sebastian-nagel opened a new pull request, #816: URL: https://github.com/apache/nutch/pull/816
and NUTCH-1942 Remove TopLevelDomain

- use methods from crawler-commons' EffectiveTldFinder in URLUtil, replacing classes and methods from the "org.apache.nutch.util.domain" package
- adapt and extend unit tests
- add tests for URLUtil.getTopLevelDomainName(url)
- reflect changes to the public suffix list since 2014 ("xyz" is now a public suffix / ICANN suffix)
- adapt to minor API changes
- URLUtil.getDomainName(url) returns the host name if no valid public suffix is found
- for Unicode suffixes and TLDs, the methods URLUtil.getDomainSuffix(url) and URLUtil.getTopLevelDomainName(url), respectively, now return the ASCII representation
- add unit tests for host names with a trailing dot ("www.apache.org.")
- add unit test for URLs without host/domain (cf. NUTCH-2450)
- update and complete Javadoc
- update DomainStatistics, TLDIndexingFilter and the domain URL filters to use the updated methods in URLUtil
- remove the class TLDScoringFilter. Its configuration is bound to domain-suffixes.xml, which was no longer maintained and is now removed
- remove package org.apache.nutch.util.domain
- move DomainStatistics to org.apache.nutch.util
- remove configuration files of domain utils
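
The behavior changes listed above (host name fallback, ASCII representation of Unicode suffixes, trailing dot handling) can be illustrated with a small stand-alone sketch. This is not the code from this pull request and it does not call Nutch's URLUtil or crawler-commons' EffectiveTldFinder: the class `DomainSketch`, its methods and the tiny hard-coded suffix set are hypothetical and exist only to show the intended semantics on host names. The real lookup uses the full public suffix list.

```java
import java.net.IDN;
import java.util.Locale;
import java.util.Set;

/** Toy illustration only; hypothetical class, not Nutch's URLUtil. */
final class DomainSketch {

  // Assumption: a tiny stand-in for the public suffix list, in ASCII form.
  private static final Set<String> SUFFIXES =
      Set.of("org", "com", "xyz", "co.uk", "xn--p1ai");

  private DomainSketch() {}

  /** Strips one trailing dot, converts to ASCII (Punycode) and lowercases. */
  static String normalize(String host) {
    String h = host.endsWith(".") ? host.substring(0, host.length() - 1) : host;
    return IDN.toASCII(h).toLowerCase(Locale.ROOT);
  }

  /** Returns the longest matching suffix in ASCII form, or null if none matches. */
  static String getDomainSuffix(String host) {
    String candidate = normalize(host);
    while (true) {
      if (SUFFIXES.contains(candidate)) {
        return candidate;
      }
      int dot = candidate.indexOf('.');
      if (dot < 0) {
        return null; // no label left to strip, no suffix found
      }
      candidate = candidate.substring(dot + 1); // drop the leftmost label
    }
  }

  /** Returns suffix plus one label, or the host name if no suffix is found. */
  static String getDomainName(String host) {
    String h = normalize(host);
    String suffix = getDomainSuffix(h);
    // Fallback to the host name. Treating a host that is itself a suffix
    // the same way is a choice of this sketch, not a statement about Nutch.
    if (suffix == null || suffix.equals(h)) {
      return h;
    }
    String prefix = h.substring(0, h.length() - suffix.length() - 1);
    return prefix.substring(prefix.lastIndexOf('.') + 1) + "." + suffix;
  }
}
```

In this sketch, `DomainSketch.getDomainName("www.apache.org.")` yields `apache.org`, `DomainSketch.getDomainName("localhost")` falls back to `localhost`, and `DomainSketch.getDomainSuffix("example.xyz")` yields `xyz`.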