[ https://issues.apache.org/jira/browse/NUTCH-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841995#comment-17841995 ]
ASF GitHub Bot commented on NUTCH-1806:
---------------------------------------

sebastian-nagel opened a new pull request, #816:
URL: https://github.com/apache/nutch/pull/816

   and NUTCH-1942 Remove TopLevelDomain

   - use methods from crawler-commons' EffectiveTldFinder in URLUtil, replacing classes and methods from the "org.apache.nutch.util.domain" package
   - adapt and extend unit tests
   - add tests for URLUtil.getTopLevelDomainName(url)
   - reflect changes to the public suffix list since 2014 ("xyz" is now a public suffix / ICANN suffix)
   - adapt to minor API changes
   - URLUtil.getDomainName(url) returns the host name in case no valid public suffix is found
   - for Unicode suffixes and TLDs, the methods URLUtil.getDomainSuffix(url) and URLUtil.getTopLevelDomainName(url), respectively, now return the ASCII representation
   - add unit tests for host names with trailing dot ("www.apache.org.")
   - add unit test for URLs without host/domain (cf. NUTCH-2450)
   - update and complete Javadoc
   - update DomainStatistics, TLDIndexingFilter and domain URL filters to use the updated methods in URLUtil
   - remove the class TLDScoringFilter. Its configuration is bound to domain-suffixes.xml, which was no longer maintained and is now removed
   - remove package org.apache.nutch.util.domain
   - move DomainStatistics to org.apache.nutch.util
   - remove configuration files of domain utils


> Delegate processing of URL domains to crawler commons
> -----------------------------------------------------
>
>                 Key: NUTCH-1806
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1806
>             Project: Nutch
>          Issue Type: Improvement
>   Affects Versions: 1.8
>            Reporter: Julien Nioche
>            Priority: Major
>              Labels: crawler-commons
>             Fix For: 1.21
>
>
> We have code in src/java/org/apache/nutch/util/domain and a resource file
> conf/domain-suffixes.xml to handle URL domains. This is used mostly from
> URLUtil.getDomainName.
> The resource file is not necessarily up to date and since crawler commons has
> a similar functionality we should use it instead of having to maintain our
> own resources.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)