sebastian-nagel opened a new pull request, #816:
URL: https://github.com/apache/nutch/pull/816

   and NUTCH-1942 Remove TopLevelDomain
   
   - use methods from crawler-commons' EffectiveTldFinder in URLUtil, replacing classes and methods from the "org.apache.nutch.util.domain" package
   
   - adapt and extend unit tests
     - add tests for URLUtil.getTopLevelDomainName(url)
     - reflect changes to the public suffix list since 2014 ("xyz" is now a 
public suffix / ICANN suffix)
     - adapt to minor API changes
        - URLUtil.getDomainName(url) returns the host name if no valid public suffix is found
        - for Unicode suffixes and TLDs, the methods URLUtil.getDomainSuffix(url) and URLUtil.getTopLevelDomainName(url), respectively, now return the ASCII representation
      - add unit tests for host names with trailing dot ("www.apache.org.")
      - add unit test for URLs without host/domain (cf. NUTCH-2450)
   
   - update and complete Javadoc
   
   - update DomainStatistics, TLDIndexingFilter and domain URL filters to use 
the updated methods in URLUtil
   - remove the class TLDScoringFilter: its configuration is bound to domain-suffixes.xml, which is no longer maintained and is now removed
   - remove package org.apache.nutch.util.domain
   - move DomainStatistics to org.apache.nutch.util
   - remove configuration files of domain utils
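
The minor API changes listed above (fall back to the host name when no public suffix matches, ASCII form for Unicode suffixes, host names with a trailing dot, URLs without a host) can be illustrated with a small standalone sketch. It is not the Nutch or crawler-commons code: the class DomainNameSketch, its methods, and the tiny SUFFIXES set are hypothetical stand-ins for URLUtil, EffectiveTldFinder, and the public suffix list. Stripping the trailing dot, returning null for a missing host, and returning the host when it is itself a suffix are assumptions of this sketch, not behavior confirmed by the pull request.

```java
import java.net.IDN;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not the URLUtil / EffectiveTldFinder implementation.
class DomainNameSketch {

    // Toy stand-in for the public suffix list, in ASCII (punycode) form.
    static final Set<String> SUFFIXES =
            Set.of("org", "com", "uk", "co.uk", "xyz", "xn--p1ai");

    // Normalize a host name: drop one trailing dot, convert to ASCII, lower-case.
    // Returns null for a missing or empty host (assumption of this sketch).
    static String normalize(String host) {
        if (host == null || host.isEmpty()) {
            return null;
        }
        String h = host.endsWith(".") ? host.substring(0, host.length() - 1) : host;
        if (h.isEmpty()) {
            return null;
        }
        return IDN.toASCII(h).toLowerCase(Locale.ROOT);
    }

    // Longest suffix from SUFFIXES that the host ends with, or null if none matches.
    static String suffix(String host) {
        String candidate = normalize(host);
        while (candidate != null) {
            if (SUFFIXES.contains(candidate)) {
                return candidate;
            }
            int dot = candidate.indexOf('.');
            candidate = dot < 0 ? null : candidate.substring(dot + 1);
        }
        return null;
    }

    // Suffix plus one more label; falls back to the host name if no suffix is found.
    static String domainName(String host) {
        String h = normalize(host);
        if (h == null) {
            return null;
        }
        String s = suffix(h);
        if (s == null || s.equals(h)) {
            return h; // no valid suffix (or host is itself a suffix): return the host
        }
        String prefix = h.substring(0, h.length() - s.length() - 1);
        return prefix.substring(prefix.lastIndexOf('.') + 1) + "." + s;
    }
}
```

In this sketch, a host such as "www.apache.org." and "www.apache.org" resolve to the same domain name, a host with no matching suffix (for example "localhost") is returned unchanged, and a Unicode suffix is reported in its ASCII form.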


