Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "DomainStatistics" page has been changed by AlexMc.
http://wiki.apache.org/nutch/DomainStatistics

--------------------------------------------------

New page:
There is a tool, DomainStatistics, that can tell you more about which domains 
have been fetched. You can run it with something like this:



    $ bin/nutch org.apache.nutch.util.domain.DomainStatistics \
        hdfs://nn:9000/user/otis/crawl/crawldb/current \
        hdfs://nn:9000/user/otis/ds-host host 8


You can then -cat the ds-host output from DFS and pipe it to sort -nrk1 to 
sort by count, highest count first.
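
The sorting step could look roughly like the sketch below. It assumes each 
output line starts with the numeric count (which is what sorting on key 1 
implies); the helper name sort_by_count is made up for illustration, and the 
part-* file naming in the usage comment is an assumption about how the job 
lays out its output directory. I haven't run this against real output.

```shell
# Sort lines that start with a count, highest count first.
# -n: compare numerically, -r: reverse order, -k1: sort key starts at field 1.
sort_by_count() {
    sort -nrk1
}

# Hypothetical usage against DFS (output path taken from the command above;
# the part-* naming is an assumption, adjust to what the job actually wrote):
#   bin/hadoop fs -cat hdfs://nn:9000/user/otis/ds-host/part-* | sort_by_count
```

Piping through head -n 20 afterwards would show only the twenty largest 
entries.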

(I first noticed this in https://issues.apache.org/jira/browse/NUTCH-628. Yes, 
the class is in the current release, but I haven't tried it.)
