Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "DomainStatistics" page has been changed by AlexMc.
http://wiki.apache.org/nutch/DomainStatistics?action=diff&rev1=1&rev2=2

--------------------------------------------------

  There is a tool which may tell you some more about which domains have been fetched. You can try it through something like this

+ == usage ==
- $ bin/nutch org.apache.nutch.util.domain.DomainStatistics
- hdfs://nn:9000/user/otis/crawl/crawldb/current
- hdfs://nn:9000/user/otis/ds-host host 8
+ {{{
+ $ bin/nutch org.apache.nutch.util.domain.DomainStatistics inputDirs outDir host|domain|suffix [numOfReducer]
+ }}}

+ == example ==
+
+ {{{
+ $ bin/nutch org.apache.nutch.util.domain.DomainStatistics hdfs://nn:9000/user/otis/crawl/crawldb/current hdfs://nn:9000/user/otis/ds-host host 8
+ }}}

  You can then -cat ds-host file from DFS and pipe it to sort -nrk1 for sorting by count, higher count first.
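
The sorting step mentioned at the end of the page can be sketched as follows. This is only an illustration: it assumes each output line has the form "count, TAB, host", which matches the page's statement that sorting on the first field sorts by count. The sample lines are made up, and the `part-*` file pattern in the trailing comment is an assumption about how the job names its output files.

```shell
# Hypothetical sample of DomainStatistics output: "<count><TAB><host>" per line.
# These lines are invented for illustration; they are not real crawl results.
sample=$(printf '3\texample.org\n12\texample.com\n7\texample.net\n')

# Sort numerically (-n), in reverse (-r), keyed on the first field (-k1),
# so the host with the highest count comes first.
sorted=$(printf '%s\n' "$sample" | sort -nrk1)
printf '%s\n' "$sorted"

# Against a real cluster the input would come from DFS instead, for example
# (assumed output file layout; adjust to what the job actually wrote):
#   bin/hadoop fs -cat hdfs://nn:9000/user/otis/ds-host/part-* | sort -nrk1
```

Appending `| head -n 20` to the pipeline limits the listing to the twenty most frequent hosts.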

