Hi Julien, can you explain the following? I have here the output of a readdb -stats job and the output of the most recent crawldb update job. The per-status counts differ a lot!
update:
db_redir_temp     0    1,036,840    1,036,840
db_redir_perm     0    1,195,539    1,195,539
unknown           0        2,315        2,315
db_unfetched      0   16,909,397   16,909,397
db_notmodified    0    1,264,001    1,264,001
db_gone           0      955,701      955,701
db_fetched        0   19,545,591   19,545,591

stats:
TOTAL urls: 40909384
status 1 (db_unfetched): 26788643
status 2 (db_fetched): 12345476
status 3 (db_gone): 763463
status 4 (db_redir_temp): 461511
status 5 (db_redir_perm): 431595
status 6 (db_notmodified): 118696

Thanks

> Crawldb update to total counts per status
> -----------------------------------------
>
> Key: NUTCH-1071
> URL: https://issues.apache.org/jira/browse/NUTCH-1071
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.4
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Priority: Trivial
> Fix For: 1.4
>
>
> The reduce phase of the crawldb update outputs all the entries that will be
> found in the updated crawldb. We can use the counters to summarise the
> number of URLs per status, which is a bit like the readdb -stats
> functionality except that it does not require an additional step. This is
> a useful way of monitoring the progress of a crawl using the Hadoop
> JobTracker UI.
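
For reference, here is a quick sanity check on the two sets of figures quoted in this message. It is only a sketch: the numbers are copied from the output above, while the dictionary names and the choice to treat the missing "unknown" entry in the stats output as 0 are my own assumptions. It shows that the two totals agree and that only the distribution across statuses differs.

```python
# Last column of the crawldb update output, copied from this message.
update_counters = {
    "db_redir_temp": 1_036_840,
    "db_redir_perm": 1_195_539,
    "unknown": 2_315,
    "db_unfetched": 16_909_397,
    "db_notmodified": 1_264_001,
    "db_gone": 955_701,
    "db_fetched": 19_545_591,
}

# Per-status counts from the readdb -stats output, copied from this message.
readdb_stats = {
    "db_unfetched": 26_788_643,
    "db_fetched": 12_345_476,
    "db_gone": 763_463,
    "db_redir_temp": 461_511,
    "db_redir_perm": 431_595,
    "db_notmodified": 118_696,
}

update_total = sum(update_counters.values())
stats_total = sum(readdb_stats.values())

# Assumption: a status absent from the stats output counts as 0 there.
differences = {
    status: count - readdb_stats.get(status, 0)
    for status, count in update_counters.items()
}

print(f"update total: {update_total}")  # 40909384
print(f"stats total:  {stats_total}")   # 40909384, same as TOTAL urls
for status, diff in sorted(differences.items()):
    print(f"{status:15s} {diff:+d}")
```

Both sums come to 40909384, which is the TOTAL urls figure reported by readdb -stats, so the question is about how the entries are assigned to statuses rather than about missing or extra entries.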