Hi Markus,

I can't really think of a reason why they would differ. Did you call 'readdb -stats' right after the crawldb update? Could it be a problem with readdb -stats? And why are we seeing an 'unknown' status in the crawldb update?
That's definitely interesting.

Julien

On 29 July 2011 10:23, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hi Julien,
>
> Can you explain the following? I've got here some output from a readdb -stats
> job and the output of the most recent crawldb update job. They differ a lot!
>
> update:
> db_redir_temp     0    1,036,840    1,036,840
> db_redir_perm     0    1,195,539    1,195,539
> unknown           0        2,315        2,315
> db_unfetched      0   16,909,397   16,909,397
> db_notmodified    0    1,264,001    1,264,001
> db_gone           0      955,701      955,701
> db_fetched        0   19,545,591   19,545,591
>
> stats:
> TOTAL urls: 40909384
> status 1 (db_unfetched): 26788643
> status 2 (db_fetched): 12345476
> status 3 (db_gone): 763463
> status 4 (db_redir_temp): 461511
> status 5 (db_redir_perm): 431595
> status 6 (db_notmodified): 118696
>
> Thanks
>
> > Crawldb update to total counts per status
> > -----------------------------------------
> >
> >                 Key: NUTCH-1071
> >                 URL: https://issues.apache.org/jira/browse/NUTCH-1071
> >             Project: Nutch
> >          Issue Type: Improvement
> >    Affects Versions: 1.4
> >            Reporter: Julien Nioche
> >            Assignee: Julien Nioche
> >            Priority: Trivial
> >             Fix For: 1.4
> >
> >
> > The reduce phase of the crawldb update outputs all the entries that will be
> > found in the updated crawldb. We can use the counters to summarise the
> > number of URLs per status, which is a bit like the readdb -stats
> > functionality except that it does not require an additional step. This is
> > a useful way of monitoring the progress of a crawl using the Hadoop
> > JobTracker UI.
> >
> > --
> > This message is automatically generated by JIRA.
> > For more information on JIRA, see: http://www.atlassian.com/software/jira

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
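
As a quick sanity check on the two sets of figures quoted in this thread, here is a minimal Python sketch that puts them side by side. The numbers are copied from Markus's message; the dictionary names and the compare() helper are hypothetical, and taking the last column of the update output as the per-status total is an assumption. It only shows where the counts diverge, not why.

```python
# Per-status counts from the crawldb update counters (last column of the
# update output in the thread; treating it as the total is an assumption).
update_counters = {
    "db_unfetched": 16_909_397,
    "db_fetched": 19_545_591,
    "db_gone": 955_701,
    "db_redir_temp": 1_036_840,
    "db_redir_perm": 1_195_539,
    "db_notmodified": 1_264_001,
    "unknown": 2_315,
}

# Per-status counts reported by readdb -stats in the same message.
readdb_stats = {
    "db_unfetched": 26_788_643,
    "db_fetched": 12_345_476,
    "db_gone": 763_463,
    "db_redir_temp": 461_511,
    "db_redir_perm": 431_595,
    "db_notmodified": 118_696,
}
readdb_total = 40_909_384  # "TOTAL urls" line from readdb -stats


def compare(update, stats):
    """Return update minus stats for every status seen in either source."""
    statuses = sorted(set(update) | set(stats))
    return {s: update.get(s, 0) - stats.get(s, 0) for s in statuses}


diff = compare(update_counters, readdb_stats)
update_total = sum(update_counters.values())
stats_total = sum(readdb_stats.values())

for status, delta in diff.items():
    print(f"{status:15s} {delta:+,d}")
print(f"update total: {update_total:,d}")
print(f"stats total:  {stats_total:,d}")
```

With these figures the two grand totals agree (both come to 40,909,384, matching the TOTAL urls line), so the discrepancy is in how the URLs are distributed across statuses rather than in the number of URLs counted.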