Hi,

A lot of indexers ran between readdb and updatedb, plus one domainstats job. Nothing that could interfere with the numbers, afaik. I'm running a complete cycle again and will run readdb again after the update and recheck.
I've no idea what the problem could be. I don't know whether to trust the readdb or the updatedb stats, and there's no way to check which output is true, or is there?

Thanks

> Hi Markus,
>
> Can't really think of a reason why they could differ. You called 'readdb
> -stats' right after the crawldb?
> Could it be a problem with readdb -stats? And why are we seeing an
> 'unknown' status in the crawldb update?
>
> That's definitely interesting
>
> Julien
>
> On 29 July 2011 10:23, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > Hi Julien,
> >
> > Can you explain the following? I've got here some output from a readdb
> > -stats job and the output of the most recent crawldb update job. They
> > differ a lot!
> >
> > update:
> > db_redir_temp    0   1,036,840   1,036,840
> > db_redir_perm    0   1,195,539   1,195,539
> > unknown          0       2,315       2,315
> > db_unfetched     0  16,909,397  16,909,397
> > db_notmodified   0   1,264,001   1,264,001
> > db_gone          0     955,701     955,701
> > db_fetched       0  19,545,591  19,545,591
> >
> > stats:
> > TOTAL urls: 40909384
> > status 1 (db_unfetched): 26788643
> > status 2 (db_fetched): 12345476
> > status 3 (db_gone): 763463
> > status 4 (db_redir_temp): 461511
> > status 5 (db_redir_perm): 431595
> > status 6 (db_notmodified): 118696
> >
> > Thanks
> >
> > > Crawldb update to total counts per status
> > > -----------------------------------------
> > >
> > > Key: NUTCH-1071
> > > URL: https://issues.apache.org/jira/browse/NUTCH-1071
> > > Project: Nutch
> > > Issue Type: Improvement
> > > Affects Versions: 1.4
> > > Reporter: Julien Nioche
> > > Assignee: Julien Nioche
> > > Priority: Trivial
> > > Fix For: 1.4
> > >
> > > The reduce phase of the crawldb update outputs all the entries that
> > > will be found in the updated crawldb. We can use the counters to
> > > summarise the number of URLs per status, which is a bit like the
> > > readdb -stats functionality except that it does not require an
> > > additional step. This is a useful way of monitoring the progress of
> > > a crawl using the Hadoop JobTracker UI.
> > >
> > > --
> > > This message is automatically generated by JIRA.
> > > For more information on JIRA, see:
> > > http://www.atlassian.com/software/jira
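
One quick cross-check on the two sets of figures quoted above, as a minimal sketch rather than anything Nutch provides: sum each set and diff them per status. It assumes the last column of the update output is the job-wide total for each counter; the names `update_counts`, `stats_counts` and `compare` are made up for illustration.

```python
# Counts copied from the crawldb update counters quoted above
# (last column, assumed here to be the job-wide total).
update_counts = {
    "db_redir_temp": 1_036_840,
    "db_redir_perm": 1_195_539,
    "unknown": 2_315,
    "db_unfetched": 16_909_397,
    "db_notmodified": 1_264_001,
    "db_gone": 955_701,
    "db_fetched": 19_545_591,
}

# Counts copied from the readdb -stats output quoted above.
stats_counts = {
    "db_unfetched": 26_788_643,
    "db_fetched": 12_345_476,
    "db_gone": 763_463,
    "db_redir_temp": 461_511,
    "db_redir_perm": 431_595,
    "db_notmodified": 118_696,
}
stats_total = 40_909_384  # the "TOTAL urls" line


def compare(update, stats):
    """Return {status: update - stats}, treating a missing status as 0."""
    statuses = sorted(set(update) | set(stats))
    return {s: update.get(s, 0) - stats.get(s, 0) for s in statuses}


diffs = compare(update_counts, stats_counts)

print("update total:", sum(update_counts.values()))
print("stats total: ", sum(stats_counts.values()))
for status, delta in diffs.items():
    print(f"{status:15s} {delta:+,d}")
```

With the figures above, both sets sum to 40,909,384, which matches the TOTAL urls line. So the two jobs agree on the number of entries and differ only in how those entries are spread across the statuses.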