Re: (NUTCH-1071) Crawldb update to total counts per status
On Friday 29 July 2011 17:06:15 Julien Nioche wrote:
> Markus,
>
> Have just committed a change to CrawlDBReducer (rev 1152254)
>
> see line 155
> -> reporter.getCounter("CrawlDB status", CrawlDatum.getStatusName(*old*.getStatus())).increment(1);
>
> was using the wrong object :-(

Don't be sad, it's finally weekend! Oh, and by the way, it's fixed as well! Both readdb and updatedb show identical numbers.

Thanks!

> Would you mind giving it a try?
>
> Thanks
>
> Julien

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: (NUTCH-1071) Crawldb update to total counts per status
Markus,

Have just committed a change to CrawlDBReducer (rev 1152254)

see line 155
-> reporter.getCounter("CrawlDB status", CrawlDatum.getStatusName(*old*.getStatus())).increment(1);

was using the wrong object :-(

Would you mind giving it a try?

Thanks

Julien
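To illustrate why reading the status from the wrong object skews the per-status counters, here is a minimal Java sketch. It is not the Nutch code: the class `StatusCounterSketch`, the nested `Update` type, and the methods `count` and `sampleUpdates` are hypothetical, and it assumes the counter is meant to describe the entry the reducer writes out. The message above does not say which object the committed change switched to.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the Nutch CrawlDBReducer.
class StatusCounterSketch {

    /** One reduce step: the status already stored and the status written out. */
    static final class Update {
        final String oldStatus;
        final String resultStatus;

        Update(String oldStatus, String resultStatus) {
            this.oldStatus = oldStatus;
            this.resultStatus = resultStatus;
        }
    }

    /** Tallies one status per update, taken from the old or the written entry. */
    static Map<String, Long> count(List<Update> updates, boolean useOld) {
        Map<String, Long> counters = new LinkedHashMap<>();
        for (Update u : updates) {
            String status = useOld ? u.oldStatus : u.resultStatus;
            counters.merge(status, 1L, Long::sum);
        }
        return counters;
    }

    /** Made-up sample: two URLs change status, one stays fetched. */
    static List<Update> sampleUpdates() {
        return Arrays.asList(
            new Update("db_unfetched", "db_fetched"),
            new Update("db_unfetched", "db_gone"),
            new Update("db_fetched", "db_fetched"));
    }

    public static void main(String[] args) {
        List<Update> updates = sampleUpdates();
        System.out.println("counting old entry:     " + count(updates, true));
        System.out.println("counting written entry: " + count(updates, false));
    }
}
```

Both tallies cover the same three entries, so the grand total is identical either way; only the split across statuses changes.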
Re: (NUTCH-1071) Crawldb update to total counts per status
Another cycle is complete; the results are incremental but similar. The TOTAL_URLS of stats is reduce_output_records + 1, which seems alright. I can't find any order in the numbers.

On Friday 29 July 2011 11:43:19 Julien Nioche wrote:
> Hi Markus,
>
> Can't really think of a reason why they could differ. You called 'readdb
> -stats' right after the crawldb?
> Could it be a problem with readdb -stats? And why are we seeing an
> 'unknown' status in the crawldb update?
>
> That's definitely interesting
>
> Julien

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: (NUTCH-1071) Crawldb update to total counts per status
Hi,

A lot of indexers and one domainstats job ran between readdb and updatedb. Nothing that could interfere with the numbers, afaik. I'm running a complete cycle again and will do a readdb again after the update and recheck.

I've no idea what the problem could be. I don't know whether to trust the readdb or the updatedb stats, and there's no way to check which output is true, or is there?

Thanks

> Hi Markus,
>
> Can't really think of a reason why they could differ. You called 'readdb
> -stats' right after the crawldb?
> Could it be a problem with readdb -stats? And why are we seeing an
> 'unknown' status in the crawldb update?
>
> That's definitely interesting
>
> Julien
Re: (NUTCH-1071) Crawldb update to total counts per status
Hi Markus,

Can't really think of a reason why they could differ. You called 'readdb -stats' right after the crawldb?
Could it be a problem with readdb -stats? And why are we seeing an 'unknown' status in the crawldb update?

That's definitely interesting

Julien

On 29 July 2011 10:23, Markus Jelsma wrote:
> Hi Julien,
>
> Can you explain the following? I've got here some output from a readdb
> -stats job and the output of the most recent crawldb update job. They
> differ a lot!
>
> update:
> db_redir_temp     0   1,036,840   1,036,840
> db_redir_perm     0   1,195,539   1,195,539
> unknown           0       2,315       2,315
> db_unfetched      0  16,909,397  16,909,397
> db_notmodified    0   1,264,001   1,264,001
> db_gone           0     955,701     955,701
> db_fetched        0  19,545,591  19,545,591
>
> stats:
> TOTAL urls: 40909384
> status 1 (db_unfetched):   26788643
> status 2 (db_fetched):     12345476
> status 3 (db_gone):        763463
> status 4 (db_redir_temp):  461511
> status 5 (db_redir_perm):  431595
> status 6 (db_notmodified): 118696
>
> Thanks

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
Re: (NUTCH-1071) Crawldb update to total counts per status
Hi Julien,

Can you explain the following? I've got here some output from a readdb -stats job and the output of the most recent crawldb update job. They differ a lot!

update:
db_redir_temp     0   1,036,840   1,036,840
db_redir_perm     0   1,195,539   1,195,539
unknown           0       2,315       2,315
db_unfetched      0  16,909,397  16,909,397
db_notmodified    0   1,264,001   1,264,001
db_gone           0     955,701     955,701
db_fetched        0  19,545,591  19,545,591

stats:
TOTAL urls: 40909384
status 1 (db_unfetched):   26788643
status 2 (db_fetched):     12345476
status 3 (db_gone):        763463
status 4 (db_redir_temp):  461511
status 5 (db_redir_perm):  431595
status 6 (db_notmodified): 118696

Thanks

> Crawldb update to total counts per status
> -----------------------------------------
>
>                 Key: NUTCH-1071
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1071
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.4
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>            Priority: Trivial
>             Fix For: 1.4
>
> The reduce phase of the crawldb update outputs all the entries that will be
> found in the updated crawldb. We can use the counters to summarise the
> number of URLs per status, which is a bit like the readdb -stats
> functionality except that it does not require an additional step. This is
> a useful way of monitoring the progress of a crawl using the Hadoop
> JobTracker UI.
>
> --
> This message is automatically generated by JIRA.
> For more information on JIRA, see: http://www.atlassian.com/software/jira
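The two listings in this message can be cross-checked with simple arithmetic. The Java sketch below sums each listing and prints the per-status difference; the figures are copied from the message, while the class `CounterTotals` and its method names are made up for illustration. Both listings add up to 40,909,384, the TOTAL urls value, so the two jobs agree on how many entries exist and disagree only on how they split across statuses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; the numbers are the ones quoted in the message above.
class CounterTotals {

    /** Last column of the crawldb update counters. */
    static Map<String, Long> updateCounters() {
        Map<String, Long> m = new LinkedHashMap<>();
        m.put("db_redir_temp", 1_036_840L);
        m.put("db_redir_perm", 1_195_539L);
        m.put("unknown", 2_315L);
        m.put("db_unfetched", 16_909_397L);
        m.put("db_notmodified", 1_264_001L);
        m.put("db_gone", 955_701L);
        m.put("db_fetched", 19_545_591L);
        return m;
    }

    /** Per-status lines of the readdb -stats output. */
    static Map<String, Long> readdbStats() {
        Map<String, Long> m = new LinkedHashMap<>();
        m.put("db_unfetched", 26_788_643L);
        m.put("db_fetched", 12_345_476L);
        m.put("db_gone", 763_463L);
        m.put("db_redir_temp", 461_511L);
        m.put("db_redir_perm", 431_595L);
        m.put("db_notmodified", 118_696L);
        return m;
    }

    /** Adds up all per-status counts. */
    static long total(Map<String, Long> counts) {
        long sum = 0;
        for (long v : counts.values()) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Long> update = updateCounters();
        Map<String, Long> stats = readdbStats();
        System.out.println("update total: " + total(update));
        System.out.println("stats total:  " + total(stats));
        // Difference per status (update minus stats); a status missing from
        // the readdb listing, such as "unknown", is treated as zero there.
        for (Map.Entry<String, Long> e : update.entrySet()) {
            long diff = e.getValue() - stats.getOrDefault(e.getKey(), 0L);
            System.out.println(e.getKey() + ": " + diff);
        }
    }
}
```

Matching totals with a different distribution is what one would expect if every entry is counted exactly once but under a status read from a different object than the one stored in the crawldb.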