Re: (NUTCH-1071) Crawldb update to total counts per status

Markus Jelsma Fri, 29 Jul 2011 02:25:30 -0700

Hi Julien,

Can you explain the following? I have here the output of a readdb -stats
job and the counters from the most recent crawldb update job. They differ a lot!


update:
db_redir_temp   0       1,036,840       1,036,840
db_redir_perm   0       1,195,539       1,195,539
unknown         0       2,315           2,315
db_unfetched    0       16,909,397      16,909,397
db_notmodified  0       1,264,001       1,264,001
db_gone         0       955,701         955,701
db_fetched      0       19,545,591      19,545,591

stats:
TOTAL urls: 40909384
status 1 (db_unfetched):    26788643
status 2 (db_fetched):      12345476
status 3 (db_gone):         763463
status 4 (db_redir_temp):   461511
status 5 (db_redir_perm):   431595
status 6 (db_notmodified):  118696
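
As a quick sanity check on the two outputs above, here is a small Python sketch that sums each set of counts and lists the per-status difference. The dictionary and function names are only illustrative; the numbers are copied from the outputs. Both sets add up to the same total of 40,909,384, so the difference is in how the URLs are distributed over the statuses, not in the number of URLs.

```python
# Counts copied from the two outputs above; names here are illustrative only.
update = {
    "db_redir_temp": 1_036_840,
    "db_redir_perm": 1_195_539,
    "unknown": 2_315,
    "db_unfetched": 16_909_397,
    "db_notmodified": 1_264_001,
    "db_gone": 955_701,
    "db_fetched": 19_545_591,
}
stats = {
    "db_unfetched": 26_788_643,
    "db_fetched": 12_345_476,
    "db_gone": 763_463,
    "db_redir_temp": 461_511,
    "db_redir_perm": 431_595,
    "db_notmodified": 118_696,
}


def compare(update_counts, stats_counts):
    """Return {status: update count - stats count} for every status seen."""
    statuses = sorted(set(update_counts) | set(stats_counts))
    return {s: update_counts.get(s, 0) - stats_counts.get(s, 0) for s in statuses}


diff = compare(update, stats)
print("update total:", sum(update.values()))
print("stats total: ", sum(stats.values()))
for status, delta in diff.items():
    print(f"{status:15} {delta:+,}")
```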

Thanks

> Crawldb update to total counts per status
> -----------------------------------------
> 
>                  Key: NUTCH-1071
>                  URL: https://issues.apache.org/jira/browse/NUTCH-1071
>              Project: Nutch
>           Issue Type: Improvement
>     Affects Versions: 1.4
>             Reporter: Julien Nioche
>             Assignee: Julien Nioche
>             Priority: Trivial
>              Fix For: 1.4
> 
> 
> The reduce phase of the crawldb update outputs all the entries that will be
> found in the updated crawldb. We can use the counters to summarise the
> number of URLs per status, which is a bit like the readdb -stats
> functionality except that it does not require an additional step. This is
> a useful way of monitoring the progress of a crawl using the Hadoop
> JobTracker UI.
> 
> --
> This message is automatically generated by JIRA.
> For more information on JIRA, see: http://www.atlassian.com/software/jira
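
The idea in the issue description can be sketched as follows. This is a minimal Python simulation of tallying one counter per status while entries are written out, not Nutch or Hadoop code; `reduce_entries` and the sample records are hypothetical.

```python
from collections import Counter


def reduce_entries(entries, counters):
    """Yield each (url, status) entry unchanged, bumping a per-status counter."""
    for url, status in entries:
        counters[status] += 1
        yield url, status


# Hypothetical entries standing in for what a reduce phase would output.
sample = [
    ("http://a.example/", "db_fetched"),
    ("http://b.example/", "db_unfetched"),
    ("http://c.example/", "db_fetched"),
    ("http://d.example/", "db_gone"),
]

counters = Counter()
updated_crawldb = list(reduce_entries(sample, counters))

# The per-status totals are available as soon as the entries are written,
# without a separate pass over the output.
for status, count in sorted(counters.items()):
    print(status, count)
```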
