Re: (NUTCH-1071) Crawldb update to total counts per status

Julien Nioche Fri, 29 Jul 2011 02:44:03 -0700

Hi Markus,

I can't really think of a reason why they would differ. Did you call
'readdb -stats' right after the crawldb update?
Could it be a problem with readdb -stats? And why are we seeing an
'unknown' status in the crawldb update?
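
As a quick sanity check on the two outputs quoted below, here is a small
sketch that sums each set of figures and prints the per-status difference.
The numbers are copied from the quoted message; the variable and function
names are only illustrative and are not part of Nutch.

```python
# Figures copied from the quoted crawldb update counters.
update_counters = {
    "db_redir_temp": 1_036_840,
    "db_redir_perm": 1_195_539,
    "unknown": 2_315,
    "db_unfetched": 16_909_397,
    "db_notmodified": 1_264_001,
    "db_gone": 955_701,
    "db_fetched": 19_545_591,
}

# Figures copied from the quoted readdb -stats output.
readdb_stats = {
    "db_unfetched": 26_788_643,
    "db_fetched": 12_345_476,
    "db_gone": 763_463,
    "db_redir_temp": 461_511,
    "db_redir_perm": 431_595,
    "db_notmodified": 118_696,
}
reported_total = 40_909_384  # "TOTAL urls" line from readdb -stats


def per_status_diff(update, stats):
    """Return (update - stats) for every status seen in either source."""
    statuses = sorted(set(update) | set(stats))
    return {s: update.get(s, 0) - stats.get(s, 0) for s in statuses}


update_total = sum(update_counters.values())
stats_total = sum(readdb_stats.values())
diff = per_status_diff(update_counters, readdb_stats)

print(f"update total: {update_total:,}")
print(f"stats total:  {stats_total:,}")
for status, delta in diff.items():
    print(f"{status:<16}{delta:>+14,}")
```

Both sets of figures add up to 40,909,384, the same as the TOTAL urls line,
and the per-status differences cancel out to zero. So the overall number of
URLs agrees and only the distribution across statuses differs.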


That's definitely interesting.

Julien

On 29 July 2011 10:23, Markus Jelsma <[email protected]> wrote:

> Hi Julien,
>
> Can you explain the following? Here is some output from a readdb -stats
> job and the output of the most recent crawldb update job. They differ a
> lot!
>
> update:
> db_redir_temp   0        1,036,840       1,036,840
> db_redir_perm   0        1,195,539       1,195,539
> unknown         0            2,315           2,315
> db_unfetched    0       16,909,397      16,909,397
> db_notmodified  0        1,264,001       1,264,001
> db_gone         0          955,701         955,701
> db_fetched      0       19,545,591      19,545,591
>
> stats:
> TOTAL urls: 40909384
> status 1 (db_unfetched):    26788643
> status 2 (db_fetched):      12345476
> status 3 (db_gone):         763463
> status 4 (db_redir_temp):   461511
> status 5 (db_redir_perm):   431595
> status 6 (db_notmodified):  118696
>
> Thanks
>
> > Crawldb update to total counts per status
> > -----------------------------------------
> >
> >                  Key: NUTCH-1071
> >                  URL: https://issues.apache.org/jira/browse/NUTCH-1071
> >              Project: Nutch
> >           Issue Type: Improvement
> >     Affects Versions: 1.4
> >             Reporter: Julien Nioche
> >             Assignee: Julien Nioche
> >             Priority: Trivial
> >              Fix For: 1.4
> >
> >
> > The reduce phase of the crawldb update outputs all the entries that will
> > be found in the updated crawldb. We can use the counters to summarise the
> > number of URLs per status, which is a bit like the readdb -stats
> > functionality except that it does not require an additional step. This is
> > a useful way of monitoring the progress of a crawl using the Hadoop
> > JobTracker UI.
> >
> > --
> > This message is automatically generated by JIRA.
> > For more information on JIRA, see: http://www.atlassian.com/software/jira
>
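
To illustrate the idea in the issue description, here is a minimal sketch
of tallying one counter per status while iterating over the entries that a
reduce phase would output. It is plain Python, not the Nutch or Hadoop
code: the status codes and names are taken from the readdb -stats output
above, while count_statuses, status_name, and the sample entries are
hypothetical. Labelling unmapped codes as "unknown" is an assumption about
how such a label could arise, not a statement about what Nutch does.

```python
from collections import Counter

# Status codes and names as listed in the readdb -stats output above.
STATUS_NAMES = {
    1: "db_unfetched",
    2: "db_fetched",
    3: "db_gone",
    4: "db_redir_temp",
    5: "db_redir_perm",
    6: "db_notmodified",
}


def status_name(code):
    # Assumption: any code missing from the table is reported as "unknown".
    return STATUS_NAMES.get(code, "unknown")


def count_statuses(reduce_output):
    """Tally one counter per status over (url, status_code) pairs."""
    counters = Counter()
    for _url, code in reduce_output:
        counters[status_name(code)] += 1
    return counters


# Made-up entries standing in for the output of the reduce phase.
sample = [
    ("http://example.com/a", 2),
    ("http://example.com/b", 1),
    ("http://example.com/c", 1),
    ("http://example.com/d", 99),  # code not in the table
]
counts = count_statuses(sample)
for name, n in sorted(counts.items()):
    print(f"{name:<16}{n:>6}")
```

Because the tally is built while the entries are being written, no extra
pass over the crawldb is needed, which is the advantage the issue
describes over running readdb -stats as a separate step.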



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
