Re: (NUTCH-1071) Crawldb update to total counts per status

Markus Jelsma Fri, 29 Jul 2011 02:57:04 -0700

Hi,

A lot of indexers ran between readdb and updatedb, plus one domainstats
job. Nothing that could interfere with the numbers, afaik. I'm running a
complete cycle again and will do a readdb again after the update and recheck.


I've no idea what the problem could be. I don't know whether to trust readdb
or updatedb stats, and there's no way to check which output is true, or is
there?
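
One check that needs no extra job, as a rough sketch: the two sets of figures
quoted below can be compared directly. The snippet copies the numbers from
this thread into two dicts; the names update_counters, readdb_stats and
compare are made up for illustration and are not part of Nutch. Both sets add
up to the same 40,909,384 URLs, so the entries themselves appear to be
accounted for. What differs is how they are spread over the statuses, and the
comparison alone cannot tell which side is right.

```python
# Figures copied from the two outputs quoted in this thread.
# All names here are hypothetical helpers, not Nutch APIs.
update_counters = {
    "db_redir_temp": 1_036_840,
    "db_redir_perm": 1_195_539,
    "unknown": 2_315,
    "db_unfetched": 16_909_397,
    "db_notmodified": 1_264_001,
    "db_gone": 955_701,
    "db_fetched": 19_545_591,
}

readdb_stats = {
    "db_unfetched": 26_788_643,
    "db_fetched": 12_345_476,
    "db_gone": 763_463,
    "db_redir_temp": 461_511,
    "db_redir_perm": 431_595,
    "db_notmodified": 118_696,
}


def compare(update, stats):
    """Return {status: (update_count, stats_count, difference)}.

    A status missing from one side is counted as 0 there.
    """
    result = {}
    for status in sorted(set(update) | set(stats)):
        u = update.get(status, 0)
        s = stats.get(status, 0)
        result[status] = (u, s, u - s)
    return result


if __name__ == "__main__":
    for status, (u, s, diff) in compare(update_counters, readdb_stats).items():
        print(f"{status:<16}{u:>12,}{s:>12,}{diff:>+13,}")
    print("update total:", f"{sum(update_counters.values()):,}")
    print("readdb total:", f"{sum(readdb_stats.values()):,}")
```
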

Thanks

> Hi Markus,
> 
> Can't really think of a reason why they could differ. You called 'readdb
> -stats' right after the crawldb update?
> Could it be a problem with readdb -stats? And why are we seeing an
> 'unknown' status in the crawldb update?
> 
> That's definitely interesting
> 
> Julien
> 
> On 29 July 2011 10:23, Markus Jelsma <[email protected]> wrote:
> > Hi Julien,
> > 
> > Can you explain the following? I've got here some output from a readdb
> > -stats job and the output of the most recent crawldb update job. They
> > differ a lot!
> > 
> > update:
> > db_redir_temp   0       1,036,840       1,036,840
> > db_redir_perm   0       1,195,539       1,195,539
> > unknown         0       2,315           2,315
> > db_unfetched    0       16,909,397      16,909,397
> > db_notmodified  0       1,264,001       1,264,001
> > db_gone         0       955,701         955,701
> > db_fetched      0       19,545,591      19,545,591
> > 
> > stats:
> > TOTAL urls:                 40909384
> > status 1 (db_unfetched):    26788643
> > status 2 (db_fetched):      12345476
> > status 3 (db_gone):         763463
> > status 4 (db_redir_temp):   461511
> > status 5 (db_redir_perm):   431595
> > status 6 (db_notmodified):  118696
> > 
> > Thanks
> > 
> > > Crawldb update to total counts per status
> > > -----------------------------------------
> > > 
> > >                  Key: NUTCH-1071
> > >                  URL: https://issues.apache.org/jira/browse/NUTCH-1071
> > >              Project: Nutch
> > >           Issue Type: Improvement
> > >     Affects Versions: 1.4
> > >             Reporter: Julien Nioche
> > >             Assignee: Julien Nioche
> > >             Priority: Trivial
> > >              Fix For: 1.4
> > > 
> > > The reduce phase of the crawldb update outputs all the entries that
> > > will be found in the updated crawldb. We can use the counters to
> > > summarise the number of URLs per status, which is a bit like the
> > > readdb -stats functionality except that it does not require an
> > > additional step. This is a useful way of monitoring the progress of a
> > > crawl using the Hadoop JobTracker UI.
> > > 
> > > --
> > > This message is automatically generated by JIRA.
> > > For more information on JIRA, see:
> > > http://www.atlassian.com/software/jira
