Re: (NUTCH-1071) Crawldb update to total counts per status

2011-07-29 Thread Markus Jelsma
On Friday 29 July 2011 17:06:15 Julien Nioche wrote:
> Markus,
> 
> Have just committed a change to CrawlDBReducer (rev 1152254)
> 
> see line 155
> -> reporter.getCounter("CrawlDB status", CrawlDatum.getStatusName(*old*.getStatus())).increment(1);
> 
> was using the wrong object :-(

Don't be sad, it's finally the weekend! Oh, and by the way, it's fixed as well! 
Both readdb and updatedb show identical numbers.

Thanks! 

> 
> Would you mind giving it a try?
> 
> Thanks
> 
> Julien

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: (NUTCH-1071) Crawldb update to total counts per status

2011-07-29 Thread Julien Nioche
Markus,

Have just committed a change to CrawlDBReducer (rev 1152254)

see line 155
-> reporter.getCounter("CrawlDB status", CrawlDatum.getStatusName(*old*.getStatus())).increment(1);

was using the wrong object :-(

Would you mind giving it a try?

Thanks

Julien
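
To illustrate how a counter fed from the wrong object can skew the per-status numbers while leaving the total untouched, here is a minimal sketch in plain Java. It does not use the Hadoop or Nutch APIs; the Entry class, its oldStatus and writtenStatus fields, and the sample data are all hypothetical, and the thread does not spell out which object the fix switched to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WrongObjectCounterSketch {

    /** Hypothetical stand-in for one URL handled by the reduce step. */
    static final class Entry {
        final String oldStatus;     // status of the entry already in the crawldb
        final String writtenStatus; // status of the entry written to the updated crawldb

        Entry(String oldStatus, String writtenStatus) {
            this.oldStatus = oldStatus;
            this.writtenStatus = writtenStatus;
        }
    }

    /** Counts one URL per entry, keyed by the status of the chosen object. */
    static Map<String, Long> count(List<Entry> entries, boolean useWritten) {
        Map<String, Long> counters = new TreeMap<>();
        for (Entry e : entries) {
            String status = useWritten ? e.writtenStatus : e.oldStatus;
            counters.merge(status, 1L, Long::sum);
        }
        return counters;
    }

    /** Made-up sample data: five URLs, some of which change status. */
    static List<Entry> sample() {
        List<Entry> entries = new ArrayList<>();
        entries.add(new Entry("db_unfetched", "db_fetched"));
        entries.add(new Entry("db_unfetched", "db_gone"));
        entries.add(new Entry("db_unfetched", "db_unfetched"));
        entries.add(new Entry("db_fetched", "db_notmodified"));
        entries.add(new Entry("db_fetched", "db_fetched"));
        return entries;
    }

    public static void main(String[] args) {
        List<Entry> entries = sample();
        System.out.println("keyed by old status:     " + count(entries, false));
        System.out.println("keyed by written status: " + count(entries, true));
    }
}
```

Either way every URL is counted exactly once, so the grand total is the same; only the split over the statuses changes. That would be consistent with counters that add up correctly but disagree with readdb -stats per status.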


Re: (NUTCH-1071) Crawldb update to total counts per status

2011-07-29 Thread Markus Jelsma
Another cycle is complete; the results are incremental but similar. The 
TOTAL_URLS of stats is reduce_output_records + 1, which seems all right. 

I can't find any pattern in the numbers.

On Friday 29 July 2011 11:43:19 Julien Nioche wrote:
> Hi Markus,
> 
> Can't really think of a reason why they could differ. Did you call 'readdb
> -stats' right after the crawldb update?
> Could it be a problem with readdb -stats?  And why are we seeing an
> 'unknown' status in the crawldb update?
> 
> That's definitely interesting
> 
> Julien
> 

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: (NUTCH-1071) Crawldb update to total counts per status

2011-07-29 Thread Markus Jelsma
Hi,

A lot of indexers and one domainstats job ran between readdb and updatedb. 
Nothing that could interfere with the numbers, afaik. I'm running a complete 
cycle again and will do another readdb after the update and recheck.

I've no idea what the problem could be. I don't know whether to trust the 
readdb or the updatedb stats, and there's no way to check which output is 
true, or is there?

Thanks

> Hi Markus,
> 
> Can't really think of a reason why they could differ. Did you call 'readdb
> -stats' right after the crawldb update?
> Could it be a problem with readdb -stats?  And why are we seeing an
> 'unknown' status in the crawldb update?
> 
> That's definitely interesting
> 
> Julien
> 


Re: (NUTCH-1071) Crawldb update to total counts per status

2011-07-29 Thread Julien Nioche
Hi Markus,

Can't really think of a reason why they could differ. Did you call 'readdb
-stats' right after the crawldb update?
Could it be a problem with readdb -stats?  And why are we seeing an
'unknown' status in the crawldb update?

That's definitely interesting

Julien

On 29 July 2011 10:23, Markus Jelsma wrote:

> Hi Julien,
>
> Can you explain the following? I've got here some output from a readdb
> -stats
> job and the output of the most recent crawldb update job. They differ a
> lot!
>
> update:
> db_redir_temp   0   1,036,840   1,036,840
> db_redir_perm   0   1,195,539   1,195,539
> unknown         0       2,315       2,315
> db_unfetched    0  16,909,397  16,909,397
> db_notmodified  0   1,264,001   1,264,001
> db_gone         0     955,701     955,701
> db_fetched      0  19,545,591  19,545,591
>
> stats:
> TOTAL urls:                 40909384
> status 1 (db_unfetched):    26788643
> status 2 (db_fetched):      12345476
> status 3 (db_gone):           763463
> status 4 (db_redir_temp):     461511
> status 5 (db_redir_perm):     431595
> status 6 (db_notmodified):    118696
>
> Thanks
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


Re: (NUTCH-1071) Crawldb update to total counts per status

2011-07-29 Thread Markus Jelsma
Hi Julien,

Can you explain the following? I've got here some output from a readdb -stats 
job and the output of the most recent crawldb update job. They differ a lot!

update:
db_redir_temp   0   1,036,840   1,036,840
db_redir_perm   0   1,195,539   1,195,539
unknown         0       2,315       2,315
db_unfetched    0  16,909,397  16,909,397
db_notmodified  0   1,264,001   1,264,001
db_gone         0     955,701     955,701
db_fetched      0  19,545,591  19,545,591

stats:
TOTAL urls:                 40909384
status 1 (db_unfetched):    26788643
status 2 (db_fetched):      12345476
status 3 (db_gone):           763463
status 4 (db_redir_temp):     461511
status 5 (db_redir_perm):     431595
status 6 (db_notmodified):    118696

Thanks
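
A quick arithmetic check on the two listings above, as a minimal sketch: the numbers are copied from this message, while the class and method names are made up for illustration. Both listings sum to 40,909,384, which is the TOTAL urls value, so the two jobs agree on the number of URLs and differ only in how those URLs are spread over the statuses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StatusCountCheck {

    /** Counter values from the "update:" listing. */
    static Map<String, Long> updateCounters() {
        Map<String, Long> m = new LinkedHashMap<>();
        m.put("db_unfetched", 16_909_397L);
        m.put("db_fetched", 19_545_591L);
        m.put("db_gone", 955_701L);
        m.put("db_redir_temp", 1_036_840L);
        m.put("db_redir_perm", 1_195_539L);
        m.put("db_notmodified", 1_264_001L);
        m.put("unknown", 2_315L);
        return m;
    }

    /** Per-status values from the "stats:" listing (no "unknown" line there). */
    static Map<String, Long> readdbStats() {
        Map<String, Long> m = new LinkedHashMap<>();
        m.put("db_unfetched", 26_788_643L);
        m.put("db_fetched", 12_345_476L);
        m.put("db_gone", 763_463L);
        m.put("db_redir_temp", 461_511L);
        m.put("db_redir_perm", 431_595L);
        m.put("db_notmodified", 118_696L);
        return m;
    }

    /** Sums all per-status counts. */
    static long total(Map<String, Long> counts) {
        long sum = 0;
        for (long value : counts.values()) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Long> update = updateCounters();
        Map<String, Long> stats = readdbStats();
        System.out.println("update total: " + total(update));
        System.out.println("stats total:  " + total(stats));
        // Per-status difference: update counter minus readdb -stats value.
        for (String status : update.keySet()) {
            long diff = update.get(status) - stats.getOrDefault(status, 0L);
            System.out.printf("%-15s %+,d%n", status, diff);
        }
    }
}
```

The per-status differences cancel out to zero, which is what one would expect if each URL is counted once in both jobs but under a different status.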

> Crawldb update to total counts per status
> -----------------------------------------
> 
>              Key: NUTCH-1071
>              URL: https://issues.apache.org/jira/browse/NUTCH-1071
>          Project: Nutch
>       Issue Type: Improvement
> Affects Versions: 1.4
>         Reporter: Julien Nioche
>         Assignee: Julien Nioche
>         Priority: Trivial
>          Fix For: 1.4
> 
> 
> The reduce phase of the crawldb update outputs all the entries that will be
> found in the updated crawldb. We can use the counters to summarise the
> number of URLs per status, which is a bit like the readdb -stats
> functionality except that it does not require an additional step. This is
> a useful way of monitoring the progress of a crawl using the Hadoop
> JobTracker UI.
> 
> --
> This message is automatically generated by JIRA.
> For more information on JIRA, see: http://www.atlassian.com/software/jira