Hi,

I'm running Nutch trunk as of today, with 3 slaves and a master.  I'm
using mapred.map.tasks=20 and mapred.reduce.tasks=4.
There is something I'm really confused about.

When I inject 25000 urls and fetch them (depth = 1) and do a readdb
-stats, I get:
060110 171347 Statistics for CrawlDb: crawldb
060110 171347 TOTAL urls:       27939
060110 171347 avg score:        1.011
060110 171347 max score:        8.883
060110 171347 min score:        1.0
060110 171347 retry 0:  26429
060110 171347 retry 1:  1510
060110 171347 status 1 (DB_unfetched):  24248
060110 171347 status 2 (DB_fetched):    3390
060110 171347 status 3 (DB_gone):       301
060110 171347 CrawlDb statistics: done

There are several things that don't make sense to me and it would be
great if someone could clear this up:

1.
If I count the occurrences of "fetching" in all of my slaves'
tasktracker logs, I get 6225.
This number clearly doesn't match the DB_fetched count of 3390 from the
readdb output.  Why is that?
What happened to the 6225-3390=2835 missing urls?
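
In case it helps anyone reproduce the count, here is a minimal sketch of
how it could be done.  It assumes each fetch attempt shows up in the
tasktracker log as a line containing "fetching <url>" (an assumption about
the log format), and the log file paths passed on the command line are
placeholders.  It reports both the raw number of occurrences and the number
of distinct urls, since the same url logged more than once would inflate
the raw count.

```python
import re
import sys
from collections import Counter

# Assumption: a fetch attempt is logged as "... fetching <url>".
FETCH_RE = re.compile(r"\bfetching\s+(\S+)")


def summarize_fetch_lines(lines):
    """Return (total_occurrences, distinct_urls) for 'fetching <url>' lines."""
    urls = Counter()
    for line in lines:
        match = FETCH_RE.search(line)
        if match:
            urls[match.group(1)] += 1
    return sum(urls.values()), len(urls)


def iter_lines(paths):
    """Yield every line from every log file, in order."""
    for path in paths:
        with open(path, errors="replace") as handle:
            yield from handle


if __name__ == "__main__":
    # Usage (paths are hypothetical): python count_fetching.py slave1.log slave2.log
    total, distinct = summarize_fetch_lines(iter_lines(sys.argv[1:]))
    print(f"occurrences of 'fetching': {total}")
    print(f"distinct urls:             {distinct}")
```

If the two numbers differ, part of the gap to DB_fetched would come from
urls that were logged more than once.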

2.
Why is TOTAL urls 27939 if I inject a file with 25000 entries?
Why is it not 25000?

3.
What is the meaning of DB_gone and DB_unfetched?
I was assuming if you inject a total of 25k urls where 5000 are
fetchable ones, you would get something like:
(DB_unfetched):  20000
(DB_fetched):    5000
That's not the case, so I'd like to understand exactly what is going on
here, and in particular what DB_gone means.

4.
If I redo (starting from an empty crawldb of course) the exact same
inject + crawl with the same 25000 urls, but I use the following mapred
settings instead: mapred.map.tasks=200 and mapred.reduce.tasks=8, I
get the following readdb output:
060110 162140 TOTAL urls:       33173
060110 162140 avg score:        1.026
060110 162140 max score:        22.083
060110 162140 min score:        1.0
060110 162140 retry 0:  28381
060110 162140 retry 1:  4792
060110 162140 status 1 (DB_unfetched):  23136
060110 162140 status 2 (DB_fetched):    9234
060110 162140 status 3 (DB_gone):       803
060110 162140 CrawlDb statistics: done
How come DB_fetched is almost 3x higher than before (9234 vs. 3390) and
TOTAL urls goes way beyond the 27939 from the first run?
This doesn't make sense to me.  I'd expect results similar to the run
with the other mapred settings.
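
To make the comparison concrete, here is a small sketch that parses the
two readdb outputs quoted above and checks the arithmetic.  The function
names and the `injected` parameter are made up for illustration; the only
data used are the numbers from the two runs.

```python
import re

TOTAL_RE = re.compile(r"TOTAL urls:\s+(\d+)")
STATUS_RE = re.compile(r"status \d+ \((\w+)\):\s+(\d+)")

# Relevant lines copied from the two readdb -stats outputs above.
RUN_20_MAPS = """
060110 171347 TOTAL urls:       27939
060110 171347 status 1 (DB_unfetched):  24248
060110 171347 status 2 (DB_fetched):    3390
060110 171347 status 3 (DB_gone):       301
"""
RUN_200_MAPS = """
060110 162140 TOTAL urls:       33173
060110 162140 status 1 (DB_unfetched):  23136
060110 162140 status 2 (DB_fetched):    9234
060110 162140 status 3 (DB_gone):       803
"""


def parse_stats(text):
    """Return (total, {status_name: count}) from readdb -stats output."""
    total = int(TOTAL_RE.search(text).group(1))
    statuses = {name: int(count) for name, count in STATUS_RE.findall(text)}
    return total, statuses


def describe(text, injected=25000):
    """Summarize one run relative to the number of injected urls."""
    total, statuses = parse_stats(text)
    return {
        "total": total,
        "statuses_sum_to_total": sum(statuses.values()) == total,
        "beyond_injected": total - injected,
        "fetched": statuses["DB_fetched"],
    }


if __name__ == "__main__":
    first, second = describe(RUN_20_MAPS), describe(RUN_200_MAPS)
    print(first)
    print(second)
    print("DB_fetched ratio:", round(second["fetched"] / first["fetched"], 2))
```

In both runs the three status counts add up exactly to TOTAL urls, so the
crawldb is at least internally consistent; what differs between the runs
is how many urls exist beyond the 25000 injected and how many were fetched.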

Thank you,
Florent
