Hi, I'm running Nutch trunk as of today, with 3 slaves and a master, using mapred.map.tasks=20 and mapred.reduce.tasks=4. There is something I'm really confused about.
When I inject 25000 urls and fetch them (depth = 1) and do a readdb -stats, I get:

    060110 171347 Statistics for CrawlDb: crawldb
    060110 171347 TOTAL urls: 27939
    060110 171347 avg score: 1.011
    060110 171347 max score: 8.883
    060110 171347 min score: 1.0
    060110 171347 retry 0: 26429
    060110 171347 retry 1: 1510
    060110 171347 status 1 (DB_unfetched): 24248
    060110 171347 status 2 (DB_fetched): 3390
    060110 171347 status 3 (DB_gone): 301
    060110 171347 CrawlDb statistics: done

There are several things that don't make sense to me, and it would be great if someone could clear this up:

1. If I count the occurrences of "fetching" in all of my slaves' tasktracker logs, I get 6225. This number clearly doesn't match the DB_fetched count of 3390 from the readdb output. Why is that? What happened to the 6225 - 3390 = 2835 missing urls?

2. Why is TOTAL urls 27939 if I inject a file with 25000 entries? Why is it not 25000?

3. What is the meaning of DB_gone and DB_unfetched? I was assuming that if you inject a total of 25k urls of which 5000 are fetchable, you would get something like:

    (DB_unfetched): 20000
    (DB_fetched): 5000

That's not the case, so I'd like to understand what exactly is going on here. Also, what is the meaning of DB_gone?

4. If I redo the exact same inject + crawl with the same 25000 urls (starting from an empty crawldb, of course), but use mapred.map.tasks=200 and mapred.reduce.tasks=8 instead, I get the following readdb output:

    060110 162140 TOTAL urls: 33173
    060110 162140 avg score: 1.026
    060110 162140 max score: 22.083
    060110 162140 min score: 1.0
    060110 162140 retry 0: 28381
    060110 162140 retry 1: 4792
    060110 162140 status 1 (DB_unfetched): 23136
    060110 162140 status 2 (DB_fetched): 9234
    060110 162140 status 3 (DB_gone): 803
    060110 162140 CrawlDb statistics: done

How come DB_fetched is about 3x higher than before, and TOTAL urls goes way beyond the earlier 27939? It doesn't make any sense. I'd expect to see results similar to those from the other mapred settings.

Thank you,
Florent
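
To make the figures above easier to cross-check, here is a minimal Python sketch that parses the two readdb -stats outputs quoted in this message and compares them. It is not part of Nutch: `parse_stats`, `summarize`, the `INJECTED` constant, and the regular expressions are hypothetical helpers that assume the log format is exactly as pasted above.

```python
import re

INJECTED = 25000  # number of urls in the injected file, as stated in this post

# readdb -stats lines copied from this post (timestamps dropped for brevity).
RUN_20_MAPS = """\
TOTAL urls: 27939
retry 0: 26429
retry 1: 1510
status 1 (DB_unfetched): 24248
status 2 (DB_fetched): 3390
status 3 (DB_gone): 301
"""

RUN_200_MAPS = """\
TOTAL urls: 33173
retry 0: 28381
retry 1: 4792
status 1 (DB_unfetched): 23136
status 2 (DB_fetched): 9234
status 3 (DB_gone): 803
"""


def parse_stats(text):
    """Extract the total, per-retry counts and per-status counts.

    Assumes each line looks like the readdb output quoted in this post.
    """
    total = None
    retries = {}
    statuses = {}
    for line in text.splitlines():
        m = re.search(r"TOTAL urls:\s*(\d+)", line)
        if m:
            total = int(m.group(1))
            continue
        m = re.search(r"retry (\d+):\s*(\d+)", line)
        if m:
            retries[int(m.group(1))] = int(m.group(2))
            continue
        m = re.search(r"status \d+ \((\w+)\):\s*(\d+)", line)
        if m:
            statuses[m.group(1)] = int(m.group(2))
    return total, retries, statuses


def summarize(text):
    """Return derived figures used to compare the two runs."""
    total, retries, statuses = parse_stats(text)
    return {
        "total": total,
        "status_sum": sum(statuses.values()),
        "retry_sum": sum(retries.values()),
        "beyond_injected": total - INJECTED,
        "fetched": statuses["DB_fetched"],
    }


summary_20 = summarize(RUN_20_MAPS)
summary_200 = summarize(RUN_200_MAPS)

for name, s in (("20 maps / 4 reduces", summary_20),
                ("200 maps / 8 reduces", summary_200)):
    print(name, s)
```

In both runs the status counts and the retry counts each add up to TOTAL urls, so each readdb output is internally consistent. The script only restates the numbers; it does not explain why the two runs differ.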