When doing a one-pass crawl, I noticed that when I inject more than ~16,000 URLs, the fetcher only fetches a subset of the URLs initially injected. I use 1 master and 3 slaves with the following properties:

  mapred.map.tasks = 30
  mapred.reduce.tasks = 6
  generate.max.per.host = -1
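In case the config format matters, here is roughly what those settings look like as property definitions; the file placement is an assumption (conf/nutch-site.xml, or conf/hadoop-site.xml for the mapred.* ones, depending on the setup):

```xml
<!-- Sketch only: file placement is an assumption; values are the ones
     listed above. -->
<property>
  <name>mapred.map.tasks</name>
  <value>30</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>6</value>
</property>
<property>
  <name>generate.max.per.host</name>
  <value>-1</value>
  <description>-1 disables the per-host limit when generating fetch lists.</description>
</property>
```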
I tried injecting different amounts of URLs to see around what threshold I start to see some missing ones. Here are the results of my tests so far:

  #urls injected     #fetched    %
  15,000 and below   all         100%
  16,000             15,998      ~100%
  25,000             21,379      86%
  50,000             26,565      53%
  100,000            22,088      22%

After having seen bug NUTCH-136 ("mapreduce segment generator generates 50 % less than excepted urls"), I thought it might fix my problem. I only applied the 2nd change mentioned in the description (the change in Generator.java, line 48), since I didn't know how to make the partitioning use a normal HashPartitioner. The fix didn't make any difference.

Then I started debugging the generator to see whether all the URLs were generated. I confirmed they were all generated (I did a check with 50k), so the problem lies further down the pipeline. I assume it's somewhere in the fetcher, but I'm not sure where yet. I'm going to keep investigating.

Has anyone encountered a similar issue? I read messages from people crawling millions of pages, and I wonder why I seem to be the only one having this issue. I'm apparently unable to fetch more than ~30k pages even though I inject 1 million URLs. Any help would be greatly appreciated.

Thanks,
--Flo
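For anyone wanting to reproduce the generated-vs-fetched check, something along these lines should work from the command line; this is a sketch (the segment path is a placeholder, and the exact subcommands depend on the Nutch version in use):

```shell
# Sketch, assuming a 0.8-era bin/nutch; crawl/ paths are placeholders.

# Per-status counts in the crawldb -- how many URLs ended up fetched
bin/nutch readdb crawl/crawldb -stats

# Per-segment summary, which should show generated vs. fetched counts
bin/nutch readseg -list crawl/segments/20060301120000
```

Comparing the generated count against the fetched count per segment is how I'd confirm whether the URLs are being dropped in the fetcher rather than in the generator.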