When doing a one-pass crawl, I noticed that when I inject more than
~16000 URLs, the fetcher only fetches a subset of the URLs initially
injected.
I use 1 master and 3 slaves with the following properties:
mapred.map.tasks = 30
mapred.reduce.tasks = 6
generate.max.per.host = -1
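
For completeness, those properties look like this in my config file (I'm
setting them in the standard Hadoop/Nutch XML property format; the exact
file name depends on your setup, e.g. nutch-site.xml):

```xml
<!-- Values as listed above -->
<property>
  <name>mapred.map.tasks</name>
  <value>30</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>6</value>
</property>
<property>
  <name>generate.max.per.host</name>
  <!-- -1 disables the per-host cap when generating fetch lists -->
  <value>-1</value>
</property>
```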

I tried injecting different amounts of URLs to see around what threshold
URLs start to go missing.  Here are the results of my tests so far:

#urls injected: #fetched
15000 and below: 100% fetched
16000:  15998 fetched (~100%)
25000:  21379 fetched (86%)
50000:  26565 fetched (53%)
100000: 22088 fetched (22%)

After coming across bug NUTCH-136, "mapreduce segment generator generates
50 % less than excepted urls", I thought it might fix my problem.  I only
applied the second change mentioned in the description (the change in
Generator.java, line 48), since I didn't know how to switch the
partitioning to a plain HashPartitioner.  The fix didn't make any
difference.
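
As I understand it, a plain HashPartitioner just assigns each key to a
reduce task by modulo hashing, instead of partitioning by host.  Here's a
self-contained sketch of that logic (my own illustration of the idea, not
Nutch's or Hadoop's actual class; the class name and URLs are made up):

```java
public class HashPartitionSketch {
    // Mimics the default hash partitioning: mask the sign bit so negative
    // hashCodes still map to a valid partition, then take the modulo.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int parts = 6; // mapred.reduce.tasks in my setup
        // Every key lands in some partition 0..parts-1, spread by hash,
        // regardless of which host the URL belongs to.
        System.out.println(getPartition("http://example.com/page1", parts));
        System.out.println(getPartition("http://example.com/page2", parts));
    }
}
```

With host-based partitioning, all URLs from one host go to the same
reducer; with plain hashing they spread across all reducers.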

Then I started debugging the generator to see whether all the URLs were
being generated.  I confirmed they all were (I checked with 50k), so the
problem lies further down the pipeline.  I assume it's somewhere in the
fetcher, but I'm not sure where yet.  I'll keep investigating.

Has anyone encountered a similar issue?
I've read messages from people crawling millions of pages, and I wonder
why I seem to be the only one hitting this.  I'm apparently unable to
fetch more than ~30k pages even when I inject 1 million URLs.

Any help would be greatly appreciated.

Thanks,
--Flo