Hi,
I am trying to crawl around 140k URLs that all belong to the same domain.
Because all URLs from the same domain go to the same mapper, only one mapper
is used for fetching and the others are just a waste of resources. These are
the configurations I have tried so far, but the crawl is still very slow.
Attempt 1 -
fetcher.threads.fetch : 10
fetcher.server.delay : 1
fetcher.threads.per.queue : 1
fetcher.server.min.delay : 0.0
Attempt 2 -
fetcher.threads.fetch : 10
fetcher.server.delay : 1
fetcher.threads.per.queue : 3
fetcher.server.min.delay : 0.5
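For a sense of why this is slow, here is a rough back-of-the-envelope estimate for Attempt 1. It assumes that only one request at a time goes to the host, that the 1 second delay is applied between consecutive requests, and that download time is ignored, so it is only a lower bound. The helper name min_crawl_hours is made up for illustration, and the real fetcher behaviour may differ.

```python
def min_crawl_hours(num_urls: int, delay_seconds: float) -> float:
    """Lower bound on crawl time when requests to one host are serialized.

    Assumption: one request at a time, with `delay_seconds` between
    requests, and zero time spent on the download itself.
    """
    return num_urls * delay_seconds / 3600.0


# Roughly 140k URLs with fetcher.server.delay = 1 (Attempt 1).
hours = min_crawl_hours(140_000, 1.0)
print(f"{hours:.1f} hours")  # prints "38.9 hours"
```

So even in the best case a single queue would need well over a day, no matter how many other threads or mappers are idle.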
Is there a way to distribute URLs from the same domain across all the
fetcher threads (fetcher.threads.fetch)? I understand that in this case the
crawl delay cannot be enforced across different mappers, but for my use case
it's ok to crawl aggressively. Any suggestions?
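In case it helps to make the problem concrete, here is a rough Python sketch of the behaviour I am describing. It is not Nutch's actual partitioner: the function name partition_for, the CRC32 hash, the mapper count of 10, and the example.com URLs are all assumptions for illustration only.

```python
import zlib
from collections import Counter
from urllib.parse import urlparse


def partition_for(url: str, num_mappers: int) -> int:
    """Pick a mapper index from the URL's host (illustrative only)."""
    host = urlparse(url).hostname or ""
    # Same host -> same hash -> same mapper, whatever the path is.
    return zlib.crc32(host.encode("utf-8")) % num_mappers


# 1000 hypothetical URLs, all on one host.
urls = [f"http://example.com/page/{i}" for i in range(1000)]
counts = Counter(partition_for(u, 10) for u in urls)

# Only one of the 10 mappers receives any work.
print(len(counts), "mapper(s) used for", sum(counts.values()), "URLs")
```

Since the partition depends only on the host, every URL lands on one mapper and the other nine stay idle, which matches what I see.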
Regards
Prateek