Hi Prateek,

mapred.map.tasks --> mapreduce.job.maps
mapred.reduce.tasks --> mapreduce.job.reduces

You should be able to override these in nutch-site.xml, then publish to your Hadoop cluster.

lewismc
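For reference, a minimal sketch of what that override in nutch-site.xml might look like, using the newer mapreduce.* property names; the values shown here (10 maps, 4 reduces) are purely illustrative and should be tuned to your cluster:

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- Replaces the deprecated mapred.map.tasks; illustrative value -->
  <property>
    <name>mapreduce.job.maps</name>
    <value>10</value>
  </property>
  <!-- Replaces the deprecated mapred.reduce.tasks; illustrative value -->
  <property>
    <name>mapreduce.job.reduces</name>
    <value>4</value>
  </property>
</configuration>
```

Note that mapreduce.job.maps is only a hint to the framework; the actual number of map tasks is ultimately driven by the input splits.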
On 2021/05/07 15:18:38, prateek <[email protected]> wrote:
> Hi,
>
> I am trying to crawl URLs belonging to the same domain (around 140k), and
> because all of the same-domain URLs go to the same mapper, only one mapper
> is used for fetching. All the others are just a waste of resources. These
> are the configurations I have tried so far, but it's still very slow.
>
> Attempt 1 -
> fetcher.threads.fetch : 10
> fetcher.server.delay : 1
> fetcher.threads.per.queue : 1
> fetcher.server.min.delay : 0.0
>
> Attempt 2 -
> fetcher.threads.fetch : 10
> fetcher.server.delay : 1
> fetcher.threads.per.queue : 3
> fetcher.server.min.delay : 0.5
>
> Is there a way to distribute the same-domain URLs across all of the
> fetcher.threads.fetch? I understand that in this case the crawl delay
> cannot be enforced across different mappers, but for my use case it's OK
> to crawl aggressively. So any suggestions?
>
> Regards
> Prateek

