Hi Markus,

Depending on the number of cores in the machine, I can only increase the number of threads up to a limit. Beyond that, performance degrades, so running a single mapper is still a bottleneck in this case. I am looking for options to distribute URLs of the same domain across multiple mappers. I am not sure whether that is even possible with Nutch.
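To illustrate the bottleneck, here is a minimal sketch. It is not Nutch code: the function names, the CRC32 hash, and the mapper count are all assumptions made for illustration. It only shows why keying the partition on the host puts every URL of a single-domain crawl into one fetch task, while keying on the full URL would spread them out (at the cost of no longer being able to enforce one crawl delay per host across tasks).

```python
import zlib
from collections import Counter
from urllib.parse import urlsplit

NUM_MAPPERS = 4  # hypothetical number of fetch tasks


def partition_by_host(url: str, num_partitions: int) -> int:
    """Pick a partition from the host name only (all URLs of a host collide)."""
    host = urlsplit(url).hostname or ""
    # crc32 is used because it is deterministic across runs, unlike hash(str).
    return zlib.crc32(host.encode("utf-8")) % num_partitions


def partition_by_url(url: str, num_partitions: int) -> int:
    """Pick a partition from the full URL (spreads a single host across tasks)."""
    return zlib.crc32(url.encode("utf-8")) % num_partitions


# 1000 made-up URLs that all belong to the same domain.
urls = [f"https://example.com/page/{i}" for i in range(1000)]

by_host = Counter(partition_by_host(u, NUM_MAPPERS) for u in urls)
by_url = Counter(partition_by_url(u, NUM_MAPPERS) for u in urls)

print("partitions used when keyed by host:", len(by_host))
print("partitions used when keyed by URL: ", len(by_url))
print("URLs per partition when keyed by URL:", dict(sorted(by_url.items())))
```

With host-based keys only one of the mappers receives work, which matches what I am seeing; whether Nutch can be configured to partition any other way is exactly my question.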
Regards
Prateek

On Tue, May 11, 2021 at 11:58 AM Markus Jelsma <[email protected]> wrote:

> Hello Prateek,
>
> If you want to fetch stuff from the same host/domain as fast as you want,
> increase the number of threads, and the number of threads per queue. Then
> decrease all the fetch delays.
>
> Regards,
> Markus
>
> On Tue, May 11, 2021 at 12:48, prateek <[email protected]> wrote:
>
> > Hi Lewis,
> >
> > As mentioned earlier, it does not matter how many mappers I assign to
> > fetch tasks. Since all the URLs are of the same domain, everything will
> > be assigned to the same mapper and all other mappers will have no task
> > to execute. So I am looking for ways I can crawl the same domain URLs
> > quickly.
> >
> > Regards
> > Prateek
> >
> > On Mon, May 10, 2021 at 1:02 AM Lewis John McGibbney <[email protected]>
> > wrote:
> >
> > > Hi Prateek,
> > > mapred.map.tasks --> mapreduce.job.maps
> > > mapred.reduce.tasks --> mapreduce.job.reduces
> > > You should be able to override these in nutch-site.xml then publish
> > > to your Hadoop cluster.
> > > lewismc
> > >
> > > On 2021/05/07 15:18:38, prateek <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > I am trying to crawl URLs belonging to the same domain (around 140k)
> > > > and because of the fact that all the same domain URLs go to the same
> > > > mapper, only one mapper is used for fetching. All others are just a
> > > > waste of resources. These are the configurations I have tried till
> > > > now but it's still very slow.
> > > >
> > > > Attempt 1 -
> > > > fetcher.threads.fetch : 10
> > > > fetcher.server.delay : 1
> > > > fetcher.threads.per.queue : 1
> > > > fetcher.server.min.delay : 0.0
> > > >
> > > > Attempt 2 -
> > > > fetcher.threads.fetch : 10
> > > > fetcher.server.delay : 1
> > > > fetcher.threads.per.queue : 3
> > > > fetcher.server.min.delay : 0.5
> > > >
> > > > Is there a way to distribute the same domain URLs across all the
> > > > fetcher.threads.fetch? I understand that in this case crawl delay
> > > > cannot be enforced across different mappers but for my use case
> > > > it's ok to crawl aggressively. So any suggestions?
> > > >
> > > > Regards
> > > > Prateek

