Hi Prateek,
mapred.map.tasks     -->    mapreduce.job.maps
mapred.reduce.tasks  -->    mapreduce.job.reduces
You should be able to override these in nutch-site.xml and then publish the
updated configuration to your Hadoop cluster.
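
As a rough sketch, the overrides in nutch-site.xml might look like this (the
values here are illustrative, not recommendations — tune them to your cluster):

```xml
<!-- nutch-site.xml: hypothetical example values, adjust for your cluster -->
<property>
  <name>mapreduce.job.maps</name>
  <value>10</value>
  <description>Default number of map tasks per job.</description>
</property>
<property>
  <name>mapreduce.job.reduces</name>
  <value>4</value>
  <description>Default number of reduce tasks per job.</description>
</property>
```

Note that mapreduce.job.maps is only a hint to the framework; the actual number
of map tasks is ultimately determined by the input splits.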
lewismc

On 2021/05/07 15:18:38, prateek <[email protected]> wrote: 
> Hi,
> 
> I am trying to crawl around 140k URLs that all belong to the same domain.
> Because all URLs from the same domain go to the same mapper, only one
> mapper is actually used for fetching; all the others are just a waste of
> resources. These are the configurations I have tried so far, but it's
> still very slow.
> 
> Attempt 1 -
>         fetcher.threads.fetch : 10
>         fetcher.server.delay : 1
>         fetcher.threads.per.queue : 1
>         fetcher.server.min.delay : 0.0
> 
> Attempt 2 -
>         fetcher.threads.fetch : 10
>         fetcher.server.delay : 1
>         fetcher.threads.per.queue : 3
>         fetcher.server.min.delay : 0.5
> 
> Is there a way to distribute the same-domain URLs across all the
> fetcher.threads.fetch? I understand that in this case the crawl delay
> cannot be enforced across different mappers, but for my use case it's ok
> to crawl aggressively. So any suggestions?
> 
> Regards
> Prateek
> 
