Hi Ye,
 
-----Original message-----
> From:Ye T Thet <yethura.t...@gmail.com>
> Sent: Thu 18-Oct-2012 15:46
> To: user@nutch.apache.org
> Subject: Fetcher Thread
> 
> Hi Folks,
> 
> I have two questions about the Fetcher Thread in Nutch. The value of
> fetcher.threads.fetch in the configuration file determines the number of
> threads Nutch uses to fetch. Of course threads.per.host is also
> used for politeness.
> 
> I set fetcher.threads.fetch to 100 and threads.per.host to 2. So far in
> development I have been using only one Linux box to fetch, so it is
> clear that Nutch would fetch 100 URLs at a time, provided that the
> threads.per.host criterion is met.
> 
> The questions are:
> 
> 1. What if I crawl on a Hadoop cluster with 5 Linux boxes and set
> fetcher.threads.fetch to 100? Would Nutch fetch 100 URLs at a time or 500
> (5 x 100) at a time?

The nodes are isolated and don't know what the others are doing. The setting
applies to each fetcher task, so with 100 threads per machine and one fetcher
task on each of the 5 machines, the cluster fetches up to 500 URLs at a time.

> 
> 2. Any advice on choosing optimal values of fetcher.threads.fetch and
> threads.per.host for a Hadoop cluster with 5 Linux boxes (Amazon EC2 medium
> instances, 3.7 GB memory)? I would be crawling around 10,000 (10k) web
> sites.

I think threads per host should not exceed 1 for most websites, out of
politeness. You can set the number of threads as high as you like; it only
takes more memory. If you parse in the fetcher as well, you can afford far
fewer threads.
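
As a rough sketch, the settings discussed above might look like this in
conf/nutch-site.xml. The value 100 is simply the one from your question, not a
tuned recommendation. The per-host property name is an assumption on my side:
as far as I recall, recent 1.x releases call it fetcher.threads.per.queue and
older ones fetcher.threads.per.host, so please check the nutch-default.xml
that ships with your version.

```xml
<!-- conf/nutch-site.xml: illustrative values only, not tuned recommendations -->
<configuration>
  <property>
    <name>fetcher.threads.fetch</name>
    <!-- threads per fetcher task; every node runs its own set -->
    <value>100</value>
  </property>
  <property>
    <!-- assumed name for recent 1.x; older releases used fetcher.threads.per.host -->
    <name>fetcher.threads.per.queue</name>
    <value>1</value>
  </property>
  <property>
    <name>fetcher.parse</name>
    <!-- false keeps parsing out of the fetcher, leaving memory for fetch threads -->
    <value>false</value>
  </property>
</configuration>
```

Every node reads the same file, so the thread count applies to each machine
separately rather than to the cluster as a whole.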

> 
> Thanks,
> 
> Ye
> 
