Re: Fetcher Thread
Thanks Markus. I will remember to set threads.per.host to 1.

Cheers,
Ye

On Thu, Oct 18, 2012 at 9:55 PM, Markus Jelsma wrote:
> I think threads per host must not exceed 1 for most websites out of
> politeness. You can set the number of threads as high as you can, it only
> takes more memory. If you parse in the fetcher as well, you can run much
> fewer threads.
RE: Fetcher Thread
Hi Ye,

-Original message-
> From: Ye T Thet
> Sent: Thu 18-Oct-2012 15:46
> To: user@nutch.apache.org
> Subject: Fetcher Thread
>
> Hi Folks,
>
> I have two questions about the Fetcher Thread in Nutch. The value of
> fetcher.threads.fetch in the configuration file determines the number of
> threads Nutch uses to fetch. Of course, threads.per.host is also
> used for politeness.
>
> I set fetcher.threads.fetch to 100 and threads.per.host to 2. So far, in
> development, I have been using only one Linux box to fetch, so it is
> clear that Nutch fetches up to 100 URLs at a time, provided that the
> threads.per.host criterion is met.
>
> The questions are:
>
> 1. What if I crawl on a Hadoop cluster with 5 Linux boxes and set
> fetcher.threads.fetch to 100? Would Nutch fetch 100 URLs at a time or
> 500 (5 x 100) at a time?

All nodes are isolated and don't know what the others are doing. So if you set the threads to 100 for each machine, each machine will run with 100 threads.

> 2. Any advice on formulating optimum values of fetcher.threads.fetch and
> threads.per.host for a Hadoop cluster with 5 Linux boxes (Amazon EC2
> medium instances, 3.7 GB memory)? I would be crawling around 10,000 (10k)
> web sites.

I think threads per host must not exceed 1 for most websites, out of politeness. You can set the number of threads as high as you like; it only takes more memory. If you parse in the fetcher as well, you can run far fewer threads.

> Thanks,
>
> Ye
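The arithmetic behind the two answers above can be sketched in a few lines of Python. This is only an illustration, not Nutch code: the function names are hypothetical, it assumes one fetch task per machine, and it assumes the 10,000 sites are spread evenly so that each host is fetched by exactly one node.

```python
def aggregate_fetch_threads(nodes: int, threads_per_task: int,
                            tasks_per_node: int = 1) -> int:
    """Total fetcher threads in the cluster.

    Each node applies fetcher.threads.fetch on its own, so the totals add up.
    tasks_per_node=1 is an assumption, not something stated in the thread.
    """
    return nodes * tasks_per_node * threads_per_task


def busy_threads_upper_bound(threads: int, distinct_hosts: int,
                             threads_per_host: int) -> int:
    """Upper bound on threads that can fetch at once on one node.

    The per-host limit caps concurrency at distinct_hosts * threads_per_host,
    and the thread pool caps it at `threads`.
    """
    return min(threads, distinct_hosts * threads_per_host)


# 5 machines, each configured with 100 threads.
print(aggregate_fetch_threads(nodes=5, threads_per_task=100))  # 500

# Assumed even split of 10,000 sites over 5 nodes, 1 thread per host.
hosts_per_node = 10_000 // 5
print(busy_threads_upper_bound(100, hosts_per_node, 1))  # 100
```

Under these assumptions the per-host limit of 1 does not starve a 100-thread pool, because each node still has far more distinct hosts than threads. With only a handful of hosts per node, the second function shows that most threads would sit idle.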
Re: Fetcher thread time out
I reduced the fetcher thread time out to mapred.task.timeout / 8, meaning 150 seconds in my case. This helps with mappers that finish the queue but remain hanging on some items, and it lets a task running on a server with too high a load end prematurely instead of being killed.

My VMs suffer from having RAID-5 enabled and a shortage of RAM, so I/O wait is high. Fetcher threads that would normally be killed by the task tracker are now being timed out. This means that whatever has been fetched is saved and no new map task is started, which would increase run time again.

Comments?

> Hi,
>
> With large map output the task tracker can time out (no progress update
> during merge). Using io.sort.factor I can tune the merge phase to proceed
> a bit faster. Yet it can still time out when the cluster is very busy, etc.
> I've increased the task time out, but now it also takes longer to get rid
> of hanging threads.
>
> The fetcher thread time out is mapred.task.timeout / 2. That makes sense,
> but I guess it would make more sense to reduce the time out value even
> further; why would I want to wait so long for it to get aborted anyway?
> Now a single mapper can have a huge impact on avg. throughput.
>
> Thoughts?
> thanks
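For reference, the time out arithmetic discussed in this message can be written out as a small Python sketch. The function name is hypothetical and this is not the Nutch implementation. The task time out of 1,200,000 ms is not stated in the thread; it is inferred from "150 seconds" at a divisor of 8, given that mapred.task.timeout is expressed in milliseconds.

```python
def fetcher_thread_timeout_s(task_timeout_ms: int, divisor: int) -> float:
    """Fetcher thread time out in seconds, derived from mapred.task.timeout.

    task_timeout_ms: value of mapred.task.timeout (milliseconds).
    divisor: 2 is the ratio described in the quoted message,
             8 is the reduced ratio tried in the reply.
    """
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    return task_timeout_ms / divisor / 1000


# Inferred value: 150 s * 8 = 1200 s = 1,200,000 ms.
TASK_TIMEOUT_MS = 1_200_000

print(fetcher_thread_timeout_s(TASK_TIMEOUT_MS, 2))  # 600.0
print(fetcher_thread_timeout_s(TASK_TIMEOUT_MS, 8))  # 150.0
```

With the same task time out, moving the divisor from 2 to 8 cuts the wait for a hung fetcher thread from ten minutes to two and a half, which leaves a wide margin before the task tracker would kill the whole task.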