I am trying to optimize fetch performance, and I think I am failing miserably, since I am not able to max out any of my resources (CPU, RAM, and more importantly bandwidth). Obviously I am not trying to max out all of them at the same time; I just want to find the bottleneck, and I can't see one at this point. So here are my questions:

1. How do fetcher.threads.per.queue and fetcher.server.delay affect each other? If threads.per.queue is 8 and server.delay is 2 seconds, does that mean I will get about 8 fetches every 2 seconds on each site? (I don't care about the distribution, just the maximum number of hits a target site will receive from my crawler. Would that be about 8 every 2 seconds?)
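
For reference, here is roughly how those two properties are set in my nutch-site.xml (a sketch of my current values; the property names are the standard Nutch ones):

  <!-- nutch-site.xml: current fetch politeness settings -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>8</value>
    <!-- up to 8 fetcher threads may work on the same host queue -->
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>2.0</value>
    <!-- delay, in seconds, between requests to the same host -->
  </property>

My naive reading is that this caps each host at about 8 fetches per 2 seconds, i.e. roughly 4 requests per second, but that is exactly the part I am not sure about.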

2. This is sort of a follow-up to the first one. I have fetcher.threads.fetch set to 16 and fetcher.threads.per.queue set to 8, but when I check the log files I see that most of the time (especially toward the end of the job, when only the sites with very many pages are left) 8 of the threads are always just spin-waiting. Wouldn't it be better to use the same value for threads.fetch and threads.per.queue?
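
If making them equal is the sensible fix, I assume the change would just be this (a sketch; I haven't tried it yet):

  <property>
    <name>fetcher.threads.fetch</name>
    <value>16</value>
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>16</value>
    <!-- raised from 8 to match threads.fetch -->
  </property>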

3. nutch-default.xml says something about the fetcher running one map task per node (in the fetcher.threads.fetch description). I have quad-core CPUs, so I have set up Hadoop to run 4 map tasks per node, each with 16 threads. Is there any downside to doing that? Is there an optimal value for this? You should also know that when I double the number of threads (to 32), I get timeout errors for roughly 80% of all pages WITHOUT seeing any real load on my bandwidth, as if the node just can't handle that many threads.
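
For context, here is how the per-node task count is configured on my cluster (a sketch; mapred.tasktracker.map.tasks.maximum is the pre-YARN Hadoop property name, which is what I am assuming here):

  <!-- mapred-site.xml: map slots per node -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
    <!-- one map task per core on a quad-core node -->
  </property>

With fetcher.threads.fetch at 16, that works out to 4 tasks x 16 threads = 64 fetcher threads per node, and doubling to 32 threads would mean 128 per node, which is where the timeouts start.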


Any comment would be really appreciated, even if it's just to say that I am stupid. Thanks,

--
Kaveh Minooie

www.plutoz.com
