I am trying to optimize fetch performance, and I think I am failing miserably, since I am not able to max out any of my resources (CPU, RAM, and more importantly bandwidth). Obviously I am not trying to max out all of them at the same time; I just want to find the bottleneck, and I can't see one at this point. So here are my questions:

1. How do fetcher.threads.per.queue and fetcher.server.delay affect each other? If threads.per.queue is 8 and server.delay is 2 seconds, does that mean I will get about 8 fetches every 2 seconds on each site? (I don't care about the distribution, just the maximum number of hits a target site will receive from my crawler. Would that be about 8 every 2 seconds?)
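
For reference, here is roughly how those two properties are set in my nutch-site.xml (a sketch of my current values; the property names are the standard Nutch ones):

  <!-- nutch-site.xml: current fetch politeness settings -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>8</value>
    <!-- up to 8 fetcher threads may work on the same host queue -->
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>2.0</value>
    <!-- delay, in seconds, between requests to the same host -->
  </property>

My naive reading is that this caps each host at about 8 fetches per 2 seconds, i.e. roughly 4 requests per second, but that is exactly the part I am not sure about.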

2. This is sort of a follow-up to the first one. I have fetcher.threads.fetch set to 16 and fetcher.threads.per.queue set to 8, but when I check the log files I see that most of the time (especially toward the end of the job, when only the sites with very many pages are left) 8 of the threads are always just spin-waiting. Wouldn't it be better to use the same value for threads.fetch and threads.per.queue?
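
If making them equal is the sensible fix, I assume the change would just be this (a sketch; I haven't tried it yet):

  <property>
    <name>fetcher.threads.fetch</name>
    <value>16</value>
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>16</value>
    <!-- raised from 8 to match threads.fetch -->
  </property>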

3. nutch-default.xml says something about the fetcher running one map task per node (in the fetcher.threads.fetch description). I have quad-core CPUs, so I have set up Hadoop to run 4 map tasks per node, each with 16 threads. Is there any downside to doing that? Is there an optimal value for this? You should also know that when I double the number of threads (to 32), I get timeout errors for roughly 80% of all pages WITHOUT seeing any real load on my bandwidth, as if the node just can't handle that many threads.
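
For context, here is how the per-node task count is configured on my cluster (a sketch; mapred.tasktracker.map.tasks.maximum is the pre-YARN Hadoop property name, which is what I am assuming here):

  <!-- mapred-site.xml: map slots per node -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
    <!-- one map task per core on a quad-core node -->
  </property>

With fetcher.threads.fetch at 16, that works out to 4 tasks x 16 threads = 64 fetcher threads per node, and doubling to 32 threads would mean 128 per node, which is where the timeouts start.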


Any comment would be really appreciated, even if it's just to say that I am stupid. Thanks,

--
Kaveh Minooie

www.plutoz.com
