Well, if we want to fetch pages from N different sites, ideally we should be able to have N threads running without any of them having to wait. Rather than waiting, the fetcher should probably put the URL it was trying to fetch back into the queue to be retried later, and grab a different one.
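Something along these lines, maybe (just a sketch of the re-queue idea, not Nutch's actual Fetcher; the queue, the per-host timestamp map, and fetchOne() are all made up for illustration):

    import java.net.URL;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    // Sketch of a fetcher thread that re-queues a URL whose host is still
    // inside the politeness delay, instead of blocking on it.
    public class RequeueFetcher implements Runnable {
        private final LinkedBlockingQueue<String> queue;        // URLs still to fetch
        private final ConcurrentHashMap<String, Long> lastHit;  // host -> last fetch time (ms)
        private final long serverDelayMs;                       // politeness delay per host

        public RequeueFetcher(LinkedBlockingQueue<String> queue,
                              ConcurrentHashMap<String, Long> lastHit,
                              long serverDelayMs) {
            this.queue = queue;
            this.lastHit = lastHit;
            this.serverDelayMs = serverDelayMs;
        }

        public void run() {
            try {
                String url;
                // Stop once the queue has been empty for a second.
                while ((url = queue.poll(1, TimeUnit.SECONDS)) != null) {
                    String host = new URL(url).getHost();
                    Long last = lastHit.get(host);
                    long now = System.currentTimeMillis();
                    if (last != null && now - last < serverDelayMs) {
                        queue.put(url);  // host still "hot": push it back, grab another
                        continue;        // (a real fetcher would also back off here, or
                    }                    //  the last few URLs of one host will spin)
                    // Note: the check-then-set on lastHit is racy across threads;
                    // good enough for a sketch, not for production.
                    lastHit.put(host, now);
                    fetchOne(url);       // hypothetical helper: the actual HTTP GET
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        private void fetchOne(String url) { /* ... */ }
    }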

I'm not so sure that accesses to each host are spread evenly through the list. The fetch list I was running had tens of thousands of different hosts, yet I was still seeing a large number of threads trying to hit the same host at the same time, even with only 50 threads. Maybe that's how it would look even if the hosts were spread evenly, but it seems like a lot.

Thanks for your help.

-Matt

Doug Cutting wrote:

Matt Zytaruk wrote:

Indeed, that does work, although it slows the fetch down a fair amount because a lot of threads end up sitting idle, and I was hoping to avoid that slowdown if possible.


What should these threads be doing?

If you have a site with N pages to fetch, and you want to fetch them all politely, then it will take at least fetcher.server.delay * N seconds to fetch them all. The fetch list is sorted by the hash of the URL, so accesses to each host should be spread fairly evenly through the list.
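For example, at a 5-second fetcher.server.delay, a host with 1,000 pages in the list needs at least 5,000 seconds, roughly 83 minutes, of elapsed time, no matter how many threads you run.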

Capping the number of pages per host (generate.max.per.host) will help, or, if you know the webmasters in question, you can consider increasing fetcher.threads.per.host.
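For reference, those knobs live in conf/nutch-site.xml; something like the following (the property names are the standard ones, but the values here are arbitrary examples, not recommendations):

    <configuration>
      <property>
        <name>generate.max.per.host</name>
        <value>100</value>  <!-- cap on pages per host in each generated fetch list -->
      </property>
      <property>
        <name>fetcher.server.delay</name>
        <value>5.0</value>  <!-- seconds between successive requests to one host -->
      </property>
      <property>
        <name>fetcher.threads.per.host</name>
        <value>2</value>    <!-- only raise this with the webmaster's consent -->
      </property>
    </configuration>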

Doug


