Well, if we want to fetch pages from N different sites, ideally we
should be able to keep N threads running without any of them having to
wait. What the fetcher should probably do, instead of waiting, is put
the URL it was trying to fetch back into the queue to be tried again
later, and grab a different one.
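To make that idea concrete, here is a minimal sketch of such a
requeueing queue in plain Java. This is not Nutch code: the class name,
the in-memory queue, and the 5-second politeness delay are all
assumptions.

    import java.util.ArrayDeque;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Queue;

    // Hypothetical sketch, not Nutch code: if a URL's host is still
    // inside its politeness delay, requeue it and take a different one.
    public class RequeueFetchQueue {
        private final Queue<String> queue = new ArrayDeque<>();
        private final Map<String, Long> nextAllowed = new HashMap<>(); // host -> earliest fetch time (ms)
        private final long serverDelayMs = 5000; // assumed politeness delay

        public synchronized void add(String url) { queue.add(url); }

        // Returns a URL whose host is ready to be fetched, or null if
        // every queued URL belongs to a host still in its delay window.
        public synchronized String next() {
            long now = System.currentTimeMillis();
            for (int i = 0, n = queue.size(); i < n; i++) {
                String url = queue.poll();
                String host = java.net.URI.create(url).getHost();
                if (nextAllowed.getOrDefault(host, 0L) <= now) {
                    nextAllowed.put(host, now + serverDelayMs); // reserve the host
                    return url;
                }
                queue.add(url); // still too soon for this host: try later
            }
            return null;
        }
    }

A worker thread would just loop on next(), fetching whatever it
returns and sleeping briefly whenever it gets null.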
I'm not so sure that accesses to each host are spread evenly through
the list. The fetch list I was running had tens of thousands of
different hosts, and I was still seeing a large number of threads trying
to access the same host at the same time, even with only 50 threads.
Maybe I'm wrong and that is how it would behave even if the hosts were
spread evenly, but it seems like a lot.
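For what it's worth, a quick birthday-style estimate says some same-host
overlap is expected even with a perfectly even spread, though maybe not
as much as I'm seeing. The 10,000-host figure below is just a stand-in
for "tens of thousands":

    // Back-of-envelope check (assumed numbers): with t threads each
    // holding a URL from one of h equally likely hosts, the chance
    // that at least two threads share a host is roughly
    // 1 - exp(-t*(t-1) / (2*h)) by the birthday approximation.
    public class CollisionEstimate {
        public static void main(String[] args) {
            int threads = 50;
            int hosts = 10_000; // stand-in for "tens of thousands"
            double p = 1 - Math.exp(-threads * (threads - 1) / (2.0 * hosts));
            System.out.printf("P(at least one shared host) ~= %.1f%%%n", p * 100); // ~11.5%
        }
    }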
Thanks for your help.
-Matt
Doug Cutting wrote:
Matt Zytaruk wrote:
Indeed, that does work, although it slows the fetch down a fair
amount because a lot of threads end up idle, waiting, and I was
hoping to avoid that slowdown if possible.
What should these threads be doing?
If you have a site with N pages to fetch, and you want to fetch them
all politely, then it will take at least fetcher.server.delay*N to
fetch them all. The fetch list is sorted by the hash of the URL, so
accesses to each host should be spread fairly evenly through the list.
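For example, at a delay of, say, 5 seconds, a host with 1,000 pages
takes at least 5,000 seconds (about 83 minutes) no matter how many
threads you run. Here is a toy illustration of the hash-sort
interleaving; the URLs and the MD5-based hash are stand-ins, and the
actual hash function may differ:

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.util.Arrays;
    import java.util.Comparator;

    // Toy illustration of the sort-by-URL-hash idea. With a
    // well-mixing hash, sorting by hash orders the list roughly at
    // random, so pages from one host are scattered through it rather
    // than grouped in runs.
    public class HashSortedFetchList {
        // Hash a URL via MD5 (illustrative choice of hash).
        static BigInteger hash(String url) {
            try {
                byte[] d = MessageDigest.getInstance("MD5").digest(url.getBytes("UTF-8"));
                return new BigInteger(1, d);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        public static void main(String[] args) {
            String[] urls = {
                "http://a.example/1", "http://a.example/2", "http://a.example/3",
                "http://b.example/1", "http://b.example/2", "http://c.example/1"
            };
            Arrays.sort(urls, Comparator.comparing(HashSortedFetchList::hash));
            for (String u : urls) System.out.println(u);
        }
    }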
Capping the number of pages per host (generate.max.per.host) will
help, or, if you know the webmasters in question, you can consider
increasing fetcher.threads.per.host.
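Both are ordinary configuration properties, so the overrides in
conf/nutch-site.xml would look something like this (the values are
only examples, not recommendations):

    <configuration>
      <property>
        <name>generate.max.per.host</name>
        <value>100</value> <!-- cap on pages per host in a generated fetch list -->
      </property>
      <property>
        <name>fetcher.threads.per.host</name>
        <value>2</value> <!-- raise only with the webmaster's blessing -->
      </property>
    </configuration>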
Doug