Well, if we want to fetch pages from N different sites, ideally we
should be able to keep N threads running without any of them having to
wait. What the fetcher should probably do, instead of waiting, is put
the URL it was trying to fetch back into the queue to be tried again
later, and grab a different one.
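To make that idea concrete, here is a minimal sketch of such a
requeueing queue in plain Java. This is not Nutch code: the class name,
the in-memory queue, and the 5-second politeness delay are all
assumptions.

    import java.util.ArrayDeque;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Queue;

    // Hypothetical sketch, not Nutch code: if a URL's host is still
    // inside its politeness delay, requeue it and take a different one.
    public class RequeueFetchQueue {
        private final Queue<String> queue = new ArrayDeque<>();
        private final Map<String, Long> nextAllowed = new HashMap<>(); // host -> earliest fetch time (ms)
        private final long serverDelayMs = 5000; // assumed politeness delay

        public synchronized void add(String url) { queue.add(url); }

        // Returns a URL whose host is ready to be fetched, or null if
        // every queued URL belongs to a host still in its delay window.
        public synchronized String next() {
            long now = System.currentTimeMillis();
            for (int i = 0, n = queue.size(); i < n; i++) {
                String url = queue.poll();
                String host = java.net.URI.create(url).getHost();
                if (nextAllowed.getOrDefault(host, 0L) <= now) {
                    nextAllowed.put(host, now + serverDelayMs); // reserve the host
                    return url;
                }
                queue.add(url); // still too soon for this host: try later
            }
            return null;
        }
    }

A worker thread would just loop on next(), fetching whatever it
returns and sleeping briefly whenever it gets null.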
I'm not so sure that accesses to each host are spread evenly through
the list. The fetch list I was running had tens of thousands of
different hosts, and I was still seeing a large number of threads trying
to access the same host at the same time, even with only 50 threads.
Maybe I'm wrong and that is how it would behave even if the hosts were
spread evenly, but it seems like a lot.
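For what it's worth, a quick birthday-style estimate says some same-host
overlap is expected even with a perfectly even spread, though maybe not
as much as I'm seeing. The 10,000-host figure below is just a stand-in
for "tens of thousands":

    // Back-of-envelope check (assumed numbers): with t threads each
    // holding a URL from one of h equally likely hosts, the chance
    // that at least two threads share a host is roughly
    // 1 - exp(-t*(t-1) / (2*h)) by the birthday approximation.
    public class CollisionEstimate {
        public static void main(String[] args) {
            int threads = 50;
            int hosts = 10_000; // stand-in for "tens of thousands"
            double p = 1 - Math.exp(-threads * (threads - 1) / (2.0 * hosts));
            System.out.printf("P(at least one shared host) ~= %.1f%%%n", p * 100); // ~11.5%
        }
    }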
Thanks for your help.
-Matt
Doug Cutting wrote:
Matt Zytaruk wrote:
Indeed, that does work, although it slows the fetch down a fair
amount because a lot of threads end up idle, waiting, and I was
hoping to avoid that slowdown if possible.
What should these threads be doing?
If you have a site with N pages to fetch, and you want to fetch them
all politely, then it will take at least fetcher.server.delay*N to
fetch them all. The fetch list is sorted by the hash of the URL, so
accesses to each host should be spread fairly evenly through the list.
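For example, at a delay of, say, 5 seconds, a host with 1,000 pages
takes at least 5,000 seconds (about 83 minutes) no matter how many
threads you run. Here is a toy illustration of the hash-sort
interleaving; the URLs and the MD5-based hash are stand-ins, and the
actual hash function may differ:

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.util.Arrays;
    import java.util.Comparator;

    // Toy illustration of the sort-by-URL-hash idea. With a
    // well-mixing hash, sorting by hash orders the list roughly at
    // random, so pages from one host are scattered through it rather
    // than grouped in runs.
    public class HashSortedFetchList {
        // Hash a URL via MD5 (illustrative choice of hash).
        static BigInteger hash(String url) {
            try {
                byte[] d = MessageDigest.getInstance("MD5").digest(url.getBytes("UTF-8"));
                return new BigInteger(1, d);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        public static void main(String[] args) {
            String[] urls = {
                "http://a.example/1", "http://a.example/2", "http://a.example/3",
                "http://b.example/1", "http://b.example/2", "http://c.example/1"
            };
            Arrays.sort(urls, Comparator.comparing(HashSortedFetchList::hash));
            for (String u : urls) System.out.println(u);
        }
    }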
Capping the number of pages per host (generate.max.per.host) will
help, or, if you know the webmasters in question, you can consider
increasing fetcher.threads.per.host.
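Both are ordinary configuration properties, so the overrides in
conf/nutch-site.xml would look something like this (the values are
only examples, not recommendations):

    <configuration>
      <property>
        <name>generate.max.per.host</name>
        <value>100</value> <!-- cap on pages per host in a generated fetch list -->
      </property>
      <property>
        <name>fetcher.threads.per.host</name>
        <value>2</value> <!-- raise only with the webmaster's blessing -->
      </property>
    </configuration>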
Doug