Alex Basa wrote:
Can someone please explain to me fetcher.threads.fetch
This is the number of fetcher threads per map task. So with, say, two tasks per machine, 10 machines, and 10 fetcher threads per task, you get 2 x 10 x 10 = 200 fetchers working concurrently across the map tasks.
and fetcher.threads.per.host
This is the number of threads allowed to fetch from a single hostname concurrently. All URLs from the same domain should be partitioned into a single fetcher map task, and within that task up to fetcher.threads.per.host threads will be requesting from that host at the same time.
The Crawl-delay value from robots.txt, fetcher.server.delay, and fetcher.max.crawl.delay will also affect performance. fetcher.max.crawl.delay is the maximum crawl delay, in seconds, that will be accepted; if robots.txt asks for anything over that, the page is ignored and not processed. Crawl delays that are accepted can affect performance greatly.
For example, if robots.txt specifies a crawl delay of 10, then each thread will wait 10 seconds before processing a page for that host. If there is no crawl delay, then fetcher.server.delay determines the number of seconds each thread will wait before processing a page request. Both of these settings are for politeness.
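To make the delay rules concrete, here is a minimal Python sketch of the decision as described above. It is not Nutch code: the function name effective_delay and its parameters are made up for illustration, and it only models the three cases mentioned (no crawl delay, an accepted crawl delay, and a crawl delay over the maximum).

```python
from typing import Optional


def effective_delay(robots_crawl_delay: Optional[float],
                    server_delay: float,
                    max_crawl_delay: float) -> Optional[float]:
    """Return the seconds a thread waits before a request to a host.

    Returns None when the page would be skipped. All names here are
    hypothetical; they only mirror the settings discussed in this thread.
    """
    if robots_crawl_delay is None:
        # No Crawl-delay in robots.txt: fall back to fetcher.server.delay.
        return server_delay
    if robots_crawl_delay > max_crawl_delay:
        # Crawl-delay exceeds fetcher.max.crawl.delay: page is not processed.
        return None
    # Crawl-delay is within the accepted range, so it is used as the wait.
    return robots_crawl_delay


# Example values are illustrative, not Nutch defaults.
print(effective_delay(10, server_delay=5, max_crawl_delay=30))
print(effective_delay(None, server_delay=5, max_crawl_delay=30))
print(effective_delay(60, server_delay=5, max_crawl_delay=30))
```

With these example values, the first call uses the crawl delay from robots.txt, the second falls back to the server delay, and the third reports that the page would be skipped.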
Does it mean that if I am only crawling one site and have fetcher.threads.per.host = 5 and fetcher.threads.fetch = 100, we will only have 5 threads fetching from the one host and the other 95 threads will be waiting?
Yes, if you are only crawling one site. But since URLs from the same domain are partitioned together, if you are only fetching from a single site you would really only have a single active map task.
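The arithmetic for the single-site case can be sketched in a few lines of Python. This is only an illustration of the reasoning in this reply, assuming all URLs land in one map task; the function name single_host_thread_usage is hypothetical and not part of Nutch.

```python
def single_host_thread_usage(threads_fetch: int, threads_per_host: int):
    """Return (active, idle) thread counts in one map task crawling one host.

    Assumes every URL belongs to the same host, so the per-host limit
    is the only thing capping how many threads can fetch at once.
    """
    active = min(threads_fetch, threads_per_host)
    idle = threads_fetch - active
    return active, idle


active, idle = single_host_thread_usage(threads_fetch=100, threads_per_host=5)
print(active, idle)  # prints "5 95"
```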
Dennis
I couldn't find any documentation on this. Thanks in advance, Alex
