Re: fetcher questions

Dennis Kubes Fri, 17 Apr 2009 21:57:25 -0700


Alex Basa wrote:

Can someone please explain to me

fetcher.threads.fetch

Number of fetcher threads per map task. So say two tasks per machine, 10 machines, 10 fetcher threads = 2 x 10 x 10 = 200 fetchers working concurrently in various map tasks.
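
As a quick check of that arithmetic, here is a minimal Python sketch. The function and parameter names are made up for illustration and are not part of Nutch.

```python
def total_fetcher_threads(tasks_per_machine: int, machines: int, threads_per_task: int) -> int:
    """Upper bound on fetcher threads running at once across the cluster."""
    # Each map task runs its own pool of fetcher.threads.fetch threads.
    return tasks_per_machine * machines * threads_per_task


# Two map tasks per machine, 10 machines, fetcher.threads.fetch = 10.
total = total_fetcher_threads(tasks_per_machine=2, machines=10, threads_per_task=10)
print(total)  # 200
```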


and

fetcher.threads.per.host

Number of threads that can fetch concurrently from a single hostname. All URLs from the same domain should be partitioned into a single fetcher map task. Up to threads.per.host threads will be requesting from a single host concurrently.

The crawl delay robots.txt variable, the fetcher.server.delay, and the fetcher.max.crawl.delay will also affect performance. The fetcher.max.crawl.delay is the maximum number of seconds up to which a crawl delay value will be accepted; anything over that and the page is ignored and not processed. Crawl delays which are accepted can affect performance greatly.

For example, if there is a crawl delay of 10 from the robots.txt, then each thread will wait 10 seconds before processing a page for that host. If there is no crawl delay, then fetcher.server.delay determines the number of seconds each thread will wait before processing a page request. Both of these settings are for politeness.
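
The delay rules above can be sketched roughly as follows. This is only a model of the behavior as described in this message, not Nutch's actual code; the function name and the sample values for fetcher.server.delay and fetcher.max.crawl.delay are assumptions chosen for illustration.

```python
from typing import Optional


def effective_delay(robots_crawl_delay: Optional[float],
                    server_delay: float,
                    max_crawl_delay: float) -> Optional[float]:
    """Seconds to wait between requests to a host, or None if the page is skipped."""
    if robots_crawl_delay is None:
        # No Crawl-delay in robots.txt: fall back to fetcher.server.delay.
        return server_delay
    if robots_crawl_delay > max_crawl_delay:
        # Crawl-delay exceeds fetcher.max.crawl.delay: the page is not processed.
        return None
    # Crawl-delay is accepted and used as the wait time.
    return robots_crawl_delay


# Illustrative values only: server delay 5 s, max accepted crawl delay 30 s.
print(effective_delay(10, server_delay=5.0, max_crawl_delay=30.0))    # 10
print(effective_delay(None, server_delay=5.0, max_crawl_delay=30.0))  # 5.0
print(effective_delay(60, server_delay=5.0, max_crawl_delay=30.0))    # None
```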


Does it mean that if I am only crawling one site and have

fetcher.threads.per.host = 5

fetcher.threads.fetch = 100

that we will only have 5 threads fetching from the one host and the other 95 
threads will be waiting?

Yes, if you are only crawling one site. But since URLs from the same domain are partitioned together, if you are only fetching from a single site then you would really only have a single active map task.
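
For the numbers in the question, a small Python sketch of the split between active and waiting threads within that one map task. The function name is hypothetical and the model assumes exactly one host in the fetch list.

```python
from typing import Tuple


def single_host_thread_usage(threads_fetch: int, threads_per_host: int) -> Tuple[int, int]:
    """Return (active, idle) thread counts when every URL belongs to one host."""
    # At most fetcher.threads.per.host threads may hit the host at once.
    active = min(threads_fetch, threads_per_host)
    idle = threads_fetch - active
    return active, idle


# fetcher.threads.fetch = 100, fetcher.threads.per.host = 5
print(single_host_thread_usage(threads_fetch=100, threads_per_host=5))  # (5, 95)
```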


Dennis


I couldn't find any documentation on this.  Thanks in advance,

Alex
