Hi,

the page is not fetched and, of course, no links will be extracted from
it (links cannot be extracted if no content is available).

> Now if the crawl-delay is something like two days, does it mean that all
> the 100 outlinks from the one single crawled page will not be crawled at all?
Yes. If these 100 links point to pages on the same host, the behavior
is definitely correct. Otherwise, the crawler would stall.
The blocking is implemented per queue (one queue per host).
If there's only one page from that host with that long crawl-delay,
one could argue the queue shouldn't be blocked any more. However, that's
a rather artificial example: usually you have to fetch many
pages from a host.
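To illustrate the mechanism described above, here is a minimal sketch (not actual Nutch code; class and variable names are made up) of per-host queues where a long crawl-delay blocks only that host's queue, and a host whose crawl-delay exceeds the configured maximum is skipped entirely:

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical value standing in for fetcher.max.crawl.delay (seconds).
MAX_CRAWL_DELAY = 30

class HostQueues:
    """Sketch: one queue per host, blocked per-host by crawl-delay."""

    def __init__(self):
        self.queues = defaultdict(list)            # host -> pending URLs
        self.next_fetch_time = defaultdict(float)  # host -> earliest fetch time

    def add(self, url, crawl_delay):
        # If the host's crawl-delay exceeds the maximum, the URL is
        # skipped and will never be fetched.
        if crawl_delay > MAX_CRAWL_DELAY:
            return False
        host = urlparse(url).netloc
        self.queues[host].append((url, crawl_delay))
        return True

    def poll(self, now):
        """Return one fetchable URL, or None if all queues are blocked."""
        for host, queue in self.queues.items():
            if queue and now >= self.next_fetch_time[host]:
                url, delay = queue.pop(0)
                # Block only this host's queue until the delay has passed;
                # queues for other hosts are unaffected.
                self.next_fetch_time[host] = now + delay
                return url
        return None
```

The point of the per-host design is visible in poll(): a two-day crawl-delay on one host never stalls fetching from the others.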

Sebastian


On 06/01/2014 05:34 PM, S.L wrote:
> Hello Folks,
> 
> I know that there is a fetcher.max.crawl.delay parameter which, when set to
> a certain value in seconds, will skip fetching a particular page if the
> crawl-delay in robots.txt for that host is greater than that value.
> 
> I am confused because the description of this parameter says that
> a particular page will not be fetched, whereas the crawl-delay applies to
> the whole website. Does it mean that none of the pages will subsequently
> be fetched by Nutch?
> 
> For example, if I have crawled page 1, and page 1 has 100 outlinks.
> 
> Now if the crawl-delay is something like two days, does it mean that all
> the 100 outlinks from the one single crawled page will not be crawled at all?
> 
> Thanks in advance!
> 
