Thank you. That means that if a website has a crawl-delay of more than
30 seconds (the value we are setting for fetcher.max.crawl.delay), then
none of its pages would ever be fetched if we use its homepage as the
seed URL and ignore external outlinks.
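To make the rule concrete, here is a minimal sketch in Python of the skip decision as described in this thread (this is not Nutch's actual code; the function name and constant are made up for illustration): if robots.txt declares a Crawl-delay larger than fetcher.max.crawl.delay, the page is simply never fetched.

```python
from typing import Optional

MAX_CRAWL_DELAY = 30  # seconds, i.e. the fetcher.max.crawl.delay setting

def should_fetch(robots_crawl_delay: Optional[float]) -> bool:
    """Return True if the page may be fetched under the max-delay rule."""
    if robots_crawl_delay is None:
        return True  # no Crawl-delay directive: fetch normally
    return robots_crawl_delay <= MAX_CRAWL_DELAY

# A host asking for a 2-day delay is skipped entirely; with its homepage
# as the only seed, none of its outlinks are ever discovered.
print(should_fetch(2 * 24 * 3600))  # False
print(should_fetch(10))             # True
```

With only the homepage as a seed, a single False here means the whole site stays out of the index, which matches the behavior observed above.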

That explains why some sites do not show up in the index.

I wonder, though, what a serious production application would do to
overcome this limitation?

Thanks again.





Sent from my HTC

----- Reply message -----
From: "Sebastian Nagel" <[email protected]>
To: <[email protected]>
Subject: Understanding Crawl-Delay
Date: Sun, Jun 1, 2014 4:15 PM

Hi,

the page is not fetched and, of course, no links will be extracted from
this page (it's impossible to find links if there is no content available).

> Now if the crawl-delay is something like two days, does it mean that all
> the 100 outlinks from the one single crawled page will not be crawled at all?
Yes. In case these 100 links point to pages on the same host, the behavior
is definitely correct. Otherwise, the crawler would be stalled.
The blocking is implemented per queue (one queue per host).
If there's only one page from that host with that long crawl-delay,
one could argue the queue shouldn't be blocked any more. However, that's
a rather artificial example: usually you have to fetch many
pages from a host.
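The per-host queue blocking described above can be sketched as a toy model (assumed names, simplified Python, not Nutch's implementation): each host gets its own queue with an "earliest next fetch" time, so a host with a very long crawl-delay blocks only its own queue while other hosts keep being fetched.

```python
from collections import deque
from urllib.parse import urlparse

class HostQueues:
    """Toy model of per-host fetch queues with per-host crawl-delays."""

    def __init__(self, crawl_delays):
        self.queues = {}            # host -> deque of URLs
        self.next_time = {}         # host -> earliest allowed fetch time
        self.delays = crawl_delays  # host -> crawl-delay in seconds

    def add(self, url):
        host = urlparse(url).netloc
        self.queues.setdefault(host, deque()).append(url)
        self.next_time.setdefault(host, 0.0)

    def next_fetchable(self, now):
        """Return one URL whose host queue is not blocked, or None."""
        for host, q in self.queues.items():
            if q and now >= self.next_time[host]:
                url = q.popleft()
                # blocking is per queue: only this host waits out its delay
                self.next_time[host] = now + self.delays.get(host, 1)
                return url
        return None

qs = HostQueues({"slow.example": 2 * 24 * 3600, "fast.example": 1})
qs.add("http://slow.example/a")
qs.add("http://slow.example/b")
qs.add("http://fast.example/x")

print(qs.next_fetchable(now=0))   # http://slow.example/a
print(qs.next_fetchable(now=0))   # http://fast.example/x (not blocked)
print(qs.next_fetchable(now=10))  # None: slow.example/b stays queued
```

The last call shows the point made above: the second page on the slow host stays blocked for the full two-day delay, while other hosts' queues are unaffected.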

Sebastian


On 06/01/2014 05:34 PM, S.L wrote:
> Hello Folks,
> 
> I know that there is a fetcher.max.crawl.delay parameter which, when set
> to a certain value in seconds, will skip fetching a particular page if the
> crawl-delay in robots.txt for that host is more than the value.
> 
> I am confused because the description of this parameter mentions that a
> particular page will not be fetched, whereas the crawl-delay applies to
> the whole website. Does it mean that all the pages will subsequently not
> be fetched by Nutch?
> 
> For example , if I have crawled page 1 , and page 1 has 100 outlinks.
> 
> Now if the crawl-delay is something like two days, does it mean that all
> the 100 outlinks from the one single crawled page will not be crawled at all?
> 
> Thanks in advance!
>
