Another question I should have asked: how long is the crawl delay in the site's robots.txt?
If you read the fetcher.max.crawl.delay property description, it explicitly
notes that the fetcher will wait however long is required by robots.txt
before it fetches the page. Do you have this information?

Thanks

On Wed, Feb 15, 2012 at 9:08 AM, Danicela nutch <[email protected]> wrote:

> I don't think I configured such things, how can I be sure?
>
> ----- Original Message -----
> From: Lewis John Mcgibbney
> Sent: 14.02.12 19:18
> To: [email protected]
> Subject: Re: fetcher.max.crawl.delay = -1 doesn't work?
>
> Hi Danicela,
>
> Before I try this, have you configured any other overrides for generating
> or fetching in nutch-site.xml?
>
> Thanks
>
> On Tue, Feb 14, 2012 at 3:10 PM, Danicela nutch <[email protected]> wrote:
>
> > Hi,
> >
> > I have in my nutch-site.xml the value fetcher.max.crawl.delay = -1.
> >
> > When I try to fetch a site with a robots.txt with a Crawl-Delay, it
> > doesn't work.
> >
> > If I put fetcher.max.crawl.delay = 10000, it works.
> >
> > I use Nutch 1.2, but according to the changelog, nothing has been
> > changed about that since then.
> >
> > Is this a Nutch bug or did I misuse something?
> >
> > Another thing: in hadoop.log, the pages which couldn't be fetched are
> > still marked as "fetching", is this normal? Shouldn't they be marked as
> > "dropped" or something?
> >
> > Thanks.
>
> --
> *Lewis*

--
*Lewis*
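
For reference, a minimal sketch of the two pieces being discussed. The -1
value comes from the thread above; the 120-second Crawl-delay is only an
illustration, not taken from the reporter's site.

The override in conf/nutch-site.xml:

  <property>
    <name>fetcher.max.crawl.delay</name>
    <!-- Value is in seconds. Pages whose robots.txt Crawl-Delay exceeds this
         value are skipped; -1 means never skip, just wait however long
         robots.txt asks before fetching. -->
    <value>-1</value>
  </property>

A robots.txt that would trigger the reported behaviour, since 120 seconds is
above the 30-second default of fetcher.max.crawl.delay in nutch-default.xml:

  User-agent: *
  Crawl-delay: 120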

