Hello, there. I believe I may have found an infinite loop in Nutch 0.9.
It happens when a site has a page that refers to itself through a redirection. The code in Fetcher.run(), around line 200 (sorry, my Fetcher has been modified a little, so line numbers may vary), says, for that case:

output(url, new CrawlDatum(), null, null, CrawlDatum.STATUS_LINKED);

What that does is insert an extra (empty) crawl datum for the new url, with a re-fetch interval of 0.0. However (see Generator.Selector.map(), particularly lines 144-145), the non-refetch condition used seems to be last-fetch+refetch-interval>now, which is always false if refetch-interval==0.0, since last-fetch is never later than now! Now, if there is a new link to the new url in that page, that crawl datum is re-used, and the whole thing loops indefinitely.

I've fixed that for myself by replacing the quoted line (it occurs twice) with:

output(url, new CrawlDatum(CrawlDatum.STATUS_LINKED, 30f), null, null, CrawlDatum.STATUS_LINKED);

and that works. (Btw, the 30f should really be the value of "db.default.fetch.interval", but I haven't the time right now to work out the issues.) In reality, the default constructor and the appropriate updater method should, if I am right in analysing the algorithm, always enforce a positive refetch interval.

Of course, another method could be used to remove this self-reference, but that could be complicated, as the self-reference may happen through a loop (2 or more pages, etc.; you know what I mean).

Has that been fixed already, and by what method?

Best regards
George Herlin
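
P.S. To make the condition above concrete, here is a minimal standalone sketch. It is not the actual Nutch code: the class RefetchCheck and both of its methods are hypothetical names, and it assumes the refetch interval is expressed in days. It only models the last-fetch+refetch-interval>now test and a guard that forces a positive interval.

```java
public class RefetchCheck {
    static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    // Hypothetical stand-in for the non-refetch test described above:
    // a url is skipped only while lastFetch + interval is still in the future.
    static boolean isSkipped(long lastFetchMillis, float intervalDays, long nowMillis) {
        long nextFetch = lastFetchMillis + (long) (intervalDays * MILLIS_PER_DAY);
        return nextFetch > nowMillis;
    }

    // Hypothetical guard: fall back to a default when the interval is not positive.
    static float positiveInterval(float intervalDays, float defaultDays) {
        return intervalDays > 0f ? intervalDays : defaultDays;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long justFetched = now - 1000; // fetched one second ago

        // Interval 0.0: never skipped, so the url is selected on every round.
        System.out.println(isSkipped(justFetched, 0f, now));

        // Interval forced to the 30-day default: skipped as expected.
        System.out.println(isSkipped(justFetched, positiveInterval(0f, 30f), now));
    }
}
```

With an interval of 0.0 the first call reports that the url is not skipped even though it was fetched a second ago; with the guard applied the second call reports that it is.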