Hi,

I've experimented with Nutch for crawling Tor hidden services, and I still run into an annoying issue that requires a patched Nutch version: #NUTCH-693 [1].

This issue is a request for an option to control the behaviour of Nutch when it encounters a rel="nofollow" link. Currently, Nutch always ignores such links, and there is no way of configuring this behaviour without patching it.

The issue was closed with little discussion, on the claim that such an option would be the same as a hypothetical "ignore.robotstxt" option. This is not the case. robots.txt is the way for webmasters to prevent crawlers from accessing certain URLs. This is *not* the job of nofollow. robots.txt is always controlled by the webmaster and, as such, it makes sense to strictly honour it. On the other hand, nofollow is always controlled by third parties (otherwise, robots.txt should be used), and its well-established use is to indicate non-endorsement of a URL. That is, in practice, to avoid giving link-juice to potential spammers. nofollow is not meant to be an access control mechanism. nofollow is not meant to protect websites from crawler abuse either. That is robots.txt's job. So there is no point in treating them as the same.

Now, there are very real use cases for following links with the rel="nofollow" attribute. In a loosely connected portion of the web, following these links might be the only sane way to crawl successfully. The Tor deepweb is a very clear case. There is a site which is very central in the Tor link-graph: The Hidden Wiki. It is a great seed for crawling Tor. But it is MediaWiki-based, and that means that every external link is tagged as rel="nofollow". Finding enough good seed URLs to crawl Tor without going through rel="nofollow" links is not trivial at all. The same might happen when crawling corporate intranets, I2P or other networks.

So there is a clear use case for adding an option to follow rel="nofollow" links. And, as far as I know, there is no reason not to add it. That is why I would like this to be discussed and, if deemed sensible, #NUTCH-693 reopened.

[1] https://issues.apache.org/jira/browse/NUTCH-693

Best,
-- 
Santiago M. Mola
Jabber ID: cooldw...@gmail.com