I tried debugging your problem but it doesn't seem to exist. I fixed Nutch's RobotRulesParser test [1], but I cannot confirm that URLs are disallowed when there is NO value for Disallow: in the robots.txt file.

[1] https://issues.apache.org/jira/browse/NUTCH-1408

Test with:

$ bin/nutch plugin lib-http org.apache.nutch.protocol.http.api.RobotRulesParser robots.txt urlfile spider
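For illustration, here is a minimal sketch of the two input files for the command above, assuming the test main reads the rules from robots.txt and then checks each URL listed in urlfile (one per line) against them for the given agent name ("spider"); the file contents and URLs below are made up for this example:

robots.txt (the pattern from the report below, trimmed to the relevant lines):

    User-agent: *
    Disallow: /images/
    Disallow:

urlfile:

    http://www.example.com/
    http://www.example.com/images/logo.png
    http://www.example.com/index.html

If the parser handles the empty Disallow: correctly, only the /images/ URL should be reported as forbidden; the bare Disallow: line should not block anything.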
-----Original message-----
> From: Markus Jelsma <markus.jel...@openindex.io>
> Sent: Thu 21-Jun-2012 00:47
> To: nutch-u...@lucene.apache.org
> Subject: RE: robots.txt, disallow: with empty string
>
> If you're sure Nutch treats an empty string the same as / then please file an
> issue in Jira so we can track and fix it.
>
> Thanks
>
> -----Original message-----
> > From: Magnús Skúlason <magg...@gmail.com>
> > Sent: Wed 20-Jun-2012 18:36
> > To: nutch-u...@lucene.apache.org
> > Subject: robots.txt, disallow: with empty string
> >
> > Hi,
> >
> > I have noticed that my Nutch crawler skips many sites with robots.txt
> > files that look something like this:
> >
> > User-agent: *
> > Disallow: /administrator/
> > Disallow: /classes/
> > Disallow: /components/
> > Disallow: /editor/
> > Disallow: /images/
> > Disallow: /includes/
> > Disallow: /language/
> > Disallow: /mambots/
> > Disallow: /media/
> > Disallow: /modules/
> > Disallow: /templates/
> > Disallow: /uploadfiles/
> > Disallow:
> >
> > That is, the last line is "Disallow:". Is Nutch treating this as if it
> > should disallow all paths? I really don't think that is what the
> > webmasters intend; it is probably an auto-generation error in some web
> > systems. If it were their intention to ban all crawlers, it would be
> > more straightforward to put "Disallow: /" and skip all the prior rules.
> >
> > Best regards,
> > Magnus