I tried debugging your problem but it doesn't seem to exist. I fixed Nutch's RobotRulesParser test [1], but I cannot confirm that URLs are disallowed when there is NO value for Disallow: in the robots.txt file.

[1] https://issues.apache.org/jira/browse/NUTCH-1408

Test with:

$ bin/nutch plugin lib-http org.apache.nutch.protocol.http.api.RobotRulesParser robots.txt urlfile spider
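For illustration, here is a minimal sketch of the two input files for the command above, assuming the test main reads the rules from robots.txt and then checks each URL listed in urlfile (one per line) against them for the given agent name ("spider"); the file contents and URLs below are made up for this example:

robots.txt (the pattern from the report below, trimmed to the relevant lines):

    User-agent: *
    Disallow: /images/
    Disallow:

urlfile:

    http://www.example.com/
    http://www.example.com/images/logo.png
    http://www.example.com/index.html

If the parser handles the empty Disallow: correctly, only the /images/ URL should be reported as forbidden; the bare Disallow: line should not block anything.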
-----Original message-----
> From: Markus Jelsma <markus.jel...@openindex.io>
> Sent: Thu 21-Jun-2012 00:47
> To: nutch-u...@lucene.apache.org
> Subject: RE: robots.txt, disallow: with empty string
>
> If you're sure Nutch treats an empty string the same as / then please file an
> issue in Jira so we can track and fix it.
>
> Thanks
>
> -----Original message-----
> > From: Magnús Skúlason <magg...@gmail.com>
> > Sent: Wed 20-Jun-2012 18:36
> > To: nutch-u...@lucene.apache.org
> > Subject: robots.txt, disallow: with empty string
> >
> > Hi,
> >
> > I have noticed that my Nutch crawler skips many sites with robots.txt
> > files that look something like this:
> >
> > User-agent: *
> > Disallow: /administrator/
> > Disallow: /classes/
> > Disallow: /components/
> > Disallow: /editor/
> > Disallow: /images/
> > Disallow: /includes/
> > Disallow: /language/
> > Disallow: /mambots/
> > Disallow: /media/
> > Disallow: /modules/
> > Disallow: /templates/
> > Disallow: /uploadfiles/
> > Disallow:
> >
> > That is, the last line is "Disallow:". Is Nutch treating this as if it
> > should disallow all paths? I really don't think that is what the
> > webmasters intend; it is probably an auto-generation error in some web
> > systems. If it were their intention to ban all crawlers, it would be
> > more straightforward to put "Disallow: /" and skip all the prior rules.
> >
> > Best regards,
> > Magnus