Hi Renaud,

That is an additional error. I'm amazed that there is no good robots.txt parser 
for Java. I'm currently modifying one that I found here, which is less 
complicated than the Nutch parser: http://www.osjava.org/norbert/ 

I might publish this one day.
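
To make the problem from my original mail concrete, here is a minimal sketch 
of the two behaviours. It is not the Nutch code and not the norbert code; the 
class RobotsSketch and the methods parse, bestMatchOnly, merged and isAllowed 
are names I made up for illustration, and only Disallow lines are handled. 
bestMatchOnly mimics what I observed (only the highest-precedence group is 
returned), merged is what I expected (the "*" rules are applied as well).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class RobotsSketch {

    /** Parses robots.txt text into a map: lower-cased agent name -> Disallow prefixes. */
    public static Map<String, List<String>> parse(String robotsTxt) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        List<String> currentAgents = new ArrayList<>();
        boolean lastWasAgent = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            int hash = raw.indexOf('#');                  // strip comments
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) continue;                      // blank or malformed line
            String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (key.equals("user-agent")) {
                // Consecutive User-agent lines share one group; otherwise start a new one.
                if (!lastWasAgent) currentAgents = new ArrayList<>();
                String agent = value.toLowerCase(Locale.ROOT);
                currentAgents.add(agent);
                groups.computeIfAbsent(agent, k -> new ArrayList<>());
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                // An empty Disallow value means "nothing is disallowed", so skip it.
                if (key.equals("disallow") && !value.isEmpty()) {
                    for (String agent : currentAgents) groups.get(agent).add(value);
                }
            }
        }
        return groups;
    }

    /** Observed behaviour: rules of the first configured agent that has a group. */
    public static List<String> bestMatchOnly(Map<String, List<String>> groups, List<String> agents) {
        for (String agent : agents) {
            List<String> rules = groups.get(agent.toLowerCase(Locale.ROOT));
            if (rules != null) return rules;
        }
        return new ArrayList<>();
    }

    /** Expected behaviour: union of the rules of every configured agent, "*" included. */
    public static List<String> merged(Map<String, List<String>> groups, List<String> agents) {
        List<String> all = new ArrayList<>();
        for (String agent : agents) {
            List<String> rules = groups.get(agent.toLowerCase(Locale.ROOT));
            if (rules == null) continue;
            for (String rule : rules) {
                if (!all.contains(rule)) all.add(rule);
            }
        }
        return all;
    }

    /** A path is allowed if it starts with none of the disallowed prefixes. */
    public static boolean isAllowed(List<String> disallowed, String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String robotsTxt = "User-agent: *\n"
                + "Disallow: /some/rule/\n"
                + "User-agent: nutch\n"
                + "Disallow: /some/other/rule/\n";
        Map<String, List<String>> groups = parse(robotsTxt);
        List<String> agents = Arrays.asList("nutch", "*");   // http.robots.agents=nutch,*

        // Only the "nutch" group is consulted, so the "*" rule is lost: prints true
        System.out.println(isAllowed(bestMatchOnly(groups, agents), "/some/rule/page.html"));
        // Both groups are applied: prints false
        System.out.println(isAllowed(merged(groups, agents), "/some/rule/page.html"));
    }
}
```

With the robots.txt and the agent order from my example, the first call lets 
/some/rule/page.html through and the second one blocks it.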


Regards,

Michael


> -----Original Message-----
> From: Renaud Richardet [mailto:[EMAIL PROTECTED]
> Sent: Thursday, August 02, 2007 6:19 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Bug: handling of robots.txt incorrect
> 
> 
> hi Michael,
> 
> is it similar to https://issues.apache.org/jira/browse/NUTCH-98 ? (I 
> just typed "robot" in the search field of Nutch's JIRA at 
> https://issues.apache.org/jira/browse/NUTCH)
> 
> HTH,
> Renaud
> 
> 
> Michael Böckling wrote:
> > Hi!
> >
> > I experimented with the robots.txt parser component of Nutch, and it
> > seems that it does not work as it should. The call to
> > RobotRulesParser.getRobotRulesSet() returns only the entry with the
> > highest precedence, which depends on the order of the values of the
> > "http.robots.agents" configuration directive.
> >
> > Here's an example:
> >
> >
> > robots.txt:
> > User-agent: * 
> > Disallow: /some/rule/
> > User-agent: nutch
> > Disallow: /some/other/rule/
> >
> > configuration:
> > http.robots.agents=nutch,*
> >
> > ==> the ruleset for "User-agent: *" is ignored
> >
> >
> > Expected behaviour: the "*" rules should be applied in every case.
> >
> > Reason: the parser only returns "bestRulesSoFar" (actual name of the
> > variable).
> >
> >
> > Is this bug known, and if yes is there a workaround or fix? 
> >
> > Thanks for any help!
> >
> > Regards,
> >
> > Michael
> >
> >
> >   
> 

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
