Hi Michael,

Is it similar to https://issues.apache.org/jira/browse/NUTCH-98 ? (I 
just typed "robot" into the search field of Nutch's JIRA at 
https://issues.apache.org/jira/browse/NUTCH)

HTH,
Renaud


Michael Böckling wrote:
> Hi!
>
> I experimented with the robots.txt parser component of Nutch, and it seems 
> that it does not work as it should. The call to 
> RobotRulesParser.getRobotRulesSet() returns only the entry with the highest 
> precedence, which depends on the order of the values in the 
> "http.robots.agents" configuration directive.
>
> Here's an example:
>
>
> robots.txt:
> User-agent: * 
> Disallow: /some/rule/
> User-agent: nutch
> Disallow: /some/other/rule/
>
> configuration:
> http.robots.agents=nutch,*
>
> ==> the ruleset for "User-agent: *" is ignored
>
>
> Expected behaviour: the "*" rules should be applied in every case.
>
> Reason: the parser only returns "bestRulesSoFar" (the actual name 
> of the variable).
>
>
> Is this a known bug, and if so, is there a workaround or fix?
>
> Thanks for any help!
>
> Regards,
>
> Michael
>
>
>   
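
For reference, the behaviour described in the quoted message can be reproduced with a small standalone sketch. This is Python, not Nutch's Java code, and the function names (parse_robots, best_rules, merged_rules) are hypothetical helpers made up for illustration; only the robots.txt content and the "nutch,*" agent order come from the report. best_rules mimics returning only the highest-precedence entry, while merged_rules shows the result Michael expected.

```python
def parse_robots(text):
    """Map each lower-cased User-agent name to its list of Disallow paths."""
    rules = {}
    current_agents = []
    last_was_agent = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            # Consecutive User-agent lines share one record; otherwise start anew.
            if not last_was_agent:
                current_agents = []
            current_agents.append(value.lower())
            rules.setdefault(value.lower(), [])
            last_was_agent = True
        elif field == "disallow":
            for agent in current_agents:
                if value:  # an empty Disallow means "allow everything"
                    rules[agent].append(value)
            last_was_agent = False
    return rules


def best_rules(rules, agents):
    """Return only the rules of the first configured agent that has a record."""
    for agent in agents:
        if agent in rules:
            return rules[agent]
    return []


def merged_rules(rules, agents):
    """Return the union of the rules of every configured agent, in order."""
    merged = []
    for agent in agents:
        for path in rules.get(agent, []):
            if path not in merged:
                merged.append(path)
    return merged


# robots.txt and agent order taken from the report above.
ROBOTS_TXT = """User-agent: *
Disallow: /some/rule/
User-agent: nutch
Disallow: /some/other/rule/
"""
AGENTS = "nutch,*".split(",")

if __name__ == "__main__":
    parsed = parse_robots(ROBOTS_TXT)
    print(best_rules(parsed, AGENTS))    # only the "nutch" record
    print(merged_rules(parsed, AGENTS))  # "nutch" record plus the "*" record
```

With the agent order "nutch,*", best_rules yields only /some/other/rule/, which matches the reported observation that the "User-agent: *" ruleset is ignored.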

