That's probably why you should put the * in first position in the
config file (see the comment there):

   <description>The agent strings we'll look for in robots.txt files,
   comma-separated, in decreasing order of precedence. You should
   put the value of http.agent.name as the first agent name, and keep the
   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
   </description>
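
To make the "only one rule set wins" behaviour concrete, here is a minimal Python sketch. It is not Nutch's actual Java code: the function names parse_groups and best_rules are made up for illustration, and it assumes exact, case-insensitive agent-name matching and only looks at Disallow lines. It just mimics the idea of returning the rules of the first configured agent that has an entry in robots.txt.

```python
def parse_groups(robots_txt):
    """Map each lower-cased User-agent name to its list of Disallow paths."""
    groups = {}
    current = []      # agent names of the group currently being read
    in_rules = False  # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                # A User-agent line after rules starts a new group.
                current, in_rules = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field == "disallow":
            in_rules = True
            if value:  # an empty Disallow means "nothing disallowed"
                for agent in current:
                    groups[agent].append(value)
    return groups


def best_rules(robots_txt, agents):
    """Return the rules of the first agent in `agents` found in robots.txt.

    `agents` is a comma-separated list in decreasing order of precedence,
    in the style of http.robots.agents. Rule sets are NOT merged.
    """
    groups = parse_groups(robots_txt)
    for name in (a.strip().lower() for a in agents.split(",")):
        if name in groups:
            return groups[name]
    return []


ROBOTS = """\
User-agent: *
Disallow: /some/rule/
User-agent: nutch
Disallow: /some/other/rule/
"""

print(best_rules(ROBOTS, "nutch,*"))  # ['/some/other/rule/']
print(best_rules(ROBOTS, "*,nutch"))  # ['/some/rule/']
```

In this sketch the order of the agent list alone decides which group is used: with "nutch,*" the "*" rules are dropped, and with "*,nutch" the nutch-specific rules are dropped instead.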

On 01.08.2007, at 18:07, Michael Böckling
<[EMAIL PROTECTED]> wrote:

> Hi!
>
> I experimented with the robots.txt parser component of Nutch, and it
> seems that it does not work as it should. The call to
> RobotRulesParser.getRobotRulesSet() returns only the entry with the
> highest precedence, which depends on the order of the values of the
> "http.robots.agents" configuration directive.
>
> Here's an example:
>
>
> robots.txt:
> User-agent: *
> Disallow: /some/rule/
> User-agent: nutch
> Disallow: /some/other/rule/
>
> configuration:
> http.robots.agents=nutch,*
>
> ==> the ruleset for "User-agent: *" is ignored
>
>
> Expected behaviour: the "*" rules should be applied in every case.
>
> Reason: That is because the parser only returns "bestRulesSoFar" (actual  
> name of the variable).
>
>
> Is this bug known, and if yes is there a workaround or fix?
>
> Thanks for any help!
>
> Regards,
>
> Michael
>
>



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
