I tried that, but it only reverses the error.
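
A minimal Python sketch of the two behaviours discussed in this thread, using the robots.txt example quoted below. This is not Nutch's Java code: `parse_groups`, `best_match_rules` and `merged_rules` are hypothetical names, and the parser is deliberately simplified. It shows that returning only the highest-precedence block drops one ruleset whichever way the agents are ordered, while merging in the `*` block keeps both.

```python
def parse_groups(robots_txt):
    """Map each lower-cased User-agent value to its list of Disallow paths.

    Simplified on purpose: every User-agent line starts its own group,
    and only Disallow lines are collected.
    """
    groups = {}
    current = None
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            current = groups.setdefault(value.lower(), [])
        elif field == "disallow" and current is not None and value:
            current.append(value)
    return groups


def best_match_rules(groups, agents):
    """Return only the rules of the first configured agent that has a block.

    This mirrors the reported behaviour: one block wins, the rest is dropped.
    """
    for agent in agents:
        if agent.lower() in groups:
            return list(groups[agent.lower()])
    return []


def merged_rules(groups, agents):
    """Return the best named-agent block plus the '*' rules.

    This is the behaviour expected in the original report.
    """
    rules = best_match_rules(groups, [a for a in agents if a != "*"])
    for path in groups.get("*", []):
        if path not in rules:
            rules.append(path)
    return rules


ROBOTS_TXT = """\
User-agent: *
Disallow: /some/rule/
User-agent: nutch
Disallow: /some/other/rule/
"""

groups = parse_groups(ROBOTS_TXT)
print(best_match_rules(groups, ["nutch", "*"]))  # ['/some/other/rule/']
print(best_match_rules(groups, ["*", "nutch"]))  # ['/some/rule/']
print(merged_rules(groups, ["nutch", "*"]))      # ['/some/other/rule/', '/some/rule/']
```

Swapping the agent order only changes which block is lost, which is what "reverses the error" refers to.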

Michael



> -----Original Message-----
> From: Fritz Bein [mailto:[EMAIL PROTECTED]
> Sent: Thursday, August 02, 2007 2:15 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Bug: handling of robots.txt incorrect
> 
> 
> That's probably why you should put the * in the first position in the
> config file (see the comment there).
> 
>    <description>The agent strings we'll look for in robots.txt files,
>    comma-separated, in decreasing order of precedence. You should
>    put the value of http.agent.name as the first agent name, and keep the
>    default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>    </description>
> 
> On 01.08.2007 at 18:07, Michael Böckling <[EMAIL PROTECTED]> wrote:
> 
> > Hi!
> >
> > I experimented with the robots.txt parser component of Nutch, and it
> > seems that it does not work as it should. The call to
> > RobotRulesParser.getRobotRulesSet() returns only the entry with the
> > highest precedence, which depends on the order of the values of the
> > "http.robots.agents" configuration directive.
> >
> > Here's an example:
> >
> >
> > robots.txt:
> > User-agent: *
> > Disallow: /some/rule/
> > User-agent: nutch
> > Disallow: /some/other/rule/
> >
> > configuration:
> > http.robots.agents=nutch,*
> >
> > ==> the ruleset for "User-agent: *" is ignored
> >
> >
> > Expected behaviour: the "*" rules should be applied in every case.
> >
> > Reason: the parser only returns "bestRulesSoFar" (the actual name of
> > the variable).
> >
> >
> > Is this bug known, and if so, is there a workaround or fix?
> >
> > Thanks for any help!
> >
> > Regards,
> >
> > Michael
> >
> >
> 
> 
> 

Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
