Hi Renaud,

that is a separate, additional error. I'm amazed that there is no good robots.txt parser for Java. I'm currently modifying one that I found here, which is less complicated than the Nutch parser: http://www.osjava.org/norbert/
I might publish this one day.

Regards,
Michael

> -----Original Message-----
> From: Renaud Richardet [mailto:[EMAIL PROTECTED]
> Sent: Thursday, August 02, 2007 6:19 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Bug: handling of robots.txt incorrect
>
> hi Michael,
>
> is it similar to https://issues.apache.org/jira/browse/NUTCH-98 ? (I
> just type "robot" in the search field of Nutch's JIRA at
> https://issues.apache.org/jira/browse/NUTCH)
>
> HTH,
> Renaud
>
> Michael Böckling wrote:
> > Hi!
> >
> > I experimented with the robots.txt parser component of nutch, and it
> > seems that it does not work as it should. The call to
> > RobotRulesParser.getRobotRulesSet() returns only the entry with the
> > highest precedence, which is depending on the order of the values of
> > the "http.robots.agents" configuration directive.
> >
> > Here's an example:
> >
> > robots.txt:
> > User-agent: *
> > Disallow: /some/rule/
> > User-agent: nutch
> > Disallow: /some/other/rule/
> >
> > configuration:
> > http.robots.agents=nutch,*
> >
> > ==> the ruleset for "User-agent: *" is ignored
> >
> > Expected behaviour: the "*" rules should be applied in every case.
> >
> > Reason: That is because the parser only returns "bestRulesSoFar"
> > (actual name of the variable).
> >
> > Is this bug known, and if yes is there a workaround or fix?
> >
> > Thanks for any help!
> >
> > Regards,
> >
> > Michael

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
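
For reference, the robots.txt example from the original report can be reproduced with a small standalone sketch. This is not Nutch's RobotRulesParser code: the class and method names below (RobotsSketch, parse, bestMatchOnly, merged, isAllowed) are made up for illustration, and the parser is simplified so that every "User-agent" line starts its own record. It only contrasts the behaviour described in the report (first matching agent wins) with the behaviour the report expected ("*" rules always applied); which of the two is correct is the question under discussion in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Nutch.
class RobotsSketch {

    /** Maps each lower-cased user-agent to its Disallow prefixes (simplified parser). */
    static Map<String, List<String>> parse(String robotsTxt) {
        Map<String, List<String>> records = new LinkedHashMap<>();
        List<String> current = null;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.trim();
            int colon = line.indexOf(':');
            if (line.isEmpty() || line.startsWith("#") || colon < 0) {
                continue; // skip blank lines, comments and malformed lines
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Simplification: each User-agent line opens (or reopens) one record.
                current = records.computeIfAbsent(
                        value.toLowerCase(Locale.ROOT), k -> new ArrayList<>());
            } else if (field.equals("disallow") && current != null && !value.isEmpty()) {
                current.add(value);
            }
        }
        return records;
    }

    /** Reported behaviour: only the first configured agent that has a record is used. */
    static List<String> bestMatchOnly(Map<String, List<String>> records, List<String> agents) {
        for (String agent : agents) {
            List<String> rules = records.get(agent.toLowerCase(Locale.ROOT));
            if (rules != null) {
                return rules;
            }
        }
        return new ArrayList<>();
    }

    /** Expected behaviour per the report: rules of all configured agents are combined. */
    static List<String> merged(Map<String, List<String>> records, List<String> agents) {
        List<String> all = new ArrayList<>();
        for (String agent : agents) {
            List<String> rules = records.get(agent.toLowerCase(Locale.ROOT));
            if (rules != null) {
                all.addAll(rules);
            }
        }
        return all;
    }

    /** A path is allowed unless it starts with one of the disallowed prefixes. */
    static boolean isAllowed(List<String> disallowed, String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robotsTxt = String.join("\n",
                "User-agent: *",
                "Disallow: /some/rule/",
                "User-agent: nutch",
                "Disallow: /some/other/rule/");
        List<String> agents = Arrays.asList("nutch", "*"); // http.robots.agents=nutch,*
        Map<String, List<String>> records = parse(robotsTxt);

        String path = "/some/rule/page.html"; // hypothetical URL path
        System.out.println(isAllowed(bestMatchOnly(records, agents), path)); // prints true
        System.out.println(isAllowed(merged(records, agents), path));        // prints false
    }
}
```

With the agent order "nutch,*", the first variant never looks at the "*" record, so /some/rule/ stays crawlable; the second variant blocks it.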