I tried that, but it only reverses the error.

Michael
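To make the behaviour concrete, here is a minimal Python sketch of the two selection strategies under discussion. It is a simplified model, not Nutch's actual RobotRulesParser code: the function names parse_robots, best_match_only and merged_with_wildcard are hypothetical, and only User-agent and Disallow lines are handled.

```python
def parse_robots(text):
    """Map each lower-cased User-agent name to its list of Disallow paths."""
    groups = {}
    current = []      # agent names of the group currently being read
    in_rules = False  # True once the current group has seen a rule line
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                # A User-agent line after rules starts a new group.
                current, in_rules = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field == "disallow":
            in_rules = True
            if value:
                for agent in current:
                    groups[agent].append(value)
    return groups


def best_match_only(groups, agents):
    """Model of the reported behaviour: return only the rules of the
    highest-precedence agent (in configured order) found in robots.txt."""
    for agent in agents:
        if agent.lower() in groups:
            return list(groups[agent.lower()])
    return []


def merged_with_wildcard(groups, agents):
    """Model of the expected behaviour from the report: the best named
    agent's rules, plus the '*' rules in every case."""
    rules = best_match_only(groups, [a for a in agents if a != "*"])
    for path in groups.get("*", []):
        if path not in rules:
            rules.append(path)
    return rules


ROBOTS = """User-agent: *
Disallow: /some/rule/
User-agent: nutch
Disallow: /some/other/rule/
"""

groups = parse_robots(ROBOTS)
print(best_match_only(groups, ["nutch", "*"]))       # nutch rules only
print(best_match_only(groups, ["*", "nutch"]))       # '*' rules only
print(merged_with_wildcard(groups, ["nutch", "*"]))  # both rule sets
```

With http.robots.agents=nutch,* the first strategy drops the * rules; swapping the order to *,nutch drops the nutch rules instead, which is the reversal described above. The merged variant models the behaviour the original report expected.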
> -----Original Message-----
> From: Fritz Bein [mailto:[EMAIL PROTECTED]
> Sent: Thursday, August 02, 2007 2:15 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Bug: handling of robots.txt incorrect
>
> That's probably why you should put the * in first position in the
> config file (see the comment there).
>
> <description>The agent strings we'll look for in robots.txt files,
> comma-separated, in decreasing order of precedence. You should
> put the value of http.agent.name as the first agent name, and keep the
> default * at the end of the list. E.g.: BlurflDev,Blurfl,*
> </description>
>
> On 01.08.2007, 18:07, Michael Böckling <[EMAIL PROTECTED]> wrote:
>
> > Hi!
> >
> > I experimented with the robots.txt parser component of Nutch, and it
> > seems that it does not work as it should. The call to
> > RobotRulesParser.getRobotRulesSet() returns only the entry with the
> > highest precedence, which depends on the order of the values of the
> > "http.robots.agents" configuration directive.
> >
> > Here's an example:
> >
> > robots.txt:
> > User-agent: *
> > Disallow: /some/rule/
> > User-agent: nutch
> > Disallow: /some/other/rule/
> >
> > configuration:
> > http.robots.agents=nutch,*
> >
> > ==> the ruleset for "User-agent: *" is ignored
> >
> > Expected behaviour: the "*" rules should be applied in every case.
> >
> > Reason: the parser only returns "bestRulesSoFar" (actual name of the
> > variable).
> >
> > Is this bug known, and if so, is there a workaround or fix?
> >
> > Thanks for any help!
> >
> > Regards,
> >
> > Michael
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
