That's probably why you should put the * in the first position in the config file (see the comment there).
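To illustrate the behaviour being discussed, here is a minimal sketch of a parser that keeps only the single highest-precedence ruleset, as the "bestRulesSoFar" variable suggests. This is a hypothetical reconstruction, not Nutch's actual code; the class and method names are invented for the example:

```java
import java.util.*;

// Hypothetical sketch of a robots.txt parser that, like the reported
// RobotRulesParser.getRobotRulesSet() behaviour, returns ONLY the ruleset
// of the highest-precedence matching agent and drops the "*" rules.
public class RobotsPrecedenceSketch {

    // Parse "User-agent:" / "Disallow:" lines into agent -> disallow paths.
    static Map<String, List<String>> parse(String robotsTxt) {
        Map<String, List<String>> rules = new LinkedHashMap<>();
        String currentAgent = null;
        for (String line : robotsTxt.split("\n")) {
            line = line.trim();
            if (line.toLowerCase().startsWith("user-agent:")) {
                currentAgent = line.substring("user-agent:".length()).trim().toLowerCase();
                rules.putIfAbsent(currentAgent, new ArrayList<>());
            } else if (line.toLowerCase().startsWith("disallow:") && currentAgent != null) {
                rules.get(currentAgent).add(line.substring("disallow:".length()).trim());
            }
        }
        return rules;
    }

    // Mirrors "bestRulesSoFar": the first (highest-precedence) agent that has
    // an entry wins outright, so "*" rules are ignored whenever a more
    // specific agent matches -- the bug described in the quoted mail.
    static List<String> bestRules(Map<String, List<String>> rules, String[] agentsByPrecedence) {
        for (String agent : agentsByPrecedence) {
            List<String> found = rules.get(agent.toLowerCase());
            if (found != null) {
                return found;
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        String robotsTxt =
            "User-agent: *\n" +
            "Disallow: /some/rule/\n" +
            "User-agent: nutch\n" +
            "Disallow: /some/other/rule/\n";
        // http.robots.agents=nutch,*
        List<String> applied = bestRules(parse(robotsTxt), new String[] {"nutch", "*"});
        // Only the "nutch" ruleset is returned; "/some/rule/" from "*" is lost.
        System.out.println(applied);
    }
}
```

With the precedence reversed (`*` first), the `*` ruleset always wins instead, which is why reordering the config is only a workaround and not a real fix for merging both rulesets.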
<description>The agent strings we'll look for in robots.txt files, comma-separated, in decreasing order of precedence. You should put the value of http.agent.name as the first agent name, and keep the default * at the end of the list. E.g.: BlurflDev,Blurfl,*</description>

On 01.08.2007 at 18:07, Michael Böckling <[EMAIL PROTECTED]> wrote:

> Hi!
>
> I experimented with the robots.txt parser component of Nutch, and it
> seems that it does not work as it should. The call to
> RobotRulesParser.getRobotRulesSet() returns only the entry with the
> highest precedence, which depends on the order of the values of the
> "http.robots.agents" configuration directive.
>
> Here's an example:
>
> robots.txt:
> User-agent: *
> Disallow: /some/rule/
> User-agent: nutch
> Disallow: /some/other/rule/
>
> configuration:
> http.robots.agents=nutch,*
>
> ==> the ruleset for "User-agent: *" is ignored
>
> Expected behaviour: the "*" rules should be applied in every case.
>
> Reason: the parser only returns "bestRulesSoFar" (the actual name of
> the variable).
>
> Is this bug known, and if so, is there a workaround or fix?
>
> Thanks for any help!
>
> Regards,
>
> Michael
>
> --
> Created with Opera's revolutionary e-mail client: http://www.opera.com/mail/

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
