Hi! I experimented with the robots.txt parser component of Nutch, and it does not seem to work as it should. The call to RobotRulesParser.getRobotRulesSet() returns only the entry with the highest precedence, which depends on the order of the values in the "http.robots.agents" configuration directive.
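To make the behaviour concrete, here is a minimal Java sketch of the selection logic as I understand it. It is not the Nutch source: the class RobotRulesSketch and the methods parse, selectBest and selectWithWildcard are hypothetical names made up for illustration, and the parsing is heavily simplified (only User-agent and Disallow lines are handled). selectBest mimics the "keep only the best match" behaviour, while selectWithWildcard shows the result I would have expected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, NOT the Nutch implementation; all names are assumptions.
class RobotRulesSketch {

    // Parse robots.txt text into: lower-cased agent name -> list of Disallow prefixes.
    static Map<String, List<String>> parse(String robotsTxt) {
        Map<String, List<String>> rulesByAgent = new LinkedHashMap<>();
        List<String> currentAgents = new ArrayList<>();
        boolean lastLineWasAgent = false;
        for (String rawLine : robotsTxt.split("\n")) {
            String line = rawLine.trim();
            int colon = line.indexOf(':');
            if (line.isEmpty() || line.startsWith("#") || colon < 0) {
                continue; // skip blank lines, comments and malformed lines
            }
            String key = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (key.equals("user-agent")) {
                if (!lastLineWasAgent) {
                    currentAgents.clear(); // a new group of rules begins
                }
                String agent = value.toLowerCase();
                currentAgents.add(agent);
                rulesByAgent.computeIfAbsent(agent, a -> new ArrayList<>());
                lastLineWasAgent = true;
            } else {
                if (key.equals("disallow")) {
                    for (String agent : currentAgents) {
                        rulesByAgent.get(agent).add(value);
                    }
                }
                lastLineWasAgent = false;
            }
        }
        return rulesByAgent;
    }

    // Observed behaviour: only the rules of the agent that comes first in the
    // configured agent list are returned; every other group is dropped.
    static List<String> selectBest(Map<String, List<String>> rulesByAgent, List<String> agents) {
        List<String> bestRulesSoFar = new ArrayList<>();
        int bestPrecedenceSoFar = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String>> entry : rulesByAgent.entrySet()) {
            int precedence = agents.indexOf(entry.getKey());
            if (precedence >= 0 && precedence < bestPrecedenceSoFar) {
                bestPrecedenceSoFar = precedence;
                bestRulesSoFar = entry.getValue();
            }
        }
        return bestRulesSoFar;
    }

    // Behaviour I expected: the best match plus whatever is listed under "*".
    static List<String> selectWithWildcard(Map<String, List<String>> rulesByAgent, List<String> agents) {
        List<String> merged = new ArrayList<>(selectBest(rulesByAgent, agents));
        for (String rule : rulesByAgent.getOrDefault("*", new ArrayList<>())) {
            if (!merged.contains(rule)) {
                merged.add(rule);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        String robotsTxt = "User-agent: *\nDisallow: /some/rule/\n\n"
                + "User-agent: nutch\nDisallow: /some/other/rule/\n";
        List<String> agents = Arrays.asList("nutch", "*"); // http.robots.agents=nutch,*
        Map<String, List<String>> parsed = parse(robotsTxt);
        System.out.println(selectBest(parsed, agents));         // [/some/other/rule/]
        System.out.println(selectWithWildcard(parsed, agents)); // [/some/other/rule/, /some/rule/]
    }
}
```

With "nutch" listed before "*", the first call drops /some/rule/ entirely, which matches what I see in Nutch.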
Here's an example:

robots.txt:

    User-agent: *
    Disallow: /some/rule/

    User-agent: nutch
    Disallow: /some/other/rule/

configuration:

    http.robots.agents=nutch,*

==> the ruleset for "User-agent: *" is ignored

Expected behaviour: the "*" rules should be applied in every case.

Reason: the parser only returns "bestRulesSoFar" (the actual name of the variable).

Is this bug known, and if so, is there a workaround or fix?

Thanks for any help!

Regards,
Michael

--
Michael Böckling
Java Engineer
dmc digital media center GmbH

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
