Hi!

I experimented with the robots.txt parser component of Nutch, and it does not 
seem to work as it should. The call to RobotRulesParser.getRobotRulesSet() 
returns only the entry with the highest precedence, which depends on the 
order of the values in the "http.robots.agents" configuration directive.

Here's an example:


robots.txt:
User-agent: * 
Disallow: /some/rule/
User-agent: nutch
Disallow: /some/other/rule/

configuration:
http.robots.agents=nutch,*

==> the ruleset for "User-agent: *" is ignored


Expected behaviour: the "*" rules should be applied in every case.

Reason: the parser returns only "bestRulesSoFar" (the actual name of the 
variable), so the rule sets of all lower-precedence agents are dropped.
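
To make the difference concrete, here is a minimal, self-contained sketch of the two behaviours. It is not the Nutch code: the class RobotsPrecedenceSketch and the methods parse, bestOnly and bestPlusWildcard are hypothetical names invented for this illustration, and the parsing is heavily simplified (only "User-agent" and "Disallow" lines, one agent per group). It only assumes that precedence follows the order of the agent list, as described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; this is NOT the Nutch RobotRulesParser.
final class RobotsPrecedenceSketch {

    // Simplified parse: maps each lower-cased user agent to its Disallow paths.
    static Map<String, List<String>> parse(String robotsTxt) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        String current = null;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "key: value" line
            }
            String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (key.equals("user-agent")) {
                current = value.toLowerCase(Locale.ROOT);
                groups.computeIfAbsent(current, k -> new ArrayList<>());
            } else if (key.equals("disallow") && current != null && !value.isEmpty()) {
                groups.get(current).add(value);
            }
        }
        return groups;
    }

    // Observed behaviour: only the first agent in the configured order that
    // has a group in robots.txt contributes rules; everything else is dropped.
    static List<String> bestOnly(Map<String, List<String>> groups, List<String> agents) {
        for (String agent : agents) {
            List<String> rules = groups.get(agent.toLowerCase(Locale.ROOT));
            if (rules != null) {
                return rules;
            }
        }
        return new ArrayList<>();
    }

    // Behaviour expected in this report: the best match plus the "*" rules.
    static List<String> bestPlusWildcard(Map<String, List<String>> groups, List<String> agents) {
        List<String> merged = new ArrayList<>(bestOnly(groups, agents));
        for (String rule : groups.getOrDefault("*", Collections.<String>emptyList())) {
            if (!merged.contains(rule)) {
                merged.add(rule);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        String robotsTxt = "User-agent: *\nDisallow: /some/rule/\n"
                + "User-agent: nutch\nDisallow: /some/other/rule/\n";
        Map<String, List<String>> groups = parse(robotsTxt);
        List<String> agents = Arrays.asList("nutch", "*"); // http.robots.agents=nutch,*
        System.out.println("best only:     " + bestOnly(groups, agents));
        System.out.println("best plus '*': " + bestPlusWildcard(groups, agents));
    }
}
```

With the robots.txt and configuration from the example, bestOnly yields only /some/other/rule/, while bestPlusWildcard also contains /some/rule/.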


Is this bug known, and if so, is there a workaround or fix?

Thanks for any help!

Regards,

Michael


-- 
Michael Böckling
Java Engineer
dmc digital media center GmbH 
Rommelstraße 11 
70376 Stuttgart (Germany) 
Phone: +49 711 601747-0
Fax: +49 711 601747-141
E-Mail: [EMAIL PROTECTED] 
Internet: www.dmc.de 

Commercial register: AG Stuttgart HRB 18974
Managing directors: Andreas Magg, Daniel Rebhorn, Andreas Schwend
