Folks,

What is the reason for robots exclusion processing being handled at the protocol level rather than at the droid level? Presently HttpProtocol attempts to retrieve robots.txt for _each_ and _every_ URI it processes and then discards the robots.txt rules when finished. This does not sound right to me. Am I missing something?
I can't help thinking that robots.txt processing belongs at the droid level. CrawlingDroid should retrieve robots.txt once at the beginning of the run and then re-use it for all subsequent requests within the same URI space. It should maintain a notion of a session and cache robots.txt rules for all URIs outside the initial URI space for the same run. At the same time, HttpProtocol should remain stateless (it should not maintain any state information that could interfere with individual sessions).

What do you think?

Oleg
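For illustration, the kind of per-run cache I have in mind might look something like this. This is only a sketch: RobotsCache and RobotRules are hypothetical names, not existing Droids classes, and the actual rule parsing is stubbed out. The point is that the cache is keyed by URI authority and owned by the droid, so each host's robots.txt is fetched at most once per run while HttpProtocol itself carries no state:

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical per-run robots.txt cache, keyed by URI authority.
// The droid would own an instance of this; HttpProtocol stays stateless.
public class RobotsCache {

    // Placeholder for parsed robots.txt rules (real parsing omitted).
    public static final class RobotRules {
        private final boolean allowAll;
        public RobotRules(boolean allowAll) { this.allowAll = allowAll; }
        public boolean isAllowed(URI uri) { return allowAll; }
    }

    private final Map<String, RobotRules> rulesByHost = new ConcurrentHashMap<>();
    private final Function<String, RobotRules> fetcher;

    // fetcher retrieves and parses robots.txt for a given authority;
    // it is invoked at most once per authority for the life of the cache.
    public RobotsCache(Function<String, RobotRules> fetcher) {
        this.fetcher = fetcher;
    }

    // Fetch robots.txt once per authority, then reuse the cached rules
    // for every subsequent URI on the same host within this run.
    public boolean isAllowed(URI uri) {
        RobotRules rules = rulesByHost.computeIfAbsent(uri.getAuthority(), fetcher);
        return rules.isAllowed(uri);
    }

    public int cachedHosts() { return rulesByHost.size(); }
}
```

With something like this, the droid checks isAllowed() before dispatching a task, and only the first URI on a given host triggers a robots.txt retrieval.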
