Folks,

What is the reason for robots exclusion processing being handled at the
protocol level rather than at the droid level? Presently HttpProtocol
attempts to retrieve robots.txt for _each_ and _every_ URI it
processes and then discards the robots.txt rules when finished. This does
not sound right to me. Am I missing something?

I can't help thinking robots.txt processing belongs at the Droid level.
CrawlingDroid should retrieve robots.txt once at the beginning of the run
and then re-use it for all subsequent requests in the same URI space.
It should maintain a notion of a session and cache robots.txt rules for
all URIs outside the initial URI space for the same run. At the same
time HttpProtocol should remain stateless (it should not maintain any
state information that could interfere with individual sessions).
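To make the idea concrete, here is a minimal sketch of what such a droid-level cache could look like. All names here (RobotsCache, the injected fetcher) are hypothetical, not actual Droids APIs; the point is just that rules are fetched at most once per host for a run, while the protocol layer stays stateless:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch: per-host robots.txt rules cached at the droid
 *  level, so the stateless protocol layer never re-fetches them. */
public class RobotsCache {
    // host -> parsed Disallow prefixes; populated at most once per host
    private final Map<String, List<String>> rulesByHost = new ConcurrentHashMap<>();
    // Injected fetcher (host -> robots.txt body) keeps the sketch testable offline.
    private final Function<String, String> fetcher;

    public RobotsCache(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    /** True if the path is allowed for this host under the cached rules. */
    public boolean isAllowed(String host, String path) {
        List<String> disallowed = rulesByHost.computeIfAbsent(host,
                h -> parseDisallows(fetcher.apply(h)));
        for (String prefix : disallowed) {
            if (!prefix.isEmpty() && path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    // Deliberately naive parse: collects every Disallow prefix and
    // ignores User-agent sections, wildcards, and Allow overrides.
    private static List<String> parseDisallows(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        for (String line : robotsTxt.split("\n")) {
            String t = line.trim();
            if (t.regionMatches(true, 0, "Disallow:", 0, 9)) {
                prefixes.add(t.substring(9).trim());
            }
        }
        return prefixes;
    }
}
```

The droid would own one RobotsCache per run (the "session"), and every fetch task would consult it before handing the URI to the protocol; HttpProtocol itself holds no rules and no state.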

What do you think?

Oleg
