> I don't know if any of you have been following the brouhaha about
> aggregators downloading whole websites, but I ran across the following
> message from Mark Pilgrim today, and wondered if Plucker wanted to start
> being a good web-citizen and honouring robots.txt.
I've suggested this before, and suggested it be defaulted to ON,
with the user-selectable option of disabling it, of course. There is no
reason for a publicly accessible site to "block" people wishing to see
all of the content on the site.
What if I am a human (carbon-based) process clicking and reading
every link on a website? How do they ascertain the difference between a
human (carbon) and a computer (silicon) process "reading" their website? If
the issue is bandwidth, or throttling, robots.txt has no facilities to
introduce a "delay" between page fetches. I think adding a gracious delay
(defaulted to ON, disabled by the user as needed) would help much more.
I'm all for trying to ease the burden on the websites and bandwidth
(I should know, my own pipes here have already served up 66 _gigabytes_ of
Plucker Desktop since 2/9/2003), but what they are really doing is forcing
people to fake UserAgent strings in their tools, instead of adding the
ability to read/parse robots.txt exclusion files.
If Plucker (through the various tools, JPluck, Python distiller,
unreleased-as-yet perl parser, etc.) were to simply get robots.txt first
(assuming an http:// protocol was in use, and not file:// of course), and
then add the URIs listed there to the stack of those NOT to fetch, that would
probably help in the short term, but I think people will just disable it and
fake their UserAgent, and we'll be right back where we started.
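
A rough sketch of the "get robots.txt first, then skip what it lists" idea,
using Python's standard urllib.robotparser module. This is not code from any
Plucker tool: the "Plucker" user-agent token, the helper name build_filter,
the sample rules, and the example.com URLs are all made up for illustration.
The rules are parsed from a list of lines so the sketch runs without network
access; a real tool would download the site's robots.txt first.

```python
from urllib.robotparser import RobotFileParser


def build_filter(robots_lines, user_agent="Plucker"):
    """Return a function that says whether a URL may be fetched.

    robots_lines is the content of a robots.txt file, one line per item.
    The user_agent default is a hypothetical token, not an official one.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed


# Hypothetical robots.txt content, for illustration only.
SAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
]

allowed = build_filter(SAMPLE_RULES)

# Hypothetical fetch queue; excluded URLs are dropped before fetching.
queue = [
    "http://example.com/index.html",
    "http://example.com/private/notes.html",
]
to_fetch = [url for url in queue if allowed(url)]
```

With the sample rules above, only the index page stays in the queue. A
user-selectable switch to disable the check would simply bypass the filter.
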
If it can be added, without breaking existing functionality in the
distiller, we should try. If it causes us more pain than good, I think we
should introduce a delay in fetching, as most of the other spiders and
aggregate fetching tools do (wget, LWP, pavuk, etc.).
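
The fetch delay could look something like the sketch below. It is only an
illustration under stated assumptions: fetch_politely, its parameters, and
the one-second default are invented here and are not taken from the
distiller or from any of the tools named above. The fetch and sleep
functions are passed in so the loop can be exercised without touching the
network.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in order, pausing between consecutive requests.

    fetch is any callable that takes a URL and returns its content.
    A delay_seconds of 0 disables the pause (the user-selectable "off").
    """
    results = []
    for index, url in enumerate(urls):
        # No pause before the first request, only between requests.
        if index > 0 and delay_seconds > 0:
            sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

Defaulting delay_seconds to a non-zero value matches the "defaulted to ON"
suggestion, while passing 0 lets the user turn it off.
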
d.
___
plucker-dev mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-dev