Re: robots.txt

2003-02-25 Thread MJ Ray
Blake Winton <[EMAIL PROTECTED]> wrote:
> wondered if Plucker wanted to start being a good web-citizen
> and honouring robots.txt.

Plucker is not necessarily a robot, although it can operate as one.  It
should support robots.txt only when it is recursing into a site, not when
downloading a single page, IMO.  It should always support the robots meta tag.
If that is acceptable and no-one beats me to it, I will try to code this
into the main plucker-build parser later this week and send a patch.
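
Roughly what I have in mind, as a sketch only (modern Python; fetch_depth,
meta_tags and the helper names are placeholders, not existing plucker-build
identifiers):

    import urllib.parse
    import urllib.robotparser

    def allowed_to_fetch(url, fetch_depth, user_agent="Plucker"):
        # Apply robots.txt only when we are recursing (depth > 0).
        if fetch_depth == 0:
            return True      # a single, user-requested page is not robot behaviour
        parts = urllib.parse.urlparse(url)
        if parts.scheme not in ("http", "https"):
            return True      # file:// and friends have no robots.txt
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()
        return rp.can_fetch(user_agent, url)

    def may_follow_links(meta_tags):
        # Always honour the robots meta tag, whatever the depth.
        return "nofollow" not in meta_tags.get("robots", "").lower()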

___
plucker-dev mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-dev


Re: robots.txt

2003-02-25 Thread David A. Desrosiers

> I don't know if any of you have been following the brouhaha about
> aggregators downloading whole websites, but I ran across the following
> message from Mark Pilgrim today, and wondered if Plucker wanted to start
> being a good web-citizen and honouring robots.txt.

I've suggested this before, and suggested it be defaulted to ON,
with the user-selectable option of disabling it, of course. There is no
reason for a site which is publicly accessible to "block" people wishing
to see all of the content on the site.

What if I am a human (carbon-based) process clicking and reading
every link on a website? How do they ascertain the difference between a
human (carbon) and a computer (silicon) process "reading" their website? If
the issue is bandwidth, or throttling, robots.txt has no facilities to
introduce a "delay" between page fetches. I think adding a gracious delay
(defaulted to ON, disabled by the user as needed) would help much more.
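
For illustration, a throttle along these lines is what I mean (a rough
sketch; the PoliteFetcher class and its option names are invented here, not
existing distiller code):

    import time

    class PoliteFetcher:
        def __init__(self, delay_seconds=1.0, enabled=True):
            self.delay = delay_seconds
            self.enabled = enabled      # defaulted to ON; the user can switch it off
            self._last_fetch = 0.0

        def wait_turn(self):
            # Sleep just long enough to keep at least delay_seconds between requests.
            if not self.enabled:
                return
            elapsed = time.monotonic() - self._last_fetch
            if elapsed < self.delay:
                time.sleep(self.delay - elapsed)
            self._last_fetch = time.monotonic()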

I'm all for trying to ease the burden on the websites and bandwidth
(I should know, my own pipes here have already served up 66 _gigabytes_ of
Plucker Desktop since 2/9/2003), but what they are really doing is forcing
people to fake UserAgent strings in their tools, instead of adding the
ability to read/parse robots.txt exclusion files.

If Plucker (through the various tools: JPluck, the Python distiller,
the as-yet-unreleased Perl parser, etc.) were to simply get robots.txt first
(assuming an http:// protocol was in use, and not file:// of course), and
then add the URIs listed there to the stack of those NOT to fetch, that would
probably help in the short term, but I think people will just disable it and
fake their UserAgent, and we'll be right back where we started.
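
As a sketch of that approach, using the robotparser module from Python's
standard library (the filter_links helper and the "Plucker" agent string are
only illustrative, not actual distiller code):

    import urllib.parse
    import urllib.robotparser

    def filter_links(start_url, candidate_links, user_agent="Plucker"):
        # Drop any candidate link that the site's robots.txt disallows.
        parts = urllib.parse.urlparse(start_url)
        if parts.scheme not in ("http", "https"):
            return list(candidate_links)   # file:// etc.: nothing to exclude
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        try:
            rp.read()                      # one extra fetch per site, done up front
        except OSError:
            return list(candidate_links)   # robots.txt unreachable: fail open
        return [url for url in candidate_links if rp.can_fetch(user_agent, url)]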

If it can be added without breaking existing functionality in the
distiller, we should try. If it causes us more pain than good, I think we
should introduce a delay in fetching, as most of the other spiders and
aggregate fetching tools do (wget, LWP, pavuk, etc.).


d.




robots.txt

2003-02-25 Thread Blake Winton
I don't know if any of you have been following the brouhaha
about aggregators downloading whole websites, but I ran
across the following message from Mark Pilgrim today, and
wondered if Plucker wanted to start being a good web-citizen
and honouring robots.txt.

http://diveintomark.org/archives/2003/02/21/newsmonster_day_2.html#c000403

Later,
Blake.
