Ok, so this discussion has definitely been interesting... let's see if
we can turn it into something actionable.
1. Desktop aggregators and services like pubsub really do not fall into
the same category as robots/crawlers and therefore should not
necessarily be paying attention to robots.txt
2. However, desktop aggregators and services like pubsub do perform
automated pulls against a server and therefore can be abusive to a server.
3. Therefore, it would be helpful if there were a way for publishers to
define rules that aggregators and readers should follow.
So how about something like this:
Add a new link rel="readers" whose href points to a robots.txt-like file
that either allows or disallows the aggregator for specific URIs and
establishes polling rate preferences:
User-agent: {aggregator-ua}
Origin: {ip-address}
Allow: {uri}
Disallow: {uri}
Frequency: {rate} [{penalty}]
Max-Requests: {num-requests} {period} [{penalty}]
The User-agent, Allow and Disallow fields have the same basic definition
as in robots.txt.
The Origin field specifies an IP address so that rules for specific IPs
can be established.
The Frequency field establishes the minimum allowed interval between
polls for the IP or User-agent. The optional {penalty} specifies the
number of milliseconds that will be added to that interval for each
violation.
The Max-Requests field establishes the maximum number of requests
allowed within a set period of time. The optional {penalty} specifies
the number of milliseconds that will be added to the period for each
violation.
Example,
<feed xmlns="http://www.w3.org/2005/Atom">
...
<link rel="readers" href="http://www.example.com/readers.txt" />
</feed>
readers.txt,
User-agent: Some-Reader
Allow: /blog/my-atom-feed.atom
Disallow: /blog/someotherfeed.atom
Frequency: 3600000 1800000          # wait at least an hour between requests,
                                    # add 30 minutes for each violation
Max-Requests: 10 86400000 3600000   # maximum of ten requests within a 24-hour
                                    # period, add 1 hour to the period for
                                    # each violation
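To make the format concrete, here is a rough sketch of how a reader might
parse such a file. The record grouping and data structures are my own
guesses for illustration, not part of the proposal:

# Sketch of a readers.txt parser. Field names come from the proposal
# above; everything else (ReaderRule, the grouping rules) is assumed.
from dataclasses import dataclass, field

@dataclass
class ReaderRule:
    user_agent: str = "*"
    origin: str = None              # optional IP address
    allow: list = field(default_factory=list)
    disallow: list = field(default_factory=list)
    frequency: tuple = None         # (interval_ms, penalty_ms)
    max_requests: tuple = None      # (num_requests, period_ms, penalty_ms)

def parse_readers_txt(text):
    rules, current = [], None
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        name, _, value = line.partition(":")
        name, value = name.strip().lower(), value.strip()
        if name == "user-agent":               # each User-agent starts a record
            current = ReaderRule(user_agent=value)
            rules.append(current)
        elif current is None:
            continue                           # ignore fields before any User-agent
        elif name == "origin":
            current.origin = value
        elif name == "allow":
            current.allow.append(value)
        elif name == "disallow":
            current.disallow.append(value)
        elif name == "frequency":
            parts = [int(p) for p in value.split()]
            current.frequency = (parts[0], parts[1] if len(parts) > 1 else 0)
        elif name == "max-requests":
            parts = [int(p) for p in value.split()]
            current.max_requests = (parts[0], parts[1],
                                    parts[2] if len(parts) > 2 else 0)
    return rules

A real reader would then match its own user-agent string and source IP
against the User-agent and Origin fields before applying the Allow,
Disallow, and rate rules.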
Some-Reader is allowed to get my-atom-feed.atom but is not allowed to
pull someotherfeed.atom.
If Some-Reader polls the feed more frequently than once an hour, it must
wait an hour and 30 minutes before its next poll. If it polls again
within that period, the required interval goes up to 2 hours. If it
polls appropriately, it drops back down to 1 hour.
If Some-Reader polls more than 10 times in a 24-hour period, the limit
tightens to no more than 10 requests in a 25-hour period, then a 26-hour
period, and so on. If the reader behaves, it reverts to 10 requests per
24-hour period.
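In rough Python, the Frequency back-off I have in mind would look
something like this. The class name and state handling are just my own
sketch, not part of the proposal; only the base/penalty arithmetic comes
from the rules above:

# Sketch of the Frequency back-off described above.
import time

class PollGate:
    def __init__(self, base_ms, penalty_ms):
        self.base_ms = base_ms          # e.g. 3600000 (one hour)
        self.penalty_ms = penalty_ms    # e.g. 1800000 (thirty minutes)
        self.current_ms = base_ms       # interval currently required
        self.last_poll_ms = None

    def record_poll(self):
        now_ms = time.time() * 1000
        if self.last_poll_ms is not None:
            if now_ms - self.last_poll_ms < self.current_ms:
                # Violation: polled too soon, so the required interval grows.
                self.current_ms += self.penalty_ms
            else:
                # Well-behaved poll: revert to the base interval.
                self.current_ms = self.base_ms
        self.last_poll_ms = now_ms

The Max-Requests penalty would work the same way, except that it is the
period, rather than the interval, that grows by the penalty on each
violation.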
The paths specified in the Allow and Disallow fields are relative to
the base URI of the readers.txt file... e.g., in the example above, they
are relative to www.example.com.
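(In code terms, a reader could resolve them the same way it resolves any
relative reference; e.g., with Python's standard urljoin -- just an
illustration:)

from urllib.parse import urljoin
# An Allow/Disallow path resolved against the location of readers.txt
urljoin("http://www.example.com/readers.txt", "/blog/my-atom-feed.atom")
# -> 'http://www.example.com/blog/my-atom-feed.atom'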
Thoughts?
- James
Walter Underwood wrote:
There are no wildcards in /robots.txt, only path prefixes and user-agent
names. There is one special user-agent, "*", which means "all".
I can't think of any good reason to always ignore the disallows for *.
I guess it is OK to implement the parts of a spec that you want.
Just don't answer "yes" when someone asks if you honor robots.txt.
A lot of spiders allow the admin to override /robots.txt for specific
sites, or better, for specific URLs.
wunder
--On August 25, 2005 11:47:18 PM -0500 "Roger B." <[EMAIL PROTECTED]> wrote:
Bob: It's one thing to ignore a wildcard rule in robots.txt. I don't
think it's a good idea, but I can at least see a valid argument for it.
However, if I put something like:
User-agent: PubSub
Disallow: /
...in my robots.txt and you ignore it, then you very much belong on
the Bad List.
--
Roger Benningfield
--
Walter Underwood
Principal Software Architect, Verity