Ok, so this discussion has definitely been interesting... let's see if we can turn it into something actionable.

1. Desktop aggregators and services like pubsub really do not fall into the same category as robots/crawlers and therefore should not necessarily be paying attention to robots.txt

2. However, desktop aggregators and services like pubsub do perform automated pulls against a server and therefore can be abusive to a server.

3. Therefore, it would be helpful if there were a way for publishers to define rules that aggregators and readers should follow.

So how about something like this:

Add a new link rel="readers" whose href points to a robots.txt-like file that either allows or disallows the aggregator for specific URIs and establishes polling rate preferences.

 User-agent: {aggregator-ua}
 Origin: {ip-address}
 Allow: {uri}
 Disallow: {uri}
 Frequency: {rate} [{penalty}]
 Max-Requests: {num-requests} {period} [{penalty}]

The User-agent, Allow and Disallow fields have the same basic definition as in robots.txt.

The Origin field specifies an IP address so that rules can be established for specific IPs. The Frequency field establishes the allowed polling rate for the IP or User-agent; the optional {penalty} specifies the number of milliseconds added to the frequency for each violation. The Max-Requests field establishes the maximum number of requests allowed within a set period of time; the optional {penalty} specifies the number of milliseconds added to the period for each violation.
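
To make the parsing concrete, here's a rough sketch of how a reader might turn such a file into a set of rules. Python is used purely for illustration, and the function name is just a placeholder, not part of the proposal:

 # Illustrative sketch of a readers.txt parser.
 # Records are grouped by User-agent, as in robots.txt.
 def parse_readers_txt(text):
     rules = []
     current = None
     for line in text.splitlines():
         line = line.split('#', 1)[0].strip()   # drop comments and whitespace
         if not line:
             continue
         field, _, value = line.partition(':')
         field, value = field.strip().lower(), value.strip()
         if field == 'user-agent':
             current = {'user-agent': value, 'allow': [], 'disallow': []}
             rules.append(current)
         elif current is None:
             continue                            # fields before any User-agent are ignored
         elif field in ('allow', 'disallow'):
             current[field].append(value)
         elif field in ('origin', 'frequency', 'max-requests'):
             current[field] = value.split()      # e.g. ['3600000', '1800000']
     return rules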

Example,
 <feed xmlns="http://www.w3.org/2005/Atom">
    ...
    <link rel="readers" href="http://www.example.com/readers.txt" />
 </feed>
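
A reader would discover the file through that link element; here's a rough sketch of what that lookup might look like (again just illustrative Python, with a placeholder function name):

 # Sketch: find the rel="readers" link in an Atom feed document.
 import xml.etree.ElementTree as ET

 ATOM_NS = '{http://www.w3.org/2005/Atom}'

 def find_readers_link(feed_xml):
     root = ET.fromstring(feed_xml)
     for link in root.findall(ATOM_NS + 'link'):
         if link.get('rel') == 'readers':
             return link.get('href')
     return None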

readers.txt,
 User-agent: Some-Reader
 Allow: /blog/my-atom-feed.atom
 Disallow: /blog/someotherfeed.atom
 Frequency: 3600000 1800000          # wait at least an hour between requests, add 30 minutes for each violation
 Max-Requests: 10 86400000 3600000   # maximum of ten requests within a 24-hour period, add 1 hour to the period for each violation

Some-Reader is allowed to get my-atom-feed.atom but is not allowed to pull someotherfeed.atom. If Some-Reader polls the feed more frequently than once an hour, it must wait 1 hour and 30 minutes before the next poll; if it polls again within that period, the wait goes up to 2 hours, and so on. If it polls appropriately, the interval goes back down to 1 hour. Likewise, if Some-Reader polls more than 10 times in a 24-hour period, the limit becomes no more than 10 requests in a 25-hour period, then a 26-hour period, etc. If the reader behaves, it reverts to 10 requests per 24-hour period.
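
Put another way, the effective interval is just the base frequency plus one penalty per outstanding violation. A quick sketch of that arithmetic (values in milliseconds, as above; the function name is just a placeholder):

 # Sketch: effective polling interval under the Frequency rule,
 # adding one penalty for each unresolved violation.
 def effective_frequency(base_ms, penalty_ms, violations):
     return base_ms + penalty_ms * violations

 # e.g. "Frequency: 3600000 1800000" with one violation on record:
 # effective_frequency(3600000, 1800000, 1) -> 5400000 (1.5 hours)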

The paths specified in the Allow and Disallow fields are relative to the base URI of the readers.txt file... e.g., in the example above, they are relative to www.example.com.
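
Resolving those paths is ordinary relative-URI resolution against the readers.txt location, e.g. (using Python's standard urljoin, just for illustration):

 # Sketch: resolve an Allow/Disallow path against the readers.txt URI.
 from urllib.parse import urljoin

 base = 'http://www.example.com/readers.txt'
 print(urljoin(base, '/blog/my-atom-feed.atom'))
 # -> http://www.example.com/blog/my-atom-feed.atom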

Thoughts?

- James



Walter Underwood wrote:

There are no wildcards in /robots.txt, only path prefixes and user-agent
names. There is one special user-agent, "*", which means "all".
I can't think of any good reason to always ignore the disallows for *.

I guess it is OK to implement the parts of a spec that you want.
Just don't answer "yes" when someone asks if you honor robots.txt.

A lot of spiders allow the admin to override /robots.txt for specific
sites, or better, for specific URLs.

wunder

--On August 25, 2005 11:47:18 PM -0500 "Roger B." <[EMAIL PROTECTED]> wrote:

Bob: It's one thing to ignore a wildcard rule in robots.txt. I don't
think it's a good idea, but I can at least see a valid argument for it.
However, if I put something like:

User-agent: PubSub
Disallow: /

...in my robots.txt and you ignore it, then you very much belong on
the Bad List.

--
Roger Benningfield





--
Walter Underwood
Principal Software Architect, Verity


