Ok, so this discussion has definitely been interesting... let's see if
we can turn it into something actionable.
1. Desktop aggregators and services like pubsub really do not fall into
the same category as robots/crawlers and therefore should not
necessarily be paying attention to robots.txt
2. However, desktop aggregators and services like pubsub do perform
automated pulls against a server and therefore can be abusive to a server.
3. Therefore, it would be helpful if there were a way for publishers to
define rules that aggregators and readers should follow.
So how about something like this:
Add a new link rel="readers" whose href points to a robots.txt-like file
that either allows or disallows the aggregator for specific URIs and
establishes polling rate preferences:
User-agent: {aggregator-ua}
Origin: {ip-address}
Allow: {uri}
Disallow: {uri}
Frequency: {rate} [{penalty}]
Max-Requests: {num-requests} {period} [{penalty}]
The User-agent, Allow and Disallow fields have the same basic definition
as in robots.txt.
The Origin field specifies an IP address so that rules for specific IPs
can be established.
The Frequency field establishes the minimum allowed interval between
polls for the IP or User-agent. The optional {penalty} specifies the
number of milliseconds that will be added to that interval for each
violation.
The Max-Requests field establishes the maximum number of requests
allowed within a set period of time. The optional {penalty} specifies
the number of milliseconds that will be added to the period for each
violation.
Example,
<feed xmlns="http://www.w3.org/2005/Atom">
...
<link rel="readers" href="http://www.example.com/readers.txt" />
</feed>
readers.txt,
User-agent: Some-Reader
Allow: /blog/my-atom-feed.atom
Disallow: /blog/someotherfeed.atom
Frequency: 3600000 1800000          # wait at least an hour between requests,
                                    # add 30 minutes for each violation
Max-Requests: 10 86400000 3600000   # maximum of ten requests within a 24-hour
                                    # period, add 1 hour to the period for
                                    # each violation
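To make the format concrete, here is a rough sketch of how a reader might
parse such a file. The record grouping and data structures are my own
guesses for illustration, not part of the proposal:

# Sketch of a readers.txt parser. Field names come from the proposal
# above; everything else (ReaderRule, the grouping rules) is assumed.
from dataclasses import dataclass, field

@dataclass
class ReaderRule:
    user_agent: str = "*"
    origin: str = None              # optional IP address
    allow: list = field(default_factory=list)
    disallow: list = field(default_factory=list)
    frequency: tuple = None         # (interval_ms, penalty_ms)
    max_requests: tuple = None      # (num_requests, period_ms, penalty_ms)

def parse_readers_txt(text):
    rules, current = [], None
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        name, _, value = line.partition(":")
        name, value = name.strip().lower(), value.strip()
        if name == "user-agent":               # each User-agent starts a record
            current = ReaderRule(user_agent=value)
            rules.append(current)
        elif current is None:
            continue                           # ignore fields before any User-agent
        elif name == "origin":
            current.origin = value
        elif name == "allow":
            current.allow.append(value)
        elif name == "disallow":
            current.disallow.append(value)
        elif name == "frequency":
            parts = [int(p) for p in value.split()]
            current.frequency = (parts[0], parts[1] if len(parts) > 1 else 0)
        elif name == "max-requests":
            parts = [int(p) for p in value.split()]
            current.max_requests = (parts[0], parts[1],
                                    parts[2] if len(parts) > 2 else 0)
    return rules

A real reader would then match its own user-agent string and source IP
against the User-agent and Origin fields before applying the Allow,
Disallow, and rate rules.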
Some-Reader is allowed to get my-atom-feed.atom but is not allowed to
pull someotherfeed.atom.
If Some-Reader polls the feed more frequently than once an hour, it must
wait an hour and 30 minutes before its next poll. If it polls again
within that period, the required interval goes up to 2 hours. If it
polls appropriately, it drops back down to 1 hour.
If Some-Reader polls more than 10 times in a 24-hour period, the limit
tightens to no more than 10 requests in a 25-hour period, then a 26-hour
period, and so on. If the reader behaves, it reverts to 10 requests per
24-hour period.
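In rough Python, the Frequency back-off I have in mind would look
something like this. The class name and state handling are just my own
sketch, not part of the proposal; only the base/penalty arithmetic comes
from the rules above:

# Sketch of the Frequency back-off described above.
import time

class PollGate:
    def __init__(self, base_ms, penalty_ms):
        self.base_ms = base_ms          # e.g. 3600000 (one hour)
        self.penalty_ms = penalty_ms    # e.g. 1800000 (thirty minutes)
        self.current_ms = base_ms       # interval currently required
        self.last_poll_ms = None

    def record_poll(self):
        now_ms = time.time() * 1000
        if self.last_poll_ms is not None:
            if now_ms - self.last_poll_ms < self.current_ms:
                # Violation: polled too soon, so the required interval grows.
                self.current_ms += self.penalty_ms
            else:
                # Well-behaved poll: revert to the base interval.
                self.current_ms = self.base_ms
        self.last_poll_ms = now_ms

The Max-Requests penalty would work the same way, except that it is the
period, rather than the interval, that grows by the penalty on each
violation.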
The paths specified in the Allow and Disallow fields are relative to
the base URI of the readers.txt file... e.g., in the example above, they
are relative to www.example.com.
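(In code terms, a reader could resolve them the same way it resolves any
relative reference; e.g., with Python's standard urljoin -- just an
illustration:)

from urllib.parse import urljoin
# An Allow/Disallow path resolved against the location of readers.txt
urljoin("http://www.example.com/readers.txt", "/blog/my-atom-feed.atom")
# -> 'http://www.example.com/blog/my-atom-feed.atom'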
Thoughts?
- James
Walter Underwood wrote:
There are no wildcards in /robots.txt, only path prefixes and user-agent
names. There is one special user-agent, "*", which means "all".
I can't think of any good reason to always ignore the disallows for *.
I guess it is OK to implement the parts of a spec that you want.
Just don't answer "yes" when someone asks if you honor robots.txt.
A lot of spiders allow the admin to override /robots.txt for specific
sites, or better, for specific URLs.
wunder
--On August 25, 2005 11:47:18 PM -0500 "Roger B." <[EMAIL PROTECTED]> wrote:
Bob: It's one thing to ignore a wildcard rule in robots.txt. I don't
think it's a good idea, but I can at least see a valid argument for it.
However, if I put something like:
User-agent: PubSub
Disallow: /
...in my robots.txt and you ignore it, then you very much belong on
the Bad List.
--
Roger Benningfield
--
Walter Underwood
Principal Software Architect, Verity