--On August 26, 2005 9:51:10 AM -0700 James M Snell <[EMAIL PROTECTED]> wrote:

> Add a new link rel="readers" whose href points to a robots.txt-like file that
> either allows or disallows the aggregator for specific URIs and establishes
> polling rate preferences
> 
>   User-agent: {aggregator-ua}
>   Origin: {ip-address}
>   Allow: {uri}
>   Disallow: {uri}
>   Frequency: {rate} [{penalty}]
>   Max-Requests: {num-requests} {period} [{penalty}]

No, on several counts.

1. Big, scalable spiders don't work like that. They don't do aggregate
frequencies or rates. They may have independent crawlers visiting the
same host. Yes, they try to be good citizens, but you can't force
WWW search folk to redesign their spiders.

2. Frequencies and rates don't work well with either HTTP caching or
with publishing schedules. Things are much cleaner with a single
model (max-age and/or expires); a quick illustration follows this list.

3. This is trying to be a remote-control for spiders instead of describing
some characteristic of the content. We've rejected the remote control
approach in Atom.

4. What happens when there are conflicting specs in this file, in
robots.txt, and in a Google Sitemap?

5. Specifying all this detail is pointless if the spider ignores it.
You still need enforceable rate controls in your webserver to handle
busted or bad-citizen robots (see the second sketch after this list).

6. Finally, this sort of thing has been proposed a few times and never
caught on. By itself, that is a weak argument, but I think the causes
are pretty strong (above).
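
On point 2, to make it concrete: a publisher that only updates hourly can
already say so with standard HTTP caching headers, no new file format or
link relation required. Something like (illustrative values only):

   HTTP/1.1 200 OK
   Content-Type: application/atom+xml
   Cache-Control: max-age=3600
   Expires: Fri, 26 Aug 2005 18:00:00 GMT

Any cache-aware client already knows what to do with that.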
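
And on point 5, the enforcement has to live on the server side anyway.
A rough sketch of the kind of per-client throttle I mean (hypothetical
Python, a sliding window per IP, not production code):

   import time
   from collections import defaultdict

   WINDOW = 60.0       # seconds
   MAX_REQUESTS = 30   # per client IP per window

   _hits = defaultdict(list)

   def allow(client_ip):
       # True if this request fits the per-IP budget; otherwise throttle it
       now = time.time()
       recent = [t for t in _hits[client_ip] if now - t < WINDOW]
       _hits[client_ip] = recent
       if len(recent) >= MAX_REQUESTS:
           return False   # e.g. answer 503 with a Retry-After header
       recent.append(now)
       return True

That works whether or not the robot ever reads your preferences file.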

There are some proprietary extensions to robots.txt:

Yahoo crawl-delay:
<http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html>

Google wildcard disallows:
<http://www.google.com/remove.html#images>

It looks like MSNbot does crawl-delay and a wildcard limited to file extensions:
<http://search.msn.com/docs/siteowner.aspx?t=SEARCH_WEBMASTER_REF_RestrictAccessToSite.htm>
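
So a publisher who wants slower polling today can already get part of the
way there with those extensions, roughly like this (syntax varies per
engine, so check each one's documentation):

   User-agent: Slurp
   Crawl-delay: 20

   User-agent: Googlebot-Image
   Disallow: /*.gif$

That covers the per-crawler delay and pattern-based disallow cases without
inventing anything new.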

wunder
--
Walter Underwood
Principal Software Architect, Verity
