On Sat, 17 Aug 2002 09:08:01 -0700 JC Dill <[EMAIL PROTECTED]> wrote:
> On 06:28 PM 8/16/02, Nick Simicich wrote:
>> At 10:04 AM 2002-08-16 -0700, JC Dill wrote:
> How hard is it to expect webmasters to take simple steps to announce
> their policy?  They need a *single* file indicating if spidering is
> allowed or not, and they need a *single* header in the webpage to
> indicate if it can be cached or not.
>
> THIS WORKS
>
> The reason it works is A) it is simple, and B) the burden is shared by
> those with the content (who have to properly post their policy in the
> robots.txt file and in their webpage headers) and those who seek to
> search, mirror, or archive it (who must follow the content provider's
> policy as it is conveyed in the file and header).

robots.txt intercepts the data acquisition path that a spider normally
and already uses in spidering.  It adds just one new function point --
grabbing that file too -- in exactly the same way the spider is already
grabbing other web files.  robots.txt doesn't add a new transport to the
mix, it doesn't add a new protocol to the mix, it doesn't add a
different third-party service to the mix, and it doesn't require any
changes to the web servers, their configuration, or their normal
function.  It merely very slightly tweaks what the robot was doing
already (sucking nodes from URLs), while what the web server was doing
stays exactly the same: serving files.

Adding an equivalent of robots.txt to mailing lists under non-SMTP
protocols violates the above model.  Adding robots.txt as a
pre-subscription check violates the above model for both the "spider"
and the list server as well.  Adding a flag header to the subscribe
handshake requires no structural changes to the "spider" and damned
close to no changes to the list server (which already sends those
messages with custom content etc anyway).

> Here is the meta tag format that tells all caches on the Internet not
> to cache your webpage:
>
>   <META NAME="ROBOTS" CONTENT="NOARCHIVE">
>
> Here is the format if you want to just specify no archive by google
> alone:
>
>   <META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">
>
> This is not hard.  This WORKS.
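To make the point concrete, here is a minimal sketch (Python chosen
purely for illustration; the `NoArchiveSniffer` class and the "MyBot"
agent name are hypothetical) of how little a well-behaved spider has to
add: one extra fetch-and-check for robots.txt over the same HTTP path it
already uses, and one extra check for a NOARCHIVE meta tag in a page it
has already downloaded.

```python
from urllib.robotparser import RobotFileParser
from html.parser import HTMLParser

# Parse a robots.txt body the spider would have fetched over plain HTTP,
# exactly the way it fetches every other URL on the site.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyBot", "http://example.com/index.html"))  # True
print(rp.can_fetch("MyBot", "http://example.com/private/x"))   # False

# Likewise, honouring NOARCHIVE is one more check on already-fetched HTML.
class NoArchiveSniffer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.noarchive = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        a = {k: (v or "").lower() for k, v in attrs}
        if (tag == "meta"
                and a.get("name") in ("robots", "googlebot")
                and "noarchive" in a.get("content", "")):
            self.noarchive = True

sniffer = NoArchiveSniffer()
sniffer.feed('<META NAME="ROBOTS" CONTENT="NOARCHIVE">')
print(sniffer.noarchive)  # True
```

Note that neither check touches the server side at all: the web server
just keeps serving files, which is the whole argument above.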
Actually, it works rather poorly and is *very* non-standardised.  Yes,
it might work with one cache, but the odds of it being honoured across
the wider range of caches out there are damned low.  <<Can you tell I
spent the last few months hacking Squid?>>

--
J C Lawrence                ---------(*)   Satan, oscillate my metallic sonatas.
[EMAIL PROTECTED]                          He lived as a devil, eh?
http://www.kanga.nu/~claw/                 Evil is a name of a foeman, as I live.
