[Robots] FW: Re: Correct URL, shlash at the end ?

2001-11-24

From: Sean 'Captain Napalm' Conner
Sent: Friday, November 23, 2001 10:58 PM
Subject: Re: [Robots] Re: Correct URL, shlash at the end ?

It was thus said that the Great [EMAIL PROTECTED] once stated:
> If one crazy idea leads to another ...then if the above did get in the
> robots.txt spec then the web services could then edit that slash part of
> robots.txt file.  When the webservice config files holding that default
> list detected the change event the web admin is then asked if they wish to
> also update the robots.txt.
> "Update your robots.txt file in the [doc root] to include the change in the
> Slash: list?"
> The crazy task is so simple the web server programmers would fight to do
> it just to be the first.

  And if so, it would be the first feature I would disable in the webserver,
since in several cases the web configuration is managed not by hand but by
other automated processes (I part-time admin one where new sites are batched
up and a new configuration file is generated at set times) and I do not want
the web server to hang because it's waiting for a `Yes' or `No' answer from
a human.  Having to manually type in a pass phrase for a secure webserver
(until such time as we found out you *could* start it without it asking) was
bad enough, as was having to always be around an Internet-enabled computer in
case I was paged.

  And second, it's not quite as simple as you make it out to be.  For
instance, in Apache, the directive that controls this is ``DirectoryIndex''
and it can appear in several different contexts, including virtual hosts
(which means for one virtual host, I can have it default to ``Welcome.html''
because that might have been the default for some other webserver the client
may have used), directories or even under control of the user in an
.htaccess file (which isn't necessarily read until needed).  It also doesn't
have to be a simple file---it could be specified as:

DirectoryIndex  index.html /defaults/hey-dummy.html

  Which means that if, in a directory, ``index.html'' isn't found, use the
one located at ``/defaults/hey-dummy.html'' (hey, I didn't even know you
could do that until just now 8-)
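
  A rough sketch of that lookup, in Python rather than anything Apache
actually runs.  The function name resolve_index and its arguments are made
up for illustration, and it assumes a leading-slash entry maps straight onto
the document root, which ignores Alias and other URL mapping:

```python
import os

def resolve_index(doc_root, request_dir, candidates):
    """Return the path of the first index candidate that exists, else None.

    Hypothetical helper: a candidate starting with '/' is treated as a URL
    path under doc_root; anything else is looked up in the requested
    directory itself.
    """
    for name in candidates:
        if name.startswith("/"):
            # Site-wide fallback, e.g. /defaults/hey-dummy.html
            path = os.path.join(doc_root, name.lstrip("/"))
        else:
            # Ordinary per-directory index file, e.g. index.html
            path = os.path.join(doc_root, request_dir.strip("/"), name)
        if os.path.isfile(path):
            return path
    return None
```

  The point being that a robot (or a robots.txt generator) can't tell from
the list alone which file a given directory will end up serving; it depends
on what is actually on disk in each directory.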

  Then there's the matter of virtual hosts.  My own small colocated server
is serving up 25 sites---which means updating 25 robots.txt files (that is,
if they exist) without blowing the existing ones to smithereens.  Even if the
webserver didn't bother asking me, having it waste time to process 25 excess
files bothers me (since it's not a fast machine by any stretch of the
imagination).  Now do this for a machine that may have 2,000+ sites on it.
For something that doesn't exactly change at all (or very rarely).
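
  To give an idea of what "not blowing it to smithereens" involves, here is
a minimal sketch in Python.  The `Slash:' field is only the proposal from
this thread (it is in no robots.txt spec), its syntax here is my guess, and
merge_slash_line is a made-up name.  It also naively appends to the end of
the file, ignoring which record the line would land in:

```python
def merge_slash_line(existing, index_names):
    """Return robots.txt text with exactly one (hypothetical) Slash: line.

    Every other line of the existing file is kept as-is and in order.
    """
    new_line = "Slash: " + " ".join(index_names)
    out = []
    replaced = False
    for line in existing.splitlines():
        if line.lower().startswith("slash:"):
            # Replace the first Slash: line, silently drop any duplicates.
            if not replaced:
                out.append(new_line)
                replaced = True
        else:
            out.append(line)
    if not replaced:
        # No Slash: line yet (or no file content at all): append one.
        out.append(new_line)
    return "\n".join(out) + "\n"
```

  And the webserver would have to run something like this once per virtual
host, on every configuration change.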

  There's even a potential race condition.  I download and am editing my
robots.txt file.  The webserver admin makes a change to the configuration
and restarts the webserver and my robots.txt file is updated.  I then finish
my editing and upload the new file.  It is now out of sync with respect to
the configuration.  Far-fetched, yes (given that not even, what, 5% of all
sites even *have* a robots.txt file to begin with) but still a possibility.
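
  The race is the classic lost update.  A small Python sketch, using a dict
as a stand-in for the document root; the names digest, blind_upload and
guarded_upload are invented for the example, and nothing here is something
an existing webserver or FTP server does for you:

```python
import hashlib

def digest(text):
    """Fingerprint of a file's contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def blind_upload(store, path, new_text):
    """What an ordinary upload does: overwrite whatever is there."""
    store[path] = new_text

def guarded_upload(store, path, new_text, seen_digest):
    """Overwrite only if the stored copy is still the one we downloaded.

    Returns False (and leaves the file alone) if it changed in between.
    """
    if digest(store[path]) != seen_digest:
        return False
    store[path] = new_text
    return True
```

  With blind_upload, whatever the webserver wrote between my download and
my upload is simply gone.  With guarded_upload the conflict is at least
noticed, but then someone has to merge the two versions, which is yet more
machinery for a supposedly trivial feature.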

  Not quite as simple as it is made out to be.

  -spc (Taken on my share of ``trivial'' changes ... )

