[Robots] FW: Re: Correct URL, slash at the end ?

2001-11-24 Thread Nick Arnett




-Original Message-
From: Sean 'Captain Napalm' Conner [mailto:[EMAIL PROTECTED]]
Sent: Friday, November 23, 2001 10:58 PM
To: [EMAIL PROTECTED]
Subject: Re: [Robots] Re: Correct URL, slash at the end ?


It was thus said that the Great [EMAIL PROTECTED] once stated:
>
> If one crazy idea leads to another ... then if the above did get into the
> robots.txt spec, the web servers could then edit that slash part of the
> robots.txt file.  When the webserver config files holding that default
> file list detected the change event, the web admin would then be asked
> whether they wish to also update robots.txt.
>
> "Update your robots.txt file in the [doc root] to include the change in
> the Slash: list?"
>
> The crazy task is so simple the web server programmers would fight to do
> it just to be the first.

  And if so, it would be the first feature I would disable in the webserver,
since in several cases the web configuration is managed not by hand but by
other automated processes (I part-time admin one where new sites are batched
up and a new configuration file is generated at set times), and I do not want
the web server to hang because it's waiting for a `Yes' or `No' answer from
a human.  Having to manually type in a pass phrase for a secure webserver
(until such time as we found out you *could* start it without it asking) was
bad enough (as was having to always be around an Internet-enabled computer
in case I was paged).

  And second, it's not quite as simple as you make it out to be.  For
instance, in Apache the directive that controls this is ``DirectoryIndex'',
and it can appear in several different contexts: virtual hosts (which means
for one virtual host I can have it default to ``Welcome.html'', because that
might have been the default for some other webserver the client may have
used), directories, or even under the control of the user in an .htaccess
file (which isn't necessarily read until needed).  It also doesn't have to
be a single file---it could be specified as:

DirectoryIndex  index.html /defaults/hey-dummy.html

  Which means that if ``index.html'' isn't found in a directory, the one
located at ``/defaults/hey-dummy.html'' is used instead (hey, I didn't even
know you could do that until just now 8-)
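A minimal sketch of how ``DirectoryIndex'' can turn up in each of those contexts (the hostnames and paths here are made up for illustration, not taken from any real configuration):

```apache
# Server-wide default
DirectoryIndex index.html

<VirtualHost *:80>
    ServerName client.example.com
    # Per-vhost override: this client's previous webserver defaulted
    # to Welcome.html, so keep that behavior here
    DirectoryIndex Welcome.html
</VirtualHost>

<Directory "/var/www/docs">
    # Per-directory setting with a fallback: if index.html is absent,
    # serve the site-wide default page instead
    DirectoryIndex index.html /defaults/hey-dummy.html
</Directory>

# ...and a user can override it yet again in their own .htaccess:
#   DirectoryIndex start.htm
```

Any tool that wanted to rewrite robots.txt from this would have to merge all four layers, which is part of why the "trivial" feature isn't.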

  Then there's the matter of virtual hosts.  My own small colocated server
is serving up 25 sites---which means updating 25 robots.txt files (that is,
if they exist) without blowing the existing ones to smithereens.  Even if
the webserver didn't bother asking me, having it waste time processing 25
extra files bothers me (since it's not a fast machine by any stretch of the
imagination).  Now do this for a machine that may have 2,000+ sites on it.
And all for something that changes very rarely, if at all.

  There's even a potential race condition.  I download and start editing my
robots.txt file.  The webserver admin makes a change to the configuration
and restarts the webserver, and my robots.txt file is updated.  I then
finish my editing and upload the new file.  It is now out of sync with
respect to the configuration.  Far-fetched, yes (given that what, maybe 5%
of all sites even *have* a robots.txt file to begin with), but still a
possibility.

  Not quite as simple as it is made out to be.

  -spc (Taken on my share of ``trivial'' changes ... )





--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send "help" in the body of a message 
to "[EMAIL PROTECTED]".




[Robots] FW: Re: Correct URL, slash at the end ?

2001-11-24 Thread Nick Arnett




-Original Message-
From: Sean 'Captain Napalm' Conner [mailto:[EMAIL PROTECTED]]
Sent: Friday, November 23, 2001 11:26 PM
To: [EMAIL PROTECTED]
Subject: Re: [Robots] Re: Correct URL, slash at the end ?


It was thus said that the Great George Phillips once stated:
>
> Don't be misled by relative URLs.  Yes, they use "." and "..".  Yes,
> "/" is very important.  Yes, they operate almost identically to
> UNIX relative paths (but different enough to keep us on our toes).
> Yes, they are extremely useful.  But they're just rules that take
> the stuff you used to get the current page and some relative stuff to
> construct new stuff -- all done by the browser.  The web server only
> understands pure, unadulterated, unrelative stuff.

  There are rules for parsing relative URLs in RFC 1808, and no, web servers
do understand relative paths---they just must start with a leading `/' when
given in a GET (or other) command.  I just fed ``/people/../index.html'' to
my colocated server (telnet to port 80, feed in the GET request directly)
and got the main index page at http://www.conman.org/ .  So the webserver
can do the processing as well (at least Apache can).
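The browser-side resolution George describes can be sketched with Python's standard library (a hypothetical illustration; `urllib.parse` implements RFC 3986, the successor to the RFC 1808 rules cited above):

```python
from urllib.parse import urljoin

# Resolve relative references against a base URL the way a browser
# (or a careful robot) would, including "." and ".." handling.
base = "http://www.conman.org/people/index.html"

# ".." steps up one path segment before appending the new name
print(urljoin(base, "../index.html"))  # -> http://www.conman.org/index.html

# "." stays in the current directory
print(urljoin(base, "./bio.html"))     # -> http://www.conman.org/people/bio.html
```

Note that this is purely client-side string manipulation; whether the server *also* normalizes dot segments (as Apache does) is a separate question.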

> My suggestion is that the robot construct URLs with care -- always do what
> a browser would do and respect the fact that the HTTP server may need
> exactly the same stuff back as it put into the HTML.  And always, always
> store exactly the URL used to retrieve a block of content.  But implement
> some generic mechanism to generalize URL equality beyond strcmp().
> Regular expression search and replace looks as promising as anything.
> Imagine something like this (with perlish regexp):
>
> URL-same: s'/(index|default).html?$'/'
>
> In other words, if the URL ends in "/index.html", "/default.html",
> "/index.htm" or "/default.htm" then drop all but the slash and we'll
> assume the URL will boil down to the same content.
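The quoted URL-same rule translates almost directly into Python (a hypothetical helper; the function name is mine, and the dot is escaped, which the Perl original left bare):

```python
import re

def url_same(url: str) -> str:
    """Normalize a URL under the quoted URL-same rule: a trailing
    /index.html, /index.htm, /default.html or /default.htm is
    collapsed down to just the slash."""
    return re.sub(r"/(index|default)\.html?$", "/", url)

print(url_same("http://example.com/index.html"))     # -> http://example.com/
print(url_same("http://example.com/a/default.htm"))  # -> http://example.com/a/
print(url_same("http://example.com/page.html"))      # unchanged
```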

  Is this for the robot configuration (on the robot end of things) or for
something like robots.txt?

> URL-same: s'[^/]+/..(/|$)''   # condense ".."

  Make sure you follow RFC-1808 though.

> URL-same: tr'A-Z'a-z' # case fold the whole thing 'cause why not?

  Because not every webserver is case-insensitive.  The host portion is (it
has to be; DNS is case-insensitive), but the path portion (at least in the
standards) is not.  Okay, some sites (like AOL) treat them as
case-insensitive, but not all sites do.
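So a safe normalizer folds case only where the standards allow it. A minimal sketch (the function name is hypothetical), again with Python's `urllib.parse`:

```python
from urllib.parse import urlsplit, urlunsplit

def fold_host_only(url: str) -> str:
    """Lowercase only the scheme and host, which are case-insensitive;
    leave the path untouched, since /Welcome.html and /welcome.html may
    be different resources on a case-sensitive server."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, parts.fragment))

print(fold_host_only("HTTP://WWW.Conman.ORG/People/Index.HTML"))
# -> http://www.conman.org/People/Index.HTML
```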

> And something for the "pathological sites"
>
> URL-same: s'^(http://boston.conman.org/.*/)0+)'$1'g
> URL-same: s'^(http://boston.conman.org/.*\.[0-9]*)0+(/|$)'$1$2'g

  What, exactly does that map?  Because I assure you that

http://boston.conman.org/2001/11/17.2

  is not the same as:

 http://boston.conman.org/2001/11/17

  even though the latter contains the content of the former (plus other
entries from that day).  But ...

  http://boston.conman.org/2000/8/10.2-15.5

  and

 http://boston.conman.org/2000/8/10.2-8/15.5

  do return the same content (in other words, those are equivalent), whereas:

  http://boston.conman.org/2000/8/10.2-15.5

  and

http://boston.conman.org/2000/8/10-15

  Are not (but again, the latter contains the content of the former).

  (Yet one more odd case.  This:

http://boston.conman.org/1999

  and this:

  http://boston.conman.org/1999/12

  and this:

http://boston.conman.org/1999/12/4-15

  Are the same, but only because I started keeping entries in December of
1999.  You can repeat for a couple of other variations).

> It would be so cool if a robot could discover these patterns for itself.
> Seems like it would be a small scale version of covering
> boston.conman.org's other "problem" of multiple overlapping data views.

  I'm not as sure of that 8-)

  -spc (I calculated that http://bible.conman.org/kj/ has over 15 million
different URL views into the King James Bible ... )






