[Robots] FW: Re: Correct URL, shlash at the end ?
-----Original Message-----
From: Sean 'Captain Napalm' Conner [mailto:[EMAIL PROTECTED]]
Sent: Friday, November 23, 2001 10:58 PM
To: [EMAIL PROTECTED]
Subject: Re: [Robots] Re: Correct URL, shlash at the end ?

It was thus said that the Great [EMAIL PROTECTED] once stated:
>
> If one crazy idea leads to another ... then if the above did get into the
> robots.txt spec, the web servers could then edit the slash part of the
> robots.txt file. When the web server config files holding that default file
> list detected a change event, the web admin would then be asked if they wish
> to also update robots.txt:
>
> "Update your robots.txt file in the [doc root] to include the change in the
> Slash: list?"
>
> The crazy task is so simple the web server programmers would fight to do it
> just to be the first.

  First, that would be the first feature I would disable in the web server, since in several cases the web configuration is managed not by hand but by other automated processes (I part-time admin one where new sites are batched up and a new configuration file is generated at set times), and I do not want the web server to hang because it's waiting for a `Yes' or `No' answer from a human. Having to manually type in a passphrase for a secure web server (until such time as we found out you *could* start it without it asking) was bad enough (as was having to always be near an Internet-enabled computer in case I was paged).

  Second, it's not quite as simple as you make it out to be. In Apache, for instance, the directive that controls this is ``DirectoryIndex'', and it can appear in several different contexts: in virtual hosts (which means for one virtual host I can have it default to ``Welcome.html'', because that might have been the default on some other web server the client may have used), in directories, or even under the control of the user in an .htaccess file (which isn't necessarily read until needed).
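For illustration, those contexts look roughly like this in an Apache configuration file; this is only a sketch, and the host names and file names below are made up, not taken from the thread:

```apache
# Server-wide default.
DirectoryIndex index.html

# Per-virtual-host override: this (hypothetical) host keeps the
# ``Welcome.html'' convention from a previous web server.
<VirtualHost *:80>
    ServerName www.example.com
    DirectoryIndex Welcome.html
</VirtualHost>

# Per-directory override.
<Directory "/var/www/example/docs">
    DirectoryIndex contents.html
</Directory>

# And a user can override it again in an .htaccess file,
# which Apache reads only when the directory is actually requested:
#
#   DirectoryIndex home.html
```

Any tool that wanted to compute the effective index-file list for a site would have to merge all of these layers, which is the point being made here.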
  It also doesn't have to be a simple file---it could be specified as:

	DirectoryIndex index.html /defaults/hey-dummy.html

which means that if ``index.html'' isn't found in a directory, use the one located at ``/defaults/hey-dummy.html'' (hey, I didn't even know you could do that until just now 8-)

  Then there's the matter of virtual hosts. My own small colocated server is serving up 25 sites---which means updating 25 robots.txt files (those that exist, that is) without blowing the existing ones to smithereens. Even if the web server didn't bother asking me, having it waste time processing 25 extra files bothers me (since it's not a fast machine by any stretch of the imagination). Now do this for a machine that may have 2,000+ sites on it, for something that rarely, if ever, changes.

  There's even a potential race condition. I download and start editing my robots.txt file. The web server admin makes a change to the configuration and restarts the web server, and my robots.txt file is updated. I then finish my editing and upload the new file. It is now out of sync with respect to the configuration. Far-fetched, yes (given that not even, what, 5% of all sites even *have* a robots.txt file to begin with), but still a possibility.

  Not quite as simple as it is made out to be.

  -spc (Taken on my share of ``trivial'' changes ... )

--
This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]). For list server commands, send "help" in the body of a message to "[EMAIL PROTECTED]".
[Robots] FW: Re: Correct URL, shlash at the end ?
-----Original Message-----
From: Sean 'Captain Napalm' Conner [mailto:[EMAIL PROTECTED]]
Sent: Friday, November 23, 2001 11:26 PM
To: [EMAIL PROTECTED]
Subject: Re: [Robots] Re: Correct URL, shlash at the end ?

It was thus said that the Great George Phillips once stated:
>
> Don't be misled by relative URLs. Yes, they use "." and "..". Yes,
> "/" is very important. Yes, they operate almost identically to
> UNIX relative paths (but different enough to keep us on our toes).
> Yes, they are extremely useful. But they're just rules that take
> the stuff you used to get the current page and some relative stuff to
> construct new stuff -- all done by the browser. The web server only
> understands pure, unadulterated, unrelative stuff.

  There are rules for parsing relative URLs in RFC-1808, and, no, web servers do understand relative URLs---they just have to start with a leading `/' when given in a GET (or other) command. I just fed ``/people/../index.html'' to my colocated server (telnet to port 80 and type the GET request in directly) and got the main index page at http://www.conman.org/ . So the web server can do the processing as well (at least Apache can).

> My suggestion is that the robot construct URLs with care -- always do what
> a browser would do and respect the fact that the HTTP server may need
> exactly the same stuff back as it put into the HTML. And always, always
> store exactly the URL used to retrieve a block of content. But implement
> some generic mechanism to generalize URL equality beyond strcmp(). Regular
> expression search and replace looks as promising as anything. Imagine something
> like this (with perlish regexp):
>
> URL-same: s'/(index|default).html?$'/'
>
> In other words, if the URL ends in "/index.html", "/default.html", "/index.htm" or
> "/default.htm" then drop all but the slash and we'll assume the URL will boil
> down to the same content.

  Is this for the robot configuration (on the robot end of things) or for something like robots.txt?
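Both mechanics under discussion can be sketched in Python using only the standard library: urllib.parse performs the RFC-1808-style dot-segment resolution a browser does, and the quoted URL-same rule translates directly to re.sub. This is only an illustration; the example.com URLs below are made up.

```python
import re
from urllib.parse import urljoin

# RFC-1808-style resolution: "." and ".." segments are collapsed against
# the base URL, which is what a browser does before issuing the request.
resolved = urljoin("http://www.conman.org/people/", "../index.html")
print(resolved)  # http://www.conman.org/index.html

# The quoted rule, s'/(index|default).html?$'/', as a rewrite function
# (with the "." escaped so it matches only a literal dot):
def url_same(url):
    return re.sub(r"/(index|default)\.html?$", "/", url)

print(url_same("http://example.com/dir/index.html"))  # http://example.com/dir/
print(url_same("http://example.com/a/default.htm"))   # http://example.com/a/
print(url_same("http://example.com/indexes.html"))    # no match, unchanged
```

Note that the rewrite only guesses at equivalence: nothing obliges a server to actually serve the same content for the two forms, which is the question raised in the rest of this message.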
> URL-same: s'[^/]+/..(/|$)''	# condense ".."

  Make sure you follow RFC-1808, though.

> URL-same: tr'A-Z'a-z'		# case fold the whole thing 'cause why not?

  Because not every web server is case-insensitive. The host portion is (it has to be---DNS is case-insensitive), but the path portion (at least in the standards) is not. Okay, some sites (like AOL) treat URLs as case-insensitive, but not all sites do.

> And something for the "pathological sites"
>
> URL-same: s'^(http://boston.conman.org/.*/)0+)'$1'g
> URL-same: s'^(http://boston.conman.org/.*\.[0-9]*)0+(/|$)'$1$2'g

  What, exactly, does that map? Because I assure you that

	http://boston.conman.org/2001/11/17.2

is not the same as

	http://boston.conman.org/2001/11/17

even though the latter contains the content of the former (plus other entries from that day). But ...

	http://boston.conman.org/2000/8/10.2-15.5
and
	http://boston.conman.org/2000/8/10.2-8/15.5

do return the same content (in other words, those are equivalent), whereas

	http://boston.conman.org/2000/8/10.2-15.5
and
	http://boston.conman.org/2000/8/10-15

are not (but again, the latter contains the content of the former).

  (Yet one more odd case. This:

	http://boston.conman.org/1999

and this:

	http://boston.conman.org/1999/12

and this:

	http://boston.conman.org/1999/12/4-15

are all the same, but only because I started keeping entries in December of 1999. You can repeat this for a couple of other variations.)

> It would be so cool if a robot could discover these patterns for itself. Seems
> like it would be a small scale version of covering boston.conman.org's other
> "problem" of multiple overlapping data views.

  I'm not so sure of that 8-)

  -spc (I calculated that http://bible.conman.org/kj/ has over 15 million different URL views into the King James Bible ... )