-----Original Message-----
From: Sean 'Captain Napalm' Conner [mailto:[EMAIL PROTECTED]]
Sent: Friday, November 23, 2001 11:26 PM
To: [EMAIL PROTECTED]
Subject: Re: [Robots] Re: Correct URL, shlash at the end ?


It was thus said that the Great George Phillips once stated:
>
> Don't be misled by relative URLs.  Yes, they use "." and "..".  Yes,
> "/" is very important.  Yes, they operate almost identically to
> UNIX relative paths (but different enough to keep us on our toes).
> Yes, they are extremely useful.  But they're just rules that take
> the stuff you used to get the current page and some relative stuff to
> construct new stuff -- all done by the browser.  The web server only
> understands pure, unadulterated, unrelative stuff.

  There are rules for parsing relative URLs in RFC-1808 and no, web servers
do understand them; the request path just has to start with a leading `/'
when given in a GET (or other) command.  I just fed ``/people/../index.html''
to my colocated server (telnet to port 80, feed in the GET request directly)
and I got the main index page at http://www.conman.org/ .  So the webserver
can do the processing as well (at least Apache can).
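
  In fact the robot doesn't even have to rely on the server for that; the
RFC-1808 resolution is easy enough to do on the client side.  A rough
sketch using the CPAN URI module (the URLs are just examples):

        use URI;

        # Resolve a relative reference against a base URL the way a
        # browser would (per RFC-1808), rather than handing the raw
        # relative path to the server.
        my $base = URI->new("http://www.conman.org/people/");
        my $abs  = URI->new_abs("../index.html", $base);
        print "$abs\n";    # http://www.conman.org/index.html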

> My suggestion is that the robot construct URLs with care -- always do what
> a browser would do and respect the fact that the HTTP server may need
> exactly the same stuff back as it put into the HTML.  And always, always
> store exactly the URL used to retrieve a block of content.  But implement
> some generic mechanism to generalize URL equality beyond strcmp().  Regular
> expression search and replace looks as promising as anything.  Imagine
> something like this (with perlish regexp):
>
> URL-same:     s'/(index|default).html?$'/'
>
> In other words, if the URL ends in "/index.html", "/default.html",
> "/index.htm" or "/default.htm" then drop all but the slash and we'll
> assume the URL will boil down to the same content.

  Is this for the robot configuration (on the robot end of things) or for
something like robots.txt?
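
  Either way, the machinery itself is simple enough.  A rough sketch of how
I'd picture a robot applying such rules before comparing URLs (the rule
list and names below are mine, nothing standard):

        use strict;

        # Run a URL through every URL-same rule; store and compare the
        # result instead of the raw URL.
        my @url_same = (
            sub { $_[0] =~ s{/(index|default)\.html?$}{/} },
        );

        sub canonical
        {
          my ($url) = @_;
          $_->($url) for @url_same;
          return $url;
        }

        print "probably the same page\n"
            if canonical("http://www.example.com/stuff/index.html")
            eq canonical("http://www.example.com/stuff/");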

> URL-same:     s'[^/]+/..(/|$)''       # condense ".."

  Make sure you follow RFC-1808 though.
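
  If you do go the regexp route for that one, remember it has to be applied
repeatedly to handle nested cases, and the dots want escaping (otherwise
something like ``foo/ab/'' gets condensed too).  A rough sketch, not a full
RFC-1808 implementation:

        # Better run over just the path on a real robot, so a leading
        # ``/../'' can't eat the hostname; this example URL is safe.
        my $url = "http://www.example.com/a/b/../../c.html";
        1 while $url =~ s{[^/]+/\.\.(/|$)}{};
        print "$url\n";    # http://www.example.com/c.html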

> URL-same:     tr'A-Z'a-z'             # case fold the whole thing 'cause why not?

  Because not every webserver is case insensitive.  The host portion is (it
has to be; DNS is case insensitive) but the path portion (at least as far
as the standards go) is not.  Okay, some sites (like AOL) treat URLs as
case insensitive, but not all sites do.
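
  What a robot can safely case fold is the scheme and the host, and nothing
else.  If I'm remembering the URI module right, its canonical() method does
exactly that (plus a bit of other tidying, like dropping an explicit :80):

        use URI;

        # Lowercases the scheme and host but leaves the path alone.
        my $u = URI->new("HTTP://WWW.CONMAN.ORG/People/Index.html");
        print $u->canonical, "\n";   # http://www.conman.org/People/Index.html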

> And something for the "pathological sites"
>
> URL-same:     s'^(http://boston.conman.org/.*/)0+'$1'g
> URL-same:     s'^(http://boston.conman.org/.*\.[0-9]*)0+(/|$)'$1$2'g

  What, exactly, does that map to?  Because I assure you that

                    http://boston.conman.org/2001/11/17.2

  is not the same as:

                     http://boston.conman.org/2001/11/17

  even though the latter contains the content of the former (plus other
entries from that day).  But ...

                  http://boston.conman.org/2000/8/10.2-15.5

  and

                 http://boston.conman.org/2000/8/10.2-8/15.5

  do return the same content (in other words, those are equivalent), whereas:

                  http://boston.conman.org/2000/8/10.2-15.5

  and

                    http://boston.conman.org/2000/8/10-15

  are not (but again, the latter contains the content of the former).

  (Yet one more odd case.  This:

                        http://boston.conman.org/1999

  and this:

                      http://boston.conman.org/1999/12

  and this:

                    http://boston.conman.org/1999/12/4-15

  are the same, but only because I started keeping entries in December of
1999.  You can repeat this for a couple of other variations.)
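
  All of which is why I'd want any URL-same rule set checked against a list
of URL pairs known to be the same (and known to be different).  A sketch,
reusing the canonical() routine from the earlier sketch and the URLs above:

        my @known_same = (
          [ "http://boston.conman.org/2000/8/10.2-15.5",
            "http://boston.conman.org/2000/8/10.2-8/15.5" ],
        );
        my @known_different = (
          [ "http://boston.conman.org/2001/11/17.2",
            "http://boston.conman.org/2001/11/17" ],
        );

        for (@known_same) {
          print "rules miss an equivalence: $_->[0] vs $_->[1]\n"
              if canonical($_->[0]) ne canonical($_->[1]);
        }
        for (@known_different) {
          print "rules conflate distinct pages: $_->[0] vs $_->[1]\n"
              if canonical($_->[0]) eq canonical($_->[1]);
        }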

> It would be so cool if a robot could discover these patterns for itself.
> Seems like it would be a small scale version of covering
> boston.conman.org's other "problem" of multiple overlapping data views.

  I'm not so sure of that 8-)

  -spc (I calculated that http://bible.conman.org/kj/ has over 15 million
        different URL views into the King James Bible ... )


