On Tue, 7 Dec 1999, Gilles Detillieux wrote:
> Date: Tue, 7 Dec 1999 16:49:58 -0600 (CST)
> From: Gilles Detillieux <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED], [EMAIL PROTECTED]
> Subject: Re: [htdig3-dev] Re: htdig-3.1.4 prerelease
>
> No, you don't want url.lowercase(); in Need2Get() anymore! It can
> break things on case-sensitive servers, where names that differ only
> in case can legitimately refer to separate files. In 3.1.4, URLs are
> converted to lowercase when they're parsed (in URL.cc), but only when
> case_sensitive is set to false.
Thanks; I'll remove it.
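For anyone following along in the archives, my understanding of the 3.1.4
behaviour boils down to something like this. This is only a self-contained
sketch, NOT the literal URL.cc code (the real code is more selective -
hostnames, for instance, are always case-insensitive); "case_sensitive" is
the actual config attribute, but the function name is mine:

    #include <algorithm>
    #include <cctype>
    #include <iostream>
    #include <string>

    // Lowercase a URL at parse time only when the case_sensitive
    // attribute is false, so case-sensitive servers keep names that
    // differ only in case distinct.
    std::string parse_normalize(std::string url, bool case_sensitive)
    {
        if (!case_sensitive)
            std::transform(url.begin(), url.end(), url.begin(),
                           [](unsigned char c)
                           { return static_cast<char>(std::tolower(c)); });
        return url;
    }

    int main()
    {
        // With case_sensitive false the two spellings collapse to one.
        std::cout << parse_normalize("http://Example.COM/Index.HTML", false)
                  << "\n"
                  << parse_normalize("http://Example.COM/Index.HTML", true)
                  << "\n";
    }

Since the URL is already normalized by the time Need2Get() sees it, the
extra lowercase() call there is redundant at best.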
> > What other changes need to be made?
>
> None that I can think of. Please try the patch I just posted, on an
> unpatched 3.1.4 prerelease Retriever.cc.
I will.
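In case anyone else wants to try it too, applying it should just be the
usual routine (the patch file name here is made up, and I'm assuming
Retriever.cc lives in the htdig/ subdirectory as in the stock tree):

    cd htdig-3.1.4/htdig
    patch Retriever.cc < local_fs.patch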
> As you probably know, .shtml files are not fetched from the local
> file system, so this is unrelated to the patch we've been discussing.
> The problem of extra path information on SSI documents was actually
> discussed at great length last week - the problem is there's not a lot
> we can do about it in htdig, other than adding .shtml/ to exclude_urls,
> which you may want to do.
I just did it and reran the dig. It took care of those duplicates ;)
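For the archives, the line in my conf now reads roughly like this (I
believe the first two patterns are the stock defaults):

    exclude_urls:    /cgi-bin/ .cgi .shtml/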
> As a quick recap, the SSI problem occurs when you have an href to an
> SSI document that has an extra slash (/) at the end of it. This makes
> the URL look like a directory URL to htdig (or ANY web client), so any
> relative hrefs in the document are resolved as being under the document
> itself, rather than under the directory that contains it.
> Try it yourself in a web browser to see what happens.
>
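To make that concrete with made-up URLs: suppose some page links to
foo.shtml with a stray trailing slash, so the document gets fetched as

    http://www.example.com/docs/foo.shtml/

A relative href="pics/logo.gif" inside foo.shtml then resolves to

    http://www.example.com/docs/foo.shtml/pics/logo.gif   (wrong)

instead of

    http://www.example.com/docs/pics/logo.gif             (right)

and because the server happily serves foo.shtml again for the extra path
info, the wrong URLs "work" too, which is what lets the hierarchy recurse
indefinitely.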
> The solution, of course, is to hunt down these defective hrefs and strip
> off the trailing slash. It's also a good idea to use only absolute hrefs
> within SSI documents, to avoid this problem when there is a faulty link.
> The exclude_urls hack above is a good precaution for htdig, but it won't
> solve the problem for other spiders or web clients. It's not a generally
> known fact that SSI documents are much more like CGI programs than
> they are like static HTML pages, and great care must be taken to keep
> them from presenting infinite hierarchies to web clients.
Thanks very much for the clarification.
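For anyone who needs to hunt these down, something like this should flag
the suspect hrefs (GNU grep; the pattern is only a rough heuristic, and
the file list is site-specific):

    grep -n 'href="[^"]*\.shtml/"' *.html *.shtml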
> I can't understand why you didn't run into this with htdig 3.1.3 - the
> problem definitely was there then and in previous releases. Did you
> add .shtml/ to exclude_urls in the config for 3.1.3, but not 3.1.4?
I can't understand it either. No, I never had .shtml/ in my exclude_urls.
This brings up an interesting point: under 3.1.3, a search for the word
Majordomo on my site reported 28 results; under 3.1.4 without .shtml/ in
exclude_urls, it reported 54, including the SSI-mangled URLs. Under 3.1.4
with .shtml/ in exclude_urls, it reports 47 results ;-/
So excluding .shtml/ dropped the 7 mangled URLs (54 - 47 = 7), and 3.1.4
still finds 19 more unique files than 3.1.3 (47 - 28 = 19) in this
particular search, one of them being the formerly mangled .shtml file ;)
I had jumped to conclusions and reported duplicates in the results ;(
> It's not included in the releases, because it's considered too much of a
> hack, I assume. I think at one point, it was added to the 3.2 source
> tree, but taken out again. I've ported the patch to 3.1.4, and I
> suppose I can do likewise to 3.2.0b1 when it comes out, although you
> should be able to do this yourself pretty easily. Just apply the code
> to Retriever.cc, which you'll likely have to do manually for 3.2 as
> Need2Get() has changed, then "diff -up Retriever.cc.orig Retriever.cc".
Understood. I have placed it in the 3.1.4 folder on the patch site.
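For the record, the recipe I'd follow for 3.2 (file and directory names
here are my own guesses):

    cd htdig-3.2.0b1/htdig
    cp Retriever.cc Retriever.cc.orig
    # hand-merge the Need2Get() changes into Retriever.cc, then:
    diff -up Retriever.cc.orig Retriever.cc > local_fs-3.2.patch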
> Are there any other old patches that you think were overlooked, which
> should be considered for 3.2?
No, that was it. Thank you very much, Gilles.
Best regards,
Joe
--
(ASCII-art bicycle signature)
Joe  [EMAIL PROTECTED]