On Tue, 7 Dec 1999, Gilles Detillieux wrote:
> Date: Tue, 7 Dec 1999 16:49:58 -0600 (CST)
> From: Gilles Detillieux <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED], [EMAIL PROTECTED]
> Subject: Re: [htdig3-dev] Re: htdig-3.1.4 prerelease
>
> No, you don't want url.lowercase(); in Need2Get() anymore! It can
> break things on case-sensitive servers, where names that differ only
> in case can legitimately refer to separate files. In 3.1.4, URLs are
> converted to lowercase when they're parsed (in URL.cc), but only when
> case_sensitive is set to false.
Thanks; I'll remove it.
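For anyone following along in the archives, my understanding of the 3.1.4
behaviour boils down to something like this. This is only a self-contained
sketch, NOT the literal URL.cc code (the real code is more selective -
hostnames, for instance, are always case-insensitive); "case_sensitive" is
the actual config attribute, but the function name is mine:

    #include <algorithm>
    #include <cctype>
    #include <iostream>
    #include <string>

    // Lowercase a URL at parse time only when the case_sensitive
    // attribute is false, so case-sensitive servers keep names that
    // differ only in case distinct.
    std::string parse_normalize(std::string url, bool case_sensitive)
    {
        if (!case_sensitive)
            std::transform(url.begin(), url.end(), url.begin(),
                           [](unsigned char c)
                           { return static_cast<char>(std::tolower(c)); });
        return url;
    }

    int main()
    {
        // With case_sensitive false the two spellings collapse to one.
        std::cout << parse_normalize("http://Example.COM/Index.HTML", false)
                  << "\n"
                  << parse_normalize("http://Example.COM/Index.HTML", true)
                  << "\n";
    }

Since the URL is already normalized by the time Need2Get() sees it, the
extra lowercase() call there is redundant at best.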
> > What other changes need to be made?
>
> None that I can think of. Please try the patch I just posted, on an
> unpatched 3.1.4 prerelease Retriever.cc.
I will.
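In case anyone else wants to try it too, applying it should just be the
usual routine (the patch file name here is made up, and I'm assuming
Retriever.cc lives in the htdig/ subdirectory as in the stock tree):

    cd htdig-3.1.4/htdig
    patch Retriever.cc < local_fs.patch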
> As you probably know, .shtml files are not fetched from the local
> file system, so this is unrelated to the patch we've been discussing.
> The problem of extra path information on SSI documents was actually
> discussed at great length last week - the problem is there's not a lot
> we can do about it in htdig, other than adding .shtml/ to exclude_urls,
> which you may want to do.
I just did it and reran the dig. It took care of those duplicates ;)
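For the archives, the line in my conf now reads roughly like this (I
believe the first two patterns are the stock defaults):

    exclude_urls:    /cgi-bin/ .cgi .shtml/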
> As a quick recap, the SSI problem occurs when you have an href to an
> SSI document that has an extra slash (/) at the end of it. This makes
> the URL look like a directory URL to htdig (or ANY web client), so any
> relative hrefs in the document are resolved as being under the document
> itself, rather than under the directory that contains it.
> Try it yourself in a web browser to see what happens.
>
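To make that concrete with made-up URLs: suppose some page links to
foo.shtml with a stray trailing slash, so the document gets fetched as

    http://www.example.com/docs/foo.shtml/

A relative href="pics/logo.gif" inside foo.shtml then resolves to

    http://www.example.com/docs/foo.shtml/pics/logo.gif   (wrong)

instead of

    http://www.example.com/docs/pics/logo.gif             (right)

and because the server happily serves foo.shtml again for the extra path
info, the wrong URLs "work" too, which is what lets the hierarchy recurse
indefinitely.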
> The solution, of course, is to hunt down these defective hrefs and strip
> off the trailing slash. It's also a good idea to use only absolute hrefs
> within SSI documents, to avoid this problem when there is a faulty link.
> The exclude_urls hack above is a good precaution for htdig, but it won't
> solve the problem for other spiders or web clients. It's not a generally
> known fact that SSI documents are much more like CGI programs than
> they are like static HTML pages, and great care must be taken to keep
> them from presenting infinite hierarchies to web clients.
Thanks very much for the clarification.
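For anyone who needs to hunt these down, something like this should flag
the suspect hrefs (GNU grep; the pattern is only a rough heuristic, and
the file list is site-specific):

    grep -n 'href="[^"]*\.shtml/"' *.html *.shtml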
> I can't understand why you didn't run into this with htdig 3.1.3 - the
> problem definitely was there then and in previous releases. Did you
> add .shtml/ to exclude_urls in the config for 3.1.3, but not 3.1.4?
I can't understand it either. No, I never had .shtml/ in my exclude_urls.
This brings up an interesting point: under 3.1.3, a search for the word
Majordomo on my site reported 28 results; under 3.1.4 without .shtml/ in
exclude_urls, it reported 54, including the SSI-mangled URLs. Under 3.1.4
with .shtml/ in exclude_urls, it reports 47 results ;-/
So excluding .shtml/ dropped the 7 mangled URLs (54 - 47 = 7), and 3.1.4
still finds 19 more unique files than 3.1.3 (47 - 28 = 19) in this
particular search, one of them being the formerly mangled .shtml file ;)
I had jumped to conclusions and reported duplicates in the results ;(
> It's not included in the releases, because it's considered too much of a
> hack, I assume. I think at one point, it was added to the 3.2 source
> tree, but taken out again. I've ported the patch to 3.1.4, and I
> suppose I can do likewise to 3.2.0b1 when it comes out, although you
> should be able to do this yourself pretty easily. Just apply the code
> to Retriever.cc, which you'll likely have to do manually for 3.2 as
> Need2Get() has changed, then "diff -up Retriever.cc.orig Retriever.cc".
Understood. I have placed it in the 3.1.4 folder on the patch site.
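For the record, the recipe I'd follow for 3.2 (file and directory names
here are my own guesses):

    cd htdig-3.2.0b1/htdig
    cp Retriever.cc Retriever.cc.orig
    # hand-merge the Need2Get() changes into Retriever.cc, then:
    diff -up Retriever.cc.orig Retriever.cc > local_fs-3.2.patch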
> Are there any other old patches that you think were overlooked, which
> should be considered for 3.2?
No, that was it. Thank you very much, Gilles.
Best regards,
Joe
--
(ASCII-art bicycle signature)
Joe  [EMAIL PROTECTED]