According to Patrick:
> I've been battling with the case_sensitive issue for a while now.  It
> seems that declaring "case_sensitive: false" will automatically
> lowercase the URLs (performed in ../htlib/URL.cc).  This seems like
> a great idea, however, I think a more logical procedure would be to
> not automatically lowercase the URL from the get-go and only lower
> case the URL temporarily when performing comparisons to previously
> crawled/queued URLs.
> 
> Basically, what is happening is that the university's web server uses
> Apache's mod_speling.  Upon a URL case-sensitivity mismatch (ex:
> http://www.foo.com/DOCUMENT is the request, but http://www.foo.com/document
> is the true document name), the module will send an automatic

I assume you meant this the other way around.  If the true document name
is lowercase, there would be no problem.  Right?

> 301 Moved Permanently message -- a message that htdig does NOT follow,
> regardless of the case_sensitive argument.

The htdig code does indeed recognize any 30x return code, other than 304,
as a redirect.  The problem you're having is probably this: if a lowercase
URL is redirected to an uppercase or mixed case URL, the code as it is
now will likely set it back to lowercase if case_sensitive is false,
thus nullifying the effect of the redirect.
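
To make the failure mode concrete, here is a rough sketch of what I think
is going on.  It is not the actual htdig code: lowercase_url, is_redirect
and redirect_is_nullified are names made up for this illustration, and
lowercase_url only stands in for whatever URL.cc does when case_sensitive
is false.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical stand-in for the lowercasing applied to URLs when
// case_sensitive is false (not the real URL.cc routine).
std::string lowercase_url(std::string url)
{
    std::transform(url.begin(), url.end(), url.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return url;
}

// Any 30x status other than 304 counts as a redirect.
bool is_redirect(int status)
{
    return status >= 300 && status <= 309 && status != 304;
}

// Returns true when normalizing the redirect target yields the very URL
// that was just requested, so following the redirect changes nothing.
bool redirect_is_nullified(const std::string &requested,
                           const std::string &location,
                           bool case_sensitive)
{
    const std::string target = case_sensitive ? location : lowercase_url(location);
    return target == requested;
}
```

With case_sensitive false, a request for http://www.foo.com/document that
is redirected to http://www.foo.com/DOCUMENT ends up back at the lowercase
form, which is the URL that triggered the redirect in the first place.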

However, this suggests that your server is definitely case sensitive,
so you shouldn't set this attribute to false when indexing this server.
Instead, set it to true, and let the server's redirects do their work,
correcting the URLs to their proper form, which htdig will use in the
database.

> Long story short: where/how can the code be modified so that the actual
> URL is NOT lowercased automatically, but rather, is only lowercased
> temporarily when doing a comparison to other queued/crawled URLs
> (which will also be temporarily lowercased during the comparison
> process)?

I imagine you could take the case_sensitive processing out of URL.cc,
and put some processing around the handling of the "visited" dictionary,
in htdig/Retriever.cc.  I don't know whether that will have any other
repercussions, though, especially for a truly case insensitive server.
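
Something along these lines is what I have in mind.  This is only a
sketch under assumptions: VisitedUrls is a hypothetical class, not the
"visited" dictionary as it exists in Retriever.cc, and it uses standard
library containers rather than htdig's own.  The idea is that the URL is
stored exactly as received, and only the lookup key is lowercased when
case_sensitive is false.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_map>

// Hypothetical "visited" table: compares URLs case-insensitively when
// asked to, but never alters the URL that is stored.
class VisitedUrls {
public:
    explicit VisitedUrls(bool case_sensitive) : case_sensitive_(case_sensitive) {}

    // Records url.  Returns true if it was new, false if an equivalent
    // URL (under the current case rule) had already been recorded.
    bool mark(const std::string &url)
    {
        return seen_.emplace(key_for(url), url).second;
    }

    bool contains(const std::string &url) const
    {
        return seen_.count(key_for(url)) != 0;
    }

    // Returns the spelling that was first recorded for this URL, or an
    // empty string if no equivalent URL has been recorded.
    std::string stored_form(const std::string &url) const
    {
        auto it = seen_.find(key_for(url));
        return it == seen_.end() ? std::string() : it->second;
    }

private:
    // The comparison key: the URL itself, or a temporary lowercase copy.
    std::string key_for(std::string url) const
    {
        if (!case_sensitive_) {
            std::transform(url.begin(), url.end(), url.begin(),
                           [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        }
        return url;
    }

    bool case_sensitive_;
    std::unordered_map<std::string, std::string> seen_;
};
```

Note that with this approach whichever spelling is seen first is the one
that gets kept, which may or may not be what you want on a truly case
insensitive server.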

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


