>What C library are you using?  Your message above implies libc5
>on Linux.  I'm using Red Hat Linux 4.2 on my web server, which comes
>with libc 5.3.12.  I've tried all sorts of things, and I've come to the
>conclusion that locale support in this C library is hopelessly broken.
>I could not get it to work despite all my attempts.

The web server where I run ht://Dig is a Slackware box (2.0.35) with glibc1
(libc5).

>On the other hand, with Red Hat Linux 5.2, which uses glibc, locales
>seem to work without any difficulties at all.

I am now willing to try ht://Dig on another Linux machine with Red Hat 6.0,
where locales seem to work. If it works, I can try to move to a more
recent platform, but the problem with broken locales is still there ...

>I've thought of how ht://Dig could be fixed to work with broken locales.
>The extra_word_characters attribute is a good first step.  If you add
>all the accented characters to this, they'll get indexed.

Thanks a lot for this hint ... At least, now I get them indexed.

>The problem
>is ht://Dig won't know how to convert them from uppercase to lowercase,
>or vice versa.  I've thought of adding extra_word_casemap as a means
>of specifying these mappings.  In this way, the HtWordType functions
>would supplement all the ctype stuff, in a way that's user configurable.
>It's a shame that we'd need to resort to this, because this is exactly
>what the locale stuff is supposed to do for us, but with so many broken
>locales out there, I think there's a need for this.
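
A minimal sketch of how such a user-configurable case map could supplement
the ctype functions. Only the attribute name extra_word_casemap comes from
the proposal above; the CaseMap class, its pair-string format and the
ISO-8859-1 byte values are assumptions made for illustration, not existing
ht://Dig code.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: case conversion that falls back to <cctype> and is
// extended by user-supplied lowercase/uppercase byte pairs.
class CaseMap {
public:
    // pairs holds lowercase/uppercase bytes back to back,
    // e.g. "\xE8\xC8" for e-grave in ISO-8859-1 (an assumed format).
    explicit CaseMap(const std::string &pairs) {
        assert(pairs.size() % 2 == 0);  // every entry needs both halves
        // Start from whatever the C library already knows.
        for (int i = 0; i < 256; ++i) {
            lower_[i] = static_cast<unsigned char>(std::tolower(i));
            upper_[i] = static_cast<unsigned char>(std::toupper(i));
        }
        // Override with the configured pairs.
        for (std::string::size_type i = 0; i + 1 < pairs.size(); i += 2) {
            unsigned char lo = static_cast<unsigned char>(pairs[i]);
            unsigned char up = static_cast<unsigned char>(pairs[i + 1]);
            lower_[up] = lo;
            upper_[lo] = up;
        }
    }

    std::string lower(const std::string &s) const { return apply(s, lower_); }
    std::string upper(const std::string &s) const { return apply(s, upper_); }

private:
    static std::string apply(const std::string &s, const unsigned char *table) {
        std::string out;
        out.reserve(s.size());
        for (std::string::size_type i = 0; i < s.size(); ++i)
            out += static_cast<char>(table[static_cast<unsigned char>(s[i])]);
        return out;
    }

    unsigned char lower_[256];
    unsigned char upper_[256];
};
```

With CaseMap m("\xE8\xC8"), m.lower() maps byte 0xC8 to 0xE8 even when the
C library's locale tables do not know about it; plain ASCII still goes
through tolower/toupper.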

I am not very hooked on fuzzy algorithms (obviously, it goes w/out saying
... ;-) ), but is it a problem to link single chars to strings of 2 chars?
Let me try to explain better ...

If I want to search for a word ending in 'è', I also have to search for
"e'", which is 2 chars long. And so:

'è' <-> "e'"
'È' <-> "E'"

And vice versa ... On the other hand, I don't think we need this conversion
in the middle of a word (or rather, that is how we usually do it in Italian).
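
Just to make the idea concrete, here is a rough sketch of the end-of-word
expansion. It assumes ISO-8859-1 bytes (0xE8 and 0xC8 for the accented
letter in lower and upper case); the names TrailingPair, kPairs and
trailingVariants are made up for this example and are not part of ht://Dig.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One accented byte and its two-character plain spelling.
struct TrailingPair {
    char accented;       // single accented byte (ISO-8859-1 assumed)
    const char *ascii;   // letter plus apostrophe, e.g. "e'"
};

static const TrailingPair kPairs[] = {
    { static_cast<char>(0xE8), "e'" },  // lowercase e-grave
    { static_cast<char>(0xC8), "E'" },  // uppercase E-grave
};

// Returns the word itself plus, if it ENDS in one of the mapped forms,
// the alternative spelling. Occurrences in the middle are left alone.
std::vector<std::string> trailingVariants(const std::string &word) {
    std::vector<std::string> out;
    out.push_back(word);
    for (const TrailingPair &p : kPairs) {
        const std::string ascii(p.ascii);
        assert(ascii.size() == 2);
        if (!word.empty() && word[word.size() - 1] == p.accented) {
            // accented ending -> add the two-character spelling
            out.push_back(word.substr(0, word.size() - 1) + ascii);
        } else if (word.size() >= 2 &&
                   word.compare(word.size() - 2, 2, ascii) == 0) {
            // two-character ending -> add the accented spelling
            out.push_back(word.substr(0, word.size() - 2) + p.accented);
        }
    }
    return out;
}
```

A search for either spelling would then look up every string returned by
trailingVariants(), so both forms match the same documents.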

>As for mapping accented to unaccented letters, as Geoff said, this has
>been discussed to some length about a week or so ago.  My suggestion
>was to implement it something like soundex, where it will go through the
>word database after htdig/htmerge, and create another database keyed on
>the canonical (unindexed) form of all of these words.  This algorithm
>could be configured either through a file, or perhaps better still,
>a config attribute (which could be taken from a file if desired) such
>as accent_map.  This map would allow you to specify precisely how to map
>various accented letters or digraphs to certain canonical representations.
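
A small sketch of what that second database could look like in memory,
under stated assumptions: the attribute name accent_map is from the text
above, while the AccentMap type, the functions canonicalForm and
buildCanonicalIndex, and the limit of two bytes per map key are
hypothetical and only illustrate the idea.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Maps an accented letter or a digraph (1 or 2 bytes) to its canonical form.
typedef std::map<std::string, std::string> AccentMap;
typedef std::map<std::string, std::vector<std::string> > CanonicalIndex;

// Rewrites a word into its canonical form, preferring 2-byte matches.
std::string canonicalForm(const std::string &word, const AccentMap &map) {
    std::string out;
    std::string::size_type i = 0;
    while (i < word.size()) {
        bool matched = false;
        for (std::string::size_type len = 2; len >= 1 && !matched; --len) {
            if (i + len > word.size())
                continue;
            AccentMap::const_iterator it = map.find(word.substr(i, len));
            if (it != map.end()) {
                out += it->second;
                i += len;
                matched = true;
            }
        }
        if (!matched)
            out += word[i++];  // unmapped byte is copied as-is
    }
    return out;
}

// Builds an index keyed on the canonical form, listing every original
// word that collapses to that key (similar in spirit to a soundex table).
CanonicalIndex buildCanonicalIndex(const std::vector<std::string> &words,
                                   const AccentMap &map) {
    for (AccentMap::const_iterator it = map.begin(); it != map.end(); ++it)
        assert(!it->first.empty() && it->first.size() <= 2);
    CanonicalIndex index;
    for (std::vector<std::string>::size_type i = 0; i < words.size(); ++i)
        index[canonicalForm(words[i], map)].push_back(words[i]);
    return index;
}
```

At search time the query word would be run through canonicalForm() as well,
and every word stored under that key would be added to the search.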

I wish I could contribute to this, but right now I am too busy and,
moreover, as soon as I can start contributing again, I have to set up the
HtHTTP and Transport classes and the retrieving code ... Shame on me !!! I
am also waiting for 2 big C++ books and, with the new year comin', I want to
dedicate more time to studying C++ and OO programming (I have been standing
still for a long time ...).

And, as if that were not enough, I also want to spend time with my
girlfriend: HEY, she's American, 21 and really BEAUTIFUL !!! Will you ever
forgive me if, for now, I can't dedicate much spare time to programming?
;-@ - Just kidding ...

But if you have some guidelines, I will be very glad to help anybody who
wants to work on that ...

Ciao
-Gabriele


-------------------------------------------------

Gabriele Bartolini
Computer Programmer (are U sure?)
U.O. Rete Civica - Comune di Prato
Prato - Italia - Europa

e-mail: [EMAIL PROTECTED]
http://www.po-net.prato.it

"Life teaches you never stop learning ..."

-------------------------------------------------
