According to Gabriele Bartolini:
> >What C library are you using? Your message above implies libc5
> >on Linux. I'm using Red Hat Linux 4.2 on my web server, which comes
> >with libc 5.3.12. I've tried all sorts of things, and I've come to the
> >conclusion that locale support in this C library is hopelessly broken.
> >I could not get it to work despite all my attempts.
>
> The web server where I run ht://dig is a Slackware box (kernel 2.0.35)
> with libc5 (glibc 1).
Sounds like the same broken code I have.
> >On the other hand, with Red Hat Linux 5.2, which uses glibc, locales
> >seem to work without any difficulties at all.
>
> I am now willing to try ht://dig on another Linux machine with Red Hat 6.0,
> where locales seem to work. If it works, I can try to move to a more
> recent platform, but the problem with broken locales is still alive ...
My tests seem to confirm that locales work fine with glibc 2.0 & 2.1,
so Red Hat 6.0 should be OK. You'll want to make sure you install all
of Red Hat's updates to 6.0, though, because there were a number of other
problems with 6.0, which the updates seem to fix very well.
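(For anyone who wants to check their own libc before blaming ht://Dig,
a quick sanity test is to see whether tolower() honours the locale for
ISO-8859-1 accented letters. Something along these lines should do; the
locale name is only an example and may differ on your system:

    #include <stdio.h>
    #include <ctype.h>
    #include <locale.h>

    int main(void)
    {
        if (setlocale(LC_ALL, "it_IT") == NULL) {
            printf("it_IT locale not available\n");
            return 1;
        }
        /* 0xC8 is E-grave in ISO-8859-1.  A working locale prints
           0xe8 (e-grave); a broken one leaves it unchanged. */
        printf("tolower(0xC8) = 0x%02x\n", tolower(0xC8));
        return 0;
    }

On the broken libc5 systems described above, you may well find that
setlocale() succeeds but tolower() still folds only plain ASCII.)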
> >I've thought of how ht://Dig could be fixed to work with broken locales.
> >The extra_word_characters attribute is a good first step. If you add
> >all the accented characters to this, they'll get indexed.
>
> Thanks a lot for this hint ... At least, now I get them indexed.
Yeah, but for searches on accented letters, you'll have to use not only
the right accent, but also the right case (upper or lower), for the search
to find a match.
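(For the archives, the workaround looks something like this in the
config file, with both cases listed precisely because of the case-folding
problem just mentioned; the character list is only an Italian-flavoured
example:

    extra_word_characters: àèéìòùÀÈÉÌÒÙ

Until the case mapping discussed below is in place, searches have to
match the stored case exactly.)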
> >The problem
> >is ht://Dig won't know how to convert them from uppercase to lowercase,
> >or vice versa. I've thought of adding extra_word_casemap as a means
> >of specifying these mappings. In this way, the HtWordType functions
> >would supplement all the ctype stuff, in a way that's user configurable.
> >It's a shame that we'd need to resort to this, because this is exactly
> >what the locale stuff is supposed to do for us, but with so many broken
> >locales out there, I think there's a need for this.
>
> I am not very hooked on fuzzy algorithms (obviously, it goes w/out saying
> ... ;-) ), but is it a problem to link single chars to strings of 2 chars?
> Let me try to explain better ...
>
> if I want to search for a word ending in 'è', I also have to search for
> "e'", which is 2 chars long. And so:
>
> 'è' <-> "e'"
> 'È' <-> "E'"
>
> And vice versa ... On the other hand, I don't think we need this conversion
> in the middle of the word (or rather, that's how we usually do it in
> Italian).
I think the fuzzy algorithms can be greatly improved with user input and
some code contributions from willing developers, so you may change
your mind about them. In particular, I think the regex and accent fuzzy
algorithms will be very useful, and currently there are many sites that
would not want to do without the endings algorithm.
As for the accents algorithm we've discussed, I think that it is very
feasible to handle one to many as well as many to one character mappings.
The way I envision implementing the accent_map attribute I proposed would
be much like the way url_part_aliases is implemented, using Hans-Peter's
WordCodec stuff. That should fit the bill, and I think it would allow
handling all the variations discussed so far.
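None of this is written yet, but to make the one-to-many point concrete:
if the map holds string pairs rather than single characters, Gabriele's
'è' <-> "e'" case falls out for free. A rough sketch, with all names
invented for illustration (the real thing would presumably go through the
WordCodec interface instead):

    #include <cstddef>
    #include <string>

    // Invented for illustration; not an actual ht://Dig structure.
    struct AccentPair {
        const char *accented;   // ISO-8859-1 accented form
        const char *plain;      // multi-character ASCII convention
    };

    static const AccentPair accent_map[] = {
        { "\xE8", "e'" },       // e-grave <-> e + apostrophe
        { "\xC8", "E'" },       // E-grave <-> E + apostrophe
    };

    // Substitute each accented form with its plain equivalent,
    // e.g. "caff\xE8" -> "caffe'".  Running the table in the other
    // direction gives the decoding step.
    std::string EncodeAccents(std::string word)
    {
        for (std::size_t i = 0;
             i < sizeof(accent_map) / sizeof(accent_map[0]); ++i) {
            const std::string from = accent_map[i].accented;
            const std::string to   = accent_map[i].plain;
            std::string::size_type pos = 0;
            while ((pos = word.find(from, pos)) != std::string::npos) {
                word.replace(pos, from.length(), to);
                pos += to.length();
            }
        }
        return word;
    }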
> >As for mapping accented to unaccented letters, as Geoff said, this has
> >been discussed to some length about a week or so ago. My suggestion
> >was to implement it something like soundex, where it will go through the
> >word database after htdig/htmerge, and create another database keyed on
> >the canonical (unindexed) form of all of these words. This algorithm
> >could be configured either through a file, or perhaps better still,
> >a config attribute (which could be taken from a file if desired) such
> >as accent_map. This map would allow you to specify precisely how to map
> >various accented letters or digraphs to certain canonical representations.
>
> I wish I could contribute to this, but right now I am too busy and,
> moreover, as soon as I can restart contributing, I have to set up the
> HtHTTP and Transport classes and the Retrieving code ... Shame on me !!! I
> am also waiting for 2 big C++ books, and with the new year comin', I want
> to dedicate more time to studying C++ and OO programming (I have been idle
> for a long time ...).
>
> And, if it's not enough, I also want to stay with my girlfriend: HEY, she's
> American, 21 and really BEAUTIFUL !!! Will you ever forgive me if, for now,
> I can't dedicate much spare time to programming? ;-@ - Just kidding ...
>
> But if you have some directives I will be very glad to help anybody who
> wants to do that ...
We all have to choose our priorities, I guess, and I can't argue with
that. :-)
I don't have any solid directives about how to implement this, but I
think the general guidelines I've suggested in the past few discussions
on this are a good starting point. The main suggestions I have are
to base it on the soundex (or metaphone) algorithm, but of course with
very different rules for coming up with the canonical form of a word.
The canonicalization could be done using the WordCodec code, and specified
by an attribute such as accent_map, handled very much like the way the
url_part_aliases attribute is implemented.
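To make the soundex parallel concrete, the canonicalization step might
look roughly like the sketch below: fold case and strip accents so that,
say, "caffè", "Caffé" and "caffe" all collapse to the same key. The
hard-wired table is purely illustrative; in the real thing it would be
driven by the accent_map attribute:

    #include <string>

    // Illustrative only: fold an ISO-8859-1 word to a lowercase,
    // unaccented canonical key.  Only the Italian vowels are shown.
    std::string Canonical(const std::string &word)
    {
        std::string key;
        for (std::string::size_type i = 0; i < word.length(); ++i) {
            unsigned char c = word[i];
            switch (c) {
            case 0xE0: case 0xE1: case 0xC0: case 0xC1: key += 'a'; break;
            case 0xE8: case 0xE9: case 0xC8: case 0xC9: key += 'e'; break;
            case 0xEC: case 0xED: case 0xCC: case 0xCD: key += 'i'; break;
            case 0xF2: case 0xF3: case 0xD2: case 0xD3: key += 'o'; break;
            case 0xF9: case 0xFA: case 0xD9: case 0xDA: key += 'u'; break;
            default:
                if (c >= 'A' && c <= 'Z')
                    key += char(c - 'A' + 'a');
                else
                    key += char(c);
            }
        }
        return key;
    }

A pass over the word database after htdig/htmerge would then store
(Canonical(word), word) pairs in a second database, much as the soundex
fuzzy algorithm already does with its codes.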
I'd be interested in spending more time on this, because it seems like a
fun challenge, but right now I really can't justify the time. I'm way
behind on other job responsibilities right now. Surely there's some
developer in Western Europe, or elsewhere, who needs something like this,
and has the time and skills to make it happen.
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930