Re: Using libunistring for string comparisons et al

Ludovic Courtès Wed, 16 Mar 2011 04:26:49 -0700

Hello Mark,

Mark H Weaver <m...@netris.org> writes:


> Mike Gran <spk...@yahoo.com> writes:
>>> The reason I am still arguing this point is because I have looked
>>> seriously at what I would need to do to (A) fix our i18n problems and
>>> (B) make the code efficient.  I very much want to fix these things,
>>> but the pain of trying to do this with our current scheme is too much
>>> for me to bear.  I shouldn't have to rewrite libunistring, and I
>>> shouldn't have to write 3 or 4 different variants of each procedure
>>> that takes two string parameters.
>>
>> What procedures are giving incorrect results?
>
> I know of two categories of bugs.  One has to do with case conversions
> and case-insensitive comparisons, which must be done on entire strings
> but are currently done for each character.  Here are some examples:
>
>   (string-upcase "Straße")         => "STRAßE"  (should be "STRASSE")
>   (string-downcase "ΧΑΟΣΣ")        => "χαοσσ"   (should be "χαoσς")
>   (string-downcase "ΧΑΟΣ Σ")       => "χαοσ σ"  (should be "χαoς σ")
>   (string-ci=? "Straße" "Strasse") => #f        (should be #t)
>   (string-ci=? "ΧΑΟΣ" "χαoσ")      => #f        (should be #t)

(Mike pointed out that SRFI-13 does not consider these bugs, but that’s
linguistically wrong so I’d consider it a bug.  Note that all these
functions are ‘linguistically buggy’ anyway since they don’t have a
locale argument, which breaks with Turkish ‘İ’.)

Can we first check what would need to be done to fix this in 2.0.x?

At first glance:

  - “Straße” is normally stored as a Latin1 string, so it would need to
    be converted to UTF-* before it can be passed to one of the
    unicase.h functions.  *Or*, we could check with bug-libunistring
    what it would take to add Latin1 string case mapping functions.

    Interestingly, ‘ß’ is the only Latin1 character that doesn’t have a
    one-to-one case mapping.  All other Latin1 strings can be handled by
    iterating over characters, as is currently done.

    With this in mind, we could hack our way so that strings that
    contain an ‘ß’ are stored as UTF-32 (yes, that’s a hack.)

  - For ‘string-downcase’, the Greek strings above are wide strings, so
    they can be passed directly to u32_tolower & co.  For these, the fix
    is about two lines.

  - Case-insensitive comparison is more difficult, as you already
    pointed out.  To do it right we’d probably need to convert Latin1
    strings to UTF-32 and then pass them to u32_casecmp.  We don’t have
    to do the conversion every time, though: we could just change Latin1
    strings in-place so they point to a wide stringbuf upon the first
    ‘string-ci=?’.
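
To make the Latin1 items above a bit more concrete, here is a minimal
sketch in plain C of the two building blocks they rely on: widening a
Latin1 buffer to UTF-32, and checking whether a Latin1 buffer contains
‘ß’.  Both function names are made up for illustration; they are not
Guile or libunistring functions, and the sketch deliberately does not
call into unicase.h.  The widened buffer is the kind of thing one would
then hand to the u32_* functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not a Guile or libunistring function): widen a
   Latin-1 buffer to UTF-32.  Every Latin-1 byte value equals its
   Unicode code point, so the conversion is a plain per-byte copy.
   Returns a malloc'd array of N code points, or NULL if allocation
   fails.  The caller frees the result.  */
static uint32_t *
latin1_to_utf32 (const unsigned char *s, size_t n)
{
  assert (s != NULL || n == 0);

  /* Allocate at least one element so an empty string still yields a
     valid, freeable pointer.  */
  uint32_t *wide = malloc ((n ? n : 1) * sizeof *wide);
  if (wide == NULL)
    return NULL;

  for (size_t i = 0; i < n; i++)
    wide[i] = s[i];

  return wide;
}

/* Hypothetical predicate: return 1 if the Latin-1 buffer contains
   'ß' (byte 0xDF), the character singled out above as lacking a
   one-to-one case mapping, and 0 otherwise.  */
static int
latin1_has_sharp_s (const unsigned char *s, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (s[i] == 0xDF)
      return 1;
  return 0;
}
```

A narrow string for which latin1_has_sharp_s returns 0 could keep going
through the existing per-character path; only the others would need to
be widened first.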

Thoughts?

Thanks,
Ludo’.
