Hello Mark,

Mark H Weaver <m...@netris.org> writes:

> Mike Gran <spk...@yahoo.com> writes:
>
>>> The reason I am still arguing this point is because I have looked
>>> seriously at what I would need to do to (A) fix our i18n problems and
>>> (B) make the code efficient.  I very much want to fix these things,
>>> but the pain of trying to do this with our current scheme is too much
>>> for me to bear.  I shouldn't have to rewrite libunistring, and I
>>> shouldn't have to write 3 or 4 different variants of each procedure
>>> that takes two string parameters.
>>
>> What procedures are giving incorrect results?
>
> I know of two categories of bugs.  One has to do with case conversions
> and case-insensitive comparisons, which must be done on entire strings
> but are currently done for each character.  Here are some examples:
>
>   (string-upcase "Straße") => "STRAßE" (should be "STRASSE")
>   (string-downcase "ΧΑΟΣΣ") => "χαοσσ" (should be "χαoσς")
>   (string-downcase "ΧΑΟΣ Σ") => "χαοσ σ" (should be "χαoς σ")
>   (string-ci=? "Straße" "Strasse") => #f (should be #t)
>   (string-ci=? "ΧΑΟΣ" "χαoσ") => #f (should be #t)

(Mike pointed out that SRFI-13 does not consider these bugs, but that’s
linguistically wrong so I’d consider it a bug.  Note that all these
functions are ‘linguistically buggy’ anyway since they don’t have a
locale argument, which breaks with Turkish ‘İ’.)

Can we first check what would need to be done to fix this in 2.0.x?

At first glance:

  - “Straße” is normally stored as a Latin1 string, so it would need to
    be converted to UTF-* before it can be passed to one of the
    unicase.h functions.  *Or*, we could check with bug-libunistring
    what it would take to add Latin1 string case mapping functions.

    Interestingly, ‘ß’ is the only Latin1 character that doesn’t have a
    one-to-one case mapping.  All other Latin1 strings can be handled
    by iterating over characters, as is currently done.  With this in
    mind, we could hack our way so that strings that contain an ‘ß’ are
    stored as UTF-32 (yes, that’s a hack.)

  - For ‘string-downcase’, the Greek strings above are wide strings, so
    they can be passed directly to u32_tolower & co.  For these, the
    fix is about two lines.

  - Case-insensitive comparison is more difficult, as you already
    pointed out.  To do it right we’d probably need to convert Latin1
    strings to UTF-32 and then pass them to u32_casecmp.  We don’t have
    to do the conversion every time, though: we could just change
    Latin1 strings in-place so they now point to a wide stringbuf upon
    the first ‘string-ci=?’.

Thoughts?

Thanks,
Ludo’.
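
A minimal sketch of the Latin1-to-UTF-32 widening step discussed in the list above, plus the ‘ß’ check behind the proposed hack.  Both function names are hypothetical: they are not Guile or libunistring API, and the real stringbuf code may look quite different.  The sketch only relies on Latin1 code points being identical to the first 256 Unicode code points, so the widened buffer is what one would then hand to the unicase.h functions such as u32_toupper or u32_casecmp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of Guile or libunistring): widen a
   Latin1 buffer to UTF-32.  Each Latin1 byte is its own Unicode code
   point, so the conversion is a plain element-wise copy.  DST must
   have room for LEN elements.  */
void
latin1_to_u32 (const unsigned char *src, size_t len, uint32_t *dst)
{
  assert (src != NULL && dst != NULL);
  for (size_t i = 0; i < len; i++)
    dst[i] = src[i];
}

/* Hypothetical helper: return 1 if the Latin1 buffer contains
   'ß' (0xDF), whose upper-case mapping is the two-character "SS",
   and 0 otherwise.  */
int
latin1_has_sharp_s (const unsigned char *src, size_t len)
{
  for (size_t i = 0; i < len; i++)
    if (src[i] == 0xDF)
      return 1;
  return 0;
}
```

Under these assumptions, a string for which latin1_has_sharp_s returns 0 could stay on the current per-character path, and only the others would need widening before case mapping.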