Mike Gran <spk...@yahoo.com> writes:
> [...] But doing the upper->lower operation picks
> up a few more of the corner cases, like U+03C2 GREEK
> SMALL LETTER FINAL SIGMA and U+03C3 GREEK SMALL LETTER SIGMA
> which are the same letter with different representations,
> or U+00B5 MICRO SIGN and U+03BC GREEK SMALL LETTER MU
> which are supposed to have the same sort ordering.

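For concreteness, here is a small sketch of those two corner cases. It is written in Python rather than Scheme only because Python's str.upper and str.lower follow the Unicode case mappings out of the box; the helper name upper_then_lower is made up for illustration and is not a Guile or libunistring function.

```python
import unicodedata


def upper_then_lower(ch: str) -> str:
    """Approximate case folding of one character: map to upper, then to lower."""
    return ch.upper().lower()


# (final sigma, sigma) and (micro sign, Greek small letter mu)
pairs = [("\u03c2", "\u03c3"), ("\u00b5", "\u03bc")]

for a, b in pairs:
    print(unicodedata.name(a), "vs", unicodedata.name(b))
    # Lowercasing alone leaves the two characters distinct.
    print("  equal after lower only:  ", a.lower() == b.lower())
    # Going through upper case first maps both to the same character.
    print("  equal after upper->lower:", upper_then_lower(a) == upper_then_lower(b))
```

Both pairs compare unequal after lowercasing alone and equal after the upper->lower round trip, which is the extra coverage described in the quoted text.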
Ah, okay. Makes sense.

> Now that we've pulled in all of libunistring, it might
> be a good idea to see if it has a complete implementation
> of unicode case folding, because upper->lower is also not
> completely correct.

I looked into this. Indeed, the libunistring documentation mentions
that in some languages (e.g. German), the to_upper and to_lower
conversions cannot be done properly on a per-character basis, because
the number of characters can change. These operations must be done on
an entire string. For example:

  <http://www.r6rs.org/final/html/r6rs-lib/r6rs-lib-Z-H-2.html>

  (string-upcase "Straße")   => "STRASSE"
  (string-foldcase "Straße") => "strasse"

libunistring contains all the necessary functions, including
case-insensitive string comparisons. However, the only string
representations supported by these operations are UTF-8, UTF-16,
UTF-32, or locale-encoded strings, and for comparisons both strings
must be in the same encoding.

I'm aware that this proposal will be very controversial, but starting
in Guile 2.2, I think we ought to consider storing strings internally
in UTF-8, as is done in Gauche. This would of course make string-ref
and string-set! into O(n) operations. However, I claim that any code
that depends on string-ref and string-set! could be better written