Using libunistring for string comparisons et al

Mark H Weaver Fri, 11 Mar 2011 14:34:13 -0800

Mike Gran <spk...@yahoo.com> writes:
> [...] But doing the upper->lower operation picks
> up a few more of the corner cases, like U+03C2 GREEK
> SMALL LETTER FINAL SIGMA and U+03C3 GREEK SMALL LETTER SIGMA
> which are the same letter with different representations,
> or U+00B5 MICRO SIGN and U+03BC GREEK SMALL LETTER MU
> which are supposed to have the same sort ordering.


Ah, okay.  Makes sense.
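
To make the corner cases concrete, here is a small sketch in Python rather
than Guile, used only because its built-in str.upper and str.lower follow the
Unicode case mappings.  The helper name upper_then_lower is made up for this
illustration; it is not a Guile or libunistring function.  Only
single-character strings are used, so context-sensitive rules do not come
into play.

```python
def upper_then_lower(s: str) -> str:
    """Hypothetical helper: map to upper case, then back to lower case."""
    return s.upper().lower()

FINAL_SIGMA = "\u03c2"  # GREEK SMALL LETTER FINAL SIGMA
SIGMA = "\u03c3"        # GREEK SMALL LETTER SIGMA
MICRO = "\u00b5"        # MICRO SIGN
MU = "\u03bc"           # GREEK SMALL LETTER MU

# Plain lower-casing leaves both pairs distinct, since every
# character here is already lower case.
assert FINAL_SIGMA.lower() != SIGMA.lower()
assert MICRO.lower() != MU.lower()

# Going through upper case first maps both members of each pair
# to the same character.
assert upper_then_lower(FINAL_SIGMA) == upper_then_lower(SIGMA) == SIGMA
assert upper_then_lower(MICRO) == upper_then_lower(MU) == MU
```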

> Now that we've pulled in all of libunistring, it might
> be a good idea to see if it has a complete implementation
> of unicode case folding, because upper->lower is also not
> completely correct.

I looked into this.  Indeed, the libunistring documentation mentions
that in some languages (e.g. German), the to_upper and to_lower
conversions cannot be done properly on a per-character basis, because
the number of characters can change.  These operations must be done on an
entire string.  For example:

<http://www.r6rs.org/final/html/r6rs-lib/r6rs-lib-Z-H-2.html>

  (string-upcase "Straße") => "STRASSE"
  (string-foldcase "Straße") => "strasse"
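
The same behaviour can be checked outside Scheme.  The following is a rough
sketch in Python, whose str.upper and str.casefold operate on whole strings;
it stands in for the R6RS procedures above and says nothing about what Guile
currently does.  The helper caseless_equal is a name invented for this
example.

```python
word = "Straße"

# Whole-string mappings can change the length: the single
# character ß expands to two letters.
assert word.upper() == "STRASSE"
assert word.casefold() == "strasse"
assert len(word) == 6
assert len(word.upper()) == 7

def caseless_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after full case folding."""
    return a.casefold() == b.casefold()

# Full case folding treats the two spellings as equal...
assert caseless_equal("Straße", "STRASSE")

# ...whereas lower-casing alone keeps ß and so reports a difference.
assert "Straße".lower() != "STRASSE".lower()
```

A one-to-one character mapping has no way to produce the extra "S", which is
why the conversion has to be defined on the string as a whole.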

libunistring contains all the necessary functions, including
case-insensitive string comparisons.  However, these operations support
only four string representations: UTF-8, UTF-16, UTF-32, and
locale-encoded strings.  For comparisons, both strings must be in the
same encoding.

I'm aware that this proposal will be very controversial, but starting in
Guile 2.2, I think we ought to consider storing strings internally in
UTF-8, as is done in Gauche.  This would of course make string-ref and
string-set! into O(n) operations.  However, I claim that any code that
depends on string-ref and string-set! could be better written
