On Tue, 2009-09-22 at 23:57 -0400, Aubrey Jaffer wrote:
> | From: Thomas Lord <[email protected]>
> | Date: Tue, 22 Sep 2009 20:38:09 -0700
> |
> | On Tue, 2009-09-22 at 20:57 -0400, Aubrey Jaffer wrote:
> | > Unicode doesn't play well with a character datatype. Downcasing
> | > or foldcasing a single scalar-value can result in a length 2
> | > string.
> |
> | That is not a problem with Unicode. That is a problem with
> | the assumption that there is a bijection between upcase
> | and downcase characters - an assumption violated by one
> | character in one language.
>
> There are other ligatures which have this property. A Latin (English)
> example is (lowercase) "ﬁ" (U+FB01). Upcasing it gives "FI";
> downcasing leaves it unchanged, foldcasing yields "fi".

The "ﬁ" (U+FB01) ligature, however, has a compatibility decomposition,
and can never appear in a compatibility-normalized (NFKC or NFKD)
string; it is replaced by "fi". If you're talking about strings
normalized that way, it is indeed true that there is only one
character in one language that upcases to a different number of
characters.

There are several that upcase or foldcase to a different number of
codepoints, but that's a different problem and should be below the
level of abstraction provided by strings.
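
A rough sketch of the distinction, in Python purely for illustration (nothing here is R6RS, and the helper name case_lengths is made up for this example). It assumes a Python 3 whose str.upper() and str.casefold() apply the full Unicode case mappings, and it counts code points, not user-perceived characters:

```python
import unicodedata

LIGATURE_FI = "\ufb01"  # LATIN SMALL LIGATURE FI
SHARP_S = "\u00df"      # LATIN SMALL LETTER SHARP S
J_CARON = "\u01f0"      # LATIN SMALL LETTER J WITH CARON


def case_lengths(ch):
    """Hypothetical helper: code-point counts of ch's full case mappings."""
    return {
        "upper": len(ch.upper()),
        "lower": len(ch.lower()),
        "fold": len(ch.casefold()),
    }


# The ligature survives canonical normalization (NFC) but is replaced
# by the two letters "f" and "i" under compatibility normalization (NFKC).
nfc_fi = unicodedata.normalize("NFC", LIGATURE_FI)
nfkc_fi = unicodedata.normalize("NFKC", LIGATURE_FI)

# Sharp s upcases to two letters, and no normalization form removes it.
upper_sharp_s = SHARP_S.upper()

# j-with-caron upcases to "J" plus a combining caron: two code points,
# but arguably still a single character at the string level.
upper_j_caron = J_CARON.upper()

if __name__ == "__main__":
    for name, ch in [("fi ligature", LIGATURE_FI),
                     ("sharp s", SHARP_S),
                     ("j caron", J_CARON)]:
        print(name, case_lengths(ch))
```

The j-with-caron case is the sort of thing meant by a change in the number of codepoints without a change in the number of characters.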

Bear
_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss