On Tue, 2009-09-22 at 23:57 -0400, Aubrey Jaffer wrote:
> | From: Thomas Lord <[email protected]>
>  | Date: Tue, 22 Sep 2009 20:38:09 -0700
>  | 
>  | On Tue, 2009-09-22 at 20:57 -0400, Aubrey Jaffer wrote:
>  | > Unicode doesn't play well with a character datatype.  Downcasing
>  | > or foldcasing a single scalar-value can result in a length 2
>  | > string.
>  | 
>  | That is not a problem with Unicode.  That is a problem with 
>  | the assumption that there is a bijection between upcase
>  | and downcase characters - an assumption violated by one
>  | character in one language.  
> 
> There are other ligatures which have this property.  A Latin (English)
> example is (lowercase) "fi" (the ligature U+FB01).  Upcasing it gives "FI";
> downcasing leaves it unchanged, foldcasing yields "fi".

The "fi" ligature, however, has a compatibility decomposition, and may 
never appear in a string normalized to NFKC or NFKD; it is replaced by 
the two letters "fi".  If you're talking about strings normalized that 
way, it is indeed true that there is only one character in one language 
that upcases to a different number of characters. 

There are several that upcase or foldcase to a different number of 
codepoints, but that's a different problem and should be below the 
level of abstraction provided by strings. 
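
The mappings discussed in this thread can be checked with a small 
sketch like the one below.  It assumes Python 3 and its standard 
unicodedata module; the helper name case_report and the three sample 
characters are my own choices for illustration, not anything taken 
from R6RS.  Both NFC and NFKC are shown, since which normalization 
form is meant matters for the ligature.

```python
import unicodedata

# Sample characters (chosen for illustration only).
LIGATURE_FI = "\ufb01"  # LATIN SMALL LIGATURE FI
SHARP_S = "\u00df"      # LATIN SMALL LETTER SHARP S
J_CARON = "\u01f0"      # LATIN SMALL LETTER J WITH CARON


def case_report(ch):
    """Return the case mappings and two normalizations of a string."""
    return {
        "upper": ch.upper(),        # full (string-valued) uppercase mapping
        "lower": ch.lower(),        # full lowercase mapping
        "fold": ch.casefold(),      # full case folding
        "NFC": unicodedata.normalize("NFC", ch),
        "NFKC": unicodedata.normalize("NFKC", ch),
    }


for ch in (LIGATURE_FI, SHARP_S, J_CARON):
    report = case_report(ch)
    # Print codepoint counts so that length changes are easy to see.
    lengths = {key: len(value) for key, value in report.items()}
    print(unicodedata.name(ch), report, lengths)
```

With this, the ligature upcases to the two letters "FI" and is 
replaced by "fi" under NFKC, the sharp s upcases to "SS" under either 
normalization, and the j with caron upcases to two codepoints ("J" 
plus a combining caron) that still make up one character as a reader 
would see it.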

                                Bear



_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
