On Wed, 2009-09-23 at 01:59 -0400, John Cowan wrote:
> Thomas Lord scripsit:
> 
> > That is not a problem with Unicode.  That is a problem with 
> > the assumption that there is a bijection between upcase
> > and downcase characters - an assumption violated by one
> > character in one language.  
> 
> A lot more than one.  In addition to ess-zet, there are:
> 
>         13 Latin and Armenian ligatures that uppercase to two characters

... which can never appear in compatibility-normalized (NFKC/NFKD)
strings because they have compatibility decompositions....
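
A minimal sketch of that point in Python, using only the standard
`unicodedata` module. It assumes compatibility normalization (NFKC),
since the Latin ligatures carry compatibility decompositions and NFC
leaves them intact. The choice of U+FB01 as the sample ligature is mine,
for illustration only.

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI, one of the ligatures that uppercases
# to two characters.
ligature = "\ufb01"

# Compatibility normalization replaces the ligature with plain "f" + "i",
# so the ligature cannot survive in an NFKC-normalized string.
nfkc = unicodedata.normalize("NFKC", ligature)

# Canonical composition (NFC) leaves the ligature untouched.
nfc = unicodedata.normalize("NFC", ligature)

# Python's str.upper() applies the full (one-to-many) case mapping.
upper = ligature.upper()

print(repr(nfkc), repr(nfc), repr(upper))
```

After NFKC the string is just two ordinary letters, each of which
upcases one-to-one.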

>         61 Latin and Greek lowercase letters with diacritics that
>         uppercase to the uppercase base character followed by the
>         combining diacritic(s)

... which are single characters represented with one codepoint that 
uppercase into single characters represented with two codepoints - 
a non-problem if you keep in mind that characters and codepoints are
different ideas ....
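
To make the distinction concrete, here is a rough Python sketch. The
sample letter (U+01F0, j with caron) is my own pick from this class, and
`approx_characters` is a hypothetical helper that only approximates
"characters" by skipping combining marks; it is not a full grapheme
cluster segmenter.

```python
import unicodedata

def approx_characters(s):
    # Rough approximation of the character count: count every code point
    # that is not a combining mark (canonical combining class 0).
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

lower = "\u01f0"        # LATIN SMALL LETTER J WITH CARON, one code point
upper = lower.upper()   # full case mapping: "J" + U+030C COMBINING CARON

print(len(lower), approx_characters(lower))
print(len(upper), approx_characters(upper))
```

The code point count changes under upcasing while the approximate
character count does not.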

>         I with dot, which lowercases (in non-Turkic contexts) to
>         i followed by combining dot in order to maintain canonical
>         equivalence rules (only one dot is displayed)

... Which is also a single character represented with one codepoint 
converting to a single character represented with two codepoints ...

>         27 Greek titlecase combinations of an uppercase vowel with
>         diacritic(s) followed by a lowercase iota which uppercase to
>         the same vowel followed by an uppercase iota.

... Which have canonical decompositions and can never appear in a 
normalized string, and whose normalized forms also have the same 
number of *characters* after a case operation even though the 
number of *codepoints* is different...

> That makes 103 characters altogether that don't work in char-upcase
> or char-downcase.

40 of which don't count because they're not part of the repertoire of
normalized characters, and 62 of which are single characters that
change under casing operations to single characters, confusing only
those who have already confused character lengths with codepoint
lengths.

And exactly one of which *does* count because it's actually a 
different number of characters after the casing operation.
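
A small Python sketch of the one case that does count, ess-zet. Both
`approx_characters` and `char_upcase` are hypothetical helpers written
for this illustration: the first skips combining marks as a stand-in for
counting characters, and the second is one possible way a strictly
one-to-one upcase could behave, not a statement of what any particular
Scheme implementation does.

```python
import unicodedata

def approx_characters(s):
    # Rough approximation: count code points that are not combining marks.
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

def char_upcase(ch):
    # Hypothetical one-to-one upcase: if the full mapping is not a single
    # code point, return the input unchanged.
    up = ch.upper()
    return up if len(up) == 1 else ch

eszett = "\u00df"       # LATIN SMALL LETTER SHARP S
full = eszett.upper()   # full case mapping yields two base letters

print(repr(full), approx_characters(eszett), approx_characters(full))
print(repr(char_upcase(eszett)), repr(char_upcase("a")))
```

Here the result has two base letters and no combining marks, so the
character count itself changes, which a character-to-character
procedure cannot express.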

                                Bear



_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
