----- Original Message -----
From: "Ketil Malde" <[EMAIL PROTECTED]>
To: "Dylan Thurston" <[EMAIL PROTECTED]>
Cc: "Andrew J Bromage" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Monday, October 08, 2001 9:02 AM
Subject: Re: UniCode
(The spelling is 'Unicode' (and none other).)

> Dylan Thurston <[EMAIL PROTECTED]> writes:
>
> > Right. In Unicode, the concept of a "character" is not really so
> > useful;
>
> After reading a bit about it, I'm certainly confused.
> Unicode/ISO-10646 contains a lot of things that aren't really one
> character, e.g. ligatures.

The ligatures that are included are there for compatibility with older
character encodings. Normally, with modern technology, ligatures are
(to be) formed automatically through the font; OpenType (OT; MS and
Adobe) and AAT (Apple) have support for this. There are often requests
to add more ligatures to 10646/Unicode, but they are rejected, since
10646/Unicode encode characters, not glyphs. (With two well-known
exceptions: for compatibility, and certain dingbats.)

> > most functions that traditionally operate on characters (e.g.,
> > uppercase or display-width) fundamentally need to operate on
> > strings. (This is due to properties of particular languages, not
> > any design flaw of Unicode.)
>
> I think an argument could be put forward that Unicode is trying to be
> more than just a character set. At least at first glance, it seems to

Yes, but:

> try to be both a character set and a glyph map, and incorporate things

not that. See above.

> like transliteration between character sets (or subsets, now that
> Unicode contains them all), directionality of script, and so on.

Unicode (but not 10646) does handle bidirectionality (see UAX 9:
http://www.unicode.org/unicode/reports/tr9/), but not transliteration.
(Transliteration is handled in IBM's ICU, though:
http://www-124.ibm.com/developerworks/oss/icu4j/index.html)

> > toUpper, toLower - Not OK. There are cases where upper casing a
> > character yields two characters.
>
> I thought title case was supposed to handle this. I'm probably
> confused, though.
The titlecase characters in Unicode are (essentially) only there for
compatibility reasons (originally for transliterating between certain
subsets of the Cyrillic and Latin scripts in a 1-1 way). You're not
really supposed to use them...

The cases where toUpper of a single character gives two characters are
some (classical) Greek, where a built-in subscript iota turns into a
capital iota, and other cases where there is no corresponding
uppercase letter. Case mapping is also context sensitive: e.g. capital
sigma maps to small sigma (mostly) or to ς (small final sigma) at the
end of a word; capital I maps to ı (small dotless i) in Turkish; and a
combining dot above is inserted/deleted for i and j in Lithuanian. See
UTR 21 and http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt.

> > etc. Any program using this library is bound to get confused on
> > Unicode strings. Even before Unicode, there is much functionality
> > missing; for instance, I don't see any way to compare strings using
> > a localized order.
>
> And you can't really use list functions like "length" on strings,
> since one item can be two characters (Lj, ij, fi) and several items
> can compose one character (combining characters).

Depends on what you mean by "length" and "character"... You seem to be
after what is sometimes referred to as a "grapheme", and counting
those. There is a proposal for a definition of a "language independent
grapheme" (with a lexical syntax), but I don't think it is stable yet.

> And "map (==)" can't compare two Strings, e.g. in the presence
> of combining characters. How are other systems handling this?

I guess it is not very systematic. Java and XML compare directly by
equality of the 'raw' characters *when* comparing identifiers and the
like, though for XML there is a proposal for "early normalisation",
essentially to NFC (normal form C). I would have preferred comparing
the normal forms of the identifiers instead.
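To make the casing and combining-character points concrete, here is a
small Haskell sketch. The 'ß' -> "SS" rule is one real entry from
SpecialCasing.txt, but the functions are illustrative toys: upperString
handles only that one special case, and nfcToy is a hypothetical
stand-in that knows a single composition, not real NFC.

```haskell
import Data.Char (toUpper)

-- Char -> Char can't express 'ß' -> "SS", so uppercasing has to be a
-- String -> String operation. A real implementation needs the full
-- SpecialCasing.txt data plus context and locale (final sigma,
-- Turkish dotless i, Lithuanian dot above, ...).
upperString :: String -> String
upperString = concatMap uc
  where
    uc '\xDF' = "SS"        -- LATIN SMALL LETTER SHARP S: no single-char uppercase
    uc c      = [toUpper c]

-- One grapheme, two encodings: precomposed e-acute vs. e + combining acute.
precomposed, decomposed :: String
precomposed = "\xE9"        -- U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed  = "e\x0301"     -- 'e' followed by U+0301 COMBINING ACUTE ACCENT

-- Hypothetical stand-in for NFC normalisation: it knows only the
-- single composition 'e' + U+0301 -> U+00E9.
nfcToy :: String -> String
nfcToy ('e':'\x0301':rest) = '\xE9' : nfcToy rest
nfcToy (c:rest)            = c : nfcToy rest
nfcToy []                  = []
```

In GHCi: upperString "stra\xDF\&e" gives "STRASSE"; length precomposed
is 1 while length decomposed is 2; precomposed == decomposed is False
even though both display as "é"; and nfcToy decomposed == precomposed
is True.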
For searches, the recommendation (though I doubt it is followed much
in practice yet) is to use a collation-key-based comparison. (Note
that collation keys are usually language dependent. More about
collation in UTS 10, http://www.unicode.org/unicode/reports/tr10/, and
ISO/IEC 14651.)

What does NOT make sense is to expose (to a user) the raw ordering (<)
of Unicode strings, though it may be useful internally. Orders exposed
to people (or other systems, for that matter) that aren't concerned
with the inner workings of a program should always be collation based.
(But that holds for any character encoding; it's just more apparent
for Unicode.)

> It may be that Unicode isn't flawed, but it's certainly extremely
> complex. I guess I'll have to delve a bit deeper into it.

It's complex, but that is because the scripts of the world are complex
(add to that politics, as well as compatibility and implementation
issues).

Kind regards
/kent k

_______________________________________________
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe
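P.S. A toy Haskell illustration of the collation point above. The key
function is hypothetical and vastly oversimplified: a real collation
key (UTS 10 / ISO/IEC 14651) has several weight levels and is tailored
per language.

```haskell
import Data.Char (toLower)
import Data.List (sortBy)
import Data.Ord (comparing)

-- A made-up one-level "collation key": case-insensitive, with the raw
-- string as a tie-breaker so the ordering stays total and
-- deterministic. Real keys carry multiple weight levels (base letter,
-- accents, case, ...) and are language dependent.
collationKey :: String -> (String, String)
collationKey s = (map toLower s, s)

sortCollated :: [String] -> [String]
sortCollated = sortBy (comparing collationKey)
```

The raw code-point ordering sorts ["Zebra", "apple"] with "Zebra"
first, since 'Z' (U+005A) precedes 'a' (U+0061), whereas sortCollated
["Zebra", "apple"] gives ["apple", "Zebra"], which is what a user
would expect to see.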