On Thursday 2004.11.18 01:44:07 +0000, Christopher Fynn wrote:

> Edward H. Trager wrote:
>
> > Mlterm (http://mlterm.sourceforge.net/) is a multilingual-capable terminal
> > emulator which handles combining characters. Mlterm with a console-based
> > mail reader like mutt works pretty well. However, one is still at the
> > mercy of the fonts. Even an OpenType font which handles diacritic stacking
> > may still not place diacritics properly for Vietnamese unless that font
> > was really designed with vietnamese in mind. And, supposing you do find a
> > font with very nice typographic placement of diacritics for Vietnamese, that
> > same font might not work so well for Greek, for example. So, the current
> > situation is that in practice you get more readable results when your
> > unicode text actually uses the code points for the precomposed glyphs.
>
> This seems to be correct for HTML & XML at least since
> W3C's (draft) "Character Model for the World Wide Web 1.0:
> Normalization" specifies NFC for HTML & XML.
> <http://www.w3.org/TR/charmod-norm/> - don't know whether or
> not any particular form is specified for other protocols.
Hmmm, I'll have to read that document again and think about this one.

One of the problems with Unicode is that it is, in many ways, such a mess. Based on first principles, people wanted Unicode to use a "character" model, not a "glyph" model. But what has really happened is that we've ended up with a "glyph" model for all of the scripts that already had legacy computer encodings at the time Unicode came into existence: this includes Latin, Cyrillic, Greek, and Arabic, among others. Only the scripts that had never (or barely) had the fortune -- or misfortune, depending on how you look at it -- to be encoded for use on computers have ended up in Unicode with a "character" rather than a "glyph" based model. These include scripts like Thaana, Devanagari, and Burmese. For those scripts there are no "precomposed" forms -- and thus no difference between the NFC and NFD normalizations. So, although it is more of a burden to display Burmese correctly, it might be easier to collate Burmese than it is to collate some European-language texts, where the text could be in NFC, NFD, or even some combination thereof ...

Of course, Unicode is such a mess because, if I may paraphrase Michael Everson, human writing systems are such a mess. And since technologies evolve over time, I suppose we just have to live with the complexities of having multiple normalization forms and lengthy documents like http://www.w3.org/TR/charmod-norm/ ...

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
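P.S. The NFC/NFD point is easy to demonstrate with Python's standard unicodedata module, for anyone who wants to poke at it. A Vietnamese word containing a precomposed letter changes length under NFD, while a Devanagari combining sequence, having no precomposed equivalent, is untouched by either normalization (just a quick sketch, not a recommendation of any particular tool):

```python
import unicodedata

# Vietnamese: "Việt Nam" with precomposed U+1EC7
# (LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW)
viet = "Vi\u1ec7t Nam"
nfc = unicodedata.normalize("NFC", viet)
nfd = unicodedata.normalize("NFD", viet)
# NFD decomposes U+1EC7 into base e + dot below + circumflex,
# so the NFD string is two code points longer.
print(len(nfc), len(nfd))  # -> 8 10

# Devanagari: KA (U+0915) + vowel sign I (U+093F), i.e. "कि".
# There is no precomposed form, so NFC and NFD are identical.
deva = "\u0915\u093f"
print(unicodedata.normalize("NFC", deva) == unicodedata.normalize("NFD", deva))  # -> True
```

Collation and comparison code that doesn't normalize first will treat the two Vietnamese spellings as different strings, which is exactly the headache the Indic-model scripts avoid.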