On Thursday 2004.11.18 01:44:07 +0000, Christopher Fynn wrote:

> Edward H. Trager wrote:
>
> > Mlterm (http://mlterm.sourceforge.net/) is a multilingual-capable terminal
> > emulator which handles combining characters. Mlterm with a console-based
> > mail reader like mutt works pretty well. However, one is still at the
> > mercy of the fonts. Even an OpenType font which handles diacritic stacking
> > may still not place diacritics properly for Vietnamese unless that font
> > was really designed with vietnamese in mind. And, supposing you do find a
> > font with very nice typographic placement of diacritics for Vietnamese, that
> > same font might not work so well for Greek, for example. So, the current
> > situation is that in practice you get more readable results when your
> > unicode text actually uses the code points for the precomposed glyphs.
>
> This seems to be correct for HTML & XML at least since
> W3C's (draft) "Character Model for the World Wide Web 1.0:
> Normalization" specifies NFC for HTML & XML.
> <http://www.w3.org/TR/charmod-norm/> - don't know whether or
> not any particular form is specified for other protocols.
Hmmm, I'll have to read that document again and think about this one.

One of the problems with Unicode is that it is, in many ways, such a mess. Based on first principles, people wanted Unicode to use a "character" model, not a "glyph" model. But what has really happened is that we've ended up with a "glyph" model for all of the scripts that already had legacy computer encodings at the time Unicode came into existence: this includes Latin, Cyrillic, Greek, and Arabic, among others. Only the scripts that had never (or barely) had the fortune -- or misfortune, depending on how you look at it -- to be encoded for use on computers have ended up in Unicode with a "character" rather than a "glyph" based model. These include scripts like Thaana, Devanagari, and Burmese. For those scripts there are no "precomposed" forms -- and thus no difference between the NFC and NFD normalizations. So, although it is more of a burden to display Burmese correctly, it might be easier to collate Burmese than it is to collate some European-language texts, where the text could be in NFC, NFD, or even some combination thereof ...

Of course, Unicode is such a mess because, if I may paraphrase Michael Everson, human writing systems are such a mess. And since technologies evolve over time, I suppose we just have to live with the complexities of having multiple normalization forms and lengthy documents like http://www.w3.org/TR/charmod-norm/ ...

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
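P.S. The NFC/NFD point is easy to demonstrate with Python's standard unicodedata module, for anyone who wants to poke at it. A Vietnamese word containing a precomposed letter changes length under NFD, while a Devanagari combining sequence, having no precomposed equivalent, is untouched by either normalization (just a quick sketch, not a recommendation of any particular tool):

```python
import unicodedata

# Vietnamese: "Việt Nam" with precomposed U+1EC7
# (LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW)
viet = "Vi\u1ec7t Nam"
nfc = unicodedata.normalize("NFC", viet)
nfd = unicodedata.normalize("NFD", viet)
# NFD decomposes U+1EC7 into base e + dot below + circumflex,
# so the NFD string is two code points longer.
print(len(nfc), len(nfd))  # -> 8 10

# Devanagari: KA (U+0915) + vowel sign I (U+093F), i.e. "कि".
# There is no precomposed form, so NFC and NFD are identical.
deva = "\u0915\u093f"
print(unicodedata.normalize("NFC", deva) == unicodedata.normalize("NFD", deva))  # -> True
```

Collation and comparison code that doesn't normalize first will treat the two Vietnamese spellings as different strings, which is exactly the headache the Indic-model scripts avoid.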