On Thu, Nov 18, 2004 at 11:44:09AM -0500, Edward H. Trager wrote:
> On Thursday 2004.11.18 01:44:07 +0000, Christopher Fynn wrote:
> > Hmmm, I'll have to read that document again and think about this one.
>
> One of the problems with Unicode is that it is, in many ways, such a mess.
> Based on first principles, people wanted Unicode to use a "character"
> model, not a "glyph" model. But it seems that what has really happened is
> that we've basically ended up with a "glyph" model for all of those
> scripts that already had legacy computer encodings at the time that
> Unicode came into existence: this includes Latin, Cyrillic, Greek, and
> Arabic, among others.
>
> Only scripts that had never (or barely) had the fortune -- or misfortune,
> depending on how you look at it -- to be encoded for use on computers have
> ended up in Unicode using a "character" rather than a "glyph" based model.
> These include scripts like Thaana, Devanagari, and Burmese. For those
> scripts, there are no "precomposed" forms -- and thus no difference
> between NFC and NFD normalizations.
>
> So, although it is more of a burden to display Burmese correctly, it might
> be easier to collate Burmese than it is to collate some European-language
> texts, where the text could be in NFC, NFD, or even some combination
> thereof ...
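[A small illustration of the point above, using Python's unicodedata module: a Latin letter like "é" has both a precomposed and a decomposed spelling, so NFC and NFD differ, while a Devanagari syllable has no precomposed form, so both normalizations leave it unchanged. The specific code points chosen here are just examples, not from the original mail.]

```python
import unicodedata

# Latin: "é" exists as one precomposed code point (U+00E9) and as the
# decomposed sequence "e" + COMBINING ACUTE ACCENT (U+0065 U+0301).
e_acute = "\u00E9"
decomposed = unicodedata.normalize("NFD", e_acute)
print([hex(ord(c)) for c in decomposed])   # two code points after NFD

# Devanagari "कि" (KA U+0915 + vowel sign I U+093F): no precomposed
# form was ever encoded, so NFC and NFD are identical.
ki = "\u0915\u093F"
print(unicodedata.normalize("NFC", ki) == unicodedata.normalize("NFD", ki))
```

So for such scripts a given text has only one normalized shape, which is what makes the collation remark at the end of the quote plausible.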
Hmm, I see it differently. All the "fully composed" characters are indeed full characters in their own right, yet Unicode has now adopted a policy of not encoding any more precomposed characters, so you need to construct many Latin letters out of a sequence of combining characters. Unicode has thus left the principle of encoding characters - symbols with distinct meaning - and has become a kind of glyph registry. This makes sorting harder to do, although it is not infeasible to sort e.g. Latin letters in their fully composed encoding together with their decomposed forms in a convenient way, as demonstrated by ISO 14651.

Best regards
Keld
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
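[A sketch of the sorting point above: normalizing every string to a single form before comparison makes composed and decomposed spellings sort as equals. This uses plain NFD as the sort key in Python and is only the basic idea; a real ISO 14651 / UCA collation would additionally apply multi-level collation weights, which this sketch omits.]

```python
import unicodedata

# The same word spelled two ways: fully decomposed and fully precomposed.
words = ["re\u0301sume\u0301", "zebra", "r\u00E9sum\u00E9", "apple"]

# Without normalization, "é" (U+00E9) and "e"+U+0301 compare as different
# code point sequences. Normalizing the key to NFD makes them compare equal.
ordered = sorted(words, key=lambda s: unicodedata.normalize("NFD", s))
print(ordered)  # both spellings of "résumé" land together, between
                # "apple" and "zebra"
```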