Danilo Segan wrote:
New policies such as "no more precomposed glyphs" also indicate that
we're talking about a glyph repository, not about a character repository
(i.e. "no more precomposed glyphs, since you can get those glyphs by
combining existing glyphs", even though they may have entirely
different meanings and be separate characters in their own right).
  

This may be more of a practical issue: for some scripts such as Korean,
representing every possible character and partial character could
require a very large amount of codespace. We only have the precomposed
characters now for compatibility with platforms that simply don't support
composition whatsoever (all too common still, sadly).

For example: do these both work under your mailreader?
NFC: Tiếng Việt
NFD: Tiếng Việt  (the same text, spelled with combining characters)

(for me under mozilla mail the second one looks slightly different,
which means it's not working perfectly)
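
For what it's worth, here is a quick sketch (Python, my own
illustration, not part of the original messages) of what the two forms
of that string look like at the codepoint level:

  import unicodedata

  text = "Tiếng Việt"
  nfc = unicodedata.normalize("NFC", text)  # precomposed vowels
  nfd = unicodedata.normalize("NFD", text)  # base letters + combining marks

  print(len(nfc), [hex(ord(c)) for c in nfc])
  print(len(nfd), [hex(ord(c)) for c in nfd])

The NFD string comes out longer, but a renderer that handles combining
characters should display both identically; if they look different,
composition isn't being applied properly.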

Defining a "Serbian Unified" code range would probably be futile:
Even after such a range was defined, little could stop someone from
representing Serbian in Cyrillic or Latin script. 
    

If Serbian were only encodable with a "Serbian range", it'd be much
better.  The keyword "representing" should be read as "displayed" for
the Serbian language.  It's much better if the script is external to
the encoding (e.g. the script is defined through metadata, not through
data) than if the language is external to the encoding (that applies to
other languages too, see my example for English and Spanish below).
After all, look at how many pages Google mistags as Russian or
Macedonian that are actually in Serbian (Google ignores metadata
itself, so that might be the culprit).
  

When you have multilingual documents you can more easily see why
that is impractical. There is no easy way for a piece of software to
know that some words are Spanish and some are English. If the two
languages had no overlapping codepoints whatsoever, you could very
easily end up with English text encoded with Spanish codepoints
and vice versa.

Even characters which look different in different scripts but are
logically identical get unified, so Unicode right now is diametrically
opposed to the position you are describing, and for good reason.

Unicode, as it is, is closer to a common glyph repository (AFII
anyone?) than a character repository (OK, backwards compatibility is
also responsible for this, because of things like English ligatures,
etc.).  FWIW, I'd assert that "j" in Spanish is not the same thing as
"j" in English (and that one is easily proved), apart from them being
represented with the same *glyph*.
  

Certainly the character is used differently. However, I would assert
that it is indeed the same character. Both English and Spanish
use Latin script.

Of course, with Unicode, it's current practice to add language marks
in a text stream instead (e.g. with XML using 'xml:lang' tagging, or
global metadata such as MIME header fields), but that defeats one big
advantage of Unicode: you can read it with a stateless machine, from
any point in the stream (a big win for network applications).  This
holds even for a transformation format such as UTF-8 (Unicode itself is
no big deal without UTF-8, IMHO).
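
That self-synchronization is easy to demonstrate.  Here is a small
sketch (Python, my own illustration) of how a reader dropped at an
arbitrary byte offset finds the next character boundary, since
continuation bytes always look like 10xxxxxx:

  def next_char_boundary(data: bytes, offset: int) -> int:
      # skip continuation bytes (0b10xxxxxx) until a lead byte is found
      while offset < len(data) and (data[offset] & 0xC0) == 0x80:
          offset += 1
      return offset

  utf8 = "Tiếng Việt".encode("utf-8")
  start = next_char_boundary(utf8, 3)  # offset 3 lands mid-character here
  print(utf8[start:].decode("utf-8"))  # decodes cleanly from that point on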

The ideal case, IMO, would be a character encoding standard where one
can deduce all properties of a character from the character itself
(i.e. when my parser runs across "б" it will know that it is a
"lowercase b in Serbian" or a "lowercase b in Russian"). 
  

This I think will never happen: codepoints that carry language
information are no longer just codepoints. Remember: characters are not
only used for language; they can also be map symbols, mathematical
operators, fancy shapes, etc.

Also, imagine the chaos for OCR programs: you'd have to tell them
ahead of time which language they are supposed to read in. And instead
of Latin ↔ Cyrillic converters you would have a proliferation of
English ↔ French, English ↔ British English, and Spanish ↔ Italian
converters (overall a much worse place to be).
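
For reference, the Latin ↔ Cyrillic direction we already have is simple
enough to sketch.  A rough illustration (Python; the table and helper
are my own example, not anything standardized) of a Serbian Latin →
Cyrillic converter:

  DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ",
              "Lj": "Љ", "Nj": "Њ", "Dž": "Џ"}
  SINGLES = dict(zip("abvgdđežzijklmnoprstćufhcčš",
                     "абвгдђежзијклмнопрстћуфхцчш"))
  SINGLES.update(dict(zip("ABVGDĐEŽZIJKLMNOPRSTĆUFHCČŠ",
                          "АБВГДЂЕЖЗИЈКЛМНОПРСТЋУФХЦЧШ")))

  def latin_to_cyrillic(text: str) -> str:
      out, i = [], 0
      while i < len(text):
          pair = text[i:i + 2]
          if pair in DIGRAPHS:          # digraphs must win over single letters
              out.append(DIGRAPHS[pair])
              i += 2
          else:
              out.append(SINGLES.get(text[i], text[i]))
              i += 1
      return "".join(out)

  print(latin_to_cyrillic("Njegoš je iz Crne Gore"))  # Његош је из Црне Горе

The digraphs (lj, nj, dž) are the only fiddly part, and even they
occasionally need context to split correctly, which is the same kind of
information the encoding itself doesn't carry.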


I don't get what this has to do with Unicode/UTF-8 being The Ultimate
among encodings.  It's only the best we've got so far.

  
I do agree that it is merely a first attempt at an Über-encoding;
however, I have yet to hear of any way that it could be fundamentally
improved upon.

Perhaps eliminating all precomposed glyphs would be one such improvement,
but Unicode already supports NFD, so it is already possible to use it as such.
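
If one wanted to enforce such a "no precomposed characters" policy
today, a small sketch (Python 3.8+ for unicodedata.is_normalized; my
own illustration) might look like:

  import unicodedata

  def strip_precomposed(text: str) -> str:
      # decompose precomposed characters into base letters + combining marks
      return unicodedata.normalize("NFD", text)

  s = "Việt"                                  # contains the precomposed U+1EC7
  print(unicodedata.is_normalized("NFD", s))  # False: precomposed chars present
  print(unicodedata.is_normalized("NFD", strip_precomposed(s)))  # True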





