Hi,

Today at 13:44, srintuar wrote:
> As for Serbian, I don't think that really has much to do with Unicode
> itself. You could apply a special folding algorithm when doing
> searches in a Serbian context, but I don't think you would want to
> make the script ambiguous.

I'd rather make the script ambiguous than make the language ambiguous :) FWIW, I can already produce perfectly fine fonts which will display Serbian encoded using U+0400-U+045F in either the Latin or the Cyrillic script: the choice of script is commonly a personal preference in Serbia, not a property of the text. Of course, this would be a misuse of Unicode :) The sheer number of Cyrillic-to-Latin and Latin-to-Cyrillic converters available for Serbian proves that this is so.

Backwards compatibility is one of Unicode's big problems when it comes to consistency. Unicode is utterly inconsistent, and thus very hard to implement fully (for both processing and display purposes). From a practical standpoint, this makes sense (nobody would have started using Unicode in the first place if it didn't have such properties), but that doesn't mean it's a "good" property (whatever a global "good" means), only that it works to some extent. New policies such as "no more precomposed glyphs" also indicate that we're talking about a glyph repository, not a character repository (i.e. "no more precomposed glyphs, since you can get those glyphs by combining existing glyphs", even though they may have entirely different meanings and be separate characters in their own right).

> Defining a "Serbian Unified" code range would probably be futile:
> Even after such a range was defined, little could stop someone from
> representing serbian in cyrillic or latin script.

If Serbian were only encodeable with a "Serbian range", that would be much better. The keyword "representing" is to be read as "displaying" when it comes to the Serbian language. It's much better if the script is external to the encoding (e.g. the script is defined through metadata, not through data) than if the language is external to the encoding (that applies to other languages too; see my example for English and Spanish below). After all, look at how many pages Google mistags as Russian or Macedonian which are actually in Serbian (Google ignores metadata itself, so that might be the culprit).

> (then you would have three strings to search for instead of one or
> two) If you have two official scripts, you have two official scripts.

Wrong: "official" has nothing to do with it (btw, "official" refers to a country, not a language, and only one script is official in Serbia: the Cyrillic script). In Serbian, "two different scripts" is a display property. Think of the glyph vs. character debate in the Unicode sense (Unicode encodes characters, i.e. their meanings, not their looks). So, when text is in Serbian, "б" is exactly the same as "b". It's not "b in Serbian Cyrillic" and "b in Serbian Latin", it's simply "b". We can talk about a display property of such a text being in Serbian Latin or Serbian Cyrillic just as we can talk about a display property of it being in Garamond or Arial, Bold or Italic. Unicode was about meaning, right?

> As for CJK un-unification, I find that highly unlikely. Even if for
> some reason certain codepoints were unambiguated, that would still
> fit within a backwards compatible modification of unicode, and not a
> complete scrapping of it.

Indeed, and that's what I aimed at: Unicode will not be expanded in a way that solves all the problems of character encoding, and we'd need something else for that. Or, if I'm not yet clear: Unicode is not an end-point in the evolution of encodings. Unicode, as it is, is closer to a common glyph repository (AFII, anyone?) than a character repository (OK, backwards compatibility is also responsible for this, because of things like English ligatures, etc.).
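The Serbian converters mentioned above are essentially trivial, which is exactly what makes the script a display property rather than a property of the text. A minimal sketch of the Cyrillic-to-Latin direction (the table and function names are my own; a real converter would also special-case title-cased digraphs like "Lj"):

```python
# Serbian Cyrillic-to-Latin transliteration is a plain per-letter mapping;
# three letters (љ, њ, џ) map to Latin digraphs (lj, nj, dž).
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

def cyr_to_lat(text: str) -> str:
    # Map each letter via its lowercase form; restore capitalization after.
    out = []
    for ch in text:
        mapped = CYR_TO_LAT.get(ch.lower(), ch)
        out.append(mapped.capitalize() if ch.isupper() else mapped)
    return "".join(out)

print(cyr_to_lat("Љубљана"))  # → Ljubljana
print(cyr_to_lat("Србија"))   # → Srbija
```

The reverse direction is nearly as simple, though Latin "nj"/"lj"/"dž" are ambiguous in rare words (they can be either one Cyrillic letter or two), which is one reason converters exist as tools rather than the two spellings being the same encoded text.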
FWIW, I'd assert that "j" in Spanish is not the same thing as "j" in English (and that is easily proved), apart from them being represented with the same *glyph*. Of course, with Unicode, the current practice is to add language marks to the text stream instead (e.g. with XML, using 'xml:lang' tagging, or global metadata such as MIME header fields), but that defeats one big advantage of Unicode: you can read it with a stateless machine, from any point in the stream (a big win for network applications). This holds even for a transformation format such as UTF-8 (Unicode itself is no big deal without UTF-8, IMHO).

The ideal case, IMO, would be a character encoding standard where one can deduce all properties of a character from the character itself (i.e. when my parser runs across "б" it will know that it is a "lowercase b in Serbian" or a "lowercase b in Russian"). Technically, this character encoding could be mapped to something like Unicode or AFII for display purposes. Of course, this brings some problems in the input area. Yet many more advantages would come with such an encoding, such as better/faster collation (one can choose to ignore the language bits of a character to get the current behaviour, which is highly unpredictable when one changes a locale), etc.

> Speaking of political issues: if you advocate
> un-unification, I'd be interested to hear which language's script you
> would place closest to zero.

I don't really mind. There are many objective criteria that could be established (e.g. by number of representable characters [smallest first/largest first], approximate number of worldwide users at the time of introduction, first-come-first-served, etc.). I'd have more problems with Unicode if I actually cared about things like what comes "closer to zero" :)

Of course, I'm aware that changes of this kind are very far-fetched, and would require substantial changes in our systems. That's not going to happen anytime soon.
Such an encoding requires a change in thinking, too ("what do you mean, Spanish j is different from English j? They're the same!"). There are many problems, like expandability (i.e. you wouldn't be able to use a new language even though it might use the same glyphs some other language uses, though this is more of a "bootstrapping" problem), so I'm not saying this would be the end-point in the evolution of encodings either. This doesn't apply only to multi-script languages, but with them we may see the problem more clearly :-) I'm simply saying that this kind of encoding has some properties that are better than what Unicode has to offer, and that Unicode is far from being the evolution end-point either. Perhaps my idea of what it should look like is not very good, but I outline problems which could also be solved with another encoding and character set. So, let's see what the future holds in store for us :)

Also, DOS wasn't deprecated in a short time span either. It took at least 10 years for it to be considered deprecated (OK, some considered it deprecated right when it was released :).

> Changes to unicode don't require any change to UTF-8 itself.
> Sacrificing native script programming comments and literals isn't worth
> it for something that may never change, imo.

I don't see what this has to do with Unicode/UTF-8 being The Ultimate among encodings. It's only the best we've got so far.

Sorry for this long rant about Unicode: I don't feel bad about it, and UTF-8 is about the only encoding I use, for all its magnificent properties, but I still see problems with it :)

Cheers,
Danilo

PS. Note that I misused "encoding" to sometimes mean "character set" as well. Hope this won't be too confusing :)

-- 
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/