Re: Unicode: endpoint of evolution of encodings?

Pablo Saratxaga Thu, 18 Nov 2004 14:15:37 -0800

Kaixo!

On Thu, Nov 18, 2004 at 01:50:28PM +0100, Danilo Segan wrote:


> The issue is whether you would care about being able to differentiate
> in your text processor between languages or not.

It is indeed a good feature to have;
but the *smallest* units for which language information is useful
are *words*, not characters/letters.

> So, "jota" would
> still make sense in Spanish, whatever it was pronounced as, but not
> much sense in English (since it's not a word there).  I think this is
> a good property to know.

No, it is useless. The letter "j", alone, is the same letter in all
languages using the Latin script. There is absolutely no gain in
creating differences based on language (plus, I know of no language
where there is a word consisting of the single letter "j").

"disambiguating" letters depending on the language is a very bad idea,
because it destroys the interchangeability of documents.
You already have problems doing Google searches in Serbian because a text
can be in either of two scripts; with your idea of disambiguating
letters, the same problem would exist for almost all
languages (minus the very few using a unique script). It would
even be worse, as the same English text, for example, could be encoded in
dozens of different ways (e.g. in English-letters, Spanish-letters,
Portuguese-letters, French-letters, German-letters, Italian-letters,
Indonesian-letters, Polish-letters, Irish-letters, Welsh-letters,
Danish-letters, ...).

> We must agree that these differences Unicode went after are
> glyph-based, rather than character-based.

They are character-based,
with a character defined as an atomic element of a script (there are of
course a lot of exceptions due to historical reasons, but that is the
basic idea).
So, Unicode is a collection of *scripts*; each script is separate and
independent of the others, and each script is a collection of characters
belonging to that script (there are some special characters, like
generic punctuation and ASCII digits, that can be used in conjunction
with most scripts, but outside the shared punctuation characters, the
different characters are exclusive to a given script, even if there are
similarities in some cases with characters of another script).
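
As a rough illustration of the "collection of scripts" idea, here is a
small Python sketch. The helper name script_hint is made up for this
example, and taking the first word of the character name is only a
heuristic (an assumption, not an official script lookup); it happens to
work for the characters shown.

```python
import unicodedata

def script_hint(ch):
    # Heuristic only (hypothetical helper): the first word of the Unicode
    # character name usually names the script the character belongs to.
    return unicodedata.name(ch).split()[0]

# The same "j" whatever language the text is in: it belongs to the script.
print(script_hint("j"))        # LATIN
# A look-alike from another script is a different character.
print(script_hint("\u0458"))   # CYRILLIC (U+0458)
# Shared punctuation and digits carry no script name at all.
print(script_hint(","))        # COMMA
print(script_hint("1"))        # DIGIT
```

Nothing in the character data says "Spanish j" or "English j"; the only
distinction recorded is the script.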

The basic unit for encoding writing is the script; that is so when
encoding text electronically simply because it is so when writing
text by hand or in print.

> I say that "a" and "а" are same characters in Serbian,

They are not.
They may be the same *letter* in Serbian.
But a letter is not a character: in Spanish, "ch" is a letter (yes, I'm
a traditionalist), just as in Serbian "lj" and "nj" are letters;
however, the characters involved are "c", "h", "l", "n", "j".
Note also that in the Cyrillic script "љ" and "њ" are single characters,
while "лј" and "нј" are not single characters.
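
The letter/character distinction can be checked directly. Below is a
minimal Python sketch; the helper name describe is made up for this
example, and the code points are written as escapes so that nothing
depends on how this message is displayed.

```python
import unicodedata

def describe(text):
    # List every character (code point) in the string with its Unicode name.
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in text]

lje = "\u0459"          # Cyrillic lje: one letter, one character
el_je = "\u043b\u0458"  # Cyrillic el followed by je: two characters
latin_lj = "lj"         # Latin: one Serbian letter, two characters

print(len(lje), describe(lje))
# 1 ['U+0459 CYRILLIC SMALL LETTER LJE']
print(len(el_je), describe(el_je))
# 2 ['U+043B CYRILLIC SMALL LETTER EL', 'U+0458 CYRILLIC SMALL LETTER JE']
print(len(latin_lj), describe(latin_lj))
# 2 ['U+006C LATIN SMALL LETTER L', 'U+006A LATIN SMALL LETTER J']

# Latin "a" and Cyrillic "a" look alike but are distinct characters.
print("a" == "\u0430")  # False
```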

Script changes are a bit like orthographic changes (extreme
orthographic changes); in French the word for "acute" used to be written
"aiguë", now it is "aigüe"; it is impossible to encode it in a single
way. You can attach as many language properties to characters as you
like, but you cannot handle orthographic changes that way.

You wrongly see the Latin and Cyrillic variants of Serbian as simple
differences in shape of the same characters; that is not so,
you should instead look at them as two orthographic variants.


-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://chanae.walon.org/pablo/          PGP Key available, key ID: 0xD9B85466
[you can write me in Walloon, Spanish, French, English, Catalan or Esperanto]
[min povas skribi en valona, esperanta, angla aux latinidaj lingvoj]
