Re: Unicode: endpoint of evolution of encodings?

Danilo Segan Thu, 18 Nov 2004 04:02:27 -0800

Hi Antoine,

Yesterday at 13:37, Antoine Leca wrote:

> srintuar wrote:
>> FWIW, I'd assert that "j" in Spanish is not the same thing as
>> "j" in English (and that one is easily proved), apart from them being
>> represented with the same *glyph*.
>
> You picked (certainly involuntarily) a very instructive example.
> I am living in Spain, so I feel qualified to offer advice on this one.
> While my uses (note the plural) of "j" in "Spanish" are different from my use
> (note the singular) in English, there is much more difference between my use
> of "jota" in Castilian (a form of "Spanish" where "j" is pronounced as a
> laryngeal, similar to "х/h" for Danilo; sorry I do not know Vietnamese) and
> my use of "jota" (the letter does not change its name) in Catalan (another
> form of "Spanish" where "j" is pronounced more or less like in French,
> similar to "ж/ž" for Danilo); and in Valencian (a variant of Catalan, so
> another form of "Spanish", spoken where I am living) it is pronounced as
> an affricate, that is... as in English.
>
> Now, the very interesting thing is that people here, when they do not know
> the context language, use... their local pronunciation; so the *same* jota is
> pronounced differently by different Spanish persons. As a practical example,
> the name of the letter itself, jota, is pronounced /xota/ (/хота/) in
> Castilla, /ʒɔtə/ (/жота/) in Barcelona and /d͡ʒɔta/ (/джота/) here in
> Valencia.
>
> And of course, NOBODY is willing to have three different Js on her
> keyboard (plus another to write German, as a bonus.)

Well, this is certainly the case where one should have one single J on
her keyboard, since it's the same *character* of the language, whether
or not it is pronounced the same.  Basically, my assumed definition
of a character fails here, since there's another layer: a reader (as a
person).

I argued solely based on the phonetic properties, since those are what
I'm most familiar with.  Anyone more clued into it (just like you are in
case of Spanish) would be able to come up with better criteria.
FWIW, whatever I said may be completely bogus in that sense :)

The issue is whether you would care about being able to differentiate
in your text processor between languages or not.  So, "jota" would
still make sense in Spanish, whatever it was pronounced as, but not
much sense in English (since it's not a word there).  I think this is
a good property to know.

> The Unicode analysis here is that they are the same, since there is a
> continuum of uses that embrace both languages (in other words, you will not
> encounter systemic differences inside a given language, even if you can
> encounter systemic differences *between* languages). On the other hand, they
> decided that there are systemic differences between A and А
> (Latin/Cyrillic).

Yes, and I'm just pointing out that it is not really the case, or
rather, that it depends on the language.  Also, when one sees
"дж" in Serbian, he'd pronounce it like д/d and ж/zhe, while in
Russian, one would pronounce it like Serbians pronounce "џ".

We must agree that these differences Unicode went after are
glyph-based, rather than character-based.  I mean, a counterexample
is enough to prove invalidity of an assertion (they say they encode
characters, but not glyphs; I say that "a" and "а" are the same character
in Serbian, thus their claim is false :).
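
To make the distinction concrete, here is a minimal sketch using Python's
standard unicodedata module (the helper name describe() is only an
illustration, not anything from the Unicode documents).  It shows that
Latin "a" and Cyrillic "а" are encoded as two separate code points, while
"j" is one code point regardless of the language it is read in.

```python
import unicodedata

def describe(ch):
    """Return the code point (as U+XXXX) and the Unicode name of one character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)

latin_a = "\u0061"     # Latin small letter a
cyrillic_a = "\u0430"  # Cyrillic small letter a, visually near-identical

for ch in (latin_a, cyrillic_a, "j"):
    print(describe(ch))

# The two "a" letters compare unequal even though they usually share a glyph.
print(latin_a == cyrillic_a)
```

Whether that split reflects characters or glyphs is exactly the point under
dispute; the code only shows what the encoding does, not why.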

> Also, in the case of "j", the fact is that one can trace it down in the
> evolution of the script(s), and all forms of "j" do have a common ancestor
> (no earlier than XVIth century).

Latin and Cyrillic scripts both have a common ancestor as well: the Greek
alphabet.  "Reformed Cyrillic" (current, 19th century Cyrillic) also
shares a lot of glyphs with the Latin script.  So, this doesn't seem
to be the argument Unicode actually used.

>> Also, imagine the chaos for OCR programs: you'd have to tell them
>> ahead of time which language they are supposed to read in.
>
> This is an aside, but already you have to tell them: the software will use
> that information to select one dictionary over another, and this enhances the
> result by a very important margin. For example, until you tell the
> OCR software you are reading Vietnamese, it will discard any marks it
> "sees" below the vowels as being meaningless.

Indeed.  As I argued already, this would mostly cause problems for
pre-existing content, where we have no language tags.  With any new
content, there's commonly some indicator of content language
(e.g. current keyboard layout, system language, explicit setting of
language like with OCR, etc).  It will be wrong in a small number of
cases (such as my using Serbian layout to type Russian).
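
A rough sketch of that fallback, in Python.  Everything here is
hypothetical: the TaggedText record, the tag_text() function and its
parameter names are invented for illustration, and the priority order
(explicit setting, then keyboard layout, then system language) is just one
plausible choice.  "und" is the BCP 47 tag for an undetermined language.

```python
from dataclasses import dataclass

@dataclass
class TaggedText:
    text: str
    lang: str    # language tag such as "sr", "ru", "es"
    source: str  # which indicator supplied the tag

def tag_text(text, explicit=None, keyboard_layout=None, system_lang=None):
    """Tag text with the first available language indicator (hypothetical order)."""
    indicators = (
        ("explicit", explicit),
        ("keyboard", keyboard_layout),
        ("system", system_lang),
    )
    for source, lang in indicators:
        if lang:
            return TaggedText(text, lang, source)
    # No indicator at all: the situation of pre-existing, untagged content.
    return TaggedText(text, "und", "none")

# Typing Russian on a Serbian layout yields the wrong tag...
guessed = tag_text("привет", keyboard_layout="sr", system_lang="en")
# ...unless the language is set explicitly.
fixed = tag_text("привет", explicit="ru", keyboard_layout="sr")
print(guessed.lang, fixed.lang)
```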

Btw Antoine, thanks a lot for enlightening me more about Spanish (I'm
just starting to learn it, so I really care about it :).

Cheers,
Danilo

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
