Re: Unicode: endpoint of evolution of encodings? (was Re: gcc and utf-8 source)

Antoine Leca Wed, 17 Nov 2004 04:37:55 -0800

srintuar wrote:
> FWIW, I'd assert that "j" in Spanish is not the same thing as
> "j" in English (and that one is easily proved), apart from them being
> represented with the same *glyph*.


You picked (certainly unintentionally) a very instructive example.
I live in Spain, so I feel qualified to offer an opinion on this one.
While my uses (note the plural) of "j" in "Spanish" are different from my use
(note the singular) in English, there is much more difference between my use
of "jota" in Castilian (a form of "Spanish" where "j" is pronounced as a
laryngeal, similar to "х/h" for Danilo; sorry I do not know Vietnamese) and
my use of "jota" (the letter does not change its name) in Catalan (another
form of "Spanish" where "j" is pronounced more or less as in French,
similar to "ж/ž" for Danilo); and in Valencian (a variant of Catalan, so
another form of "Spanish", spoken where I live) it is pronounced as an
affricate, that is... as in English.

Now, the very interesting thing is that people here, when they do not know
the context language, use... their local pronunciation; so the *same* jota is
pronounced differently by different Spaniards. As a practical example,
the name of the letter itself, jota, is pronounced /xota/ (/хота/) in
Castilla, /ʒɔtə/ (/жота/) in Barcelona and /d͡ʒɔta/ (/джота/) here in
Valencia.

And of course, NOBODY is willing to have three different Js on her
keyboard (plus another to write German, as a bonus.)

> Certainly the character is used differently. However, I would assert
> that it is indeed the same character. Both English and Spanish
> use latin script.

The Unicode analysis here is that they are the same, since there is a
continuum of uses that embraces both languages (in other words, you will not
encounter systemic differences inside a given language, even if you can
encounter systemic differences *between* languages). On the other hand, they
decided that there are systemic differences between A and А
(Latin/Cyrillic).
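
As a rough illustration of that last point (only a sketch; the helper name
`describe` is made up for this example), Python's standard `unicodedata`
module shows that Latin A and Cyrillic А are two separate characters, while
"j" has a single code point regardless of the language it is used for:

```python
import unicodedata

latin_a = "\u0041"     # Latin capital A
cyrillic_a = "\u0410"  # Cyrillic capital A, looks the same in most fonts
latin_j = "\u006A"     # the one and only "j" for English, Castilian, Catalan...


def describe(ch):
    """Return the code point and the Unicode character name of ch."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe(latin_a))
print(describe(cyrillic_a))
print(latin_a == cyrillic_a)  # two distinct characters despite the shared shape
print(describe(latin_j))
```

The two A's compare unequal even though they usually render identically,
which is exactly the Latin/Cyrillic split described above.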

Also, in the case of "j", the fact is that one can trace it back through the
evolution of the script(s), and all forms of "j" do have a common ancestor
(no earlier than the 16th century).


> Also, imagine the chaos for OCR programs: you'd have to tell them
> ahead of time which language they are supposed to read in.

This is an aside, but you already have to tell them: the software will use
that information to select one dictionary over another, and this improves the
result by a very large margin. For example, unless you tell the OCR software
you are reading Vietnamese, it will discard any marks it "sees" below the
vowels as meaningless.

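To make the Vietnamese case concrete, here is a small Python sketch. It is
not how any real OCR engine works, and `drop_marks_below` is a hypothetical
helper; it only simulates a recognizer that throws away marks drawn below
letters, using the standard `unicodedata` module:

```python
import unicodedata

# Canonical combining class 220 means "attached below",
# e.g. U+0323 COMBINING DOT BELOW.
BELOW = 220


def drop_marks_below(text):
    """Simulate a recognizer that ignores marks drawn below letters."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(
        ch for ch in decomposed if unicodedata.combining(ch) != BELOW
    )
    return unicodedata.normalize("NFC", kept)


word = "m\u1EA1"  # "m" + U+1EA1 LATIN SMALL LETTER A WITH DOT BELOW
print(drop_marks_below(word) == "ma")            # the dot below is lost
print(drop_marks_below("\u1EC7") == "\u00EA")    # circumflex above survives
```

Once the mark below is dropped, two different Vietnamese spellings collapse
into the same string, which is why the language hint matters.
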
Antoine


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
