Hi, I am a native Japanese speaker, and I think I am somewhat of a Unicode lover compared with the average Japanese person.
At Tue, 8 Jan 2002 23:03:35 -0500, Glenn Maynard wrote:

> What, exactly, needs to be done by an application (or rather, its data
> formats) to accomodate CJK in Unicode (and other languages with similar
> ambiguities)?

The best-known criticism of Unicode is that it unified the "Han ideographs" (Kanji) of Chinese, Japanese, and Korean: characters with similar shapes and origins which many people nevertheless consider different characters. Even native CJK speakers and CJK scholars can disagree on the question "are this Kanji and that Kanji different characters, or the same character with different shapes?" Since Unicode adopted a position different from that of most ordinary Japanese people, Japanese people have come to dislike Unicode in general. It is natural that scholars hold a wider variety of opinions than laypeople, and the Unicode Consortium did find a native Japanese scholar who supports its position; but that position still differs from the ordinary Japanese one....

Thus, Japanese people feel that Unicode cannot distinguish different "characters" from China, Japan, and Korea. Unicode's view is that these are the same characters with different shapes (glyphs), so each should share one codepoint, because Unicode is a _character_ code, not a _glyph_ code. This is Han Unification. Now nobody can stand against the political and commercial power of Unicode, and Japanese people feel helpless....

Note that I have heard that Chinese and Korean people hold a different opinion about Kanji than Japanese people do: they regard the Kanji of China, Japan, and Korea as the same characters with different shapes, and they accept Unicode.

If your software supports only one language at a time, you can use Unicode, and the only problem is choosing the proper font. (Here, a "Japanese font" means a font which supplies Japanese glyphs, in Unicode's view, for the Han Unification codepoints.) The task is then to use a Japanese font for Japanese, a Chinese font for Chinese, and a Korean font for Korean.
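To make the unification concrete, here is a minimal Python sketch (my own illustration, not part of the original mail): the Kanji 直, whose preferred glyph shape differs between Japanese and Chinese typography, occupies a single Unicode codepoint, so plain UTF-8 text by itself cannot record which national form is intended.

```python
# A Han-unified character ("direct"/"straight").  Japanese and Chinese
# typography draw this character differently, but Unicode assigns both
# national forms one and the same codepoint.
ja_text = "直"   # as it would appear in a Japanese document
zh_text = "直"   # as it would appear in a Chinese document

# Both are the identical abstract character in Unicode...
assert ja_text == zh_text
assert ord(ja_text) == 0x76F4

# ...so the byte stream alone cannot say which glyph to draw; that
# choice must come from out-of-band data (font selection, a language
# tag, markup such as an HTML lang attribute, etc.).
print(hex(ord(ja_text)))  # 0x76f4
```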
However, if your software supports multilingual text, the problem can be difficult. Japanese people want to distinguish the unified Kanji. However, many people (even Japanese people) are satisfied if Japanese text is simply rendered with a Japanese font. Thus, an easy compromise is to use a Japanese font for all Han Unification characters (Chinese and Korean people will accept it). I think the Han Unification problem can be ignored for daily usage by adopting this compromise.

> Is knowing the language enough? (For example, is it enough in HTML to
> write UTF-8 and use the LANG tag?)
>
> Is it generally important or useful to be able to change language mid-
> sentence? (It's much simpler to store a single language for a whole data
> element, and it's much easier to render.)

Of course, if your software can carry language information, that is great, and mid-sentence language support is excellent! Using a Japanese font everywhere, as I wrote above, is a _compromise_, so anything that avoids the compromise is always welcome. However, I would rather see a larger and larger percentage of the world's software become able to handle CJK characters as soon as possible than wait for "perfect" CJK support.

There are a few ways to store language information: the language tags above U+E0000, markup languages like XML, and so on. I wonder whether the "Variation Selectors" in Unicode 3.2 Beta (http://www.unicode.org/versions/beta.html) can be used for this purpose.... Does anyone have information?

As for round-trip compatibility: yes, round-trip compatibility with EUC-JP, EUC-KR, Big5, GB2312, and GBK is guaranteed, i.e., Unicode is a superset of each of these encodings (character sets). However, (1) there is no authoritative mapping table between these encodings and Unicode, only various private mapping tables, and this can cause portability problems with round-trip compatibility; and (2) Unicode is _not_ a superset of the combination of these encodings, i.e., Unicode is _not_ a superset of ISO-2022-JP-2 and so on.
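Problem (1) can be seen inside a single modern language runtime. A small Python sketch (my illustration; the principle is exactly the mapping-table divergence described above): the JIS X 0208 character WAVE DASH (row 1, cell 33) decodes to U+301C under the common EUC-JP table but to U+FF5E (FULLWIDTH TILDE) under Microsoft's CP932 table, so the "same" source character lands on two different Unicode codepoints depending on whose table is used.

```python
# One JIS X 0208 character (1-33, WAVE DASH) decoded through two
# widely deployed mapping tables.  Both byte sequences denote the
# identical character in the source character set.
wave_dash_euc   = b"\xa1\xc1"   # EUC-JP encoding of JIS 1-33
wave_dash_cp932 = b"\x81\x60"   # Shift_JIS/CP932 encoding of JIS 1-33

via_euc_jp = wave_dash_euc.decode("euc_jp")    # standard-style table
via_cp932  = wave_dash_cp932.decode("cp932")   # Microsoft's table

print(hex(ord(via_euc_jp)))  # 0x301c  WAVE DASH
print(hex(ord(via_cp932)))   # 0xff5e  FULLWIDTH TILDE

# Two different Unicode codepoints for one source character:
assert via_euc_jp != via_cp932

# Consequence: text decoded with one table may fail to round-trip
# through a codec built on the other table.
try:
    via_euc_jp.encode("cp932")
except UnicodeEncodeError:
    print("U+301C does not round-trip through CP932")
```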
For (1), I am now trying to get the Unicode Consortium to adopt some solution, or at least to publish a notice or technical report about this problem. I hear the Unicode Technical Committee is discussing it now. For (2), no complete solution can exist, because Unicode and ISO-2022 hold different opinions about what constitutes the identity of a character. However, the use of language tags or (perhaps) variation selectors can partly solve this problem. Still, an authoritative way to express the distinction between CJK Kanji must be determined, and everyone must follow it, to preserve portability. As far as I hear, nobody is wrestling with this problem now... and "authoritative" is a political problem rather than a technical one....

Note that the internal encoding may be Unicode, but the stream I/O encoding has to be specified by the LC_CTYPE locale. This is mandatory for internationalized software.

---
Tomohiro KUBOTA <[EMAIL PROTECTED]> http://www.debian.or.jp/~kubota/
"Introduction to I18N" http://www.debian.org/doc/manuals/intro-i18n/
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/