Re: Reading Chinese Characters from a browser

Kenneth Whistler Wed, 09 Jul 2003 19:45:07 -0700

Philippe Verdy responded to a question by SRIDHARAN Aravind:

> > How can I differentiate whether a given character in chinese is
> > simplified or traditional? 
> 
> Normally you can't with Unicode/ISO10646: 
> They are unified now by the UniHan working group, to be used 
> for Traditional or Simplified Chinese, or Japanese, or traditional 
> Korean and Vietnamese, and other minority languages written with 
> this ideographic script.


Correcting some misstatements here...

Actually, in most instances in Unicode you *can* differentiate
whether a given Chinese character is simplified or traditional,
precisely because the two related forms are *NOT* unified in
Unicode. Thus, to pick an example which hasn't already been
rendered hackneyed by discussion:

U+9BE8 jing1 'whale' (traditional character form)
U+9CB8 jing1 'whale' (simplified character form)

So in Unicode you can differentiate the two *by code point*.
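
A minimal sketch of what differentiating "by code point" can look like in practice. The two-entry sets and the classify() helper are hypothetical names made up for illustration; a usable table would have to be built from real data rather than typed in by hand.

```python
# Hypothetical hand-written tables holding only the 'whale' pair from above.
# A real table would be far larger and derived from a data source.
TRADITIONAL_ONLY = {0x9BE8}  # U+9BE8, traditional form of 'whale'
SIMPLIFIED_ONLY = {0x9CB8}   # U+9CB8, simplified form of 'whale'


def classify(ch: str) -> str:
    """Classify a single character by looking up its code point."""
    cp = ord(ch)
    if cp in TRADITIONAL_ONLY:
        return "traditional"
    if cp in SIMPLIFIED_ONLY:
        return "simplified"
    # Most characters are shared by both orthographies or simply not listed.
    return "unknown"


if __name__ == "__main__":
    for ch in ("\u9be8", "\u9cb8", "A"):
        print(f"U+{ord(ch):04X} -> {classify(ch)}")
```

The "unknown" result matters: a character missing from both sets may be common to both orthographies, so it says nothing either way.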

Of course, coming up with the exact list of code points is
non-trivial, but as Philippe pointed out, you can get a
lot of information here by examining Unihan.txt. In particular,
the kTraditional and kSimplified fields give mappings back
and forth between such pairs. (The problem is, however, messy
around the edges because of "traditional simplified" forms,
1-to-n mappings, distinct national simplifications, and
similar problems.)
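
As a rough sketch of pulling those mappings out of Unihan-style data: the code below assumes tab-separated lines of the form "U+XXXX<TAB>field<TAB>value" and assumes the fields are labeled kSimplifiedVariant and kTraditionalVariant in the data file. The three sample lines are illustrative stand-ins, not an excerpt checked against any particular release, and load_variants() is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Illustrative sample in the assumed tab-separated layout.
SAMPLE = (
    "# code point<TAB>field<TAB>value\n"
    "U+9BE8\tkSimplifiedVariant\tU+9CB8\n"
    "U+9CB8\tkTraditionalVariant\tU+9BE8\n"
    "U+9BE8\tkDefinition\twhale\n"
)

CODE_POINT = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def load_variants(lines):
    """Return (to_simplified, to_traditional) as dicts of char -> list of chars.

    Values are lists because a mapping can be 1-to-n.
    """
    to_simplified = defaultdict(list)
    to_traditional = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # not in the assumed three-column layout
        cp, field, value = parts
        if field == "kSimplifiedVariant":
            table = to_simplified
        elif field == "kTraditionalVariant":
            table = to_traditional
        else:
            continue  # some other field, e.g. a definition
        source = chr(int(cp[2:], 16))
        for hex_digits in CODE_POINT.findall(value):
            table[source].append(chr(int(hex_digits, 16)))
    return dict(to_simplified), dict(to_traditional)


if __name__ == "__main__":
    to_simp, to_trad = load_variants(SAMPLE.splitlines())
    for src, targets in to_simp.items():
        print(f"U+{ord(src):04X} simplifies to",
              [f"U+{ord(t):04X}" for t in targets])
```

Keeping lists as values leaves the messy cases visible instead of silently picking one target; deciding what to do with a 1-to-n mapping is left to the caller.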

I think what Philippe was trying to convey is that if text
is identified as being encoded using Unicode, you cannot
use that fact alone to determine whether the text is
"traditional" or "simplified" in orthography, since Unicode
includes both forms and encompasses text in either
orthography (or even mix-and-match text that would use
both orthographies together, e.g. to contrast the two usages).

This differs from the situation for some traditional East
Asian character sets. For example, identification of
charset = cp936 would indicate that text is "simplified",
since that character encoding does not include many
traditional forms, whereas charset = cp950 would indicate
that text is "traditional", since that character
encoding does not include many simplified forms.
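
One way to see this effect is to test whether a character survives encoding into a legacy charset. The sketch below uses Python's "gb2312" and "big5" codecs as stand-ins for the simplified and traditional character sets; that choice is an assumption made for the illustration, since Python's "cp936" codec is an alias for GBK, whose repertoire is wider than GB 2312. can_encode() is a hypothetical helper name.

```python
def can_encode(text: str, charset: str) -> bool:
    """Return True if every character in text is representable in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    traditional_whale = "\u9be8"  # U+9BE8
    simplified_whale = "\u9cb8"   # U+9CB8
    for charset in ("gb2312", "big5"):
        print(charset,
              "traditional:", can_encode(traditional_whale, charset),
              "simplified:", can_encode(simplified_whale, charset))
```

With these two codecs the simplified form should encode only as gb2312 and the traditional form only as big5. A failed encoding is only a hint about orthography, since a character can be missing from a charset for unrelated reasons.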

Incidentally, the "UniHan working group" is a misnomer. The
correct term is Ideographic Rapporteur Group (IRG), the
group which does unifications of candidate CJK ideographs
on behalf of WG2 (for ISO/IEC 10646).

--Ken
