Philippe Verdy responded to a question by SRIDHARAN Aravind:

> > How can I differentiate whether a given character in chinese is
> > simplified or traditional?
>
> Normally you can't with Unicode/ISO10646:
> They are unified now by the UniHan working group, to be used
> for Traditional or Simplified Chinese, or Japanese, or traditional
> Korean and Vietnamese, and other minority languages written with
> this ideographic script.

Correcting some misstatements here...

Actually, in most instances in Unicode you *can* differentiate whether
a given Chinese character is simplified or traditional, precisely
because the two related forms are *NOT* unified in Unicode. Thus, to
pick an example which hasn't already been rendered hackneyed by
discussion:

U+9BE8 jing1 'whale' (traditional character form)
U+9CB8 jing1 'whale' (simplified character form)

So in Unicode you can differentiate the two *by code point*.

Of course, coming up with the exact list of code points is
non-trivial, but as Philippe pointed out, you can get a lot of
information here by examining Unihan.txt. In particular, the
kTraditionalVariant and kSimplifiedVariant fields give mappings back
and forth between such pairs. (The problem is, however, messy around
the edges because of "traditional simplified" forms, 1-to-n mappings,
distinct national simplifications, and similar problems.)

I think what Philippe was trying to convey is that if text is
identified as being encoded using Unicode, you cannot use that fact
alone to determine whether the text is "traditional" or "simplified"
in orthography, since Unicode includes both forms and encompasses
text in either orthography (or even mix-and-match text that would use
both orthographies together, e.g. to contrast the two usages).

This differs from the situation for some traditional East Asian
character sets. For example, identification of charset = cp936 would
indicate that text is "simplified", since that character encoding
does not include many traditional forms, whereas charset = cp950
would indicate that text is "traditional", since that character
encoding does not include many simplified forms.

Incidentally, the "UniHan working group" is a misnomer. The correct
term is Ideographic Rapporteur Group (IRG), the group which does
unifications of candidate CJK ideographs on behalf of WG2 (for
ISO/IEC 10646).

--Ken
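
P.S. For anyone who wants to try the code-point approach described above, here is a minimal Python sketch. It rests on several assumptions that are not part of the discussion itself: that the variant data is available as tab-separated lines of the form "U+XXXX<TAB>field<TAB>value", that the relevant field names are kSimplifiedVariant and kTraditionalVariant, and that a character carrying only a kSimplifiedVariant entry can be treated as traditional (and vice versa). The two sample lines and the helper names load_variants and classify are made up for illustration; a real run would read the Unihan data file instead.

```python
from collections import defaultdict

# Illustrative sample in the assumed tab-separated Unihan layout.
# These two lines are written by hand for this sketch, not copied
# from a particular release of the database.
SAMPLE_UNIHAN = """\
# code point<TAB>field<TAB>value
U+9BE8\tkSimplifiedVariant\tU+9CB8
U+9CB8\tkTraditionalVariant\tU+9BE8
"""

VARIANT_FIELDS = ("kSimplifiedVariant", "kTraditionalVariant")


def load_variants(text):
    """Map each code point to the set of variant fields it carries."""
    fields = defaultdict(set)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # not a "code point, field, value" record
        code_point, field, _value = parts
        if field in VARIANT_FIELDS:
            fields[int(code_point[2:], 16)].add(field)
    return fields


def classify(ch, variants):
    """Rough label for one character, under the assumptions stated above."""
    present = variants.get(ord(ch), set())
    has_simplified_form = "kSimplifiedVariant" in present
    has_traditional_form = "kTraditionalVariant" in present
    if has_simplified_form and has_traditional_form:
        return "ambiguous"    # one of the messy edge cases
    if has_simplified_form:
        return "traditional"  # it points at a simplified counterpart
    if has_traditional_form:
        return "simplified"   # it points at a traditional counterpart
    return "unmarked"         # no variant data: shared form or not Han


if __name__ == "__main__":
    table = load_variants(SAMPLE_UNIHAN)
    for ch in ("\u9be8", "\u9cb8", "A"):
        print(f"U+{ord(ch):04X}", classify(ch, table))
```

The "ambiguous" and "unmarked" labels are there on purpose: most Han characters are common to both orthographies and carry no variant data at all, and the edge cases mentioned above (1-to-n mappings, national simplifications) mean that a per-character label is only a hint. Counting labels over a whole text is likely to be more useful than trusting any single character.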