Hi, I am a native Japanese speaker, and I think I am somewhat of a Unicode lover compared with the average Japanese person.
At Tue, 8 Jan 2002 23:03:35 -0500, Glenn Maynard wrote:

> What, exactly, needs to be done by an application (or rather, its data
> formats) to accomodate CJK in Unicode (and other languages with similar
> ambiguities)?

The best-known criticism of Unicode is that it unified the "Han ideographs" (Kanji) of Chinese, Japanese, and Korean: characters with similar shapes and origins which many people nevertheless consider different characters. Even native CJK speakers and CJK scholars can disagree on the question "are this Kanji and that Kanji different characters, or the same character with different shapes?" Since Unicode adopted a position different from that of most ordinary Japanese people, Japanese people have come to dislike Unicode in general. It is natural that scholars hold a wider variety of opinions than laypeople, and the Unicode Consortium did find a native Japanese scholar who supports its position; but that position still differs from the ordinary Japanese one....

Thus, Japanese people feel that Unicode cannot distinguish different "characters" from China, Japan, and Korea. Unicode's view is that these are the same characters with different shapes (glyphs), so each should share one codepoint, because Unicode is a _character_ code, not a _glyph_ code. This is Han Unification. Now nobody can stand against the political and commercial power of Unicode, and Japanese people feel helpless....

Note that I have heard that Chinese and Korean people hold a different opinion about Kanji than Japanese people do: they regard the Kanji of China, Japan, and Korea as the same characters with different shapes, and they accept Unicode.

If your software supports only one language at a time, you can use Unicode, and the only problem is choosing the proper font. (Here, a "Japanese font" means a font which supplies Japanese glyphs, in Unicode's view, for the Han Unification codepoints.) The task is then to use a Japanese font for Japanese, a Chinese font for Chinese, and a Korean font for Korean.
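To make the unification concrete, here is a minimal Python sketch (my own illustration, not part of the original mail): the Kanji 直, whose preferred glyph shape differs between Japanese and Chinese typography, occupies a single Unicode codepoint, so plain UTF-8 text by itself cannot record which national form is intended.

```python
# A Han-unified character ("direct"/"straight").  Japanese and Chinese
# typography draw this character differently, but Unicode assigns both
# national forms one and the same codepoint.
ja_text = "直"   # as it would appear in a Japanese document
zh_text = "直"   # as it would appear in a Chinese document

# Both are the identical abstract character in Unicode...
assert ja_text == zh_text
assert ord(ja_text) == 0x76F4

# ...so the byte stream alone cannot say which glyph to draw; that
# choice must come from out-of-band data (font selection, a language
# tag, markup such as an HTML lang attribute, etc.).
print(hex(ord(ja_text)))  # 0x76f4
```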
However, if your software supports multilingual text, the problem can be difficult. Japanese people want to distinguish the unified Kanji. However, many people (even Japanese people) are satisfied if Japanese text is simply rendered with a Japanese font. Thus, an easy compromise is to use a Japanese font for all Han Unification characters (Chinese and Korean people will accept it). I think the Han Unification problem can be ignored for daily usage by adopting this compromise.

> Is knowing the language enough? (For example, is it enough in HTML to
> write UTF-8 and use the LANG tag?)
>
> Is it generally important or useful to be able to change language mid-
> sentence? (It's much simpler to store a single language for a whole data
> element, and it's much easier to render.)

Of course, if your software can carry language information, that is great, and mid-sentence language support is excellent! Using a Japanese font everywhere, as I wrote above, is a _compromise_, so anything that avoids the compromise is always welcome. However, I would rather see a larger and larger percentage of the world's software become able to handle CJK characters as soon as possible than wait for "perfect" CJK support.

There are a few ways to store language information: the language tags above U+E0000, markup languages like XML, and so on. I wonder whether the "Variation Selectors" in Unicode 3.2 Beta (http://www.unicode.org/versions/beta.html) can be used for this purpose.... Does anyone have information?

As for round-trip compatibility: yes, round-trip compatibility with EUC-JP, EUC-KR, Big5, GB2312, and GBK is guaranteed, i.e., Unicode is a superset of each of these encodings (character sets). However, (1) there is no authoritative mapping table between these encodings and Unicode, only various private mapping tables, and this can cause portability problems with round-trip compatibility; and (2) Unicode is _not_ a superset of the combination of these encodings, i.e., Unicode is _not_ a superset of ISO-2022-JP-2 and so on.
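Problem (1) can be seen inside a single modern language runtime. A small Python sketch (my illustration; the principle is exactly the mapping-table divergence described above): the JIS X 0208 character WAVE DASH (row 1, cell 33) decodes to U+301C under the common EUC-JP table but to U+FF5E (FULLWIDTH TILDE) under Microsoft's CP932 table, so the "same" source character lands on two different Unicode codepoints depending on whose table is used.

```python
# One JIS X 0208 character (1-33, WAVE DASH) decoded through two
# widely deployed mapping tables.  Both byte sequences denote the
# identical character in the source character set.
wave_dash_euc   = b"\xa1\xc1"   # EUC-JP encoding of JIS 1-33
wave_dash_cp932 = b"\x81\x60"   # Shift_JIS/CP932 encoding of JIS 1-33

via_euc_jp = wave_dash_euc.decode("euc_jp")    # standard-style table
via_cp932  = wave_dash_cp932.decode("cp932")   # Microsoft's table

print(hex(ord(via_euc_jp)))  # 0x301c  WAVE DASH
print(hex(ord(via_cp932)))   # 0xff5e  FULLWIDTH TILDE

# Two different Unicode codepoints for one source character:
assert via_euc_jp != via_cp932

# Consequence: text decoded with one table may fail to round-trip
# through a codec built on the other table.
try:
    via_euc_jp.encode("cp932")
except UnicodeEncodeError:
    print("U+301C does not round-trip through CP932")
```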
For (1), I am now trying to get the Unicode Consortium to adopt some solution, or at least to publish a notice or technical report about this problem. I hear the Unicode Technical Committee is discussing it now. For (2), no complete solution can exist, because Unicode and ISO-2022 hold different opinions about what constitutes the identity of a character. However, the use of language tags or (perhaps) variation selectors can partly solve this problem. Still, an authoritative way to express the distinction between CJK Kanji must be determined, and everyone must follow it, to preserve portability. As far as I hear, nobody is wrestling with this problem now... and "authoritative" is a political problem rather than a technical one....

Note that the internal encoding may be Unicode, but the stream I/O encoding has to be specified by the LC_CTYPE locale. This is mandatory for internationalized software.

---
Tomohiro KUBOTA <[EMAIL PROTECTED]> http://www.debian.or.jp/~kubota/
"Introduction to I18N" http://www.debian.org/doc/manuals/intro-i18n/
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/