On Tue, Jun 05, 2001 at 01:31:38PM -0400, Dan Sugalski wrote:
> The other issue it actively brought up was the complaint about having to
> share glyphs amongst several languages, which didn't strike me as all that
> big a deal either, except perhaps as a matter of national pride and/or easy
> identification of the language of origin for a glyph. Not being literate in
> any of the languages in question, though, I didn't feel particularly
> qualified to make a judgement as to the validity of the complaints.
There are a number of related problems here; the Han unification effort has pissed off some Asians on several counts. The easiest part to explain is display; this isn't something that Perl particularly needs to care about, but the same glyph may need to look different in Chinese than it does in Japanese. For the rest, I refer the assembly to my undergraduate dissertation :) :

--------

Unicode itself is, like the JIS standard, simply an enumeration of characters with their orderings; it says nothing about how the data is represented to the computer, and must be supplemented by one of several Unicode Transformation Formats which describe the encoding. However, despite the huge benefits to programmers worldwide, two critical problems are hindering the adoption of Unicode amongst the Japanese computer-using community. The first objection is technical; the second is more sociological.

The technical objection stems from the fact that the Unicode Consortium initially assigned a fixed space for all Japanese, Chinese and Korean characters, allowing only just under 28,000 characters. This space has nearly been filled, with 20,902 basic characters already accepted and 6,585 new characters under review; the situation is not going to get any better, as Chinese characters continue to be invented for use in proper names and so on. It is evident that 28,000 characters is not going to be anywhere near enough, and programmers have felt betrayed that the promised `fully universal character set' would satisfy all other languages but theirs. Thankfully, the Unicode Consortium has recently assigned a further extension plane for CJK characters and adopted another 42,711 characters, meaning that all the characters in the Chinese Han Yu Da Zidian and the Japanese Morohashi Dai Kanwa Jiten are now adopted into Unicode. However, many programmers are unaware of the extension plane and still feel that the Unicode Consortium is ignoring their plight.
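To make the extension-plane point concrete, here is a small sketch (in Python, which is my choice rather than anything assumed by the dissertation; the code point U+20000, the first character of CJK Extension B, is likewise my own example) of how a supplementary-plane Han character travels through the various Unicode Transformation Formats:

```python
# A character from CJK Unified Ideographs Extension B, which lives on
# the supplementary plane discussed above (U+20000 and up).
ch = "\U00020000"

# UTF-8 needs four bytes for anything beyond U+FFFF...
print(len(ch.encode("utf-8")))       # 4

# ...and UTF-16 needs a surrogate pair (two 16-bit code units):
utf16 = ch.encode("utf-16-be")
print(utf16.hex())                   # d840dc00 - the surrogate pair

# UTF-32 is a flat four bytes per character regardless of plane:
print(ch.encode("utf-32-be").hex())  # 00020000
```

The same abstract character, three different byte sequences; this is exactly the character/encoding split the dissertation describes, and it is why software that assumed "one character = one 16-bit unit" struggled once CJK characters spilled past the Basic Multilingual Plane.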
More serious, however, is the decision to unify equivalent characters in the Chinese, Japanese and Korean character sets into a single table known as `Unihan'[10]. This has proved controversial primarily through lack of understanding of the nature of `equivalent characters': the Unihan table does not constitute a dumbing down of the character set, as simplified and traditional forms of characters have been kept distinct. However, Chinese and Japanese variants of the same single character have been unified. The Unicode standard seeks to encode characters rather than glyphs[11], and hence variant characters which arise from differences in writing style have been unified; characters exhibiting structural variation, on the other hand, have not.

The principles on which Han unification took place are, according to [Graham, 2000], not dissimilar to those used to unify characters in the legacy JIS and other character sets. Three rules were used to determine whether or not two kanji should be considered equivalent:

Source Separation Rule: If two kanji were distinct in a primary source character set (JIS in the case of Japanese, GB2312-80 and other GB standards for Chinese, KSC5601-1987 for Korean, and so on), they should not be unified. This allows round-trip conversion between Unicode and the original source. For instance, the following variants of the character for tsurugi, sword, were not unified: [Picture omitted]

Non-Cognate Rule: Kanji which are not cognate are not variants; this prohibits, for instance, the unification of the following characters: [Picture omitted]

Component Structure: If a unification is acceptable under the above rules, it is only carried out if the characters share the same radicals and component features, taking their arrangement into consideration.
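One visible consequence of the Unihan table can be sketched with Python's standard unicodedata module (my illustration, not the dissertation's): every unified Han character carries only a generic, algorithmically derived name, with nothing to say whether it came from a Chinese, Japanese or Korean source.

```python
import unicodedata

# After unification, Unicode records only the abstract character: the
# character name is derived from the code point itself and carries no
# trace of the Chinese, Japanese or Korean source standard.
for ch in "漢字":
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+6F22  CJK UNIFIED IDEOGRAPH-6F22
# U+5B57  CJK UNIFIED IDEOGRAPH-5B57
```

Contrast this with, say, Latin letters, whose names distinguish the script and case; for Han characters the code point is all there is, which is precisely why the variant-selection complaints below arise.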
Using these rules, the CJK Joint Research Group of the ISO technical committee reduced a candidate 121,000 Han characters to 20,902 unique characters[12]. On the other hand, there are some valid objections from the Japanese side, on three specific counts[13]:

Firstly, the JIS standard defines, along with the ordering and enumeration of its characters, their glyph shapes. Unicode, on the other hand, does not. This means that as far as Unicode is concerned, there is literally no distinction between two distinct shapes, and hence no way to specify which should be used. This becomes particularly emotive when one is, for instance, attempting to represent a person's name: if they write their name with a particular preferred variant character, there is no way to communicate that to the computer, and information is lost.

The second objection is again related to character-versus-glyph issues: since the Chinese, Japanese and Korean forms of glyphs are unified into a single character, display of CJK text becomes difficult. As there is no indication of the language of the input, software displaying Unicode text has no hint about the style in which characters should be displayed. Chinese and Japanese fonts have distinct styles, and it is impossible to devise a font in which Japanese and Chinese texts could both be displayed concurrently without one appearing `alien'. For instance, a Chinese user could conceivably see recognisably Japanese variants of characters appearing in his Chinese text, and vice versa. In Unicode's defence, it provides (but discourages) `non-printing' characters which tag the following text as being in a particular language[14]. For instance, the three Unicode characters U+E0001 U+E006A U+E0061 signify that the following text is in Japanese, allowing an application to select the correct font style.
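The Plane 14 tag mechanism above is simple to sketch: U+E0001 is the LANGUAGE TAG character, and each letter of the language code is the corresponding ASCII character offset by 0xE0000. A minimal Python illustration (my own helper function, not part of any standard library; note that these tag characters were later deprecated by the Unicode Consortium):

```python
# Build a Plane 14 language tag for a language code such as "ja".
# U+E0001 introduces the tag; each following tag character is the
# ASCII letter shifted up by 0xE0000 (so 'j' -> U+E006A, 'a' -> U+E0061).
def language_tag(code: str) -> str:
    return "\U000E0001" + "".join(chr(0xE0000 + ord(c)) for c in code)

tag = language_tag("ja")
print([f"U+{ord(c):X}" for c in tag])  # ['U+E0001', 'U+E006A', 'U+E0061']
```

A renderer that understood these invisible characters could switch to a Japanese-style font for the text that follows, which is exactly the hint that raw Unihan text lacks.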
Finally, there is a historiographical issue: when computers are used to digitise and store historical literature containing archaic characters, specifying the exact variant character becomes an important consideration. Once again, this can be made more emotive by considering the digitisation of Japanese or Korean family records. In such a case, one would want to make a faithful representation of the original source document, something which Unihan unification does not permit.

[10] The characters are known in Unicode as Han ideographs; the name `Han' is a reference to the Chinese origin of the characters, which are known as hanzi in Chinese, hanja in Korean, chu Han in Vietnamese and, of course, kanji in Japanese.
[11] See Appendix B
[12] [Lunde, 1999, pp.124-125] presents a graphical view of the unification process.
[13] This section is adapted from [Cheong, 1999]
[14] [Whistler and Adams, 2001]

-- 
"He was a modest, good-humored boy. It was Oxford that made him insufferable."