Ben said:

> However, what I am certain
> is that you have illustrated the fact that
> applications/clients/users/servers/etc can be made to take advantage
> of this explicit labeling of what "script" an IDN is in and over time
> (with people writing appropriate applications) can be developed into a
> very powerful and useful system. (Unlike TLD such as ".gov", ".ca"
> which serves next to no purpose from an IDN's perspective.)

I have to disagree. I am certain that labelling what script an IDN is in will just cause problems. At the very least, it will introduce an entirely new class of error conditions, where the label says one thing but the character content of the IDN does not in fact match the label.

Furthermore, the example we have been talking about here, traditional versus simplified Chinese, is not even a script difference in the first place. "Traditional" versus "Simplified" in a character set context, and as typically implemented, refers to distinctions between Code Page 950 (Big 5) and Code Page 936 (GBK, etc.), together with the fonts, input methods, message resource files, and such, as needed to support them.

And either of those character sets is actually mixed script, since they both support Latin characters from ASCII, as well as the basic Greek alphabet and Bopomofo. "Simplified Chinese" also supports the basic Cyrillic alphabet and Hiragana and Katakana for Japanese.

Even if you are just talking about Traditional versus Simplified Chinese characters (ideographs) within the Han script subparts of Code Page 950 or Code Page 936, the distinction is not as clean as you might think it would be. The PRC simplified set, even in its earlier form in GB 2312, contains *some* traditional forms for characters. But the current extensions, first for GBK (~ Microsoft Code Page 936), and now for GB 18030, incorporate *all* of the Han characters from the Unicode 3.0 repertoire, which means that a "Simplified" code page for China now contains *all* of the traditional characters from Code Page 950, as well as all the simplified characters from Unicode 3.0.

And of course, Unicode data itself encompasses both simplified and traditional forms of Chinese ideographs. So what would the IDN distinction between simplified and traditional mean if the data were encoded in Unicode?

Even the identification of scripts is non-trivial.
Many characters are *shared* between scripts, or are borrowed from one script to the next. Cyrillic and Latin have a long history of cross-borrowing forms from one script into the other, for example, for special uses. And Japanese got all its Chinese characters (kanji) in the first place by borrowing them from Chinese.

See Unicode Technical Report #24, Script Names, for more discussion of this:

http://www.unicode.org/unicode/reports/tr24/

Note, in particular, that "Traditional (Chinese)" and "Simplified (Chinese)" are nowhere mentioned in that report -- those are simply not script distinctions.

--Ken
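
For anyone who wants to poke at the label-versus-content problem, here is a rough sketch in Python. It is only an approximation: Python's standard library does not expose the script property from TR #24, so the sketch assumes that the first word of each character's Unicode name is a good enough stand-in. The function names (name_prefixes, matches_declared) and the sample labels are made up for illustration and are not part of any IDN proposal.

```python
import unicodedata


def name_prefixes(label):
    """Return the set of first words of each character's Unicode name.

    This is a crude proxy for "script" (an assumption of this sketch),
    not the script property defined in TR #24.
    """
    prefixes = set()
    for ch in label:
        # Characters without a name fall back to a placeholder.
        name = unicodedata.name(ch, "UNNAMED")
        prefixes.add(name.split()[0])
    return prefixes


def matches_declared(label, declared):
    """True only if every character's name prefix equals the declared one."""
    return name_prefixes(label) == {declared}


simplified = "\u56fd"    # simplified form of "country"
traditional = "\u570b"   # traditional form of the same word
mixed = "p\u0430ypal"    # second letter is CYRILLIC SMALL LETTER A

print(name_prefixes(simplified), name_prefixes(traditional))
print(name_prefixes(mixed), matches_declared(mixed, "LATIN"))
```

Under this heuristic the simplified and traditional ideographs are indistinguishable (both come back as plain "CJK"), while a label declared as Latin can quietly carry a Cyrillic letter, which is exactly the mismatch error case described above.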
