> People have posted cases where the number of TC characters that make
> up a word is different from the number of SC characters that make up
> the same word. People have also posted cases where the number of
> characters remains the same but the mapping depends on context.
You are talking about combinations of glyphs; in the IDN context we call them words, or labels. A label may sometimes contain only one character, and it is easier for us to discuss this without having to show the glyphs on our screens here on this list. This may be the reason for the confusion above. Let me try to elaborate a little on Chinese character processing, meaning one glyph at a time, as in the Unicode table.

There are 20,000+ often-used symbols worldwide, as collected in Plane 0 of the UCS. Mainland China uses about 7,000 regularly; Taiwan uses about 13,000 regularly. We call these the frequently used characters. Among the frequently used characters there are always semantic differences between any two given characters, due to the history and locality of their usage, as you can imagine. This creates the need for organizing the characters, and thus for dictionary editing and standards work, throughout the history of the written Chinese language, and especially once computers came along.

The first classification is certainly those characters that are semantically distinct and distinct in written form. Many semantically distinct but similar-looking characters fall into a carefully explained category in the education sector, as well as in written-language criticism. This is analogous to Latin spell checking as an educational activity, but not to the equivalent-symbol-set concern being discussed here on this list. It is certainly a spell-checking feature as far as input is concerned.

The second classification is those characters whose meanings overlap but are not the same. We translate this category as synonyms, or "same meaning characters." But they are not a thesaurus; thesauri were introduced into China only in recent years. These characters are not the subject of any unification or mixing. The correct usage in a text has to be differentiated by context, so word dictionaries are used to help; this is an AI feature in editor software.

The third classification is characters that are semantically "identical" but have many different forms, accumulated especially through the long history of preserving these characters. The majority of the characters in the UCS beyond the 20,000 frequently used characters belong to this category, and the majority of TC/SC pairs belong to it as well. As I have said, if you go looking for some detail that differs, you will always find one among these characters. This third classification is what concerns us here: the preference of display, and the possible inclusion of more character forms beyond the frequently used character set, or their exclusion from such an equivalence set, which has been raised by Japanese users.

Notice that I said the majority of TC/SC pairs belong to this category. This is why you have heard that some users do not agree with the "identical" classification. It is a fact of life that the Han user community has to be precise about which symbol is in which set, so that they have a standard to work from. Excluding this equivalence set from the basic [nameprep] profile would definitely mark the failure of IDN, and would cause more "trademark" conflicts along the way.

> People have stated that conversion between TC and SC requires a
> dictionary of words, rather than a table of characters. All these
> show that TC/SC is analogous to a spelling difference.

Correct. This deals with the small number (on the scale of 10 vs. 2,000) of TC/SC cases at the input and display level, which should not overtake the fact that TC/SC have to be equivalent identifiers in [nameprep].
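To make the two levels concrete, here is a minimal sketch in Python. The names CHAR_FOLD, WORD_DICT, and fold_label are my own, and the table entries are toy stand-ins rather than real conversion data, but they show why display conversion needs a word dictionary while identifier matching can stay at the character level:

    # Character-level fold: analogous to case folding, and enough for
    # identifier matching.  Toy entries only, not a real table.
    CHAR_FOLD = {
        "發": "发",   # fa1, "to issue"
        "髮": "发",   # fa4, "hair" -- two TC characters fold to one SC
        "頭": "头",
        "後": "后",
    }

    # Word-level dictionary: needed for display conversion in the other
    # direction (SC -> TC), where one SC character expands differently
    # depending on the word it appears in.  Toy entries only.
    WORD_DICT = {
        "头发": "頭髮",   # "hair":        发 becomes 髮 here
        "发展": "發展",   # "development": but 發 here
    }

    def fold_label(label):
        """Fold a label to its SC form for identifier matching."""
        return "".join(CHAR_FOLD.get(ch, ch) for ch in label)

    # Identifier matching is context-free and deterministic:
    assert fold_label("頭髮") == fold_label("头发")

    # Display conversion SC -> TC is not: a character table alone cannot
    # choose between 發 and 髮, so the word dictionary has to decide.
    assert WORD_DICT["头发"] == "頭髮"

So "a dictionary of words rather than a table of characters" is true at the display level, yet it does not prevent a per-character equivalence in [nameprep].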
As for using characters as identifiers in IDN, the job we have to be concerned with is reducing these semantically "identical" characters, from whatever their number, down to a "no trademark conflict" level of clearance: a viable symbol set which we can permit in IDN for identifier matching. In this sense, it is like the case-insensitive treatment of Latin symbols. Yes, we do want uppercase too, but at the identifier level they are the same!

> I don't claim that TC/SC conversion or equivalence is not a problem.

I hope that the explanation above has shown a feasible solution to this problem, to your satisfaction.

> Neither do I claim that the potential confusion between <GREEK
> CAPITAL LETTER ALPHA> and <LATIN CAPITAL LETTER A> is
> not a problem.

This is a problem for IDN, and it is the opposite of TC/SC equivalence. Because these symbols are picked up, pasted, or typed from a mixture of applications and user interfaces, any one of them can be the bad guy hidden from someone's eye, and the machines only know about bits. If you add more forms of encoding, such as UTF-8/UTF-16 or input keystroke sequences, then the problem escalates quickly. The solutions that I can think of at this moment would be two:

1. Unification of symbols, like CJK unification, with an equivalent symbol set defined;
2. A transparent language tag, to enforce that each label is consistent with its tag throughout the system, including DNS.

If we work out the CJK-in-IDN problem, then this will be a piece of cake at the end of our IDN banquet :-)

> Neither
> do I claim that the potential confusion between English "theatre"
> and American "theater" or English "lift" and American "elevator"
> are not problems. But I believe that all these problems are
> outside the scope of IDN.

Correct, this problem is outside the scope of IDN.

Regards,
Liana Ye
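P.S. For the ALPHA/A case, here is a minimal sketch of the per-label consistency check behind option 2, without the tag plumbing. The function name is my own, and the check only knows the two scripts from the example; a real implementation would use the full Unicode Scripts property:

    import unicodedata

    def label_scripts(label):
        """Rough per-character script classification via Unicode names.

        Only distinguishes the two scripts from the example above.
        """
        scripts = set()
        for ch in label:
            name = unicodedata.name(ch, "UNKNOWN")
            if name.startswith("GREEK"):
                scripts.add("Greek")
            elif name.startswith("LATIN"):
                scripts.add("Latin")
            else:
                scripts.add("Other")
        return scripts

    # "Alpha" with GREEK CAPITAL LETTER ALPHA in front of Latin letters
    # mixes scripts, so a consistency rule would reject the label:
    print(label_scripts("alpha"))        # {'Latin'}
    print(label_scripts("\u0391lpha"))   # {'Greek', 'Latin'} -> reject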
