Many have responded:

> Meanwhile, it is true that there are simplified characters which
> correspond to more than one traditional form.
...
> This is the kind of mess that has discouraged anybody from doing a
> systematic survey of simplifications for the Unihan database.
...
> Before converting TC to SC, one should resolve all TC variants to
> the most "common" or "standard" TC form (good luck deciding what that
> means).
...
> I think that any mapping will fail.

Thanks to everyone for your input concerning the TC/SC mapping issue. You have confirmed what I already knew but needed concrete evidence for: that mapping between Traditional Chinese and Simplified Chinese is not a simple 1-to-1 table lookup problem, but involves lexical analysis and even knowledge of the author's intent.

Currently there is a big debate over this topic on the IDN mailing list. It is well known that ASCII-based domain names are matched in the DNS in a case-insensitive manner. Many people recognize that Chinese readers who are familiar with both TC and SC consider text written in the two sub-scripts to be interchangeable, in roughly the same way that uppercase and lowercase Latin are interchangeable. They would like Chinese domain names written in TC to match the "equivalent" name written in SC, just as "UNICODE.ORG" matches "unicode.org".

The difficulty is getting people to understand the scope of the problem. As you have illustrated so well, TC/SC mapping is NOT, in the general case, as simple as Latin case mapping. It requires content analysis, and possibly some form of tagging.

Almost all of the list members whose e-mail addresses end in .cn, .tw or .hk seem to believe that there is a willful disregard on the part of the working group for the needs of Chinese users in this respect. We have tried to convince them that:

(a) the solution is not as simple as Latin case mapping, as many have portrayed it;
(b) the problem is not with Unicode Han unification, since TC and SC are not unified;
(c) content analysis is not feasible for domain names; and
(d) the entire problem is out of scope of the IDN WG.

We have proposed that organizations register both <TC><TC><TC>.cn and <SC><SC><SC>.cn if they want both lookups to succeed. So far, not much convincing has taken place. In the above case, they claim that all eight (2^3) possible combinations (e.g. "<TC><SC><TC>.cn") would need to be registered, which is overkill.
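
The 2^3 count above can be checked with a minimal Python sketch. The table `SC_TO_TC` and the function `label_variants` are hypothetical names invented for illustration, and the four entries are only a toy sample, not a real mapping table. The entry for 发 also shows the one-to-many case raised in the quoted replies, where a single simplified form corresponds to more than one traditional form.

```python
from itertools import product

# Hypothetical toy table: simplified form -> known traditional forms.
# A real table would be far larger, and the thread argues that no such
# table can be complete or unambiguous.
SC_TO_TC = {
    "国": ["國"],
    "学": ["學"],
    "网": ["網"],
    "发": ["發", "髮"],  # one simplified form, two traditional forms
}


def label_variants(sc_label):
    """Return every per-character SC/TC mix of a label written in SC."""
    # For each character, the choices are the SC form plus any TC forms.
    choices = [[ch] + SC_TO_TC.get(ch, []) for ch in sc_label]
    return ["".join(combo) for combo in product(*choices)]


variants = label_variants("国学网")
print(len(variants))  # 8, i.e. 2 * 2 * 2
print(variants)
```

The count grows multiplicatively with the length of the label, and faster still when a simplified character has several traditional counterparts: `label_variants("发")` alone yields three strings.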

One list member has even proposed the prohibition of all CJK code points from internationalized domain names "until the problem can be solved," and he has the support of several others. It is obvious that this is an attempt to hijack the entire IDN model by claiming "it does not support Chinese at all," which would certainly be true if Han characters were prohibited, and imposing a locally-constructed, Chinese-specific (i.e. not universal) model later on.

Unfortunately, as an American who does not speak or read Chinese, I have been in a poor position to argue with these people about their own written language. So I relied on the combined expertise of the Unicode list, including native speakers and people with doctorates in Chinese, for background information. Thanks again for your help.

-Doug Ewell
 Fullerton, California