Many have responded:

> Meanwhile, it is true that there are simplified characters which 
> correspond to more than one traditional form.
...
> This is the kind of mess that has discouraged anybody from doing a 
> systematic survey of simplifications for the Unihan database.
...
> Before converting TC to SC, one should resolve all TC variants to
> the most "common" or "standard" TC form (good luck deciding what that
> means).
...
> I think that any mapping will fail.

Thanks to everyone for your input concerning the TC/SC mapping issue.  You 
have confirmed what I already knew but needed concrete evidence for: namely, 
that mapping between Traditional Chinese and Simplified Chinese is not a 
simple 1-to-1 table lookup problem, but involves lexical analysis and even 
knowledge of the author's intent.

Currently on the IDN mailing list there is a big debate over this topic.  It 
is well known that ASCII-based domain names are matched in the DNS in a 
case-insensitive manner.  Many people recognize that Chinese readers who are 
familiar with both TC and SC consider text written in the two sub-scripts to 
be interchangeable, in roughly the same way that uppercase and lowercase 
Latin are interchangeable.  They would like Chinese domain names written in 
TC to match the "equivalent" name written in SC, just as "UNICODE.ORG" 
matches "unicode.org".

The difficulty is getting people to understand the scope of the problem.  As 
you have illustrated so well, TC/SC mapping is NOT, in the general case, as 
simple as Latin case mapping.  It requires content analysis, and possibly 
some form of tagging.
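
To make the contrast concrete, here is a minimal Python sketch.  The table 
`SC_TO_TC` and the helpers `fold_case` and `tc_candidates` are hypothetical 
names of my own, and the three entries are only a tiny illustrative sample 
of well-known one-to-many cases, not data taken from Unihan or any real 
mapping table.

```python
# Illustrative sample only: a few simplified characters that each stand
# for more than one traditional character. Not a real mapping table.
SC_TO_TC = {
    "发": ["發", "髮"],        # "to emit" vs. "hair"
    "后": ["後", "后"],        # "after/behind" vs. "queen"
    "干": ["乾", "幹", "干"],  # "dry" vs. "to do" vs. "shield"
}


def fold_case(label):
    """Latin case matching: every input has exactly one folded form."""
    return label.casefold()


def tc_candidates(sc_char):
    """Return every traditional character this simplified one may stand for.

    Characters missing from the sample table are returned unchanged.
    """
    return SC_TO_TC.get(sc_char, [sc_char])


# Case folding is a pure function of the input string.
print(fold_case("UNICODE.ORG") == fold_case("unicode.org"))  # True

# SC -> TC yields several candidates; picking one needs the context.
print(tc_candidates("发"))  # ['發', '髮']
```

Case folding needs nothing but the string itself, whereas choosing among 
the candidates returned by `tc_candidates` needs to know which word was 
meant, and a bare domain label does not carry that information.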

Almost all of the list members whose e-mail addresses end in .cn, .tw or .hk 
seem to believe that there is a willful disregard on the part of the working 
group for the needs of Chinese users in this respect.  We have tried to 
convince them that (a) the solution is not as simple as Latin case mapping, 
as many have portrayed it; (b) the problem is not with Unicode Han 
unification, since TC and SC are not unified; (c) content analysis is not 
feasible for domain names; and (d) the entire problem is out of scope of the 
IDN WG.  We have proposed that organizations register both <TC><TC><TC>.cn 
and <SC><SC><SC>.cn if they want both names to resolve.  So far, not 
much convincing has taken place.  In response to this proposal, they claim 
that all eight (2^3) possible combinations (e.g. "<TC><SC><TC>.cn") would 
need to be registered, which they consider overkill.
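
The arithmetic behind the "eight combinations" claim can be sketched in a 
few lines of Python.  The function name `mixed_script_variants` and the 
sample label are my own illustrations, not anything proposed on the IDN 
list, and the sketch assumes the simplest case, in which every character in 
the label has exactly one TC form and one SC form.

```python
from itertools import product


def mixed_script_variants(forms_per_position):
    """Enumerate every label obtained by picking one form per position."""
    return ["".join(choice) for choice in product(*forms_per_position)]


# Hypothetical three-character label; each tuple is (TC form, SC form).
label = [("國", "国"), ("學", "学"), ("網", "网")]

variants = mixed_script_variants(label)
print(len(variants))  # 8, i.e. 2**3
for v in variants:
    print(v + ".cn")
```

Under that assumption a label of n such characters has 2^n mixed forms, so 
the count doubles with each additional character.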

One list member has even proposed the prohibition of all CJK code points from 
internationalized domain names "until the problem can be solved," and he has 
the support of several others.  It is obvious that this is an attempt to 
hijack the entire IDN model by claiming "it does not support Chinese at all," 
which would certainly be true if Han characters were prohibited, and imposing 
a locally-constructed, Chinese-specific (i.e. not universal) model later on.

Unfortunately, as an American who does not speak or read Chinese, I have been 
in a poor position to argue with these people about their own written 
language.  So I relied on the combined expertise of the Unicode list, 
including native speakers and people with doctorates in Chinese, for 
background information.  Thanks again for your help.

-Doug Ewell
 Fullerton, California
