On Thursday, January 24, 2002, at 09:39 AM, [EMAIL PROTECTED] wrote:

>
> Currently on the IDN mailing list there is a big debate over this topic.
> It is well known that ASCII-based domain names are matched in the DNS in
> a case-insensitive manner.  Many people recognize that Chinese readers
> who are familiar with both TC and SC consider text written in the two
> sub-scripts to be interchangeable, in roughly the same way that uppercase
> and lowercase Latin are interchangeable.  They would like Chinese domain
> names written in TC to match the "equivalent" name written in SC, just as
> "UNICODE.ORG" matches "unicode.org".
>

Actually, this is more like asking "honor" and "honour" to match.

> Almost all of the list members whose e-mail addresses end in .cn, .tw or
> .hk seem to believe that there is a willful disregard on the part of the
> working group for the needs of Chinese users in this respect.  We have
> tried to convince them that (a) the solution is not as simple as Latin
> case mapping, as many have portrayed it; (b) the problem is not with
> Unicode Han unification, since TC and SC are not unified; (c) content
> analysis is not feasible for domain names; and (d) the entire problem is
> out of scope of the IDN WG.  We have proposed that organizations register
> both <TC><TC><TC>.cn and <SC><SC><SC>.cn if they want both hits to be
> successful.  So far, not much convincing has taken place.  In the above
> case, they claim that all eight (2^3) possible combinations (e.g.
> "<TC><SC><TC>.cn") would need to be registered, which is overkill.
>

The bulk of Han ideographs don't occur in TC/SC pairs, so this is specious.
I.e., to register the equivalent of "unicode.org", you only need two
registrations, "<U+540C><U+4E00><U+78BC>.org" (TC) and
"<U+540C><U+4E00><U+7801>.org" (SC).  You don't need eight registrations,
because only the third character has distinct TC and SC forms.
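
To make the counting concrete, here is a minimal sketch in Python.  The
VARIANTS table and the spellings() helper are hypothetical and written by
hand for this one example; they are not taken from any UTC data file.  The
number of spellings is the product of the number of forms each character
has, and characters without a TC/SC pair contribute a factor of one.

```python
from itertools import product

# Hypothetical, hand-written table covering only the example above.
# Characters that have no separate TC/SC forms are simply absent.
VARIANTS = {
    "\u78BC": ("\u78BC", "\u7801"),  # 碼 (TC) / 码 (SC)
    "\u7801": ("\u78BC", "\u7801"),
}


def spellings(label):
    """Return every TC/SC spelling of a label under the VARIANTS table."""
    # Each character contributes its variant tuple, or just itself.
    options = [VARIANTS.get(ch, (ch,)) for ch in label]
    return ["".join(combo) for combo in product(*options)]


names = spellings("\u540C\u4E00\u78BC")  # 同一碼
print(len(names), names)
```

With this table the label has 1 x 1 x 2 = 2 spellings, not 2^3 = 8.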

Meanwhile, I'd like to offer a suggestion:

*If* they can live with one caveat, and *if* they can give us time to 
clean up our SC/TC mapping data, we could do the following:

1) SC/TC matching on Unicode data is only to be done on the SC/TC mapping 
data supplied by UTC.

2) Wherever a single SC character matches multiple TC characters, all the 
characters are to be treated the same.

This means, for example, that U+53F0 (台) will be treated the same as 
U+6AAF (檯), U+81FA (臺), and U+98B1 (颱).  This also means, of course, that 
U+6AAF, U+81FA, and U+98B1 will end up being indistinguishable even in 
purely TC names.

3) This includes Unicode compatibility mappings.  (Thereby reducing a lot 
of turtles, if nothing else.)
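
The three points above could be sketched roughly as follows in Python.
This is only an illustration under stated assumptions: the TC_TO_SC table
is a hand-written stand-in for the mapping data UTC would supply, the
fold() and same_name() helpers are hypothetical names, and using NFKC for
point 3 is my guess at one way to apply the compatibility mappings, not
something any IDN draft specifies.

```python
import unicodedata

# Illustrative stand-in for the UTC-supplied SC/TC mapping data (point 1).
# Several TC characters may fold to one SC character (point 2).
TC_TO_SC = {
    "\u6AAF": "\u53F0",  # 檯 -> 台
    "\u81FA": "\u53F0",  # 臺 -> 台
    "\u98B1": "\u53F0",  # 颱 -> 台
    "\u78BC": "\u7801",  # 碼 -> 码
}


def fold(label):
    """Reduce a label to a matching key; not a lexical TC-to-SC conversion."""
    # Assumption: NFKC stands in for "Unicode compatibility mappings" (point 3).
    normalized = unicodedata.normalize("NFKC", label)
    # Replace each mapped TC character by its SC counterpart.
    return "".join(TC_TO_SC.get(ch, ch) for ch in normalized)


def same_name(a, b):
    """Two labels match when their folded keys are identical."""
    return fold(a) == fold(b)
```

Under this sketch same_name("\u81FA", "\u98B1") is true, which is the
loss of distinction within purely TC names described in point 2, while
pairs absent from the table, such as U+8AAA and U+8AAC, still do not match.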

The caveat is that this must be understood to be a first-order, 
computer-appropriate equivalence and is not in any way to be held to be a 
generalized solution to the lexically appropriate conversion between SC 
and TC.  It also has to be understood that some things are going to slip 
through because it is not a generalized solution to Han normalization.  
Lexically inappropriate matches will take place!

(Maybe we should refer to *zhengguihua* instead of "Han normalization"…)

It also means that some desired matches won't happen, and some things can 
be "spoofed" by these nasty variant issues such as came up yesterday.  
U+9EBC and U+9EBD aren't likely to both match U+4E48.

However, this is already a problem in Unicode.  "shuowen.org" will have to 
register both "<U+8AAA><U+6587>.org" and "<U+8AAC><U+6587>.org"; Jingwa, 
Inc., will need both "<U+4E3C><U+86D9>" and "<U+4E95><U+86D9>".

OK, so this is more than one caveat.  It will also mean that we will no 
longer be able to accept both the TC and SC form for a character as a 
candidate for separate encoding in the future, and future compatibility 
ideographs will be excluded from use in IDN.  (Actually, you could save 
yourself some grief right off by excluding Han radicals and all 
compatibility ideographs.)

==========
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/

