Re: TC/SC mapping
On Wed, 23 Jan 2002, John H. Jenkins wrote:

> Thomas, do you have a reference for U+9EBC (麼) and U+9EBD (麽) being
> different? The only dictionary I have which contains both is the
> (traditional) CiHai, and it claims they're variants of each other.

Belated, but a little more on these two.

"Annex T: Procedure for the Unification and Arrangement of CJK Ideographs"[1] (AMD8 of ISO/IEC 10646-1:1993), at the very end of section T.3, gives this pair as an example of unification blocked by source separation (a T source is the culprit).

[1] ftp://ftp.cse.cuhk.edu.hk/pub/irg/AnnexT.rtf

Thomas Chan
[EMAIL PROTECTED]
Re: TC/SC mapping
John H. Jenkins wrote:

> Well, first of all, the UTC is already on record as refusing to encode
> new SC separately.
>
> Secondly, we would break IDN equivalence. If we add a new SC which is
> equivalent to two TC,

By your previous graf, that can't happen; so it must be adding a new TC (off a newly dug-up bone, perhaps) which simplifies to two different SCs. Fair enough.

--
Not to perambulate             || John Cowan <[EMAIL PROTECTED]>
   the corridors               || http://www.reutershealth.com
during the hours of repose     || http://www.ccil.org/~cowan
   in the boots of ascension.  \\ Sign in Austrian ski-resort hotel
Re: TC/SC mapping
On Thursday, January 24, 2002, at 12:29 PM, John Cowan wrote:

> John H. Jenkins wrote:
>
> {TC1, SC1, SC2, TC2, TC3, SC3} constitute a "Han simplification
> class" (HSC), and are all the same when appearing in IDNs.
>
> Correct?

Oui.

>> The caveat is that this must be understood to be a first-order,
>> computer-appropriate equivalence and is not in any way to be held to be
>> a generalized solution to the lexically appropriate conversion between
>> SC and TC.
>
> Is there any danger that these classes will turn out to be a
> "small world", in the sense that we wind up with a few huge classes
> which include almost all the characters?

Nope.

>> (Maybe we should refer to *zhengguihua* instead of "Han normalization")
>
> Can you explain the joke?

It's just to make Ken happy. He doesn't like me talking about "Han normalization," since "normalization" is Unicodespeak for something else. "Zhengguihua" is Mandarin for "normalization."

>> It will also mean that we will no longer be able to accept both the TC
>> and SC form for a character as a candidate for separate encoding in the
>> future,
>
> I don't understand this part. Since this is neither compatibility nor
> canonical equivalence, it will not affect any of the known normalization
> forms. Nor are we defining a new normalization form here, since in
> HSCs like the above there is no particular reason to pick any of the
> six characters as *the* normalized form, although by convention we can
> pick one -- say, the one with the smallest Unicode scalar
> value, or the one which appears in the largest number of legacy
> sets -- to aid in description and implementation.
>
> It's just another of those sets of equivalence classes provided for
> special purposes, like the Arabic/Syriac shaping classes or the
> canonical combining classes.

Well, first of all, the UTC is already on record as refusing to encode new SC separately.

Secondly, we would break IDN equivalence. If we add a new SC which is equivalent to two TC, then suddenly domains which could be distinguished on the basis of the old TC pair can't be any more.

> Or are you saying that this new information should be represented
> as a Unicode compatibility equivalence? If so, that would
> wreak havoc with existing NFC and NFKC code.

No,

>> (Actually, you could save yourself some grief right off by excluding Han
>> radicals and all compatibility ideographs.)
>
> This would be a Bad Thing in Korean, though, because the whole point
> of Korean compatibility ideographs is to preserve differences in
> reading. Or are ideographs not used in (modern) Korean names?

These compatibility ideographs are *not* there to provide phonetic-specific distinctions between various Korean hanja. They're for compatibility with an older standard only, which did make that distinction.

IMHO it would be more confusing to Chinese, Japanese, *and* Korean readers to have some domain names distinguished when the only thing different about them is the Korean pronunciation of the hanja used to write them.

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/
Re: TC/SC mapping
On Thursday, January 24, 2002, at 11:44 AM, Thomas Chan wrote:

> On Thu, 24 Jan 2002, John H. Jenkins wrote:
>
>> However, this is already a problem in Unicode. "shuowen.org" will have
>> to register both ".org" and ".org"; Jingwa,
>> Inc., will need both "" and "".
>
> U+8AAA and U+8AAC are given on p. 265 of TUS3.0 as an example of what
> would have been unified had it not been for source separation. Is it
> possible to acquire data on other z-variants?

Er, no.

> The kZVariant fields do not
> seem to contain exactly that data.

Nope. As with the SC/TC problem from yesterday, this is just too messy for anyone to have found the time to do it properly. The bulk of the kZVariant data we have right now is largely derived from the CCCII mapping data.

This is something we're going to ask WG2 to tell the IRG to do.

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/
RE: TC/SC mapping
Doug Ewell wrote:

> Currently on the IDN mailing list there is a big debate over
> this topic. It is well known that ASCII-based domain names
> are matched in the DNS in a case-insensitive manner. Many
> people recognize that Chinese readers who are familiar with
> both TC and SC consider text written in the two sub-scripts
> to be interchangeable, in roughly the same way that
> uppercase and lowercase Latin are interchangeable.

Converting TC to SC is difficult, and the opposite is nearly impossible. But a simple "loose match" like the one you describe does not seem so difficult.

On the other hand, outside the English-only realm, converting uppercase to lowercase is also difficult, and the opposite is nearly impossible. But simple case folding is not so difficult.

Here it is simply a matter of putting together all the groups of ideographs that may be considered variants of each other (not only SC and TC, but also Japanese simplifications, semantic variants, "specialized semantic variants", compatibility equivalents, radicals, etc.), and mapping them *internally* to a single key (e.g., the lowest code point in the group). You need not even care whether the result is TC, SC, or a horrible mix of the two: nobody is supposed to see it anyway.

Of course there are security concerns. The conversion must be well-defined and must not change over time. And, of course, DNS names should be registered in their "folded" version.

_ Marco
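Marco's folding idea can be sketched roughly as follows. This is only an illustration, not an agreed procedure: the function names and the two variant groups are assumptions (the groups are hand-picked from code points mentioned elsewhere in this thread), and the sketch assumes the groups are disjoint.

```python
def build_fold_table(groups):
    """Map every character of each variant group to the group's lowest code point."""
    table = {}
    for group in groups:
        key = min(group, key=ord)  # internal key: lowest code point in the group
        for ch in group:
            table[ch] = key
    return table


def fold_label(label, table):
    """Fold a label for matching only; the result is never shown to users."""
    return "".join(table.get(ch, ch) for ch in label)


# Hypothetical, hand-picked groups (assumed disjoint), for illustration only.
VARIANT_GROUPS = [
    {"\u53f0", "\u6aaf", "\u81fa", "\u98b1"},  # U+53F0, U+6AAF, U+81FA, U+98B1
    {"\u8aaa", "\u8aac"},                      # U+8AAA, U+8AAC
]
FOLD = build_fold_table(VARIANT_GROUPS)

# Two labels match when their folded keys are equal.
print(fold_label("\u8aac\u6587", FOLD) == fold_label("\u8aaa\u6587", FOLD))
```

Because the table is fixed once it is built, the same label always folds to the same key, which is the stability property the security concern above asks for.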
Re: TC/SC mapping
On Thu, 24 Jan 2002, John H. Jenkins wrote:

> However, this is already a problem in Unicode. "shuowen.org" will have to
> register both ".org" and ".org"; Jingwa,
> Inc., will need both "" and "".

U+8AAA and U+8AAC are given on p. 265 of TUS3.0 as an example of what would have been unified had it not been for source separation. Is it possible to acquire data on other z-variants? The kZVariant fields do not seem to contain exactly that data.

Had that example not been pointed out, I wouldn't have known that both were encoded.

Thomas Chan
[EMAIL PROTECTED]
Re: TC/SC mapping
On Thursday, January 24, 2002, at 09:39 AM, [EMAIL PROTECTED] wrote:

> Currently on the IDN mailing list there is a big debate over this topic.
> It is well known that ASCII-based domain names are matched in the DNS in
> a case-insensitive manner. Many people recognize that Chinese readers who
> are familiar with both TC and SC consider text written in the two
> sub-scripts to be interchangeable, in roughly the same way that uppercase
> and lowercase Latin are interchangeable. They would like Chinese domain
> names written in TC to match the "equivalent" name written in SC, just as
> "UNICODE.ORG" matches "unicode.org".

Actually, this is more like asking "honor" and "honour" to match.

> Almost all of the list members whose e-mail addresses end in .cn, .tw or
> .hk seem to believe that there is a willful disregard on the part of the
> working group for the needs of Chinese users in this respect. We have
> tried to convince them that (a) the solution is not as simple as Latin
> case mapping, as many have portrayed it; (b) the problem is not with
> Unicode Han unification, since TC and SC are not unified; (c) content
> analysis is not feasible for domain names; and (d) the entire problem is
> out of scope of the IDN WG. We have proposed that organizations register
> both .cn and .cn if they want both hits to be successful. So far, not
> much convincing has taken place. In the above case, they claim that all
> eight (2^3) possible combinations (e.g. ".cn") would need to be
> registered, which is overkill.

The bulk of Han ideographs don't occur in TC/SC pairs, so this is specious. I.e., to register the equivalent of "unicode.org", you only need two registrations, "<78BC>.org" (TC) and ".org" (SC). You don't need eight registrations.

Meanwhile, I'd like to offer a suggestion:

*If* they can live with one caveat, and *if* they can give us time to clean up our SC/TC mapping data, we could do the following:

1) SC/TC matching on Unicode data is only to be done on the SC/TC mapping data supplied by UTC.

2) Wherever a single SC character matches multiple TC characters, all the characters are to be treated the same. This means, for example, that U+53F0 (台) will be treated the same as U+6AAF (檯), U+81FA (臺), and U+98B1 (颱). This also means, of course, that U+6AAF, U+81FA, and U+98B1 will end up being indistinguishable even in purely TC names.

3) This includes Unicode compatibility mappings. (Thereby reducing a lot of turtles, if nothing else.)

The caveat is that this must be understood to be a first-order, computer-appropriate equivalence and is not in any way to be held to be a generalized solution to the lexically appropriate conversion between SC and TC.

It also has to be understood that some things are going to slip through because it is not a generalized solution to Han normalization. Lexically inappropriate matches will take place! (Maybe we should refer to *zhengguihua* instead of "Han normalization"…)

It also means that some desired matches won't happen, and some things can be "spoofed" by these nasty variant issues such as came up yesterday. U+9EBC and U+9EBD aren't likely to both match U+4E48.

However, this is already a problem in Unicode. "shuowen.org" will have to register both ".org" and ".org"; Jingwa, Inc., will need both "" and "".

OK, so this is more than one caveat.

It will also mean that we will no longer be able to accept both the TC and SC form for a character as a candidate for separate encoding in the future, and future compatibility ideographs will be excluded from use in IDN. (Actually, you could save yourself some grief right off by excluding Han radicals and all compatibility ideographs.)

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/
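As a rough sketch of point 2 in the proposal above, the classes could be built by merging every (SC, TC) pair with a small union-find. The function names and the pair list are assumptions for illustration; real data would have to come from the UTC-supplied mapping, which this sketch does not model.

```python
def build_class_keys(pairs):
    """Merge (sc, tc) code-point pairs into classes; return {code point: class key}.

    The class key is the smallest code point in the class (a convention only).
    """
    parent = {}

    def find(cp):
        parent.setdefault(cp, cp)
        while parent[cp] != cp:
            parent[cp] = parent[parent[cp]]  # path halving
            cp = parent[cp]
        return cp

    for sc, tc in pairs:
        a, b = find(sc), find(tc)
        if a != b:
            parent[max(a, b)] = min(a, b)  # keep the smaller code point as root
    return {cp: find(cp) for cp in list(parent)}


def same_for_idn(a, b, keys):
    """True if two characters fall in the same class under this sketch."""
    return keys.get(ord(a), ord(a)) == keys.get(ord(b), ord(b))


# Illustrative pairs taken from the U+53F0 example in the message above.
PAIRS = [(0x53F0, 0x6AAF), (0x53F0, 0x81FA), (0x53F0, 0x98B1)]
KEYS = build_class_keys(PAIRS)

# The three TC forms end up indistinguishable from each other as well.
print(same_for_idn("\u6aaf", "\u98b1", KEYS))
```

Merging is transitive, so a TC form shared by two SC forms would pull both into one class; that is the behaviour the caveats above warn about.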
Re: TC/SC mapping
Many have responded:

> Meanwhile, it is true that there are simplified characters which
> correspond to more than one traditional form.

...

> This is the kind of mess that has discouraged anybody from doing a
> systematic survey of simplifications for the Unihan database.

...

> Before converting TC to SC, one should resolve all TC variants to
> the most "common" or "standard" TC form (good luck deciding what that
> means).

...

> I think that any mapping will fail.

Thanks to everyone for your input concerning the TC/SC mapping issue. You have confirmed what I already knew, but needed concrete evidence of; namely, that mapping between Traditional Chinese and Simplified Chinese is not a simple 1-to-1 table lookup problem, but involves lexical analysis and even knowledge of the author's intent.

Currently on the IDN mailing list there is a big debate over this topic. It is well known that ASCII-based domain names are matched in the DNS in a case-insensitive manner. Many people recognize that Chinese readers who are familiar with both TC and SC consider text written in the two sub-scripts to be interchangeable, in roughly the same way that uppercase and lowercase Latin are interchangeable. They would like Chinese domain names written in TC to match the "equivalent" name written in SC, just as "UNICODE.ORG" matches "unicode.org".

The problem is getting people to understand the scope of the problem. As you have illustrated so well, TC/SC mapping is NOT, in the general case, as simple as Latin case mapping. It requires content analysis, and possibly some form of tagging.

Almost all of the list members whose e-mail addresses end in .cn, .tw or .hk seem to believe that there is a willful disregard on the part of the working group for the needs of Chinese users in this respect. We have tried to convince them that (a) the solution is not as simple as Latin case mapping, as many have portrayed it; (b) the problem is not with Unicode Han unification, since TC and SC are not unified; (c) content analysis is not feasible for domain names; and (d) the entire problem is out of scope of the IDN WG. We have proposed that organizations register both .cn and .cn if they want both hits to be successful. So far, not much convincing has taken place. In the above case, they claim that all eight (2^3) possible combinations (e.g. ".cn") would need to be registered, which is overkill.

One list member has even proposed the prohibition of all CJK code points from internationalized domain names "until the problem can be solved," and he has the support of several others. It is obvious that this is an attempt to hijack the entire IDN model by claiming "it does not support Chinese at all," which would certainly be true if Han characters were prohibited, and imposing a locally-constructed, Chinese-specific (i.e. not universal) model later on.

Unfortunately, as an American who does not speak or read Chinese, I have been in a poor position to argue with these people about their own written language. So I relied on the combined expertise of the Unicode list, including native speakers and people with doctorates in Chinese, for background information. Thanks again for your help.

-Doug Ewell
Fullerton, California
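To make the "eight (2^3) combinations" arithmetic concrete, here is a minimal sketch that enumerates the spellings a registrant would need if every character with a TC/SC pair could appear in either form. The labels and the tiny pair table are hypothetical stand-ins, since the actual names in the message did not survive archiving.

```python
from itertools import product

# Hypothetical TC -> SC pairs, for illustration only.
TC_TO_SC = {
    "\u842c": "\u4e07",  # U+842C -> U+4E07
    "\u570b": "\u56fd",  # U+570B -> U+56FD
    "\u78bc": "\u7801",  # U+78BC -> U+7801
}


def registration_variants(label, pairs):
    """List every mixed TC/SC spelling of a TC label."""
    options = [(ch, pairs[ch]) if ch in pairs else (ch,) for ch in label]
    return ["".join(combo) for combo in product(*options)]


# Three paired characters: 2 * 2 * 2 spellings.
print(len(registration_variants("\u842c\u570b\u78bc", TC_TO_SC)))
# Only one paired character: the count doubles once, not three times.
print(len(registration_variants("\u4e2d\u6587\u78bc", TC_TO_SC)))
```

The count is the product of the number of forms per character, so only characters that actually have a pair contribute a factor of two.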
Re: TC/SC mapping
> > This is the kind of mess that has discouraged anybody from doing a
> > systematic survey of simplifications for the Unihan database.
>
> Part of this is because there is the orthogonal complexity of
> variant TC forms. Before converting TC to SC, one should resolve
> all TC variants to the most "common" or "standard" TC form (good
> luck deciding what that means). e.g., in the above case, resolve to
> U+9EBD.

I think that any mapping will fail. As with so many things in CJK characters, the usage depends on constraints beyond a character encoding: time, location, purpose, etc. This is the very reason why CCCII hasn't succeeded.

As a consequence, the available fields are not enough to really represent the interdependencies correctly. Either increase the number of available keywords (e.g. kZVariant1, kZVariant2) to be able to fine-tune the dependencies (something like `character a in the meaning of b is a variant of character c'), or add a remark to the description of keywords that the fields can't be exhaustive due to such and such reasons.

Werner
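One possible shape for Werner's "character a in the meaning of b is a variant of character c" is a record keyed by both character and sense. The sketch below is hypothetical: the class, field names, and relation labels are not Unihan fields, and the three records simply restate the U+8721 senses reported elsewhere in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantRecord:
    char: str      # character a
    sense: str     # meaning b
    relation: str  # free-text relation label (hypothetical, not a Unihan key)
    other: str     # character c


# Records restating the U+8721 senses reported in this thread.
RECORDS = [
    VariantRecord("\u8721", "wax (la4)", "simplified form of", "\u881f"),
    VariantRecord("\u8721", "maggot (qu1)", "archaic form of", "\u86c6"),
    VariantRecord("\u8721", "year-end ceremony (zha4)", "traditional form of", "\u8721"),
]


def related(char, sense, records=RECORDS):
    """Return (relation, other) pairs for one character in one sense."""
    return [(r.relation, r.other) for r in records
            if r.char == char and r.sense == sense]


print(related("\u8721", "wax (la4)"))
```

A flat per-character field can only list U+8721 and U+881F side by side; the sense key is what lets the three readings stay separate.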
Re: TC/SC mapping
On Wed, 23 Jan 2002, John H. Jenkins wrote:

> On Wednesday, January 23, 2002, at 09:05 AM, Thomas Chan wrote:
> > In other words,
> >   yao1 'small'                         TC U+4E48 or U+5E7A -> SC U+4E48
> >   me (as in shen2me 'what')            TC U+9EBC or U+9EBD -> SC U+4E48
> >   mo2 (as in yao1mo2 'insignificant')  TC U+9EBC or U+9EBD -> SC U+9EBD
>
> Thomas, do you have a reference for U+9EBC (麼) and U+9EBD (麽) being
> different? The only dictionary I have which contains both is the
> (traditional) CiHai, and it claims they're variants of each other.

Well, first, the "Jianhuazi Zongbiao" that defines the PRC simplifications juxtaposes U+9EBD and U+4E48 for the "me" pronunciation of the former (non-"me" usages of the former are not simplified); U+9EBC is not mentioned.

In the PRC's _Ci Hai_ from 1979 (the third dictionary to bear that name), U+9EBC is a pointer to U+9EBD for all usages of U+9EBD.

In the _Hanyu Da Zidian_ (PRC, 1986), U+9EBD has the following usages:

  1) mo2 'small'
  2) ma2 of gan4ma2 'what for'. (It says that nowadays this particular ma2 is written U+55CE.)
  3.1) ma, a particle, which can sometimes be written U+55CE.
  3.2) ma, a particle, which can sometimes be written U+561B.
  4) me of zhe4me 'so; like this'; also used as padding in songs.

However, for U+9EBC, it says it is the same as U+9EBD, but the only examples given have the 'small' meaning, including one from the _Shuowen Jiezi_ (China, AD 100) that says that U+9EBD is a vulgar (su2) form of U+9EBC.

Apparently, U+9EBC is the more orthodox version as far as mo2 'small' is concerned, but U+9EBD has become more common, including being used to write various modern/colloquial words. I would revise the mapping as follows:

  me (as in shen2me 'what')            TC U+9EBD -> SC U+4E48
  mo2 (as in yao1mo2 'insignificant')  TC U+9EBC -> TC U+9EBD -> SC U+9EBD

I think the choice whether to regard U+9EBC and U+9EBD as different or not depends on the application. I would lean towards treating them as the same.

> Meanwhile, both Sanseido and KangXi say that U+5C1B (尛) is a member of
> the family. (KangXi says that anciently U+9EBC (麼) was written U+5C1B (尛).)
> Mathews and Sanseido also remind us that U+5E85 (庅) is another variant,
> and Sanseido *also* lists U+5692 (嚒).

In the _Hanyu Da Zidian_, U+5C1B points to U+9EBC. (I see on the same page that U+21B6F also points to U+9EBC, and the _Hanyu Da Zidian_ is citing this pointer from the same source.) It doesn't say, but I would presume these refer only to the original mo2 'small' usage, given the age of the cited source, _Longkan Shoujian_ (China, AD 997), and the composition of U+5C1B (three 'smalls') and U+21B6F ('three' + 'small').

U+5E85 is understandable as an abbreviated form of U+9EBD, and I'll add that it's also documented in Samuel Wells Williams' 1874 dictionary (which pushes back the usage given in Mathews by at least half a century).

U+5692 seems understandable--it is just U+9EBC with a mouth radical tacked on--I presume this is only for the modern/colloquial "me" usages, and not mo2 'small'. (I wouldn't be surprised if somewhere there is attested a U+9EBD with a mouth radical tacked on.)

I would further revise the (partial) mapping as follows:

  me (as in shen2me 'what'):
    TC U+9EBC -> TC U+9EBD -> TC U+5E85 -> SC U+4E48
    TC U+9EBC -> TC U+5692
  mo2 (as in yao1mo2 'insignificant'):
    TC U+9EBC -> TC U+9EBD -> SC U+9EBD

And this is not finished yet! The _Hanyu Da Zidian_ also lists some other variant forms of U+9EBD--I suspect they are probably all/mostly for the mo2 'small' usage. I should point out that the _Hanyu Da Zidian_ is in no way the final word despite its comprehensiveness, e.g., U+5E85 and U+5692 are not included in it.

> So, Doug, you see that U+4E48 (么) could conceivably be a traditional
> character in its own right *or* the simplified form for no fewer than six
> (!) other ideographs.
>
> This is the kind of mess that has discouraged anybody from doing a
> systematic survey of simplifications for the Unihan database.

Part of this is because there is the orthogonal complexity of variant TC forms. Before converting TC to SC, one should resolve all TC variants to the most "common" or "standard" TC form (good luck deciding what that means). e.g., in the above case, resolve to U+9EBD.

I think we are also complicating things by treating the entire process of variants and simplifications as operating solely on the orthography (cf., upper and lower case); in some cases, it is simpler to conceptualize it as the "spelling" of words being changed.

> > The other example (U+8721 kTraditionalVariant U+8721 U+881F) is a
> > mistake--the TraditionalVariant should only be U+881F.
>
> Actually, no. Both KangXi and the Cihai list U+8721 (蜡) as a traditional
> character in its own right, although I assume it's rare as I can't find it
> in my other dictionaries.

You're right. The presence of U+8721
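The two steps suggested in this message (resolve TC variants first, then treat simplification as a change in the "spelling" of words) might look roughly like the sketch below. It is an assumption-laden toy: the tables hold only the handful of code points discussed here, and the names and the longest-match strategy are choices made for illustration, not anything the thread specifies.

```python
# Step 1: resolve TC variants to one chosen TC form (here U+9EBC -> U+9EBD).
TC_NORMAL = {"\u9ebc": "\u9ebd"}

# Step 2a: word-level "spellings" that override the per-character default.
# yao1mo2: the mo2 reading keeps U+9EBD in SC, per the mapping above.
WORDS = {
    "\u5e7a\u9ebd": "\u4e48\u9ebd",
    "\u4e48\u9ebd": "\u4e48\u9ebd",
}

# Step 2b: per-character default (the "me" reading, and yao1).
CHARS = {"\u9ebd": "\u4e48", "\u5e7a": "\u4e48"}


def tc_to_sc(text):
    """Toy TC -> SC conversion: variant resolution, then longest word match."""
    text = "".join(TC_NORMAL.get(ch, ch) for ch in text)
    words = sorted(WORDS, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for word in words:
            if text.startswith(word, i):
                out.append(WORDS[word])
                i += len(word)
                break
        else:  # no word matched at this position
            out.append(CHARS.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(tc_to_sc("\u4ec0\u9ebc"))  # shen2me
print(tc_to_sc("\u5e7a\u9ebc"))  # yao1mo2
```

A domain label carries no word boundaries or readings, so a registry could not rely on a lookup like this; the sketch only shows why the same TC character can need two different SC outputs.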
Re: TC/SC mapping
John,

> This is the kind of mess that has discouraged anybody from doing a
> systematic survey of simplifications for the Unihan database.
>
> > The other example (U+8721 kTraditionalVariant U+8721 U+881F) is a
> > mistake--the TraditionalVariant should only be U+881F.
>
> Actually, no. Both KangXi and the Cihai list U+8721 (蜡) as a traditional
> character in its own right, although I assume it's rare as I can't find it
> in my other dictionaries.

Oh what tangled webs.

First of all, we have "wax", Mandarin la4

  Traditional form: U+881F
  Simplified form:  U+8721

Then we have "maggot", Mandarin qu1, of which Cihai claims that Shuowen says:

  Vulgar form:  U+86C6
  Correct form: U+43E3
  Archaic form: U+8721

My Taiwan and PRC dictionaries both claim U+86C6 for "maggot", so that would now be considered both the Traditional and Simplified form, with U+8721 being an obsolete, archaic variant for it.

Then we have "yearend ceremony (of Zhou dynasty)", Mandarin zha4

  Traditional form: U+8721

Not listed in my contemporary PRC dictionary.

And yes, this is the kind of mess that has discouraged anybody from doing a systematic survey of simplifications for the Unihan database.

--Ken
Re: TC/SC mapping
On Wednesday, January 23, 2002, at 09:05 AM, Thomas Chan wrote:

> In other words,
>   yao1 'small'                         TC U+4E48 or U+5E7A -> SC U+4E48
>   me (as in shen2me 'what')            TC U+9EBC or U+9EBD -> SC U+4E48
>   mo2 (as in yao1mo2 'insignificant')  TC U+9EBC or U+9EBD -> SC U+9EBD

Thomas, do you have a reference for U+9EBC (麼) and U+9EBD (麽) being different? The only dictionary I have which contains both is the (traditional) CiHai, and it claims they're variants of each other.

Meanwhile, both Sanseido and KangXi say that U+5C1B (尛) is a member of the family. (KangXi says that anciently U+9EBC (麼) was written U+5C1B (尛).) Mathews and Sanseido also remind us that U+5E85 (庅) is another variant, and Sanseido *also* lists U+5692 (嚒).

So, Doug, you see that U+4E48 (么) could conceivably be a traditional character in its own right *or* the simplified form for no fewer than six (!) other ideographs.

This is the kind of mess that has discouraged anybody from doing a systematic survey of simplifications for the Unihan database.

> The other example (U+8721 kTraditionalVariant U+8721 U+881F) is a
> mistake--the TraditionalVariant should only be U+881F.

Actually, no. Both KangXi and the Cihai list U+8721 (蜡) as a traditional character in its own right, although I assume it's rare as I can't find it in my other dictionaries.

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/