Re: TC/SC mapping

2002-02-04 Thread Thomas Chan

On Wed, 23 Jan 2002, John H. Jenkins wrote:

> Thomas, do you have a reference for U+9EBC (麼) and U+9EBD (麽) being 
> different?  The only dictionary I have which contains both is the 
> (traditional) CiHai, and it claims they're variants of each other.

Belated, but a little more on these two.  "Annex T: Procedure for the
Unification and Arrangement of CJK Ideographs"[1] (AMD8 of ISO/IEC
10646-1:1993) at the very end of section T.3, gives this pair as an
example of unification blocked by source separation (a T source is the
culprit).

[1] ftp://ftp.cse.cuhk.edu.hk/pub/irg/AnnexT.rtf


Thomas Chan
[EMAIL PROTECTED]





Re: TC/SC mapping

2002-01-24 Thread John Cowan

John H. Jenkins wrote:


> Well, first of all, the UTC is already on record as refusing to encode 
> new SC separately.
> 
> Secondly, we would break IDN equivalence.  If we add a new SC which is 
> equivalent to two TC,


By your previous graf, that can't happen; so it must be adding a new TC
(off a newly dug-up bone, perhaps) which simplifies to two different
SCs.  Fair enough.

-- 
Not to perambulate || John Cowan <[EMAIL PROTECTED]>
the corridors   || http://www.reutershealth.com
during the hours of repose || http://www.ccil.org/~cowan
in the boots of ascension.  \\ Sign in Austrian ski-resort hotel





Re: TC/SC mapping

2002-01-24 Thread John H. Jenkins


On Thursday, January 24, 2002, at 12:29 PM, John Cowan wrote:

> John H. Jenkins wrote:
>
> {TC1, SC1, SC2, TC2, TC3, SC3} constitute a "Han simplification
> class" (HSC), and are all the same when appearing in IDNs.
>
> Correct?
>

Oui.

>
>> The caveat is that this must be understood to be a first-order, 
>> computer-appropriate equivalence and is not in any way to be held to be 
>> a generalized solution to the lexically appropriate conversion between 
>> SC and TC.
>
>
> Is there any danger that these classes will turn out to be a
> "small world", in the sense that we wind up with a few huge classes
> which include almost all the characters?
>

Nope.

>> (Maybe we should refer to *zhengguihua* instead of "Han normalization"…)
>
>
> Can you explain the joke?
>

It's just to make Ken happy.  He doesn't like me talking about "Han 
normalization," since "normalization" is Unicodespeak for something else.  
"Zhengguihua" is Mandarin for "normalization."

>> It will also mean that we will no longer be able to accept both the TC 
>> and SC form for a character as a candidate for separate encoding in the 
>> future,
>
>
> I don't understand this part.  Since this is neither compatibility nor
> canonical equivalence, it will not affect any of the known normalization
> forms.  Nor are we defining a new normalization form here, since in
> HSCs like the above there is no particular reason to pick any of the
> six characters as *the* normalized form, although by convention we can
> pick one -- say, the one with the smallest Unicode scalar
> value, or the one which appears in the largest number of legacy
> sets -- to aid in description and implementation.
>
> It's just another of those sets of equivalence classes provided for
> special purposes, like the Arabic/Syriac shaping classes or the
> canonical combining classes.
>

Well, first of all, the UTC is already on record as refusing to encode new 
SC separately.

Secondly, we would break IDN equivalence.  If we add a new SC which is 
equivalent to two TC, then suddenly domains which could be distinguished 
on the basis of the old TC pair can't any more.
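
To make the second point concrete, here is a minimal Python sketch. The symbols TC_A, TC_B and SC_NEW are placeholders rather than real characters, and same_class / names_match are hypothetical helpers, not anything the IDN drafts define.

```python
def same_class(a, b, classes):
    """True if a and b are identical or share an equivalence class."""
    return a == b or any(a in c and b in c for c in classes)


def names_match(n1, n2, classes):
    """Compare two names symbol by symbol under the given classes."""
    return len(n1) == len(n2) and all(
        same_class(x, y, classes) for x, y in zip(n1, n2)
    )


# Placeholder symbols (hypothetical, not real code points).
TC_A, TC_B, SC_NEW = "TC_A", "TC_B", "SC_NEW"

# Before: the two traditional characters are unrelated.
classes_before = [{TC_A}, {TC_B}]
# After: a newly encoded SC is declared equivalent to both of them.
classes_after = [{TC_A, TC_B, SC_NEW}]

name1 = [TC_A, "x"]
name2 = [TC_B, "x"]

distinct_before = not names_match(name1, name2, classes_before)
distinct_after = not names_match(name1, name2, classes_after)
print(distinct_before, distinct_after)
```

Two names that were separate registrations under the old classes collapse into one once the new SC joins them.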

> Or are you saying that this new information should be represented
> as a Unicode compatibility equivalence?  If so, that would
> wreak havoc with existing NFC and NFKC code.
>

No.

>> (Actually, you could save yourself some grief right off by excluding Han 
>> radicals and all compatibility ideographs.)
>
> This would be a Bad Thing in Korean, though, because the whole point
> of Korean compatibility ideographs is to preserve differences in
> reading.  Or are ideographs not used in (modern) Korean names?
>

These compatibility ideographs are *not* to provide phonetic-specific 
distinctions between various Korean hanja.  They're for compatibility with 
an older standard only, which did make that distinction.  IMHO it would be 
more confusing to Chinese, Japanese, *and* Korean readers to have some 
domain names distinguished when the only thing different about them is 
the Korean pronunciation of the hanja used to write them.

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/





Re: TC/SC mapping

2002-01-24 Thread John H. Jenkins


On Thursday, January 24, 2002, at 11:44 AM, Thomas Chan wrote:

> On Thu, 24 Jan 2002, John H. Jenkins wrote:
>
>> However, this is already a problem in Unicode.  "shuowen.org" will have
>> to register both ".org" and ".org"; Jingwa,
>> Inc., will need both "" and "".
>
> U+8AAA and U+8AAC are given on p. 265 of TUS3.0 as an example of what
> would have been unified had it not been for source separation.  Is it
> possible to acquire data on other z-variants?

Er, no.

> The kZVariant fields do not
> seem to contain exactly that data.

Nope.  As with the SC/TC problem from yesterday, this is just too messy 
for anyone to have found the time to do it properly.  The bulk of the 
kZVariant data we have right now is largely derived from the CCCII mapping 
data.
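
For anyone who wants to see what the kZVariant fields do contain, a minimal Python sketch follows. It assumes the tab-separated Unihan text layout (code point, field name, value). The sample rows are made up for illustration, and read_field is a hypothetical helper.

```python
import re
from collections import defaultdict

# Made-up sample rows in the assumed Unihan layout; not real database content.
SAMPLE = """\
# comment line
U+8AAA\tkZVariant\tU+8AAC
U+8AAC\tkZVariant\tU+8AAA
U+53F0\tkTraditionalVariant\tU+6AAF U+81FA U+98B1
"""

CODE_POINT = re.compile(r"U\+([0-9A-F]{4,6})")


def read_field(text, field):
    """Return {source code point: set of target code points} for one field."""
    result = defaultdict(set)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) != 3 or parts[1] != field:
            continue
        source = CODE_POINT.fullmatch(parts[0])
        if source is None:
            continue
        # The value may list several code points; collect every one found.
        for hit in CODE_POINT.finditer(parts[2]):
            result[int(source.group(1), 16)].add(int(hit.group(1), 16))
    return dict(result)


z_variants = read_field(SAMPLE, "kZVariant")
for cp, targets in sorted(z_variants.items()):
    print(f"U+{cp:04X} -> " + " ".join(f"U+{t:04X}" for t in sorted(targets)))
```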

This is something we're going to ask WG2 to tell the IRG to do.

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/





RE: TC/SC mapping

2002-01-24 Thread Marco Cimarosti

Doug Ewell wrote:
> Currently on the IDN mailing list there is a big debate over 
> this topic.  It is well known that ASCII-based domain names
> are matched in the DNS in a case-insensitive manner.  Many
> people recognize that Chinese readers who are familiar with
> both TC and SC consider text written in the two sub-scripts
> to be interchangeable, in roughly the same way that
> uppercase and lowercase Latin are interchangeable.

Converting TC to SC is difficult, and the opposite is nearly impossible. But
a simple "loose match" like the one you describe does not seem so difficult.

On the other hand, outside the English-only realm, converting uppercase to
lowercase is also difficult, and the opposite is nearly impossible. But simple
case folding is not so difficult.

Here it is simply a matter of putting together all the groups of ideographs
that may be considered variants of each other (not only SC and TC, but also
Japanese simplifications, semantic variants, "specialized semantic
variants", compatibility equivalents, radicals, etc.), and mapping them
*internally* to a single key (e.g., the lowest code point in the group).

You need not even care whether the result is TC, SC, or a horrible mix of the
two: nobody is supposed to see it anyway.

Of course there are security concerns. The conversion must be well-defined
and must not change over time. And, of course, domain names should be
registered in their "folded" version.
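
A rough Python sketch of this folding, under the assumption that the variant groups are already given. The two groups below use only code points mentioned in this thread; GROUPS, fold and loose_match are names invented for the illustration.

```python
# Variant groups, written as escapes so the code points are unambiguous.
GROUPS = [
    "\u53f0\u6aaf\u81fa\u98b1",  # U+53F0, U+6AAF, U+81FA, U+98B1
    "\u8aaa\u8aac",              # U+8AAA, U+8AAC
]


def build_fold_table(groups):
    """Map every member of a group to the group's lowest code point."""
    table = {}
    for group in groups:
        key = min(group)  # min over characters picks the lowest code point
        for ch in group:
            table[ord(ch)] = key
    return table


FOLD = build_fold_table(GROUPS)


def fold(label):
    """Return the internal key for a label; the result is never displayed."""
    return label.translate(FOLD)


def loose_match(a, b):
    """Two labels match if they fold to the same internal key."""
    return fold(a) == fold(b)


print(loose_match("\u8aaa.org", "\u8aac.org"))
```

The table has to be frozen once names are registered against it, which is the security concern mentioned above.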

_ Marco




Re: TC/SC mapping

2002-01-24 Thread Thomas Chan

On Thu, 24 Jan 2002, John H. Jenkins wrote:

> However, this is already a problem in Unicode.  "shuowen.org" will have to 
> register both ".org" and ".org"; Jingwa, 
> Inc., will need both "" and "".

U+8AAA and U+8AAC are given on p. 265 of TUS3.0 as an example of what
would have been unified had it not been for source separation.  Is it
possible to acquire data on other z-variants?  The kZVariant fields do not
seem to contain exactly that data.  Had that example not been pointed out,
I wouldn't have known that both were encoded.


Thomas Chan
[EMAIL PROTECTED]






Re: TC/SC mapping

2002-01-24 Thread John H. Jenkins


On Thursday, January 24, 2002, at 09:39 AM, [EMAIL PROTECTED] wrote:

>
> Currently on the IDN mailing list there is a big debate over this topic.
> It is well known that ASCII-based domain names are matched in the DNS in a
> case-insensitive manner.  Many people recognize that Chinese readers who
> are familiar with both TC and SC consider text written in the two
> sub-scripts to be interchangeable, in roughly the same way that uppercase
> and lowercase Latin are interchangeable.  They would like Chinese domain
> names written in TC to match the "equivalent" name written in SC, just as
> "UNICODE.ORG" matches "unicode.org".
>

Actually, this is more like asking "honor" and "honour" to match.

> Almost all of the list members whose e-mail addresses end in .cn, .tw or
> .hk seem to believe that there is a willful disregard on the part of the
> working group for the needs of Chinese users in this respect.  We have
> tried to convince them that (a) the solution is not as simple as Latin
> case mapping, as many have portrayed it; (b) the problem is not with
> Unicode Han unification, since TC and SC are not unified; (c) content
> analysis is not feasible for domain names; and (d) the entire problem is
> out of scope of the IDN WG.  We have proposed that organizations register
> both .cn and .cn if they want both hits to be successful.  So far, not
> much convincing has taken place.  In the above case, they claim that all
> eight (2^3) possible combinations (e.g. ".cn") would need to be
> registered, which is overkill.
>

The bulk of Han ideographs don't occur in TC/SC pairs, so this is specious.  
I.e., to register the equivalent of "unicode.org", you only need two 
registrations, "<78BC>.org" (TC) and 
".org" (SC).  You don't need eight registrations.

Meanwhile, I'd like to offer a suggestion:

*If* they can live with one caveat, and *if* they can give us time to 
clean up our SC/TC mapping data, we could do the following:

1) SC/TC matching on Unicode data is only to be done on the SC/TC mapping 
data supplied by UTC.

2) Wherever a single SC character matches multiple TC characters, all the 
characters are to be treated the same.

This means, for example, that U+53F0 (台) will be treated the same as 
U+6AAF (檯), U+81FA (臺), and U+98B1 (颱).  This also means, of course, that 
U+6AAF, U+81FA, and U+98B1 will end up being indistinguishable even in 
purely TC names.

3) This includes Unicode compatibility mappings.  (Thereby reducing a lot 
of turtles, if nothing else.)
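
As a sketch of rule 2, the Python below merges every SC character with all of its TC counterparts using a small disjoint-set structure. The SC_TO_TC table holds only the example from this message; a real table would come from the cleaned-up UTC mapping data, which does not exist yet. DisjointSet and same_class are hypothetical names.

```python
class DisjointSet:
    """Union-find that keeps the lowest code point as each class's root."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving: point x at its grandparent as we walk up.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            low, high = sorted((ra, rb))
            self.parent[high] = low


# Only the example given above: one SC with three TC counterparts.
SC_TO_TC = {0x53F0: [0x6AAF, 0x81FA, 0x98B1]}

ds = DisjointSet()
for sc, tcs in SC_TO_TC.items():
    for tc in tcs:
        ds.union(sc, tc)


def same_class(a, b):
    return ds.find(a) == ds.find(b)


# The three TC characters become indistinguishable from one another.
print(same_class(0x6AAF, 0x98B1))
print(f"U+{ds.find(0x81FA):04X}")
```

Because the merge is transitive, two TC characters that share an SC end up in one class even though neither maps to the other directly.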

The caveat is that this must be understood to be a first-order, 
computer-appropriate equivalence and is not in any way to be held to be a 
generalized solution to the lexically appropriate conversion between SC 
and TC.  It also has to be understood that some things are going to slip 
through because it is not a generalized solution to Han normalization.  
Lexically inappropriate matches will take place!

(Maybe we should refer to *zhengguihua* instead of "Han normalization"…)

It also means that some desired matches won't happen, and some things can 
be "spoofed" by these nasty variant issues such as came up yesterday.  
U+9EBC and U+9EBD aren't likely to both match U+4E48.

However, this is already a problem in Unicode.  "shuowen.org" will have to 
register both ".org" and ".org"; Jingwa, 
Inc., will need both "" and "".

OK, so this is more than one caveat.  It will also mean that we will no 
longer be able to accept both the TC and SC form for a character as a 
candidate for separate encoding in the future, and future compatibility 
ideographs will be excluded from use in IDN.  (Actually, you could save 
yourself some grief right off by excluding Han radicals and all 
compatibility ideographs.)

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/





Re: TC/SC mapping

2002-01-24 Thread DougEwell2

Many have responded:

> Meanwhile, it is true that there are simplified characters which 
> correspond to more than one traditional form.
...
> This is the kind of mess that has discouraged anybody from doing a 
> systematic survey of simplifications for the Unihan database.
...
> Before converting TC to SC, one should resolve all TC variants to
> the most "common" or "standard" TC form (good luck deciding what that
> means).
...
> I think that any mapping will fail.

Thanks to everyone for your input concerning the TC/SC mapping issue.  You 
have confirmed what I already knew, but needed concrete evidence of; namely, 
that mapping between Traditional Chinese and Simplified Chinese is not a 
simple 1-to-1 table lookup problem, but involves lexical analysis and even 
knowledge of the author's intent.

Currently on the IDN mailing list there is a big debate over this topic.  It 
is well known that ASCII-based domain names are matched in the DNS in a 
case-insensitive manner.  Many people recognize that Chinese readers who are 
familiar with both TC and SC consider text written in the two sub-scripts to 
be interchangeable, in roughly the same way that uppercase and lowercase 
Latin are interchangeable.  They would like Chinese domain names written in 
TC to match the "equivalent" name written in SC, just as "UNICODE.ORG" 
matches "unicode.org".

The problem is getting people to understand the scope of the problem.  As you 
have illustrated so well, TC/SC mapping is NOT, in the general case, as 
simple as Latin case mapping.  It requires content analysis, and possibly 
some form of tagging.

Almost all of the list members whose e-mail addresses end in .cn, .tw or .hk 
seem to believe that there is a willful disregard on the part of the working 
group for the needs of Chinese users in this respect.  We have tried to 
convince them that (a) the solution is not as simple as Latin case mapping, 
as many have portrayed it; (b) the problem is not with Unicode Han 
unification, since TC and SC are not unified; (c) content analysis is not 
feasible for domain names; and (d) the entire problem is out of scope of the 
IDN WG.  We have proposed that organizations register both .cn 
and .cn if they want both hits to be successful.  So far, not 
much convincing has taken place.  In the above case, they claim that all 
eight (2^3) possible combinations (e.g. ".cn") would need to be 
registered, which is overkill.
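
The arithmetic behind the "eight combinations" claim can be sketched in Python as follows. The labels T1/S1 and so on are stand-ins for the TC and SC forms of a hypothetical three-character name, and spellings is an invented helper. The count only reaches 2^3 if every character has two distinct forms and mixed spellings are counted.

```python
from itertools import product


def spellings(variants_per_position):
    """Every spelling obtainable by picking one form per position."""
    return ["".join(choice) for choice in product(*variants_per_position)]


# Hypothetical name in which all three characters have distinct TC/SC forms.
name = [("T1", "S1"), ("T2", "S2"), ("T3", "S3")]
all_mixed = spellings(name)

# The two "pure" spellings: all-TC and all-SC.
pure = ["".join(p[0] for p in name), "".join(p[1] for p in name)]

# Hypothetical name in which only the first character has two forms.
partial = [("T1", "S1"), ("X",), ("Y",)]

print(len(all_mixed), len(pure), len(spellings(partial)))
```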

One list member has even proposed the prohibition of all CJK code points from 
internationalized domain names "until the problem can be solved," and he has 
the support of several others.  It is obvious that this is an attempt to 
hijack the entire IDN model by claiming "it does not support Chinese at all," 
which would certainly be true if Han characters were prohibited, and imposing 
a locally-constructed, Chinese-specific (i.e. not universal) model later on.

Unfortunately, as an American who does not speak or read Chinese, I have been 
in a poor position to argue with these people about their own written 
language.  So I relied on the combined expertise of the Unicode list, 
including native speakers and people with doctorates in Chinese, for 
background information.  Thanks again for your help.

-Doug Ewell
 Fullerton, California




Re: TC/SC mapping

2002-01-24 Thread Werner LEMBERG

> > This is the kind of mess that has discouraged anybody from doing a
> > systematic survey of simplifications for the Unihan database.
> 
> Part of this is because there is the orthogonal complexity of
> variant TC forms.  Before converting TC to SC, one should resolve
> all TC variants to the most "common" or "standard" TC form (good
> luck deciding what that means).  e.g., in the above case, resolve to
> U+9EBD.

I think that any mapping will fail.  As so many things with CJK
characters, the usage depends on constraints beyond a character
encoding: time, location, purpose, etc.  This is the very reason why
CCCII hasn't succeeded.  As a consequence, the available fields are
not enough to really represent the interdependencies correctly.

Either increase the number of available keywords (e.g. kZVariant1,
kZVariant2) to be able to fine-tune the dependencies (something like
`character a in the meaning of b is a variant of character c'), or add
a remark to the description of keywords that the fields can't be
exhaustive due to such and such reasons.


Werner




Re: TC/SC mapping

2002-01-23 Thread Thomas Chan

On Wed, 23 Jan 2002, John H. Jenkins wrote:

> On Wednesday, January 23, 2002, at 09:05 AM, Thomas Chan wrote:
> > In other words,
> >   yao1 'small'                         TC U+4E48 or U+5E7A -> SC U+4E48
> >   me (as in shen2me 'what')            TC U+9EBC or U+9EBD -> SC U+4E48
> >   mo2 (as in yao1mo2 'insignificant')  TC U+9EBC or U+9EBD -> SC U+9EBD
> 
> Thomas, do you have a reference for U+9EBC (麼) and U+9EBD (麽) being 
> different?  The only dictionary I have which contains both is the 
> (traditional) CiHai, and it claims they're variants of each other.

Well, first, the "Jianhuazi Zongbiao" that defines the PRC
simplifications juxtaposes U+9EBD and U+4E48 for the "me" pronunciation
of the former (non-"me" usage of the former are not simplified);
U+9EBC is not mentioned.

In the PRC's _Ci Hai_ from 1979 (the third dictionary to bear that
name), U+9EBC is a pointer to U+9EBD for all usages of U+9EBD.

In the _Hanyu Da Zidian_ (PRC, 1986), U+9EBD has the following
usages:
  1)   mo2 'small'
  2)   ma2 of gan4ma2 'what for'.  (It says that nowadays this
   particular ma2 is written U+55CE.)
  3.1) ma, a particle, which can sometimes be written U+55CE.
  3.2) ma, a particle, which can sometimes be written U+561B.
  4)   me of zhe4me 'so; like this'; also used as padding in songs.

However, for U+9EBC, it says it is the same as U+9EBD, but the
only examples given have the 'small' meaning, including one from
the _Shuowen Jiezi_ (China, AD 100) that says that U+9EBD is a
vulgar (su2) form of U+9EBC.

Apparently, U+9EBC is the more orthodox version as far as mo2
'small' is concerned, but U+9EBD has become more common,
including becoming used to write various modern/colloquial words.

I would revise the mapping as follows:
  me (as in shen2me 'what')            TC U+9EBD -> SC U+4E48
  mo2 (as in yao1mo2 'insignificant')  TC U+9EBC -> TC U+9EBD -> SC U+9EBD

I think the choice whether to regard U+9EBC and U+9EBD as different or not
depends on the application.  I would lean towards treating them as the
same.

 
> Meanwhile, both Sanseido and KangXi say that U+5C1B (尛) is a member of 
> the family.  (KangXi says that anciently U+9EBC (麼) was written U+5C1B
> (尛).)  Mathews and Sanseido also remind us that U+5E85 (庅) is another
> variant, and Sanseido *also* lists U+5692 (嚒).

In the _Hanyu Da Zidian_, U+5C1B points to U+9EBC.  (I see on the same
page that U+21B6F also points to U+9EBC, and the _Hanyu Da Zidian_ is
citing this pointer from the same source.)  It doesn't say, but I would
presume these refer only to the original mo2 'small' usage, given the age
of the cited source, _Longkan Shoujian_ (China, AD 997), and the
composition of U+5C1B (three 'smalls') and U+21B6F ('three' + 'small').

U+5E85 is understandable as an abbreviated form of U+9EBD, and I'll
add that it's also documented in Samuel Wells Williams' 1874
dictionary (pushes back the usage given in Mathews by at least half a
century).

U+5692 seems understandable--it is just U+9EBC with a mouth radical
tacked on--I presume this is only for the modern/colloquial "me" usages,
and not mo2 'small'.  (I wouldn't be surprised if somewhere there is
attested a U+9EBD with a mouth radical tacked on.)

I would further revise the (partial) mapping as follows:

  me (as in shen2me 'what'):
    TC U+9EBC -> TC U+9EBD -> TC U+5E85 -> SC U+4E48
    TC U+9EBC -> TC U+5692

  mo2 (as in yao1mo2 'insignificant'):
    TC U+9EBC -> TC U+9EBD -> SC U+9EBD

And this is not finished, yet!  The _Hanyu Da Zidian_ also lists
some other variant forms of U+9EBD--I suspect they are probably
all/mostly for the mo2 'small' usage.  I should point out that the _Hanyu
Da Zidian_ is in no way the final word despite its comprehensiveness,
e.g., U+5E85 and U+5692 are not included in it.

 
> So, Doug, you see that U+4E48 (么) could conceivably be a traditional 
> character in its own right *or* the simplified form for no fewer than six 
> (!) other ideographs.
> 
> This is the kind of mess that has discouraged anybody from doing a 
> systematic survey of simplifications for the Unihan database.

Part of this is because there is the orthogonal complexity of variant TC
forms.  Before converting TC to SC, one should resolve all TC variants to
the most "common" or "standard" TC form (good luck deciding what that
means).  e.g., in the above case, resolve to U+9EBD.

I think we are also complicating things by treating the entire process of
variants and simplifications as operating solely on the orthography (cf., 
upper and lower case); in some cases, it is simpler to conceptualize it as
the "spelling" of words being changed.
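
A minimal Python sketch of that two-step idea: first resolve a TC variant to one chosen TC form, then simplify keyed on the usage rather than on the character alone. The tables hold only the entries discussed in this message, the reading labels are ad hoc tags, and to_simplified is a hypothetical helper. Choosing U+9EBD as the "standard" form is the assumption made above, not an established fact.

```python
# Step 1: TC variant -> chosen TC form (assumed here to be U+9EBD).
TC_VARIANT_TO_STANDARD = {0x9EBC: 0x9EBD}

# Step 2: (chosen TC form, reading) -> SC form.
SIMPLIFY_BY_READING = {
    (0x9EBD, "me"): 0x4E48,    # as in shen2me 'what'
    (0x9EBD, "mo2"): 0x9EBD,   # as in yao1mo2 'insignificant'; not simplified
    (0x5E7A, "yao1"): 0x4E48,  # 'small'
    (0x4E48, "yao1"): 0x4E48,
}


def to_simplified(code_point, reading):
    """Resolve the variant first, then pick the SC form for this reading."""
    standard = TC_VARIANT_TO_STANDARD.get(code_point, code_point)
    # Characters with no entry are assumed to have no separate SC form.
    return SIMPLIFY_BY_READING.get((standard, reading), standard)


for cp, reading in [(0x9EBC, "me"), (0x9EBC, "mo2"), (0x5E7A, "yao1")]:
    print(f"U+{cp:04X} ({reading}) -> U+{to_simplified(cp, reading):04X}")
```

The same input character gives different outputs depending on the word it is spelling, which is why a character-only table cannot be right in general.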

 
> > The other example (U+8721 kTraditionalVariant U+8721 U+881F) is a
> > mistake--the TraditionalVariant should only be U+881F.
> 
> Actually, no.  Both KangXi and the Cihai list U+8721 (蜡) as a traditional 
> character in its own right, although I assume it's rare as I can't find it 
> in my other dictionaries.

You're right.  The presence of U+8721

Re: TC/SC mapping

2002-01-23 Thread Kenneth Whistler

John,

> This is the kind of mess that has discouraged anybody from doing a 
> systematic survey of simplifications for the Unihan database.
> 
> > The other example (U+8721 kTraditionalVariant U+8721 U+881F) is a
> > mistake--the TraditionalVariant should only be U+881F.
> >
> 
> Actually, no.  Both KangXi and the Cihai list U+8721 (蜡) as a traditional 
> character in its own right, although I assume it's rare as I can't find it 
> in my other dictionaries.

Oh what tangled webs.

First of all, we have "wax", Mandarin la4

  Traditional form: U+881F
  Simplified form:  U+8721

Then we have "maggot", Mandarin qu1, of which Cihai claims that Shuowen says:

  Vulgar form:  U+86C6
  Correct form: U+43E3
  Archaic form: U+8721

  My Taiwan and PRC dictionaries both claim U+86C6 for "maggot", so
  that would now be considered both the Traditional and Simplified form,
  with U+8721 being an obsolete, archaic variant for it.

Then we have "yearend ceremony (of Zhou dynasty)", Mandarin zha4

  Traditional form: U+8721

  Not listed in my contemporary PRC dictionary.

And yes, this is the kind of mess that has discouraged anybody from
doing a systematic survey of simplifications for the Unihan database.
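
To show why a single field per character cannot capture this, here is a small Python sketch that records the three entries above as per-sense data. The dictionary keys are ad hoc labels, and roles_of and traditional_candidates are hypothetical helpers, not Unihan fields.

```python
# One record per sense, using only the forms listed above.
SENSES = [
    {"gloss": "wax", "reading": "la4",
     "traditional": 0x881F, "simplified": 0x8721},
    {"gloss": "maggot", "reading": "qu1",
     "traditional": 0x86C6, "simplified": 0x86C6, "archaic": 0x8721},
    {"gloss": "yearend ceremony", "reading": "zha4",
     "traditional": 0x8721},
]


def roles_of(code_point):
    """List (gloss, role) for every sense in which the code point appears."""
    return [
        (sense["gloss"], role)
        for sense in SENSES
        for role, value in sense.items()
        if value == code_point
    ]


def traditional_candidates(code_point):
    """Traditional forms for senses where the code point is current."""
    found = set()
    for sense in SENSES:
        if code_point in (sense.get("simplified"), sense.get("traditional")):
            found.add(sense["traditional"])
    return found


print(roles_of(0x8721))
print(sorted(f"U+{cp:04X}" for cp in traditional_candidates(0x8721)))
```

Under these assumptions U+8721 has two traditional candidates, itself and U+881F, which is what the disputed kTraditionalVariant entry lists.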

--Ken






Re: TC/SC mapping

2002-01-23 Thread John H. Jenkins


On Wednesday, January 23, 2002, at 09:05 AM, Thomas Chan wrote:
> In other words,
>   yao1 'small'                         TC U+4E48 or U+5E7A -> SC U+4E48
>   me (as in shen2me 'what')            TC U+9EBC or U+9EBD -> SC U+4E48
>   mo2 (as in yao1mo2 'insignificant')  TC U+9EBC or U+9EBD -> SC U+9EBD
>

Thomas, do you have a reference for U+9EBC (麼) and U+9EBD (麽) being 
different?  The only dictionary I have which contains both is the 
(traditional) CiHai, and it claims they're variants of each other.

Meanwhile, both Sanseido and KangXi say that U+5C1B (尛) is a member of 
the family.  (KangXi says that anciently U+9EBC (麼) was written U+5C1B
(尛).)  Mathews and Sanseido also remind us that U+5E85 (庅) is another
variant, and Sanseido *also* lists U+5692 (嚒).

So, Doug, you see that U+4E48 (么) could conceivably be a traditional 
character in its own right *or* the simplified form for no fewer than six 
(!) other ideographs.

This is the kind of mess that has discouraged anybody from doing a 
systematic survey of simplifications for the Unihan database.

> The other example (U+8721 kTraditionalVariant U+8721 U+881F) is a
> mistake--the TraditionalVariant should only be U+881F.
>

Actually, no.  Both KangXi and the Cihai list U+8721 (蜡) as a traditional 
character in its own right, although I assume it's rare as I can't find it 
in my other dictionaries.

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/