Re: Status of Unihan Mandarin readings?

2002-12-20 Thread Raymond Mercier

On the errors in kMandarin:

Apart from the kMandarin errors of the kind that Andrew West has noted, 
there is another corruption, namely the loss of ü, and this happened 
between 3.0b1 and 3.0b2, when the ü became the two bytes C393.
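
A corruption of this kind can be flagged mechanically. The following is only a minimal sketch, under assumptions that are mine and not taken from the file's documentation: each data line has three tab-separated columns (code point, field name, value), and a kMandarin reading is a run of uppercase letters, Ü allowed, followed by a tone digit 1 to 5. The function name suspicious_readings and the sample lines are made up for illustration. The corrupted sample uses Ó (U+00D3), which is what the bytes C3 93 decode to when read as UTF-8.

```python
import re

# Assumed shape of one reading: uppercase ASCII letters or U+00DC (U with
# diaeresis), followed by a single tone digit 1-5. This pattern is an assumption.
READING = re.compile("[A-Z\u00dc]+[1-5]")


def suspicious_readings(lines):
    """Yield (code_point, reading) for kMandarin readings that do not fit READING."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment lines and blank lines
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3 or parts[1] != "kMandarin":
            continue  # only the kMandarin field is checked here
        for reading in parts[2].split():
            if not READING.fullmatch(reading):
                yield parts[0], reading


# Illustrative lines only, not copied from any released Unihan file.
sample = [
    "# comment line\n",
    "U+6C49\tkMandarin\tHAN4\n",
    "U+5973\tkMandarin\tN\u00dc3\n",   # intact reading with U+00DC
    "U+5973\tkMandarin\tN\u00d33\n",   # same reading with U+00D3 in its place
]

for code_point, reading in suspicious_readings(sample):
    print(code_point, repr(reading))
```

Run over a real file, the list of flagged readings would show at a glance whether a whole class of characters had gone missing.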

As to Han/Yi, U+6C49 YI4 HAN4 is found not only in 3.0b1 and 3.0b2, 
but also in 2.0. The HAN4 was dropped only in 3.2.
While I admire the effort to explain the intrusion of YI4, I feel it is a 
bit misplaced, and that some more mechanical/clerical explanation is in 
order. After all, look at the number of times "same as U+..." is written 
as "sama as U+..." in 3.2: 6 to be precise.


Raymond Mercier








Re: Status of Unihan Mandarin readings?

2002-12-20 Thread Raymond Mercier


At 08:44 AM 12/20/2002 -0700, you wrote:

 That's because the file was converted to UTF-8.  Previously it had not 
 been in any single encoding, which was creating problems.

Well, OK, but shouldn't you have created by now some sort of program that 
checks the file whenever you make a change, a sort of spellcheck? It should 
not be too hard to write something that displays the effects of any changes.
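
Something along these lines might do, as a rough sketch only. It assumes that each data line has three tab-separated columns (code point, field name, value) and that lines starting with # are comments. The names load_field and diff_field are hypothetical, and the sample lines are illustrative, apart from U+6C49, whose readings are the ones discussed in this thread.

```python
def load_field(lines, field="kMandarin"):
    """Map code point -> value for one field of tab-separated Unihan-style lines."""
    values = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment lines and blank lines
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3 and parts[1] == field:
            values[parts[0]] = parts[2]
    return values


def diff_field(old_lines, new_lines, field="kMandarin"):
    """Return (code_point, old_value, new_value) for every entry that differs.

    A value of None means the entry is absent from that version.
    """
    old = load_field(old_lines, field)
    new = load_field(new_lines, field)
    # Sort numerically by code point so that U+20000 comes after U+9FA5.
    code_points = sorted(old.keys() | new.keys(), key=lambda cp: int(cp[2:], 16))
    return [
        (cp, old.get(cp), new.get(cp))
        for cp in code_points
        if old.get(cp) != new.get(cp)
    ]


# Illustrative input standing in for two versions of the file.
old_version = [
    "U+4E00\tkMandarin\tYI1\n",
    "U+6C49\tkMandarin\tYI4 HAN4\n",
]
new_version = [
    "U+4E00\tkMandarin\tYI1\n",
    "U+6C49\tkMandarin\tYI4\n",
]

for code_point, before, after in diff_field(old_version, new_version):
    print(code_point, before, "->", after)
```

With real files, the two arguments could be file objects opened with open(path, encoding="utf-8"), and the printed list would be the "effects of any changes" to review before release.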

Raymond Mercier





Re: Status of Unihan Mandarin readings?

2002-12-19 Thread Andrew C. West

On Thu, 19 Dec 2002 04:58:08 -0800 (PST), Marco Cimarosti wrote:

 
 I have tried to follow the discussion about the errors in field kMandarin
 of file Unihan.txt but, after a while, I lost my way with all those
 dictionary references...
 
 Could someone kindly make a short summary of the situation? Here are my
 biggest ???'s:

Here's my take on the situation:

 
 - Are the errors really there?

Yes.

 - Any estimate as to how many entries are affected?

I estimate about 10% of basic CJK, in other words 2,000+ entries.

 - Is it only kMandarin affected or also any other fields?

I don't think any other fields are affected.

 - Any estimates for when it will be possible to publish a fixed version?

I'll let Mr. Jenkins answer that one.

 - Any suggestion for interim work-arounds (e.g., an older version of the
 file, an alternative source)?

Use the Unihan database for Unicode 3.0 at
http://www.unicode.org/Public/3.0-Update/Unihan-3.txt

This is the latest uncorrupted version.

Hope this clarifies the situation.

Andrew