RE: Unicode 4.0.1 Released

2004-04-12 Thread Kenneth Whistler
Jony asked:

> I thought that the alphabetic presentation forms are deprecated, however
> they are not indicated as such in proplist.txt.

A formal deprecation of a Unicode character takes an explicit
decision by the UTC, and no such decision is on record for
the alphabetic presentation forms.

--Ken




RE: Unicode 4.0.1 Released

2004-04-11 Thread Ernest Cline



> [Original Message]
> From: Jony Rosenne <[EMAIL PROTECTED]>
>
> I thought that the alphabetic presentation forms are deprecated,
> however they are not indicated as such in proplist.txt.

Their use is discouraged for new Unicode text, but they are not
deprecated, just as ANGSTROM SIGN is not deprecated.
They were encoded so that conversion from some legacy
codesets could be done with a single Unicode character
per legacy character.  Except in the unlikely event that another
codeset with additional presentation forms gets added to the
list of those considered important to support in that manner, it
is extremely doubtful that more of those forms will be encoded
in Unicode.

Deprecated characters are the way Unicode says,
"We made a mistake in encoding this character, and if we
were starting from scratch we wouldn't have this character,
but it's there, so we have to keep it.  But please don't use it,
and if you have data that does, please change it, ASAP!"
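
Whether a character is formally deprecated can be checked mechanically against the Deprecated property in PropList.txt. The following is a minimal sketch, assuming the usual UCD line layout of `code-or-range ; Property # comment`. The function name `parse_deprecated` and the `SAMPLE` lines are hypothetical illustrations written in that layout, not an excerpt from any particular release of the file.

```python
def parse_deprecated(lines):
    """Collect the code points listed with the Deprecated property."""
    deprecated = set()
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue  # blank or comment-only line
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "Deprecated":
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        deprecated.update(range(start, end + 1))
    return deprecated


# Illustrative lines in PropList.txt layout (assumed, not copied from a release).
SAMPLE = [
    "# comment line",
    "0340..0341    ; Deprecated # illustrative range",
    "206A..206F    ; Deprecated # illustrative range",
    "0009..000D    ; White_Space # illustrative other property",
]

deprecated = parse_deprecated(SAMPLE)
# U+FB1D is in the Alphabetic Presentation Forms block.
print(0xFB1D in deprecated)
```

With the real file one would pass an open file handle instead of `SAMPLE`; a presentation form missing from the resulting set is then "not deprecated" in the formal sense discussed here.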





RE: Unicode 4.0.1 Released

2004-04-11 Thread Jony Rosenne
I thought that the alphabetic presentation forms are deprecated, however
they are not indicated as such in proplist.txt.

Jony




Re: Unicode 4.0.1 Released

2004-04-08 Thread jameskass

John Jenkins wrote concerning UNIHAN.TXT,


> BTW, in case anybody's wondering, I've been working with Andrew 
> privately to get these issues resolved.

That's great!  Maintaining such a huge database must be a huge chore.

The work that goes into UNIHAN.TXT is most appreciated.  Although
Andrew pointed out some problems with the update, the database
does contain a lot of useful information.  It's pretty cool to see 
a few definition fields appearing now in Plane Two, too.

Best regards,

James Kass




Re: Unicode 4.0.1 Released

2004-04-08 Thread Richard S Cook
On Apr 8, 2004, at 04:29 PM, John Jenkins wrote:
> On Apr 2, 2004, at 4:38 AM, Andrew C. West wrote:
>
> > For me 4.0.1 was a big disappointment. The much vaunted update of the Unihan
> > database did not even clear up all the editorial errors in the database, let
> > alone deal with the real problems of content, such as incorrect or dubious
> > Mandarin, Cantonese, Korean and Japanese readings.
> > [ and so on...]
>
> BTW, in case anybody's wondering, I've been working with Andrew
> privately to get these issues resolved.

John,

In a database of 1,047,571 records (when last I checked) errors are 
bound to turn up now and again. As more and more informative data gets 
added, it's going to need fixing sometimes.

Far from being disappointed, I commend you for your tirelessness in 
this thankless task. Unihan will never be more than what people make of 
it. You have clearly made a lot of it.

-Richard




Re: Unicode 4.0.1 Released

2004-04-08 Thread John Jenkins
On Apr 2, 2004, at 4:38 AM, Andrew C. West wrote:

> For me 4.0.1 was a big disappointment. The much vaunted update of the Unihan
> database did not even clear up all the editorial errors in the database, let
> alone deal with the real problems of content, such as incorrect or dubious
> Mandarin, Cantonese, Korean and Japanese readings.
> [ and so on...]

BTW, in case anybody's wondering, I've been working with Andrew 
privately to get these issues resolved.


John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/



Re: Unicode 4.0.1 Released

2004-04-02 Thread Andrew C. West
On Tue, 30 Mar 2004 15:49:53 -0800, Rick McGowan wrote:
> 
> Unicode 4.0.1 has been released!
> 
> The main new features in Unicode 4.0.1 are the following:
> 
> 1. The first significant update of the Unihan Database (Unihan.txt)
>   since Unicode 3.2.0, including a large number of fixes and
>   additional data items.
>

For me 4.0.1 was a big disappointment. The much vaunted update of the Unihan
database did not even clear up all the editorial errors in the database, let
alone deal with the real problems of content, such as incorrect or dubious
Mandarin, Cantonese, Korean and Japanese readings.

As to the 164 incorrect Vietnamese readings for basic CJKV ideographs, I notice
that although the correct readings for these characters have now been added to
the kVietnamese field, the original erroneous readings (agreed as such by the
relevant Vietnamese experts) have been retained as well; so that now each of the
164 characters in the CJK Unified Ideographs block with a kVietnamese key has a
spurious reading followed by the correct reading. Hardly much of an improvement.

For the record, the following is a list of easily fixed editorial errors
relating to the fields of interest to me that I submitted as part of the review
process, and which remain unfixed in the latest version of the Unihan database.
This means that I have to manually preprocess Unihan.txt to correct these errors
before I can put it through my parsing program -- which is a pain.
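
A preprocessing pass of this kind could start from a small line parser. This is a minimal sketch, assuming each data line of Unihan.txt has three tab-separated columns (code point, field name, value) and that comment lines start with `#`. The function name `parse_unihan_line` is hypothetical and is not the parsing program mentioned above.

```python
def parse_unihan_line(line):
    """Split one Unihan.txt data line into (code point, field, value).

    Returns None for blank lines, comment lines and anything that does
    not have the assumed three tab-separated columns.
    """
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    parts = line.split("\t", 2)
    if len(parts) != 3 or not parts[0].startswith("U+"):
        return None
    code, field, value = parts
    return int(code[2:], 16), field, value


record = parse_unihan_line("U+7674\tkMandarin\tLÜAN2")
print(record)
```

Corrections can then be applied per field before the records are handed on to the main parser.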

1. kRSUnicode Field

A. Missing Simplified Radical Marker
Simplified radicals are indicated by an apostrophe after the radical number in
the basic CJK block, but not in CJK-A or CJK-B.
U+4336..4341  Radical 120 should be 120'
U+4723..4729  Radical 149 should be 149'
U+478C..4790  Radical 154 should be 154'
U+4880..4882  Radical 159 should be 159'
U+497A..4986  Radical 167 should be 167'
U+49B6..49B8  Radical 169 should be 169'
U+4B6A  Radical 184 should be 184'
U+4BC3..4BC5  Radical 187 should be 187'
U+4C9D..4CA4  Radical 195 should be 195'
U+4D13..4D19  Radical 196 should be 196'
U+4DAD..4DAE  Radical 212 should be 212'
U+8D5C  154.11 should be 154'.11
U+8D5D  154.12 should be 154'.12
U+8F89  159.8 should be 159'.8
U+987C  181.4 should be 181'.4
U+9E6B  196.12 should be 196'.12
U+9EA6  199.0 should be 199'.0
U+9EFE  205.0 should be 205'.0
U+9F7F  211.0 should be 211'.0
U+26208..26221  Radical 120 should be 120'
U+27BAA Radical 149 should be 149'
U+27E51..27E57  Radical 154 should be 154'
U+28405..2840A  Radical 159 should be 159'
U+28C3E..28C56  Radical 167 should be 167'
U+28DFF..28E0E  Radical 169 should be 169'
U+293FC..29400  Radical 178 should be 178'
U+29595..29597  Radical 181 should be 181'
U+29665..29670  Radical 182 should be 182'
U+297FE..2980F  Radical 184 should be 184'
U+299E6..29A10  Radical 187 should be 187'
U+29F79..29F8E  Radical 195 should be 195'
U+2A241..2A255  Radical 196 should be 196'
U+2A388..2A390  Radical 199 should be 199'
U+2A68F..2A690  Radical 211 should be 211'
U+2FA18 Radical 205 should be 205'
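
The apostrophe convention above can be read mechanically. This is a minimal sketch, assuming a single kRSUnicode entry has the shape `radical['].strokes`, as in the `154'.11` and `199'.0` examples in the list. The names `RS_PATTERN` and `parse_rs` are hypothetical, and allowing a leading minus sign on the stroke count is an assumption.

```python
import re

# radical number, optional apostrophe (simplified radical), residual strokes
RS_PATTERN = re.compile(r"^(\d+)('?)\.(-?\d+)$")


def parse_rs(value):
    """Return (radical, is_simplified, residual_strokes) for one entry."""
    match = RS_PATTERN.match(value)
    if match is None:
        raise ValueError(f"unrecognised kRSUnicode entry: {value!r}")
    radical, marker, strokes = match.groups()
    return int(radical), marker == "'", int(strokes)


print(parse_rs("154'.11"))
print(parse_rs("154.11"))
```

A consistency check over the file could then flag entries in the listed ranges where `is_simplified` comes back false.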

2. kMandarin Field

A. Invalid Application of U-UMLAUT
LÜN and LÜAN are invalid Mandarin pinyin spellings.
U+6523  kMandarin   LÜAN2 LUAN2 should be LUAN2
U+7674  kMandarin   LÜAN2 should be LUAN2
U+7D6F  kMandarin   LÜN4 GAI1 should be LUN4 GAI1

B. Invalid Non-Application of U-UMLAUT
LUE and NUE should always be written with an umlaut.
U+3A3C  kMandarin   LUE4 should be LÜE4
U+4588  kMandarin   NUE4 should be NÜE4
U+458B  kMandarin   NUE4 should be NÜE4
U+63A0  kMandarin   LÜE4 LUE3 should be LÜE4 LÜE3
U+7878  kMandarin   NUE4 should be NÜE4
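
The two umlaut rules in A and B can be applied as a simple rewrite. This is a minimal sketch, assuming each reading is an upper-case syllable followed by a tone digit 1 to 5, as in the entries above. The table and function names are hypothetical, and only the four syllables named in this message are covered.

```python
import re

READING = re.compile(r"^([A-ZÜ]+)([1-5])$")

# Spellings named above as invalid with an umlaut.
INVALID_WITH_UMLAUT = {"LÜN": "LUN", "LÜAN": "LUAN"}
# Spellings named above as always requiring an umlaut.
MISSING_UMLAUT = {"LUE": "LÜE", "NUE": "NÜE"}


def fix_umlaut(reading):
    """Correct the umlaut in one reading such as 'LUE4'."""
    match = READING.match(reading)
    if match is None:
        return reading  # leave anything unexpected untouched
    syllable, tone = match.groups()
    syllable = INVALID_WITH_UMLAUT.get(syllable, syllable)
    syllable = MISSING_UMLAUT.get(syllable, syllable)
    return syllable + tone


def fix_field(value):
    """Apply fix_umlaut to every reading in a space-separated field."""
    return " ".join(fix_umlaut(r) for r in value.split())


print(fix_field("LÜN4 GAI1"))
print(fix_field("LÜE4 LUE3"))
```

Note that rewriting `LÜAN2 LUAN2` this way leaves a repeated `LUAN2`, which would still need to be collapsed separately.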

C. Other Invalid Pinyin Spellings
IE, LIONG, YIAN, IONG and YIAO are invalid Mandarin pinyin spellings.
U+34C8  kMandarin   BEI4 BING4 FEI4 IE4 should be BEI4 BING4 FEI4 YE4 ?
U+3D88  kMandarin   LIONG3 YING2 should be YONG3 YING2 ?
U+66D5  kMandarin   YIAN4   should be YAN4
U+7867  kMandarin   IONG3 should be YONG3
U+9D01  kMandarin   YIAO1   should be YAO1

D. Duplicate Readings
Whilst there may be a historical reason why some characters in the Unihan
database originally had multiple duplicate Mandarin readings, there is no reason
why they should still be there in the latest version.
U+3561  kMandarin   HE2 HE2 HE4 HE4 HUO4
U+3563  kMandarin   YAN3 YAN4 YAN4
U+356A  kMandarin   DAN3 DAN3
U+3613  kMandarin   LAN2 LAN2
U+363A  kMandarin   FA2 FA2
U+369C  kMandarin   XU4 XU4 YU4
U+38BD  kMandarin   ER3 ER3
U+39A9  kMandarin   YIN3 YIN3
U+39E5  kMandarin   XIAN3 XIAN3
U+3C34  kMandarin   PO2 POU3 POU3
U+3C76  kMandarin   BENG4 JIAO4 PENG2 PENG2 QIAO3 RU4
U+3C78  kMandarin   BI4 BI4 BIE2
U+3E2E  kMandarin   FEN2 FEN2
U+3F18  kMandarin   WA3 WA3
U+3FC5  kMandarin   XIAN3 XUAN3
U+400E  kMandarin  
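
Duplicate readings of the kind listed in D can be collapsed while keeping the original order. This is a minimal sketch, assuming readings within a field are separated by spaces; the function name `dedupe_readings` is hypothetical.

```python
def dedupe_readings(value):
    """Drop repeated readings from a field, keeping first occurrences."""
    seen = set()
    kept = []
    for reading in value.split():
        if reading not in seen:
            seen.add(reading)
            kept.append(reading)
    return " ".join(kept)


print(dedupe_readings("HE2 HE2 HE4 HE4 HUO4"))
```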

RE: Unicode 4.0.1 Released

2004-03-31 Thread Kenneth Whistler

> > * Changed: bidi class of several characters

> Won't these fixes break applications out there? I.e., won't they turn
> previously conformant applications into non conformant ones?

And the other thing to understand about this particular change
is that it is the outcome of a years-long debate and a
painstakingly negotiated settlement that reminded me of
other difficult negotiations involving Middle Eastern issues.

The upshot will be that Microsoft-based applications will
come *into* compliance with the Bidirectional Algorithm,
with the crucial few character property changes as stated.
And IBM and other vendors who had implemented based on the
prior property values agreed that it was worth the tweak
in order to bring the entire industry into an agreed,
interoperable state for bidirectional behavior (in the
absence of higher-level protocols) involving the crucial
ASCII characters, '+', '-', and '/', that interact with URL's,
dates, times, and numbers in a bidirectional context.

--Ken




Re: Unicode 4.0.1 Released

2004-03-31 Thread jcowan
Marco Cimarosti scripsit:

> So far, my understanding was that the normative properties of existing code
> points were "carved in stone".

Not all normative properties are immutable.  A normative property is
simply one which you have to get right if you claim conformance to
that part of Unicode:  you cannot make PLUS SIGN a letter.  Immutable
properties are those which Unicode guarantees will never change; they
are a subset of the normative properties.

> Won't these fixes break applications out there? I.e., won't they turn
> previously conformant applications into non conformant ones?

They will conform to previous versions but not to newer versions.

-- 
BALIN FUNDINUL  UZBAD KHAZADDUMU  [EMAIL PROTECTED]
BALIN SON OF FUNDIN LORD OF KHAZAD-DUM  http://www.ccil.org/~cowan



RE: Unicode 4.0.1 Released

2004-03-31 Thread Marco Cimarosti
Rick McGowan wrote:
> Unicode 4.0.1 has been released! [...]
> The main new features in Unicode 4.0.1 are the following:
> [...]
> 3. Unicode Character Database:
> [...]
>   * Changed: general category of U+200B ZERO WIDTH SPACE
>   * Changed: bidi class of several characters

(If I am asking a FAQ, I apologize in advance...)

So far, my understanding was that the normative properties of existing code
points were "carved in stone".

Won't these fixes break applications out there? I.e., won't they turn
previously conformant applications into non conformant ones?

_ Marco