RE: Unicode 4.0.1 Released
Jony asked:

> I thought that the alphabetic presentation forms are deprecated, however
> they are not indicated as such in proplist.txt.

A formal deprecation of a Unicode character takes an explicit decision by the UTC, and no such decision is on record for the alphabetic presentation forms.

--Ken
RE: Unicode 4.0.1 Released
> [Original Message]
> From: Jony Rosenne <[EMAIL PROTECTED]>
>
> I thought that the alphabetic presentation forms are deprecated,
> however they are not indicated as such in proplist.txt.

Their use is discouraged for new Unicode text, but they are not deprecated, just as ANGSTROM SIGN is not deprecated. They were encoded so that conversion from some legacy codesets could be done with a single Unicode character per legacy character. Except in the unlikely event that another codeset with additional presentation forms gets added to the list of those considered important to support in that manner, it is extremely doubtful that more of those forms will be encoded in Unicode.

Deprecated characters are the way Unicode says, "We made a mistake in encoding this character, and if we were starting from scratch we wouldn't have this character, but it's there, so we have to keep it. But please don't use it, and if you have data that does, please change it, ASAP!"
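[Editorial sketch, not part of the original message.] The "discouraged but not deprecated" status described above can be seen in how normalization treats such characters. The following Python sketch uses the standard `unicodedata` module; the helper name `describe` is made up for illustration, and the results reflect whatever Unicode version ships with the interpreter rather than 4.0.1 specifically. It shows that LATIN SMALL LIGATURE FI (from the Alphabetic Presentation Forms block) has only a compatibility mapping, while ANGSTROM SIGN has a canonical one.

```python
import unicodedata


def describe(ch):
    """Summarize how the bundled Unicode data treats a single character."""
    return {
        "name": unicodedata.name(ch),
        # Empty string if none; a "<compat>" tag marks a compatibility mapping.
        "decomposition": unicodedata.decomposition(ch),
        "NFC": unicodedata.normalize("NFC", ch),
        "NFKC": unicodedata.normalize("NFKC", ch),
    }


ligature = describe("\uFB01")  # LATIN SMALL LIGATURE FI
angstrom = describe("\u212B")  # ANGSTROM SIGN

for info in (ligature, angstrom):
    print(info)
```

The ligature survives NFC unchanged and is only folded to "fi" under NFKC, whereas ANGSTROM SIGN is replaced by U+00C5 even under NFC. Neither behavior amounts to deprecation; that status is recorded separately, in the Deprecated property of PropList.txt.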
RE: Unicode 4.0.1 Released
I thought that the alphabetic presentation forms are deprecated, however they are not indicated as such in proplist.txt. Jony
Re: Unicode 4.0.1 Released
John Jenkins wrote concerning UNIHAN.TXT,

> BTW, in case anybody's wondering, I've been working with Andrew
> privately to get these issues resolved.

That's great! Maintaining such a huge database must be a huge chore. The work that goes into UNIHAN.TXT is most appreciated.

Although Andrew pointed out some problems with the update, the database does contain a lot of useful information. It's pretty cool to see a few definition fields appearing now in Plane Two, too.

Best regards,
James Kass
Re: Unicode 4.0.1 Released
On Apr 8, 2004, at 04:29 PM, John Jenkins wrote:

On Apr 2, 2004, at 4:38 AM, Andrew C. West wrote:

For me 4.0.1 was a big disappointment. The much vaunted update of the Unihan database did not even clear up all the editorial errors in the database, let alone deal with the real problems of content, such as incorrect or dubious Mandarin, Cantonese, Korean and Japanese readings.

[ and so on...]

BTW, in case anybody's wondering, I've been working with Andrew privately to get these issues resolved.

John,

In a database of 1,047,571 records (when last I checked) errors are bound to turn up now and again. As more and more informative data gets added, it's going to need fixing sometimes. Far from being disappointed, I commend you for your tirelessness in this thankless task. Unihan will never be more than what people make of it. You have clearly made a lot of it.

-Richard
Re: Unicode 4.0.1 Released
On Apr 2, 2004, at 4:38 AM, Andrew C. West wrote:

For me 4.0.1 was a big disappointment. The much vaunted update of the Unihan database did not even clear up all the editorial errors in the database, let alone deal with the real problems of content, such as incorrect or dubious Mandarin, Cantonese, Korean and Japanese readings.

[ and so on...]

BTW, in case anybody's wondering, I've been working with Andrew privately to get these issues resolved.

John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/
Re: Unicode 4.0.1 Released
On Tue, 30 Mar 2004 15:49:53 -0800, Rick McGowan wrote:
>
> Unicode 4.0.1 has been released!
>
> The main new features in Unicode 4.0.1 are the following:
>
> 1. The first significant update of the Unihan Database (Unihan.txt)
> since Unicode 3.2.0, including a large number of fixes and
> additional data items.
>

For me 4.0.1 was a big disappointment. The much vaunted update of the Unihan database did not even clear up all the editorial errors in the database, let alone deal with the real problems of content, such as incorrect or dubious Mandarin, Cantonese, Korean and Japanese readings.

As to the 164 incorrect Vietnamese readings for basic CJKV ideographs, I notice that although the correct readings for these characters have now been added to the kVietnamese field, the original erroneous readings (agreed as such by the relevant Vietnamese experts) have been retained as well; so that now each of the 164 characters in the CJK Unified Ideographs block with a kVietnamese key has a spurious reading followed by the correct reading. Hardly much of an improvement.

For the record, the following is a list of easily fixed editorial errors relating to the fields of interest to me that I submitted as part of the review process, and which remain unfixed in the latest version of the Unihan database. This means that I have to manually preprocess Unihan.txt to correct these errors before I can put it through my parsing program -- which is a pain.

1. kRSUnicode Field

A. Missing Simplified Radical Marker

Simplified radicals are indicated by an apostrophe after the radical number in the basic CJK block, but not in CJK-A or CJK-B.

U+4336..4341 Radical 120 should be 120'
U+4723..4729 Radical 149 should be 149'
U+478C..4790 Radical 154 should be 154'
U+4880..4882 Radical 159 should be 159'
U+497A..4986 Radical 167 should be 167'
U+49B6..49B8 Radical 169 should be 169'
U+4B6A Radical 184 should be 184'
U+4BC3..4BC5 Radical 187 should be 187'
U+4C9D..4CA4 Radical 195 should be 195'
U+4D13..4D19 Radical 196 should be 196'
U+4DAD..4DAE Radical 212 should be 212'
U+8D5C 154.11 should be 154'.11
U+8D5D 154.12 should be 154'.12
U+8F89 159.8 should be 159'.8
U+987C 181.4 should be 181'.4
U+9E6B 196.12 should be 196'.12
U+9EA6 199.0 should be 199'.0
U+9EFE 205.0 should be 205'.0
U+9F7F 211.0 should be 211'.0
U+26208..26221 Radical 120 should be 120'
U+27BAA Radical 149 should be 149'
U+27E51..27E57 Radical 154 should be 154'
U+28405..2840A Radical 159 should be 159'
U+28C3E..28C56 Radical 167 should be 167'
U+28DFF..28E0E Radical 169 should be 169'
U+293FC..29400 Radical 178 should be 178'
U+29595..29597 Radical 181 should be 181'
U+29665..29670 Radical 182 should be 182'
U+297FE..2980F Radical 184 should be 184'
U+299E6..29A10 Radical 187 should be 187'
U+29F79..29F8E Radical 195 should be 195'
U+2A241..2A255 Radical 196 should be 196'
U+2A388..2A390 Radical 199 should be 199'
U+2A68F..2A690 Radical 211 should be 211'
U+2FA18 Radical 205 should be 205'

2. kMandarin Field

A. Invalid Application of U-UMLAUT

LÜN and LÜAN are invalid Mandarin pinyin spellings.

U+6523 kMandarin LÜAN2 LUAN2 should be LUAN2
U+7674 kMandarin LÜAN2 should be LUAN2
U+7D6F kMandarin LÜN4 GAI1 should be LUN4 GAI1

B. Invalid Non-Application of U-UMLAUT

LUE and NUE should always be written with an umlaut.

U+3A3C kMandarin LUE4 should be LÜE4
U+4588 kMandarin NUE4 should be NÜE4
U+458B kMandarin NUE4 should be NÜE4
U+63A0 kMandarin LÜE4 LUE3 should be LÜE4 LÜE3
U+7878 kMandarin NUE4 should be NÜE4

C. Other Invalid Pinyin Spellings

IE, LIONG, YIAN, IONG and YIAO are invalid Mandarin pinyin spellings.

U+34C8 kMandarin BEI4 BING4 FEI4 IE4 should be BEI4 BING4 FEI4 YE4 ?
U+3D88 kMandarin LIONG3 YING2 should be YONG3 YING2 ?
U+66D5 kMandarin YIAN4 should be YAN4
U+7867 kMandarin IONG3 should be YONG3
U+9D01 kMandarin YIAO1 should be YAO1

D. Duplicate Readings

Whilst there may be a historical reason why some characters in the Unihan database originally had multiple duplicate Mandarin readings, there is no reason why they should still be there in the latest version.

U+3561 kMandarin HE2 HE2 HE4 HE4 HUO4
U+3563 kMandarin YAN3 YAN4 YAN4
U+356A kMandarin DAN3 DAN3
U+3613 kMandarin LAN2 LAN2
U+363A kMandarin FA2 FA2
U+369C kMandarin XU4 XU4 YU4
U+38BD kMandarin ER3 ER3
U+39A9 kMandarin YIN3 YIN3
U+39E5 kMandarin XIAN3 XIAN3
U+3C34 kMandarin PO2 POU3 POU3
U+3C76 kMandarin BENG4 JIAO4 PENG2 PENG2 QIAO3 RU4
U+3C78 kMandarin BI4 BI4 BIE2
U+3E2E kMandarin FEN2 FEN2
U+3F18 kMandarin WA3 WA3
U+3FC5 kMandarin XIAN3 XUAN3
U+400E kMandarin
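
[Editorial sketch, not part of the original message.] The kMandarin problems listed above (misapplied or missing umlauts, other invalid spellings, duplicate readings) are mechanical enough to check automatically. The Python sketch below is one possible way to flag them; the function name `check_mandarin`, the set of bad syllables (taken only from the list in this message, not from a complete pinyin table), and the assumption that readings are uppercase syllables followed by a tone digit 1 to 5 are all illustrative choices, not part of Unihan or of the author's own parsing program.

```python
import re

# Syllables this message identifies as invalid Mandarin pinyin spellings.
# This is NOT a complete validity table, only the cases reported above.
INVALID_SYLLABLES = {
    "LÜN", "LÜAN",                            # umlaut wrongly applied
    "LUE", "NUE",                             # umlaut wrongly omitted
    "IE", "LIONG", "YIAN", "IONG", "YIAO",    # other invalid spellings
}

# Assumed reading shape: uppercase letters (possibly Ü) plus a tone digit.
READING = re.compile(r"^([A-ZÜ]+)([1-5])$")


def check_mandarin(line):
    """Return a list of problem descriptions for one kMandarin record.

    Records for other fields, comments, and records with no value
    yield an empty list.
    """
    parts = line.split(None, 2)  # code point, field name, value
    if len(parts) != 3 or parts[1] != "kMandarin":
        return []
    code_point, _, value = parts
    problems = []
    seen = set()
    for reading in value.split():
        if reading in seen:
            problems.append(f"{code_point}: duplicate reading {reading}")
        seen.add(reading)
        match = READING.match(reading)
        if match is None:
            problems.append(f"{code_point}: malformed reading {reading}")
        elif match.group(1) in INVALID_SYLLABLES:
            problems.append(f"{code_point}: invalid syllable {reading}")
    return problems


sample = [
    "U+6523 kMandarin LÜAN2 LUAN2",
    "U+3561 kMandarin HE2 HE2 HE4 HE4 HUO4",
    "U+3FC5 kMandarin XIAN3 XUAN3",
]
for record in sample:
    for problem in check_mandarin(record):
        print(problem)
```

Splitting on any whitespace lets the same function accept both the tab-separated records of Unihan.txt and the space-separated form quoted in this message. A checker like this only reports problems; deciding the correct reading (for example YE4 versus some other value for U+34C8) still needs a human.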
RE: Unicode 4.0.1 Released
> > * Changed: bidi class of several characters

> Won't these fixes break applications out there? I.e., won't they turn
> previously conformant applications into non conformant ones?

And the other thing to understand about this particular change is that it is the outcome of a years-long debate and a painstakingly negotiated settlement that reminded me of other difficult negotiations involving Middle Eastern issues.

The upshot will be that Microsoft-based applications will come *into* compliance with the Bidirectional Algorithm, with the crucial few character property changes as stated. And IBM and other vendors who had implemented based on the prior property values agreed that it was worth the tweak in order to bring the entire industry into an agreed, interoperable state for bidirectional behavior (in the absence of higher-level protocols) involving the crucial ASCII characters, '+', '-', and '/', that interact with URLs, dates, times, and numbers in a bidirectional context.

--Ken
Re: Unicode 4.0.1 Released
Marco Cimarosti scripsit:

> So far, my understanding was that the normative properties of existing code
> points were "carved in stone".

Not all normative properties are immutable. A normative property is simply one which you have to get right if you claim conformance to that part of Unicode: you cannot make PLUS SIGN a letter. Immutable properties are those which Unicode guarantees will never change; they are a subset of the normative properties.

> Won't these fixes break applications out there? I.e., won't they turn
> previously conformant applications into non conformant ones?

They will conform to previous versions but not to newer versions.

--
BALIN FUNDINUL UZBAD KHAZADDUMU [EMAIL PROTECTED]
BALIN SON OF FUNDIN LORD OF KHAZAD-DUM http://www.ccil.org/~cowan
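
[Editorial sketch, not part of the original message.] The point that conformance is relative to a Unicode version can be illustrated with Python's standard `unicodedata` module, which exposes both the interpreter's current character database and a frozen 3.2.0 snapshot (`unicodedata.ucd_3_2_0`). Note the assumptions: this compares 3.2.0 against whatever version the interpreter bundles, not 4.0 against 4.0.1, and the helper names `properties` and `compare` are made up for illustration.

```python
import unicodedata

# The ASCII characters discussed in this thread, plus ZERO WIDTH SPACE,
# whose general category also changed in 4.0.1.
CHARS = ["+", "-", "/", "\u200B"]


def properties(db, ch):
    """Return (general category, bidi class) of ch according to db."""
    return (db.category(ch), db.bidirectional(ch))


def compare():
    """Pair each character's properties in Unicode 3.2.0 and in the current data."""
    rows = []
    for ch in CHARS:
        old = properties(unicodedata.ucd_3_2_0, ch)
        new = properties(unicodedata, ch)
        rows.append((f"U+{ord(ch):04X}", old, new))
    return rows


for code_point, old, new in compare():
    marker = "changed" if old != new else "same"
    print(code_point, "3.2.0:", old, "current:", new, marker)
```

An implementation that hard-codes the older values remains conformant to the older version of the standard; it simply cannot claim conformance to a version in which those property values differ.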
RE: Unicode 4.0.1 Released
Rick McGowan wrote:
> Unicode 4.0.1 has been released!
[...]
> The main new features in Unicode 4.0.1 are the following:
> [...]
> 3. Unicode Character Database:
> [...]
> * Changed: general category of U+200B ZERO WIDTH SPACE
> * Changed: bidi class of several characters

(If I am asking a FAQ, I apologize in advance...)

So far, my understanding was that the normative properties of existing code points were "carved in stone".

Won't these fixes break applications out there? I.e., won't they turn previously conformant applications into non conformant ones?

_ Marco