Re: Application that displays CJK text in Normalization Form D
Kent Karlsson wrote:
> > Crap. Yes, Ken and BabelPad are right. Some ideographs do have singleton mappings and can thus be different between NFD and NFC.
> No, both NFD and NFC will map U+FA47 to U+6F22; singleton canonical mappings are not "reversed" in the composition phase of transforming to NFC.
Some ideographs have singleton mappings and can thus be different when mapped to NFD and/or to NFC?
-- Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
Re: CJK Compatibility Gotchas (was: Re: Application that displays CJK text in Normalization Form D)
On 11/15/2010 5:43 PM, Kenneth Whistler wrote:
> Perhaps someone would like to make a detailed proposal to the UTC for how to fix the text and charts? ;-)
Ken, having shown yourself the master of detail in your reply, I think you've appointed yourself. A round of applause for Ken! See how easy that was? :)
Cheers, A./
PS: I had something pithy in mind that would work for the charts - I'll send that off to the guy who maintains the nameslist.
CJK Compatibility Gotchas (was: Re: Application that displays CJK text in Normalization Form D)
Asmus replied:
> On 11/15/2010 2:24 PM, Kenneth Whistler wrote:
> >> FA47 is a "compatibility character", and would have a compatibility mapping.
> > Faulty syllogism.
> Formally correct answer but only because of something of a design flaw in Unicode. When the type of mapping was decided on, people didn't fully expect that NFC might become widely used/enforced, making these distinctions appear wherever text is normalized in a distributed architecture.
O.k., I'm gonna have to intervene again. *hehe*
Yes, there is a design flaw here, but Asmus' explanation is also somewhat faulty, because it flattens out the history in a way that is liable to be misunderstood. There is a *reason* why "when the type of mapping was decided on" that "people didn't fully expect that NFC might become widely used/enforced" -- but it wasn't that they were goofing up in understanding the implications of normalization. Rather, at that point in Unicode history NFC didn't *exist* yet, nor had the normalization algorithm been designed.
Here, for the benefit of the standards geeks out there, are the relevant highlights of the historical timeline involved.
June, 1992. The canonical mappings for the CJK Compatibility characters were *printed* (with off-by-one errors for some of them!) in Unicode 1.0, volume 2 (= Unicode 1.0.1). Actually, at the time, we didn't know they were "canonical" mappings, because that concept hadn't formally been invented yet, but the intention was clear. They were the mappings from the "CJK compatibility ideographs" to the "real" unified Han ideographs in the standard. The CJK compatibility characters were all considered to be duplicates in the source standards that didn't follow the unification rules.
July, 1996. The formal definitions of "canonical decomposition" and "compatibility decomposition" were first published in Unicode 2.0. There wasn't a data file for the CJK Compatibility Ideographs block, but the canonical mappings were *printed* (correctly, this time) on pp. 7-470 to 7-472 of the standard.
August 4, 1998. The first published version of UnicodeData.txt that contained the canonical mappings for the CJK Compatibility Ideographs was UnicodeData-2.1.5.txt for Unicode 2.1.5. (Actually, they got into UnicodeData-2.1.4.txt on July 9, 1998, but that wasn't a published version of the data file.)
July 23, 1999. This was the publication date of the first approved version of UAX #15 (Revision 15), and so is the first published definition of NFC. (Of course UAX #15 had been in draft for some time earlier than that, so the term "NFC" can be tracked back in the drafts to mid-1998.)
September, 1999. Release of Unicode 3.0 -- the first release of Unicode formally tied to the Unicode Normalization Algorithm. (The revision of UAX #15 for the release was actually Revision 18, dated November 11, 1999.)
March 23, 2001. UAX #15, Version 3.1.0. This was the version of the Unicode Normalization Algorithm that specified the composition version to be Version 3.1.0 and locked down normalization forever more.
So essentially, there was a 9-year period between when the first mappings were defined for the CJK Compatibility Ideographs and the date beyond which it became impossible to reinterpret or change a canonical mapping because of the lockdown of normalization.
The problems resulting from the normalization for CJK Compatibility Ideographs only started to become visible to people *after* the lockdown, and when Unicode normalization started to become a regular feature of actual processing.
And it wasn't because "people didn't fully expect that NFC might become widely used/enforced" -- or at least not the people in the UTC. The UAX #15 text published with Unicode 3.0 already stated: "The W3C Character Model for the World Wide Web requires the use of Normalization Form C for XML and related standards..."
And it wasn't because of some oversight about the canonical mappings involving the CJK Compatibility Ideographs per se. That same UAX #15 for Unicode 3.0 also stated: "With *all* normalization forms singleton characters (those with singleton canonical mappings) are replaced." So the ground facts for the FA10 --> (NFC/NFD/NFKC/NFKD) 585C normalization pattern were well-established and explicitly stated in 1999. > > FA47 is a CJK Compatibility character, which means it was encoded > > for compatibility purposes -- in this case to cover the round-trip > > mapping needed for JIS X 0213. > > > > However, it has a *canonical* decomposition mapping to U+6F22. > > And that, of course, destroys the desired "round-trip" behavior if it is > inadvertently applied while the data are encoded in Unicode. Hence the > need to recreate a solution to the issue of variant forms with a > different mechanism, the ideographic variation sequence (and > corresponding database). Yes, that is basically correct. But, this architect
Re: Application that displays CJK text in Normalization Form D
On 2010-11-15 23:53, "Doug Ewell" wrote:
>> When I type the ideograph 漢 (U+FA47) into BabelPad, highlight it, and then click the button labeled "Normalize to NFC", the character becomes 漢 (U+6F22). Does BabelPad not conform to the Unicode Standard in this case? Is this not truly Unicode normalization?
> Crap. Yes, Ken and BabelPad are right. Some ideographs do have singleton mappings and can thus be different between NFD and NFC.
No, both NFD and NFC will map U+FA47 to U+6F22; singleton canonical mappings are not "reversed" in the composition phase of transforming to NFC.
/Kent K
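Kent's point can be checked with a few lines of Python. This is only a sketch, assuming the standard-library unicodedata module and the Unicode data bundled with the interpreter; the variable names are illustrative and not from the thread.

```python
import unicodedata

compat = "\uFA47"   # CJK COMPATIBILITY IDEOGRAPH-FA47
unified = "\u6F22"  # CJK UNIFIED IDEOGRAPH-6F22

# decomposition() returns the raw mapping from the character database.
# A mapping with no "<tag>" prefix is a canonical mapping; a single
# code point on the right-hand side makes it a singleton.
mapping = unicodedata.decomposition(compat)

# Apply all four normalization forms to the compatibility ideograph.
forms = {
    form: unicodedata.normalize(form, compat)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, result in forms.items():
    print(form, "U+%04X" % ord(result))
```

If the bundled data matches the standard, every form yields the unified ideograph, so NFC and NFD do not differ for this character; both differ only from the unnormalized original.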
Re: Application that displays CJK text in Normalization Form D
On 11/15/2010 2:24 PM, Kenneth Whistler wrote:
>> FA47 is a "compatibility character", and would have a compatibility mapping.
> Faulty syllogism.
Formally correct answer but only because of something of a design flaw in Unicode. When the type of mapping was decided on, people didn't fully expect that NFC might become widely used/enforced, making these distinctions appear wherever text is normalized in a distributed architecture.
> FA47 is a CJK Compatibility character, which means it was encoded for compatibility purposes -- in this case to cover the round-trip mapping needed for JIS X 0213.
> However, it has a *canonical* decomposition mapping to U+6F22.
And that, of course, destroys the desired "round-trip" behavior if it is inadvertently applied while the data are encoded in Unicode. Hence the need to recreate a solution to the issue of variant forms with a different mechanism, the ideographic variation sequence (and corresponding database).
> The behavior in BabelPad is correct: U+6F22 is the NFC form of U+FA47. Easily verified, for example, by checking the FA47 entry in NormalizationTest.txt in the UCD.
While correct, it's something that remains a bit of a gotcha. Especially now that Unicode has charts that go to great lengths showing the different glyphs for these characters, I would suggest adding a note to the charts that makes clear that these distinctions are *removed* anytime the text is normalized, which, in a distributed architecture, may happen anytime.
A./
> --Ken
>> When I type ... (U+FA47) into BabelPad, highlight it, and then click the button labeled "Normalize to NFC", the character becomes ... (U+6F22). Does BabelPad not conform to the Unicode Standard in this case? ...
RE: Application that displays CJK text in Normalization Form D
Jim Monty wrote: > How cool is it to post an inquiry to the Unicode mailing list and have > Unicode luminaries like Mark Davis, Asmus Freytag, Markus Scherer, > Martin Dürst and Doug Ewell ALL reply? Don't count me among the luminaries. I'm just a student too, studying Unicode for 19 years now, and to prove that I'm still learning... > When I type the ideograph 漢 (U+FA47) into BabelPad, highlight it, and > then click the button labeled "Normalize to NFC", the character > becomes 漢 (U+6F22). Does BabelPad not conform to the Unicode Standard > in this case? Is this not truly Unicode normalization? Crap. Yes, Ken and BabelPad are right. Some ideographs do have singleton mappings and can thus be different between NFD and NFC. It isn't quite the same as combining U+30C8 and U+3099 to make U+30C9, or combining jamos into precomposed syllables, but it's enough to disprove my earlier statement. How about this: For *any* text example which can be encoded differently in NFC and NFD, there are some combinations of OS + app + rendering engine + font that can display that example properly in both forms, and some that cannot. -- Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
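The contrast Doug draws between a composing pair and a singleton can be sketched in Python. This assumes the standard-library unicodedata module; the helper name round_trips is hypothetical and exists only for this illustration.

```python
import unicodedata

to, dakuten, do = "\u30C8", "\u3099", "\u30C9"  # TO, combining voiced mark, DO

# Ordinary canonical pair: NFD splits the letter, NFC puts it back together.
nfd_of_do = unicodedata.normalize("NFD", do)
nfc_of_pair = unicodedata.normalize("NFC", to + dakuten)

def round_trips(ch):
    """True if NFC(NFD(ch)) returns the original character."""
    decomposed = unicodedata.normalize("NFD", ch)
    return unicodedata.normalize("NFC", decomposed) == ch

print(round_trips(do))        # the katakana pair recomposes
print(round_trips("\uFA47"))  # the singleton is never restored
```

The katakana letter survives a trip through NFD and back, while the compatibility ideograph does not: once replaced by U+6F22 it stays replaced.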
RE: Application that displays CJK text in Normalization Form D
> FA47 is a "compatibility character", and would have a compatibility mapping. Faulty syllogism. FA47 is a CJK Compatibility character, which means it was encoded for compatibility purposes -- in this case to cover the round-trip mapping needed for JIS X 0213. However, it has a *canonical* decomposition mapping to U+6F22. The behavior in BabelPad is correct: U+6F22 is the NFC form of U+FA47. Easily verified, for example, by checking the FA47 entry in NormalizationTest.txt in the UCD. --Ken > > When I type ... (U+FA47) into BabelPad, highlight it, and then > > click the button labeled "Normalize to NFC", the character > > becomes ... (U+6F22). Does BabelPad not conform to the Unicode Standard > > in this case? ...
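Ken's suggestion to check NormalizationTest.txt can be approximated in Python. The sketch below assumes the file's documented layout of five semicolon-separated columns (source, NFC, NFD, NFKC, NFKD) holding hexadecimal code points, followed by a "#" comment. SAMPLE_LINE is written out by hand in that layout for illustration, not copied from the file, and the function names are hypothetical.

```python
import unicodedata

# Hand-written line in the five-column layout (illustrative only).
SAMPLE_LINE = "FA47;6F22;6F22;6F22;6F22; # CJK COMPATIBILITY IDEOGRAPH-FA47"

def parse_line(line):
    """Return the five columns of a test line as Python strings."""
    data = line.split("#", 1)[0]      # drop the trailing comment
    fields = data.split(";")[:5]      # source, NFC, NFD, NFKC, NFKD
    return ["".join(chr(int(cp, 16)) for cp in field.split())
            for field in fields]

def check_line(line):
    """Compare each expected column with the library's normalization."""
    source, nfc, nfd, nfkc, nfkd = parse_line(line)
    expected = {"NFC": nfc, "NFD": nfd, "NFKC": nfkc, "NFKD": nfkd}
    return {form: unicodedata.normalize(form, source) == want
            for form, want in expected.items()}

results = check_line(SAMPLE_LINE)
print(results)
```

The same two functions could be run over the real file from the UCD, skipping comment and "@Part" lines, to test a normalizer more broadly.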
RE: Application that displays CJK text in Normalization Form D
FA47 is a "compatibility character", and would have a compatibility mapping. -Original Message- From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Jim Monty Sent: Monday, November 15, 2010 1:02 PM To: unicode@unicode.org Subject: Re: Application that displays CJK text in Normalization Form D > When I type the ideograph 漢 (U+FA47) into BabelPad, highlight it, and then > click the button labeled "Normalize to NFC", the character becomes 漢 > (U+6F22). Does BabelPad not conform to the Unicode Standard in this case? Is > this not truly Unicode normalization? Jim Monty
RE: Application that displays katakana and Hangul text in Normalization Form D [Was Re: Application that displays CJK text in Normalization Form D] :-)
Jim, behaviour will depend on fonts being used. It could also depend on the version of software you are using. Windows 7 has pretty good support (fonts and Uniscribe) for all of this.
Peter
-----Original Message-----
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Jim Monty
Sent: Sunday, November 14, 2010 3:35 PM
To: unicode@unicode.org
Subject: Application that displays katakana and Hangul text in Normalization Form D [Was Re: Application that displays CJK text in Normalization Form D] :-)
Andrew Cunningham wrote:
> Jim Monty wrote:
> > In my original post, I used "CJK text" in opposition to non-CJK text because non-CJK text (in particular, Latin text) in Normalization Form D displays properly in the same software I described where CJK text (in particular, katakana and Hangul) in Normalization Form D does not display properly.
> Actually the Latin text can suffer from the same problems, Latin text in NFD has similar dependencies as Korean text in NFD, and sometimes with worse results.
Yes, I realize this, too. I was referring to the specific case of East Asian-script characters in NFD, not the general case of characters in any script in NFD.
In Notepad, I see an o with a macron on top of it for the Unicode characters U+006F U+0304. On the next line of the same text file, there are the two Unicode characters U+30C8 U+3099, but I do not see a katakana letter do. Instead, I see a katakana letter to and, to the right of it, a katakana-hiragana voiced sound mark. I observe essentially the same thing in other applications, including BabelPad and SC UniPad.
So this is the specific circumstance that led me to ask the Unicode community about a specific case: Asian-script characters in Unicode Normalization Form D.
The answer for my specific case (thanks to Doug Ewell) is that the version of Uniscribe installed on my computer is not properly rendering katakana and Hangul characters in Normalization Form D.
It seems I need a better Uniscribe. The other valuable thing I learned is that there are plenty of systems (complex systems of computer and similar digital device hardware, video display devices, computer operating systems, software applications, font-rendering and text-layout service applications, fonts, etc.) that support Unicode in Normalization Form D better than the systems I'm using at the moment. I didn't know this. Thank you for the additional information about Latin-script NFD. Jim Monty
RE: Application that displays CJK text in Normalization Form D
Another point: > Aren't the two versions of the same Unicode text supposed to be > rendered the same? They're not, at least not in any of the > applications in which I've viewed them: Microsoft Internet Explorer, > Microsoft Notepad, Vim, BabelPad and SC Unipad. SC UniPad uses its own built-in font and rendering engine, and does not claim to do much "smart" rendering beyond Arabic contextual forms and bidirectionality. It does have options to "Combine Characters" and "Combine Hangul Jamo," which will convert the Japanese and Korean examples (respectively) from NFD to NFC, but I realize that's not the question you are asking. -- Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
Re: Application that displays CJK text in Normalization Form D
Doug Ewell wrote: > And no, I did not intend to make this big a deal out of it, and I > apologize for doing so. Nor did I. I'm a genuine student of Unicode, here to learn. It seems many of the regular contributors to the Unicode and Unicore mailing lists are the Unicode experts themselves, many of whom are developers of the Unicode Standard. As such, these mailing lists are fantastic! There are very few technology mailing lists like them anymore. How cool is it to post an inquiry to the Unicode mailing list and have Unicode luminaries like Mark Davis, Asmus Freytag, Markus Scherer, Martin Dürst and Doug Ewell ALL reply? (The answer: Pretty darn cool!) When I asked for clarification about my use of the term "CJK text" instead of "kana and Hangul text", I was earnest. If there was something wrong with my understanding of the standard terminology, I genuinely wanted to know what it was. You're the experts, I'm the initiate. > The answer to Jim's question, then, is that for those examples > of "CJK text" which are encoded differently in NFC and NFD (a group > that excludes ideographs, thus immediately putting that side issue > to rest), there are indeed some combinations of OS + app + rendering > engine + font that can display those examples properly. And this was the valuable lesson I learned. Until this exchange on the Unicode mailing, I'd had a biased and wrong impression of the state of the art with respect to Unicode normalization and modern software based on my own personal experience. I'm glad I asked the question, and I'm grateful for all the excellent and thorough answers. When I type the ideograph 漢 (U+FA47) into BabelPad, highlight it, and then click the button labeled "Normalize to NFC", the character becomes 漢 (U+6F22). Does BabelPad not conform to the Unicode Standard in this case? Is this not truly Unicode normalization? Jim Monty
RE: Application that displays CJK text in Normalization Form D
On Windows, strings will display correctly in either NFC or NFD provided an appropriate font is used--that choice being different for Japanese and for Korean. Windows 7 and earlier do not ship with fonts that support Old Hangul, but Old Hangul fonts are available from other sources; e.g. there's an MS Office add-on sold in Korea that includes Old Hangul fonts. One limitation wrt Japanese marks: when drawing in GDI in vertical orientation, marks may not position correctly if there is no precomposed character for the combination. That's not an issue for the strings you provided here, however. Peter -Original Message- From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Jim Monty Sent: Saturday, November 13, 2010 4:47 PM To: unicode@unicode.org Subject: Application that displays CJK text in Normalization Form D Is there even a single software application that properly displays CJK text in Normalization Form D? NFC: ドライドマンゴス NFD: ドライドマンゴス NFC: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요 NFD: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요 Aren't the two versions of the same Unicode text supposed to be rendered the same? They're not, at least not in any of the applications in which I've viewed them: Microsoft Internet Explorer, Microsoft Notepad, Vim, BabelPad and SC Unipad. Jim Monty
Re: Application that displays CJK text in Normalization Form D
Asmus Freytag wrote:
>> The term "CJK" is often used to refer to those characters which are common to Chinese and Japanese and Korean, viz. the ideographic characters.
> Doug, you might want to talk to the author of UTN#14 then, because he seems to be using the term "CJK text" in a sense that I find indistinguishable from the way Jim did. Any relation of yours?
Nice catch. In UTN #14, I wrote:
In the case of Chinese, Japanese, and Korean (“CJK”) text, where a typical document might contain thousands of different ideographic Han characters, there never was any expectation that 8 bits per character would suffice. The legacy double-byte character sets designed for CJK text used a single byte for some characters (ASCII and halfwidth katakana) and two for others. DBCS encodings are trickier to handle than fixed-length encodings—programmers must keep track of lead and trail bytes—but at least these character sets represented CJK text in no more than 16 bits, as compactly as could be expected.
By "CJK text" I definitely did mean to emphasize the unique situation of having to find room for thousands of ideographic characters. I note that legacy character sets (primarily EBCDIC-based) have been devised to handle only Latin plus katakana, or only Latin plus jamos, such that 8 bits per character did in fact suffice.
In my second sentence above, I did acknowledge that "double-byte character sets designed for CJK text" include halfwidth katakana. For that matter, many of them also include Greek and Cyrillic, so I'm not sure if the comparison to Jim's usage is quite on the mark, but I'll accept it if Asmus sees it that way.
The answer to Jim's question, then, is that for those examples of "CJK text" which are encoded differently in NFC and NFD (a group that excludes ideographs, thus immediately putting that side issue to rest), there are indeed some combinations of OS + app + rendering engine + font that can display those examples properly.
And no, I did not intend to make this big a deal out of it, and I apologize for doing so. -- Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
Application that displays katakana and Hangul text in Normalization Form D [Was Re: Application that displays CJK text in Normalization Form D] :-)
[I apologize for the repost. The original one was formatted badly.]
Andrew Cunningham wrote:
> Jim Monty wrote:
> > In my original post, I used "CJK text" in opposition to non-CJK text because non-CJK text (in particular, Latin text) in Normalization Form D displays properly in the same software I described where CJK text (in particular, katakana and Hangul) in Normalization Form D does not display properly.
> Actually the Latin text can suffer from the same problems, Latin text in NFD has similar dependencies as Korean text in NFD, and sometimes with worse results.
Yes, I realize this, too. I was referring to the specific case of East Asian-script characters in NFD, not the general case of characters in any script in NFD.
In Notepad, I see an o with a macron on top of it for the Unicode characters U+006F U+0304. On the next line of the same text file, there are the two Unicode characters U+30C8 U+3099, but I do not see a katakana letter do. Instead, I see a katakana letter to and, to the right of it, a katakana-hiragana voiced sound mark. I observe essentially the same thing in other applications, including BabelPad and SC UniPad.
So this is the specific circumstance that led me to ask the Unicode community about a specific case: Asian-script characters in Unicode Normalization Form D.
The answer for my specific case (thanks to Doug Ewell) is that the version of Uniscribe installed on my computer is not properly rendering katakana and Hangul characters in Normalization Form D. It seems I need a better Uniscribe.
The other valuable thing I learned is that there are plenty of systems (complex systems of computer and similar digital device hardware, video display devices, computer operating systems, software applications, font-rendering and text-layout service applications, fonts, etc.) that support Unicode in Normalization Form D better than the systems I'm using at the moment. I didn't know this.
Thank you for the additional information about Latin-script NFD. Jim Monty
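When a rendering looks wrong, it helps to confirm what the file actually contains before blaming the renderer. A minimal sketch, assuming Python's standard-library unicodedata module; the helper name describe is hypothetical.

```python
import unicodedata

def describe(text):
    """List each code point in text as 'U+XXXX NAME'."""
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))
            for ch in text]

latin_nfd = "\u006F\u0304"  # o + combining macron
kana_nfd = "\u30C8\u3099"   # katakana to + combining voiced sound mark

for line in describe(latin_nfd) + describe(kana_nfd):
    print(line)

# Both sequences have a precomposed NFC form, so a renderer that handles
# one pair but not the other is showing a font or shaping limitation,
# not a difference in how the text is encoded.
latin_nfc = unicodedata.normalize("NFC", latin_nfd)
kana_nfc = unicodedata.normalize("NFC", kana_nfd)
```

Here latin_nfc is the single character U+014D and kana_nfc is U+30C9, which is the equivalence the two Notepad lines were expected to show.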
Re: Application that displays CJK text in Normalization Form D
On 11/14/2010 12:57 PM, Doug Ewell wrote:
> Jim Monty wrote:
>> Japanese kana (the "J" in "CJK") and Korean syllables (the "K" in "CJK") both have different normalization forms. What do ideographs have to do with anything? I didn't mention ideographs; you did.
> The term "CJK" is often used to refer to those characters which are common to Chinese and Japanese and Korean, viz. the ideographic characters.
Doug, you might want to talk to the author of UTN#14 then, because he seems to be using the term "CJK text" in a sense that I find indistinguishable from the way Jim did. Any relation of yours? :)
A./
PS: I too think that replacing "CJK text" with "Katakana and Hangul", as a more specific choice, would have been an improvement - as written it makes the problem sound more open-ended than it is. But you guys are arguing about an E-mail subject line, of all things
Re: Application that displays CJK text in Normalization Form D
Doug Ewell wrote: > One might as well ask if there are any systems which can properly display > "Unicode text" in NFD. That seems like a perfectly reasonable question to ask. Its answer might be complex, but it's nonetheless a valid question. In fact, to me, it reads like a Unicode FAQ. I get the subtle distinction you're making; I just don't understand why you're making it in this context. In my original post, I used "CJK text" in opposition to non-CJK text because non-CJK text (in particular, Latin text) in Normalization Form D displays properly in the same software I described where CJK text (in particular, katakana and Hangul) in Normalization Form D does not display properly. I don't understand what's wrong with using CJK as an umbrella term, which is exactly what it is. I don't think it refers specifically just to Chinese characters, or Han ideographs. There are terms specifically for those: Chinese characters and Han ideographs. Jim Monty
Re: Application that displays CJK text in Normalization Form D
> "JB" == Jim Breen writes:
JB> Firefox (3.6.12 - Ubuntu) placed the dakuten over the following katakana
JB> and mangled the hangul. GNOME Terminal (2.28.1) did the same.
That is a general PanGo (παν誤) issue. I don't know whether the new harfbuzz will do any better, yet.
PanGo does get the hangul right if you choose any of the Un family of fonts, but it still fails to look as good.
Interestingly, rxvt-unicode does get the katakana identically. (I have it configured to use Droid Sans Fallback as its first fallback font for CJK.) It also succeeds in making syllables of the choseong and jungseong chars, but like PanGo they are not as legible as the precomposed syllables. Even selection selects a syllable at a time.
-JimC
-- James Cloos OpenPGP: 1024D/ED7DAEA6
Re: Application that displays CJK text in Normalization Form D
Jim Monty wrote:
> Japanese kana (the "J" in "CJK") and Korean syllables (the "K" in "CJK") both have different normalization forms. What do ideographs have to do with anything? I didn't mention ideographs; you did.
The term "CJK" is often used to refer to those characters which are common to Chinese and Japanese and Korean, viz. the ideographic characters.
> This is Korean text in NFC...
> 유리를
> HANGUL SYLLABLE YU HANGUL SYLLABLE RI HANGUL SYLLABLE REUL
> ...and this is the same Korean text in NFD...
> 유리를
> HANGUL CHOSEONG IEUNG HANGUL JUNGSEONG YU HANGUL CHOSEONG RIEUL HANGUL JUNGSEONG I HANGUL CHOSEONG RIEUL HANGUL JUNGSEONG EU HANGUL JONGSEONG RIEUL
Right, I got that.
> How is this text different than anything else in Unicode with respect to normalization forms NFC and NFD? What's wrong, exactly, with my question and the way I phrased it? I simply asked a question about CJK text (which includes, by definition, Japanese kana and Korean syllables and jamo) and software that displays such CJK text when it is in Normalization Form D. For the sake of clarity, I included specific examples.
There's nothing wrong with asking what systems display hangul the same in NFC and NFD, or similarly for katakana. Lumping them together under one "CJK" umbrella didn't seem right. There's nothing about a system's ability to display one correctly that implies an ability or inability to display the other correctly.
One might as well ask if there are any systems which can properly display "Unicode text" in NFD.
-- Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
Re: Application that displays CJK text in Normalization Form D
Doug Ewell wrote: > Jim Monty wrote: >> >> Is there even a single software application that properly displays CJK text >> in Normalization Form D? >> >> NFC: ドライドマンゴス >> NFD: ドライドマンゴス >> >> NFC: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요 >> NFD: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요 > > BabelPad running under Uniscribe v1.0626.6000.16386 displays the Katakana > examples identically (using Meiryo) and the Hangul examples identically > (using Batang). > > As usual, there is more to "does it display properly?" than calling out an > individual application or operating system. This is good to know. Thank you. > Furthermore, I don't think "CJK text" is an appropriate way to lump these > two issues together. In particular, Korean syllable-block formation isn't > like anything else in Unicode. When I read the Subject line, my first > thought was, how silly, ideographs aren't subject to normalization. Japanese kana (the "J" in "CJK") and Korean syllables (the "K" in "CJK") both have different normalization forms. What do ideographs have to do with anything? I didn't mention ideographs; you did. This is Korean text in NFC... 유리를 HANGUL SYLLABLE YU HANGUL SYLLABLE RI HANGUL SYLLABLE REUL ...and this is the same Korean text in NFD... 유리를 HANGUL CHOSEONG IEUNG HANGUL JUNGSEONG YU HANGUL CHOSEONG RIEUL HANGUL JUNGSEONG I HANGUL CHOSEONG RIEUL HANGUL JUNGSEONG EU HANGUL JONGSEONG RIEUL How is this text different than anything else in Unicode with respect to normalization forms NFC and NFD? What's wrong, exactly, with my question and the way I phrased it? I simply asked a question about CJK text (which includes, by definition, Japanese kana and Korean syllables and jamo) and software that displays such CJK text when it is in Normalization Form D. For the sake of clarity, I included specific examples. Jim Monty
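The jamo listed above can be derived arithmetically, which is part of what sets Hangul apart from table-driven decompositions. The sketch below uses the constants of the Hangul syllable decomposition in the Unicode Standard; the function and variable names are illustrative, and the result is cross-checked against Python's unicodedata.normalize rather than taken on trust.

```python
import unicodedata

# Constants from the standard's Hangul syllable decomposition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT        # 11172 precomposed syllables

def decompose_syllable(ch):
    """Split one precomposed Hangul syllable into its conjoining jamo."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch                      # not a precomposed syllable
    lead = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    trail = T_BASE + s_index % T_COUNT
    jamo = chr(lead) + chr(vowel)
    if trail != T_BASE:                # T_BASE itself means "no trailing consonant"
        jamo += chr(trail)
    return jamo

text = "\uC720\uB9AC\uB97C"            # the three syllables YU, RI, REUL
decomposed = "".join(decompose_syllable(ch) for ch in text)
names = [unicodedata.name(ch) for ch in decomposed]
print(names)
```

Three syllables become seven jamo, so a renderer must reassemble syllable blocks from the jamo sequence to display the NFD text like the NFC text.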
Re: Application that displays CJK text in Normalization Form D
Jim Monty wrote: Is there even a single software application that properly displays CJK text in Normalization Form D? NFC: ドライドマンゴス NFD: ドライドマンゴス NFC: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요 NFD: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요 BabelPad running under Uniscribe v1.0626.6000.16386 displays the Katakana examples identically (using Meiryo) and the Hangul examples identically (using Batang). As usual, there is more to "does it display properly?" than calling out an individual application or operating system. Furthermore, I don't think "CJK text" is an appropriate way to lump these two issues together. In particular, Korean syllable-block formation isn't like anything else in Unicode. When I read the Subject line, my first thought was, how silly, ideographs aren't subject to normalization. -- Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
Re: Application that displays CJK text in Normalization Form D
On 14.11.2010 12:03, Michel Bottin wrote:
> I don't see any difference in Firefox 3.6.12 and Thunderbird 3.1.6 on MacOS X 10.5
> Michel Bottin
> On 14/11/10 03:59, Jim Breen wrote:
>> On Sat, 13 Nov 2010 Jim Monty wrote:
>>> Is there even a single software application that properly displays CJK text in Normalization Form D?
>>> NFC: ドライドマンゴス
>>> NFD: ドライドマンゴス
>>> NFC: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요
>>> NFD: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요
For me - using Thunderbird 3.1.5 on Windows 7 - there is also no visible difference.
Best regards,
-- Dominikus Dittes Scherkl
Re: Application that displays CJK text in Normalization Form D
I don't see any difference in Firefox 3.6.12 and Thunderbird 3.1.6 on MacOS X 10.5 Michel Bottin On 14/11/10 03:59, Jim Breen wrote: On Sat, 13 Nov 2010 Jim Monty wrote: Is there even a single software application that properly displays CJK text in Normalization Form D? NFC: ドライドマンゴス NFD: ドライドマンゴス NFC: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요 NFD: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요 Google's Chromium browser (6.0.409.0 (47612) Ubuntu) displayed both correctly. Yudit (Unicode editor - http://www.yudit.org/) also displayed both correctly. Firefox (3.6.12 - Ubuntu) placed the dakuten over the following katakana and mangled the hangul. GNOME Terminal (2.28.1) did the same. Opera (10.63 - Linux) displayed the dakuten and most of the hangul as rectangles. NFC: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요 NFD: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요 Jim -- Jim Breen Adjunct Snr Research Fellow, Clayton School of IT, Monash University Vice-president: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre Graduate student: Language Technology Group, University of Melbourne -- In girum imus nocte et consumimur igni
Re: Application that displays CJK text in Normalization Form D
On Sat, 13 Nov 2010 Jim Monty wrote: > > Is there even a single software application that properly displays CJK text in > Normalization Form D? > > NFC: ドライドマンゴス > NFD: ドライドマンゴス > > NFC: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요 > NFD: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요 Google's Chromium browser (6.0.409.0 (47612) Ubuntu) displayed both correctly. Yudit (Unicode editor - http://www.yudit.org/) also displayed both correctly. Firefox (3.6.12 - Ubuntu) placed the dakuten over the following katakana and mangled the hangul. GNOME Terminal (2.28.1) did the same. Opera (10.63 - Linux) displayed the dakuten and most of the hangul as rectangles. > NFC: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요 > NFD: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요 Jim -- Jim Breen Adjunct Snr Research Fellow, Clayton School of IT, Monash University Vice-president: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre Graduate student: Language Technology Group, University of Melbourne
Re: Application that displays CJK text in Normalization Form D
Note, however, that when you edit a reply to your message within Gmail, the NFD text quoted in the web form causes Gmail to refuse to save or send the message. If you try to save the draft or send it, Gmail says "error, the action has failed. Please retry", and you can retry any number of times; it will still fail. I think this is a severe bug in Gmail: you need to delete the NFD text or normalize it in an external application. Philippe. 2010/11/14 Jim Monty > > Is there even a single software application that properly displays CJK text > in > Normalization Form D? > > NFC: ドライドマンゴス > > NFC: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요 > > Aren't the two versions of the same Unicode text supposed to be rendered > the > same? They're not, at least not in any of the applications in which I've > viewed > them: Microsoft Internet Explorer, Microsoft Notepad, Vim, BabelPad and SC > Unipad. > > Jim Monty
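The "normalize it in an external application" workaround can be as small as a few lines of script. The following is a minimal sketch in Python, assuming the text is already available as a string. The function name `to_nfc` is hypothetical, and nothing here is specific to Gmail or claims to reproduce its behaviour.

```python
import unicodedata


def to_nfc(text):
    """Return text converted to Normalization Form C."""
    return unicodedata.normalize("NFC", text)


# A decomposed sample written with escapes: katakana TO + combining dakuten,
# followed by the three conjoining jamo of one Hangul syllable.
decomposed = "\u30C8\u3099\u1102\u1173\u11AB"
composed = to_nfc(decomposed)

# The composed string is shorter, but canonically equivalent to the input.
print(len(decomposed), "->", len(composed))
```

Because the two strings are canonically equivalent, converting to NFC before pasting should not change what a conformant renderer displays.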
Re: Application that displays CJK text in Normalization Form D
They are the same for me when viewed in Gmail (in any one of the modern browsers in their most current versions on Windows; I did not test on MacOS X or Linux). I suppose that Gmail renormalizes the texts to NFC before displaying them... I can't even detect a difference in the HTML source of the displayed message; all seems to be in NFC (could that originate from the web browser performing such normalization immediately on HTML text elements before entering them in the DOM and making them accessible from JavaScript?) I stopped using local mail clients (like Outlook, Outlook Express, Windows Mail, and others) long ago, because webmail is definitely more practical for me, from any PC or smartphone, and offers comfortable storage space for many years of emails, as long as you clean up the undetected spam (most spam falls into a specific box whose cleanup is automated), so I can't confirm whether those clients normalize the texts. This may not be the case, however, for attachments (if their MIME type is not "text/*", or if they are digitally signed). Plain text editors are not supposed to perform such normalizations, so all will depend on how they manage their own internal data storage. But yes, these editors should display them exactly the same (if not, this is an issue of how they use their text renderers), even if they are left in their initial normalization form (or in unnormalized forms). Philippe.
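Whether some intermediate layer has renormalized a message can be checked by testing the received text directly rather than by eye. The following is a minimal sketch, assuming Python 3.8 or later (where `unicodedata.is_normalized` is available). The function name `forms_of` is hypothetical, and how you obtain the text (page source, saved message, clipboard) is left open.

```python
import unicodedata


def forms_of(text):
    """Return which of NFC and NFD the given text already conforms to."""
    return [form for form in ("NFC", "NFD")
            if unicodedata.is_normalized(form, text)]


# Precomposed katakana DO is in NFC only.
print(forms_of("\u30C9"))
# TO followed by the combining dakuten is in NFD only.
print(forms_of("\u30C8\u3099"))
# Plain ASCII is unaffected by normalization, so it is in both forms.
print(forms_of("abc"))
```

If a line that was sent as NFD reports only NFC on arrival, something along the way normalized it.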
Re: Application that displays CJK text in Normalization Form D
All Cocoa/Cocoa Touch apps display them correctly. Aki Inoue On 2010/11/13, at 17:07, Bill Poser wrote: > > > On Sat, Nov 13, 2010 at 4:46 PM, Jim Monty wrote: > Is there even a single software application that properly displays CJK text in > Normalization Form D? > > > I just tried your examples in Yudit (http://www.yudit.org) and they seem to > work: the NFD text looks the same as the NFC text. >
Re: Application that displays CJK text in Normalization Form D
On Sat, Nov 13, 2010 at 4:46 PM, Jim Monty wrote: > Is there even a single software application that properly displays CJK text > in > Normalization Form D? > > I just tried your examples in Yudit (http://www.yudit.org) and they seem to work: the NFD text looks the same as the NFC text.