RE: Application that displays CJK text in Normalization Form D

2010-11-15 Thread Peter Constable
On Windows, strings will display correctly in either NFC or NFD provided an 
appropriate font is used--that choice being different for Japanese and for 
Korean. Windows 7 and earlier do not ship with fonts that support Old Hangul, 
but Old Hangul fonts are available from other sources; e.g. there's an MS 
Office add-on sold in Korea that includes Old Hangul fonts.

One limitation wrt Japanese marks: when drawing in GDI in vertical orientation, 
marks may not position correctly if there is no precomposed character for the 
combination. That's not an issue for the strings you provided here, however.


Peter

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Jim Monty
Sent: Saturday, November 13, 2010 4:47 PM
To: unicode@unicode.org
Subject: Application that displays CJK text in Normalization Form D

Is there even a single software application that properly displays CJK text in 
Normalization Form D?

NFC: ドライドマンゴス
NFD: ドライドマンゴス

NFC: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요
NFD: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요

Aren't the two versions of the same Unicode text supposed to be rendered the 
same? They're not, at least not in any of the applications in which I've viewed
them: Microsoft Internet Explorer, Microsoft Notepad, Vim, BabelPad and SC 
Unipad.

Jim Monty








Re: Application that displays CJK text in Normalization Form D

2010-11-15 Thread Jim Monty
Doug Ewell wrote:
 And no, I did not intend to make this big a deal out of it, and I
 apologize for doing so.

Nor did I.

I'm a genuine student of Unicode, here to learn. It seems many of the regular 
contributors to the Unicode and Unicore mailing lists are the Unicode experts 
themselves, many of whom are developers of the Unicode Standard. As such, these 
mailing lists are fantastic! There are very few technology mailing lists 
like them anymore. How cool is it to post an inquiry to the Unicode mailing 
list and have Unicode luminaries like Mark Davis, Asmus Freytag, Markus 
Scherer, Martin Dürst and Doug Ewell ALL reply? (The answer: Pretty darn cool!)

When I asked for clarification about my use of the term CJK text instead of 
kana and Hangul text, I was earnest. If there was something wrong with my 
understanding of the standard terminology, I genuinely wanted to know what it 
was. You're the experts, I'm the initiate.

 The answer to Jim's question, then, is that for those examples
 of CJK text which are encoded differently in NFC and NFD (a group
 that excludes ideographs, thus immediately putting that side issue
 to rest), there are indeed some combinations of OS + app + rendering
 engine + font that can display those examples properly.

And this was the valuable lesson I learned. Until this exchange on the Unicode 
mailing list, I'd had a biased and wrong impression of the state of the art with 
respect to Unicode normalization and modern software based on my own personal 
experience. I'm glad I asked the question, and I'm grateful for all the 
excellent and thorough answers.

When I type the ideograph 漢 (U+FA47) into BabelPad, highlight it, and then 
click 
the button labeled Normalize to NFC, the character becomes 漢 (U+6F22). Does 
BabelPad not conform to the Unicode Standard in this case? Is this not truly 
Unicode normalization?

Jim Monty




RE: Application that displays CJK text in Normalization Form D

2010-11-15 Thread Doug Ewell
Another point:

 Aren't the two versions of the same Unicode text supposed to be
 rendered the same? They're not, at least not in any of the
 applications in which I've viewed them: Microsoft Internet Explorer,
 Microsoft Notepad, Vim, BabelPad and SC Unipad.

SC UniPad uses its own built-in font and rendering engine, and does not
claim to do much smart rendering beyond Arabic contextual forms and
bidirectionality.  It does have options to Combine Characters and
Combine Hangul Jamo, which will convert the Japanese and Korean
examples (respectively) from NFD to NFC, but I realize that's not the
question you are asking.

--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s ­






RE: Application that displays katakana and Hangul text in Normalization Form D [Was Re: Application that displays CJK text in Normalization Form D] :-)

2010-11-15 Thread Peter Constable
Jim, behaviour will depend on fonts being used. It could also depend on the 
version of software you are using. Windows 7 has pretty good support (fonts and 
Uniscribe) for all of this.


Peter


-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Jim Monty
Sent: Sunday, November 14, 2010 3:35 PM
To: unicode@unicode.org
Subject: Application that displays katakana and Hangul text in Normalization 
Form D [Was Re: Application that displays CJK text in Normalization Form D] :-)

Andrew Cunningham wrote:
 Jim Monty wrote:
  In my original post, I used CJK text in opposition to non-CJK text 
  because non-CJK text (in particular, Latin text) in Normalization 
  Form D displays properly in the same software I described where CJK 
  text (in particular, katakana and Hangul) in Normalization Form D 
  does not display properly.

 Actually the Latin text can suffer from the same problems, Latin text 
 in NFD has similar dependencies as Korean text in NFD, and sometimes 
 with worse results.

Yes, I realize this, too. I was referring to the specific case of East 
Asian-script characters in NFD, not the general case of characters in any 
script in NFD.

In Notepad, I see an o with a macron on top of it for the Unicode characters 
U+006F U+0304. On the next line of the same text file, there are the two 
Unicode characters U+30C8 U+3099, but I do not see a katakana letter do. 
Instead, I see a katakana letter to and, to the right of it, a 
katakana-hiragana voiced sound mark. I observe essentially the same thing in 
other applications, including BabelPad and SC UniPad. So this is the specific 
circumstance that led me to ask the Unicode community about a specific case: 
Asian-script characters in Unicode Normalization Form D.
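
For readers who want to check the two sequences discussed here (o plus combining macron, and katakana TO plus the combining voiced sound mark U+3099), the following is a minimal sketch using Python's standard unicodedata module. It is an illustration only and not part of the original message. It confirms that both sequences are canonically equivalent to a single precomposed character, so any difference on screen comes from the font and rendering engine, not from the text itself.

```python
import unicodedata

# Decomposed (NFD) sequences discussed in the message.
latin_nfd = "\u006F\u0304"   # LATIN SMALL LETTER O + COMBINING MACRON
kana_nfd = "\u30C8\u3099"    # KATAKANA LETTER TO + COMBINING VOICED SOUND MARK

# Compose each sequence into Normalization Form C.
latin_nfc = unicodedata.normalize("NFC", latin_nfd)
kana_nfc = unicodedata.normalize("NFC", kana_nfd)

for label, text in [("Latin NFD", latin_nfd), ("Latin NFC", latin_nfc),
                    ("kana NFD", kana_nfd), ("kana NFC", kana_nfc)]:
    # Show every code point with its character name.
    print(label, [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in text])
```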

The answer for my specific case (thanks to Doug Ewell) is that the version of 
Uniscribe installed on my computer is not properly rendering katakana and 
Hangul characters in Normalization Form D. It seems I need a better Uniscribe.

The other valuable thing I learned is that there are plenty of systems (complex 
systems of computer and similar digital device hardware, video display devices, 
computer operating systems, software applications, font-rendering and 
text-layout service applications, fonts, etc.) that support Unicode in 
Normalization Form D better than the systems I'm using at the moment. I didn't 
know this.

Thank you for the additional information about Latin-script NFD.

Jim Monty







RE: Application that displays CJK text in Normalization Form D

2010-11-15 Thread Shawn Steele
FA47 is a compatibility character, and would have a compatibility mapping.

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Jim Monty
Sent: Monday, November 15, 2010 1:02 PM
To: unicode@unicode.org
Subject: Re: Application that displays CJK text in Normalization Form D

 When I type the ideograph 漢 (U+FA47) into BabelPad, highlight it, and then 
 click the button labeled Normalize to NFC, the character becomes 漢 
 (U+6F22). Does BabelPad not conform to the Unicode Standard in this case? Is 
 this not truly Unicode normalization?

Jim Monty







RE: Application that displays CJK text in Normalization Form D

2010-11-15 Thread Kenneth Whistler

 FA47 is a compatibility character, and would have a compatibility mapping.

Faulty syllogism.

FA47 is a CJK Compatibility character, which means it was encoded
for compatibility purposes -- in this case to cover the round-trip
mapping needed for JIS X 0213.

However, it has a *canonical* decomposition mapping to U+6F22.

The behavior in BabelPad is correct: U+6F22 is the NFC form of U+FA47.

Easily verified, for example, by checking the FA47 entry in
NormalizationTest.txt in the UCD.

--Ken
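
The distinction drawn above (a "compatibility character" that nevertheless has a canonical mapping) can be inspected in the Decomposition_Mapping field of the Unicode Character Database. The sketch below uses Python's unicodedata.decomposition, which returns that field as a string; a leading tag in angle brackets marks a compatibility mapping, and no tag means the mapping is canonical. The helper name mapping_kind is hypothetical, and U+FB01 is included only as a contrasting example that is not mentioned in the thread.

```python
import unicodedata

def mapping_kind(ch):
    """Classify ch's decomposition mapping as 'none', 'canonical' or 'compatibility'."""
    field = unicodedata.decomposition(ch)
    if not field:
        return "none"
    # Compatibility mappings start with a tag such as <compat> or <font>.
    return "compatibility" if field.startswith("<") else "canonical"

# U+FA47: a CJK Compatibility Ideograph with a canonical singleton mapping.
print(repr(unicodedata.decomposition("\uFA47")), mapping_kind("\uFA47"))
# U+FB01 LATIN SMALL LIGATURE FI: a genuine compatibility mapping, for contrast.
print(repr(unicodedata.decomposition("\uFB01")), mapping_kind("\uFB01"))
```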

  When I type ... (U+FA47) into BabelPad, highlight it, and then 
  click the button labeled Normalize to NFC, the character 
  becomes ... (U+6F22). Does BabelPad not conform to the Unicode Standard 
  in this case? ...




RE: Application that displays CJK text in Normalization Form D

2010-11-15 Thread Doug Ewell
Jim Monty jim dot monty at yahoo dot com wrote:

 How cool is it to post an inquiry to the Unicode mailing list and have
 Unicode luminaries like Mark Davis, Asmus Freytag, Markus Scherer,
 Martin Dürst and Doug Ewell ALL reply?

Don't count me among the luminaries.  I'm just a student too, studying
Unicode for 19 years now, and to prove that I'm still learning...

 When I type the ideograph 漢 (U+FA47) into BabelPad, highlight it, and
 then click the button labeled Normalize to NFC, the character
 becomes 漢 (U+6F22). Does BabelPad not conform to the Unicode Standard
 in this case? Is this not truly Unicode normalization?

Crap.  Yes, Ken and BabelPad are right.  Some ideographs do have
singleton mappings and can thus be different between NFD and NFC.  It
isn't quite the same as combining U+30C8 and U+3099 to make U+30C9, or
combining jamos into precomposed syllables, but it's enough to disprove
my earlier statement.

How about this:

For *any* text example which can be encoded differently in NFC and NFD,
there are some combinations of OS + app + rendering engine + font that
can display that example properly in both forms, and some that cannot.

--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s ­






Re: Application that displays CJK text in Normalization Form D

2010-11-15 Thread Asmus Freytag

On 11/15/2010 2:24 PM, Kenneth Whistler wrote:

FA47 is a compatibility character, and would have a compatibility mapping.

Faulty syllogism.


Formally correct answer but only because of something of a design flaw 
in Unicode. When the type of mapping was decided on, people didn't fully 
expect that NFC might become widely used/enforced, making these 
distinctions appear wherever text is normalized in a distributed 
architecture.

FA47 is a CJK Compatibility character, which means it was encoded
for compatibility purposes -- in this case to cover the round-trip
mapping needed for JIS X 0213.

However, it has a *canonical* decomposition mapping to U+6F22.


And that, of course, destroys the desired round-trip behavior if it is 
inadvertently applied while the data are encoded in Unicode. Hence the 
need to recreate a solution to the issue of variant forms with a 
different mechanism, the ideographic variation sequence (and 
corresponding database).
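
One way to guard against the inadvertent loss described here is to look for code points that NFC will not preserve before normalizing. The following is a minimal sketch in Python, assuming the standard unicodedata module; the function name not_preserved_by_nfc is hypothetical. It tests each code point in isolation, so it reports singletons such as U+FA47 but deliberately ignores ordinary combining sequences that NFC would merely compose.

```python
import unicodedata

def not_preserved_by_nfc(text):
    """List (index, original, replacement) for code points that NFC replaces."""
    found = []
    for index, ch in enumerate(text):
        replacement = unicodedata.normalize("NFC", ch)
        if replacement != ch:
            found.append((index, ch, replacement))
    return found

# "A", U+FA47, "B": only the compatibility ideograph is reported.
for index, ch, replacement in not_preserved_by_nfc("A\uFA47B"):
    print(index, f"U+{ord(ch):04X}", "->", [f"U+{ord(c):04X}" for c in replacement])
```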




The behavior in BabelPad is correct: U+6F22 is the NFC form of U+FA47.

Easily verified, for example, by checking the FA47 entry in
NormalizationTest.txt in the UCD.


While correct, it's something that remains a bit of a gotcha. Especially 
now that Unicode has charts that go to great length showing the 
different glyphs for these characters, I would suggest adding a note to 
the charts that make clear that these distinctions are *removed* anytime 
the text is normalized, which, in a distributed architecture may happen 
anytime.


A./

--Ken


When I type ... (U+FA47) into BabelPad, highlight it, and then
click the button labeled Normalize to NFC, the character
becomes ... (U+6F22). Does BabelPad not conform to the Unicode Standard
in this case? ...








Re: Application that displays CJK text in Normalization Form D

2010-11-15 Thread Kent Karlsson

On 2010-11-15 23:53, Doug Ewell d...@ewellic.org wrote:

 When I type the ideograph 漢 (U+FA47) into BabelPad, highlight it, and
 then click the button labeled Normalize to NFC, the character
 becomes 漢 (U+6F22). Does BabelPad not conform to the Unicode Standard
 in this case? Is this not truly Unicode normalization?
 
 Crap.  Yes, Ken and BabelPad are right.  Some ideographs do have
 singleton mappings and can thus be different between NFD and NFC.

No, both NFD and NFC will map U+FA47 to U+6F22; singleton canonical
mappings are not reversed in the composition phase of transforming to NFC.

/Kent K
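
The difference between a singleton mapping and an ordinary composable pair can be sketched as follows, using Python's unicodedata module (an illustration, not part of the original message; the helper name forms is hypothetical). U+FA47 collapses to U+6F22 in both forms, whereas U+30C9 decomposes in NFD and is recomposed in NFC.

```python
import unicodedata

def forms(text):
    """Return the NFC and NFD forms of text in a dict."""
    return {form: unicodedata.normalize(form, text) for form in ("NFC", "NFD")}

singleton = forms("\uFA47")   # CJK Compatibility Ideograph
kana = forms("\u30C9")        # KATAKANA LETTER DO

for label, result in [("U+FA47", singleton), ("U+30C9", kana)]:
    for form, value in result.items():
        print(label, form, [f"U+{ord(c):04X}" for c in value])
```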






CJK Compatibility Gotchas (was: Re: Application that displays CJK text in Normalization Form D

2010-11-15 Thread Kenneth Whistler
Asmus replied:

 On 11/15/2010 2:24 PM, Kenneth Whistler wrote:
  FA47 is a compatibility character, and would have a 
  compatibility mapping.
  Faulty syllogism.
 
 Formally correct answer but only because of something of a design flaw 
 in Unicode. When the type of mapping was decided on, people didn't fully 
 expect that NFC might become widely used/enforced, making these 
 distinctions appear wherever text is normalized in a distributed 
 architecture.

O.k., I'm gonna have to intervene again. *hehe* Yes, there is
a design flaw here, but Asmus' explanation is also somewhat
faulty, because it flattens out the history in a way that is
liable to be misunderstood.

There is a *reason* why when the type of mapping was decided on
that people didn't fully expect that NFC might become
widely used/enforced -- but it wasn't that they were goofing
up in understanding the implications of normalization. Rather,
at that point in Unicode history NFC didn't *exist* yet, nor
had the normalization algorithm been designed.

Here, for the benefit of the standards geeks out there, are the
relevant highlights of the historical timeline involved.

June, 1992.

  The canonical mappings for the CJK Compatibility characters
  were *printed* (with off-by-one errors for some of them!) in
  Unicode 1.0, volume 2 (= Unicode 1.0.1).
  
  Actually, at the time, we didn't know they were canonical
  mappings, because that concept hadn't formally been invented
  yet, but the intention was clear. They were the mappings
  from the CJK compatibility ideographs to the real unified
  Han ideographs in the standard. The CJK compatibility characters
  were all considered to be duplicates in the source standards
  that didn't follow the unification rules.
  
July, 1996.

  The formal definitions of canonical decomposition and
  compatibility decomposition were first published in
  Unicode 2.0. There wasn't a data file for the CJK Compatibility
  Ideographs block, but the canonical mappings were *printed*
  (correctly, this time) on pp. 7-470 to 7-472 of the standard.
  
August 4, 1998.

  The first published version of UnicodeData.txt that contained
  the canonical mappings for the CJK Compatibility Ideographs
  was UnicodeData-2.1.5.txt for Unicode 2.1.5. (Actually,
  they got into UnicodeData-2.1.4.txt on July 9, 1998, but that
  wasn't a published version of the data file.)
  
July 23, 1999.

  This was the publication date of the first approved version
  of UAX #15 (Revision 15), and so is the first published definition
  of NFC. (Of course UAX #15 had been in draft for some time earlier
  than that, so the term NFC can be tracked back in the drafts
  to mid-1998.)
  
September, 1999.

  Release of Unicode 3.0 -- the first release of Unicode formally
  tied to the Unicode Normalization Algorithm. (The revision
  of UAX #15 for the release was actually Revision 18, dated
  November 11, 1999.)
  
March 23, 2001.

  UAX #15, Version 3.1.0. This was the version of the Unicode
  Normalization Algorithm that specified the composition version
  to be Version 3.1.0 and locked down normalization
  forever more.
  
So essentially, there was a 9 year period between when the
first mappings were defined for the CJK Compatibility Ideographs
and the date beyond which it became impossible to reinterpret
or change a canonical mapping because of the lockdown of
normalization.

The problems resulting from the normalization for CJK Compatibility
Ideographs only started to become visible to people *after*
the lockdown, and when Unicode normalization started to become
a regular feature of actual processing.

And it wasn't because people didn't fully expect that NFC might 
become widely used/enforced -- or at least not the people in
the UTC. The UAX #15 text published with Unicode 3.0 already
stated: The W3C Character Model for the World Wide Web requires
the use of Normalization Form C for XML and related standards...

And it wasn't because of some oversight about the canonical
mappings involving the CJK Compatibility Ideographs per se.
That same UAX #15 for Unicode 3.0 also stated: With *all*
normalization forms singleton characters (those with singleton
canonical mappings) are replaced. So the ground facts for
the FA10 -- (NFC/NFD/NFKC/NFKD) 585C normalization pattern
were well-established and explicitly stated in 1999.

  FA47 is a CJK Compatibility character, which means it was encoded
  for compatibility purposes -- in this case to cover the round-trip
  mapping needed for JIS X 0213.
 
  However, it has a *canonical* decomposition mapping to U+6F22.
 
 And that, of course, destroys the desired round-trip behavior if it is 
 inadvertently applied while the data are encoded in Unicode. Hence the 
 need to recreate a solution to the issue of variant forms with a 
 different mechanism, the ideographic variation sequence (and 
 corresponding database).

Yes, that is basically correct. But, this architectural design flaw
actually results from two additional 

Re: CJK Compatibility Gotchas (was: Re: Application that displays CJK text in Normalization Form D

2010-11-15 Thread Asmus Freytag

On 11/15/2010 5:43 PM, Kenneth Whistler wrote:

Perhaps someone would like to make a detailed proposal to
the UTC for how to fix the text and charts?;-)


Ken,

having shown yourself the master of detail in your reply, I think you've 
appointed yourself.


A round of applause for Ken!

See how easy that was? :)

Cheers,

A./

PS: I had something pithy in mind that would work for the charts - I'll 
send that off to the guy who maintains the nameslist.


Re: Application that displays CJK text in Normalization Form D

2010-11-15 Thread Doug Ewell

Kent Karlsson kent dot karlsson14 at telia dot com wrote:

Crap.  Yes, Ken and BabelPad are right.  Some ideographs do have 
singleton mappings and can thus be different between NFD and NFC.


No, both NFD and NFC will map U+FA47 to U+6F22; singleton canonical 
mappings are not reversed in the composition phase of transforming 
to NFC.


Some ideographs have singleton mappings and can thus be different when 
mapped to NFD and/or to NFC?


--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s ­





Re: Application that displays CJK text in Normalization Form D

2010-11-14 Thread Michel Bottin
I don't see any difference in Firefox 3.6.12 and Thunderbird 3.1.6 on 
MacOS X 10.5


Michel Bottin

On 14/11/10 03:59, Jim Breen wrote:

On Sat, 13 Nov 2010, Jim Monty jim.mo...@yahoo.com wrote:

Is there even a single software application that properly displays CJK text in
Normalization Form D?

NFC: ドライドマンゴス
NFD: ドライドマンゴス

NFC: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요
NFD: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요

Google's Chromium browser (6.0.409.0 (47612) Ubuntu) displayed both
correctly. Yudit (Unicode editor - http://www.yudit.org/) also displayed both
correctly.

Firefox (3.6.12 - Ubuntu) placed the dakuten over the following katakana
and mangled the hangul. GNOME Terminal (2.28.1) did the same.

Opera (10.63 - Linux) displayed the dakuten and most of the hangul as
rectangles.


NFC: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요
NFD: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요

Jim

--
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Vice-president: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne

--
In girum imus nocte et consumimur igni



Re: Application that displays CJK text in Normalization Form D

2010-11-14 Thread Dominikus Dittes Scherkl
On 14.11.2010 12:03, Michel Bottin wrote:
 I don't see any difference in Firefox 3.6.12 and Thunderbird 3.1.6 on
 MacOS X 10.5
 
 Michel Bottin
 
 On 14/11/10 03:59, Jim Breen wrote:
 On Sat, 13 Nov 2010  Jim Monty jim.mo...@yahoo.com wrote:
 Is there even a single software application that properly displays CJK text
 in Normalization Form D?

 NFC: ドライドマンゴス
 NFD: ドライドマンゴス

 NFC: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요
 NFD: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요
For me - using Thunderbird 3.1.5 on Windows 7 - there is also no visible
difference.

Best regards,

- -- 

Dominikus Dittes Scherkl



Re: Application that displays CJK text in Normalization Form D

2010-11-14 Thread Doug Ewell

Jim Monty jim dot monty at yahoo dot com wrote:

Is there even a single software application that properly displays CJK 
text in Normalization Form D?

NFC: ドライドマンゴス
NFD: ドライドマンゴス

NFC: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요
NFD: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요


BabelPad running under Uniscribe v1.0626.6000.16386 displays the 
Katakana examples identically (using Meiryo) and the Hangul examples 
identically (using Batang).


As usual, there is more to "does it display properly?" than calling out 
an individual application or operating system.


Furthermore, I don't think "CJK text" is an appropriate way to lump 
these two issues together.  In particular, Korean syllable-block 
formation isn't like anything else in Unicode.  When I read the Subject 
line, my first thought was, "how silly, ideographs aren't subject to 
normalization."
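
What sets Hangul apart is that precomposed syllables are not listed one by one in the decomposition data; they are derived arithmetically from the conjoining jamo. The following is a minimal sketch of that arithmetic in Python, using the constants given in the Unicode Standard's section on conjoining jamo behavior. The function name compose_syllable is hypothetical, and the sketch does no range checking on its arguments.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28

def compose_syllable(lead, vowel, trail=None):
    """Compose a leading consonant, vowel and optional trailing consonant."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    # T_BASE is one below the first trailing consonant, so index 0 means "none".
    t_index = ord(trail) - T_BASE if trail else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

# CHOSEONG RIEUL + JUNGSEONG EU + JONGSEONG RIEUL
syllable = compose_syllable("\u1105", "\u1173", "\u11AF")
print(f"U+{ord(syllable):04X}", unicodedata.name(syllable))
```

The result agrees with what unicodedata.normalize("NFC", ...) produces for the same three jamo.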


--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s ­




Re: Application that displays CJK text in Normalization Form D

2010-11-14 Thread Jim Monty
Doug Ewell wrote:
 Jim Monty wrote:

 Is there even a single software application that properly displays CJK text
 in Normalization Form D?
 
 NFC: ドライドマンゴス
 NFD: ドライドマンゴス
 
 NFC: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요
 NFD: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요

 BabelPad running under Uniscribe v1.0626.6000.16386 displays the Katakana
 examples identically (using Meiryo) and the Hangul examples identically
 (using Batang).

 As usual, there is more to "does it display properly?" than calling out an
 individual application or operating system.

This is good to know. Thank you.

 Furthermore, I don't think "CJK text" is an appropriate way to lump these
 two issues together. In particular, Korean syllable-block formation isn't
 like anything else in Unicode. When I read the Subject line, my first
 thought was, "how silly, ideographs aren't subject to normalization."

Japanese kana (the J in CJK) and Korean syllables (the K in CJK) both 
have different normalization forms. What do ideographs have to do with 
anything? I didn't mention ideographs; you did.

This is Korean text in NFC...

    유리를
    HANGUL SYLLABLE YU
    HANGUL SYLLABLE RI
    HANGUL SYLLABLE REUL

...and this is the same Korean text in NFD...

    유리를
    HANGUL CHOSEONG IEUNG
    HANGUL JUNGSEONG YU
    HANGUL CHOSEONG RIEUL
    HANGUL JUNGSEONG I
    HANGUL CHOSEONG RIEUL
    HANGUL JUNGSEONG EU
    HANGUL JONGSEONG RIEUL

How is this text different than anything else in Unicode with respect to 
normalization forms NFC and NFD? What's wrong, exactly, with my question and 
the way I phrased it? I simply asked a question about CJK text (which 
includes, by definition, Japanese kana and Korean syllables and jamo) and 
software that displays such CJK text when it is in Normalization Form D. For 
the sake of clarity, I included specific examples.

Jim Monty
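
The two character-name listings above can be reproduced with a short script. This is a minimal sketch using Python's unicodedata module, written with escape sequences so that it does not depend on how this archive encodes the Korean text; the helper name names is hypothetical.

```python
import unicodedata

# The three syllables from the example, given as precomposed code points.
text_nfc = "\uC720\uB9AC\uB97C"
text_nfd = unicodedata.normalize("NFD", text_nfc)

def names(text):
    """Return the Unicode character name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]

print("NFC:", len(text_nfc), "code points")
for name in names(text_nfc):
    print("   ", name)

print("NFD:", len(text_nfd), "code points")
for name in names(text_nfd):
    print("   ", name)
```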





Re: Application that displays CJK text in Normalization Form D

2010-11-14 Thread Doug Ewell

Jim Monty jim dot monty at yahoo dot com wrote:

Japanese kana (the J in CJK) and Korean syllables (the K in 
CJK) both have different normalization forms. What do ideographs 
have to do with anything? I didn't mention ideographs; you did.


The term CJK is often used to refer to those characters which are 
common to Chinese and Japanese and Korean, viz. the ideographic 
characters.



This is Korean text in NFC...

유리를
HANGUL SYLLABLE YU
HANGUL SYLLABLE RI
HANGUL SYLLABLE REUL

...and this is the same Korean text in NFD...

유리를
HANGUL CHOSEONG IEUNG
HANGUL JUNGSEONG YU
HANGUL CHOSEONG RIEUL
HANGUL JUNGSEONG I
HANGUL CHOSEONG RIEUL
HANGUL JUNGSEONG EU
HANGUL JONGSEONG RIEUL


Right, I got that.

How is this text different than anything else in Unicode with respect 
to normalization forms NFC and NFD? What's wrong, exactly, with my 
question and the way I phrased it? I simply asked a question about CJK 
text (which includes, by definition, Japanese kana and Korean 
syllables and jamo) and software that displays such CJK text when it 
is in Normalization Form D. For the sake of clarity, I included 
specific examples.


There's nothing wrong with asking what systems display hangul the same 
in NFC and NFD, or similarly for katakana.  Lumping them together under 
one CJK umbrella didn't seem right.  There's nothing about a system's 
ability to display one correctly that implies an ability or inability to 
display the other correctly.  One might as well ask if there are any 
systems which can properly display Unicode text in NFD.


--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s ­




Re: Application that displays CJK text in Normalization Form D

2010-11-14 Thread James Cloos
 JB == Jim Breen jimbr...@gmail.com writes:

JB Firefox (3.6.12 - Ubuntu) placed the dakuten over the following katakana
JB and mangled the hangul. GNOME Terminal (2.28.1) did the same.

That is a general PanGo (παν誤) issue.  I don't know whether the new harfbuzz
will do any better, yet.

PanGo does get the hangul right if you choose any of the Un family of fonts,
but it still fails to look as good.

Interestingly, rxvt-unicode does get the katakana identically.  (I have it
configured to use Droid Sans Fallback as its first fallback font for CJK.)
It also succeeds in making syllables of the choseong and jungseong chars,
but like PanGo they are not as legible as the precomposed syllables.
Even selection selects a syllable at a time.

-JimC
-- 
James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6



Re: Application that displays CJK text in Normalization Form D

2010-11-14 Thread Jim Monty
Doug Ewell wrote:
 One might as well ask if there are any systems which can properly display
 Unicode text in NFD.

That seems like a perfectly reasonable question to ask. Its answer might be 
complex, but it's nonetheless a valid question. In fact, to me, it reads like a 
Unicode FAQ.

I get the subtle distinction you're making; I just don't understand why you're 
making it in this context. In my original post, I used CJK text in opposition 
to non-CJK text because non-CJK text (in particular, Latin text) in 
Normalization Form D displays properly in the same software I described where 
CJK text (in particular, katakana and Hangul) in Normalization Form D does not 
display properly.

I don't understand what's wrong with using CJK as an umbrella term, which is 
exactly what it is. I don't think it refers specifically just to Chinese 
characters, or Han ideographs. There are terms specifically for those: Chinese 
characters and Han ideographs.

Jim Monty





Re: Application that displays CJK text in Normalization Form D

2010-11-14 Thread Asmus Freytag

On 11/14/2010 12:57 PM, Doug Ewell wrote:

Jim Monty jim dot monty at yahoo dot com wrote:

Japanese kana (the J in CJK) and Korean syllables (the K in 
CJK) both have different normalization forms. What do ideographs 
have to do with anything? I didn't mention ideographs; you did.


The term CJK is often used to refer to those characters which are 
common to Chinese and Japanese and Korean, viz. the ideographic 
characters.


Doug,

you might want to talk to the author of UTN#14 then, because he seems to 
be using the term CJK text in a sense that I find indistinguishable 
from the way Jim did.


Any relation of yours?

:)

A./

PS: I too think that replacing the CJK text with Katakana and Hangul 
as a more specific choice, would have been an improvement- as written it 
makes the problem sound more open-ended than it is. But you guys are 
arguing about an E-mail subject line, of all things




Application that displays katakana and Hangul text in Normalization Form D [Was Re: Application that displays CJK text in Normalization Form D] :-)

2010-11-14 Thread Jim Monty
Andrew Cunningham wrote:
 Jim Monty wrote:
  In my original post, I used CJK text in opposition
  to non-CJK text because non-CJK text (in particular, Latin text) in
  Normalization Form D displays properly in the same software I described
  where CJK text (in particular, katakana and Hangul) in Normalization
  Form D does not display properly.

 Actually the Latin text can suffer from the same problems, Latin text
 in NFD has similar dependencies as Korean text in NFD, and sometimes
 with worse results.

Yes, I realize this, too. I was referring to the specific case of East 
Asian-script characters in NFD, not the general case of characters in any 
script in NFD.

In Notepad, I see an o with a macron on top of it for the Unicode characters 
U+006F U+0304. On the next line of the same text file, there are the two 
Unicode characters U+30C8 U+3099, but I do not see a katakana letter do. 
Instead, I see a katakana letter to and, to the right of it, a 
katakana-hiragana voiced sound mark. I observe essentially the same thing in 
other applications, including BabelPad and SC UniPad. So this is the specific 
circumstance that led me to ask the Unicode community about a specific case: 
Asian-script characters in Unicode Normalization Form D.

The answer for my specific case (thanks to Doug Ewell) is that the version of 
Uniscribe installed on my computer is not properly rendering katakana and 
Hangul characters in Normalization Form D. It seems I need a better Uniscribe.

The other valuable thing I learned is that there are plenty of systems (complex 
systems of computer and similar digital device hardware, video display devices, 
computer operating systems, software applications, font-rendering and 
text-layout service applications, fonts, etc.) that support Unicode in 
Normalization Form D better than the systems I'm using at the moment. I didn't 
know this.

Thank you for the additional information about Latin-script NFD.

Jim Monty




Application that displays katakana and Hangul text in Normalization Form D [Was Re: Application that displays CJK text in Normalization Form D] :-)

2010-11-14 Thread Jim Monty
[I apologize for the repost. The original one was formatted badly.]

Andrew Cunningham wrote:
 Jim Monty wrote:
  In my original post, I used CJK text in opposition
  to non-CJK text because non-CJK text (in particular, Latin text) in
  Normalization Form D displays properly in the same software I described
  where CJK text (in particular, katakana and Hangul) in Normalization
  Form D does not display properly.

 Actually the Latin text can suffer from the same problems, Latin text
 in NFD has similar dependencies as Korean text in NFD, and sometimes
 with worse results.

Yes, I realize this, too. I was referring to the specific case of East 
Asian-script characters in NFD, not the general case of characters in any 
script in NFD.

In Notepad, I see an o with a macron on top of it for the Unicode characters
U+006F U+0304. On the next line of the same text file, there are the two Unicode
characters U+30C8 U+3099, but I do not see a katakana letter do. Instead, I see a
katakana letter to and, to the right of it, a katakana-hiragana voiced sound
mark. I observe essentially the same thing in other applications, including
BabelPad and SC UniPad. So this is the specific circumstance that led me to ask
the Unicode community about a specific case: Asian-script characters in Unicode
Normalization Form D.
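
For readers who want to check the code point sequences described above, here is
a minimal Python sketch using the standard unicodedata module. The helper name
code_points and the choice of sample characters (including the Hangul syllable
na) are illustrative assumptions, not something taken from the original test
file.

```python
import unicodedata


def code_points(text):
    """Return the code points of text as space-separated 'U+XXXX' strings."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Precomposed (NFC) characters; the labels are only for display.
samples = {
    "latin o with macron": "\u014D",
    "katakana do": "\u30C9",
    "hangul na": "\uB098",
}

for label, composed in samples.items():
    decomposed = unicodedata.normalize("NFD", composed)
    print(f"{label}: NFC {code_points(composed)} -> NFD {code_points(decomposed)}")
```

This only shows what the two forms contain; whether the decomposed sequence is
drawn as a single glyph still depends on the renderer and font in use.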

The answer for my specific case (thanks to Doug Ewell) is that the version of
Uniscribe installed on my computer is not properly rendering katakana and Hangul
characters in Normalization Form D. It seems I need a better Uniscribe.

The other valuable thing I learned is that there are plenty of systems (complex 
systems of computer and similar digital device hardware, video display devices, 
computer operating systems, software applications, font-rendering and 
text-layout service applications, fonts, etc.) that support Unicode in 
Normalization Form D better than the systems I'm using at the moment. I didn't 
know this.

Thank you for the additional information about Latin-script NFD.

Jim Monty





Re: Application that displays CJK text in Normalization Form D

2010-11-14 Thread Doug Ewell

Asmus Freytag asmusf at ix dot netcom dot com wrote:

The term CJK is often used to refer to those characters which are 
common to Chinese and Japanese and Korean, viz. the ideographic 
characters.


Doug,

you might want to talk to the author of UTN#14 then, because he seems 
to be using the term CJK text in a sense that I find 
indistinguishable from the way Jim did.


Any relation of yours?


Nice catch.  In UTN #14, I wrote:

In the case of Chinese, Japanese, and Korean (“CJK”) text, where a 
typical document might contain thousands of different ideographic Han 
characters, there never was any expectation that 8 bits per character 
would suffice. The legacy double-byte character sets designed for CJK 
text used a single byte for some characters (ASCII and halfwidth 
katakana) and two for others. DBCS encodings are trickier to handle 
than fixed-length encodings—programmers must keep track of lead and 
trail bytes—but at least these character sets represented CJK text in 
no more than 16 bits, as compactly as could be expected.
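
The mixed one-byte and two-byte behaviour described in the quoted passage can be
observed with a short Python sketch. Shift_JIS is used here only as one
representative legacy double-byte encoding; the passage does not name a
specific one, and the helper encoded_length and the sample characters are
illustrative assumptions.

```python
def encoded_length(text, encoding="shift_jis"):
    """Return how many bytes text occupies in the given legacy encoding."""
    return len(text.encode(encoding))


# One character per entry; the labels are only for display.
samples = {
    "ASCII letter A": "A",
    "halfwidth katakana a": "\uFF71",
    "fullwidth katakana a": "\u30A2",
    "Han ideograph": "\u6F22",
}

for label, ch in samples.items():
    print(f"{label}: {encoded_length(ch)} byte(s) in Shift_JIS")
```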


By CJK text I definitely did mean to emphasize the unique situation of 
having to find room for thousands of ideographic characters.  I note 
that legacy character sets (primarily EBCDIC-based) have been devised to 
handle only Latin plus katakana, or only Latin plus jamos, such that 8 
bits per character did in fact suffice.


In my second sentence above, I did acknowledge that double-byte 
character sets designed for CJK text include halfwidth katakana.  For 
that matter, many of them also include Greek and Cyrillic, so I'm not 
sure if the comparison to Jim's usage is quite on the mark, but I'll 
accept it if Asmus sees it that way.


The answer to Jim's question, then, is that for those examples of CJK 
text which are encoded differently in NFC and NFD (a group that 
excludes ideographs, thus immediately putting that side issue to rest), 
there are indeed some combinations of OS + app + rendering engine + font 
that can display those examples properly.
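
As a rough check of which characters fall into that group, the following Python
sketch compares the NFC and NFD forms of a few sample characters. The function
name differs_between_forms and the samples are illustrative assumptions; the
ideograph chosen is an ordinary unified Han character.

```python
import unicodedata


def differs_between_forms(text):
    """Return True if text is encoded differently in NFC and in NFD."""
    return unicodedata.normalize("NFC", text) != unicodedata.normalize("NFD", text)


# One character per entry; the labels are only for display.
samples = {
    "Han ideograph": "\u6F22",
    "katakana to (no voicing mark)": "\u30C8",
    "katakana do": "\u30C9",
    "Hangul syllable han": "\uD55C",
}

for label, ch in samples.items():
    print(f"{label}: differs in NFC vs NFD = {differs_between_forms(ch)}")
```

Only the voiced katakana and the Hangul syllable report a difference, which is
why the question comes down to how renderers handle combining voicing marks
and conjoining jamos.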


And no, I did not intend to make this big a deal out of it, and I 
apologize for doing so.


--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s




Application that displays CJK text in Normalization Form D

2010-11-13 Thread Jim Monty
Is there even a single software application that properly displays CJK text in 
Normalization Form D?

NFC: ドライドマンゴス
NFD: ドライドマンゴス

NFC: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요
NFD: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요

Aren't the two versions of the same Unicode text supposed to be rendered the 
same? They're not, at least not in any of the applications in which I've viewed 
them: Microsoft Internet Explorer, Microsoft Notepad, Vim, BabelPad and SC 
Unipad.

Jim Monty





Re: Application that displays CJK text in Normalization Form D

2010-11-13 Thread Bill Poser
On Sat, Nov 13, 2010 at 4:46 PM, Jim Monty jim.mo...@yahoo.com wrote:

 Is there even a single software application that properly displays CJK text
 in Normalization Form D?


I just tried your examples in Yudit (http://www.yudit.org) and they seem to
work: the NFD text looks the same as the NFC text.


Re: Application that displays CJK text in Normalization Form D

2010-11-13 Thread Aki Inoue
All Cocoa/Cocoa Touch apps display them correctly. 

Aki Inoue


On 2010/11/13, at 17:07, Bill Poser billpos...@gmail.com wrote:

 
 
 On Sat, Nov 13, 2010 at 4:46 PM, Jim Monty jim.mo...@yahoo.com wrote:
 Is there even a single software application that properly displays CJK text in
 Normalization Form D?
 
 
 I just tried your examples in Yudit (http://www.yudit.org) and they seem to 
 work: the NFD text looks the same as the NFC text. 
 


Re: Application that displays CJK text in Normalization Form D

2010-11-13 Thread Philippe Verdy
They are the same for me when viewed in Gmail (in any one of the modern
browsers in their most current versions on Windows; I did not test on Mac OS
X or Linux).

I suppose that Gmail renormalizes the texts to NFC before displaying them...

I can't even detect a difference in the HTML source of the displayed
message; all of it seems to be in NFC (could that originate from the web
browser performing such normalization immediately on HTML text elements before
entering them in the DOM and making them accessible from JavaScript?)

I stopped using local mail clients (Outlook, Outlook Express, Windows Mail,
and others) long ago, because webmail is definitely more practical for me,
from any PC or smart phone, and offers comfortable storage space for many
years of emails, as long as you clean up the undetected spam (most spam falls
in a specific box whose cleanup is automated). So I can't confirm whether
those clients normalize the texts. This may not be the case, however, for
attachments (if their MIME type is not text/*, or if they are digitally
signed).

Plain text editors are not supposed to perform such normalizations, so all
will depend on how they manage their own internal data storage. But yes,
these editors should display them exactly the same (if not, this is an issue
of how they use their text renderers), even if they are left in their
initial normalization form (or in unnormalized forms).

Philippe.


Re: Application that displays CJK text in Normalization Form D

2010-11-13 Thread Philippe Verdy
Note however that when editing a reply to your message within Gmail, the
text that appears in the webform containing your text in NFD will cause
Gmail to reject storing the text or sending it.

If you try to save the temporary message or send it, Gmail says "error, the
action has failed. Please retry", and you can retry any number of times; it
will fail. I think this is a severe bug in Gmail: you need to delete the
NFD text or normalize it in an external application.
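
For the "normalize it in an external application" step, a minimal Python sketch
is shown below. It is only one possible way to do this; the function name
to_nfc and the sample string are illustrative assumptions, and nothing here is
specific to Gmail.

```python
import unicodedata


def to_nfc(text):
    """Return text converted to Normalization Form C."""
    return unicodedata.normalize("NFC", text)


# Build an NFD sample (katakana to + combining voiced sound mark, then "rai").
nfd_sample = "\u30C8\u3099\u30E9\u30A4"
nfc_sample = to_nfc(nfd_sample)

print(len(nfd_sample), "code points before,", len(nfc_sample), "after")
```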

Philippe.

2010/11/14 Jim Monty jim.mo...@yahoo.com


 Is there even a single software application that properly displays CJK text
 in Normalization Form D?

 NFC: ドライドマンゴス

 NFC: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요

 Aren't the two versions of the same Unicode text supposed to be rendered the
 same? They're not, at least not in any of the applications in which I've viewed
 them: Microsoft Internet Explorer, Microsoft Notepad, Vim, BabelPad and SC
 Unipad.

 Jim Monty









Re: Application that displays CJK text in Normalization Form D

2010-11-13 Thread Jim Breen
On Sat, 13 Nov 2010  Jim Monty jim.mo...@yahoo.com wrote:

 Is there even a single software application that properly displays CJK text in
 Normalization Form D?

 NFC: ドライドマンゴス
 NFD: ドライドマンゴス

 NFC: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요
 NFD: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요

Google's Chromium browser (6.0.409.0 (47612) Ubuntu) displayed both
correctly. Yudit (Unicode editor - http://www.yudit.org/) also displayed both
correctly.

Firefox (3.6.12 - Ubuntu) placed the dakuten over the following katakana
and mangled the hangul. GNOME Terminal (2.28.1) did the same.

Opera (10.63 - Linux) displayed the dakuten and most of the hangul as
rectangles.

Jim

--
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Vice-president: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne