Re: Application that displays CJK text in Normalization Form D

Asmus Freytag Mon, 15 Nov 2010 15:13:28 -0800

On 11/15/2010 2:24 PM, Kenneth Whistler wrote:

FA47 is a "compatibility character", and would have a compatibility mapping.

Faulty syllogism.

Formally correct answer but only because of something of a design flaw in Unicode. When the type of mapping was decided on, people didn't fully expect that NFC might become widely used/enforced, making these distinctions disappear wherever text is normalized in a distributed architecture.

FA47 is a CJK Compatibility character, which means it was encoded
for compatibility purposes -- in this case to cover the round-trip
mapping needed for JIS X 0213.

However, it has a *canonical* decomposition mapping to U+6F22.

And that, of course, destroys the desired "round-trip" behavior if it is inadvertently applied while the data are encoded in Unicode. Hence the need to recreate a solution to the issue of variant forms with a different mechanism, the ideographic variation sequence (and corresponding database).
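
A rough sketch of that difference, using Python's standard unicodedata module. The helper name survives_normalization is made up for illustration, and the variation sequence shown (U+6F22 followed by VARIATION SELECTOR-17) is only a hypothetical example; whether that particular pair is registered in the variation database is not checked here.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def survives_normalization(text, form):
    """Return True if normalizing `text` to `form` leaves it unchanged."""
    return unicodedata.normalize(form, text) == text

# The compatibility ideograph on its own.
compat = "\uFA47"

# Hypothetical variation sequence: unified ideograph + VARIATION SELECTOR-17.
# Used purely to show that a selector is not stripped by normalization.
sequence = "\u6F22\U000E0100"

for form in FORMS:
    print(form,
          "compat kept:", survives_normalization(compat, form),
          "sequence kept:", survives_normalization(sequence, form))
```

The compatibility character is replaced under every form, so the information needed for the round trip is lost, while the base-plus-selector sequence passes through unchanged.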

The behavior in BabelPad is correct: U+6F22 is the NFC form of U+FA47.

Easily verified, for example, by checking the FA47 entry in
NormalizationTest.txt in the UCD.
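
The same check can also be sketched without opening the data file, assuming a Python interpreter whose bundled unicodedata module covers U+FA47. This reads the interpreter's copy of the character database, not NormalizationTest.txt itself, and the variable names are illustrative only.

```python
import unicodedata

compat = "\uFA47"   # CJK COMPATIBILITY IDEOGRAPH-FA47
unified = "\u6F22"  # CJK UNIFIED IDEOGRAPH-6F22

# A canonical mapping is reported without a <tag> prefix;
# a compatibility mapping would start with something like "<compat>".
decomp = unicodedata.decomposition(compat)
is_canonical = bool(decomp) and not decomp.startswith("<")

# Apply each normalization form to the single character.
forms = {form: unicodedata.normalize(form, compat)
         for form in ("NFC", "NFD", "NFKC", "NFKD")}

print("decomposition:", decomp, "canonical:", is_canonical)
for form, result in forms.items():
    print(form, "->", "U+%04X" % ord(result))
```

Because the mapping is canonical, all four forms give the same result, not just the compatibility forms NFKC and NFKD.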

While correct, it's something that remains a bit of a gotcha. Especially now that Unicode has charts that go to great lengths showing the different glyphs for these characters, I would suggest adding a note to the charts that makes clear that these distinctions are *removed* any time the text is normalized, which, in a distributed architecture, may happen at any time.

A./

--Ken

When I type ... (U+FA47) into BabelPad, highlight it, and then
click the button labeled "Normalize to NFC", the character
becomes ... (U+6F22). Does BabelPad not conform to the Unicode Standard
in this case? ...
