On 11/15/2010 2:24 PM, Kenneth Whistler wrote:
FA47 is a "compatibility character", and would have a compatibility mapping.
Faulty syllogism.

A formally correct answer, but only because of something of a design flaw in Unicode. When the type of mapping was decided on, people didn't fully expect that NFC might become widely used or enforced, making these distinctions disappear wherever text is normalized in a distributed architecture.
FA47 is a CJK Compatibility character, which means it was encoded
for compatibility purposes -- in this case to cover the round-trip
mapping needed for JIS X 0213.

However, it has a *canonical* decomposition mapping to U+6F22.

And that, of course, destroys the desired "round-trip" behavior if it is inadvertently applied while the data are encoded in Unicode. Hence the need to recreate a solution to the issue of variant forms with a different mechanism, the ideographic variation sequence (and corresponding database).
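
To illustrate the point, here is a minimal Python sketch using the standard library's unicodedata module. It shows the compatibility ideograph collapsing onto the unified ideograph under NFC, so the two can no longer be told apart afterwards, while a variation sequence passes through normalization unchanged. The choice of U+E0100 (VARIATION SELECTOR-17) is purely an illustrative assumption; the sketch does not claim that this particular sequence is registered for U+6F22.

```python
import unicodedata

COMPAT = "\uFA47"    # CJK COMPATIBILITY IDEOGRAPH-FA47
UNIFIED = "\u6F22"   # the unified ideograph it canonically decomposes to

# Hypothetical variation sequence: base character plus VARIATION SELECTOR-17.
# The selector is an assumption chosen only for illustration.
VARIANT_SEQ = UNIFIED + "\U000E0100"


def nfc(text):
    """Normalize text to NFC."""
    return unicodedata.normalize("NFC", text)


# Both code points yield the same string after NFC, so a converter back to
# the legacy character set can no longer tell which one was in the source.
collapsed = nfc(COMPAT) == nfc(UNIFIED)

# The variation sequence has no decomposition, so NFC leaves it alone.
preserved = nfc(VARIANT_SEQ) == VARIANT_SEQ

print(collapsed, preserved)
```

Whether a given variation sequence is meaningful still depends on the corresponding database; the sketch only shows that normalization does not remove it.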


The behavior in BabelPad is correct: U+6F22 is the NFC form of U+FA47.

Easily verified, for example, by checking the FA47 entry in
NormalizationTest.txt in the UCD.
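
For those without the UCD files at hand, a rough equivalent of that check can be sketched in Python with the standard unicodedata module, assuming the interpreter's bundled Unicode data is recent enough to include U+FA47. The helper name normalization_row is made up for this example; it lists the forms in the column order NormalizationTest.txt uses (source, NFC, NFD, NFKC, NFKD).

```python
import unicodedata


def normalization_row(ch):
    """Return hex code points for: source, NFC, NFD, NFKC, NFKD."""
    forms = [ch] + [
        unicodedata.normalize(form, ch)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    ]
    return [" ".join(f"{ord(c):04X}" for c in s) for s in forms]


row = normalization_row("\uFA47")
print(";".join(row))

# A canonical mapping carries no <tag> prefix in the decomposition field;
# a compatibility mapping would start with a tag such as "<compat>".
mapping = unicodedata.decomposition("\uFA47")
print(mapping)
```

This is not a substitute for running the full conformance test, only a quick look at one code point.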

While correct, it's something that remains a bit of a gotcha. Especially now that Unicode has charts that go to great lengths to show the different glyphs for these characters, I would suggest adding a note to the charts that makes clear that these distinctions are *removed* any time the text is normalized, which, in a distributed architecture, may happen at any time.

A./
--Ken

When I type ... (U+FA47) into BabelPad, highlight it, and then
click the button labeled "Normalize to NFC", the character
becomes ... (U+6F22). Does BabelPad not conform to the Unicode Standard
in this case? ...



