On 11/15/2010 2:24 PM, Kenneth Whistler wrote:
FA47 is a "compatibility character", and would have a compatibility mapping.
Faulty syllogism.
A formally correct answer, but only because of something of a design flaw
in Unicode. When the type of mapping was decided on, people didn't fully
expect that NFC might become widely used and enforced, making these
distinctions appear wherever text is normalized in a distributed
architecture.
FA47 is a CJK Compatibility character, which means it was encoded
for compatibility purposes -- in this case to cover the round-trip
mapping needed for JIS X 0213.
However, it has a *canonical* decomposition mapping to U+6F22.
And that, of course, destroys the desired "round-trip" behavior if it is
inadvertently applied while the data are encoded in Unicode. Hence the
need to recreate a solution to the issue of variant forms with a
different mechanism, the ideographic variation sequence (and
corresponding database).
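A rough sketch of the difference, using Python's standard unicodedata
module. The choice of U+E0100 (VARIATION SELECTOR-17) is only an
illustration; whether that particular sequence is registered in the
variation database is not checked here.

```python
import unicodedata

compat = "\uFA47"    # CJK COMPATIBILITY IDEOGRAPH-FA47
unified = "\u6F22"   # CJK UNIFIED IDEOGRAPH-6F22

# Before normalization the two code points are distinct, so a round trip
# back to the legacy encoding could still tell them apart.
distinct_before = compat != unified

# After NFC both strings are identical: the distinction is gone.
same_after = (unicodedata.normalize("NFC", compat)
              == unicodedata.normalize("NFC", unified))

# A variation sequence is the unified ideograph followed by a variation
# selector. U+E0100 is an illustrative choice, not a verified registration.
sequence = "\u6F22\U000E0100"
survives = unicodedata.normalize("NFC", sequence) == sequence

print(distinct_before, same_after, survives)
```

The selector has no decomposition mapping, so normalization leaves the
sequence alone, which is what makes it usable where the compatibility
character is not.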
The behavior in BabelPad is correct: U+6F22 is the NFC form of U+FA47.
Easily verified, for example, by checking the FA47 entry in
NormalizationTest.txt in the UCD.
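The same check can be sketched without BabelPad, assuming a Python
installation whose unicodedata module carries a Unicode version that
includes U+FA47. This reads the decomposition mapping and applies all
four normalization forms:

```python
import unicodedata

compat = "\uFA47"    # CJK COMPATIBILITY IDEOGRAPH-FA47
unified = "\u6F22"   # CJK UNIFIED IDEOGRAPH-6F22

# A compatibility mapping would carry a tag such as "<compat>" in front
# of the code point; a canonical mapping is reported as bare hex digits.
decomposition = unicodedata.decomposition(compat)
print("decomposition:", decomposition)

# Because the mapping is canonical, every normalization form applies it.
forms = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    forms[form] = unicodedata.normalize(form, compat)
    print(form, "U+%04X" % ord(forms[form]))
```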
While correct, it's something that remains a bit of a gotcha. Especially
now that Unicode has charts that go to great lengths to show the
different glyphs for these characters, I would suggest adding a note to
the charts that makes clear that these distinctions are *removed* any time
the text is normalized, which, in a distributed architecture, may happen
at any time.
A./
--Ken
When I type ... (U+FA47) into BabelPad, highlight it, and then
click the button labeled "Normalize to NFC", the character
becomes ... (U+6F22). Does BabelPad not conform to the Unicode Standard
in this case? ...