FYI, by far the largest source of text in NFD (decomposed) form in Mac OS X is the file system. File names are stored this way (for historical reasons), so anything copied from a file name is in (a slightly altered form of) NFD.

Also, a few keyboard layouts generate text that is partly decomposed, for ease of typing (e.g., Vietnamese).

Deborah Goldsmith
Internationalization, Unicode liaison
Apple Computer, Inc.
[EMAIL PROTECTED]

On Aug 23, 2004, at 11:51 AM, Doug Ewell wrote:

William Tay wrote:

Can anyone explain why an accented character is sometimes represented
as a base character plus its accent?  For example, the utf-8
representation for é is 65 CC 81, which is the utf-8 representation
for e and the accent, instead of C3 A9?  I find that this is how MacOS
X represents accented characters.

The two characters U+0065 and U+0301 (é) are canonically equivalent to the single character U+00E9 (é). That is, the two-character combining sequence is supposed to be considered equivalent to the single precomposed character. Apparently MacOS X, or at least one application running under it, does use the combining sequence.
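The equivalence can be checked concretely. The following is only an illustrative sketch; it uses Python's standard unicodedata module (not something mentioned in this thread) to decode the two byte sequences from the question and compare them before and after normalization. The variable names are made up for the example.

```python
import unicodedata

# The two byte sequences from the question, decoded from UTF-8.
decomposed = bytes([0x65, 0xCC, 0x81]).decode("utf-8")   # U+0065 U+0301
precomposed = bytes([0xC3, 0xA9]).decode("utf-8")        # U+00E9

# Compared code point by code point, the two strings differ.
raw_equal = decomposed == precomposed

# Once both are brought to the same normalization form, they match.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed
```

Here raw_equal is False while nfc_equal and nfd_equal are both True, which is what canonical equivalence means in practice: the strings only compare equal after normalization.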

How can a C application that receives such utf-8 encoded characters
handle them correctly?  Appreciate your comments.

It must understand normalization. See TUS 4.0, section 5.6 for more information.
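As a rough sketch of what "understanding normalization" can look like, one common approach is to normalize all incoming text to a single form (NFC here) before storing or comparing it. The example is in Python only for brevity, and the helper name to_nfc_utf8 is hypothetical; a C application would need an equivalent routine from a Unicode library (ICU is one library that provides normalization functions).

```python
import unicodedata

def to_nfc_utf8(raw: bytes) -> bytes:
    """Decode UTF-8 input, normalize it to NFC, and re-encode as UTF-8.

    Hypothetical helper for illustration; raises UnicodeDecodeError
    if the input is not valid UTF-8.
    """
    text = raw.decode("utf-8")
    return unicodedata.normalize("NFC", text).encode("utf-8")
```

With this helper, the decomposed input 65 CC 81 comes out as C3 A9, and input that is already precomposed passes through unchanged, so later byte-wise comparisons see a single representation.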

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/







