Problem with accented characters

William Tay wrote:

> Can anyone explain why an accented character is sometimes represented
> as a base character plus its accent? For example, the utf-8
> representation for é is 65 CC 81, which is the utf-8 representation
> for e and the accent, instead of C3 A9? I find that this is how MacOS
> X represents accented characters.
The two characters U+0065 and U+0301 (é) are canonically equivalent to
the single character U+00E9 (é). That is, the two-character combining
sequence is supposed to be considered equivalent to the single
precomposed character. Apparently MacOS X, or at least one application
running under it, does use the combining sequence.

> How can a C application that receives such utf-8 encoded characters
> handle them correctly? Appreciate your comments.

It must understand normalization. See TUS 4.0, section 5.6 for more
information.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
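To make the equivalence concrete, here is a rough sketch in C of what
"understanding normalization" involves at its very simplest. It decodes
UTF-8 and replaces a base letter followed by a combining accent with the
precomposed character. The function names (utf8_decode, utf8_encode,
toy_compose) and the four-entry table are hypothetical, made up for this
illustration. This is not an implementation of NFC: real normalization
needs the full Unicode composition data, canonical reordering of multiple
marks, and more, so an actual application would normally rely on a
library such as ICU rather than code like this.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (n bytes available) into *cp.
   Returns the number of bytes consumed, or 0 if the input is malformed. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len;
    uint32_t c, min;

    if (n == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0)      { len = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0;
    if (n < len) return 0;
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;
    *cp = c;
    return len;
}

/* Encode cp as UTF-8 into out (needs up to 4 bytes). Returns bytes written. */
static size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) { out[0] = (unsigned char)cp; return 1; }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Illustrative table only: a handful of (base, mark) -> precomposed pairs. */
static const uint32_t pairs[][3] = {
    { 0x0061, 0x0300, 0x00E0 },  /* a + combining grave -> U+00E0 */
    { 0x0061, 0x0301, 0x00E1 },  /* a + combining acute -> U+00E1 */
    { 0x0065, 0x0300, 0x00E8 },  /* e + combining grave -> U+00E8 */
    { 0x0065, 0x0301, 0x00E9 },  /* e + combining acute -> U+00E9 */
};

/* Return the precomposed character for (base, mark), or 0 if not in the table. */
static uint32_t lookup_pair(uint32_t base, uint32_t mark)
{
    for (size_t i = 0; i < sizeof pairs / sizeof pairs[0]; i++)
        if (pairs[i][0] == base && pairs[i][1] == mark)
            return pairs[i][2];
    return 0;
}

/* Copy in[0..in_len) to out, composing any pair found in the toy table.
   out must hold at least in_len bytes (the output is never longer).
   Returns the output length, or (size_t)-1 on malformed UTF-8. */
size_t toy_compose(const unsigned char *in, size_t in_len, unsigned char *out)
{
    size_t i = 0, o = 0;

    while (i < in_len) {
        uint32_t cp, next = 0, composed;
        size_t k = utf8_decode(in + i, in_len - i, &cp);
        if (k == 0) return (size_t)-1;
        i += k;

        /* Peek at the following code point, if there is one. */
        size_t k2 = (i < in_len) ? utf8_decode(in + i, in_len - i, &next) : 0;
        composed = k2 ? lookup_pair(cp, next) : 0;
        if (composed) { cp = composed; i += k2; }

        o += utf8_encode(cp, out + o);
    }
    return o;
}
```

Given the three bytes 65 CC 81 from the question, toy_compose writes the
two bytes C3 A9; input that is already C3 A9 passes through unchanged, so
both spellings end up byte-for-byte identical and can be compared with
memcmp.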

