>I've noticed that the non-ASCII characters are getting split into their
>base code points. For example, U+00E9, Latin small letter E with
>acute, becomes U+0065 U+0301 (unicode.org/charts/PDF/U0080.pdf). Is
>there a way to easily recombine the code points to get the original
>value? It's strange to me that Encode::decode_utf8 doesn't do this. I
>thought diacritical marks were always combined with their preceding
>letter, if possible.
>
>Andrew

You've run into the particular format of HFS+ filenames. It's not just any UTF-8 encoding: almost all of the Unicode characters that are decomposable are decomposed, and must be so!

In Apple's header files (CoreFoundation/CFStringEncodingExt.h), it's referred to as kUnicodeCanonicalDecompVariant. In NSString.h there are methods for decomposedStringWithCanonicalMapping (and precomposed- and -CompatibilityMapping). How you get to them from Perl, tho.... maybe CamelBones?

A description of this text encoding (and the reason for it) is found at
http://developer.apple.com/technotes/tn/tn1150.html

see especially
http://developer.apple.com/technotes/tn/tn1150.html#HFSPlusNames
and
http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties

Hope that helps a little,
-Randy
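
For the recombining step asked about in the question, one pure-Perl possibility, not mentioned in the reply above and offered only as a sketch, is the core Unicode::Normalize module, whose NFC function applies standard Unicode canonical composition. The sample byte string is an assumption chosen to match the U+0065 U+0301 example. Because TN1150 describes the HFS+ form as a variant of canonical decomposition, NFC and NFD may not round-trip every possible filename exactly.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFC NFD);

# Illustrative input (an assumption): the UTF-8 bytes for "e" followed by
# U+0301 COMBINING ACUTE ACCENT, as a decomposed filename would contain.
my $bytes = "e\xCC\x81";

# Decoding only turns bytes into code points; it does not normalize.
my $decomposed = decode('UTF-8', $bytes);

# Canonical composition merges the base letter and the combining mark.
my $composed = NFC($decomposed);

# Going the other way, e.g. before comparing against a name read from disk.
my $redecomposed = NFD($composed);

# Format a string as a list of code points for display.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
}

print 'decoded:  ', code_points($decomposed),   "\n";
print 'NFC:      ', code_points($composed),     "\n";
print 'NFD(NFC): ', code_points($redecomposed), "\n";
```

Run as is, this prints U+0065 U+0301 for the decoded string, U+00E9 after NFC, and U+0065 U+0301 again after NFD.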