>I've noticed that the non-ASCII characters are getting split into their 
>base code points.  For example, U+00E9, Latin small letter E with 
>acute, becomes U+0065 U+0301 (unicode.org/charts/PDF/U0080.pdf).  Is 
>there a way to easily recombine the code points to get the original 
>value?  It's strange to me that Encode::decode_utf8 doesn't do this.  I 
>thought diacritical marks were always combined with their preceding 
>letter, if possible.
>
>Andrew

You've run into the particular format of HFS+ filenames.  They aren't 
stored in just any UTF-8 encoding: nearly all Unicode characters that 
can be decomposed are stored decomposed, and must be!

In Apple's header files (CoreFoundation/CFStringEncodingExt.h), it's 
referred to as kUnicodeCanonicalDecompVariant.
In NSString.h there are methods for 
decomposedStringWithCanonicalMapping (and the precomposed- and 
-CompatibilityMapping variants).  How you get to them from Perl, 
though... maybe CamelBones?
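
One way to recombine from plain Perl, without going through Cocoa, 
might be the core Unicode::Normalize module.  This is only a minimal 
sketch: the filename bytes are made up for illustration, and it assumes 
the standard NFC form is what you want.  Since the HFS+ decomposition 
is Apple's own variant, NFC may not give back the exact original string 
for every character.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use Unicode::Normalize qw(NFC);

# Hypothetical filename bytes, as a decomposed name would look in UTF-8:
# "caf", then "e" (U+0065), then COMBINING ACUTE ACCENT (U+0301 = CC 81).
my $bytes = "caf\x65\xCC\x81";

# decode_utf8 only turns bytes into characters; it does not normalize,
# so the "e" and the combining accent stay separate.
my $decomposed = decode_utf8($bytes);

# NFC recombines a base letter with its combining marks wherever a
# precomposed character exists.
my $composed = NFC($decomposed);

printf "%d -> %d characters\n", length($decomposed), length($composed);
printf "last character is U+%04X\n", ord(substr($composed, -1));
```

Here the five decoded characters come out as four, the last one being 
U+00E9.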

A description of this text encoding (and the reason for it) is found at
  http://developer.apple.com/technotes/tn/tn1150.html

see especially
  http://developer.apple.com/technotes/tn/tn1150.html#HFSPlusNames
and
  http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties


Hope that helps a little,

 -Randy
