Re: character encoding on file upload name

Randy Boring Fri, 08 Apr 2005 07:33:29 -0700

>I've noticed that the non-ASCII characters are getting split into their 
>base code points.  For example, U+00E9, Latin small letter E with 
>acute, becomes U+0065 U+0301 (unicode.org/charts/PDF/U0080.pdf).  Is 
>there a way to easily recombine the code points to get the original 
>value?  It's strange to me that Encode::decode_utf8 doesn't do this.  I 
>thought diacritical marks were always combined with their preceding 
>letter, if possible.
>
>Andrew


You've run into the particular format of HFS+ filenames.  It's not just 
any UTF-8 encoding: nearly all Unicode characters that can be 
decomposed are stored in decomposed form, and must be!

In Apple's header files (CoreFoundation/CFStringEncodingExt.h), it's 
referred to as kUnicodeCanonicalDecompVariant.
In NSString.h there are methods such as 
decomposedStringWithCanonicalMapping (along with the precomposed- and 
-CompatibilityMapping variants).  How you get to them from Perl, 
though... maybe CamelBones?
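
A pure-Perl route may also work.  What follows is a minimal sketch, 
assuming the Unicode::Normalize module (not mentioned above; it has 
shipped with Perl since 5.8) is acceptable for your purposes.  Its NFC 
function applies canonical composition, which should turn U+0065 U+0301 
back into U+00E9.  Encode::decode_utf8 only converts bytes into code 
points and does not normalize, which is why the decomposed form survives 
decoding.  The byte string and the code_points helper are made up for 
illustration.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use Unicode::Normalize qw(NFC);

# Hypothetical helper: list a string's code points as "U+XXXX ...".
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $str;
}

# UTF-8 bytes for "e" followed by U+0301 (combining acute accent),
# standing in for a filename in the decomposed form HFS+ uses.
my $bytes      = "e\xCC\x81";
my $decomposed = decode_utf8($bytes);   # decoding only, no normalization
my $composed   = NFC($decomposed);      # canonical composition

print code_points($decomposed), "\n";   # U+0065 U+0301
print code_points($composed),   "\n";   # U+00E9
```

Note that NFC only helps on the Perl side.  A name you hand back to the 
filesystem will be stored decomposed again, per the technote below.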

A description of this text encoding (and the reason for it) is found at
  http://developer.apple.com/technotes/tn/tn1150.html

see especially
  http://developer.apple.com/technotes/tn/tn1150.html#HFSPlusNames
and
  http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties


Hope that helps a little,

 -Randy
