    use Unicode::Normalize qw(compose);
    use Encode qw(decode_utf8);
    ...
    my $f = decode_utf8(param('file'));
    ...
    # write out the file itself with name in decomposed utf-8
    $f = compose($f);
    ...
    # now do something with filename in composed utf-8
    # etc.
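For anyone who wants to try the decode-then-recompose step on its own, here is a minimal sketch. The sample bytes are a made-up stand-in for a name coming back from the filesystem, and it uses NFC() rather than compose(); NFC() also reorders combining marks before composing, which should be the safer choice when the input is not guaranteed to be in canonical order.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use Unicode::Normalize qw(NFC);

# Hypothetical sample: "cafe" followed by COMBINING ACUTE ACCENT (U+0301),
# encoded as UTF-8 bytes the way a decomposed filename would be.
my $bytes = "cafe\xCC\x81";

# Turn the raw bytes into a Perl character string: c a f e U+0301
my $decomposed = decode_utf8($bytes);

# Recombine base letter + combining mark into U+00E9 where possible
my $composed = NFC($decomposed);

printf "%d characters before, %d after\n",
    length($decomposed), length($composed);
printf "last character is U+%04X\n", ord(substr($composed, -1));
```

The decomposed string has five characters and the composed one four, with U+00E9 at the end.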
Thanks to everyone who helped out. I'm not sure what to do with my day now.
Andrew
On Apr 7, 2005, at 1:57 PM, Randy Boring wrote:
I've noticed that the non-ASCII characters are getting split into their
base code points. For example, U+00E9, Latin small letter E with
acute, becomes U+0065 U+0301 (unicode.org/charts/PDF/U0080.pdf). Is
there a way to easily recombine the code points to get the original
value? It's strange to me that Encode::decode_utf8 doesn't do this. I
thought diacritical marks were always combined with their preceding
letter, if possible.
Andrew
You've run into the particular format of HFS+ filenames. It's not just any UTF-8 encoding: almost all of the Unicode characters that can be decomposed are stored decomposed, and must be!
In Apple's header files (CoreFoundation/CFStringEncodingExt.h), it's referred to as kUnicodeCanonicalDecompVariant. In NSString.h there are methods such as decomposedStringWithCanonicalMapping (plus the precomposed- and -CompatibilityMapping variants). How you get to them from Perl, tho.... maybe CamelBones?
A description of this text encoding (and the reason for it) can be found at
http://developer.apple.com/technotes/tn/tn1150.html
see especially http://developer.apple.com/technotes/tn/tn1150.html#HFSPlusNames and http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties
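If the goal is only to compare a name you were given against a name read back from disk, one way to sidestep the question is to normalize both sides to the same form before comparing. A rough sketch follows; same_name is a hypothetical helper, not anything from Apple or CPAN, and standard NFD is only an approximation of the HFS+ variant, which may differ for some characters, so I would not rely on it to predict the exact on-disk bytes.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: true if two character strings are canonically
# equivalent, regardless of which one is composed or decomposed.
sub same_name {
    my ($x, $y) = @_;
    return NFD($x) eq NFD($y);
}

my $typed   = "caf\x{E9}";     # composed: U+00E9
my $on_disk = "cafe\x{301}";   # decomposed: U+0065 U+0301

print same_name($typed, $on_disk) ? "same\n" : "different\n";
```

Both arguments should already be decoded character strings (e.g. via decode_utf8), not raw bytes.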
Hope that helps a little,
-Randy