With Randy's tip and my discovery of the Unicode::Normalize module, I've gotten things worked out.

use Unicode::Normalize qw(compose);
use Encode qw(decode_utf8);
...
my $f = decode_utf8(param('file'));   # UTF-8 bytes -> Perl characters
# ... write out the file itself with name in decomposed utf-8
$f = compose($f);                     # recombine base letters and combining marks
# ... now do something with filename in composed utf-8
etc.
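
For anyone who wants to try the same round trip outside a CGI script, here is a self-contained sketch. The byte string is a made-up stand-in for param('file'), and using NFC rather than compose is my own substitution: NFC decomposes and reorders before composing, so it should be the safer choice when the input may not already be cleanly decomposed.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC compose);
use Encode qw(decode_utf8 encode_utf8);

# Hypothetical stand-in for param('file'): "résumé.txt" as decomposed UTF-8
# bytes (each "é" is "e" followed by U+0301, which is 0xCC 0x81 in UTF-8).
my $raw = "re\xCC\x81sume\xCC\x81.txt";

my $decomposed = decode_utf8($raw);   # bytes -> characters, still decomposed
my $composed   = NFC($decomposed);    # U+0065 U+0301 -> U+00E9

printf "decomposed: %d characters\n", length $decomposed;
printf "composed:   %d characters\n", length $composed;
printf "compose() agrees with NFC here: %s\n",
    compose($decomposed) eq $composed ? "yes" : "no";

# Encode again before handing the name to anything that expects bytes.
my $composed_bytes = encode_utf8($composed);
```

The last step matters if the composed name is then written to a file, a database, or an HTTP response, since those generally want bytes rather than Perl character strings.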

Thanks to everyone who helped out. I'm not sure what to do with my day now.

Andrew



On Apr 7, 2005, at 1:57 PM, Randy Boring wrote:

I've noticed that the non-ASCII characters are getting split into their
base code points. For example, U+00E9, Latin small letter E with
acute, becomes U+0065 U+0301 (unicode.org/charts/PDF/U0080.pdf). Is
there a way to easily recombine the code points to get the original
value? It's strange to me that Encode::decode_utf8 doesn't do this. I
thought diacritical marks were always combined with their preceding
letter, if possible.


Andrew

You've run into the particular format of HFS+ filenames. It's not just any UTF-8 encoding: nearly all of the Unicode characters that are decomposable are decomposed, and must be so!
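
To see what such a decomposition looks like from Perl, here is a small sketch using NFD from Unicode::Normalize. Treat it as an approximation: this is standard NFD, while HFS+ (as I understand the technote) uses its own variant of canonical decomposition, so the two may not agree for every character.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);
use Encode qw(encode_utf8);

my $precomposed = "\x{E9}";            # U+00E9, e with acute
my $decomposed  = NFD($precomposed);   # standard canonical decomposition

# List the code points of the decomposed string.
my @code_points = map { sprintf "U+%04X", ord } split //, $decomposed;
print "code points: @code_points\n";

# The two forms are different byte sequences once encoded as UTF-8,
# which is why a plain UTF-8 decode cannot turn one into the other.
my $bytes_pre = unpack "H*", encode_utf8($precomposed);
my $bytes_dec = unpack "H*", encode_utf8($decomposed);
print "precomposed bytes: $bytes_pre\n";
print "decomposed bytes:  $bytes_dec\n";
```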

In Apple's header files (CoreFoundation/CFStringEncodingExt.h), it's
referred to as kUnicodeCanonicalDecompVariant.
In NSString.h there are methods for
decomposedStringWithCanonicalMapping (and precomposed- and
-CompatibilityMapping).  How you get to them from Perl, tho.... maybe
CamelBones?

A description of this text encoding (and the reason for it) is found at
http://developer.apple.com/technotes/tn/tn1150.html


see especially
  http://developer.apple.com/technotes/tn/tn1150.html#HFSPlusNames
and
  http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties


Hope that helps a little,

 -Randy



