Re: character encoding on file upload name

Andrew Mace Thu, 07 Apr 2005 11:40:09 -0700

With Randy's tip and my discovery of the Unicode::Normalize module, I've gotten things worked out.

use Unicode::Normalize qw(compose);
use Encode qw(decode_utf8);
# ...
my $f = decode_utf8(param('file'));  # param() as exported by CGI.pm (assumed)
# ... write out the file itself, with its name in decomposed UTF-8
$f = compose($f);
# ... now do something with the filename in composed UTF-8
# etc.
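
For anyone who wants to try this without a real upload, here is a minimal standalone sketch of the same decode-then-compose step. The filename bytes are made up for illustration (they stand in for whatever the browser sends), and the variable names are not from my actual script.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);
use Unicode::Normalize qw(compose);

# Hypothetical upload name: "cafe" followed by the UTF-8 bytes of
# U+0301 (combining acute accent), i.e. the decomposed form.
my $raw = "cafe\xCC\x81.txt";

# Turn the UTF-8 bytes into a Perl character string (still decomposed).
my $decomposed = decode_utf8($raw);

# Recombine base letter + combining mark into U+00E9 where possible.
my $composed = compose($decomposed);

# Encode back to UTF-8 bytes for anything that expects bytes.
my $bytes = encode_utf8($composed);

# Show the code points before and after composition.
for my $s ($decomposed, $composed) {
    print join(' ', map { sprintf 'U+%04X', ord($_) } split //, $s), "\n";
}
```

The composed string is one character shorter than the decomposed one, since the "e" and the combining accent collapse into a single code point.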

Thanks to everyone who helped out. I'm not sure what to do with my day now.

Andrew

On Apr 7, 2005, at 1:57 PM, Randy Boring wrote:

I've noticed that the non-ASCII characters are getting split into their base code points. For example, U+00E9, Latin small letter E with acute, becomes U+0065 U+0301 (unicode.org/charts/PDF/U0080.pdf). Is there a way to easily recombine the code points to get the original value? It's strange to me that Encode::decode_utf8 doesn't do this. I thought diacritical marks were always combined with their preceding letter, if possible.

Andrew
You've run into the particular format of HFS+ filenames.  It's not just
any UTF-8 encoding: almost all of the Unicode characters that are
decomposable are decomposed, and must be so!
In Apple's header files (CoreFoundation/CFStringEncodingExt.h), it's
referred to as kUnicodeCanonicalDecompVariant.
In NSString.h there are methods for
decomposedStringWithCanonicalMapping (and precomposed- and
-CompatibilityMapping).  How you get to them from Perl, tho.... maybe
CamelBones?
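
A possible pure-Perl route, sketched here as an assumption rather than something tested against HFS+ itself: the Unicode::Normalize module provides NFD and NFC, which perform standard canonical decomposition and composition. The HFS+ rules may differ from standard NFD for some characters, so treat this as an approximation. The code_points helper is made up for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Hypothetical helper: format a string as a list of code points.
sub code_points {
    my ($s) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $s;
}

# U+00E9, Latin small letter e with acute, in precomposed form.
my $precomposed = "\x{E9}";

# Canonical decomposition: base letter followed by combining accent.
my $decomposed = NFD($precomposed);
print code_points($decomposed), "\n";       # U+0065 U+0301

# Canonical composition takes it back to the single code point.
my $recomposed = NFC($decomposed);
print code_points($recomposed), "\n";       # U+00E9
```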
A description of this text encoding (and the reason for it) is found at http://developer.apple.com/technotes/tn/tn1150.html
see especially
  http://developer.apple.com/technotes/tn/tn1150.html#HFSPlusNames
and
  http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties
Hope that helps a little,
 -Randy
