On Apr 6, 2005, at 6:00 PM, Joel Rees wrote:

Just a random thought: have you tried hexdump on the file, on the upload/download through the system and the browser(s) and whatever else you use, and on the output of the script? hexdump should tell you whether the bytes are changing or not.

(There may be a bug in hexdump relative to the BOM. RH Linux's hexdump has such a bug, and I haven't checked other systems for it yet. The bug, where present, shifts the displayed bytes relative to the interpreted bytes in some formats; I guess I should check, but it may not be immediately. Anyway, hexdump should give some idea of what's going on with all the different systems trying to help out.)
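If hexdump's output is suspect, a byte dump done in Perl itself sidesteps the question. A rough sketch along those lines (the file path is whichever copy you're checking):

    # print each byte of a file in hex, 16 per line, as a hexdump cross-check
    open my $fh, '<:raw', $ARGV[0] or die "open: $!";
    local $/;                          # slurp the whole file as raw bytes
    my @bytes = unpack 'C*', <$fh>;
    while (my @row = splice @bytes, 0, 16) {
        print join(' ', map { sprintf '%02x', $_ } @row), "\n";
    }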

I've noticed that the non-ASCII characters are getting decomposed into a base letter plus combining marks. For example, U+00E9, Latin small letter e with acute, becomes U+0065 U+0301, the plain letter followed by a combining acute accent (unicode.org/charts/PDF/U0080.pdf). Is there a way to easily recombine the code points to get the original value? It's strange to me that Encode::decode_utf8 doesn't do this; I thought diacritical marks were always combined with the preceding letter when possible.
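Would something along these lines, using Unicode::Normalize (core since 5.8, if I remember right), be the right way to put them back together? A rough sketch, assuming the bytes have already been decoded as shown:

    use Encode qw(decode_utf8);
    use Unicode::Normalize qw(NFC);

    # UTF-8 bytes for "e" followed by U+0301 COMBINING ACUTE ACCENT
    my $bytes    = "e\xcc\x81";
    my $chars    = decode_utf8($bytes);            # decoded, but still two code points
    my $composed = NFC($chars);                    # canonical composition -> U+00E9
    printf "%v04x -> %v04x\n", $chars, $composed;  # prints 0065.0301 -> 00e9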


Andrew


