On Apr 6, 2005, at 6:00 PM, Joel Rees wrote:

Just a random thought: have you tried hexdump on the file, on the upload/download through the system and the browser(s) and whatever else you use, and on the output of the script? hexdump should tell you whether the bytes are changing or not.

(There may be a bug in hexdump relative to the BOM. RH Linux's hexdump has such a bug, and I haven't checked other systems for it yet. The bug, where present, shifts the displayed bytes relative to the interpreted bytes in some formats; I guess I should check, but it may not be immediately. Anyway, hexdump should give some idea of what's going on with all the different systems trying to help out.)
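If hexdump's output is suspect, a byte dump done in Perl itself sidesteps the question. A rough sketch along those lines (the file path is whichever copy you're checking):

    # print each byte of a file in hex, 16 per line, as a hexdump cross-check
    open my $fh, '<:raw', $ARGV[0] or die "open: $!";
    local $/;                          # slurp the whole file as raw bytes
    my @bytes = unpack 'C*', <$fh>;
    while (my @row = splice @bytes, 0, 16) {
        print join(' ', map { sprintf '%02x', $_ } @row), "\n";
    }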

I've noticed that the non-ASCII characters are getting decomposed into a base letter plus combining marks. For example, U+00E9, Latin small letter e with acute, becomes U+0065 U+0301, the plain letter followed by a combining acute accent (unicode.org/charts/PDF/U0080.pdf). Is there a way to easily recombine the code points to get the original value? It's strange to me that Encode::decode_utf8 doesn't do this; I thought diacritical marks were always combined with the preceding letter when possible.
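Would something along these lines, using Unicode::Normalize (core since 5.8, if I remember right), be the right way to put them back together? A rough sketch, assuming the bytes have already been decoded as shown:

    use Encode qw(decode_utf8);
    use Unicode::Normalize qw(NFC);

    # UTF-8 bytes for "e" followed by U+0301 COMBINING ACUTE ACCENT
    my $bytes    = "e\xcc\x81";
    my $chars    = decode_utf8($bytes);            # decoded, but still two code points
    my $composed = NFC($chars);                    # canonical composition -> U+00E9
    printf "%v04x -> %v04x\n", $chars, $composed;  # prints 0065.0301 -> 00e9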


Andrew


