Thanks to all those who replied to my query about finding out the encoding of a text that *should* have been in 7bit ASCII.

In the end, I dealt with the problem in the following way: I captured the STDERR output of the script that gave the Malformed UTF-8 character warnings and I processed it with a regexp to extract the lines where the warnings were sent. Then, a colleague and I fixed all those lines with a mix of hexdumping, text editing, deciding the right symbol on the basis of the context, and simple scripts.

Regards,

Marco


--- Marco Baroni University of Bologna http://sslmit.unibo.it/~baroni



Reply via email to