In the end, I dealt with the problem in the following way: I captured the STDERR output of the script that gave the Malformed UTF-8 character warnings and I processed it with a regexp to extract the lines where the warnings were sent. Then, a colleague and I fixed all those lines with a mix of hexdumping, text editing, deciding the right symbol on the basis of the context, and simple scripts.
Regards,
Marco
--- Marco Baroni University of Bologna http://sslmit.unibo.it/~baroni