Looks like your theory about the input data being "in ascii (with entity
references...)" is contradicted by the evidence.
Indeed.
I do not see a c with cedilla, I see a rhombus with a question mark inside (which is the way my shell displays non-ASCII characters). I guess it is a c with cedilla from the context.
So now you need to determine what character encoding is being used for the non-ASCII codes, which are obviously present in the data. When you look at the file and you see a c with cedilla, can you tell whether this is actually the appropriate character, based on its context? Is this true of all such characters?
So, I would like to ask you or anybody else: is there some kind of tool (e.g., a text editor) that I could use to discover which encoding is being used? (I tried with emacs but failed).
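Short of a dedicated tool, one rough way to narrow this down is to pull out the bytes around each non-ASCII code and decode them under a few candidate encodings, then judge by eye which rendering fits the context. The following is only a sketch of that idea in Python: the list of candidate encodings and the function names are my own assumptions, not something taken from the data or from any particular tool.

```python
# Assumption: these are merely plausible candidates for Western European
# text; extend or reorder the list as needed.
CANDIDATES = ["utf-8", "iso-8859-1", "cp1252", "mac_roman", "cp850"]


def non_ascii_contexts(data, width=10, limit=5):
    """Return byte slices around the first few runs of non-ASCII bytes.

    Each slice holds one run of bytes >= 0x80 plus up to `width` ASCII
    bytes on either side, so a multi-byte character is never cut in half.
    """
    contexts = []
    i, n = 0, len(data)
    while i < n and len(contexts) < limit:
        if data[i] < 0x80:
            i += 1
            continue
        # Find the end of this run of non-ASCII bytes.
        j = i
        while j < n and data[j] >= 0x80:
            j += 1
        # Grow the window outwards over ASCII bytes only.
        start = i
        while start > 0 and i - start < width and data[start - 1] < 0x80:
            start -= 1
        end = j
        while end < n and end - j < width and data[end] < 0x80:
            end += 1
        contexts.append(data[start:end])
        i = j
    return contexts


def decode_with_candidates(chunk, candidates=CANDIDATES):
    """Map each candidate encoding to the decoded text, or None on failure."""
    results = {}
    for enc in candidates:
        try:
            results[enc] = chunk.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


def report(path):
    """Print every candidate decoding for each context found in the file."""
    with open(path, "rb") as f:  # read raw bytes, no decoding yet
        data = f.read()
    for chunk in non_ascii_contexts(data):
        print(repr(chunk))
        for enc, text in decode_with_candidates(chunk).items():
            print("  %-12s %s" % (enc, text if text is not None else "(invalid)"))
```

Calling report() on the file would print, for each of the first few non-ASCII spots, how it reads under every candidate. Note that a single-byte encoding such as iso-8859-1 never fails to decode, so success alone proves nothing; the decision still rests on whether the resulting character makes sense in its word.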
Thanks again.
Marco
--- Marco Baroni University of Bologna http://sslmit.unibo.it/~baroni
