Note: This was sent on Sunday at 19.45 but seems to have disappeared on its way to pharo users. Re-sent just to complete the story.
_____________________________________________________

Paul
Good to have found the charset discrepancy - that may have something to do with it. But I don't think it has to do with the C’è in the body of the page. I have just parsed another page published today, with the same error, and again it fails in parsing the <head> node, so it has not even reached the body. The <head> contains a meta which describes the article - a sort of paraphrase of the article headline - and it fails in the middle of decoding that. The character at which it fails is again $«, so that is definitely the cause. Maybe the wrong charset explains why it messes that up, but I don't know enough about the different charsets to know. Does ISO-8859-1 even contain $«?

Peter

Addendum: I have looked a bit further, and the charset problem lies behind it all. The ISO-8859-1 charset *does* include $«, at decimal 171 or hex AB. At the point where it fails, the parser is reading the string '«B', which is hex AB 42 in ISO-8859-1, and the debugger shows that the parser is trying to decode hex AB 42 as a multibyte UTF-8 character. So there are two questions remaining:

(a) why does the parser try to decode it as UTF-8?
(b) why does reading the string in before calling the parser get round the problem?
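
For anyone following along, here is a small sketch of why those two bytes cannot be read as UTF-8. It is in Python rather than Pharo, purely as an illustration, and it is not the parser's own code; the variable names are made up for the example. It only shows the byte-level facts: AB 42 is a valid '«B' in ISO-8859-1, whereas in UTF-8 the byte AB is a continuation byte and cannot start a character, and a properly UTF-8 encoded $« would be the two bytes C2 AB.

```python
# The two bytes the parser was reading at the point of failure.
raw = bytes([0xAB, 0x42])

# ISO-8859-1 maps every byte to exactly one character, so this cannot fail.
as_latin1 = raw.decode("iso-8859-1")

# A strict UTF-8 decoder rejects the same bytes: 0xAB has the bit pattern
# 10xxxxxx, which marks a continuation byte, not the start of a character.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# For comparison, the UTF-8 encoding of the same character is two bytes.
guillemet_utf8 = "\u00ab".encode("utf-8")

print(repr(as_latin1))
print(utf8_error)
print(guillemet_utf8.hex())
```

So a decoder that has been told, or has assumed, that the input is UTF-8 is bound to stumble on the first $« in an ISO-8859-1 page, which is consistent with the failure being in the <head> before the body is ever reached.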