Note: This was sent on Sunday at 19.45 but seems to have disappeared on its
way to pharo-users. Re-sent just to complete the story.
_____________________________________________________
Paul

Good to have found the charset discrepancy - that may have something to do
with it. But I don't think it has to do with the C’è in the body of the
page. I have just parsed another page published today, with the same error,
and again it fails in parsing the <head> node, so it has not even reached
the body. The <head> contains a meta which describes the article - a sort of
paraphrase of the article headline - and it fails in the middle of decoding
that. The character at which it fails is again $«, so that is definitely the
trigger. The wrong charset may well explain why the parser mishandles it,
but I don't know enough about the different charsets to say. Does
ISO-8859-1 even contain $«?

Peter

Addendum: I have looked a bit further, and the charset problem lies behind
it all. The ISO-8859-1 charset *does* include $«, at decimal 171 or hex AB.
At the point where it fails, the parser is reading the string '«B', which is
hex AB 42 in ISO-8859-1, and the debugger shows that the parser is trying to
decode hex AB 42 as a multibyte UTF-8 character. So there are two questions
remaining: (a) why does the parser try to decode it as UTF-8? (b) why does
reading the string in before calling the parser get round the problem?
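
For anyone who wants to check the byte-level claim outside the image, here
is a minimal sketch in Python (not Pharo, and not the parser's own code; the
variable names are just for illustration). It only shows that the two bytes
AB 42 are a valid ISO-8859-1 string but cannot be valid UTF-8, because AB
has the bit pattern 10101011, which UTF-8 reserves for continuation bytes
and never allows at the start of a character. It says nothing about why the
parser picks UTF-8 in the first place.

```python
# The two bytes the parser is looking at when it fails.
raw = bytes([0xAB, 0x42])

# Read as ISO-8859-1, every byte maps directly to one character.
latin1_text = raw.decode("iso-8859-1")

# Read as UTF-8, 0xAB is a continuation byte with no lead byte before it,
# so decoding is rejected.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# For comparison: the same text correctly encoded as UTF-8 needs two bytes
# for the guillemet (C2 AB) followed by 42 for the letter B.
utf8_bytes = ("\u00ab" + "B").encode("utf-8")

print(repr(latin1_text), utf8_ok, utf8_bytes.hex())
```

So a page that really is ISO-8859-1 but gets handed to a UTF-8 decoder would
be expected to fail at the first $«, which matches what the debugger shows.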



