Hello, I'm using htmlParseDoc to build a tree for an ISO-8859-7 (Greek) HTML file. In most cases I don't have a problem, but in some cases I discovered that when I later try to save the tree to a file or to memory, it is not dumped fully or not in the right encoding. I tried to track down the problem and why it only happens sometimes.
I discovered that the problem probably occurs in HTMLparser.c, in the htmlCurrentChar function, and only when the HTML content has some encoded (non-ASCII) characters BEFORE the "Content-Type" meta tag, such as in the "title" tag. Since the parser doesn't know the right encoding yet at that point, it assumes ISO Latin-1 and tries to convert the rest of the encoded characters on that basis.

Is there any simple workaround when I don't know the correct encoding before parsing the document? Something like finding the "Content-Type" meta tag before parsing the rest of the document, or something similar, to resolve this issue?

As an example I supply two links from the same site which demonstrate the problem (the site was selected only for demonstration purposes):

1) http://www.m-art.gr/
   The title has ISO-8859-7 encoded characters and the document doesn't get parsed properly.

2) http://www.m-art.gr/gr/bazart/index.asp
   Also an ISO-8859-7 document, but with no encoded characters before the "Content-Type" declaration. This one gets parsed properly.

Thank you very much,
Liron
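P.S. To make the idea of "finding the Content-Type meta tag first" concrete, here is a rough sketch of the kind of pre-scan I have in mind. The helper name sniff_meta_charset is made up for illustration and is not part of libxml2. It only does a naive case-insensitive search for "charset=" in the raw bytes, so it assumes the declaration itself is plain ASCII and it does not check that the match is really inside a meta tag.

```c
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive comparison of the first n bytes of a and b (ASCII only). */
static int ascii_ncasecmp(const char *a, const char *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int ca = tolower((unsigned char)a[i]);
        int cb = tolower((unsigned char)b[i]);
        if (ca != cb)
            return ca - cb;
    }
    return 0;
}

/*
 * Hypothetical helper (NOT a libxml2 function): scan the first `len` bytes
 * of `buf` for "charset=" and copy the name that follows into `out`.
 * Returns 1 if a non-empty charset name was found, 0 otherwise.
 */
int sniff_meta_charset(const char *buf, size_t len, char *out, size_t outsz)
{
    static const char key[] = "charset=";
    const size_t klen = sizeof(key) - 1;

    if (outsz == 0)
        return 0;

    for (size_t i = 0; i + klen <= len; i++) {
        if (ascii_ncasecmp(buf + i, key, klen) != 0)
            continue;

        size_t j = i + klen;
        /* Skip an optional opening quote, as in charset="..." */
        if (j < len && (buf[j] == '"' || buf[j] == '\''))
            j++;

        /* Copy characters that can plausibly appear in a charset name. */
        size_t n = 0;
        while (j < len && n + 1 < outsz &&
               (isalnum((unsigned char)buf[j]) || buf[j] == '-' ||
                buf[j] == '_' || buf[j] == '.' || buf[j] == ':')) {
            out[n++] = buf[j++];
        }
        out[n] = '\0';
        if (n > 0)
            return 1;
    }

    out[0] = '\0';
    return 0;
}
```

If this returns 1, the name could then be passed as the encoding argument of htmlParseDoc(cur, encoding) or htmlReadMemory(buffer, size, URL, encoding, options) instead of NULL. I have not verified that doing so avoids the problem described above, so I would still like to know whether there is a proper way to handle this.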
_______________________________________________
xml mailing list, project page http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml