[xml] Adding parsed 8859-1 content to a UTF-8 document ...

denverrox denver Sat, 19 Jul 2008 15:30:19 -0700

Greetings list!

I'm using libxml2 (specifically the HTMLparser/tree modules, and the xpath 
library) to perform transformation operations on HTML input files, and have run 
into a character encoding issue:


Specifically, I have two HTML documents, one in 8859-1 encoding, and the other 
in UTF-8.

First I parse both documents into DOM trees.

Then I evaluate an XPath expression on the 8859-1 document, clone the 
result-set nodes with "xmlCopyNodeList", and use "xmlAddNextSibling" to add 
the cloned 8859-1 content to the document that was originally UTF-8 encoded.

When I then serialize the UTF-8 document, the content that came from the 
8859-1 document is not output correctly: special characters are garbled.

Based on the libxml2 encodings webpage ( http://xmlsoft.org/encoding.html ), 
it seems that libxml2 converts all character encodings to UTF-8 internally. 
Therefore, unless I'm misunderstanding something, the 8859-1 document should 
be in UTF-8 after parsing.
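
To make concrete what I understand that internal conversion to mean at the 
byte level, here is a minimal sketch in plain C. The helper name 
latin1_to_utf8 is hypothetical and is not part of libxml2; it only shows the 
mapping I would expect the parser to have applied to the 8859-1 text already.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not a libxml2 function): convert a NUL-terminated
 * ISO-8859-1 string into a newly allocated UTF-8 string.
 * The caller must free() the result. Returns NULL if allocation fails.
 */
char *latin1_to_utf8(const char *in)
{
    size_t len = strlen(in);
    /* Each 8859-1 byte expands to at most two UTF-8 bytes. */
    unsigned char *out = malloc(2 * len + 1);
    unsigned char *p = out;

    if (out == NULL)
        return NULL;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)in[i];
        if (c < 0x80) {
            /* ASCII range is identical in both encodings. */
            *p++ = c;
        } else {
            /* U+0080..U+00FF become a two-byte UTF-8 sequence. */
            *p++ = (unsigned char)(0xC0 | (c >> 6));
            *p++ = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    *p = '\0';
    return (char *)out;
}
```

For example, the single 8859-1 byte 0xE9 (e with acute accent) should come 
out as the two UTF-8 bytes 0xC3 0xA9. If node content in both parsed trees 
already looks like that, copying nodes between them should not by itself 
change the bytes.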

Is there any reason why this serialization problem should occur if both the 
8859-1 document and the UTF-8 document are converted to UTF-8 internally by 
libxml2? Shouldn't it "just work"? My impression is that you can freely copy 
cloned node sets between documents, since they are all UTF-8 internally. A 
careful review of the libxml2 encodings page seems to support this, so I'm 
quite stumped.

Any help on this is appreciated, thank you!

D. Platt

_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
