Dale Strickland-Clark wrote:

> from xml.dom.minidom import parseString
> output = parseString(strHTML).toxml()
>
> The output is:
>
><?xml version="1.0" encoding="iso-8859-1"?>
><html>
><head>
><title/>
><meta content="text/html; charset=iso-8859-1"
>http-equiv="Content-Type"/> </head>
><body>
> €
></body>
></html>
>
> So it encodes the entity reference to € (Euro sign).  I need it to
> remain as &#8364; so that the resulting HTML can render properly in a
> browser.  Is there a way to make the parser not convert the entity
> references?  Or is there a convenient post processing function that
> will do the conversion?

First up, when I repeat what you did I don't get the same output. toxml() without an encoding argument produces a unicode string, and no encoding attribute in the <?xml ...?> declaration.

toxml() only takes a single encoding argument, so unfortunately there isn't any way to tell it what to do for unicode characters which are not supported in the encoding you are using. However, if you then encode the unicode output to ascii with entity escapes, I think you should be alright (unless I've missed something):

>>> from xml.dom.minidom import parseString
>>> strHTML = '''<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
<head>
<title></title>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />
</head>
<body>
&#8364;
</body>
</html>'''
>>> print parseString(strHTML).toxml().encode('ascii', 'xmlcharrefreplace')
<?xml version="1.0" ?>
<html>
<head>
<title/>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"/>
</head>
<body>
&#8364;
</body>
</html>
>>>

You lose the encoding at the top of the output, but since the output is entirely ascii I don't think that matters.
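
For anyone trying this on Python 3, here is a minimal sketch of the same idea. It assumes toxml() is called without an encoding argument (so it returns a str) and that ASCII output is acceptable. The helper name to_ascii_markup and the sample document are made up for illustration and are not part of minidom:

```python
from xml.dom.minidom import parseString


def to_ascii_markup(source):
    """Re-serialise `source` as ASCII bytes.

    Hypothetical helper: any character outside ASCII is written as a
    numeric character reference (for example the Euro sign becomes
    &#8364;) by the standard 'xmlcharrefreplace' error handler.
    """
    # Without an encoding argument, toxml() returns text, not bytes.
    text = parseString(source).toxml()
    return text.encode("ascii", "xmlcharrefreplace")


# Illustrative input: the parser resolves &#8364; to the Euro character,
# and the encode step turns it back into a character reference.
doc = "<html><body>&#8364;</body></html>"
result = to_ascii_markup(doc)
print(result.decode("ascii"))
```

Note that the result is a bytes object on Python 3, so decode it (or write it to a file opened in binary mode) before further use.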