New submission from Dave Hughes <[EMAIL PROTECTED]>:

In the ElementTree and cElementTree implementations in Python 2.5 (and possibly Python 2.6, as I also found this issue when testing an SVN checkout of ElementTree 1.3), converting a ProcessingInstruction to a string converts XML reserved characters (<, >, &) to character entities:

>>> from xml.etree.ElementTree import *
>>> tostring(ProcessingInstruction('test', '<testing&>'))
'<?test &lt;testing&amp;&gt;?>'
>>> from xml.etree.cElementTree import *
>>> tostring(ProcessingInstruction('test', '<testing&>'))
'<?test &lt;testing&amp;&gt;?>'

The XML 1.0 spec is rather vague on whether character entities are permitted in PIs (it explicitly states that parameter entities are not parsed in PIs, but says nothing about parsing character entities). However, it does have this to say in section 2.4, "Character Data and Markup":

"The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section."

So XML reserved characters don't need converting in PIs (the only string not permitted in a PI's content according to the spec, section 2.6, is '?>'), which sort of implies that they shouldn't be converted. As for practical reasons why they shouldn't be:

Breaks generated PHP:

>>> from xml.etree.cElementTree import *
>>> doc = Element('html')
>>> SubElement(doc, 'head')
<Element 'head' at 0x2af4e3b8a9f0>
>>> SubElement(doc, 'body')
<Element 'body' at 0x2af4e3b922a0>
>>> doc[1].append(ProcessingInstruction('php', 'if (2 < 1) print "<p>Something has gone horribly wrong!</p>";'))
>>> tostring(doc)
'<html><head /><body><?php if (2 &lt; 1) print "&lt;p&gt;Something has gone horribly wrong!&lt;/p&gt;";?></body></html>'

Different from xml.dom:

>>> from xml.dom.minidom import *
>>> i = getDOMImplementation()
>>> doc = i.createDocument(None, 'html', None)
>>> doc.documentElement.appendChild(doc.createElement('head'))
<DOM Element: head at 0x8c6170>
>>> doc.documentElement.appendChild(doc.createElement('body'))
<DOM Element: body at 0x8c6290>
>>> doc.documentElement.lastChild.appendChild(doc.createProcessingInstruction('test', '<testing&>'))
<xml.dom.minidom.ProcessingInstruction instance at 0x8c63b0>
>>> doc.toxml()
'<?xml version="1.0" ?>\n<html><head/><body><?test <testing&>?></body></html>'

Different from lxml:

>>> from lxml.etree import *
>>> tostring(ProcessingInstruction('test', '<testing&>'))
'<?test <testing&>?>'

I suspect the only change necessary to fix this is to replace the _escape_cdata() call for ProcessingInstruction (and possibly Comment too, given the spec quote above) in ElementTree._write() with an _encode() call, as shown in this patch (which includes the Comment change as well):

Index: elementtree/ElementTree.py
===================================================================
--- elementtree/ElementTree.py (revision 511)
+++ elementtree/ElementTree.py (working copy)
@@ -663,9 +663,9 @@
         # write XML to file
         tag = node.tag
         if tag is Comment:
-            file.write("<!-- %s -->" % _escape_cdata(node.text, encoding))
+            file.write("<!-- %s -->" % _encode(node.text, encoding))
         elif tag is ProcessingInstruction:
-            file.write("<?%s?>" % _escape_cdata(node.text, encoding))
+            file.write("<?%s?>" % _encode(node.text, encoding))
         else:
             items = node.items()
             xmlns_items = [] # new namespaces in this scope

Sorry, I haven't got a similar patch for cElementTree. I've had a quick look through the source, but haven't yet figured out where the change should be made (unless it's not required: does cElementTree reuse that bit of ElementTree?).

----------
components: XML
messages: 66154
nosy: waveform
severity: normal
status: open
title: ElementTree ProcessingInstruction uses character entities in content
type: behavior
versions: Python 2.5

__________________________________
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2746>
__________________________________
_______________________________________________
Python-bugs-list mailing list
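
As a rough illustration of the behaviour requested in the report above, here is a minimal standalone sketch. It assumes the serializer should pass <, > and & through untouched and reject only the '?>' delimiter that section 2.6 forbids. The name serialize_pi is hypothetical; it is not part of ElementTree, cElementTree, or the patch.

```python
def serialize_pi(target, text=None):
    """Return a processing instruction as a string without entity escaping.

    Hypothetical helper for illustration only; not an ElementTree API.
    """
    # Join target and text the way the examples in the report show them:
    # '<?target text?>', or just '<?target?>' when there is no text.
    if text:
        content = "%s %s" % (target, text)
    else:
        content = target

    # Assumption based on XML 1.0 section 2.6: the only string that may
    # not appear inside a PI is its own closing delimiter.
    if "?>" in content:
        raise ValueError("'?>' is not allowed inside a processing instruction")

    # Reserved characters (<, >, &) are written literally.
    return "<?%s?>" % content


print(serialize_pi("test", "<testing&>"))
```

With this sketch, the PI from the report comes out as '<?test <testing&>?>', matching the xml.dom.minidom and lxml output quoted above, while content containing '?>' raises ValueError instead of producing malformed XML.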