Robin Becker wrote at 2022-1-12 10:22 +0000:
>I have a puzzle over how lxml & entities should be 'preserved' code below
>illustrates. To preserve I change & --> &amp;
>in the source and add resolve_entities=False to the parser definition. The
>escaping means we only have one kind of
>entity &amp; which means lxml will preserve it. For whatever reason lxml won't
>preserve character entities eg &#33;.
>
>The simple parse from string and conversion tostring shows that the parsing at
>least took notice of it.
>
>However, I want to create a tuple tree so have to use tree.text,
>tree.getchildren() and tree.tail for access.
>
>When I use those I expected to have to undo the escaping to get back the
>original entities, but it seems they are
>already done.
>
>Good for me, but if the tree knows how it was created (tostring shows that)
>why is it ignored with attribute access?
>
>if __name__=='__main__':
>    from lxml import etree as ET
>    #initial xml
>    xml = b'<a attr="&mysym; &lt; &amp; &gt; &#33;">aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA</a>'
>    #escaped xml
>    xxml = xml.replace(b'&',b'&amp;')
>
>    myparser = ET.XMLParser(resolve_entities=False)
>    tree = ET.fromstring(xxml,parser=myparser)
>
>    #use tostring
>    print(f'using tostring\n{xxml=!r}\n{ET.tostring(tree)=!r}\n')
>
>    #now access the items using text & children & text
>    print(f'using attributes\n{tree.text=!r}\n{tree.getchildren()=!r}\n{tree.tail=!r}')
>
>when run I see this
>
>$ python tmp/tlp.py
>using tostring
>xxml=b'<a attr="&amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33;">aaaaa &amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33; AAAAA</a>'
>ET.tostring(tree)=b'<a attr="&amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33;">aaaaa &amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33; AAAAA</a>'
>
>using attributes
>tree.text='aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA'
>tree.getchildren()=[]
>tree.tail=None

Apparently, `resolve_entities=False` was not effective: otherwise, your tree content should have more structure (in particular, some entity reference children).

`&#<value>;` is not an entity reference but a character reference. It may rightfully be treated differently from entity references.
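
To illustrate the distinction, here is a minimal sketch. It uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml, on the assumption that lxml's ElementTree-compatible API behaves the same way for these particular calls; the sample documents are made up for illustration. It shows that a character reference is resolved by the parser, that an escaped `&amp;` simply becomes a literal `&` in `.text`, and that serialization applies the escaping again.

```python
import xml.etree.ElementTree as ET

# A character reference (&#33;) is resolved by the XML parser itself;
# the tree only ever sees the character "!".
char_ref = ET.fromstring(b"<a>x &#33; y</a>")
print(repr(char_ref.text))    # 'x ! y'

# After escaping every "&" as "&amp;", the document contains no entity
# or character references other than the predefined &amp;.
# The parser turns each &amp; back into a literal "&" in the text.
escaped = ET.fromstring(b"<a>x &amp;mysym; &amp;#33; y</a>")
print(repr(escaped.text))     # 'x &mysym; &#33; y'

# Serialization escapes the literal "&" again, which is why the
# tostring() output looks like the escaped input.
print(ET.tostring(escaped))   # b'<a>x &amp;mysym; &amp;#33; y</a>'
```

In other words, the tree does not remember how the text was written in the source: `.text` holds the decoded characters, and `tostring()` re-escapes them on output.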