Robin Becker wrote at 2022-1-12 10:22 +0000:
>I have a puzzle over how lxml & entities should be 'preserved'; the code below
>illustrates. To preserve them I change & --> &amp;
>in the source and add resolve_entities=False to the parser definition. The
>escaping means we only have one kind of
>entity, &amp;, which means lxml will preserve it. For whatever reason lxml won't
>preserve character entities, e.g. &#33;.
>
>The simple parse from string and conversion tostring shows that the parsing at 
>least took notice of it.
>
>However, I want to create a tuple tree so have to use tree.text, 
>tree.getchildren() and tree.tail for access.
>
>When I use those I expected to have to undo the escaping to get back the 
>original entities, but it seems they are
>already done.
>
>Good for me, but if the tree knows how it was created (tostring shows that) 
>why is it ignored with attribute access?
>
>if __name__=='__main__':
>     from lxml import etree as ET
>     #initial xml
>     xml = b'<a attr="&mysym; &lt; &amp; &gt; &#33;">aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA</a>'
>     #escaped xml
>     xxml = xml.replace(b'&',b'&amp;')
>
>     myparser = ET.XMLParser(resolve_entities=False)
>     tree = ET.fromstring(xxml,parser=myparser)
>
>     #use tostring
>     print(f'using tostring\n{xxml=!r}\n{ET.tostring(tree)=!r}\n')
>
>     #now access the items using text & children & tail
>     print(f'using attributes\n{tree.text=!r}\n{tree.getchildren()=!r}\n{tree.tail=!r}')
>
>when run I see this
>
>$ python tmp/tlp.py
>using tostring
>xxml=b'<a attr="&amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33;">aaaaa &amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33; AAAAA</a>'
>ET.tostring(tree)=b'<a attr="&amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33;">aaaaa &amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33; AAAAA</a>'
>
>using attributes
>tree.text='aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA'
>tree.getchildren()=[]
>tree.tail=None

Apparently, `resolve_entities=False` was not effective: otherwise,
your tree would have more structure (in particular, some
entity reference children).
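
One possible reading of the output above, sketched below under stated
assumptions: after the `replace`, the only reference left in the source is
the predefined `&amp;`, which the parser decodes into plain character data,
so `.text` already holds the unescaped string and the escaping only
reappears on serialization. The sketch uses the standard library's
`xml.etree.ElementTree` instead of lxml (an assumption made so that it runs
without extra packages); it is not a statement about lxml internals.

```python
import xml.etree.ElementTree as ET

# Same input as in the quoted script.
xml = b'<a attr="&mysym; &lt; &amp; &gt; &#33;">aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA</a>'

# Escape every ampersand, so the source only contains "&amp;" references.
xxml = xml.replace(b'&', b'&amp;')

tree = ET.fromstring(xxml)

# The parser decodes "&amp;" into "&", so attribute and text access
# return the strings of the *initial* xml, not the escaped ones.
text = tree.text
attr = tree.get('attr')

# Serialization escapes "&" again, which is why tostring shows "&amp;".
serialized = ET.tostring(tree)

print(f'{text=!r}')
print(f'{attr=!r}')
print(f'{serialized=!r}')
```

In this sketch the element has no children either: the escaped input
simply contains no entity references that could become nodes.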

`&#<value>;` is not an entity reference but a character reference.
It may rightfully be treated differently from entity references.
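
A minimal sketch of that difference, again using the standard library's
`xml.etree.ElementTree` as a stand-in (an assumption; lxml may report the
second case differently): a character reference needs no declaration and is
decoded directly, while an entity reference to an undeclared name is
rejected by the parser.

```python
import xml.etree.ElementTree as ET

# Character reference: &#33; is code point 33, decoded without any DTD.
char_ref = ET.fromstring(b'<a>&#33;</a>')
print(repr(char_ref.text))

# Entity reference: &mysym; refers to a named entity that is not declared
# anywhere in this document, so parsing fails.
undefined_entity_error = None
try:
    ET.fromstring(b'<a>&mysym;</a>')
except ET.ParseError as exc:
    undefined_entity_error = exc
    print('entity reference rejected:', exc)
```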
-- 
https://mail.python.org/mailman/listinfo/python-list
