[lxml] Re: what you can and cannot do with lxml etree.strip_tags

jholg--- via lxml - The Python XML Toolkit Mon, 13 Apr 2026 04:37:49 -0700

Hi Martin,

On Montag, 13. April 2026 00:15:21 CEST Martin Mueller wrote:
> Lxml is an excellent program, but its documentation is very terse, and often
> assumes a reader who already "sort of"  knows what the actual reader does
> not.    Steven Pinker in his Sense of Style  has a chapter on "The Curse of
> Knowledge" with the wonderful topic sentence: "The main cause of
> incomprehensible prose is the difficulty of imagining what it's like for
> someone else not to know something that you know".


I consider lxml's documentation very good, but writing documentation is really
hard. Most of the time it is written by the experts, who naturally have exactly
the difficulty you mention, almost by design.

That said, this works both ways, i.e. it applies to questions and bug reports
too. ;-)

> I had trouble with etree.strip_tags  but discovered that you can't add
> secondary constraints to a strip_tag command.  

I don't quite understand what you mean by "secondary constraints" here.
Could you give a very small, reproducible example with input data that shows 
what you'd like to do but can't?

For example:

$ python3
Python 3.12.3 (main, Mar  3 2026, 12:15:18) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> data = "<root><div><hi>Hello</hi>, world!</div></root>"
>>> root = etree.fromstring(data)
>>> etree.tostring(root)
b'<root><div><hi>Hello</hi>, world!</div></root>'
>>> root.xpath("./div")    # or root[0] in this case
[<Element div at 0x7abca645ee40>]
>>> div_elem = root.xpath("./div")[0]  
>>> etree.strip_tags(div_elem, 'hi')
>>> etree.tostring(root)  # Look Ma, no <hi> any more!
b'<root><div>Hello, world!</div></root>'

This is very simplified; an actual example should represent the main 
characteristics of your use case, e.g. the different kinds of <hi> elements 
you are dealing with.

Concrete examples often work better here than describing your problem in 
prose. Moreover, a short reproducible example gives helpful souls an easy way 
to try it out themselves.

Sometimes it also helps to describe the intent of what you'd like to achieve, 
much like a code comment, e.g.

>>> # ...
>>> # Strip <hi attr='whatever'> elements from the tree but not other <hi>
>>> # elements. Descendants and text/tail content of stripped elements
>>> # shall be kept.

Or maybe invent a "hypothetical" function or API that describes your intent:

>>> def check_for_removal(elem):
...     # Check if this element should be removed, e.g. by XPath.
...     # If elem does not have a foo='bar' attribute, the XPath result
...     # will be an empty list:
...     foo_bar_attr = elem.xpath('@foo[.="bar"]')
...     return bool(foo_bar_attr)
...
>>> # Strip <hi> elements based on additional element conditions
>>> # (hypothetical; strip_tags() has no such parameter):
>>> # etree.strip_tags(div_elem, "hi", _condition=check_for_removal)

> Thus you can strip  all "hi"
> children of a div, but you cannot do this for <hi> tags with less than
> child elements. Is that the case? )

What does "tags with less than child elements" mean here?

(Guessing) do you want to only strip <hi> tags with a certain number of child 
elements?
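
If that guess is right, the selection itself could be written as an XPath
predicate on the number of child elements. A minimal sketch; the sample data
and the threshold of 2 are made up, since your sentence does not say which
number you had in mind:

```python
from lxml import etree

# Made-up sample data: three <hi> elements with 0, 1 and 2 child elements.
data = (
    "<root><div>"
    "<hi>plain</hi>"
    "<hi><b>one</b></hi>"
    "<hi><b>one</b><i>two</i></hi>"
    "</div></root>"
)
root = etree.fromstring(data)

# XPath count(*) counts child elements only, not text nodes.
# The threshold 2 is an assumption for illustration.
few_children = root.xpath(".//hi[count(*) < 2]")

# len(elem) is the number of children of an lxml element.
child_counts = [len(hi) for hi in few_children]
print(child_counts)  # [0, 1]
```

This only selects the elements; it does not strip anything yet.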

(Guessing) do you want to do something else to the <hi> tags or their 
children/attributes/text content *and* simultaneously strip them?

> If you want to operate on <hi> elements with further specification, you
> select them and then use  the "addnext" method, looping through the <hi>
> element in reverse order.

(Guessing) do you want to only operate on or strip certain <hi> elements (e.g. 
with special attributes or child elements) but not on other <hi> elements?

Assuming you want to select the <hi> elements to modify/remove not by tag name 
only but by more elaborate conditions, then yes, you can't use 
etree.strip_tags() directly.

Maybe this approach suggested previously could work for you:

https://mail.python.org/archives/list/[email protected]/message/2F7ZICTGEBIQEHRUOSGUE7DZQ7OAA4AZ/

Basically, select the elements to operate on by XPath, which can contain 
complex conditions, and rename those you'd like to remove to a tag name that 
is otherwise unused. Afterwards, use etree.strip_tags() on the renamed 
elements. I take it strip_tags() is what you want here because, unlike 
removing the elements, it keeps their descendants and text/tail content(?).
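
A minimal sketch of that rename-then-strip approach, in case it helps. The
marker tag name "strip-me", the rend attribute and the sample data are made up
for illustration; the lxml calls used are xpath(), the tag attribute and
etree.strip_tags():

```python
from lxml import etree

# Made-up sample data with two kinds of <hi> elements.
data = (
    "<root><div>"
    "<hi rend='italic'>Hello</hi>, <hi rend='bold'>big</hi> world!"
    "</div></root>"
)
root = etree.fromstring(data)

# Hypothetical marker name; it must not occur anywhere in your real data.
MARKER = "strip-me"

# 1. Select only the <hi> elements that match the additional condition.
for elem in root.xpath(".//hi[@rend='italic']"):
    # 2. Rename them so they can be addressed by tag name alone.
    elem.tag = MARKER

# 3. Strip the renamed elements; their text, tail and children are kept.
etree.strip_tags(root, MARKER)

result = etree.tostring(root)
print(result)  # b'<root><div>Hello, <hi rend="bold">big</hi> world!</div></root>'
```

The other <hi> element is left alone because it was never renamed.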

Probably not too relevant in this case, but just to mention it:
It can also help to include version information for your concrete environment 
in such examples, i.e. the Python version, the lxml version and the versions 
of libxml2/libxslt (the C libraries that "power" lxml):

>>> etree.__version__
'5.2.1'
>>> etree.LXML_VERSION
(5, 2, 1, 0)
>>> etree.LIBXML_VERSION
(2, 9, 14)
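
To collect all of this in one go, something along these lines could be pasted
into a report. It is only a sketch: the dictionary labels are my own wording,
while the version constants are the ones lxml.etree exposes:

```python
import sys
from lxml import etree

# Labels are free text; the values come from sys and lxml.etree.
versions = {
    "Python": sys.version.split()[0],
    "lxml": etree.LXML_VERSION,
    "libxml2 (in use)": etree.LIBXML_VERSION,
    "libxml2 (compiled against)": etree.LIBXML_COMPILED_VERSION,
    "libxslt (in use)": etree.LIBXSLT_VERSION,
    "libxslt (compiled against)": etree.LIBXSLT_COMPILED_VERSION,
}

for name, version in versions.items():
    print(f"{name:28} {version}")
```

The "in use" and "compiled against" versions can differ, which is sometimes
the interesting bit.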

Best regards
Holger
