Hi Martin,
On Monday, 13 April 2026 00:15:21 CEST Martin Mueller wrote:
> Lxml is an excellent program, but its documentation is very terse, and often
> assumes a reader who already "sort of" knows what the actual reader does
> not. Steven Pinker in his Sense of Style has a chapter on "The Curse of
> Knowledge" with the wonderful topic sentence: "The main cause of
> incomprehensible prose is the difficulty of imagining what it's like for
> someone else not to know something that you know".
I consider lxml's documentation very good, but writing documentation is
really hard. Most of the time it is written by the experts, who naturally have
exactly the difficulty Pinker describes, almost by design.
That said, this works both ways, i.e. it also applies to questions and bug
reports. ;-)
> I had trouble with etree.strip_tags but discovered that you can't add
> secondary constraints to a strip_tag command.
I don't quite understand what you mean by "secondary constraints" here.
Could you give a very small, reproducible example with input data that shows
what you'd like to do but can't?
For example:
$ python3
Python 3.12.3 (main, Mar 3 2026, 12:15:18) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> data = "<root><div><hi>Hello</hi>, world!</div></root>"
>>> root = etree.fromstring(data)
>>> etree.tostring(root)
b'<root><div><hi>Hello</hi>, world!</div></root>'
>>> root.xpath("./div") # or root.getchildren()[0] in this case
[<Element div at 0x7abca645ee40>]
>>> div_elem = root.xpath("./div")[0]
>>> etree.strip_tags(div_elem, 'hi')
>>> etree.tostring(root) # Look Ma, no <hi> any more!
b'<root><div>Hello, world!</div></root>'
This is very simplified. An actual example should represent the main
characteristics of your use case, e.g. the different kinds of <hi> elements
you are dealing with.
Concrete examples often work better here than a description of the problem in
prose. Moreover, a short reproducible example gives helpful souls an easy way
to try it out themselves.
Sometimes it can be good to then describe the intent of what you'd like to
achieve much like a code comment, e.g.
>>> # ...
>>> # Strip <hi attr='whatever'> elements from the tree but not other <hi>
>>> # elements. Descendants and text/tail content of stripped elements
>>> # shall be kept.
Or maybe invent a "hypothetical" function or API that describes your intent:
>>> def check_for_removal(elem):
...     # Check whether this element should be removed, e.g. by XPath.
...     # If elem does not have a foo='bar' attribute, the XPath result
...     # will be an empty list:
...     foo_bar_attr = elem.xpath('@foo[.="bar"]')
...     return bool(foo_bar_attr)
...
>>> # Strip <hi> elements based on additional element conditions
>>> # etree.strip_tags(div_elem, "hi", _condition=check_for_removal)
> Thus you can strip all "hi"
> children of a div, but you cannot do this for <hi> tags with less than
> child elements. Is that the case? )
What does "tags with less than child elements" mean here?
(Guessing) do you want to only strip <hi> tags with a certain number of child
elements?
(Guessing) do you want to do something else to the <hi> tags or their
children/attributes/text content *and* simultaneously strip them?
> If you want to operate on <hi> elements with further specification, you
> select them and then use the "addnext" method, looping through the <hi>
> element in reverse order.
(Guessing) do you want to only operate on or strip certain <hi> elements (e.g.
with special attributes or child elements) but not on other <hi> elements?
Assuming you want to select the <hi> elements to modify/remove not by tag name
alone but by more elaborate conditions, then yes, you can't use
etree.strip_tags() directly.
Maybe this approach suggested previously could work for you:
https://mail.python.org/archives/list/[email protected]/message/2F7ZICTGEBIQEHRUOSGUE7DZQ7OAA4AZ/
Basically, select the elements to operate on by XPath, which can contain
complex conditions, and rename those you'd like to remove.
Afterwards, use etree.strip_tags() on the renamed elements.
I take it strip_tags() is attractive here because it removes only the tags
themselves and keeps element descendants and text/tail content(?).
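A minimal sketch of that rename-then-strip approach, in case it helps. The
input data, the rend attribute and the "strip-me" placeholder tag are made up
for illustration; the placeholder is assumed not to occur anywhere else in
your documents.

```python
from lxml import etree

# Made-up input: two kinds of <hi> elements, told apart by a "rend" attribute.
data = (
    '<root><div>'
    '<hi rend="bold">Hello</hi>, <hi rend="italic">wide</hi> world!'
    '</div></root>'
)
root = etree.fromstring(data)

# Hypothetical placeholder tag, assumed to be unused in the document.
MARKER = "strip-me"

# Step 1: select only the <hi> elements that meet the extra condition
# and rename them. The XPath predicate can be as elaborate as needed.
for elem in root.xpath('.//hi[@rend="bold"]'):
    elem.tag = MARKER

# Step 2: strip the renamed elements. Their text, tail and children are
# merged into the parent; the other <hi> elements are left alone.
etree.strip_tags(root, MARKER)

result = etree.tostring(root)
print(result)
# b'<root><div>Hello, <hi rend="italic">wide</hi> world!</div></root>'
```

Note that the attributes of the renamed elements go away together with the
tag, so if you still need them you would have to copy them elsewhere first.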
Probably not too relevant in this case but just to mention it:
It can also be beneficial to include version information for your concrete
environment in such examples, like Python version, lxml version, libxml2/
libxslt (the C libraries that "power" lxml) versions:
>>> etree.__version__
'5.2.1'
>>> etree.LXML_VERSION
(5, 2, 1, 0)
>>> etree.LIBXML_VERSION
(2, 9, 14)
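If you'd rather collect all of that in one go, something like the following
sketch could be pasted into a report. describe_environment is a name I made
up for this mail, not part of lxml; the version attributes it reads are the
ones lxml.etree exposes.

```python
import sys

from lxml import etree


def describe_environment():
    """Return the versions that are usually worth quoting in a report.

    Hypothetical helper for illustration, not an lxml API.
    """
    return {
        "python": sys.version.split()[0],
        "lxml": etree.LXML_VERSION,
        "libxml2": etree.LIBXML_VERSION,
        "libxslt": etree.LIBXSLT_VERSION,
    }


versions = describe_environment()
for name, version in versions.items():
    print(f"{name:8} {version}")
```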
Best regards
Holger