This is about a problem I thought I had solved.
I use lxml to update linguistically annotated TEI texts, where each word is
wrapped in a <w> element. In a typical workflow, corrections exist as a
dictionary whose keys are xml:id values. The script loops through a text, and
whenever an xml:id is a dictionary key, a simple function replaces the
attribute values in the corresponding <w> element.
At the end of each run, the tree of the text is written to a new document.
There is a final function, "sort_and_indent", appended below, which applies
some formatting to the text so that attributes appear in alphabetical order
and the text is consistently indented.
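To make the workflow concrete, here is a schematic version of the correction
pass; the namespace handling is real lxml usage, but the corrections
dictionary, the helper name, and the attribute names (lemma, pos) are
illustrative assumptions, not my actual script:

```python
from lxml import etree

# Clark-notation names for xml:id and the TEI namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical corrections: xml:id -> corrected attribute values.
corrections = {
    "w1": {"lemma": "run", "pos": "vvz"},
}

def apply_corrections(root, corrections):
    """Replace attribute values on each <w> whose xml:id is a key."""
    for w in root.iter(TEI + "w"):
        fixes = corrections.get(w.get(XML_ID))
        if fixes is not None:
            for name, value in fixes.items():
                w.set(name, value)
```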
This function worked for a while, and then it started acting up. It would run
through some texts, but after a short interval (sometimes 30 seconds,
sometimes five minutes) it would exit with an error message like
File
"/users/martinmueller/Dropbox/earlyprint/eebochron/1473-1623/159/a/159-ane-A21328.xml",
line 27
lxml.etree.XMLSyntaxError: Memory allocation failed, line 27, column 19
If you started the run again from the file on which it had exited, it would
process that file properly but stumble again some files later. If you remove
the function, the error disappears. That supports two conclusions:
1. The error has something to do with that function
2. It has nothing to do with the way it handles any individual text
When I reported this problem some time ago, I think Stefan advised me to
introduce a step that would clear memory after each text. I remember that
this didn't work, and I gave up on the function, thinking that perhaps there
was some problem with the way Python, Anaconda, and lxml interacted.
Recently I got rid of Anaconda and updated to the latest versions of Python
and lxml. The problem disappeared for a little while, but then it reappeared.
So I tried two ways of clearing memory again: I added either "tree = None" or
"del tree" as the last command for any given file.
This made no difference. The most plausible explanation for this behaviour is
that there is some cumulative effect that aborts the program when it crosses
some threshold. The way I reset doesn't work.
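In outline, the per-file loop with the attempted reset looks roughly like
this; the paths, output naming, and the elided correction step are schematic
placeholders, not my actual code:

```python
from lxml import etree

def process_files(paths):
    """Parse, edit, and write each file, dropping the tree afterwards."""
    for path in paths:
        tree = etree.parse(path)
        # ... apply corrections and sort_and_indent(tree.getroot()) here ...
        tree.write(str(path) + ".new", encoding="utf-8", xml_declaration=True)
        del tree  # the attempted reset; it made no difference in practice
```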
Oddly enough, while I have been writing this email, the script has run
through 343 texts in 645 seconds and is still chugging away. One of the texts
is very long, from which I gather that the cumulative length of texts
processed between failures is unlikely to be the cause. It finally failed
after 400 texts and 721 seconds. The next run failed after 44 texts and 41
seconds.
The error message refers to an lxml.etree.XMLSyntaxError. I looked up some
failure points in particular texts but couldn't see any pattern. Besides, the
point of failure is never the same the second time round.
The problem must have something to do with the little function below. If you
drop it, lxml will process thousands or tens of thousands of texts without
any problem, building one tree, doing stuff with it, and dropping it for
another tree. But what is it about this code that works perfectly on any
individual text yet has a cumulative effect that typically leads to failure
after two or three dozen texts?
def sort_and_indent(elem, level: int = 0):
    attrib = elem.attrib
    if len(attrib) > 1:
        attributes = sorted(attrib.items())
        attrib.clear()
        attrib.update(attributes)
    i = "\n" + " " * level
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + " "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for elem in elem:
            sort_and_indent(elem, level + 1)
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i
_______________________________________________
lxml - The Python XML Toolkit mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/lxml.python.org/
Member address: [email protected]