This is about a problem I thought I had solved.

I use lxml to update linguistically annotated TEI texts, where each word is
wrapped in a <w> element. In a typical workflow, corrections exist as a
dictionary whose keys are xmlids. The script loops through a text. If an xmlid
is a dictionary key, a simple function replaces the attribute values in the
<w> element.
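
To make the workflow concrete, here is a minimal sketch of the correction step. It is not the actual script: the names `apply_corrections` and `corrections`, the sample xmlids, and the `lemma` and `pos` attributes are hypothetical stand-ins, and the fallback import is there only so the sketch also runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

TEI_W = "{http://www.tei-c.org/ns/1.0}w"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def apply_corrections(root, corrections):
    """Overwrite attributes on every <w> whose xml:id is a key in corrections."""
    changed = 0
    for w in root.iter(TEI_W):
        new_values = corrections.get(w.get(XML_ID))
        if new_values:
            for name, value in new_values.items():
                w.set(name, value)
            changed += 1
    return changed


# Invented sample text and corrections, for illustration only.
sample = (
    '<text xmlns="http://www.tei-c.org/ns/1.0">'
    '<w xml:id="A21328-0010" lemma="teh" pos="n1">the</w>'
    '<w xml:id="A21328-0020" lemma="word" pos="n1">word</w>'
    '</text>'
)
root = etree.fromstring(sample)
corrections = {"A21328-0010": {"lemma": "the", "pos": "d"}}
changed = apply_corrections(root, corrections)
```

Only the first <w> is touched here, because only its xmlid appears as a key in the dictionary.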

At the end of each run, the tree of the text is written to a new document.
There is a final function, sort_and_indent, appended below, which applies some
formatting to the text so that attributes appear in alphabetical order and text
is consistently indented.

This function worked for a while, and then it started acting up. It would run
through some texts but after a short interval (sometimes 30 seconds, sometimes
five minutes) it would exit with an error message like

  File "/users/martinmueller/Dropbox/earlyprint/eebochron/1473-1623/159/a/159-ane-A21328.xml", line 27
lxml.etree.XMLSyntaxError: Memory allocation failed, line 27, column 19

If you started the run again from the file on which it had exited, it would
process that file properly but stumble again some files later. If you remove
the function, the error disappears. This supports two conclusions:


  1.  The error has something to do with that function
  2.  It has nothing to do with the way it handles any individual text

When I reported that problem at an earlier time, I think that Stefan advised me
to introduce some step that would clear memory after each single text. I
remember that this didn’t work and I gave up on that function, thinking that
perhaps there was some problem with the way in which Python, Anaconda and lxml
interacted.

Recently I got rid of Anaconda and updated to the latest versions of Python and
lxml. The problem disappeared for a little while, but then it reappeared. So I
again tried two ways of clearing memory: I added either “tree=’None’” or “del
tree” as the last command for any given file.

This made no difference. The most plausible explanation for this behaviour is
that there is some cumulative effect which aborts the program when it crosses
some threshold. The way I reset doesn’t work.

Oddly enough, while I have been writing this email, the script has run through 
343 texts in 645 seconds and is still chugging away. One of the texts is very 
long, from which I gather that the cumulative length of texts processed between 
failures is unlikely to be the cause of failure. It finally failed after 400
texts and 721 seconds. The next run failed after 44 texts and 41 seconds.
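
Counts like these can be collected automatically rather than by watching the run. The following is a rough sketch, assuming a per-file function is passed in as `process`; `run_until_failure`, `fake_process`, and the file names are hypothetical and exist only to show the bookkeeping.

```python
import time


def run_until_failure(paths, process):
    """Call process(path) for each path and report how far the run got."""
    start = time.perf_counter()
    done = 0
    for path in paths:
        try:
            process(path)
        except Exception as exc:
            # Record where and when the run stopped instead of losing it.
            return {
                "done": done,
                "seconds": time.perf_counter() - start,
                "failed_on": path,
                "error": repr(exc),
            }
        done += 1
    return {
        "done": done,
        "seconds": time.perf_counter() - start,
        "failed_on": None,
        "error": None,
    }


def fake_process(path):
    # Stand-in for the real per-file work; fails on one invented file name.
    if path == "c.xml":
        raise ValueError("simulated failure")


report = run_until_failure(["a.xml", "b.xml", "c.xml", "d.xml"], fake_process)
```

Comparing such reports across several runs would show whether the number of texts, the elapsed time, or neither is stable at the point of failure.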

The error message refers to an lxml.etree.XMLSyntaxError. I looked up some
failure points in particular texts, but couldn’t see any pattern. Besides, the
point of failure is never a point of failure the second time round.

The problem must have something to do with the little function below. If you
drop it, lxml will process thousands or tens of thousands of texts without any
problem, building one tree, doing stuff with it, and dropping it for another
tree. But what is it about this code that works perfectly on any individual
text but has a cumulative effect that typically leads to failure after two or
three dozen texts?




def sort_and_indent(elem, level: int = 0):
    # Re-insert the attributes in alphabetical order.
    attrib = elem.attrib
    if len(attrib) > 1:
        attributes = sorted(attrib.items())
        attrib.clear()
        attrib.update(attributes)

    # A newline followed by one space per nesting level.
    i = "\n" + " " * level

    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + " "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for child in elem:
            sort_and_indent(child, level + 1)
        # The last child's tail steps back to the parent's indentation.
        if not child.tail or not child.tail.strip():
            child.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i
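
To check the function in isolation, it can be run on a tiny in-memory tree. This is only a sketch: the sample markup and attribute values are invented, the fallback import is there so it runs without lxml, and the function body is repeated so the snippet runs by itself (the loop variable is called `child` here, which does not change the behaviour).

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree


def sort_and_indent(elem, level: int = 0):
    attrib = elem.attrib
    if len(attrib) > 1:
        attributes = sorted(attrib.items())
        attrib.clear()
        attrib.update(attributes)

    i = "\n" + " " * level

    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + " "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for child in elem:
            sort_and_indent(child, level + 1)
        # The last child's tail steps back to the parent's indentation.
        if not child.tail or not child.tail.strip():
            child.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i


# Invented sample: attributes deliberately out of alphabetical order.
root = etree.fromstring(
    '<text><p><w reg="the" pos="d" lemma="the">the</w>'
    '<w pos="n1" lemma="word">word</w></p></text>'
)
sort_and_indent(root)
p = root[0]
first_w, second_w = p[0], p[1]
first_keys = list(first_w.attrib.keys())
```

After the call, the attributes of the first <w> come back in the order lemma, pos, reg, and the whitespace in `text` and `tail` carries the indentation.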
