Alan Meyer, 28.12.2010 01:29:
> On 12/27/2010 4:55 PM, Stefan Behnel wrote:
>> From my experience, SAX is only practical for very simple cases where
>> little state is involved when extracting information from the parse
>> events. A typical example is gathering statistics based on single tags -
>> not a very common use case. Anything that involves knowing where in the
>> XML tree you are to figure out what to do with the event is already too
>> complicated. The main drawback of SAX is that the callbacks run in
>> separate method calls, so you have to do all the state keeping manually
>> through fields of the SAX handler instance.
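To make the state-keeping point concrete, a minimal SAX handler might look like this (the tag names and the extraction task are made up for illustration):

```python
import io
import xml.sax

class TitleHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_title = False  # where-are-we flag, kept by hand
        self.chunks = []       # character buffer for the open <title>
        self.titles = []       # collected results

    def startElement(self, name, attrs):
        if name == 'title':
            self.in_title = True
            self.chunks = []

    def characters(self, content):
        # may be called several times for a single text node
        if self.in_title:
            self.chunks.append(content)

    def endElement(self, name):
        if name == 'title':
            self.in_title = False
            self.titles.append(''.join(self.chunks))

handler = TitleHandler()
xml.sax.parse(io.BytesIO(
    b'<books><book><title>A</title></book>'
    b'<book><title>B</title></book></books>'), handler)
print(handler.titles)  # ['A', 'B']
```

Even for this toy task, three handler fields are needed just to remember where in the document the parser currently is.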

>> My serious advice is: don't waste your time learning SAX. It's simply
>> too frustrating to debug SAX extraction code into existence. Given how
>> simple and fast it is to extract data with ElementTree's iterparse() in
>> a memory-efficient way, there is really no reason to write complicated
>> SAX code instead.

> I confess that I hadn't been thinking about iterparse(). I presume that
> clear() is required with iterparse() if we're going to process files of
> arbitrary length.
>
> I should think that this approach provides an intermediate solution. It's
> more work than building the full tree in memory, because the programmer has
> to do some additional housekeeping to call clear() at the right time and
> place. But it's less housekeeping than SAX.

The iterparse() implementation in lxml.etree allows you to intercept on a specific tag name, which is especially useful for large XML documents that are basically an endless sequence of (however deeply structured) top-level elements - arguably the most common format for gigabyte-sized XML files. So what I usually do here is to intercept on the top-level tag name, clear() that tag after use and leave it dangling around, like this:

    from lxml import etree as ET  # iterparse(tag=...) is lxml-specific

    for _, element in ET.iterparse(source, tag='toptagname'):
        # ... work on the element and its subtree
        element.clear()  # drop the children, leave the empty tag in place

That allows you to write simple in-memory tree handling code (iteration, XPath, XSLT, whatever), while pushing the performance up (compared to ET's iterparse that returns all elements) and keeping the total amount of memory usage reasonably low. Even a series of several hundred thousand empty top-level tags doesn't add up to anything that would truly hurt a decent machine.
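For comparison, the standard library's xml.etree.ElementTree lacks the tag= filter, so the same pattern needs an explicit tag check on the 'end' events - a sketch with made-up tag names:

```python
import io
import xml.etree.ElementTree as ET

xml_data = b'<feed>' + b''.join(
    b'<entry><id>%d</id></entry>' % i for i in range(3)) + b'</feed>'

ids = []
# stdlib iterparse has no tag= filter, so check the tag name ourselves
for event, element in ET.iterparse(io.BytesIO(xml_data), events=('end',)):
    if element.tag == 'entry':
        ids.append(element.findtext('id'))
        element.clear()  # free the subtree after use, as in the lxml idiom

print(ids)  # ['0', '1', '2']
```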

In many cases where I know that the XML file easily fits into memory anyway, I don't even do any housekeeping at all. And the true advantage is: if you ever find that it's needed because the file sizes grow beyond your initial expectations, you don't have to touch your tested and readily debugged data extraction code, just add a suitable bit of cleanup code, or even switch from the initial all-in-memory parse() solution to an event-driven iterparse()+cleanup solution.
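A sketch of that upgrade path (all names hypothetical): the tested extract() function is untouched, only the driver loop around it changes:

```python
import io
import xml.etree.ElementTree as ET

DATA = (b'<db><record><name>ann</name></record>'
        b'<record><name>bob</name></record></db>')

def extract(record):
    # tested, debugged extraction code - works on any element subtree
    return record.findtext('name')

# initial all-in-memory version
names = [extract(r) for r in ET.parse(io.BytesIO(DATA)).iter('record')]

# later, memory-conscious version: same extract(), plus a bit of cleanup
names2 = []
for _, elem in ET.iterparse(io.BytesIO(DATA)):
    if elem.tag == 'record':
        names2.append(extract(elem))
        elem.clear()  # free the subtree once extracted

print(names, names == names2)  # ['ann', 'bob'] True
```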


> I guess I've done enough SAX, in enough different languages, that I don't
> find it that onerous to use. When I need an element stack to keep track of
> things, I can usually re-use code I've written for other applications. But
> for a programmer who doesn't do a lot of this stuff, I agree, the learning
> curve with lxml will be shorter and the programming and debugging can be
> faster.

I'm aware that SAX has the advantage of being available for more languages. But if you are in the lucky position of being able to use Python for XML processing, why not just use the tools that it makes available?

Stefan

--
http://mail.python.org/mailman/listinfo/python-list