Re: Trying to parse a HUGE(1gb) xml file

Alan Meyer Mon, 27 Dec 2010 16:33:18 -0800

On 12/27/2010 4:55 PM, Stefan Behnel wrote:
...

From my experience, SAX is only practical for very simple cases where
little state is involved when extracting information from the parse
events. A typical example is gathering statistics based on single tags -
not a very common use case. Anything that involves knowing where in the
XML tree you are to figure out what to do with the event is already too
complicated. The main drawback of SAX is that the callbacks run into
separate method calls, so you have to do all the state keeping manually
through fields of the SAX handler instance.


My serious advice is: don't waste your time learning SAX. It's simply
too frustrating to debug SAX extraction code into existence. Given how
simple and fast it is to extract data with ElementTree's iterparse() in
a memory efficient way, there is really no reason to write complicated
SAX code instead.

Stefan

I confess that I hadn't been thinking about iterparse(). I presume that clear() is required with iterparse() if we're going to process files of arbitrary length.

I should think that this approach provides an intermediate solution. It's more work than building the full tree in memory because the programmer has to do some additional housekeeping to call clear() at the right time and place. But it's less housekeeping than SAX.
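
To make the clear() housekeeping concrete, here is a minimal sketch using the standard library's xml.etree.ElementTree.iterparse(). The function name iter_records, the sample data, and the tag name "item" are made up for illustration, and the sketch assumes the records of interest are direct children of the root element. Clearing the root after each record is one common way to keep memory bounded; other document layouts may need clear() called elsewhere.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag):
    """Yield {child tag: child text} for each <tag> element in source.

    Assumption: <tag> elements are direct children of the root, so
    clearing the root after each one releases the finished record.
    """
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the "start" of the root element; keep a handle on it.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            # The record is complete here: all of its children are parsed.
            yield {child.tag: child.text for child in elem}
            # Housekeeping: drop the root's references to processed records.
            root.clear()


if __name__ == "__main__":
    # Hypothetical sample data standing in for a very large file.
    sample = io.BytesIO(
        b"<items>"
        b"<item><name>a</name><qty>1</qty></item>"
        b"<item><name>b</name><qty>2</qty></item>"
        b"</items>"
    )
    for record in iter_records(sample, "item"):
        print(record)
```

The same function accepts a filename in place of the file object, since iterparse() takes either.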

I guess I've done enough SAX, in enough different languages, that I don't find it that onerous to use. When I need an element stack to keep track of things I can usually re-use code I've written for other applications. But for a programmer that doesn't do a lot of this stuff, I agree, the learning curve with lxml will be shorter and the programming and debugging can be faster.
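
For comparison, the element-stack bookkeeping mentioned above might look roughly like the following with the standard library's xml.sax module. This is only an illustrative sketch, not code from any existing application: the class name TextAtPath, its attributes, and the sample document are all hypothetical. It shows the manual state keeping that SAX requires, with the handler tracking its position in the tree through its own fields.

```python
import io
import xml.sax


class TextAtPath(xml.sax.ContentHandler):
    """Collect the text of every element found at one exact path.

    The element stack is kept by hand in self.stack, since SAX itself
    does not tell the handler where in the tree an event occurred.
    """

    def __init__(self, target_path):
        super().__init__()
        self.target = tuple(target_path)  # e.g. ("items", "item", "name")
        self.stack = []      # names of the currently open elements
        self.buffer = None   # text pieces while inside a target element
        self.values = []     # collected results

    def startElement(self, name, attrs):
        self.stack.append(name)
        if tuple(self.stack) == self.target:
            self.buffer = []

    def characters(self, content):
        # The parser may deliver text in several chunks, so accumulate it.
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        if self.buffer is not None and tuple(self.stack) == self.target:
            self.values.append("".join(self.buffer))
            self.buffer = None
        self.stack.pop()


if __name__ == "__main__":
    # Hypothetical sample data; a real run would pass a file or filename.
    sample = io.BytesIO(
        b"<items><item><name>a</name><qty>1</qty></item>"
        b"<item><name>b</name></item></items>"
    )
    handler = TextAtPath(["items", "item", "name"])
    xml.sax.parse(sample, handler)
    print(handler.values)
```

Memory use stays flat here without any clear() calls, at the cost of spreading the logic over three callbacks.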


    Alan
--
http://mail.python.org/mailman/listinfo/python-list
