On 12/27/2010 4:55 PM, Stefan Behnel wrote:
...
From my experience, SAX is only practical for very simple cases where
little state is involved when extracting information from the parse
events. A typical example is gathering statistics based on single tags -
not a very common use case. Anything that involves knowing where in the
XML tree you are to figure out what to do with the event is already too
complicated. The main drawback of SAX is that the callbacks run into
separate method calls, so you have to do all the state keeping manually
through fields of the SAX handler instance.

My serious advice is: don't waste your time learning SAX. It's simply
too frustrating to debug SAX extraction code into existence. Given how
simple and fast it is to extract data with ElementTree's iterparse() in
a memory efficient way, there is really no reason to write complicated
SAX code instead.

Stefan


I confess that I hadn't been thinking about iterparse(). I presume that clear() is required with iterparse() if we're going to process files of arbitrary length.

I should think that this approach provides an intermediate solution. It's more work than building the full tree in memory because the programmer has to do some additional housekeeping to call clear() at the right time and place. But it's less housekeeping than SAX.
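For what it's worth, here is a minimal sketch of the kind of housekeeping I mean, using the standard library's xml.etree.ElementTree. The sample document, the tag names (catalog, book, title) and the helper iter_titles() are made up for illustration, not taken from anyone's real data.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample input; in practice this would be a large file on disk.
SAMPLE = b"""<catalog>
  <book id="1"><title>First</title></book>
  <book id="2"><title>Second</title></book>
</catalog>"""

def iter_titles(source):
    """Yield (id, title) for each <book>, clearing each element once handled."""
    # Only "end" events are requested, so every element we see is complete.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "book":
            yield elem.get("id"), elem.findtext("title")
            # Discard the book's children, text and attributes. The emptied
            # <book> element itself stays attached to the root.
            elem.clear()

titles = list(iter_titles(io.BytesIO(SAMPLE)))
print(titles)
```

Since the emptied elements remain attached to the root, a truly enormous file may also need the root's child list cleared from time to time. I have left that out to keep the sketch short.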

I guess I've done enough SAX, in enough different languages, that I don't find it that onerous to use. When I need an element stack to keep track of things I can usually re-use code I've written for other applications. But for a programmer who doesn't do a lot of this stuff, I agree, the learning curve with lxml will be shorter and the programming and debugging can be faster.
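To make the comparison concrete, this is roughly what the element-stack approach looks like with the standard library's xml.sax. It is only a sketch: the handler name TitleHandler, its fields, and the sample document are hypothetical, chosen to show the manual state keeping Stefan describes.

```python
import xml.sax

# Hypothetical sample input, for illustration only.
SAMPLE = b"""<catalog>
  <book id="1"><title>First</title></book>
  <book id="2"><title>Second</title></book>
</catalog>"""

class TitleHandler(xml.sax.ContentHandler):
    """Collect (id, title) pairs, tracking position with an element stack."""

    def __init__(self):
        super().__init__()
        self.stack = []         # names of the currently open elements
        self.current_id = None  # id attribute of the enclosing <book>
        self.buffer = []        # text chunks of the current <title>
        self.titles = []

    def in_title(self):
        # True only while inside a <title> that is a direct child of <book>.
        return self.stack[-2:] == ["book", "title"]

    def startElement(self, name, attrs):
        self.stack.append(name)
        if name == "book":
            self.current_id = attrs.get("id")
        elif self.in_title():
            self.buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_title():
            self.buffer.append(content)

    def endElement(self, name):
        if self.in_title():
            self.titles.append((self.current_id, "".join(self.buffer)))
        self.stack.pop()

handler = TitleHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.titles)
```

All of the state lives in fields on the handler instance, which is exactly the bookkeeping that the iterparse() approach avoids.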

    Alan
--
http://mail.python.org/mailman/listinfo/python-list
