Adam Tauno Williams, 20.12.2010 20:49:
On Mon, 2010-12-20 at 11:34 -0800, spaceman-spiff wrote:
This is a rather long post, but i wanted to include all the details &
everything i have tried so far myself, so please bear with me & read
the entire boringly long post.
I am trying to parse a ginormous ( ~ 1gb) xml file.
Do that hundreds of times a day.
0. I am a python & xml n00b, & have been relying on the excellent
beginner book DIP (Dive_Into_Python3) by MP (Mark Pilgrim).... Mark, if
u are reading this, you are AWESOME & so is your witty & humorous
writing style.
1. Almost all examples of parsing xml in python that i have seen start off with
these 4 lines of code.
import xml.etree.ElementTree as etree
Try
import xml.etree.cElementTree as etree
instead. Note the leading "c", which hints at the C implementations of
ElementTree. It's much faster and much more memory friendly than the Python
implementation.
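A minimal sketch of that import swap, written so it also runs where the C module is not importable. The fallback is an addition that is not part of the advice above: in Python 3.3 and later the plain ElementTree module picks up the C accelerator on its own, and cElementTree was removed in Python 3.9.

```python
# Prefer the C implementation where it exists as a separate module
# (Python 2.5 up to Python 3.8); otherwise fall back to the plain module,
# which uses the C accelerator automatically on Python 3.3+.
try:
    import xml.etree.cElementTree as etree
except ImportError:
    import xml.etree.ElementTree as etree

# Tiny in-memory document, only here to show that the API is unchanged.
root = etree.fromstring("<root><child/></root>")
print(root.tag)
```

The rest of the code stays the same either way, since both modules expose the same API under the name etree.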
tree = etree.parse('*path_to_ginormous_xml*')
root = tree.getroot() #my huge xml has 1 root at the top level
print(root)
Yes, this is a terrible technique; most examples are crap.
2. In the 2nd line of code above, as Mark explains in DIP, the parse
function builds& returns a tree object, in-memory(RAM), which
represents the entire document.
I tried this code, which works fine for a small (~1MB) file, but when i
run this simple 4 line py code in a terminal for my HUGE target file
(1GB), nothing happens.
In a separate terminal, i run the top command, & i can see a python
process, with memory (the VIRT column) increasing from 100MB, all the
way up to 2100MB.
Yes, this is using DOM. DOM is evil and the enemy, full-stop.
Actually, ElementTree is not "DOM", it's modelled after the XML Infoset.
While I agree that DOM is, well, maybe not "the enemy", but not exactly
beautiful either, ElementTree is really a good thing, likely also in this case.
I am guessing, as this happens (over the course of 20-30 mins), the
tree representing the document is being slowly built in memory, but even
after 30-40 mins, nothing happens.
I don't get an error, seg fault or out_of_memory exception.
You need to process the document as a stream of elements; aka SAX.
IMHO, this is the worst advice you can give.
Stefan
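For reference, one way to process a document as a stream of elements while staying with ElementTree is its iterparse() function. This is only a sketch and is not taken from the messages above: the tag name "item", the helper count_elements() and the sample data are hypothetical, and it assumes the records of interest are direct children of the root element.

```python
import io
import xml.etree.ElementTree as etree


def count_elements(source, tag):
    """Count elements named `tag` without keeping the whole tree in memory.

    `source` may be a filename or a file object opened in binary mode.
    Assumes the counted elements are direct children of the root.
    """
    count = 0
    context = etree.iterparse(source, events=("start", "end"))
    # The first event is the "start" of the root element; keep a handle on it.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            count += 1
            # Drop the children parsed so far so memory use stays flat.
            root.clear()
    return count


# Hypothetical sample standing in for the real 1GB file.
sample = b"<root><item id='1'/><item id='2'/><other/></root>"
print(count_elements(io.BytesIO(sample), "item"))
```

For the real file, a path would be passed instead of the BytesIO object, and the counting line would be replaced by whatever per-record work is needed before the clear() call.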