Adam Tauno Williams, 20.12.2010 20:49:
On Mon, 2010-12-20 at 11:34 -0800, spaceman-spiff wrote:
This is a rather long post, but I wanted to include all the details &
everything I have tried so far myself, so please bear with me & read
the entire boringly long post.
I am trying to parse a ginormous (~1 GB) XML file.

Do that hundreds of times a day.

0. I am a Python & XML n00b, so have been relying on the excellent
beginner book DIP (Dive Into Python 3 by MP (Mark Pilgrim)... Mark, if
u are reading this, you are AWESOME & so is your witty & humorous
writing style)
1. Almost all examples of parsing XML in Python that I have seen start off with
these 4 lines of code.
import xml.etree.ElementTree as etree

Try

    import xml.etree.cElementTree as etree

instead. Note the leading "c", which hints at the C implementation of ElementTree. It's much faster and much more memory-friendly than the Python implementation.
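
A minimal sketch of that import swap, written so that it also runs on newer interpreters. This goes beyond the advice above: from Python 3.3 on, xml.etree.ElementTree picks up the C accelerator by itself, and cElementTree was removed in Python 3.9, so the fallback branch is what runs there. The tiny sample document is made up for illustration.

```python
try:
    # C implementation, available as a separate module in Python 2.5 to 3.8
    import xml.etree.cElementTree as etree
except ImportError:
    # Python 3.9+: the plain module already uses the C accelerator when available
    import xml.etree.ElementTree as etree

# Hypothetical tiny document, only to show that the API is the same either way.
root = etree.fromstring("<root><item>1</item><item>2</item></root>")
item_count = len(root.findall("item"))
print(item_count)
```

Either branch binds the same name, so the rest of the script does not need to change.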


tree = etree.parse('*path_to_ginormous_xml*')
root = tree.getroot()  #my huge xml has 1 root at the top level
print root

Yes, this is a terrible technique;  most examples are crap.

2. In the 2nd line of code above, as Mark explains in DIP, the parse
function builds & returns a tree object, in memory (RAM), which
represents the entire document.
I tried this code, which works fine for a small (~1 MB) file, but when I
run this simple 4-line py code in a terminal for my HUGE target file
(1 GB), nothing happens.
In a separate terminal, I run the top command, & I can see a python
process, with memory (the VIRT column) increasing from 100 MB all the
way up to 2100 MB.

Yes, this is using DOM.  DOM is evil and the enemy, full-stop.

Actually, ElementTree is not "DOM", it's modelled after the XML Infoset. While I agree that DOM is, well, maybe not "the enemy", but not exactly beautiful either, ElementTree is really a good thing, likely also in this case.
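
One way ElementTree can cope with a file that does not fit in memory is incremental parsing with iterparse(), which nobody in this thread has named so far, so treat the following as a sketch under assumptions rather than a recommendation from the posts above. The tag name "record" and the in-memory sample data are hypothetical stand-ins for whatever element repeats inside the real 1 GB file.

```python
import io
import xml.etree.ElementTree as etree

def count_records(source, tag="record"):
    """Count <tag> elements without keeping the whole tree in memory."""
    count = 0
    # Request "start" events too, so the very first event hands us the root.
    context = etree.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            count += 1  # real per-record processing would go here
            # Discard what has been parsed so far to keep memory use flat.
            root.clear()
    return count

# Hypothetical sample; for a real file, pass its path or an open file object.
sample = io.BytesIO(
    b"<root><record id='1'>a</record><record id='2'>b</record></root>"
)
record_count = count_records(sample)
print(record_count)
```

The point of the design is that each record is still a normal Element while it is being processed, but it is thrown away right afterwards instead of accumulating in one huge tree.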


I am guessing, as this happens (over the course of 20-30 mins), the
tree representation is being slowly built in memory, but even after
30-40 mins, nothing happens.
I don't get an error, seg fault or out_of_memory exception.

You need to process the document as a stream of elements; aka SAX.

IMHO, this is the worst advice you can give.

Stefan

--
http://mail.python.org/mailman/listinfo/python-list
