On 29 Aug 2005 08:17:04 -0700 "jog" <[EMAIL PROTECTED]> wrote:

> I want to get text out of some nodes of a huge XML file (1.5 GB). The
> structure of the XML file is something like this:
> [structure snipped]
> I want to combine the text out of page:title and page:revision:text
> for every single page element. One by one I want to index these
> combined texts (so for each page, one index).
> What is the most efficient API for that: SAX (I don't think so), DOM,
> or pulldom?
Definitely SAX IMHO, or xml.parsers.expat. For what you're doing, an event-driven interface is ideal.

DOM parses the *entire* XML tree into memory at once, before you can do anything with it - highly inefficient for a large data set like this.

I've never used pulldom; it might have potential, but from my (limited and flawed) understanding of it, I think it may also wind up loading most of the file into memory by the time you're done.

SAX will not build any in-memory structures other than the ones you explicitly create (in fact, SAX is commonly used to build DOM trees). With SAX, you can just watch for the tags of interest (and perhaps some surrounding tags to provide context), extract the desired data, and do all of that very efficiently. It took me a bit to get the hang of SAX, but once I did, I haven't looked back. Event-driven parsing is a brilliant fit for this problem domain.

> Or should I just use XPath somehow.

XPath usually requires a DOM tree on which it can operate. The Python XPath implementation (in PyXML) requires DOM objects, so I see this as a highly inefficient solution as well.

Another possibility, if the data file has a lot of extraneous information: run the source file through an XSLT transform that strips it down to only the data you need, and then apply SAX to parse the result.

- Michael
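P.S. Here's a rough, untested sketch of the kind of SAX handler I have in mind. I'm assuming, going by your description, that the elements are literally named "page", "title", "revision", and "text"; the file name "dump.xml" and the index_page callback are placeholders for your own file and indexing code:

    import xml.sax

    class PageHandler(xml.sax.handler.ContentHandler):
        """Combine <title> and <revision>/<text> for each <page>."""

        def __init__(self, callback):
            xml.sax.handler.ContentHandler.__init__(self)
            self.callback = callback   # called once per page with the combined text
            self.chunks = []           # character data for the element being captured
            self.capture = False
            self.in_revision = False
            self.title = ''
            self.text = ''

        def startElement(self, name, attrs):
            if name == 'page':
                self.title = ''
                self.text = ''
            elif name == 'revision':
                self.in_revision = True
            elif name == 'title' or (name == 'text' and self.in_revision):
                self.capture = True
                self.chunks = []

        def characters(self, content):
            # characters() may be called several times within one element,
            # so accumulate the pieces and join them at the end tag.
            if self.capture:
                self.chunks.append(content)

        def endElement(self, name):
            if name == 'title':
                self.title = ''.join(self.chunks)
                self.capture = False
            elif name == 'text' and self.in_revision:
                self.text = ''.join(self.chunks)
                self.capture = False
            elif name == 'revision':
                self.in_revision = False
            elif name == 'page':
                # One combined document per page, ready for indexing.
                self.callback(self.title + '\n' + self.text)

    def index_page(combined):
        # Placeholder: hand the combined text to your indexer here.
        pass

    xml.sax.parse('dump.xml', PageHandler(index_page))

The handler holds at most one page's worth of text in memory at a time, which is the whole point of the event-driven approach.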