Hi Rob:

On Wed, Feb 25, 2004 at 03:31:07PM -0500, Robert Fox wrote:
> 1. Am I using the best XML processing module that I can for this sort of 
> task?

XPath expressions require building a document object model (DOM) of your XML 
file. Building a DOM for a huge file is extremely expensive, since it converts
the entire XML file into an in-memory tree structure where each element is a
node. Your system is probably digging into virtual memory (swapping to disk)
to keep the monster in memory...which means slow. And the whole file has to be
slurped in before any work can actually start.

When processing large XML files you'll want to use a stream-based parser like 
XML::SAX.
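
For instance, here's a rough sketch that counts <record> elements as they
stream by (the element name and file name are just placeholders for whatever
is in your data):

  #!/usr/bin/perl
  use strict;
  use warnings;

  package CountingHandler;
  use base qw( XML::SAX::Base );

  # fires once per opening tag as the parser streams through the file;
  # the full document is never held in memory
  sub start_element {
      my ( $self, $el ) = @_;
      $self->{count}++ if $el->{LocalName} eq 'record';
  }

  sub end_document {
      my $self = shift;
      print "saw ", $self->{count} || 0, " records\n";
  }

  package main;
  use XML::SAX::ParserFactory;

  my $parser = XML::SAX::ParserFactory->parser(
      Handler => CountingHandler->new
  );
  $parser->parse_uri('huge.xml');   # memory use stays flat

Since the handler only hangs onto a counter, memory use doesn't grow with the
size of the file the way it does with a DOM.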

> 2. Has anyone else processed documents of this size, and what have they 
> used?

Yep, I've used XML::SAX recently and XML::Parser back in the day. XML::Parser
is deprecated now, but once upon a time it was cutting edge :)

> 3. What is the most efficient way to process through such a large document 
> no matter what XML processor one uses?

Use a stream-based parser instead of one that is DOM-based. This applies in 
any language (Java, Python, etc.). There is a series of good articles on 
SAX parsing in Perl on xml.com [1]. The nice thing about SAX is that it is 
not Perl specific, so what you learn about SAX can be applied in lots of other
languages. SAX filters [2] are also incredibly useful; there's a tiny example
below.
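
Just as a sketch (the filter name is made up, and this assumes you have
XML::SAX::Writer installed): a filter is an ordinary XML::SAX::Base handler
that forwards events downstream after fiddling with them, so you can chain
parser -> filter -> output.

  package UppercaseFilter;
  use base qw( XML::SAX::Base );

  # tweak the character data, then forward the event down the chain
  sub characters {
      my ( $self, $data ) = @_;
      $data->{Data} = uc $data->{Data};
      $self->SUPER::characters( $data );
  }

  package main;
  use strict;
  use warnings;
  use XML::SAX::ParserFactory;
  use XML::SAX::Writer;

  # chain: parser -> filter -> writer
  my $writer = XML::SAX::Writer->new( Output => \*STDOUT );
  my $filter = UppercaseFilter->new( Handler => $writer );
  my $parser = XML::SAX::ParserFactory->parser( Handler => $filter );
  $parser->parse_uri('huge.xml');

Any event you don't override passes through untouched, which is what makes
filters so easy to stack.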

Good luck!

//Ed

[1] http://www.xml.com/pub/a/2001/02/14/perlsax.html
[2] http://www.xml.com/pub/a/2001/10/10/sax-filters.html
