First of all, thanks to all of you who supplied comments and suggestions for my issue relating to parsing very large XML documents with complex structures.

Given those suggestions, I was able to find a solution. Believe it or not, the solution was to use a Perl API that relies on a C library for parsing the XML, as opposed to a pure Perl solution. In this case, I used XML::LibXML (an interface to the GNOME libxml2 C library). To say the processing speed improved is an understatement: after running several tests, I'm now able to process a 54MB file of XML/RDF records in 1/24th the time it took previously with the pure Perl XML::XPath/XML::Parser modules. It now takes minutes instead of hours. And, as a bonus, re-coding the program didn't take long, since the API is very similar and uses the same DOM technique. I suspect the performance also owes something to having enough RAM to manipulate the document in memory rather than swapping horribly to disk.
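For anyone curious what the switch looks like in practice, here is a minimal sketch of DOM-style parsing with XML::LibXML. The file name, namespace registration, and element names are placeholders for illustration, not the actual structure of my records:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::LibXML;

    # Placeholder file name -- substitute your own XML/RDF data set
    my $file = 'records.rdf';

    # Parse the whole document into a DOM tree; libxml2 does the work in C
    my $parser = XML::LibXML->new();
    my $doc    = $parser->parse_file($file);

    # XPath queries look much like XML::XPath's, but namespace prefixes
    # are registered explicitly on an XPathContext
    my $xc = XML::LibXML::XPathContext->new($doc);
    $xc->registerNs('rdf', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#');

    # Walk the record nodes (the element name here is illustrative only)
    foreach my $node ($xc->findnodes('//rdf:Description')) {
        print $node->getAttributeNS(
            'http://www.w3.org/1999/02/22-rdf-syntax-ns#', 'about'
        ), "\n";
    }

On a file this size the parse itself is the expensive step, so building the DOM once and running all the XPath queries against it is where the time savings come from.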

My script is running on the same host, against the same data set, and the improvement was phenomenal. This was the only performance tweak I made to the program, and the payoff was well worth the relatively minimal effort. I really couldn't believe the performance increase when I saw it, but I'm relieved, because I had been worried that the problem was my algorithm or the underlying Perl code library we had written as the basis for this application.

I would be interested to know if others have had a similar experience switching to an API that relies on compiled C library routines (such as XML::Sablotron). Hats off to Matt Sergeant and Christian Glahn for their work on the XML::LibXML modules.

I hope my experience helps some of you out there working on XML projects involving large data sets.

Rob

Robert Fox
Sr. Programmer/Analyst
University Libraries of Notre Dame
(574)631-3353
[EMAIL PROTECTED]


