I'm cross-posting this question to perl4lib and xml4lib, hoping that someone will have a suggestion.

I've created a very large (~54MB) XML document in RDF format for the purpose of importing related records into a database. Not only does the RDF document contain many thousands of individual records for electronic resources (web resources), but it also encodes all of the relationships between those resources, so that the document itself represents a rather large database of these resources. The relationships are multi-tiered. I've also written a Perl script that parses this large document and works through all of the XML data in order to import the records, along with all of their various relationships, into the database. The script uses XML::XPath and XML::XPath::XMLParser to find the appropriate document nodes as needed while processing goes on and the database is being populated. The database is not a flat file: several data tables and linking tables are involved.
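For context, here is a stripped-down sketch of the kind of XPath-driven processing the script does. The element names, namespaces, and sample data below are placeholders for illustration only, not my actual schema:

#!/usr/bin/perl
use strict;
use warnings;
use XML::XPath;

# A toy two-record document standing in for the real 54MB file; the
# rdf:Description / dc:relation structure here is just illustrative.
my $xml = <<'END_XML';
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="urn:resource:1">
    <dc:title>Example Resource</dc:title>
    <dc:relation rdf:resource="urn:resource:2"/>
  </rdf:Description>
  <rdf:Description rdf:about="urn:resource:2">
    <dc:title>Related Resource</dc:title>
  </rdf:Description>
</rdf:RDF>
END_XML

my $xp = XML::XPath->new(xml => $xml);
$xp->set_namespace(rdf => 'http://www.w3.org/1999/02/22-rdf-syntax-ns#');
$xp->set_namespace(dc  => 'http://purl.org/dc/elements/1.1/');

# For every record node, run further XPath queries relative to that
# node to pull out its fields and its links to related records, then
# (in the real script) insert rows into the data and linking tables.
foreach my $rec ($xp->findnodes('/rdf:RDF/rdf:Description')) {
    my $about = $xp->findvalue('@rdf:about', $rec);
    my $title = $xp->findvalue('dc:title',   $rec);
    print "$about: $title\n";
    foreach my $rel ($xp->findnodes('dc:relation/@rdf:resource', $rec)) {
        print "  related to: ", $rel->getNodeValue, "\n";
    }
}

The real script follows this per-record pattern of repeated XPath queries against the in-memory document, which may be part of the problem.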

I've run into a problem, though: my Perl script runs very slowly. I've done just about everything I can to optimize the script so that it isn't memory intensive and runs efficiently, and nothing has helped significantly. Therefore, I have a few questions for the list(s):

1. Am I using the best XML processing module that I can for this sort of task?
2. Has anyone else processed documents of this size, and what have they used?
3. What is the most efficient way to work through such a large document, regardless of which XML processor one uses?


The processing is so amazingly slow that it is likely to take many hours, if not days(!), to work through the bulk of the records in this XML document. There must be a better way.

Any suggestions or help would be much appreciated,

Rob Fox

Robert Fox
Sr. Programmer/Analyst
University Libraries of Notre Dame
(574)631-3353
[EMAIL PROTECTED]


