RE: XML Parsing for large XML documents

2004-02-26 Thread Robert Fox
Peter and Ed-

Thanks for the replies.

Your suggestions are very good. Here is my problem, though: I don't think 
that I can process this document in a serial fashion, which seems to be 
more akin to SAX. I need to do a lot of node hopping in order to create 
somewhat complex data structures for import into the database, and that 
requires a lot of jumping around from one part of the node tree to another. 
Thus, it seems as though I need to use a DOM parser to accomplish this. 
Scanning an entire document of this size in order to perform very specific 
event handling for each operation (using SAX) seems like it would be just 
as time-consuming as having the entire node tree represented in memory. 
Please correct me if I'm wrong here.
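
To make that concrete, the kind of node hopping I have in mind looks
roughly like this (just a sketch; the element and attribute names are
invented for illustration, not taken from my actual document):

    use XML::XPath;

    # Parse the whole document into an in-memory tree up front.
    my $xp = XML::XPath->new(filename => 'resources.rdf');
    $xp->set_namespace('rdf', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#');
    $xp->set_namespace('dc',  'http://purl.org/dc/elements/1.1/');

    foreach my $resource ($xp->findnodes('//rdf:Description')) {
        my $id = $resource->getAttribute('rdf:about');

        # Hop to arbitrary other parts of the tree that refer back to
        # this resource, wherever they happen to sit in the document.
        my @related = $xp->findnodes(
            qq{//rdf:Description[dc:relation/\@rdf:resource = "$id"]}
        );

        # ... build the data structures for the database import here ...
    }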

On the plus side, I am running this process on a machine that seems to have 
enough RAM to represent the entire document and my code structures (arrays, 
etc.) without the need for virtual memory and heavy disk I/O. However, the 
process is VERY CPU intensive because of all of the sorting and lookups 
that occur for many of the operations. I'm going to see today if I can make 
those more efficient as well.
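
One thing I plan to try (again only a sketch, with the same invented
names) is building a hash index over the tree in a single pass, so that
each cross-reference becomes a hash lookup instead of another XPath
search over the whole document:

    use XML::XPath;

    my $xp = XML::XPath->new(filename => 'resources.rdf');
    $xp->set_namespace('rdf', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#');

    # One pass to index every resource node by its identifier.
    my %node_for;
    foreach my $node ($xp->findnodes('//rdf:Description')) {
        $node_for{ $node->getAttribute('rdf:about') } = $node;
    }

    # Later cross-references are hash fetches rather than fresh
    # searches over the entire tree.
    my $related = $node_for{'urn:example:some-resource'};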

Someone else has suggested to me that perhaps it would be a good idea to 
break up the larger document into smaller parts during processing and only 
work on those parts serially. It was also suggested that XML::LibXML is an 
efficient tool because its core is a C library (libxml2). I've also now 
heard of hybrid parsers that combine the ease of use and flexibility of 
DOM with the efficiency of SAX (RelaxNGCC).

For those of you that haven't heard of these tools before, you might want 
to check out:

XML::Sablotron (similar to XML::LibXML)
XMLPull (http://www.xmlpull.org)
Piccolo Parser (http://piccolo.sourceforge.net)
RelaxNGCC (http://relaxngcc.sourceforge.net/en/index.htm)
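
Along the same chunk-at-a-time lines, something like XML::Twig (not on
the list above, so treat this as an aside) streams the document but
hands you a small in-memory tree for one record at a time, which can be
discarded before moving on. A rough sketch, with invented element names
as before:

    use XML::Twig;

    my $twig = XML::Twig->new(
        twig_handlers => {
            'rdf:Description' => \&handle_record,
        },
    );
    $twig->parsefile('resources.rdf');

    sub handle_record {
        my ($twig, $record) = @_;
        my $id = $record->att('rdf:about');

        # Work on the small subtree for this one record, stashing any
        # cross-record relationships in plain Perl structures.

        $twig->purge;   # free everything parsed so far
    }

The catch, of course, is that relationships between records have to be
carried along in Perl data structures, since earlier records are gone
from memory once they have been purged.
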
I get the impression that if I tried to use SAX parsing for a relatively 
complex RDF document, the programming load would be rather significant. 
But, if it speeds up processing by several orders of magnitude, then it 
would be worth it. I'm concerned, though, that I won't have the ability to 
crawl the document nodes using conditionals and revert to previous portions 
of the document that need further processing. What is your experience in 
this regard?
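
For what it's worth, my current picture of how a SAX version would have
to look is something like the skeleton below. The element names and the
idea of stashing identifiers and relationships in plain Perl structures
(so that nothing has to be revisited) are my own assumptions; treat it
as a sketch rather than working code:

    package ResourceHandler;
    use strict;
    use base 'XML::SAX::Base';

    my $RDF = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#';

    sub start_document {
        my $self = shift;
        $self->{records}   = {};   # id => accumulated fields
        $self->{relations} = [];   # [ from_id, to_id ] pairs
    }

    sub start_element {
        my ($self, $el) = @_;
        if ($el->{LocalName} eq 'Description') {
            my $about = $el->{Attributes}{"{$RDF}about"};
            $self->{current_id} = $about ? $about->{Value} : undef;
            $self->{records}{ $self->{current_id} } ||= {}
                if defined $self->{current_id};
        }
        elsif ($el->{LocalName} eq 'relation') {   # invented element name
            my $target = $el->{Attributes}{"{$RDF}resource"};
            push @{ $self->{relations} },
                [ $self->{current_id}, $target->{Value} ]
                if $target && defined $self->{current_id};
        }
    }

    sub end_document {
        my $self = shift;
        # Everything needed for the database load now lives in
        # $self->{records} and $self->{relations}; the XML itself never
        # has to be revisited.
    }

    package main;
    use XML::SAX::ParserFactory;

    my $handler = ResourceHandler->new;
    XML::SAX::ParserFactory->parser(Handler => $handler)
                           ->parse_uri('resources.rdf');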

Thanks again for the responses. This is great.

Rob



At 11:07 AM 2/26/2004, Peter Corrigan wrote:
On 25 February 2004 20:31, Robert Fox wrote:
1. Am I using the best XML processing module that I can for this sort
of task?
If it must be faster, then it might be worth porting what you have to
work with LibXML, which has all-round impressive benchmarks, especially
for DOM work.
Useful comparisons may be found at:
http://xmlbench.sourceforge.net/results/benchmark/index.html
Remember that the size of the final internal representation used to
manipulate the XML data for DOM could be up to 5 times the original size,
i.e. 270MB in your case. Simply adding RAM or porting your existing code to
another machine might be enough to give you the speed-up you require.
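
By way of illustration, the XPath-style calls port over fairly directly.
A sketch only, with made-up node names, not something I have run against
your data:

    use XML::LibXML;

    my $parser = XML::LibXML->new;
    my $doc    = $parser->parse_file('resources.rdf');
    my $root   = $doc->documentElement;

    # Prefixes declared on the root element (rdf:, dc:, ...) can be used
    # in XPath expressions evaluated relative to it.
    foreach my $resource ($root->findnodes('//rdf:Description')) {
        my $id = $resource->getAttributeNS(
            'http://www.w3.org/1999/02/22-rdf-syntax-ns#', 'about'
        );
        # ... the same node hopping as before, on the libxml2 tree ...
    }
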
3. What is the most efficient way to process through such a large
document, no matter what XML processor one uses?
SAX-type processing will be faster and use less memory. If you need
random access to any point of the tree after the document has been read,
you will need DOM, hence you will need lots of memory.
If this is a one-off load, I guess you have to balance the cost of your
time recoding with the cost of waiting for the data to load using what
you have already. Machines usually work cheaper :-)
Best of luck

Peter Corrigan
Head of Library Systems
James Hardiman Library
NUI Galway
IRELAND
Tel: +353-91-524411 Ext 2497
Mobile: +353-87-2798505
-----Original Message-----
From: Robert Fox [mailto:[EMAIL PROTECTED]
Sent: 25 February 2004 20:31
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: XML Parsing for large XML documents
I'm cross posting this question to perl4lib and xml4lib, hoping that
someone will have a suggestion.
I've created a very large (~54MB) XML document in RDF format for the
purpose of importing related records into a database. Not only does the
RDF document contain many thousands of individual records for electronic
resources (web resources), but it also contains all of the relationships
between those resources, encoded in such a way that the document itself
represents a rather large database of these resources. The relationships
are multi-tiered. I've also written a Perl script which can parse this
large document and process through all of the XML data in order to import
the data, along with all of the various relationships, into the database.
The Perl script uses XML::XPath and XML::XPath::XMLParser. I use these
modules to find the appropriate document nodes as needed while the
processing is going on and the database is being populated. The database
is not a flat file: several data tables and linking tables are involved.
I've run into a problem, though: my Perl script runs 

Re: Problems testing MARC::Charset-0.5

2004-02-26 Thread Ed Summers
On Tue, Feb 17, 2004 at 10:55:35AM -0300, Oberdan Luiz May wrote:
   I'm running perl 5.8.3 on Solaris 2.6, with the latest version of all 
 the modules needed, the latest Berkeley DB, all compiled with GCC 3.3.2. 
 Any hints?

There was a bug in MARC::Charset v0.5 which was causing the EastAsian Berkeley
DB mapping to fail. The failure wasn't evident when I released v0.5 since
MARC::Charset::EastAsian was using the installed BerkeleyDB for lookups
rather than the one that is generated as part of the perl Makefile.PL process.

Both problems have been fixed, and the fix was just uploaded to CPAN as v0.6. If 
you really want the latest package you can get it from SourceForge here:

http://sourceforge.net/projects/marcpm/

Thanks for writing to the list about this Oberdan!

//Ed


Re: Problems testing MARC::Charset-0.5

2004-02-26 Thread Oberdan Luiz May
At 11:22 26/2/2004 -0600, you wrote:
On Tue, Feb 17, 2004 at 10:55:35AM -0300, Oberdan Luiz May wrote:
   I'm running perl 5.8.3 on Solaris 2.6, with the latest version of all
 the modules needed, the latest Berkeley DB, all compiled with GCC 3.3.2.
 Any hints?
There was a bug in MARC::Charset v0.5 which was causing the EastAsian Berkeley
DB mapping to fail. The failure wasn't evident when I released v0.5 since
MARC::Charset::EastAsian was using the installed BerkeleyDB for lookups
rather than the one that is generated as part of the perl Makefile.PL process.
Both problems have been fixed, and the fix was just uploaded to CPAN as v0.6. If
you really want the latest package you can get it from SourceForge here:
http://sourceforge.net/projects/marcpm/

Thanks for writing to the list about this Oberdan!

//Ed


Hello Ed,

I already downloaded MARC::Charset 0.6. This time everything worked 
fine. Actually, I'm having other problems when converting from ANSEL 
to UTF-8, but I'll post that as another question. Thanks!

[]'s

Oberdan