G'Day,

Are there any easy XML tools (Python or otherwise) for splitting a 750 MB
XML file into smaller pieces?

Because the file is larger than available memory, I think the tool
needs to be a 'streaming' tool.  I found an article on the IBM
developerWorks site describing how to do this with XSLT, but other
sources say that XSLT processors usually aren't streaming, so I'm
guessing none of the usual ones (Xalan, Saxon) will succeed.
(Not to mention it's been more than 10 years since I last worked
with XSLT.)

The original file looks like this:
<?xml version="1.0"?>
<!DOCTYPE BigFile SYSTEM "BigFile.dtd">
<BigFile> 
<TrivialHeader> blah </TrivialHeader>
<Datum> A couple hundred thousand Datum elements.</Datum>
<Datum> 'Datum' are non-trivial, containing extensive subtrees.</Datum>
<Datum> ...etc... </Datum> 
<TrivialFooter> blah </TrivialFooter>
</BigFile>


I'd like a tool to split that into, say, 10 separate valid XML files,
each of which keeps the <BigFile>, <TrivialHeader> and
<TrivialFooter> tags but holds only a tenth of the <Datum> elements.
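
Just to make the target shape concrete, here's a rough sketch of what
each output file would need to contain.  write_chunk() and its argument
names are mine (there's no existing tool behind this); it assumes the
structure shown above and skips the DOCTYPE line:

    def write_chunk(path, header, datums, footer):
        # header/footer: the serialized <TrivialHeader>/<TrivialFooter>
        # elements as bytes; datums: a list of serialized <Datum> elements.
        with open(path, "wb") as out:
            out.write(b'<?xml version="1.0"?>\n')   # DOCTYPE line left out
            out.write(b"<BigFile>\n")
            out.write(header)
            for d in datums:
                out.write(d)
            out.write(footer)
            out.write(b"</BigFile>\n")

The hard part is producing the header, footer and <Datum> blobs without
reading the whole 750 MB file in, which is exactly where I'm stuck
(see below).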


The problem is that on my 4 GB laptop I run out of memory with
any tool that tries to read the whole tree in at once.
In my case, Python's ElementTree fails like so:

    import xml.etree.ElementTree
    fin  = open("BigFile.xml", "r")
    tree = xml.etree.ElementTree.parse(fin)   # --> out of memory
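
The one streaming option I'm aware of in the standard library is
ElementTree's iterparse(), which hands elements back as their end tags
arrive and lets you discard them again.  The sketch below just counts
the <Datum> elements (the file name is a placeholder), but the same
loop could presumably grab <TrivialHeader> and <TrivialFooter> and
write each <Datum> out to a chunk file.  I haven't verified it against
the real 750 MB file:

    import xml.etree.ElementTree as ET

    context = ET.iterparse("BigFile.xml", events=("start", "end"))
    event, root = next(context)        # first event: the start tag of <BigFile>

    n_datum = 0
    for event, elem in context:
        if event == "end" and elem.tag == "Datum":
            n_datum += 1               # ...or ET.tostring(elem) out to a chunk file
            root.clear()               # drop the finished subtree so memory stays flat
    print("saw %d Datum elements" % n_datum)

If that really does keep memory flat, combining it with something like
write_chunk() above ought to do the split, but I'd happily use an
existing tool instead.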


The solution doesn't have to be Python, but it would be nicest
if it were, as the rest of the processing is done in a Python script.


Cheers,
Tom





--
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
