G'day,

Are there any easy XML tools (Python or otherwise) for splitting a 750 MB XML file down into smaller portions?
Because the file is so large that it exceeds available memory, I think the tool needs to be a streaming tool. I found an article on IBM's developerWorks site describing an XSLT approach, but elsewhere it's stated that XSLT tools usually aren't streaming, so I'm guessing none of the XSLT processors (Xalan, Saxon) will cope. (Not to mention it's been more than ten years since I last worked with XSLT.)

The original file looks like:

    <?xml version="1.0"?>
    <!DOCTYPE BigFile SYSTEM "BigFile.dtd">
    <BigFile>
      <TrivialHeader> blah </TrivialHeader>
      <Datum> A couple hundred thousand Datum elements. </Datum>
      <Datum> 'Datum' are non-trivial, containing extensive subtrees. </Datum>
      <Datum> ...etc... </Datum>
      <TrivialFooter> blah </TrivialFooter>
    </BigFile>

I'd like a tool to split that into, say, 10 separate valid XML files, each of which has the <BigFile>, <TrivialHeader> and <TrivialFooter> tags but only a tenth as many <Datum> elements.

The problem is that on my 4 GB laptop I run out of memory with any tool that tries to read in the whole tree at once. In my case, Python's ElementTree fails like so:

    fin = open("BigFile.xml", "r")
    tree = xml.etree.ElementTree.parse(fin)
    --> Out of Memory

The solution doesn't have to be Python, but Python would be nicest, as the rest of the processing is all done in a Python script.

Cheers,
Tom

--
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
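PS: In case it helps anyone answering, here's the shape of the streaming approach I'm imagining, using ElementTree's iterparse so the whole tree is never in memory at once. This is only an untested sketch under my own assumptions: the element names come from the sample above, the function name and output filenames are made up by me, it drops the DOCTYPE from the output files, and it assumes the <Datum> elements aren't nested and that no entities from the DTD are needed.

```python
import xml.etree.ElementTree as ET

def split_bigfile(path, parts=10, prefix="BigFile_part"):
    """Split `path` into up to `parts` valid XML files, each with the
    original header/footer and a share of the <Datum> elements."""
    # Pass 1: count <Datum> elements and capture the header/footer
    # markup, clearing the root as we go so memory stays flat.
    total, header, footer = 0, b"", b""
    context = ET.iterparse(path, events=("start", "end"))
    _, root = next(context)            # grab a handle on the root
    for event, elem in context:
        if event != "end":
            continue
        if elem.tag == "Datum":
            total += 1
            root.clear()               # drop processed subtrees
        elif elem.tag == "TrivialHeader":
            header = ET.tostring(elem)
        elif elem.tag == "TrivialFooter":
            footer = ET.tostring(elem)
    per_file = -(-total // parts)      # ceiling division

    # Pass 2: stream batches of <Datum> into standalone, valid files.
    names, out, n = [], None, 0
    context = ET.iterparse(path, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event != "end" or elem.tag != "Datum":
            continue
        if out is None:                # start a new output part
            name = "%s%02d.xml" % (prefix, len(names))
            names.append(name)
            out = open(name, "wb")
            out.write(b'<?xml version="1.0"?>\n<BigFile>' + header)
        out.write(ET.tostring(elem))   # full <Datum> subtree
        root.clear()
        n += 1
        if n == per_file:              # close off this part
            out.write(footer + b"</BigFile>\n")
            out.close()
            out, n = None, 0
    if out is not None:                # close the last, partial file
        out.write(footer + b"</BigFile>\n")
        out.close()
    return names
```

The key trick, as I understand it, is root.clear() after each <Datum>: iterparse still builds elements, but clearing the root releases each processed subtree so memory use stays roughly constant. Does something like this actually hold up on a 750 MB file?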