Re: [SLUG] Python, XML, and Splitting a 750M XML File?
I was a bit bored, and this works for me...

http://pastebin.com/srPxwvSm

Chris-

On Thu, Jan 6, 2011 at 4:12 PM, Peter Miller wrote:
> On Thu, 2011-01-06 at 15:50 +1100, Peter Miller wrote:
>> > 'Datum' are non-trivial, containing extensive subtrees.
>> > ...etc...
>> > blah
>>
>> XML is plain text, use a text tool.
>> If the line breaks are as indicated, use split(1)
>> and then hand edit the headers and footers.
>
> Or, use awk(1) and split on lines containing /<.Datum>/
> using awk's ability to write to more than one file.
> I suppose much the same could be done in Perl, too, but I'm older than
> such new-fangled things as Perl.

--
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
Re: [SLUG] Python, XML, and Splitting a 750M XML File?
On Thu, 2011-01-06 at 15:50 +1100, Peter Miller wrote:
> > 'Datum' are non-trivial, containing extensive subtrees.
> > ...etc...
> > blah
>
> XML is plain text, use a text tool.
> If the line breaks are as indicated, use split(1)
> and then hand edit the headers and footers.

Or, use awk(1) and split on lines containing /<.Datum>/
using awk's ability to write to more than one file.

I suppose much the same could be done in Perl, too, but I'm older than
such new-fangled things as Perl.

--
Regards
Peter Miller
/\/\* http://miller.emu.id.au/pmiller/

PGP public key ID: 1024D/D0EDB64D
fingerprint = AD0A C5DF C426 4F03 5D53 2BDB 18D8 A4E2 D0ED B64D
See http://www.keyserver.net or any PGP keyserver for public key.

"A data structure is just a stupid programming language." -- R. Wm. Gosper
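Since the rest of the processing is in Python, the awk suggestion above can be approximated there too. The following is only a sketch of the same line-oriented idea, not a translation of any script from the thread. It assumes that each Datum element starts and ends on its own lines, that Datum elements are not nested inside each other, and that nothing except Datum elements sits between the header and the footer. The function name `split_by_lines` and its parameters are made up for illustration.

```python
def split_by_lines(src, out_pattern, per_file, tag="Datum"):
    """Line-oriented split: no XML parsing, just marker matching per line.

    Assumes opening and closing tags of `tag` sit on their own lines.
    Returns the list of file names written.
    """
    open_mark, close_mark = f"<{tag}", f"</{tag}>"
    header, footer, names = [], [], []
    out, count = None, 0
    in_datum = seen_first = False

    with open(src, encoding="utf-8") as fin:
        for line in fin:
            if not in_datum and open_mark in line:
                in_datum = seen_first = True
                if out is None:
                    # Start a new output file and repeat the header in it.
                    names.append(out_pattern.format(len(names)))
                    out = open(names[-1], "w", encoding="utf-8")
                    out.writelines(header)
            if in_datum:
                out.write(line)
                if close_mark in line:
                    in_datum = False
                    count += 1
                    if count == per_file:
                        out.close()
                        out, count = None, 0
            elif seen_first:
                footer.append(line)   # lines after the Datum elements
            else:
                header.append(line)   # lines before the first Datum

    if out is not None:
        out.close()
    # The footer is only known at end of input, so append it afterwards.
    for name in names:
        with open(name, "a", encoding="utf-8") as f:
            f.writelines(footer)
    return names
```

Called as `split_by_lines("BigFile.xml", "part{}.xml", 20000)`, this reads one line at a time, so memory use does not depend on the size of the input. If the line breaks are not as assumed, a real XML parser is the safer route.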
Re: [SLUG] Python, XML, and Splitting a 750M XML File?
On Thu, 2011-01-06 at 13:51 +1100, Tom Deckert wrote:
> Original file looks like:
>
> blah
> A couple hundred thousand Datum elements.
> 'Datum' are non-trivial, containing extensive subtrees.
> ...etc...
> blah

XML is plain text, use a text tool.
If the line breaks are as indicated, use split(1)
and then hand edit the headers and footers.

--
Regards
Peter Miller
/\/\* http://miller.emu.id.au/pmiller/

"As we said in the preface to the first edition, C 'wears well as one's
experience with it grows.' With a decade more experience, we still feel
that way." -- Brian Kernighan and Dennis Ritchie
Re: [SLUG] Python, XML, and Splitting a 750M XML File?
Sorry, I misread your email. Have you tried SAX parsing?
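For reference, a SAX-based split might look roughly like the sketch below. It uses only the standard library (`xml.sax` and `XMLGenerator`) and never builds a tree, so memory use stays flat. The class name `DatumSplitter`, the helper `split_with_sax`, and the output naming pattern are made up for illustration. It assumes the Datum elements are direct children of the document element, and it copies only the document element's tag and attributes into each output file. Any other header or footer elements are dropped, so they would need extra handling.

```python
import xml.sax
from xml.sax.saxutils import XMLGenerator


class DatumSplitter(xml.sax.ContentHandler):
    """Re-emit <Datum> subtrees into numbered files, per_file at a time."""

    def __init__(self, out_pattern, per_file, tag="Datum"):
        super().__init__()
        self.out_pattern, self.per_file, self.tag = out_pattern, per_file, tag
        self.files = []
        self.root = None      # (name, attrs) of the document element
        self.depth = 0
        self.inside = False   # True while inside a Datum subtree
        self.count = 0
        self.fh = self.gen = None

    def _open(self):
        name = self.out_pattern.format(len(self.files))
        self.files.append(name)
        self.fh = open(name, "w", encoding="utf-8")
        self.gen = XMLGenerator(self.fh, encoding="utf-8")
        self.gen.startDocument()
        self.gen.startElement(*self.root)
        self.gen.characters("\n")

    def _close(self):
        if self.fh is not None:
            self.gen.endElement(self.root[0])
            self.gen.characters("\n")
            self.gen.endDocument()
            self.fh.close()
            self.fh = self.gen = None
            self.count = 0

    def startElement(self, name, attrs):
        self.depth += 1
        if self.depth == 1:
            self.root = (name, attrs.copy())
        elif self.depth == 2 and name == self.tag:
            if self.fh is None:
                self._open()
            self.inside = True
        if self.inside:
            self.gen.startElement(name, attrs)

    def characters(self, content):
        if self.inside:
            self.gen.characters(content)

    def endElement(self, name):
        if self.inside:
            self.gen.endElement(name)
            if self.depth == 2:          # just closed a Datum
                self.inside = False
                self.gen.characters("\n")
                self.count += 1
                if self.count == self.per_file:
                    self._close()
        self.depth -= 1

    def endDocument(self):
        self._close()                    # finish a partly filled last file


def split_with_sax(src, out_pattern, per_file):
    handler = DatumSplitter(out_pattern, per_file)
    xml.sax.parse(src, handler)
    return handler.files
```

Usage would be something like `split_with_sax("BigFile.xml", "part{}.xml", 20000)`, which returns the names of the files it wrote.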
Re: [SLUG] Python, XML, and Splitting a 750M XML File?
On 6 January 2011 13:51, Tom Deckert wrote:
>
> G'Day,
>
> Any easy XML (Python or otherwise) tools for splitting a 750M
> XML file down into smaller portions?
>
> Because the file is so large
> and exceeds memory size, I think the tool needs to be a 'streaming'
> tool. On IBM DeveloperWorks site, I found an article detailing
> using XSLT, but in other places it states XSLT tools usually
> aren't streaming, so I'm guessing none of the XSLT processors
> (xalan, saxon) will succeed. (Not to mention it's been more than
> 10 years since I last worked with XSLT.)
>
> Original file looks like:
>
> blah
> A couple hundred thousand Datum elements.
> 'Datum' are non-trivial, containing extensive subtrees.
> ...etc...
> blah
>
> I'd like a tool to split that into maybe
> 10 different, valid XML files, all of which have the ,
> and tags,
> but 1/10th as many <Datum>s per file.
>
> The problem is that on my 4Gig laptop, I run out of memory
> for any tool which tries to read in the whole tree at
> one time. In my case, Python's ElementTree fails, a la:
>
>> fin = open("BigFile.xml", "r")
>> tree = xml.etree.ElementTree.parse(fin) --> Out of Memory
>
> Solution doesn't have to be Python, but it would be nicest
> if it were, as rest of the processing is all done in
> a Python script.

Out of interest, is it just one large XML file or multiple XML files
within one file?

Also, have you tried lxml? [0]

[0] - http://codespeak.net/lxml/
[SLUG] Python, XML, and Splitting a 750M XML File?
G'Day,

Any easy XML (Python or otherwise) tools for splitting a 750M
XML file down into smaller portions?

Because the file is so large and exceeds memory size, I think the tool
needs to be a 'streaming' tool. On IBM DeveloperWorks site, I found an
article detailing using XSLT, but in other places it states XSLT tools
usually aren't streaming, so I'm guessing none of the XSLT processors
(xalan, saxon) will succeed. (Not to mention it's been more than
10 years since I last worked with XSLT.)

Original file looks like:

blah
A couple hundred thousand Datum elements.
'Datum' are non-trivial, containing extensive subtrees.
...etc...
blah

I'd like a tool to split that into maybe 10 different, valid XML files,
all of which have the , and tags, but 1/10th as many <Datum>s per file.

The problem is that on my 4Gig laptop, I run out of memory
for any tool which tries to read in the whole tree at
one time. In my case, Python's ElementTree fails, a la:

> fin = open("BigFile.xml", "r")
> tree = xml.etree.ElementTree.parse(fin) --> Out of Memory

Solution doesn't have to be Python, but it would be nicest
if it were, as rest of the processing is all done in
a Python script.

Cheers,
Tom
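One way to stay within ElementTree without loading the whole tree is `xml.etree.ElementTree.iterparse`, which yields elements as they are parsed and lets the caller discard them afterwards. The sketch below is a minimal illustration under some assumptions, not a tested answer for the real file: Datum elements are taken to be direct children of the document element, the document is taken to use no XML namespaces, and only the document element's tag and attributes are repeated in each output file (other header elements are skipped). The function name `split_xml` and the `out_pattern` parameter are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def split_xml(src, out_pattern, per_file, tag="Datum"):
    """Stream src and write groups of per_file <tag> elements to numbered files.

    Returns the list of file names written.
    """
    written = []
    out, count = None, 0
    root = None
    open_tag = close_tag = ""
    depth = 0

    for event, elem in ET.iterparse(src, events=("start", "end")):
        if event == "start":
            depth += 1
            if root is None:
                # First start event is the document element; remember its tag.
                root = elem
                attrs = "".join(
                    f" {k}={quoteattr(v)}" for k, v in elem.attrib.items()
                )
                open_tag = f"<{elem.tag}{attrs}>\n"
                close_tag = f"</{elem.tag}>\n"
            continue

        depth -= 1
        if depth != 1:
            continue  # only act when a direct child of the root has ended

        if elem.tag == tag:
            if out is None:
                name = out_pattern.format(len(written))
                out = open(name, "w", encoding="utf-8")
                out.write(open_tag)
                written.append(name)
            elem.tail = "\n"
            out.write(ET.tostring(elem, encoding="unicode"))
            count += 1
            if count == per_file:
                out.write(close_tag)
                out.close()
                out, count = None, 0

        root.clear()  # drop finished children so memory stays bounded

    if out is not None:
        out.write(close_tag)
        out.close()
    return written
```

For ten roughly equal files, `per_file` would be about a tenth of the total Datum count, e.g. `split_xml("BigFile.xml", "part{}.xml", 20000)` if there are around 200,000 of them. The `root.clear()` call is what keeps memory use flat: without it, iterparse still accumulates the full tree under the root.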