Re: [SLUG] Python, XML, and Splitting a 750M XML File?

2011-01-05 Thread Chris Donovan
I was a bit bored, and this works for me...

http://pastebin.com/srPxwvSm

Chris-

On Thu, Jan 6, 2011 at 4:12 PM, Peter Miller wrote:
> On Thu, 2011-01-06 at 15:50 +1100, Peter Miller wrote:
>> >  'Datum' are non-trivial, containing extensive subtrees.
>> >  ...etc... 
>> >  blah 
>> > 
>>
>> XML is plain text, use a text tool.
>> If the line breaks are as indicated, use split(1)
>> and then hand edit the headers and footers.
>
> Or, use awk(1) and split on lines containing /<.Datum>/
> using awk's ability to write to more than one file.
> I suppose much the same could be done in Perl, too, but I'm older than
> such new-fangled things as Perl.
>
--
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html


Re: [SLUG] Python, XML, and Splitting a 750M XML File?

2011-01-05 Thread Peter Miller
On Thu, 2011-01-06 at 15:50 +1100, Peter Miller wrote:
> >  'Datum' are non-trivial, containing extensive subtrees.
> >  ...etc...  
> >  blah 
> > 
> 
> XML is plain text, use a text tool.
> If the line breaks are as indicated, use split(1)
> and then hand edit the headers and footers.

Or, use awk(1) and split on lines containing /<.Datum>/
using awk's ability to write to more than one file.
I suppose much the same could be done in Perl, too, but I'm older than
such new-fangled things as Perl.
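
Since the rest of the processing is in Python, the same line-based idea
can be sketched there too. This is only a rough sketch, and it assumes
every closing </Datum> tag sits on a line of its own, as the sample
suggests. The names split_lines, prefix and per_file, and the output
file naming, are made up for illustration.

```python
def split_lines(src_path, prefix, per_file, marker="</Datum>"):
    """Copy src_path into numbered pieces, starting a new piece after
    every `per_file` lines that contain `marker`.
    Returns the number of pieces written."""
    out, n_files, seen = None, 0, 0
    with open(src_path, "r", encoding="utf-8") as src:
        for line in src:
            if out is None:
                # Hypothetical naming scheme: part000.xml, part001.xml, ...
                out = open("%s%03d.xml" % (prefix, n_files), "w",
                           encoding="utf-8")
                n_files += 1
            out.write(line)
            if marker in line:
                seen += 1
                if seen == per_file:
                    out.close()
                    out, seen = None, 0
    if out is not None:
        out.close()
    return n_files
```

Only one line is held in memory at a time. As with split(1), the pieces
are not valid XML on their own: the header stays in the first piece and
whatever follows the last </Datum> lands in the final piece, so the
headers and footers still need hand editing.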

-- 
Regards
Peter Miller 
/\/\*        http://miller.emu.id.au/pmiller/

PGP public key ID: 1024D/D0EDB64D
fingerprint = AD0A C5DF C426 4F03 5D53  2BDB 18D8 A4E2 D0ED B64D
See http://www.keyserver.net or any PGP keyserver for public key.

"A data structure is just a stupid programming language." -- R. Wm. Gosper


Re: [SLUG] Python, XML, and Splitting a 750M XML File?

2011-01-05 Thread Peter Miller
On Thu, 2011-01-06 at 13:51 +1100, Tom Deckert wrote:
> Original file looks like:
> 
> 
>  
>  blah 
>  A couple hundred thousand Datum elements.
>  'Datum' are non-trivial, containing extensive subtrees.
>  ...etc...  
>  blah 
> 

XML is plain text, use a text tool.
If the line breaks are as indicated, use split(1)
and then hand edit the headers and footers.

-- 
Regards
Peter Miller 
/\/\*        http://miller.emu.id.au/pmiller/

PGP public key ID: 1024D/D0EDB64D
fingerprint = AD0A C5DF C426 4F03 5D53  2BDB 18D8 A4E2 D0ED B64D
See http://www.keyserver.net or any PGP keyserver for public key.

"As we said in the preface to the first edition, C 'wears well as one's
experience with it grows.'  With a decade more experience, we still feel
that way." -- Brian Kernighan and Dennis Ritchie


Re: [SLUG] Python, XML, and Splitting a 750M XML File?

2011-01-05 Thread dave b
Sorry, I misread your email.

Have you tried SAX parsing?
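
For what it's worth, a SAX handler sees one event at a time and never
builds the tree, so memory use stays flat. A minimal sketch using the
standard library's xml.sax, which only counts the Datum elements (handy
for working out how many should go in each output file). The names
DatumCounter and count_elements are made up for illustration.

```python
import xml.sax


class DatumCounter(xml.sax.ContentHandler):
    """Counts start tags with a given name; no tree is ever built."""

    def __init__(self, tag="Datum"):
        super().__init__()
        self.tag = tag
        self.count = 0

    def startElement(self, name, attrs):
        # Called once per opening tag as the parser streams the input.
        if name == self.tag:
            self.count += 1


def count_elements(source, tag="Datum"):
    """`source` may be a file name or an open binary file object."""
    handler = DatumCounter(tag)
    xml.sax.parse(source, handler)
    return handler.count
```

Something like count_elements("BigFile.xml") would then give the total.
Note that it counts every element with that name, at any nesting depth.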


Re: [SLUG] Python, XML, and Splitting a 750M XML File?

2011-01-05 Thread dave b
On 6 January 2011 13:51, Tom Deckert wrote:
>
> G'Day,
>
> Are there any easy XML tools (Python or otherwise) for splitting a 750M
> XML file into smaller portions?
>
> Because the file is so large
> and exceeds memory size, I think the tool needs to be a 'streaming'
> tool.  On the IBM developerWorks site, I found an article detailing
> the use of XSLT, but other places state that XSLT tools usually
> aren't streaming, so I'm guessing none of the XSLT processors
> (xalan, saxon) will succeed.  (Not to mention it's been more than
> 10 years since I last worked with XSLT.)
>
> Original file looks like:
> 
> 
> 
>  blah 
>  A couple hundred thousand Datum elements.
>  'Datum' are non-trivial, containing extensive subtrees.
>  ...etc... 
>  blah 
> 
>
>
> I'd like a tool to split that into maybe
> 10 different, valid XML files, all of which keep the enclosing
> header and footer tags,
> but with 1/10th as many Datum elements per file.
>
>
> The problem is that on my 4 GB laptop, I run out of memory
> with any tool that tries to read in the whole tree at
> one time.  In my case, Python's ElementTree fails, a la:
>
>> import xml.etree.ElementTree
>> fin  = open("BigFile.xml", "r")
>> tree = xml.etree.ElementTree.parse(fin)  # --> Out of Memory
>
>
> The solution doesn't have to be Python, but it would be nicest
> if it were, as the rest of the processing is all done in
> a Python script.

Out of interest, is it just one large XML file, or multiple XML files
within one file?

Also, have you tried lxml? [0]

[0] - http://codespeak.net/lxml/
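
For reference, a rough sketch of a streaming split. It uses iterparse
from the standard library's xml.etree.ElementTree; lxml has an iterparse
of its own with a similar interface. It assumes the Datum elements are
direct children of the root and that the root tag has no namespace. The
names split_xml, prefix and per_file, and the output file naming, are
made up for illustration. Other children of the root (header material
and so on) are skipped here and would need handling of their own.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def split_xml(src_path, prefix, per_file, tag="Datum"):
    """Stream src_path and write `per_file` <tag> elements per output
    file, each file wrapped in a copy of the source's root tag.
    Returns the number of files written."""
    context = ET.iterparse(src_path, events=("start", "end"))
    _, root = next(context)  # the first event is the root's start tag
    attrs = "".join(" %s=%s" % (k, quoteattr(v))
                    for k, v in root.attrib.items())
    open_tag = "<%s%s>\n" % (root.tag, attrs)
    close_tag = "</%s>\n" % root.tag

    out, n_files, seen, depth = None, 0, 0, 0
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth != 0:
            continue  # nested element, or the root's own end tag
        if elem.tag == tag:
            if out is None:
                # Hypothetical naming scheme: part000.xml, part001.xml, ...
                out = open("%s%03d.xml" % (prefix, n_files), "w",
                           encoding="utf-8")
                out.write(open_tag)
                n_files += 1
            elem.tail = "\n"  # tail may not be parsed yet; normalise it
            out.write(ET.tostring(elem, encoding="unicode"))
            seen += 1
            if seen == per_file:
                out.write(close_tag)
                out.close()
                out, seen = None, 0
        root.clear()  # drop finished children so memory stays flat
    if out is not None:
        out.write(close_tag)
        out.close()
    return n_files
```

Only one Datum subtree is held in memory at a time, because each
finished child is dropped from the root as soon as it has been written.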


[SLUG] Python, XML, and Splitting a 750M XML File?

2011-01-05 Thread Tom Deckert

G'Day,

Are there any easy XML tools (Python or otherwise) for splitting a 750M
XML file into smaller portions?

Because the file is so large
and exceeds memory size, I think the tool needs to be a 'streaming'
tool.  On the IBM developerWorks site, I found an article detailing
the use of XSLT, but other places state that XSLT tools usually
aren't streaming, so I'm guessing none of the XSLT processors
(xalan, saxon) will succeed.  (Not to mention it's been more than
10 years since I last worked with XSLT.)

Original file looks like:


 
 blah 
 A couple hundred thousand Datum elements.
 'Datum' are non-trivial, containing extensive subtrees.
 ...etc...  
 blah 



I'd like a tool to split that into maybe
10 different, valid XML files, all of which keep the enclosing
header and footer tags,
but with 1/10th as many Datum elements per file.


The problem is that on my 4 GB laptop, I run out of memory
with any tool that tries to read in the whole tree at
one time.  In my case, Python's ElementTree fails, a la:

> import xml.etree.ElementTree
> fin  = open("BigFile.xml", "r")
> tree = xml.etree.ElementTree.parse(fin)  # --> Out of Memory


The solution doesn't have to be Python, but it would be nicest
if it were, as the rest of the processing is all done in
a Python script.


Cheers,
Tom




