Re: [SLUG] Python, XML, and Splitting a 750M XML File?

2011-01-05 Thread Chris Donovan
I was a bit bored, and this works for me...

http://pastebin.com/srPxwvSm

Chris-

On Thu, Jan 6, 2011 at 4:12 PM, Peter Miller  wrote:
> On Thu, 2011-01-06 at 15:50 +1100, Peter Miller wrote:
>> >  'Datum' are non-trivial, containing extensive subtrees.
>> >  ...etc... 
>> >  blah 
>> > 
>>
>> XML is plain text, use a text tool.
>> If the line breaks are as indicated, use split(1)
>> and then hand edit the headers and footers.
>
> Or, use awk(1) and split on lines containing /<.Datum>/
> using awk's ability to write to more than one file.
> I suppose much the same could be done in Perl, too, but I'm older than
> such new-fangled things as Perl.
>
> --
> Regards
> Peter Miller 
> /\/\*        http://miller.emu.id.au/pmiller/
>
> PGP public key ID: 1024D/D0EDB64D
> fingerprint = AD0A C5DF C426 4F03 5D53  2BDB 18D8 A4E2 D0ED B64D
> See http://www.keyserver.net or any PGP keyserver for public key.
>
> "A data structure is just a stupid programming language." -- R. Wm. Gosper
--
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html


Re: [SLUG] Python, XML, and Splitting a 750M XML File?

2011-01-05 Thread Peter Miller
On Thu, 2011-01-06 at 15:50 +1100, Peter Miller wrote:
> >  'Datum' are non-trivial, containing extensive subtrees.
> >  ...etc...  
> >  blah 
> > 
> 
> XML is plain text, use a text tool.
> If the line breaks are as indicated, use split(1)
> and then hand edit the headers and footers.

Or, use awk(1) and split on lines containing /<.Datum>/
using awk's ability to write to more than one file.
I suppose much the same could be done in Perl, too, but I'm older than
such new-fangled things as Perl.
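
For readers who would rather stay in Python, the same line-oriented idea
can be sketched as below. This is only a rough sketch, not anything posted
in the thread: it assumes each closing tag is literally `</Datum>` and that
the line breaks fall as in the example. The function name and the
`per_chunk` parameter are made up for illustration. As with split(1), the
enclosing header and footer still have to be added to each piece.

```python
def split_on_closing_tag(lines, closing_tag="</Datum>", per_chunk=2):
    """Group lines into chunks, ending a chunk after every `per_chunk`
    lines that contain `closing_tag` (hypothetical helper).

    `lines` can be any iterable of strings, so an open file can be
    streamed through without reading it into memory all at once.
    """
    chunk, seen = [], 0
    for line in lines:
        chunk.append(line)
        if closing_tag in line:
            seen += 1
            if seen == per_chunk:
                yield chunk
                chunk, seen = [], 0
    if chunk:  # leftover lines after the last full chunk
        yield chunk
```

Each chunk could then be written out with `writelines()` to a numbered
output file, much as awk would redirect output to several files.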

-- 
Regards
Peter Miller 
/\/\*        http://miller.emu.id.au/pmiller/

PGP public key ID: 1024D/D0EDB64D
fingerprint = AD0A C5DF C426 4F03 5D53  2BDB 18D8 A4E2 D0ED B64D
See http://www.keyserver.net or any PGP keyserver for public key.

"A data structure is just a stupid programming language." -- R. Wm. Gosper


Re: [SLUG] Python, XML, and Splitting a 750M XML File?

2011-01-05 Thread Peter Miller
On Thu, 2011-01-06 at 13:51 +1100, Tom Deckert wrote:
> Original file looks like:
> 
> 
>  
>  blah 
>  A couple hundred thousand Datum elements.
>  'Datum' are non-trivial, containing extensive subtrees.
>  ...etc...  
>  blah 
> 

XML is plain text, use a text tool.
If the line breaks are as indicated, use split(1)
and then hand edit the headers and footers.

-- 
Regards
Peter Miller 
/\/\*        http://miller.emu.id.au/pmiller/

PGP public key ID: 1024D/D0EDB64D
fingerprint = AD0A C5DF C426 4F03 5D53  2BDB 18D8 A4E2 D0ED B64D
See http://www.keyserver.net or any PGP keyserver for public key.

"As we said in the preface to the first edition, C 'wears well as one's
experience with it grows.'  With a decade more experience, we still feel
that way." -- Brian Kernighan and Dennis Ritchie


Re: [SLUG] Python, XML, and Splitting a 750M XML File?

2011-01-05 Thread dave b
Sorry, I misread your email.

Have you tried sax parsing?
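
A minimal sketch of what SAX parsing looks like with Python's standard
`xml.sax` module, assuming the repeated element is named `Datum`. It only
counts the elements, which is enough to work out how many should go into
each output file. The class and function names are invented for this
example.

```python
import xml.sax


class DatumCounter(xml.sax.ContentHandler):
    """Counts start tags with a given name as the parser streams by."""

    def __init__(self, tag="Datum"):
        super().__init__()
        self.tag = tag
        self.count = 0

    def startElement(self, name, attrs):
        if name == self.tag:
            self.count += 1


def count_elements(source, tag="Datum"):
    """`source` may be a filename or an open binary file object."""
    handler = DatumCounter(tag)
    xml.sax.parse(source, handler)  # no tree is ever built in memory
    return handler.count
```

Something like `count_elements("BigFile.xml") // 10` would then give a
rough per-file quota for a second, splitting pass.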


Re: [SLUG] Python, XML, and Splitting a 750M XML File?

2011-01-05 Thread dave b
On 6 January 2011 13:51, Tom Deckert  wrote:
>
> G'Day,
>
> Any easy XML (Python or otherwise) tools for splitting a 750M
> XML file down into smaller portions?
>
> Because the file is so large
> and exceeds memory size, I think the tool needs to be a 'streaming'
> tool.  On IBM DeveloperWorks site, I found an article detailing
> using XSLT, but in other places it states XSLT tools usually
> aren't streaming, so I'm guessing none of the XSLT processors
> (xalan, saxon) will succeed.  (Not to mention it's been more than
> 10 years since I last worked with XSLT.)
>
> Original file looks like:
> 
> 
> 
>  blah 
>  A couple hundred thousand Datum elements.
>  'Datum' are non-trivial, containing extensive subtrees.
>  ...etc... 
>  blah 
> 
>
>
> I'd like a tool to split that into maybe
> 10 different, valid XML files, all of which have the same
> enclosing tags, but 1/10th as many Datum elements per file.
>
>
> The problem is that on my 4Gig laptop, I run out of memory
> for any tool which tries to read in the whole tree at
> one time.  In my case, Python's ElementTree fails, ala:
>
>> import xml.etree.ElementTree
>> fin  = open("BigFile.xml", "r")
>> tree = xml.etree.ElementTree.parse(fin)  # --> Out of Memory
>
>
> Solution doesn't have to be Python, but it would be nicest
> if it were, as rest of the processing is all done in
> a Python script.

Out of interest is it just one large xml file or multiple xml files
within one file ?

Also, have you tried lxml? [0]

[0] - http://codespeak.net/lxml/
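
One streaming approach, sketched here with the standard library's
`xml.etree.ElementTree.iterparse` (lxml offers an `iterparse` of its own).
This is an illustration under assumptions, not a tested solution for the
file in question: it assumes the `Datum` elements are direct children of
the root and are not nested inside each other, and it copies only the root
element into each output, not any other header or footer elements. The
function name and `per_file` parameter are hypothetical.

```python
import xml.etree.ElementTree as ET


def split_xml(source, tag="Datum", per_file=2):
    """Yield complete XML documents (bytes), each wrapping at most
    `per_file` <tag> elements in a copy of the original root element."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event: start of the root element

    # Serialise an empty copy of the root once, then cut it into its
    # opening and closing halves so every output can be wrapped in it.
    shell = ET.tostring(ET.Element(root.tag, dict(root.attrib)),
                        short_empty_elements=False)
    head, sep, rest = shell.rpartition(b"</")
    close = sep + rest

    batch = []
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            elem.tail = None          # drop whitespace between elements
            batch.append(ET.tostring(elem))
            root.clear()              # detach subtrees already handled
            if len(batch) == per_file:
                yield head + b"".join(batch) + close
                batch = []
    if batch:                         # final, possibly smaller, document
        yield head + b"".join(batch) + close
```

Opening the big file in binary mode and writing each yielded document to
its own numbered file keeps only one batch in memory at a time.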


[SLUG] Python, XML, and Splitting a 750M XML File?

2011-01-05 Thread Tom Deckert

G'Day,

Any easy XML (Python or otherwise) tools for splitting a 750M 
XML file down into smaller portions?  

Because the file is so large
and exceeds memory size, I think the tool needs to be a 'streaming'
tool.  On IBM DeveloperWorks site, I found an article detailing 
using XSLT, but in other places it states XSLT tools usually
aren't streaming, so I'm guessing none of the XSLT processors
(xalan, saxon) will succeed.  (Not to mention it's been more than
10 years since I last worked with XSLT.)

Original file looks like:


 
 blah 
 A couple hundred thousand Datum elements.
 'Datum' are non-trivial, containing extensive subtrees.
 ...etc...  
 blah 



I'd like a tool to split that into maybe
10 different, valid XML files, all of which have the same
enclosing tags, but 1/10th as many Datum elements per file.


The problem is that on my 4Gig laptop, I run out of memory
for any tool which tries to read in the whole tree at
one time.  In my case, Python's ElementTree fails, ala:

> import xml.etree.ElementTree
> fin  = open("BigFile.xml", "r")
> tree = xml.etree.ElementTree.parse(fin)  # --> Out of Memory


Solution doesn't have to be Python, but it would be nicest 
if it were, as rest of the processing is all done in
a Python script.


Cheers,
Tom







Re: [SLUG] Value of Red Hat certification ?

2011-01-05 Thread darrin hodges
I did my RHCE last year: RH253 (Networking and Admin) and RH302 (RHCE
exam), which I paid for out of my own pocket (about $4,000 all up).  If
you have good experience with Linux (whichever distro), it's only a matter of
learning how to do things the RH-specific way and you'll get through the exam
fairly easily.  It did help to land me a job (in an Ubuntu shop!) and
certification will allow prospective employers to see that you have a
particular level of knowledge of Linux. My experience dealing with Red Hat
was very positive; they were very helpful throughout the process. I believe
for me it was worth doing as it gave me 'the edge'.

cheers
Darrin.




On Thu, Jan 6, 2011 at 1:33 AM, Rod Butcher wrote:

> I had considered that - my plan is to actually train myself to be
> vendor-neutral, i.e. familiarise myself with the major distros (RHEL, SUSE,
> Ubuntu) so that I can administer them all, but to add the RHEL specialisation
> on top of that, mainly because RHEL is apparently viewed as Number 1 - but I
> think somebody who can only make a single distro work is pretty useless.
> I think Red Hat certification will inevitably include a degree of
> advertising/brainwashing to try to get people to do things their way purely
> to differentiate their brand, but I'm old enough to see through FUD.
> How do employers view this - do they assume that serious admins make sure
> they are familiar with multiple distros, and see RHEL certification as a
> bonus (i.e. the person knows More), or do they assume that Red Hat cert
> means a person knows Less?
> thanks
> Rod
>
>
> On 05/01/11 12:22, onlyjob wrote:
>
>> Why Get a Vendor/Distribution *Neutral* Linux Certification?
>> http://www.youtube.com/watch?v=ZaGjgdYB1vI
>>


Re: [SLUG] Value of Red Hat certification ?

2011-01-05 Thread Rod Butcher
I had considered that - my plan is to actually train myself to be 
vendor-neutral, i.e. familiarise myself with the major distros (RHEL, 
SUSE, Ubuntu) so that I can administer them all, but to add the RHEL 
specialisation on top of that, mainly because RHEL is apparently viewed 
as Number 1 - but I think somebody who can only make a single distro 
work is pretty useless.
I think Red Hat certification will inevitably include a degree of 
advertising/brainwashing to try to get people to do things their way 
purely to differentiate their brand, but I'm old enough to see through FUD.
How do employers view this - do they assume that serious admins make 
sure they are familiar with multiple distros, and see RHEL 
certification as a bonus (i.e. the person knows More), or do they 
assume that Red Hat cert means a person knows Less?

thanks
Rod

On 05/01/11 12:22, onlyjob wrote:

Why Get a Vendor/Distribution *Neutral* Linux Certification?
http://www.youtube.com/watch?v=ZaGjgdYB1vI
