Re: [Tutor] Trying to parse a HUGE(1gb) xml file in python

Stefan Behnel Tue, 21 Dec 2010 02:58:12 -0800

David Hutto, 21.12.2010 11:29:

On Tue, Dec 21, 2010 at 5:19 AM, Stefan Behnel wrote:

Alan Gauld, 21.12.2010 10:58:

22 Jan 2009 ... Stripping Illegal Characters from XML in Python


... I'd be asking Python to process 6.4 gigabytes of CSV into
6.5 gigabytes of XML 1. ..... In fact, what happened was that
the parsing didn't work and the whole db was ...

And I thought a 1G file was extreme... Do these people stop to think that
with XML as much as 80% of their "data" is just description (i.e. the tags)?


As I already said, it compresses well. In run-length compressed XML files,
the tags can easily take up a negligible amount of space compared to the
more widely varying data content (although that also commonly tends to
compress rather well). And depending on how fast your underlying storage is,
decompressing and parsing the file may still be faster than parsing a huge
uncompressed file directly. So, again, the sheer uncompressed file size is
*not* a very interesting argument.
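
As an illustration of decompressing and parsing in one pass, here is a minimal sketch using only the standard library (gzip plus ElementTree's iterparse). The file name, the "record" tag and the count_records helper are made up for the example, and it uses gzip rather than any particular compressor mentioned in this thread. Whether it actually beats parsing the uncompressed file depends on your storage and CPU, so measure before relying on it.

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET


def count_records(path, tag="record"):
    """Stream a gzip-compressed XML file and count <tag> elements.

    Hypothetical helper: the tag name is an assumption for this sketch.
    """
    count = 0
    with gzip.open(path, "rb") as f:
        # iterparse reads the decompressed stream incrementally,
        # so the whole document is never held in memory as text.
        for event, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == tag:
                count += 1
                elem.clear()  # release the element's children and text
    return count


# Build a small compressed sample file so the sketch is runnable.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.xml.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    f.write("<records>")
    for i in range(1000):
        f.write("<record><id>%d</id><name>item %d</name></record>" % (i, i))
    f.write("</records>")

n_records = count_records(sample_path)
print(n_records)
```

Note that elem.clear() only empties each processed element; the root still keeps a reference to the emptied children, so for truly huge files you may also want to remove them from the root.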


However, could they (as mentioned elsewhere, and by others in another
form) mitigate the damage by using smaller tags exclusively?

Why should that have a (noticeable) impact on the compressed file? It's the inherent nature of compression to reduce redundancy, which in XML files usually includes the redundancy of repeated tag names (even if the compression is not specifically XML aware).

It's a very bad idea to use short and obfuscated tag names to reduce the storage size. That's like coding in assembler to reduce the size of the source code. Just use compression for storage, or buy a larger hard disk for your NAS.
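
A rough way to check this yourself is to compress the same synthetic data twice, once with long descriptive tag names and once with one-letter tags, and compare the sizes. The sketch below uses zlib (DEFLATE) from the standard library as a stand-in for whatever compressor you would really use; the tag names, the build_xml helper and the random data are all invented for the example, so treat the numbers as indicative only.

```python
import random
import zlib


def build_xml(rec_tag, id_tag, name_tag, count=5000, seed=0):
    """Build a synthetic XML document; only the tag names differ between calls.

    Hypothetical helper for this comparison, not part of any library.
    """
    rng = random.Random(seed)  # same seed -> identical data content
    parts = ["<root>"]
    for num in range(count):
        name = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(16))
        parts.append(
            "<{rec}><{idt}>{num}</{idt}><{nm}>{name}</{nm}></{rec}>".format(
                rec=rec_tag, idt=id_tag, nm=name_tag, num=num, name=name
            )
        )
    parts.append("</root>")
    return "".join(parts).encode("utf-8")


long_raw = build_xml("customer_order_record", "customer_identifier",
                     "customer_display_name")
short_raw = build_xml("r", "i", "n")

long_z = zlib.compress(long_raw, 9)
short_z = zlib.compress(short_raw, 9)

# How much bigger is the long-tag variant, before and after compression?
raw_ratio = len(long_raw) / len(short_raw)
z_ratio = len(long_z) / len(short_z)

print("uncompressed: long=%d short=%d ratio=%.2f"
      % (len(long_raw), len(short_raw), raw_ratio))
print("compressed:   long=%d short=%d ratio=%.2f"
      % (len(long_z), len(short_z), z_ratio))
```

If the compressed ratio comes out much closer to 1 than the uncompressed one, the long tag names cost little once the file is compressed.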

And also compressed is formatted, even for the tags, correct?


The (lossless) compression doesn't change the content.
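
A tiny round trip shows what that means in practice; this is only a sketch with a made-up document, again using zlib from the standard library:

```python
import zlib
import xml.etree.ElementTree as ET

# Made-up sample document for the round trip.
original = b"<order><customer_identifier>42</customer_identifier></order>"

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# Byte-for-byte identical, so tags and formatting survive unchanged.
same_bytes = restored == original

# The restored bytes parse exactly like the original would.
root = ET.fromstring(restored)
first_tag = root[0].tag

print(same_bytes, first_tag)
```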

Stefan

_______________________________________________
Tutor maillist  -  [email protected]
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
