Alan Gauld, 21.12.2010 15:11:
"Stefan Behnel" wrote
And I thought a 1G file was extreme... Do these people stop to think that
with XML, as much as 80% of their "data" is just description (i.e. the tags)?
As I already said, it compresses well. In run-length compressed XML
files, the tags can easily take up a negligible amount of space compared
to the more widely varying data content.
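To make the point concrete, here is a quick sketch (the element names are
made up, not from this thread) showing how much gzip can squeeze out of a
tag-heavy document, since the repeated tags compress almost for free:

```python
import gzip

# Build a small XML document where most bytes are repetitive tags
# (the names "record", "name", "value" are purely illustrative).
records = "".join(
    "<record><name>row%d</name><value>%d</value></record>" % (i, i)
    for i in range(10000)
)
doc = ("<data>%s</data>" % records).encode("ascii")

compressed = gzip.compress(doc)
print("raw bytes:     ", len(doc))
print("gzipped bytes: ", len(compressed))
print("ratio: %.1fx" % (len(doc) / len(compressed)))
```

On data like this the ratio is typically well over 10x, precisely because
the tag overhead is so repetitive.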
I understand how compression helps with the data transmission aspect.
XML files compress rather well. And depending on how fast your underlying
storage is, decompressing and parsing the file may still be faster than
parsing a huge uncompressed file directly.
But I don't understand how uncompressing a file before parsing it can
be faster than parsing the original uncompressed file.
I didn't say "uncompressing a file *before* parsing it". I meant
uncompressing the data *while* parsing it. Just as the parser already has to
decode the data, decompression is simply one more step before decoding.
Depending on the relative speeds of I/O and decompression, it can be faster
to load the compressed data and decompress it into the parser on the fly.
lxml.etree (or rather libxml2) does this for you internally, for example,
when it detects compressed input while parsing from a file.
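A rough stdlib sketch of the decompress-while-parsing idea (using gzip plus
ElementTree rather than lxml, with an in-memory buffer standing in for a real
.xml.gz file on disk):

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Prepare a gzip-compressed XML document in memory (tag names are
# illustrative); in practice this would be an .xml.gz file on disk.
data = b"<items>" + b"".join(
    b"<item>%d</item>" % i for i in range(1000)
) + b"</items>"
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(data)
buf.seek(0)

# Decompress *while* parsing: iterparse pulls decompressed chunks from
# the GzipFile stream on demand, so the full uncompressed document never
# has to exist on disk or in memory at once.
count = 0
with gzip.GzipFile(fileobj=buf, mode="rb") as stream:
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "item":
            count += 1
            elem.clear()  # discard the element to keep memory flat
print("parsed items:", count)
```

Because iterparse accepts any file-like object, the gzip stream drops in
transparently; no temporary uncompressed file is needed.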
Note that these performance differences are tricky to prove in benchmarks,
as repeating the benchmark usually means that the file is already cached in
memory after the first run, so the decompression overhead will dominate in
the second run. That's not what you will see in a clean run or for huge
files, though.
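One way to observe the caching effect described above is to time several
consecutive parses of the same file: on a cold cache the first run pays for
disk I/O, while later runs hit the OS page cache. The absolute timings vary
by machine, so this sketch (with a throwaway file path) only prints them:

```python
import os
import tempfile
import time
import xml.etree.ElementTree as ET

# Create a throwaway XML file to benchmark (path is illustrative).
path = os.path.join(tempfile.gettempdir(), "bench.xml")
with open(path, "wb") as f:
    f.write(b"<r>" + b"<e>x</e>" * 200000 + b"</r>")

# Time repeated parses: after the first run the OS usually has the file
# in its page cache, so later runs measure parsing alone, not disk I/O.
timings = []
for run in range(3):
    start = time.perf_counter()
    root = ET.parse(path).getroot()
    timings.append(time.perf_counter() - start)

print(["%.4fs" % t for t in timings])
os.remove(path)
```

For a clean-cache measurement you would need to drop the OS cache (or reboot)
between runs, which is exactly why repeated benchmarks understate I/O cost.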
Stefan
_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor