Tim Harig, 26.12.2010 02:05:
On 2010-12-25, Nobody <nob...@nowhere.com> wrote:
On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote:
Of course, one advantage of XML is that with so much redundant text, it
compresses well.  We typically see gzip compression ratios of 20:1.
But that just means you can archive them efficiently; you can't do
anything useful until you unzip them.

XML is typically processed sequentially, so you don't need to create a
decompressed copy of the file before you start processing it.
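For illustration, a minimal sketch of that point (the file name data.xml.gz and the <record>/<id> tags are made up): ElementTree's iterparse() will happily read straight from a gzip stream, so no decompressed copy ever hits the disk.

import gzip
import xml.etree.ElementTree as ET

def handle(record):
    # Placeholder for the real per-record work.
    print(record.findtext("id"))

with gzip.open("data.xml.gz", "rb") as f:
    # iterparse() pulls data from the stream incrementally.
    for event, elem in ET.iterparse(f, events=("end",)):
        if elem.tag == "record":
            handle(elem)
            elem.clear()   # drop processed subtrees to keep memory bounded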

Sometimes XML is processed sequentially.  When the markup footprint is
large enough, it must be.  Quite often, as in the case of the OP, you only
want to extract a small piece out of the total data.  In those cases, being
forced to read all of the data sequentially is both inconvenient and a
performance penalty unless there is some way to address the data you want
directly.
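Even then, a sequential scan can at least stop as soon as it has found what it wants, which softens that penalty somewhat. A sketch, again with made-up element names, not a substitute for real random access:

import xml.etree.ElementTree as ET

def find_record(path, wanted_id):
    # Scan the document in order, but bail out at the first match,
    # so on average only part of the file gets read.
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "record" and elem.get("id") == wanted_id:
            return elem
        elem.clear()
    return None

print(find_record("data.xml", "42"))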

So what? If you only have to do that once, it hardly matters whether you read the whole file or just the part you need. It should make a difference of a couple of minutes at most.

If you do it a lot, you will have to find a way to make the access efficient for your specific use case. So the file format doesn't matter either, because the data will most likely end up in a fast database after reading it in sequentially *once*, just as in the case above.
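A minimal sketch of that one-off conversion (table layout and tag names invented for the example), using the sqlite3 module that ships with Python:

import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect("records.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, value TEXT)")

with conn:  # one transaction for the whole sequential pass
    for event, elem in ET.iterparse("data.xml", events=("end",)):
        if elem.tag == "record":
            conn.execute("INSERT OR REPLACE INTO records VALUES (?, ?)",
                         (elem.get("id"), elem.findtext("value")))
            elem.clear()

# From here on, lookups are indexed and the XML is never touched again.
print(conn.execute("SELECT value FROM records WHERE id=?", ("42",)).fetchone())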

I really don't think there are many important use cases where you need fast random access to large data sets and cannot afford to adapt the storage layout beforehand.

Stefan

