Re: Trying to parse a HUGE(1gb) xml file

Stefan Sonnenberg-Carstens Sat, 25 Dec 2010 12:54:36 -0800

On 25.12.2010 20:41, Roy Smith wrote:

In article<[email protected]>,
  Adam Tauno Williams<[email protected]>  wrote:

XML works extremely well for large datasets.

Barf.  I'll agree that there are some nice points to XML.  It is
portable.  It is (to a certain extent) human readable, and in a pinch
you can use standard text tools to do ad-hoc queries (e.g. grep for a
particular entry).  And, yes, there are plenty of toolsets for dealing
with XML files.

On the other hand, the verbosity is unbelievable.  I'm currently working
with a data feed we get from a supplier in XML.  Every day we get
incremental updates of about 10-50 MB each.  The total data set at this
point is 61 GB.  It's got stuff like this in it:

         <Parental-Advisory>FALSE</Parental-Advisory>

That's 54 bytes, counting the indentation and newline, to store a single
bit of information.  I'm all for human-readable formats, but bloating
the data by a factor of 432 is rather excessive.  Of course, that's an
extreme example.  A more efficient example would be:

         <Id>1173722</Id>

which is 26 bytes to store a 4-byte integer.  That's only a bloat
factor of 6-1/2.
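
The byte counts and bloat factors above can be checked with a few lines
of Python.  This is only a sketch: it assumes each line is counted with
its 9 leading spaces and trailing newline, and that the integer would
otherwise occupy 4 bytes.  The variable names are made up for
illustration.

```python
# Each line is measured as it appears in the feed: 9 leading spaces,
# the element itself, and a trailing newline (an assumption that
# reproduces the 54 and 26 byte figures).
flag_line = " " * 9 + "<Parental-Advisory>FALSE</Parental-Advisory>\n"
id_line = " " * 9 + "<Id>1173722</Id>\n"

flag_bytes = len(flag_line.encode("utf-8"))
id_bytes = len(id_line.encode("utf-8"))

# A boolean carries 1 bit of information; the XML line uses 8 bits per byte.
flag_bloat = flag_bytes * 8 / 1

# Assumption: the Id would otherwise be stored as a 4-byte (32-bit) integer.
id_bloat = id_bytes / 4

print(flag_bytes, flag_bloat)  # 54 432.0
print(id_bytes, id_bloat)      # 26 6.5
```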

Of course, one advantage of XML is that with so much redundant text, it
compresses well.  We typically see gzip compression ratios of 20:1.
But that just means you can archive them efficiently; you can't do
anything useful until you unzip them.
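
How well a feed compresses depends on the data, so a ratio is best
measured rather than assumed.  The sketch below builds a synthetic feed
(the element names mirror the examples above, the records themselves are
invented), measures the gzip ratio, and then parses the compressed bytes
as a stream.  Python's gzip module decompresses on the fly, so at least
in this setting no unzipped copy has to be written to disk first.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Synthetic feed; <Feed> and <Item> are hypothetical wrapper elements.
records = "".join(
    "  <Item>\n"
    f"    <Id>{1173722 + i}</Id>\n"
    "    <Parental-Advisory>FALSE</Parental-Advisory>\n"
    "  </Item>\n"
    for i in range(10000)
)
raw = ("<Feed>\n" + records + "</Feed>\n").encode("utf-8")

# Compress in memory and report the ratio for this particular data.
packed = gzip.compress(raw)
ratio = len(raw) / len(packed)
print(f"{len(raw)} -> {len(packed)} bytes, ratio {ratio:.1f}:1")

# Stream-parse directly from the compressed bytes.
count = 0
with gzip.open(io.BytesIO(packed), "rb") as stream:
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "Item":
            count += 1
            elem.clear()  # drop the item's children once it has been handled
print(count)
```

With a real feed, passing the path of the .gz file to gzip.open in place
of the BytesIO object should work the same way.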

Sending complete SQLite databases is absolutely perfect.
For example, Fedora uses (used?) this for its yum catalog updates.
Download it to the right place, point your tool at it, and you're ready.
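
A minimal sketch of that workflow with Python's sqlite3 module follows.
The file name, table, and columns are hypothetical, and the "producer"
half only stands in for whatever the supplier would actually ship.

```python
import os
import sqlite3
import tempfile

# Stand-in for the downloaded file; the name and location are made up.
db_path = os.path.join(tempfile.mkdtemp(), "catalog.sqlite")

# Producer side: write the catalog once into a single database file.
conn = sqlite3.connect(db_path)
conn.execute(
    "CREATE TABLE items ("
    "id INTEGER PRIMARY KEY, parental_advisory INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(1173722, 0), (1173723, 1)],
)
conn.commit()
conn.close()

# Consumer side: open the file where it landed and query it directly.
conn = sqlite3.connect(db_path)
rows = conn.execute(
    "SELECT id FROM items WHERE parental_advisory = 0"
).fetchall()
conn.close()
print(rows)  # [(1173722,)]
```

The consumer needs no parsing step at all: the file is usable as soon as
the download finishes, and the flag is stored as an integer column
instead of a tagged string.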

--
http://mail.python.org/mailman/listinfo/python-list
