On Fri, 2009-05-15 at 19:47 +0200, Francesc Alted wrote:
> On Friday 15 May 2009 17:40:15, Francesc Alted wrote:
> > On Friday 15 May 2009 15:40:16, David Fokkema wrote:
> > > Hi list,
> > >
> > > I don't get this (using pytables 2.1.1):
> > >
> > > In [1]: import tables
> > >
> > > In [2]: data = tables.openFile('data_new.h5', 'w')
> > >
> > > In [3]: data.createVLArray(data.root, 'nosee', tables.Int32Atom())
> > > Out[3]:
> > > /nosee (VLArray(0,)) ''
> > > atom = Int32Atom(shape=(), dflt=0)
> > > byteorder = 'little'
> > > nrows = 0
> > > flavor = 'numpy'
> > >
> > > In [4]: data.createVLArray(data.root, 'see', tables.Int32Atom(),
> > > filters=tables.Filters(complevel=1))
> > > Out[4]:
> > > /see (VLArray(0,), shuffle, zlib(1)) ''
> > > atom = Int32Atom(shape=(), dflt=0)
> > > byteorder = 'little'
> > > nrows = 0
> > > flavor = 'numpy'
> > >
> > > In [5]: a = 1000000 * [200]
> > >
> > > In [6]: for i in range(50):
> > >    ...:     data.root.see.append(a)
> > >    ...:
> > >
> > > In [7]: data.flush()
> > >
> > > And looking at the file:
> > >
> > > 191M 2009-05-15 15:37 data_new.h5
> > >
> > > Writing the same data to the uncompressed array also adds 191 MB to
> > > the file, so I really see no compression at all. I also tried zlib(9).
> > > Why are my arrays not compressed? The repetitive values seem like a
> > > perfect candidate for compression.
> >
> > Yes, I can reproduce this. Well, at least it seems that PyTables is
> > setting the filters correctly. For the 'see' dataset, h5ls -v reports:
> >
> > Chunks: {2048} 32768 bytes
> > Storage: 800 logical bytes, 391 allocated bytes, 204.60% utilization
> > Filter-0: shuffle-2 OPT {16}
> > Filter-1: deflate-1 OPT {1}
> > Type: variable length of
> > native int
> >
> > which clearly demonstrates that the filters are correctly installed in the
> > HDF5 pipeline :-\
> >
> > This definitely seems to be an HDF5 issue. To tell the truth, I've never
> > seen good compression ratios on VLArrays (although I never thought that
> > compression was completely nonexistent!).
> >
> > I'll try to report this to the hdf-forum list and get back to you.
Wow, thanks!
>
> Done. So, George Lewandowski answered this:
>
> """
> For VL data, the dataset itself contains a struct which points to the actual
> data, which is stored elsewhere. When you apply compression, the "pointer"
> structs are compressed, but the data itself is not affected.
> """
>
> So, we should expect gains only from compressing the pointer structure of
> the variable length dataset, and not the data itself. This effect can be
> seen with smaller rows (on the order of tens of elements, not millions, as
> in your example). For instance, this:
>
> In [57]: data = tb.openFile('/tmp/vldata2.h5', 'w')
>
> In [58]: data.createVLArray(data.root, 'see', tb.Int32Atom(),
> filters=tb.Filters(complevel=1))
>
> Out[58]:
> /see (VLArray(0,), shuffle, zlib(1)) ''
> atom = Int32Atom(shape=(), dflt=0)
> byteorder = 'little'
> nrows = 0
> flavor = 'numpy'
>
> In [59]: d = tb.numpy.arange(10)
>
> In [60]: for i in range(5000):
>    ....:     data.root.see.append(d)
>    ....:
>
> In [63]: data.close()
>
> creates a file of 301,104 bytes, while without compression the size grows
> to 397,360 bytes. Here, the 'real' data only takes 200,000 bytes (5000
> rows x 10 int32 values x 4 bytes each), and it is the pointer structure
> that has been reduced, from around 197,000 bytes down to about 101,000
> bytes, which is a 2x compression ratio (more or less).
>
> Apparently there is no provision in HDF5 for compressing the actual data in
> variable length arrays. However, if this is a must for you, you can always
> compress the data manually before writing it to disk and decompress it
> after reading it back.
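>
> Something along these lines should work (an untested sketch against the
> PyTables 2.x API; the file name, node name, and sample row are just
> placeholders):
>
> import zlib
> import numpy
> import tables
>
> # Write: deflate each row by hand and store the raw compressed bytes.
> fileh = tables.openFile('packed.h5', 'w')
> vla = fileh.createVLArray(fileh.root, 'packed', tables.UInt8Atom())
> row = numpy.array(1000000 * [200], dtype=numpy.int32)
> packed = zlib.compress(row.tostring(), 1)  # level 1, like zlib(1)
> vla.append(numpy.fromstring(packed, dtype=numpy.uint8))
> fileh.close()
>
> # Read: decompress a row and rebuild the original int32 array.
> fileh = tables.openFile('packed.h5', 'r')
> raw = fileh.root.packed[0].tostring()
> restored = numpy.fromstring(zlib.decompress(raw), dtype=numpy.int32)
> fileh.close()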
Hmmm... that's a shame. Is there really no provision for it, or is it
just hard to set up? I'll have to think this over, then. I do need
compression, because I'm basically storing parts of a terabyte dataset
on my Eee PC, which I'm very happy with because of its weight and
easy-to-travel-with design, but which is a bit underpowered for
real-world data analysis. That may rule out compression because of CPU
cycles, now that I think about it :-/ Well, I'll try compressing,
serializing and storing the data as strings, along the lines sketched
below.
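
Something like this is what I have in mind (an untested sketch, again
using the PyTables 2.x API; the node name and the event dict are just
placeholders):

import zlib
import cPickle
import tables

fileh = tables.openFile('blobs.h5', 'w')
# One variable-length string per row; each row holds one compressed pickle.
blobs = fileh.createVLArray(fileh.root, 'blobs', tables.VLStringAtom())

event = {'adc': 1000000 * [200]}  # any picklable object
blobs.append(zlib.compress(cPickle.dumps(event, 2)))

# Reading it back:
restored = cPickle.loads(zlib.decompress(blobs[0]))
fileh.close()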
Thanks,
David