Hi Francesc,

thanks for your elaborate answer. In fact, this is exactly (and I
really do mean exactly) what I have been doing, after finding out by
experiment that the EArray would not work for me. This solution indeed
works extremely well.
It was just the fact that I am still limited to a predefined array size
that made me wonder if there would be an EArray solution. I am very new
to pytables, so I could have missed something :) On the other hand (as
a short note on my background), I am not new to storing and processing
large data arrays (remote sensing), but until now I used to store my
data as (geo)tiff files, chunked (tiled, in tiff jargon) and compressed
as well, so I was glad to see the chunk and compression support in
hdf5/pytables.
The nice thing about pytables, compared to tiff, is that I can use my
data as if it were one large memmapped numpy array and let pytables do
the rest, while still saving lots of disk space compared to normal
memmapped arrays.
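
Just to illustrate what I mean, a small sketch (the file and node names
here are made up):

import tables

h5f = tables.openFile('mosaic.h5', 'r')
data = h5f.root.carray
# Slicing reads only the chunks overlapping the requested window and
# decompresses them on the fly, much like indexing a memmapped array.
window = data[10000:10512, 5000:5512, :, 0]
h5f.close()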


Francesc Altet wrote:
> On Thu, 26 Apr 2007 at 08:55, Vincent Schut wrote:
>> The problem I face is this:
>> I would like to store lots of time-dependent data in an EArray, where
>> the extensible dimension represents the date. However, the size of
>> the other dimensions is such that even an array slice for one date
>> does not fit into memory. E.g. my EArray is defined with a shape of
>> (40000, 20000, 4, 0), UInt16. When I want to extend the EArray with a
>> new date slice, I have to append the whole slice at once, which does
>> not work because a numpy array of (40000, 20000, 4, 1) is too large
>> to fit into memory.
>> Is there a way to append an uninitialized array slice in a memory
>> efficient way, and fill it part by part later?
> 
> Mmm, currently not. However, you may want to use a CArray for doing
> this.  With a CArray, you normally define a (potentially huge) array on
> disk, and then you can start populating it by slices (also called
> hyperslabs in HDF5 jargon). Look at this example (slightly modified
> from examples/carray1.py):
> 
> import numpy
> import tables
> 
> N = XXX  # the number of date slices you want to keep in the CArray
> fileName = 'carray1.h5'
> shape = (40000, 20000, 4, N)
> atom = tables.UInt16Atom()
> filters = tables.Filters(complevel=5, complib='zlib')
> 
> # Create the file and fill it with one hyperslab
> h5f = tables.openFile(fileName, 'w')
> ca = h5f.createCArray(h5f.root, 'carray', atom, shape, filters=filters)
> # Fill a hyperslab in ``ca``.
> ca[:100, 0:50, :, 0] = numpy.ones((100, 50, 4), dtype='uint16')
> ca.attrs.prows = 1  # Keep track of the number of populated rows
> h5f.close()
> 
> # Re-open the file and save another hyperslab
> h5f = tables.openFile(fileName, 'a')
> ca = h5f.root.carray  # re-fetch the node after re-opening the file
> # Fill another hyperslab in ``ca``.
> ca[100:200, 50:100, :, 0] = numpy.zeros((100, 50, 4), dtype='uint16')
> ca.attrs.prows += 1  # Keep track of the number of populated rows
> h5f.close()
> 
> With this, you can continue populating the dataset (probably using a
> loop) until you are done.
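> 
> For example, such a loop could look roughly like this (only a sketch;
> read_block() is a made-up placeholder for whatever produces the data
> of a given block and date):
> 
> h5f = tables.openFile(fileName, 'a')
> ca = h5f.root.carray
> for date in range(N):
>     # Walk the spatial dimensions in blocks that fit in memory.
>     for i in range(0, shape[0], 1000):
>         for j in range(0, shape[1], 1000):
>             ca[i:i+1000, j:j+1000, :, date] = read_block(i, j, date)
>     ca.attrs.prows = date + 1  # number of populated date slices
> h5f.close()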
> 
> As the CArray also supports compression (it is active in the example
> above), you don't have to be afraid of the initial size of the CArray:
> initially it will be filled with 0's (you can change this value by
> specifying the dflt argument of the Atom constructor), so the freshly
> created dataset will have very low entropy (zero, in fact) and the
> compressor will do an excellent job at reducing the space needed to
> keep it.
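> 
> For instance, a sketch making that default explicit for UInt16 data:
> 
> atom = tables.UInt16Atom(dflt=0)  # unwritten areas read back as 0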
> 
> Indeed, when you start populating the CArray, you will notice that the
> dataset on disk starts to grow (you are adding more entropy to it).  So,
> in the end, what you have is a kind of EArray (because it is
> 'extendible' in practical terms), but with the possibility to populate
> it with small 'hyperslabs' until you are done.
> 
> Finally, if speed is critical to you, you might want to specify the
> chunk size for the CArray (see the 'chunkshape' argument of
> createCArray in PyTables 2.0) to be the same as the hyperslab that you
> are using to populate it.  Be careful, because this is only meant to
> be used by experts (for example, specifying too large a chunkshape
> would require too much effort on the compressor side and performance
> would degrade).  Generally, the algorithm in PyTables for computing
> the optimal chunksize should be enough for most situations.
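> 
> A minimal sketch of that (the chunkshape below is just an
> illustration, not a recommendation):
> 
> ca = h5f.createCArray(h5f.root, 'carray', atom, shape,
>                       filters=filters, chunkshape=(256, 256, 4, 1))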
> 
> Hope that helps,
> 

