On Thu, 26 Apr 2007 at 08:55, Vincent Schut wrote:
> The problem I face is this:
> I would like to store lots of time-dependent data in an EArray, where
> the extensible dimension represents the date. However, the sizes of
> the other dimensions are such that even an array slice for a single
> date does not fit into memory. E.g. my EArray is defined with a shape
> of (40000, 20000, 4, 0), UInt16. When I want to extend the EArray with
> a new date slice, I have to append the whole slice at once, which does
> not work because a numpy array of shape (40000, 20000, 4, 1) is too
> large to fit into memory.
> Is there a way to append an uninitialized array slice in a memory
> efficient way, and fill it part by part later?

Mmm, currently not. However, you may want to use a CArray for this.
With a CArray, you define a (potentially huge) array on disk up front,
and then you can start populating it by slices (also called hyperslabs
in HDF5 jargon). Look at this example (slightly modified from
examples/carray1.py):

import numpy
import tables

N = XXX  # the number of date slices you want to keep in the CArray
fileName = 'carray1.h5'
shape = (40000, 20000, 4, N)
atom = tables.UInt16Atom()
filters = tables.Filters(complevel=5, complib='zlib')

# Create the file and fill it with one hyperslab
h5f = tables.openFile(fileName, 'w')
ca = h5f.createCArray(h5f.root, 'carray', atom, shape, filters=filters)
# Fill a hyperslab in ``ca``.
ca[:100, 0:50, :, 0] = numpy.ones((100, 50, 4), dtype='uint16')
ca.attrs.prows = 1  # Keep track of the number of populated hyperslabs
h5f.close()

# Re-open the file and save another hyperslab
h5f = tables.openFile(fileName, 'a')
ca = h5f.root.carray
# Fill another hyperslab in ``ca``.
ca[100:200, 50:100, :, 0] = numpy.zeros((100, 50, 4), dtype='uint16')
ca.attrs.prows += 1  # Keep track of the number of populated hyperslabs
h5f.close()

With this, you can continue populating the dataset (probably using a
loop) until you are done.
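
For instance, such a loop could look roughly like the sketch below; the
block size of 1000 rows and the fill_block() helper are hypothetical
placeholders for however you produce your data:

import tables

h5f = tables.openFile('carray1.h5', 'a')
ca = h5f.root.carray

date_index = 0    # the date slice (last dimension) being populated
block = 1000      # hypothetical number of rows written per iteration
for start in range(0, ca.shape[0], block):
    stop = min(start + block, ca.shape[0])
    # fill_block() is a hypothetical stand-in for whatever produces
    # your (stop-start, 20000, 4) chunk of UInt16 data
    data = fill_block(start, stop)
    ca[start:stop, :, :, date_index] = data
h5f.close()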

As the CArray also supports compression (it is enabled in the example
above), you don't have to worry about the initial size of the CArray:
initially it is filled with zeros (you can change this value by
specifying the dflt argument of the Atom constructor), so the fresh
dataset has very low entropy (zero, in fact) and the compressor does an
excellent job of reducing the space needed to keep the initial CArray.
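
Just as a small sketch of the dflt argument mentioned above (the fill
value of 9999 is an arbitrary choice for illustration only):

import tables

# Unpopulated areas of the CArray take the atom's default value;
# 9999 here is only an illustration, not a recommendation.
atom = tables.UInt16Atom(dflt=9999)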

Indeed, when you start populating the CArray, you will notice that the
dataset on disk starts to grow (you are adding entropy to it).  So, in
the end, what you have is a kind of EArray (it is 'extendible' in
practical terms), but with the possibility of populating it with small
'hyperslabs' until you are done.

Finally, if speed is critical for you, you might want to specify the
chunk size for the CArray (see the 'chunkshape' argument of createCArray
in PyTables 2.0) to be the same as the hyperslab that you are using to
populate it.  Be careful, because this is only meant to be used by
experts (for example, specifying too large a chunkshape would require
too much effort on the compressor side and performance would degrade).
Generally, the algorithm in PyTables for computing the optimal chunksize
should be enough for most situations.
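
If you do want to experiment with it, a minimal sketch would be (the
chunkshape below just mirrors a small hypothetical hyperslab and is not
a tuned recommendation):

import tables

h5f = tables.openFile('carray1.h5', 'w')
# chunkshape mirrors a hypothetical (1000, 50, 4, 1) hyperslab; this is
# only an illustration of the argument, not a tuned value.
ca = h5f.createCArray(h5f.root, 'carray', tables.UInt16Atom(),
                      (40000, 20000, 4, 10),
                      filters=tables.Filters(complevel=5, complib='zlib'),
                      chunkshape=(1000, 50, 4, 1))
h5f.close()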

Hope that helps,

-- 
Francesc Altet    |  Be careful about using the following code --
Carabos Coop. V.  |  I've only proven that it works, 
www.carabos.com   |  I haven't tested it. -- Donald Knuth

