On Thu, 26 Apr 2007 at 08:55, Vincent Schut wrote:
> The problem I face is this:
> I would like to store lots of time-dependent data in an EArray, where
> the extensible dimension represents the date. However, the sizes of the
> other dimensions are such that even an array slice of one date does not
> fit into memory. E.g. my EArray is defined with a shape of (40000,
> 20000, 4, 0), UInt16. When I want to extend the EArray with a new date
> slice, I have to append the new slice at once, which does not work
> because a numpy array of shape (40000, 20000, 4, 1) is too large to fit
> into memory.
> Is there a way to append an uninitialized array slice in a memory
> efficient way, and fill it part by part later?
Mmm, currently not. However, you may want to use a CArray for doing this. With a CArray, you define a (potentially huge) array on disk up front, and then you can start populating it by slices (also called hyperslabs in HDF5 jargon). Look at this example (slightly modified from examples/carray1.py):

import numpy
import tables

N = XXX  # the number of slices you want to keep in the CArray
fileName = 'carray1.h5'
shape = (40000, 20000, 4, N)
atom = tables.UInt16Atom()
filters = tables.Filters(complevel=5, complib='zlib')

# Create the file and fill it with one hyperslab.
h5f = tables.openFile(fileName, 'w')
ca = h5f.createCArray(h5f.root, 'carray', atom, shape, filters=filters)
# Fill a hyperslab in ``ca``.
ca[:100, 0:50, :, 0] = numpy.ones((100, 50, 4), dtype='uint16')
ca.attrs.prows = 1  # keep track of the number of populated hyperslabs
h5f.close()

# Re-open the file and save another hyperslab.
h5f = tables.openFile(fileName, 'a')
ca = h5f.root.carray  # re-fetch the node; the old reference died with the file
# Fill another hyperslab in ``ca``.
ca[100:200, 50:100, :, 0] = numpy.zeros((100, 50, 4), dtype='uint16')
ca.attrs.prows += 1  # keep track of the number of populated hyperslabs
h5f.close()

With this, you can continue populating the dataset (probably using a loop; a rough sketch is appended below) until you are done. As the CArray also supports compression (it is active in the example above), you don't have to worry about the initial size of the CArray: initially it will be filled with 0's (you can change this value by specifying the dflt argument of the Atom constructor), so the initial dataset will have very low entropy (zero, in fact), and the compressor will do an excellent job at reducing the space needed to keep it. Indeed, as you start populating the CArray, you will notice that the dataset on disk starts to grow (you are adding more entropy to it). So, in the end, what you have is a kind of EArray (because it is 'extendible' in practical terms), but with the possibility of populating it with small 'hyperslabs' until you are done.

Finally, if speed is critical to you, you might want to specify the chunk size for the CArray (see the 'chunkshape' argument of createCArray in PyTables 2.0) so that it matches the hyperslabs you are using to populate it (also sketched below). Be careful, though: this is only meant to be used by experts. For example, specifying too large a chunkshape would require too much effort on the compressor side, and performance would degrade. Generally, the algorithm that PyTables uses to compute the optimal chunksize should be enough for most situations.

Hope that helps,

--
Francesc Altet    |  Be careful about using the following code --
Carabos Coop. V.  |  I've only proven that it works,
www.carabos.com   |  I haven't tested it.  -- Donald Knuth
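The populating loop mentioned above could look roughly like this (a minimal sketch, assuming a hypothetical read_date_block() helper that returns one (brows, bcols, 4) uint16 block of the new date slice; the block sizes are made up and should be tuned to the available memory):

import tables

# Append one more date slice to the CArray, block by block.
# ``read_date_block()`` is a hypothetical user function that returns the
# (brows, bcols, 4) block of new data starting at offsets (i, j).
h5f = tables.openFile('carray1.h5', 'a')
ca = h5f.root.carray
d = ca.attrs.prows          # index of the next date slice to fill
brows, bcols = 1000, 1000   # block size; tune to the available memory
for i in range(0, ca.shape[0], brows):
    for j in range(0, ca.shape[1], bcols):
        ca[i:i+brows, j:j+bcols, :, d] = read_date_block(d, i, j, brows, bcols)
ca.attrs.prows = d + 1      # one more date slice is populated now
h5f.close()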
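Changing the fill value via the Atom constructor is a one-liner (a sketch; 65535 is an arbitrary marker value chosen for illustration):

import tables

# Use 65535 instead of 0 as the "not populated yet" marker.  Unwritten
# areas still compress to almost nothing because they are constant,
# whatever the default value is.
atom = tables.UInt16Atom(dflt=65535)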
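And here is a sketch of pinning the chunkshape to the hyperslab shape (the numbers simply mirror the example above; benchmark before adopting them):

import tables

filters = tables.Filters(complevel=5, complib='zlib')
h5f = tables.openFile('carray2.h5', 'w')
# Chunks of (100, 50, 4, 1) line up with the (100, 50, 4) hyperslabs
# written above, so each assignment fills whole chunks and none has to
# be read back and rewritten.
ca = h5f.createCArray(h5f.root, 'carray', tables.UInt16Atom(),
                      (40000, 20000, 4, 10), filters=filters,
                      chunkshape=(100, 50, 4, 1))
h5f.close()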
