Hi Francesc, thanks for your elaborate answer. In fact, this is exactly (and I do mean exactly) what I have been doing, after I found out by experiment that the EArray would not work for me. This solution indeed works extremely well. It was just the fact that I am still limited to a predefined array size that made me wonder if there would be an EArray solution. I am very new to PyTables, so I could have missed something :) On the other hand (as a short note on my background), I am not new to storing and processing large data arrays (remote sensing), but until now I used to store my data as (Geo)TIFF files, chunked (tiled in TIFF jargon) and compressed as well, so I was glad to see the chunking and compression support in HDF5/PyTables. The nice thing about PyTables, compared to TIFF, is that I can treat my data as if it were one large memory-mapped numpy array and let PyTables do the rest, while still saving lots of disk space compared to plain memory-mapped arrays.
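For reference, this is roughly the pattern I use to read windows out of the on-disk array as if it were one big numpy array (a minimal sketch only; the file name 'stack.h5', the node name 'carray' and the index values are just placeholders, not my real setup):

import tables

h5f = tables.openFile('stack.h5', 'r')    # hypothetical file name
ca = h5f.root.carray                      # CArray node created as in the quoted example below
# Slicing the on-disk CArray returns an ordinary in-memory numpy array
# containing only the requested window, so memory use stays small.
window = ca[1000:1512, 2000:2512, :, 3]   # 512x512 window, all 4 bands, date index 3
print window.shape, window.dtype
h5f.close()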
Francesc Altet wrote:
> On Thursday, 26 April 2007 at 08:55, Vincent Schut wrote:
>> The problem I face is this:
>> I would like to store lots of time-dependent data in an EArray, where
>> the extensible dimension represents the date. However, the sizes of the
>> other dimensions are such that even an array slice for a single date
>> does not fit into memory. E.g. my EArray is defined with a shape of
>> (40000, 20000, 4, 0), UInt16. When I want to extend the EArray with a
>> new date slice, I have to append the whole slice at once, which does
>> not work because a numpy array of shape (40000, 20000, 4, 1) is too
>> large to fit into memory.
>> Is there a way to append an uninitialized array slice in a
>> memory-efficient way, and fill it part by part later?
>
> Mmm, currently not. However, you may want to use a CArray for doing
> this. With a CArray, you define a (potentially huge) array on disk up
> front, and then you can start populating it by slices (also called
> hyperslabs in HDF5 jargon). Look at this example (slightly modified
> from examples/carray1.py):
>
> import numpy
> import tables
>
> N = XXX  # the number of date slices you want to keep in the CArray
> fileName = 'carray1.h5'
> shape = (40000, 20000, 4, N)
> atom = tables.UInt16Atom()
> filters = tables.Filters(complevel=5, complib='zlib')
>
> # Create the file and fill it with one hyperslab
> h5f = tables.openFile(fileName, 'w')
> ca = h5f.createCArray(h5f.root, 'carray', atom, shape, filters=filters)
> # Fill a hyperslab in ``ca``.
> ca[:100, 0:50, :, 0] = numpy.ones((100, 50, 4), dtype='uint16')
> ca.attrs.prows = 1  # Keep track of the number of populated rows
> h5f.close()
>
> # Re-open the file and save another hyperslab
> h5f = tables.openFile(fileName, 'a')
> ca = h5f.root.carray
> # Fill another hyperslab in ``ca``.
> ca[100:200, 50:100, :, 0] = numpy.zeros((100, 50, 4), dtype='uint16')
> ca.attrs.prows += 1  # Keep track of the number of populated rows
> h5f.close()
>
> With this, you can continue populating the dataset (probably using a
> loop) until you are done.
>
> As the CArray also supports compression (it is active in the example
> above), you don't have to be afraid of the initial size of the CArray:
> initially it is filled with 0's (you can change this value by specifying
> the dflt argument of the Atom constructor), so the initial dataset has
> very low entropy (zero, in fact) and the compressor will do an excellent
> job at reducing the space needed to keep it.
>
> Indeed, when you start populating the CArray, you will notice that the
> dataset on disk starts to grow (you are adding more entropy to it). So,
> in the end, what you have is a kind of EArray (because it is
> 'extendible' in practical terms), but with the possibility to populate
> it with small 'hyperslabs' until you are done.
>
> Finally, if speed is critical to you, you might want to specify the
> chunk size for the CArray (see the 'chunkshape' argument of createCArray
> in PyTables 2.0) to be the same as the hyperslab that you are using to
> populate it. Be careful, because this is only meant to be used by
> experts (for example, specifying a too-large chunkshape would require
> too much effort on the compressor side and performance would degrade).
> Generally, the algorithm in PyTables to compute the optimal chunksize
> should be good enough for most situations.
>
> Hope that helps,
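As a concrete follow-up to the chunkshape advice at the end of your mail: this is roughly how I would create the CArray with a chunkshape matching the hyperslabs I write, using the PyTables 2.0 chunkshape argument, and then populate the first date slice block by block (a sketch under assumptions only; the file name, the block size of 500 pixels and the number of dates are made-up values, and I have not benchmarked this chunkshape):

import numpy
import tables

ndates = 10                    # hypothetical number of date slices
bs = 500                       # hypothetical block size; divides 40000 and 20000 evenly
shape = (40000, 20000, 4, ndates)
atom = tables.UInt16Atom()
filters = tables.Filters(complevel=5, complib='zlib')

h5f = tables.openFile('stack.h5', 'w')
# chunkshape chosen to match the hyperslab written in the loop below
ca = h5f.createCArray(h5f.root, 'carray', atom, shape,
                      filters=filters, chunkshape=(bs, bs, 4, 1))
for i in range(0, 40000, bs):
    for j in range(0, 20000, bs):
        block = numpy.zeros((bs, bs, 4), dtype='uint16')
        # ... fill `block` with real data for this window here ...
        ca[i:i+bs, j:j+bs, :, 0] = block
ca.attrs.prows = 1             # first date slice populated
h5f.close()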
