Dag Sverre Seljebotn:
> Hi,
>
> Is anyone working on alternative storage options for numpy arrays, and
> specifically recarrays? My main application involves processing series
> of large recarrays (say 1000 recarrays, each with 5M rows having 50
> fields). Existing options meet some but not all of my requirements.
>
> Requirements
> --------------
> The basic requirements are:
>
> Mandatory
> - fast
> - suitable for very large arrays (larger than can fit in memory)
> - compressed (to reduce disk space, read data more quickly)
> - seekable (can read subset of data without decompressing everything)
> - can append new data to an existing file
> - able to extract individual fields from a recarray (for when indexing
>   or processing needs just a few fields)
> Nice to have
> - files can be split without decompressing and recompressing (e.g.
>   distribute processing over a grid)
> - encryption, ideally field-level, with encryption occurring after
>   compression
> - can store multiple arrays in one physical file (convenience)
> - portable/standard/well documented
>
> Existing options
> -----------------
> Over the last few years I've tried most of numpy's options for saving
> arrays to disk, including pickles, .npy, .npz, memmap-ed files and HDF
> (Pytables).
>
> None of these is perfect, although Pytables comes close:
> - .npy - not compressed, need to read whole array into memory
> - .npz - compressed but ZLIB compression is too slow
> - memmap - not compressed
> - Pytables (HDF using chunked storage for recarrays with LZO
>   compression and shuffle filter)
>   - can't extract individual field from a recarray

I'm just learning PyTables so I'm curious about this... if I use a normal
Table, it will be presented as a NumPy record array when I access it, and
I can access individual fields. What are the disadvantages to that?

>   - multiple dependencies (HDF, PyTables+LZO, Pyh5+LZF)

(I think this is a pro, not a con: It means that there's a lot of already
bugfixed code being used. Any codebase is only as strong as the number of
eyes on it.)

Dag Sverre
_______________________________________________
NumPy-Discussion mailing list
http://mail.scipy.org/mailman/listinfo/numpy-discussion
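
To make the "extract individual fields" requirement concrete, here is a minimal NumPy-only sketch of field access on a record array stored as a memory-mapped .npy file. It is not the PyTables approach discussed above. The field names (`id`, `price`, `flag`), the row count, and the file name are hypothetical and chosen only for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical record layout; the thread mentions about 50 fields,
# three are enough to show the idea.
dtype = np.dtype([("id", "i8"), ("price", "f8"), ("flag", "u1")])

records = np.zeros(1000, dtype=dtype)
records["id"] = np.arange(1000)
records["price"] = np.linspace(0.0, 99.9, 1000)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.npy")
    np.save(path, records)

    # mmap_mode="r" maps the file rather than reading it all into memory.
    mapped = np.load(path, mmap_mode="r")

    # Indexing by field name gives a strided view onto the mapped rows.
    price_view = mapped["price"]

    # np.array(...) copies just the requested rows of that one field.
    prices = np.array(price_view[100:200])

    # Drop the mapping before the temporary directory is cleaned up.
    del price_view, mapped

print(prices.shape, prices.dtype)
```

Because the rows are stored one after another, the `price` values are interleaved with the other fields on disk, so reading a single field this way still touches the pages holding whole rows. The data is also uncompressed, as the quoted list notes for memmap.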
