Re: [Numpy-discussion] Designing a new storage format for numpy recarrays
2009/10/30 Stephen Simmons:
> I should clarify what I meant...
>
> Suppose I have a recarray with 50 fields and want to read just one of
> those fields. PyTables/HDF will read in the compressed data for chunks
> of complete rows, decompress the full 50 fields, and then give me back
> the data for just one field.
>
> I'm after a solution where asking for a single field reads in the bytes
> for just that field from disk and decompresses it.
>
> This is similar to the difference between databases storing their data
> as rows or columns. See for example Mike Stonebraker's C-store
> column-oriented database (http://db.lcs.mit.edu/projects/cstore/vldb.pdf).

Is there any reason not to simply store the data as a collection of
separate arrays, one per column? It shouldn't be too hard to write a
wrapper to give this nicer syntax, while implementing it under the hood
with HDF5...

Anne

> Stephen
>
> Francesc Alted wrote:
>> On Friday 30 October 2009 14:18:05, Stephen Simmons wrote:
>>> - Pytables (HDF using chunked storage for recarrays with LZO
>>> compression and shuffle filter)
>>> - can't extract individual field from a recarray
>>
>> Er... Have you tried the ``cols`` accessor?
>>
>> http://www.pytables.org/docs/manual/ch04.html#ColsClassDescr
>>
>> Cheers,

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
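[A minimal sketch of the wrapper idea Anne suggests, purely illustrative: each field is stored as its own on-disk array so a single-field read touches only that field's file. To keep the sketch dependency-free it uses one .npy file per field rather than HDF5; the ColumnStore name and layout are made up.]

```python
import os
import tempfile
import numpy as np

class ColumnStore:
    """Store each recarray field as its own on-disk array (hypothetical sketch)."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def save(self, recarray):
        # One file per column; each field could then be compressed independently.
        for name in recarray.dtype.names:
            np.save(os.path.join(self.directory, name + ".npy"),
                    np.ascontiguousarray(recarray[name]))

    def __getitem__(self, field):
        # Reads only the requested field's bytes from disk.
        return np.load(os.path.join(self.directory, field + ".npy"))

store = ColumnStore(tempfile.mkdtemp())
data = np.zeros(10, dtype=[("x", "i4"), ("y", "f8")])
data["x"] = np.arange(10)
store.save(data)
x = store["x"]  # touches only x.npy on disk
```

A real version would put the per-column arrays into one HDF5 file, as suggested, but the access pattern is the same.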
Re: [Numpy-discussion] Designing a new storage format for numpy recarrays
I should clarify what I meant...

Suppose I have a recarray with 50 fields and want to read just one of
those fields. PyTables/HDF will read in the compressed data for chunks
of complete rows, decompress the full 50 fields, and then give me back
the data for just one field.

I'm after a solution where asking for a single field reads in the bytes
for just that field from disk and decompresses it.

This is similar to the difference between databases storing their data
as rows or columns. See for example Mike Stonebraker's C-store
column-oriented database (http://db.lcs.mit.edu/projects/cstore/vldb.pdf).

Stephen

Francesc Alted wrote:
> On Friday 30 October 2009 14:18:05, Stephen Simmons wrote:
>> - Pytables (HDF using chunked storage for recarrays with LZO
>> compression and shuffle filter)
>> - can't extract individual field from a recarray
>
> Er... Have you tried the ``cols`` accessor?
>
> http://www.pytables.org/docs/manual/ch04.html#ColsClassDescr
>
> Cheers,
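[To make the row-vs-column point concrete, here is a small self-contained comparison; it uses zlib rather than LZO, and the three-field layout is made up. With row-oriented chunks, reading one field means decompressing every field's bytes; with a column store, a field's bytes sit contiguously on their own.]

```python
import zlib
import numpy as np

# Three 8-byte fields; only "a" carries data (hypothetical layout).
rows = np.zeros(100_000, dtype=[("a", "i8"), ("b", "f8"), ("c", "i8")])
rows["a"] = np.arange(100_000)

# Row-oriented storage: to read field "a" you must fetch and decompress
# chunks of complete rows -- all three fields' interleaved bytes.
row_chunk = zlib.compress(rows.tobytes())

# Column-oriented storage: field "a" is stored (and compressed) contiguously
# on its own, so a single-field read touches only a third of the raw bytes.
col_a = zlib.compress(np.ascontiguousarray(rows["a"]).tobytes())
```

With 50 fields instead of 3, the row-oriented read touches roughly 50x the raw bytes needed for one field, which is the overhead being described.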
Re: [Numpy-discussion] Designing a new storage format for numpy recarrays
On Fri, Oct 30, 2009 at 08:18, Stephen Simmons wrote:
> Thoughts about a new format
>
> It seems that numpy could benefit from a new storage format.

While you may indeed need a new format, I'm not sure that numpy does.
Lord knows I've gotten enough flak for inventing yet another binary
format with .npy. :-)

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco
Re: [Numpy-discussion] Designing a new storage format for numpy recarrays
On Friday 30 October 2009 14:18:05, Stephen Simmons wrote:
> - Pytables (HDF using chunked storage for recarrays with LZO
> compression and shuffle filter)
> - can't extract individual field from a recarray

Er... Have you tried the ``cols`` accessor?

http://www.pytables.org/docs/manual/ch04.html#ColsClassDescr

Cheers,

--
Francesc Alted
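[For what it's worth, a minimal round-trip showing the ``cols`` accessor in action. This assumes PyTables is installed; the file, table, and column names are invented, and modern PyTables spells the calls open_file/create_table rather than the camelCase names used in the 2009 manual.]

```python
import os
import tempfile
import numpy as np
import tables  # PyTables

path = os.path.join(tempfile.mkdtemp(), "demo.h5")
dtype = np.dtype([("a", "i4"), ("b", "f8")])
data = np.zeros(10, dtype=dtype)
data["a"] = np.arange(10)

# Write a Table from a structured array.
with tables.open_file(path, "w") as f:
    table = f.create_table("/", "demo", description=dtype)
    table.append(data)

# Read back just one column via the cols accessor.
with tables.open_file(path, "r") as f:
    a = f.root.demo.cols.a[:]  # only the "a" column, as a plain ndarray
```

Note this addresses the API side of the complaint (you get a single field back); whether the chunked storage underneath still decompresses whole rows is a separate question.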
Re: [Numpy-discussion] Designing a new storage format for numpy recarrays
Unless I read your request or the documentation wrong, h5py already
supports pulling specific fields out of "compound data types":
http://h5py.alfven.org/docs-1.1/guide/hl.html#id3

> For compound data, you can specify multiple field names alongside
> the numeric slices:
>
> >>> dset["FieldA"]
> >>> dset[0,:,4:5, "FieldA", "FieldB"]
> >>> dset[0, ..., "FieldC"]

Is this latter style of access what you were asking for? (Or is the
problem that it's not fast enough in HDF5, even with the shuffle filter,
etc.?)

So then the issue is that there's a dependency on HDF5 and h5py (or, if
you want to access LZF-compressed files without h5py, on HDF5 and the C
LZF compressor)? This is pretty lightweight, especially since you're
proposing writing new code which would itself become a dependency. Your
new code couldn't depend on *anything* else if you wanted it to be a
fewer-dependencies option than HDF5 + h5py, right?

Zach
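[A round-trip of the field-name indexing quoted above. This assumes h5py is installed; the dataset and field names are made up. Indexing a compound dataset by field name hands back only that field's values as an ordinary ndarray.]

```python
import os
import tempfile
import numpy as np
import h5py

path = os.path.join(tempfile.mkdtemp(), "demo.h5")
dt = np.dtype([("FieldA", "i4"), ("FieldB", "f8")])
data = np.zeros(5, dtype=dt)
data["FieldA"] = np.arange(5)

# Write a dataset with a compound (record) dtype.
with h5py.File(path, "w") as f:
    f.create_dataset("dset", data=data)

# Field-name indexing: read just one member of the compound type.
with h5py.File(path, "r") as f:
    a = f["dset"]["FieldA"]  # plain i4 ndarray holding only FieldA
```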
Re: [Numpy-discussion] Designing a new storage format for numpy recarrays
Stephen Simmons wrote:
> P.S. Maybe this will be too much work, and I'd be better off sticking
> with Pytables.

I can't judge that, but I want to share some thoughts (rant?):

- Are you ready to not only write the code, but maintain it over years
  to come, work through nasty bugs, and think things through when people
  ask for parallelism or obscure filesystem-locking functionality or
  whatnot?

- Are you ready to finish even the last, boring "10%"? Since there are
  existing options in the same area, you can't expect a growing userbase
  to help you with that last "10%" (unlike projects in unexplored areas).

- When you are done, are you sure that what you finally have will really
  be leaner and easier to work with than the existing options (like
  PyTables)? If not, odds are the result will in the end only be used by
  yourself.

Simply writing the prototype is the easy part of the job!

Perhaps needless to say, my hunch would be to try to work with PyTables
to add what you miss there. There's a harder learning curve than writing
something from scratch, but not harder than what others will face with
something you write from scratch. The advantage of HDF5 is that there
are lots of existing tools for inspecting, processing and sharing the
data independent of NumPy (well, up to proprietary compression; but
that's hardly worse than the entire format being proprietary).

Dag Sverre
Re: [Numpy-discussion] Designing a new storage format for numpy recarrays
Stephen Simmons wrote:
> Hi,
>
> Is anyone working on alternative storage options for numpy arrays, and
> specifically recarrays? My main application involves processing series
> of large recarrays (say 1000 recarrays, each with 5M rows having 50
> fields). Existing options meet some but not all of my requirements.
>
> Requirements
> ------------
> The basic requirements are:
>
> Mandatory
> - fast
> - suitable for very large arrays (larger than can fit in memory)
> - compressed (to reduce disk space, read data more quickly)
> - seekable (can read subset of data without decompressing everything)
> - can append new data to an existing file
> - able to extract individual fields from a recarray (for when indexing
>   or processing needs just a few fields)
>
> Nice to have
> - files can be split without decompressing and recompressing (e.g. to
>   distribute processing over a grid)
> - encryption, ideally field-level, with encryption occurring after
>   compression
> - can store multiple arrays in one physical file (convenience)
> - portable/standard/well documented
>
> Existing options
> ----------------
> Over the last few years I've tried most of numpy's options for saving
> arrays to disk, including pickles, .npy, .npz, memmap-ed files and HDF
> (Pytables).
>
> None of these is perfect, although Pytables comes close:
> - .npy - not compressed, need to read whole array into memory
> - .npz - compressed, but ZLIB compression is too slow
> - memmap - not compressed
> - Pytables (HDF using chunked storage for recarrays with LZO
>   compression and shuffle filter)
>   - can't extract individual field from a recarray

I'm just learning PyTables, so I'm curious about this... if I use a
normal Table, it will be presented as a NumPy record array when I access
it, and I can access individual fields. What are the disadvantages of
that?

> - multiple dependencies (HDF, PyTables+LZO, Pyh5+LZF)

(I think this is a pro, not a con: it means that there's a lot of
already-bugfixed code being used. Any codebase is only as strong as the
number of eyes on it.)

Dag Sverre
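[For reference, in-memory field access on the record array PyTables hands back is indeed cheap: it is a view into the same buffer, not a copy. A numpy-only illustration follows; the catch Stephen describes is upstream of this, at the disk/decompression layer, before that in-memory array exists.]

```python
import numpy as np

rec = np.zeros(4, dtype=[("a", "i4"), ("b", "f8")])
rec["a"] = [1, 2, 3, 4]

col = rec["a"]   # a strided view into rec's buffer, not a copy
col[0] = 99      # writes through to the original record array
```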