Re: [Numpy-discussion] Designing a new storage format for numpy recarrays

2009-10-30 Thread Anne Archibald
2009/10/30 Stephen Simmons:
> I should clarify what I meant...
>
> Suppose I have a recarray with 50 fields and want to read just one of
> those fields. PyTables/HDF will read in the compressed data for chunks
> of complete rows, decompress the full 50 fields, and then give me back
> the data for just one field.
>
> I'm after a solution where asking for a single field reads in the bytes
> for just that field from disk and decompresses it.
>
> This is similar to the difference between databases storing their data
> as rows or columns. See for example Mike Stonebraker's C-store
> column-oriented database (http://db.lcs.mit.edu/projects/cstore/vldb.pdf).

Is there any reason not to simply store the data as a collection of
separate arrays, one per column? It shouldn't be too hard to write a
wrapper to give this nicer syntax, while implementing it under the
hood with HDF5...
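
Something along these lines (an untested sketch; the helper names and file
layout are invented, not an existing API) is roughly what I have in mind:

import numpy as np
import h5py

def save_columns(filename, recarray, **dataset_kwargs):
    # One HDF5 dataset per field, so each column is chunked and
    # compressed independently of the others.
    with h5py.File(filename, "w") as f:
        for name in recarray.dtype.names:
            f.create_dataset(name, data=recarray[name],
                             chunks=True, **dataset_kwargs)

def load_column(filename, name):
    # Only this column's chunks are read from disk and decompressed.
    with h5py.File(filename, "r") as f:
        return f[name][:]

# Usage sketch with a made-up dtype:
dt = np.dtype([("id", "i8"), ("price", "f8"), ("flag", "u1")])
arr = np.zeros(1000, dtype=dt).view(np.recarray)
save_columns("columns.h5", arr, compression="gzip", shuffle=True)
ids = load_column("columns.h5", "id")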

Anne

> Stephen
>
>
>
> Francesc Alted wrote:
>> On Friday 30 October 2009 14:18:05, Stephen Simmons wrote:
>>
>>>  - Pytables (HDF using chunked storage for recarrays with LZO
>>> compression and shuffle filter)
>>>     - can't extract individual field from a recarray
>>>
>>
>> Er... Have you tried the ``cols`` accessor?
>>
>> http://www.pytables.org/docs/manual/ch04.html#ColsClassDescr
>>
>> Cheers,
>>
>>
>


Re: [Numpy-discussion] Designing a new storage format for numpy recarrays

2009-10-30 Thread Stephen Simmons
I should clarify what I meant...

Suppose I have a recarray with 50 fields and want to read just one of 
those fields. PyTables/HDF will read in the compressed data for chunks 
of complete rows, decompress the full 50 fields, and then give me back 
the data for just one field.

I'm after a solution where asking for a single field reads in the bytes 
for just that field from disk and decompresses it.

This is similar to the difference between databases storing their data 
as rows or columns. See for example Mike Stonebraker's C-store 
column-oriented database (http://db.lcs.mit.edu/projects/cstore/vldb.pdf).
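
Rough arithmetic of what the row-chunk approach costs (assuming, purely for
illustration, 50 8-byte fields):

import numpy as np

dt = np.dtype([("f%02d" % i, "f8") for i in range(50)])

bytes_per_row  = dt.itemsize          # 400 bytes: all 50 fields
wanted_per_row = dt["f00"].itemsize   # 8 bytes: the one field asked for

# With chunks holding complete rows, roughly 50x more data has to be
# read and decompressed than the query actually needs.
print(bytes_per_row // wanted_per_row)   # -> 50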

Stephen



Francesc Alted wrote:
> On Friday 30 October 2009 14:18:05, Stephen Simmons wrote:
>   
>>  - Pytables (HDF using chunked storage for recarrays with LZO
>> compression and shuffle filter)
>> - can't extract individual field from a recarray
>> 
>
> Er... Have you tried the ``cols`` accessor?
>
> http://www.pytables.org/docs/manual/ch04.html#ColsClassDescr
>
> Cheers,
>
>   



Re: [Numpy-discussion] Designing a new storage format for numpy recarrays

2009-10-30 Thread Robert Kern
On Fri, Oct 30, 2009 at 08:18, Stephen Simmons wrote:

> Thoughts about a new format
> 
> It seems that numpy could benefit from a new storage format.

While you may indeed need a new format, I'm not sure that numpy does.
Lord knows I've gotten enough flak for inventing yet another binary
format with .npy. :-)

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco


Re: [Numpy-discussion] Designing a new storage format for numpy recarrays

2009-10-30 Thread Francesc Alted
On Friday 30 October 2009 14:18:05, Stephen Simmons wrote:
>  - Pytables (HDF using chunked storage for recarrays with LZO
> compression and shuffle filter)
> - can't extract individual field from a recarray

Er... Have you tried the ``cols`` accessor?

http://www.pytables.org/docs/manual/ch04.html#ColsClassDescr
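
A minimal sketch of what I mean (the file and column names are made up; this
uses the PyTables 2.x API, tables.openFile):

import tables

fileh = tables.openFile("data.h5", mode="r")
table = fileh.root.mytable

# Ask for a single column; you get back a plain 1-D numpy array.
prices = table.cols.price[:]           # the whole column
window = table.cols.price[1000:2000]   # or just a slice of it

fileh.close()

(Note that a Table's HDF5 chunks still contain complete rows, so the chunks
are decompressed in full; this addresses the API convenience rather than the
raw I/O pattern.)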

Cheers,

-- 
Francesc Alted


Re: [Numpy-discussion] Designing a new storage format for numpy recarrays

2009-10-30 Thread Zachary Pincus
Unless I read your request or the documentation wrong, h5py already  
supports pulling specific fields out of "compound data types":

http://h5py.alfven.org/docs-1.1/guide/hl.html#id3

> For compound data, you can specify multiple field names alongside  
> the numeric slices:
> >>> dset["FieldA"]
> >>> dset[0,:,4:5, "FieldA", "FieldB"]
> >>> dset[0, ..., "FieldC"]

Is this latter style of access what you were asking for? (Or is the  
problem that it's not fast enough in hdf5, even with the shuffle  
filter, etc?)
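
For concreteness, a small end-to-end sketch of that access pattern (the field
names, sizes and file name are invented):

import numpy as np
import h5py

dt = np.dtype([("FieldA", "f8"), ("FieldB", "i4"), ("FieldC", "S8")])
data = np.zeros(1000000, dtype=dt)

with h5py.File("compound.h5", "w") as f:
    f.create_dataset("dset", data=data, chunks=True,
                     compression="lzf", shuffle=True)

with h5py.File("compound.h5", "r") as f:
    dset = f["dset"]
    a = dset["FieldA"]           # one field for every row, as float64
    b = dset[0:100, "FieldB"]    # one field for the first 100 rows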

So then the issue is that there's a dependency on hdf5 and h5py (or, if you
want to access LZF-compressed files without h5py, a dependency on hdf5 and
the C LZF compressor)? That's pretty lightweight, especially since any new
code you propose writing would itself be a dependency. So your new code
couldn't depend on *anything* else if you wanted it to be a fewer-dependencies
option than hdf5+h5py, right?

Zach


Re: [Numpy-discussion] Designing a new storage format for numpy recarrays

2009-10-30 Thread Dag Sverre Seljebotn
Stephen Simmons wrote:
> P.S. Maybe this will be too much work, and I'd be better off sticking
> with Pytables.

I can't judge that, but I want to share some thoughts (rant?):

 - Are you ready to not only write the code, but also maintain it over years
to come, work through nasty bugs, and think things through when people ask
for parallelism or obscure filesystem-locking functionality or whatnot?

 - Are you ready to finish even the last, boring "10%"? Since there are
existing options in the same area, you can't expect a growing userbase to
help you with that last "10%" (unlike projects in unexplored areas).

 - When you are done, are you sure that what you finally have will really
be leaner and easier to work with than the existing options (like PyTables)?

If not, odds are the result will end up being used only by yourself.
Simply writing the prototype is the easy part of the job!

Perhaps needless to say, my hunch would be to try to work with PyTables and
add what you miss there. The learning curve is steeper than writing
something from scratch, but no steeper than the one others will face with
something you write from scratch.

The advantage of HDF5 is that there are lots of existing tools for
inspecting, processing and sharing the data independently of NumPy (well, up
to proprietary compression; but that's hardly worse than the entire format
being proprietary).

Dag Sverre



Re: [Numpy-discussion] Designing a new storage format for numpy recarrays

2009-10-30 Thread Dag Sverre Seljebotn
Stephen Simmons wrote:
> Hi,
>
> Is anyone working on alternative storage options for numpy arrays, and
> specifically recarrays? My main application involves processing a series
> of large recarrays (say 1000 recarrays, each with 5M rows having 50
> fields). Existing options meet some but not all of my requirements.
>
> Requirements
> ------------
> The basic requirements are:
>
> Mandatory
>  - fast
>  - suitable for very large arrays (larger than can fit in memory)
>  - compressed (to reduce disk space, read data more quickly)
>  - seekable (can read subset of data without decompressing everything)
>  - can append new data to an existing file
>  - able to extract individual fields from a recarray (for when indexing
> or processing needs just a few fields)
> Nice to have
>  - files can be split without decompressing and recompressing (e.g.
> distribute processing over a grid)
>  - encryption, ideally field-level, with encryption occurring after
> compression
>  - can store multiple arrays in one physical file (convenience)
>  - portable/standard/well documented
>
> Existing options
> ----------------
> Over the last few years I've tried most of numpy's options for saving
> arrays to disk, including pickles, .npy, .npz, memmap-ed files and HDF
> (Pytables).
>
> None of these is perfect, although Pytables comes close:
>  - .npy - not compressed, need to read whole array into memory
>  - .npz - compressed but ZLIB compression is too slow
>  - memmap - not compressed
>  - Pytables (HDF using chunked storage for recarrays with LZO
> compression and shuffle filter)
> - can't extract individual field from a recarray

I'm just learning PyTables, so I'm curious about this... if I use a normal
Table, it will be presented as a NumPy record array when I access it, and
I can access individual fields. What are the disadvantages of that?
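
To make the contrast concrete, this is the difference I have in mind (names
invented; PyTables 2.x-era spelling, open_file in later versions):

import tables

fileh = tables.openFile("data.h5", mode="r")
table = fileh.root.mytable

# (a) Read rows as a record array, then pick a field afterwards:
# every field is read, decompressed and materialised first.
rows  = table[:100000]
price = rows["price"]

# (b) Ask for the field up front via the Column interface, so only
# that field comes back as a numpy array:
price = table.cols.price[:100000]

fileh.close()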

> - multiple dependencies (HDF, PyTables+LZO, Pyh5+LZF)

(I think this is a pro, not a con: it means there's a lot of already
bug-fixed code being used. Any codebase is only as strong as the number of
eyes on it.)

Dag Sverre
