> Anand Patil (on 2007-10-31 at 17:53:17 -0700) said::
>
> > I have a file full of 32-bit floats, in binary format, compressed
> > with zip.  I'd like to get it into a PyTables array, but this:
> >
> >     Z = ZipFile('data_file.zip')
> >     binary_data = Z.read('data_file')
> >     numpy_array = numpy.fromstring(binary_data, dtype=numpy.float32)
> >     h5file.createArray('/', 'data', numpy_array)
> >
> > won't work because I don't have enough memory for the intermediate
> > stages.  Is there an easy way to do this piece-by-piece or in a
> > 'streaming' fashion?
>
> First of all, I'd avoid using an ``Array`` object for storing such a
> big array.  ``CArray`` or ``EArray`` objects are better suited, since
> they are chunked and therefore much more memory-efficient.  Both allow
> you to store your data little by little, since disk space is only
> allocated for a chunk when it is really needed.  A ``CArray`` has a
> fixed shape, while an ``EArray`` is enlargeable.
>
> I guess the big obstacle would be to extract data from the zip file
> incrementally.  Since the ``ZipFile`` interface doesn't allow this, you
> may unzip ``data_file`` to disk, then open it and read chunks of data
> from it.  Something like this:
>
>     nptype = numpy.float32
>     atom = tables.Atom.from_sctype(nptype)
>
>     # Extract data_file from data_file.zip (e.g. with subprocess),
>     # then compute:
>     #   total_rows = size of data_file / atom.itemsize  (e.g. os.stat)
>
>     array = h5file.createCArray( '/', 'data', atom,
>                                  shape=(total_rows,) )
>     # or
>     array = h5file.createEArray( '/', 'data', atom,
>                                  shape=(0,), expectedrows=total_rows )
>
>     # We will be reading blocks as big as a chunk.
>     rows_to_read = array.chunkshape[0]
>     bytes_to_read = rows_to_read * atom.itemsize
>
>     dfile = open('data_file', 'rb')  # binary mode is 'rb', not 'b'
>     data = dfile.read(bytes_to_read)
>     base = 0  # only for the CArray case
>     while data:
>         arr = numpy.fromstring(data, dtype=nptype)
>         # CArray case: assign into the preallocated slice.
>         array[base:base+len(arr)] = arr
>         base += len(arr)
>         # EArray case: just append.  Use one case or the other,
>         # not both.
>         array.append(arr)
>         data = dfile.read(bytes_to_read)
>     array.flush()
>     dfile.close()
>
> This is untested, but I hope you get the idea.
>
> Cheers,
>
> ::
>
>         Ivan Vilata i Balaguer   >qo<   http://www.carabos.com/
>                Cárabos Coop. V.  V  V   Enjoy Data


Got it, thanks!
_______________________________________________
Pytables-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pytables-users
