Re: [Numpy-discussion] 2D binning
On 1/06/2010 10:51 PM, Wes McKinney wrote:
> [snip]
> This is a pretty good example of the group-by problem that will hopefully
> work its way into a future edition of NumPy.

Wes (or anyone else), please can you elaborate on any plans for groupby?

I've made my own modification to numpy.bincount for doing groupby-type
operations, but haven't contributed anything back to the numpy community.

Is there any interest from European-based people in working on groupby etc.
at the EuroSciPy sprints in July? If others are interested, maybe we could
work out requirements beforehand and then do some coding in Paris.

Cheers
Stephen
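For anyone who hasn't seen the trick, here is a minimal sketch of a group-by
sum/mean using plain numpy.bincount (my own illustration, not the modified
bincount mentioned above):

    import numpy as np

    keys = np.array([0, 1, 0, 2, 1])                # integer group labels
    values = np.array([10., 20., 30., 40., 50.])    # data to aggregate

    sums = np.bincount(keys, weights=values)        # per-group sums
    counts = np.bincount(keys)                      # per-group row counts
    means = sums / np.maximum(counts, 1)            # per-group means, avoiding 0/0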
Re: [Numpy-discussion] Proposal for new ufunc functionality
I would really like to see this become a core part of numpy...

For groupby-like summing over arrays, I use a modified version of
numpy.bincount() which has optional arguments that greatly enhance its
flexibility:

    bincount(bins, weights=, max_bin=, out=)

where:
 * bins - numpy array of bin numbers (uint8, int16 or int32). Negative bin
   numbers indicate weights to be ignored [1]
 * weights - (opt) numpy array of weights (float or double)
 * max_bin - (opt) bin numbers greater than this are ignored when counting [2]
 * out - (opt) numpy output array (int32 or double)

[1] This is how I support Robert Kern's comment below: "If there are some
areas you want to ignore, that's difficult to do with reduceat()."

[2] Specifying the number of bins up front has two benefits: (i) it saves
scanning the bins array to see how big the output needs to be; and (ii) it
allows you to control the size of the output array, as you may want it
bigger than the number of bins would suggest.

I look forward to the draft NEP!

Best regards
Stephen Simmons

On 13/04/2010 10:34 PM, Robert Kern wrote:
> On Sat, Apr 10, 2010 at 17:59, Robert Kern <robert.k...@gmail.com> wrote:
>> On Sat, Apr 10, 2010 at 12:45, Pauli Virtanen <p...@iki.fi> wrote:
>>> la, 2010-04-10 kello 12:23 -0500, Travis Oliphant kirjoitti:
>>> [clip]
>>>> Here are my suggested additions to NumPy:
>>>> ufunc methods:
>>>> [clip]
>>>> * reducein (array, indices, axis=0)
>>>>   similar to reduce-at, but the indices provide both the start and end
>>>>   points (rather than being fence-posts like reduceat).
>>>
>>> Is the `reducein` important to have, as compared to `reduceat`?
>>
>> Yes, I think so. If there are some areas you want to ignore, that's
>> difficult to do with reduceat().
>
> And conversely, overlapping areas are highly useful but completely
> impossible to do with reduceat.
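For anyone wanting to experiment before a C version lands, a rough
pure-numpy emulation of the semantics described above could look like the
following (a sketch only: the names, defaults and dtype handling are my own
guesses rather than the author's C code, and np.add.at needs numpy >= 1.8):

    import numpy as np

    def bincount_ext(bins, weights=None, max_bin=None, out=None):
        bins = np.asarray(bins)
        if max_bin is None:
            max_bin = int(bins.max())
        if out is None:
            out_dtype = np.float64 if weights is not None else np.intp
            out = np.zeros(max_bin + 1, dtype=out_dtype)
        keep = (bins >= 0) & (bins <= max_bin)   # negative/too-large bins ignored
        if weights is None:
            np.add.at(out, bins[keep], 1)
        else:
            np.add.at(out, bins[keep], np.asarray(weights)[keep])
        return out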
[Numpy-discussion] Designing a new storage format for numpy recarrays
Hi,

Is anyone working on alternative storage options for numpy arrays, and
specifically recarrays? My main application involves processing series of
large recarrays (say 1000 recarrays, each with 5M rows having 50 fields).
Existing options meet some but not all of my requirements.

Requirements
------------
The basic requirements are:

Mandatory
 - fast
 - suitable for very large arrays (larger than can fit in memory)
 - compressed (to reduce disk space, read data more quickly)
 - seekable (can read a subset of data without decompressing everything)
 - can append new data to an existing file
 - able to extract individual fields from a recarray (for when indexing or
   processing needs just a few fields)

Nice to have
 - files can be split without decompressing and recompressing (e.g. to
   distribute processing over a grid)
 - encryption, ideally field-level, with encryption occurring after compression
 - can store multiple arrays in one physical file (convenience)
 - portable/standard/well documented

Existing options
----------------
Over the last few years I've tried most of numpy's options for saving arrays
to disk, including pickles, .npy, .npz, memmap-ed files and HDF (PyTables).
None of these is perfect, although PyTables comes close:
 - .npy - not compressed; need to read the whole array into memory
 - .npz - compressed, but ZLIB compression is too slow
 - memmap - not compressed
 - PyTables (HDF using chunked storage for recarrays with LZO compression
   and shuffle filter)
    - can't extract an individual field from a recarray
    - multiple dependencies (HDF5, PyTables+LZO, h5py+LZF)
    - HDF is standard, but the LZO implementation is specific to PyTables
      (similarly LZF is specific to h5py)

Are there any other options?

Thoughts about a new format
---------------------------
It seems that numpy could benefit from a new storage format. My first
thoughts involve:
 - Use a chunked format - split big arrays into pages of consecutive rows,
   compressed separately
 - Get good compression ratios by shuffling data before compressing (byte 1
   of all rows, then byte 2 of all rows, ...)
 - Get efficient access to individual fields in recarrays by compressing
   each recarray field's data separately (shuffling has the nice side-effect
   of separating recarray fields' data)
 - Make it fast to compress and decompress by using LZO
 - Store pages of rows (and compressed field data within a page) using a
   numpy variation of the IFF chunked format (e.g. as used by the DjVu
   scanned document format, version 3). For example, a FORM chunk for the
   whole file, a DTYP chunk for dtype info, a DIRM chunk for the directory
   of pages holding rows, and an NPAG chunk for a page
 - The IFF structure of named chunk types allows the format to be extended
   (compressors other than LZO, encryption, links to remote data chunks, etc.)

I'd appreciate any comments or suggestions before I start coding.

References
----------
DjVu format - http://djvu.org/resources/
DjVu v3 format - http://djvu.org/docs/DjVu3Spec.djvu

Stephen

P.S. Maybe this will be too much work, and I'd be better off sticking with
PyTables.
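To make the shuffle idea concrete, here is a minimal sketch of the byte
transpose described above (my own illustration; the function name is
hypothetical, and zlib merely stands in for an LZO codec):

    import zlib
    import numpy as np

    def shuffle_and_compress(recs, level=1):
        # Lay the raw record bytes out as (nrows, itemsize), then transpose
        # so byte 1 of every row is stored together, then byte 2, and so on.
        raw = np.frombuffer(recs.tobytes(), dtype=np.uint8)
        raw = raw.reshape(len(recs), recs.dtype.itemsize)
        shuffled = np.ascontiguousarray(raw.T)
        # zlib here only as a stand-in for a fast LZO compressor
        return zlib.compress(shuffled.tobytes(), level)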
Re: [Numpy-discussion] Home for pyhdf5io?
David Warde-Farley wrote:
> On 23-May-09, at 4:25 PM, Albert Thuswaldner wrote:
>> Actually my vision with pyhdf5io is to have hdf5 replace numpy's own
>> binary file format (.npy, .npz). Pyhdf5io (or an incarnation of it)
>> should be the standard (binary) way to store data in scipy/numpy. A bold
>> statement, I know, but I think that it would be an improvement,
>> especially for those users who are replacing Matlab with scipy/numpy.
>
> In that it introduces a dependency on pytables (and the hdf5 C library) I
> doubt it would be something the numpy core developers would be eager to
> adopt. The npy and npz formats (as best I can gather) exist so that there
> is _some_ way of persisting data to disk that ships with numpy. It's not
> meant necessarily as the best way, or as an interchange format, just as
> something that works out of the box, the code for which is completely
> contained within numpy.
>
> It might be worth mentioning the limitations of numpy's built-in save(),
> savez() and load() in the docstrings and recommending more portable
> alternatives, though.
>
> David

I tend to agree with David that PyTables is too big a dependency for
inclusion in core numpy. It does a lot more than simply loading and saving
arrays.

While I haven't tried Andrew Collette's h5py (http://code.google.com/p/h5py),
it looks like a very 'thin' wrapper around the HDF5 C libraries.

Maybe numpy's save(), savez(), load() and memmap() could be enhanced so that
saving/loading files with HDF5-like file extensions used the HDF5 format,
with code based on h5py and pyhdf5io. This could, I imagine, be a relatively
small/simple addition to numpy, with the only external dependency being the
HDF5 libraries themselves.

Stephen
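As a rough illustration of the kind of dispatch I have in mind (my own
sketch with a hypothetical save_any() helper, not a proposed numpy API):

    import os
    import numpy as np

    def save_any(path, arr):
        # Route HDF5-like extensions to h5py; everything else uses np.save.
        if os.path.splitext(path)[1].lower() in ('.h5', '.hdf5'):
            import h5py
            f = h5py.File(path, 'w')
            try:
                f.create_dataset('data', data=arr)
            finally:
                f.close()
        else:
            np.save(path, arr)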
[Numpy-discussion] Example code for Numpy C preprocessor 'repeat' directive?
Hi,

Please can someone suggest resources for learning how to use the 'repeat'
macros in numpy C code to avoid repeating sections of type-specific code
for each data type? Ideally there would be two types of resources: (i) a
description of how the repeat macros are meant to be used/compiled; and
(ii) a suggestion for a numpy source file that best illustrates this.

Thanks in advance!

Stephen

P.S. The motivation for this is that I'm trying to write an optimised numpy
implementation of SQL-style aggregation operators for an OLAP data analysis
project (using PyTables to store large numpy data sets). bincount() is
being used to implement SELECT SUM(x) FROM TBL WHERE y GROUP BY fn(z). My
modified bincount code can handle a wider variety of index, weight and
output array data types. It also supports passing in the output array as a
parameter, allowing multipass aggregation routines. I got the code working
for a small number of data type combinations, but now I'm drowning in an
exponential explosion of manually maintained data type combinations:

---snip---
    } else if ((weight_type == NPY_FLOAT) && (out_type == PyArray_DOUBLE)) {
        ...
        } else if (bin_type == PyArray_INTP) {
            for (i = 0; i < bin_len; i++) {
                bin = ((npy_intp *) bin_data)[i];
                if (bin >= 0 && bin <= max_bin)
                    ((double *) out_data)[bin] +=
                        *((float *) (weights_data + i*wt_stride));
            }
        } else if (bin_type == PyArray_UINT8) {
            for (i = 0; i < bin_len; i++) {
                bin = ((npy_uint8 *) bin_data)[i];
                if (bin >= 0 && bin <= max_bin)
                    ((double *) out_data)[bin] +=
                        *((float *) (weights_data + i*wt_stride));
            }
---snip---

'repeat' directives in C comments are obviously the proper way to avoid
manually generating all this boilerplate code. Unfortunately I haven't yet
understood how to make the autogenerated type-specific code link back to a
common function entry point. Either there is some compiler/distutils magic
going on, or it's explained in a different numpy source file from where I'm
looking right now...
[Numpy-discussion] Easy way to vectorize a loop?
Hi,

Can anyone help me out with a simple way to vectorize this loop?

    # idx and vals are arrays with indexes and values used to update array data
    data = numpy.ndarray(shape=(100,100,100,100), dtype='f4')
    flattened = data.ravel()
    for i in range(len(vals)):
        flattened[idx[i]] += vals[i]

Many thanks!
Stephen
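One way to vectorize this (a sketch, assuming idx holds non-negative flat
indices; bincount's minlength argument and numpy.add.at need reasonably
recent numpy versions):

    # accumulate vals into the flattened array; repeated indices are summed
    flattened += numpy.bincount(idx, weights=vals, minlength=flattened.size)

    # alternative on newer numpy:
    # numpy.add.at(flattened, idx, vals)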
Re: [Numpy-discussion] ANN: HDF5 for Python 1.1
Hi Andrew,

Do you have any plans to support LZO compression in h5py? I have lots of
LZO-compressed datasets created with PyTables. There's a real barrier to
using both h5py and PyTables if the fast decompressor options are just LZF
on h5py and LZO on PyTables.

Many thanks
Stephen

Andrew Collette wrote:
> = Announcing HDF5 for Python (h5py) 1.1 =
>
> What is h5py?
> -------------
> HDF5 for Python (h5py) is a general-purpose Python interface to the
> Hierarchical Data Format library, version 5. HDF5 is a versatile, mature
> scientific software library designed for the fast, flexible storage of
> enormous amounts of data.
>
> From a Python programmer's perspective, HDF5 provides a robust way to
> store data, organized by name in a tree-like fashion. You can create
> datasets (arrays on disk) hundreds of gigabytes in size, and perform
> random-access I/O on desired sections. Datasets are organized in a
> filesystem-like hierarchy using containers called "groups", and accessed
> using the traditional POSIX /path/to/resource syntax.
>
> In addition to providing interoperability with existing HDF5 datasets and
> platforms, h5py is a convenient way to store and retrieve arbitrary NumPy
> data and metadata.
>
> New features in 1.1
> -------------------
> - A new compression filter based on the LZF library, which provides
>   transparent compression many times faster than the standard HDF5 GZIP
>   filter.
> - Efficient broadcasting using HDF5 hyperslab selections; for example,
>   you can write to a (2000 x 100 x 50) selection from a (100 x 50) source
>   array.
> - Now supports the NumPy boolean type
> - Auto-completion for IPython 0.9.X (contributed by Darren Dale)
> - Installable via easy_install
>
> Standard features
> -----------------
> - Supports storage of NumPy data of the following types:
>   * Integer/Unsigned Integer
>   * Float/Double
>   * Complex/Double Complex
>   * Compound (recarray)
>   * Strings
>   * Boolean
>   * Array (as members of a compound type only)
>   * Void
> - Random access to datasets using the standard NumPy slicing syntax,
>   including fancy indexing and point-based selection
> - Transparent compression of datasets using GZIP, LZF or SZIP, and
>   error-detection using Fletcher32
> - Pythonic interface supporting dictionary and NumPy-array metaphors for
>   the high-level HDF5 abstractions like groups and datasets
> - A comprehensive, object-oriented wrapping of the HDF5 low-level C API
>   via Cython, in addition to the NumPy-like high-level interface
> - Supports many new features of HDF5 1.8, including recursive iteration
>   over entire files and in-library copy operations on the file tree
> - Thread-safe
>
> Where to get it
> ---------------
> * Main website, documentation: http://h5py.alfven.org
> * Downloads, bug tracker: http://h5py.googlecode.com
>
> Requires
> --------
> * Linux, Mac OS-X or Windows
> * Python 2.5 (Windows), Python 2.5 or 2.6 (Linux/Mac OS-X)
> * NumPy 1.0.3 or later
> * HDF5 1.6.5 or later (including 1.8); HDF5 is included with the Windows
>   version.
>
> Thanks
> ------
> Thanks to D. Dale, E. Lawrence and others for their continued support and
> comments. Also thanks to Francesc Alted and the PyTables project, for
> inspiration and for generously providing their code to the community.
> Thanks to everyone at the HDF Group for creating such a useful piece of
> software.
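For anyone who hasn't tried the LZF filter mentioned above, a minimal usage
sketch (mine, not part of the announcement) looks like:

    import numpy as np
    import h5py

    arr = np.arange(1000000, dtype='f4').reshape(1000, 1000)

    f = h5py.File('example.h5', 'w')
    dset = f.create_dataset('data', data=arr, compression='lzf')  # transparent LZF
    part = dset[100:200, :50]   # random access to a slice, decompressed on the fly
    f.close()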
[Numpy-discussion] 3-10x speedup in bincount()
Well, there were no responses to my earlier email proposing changes to
numpy.bincount() to make it faster and more flexible. Does this mean no one
is using bincount? :-)

Anyway, I've now made the proposed modifications and got substantial
speedups of 3-10x. Details are in this extract from my C code. For the time
being I will test this as a standalone extension module. When it is stable
and tests are written, I'll submit it as a patch to numpy's
_compiled_base.c.

Cheers, Stephen

/*
 * Faster versions of bincount and casting strings to integers
 *
 * Author: Stephen Simmons, [EMAIL PROTECTED]
 * Date:   11 March 2007
 *
 * This module contains C code for functions I am using to accelerate
 * SQL-like aggregate functions for a column-oriented database based on
 * numpy.
 *
 * subtotal's bincount is typically 3-10 times faster than numpy's
 * bincount, and more flexible:
 *  - takes bins as they are, without promoting their type to int32
 *  - takes weights as they are, without promoting their type to double
 *  - ignores negative bins, rather than reporting an error
 *  - allows an optional int32/double output array to be passed in
 *  - specify the maximum bin number to use; larger bin numbers are ignored
 *  - only scans for the max bin number if neither max_bin nor an out array
 *    is specified
 *
 * atoi is 30-60 times faster than casting a string array to an integer
 * type, and may optionally use a translation table. The translation table
 * is a convenient way to map sparse integer strings to adjacent bins
 * before using bincount.
 *
 * Typical usage
 * -------------
 *  # Set up arrays 5,000,000 elts long. s1 has strings filled with '1' and '2'.
 *  # w are weights
 *  s1 = numpy.ndarray(shape=5000000, dtype='S1'); s1[:]='2'; s1[1::2]='1'
 *  w = numpy.arange(5000000, dtype='f4')
 *
 *  # Using standard numpy functions, string conversion is slow
 *  i = s1.astype('i1')                  -> 5.95 sec (!)
 *  numpy.bincount(i)                    ->  113 ms
 *  numpy.bincount(i, w)                 ->  272 ms
 *  numpy.bincount(s1.astype('i1'), w)   -> 6.12 sec
 *
 *  # Using the faster functions here:
 *  i = subtotal.atoi(s1)                ->   90 ms (60x faster)
 *  subtotal.bincount(i, max_bin=2)      ->   31 ms (3.6x faster)
 *  subtotal.bincount(i, w, max_bin=2)   ->   51 ms (5.3x faster)
 *
 *  # In both cases, the output is:
 *  array([2, 1, 2, ..., 1, 2, 1], dtype=int8)
 *  array([      0, 2500000, 2500000])
 *  array([  0.00000000e+00,   6.25000000e+12,   6.24999750e+12])
 *
 *  # As an example of using a translate table, run bincount on atoi's
 *  # output, applying a translate table for counting odd vs even strings.
 *  # This does the whole lot in 158 ms, a speedup of 38x from 6.12 s above.
 *  t = numpy.arange(256, dtype='i1') % 2
 *  subtotal.bincount(subtotal.atoi(s1, t), w, max_bin=t.max())  -> 158 ms
 *  array([  6.24999750e+12,   6.25000000e+12])
 *
 * These timings are based on the numpy-1.0.1 Windows binary distribution
 * for Python 2.5, compared with building this code using standard distutils
 * without any particular compiler optimisations:
 *     C:\mnt\dev\subtotal> python setup.py build -cmingw32
 * Processor is a 1.83GHz Pentium-M
 */

... rest of C code omitted ...

Stephen Simmons wrote:
> Hi,
>
> I'd like to propose some minor modifications to the function
> bincount(arr, weights=None), so would like some feedback from other users
> of bincount() before I write this up as a proper patch.
>
> Background: bincount() has two forms:
> - bincount(x) returns an integer array ians of length max(x)+1, where
>   ians[n] is the number of times n appears in x.
> - bincount(x, weights) returns a double array dans of length max(x)+1,
>   where dans[n] is the sum of the elements weights[i] at the positions
>   where x[i]==n.
> In both cases, all elements of x must be non-negative.
> Proposed changes:
>
> (1) Remove the restriction that elements of x must be non-negative.
> Currently bincount() starts by finding max(x) and min(x). If the min value
> is negative, an exception is raised. This change proposes dropping the
> initial search for min(x), and instead testing for non-negativity while
> summing values into the return arrays ians or dans. Any indexes where x is
> negative will be silently ignored. This will allow selective bincounts
> where values to ignore are flagged with a negative bin number.
>
> (2) Allow an optional argument for the maximum bin number.
> Currently bincount(x) returns an array whose length is dependent on
> max(x). It is sometimes preferable to specify the exact size of the
> returned array, so this change would add a new optional argument, max_bin,
> which is one less than the size of the returned array. Under this change,
> bincount() starts by finding max(x) only if max_bin is not specified. Then
> the returned array ians or dans is created with length max_bin+1, and any
> indexes that would overflow the output array are silently ignored.
>
> (3) Allow an optional output array, y.
> Currently bincount
[Numpy-discussion] array.sum() slower than expected along some array axes?
Hi,

Does anyone know why there is an order of magnitude difference in the speed
of numpy's array.sum() function depending on the axis of the matrix summed?

To see this, import numpy and create a big array with two rows:

    import numpy
    a = numpy.ones([2,1000000], 'f4')

Then using ipython's timeit function:

                                                       Time (ms)
    sum(a)                                                20
    a.sum()                                                9
    a.sum(axis=1)                                          9
    a.sum(axis=0)                                        159
    numpy.dot(numpy.ones(a.shape[0], a.dtype), a)         15

This last one using a dot product is functionally equivalent to
a.sum(axis=0), suggesting that the slowdown is due to how indexing is
implemented in array.sum().

Cheers
Stephen

P.S. I am using the following versions of Python and numpy:
    >>> numpy.__version__
    '1.0.1'
    >>> sys.version
    '2.5 (r25:51908, Sep 19 2006, 09:52:17) [MSC v.1310 32 bit (Intel)]'
Re: [Numpy-discussion] array.sum() slower than expected along some array axes?
Charles R Harris wrote:
> On 2/3/07, Stephen Simmons [EMAIL PROTECTED] wrote:
>> Hi,
>>
>> Does anyone know why there is an order of magnitude difference in the
>> speed of numpy's array.sum() function depending on the axis of the
>> matrix summed?
>>
>> To see this, import numpy and create a big array with two rows:
>>     import numpy
>>     a = numpy.ones([2,1000000], 'f4')
>>
>> Then using ipython's timeit function:
>>                                                        Time (ms)
>>     sum(a)                                                20
>>     a.sum()                                                9
>>     a.sum(axis=1)                                          9
>>     a.sum(axis=0)                                        159
>>     numpy.dot(numpy.ones(a.shape[0], a.dtype), a)         15
>>
>> This last one using a dot product is functionally equivalent to
>> a.sum(axis=0), suggesting that the slowdown is due to how indexing is
>> implemented in array.sum().
>
> In this case it is expected. There are inner and outer loops; in the slow
> case the inner loop with its extra code is called 1,000,000 times, in the
> fast case, twice. On the other hand, note this:
>
>     In [10]: timeit a[0,:] + a[1,:]
>     100 loops, best of 3: 19.7 ms per loop
>
> which has only one loop. Caching could also be a problem, but in this
> case it is dominated by loop overhead.
>
> Chuck

I agree that summing along the longer axis is most probably slower because
it makes more passes through the inner loop. The question though is whether
all of the inner loop's overhead is necessary. My counterexample using
numpy.dot() suggests there's considerable scope for improvement, at least
for certain common cases.
Re: [Numpy-discussion] Advice please on efficient subtotal function
Thanks Francesc, but I am already planning to read the data in block-wise,
as you suggest. My question is rather how best to update the subtotals for
each block in a parallel way using numpy efficiently, rather than with a
simplistic and slow element-by-element loop.

I can't use a simple sum(), as in your benchmark example or Greg's reply,
because I need to do:

    for n in xrange(len(data)):
        totals[ i[n], j[n] ] += data[n]

and not:

    for n in xrange(len(data)):
        totals[n] += data[n]

My best solution so far is roughly like this:
 - read in the next block of 100k or so rows (taking into account the
   PyTables table's _v_maxTuples and _v_chunksize)
 - calculate the subtotal index arrays i and j
 - do a lexsort() on [i, j, n]
 - partition the sorted [i, j, n] into subsets where the i and j arrays
   change values. The k-th such subset is thus
   s_k = [ i_k, j_k, [n_k0, ..., n_kN] ]
 - update the subtotals for each subset in the block:
   totals[i_k, j_k] += sum(data[n_k0, ..., n_kN])

This should be reasonably efficient, but it's messy, and I'm not familiar
enough with numpy's indexing tricks to get this right the first time.

Maybe instead I'll have a go at writing a Pyrex function that implements the
simple loop at C speed:

    subtotal2d(data_array, idx_array, out=None, dtype=None)

where data_array is Nx1, idx_array is NxM and out is M-dimensional.

Incidentally, there's one other function I'd find useful here in forming the
index arrays i[] and j[]: a fast translate-from-dict function:

    arr2 = fromiter( d[a] for a in arr1 )

My initial impression is that a C version would be substantially faster;
maybe I should do some benchmarking to see whether a pure Python/numpy
approach is actually faster than I expect.

Cheers, and thanks for any further suggestions,
Stephen

Francesc Altet [EMAIL PROTECTED] wrote:
> On Friday 29 December 2006 10:05, Stephen Simmons wrote:
>> Hi,
>>
>> I'm looking for efficient ways to subtotal a 1-d array onto a 2-D grid.
>> This is more easily explained in code than in words, thus:
>>
>>     for n in xrange(len(data)):
>>         totals[ i[n], j[n] ] += data[n]
>>
>> data comes from a series of PyTables files with ~200m rows. Each row has
>> ~20 cols, and I use the first three columns (which are 1-3 char strings)
>> to form the indexing functions i[] and j[], then want to calc averages
>> of the remaining 17 numerical cols.
>>
>> I have tried various indirect ways of doing this with searchsorted and
>> bincount, but intuitively they feel like overly complex solutions to
>> what is essentially a very simple problem.
>>
>> My work involves comparing the subtotals for various different
>> segmentation strategies (the i[] and j[] indexing functions). Efficient
>> solutions are important because I need to make many passes through the
>> 200m rows of data. Memory usage is the easiest thing for me to adjust,
>> by changing how many rows of data to read in for each pass and then
>> reusing the same array data buffers.
>
> Well, from your words I guess you should already have tested this, but
> just in case. As PyTables saves data in tables row-wise, it is always
> faster to use the complete row for computations in each iteration than to
> use just a single column. This is shown in the small benchmark that I'm
> attaching at the end of the message. Here is its output for a table with
> 1m rows:
>
>     time for creating the file            -- 12.044
>     time for using column reads           -- 46.407
>     time for using the row wise iterator  -- 73.036
>     time for using block reads (row wise) --  5.156
>
> So, using block reads (in case you can use them) is your best bet.
> HTH,
>
> --
>
> import tables
> import numpy
> from time import time
>
> nrows = 1000*1000
>
> # Create a table definition with 17 double cols and 3 string cols
> coltypes = numpy.dtype("f8,"*17 + "S3,"*3)
>
> t1 = time()
> # Create a file with an empty table. Use compression to minimize file size.
> f = tables.openFile("/tmp/prova.h5", 'w')
> table = f.createTable(f.root, 'table', numpy.empty(0, coltypes),
>                       filters=tables.Filters(complevel=1, complib='lzo'))
> # Fill the table with default values (empty strings and zeros)
> row = table.row
> for nrow in xrange(nrows):
>     row.append()
> f.close()
> print "time for creating the file--", round(time()-t1, 3)
>
> # *** Start benchmarks ***
> f = tables.openFile("/tmp/prova.h5", 'r')
> table = f.root.table
> colnames = table.colnames[:-3]   # exclude the string cols
>
> # Loop over the table using column reads
> t1 = time(); cum = numpy.zeros(17)
> for ncol, colname in enumerate(colnames):
>     col = table.read(0, nrows, field=colname)
>     cum[ncol] += col.sum()
> print "time for using column reads --", round(time()-t1, 3)
>
> # Loop over the table using its row iterator
> t1 = time(); cum = numpy.zeros(17)
> for row in table:
>     for ncol, colname in enumerate(colnames):
>         cum[ncol] += row[colname]
> print "time for using the row iterator --", round(time()-t1, 3)
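Coming back to the original subtotal question, here is a rough numpy-only
sketch (my own illustration, untested against the real data; bincount's
minlength argument needs numpy 1.6 or later) of how the per-block update
described above could be vectorised without an explicit Python loop or the
lexsort/partition step:

    import numpy as np

    def add_block_subtotals(totals, i, j, data):
        # totals: 2-D accumulator grid; i, j: integer bin indices for this
        # block of rows; data: values to subtotal. Repeated (i, j) pairs
        # are summed, matching the element-by-element loop above.
        flat = i * totals.shape[1] + j          # combined (i, j) bin number
        totals += np.bincount(flat, weights=data,
                              minlength=totals.size).reshape(totals.shape)
        return totals

    # usage, per block read from the PyTables file:
    # totals = np.zeros((n_i, n_j))
    # add_block_subtotals(totals, i, j, data)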