Re: [Numpy-discussion] 2D binning
On 1/06/2010 10:51 PM, Wes McKinney wrote:
> [snip]
> This is a pretty good example of the group-by problem that will hopefully
> work its way into a future edition of NumPy.

Wes (or anyone else), please can you elaborate on any plans for groupby?

I've made my own modification to numpy.bincount for doing groupby-type
operations, but haven't contributed anything back to the numpy community.

Is there any interest from European-based people in working on groupby etc.
at the EuroSciPy sprints in July? If others are interested, maybe we could
work out requirements beforehand and then do some coding in Paris.

Cheers
Stephen
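For anyone who hasn't seen the trick, here is a minimal sketch of a group-by
sum/mean using plain numpy.bincount (my own illustration, not the modified
bincount mentioned above):

    import numpy as np

    keys = np.array([0, 1, 0, 2, 1])                # integer group labels
    values = np.array([10., 20., 30., 40., 50.])    # data to aggregate

    sums = np.bincount(keys, weights=values)        # per-group sums
    counts = np.bincount(keys)                      # per-group row counts
    means = sums / np.maximum(counts, 1)            # per-group means, avoiding 0/0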
Re: [Numpy-discussion] Proposal for new ufunc functionality
I would really like to see this become a core part of numpy...

For groupby-like summing over arrays, I use a modified version of
numpy.bincount() which has optional arguments that greatly enhance its
flexibility:

    bincount(bins, weights=, max_bin=, out=)

where:
 * bins - numpy array of bin numbers (uint8, int16 or int32). Negative bin
   numbers indicate weights to be ignored [1]
 * weights - (opt) numpy array of weights (float or double)
 * max_bin - (opt) bin numbers greater than this are ignored when counting [2]
 * out - (opt) numpy output array (int32 or double)

[1] This is how I support Robert Kern's comment below: "If there are some
areas you want to ignore, that's difficult to do with reduceat()."

[2] Specifying the number of bins up front has two benefits: (i) it saves
scanning the bins array to see how big the output needs to be; and (ii) it
allows you to control the size of the output array, as you may want it
bigger than the number of bins would suggest.

I look forward to the draft NEP!

Best regards
Stephen Simmons

On 13/04/2010 10:34 PM, Robert Kern wrote:
> On Sat, Apr 10, 2010 at 17:59, Robert Kern <robert.k...@gmail.com> wrote:
>> On Sat, Apr 10, 2010 at 12:45, Pauli Virtanen <p...@iki.fi> wrote:
>>> la, 2010-04-10 kello 12:23 -0500, Travis Oliphant kirjoitti:
>>> [clip]
>>>> Here are my suggested additions to NumPy:
>>>> ufunc methods:
>>>> [clip]
>>>> * reducein (array, indices, axis=0)
>>>>   similar to reduce-at, but the indices provide both the start and end
>>>>   points (rather than being fence-posts like reduceat).
>>>
>>> Is the `reducein` important to have, as compared to `reduceat`?
>>
>> Yes, I think so. If there are some areas you want to ignore, that's
>> difficult to do with reduceat().
>
> And conversely, overlapping areas are highly useful but completely
> impossible to do with reduceat.
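For anyone wanting to experiment before a C version lands, a rough
pure-numpy emulation of the semantics described above could look like the
following (a sketch only: the names, defaults and dtype handling are my own
guesses rather than the author's C code, and np.add.at needs numpy >= 1.8):

    import numpy as np

    def bincount_ext(bins, weights=None, max_bin=None, out=None):
        bins = np.asarray(bins)
        if max_bin is None:
            max_bin = int(bins.max())
        if out is None:
            out_dtype = np.float64 if weights is not None else np.intp
            out = np.zeros(max_bin + 1, dtype=out_dtype)
        keep = (bins >= 0) & (bins <= max_bin)   # negative/too-large bins ignored
        if weights is None:
            np.add.at(out, bins[keep], 1)
        else:
            np.add.at(out, bins[keep], np.asarray(weights)[keep])
        return out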
[Numpy-discussion] Designing a new storage format for numpy recarrays
Hi,

Is anyone working on alternative storage options for numpy arrays, and
specifically recarrays? My main application involves processing series of
large recarrays (say 1000 recarrays, each with 5M rows having 50 fields).
Existing options meet some but not all of my requirements.

Requirements
------------
The basic requirements are:

Mandatory
 - fast
 - suitable for very large arrays (larger than can fit in memory)
 - compressed (to reduce disk space, read data more quickly)
 - seekable (can read a subset of data without decompressing everything)
 - can append new data to an existing file
 - able to extract individual fields from a recarray (for when indexing or
   processing needs just a few fields)

Nice to have
 - files can be split without decompressing and recompressing (e.g. to
   distribute processing over a grid)
 - encryption, ideally field-level, with encryption occurring after compression
 - can store multiple arrays in one physical file (convenience)
 - portable/standard/well documented

Existing options
----------------
Over the last few years I've tried most of numpy's options for saving arrays
to disk, including pickles, .npy, .npz, memmap-ed files and HDF (PyTables).
None of these is perfect, although PyTables comes close:
 - .npy - not compressed; need to read the whole array into memory
 - .npz - compressed, but ZLIB compression is too slow
 - memmap - not compressed
 - PyTables (HDF using chunked storage for recarrays with LZO compression
   and shuffle filter)
    - can't extract an individual field from a recarray
    - multiple dependencies (HDF5, PyTables+LZO, h5py+LZF)
    - HDF is standard, but the LZO implementation is specific to PyTables
      (similarly LZF is specific to h5py)

Are there any other options?

Thoughts about a new format
---------------------------
It seems that numpy could benefit from a new storage format. My first
thoughts involve:
 - Use a chunked format - split big arrays into pages of consecutive rows,
   compressed separately
 - Get good compression ratios by shuffling data before compressing (byte 1
   of all rows, then byte 2 of all rows, ...)
 - Get efficient access to individual fields in recarrays by compressing
   each recarray field's data separately (shuffling has the nice side-effect
   of separating recarray fields' data)
 - Make it fast to compress and decompress by using LZO
 - Store pages of rows (and compressed field data within a page) using a
   numpy variation of the IFF chunked format (e.g. as used by the DjVu
   scanned document format, version 3). For example, a FORM chunk for the
   whole file, a DTYP chunk for dtype info, a DIRM chunk for the directory
   of pages holding rows, and an NPAG chunk for a page
 - The IFF structure of named chunk types allows the format to be extended
   (compressors other than LZO, encryption, links to remote data chunks, etc.)

I'd appreciate any comments or suggestions before I start coding.

References
----------
DjVu format - http://djvu.org/resources/
DjVu v3 format - http://djvu.org/docs/DjVu3Spec.djvu

Stephen

P.S. Maybe this will be too much work, and I'd be better off sticking with
PyTables.
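To make the shuffle idea concrete, here is a minimal sketch of the byte
transpose described above (my own illustration; the function name is
hypothetical, and zlib merely stands in for an LZO codec):

    import zlib
    import numpy as np

    def shuffle_and_compress(recs, level=1):
        # Lay the raw record bytes out as (nrows, itemsize), then transpose
        # so byte 1 of every row is stored together, then byte 2, and so on.
        raw = np.frombuffer(recs.tobytes(), dtype=np.uint8)
        raw = raw.reshape(len(recs), recs.dtype.itemsize)
        shuffled = np.ascontiguousarray(raw.T)
        # zlib here only as a stand-in for a fast LZO compressor
        return zlib.compress(shuffled.tobytes(), level)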
Re: [Numpy-discussion] Home for pyhdf5io?
David Warde-Farley wrote:
> On 23-May-09, at 4:25 PM, Albert Thuswaldner wrote:
>> Actually my vision with pyhdf5io is to have hdf5 replace numpy's own
>> binary file format (.npy, .npz). Pyhdf5io (or an incarnation of it)
>> should be the standard (binary) way to store data in scipy/numpy. A bold
>> statement, I know, but I think that it would be an improvement,
>> especially for those users who are replacing Matlab with scipy/numpy.
>
> In that it introduces a dependency on pytables (and the hdf5 C library) I
> doubt it would be something the numpy core developers would be eager to
> adopt. The npy and npz formats (as best I can gather) exist so that there
> is _some_ way of persisting data to disk that ships with numpy. It's not
> meant necessarily as the best way, or as an interchange format, just as
> something that works out of the box, the code for which is completely
> contained within numpy.
>
> It might be worth mentioning the limitations of numpy's built-in save(),
> savez() and load() in the docstrings and recommending more portable
> alternatives, though.
>
> David

I tend to agree with David that PyTables is too big a dependency for
inclusion in core numpy. It does a lot more than simply loading and saving
arrays.

While I haven't tried Andrew Collette's h5py (http://code.google.com/p/h5py),
it looks like a very 'thin' wrapper around the HDF5 C libraries.

Maybe numpy's save(), savez(), load() and memmap() could be enhanced so that
saving/loading files with HDF5-like file extensions used the HDF5 format,
with code based on h5py and pyhdf5io. This could, I imagine, be a relatively
small/simple addition to numpy, with the only external dependency being the
HDF5 libraries themselves.

Stephen
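As a rough illustration of the kind of dispatch I have in mind (my own
sketch with a hypothetical save_any() helper, not a proposed numpy API):

    import os
    import numpy as np

    def save_any(path, arr):
        # Route HDF5-like extensions to h5py; everything else uses np.save.
        if os.path.splitext(path)[1].lower() in ('.h5', '.hdf5'):
            import h5py
            f = h5py.File(path, 'w')
            try:
                f.create_dataset('data', data=arr)
            finally:
                f.close()
        else:
            np.save(path, arr)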
[Numpy-discussion] Example code for Numpy C preprocessor 'repeat' directive?
Hi,

Please can someone suggest resources for learning how to use the 'repeat'
macros in numpy C code to avoid repeating sections of type-specific code
for each data type? Ideally there would be two types of resources: (i) a
description of how the repeat macros are meant to be used/compiled; and
(ii) a suggestion for a numpy source file that best illustrates this.

Thanks in advance!

Stephen

P.S. The motivation for this is that I'm trying to write an optimised numpy
implementation of SQL-style aggregation operators for an OLAP data analysis
project (using PyTables to store large numpy data sets). bincount() is
being used to implement SELECT SUM(x) FROM TBL WHERE y GROUP BY fn(z). My
modified bincount code can handle a wider variety of index, weight and
output array data types. It also supports passing in the output array as a
parameter, allowing multipass aggregation routines. I got the code working
for a small number of data type combinations, but now I'm drowning in an
exponential explosion of manually maintained data type combinations:

---snip---
    } else if ((weight_type == NPY_FLOAT) && (out_type == PyArray_DOUBLE)) {
        ...
        } else if (bin_type == PyArray_INTP) {
            for (i = 0; i < bin_len; i++) {
                bin = ((npy_intp *) bin_data)[i];
                if (bin >= 0 && bin <= max_bin)
                    ((double *) out_data)[bin] +=
                        *((float *) (weights_data + i*wt_stride));
            }
        } else if (bin_type == PyArray_UINT8) {
            for (i = 0; i < bin_len; i++) {
                bin = ((npy_uint8 *) bin_data)[i];
                if (bin >= 0 && bin <= max_bin)
                    ((double *) out_data)[bin] +=
                        *((float *) (weights_data + i*wt_stride));
            }
---snip---

'repeat' directives in C comments are obviously the proper way to avoid
manually generating all this boilerplate code. Unfortunately I haven't yet
understood how to make the autogenerated type-specific code link back to a
common function entry point. Either there is some compiler/distutils magic
going on, or it's explained in a different numpy source file from where I'm
looking right now...
[Numpy-discussion] Easy way to vectorize a loop?
Hi,

Can anyone help me out with a simple way to vectorize this loop?

    # idx and vals are arrays with indexes and values used to update array data
    data = numpy.ndarray(shape=(100,100,100,100), dtype='f4')
    flattened = data.ravel()
    for i in range(len(vals)):
        flattened[idx[i]] += vals[i]

Many thanks!
Stephen
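One way to vectorize this (a sketch, assuming idx holds non-negative flat
indices; bincount's minlength argument and numpy.add.at need reasonably
recent numpy versions):

    # accumulate vals into the flattened array; repeated indices are summed
    flattened += numpy.bincount(idx, weights=vals, minlength=flattened.size)

    # alternative on newer numpy:
    # numpy.add.at(flattened, idx, vals)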
Re: [Numpy-discussion] ANN: HDF5 for Python 1.1
Hi Andrew,

Do you have any plans to support LZO compression in h5py? I have lots of
LZO-compressed datasets created with PyTables. There's a real barrier to
using both h5py and PyTables if the fast decompressor options are just LZF
on h5py and LZO on PyTables.

Many thanks
Stephen

Andrew Collette wrote:
> = Announcing HDF5 for Python (h5py) 1.1 =
>
> What is h5py?
> -------------
> HDF5 for Python (h5py) is a general-purpose Python interface to the
> Hierarchical Data Format library, version 5. HDF5 is a versatile, mature
> scientific software library designed for the fast, flexible storage of
> enormous amounts of data.
>
> From a Python programmer's perspective, HDF5 provides a robust way to
> store data, organized by name in a tree-like fashion. You can create
> datasets (arrays on disk) hundreds of gigabytes in size, and perform
> random-access I/O on desired sections. Datasets are organized in a
> filesystem-like hierarchy using containers called "groups", and accessed
> using the traditional POSIX /path/to/resource syntax.
>
> In addition to providing interoperability with existing HDF5 datasets and
> platforms, h5py is a convenient way to store and retrieve arbitrary NumPy
> data and metadata.
>
> New features in 1.1
> -------------------
> - A new compression filter based on the LZF library, which provides
>   transparent compression many times faster than the standard HDF5 GZIP
>   filter.
> - Efficient broadcasting using HDF5 hyperslab selections; for example,
>   you can write to a (2000 x 100 x 50) selection from a (100 x 50) source
>   array.
> - Now supports the NumPy boolean type
> - Auto-completion for IPython 0.9.X (contributed by Darren Dale)
> - Installable via easy_install
>
> Standard features
> -----------------
> - Supports storage of NumPy data of the following types:
>   * Integer/Unsigned Integer
>   * Float/Double
>   * Complex/Double Complex
>   * Compound (recarray)
>   * Strings
>   * Boolean
>   * Array (as members of a compound type only)
>   * Void
> - Random access to datasets using the standard NumPy slicing syntax,
>   including fancy indexing and point-based selection
> - Transparent compression of datasets using GZIP, LZF or SZIP, and
>   error-detection using Fletcher32
> - Pythonic interface supporting dictionary and NumPy-array metaphors for
>   the high-level HDF5 abstractions like groups and datasets
> - A comprehensive, object-oriented wrapping of the HDF5 low-level C API
>   via Cython, in addition to the NumPy-like high-level interface
> - Supports many new features of HDF5 1.8, including recursive iteration
>   over entire files and in-library copy operations on the file tree
> - Thread-safe
>
> Where to get it
> ---------------
> * Main website, documentation: http://h5py.alfven.org
> * Downloads, bug tracker: http://h5py.googlecode.com
>
> Requires
> --------
> * Linux, Mac OS-X or Windows
> * Python 2.5 (Windows), Python 2.5 or 2.6 (Linux/Mac OS-X)
> * NumPy 1.0.3 or later
> * HDF5 1.6.5 or later (including 1.8); HDF5 is included with the Windows
>   version.
>
> Thanks
> ------
> Thanks to D. Dale, E. Lawrence and others for their continued support and
> comments. Also thanks to Francesc Alted and the PyTables project, for
> inspiration and for generously providing their code to the community.
> Thanks to everyone at the HDF Group for creating such a useful piece of
> software.
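For anyone who hasn't tried the LZF filter mentioned above, a minimal usage
sketch (mine, not part of the announcement) looks like:

    import numpy as np
    import h5py

    arr = np.arange(1000000, dtype='f4').reshape(1000, 1000)

    f = h5py.File('example.h5', 'w')
    dset = f.create_dataset('data', data=arr, compression='lzf')  # transparent LZF
    part = dset[100:200, :50]   # random access to a slice, decompressed on the fly
    f.close()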
[Numpy-discussion] 3-10x speedup in bincount()
Well, there were no responses to my earlier email proposing changes to
numpy.bincount() to make it faster and more flexible. Does this mean no one
is using bincount? :-)

Anyway, I've now made the proposed modifications and got substantial
speedups of 3-10x. Details are in this extract from my C code. For the time
being I will test this as a standalone extension module. When it is stable
and tests are written, I'll submit it as a patch to numpy's
_compiled_base.c.

Cheers, Stephen

/*
 * Faster versions of bincount and casting strings to integers
 *
 * Author: Stephen Simmons, [EMAIL PROTECTED]
 * Date:   11 March 2007
 *
 * This module contains C code for functions I am using to accelerate
 * SQL-like aggregate functions for a column-oriented database based on
 * numpy.
 *
 * subtotal's bincount is typically 3-10 times faster than numpy's
 * bincount, and more flexible:
 *  - takes bins as they are, without promoting their type to int32
 *  - takes weights as they are, without promoting their type to double
 *  - ignores negative bins, rather than reporting an error
 *  - allows an optional int32/double output array to be passed in
 *  - specify the maximum bin number to use; larger bin numbers are ignored
 *  - only scans for the max bin number if neither max_bin nor an out array
 *    is specified
 *
 * atoi is 30-60 times faster than casting a string array to an integer
 * type, and may optionally use a translation table. The translation table
 * is a convenient way to map sparse integer strings to adjacent bins
 * before using bincount.
 *
 * Typical usage
 * -------------
 *  # Set up arrays 5,000,000 elts long. s1 has strings filled with '1' and '2'.
 *  # w are weights
 *  s1 = numpy.ndarray(shape=5000000, dtype='S1'); s1[:]='2'; s1[1::2]='1'
 *  w = numpy.arange(5000000, dtype='f4')
 *
 *  # Using standard numpy functions, string conversion is slow
 *  i = s1.astype('i1')                  -> 5.95 sec (!)
 *  numpy.bincount(i)                    ->  113 ms
 *  numpy.bincount(i, w)                 ->  272 ms
 *  numpy.bincount(s1.astype('i1'), w)   -> 6.12 sec
 *
 *  # Using the faster functions here:
 *  i = subtotal.atoi(s1)                ->   90 ms (60x faster)
 *  subtotal.bincount(i, max_bin=2)      ->   31 ms (3.6x faster)
 *  subtotal.bincount(i, w, max_bin=2)   ->   51 ms (5.3x faster)
 *
 *  # In both cases, the output is:
 *  array([2, 1, 2, ..., 1, 2, 1], dtype=int8)
 *  array([      0, 2500000, 2500000])
 *  array([  0.00000000e+00,   6.25000000e+12,   6.24999750e+12])
 *
 *  # As an example of using a translate table, run bincount on atoi's
 *  # output, applying a translate table for counting odd vs even strings.
 *  # This does the whole lot in 158 ms, a speedup of 38x from 6.12 s above.
 *  t = numpy.arange(256, dtype='i1') % 2
 *  subtotal.bincount(subtotal.atoi(s1, t), w, max_bin=t.max())  -> 158 ms
 *  array([  6.24999750e+12,   6.25000000e+12])
 *
 * These timings are based on the numpy-1.0.1 Windows binary distribution
 * for Python 2.5, compared with building this code using standard distutils
 * without any particular compiler optimisations:
 *     C:\mnt\dev\subtotal> python setup.py build -cmingw32
 * Processor is a 1.83GHz Pentium-M
 */

... rest of C code omitted ...

Stephen Simmons wrote:
> Hi,
>
> I'd like to propose some minor modifications to the function
> bincount(arr, weights=None), so would like some feedback from other users
> of bincount() before I write this up as a proper patch.
>
> Background: bincount() has two forms:
> - bincount(x) returns an integer array ians of length max(x)+1, where
>   ians[n] is the number of times n appears in x.
> - bincount(x, weights) returns a double array dans of length max(x)+1,
>   where dans[n] is the sum of the elements weights[i] at the positions
>   where x[i]==n.
> In both cases, all elements of x must be non-negative.
> Proposed changes:
>
> (1) Remove the restriction that elements of x must be non-negative.
> Currently bincount() starts by finding max(x) and min(x). If the min value
> is negative, an exception is raised. This change proposes dropping the
> initial search for min(x), and instead testing for non-negativity while
> summing values into the return arrays ians or dans. Any indexes where x is
> negative will be silently ignored. This will allow selective bincounts
> where values to ignore are flagged with a negative bin number.
>
> (2) Allow an optional argument for the maximum bin number.
> Currently bincount(x) returns an array whose length is dependent on
> max(x). It is sometimes preferable to specify the exact size of the
> returned array, so this change would add a new optional argument, max_bin,
> which is one less than the size of the returned array. Under this change,
> bincount() starts by finding max(x) only if max_bin is not specified. Then
> the returned array ians or dans is created with length max_bin+1, and any
> indexes that would overflow the output array are silently ignored.
>
> (3) Allow an optional output array, y.
> Currently bincount
[Numpy-discussion] array.sum() slower than expected along some array axes?
Hi,

Does anyone know why there is an order of magnitude difference in the speed
of numpy's array.sum() function depending on the axis of the matrix summed?

To see this, import numpy and create a big array with two rows:

    import numpy
    a = numpy.ones([2,1000000], 'f4')

Then using ipython's timeit function:

                                                       Time (ms)
    sum(a)                                                20
    a.sum()                                                9
    a.sum(axis=1)                                          9
    a.sum(axis=0)                                        159
    numpy.dot(numpy.ones(a.shape[0], a.dtype), a)         15

This last one using a dot product is functionally equivalent to
a.sum(axis=0), suggesting that the slowdown is due to how indexing is
implemented in array.sum().

Cheers
Stephen

P.S. I am using the following versions of Python and numpy:
    >>> numpy.__version__
    '1.0.1'
    >>> sys.version
    '2.5 (r25:51908, Sep 19 2006, 09:52:17) [MSC v.1310 32 bit (Intel)]'
Re: [Numpy-discussion] array.sum() slower than expected along some array axes?
Charles R Harris wrote:
> On 2/3/07, Stephen Simmons [EMAIL PROTECTED] wrote:
>> Hi,
>>
>> Does anyone know why there is an order of magnitude difference in the
>> speed of numpy's array.sum() function depending on the axis of the
>> matrix summed?
>>
>> To see this, import numpy and create a big array with two rows:
>>     import numpy
>>     a = numpy.ones([2,1000000], 'f4')
>>
>> Then using ipython's timeit function:
>>                                                        Time (ms)
>>     sum(a)                                                20
>>     a.sum()                                                9
>>     a.sum(axis=1)                                          9
>>     a.sum(axis=0)                                        159
>>     numpy.dot(numpy.ones(a.shape[0], a.dtype), a)         15
>>
>> This last one using a dot product is functionally equivalent to
>> a.sum(axis=0), suggesting that the slowdown is due to how indexing is
>> implemented in array.sum().
>
> In this case it is expected. There are inner and outer loops; in the slow
> case the inner loop with its extra code is called 1,000,000 times, in the
> fast case, twice. On the other hand, note this:
>
>     In [10]: timeit a[0,:] + a[1,:]
>     100 loops, best of 3: 19.7 ms per loop
>
> which has only one loop. Caching could also be a problem, but in this
> case it is dominated by loop overhead.
>
> Chuck

I agree that summing along the longer axis is most probably slower because
it makes more passes through the inner loop. The question though is whether
all of the inner loop's overhead is necessary. My counterexample using
numpy.dot() suggests there's considerable scope for improvement, at least
for certain common cases.
Re: [Numpy-discussion] Advice please on efficient subtotal function
Thanks Francesc, but I am already planning to read the data in block-wise,
as you suggest. My question is rather how best to update the subtotals for
each block in a parallel way using numpy efficiently, rather than with a
simplistic and slow element-by-element loop.

I can't use a simple sum(), as in your benchmark example or Greg's reply,
because I need to do:

    for n in xrange(len(data)):
        totals[ i[n], j[n] ] += data[n]

and not:

    for n in xrange(len(data)):
        totals[n] += data[n]

My best solution so far is roughly like this:
 - read in the next block of 100k or so rows (taking into account the
   PyTables table's _v_maxTuples and _v_chunksize)
 - calculate the subtotal index arrays i and j
 - do a lexsort() on [i, j, n]
 - partition the sorted [i, j, n] into subsets where the i and j arrays
   change values. The k-th such subset is thus
   s_k = [ i_k, j_k, [n_k0, ..., n_kN] ]
 - update the subtotals for each subset in the block:
   totals[i_k, j_k] += sum(data[n_k0, ..., n_kN])

This should be reasonably efficient, but it's messy, and I'm not familiar
enough with numpy's indexing tricks to get this right the first time.

Maybe instead I'll have a go at writing a Pyrex function that implements the
simple loop at C speed:

    subtotal2d(data_array, idx_array, out=None, dtype=None)

where data_array is Nx1, idx_array is NxM and out is M-dimensional.

Incidentally, there's one other function I'd find useful here in forming the
index arrays i[] and j[]: a fast translate-from-dict function:

    arr2 = fromiter( d[a] for a in arr1 )

My initial impression is that a C version would be substantially faster;
maybe I should do some benchmarking to see whether a pure Python/numpy
approach is actually faster than I expect.

Cheers, and thanks for any further suggestions,
Stephen

Francesc Altet [EMAIL PROTECTED] wrote:
> On Friday 29 December 2006 10:05, Stephen Simmons wrote:
>> Hi,
>>
>> I'm looking for efficient ways to subtotal a 1-d array onto a 2-D grid.
>> This is more easily explained in code than in words, thus:
>>
>>     for n in xrange(len(data)):
>>         totals[ i[n], j[n] ] += data[n]
>>
>> data comes from a series of PyTables files with ~200m rows. Each row has
>> ~20 cols, and I use the first three columns (which are 1-3 char strings)
>> to form the indexing functions i[] and j[], then want to calc averages
>> of the remaining 17 numerical cols.
>>
>> I have tried various indirect ways of doing this with searchsorted and
>> bincount, but intuitively they feel like overly complex solutions to
>> what is essentially a very simple problem.
>>
>> My work involves comparing the subtotals for various different
>> segmentation strategies (the i[] and j[] indexing functions). Efficient
>> solutions are important because I need to make many passes through the
>> 200m rows of data. Memory usage is the easiest thing for me to adjust,
>> by changing how many rows of data to read in for each pass and then
>> reusing the same array data buffers.
>
> Well, from your words I guess you should already have tested this, but
> just in case. As PyTables saves data in tables row-wise, it is always
> faster to use the complete row for computations in each iteration than to
> use just a single column. This is shown in the small benchmark that I'm
> attaching at the end of the message. Here is its output for a table with
> 1m rows:
>
>     time for creating the file            -- 12.044
>     time for using column reads           -- 46.407
>     time for using the row wise iterator  -- 73.036
>     time for using block reads (row wise) --  5.156
>
> So, using block reads (in case you can use them) is your best bet.
> HTH,
>
> --
>
> import tables
> import numpy
> from time import time
>
> nrows = 1000*1000
>
> # Create a table definition with 17 double cols and 3 string cols
> coltypes = numpy.dtype("f8,"*17 + "S3,"*3)
>
> t1 = time()
> # Create a file with an empty table. Use compression to minimize file size.
> f = tables.openFile("/tmp/prova.h5", 'w')
> table = f.createTable(f.root, 'table', numpy.empty(0, coltypes),
>                       filters=tables.Filters(complevel=1, complib='lzo'))
> # Fill the table with default values (empty strings and zeros)
> row = table.row
> for nrow in xrange(nrows):
>     row.append()
> f.close()
> print "time for creating the file--", round(time()-t1, 3)
>
> # *** Start benchmarks ***
> f = tables.openFile("/tmp/prova.h5", 'r')
> table = f.root.table
> colnames = table.colnames[:-3]   # exclude the string cols
>
> # Loop over the table using column reads
> t1 = time(); cum = numpy.zeros(17)
> for ncol, colname in enumerate(colnames):
>     col = table.read(0, nrows, field=colname)
>     cum[ncol] += col.sum()
> print "time for using column reads --", round(time()-t1, 3)
>
> # Loop over the table using its row iterator
> t1 = time(); cum = numpy.zeros(17)
> for row in table:
>     for ncol, colname in enumerate(colnames):
>         cum[ncol] += row[colname]
> print "time for using the row iterator --", round(time()-t1, 3)
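Coming back to the original subtotal question, here is a rough numpy-only
sketch (my own illustration, untested against the real data; bincount's
minlength argument needs numpy 1.6 or later) of how the per-block update
described above could be vectorised without an explicit Python loop or the
lexsort/partition step:

    import numpy as np

    def add_block_subtotals(totals, i, j, data):
        # totals: 2-D accumulator grid; i, j: integer bin indices for this
        # block of rows; data: values to subtotal. Repeated (i, j) pairs
        # are summed, matching the element-by-element loop above.
        flat = i * totals.shape[1] + j          # combined (i, j) bin number
        totals += np.bincount(flat, weights=data,
                              minlength=totals.size).reshape(totals.shape)
        return totals

    # usage, per block read from the PyTables file:
    # totals = np.zeros((n_i, n_j))
    # add_block_subtotals(totals, i, j, data)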