On Tue, Apr 13, 2010 at 10:03 AM, Travis Oliphant <oliph...@enthought.com> wrote:
>
> On Apr 12, 2010, at 5:31 PM, Robert Kern wrote:
>
> > We should collect all of these proposals into a NEP. To clarify what I
> > mean by "group-by" behavior.
> >
> > Suppose I have an array of floats and an array of integers. Each element
> > in the array of integers represents a region in the float array of a certain
> > "kind". The reduction should take place over like-kind values:
> >
> > Example:
> >
> > add.reduceby(array=[1,2,3,4,5,6,7,8,9], by=[0,1,0,1,2,0,0,2,2])
> >
> > results in the calculations:
> >
> > 1 + 3 + 6 + 7
> > 2 + 4
> > 5 + 8 + 9
> >
> > and therefore the output (notice the two arrays --- perhaps a structured
> > array should be returned instead...)
> >
> > [0,1,2],
> > [17, 6, 22]
> >
> > The real value is when you have tabular data and you want to do reductions
> > in one field based on values in another field. This happens all the time
> > in relational algebra and would be a relatively straightforward thing to
> > support in ufuncs.
>
> I might suggest a simplification where the by array must be an array
> of non-negative ints such that they are indices into the output. For
> example (note that I replace 2 with 3 and have no 2s in the by array):
>
> add.reduceby(array=[1,2,3,4,5,6,7,8,9], by=[0,1,0,1,3,0,0,3,3]) ==
> [17, 6, 0, 22]
>
> This basically generalizes bincount() to other binary ufuncs.
>
>
> Interesting proposal. I do like the having only one output.
>
> I'm particularly interested in reductions with "by" arrays of strings. i.e.
> something like:
>
> add.reduceby([10,11,12,13,14,15,16],
> by=['red','green','red','green','red','blue', 'blue']).
>
> resulting in:
>
> 10+12+14
> 11+13
> 15+16
>
> In practice, these would have to be essentially mapped to the kind of
> integer array I used in the original example, and so I suppose if we couple
> your proposal with the segment function from the rest of my original
> proposal, then the same resulting functionality is available (with perhaps
> the extra intermediate integer array that may not be strictly necessary).
>
> But, having simple building blocks is usually better in the long run (and
> typically leads to better optimizations by human programmers).

Currently I'm using unique with return_inverse=True to recode the labels
into integers

>>> np.unique(['red','green','red','green','red','blue','blue'],
...           return_inverse=True)
(array(['blue', 'green', 'red'], dtype='|S5'), array([2, 1, 2, 1, 2, 0, 0]))

and then feeding the result into bincount.

Your plans are a good generalization and speedup.

Josef

> Thanks,
> -Travis

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
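
The unique-then-bincount workaround mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: `reduceby_sum` and `reduceby` are hypothetical helper names, not NumPy functions, and the second helper relies on `ufunc.at` and `np.full`, which only became available in NumPy releases later than this thread. It returns the labels alongside the reduced values, as in the two-array output of the original example.

```python
import numpy as np

def reduceby_sum(values, by):
    """Sum `values` per group label in `by` (hypothetical helper, not NumPy API)."""
    # Recode arbitrary labels (strings included) into integer codes 0..k-1.
    labels, codes = np.unique(np.asarray(by), return_inverse=True)
    # bincount with weights adds each value into the bin named by its code.
    sums = np.bincount(codes.ravel(),
                       weights=np.asarray(values, dtype=float),
                       minlength=len(labels))
    return labels, sums

def reduceby(ufunc, values, by):
    """Same idea for a binary ufunc that has an identity (hypothetical helper)."""
    if ufunc.identity is None:
        raise ValueError("this sketch needs a ufunc with an identity, e.g. add")
    values = np.asarray(values)
    labels, codes = np.unique(np.asarray(by), return_inverse=True)
    # Start every group at the identity, then accumulate unbuffered in place.
    out = np.full(len(labels), ufunc.identity, dtype=values.dtype)
    ufunc.at(out, codes.ravel(), values)
    return labels, out

labels, sums = reduceby_sum(
    [10, 11, 12, 13, 14, 15, 16],
    ['red', 'green', 'red', 'green', 'red', 'blue', 'blue'])
print(labels)  # ['blue' 'green' 'red']
print(sums)    # [31. 24. 36.]
```

Note that the groups come back in sorted label order (blue, green, red), not in order of first appearance, because that is how `np.unique` orders its output.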