On Tue, Aug 12, 2014 at 12:17 PM, Eelco Hoogendoorn < hoogendoorn.ee...@gmail.com> wrote:

> Thanks. Prompted by that stackoverflow question, and similar problems I
> had to deal with myself, I started working on a much more general extension
> to numpy's functionality in this space. Like you noted, things get a little
> panda-y, but I think there is a lot of pandas' functionality that could or
> should be part of the numpy core, a robust set of grouping operations in
> particular.
>

FYI I wrote some table grouping operations (join, hstack, vstack) for numpy
some time ago, available here:

https://github.com/astropy/astropy/blob/v0.4.x/astropy/table/np_utils.py

These are part of the astropy project, but this module has no actual astropy
dependencies apart from a local backport of OrderedDict for Python < 2.7.

Cheers,
Tom

> see pastebin here:
> http://pastebin.com/c5WLWPbp
>
> I've posted about it on this list before, but without apparent interest;
> and I haven't gotten around to getting this up to professional standards yet
> either. But there is a lot more that could be done in this direction.
>
> Note that the count functionality in the stackoverflow answer is
> relatively indirect and inefficient, using the inverse_index and such. A
> much more efficient method is obtained by the code used here.
>
>
> On Tue, Aug 12, 2014 at 5:57 PM, Warren Weckesser <warren.weckes...@gmail.com> wrote:
>
>> On Tue, Aug 12, 2014 at 11:35 AM, Warren Weckesser <warren.weckes...@gmail.com> wrote:
>>
>>> I created a pull request (https://github.com/numpy/numpy/pull/4958)
>>> that defines the function `count_unique`. `count_unique` generates a
>>> contingency table from a collection of sequences. For example,
>>>
>>>     In [7]: x = [1, 1, 1, 1, 2, 2, 2, 2, 2]
>>>
>>>     In [8]: y = [3, 4, 3, 3, 3, 4, 5, 5, 5]
>>>
>>>     In [9]: (xvals, yvals), counts = count_unique(x, y)
>>>
>>>     In [10]: xvals
>>>     Out[10]: array([1, 2])
>>>
>>>     In [11]: yvals
>>>     Out[11]: array([3, 4, 5])
>>>
>>>     In [12]: counts
>>>     Out[12]:
>>>     array([[3, 1, 0],
>>>            [1, 1, 3]])
>>>
>>>
>>> It can be interpreted as a multi-argument generalization of
>>> `np.unique(x, return_counts=True)`.
>>>
>>> It overlaps with Pandas' `crosstab`, but I think this is a pretty
>>> fundamental counting operation that fits in numpy.
>>>
>>> Matlab's `crosstab` (http://www.mathworks.com/help/stats/crosstab.html)
>>> and R's `table` perform the same calculation (with a few more bells and
>>> whistles).
>>>
>>>
>>> For comparison, here's Pandas' `crosstab` (same `x` and `y` as above):
>>>
>>>     In [28]: import pandas as pd
>>>
>>>     In [29]: xs = pd.Series(x)
>>>
>>>     In [30]: ys = pd.Series(y)
>>>
>>>     In [31]: pd.crosstab(xs, ys)
>>>     Out[31]:
>>>     col_0  3  4  5
>>>     row_0
>>>     1      3  1  0
>>>     2      1  1  3
>>>
>>>
>>> And here is R's `table`:
>>>
>>>     > x <- c(1,1,1,1,2,2,2,2,2)
>>>     > y <- c(3,4,3,3,3,4,5,5,5)
>>>     > table(x, y)
>>>        y
>>>     x   3 4 5
>>>       1 3 1 0
>>>       2 1 1 3
>>>
>>>
>>> Is there any interest in adding this (or some variation of it) to numpy?
>>>
>>>
>>> Warren
>>>
>>
>> While searching StackOverflow in the numpy tag for "count unique", I just
>> discovered that I basically reinvented Eelco Hoogendoorn's code in his
>> answer to
>> http://stackoverflow.com/questions/10741346/numpy-frequency-counts-for-unique-values-in-an-array.
>> Nice one, Eelco!
>>
>> Warren
>>
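
For anyone who wants to try the counting step from the quoted example without applying the pull request, here is a minimal sketch of one way to build the same contingency table with existing numpy functions. It is not the code from the pull request or from the pastebin; the name `count_unique_sketch` is made up for illustration, and it assumes equal-length 1-D inputs whose values `np.unique` can sort.

```python
import numpy as np


def count_unique_sketch(*sequences):
    """Hypothetical helper: contingency table for equal-length 1-D sequences.

    Returns a tuple with the sorted unique values of each sequence and an
    integer array whose [i, j, ...] entry counts how often that combination
    of unique values occurs at the same position in the inputs.
    """
    uniques = []
    inverses = []
    for seq in sequences:
        # `inv` maps every element to the index of its value in `vals`.
        vals, inv = np.unique(np.asarray(seq), return_inverse=True)
        uniques.append(vals)
        inverses.append(inv.ravel())

    shape = tuple(len(vals) for vals in uniques)
    # Collapse each tuple of per-sequence indices into one flat cell index,
    # count the cells, then restore the table shape.
    flat = np.ravel_multi_index(tuple(inverses), shape)
    counts = np.bincount(flat, minlength=int(np.prod(shape))).reshape(shape)
    return tuple(uniques), counts


x = [1, 1, 1, 1, 2, 2, 2, 2, 2]
y = [3, 4, 3, 3, 3, 4, 5, 5, 5]
(xvals, yvals), counts = count_unique_sketch(x, y)
print(xvals)   # [1 2]
print(yvals)   # [3 4 5]
print(counts)  # [[3 1 0]
               #  [1 1 3]]
```

This goes through the inverse index, so it is the indirect route mentioned in the quoted text rather than the more efficient method referred to there; it is only meant to make the definition of the table concrete.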
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion