Re: [Numpy-discussion] New function `count_unique` to generate contingency tables.

Aldcroft, Thomas Sun, 25 Jan 2015 11:33:07 -0800

On Tue, Aug 12, 2014 at 12:17 PM, Eelco Hoogendoorn <
hoogendoorn.ee...@gmail.com> wrote:


> Thanks. Prompted by that stackoverflow question, and similar problems I
> had to deal with myself, I started working on a much more general extension
> to numpy's functionality in this space. Like you noted, things get a little
> panda-y, but I think there is a lot of pandas functionality that could or
> should be part of the numpy core, a robust set of grouping operations in
> particular.
>

FYI I wrote some table grouping operations (join, hstack, vstack) for numpy
some time ago, available here:

  https://github.com/astropy/astropy/blob/v0.4.x/astropy/table/np_utils.py

These are part of the astropy project but this module has no actual astropy
dependencies apart from a local backport of OrderedDict for Python < 2.7.

Cheers,
Tom




> See the pastebin here:
> http://pastebin.com/c5WLWPbp
>
> I've posted about it on this list before, but without apparent interest;
> and I haven't gotten around to getting this up to professional standards yet
> either. But there is a lot more that could be done in this direction.
>
> Note that the count functionality in the stackoverflow answer is
> relatively indirect and inefficient, using the inverse_index and such. A
> much more efficient method is obtained by the code used here.
>
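For readers without the Stack Overflow answer to hand, counting through the
inverse index looks roughly like the sketch below. This is not the pastebin
code; the sample data and variable names are illustrative assumptions. It only
contrasts the indirect route with `np.unique(..., return_counts=True)`.

```python
import numpy as np

x = np.array([3, 4, 3, 3, 3, 4, 5, 5, 5])

# Indirect route: get each element's index into the sorted unique values,
# then tally how often each index occurs.
vals, inverse = np.unique(x, return_inverse=True)
counts_via_inverse = np.bincount(inverse, minlength=len(vals))

# Direct route: let np.unique return the counts itself.
vals_direct, counts_direct = np.unique(x, return_counts=True)

print(vals, counts_via_inverse, counts_direct)
```

Both routes give the same counts; the indirect one builds an extra array the
same length as the input.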
>
> On Tue, Aug 12, 2014 at 5:57 PM, Warren Weckesser <
> warren.weckes...@gmail.com> wrote:
>
>>
>>
>>
>> On Tue, Aug 12, 2014 at 11:35 AM, Warren Weckesser <
>> warren.weckes...@gmail.com> wrote:
>>
>>> I created a pull request (https://github.com/numpy/numpy/pull/4958)
>>> that defines the function `count_unique`.  `count_unique` generates a
>>> contingency table from a collection of sequences.  For example,
>>>
>>> In [7]: x = [1, 1, 1, 1, 2, 2, 2, 2, 2]
>>>
>>> In [8]: y = [3, 4, 3, 3, 3, 4, 5, 5, 5]
>>>
>>> In [9]: (xvals, yvals), counts = count_unique(x, y)
>>>
>>> In [10]: xvals
>>> Out[10]: array([1, 2])
>>>
>>> In [11]: yvals
>>> Out[11]: array([3, 4, 5])
>>>
>>> In [12]: counts
>>> Out[12]:
>>> array([[3, 1, 0],
>>>        [1, 1, 3]])
>>>
>>>
>>> It can be interpreted as a multi-argument generalization of
>>> `np.unique(x, return_counts=True)`.
>>>
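The example above can be reproduced with existing NumPy calls. The following
is a minimal sketch, not the implementation in the pull request: the name
`count_unique_sketch` is hypothetical, and it assumes all input sequences have
the same length.

```python
import numpy as np

def count_unique_sketch(*seqs):
    """Hypothetical helper: contingency table for equal-length sequences."""
    uniques, inverses = [], []
    for seq in seqs:
        # Sorted unique values plus, for each element, its index into them.
        vals, inv = np.unique(np.asarray(seq), return_inverse=True)
        uniques.append(vals)
        inverses.append(inv.ravel())
    shape = tuple(len(v) for v in uniques)
    # Collapse each tuple of per-sequence indices into one flat cell index.
    flat = np.ravel_multi_index(tuple(inverses), shape)
    # Tally the flat indices and fold the result back into table shape.
    counts = np.bincount(flat, minlength=int(np.prod(shape))).reshape(shape)
    return tuple(uniques), counts

x = [1, 1, 1, 1, 2, 2, 2, 2, 2]
y = [3, 4, 3, 3, 3, 4, 5, 5, 5]
(xvals, yvals), counts = count_unique_sketch(x, y)
print(xvals, yvals)
print(counts)
```

With a single argument this reduces to the counts from
`np.unique(x, return_counts=True)`, which is the sense in which the table is a
multi-argument generalization.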
>>> It overlaps with Pandas' `crosstab`, but I think this is a pretty
>>> fundamental counting operation that fits in numpy.
>>>
>>> Matlab's `crosstab` (http://www.mathworks.com/help/stats/crosstab.html)
>>> and R's `table` perform the same calculation (with a few more bells and
>>> whistles).
>>>
>>>
>>> For comparison, here's Pandas' `crosstab` (same `x` and `y` as above):
>>>
>>> In [28]: import pandas as pd
>>>
>>> In [29]: xs = pd.Series(x)
>>>
>>> In [30]: ys = pd.Series(y)
>>>
>>> In [31]: pd.crosstab(xs, ys)
>>> Out[31]:
>>> col_0  3  4  5
>>> row_0
>>> 1      3  1  0
>>> 2      1  1  3
>>>
>>>
>>> And here is R's `table`:
>>>
>>> > x <- c(1,1,1,1,2,2,2,2,2)
>>> > y <- c(3,4,3,3,3,4,5,5,5)
>>> > table(x, y)
>>>    y
>>> x   3 4 5
>>>   1 3 1 0
>>>   2 1 1 3
>>>
>>>
>>> Is there any interest in adding this (or some variation of it) to numpy?
>>>
>>>
>>> Warren
>>>
>>>
>>
>> While searching StackOverflow in the numpy tag for "count unique", I just
>> discovered that I basically reinvented Eelco Hoogendoorn's code in his
>> answer to
>> http://stackoverflow.com/questions/10741346/numpy-frequency-counts-for-unique-values-in-an-array.
>> Nice one, Eelco!
>>
>> Warren
>>
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>>
>
