I have just submitted a PR (https://github.com/numpy/numpy/pull/4330)
adding an axis argument to bincount. It lets you do things that would have
been hard before, but the UI when broadcasting arrays together and having
an axis argument can get tricky, and there is no obvious example already in
place to follow, so I'd like to get some feedback on my choices.

*With no weights*

When not using the 'weights' parameter, the counting is done over the axes
passed in 'axis'. This defaults to all axes, i.e. the flattened array. The
output will have the shape of the original array, with those axes removed,
and an extra dimension of size 'n' added at the end, where 'n' is the
larger of 'minlength' and the maximum value in the array plus one. This is
pretty straightforward. I think the only design choices that warrant some
discussion are:

1. Should the default value for 'axis' be all axes, just the last, just the
first?
2. Having the extra dimension added at the end. It may seem more natural to
have the new dimension replace a dimension that has been removed. But
because 'axis' can hold multiple axes, this would require some guessing
(the first? the last? first or last based on position in the array, or on
position in the 'axis' tuple?), which is avoided by having a fixed
position. The other option would be at the beginning, not the end of the
shape. For counting I think the last dimension is the right choice, but...

As an example of how it works:

>>> a = np.random.randint(5, size=(3, 400, 500))
>>> np.bincount(a, axis=(-1, -2))
array([[39763, 40086, 39832, 39970, 40349],
       [40006, 39892, 40226, 39938, 39938],
       [39990, 40082, 40184, 39818, 39926]])

So there were 40184 occurrences of 2 in a[2, :, :].
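
For comparison, the same per-slice counts can be reproduced with the existing
one-dimensional bincount by offsetting each slice into its own range of bins.
This is only a sketch: 'bincount_trailing' and its 'keep' parameter are
hypothetical names, not part of NumPy or of the PR, and the helper assumes a
non-empty array of non-negative integers.

```python
import numpy as np

def bincount_trailing(a, keep, minlength=0):
    # Hypothetical helper (not NumPy API, not the PR): count over every axis
    # after the first `keep` axes of a non-empty, non-negative integer array.
    a = np.asarray(a)
    n = max(minlength, int(a.max()) + 1)
    lead = a.shape[:keep]
    rows = a.reshape(int(np.prod(lead)), -1)
    # Shift row i into bins [i * n, (i + 1) * n) so one flat call counts all rows.
    offsets = np.arange(rows.shape[0])[:, None] * n
    counts = np.bincount((rows + offsets).ravel(), minlength=rows.shape[0] * n)
    return counts.reshape(lead + (n,))

rng = np.random.default_rng(0)
a = rng.integers(5, size=(3, 400, 500))
counts = bincount_trailing(a, keep=1)
print(counts.shape)
```

Each row of 'counts' should match np.bincount(a[i].ravel(), minlength=5), which
is what the 'axis=(-1, -2)' call above is meant to return.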

*With weights*

This can get complicated, but the rules are simple: the two arrays are
broadcast together, and the axes removed refer to the axes in the input
array before broadcasting.

This is probably best illustrated with an example:

>>> w = np.random.rand(100, 3, 2)
>>> a = np.random.randint(4, size=(100,))
>>> np.bincount(a, w.T).T
array([[[  8.29654919,   9.65794721],
        [ 12.01620609,  10.06676672],
        [ 11.73217521,  10.42220345]],

       [[ 10.67034693,  11.7945728 ],
        [ 13.47044072,  11.45176676],
        [ 10.83104283,  12.00869285]],

       [[ 14.30506753,   8.18840995],
        [ 13.44466573,  13.18924624],
        [ 11.95200531,  12.92169698]],

       [[ 16.78580192,  16.96104034],
        [ 12.80863984,  15.04778831],
        [ 16.35114845,  14.63648771]]])

Here 'w' has shape '(100, 3, 2)', interpreted as a list of 100 arrays of
shape '(3, 2)'. We want to add together the arrays into several groups, as
indicated by another array 'a' of shape '(100,)', which is what is achieved
above. Other options to consider are:

>>> np.bincount(a[:, None, None], w).shape
(4,) <-- WRONG: the axes of size 1 were not added by broadcasting, so they
get removed
>>> np.bincount(a[:, None, None], w, axis=0).shape
(3, 2, 4) <-- RIGHT, but this doesn't seem to be the ordering of the
dimensions one would want.
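
The shapes in the calls above follow mechanically from the rule stated
earlier. The sketch below only computes the expected output shape, it does no
counting. 'expected_shape' is a hypothetical helper written from the
description in this message, not code taken from the PR, and it assumes
'axis' is None, an int, or a tuple of ints.

```python
import numpy as np

def expected_shape(a_shape, w_shape, axis, n):
    # Hypothetical helper: encodes the rule as described in this message.
    out = np.broadcast_shapes(a_shape, w_shape)
    shift = len(out) - len(a_shape)  # axes of the input align to the right
    if axis is None:
        axis = range(len(a_shape))   # default: every axis of the input
    elif isinstance(axis, int):
        axis = (axis,)
    # `axis` refers to the input array before broadcasting.
    removed = {ax % len(a_shape) + shift for ax in axis}
    kept = tuple(s for i, s in enumerate(out) if i not in removed)
    return kept + (n,)               # the bins always go last

print(expected_shape((100,), (2, 3, 100), None, 4))       # (2, 3, 4)
print(expected_shape((100, 1, 1), (100, 3, 2), None, 4))  # (4,)
print(expected_shape((100, 1, 1), (100, 3, 2), 0, 4))     # (3, 2, 4)
```

The first line corresponds to bincount(a, w.T) before the final transpose.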

It seems to me that what anyone trying to do this would like to get back is
an array of shape '(4, 3, 2)', so I think the construct bincount(x, w.T).T
will be used often enough that it warrants some less convoluted way of
getting that back. But unless someone can figure out a smart way of
handling this, I'd rather wait to see how it gets used, and modify it
later, rather than making up an uninformed UI which turns out to be useless.
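
As a point of reference for what the '(4, 3, 2)' result should contain, the
same grouped sum can be written today with np.add.at. This is a sketch with
existing NumPy only, not what the PR does internally, and the number of
groups is passed in by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.random((100, 3, 2))
a = rng.integers(4, size=100)

n_groups = 4  # known here from the upper bound used to draw `a`
grouped = np.zeros((n_groups,) + w.shape[1:])
# Unbuffered in-place add: repeated indices in `a` accumulate, so
# grouped[k] ends up as the sum of every w[i] with a[i] == k.
np.add.at(grouped, a, w)
print(grouped.shape)  # (4, 3, 2)
```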

The obvious questions for bincount with weights are:

1. Should axis refer to the axes **after** broadcasting? I don't think it
makes sense to add over a dimension of size 1 in the input array, since you
can get the same result by summing over that dimension in `weights` before
calling bincount, but I am open to other opinions.
2. Any ideas on how to better handle multidimensional weights?
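
The claim in question 1 can be checked with the existing bincount. In this
sketch the broadcasting is done by hand with np.broadcast_to, which stands in
for what counting over a size-1 axis of the input would mean.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(4, size=100)
w = rng.random((100, 3))

# Broadcast `a` against `w` by hand, then count over every element.
a_b = np.broadcast_to(a[:, None], w.shape)
via_broadcast = np.bincount(a_b.ravel(), weights=w.ravel(), minlength=4)

# Sum the weights over the axis where `a` has size 1, then count as usual.
via_presum = np.bincount(a, weights=w.sum(axis=1), minlength=4)

print(np.allclose(via_broadcast, via_presum))  # True
```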

Thanks,

Jaime

-- 
(\__/)
( O.o)
( > <) This is Conejo. Copy Conejo into your signature and help him with his
plans for world domination.