I have just submitted a PR (https://github.com/numpy/numpy/pull/4330) adding an axis argument to bincount. It lets you do things that would have been hard before, but the UI when broadcasting arrays together and having an axis argument can get tricky, and there is no obvious example already in place to follow, so I'd like to get some feedback on my choices.
*With no weights*

When not using the 'weights' parameter, the counting is done over the axes passed in 'axis'. This defaults to all axes, i.e. the flattened array. The output will have the shape of the original array, with those axes removed, and an extra dimension of size 'n' added at the end, where 'n' is the larger of 'minlength' and the maximum value in the array plus one.

This is pretty straightforward. I think the only design choices that warrant some discussion are:

1. Should the default value for 'axis' be all axes, just the last, or just the first?

2. Having the extra dimension added at the end. It may seem more natural to have the new dimension replace a dimension that has been removed. But because 'axis' can hold multiple axes, this would require some guessing (the first? the last? first or last based on position in the array, or on position in the 'axis' tuple?), which is avoided by having a fixed position. The other option would be to add it at the beginning of the shape rather than at the end. For counting I think the last dimension is the right choice, but...

As an example of how it works:

>>> a = np.random.randint(5, size=(3, 400, 500))
>>> np.bincount(a, axis=(-1, -2))
array([[39763, 40086, 39832, 39970, 40349],
       [40006, 39892, 40226, 39938, 39938],
       [39990, 40082, 40184, 39818, 39926]])

So there were 40184 occurrences of 2 in a[2, :, :].

*With weights*

This can get complicated, but the rules are simple: the two arrays are broadcast together, and the axes removed refer to the axes in the input array before broadcasting.
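Going back to the no-weights case for a moment: for anyone who wants to experiment with those semantics without building the PR branch, the following is a minimal sketch that emulates them on top of the existing 1-D np.bincount. The helper name 'bincount_over_axes' is hypothetical and is not how the PR implements it; the sketch assumes a non-empty array of non-negative integers and no repeated entries in 'axis'.

```python
import numpy as np


def bincount_over_axes(a, axis=None, minlength=0):
    """Emulate the proposed no-weights np.bincount(a, axis=...) semantics.

    Hypothetical helper for illustration only; not the PR's implementation.
    Assumes `a` is a non-empty array of non-negative integers.
    """
    a = np.asarray(a)
    if axis is None:
        axis = tuple(range(a.ndim))      # default: count over all axes
    elif isinstance(axis, int):
        axis = (axis,)
    axis = tuple(ax % a.ndim for ax in axis)  # normalize negative axes

    # Size of the new trailing dimension.
    n = max(minlength, int(a.max()) + 1)

    # Move the kept axes to the front and flatten the counted axes into one.
    kept = [ax for ax in range(a.ndim) if ax not in axis]
    moved = a.transpose(kept + list(axis))
    kept_shape = moved.shape[:len(kept)]
    rows = int(np.prod(kept_shape))      # 1 when every axis is counted
    flat = moved.reshape(rows, -1)

    # Shift each row into its own block of n bins, then count once in 1-D.
    offsets = n * np.arange(rows)[:, None]
    counts = np.bincount((flat + offsets).ravel(), minlength=rows * n)
    return counts.reshape(kept_shape + (n,))


rng = np.random.default_rng(0)
a = rng.integers(5, size=(3, 40, 50))
counts = bincount_over_axes(a, axis=(-1, -2), minlength=5)
print(counts.shape)  # (3, 5)
```

Here counts[i, k] is the number of occurrences of k in a[i, :, :], matching the "removed axes, extra dimension at the end" layout described above.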
The weighted case is probably best illustrated with an example:

>>> w = np.random.rand(100, 3, 2)
>>> a = np.random.randint(4, size=(100,))
>>> np.bincount(a, w.T).T
array([[[  8.29654919,   9.65794721],
        [ 12.01620609,  10.06676672],
        [ 11.73217521,  10.42220345]],

       [[ 10.67034693,  11.7945728 ],
        [ 13.47044072,  11.45176676],
        [ 10.83104283,  12.00869285]],

       [[ 14.30506753,   8.18840995],
        [ 13.44466573,  13.18924624],
        [ 11.95200531,  12.92169698]],

       [[ 16.78580192,  16.96104034],
        [ 12.80863984,  15.04778831],
        [ 16.35114845,  14.63648771]]])

Here 'w' has shape '(100, 3, 2)', interpreted as a list of 100 arrays of shape '(3, 2)'. We want to add the arrays together into several groups, as indicated by another array 'a' of shape '(100,)', which is what is achieved above. Other options to consider are:

>>> np.bincount(a[:, None, None], w).shape
(4,)        <-- WRONG: the axes of size 1 were not added by broadcasting, so they get removed
>>> np.bincount(a[:, None, None], w, axis=0).shape
(3, 2, 4)   <-- RIGHT, but this doesn't seem to be the ordering of the dimensions one would want.

It seems to me that what anyone trying to do this would like to get back is an array of shape '(4, 3, 2)', so I think the construct bincount(x, w.T).T will be used often enough that it warrants some less convoluted way of getting that back. But unless someone can figure out a smart way of handling this, I'd rather wait to see how it gets used and modify it later than make up an uninformed UI which turns out to be useless.

The obvious questions for bincount with weights are:

1. Should axis refer to the axes **after** broadcasting? I don't think it makes sense to add over a dimension of size 1 in the input array, since you can get the same result by summing over that dimension in `weights` before calling bincount, but I am open to other opinions.

2. Any ideas on how to better handle multidimensional weights?

Thanks,

Jaime

--
(\__/)
( O.o)
( > <) This is Bunny. Copy Bunny into your signature and help him with his plans for world domination.