Re: [Numpy-discussion] EHN: Discusions about 'add numpy.topk'

2021-05-30 Thread Ilhan Polat
After a coffee, I don't see the point of still calling it "k", so "max_n" is
my vote, for what it's worth.

On Sun, May 30, 2021 at 8:38 AM Ilhan Polat  wrote:

> Since this is going into the top namespace, I'd also vote against the
> MATLAB-y "topk" name. And even MATLAB didn't do what I would expect and
> went with maxk
>
> https://nl.mathworks.com/help/matlab/ref/maxk.html
>
> I think "max_k" is a good generalization of the regular "max". Even when
> auto-completing, having this show up under "max" makes more sense to me
> than searching for it among the "t"s. Besides, "argmax_k" follows suit,
> in line with the existing convention. To my eyes this is an acceptable
> disturbance to an already very crowded namespace.
>
>
>
> a few moments later
>
> But then again an ugly idea rears its head, proposing that this go into
> the existing max function. But I'll shut up now :)
>
>
>
>
>
>
>
> On Sun, May 30, 2021 at 12:50 AM Robert Kern 
> wrote:
>
>> On Sat, May 29, 2021 at 3:35 PM Daniele Nicolodi 
>> wrote:
>>
>>> What does k stand for here? As someone that never encountered this
>>> function before I find both names equally confusing. If I understand
>>> what the function is supposed to be doing, I think largest() would be
>>> much more descriptive.
>>>
>>
>> `k` is the number of elements to return. `largest()` can connote that
>> it's only returning the one largest value. It's fairly typical to include a
>> dummy variable (`k` or `n`) in the name to indicate that the function lets
>> you specify how many you want. See, for example, the stdlib `heapq`
>> module's `nlargest()` function.
>>
>> https://docs.python.org/3/library/heapq.html#heapq.nlargest
>>
>> "top-k" comes from the ML community where this function is used to
>> evaluate classification models (`k` instead of `n` being largely an
>> accident of history, I imagine). In many classification problems, the
>> number of classes is very large, and they are very related to each other.
>> For example, ImageNet has a lot of different dog breeds broken out as
>> separate classes. In order to get a more balanced view of the relative
>> performance of the classification models, you often want to check whether
>> the correct class is in the top 5 classes (or whatever `k` is appropriate)
>> that the model predicted for the example, not just the one class that the
>> model says is the most likely. "5 largest" doesn't really work in the
>> sentences that one usually writes when talking about ML classifiers; they
>> are talking about the 5 classes that are associated with the 5 largest
>> values from the predictor, not the values themselves. So "top k" is what
>> gets used in ML discussions, and that transfers over to the name of the
>> function in ML libraries.
>>
>> It is a top-down reflection of the higher level thing that people want to
>> compute (in that context) rather than a bottom-up description of how the
>> function is manipulating the input, if that makes sense. Either one is a
>> valid way to name things. There is a lot to be said for numpy's
>> domain-agnostic nature that we should prefer the bottom-up description
>> style of naming. However, we are also in the midst of a diversifying
>> ecosystem of array libraries, largely driven by the ML domain, and adopting
>> some of that terminology when we try to enhance our interoperability with
>> those libraries is also a factor to be considered.
>>
>> --
>> Robert Kern
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>
>


Re: [Numpy-discussion] EHN: Discusions about 'add numpy.topk'

2021-05-30 Thread Matti Picus


On 29/5/21 5:28 pm, Ralf Gommers wrote:

> On Fri, May 28, 2021 at 4:58 PM > wrote:
>
>> Hi all,
>>
>> Finding topk elements is widely used in several fields, but missed
>> in NumPy.
>> I implement this functionality named as  numpy.topk using core numpy
>> functions and open a PR:
>>
>> https://github.com/numpy/numpy/pull/19117
>>
>> Any discussion are welcome.
>
> Thanks for the proposal Kang. I think this functionality is indeed a
> fairly obvious gap in what Numpy offers, and would make sense to add.
> A detailed comparison with other libraries would be very helpful here.
> TensorFlow and JAX call this function `top_k`, while PyTorch, Dask and
> MXNet call it `topk`.
>
> Two things to look at in more detail here are:
> 1. complete signatures of the function in each of those libraries, and
> what the commonality is there.
> 2. the argument Eric made on your PR about consistency with
> sort/argsort, and if we want topk/argtopk? Also, do other libraries
> have `argtopk`?
>
> Cheers,
> Ralf
>
>> Best wishes,
>>
>> Kang Kai

Did this function come up at all in the array-API consortium discussions?

Matti



Re: [Numpy-discussion] EHN: Discusions about 'add numpy.topk'

2021-05-30 Thread Daniele Nicolodi
On 30/05/2021 00:48, Robert Kern wrote:
> On Sat, May 29, 2021 at 3:35 PM Daniele Nicolodi wrote:
>
>> What does k stand for here? As someone that never encountered this
>> function before I find both names equally confusing. If I understand
>> what the function is supposed to be doing, I think largest() would be
>> much more descriptive.
>
> `k` is the number of elements to return. `largest()` can connote that
> it's only returning the one largest value. It's fairly typical to
> include a dummy variable (`k` or `n`) in the name to indicate that the
> function lets you specify how many you want. See, for example, the
> stdlib `heapq` module's `nlargest()` function.

I thought that a `largest()` function with an integer second argument
would be self-explanatory enough. `nlargest()` would be much more
obvious to the wider audience, I think.
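
For reference, here is a minimal sketch of the stdlib `heapq.nlargest()` function mentioned above, including its optional `key` argument. The sample data is made up purely for illustration.

```python
import heapq

scores = [0.1, 0.7, 0.2, 0.9, 0.4]

# The 3 largest values, returned in descending order.
top3 = heapq.nlargest(3, scores)

# With a key function: the 2 longest words (all lengths are distinct here).
words = ["pear", "fig", "banana", "kiwis"]
longest2 = heapq.nlargest(2, words, key=len)

print(top3)
print(longest2)
```

Note that `heapq.nlargest()` takes the count as its first argument and works on any iterable, not specifically on arrays or along an axis.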

> https://docs.python.org/3/library/heapq.html#heapq.nlargest
> 
> 
> "top-k" comes from the ML community where this function is used to
> evaluate classification models (`k` instead of `n` being largely an
> accident of history, I imagine). In many classification problems, the
> number of classes is very large, and they are very related to each
> other. For example, ImageNet has a lot of different dog breeds broken
> out as separate classes. In order to get a more balanced view of the
> relative performance of the classification models, you often want to
> check whether the correct class is in the top 5 classes (or whatever `k`
> is appropriate) that the model predicted for the example, not just the
> one class that the model says is the most likely. "5 largest" doesn't
> really work in the sentences that one usually writes when talking about
> ML classifiers; they are talking about the 5 classes that are associated
> with the 5 largest values from the predictor, not the values themselves.
> So "top k" is what gets used in ML discussions, and that transfers over
> to the name of the function in ML libraries.
> 
> It is a top-down reflection of the higher level thing that people want
> to compute (in that context) rather than a bottom-up description of how
> the function is manipulating the input, if that makes sense. Either one
> is a valid way to name things. There is a lot to be said for numpy's
> domain-agnostic nature that we should prefer the bottom-up description
> style of naming. However, we are also in the midst of a diversifying
> ecosystem of array libraries, largely driven by the ML domain, and
> adopting some of that terminology when we try to enhance our
> interoperability with those libraries is also a factor to be considered.
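
The "top-k" evaluation described in the quoted text could be sketched roughly as follows. This is only an illustration using existing NumPy calls (`np.argpartition`); the function name `top_k_accuracy`, its signature, and the sample scores are assumptions made up for this example and are not part of the proposal.

```python
import numpy as np

def top_k_accuracy(scores, labels, k=5):
    """Fraction of rows whose true label is among the k highest-scoring classes.

    `scores` has shape (n_samples, n_classes); `labels` has shape (n_samples,).
    Hypothetical helper, for illustration only.
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # argpartition puts the indices of the k largest scores into the last k
    # columns, in no particular order, which is all that is needed here.
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return hits.mean()

scores = np.array([[0.1, 0.5, 0.4],
                   [0.7, 0.2, 0.1],
                   [0.2, 0.3, 0.5]])
labels = np.array([2, 1, 0])

acc1 = top_k_accuracy(scores, labels, k=1)  # only the single best class counts
acc2 = top_k_accuracy(scores, labels, k=2)  # the two best classes count
```

Here it is the class indices, not the score values, that matter, which is the distinction the quoted explanation draws between "top k" and "k largest".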

I think that such a simple function should be named in the most obvious
way possible, or it will become a function that is used in the domains
where the unusual name makes sense but ends up being re-implemented in
all other contexts. I am sure that if I had been looking for a function
that returns the N largest items in an array (whether ranked according
to a given key function or otherwise), I would never have looked at a
function named `topk()` or `top_k()`, and I am pretty sure I would have
discarded anything that has `k` or `top` in its name.

On the other hand, I understand that ML is where all the hype (and a
large fraction of the money) is these days, so I understand if numpy
wants to appease the crowd.

Cheers,
Dan


Re: [Numpy-discussion] EHN: Discusions about 'add numpy.topk'

2021-05-30 Thread Alan G. Isaac

Is there any thought of allowing for other comparisons?
In which case `last_k` might be preferable.
Alan Isaac

On 5/30/2021 2:38 AM, Ilhan Polat wrote:


> I think "max_k" is a good generalization of the regular "max".



Re: [Numpy-discussion] EHN: Discusions about 'add numpy.topk'

2021-05-30 Thread Alan G. Isaac

Mathematica and Julia both seem relevant here.
Mma has TakeLargest (and Wolfram tends to think hard about names).
https://reference.wolfram.com/language/ref/TakeLargest.html
Julia's closest comparable is perhaps partialsortperm:
https://docs.julialang.org/en/v1/base/sort/#Base.Sort.partialsortperm
Alan Isaac



On 5/30/2021 4:40 AM, kang...@mail.ustc.edu.cn wrote:

> Hi, Thanks for reply, I present some details below:



Re: [Numpy-discussion] EHN: Discusions about 'add numpy.topk'

2021-05-30 Thread Neal Becker
Topk is a bad choice imo. I initially parsed it as to_pk, and had no idea
what that was, although it sounded a lot like a scipy signal function.
Nlargest would be very obvious.

On Sun, May 30, 2021, 7:50 AM Alan G. Isaac  wrote:

> Mathematica and Julia both seem relevant here.
> Mma has TakeLargest (and Wolfram tends to think hard about names).
> https://reference.wolfram.com/language/ref/TakeLargest.html
> Julia's closest comparable is perhaps partialsortperm:
> https://docs.julialang.org/en/v1/base/sort/#Base.Sort.partialsortperm
> Alan Isaac
>
>
>
> On 5/30/2021 4:40 AM, kang...@mail.ustc.edu.cn wrote:
> > Hi, Thanks for reply, I present some details below:


Re: [Numpy-discussion] EHN: Discusions about 'add numpy.topk'

2021-05-30 Thread kangkai
>
>
> On Fri, May 28, 2021 at 4:58 PM  > wrote:
>
> Hi all,
>
> Finding topk elements is widely used in several fields, but missed
> in NumPy.
> I implement this functionality named as  numpy.topk using core numpy
> functions and open a PR:
>
> https://github.com/numpy/numpy/pull/19117
> 
>
> Any discussion are welcome.
>
>
> Thanks for the proposal Kang. I think this functionality is indeed a 
> fairly obvious gap in what Numpy offers, and would make sense to add. 
> A detailed comparison with other libraries would be very helpful here. 
> TensorFlow and JAX call this function `top_k`, while PyTorch, Dask and 
> MXNet call it `topk`.
>
> Two things to look at in more detail here are:
> 1. complete signatures of the function in each of those libraries, and 
> what the commonality is there.
> 2. the argument Eric made on your PR about consistency with 
> sort/argsort, and if we want topk/argtopk? Also, do other libraries 
> have `argtopk`?
>
> Cheers,
> Ralf
>
>
> Best wishes,
>
> Kang Kai
>


Hi, thanks for the reply. I present some details below:


## 1. complete signatures of the function in each of those libraries, and what the commonality is there.


| Library     | Name               | arg1  | arg2 | arg3 | arg4      | arg5   |
|-------------|--------------------|-------|------|------|-----------|--------|
| NumPy [1]   | numpy.topk         | a     | k    | axis | largest   | sorted |
| PyTorch [2] | torch.topk         | input | k    | dim  | largest   | sorted |
| R [3]       | topK               | x     | K    | /    | /         | /      |
| MXNet [4]   | mxnet.npx.topk     | data  | k    | axis | is_ascend | /      |
| CNTK [5]    | cntk.ops.top_k     | x     | k    | axis | /         | /      |
| TF [6]      | tf.math.top_k      | input | k    | /    | /         | sorted |
| Dask [7]    | dask.array.topk    | a     | k    | axis | -k        | /      |
| Dask [8]    | dask.array.argtopk | a     | k    | axis | -k        | /      |
| MATLAB [9]  | mink               | A     | k    | dim  | /         | /      |
| MATLAB [10] | maxk               | A     | k    | dim  | /         | /      |



| Library     | Name               | Returns               |
|-------------|--------------------|-----------------------|
| NumPy [1]   | numpy.topk         | values, indices       |
| PyTorch [2] | torch.topk         | values, indices       |
| R [3]       | topK               | indices               |
| MXNet [4]   | mxnet.npx.topk     | controlled by ret_typ |
| CNTK [5]    | cntk.ops.top_k     | values, indices       |
| TF [6]      | tf.math.top_k      | values, indices       |
| Dask [7]    | dask.array.topk    | values                |
| Dask [8]    | dask.array.argtopk | indices               |
| MATLAB [9]  | mink               | values, indices       |
| MATLAB [10] | maxk               | values, indices       |


- arg1: Input array.
- arg2: Number of top elements to look for along the given axis.
- arg3: Axis along which to find topk.
  - R only supports vectors; TensorFlow only supports axis=-1.
- arg4: Controls whether to return the k largest or the k smallest elements.
  - R, CNTK and TensorFlow only return the k largest elements.
  - In Dask, k can be negative, which means returning the k smallest elements.
  - MATLAB uses two distinct functions.
- arg5: If true, the resulting k elements will be sorted by value.
  - R, MXNet, CNTK, Dask and MATLAB only return sorted elements.

**Summary**:
- Function Name: could be `topk`, `top_k`, `mink`/`maxk`.
- arg1 (a), arg2 (k), arg3 (axis): should be required.
- arg4 (largest), arg5 (sorted): might be discussed.
- Returns: discussed below.
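
To make the five arguments in the tables concrete, here is a rough sketch built only on existing NumPy functions (`np.argpartition`, `np.take`, `np.take_along_axis`, `np.argsort`, `np.flip`). The name `topk_sketch` is hypothetical, and this is not the implementation from the PR; it only illustrates one way the `(a, k, axis, largest, sorted)` signature could behave, assuming it returns `(values, indices)`.

```python
import numpy as np

def topk_sketch(a, k, axis=-1, largest=True, sorted=True):
    """Hypothetical illustration; returns (values, indices) along `axis`.

    The `sorted` parameter mirrors the name in the comparison table and
    shadows the builtin inside this function only.
    """
    a = np.asarray(a)
    if largest:
        # Indices of the k largest entries end up in the last k slots.
        idx = np.argpartition(a, -k, axis=axis)
        idx = np.take(idx, np.arange(-k, 0), axis=axis)
    else:
        # Indices of the k smallest entries end up in the first k slots.
        idx = np.argpartition(a, k - 1, axis=axis)
        idx = np.take(idx, np.arange(k), axis=axis)
    vals = np.take_along_axis(a, idx, axis=axis)
    if sorted:
        # Order the k results: descending for largest, ascending otherwise.
        order = np.argsort(vals, axis=axis)
        if largest:
            order = np.flip(order, axis=axis)
        idx = np.take_along_axis(idx, order, axis=axis)
        vals = np.take_along_axis(vals, order, axis=axis)
    return vals, idx

a = np.array([[1, 5, 3, 4],
              [9, 2, 8, 7]])
vals, idx = topk_sketch(a, 2)                             # two largest per row
small_vals, small_idx = topk_sketch(a, 2, largest=False)  # two smallest per row
```

With `sorted=False` the partition step alone is used, so the k results come back in no particular order.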


## 2. the argument Eric made on your PR about consistency with sort/argsort, if we want topk/argtopk? Also, do other libraries have `argtopk`?


In most libraries, `topk` or `top_k` returns both values and indices, and 
`argtopk` is not included except for Dask. In addition, there is another 
inconsistency: `sort` returns ascending values, but `topk` returns 
descending values.


## Suggestions
Finally, IMHO, the new function signature might be designed as one of:
I) use `topk` / `argtopk` or `top_k` / `argtop_k`
```python
def topk(a, k, axis=-1, sorted=True) -> topk_values
def argtopk(a, k, axis=-1, sorted=True) -> topk_indices
```
or
```python
def top_k(a, k, axis=-1, sorted=True) -> topk_values
def argtop_k(a, k, axis=-1, sorted=True) -> topk_indices
```
where `k` can be negative, which means returning the k smallest elements.
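
A minimal sketch of the signed-`k` convention, assuming a values-only return. The name `topk_signed` is made up for this example, it uses a full `np.sort` for simplicity rather than a partial sort, and the choice to return the smallest elements in ascending order is an assumption, not something settled in this thread.

```python
import numpy as np

def topk_signed(a, k, axis=-1):
    """Illustrative only: k > 0 gives the k largest, k < 0 the |k| smallest."""
    a = np.asarray(a)
    if k == 0:
        raise ValueError("k must be non-zero")
    ordered = np.sort(a, axis=axis)  # ascending along the axis
    n = abs(k)
    if k > 0:
        # Take the last n entries and flip them so the largest comes first.
        tail = np.take(ordered, np.arange(-n, 0), axis=axis)
        return np.flip(tail, axis=axis)
    # Take the first n entries, smallest first (an assumed ordering).
    return np.take(ordered, np.arange(n), axis=axis)

a = np.array([4, 1, 3, 2, 5])
largest_two = topk_signed(a, 2)
smallest_two = topk_signed(a, -2)
```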


II) use `maxk` / `argmaxk` or `max_k` / `argmax_k` (`mink` / `argmink` or 
`min_k` / `argmin_k`)
```python
def maxk(a, k, axis=-1, sorted=True) -> values
def argmaxk(a, k, axis=-1, sorted=True) -> indices


def mink(a, k, axis=-1, sorted=True) -> values
def argmink(a, k, axis=-1, sorted=True) -> indices
```
or
```python
def max_k(a, k, axis=-1, sorted=True) -> values
def argmax_k(a, k, axis=-1, so
```

Re: [Numpy-discussion] EHN: Discusions about 'add numpy.topk'

2021-05-30 Thread Benjamin Root
To be honest, I read "topk" as "topeka", but I am weird. While numpy
doesn't use underscores all that much, I think this is one case where it
makes sense.

I'd also watch out for the use of the term "sorted", as it may mean
different things to different people, particularly with regard to what its
default value should be. I also find myself initially confused by the names
"largest" and "sorted", especially what they should mean with the "min-k"
behavior. I think Dask's use of negative k is very pythonic and would help
keep the namespace clean by avoiding the extra "min_k".

As for the indices, I am of two minds. On the one hand, I don't like
polluting the namespace with extra functions. On the other hand, having a
function that behaves differently based on a parameter is just fugly,
although we do have a function that does this - np.unique().
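
For readers unfamiliar with the `np.unique()` behaviour referred to here, a short sketch: the return type changes from a single array to a tuple depending on a keyword argument. The sample array is made up for illustration.

```python
import numpy as np

a = np.array([3, 1, 3, 2, 1])

# Default call: a single array of the sorted unique values.
values = np.unique(a)

# With return_index=True: a tuple (unique values, index of first occurrence).
result = np.unique(a, return_index=True)
unique_vals, first_idx = result

print(type(values).__name__, type(result).__name__)
```

A `topk` that returns either values, indices, or both depending on a flag would follow the same pattern.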

Ben Root

On Sun, May 30, 2021 at 8:22 AM Neal Becker  wrote:

> Topk is a bad choice imo.  I initially parsed it as to_pk, and had no idea
> what that was, although sounded a lot like a scipy signal function.
> Nlargest would be very obvious.
>
> On Sun, May 30, 2021, 7:50 AM Alan G. Isaac  wrote:
>
>> Mathematica and Julia both seem relevant here.
>> Mma has TakeLargest (and Wolfram tends to think hard about names).
>> https://reference.wolfram.com/language/ref/TakeLargest.html
>> Julia's closest comparable is perhaps partialsortperm:
>> https://docs.julialang.org/en/v1/base/sort/#Base.Sort.partialsortperm
>> Alan Isaac
>>
>>
>>
>> On 5/30/2021 4:40 AM, kang...@mail.ustc.edu.cn wrote:
>> > Hi, Thanks for reply, I present some details below: