An update on this: after discussion in the community meeting, the preferred
direction is to add an `equal_nans` keyword and support both behaviors. Given
that we won't get to this, I proposed on
https://github.com/numpy/numpy/issues/20326 to make the default False, i.e.
the same as for 1.21.x.
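For concreteness, here is a rough sketch of what such a keyword could look like, written as a standalone function rather than as a patch to `np.unique`. The name `equal_nans`, its default, and the two behaviors are taken from the proposal above; nothing here is released NumPy API:

```python
import numpy as np

def unique_equal_nans(a, equal_nans=False):
    """Sketch of `np.unique` with the proposed `equal_nans` keyword.

    equal_nans=False: every NaN is distinct (pre-1.21 behavior).
    equal_nans=True:  all NaNs collapse into one (1.21.0 behavior).
    """
    a = np.sort(np.asarray(a, dtype=float))  # NaNs sort to the end
    if a.size == 0:
        return a
    # Keep each element that differs from its predecessor.  Under
    # IEEE-754, nan != nan, so this keeps every NaN.
    keep = np.concatenate(([True], a[1:] != a[:-1]))
    u = a[keep]
    n_nan = int(np.isnan(u).sum())
    if equal_nans and n_nan > 1:
        u = u[: u.size - n_nan + 1]  # collapse the trailing NaN run
    return u

a = np.array([1, 1, np.nan, np.nan, np.nan])
print(unique_equal_nans(a))                   # one 1. followed by three nan
print(unique_equal_nans(a, equal_nans=True))  # one 1. followed by one nan
```

Note that the `equal_nans=True` path is a cheap post-processing step over the multiple-NaN result, which is the "easy direction" argument made later in this thread.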
Cheers,
Ralf

On Mon, Nov 8, 2021 at 7:06 PM Sebastian Berg <sebast...@sipsolutions.net> wrote:

> On Fri, 2021-11-05 at 18:44 -0500, Juan Nunez-Iglesias wrote:
> > I agree with the argument to revert. Consistency among libraries should
> > be near the top of the list of priorities, as should performance.
> > Whether the new behaviour "makes more sense", meanwhile, is debatable.
>
> I don't have much of an opinion either way. But while correctness is
> debatable, is there any reason someone would prefer getting multiple
> (potentially very many) NaNs? I admit it feels more like a trap to me.
>
> About Performance
> -----------------
>
> Note that at least on CPUs there is no reason why there even is an
> overhead [1]! (We simply miss an `np.islessgreater` function.)
>
> The overhead is a bit annoying admittedly, although pretty typical for
> NumPy, I guess. This one seems to be reducible to around 10% if you
> use `a != a` instead of `isnan` (the ufunc incurs a lot of overhead),
> which is on par with 1.19.x, since NumPy got faster ;).
>
> About Reverting
> ---------------
>
> It feels like this should have a compat release note, which seems like
> a very strong flag for "not just a bug fix". But maybe there is a good
> reason to break with that?
>
> Cheers,
>
> Sebastian
>
> [1] C99 has `islessgreater` and both SSE2 and AVX have intrinsics which
> allow reformulating the whole thing to have exactly the same speed as
> before; these intrinsics effectively calculate things like:
>
>     !(a < b) or !(a != b)
>
> Admittedly, the SSE2 version probably sets FPEs for NaNs (but that is a
> problem that we already have and is even now not really avoidable).
>
> So at least for all relevant CPUs, it is only an implementation detail
> that we currently don't have `islessgreater`.
> (I have not found information on whether the C99 `islessgreater` has
> decent performance on CUDA.)
>
> (Not surprisingly, MSVC is arguably buggy and does not compile the
> C99 `islessgreater` to fast machine code – although plausibly a
> link-time optimizer may save that. But since we have hand-coded
> SSE2/AVX versions of comparisons, even that would probably not matter.)
>
> > On Fri, 5 Nov 2021, at 4:08 PM, Ralf Gommers wrote:
> > >
> > > On Mon, Aug 2, 2021 at 7:49 PM Ralf Gommers <ralf.gomm...@gmail.com>
> > > wrote:
> > > >
> > > > On Mon, Aug 2, 2021 at 7:04 PM Sebastian Berg
> > > > <sebast...@sipsolutions.net> wrote:
> > > > > Hi all,
> > > > >
> > > > > In NumPy 1.21, the output of `np.unique` changed in the presence
> > > > > of multiple NaNs. Previously, all NaNs were returned, whereas we
> > > > > now only return one (all NaNs used to be considered unique):
> > > > >
> > > > >     a = np.array([1, 1, np.nan, np.nan, np.nan])
> > > > >
> > > > > Before 1.21:
> > > > >
> > > > >     >>> np.unique(a)
> > > > >     array([ 1., nan, nan, nan])
> > > > >
> > > > > After 1.21:
> > > > >
> > > > >     array([ 1., nan])
> > > > >
> > > > > This change was requested in an old issue:
> > > > >
> > > > >     https://github.com/numpy/numpy/issues/2111
> > > > >
> > > > > And happened here:
> > > > >
> > > > >     https://github.com/numpy/numpy/pull/18070
> > > > >
> > > > > While it has a release note, I am not sure the change got the
> > > > > attention it deserved. This would be especially worrying if it
> > > > > is a regression for anyone.
> > > >
> > > > I think it's now the expected answer, not a regression. `unique` is
> > > > not an elementwise function that needs to adhere to IEEE-754, where
> > > > nan != nan. I can't remember reviewing this change, but it makes
> > > > perfect sense to me.
> > >
> > > It turns out there's a lot more to this story.
> > > The short summary of it is that:
> > >
> > > - Use-case-wise it's still true that considering NaNs equal makes
> > >   more sense.
> > > - scikit-learn has a custom implementation which removes duplicate
> > >   NaNs:
> > >   https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/_encode.py#L7
> > > - All other array/tensor libraries match NumPy's old behavior
> > >   (return multiple NaNs).
> > > - There's a performance and implementation cost for CuPy et al. to
> > >   match NumPy's changed behavior (multiple NaNs is the natural
> > >   result; no extra checks needed).
> > > - There is also a significant performance cost for NumPy, ~25% for
> > >   small/medium-size arrays with no/few NaNs (see the benchmarks in
> > >   https://github.com/numpy/numpy/pull/18070), which is a common
> > >   case, and that's not negligible like the PR description claims.
> > > - The "single NaN" behavior is easy to get as a utility function on
> > >   top of the "multiple NaNs" one (like scikit-learn does); the
> > >   opposite is much harder.
> > > - For the array API standard, the final decision was to go with the
> > >   "multiple NaNs" behavior, so we'll need that in `numpy.array_api`.
> > > - More discussion in https://github.com/data-apis/array-api/issues/249
> > >
> > > It would be good to make up our minds before 1.22.0. Two choices:
> > >
> > > 1. We can leave it as it is now and work around the issue in
> > >    `numpy.array_api`. It would also require CuPy and others to make
> > >    changes, which would probably cost performance.
> > > 2. We can revert it in 1.22.0, and possibly in 1.21.5 if such a
> > >    release will be made.
> > >
> > > There is something to say for either option. I'd personally lean
> > > slightly towards reverting, because of both consistency with other
> > > implementations and performance. But it seems like there's no
> > > optimal solution here; it's a fairly hairy issue.
> > >
> > > Thoughts?
> > >
> > > Ralf
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com
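As a footnote to Sebastian's point [1] above: in NumPy terms, `islessgreater(a, b)` corresponds to `(a < b) | (a > b)`, which agrees with `a != b` for ordinary floats but differs when NaN is involved. A small illustration of the semantics (this is not NumPy's internal implementation, just the distinction a NaN-aware comparison loop has to pay for):

```python
import numpy as np

a = np.array([1.0, 2.0, np.nan])
b = np.array([1.0, 3.0, np.nan])

# IEEE-754 `!=` is True whenever either operand is NaN...
neq = a != b             # [False, True, True]

# ...while "less or greater" (C99 islessgreater) is False for NaN,
# so a NaN-collapsing unique needs an extra check beyond plain `!=`.
ilg = (a < b) | (a > b)  # [False, True, False]

print(neq)
print(ilg)
```

The two results differ only in the NaN position, which is exactly why `a != a` works as a cheap NaN test (it is True only for NaN) and why a missing `islessgreater` is the source of the overhead discussed above.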