An update on this: after discussion in the community meeting, the preferred
direction is to add an `equal_nans` keyword and support both behaviors. Given
that we won't get to this, I proposed on
https://github.com/numpy/numpy/issues/20326 to make the default False, i.e.
the same as for 1.21.x.
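For concreteness, here is a rough sketch of what such a keyword could look like, written as a standalone function rather than as a patch to `np.unique`. The name `equal_nans`, its default, and the two behaviors are taken from the proposal above; nothing here is released NumPy API:

```python
import numpy as np

def unique_equal_nans(a, equal_nans=False):
    """Sketch of `np.unique` with the proposed `equal_nans` keyword.

    equal_nans=False: every NaN is distinct (pre-1.21 behavior).
    equal_nans=True:  all NaNs collapse into one (1.21.0 behavior).
    """
    a = np.sort(np.asarray(a, dtype=float))  # NaNs sort to the end
    if a.size == 0:
        return a
    # Keep each element that differs from its predecessor.  Under
    # IEEE-754, nan != nan, so this keeps every NaN.
    keep = np.concatenate(([True], a[1:] != a[:-1]))
    u = a[keep]
    n_nan = int(np.isnan(u).sum())
    if equal_nans and n_nan > 1:
        u = u[: u.size - n_nan + 1]  # collapse the trailing NaN run
    return u

a = np.array([1, 1, np.nan, np.nan, np.nan])
print(unique_equal_nans(a))                   # one 1. followed by three nan
print(unique_equal_nans(a, equal_nans=True))  # one 1. followed by one nan
```

Note that the `equal_nans=True` path is a cheap post-processing step over the multiple-NaN result, which is the "easy direction" argument made later in this thread.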
Cheers,
Ralf

On Mon, Nov 8, 2021 at 7:06 PM Sebastian Berg <sebast...@sipsolutions.net> wrote:

> On Fri, 2021-11-05 at 18:44 -0500, Juan Nunez-Iglesias wrote:
> > I agree with the argument to revert. Consistency among libraries should
> > be near the top of the list of priorities, as should performance.
> > Whether the new behaviour "makes more sense", meanwhile, is debatable.
>
> I don't have much of an opinion either way. But while correctness is
> debatable, is there any reason someone would prefer getting multiple
> (potentially very many) NaNs? I admit it feels more like a trap to me.
>
> About Performance
> -----------------
>
> Note that at least on CPUs there is no reason why there even is an
> overhead [1]! (We simply miss an `np.islessgreater` function.)
>
> The overhead is a bit annoying admittedly, although pretty typical for
> NumPy, I guess. This one seems to be reducible to around 10% if you
> use `a != a` instead of `isnan` (the ufunc incurs a lot of overhead),
> which is on par with 1.19.x, since NumPy got faster ;).
>
> About Reverting
> ---------------
>
> It feels like this should have a compat release note, which seems like
> a very strong flag for "not just a bug fix". But maybe there is a good
> reason to break with that?
>
> Cheers,
>
> Sebastian
>
> [1] C99 has `islessgreater` and both SSE2 and AVX have intrinsics which
> allow reformulating the whole thing to have exactly the same speed as
> before; these intrinsics effectively calculate things like:
>
>     !(a < b) or !(a != b)
>
> Admittedly, the SSE2 version probably sets FPEs for NaNs (but that is a
> problem that we already have and is even now not really avoidable).
>
> So at least for all relevant CPUs, it is only an implementation detail
> that we currently don't have `islessgreater`.
> (I have not found information on whether the C99 `islessgreater` has
> decent performance on CUDA.)
>
> (Not surprisingly, MSVC is arguably buggy and does not compile the
> C99 `islessgreater` to fast machine code – although plausibly a
> link-time optimizer may save that. But since we have hand-coded
> SSE2/AVX versions of comparisons, even that would probably not matter.)
>
> > On Fri, 5 Nov 2021, at 4:08 PM, Ralf Gommers wrote:
> > >
> > > On Mon, Aug 2, 2021 at 7:49 PM Ralf Gommers <ralf.gomm...@gmail.com>
> > > wrote:
> > > >
> > > > On Mon, Aug 2, 2021 at 7:04 PM Sebastian Berg
> > > > <sebast...@sipsolutions.net> wrote:
> > > > > Hi all,
> > > > >
> > > > > In NumPy 1.21, the output of `np.unique` changed in the presence
> > > > > of multiple NaNs. Previously, all NaNs were returned, whereas we
> > > > > now only return one (all NaNs used to be considered unique):
> > > > >
> > > > >     a = np.array([1, 1, np.nan, np.nan, np.nan])
> > > > >
> > > > > Before 1.21:
> > > > >
> > > > >     >>> np.unique(a)
> > > > >     array([ 1., nan, nan, nan])
> > > > >
> > > > > After 1.21:
> > > > >
> > > > >     array([ 1., nan])
> > > > >
> > > > > This change was requested in an old issue:
> > > > >
> > > > >     https://github.com/numpy/numpy/issues/2111
> > > > >
> > > > > And happened here:
> > > > >
> > > > >     https://github.com/numpy/numpy/pull/18070
> > > > >
> > > > > While it has a release note, I am not sure the change got the
> > > > > attention it deserved. This would be especially worrying if it
> > > > > is a regression for anyone.
> > > >
> > > > I think it's now the expected answer, not a regression. `unique` is
> > > > not an elementwise function that needs to adhere to IEEE-754, where
> > > > nan != nan. I can't remember reviewing this change, but it makes
> > > > perfect sense to me.
> > >
> > > It turns out there's a lot more to this story.
> > > The short summary of it is that:
> > >
> > > - Use-case-wise it's still true that considering NaNs equal makes
> > >   more sense.
> > > - scikit-learn has a custom implementation which removes duplicate
> > >   NaNs:
> > >   https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/_encode.py#L7
> > > - All other array/tensor libraries match NumPy's old behavior
> > >   (return multiple NaNs).
> > > - There's a performance and implementation cost for CuPy et al. to
> > >   match NumPy's changed behavior (multiple NaNs is the natural
> > >   result; no extra checks needed).
> > > - There is also a significant performance cost for NumPy, ~25% for
> > >   small/medium-size arrays with no/few NaNs (see the benchmarks in
> > >   https://github.com/numpy/numpy/pull/18070), which is a common
> > >   case, and that's not negligible like the PR description claims.
> > > - The "single NaN" behavior is easy to get as a utility function on
> > >   top of the "multiple NaNs" one (like scikit-learn does); the
> > >   opposite is much harder.
> > > - For the array API standard, the final decision was to go with the
> > >   "multiple NaNs" behavior, so we'll need that in `numpy.array_api`.
> > > - More discussion in https://github.com/data-apis/array-api/issues/249
> > >
> > > It would be good to make up our minds before 1.22.0. Two choices:
> > >
> > > 1. We can leave it as it is now and work around the issue in
> > >    `numpy.array_api`. It would also require CuPy and others to make
> > >    changes, which would probably cost performance.
> > > 2. We can revert it in 1.22.0, and possibly in 1.21.5 if such a
> > >    release will be made.
> > >
> > > There is something to say for either option. I'd personally lean
> > > slightly towards reverting, because of both consistency with other
> > > implementations and performance. But it seems like there's no
> > > optimal solution here; it's a fairly hairy issue.
> > >
> > > Thoughts?
> > >
> > > Ralf
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com
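As a footnote to Sebastian's point [1] above: in NumPy terms, `islessgreater(a, b)` corresponds to `(a < b) | (a > b)`, which agrees with `a != b` for ordinary floats but differs when NaN is involved. A small illustration of the semantics (this is not NumPy's internal implementation, just the distinction a NaN-aware comparison loop has to pay for):

```python
import numpy as np

a = np.array([1.0, 2.0, np.nan])
b = np.array([1.0, 3.0, np.nan])

# IEEE-754 `!=` is True whenever either operand is NaN...
neq = a != b             # [False, True, True]

# ...while "less or greater" (C99 islessgreater) is False for NaN,
# so a NaN-collapsing unique needs an extra check beyond plain `!=`.
ilg = (a < b) | (a > b)  # [False, True, False]

print(neq)
print(ilg)
```

The two results differ only in the NaN position, which is exactly why `a != a` works as a cheap NaN test (it is True only for NaN) and why a missing `islessgreater` is the source of the overhead discussed above.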