[issue43475] Worst-case behaviour of hash collision with float NaN

Mark Dickinson Thu, 11 Mar 2021 09:36:06 -0800


Mark Dickinson <dicki...@gmail.com> added the comment:


Sigh. When I'm undisputed ruler of the multiverse, I'm going to make "NaN == 
NaN" return True, IEEE 754 be damned. NaN != NaN is fine(ish) at the level of 
numerics; the problems start when the consequences of that choice leak into the 
other parts of the language that care about equality. NaNs just shouldn't be 
considered "special enough to break the rules" (where the rules here are that 
the equivalence relation being used as the basis of equality for a general 
container should actually *be* an equivalence relation - reflexive, symmetric, 
and transitive).
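
To make the reflexivity point concrete, here is a minimal sketch in plain Python. The names `nan` and `d` are illustrative only. It shows that NaN is not equal to itself, while lists and dicts still find the *same* NaN object because they check identity before equality:

```python
nan = float("nan")

# Equality is not reflexive for NaN.
not_reflexive = not (nan == nan)

# Containers check identity first, so the same object is still found ...
same_object_found = nan in [nan]
# ... but a different NaN object is never found.
other_object_found = float("nan") in [nan]

# As dict keys, each distinct NaN object becomes its own entry.
d = {nan: "first"}
d[nan] = "overwritten"       # same object: replaces the existing entry
d[float("nan")] = "second"   # different object: adds a new entry

print(not_reflexive, same_object_found, other_object_found, len(d))
```

So containers already treat NaN by identity in practice, which is what makes hashing by identity a natural fit.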

Anyway, more constructively ...

I agree with the analysis, and the proposed solution seems sound: if we're 
going to change this, then using the object hash seems like a workable 
solution. The question is whether we actually do need to change this.

I'm not too worried about backwards compatibility here: if we change this, 
we're bound to break *someone's* code somewhere, but it's hard to imagine that 
there's *that* much code out there that usefully relies on the property that 
hash(nan) == 0. The most likely source of breakage I can think of is in test 
code that checks that 3rd-party float-like things (NumPy's float64, or gmpy2 
floats, or ...) behave like Python's floats.
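
As a rough sketch of what "using the object hash" could look like for a float-like type, consider the subclass below. `IdentityHashedFloat` is a hypothetical name invented for this illustration, not an existing class, and this is not the patch under discussion. It falls back to the identity-based `object.__hash__` for NaN and keeps the ordinary float hash otherwise:

```python
class IdentityHashedFloat(float):
    """Hypothetical float subclass: NaN hashes by object identity."""

    def __hash__(self):
        if self != self:  # only NaN compares unequal to itself
            return object.__hash__(self)
        return float.__hash__(self)


x = IdentityHashedFloat("nan")
y = IdentityHashedFloat("nan")
z = IdentityHashedFloat(1.5)

# Distinct NaN objects get distinct hashes; ordinary values are unaffected.
print(hash(x) != hash(y), hash(z) == hash(1.5))
```

Since two distinct NaNs never compare equal, giving them different hashes does not violate the rule that equal objects must hash equal.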

@Cong Ma: do you have any examples of cases where this has proved, or is likely 
to prove, a problem in real-world code, or is this purely theoretical at this 
point?

I'm finding it hard to imagine cases where a developer *intentionally* has a 
dictionary with several NaN values as keys. (Even a single NaN as a key seems a 
stretch; general floats as dictionary keys are rare in real-world code that 
I've encountered.) But it's somewhat easier to imagine situations where one 
could run into this accidentally. Here's one such:

>>> import collections, numpy as np
>>> x = np.full(100000, np.nan)
>>> c = collections.Counter(x)

That took around 4 minutes of non-ctrl-C-interruptible time on my laptop. (I 
actually expected it not to "work" as a bad example: I thought that NaNs 
realised from a NumPy array would all be the same NaN object, but that's not 
what happens.) For comparison, this appears instant:

>>> x = np.random.rand(100000)
>>> c = collections.Counter(x)
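
The same effect can be sketched without NumPy. The helper `count_seconds` and the size `n` are assumptions made for this illustration, and `n` is kept small so the script finishes quickly even where every NaN lands in the same hash bucket. Timings will vary by machine and interpreter version, so none are quoted here:

```python
import collections
import time


def count_seconds(values):
    """Build a Counter from values and return it with the elapsed time."""
    start = time.perf_counter()
    counter = collections.Counter(values)
    return counter, time.perf_counter() - start


n = 2000  # deliberately small; the cost grows quadratically when hashes collide
nans = [float("nan") for _ in range(n)]  # n distinct NaN objects
floats = [i / n for i in range(n)]       # n distinct ordinary floats

nan_counter, nan_time = count_seconds(nans)
float_counter, float_time = count_seconds(floats)

# Every NaN object ends up as its own key, each with a count of 1.
print(len(nan_counter), len(float_counter))
print(f"NaN keys: {nan_time:.4f}s, ordinary keys: {float_time:.4f}s")
```

When all NaNs share one hash value, each insertion has to probe past every NaN already stored, which is where the quadratic cost comes from.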

And it's not a stretch to imagine having a NumPy array representing a masked 
binary 1024x1024-pixel image (float values of NaN, 0.0 and 1.0) and using 
Counter to find out how many 0s and 1s there were.
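
For that kind of counting, one possible way to sidestep the problem regardless of how NaN hashes is to keep NaNs out of the Counter altogether. The sketch below uses a small hand-written `pixels` list as a stand-in for the image data (an assumption for illustration; a real NumPy array would more likely be handled with vectorised tools):

```python
import collections
import math

# Stand-in for masked image data: NaN marks a masked pixel.
pixels = [0.0, 1.0, float("nan"), 1.0, float("nan"), 0.0, 1.0]

# Count only the unmasked values, and tally the NaNs separately.
counts = collections.Counter(p for p in pixels if not math.isnan(p))
nan_count = sum(1 for p in pixels if math.isnan(p))

print(counts[0.0], counts[1.0], nan_count)  # 2 3 2
```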

On the other hand, we've lived with the current behaviour for some decades now 
without it apparently ever being a real issue.

On balance, I'm +1 for fixing this. It seems like a failure mode that would be 
rarely encountered, but it's quite unpleasant when it *is* encountered.

----------
nosy: +rhettinger, tim.peters

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue43475>
_______________________________________