[issue43475] Worst-case behaviour of hash collision with float NaN

Cong Ma Thu, 11 Mar 2021 06:58:54 -0800


New submission from Cong Ma <m.c...@protonmail.ch>:


Summary: CPython hash all NaN values to 0. This guarantees worst-case behaviour 
for dict if numerous existing keys are NaN. I think by hashing NaN using the 
generic object (or "pointer") hash instead, the worst-case situation can be 
alleviated without changing the semantics of either dict or float. However, 
this also requires changes to how complex and Decimal objects hash, and 
moreover incompatible change to sys.hash_info. I would like to hear how Python 
developers think about this matter.

--------

Currently CPython uses the hard-coded macro constant 0 (_PyHASH_NAN, defined in 
Include/pyhash.h) for the hash value of all floating point NaNs. The value is 
part of the sys.hashinfo API and is re-used by complex and Decimal in computing 
its hash in accordance with Python builtin-type documentation. [0]

(The doc [0] specifically says that "[a]ll hashable nans have the same hash 
value.")

This is normally not a great concern, except for the worst case performance. 
The problem is that, since they hash to the same value and they're guaranteed 
to compare unequal to any compatible numeric value -- not even to themselves, 
this means they're guaranteed to collide.

For this reason I'd like to question whether it is a good idea to have all 
hashable NaNs with the same hash value.

There has been some discussions about this over the Web for some time (see 
[1]). In [1] the demo Python script times the insertion of k distinct NaN keys 
(as different objects) into a freshly created dict. Since the keys are distinct 
and are guaranteed to collide with each other (if any), the run time of a 
single lookup/insertion is roughly linear to the existing number of NaN keys. 
I've recreated the same script using with a more modern Python (attached).

I'd suggest a fix for this worst-case behaviour: instead of returning the hash 
value 0 for all NaNs, use the generic object (pointer) hash for these objects. 
As a PoC (also in the attached script), it roughly means

```
class myfloat(float):
    def __hash__(self):
        if self != self:  # i.e., math.isnan(self)
            return object.__hash__(self)
        return super().__hash__(self)
```

This will
- keep the current behaviour of dict intact;
- keep the invariant `a == b implies hash(a) == hash(b)` intact, where 
applicable;
- uphold all the other rules for Python numeric objects listed in [0];
- make hash collisions no more likely than with object() instances (dict lookup 
time is amortized constant w.r.t. existing number of NaN keys).

However, it will
- not keep the current rule "All hashable nans have the same hash value";
- not be compatible with the current sys.hash_info API (requires the removal of 
the "nan" attribute from there and documenting the change);
- require congruent modifications in complex and Decimal too.

Additionally, I don't think this will affect module-level NaN "constants" such 
as math.nan and how they behave. The "NaN constant" has never been a problem to 
begin with. It's only the *distinct* NaN objects that may cause the worst-case 
behaviour.

--------

Just for the record I'd also like to clear some outdated info or misconception 
about NaN keys in Python dicts. It's not true that NaN keys, once inserted, 
cannot be retrieved (e.g., as claimed in [1][2]). In Python, they can be, if 
you keep the original key *object* around by keeping a reference to it (or 
obtaining a new one from the dict by iterating over it). This, I think, is 
because Python dict compares for object identity before rich-comparing for 
equality in `lookdict()` in Objects/dictobject.c, so this works for `d = 
dict()`:

```
f = float("nan")
d[f] = "value"
v = d[f]
```

but this fails with `KeyError`, as it should:

```
d[float("nan")] = "value"
v = d[float("nan")]
```

In this regard the NaN float object behaves exactly like the object() instance 
as keys -- except for the collisions. That's why I think at least for floats 
the "object" hash is likely to work. The solution using PRNG [1] (implemented 
with the Go language) is not necessary for CPython because the live objects are 
already distinct.

--------

Links:

[0] https://docs.python.org/3/library/stdtypes.html#hashing-of-numeric-types
[1] https://research.swtch.com/randhash
[2] 
https://readafterwrite.wordpress.com/2017/03/23/how-to-hash-floating-point-numbers/

----------
components: Interpreter Core, Library (Lib)
files: nan_key.py
messages: 388508
nosy: congma
priority: normal
severity: normal
status: open
title: Worst-case behaviour of hash collision with float NaN
type: performance
Added file: https://bugs.python.org/file49869/nan_key.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue43475>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43475] Worst-case behaviour of hash collision with float NaN

Reply via email to