[issue43475] Worst-case behaviour of hash collision with float NaN

Cong Ma Thu, 11 Mar 2021 10:58:23 -0800


Cong Ma <m.c...@protonmail.ch> added the comment:


Thank you @mark.dickinson for the detailed analysis.

In addition to your hypothetical usage examples, I am also trying to understand 
the implications for user code.

Judging by issues people open on GitHub, such as 
https://github.com/pandas-dev/pandas/issues/28363, people apparently do run 
into the "NaN as key" problem every now and then. (I'm not familiar enough 
with the poster's real problem in the Pandas setting to comment on that one, 
but it suggests that people do run into "real world" problems that share 
some common features with this one.) There are also StackOverflow threads 
like this one (https://stackoverflow.com/a/17062985) where people discuss 
hashing a data table that explicitly uses NaN for missing values. The reason 
seems to be that "[e]mpirical evidence suggests that hashing is an effective 
strategy for dimensionality reduction and practical nonparametric estimation" 
(https://arxiv.org/pdf/0902.2206.pdf).

I cannot say whether the proposed change would make life easier for people 
who really want NaN keys to begin with. However, I think it might reduce the 
exposure to worst-case performance in dict (and maybe set/frozenset), while 
not hurting Python's own self-consistency.
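
The "NaN as key" situation can be sketched as follows. This is only an 
illustration, assuming CPython, where each float("nan") call returns a new 
object; the helper name build_nan_keyed_dict is hypothetical. Because a NaN 
never compares equal to another NaN, every insertion adds a new key, and on 
interpreters where all NaNs share one hash value those keys all collide.

```python
def build_nan_keyed_dict(n):
    """Insert n separately constructed NaNs as keys of one dict."""
    d = {}
    for i in range(n):
        # A fresh NaN is neither identical nor equal to any existing key,
        # so each assignment creates a new entry instead of overwriting one.
        d[float("nan")] = i
    return d


nan_dict = build_nan_keyed_dict(1000)
print(len(nan_dict))

# Number of distinct hash values among the keys. A result of 1 means every
# key collides; the value depends on the interpreter version.
distinct_hashes = len({hash(k) for k in nan_dict})
print(distinct_hashes)
```

When all keys collide, each insertion has to probe past the earlier NaN 
entries, so building the dict costs time quadratic in n.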

Maybe there's one thing to consider about future compatibility, though: 
the proposed fix depends on the current behaviour that floats (and by 
extension, built-in float-like objects such as Decimal and complex) are not 
cached, unlike small ints and interned Unicode objects. So when you explicitly 
construct a NaN by calling float(), you always get a new Python object. Is this 
guaranteed now on different platforms with different compilers? I'm trying to 
parse the magic in Include/pymath.h about the definition of macro `Py_NAN`, 
where it seems to me that for certain configs it could be a `static const 
union` translation-unit-level constant. Does this affect the code that actually 
builds a Python object from an underlying NaN? (I apologize if this is a bad 
question). But if it doesn't and we're guaranteed to really have Python 
NaN-objects that don't alias if explicitly constructed, is this something 
unlikely to change in the future?
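
The non-aliasing assumption can be checked with a short sketch like the one 
below. It reflects CPython's behaviour at the time of writing and is not a 
language guarantee, which is exactly the open question above. The variable 
names are illustrative only.

```python
a = float("nan")
b = float("nan")

# On CPython, two explicit constructions give two distinct objects.
print(a is b)

# A NaN is not equal to anything, including itself.
print(a == a)  # False

d = {a: "first"}

# Lookup with the same object succeeds because dict checks identity
# before falling back to ==.
print(a in d)  # True

# Lookup with a different NaN object fails: not identical, not equal.
print(b in d)
```

If NaN objects were ever cached or shared, `a is b` would become true and 
lookups with `b` would start finding the entry stored under `a`.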

I also wonder if there's a security implication for servers that process 
user-submitted input, maybe running the float() constructor on a list of 
strings, checking for exceptions but silently succeeding with "nan". 
Arguably this is not going to be as common a concern as the string-key 
collision DoS now foiled by hash randomization, but I'm not entirely sure.
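
To illustrate the input-handling scenario: float() raises ValueError for 
malformed strings but accepts "nan" in any letter case and with surrounding 
whitespace. The sketch below shows one possible guard using math.isnan. 
The function name parse_floats_rejecting_nan is hypothetical, and this is 
not part of the proposed fix.

```python
import math


def parse_floats_rejecting_nan(strings):
    """Parse strings to floats, treating NaN as invalid input."""
    values = []
    for s in strings:
        x = float(s)  # raises ValueError for malformed input, accepts "nan"
        if math.isnan(x):
            raise ValueError(f"NaN is not accepted: {s!r}")
        values.append(x)
    return values


print(parse_floats_rejecting_nan(["1.5", "2", "-3e2"]))  # [1.5, 2.0, -300.0]

try:
    parse_floats_rejecting_nan(["1.0", " NaN "])
except ValueError as exc:
    print("rejected:", exc)
```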

On "theoretical interest": yes, theoretical interests can also be "real 
world" if one teaches CS theory in the real world using Python; see 
https://bugs.python.org/issue43198#msg386849

So yes, I admit we're in an obscure corner of Python here. It's a tricky, 
maybe-theoretical, seemingly annoying but easy-to-fix issue.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue43475>
_______________________________________