Jeroen Demeyer <[email protected]> added the comment:
This weekend I realized something important that I had missed before: some
hash functions which I assumed to be good (i.e. a small chance of collisions
between any two given tuples) turned out to fail the tests often. This is
because you don't just want to minimize collisions; you also want to minimize
*correlations* between collisions.
More precisely: for a given hash function (considering the multiplier as a
parameter), it can happen that there are 4 tuples t, u, v, w such that whether
or not hash(t) == hash(u) is correlated with whether or not hash(v) == hash(w).
Such correlations greatly increase the standard deviation of the number of
collisions in the tests (even when the average is unaffected), which leads to
a significant chance of failing the tests.
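As an illustration (a toy simulation, not the actual tuple tests): two
collision events that each occur with the same probability give the same
average collision count whether or not they are correlated, but the
correlated case has a visibly larger standard deviation:

```python
import random
import statistics

def simulate(correlated, p=0.5, trials=10_000, seed=12345):
    # Count "collisions" among two events that each occur with
    # probability p.  Perfectly correlated events have the same mean
    # count as independent ones, but a larger spread.
    rng = random.Random(seed)
    counts = []
    for _ in range(trials):
        a = rng.random() < p
        b = a if correlated else (rng.random() < p)
        counts.append(int(a) + int(b))
    return statistics.mean(counts), statistics.pstdev(counts)

mean_ind, sd_ind = simulate(correlated=False)
mean_cor, sd_cor = simulate(correlated=True)
# Same mean (about 1.0), but the correlated case has a noticeably larger SD
```

A test that only checks the average number of collisions would treat both
cases as equally good; one that must pass on every run will fail the
correlated case much more often.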
So with this in mind I stopped testing pairs of tuples and instead ran the
actual testsuites. The metric I'm using now is the probability that the
testsuite passes for a randomly chosen multiplier (congruent to 3 mod 8). For
example, the original tuple hash passes the original testsuite with a
probability of around 97%.
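A sketch of how such a metric could be estimated; `testsuite_passes` here is
a hypothetical stand-in for running the real testsuite with a given
multiplier:

```python
import random

def random_multiplier(bits=64, rng=random):
    # Uniform random multiplier with the low three bits forced to 0b011,
    # i.e. congruent to 3 mod 8.
    return (rng.getrandbits(bits) & ~0x7) | 0x3

def estimate_pass_probability(testsuite_passes, trials=1000):
    # Fraction of randomly chosen multipliers for which the suite passes.
    passed = sum(testsuite_passes(random_multiplier()) for _ in range(trials))
    return passed / trials
```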
None of the hash functions that I tried (DJB, or FNV with input mangling like
t ^= t << 7) achieved such a high probability of passing the original test.
The *only* variation that I found which passes the original test, my new test,
and a third "random" test (which I haven't mentioned before) with a high
enough probability was FNV with the input mangled by a second multiplier:
h = 1
for y in INPUT:
    t = hash(y)
    t ^= t * SOME_LARGE_EVEN_NUMBER   # instead of t ^= t << SHIFT
    h = (h ^ t) * MULTIPLIER
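For concreteness, a pure-Python sketch of this variant with 64-bit
wraparound. The constants below are illustrative placeholders only (the
multiplier must be congruent to 3 mod 8, the mangling constant must be a
large even number), not the values under discussion:

```python
MASK = (1 << 64) - 1            # emulate 64-bit unsigned wraparound

MULTIPLIER = 1000003            # placeholder; must be congruent to 3 mod 8
MANGLER = 0x5D3A9F4B2C6E8A0E    # placeholder; a large *even* number

def tuple_hash(items):
    h = 1
    for y in items:
        t = hash(y) & MASK
        t ^= (t * MANGLER) & MASK          # input mangling: t ^= t * EVEN
        h = ((h ^ t) * MULTIPLIER) & MASK
    return h
```

Note that multiplying by an even number and XOR-ing back in generalizes the
shift-based mangling: t ^= t << SHIFT is exactly the special case where the
even number is a power of two.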
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue34751>
_______________________________________