Jeroen Demeyer <j.deme...@ugent.be> added the comment:

This weekend I realized something important that I hadn't realized before: some 
hash functions which I assumed to be good (i.e. a small chance of collisions 
between any given two tuples) turned out to fail the tests quite often. This is 
because you don't just want to minimize collisions, you also want to minimize 
*correlations* between collisions.

More precisely: for a given hash function (with the multiplier considered as a 
parameter), it can happen that there are 4 tuples t, u, v, w such that whether 
or not hash(t) == hash(u) is correlated with whether or not hash(v) == hash(w). 
Such correlations greatly increase the standard deviation of the number of 
collisions in the tests (even if the average is unaffected), which leads to a 
significant chance of failing the tests.
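
As a toy illustration (my own, not part of any patch), consider a plain FNV-style 
combine h = (h ^ x) * m over 64 bits, without any input mangling. If the 2-tuples 
(a, b) and (c, d) collide, then (a, b ^ s) and (c, d ^ s) collide for every s, 
because the last XOR-ed term cancels in exactly the same way; one "lucky" 
multiplier produces a whole family of collisions at once:

MASK = 2**64 - 1

def fnv_combine(values, m):
    # Simplified FNV-style tuple hash: no input mangling, 64-bit.
    h = 1
    for x in values:
        h = ((h ^ x) * m) & MASK
    return h

m = 0xc2b2ae3d27d4eb4f      # an arbitrary odd multiplier, just for the demo
a, c = 12345, 67890
b = 0
d = (((1 ^ a) * m) ^ ((1 ^ c) * m)) & MASK   # chosen so (a, b) and (c, d) collide
assert fnv_combine((a, b), m) == fnv_combine((c, d), m)

# The collision propagates to a whole family of tuple pairs:
for s in (1, 7, 2**32, 987654321):
    assert fnv_combine((a, b ^ s), m) == fnv_combine((c, d ^ s), m)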

So with this in mind, I stopped testing pairs of tuples and instead ran the 
actual test suites. The metric I'm using is now the probability that the test 
suite passes for randomly chosen multipliers (congruent to 3 mod 8). For 
example, the original tuple hash has a probability of around 97% of passing the 
original test suite.
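
A minimal sketch of that kind of metric (my own, not the actual harness; 
run_testsuite is a hypothetical callable that rebuilds the tuple hash with the 
given multiplier and runs the collision tests):

import random

def random_multiplier(rng):
    # Random 64-bit multiplier congruent to 3 mod 8.
    return (rng.getrandbits(61) << 3) | 3

def estimate_pass_rate(run_testsuite, trials=1000, seed=0):
    rng = random.Random(seed)
    passed = sum(run_testsuite(random_multiplier(rng)) for _ in range(trials))
    return passed / trials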

None of the hash functions that I tried (DJB, or FNV with input mangling like 
t ^= t << 7) achieved such a high probability of passing the original test. The 
*only* variation I found that passes the original test and my new test (and a 
third "random" test which I haven't mentioned before) with a high enough 
probability was FNV with input mangling using a second multiplier:

h = 1
for y in INPUT:
    t = hash(y)
    t ^= t * SOME_LARGE_EVEN_NUMBER   # instead of t ^= t << SHIFT
    h = (h ^ t) * MULTIPLIER
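
The same loop as a runnable 64-bit sketch (my own rendering; the concrete 
constants below are placeholders picked only so the code executes, since 
SOME_LARGE_EVEN_NUMBER and MULTIPLIER are deliberately left open above):

MASK = 2**64 - 1
MULTIPLIER = 0x100000001b3                    # placeholder odd multiplier (3 mod 8)
SOME_LARGE_EVEN_NUMBER = 0xdeadbeef00000002   # placeholder: any large even constant

def tuple_hash(items):
    h = 1
    for y in items:
        t = hash(y) & MASK                          # hash() can be negative; force 64-bit
        t ^= (t * SOME_LARGE_EVEN_NUMBER) & MASK    # instead of t ^= t << SHIFT
        h = ((h ^ t) * MULTIPLIER) & MASK
    return h

print(tuple_hash((1, 2, 3)), tuple_hash(("a", "b")))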

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue34751>
_______________________________________