[issue34751] Hash collisions for tuples

Tim Peters Mon, 01 Oct 2018 11:47:53 -0700


Tim Peters <t...@python.org> added the comment:


If SeaHash is interesting to us (it is to me), I suggest dropping our DJB/FNV 
combining part entirely and using a purer form, like so:

    Py_uhash_t t = (Py_uhash_t)y;
    x ^= t ^ (t << 1);
    x *= (Py_uhash_t)0x6eed0e9da4d94a4fULL;
    x ^= (x >> 32) >> (x >> 60);
    x *= (Py_uhash_t)0x6eed0e9da4d94a4fULL;

The only collisions with that were 14 in the new tuple test.  Rotate t right by 
3 first, and that drops to 13.  Much better than utter catastrophe ;-)

100% pure SeaHash does

    x ^= t;
    
at the start first, instead of `t ^ (t << 1)` on the RHS.  That fails the 
second tuple test, with 98 collisions.  Not catastrophic, just "too many".  But 
there's only so much we can expect from a cheap non-crypto-strength hash.  `t ^ 
(t << 1)` is a pragmatic tweek to get rid of masses of uselessly repeated sign 
bits, which do occur in realistic tuples (mixing positive and negative ints).  
Adding that tweak shouldn't harm any of SeaHash's provable global properties, 
since `t ^ (t << 1)` is just a permutation of the input space.

Different constants would be needed for a 32-bit version (best guesses, 
untried:  a 31-bit random prime; s/32/16/; s/60/29/).

I don't yet know about potential SeaHash licensing issues.  It's part of Rust:

https://docs.rs/seahash/3.0.5/seahash/

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue34751>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue34751] Hash collisions for tuples

Reply via email to