Tim Peters <t...@python.org> added the comment:

High-order bit:  please restore the original tuple hash test.  You have the 
worst case of "but I didn't write it" I've ever encountered ;-)  Your new test 
is valuable, but I've seen several cases now where it fails to detect any 
problems where the original test fails catastrophically.  Tests are written to 
ensure known problems don't recur, so there's _never_ "a reason" to throw one 
away unless the test itself is in error.  The test suite only grows over time, 
and that's how it should be.

Example:  when I replace the computational guts of the tuple hash with the 
single line (64-bit build, and this is the only code remaining in the loop 
apart from setting y to the next tuple component's hash):

    x = (x * mult) + y;

both tests fail spectacularly.
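For concreteness, here's roughly what that stripped-down loop looks like
as a standalone function (a sketch, not the actual CPython source; the
function name and arguments are made up for illustration, and while the
seed and multiplier here are the long-time CPython values, nothing
depends on that):

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch of the stripped-down combining loop under discussion:
       on a 64-bit build, each iteration fetches the next component's
       hash into y and folds it in with the single line being tested. */
    uint64_t
    tuple_hash_sketch(const uint64_t *hashes, size_t n)
    {
        uint64_t x = 0x345678UL;    /* traditional seed */
        uint64_t mult = 1000003UL;  /* any "big enough" odd multiplier */
        for (size_t i = 0; i < n; i++) {
            uint64_t y = hashes[i]; /* hash of tuple component i */
            x = (x * mult) + y;     /* the single line under test */
        }
        return x;
    }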

But if I add one more line:

    x = (x * mult) + y;
    x += x >> 16;
    
your new test drops to 0 collisions, while the original test still suffers 
83447 collisions (an improvement, down from 120050, but still a spectacular 
failure).

We're flying blind enough here without deliberately plucking out one of our 
eyes ;-)  Indeed, the original tuple test is the only one in my collection that 
showed even a single collision after the two-line replacement just above.  "My 
collection" means your test, the original tuple hash test, and every example 
given in all the messages in this bug report.

Another "surprise":  replace addition with xor in the second line above:

    x = (x * mult) + y;
    x ^= x >> 16;

and then the original tuple test drops from 83447 to 0(!) collisions, while 
your new one jumps from 0 to a measly 12.


> The collision hash((3,3)) == hash((-3,-3)) is due to a specific
> structure in the input to the FNV algorithm.

This is all so hypothetical it's impossible to know.  Perhaps the tuple hash 
function simply subtracts the hash of the first component from the hash of the 
second.  Then it's impossible to make any deterministic change to the component 
hashes that can stop either of the tuples from hashing to 0.  The weakness is 
entirely in the hash combining function then.
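A toy version of that hypothetical combiner, just to make the point
concrete (purely illustrative; nothing like this is being proposed):

    #include <stdint.h>

    /* Hypothetical 2-tuple combiner: subtract the first component's
       hash from the second's.  For ANY deterministic component hash h,
       hash((a, a)) == h(a) - h(a) == 0, so (3, 3) and (-3, -3) both
       hash to 0 no matter how h is changed.  The weakness lives
       entirely in the combining step, not in the component hashes. */
    uint64_t
    subtract_combine(uint64_t h_first, uint64_t h_second)
    {
        return h_second - h_first;
    }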

> The operation t ^= t << 7 that you suggested (and which I approve, except
> for the shift amount) is meant precisely to destroy that structure.

Except I have no reason to believe that destroying the input structure is of 
unique value.  As shown above, it's easy to change the tuple hash in a simple 
way that passes all known tests without touching `y` (aka `t`) at all.

>> It's x you care about directly, not t.

> That would not be a good solution, because that destroys the structure
> of the hash algorithm.   For Bernstein for example, each loop iteration
> should do something like
>
>    x = (x * m) + t
>
> for *some* value t depending on the input. If you mess with x, then it
> really becomes a different hash algorithm.

The two-liner above with the xor in the second line is exactly Bernstein 33A, 
followed by a permutation of 33A's _output_ space.  It works fine in tests.  
Why should I care about merely that it's "different"?  33A on its own fails 
spectacularly - it damned well _better_ be "different" ;-)
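Spelled out, the structure I mean is this (a sketch; the function name
and arguments are made up for illustration, with the seed and multiplier
left as parameters since the experiments vary them):

    #include <stdint.h>
    #include <stddef.h>

    /* Each iteration is one Bernstein 33A combining step followed by
       an invertible xor-shift, i.e. a permutation of 33A's output
       space.  Because the xor-shift is invertible, it merely relabels
       the state; it's still "33A plus a permutation". */
    uint64_t
    bernstein_33a_permuted(const uint64_t *hashes, size_t n,
                           uint64_t x, uint64_t mult)
    {
        for (size_t i = 0; i < n; i++) {
            x = (x * mult) + hashes[i];  /* Bernstein 33A step */
            x ^= x >> 16;                /* permutation of the output */
        }
        return x;
    }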

There is no "Bernstein theory" to appeal to here - just raw assertions.  I'm 
not attached to any of these approaches.  The strongest I can say from trying 
many things is that "chaining permutations works best, but no detail appears to 
matter much, except that multiplication is the most valuable of all the 
permutations I've tried".

x*M is a permutation of the word-size space for any odd M (odd M is 
invertible modulo a power of 2), and any "big enough" odd M I tried appears 
to work about as well as any other.  Adding t is another permutation.  The 
shift-xor in the second line above is a third (but the "+=" version instead 
was not a permutation, and it failed).
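Both halves of that are easy to check mechanically.  A quick standalone
check (model code, nothing to do with CPython; the "+=" variant is
modeled in a 16-bit word, with the shift scaled down accordingly, so the
whole space can be enumerated):

    #include <stdint.h>
    #include <stdio.h>

    /* Inverse of an odd m mod 2**64 by Newton iteration; each step
       doubles the number of correct low bits, so 5 steps suffice. */
    static uint64_t
    inv64(uint64_t m)
    {
        uint64_t inv = m;          /* already correct to 3 bits */
        for (int i = 0; i < 5; i++)
            inv *= 2 - m * inv;
        return inv;
    }

    int
    main(void)
    {
        /* 1. x*M mod 2**64 round-trips for odd M: a permutation. */
        uint64_t m = 1000003UL, x = 0x123456789abcdef0ULL;
        printf("x*M invertible: %d\n", x * m * inv64(m) == x);

        /* 2. The "+=" variant is not a permutation: counting distinct
           outputs over all 16-bit inputs comes up short of 65536. */
        static unsigned char seen[1 << 16];
        unsigned distinct = 0;
        for (uint32_t v = 0; v < (1u << 16); v++) {
            uint16_t y = (uint16_t)(v + (v >> 4));
            if (!seen[y]++)
                distinct++;
        }
        printf("distinct outputs of x += x>>4: %u of 65536\n", distinct);
        return 0;
    }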

> That is far more dangerous than simply applying a permutation on t.

Why?

> Which "desirable properties" of t does the operation t ^= (t << 1) damage?

On second thought, none!  Across any contiguous range of integers, it 
preserves the property that the last k bits (for all k >= 1) remain as 
evenly distributed as in the inputs.  That wasn't obvious to me at first.  
It's not true if the left shift is replaced by a right shift, though.
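The reason, I believe: bit i of t ^ (t << 1) is t_i xor t_{i-1}, so the
low k output bits depend only on the low k input bits, and the induced
map on those k bits is invertible, so it just permutes the low-k buckets
(a right shift feeds _high_ bits into the low ones, which is why the
symmetric claim fails).  A throwaway counter confirms it (standalone,
illustrative):

    #include <stdint.h>
    #include <stdio.h>

    /* Over a contiguous range of inputs, bucket t and t ^ (t << 1) by
       their low 8 bits and compare how evenly the buckets fill.  The
       min/max bucket counts come out identical. */
    int
    main(void)
    {
        enum { K = 8, BUCKETS = 1 << K };
        static uint32_t in[BUCKETS], out[BUCKETS];
        for (uint64_t t = 1000; t < 1000 + 1000000; t++) {
            in[t & (BUCKETS - 1)]++;
            out[(t ^ (t << 1)) & (BUCKETS - 1)]++;
        }
        uint32_t imin = ~0u, imax = 0, omin = ~0u, omax = 0;
        for (int i = 0; i < BUCKETS; i++) {
            if (in[i] < imin) imin = in[i];
            if (in[i] > imax) imax = in[i];
            if (out[i] < omin) omin = out[i];
            if (out[i] > omax) omax = out[i];
        }
        printf("inputs:  min %u max %u per bucket\n", imin, imax);
        printf("outputs: min %u max %u per bucket\n", omin, omax);
        return 0;
    }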

However, with a shift count of 1 I've generally found that the _original_ 
tuple hash test fails (not spectacularly, but fails all the same).  
"Generally" means across a substantial number of variations of the 
permutations used in other places.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue34751>
_______________________________________