Tim Peters <[email protected]> added the comment:
An "aha!" moment for me: I noted before that replacing the tuple hash loop with
Py_uhash_t t = (Py_uhash_t)y;
t ^= t << 1;
x = (x * mult) + t;
does OK (64-bit build): most tests had no collisions, but the original tuple
test had 24 (a modest failure) and the new one 6.
Horrible results with a different scheme prompted me to add one line before the
shift-xor:
t = (t >> 3) | (t << 61);
That is, merely rotate the input right by 3 bits first. Disaster! Almost all
tests suffered major collisions, and the old test 235424 and the new one 344919.
What's going on? With hindsight, it's obvious: multiplication is a horrible
"mixer" _in this context_. Because nearly all the tests are slinging little
integers, almost all the input variation is in the last handful of bits.
Multiplication is great for spreading low-order bits around. But rotate them
to the high end, and multiplication is next to useless: it only propagates
them "to the left", and they're already at the left end then. This part has
virtually nothing to do with whether + or ^ is used, or with whether
multiplication is done before or after. It's actually the multiplication
that's the weakness, and has nothing to do with which multiplier is used.
This isn't a problem with FNV or DJB when used _as intended_, because their
output is much wider than their inputs. The high-order bit of an input for
them is still a lower-order bit to their much wider multiplication, and
eventually propagates. It isn't a problem for "multiplicative hashing"
algorithms either, because those use a double-width multiply and only retain
the _high_ half of a double-width product. We're only retaining the _lower_
half.
I also noted before that replacing the shift-xor with the frozenset hash's
input scrambler:
t = ((t ^ 89869747UL) ^ (t << 16)) * 3644798167UL;
worked great. But it's equally a disaster if the inputs are rotated right by 3
first. Same reason: it too only propagates "to the left".
So almost all tuple hash schemes on the table, both current and various
experiments here, are systematic disasters if input hashes vary mostly in the
high-order bits. We haven't noticed because the current string hashes vary
about the same in all bit positions, while contiguous ranges of not-huge ints
have hashes that vary in the low-order bits.
The only schemes in all these messages that are "obviously immune" are those
that used a (any) flavor of FNV or DJB as intended, treating each input hash as
a sequence of unsigned bytes.
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue34751>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe:
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com