Tim Peters <t...@python.org> added the comment:

>> people already wrote substantial test suites dedicated
>> to that sole purpose, and we should aim to be "mere
>> consumers" of functions that pass _those_ tests.

> There are hash functions that pass those tests which
> are still bad in practice when used as tuple hash function.

Really?  Which one(s)?  If you mean that some fail badly when you _replace_ 
the constants they picked, I doubt they'd accept that as proof of anything.  
Things like FNV and DJB score poorly on SMHasher to begin with.
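
For concreteness, here's 64-bit FNV-1a in a few lines of Python - my own 
sketch, using FNV's published constants.  The scores SMHasher reports are for 
_these_ constants; swap in arbitrary "random-looking" ones and you're no 
longer testing the same function:

    FNV64_OFFSET = 0xcbf29ce484222325  # published 64-bit offset basis
    FNV64_PRIME = 0x100000001b3        # published 64-bit FNV prime

    def fnv1a_64(data: bytes) -> int:
        h = FNV64_OFFSET
        for byte in data:
            h ^= byte                            # fold in one input byte
            h = (h * FNV64_PRIME) & (2**64 - 1)  # multiply; keep low 64 bits
        return h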


> That's really unfortunate, but it's a fact that we need
> to live with.

I'm surprised they do as well as they do, given that they use fewer than a 
handful of invertible transformations per input, on state of the same bit 
width as the inputs.
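
To spell out what "invertible" buys:  each step is a bijection on 64-bit 
values, so no information about the inputs ever gets thrown away.  A hedged 
sketch (the multiplier happens to be xxHash64's first prime, but any odd 
constant is invertible mod 2**64; needs Python 3.8+ for the modular inverse):

    MASK = (1 << 64) - 1
    PRIME = 11400714785074694791         # xxHash64's prime 1; any odd works
    PRIME_INV = pow(PRIME, -1, 1 << 64)  # multiplicative inverse mod 2**64

    def rotl(x, r):
        return ((x << r) | (x >> (64 - r))) & MASK

    def rotr(x, r):
        return ((x >> r) | (x << (64 - r))) & MASK

    def step(state, x):    # one per-input step: add, rotate, multiply
        return (rotl((state + x) & MASK, 31) * PRIME) & MASK

    def unstep(state, x):  # its exact inverse, run backwards
        return (rotr((state * PRIME_INV) & MASK, 31) - x) & MASK

    assert unstep(step(12345, 999), 999) == 12345

If a step _isn't_ invertible, distinct states can merge and never un-merge, 
and that's where systematic collisions come from.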

I don't expect them to be immune to all easily-provoked "worse than random" 
cases, though.  Over time, these hashes change while keeping the same name.  As 
noted before, this is at least the second version of SeaHash.  The best of the 
original algorithms of this kind - MurmurHash - is on its third version now.

The motivation is the same:  someone bumps into real-life cases where they do 
_way way way_ worse than "random", and so they try to repair that as cheaply as 
possible.

Most of these are designed for data centers to do cheap-as-possible, 
reasonably robust fingerprinting of giant data blobs.  They could have used, 
e.g., SHA-2 for highly robust fingerprinting, but they don't want to pay its 
much higher runtime cost.

If you can provoke any deviation from randomness in that kind of hash, it's 
considered "broken" and is never used again.  But in these far cheaper hashes, 
it's considered part of the tradeoffs.  In the Reddit thread about SeaHash I 
cited before, the person who suggested the current transformations noted that, 
just by thinking about how it worked, he could still find weaknesses in some 
areas of the input space - but far milder ones than he found, the same way, in 
the original version.

That doesn't imply there aren't far worse weaknesses in areas he _didn't_ think 
about.  So it goes.

> It means that we may need to do some adjustments to
> the hash functions.

I'm fine with the faithful (barring errors on my part) xxHash I posted here 
before.  And I don't care whether it works "better" or "worse" on Python's tiny 
test suite if its constants are replaced, or its algorithm is "tweaked".  When 
xxHash version 2 is released, I want it to be as mindless as possible to 
replace the guts of Python's tuple-hash loop with the guts of xxHash version 2.
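
To make "the guts" concrete, here's a Python model of the loop's shape - my 
illustration of the structure, not the actual code I posted (an xxHash64-style 
round per item; the real xxHash64 also runs a final avalanche step, omitted 
here):

    MASK = (1 << 64) - 1
    P1 = 11400714785074694791  # xxHash64 prime 1
    P2 = 14029467366897019727  # xxHash64 prime 2
    P5 = 2870177450012600261   # xxHash64 prime 5, used to seed the state

    def rotl(x, r):
        return ((x << r) | (x >> (64 - r))) & MASK

    def tuple_hash(items):
        acc = P5
        for item in items:
            acc = (acc + hash(item) * P2) & MASK  # inject one item's hash
            acc = (rotl(acc, 31) * P1) & MASK     # xxHash64-style round
        return (acc + len(items)) & MASK          # fold in the length

The point of staying faithful is exactly that upgrading to a new version 
should be a mechanical swap of the round function, nothing more.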

It's easy to believe that SMHasher has nothing testing the bit patterns 
produced by mixing two's-complement integers of similar magnitude but opposite 
signs.  That's not the kind of data giant data centers are likely to have much 
of ;-)  If Appleby can be talked into _adding_ that kind of keyset data to 
SMHasher, then the next generations of these hashes will deal with it for us.  
But an objectively small number of excess collisions in some such cases isn't 
enough of an annoyance now to make me want to bother.
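
If anyone wants to play with that kind of keyset, it's a few lines.  My quick 
sketch - h is whatever candidate tuple hash is under test; passing CPython's 
own hash gives the current behavior as a baseline:

    from itertools import product
    from collections import Counter

    def excess_collisions(h, keys):
        counts = Counter(h(k) for k in keys)
        return len(keys) - len(counts)  # colliding keys beyond the first

    # small positive and negative ints of similar magnitude, mixed in tuples
    keys = list(product(range(-3, 4), repeat=4))  # 7**4 == 2401 tuples
    print(excess_collisions(hash, keys))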
