Tim Peters <[email protected]> added the comment:
>> Why do you claim the original was "too small"? Too small for
>> what purpose?
> If the multiplier is too small, then the resulting hash values are
> small too. This causes collisions to appear for smaller numbers:
All right! An actual reason ;-) So there are two things so far I clearly
agree with, but they're independent of "the problem" this report was opened
about:
1. On a 64-bit box life would be better if we picked a bigger multiplier. This
should be done via preprocessor #if selection, though, not via the inscrutable
rules C uses to cut back an integer literal out of range for the type of the
variable it's assigned to.
2. A raw integer hash of -1 should be changed to something other than -2. It
was always a Poor Idea to pick an integer of small magnitude for the
substitute. Some newer types didn't repeat this mistake. For example,
frozensets replace a raw hash of -1 with
590923713. Alas, this is hard to change now, because hashable objects of any
type that compare equal to -1 also need to return the same substitute value
(e.g., hash(-1.0) == hash(decimal.Decimal("-100e-2")) == hash(-1+0j) == -2).
So I'm afraid it's been hard-coded into user-defined numeric classes too :-(
There would be no obvious problem in changing the _tuple_ hash to use a
different substitute value, although it's not clear that would buy anything
either. hash(-1) == -2 was the original sin.
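The constraint described above is easy to check interactively. A quick demonstration of the CPython behavior in question (just the observable facts, nothing new):

```python
import decimal

# CPython's C-level hash protocol reserves -1 as an error return, so a
# raw integer hash of -1 is replaced by the substitute value -2:
assert hash(-1) == -2

# Objects that compare equal must hash equal, which is how the -2
# substitute leaks into every numeric type that can equal -1:
assert hash(-1.0) == -2
assert hash(complex(-1, 0)) == -2
assert hash(decimal.Decimal("-100e-2")) == -2  # -100e-2 == -1
```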
[about multipliers]
> Because it's really just random chance.
> ...
> Ideally, it should be just a random number.
How do you know this? I'm not aware of any research papers about picking
multipliers in this context, but would love to see one.
I am aware of much research about multipliers in the somewhat related contexts
of linear congruential pseudo-random number generators, and of "multiplicative
hashing", and "any random multiplier is fine" couldn't be further from the
truth _in_ those contexts.
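To make that concrete in the LCG setting: full period is governed by the Hull-Dobell conditions, and a "random" multiplier usually fails them. A toy experiment (my illustration, nothing to do with CPython's code):

```python
def lcg_period(mult, inc=1, mod=1 << 16, seed=1):
    # Iterate x -> (mult*x + inc) % mod and measure the length of the
    # cycle the sequence falls into (a full-period LCG gives mod).
    seen_at = {}
    x = seed
    for step in range(mod + 1):
        if x in seen_at:
            return step - seen_at[x]
        seen_at[x] = step
        x = (mult * x + inc) % mod

# Hull-Dobell: with inc odd and mult % 4 == 1, the LCG mod 2**16
# visits every state before repeating:
assert lcg_period(5) == 1 << 16

# An arbitrary multiplier does far worse; an even multiplier isn't
# even a bijection, so the period collapses:
assert lcg_period(6) < 1 << 16
```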
Guido picked a prime to begin with (actually, at first it was 3(!)), and
comments in the current code sure say that whoever added this part was keen on
primes too:
"""
/* The addend 82520, was selected from the range(0, 1000000) for
generating the greatest number of prime multipliers for tuples
up to length eight:
"""
I don't know that primes are important here, but neither do I know that they're
_not_ important here. Widely used production code isn't the place to conduct
experiments, so the status quo is favored in the absence of truly compelling
reasons. Just repeating the raw assertion "random is fine" isn't compelling.
If it were true, we could, e.g., multiply by a large power of 2 via shifting
instead, and save a few cycles.
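To see why a power-of-two multiplier in particular would be a disaster, here's a toy model of an FNV-style loop (my sketch, combining raw small ints with a fixed multiplier; it is not CPython's actual tuple hash):

```python
from itertools import product

MASK = (1 << 64) - 1  # model 64-bit unsigned wraparound

def toy_tuple_hash(t, mult):
    # Simplified FNV-style accumulation with a fixed multiplier.
    x = 0x345678
    for y in t:
        x = ((x ^ y) * mult) & MASK
    return x

prime_mult = 1000003   # the classic prime multiplier
pow2_mult = 1 << 20    # an arbitrary power of two

tuples = list(product(range(16), repeat=4))
prime_hashes = {toy_tuple_hash(t, prime_mult) for t in tuples}
pow2_hashes = {toy_tuple_hash(t, pow2_mult) for t in tuples}

# With mult = 2**20, each multiply is a pure 20-bit left shift, so
# after four iterations the first element has been shifted entirely
# out of the 64-bit result (4 * 20 = 80 > 64).  Tuples differing only
# in their first element always collide:
assert toy_tuple_hash((0, 1, 2, 3), pow2_mult) == \
       toy_tuple_hash((9, 1, 2, 3), pow2_mult)
assert len(pow2_hashes) < len(prime_hashes)
```

With the shifting multiplier, only the last three elements survive in the result at all, so the 65,536 inputs can produce at most 4,096 distinct hashes.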
For my best guess at what "the problem" here actually is, in the
>>> from itertools import product
>>> cands = list(range(-10, -1)) + list(range(9))
>>> len(set(hash(t) for t in product(cands, repeat=4)))
example, which currently (on my 64-bit box) generates 35,380 distinct hash
codes from the 104,976 inputs, ALL the collisions go away merely by changing
the character "^" to "+" in the current code.
Which was your original suggestion. Which you now appear to be opposed to?
If so, I'm unclear about why.
That's a change I could probably live with, although I have no strong reason to
believe it couldn't cause problems for applications that currently work fine.
I don't have more time to think about that now, though.
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue34751>
_______________________________________