Tim Peters <[email protected]> added the comment:
>> Why do you claim the original was "too small"? Too small for
>> what purpose?
> If the multiplier is too small, then the resulting hash values are
> small too. This causes collisions to appear for smaller numbers:
All right! An actual reason ;-) So there are two things so far I clearly
agree with, but they're independent of "the problem" this report was opened
about:
1. On a 64-bit box life would be better if we picked a bigger multiplier. This
should be done via preprocessor #if selection, though, not via the inscrutable
rules C uses to cut back an integer literal out of range for the type of the
variable it's assigned to.
2. A raw integer hash of -1 should be changed to something other than -2. It
was always a Poor Idea to pick an integer of small magnitude for the
substitute. Some newer types didn't repeat this mistake. For example,
frozensets replace a raw hash of -1 with
590923713. Alas, this is hard to change now, because hashable objects of any
type that compare equal to -1 also need to return the same substitute value
(e.g., hash(-1.0) == hash(decimal.Decimal("-100e-2")) == hash(-1+0j) == -2).
So I'm afraid it's been hard-coded into user-defined numeric classes too :-(
There would be no obvious problem in changing the _tuple_ hash to use a
different substitute value, although it's not clear that would buy anything
either. hash(-1) == -2 was the original sin.
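The constraint described above is easy to check interactively. A quick demonstration of the CPython behavior in question (just the observable facts, nothing new):

```python
import decimal

# CPython's C-level hash protocol reserves -1 as an error return, so a
# raw integer hash of -1 is replaced by the substitute value -2:
assert hash(-1) == -2

# Objects that compare equal must hash equal, which is how the -2
# substitute leaks into every numeric type that can equal -1:
assert hash(-1.0) == -2
assert hash(complex(-1, 0)) == -2
assert hash(decimal.Decimal("-100e-2")) == -2  # -100e-2 == -1
```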
[about multipliers]
> Because it's really just random chance.
> ...
> Ideally, it should be just a random number.
How do you know this? I'm not aware of any research papers about picking
multipliers in this context, but would love to see one.
I am aware of much research about multipliers in the somewhat related contexts
of linear congruential pseudo-random number generators, and of "multiplicative
hashing", and "any random multiplier is fine" couldn't be further from the
truth _in_ those contexts.
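To make that concrete in the LCG setting: full period is governed by the Hull-Dobell conditions, and a "random" multiplier usually fails them. A toy experiment (my illustration, nothing to do with CPython's code):

```python
def lcg_period(mult, inc=1, mod=1 << 16, seed=1):
    # Iterate x -> (mult*x + inc) % mod and measure the length of the
    # cycle the sequence falls into (a full-period LCG gives mod).
    seen_at = {}
    x = seed
    for step in range(mod + 1):
        if x in seen_at:
            return step - seen_at[x]
        seen_at[x] = step
        x = (mult * x + inc) % mod

# Hull-Dobell: with inc odd and mult % 4 == 1, the LCG mod 2**16
# visits every state before repeating:
assert lcg_period(5) == 1 << 16

# An arbitrary multiplier does far worse; an even multiplier isn't
# even a bijection, so the period collapses:
assert lcg_period(6) < 1 << 16
```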
Guido picked a prime to begin with (actually, at first it was 3(!)), and
comments in the current code sure say that whoever added this part was keen on
primes too:
"""
/* The addend 82520, was selected from the range(0, 1000000) for
generating the greatest number of prime multipliers for tuples
up to length eight:
"""
I don't know that primes are important here, but neither do I know that they're
_not_ important here. Widely used production code isn't the place to conduct
experiments, so the status quo is favored in the absence of truly compelling
reasons. Just repeating the raw assertion "random is fine" isn't compelling.
If it were true, we could, e.g., multiply by a large power of 2 via shifting
instead, and save a few cycles.
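To see why a power-of-two multiplier in particular would be a disaster, here's a toy model of an FNV-style loop (my sketch, combining raw small ints with a fixed multiplier; it is not CPython's actual tuple hash):

```python
from itertools import product

MASK = (1 << 64) - 1  # model 64-bit unsigned wraparound

def toy_tuple_hash(t, mult):
    # Simplified FNV-style accumulation with a fixed multiplier.
    x = 0x345678
    for y in t:
        x = ((x ^ y) * mult) & MASK
    return x

prime_mult = 1000003   # the classic prime multiplier
pow2_mult = 1 << 20    # an arbitrary power of two

tuples = list(product(range(16), repeat=4))
prime_hashes = {toy_tuple_hash(t, prime_mult) for t in tuples}
pow2_hashes = {toy_tuple_hash(t, pow2_mult) for t in tuples}

# With mult = 2**20, each multiply is a pure 20-bit left shift, so
# after four iterations the first element has been shifted entirely
# out of the 64-bit result (4 * 20 = 80 > 64).  Tuples differing only
# in their first element always collide:
assert toy_tuple_hash((0, 1, 2, 3), pow2_mult) == \
       toy_tuple_hash((9, 1, 2, 3), pow2_mult)
assert len(pow2_hashes) < len(prime_hashes)
```

With the shifting multiplier, only the last three elements survive in the result at all, so the 65,536 inputs can produce at most 4,096 distinct hashes.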
For my best guess at what "the problem" here actually is, in the
>>> from itertools import product
>>> cands = list(range(-10, -1)) + list(range(9))
>>> len(set(hash(t) for t in product(cands, repeat=4)))
example, which currently (on my 64-bit box) generates 35,380 distinct hash
codes from the 104,976 inputs, ALL the collisions go away merely by changing
the character "^" to "+" in the current code.
Which was your original suggestion. Which you now appear to be opposed to?
If so, I'm unclear about why.
That's a change I could probably live with, although I have no strong reason to
believe it couldn't cause problems for applications that currently work fine.
I don't have more time to think about that now, though.
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue34751>
_______________________________________