[issue34751] Hash collisions for tuples

2018-11-05 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

Many thanks!

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-27 Thread Raymond Hettinger


Change by Raymond Hettinger :


--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-27 Thread Raymond Hettinger


Raymond Hettinger  added the comment:


New changeset aeb1be5868623c9cd9cf6d7de3015a43fb005815 by Raymond Hettinger 
(jdemeyer) in branch 'master':
bpo-34751: improved hash function for tuples (GH-9471)
https://github.com/python/cpython/commit/aeb1be5868623c9cd9cf6d7de3015a43fb005815


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-10 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> it will typically change only the last two bits of the final result

Which is great if all that you care about is avoiding collisions.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-09 Thread Tim Peters


Tim Peters  added the comment:

>> Changes initialization to add in the length:

> What's the rationale for that change? You always asked
> me to stay as close as possible to the "official" hash function
> which adds in the length at the end. Is there an actual benefit
> from doing it in the beginning?

The heart of xxHash is the state-updating code, neither initialization nor 
finalization.  Likewise for SeaHash or any of the others.

Without the post-loop avalanche code, adding the length at the end has very 
little effect - it will typically change only the last two bits of the final 
result.  _With_ the avalanche code, the length can affect every bit in the 
result.  But adding it in at the start also achieves that - same as changing 
any bit in the initial accumulator value.

Adding it in at the start instead also takes the addition off the critical 
path.  Which may or may not save a cycle or two (depending on processor and 
compiler), but can't hurt speed.

I noted before that adding the length at the end _can_ break out of a zero 
fixed-point (accumulator becomes 0 and all remaining values hash to 0).  Adding 
it at the start loses that.

So there is a theoretical danger there ... OK, trying it both ways I don't see 
any significant differences in my test results or a timing difference outside 
the noise range.  So I'm happy either way.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-09 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

I pushed an update at PR 9471. I think I took into account all your comments, 
except for moving the length addition from the end to the begin of the function.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-09 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> Changes initialization to add in the length:

What's the rationale for that change? You always asked me to stay as close as 
possible to the "official" hash function which adds in the length at the end. 
Is there an actual benefit from doing it in the beginning?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-08 Thread Tim Peters


Tim Peters  added the comment:

We need to worry about timing too :-(  I'm using this as a test.  It's very 
heavy on using 3-tuples of little ints as dict keys.  Getting hash codes for 
ints is relatively cheap, and there's not much else going on, so this is 
intended to be very sensitive to changes in the speed of tuple hashing:

def doit():
from time import perf_counter as now
from itertools import product

s = now()
d = dict.fromkeys(product(range(150), repeat=3), 0)
for k in d:
 d[k] += 1
for k in d:
 d[k] *= 2
f = now()
return f - s

I run that five times in a loop on a mostly quiet box, and pick the smallest 
time as "the result".

Compared to current master, a 64-bit release build under Visual Studio takes 
20.7% longer.  Ouch!  That's a real hit.

Fiddling the code a bit (see the PR) to convince Visual Studio to emit a rotate 
instruction instead of some shifts and an add reduces that to 19.3% longer.  A 
little better.

The variant I discussed last time (add in the length at the start, and get rid 
of all the post-loop avalanche code) reduces it to 8.88% longer.  The avalanche 
code is fixed overhead independent of tuple length, so losing it is more 
valuable (for relative speed) the shorter the tuple.

I can't speed it more.  These high-quality hashes have unforgiving long 
critical paths, and every operation appears crucial to their good results in 
_some_ test we already have.  But "long" is relative to our current tuple hash, 
which does relatively little to scramble bits, and never "propagates right" at 
all.  In its loop, it does one multiply nothing waits on, and increments the 
multplier for the next iteration while the multiply is going on.

Note:  "the real" xxHash runs 4 independent streams, but it only has to read 8 
bytes to get the next value to fold in.  That can go on - in a single thread - 
while other streams are doing arithmetic (ILP).  We pretty much have to "stop 
the world" to call PyObject_Hash() instead.

We could call that 4 times in a row and _then_ start arithmetic.  But most 
tuples that get hashed are probably less than 4 long, and the code to mix 
stream results together at the end is another bottleneck.  My first stab at 
trying to split it into 2 streams ran substantially slower on realistic-length 
(i.e., small) tuples - so that was also my last stab ;-)

I can live with the variant.  When I realized we never "propagate right" now, I 
agreed with Jeroen that the tuple hash is fundamentally "broken", despite that 
people hadn't reported it as such yet, and despite that that flaw had 
approximately nothing to do with the issue this bug report was opened about.  
Switching to "a real" hash will avoid a universe of bad cases, and xxHash 
appears to be as cheap as a hash without glaringly obvious weaknesses gets:  
two multiplies, an add, and a rotate.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-07 Thread Tim Peters


Tim Peters  added the comment:

Here's htest.py output from

git pull https://github.com/jdemeyer/cpython.git bpo34751

and a variation.  The number of collisions in the variation appear in square 
brackets at the end of each line.

32-bit build:

range(100) by 3; 32-bit hash codes; mean 116.42 got 0 [0]
-10 .. 8 by 4; 32-bit hash codes; mean 1.28 got 0 [0]
-50 .. 50 less -1 by 3; 32-bit hash codes; mean 116.42 got 0 [0]
0..99 << 60 by 3; 32-bit hash codes; mean 116.42 got 0 [0]
[-3, 3] by 20; 32-bit hash codes; mean 128.00 got 140 [108]
[0.5, 0.25] by 20; 32-bit hash codes; mean 128.00 got 99 [154]
old tuple test; 32-bit hash codes; mean 7.43 got 4 [2]
new tuple test; 32-bit hash codes; mean 13.87 got 19 [21]

64-bit build:

range(100) by 3; 64-bit hash codes; mean 0.00 got 0 [0]
range(100) by 3; 32-bit lower hash codes; mean 116.42 got 130 [0]
range(100) by 3; 32-bit upper hash codes; mean 116.42 got 94 [6]
-10 .. 8 by 4; 64-bit hash codes; mean 0.00 got 0 [0]
-10 .. 8 by 4; 32-bit lower hash codes; mean 1.28 got 1 [0]
-10 .. 8 by 4; 32-bit upper hash codes; mean 1.28 got 1 [0]
-50 .. 50 less -1 by 3; 64-bit hash codes; mean 0.00 got 0 [0]
-50 .. 50 less -1 by 3; 32-bit lower hash codes; mean 116.42 got 128 [0]
-50 .. 50 less -1 by 3; 32-bit upper hash codes; mean 116.42 got 116 [8]
0..99 << 60 by 3; 64-bit hash codes; mean 0.00 got 0 [0]
0..99 << 60 by 3; 32-bit lower hash codes; mean 116.42 got 123 [258]
0..99 << 60 by 3; 32-bit upper hash codes; mean 116.42 got 121 [0]
[-3, 3] by 20; 64-bit hash codes; mean 0.00 got 0 [0]
[-3, 3] by 20; 32-bit lower hash codes; mean 128.00 got 129 [117]
[-3, 3] by 20; 32-bit upper hash codes; mean 128.00 got 137 [115]
[0.5, 0.25] by 20; 64-bit hash codes; mean 0.00 got 0 [0]
[0.5, 0.25] by 20; 32-bit lower hash codes; mean 128.00 got 126 [131]
[0.5, 0.25] by 20; 32-bit upper hash codes; mean 128.00 got 137 [130]
old tuple test; 64-bit hash codes; mean 0.00 got 0 [0]
old tuple test; 32-bit lower hash codes; mean 7.43 got 12 [5]
old tuple test; 32-bit upper hash codes; mean 7.43 got 54 [52]
new tuple test; 64-bit hash codes; mean 0.00 got 0 [0]
new tuple test; 32-bit lower hash codes; mean 13.87 got 10 [6]
new tuple test; 32-bit upper hash codes; mean 13.87 got 20 [30]

They're all fine by me.  This is what the variation does:

1. Removes all "final mix" code after the loop ends.
2. Changes initialization to add in the length:

Py_uhash_t acc = _PyHASH_XXPRIME_5 + (Py_uhash_t)len;

The vast bulk of the "final mix" code applies a chain of tuple-agnostic 
permutations.  The only exception is adding in the length, which #2 moves to 
the start instead.

Applying permutations after the loop ends changes nothing about the number of 
full-width hash collisions, which is Python's _primary_ concern.  Two 
full-width hash codes are the same after the final mix if and only if they're 
the same before the final mix starts.

In xxHash they're concerned about getting close to "avalanche" perfection (each 
input bit affects all final output bits with probability about 0.5 for each).  
There aren't enough permutations _inside_ the loop to achieve that for the last 
input or two, so they pile up more permutations after the loop.

While we do care about "propagate left" and "propagate right" to mix up very 
regular and/or sparse hash codes, "avalanche perfection" is of no particular 
value to us.  To the contrary, in _typical_ cases like `product(range(100), 
repeat=3)` it's better for us _not_ to destroy all the regularity:

range(100) by 3; 32-bit lower hash codes; mean 116.42 got 130 [0]
range(100) by 3; 32-bit upper hash codes; mean 116.42 got 94 [6]

"Avalanche perfection" is what drives the number of collisions up with the 
final mix code intact.  Without it, we get "waaay better than random" 
numbers of collisions.

The other reason it's attractive to drop it:  the final mix code is one long 
critical path (no instruction-level parallelism), adding 8 serialized machine 
instructions including two multiplies.  That's a real expense for 
typically-short tuples used as dict keys or set elements.

Nothing is anywhere near disaster territory either way, although "worse than 
random" remains quite possible, especially when cutting a 64-bit hash in half 
(do note that, either way, there are no collisions in any test when retaining 
the full 64-bit hash code).

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-07 Thread Tim Peters


Tim Peters  added the comment:

Attaching htest.py so we have a common way to compare what various things do on 
the tests that have been suggested.  unittest sucks for that.  doctest too.  
Here's current code output from a 32-bit build; "ideally" we want "got" values 
not much larger than "mean" (these are counting collisions):

range(100) by 3; 32-bit hash codes; mean 116.42 got 0
-10 .. 8 by 4; 32-bit hash codes; mean 1.28 got 69,596
-50 .. 50 less -1 by 3; 32-bit hash codes; mean 116.42 got 708,066
0..99 << 60 by 3; 32-bit hash codes; mean 116.42 got 875,000
[-3, 3] by 20; 32-bit hash codes; mean 128.00 got 1,047,552
[0.5, 0.25] by 20; 32-bit hash codes; mean 128.00 got 1,048,568
old tuple test; 32-bit hash codes; mean 7.43 got 6
new tuple test; 32-bit hash codes; mean 13.87 got 102,922

And under a 64-bit build, where the full hash code is considered, and also its 
lowest and highest 32 bits.  Note, e.g., that the old tuple test is an utter 
disaster if we only look at the high 32 bits.  Which is actually fine by me - 
the point of this is to show what happens.  Judging is a different (albeit 
related) issue ;-)

range(100) by 3; 64-bit hash codes; mean 0.00 got 0
range(100) by 3; 32-bit lower hash codes; mean 116.42 got 0
range(100) by 3; 32-bit upper hash codes; mean 116.42 got 989,670
-10 .. 8 by 4; 64-bit hash codes; mean 0.00 got 69,596
-10 .. 8 by 4; 32-bit lower hash codes; mean 1.28 got 69,596
-10 .. 8 by 4; 32-bit upper hash codes; mean 1.28 got 101,438
-50 .. 50 less -1 by 3; 64-bit hash codes; mean 0.00 got 708,066
-50 .. 50 less -1 by 3; 32-bit lower hash codes; mean 116.42 got 708,066
-50 .. 50 less -1 by 3; 32-bit upper hash codes; mean 116.42 got 994,287
0..99 << 60 by 3; 64-bit hash codes; mean 0.00 got 500,000
0..99 << 60 by 3; 32-bit lower hash codes; mean 116.42 got 875,000
0..99 << 60 by 3; 32-bit upper hash codes; mean 116.42 got 989,824
[-3, 3] by 20; 64-bit hash codes; mean 0.00 got 1,047,552
[-3, 3] by 20; 32-bit lower hash codes; mean 128.00 got 1,047,552
[-3, 3] by 20; 32-bit upper hash codes; mean 128.00 got 1,047,552
[0.5, 0.25] by 20; 64-bit hash codes; mean 0.00 got 1,048,544
[0.5, 0.25] by 20; 32-bit lower hash codes; mean 128.00 got 1,048,575
[0.5, 0.25] by 20; 32-bit upper hash codes; mean 128.00 got 1,048,544
old tuple test; 64-bit hash codes; mean 0.00 got 0
old tuple test; 32-bit lower hash codes; mean 7.43 got 6
old tuple test; 32-bit upper hash codes; mean 7.43 got 128,494
new tuple test; 64-bit hash codes; mean 0.00 got 102,920
new tuple test; 32-bit lower hash codes; mean 13.87 got 102,922
new tuple test; 32-bit upper hash codes; mean 13.87 got 178,211

--
Added file: https://bugs.python.org/file47857/htest.py

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-07 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

I updated PR 9471 with a tuple hash function based on xxHash. The only change 
w.r.t. the official xxHash specification is that I'm not using parallellism and 
just using 1 accumulator. Please have a look.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-06 Thread Tim Peters


Tim Peters  added the comment:

Thinking about the way xxHash works prompted me to try this initialization 
change:

x = 0x345678UL + (Py_uhash_t)len; // adding in the length is new

That was a pure win in my tests, slashing the number of collisions in the new 
tuple test, 32-bit build, from 51 to 17.  In the 64-bit build, cut from 11 to 
10, but when looking at the high 32 hash bits instead from 27 to 15.  There 
were also small improvements in other 32-bit build collision stats.

Full-blown xxHash (& SeaHash, and murmurhash, and ...) also fold in the length, 
but not inside their core loops.  It's useful info!

But they fold it in _after_ their core loops, not before.  They're apparently 
worried about this:  if their internal state ever reaches 0, that's a fixed 
point so long as all remaining incoming data is zeroes (in Python, e.g., 
picture adding one or more trailing integer or float zeroes or ..., which hash 
to 0).  Folding in the length at the end snaps it slightly out of zeroland.

I don't really care about that.  Folding in the length at the start quickly 
leads to large differences across iterations for tuples of different lengths 
(thanks to repeated mulitiplication and "propagate right" steps), which is 
especially helpful in things like the nested tuple tests where there's almost 
as much variation in tuple lengths as there is in the few little integers 
bottom-level tuples contain.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-06 Thread Tim Peters


Tim Peters  added the comment:

As a sanity check, I tried the 32-bit version of MurmurHash2 (version 3 doesn't 
have a 32-bit version).  All of my Python tests had collisions, and while the 
new tuple test dropped to 15, product([0.5, 0.25], repeat=20) skyrocketed from 
141 (32-bit xxHash) to 3848.

I could live with that too, but xxHash does better overall on our tiny tests, 
and requires less code and fewer cycles.

Here's what I used:

static Py_hash_t
tuplehash(PyTupleObject *v)
{
const Py_uhash_t m = 0x5bd1e995;
const int r = 24;
Py_uhash_t x, t;  /* Unsigned for defined overflow behavior. */
Py_hash_t y;
Py_ssize_t len = Py_SIZE(v);
PyObject **p;
x = 0x345678UL ^ (Py_uhash_t)len;
p = v->ob_item;
while (--len >= 0) {
y = PyObject_Hash(*p++);
if (y == -1)
return -1;
Py_uhash_t t = (Py_uhash_t)y;
t *= m;
t ^= t >> r;
t *= m;
x *= m;
x ^= t;
}
x ^= x >> 13;
x *= m;
x ^= x >> 15;
if (x == (Py_uhash_t)-1)
x = -2;
return x;
}

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-04 Thread Tim Peters


Tim Peters  added the comment:

>> people already wrote substantial test suites dedicated
>> to that sole purpose, and we should aim to be "mere
>> consumers" of functions that pass _those_ tests.

> There are hash functions that pass those tests which
> are still bad in practice when used as tuple hash function.

Really?  Which one(s)?  If you're talking about that some fail badly when you 
_replace_ the constants they picked, I doubt they'd accept that as proof of 
anything.  Things like FNV and DJB score poorly on SMHasher to begin with.


> That's really unfortunate, but it's a fact that we need
> to live with.

I'm surprised they do as well as they do using less than a handful of 
invertible transformations per input, using state of the same bit width as the 
inputs.

I don't expect them to be immune to all easily-provoked "worse than random" 
cases, though.  Over time, these hashes change while keeping the same name.  As 
noted before, this is at least the second version of SeaHash.  The best of the 
original algorithms of this kind - murmurhash - is on its 3rd version now.

The motivation is the same:  someone bumps into real-life cases where they do 
_way way way_ worse than "random", and so they try to repair that as cheaply as 
possible.

Most of these are designed for data centers to do cheap-as-possible reasonably 
robust fingerprinting of giant data blobs.  They could already have used, e.g., 
SHA-2 for highly robust fingerprinting, but they don't want to pay the very 
much higher runtime cost.

If you can provoke any deviation from randomness in that kind of hash, it's 
considered "broken" and is never used again.  But in these far cheaper hashes, 
it's considered part of the tradeoffs.  If you read the Reddit thread about 
SeaHash I cited before, the person who suggested the current transformations 
noted that there were still weaknesses in some areas of the input space he was 
able to find just by thinking about how it worked, but far milder than the 
weaknesses he found by thinking about how the original version worked.

That doesn't imply there aren't far worse weaknesses in areas he _didn't_ think 
about.  So it goes.

> It means that we may need to do some adjustments to
> the hash functions.

I'm fine with the faithful (barring errors on my part) xxHash I posted here 
before.  And I don't care whether it works "better" or "worse" on Python's tiny 
test suite if its constants are replaced, or its algorithm is "tweaked".  When 
xxHash version 2 is released, I want it to be as mindless as possible to 
replace the guts of Python's tuple-hash loop with the guts of xxHash version 2.

It's easy to believe that SMHasher has nothing testing the bit patterns 
produced by mixing two's-complement integers of similar magnitude but opposite 
signs.  That's not the kind of data giant data centers are likely to have much 
of ;-)  If Appleby can be talked into _adding_ that kind of keyset data to 
SMHasher, then the next generations of these hashes will deal with it for us.  
But an objectively small number of collisions more than expected in some such 
cases now doesn't annoy me enough to want to bother.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-04 Thread Tim Peters


Change by Tim Peters :


--
components: +Interpreter Core -XML

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-04 Thread Tim Peters


Tim Peters  added the comment:

>> In the 64-bit build there are no collisions across my
>> tests except for 11 in the new tuple test.

> That's pretty bad actually. With 64 bits, you statistically
> expect something in the order of 10**-8 collisions. So
> what you're seeing is 9 orders of magnitude too many collisions.

You wrote the test, so you should remember that you throw away the top 32 bits 
of the 64-bit hash code :-)  Tossing 345130 balls into 2**32 bins has an 
expected mean of 13.9 collisions and sdev 3.7.  11 collisions is fine.

If I change the test to retain all the hash bits in the 64-bit build, there are 
no collisions in the new tuple test.

If I change it again to retain only the high 32 hash bits, 27 collisions.  
Which is statistically suspect, but still universes away from "disaster" 
territory.

--
components: +XML -Interpreter Core

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-04 Thread Tim Peters


Tim Peters  added the comment:

> Note that I'm always considering parametrized
> versions of the hash functions that I'm testing.
> I'm replacing the fixed multiplier (all algorithms
> mentioned here have such a thing) by a random
> multiplier which is 3 mod 8.

I've explained before in some detail that this makes NO SENSE for SeaHash:  its 
multiplier was specially constructed to guarantee certain dispersion 
properties, to maximize the effectiveness of its "propagate right" step.

You already have my explanation about that, and a link to the original Reddit 
thread in which it was proposed (and eagerly accepted by SeaHash's primary 
author).  Also already explained that its multiplier appears to fail its design 
criteria at two specific bit positions.  Which I can repair, but I would do so 
not by changing Python's version, but by bringing it to SeaHash's author's 
attention.

OTOH, I haven't been able to find anything explaining why the xxHash authors 
picked what they picked.  But they certainly didn't have "random" in mind - 
nobody does.  You discovered for yourself by trying things that various 
properties make for bad multipliers, and people who work on these things "for 
real" knew that a long time ago.  They almost certainly know a great deal more 
that we haven't thought of at all.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-04 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> Taking an algorithm in wide use that's already known to get a top score on 
> SMHasher and fiddling it to make a "slight" improvement in one tiny Python 
> test doesn't make sense to me.

OK, I won't do that. The difference is not that much anyway (it increases the 
test success rate from about 85% to 90%)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-04 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> In the 64-bit build there are no collisions across my tests except for 11 in 
> the new tuple test.

That's pretty bad actually. With 64 bits, you statistically expect something in 
the order of 10**-8 collisions. So what you're seeing is 9 orders of magnitude 
too many collisions.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-04 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> Taking an algorithm in wide use that's already known to get a top score on 
> SMHasher and fiddling it to make a "slight" improvement in one tiny Python 
> test doesn't make sense to me.

What I'm doing is the most innocent change: just applying a fixed permutation 
on the *input* of the hash function. I'm not changing the actual hashing 
algorithm.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-04 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> people already wrote substantial test suites dedicated to that sole purpose, 
> and we should aim to be "mere consumers" of functions that pass _those_ tests.

There are hash functions that pass those tests which are still bad in practice 
when used as tuple hash function. That's really unfortunate, but it's a fact 
that we need to live with. It means that we may need to do some adjustments to 
the hash functions.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-04 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> Note:  I'm assuming that by "PRIME32_2" you mean 2246822519U

Yes indeed.

> and that "MULTIPLIER" means 2654435761U.

No, I mean a randomly chosen multiplier which is 3 mod 8.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-03 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> I've posted several SeaHash cores that suffer no collisions at all in any of 
> our tests (including across every "bad example" in these 100+ messages), 
> except for "the new" tuple test.  Which it also passed, most recently with 7 
> collisions.  That was under 64-bit builds, though, and from what follows I 
> figure you're only looking at 32-bit builds for now.

Note that I'm always considering parametrized versions of the hash functions 
that I'm testing. I'm replacing the fixed multiplier (all algorithms mentioned 
here have such a thing) by a random multiplier which is 3 mod 8. I keep the 
other constants. This allows me to look at the *probability* of passing the 
testsuite. That says a lot more than a simple yes/no answer for a single test. 
You don't want to pass the testsuite by random chance, you want to pass the 
testsuite because you have a high probability of passing it.

And I'm testing 32-bit hashes indeed since we need to support that anyway and 
the probability of collisions is high enough to get interesting statistical 
data.

For SeaHash, the probability of passing my new tuple test was only around 55%. 
For xxHash, this was about 85%. Adding some input mangling improved both 
scores, but the xxHash variant was still better than SeaHash.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-03 Thread Tim Peters


Tim Peters  added the comment:

Here's a complete xxHash-based implementation via throwing away C++-isms from

https://github.com/RedSpah/xxhash_cpp/blob/master/xxhash/xxhash.hpp

In the 64-bit build there are no collisions across my tests except for 11 in 
the new tuple test.

The 32-bit build fares worse:

- 3 collisions in the old tuple test.
- 51 collisions in the new tuple test.
- 173 collisions across product([-3, 3], repeat=20)
- 141 collisions across product([0.5, 0.25], repeat=20)
- no other collisions

The expected number of collisions from tossing 2**20 balls into 2**32 buckets 
is about 128, with standard deviation 11.3.  So 141 is fine, but 173 is highly 
suspect.  51 for the new tuple test is way out of bounds.

But I don't much care.  None of these are anywhere near disasters, and - as 
I've already said - I know of no non-crypto strength hash that doesn't suffer 
"worse than random" behaviors in some easily-provoked cases.  All you have to 
do is keep trying.

Much as with SeaHash, adding

t ^= t << 1;

at the top helps a whole lot in the "bad cases", cutting the new test's 
collisions to 7 in the 32-bit build.  It also cuts the product([-3, 3], 
repeat=20) collisions to 109.  But boosts the old tuple test's collisions to 
12.  I wouldn't do it:  it adds cycles for no realistic gain.  We don't really 
care whether the hash "is random" - we do care about avoiding catastrophic 
pile-ups, and there are none with an unmodified xxHash.

BTW, making that change also _boosts_ the number of "new test" collisions to 23 
in the 64-bit build (a little over its passing limit).

#define Py_NHASHBITS (SIZEOF_PY_HASH_T * CHAR_BIT)
#if Py_NHASHBITS >= 64
#define Py_HASH_MULT1 (Py_uhash_t)11400714785074694791ULL
#define Py_HASH_MULT2 (Py_uhash_t)14029467366897019727ULL
#define Py_HASH_LSHIFT 31
#elif Py_NHASHBITS >= 32
#define Py_HASH_MULT1 (Py_uhash_t)2654435761ULL
#define Py_HASH_MULT2 (Py_uhash_t)2246822519ULL
#define Py_HASH_LSHIFT 13
#else
#error "can't make sense of Py_uhash_t size"
#endif
#define Py_HASH_RSHIFT (Py_NHASHBITS - Py_HASH_LSHIFT)

static Py_hash_t
tuplehash(PyTupleObject *v)
{
Py_uhash_t x, t;  /* Unsigned for defined overflow behavior. */
Py_hash_t y;
Py_ssize_t len = Py_SIZE(v);
PyObject **p;
x = 0x345678UL;
p = v->ob_item;
while (--len >= 0) {
y = PyObject_Hash(*p++);
if (y == -1)
return -1;
t = (Py_uhash_t)y;
x += t * Py_HASH_MULT2;
x = (x << Py_HASH_LSHIFT) | (x >> Py_HASH_RSHIFT);
x *= Py_HASH_MULT1;
}
x += 97531UL;
if (x == (Py_uhash_t)-1)
x = -2;
return x;
}
#undef Py_NHASHBITS
#undef Py_HASH_MULT1
#undef Py_HASH_MULT2
#undef Py_HASH_LSHIFT
#undef Py_HASH_RSHIFT

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-03 Thread Tim Peters


Tim Peters  added the comment:

Thanks for looking at xxHash!  An advantage is that it already comes in 32- and 
64-bit versions.

> A (simplified and slightly modified version of) xxHash
> seems to work very well, much better than SeaHash.

?  I've posted several SeaHash cores that suffer no collisions at all in any of 
our tests (including across every "bad example" in these 100+ messages), except 
for "the new" tuple test.  Which it also passed, most recently with 7 
collisions.  That was under 64-bit builds, though, and from what follows I 
figure you're only looking at 32-bit builds for now.

>  Just like SeaHash, xxHash also works in parallel. But I'm not
> doing that and just using this for the loop:
>
>for y in t:
>y ^= y * (PRIME32_2 - 1)
>acc += y
>acc = ((acc << 13) + (acc >> 19))  # rotate left by 13 bits
>acc *= MULTIPLIER
>
> Plain xxHash does "y *= PRIME32_2" or equivalently
> "y += y * (PRIME32_2 - 1)". Replacing that += by ^= helps
> slightly with my new tuple test.

Taking an algorithm in wide use that's already known to get a top score on 
SMHasher and fiddling it to make a "slight" improvement in one tiny Python test 
doesn't make sense to me.  Does the variant also score well under SMHasher?  "I 
don't see why it wouldn't" is not an answer to that ;-)

Note that the change also slows the loop a bit - "acc += y" can't begin until 
the multiply finishes, and the following "acc *=" can't begin until the 
addition finishes.  Lengthening the critical path needs to buy something real 
to justify the extra expense.  I don't see it here.

For posterity:  the 64-bit version of xxHash uses different primes, and rotates 
left by 31 instead of by 13.

Note:  I'm assuming that by "PRIME32_2" you mean 2246822519U, and that 
"MULTIPLIER" means 2654435761U.  Which appear to be the actual multipliers used 
in, e.g.,

https://github.com/RedSpah/xxhash_cpp/blob/master/xxhash/xxhash.hpp

As always, I'm not a fan of making gratuitous changes.  In any case, without 
knowing the actual values you're using, nobody can replicate what you're doing.

That said, I have no reason to favor SeaHash over xxHash, and xxHash also has a 
real advantage in that it apparently didn't _need_ the "t ^= t << 1" 
permutation to clear the new tuple test's masses of replicated sign bits.

But the more we make changes to either, the more work _we_ have to do to ensure 
we haven't introduced subtle weaknesses.  Which the Python test suite will 
never be substantial enough to verify - we're testing almost nothing about hash 
behavior.  Nor should we:  people already wrote substantial test suites 
dedicated to that sole purpose, and we should aim to be "mere consumers" of 
functions that pass _those_ tests.  Python's test suite is more to ensure that 
known problems _in Python_ don't recur over the decades.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-03 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

A (simplified and slightly modified version of) xxHash seems to work very well, 
much better than SeaHash. Just like SeaHash, xxHash also works in parallel. But 
I'm not doing that and just using this for the loop:

for y in t:
y ^= y * (PRIME32_2 - 1)
acc += y
acc = ((acc << 13) + (acc >> 19))  # rotate left by 13 bits
acc *= MULTIPLIER

Plain xxHash does "y *= PRIME32_2" or equivalently "y += y * (PRIME32_2 - 1)". 
Replacing that += by ^= helps slightly with my new tuple test.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-03 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

I'm having a look at xxHash, the second-fastest hash mentioned on 
https://docs.rs/seahash/3.0.5/seahash/

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-03 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> I know of no such hash functions short of crypto-strength ones.

Being crypto-strength and having few collisions statistically are different 
properties.

For non-crypto hash functions it's typically very easy to generate collisions 
once you know the parameters (which are called the "key" for crypto hash 
functions). But I think that you shouldn't be able to produce a large number of 
collisions if you don't know the parameters in advance.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-02 Thread Tim Peters


Tim Peters  added the comment:

> SeaHash seems to be designed for 64 bits.

Absolutely.

> I'm guessing that replacing the shifts by
>
> x ^= ((x >> 16) >> (x >> 29))
>
> would be what you'd do for a 32-bit hash.

My guess too.  But "the prime" is a puzzle.  As noted before, the 64-bit prime 
is specially constructed to satisfy specific dispersion properties.  While I 
haven't triple-checked this, I believe it fails at two specific bit positions:

- Bit 2**16 doesn't propagate into the leading 4 bits at all.

- Bit 2**17 does propagate into bit 2**60, but _only_ into that bit among the 
highest 4 bits.  One of the criteria is that the prime shouldn't propagate a 
lower-order bit into _only_ the least-significant of the 4 leading bits.

Repairing that would probably require adding a bit or two.  But the prime 
already has 35 bits set.  The greater the imbalance between 0 and 1 bits, the 
more multiply acts like a simpler sequence of shift-and-adds or 
shift-and-subtracts (for example, at an extreme, consider the multiplier (1 << 
63) - 1).

Anyway, it seems the density of 1 bits would have to be even higher for a 
32-bit multiplier to ensure all low-order bits directly propagated to at least 
one of the topmost 3 bits, but not to the third-most significant alone.

Or I could search for a 31-bit prime - or just an odd integer - that got as 
close as possible with 16 or 17 bits set.  Or ...

> Alternatively, we could always compute the hash with
> 64 bits (using uint64_t) and then truncate at the end
> if needed.

Provided we continue to like it at all ;-)  In which case I'd probably end it 
with

32_bit_result = (32_bit_result_type)(x ^ (x >> 32));

instead.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-02 Thread Tim Peters


Tim Peters  added the comment:

FYI, this appears to be a good overview of what SMHasher is looking for:

https://github.com/aappleby/smhasher/wiki/SMHasher

Someg of everything: performance, correctness, statistical measures, and 
specific kinds of keysets that have proved problematic for other non-crypto 
hash functions.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-02 Thread Tim Peters


Tim Peters  added the comment:

> So it seems that this SMHasher test suite doesn't
> catch the problem that we're seeing with negative integers.

Seems to be so, but I've never run SMHasher myself.  I believe it's focused on 
statistical properties, like avalanche and bit independence.


> However, we ideally want a hash that works well for
> all kinds of inputs. If the hash function is good,
> it shouldn't be possible to write a hash collision
> test function which has a significantly higher chance
> of failing than random chance.

I know of no such hash functions short of crypto-strength ones.  Nobody uses 
those for hash tables, though, because they're much slower and usually much 
more involved.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-02 Thread Tim Peters


Tim Peters  added the comment:

> >>> from itertools import product
> >>> len(set(map(hash, product([0.5, 0.25], repeat=20
> 32

> Good catch! Would you like me to add this to the testsuite?

It's in mine already ;-)  I've added all the "bad examples" in all the messages 
here.  Sooner or later they'll get folded into Python's test suite.

BTW, there were no collisions in that under whatever 64-bit Python I last 
compiled.  That was a SeaHash variant.  I'm not certain, but I believe it had 
"t ^= t << 1" at the start and with the first multiply commented out.

Having learning _something_ about why SeaHash does what it does, I'm not 
convinced the first multiply is of much value.  As a standalone bit-scrambler 
for a single 64-bit input, it does matter.  But in the tuple hash context, 
we're running it in a loop.  Strictly alternating "propagate left" and 
"propagate right" seems to me almost as good - although that's just intuition.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-02 Thread Tim Peters


Tim Peters  added the comment:

>> For that reason, I've only been looking at those that
>> scored 10 (best possible) on Appleby's SMHasher[1] test suite

> Do you have a list of such hash functions?

A world of searching.  The ones rated "Excellent" here (previously cited) for a 
start:

https://docs.rs/seahash/3.0.5/seahash/

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-02 Thread Tim Peters


Tim Peters  added the comment:

>> I've noted before, e.g., that sticking to a prime
>> eliminates a world of regular bit patterns in the
>> multiplier.

> Why do you think this? 0x1fff is prime :-)

Point taken ;-)  But "a world of" is not the same as "the universe".  For 
example, sticking to a prime you'll never get 8 bytes the same.  Etc - "a world 
of" extremely regular patterns are eliminated.


> Having regular bit patterns and being prime are independent
> properties.
>
> To be clear: I don't have anything against picking a prime
> but we just shouldn't pretend that primes are important
> when they are not. That's all...

I don't like arguments from ignorance.  As I've said, I don't know why SeaHash 
uses a prime.  Neither do you.  Our collective ignorance doesn't imply the 
designer didn't have a good reason.

I can't think of a "good reason" for it either, but that's where we differ:  I 
think "so don't make gratuitous changes" while you think "therefore it can't 
possibly matter" ;-)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-02 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> For that reason, I've only been looking at those that scored 10 (best 
> possible) on Appleby's SMHasher[1] test suite

Do you have a list of such hash functions?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-02 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

>>> from itertools import product
>>> len(set(map(hash, product([0.5, 0.25], repeat=20
32

Good catch! Would you like me to add this to the testsuite?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-02 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> For that reason, I've only been looking at those that scored 10 (best 
> possible) on Appleby's SMHasher[1] test suite, which is used by everyone who 
> does recognized work in this field.

So it seems that this SMHasher test suite doesn't catch the problem that we're 
seeing with negative integers.

> I'm concerned that I've been putting way too much weight on "the new" tuple 
> test. [...] that's a minuscule region of the problem space.

I'll admit that it's a miniscule region of the problem space. However, we 
ideally want a hash that works well for all kinds of inputs. If the hash 
function is good, it shouldn't be possible to write a hash collision test 
function which has a significantly higher chance of failing than random chance.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-02 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> I've noted before, e.g., that sticking to a prime eliminates a world of 
> regular bit patterns in the multiplier.

Why do you think this? 0x1fff is prime :-)

Having regular bit patterns and being prime are independent properties.

To be clear: I don't have anything against picking a prime but we just 
shouldn't pretend that primes are important when they are not. That's all...

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-02 Thread Tim Peters


Tim Peters  added the comment:

> If we, e.g., tested tuples of little floats instead ...

Speaking of which:

>>> from itertools import product
>>> len(set(map(hash, product([0.5, 0.25], repeat=20
32

32 hash codes out of 1048576 distinct two-tuples isn't exactly 
confidence-inspiring either ;-)  No scheme that only "propagates to the left" 
is going to help that much, because the variation in those hashes is all in the 
high-order bits.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-02 Thread Tim Peters


Tim Peters  added the comment:

> So my suggestion remains
>
> for y in INPUT:
>t = hash(y)
>t ^= t * SOME_LARGE_EVEN_NUMBER
>h ^= t
>h *= MULTIPLIER

On the face of it, I'd be amazed if that passed SMHasher, because it only 
propagates bits "to the left".  All hashes that score well on that test work to 
propagate "to the right" too.  SeaHash does that with its variable-count 
right-shift-xor.  Another popular choice in other hashes is a rotate.

As noted several times before, our tuple tests only use small integers, where 
there's virtually no variation in the high-order hash bits beyond "all 0" or 
"all 1".  Since all the variation is in a handful of low-order bits, "propagate 
right" _appears_ to be a waste of time.

If we, e.g., tested tuples of little floats instead, we may have "concluded" 
that it's "propagate left" that's a waste of time:

>>> for i in range(5):
... x = 1 / 2**i
... print(x, hex(hash(x)
...
1.0   0x1
0.50x1000
0.250x800
0.125   0x400
0.0625  0x200

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-02 Thread Tim Peters


Tim Peters  added the comment:

>> the author wants this transformation to be easily
>> invertible, so a prime is necessary

> A multiplication by any odd number modulo 2**64 is
> invertible.

Right!  My mistake.


> As I argued before, the concept of primes is
> meaningless (except for the prime 2) when computing
> modulo 2**64.

I don't know why they're using a prime.  But that doesn't mean they don't have 
"a reason".  You seem quick to dismiss things if they don't make instant sense 
to you at first glance.  I've noted before, e.g., that sticking to a prime 
eliminates a world of regular bit patterns in the multiplier.

The original SeaHash used a different prime, and a fixed right shift of 32 (but 
twice in different places).

Switching to the current prime, and the variable shift, can be traced back to 
this long comment on Reddit:

https://www.reddit.com/r/rust/comments/5fdf8z/seahash_a_blazingly_fast_portable_hash_function/dakdii1/

The prime isn't "random":  it was constructed so that flipping a low-order bit 
to 1 would directly affect at least one of the topmost 4 bits, which in turn 
are used to select the shift count.  While that doesn't seem to matter for our 
tiny test suite, as the message shows it made huge improvements in other 
regions of the input space.

I haven't reported this, but doing "x ^= x >> 32" instead worked fine for our 
test suite too.  Well, big deal - we're testing almost nothing beyond two kinds 
of "oops!" cases (nested tuples, and mixed-sign tiny ints).

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-02 Thread Tim Peters


Tim Peters  added the comment:

A meta-comment:  a great deal of work has been done on non-crypto hashes in 
recent years.  I'm not interested in rolling our own if one of the "winners" 
from that process can be adapted.  For that reason, I've only been looking at 
those that scored 10 (best possible) on Appleby's SMHasher[1] test suite, which 
is used by everyone who does recognized work in this field.  SeaHash appears to 
be the fastest of those, with short & simple code.

I'm concerned that I've been putting way too much weight on "the new" tuple 
test.  That uses a grand total of 10 tiny integers in -5 .. 5 (omits -1).  
_All_ base-component variation is in the last 3 bits, while the top 61 bits are 
all 0 or all 1.  Great for testing nested tuples with tiny mixed-sign integers, 
but that's a minuscule region of the problem space.

[1] https://github.com/aappleby/smhasher

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-02 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> 100% pure SeaHash does x ^= t at the start first, instead of `t ^ (t << 1)` 
> on the RHS.

Indeed. Some initial testing shows that this kind of "input mangling" (applying 
such a permutation on the inputs) actually plays a much more important role to 
avoid collisions than the SeaHash operation x ^= ((x >> 16) >> (x >> 29)).

So my suggestion remains

for y in INPUT:
t = hash(y)
t ^= t * SOME_LARGE_EVEN_NUMBER
h ^= t
h *= MULTIPLIER

Adding in the additional SeaHash operations

x ^= ((x >> 16) >> (x >> 29))
x *= MULTIPLIER

does not increase the probability of the tests passing.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-02 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

Correction: the FNV variant of SeaHash only fails the new testsuite, not the 
old one. The DJB variant of SeaHash fails both.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-02 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

SeaHash seems to be designed for 64 bits. I'm guessing that replacing the 
shifts by

x ^= ((x >> 16) >> (x >> 29))

would be what you'd do for a 32-bit hash. Alternatively, we could always 
compute the hash with 64 bits (using uint64_t) and then truncate at the end if 
needed.

However, when testing the hash function

for t in INPUT:
x ^= hash(t)
x *= MULTIPLIER
x ^= ((x >> 16) >> (x >> 29))
x *= MULTIPLIER

It fails horribly on the original and my new testsuite. I'm guessing that the 
problem is that the line x ^= ((x >> 16) >> (x >> 29)) ignores low-order bits 
of x, so it's too close to pure FNV which is known to have problems. When 
replacing the first line of the loop above by x += hash(t) (DJB-style), it 
becomes too close to pure DJB and it also fails horribly because of nested 
tuples.

So it doesn't seem that the line x ^= ((x >> 16) >> (x >> 29)) (which is what 
makes SeaHash special) really helps much to solve the known problems with DJB 
or FNV.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-02 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

This weekend I realized something important which I didn't realize before: some 
hash functions which I assumed to be good (i.e. small chance of collisions 
between any given two tuples) turned out to often fail the tests. This is 
because you don't just want to minimize collisions, you also want to minimize 
*correlations* between collisions.

More precisely: for a given hash function (considering the multiplier as 
parameter), it can happen that there are 4 tuples t, u, v, w such that whether 
or not hash(t) == hash(u) is correlated to whether or not hash(v) == hash(w). 
Such correlations increase the standard deviation of the number of collisions 
in the tests a lot (even if the average is unaffected), which leads to 
significant chances of failing the tests.

So with this in mind I stopped testing pairs of tuples but I ran the actual 
testsuites. The metric I'm using is now the probability that the testsuite 
passes for randomly chosen multipliers (3 mod 8). For example, the original 
tuple hash has a probability of around 97% of passing the original testsuite.

None of the hash functions that I tried (DJB or FNV with input mangling like t 
^= t << 7) achieved such a high probability of passing the original test. The 
*only* variation that I found which passes the original test and my new test 
(and a third "random" test which I haven't mentioned before) with a high enough 
probability was FNV with input mangling with a second multiplier:

h = 1
for y in INPUT:
t = hash(y)
t ^= t * SOME_LARGE_EVEN_NUMBER   # instead of t ^= t << SHIFT
h = (h ^ t) * MULTIPLIER

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-02 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> the author wants this transformation to be easily invertible, so a prime is 
> necessary

A multiplication by any odd number modulo 2**64 is invertible. As I argued 
before, the concept of primes is meaningless (except for the prime 2) when 
computing modulo 2**64.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-01 Thread Tim Peters


Tim Peters  added the comment:

>Py_uhash_t t = (Py_uhash_t)y;
>x ^= t ^ (t << 1);
>x *= (Py_uhash_t)0x6eed0e9da4d94a4fULL;
>x ^= (x >> 32) >> (x >> 60);
>x *= (Py_uhash_t)0x6eed0e9da4d94a4fULL;

Comment out either one of the multiplies, and it still passes all the tests.  
Commenting out the first one is "more valuable", because on all but the last 
loop iteration the latency of the last-line multiply will overlap with the next 
call to hash a tuple component.  It even reduces the number of collisions in 
the new tuple test (down to 7 - and still no collisions at all in other tests).

I'm not clear on why there are two multiplies.  With both, or either one alone, 
low and high bits still get their chances to affect the other end.

Presumably it boils down to some detail in "the proofs" - but, as for FNV, 
while various things are _claimed_, I can't find a writeup of "the proofs" 
anywhere :-(

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-10-01 Thread Tim Peters


Tim Peters  added the comment:

If SeaHash is interesting to us (it is to me), I suggest dropping our DJB/FNV 
combining part entirely and using a purer form, like so:

Py_uhash_t t = (Py_uhash_t)y;
x ^= t ^ (t << 1);
x *= (Py_uhash_t)0x6eed0e9da4d94a4fULL;
x ^= (x >> 32) >> (x >> 60);
x *= (Py_uhash_t)0x6eed0e9da4d94a4fULL;

The only collisions with that were 14 in the new tuple test.  Rotate t right by 
3 first, and that drops to 13.  Much better than utter catastrophe ;-)

100% pure SeaHash does

x ^= t;

at the start first, instead of `t ^ (t << 1)` on the RHS.  That fails the 
second tuple test, with 98 collisions.  Not catastrophic, just "too many".  But 
there's only so much we can expect from a cheap non-crypto-strength hash.  `t ^ 
(t << 1)` is a pragmatic tweek to get rid of masses of uselessly repeated sign 
bits, which do occur in realistic tuples (mixing positive and negative ints).  
Adding that tweak shouldn't harm any of SeaHash's provable global properties, 
since `t ^ (t << 1)` is just a permutation of the input space.

Different constants would be needed for a 32-bit version (best guesses, 
untried:  a 31-bit random prime; s/32/16/; s/60/29/).

I don't yet know about potential SeaHash licensing issues.  It's part of Rust:

https://docs.rs/seahash/3.0.5/seahash/

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-30 Thread Tim Peters


Tim Peters  added the comment:

Noting that the failure of high-order bits to propagate is one form of 
"avalanche" failure.  A recent 64-bit hash, SeaHash, takes this approach which 
is said to have provably "perfect" avalanche behavior:

Py_uhash_t t = (Py_uhash_t)y;
t *= (Py_uhash_t)0x6eed0e9da4d94a4fULL;
t ^= (t >> 32) >> (t >> 60);
t *= (Py_uhash_t)0x6eed0e9da4d94a4fULL;

The giant constant is just a 63-bit "patternless" prime (for whatever reason, 
the author wants this transformation to be easily invertible, so a prime is 
necessary - this is NOT a "crypto" hash).  The first multiply propagates 
low-order bits left.  Then the next line uses high-order bits to change 
low-order bits.  Extracting a variable shift count from the data itself is a 
clever idea taken from the PCG family of PRNGs - you have to work to contrive 
data where this doesn't "act kinda random".  The last line then propagates the 
- now altered by the high-order bits - lower-order bits left again.

Followed by

x = (x * mult) + t;

this yields a tuple hash that passes all the tests I have.  The only collisions 
are in the new tuple test, which suffers 14.

Better, add the "catastrophic" right-rotate

t = (t >> 3) | (t << 61);

at the start and it's still only the new tuple test that has a collision - it 
rises to 19 then, close to (but still under) its failure cutoff.

What I haven't tried:  in context it would be nice to eliminate the second 
multiply by the giant constant.  We're doing a multiply anyway to fold `t` into 
`x`, which will propagate the low-order bits left on the next iteration's `x * 
mult`.  That would ruin SeaHash's provable guarantees, but I'm more concerned 
about saving some cycles than purity ;-)  If that added enough collisions to 
push the second tuple test "not much" over the limit, I'd raise its limit.

Gonzo:  "the real" SeaHash duplicates the code above and works on 4 inputs 
simultaneously, designed to keep a modern processor's instruction pipelines as 
busy as possible.  I'd rather not bother (most tuples are short).

So this is principled and, if the SeaHash theory is right, immune to any simple 
kind of catastrophic failure.  Is it worth it?  I'd sleep better, yes ;-)  Note 
that the first multiply by the giant constant can be active at the same time as 
the `x * mult` at the end, so I'm guessing the speed hit would be bearable.

There's no truly _cheap_ way to get good propagation from all bit positions.  
SeaHash has the fastest way to do that I've encountered.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-30 Thread Tim Peters


Tim Peters  added the comment:

An "aha!" moment for me:  I noted before that replacing the tuple hash loop with

Py_uhash_t t = (Py_uhash_t)y;
t ^= t << 1;
x = (x * mult) + t;

does OK (64-bit build):  most tests had no collisions, but the original tuple 
test had 24 (a modest failure) and the new one 6.

Horrible results with a different scheme prompted me to add one line before the 
shift-xor:

t = (t >> 3) | (t << 61);

That is, merely rotate the input right by 3 bits first.  Disaster!  Almost all 
tests suffered major collisions, and the old test 235424 and the new one 344919.

What's going on?  With hindsight, it's obvious:  multiplication is a horrible 
"mixer" _in this context_.  Because nearly all the tests are slinging little 
integers, almost all the input variation is in the last handful of bits.  
Multiplication is great for spreading low-order bits around.  But rotate them 
to the high end, and multiplication is next to useless:  it only propagates 
them "to the left", and they're already at the left end then.  This part has 
virtually nothing to do with whether + or ^ is used, or with whether 
multiplication is done before or after.  It's actually the multiplication 
that's the weakness, and has nothing to do with which multiplier is used.

This isn't a problem with FNV or DJB when used _as intended_, because their 
output is much wider than their inputs.  The high-order bit of an input for 
them is still a lower-order bit to their much wider multiplication, and 
eventually propagates.  It isn't a problem for "multiplicative hashing" 
algorithms either, because those use a double-width multiply and only retain 
the _high_ half of a double-width product.  We're only retaining the _lower_ 
half.

I also noted before that replacing the shift-xor with the frozenset hash's 
input scrambler:

t = ((t ^ 89869747UL) ^ (t << 16)) * 3644798167UL;

worked great.  But it's equally a disaster if the inputs are rotated right by 3 
first.  Same reason:  it too only propagates "to the left".

So almost all tuple hash schemes on the table, both current and various 
experiments here, are systematic disasters if input hashes vary mostly in the 
high-order bits.  We haven't noticed because the current string hashes vary 
about the same in all bit positions, while contiguous ranges of not-huge ints 
have hashes that vary in the low-order bits.

The only schemes in all these messages that are "obviously immune" are those 
that used a (any) flavor of FNV or DJB as intended, treating each input hash as 
a sequence of unsigned bytes.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-28 Thread Tim Peters


Tim Peters  added the comment:

Jeroen, thanks for helping us fly slightly less blind! ;-) It's a lot of work.

I'd say you may as well pick a prime.  It's folklore, but "a reason" is that 
you've discovered that regular bit patterns in multipliers can hurt, and 
sticking to primes eliminates worlds of simple & subtle bit patterns.

Another bit of folklore:  pick a multiplier such that for each byte, neither it 
nor its complement are close in value to any other byte of the multiplier.  
Which may seem more valuable for byte-oriented hashes, yet the byte-oriented 
FNV violates it spectacularly:  all FNV multipliers have _no_ bits set higher 
than 2**8 except for their leading bit.  So the longer the FNV multiplier, the 
more all-0 bytes it contains.

Which appears to be a deliberate choice to limit how quickly each new input 
byte propagates to other lower-order state bits across iterations.  The DJB33 
algorithms accomplish that by using a tiny multiplier (relative to the output 
hash width) instead.

As I explained in other msgs here, treating tuple component hashes as sequences 
of bytes seems to deliver excellent results with the unaltered byte-oriented 
versions of FNV-1[a] and DJBX33[AX] - too good for anything in my test 
collection to detect anything even suspicious, let alone "a problem" (although 
it would be easy to contrive problem cases for 33[AX] - 33 is way "too small" 
to avoid collisions even across tuples with a single component).

So part of our challenge appears to me to be that we're trying instead to 
produce a hash from inputs of the _same_ bit width.  DJB/FNV were designed to 
produce hashes of width at least 4 times their inputs' bit widths.  Each input 
has a relatively tiny effect on the internal state for them.  For us, each 
input can change the entire internal state.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-28 Thread Tim Peters


Tim Peters  added the comment:

[Tim]
> Perhaps worth noting that FNV-1a works great if used as
> _intended_:  on a stream of unsigned bytes.
> ...
>
>Py_uhash_t t = (Py_uhash_t)y;
>for (int i = 0; i < sizeof(t); ++i) {
>x = (x ^ (t & 0xff)) * (Py_uhash_t)1099511628211ULL;
>t >>= 8;
>}

And that's all - no hacks for nested tuples, no hacks for mixed-sign ints, just 
100% pure FNV-1a.

So, just for fun, how about 100% pure Bernstein 33A or 33X?  Those replace the 
3rd line above with, respectively,

x = (x * 33) + (t & 0xff); // 33A

or

x = (x * 33) ^ (t & 0xff); // 33X


And those _also_ work great:  no collisions at all across my collection, except 
for the new tuple test, where they suffer 10 and 8 collisions respectively 
(note that the new test retains only the 32 least-significant hash bits).

Those are remarkable partly because multiplying by 33 is a weak permutation on 
its own:  it's just a left shift (by 5) and an add.  And while you can find odd 
multipliers with lower order (mod 2**N) than 33, you have to work to find one - 
it's exceptionally poor in this respect.  For example, pow(33, 8, 256) == 1, 
but half of the odd multipliers in range(256) have order 64 (mod 256).  Only 16 
of 'em have order <= 8.

Then again, we're applying that weak permutation 8 times per 64-bit input here, 
and folding in a fresh byte each time.  It seems that, overall, that's 
"stronger" than applying a stronger - but still easily spelled - permutation 
just one or two times to a 64-bit input.

The only thing I've tried using 64-bit ops that was comparably robust across 
these tests used the frozenset hash's permutation on the inputs:

Py_uhash_t t = (Py_uhash_t)y;
t = ((t ^ 89869747UL) ^ (t << 16)) * 3644798167UL;
x = (x * mult) + t;

That was quite insensitive to the choice of `mult`.  For example, in this 
instead:

Py_uhash_t t = (Py_uhash_t)y;
t ^= t << 1;
x = (x * mult) + t;

the original tuple test had 24 collision and the new test 6 using the current 
multiplier; changing to the FNV-1a 32-bit multiplier, those zoomed to 857 and 
44; then to the FNV-1a 64-bit multiplier, zoomed again to 477 and 2027; then to 
the "random" 1484268081, way down to 0 and 25; and finally to the "random" 
5517301966916670289, down to 0 and 4.  Which didn't inspire confidence ;-)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-28 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

I spent about 2 days doing an extensive study of the FNV and DJB algorithms. I 
share my conclusions below.

To be very clear what I mean, I am talking about the following algorithms (t is 
a tuple and m is the multiplier which is always assumed to be odd).

def FNV(t, m):
h = 1
for x in t:
h = ((h * m) ^ x) % 2**32
return h

def DJB(t, m):
h = 1
for x in t:
h = ((h * m) + x) % 2**32
return h

I am restricting to 32 bits because we need to support that anyway and because 
it's computationally easier than 64 bits.

I took the point of view of fixing two different tuples t and u and then seeing 
for which multipliers m a collision hash(t, m) == hash(u, m) occurs. Since the 
lower k bits of the hash depend only on the lower k bits of the multiplier, it 
is actually feasible to list all m's for which there is a collision; I wrote a 
C program to do that. Note that there are 2**31 possible odd multipliers and 
2**32 different possible hash values: so you expect about 0.5 collisions on 
average.

For the entries of the tuples, I took random integers in the range from 0 to 63 
(so the issue with negative numbers in FNV does not occur). I mostly considered 
tuples of length 4, 5 and 6.

It turns out that both algorithms have pairs (t, u) for which a massive number 
of collisions occur:

- For FNV, the tuples (12, 50, 52, 24, 3), (28, 18, 52, 56, 19) have 2**26 
collisions (all multipliers with lower byte 0x01, 0x7f, 0x81 or 0xff give 
collisions)
- For DJB, the tuples (22, 10, 12, 22, 29), (23, 14, 18, 26, 30) have 2**24 
collisions (all multipliers with lower byte 0xff give collisions)

However, when you then look at the multipliers for these bad cases, they all 
have a special structure in the lower bits of which there are 2 cases:

- A special bit pattern like 0x001, 0x003, 0x555, 0xccd, ...
- A zero of a low-degree polynomial like 0x9fa5 which is a zero of x^2 + x + 2 
modulo 2**16 (this may only be relevant for DJB)

In order to eliminate these 2 cases, I decided to fix the lower byte of the 
multiplier. I wanted it to be 3 or 5 mod 8 to have maximal multiplicative order 
and then I checked which byte had the worst algebraic relations and no obvious 
bit pattern. I decided to take 0xc5 as lower byte.

When considering only multipliers with 0xc5 as lower byte, I found no very bad 
cases. I checked lots of pairs of tuples of length <= 6 and entries <= 64 and 
the worst I could find was 8192 collisions for FNV and 1024 collisions for DJB. 
This maximum number was found for tuples of length 5. Also, the maximum number 
of collisions seems bounded as the bitsize increases. For a 25-bit hash, I got 
the same maximum numbers of collisions as for a 32-bit hash. So, the 
probability of getting a collision decreases by a factor of 2 for every bit 
added, which is what you want.

When testing various bounds on the entries and tuple lengths, DJB is generally 
a little bit better than FNV (by 2 metrics: the maximum number of collisions 
and the root-mean-square of the number of collisions).

So my conclusion would be to use DJB as core algorithm, taking care to pick a 
good multiplier (or at least not an obviously bad multiplier).

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-28 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> Replacing DJBX33A's multiplier of 33 is also a different algorithm.  So is 
> working with inputs other than unsigned bytes.

I would argue that this is just extending the parameters of the algorithm. If 
the algorithm is general enough, then that shouldn't be a problem.

> So is mucking with the inputs before adding them in.

Not really. If you consider it an algorithm for hashing a sequence, that is 
just changing the input sequence.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-27 Thread Tim Peters


Tim Peters  added the comment:

Also worth noting:  other projects need to combine hashes too.  Here's a 64-bit 
version of the highly regarded C++ Boost[1] library's hash_combine() function 
(I replaced its 32-bit int literal with "a random" 64-bit one):

x ^= (Py_uhash_t)y + 0x94ae1875aa5647f1UL + (x << 6) + (x >> 2);

It's very cheap.  It also sucks horribly if used as the guts Python's tuple 
hash loop ;-)

This isn't a paradox.  Best I can tell, Python may be unique in trying _not_ to 
make all hashes "look random".  If you have only randomish hashes to combine, 
the Boost approach works fine.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-27 Thread Tim Peters


Tim Peters  added the comment:

Perhaps worth noting that FNV-1a works great if used as _intended_:  on a 
stream of unsigned bytes.  All tests except the new tuple hash test suffer no 
collisions; the new test suffers 14.  Nothing is needed to try to worm around 
nested tuple catastrophes, or to worm around mixing integers of similar 
magnitude but different signs.  The obvious downside:  on a 64-bit box, it's 8 
times as many multiplies :-(  Here's a 64-bit version:

Py_uhash_t t = (Py_uhash_t)y;
for (int i = 0; i < sizeof(t); ++i) {
x = (x ^ (t & 0xff)) * (Py_uhash_t)1099511628211ULL;
t >>= 8;
}

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-27 Thread Tim Peters


Tim Peters  added the comment:

I should have spelled this out before:  these are all permutations, so in 
general permuting the result space of `x * mult + y` (or any other permutation 
involving x and y) is exactly the same as not permuting it but applying a 
different permutation to y instead.

Specifically, the sequence:

x = x * mult + y  (mod 2**N)
x = P(x)  # P is any permutation of N-bit integers

is the same as (and this isn't "deep" - it's trivial):

x = x * mult + Q(y, x)  (mod 2**N)

where Q(y, x) = P(x * mult + y) - x * mult  (mod 2**N)

Q is a "richer" class of permutation though, because it's a different 
permutation for each fixed value of `x`, and uses multiplication to help 
disperse bits.

While it would take a lot of work to quantify this, in my experiments the tuple 
hash is significantly less touchy when permuting x after than when permuting y 
before, presumably because Q is richer.

The two tuple hash tests have many inputs whose tuple component hashes have 
very similar (even identical) bit patterns.  There's only so much dirt-simple 
permutations (like "y ^= y << 1") can do to break the patterns.  So, e.g., 
change a shift count, change the multiplier, ..., and at least one of those two 
tests fails depressingly often.  Not spectacularly, but over the limit they 
allow.  Q(y, x) does a far better job of magnifying small differences.

Something I haven't tried:  use a richer permutation of y that _doesn't_ depend 
on x.  For example, the frozenset hash has much the same underlying challenge, 
and uses this permutation to "blow up" small input differences:

static Py_uhash_t
_shuffle_bits(Py_uhash_t h)
{
return ((h ^ 89869747UL) ^ (h << 16)) * 3644798167UL;
}

But that's a lot more expensive than a single shift-xor, and the speed of tuple 
hashing is more important than of frozenset hashing.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-26 Thread Tim Peters


Tim Peters  added the comment:

>> The two-liner above with the xor in the second line is
>> exactly Bernstein 33A, followed by a permutation
>> of 33A's _output_ space.

> Not output space, but internal state

?  33A's output _is_ its internal state at the end.  This is a distinction that 
makes no difference.  I do distinguish here between 33A and the output of 
Python's `tuplehash()`, but the output space of both on a 64-bit box is a 
64-bit int, so that's another pointless distinction to me.

> (I assume that you do that operation inside the loop).

Yes, as I said at the start, "this is the only code remaining in the loop apart 
from setting y to the next tuple component's hash".

> It's replacing DJBX33A by a different algorithm which is not
> DJBX33A.

Replacing DJBX33A's multiplier of 33 is also a different algorithm.  So is 
working with inputs other than unsigned bytes.  So is mucking with the inputs 
before adding them in.

> It may or may not work, that's not my point. It's just that
> I would avoid changing the structure of the algorithm if
> there is no reason to.

Which is also fine by me, except I see no actual reason to care.  All 
variations of "chain permutations" I've tried appear to work fine, except for 
those (which I haven't mentioned at all) that tried replacing multiplication 
with "weaker" permutations.

A minor exception is that, as already mentioned, applying the "leftshift-xor" 
permutation to the inputs with a shift count of 1 didn't pass the original 
tuple hash test in several cases.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-26 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> The two-liner above with the xor in the second line is exactly Bernstein 33A, 
> followed by a permutation of 33A's _output_ space.

Not output space, but internal state (I assume that you do that operation 
inside the loop). It's replacing DJBX33A by a different algorithm which is not 
DJBX33A. It may or may not work, that's not my point. It's just that I would 
avoid changing the structure of the algorithm if there is no reason to.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-26 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> please restore the original tuple hash test.

Sure. I wasn't sure what to do and was I afraid that having 2 tests for tuple 
hashes would be too much. If that's OK for you, then surely I will restore the 
test.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-26 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> Do you believe any other multiplier would work better toward that end?

Absolutely. Ideally, the multiplier should just be a random 64-bit number.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-26 Thread Tim Peters


Tim Peters  added the comment:

High-order bit:  please restore the original tuple hash test.  You have the 
worst case of "but I didn't write it" I've ever encountered ;-)  Your new test 
is valuable, but I've seen several cases now where it fails to detect any 
problems where the original test fails catastrophically.  Tests are written to 
ensure known problems don't recur, so there's _never_ "a reason" to throw one 
away unless the test itself is in error.  The test suite only grows over time, 
and that's how it should be.

Example:  when I replace the computational guts of the tuple hash with the 
single line (64-bit build, and this is the only code remaining in the loop 
apart from setting y to the next tuple component's hash):

x = (x * mult) + y;

both tests fail spectacularly.

But if I add one more line:

x = (x * mult) + y;
x += x >> 16;

your new test falls to 0 collisions, while the original test still suffers 
83447 collisions (an improvement - down from 120050 - but still a spectacular 
failure).

We're flying blind enough here without deliberately plucking out one of our 
eyes ;-)  Indeed, the original tuple test is the only one in my collection that 
showed even a single collision after the two-line replacement just above.  "My 
collection" means your test, the original tuple hash test, and every example 
given in all the messages in this bug report.

Another "surprise":  replace addition with xor in the second line above:

x = (x * mult) + y;
x ^= x >> 16;

and then the original tuple test drops from 83447 to 0(!) collisions, while 
your new one jumps from 0 to a measly 12.


> The collision hash((3,3)) == hash((-3,-3)) is due a specific
> structure in the input to the FNV algorithm.

This is all so hypothetical it's impossible to know.  Perhaps the tuple hash 
function simply subtracts the hash of the first component from the hash of the 
second.  Then it's impossible to make any deterministic change to the component 
hashes that can stop either of the tuples from hashing to 0.  The weakness is 
entirely in the hash combining function then.

> The operation t ^= t << 7 that you suggested (and which I approve, except
> for the shift amount) is meant precisely to destroy that structure.

Except I have no reason to believe that destroying the input structure is of 
unique value.  As shown above, it's easy to change the tuple hash in a simple 
way that passes all known tests without touching `y` (aka `t`) at all.

>> It's x you care about directly, not t.

> That would not be a good solution, because that destroys the structure
> of the hash algorithm.   For Bernstein for example, each loop iteration
> should do something like
>
>x = (x * m) + t
>
> for *some* value t depending on the input. If you mess with x, then it
> really becomes a different hash algorithm.

The two-liner above with the xor in the second line is exactly Bernstein 33A, 
followed by a permutation of 33A's _output_ space.  It works fine in tests.  
Why should I care about merely that it's "different"?  33A on its own fails 
spectacularly - it damned well _better_ be "different" ;-)

There is no "Bernstein theory" to appeal to here - just raw assertions.  I'm 
not attached to any of these approaches.  The strongest I can say from trying 
many things is that "chaining permutations works best, but no detail appears to 
matter much, except that multiplication is the most valuable of all the 
permutations I've tried".

x*M is a permutation of the word-size space for any odd M, and any "big enough" 
odd M I tried appears to work about as well as any other.  Adding t is another. 
 The shift-xor in the second line above is a third (but the "+=" version 
instead was not a permutation, and it failed).

> That is far more dangerous than simply applying a permutation on t.

Why?

> Which "desirable properties" of t does the operation t ^= (t << 1) damage?

On second thought, none!  Across any contiguous range of integers, that 
preserves that the last k bits (for all k >= 1) remain as evenly distributed as 
in the inputs.  That wasn't obvious to me at first.  It's not true if the left 
shift is replaced by a right shift, though.

However, with a shift count of 1 I've generally found that the _original_ tuple 
hash test fails (not spectacularly, but fails all the same).  "Generally" means 
across a substantial number of varying the permutations used in other places.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-26 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> There are _two_ hash functions at play in that collision:  the tuple hash 
> function, and the integer hash function.  Regardless of whether the _tuple_ 
> hash function does [anything involving just `t`], that only directly affects 
> the result of the _int_ hash function

...which is what you want here. The collision hash((3,3)) == hash((-3,-3)) is 
due a specific structure in the input to the FNV algorithm. The operation t ^= 
t << 7 that you suggested (and which I approve, except for the shift amount) is 
meant precisely to destroy that structure.

> If you feel you just have to mess with low-order bits, do it instead on the 
> _tuple_ hash intermediate results, not on its inputs. It's x you care about 
> directly, not t.

That would not be a good solution, because that destroys the structure of the 
has algorithm. For Bernstein for example, each loop iteration should do 
something like

x = (x * m) + t

for *some* value t depending on the input. If you mess with x, then it really 
becomes a different hash algorithm. That is far more dangerous than simply 
applying a permutation on t.

> Mucking with t to avoid the nested-tuple catastrophes is worth it, but I 
> prefer to do that with as little damage to t's desirable properties as 
> reasonably possible.

Which "desirable properties" of t does the operation t ^= (t << 1) damage?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-25 Thread Tim Peters


Tim Peters  added the comment:

>> j is even implies (j ^ -3) == -(j ^ 3)

> This follows from what I posted before: if j is even, then
> j ^ 3 is odd, so we can apply the rule x ^ -2 = -x to x = j ^ 3
> ...

Thanks!  That helps a lot.  I had a blind spot there.  This kind of thing would 
have been best spelled out at the start:  it's not just that -2 has some bad 
effects, but that there's a whole universe of potentially damaging identities:

j odd  implies   j  = -(j ^ -2)
j even implies  (j ^ 1) = -(j ^ -1)
j odd  implies  (j ^ 2) = -(j ^ -4)
j even implies  (j ^ 3) = -(j ^ -3)
j odd  implies  (j ^ 4) = -(j ^ -6)
j even implies  (j ^ 5) = -(j ^ -5)
j odd  implies  (j ^ 6) = -(j ^ -8)
j even implies  (j ^ 7) = -(j ^ -7)
j odd  implies  (j ^ 8) = -(j ^ -10)
j even implies  (j ^ 9) = -(j ^ -9)
j odd  implies  (j ^ 10) = -(j ^ -12)
j even implies  (j ^ 11) = -(j ^ -11)
j odd  implies  (j ^ 12) = -(j ^ -14)
j even implies  (j ^ 13) = -(j ^ -13)
j odd  implies  (j ^ 14) = -(j ^ -16)
j even implies  (j ^ 15) = -(j ^ -15)
...

Every integer pairs up with another of the opposite sign in one of these - but 
with only one.  That goes a long way toward explaining why collisions are 
frequent when mixing signs on ints of similar magnitude, but also why they 
don't "pile up" catastrophically.

But it also suggests a way to construct cases that _are_ catastrophic.  For 
example:

>>> len(set(map(hash, product([-3, 3], repeat=20
1024

Ouch!

>>> c = Counter(map(hash, product([-3, 3], repeat=20)))
>>> len(c)
1024
>>> max(c.values())
1024
>>> min(c.values())
1024

So each of the 1024 hash codes showed up 1024 times.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-25 Thread Tim Peters


Tim Peters  added the comment:

> Suppose that there is a hash collision, say hash((3, 3)) ==
> hash((-3, -3)) and you change the hashing algorithm to fix
> this collision.

There are _two_ hash functions at play in that collision:  the tuple hash 
function, and the integer hash function.  Regardless of whether the _tuple_ 
hash function does

t ^= t << shift;
or
t += t >> shift;

or anything else involving just `t`, that only directly affects the result of 
the _int_ hash function.  Whose low order bits cannot be improved in general - 
they're already optimally spread out in the most important cases.

> If that change does NOT affect the lower N bits,
> then you still have a collision hash((3, 3)) % 2**N ==
> hash((-3, -3)) % 2**N. This is relevant because the dict
> implementation starts by taking the lower bits first. So
> ideally we want to avoid hash collisions in the lower
> bits too.

Which the int hash on its own already does a superb job of doing in the most 
important cases.  If you feel you just have to mess with low-order bits, do it 
instead on the _tuple_ hash intermediate results, not on its inputs.  Like

x += x >> 16;

after t is folded in to x.  It's x you care about directly, not t.  Damaging 
desirable properties in t to _indirectly_ gain something in x is too clever by 
half.

Mucking with t to avoid the nested-tuple catastrophes is worth it, but I prefer 
to do that with as little damage to t's desirable properties as reasonably 
possible.

> This may also be a reason to avoid the standard FNV
> multipliers: the 64-bit FNV multiplier was chosen with
> the full 64-bit hash range in mind. It was never meant
> to work well when truncated to some lower N bits.

Do you believe any other multiplier would work better toward that end?  For any 
odd multiplier M, the last k bits of i*M are determined solely by the last k 
bits of i, and

[(last k bits of i*M) for i in range(anything, anything + 2**k)]

is a permutation of range(2**k).  The high order bits of i have nothing to do 
with it, and any odd multiplier permutes the lower k bits over any stretch of 
2**k consecutive multiplicands.

Extracting low-order bits is intended to be done by "xor folding" in FNV, and I 
_expect_ the same would be prudent for any other multiplier:

https://tools.ietf.org/html/draft-eastlake-fnv-15#page-7

The dict and set lookup functions already do something stronger than xor 
folding to bring high bits into play, but do so incrementally as the probe 
sequence unfolds.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-25 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

Regarding t ^= t << 7: I tested PR 9471 with all shift values from 1 to 20. The 
new testsuite passed for all shifts from 1 to 13, except for 6. It failed for 6 
and for 14 to 20. This indicates that smaller shift values are better (even 
when looking at 32-bit outputs).

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-25 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> I want to leave low-order hash bits alone.  That's deliberate.

Since I didn't understand the rationale for this and since shifting << 1 also 
seems to work well, I edited PR 9471 to use DJBX33A with t ^= t << 1.

Since you insisted on adding 97531 at the end, I put that back.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-25 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> And one more:

x = (x * mult) ^ t;

also appears to work equally well.

The order of operations does not really matter: you can write the loop as

x *= mult   # Appears only in FNV-1
x ^= t[0]
x *= mult
x ^= t[1]
x *= mult
x ^= t[2]
x *= mult
x ^= t[3]
x *= mult  # Appears only in FNV-1a

The initial multiplication is equivalent to changing the initial value and the 
final multiplication is just a permutation of the outputs. None of those should 
really affect the quality of the hash.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-25 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> j is even implies (j ^ -3) == -(j ^ 3)

This follows from what I posted before: if j is even, then j ^ 3 is odd, so we 
can apply the rule x ^ -2 = -x to x = j ^ 3:

(j ^ 3) ^ -2 = -(j ^ 3)

which implies

j ^ (3 ^ -2) = -(j ^ 3)

or equivalently

j ^ -3 = -(j ^ 3)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-25 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> When testing what, specifically?  And the standard 32-bit FNV multiplier, or 
> the standard 64-bit FNV multiplier?

FNV-1a with the t ^= 2 * t mangling running my new testsuite on either PR 9471 
or PR 9534 using the 64-bit FNV multiplier to produce 64-bit hashes. In other 
words, the code from PR 9534 but just changing the multiplier.

On the full 64-bit range, I got 2 collisions (where statistically 0 would be 
expected). When truncated to 32-bits, I got about 1700 collisions (where 
statistically about 15 would be expected).

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-25 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> The low bits are already un-improvable in the most important cases.

Maybe you misunderstood me. Suppose that there is a hash collision, say 
hash((3, 3)) == hash((-3, -3)) and you change the hashing algorithm to fix this 
collision. If that change does NOT affect the lower N bits, then you still have 
a collision hash((3, 3)) % 2**N == hash((-3, -3)) % 2**N. This is relevant 
because the dict implementation starts by taking the lower bits first. So 
ideally we want to avoid hash collisions in the lower bits too.

This may also be a reason to avoid the standard FNV multipliers: the 64-bit FNV 
multiplier was chosen with the full 64-bit hash range in mind. It was never 
meant to work well when truncated to some lower N bits.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-24 Thread Tim Peters


Tim Peters  added the comment:

And one more:

x = (x * mult) ^ t;

also appears to work equally well.  So, way back when, it appears we _could_ 
have wormed around the disaster du jour just by applying a shift-xor 
permutation to the raw hash results.

Note the implication:  if we believed our tuple hash worked OK before then (& 
we did), adding a shift-xor step would have changed nothing about it _except_ 
to permute the input space.  That alone was enough to repair the nested tuple 
problems.

Note that permutations are a bog standard way to improve algorithms with bad 
behaviors in some cases.  For example, at the end of the Mersenne Twister they 
do 4 permutations to improve otherwise-marginal results from "the real" 
pseudo-random generator:

y ^= (y >> 11);
y ^= (y << 7) & 0x9d2c5680U;
y ^= (y << 15) & 0xefc6U;
y ^= (y >> 18);
return y;

At this point I see no "good reason" to prefer any of

x = (x * mult) ^ t; // like Bernstein 33X & FNV-1
x = (x * mult) + t; // like Bernstein 33A
x = (x ^ t) * mult; // like FNV-1a

to any other.  In their _original_ contexts, they were applied to longish 
sequences of unsigned bytes.  But tuples used as keys and set elements are 
typically much shorter than strings, so results about how they worked in their 
original contexts are largely irrelevant for that reason too.

While lots of non-endcase testing is also needed, at this point I don't have 
any real fear about any of them.

As a sanity check,

x = (x * mult) | t;

is disastrous for all the endcase tests.  So I believe I really am compiling 
and running the code ;-)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-24 Thread Tim Peters


Tim Peters  added the comment:

Just noting that this Bernstein-like variant appears to work as well as the 
FNV-1a version in all the goofy ;-) endcase tests I've accumulated:

while (--len >= 0) {
y = PyObject_Hash(*p++);
if (y == -1)
return -1;
Py_uhash_t t = (Py_uhash_t)y;
t ^= t << 7;
x = x * mult + t;
}

They're identical except for the last line.  FNV-1a uses

 x = (x ^ t) * mult;

instead.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-24 Thread Tim Peters


Tim Peters  added the comment:

> advantage of my approach is that high-order bits become more
> important:

I don't much care about high-order bits, beyond that we don't systematically 
_lose_ them.  The dict and set lookup routines have their own strategies for 
incorporating high-order bits into the probe sequence, and those routines are 
the primary reason hash() exists in Python.

So there's scant reason to try to let high bits influence low bits (but good 
reason _not_ to, because - as explained before - the low-order bits are already 
in great shape for the important tuple component types).

Something else appears to be going on in this specific example, which has a 
very touchy set of hash codes:

> BEFORE:
> >>> L = [n << 60 for n in range(100)]; T = [(a,b,c) for a in L for b in L for 
> >>> c in L]; len(set(hash(x) for x in T))
> 50

Which is what I get too on (at least) a 64-bit build.

> AFTER:
> >>> L = [n << 60 for n in range(100)]; T = [(a,b,c) for a in L for b in L for 
> >>> c in L]; len(set(hash(x) for x in T))
> 100

Ditto.  However, I get exactly the same from the code I posted before, which 
has no right shifts at all.

More, if I take the code exactly as it exists today, and merely comment out the

mult += (Py_hash_t)(82520UL + len + len);

line, I also get a million distinct hash codes.

Half of the hash codes have bit 2**60 set, and apart from those all the hash 
codes have no bits set higher than 2**6.  Any number of accidents could cause 
the 2**60 bits to wipe each other out, and merely changing the sequence of 
multipliers was demonstrably enough to stop that.

Indeed, merely changing 82520 to 82524 in the current code is enough to get a 
million unique hash codes in this case.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-24 Thread Tim Peters


Tim Peters  added the comment:

Jeroen, I understood the part about -2 from your initial report ;-)  That's why 
the last code I posted didn't use -2 at all (neither -1, which hashes to -2).  
None of the very many colliding tuples contained -2 in any form.  For example, 
these 8 tuples all have the same hash now:

(-4,  -4,  -4,  40)(-4,  -4,  4, -48)
( 4,   4,  -4,  40)( 4,   4,  4, -48)
(-4,  28, -28, -48)(-4,  28, 28,  40)
( 4, -28, -28, -48)( 4, -28, 28,  40)

Your last example (with (3, -2) and (-3, 0)) also implicitly relies on that:

j is even implies (j ^ -3) == -(j ^ 3)

There are apparently piles of similar identities :-(

I appreciate that

a*M + C = b*M + C (mod N)

implies

a = b (mod N)

when M is coprime to N, and also that the theory of linear combinations modulo 
2**K is far better known than the well-hidden theory FNV developed.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-24 Thread Tim Peters


Tim Peters  added the comment:

> when you do t ^= t << 7, then you are not changing
> the lower 7 bits at all.

I want to leave low-order hash bits alone.  That's deliberate.

The most important tuple component types, for tuples that are hashable, are 
strings and contiguous ranges of "not huge" ints.  The current string hash 
works hard to "act randomly" in every bit position - there's no point in trying 
to "improve" anything about the hashes it already produces.

In contrast, the hashes of any contiguous range of N not-huge integers 
(excluding -1) are already, by construction, guaranteed to have no collisions 
at all in their low-order (roughly) log2(N) bits.  They can't be improved in 
this respect, because they're already optimal in this respect.

So, if anything, I'd look at increasing the left shift rather than reducing it. 
 The low bits are already un-improvable in the most important cases.


> So applications using hash(x) % 128 will still see
> all the problems that we are trying to fix.

?  The only "problem" I've seen identified was in mixing positive and negative 
integers.  Here on a 32-build with the FNV-1a 32-bit multiplier:

>> from itertools import product
>> from collections import Counter
>> cands = list(range(-50, 51))
>> cands.remove(-1)
>> c = Counter()
>> for t in product(cands, repeat=4):
.. c[hash(t) & 0x7f] += 1
>>> len(c)
128

So all 128 lower 7-bit patterns showed up.  And they're quite evenly 
distributed:

>>> c2 = Counter(c.values())
>>> for k in sorted(c2):
... print(k, c2[k])
...
781202 1
781207 1
781209 2
781212 1
781213 2
781214 1
781215 4
781221 3
781222 1
781225 3
781226 1
781227 3
781228 2
781229 2
781230 1
781231 1
781232 2
781233 3
781234 1
781235 4
781236 2
781237 1
781238 2
781240 5
781241 6
781242 1
781243 1
781244 1
781245 1
781247 1
781248 2
781251 2
781252 4
781253 3
781254 5
781255 2
781256 2
781257 3
781259 2
781260 1
781261 1
781262 1
781263 2
781264 4
781265 2
781266 1
781267 1
781268 4
781269 1
781270 1
781271 2
781272 1
781274 2
781275 1
781276 1
781278 1
781280 1
781281 2
781282 2
781285 1
781286 2
781288 1
781295 1
781297 2
781301 1
781302 1
781304 1
781307 1

> With the standard FNV multiplier on 64 bits, I did
> get collisions while testing.

When testing what, specifically?  And the standard 32-bit FNV multiplier, or 
the standard 64-bit FNV multiplier?

> Instead, I chose 3**41 as multiplier. But of course,
> there are still plenty of bikeshedding opportunities
> for the multiplier...

Not for me.  If the universally used 64-bit FNV multiplier can't be used in the 
context of Python's tuple hash fiddled to use an almost-pure form of FNV-1a, 
then that approach is dead to me.  Needing to pick multipliers out of thin air 
instead implies the theory underlying FNV-1a doesn't transfer to this context.

About which I have no current opinion.  It may or may not.  Since we're flying 
blind, I'm just willing to _assume_ it does until proven otherwise by testing.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-24 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

I created a new PR based on Tim's t ^= t << 7 idea, except that I'm using << 1 
instead, to have more mixing in the lower bits.

With the standard FNV multiplier on 64 bits, I did get collisions while 
testing. I haven't figured out exactly why these occurred, but it's probably 
due to the high number of 0 bits. Instead, I chose 3**41 as multiplier. But of 
course, there are still plenty of bikeshedding opportunities for the 
multiplier...

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-24 Thread Jeroen Demeyer


Change by Jeroen Demeyer :


--
pull_requests: +8937

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-24 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> In the absence of a real analysis, the intuition is simply that "t ^= t << 7" 
> will clear masses of leading sign bits when hashing "small" negative integers.

That's a clever solution. If you want to go that route, I would rather suggest 
t ^= t << 1 which is essentially the same fix but which has probably less 
collisions on the low-order bits.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-24 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> For example, FNV-1a has far better "avalanche" behavior than Bernstein

We don't care about that though. We want to have no collisions, not a random 
output.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-24 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> stuff like "t += t >> 16" is a many-to-one function, not a permutation

Yes, I am aware of that. However, the number of collisions here is really quite 
small. It's very unlikely to hit one by accident.

I also chose >> over << for two reasons:

1. It brings the high-order in play: https://bugs.python.org/msg326117

2. It avoids collisions on the low-order bits: when you do t ^= t << 7, then 
you are not changing the lower 7 bits at all. So applications using hash(x) % 
128 will still see all the problems that we are trying to fix.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-24 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

While writing up the analysis above, it occurred to me that collisions already 
happen for 2-tuples:

>>> hash((3, -2)) == hash((-3, 0))
True

These kind of 2-tuples of small integers don't look contrived at all. I can 
easily see them appearing, in mathematical applications for example.

As for real-world usage: the only thing that I can say is that I discovered 
these hash collisions a while ago, while working on SageMath. I was testing the 
hash for a custom class and I found collisions, which I traced back to 
collisions for tuples.

In any case, it is hard to find real-world problems where a bad hash really 
matters, since Python works fine with a broken hash too.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-24 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> Has anyone figured out the real source of the degeneration when mixing in 
> negative integers?

The underlying reason for the collisions is the following mathematical relation:

x ^ -x = -2 << i

where i is the index of the smallest set bit of x. In particular, x ^ -x = -2 
for odd x.

Now consider two consecutive hash iterations:

y = (x ^ a) * m1
z = (y ^ b) * m2

and suppose that x ^ a is odd. Now if we replace a by a ^ -2, then x ^ a will 
be replaced by -(x ^ a) and y will be replaced by -y. If we also replace b by b 
^ -2, then y ^ b will be replaced by y ^ b. In other words, we have a collision.

This kind of bad interaction between ^ and * only happens with the FNV-style 
hashing. The Bernstein hash using + instead of ^ does not suffer from this 
problem. That makes it a better choice in my opinion.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-23 Thread Tim Peters


Tim Peters  added the comment:

BTW, those tests were all done under a 64-bit build.  Some differences in a 
32-bit build:

1. The test_tuple hash test started with 6 collisions.  With the change, it 
went down to 4.  Also changing to the FNV-1a 32-bit multiplier boosted it to 8. 
 The test passes in all those cases.

2. For 4-tuples taken from range(-50, 51) omitting -1 and -2:  massive number 
of collisions under the current code.  Died with MemoryError with the change - 
ran out of memory to build the Counter recording all the distinct hash codes.

Changing it to 3-tuples instead:  somewhere around 140 collisions with the 
change, and also changing to the FNV-1a 32-bit multiplier eliminated all 
collisions.

All of this was done under Win 10, using Visual Studio 2017.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-23 Thread Tim Peters


Tim Peters  added the comment:

FYI, using this for the guts of the tuple hash works well on everything we've 
discussed.  In particular, no collisions in the current test_tuple hash test, 
and none either in the cases mixing negative and positive little ints.  This 
all remains so using the current multiplier, or "the standard" FNV-1a 
multipliers 16777619UL (32 bit) and 1099511628211UL (64 bit).  I've done no 
other tests:

while (--len >= 0) {
y = PyObject_Hash(*p++);
if (y == -1)
return -1;
Py_uhash_t t = (Py_uhash_t)y;
t ^= t << 7;
x = (x ^ t) * mult;
}

Note that the multiplier doesn't change anymore.  The twist is adding

t ^= t << 7;

to _permute_ the native hash space (stuff like "t += t >> 16" is a many-to-one 
function, not a permutation).  Other than that, it's exactly FNV-1a.  I don't 
know that 7 is especially magical - it's just the first thing I tried.

In the absence of a real analysis, the intuition is simply that "t ^= t << 7" 
will clear masses of leading sign bits when hashing "small" negative integers.  
For whatever reason(s), they appear to be the cause of the troubles.

However, since that line of code permutes the space, exactly the same 
statistics can be provoked by some other inputs.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-23 Thread Tim Peters


Tim Peters  added the comment:

[Raymond, on boosting the multiplier on 64-bit boxes]
> Yes, that would be perfectly reasonable (though to some
> extent the objects in the tuple also share some of the
> responsibility for getting all bits into play).

It's of value independent of that.  Tuples of ints used as keys and set 
elements are very important, and I doubt we'll ever give up on that `hash(i) == 
i` for the typically "not huge" ints used in such contexts.  Jeroen gave a 
reasonable example of how boosting the multiplier can help in a case of that 
above:

https://bugs.python.org/msg326032

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-23 Thread Tim Peters


Tim Peters  added the comment:

Has anyone figured out the real source of the degeneration when mixing in 
negative integers?  I have not.  XOR always permutes the hash range - it's 
one-to-one.  No possible outputs are lost, and XOR with a negative int isn't 
"obviously degenerate" in general:

>>> for i in range(-16, 17):
...print(i, i ^ -5)
-16 11
-15 10
-14 9
-13 8
-12 15
-11 14
-10 13
-9 12
-8 3
-7 2
-6 1
-5 0
-4 7
-3 6
-2 5
-1 4
0 -5
1 -6
2 -7
3 -8
4 -1
5 -2
6 -3
7 -4
8 -13
9 -14
10 -15
11 -16
12 -9
13 -10
14 -11
15 -12
16 -21

Clear enough? "Distinct integers in, distinct integers out", but with the twist 
that the sign bit flips.  That's _seemingly_ minor to me.  Yet it has massive 
effects on tuple hash distribution.  Here's a function to show that:

def doit(cands, repeat=4):
from collections import Counter
from itertools import product

print(len(cands), "cands **", repeat, "=", len(cands)**repeat)
c1 = Counter(map(hash, product(cands, repeat=repeat)))
print(len(c1), "distinct hash codes")
c2 = Counter(c1.values())
for dups in sorted(c2):
print(dups, c2[dups])

Then an example we're proud of:

>>> doit(range(100))
100 cands ** 4 = 1
1 distinct hash codes
1 1

Much the same, mixing in negative ints, but _excluding_ the problematic -1 and 
-2 inputs:

>>> cands = list(range(-50, 51))
>>> cands.remove(-1) # hashes to -2
>>> cands.remove(-2) # for j odd, j ^ -2 == -j
>>> doit(cands)
99 cands ** 4 = 96059601
15736247 distinct hash codes
1 70827
2 1005882
3 228578
4 5000424
5 19728
6 1047762
8 8363046

What accounts for such a massive difference?  It's not merely that we're using 
negative ints:

>>> doit(range(-101, -1)) # leave -2 in, for a change
100 cands ** 4 = 1
1 distinct hash codes
1 1

So, on their own, negative ints are as spectacularly well-behaved in this range 
as positive ints, and including -2 creates no problems.

I suspect it's related to that x ^ -(x + 1) == -1, but not in a way obvious 
enough to be ... well, obvious ;-)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-23 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

> To my eyes, the _strongest_ case in all these messages is 
> for boosting the multiplier size on 64-bit boxes.  That's
> principled and well-motivated.

Yes, that would be perfectly reasonable (though to some extent the objects in 
the tuple also share some of the responsibility for getting all bits into play).

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-23 Thread Tim Peters


Tim Peters  added the comment:

Oh, I don't agree that it's "broken" either.  There's still no real-world test 
case here demonstrating catastrophic behavior, neither even a contrived test 
case demonstrating that, nor a coherent characterization of what "the problem" 
is.

I'm nevertheless open to making improvements, but keeping in mind foremost that 
_all_ changes are potentially damaging to patterns of data we know nothing 
about precisely because they've been working fine for many years.  So, e.g., 
there's no chance that gratuitous changes will be accepted.  For example, don't 
_like_ adding 97531UL at the end?  Tough luck - it's staying, and it's not 
worth one word of argument.

To my eyes, the _strongest_ case in all these messages is for boosting the 
multiplier size on 64-bit boxes.  That's principled and well-motivated.  Python 
itself changed the multiplier from 3 to 103 long ago for the same reasons.  
But that apparently has nothing to do with "the problem" this report was opened 
about ;-)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-23 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> We shouldn't feel shoved into altering something that we don't agree is broken

It *is* broken. You are just denying the reality.

That's also the reason that I'm insisting so much: I don't want to push my 
personal fix (despite that it may seem so), I want to fix a bug.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-22 Thread Tim Peters


Tim Peters  added the comment:

Raymond, I share your concerns.  There's no reason at all to make gratuitous 
changes (like dropping the "post-addition of a constant and incorporating 
length signature"), apart from that there's no apparent reason for them 
existing to begin with ;-)

Unintended consequences can abound.

Still, the last repair was pretty hacky, and a very well-known and highly 
respected algorithm (FNV-1a) _is_ hiding under it.  I would like to pursue 
seeing whether a more direct form of FNV-1a can be rehabilitated to worm around 
the known problematic cases.  That would also, if successful, give us a 
principled way to pick a better multiplier for 64-bit boxes.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-22 Thread Tim Peters


Tim Peters  added the comment:

I strive not to believe anything in the absence of evidence ;-)

FNV-1a supplanted Bernstein's scheme in many projects because it works better.  
Indeed, Python itself used FNV for string hashing before the security wonks got 
exercised over collision attacks.  It worked great.  But "better" depends on 
what you value.  For example, FNV-1a has far better "avalanche" behavior than 
Bernstein[1].

Please note that I don't _object_ to your code!  It may work fine.  Or it may 
not in some cases.  The problem is that we have no way to tell.  There's no 
theory here, just misleading appeals to experience with the algorithms in 
contexts they were _intended_ to be used in.  Nobody expected some patterns of 
nested tuples to fail catastrophically before, and nobody expected mixed-sign 
integers to lead to poor (but not catastrophic) behavior after the former was 
"repaired".  Nobody now expects the next 10 catastrophes either.

We can only rely on contrived tests and bug reports.  Python's current scheme 
is unbeatable on that count, because it's the only scheme for which over a 
decade of experience _in context_ exists.

Which experience says there are no known catastophically bad real cases.  Which 
is worth a whole lot in a project that's so widely used now.

That said, the "repair" before was at least as much pure hackery as your 
proposed hackery, and - I agree - completely undercut the _theory_ FNV was 
based on (although, to be fair, I doubt anyone else _knew_ they were 
re-inventing a damaged FNV-1a at the time).

So ... since FNV-1a is in fact better in its intended context than the 
Bernstein one-liners in the same context, how about adding your

t += (t >> (4 * sizeof(t)));

hack to the _current_ code (& delete the code changing the multiplier)?  Then 
we could switch to the "official" FNV 32-bit and 64-bit multipliers too, and 
know at least that "almost everything" about the resulting algorithm is known 
to work exceedingly well in its original context.

[1] https://sites.google.com/site/murmurhash/avalanche

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-22 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

I don't think any change should be made unless we agree that there is a real 
problem to be solved rather than because the OP is brazenly insistent on 
forcing through some change.  We shouldn't feel shoved into altering something 
that we don't agree is broken and into replacing it with something that we're 
less sure about.  Also, I'm concerned that that patch drops features like the 
post-addition of a constant and incorporating length signature -- it is as if 
we're starting from scratch rather than keeping the incremental improvements 
made over decades.  AFAICT, the only motivation for change is that the OP is 
relentless; otherwise, none of us would be giving this further thought.  I echo 
Eric's concern that it is easy to inadvertently make Python worse.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-22 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> the made-up hacks Python used to worm around a class of gross flaws its prior 
> DJBX33X approach suffered, taking DJBX33X out of its original context and 
> applying it in an area it wasn't designed for.

But we know why the DJBX33A hash didn't work (nested tuples), so we can think 
of the best way to solve that. Python messes with the multiplier, which makes 
it quite a different hash. Surely, if you believe that the precise choice of 
multiplier matters a lot, then you should also agree that arbitrarily changing 
the multiplier in the loop is a bad idea.

My proposal instead is to keep the structure of the DJBX33A hash but change the 
hash of the individual items to be hashed. That's a much less invasive change 
to the known algorithm.

Finally, something that I haven't mentioned here: an additional advantage of my 
approach is that high-order bits become more important:

BEFORE:
>>> L = [n << 60 for n in range(100)]; T = [(a,b,c) for a in L for b in L for c 
>>> in L]; len(set(hash(x) for x in T))
50

AFTER:
>>> L = [n << 60 for n in range(100)]; T = [(a,b,c) for a in L for b in L for c 
>>> in L]; len(set(hash(x) for x in T))
100

Again, I'm not claiming that this is a major issue. Just additional evidence 
that maybe my new hash might actually be slightly better than the existing hash.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-22 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

Thanks for the reference, I never heard of the FNV hash. Unfortunately, I 
haven't seen a reference for the rationale of how they pick their multiplier.

It's not clear what you are suggesting though. Keep using the FNV-ish hash 
somehow? Or using a DJBX33A hash with the FNV multiplier?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-22 Thread Tim Peters


Tim Peters  added the comment:

So you don't know of any directly relevant research either.  "Offhand I can't 
see anything wrong" is better than nothing, but very far from "and we know it 
will be OK because [see references 1 and 2]".

That Bernstein's DJBX33A has been widely used inspires no confidence 
whatsoever.  As already explained, Python _did_ use DJBX33X (with a different 
multiplier), and it systematically screwed up catastrophically in some nested 
tuple cases already spelled out.  Bernstein himself moved to DJBX33X (using xor 
instead of addition), and that would have screwed up too on a mix of smallish 
integers of both signs.

What Python does now, except for changing the multiplier, is essentially 
version 1a of the _also_ very widely used - but more recent - Fowler/Noll/Vo 
hash family[1]:

# Version 1a, recommended over version 1 (which does * before ^).
hash = offset_basis
for each octet_of_data to be hashed
hash = hash xor octet_of_data
hash = hash * FNV_prime
return hash

They did extensive studies, and very deliberately use a prime multiplier 
subject to a number of other constraints.  Not based on "offhand I can't see 
why not", but based on analysis and running extensive tests.

But, same as with Bernstein's variants (which predate FNV's work on picking 
"good" multipliers), they were developed in the context of hashing sequences of 
unsigned 8-bit integers.  There's no a priori reason I can see to expect them 
to work well in contexts other than that.  Indeed, FNV puts special constraints 
on the last 8 bits of the primes they pick, presumably because they're only 
concerned with hashing sequences of 8-bit values.

FYI, for 32-bit hashes they use multiplier 16777619, and for 64-bit 
1099511628211.

In the absence of coherent analysis directly relevant to what Python actually 
needs here, we're all flying blind.  What we do have is over a decade of 
positive real-world experience with the made-up hacks Python used to worm 
around a class of gross flaws its prior DJBX33X approach suffered, taking 
DJBX33X out of its original context and applying it in an area it wasn't 
designed for.  Which made it close to FNV-1a, but also in a context _that_ 
wasn't designed for.

However, it _is_ structurally close to FNV-1a, and using prime multipliers was 
a big deal to them.  "But offhand I don't see why" isn't a good enough reason 
to abandon that.  To the contrary, in the absence of _proof_ that it doesn't 
actually matter, I'd be happiest using the same multipliers (given above) and 
initial values "the standard" FNV-1a algorithms use.

[1] http://www.isthe.com/chongo/tech/comp/fnv/index.html#FNV-1a

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-22 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> I'm not aware of any research papers about picking multipliers in this 
> context, but would love to see one.

The only real condition that I can think of is that the order should be large: 
we do not want MULTIPLIER**n = 1 (mod 2**N) for a small number n.

Other than that, we could pick the multiplier to guarantee no hash collisions 
on some chosen subset of inputs. A bit like your product(range(100), repeat=4) 
example but then for more inputs.

If you agree with this approach, I could try to find a good multiplier this way.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-22 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> Which was your original suggestion.  Which you appear to be opposed to now?  
> I'm unclear about why, if so.

I'm not strictly opposed to that. It's just that I have less confidence in the 
current ad-hoc hash compared to something following the DJBX33A scheme.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34751] Hash collisions for tuples

2018-09-22 Thread Jeroen Demeyer


Jeroen Demeyer  added the comment:

> I don't know that primes are important here, but neither do I know that 
> they're _not_ important here.

Hashes are effectively computed modulo 2**N. "Primes" are meaningless in that 
setting (except for the prime 2 which does have a meaning). For example, 
103 is prime but 103 + 2**64 is not. But these represent the same 
number modulo 2**64. Also, multiplication by any odd number is a permutation 
modulo 2**N, so every odd number is invertible.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



  1   2   >