Eric Appelt added the comment:

I spent some time looking at the code in Objects/setobject.c that implements 
the hashing scheme for frozensets, which essentially works by XOR-ing together 
the hashes of all the entries. Significant care is taken to shuffle bits at 
various stages, and the algorithm seems well thought out. My own attempts to 
change constants or add reshufflings either didn't help the collision 
statistics or made things worse. I think that there is something of a 
fundamental limitation here due to information loss in XOR, and the other fast, 
commutative bitwise operations (e.g. AND, OR) are well known to be much worse.
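
As a toy illustration of that information loss (4-bit values standing in for 
real hashes; this example is mine, not anything from setobject.c), two 
entirely different pairs of entries can XOR to the same combined value:

    # Toy 4-bit "hashes": two distinct pairs collapse to one XOR result.
    a, b = 0b1100, 0b1010
    c, d = 0b0110, 0b0000
    assert a ^ b == c ^ d == 0b0110  # {a, b} and {c, d} collide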

I did stumble on a 2006 paper by Boneh and Boyen [1], which looked at a 
potentially related problem of trying to combine two different hash functions 
to improve collision resistance, and found that there is no way to do this 
with fewer bits than simply concatenating the hashes. The present ticket is 
more a problem of combining hashes from the same function in an 
order-independent way and may be completely unrelated. However, I imagine that 
concatenation followed by rehashing the result would avoid the loss that XOR 
incurs on unlucky choices of hashes.

Concatenating everything seemed obviously too slow and/or unacceptable in terms 
of memory use, but I thought of a compromise: construct an array of 4 hash 
values, initialized to zero, and for each entry select an array index based on 
a reshuffling of its bits and XOR that entry into the chosen slot. This 
produces a hash that is four times wider than the standard size, which reduces 
the information loss incurred by XOR. The wide hash can then be hashed down to 
a normal-sized hash.
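
Here is a minimal sketch of the idea in Python, assuming 64-bit hash words; 
the mixing constants and the final fold are illustrative placeholders, not 
the actual patch:

    MASK = (1 << 64) - 1  # assume 64-bit hash words

    def _shuffle(h):
        # Cheap xorshift-multiply bit mixer; the real patch may use
        # different constants and shufflings.
        h ^= h >> 33
        h = (h * 0xff51afd7ed558ccd) & MASK
        h ^= h >> 33
        return h

    def wide4_hash(entry_hashes):
        wide = [0, 0, 0, 0]
        for h in entry_hashes:
            h &= MASK
            # The lane index depends only on the entry's own bits, and
            # XOR within a lane commutes, so the combined result stays
            # order-independent.
            wide[_shuffle(h) & 3] ^= h
        # Fold the 4x-wide value back down to a normal-sized hash.
        result = 0
        for lane in wide:
            result = _shuffle(result ^ lane)
        return result

The point is that two unlucky entries can only cancel if they also land in 
the same lane, so far more of the entries' bits survive into the final mix.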

I implemented this algorithm and tested it using the same procedure as before. 
Specifically, the frozensets of all 128 possible combinations of the letters 
'abcdefg' as single-character strings are hashed, and the last 7 bits of each 
hash are inspected to count how many unique 7-bit patterns are produced. This 
is done for PYTHONHASHSEED values from 1 to 10000 to build a distribution, 
which is then compared against a similar distribution constructed from 
pseudorandom numbers.
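
For reference, the measurement can be reproduced with something like the 
following (my own reconstruction of the procedure, not the original script; 
PYTHONHASHSEED must be set before interpreter startup, hence one subprocess 
per seed):

    import os
    import subprocess
    import sys
    import textwrap

    SNIPPET = textwrap.dedent("""
        from itertools import combinations
        letters = 'abcdefg'
        seen = set()
        for r in range(len(letters) + 1):
            for combo in combinations(letters, r):
                seen.add(hash(frozenset(combo)) & 0x7f)  # last 7 bits
        print(len(seen))
    """)

    def unique_patterns(seed):
        env = dict(os.environ, PYTHONHASHSEED=str(seed))
        out = subprocess.check_output([sys.executable, "-c", SNIPPET],
                                      env=env, text=True)
        return int(out)

    # Distribution of unique 7-bit patterns across many hash seeds.
    distribution = [unique_patterns(seed) for seed in range(1, 10001)]

For an ideal hash, throwing 128 values into 128 7-bit bins should cover about 
128*(1 - (127/128)^128) ~= 81 distinct patterns on average, which is what the 
pseudorandom benchmark measures.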

Unlike the current hashing algorithm for frozensets, and much like the results 
from string hashes, the distribution from this new "wide4" algorithm appears 
to be nearly ideal. The results are plotted in the attached 
"frozenset_string_n7_10k_wide4.png".

I will follow this up with a patch implementing the algorithm as soon as I 
have run the complete test suite and cleaned up the code.

Another option, if the current algorithm is considered good enough, is to 
alter the current test so that when the power set of the letters 'abcdefg...' 
fails the check, the test shifts the letters by one and tries again, perhaps 
requiring that 4 out of 5 such attempts pass (see the sketch below). This 
ought to greatly reduce the sporadic build failures.
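
Roughly like this (check_letters is my hypothetical stand-in for the existing 
power-set collision check, and the threshold of 64 distinct patterns is an 
illustrative assumption, not the real test's bound):

    import string
    from itertools import combinations

    def check_letters(letters, threshold=64):
        # Hypothetical stand-in: count distinct 7-bit hash patterns
        # over the power set and compare against a threshold.
        seen = {hash(frozenset(c)) & 0x7f
                for r in range(len(letters) + 1)
                for c in combinations(letters, r)}
        return len(seen) >= threshold

    def robust_check(n=7, attempts=5, required=4):
        # Shift the starting letter on each attempt and require that
        # most attempts pass, so one unlucky seed cannot fail the build.
        passed = 0
        for shift in range(attempts):
            letters = string.ascii_lowercase[shift:shift + n]
            if check_letters(letters):
                passed += 1
        return passed >= required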

[1] D. Boneh and X. Boyen, "On the Impossibility of Efficiently Combining 
    Collision Resistant Hash Functions", CRYPTO 2006. 
    http://ai.stanford.edu/~xb/crypto06b/blackboxhash.pdf

----------
Added file: http://bugs.python.org/file45281/frozenset_string_n7_10k_wide4.png

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26163>
_______________________________________