Jeroen Demeyer <[email protected]> added the comment:
I spent about 2 days doing an extensive study of the FNV and DJB algorithms. I
share my conclusions below.
To be very clear what I mean, I am talking about the following algorithms (t is
a tuple and m is the multiplier which is always assumed to be odd).
def FNV(t, m):
h = 1
for x in t:
h = ((h * m) ^ x) % 2**32
return h
def DJB(t, m):
h = 1
for x in t:
h = ((h * m) + x) % 2**32
return h
I am restricting to 32 bits because we need to support that anyway and because
it's computationally easier than 64 bits.
I took the point of view of fixing two different tuples t and u and then seeing
for which multipliers m a collision hash(t, m) == hash(u, m) occurs. Since the
lower k bits of the hash depend only on the lower k bits of the multiplier, it
is actually feasible to list all m's for which there is a collision; I wrote a
C program to do that. Note that there are 2**31 possible odd multipliers and
2**32 different possible hash values: so you expect about 0.5 collisions on
average.
For the entries of the tuples, I took random integers in the range from 0 to 63
(so the issue with negative numbers in FNV does not occur). I mostly considered
tuples of length 4, 5 and 6.
It turns out that both algorithms have pairs (t, u) for which a massive number
of collisions occur:
- For FNV, the tuples (12, 50, 52, 24, 3), (28, 18, 52, 56, 19) have 2**26
collisions (all multipliers with lower byte 0x01, 0x7f, 0x81 or 0xff give
collisions)
- For DJB, the tuples (22, 10, 12, 22, 29), (23, 14, 18, 26, 30) have 2**24
collisions (all multipliers with lower byte 0xff give collisions)
However, when you then look at the multipliers for these bad cases, they all
have a special structure in the lower bits of which there are 2 cases:
- A special bit pattern like 0x001, 0x003, 0x555, 0xccd, ...
- A zero of a low-degree polynomial like 0x9fa5 which is a zero of x^2 + x + 2
modulo 2**16 (this may only be relevant for DJB)
In order to eliminate these 2 cases, I decided to fix the lower byte of the
multiplier. I wanted it to be 3 or 5 mod 8 to have maximal multiplicative order
and then I checked which byte had the worst algebraic relations and no obvious
bit pattern. I decided to take 0xc5 as lower byte.
When considering only multipliers with 0xc5 as lower byte, I found no very bad
cases. I checked lots of pairs of tuples of length <= 6 and entries <= 64 and
the worst I could find was 8192 collisions for FNV and 1024 collisions for DJB.
This maximum number was found for tuples of length 5. Also, the maximum number
of collisions seems bounded as the bitsize increases. For a 25-bit hash, I got
the same maximum numbers of collisions as for a 32-bit hash. So, the
probability of getting a collision decreases by a factor of 2 for every bit
added, which is what you want.
When testing various bounds on the entries and tuple lengths, DJB is generally
a little bit better than FNV (by 2 metrics: the maximum number of collisions
and the root-mean-square of the number of collisions).
So my conclusion would be to use DJB as core algorithm, taking care to pick a
good multiplier (or at least not an obviously bad multiplier).
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue34751>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe:
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com