[Tim]
> ...
> I haven't yet thought of a cheap way to compute an
> `inc` for double-hashing that isn't vulnerable to bad behavior for
> _some_ easily constructed set of int keys.  If you forget "cheap",
> it's easy; e.g.,
>
>     random.seed(h)
>     inc = random.choice(range(1, mask + 1, 2))
Heh.  I always tell people not to believe what they think: run code to
make sure!  So I did ;-)

    def duni(h, nbits):
        from random import seed, choice
        mask = (1 << nbits) - 1
        i = h & mask
        yield i
        seed(h)
        inc = choice(range(1, mask + 1, 2))
        while True:
            i = (i + inc) & mask
            yield i

On the terrible-for-good-reasons-for-failing-`double`-1023*i-searches
case, it turned out `duni` did just as badly as `dfib`:

"""
bits 20 nslots 1,048,576 dlen 699,050 alpha 0.67 # built 1
theoretical avg probes for uniform hashing when found 1.65 not found 3.00
... crisp ... skipping (slow!)
prober current
    found min 1:100.00% max 1 mean 1.00
    fail  min 1:33.33% max 34 mean 3.04
prober double
    found min 1:100.00% max 1 mean 1.00
    fail  min 1:33.33% max 699049 mean 1867.51
prober dfib
    found min 1:100.00% max 1 mean 1.00
    fail  min 1:33.33% max 427625 mean 8.09
prober duni
    found min 1:100.00% max 1 mean 1.00
    fail  min 1:33.33% max 433846 mean 9.84
prober uniform
    found min 1:66.65% max 24 mean 1.65
    fail  min 1:33.35% max 35 mean 3.00
"""

Yikes!  That really surprised me.  So - are integers of the form 1023*i
so magical they also manage to provoke random.seed() into amazingly
regular behavior?  Or is there something about the whole idea of
"double hashing" that leaves it vulnerable no matter how "random" an
odd increment it picks?

It's not the former.  More code showed that a great number of distinct
`inc` values are in fact used.  And double hashing is widely &
successfully used in other contexts, so it's not a problem with the
idea _in general_.

What does appear to be the case: it doesn't always play well with
taking the last `nbits` bits as the initial table index.  In other
double-hashing contexts, they strive to pick a pseudo-random initial
table index too.

Why this matters: for any odd integer N,

    N*i = N*j (mod 2**k)  if and only if  i = j (mod 2**k)

So, when integers are their own hash codes, for any `i` all the values
in range(i*N, i*N + N*2**k, N) hash to unique slots in a table with
2**k slots (indeed, in any table with 2**m slots for any m >= k).  In
particular, mod 2**k they map to the slots

    i*N + 0*N
    i*N + 1*N
    i*N + 2*N
    i*N + 3*N
    ...

So, on a failing search for an integer of the form j*N, whenever
double-hashing picks an increment that happens to be a multiple of N,
it can bump into a huge chain of collisions, because double-hashing
also uses a fixed increment between probes.  And this is so for any
odd N.  For example, using a "yield i * 11" generator:

"""
bits 20 nslots 1,048,576 dlen 699,050 alpha 0.67 # built 1
theoretical avg probes for uniform hashing when found 1.65 not found 3.00
...
prober current
    found min 1:100.00% max 1 mean 1.00
    fail  min 1:33.33% max 34 mean 3.00
prober double
    found min 1:100.00% max 1 mean 1.00
    fail  min 1:33.33% max 667274 mean 34.10
prober dfib
    found min 1:100.00% max 1 mean 1.00
    fail  min 1:33.33% max 539865 mean 9.30
prober duni
    found min 1:100.00% max 1 mean 1.00
    fail  min 1:33.33% max 418562 mean 8.65
prober uniform
    found min 1:66.63% max 25 mean 1.65
    fail  min 1:33.31% max 32 mean 3.00
"""

All three double-hashing variants have horrible max collision chains,
for the reasons just explained.

In "normal" double-hashing contexts, the initial table indices are
scrambled; it's my fault they follow a (mod 2**k) arithmetic
progression in Python.  Which fault I gladly accept ;-)  It's valuable
in practice.

But, so long as that stays, it kills the double-hashing idea for me:
while it's great for random-ish hash codes, the worst cases are
horribly bad and very easily provoked (even by accident).
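To make the arithmetic concrete, here's a small standalone sketch - not
the benchmark harness that produced the numbers above; `probes_to_fail`,
the chosen increments, and the "absent" key are all made up just for
illustration.  It fills 2/3 of a 2**20-slot table with keys of the form
j*1023 (ints as their own hash codes) and counts how many probes a
failing search needs with a fixed increment: a multiple of 1023 walks
straight down the occupied arithmetic progression, while an arbitrary
odd increment escapes almost at once.

    def probes_to_fail(nbits, keys, absent, inc):
        # Count probes until a failing search for `absent` reaches an
        # empty slot, using a fixed increment between probes
        # (double-hashing style).  Illustrative only.
        mask = (1 << nbits) - 1
        occupied = {k & mask for k in keys}   # ints are their own hashes
        i = absent & mask
        count = 1
        while i in occupied:
            i = (i + inc) & mask
            count += 1
        return count

    nbits = 20
    N = 1023
    nkeys = 699_050                        # ~2/3 load, as in the runs above
    keys = [j * N for j in range(nkeys)]   # fills slots 0*N, 1*N, 2*N, ... (mod 2**20)

    absent = (1 << nbits) * N              # same j*N form, hashes to slot 0, not in the table
    print(probes_to_fail(nbits, keys, absent, inc=12345))  # arbitrary odd inc: 2 probes
    print(probes_to_fail(nbits, keys, absent, inc=3 * N))  # multiple of N: ~233,000 probes
    print(probes_to_fail(nbits, keys, absent, inc=N))      # multiple of N: ~699,000 probes

The occupied slots are exactly the residues j*N (mod 2**20) for
j < 699,050, so a probe sequence stepping by m*N stays inside that
progression for roughly 699,050/m consecutive probes - which is the
huge-max-chain behavior the tables above show.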