Some explanations and cautions. An advantage of sticking with pure Python instead of C is that spare moments are devoted to investigating the problem instead of thrashing with micro-optimizations ;-)
Why does the current scheme suffer for small tables?  With hindsight it's pretty obvious:  it can visit table slots more than once (the other schemes cannot), and while the odds of that happening are tiny in large tables, they're high for small tables.

For example, in a 3-bit table (8 slots), suppose the initial index is 1, and there's a collision.  The 5*i+1 recurrence _would_ visit slot 6 next, but we also shift in 5 new bits and add them.  If those bits are "random", there's a 1-in-8 chance that we'll visit index 1 _again_.  If we don't, and have another collision, shifting in the next 5 bits has a 2-in-8 chance of repeating one of the two slots we already visited.  And so on.  Here using the `str(i)` generator (which yields random-ish hash codes):

"""
bits 3 nslots 8 dlen 5 alpha 0.62 # built 20,000
...
 prober current
        found min 1:74.80% max 15 mean 1.42
        fail min 1:37.62% max 18 mean 2.66
"""

So despite that only 5 slots are filled, in at least one case it took 15 probes to find an existing key, and 18 probes to realize a key was missing.  In the other schemes, it takes at most 5 probes for success and 6 probes for failure.

This is worse for 64-bit hash codes than for 32-bit ones, because we can loop around twice as often before `perturb` permanently becomes 0 (at which point the pure 5*i+1 recurrence visits each slot at most once).

The larger the table, the less likely a repeat due to `perturb` becomes.  For example, suppose we have an 8-bit table and again visit index 1 first.  We may visit any index in range(6, 6+32) next (depending on the 5 fresh bits shifted in), but repeating 1 is _not_ a possibility.
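Here's a quick way to see the revisiting directly (this snippet isn't part of the attached script; it assumes the prober generators defined there are in scope):

from itertools import islice
from random import getrandbits

# Rough sketch:  average number of revisited slots among a prober's
# first 2**nbits probes, for random 64-bit hash codes.
def mean_repeats(prober, nbits=3, trials=10_000):
    nslots = 1 << nbits
    total = 0
    for _ in range(trials):
        h = getrandbits(64)
        probes = list(islice(prober(h, nbits), nslots))
        total += nslots - len(set(probes))  # slots probed more than once
    return total / trials

# mean_repeats(current) comes out above 0 (the current scheme can revisit
# slots), while mean_repeats(double) and mean_repeats(dfib) are exactly 0:
# an odd increment cycles through all 2**nbits slots before repeating any.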
Why do the double hashing methods do better for the `i << 12` generator?  In any case where the hash codes have "a lot" of trailing bits in common, they all map to the same table index at first, and the probe sequence remains the same for all until the loop shifts `perturb` far enough right that the rightmost differing bits finally show up in the addition.  In the double hashing methods, _all_ the bits of the hash code affect the value of `inc` computed before the loop starts, so they have a good chance of differing already on the second probe.

This is again potentially worse for the current scheme with 64-bit hash codes, since the sheer number of common trailing bits _can_ be up to 63.

However, the double-hashing schemes have pathological cases of their own, which the current scheme avoids.  The first I tried was a `yield i * 1023` generator.  These are spectacularly _good_ values for all schemes except `uniform` on successful searches, because

    i*1023 = j*1023 (mod 2**k)

implies

    i = j (mod 2**k)

That is, there are no first-probe collisions at all across a contiguous range of `i` with no more than 2**k values.  But the `double` scheme for a 10-bit table degenerates to linear probing in this case, because

    inc = (h % mask) | 1    # force it odd

is always 1 when h is divisible by 1023 (== mask for a 10-bit table).  This is terrible for a failing search; e.g., for a 20-bit table (note that 2**20-1 is divisible by 1023):

"""
bits 20 nslots 1,048,576 dlen 699,050 alpha 0.67 # built 1
theoretical avg probes for uniform hashing when found 1.65 not found 3.00
...
 prober current
        found min 1:100.00% max 1 mean 1.00
        fail min 1:33.33% max 34 mean 3.04
 prober double
        found min 1:100.00% max 1 mean 1.00
        fail min 1:33.33% max 699049 mean 1867.51
 prober dfib
        found min 1:100.00% max 1 mean 1.00
        fail min 1:33.33% max 427625 mean 8.09
 prober uniform
        found min 1:66.65% max 24 mean 1.65
        fail min 1:33.35% max 35 mean 3.00
"""

While that's a failing-search disaster for `double`, it's also bad for `dfib` (& I don't have a slick explanation for that).

So where does that leave us?  I don't know.  All schemes have good & bad points ;-)  I haven't yet thought of a cheap way to compute an `inc` for double-hashing that isn't vulnerable to bad behavior for _some_ easily constructed set of int keys.  If you forget "cheap", it's easy; e.g.,

    random.seed(h)
    inc = random.choice(range(1, mask + 1, 2))
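Spelled out as a prober in the same style (a sketch only:  `dprng` is a made-up name, it's not in the attached script, and it's far too expensive to be a serious candidate):

from random import Random

def dprng(h, nbits):
    mask = (1 << nbits) - 1
    i = h & mask
    yield i
    # A PRNG seeded by the hash code picks an odd increment.  Odd means
    # relatively prime to the power-of-2 table size, and no arithmetic
    # pattern in the int keys can force a degenerate value.  A local
    # Random instance leaves the global PRNG state alone.
    inc = Random(h).randrange(1, mask + 1, 2)
    while True:
        i = (i + inc) & mask
        yield i

Adding `dprng` to the `for prober in (...)` tuple in `drive()` is enough to pit it against the others.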
Regardless, I'll attach the current version of the code.

MIN_ELTS = 100_000

M64 = (1 << 64) - 1

def phash(obj, M=M64):  # hash(obj) as uint64
    return hash(obj) & M

# Probers:  generate sequence of table indices to look at,
# in table of size 2**nbits, for object with uint64 hash code h.

def linear(h, nbits):
    mask = (1 << nbits) - 1
    i = h & mask
    while True:
        yield i
        i = (i + 1) & mask

# offsets of 0, 1, 3, 6, 10, 15, ...
# this permutes the index range when the table size is a power of 2
def quadratic(h, nbits):
    mask = (1 << nbits) - 1
    i = h & mask
    inc = 1
    while True:
        yield i
        i = (i + inc) & mask
        inc += 1

def pre28201(h, nbits):
    mask = (1 << nbits) - 1
    i = h & mask
    while True:
        yield i
        i = (5*i + h + 1) & mask
        h >>= 5

def current(h, nbits):
    mask = (1 << nbits) - 1
    i = h & mask
    while True:
        yield i
        h >>= 5
        i = (5*i + h + 1) & mask

# One version of "double hashing".  The increment between probes is
# fixed, but varies across objects.  This does very well!  Note that the
# increment needs to be relatively prime to the table size so that all
# possible indices are generated.  Because our tables have power-of-2
# sizes, we merely need to ensure the increment is odd.
#
# Using `h % mask` is akin to "casting out 9's" in decimal:  it's as if
# we broke the hash code into nbits-wide chunks from the right, then
# added them, then repeated that procedure until only one "digit"
# remains.  All bits in the hash code affect the result.
#
# While mod is expensive, a successful search usually gets out on the
# first try, & then the lookup can return before the mod completes.
def double(h, nbits):
    mask = (1 << nbits) - 1
    i = h & mask
    yield i
    inc = (h % mask) | 1    # force it odd
    while True:
        i = (i + inc) & mask
        yield i

# Double hashing using Fibonacci multiplication for the increment.  This
# does about as well as `double`, but doesn't require division.
#
# The multiplicative constant depends on the word size W, and is the
# nearest odd integer to 2**W/((1 + sqrt(5))/2).  So for a 64-bit box:
#
# >>> 2**64 / ((1 + decimal.getcontext().sqrt(5))/2)
# Decimal('11400714819323198485.95161059')
#
# For a 32-bit box, it's 2654435769.
#
# The result is the high-order `nbits` bits of the low-order W bits of
# the product.  In C, the "& M" part isn't needed (unsigned * in C
# returns only the low-order bits to begin with).
#
# Annoyance:  I don't think Python dicts store `nbits`, just 2**nbits.
def dfib(h, nbits, M=M64):
    mask = (1 << nbits) - 1
    i = h & mask
    yield i
    inc = (((h * 11400714819323198485) & M) >> (64 - nbits)) | 1
    while True:
        i = (i + inc) & mask
        yield i

# The theoretical "gold standard":  generate a random permutation of the
# table indices for each object.  We can't actually do that, but
# Python's PRNG gets close enough that there's no practical difference.
def uniform(h, nbits):
    from random import seed, randrange
    seed(h)
    n = 1 << nbits
    seen = set()
    while True:
        assert len(seen) < n
        while True:
            i = randrange(n)
            if i not in seen:
                break
        seen.add(i)
        yield i

def spray(nbits, objs, cs, prober, *, used=None, shift=5):
    building = used is None
    nslots = 1 << nbits
    if building:
        used = [0] * nslots
    assert len(used) == nslots
    for o in objs:
        n = 1
        for i in prober(phash(o), nbits):
            if used[i]:
                n += 1
            else:
                break
        if building:
            used[i] = 1
        cs[n] += 1
    return used

def objgen(i=1):
    while True:
        yield str(i)
        #yield i << 12
        # i*1023 gives a unique first probe, but is deadly
        # on failing searches for `double` (especially) and
        # `dfib`.
        #yield i * 1023
        i += 1

# Average probes for a failing search; e.g.,
#     100 slots; 3 occupied
#     1: 97/100
#     2: 3/100 * 97/99
#     3: 3/100 * 2/99 * 97/98
#     4: 3/100 * 2/99 * 1/98 * 97/97
#
# `total` slots, `filled` occupied
# probability `p` probes will be needed, 1 <= p <= filled+1
# p-1 collisions followed by success:
# ff(filled, p-1) / ff(total, p-1) * (total - filled) / (total - (p-1))
# where `ff` is the falling factorial.
def avgf(total, filled):
    assert 0 <= filled < total
    ffn = float(filled)
    ffd = float(total)
    tmf = ffd - ffn
    result = 0.0
    ffpartial = 1.0
    ppartial = 0.0
    for p in range(1, filled + 2):
        thisp = ffpartial * tmf / (total - (p-1))
        ppartial += thisp
        result += thisp * p
        ffpartial *= ffn / ffd
        ffn -= 1.0
        ffd -= 1.0
    assert abs(ppartial - 1.0) < 1e-14, ppartial
    return result

# Average probes for a successful search.  Alas, this takes time
# quadratic in `filled`.
def avgs(total, filled):
    assert 0 < filled < total
    return sum(avgf(total, f) for f in range(filled)) / filled

def pstats(ns):
    total = sum(ns.values())
    small = min(ns)
    print(f"min {small}:{ns[small]/total:.2%} "
          f"max {max(ns)} "
          f"mean {sum(i * j for i, j in ns.items())/total:.2f} ")

def drive(nbits):
    from collections import defaultdict
    from itertools import islice
    import math
    import sys

    nslots = 1 << nbits
    dlen = nslots * 2 // 3
    assert (sys.getsizeof({i: i for i in range(dlen)})
            < sys.getsizeof({i: i for i in range(dlen + 1)}))
    alpha = dlen / nslots  # actual load factor of max dict
    ntodo = (MIN_ELTS + dlen - 1) // dlen
    print()
    print("bits", nbits,
          f"nslots {nslots:,} dlen {dlen:,} alpha {alpha:.2f} "
          f"# built {ntodo:,}")
    print(f"theoretical avg probes for uniform hashing "
          f"when found {math.log(1/(1-alpha))/alpha:.2f} "
          f"not found {1/(1-alpha):.2f}")
    print(" crisp ", end="")
    if nbits > 12:
        print("... skipping (slow!)")
    else:
        print(f"when found {avgs(nslots, dlen):.2f} "
              f"not found {avgf(nslots, dlen):.2f}")
    for prober in (linear,
                   quadratic,
                   pre28201,
                   current,
                   double,
                   dfib,
                   uniform,
                   ):
        print(" prober", prober.__name__)
        objs = objgen()
        good = defaultdict(int)
        bad = defaultdict(int)
        for _ in range(ntodo):
            used = spray(nbits, islice(objs, dlen), good, prober)
            assert sum(used) == dlen
            spray(nbits, islice(objs, nslots), bad, prober, used=used)
        print(" " * 8 + "found ", end="")
        pstats(good)
        print(" " * 8 + "fail ", end="")
        pstats(bad)

for bits in range(3, 23):
    drive(bits)
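One more bit that's not in the script:  a sanity check that `avgf` reproduces the hand-computed 100-slots / 3-occupied example from its comment block (run with the script's definitions in scope):

from math import isclose

# Spell out the four terms from the comment:  p probes with probability
# of p-1 collisions followed by hitting one of the 97 empty slots.
expected = (1 * (97/100) +
            2 * (3/100) * (97/99) +
            3 * (3/100) * (2/99) * (97/98) +
            4 * (3/100) * (2/99) * (1/98) * (97/97))
assert isclose(avgf(100, 3), expected)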