Super cool! Yeah, I find the intuition on ideal hash count to be tricky as well. Increasing the number of hash functions increases saturation (not desirable), but also decreases false positives (desirable) since `fpp = s**k` (where s is saturation and k is the number of hash functions). Since the ideal saturation is 0.5, fpp drops exponentially as k increases (assuming s stays at 0.5).
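(To make that concrete with some quick arithmetic of my own: at s == 0.5, k = 4 gives fpp = 0.5^4 ~= 0.06 and k = 10 gives fpp = 0.5^10 ~= 0.001, so each additional hash function halves the false positive rate, provided saturation really does stay at 0.5.)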
But, k negatively impacts saturation (`s = 1 - e**(-kn/m)`). It turns out the ideal value is `k = (m/n) * ln(2)`, which yields a saturation of 0.5. All this means you can first pick an ideal size (m) based on the expected number of unique values (n) by substituting `(m/n) * ln(2)` for k, then compute k from the chosen value of m. This is exactly what the FuzzySet implementation does (as of GH#11900). (I've put a small back-of-the-envelope sketch of this sizing math at the bottom of this mail.)

But on the intuition bit, this helped me: if you crank k way up, you drive saturation way up, and no number of hash functions will really help you once saturation is very high. In other words, since `fpp = s**k`, it takes a really large k to get fpp down when s is large (e.g., if s == 0.9 you'd need k ~= 22 to get fpp ~= 0.1), but trying to crank k up just further increases saturation. I found this calculator tool useful for visualizing the interaction between these variables (in particular, fpp vs. k): https://hur.st/bloomfilter/?n=500000000&p=.1&m=&k=

All this also makes me wonder why we make "target saturation" in FuzzySet configurable ( https://github.com/apache/lucene/blob/a55be5cb368cba698114f63bc8e8be2a2a55b089/lucene/codecs/src/java/org/apache/lucene/codecs/bloom/FuzzySet.java#L298). Since 0.5 is ideal, it's unclear to me why anyone would want to set a different target for downsizing. Maybe this sort of carries over from cases where a single hash function is used (we still have methods that set up FuzzySet that way)? I think there might be an opportunity to clean up the API a bit and get rid of the ability to create FuzzySet with a hardcoded k == 1 (shameless plug for a PR I opened back in May to do this: GH#14615).

Cheers,
-g

On Sat, Dec 20, 2025 at 1:50 AM Michael Wechner <[email protected]> wrote:

> very cool, thanks for sharing!
>
> Am 19.12.25 um 18:13 schrieb Michael McCandless:
>
> Hi Team, I thought this might be interesting to some of us :)
>
> If anyone out there also struggles to understand / intuit how a bloom
> filter's multiple hash functions (which Lucene's bloom filter implements)
> improve the false positive error rate, here is an interactive 3D
> surface/chart to see how hash function count (k), bits per element (m/n),
> and false positive error rate relate:
>
> https://githubsearch.mikemccandless.com/gemini_bloom10.html
>
> The mouseover also shows the saturation (percent bits set) but it's not
> rendered in the graph.
>
> I created this by iterating with Gemini Pro 3 ... see the full
> prompting/iterations here: https://gemini.google.com/share/46a219e0abaf
>
> Net/net doing multiple hash functions into the same backing bit set is
> surprisingly worthwhile even as the saturation grows, up to 50%.
>
> It's entirely possible Gemini hallucinated something in the graph!! But
> it seems sorta correct to me. And of course that false positive error rate
> formula involves e!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
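P.S. Here's the back-of-the-envelope sketch I mentioned above (my own scratch Java, not FuzzySet's actual code): pick m from n and a target fpp by assuming `k = (m/n) * ln(2)`, then derive k, saturation, and fpp from the chosen m. The n and fpp values just mirror the calculator link.

```java
public class BloomSizingSketch {
  public static void main(String[] args) {
    long n = 500_000_000L;   // expected number of unique values
    double targetFpp = 0.1;  // desired false positive probability

    // With k = (m/n) * ln(2), fpp = 0.5^k, so solving for m gives
    // m = -n * ln(fpp) / (ln 2)^2.
    long m = (long) Math.ceil(-n * Math.log(targetFpp) / (Math.log(2) * Math.log(2)));

    // Derive the hash count from the size we actually chose.
    int k = Math.max(1, (int) Math.round((double) m / n * Math.log(2)));

    // Predicted saturation and false positive probability for these parameters.
    double saturation = 1 - Math.exp(-(double) k * n / m);
    double fpp = Math.pow(saturation, k);

    System.out.printf("m=%d bits, k=%d, saturation=%.3f, fpp=%.4f%n", m, k, saturation, fpp);
  }
}
```

For n = 500M and a 0.1 target this lands at roughly 2.4 billion bits with k = 3 and a predicted saturation just under 0.5.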
