On 5/29/26 5:55 PM, Tomas Vondra wrote:
old patches
-----------
Those old patches tried to do a fairly small thing during a hash join,
and that's building a Bloom filter on the inner relation (the one that
gets hashed), and then use that filter before probing the hash table.
The benefits come from Bloom filters being (fairly) cheap, and a
negative answer (hash is not in the filter) may allows us to skip a much
more expensive operation.
The old threads patches focused especially at two hash join cases:
(a) A very selective join, i.e. a significant fraction of outer tuples
does not have a match in the hash table.
(b) A selective hash join forced to do batching because the hash table
is too large, and thus forced to spill outer tuples to temporary files.
For (a), the benefit comes from Bloom filters being much cheaper to
probe than a hash table. The exact cost depends on the implementation,
sizes, etc. We're in the ballpark of 50 vs. 500 cycles, maybe. But if
the filter discards 90% of tuples, it can be a big win.
For (b), the filter (for all the batches at once) allows us to discard
some of the outer tuples without writing them to temporary files. Which
is way more expensive than probing a hash table.
As it happens, I've been exploring the use of a bitmap filter for the
same two cases you mention. This has some relevance to the issues you
mention in your post about sizing, false-positive rate, etc.
Instead of a Bloom filter, I chose to use a bitmap filter, with one bit
per bucket on the build side. As the inner table is built, I set a bit
in the bitmap filter for every occupied bucket. If a bucket is empty,
there are no matching hashes and those hash values can be skipped where
appropriate. The advantages of this bitmap over a Bloom filter are:
- sizing is pre-determined by nbuckets
- small bitmaps (4k for 32k buckets)
- cheaper - nominal cost to set/check bits
A well-chosen Bloom filter will be more discriminating, but the bitmap
has the same no-false-negatives guarantee and costs much less space and
time to build.
I implemented both of your cases:
Drop-before-spill: (Case b)
Build per-batch bitmaps during inner partition pass and drop tuples that
don't have a bit set. Saves I/O on tuples that will never match. This
only works for inner and semi joins.
Single-Batch probe: (Case a)
Only pays off in high-miss-rate joins and a bucket array larger than
L2/L3 cache. This case has a higher penalty for hash table lookup than
the in-cache bitmap check. This case works in multi-batch, but the I/O
cost dominates and there is no gain.
I put runtime guards on both of these; I sampled the drop rate over a
window and disable the filter for the rest of the pass if the rate falls
below a threshold. (~5% for case b; ~25% for case a)
The benchmarks are encouraging:
For case a, I was able to see a best-case improvement of ~15% for
carefully chosen data (dependent on L2/L3 cache size).
For case b, I tested 3 cases with sparse, average and dense probe hits:
sparse probe (~95% miss): +18% to +36%
avg probe (~37% miss): +9% to +13%
dense probe (FK-like, ~0% miss): flat, within noise
(This was on a 8-core x86-64, L1 32KB/core, L2 4MB/core, L3 32MB, 31 GB
RAM. PostgreSQL 19devel, serial hash join,
max_parallel_workers_per_gather = 0, across work_mem = 1-8MB)
Happy to share the patch and full benchmark data if useful.
-Ben Mejia