On 6/24/26 00:36, Ben Mejia wrote:
>
>
> On 5/29/26 5:55 PM, Tomas Vondra wrote:
>
>> old patches
>> -----------
>>
>> Those old patches tried to do a fairly small thing during a hash join,
>> and that's building a Bloom filter on the inner relation (the one that
>> gets hashed), and then use that filter before probing the hash table.
>>
>> The benefits come from Bloom filters being (fairly) cheap, and a
>> negative answer (hash is not in the filter) may allows us to skip a much
>> more expensive operation.
>>
>> The old threads patches focused especially at two hash join cases:
>>
>> (a) A very selective join, i.e. a significant fraction of outer tuples
>> does not have a match in the hash table.
>>
>> (b) A selective hash join forced to do batching because the hash table
>> is too large, and thus forced to spill outer tuples to temporary files.
>>
>> For (a), the benefit comes from Bloom filters being much cheaper to
>> probe than a hash table. The exact cost depends on the implementation,
>> sizes, etc. We're in the ballpark of 50 vs. 500 cycles, maybe. But if
>> the filter discards 90% of tuples, it can be a big win.
>>
>> For (b), the filter (for all the batches at once) allows us to discard
>> some of the outer tuples without writing them to temporary files. Which
>> is way more expensive than probing a hash table.
> As it happens, I've been exploring the use of a bitmap filter for the
> same two cases you mention. This has some relevance to the issues you
> mention in your post about sizing, false-positive rate, etc.
>
> Instead of a Bloom filter, I chose to use a bitmap filter, with one bit
> per bucket on the build side. As the inner table is built, I set a bit
> in the bitmap filter for every occupied bucket. If a bucket is empty,
> there are no matching hashes and those hash values can be skipped where
> appropriate. The advantages of this bitmap over a Bloom filter are:
>
> - sizing is pre-determined by nbuckets
> - small bitmaps (4k for 32k buckets)
> - cheaper - nominal cost to set/check bits
>
> A well-chosen Bloom filter will be more discriminating, but the bitmap
> has the same no-false-negatives guarantee and costs much less space and
> time to build.
>
Isn't that pretty much the same thing as a Bloom filter with a single
hash function? So it has the same false positive properties, i.e. it may
be sacrificing some of the accuracy for not having to calculate any more
hashes. Could be a win.
> I implemented both of your cases:
>
> Drop-before-spill: (Case b)
> Build per-batch bitmaps during inner partition pass and drop tuples that
> don't have a bit set. Saves I/O on tuples that will never match. This
> only works for inner and semi joins.
>
OK, makes sense.
>
> Single-Batch probe: (Case a)
> Only pays off in high-miss-rate joins and a bucket array larger than L2/
> L3 cache. This case has a higher penalty for hash table lookup than the
> in-cache bitmap check. This case works in multi-batch, but the I/O cost
> dominates and there is no gain.
>
Right, it's hard to beat the hash table in this case.
>
> I put runtime guards on both of these; I sampled the drop rate over a
> window and disable the filter for the rest of the pass if the rate falls
> below a threshold. (~5% for case b; ~25% for case a)
>
This seems similar to the adaptive behavior I implemented in v2. I
haven't thought of the idea of using different thresholds for the two
cases - I like that.
> The benchmarks are encouraging:
>
> For case a, I was able to see a best-case improvement of ~15% for
> carefully chosen data (dependent on L2/L3 cache size).
>
> For case b, I tested 3 cases with sparse, average and dense probe hits:
>
> sparse probe (~95% miss): +18% to +36%
> avg probe (~37% miss): +9% to +13%
> dense probe (FK-like, ~0% miss): flat, within noise
>
> (This was on a 8-core x86-64, L1 32KB/core, L2 4MB/core, L3 32MB, 31 GB
> RAM. PostgreSQL 19devel, serial hash join,
> max_parallel_workers_per_gather = 0, across work_mem = 1-8MB)
>
> Happy to share the patch and full benchmark data if useful.
>
I'd be happy to collaborate of some of this, if you're interested. Feel
free to post your patch here, or in a separate thread. Up to you.
Separate threads might be easier for cfbot to track.
I've spent some time hacking on v1, i.e. the pushdown - making the
planning work properly, etc. That's mostly orthogonal to this, with some
overlap. Ideally we'd want to do both I think.
regards
--
Tomas Vondra