On 6/3/26 11:20, Oleg Bartunov wrote: > ... > > Bloom filters have two rather different roles here. > > For a local Hash Join optimization, Bloom does not require any > particular physical ordering of the heap. It can be useful simply when > the join is selective enough, or when batching/spilling makes failed > probes expensive: the Bloom filter rejects many outer tuples before a > full hash-table probe or before writing them to temporary batches. >
Right. Adding a filter within a hash join is certainly less ambitious, and the possible benefits are smaller. > But once we talk about pushing a runtime filter down to the scan/storage > layer, the physical preconditions become crucial. To get more than a > cheap per-row check, the scan must have something coarse-grained to > skip: partitions, row groups, chunks, block ranges, dictionaries, min/ > max metadata, BRIN-like summaries, etc. Without that, the filter is > still correct, but the benefit is mostly CPU/probe reduction rather than > avoiding data production. > Maybe, but there's also ongoing work on adding batches to the executor, in which case we'd eliminate "row groups" even when using a filter in the scope of a hashjoin operator. Of course, the tuples will flow all the way up to that operator. > So for me the most interesting part of this thread is not Bloom itself, > but the architectural idea: pushing runtime knowledge down to the scan > node, against the normal direction of data flow. The build side of a > join produces compact knowledge about admissible keys, and lower layers > may use it before rows are materialized and sent upward. > > I saw this in my own experiments with zone/chunk-oriented storage for > Postgres: static predicates could prune zones nicely, but joins were the > hard case because the useful filtering knowledge was produced above the > scan. A runtime semi-join filter pushed from the Hash Join build side > into the scan could turn join-derived knowledge into scan-level pruning. > > For example: > > SELECT sum(e.cost) > FROM events e > JOIN accounts a ON e.account_id = a.id <http://a.id> > WHERE a.region = 'NP'; -- Nepal > > The events scan does not know which account_id values are EU accounts. > That knowledge is produced above it, on the build side of the join. A > runtime semi-join filter pushed from the Hash Join build side down into > the events scan could let the scan reject impossible account_id values > before producing tuples. > Yes. This is known as "predicate transfer" in academic papers. > For a plain heap scan this may mostly save hash probes. But with zone/ > chunk-oriented storage, where chunks have dictionaries, min/max > metadata, Bloom summaries, or tenant ranges, the same runtime filter can > skip whole chunks. That is the part I find most interesting: turning > join-derived knowledge into scan-level pruning, against the normal > direction of data flow. > > Bloom is just one carrier for that knowledge. The real feature is a > pluggable runtime-filter mechanism that heap, CustomScan, FDW, columnar/ > table AMs, partitioned storage, or chunk/cold storage can consume at the > level they understand. > > This may be a topic for a separate thread, because it quickly becomes > less about Hash Join Bloom filters and more about runtime knowledge > pushdown into storage. > Right, there's a general concept of a "filter", and Bloom filters are just one example of that. And maybe we could build other types of filters more suitable for the scan. But I think it'll still be tied to a hash join, because what other nodes / joins can build the filter? regards -- Tomas Vondra
