Re: hashjoins vs. Bloom filters (yet again)

Andrew Dunstan Sun, 31 May 2026 08:04:09 -0700


On 2026-05-30 Sa 2:14 PM, Tomas Vondra wrote:



Hi, Tomas

This is terrific and very timely from my POV.

I've been experimenting with a table AM (implemented as a
CustomScan scan provider), and bloom-filter pushdown from a hashjoin is one
of the bigger wins available to it: a fact-table scan joined to a filtered
dimension can use the filter to skip whole row groups and avoid
decompressing columns entirely, rather than just rejecting a tuple after
it's been produced. I'd hacked up a private version of this via a new
table-AM callback (the hashjoin walks the outer subtree, builds a filter
from the build side, and hands it to the AM's scan descriptor). Having now
read your PoC, I think your framework is the better foundation, and I'd
rather build on it than carry a parallel mechanism. But two things stand in
the way of a storage-level consumer using it, and I think both are
relatively small.

OK, good to hear. I was actually thinking about that use case too, i.e.
making it possible for the scan to do something smart with the filter
(like pushing it even further down, to "storage"). Or maybe the
ForeignScan could push it to the remote side, so that it's actually
filtered there.

I didn't mention that in my message, and there are some difficulties:

1) We only build the hash (and bloom) with a delay, after the scan
already produces some tuples. That complicates the pushdown, which may
need to happen when starting the scan. Presumably, we'd need to allow
disabling this optimization, optionally.

2) We'd need some sort of "portable" Bloom filter, with serialization
and deserialization, etc.

Both of these seem rather solvable.
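
To make point 2 a bit more concrete, here is a minimal sketch of what a "portable" filter image could look like. Everything in it is hypothetical (the ToyPortableBloom struct, the 5-byte header, the function names); it is not the bloom_filter used by the patch, and a real format would also have to pin down the hash function and seed so that both ends agree on them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLOOM_MAX_BYTES 64
#define TOY_BLOOM_HEADER    5   /* 4-byte length + 1-byte hash count */

/* Hypothetical in-memory filter; field layout is an assumption. */
typedef struct
{
    uint32_t nbytes;                    /* bitmap bytes actually in use */
    uint8_t  nhashes;                   /* hash functions used by the builder */
    uint8_t  bits[TOY_BLOOM_MAX_BYTES];
} ToyPortableBloom;

/* Write the filter in a fixed little-endian layout; returns bytes written, 0 if buf is too small. */
static size_t
toy_bloom_serialize(const ToyPortableBloom *f, uint8_t *buf, size_t buflen)
{
    size_t need;

    assert(f->nbytes <= TOY_BLOOM_MAX_BYTES);
    need = TOY_BLOOM_HEADER + (size_t) f->nbytes;
    if (buflen < need)
        return 0;

    buf[0] = (uint8_t) (f->nbytes & 0xFF);
    buf[1] = (uint8_t) ((f->nbytes >> 8) & 0xFF);
    buf[2] = (uint8_t) ((f->nbytes >> 16) & 0xFF);
    buf[3] = (uint8_t) ((f->nbytes >> 24) & 0xFF);
    buf[4] = f->nhashes;
    memcpy(buf + TOY_BLOOM_HEADER, f->bits, f->nbytes);
    return need;
}

/* Rebuild a filter from the byte image; rejects truncated or oversized input. */
static bool
toy_bloom_deserialize(ToyPortableBloom *f, const uint8_t *buf, size_t buflen)
{
    uint32_t nbytes;

    if (buflen < TOY_BLOOM_HEADER)
        return false;
    nbytes = (uint32_t) buf[0] | ((uint32_t) buf[1] << 8) |
             ((uint32_t) buf[2] << 16) | ((uint32_t) buf[3] << 24);
    if (nbytes > TOY_BLOOM_MAX_BYTES || buflen - TOY_BLOOM_HEADER < nbytes)
        return false;

    memset(f, 0, sizeof(*f));
    f->nbytes = nbytes;
    f->nhashes = buf[4];
    memcpy(f->bits, buf + TOY_BLOOM_HEADER, nbytes);
    return true;
}
```

The explicit byte order is the only part that matters for portability here; the rest is bookkeeping.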

1) A CustomScan can't currently be a recipient.

find_bloom_filter_recipient() only recognizes the stock scan tags, and the
probe itself lives in ExecScanExtended(), which a CustomScan never calls
(it dispatches to the provider's ExecCustomScan). The second part is
actually a feature, not a bug: if a CustomScan provider does its own
probing, it can choose the granularity -- per dictionary entry, per row
group, or per row -- instead of being locked into the per-row,
post-materialization probe that the stock nodes get. So all that's needed
on your side is to let the planner attach a filter to a base-relation
CustomScan; the provider takes care of consuming it.

Concretely, that's adding T_CustomScan to the scan-leaf case in
find_bloom_filter_recipient() (CustomScan embeds Scan first, so the
scanrelid test is identical; non-leaf custom nodes have scanrelid == 0 and
fall through to NULL), plus the matching fix_scan_bloom_filters() call in
set_customscan_references(). The provider then calls ExecInitBloomFilters()
in BeginCustomScan and ExecBloomFilters() (or a coarser-grained variant)
inside its scan loop. Everything else -- producer registration, the
es_bloom_producers lookup, the adaptive sampling, EXPLAIN -- is reused
unchanged.

Yes, that should work and it's a mostly mechanical change.

Maybe we'd want some sort of opt-in, so that the CustomScan can indicate
it can handle Bloom filters. Like, setting
CUSTOMPATH_SUPPORT_BLOOM_FILTERS to flags.

2) The combined-hash filter can't be tested against a single column.

You build one filter keyed on hash32() of all the join keys combined. For a
single-key join that's ideal, and a column store can use it directly: hash
each distinct dictionary value once per row group and skip groups whose
values are all absent. For a multi-column join, though, the combined hash
mixes the keys, so it can only ever be tested per-row (with all key columns
in hand) -- it can't be checked against any one column's dictionary. The
per-row probe is still useful, but the row-group/dictionary skipping, which
is where most of the storage win comes from, isn't available.
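
As an illustration of the single-key case, the dictionary check could be sketched roughly like this. The ToyBloom type and its two hash functions are made up for the example and kept deliberately simplistic so the result can be checked by hand; they are not the filter or the hash32() used in the PoC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy 64-bit Bloom filter with two hand-checkable hashes (hypothetical). */
typedef struct
{
    uint64_t bits;
} ToyBloom;

static unsigned toy_h1(uint32_t v) { return v % 64; }
static unsigned toy_h2(uint32_t v) { return (v / 3) % 64; }

/* Build side: set both bits for a join key value. */
static void
toy_bloom_add(ToyBloom *f, uint32_t v)
{
    f->bits |= (UINT64_C(1) << toy_h1(v)) | (UINT64_C(1) << toy_h2(v));
}

/* Probe: false means "definitely absent", true means "possibly present". */
static bool
toy_bloom_may_contain(const ToyBloom *f, uint32_t v)
{
    uint64_t mask = (UINT64_C(1) << toy_h1(v)) | (UINT64_C(1) << toy_h2(v));

    return (f->bits & mask) == mask;
}

/*
 * A row group can be skipped without decompression when none of its
 * dictionary entries can be in the filter.  Each distinct value is
 * probed once, however many rows carry it.
 */
static bool
row_group_can_skip(const ToyBloom *f, const uint32_t *dict, size_t ndict)
{
    assert(dict != NULL || ndict == 0);
    for (size_t i = 0; i < ndict; i++)
    {
        if (toy_bloom_may_contain(f, dict[i]))
            return false;
    }
    return true;
}
```

With build keys 5 and 17, a row group whose dictionary is {100, 200, 300} is skipped, while one containing 17 is kept. A dictionary holding only 197 is also kept even though nothing matches, because both of its bits happen to collide; the filter errs only in that direction.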

The obvious thought is to key a filter per column instead. But I don't
think that should *replace* the combined filter, because per-column filters
are strictly less selective on multi-column joins: they only test whether
each column's value appears *somewhere* in the build side, not whether the
combination does. With build pairs {(1,10),(2,20)}, an outer (1,20) passes
both per-column filters even though it matches no build row, whereas the
combined filter rejects it. So for the row-level probe -- and especially
for plain heap -- the combined filter is the better one, and replacing it
would be a regression.
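
The example can be spelled out in a few lines of C. To keep it deterministic, this sketch uses exact membership tests as stand-ins for the filters (the KeyPair type and both function names are invented for illustration); a real Bloom filter can only be more permissive than the exact set, so this is the best case for both variants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One build-side row of a two-key join. */
typedef struct
{
    int x;
    int y;
} KeyPair;

/* Stand-in for the combined filter: is (x, y) itself a build-side row? */
static bool
combined_may_match(const KeyPair *build, size_t n, int x, int y)
{
    for (size_t i = 0; i < n; i++)
    {
        if (build[i].x == x && build[i].y == y)
            return true;
    }
    return false;
}

/*
 * Stand-in for two per-column filters: does x appear in some build row
 * and y appear in some (possibly different) build row?
 */
static bool
percolumn_may_match(const KeyPair *build, size_t n, int x, int y)
{
    bool x_seen = false;
    bool y_seen = false;

    assert(build != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
    {
        if (build[i].x == x)
            x_seen = true;
        if (build[i].y == y)
            y_seen = true;
    }
    return x_seen && y_seen;
}
```

For build rows {(1,10),(2,20)}, percolumn_may_match() accepts the outer row (1,20) while combined_may_match() rejects it; both accept (2,20).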

What I think would actually help is to let the framework *optionally* emit
per-column filters in addition to the combined one, when a recipient
signals it can use them. The combined filter stays the default and does the
precise per-row rejection (unchanged for heap, and usable per-row by a
column store too); the per-column filters are extra, built only on demand,
and let a storage consumer cheaply eliminate whole row groups before the
combined filter does the exact work. The cost is the build CPU and memory
for the extra filters -- but only for consumers that ask, so your design is
untouched when nobody does. For a single-key join the two filters
coincide, so there'd be no reason to build both.

I think I speculated about this (having per-key filters) in some of the
comments in the patch, although the use case was different. I haven't
thought about TAM, but about different joins where the join keys come
from both sides. Consider a join like

         HJ
       /    \
      A     HJ
          /    \
         B      C

where A-(BC) is on (A.x = B.x AND A.y = C.y), so the complete filter
can't be pushed to either side. But we could:

(1) Push the filter on top of the BC join (which in this example is not
really a push-down).

(2) Build filters on (x) and (y) separately, and push-down these.

Or we could do both, really.

I suppose a variation of (2) would work for your use case too, except
we'd push all three filters (x,y), (x) and (y) to the same scan.

I guess this could also be opt-in, enabled by some CUSTOMPATH_ flag.

The question is how efficient the smaller filters can be. The complete
filter can be very selective, while the per-key filters are terrible.

I'd be happy to work on patches for these.

Great. It's an interesting experiment / area to explore.



Here are 3 patches (developed using Claude) that sit on top of your POC.

Patch 1 enables the pushdown filters for custom scans. As you say, it's fairly mechanical, and it is enabled by a CUSTOMPATH_SUPPORT_BLOOM_FILTERS path flag.

Patch 2 provides for building per-key filters in addition to the multi-key filter if that flag is set. There may be other cases that would want it, but this would suit my immediate use case.

Patch 3 provides for eager creation of the filter(s) in such cases, disabling the optimization you mentioned in point 1 above.


FWIW I think the main difficulty for this PoC is going to be the
planning/costing stuff, and the impact on EXPLAIN.

I haven't dealt with that or the other issues you raise, but I think this is enough for me to begin testing. I have adapted my TAM to it and verified that it acts as expected. I will start doing some benchmarks.



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

From ff734511d22bcb93f5c1256fd745a9d21818f7f1 Mon Sep 17 00:00:00 2001
From: Andrew Dunstan <[email protected]>
Date: Sun, 31 May 2026 07:13:48 -0400
Subject: [PATCH addon 1/3] Allow a CustomScan to receive a pushed-down
 hashjoin bloom filter

Extend the hashjoin bloom-filter pushdown so that a base-relation
CustomScan can be a recipient, gated on a new opt-in path flag
CUSTOMPATH_SUPPORT_BLOOM_FILTERS.  This lets a table AM implemented as a
CustomScan scan provider consume the filter and apply it inside its own
scan loop -- for a column store, at row-group or dictionary granularity,
before decompression -- rather than only rejecting an already-produced
tuple.

find_bloom_filter_recipient() now treats a base-rel CustomScan
(scanrelid > 0) that advertised the flag the same as a SeqScan.  The
probe is not wired into ExecScanExtended() (a CustomScan dispatches to
the provider's ExecCustomScan), so the provider calls ExecBloomFilters()
itself; ExecInitCustomScan() compiles the probe state up front via
ExecInitBloomFilters() so the provider need not touch bloom internals.
set_customscan_references() fixes the pushed key expressions for a
base-relation scan just like the scan qual.

Providers that do not set the flag, and heap, are unaffected.
---
 src/backend/executor/nodeCustom.c       | 10 ++++++++++
 src/backend/optimizer/plan/createplan.c | 19 +++++++++++++++++++
 src/backend/optimizer/plan/setrefs.c    | 10 ++++++++++
 src/include/nodes/extensible.h          |  2 ++
 4 files changed, 41 insertions(+)

diff --git a/src/backend/executor/nodeCustom.c b/src/backend/executor/nodeCustom.c
index b7cc890cd20..dfd87e49737 100644
--- a/src/backend/executor/nodeCustom.c
+++ b/src/backend/executor/nodeCustom.c
@@ -101,6 +101,16 @@ ExecInitCustomScan(CustomScan *cscan, EState *estate, int eflags)
        css->ss.ps.qual =
                ExecInitQual(cscan->scan.plan.qual, (PlanState *) css);
 
+       /*
+        * Set up any bloom filters a hash join pushed down to this scan (see
+        * nodeHashjoin.c).  This compiles the probe expressions against the scan
+        * tuple slot; the provider is responsible for actually probing them with
+        * ExecBloomFilters() from its ExecCustomScan callback, at whatever
+        * granularity it supports.  A no-op unless the provider advertised
+        * CUSTOMPATH_SUPPORT_BLOOM_FILTERS and the planner found a filter to push.
+        */
+       ExecInitBloomFilters((PlanState *) css, css->ss.ss_ScanTupleSlot);
+
        /*
         * The callback of custom-scan provider applies the final initialization
         * of the custom-scan-state node according to its logic.
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 7ecb551aae6..304ce0e3c0d 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -4799,6 +4799,25 @@ find_bloom_filter_recipient(Plan *plan, Index target_relid)
                                        return plan;
                                return NULL;
                        }
+               case T_CustomScan:
+                       {
+                               /*
+                                * A CustomScan on a base relation can act as a recipient, but
+                                * only if the provider advertised that it knows how to consume
+                                * a pushed-down bloom filter.  Unlike the stock scans, the
+                                * probe is not performed by ExecScanExtended() (a CustomScan
+                                * dispatches to the provider's own ExecCustomScan); the
+                                * provider is responsible for calling ExecBloomFilters() at
+                                * whatever granularity it likes.  Non-leaf custom nodes have
+                                * scanrelid == 0 and so are rejected by the relid test.
+                                */
+                               CustomScan *cscan = (CustomScan *) plan;
+
+                               if ((cscan->flags & CUSTOMPATH_SUPPORT_BLOOM_FILTERS) &&
+                                       cscan->scan.scanrelid == target_relid)
+                                       return plan;
+                               return NULL;
+                       }
                case T_Sort:
                case T_IncrementalSort:
                case T_Material:
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 0059acfccbe..74c7a5bf3a5 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -1826,6 +1826,16 @@ set_customscan_references(PlannerInfo *root,
                cscan->custom_exprs =
                        fix_scan_list(root, cscan->custom_exprs,
                                                  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
+
+               /*
+                * Bloom filters pushed down to a base-relation CustomScan: the key
+                * expressions are plain Vars of the scanned relation, so they are
+                * fixed up the same way as the scan qual.  (A CustomScan emitting a
+                * custom_scan_tlist takes the branch above and would instead need
+                * fix_upper_expr against the tlist index, like IndexOnlyScan; no
+                * in-tree provider needs that yet.)
+                */
+               fix_scan_bloom_filters(root, (Plan *) cscan, rtoffset);
        }
 
        /* Adjust child plan-nodes recursively, if needed */
diff --git a/src/include/nodes/extensible.h b/src/include/nodes/extensible.h
index 517db95c4a3..ea2cef4fe3b 100644
--- a/src/include/nodes/extensible.h
+++ b/src/include/nodes/extensible.h
@@ -84,6 +84,8 @@ extern const ExtensibleNodeMethods *GetExtensibleNodeMethods(const char *extnode
 #define CUSTOMPATH_SUPPORT_BACKWARD_SCAN       0x0001
 #define CUSTOMPATH_SUPPORT_MARK_RESTORE                0x0002
 #define CUSTOMPATH_SUPPORT_PROJECTION          0x0004
+/* provider can accept a hashjoin bloom filter pushed down to its scan */
+#define CUSTOMPATH_SUPPORT_BLOOM_FILTERS       0x0008
 
 /*
  * Custom path methods.  Mostly, we just need to know how to convert a
-- 
2.43.0

From 2bac3bb8a4917f77deb19998752493f75b4f1c70 Mon Sep 17 00:00:00 2001
From: Andrew Dunstan <[email protected]>
Date: Sun, 31 May 2026 07:13:58 -0400
Subject: [PATCH addon 2/3] Optionally build per-key hashjoin bloom filters for
 opted-in recipients

Add an opt-in path that builds one bloom filter per join key, in
addition to the existing combined-hash filter, when the pushdown
recipient is a CustomScan that advertised CUSTOMPATH_SUPPORT_BLOOM_FILTERS
and the join has more than one key.

The combined filter, keyed on the hash of all keys together, stays the
default and remains the more selective one for a per-row probe: per-key
filters only test whether each column's value appears somewhere in the
build side, so on a multi-column join they are strictly weaker (they
cannot reject a row whose columns each match but not as a tuple).  What
they enable is testing a single key column on its own -- a column store
can check one column against its per-column dictionary or zone map and
skip whole row groups before decompression, which the combined filter
cannot support.

The build reuses the per-key inner hash functions (the combined hash
value cannot be decomposed, so the Hash node builds one single-key hash
ExprState per key); the extra CPU and memory are paid only by a consumer
that opted in.  A recipient correlates HashState.perkey_filters[i] with
BloomFilter.filter_exprs[i] by position.  Heap and single-key joins are
unaffected.
---
 src/backend/executor/nodeHash.c         | 35 +++++++++++++++++++++++++
 src/backend/executor/nodeHashjoin.c     | 25 ++++++++++++++++++
 src/backend/optimizer/plan/createplan.c | 12 +++++++++
 src/include/nodes/execnodes.h           | 14 ++++++++++
 src/include/nodes/plannodes.h           | 11 ++++++++
 5 files changed, 97 insertions(+)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 37224324bce..2b045eae186 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -197,6 +197,25 @@ MultiExecPrivateHash(HashState *node)
                                                                  (unsigned char *) &hashvalue,
                                                                  sizeof(hashvalue));
 
+                       /*
+                        * Likewise for the optional per-key filters, using the per-key
+                        * (single-key) hash ExprStates.  Same econtext as the combined
+                        * hash above (ecxt_outertuple is the just-fetched inner tuple).
+                        */
+                       for (int k = 0; k < node->perkey_nfilters; k++)
+                       {
+                               bool            keyisnull;
+                               uint32          keyhash;
+
+                               keyhash = DatumGetUInt32(ExecEvalExprSwitchContext(node->perkey_hash[k],
+                                                                                                                  econtext,
+                                                                                                                  &keyisnull));
+                               if (!keyisnull)
+                                       bloom_add_element(node->perkey_filters[k],
+                                                                         (unsigned char *) &keyhash,
+                                                                         sizeof(keyhash));
+                       }
+
                        bucketNumber = ExecHashGetSkewBucket(hashtable, hashvalue);
                        if (bucketNumber != INVALID_SKEW_BUCKET_NO)
                        {
@@ -722,6 +741,22 @@ ExecHashTableCreate(HashState *state)
                oldctx = MemoryContextSwitchTo(hashtable->hashCxt);
                state->bloom_filter = bloom_create((int64) Max(rows, 1.0),
                                                                                   bloom_work_mem, 0);
+
+               /*
+                * If a recipient opted in, also build one filter per join key (in
+                * addition to the combined one above).  These let a recipient test an
+                * individual key column on its own; they are less selective than the
+                * combined filter, so they are built only on demand.
+                */
+               if (state->want_perkey_bloom)
+               {
+                       state->perkey_filters = palloc_array(struct bloom_filter *,
+                                                                                                state->perkey_nfilters);
+                       for (int i = 0; i < state->perkey_nfilters; i++)
+                               state->perkey_filters[i] = bloom_create((int64) Max(rows, 1.0),
+                                                                                                               bloom_work_mem, 0);
+               }
+
                MemoryContextSwitchTo(oldctx);
        }
 
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 8fa7af4cfef..1eaf81285f8 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -908,6 +908,7 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
        hashState = castNode(HashState, innerPlanState(hjstate));
        hashState->want_bloom_filter = (node->bloom_consumer_count > 0);
        hashState->bloom_filter_id = node->bloom_filter_id;
+       hashState->want_perkey_bloom = node->bloom_perkey;
 
        /*
         * Initialize result slot, type and projection.
@@ -1031,6 +1032,28 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
                                                                &hashstate->ps,
                                                                0);
 
+               /*
+                * If a recipient opted in to per-key bloom filters, build one inner
+                * (single-key) hash ExprState per join key, used by the Hash node to
+                * populate the per-key filters.  The combined hash above cannot be
+                * decomposed, so this is the extra cost a per-key consumer pays.
+                */
+               if (hashstate->want_perkey_bloom)
+               {
+                       hashstate->perkey_nfilters = nkeys;
+                       hashstate->perkey_hash = palloc_array(ExprState *, nkeys);
+                       for (int i = 0; i < nkeys; i++)
+                               hashstate->perkey_hash[i] =
+                                       ExecBuildHash32Expr(hashstate->ps.ps_ResultTupleDesc,
+                                                                               hashstate->ps.resultops,
+                                                                               &inner_hashfuncid[i],
+                                                                               list_make1_oid(list_nth_oid(node->hashcollations, i)),
+                                                                               list_make1(list_nth(hash->hashkeys, i)),
+                                                                               &hash_strict[i],
+                                                                               &hashstate->ps,
+                                                                               0);
+               }
+
                /* Remember whether we need to save tuples with null join keys */
                hjstate->hj_KeepNullTuples = HJ_FILL_OUTER(hjstate);
                hashstate->keep_null_tuples = HJ_FILL_INNER(hjstate);
@@ -1118,6 +1141,7 @@ ExecEndHashJoin(HashJoinState *node)
                ExecHashTableDestroy(node->hj_HashTable);
                node->hj_HashTable = NULL;
                hashNode->bloom_filter = NULL;
+               hashNode->perkey_filters = NULL;
        }
 
        /*
@@ -1775,6 +1799,7 @@ ExecReScanHashJoin(HashJoinState *node)
                         * freed by the ExecHashTableDestroy call.
                         */
                        hashNode->bloom_filter = NULL;
+                       hashNode->perkey_filters = NULL;
 
                        /*
                         * if chgParam of subnode is not null then plan will be re-scanned
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 304ce0e3c0d..5b01b3e45cc 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -4992,6 +4992,18 @@ try_push_bloom_filter(PlannerInfo *root, HashJoin *hj, Plan *outer_plan)
 
        recipient->bloom_filters = lappend(recipient->bloom_filters, bf);
 
+       /*
+        * If the recipient is a CustomScan that opted in, also build a separate
+        * filter per join key.  Only such a recipient can make use of them (to
+        * test a single column against a dictionary or zone map); the combined
+        * filter is always built and is the more selective one for the per-row
+        * probe.  There is nothing to gain for a single-key join, where the two
+        * coincide.
+        */
+       if (list_length(hashkeys) > 1 && IsA(recipient, CustomScan) &&
+               (((CustomScan *) recipient)->flags & CUSTOMPATH_SUPPORT_BLOOM_FILTERS))
+               hj->bloom_perkey = true;
+
        /*
         * XXX We've manged to push the filter to the scan node, but maybe
         * we should wait with updating bloom_consumer_count when it actually
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 04333f1a4d0..ee98bcb3adf 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2783,6 +2783,20 @@ typedef struct HashState
         */
        struct bloom_filter *bloom_filter;
 
+       /*
+        * Optional per-key bloom filters, built in addition to the combined
+        * bloom_filter above when a recipient opted in (HashJoin.bloom_perkey).
+        * perkey_filters has perkey_nfilters entries, one per join key, in hashkey
+        * order; a recipient correlates them with BloomFilter.filter_exprs by
+        * position.  perkey_hash holds the matching per-key (single-key) hash
+        * ExprStates used to populate them during the build.  All live in hashCxt
+        * and follow the same lifecycle as bloom_filter.
+        */
+       bool            want_perkey_bloom;
+       int                     perkey_nfilters;
+       struct bloom_filter **perkey_filters;
+       ExprState **perkey_hash;
+
        /*
         * Counters with total per-filter instrumentation. Separate from the
         * per-recipient counters in BloomFilterState. Redundant, but will be
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4e35d77cc49..21ec7ffae1a 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1124,6 +1124,17 @@ typedef struct HashJoin
         * Zero when this HashJoin has no consumers.
         */
        int                     bloom_filter_id;
+
+       /*
+        * Whether to also build one bloom filter per join key (in addition to the
+        * combined-hash filter), so that a recipient can test an individual key
+        * column on its own -- e.g. a column store probing a per-column dictionary
+        * or zone map.  Set at plan time only when the recipient is a CustomScan
+        * that advertised CUSTOMPATH_SUPPORT_BLOOM_FILTERS.  The combined filter is
+        * always built and remains the more selective one; per-key filters are an
+        * opt-in extra that nobody else pays for.
+        */
+       bool            bloom_perkey;
 } HashJoin;
 
 /* ----------------
-- 
2.43.0

From 3a4be73ebded2c7cb683f2f0803dcf3badf0686a Mon Sep 17 00:00:00 2001
From: Andrew Dunstan <[email protected]>
Date: Sun, 31 May 2026 07:48:23 -0400
Subject: [PATCH addon 3/3] Build the hashjoin bloom filter eagerly for a
 CustomScan recipient

When the outer relation's startup cost is below the hash-table build
cost, ExecHashJoinImpl fetches the first outer tuple before building the
hash table, to take the empty-outer shortcut.  For a CustomScan that
consumes a pushed-down bloom filter in its own scan loop that is too
late: its first tuple request -- which for a column store may decompress
a whole row group -- happens before the filter exists, so the first
batch is scanned unfiltered.

Add a HashJoin.bloom_eager flag, set at plan time when the filter is
pushed to a CustomScan recipient (which advertised
CUSTOMPATH_SUPPORT_BLOOM_FILTERS), telling the executor to skip the
empty-outer prefetch and build the hash table -- and the filter --
before the outer scan starts.  This is driven by the same opt-in path as
the recipient itself rather than a GUC, and only such a recipient pays
the cost (a possibly-needless hash build when the outer turns out empty);
stock-scan recipients, which probe per-row after producing a tuple
anyway, are unaffected.
---
 src/backend/executor/nodeHashjoin.c     | 11 +++++++++
 src/backend/optimizer/plan/createplan.c | 30 ++++++++++++++++++-------
 src/include/nodes/plannodes.h           | 10 +++++++++
 3 files changed, 43 insertions(+), 8 deletions(-)

diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 1eaf81285f8..9154310c09a 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -317,6 +317,17 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
                                         */
                                        node->hj_FirstOuterTupleSlot = NULL;
                                }
+                               else if (((HashJoin *) node->js.ps.plan)->bloom_eager)
+                               {
+                                       /*
+                                        * We pushed a bloom filter to a CustomScan on the outer
+                                        * side that wants it at scan start (e.g. to skip row groups
+                                        * before decompression).  Skip the empty-outer prefetch and
+                                        * build the hash table -- and the filter -- first, so it is
+                                        * ready before the outer scan produces its first tuple.
+                                        */
+                                       node->hj_FirstOuterTupleSlot = NULL;
+                               }
                                else if (HJ_FILL_OUTER(node) ||
                                                 (outerNode->plan->startup_cost < hashNode->ps.plan->total_cost &&
                                                  !node->hj_OuterNotEmpty))
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 5b01b3e45cc..a70f1104800 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -4993,16 +4993,30 @@ try_push_bloom_filter(PlannerInfo *root, HashJoin *hj, Plan *outer_plan)
        recipient->bloom_filters = lappend(recipient->bloom_filters, bf);
 
        /*
-        * If the recipient is a CustomScan that opted in, also build a separate
-        * filter per join key.  Only such a recipient can make use of them (to
-        * test a single column against a dictionary or zone map); the combined
-        * filter is always built and is the more selective one for the per-row
-        * probe.  There is nothing to gain for a single-key join, where the two
-        * coincide.
+        * A CustomScan recipient that opted in consumes the filter in its own
+        * scan loop, possibly at the storage level, so it wants two things a
+        * stock scan does not.
         */
-       if (list_length(hashkeys) > 1 && IsA(recipient, CustomScan) &&
+       if (IsA(recipient, CustomScan) &&
                (((CustomScan *) recipient)->flags & CUSTOMPATH_SUPPORT_BLOOM_FILTERS))
-               hj->bloom_perkey = true;
+       {
+               /*
+                * Build the hash table (and filter) before the outer scan starts, so
+                * the filter is available on the first tuple request rather than after
+                * a batch has already been scanned unfiltered.
+                */
+               hj->bloom_eager = true;
+
+               /*
+                * Also build a separate filter per join key, so the recipient can test
+                * a single column on its own (e.g. against a per-column dictionary or
+                * zone map).  The combined filter is always built and is the more
+                * selective one for a per-row probe; there is nothing to gain for a
+                * single-key join, where the two coincide.
+                */
+               if (list_length(hashkeys) > 1)
+                       hj->bloom_perkey = true;
+       }
 
        /*
         * XXX We've manged to push the filter to the scan node, but maybe
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 21ec7ffae1a..0e011f3d4e2 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1135,6 +1135,16 @@ typedef struct HashJoin
         * opt-in extra that nobody else pays for.
         */
        bool            bloom_perkey;
+
+       /*
+        * Whether to build the hash table (and bloom filter) before fetching the
+        * first outer tuple, skipping the empty-outer prefetch optimization.  Set
+        * at plan time when the filter is pushed to a CustomScan recipient, which
+        * may want to apply the filter the moment its scan starts (e.g. a column
+        * store skipping row groups before decompression) rather than after having
+        * already produced a batch unfiltered.  See ExecHashJoinImpl.
+        */
+       bool            bloom_eager;
 } HashJoin;
 
 /* ----------------
-- 
2.43.0
