Hi,

This is the bucket-occupancy bitmap filter I described on the "hashjoins vs. Bloom filters (yet again)" thread [1]. Tomas suggested posting the patch the patch I described in [2] in a separate thread for cfbot, so here it is as a two-patch series.

Similar to Tomas's patch to use a Bloom filter in hashjoins to avoid probes, I have these two patches that use a bitmap filter for the same two cases. As Tomas points out, a bitmap filter can be viewed as a Bloom filter with a single hash function.

The idea: while building the inner side, set one bit per occupied hash bucket (one bit per bucket, sized directly by nbuckets -- ~4kB for 32k buckets). An empty bucket means no possible match, with no false negatives. Compared to a Bloom filter it is less discriminating but far cheaper to build and probe, and needs no sizing/false-positive tuning.

  It exploits this in the two cases from the older Bloom-filter work:

0001 - pre-spill outer drop (multi-batch). During outer partitioning, an outer tuple whose target batch's bucket is empty cannot match, so it is dropped before being written to the batch file -- saving the spill write and subsequent read. Inner and semi joins only. Also, this currently disables itself when nbatch grows at runtime (batch doubling), so it targets stable multi-batch plans rather than the runaway-batch case. (See [3] for a discussion on dealing with runaway batches.)

0002 - single-batch probe skip. While probing a single-batch hash table, an empty bucket short-circuits the chain walk. Safe for any join type (it only avoids work on a provably-empty bucket). Pays off mainly with a high miss rate and a bucket array larger than L2/L3.

Both carry an adaptive runtime guard: they sample the drop/skip rate over a window and stop consulting the bitmap if it falls below break-even, so dense (e.g. FK) joins do not pay the per-tuple check for nothing.

  Rough numbers (serial join, 8-core x86-64, work_mem 1-8MB):

    0001, by probe selectivity:   ~95% miss: +18% to +36%
                                  ~37% miss: +9% to +13%
                                  FK-like (~0% miss): flat
    0002, single-batch high-miss: up to ~15% on cache-friendly data

Both are gated by GUCs (enable_hashjoin_prefilter, enable_hashjoin_probe_filter), default off. They're experimental toggles for now -- no config.sgml docs yet, pending design feedback. Serial hash join only at this stage.

Some known issues: The naming needs a little work... prefilter and probe filter are pretty generic and could use better names-- I just can't think of new ones right now. Also, the memory used for the bitmaps is not included or reported in spaceUsed for now.

[1] https://www.postgresql.org/message-id/[email protected] [2] https://www.postgresql.org/message-id/[email protected] [3] https://www.postgresql.org/message-id/[email protected]

Regards,
  Ben Mejia
From 62040683b360efa790b42d0c3c4970bbdc439c52 Mon Sep 17 00:00:00 2001
From: Ben Mejia <[email protected]>
Date: Mon, 22 Jun 2026 13:56:45 -0700
Subject: [PATCH v1 1/2] Add enable_hashjoin_prefilter: drop non-matching outer
 tuples before spill

Maintain a per-batch inner-occupancy bitmap (the prefilter), populated as
inner tuples are written to their batch file during the build. During outer
partitioning, consult the target batch's bitmap and drop the tuple if its bucket
is empty. Restricted to serial multi-batch joins with no batch doubling.

An adaptive guard samples the eviction rate over a window of routed probe tuples
and stops consulting the bitmap once the rate falls below break-even, so
dense (FK-like) joins do not keep paying the per-probe check for no benefit.

Controlled by the enable_hashjoin_prefilter GUC (default off) and reported in
EXPLAIN ANALYZE as "Outer Prefilter: N tuples dropped before spill".
---
 src/backend/commands/explain.c                |  15 ++
 src/backend/executor/nodeHash.c               |  35 ++++
 src/backend/executor/nodeHashjoin.c           |  53 ++++-
 src/backend/utils/misc/guc_parameters.dat     |   7 +
 src/backend/utils/misc/guc_tables.c           |   1 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/executor/hashjoin.h               |  17 ++
 src/include/executor/instrument_node.h        |   1 +
 src/include/miscadmin.h                       |   2 +
 src/test/regress/expected/join_hash.out       | 195 ++++++++++++++++++
 src/test/regress/sql/join_hash.sql            | 154 ++++++++++++++
 11 files changed, 479 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index a40d03d35f3..cb4044c310d 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3473,6 +3473,21 @@ show_hash_info(HashState *hashstate, ExplainState *es)
                                                         hinstrument.nbuckets, 
hinstrument.nbatch,
                                                         spacePeakKb);
                }
+
+               /* Show outer pre-storage filter stats if it dropped anything */
+               if (hinstrument.outer_prefiltered > 0)
+               {
+                       if (es->format != EXPLAIN_FORMAT_TEXT)
+                               ExplainPropertyInteger("Outer Tuples 
Prefiltered", NULL,
+                                                                          
hinstrument.outer_prefiltered, es);
+                       else
+                       {
+                               ExplainIndentText(es);
+                               appendStringInfo(es->str,
+                                                                "Outer 
Prefilter: " UINT64_FORMAT " tuples dropped before spill\n",
+                                                                
hinstrument.outer_prefiltered);
+                       }
+               }
        }
 }
 
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 8825bb6fa23..389c9b2367a 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -537,6 +537,11 @@ ExecHashTableCreate(HashState *state)
        hashtable->skewTuples = 0;
        hashtable->innerBatchFile = NULL;
        hashtable->outerBatchFile = NULL;
+       hashtable->batch_bitmap = NULL;
+       hashtable->prefilter_active = false;
+       hashtable->prefilter_win_checks = 0;
+       hashtable->prefilter_win_drops = 0;
+       hashtable->outer_prefiltered = 0;
        hashtable->spaceUsed = 0;
        hashtable->spacePeak = 0;
        hashtable->spaceAllowed = space_allowed;
@@ -587,6 +592,21 @@ ExecHashTableCreate(HashState *state)
                hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
                hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
 
+               /*
+                * Allocate per-batch pre-filter bitmaps.
+                * Index 0 is left NULL, as batch 0 is never spilled.
+                */
+               if (enable_hashjoin_prefilter)
+               {
+                       size_t          bitmap_bytes = (hashtable->nbuckets + 
7) / 8;
+                       int                     b;
+
+                       hashtable->batch_bitmap = palloc0_array(uint8 *, 
nbatch);
+                       for (b = 1; b < nbatch; b++)
+                               hashtable->batch_bitmap[b] = 
palloc0(bitmap_bytes);
+                       hashtable->prefilter_active = true;
+               }
+
                MemoryContextSwitchTo(oldctx);
 
                /* The files will not be opened until needed... */
@@ -1103,6 +1123,16 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 
        hashtable->nbatch = nbatch;
 
+       /* Batch doubling invalidates the outer pre-storage filter. */
+       if (hashtable->batch_bitmap != NULL)
+       {
+               for (int b = 1; b < oldnbatch; b++)
+                       if (hashtable->batch_bitmap[b])
+                               pfree(hashtable->batch_bitmap[b]);
+               pfree(hashtable->batch_bitmap);
+               hashtable->batch_bitmap = NULL;
+       }
+
        /*
         * Scan through the existing hash table entries and dump out any that 
are
         * no longer of the current batch.
@@ -1851,6 +1881,10 @@ ExecHashTableInsert(HashJoinTable hashtable,
                                                          hashvalue,
                                                          
&hashtable->innerBatchFile[batchno],
                                                          hashtable);
+
+               /* Set bit in pre-filter to record this bucket as occupied */
+               if (hashtable->batch_bitmap != NULL)
+                       hashtable->batch_bitmap[batchno][bucketno >> 3] |= (1 
<< (bucketno & 7));
        }
 
        if (shouldFree)
@@ -2945,6 +2979,7 @@ ExecHashAccumInstrumentation(HashInstrumentation 
*instrument,
                                                                          
hashtable->nbatch_original);
        instrument->space_peak = Max(instrument->space_peak,
                                                                 
hashtable->spacePeak);
+       instrument->outer_prefiltered += hashtable->outer_prefiltered;
 }
 
 /*
diff --git a/src/backend/executor/nodeHashjoin.c 
b/src/backend/executor/nodeHashjoin.c
index 0b365d5b475..44ce1857947 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -193,6 +193,15 @@
 /* Returns true if doing null-fill on inner relation */
 #define HJ_FILL_INNER(hjstate) ((hjstate)->hj_NullOuterTupleSlot != NULL)
 
+/*
+ * Adaptive guard for the pre-filter.  Sample the eviction rate
+ * over a window of routed probe tuples; if it falls below the threshold, stop
+ * consulting the bitmap for the rest of the outer-partitioning pass so dense
+ * (FK-like) joins do not pay the per-probe check for no benefit.
+ */
+#define PREFILTER_WINDOW               4096
+#define PREFILTER_MIN_DROP_PCT 5
+
 static TupleTableSlot *ExecHashJoinOuterGetTuple(PlanState *outerNode,
                                                                                
                 HashJoinState *hjstate,
                                                                                
                 uint32 *hashvalue);
@@ -508,8 +517,48 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
                                        node->hj_CurSkewBucketNo == 
INVALID_SKEW_BUCKET_NO)
                                {
                                        bool            shouldFree;
-                                       MinimalTuple mintuple = 
ExecFetchSlotMinimalTuple(outerTupleSlot,
-                                                                               
                                                          &shouldFree);
+                                       MinimalTuple mintuple;
+
+                                       /*
+                                        * Pre-filter: if inner table has 
nothing in this bucket, the
+                                        * outer tuple cannot match, so drop 
it. Not valid for outer/anti
+                                        * joins, which must still emit 
unmatched outer rows.
+                                        */
+                                       if (hashtable->batch_bitmap != NULL && 
hashtable->prefilter_active &&
+                                               (node->js.jointype == 
JOIN_INNER ||
+                                                node->js.jointype == 
JOIN_SEMI))
+                                       {
+                                               int                     
bucketno = node->hj_CurBucketNo;
+                                               uint8      *bitmap = 
hashtable->batch_bitmap[batchno];
+                                               bool            empty;
+
+                                               empty = ((bitmap[bucketno >> 3] 
& (1 << (bucketno & 7))) == 0);
+
+                                               /* Window accounting: count 
this probe, and the drop if empty. */
+                                               
hashtable->prefilter_win_checks++;
+                                               if (empty)
+                                               {
+                                                       
hashtable->prefilter_win_drops++;
+                                                       
hashtable->outer_prefiltered++;
+                                               }
+
+                                               /* End of window: stop 
filtering if it is no longer paying off. */
+                                               if 
(hashtable->prefilter_win_checks >= PREFILTER_WINDOW)
+                                               {
+                                                       if 
(hashtable->prefilter_win_drops * 100 <
+                                                               
hashtable->prefilter_win_checks * PREFILTER_MIN_DROP_PCT)
+                                                               
hashtable->prefilter_active = false;
+                                                       
hashtable->prefilter_win_checks = 0;
+                                                       
hashtable->prefilter_win_drops = 0;
+                                               }
+
+                                               /* empty bucket: no inner match 
possible, drop without spilling */
+                                               if (empty)
+                                                       continue;
+                                       }
+
+                                       mintuple = 
ExecFetchSlotMinimalTuple(outerTupleSlot,
+                                                                               
                                 &shouldFree);
 
                                        /*
                                         * Need to postpone this outer tuple to 
a later batch.
diff --git a/src/backend/utils/misc/guc_parameters.dat 
b/src/backend/utils/misc/guc_parameters.dat
index 7b1eb6e61bc..00c9c103e16 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -919,6 +919,13 @@
   boot_val => 'true',
 },
 
+{ name => 'enable_hashjoin_prefilter', type => 'bool', context => 
'PGC_USERSET', group => 'QUERY_TUNING_METHOD',
+  short_desc => 'Enables pre-filter for outer tuples before spilling them in 
multi-batch hash joins.',
+  flags => 'GUC_EXPLAIN',
+  variable => 'enable_hashjoin_prefilter',
+  boot_val => 'false',
+},
+
 { name => 'enable_incremental_sort', type => 'bool', context => 'PGC_USERSET', 
group => 'QUERY_TUNING_METHOD',
   short_desc => 'Enables the planner\'s use of incremental sort steps.',
   flags => 'GUC_EXPLAIN',
diff --git a/src/backend/utils/misc/guc_tables.c 
b/src/backend/utils/misc/guc_tables.c
index 290ccbc543e..400cc687533 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -532,6 +532,7 @@ extern const struct config_enum_entry 
dynamic_shared_memory_options[];
  * GUC option variables that are exported from this module
  */
 bool           AllowAlterSystem = true;
+bool           enable_hashjoin_prefilter = false;
 bool           log_duration = false;
 bool           Debug_print_plan = false;
 bool           Debug_print_parse = false;
diff --git a/src/backend/utils/misc/postgresql.conf.sample 
b/src/backend/utils/misc/postgresql.conf.sample
index ac38cddaaf9..92630504fdf 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -428,6 +428,7 @@
 #enable_gathermerge = on
 #enable_hashagg = on
 #enable_hashjoin = on
+#enable_hashjoin_prefilter = off
 #enable_incremental_sort = on
 #enable_indexscan = on
 #enable_indexonlyscan = on
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 4d342174b9a..ca5f69f9830 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -359,6 +359,23 @@ typedef struct HashJoinTableData
        BufFile   **innerBatchFile; /* buffered virtual temp file per batch */
        BufFile   **outerBatchFile; /* buffered virtual temp file per batch */
 
+       /*
+        * Per-batch pre-filter bitmaps, one bit per bucket, populated as inner
+        * tuples are written to their batch file during the inner build.
+        * During outer partitioning, an outer tuple whose bucket is empty
+        * cannot match. So for inner and semi joins it can be dropped before
+        * being spilled, saving the outerBatchFile write and the later read.
+        *
+        * prefilter_active is an adaptive runtime guard: it starts true and is
+        * cleared if the sampled eviction rate over a window of routed probe 
tuples
+        * is below break-even.
+        */
+       uint8     **batch_bitmap;       /* per-batch occupancy bitmaps, or NULL 
*/
+       bool            prefilter_active;       /* adaptive: still consulting 
the bitmap? */
+       uint64          prefilter_win_checks;   /* routed probes seen in 
current window */
+       uint64          prefilter_win_drops;    /* drops in current window */
+       uint64          outer_prefiltered;      /* outer tuples dropped before 
spilling */
+
        Size            spaceUsed;              /* memory space currently used 
by tuples */
        Size            spaceAllowed;   /* upper limit for space used */
        Size            spacePeak;              /* peak space used */
diff --git a/src/include/executor/instrument_node.h 
b/src/include/executor/instrument_node.h
index 4076990408e..37776baa7ab 100644
--- a/src/include/executor/instrument_node.h
+++ b/src/include/executor/instrument_node.h
@@ -227,6 +227,7 @@ typedef struct HashInstrumentation
        int                     nbatch;                 /* number of batches at 
end of execution */
        int                     nbatch_original;        /* planned number of 
batches */
        Size            space_peak;             /* peak memory usage in bytes */
+       uint64          outer_prefiltered;      /* outer tuples dropped before 
spilling */
 } HashInstrumentation;
 
 /*
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 7170a4bff98..9ca5a15ea5b 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -273,6 +273,8 @@ extern PGDLLIMPORT double hash_mem_multiplier;
 extern PGDLLIMPORT int maintenance_work_mem;
 extern PGDLLIMPORT int max_parallel_maintenance_workers;
 
+extern PGDLLIMPORT bool enable_hashjoin_prefilter;
+
 /*
  * Upper and lower hard limits for the buffer access strategy ring size
  * specified by the VacuumBufferUsageLimit GUC and BUFFER_USAGE_LIMIT option
diff --git a/src/test/regress/expected/join_hash.out 
b/src/test/regress/expected/join_hash.out
index 75009e29720..0557c544272 100644
--- a/src/test/regress/expected/join_hash.out
+++ b/src/test/regress/expected/join_hash.out
@@ -1166,3 +1166,198 @@ lateral (select t1.fivethous, i4.f1 from tenk1 t1 join 
int4_tbl i4
 (4 rows)
 
 rollback;
+--
+-- enable_hashjoin_prefilter: drop non-matching outer tuples before they are
+-- spilled to a batch file.  Valid only for inner and semi joins; outer and
+-- anti joins must still emit unmatched outer rows, so the pre-filter must not
+-- run for them.
+--
+begin;
+set local enable_hashjoin = on;
+set local enable_mergejoin = off;
+set local enable_nestloop = off;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+-- Locate the Hash node in an EXPLAIN (FORMAT json) plan.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+  x json;
+  child json;
+begin
+  if node->>'Node Type' = 'Hash' then
+    return node;
+  else
+    for child in select json_array_elements(node->'Plans')
+    loop
+      x := find_hash(child);
+      if x is not null then
+        return x;
+      end if;
+    end loop;
+    return null;
+  end if;
+end;
+$$;
+-- original/final batch counts, as in the tests above.
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+  whole_plan json;
+  hash_node json;
+begin
+  for whole_plan in
+    execute 'explain (analyze, format ''json'') ' || query
+  loop
+    hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+    original := hash_node->>'Original Hash Batches';
+    final := hash_node->>'Hash Batches';
+    return next;
+  end loop;
+end;
+$$;
+-- Number of outer tuples removed by the pre-filter, or NULL when it did not
+-- run (e.g. for outer/anti joins, or a single-batch join).
+create or replace function hash_join_prefiltered(query text)
+returns bigint language plpgsql
+as
+$$
+declare
+  whole_plan json;
+  hash_node json;
+  result bigint;
+begin
+  for whole_plan in
+    execute 'explain (analyze, format ''json'') ' || query
+  loop
+    hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+    result := (hash_node->>'Outer Tuples Prefiltered')::bigint;
+  end loop;
+  return result;
+end;
+$$;
+-- Build side: 20000 distinct keys, large enough to need multiple batches at
+-- the work_mem above and correctly estimated so nbatch stays put (the regime
+-- where the pre-filter bitmaps survive).
+create table pf_build as
+  select g as id, repeat('x', 30) as t from generate_series(1, 20000) g;
+analyze pf_build;
+-- Probe side: 100000 rows, ~10% matching (keys 1..10000) and ~90% with no
+-- match (keys far above the build range), interleaved.  The interleaving
+-- matters: the pre-filter samples its drop rate over a window of routed probe
+-- tuples and switches itself off if that window is unrepresentative, so the
+-- matching and non-matching rows must be mixed rather than segregated.  Larger
+-- than the build side so the planner hashes pf_build and probes this.
+create table pf_probe as
+  select case when g % 10 = 0 then g / 10
+              else 1000000 + g end as id
+  from generate_series(1, 100000) g;
+analyze pf_probe;
+-- Premise: the join is multi-batch and planned that way (no runtime growth).
+select original > 1 as initially_multibatch, final > original as 
increased_batches
+  from hash_join_batches(
+$$
+  select count(*) from pf_probe p join pf_build b using (id)
+$$);
+ initially_multibatch | increased_batches 
+----------------------+-------------------
+ t                    | f
+(1 row)
+
+-- (1) INNER join: identical results with the pre-filter off and on.
+set local enable_hashjoin_prefilter = off;
+select count(*), coalesce(sum(p.id), 0) as checksum
+  from pf_probe p join pf_build b using (id);
+ count | checksum 
+-------+----------
+ 10000 | 50005000
+(1 row)
+
+set local enable_hashjoin_prefilter = on;
+select count(*), coalesce(sum(p.id), 0) as checksum
+  from pf_probe p join pf_build b using (id);
+ count | checksum 
+-------+----------
+ 10000 | 50005000
+(1 row)
+
+-- (2) the pre-filter actually dropped non-matching outer tuples.
+select hash_join_prefiltered(
+$$
+  select count(*) from pf_probe p join pf_build b using (id)
+$$) > 0 as inner_prefilter_fired;
+ inner_prefilter_fired 
+-----------------------
+ t
+(1 row)
+
+-- (3) SEMI join: identical results off vs on, and the pre-filter may fire.
+set local enable_hashjoin_prefilter = off;
+select count(*) from pf_probe p
+  where exists (select 1 from pf_build b where b.id = p.id);
+ count 
+-------
+ 10000
+(1 row)
+
+set local enable_hashjoin_prefilter = on;
+select count(*) from pf_probe p
+  where exists (select 1 from pf_build b where b.id = p.id);
+ count 
+-------
+ 10000
+(1 row)
+
+select hash_join_prefiltered(
+$$
+  select count(*) from pf_probe p
+    where exists (select 1 from pf_build b where b.id = p.id)
+$$) > 0 as semi_prefilter_fired;
+ semi_prefilter_fired 
+----------------------
+ t
+(1 row)
+
+-- (4) LEFT join: unmatched outer rows must survive, so the pre-filter must
+-- not run.  (A wrong pre-filter would make this count drop below 100000.)
+set local enable_hashjoin_prefilter = on;
+select count(*) from pf_probe p left join pf_build b using (id);
+ count  
+--------
+ 100000
+(1 row)
+
+select hash_join_prefiltered(
+$$
+  select count(*) from pf_probe p left join pf_build b using (id)
+$$) is null as left_prefilter_disabled;
+ left_prefilter_disabled 
+-------------------------
+ t
+(1 row)
+
+-- (5) ANTI join: the non-matching outer rows are exactly the result, so the
+-- pre-filter must not run.
+select count(*) from pf_probe p
+  where not exists (select 1 from pf_build b where b.id = p.id);
+ count 
+-------
+ 90000
+(1 row)
+
+select hash_join_prefiltered(
+$$
+  select count(*) from pf_probe p
+    where not exists (select 1 from pf_build b where b.id = p.id)
+$$) is null as anti_prefilter_disabled;
+ anti_prefilter_disabled 
+-------------------------
+ t
+(1 row)
+
+rollback;
diff --git a/src/test/regress/sql/join_hash.sql 
b/src/test/regress/sql/join_hash.sql
index 989390e6864..ab01b277532 100644
--- a/src/test/regress/sql/join_hash.sql
+++ b/src/test/regress/sql/join_hash.sql
@@ -625,3 +625,157 @@ lateral (select t1.fivethous, i4.f1 from tenk1 t1 join 
int4_tbl i4
          on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
 
 rollback;
+
+--
+-- enable_hashjoin_prefilter: drop non-matching outer tuples before they are
+-- spilled to a batch file.  Valid only for inner and semi joins; outer and
+-- anti joins must still emit unmatched outer rows, so the pre-filter must not
+-- run for them.
+--
+begin;
+
+set local enable_hashjoin = on;
+set local enable_mergejoin = off;
+set local enable_nestloop = off;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+
+-- Locate the Hash node in an EXPLAIN (FORMAT json) plan.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+  x json;
+  child json;
+begin
+  if node->>'Node Type' = 'Hash' then
+    return node;
+  else
+    for child in select json_array_elements(node->'Plans')
+    loop
+      x := find_hash(child);
+      if x is not null then
+        return x;
+      end if;
+    end loop;
+    return null;
+  end if;
+end;
+$$;
+
+-- original/final batch counts, as in the tests above.
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+  whole_plan json;
+  hash_node json;
+begin
+  for whole_plan in
+    execute 'explain (analyze, format ''json'') ' || query
+  loop
+    hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+    original := hash_node->>'Original Hash Batches';
+    final := hash_node->>'Hash Batches';
+    return next;
+  end loop;
+end;
+$$;
+
+-- Number of outer tuples removed by the pre-filter, or NULL when it did not
+-- run (e.g. for outer/anti joins, or a single-batch join).
+create or replace function hash_join_prefiltered(query text)
+returns bigint language plpgsql
+as
+$$
+declare
+  whole_plan json;
+  hash_node json;
+  result bigint;
+begin
+  for whole_plan in
+    execute 'explain (analyze, format ''json'') ' || query
+  loop
+    hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+    result := (hash_node->>'Outer Tuples Prefiltered')::bigint;
+  end loop;
+  return result;
+end;
+$$;
+
+-- Build side: 20000 distinct keys, large enough to need multiple batches at
+-- the work_mem above and correctly estimated so nbatch stays put (the regime
+-- where the pre-filter bitmaps survive).
+create table pf_build as
+  select g as id, repeat('x', 30) as t from generate_series(1, 20000) g;
+analyze pf_build;
+
+-- Probe side: 100000 rows, ~10% matching (keys 1..10000) and ~90% with no
+-- match (keys far above the build range), interleaved.  The interleaving
+-- matters: the pre-filter samples its drop rate over a window of routed probe
+-- tuples and switches itself off if that window is unrepresentative, so the
+-- matching and non-matching rows must be mixed rather than segregated.  Larger
+-- than the build side so the planner hashes pf_build and probes this.
+create table pf_probe as
+  select case when g % 10 = 0 then g / 10
+              else 1000000 + g end as id
+  from generate_series(1, 100000) g;
+analyze pf_probe;
+
+-- Premise: the join is multi-batch and planned that way (no runtime growth).
+select original > 1 as initially_multibatch, final > original as 
increased_batches
+  from hash_join_batches(
+$$
+  select count(*) from pf_probe p join pf_build b using (id)
+$$);
+
+-- (1) INNER join: identical results with the pre-filter off and on.
+set local enable_hashjoin_prefilter = off;
+select count(*), coalesce(sum(p.id), 0) as checksum
+  from pf_probe p join pf_build b using (id);
+set local enable_hashjoin_prefilter = on;
+select count(*), coalesce(sum(p.id), 0) as checksum
+  from pf_probe p join pf_build b using (id);
+
+-- (2) the pre-filter actually dropped non-matching outer tuples.
+select hash_join_prefiltered(
+$$
+  select count(*) from pf_probe p join pf_build b using (id)
+$$) > 0 as inner_prefilter_fired;
+
+-- (3) SEMI join: identical results off vs on, and the pre-filter may fire.
+set local enable_hashjoin_prefilter = off;
+select count(*) from pf_probe p
+  where exists (select 1 from pf_build b where b.id = p.id);
+set local enable_hashjoin_prefilter = on;
+select count(*) from pf_probe p
+  where exists (select 1 from pf_build b where b.id = p.id);
+select hash_join_prefiltered(
+$$
+  select count(*) from pf_probe p
+    where exists (select 1 from pf_build b where b.id = p.id)
+$$) > 0 as semi_prefilter_fired;
+
+-- (4) LEFT join: unmatched outer rows must survive, so the pre-filter must
+-- not run.  (A wrong pre-filter would make this count drop below 100000.)
+set local enable_hashjoin_prefilter = on;
+select count(*) from pf_probe p left join pf_build b using (id);
+select hash_join_prefiltered(
+$$
+  select count(*) from pf_probe p left join pf_build b using (id)
+$$) is null as left_prefilter_disabled;
+
+-- (5) ANTI join: the non-matching outer rows are exactly the result, so the
+-- pre-filter must not run.
+select count(*) from pf_probe p
+  where not exists (select 1 from pf_build b where b.id = p.id);
+select hash_join_prefiltered(
+$$
+  select count(*) from pf_probe p
+    where not exists (select 1 from pf_build b where b.id = p.id)
+$$) is null as anti_prefilter_disabled;
+
+rollback;

base-commit: 4df5fe3833a87f6629eb888ca5a385bcb0b179d9
-- 
2.50.1 (Apple Git-155)

From 9866e34b9d7719bed781acdbf1961a5b933b455d Mon Sep 17 00:00:00 2001
From: Ben Mejia <[email protected]>
Date: Mon, 22 Jun 2026 15:35:39 -0700
Subject: [PATCH v1 2/2] Add a bitmap filter for single-batch prefiltering.

Add a GUC to control probe filter: enable_hashjoin_probe_filter

The probe site has its own per-batch adaptive guard with a higher threshold
(~27%).

Reported in EXPLAIN ANALYZE as "Probe Filter: N empty buckets skipped".
---
 src/backend/commands/explain.c                | 15 ++++
 src/backend/executor/nodeHash.c               | 77 ++++++++++++++++++-
 src/backend/utils/misc/guc_parameters.dat     |  7 ++
 src/backend/utils/misc/guc_tables.c           |  1 +
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/executor/hashjoin.h               | 11 +++
 src/include/executor/instrument_node.h        |  1 +
 src/include/miscadmin.h                       |  1 +
 src/test/regress/expected/join_hash.out       | 70 +++++++++++++++++
 src/test/regress/sql/join_hash.sql            | 53 +++++++++++++
 10 files changed, 234 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index cb4044c310d..db6a842b550 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3488,6 +3488,21 @@ show_hash_info(HashState *hashstate, ExplainState *es)
                                                                 
hinstrument.outer_prefiltered);
                        }
                }
+
+               /* Show probe-time empty-bucket skips if any */
+               if (hinstrument.probe_filter_skips > 0)
+               {
+                       if (es->format != EXPLAIN_FORMAT_TEXT)
+                               ExplainPropertyInteger("Probe Filter Skips", 
NULL,
+                                                                          
hinstrument.probe_filter_skips, es);
+                       else
+                       {
+                               ExplainIndentText(es);
+                               appendStringInfo(es->str,
+                                                                "Probe Filter: 
" UINT64_FORMAT " empty buckets skipped\n",
+                                                                
hinstrument.probe_filter_skips);
+                       }
+               }
        }
 }
 
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 389c9b2367a..26765b1f9f1 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -43,6 +43,14 @@
 #include "utils/tuplestore.h"
 #include "utils/wait_event.h"
 
+/*
+ * Adaptive guard for the probe-time empty-bucket filter. Sample the skip
+ * rate over a window; if it falls below break-even, stop consulting the
+ * bitmap.
+ */
+#define PROBE_FILTER_WINDOW                    1024
+#define PROBE_FILTER_MIN_SKIP_PCT      27
+
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
 static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable);
@@ -213,6 +221,26 @@ MultiExecPrivateHash(HashState *node)
        if (hashtable->nbuckets != hashtable->nbuckets_optimal)
                ExecHashIncreaseNumBuckets(hashtable);
 
+       /*
+        * Single-batch probe filter: build the bitmap filter now that nbuckets
+        * is final.
+        */
+       if (hashtable->nbatch == 1 && enable_hashjoin_probe_filter &&
+               hashtable->batch_bitmap == NULL)
+       {
+               MemoryContext oldctx = 
MemoryContextSwitchTo(hashtable->hashCxt);
+               size_t          bitmap_bytes = (hashtable->nbuckets + 7) / 8;
+               int                     b;
+
+               hashtable->batch_bitmap = palloc0_array(uint8 *, 1);
+               hashtable->batch_bitmap[0] = palloc0(bitmap_bytes);
+               for (b = 0; b < hashtable->nbuckets; b++)
+                       if (hashtable->buckets.unshared[b] != NULL)
+                               hashtable->batch_bitmap[0][b >> 3] |= (1 << (b 
& 7));
+               hashtable->probe_filter_active = true;
+               MemoryContextSwitchTo(oldctx);
+       }
+
        /* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
        hashtable->spaceUsed += hashtable->nbuckets * sizeof(HashJoinTuple);
        if (hashtable->spaceUsed > hashtable->spacePeak)
@@ -542,6 +570,10 @@ ExecHashTableCreate(HashState *state)
        hashtable->prefilter_win_checks = 0;
        hashtable->prefilter_win_drops = 0;
        hashtable->outer_prefiltered = 0;
+       hashtable->probe_filter_active = false;
+       hashtable->probe_filter_win_checks = 0;
+       hashtable->probe_filter_win_skips = 0;
+       hashtable->probe_filter_skips = 0;
        hashtable->spaceUsed = 0;
        hashtable->spacePeak = 0;
        hashtable->spaceAllowed = space_allowed;
@@ -593,8 +625,9 @@ ExecHashTableCreate(HashState *state)
                hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
 
                /*
-                * Allocate per-batch pre-filter bitmaps.
-                * Index 0 is left NULL, as batch 0 is never spilled.
+                * Allocate the pre-spill drop filter's per-batch occupancy 
bitmaps.
+                * Index 0 is left NULL, as batch 0 is never spilled.  (The 
probe-time
+                * skip is single-batch only and builds its own bitmap in 
MultiExecHash.)
                 */
                if (enable_hashjoin_prefilter)
                {
@@ -1126,7 +1159,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
        /* Batch doubling invalidates the outer pre-storage filter. */
        if (hashtable->batch_bitmap != NULL)
        {
-               for (int b = 1; b < oldnbatch; b++)
+               for (int b = 0; b < oldnbatch; b++)
                        if (hashtable->batch_bitmap[b])
                                pfree(hashtable->batch_bitmap[b]);
                pfree(hashtable->batch_bitmap);
@@ -2057,6 +2090,43 @@ ExecScanHashBucket(HashJoinState *hjstate,
        HashJoinTuple hashTuple = hjstate->hj_CurTuple;
        uint32          hashvalue = hjstate->hj_CurHashValue;
 
+       /*
+        * Probe-time empty-bucket skip: on the first probe of a standard 
bucket,
+        * check the bitmap filter and skip this bucket if empty. Also check the
+        * adaptive guard and disable the filter if the rate is below 
break-even.
+        */
+       if (hashTuple == NULL &&
+               enable_hashjoin_probe_filter &&
+               hashtable->probe_filter_active &&
+               hashtable->nbatch == 1 &&
+               hashtable->batch_bitmap != NULL &&
+               hjstate->hj_CurSkewBucketNo == INVALID_SKEW_BUCKET_NO)
+       {
+               int                     bucketno = hjstate->hj_CurBucketNo;
+               uint8      *bitmap = 
hashtable->batch_bitmap[hashtable->curbatch];
+               bool            empty = ((bitmap[bucketno >> 3] & (1 << 
(bucketno & 7))) == 0);
+
+               hashtable->probe_filter_win_checks++;
+               if (empty)
+                       hashtable->probe_filter_win_skips++;
+
+               /* End of probe filter window: skip only if it is paying off. */
+               if (hashtable->probe_filter_win_checks >= PROBE_FILTER_WINDOW)
+               {
+                       if (hashtable->probe_filter_win_skips * 100 <
+                               hashtable->probe_filter_win_checks * 
PROBE_FILTER_MIN_SKIP_PCT)
+                               hashtable->probe_filter_active = false;
+                       hashtable->probe_filter_win_checks = 0;
+                       hashtable->probe_filter_win_skips = 0;
+               }
+
+               if (empty)
+               {
+                       hashtable->probe_filter_skips++;
+                       return false;
+               }
+       }
+
        /*
         * hj_CurTuple is the address of the tuple last returned from the 
current
         * bucket, or NULL if it's time to start scanning a new bucket.
@@ -2980,6 +3050,7 @@ ExecHashAccumInstrumentation(HashInstrumentation 
*instrument,
        instrument->space_peak = Max(instrument->space_peak,
                                                                 
hashtable->spacePeak);
        instrument->outer_prefiltered += hashtable->outer_prefiltered;
+       instrument->probe_filter_skips += hashtable->probe_filter_skips;
 }
 
 /*
diff --git a/src/backend/utils/misc/guc_parameters.dat 
b/src/backend/utils/misc/guc_parameters.dat
index 00c9c103e16..a972e0d2364 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -926,6 +926,13 @@
   boot_val => 'false',
 },
 
+{ name => 'enable_hashjoin_probe_filter', type => 'bool', context => 
'PGC_USERSET', group => 'QUERY_TUNING_METHOD',
+  short_desc => 'Enables skipping empty hash buckets during the probe phase of 
single-batch hash joins.',
+  flags => 'GUC_EXPLAIN',
+  variable => 'enable_hashjoin_probe_filter',
+  boot_val => 'false',
+},
+
 { name => 'enable_incremental_sort', type => 'bool', context => 'PGC_USERSET', 
group => 'QUERY_TUNING_METHOD',
   short_desc => 'Enables the planner\'s use of incremental sort steps.',
   flags => 'GUC_EXPLAIN',
diff --git a/src/backend/utils/misc/guc_tables.c 
b/src/backend/utils/misc/guc_tables.c
index 400cc687533..b5affcc52b1 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -533,6 +533,7 @@ extern const struct config_enum_entry 
dynamic_shared_memory_options[];
  */
 bool           AllowAlterSystem = true;
 bool           enable_hashjoin_prefilter = false;
+bool           enable_hashjoin_probe_filter = false;
 bool           log_duration = false;
 bool           Debug_print_plan = false;
 bool           Debug_print_parse = false;
diff --git a/src/backend/utils/misc/postgresql.conf.sample 
b/src/backend/utils/misc/postgresql.conf.sample
index 92630504fdf..544b9d2e14e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -429,6 +429,7 @@
 #enable_hashagg = on
 #enable_hashjoin = on
 #enable_hashjoin_prefilter = off
+#enable_hashjoin_probe_filter = off
 #enable_incremental_sort = on
 #enable_indexscan = on
 #enable_indexonlyscan = on
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index ca5f69f9830..487e00fc909 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -376,6 +376,17 @@ typedef struct HashJoinTableData
        uint64          prefilter_win_drops;    /* drops in current window */
        uint64          outer_prefiltered;      /* outer tuples dropped before 
spilling */
 
+       /*
+        * Probe-time empty-bucket skip (single-batch joins only).  After the 
build,
+        * MultiExecHash records every occupied bucket in batch_bitmap[0]; while
+        * probing, an empty bucket cannot match, so the chain walk is skipped. 
 A
+        * one-shot adaptive guard disables it if the skip rate is below 
break-even.
+        */
+       bool            probe_filter_active;    /* still consulting at probe 
time? */
+       uint64          probe_filter_win_checks;        /* probes seen in 
current window */
+       uint64          probe_filter_win_skips; /* skips in current window */
+       uint64          probe_filter_skips; /* total empty buckets skipped at 
probe */
+
        Size            spaceUsed;              /* memory space currently used 
by tuples */
        Size            spaceAllowed;   /* upper limit for space used */
        Size            spacePeak;              /* peak space used */
diff --git a/src/include/executor/instrument_node.h 
b/src/include/executor/instrument_node.h
index 37776baa7ab..5df6f07034c 100644
--- a/src/include/executor/instrument_node.h
+++ b/src/include/executor/instrument_node.h
@@ -228,6 +228,7 @@ typedef struct HashInstrumentation
        int                     nbatch_original;        /* planned number of 
batches */
        Size            space_peak;             /* peak memory usage in bytes */
        uint64          outer_prefiltered;      /* outer tuples dropped before 
spilling */
+       uint64          probe_filter_skips; /* empty buckets skipped at probe 
time */
 } HashInstrumentation;
 
 /*
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 9ca5a15ea5b..250c416c8ec 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -274,6 +274,7 @@ extern PGDLLIMPORT int maintenance_work_mem;
 extern PGDLLIMPORT int max_parallel_maintenance_workers;
 
 extern PGDLLIMPORT bool enable_hashjoin_prefilter;
+extern PGDLLIMPORT bool enable_hashjoin_probe_filter;
 
 /*
  * Upper and lower hard limits for the buffer access strategy ring size
diff --git a/src/test/regress/expected/join_hash.out 
b/src/test/regress/expected/join_hash.out
index 0557c544272..d1bbe1a177c 100644
--- a/src/test/regress/expected/join_hash.out
+++ b/src/test/regress/expected/join_hash.out
@@ -1241,6 +1241,26 @@ begin
   return result;
 end;
 $$;
+-- Number of empty buckets the single-batch probe filter skipped, or NULL when
+-- it did not run.
+create or replace function hash_join_probe_skips(query text)
+returns bigint language plpgsql
+as
+$$
+declare
+  whole_plan json;
+  hash_node json;
+  result bigint;
+begin
+  for whole_plan in
+    execute 'explain (analyze, format ''json'') ' || query
+  loop
+    hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+    result := (hash_node->>'Probe Filter Skips')::bigint;
+  end loop;
+  return result;
+end;
+$$;
 -- Build side: 20000 distinct keys, large enough to need multiple batches at
 -- the work_mem above and correctly estimated so nbatch stays put (the regime
 -- where the pre-filter bitmaps survive).
@@ -1360,4 +1380,54 @@ $$) is null as anti_prefilter_disabled;
  t
 (1 row)
 
+-- (6) Probe filter must stay inert in a multi-batch join.  With the drop
+-- filter on, batch_bitmap[1..n] is allocated but index 0 is left NULL; the
+-- probe filter is single-batch only and must not consult it.  A correct count
+-- (and no crash) guards against the nbatch==1 gate being removed.
+set local enable_hashjoin_prefilter = on;
+set local enable_hashjoin_probe_filter = on;
+select count(*) from pf_probe p join pf_build b using (id);
+ count 
+-------
+ 10000
+(1 row)
+
+-- (7) Single-batch probe filter: with the inner in one batch and a sparsely
+-- occupied bucket array, non-matching probe tuples land in empty buckets and
+-- are skipped at probe time.  Matching and non-matching keys are interleaved
+-- so the adaptive sampling window sees a representative skip rate.
+set local work_mem = '4MB';
+set local enable_hashjoin_prefilter = off;
+set local enable_hashjoin_probe_filter = off;
+create table pf_sb_build as
+  select g as id from generate_series(1, 5000) g;
+analyze pf_sb_build;
+create table pf_sb_probe as
+  select case when g % 50 = 0 then g / 50 else 1000000 + g end as id
+  from generate_series(1, 50000) g;
+analyze pf_sb_probe;
+-- identical result with the probe filter off and on
+select count(*) from pf_sb_probe p join pf_sb_build b using (id);
+ count 
+-------
+  1000
+(1 row)
+
+set local enable_hashjoin_probe_filter = on;
+select count(*) from pf_sb_probe p join pf_sb_build b using (id);
+ count 
+-------
+  1000
+(1 row)
+
+-- the probe filter actually skipped empty buckets
+select hash_join_probe_skips(
+$$
+  select count(*) from pf_sb_probe p join pf_sb_build b using (id)
+$$) > 0 as single_batch_probe_fired;
+ single_batch_probe_fired 
+--------------------------
+ t
+(1 row)
+
 rollback;
diff --git a/src/test/regress/sql/join_hash.sql 
b/src/test/regress/sql/join_hash.sql
index ab01b277532..56086361c83 100644
--- a/src/test/regress/sql/join_hash.sql
+++ b/src/test/regress/sql/join_hash.sql
@@ -706,6 +706,27 @@ begin
 end;
 $$;
 
+-- Number of empty buckets the single-batch probe filter skipped, or NULL when
+-- it did not run.
+create or replace function hash_join_probe_skips(query text)
+returns bigint language plpgsql
+as
+$$
+declare
+  whole_plan json;
+  hash_node json;
+  result bigint;
+begin
+  for whole_plan in
+    execute 'explain (analyze, format ''json'') ' || query
+  loop
+    hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+    result := (hash_node->>'Probe Filter Skips')::bigint;
+  end loop;
+  return result;
+end;
+$$;
+
 -- Build side: 20000 distinct keys, large enough to need multiple batches at
 -- the work_mem above and correctly estimated so nbatch stays put (the regime
 -- where the pre-filter bitmaps survive).
@@ -778,4 +799,36 @@ $$
     where not exists (select 1 from pf_build b where b.id = p.id)
 $$) is null as anti_prefilter_disabled;
 
+-- (6) Probe filter must stay inert in a multi-batch join.  With the drop
+-- filter on, batch_bitmap[1..n] is allocated but index 0 is left NULL; the
+-- probe filter is single-batch only and must not consult it.  A correct count
+-- (and no crash) guards against the nbatch==1 gate being removed.
+set local enable_hashjoin_prefilter = on;
+set local enable_hashjoin_probe_filter = on;
+select count(*) from pf_probe p join pf_build b using (id);
+
+-- (7) Single-batch probe filter: with the inner in one batch and a sparsely
+-- occupied bucket array, non-matching probe tuples land in empty buckets and
+-- are skipped at probe time.  Matching and non-matching keys are interleaved
+-- so the adaptive sampling window sees a representative skip rate.
+set local work_mem = '4MB';
+set local enable_hashjoin_prefilter = off;
+set local enable_hashjoin_probe_filter = off;
+create table pf_sb_build as
+  select g as id from generate_series(1, 5000) g;
+analyze pf_sb_build;
+create table pf_sb_probe as
+  select case when g % 50 = 0 then g / 50 else 1000000 + g end as id
+  from generate_series(1, 50000) g;
+analyze pf_sb_probe;
+-- identical result with the probe filter off and on
+select count(*) from pf_sb_probe p join pf_sb_build b using (id);
+set local enable_hashjoin_probe_filter = on;
+select count(*) from pf_sb_probe p join pf_sb_build b using (id);
+-- the probe filter actually skipped empty buckets
+select hash_join_probe_skips(
+$$
+  select count(*) from pf_sb_probe p join pf_sb_build b using (id)
+$$) > 0 as single_batch_probe_fired;
+
 rollback;
-- 
2.50.1 (Apple Git-155)

Reply via email to