On Tue, Feb 24, 2026, at 19:21, Joel Jacobson wrote:
> This bug seems to sometimes cause the wrong table, the larger table, to
> be hashed in a Hash Join, and the smaller table to be used for probing.
...
> The fix causes quite a lot of plans in
> src/test/regress/expected/partition_join.out to change, which makes me a
> bit worried I might have misunderstood something here. I haven't
> verified if all the new plans are improvements, I just copied the result
> file to the expected dir.

I've now investigated all the plan changes in

   src/test/regress/expected/partition_join.out

due to this fix, and now feel confident this is a bug,
and that the bug fix is correct.

To benchmark, I beefed up the populated data in partition_join.sql
by x10000, e.g:

-CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (250);
-CREATE TABLE prt1_p3 PARTITION OF prt1 FOR VALUES FROM (500) TO (600);
-CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES FROM (250) TO (500);
-INSERT INTO prt1 SELECT i, i % 25, to_char(i, 'FM0000') FROM 
generate_series(0, 599) i WHERE i % 2 = 0;
+CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (2500000);
+CREATE TABLE prt1_p3 PARTITION OF prt1 FOR VALUES FROM (5000000) TO (6000000);
+CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES FROM (2500000) TO (5000000);
+INSERT INTO prt1 SELECT i, i % 25, to_char(i, 'FM0000') FROM 
generate_series(0, 5999999) i WHERE i % 2 = 0;

I then measured all queries that produced a different plan,
using EXPLAIN ANALYZE, here are the results:

joel=# SELECT COUNT(*), COUNT(*) FILTER (WHERE execution_time_head > 
execution_time_patch) AS faster, COUNT(*) FILTER (WHERE execution_time_head < 
execution_time_patch) AS slower FROM partition_join_bench;
 count | faster | slower
-------+--------+--------
    27 |     16 |     11
(1 row)

joel=# SELECT SUM(execution_time_head) AS total_execution_time_master, 
SUM(execution_time_patch) AS total_execution_time_patch, 
1-SUM(execution_time_patch)/SUM(execution_time_head) AS improvement FROM 
partition_join_bench;
 total_execution_time_master | total_execution_time_patch |      improvement
-----------------------------+----------------------------+------------------------
                    3577.826 |                   2892.280 | 
0.19160965345995026030
(1 row)

joel=# SELECT execution_time_head, execution_time_patch, 
execution_time_head-execution_time_patch AS diff FROM partition_join_bench 
ORDER BY 3;
 execution_time_head | execution_time_patch |  diff
---------------------+----------------------+---------
             128.481 |              170.469 | -41.988
              59.927 |               84.131 | -24.204
              63.928 |               87.188 | -23.260
              57.315 |               78.443 | -21.128
              65.779 |               84.669 | -18.890
              65.456 |               81.128 | -15.672
              57.349 |               72.832 | -15.483
              63.383 |               77.267 | -13.884
              60.248 |               73.359 | -13.111
              61.173 |               67.388 |  -6.215
              67.052 |               69.475 |  -2.423
              79.368 |               78.874 |   0.494
              61.533 |               56.617 |   4.916
             108.781 |               92.301 |  16.480
             124.661 |               98.540 |  26.121
             146.671 |              117.109 |  29.562
             112.973 |               79.949 |  33.024
             119.745 |               82.465 |  37.280
             145.449 |               99.523 |  45.926
             239.796 |              166.813 |  72.983
             228.056 |              154.956 |  73.100
             225.025 |              145.068 |  79.957
             261.493 |              173.595 |  87.898
             245.301 |              157.054 |  88.247
             239.626 |              149.158 |  90.468
             243.589 |              147.587 |  96.002
             245.668 |              146.322 |  99.346
(27 rows)

In total the improvement is about 20%.

The faster queries are due to swapping the build/probe side,
so that the planner hash the smaller filtered side instead
of the larger unfiltered side.

The slower queries are due to fixing the hash join cost estimate,
which makes the makes hash join look cheaper than nested loop.
But at this data scale, nested loop is still a win for such cases.

I benchmarked just in case I'd missed something.
These results makes me confident we have a bug,
and that this fix is correct.

/Joel


Reply via email to