On 09/02/2026 23:34, Andrei Lepikhov wrote:
> On 9/2/26 21:16, Tom Lane wrote:
>> What I'm wondering about is that join_collapse_limit and
>> from_collapse_limit were invented more than two decades ago, but
>> we've not touched their default values since then.  Machines are a
>> lot faster since 2004, and we've probably achieved some net speedups
>> in the planner logic as well.  Could we alleviate this concern by
>> raising those defaults, and if so, what are reasonable values in 2026?
> 
> As I see, people never use the default settings now. The case that triggered
> this topic could work well with a join collapse limit around 40 joins (GEQO
> started at 14). But a specific setting always depends on how much time people
> want to spend on planning. So, I don't think a change of default settings is
> needed.
After looking into more cases, I realized the main issue is actually something 
else.

If a SubLink in the WHERE clause is not turned into a JOIN, it acts as a filter
at the lowest possible level. When we do transform it, we move it to the top of
the join tree. If the collapse limit is lower than the number of joins, we end
up moving its filtering effect outside the ‘join problem’, and the optimiser
cannot take advantage of this often helpful way to execute the query. You can
see a dumb demo in the attachment.

To solve the problem, we should identify the relids that the SubLink refers to
or depends on, ensure they are not on the nullable side of a join, and add them
to the jointree as a JoinExpr with as few relids as possible.

Looking at pull_up_sublinks_qual_recurse, I see that we have information to skip
extra work if the join_collapse_limit is high enough.

-- 
regards, Andrei Lepikhov,
pgEdge
DROP TABLE IF EXISTS t,tn;

SET max_parallel_workers_per_gather = 0;
SET join_collapse_limit = 1;

CREATE TEMP TABLE t (x integer);
INSERT INTO t SELECT 1 FROM generate_series(1,1E4) AS x;
CREATE TEMP TABLE tn (x integer);
INSERT INTO tn SELECT 2 FROM generate_series(1,1E4) AS x;
CREATE INDEX ON tn (x);

EXPLAIN (COSTS OFF)
SELECT * FROM t t0 LEFT JOIN t t1 LEFT JOIN t t2 ON (t1.x=t2.x) ON (t0.x=t1.x)
WHERE EXISTS (SELECT 1 FROM tn WHERE tn.x = t0.x);

EXPLAIN (COSTS OFF)
SELECT * FROM t t0 LEFT JOIN t t1  LEFT JOIN t t2 ON (t1.x=t2.x) ON (t0.x=t1.x)
WHERE EXISTS (SELECT 1 FROM tn WHERE tn.x = t0.x OFFSET 0);

/*
 Nested Loop Left Join
   Join Filter: (t0.x = t1.x)
   ->  Hash Join
         Hash Cond: (t0.x = tn.x)
         ->  Seq Scan on t t0
         ->  Hash
               ->  HashAggregate
                     Group Key: tn.x
                     ->  Seq Scan on tn
   ->  Hash Left Join
         Hash Cond: (t1.x = t2.x)
         ->  Seq Scan on t t1
         ->  Hash
               ->  Seq Scan on t t2

 Hash Right Join
   Hash Cond: (t1.x = t0.x)
   ->  Hash Left Join
         Hash Cond: (t1.x = t2.x)
         ->  Seq Scan on t t1
         ->  Hash
               ->  Seq Scan on t t2
   ->  Hash
         ->  Seq Scan on t t0
               Filter: EXISTS(SubPlan 1)
               SubPlan 1
                 ->  Seq Scan on tn
                       Filter: (x = t0.x)
*/

Reply via email to