Re: A new strategy for pull-up correlated ANY_SUBLINK

Andy Fan Wed, 05 Apr 2023 00:16:11 -0700

Hi Tom:

 Sorry for the delayed response!  I think my knowledge has been refreshed
 for this discussion.



> One thing I'm not at all clear about is whether we need to restrict
> the optimization so that it doesn't occur if the subquery contains
> outer references falling outside available_rels.  I think that that
> case is covered by is_simple_subquery() deciding later to not pull up
> the subquery based on LATERAL restrictions, but maybe that misses
> something.
>

I think we need the restriction and that should be enough for this feature
. Given the query Richard provided before:

explain
select * from tenk1 A where exists
(select 1 from tenk2 B
where A.hundred in (select C.hundred FROM tenk2 C
WHERE c.odd = b.odd));

It first can be converted to the below format without any issue.

SELECT * FROM tenk1 A SEMI JOIN tenk2 B
on A.hundred in (select C.hundred FROM tenk2 C
WHERE c.odd = b.odd);

Then without the restriction, since we only pull the varnos from
sublink->testexpr, then it is {A}, so it convert to

SELECT * FROM
(tenk1 A SEMI JOIN LATERAL (SELECT c.hundred FROM tenk2 C)
ON c.odd = b.odd AND a.hundred = v.hundred)
SEMI JOIN on tenk2 B ON TRUE;

then the above query is NOT A VALID QUERY since:
1. The above query is *not* same as

SELECT * FROM (tenk1 A SEMI JOIN tenk2 B) on true
SEMI JOIN LATERAL (SELECT c.hundred FROM tenk2 C) v
ON v.odd = b.odd;

2. The above query requires b.odd when B is not available. So it is
right that an optimizer can't generate a plan for it. The fix would
be to do the restriction before applying this optimization.

I'm not sure pull-up-subquery can play any role here, IIUC, the bad thing
happens before pull-up-subquery.

I also write & analyze more test and found no issue by me

1. SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
==> should not be pull-up to rarg of the left join since A.hundred is not
available.

2.  SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON B.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = a.odd);
==> should not be pull-up to rarg of the left join since A.odd is not
available.

3. SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON B.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
==> should be pull-up to rarg of left join.

4. SELECT * FROM tenk1 A INNER JOIN tenk2 B
ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
==> pull-up as expected.

5. SELECT * FROM tenk1 A RIGHT JOIN tenk2 B
ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
==> should not be pull-up into larg of left join since b.odd is not
available.


About the existing test case changes because of this patch, they do
requires on the sublink is planned to a subPlan, so I introduces the below
changes to keep the original intention.

Changes
A in (SELECT A FROM ..)
To
(A, random() > 0) in (SELECT a, random() > 0 FROM ..);


I'm also wondering whether the similar restriction in
> convert_EXISTS_sublink_to_join could be removed similarly.
> In this light it was a mistake for convert_EXISTS_sublink_to_join
> to manage the pullup itself rather than doing it in the two-step
> fashion that convert_ANY_sublink_to_join does it.
>
>
Yes,  it is true!  I prefer to believe this deserves a separate patch.

Any feedback is welcome!

-- 
Best Regards
Andy Fan

From 3e2c8b46928a7f246ee7066b9ad160cf18a2e952 Mon Sep 17 00:00:00 2001
From: Andy Fan <zhihui.fan1213@gmail.com>
Date: Wed, 5 Apr 2023 14:45:11 +0800
Subject: [PATCH v3] Pull up direct-correlated ANY_SUBLINK using lateral join.

---
 .../postgres_fdw/expected/postgres_fdw.out    |   6 +-
 contrib/postgres_fdw/sql/postgres_fdw.sql     |   4 +-
 src/backend/optimizer/plan/subselect.c        |  19 +++-
 src/test/regress/expected/join.out            |  14 +--
 src/test/regress/expected/subselect.out       | 102 ++++++++++++++++++
 src/test/regress/sql/join.sql                 |   8 +-
 src/test/regress/sql/subselect.sql            |  32 ++++++
 7 files changed, 164 insertions(+), 21 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 04a3ef450cf..d43f74e49bf 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -11563,7 +11563,7 @@ CREATE FOREIGN TABLE foreign_tbl (b int)
 CREATE FOREIGN TABLE foreign_tbl2 () INHERITS (foreign_tbl)
   SERVER loopback OPTIONS (table_name 'base_tbl');
 EXPLAIN (VERBOSE, COSTS OFF)
-SELECT a FROM base_tbl WHERE a IN (SELECT a FROM foreign_tbl);
+SELECT a FROM base_tbl WHERE (a, random() > 0) IN (SELECT a, random() > 0 FROM foreign_tbl);
                                  QUERY PLAN                                  
 -----------------------------------------------------------------------------
  Seq Scan on public.base_tbl
@@ -11571,7 +11571,7 @@ SELECT a FROM base_tbl WHERE a IN (SELECT a FROM foreign_tbl);
    Filter: (SubPlan 1)
    SubPlan 1
      ->  Result
-           Output: base_tbl.a
+           Output: base_tbl.a, (random() > '0'::double precision)
            ->  Append
                  ->  Async Foreign Scan on public.foreign_tbl foreign_tbl_1
                        Remote SQL: SELECT NULL FROM public.base_tbl
@@ -11579,7 +11579,7 @@ SELECT a FROM base_tbl WHERE a IN (SELECT a FROM foreign_tbl);
                        Remote SQL: SELECT NULL FROM public.base_tbl
 (11 rows)
 
-SELECT a FROM base_tbl WHERE a IN (SELECT a FROM foreign_tbl);
+SELECT a FROM base_tbl WHERE (a, random() > 0) IN (SELECT a, random() > 0 FROM foreign_tbl);
  a 
 ---
  1
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 4f3088c03ea..8d775fdcf57 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -3816,8 +3816,8 @@ CREATE FOREIGN TABLE foreign_tbl2 () INHERITS (foreign_tbl)
   SERVER loopback OPTIONS (table_name 'base_tbl');
 
 EXPLAIN (VERBOSE, COSTS OFF)
-SELECT a FROM base_tbl WHERE a IN (SELECT a FROM foreign_tbl);
-SELECT a FROM base_tbl WHERE a IN (SELECT a FROM foreign_tbl);
+SELECT a FROM base_tbl WHERE (a, random() > 0) IN (SELECT a, random() > 0 FROM foreign_tbl);
+SELECT a FROM base_tbl WHERE (a, random() > 0) IN (SELECT a, random() > 0 FROM foreign_tbl);
 
 -- Clean up
 DROP FOREIGN TABLE foreign_tbl CASCADE;
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 052263aea6d..4708a5363ac 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -1279,15 +1279,24 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 	List	   *subquery_vars;
 	Node	   *quals;
 	ParseState *pstate;
+	Relids sub_ref_outer_relids = NULL;
+	bool use_lateral = false;
 
 	Assert(sublink->subLinkType == ANY_SUBLINK);
 
 	/*
-	 * The sub-select must not refer to any Vars of the parent query. (Vars of
-	 * higher levels should be okay, though.)
+	 * If the sub-select refers to any Vars of the parent query, we have to
+	 * treat it as LATERAL.  (Vars of higher levels don't matter here.)
 	 */
-	if (contain_vars_of_level((Node *) subselect, 1))
-		return NULL;
+	sub_ref_outer_relids = pull_varnos_of_level(NULL, (Node *) subselect, 1);
+
+	if (!bms_is_empty(sub_ref_outer_relids))
+	{
+		if (bms_is_subset(sub_ref_outer_relids, available_rels))
+			use_lateral = true;
+		else
+			return NULL;
+	}
 
 	/*
 	 * The test expression must contain some Vars of the parent query, else
@@ -1324,7 +1333,7 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 	nsitem = addRangeTableEntryForSubquery(pstate,
 										   subselect,
 										   makeAlias("ANY_subquery", NIL),
-										   false,
+										   use_lateral,
 										   false);
 	rte = nsitem->p_rte;
 	parse->rtable = lappend(parse->rtable, rte);
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 5d59ed7890f..317d9240dfd 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -4898,7 +4898,7 @@ reset enable_nestloop;
 explain (costs off)
 select a.unique1, b.unique2
   from onek a left join onek b on a.unique1 = b.unique2
-  where b.unique2 = any (select q1 from int8_tbl c where c.q1 < b.unique1);
+  where (b.unique2, random() > 0) = any (select q1, random() > 0 from int8_tbl c where c.q1 < b.unique1);
                         QUERY PLAN                        
 ----------------------------------------------------------
  Hash Join
@@ -4914,7 +4914,7 @@ select a.unique1, b.unique2
 
 select a.unique1, b.unique2
   from onek a left join onek b on a.unique1 = b.unique2
-  where b.unique2 = any (select q1 from int8_tbl c where c.q1 < b.unique1);
+  where (b.unique2, random() > 0) = any (select q1, random() > 0 from int8_tbl c where c.q1 < b.unique1);
  unique1 | unique2 
 ---------+---------
      123 |     123
@@ -6686,12 +6686,12 @@ select * from (values (0), (1)) v(id),
 lateral (select * from int8_tbl t1,
          lateral (select * from
                     (select * from int8_tbl t2
-                     where q1 = any (select q2 from int8_tbl t3
+                     where (q1, random() > 0) = any (select q2, random() > 0 from int8_tbl t3
                                      where q2 = (select greatest(t1.q1,t2.q2))
                                        and (select v.id=0)) offset 0) ss2) ss
          where t1.q1 = ss.q2) ss0;
-                              QUERY PLAN                              
-----------------------------------------------------------------------
+                                  QUERY PLAN                                   
+-------------------------------------------------------------------------------
  Nested Loop
    Output: "*VALUES*".column1, t1.q1, t1.q2, ss2.q1, ss2.q2
    ->  Seq Scan on public.int8_tbl t1
@@ -6708,7 +6708,7 @@ lateral (select * from int8_tbl t1,
                      Filter: (SubPlan 3)
                      SubPlan 3
                        ->  Result
-                             Output: t3.q2
+                             Output: t3.q2, (random() > '0'::double precision)
                              One-Time Filter: $4
                              InitPlan 1 (returns $2)
                                ->  Result
@@ -6725,7 +6725,7 @@ select * from (values (0), (1)) v(id),
 lateral (select * from int8_tbl t1,
          lateral (select * from
                     (select * from int8_tbl t2
-                     where q1 = any (select q2 from int8_tbl t3
+                     where (q1, random() > 0) = any (select q2, random() > 0 from int8_tbl t3
                                      where q2 = (select greatest(t1.q1,t2.q2))
                                        and (select v.id=0)) offset 0) ss2) ss
          where t1.q1 = ss.q2) ss0;
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index 22af8fafa17..49325058c5d 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -1926,3 +1926,105 @@ select * from x for update;
    Output: subselect_tbl.f1, subselect_tbl.f2, subselect_tbl.f3
 (2 rows)
 
+-- Pull-up the direct-correlated ANY_SUBLINK
+explain (costs off)
+select * from tenk1 A where hundred in (select hundred from tenk2 B where B.odd = A.odd);
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Hash Join
+   Hash Cond: ((a.odd = b.odd) AND (a.hundred = b.hundred))
+   ->  Seq Scan on tenk1 a
+   ->  Hash
+         ->  HashAggregate
+               Group Key: b.odd, b.hundred
+               ->  Seq Scan on tenk2 b
+(7 rows)
+
+explain (costs off)
+select * from tenk1 A where exists
+(select 1 from tenk2 B
+where A.hundred in (select C.hundred FROM tenk2 C
+WHERE c.odd = b.odd));
+           QUERY PLAN            
+---------------------------------
+ Nested Loop Semi Join
+   Join Filter: (SubPlan 1)
+   ->  Seq Scan on tenk1 a
+   ->  Materialize
+         ->  Seq Scan on tenk2 b
+   SubPlan 1
+     ->  Seq Scan on tenk2 c
+           Filter: (odd = b.odd)
+(8 rows)
+
+-- we should only try to pull up the sublink into RHS of a left join
+-- but a.hundred is not avaiable.
+explain (costs off)
+SELECT * FROM tenk1 A LEFT JOIN tenk2 B
+ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
+           QUERY PLAN            
+---------------------------------
+ Nested Loop Left Join
+   Join Filter: (SubPlan 1)
+   ->  Seq Scan on tenk1 a
+   ->  Materialize
+         ->  Seq Scan on tenk2 b
+   SubPlan 1
+     ->  Seq Scan on tenk2 c
+           Filter: (odd = b.odd)
+(8 rows)
+
+-- we should only try to pull up the sublink into RHS of a left join
+-- but a.odd is not avaiable for this.
+explain (costs off)
+SELECT * FROM tenk1 A LEFT JOIN tenk2 B
+ON B.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = a.odd);
+           QUERY PLAN            
+---------------------------------
+ Nested Loop Left Join
+   Join Filter: (SubPlan 1)
+   ->  Seq Scan on tenk1 a
+   ->  Materialize
+         ->  Seq Scan on tenk2 b
+   SubPlan 1
+     ->  Seq Scan on tenk2 c
+           Filter: (odd = a.odd)
+(8 rows)
+
+-- should be able to pull up since all the references is available
+explain (costs off)
+SELECT * FROM tenk1 A LEFT JOIN tenk2 B
+ON B.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
+                               QUERY PLAN                               
+------------------------------------------------------------------------
+ Nested Loop Left Join
+   ->  Seq Scan on tenk1 a
+   ->  Materialize
+         ->  Hash Join
+               Hash Cond: ((b.odd = c.odd) AND (b.hundred = c.hundred))
+               ->  Seq Scan on tenk2 b
+               ->  Hash
+                     ->  HashAggregate
+                           Group Key: c.odd, c.hundred
+                           ->  Seq Scan on tenk2 c
+(10 rows)
+
+-- we can pull up the sublink into the inner JoinExpr.
+explain (costs off)
+SELECT * FROM tenk1 A INNER JOIN tenk2 B
+ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
+                   QUERY PLAN                    
+-------------------------------------------------
+ Hash Join
+   Hash Cond: (c.odd = b.odd)
+   ->  Hash Join
+         Hash Cond: (a.hundred = c.hundred)
+         ->  Seq Scan on tenk1 a
+         ->  Hash
+               ->  HashAggregate
+                     Group Key: c.odd, c.hundred
+                     ->  Seq Scan on tenk2 c
+   ->  Hash
+         ->  Seq Scan on tenk2 b
+(11 rows)
+
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index a630f58b571..e91d25c0915 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -1721,11 +1721,11 @@ reset enable_nestloop;
 explain (costs off)
 select a.unique1, b.unique2
   from onek a left join onek b on a.unique1 = b.unique2
-  where b.unique2 = any (select q1 from int8_tbl c where c.q1 < b.unique1);
+  where (b.unique2, random() > 0) = any (select q1, random() > 0 from int8_tbl c where c.q1 < b.unique1);
 
 select a.unique1, b.unique2
   from onek a left join onek b on a.unique1 = b.unique2
-  where b.unique2 = any (select q1 from int8_tbl c where c.q1 < b.unique1);
+  where (b.unique2, random() > 0) = any (select q1, random() > 0 from int8_tbl c where c.q1 < b.unique1);
 
 --
 -- test full-join strength reduction
@@ -2373,7 +2373,7 @@ select * from (values (0), (1)) v(id),
 lateral (select * from int8_tbl t1,
          lateral (select * from
                     (select * from int8_tbl t2
-                     where q1 = any (select q2 from int8_tbl t3
+                     where (q1, random() > 0) = any (select q2, random() > 0 from int8_tbl t3
                                      where q2 = (select greatest(t1.q1,t2.q2))
                                        and (select v.id=0)) offset 0) ss2) ss
          where t1.q1 = ss.q2) ss0;
@@ -2382,7 +2382,7 @@ select * from (values (0), (1)) v(id),
 lateral (select * from int8_tbl t1,
          lateral (select * from
                     (select * from int8_tbl t2
-                     where q1 = any (select q2 from int8_tbl t3
+                     where (q1, random() > 0) = any (select q2, random() > 0 from int8_tbl t3
                                      where q2 = (select greatest(t1.q1,t2.q2))
                                        and (select v.id=0)) offset 0) ss2) ss
          where t1.q1 = ss.q2) ss0;
diff --git a/src/test/regress/sql/subselect.sql b/src/test/regress/sql/subselect.sql
index 40276708c99..b2dd094983d 100644
--- a/src/test/regress/sql/subselect.sql
+++ b/src/test/regress/sql/subselect.sql
@@ -968,3 +968,35 @@ select * from (with x as (select 2 as y) select * from x) ss;
 explain (verbose, costs off)
 with x as (select * from subselect_tbl)
 select * from x for update;
+
+-- Pull-up the direct-correlated ANY_SUBLINK
+explain (costs off)
+select * from tenk1 A where hundred in (select hundred from tenk2 B where B.odd = A.odd);
+
+explain (costs off)
+select * from tenk1 A where exists
+(select 1 from tenk2 B
+where A.hundred in (select C.hundred FROM tenk2 C
+WHERE c.odd = b.odd));
+
+-- we should only try to pull up the sublink into RHS of a left join
+-- but a.hundred is not avaiable.
+explain (costs off)
+SELECT * FROM tenk1 A LEFT JOIN tenk2 B
+ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
+
+-- we should only try to pull up the sublink into RHS of a left join
+-- but a.odd is not avaiable for this.
+explain (costs off)
+SELECT * FROM tenk1 A LEFT JOIN tenk2 B
+ON B.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = a.odd);
+
+-- should be able to pull up since all the references is available
+explain (costs off)
+SELECT * FROM tenk1 A LEFT JOIN tenk2 B
+ON B.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
+
+-- we can pull up the sublink into the inner JoinExpr.
+explain (costs off)
+SELECT * FROM tenk1 A INNER JOIN tenk2 B
+ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
-- 
2.21.0

Re: A new strategy for pull-up correlated ANY_SUBLINK

Reply via email to