On 12/10/2023 18:32, Alexander Korotkov wrote:
On Thu, Oct 5, 2023 at 12:17 PM Andrei Lepikhov
<a.lepik...@postgrespro.ru> wrote:
On 4/10/2023 14:34, Alexander Korotkov wrote:
  > Relid replacement machinery is the most contradictory code here. We used
  > a utilitarian approach and implemented a simplistic variant.

  > > 2) It would be nice to skip the insertion of IS NOT NULL checks when
  > > they are not necessary.  [1] points that infrastructure from [2] might
  > > be useful.  The patchset from [2] seems committed now.  However, I
  > > can't see it is directly helpful in this matter.  Could we just skip
  > > adding IS NOT NULL clause for the columns, that have
  > > pg_attribute.attnotnull set?
  > Thanks for the links, I will look into that case.
To be more precise: in the attachment you can find a diff against the main
patch, which shows the volume of changes needed to achieve the desired
behaviour. Some EXPLAIN outputs in the regression tests shifted, so I've
added the following tests:

DROP TABLE test CASCADE;
CREATE TABLE test (a int, b int not null);
CREATE UNIQUE INDEX abc ON test(b);
explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a)
WHERE t1.b=t2.b;
CREATE UNIQUE INDEX abc1 ON test(a,b);
explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a)
WHERE t1.b=t2.b;
explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a)
WHERE t1.b=t2.b AND (t1.a=t2.a OR t2.a=t1.a);
DROP INDEX abc1;
explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a)
WHERE t1.b=t2.b AND (t1.b=t2.b OR t2.b=t1.b);

We get almost the results we wanted. But in the last EXPLAIN you can see
that nothing happened with the OR clause. We would need to use the
expression mutator instead of the walker to handle such clauses, but it
doesn't process the RestrictInfo node ... I'm inclined to put the solution
of this issue off for a while.

OK.  I don't think it's worth eliminating IS NULL quals at this
complexity (at least at this stage of the work).

I've made improvements to the code: mostly new comments, grammar
corrections to existing comments, and small refactoring.

Also, I found that the suggestion from David Rowley [1] to qsort the
array of relations to find duplicates faster was still unaddressed.
I've implemented it.  That helps to avoid quadratic complexity with a
large number of relations.

I've also incorporated the improvements from Alena Rybakina, except the
one skipping SJ removal when no SJ quals are found.  It's not yet clear
to me whether that check fixes any cases, but at least the optimization
got skipped in some useful cases (as you can see in the regression tests).

I would like to propose one more minor improvement (see the attachment).  The idea is that after removing a self-join and updating the clauses, we should re-probe the set of relids with the same Oid, because we may find more removable self-joins (see the demo test in join.sql).

--
regards,
Andrey Lepikhov
Postgres Professional
diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 7b8dc7a2b7..f7ccda5231 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -2298,6 +2298,7 @@ remove_self_joins_recurse(PlannerInfo *root, List *joinlist, Relids toRemove)
 			{
 				/* Create group of relation indexes with the same oid */
 				Relids		group = NULL;
+				Relids		removed;
 
 				while (i < j)
 				{
@@ -2306,8 +2307,21 @@ remove_self_joins_recurse(PlannerInfo *root, List *joinlist, Relids toRemove)
 				}
 
 				relids = bms_del_members(relids, group);
-				toRemove = bms_add_members(toRemove,
-										   remove_self_joins_one_group(root, group));
+
+				/*
+				 * Try to remove self-joins from a group of identical entries.
+				 * Make the next attempt iteratively - if something is deleted
+				 * from a group, changes in clauses and equivalence classes can
+				 * give us a chance to find more candidates.
+				 */
+				do {
+					Assert(!bms_overlap(group, toRemove));
+					removed = remove_self_joins_one_group(root, group);
+					toRemove = bms_add_members(toRemove, removed);
+					group = bms_del_members(group, removed);
+				} while (!bms_is_empty(removed) &&
+						 bms_membership(group) == BMS_MULTIPLE);
+				bms_free(removed);
 				bms_free(group);
 			}
 			else
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 12a90bd42e..cb2429645c 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -6786,6 +6786,36 @@ SELECT * FROM emp1 e1, emp1 e2 WHERE e1.id = e2.id AND e2.code <> e1.code;
    Filter: ((e2.id IS NOT NULL) AND (e2.code <> e2.code))
 (3 rows)
 
+-- Shuffle the self-joined relations. Only with iterative removal attempts
+-- will the EXPLAIN output of these queries be identical.
+CREATE UNIQUE INDEX ON emp1((id*id));
+explain SELECT count(*) FROM emp1 c1, emp1 c2, emp1 c3
+WHERE c1.id=c2.id AND c1.id*c2.id=c3.id*c3.id;
+                           QUERY PLAN                            
+-----------------------------------------------------------------
+ Aggregate  (cost=43.84..43.85 rows=1 width=8)
+   ->  Seq Scan on emp1 c3  (cost=0.00..38.25 rows=2237 width=0)
+         Filter: ((id IS NOT NULL) AND ((id * id) IS NOT NULL))
+(3 rows)
+
+explain SELECT count(*) FROM emp1 c1, emp1 c2, emp1 c3
+WHERE c1.id=c3.id AND c1.id*c3.id=c2.id*c2.id;
+                           QUERY PLAN                            
+-----------------------------------------------------------------
+ Aggregate  (cost=43.84..43.85 rows=1 width=8)
+   ->  Seq Scan on emp1 c3  (cost=0.00..38.25 rows=2237 width=0)
+         Filter: ((id IS NOT NULL) AND ((id * id) IS NOT NULL))
+(3 rows)
+
+explain SELECT count(*) FROM emp1 c1, emp1 c2, emp1 c3
+WHERE c3.id=c2.id AND c3.id*c2.id=c1.id*c1.id;
+                           QUERY PLAN                            
+-----------------------------------------------------------------
+ Aggregate  (cost=43.84..43.85 rows=1 width=8)
+   ->  Seq Scan on emp1 c3  (cost=0.00..38.25 rows=2237 width=0)
+         Filter: ((id IS NOT NULL) AND ((id * id) IS NOT NULL))
+(3 rows)
+
 -- We can remove the join even if we find the join can't duplicate rows and
 -- the base quals of each side are different.  In the following case we end up
 -- moving quals over to s1 to make it so it can't match any rows.
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 4d49c0767a..55147263ca 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -2576,6 +2576,16 @@ CREATE TABLE emp1 ( id SERIAL PRIMARY KEY NOT NULL, code int);
 explain (verbose, costs off)
 SELECT * FROM emp1 e1, emp1 e2 WHERE e1.id = e2.id AND e2.code <> e1.code;
 
+-- Shuffle the self-joined relations. Only with iterative removal attempts
+-- will the EXPLAIN output of these queries be identical.
+CREATE UNIQUE INDEX ON emp1((id*id));
+explain SELECT count(*) FROM emp1 c1, emp1 c2, emp1 c3
+WHERE c1.id=c2.id AND c1.id*c2.id=c3.id*c3.id;
+explain SELECT count(*) FROM emp1 c1, emp1 c2, emp1 c3
+WHERE c1.id=c3.id AND c1.id*c3.id=c2.id*c2.id;
+explain SELECT count(*) FROM emp1 c1, emp1 c2, emp1 c3
+WHERE c3.id=c2.id AND c3.id*c2.id=c1.id*c1.id;
+
 -- We can remove the join even if we find the join can't duplicate rows and
 -- the base quals of each side are different.  In the following case we end up
 -- moving quals over to s1 to make it so it can't match any rows.
