Re: POC: GROUP BY optimization

Andrei Lepikhov Mon, 15 Apr 2024 23:00:14 -0700

On 4/12/24 06:44, Tom Lane wrote:

* The very first hunk causes get_eclass_for_sort_expr to have
side-effects on existing EquivalenceClass data structures.
What is the argument that that's not going to cause problems?
At the very least it introduces asymmetry, in that the first
caller wins the chance to change the EC's ec_sortref and later
callers won't be able to.

Let me resolve issues bit by bit.

Addressing your first note, 'asymmetry,' I discovered this part of thecode. Before the 8d83a5d, it could cause some artefacts, but since then,a kind of asymmetry has been introduced into the GROUP-BY clause list.As you can see in the attached patch, GROUP-BY reordering works evenwhen the asymmetry of the 8d83a5d chooses different keys.

At the same time, I agree that implicitly setting anything in anexporting function, which should just look up for EC, is a questionablecoding style. To avoid possible usage issues (in extensions, forexample), I propose setting it up afterwards, explicitly forcing thisaction by input parameter - see my attempt in the attachment.


--
regards,
Andrei Lepikhov
Postgres Professional

diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index a619ff9177..bc08dfadaf 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -652,18 +652,7 @@ get_eclass_for_sort_expr(PlannerInfo *root,
 
 			if (opcintype == cur_em->em_datatype &&
 				equal(expr, cur_em->em_expr))
-			{
-				/*
-				 * Match!
-				 *
-				 * Copy the sortref if it wasn't set yet.  That may happen if
-				 * the ec was constructed from a WHERE clause, i.e. it doesn't
-				 * have a target reference at all.
-				 */
-				if (cur_ec->ec_sortref == 0 && sortref > 0)
-					cur_ec->ec_sortref = sortref;
-				return cur_ec;
-			}
+				return cur_ec; /* Match! */
 		}
 	}
 
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 1d61881a6b..5d9597adcd 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -1355,7 +1355,8 @@ make_pathkeys_for_sortclauses(PlannerInfo *root,
 													&sortclauses,
 													tlist,
 													false,
-													&sortable);
+													&sortable,
+													false);
 	/* It's caller error if not all clauses were sortable */
 	Assert(sortable);
 	return result;
@@ -1379,13 +1380,16 @@ make_pathkeys_for_sortclauses(PlannerInfo *root,
  * to remove any clauses that can be proven redundant via the eclass logic.
  * Even though we'll have to hash in that case, we might as well not hash
  * redundant columns.)
+ * *set_ec_sortref is true if caller wants to set up ec_sortref field in
+ * the pathkey Equivalence Class, if it have not initialized beforehand.
  */
 List *
 make_pathkeys_for_sortclauses_extended(PlannerInfo *root,
 									   List **sortclauses,
 									   List *tlist,
 									   bool remove_redundant,
-									   bool *sortable)
+									   bool *sortable,
+									   bool set_ec_sortref)
 {
 	List	   *pathkeys = NIL;
 	ListCell   *l;
@@ -1409,6 +1413,13 @@ make_pathkeys_for_sortclauses_extended(PlannerInfo *root,
 										   sortcl->nulls_first,
 										   sortcl->tleSortGroupRef,
 										   true);
+		if (pathkey->pk_eclass->ec_sortref == 0 && set_ec_sortref)
+			/*
+			 * Copy the sortref if it wasn't set yet.  That may happen if
+			 * the ec was constructed from a WHERE clause, i.e. it doesn't
+			 * have a target reference at all.
+			 */
+			pathkey->pk_eclass->ec_sortref = sortcl->tleSortGroupRef;
 
 		/* Canonical form eliminates redundant ordering keys */
 		if (!pathkey_is_redundant(pathkey, pathkeys))
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 5320da51a0..dee98c9a90 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3391,7 +3391,7 @@ standard_qp_callback(PlannerInfo *root, void *extra)
 		/*
 		 * With a plain GROUP BY list, we can remove any grouping items that
 		 * are proven redundant by EquivalenceClass processing.  For example,
-		 * we can remove y given "WHERE x = y GROUP BY x, y".  These aren't
+		 * we can remove y given "WHERE x get_eclass_for_sort_expr= y GROUP BY x, y".  These aren't
 		 * especially common cases, but they're nearly free to detect.  Note
 		 * that we remove redundant items from processed_groupClause but not
 		 * the original parse->groupClause.
@@ -3403,7 +3403,8 @@ standard_qp_callback(PlannerInfo *root, void *extra)
 												   &root->processed_groupClause,
 												   tlist,
 												   true,
-												   &sortable);
+												   &sortable,
+												   true);
 		if (!sortable)
 		{
 			/* Can't sort; no point in considering aggregate ordering either */
@@ -3453,7 +3454,8 @@ standard_qp_callback(PlannerInfo *root, void *extra)
 												   &root->processed_distinctClause,
 												   tlist,
 												   true,
-												   &sortable);
+												   &sortable,
+												   false);
 		if (!sortable)
 			root->distinct_pathkeys = NIL;
 	}
@@ -3479,7 +3481,8 @@ standard_qp_callback(PlannerInfo *root, void *extra)
 												   &groupClauses,
 												   tlist,
 												   false,
-												   &sortable);
+												   &sortable,
+												   false);
 		if (!sortable)
 			root->setop_pathkeys = NIL;
 	}
@@ -6027,7 +6030,8 @@ make_pathkeys_for_window(PlannerInfo *root, WindowClause *wc,
 																 &wc->partitionClause,
 																 tlist,
 																 true,
-																 &sortable);
+																 &sortable,
+																 false);
 
 		Assert(sortable);
 	}
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 39ba461548..fff4b69dfd 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -242,7 +242,8 @@ extern List *make_pathkeys_for_sortclauses_extended(PlannerInfo *root,
 													List **sortclauses,
 													List *tlist,
 													bool remove_redundant,
-													bool *sortable);
+													bool *sortable,
+													bool set_ec_sortref);
 extern void initialize_mergeclause_eclasses(PlannerInfo *root,
 											RestrictInfo *restrictinfo);
 extern void update_mergeclause_eclasses(PlannerInfo *root,
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index 50695d83a2..1b06f9ca97 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -2886,6 +2886,53 @@ GROUP BY c1.w, c1.z;
  5.0000000000000000
 (2 rows)
 
+-- Pathkeys, built in subtree, can be used to optimize GROUP-BY clause ordering.
+-- Also, here we check that it doesn't depend on the initial clause order in the
+-- GROUP-BY list.
+EXPLAIN (COSTS OFF)
+SELECT c1.y,c1.x FROM group_agg_pk c1
+  JOIN group_agg_pk c2
+  ON c1.x = c2.x
+GROUP BY c1.y,c1.x,c2.x;
+                     QUERY PLAN                      
+-----------------------------------------------------
+ Group
+   Group Key: c1.x, c1.y
+   ->  Incremental Sort
+         Sort Key: c1.x, c1.y
+         Presorted Key: c1.x
+         ->  Merge Join
+               Merge Cond: (c1.x = c2.x)
+               ->  Sort
+                     Sort Key: c1.x
+                     ->  Seq Scan on group_agg_pk c1
+               ->  Sort
+                     Sort Key: c2.x
+                     ->  Seq Scan on group_agg_pk c2
+(13 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT c1.y,c1.x FROM group_agg_pk c1
+  JOIN group_agg_pk c2
+  ON c1.x = c2.x
+GROUP BY c1.y,c2.x,c1.x;
+                     QUERY PLAN                      
+-----------------------------------------------------
+ Group
+   Group Key: c2.x, c1.y
+   ->  Incremental Sort
+         Sort Key: c2.x, c1.y
+         Presorted Key: c2.x
+         ->  Merge Join
+               Merge Cond: (c1.x = c2.x)
+               ->  Sort
+                     Sort Key: c1.x
+                     ->  Seq Scan on group_agg_pk c1
+               ->  Sort
+                     Sort Key: c2.x
+                     ->  Seq Scan on group_agg_pk c2
+(13 rows)
+
 RESET enable_nestloop;
 RESET enable_hashjoin;
 DROP TABLE group_agg_pk;
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 2905848ead..0be9c48483 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -1253,6 +1253,20 @@ SELECT avg(c1.f ORDER BY c1.x, c1.y)
 FROM group_agg_pk c1 JOIN group_agg_pk c2 ON c1.x = c2.x
 GROUP BY c1.w, c1.z;
 
+-- Pathkeys, built in subtree, can be used to optimize GROUP-BY clause ordering.
+-- Also, here we check that it doesn't depend on the initial clause order in the
+-- GROUP-BY list.
+EXPLAIN (COSTS OFF)
+SELECT c1.y,c1.x FROM group_agg_pk c1
+  JOIN group_agg_pk c2
+  ON c1.x = c2.x
+GROUP BY c1.y,c1.x,c2.x;
+EXPLAIN (COSTS OFF)
+SELECT c1.y,c1.x FROM group_agg_pk c1
+  JOIN group_agg_pk c2
+  ON c1.x = c2.x
+GROUP BY c1.y,c2.x,c1.x;
+
 RESET enable_nestloop;
 RESET enable_hashjoin;
 DROP TABLE group_agg_pk;

Re: POC: GROUP BY optimization

Reply via email to