cloud-fan commented on code in PR #52149: URL: https://github.com/apache/spark/pull/52149#discussion_r2320734123
########## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala:
##########
@@ -298,4 +298,20 @@ class CollapseProjectSuite extends PlanTest {
     comparePlans(optimized, expected)
   }
 }
+
+  test("SPARK-53399: Merge expressions") {
+    val query = testRelation
+      .select($"a" + 1 as "a_plus_1", $"b" + 1 as "b_plus_1")
+      .select($"a_plus_1" + $"a_plus_1", $"b_plus_1")
+      .analyze
+
+    val optimized = Optimize.execute(query)
+
+    val expected = testRelation
+      .select($"a" + 1 as "a_plus_1", $"b")
+      .select($"a_plus_1" + $"a_plus_1", $"b" + 1 as "b_plus_1")

Review Comment:
Thinking about it again: if `b_plus_1` references 1000 columns, this optimization actually makes the plan worse, because the lower `Project` becomes extremely wide.

The benefits of this new optimization are:
- if we happen to put two Python UDFs into a single `Project`, it's a huge win;
- combining two expressions allows the optimizer to make better decisions.

Given that we will handle Python UDFs specially later, let's be more conservative here:
```
...
val newRefs = AttributeSet(substitute.flatMap(_.references))
if (newRefs.size <= substitute.size) ... else (consumers, producers)
...
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
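The guard the reviewer proposes can be modeled outside Spark in a few lines. This is a hypothetical, Catalyst-free Python sketch: `should_merge` is an illustrative name, and each expression is modeled simply as the set of column names it references (a stand-in for an `AttributeSet` of `_.references`). The idea is to merge only when the combined references do not outnumber the expressions being substituted, so the lower `Project` never gets wider than what it replaces.

```python
def should_merge(substitute):
    """Conservative guard: merge producer expressions into the consumer
    Project only if the union of their referenced columns is no larger
    than the number of substituted expressions.

    `substitute` is a list of expressions, each modeled as a set of
    referenced column names (illustrative stand-in for Catalyst types).
    """
    new_refs = set().union(*substitute) if substitute else set()
    return len(new_refs) <= len(substitute)


# `a + 1` references a single column: merging cannot widen the plan.
print(should_merge([{"a"}]))  # True

# An expression like `b_plus_1` referencing 1000 columns would make the
# lower Project super wide, so the guard rejects the merge.
print(should_merge([{f"c{i}" for i in range(1000)}]))  # False
```

With this check, the example in the test above still merges (each substituted expression touches one column), while the pathological 1000-column case keeps the original `(consumers, producers)` split.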