cloud-fan commented on code in PR #52149: URL: https://github.com/apache/spark/pull/52149#discussion_r2320734123
########## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala:
##########
@@ -298,4 +298,20 @@ class CollapseProjectSuite extends PlanTest {
     comparePlans(optimized, expected)
   }
 }
+
+  test("SPARK-53399: Merge expressions") {
+    val query = testRelation
+      .select($"a" + 1 as "a_plus_1", $"b" + 1 as "b_plus_1")
+      .select($"a_plus_1" + $"a_plus_1", $"b_plus_1")
+      .analyze
+
+    val optimized = Optimize.execute(query)
+
+    val expected = testRelation
+      .select($"a" + 1 as "a_plus_1", $"b")
+      .select($"a_plus_1" + $"a_plus_1", $"b" + 1 as "b_plus_1")

Review Comment:
Thinking about it again: if `b_plus_1` references 1000 columns, this optimization actually makes the plan worse, because the lower `Project` becomes extremely wide.

The benefits of this new optimization are:
- if we happen to put two Python UDFs into a single `Project`, it's a huge win;
- combining two expressions allows the optimizer to make better decisions.

Given that we will handle Python UDFs specially later, let's be more conservative here:
```
...
val newRefs = AttributeSet(substitute.flatMap(_.references))
if (newRefs.size <= substitute.size) ... else (consumers, producers)
...
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
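The guard the reviewer proposes can be modeled outside Spark in a few lines. This is a hypothetical, Catalyst-free Python sketch: `should_merge` is an illustrative name, and each expression is modeled simply as the set of column names it references (a stand-in for an `AttributeSet` of `_.references`). The idea is to merge only when the combined references do not outnumber the expressions being substituted, so the lower `Project` never gets wider than what it replaces.

```python
def should_merge(substitute):
    """Conservative guard: merge producer expressions into the consumer
    Project only if the union of their referenced columns is no larger
    than the number of substituted expressions.

    `substitute` is a list of expressions, each modeled as a set of
    referenced column names (illustrative stand-in for Catalyst types).
    """
    new_refs = set().union(*substitute) if substitute else set()
    return len(new_refs) <= len(substitute)


# `a + 1` references a single column: merging cannot widen the plan.
print(should_merge([{"a"}]))  # True

# An expression like `b_plus_1` referencing 1000 columns would make the
# lower Project super wide, so the guard rejects the merge.
print(should_merge([{f"c{i}" for i in range(1000)}]))  # False
```

With this check, the example in the test above still merges (each substituted expression touches one column), while the pathological 1000-column case keeps the original `(consumers, producers)` split.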