[spark] branch master updated: [SPARK-28090][SQL] Improve `replaceAliasButKeepName` performance

dongjoon Sun, 03 Apr 2022 00:49:19 -0700

This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new 61074d4  [SPARK-28090][SQL] Improve `replaceAliasButKeepName` 
performance
61074d4 is described below

commit 61074d4741556b7cdc66a2b0af012e5b4b2bc075
Author: Peter Toth <peter.t...@gmail.com>
AuthorDate: Sun Apr 3 00:47:40 2022 -0700

    [SPARK-28090][SQL] Improve `replaceAliasButKeepName` performance
    
    ### What changes were proposed in this pull request?
    
    SPARK-28090 ticket description contains an example query with multiple 
nested struct creation and field extraction. The following is is an example of 
the query when the sample code range is set to only 3:
    ```
    Project [struct(num1, numerics#23.num1, num10, numerics#23.num10, num11, 
numerics#23.num11, num12, numerics#23.num12, num13, numerics#23.num13, num14, 
numerics#23.num14, num15, numerics#23.num15, num2, numerics#23.num2, num3, 
numerics#23.num3, num4, numerics#23.num4, num5, numerics#23.num5, num6, 
numerics#23.num6, num7, numerics#23.num7, num8, numerics#23.num8, num9, 
numerics#23.num9, out_num1, numerics#23.out_num1, out_num2, -numerics#23.num2) 
AS numerics#42]
    +- Project [struct(num1, numerics#5.num1, num10, numerics#5.num10, num11, 
numerics#5.num11, num12, numerics#5.num12, num13, numerics#5.num13, num14, 
numerics#5.num14, num15, numerics#5.num15, num2, numerics#5.num2, num3, 
numerics#5.num3, num4, numerics#5.num4, num5, numerics#5.num5, num6, 
numerics#5.num6, num7, numerics#5.num7, num8, numerics#5.num8, num9, 
numerics#5.num9, out_num1, -numerics#5.num1) AS numerics#23]
       +- LogicalRDD [numerics#5], false
    ```
    If the level of nesting reaches 7 the query plan generation becomes 
extremely slow on Spark 2.4. That is because both
    - `CollapseProject` rule is slow and
    - some of the expression optimization rules running on the huge, not yet 
simplified expression tree of the single, collapsed `Project` node are slow.
    
    On Spark 3.3, after SPARK-36718, `CollapseProject` doesn't collapse such 
plans so the above issues don't occur,
    but `PhysicalOperation` extractor has an issue that it also builds up that 
huge expression tree and then traverses and modifies it in 
`AliasHelper.replaceAliasButKeepName()`. With a small change in that function 
we can avoid such costly operations.
    
    ### Why are the changes needed?
    The suggested change reduced the plan generation time of the example query 
from minutes (range = 7) or hours (range = 8+) to seconds.
    
    ### Does this PR introduce _any_ user-facing change?
    The example query can be executed.
    
    ### How was this patch tested?
    Existing UTs + manual test of the example query in the ticket description.
    
    Closes #35382 from peter-toth/SPARK-28090-improve-replacealiasbutkeepname.
    
    Authored-by: Peter Toth <peter.t...@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongj...@apache.org>
---
 .../spark/sql/catalyst/expressions/AliasHelper.scala       | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala
index dea7ea0..888a986 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala
@@ -70,11 +70,17 @@ trait AliasHelper {
   protected def replaceAliasButKeepName(
      expr: NamedExpression,
      aliasMap: AttributeMap[Alias]): NamedExpression = {
-    // Use transformUp to prevent infinite recursion when the replacement 
expression
-    // redefines the same ExprId,
-    trimNonTopLevelAliases(expr.transformUp {
+    expr match {
+      // We need to keep the `Alias` if we replace a top-level Attribute, so 
that it's still a
+      // `NamedExpression`. We also need to keep the name of the original 
Attribute.
       case a: Attribute => aliasMap.get(a).map(_.withName(a.name)).getOrElse(a)
-    }).asInstanceOf[NamedExpression]
+      case o =>
+        // Use transformUp to prevent infinite recursion when the replacement 
expression
+        // redefines the same ExprId.
+        o.mapChildren(_.transformUp {
+          case a: Attribute => aliasMap.get(a).map(_.child).getOrElse(a)
+        }).asInstanceOf[NamedExpression]
+    }
   }
 
   protected def trimAliases(e: Expression): Expression = {

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

[spark] branch master updated: [SPARK-28090][SQL] Improve `replaceAliasButKeepName` performance

Reply via email to