This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new afd99d19a2b8 [SPARK-47897][SQL][3.5] Fix ExpressionSet performance 
regression in scala 2.12
afd99d19a2b8 is described below

commit afd99d19a2b85dda2245d3557506d1090187c5f4
Author: Zhen Wang <643348...@qq.com>
AuthorDate: Fri Apr 19 10:53:16 2024 +0800

    [SPARK-47897][SQL][3.5] Fix ExpressionSet performance regression in scala 
2.12
    
    ### What changes were proposed in this pull request?
    
    Fix `ExpressionSet` performance regression in scala 2.12.
    
    ### Why are the changes needed?
    
    The implementation of the `SetLike.++` method in scala 2.12 is to 
iteratively execute the `+` method. The `ExpressionSet.+` method first clones a 
new object and then adds element, which is very expensive.
    
    
https://github.com/scala/scala/blob/ceaf7e68ac93e9bbe8642d06164714b2de709c27/src/library/scala/collection/SetLike.scala#L186
    
    After https://github.com/apache/spark/pull/36121, the `++` and `--` methods 
in ExpressionSet of scala 2.12 were removed, causing performance regression.
    
    ### Does this PR introduce _any_ user-facing change?
    
    ### How was this patch tested?
    
    Benchmark code:
    
    ```
    object TestBenchmark {
      def main(args: Array[String]): Unit = {
        val count = 300
        val benchmark = new Benchmark("Test ExpressionSetV2 ++ ", count)
        val aUpper = AttributeReference("A", IntegerType)(exprId = ExprId(1))
    
        var initialSet = ExpressionSet((0 until 300).map(i => aUpper + i))
        val setToAddWithSameDeterministicExpression = ExpressionSet((0 until 
300).map(i => aUpper + i))
    
        benchmark.addCase("Test ++", 10) { _: Int =>
          for (_ <- 0L until count) {
            initialSet ++= setToAddWithSameDeterministicExpression
          }
        }
        benchmark.run()
      }
    }
    ```
    
    before this change:
    
    ```
    OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-957.el7.x86_64
    Intel Core Processor (Skylake, IBRS)
    Test ExpressionSetV2 ++ :                 Best Time(ms)   Avg Time(ms)   
Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    
------------------------------------------------------------------------------------------------------------------------
    Test ++                                            1577           1691      
    61          0.0     5255516.0       1.0X
    ```
    
    after this change:
    
    ```
    OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-957.el7.x86_64
    Intel Core Processor (Skylake, IBRS)
    Test ExpressionSetV2 ++ :                 Best Time(ms)   Avg Time(ms)   
Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    
------------------------------------------------------------------------------------------------------------------------
    Test ++                                              14             14      
     0          0.0       45395.2       1.0X
    ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes #46114 from wForget/SPARK-47897.
    
    Authored-by: Zhen Wang <643348...@qq.com>
    Signed-off-by: Kent Yao <y...@apache.org>
---
 .../sql/catalyst/expressions/ExpressionSet.scala    | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git 
a/sql/catalyst/src/main/scala-2.12/org/apache/spark/sql/catalyst/expressions/ExpressionSet.scala
 
b/sql/catalyst/src/main/scala-2.12/org/apache/spark/sql/catalyst/expressions/ExpressionSet.scala
index 3e545f745bae..c18679330f3a 100644
--- 
a/sql/catalyst/src/main/scala-2.12/org/apache/spark/sql/catalyst/expressions/ExpressionSet.scala
+++ 
b/sql/catalyst/src/main/scala-2.12/org/apache/spark/sql/catalyst/expressions/ExpressionSet.scala
@@ -17,7 +17,7 @@
 
 package org.apache.spark.sql.catalyst.expressions
 
-import scala.collection.mutable
+import scala.collection.{mutable, GenTraversableOnce}
 import scala.collection.mutable.ArrayBuffer
 
 object ExpressionSet {
@@ -108,12 +108,31 @@ class ExpressionSet protected(
     newSet
   }
 
+  /**
+   * SPARK-47897: In Scala 2.12, the `SetLike.++` method iteratively calls `+` 
method.
+   * `ExpressionSet.+` is expensive, so we override `++`.
+   */
+  override def ++(elems: GenTraversableOnce[Expression]): ExpressionSet = {
+    val newSet = clone()
+    elems.foreach(newSet.add)
+    newSet
+  }
+
   override def -(elem: Expression): ExpressionSet = {
     val newSet = clone()
     newSet.remove(elem)
     newSet
   }
 
+  /**
+   * SPARK-47897: We need to override `--` like `++`.
+   */
+  override def --(elems: GenTraversableOnce[Expression]): ExpressionSet = {
+    val newSet = clone()
+    elems.foreach(newSet.remove)
+    newSet
+  }
+
   def map(f: Expression => Expression): ExpressionSet = {
     val newSet = new ExpressionSet()
     this.iterator.foreach(elem => newSet.add(f(elem)))


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

Reply via email to