This is an automated email from the ASF dual-hosted git repository.

yumwang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 16e813cecd5 [SPARK-45071][SQL] Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
16e813cecd5 is described below

commit 16e813cecd55490a71ef6c05fca2209fbdae078f
Author: zzzzming95 <505306...@qq.com>
AuthorDate: Wed Sep 6 11:38:30 2023 +0800

    [SPARK-45071][SQL] Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
    
    ### What changes were proposed in this pull request?
    
    Since `BinaryArithmetic#dataType` recursively computes the data type of each node, the driver becomes very slow when many columns are processed.
    
    For example, the following code:
    ```scala
    import spark.implicits._
    import scala.util.Random
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.{expr, sum}
    import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
    
    val N = 30
    val M = 100
    
    val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString)
    val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5))
    
    val schema = StructType(columns.map(StructField(_, IntegerType)))
    val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_)))
    val df = spark.createDataFrame(rdd, schema)
    val colExprs = columns.map(sum(_))
    
    // generate a new column that sums the 30 existing columns
    df.withColumn("new_col_sum", expr(columns.mkString(" + ")))
    ```
    
    This code takes a few minutes for the driver to execute on Spark 3.4, but only a few seconds on Spark 3.2. Related issue: [SPARK-39316](https://github.com/apache/spark/pull/36698)
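
    To reproduce the slowdown, the plan construction can be wrapped in `SparkSession#time`, which prints the elapsed time of the enclosed block. A minimal sketch: no action is executed here, so all reported time is driver-side analysis, where `BinaryArithmetic#dataType` is called repeatedly (`withColumn` triggers analysis eagerly).

    ```scala
    // Measure driver-side analysis time for the expression tree built above;
    // spark.time prints the elapsed time of the enclosed block.
    import org.apache.spark.sql.functions.expr

    spark.time {
      df.withColumn("new_col_sum", expr(columns.mkString(" + ")))
    }
    ```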
    
    ### Why are the changes needed?
    
    Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Manual testing.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes #42804 from zzzzming95/SPARK-45071.
    
    Authored-by: zzzzming95 <505306...@qq.com>
    Signed-off-by: Yuming Wang <yumw...@ebay.com>
---
 .../apache/spark/sql/catalyst/expressions/arithmetic.scala   | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
index 2d9bccc0854..a556ac9f129 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
@@ -219,6 +219,12 @@ abstract class BinaryArithmetic extends BinaryOperator
 
   protected val evalMode: EvalMode.Value
 
+  private lazy val internalDataType: DataType = (left.dataType, right.dataType) match {
+    case (DecimalType.Fixed(p1, s1), DecimalType.Fixed(p2, s2)) =>
+      resultDecimalType(p1, s1, p2, s2)
+    case _ => left.dataType
+  }
+
   protected def failOnError: Boolean = evalMode match {
    // The TRY mode executes as if it would fail on errors, except that it would capture the errors
     // and return null results.
@@ -234,11 +240,7 @@ abstract class BinaryArithmetic extends BinaryOperator
     case _ => super.checkInputDataTypes()
   }
 
-  override def dataType: DataType = (left.dataType, right.dataType) match {
-    case (DecimalType.Fixed(p1, s1), DecimalType.Fixed(p2, s2)) =>
-      resultDecimalType(p1, s1, p2, s2)
-    case _ => left.dataType
-  }
+  override def dataType: DataType = internalDataType
 
  // When `spark.sql.decimalOperations.allowPrecisionLoss` is set to true, if the precision / scale
  // needed are out of the range of available values, the scale is reduced up to 6, in order to
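
The fix caches the computed type in a `lazy val`, so repeated calls to `dataType` during analysis evaluate the decimal match only once per operator instance instead of re-walking the child types on every call. The same caching pattern in isolation, as a standalone sketch with hypothetical `Expr`/`resultWidth` names (not Spark's actual classes):

```scala
// Sketch of the caching pattern: a plain `def` recomputes its result on
// every call, so a tree whose nodes query their children repeatedly does
// redundant work; a `lazy val` computes once per node and reuses it.
sealed trait Expr { def resultWidth: Int }

final case class Leaf(w: Int) extends Expr {
  def resultWidth: Int = w
}

final case class Add(left: Expr, right: Expr) extends Expr {
  // Cached on first access, mirroring the new internalDataType above.
  lazy val resultWidth: Int = math.max(left.resultWidth, right.resultWidth)
}

object Demo extends App {
  // A left-deep chain like "c1 + c2 + ... + c30" from the repro above.
  val chain = (1 to 30).map(Leaf(_)).reduceLeft[Expr](Add(_, _))
  // After the first access, each node answers in O(1) instead of
  // re-walking its subtree on every call.
  println(chain.resultWidth)
}
```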

