This is an automated email from the ASF dual-hosted git repository.

yumwang pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new 699e4e59515  [SPARK-45071][SQL] Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
699e4e59515 is described below

commit 699e4e59515864862bd86734b2eb51cc674e38be
Author: zzzzming95 <505306...@qq.com>
AuthorDate: Wed Sep 6 11:38:30 2023 +0800

[SPARK-45071][SQL] Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data

### What changes were proposed in this pull request?

Since `BinaryArithmetic#dataType` recursively computes the data type of every node in the expression tree, the driver becomes very slow when many columns are processed. For example, the following code:

```scala
import spark.implicits._
import scala.util.Random
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{expr, sum}
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}

val N = 30
val M = 100

val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString)
val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5))

val schema = StructType(columns.map(StructField(_, IntegerType)))
val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_)))
val df = spark.createDataFrame(rdd, schema)
val colExprs = columns.map(sum(_))

// generate a new column by adding the other 30 columns
df.withColumn("new_col_sum", expr(columns.mkString(" + ")))
```

This code takes the driver a few minutes to execute on Spark 3.4, but only a few seconds on Spark 3.2.

Related issue: [SPARK-39316](https://github.com/apache/spark/pull/36698)

### Why are the changes needed?

Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual testing.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42804 from zzzzming95/SPARK-45071.

Authored-by: zzzzming95 <505306...@qq.com>
Signed-off-by: Yuming Wang <yumw...@ebay.com>
(cherry picked from commit 16e813cecd55490a71ef6c05fca2209fbdae078f)
Signed-off-by: Yuming Wang <yumw...@ebay.com>
---
 .../apache/spark/sql/catalyst/expressions/arithmetic.scala | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
index 31d4d71cd40..06742ef917b 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
@@ -219,6 +219,12 @@ abstract class BinaryArithmetic extends BinaryOperator
 
   protected val evalMode: EvalMode.Value
 
+  private lazy val internalDataType: DataType = (left.dataType, right.dataType) match {
+    case (DecimalType.Fixed(p1, s1), DecimalType.Fixed(p2, s2)) =>
+      resultDecimalType(p1, s1, p2, s2)
+    case _ => left.dataType
+  }
+
   protected def failOnError: Boolean = evalMode match {
     // The TRY mode executes as if it would fail on errors, except that it would capture the errors
     // and return null results.
@@ -234,11 +240,7 @@ abstract class BinaryArithmetic extends BinaryOperator
     case _ => super.checkInputDataTypes()
   }
 
-  override def dataType: DataType = (left.dataType, right.dataType) match {
-    case (DecimalType.Fixed(p1, s1), DecimalType.Fixed(p2, s2)) =>
-      resultDecimalType(p1, s1, p2, s2)
-    case _ => left.dataType
-  }
+  override def dataType: DataType = internalDataType
 
   // When `spark.sql.decimalOperations.allowPrecisionLoss` is set to true, if the precision / scale
   // needed are out of the range of available values, the scale is reduced up to 6, in order to

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
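Aside: a minimal, self-contained sketch of why the `lazy val` caching in the patch above helps. The classes below (`Expr`, `Leaf`, `AddUncached`, `AddCached`, `LazyValDataTypeDemo`) are hypothetical stand-ins, not Spark's real expression classes; the point is only that recomputing a recursively derived type on every `dataType` call does quadratic work over a chain of N additions, while caching it in a `lazy val` does linear work.

```scala
object LazyValDataTypeDemo {

  // Counts how many "type combination" computations are performed.
  var combinations: Long = 0L

  sealed trait Expr { def dataType: String }

  final case class Leaf(tpe: String) extends Expr {
    def dataType: String = tpe
  }

  // Uncached: every call to dataType walks both subtrees again.
  final case class AddUncached(left: Expr, right: Expr) extends Expr {
    def dataType: String = {
      combinations += 1
      (left.dataType, right.dataType) match { case (l, _) => l } // simplified combination rule
    }
  }

  // Cached: the same computation, but evaluated at most once per node.
  final case class AddCached(left: Expr, right: Expr) extends Expr {
    private lazy val internalDataType: String = {
      combinations += 1
      (left.dataType, right.dataType) match { case (l, _) => l }
    }
    def dataType: String = internalDataType
  }

  // Build a left-deep chain c1 + c2 + ... + cN and return every Add node in it.
  def chain(n: Int, add: (Expr, Expr) => Expr): Seq[Expr] = {
    val nodes = scala.collection.mutable.ListBuffer.empty[Expr]
    var acc: Expr = Leaf("int")
    for (_ <- 1 to n) {
      acc = add(acc, Leaf("int"))
      nodes += acc
    }
    nodes.toList
  }

  def main(args: Array[String]): Unit = {
    val n = 1000

    // The analyzer asks every node for its type, not just the root.
    combinations = 0L
    chain(n, AddUncached.apply).foreach(_.dataType)
    println(s"uncached: $combinations combinations") // n * (n + 1) / 2 = 500500

    combinations = 0L
    chain(n, AddCached.apply).foreach(_.dataType)
    println(s"cached:   $combinations combinations") // n = 1000
  }
}
```

With n = 1000 the uncached tree performs 500500 type combinations while the cached one performs 1000, which mirrors the driver-side slowdown described in the commit message.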