[
https://issues.apache.org/jira/browse/SPARK-45071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun closed SPARK-45071.
---------------------------------
> Optimize the processing speed of `BinaryArithmetic#dataType` when processing
> multi-column data
> ----------------------------------------------------------------------------------------------
>
> Key: SPARK-45071
> URL: https://issues.apache.org/jira/browse/SPARK-45071
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0, 3.5.0
> Reporter: zzzzming95
> Assignee: zzzzming95
> Priority: Major
> Fix For: 3.4.2, 3.5.1, 4.0.0
>
>
> Since `BinaryArithmetic#dataType` recursively computes the data type of each
> child node, the driver becomes very slow when many columns are processed.
> For example, the following code:
> {code:java}
> import spark.implicits._
> import scala.util.Random
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.functions.{expr, sum}
> import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
>
> val N = 30
> val M = 100
> val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString)
> val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5))
> val schema = StructType(columns.map(StructField(_, IntegerType)))
> val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_)))
> val df = spark.createDataFrame(rdd, schema)
> val colExprs = columns.map(sum(_))
> // Generate a new column that adds up the other 30 columns.
> df.withColumn("new_col_sum", expr(columns.mkString(" + ")))
> {code}
>
> This code takes a few minutes to run on the driver in Spark 3.4, but only a
> few seconds in Spark 3.2.
> Related issue: SPARK-39316
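The slowdown pattern described above can be illustrated with a minimal sketch. This is not Spark's actual `BinaryArithmetic` implementation; it is a toy expression tree (names `Leaf`, `Add`, `callsFor` are invented for illustration) whose `dataType` re-derives both children's types on every call without caching. Because each parent consults a child's type more than once, the total number of `dataType` calls grows exponentially with the depth of a chained `c0 + c1 + ... + cN` expression, which is why wide multi-column sums hit the driver so hard:

```scala
// Toy model of un-cached recursive dataType resolution (NOT Spark internals).
object DataTypeRecursionDemo {
  var calls = 0 // counts every dataType invocation

  sealed trait Expr { def dataType: String }

  case class Leaf(name: String) extends Expr {
    def dataType: String = { calls += 1; "int" }
  }

  case class Add(left: Expr, right: Expr) extends Expr {
    // Checks both children, then re-derives the left type to return it:
    // the left subtree is walked twice per call, so work compounds with depth.
    def dataType: String = {
      calls += 1
      require(left.dataType == right.dataType)
      left.dataType
    }
  }

  // Builds a left-leaning chain c0 + c1 + ... + c(n-1).
  def chain(n: Int): Expr =
    (1 until n).foldLeft(Leaf("c0"): Expr)((acc, i) => Add(acc, Leaf("c" + i)))

  // Number of dataType calls triggered by one resolution of the root.
  def callsFor(n: Int): Int = { calls = 0; chain(n).dataType; calls }

  def main(args: Array[String]): Unit = {
    println(callsFor(5))  // 46 calls for 5 columns
    println(callsFor(10)) // 1534 calls for 10 columns
  }
}
```

Caching the resolved type per node (as the fix for this issue does in effect) collapses this to one call per node, i.e. linear in the number of columns.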
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]