zzzzming95 created SPARK-45071:
----------------------------------

             Summary: Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
                 Key: SPARK-45071
                 URL: https://issues.apache.org/jira/browse/SPARK-45071
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.4.0, 3.5.0
            Reporter: zzzzming95
Since `BinaryArithmetic#dataType` recursively recomputes the data type of every node in the expression tree, the driver becomes very slow when an expression references many columns. For example, the following code:

```
import spark.implicits._
import scala.util.Random
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{expr, sum}
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}

val N = 30
val M = 100

val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString)
val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5))

val schema = StructType(columns.map(StructField(_, IntegerType)))
val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_)))
val df = spark.createDataFrame(rdd, schema)
val colExprs = columns.map(sum(_))

// generate a new column that adds the other 30 columns together
df.withColumn("new_col_sum", expr(columns.mkString(" + ")))
```

The driver takes a few minutes to execute this code on Spark 3.4, but only a few seconds on Spark 3.2.

Related issue: SPARK-39316
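To illustrate the cost pattern, here is a minimal, standalone Scala sketch (not Spark's actual classes; `ToyExpr`, `Add`, and `AddCached` are hypothetical names) of how recomputing a node's type on every `dataType` call fans out into repeated traversals of the subtree, and how caching the result once per node (for example with a lazy val) removes the repeated recursion. This is only a sketch of the general idea, not the actual change made in Spark.

```
// Standalone sketch, not Spark code: a toy expression tree whose type
// resolution mimics the recursive shape of BinaryArithmetic#dataType.
sealed trait ToyExpr {
  def dataType: String
}

case class Leaf(tpe: String) extends ToyExpr {
  def dataType: String = tpe
}

// Uncached: every dataType call recurses into both children, and each call
// here touches the children's dataType more than once, so a long left-deep
// chain like "c1 + c2 + ... + c30" is re-resolved an exponential number of
// times as callers repeatedly ask nodes for their type.
case class Add(left: ToyExpr, right: ToyExpr) extends ToyExpr {
  def dataType: String = {
    require(left.dataType == right.dataType, "operand types must match")
    left.dataType
  }
}

// Cached: the result is computed once per node and reused, so repeated
// dataType calls during analysis stay cheap.
case class AddCached(left: ToyExpr, right: ToyExpr) extends ToyExpr {
  lazy val dataType: String = {
    require(left.dataType == right.dataType, "operand types must match")
    left.dataType
  }
}

// Build a left-deep chain of n integer additions, like the reproduction above.
def chain(n: Int, cached: Boolean): ToyExpr =
  (1 until n).foldLeft[ToyExpr](Leaf("int")) { (acc, _) =>
    if (cached) AddCached(acc, Leaf("int")) else Add(acc, Leaf("int"))
  }
```

With the uncached variant, a single `chain(30, cached = false).dataType` call already triggers on the order of 2^30 recursive calls, whereas the cached variant resolves each node once.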