Dear Spark users,
I'm experiencing an unusual issue with Spark 3.4.x.
When I create a new column as the sum of several existing columns, the time
taken almost doubles with each additional column in the sum. This operation
shouldn't require many resources, so I suspect there might be a problem in the
parsing/analysis engine.
This phenomenon does not occur in 3.3.x or earlier versions. I've attached a
simple example below.
// example code
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// 100 integer columns c1..c100 and a single row of 1s
val schema = StructType((1 to 100).map(x => StructField(s"c$x", IntegerType)))
val data = Row.fromSeq(Seq.fill(100)(1))
val df = spark.createDataFrame(spark.sparkContext.parallelize(Seq(data)), schema)

// add a new column that sums the first 31 columns
val sumCols = (1 to 31).map(x => s"c$x").toList
spark.time(df.withColumn("sumofcols", sumCols.map(col).reduce(_ + _)).count())
//=========
Time taken: 288213 ms
res13: Long = 1L
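For reference, something like the loop below (reusing the df defined above; the
column counts are arbitrary sample points I picked) can be used to watch how the
time grows as more columns are included in the sum:

// rough sketch to watch how the time grows with the number of summed columns,
// reusing the df defined in the example above; the column counts are arbitrary
for (n <- Seq(20, 24, 28, 31)) {
  val cols = (1 to n).map(x => col(s"c$x"))
  val start = System.nanoTime()
  df.withColumn("sumofcols", cols.reduce(_ + _)).count()
  println(s"$n columns: ${(System.nanoTime() - start) / 1e6} ms")
}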
With Spark 3.3.2, the timed line in the original example takes about 150 ms.
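My guess is that the time is spent while the query is being analyzed/planned
rather than during execution; timing explain() alone, which runs no job, should
show whether that is the case:

// timing plan generation only -- explain() does not execute a Spark job
spark.time {
  df.withColumn("sumofcols", sumCols.map(col).reduce(_ + _)).explain()
}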
Is this a known problem?
Regards,
Dukhyun Ko