Dear Spark users,
I'm experiencing an unusual issue with Spark 3.4.x.
When I create a new column as the sum of several existing columns, the time
taken almost doubles with each additional column in the sum. This operation
shouldn't require many resources, so I suspect there might be a problem in the
parsing/analysis engine.
This phenomenon does not occur in 3.3.x or earlier versions. I've attached a
simple example below.
// example code
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// 100 integer columns c1..c100 and a single row of 1s
val schema = StructType((1 to 100).map(x => StructField(s"c$x", IntegerType)))
val data = Row.fromSeq(Seq.fill(100)(1))
val df = spark.createDataFrame(spark.sparkContext.parallelize(Seq(data)), schema)

// add a new column that sums the first 31 columns
val sumCols = (1 to 31).map(x => s"c$x").toList
spark.time(df.withColumn("sumofcols", sumCols.map(col).reduce(_ + _)).count())
//=========
Time taken: 288213 ms
res13: Long = 1L
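For reference, something like the loop below (reusing the df defined above; the
column counts are arbitrary sample points I picked) can be used to watch how the
time grows as more columns are included in the sum:

// rough sketch to watch how the time grows with the number of summed columns,
// reusing the df defined in the example above; the column counts are arbitrary
for (n <- Seq(20, 24, 28, 31)) {
  val cols = (1 to n).map(x => col(s"c$x"))
  val start = System.nanoTime()
  df.withColumn("sumofcols", cols.reduce(_ + _)).count()
  println(s"$n columns: ${(System.nanoTime() - start) / 1e6} ms")
}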
With Spark 3.3.2, the timed line in the original example takes about 150 ms.
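My guess is that the time is spent while the query is being analyzed/planned
rather than during execution; timing explain() alone, which runs no job, should
show whether that is the case:

// timing plan generation only -- explain() does not execute a Spark job
spark.time {
  df.withColumn("sumofcols", sumCols.map(col).reduce(_ + _)).explain()
}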
Is this a known problem?
Regards,
Dukhyun Ko