Dear Spark users, I'm experiencing an unusual issue with Spark 3.4.x. When creating a new column as the sum of several existing columns, the time taken roughly doubles each time another column is added to the sum. This operation shouldn't require many resources, so I suspect there may be a problem in the parsing engine.
This phenomenon does not occur in 3.3.x or earlier. I've attached a simple example below.

//example code
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType((1 to 100).map(x => StructField(s"c$x", IntegerType)))
val data = Row.fromSeq(Seq.fill(100)(1))
val df = spark.createDataFrame(spark.sparkContext.parallelize(Seq(data)), schema)
val sumCols = (1 to 31).map(x => s"c$x").toList
spark.time { df.withColumn("sumofcols", sumCols.map(col).reduce(_ + _)).count() }

//=========
Time taken: 288213 ms
res13: Long = 1L

With Spark 3.3.2, the last line takes about 150 ms. Is this a known problem?

Regards,
Dukhyun Ko
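
P.S. For anyone who wants to reproduce the scaling, here is a rough sketch (assuming the same spark-shell session and the df defined above; the column counts are arbitrary) that times the same sum for a growing number of columns:

//scaling sweep
import org.apache.spark.sql.functions.col

for (n <- Seq(20, 22, 24, 26, 28)) {
  // Build c1 + c2 + ... + cn as a left-nested chain of Add expressions
  // (reduce on a Seq folds from the left), then time the job end to end.
  val sum = (1 to n).map(x => col(s"c$x")).reduce(_ + _)
  print(s"n=$n: ")
  spark.time(df.withColumn("sumofcols", sum).count())
}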