Hi all,
I'm experiencing a serious performance problem when using withColumn on a dataset with a large number of columns. It is very slow: on a dataset with 100 columns, it takes a few seconds. The code snippet below demonstrates the problem; it prints the time in milliseconds taken by each withColumn call.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    val custs = Seq(
      Row(1, "Bob", 21, 80.5),
      Row(2, "Bobby", 21, 80.5),
      Row(3, "Jean", 21, 80.5),
      Row(4, "Fatime", 21, 80.5)
    )

    // Field order must match the Row values: id, b (string), a (integer), target.
    val fields = List(
      StructField("id", IntegerType, true),
      StructField("b", StringType, true),
      StructField("a", IntegerType, true),
      StructField("target", DoubleType, false))
    val schema = StructType(fields)

    // sc and sqlContext are the ones provided by spark-shell.
    val rdd = sc.parallelize(custs)
    var df = sqlContext.createDataFrame(rdd, schema)

    // Add 200 columns one at a time, timing each withColumn call.
    for (i <- 1 to 200) {
      val now = System.currentTimeMillis
      df = df.withColumn("a_new_col_" + i, df("a") + i)
      println(s"$i -> " + (System.currentTimeMillis - now))
    }

    df.show()