Hi all,

I'm experiencing a serious performance problem when using withColumn on a
dataset with a large number of columns. It is very slow: on a dataset with
100 columns it takes a few seconds.


The following code snippet demonstrates the problem.


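import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Sample rows; the values follow the schema order below: id, b, a, target
val custs = Seq(
  Row(1, "Bob", 21, 80.5),
  Row(2, "Bobby", 21, 80.5),
  Row(3, "Jean", 21, 80.5),
  Row(4, "Fatime", 21, 80.5)
)

// "a" has to be the integer column, because the loop computes df("a") + i
val fields = List(
  StructField("id", IntegerType, true),
  StructField("b", StringType, true),
  StructField("a", IntegerType, true),
  StructField("target", DoubleType, false))
val schema = StructType(fields)

val rdd = sc.parallelize(custs)
var df = sqlContext.createDataFrame(rdd, schema)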

// Add 200 columns one at a time and print how long each withColumn call takes (ms)
for (i <- 1 to 200) {
  val now = System.currentTimeMillis
  df = df.withColumn("a_new_col_" + i, df("a") + i)
  println(s"$i -> " + (System.currentTimeMillis - now))
}

df.show()
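
For comparison, here is a rough sketch of a variant that adds all 200 columns in a single select instead of 200 separate withColumn calls. It assumes `df` is still the original 4-column DataFrame, so it would run in place of the loop, not after it. The names `newCols` and `wide` are only illustrative, and I have not measured whether this avoids the slowdown.

```scala
import org.apache.spark.sql.functions.col

// Assumption: df is the 4-column DataFrame created by createDataFrame above,
// before any withColumn calls have been applied to it.

// Build the 200 derived column expressions up front.
val newCols = (1 to 200).map(i => (col("a") + i).as("a_new_col_" + i))

// Keep the existing columns and append all new ones in one projection.
val now = System.currentTimeMillis
val wide = df.select(df.columns.map(col) ++ newCols: _*)
println("single select -> " + (System.currentTimeMillis - now))

wide.show()
```

If the single select is fast while the loop gets slower with every iteration, that would suggest the cost comes from the repeated withColumn calls and not from the number of columns as such.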
