We improved this in Spark 1.4. Adding 100 columns took 4s on my laptop. https://issues.apache.org/jira/browse/SPARK-7276
Still not the fastest, but much faster:

scala> var df = Seq((1, 2)).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: int, b: int]

scala> val start = System.nanoTime
start: Long = 1433274299441224000

scala> for (i <- 1 to 100) {
     |   df = df.withColumn("n" + i, org.apache.spark.sql.functions.lit(0))
     | }

scala> val end = System.nanoTime
end: Long = 1433274303250091000

scala> println((end - start) / 1000 / 1000 / 1000)
3

On Tue, Jun 2, 2015 at 12:34 PM, zsampson <zsamp...@palantir.com> wrote:
> Hey,
>
> I'm seeing extreme slowness in withColumn when it's used in a loop. I'm
> running this code:
>
> for (int i = 0; i < NUM_ITERATIONS; ++i) {
>     df = df.withColumn("col" + i, new Column(new Literal(i, DataTypes.IntegerType)));
> }
>
> where df is initially a trivial DataFrame. Here are the results of running
> with different values of NUM_ITERATIONS:
>
> iterations   time
> 25             3s
> 50            11s
> 75            31s
> 100           76s
> 125          159s
> 150          283s
>
> When I update the DataFrame by manually copying/appending to the column
> array and using DataFrame.select, it runs in about half the time, but this
> is still untenable at any significant number of iterations.
>
> Any insight?
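
In case it helps on older versions: the select-based workaround mentioned above can be written as a single select over the old and new columns together, something like this (untested sketch; the column names and the count of 100 just mirror the benchmark above):

import org.apache.spark.sql.functions.{col, lit}

var df = Seq((1, 2)).toDF("a", "b")
// Build every new column up front...
val newCols = (1 to 100).map(i => lit(0).as("n" + i))
// ...then append them to the existing columns in one select,
// instead of growing the plan one withColumn at a time.
df = df.select(df.columns.map(col) ++ newCols: _*)

Since there is only one select, the plan gets analyzed once rather than once per added column, which is where the per-iteration cost in the loop version comes from.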