Would it be valuable to create a .withColumns([colName], [ColumnObject]) method that adds columns in bulk rather than one at a time?
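As a rough sketch of what I mean (the name and signature are hypothetical, not an existing Spark API; this version is just a helper built on a single select):

    import org.apache.spark.sql.{Column, DataFrame}
    import org.apache.spark.sql.functions.col

    // Hypothetical bulk variant of withColumn -- not part of Spark.
    // It projects once instead of rebuilding the plan per column.
    def withColumns(df: DataFrame, names: Seq[String], cols: Seq[Column]): DataFrame = {
      require(names.length == cols.length, "one name per new column")
      val existing = df.columns.map(col(_))                        // keep current columns
      val added = names.zip(cols).map { case (n, c) => c.as(n) }   // alias the new ones
      df.select(existing ++ added: _*)
    }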
Alternatively, effort might be better spent making the singular .withColumn() faster.

On Tue, Jun 2, 2015 at 3:46 PM, Reynold Xin <r...@databricks.com> wrote:

> We improved this in 1.4. Adding 100 columns took 4s on my laptop.
> https://issues.apache.org/jira/browse/SPARK-7276
>
> Still not the fastest, but much faster.
>
> scala> var df = Seq((1, 2)).toDF("a", "b")
> df: org.apache.spark.sql.DataFrame = [a: int, b: int]
>
> scala> val start = System.nanoTime
> start: Long = 1433274299441224000
>
> scala> for (i <- 1 to 100) {
>      |   df = df.withColumn("n" + i, org.apache.spark.sql.functions.lit(0))
>      | }
>
> scala> val end = System.nanoTime
> end: Long = 1433274303250091000
>
> scala> println((end - start) / 1000 / 1000 / 1000)
> 3
>
> On Tue, Jun 2, 2015 at 12:34 PM, zsampson <zsamp...@palantir.com> wrote:
>
>> Hey,
>>
>> I'm seeing extreme slowness in withColumn when it's used in a loop. I'm
>> running this code:
>>
>> for (int i = 0; i < NUM_ITERATIONS; ++i) {
>>   df = df.withColumn("col" + i,
>>       new Column(new Literal(i, DataTypes.IntegerType)));
>> }
>>
>> where df is initially a trivial DataFrame. Here are the results of
>> running with different values of NUM_ITERATIONS:
>>
>> iterations  time
>> 25            3s
>> 50           11s
>> 75           31s
>> 100          76s
>> 125         159s
>> 150         283s
>>
>> When I update the DataFrame by manually copying/appending to the column
>> array and using DataFrame.select, it runs in about half the time, but
>> this is still untenable at any significant number of iterations.
>>
>> Any insight?
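For reference, the single-select workaround zsampson describes is the same single-projection idea the helper above wraps. Inline, it might look roughly like this (a sketch assuming the 1.4-era DataFrame API; df and numIterations stand in for his setup):

    import org.apache.spark.sql.functions.{col, lit}

    // Build every new column up front, then project once with select,
    // instead of growing the plan with withColumn in a loop.
    // df is assumed to be an existing DataFrame.
    val numIterations = 100
    val newCols = (0 until numIterations).map(i => lit(i).as("col" + i))
    val result = df.select(df.columns.map(col(_)) ++ newCols: _*)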