Hi, I have been experimenting with unioning a large number of DataFrames and found that the time taken by the union calls themselves (not by a subsequent action) grows worse than O(N^2). Since a union basically just defines lineage (i.e. the current DataFrame plus the other one as a child), it should be almost instantaneous; in practice, however, it can be very costly.
I was wondering why this is and if there is a way to fix this. A sample test:

    def testUnion(n: Int): Long = {
      val dataframes = for {
        x <- 0 until n
      } yield spark.range(1000)

      val t0 = System.currentTimeMillis()
      val allDF = dataframes.reduceLeft(_.union(_))
      val t1 = System.currentTimeMillis()
      val totalTime = t1 - t0
      println(s"$totalTime miliseconds")
      totalTime
    }

    scala> testUnion(100)
    193 miliseconds
    res5: Long = 193

    scala> testUnion(200)
    759 miliseconds
    res1: Long = 759

    scala> testUnion(500)
    4438 miliseconds
    res2: Long = 4438

    scala> testUnion(1000)
    18441 miliseconds
    res6: Long = 18441

    scala> testUnion(2000)
    88498 miliseconds
    res7: Long = 88498

    scala> testUnion(5000)
    822305 miliseconds
    res8: Long = 822305
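
To back up the "worse than O(N^2)" reading of these numbers, here is a minimal sketch that estimates the local growth exponent between consecutive measurements, assuming the time behaves roughly like c * n^k over each interval. The names UnionGrowth and localExponents are made up for this illustration; the code is plain Scala and does not need Spark.

```scala
object UnionGrowth {
  // (n, milliseconds) pairs copied from the testUnion timings reported above.
  val timings: Seq[(Int, Long)] = Seq(
    100 -> 193L, 200 -> 759L, 500 -> 4438L,
    1000 -> 18441L, 2000 -> 88498L, 5000 -> 822305L
  )

  // Local exponent k between consecutive measurements,
  // assuming time ~ c * n^k, so k = log(t2 / t1) / log(n2 / n1).
  def localExponents(data: Seq[(Int, Long)]): Seq[Double] =
    data.zip(data.tail).map { case ((n1, t1), (n2, t2)) =>
      math.log(t2.toDouble / t1) / math.log(n2.toDouble / n1)
    }

  def main(args: Array[String]): Unit =
    timings.tail.zip(localExponents(timings)).foreach { case ((n, _), k) =>
      println(f"up to n=$n%5d: exponent ~ $k%.2f")
    }
}
```

On these timings the estimated exponent sits slightly below 2 for the first two intervals and is above 2 for every interval past n = 500, which is what I mean by worse than quadratic. This is only a rough estimate from single runs, so JVM warm-up and GC noise are not accounted for.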