Hi,

I have been playing around with doing union between a large number of 
dataframes and saw that the performance of the actual union (not the action) is 
worse than O(N^2). Since a union basically defines a lineage (i.e. current + 
union with of other as a child) this should be almost instantaneous, however in 
practice this can be very costly.

I was wondering why this is and if there is a way to fix this.

A sample test:
def testUnion(n: Int): Long = {
  val dataframes = for {
    x <- 0 until n
  } yield spark.range(1000)

  val t0 = System.currentTimeMillis()
  val allDF = dataframes.reduceLeft(_.union(_))
  val t1 = System.currentTimeMillis()
  val totalTime = t1 - t0
  println(s"$totalTime miliseconds")
  totalTime
}

scala> testUnion(100)
193 miliseconds
res5: Long = 193

scala> testUnion(200)
759 miliseconds
res1: Long = 759

scala> testUnion(500)
4438 miliseconds
res2: Long = 4438

scala> testUnion(1000)
18441 miliseconds
res6: Long = 18441

scala> testUnion(2000)
88498 miliseconds
res7: Long = 88498

scala> testUnion(5000)
822305 miliseconds
res8: Long = 822305






--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/repeated-unioning-of-dataframes-take-worse-than-O-N-2-time-tp20394.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Reply via email to