Iterative union like this creates a deeply nested recursive structure,
similar to the one described here: http://stackoverflow.com/q/34461804
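
To make the nesting concrete, here is a toy sketch in plain Scala. The
Plan, Leaf and Union classes are hypothetical stand-ins, not Spark's
actual logical plan classes, and balancedReduce is an illustrative
helper that I have not benchmarked against Spark. It only compares the
depth of the tree produced by reduceLeft with a pairwise reduction.

```scala
object UnionDepthDemo {
  // Toy stand-in for a query plan; NOT Spark's real classes.
  sealed trait Plan
  final case class Leaf(id: Int) extends Plan
  final case class Union(left: Plan, right: Plan) extends Plan

  // Number of levels in the plan tree (a single leaf has depth 1).
  def depth(p: Plan): Int = p match {
    case Leaf(_)     => 1
    case Union(l, r) => 1 + math.max(depth(l), depth(r))
  }

  // Pairwise reduction: combine neighbours round by round until one is left.
  def balancedReduce[A](xs: Seq[A])(f: (A, A) => A): A = {
    require(xs.nonEmpty, "cannot reduce an empty sequence")
    if (xs.size == 1) xs.head
    else {
      val next = xs.grouped(2).map {
        case Seq(a, b) => f(a, b)
        case Seq(a)    => a // odd element is carried to the next round
      }.toVector
      balancedReduce(next)(f)
    }
  }

  val leaves: Seq[Plan] = (0 until 100).map(Leaf(_))

  // Same shape as dataframes.reduceLeft(_.union(_)): a left-deep chain.
  val chained: Plan = leaves.reduceLeft(Union(_, _))

  // Pairwise combination keeps the tree shallow.
  val balanced: Plan = balancedReduce(leaves)(Union(_, _))

  def main(args: Array[String]): Unit = {
    println(depth(chained))  // 100
    println(depth(balanced)) // 8
  }
}
```

The chained tree grows one level per input, while the pairwise tree
grows roughly with log2(N). Whether applying the same pairwise pattern
to real DataFrames changes the timings is something I have not measured.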

You can try something like this: http://stackoverflow.com/a/37612978 but
there is of course an overhead of conversion between Dataset and RDD.
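
A minimal sketch of an RDD-level union of that kind follows. The object
and method names (RddUnion, unionViaRdd) are hypothetical, it is not
necessarily the exact code from the linked answer, and it assumes every
input has the same schema as the first one.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RddUnion {
  // Union many DataFrames in one step at the RDD level, then rebuild a
  // DataFrame, instead of chaining N Dataset unions.
  // Assumption: all inputs share the schema of dfs.head.
  def unionViaRdd(spark: SparkSession, dfs: Seq[DataFrame]): DataFrame = {
    require(dfs.nonEmpty, "need at least one DataFrame")
    // DataFrame -> RDD[Row]; this conversion is the overhead mentioned above.
    val rows = spark.sparkContext.union(dfs.map(_.rdd))
    spark.createDataFrame(rows, dfs.head.schema)
  }
}
```

With the test from the original message this could be called as
RddUnion.unionViaRdd(spark, (0 until n).map(_ => spark.range(1000).toDF())).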


On 12/29/2016 06:21 PM, assaf.mendelson wrote:
> Hi,
>
> I have been playing around with doing union between a large number of
> dataframes and saw that the performance of the actual union (not the
> action) is worse than O(N^2). Since a union basically defines a
> lineage (i.e. current + union with the other as a child) this should be
> almost instantaneous; however, in practice this can be very costly.
>
> I was wondering why this is and if there is a way to fix this.
>
> A sample test:
>
> def testUnion(n: Int): Long = {
>   // Build n small DataFrames (assumes a `spark` session, as in spark-shell).
>   val dataframes = for {
>     x <- 0 until n
>   } yield spark.range(1000)
>
>   // Time only the construction of the union; no action is run.
>   val t0 = System.currentTimeMillis()
>   val allDF = dataframes.reduceLeft(_.union(_))
>   val t1 = System.currentTimeMillis()
>   val totalTime = t1 - t0
>   println(s"$totalTime miliseconds")
>   totalTime
> }
>
> scala> testUnion(100)
> 193 miliseconds
> res5: Long = 193
>
> scala> testUnion(200)
> 759 miliseconds
> res1: Long = 759
>
> scala> testUnion(500)
> 4438 miliseconds
> res2: Long = 4438
>
> scala> testUnion(1000)
> 18441 miliseconds
> res6: Long = 18441
>
> scala> testUnion(2000)
> 88498 miliseconds
> res7: Long = 88498
>
> scala> testUnion(5000)
> 822305 miliseconds
> res8: Long = 822305

-- 
Maciej Szymkiewicz
