Hi, I understand that doing a union creates a nested structures, however why isn’t it O(N)? If I look at the code it seems this should be a tree merge of two plans, that should occur at O(N) not O(N^2). Even when running the plan that should be O(N*LOG(N)) instead of O(N^2) or worse. Assaf.
From: Maciej Szymkiewicz [via Apache Spark Developers List] [mailto:ml-node+s1001551n20395...@n3.nabble.com] Sent: Thursday, December 29, 2016 7:39 PM To: Mendelson, Assaf Subject: Re: repeated unioning of dataframes take worse than O(N^2) time Iterative union like this creates a deeply nested recursive structure in a similar manner to described here http://stackoverflow.com/q/34461804 You can try something like this http://stackoverflow.com/a/37612978 but there is of course on overhead of conversion between Dataset and RDD. On 12/29/2016 06:21 PM, assaf.mendelson wrote: Hi, I have been playing around with doing union between a large number of dataframes and saw that the performance of the actual union (not the action) is worse than O(N^2). Since a union basically defines a lineage (i.e. current + union with of other as a child) this should be almost instantaneous, however in practice this can be very costly. I was wondering why this is and if there is a way to fix this. A sample test: def testUnion(n: Int): Long = { val dataframes = for { x <- 0 until n } yield spark.range(1000) val t0 = System.currentTimeMillis() val allDF = dataframes.reduceLeft(_.union(_)) val t1 = System.currentTimeMillis() val totalTime = t1 - t0 println(s"$totalTime miliseconds") totalTime } scala> testUnion(100) 193 miliseconds res5: Long = 193 scala> testUnion(200) 759 miliseconds res1: Long = 759 scala> testUnion(500) 4438 miliseconds res2: Long = 4438 scala> testUnion(1000) 18441 miliseconds res6: Long = 18441 scala> testUnion(2000) 88498 miliseconds res7: Long = 88498 scala> testUnion(5000) 822305 miliseconds res8: Long = 822305 ________________________________ View this message in context: repeated unioning of dataframes take worse than O(N^2) time<http://apache-spark-developers-list.1001551.n3.nabble.com/repeated-unioning-of-dataframes-take-worse-than-O-N-2-time-tp20394.html> Sent from the Apache Spark Developers List mailing list archive<http://apache-spark-developers-list.1001551.n3.nabble.com/> at Nabble.com. -- Maciej Szymkiewicz ________________________________ If you reply to this email, your message will be added to the discussion below: http://apache-spark-developers-list.1001551.n3.nabble.com/repeated-unioning-of-dataframes-take-worse-than-O-N-2-time-tp20394p20395.html To start a new topic under Apache Spark Developers List, email ml-node+s1001551n1...@n3.nabble.com<mailto:ml-node+s1001551n1...@n3.nabble.com> To unsubscribe from Apache Spark Developers List, click here<http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=1&code=YXNzYWYubWVuZGVsc29uQHJzYS5jb218MXwtMTI4OTkxNTg1Mg==>. NAML<http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml> -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/repeated-unioning-of-dataframes-take-worse-than-O-N-2-time-tp20394p20403.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.