RE: repeated unioning of dataframes take worse than O(N^2) time

assaf.mendelson Thu, 29 Dec 2016 11:20:03 -0800

Hi,

I understand that doing a union creates a nested structures, however why isn’t 
it O(N)? If I look at the code it seems this should be a tree merge of two 
plans, that should occur at O(N) not O(N^2).
Even when running the plan that should be O(N*LOG(N)) instead of O(N^2) or 
worse.
Assaf.

From: Maciej Szymkiewicz [via Apache Spark Developers List] 
[mailto:ml-node+s1001551n20395...@n3.nabble.com]
Sent: Thursday, December 29, 2016 7:39 PM
To: Mendelson, Assaf
Subject: Re: repeated unioning of dataframes take worse than O(N^2) time

Iterative union like this creates a deeply nested recursive structure in a 
similar manner to described here http://stackoverflow.com/q/34461804

You can try something like this http://stackoverflow.com/a/37612978 but there 
is of course on overhead of conversion between Dataset and RDD.

On 12/29/2016 06:21 PM, assaf.mendelson wrote:
Hi,

I have been playing around with doing union between a large number of 
dataframes and saw that the performance of the actual union (not the action) is 
worse than O(N^2). Since a union basically defines a lineage (i.e. current + 
union with of other as a child) this should be almost instantaneous, however in 
practice this can be very costly.

I was wondering why this is and if there is a way to fix this.

A sample test:
def testUnion(n: Int): Long = {
  val dataframes = for {
    x <- 0 until n
  } yield spark.range(1000)

  val t0 = System.currentTimeMillis()
  val allDF = dataframes.reduceLeft(_.union(_))
  val t1 = System.currentTimeMillis()
  val totalTime = t1 - t0
  println(s"$totalTime miliseconds")
  totalTime
}

scala> testUnion(100)
193 miliseconds
res5: Long = 193

scala> testUnion(200)
759 miliseconds
res1: Long = 759

scala> testUnion(500)
4438 miliseconds
res2: Long = 4438

scala> testUnion(1000)
18441 miliseconds
res6: Long = 18441

scala> testUnion(2000)
88498 miliseconds
res7: Long = 88498

scala> testUnion(5000)
822305 miliseconds
res8: Long = 822305

________________________________
View this message in context: repeated unioning of dataframes take worse than 
O(N^2) 
time<http://apache-spark-developers-list.1001551.n3.nabble.com/repeated-unioning-of-dataframes-take-worse-than-O-N-2-time-tp20394.html>
Sent from the Apache Spark Developers List mailing list 
archive<http://apache-spark-developers-list.1001551.n3.nabble.com/> at 
Nabble.com.

--

Maciej Szymkiewicz

________________________________
If you reply to this email, your message will be added to the discussion below:
http://apache-spark-developers-list.1001551.n3.nabble.com/repeated-unioning-of-dataframes-take-worse-than-O-N-2-time-tp20394p20395.html
To start a new topic under Apache Spark Developers List, email 
ml-node+s1001551n1...@n3.nabble.com<mailto:ml-node+s1001551n1...@n3.nabble.com>
To unsubscribe from Apache Spark Developers List, click 
here<http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=1&code=YXNzYWYubWVuZGVsc29uQHJzYS5jb218MXwtMTI4OTkxNTg1Mg==>.
NAML<http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>

--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/repeated-unioning-of-dataframes-take-worse-than-O-N-2-time-tp20394p20403.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

RE: repeated unioning of dataframes take worse than O(N^2) time

Reply via email to