Re: repeated unioning of dataframes take worse than O(N^2) time

2016-12-30 Thread Liang-Chi Hsieh

Actually, since you are using Dataset's union API, which, unlike RDD's union
API, flattens the nested structure into a single Union with many children,
the nesting itself should not be the issue.

The additional time introduced as the number of dataframes grows is spent in
the analysis stage. My take is that because the Union has a long list of
children, the analyzer needs more time to traverse the tree.

When the dataset Union(Range1, Range2) is created, the Analyzer needs to go
through 2 Range(s). When the next union happens, i.e., Union(Range1, Range2,
Range3), the Analyzer needs to go through 3 Range(s), including the first 2
Range(s) again. Those two Range plans were already analyzed, but the Analyzer
still goes through them.

If a Union of 5 Range logical plans is built up this way, the Analyzer goes through:

2 + 3 + 4 + 5 = 14 Range(s) under the Union

When you increase the number of Range plans to 10, it becomes:

2 + 3 + 4 + 5 + ... + 10 = 54 Range(s)

So for a Union of 100 Range plans, 5049 Range(s) need to be traversed; for 200
Range plans, it becomes 20099.

You can see it is not a linear relation.
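
For reference, here is a tiny sketch of that count (the helper is purely
illustrative, not a Spark API). Unioning n Range plans one at a time makes the
Analyzer walk 2 + 3 + ... + n = n(n + 1)/2 - 1 Range leaves in total, i.e.
quadratically many:

// Total number of Range leaf plans the Analyzer traverses when n ranges
// are unioned one at a time.
def rangesTraversed(n: Int): Long = n.toLong * (n + 1) / 2 - 1

rangesTraversed(5)    // 14
rangesTraversed(10)   // 54
rangesTraversed(100)  // 5049
rangesTraversed(200)  // 20099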





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 



RE: repeated unioning of dataframes take worse than O(N^2) time

2016-12-29 Thread assaf.mendelson
Hi,

I understand that doing a union creates a nested structure; however, why isn't
it O(N)? If I look at the code, it seems this should be a merge of two plan
trees, which should cost O(N), not O(N^2). Even running the plan should then be
O(N*log(N)) instead of O(N^2) or worse.
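
Below is a rough, untested sketch of such a pairwise merge over the same
dataframes (the helper name is my own, and it assumes a non-empty,
schema-compatible sequence); if the cost really comes from re-analyzing an ever
longer plan, this should behave closer to O(N*log(N)) than O(N^2):

import org.apache.spark.sql.DataFrame

// Merge the dataframes pairwise (a balanced tree) instead of folding left,
// so each leaf plan is re-analyzed only about log2(N) times rather than up to N times.
def unionBalanced(dfs: Seq[DataFrame]): DataFrame =
  if (dfs.length <= 1) dfs.head
  else unionBalanced(dfs.grouped(2).map {
    case Seq(a, b) => a.union(b)
    case Seq(a)    => a
  }.toSeq)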
Assaf.

From: Maciej Szymkiewicz [via Apache Spark Developers List] 
[mailto:ml-node+s1001551n20395...@n3.nabble.com]
Sent: Thursday, December 29, 2016 7:39 PM
To: Mendelson, Assaf
Subject: Re: repeated unioning of dataframes take worse than O(N^2) time


Iterative union like this creates a deeply nested recursive structure, in a
similar manner to the one described here: http://stackoverflow.com/q/34461804

You can try something like this: http://stackoverflow.com/a/37612978 but there
is of course an overhead of conversion between Dataset and RDD.


On 12/29/2016 06:21 PM, assaf.mendelson wrote:
Hi,

I have been playing around with doing union between a large number of 
dataframes and saw that the performance of the actual union (not the action) is 
worse than O(N^2). Since a union basically defines a lineage (i.e. the current
plan plus the other as a child), this should be almost instantaneous; however, in
practice it can be very costly.

I was wondering why this is and if there is a way to fix this.

A sample test:
def testUnion(n: Int): Long = {
  val dataframes = for {
    x <- 0 until n
  } yield spark.range(1000)

  val t0 = System.currentTimeMillis()
  val allDF = dataframes.reduceLeft(_.union(_))
  val t1 = System.currentTimeMillis()
  val totalTime = t1 - t0
  println(s"$totalTime milliseconds")
  totalTime
}

scala> testUnion(100)
193 milliseconds
res5: Long = 193

scala> testUnion(200)
759 milliseconds
res1: Long = 759

scala> testUnion(500)
4438 milliseconds
res2: Long = 4438

scala> testUnion(1000)
18441 milliseconds
res6: Long = 18441

scala> testUnion(2000)
88498 milliseconds
res7: Long = 88498

scala> testUnion(5000)
822305 milliseconds
res8: Long = 822305







--

Maciej Szymkiewicz







Re: repeated unioning of dataframes take worse than O(N^2) time

2016-12-29 Thread Sean Owen
Don't do that. Union them all at once with SparkContext.union

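A minimal sketch of that, assuming the same spark session and spark.range(1000)
datasets as in the quoted test (the helper name is made up; only
SparkContext.union itself is the real API):

// Build all the RDDs up front and union them in a single SparkContext.union
// call, so there is no chain of N-1 binary unions for Catalyst to re-analyze.
def testRddUnion(n: Int): org.apache.spark.rdd.RDD[java.lang.Long] =
  spark.sparkContext.union((0 until n).map(_ => spark.range(1000).rdd))
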
On Thu, Dec 29, 2016, 17:21 assaf.mendelson  wrote:

> Hi,
>
>
>
> I have been playing around with doing union between a large number of
> dataframes and saw that the performance of the actual union (not the
> action) is worse than O(N^2). Since a union basically defines a lineage
> (i.e. the current plan plus the other as a child), this should be almost
> instantaneous; however, in practice it can be very costly.
>
>
>
> I was wondering why this is and if there is a way to fix this.
>
>
>
> A sample test:
>
> def testUnion(n: Int): Long = {
>   val dataframes = for {
>     x <- 0 until n
>   } yield spark.range(1000)
>
>   val t0 = System.currentTimeMillis()
>   val allDF = dataframes.reduceLeft(_.union(_))
>   val t1 = System.currentTimeMillis()
>   val totalTime = t1 - t0
>   println(s"$totalTime milliseconds")
>   totalTime
> }
>
>
>
> scala> testUnion(100)
>
> 193 milliseconds
>
> res5: Long = 193
>
>
>
> scala> testUnion(200)
>
> 759 milliseconds
>
> res1: Long = 759
>
>
>
> scala> testUnion(500)
>
> 4438 milliseconds
>
> res2: Long = 4438
>
>
>
> scala> testUnion(1000)
>
> 18441 milliseconds
>
> res6: Long = 18441
>
>
>
> scala> testUnion(2000)
>
> 88498 milliseconds
>
> res7: Long = 88498
>
>
>
> scala> testUnion(5000)
>
> 822305 milliseconds
>
> res8: Long = 822305
>
>
>
>
>


Re: repeated unioning of dataframes take worse than O(N^2) time

2016-12-29 Thread Maciej Szymkiewicz
Iterative union like this creates a deeply nested recursive structure, in
a similar manner to the one described here: http://stackoverflow.com/q/34461804

You can try something like this: http://stackoverflow.com/a/37612978 but
there is of course an overhead of conversion between Dataset and RDD.
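
A rough sketch along those lines, assuming a SparkSession named spark and a
non-empty sequence of dataframes that all share the same schema (the helper
name is mine):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Drop to RDDs, union them in one SparkContext.union call, and rebuild a
// DataFrame from the first schema; the Dataset <-> RDD round trip is the
// overhead mentioned above.
def unionViaRDD(spark: SparkSession, dfs: Seq[DataFrame]): DataFrame =
  spark.createDataFrame(spark.sparkContext.union(dfs.map(_.rdd)), dfs.head.schema)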


On 12/29/2016 06:21 PM, assaf.mendelson wrote:
>
> Hi,
>
>  
>
> I have been playing around with doing union between a large number of
> dataframes and saw that the performance of the actual union (not the
> action) is worse than O(N^2). Since a union basically defines a
> lineage (i.e. the current plan plus the other as a child), this should be
> almost instantaneous; however, in practice it can be very costly.
>
>  
>
> I was wondering why this is and if there is a way to fix this.
>
>  
>
> A sample test:
>
> def testUnion(n: Int): Long = {
>   val dataframes = for {
>     x <- 0 until n
>   } yield spark.range(1000)
>
>   val t0 = System.currentTimeMillis()
>   val allDF = dataframes.reduceLeft(_.union(_))
>   val t1 = System.currentTimeMillis()
>   val totalTime = t1 - t0
>   println(s"$totalTime milliseconds")
>   totalTime
> }
>
>  
>
> scala> testUnion(100)
>
> 193 milliseconds
>
> res5: Long = 193
>
>  
>
> scala> testUnion(200)
>
> 759 milliseconds
>
> res1: Long = 759
>
>  
>
> scala> testUnion(500)
>
> 4438 milliseconds
>
> res2: Long = 4438
>
>  
>
> scala> testUnion(1000)
>
> 18441 milliseconds
>
> res6: Long = 18441
>
>  
>
> scala> testUnion(2000)
>
> 88498 milliseconds
>
> res7: Long = 88498
>
>  
>
> scala> testUnion(5000)
>
> 822305 milliseconds
>
> res8: Long = 822305
>
>  
>
>  
>

-- 
Maciej Szymkiewicz