Re: Why these operations are slower than the equivalent on Hadoop?

2014-04-16 Thread Eugen Cepoi
Yes, the second example does that. It transforms all the points of a partition into a single element, the skyline, so reduce will run on the skylines of two partitions rather than on individual points.
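A minimal sketch of that pattern, assuming data: RDD[Array[Double]] of 2-D points as in the original post; the quadratic skyline helper below is hypothetical (the thread only uses the name), and it treats smaller as better in both coordinates:

  // Hypothetical helper: keep every point that no other point dominates.
  // Dominance here means <= in both coordinates and < in at least one.
  def skyline(points: Seq[Array[Double]]): Seq[Array[Double]] =
    points.filter { p =>
      !points.exists(q =>
        q(0) <= p(0) && q(1) <= p(1) && (q(0) < p(0) || q(1) < p(1)))
    }

  // Collapse each partition into a single element: its local skyline.
  val localSkylines = data.mapPartitions(points => Iterator(skyline(points.toVector)))

  // reduce now merges one skyline per partition, not individual points.
  val result = localSkylines.reduce((left, right) => skyline(left ++ right))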

Why these operations are slower than the equivalent on Hadoop?

2014-04-15 Thread Yanzhe Chen
Hi all, Following up on a previous thread, I am asking how to implement a divide-and-conquer algorithm (skyline) in Spark. Here is my current solution: val data = sc.textFile(…).map(line => line.split(",").map(_.toDouble)) val result = data.mapPartitions(points =>
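The snippet is cut off by the archive. For the per-partition step it sets up, a common local kernel for 2-D skylines (a sketch, not from the thread; assumes smaller is better in both coordinates, and drops exact duplicates) sorts by x and sweeps, keeping a point only if its y beats every y seen so far:

  // Hypothetical O(n log n) local kernel: sort by (x, y), then one pass.
  def skylineSweep(points: Seq[Array[Double]]): Seq[Array[Double]] = {
    var bestY = Double.PositiveInfinity
    points.sortBy(p => (p(0), p(1))).filter { p =>
      val keep = p(1) < bestY   // survives iff strictly below all earlier ys
      if (keep) bestY = p(1)
      keep
    }
  }

Either this or the quadratic filter sketched above can serve as the function that mapPartitions applies to each partition.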

Re: Why these operations are slower than the equivalent on Hadoop?

2014-04-15 Thread Cheng Lian
Your Spark solution first reduces partial results into a single partition, computes the final result, and then collects it to the driver side. This involves a shuffle and two waves of network traffic. Instead, you can directly collect partial results to the driver and then compute the final result
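A sketch of the suggested shape, reusing the hypothetical skyline helper from above (it assumes the per-partition skylines are small enough to fit in driver memory):

  // One local skyline per partition, collected straight to the driver:
  // a single wave of network traffic, no shuffle.
  val partial = data.mapPartitions(points => Iterator(skyline(points.toVector))).collect()

  // Final merge happens driver-side.
  val result = skyline(partial.flatten.toSeq)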

Re: Why these operations are slower than the equivalent on Hadoop?

2014-04-15 Thread Eugen Cepoi
It depends on your algorithm, but I guess you probably should use reduce (the code probably doesn't compile, but it shows the idea). val result = data.reduce { case (left, right) => skyline(left ++ right) } Or, in case you want to merge the result of a partition with another one, you
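One reason it doesn't compile as written: the elements of data are single points (Array[Double]), so left ++ right would concatenate the coordinates of two points rather than two point sets. A compiling variant of the same idea (again with the hypothetical skyline helper from above):

  val result = data
    .map(p => Vector(p))  // lift each point into a one-element point set
    .reduce((left, right) => skyline(left ++ right).toVector)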

Re: Why these operations are slower than the equivalent on Hadoop?

2014-04-15 Thread Yanzhe Chen
Eugen, Thanks for your tip, and I do want to merge the result of a partition with another one, but I am still not quite clear how to do it. Say the original data RDD has 32 partitions; since mapPartitions won’t change the number of partitions, it will remain 32 partitions, each of which contains
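For the merging itself: after mapPartitions, each of the 32 partitions holds exactly one element, its local skyline, and reduce merges those 32 elements pairwise into the global skyline. To make the partition-with-partition merging explicit, one option (a sketch, not from the thread; reuses the hypothetical skyline helper) is to repeatedly halve the partition count with coalesce:

  var cur = data.mapPartitions(points => Iterator(skyline(points.toVector)))
  var parts = cur.partitions.length            // 32 in this example
  while (parts > 1) {
    parts = (parts + 1) / 2
    cur = cur.coalesce(parts)                  // pair partitions up, no shuffle
             .mapPartitions(ss => Iterator(skyline(ss.toVector.flatten)))
  }
  val result = cur.first()                     // the global skyline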