Hi all,

Following up on a previous thread, I am asking how to implement a divide-and-conquer algorithm (skyline) in Spark. Here is my current solution:
    val data = sc.textFile(...)
      .map(line => line.split(",").map(_.toDouble))

    val result = data
      .mapPartitions(points => skyline(points.toArray).iterator)
      .coalesce(1, true)
      .mapPartitions(points => skyline(points.toArray).iterator)
      .collect()

where skyline is a local algorithm that computes the result:

    def skyline(points: Array[Point]): Array[Point]

Basically, I find this implementation runs slower than the corresponding Hadoop version (an identity map phase plus the local skyline for both the combine and reduce phases). Here are my questions:

1. Why is this implementation much slower than the Hadoop one? I can see two possible reasons: one is the shuffle overhead in coalesce, the other is the repeated toArray and iterator conversions around each call to the local skyline algorithm. But I am not sure which one is the real cause.

2. One observation is that while the Hadoop version used up almost all of the CPU during execution, CPU utilization seems much lower with Spark. Is that a clue that shuffling might be the real bottleneck?

3. Is there any difference between coalesce(1, true) and repartition? It seems that both operations need to shuffle data. In what situations is the coalesce method the proper choice?

4. More generally, I am trying to implement some core geometry operators on Spark (skyline, convex hull, etc.). In my understanding, since Spark is more capable of handling iterative computation on a dataset, the above solution apparently doesn't exploit what Spark is good at. Any comments on how to do geometry computations on Spark (if it is possible)?

Thanks for any insight.

Yanzhe
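P.S. For reference, here is a minimal version of the local skyline for the 2-D maximization case. This is only a sketch for discussion, not necessarily the code behind the skyline signature above, and the Point case class here is a hypothetical stand-in for my real type:

```scala
// Hypothetical stand-in for the Point type used in my code.
case class Point(x: Double, y: Double)

// p dominates q if p is >= q in every dimension and strictly greater
// in at least one (maximization convention; flip for minimization).
def dominates(p: Point, q: Point): Boolean =
  p.x >= q.x && p.y >= q.y && (p.x > q.x || p.y > q.y)

// A point belongs to the skyline iff no other point dominates it.
// Quadratic for clarity; a sort-by-x-then-scan version is O(n log n).
def skyline(points: Array[Point]): Array[Point] =
  points.filter(p => !points.exists(q => dominates(q, p)))
```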
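To make question 3 concrete, these are the two calls I am comparing (on the same RDD as above):

```scala
// Force everything into one partition with an explicit shuffle:
val viaCoalesce = data.coalesce(1, true)

// The call I would otherwise reach for:
val viaRepartition = data.repartition(1)
```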
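For question 4, one variation I have been wondering about is merging the per-partition skylines in several rounds, halving the partition count each time, instead of collapsing to one partition in a single step. A sketch only, reusing the same local skyline function (merging by re-running skyline on the union of partial skylines is valid, since any dominated point is also dominated in the union); whether the extra stages pay off would need measuring:

```scala
// Round 1: local skyline per partition.
var partials = data.mapPartitions(ps => skyline(ps.toArray).iterator)

// Merge partial skylines pairwise until one partition remains.
var n = partials.partitions.length
while (n > 1) {
  n = (n + 1) / 2
  partials = partials
    .coalesce(n, true)
    .mapPartitions(ps => skyline(ps.toArray).iterator)
}

val result = partials.collect()
```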