Implementing percentile through top Vs take

2014-07-30 Thread Bharath Ravi Kumar
I'm looking to select the top n records (by rank) from a data set of a few hundred GB's. My understanding is that JavaRDD.top(n, comparator) is entirely a driver-side operation in that all records are sorted in the driver's memory. I prefer an approach where the records are sorted on the cluster

Re: Implementing percentile through top Vs take

2014-07-30 Thread Sean Owen
No, it's definitely not done on the driver. It works as you say. Look at the source code for RDD.takeOrdered, which is what top calls. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1130 On Wed, Jul 30, 2014 at 7:07 PM, Bharath Ravi Kumar