I'm looking to select the top n records (by rank) from a data set of a few
hundred GB's. My understanding is that JavaRDD.top(n, comparator) is
entirely a driver-side operation in that all records are sorted in the
driver's memory. I prefer an approach where the records are sorted on the
cluster
No, it's definitely not done on the driver. It works as you say. Look
at the source code for RDD.takeOrdered, which is what top calls.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1130
On Wed, Jul 30, 2014 at 7:07 PM, Bharath Ravi Kumar