I'm looking to select the top n records (by rank) from a data set of a few hundred GB's. My understanding is that JavaRDD.top(n, comparator) is entirely a driver-side operation in that all records are sorted in the driver's memory. I prefer an approach where the records are sorted on the cluster and only the top ones sent to the driver. I'm hence leaning towards creating a JavaPairRDD on a key, then sorting the rdd by key and invoking take(N). I'd like to know if rdd.top achieves the same result (while being executed on the cluster) as take or if my assumption that it's a driver side operation is correct.
Thanks, Bharath