I'm looking to select the top n records (by rank) from a data set of a few
hundred GB. My understanding is that JavaRDD.top(n, comparator) is
entirely a driver-side operation, in that all records are sorted in the
driver's memory. I would prefer an approach where the records are sorted on
the cluster and only the top ones are sent to the driver. I'm therefore
leaning towards creating a JavaPairRDD keyed on the rank, sorting the RDD by
key, and invoking take(n). I'd like to know whether rdd.top achieves the same
result as take while being executed on the cluster, or whether my assumption
that it is a driver-side operation is correct.
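
To make the behaviour I'm after concrete, here is a minimal plain-Java
sketch. It does not use Spark: partitions are simulated as in-memory lists,
and the names TopNSketch, topOfPartition and topN are made up for this
illustration. Each partition keeps only its n best records in a bounded
heap, and only those candidates are merged in one place (the "driver"). It
is meant to show the idea, not how any particular Spark method is
implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only: simulates "select on the cluster, merge on
// the driver" with plain lists standing in for partitions.
final class TopNSketch {

    // Keep only the n largest records (per cmp) of one partition, using a
    // min-heap that never holds more than n records.
    static <T> List<T> topOfPartition(List<T> partition, int n, Comparator<T> cmp) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        PriorityQueue<T> heap = new PriorityQueue<>(n, cmp); // head = smallest kept record
        for (T record : partition) {
            if (heap.size() < n) {
                heap.add(record);
            } else if (cmp.compare(record, heap.peek()) > 0) {
                heap.poll();      // evict the smallest kept record
                heap.add(record); // and keep the better one
            }
        }
        return new ArrayList<>(heap);
    }

    // "Driver" step: merge the per-partition candidates (at most n from each
    // partition) and return the overall top n, best first.
    static <T> List<T> topN(List<List<T>> partitions, int n, Comparator<T> cmp) {
        List<T> candidates = new ArrayList<>();
        for (List<T> partition : partitions) {
            candidates.addAll(topOfPartition(partition, n, cmp));
        }
        List<T> result = topOfPartition(candidates, n, cmp);
        result.sort(cmp.reversed());
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = Arrays.asList(
            Arrays.asList(5, 1, 9),
            Arrays.asList(7, 3),
            Arrays.asList(8, 2, 6));
        // prints [9, 8, 7]
        System.out.println(topN(partitions, 3, Comparator.<Integer>naturalOrder()));
    }
}
```

With this shape, the merging side only ever sees at most n records per
partition, which is the property I'm hoping to get from either rdd.top or
the sort-then-take approach.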

Thanks,
Bharath
