Hi all, I have a distribution represented as an RDD of (segment, score) tuples. For each segment, I want to discard the tuples whose scores fall in the top X percent. This seems hard to do with Spark RDDs.
A naive algorithm would be:
1) Sort the RDD by segment and score (descending).
2) Within each segment, number the rows from top to bottom.
3) For each segment, calculate the cut-off index, e.g. index 90 for a 10% cut-off in a segment with 100 rows.
4) Over the entire RDD, keep only rows whose row number is past the cut-off index.
This does not look like a good algorithm, since it needs a full sort and per-segment row numbering. I would really appreciate it if someone could suggest a better way to implement this in Spark. Regards, Aung
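To make the steps above concrete, here is the per-segment logic in plain Python (no Spark required to see the idea). It is only a sketch: the function name and the tie-handling choice are mine, and instead of numbering every row it derives a cut-off *score* per segment and filters on that. In Spark terms, the grouping would correspond to something like groupByKey, and the final step to a filter against the (small) per-segment cut-off map, e.g. via a broadcast variable.

```python
import math
from collections import defaultdict

def discard_top_percent(rows, pct):
    """Drop, within each segment, the tuples whose scores fall in the
    top `pct` percent for that segment. `rows` is a list of
    (segment, score) tuples, mirroring the RDD's rows."""
    # Step 1: group scores by segment (in Spark: groupByKey or aggregateByKey).
    by_segment = defaultdict(list)
    for segment, score in rows:
        by_segment[segment].append(score)

    # Step 2: per segment, sort descending and find the cut-off score.
    # Dropping the top pct% of n rows means skipping the first
    # ceil(n * pct / 100) of them; the cut-off score is the score at
    # that index. Ties at the cut-off score may keep slightly more rows.
    cutoffs = {}
    for segment, scores in by_segment.items():
        scores_desc = sorted(scores, reverse=True)
        k = math.ceil(len(scores_desc) * pct / 100)  # rows to drop
        cutoffs[segment] = scores_desc[k] if k < len(scores_desc) else None

    # Step 3: filter the original rows against the per-segment cut-off
    # (in Spark: broadcast `cutoffs` and rdd.filter(...)).
    return [(seg, s) for seg, s in rows
            if cutoffs[seg] is not None and s <= cutoffs[seg]]
```

The advantage over the numbered-rows version is that only the small per-segment cut-off map has to be collected; the big RDD is touched once for grouping and once for filtering, with no global sort.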