Hi Debasish,

Thanks for your suggestions. The in-memory version is quite useful. I do not quite understand how you can use aggregateBy to get the top 10% of elements, though. Can you please give an example?
Thanks,
Aung

On Fri, Mar 27, 2015 at 2:40 PM, Debasish Das <debasish.da...@gmail.com> wrote:

> You can do it in-memory as well: get the top 10% of elements from each
> partition and merge them using the merge step of any sort algorithm, such
> as timsort. Basically aggregateBy.
>
> Your version uses a shuffle, but this version needs zero shuffles.
> Assuming your data set is cached, you will be doing an in-memory allReduce
> through treeAggregate.
>
> But this is only good for the top 10% or bottom 10%. If you need to do it
> for the top 30%, then the shuffle version may work better.
>
> On Thu, Mar 26, 2015 at 8:31 PM, Aung Htet <aung....@gmail.com> wrote:
>
>> Hi all,
>>
>> I have a distribution represented as an RDD of tuples, in rows of
>> (segment, score).
>> For each segment, I want to discard the tuples with the top X percent of
>> scores. This seems hard to do with Spark RDDs.
>>
>> A naive algorithm would be:
>>
>> 1) Sort the RDD by segment & score (ascending).
>> 2) Within each segment, number the rows from top to bottom.
>> 3) For each segment, calculate the cut-off index, i.e. 90 for a 10%
>> cut-off in a segment with 100 rows.
>> 4) For the entire RDD, keep the rows with row num <= cut-off index.
>>
>> This does not look like a good algorithm. I would really appreciate it
>> if someone could suggest a better way to implement this in Spark.
>>
>> Regards,
>> Aung
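For what it's worth, here is a minimal sketch of what the in-memory approach might look like. It is not from the thread: the name topFraction is invented, it works on a single global score column (a per-segment variant would aggregate a map keyed by segment), and it uses a bounded heap in place of a literal timsort merge. It assumes scores is a cached RDD[Double].

import org.apache.spark.rdd.RDD
import scala.collection.mutable.PriorityQueue

// Keep the k largest scores, where k = fraction * total count.
def topFraction(scores: RDD[Double], fraction: Double): Array[Double] = {
  val k = math.ceil(scores.count() * fraction).toInt

  // Bounded min-heap: with the reversed ordering, dequeue() evicts the
  // smallest element, so the heap always holds the k largest seen so far.
  def insert(heap: PriorityQueue[Double], x: Double): PriorityQueue[Double] = {
    heap.enqueue(x)
    if (heap.size > k) heap.dequeue()
    heap
  }

  // Zero shuffles: each partition builds its own top-k heap (seqOp) and the
  // partial heaps are merged tree-wise on the executors (combOp).
  val top = scores.treeAggregate(
    PriorityQueue.empty[Double](Ordering[Double].reverse))(
    seqOp = insert,
    combOp = (a, b) => { b.foreach(insert(a, _)); a })

  top.dequeueAll.toArray // ascending; the global top `fraction` of scores
}

Note that Spark's built-in rdd.top(k) already implements this per-partition top-k plus merge pattern, so for the unsegmented case it may be all you need.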
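And, for comparison, a sketch of the shuffle-based algorithm from the original post. Again illustrative rather than from the thread: dropTopFraction is an invented name, and it assumes rows is an RDD[(String, Double)] of (segment, score) pairs where each segment is small enough for groupByKey to gather onto a single task.

import org.apache.spark.rdd.RDD

// Discard the top `fraction` of scores within every segment.
def dropTopFraction(rows: RDD[(String, Double)],
                    fraction: Double): RDD[(String, Double)] = {
  rows
    .groupByKey() // the shuffle: gathers each segment onto one task
    .flatMap { case (segment, scores) =>
      val sorted = scores.toArray.sorted // ascending
      // cut-off index, e.g. 90 for a 10% cut in a 100-row segment
      val cutOff = math.floor(sorted.length * (1.0 - fraction)).toInt
      sorted.take(cutOff).map(score => (segment, score)) // keep rows below it
    }
}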