Hi Debasish,

Thanks for your suggestions. The in-memory version sounds quite useful,
but I don't quite understand how you can use aggregateBy to get the top
10% (top K) elements. Could you please give an example?

Thanks,
Aung

On Fri, Mar 27, 2015 at 2:40 PM, Debasish Das <debasish.da...@gmail.com>
wrote:

> You can do it in-memory as well: take the top 10% (top K) elements from
> each partition and merge the per-partition results, as in the merge step
> of a sort algorithm like timsort... basically aggregateBy.
>
> Your version uses a shuffle, but this version has zero shuffle. Assuming
> your data set is cached, you will be doing an in-memory allReduce
> through treeAggregate...
>
> But this is only good for the top 10% or bottom 10%. If you need to do
> it for, say, the top 30%, then the shuffle version may work better...
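>
> A rough, untested sketch of the idea (assuming the number of rows to
> keep per segment, k, is known up front, e.g. from an earlier
> countByKey; topKPerSegment is just an illustrative name):
>
>   import scala.collection.mutable
>   import org.apache.spark.rdd.RDD
>
>   def topKPerSegment(rdd: RDD[(String, Double)],
>                      k: Map[String, Int]): Map[String, Array[Double]] = {
>     // keep only the k largest scores seen so far for one segment
>     def add(buf: mutable.Map[String, Array[Double]],
>             seg: String, scores: Array[Double]) = {
>       buf(seg) = (buf.getOrElse(seg, Array.empty[Double]) ++ scores)
>         .sorted(Ordering[Double].reverse)       // descending merge step
>         .take(k.getOrElse(seg, 0))
>       buf
>     }
>     rdd.treeAggregate(mutable.Map.empty[String, Array[Double]])(
>       (buf, row) => add(buf, row._1, Array(row._2)),          // per record
>       (a, b) => { b.foreach { case (s, v) => add(a, s, v) }; a } // merge
>     ).toMap
>   }
>
> The smallest retained score in each segment is that segment's cut-off;
> you can broadcast those and do one in-memory filter pass over the
> cached RDD to drop the top 10%.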
>
> On Thu, Mar 26, 2015 at 8:31 PM, Aung Htet <aung....@gmail.com> wrote:
>
>> Hi all,
>>
>> I have a distribution represented as an RDD of (segment, score) tuples.
>> For each segment, I want to discard the tuples with the top X percent
>> of scores. This seems hard to do with Spark RDDs.
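>>
>> For concreteness, the data might look like this (values are made up;
>> sc is the SparkContext):
>>
>>   val rdd = sc.parallelize(Seq(
>>     ("segmentA", 0.91), ("segmentA", 0.42), ("segmentA", 0.77),
>>     ("segmentB", 0.13), ("segmentB", 0.66)))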
>>
>> A naive algorithm would be (a rough sketch in code follows the steps) -
>>
>> 1) Sort the RDD by segment & score (descending)
>> 2) Within each segment, number the rows from top to bottom.
>> 3) For each segment, calculate the cut-off index, i.e. 10 for a 10% cut
>> off in a segment with 100 rows.
>> 4) For the entire RDD, filter out rows with row num <= cut off index,
>> keeping the remaining 90%.
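>>
>> In code, the steps above would look roughly like this (untested;
>> groupByKey pulls each whole segment to one executor, hence the
>> shuffle):
>>
>>   import org.apache.spark.rdd.RDD
>>
>>   def dropTopPercent(rdd: RDD[(String, Double)],
>>                      pct: Double): RDD[(String, Double)] =
>>     rdd.groupByKey().flatMap { case (seg, scores) =>
>>       val desc = scores.toArray.sorted(Ordering[Double].reverse) // 1, 2
>>       val cutOff = (desc.length * pct).toInt                     // 3
>>       desc.drop(cutOff).map(s => (seg, s))                       // 4
>>     }
>>
>> e.g. dropTopPercent(rdd, 0.10) keeps the bottom 90% of each segment.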
>>
>> This does not look like a good algorithm. I would really appreciate it
>> if someone could suggest a better way to implement this in Spark.
>>
>> Regards,
>> Aung
>>
>
>
