You could also consider using a count-min data structure such as the one in
https://github.com/laserson/dsq

to get approximate quantiles, then use whatever values you want to filter the 
original sequence.
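
A minimal sketch of that idea, not using dsq itself but simply estimating a per-segment cut-off score from a sample and then filtering the full RDD against it (the RDD shape and the topFraction/sampleFraction parameters are assumptions):

import org.apache.spark.rdd.RDD

// Sketch: estimate the top-X% score cut-off per segment from a sample,
// then filter the full RDD against those approximate thresholds.
// `data` is assumed to be an RDD[(String, Double)] of (segment, score).
def filterOutTopPercentApprox(data: RDD[(String, Double)],
                              topFraction: Double,
                              sampleFraction: Double = 0.1): RDD[(String, Double)] = {
  // Approximate each segment's cut-off score from a sample instead of sorting everything.
  val cutoffs: Map[String, Double] = data
    .sample(withReplacement = false, sampleFraction)
    .groupByKey()
    .mapValues { scores =>
      val sorted = scores.toArray.sorted
      // Score at the (1 - topFraction) quantile of the sample.
      sorted(math.min(sorted.length - 1, (sorted.length * (1 - topFraction)).toInt))
    }
    .collectAsMap()
    .toMap

  val bc = data.sparkContext.broadcast(cutoffs)
  // Discard tuples whose score is at or above the estimated cut-off for their segment;
  // segments missing from the sample are kept untouched.
  data.filter { case (seg, score) => score < bc.value.getOrElse(seg, Double.PositiveInfinity) }
}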

________________________________
From: Debasish Das <debasish.da...@gmail.com>
Sent: Thursday, March 26, 2015 9:45 PM
To: Aung Htet
Cc: user
Subject: Re: How to get a top X percent of a distribution represented as RDD

The idea is to use a heap and get the top-K elements from every partition, then use 
aggregateByKey, and for the combOp do a merge routine as in merge sort: basically get 
100 items from partition 1 and 100 items from partition 2, merge them so that you get 
200 sorted items, and take 100. For the merge you can use a heap as well. Matei had a 
BPQ (BoundedPriorityQueue) inside Spark which we use all the time. Passing arrays over 
the wire is better than passing full heap objects, and a merge routine on arrays should 
run faster, but that needs experimenting.
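
A rough sketch of that aggregateByKey plus bounded-heap approach (Spark's internal BoundedPriorityQueue is private, so this uses a plain scala.collection.mutable.PriorityQueue; k and the (segment, score) types are assumptions):

import org.apache.spark.rdd.RDD
import scala.collection.mutable

// Sketch: keep at most k scores per segment in a bounded min-heap during seqOp,
// then merge the partitions' heaps in combOp, as described above.
def topKPerSegment(data: RDD[(String, Double)], k: Int): RDD[(String, Array[Double])] = {
  // Min-heap of size <= k: the reversed ordering makes dequeue() drop the smallest score.
  def add(heap: mutable.PriorityQueue[Double], score: Double) = {
    heap.enqueue(score)
    if (heap.size > k) heap.dequeue()
    heap
  }

  data.aggregateByKey(mutable.PriorityQueue.empty[Double](Ordering[Double].reverse))(
    (heap, score) => add(heap, score),                   // seqOp: fold scores within a partition
    (h1, h2) => { h2.foreach(s => add(h1, s)); h1 }      // combOp: merge heaps across partitions
  ).mapValues(_.toArray.sorted(Ordering[Double].reverse)) // top-k scores per segment, descending
}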

On Thu, Mar 26, 2015 at 9:26 PM, Aung Htet <aung....@gmail.com> wrote:
Hi Debasish,

Thanks for your suggestions. The in-memory version is quite useful, but I do not quite 
understand how you can use aggregateByKey to get the top 10% (top-K) elements. Can you 
please give an example?

Thanks,
Aung

On Fri, Mar 27, 2015 at 2:40 PM, Debasish Das <debasish.da...@gmail.com> wrote:
You can do it in-memory as well: get the top 10% (top-K) elements from each partition 
and use a merge step from any sort algorithm like timsort, basically aggregateByKey.

Your version uses a shuffle, but this version has zero shuffle. Assuming your data set 
is cached, you will be using an in-memory allReduce through treeAggregate.

But this is only good for the top 10% or bottom 10%; if you need to do it for the top 
30%, then maybe the shuffle version will work better.
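
A hedged sketch of that zero-shuffle variant with treeAggregate, folding everything into a per-segment top-K map on the driver (only sensible when the number of segments times k is small; the types and k are assumptions):

import org.apache.spark.rdd.RDD

// Sketch: fold the whole RDD into one Map(segment -> top-k scores) on the driver
// via treeAggregate, so the full data set is never shuffled.
def topKPerSegmentNoShuffle(data: RDD[(String, Double)], k: Int): Map[String, Array[Double]] = {
  // Merge two arrays and keep the k largest scores, descending (a simple merge-sort style step).
  def merge(a: Array[Double], b: Array[Double]): Array[Double] =
    (a ++ b).sorted(Ordering[Double].reverse).take(k)

  data.treeAggregate(Map.empty[String, Array[Double]])(
    // seqOp: fold each (segment, score) into the per-partition partial map
    (acc, kv) => acc.updated(kv._1, merge(acc.getOrElse(kv._1, Array.empty[Double]), Array(kv._2))),
    // combOp: merge the partial maps coming from different partitions
    (m1, m2) => (m1.keySet ++ m2.keySet).map { seg =>
      seg -> merge(m1.getOrElse(seg, Array.empty[Double]), m2.getOrElse(seg, Array.empty[Double]))
    }.toMap
  )
}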

On Thu, Mar 26, 2015 at 8:31 PM, Aung Htet <aung....@gmail.com> wrote:
Hi all,

I have a distribution represented as an RDD of tuples, in rows of (segment, score).
For each segment, I want to discard the tuples with the top X percent of scores. This 
seems hard to do with Spark RDDs.

A naive algorithm would be -

1) Sort the RDD by segment & score (ascending)
2) Within each segment, number the rows from top to bottom.
3) For each segment, calculate the cut-off index, e.g. 90 for a 10% cut-off in a segment 
with 100 rows.
4) For the entire RDD, keep only the rows with row num <= cut-off index
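
A rough sketch of the naive approach above, using groupByKey per segment (the field types and the topFraction parameter are assumptions; note that groupByKey shuffles every row):

import org.apache.spark.rdd.RDD

// Sketch of the naive algorithm: sort each segment's scores, compute the
// cut-off index, and keep only the rows below it (i.e. discard the top X percent).
def dropTopPercentNaive(data: RDD[(String, Double)], topFraction: Double): RDD[(String, Double)] = {
  data.groupByKey().flatMap { case (seg, scores) =>
    val sorted = scores.toArray.sorted                      // steps 1-2: ascending, per segment
    val cutoff = (sorted.length * (1 - topFraction)).toInt  // step 3: e.g. 90 for 10% of 100 rows
    sorted.take(cutoff).map(score => (seg, score))          // step 4: keep rows up to the cut-off
  }
}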

This does not look like a good algorithm. I would really appreciate it if someone could 
suggest a better way to implement this in Spark.

Regards,
Aung


