Re: How to get a top X percent of a distribution represented as RDD

2015-04-03 Thread Debasish Das
e whatever values you want to filter the original sequence.

Re: How to get a top X percent of a distribution represented as RDD

2015-04-03 Thread Aung Htet
>> Idea is to use a heap and get topK elements from every partition...then use aggregateBy and for combOp do a merge routine fr

Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Debasish Das
> Idea is to use a heap and get topK elements from every partition...then use aggregateBy and for combOp do a merge routine from mergeSort...basicall

Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Charles Hayden
> Idea is to use a heap and get topK elements from every partition...then use aggregateBy and for combOp do a merge routine from mergeSort...basically get 100 items from partition 1, 100 items from

Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Debasish Das
Idea is to use a heap and get topK elements from every partition...then use aggregateBy and for combOp do a merge routine from mergeSort...basically get 100 items from partition 1, 100 items from partition 2, merge them so that you get sorted 200 items and take 100...for merge you can use heap as w
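
The merge step described above can be sketched in plain Python, with each partition modeled as a local list (no Spark involved). The helper names `top_k_of_partition` and `merge_top_k` are hypothetical; they play the roles of the seqOp and combOp that would be handed to an aggregate-style call. The message says "aggregateBy"; Spark's RDD API has `aggregate`, `treeAggregate` and `aggregateByKey`, and which one was meant is not stated.

```python
import heapq
from itertools import islice

def top_k_of_partition(scores, k):
    """seqOp analogue: the k largest scores of one partition, sorted descending."""
    return heapq.nlargest(k, scores)

def merge_top_k(left, right, k):
    """combOp analogue: merge two descending-sorted lists and keep the first k."""
    merged = heapq.merge(left, right, reverse=True)  # lazy merge, as in merge sort
    return list(islice(merged, k))

# Two made-up partitions of scores (illustrative data only).
partition_1 = [5, 1, 9, 3, 7]
partition_2 = [8, 2, 6, 4, 10]
k = 3

result = merge_top_k(top_k_of_partition(partition_1, k),
                     top_k_of_partition(partition_2, k),
                     k)
```

Each side contributes at most k items, so the merge never looks at more than 2k values, which matches the "100 + 100, keep 100" description.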

Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Aung Htet
Hi Debasish, Thanks for your suggestions. The in-memory version is quite useful. I do not quite understand how you can use aggregateBy to get 10% top K elements. Can you please give an example? Thanks, Aung On Fri, Mar 27, 2015 at 2:40 PM, Debasish Das wrote: > You can do it in-memory as well...g

Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Debasish Das
You can do it in-memory as well...get 10% topK elements from each partition and use merge from any sort algorithm like timsort...basically aggregateBy. Your version uses shuffle but this version is 0 shuffle...assuming your data set is cached you will be using in-memory allReduce through treeAggre
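
One way to read "10% topK" is sketched below in plain Python, under an assumption the message does not spell out: k is derived from the total element count, and every partition contributes up to k candidates. If each partition contributed only its own top 10%, a skewed partition could hide elements that belong in the global result. The function name `top_fraction` is hypothetical, and the lists stand in for cached partitions.

```python
import heapq
import math
from functools import reduce

def top_fraction(partitions, fraction):
    """Return the globally largest ceil(fraction * N) values, sorted descending."""
    total = sum(len(p) for p in partitions)   # stands in for a count over the data set
    k = math.ceil(fraction * total)
    if k == 0:
        return []
    # Per-partition step: keep at most k candidates from each partition.
    candidates = [heapq.nlargest(k, p) for p in partitions]
    # Combine step: pairwise merge of candidate lists, trimmed back to k each time.
    combine = lambda a, b: heapq.nlargest(k, a + b)
    return reduce(combine, candidates, [])

# Made-up, deliberately skewed data: both top scores sit in the second partition.
partitions = [[1, 2, 3, 4], [100, 90, 80, 70]]
top = top_fraction(partitions, 0.25)
```

With 8 values and a fraction of 0.25, k is 2. Taking only the top 25% of each partition separately would yield 4 and 100 and miss 90.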

How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Aung Htet
Hi all, I have a distribution represented as an RDD of tuples, in rows of (segment, score). For each segment, I want to discard tuples with top X percent scores. This seems hard to do in Spark RDD. A naive algorithm would be: 1) Sort RDD by segment & score (descending) 2) Within each segment, nu
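
The stated goal (for each segment, discard the tuples with the top X percent of scores) can be illustrated on a small local list in plain Python. This is only a sketch of the intended result, not the Spark algorithm from the message, whose remaining steps are cut off above. The function name `drop_top_percent` is hypothetical, and rounding the number of dropped rows down is an assumption.

```python
import math
from collections import defaultdict

def drop_top_percent(rows, percent):
    """Drop the highest-scoring `percent` of rows within each segment."""
    by_segment = defaultdict(list)
    for segment, score in rows:
        by_segment[segment].append(score)

    kept = []
    for segment, scores in by_segment.items():
        scores.sort(reverse=True)                         # score descending
        n_drop = math.floor(len(scores) * percent / 100)  # assumption: round down
        kept.extend((segment, s) for s in scores[n_drop:])
    return kept

# Made-up (segment, score) rows.
rows = [("a", 10), ("a", 30), ("a", 20), ("a", 40), ("b", 1), ("b", 2)]
kept = drop_top_percent(rows, 50)
```

Here segment "a" loses its two highest scores and segment "b" loses one, leaving the lower half of each segment.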