Re: How to get a top X percent of a distribution represented as RDD

2015-04-03 Thread Aung Htet
such as in https://github.com/laserson/dsq to get approximate quantiles, then use whatever values you want to filter the original sequence. -- *From:* Debasish Das debasish.da...@gmail.com *Sent:* Thursday, March 26, 2015 9:45 PM *To:* Aung Htet *Cc:* user *Subject:* Re
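
The "approximate quantiles, then filter" idea in this reply can be sketched roughly as follows. This is plain Python rather than Spark, and it is not the algorithm used by the linked dsq project: it simply estimates a per-segment cutoff from a random sample and then filters the full data against it. The function names `approx_thresholds` and `filter_below` and the `sample_fraction` parameter are hypothetical, chosen only for illustration.

```python
import random
from collections import defaultdict


def approx_thresholds(rows, percent, sample_fraction=0.1, seed=0):
    """Estimate, per segment, the score at which the top `percent` begins.

    Assumption: a uniform random sample stands in for a real
    approximate-quantile algorithm.
    """
    rng = random.Random(seed)
    sampled = defaultdict(list)
    for segment, score in rows:
        if rng.random() < sample_fraction:
            sampled[segment].append(score)

    thresholds = {}
    for segment, scores in sampled.items():
        scores.sort()
        # Position of the (100 - percent)th percentile within the sample.
        idx = min(len(scores) - 1, int(len(scores) * (100 - percent) / 100))
        thresholds[segment] = scores[idx]
    return thresholds


def filter_below(rows, thresholds):
    """Keep rows scoring strictly below their segment's cutoff.

    Segments that were never sampled have no cutoff and are kept whole.
    """
    return [
        (segment, score)
        for segment, score in rows
        if segment not in thresholds or score < thresholds[segment]
    ]
```

With a small sample the cutoff is only an estimate, so the share of rows actually discarded will deviate from X percent. The benefit is that the full data set is filtered in one pass instead of being sorted.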

How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Aung Htet
Hi all, I have a distribution represented as an RDD of tuples, in rows of (segment, score). For each segment, I want to discard the tuples with the top X percent of scores. This seems hard to do with Spark RDDs. A naive algorithm would be: 1) Sort RDD by segment score (descending) 2) Within each segment,
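
A minimal sketch of the naive approach described in this message, written in plain Python rather than Spark so it runs anywhere. Step 2 is cut off in the archive snippet, so the rule used here (drop the `ceil(X% * n)` highest scores in each segment) is an assumption, and the function name `drop_top_percent` is hypothetical.

```python
import math
from collections import defaultdict


def drop_top_percent(rows, percent):
    """Discard, within each segment, the rows with the top `percent` of scores."""
    # Group scores by segment (the in-memory analogue of grouping an RDD by key).
    by_segment = defaultdict(list)
    for segment, score in rows:
        by_segment[segment].append(score)

    kept = []
    for segment, scores in by_segment.items():
        # Sort descending so the highest scores come first.
        scores.sort(reverse=True)
        # Assumed rounding rule: round the number of dropped rows up.
        n_drop = math.ceil(len(scores) * percent / 100.0)
        kept.extend((segment, s) for s in scores[n_drop:])
    return kept
```

For example, `drop_top_percent(rows, 20)` on a segment with ten scores removes the two highest. Every score of a segment has to be collected and sorted in one place, which is why this version is expensive on large segments.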

Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Aung Htet
... But this is only good for the top 10% or bottom 10%. If you need to do it for the top 30%, then maybe the shuffle version will work better... On Thu, Mar 26, 2015 at 8:31 PM, Aung Htet aung@gmail.com wrote: Hi all, I have a distribution represented as an RDD of tuples, in rows of (segment