such as in
https://github.com/laserson/dsq
to get approximate quantiles, then use whatever values you want to filter
the original sequence.
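
A minimal local sketch of that two-pass idea, in plain Python rather than Spark. The function names (segment_cutoffs, drop_top) are hypothetical, and an exact sort stands in for the approximate quantile step; it does not use or reproduce the dsq API. On an RDD, the second pass would presumably be a filter against the per-segment cutoffs.

```python
from collections import defaultdict


def segment_cutoffs(rows, top_percent):
    """Pass 1: for each segment, find the highest score that survives
    once the top `top_percent` percent of its rows are discarded.

    Exact stand-in for an approximate quantile sketch (assumption).
    """
    scores = defaultdict(list)
    for segment, score in rows:
        scores[segment].append(score)

    cutoffs = {}
    for segment, values in scores.items():
        values.sort()
        n = len(values)
        # Rows to discard, rounded up, using integer arithmetic only.
        discard = (n * top_percent + 99) // 100
        keep = n - discard
        # If nothing survives, use a cutoff below every possible score.
        cutoffs[segment] = values[keep - 1] if keep > 0 else float("-inf")
    return cutoffs


def drop_top(rows, cutoffs):
    """Pass 2: keep only rows at or below their segment's cutoff."""
    return [(seg, score) for seg, score in rows if score <= cutoffs[seg]]


rows = [("a", s) for s in range(1, 11)] + [("b", s) for s in (5, 50, 500, 5000)]
cutoffs = segment_cutoffs(rows, 20)
kept = drop_top(rows, cutoffs)
```

Because the filter compares values rather than ranks, tied scores at the cutoff are all kept, so slightly fewer than X percent of rows may be dropped when there are ties.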
--
*From:* Debasish Das debasish.da...@gmail.com
*Sent:* Thursday, March 26, 2015 9:45 PM
*To:* Aung Htet
*Cc:* user
*Subject:* Re
Hi all,
I have a distribution represented as an RDD of tuples, in rows of
(segment, score).
For each segment, I want to discard the tuples with the top X percent of
scores. This seems hard to do with a Spark RDD.
A naive algorithm would be -
1) Sort RDD by segment and score (descending)
2) Within each segment,
...
But this is only good for the top 10% or bottom 10%. If you need to do it
for the top 30%, then maybe the shuffle version will work better.
On Thu, Mar 26, 2015 at 8:31 PM, Aung Htet aung@gmail.com wrote:
Hi all,
I have a distribution represented as an RDD of tuples, in rows of
(segment