In that case you can directly use the count-min sketch from Algebird... it
works fine with Spark's aggregateByKey, but I have found the Java BPQ
(BoundedPriorityQueue) from Spark much faster than, say, Algebird's heap data
structure...
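
Roughly, the bounded-queue-per-segment idea looks like the sketch below
(Spark's own BoundedPriorityQueue is private[spark], so a size-capped
scala.collection.mutable.PriorityQueue stands in for it here; the
RDD[(String, Double)] of (segment, score) pairs and k are placeholders):

import org.apache.spark.rdd.RDD
import scala.collection.mutable

// Keep the k largest scores per segment with aggregateByKey; only the small
// capped queues travel between executors.
def topKPerSegment(data: RDD[(String, Double)], k: Int): RDD[(String, Array[Double])] = {
  // reversed ordering => dequeue() removes the smallest of the current top k
  def add(h: mutable.PriorityQueue[Double], x: Double) = {
    h.enqueue(x); if (h.size > k) h.dequeue(); h
  }
  data.aggregateByKey(mutable.PriorityQueue.empty[Double](Ordering[Double].reverse))(
      add, (a, b) => { b.foreach(add(a, _)); a })
    .mapValues(_.toArray.sorted(Ordering[Double].reverse))   // scores in descending order
}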

On Thu, Mar 26, 2015 at 10:04 PM, Charles Hayden <charles.hay...@atigeo.com>
wrote:

> You could also consider using a count-min data structure such as the one in
> https://github.com/laserson/dsq
>
> to get approximate quantiles, then use whatever cut-off values you want to
> filter the original sequence.
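>
> A rough sketch of that pattern (a simple sample-based estimate stands in for
> the dsq sketch, whose API is not shown here; the RDD name, sampling fraction,
> and cut-off are placeholders):
>
> import org.apache.spark.rdd.RDD
>
> // Estimate a per-segment cut-off from a sample, then filter the full RDD.
> def dropTopPercent(data: RDD[(String, Double)], topFraction: Double): RDD[(String, Double)] = {
>   val thresholds = data.sample(withReplacement = false, fraction = 0.01)
>     .groupByKey()
>     .mapValues { scores =>
>       val sorted = scores.toArray.sorted
>       sorted(((1.0 - topFraction) * (sorted.length - 1)).toInt)  // approximate (1 - X) quantile
>     }
>     .collectAsMap()
>   val bc = data.sparkContext.broadcast(thresholds)
>   // keep rows at or below the estimated cut-off; the top X percent are discarded
>   data.filter { case (seg, score) => score <= bc.value.getOrElse(seg, Double.MaxValue) }
> }
>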
>  ------------------------------
> *From:* Debasish Das <debasish.da...@gmail.com>
> *Sent:* Thursday, March 26, 2015 9:45 PM
> *To:* Aung Htet
> *Cc:* user
> *Subject:* Re: How to get a top X percent of a distribution represented
> as RDD
>
> The idea is to use a heap to get the top-K elements from every partition,
> then use aggregateByKey and, for the combOp, do a merge routine as in merge
> sort: get 100 items from partition 1 and 100 items from partition 2, merge
> them so that you get a sorted 200 and take the top 100. For the merge you can
> use a heap as well. Matei added a BPQ (BoundedPriorityQueue) inside Spark
> which we use all the time. Passing arrays over the wire is better than
> passing full heap objects, and a merge routine on arrays should run faster,
> but that needs experimentation...
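>
> For that combOp, merging two already-sorted arrays is just the merge step of
> merge sort with an early stop at k, something like this (names and k are
> illustrative):
>
> // Merge two arrays sorted in descending order, keeping only the k largest,
> // so only small plain arrays are shipped between executors.
> def mergeTopK(a: Array[Double], b: Array[Double], k: Int): Array[Double] = {
>   val out = new Array[Double](math.min(k, a.length + b.length))
>   var i = 0; var j = 0; var n = 0
>   while (n < out.length) {
>     // take the larger head of the two inputs
>     if (j >= b.length || (i < a.length && a(i) >= b(j))) { out(n) = a(i); i += 1 }
>     else { out(n) = b(j); j += 1 }
>     n += 1
>   }
>   out
> }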
>
> On Thu, Mar 26, 2015 at 9:26 PM, Aung Htet <aung....@gmail.com> wrote:
>
>> Hi Debasish,
>>
>> Thanks for your suggestions. The in-memory version is quite useful. I do not
>> quite understand how you can use aggregateByKey to get the top 10% of
>> elements. Can you please give an example?
>>
>> Thanks,
>> Aung
>>
>> On Fri, Mar 27, 2015 at 2:40 PM, Debasish Das <debasish.da...@gmail.com>
>> wrote:
>>
>>> You can do it in-memory as well: get the top 10% of elements from each
>>> partition and use a merge routine from any sort algorithm, like timsort,
>>> basically via aggregate/treeAggregate.
>>>
>>> Your version uses a shuffle, but this version has zero shuffle. Assuming
>>> your data set is cached, you will be doing an in-memory allReduce through
>>> treeAggregate...
>>>
>>> But this is only good for the top 10% or bottom 10%; if you need to do it
>>> for the top 30%, then maybe the shuffle version will work better...
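>>>
>>> Roughly, for a single segment the no-shuffle version looks like this
>>> (scores: RDD[Double] and the per-partition cut k are placeholders, and a
>>> simple concatenate-sort-take stands in for a smarter merge):
>>>
>>> import org.apache.spark.rdd.RDD
>>>
>>> // Each partition contributes its top k as a sorted array; treeAggregate
>>> // then merges those arrays without shuffling the cached data.
>>> def topKNoShuffle(scores: RDD[Double], k: Int): Array[Double] = {
>>>   val desc = Ordering[Double].reverse
>>>   scores
>>>     .mapPartitions(it => Iterator.single(it.toArray.sorted(desc).take(k)))
>>>     .treeAggregate(Array.empty[Double])(
>>>       (acc, arr) => (acc ++ arr).sorted(desc).take(k),  // merge within a task
>>>       (a, b) => (a ++ b).sorted(desc).take(k))          // merge across partitions
>>> }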
>>>
>>> On Thu, Mar 26, 2015 at 8:31 PM, Aung Htet <aung....@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have a distribution represented as an RDD of tuples, in rows of
>>>> (segment, score).
>>>> For each segment, I want to discard the tuples with the top X percent of
>>>> scores. This seems hard to do with Spark RDDs.
>>>>
>>>> A naive algorithm would be (a rough sketch follows the steps below) -
>>>>
>>>> 1) Sort the RDD by segment & score (descending).
>>>> 2) Within each segment, number the rows from top to bottom.
>>>> 3) For each segment, calculate the cut-off index, e.g. 90 for a 10% cut-off
>>>> out of a segment with 100 rows.
>>>> 4) For the entire RDD, filter rows with row number <= the cut-off index.
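>>>>
>>>> A rough sketch of these steps (names are illustrative, and groupByKey
>>>> assumes each segment's rows fit in memory on a single executor):
>>>>
>>>> import org.apache.spark.rdd.RDD
>>>>
>>>> // Per segment: sort scores descending and drop the top X fraction.
>>>> def dropTopXPercent(data: RDD[(String, Double)], x: Double): RDD[(String, Double)] =
>>>>   data.groupByKey().flatMap { case (segment, scores) =>
>>>>     val sorted = scores.toArray.sortBy(-_)              // step 1: sort descending
>>>>     val cutOff = (sorted.length * x).toInt              // step 3: rows to discard
>>>>     sorted.drop(cutOff).map(score => (segment, score))  // steps 2 & 4: keep the rest
>>>>   }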
>>>>
>>>> This does not look like a good algorithm. I would really appreciate it if
>>>> someone could suggest a better way to implement this in Spark.
>>>>
>>>> Regards,
>>>> Aung
>>>>
>>>
>>>
>>
>
