Hi Debasish, Charles,

I solved the problem by using a BPQ-like method based on your suggestions, so thanks very much for that!
My approach was:

1) Count the population of each segment in the RDD by map/reduce, so that I get the bound N equivalent to 10% of each segment. This becomes the size of the BPQ.
2) Associate the bound N with the corresponding records in the first RDD.
3) Reduce the RDD from step 2 by merging the values of every two rows, basically building up a sorted list (IndexedSeq) per segment.
4) Whenever the size of the sorted list exceeds N (the bound), create a new sorted list by loading the values into a priority queue and dequeuing the top N.

In the end, I get one record per segment holding that segment's N max values.
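In code, it looks roughly like this (a minimal sketch, assuming an RDD[(String, Double)] of (segment, score) pairs; for brevity, step 4 is shown as a sort-and-take rather than the priority-queue dequeue, and the function name is just illustrative):

import org.apache.spark.rdd.RDD

// Sketch only: keeps the N max scores per segment, where N is ~10% of the segment.
def topTenPercentPerSegment(scores: RDD[(String, Double)]): RDD[(String, IndexedSeq[Double])] = {
  // 1) Size each segment and derive the bound N (at least 1).
  val bounds: RDD[(String, Int)] =
    scores.mapValues(_ => 1L).reduceByKey(_ + _)
      .mapValues(count => math.max(1L, count / 10).toInt)

  // 2) Attach each segment's bound N to its records.
  scores.join(bounds)                                  // (segment, (score, n))
    .mapValues { case (score, n) => (IndexedSeq(score), n) }
    // 3) + 4) Merge the values of every two rows into one sorted list,
    // truncating back to the top n whenever it grows past the bound.
    .reduceByKey { case ((xs, n), (ys, _)) =>
      ((xs ++ ys).sortBy(-_).take(n), n)
    }
    .mapValues { case (topN, _) => topN }
}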
Regards,
Aung

On Fri, Mar 27, 2015 at 4:27 PM, Debasish Das <debasish.da...@gmail.com> wrote:

> In that case you can directly use count-min-sketch from algebird...they
> work fine with Spark aggregateBy, but I have found the java BPQ from Spark
> much faster than, say, the algebird Heap data structure...
>
> On Thu, Mar 26, 2015 at 10:04 PM, Charles Hayden <charles.hay...@atigeo.com> wrote:
>
>> You could also consider using a count-min data structure such as in
>> https://github.com/laserson/dsq
>> to get approximate quantiles, then use whatever values you want to filter
>> the original sequence.
>>
>> *From:* Debasish Das <debasish.da...@gmail.com>
>> *Sent:* Thursday, March 26, 2015 9:45 PM
>> *To:* Aung Htet
>> *Cc:* user
>> *Subject:* Re: How to get a top X percent of a distribution represented as RDD
>>
>> The idea is to use a heap and get the topK elements from every
>> partition...then use aggregateBy, and for combOp do a merge routine from
>> mergeSort...basically get 100 items from partition 1, 100 items from
>> partition 2, merge them so that you get 200 sorted items, and take
>> 100...for the merge you can use a heap as well...Matei had a BPQ inside
>> Spark which we use all the time...passing arrays over the wire is better
>> than passing full heap objects, and a merge routine on arrays should run
>> faster, but that needs experiment...
>>
>> On Thu, Mar 26, 2015 at 9:26 PM, Aung Htet <aung....@gmail.com> wrote:
>>
>>> Hi Debasish,
>>>
>>> Thanks for your suggestions. The in-memory version is quite useful. I do
>>> not quite understand how you can use aggregateBy to get the top 10% of
>>> elements. Can you please give an example?
>>>
>>> Thanks,
>>> Aung
>>>
>>> On Fri, Mar 27, 2015 at 2:40 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>
>>>> You can do it in-memory as well...get the top 10% of elements from each
>>>> partition and use a merge from any sort algorithm like
>>>> timsort...basically aggregateBy.
>>>>
>>>> Your version uses a shuffle, but this version is 0 shuffle...assuming
>>>> your data set is cached, you will be using in-memory allReduce through
>>>> treeAggregate...
>>>>
>>>> But this is only good for the top 10% or bottom 10%...if you need to do
>>>> it for the top 30%, then maybe the shuffle version will work better...
>>>>
>>>> On Thu, Mar 26, 2015 at 8:31 PM, Aung Htet <aung....@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a distribution represented as an RDD of tuples, in rows of
>>>>> (segment, score).
>>>>> For each segment, I want to discard the tuples with the top X percent
>>>>> of scores. This seems hard to do in Spark RDD.
>>>>>
>>>>> A naive algorithm would be:
>>>>>
>>>>> 1) Sort the RDD by segment & score (descending).
>>>>> 2) Within each segment, number the rows from top to bottom.
>>>>> 3) For each segment, calculate the cutoff index, e.g. 90 for a 10%
>>>>> cutoff out of a segment with 100 rows.
>>>>> 4) For the entire RDD, filter rows with row num <= cutoff index.
>>>>>
>>>>> This does not look like a good algorithm. I would really appreciate it
>>>>> if someone could suggest a better way to implement this in Spark.
>>>>>
>>>>> Regards,
>>>>> Aung
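P.S. For the archives, here is a rough sketch of the per-partition topK idea Debasish describes in the quoted thread above, using aggregateByKey with sorted arrays in place of heap objects (per his note that passing arrays over the wire can be faster). The (segment, score) shape, the fixed k, and the helper names are mine, so treat it as a starting point rather than a tested implementation:

import org.apache.spark.rdd.RDD

// Per-segment top-k: every partition folds its rows into a small
// descending-sorted array per key, and only those small arrays cross
// the wire when aggregateByKey combines partitions.
def topKPerSegment(scores: RDD[(String, Double)], k: Int): RDD[(String, Array[Double])] = {

  // seqOp: fold one score into a segment's current top-k (kept sorted, descending).
  def add(top: Array[Double], score: Double): Array[Double] =
    if (top.length < k) (top :+ score).sortBy(-_)
    else if (score > top.last) (top.init :+ score).sortBy(-_)
    else top

  // combOp: mergesort-style merge of two sorted arrays, keeping the k largest.
  def merge(a: Array[Double], b: Array[Double]): Array[Double] = {
    val out = new Array[Double](math.min(k, a.length + b.length))
    var i = 0; var j = 0; var o = 0
    while (o < out.length) {
      if (j >= b.length || (i < a.length && a(i) >= b(j))) { out(o) = a(i); i += 1 }
      else { out(o) = b(j); j += 1 }
      o += 1
    }
    out
  }

  scores.aggregateByKey(Array.empty[Double])(add, merge)
}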