Re: Selecting first ten values in a RDD/partition

2014-06-29 Thread Chris Fregly
as brian g alluded to earlier, you can use DStream.mapPartitions() to return the partition-local top 10 for each partition. once you collect the results from all the partitions, you can do a global top 10 merge sort across all partitions. this leads to a much much-smaller dataset to be shuffled

Re: Selecting first ten values in a RDD/partition

2014-05-30 Thread nilmish
My primary goal : To get top 10 hashtag for every 5 mins interval. I want to do this efficiently. I have already done this by using reducebykeyandwindow() and then sorting all hashtag in 5 mins interval taking only top 10 elements. But this is very slow. So I now I am thinking of retaining only

Re: Selecting first ten values in a RDD/partition

2014-05-29 Thread Anwar Rizal
Can you clarify what you're trying to achieve here ? If you want to take only top 10 of each RDD, why don't sort followed by take(10) of every RDD ? Or, you want to take top 10 of five minutes ? Cheers, On Thu, May 29, 2014 at 2:04 PM, nilmish nilmish@gmail.com wrote: I have a DSTREAM

Re: Selecting first ten values in a RDD/partition

2014-05-29 Thread Brian Gawalt
Try looking at the .mapPartitions( ) method implemented for RDD[T] objects. It will give you direct access to an iterator containing the member objects of each partition for doing the kind of within-partition hashtag counts you're describing. -- View this message in context: