Re: Selecting first ten values in a RDD/partition

2014-06-29 Thread Chris Fregly
as brian g alluded to earlier, you can use DStream.mapPartitions() to return the partition-local top 10 for each partition. once you collect the results from all the partitions, you can do a global top 10 merge sort across all partitions. this leads to a much smaller dataset to be shuffled
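
A minimal sketch of the two-step idea described above, written in plain Python rather than against the Spark API. Partitions are simulated as lists of (hashtag, count) pairs, and it assumes counts have already been aggregated per hashtag, so each hashtag appears in only one partition. The names `local_top_k` and `global_top_k` are hypothetical, chosen only for illustration.

```python
import heapq
from typing import Iterable, List, Tuple

Pair = Tuple[str, int]  # (hashtag, count)

def local_top_k(partition: Iterable[Pair], k: int = 10) -> List[Pair]:
    # Keep only the k largest counts seen in this one partition.
    return heapq.nlargest(k, partition, key=lambda kv: kv[1])

def global_top_k(partitions: Iterable[Iterable[Pair]], k: int = 10) -> List[Pair]:
    # Gather at most k candidates per partition, then pick the overall top k.
    candidates: List[Pair] = []
    for part in partitions:
        candidates.extend(local_top_k(part, k))
    return heapq.nlargest(k, candidates, key=lambda kv: kv[1])

# Simulated partitions (assumed data, for illustration only).
partitions = [
    [("#spark", 40), ("#scala", 12), ("#java", 7)],
    [("#bigdata", 33), ("#hadoop", 25)],
    [("#ml", 18), ("#python", 29)],
]
top3 = global_top_k(partitions, k=3)
```

Only up to k pairs per partition reach the final merge, which is why the approach moves far less data than sorting everything.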

Re: Selecting first ten values in a RDD/partition

2014-05-30 Thread nilmish
My primary goal: to get the top 10 hashtags for every 5-minute interval. I want to do this efficiently. I have already done this by using reduceByKeyAndWindow() and then sorting all hashtags in the 5-minute interval and taking only the top 10 elements. But this is very slow. So now I am thinking of retaining only

Re: Selecting first ten values in a RDD/partition

2014-05-29 Thread Gerard Maas
DStream has a helper method to print the first 10 elements of each RDD. You could take some inspiration from it, as the use case is practically the same and the code will probably be very similar: rdd.take(10)... https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/s

Re: Selecting first ten values in a RDD/partition

2014-05-29 Thread Brian Gawalt
Try looking at the .mapPartitions( ) method implemented for RDD[T] objects. It will give you direct access to an iterator containing the member objects of each partition for doing the kind of within-partition hashtag counts you're describing.
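
As a rough illustration of within-partition counting over an iterator, here is a plain-Python sketch (not Spark code). The function name `count_hashtags` is hypothetical; it stands in for the kind of function one might hand to a mapPartitions-style call, taking an iterator of hashtags and returning an iterator of (hashtag, count) pairs.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

def count_hashtags(partition: Iterable[str]) -> Iterator[Tuple[str, int]]:
    # Consume the partition's iterator once, tallying each hashtag.
    counts = Counter(partition)
    # Return an iterator of (hashtag, count) pairs for this partition only.
    return iter(counts.items())

# One simulated partition (assumed data, for illustration only).
partition = ["#spark", "#scala", "#spark", "#ml", "#spark", "#ml"]
local_counts = dict(count_hashtags(iter(partition)))
```

These are partial counts: if the same hashtag can occur in several partitions, the per-partition results still have to be summed per hashtag before any ranking.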

Re: Selecting first ten values in a RDD/partition

2014-05-29 Thread Anwar Rizal
Can you clarify what you're trying to achieve here? If you want to take only the top 10 of each RDD, why not sort followed by take(10) on every RDD? Or do you want to take the top 10 over five minutes? Cheers, On Thu, May 29, 2014 at 2:04 PM, nilmish wrote: > I have a DStream which consists of RDD