As Brian G alluded to earlier, you can use DStream.mapPartitions() to
return the partition-local top 10 for each partition. Once you collect
the results from all the partitions, you can do a global top-10 merge
sort across all partitions.
This leads to a much, much smaller dataset to be shuffled.
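A rough sketch of that two-step pattern (the variable name `counts` and the
assumption that it is a DStream[(String, Int)] of hashtag counts are mine):

    // counts: DStream[(String, Int)] of (hashtag, count) pairs for the
    // current window (an assumed name).

    // Step 1: within each partition, keep only that partition's top 10.
    val partitionTops = counts.mapPartitions { iter =>
      iter.toSeq.sortBy(-_._2).take(10).iterator
    }

    // Step 2: the surviving candidates (at most 10 per partition) are
    // small, so collecting and merging them on the driver is cheap.
    partitionTops.foreachRDD { rdd =>
      val globalTop10 = rdd.collect().sortBy(-_._2).take(10)
      globalTop10.foreach { case (tag, n) => println(s"$tag: $n") }
    }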
My primary goal: to get the top 10 hashtags for every 5-minute interval.
I want to do this efficiently. I have already done this by using
reduceByKeyAndWindow() and then sorting all hashtags in the 5-minute
interval, taking only the top 10 elements. But this is very slow.
So now I am thinking of retaining only
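For context, the slow approach described above looks roughly like this
(a sketch; the name `tagCounts` and the window setup are assumptions):

    import org.apache.spark.streaming.Minutes

    // tagCounts: DStream[(String, Int)] of (hashtag, 1) pairs.
    val windowed = tagCounts.reduceByKeyAndWindow(_ + _, Minutes(5))

    // The slow part: sortBy shuffles the entire windowed dataset even
    // though only 10 elements are ultimately kept.
    windowed.foreachRDD { rdd =>
      rdd.sortBy(_._2, ascending = false).take(10).foreach(println)
    }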
DStream has a helper method to print the first 10 elements of each RDD. You
could take some inspiration from it, as the use case is practically the same
and the code will probably be very similar: rdd.take(10)...
https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/s
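Following that same per-RDD pattern, one cheap variant is a sketch like the
following (assuming `windowed` is a DStream[(String, Int)] of windowed
hashtag counts; the name is mine):

    // top() keeps a bounded heap per partition and merges the candidates
    // on the driver, avoiding a full sortBy shuffle.
    windowed.foreachRDD { rdd =>
      rdd.top(10)(Ordering.by(_._2)).foreach(println)
    }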
Try looking at the .mapPartitions() method implemented for RDD[T] objects.
It will give you direct access to an iterator over the member objects
of each partition, for doing the kind of within-partition hashtag counts
you're describing.
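For instance, a minimal sketch of within-partition counting (assuming an
RDD[String] of raw hashtag occurrences named `hashtags`):

    import scala.collection.mutable

    // Count within each partition before anything is shuffled.
    val candidates = hashtags.mapPartitions { iter =>
      val counts = mutable.HashMap.empty[String, Int].withDefaultValue(0)
      iter.foreach(tag => counts(tag) += 1)
      // Keep only this partition's top 10 as merge candidates.
      counts.toSeq.sortBy(-_._2).take(10).iterator
    }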
Can you clarify what you're trying to achieve here?
If you want to take only the top 10 of each RDD, why not sort followed by
take(10) on every RDD?
Or do you want to take the top 10 over the five-minute window?
Cheers,
On Thu, May 29, 2014 at 2:04 PM, nilmish wrote:
> I have a DStream which consists of RDD