As Brian G alluded to earlier, you can use DStream.mapPartitions() to
return the partition-local top 10 for each partition. Once you collect
the results from all the partitions, you can do a global top-10 merge
sort across all partitions.
This leads to a much, much smaller dataset to be shuffled.
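A rough sketch of that two-step pattern (the variable name `counts` and the
assumption that it is a DStream[(String, Int)] of hashtag counts are mine):

    // counts: DStream[(String, Int)] of (hashtag, count) pairs for the
    // current window (an assumed name).

    // Step 1: within each partition, keep only that partition's top 10.
    val partitionTops = counts.mapPartitions { iter =>
      iter.toSeq.sortBy(-_._2).take(10).iterator
    }

    // Step 2: the surviving candidates (at most 10 per partition) are
    // small, so collecting and merging them on the driver is cheap.
    partitionTops.foreachRDD { rdd =>
      val globalTop10 = rdd.collect().sortBy(-_._2).take(10)
      globalTop10.foreach { case (tag, n) => println(s"$tag: $n") }
    }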
My primary goal: to get the top 10 hashtags for every 5-minute interval.
I want to do this efficiently. I have already done this by using
reduceByKeyAndWindow() and then sorting all hashtags in the 5-minute
interval, taking only the top 10 elements. But this is very slow.
So now I am thinking of retaining only
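For context, the slow approach described above looks roughly like this
(a sketch; the name `tagCounts` and the window setup are assumptions):

    import org.apache.spark.streaming.Minutes

    // tagCounts: DStream[(String, Int)] of (hashtag, 1) pairs.
    val windowed = tagCounts.reduceByKeyAndWindow(_ + _, Minutes(5))

    // The slow part: sortBy shuffles the entire windowed dataset even
    // though only 10 elements are ultimately kept.
    windowed.foreachRDD { rdd =>
      rdd.sortBy(_._2, ascending = false).take(10).foreach(println)
    }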
DStream has a helper method to print the first 10 elements of each RDD. You
could take some inspiration from it, as the use case is practically the same
and the code will probably be very similar: rdd.take(10)...
https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/s
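Following that same per-RDD pattern, one cheap variant is a sketch like the
following (assuming `windowed` is a DStream[(String, Int)] of windowed
hashtag counts; the name is mine):

    // top() keeps a bounded heap per partition and merges the candidates
    // on the driver, avoiding a full sortBy shuffle.
    windowed.foreachRDD { rdd =>
      rdd.top(10)(Ordering.by(_._2)).foreach(println)
    }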
Try looking at the .mapPartitions() method implemented for RDD[T] objects.
It will give you direct access to an iterator over the member objects
of each partition, for doing the kind of within-partition hashtag counts
you're describing.
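For instance, a minimal sketch of within-partition counting (assuming an
RDD[String] of raw hashtag occurrences named `hashtags`):

    import scala.collection.mutable

    // Count within each partition before anything is shuffled.
    val candidates = hashtags.mapPartitions { iter =>
      val counts = mutable.HashMap.empty[String, Int].withDefaultValue(0)
      iter.foreach(tag => counts(tag) += 1)
      // Keep only this partition's top 10 as merge candidates.
      counts.toSeq.sortBy(-_._2).take(10).iterator
    }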
Can you clarify what you're trying to achieve here?
If you want to take only the top 10 of each RDD, why not sort followed by
take(10) on every RDD?
Or do you want to take the top 10 over the five-minute window?
Cheers,
On Thu, May 29, 2014 at 2:04 PM, nilmish wrote:
> I have a DStream which consists of RDD