as brian g alluded to earlier, you can use DStream.mapPartitions() to
return the partition-local top 10 for each partition. once you collect
the results from all the partitions, you can do a global top 10 merge sort
across all partitions.
this leads to a much much-smaller dataset to be shuffled
My primary goal : To get top 10 hashtag for every 5 mins interval.
I want to do this efficiently. I have already done this by using
reducebykeyandwindow() and then sorting all hashtag in 5 mins interval
taking only top 10 elements. But this is very slow.
So I now I am thinking of retaining only
Can you clarify what you're trying to achieve here ?
If you want to take only top 10 of each RDD, why don't sort followed by
take(10) of every RDD ?
Or, you want to take top 10 of five minutes ?
Cheers,
On Thu, May 29, 2014 at 2:04 PM, nilmish nilmish@gmail.com wrote:
I have a DSTREAM
Try looking at the .mapPartitions( ) method implemented for RDD[T] objects.
It will give you direct access to an iterator containing the member objects
of each partition for doing the kind of within-partition hashtag counts
you're describing.
--
View this message in context: