as brian g alluded to earlier, you can use DStream.mapPartitions() to return the partition-local top 10 for each partition. once you collect the results from all the partitions, you can do a global top 10 merge sort across all partitions.
this leads to a much much-smaller dataset to be shuffled back to the driver to calculate the global top 10. On Fri, May 30, 2014 at 5:05 AM, nilmish <nilmish....@gmail.com> wrote: > My primary goal : To get top 10 hashtag for every 5 mins interval. > > I want to do this efficiently. I have already done this by using > reducebykeyandwindow() and then sorting all hashtag in 5 mins interval > taking only top 10 elements. But this is very slow. > > So I now I am thinking of retaining only top 10 hashtags in each RDD > because > these only could come in the final answer. > > I am stuck at : how to retain only top 10 hashtag in each RDD of my DSTREAM > ? Basically I need to transform my DTREAM in which each RDD contains only > top 10 hashtags so that number of hashtags in 5 mins interval is low. > > If there is some more efficient way of doing this then please let me know > that also. > > Thanx, > Nilesh > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Selecting-first-ten-values-in-a-RDD-partition-tp6517p6577.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. >