Hi nilmish,

One option is to move to a different algorithm. The SpaceSaver/StreamSummary method gives you approximate results in exchange for a smaller data structure. It has an implementation in Twitter's Algebird library, if you're using Scala:
https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/SpaceSaver.scala

There is also a more general write-up here:
http://boundary.com/blog/2013/05/14/approximate-heavy-hitters-the-spacesaving-algorithm/

I believe it will let you avoid an expensive sort of all the hundreds of thousands of hashtags you can see in a day.

Best,
--Brian
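
For illustration, here is a minimal Python sketch of the Space-Saving idea described above. It is not Algebird's API: the `SpaceSaving` class, its `add` and `top` methods, and the sample hashtags are all made up for this example. It keeps at most `capacity` counters, and when a new item arrives while the structure is full, it evicts the item with the smallest count and lets the newcomer inherit that count as its possible overestimate.

```python
class SpaceSaving:
    """Approximate heavy-hitters counter (hypothetical helper, not Algebird's API)."""

    def __init__(self, capacity):
        self.capacity = capacity  # maximum number of items tracked at once
        self.counts = {}          # item -> estimated count (never an underestimate)
        self.errors = {}          # item -> maximum possible overestimate

    def add(self, item):
        if item in self.counts:
            # Already tracked: just bump its counter.
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            # Room left: start tracking with an exact count.
            self.counts[item] = 1
            self.errors[item] = 0
        else:
            # Full: evict the item with the smallest count. The newcomer
            # inherits that count, which bounds how far it may be overestimated.
            # This linear scan keeps the sketch short; the StreamSummary
            # structure exists to avoid it.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            del self.errors[victim]
            self.counts[item] = floor + 1
            self.errors[item] = floor

    def top(self, k):
        # Sorts only the tracked items (at most `capacity`), not every hashtag seen.
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


# Usage with made-up hashtags and a deliberately tiny capacity.
ss = SpaceSaving(capacity=3)
for tag in ["#a", "#b", "#a", "#c", "#a", "#d", "#b"]:
    ss.add(tag)
print(ss.top(1))
```

The point for the top-10 case is that `top` only ever sorts `capacity` entries, so the cost no longer depends on how many distinct hashtags show up in the window. For any tracked item, `counts[item] - errors[item]` is a lower bound on its true count, which helps judge how much to trust a given ranking.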