This might help:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopularTags.scala
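
For the per-partition approach from the AMP Camp note quoted below, here is a minimal sketch. It assumes the counts are already available as an RDD[(String, Int)] of (hashtag, count) pairs; the names TopHashtagsSketch, localTopK and topK are made up for illustration and are not part of Spark or of the linked example. Each partition keeps only its own k largest pairs, the driver collects those few candidates, and the final top k is picked among them, so the full data set is never sorted or shuffled.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of Spark.
object TopHashtagsSketch {
  // Orders (tag, count) pairs by count, largest first.
  private val byCountDesc: Ordering[(String, Int)] =
    Ordering.by[(String, Int), Int](_._2).reverse

  // Keep only the k pairs with the largest counts from one iterator.
  // This materializes the iterator; a bounded priority queue would use less memory.
  def localTopK(pairs: Iterator[(String, Int)], k: Int): Iterator[(String, Int)] =
    pairs.toArray.sorted(byCountDesc).take(k).iterator

  // Top k per partition, collected at the driver, then merged into the overall top k.
  def topK(counts: RDD[(String, Int)], k: Int): Array[(String, Int)] = {
    val candidates = counts.mapPartitions(iter => localTopK(iter, k)).collect()
    localTopK(candidates.iterator, k).toArray
  }

  def main(args: Array[String]): Unit = {
    // Local master and sample data are assumptions for a quick standalone run.
    val conf = new SparkConf().setAppName("TopHashtagsSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val counts = sc.parallelize(
        Seq(("#spark", 42), ("#scala", 17), ("#bigdata", 30), ("#streaming", 8)),
        numSlices = 2)
      topK(counts, 2).foreach { case (tag, n) => println(s"$tag: $n") }
    } finally {
      sc.stop()
    }
  }
}
```

In the streaming exercise you would presumably call topK(rdd, 10) inside foreachRDD on the windowed counts DStream. Spark also has a built-in rdd.top(10)(Ordering.by[(String, Int), Int](_._2)) which, as far as I know, works in a similar per-partition way, so that may be all you need.
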

Thanks
Best Regards

On Tue, Nov 4, 2014 at 6:03 AM, Harold Nguyen <har...@nexgate.com> wrote:

> Hi all,
>
> I was just reading this nice documentation here:
>
> http://ampcamp.berkeley.edu/3/exercises/realtime-processing-with-spark-streaming.html
>
> And got to the end of it, which says:
>
> "Note that there are more efficient ways to get the top 10 hashtags. For
> example, instead of sorting the entire of 5-minute-counts (thereby,
> incurring the cost of a data shuffle), one can get the top 10 hashtags in
> each partition, collect them together at the driver and then find the top
> 10 hashtags among them. We leave this as an exercise for the reader to try."
>
> I was just wondering if anyone had managed to do this, and was willing to
> share as an example :) This seems to be the exact use case that will help
> me!
>
> Thanks!
>
> Harold
>