Hi all, I was just reading this nice documentation here: http://ampcamp.berkeley.edu/3/exercises/realtime-processing-with-spark-streaming.html
And got to the end of it, which says:

"Note that there are more efficient ways to get the top 10 hashtags. For example, instead of sorting the entire of 5-minute-counts (thereby, incurring the cost of a data shuffle), one can get the top 10 hashtags in each partition, collect them together at the driver and then find the top 10 hashtags among them. We leave this as an exercise for the reader to try."

I was just wondering if anyone had managed to do this, and was willing to share an example :) This seems to be exactly the use case that would help me!

Thanks!
Harold
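To sketch what I think the exercise is asking for, here is the pattern in plain Python rather than the tutorial's Scala, with made-up hashtag counts standing in for the partitions of the 5-minute-counts RDD (the partition data, tag names, and `top_k_per_partition` helper are all hypothetical, just to illustrate the idea):

```python
import heapq

# Hypothetical 5-minute hashtag counts, pre-split into two partitions.
# After a reduceByKey, each hashtag would live in exactly one partition,
# so the tags here are distinct across partitions.
partitions = [
    [("#spark", 50), ("#scala", 30), ("#bigdata", 20)],
    [("#ml", 40), ("#streaming", 25), ("#ai", 10)],
]

def top_k_per_partition(partition, k=10):
    # The work a mapPartitions-style step would do on each partition:
    # keep only that partition's k largest counts, so we never sort
    # (or shuffle) the full dataset.
    return heapq.nlargest(k, partition, key=lambda pair: pair[1])

# "Collect" each partition's small candidate list at the driver...
candidates = [pair for part in partitions for pair in top_k_per_partition(part)]

# ...then take the global top 10 from the few collected candidates.
top10 = heapq.nlargest(10, candidates, key=lambda pair: pair[1])
```

With at most 10 candidates per partition, the driver only ever sorts `10 * numPartitions` pairs instead of the whole counts RDD. (If I understand the Spark API right, `RDD.top(n)` with a count-based `Ordering` does essentially this partition-wise reduction for you, so it may be worth checking that too.)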