Hi all,

I was just reading this nice documentation here:
http://ampcamp.berkeley.edu/3/exercises/realtime-processing-with-spark-streaming.html

And got to the end of it, which says:

"Note that there are more efficient ways to get the top 10 hashtags. For
example, instead of sorting the entire set of 5-minute counts (thereby
incurring the cost of a data shuffle), one can get the top 10 hashtags in
each partition, collect them together at the driver and then find the top
10 hashtags among them. We leave this as an exercise for the reader to try."

I was just wondering if anyone had managed to do this and would be willing
to share it as an example :) This seems to be exactly the use case that
would help me!
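In case it helps the discussion, here's my rough understanding of the idea in plain Python (just simulating partitions as lists of (hashtag, count) pairs — not actual Spark code; in Spark I assume this would be something like `mapPartitions` to get per-partition top-10s, then a `collect` at the driver). The hashtags and counts below are made up for illustration:

```python
import heapq

def top10(pairs):
    # Top 10 (hashtag, count) pairs within one partition, by count.
    return heapq.nlargest(10, pairs, key=lambda kv: kv[1])

# Simulated partitions. After a reduceByKey, each hashtag should live in
# exactly one partition, so merging per-partition top-10s is safe.
partitions = [
    [("#spark", 42), ("#scala", 7), ("#bigdata", 30)],
    [("#streaming", 55), ("#ml", 3)],
]

# Step 1: top 10 within each partition (in Spark: rdd.mapPartitions(...)).
candidates = [kv for part in partitions for kv in top10(part)]

# Step 2: collect the small candidate list at the driver and take the
# overall top 10 from it -- no full sort or shuffle of all the counts.
overall_top10 = heapq.nlargest(10, candidates, key=lambda kv: kv[1])
print(overall_top10)
```

The point, as I read it, is that only (10 × number-of-partitions) candidate pairs ever reach the driver, instead of sorting every count. Corrections welcome if I've misunderstood!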

Thanks!

Harold
