Good suggestion, TD.

And I believe the optimization that jon.burns is referring to, from the
big data mini course, is a step earlier: the sorting mechanism that
produces sortedCounts.

You can use mapPartitions() to compute a top k locally on each partition,
then shuffle only (k * number of partitions) elements to the driver for
the final sort, versus shuffling the whole dataset from all partitions.
That saves a lot of network I/O.
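
To make it concrete, here is a rough sketch of the idea (untested; the RDD
name, the sample data, and the object/app names are all made up for
illustration, not the exact code from the course):

import org.apache.spark.{SparkConf, SparkContext}

object TopKSketch {
  def main(args: Array[String]): Unit = {
    // local[2] is just for a quick local test; drop setMaster when submitting to a cluster.
    val sc = new SparkContext(new SparkConf().setAppName("TopKSketch").setMaster("local[2]"))

    // Made-up (word, count) pairs standing in for the RDD behind sortedCounts.
    val counts = sc.parallelize(Seq(("spark", 10), ("scala", 7), ("kafka", 3), ("flume", 1)))
    val k = 2

    // Top k within each partition: at most k elements per partition survive,
    // so only (k * number of partitions) elements ever cross the network.
    val partialTopK = counts.mapPartitions { iter =>
      iter.toSeq.sortBy { case (_, count) => -count }.take(k).iterator
    }

    // Final sort/merge on the driver over the small, pre-filtered set.
    val topK = partialTopK.collect().sortBy { case (_, count) => -count }.take(k)

    topK.foreach(println)
    sc.stop()
  }
}

If all you need is the top k, RDD.takeOrdered(k) / RDD.top(k) should do
essentially the same thing under the hood (a bounded per-partition
structure merged on the driver), so they may be the simplest way to get
this for free.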


On Tue, Jul 15, 2014 at 9:41 AM, jon.burns <jon.bu...@uleth.ca> wrote:

> It works perfectly, thanks! I feel like I should have figured that out; I'll
> chalk it up to inexperience with Scala. Thanks again.
