Good suggestion, TD. And I believe the optimization jon.burns is referring to (from the big data mini course) is a step earlier: the sorting mechanism that produces sortedCounts.
You can use mapPartitions() to compute a top k locally on each partition, then shuffle only k * (number of partitions) elements to the driver for the final sort, instead of shuffling the whole dataset from all partitions. It's a network-I/O-saving technique.

On Tue, Jul 15, 2014 at 9:41 AM, jon.burns <jon.bu...@uleth.ca> wrote:
> It works perfect, thanks! I feel like I should have figured that out; I'll
> chalk it up to inexperience with Scala. Thanks again.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-collect-take-functionality-tp9670p9772.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
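A rough sketch of the idea in plain Scala, using a Seq of Seqs to stand in for RDD partitions (all names here are illustrative, not from the course): in Spark the per-partition step would go inside rdd.mapPartitions { iter => ... } and the final merge would run on the driver after collect().

```scala
// Each "partition" keeps only its k largest (word, count) pairs,
// so the driver only ever sees k * numPartitions candidates.
def topKPerPartition(partition: Iterator[(String, Int)], k: Int): Iterator[(String, Int)] =
  partition.toList.sortBy(-_._2).take(k).iterator

def topK(partitions: Seq[Seq[(String, Int)]], k: Int): Seq[(String, Int)] = {
  // Step 1 (mapPartitions): each partition emits at most k elements.
  val candidates = partitions.flatMap(p => topKPerPartition(p.iterator, k))
  // Step 2 (driver): sort only the shipped candidates and take the global top k.
  candidates.sortBy(-_._2).take(k)
}

val parts = Seq(
  Seq(("a", 5), ("b", 1), ("c", 9)),
  Seq(("d", 7), ("e", 2))
)
println(topK(parts, 2))  // global top 2 from only 4 shipped candidates
```

Note that Spark's built-in RDD.top(k) / takeOrdered(k) already implement essentially this per-partition pattern, so on a plain RDD you may not need to hand-roll it.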