Why are you repartitioning 1? That would obviously be slow, you are converting a distributed operation to a single node operation. Also consider using RDD.top(). If you define the ordering right (based on the count), then you will get top K across then without doing a shuffle for sortByKey. Much cheaper.
TD On Thu, Mar 12, 2015 at 11:06 AM, Laeeq Ahmed <laeeqsp...@yahoo.com.invalid> wrote: > Hi, > > I have a streaming application where am doing top 10 count in each window > which seems slow. Is there efficient way to do this. > > val counts = keyAndValues.map(x => > math.round(x._3.toDouble)).countByValueAndWindow(Seconds(4), Seconds(4)) > val topCounts = counts.repartition(1).map(_.swap).transform(rdd => > rdd.sortByKey(false)).map(_.swap).mapPartitions(rdd => rdd.take(10)) > > Regards, > Laeeq > >