I believe it will be most efficient to let top(n) do the work, rather than sorting the whole RDD and then taking the first n. The reason is that top and takeOrdered know they need at most n elements from each partition, so they only have to merge those per-partition results. The whole data set never needs to be sorted.
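To illustrate the idea, here is a rough sketch in plain Scala collections, with no Spark involved. It is not Spark's actual implementation, just the shape of the argument: each "partition" is cut down to at most n elements, and only those small results are merged. The names TopNSketch, topOfPartition, merge and topN are made up for this example.

```scala
// Plain-Scala sketch of "keep at most n per partition, then merge".
// Hypothetical helper names; this is NOT how Spark implements top().
object TopNSketch {
  // Keep only the n largest elements of a single partition.
  // Sorting one partition is done here for brevity; a bounded heap
  // would avoid even that.
  def topOfPartition[T](partition: Seq[T], n: Int)(implicit ord: Ordering[T]): Seq[T] =
    partition.sorted(ord.reverse).take(n)

  // Merge two partial results, again keeping at most n elements.
  def merge[T](a: Seq[T], b: Seq[T], n: Int)(implicit ord: Ordering[T]): Seq[T] =
    (a ++ b).sorted(ord.reverse).take(n)

  // Reduce every partition to its own top n, then fold the small results together.
  def topN[T](partitions: Seq[Seq[T]], n: Int)(implicit ord: Ordering[T]): Seq[T] =
    partitions
      .map(p => topOfPartition(p, n))
      .foldLeft(Seq.empty[T])((acc, p) => merge(acc, p, n))

  def main(args: Array[String]): Unit = {
    // Three pretend partitions of (endpoint, count) pairs.
    val partitions = Seq(
      Seq(("/a", 5), ("/b", 1), ("/c", 9)),
      Seq(("/d", 7), ("/e", 2)),
      Seq(("/f", 3))
    )
    // Compare pairs by their count only.
    val byCount = Ordering.by[(String, Int), Int](_._2)
    println(topN(partitions, 2)(byCount))
  }
}
```

No step ever sorts more than one partition or more than 2n merged elements, which is why the full sort in sortByKey(false).take(10) is more work than needed.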
I also believe it will be marginally faster to provide an Ordering than to swap the pairs just to use the natural Ordering, since the swap adds a map step over every pair, but I don't know whether the difference is significant. Note that I think you can write "Ordering.by(_._2)" to be more concise (not 100% sure about the syntax off the top of my head).

On Tue, Oct 20, 2015 at 3:56 PM, Carol McDonald <cmcdon...@maprtech.com> wrote:
> To find the top 10 counts , which is better using top(10) with Ordering on
> the value,
> or swapping the key value and ordering on the key ? For example which is
> better below ?
> Or does it matter
>
> val top10 = logs.filter(log => log.responseCode != 200).map(log =>
> (log.endpoint, 1)).reduceByKey(_ + _).top(10)(Ordering[Long].on(x=>x._2))
>
>
> val top10 = logs.filter(log => log.responseCode != 200).map(log =>
> (log.endpoint,
> 1)).reduceByKey((x,y)=>x+y).map(x=>(x._2,x._1)).sortByKey(false).take(10)
>
>
> val top10 = logs.filter(log => log.responseCode != 200).map(log =>
> (log.endpoint, 1)).reduceByKey((x,y)=>x+y).map(pair => pair.swap).top(10)
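On the Ordering point, here is a small comparison on an ordinary Scala collection rather than an RDD, so it runs without Spark. The sample data and the names OrderingSketch, byCount, viaOrdering and viaSwap are invented for illustration. I spell out the type parameters on Ordering.by to sidestep any inference question about the shorter form.

```scala
// Custom Ordering versus swapping pairs, on a plain Seq (no Spark needed).
// All names and data here are hypothetical.
object OrderingSketch {
  val counts: Seq[(String, Int)] =
    Seq(("/a", 5), ("/b", 1), ("/c", 9), ("/d", 7))

  // Orders pairs by their second element (the count); pairs stay as they are.
  val byCount: Ordering[(String, Int)] =
    Ordering.by[(String, Int), Int](_._2)

  // Option 1: pass the custom Ordering, no reshaping of the data.
  val viaOrdering: Seq[(String, Int)] =
    counts.sorted(byCount.reverse).take(2)

  // Option 2: swap each pair so the natural tuple Ordering sees the count first.
  val viaSwap: Seq[(Int, String)] =
    counts.map(_.swap).sorted(Ordering[(Int, String)].reverse).take(2)

  def main(args: Array[String]): Unit = {
    println(viaOrdering)
    println(viaSwap)
  }
}
```

On an RDD of (String, Int) pairs the same byCount value can be passed as top(10)(byCount). One difference to keep in mind: the natural tuple Ordering breaks ties in the count by comparing the endpoint string, while byCount treats pairs with equal counts as equal, so the two approaches can pick different elements when counts tie.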