This works:

val top10 = logs.filter(log => log.responseCode != 200).map(log => (log.endpoint, 1)).reduceByKey(_ + _).top(10)(Ordering.by(_._2))

or

val top10 = logs.filter(log => log.responseCode != 200).map(log => (log.endpoint, 1)).reduceByKey(_ + _).top(10)(Ordering.by(_._2))

On Tue, Oct 20, 2015 at 11:07 AM, Sean Owen <so...@cloudera.com> wrote:

> I believe it will be most efficient to let top(n) do the work, rather than
> sort the whole RDD and then take the first n. The reason is that top and
> takeOrdered know they need at most n elements from each partition, and then
> just need to merge those. It's never required to sort the whole thing.
>
> I also believe it will be marginally faster to provide an Ordering rather
> than swap pairs just to use the natural Ordering, but, I don't know if it's
> significant.
>
> Note that I think you can write "Ordering.by(_._2)" to be more concise
> (not 100% sure about the syntax off the top of my head).
>
> On Tue, Oct 20, 2015 at 3:56 PM, Carol McDonald <cmcdon...@maprtech.com>
> wrote:
>
>> To find the top 10 counts , which is better using top(10) with Ordering
>> on the value,
>> or swapping the key value and ordering on the key ? For example which is
>> better below ?
>> Or does it matter
>>
>> val top10 = logs.filter(log => log.responseCode != 200).map(log =>
>> (log.endpoint, 1)).reduceByKey(_ + _).top(10)(Ordering[Long].on(x=>x._2))
>>
>> val top10 = logs.filter(log => log.responseCode != 200).map(log =>
>> (log.endpoint,
>> 1)).reduceByKey((x,y)=>x+y).map(x=>(x._2,x._1)).sortByKey(false).take(10)
>>
>> val top10 = logs.filter(log => log.responseCode != 200).map(log =>
>> (log.endpoint, 1)).reduceByKey((x,y)=>x+y).map(pair => pair.swap).top(10)
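
For anyone reading this thread later, here is a rough sketch of the idea Sean describes: keep at most n elements per partition, then merge those small results. It is plain Scala with no Spark dependency, and it is not Spark's actual implementation. The object TopNSketch, the helpers topOfPartition and topN, and the sample endpoints are all made up for illustration; each inner Seq stands in for one RDD partition.

```scala
import scala.collection.mutable

object TopNSketch {
  // Hypothetical helper: keep only the n largest elements of one "partition"
  // using a min-heap that never grows beyond n entries.
  def topOfPartition[T](part: Iterator[T], n: Int)(implicit ord: Ordering[T]): Seq[T] = {
    // With the reversed ordering, heap.head is the smallest element kept so far.
    val heap = mutable.PriorityQueue.empty[T](ord.reverse)
    part.foreach { x =>
      if (heap.size < n) heap.enqueue(x)
      else if (n > 0 && ord.gt(x, heap.head)) {
        heap.dequeue() // evict the current smallest
        heap.enqueue(x)
      }
    }
    heap.toSeq
  }

  // Hypothetical helper: merge the per-partition winners (at most n each)
  // and keep the n largest overall. Only this small merged set is sorted.
  def topN[T](partitions: Seq[Seq[T]], n: Int)(implicit ord: Ordering[T]): Seq[T] =
    partitions.flatMap(p => topOfPartition(p.iterator, n)).sorted(ord.reverse).take(n)

  def main(args: Array[String]): Unit = {
    // Made-up (endpoint, count) pairs spread over three "partitions".
    val partitions = Seq(
      Seq(("/a", 5), ("/b", 1), ("/c", 9)),
      Seq(("/d", 7), ("/e", 2)),
      Seq(("/f", 3))
    )
    // Order by the count, the same idea as Ordering.by(_._2) above.
    val byCount = Ordering.by[(String, Int), Int](_._2)
    println(topN(partitions, 2)(byCount))
  }
}
```

The point of the sketch is that the only sort happens over at most n elements per partition, which is why top(n) should beat sortByKey followed by take(n) when n is small relative to the data.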