Re: Top 10 count
// sort by 2nd element
Sorting.quickSort(pairs)(Ordering.by[(String, Int, Int), Int](_._2))

// sort by the 3rd element, then 1st
Sorting.quickSort(pairs)(Ordering[(Int, String)].on(x => (x._3, x._1)))

On Tue, Oct 20, 2015 at 11:33 AM, Carol McDonald wrote:

> this works
>
> val top10 = logs.filter(log => log.responseCode != 200)
>   .map(log => (log.endpoint, 1))
>   .reduceByKey(_ + _)
>   .top(10)(Ordering.by(_._2))
>
> or
>
> val top10 = logs.filter(log => log.responseCode != 200)
>   .map(log => (log.endpoint, 1))
>   .reduceByKey(_ + _)
>   .top(10)(Ordering.by(_._2))
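The two `Sorting.quickSort` calls at the top of this message can be tried outside Spark. The following is a minimal sketch in plain Scala; the object name `TupleOrderingSketch`, the sample triples, and the helper names are made up for illustration and do not come from the thread. The lambda passed to `on` carries an explicit parameter type so that it compiles without relying on inference.

```scala
import scala.util.Sorting

object TupleOrderingSketch {
  // Hypothetical sample data: (endpoint, responseCode, count).
  val sample: Array[(String, Int, Int)] = Array(
    ("/a", 500, 3),
    ("/b", 404, 7),
    ("/c", 403, 3)
  )

  // Sort a copy by the 2nd element, ascending.
  def bySecond(pairs: Array[(String, Int, Int)]): Array[(String, Int, Int)] = {
    val copy = pairs.clone()
    Sorting.quickSort(copy)(Ordering.by[(String, Int, Int), Int](_._2))
    copy
  }

  // Sort a copy by the 3rd element, then by the 1st to break ties.
  def byThirdThenFirst(pairs: Array[(String, Int, Int)]): Array[(String, Int, Int)] = {
    val copy = pairs.clone()
    Sorting.quickSort(copy)(
      Ordering[(Int, String)].on((x: (String, Int, Int)) => (x._3, x._1))
    )
    copy
  }
}
```

`Sorting.quickSort` sorts the array in place, which is why the helpers work on a clone. An `Ordering` built this way can also be passed to `top(n)` on an RDD of tuples, as in the snippets quoted above.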
Re: Top 10 count
this works

val top10 = logs.filter(log => log.responseCode != 200)
  .map(log => (log.endpoint, 1))
  .reduceByKey(_ + _)
  .top(10)(Ordering.by(_._2))

or

val top10 = logs.filter(log => log.responseCode != 200)
  .map(log => (log.endpoint, 1))
  .reduceByKey(_ + _)
  .top(10)(Ordering.by(_._2))

On Tue, Oct 20, 2015 at 11:07 AM, Sean Owen wrote:

> I believe it will be most efficient to let top(n) do the work, rather than
> sort the whole RDD and then take the first n. The reason is that top and
> takeOrdered know they need at most n elements from each partition, and then
> just need to merge those. It's never required to sort the whole thing.
>
> I also believe it will be marginally faster to provide an Ordering rather
> than swap pairs just to use the natural Ordering, but, I don't know if it's
> significant.
>
> Note that I think you can write "Ordering.by(_._2)" to be more concise
> (not 100% sure about the syntax off the top of my head).
Re: Top 10 count
I believe it will be most efficient to let top(n) do the work, rather than sort the whole RDD and then take the first n. The reason is that top and takeOrdered know they need at most n elements from each partition, and then just need to merge those. It's never required to sort the whole thing.

I also believe it will be marginally faster to provide an Ordering rather than swap pairs just to use the natural Ordering, but, I don't know if it's significant.

Note that I think you can write "Ordering.by(_._2)" to be more concise (not 100% sure about the syntax off the top of my head).

On Tue, Oct 20, 2015 at 3:56 PM, Carol McDonald wrote:

> To find the top 10 counts, which is better: using top(10) with Ordering on
> the value, or swapping the key value and ordering on the key? For example
> which is better below? Or does it matter
>
> val top10 = logs.filter(log => log.responseCode != 200)
>   .map(log => (log.endpoint, 1))
>   .reduceByKey(_ + _)
>   .top(10)(Ordering[Long].on(x => x._2))
>
> val top10 = logs.filter(log => log.responseCode != 200)
>   .map(log => (log.endpoint, 1))
>   .reduceByKey((x, y) => x + y)
>   .map(x => (x._2, x._1))
>   .sortByKey(false)
>   .take(10)
>
> val top10 = logs.filter(log => log.responseCode != 200)
>   .map(log => (log.endpoint, 1))
>   .reduceByKey((x, y) => x + y)
>   .map(pair => pair.swap)
>   .top(10)
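The per-partition argument in Sean's reply can be illustrated without Spark. The sketch below is plain Scala over in-memory sequences standing in for partitions. It is not Spark's implementation, and the names `TopNSketch`, `topOfPartition`, and `topN` are hypothetical. It shows why a full sort is unnecessary: each partition keeps at most n candidates in a bounded heap, and only those candidates are merged.

```scala
import scala.collection.mutable

object TopNSketch {
  // Keep only the n largest elements of one partition using a bounded heap.
  def topOfPartition[T](part: Iterator[T], n: Int)(implicit ord: Ordering[T]): List[T] = {
    // PriorityQueue dequeues its largest element under the given ordering, so
    // reversing the ordering puts the smallest retained element at the head.
    val heap = mutable.PriorityQueue.empty[T](ord.reverse)
    part.foreach { x =>
      heap.enqueue(x)
      if (heap.size > n) heap.dequeue() // evict the current smallest
    }
    heap.toList
  }

  // Merge the per-partition candidates; at most n * partitions.size elements are sorted.
  def topN[T](partitions: Seq[Seq[T]], n: Int)(implicit ord: Ordering[T]): List[T] =
    partitions
      .flatMap(p => topOfPartition(p.iterator, n))
      .sorted(ord.reverse)
      .take(n)
      .toList
}
```

For example, with two made-up partitions of (endpoint, count) pairs, `TopNSketch.topN(Seq(Seq(("/a", 5), ("/b", 1), ("/e", 7)), Seq(("/c", 9), ("/d", 3))), 2)(Ordering.by[(String, Int), Int](_._2))` keeps two candidates per partition and then merges four elements instead of sorting all five. Passing the `Ordering` directly, as in this call, also avoids the extra `map` that swapping the pairs would need.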