Re: Top 10 count

Sean Owen Tue, 20 Oct 2015 08:15:48 -0700

I believe it will be most efficient to let top(n) do the work, rather than
sort the whole RDD and then take the first n. The reason is that top and
takeOrdered know they need at most n elements from each partition, and then
just need to merge those. It's never required to sort the whole thing.
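
To illustrate the idea, here is a minimal sketch in plain Scala, not Spark's actual implementation. It assumes partitions can be modelled as a Seq of Seqs; the names TopNSketch, topOfPartition and topN are made up for this example. Each partition keeps at most n elements in a bounded heap, and only those survivors are merged and sorted.

```scala
import scala.collection.mutable

object TopNSketch {
  // Keep the n largest elements of one partition using a bounded min-heap,
  // so the partition itself is never fully sorted.
  def topOfPartition[T](part: Iterator[T], n: Int)(implicit ord: Ordering[T]): Seq[T] = {
    // With the reversed ordering, the head of the queue is the smallest kept element.
    val heap = mutable.PriorityQueue.empty[T](ord.reverse)
    part.foreach { x =>
      heap.enqueue(x)
      if (heap.size > n) heap.dequeue() // drop the smallest once over capacity
    }
    heap.toSeq
  }

  // Merge the per-partition candidates (at most n from each) and keep the n largest.
  def topN[T](partitions: Seq[Seq[T]], n: Int)(implicit ord: Ordering[T]): Seq[T] =
    partitions
      .flatMap(p => topOfPartition(p.iterator, n))
      .sorted(ord.reverse)
      .take(n)
}
```

For example, with partitions Seq(Seq(("/a", 3), ("/b", 7)), Seq(("/c", 5), ("/d", 1))) and an ordering on the count, TopNSketch.topN(partitions, 2) yields ("/b", 7) followed by ("/c", 5). The final sort only ever sees at most n elements per partition.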

I also believe it will be marginally faster to provide an Ordering rather
than swap pairs just to use the natural Ordering, but I don't know if the
difference is significant.

Note that I think you can write "Ordering.by(_._2)" to be more concise (not
100% sure about the syntax off the top of my head).
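
As a rough check of that syntax, the sketch below builds the ordering on plain Scala collections rather than an RDD. The object name OrderingSketch and the sample data are made up for illustration. Spelling out the type parameters is one way to avoid type-inference problems with `_._2`; whether the shorter form compiles inside top(10)(...) may depend on the Scala version.

```scala
object OrderingSketch {
  // Hypothetical (endpoint, count) pairs standing in for the reduceByKey output.
  val counts: Seq[(String, Int)] = Seq(("/index", 4), ("/login", 9), ("/api", 6))

  // Order pairs by their second element; explicit type parameters help inference.
  val byCount: Ordering[(String, Int)] = Ordering.by[(String, Int), Int](_._2)

  // The same ordering written with Ordering[...].on(...), as in the question.
  val byCountOn: Ordering[(String, Int)] = Ordering[Int].on[(String, Int)](_._2)

  // Largest pair by count, and the two largest in descending order.
  val largest: (String, Int) = counts.max(byCount)
  val top2: Seq[(String, Int)] = counts.sorted(byCount.reverse).take(2)
}
```

Assuming the RDD holds (String, Int) pairs, the same byCount value could presumably be passed as the second argument list of top, for instance top(10)(byCount).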

On Tue, Oct 20, 2015 at 3:56 PM, Carol McDonald <[email protected]>
wrote:

> To find the top 10 counts, which is better: using top(10) with an Ordering
> on the value, or swapping the key and value and ordering on the key? For
> example, which of the versions below is better? Or does it not matter?
>
>  val top10 = logs.filter(log => log.responseCode != 200)
>    .map(log => (log.endpoint, 1))
>    .reduceByKey(_ + _)
>    .top(10)(Ordering[Long].on(x => x._2))
>
>
>  val top10 = logs.filter(log => log.responseCode != 200)
>    .map(log => (log.endpoint, 1))
>    .reduceByKey((x, y) => x + y)
>    .map(x => (x._2, x._1))
>    .sortByKey(false)
>    .take(10)
>
>
>  val top10 = logs.filter(log => log.responseCode != 200)
>    .map(log => (log.endpoint, 1))
>    .reduceByKey((x, y) => x + y)
>    .map(pair => pair.swap)
>    .top(10)
>
>
