Re: Top 10 count

2015-10-20 Thread Carol McDonald
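import scala.util.Sorting

// pairs is an Array[(String, Int, Int)]; quickSort sorts it in place
// sort by the 2nd element
Sorting.quickSort(pairs)(Ordering.by[(String, Int, Int), Int](_._2))
// sort by the 3rd element, then the 1st
Sorting.quickSort(pairs)(Ordering[(Int, String)].on(x => (x._3, x._1)))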
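
For anyone who wants to try the two calls above outside Spark, here is a minimal self-contained sketch. The object name QuickSortDemo and the sample tuples are made up for illustration; only Sorting.quickSort and the two Ordering expressions come from the message.

```scala
import scala.util.Sorting

object QuickSortDemo {
  // Hypothetical sample data; the thread does not show what pairs contains.
  def samplePairs: Array[(String, Int, Int)] =
    Array(("a", 5, 2), ("c", 3, 1), ("b", 1, 3))

  // Sort a fresh copy by the 2nd element of each tuple.
  def sortedBySecond: Array[(String, Int, Int)] = {
    val pairs = samplePairs
    Sorting.quickSort(pairs)(Ordering.by[(String, Int, Int), Int](_._2))
    pairs
  }

  // Sort a fresh copy by the 3rd element, breaking ties on the 1st.
  def sortedByThirdThenFirst: Array[(String, Int, Int)] = {
    val pairs = samplePairs
    Sorting.quickSort(pairs)(Ordering[(Int, String)].on(x => (x._3, x._1)))
    pairs
  }
}
```

Note that quickSort mutates the array it is given, which is why each method builds its own copy.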





Re: Top 10 count

2015-10-20 Thread Carol McDonald
This works:

val top10 = logs.filter(log => log.responseCode != 200)
  .map(log => (log.endpoint, 1))
  .reduceByKey(_ + _)
  .top(10)(Ordering.by(_._2))
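
To experiment with the same pipeline without a cluster, here is a rough local analogue on plain Scala collections. It is not Spark code: the Log case class, the object name Top10Local, and the use of groupBy in place of reduceByKey are all assumptions made for illustration. Only the field names endpoint and responseCode and the idea of ordering by the count come from the snippet above.

```scala
object Top10Local {
  // Hypothetical record type; only the two fields used in the thread are modelled.
  final case class Log(endpoint: String, responseCode: Int)

  // Local stand-in for filter -> map -> reduceByKey -> top(n)(Ordering.by(_._2)).
  def topEndpoints(logs: Seq[Log], n: Int): Seq[(String, Int)] = {
    // Count the non-200 responses per endpoint.
    val counts: Seq[(String, Int)] = logs
      .filter(_.responseCode != 200)
      .groupBy(_.endpoint)
      .map { case (endpoint, hits) => (endpoint, hits.size) }
      .toSeq
    // Order by the count (2nd tuple element), largest first, and keep n entries.
    counts.sorted(Ordering.by[(String, Int), Int](_._2).reverse).take(n)
  }
}
```

Unlike top(n) on an RDD, this version sorts every count, which is fine for a small local test but is exactly what top(n) avoids at scale.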



Re: Top 10 count

2015-10-20 Thread Sean Owen
I believe it will be most efficient to let top(n) do the work, rather than
sort the whole RDD and then take the first n. The reason is that top and
takeOrdered know they need at most n elements from each partition, and then
just need to merge those. The whole RDD never has to be sorted.

I also believe it will be marginally faster to provide an Ordering rather
than swap the pairs just to use the natural Ordering, but I don't know whether
the difference is significant.

Note that I think you can write "Ordering.by(_._2)" to be more concise (not
100% sure about the syntax off the top of my head).
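
To make the "at most n elements from each partition, then merge" idea concrete, here is a small sketch on plain Scala collections. It is not Spark's actual implementation; the object TopNSketch, the method names, and the modelling of partitions as a Seq of Seqs are assumptions for illustration only.

```scala
import scala.collection.mutable

object TopNSketch {
  // Keep only the n largest elements of one partition, using a bounded heap
  // whose head is the smallest element kept so far.
  def topOfPartition[T](part: Iterator[T], n: Int)(implicit ord: Ordering[T]): Seq[T] = {
    val heap = mutable.PriorityQueue.empty[T](ord.reverse)
    part.foreach { x =>
      if (heap.size < n) heap.enqueue(x)
      else if (n > 0 && ord.gt(x, heap.head)) {
        heap.dequeue() // drop the smallest kept element
        heap.enqueue(x)
      }
    }
    heap.toSeq
  }

  // Merge the per-partition candidates; only partitions.size * n elements
  // at most are ever sorted here, never the full data set.
  def top[T](partitions: Seq[Seq[T]], n: Int)(implicit ord: Ordering[T]): Seq[T] =
    partitions
      .flatMap(p => topOfPartition(p.iterator, n))
      .sorted(ord.reverse)
      .take(n)
}
```

With (endpoint, count) pairs split across two made-up partitions, a call such as TopNSketch.top(parts, 2)(Ordering.by(_._2)) orders on the count directly, so no swap is needed.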



On Tue, Oct 20, 2015 at 3:56 PM, Carol McDonald wrote:

> To find the top 10 counts, which is better: using top(10) with an Ordering
> on the value, or swapping the key and value and ordering on the key? For
> example, which of the variants below is better, or does it not matter?
>
> val top10 = logs.filter(log => log.responseCode != 200)
>   .map(log => (log.endpoint, 1))
>   .reduceByKey(_ + _)
>   .top(10)(Ordering[Long].on(x => x._2))
>
> val top10 = logs.filter(log => log.responseCode != 200)
>   .map(log => (log.endpoint, 1))
>   .reduceByKey((x, y) => x + y)
>   .map(x => (x._2, x._1))
>   .sortByKey(false)
>   .take(10)
>
> val top10 = logs.filter(log => log.responseCode != 200)
>   .map(log => (log.endpoint, 1))
>   .reduceByKey((x, y) => x + y)
>   .map(pair => pair.swap)
>   .top(10)