Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

Michael Armbrust Thu, 03 Nov 2016 22:01:17 -0700

Thinking out loud is good :)

You are right in that anytime you ask for a global ordering from Spark you
will pay the cost of figuring out the range boundaries for partitions.  If
you say orderBy, though, we aren't sure that you aren't expecting a global
order.


If you only want to make sure that items are colocated, it is cheaper to do
a groupByKey followed by a flatMapGroups
<https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1828840559545742/2840265927289860/latest.html>
.



On Thu, Nov 3, 2016 at 7:31 PM, Koert Kuipers <ko...@tresata.com> wrote:

> i guess i could sort by (hashcode(key), key, secondarySortColumn) and then
> do mapPartitions?
>
> sorry thinking out loud a bit here. ok i think that could work. thanks
>
> On Thu, Nov 3, 2016 at 10:25 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> thats an interesting thought about orderBy and mapPartitions. i guess i
>> could emulate a groupBy with secondary sort using those two. however isn't
>> using an orderBy expensive since it is a total sort? i mean a groupBy with
>> secondary sort is also a total sort under the hood, but its on
>> (hashCode(key), secondarySortColumn) which is easier to distribute and
>> therefore can be implemented more efficiently.
>>
>>
>>
>>
>>
>> On Thu, Nov 3, 2016 at 8:59 PM, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>>> It is still unclear to me why we should remember all these tricks (or
>>>> add lots of extra little functions) when this elegantly can be expressed in
>>>> a reduce operation with a simple one line lamba function.
>>>>
>>> I think you can do that too.  KeyValueGroupedDataset has a reduceGroups
>>> function.  This probably won't be as fast though because you end up
>>> creating objects where as the version I gave will get codgened to operate
>>> on binary data the whole way though.
>>>
>>>> The same applies to these Window functions. I had to read it 3 times to
>>>> understand what it all means. Maybe it makes sense for someone who has been
>>>> forced to use such limited tools in sql for many years but that's not
>>>> necessary what we should aim for. Why can I not just have the sortBy and
>>>> then an Iterator[X] => Iterator[Y] to express what I want to do?
>>>>
>>> We also have orderBy and mapPartitions.
>>>
>>>> All these functions (rank etc.) can be trivially expressed in this,
>>>> plus I can add other operations if needed, instead of being locked in like
>>>> this Window framework.
>>>>
>>>  I agree that window functions would probably not be my first choice for
>>> many problems, but for people coming from SQL it was a very popular
>>> feature.  My real goal is to give as many paradigms as possible in a single
>>> unified framework.  Let people pick the right mode of expression for any
>>> given job :)
>>>
>>
>>
>

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

Reply via email to