I agree with Koert. Relying on behaviour because it appears to work when you 
test it is dangerous if nothing in the API actually guarantees it.

Going back quite a few years, Oracle used to return GROUP BY results with the 
rows sorted in the order of the grouping key. This was a side effect of the 
implementation specifics of GROUP BY, not a documented guarantee. Then at some 
point Oracle introduced a new hash-based GROUP BY mechanism that the cost-based 
optimizer could choose, and all of a sudden lots of people's applications 
'broke' because they had been relying on behaviour that had always worked in 
the past but was never actually guaranteed.

TLDR - don’t rely on functionality that isn’t specified
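
For the original question below, the same result can be had with semantics 
that actually are specified, via a window function. A minimal sketch against 
the Spark 1.6 DataFrame API, reusing the column names from the query further 
down (untested, illustrative only):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, row_number}

// Rank rows within each mobileno group, newest transaction first.
val w = Window.partitionBy("mobileno").orderBy(desc("transaction_date"))

val latest = df
  .withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)  // keep only the latest row per group
  .drop("rn")
  .select("customername", "service_type", "mobileno", "cust_addr")

Here the ordering is part of the window definition itself, so the optimizer 
is not free to rearrange it, unlike an orderBy followed by a groupBy.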


> On 3 Nov 2016, at 14:37, Koert Kuipers <ko...@tresata.com> wrote:
> 
> I did not check the claim in that blog post that the data is ordered, but I 
> wouldn't rely on that behavior, since it is not something the API guarantees 
> and it could change in future versions.
> 
> On Thu, Nov 3, 2016 at 9:59 AM, Rabin Banerjee <dev.rabin.baner...@gmail.com> wrote:
> Hi Koert & Robin,
> 
>   Thanks! But if you go through the blog 
> https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
> and check the comments under it, the approach actually works, although I am 
> not sure how. And yes, I agree a custom aggregate UDAF is a good option.
> 
> Can anyone share the best way to implement this in Spark?
> 
> Regards,
> Rabin Banerjee 
> 
> On Thu, Nov 3, 2016 at 6:59 PM, Koert Kuipers <ko...@tresata.com> wrote:
> Just realized you only want to keep the first element. You can do this 
> without sorting, by doing something similar to a min or max operation with a 
> custom aggregator/UDAF, or with reduceGroups on a Dataset; see the sketch 
> below. This is also more efficient.
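> 
> A rough sketch of the reduceGroups route, written against the Spark 2.0 
> Dataset API (the GroupedDataset API in 1.6 is similar but not identical). 
> The Txn case class and the SparkSession named spark are assumptions; the 
> fields just mirror the columns in your query:
> 
> import spark.implicits._  // assumption: a SparkSession named spark is in scope
> 
> case class Txn(mobileno: String, transaction_date: java.sql.Timestamp,
>                customername: String, service_type: String, cust_addr: String)
> 
> // For each mobileno, keep the row with the latest transaction_date.
> val latest = df.as[Txn]
>   .groupByKey(_.mobileno)
>   .reduceGroups((a, b) => if (a.transaction_date.after(b.transaction_date)) a else b)
>   .map(_._2)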
> 
> 
> On Nov 3, 2016 7:53 AM, "Rabin Banerjee" <dev.rabin.baner...@gmail.com> wrote:
> Hi All ,
> 
>   I want to do a DataFrame operation to find the rows having the latest 
> timestamp in each group, using the operation below:
> 
> df.orderBy(desc("transaction_date"))
>   .groupBy("mobileno")
>   .agg(first("customername").as("customername"),
>        first("service_type").as("service_type"),
>        first("cust_addr").as("cust_addr"))
>   .select("customername", "service_type", "mobileno", "cust_addr")
> 
> Spark version: 1.6.x
> My question is: will Spark guarantee the order during the groupBy if the 
> DataFrame was previously ordered with orderBy, in Spark 1.6.x?
> 
> I referred to a blog here: 
> https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
> which claims it will work, except in Spark 1.5.1 and 1.5.2.
> 
> Could someone elaborate on how Spark handles this internally? Also, is it 
> more efficient than using a window function?
> 
> Thanks in advance,
> Rabin Banerjee