You are looking to perform an *argmax*, which you can do with a single aggregation. Here is an example: <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/3170497669323442/2840265927289860/latest.html>
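For reference, a minimal sketch of the single-aggregation argmax, using the column names from your question below (the linked notebook may differ in details). The trick is that max() over a struct compares fields left to right, so putting the timestamp first selects the whole row with the latest timestamp per group:

    import org.apache.spark.sql.functions.{col, max, struct}

    // One aggregation per group: max over a struct whose first field
    // is the timestamp, so the comparison is driven by transaction_date.
    val latest = df
      .groupBy("mobileno")
      .agg(max(struct(
        col("transaction_date"),
        col("customername"),
        col("service_type"),
        col("cust_addr")
      )).as("latest"))
      // Unpack the winning struct back into top-level columns.
      .select(
        col("latest.customername"),
        col("latest.service_type"),
        col("mobileno"),
        col("latest.cust_addr")
      )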
On Thu, Nov 3, 2016 at 4:53 AM, Rabin Banerjee <dev.rabin.baner...@gmail.com> wrote:
> Hi All,
>
> I want to find the rows with the latest timestamp in each group using the
> following DataFrame operation:
>
>   df.orderBy(desc("transaction_date"))
>     .groupBy("mobileno")
>     .agg(first("customername").as("customername"),
>          first("service_type").as("service_type"),
>          first("cust_addr").as("cust_addr"))
>     .select("customername", "service_type", "mobileno", "cust_addr")
>
> Spark version: 1.6.x
>
> My question is: *will Spark guarantee the order during the groupBy if the
> DataFrame was previously ordered with orderBy, in Spark 1.6.x?*
>
> I referred to a blog post here:
> https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
> which claims it will work except in Spark 1.5.1 and 1.5.2.
>
> I would appreciate some elaboration on how Spark handles this internally.
> Also, is it more efficient than using a window function?
>
> Thanks in advance,
> Rabin Banerjee
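For comparison, the window-function formulation asked about above would look roughly like this (a sketch against the 1.6 API; the window approach requires a sort within each partition, whereas the struct argmax is a single pass of aggregation):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, desc, row_number}

    // Rank rows within each mobileno group by descending timestamp,
    // then keep only the top-ranked (latest) row per group.
    val w = Window.partitionBy("mobileno").orderBy(desc("transaction_date"))

    val latestViaWindow = df
      .withColumn("rn", row_number().over(w))
      .where(col("rn") === 1)
      .select("customername", "service_type", "mobileno", "cust_addr")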