You need to use window functions to get this kind of behavior, or take the max of a struct (see http://stackoverflow.com/questions/13523049/hive-sql-find-the-latest-record).
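
A rough sketch of both options, in case it helps. This is only an illustration under some assumptions: it uses the SparkSession entry point and row_number from a more recent Spark release than the one current at the time of this thread, the column names are the ones from your snippet, and the object name LatestPerGroup and the sample rows are made up for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, row_number, struct}

// Hypothetical helper object, not part of Spark.
object LatestPerGroup {

  // Option 1: aggregate with max over a struct. Structs compare field by
  // field, so putting col2 first selects the row with the highest col2.
  def viaStruct(df: DataFrame): DataFrame =
    df.groupBy(col("col1"))
      .agg(max(struct(col("col2"), col("col3"))).as("top"))
      .select(col("col1"), col("top").getField("col3").as("col3"))

  // Option 2: number the rows inside each col1 partition, highest col2
  // first, and keep only the first row of each partition.
  def viaWindow(df: DataFrame): DataFrame = {
    val byCol1 = Window.partitionBy(col("col1")).orderBy(col("col2").desc)
    df.withColumn("rn", row_number().over(byCol1))
      .where(col("rn") === 1)
      .select(col("col1"), col("col3"))
  }

  def main(args: Array[String]): Unit = {
    // Local session for the example only (assumed setup).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("latest-per-group")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data.
    val df = Seq(("a", 1, "x"), ("a", 3, "y"), ("a", 2, "z"), ("b", 5, "p"), ("b", 4, "q"))
      .toDF("col1", "col2", "col3")

    viaStruct(df).show()
    viaWindow(df).show()
    spark.stop()
  }
}
```

Neither version relies on row order surviving the groupBy. They can differ when two rows in a group share the same highest col2: the struct version then falls back to comparing col3, while the window version keeps whichever of the tied rows is numbered first.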

On Thu, Dec 17, 2015 at 11:55 PM, Timothée Carayol <timothee.cara...@gmail.com> wrote:

> Hi all,
>
> I tried to do something like the following in Spark
>
> df.orderBy('col1, 'col2).groupBy('col1).agg(first('col3))
>
> I was hoping to get, within each col1 value, the value for col3 that
> corresponds to the highest value for col2 within that col1 group. This only
> works if the order on col2 is preserved after the groupBy step.
>
> https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
> suggests that it is (unlike RDD.groupBy, DataFrame.groupBy is described as
> preserving the order).
>
> Yet in my experiments, I find that in some cases the order is not
> preserved. Running the same code multiple times gives me different results.
>
> If this is a bug, I'll happily work on a reproducible example and post to
> JIRA but I thought I'd check with the mailing list first in case that is,
> in fact, the expected behaviour?
>
> Thanks
> Timothée