Re: Is DataFrame.groupBy supposed to preserve order within groups?

Michael Armbrust Fri, 18 Dec 2015 14:18:40 -0800

You need to use window functions to get this kind of behavior. Alternatively,
take the max of a struct (
http://stackoverflow.com/questions/13523049/hive-sql-find-the-latest-record)
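
A minimal sketch of both suggestions, for illustration only. It assumes the column names from the question (col1, col2, col3), a Spark version that provides row_number (1.6 or later), and a col3 type that Spark can order. The object and method names (LatestPerGroup, viaWindow, viaMaxStruct) are hypothetical, not part of Spark.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, row_number, struct}

// Hypothetical helper object; the names are illustrative only.
object LatestPerGroup {

  // Window-function approach: number the rows of each col1 group by col2
  // descending (the question asks for the highest col2), then keep row 1.
  def viaWindow(df: DataFrame): DataFrame = {
    val byCol1 = Window.partitionBy(col("col1")).orderBy(col("col2").desc)
    df.withColumn("rn", row_number().over(byCol1))
      .where(col("rn") === 1)
      .select(col("col1"), col("col3"))
  }

  // max-of-struct approach: structs are compared field by field, left to
  // right, so the max struct per group carries the largest col2 and the
  // col3 value from that same row.
  def viaMaxStruct(df: DataFrame): DataFrame =
    df.groupBy(col("col1"))
      .agg(max(struct(col("col2"), col("col3"))).as("best"))
      .select(col("col1"), col("best").getField("col3").as("col3"))
}
```

Neither version depends on the row order surviving groupBy. The two differ on ties: when several rows in a group share the highest col2, the window version keeps an arbitrary one of them, while the struct version keeps the one with the largest col3.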


On Thu, Dec 17, 2015 at 11:55 PM, Timothée Carayol <
timothee.cara...@gmail.com> wrote:

> Hi all,
>
> I tried to do something like the following in Spark
>
> df.orderBy('col1, 'col2).groupBy('col1).agg(first('col3))
>
> I was hoping to get, within each col1 value, the value for col3 that
> corresponds to the highest value for col2 within that col1 group. This only
> works if the order on col2 is preserved after the groupBy step.
>
>
> https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
> suggests that it is (unlike RDD.groupBy, DataFrame.groupBy is described as
> preserving the order).
>
> Yet in my experiments, I find that in some cases the order is not
> preserved. Running the same code multiple times gives me different results.
>
> If this is a bug, I'll happily work on a reproducible example and post to
> JIRA but I thought I'd check with the mailing list first in case that is,
> in fact, the expected behaviour?
>
> Thanks
> Timothée
>
