Thanks Michael.

If I understand correctly, this is the expected behaviour then, and there
is no ordering guarantee within a grouped DataFrame. I'll comment on that
blog post to report that its message is inaccurate.

Your first suggestion is to use window functions. As I understand it, window
functions return one row per input row rather than one row per group,
whereas I want only one row per group in my final result. Am I right in
understanding that you suggest using a window function to compute the
within-group maximum (repeated n times per group), and then aggregating by
group to reach the final result (with another groupBy step)?

Best wishes
Timothée

2015-12-18 23:17 GMT+01:00 Michael Armbrust <mich...@databricks.com>:

> You need to use window functions to get this kind of behavior.  Or use max
> and a struct (
> http://stackoverflow.com/questions/13523049/hive-sql-find-the-latest-record
> )
>
> On Thu, Dec 17, 2015 at 11:55 PM, Timothée Carayol <
> timothee.cara...@gmail.com> wrote:
>
>> Hi all,
>>
>> I tried to do something like the following in Spark
>>
>> df.orderBy('col1, 'col2).groupBy('col1).agg(first('col3))
>>
>> I was hoping to get, within each col1 value, the value for col3 that
>> corresponds to the highest value for col2 within that col1 group. This only
>> works if the order on col2 is preserved after the groupBy step.
>>
>>
>> https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
>> suggests that it is (unlike RDD.groupBy, DataFrame.groupBy is described as
>> preserving the order).
>>
>> Yet in my experiments, I find that in some cases the order is not
>> preserved. Running the same code multiple times gives me different results.
>>
>> If this is a bug, I'll happily work on a reproducible example and post to
>> JIRA but I thought I'd check with the mailing list first in case that is,
>> in fact, the expected behaviour?
>>
>> Thanks
>> Timothée
>>
>
>
