Hi All, [PySpark 2.3, Python 2.7]
I would like to achieve something like the following and would appreciate suggestions on the best way to implement it (ideally with the pros and cons of each approach in terms of performance):

    df = df.groupBy('grp_col').agg(
        max('file_date').alias('max_date'),
        count(when(col('file_date') == col('max_date'), 1)))

Please note that 'max_date' is the result of the aggregate max() inside the same agg(), so I cannot reference it directly like this. I can definitely use multiple groupBys to achieve this (a rough sketch is below), but is there a better way, perhaps one that performs better? Appreciate your help!
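For reference, here is a minimal sketch of the two-step approach I have in mind today (assuming df has columns 'grp_col' and 'file_date'; the variable and output column names are just illustrative):

    from pyspark.sql import functions as F

    # Step 1: compute the max file_date per group
    max_df = df.groupBy('grp_col').agg(F.max('file_date').alias('max_date'))

    # Step 2: join the group max back and count rows whose file_date equals it
    result = (df.join(max_df, on='grp_col')
                .groupBy('grp_col', 'max_date')
                .agg(F.count(F.when(F.col('file_date') == F.col('max_date'), 1))
                      .alias('cnt_on_max_date')))

--
Regards,
Rishi Shah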