Hi All, [PySpark 2.3, Python 2.7]
I would like to achieve something like the following and would appreciate suggestions on the best way to implement it (ideally with the pros and cons of each approach in terms of performance):

    df = df.groupBy('grp_col').agg(
        max('file_date').alias('max_date'),
        count(when(col('file_date') == col('max_date'), 1)))

Please note that 'max_date' is the result of the aggregate max() inside the same agg(), so I cannot reference it directly like this. I can definitely use multiple groupBys to achieve this (a rough sketch is below), but is there a better way, perhaps one that performs better? Appreciate your help!
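For reference, here is a minimal sketch of the two-step approach I have in mind today (assuming df has columns 'grp_col' and 'file_date'; the variable and output column names are just illustrative):

    from pyspark.sql import functions as F

    # Step 1: compute the max file_date per group
    max_df = df.groupBy('grp_col').agg(F.max('file_date').alias('max_date'))

    # Step 2: join the group max back and count rows whose file_date equals it
    result = (df.join(max_df, on='grp_col')
                .groupBy('grp_col', 'max_date')
                .agg(F.count(F.when(F.col('file_date') == F.col('max_date'), 1))
                      .alias('cnt_on_max_date')))

--
Regards,
Rishi Shah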