I would just map each record to an (id, record) pair, then do a reduceByKey
where you compare the scores and keep the record with the highest one. Then
call .values() on the result and that should give you the deduplicated records.
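
Roughly something like the sketch below. It assumes each record is a dict with "id" and "score" fields; those field names and the helper names keep_highest and dedupe_by_id are made up for illustration, so adjust them to your data.

```python
def keep_highest(a, b):
    """Return whichever of two records has the higher score.

    On a tie the first argument is returned. reduceByKey does not
    guarantee the order in which records are combined, so which of
    the tied records survives is effectively arbitrary.
    """
    return a if a["score"] >= b["score"] else b


def dedupe_by_id(rdd):
    """Keep one record per id: the one with the highest score.

    `rdd` is assumed to be a PySpark RDD of dicts that each carry
    "id" and "score" keys (hypothetical field names).
    """
    return (
        rdd.map(lambda rec: (rec["id"], rec))  # pair each record with its id
           .reduceByKey(keep_highest)          # collapse duplicates per id
           .values()                           # drop the key, keep the records
    )
```

With an existing SparkContext sc, you could try it as dedupe_by_id(sc.parallelize(records)).collect(). reduceByKey combines records within each partition before shuffling, so it should move less data than grouping all duplicates together first.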
> On Jan 11, 2020, at 11:14 AM, Rishi Shah wrote:
>
>
Thanks everyone for your contributions on this topic. I wanted to check in
to see if anyone has discovered a different approach, or has an opinion on a
better approach, to deduplicating data using pyspark. Would really appreciate
any further insight on this.
Thanks,
-Rishi
On Wed, Jun 12, 2019 at 4:21 PM