Re: High level explanation of dropDuplicates

2020-01-11 Thread Miguel Morales
I would just map to pairs keyed by the id, then do a reduceByKey where you compare the scores and keep the highest. Then call .values() and that should do it.
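
A minimal PySpark sketch of the suggestion above, for illustration only. It assumes each record is a dict with "id" and "score" fields (the thread does not show the actual schema), and the helper names keep_highest, dedupe_by_id and run_local_demo are hypothetical. When two records with the same id have equal scores, which one survives is arbitrary.

```python
def keep_highest(a, b):
    """Reducer: of two records sharing an id, keep the one with the higher score."""
    return a if a["score"] >= b["score"] else b


def dedupe_by_id(rdd):
    """Return an RDD with one record per id: the highest-scoring one.

    Assumes `rdd` holds dicts with "id" and "score" keys (an assumption,
    not something the thread specifies).
    """
    return (
        rdd.map(lambda rec: (rec["id"], rec))  # pair each record with its id
        .reduceByKey(keep_highest)             # collapse each id to its best record
        .values()                              # drop the key, keep the record
    )


def run_local_demo():
    """Run the sketch on a tiny in-memory dataset (requires pyspark)."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("dedupe-sketch")
        .getOrCreate()
    )
    try:
        records = [
            {"id": 1, "score": 0.2},
            {"id": 1, "score": 0.9},
            {"id": 2, "score": 0.5},
        ]
        rdd = spark.sparkContext.parallelize(records)
        return sorted(dedupe_by_id(rdd).collect(), key=lambda r: r["id"])
    finally:
        spark.stop()
```

Calling run_local_demo() on a machine with pyspark installed runs the three steps on the sample records. Unlike dropDuplicates, which does not let you choose which duplicate is kept, the reducer here makes the choice explicit.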

Re: High level explanation of dropDuplicates

2020-01-11 Thread Rishi Shah
Thanks everyone for your contributions on this topic. I wanted to check in to see whether anyone has found a different approach, or has an opinion on a better approach, to deduplicating data using PySpark. I would really appreciate any further insight on this. Thanks, -Rishi On Wed, Jun 12, 2019 at 4:21 PM