Re: High level explanation of dropDuplicates

2020-01-11 Thread Miguel Morales
I would just map to a pair using the id, then do a reduceByKey where you compare the scores and keep the highest, then do .values and that should do it.
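A minimal PySpark sketch of that reduceByKey approach (the column names "id" and "score" and the sample rows are assumptions for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup-reduceByKey").getOrCreate()

df = spark.createDataFrame(
    [(1, 10.0), (1, 25.0), (2, 7.0), (2, 3.0)],
    ["id", "score"],
)

deduped = (
    df.rdd
      .map(lambda row: (row["id"], row))  # pair each row by its id
      .reduceByKey(lambda a, b: a if a["score"] >= b["score"] else b)  # keep the higher score
      .values()  # drop the key, keep the winning rows
)

print(deduped.collect())
# e.g. [Row(id=1, score=25.0), Row(id=2, score=7.0)] (row order not guaranteed)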

Re: High level explanation of dropDuplicates

2020-01-11 Thread Rishi Shah
Thanks everyone for your contribution on this topic. I wanted to check in to see if anyone has discovered a different approach, or has an opinion on a better approach, to deduplicating data using PySpark. Would really appreciate any further insight on this. Thanks, -Rishi

Re: High level explanation of dropDuplicates

2019-06-12 Thread Yeikel
Nicholas, thank you for your explanation. I am also interested in the example that Rishi is asking for. I am sure mapPartitions may work, but as Vladimir suggests it may not be the best option in terms of performance. @Vladimir Prus, are you aware of any example about writing a "custom…

Re: High level explanation of dropDuplicates

2019-06-12 Thread Vladimir Prus
Hi, If your data frame is partitioned by column A, and you want deduplication by columns A, B and C, then a faster way might be to sort each partition by A, B and C and then do a linear scan - it is often faster than a group by on all columns, which requires a shuffle. Sadly, there's no standard way…
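A sketch of that sort-then-scan idea in PySpark (the column names "a", "b", "c", the sample rows, and the repartition step are assumptions; whether it beats a shuffle-based group-by depends on the data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup-sorted-scan").getOrCreate()

df = (
    spark.createDataFrame(
        [(1, "x", 1), (1, "x", 1), (1, "y", 2), (2, "x", 3)],
        ["a", "b", "c"],
    )
    .repartition("a")  # co-locate equal keys so each key lives in one partition
    .sortWithinPartitions("a", "b", "c")
)

def drop_adjacent_dupes(rows):
    # With each partition sorted, duplicate rows sit next to each other,
    # so a single pass that remembers the previous key suffices.
    prev = None
    for row in rows:
        key = (row["a"], row["b"], row["c"])
        if key != prev:
            yield row
        prev = key

deduped = spark.createDataFrame(df.rdd.mapPartitions(drop_adjacent_dupes), df.schema)
deduped.show()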

Re: High level explanation of dropDuplicates

2019-06-09 Thread Rishi Shah
Hi All, Just wanted to check back regarding the best way to perform deduplication. Is using dropDuplicates the optimal way to get rid of duplicates? Would it be better if we ran the operations on the RDD directly? Also, what about if we want to keep the last value of the group while performing deduplication…
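On keeping the last value of the group: dropDuplicates does not let you choose which row survives, but a window function does. A common sketch ("id", "updated_at" and "payload" are assumed column names):

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("dedup-keep-last").getOrCreate()

df = spark.createDataFrame(
    [(1, "2019-06-01", "old"), (1, "2019-06-09", "new"), (2, "2019-06-05", "only")],
    ["id", "updated_at", "payload"],
)

# Rank rows within each id, newest first, and keep only the top row.
w = Window.partitionBy("id").orderBy(F.col("updated_at").desc())

latest = (
    df.withColumn("rn", F.row_number().over(w))
      .where(F.col("rn") == 1)
      .drop("rn")
)
latest.show()  # one row per id: the one with the latest updated_at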

Re: High level explanation of dropDuplicates

2019-05-20 Thread Nicholas Hakobian
From doing some searching around in the spark codebase, I found the following: https://github.com/apache/spark/blob/163a6e298213f216f74f4764e241ee6298ea30b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1452-L1474 So it appears there is no direct operation…
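The linked optimizer code rewrites a dropDuplicates on a subset of columns into an aggregate keyed on those columns, taking first() of everything else. A rough hand-written equivalent in PySpark (a sketch; "id", "tag" and "score" are assumed names, and which row first() picks per group is not deterministic):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dedup-as-aggregate").getOrCreate()

df = spark.createDataFrame(
    [(1, "a", 10), (1, "b", 20), (2, "c", 30)],
    ["id", "tag", "score"],
)

# Roughly the plan the optimizer produces for df.dropDuplicates(["id"]):
agg = df.groupBy("id").agg(
    F.first("tag").alias("tag"),
    F.first("score").alias("score"),
)
agg.show()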

High level explanation of dropDuplicates

2019-05-20 Thread Yeikel
Hi, I am looking for a high level explanation (overview) of how dropDuplicates [1] works. [1] https://github.com/apache/spark/blob/db24b04cad421ed508413d397c6beec01f723aee/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2326 Could someone please explain? Thank you
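For reference, the user-facing behaviour being asked about (a minimal sketch with assumed sample data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dropDuplicates-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "a"), (1, "a"), (1, "b")],
    ["id", "tag"],
)

df.dropDuplicates().show()        # drops rows identical across all columns
df.dropDuplicates(["id"]).show()  # keeps one (arbitrary) row per id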