Re: [pyspark 2.3+] Dedupe records

2020-05-30 Thread Anwar AliKhan
What does it mean that DataFrames are RDDs under the cover? What does deduplication mean? Please send your biodata and history of past commercial projects. The Wali Ahad agreed to release 300 million USD for a new machine learning research project to centralize government facilities to find a better way to …

Re: [pyspark 2.3+] Dedupe records

2020-05-30 Thread Molotch
The performant way would be to partition your dataset into reasonably small chunks and use a Bloom filter to check whether an entity might already be in your set before you do a lookup. Check the Bloom filter; if the entity might be in the set, rely on partition pruning to read and backfill the relevant …
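
A minimal sketch of that pre-check in PySpark, assuming records carry a unique "id" column and hypothetical parquet paths; the hand-rolled filter below stands in for a library one, and at this scale you would likely build one filter per partition rather than a single driver-side filter:

    import hashlib
    import math
    from pyspark.sql import SparkSession

    class BloomFilter:
        # Minimal Bloom filter: false positives possible, false negatives not.
        def __init__(self, capacity, error_rate=0.01):
            # Standard sizing: m bits and k hashes for the target error rate.
            self.num_bits = max(8, int(-capacity * math.log(error_rate) / math.log(2) ** 2))
            self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
            self.bits = bytearray((self.num_bits + 7) // 8)

        def _positions(self, item):
            # Double hashing derived from one SHA-256 digest.
            digest = hashlib.sha256(str(item).encode("utf-8")).digest()
            h1 = int.from_bytes(digest[:8], "big")
            h2 = int.from_bytes(digest[8:16], "big")
            return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

        def add(self, item):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, item):
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

    spark = SparkSession.builder.getOrCreate()
    existing_ids = spark.read.parquet("/data/entities").select("id")  # hypothetical path
    incoming = spark.read.parquet("/data/incoming")                   # hypothetical path

    # Build the filter over stored ids and ship it to the executors.
    bf = BloomFilter(capacity=1000000)
    for row in existing_ids.toLocalIterator():
        bf.add(row["id"])
    bf_bc = spark.sparkContext.broadcast(bf)

    # Only "maybe seen" records pay for the expensive lookup; the rest
    # are guaranteed new and can be appended directly.
    certainly_new = incoming.rdd.filter(lambda r: not bf_bc.value.might_contain(r["id"]))
    needs_lookup = incoming.rdd.filter(lambda r: bf_bc.value.might_contain(r["id"]))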

Re: [pyspark 2.3+] Dedupe records

2020-05-29 Thread Sonal Goyal
Hi Rishi, 1. DataFrames are RDDs under the cover. If you have unstructured data, or if you know something about the data through which you can optimize the computation, you can go with RDDs. Otherwise, DataFrames, which are optimized by Spark SQL, should be fine. 2. For incremental deduplication, I …
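
Sonal's incremental-deduplication advice is cut off in the archive. One common pattern for it, sketched here with assumed paths and an assumed "id" key column, is to check only the new batch against the keys already stored instead of re-deduplicating the full dataset:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    existing_keys = spark.read.parquet("/data/records").select("id")  # hypothetical path
    batch = spark.read.parquet("/data/new_batch")                     # hypothetical path

    # Rows whose id is not stored yet: safe to append as-is.
    fresh = batch.join(existing_keys, on="id", how="left_anti")

    # Rows whose id already exists: resolve against the stored copy
    # (update/overwrite) instead of rescanning the whole dataset.
    collisions = batch.join(existing_keys, on="id", how="left_semi")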

[pyspark 2.3+] Dedupe records

2020-05-29 Thread Rishi Shah
Hi All, I have around 100B records where I get new, update & delete records. Update/delete records are not that frequent. I would like to get some advice on the below: 1) Should I use RDD + reduceByKey or a DataFrame window operation for data of this size? Which one would outperform the other? Which is …
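
For reference, the two options named in question 1 look roughly like this; the "id" and "updated_at" columns and the input path are assumptions, and both variants keep only the latest record per id:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    records = spark.read.parquet("/data/records")  # hypothetical path

    # Option 1: RDD + reduceByKey -- pairwise keep the newer of two records.
    latest_rdd = (records.rdd
                  .keyBy(lambda r: r["id"])
                  .reduceByKey(lambda a, b: a if a["updated_at"] >= b["updated_at"] else b)
                  .values())

    # Option 2: DataFrame window -- rank by recency within each id, keep rank 1.
    w = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
    latest_df = (records
                 .withColumn("rn", F.row_number().over(w))
                 .filter(F.col("rn") == 1)
                 .drop("rn"))

The DataFrame version gives Spark SQL's optimizer room to plan the shuffle and code generation, which tends to matter at 100B-record scale; the RDD version is mainly worth considering when you can exploit knowledge of the data that Spark SQL cannot.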