What does it mean that DataFrames are RDDs under the cover?

What does deduplication mean?


Please send your bio data, work history, and past commercial projects.

The Wali Ahad agreed to release 300 million USD for a new machine learning
research project to centralize government facilities and find better ways to
offer citizen services with artificial intelligence technologies.

I am looking for talented Artificial Intelligence experts.


Shukran (Thanks)



On Sat, 30 May 2020, 05:26 Sonal Goyal, <sonalgoy...@gmail.com> wrote:

> Hi Rishi,
>
> 1. Dataframes are RDDs under the cover. If you have unstructured data, or
> if you know something about the data through which you can optimize the
> computation, you can go with RDDs. Otherwise, DataFrames, which are optimized
> by Spark SQL, should be fine.
> 2. For incremental deduplication, I guess you can hash your data based on
> some particular values and then only compare the new records against the
> ones which have the same hash. That should reduce the order of comparisons
> drastically, provided you can come up with a good indexing/hashing scheme
> suited to your dataset.
>
> Thanks,
> Sonal
> Nube Technologies <http://www.nubetech.co>
>
> <http://in.linkedin.com/in/sonalgoyal>
>
>
>
>
> On Sat, May 30, 2020 at 8:17 AM Rishi Shah <rishishah.s...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> I have around 100B records where I receive new, update, and delete records.
>> Update/delete records are not that frequent. I would like to get some
>> advice on the following:
>>
>> 1) Should I use RDD + reduceByKey or a DataFrame window operation for data
>> of this size? Which one would outperform the other? Which is more reliable
>> and lower maintenance?
>> 2) Also, how would you suggest we do incremental deduplication? Currently
>> we do full processing once a week and no dedupe during weekdays to avoid
>> heavy processing. However, I would like to explore an incremental dedupe
>> option and weigh the pros and cons.
>>
>> Any input is highly appreciated!
>>
>> --
>> Regards,
>>
>> Rishi Shah
>>
>
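Sonal's hash/blocking suggestion above can be sketched in plain Python. This is only an illustration of the idea, not Spark code: the record fields (`name`, `zip`), the md5-based block key, and the exact-match comparison are all assumptions for the example. In Spark, the same pattern could be expressed as a join between new records and existing records on the block-key column, so only records sharing a block are ever compared.

```python
import hashlib

def block_key(record):
    # Hypothetical blocking key: first 3 chars of name (lowercased)
    # plus the zip prefix, hashed so similar records land in one block.
    raw = record["name"][:3].lower() + record["zip"][:2]
    return hashlib.md5(raw.encode()).hexdigest()

def incremental_dedupe(existing, new_records):
    # Index existing records by blocking key once.
    index = {}
    for rec in existing:
        index.setdefault(block_key(rec), []).append(rec)

    # Compare each new record only against records in the same block,
    # instead of against all existing records.
    unique = []
    for rec in new_records:
        candidates = index.get(block_key(rec), [])
        is_dupe = any(
            c["name"] == rec["name"] and c["zip"] == rec["zip"]
            for c in candidates
        )
        if not is_dupe:
            unique.append(rec)
            index.setdefault(block_key(rec), []).append(rec)
    return unique
```

With a well-chosen key, each new record is compared against only a handful of candidates rather than the full 100B-record history, which is what makes the incremental approach cheaper than the weekly full pass.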
