Hi Rishi,

1. DataFrames are RDDs under the covers. If you have unstructured data, or if
you know something about your data that lets you optimize the computation
yourself, you can go with RDDs. Otherwise DataFrames, which are optimized by
Spark SQL, should be fine. The first sketch below shows the same aggregation
both ways.
2. For incremental deduplication, I guess you can hash your data based on
some particular columns and then compare each new record only against the
existing ones which have the same hash. That should reduce the number of
comparisons drastically, provided you can come up with a good
indexing/hashing scheme for your dataset. The second sketch below outlines
the idea.
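
Here is a minimal sketch of point 1 (my own illustration, not something from
your pipeline): the same aggregation written once against an RDD and once
against a DataFrame. The SparkSession value "spark" and the column names are
assumptions made just for the example.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("rdd-vs-df").getOrCreate()
    import spark.implicits._

    val pairs = Seq(("a", 1), ("b", 2), ("a", 3))

    // RDD version: you choose the functions and control the shuffle yourself.
    val rddCounts = spark.sparkContext
      .parallelize(pairs)
      .reduceByKey(_ + _)

    // DataFrame version: Spark SQL can optimize the plan for you.
    val dfCounts = pairs.toDF("key", "value")
      .groupBy("key")
      .agg(sum("value").as("total"))

    dfCounts.show()

For data of your size the DataFrame plan usually wins unless you can exploit
structure that Spark SQL cannot see.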
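
And a rough sketch of the hashing idea in point 2 (again only an illustration,
with assumed DataFrames "existing" and "incoming" that share a schema, and a
list of key columns to dedupe on):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    def incrementalDedupe(existing: DataFrame,
                          incoming: DataFrame,
                          keyCols: Seq[String]): DataFrame = {
      // Hash the chosen key columns; a row can only be a duplicate of rows
      // that share the same hash value.
      val hashExpr = sha2(concat_ws("||", keyCols.map(col): _*), 256)

      val existingHashes = existing
        .withColumn("dedupe_hash", hashExpr)
        .select("dedupe_hash")
        .distinct()

      // Keep only incoming rows whose key hash is not already present.
      // (A stricter version would re-compare full rows within matching
      // hashes to guard against collisions.)
      incoming
        .withColumn("dedupe_hash", hashExpr)
        .join(existingHashes, Seq("dedupe_hash"), "left_anti")
        .drop("dedupe_hash")
    }

You would then append the deduped increment each day instead of re-processing
the full 100B records weekly.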

Thanks,
Sonal
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>




On Sat, May 30, 2020 at 8:17 AM Rishi Shah <rishishah.s...@gmail.com> wrote:

> Hi All,
>
> I have around 100B records where I get new, update & delete records.
> Update/delete records are not that frequent. I would like to get some
> advice on the following:
>
> 1) Should I use rdd + reduceByKey or a DataFrame window operation for data
> of this size? Which one would outperform the other? Which is more reliable
> and low maintenance?
> 2) Also, how would you suggest we do incremental deduplication? Currently
> we do full processing once a week and no dedupe during weekdays to avoid
> heavy processing. However, I would like to explore the incremental dedupe
> option and weigh the pros/cons.
>
> Any input is highly appreciated!
>
> --
> Regards,
>
> Rishi Shah
>
