Re: [pyspark 2.3+] Dedupe records

2020-05-30 Thread Anwar AliKhan
What is meant by "Dataframes are RDDs under the cover"?

What is meant by deduplication?


Please send your bio, work history and past commercial projects.

The Wali Ahad has agreed to release 300 million USD for a new machine
learning research project to centralize government facilities and find
better ways to offer citizen services with artificial intelligence
technologies.

I am tasked with finding talented artificial intelligence experts.


Shukran





Re: [pyspark 2.3+] Dedupe records

2020-05-30 Thread Molotch
The performant way would be to partition your dataset into reasonably small
chunks and use a bloom filter to see if the entity might be in your set
before you make a lookup.

Check the bloom filter; if the entity might be in the set, rely on partition
pruning to read and backfill only the relevant partition. If the entity isn't
in the set, just save it as new data.
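
Roughly, that flow could look like the PySpark sketch below. Everything in it
is an assumption for illustration: a Parquet table partitioned by a `part`
column, an `id` key, a Bloom filter from the third-party pybloom_live
package, and dynamic partition overwrite (available from Spark 2.3).

from pyspark.sql import SparkSession, functions as F
from pybloom_live import BloomFilter  # third-party package, assumed available

spark = (SparkSession.builder
         .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
         .getOrCreate())

# In practice you would build and persist this filter incrementally rather
# than rebuilding it from the full table on every batch.
bloom = BloomFilter(capacity=100_000_000, error_rate=0.001)
for row in spark.read.parquet("/data/events").select("id").toLocalIterator():
    bloom.add(row["id"])

incoming = spark.read.parquet("/data/incoming_batch")

# Split the batch: ids the filter has possibly seen vs. definitely new ids.
# (For very large batches, broadcast the filter and use a UDF instead of isin.)
batch_ids = [r["id"] for r in incoming.select("id").distinct().collect()]
maybe_seen_ids = [i for i in batch_ids if i in bloom]
maybe_seen = incoming.filter(F.col("id").isin(maybe_seen_ids))
definitely_new = incoming.filter(~F.col("id").isin(maybe_seen_ids))

# Definitely new entities: plain append, no lookup needed.
definitely_new.write.mode("append").partitionBy("part").parquet("/data/events")

# Possibly seen entities: read only the affected partitions (partition
# pruning), drop the old versions of those ids, and rewrite just those
# partitions.
parts = [r["part"] for r in maybe_seen.select("part").distinct().collect()]
existing = spark.read.parquet("/data/events").filter(F.col("part").isin(parts))
merged = (existing.join(maybe_seen.select("id"), "id", "left_anti")
          .unionByName(maybe_seen))
# Spark refuses to overwrite a path it is also reading from in the same job,
# so stage the rewritten partitions (or use a transactional format like Delta).
merged.write.mode("overwrite").partitionBy("part").parquet("/data/events_staged")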

Sooner or later you would probably want to compact the appended partitions
to reduce the number of small files.

Delta Lake has update and compaction semantics built in, unless you want to
do it manually.
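
For reference, a rough sketch of that with Delta Lake's Python API (the table
path and the `id` key below are made-up names, and delta-core is assumed to
be on the classpath):

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()
updates = spark.read.parquet("/data/incoming_batch")

# Upsert the batch into the Delta table on the key.
target = DeltaTable.forPath(spark, "/data/events_delta")
(target.alias("t")
 .merge(updates.alias("u"), "t.id = u.id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Manual compaction: rewrite the table into fewer, larger files without
# changing its contents.
(spark.read.format("delta").load("/data/events_delta")
 .repartition(400)
 .write.option("dataChange", "false")
 .format("delta").mode("overwrite")
 .save("/data/events_delta"))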

Since 2.4.0 Spark is also able to prune buckets. But as far as I know
there's no way to backfill a single bucket. If there were, the combination
of partition and bucket pruning could dramatically limit the amount of data
you need to read/write from/to disk.
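
For completeness, writing a bucketed table looks roughly like the sketch
below (the key, bucket count and path are arbitrary; bucketed writes have to
go through saveAsTable):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.read.parquet("/data/events")  # hypothetical source

# 256 buckets on the key; on Spark 2.4.0+ equality filters on `id` can then
# prune buckets at read time.
(df.write
   .bucketBy(256, "id")
   .sortBy("id")
   .format("parquet")
   .mode("overwrite")
   .saveAsTable("events_bucketed"))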

RDD vs DataFrame: I'm not sure exactly how and when Tungsten can be used
with RDDs, if at all. Because of that I always try to use DataFrames and the
built-in functions as long as possible, just to get the sweet off-heap
allocation and the "expressions to byte code" code generation along with the
Catalyst optimizations. That will probably do more for your performance than
anything else. The memory overhead of JVM objects and GC runs can be brutal
on your performance and memory usage, depending on your dataset and use
case.
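
As a concrete example of staying with DataFrames and built-in functions, a
window-based dedupe that keeps only the latest row per key could look like
the sketch below (the `id` and `updated_at` columns are made up):

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/events")  # hypothetical input

# Keep the most recent row per id; everything here is built-in expressions,
# so Catalyst can optimize the whole plan and Tungsten manages the memory.
w = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
deduped = (df.withColumn("rn", F.row_number().over(w))
             .filter(F.col("rn") == 1)
             .drop("rn"))

deduped.write.mode("overwrite").parquet("/data/events_deduped")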


br,

molotch






Re: [pyspark 2.3+] Dedupe records

2020-05-29 Thread Sonal Goyal
Hi Rishi,

1. DataFrames are RDDs under the cover. If you have unstructured data, or if
you know something about the data through which you can optimize the
computation, you can go with RDDs. Otherwise DataFrames, which are optimized
by Spark SQL, should be fine.
2. For incremental deduplication, I guess you can hash your data based on
some particular values and then only compare the new records against the
ones which have the same hash. That should reduce the number of comparisons
drastically, provided you can come up with a good indexing/hashing scheme
for your dataset (a rough sketch of the idea follows below).
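
A rough PySpark sketch of that idea (the blocking column `zip`, the `name`
column and the paths are all made up): hash records into blocks, then compare
incoming records only against existing records in the same block.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
existing = spark.read.parquet("/data/events")          # hypothetical paths
incoming = spark.read.parquet("/data/incoming_batch")

# Blocking key: a hash over one or more stable columns. Only records that
# share a block are compared in detail, which cuts the comparison count.
def with_block(df):
    return df.withColumn("block", F.sha2(F.col("zip").cast("string"), 256))

candidates = (with_block(incoming).alias("n")
              .join(with_block(existing).alias("e"), "block"))

# Run the expensive, detailed comparison only inside each block,
# e.g. near-identical names.
matches = candidates.where(F.levenshtein(F.col("n.name"), F.col("e.name")) <= 2)
matches.show()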

Thanks,
Sonal
Nube Technologies 








[pyspark 2.3+] Dedupe records

2020-05-29 Thread Rishi Shah
Hi All,

I have around 100B records where I get new, update & delete records.
Update/delete records are not that frequent. I would like to get some
advice on the below:

1) Should I use RDD + reduceByKey or a DataFrame window operation for data
of this size? Which one would outperform the other? Which is more reliable
and lower maintenance?
2) Also, how would you suggest we do incremental deduplication? Currently we
do full processing once a week and no dedupe during weekdays to avoid heavy
processing. However, I would like to explore an incremental dedupe option
and weigh the pros/cons.

Any input is highly appreciated!

-- 
Regards,

Rishi Shah