Re: Migration to Spark 3.2

2022-01-27 Thread Stephen Coy
Hi Aurélien, Your Jackson versions look fine. What happens if you change the scope of your Jackson dependencies to “provided”? This should result in your application using the versions provided by Spark and avoid this potential collision. Cheers, Steve C On 27 Jan 2022, at 9:48 pm, Aurélien
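For reference, a minimal sketch of the change Steve is suggesting, assuming a Maven build; the artifactId and version below are illustrative and should match whatever your build currently pulls in:

    <!-- Illustrative pom.xml fragment: "provided" keeps Jackson available at
         compile time but defers to the versions bundled with Spark at runtime.
         The artifactId and version here are examples only. -->
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
      <version>2.12.3</version>
      <scope>provided</scope>
    </dependency>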

Re: How to delete the record

2022-01-27 Thread ayan guha
Btw, the 2 options Mich explained are not mutually exclusive. Option 2 can and should be implemented over a Delta Lake table anyway, especially if you eventually need to do hard deletes (e.g. for regulatory needs). On Fri, 28 Jan 2022 at 6:50 am, Sid Kal wrote: > Thanks Mich and Sean for your time
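For the hard-delete part, a minimal sketch over a Delta table, assuming a Delta-enabled SparkSession `spark`; the path and predicate are placeholders:

    # Hypothetical example: physically remove rows for one key, then vacuum
    # so the old files (still referenced by earlier versions) are purged too.
    from delta.tables import DeltaTable

    dt = DeltaTable.forPath(spark, "s3://my-bucket/delta/customers")
    dt.delete("id = '1'")
    dt.vacuum(168)  # retention in hours; drops files no longer referenced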

Re: How to delete the record

2022-01-27 Thread Sid Kal
Thanks Mich and Sean for your time On Fri, 28 Jan 2022, 00:53 Mich Talebzadeh, wrote: > Yes I believe so. > > Check this article of mine; it is dated early 2019 but still has some > relevance to what I am implying.

Re: How to delete the record

2022-01-27 Thread Mich Talebzadeh
Yes I believe so. Check this article of mine; it is dated early 2019 but still has some relevance to what I am implying. https://www.linkedin.com/pulse/real-time-data-streaming-big-typical-use-cases-talebzadeh-ph-d-/ HTH view my Linkedin profile

Re: DataStreamReader cleanSource option

2022-01-27 Thread Mich Talebzadeh
Hi Gabriela, I don't know about data lake but this is about Spark Structured Streaming. Are both readStream and writeStream working OK? For example, can you do df.printSchema() after the read? It is advisable to wrap the logic inside a try block. This is an example of wrapping it: data_path =
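The snippet is cut off in the archive; a minimal sketch of the pattern, assuming an active SparkSession `spark` (path and schema are placeholders):

    # Hypothetical example: wrap the stream setup in try/except so a bad
    # path or schema surfaces immediately instead of failing later.
    import sys
    from pyspark.sql.types import StructType, StructField, StringType

    data_path = "abfss://landing@myaccount.dfs.core.windows.net/in"  # placeholder
    input_schema = StructType([StructField("id", StringType())])     # illustrative

    try:
        df = spark.readStream.format("parquet").schema(input_schema).load(data_path)
        df.printSchema()  # quick check that the read side is wired up
    except Exception as e:
        print(f"readStream setup failed: {e}")
        sys.exit(1)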

Re: How to delete the record

2022-01-27 Thread Sid Kal
Okay, sounds good. So, the two options below would help me capture CDC changes: 1) Delta Lake 2) Maintaining a snapshot of records with some indicators and a timestamp. Correct me if I'm wrong. Thanks, Sid On Thu, 27 Jan 2022, 23:59 Mich Talebzadeh, wrote: > There are two ways of doing it.

Re: How to delete the record

2022-01-27 Thread Mich Talebzadeh
There are two ways of doing it. 1. Through a snapshot, meaning an immutable snapshot of the state of the table at a given version. For example, the state of a Delta table
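A minimal sketch of reading such a snapshot, assuming a Delta-enabled SparkSession `spark` and an illustrative path:

    # Hypothetical example: read the immutable state of a Delta table as of
    # a given version, or as of a timestamp. Path and values are placeholders.
    df_v5 = spark.read.format("delta") \
        .option("versionAsOf", 5) \
        .load("s3://my-bucket/delta/events")

    df_old = spark.read.format("delta") \
        .option("timestampAsOf", "2022-01-01") \
        .load("s3://my-bucket/delta/events")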

Re: How to delete the record

2022-01-27 Thread Sean Owen
Delta, for example, manages merge/append/delete and also keeps previous states of the table's data, so you can query what it looked like before. See delta.io On Thu, Jan 27, 2022, 11:54 AM Sid Kal wrote: > Hi Sean, > > So you mean if I use those file formats it will do the work of CDC >
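A minimal sketch of browsing those previous states, assuming a Delta-enabled SparkSession `spark` and a placeholder path:

    # Hypothetical example: list the table's version history; each row is a
    # commit (merge/append/delete/...) whose state can be queried later.
    from delta.tables import DeltaTable

    dt = DeltaTable.forPath(spark, "s3://my-bucket/delta/events")
    dt.history().select("version", "timestamp", "operation").show()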

Re: How to delete the record

2022-01-27 Thread Sid Kal
Hi Sean, So you mean if I use those file formats, they will do the work of CDC automatically, or would I have to handle it via code? Hi Mich, Not sure if I understood you. Let me try to explain my scenario. Suppose there is an id "1" which is inserted today, so I transformed and ingested it. Now

Re: How to delete the record

2022-01-27 Thread Mich Talebzadeh
Sid, How do you cater for updates? Do you add an update as a new record without touching the original record? This approach allows you to see the history of the record, i.e. inserted once, deleted once and updated *n* times throughout the Entity Life History of the record. So your mileage
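With that append-only style, the current state of each record is a window query away; a minimal sketch with illustrative column names, where `history_df` stands for the accumulated history table:

    # Hypothetical example: pick the latest version of each record from a
    # history that stores every insert/update/delete as a new row.
    from pyspark.sql import Window
    from pyspark.sql.functions import row_number, col

    w = Window.partitionBy("id").orderBy(col("op_time").desc())
    current = (history_df
               .withColumn("rn", row_number().over(w))
               .filter("rn = 1")
               .filter("op_type != 'D'")  # drop ids whose latest op is a delete
               .drop("rn"))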

Re: How to delete the record

2022-01-27 Thread Sean Owen
This is what storage engines like Delta, Hudi, Iceberg are for. No need to manage it manually or use a DBMS. These formats allow deletes, upserts, etc. of data, using Spark, on cloud storage. On Thu, Jan 27, 2022 at 10:56 AM Mich Talebzadeh wrote: > Where is the ETL data stored? > > > > *But now the
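A minimal sketch of such an upsert-plus-delete with Delta, assuming a Delta-enabled SparkSession `spark` and a changes DataFrame `changes_df` whose `op` column carries the CDC action ('I'/'U'/'D'); names and paths are placeholders:

    # Hypothetical example: apply one CDC batch (inserts, updates, deletes)
    # to a Delta table with a single MERGE.
    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, "s3://my-bucket/delta/customers")
    (target.alias("t")
        .merge(changes_df.alias("s"), "t.id = s.id")
        .whenMatchedDelete("s.op = 'D'")      # source says the row was deleted
        .whenMatchedUpdateAll()               # otherwise take the new values
        .whenNotMatchedInsertAll("s.op != 'D'")
        .execute())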

Re: How to delete the record

2022-01-27 Thread Sid Kal
Hi Mich, Thanks for your time. The data is stored in S3 via DMS and is read by the Spark jobs. How can I mark it as a soft delete? Any small snippet / link / example would help. Thanks, Sid On Thu, 27 Jan 2022, 22:26 Mich Talebzadeh, wrote: > Where is the ETL data stored? > > > > *But

Re: How to delete the record

2022-01-27 Thread Mich Talebzadeh
Where is the ETL data stored? *But now the main problem is when the record at the source is deleted, it should be deleted in my final transformed record too.* If your final sink (storage) is a data warehouse, it should be soft flagged with op_type (Insert/Update/Delete) and op_time (timestamp).
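A minimal sketch of that flagging, assuming the op indicator itself comes in with the DMS change stream; column names and the path are placeholders:

    # Hypothetical example: stamp each incoming CDC row with op_time and
    # append it, rather than physically deleting anything. incoming_df is
    # assumed to already carry op_type ('I'/'U'/'D') from the change stream.
    from pyspark.sql.functions import current_timestamp

    flagged = incoming_df.withColumn("op_time", current_timestamp())
    flagged.write.mode("append").parquet("s3://my-bucket/warehouse/customers_hist")

    # Readers then serve the "current" view by taking the latest row per id
    # and excluding op_type = 'D', as in the window-function sketch earlier
    # in this thread.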

How to delete the record

2022-01-27 Thread Sid Kal
I am using a Spark incremental approach for bringing in the latest data every day. Everything works fine. But now the main problem is that when a record at the source is deleted, it should be deleted in my final transformed record too. How do I capture such changes and change my table too? Best

Re: Small optimization questions

2022-01-27 Thread Aki Riisiö
Ah, sorry for spamming, I found the answer in the documentation. Thank you for the clarification! Best regards, Aki Riisiö On Thu, 27 Jan 2022 at 10:39, Aki Riisiö wrote: > Hello. > > Thank you for the reply again. I just checked how many tasks are spawned > when we read the data from S3 and in

Re: Small optimization questions

2022-01-27 Thread Aki Riisiö
Hello. Thank you for the reply again. I just checked how many tasks are spawned when we read the data from S3, and in the latest run this was a little over 23000. What determines the number of tasks during the read? Does it correspond directly to the number of files to be read? Thank you. On
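For file sources the task count is driven mainly by the number and size of the files plus a couple of configs; a rough sketch of the knobs involved (values illustrative):

    # Hypothetical example: Spark packs file splits into read tasks using
    # maxPartitionBytes (target split size) and openCostInBytes (a per-file
    # padding), so many small files means many tasks. Tuning these changes
    # the task count; the values below are illustrative only.
    spark.conf.set("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))
    spark.conf.set("spark.sql.files.openCostInBytes", str(4 * 1024 * 1024))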

DataStreamReader cleanSource option

2022-01-27 Thread Gabriela Dvořáková
Hi, I am writing to ask for advice regarding the cleanSource option of the DataStreamReader. I am using pyspark with Spark 3.1 via Azure Synapse. To my knowledge, the cleanSource option was introduced in Spark version 3. I'd spent a significant amount of time trying to configure this option with
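For reference, a minimal sketch of the option as documented for Spark 3.x file sources, assuming an active SparkSession `spark`; paths and schema are placeholders:

    # Hypothetical example: archive fully processed source files. With
    # cleanSource=archive, sourceArchiveDir must lie outside the source
    # directory; cleanSource also accepts "delete" and "off" (the default).
    from pyspark.sql.types import StructType, StructField, StringType

    schema = StructType([StructField("id", StringType())])  # illustrative
    df = (spark.readStream
          .format("csv")
          .schema(schema)
          .option("cleanSource", "archive")
          .option("sourceArchiveDir", "abfss://archive@myaccount.dfs.core.windows.net/done")
          .load("abfss://landing@myaccount.dfs.core.windows.net/in"))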

RE: newbie question for reduce

2022-01-27 Thread Christopher Robson
Hi, The reduce lambda accepts as its first argument the return value of the previous execution. The first time, it is invoked with x = ("a", 1), y = ("b", 2) and returns 1+2=3. The second time, it is invoked with x = 3, y = ("c", 3), so you can see why it raises the error you are seeing. There
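A minimal sketch of the usual fix, assuming the input is a list of (key, value) pairs like the one above:

    # Hypothetical example: keep the accumulator type consistent, either by
    # seeding reduce with an initial value, or by mapping to values first.
    from functools import reduce

    data = [("a", 1), ("b", 2), ("c", 3)]

    # Plain Python: give reduce an initial accumulator and add values only.
    total = reduce(lambda acc, kv: acc + kv[1], data, 0)  # -> 6

    # Spark RDD equivalent: map to the values first, then reduce.
    total2 = (spark.sparkContext.parallelize(data)
              .map(lambda kv: kv[1])
              .reduce(lambda a, b: a + b))                # -> 6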

Re: Migration to Spark 3.2

2022-01-27 Thread Aurélien Mazoyer
Hi Stephen, Thank you for your answer! Here it is; it seems that the Jackson dependencies are correct, no? Thanks,
[INFO] com.krrier:spark-lib-full:jar:0.0.1-SNAPSHOT
[INFO] +- com.krrier:backend:jar:0.0.1-SNAPSHOT:compile
[INFO] |  \- com.krrier:data:jar:0.0.1-SNAPSHOT:compile
[INFO] +-