Re: How to delete the record

2022-01-30 Thread Gourav Sengupta
Hi, I think it will be useful to understand the problem before solving the problem. Can I please ask what this table is? Is it a fact (event store) kind of a table, or a dimension (master data) kind of table? And what are the downstream consumptions of this table? Besides that what is the

Re: How to delete the record

2022-01-27 Thread ayan guha
Btw, the 2 options Mich explained are not mutually exclusive. Option 2 can and should be implemented over a Delta Lake table anyway, especially if you need to do hard deletes eventually (e.g. for regulatory needs). On Fri, 28 Jan 2022 at 6:50 am, Sid Kal wrote: > Thanks Mich and Sean for your time

Re: How to delete the record

2022-01-27 Thread Sid Kal
Thanks Mich and Sean for your time. On Fri, 28 Jan 2022, 00:53 Mich Talebzadeh wrote: > Yes I believe so. Check this article of mine dated early 2019 but will have some relevance to what I am implying.

Re: How to delete the record

2022-01-27 Thread Mich Talebzadeh
Yes, I believe so. Check this article of mine, dated early 2019, which will have some relevance to what I am implying. https://www.linkedin.com/pulse/real-time-data-streaming-big-typical-use-cases-talebzadeh-ph-d-/ HTH

Re: How to delete the record

2022-01-27 Thread Sid Kal
Okay, sounds good. So the two options below would help me capture CDC changes: 1) Delta Lake 2) Maintaining a snapshot of records with some indicators and a timestamp. Correct me if I'm wrong. Thanks, Sid On Thu, 27 Jan 2022, 23:59 Mich Talebzadeh wrote: > There are two ways of doing it.

Re: How to delete the record

2022-01-27 Thread Mich Talebzadeh
There are two ways of doing it. 1. Through the snapshots on offer, meaning an immutable snapshot of the state of the table at a given version. For example, the state of a Delta table
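
The idea of "an immutable snapshot of the state of the table at a given version" can be illustrated with a toy model in plain Python. This is only a sketch of the concept, not Delta Lake's API or its storage layout; the class `VersionedTable` and its methods `commit` and `as_of` are hypothetical names invented for this illustration.

```python
class VersionedTable:
    """Toy table that keeps every committed state so old versions stay readable."""

    def __init__(self):
        # Version 0 is the empty table.
        self._versions = [{}]

    def commit(self, upserts=None, deletes=()):
        """Create a new version from the latest one and return its number."""
        new_state = dict(self._versions[-1])   # copy, so earlier versions are untouched
        new_state.update(upserts or {})        # insert new keys / overwrite existing ones
        for key in deletes:
            new_state.pop(key, None)           # deleting a missing key is a no-op
        self._versions.append(new_state)
        return len(self._versions) - 1

    def as_of(self, version):
        """Return a copy of the table as it looked at the given version."""
        return dict(self._versions[version])


t = VersionedTable()
v1 = t.commit(upserts={1: "inserted today"})
v2 = t.commit(upserts={1: "updated tomorrow"})
v3 = t.commit(deletes=[1])
```

Here `t.as_of(v3)` no longer contains Id 1, while `t.as_of(v1)` and `t.as_of(v2)` still show what the record looked like before the delete.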

Re: How to delete the record

2022-01-27 Thread Sean Owen
Delta, for example, manages merge/append/delete and also keeps previous states of the table's data, so you can query what it looked like before. See delta.io On Thu, Jan 27, 2022, 11:54 AM Sid Kal wrote: > Hi Sean, So you mean if I use those file formats it will do the work of CDC

Re: How to delete the record

2022-01-27 Thread Sid Kal
Hi Sean, So you mean if I use those file formats it will do the work of CDC automatically, or would I have to handle it via code? Hi Mich, Not sure if I understood you. Let me try to explain my scenario. Suppose there is an Id "1" which is inserted today, so I transformed and ingested it. Now

Re: How to delete the record

2022-01-27 Thread Mich Talebzadeh
Sid, How do you cater for updates? Do you add each update as a new record without touching the original record? This approach allows you to see the history of the records, i.e. inserted once, deleted once and updated *n* times throughout the Entity Life History of the record. So your mileage

Re: How to delete the record

2022-01-27 Thread Sean Owen
This is what storage engines like Delta, Hudi, Iceberg are for. No need to manage it manually or use a DBMS. These formats allow deletes, upserts, etc. of data, using Spark, on cloud storage. On Thu, Jan 27, 2022 at 10:56 AM Mich Talebzadeh wrote: > Where ETL data is stored? *But now the

Re: How to delete the record

2022-01-27 Thread Sid Kal
Hi Mich, Thanks for your time. Data is stored in S3 via DMS, which is read in the Spark jobs. How can I mark a record as a soft delete? Any small snippet / link / example would help. Thanks, Sid On Thu, 27 Jan 2022, 22:26 Mich Talebzadeh wrote: > Where ETL data is stored? *But

Re: How to delete the record

2022-01-27 Thread Mich Talebzadeh
Where is the ETL data stored? *But now the main problem is when the record at the source is deleted, it should be deleted in my final transformed record too.* If your final sink (storage) is a data warehouse, it should be soft flagged with op_type (Insert/Update/Delete) and op_time (timestamp).
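
A minimal sketch of the soft-flag idea, written in plain Python rather than Spark so that it runs anywhere. The column names op_type and op_time come from the message above; the event layout (`id`, `data`), the function names `apply_cdc` and `active_rows`, and the sample records are assumptions made up for this illustration, not the format of any particular CDC tool.

```python
from datetime import datetime


def apply_cdc(target, events):
    """Apply CDC events in op_time order; a delete only changes the flags."""
    for ev in sorted(events, key=lambda e: e["op_time"]):
        row = dict(target.get(ev["id"], {}))
        if ev["op_type"] in ("Insert", "Update"):
            row.update(ev["data"])          # take the new column values
        # On "Delete" the payload is kept as it was: this is the soft delete.
        row["op_type"] = ev["op_type"]
        row["op_time"] = ev["op_time"]
        target[ev["id"]] = row
    return target


def active_rows(target):
    """Rows a downstream consumer should see, with soft-deleted ones hidden."""
    return {k: v for k, v in target.items() if v["op_type"] != "Delete"}


# Hypothetical change events for two records.
events = [
    {"id": 1, "op_type": "Insert", "op_time": datetime(2022, 1, 26), "data": {"name": "alice"}},
    {"id": 2, "op_type": "Insert", "op_time": datetime(2022, 1, 26), "data": {"name": "bob"}},
    {"id": 1, "op_type": "Update", "op_time": datetime(2022, 1, 27), "data": {"name": "alice2"}},
    {"id": 2, "op_type": "Delete", "op_time": datetime(2022, 1, 28), "data": {}},
]

table = apply_cdc({}, events)
```

After this runs, record 2 is still present in `table` with op_type "Delete" and the delete's op_time, while `active_rows(table)` returns only record 1. A later hard delete, if required, can then be a separate pass over the rows flagged "Delete".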

How to delete the record

2022-01-27 Thread Sid Kal
I am using a Spark incremental approach for bringing in the latest data every day. Everything works fine. But now the main problem is that when a record at the source is deleted, it should be deleted in my final transformed record too. How do I capture such changes and change my table too? Best

Re: How to delete a record from parquet files using dataframes

2016-02-24 Thread Jakob Odersky
> form of parquet files using dataframes in hdfs. How to delete the records using dataframes? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-delete-a-record-f

Re: How to delete a record from parquet files using dataframes

2016-02-24 Thread Cheng Lian
using dataframes? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-delete-a-record-from-parquet-files-using-dataframes-tp26242.html Sent from the Apache Spark User List mailing list archive at Nabble.com

How to delete a record from parquet files using dataframes

2016-02-16 Thread SRK
Hi, I am saving my records in the form of parquet files using dataframes in HDFS. How do I delete records using dataframes? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-delete-a-record-from-parquet-files-using-dataframes-tp26242.html