No worries. Is the custom build something you can work with the AWS team to get installed, so that you are able to test?
-Nishith

On Wed, Apr 14, 2021 at 12:57 PM Kizhakkel Jose, Felix
<[email protected]> wrote:

> Hi Nishith, Vinoth,
>
> Thank you so much for the quick response and offering the help.
>
> Regards,
> Felix K Jose
>
> From: Kizhakkel Jose, Felix <[email protected]>
> Date: Wednesday, April 14, 2021 at 3:55 PM
> To: [email protected] <[email protected]>
> Subject: Re: GDPR deletes and Consenting deletes of data from hudi table
>
> Hi Nishith,
>
> As I mentioned, we are on AWS EMR, but I don't think this 0.8.0 version is
> available as part of the existing release, so we need a custom build to get
> it working on the latest EMR 6.1.0.
>
> Regards,
> Felix K Jose
>
> From: nishith agarwal <[email protected]>
> Date: Wednesday, April 14, 2021 at 3:49 PM
> To: dev <[email protected]>
> Subject: Re: GDPR deletes and Consenting deletes of data from hudi table
>
> Felix,
>
> Happy to help you through trying out and rolling out multi-writer on Hudi
> tables. Do you have a test environment where you can try the feature by
> following the doc that Vinoth pointed to above?
>
> Thanks,
> Nishith
>
> On Wed, Apr 14, 2021 at 12:26 PM Vinoth Chandar <[email protected]> wrote:
>
> > Hi Felix,
> >
> > Most people, I think, are publishing this data into Kafka and applying
> > the deletes as part of the streaming job itself. The reason this works is
> > that typically only a small fraction of users leave the service (say
> > << 0.1% weekly is what I have heard), so the cost of storage on Kafka is
> > not much. Is that not the case for you? Are you looking for one-time
> > scrubbing of data, for example? The benefit of this approach is that you
> > eliminate any concurrency issues that arise from the streaming job
> > producing data for a user while deletes are also being issued for that
> > user.
> >
> > On concurrency control, Hudi now supports multiple writers, if you want
> > to write a background job that will perform these deletes for you. It's
> > in 0.8.0, see https://hudi.apache.org/docs/concurrency_control.html. One
> > of us can help you out with trying this and rolling it out (Nishith is
> > the feature author). Here, if the delete job touches the same files that
> > the streaming job is writing to, then only one of them will succeed.
> >
> > We are working on a design for true lock-free concurrency control, which
> > provides the benefits of both models, but it won't be there for another
> > month or two.
> >
> > Thanks,
> > Vinoth
> >
> > On Tue, Apr 13, 2021 at 1:45 PM Kizhakkel Jose, Felix
> > <[email protected]> wrote:
> >
> > > Hi All,
> > >
> > > I have 100s of Hudi tables (AWS S3), each populated via Spark
> > > structured streaming from Kafka streams. Now I have to delete records
> > > for a given user (userId) from all the tables that have data for that
> > > user, meaning all tables with a reference to that specific userId. I
> > > cannot republish all the events/records for that user to Kafka to
> > > perform the delete, since it is around 10-15 years' worth of data per
> > > user and would be very costly and time consuming. So I am wondering how
> > > everybody is performing GDPR deletes on their Hudi tables.
> > >
> > > How do we get the delete request? On a delete Kafka topic we get a
> > > delete event [which just contains the userId of the user to delete], so
> > > we have to use that as a filter condition, read all the matching
> > > records from the Hudi tables, and write them back with the data source
> > > operation set to 'delete'. But while this delete Spark job is running
> > > on a table, if the streaming job continues to ingest newly arriving
> > > data, what will be the side effect? Will it work, since it seems like
> > > multiple writers are not currently supported?
> > >
> > > Could you help me with a solution?
> > >
> > > Regards,
> > > Felix K Jose
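For the delete approach Felix describes above (filter the table by userId and
write the matching rows back with the 'delete' operation), a minimal sketch of
such a batch job in Spark/Scala might look like the following. The table path,
table name, record key field, precombine field, and the "userId" column are
placeholders, not anything from this thread; exact datasource option names
should be checked against the Hudi docs for your version.

    // Rough sketch of a batch GDPR-delete job for one Hudi table.
    // All paths and field names below are placeholders.
    import org.apache.spark.sql.{SaveMode, SparkSession}
    import org.apache.spark.sql.functions.col

    object GdprDeleteJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("hudi-gdpr-delete").getOrCreate()

        val tablePath = "s3://my-bucket/hudi/my_table"  // placeholder
        val userIdToDelete = args(0)                    // e.g. taken from the delete topic

        // Read the current snapshot of the table and keep only the user's records.
        val toDelete = spark.read.format("hudi")
          .load(tablePath)
          .filter(col("userId") === userIdToDelete)

        // Write those records back with the 'delete' write operation.
        toDelete.write.format("hudi")
          .option("hoodie.table.name", "my_table")                        // placeholder
          .option("hoodie.datasource.write.recordkey.field", "recordId")  // placeholder
          .option("hoodie.datasource.write.precombine.field", "ts")       // placeholder
          .option("hoodie.datasource.write.operation", "delete")
          .mode(SaveMode.Append)
          .save(tablePath)
      }
    }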
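For the multi-writer setup Vinoth and Nishith mention, a rough sketch of the
0.8.0 optimistic-concurrency write options is below, assuming a ZooKeeper lock
provider; the ZooKeeper endpoint, lock key, and base path are placeholders, and
the exact option names should be verified against the concurrency_control page
linked above.

    // Extra write options for optimistic concurrency control (Hudi 0.8.0),
    // so the background delete job and the streaming ingest can both write
    // to the same table. ZooKeeper host, lock key, and base path are placeholders.
    val concurrencyOpts = Map(
      "hoodie.write.concurrency.mode" -> "optimistic_concurrency_control",
      "hoodie.cleaner.policy.failed.writes" -> "LAZY",
      "hoodie.write.lock.provider" ->
        "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
      "hoodie.write.lock.zookeeper.url" -> "zk-host",      // placeholder
      "hoodie.write.lock.zookeeper.port" -> "2181",        // placeholder
      "hoodie.write.lock.zookeeper.lock_key" -> "my_table",
      "hoodie.write.lock.zookeeper.base_path" -> "/hudi/locks"
    )

    // Both writers would pass these via .options(concurrencyOpts) on their
    // Hudi writes; if they touch the same files, one commit is rolled back.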
