If you want to quickly try something, you can also build a jar off master and
run it independently (this works for client mode/spark-shell experiments):
https://dev.to/bytearray/using-your-own-apache-spark-hudi-versions-with-aws-emr-40a0

On Thu, Apr 15, 2021 at 6:09 AM Kizhakkel Jose, Felix
<felix.j...@philips.com.invalid> wrote:

> Hi Nishith,
>
> I will check with Udit M, since he had helped me in the past with a custom
> jar for EMR.
>
> Regards,
> Felix K Jose
> From: nishith agarwal <n3.nas...@gmail.com>
> Date: Wednesday, April 14, 2021 at 3:59 PM
> To: dev <dev@hudi.apache.org>
> Subject: Re: GDPR deletes and Consenting deletes of data from hudi table
> Caution: This e-mail originated from outside of Philips, be careful for
> phishing.
>
>
> No worries. Is the custom build something you can work with the AWS team to
> get installed to be able to test ?
>
> -Nishith
>
> On Wed, Apr 14, 2021 at 12:57 PM Kizhakkel Jose, Felix
> <felix.j...@philips.com.invalid> wrote:
>
> > Hi Nishith, Vinoth,
> >
> > Thank you so much for the quick response and offering the help.
> >
> > Regards,
> > Felix K Jose
> > From: Kizhakkel Jose, Felix <felix.j...@philips.com.INVALID>
> > Date: Wednesday, April 14, 2021 at 3:55 PM
> > To: dev@hudi.apache.org <dev@hudi.apache.org>
> > Subject: Re: GDPR deletes and Consenting deletes of data from hudi table
> >
> >
> > Hi Nishith,
> >
> > As I mentioned we are on AWS EMR, but I don’t think we have this 0.8.0
> > version available as part of existing version. So we need a custom build
> > for working it on latest EMR 6.1.0
> >
> > Regards,
> > Felix K Jose
> > From: nishith agarwal <n3.nas...@gmail.com>
> > Date: Wednesday, April 14, 2021 at 3:49 PM
> > To: dev <dev@hudi.apache.org>
> > Subject: Re: GDPR deletes and Consenting deletes of data from hudi table
> >
> >
> > Felix,
> >
> > Happy to help you through trying and rolling out multi-writer on Hudi
> > tables. Do you have a test environment where you can try out the feature
> > by following the doc that Vinoth pointed above ?
> >
> > Thanks,
> > Nishith
> >
> > On Wed, Apr 14, 2021 at 12:26 PM Vinoth Chandar <vin...@apache.org> wrote:
> >
> > > Hi Felix,
> > >
> > > Most people, I think, are publishing this data into Kafka and applying
> > > the deletes as a part of the streaming job itself. The reason why this
> > > works is because typically only a small fraction of users leave the
> > > service (say << 0.1% weekly is what I have heard). So, the cost of
> > > storage on Kafka is not much. Is that not the case for you? Are you
> > > looking for a one-time scrubbing of data, for example? The benefit of
> > > this approach is that you eliminate any concurrency issues that arise
> > > from the streaming job producing data for a user while the deletes are
> > > also issued for that user.
> > >
> > > On concurrency control, Hudi now supports multiple writers, if you want
> > > to write a background job that will perform these deletes for you. It's
> > > in 0.8.0, see
> > > https://hudi.apache.org/docs/concurrency_control.html .
> > > One of us can help you out with trying this and rolling it out (Nishith
> > > is the feature author). Here, if the delete job touches the same files
> > > that the streaming job is writing to, then only one of them will
> > > succeed.
> > >
> > > We are working on a design for true lock-free concurrency control, which
> > > provides the benefits of both models. But it won't be there for another
> > > month or two.
> > >
> > > Thanks
> > > Vinoth
> > >
> > >
> > > On Tue, Apr 13, 2021 at 1:45 PM Kizhakkel Jose, Felix
> > > <felix.j...@philips.com.invalid> wrote:
> > >
> > > > Hi All,
> > > >
> > > > I have 100s of HUDI tables (AWS S3), each of which is populated via
> > > > Spark structured streaming from Kafka streams. Now I have to delete
> > > > records for a given user (userId) from all the tables which have data
> > > > for that user, meaning all tables where we have a reference to that
> > > > specific userId. I cannot republish all the events/records for that
> > > > user to Kafka to perform the delete, since it's around 10-15 years'
> > > > worth of data for each user and republishing would be costly and time
> > > > consuming. So I am wondering how everybody is performing GDPR deletes
> > > > on their HUDI tables?
> > > >
> > > >
> > > > How do I get the delete request?
> > > > On a delete Kafka topic we get a delete event [which just contains the
> > > > userId of the user to delete], so we have to use that as the filter
> > > > condition, read all the matching records from the HUDI tables, and
> > > > write them back with the data source operation set to ‘delete’. But
> > > > while this delete Spark job is running on the table, if the streaming
> > > > job continues to ingest newly arriving data, what will be the side
> > > > effect? Will it work, since it seems like multiple writers are not
> > > > currently supported?
> > > >
> > > > Could you help me with a solution?
> > > >
> > > > Regards,
> > > > Felix K Jose
> > > >
> > > > ________________________________
> > > > The information contained in this message may be confidential and
> > > > legally protected under applicable law. The message is intended solely
> > > > for the addressee(s). If you are not the intended recipient, you are
> > > > hereby notified that any use, forwarding, dissemination, or
> > > > reproduction of this message is strictly prohibited and may be
> > > > unlawful.
> > > > If you are not the intended recipient, please contact the sender by
> > > > return e-mail and destroy all copies of the original message.
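
For the background delete job discussed in this thread, a minimal sketch of the
write options might look like the following. The option keys are the ones I
understand the Hudi 0.8.0 datasource and concurrency-control docs to describe
(with the ZooKeeper-based lock provider); the helper name, table name, field
names, ZooKeeper host, lock key and paths are all hypothetical placeholders,
not values from this thread. As far as I can tell from the docs, the streaming
writer would need the same concurrency and lock settings for the two jobs to
coordinate.

```python
# Sketch only: builds the options dict for a Hudi "delete" write that runs
# alongside another writer. Every value passed in below is a placeholder.

def build_delete_options(table_name, record_key, partition_field,
                         precombine_field, zk_url, zk_port, lock_key,
                         zk_base_path):
    """Return Hudi write options for a delete issued next to a second writer."""
    return {
        # Regular datasource write options, with the operation set to delete.
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "delete",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Multi-writer settings (optimistic concurrency control, 0.8.0+).
        "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
        "hoodie.cleaner.policy.failed.writes": "LAZY",
        "hoodie.write.lock.provider":
            "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
        "hoodie.write.lock.zookeeper.url": zk_url,
        "hoodie.write.lock.zookeeper.port": str(zk_port),
        "hoodie.write.lock.zookeeper.lock_key": lock_key,
        "hoodie.write.lock.zookeeper.base_path": zk_base_path,
    }


# Hypothetical values, for illustration only.
opts = build_delete_options(
    table_name="user_events",
    record_key="eventId",
    partition_field="eventDate",
    precombine_field="ts",
    zk_url="zk-host.example.internal",
    zk_port=2181,
    lock_key="user_events",
    zk_base_path="/hudi/locks",
)

# With a SparkSession `spark` and a Hudi bundle on the classpath, the delete
# job would then look roughly like this (not executed here; the load path may
# need partition globs depending on the Hudi version):
#   to_delete = spark.read.format("hudi").load(table_path) \
#                    .where("userId = '<id from the delete topic>'")
#   to_delete.write.format("hudi").options(**opts) \
#            .mode("append").save(table_path)
```

If the delete job and the streaming job touch the same files, only one of the
two commits is expected to succeed, so the delete job should be prepared to
retry.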