No worries. Is the custom build something you can work with the AWS team to get installed, so that you are able to test?
-Nishith

On Wed, Apr 14, 2021 at 12:57 PM Kizhakkel Jose, Felix
<[email protected]> wrote:

> Hi Nishith, Vinoth,
>
> Thank you so much for the quick response and offering the help.
>
> Regards,
> Felix K Jose
>
> From: Kizhakkel Jose, Felix <[email protected]>
> Date: Wednesday, April 14, 2021 at 3:55 PM
> To: [email protected] <[email protected]>
> Subject: Re: GDPR deletes and Consenting deletes of data from hudi table
>
> Hi Nishith,
>
> As I mentioned, we are on AWS EMR, but I don't think this 0.8.0 version is
> available as part of the existing release, so we need a custom build to get
> it working on the latest EMR 6.1.0.
>
> Regards,
> Felix K Jose
>
> From: nishith agarwal <[email protected]>
> Date: Wednesday, April 14, 2021 at 3:49 PM
> To: dev <[email protected]>
> Subject: Re: GDPR deletes and Consenting deletes of data from hudi table
>
> Felix,
>
> Happy to help you through trying out and rolling out multi-writer on Hudi
> tables. Do you have a test environment where you can try the feature by
> following the doc that Vinoth pointed to above?
>
> Thanks,
> Nishith
>
> On Wed, Apr 14, 2021 at 12:26 PM Vinoth Chandar <[email protected]> wrote:
>
> > Hi Felix,
> >
> > Most people, I think, are publishing this data into Kafka and applying
> > the deletes as part of the streaming job itself. The reason this works is
> > that typically only a small fraction of users leave the service (say
> > << 0.1% weekly is what I have heard), so the cost of storage on Kafka is
> > not much. Is that not the case for you? Are you looking for one-time
> > scrubbing of data, for example? The benefit of this approach is that you
> > eliminate any concurrency issues that arise from the streaming job
> > producing data for a user while deletes are also being issued for that
> > user.
> >
> > On concurrency control, Hudi now supports multiple writers, if you want
> > to write a background job that will perform these deletes for you. It's
> > in 0.8.0, see https://hudi.apache.org/docs/concurrency_control.html. One
> > of us can help you out with trying this and rolling it out (Nishith is
> > the feature author). Here, if the delete job touches the same files that
> > the streaming job is writing to, then only one of them will succeed.
> >
> > We are working on a design for true lock-free concurrency control, which
> > provides the benefits of both models, but it won't be there for another
> > month or two.
> >
> > Thanks,
> > Vinoth
> >
> > On Tue, Apr 13, 2021 at 1:45 PM Kizhakkel Jose, Felix
> > <[email protected]> wrote:
> >
> > > Hi All,
> > >
> > > I have 100s of Hudi tables (AWS S3), each populated via Spark
> > > structured streaming from Kafka streams. Now I have to delete records
> > > for a given user (userId) from all the tables that have data for that
> > > user, meaning all tables with a reference to that specific userId. I
> > > cannot republish all the events/records for that user to Kafka to
> > > perform the delete, since it is around 10-15 years' worth of data per
> > > user and would be very costly and time consuming. So I am wondering how
> > > everybody is performing GDPR deletes on their Hudi tables.
> > >
> > > How do we get the delete request? On a delete Kafka topic we get a
> > > delete event [which just contains the userId of the user to delete], so
> > > we have to use that as a filter condition, read all the matching
> > > records from the Hudi tables, and write them back with the data source
> > > operation set to 'delete'. But while this delete Spark job is running
> > > on a table, if the streaming job continues to ingest newly arriving
> > > data, what will be the side effect? Will it work, since it seems like
> > > multiple writers are not currently supported?
> > >
> > > Could you help me with a solution?
> > >
> > > Regards,
> > > Felix K Jose
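For the delete approach Felix describes above (filter the table by userId and
write the matching rows back with the 'delete' operation), a minimal sketch of
such a batch job in Spark/Scala might look like the following. The table path,
table name, record key field, precombine field, and the "userId" column are
placeholders, not anything from this thread; exact datasource option names
should be checked against the Hudi docs for your version.

    // Rough sketch of a batch GDPR-delete job for one Hudi table.
    // All paths and field names below are placeholders.
    import org.apache.spark.sql.{SaveMode, SparkSession}
    import org.apache.spark.sql.functions.col

    object GdprDeleteJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("hudi-gdpr-delete").getOrCreate()

        val tablePath = "s3://my-bucket/hudi/my_table"  // placeholder
        val userIdToDelete = args(0)                    // e.g. taken from the delete topic

        // Read the current snapshot of the table and keep only the user's records.
        val toDelete = spark.read.format("hudi")
          .load(tablePath)
          .filter(col("userId") === userIdToDelete)

        // Write those records back with the 'delete' write operation.
        toDelete.write.format("hudi")
          .option("hoodie.table.name", "my_table")                        // placeholder
          .option("hoodie.datasource.write.recordkey.field", "recordId")  // placeholder
          .option("hoodie.datasource.write.precombine.field", "ts")       // placeholder
          .option("hoodie.datasource.write.operation", "delete")
          .mode(SaveMode.Append)
          .save(tablePath)
      }
    }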
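For the multi-writer setup Vinoth and Nishith mention, a rough sketch of the
0.8.0 optimistic-concurrency write options is below, assuming a ZooKeeper lock
provider; the ZooKeeper endpoint, lock key, and base path are placeholders, and
the exact option names should be verified against the concurrency_control page
linked above.

    // Extra write options for optimistic concurrency control (Hudi 0.8.0),
    // so the background delete job and the streaming ingest can both write
    // to the same table. ZooKeeper host, lock key, and base path are placeholders.
    val concurrencyOpts = Map(
      "hoodie.write.concurrency.mode" -> "optimistic_concurrency_control",
      "hoodie.cleaner.policy.failed.writes" -> "LAZY",
      "hoodie.write.lock.provider" ->
        "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
      "hoodie.write.lock.zookeeper.url" -> "zk-host",      // placeholder
      "hoodie.write.lock.zookeeper.port" -> "2181",        // placeholder
      "hoodie.write.lock.zookeeper.lock_key" -> "my_table",
      "hoodie.write.lock.zookeeper.base_path" -> "/hudi/locks"
    )

    // Both writers would pass these via .options(concurrencyOpts) on their
    // Hudi writes; if they touch the same files, one commit is rolled back.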
