Hi Nishith,

As I mentioned, we are on AWS EMR, but I don’t think Hudi 0.8.0 is available as
part of the existing EMR release. So we would need a custom build to get it
working on the latest EMR, 6.1.0.

Regards,
Felix K Jose
From: nishith agarwal <n3.nas...@gmail.com>
Date: Wednesday, April 14, 2021 at 3:49 PM
To: dev <dev@hudi.apache.org>
Subject: Re: GDPR deletes and Consenting deletes of data from hudi table


Felix,

Happy to help you through trying and rolling out multi-writer on Hudi
tables. Do you have a test environment where you can try out the feature by
following the doc that Vinoth pointed to above?

Thanks,
Nishith

On Wed, Apr 14, 2021 at 12:26 PM Vinoth Chandar <vin...@apache.org> wrote:

> Hi Felix,
>
> Most people, I think, are publishing this data into Kafka and applying the
> deletes as a part of the streaming job itself. The reason why this works is
> that typically only a small fraction of users leave the service (say <<
> 0.1% weekly is what I have heard). So, the cost of storage on Kafka is not
> much. Is that not the case for you? Are you looking for a one-time scrubbing
> of data, for example? The benefit of this approach is that you eliminate any
> concurrency issues that arise from the streaming job producing data for a
> user while deletes are also issued for that user.
>
> On concurrency control, Hudi now supports multiple writers, if you want to
> write a background job that will perform these deletes for you. It's in
> 0.8.0, see https://hudi.apache.org/docs/concurrency_control.html.
> One of us can help you out with trying this and rolling it out (Nishith is
> the feature author). Here, if the delete job touches the same files that the
> streaming job is writing to, then only one of them will succeed.
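
For reference, a minimal sketch of the writer options that the concurrency
control doc linked above describes for 0.8.0, collected into one dict so both
the streaming job and the delete job can share them. The helper name and the
ZooKeeper host, port, lock key, and base path are hypothetical placeholders;
the option keys and the ZooKeeper lock provider class are taken from that doc,
so please check them against the version you actually deploy.

```python
def multi_writer_options(zk_url, zk_port, lock_key, lock_base_path):
    """Build Hudi optimistic-concurrency-control options (hypothetical helper).

    Every writer on the same table should use the same lock settings so that
    conflicting commits are detected and only one of them succeeds.
    """
    return {
        # Turn on multi-writer support (optimistic concurrency control).
        "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
        # Failed writes are cleaned up lazily when several writers are active.
        "hoodie.cleaner.policy.failed.writes": "LAZY",
        # ZooKeeper-based lock provider; other providers are listed in the doc.
        "hoodie.write.lock.provider":
            "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
        "hoodie.write.lock.zookeeper.url": zk_url,
        "hoodie.write.lock.zookeeper.port": str(zk_port),
        "hoodie.write.lock.zookeeper.lock_key": lock_key,
        "hoodie.write.lock.zookeeper.base_path": lock_base_path,
    }


# Placeholder values, for illustration only.
opts = multi_writer_options("zk1.example.internal", 2181, "events", "/hudi/locks")
for key, value in sorted(opts.items()):
    print(f"{key}={value}")
```

These would be passed alongside the usual write options, for example via
`.options(**opts)` on the Spark DataFrame writer in each job.
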
>
> We are working on a design for true lock-free concurrency control, which
> provides the benefits of both models. But it won't be there for another month
> or two.
>
> Thanks
> Vinoth
>
>
> On Tue, Apr 13, 2021 at 1:45 PM Kizhakkel Jose, Felix
> <felix.j...@philips.com.invalid> wrote:
>
> > Hi All,
> >
> > I have 100s of Hudi tables (AWS S3), each of which is populated via
> > Spark Structured Streaming from Kafka streams. Now I have to delete records
> > for a given user (userId) from all the tables that have data for that user,
> > meaning all tables where we have a reference to that specific userId. I
> > cannot republish all the events/records for that user to Kafka to perform
> > the delete, since it's around 10-15 years' worth of data for each user and
> > would be very costly and time-consuming. So I am wondering how everybody
> > is performing GDPR deletes on their Hudi tables?
> >
> >
> > How do I get the delete request?
> > On a delete Kafka topic we get a delete event (which contains just the
> > userId of the user to delete), so we have to use that as a filter condition,
> > read all the matching records from the Hudi tables, and write them back with
> > the data source operation set to ‘delete’. But if the streaming job
> > continues to ingest newly arriving data while this delete Spark job is
> > running on the table, what will be the side effect? Will it work, since it
> > seems multiple writers are not currently supported?
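
A rough sketch of the write options such a delete job might use, assuming the
standard Hudi Spark datasource keys. The helper name and the table and field
names ("events", "eventId", "eventDate", "ts") are made up for illustration;
only `userId` comes from the description above. The Spark calls are left as
comments because they need a running session and the real table path.

```python
def delete_write_options(table_name, record_key_field, partition_field,
                         precombine_field):
    """Build Hudi datasource options for a delete write (hypothetical helper)."""
    return {
        "hoodie.table.name": table_name,
        # Records in the written DataFrame are treated as deletes, matched by key.
        "hoodie.datasource.write.operation": "delete",
        "hoodie.datasource.write.recordkey.field": record_key_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }


# Placeholder names, for illustration only.
opts = delete_write_options("events", "eventId", "eventDate", "ts")

# With a SparkSession `spark`, a table `base_path`, and the `user_id` taken
# from the delete event, the job would look roughly like this (assumption:
# the read path may need a partition glob depending on the Hudi version):
#
#   to_delete = spark.read.format("hudi").load(base_path) \
#       .where(f"userId = '{user_id}'")
#   to_delete.write.format("hudi").options(**opts).mode("append").save(base_path)
print(opts["hoodie.datasource.write.operation"])
```

The write uses append mode so that the delete lands as a new commit on the
existing table rather than replacing it.
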
> >
> > Could you help me with a solution?
> >
> > Regards,
> > Felix K Jose
> >
> > ________________________________
> > The information contained in this message may be confidential and legally
> > protected under applicable law. The message is intended solely for the
> > addressee(s). If you are not the intended recipient, you are hereby
> > notified that any use, forwarding, dissemination, or reproduction of this
> > message is strictly prohibited and may be unlawful. If you are not the
> > intended recipient, please contact the sender by return e-mail and destroy
> > all copies of the original message.
> >
>
