If you want to try something quickly, you can also build a jar off master and
run it independently (this works for client mode / spark-shell experiments):
https://dev.to/bytearray/using-your-own-apache-spark-hudi-versions-with-aws-emr-40a0



On Thu, Apr 15, 2021 at 6:09 AM Kizhakkel Jose, Felix
<felix.j...@philips.com.invalid> wrote:

> Hi Nishith,
>
> I will check with Udit M, since he had helped me in the past with a custom
> jar for EMR.
>
> Regards,
> Felix K Jose
> From: nishith agarwal <n3.nas...@gmail.com>
> Date: Wednesday, April 14, 2021 at 3:59 PM
> To: dev <dev@hudi.apache.org>
> Subject: Re: GDPR deletes and Consenting deletes of data from hudi table
> Caution: This e-mail originated from outside of Philips, be careful for
> phishing.
>
>
> No worries. Is the custom build something you can work with the AWS team to
> get installed, so that you can test?
>
> -Nishith
>
> On Wed, Apr 14, 2021 at 12:57 PM Kizhakkel Jose, Felix
> <felix.j...@philips.com.invalid> wrote:
>
> > Hi Nishith, Vinoth,
> >
> > Thank you so much for the quick response and offering the help.
> >
> > Regards,
> > Felix K Jose
> > From: Kizhakkel Jose, Felix <felix.j...@philips.com.INVALID>
> > Date: Wednesday, April 14, 2021 at 3:55 PM
> > To: dev@hudi.apache.org <dev@hudi.apache.org>
> > Subject: Re: GDPR deletes and Consenting deletes of data from hudi table
> >
> >
> > Hi Nishith,
> >
> > As I mentioned, we are on AWS EMR, but I don’t think version 0.8.0 is
> > available as part of any existing EMR release. So we would need a custom
> > build to get it working on the latest EMR, 6.1.0.
> >
> > Regards,
> > Felix K Jose
> > From: nishith agarwal <n3.nas...@gmail.com>
> > Date: Wednesday, April 14, 2021 at 3:49 PM
> > To: dev <dev@hudi.apache.org>
> > Subject: Re: GDPR deletes and Consenting deletes of data from hudi table
> >
> >
> > Felix,
> >
> > Happy to help you through trying and rolling out multi-writer on Hudi
> > tables. Do you have a test environment where you can try out the feature
> > by following the doc that Vinoth pointed to above?
> >
> > Thanks,
> > Nishith
> >
> > On Wed, Apr 14, 2021 at 12:26 PM Vinoth Chandar <vin...@apache.org> wrote:
> >
> > > Hi Felix,
> > >
> > > Most people, I think, publish this data into Kafka and apply the
> > > deletes as part of the streaming job itself. This works because
> > > typically only a small fraction of users leave the service (<< 0.1%
> > > weekly is what I have heard), so the cost of storage on Kafka is not
> > > much. Is that not the case for you? Are you looking for a one-time
> > > scrubbing of data, for example? The benefit of this approach is that
> > > you eliminate any concurrency issues that arise from the streaming job
> > > producing data for a user while deletes are also being issued for that
> > > user.
> > >
> > > On concurrency control, Hudi now supports multiple writers, if you want
> > > to write a background job that will perform these deletes for you. It's
> > > in 0.8.0 (see https://hudi.apache.org/docs/concurrency_control.html).
> > > One of us can help you out with trying this and rolling it out (Nishith
> > > is the feature author). Here, if the delete job touches the same files
> > > that the streaming job is writing to, then only one of them will
> > > succeed.
> > >
> > > We are working on a design for true lock-free concurrency control, which
> > > provides the benefits of both models, but it won't be there for another
> > > month or two.
> > >
> > > Thanks
> > > Vinoth
> > >
> > >
> > > On Tue, Apr 13, 2021 at 1:45 PM Kizhakkel Jose, Felix
> > > <felix.j...@philips.com.invalid> wrote:
> > >
> > > > Hi All,
> > > >
> > > > I have 100s of Hudi tables (AWS S3), each of which is populated via
> > > > Spark Structured Streaming from Kafka streams. Now I have to delete
> > > > the records for a given user (userId) from all the tables that have
> > > > data for that user, meaning all tables where we have a reference to
> > > > that specific userId. I cannot republish all the events/records for
> > > > that user to Kafka to perform the delete, since it's around 10-15
> > > > years’ worth of data for each user and would be very costly and time
> > > > consuming. So I am wondering how everybody is performing GDPR deletes
> > > > on their Hudi tables?
> > > >
> > > >
> > > > How do I get the delete request?
> > > > On a delete Kafka topic we get a delete event (which just contains the
> > > > userId of the user to delete), so we have to use that as a filter
> > > > condition, read all the matching records from the Hudi tables, and
> > > > write them back with the data source operation set to ‘delete’. But
> > > > if the streaming job continues to ingest newly arriving data while
> > > > this delete Spark job is running on the table, what will the side
> > > > effect be? Will it work, since it seems that multiple writers are not
> > > > currently supported?
> > > >
> > > > Could you help me with a solution?
> > > >
> > > > Regards,
> > > > Felix K Jose
> > > >
> > > > ________________________________
> > > > The information contained in this message may be confidential and
> > > > legally protected under applicable law. The message is intended solely
> > > > for the addressee(s). If you are not the intended recipient, you are
> > > > hereby notified that any use, forwarding, dissemination, or
> > > > reproduction of this message is strictly prohibited and may be
> > > > unlawful. If you are not the intended recipient, please contact the
> > > > sender by return e-mail and destroy all copies of the original message.
> > > >
> > >
> >
> >
>
>
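For anyone following the thread, here is a minimal sketch of how the writer options for the background delete job discussed above might be assembled, assuming Hudi 0.8.0 with optimistic concurrency control and a ZooKeeper-based lock provider. The helper name `delete_job_options` and all of its parameters are hypothetical and only for illustration. The option keys are the ones I understand the 0.8.0 concurrency control doc to describe, and ZooKeeper is just one possible lock provider, so please check both against the doc linked in the thread before relying on them.

```python
from typing import Dict


def delete_job_options(
    table_name: str,
    record_key: str,
    partition_field: str,
    precombine_field: str,
    zk_url: str,
    zk_port: int,
    lock_key: str,
    zk_base_path: str,
) -> Dict[str, str]:
    """Build Hudi writer options for a delete job that runs next to a streaming writer.

    Hypothetical helper: it only assembles a dict of option strings and does
    not talk to Spark or Hudi itself.
    """
    # Options that make the write a delete of the rows in the DataFrame.
    write_opts = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "delete",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }
    # Options that enable optimistic concurrency control (assumed to be set
    # on the streaming writer as well, so both writers share the same lock).
    occ_opts = {
        "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
        "hoodie.cleaner.policy.failed.writes": "LAZY",
        "hoodie.write.lock.provider": (
            "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider"
        ),
        "hoodie.write.lock.zookeeper.url": zk_url,
        "hoodie.write.lock.zookeeper.port": str(zk_port),
        "hoodie.write.lock.zookeeper.lock_key": lock_key,
        "hoodie.write.lock.zookeeper.base_path": zk_base_path,
    }
    return {**write_opts, **occ_opts}
```

With a SparkSession available, the rows read from the table and filtered by userId would then be written back with something like `df.write.format("hudi").options(**opts).mode("append").save(base_path)`. As noted in the thread, if the delete job and the streaming job touch the same files, only one of the two commits will succeed, so the delete job should be prepared to retry.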
