Sounds good to me. We are always looking to add more contributors. https://github.com/apache/hudi/pull/2263 is the PR under review for clustering.
RFC 18/19 have the details as well.

On Wed, Nov 25, 2020 at 6:20 AM Kizhakkel Jose, Felix <felix.j...@philips.com> wrote:

> Hi Vinoth, Siva,
>
> I know you guys are so busy, but I always get a quick response from one of the hoodiers. Thank you so much for the detailed information.
>
> Yes, as suggested, for UPSERTs I will go with *Approach 2*.
>
> For deletes, clustering can help me. I am also happy to see that we don't need to duplicate that field as part of the record key to get it clustered. Where can I find the PR/RFC for the clustering implementation, to read about it and get a better understanding? And I believe this is something similar to bucketing in Hive?
>
> Also, RFC-21 is going to help a lot with the storage footprint.
>
> All interesting stuff. Once I complete my major data lake implementation project I definitely would like to start contributing to Hudi.
>
> Thank you @Vinoth Chandar <vin...@apache.org> @Siva once again for all of your help. And @Raymond, thank you for answering and clarifying things throughout this.
>
> Regards,
> Felix K Jose
>
> *From:* Vinoth Chandar <vin...@apache.org>
> *Date:* Tuesday, November 24, 2020 at 5:52 PM
> *To:* Sivabalan <n.siv...@gmail.com>
> *Cc:* Kizhakkel Jose, Felix <felix.j...@philips.com>, Raymond Xu <xu.shiyan.raym...@gmail.com>, dev@hudi.apache.org <dev@hudi.apache.org>
> *Subject:* Re: Hudi Record Key Best Practices
>
> Agree with Siva's suggestions.
>
> For clustering, it's not necessary for it to be part of the key. (Satish can correct me if I missed something.)
>
> On Tue, Nov 24, 2020 at 2:01 PM Sivabalan <n.siv...@gmail.com> wrote:
>
> Here are the discussion points we had in Slack.
>
> The suggestion is to go with Approach 2, based on these points:
>
> - Prefixing F1 (including timestamp) will help prune some file slices even within a day (within a partition), if records are properly ordered by timestamp.
> - Deletes are occasional compared to upserts.
> So, optimizing for upserts makes sense, and hence Approach 2 is fine. Also, deleting records is a two-part execution anyway: first a query to Hudi like "select HoodieKey from hudi_tbl where user_id = 'X'", and then a DELETE operation to Hudi for those HoodieKeys. For the first query, I assume embedding user_id in the record keys does not matter, because that query filters on a specific column of the dataset.
>
> So initially I thought there was not much value in embedding the user id in the record key. But as Vinoth suggested, clustering could come in handy, so let's have userId as part of the record keys too.
>
> - In Approach 3, the record keys could be too large, so we may not want to go this route.
>
> On Tue, Nov 24, 2020 at 11:58 AM Vinoth Chandar <vin...@apache.org> wrote:
>
> Hi Felix,
>
> I will try to be faster going forward. Apologies for the late reply. Thanks Raymond for all the great clarifications.
>
> On RFC-21, I think it's safe to assume it will be available by Jan or so, in 0.8.0. (Uber folks, correct me if I am wrong.)
>
> >> For approach 2 – the reason for prepending datetime is to have an incrementing id; otherwise your uuid is a purely random id and won't support range pruning while writing, correct?
>
> You are right. In general, we only have the following levers to control performance. I take it that "origination datetime" is not monotonically increasing? Otherwise Approach 1 is good, right?
>
> If you want to optimize for upsert performance:
>
> - Prepending a timestamp field would help. If you simply prepend the date, which is already also the partition path, then all keys in that partition will have the same prefix and no additional pruning opportunities exist.
> - Advise using dynamic bloom filters (config hoodie.bloom.index.filter.type=DYNAMIC_V0), to ensure the bloom filters filter out enough files after range pruning.
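[Editor's note] The advice above can be collected into a set of Spark datasource write options. This is a hedged sketch: `hoodie.bloom.index.filter.type=DYNAMIC_V0` is quoted directly from the thread, while the clustering option names are assumptions based on later Hudi releases (clustering was still landing in 0.7.0 when this was written).

```python
# Hedged sketch of the Hudi write options discussed in this thread.
hudi_write_opts = {
    # Quoted from the thread: dynamic bloom filters, so that after range
    # pruning the remaining files' bloom filters still filter out enough
    # candidate files.
    "hoodie.bloom.index.filter.type": "DYNAMIC_V0",
    # Assumed clustering configs (post-0.7.0 naming): sort/pack each
    # user's records into as few files as possible in older partitions,
    # which keeps per-user GDPR deletes from rewriting whole partitions.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.plan.strategy.sort.columns": "user_id",
}
```

These would be passed to the writer via `.options(**hudi_write_opts)` alongside the usual record key and partition path settings.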
>
> For good delete performance, we can cluster records by user_id for older partitions, such that all of a user's records are packed into the smallest number of files. This way, when only a small number of users leave, your delete won't rewrite the entire partition's files. Clustering support is landing by the end of the year in 0.7.0. (There is a PR out already, if you want to test/play.)
>
> All of this is also highly workload specific, so we can get into those details if that helps. MOR is a much better alternative for dealing with deletes, IMO. It was specifically designed and used for those cases, since it can absorb the deletes into log files and apply them later, amortizing costs.
>
> The future is good, since we are investing in record-level indexes that could also natively index secondary fields like user_id. Again, expect that to be there in 0.9.0 or so, around March. For now, we have to play with how we lay out the data to squeeze out performance.
>
> Hope that helps.
>
> thanks
> vinoth
>
> On Tue, Nov 24, 2020 at 5:54 AM Kizhakkel Jose, Felix <felix.j...@philips.com> wrote:
>
> Hi Raymond,
>
> Thanks a lot for the reply.
>
> For approach 2 – the reason for prepending datetime is to have an incrementing id; otherwise your uuid is a purely random id and won't support range pruning while writing, correct? In a given date partition I expect to get tens of billions of records, and having an incrementing id helps BLOOM filtering? This is the only intent of the datetime prefix (int64 representation).
>
> Yes, I also see Approach 3 as really too big, adding a lot to the storage footprint.
>
> My initial approach was Approach 1 (a uuid generated from all 4 fields); then I heard that range pruning can make writes faster – so I thought of datetime as a prefix. Do you see any benefit, or can the UUID itself be sufficient – since it's been generated from the 4 input fields?
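[Editor's note] The two-part delete execution Siva described earlier (first query the HoodieKeys for a user, then issue a DELETE for those keys) can be sketched roughly as below. This is a hypothetical illustration, not the thread's actual code: the table name, field names, and paths are placeholders, shown here as the query string and writer options rather than a live Spark job.

```python
# Part 1: query Hudi for the keys of the records belonging to one user.
# _hoodie_record_key and _hoodie_partition_path are Hudi's metadata columns.
lookup_sql = (
    "SELECT _hoodie_record_key, _hoodie_partition_path "
    "FROM hudi_tbl WHERE user_id = 'X'"
)

# Part 2: issue a DELETE operation for those keys through the Spark
# datasource writer (operation=delete is Hudi's delete-by-key path).
delete_opts = {
    "hoodie.datasource.write.operation": "delete",
    "hoodie.table.name": "hudi_tbl",
}

# With a live SparkSession this would run roughly as:
#   keys_df = spark.sql(lookup_sql)
#   keys_df.write.format("hudi").options(**delete_opts).mode("append").save(base_path)
```

Note how the first query filters on the user_id column itself, which is why embedding user_id in the record key does not help this step; it is clustering by user_id that limits how many files the second step rewrites.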
>
> Regards,
> Felix K Jose
>
> *From:* Raymond Xu <xu.shiyan.raym...@gmail.com>
> *Date:* Tuesday, November 24, 2020 at 2:20 AM
> *To:* Kizhakkel Jose, Felix <felix.j...@philips.com>
> *Cc:* dev@hudi.apache.org <dev@hudi.apache.org>, vin...@apache.org <vin...@apache.org>, n.siv...@gmail.com <n.siv...@gmail.com>
> *Subject:* Re: Hudi Record Key Best Practices
>
> Hi Felix,
>
> I'd prefer approach 1. The logic is simple: to ensure uniqueness in your dataset.
>
> For 2, I am not very sure about the intention of prepending the datetime; it looks like duplicate info, knowing that you already partitioned by that field.
>
> For 3, it seems too long for a primary id.
>
> Hope this helps.
>
> On Mon, Nov 23, 2020 at 6:25 PM Kizhakkel Jose, Felix <felix.j...@philips.com> wrote:
>
> @Vinoth Chandar <vin...@apache.org>,
>
> Could you please take a look and let me know what the best approach is, or could you see who can help me on this?
>
> Regards,
> Felix K Jose
>
> *From:* Kizhakkel Jose, Felix <felix.j...@philips.com.INVALID>
> *Date:* Thursday, November 19, 2020 at 12:04 PM
> *To:* dev@hudi.apache.org <dev@hudi.apache.org>, Vinoth Chandar <vin...@apache.org>, xu.shiyan.raym...@gmail.com <xu.shiyan.raym...@gmail.com>
> *Cc:* vin...@apache.org <vin...@apache.org>, n.siv...@gmail.com <n.siv...@gmail.com>
> *Subject:* Re: Hudi Record Key Best Practices
>
> Sure, I will look into the partition key.
>
> Since RFC 21 is not yet implemented and available to consume, can anyone please suggest the best approach I should follow to construct the record key I asked about in the original question:
>
> "
> My Write Use Cases:
> 1. Writes to a partitioned HUDI table every 15 minutes
>    1. where 95% are inserts and 5% are updates,
>    2. Also, 95% of writes go to the same partition (current date); 5% of writes can span multiple partitions
> 2.
GDPR requests to delete records from the table using the User Identifier field (F3)
>
> Record Key Construction:
>
> Approach 1:
> Generate a UUID from the concatenated string of all 4 fields [e.g. str(F1) + "_" + str(F2) + "_" + str(F3) + "_" + str(F4)] and use that newly generated field as the record key.
>
> Approach 2:
> Generate a UUID from the concatenated string of the 3 fields except the datetime field (F1) [e.g. str(F2) + "_" + str(F3) + "_" + str(F4)], prepend the datetime field to the generated UUID, and use that newly generated field as the record key: F1_<uuid>
>
> Approach 3:
> Record key as a composite key of all 4 fields (F1, F2, F3, F4)
> "
>
> Regards,
> Felix K Jose
>
> From: Raymond Xu <xu.shiyan.raym...@gmail.com>
> Date: Wednesday, November 18, 2020 at 5:30 PM
> To: dev@hudi.apache.org <dev@hudi.apache.org>
> Cc: vin...@apache.org <vin...@apache.org>, n.siv...@gmail.com <n.siv...@gmail.com>
> Subject: Re: Hudi Record Key Best Practices
>
> Hi Felix, I wasn't suggesting partitioning by user id; that would be too many partitions. I was just saying that spreading the writes more evenly across partitions could be better. Effectively, with 95% of writes going to one partition, it's like writing to a single-partition dataset. Hourly partitions could mitigate the situation, since you also have date-range queries. Just some rough ideas; the strategy really depends on your data pattern and requirements.
>
> For the development timeline on RFC 21, probably Vinoth or Balaji could give more info.
>
> On Wed, Nov 18, 2020 at 7:38 AM Kizhakkel Jose, Felix <felix.j...@philips.com.invalid> wrote:
> >
> > Hi Raymond,
> >
> > Thank you for the response.
> >
> > Yes, the virtual key is definitely going to help reduce the storage footprint. When do you think it is going to be available, and will it be compatible with all downstream processing engines (Presto, Redshift Spectrum, etc.)? We have started our development activities and expect to get into PROD in the March-April timeframe.
> >
> > Regarding the partition key: we get data every day from 10-20 million users, and currently we are planning to partition the data by date (YYYY-MM-DD). That gives us consistent partitions for downstream systems (every partition has the same amount of data [20 million users' data in each partition, rather than skewed partitions]). And most of our queries are date-range queries for a given user-Id.
> >
> > If I partition by user-Id, then I will have millions of partitions, and I have read that having a large number of partitions has a major read impact (metadata management, etc.). What do you think? Is my understanding correct?
> >
> > Yes, for the current day most of the data will be for that day – so do you think it's going to be a problem while writing (won't the BLOOM index help)? That's what I am trying to understand, to land on a more performant solution.
> >
> > Meanwhile, I would like to settle my record key construct as well, to see how it can help with write performance and the downstream requirement to support GDPR, and to avoid any reprocessing/migration down the line.
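[Editor's note] The date partitioning scheme described above (one partition per day, YYYY-MM-DD, derived from the record's origination datetime) is a one-liner; a minimal sketch follows, with the function and field names being placeholders, not names from the thread.

```python
from datetime import datetime, timezone

def partition_path(origination_dt: datetime) -> str:
    """Map a record's origination datetime to its YYYY-MM-DD date partition."""
    return origination_dt.strftime("%Y-%m-%d")

# A 15-minute batch on Nov 24 lands almost entirely in the 2020-11-24
# partition, matching the "95% of writes hit the current date" pattern.
event_time = datetime(2020, 11, 24, 10, 30, tzinfo=timezone.utc)
print(partition_path(event_time))  # -> 2020-11-24
```

Hourly partitioning, as Raymond suggests, would simply extend the format string (e.g. `"%Y-%m-%d-%H"`) at the cost of 24x more partitions per day.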
> >
> > Regards,
> > Felix K Jose
> >
> > From: Raymond Xu <xu.shiyan.raym...@gmail.com>
> > Date: Tuesday, November 17, 2020 at 6:18 PM
> > To: dev@hudi.apache.org <dev@hudi.apache.org>
> > Cc: vin...@apache.org <vin...@apache.org>, n.siv...@gmail.com <n.siv...@gmail.com>, v.bal...@ymail.com.invalid <v.bal...@ymail.com.invalid>
> > Subject: Re: Hudi Record Key Best Practices
> >
> > Hi Felix, it looks like the use case will benefit from the virtual key feature in this RFC:
> >
> > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+21+%3A+Allow+HoodieRecordKey+to+be+Virtual
> >
> > Once this is implemented, you don't have to create a separate key.
> >
> > A rough thought: you mentioned 95% of writes go to the same partition. Rather than the record key, maybe consider improving on the partition field, e.g. to have more even writes across partitions?
> >
> > On Sat, Nov 14, 2020 at 8:46 PM Kizhakkel Jose, Felix <felix.j...@philips.com.invalid> wrote:
> > >
> > > Hello All,
> > >
> > > I have asked generic questions regarding the record key in the Slack channel, but I just want to consolidate everything regarding the record key and the suggested best practices for record key construction to get better write performance.
> > >
> > > Table Type: COW
> > > Partition Path: Date
> > >
> > > My record uniqueness is derived from a combination of 4 fields:
> > >
> > > 1. F1: Datetime (record's origination datetime)
> > > 2. F2: String (11-char serial number)
> > > 3. F3: UUID (User Identifier)
> > > 4. F4: String (12-char statistic name)
> > >
> > > Note: My record is a nested document, and some of the above fields are nested fields.
> > >
> > > My Write Use Cases:
> > > 1. Writes to a partitioned HUDI table every 15 minutes
> > >    1. where 95% are inserts and 5% are updates,
> > >    2. Also, 95% of writes go to the same partition (current date); 5% of writes can span multiple partitions
> > > 2. GDPR requests to delete records from the table using the User Identifier field (F3)
> > >
> > > Record Key Construction:
> > >
> > > Approach 1:
> > > Generate a UUID from the concatenated string of all 4 fields [e.g. str(F1) + "_" + str(F2) + "_" + str(F3) + "_" + str(F4)] and use that newly generated field as the record key.
> > >
> > > Approach 2:
> > > Generate a UUID from the concatenated string of the 3 fields except the datetime field (F1) [e.g. str(F2) + "_" + str(F3) + "_" + str(F4)], prepend the datetime field to the generated UUID, and use that newly generated field as the record key: F1_<uuid>
> > >
> > > Approach 3:
> > > Record key as a composite key of all 4 fields (F1, F2, F3, F4)
> > >
> > > Which approach would you suggest? Could you please help me?
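[Editor's note] The three candidate record keys can be sketched in plain Python. This is only an illustration: `uuid5` is used as one deterministic way to "generate a UUID from the concatenated string" (the thread does not specify the mechanism), and all field values are made-up placeholders.

```python
import uuid
from datetime import datetime, timezone

# Illustrative field values only (the real F1-F4 come from the record).
f1 = datetime(2020, 11, 24, 10, 30, 0, tzinfo=timezone.utc)   # origination datetime
f2 = "SN123456789"                                            # 11-char serial number
f3 = str(uuid.uuid5(uuid.NAMESPACE_URL, "user-42"))           # user identifier
f4 = "STAT_NAME_01"                                           # 12-char statistic name

# Approach 1: one UUID derived from all four concatenated fields.
key1 = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{f1.isoformat()}_{f2}_{f3}_{f4}"))

# Approach 2: datetime prefix (int64 epoch millis, per the thread) plus a
# UUID of the remaining three fields. Keys now sort by event time within a
# partition, which is what enables range pruning of file slices on write.
prefix = int(f1.timestamp() * 1000)
key2 = f"{prefix}_{uuid.uuid5(uuid.NAMESPACE_URL, f'{f2}_{f3}_{f4}')}"

# Approach 3: plain composite of all four fields (largest on disk).
key3 = f"{f1.isoformat()}_{f2}_{f3}_{f4}"
```

This makes the trade-offs concrete: Approach 1 yields a fixed 36-char but randomly ordered key; Approach 2 keeps the key sortable by time at a modest size; Approach 3 is the longest, which is the storage-footprint concern raised in the thread.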
> > >
> > > Regards,
> > > Felix K Jose
> > >
> > > ________________________________
> > > The information contained in this message may be confidential and legally protected under applicable law. The message is intended solely for the addressee(s). If you are not the intended recipient, you are hereby notified that any use, forwarding, dissemination, or reproduction of this message is strictly prohibited and may be unlawful. If you are not the intended recipient, please contact the sender by return e-mail and destroy all copies of the original message.

--
Regards,
-Sivabalan