Re: Hudi Record Key Best Practices

Raymond Xu Mon, 23 Nov 2020 23:20:50 -0800

Hi Felix,
I'd prefer approach 1. The logic is simple: the key only has to ensure
uniqueness in your dataset.
For 2, I'm not sure what prepending the datetime is meant to achieve; it looks
like duplicate information, given that you already partition by that field.
For 3, the composite key seems too long for a primary ID.
Hope this helps.
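
A minimal sketch of approach 1 in Python, for illustration only. It assumes
that "generate a UUID from the concatenated string" means a name-based
(version 5) UUID, so the same four values always produce the same key. The
namespace string, function names, and sample values are made up here and are
not taken from your pipeline. It also builds an approach 3 style string to
show the length difference.

```python
import uuid
from datetime import datetime

# Hypothetical namespace for this table's keys. Any fixed UUID works,
# but it must never change once records have been written.
KEY_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org/hudi-record-key")


def composite_key(f1: datetime, f2: str, f3: uuid.UUID, f4: str) -> str:
    """Approach 3 style key: the four fields joined with '_'."""
    return "_".join([f1.isoformat(), f2, str(f3), f4])


def record_key(f1: datetime, f2: str, f3: uuid.UUID, f4: str) -> str:
    """Approach 1: deterministic version 5 UUID over the joined fields."""
    return str(uuid.uuid5(KEY_NAMESPACE, composite_key(f1, f2, f3, f4)))


# Made-up sample values matching the field shapes described in the thread.
f1 = datetime(2020, 11, 23, 18, 25, 0)                  # F1: datetime
f2 = "SN123456789"                                      # F2: 11-char serial
f3 = uuid.UUID("12345678-1234-5678-1234-567812345678")  # F3: user identifier
f4 = "HEARTRATEAVG"                                     # F4: 12-char statistic

key = record_key(f1, f2, f3, f4)
long_key = composite_key(f1, f2, f3, f4)
print(len(key), len(long_key))
```

With these sample values the composite string is 81 characters and the UUID is
36. If you write through the Spark datasource, the generated column would be
the one you name in hoodie.datasource.write.recordkey.field.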


On Mon, Nov 23, 2020 at 6:25 PM Kizhakkel Jose, Felix <
[email protected]> wrote:

> @Vinoth Chandar <[email protected]>,
>
> Could you please take a look at this and let me know what the best approach
> is, or point me to someone who can help me with this?
>
>
>
> Regards,
>
> Felix K Jose
>
> *From: *Kizhakkel Jose, Felix <[email protected]>
> *Date: *Thursday, November 19, 2020 at 12:04 PM
> *To: *[email protected] <[email protected]>, Vinoth Chandar <
> [email protected]>, [email protected] <
> [email protected]>
> *Cc: *[email protected] <[email protected]>, [email protected] <
> [email protected]>
> *Subject: *Re: Hudi Record Key Best Practices
>
> Sure, I will look into the partition key.
>
> Since RFC 21 is not yet implemented and available to use, can anyone
> please suggest the best approach for constructing the record key, as I
> asked in the original question:
>
> “
> My Write Use Cases:
> 1. Writes to a partitioned Hudi table every 15 minutes, where
>    a. 95% are inserts and 5% are updates
>    b. 95% of writes go to the same partition (current date) and 5% can
>       span multiple partitions
> 2. GDPR requests to delete records from the table using the User Identifier
>    field (F3)
>
>
> Record Key Construction:
> Approach 1:
> Generate a UUID from the concatenated string of all 4 fields [e.g.
> str(F1) + "_" + str(F2) + "_" + str(F3) + "_" + str(F4)] and use that
> newly generated field as the Record Key
>
> Approach 2:
> Generate a UUID from the concatenated string of the 3 fields other than the
> datetime field (F1) [e.g. str(F2) + "_" + str(F3) + "_" + str(F4)], prepend
> the datetime field to the generated UUID, and use that newly generated field
> as the Record Key: F1_<uuid>
>
> Approach 3:
> Record Key as a composite key of all 4 fields (F1, F2, F3, F4)
> “
>
> Regards,
> Felix K Jose
> From: Raymond Xu <[email protected]>
> Date: Wednesday, November 18, 2020 at 5:30 PM
> To: [email protected] <[email protected]>
> Cc: [email protected] <[email protected]>, [email protected] <
> [email protected]>
> Subject: Re: Hudi Record Key Best Practices
> Hi Felix, I wasn't suggesting partitioning by user ID; that would be too many
> partitions. I was just saying that spreading the writes more evenly could be
> better. Effectively, with 95% of writes going to one partition, it's like
> writing to a single-partition dataset. Hourly partitions could mitigate this,
> since you also have date-range queries. These are just rough ideas; the
> strategy really depends on your data pattern and requirements.
>
> For the development timeline of RFC 21, Vinoth or Balaji could probably
> give more info.
>
> On Wed, Nov 18, 2020 at 7:38 AM Kizhakkel Jose, Felix
> <[email protected]> wrote:
>
> > Hi Raymond,
> > Thank you for the response.
> >
> > Yes, the virtual key is definitely going to help reduce the storage
> > footprint. When do you think it will be available, and will it be
> > compatible with all downstream processing engines (Presto, Redshift
> > Spectrum, etc.)? We have started our development activities and expect to
> > get into PROD in the March-April timeframe.
> >
> > Regarding the partition key: we get data every day from 10-20 million
> > users, and we currently plan to partition the data by date
> > (YYYY-MM-DD). That gives consistent partitions for downstream
> > systems (every partition has about the same amount of data, 20 million
> > users' data in each, rather than skewed partitions). Most of our queries
> > are date-range queries for a given user ID.
> >
> > If I partition by user ID, I will have millions of partitions, and I
> > have read that a large number of partitions has a major read impact
> > (metadata management, etc.). What do you think? Is my understanding
> > correct?
> >
> > Yes, for the current day most of the data will be for that day. Do you
> > think that is going to be a problem while writing (won't the BLOOM index
> > help)? That's what I am trying to understand, to arrive at a
> > better-performing solution.
> >
> > Meanwhile, I would like my record key construction reviewed as well, to
> > see how it can help write performance and the downstream requirement to
> > support GDPR, and to avoid any reprocessing/migration down the line.
> >
> > Regards,
> > Felix K Jose
> >
> > From: Raymond Xu <[email protected]>
> > Date: Tuesday, November 17, 2020 at 6:18 PM
> > To: [email protected] <[email protected]>
> > Cc: [email protected] <[email protected]>, [email protected] <
> > [email protected]>, [email protected]
> > <[email protected]>
> > Subject: Re: Hudi Record Key Best Practices
> > Hi Felix, it looks like the use case will benefit from the virtual key
> > feature in this RFC:
> >
> >
> > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+21+%3A+Allow+HoodieRecordKey+to+be+Virtual
> >
> > Once this is implemented, you won't have to create a separate key.
> >
> > A rough thought: you mentioned 95% of writes go to the same partition.
> > Rather than the record key, maybe consider improving the partition field,
> > for example to get more even writes across partitions?
> >
> > On Sat, Nov 14, 2020 at 8:46 PM Kizhakkel Jose, Felix
> > <[email protected]> wrote:
> >
> > > Hello All,
> > >
> > > I have asked general questions about the record key in the Slack
> > > channel, but I want to consolidate everything about the Record Key and
> > > the suggested best practices for constructing it to get better write
> > > performance.
> > >
> > > Table Type: COW
> > > Partition Path: Date
> > >
> > > My record uniqueness is derived from a combination of 4 fields:
> > >
> > >   1.  F1: Datetime (record’s origination datetime)
> > >   2.  F2: String   (11-character serial number)
> > >   3.  F3: UUID     (User Identifier)
> > >   4.  F4: String   (12-character statistic name)
> > >
> > > Note: My record is a nested document and some of the above fields are
> > > nested fields
> > >
> > > My Write Use Cases:
> > > 1. Writes to a partitioned Hudi table every 15 minutes, where
> > >    a. 95% are inserts and 5% are updates
> > >    b. 95% of writes go to the same partition (current date) and 5% can
> > >       span multiple partitions
> > > 2. GDPR requests to delete records from the table using the User
> > >    Identifier field (F3)
> > >
> > >
> > > Record Key Construction:
> > > Approach 1:
> > > Generate a UUID from the concatenated string of all 4 fields [e.g.
> > > str(F1) + "_" + str(F2) + "_" + str(F3) + "_" + str(F4)] and use that
> > > newly generated field as the Record Key
> > >
> > > Approach 2:
> > > Generate a UUID from the concatenated string of the 3 fields other than
> > > the datetime field (F1) [e.g. str(F2) + "_" + str(F3) + "_" + str(F4)],
> > > prepend the datetime field to the generated UUID, and use that newly
> > > generated field as the Record Key: F1_<uuid>
> > >
> > > Approach 3:
> > > Record Key as a composite key of all 4 fields (F1, F2, F3, F4)
> > >
> > > Which approach would you suggest? Could you please help me?
> > >
> > > Regards,
> > > Felix K Jose
> > >
> > > ________________________________
> > > The information contained in this message may be confidential and
> > > legally protected under applicable law. The message is intended solely
> > > for the addressee(s). If you are not the intended recipient, you are
> > > hereby notified that any use, forwarding, dissemination, or reproduction
> > > of this message is strictly prohibited and may be unlawful. If you are
> > > not the intended recipient, please contact the sender by return e-mail
> > > and destroy all copies of the original message.
> > >
