Agree with Siva's suggestions. For clustering, though, user_id does not need to be part of the record key. (Satish can correct me if I missed something.)
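
To make approach 2 concrete, here is a minimal Python sketch of the key layout Felix described (an F1 prefix followed by a UUID derived from F2, F3 and F4). The function name, the namespace UUID, and the zero-padded epoch-millisecond encoding of F1 are assumptions made for illustration, not anything Hudi prescribes. The padding is there so that the string order of the keys matches their time order.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Hypothetical namespace for the name-based UUID; any fixed UUID would do.
KEY_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def approach2_record_key(f1: datetime, f2: str, f3: str, f4: str) -> str:
    """Build 'F1_<uuid>' where the UUID is derived from F2, F3 and F4.

    f1 must be timezone-aware. It is encoded as epoch milliseconds,
    zero-padded to 13 digits (an assumption of this sketch) so that
    lexicographic order of keys follows chronological order.
    """
    epoch_ms = (f1 - EPOCH) // timedelta(milliseconds=1)
    # uuid5 is deterministic: the same F2/F3/F4 always yields the same UUID.
    suffix = uuid.uuid5(KEY_NAMESPACE, f"{f2}_{f3}_{f4}")
    return f"{epoch_ms:013d}_{suffix}"


if __name__ == "__main__":
    ts = datetime(2020, 11, 24, 14, 1, tzinfo=timezone.utc)
    print(approach2_record_key(ts, "SN000000001", "user-1", "stat_name_01"))
```

Because the UUID part is name-based rather than random, re-sending the same record produces the same key, which is what an upsert needs.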

On Tue, Nov 24, 2020 at 2:01 PM Sivabalan <[email protected]> wrote:

> Here are the discussion points we had in Slack.
>
> The suggestion is to go with approach 2, based on these points:
> - Prefixing F1 (including the timestamp) will help prune some file slices even within a day (within a partition), if records are properly ordered by timestamp.
> - Deletes are occasional compared to upserts, so optimizing for upserts makes sense and approach 2 is fine. Also, deleting records is a two-part execution anyway: first a query to Hudi like "select HoodieKey from hudi_tbl where user_id = 'X'", and then a DELETE operation to Hudi for these HoodieKeys. For the first query, I assume embedding user_id in record keys does not matter, because the query filters on a specific column in the dataset. So I initially thought there was not much value in embedding the user id in the record key. But as Vinoth suggested, clustering could come in handy, so let's have userId as part of the record keys too.
> - In approach 3, the record keys could be too large, so we may not want to go this route.
>
> On Tue, Nov 24, 2020 at 11:58 AM Vinoth Chandar <[email protected]> wrote:
>
>> Hi Felix,
>>
>> I will try to be faster going forward. Apologies for the late reply.
>> Thanks Raymond for all the great clarifications.
>>
>> On RFC-21, I think it's safe to assume it will be available by Jan or so, in 0.8.0 (Uber folks, correct me if I am wrong).
>>
>> >> For approach 2, the reason for prepending datetime is to have an incrementing id, otherwise your uuid is a purely random id and won't support range pruning while writing, correct?
>>
>> You are right. In general, we only have the following levers to control performance. I take it that "origination datetime" is not monotonically increasing? Otherwise Approach 1 is good, right?
>>
>> If you want to optimize for upsert performance,
>> - prepending a timestamp field would help.
>>   If you simply prepend the date, which is already the partition path, then all keys in that partition will have the same prefix and no additional pruning opportunities exist.
>> - Advise using dynamic bloom filters (config hoodie.bloom.index.filter.type=DYNAMIC_V0), to ensure the bloom filters filter out enough files after range pruning.
>>
>> For good delete performance, we can cluster records by user_id for older partitions, such that all records of a user are packed into the smallest number of files. This way, when only a small number of users leave, your delete won't rewrite the entire partition's files. Clustering support is landing by the end of the year in 0.7.0. (There is a PR out already, if you want to test/play.)
>>
>> All of this is also highly workload-specific, so we can get into those details if that helps. MOR is a much better alternative for dealing with deletes, IMO. It was specifically designed and used for those, since it can absorb the deletes into log files and apply them later, amortizing costs.
>>
>> The future is good, since we are investing in record-level indexes that could also natively index secondary fields like user_id. Again, expect that to be there in 0.9.0 or something, around Mar.
>> For now, we have to play with how we lay out the data to squeeze performance.
>>
>> Hope that helps.
>>
>> thanks
>> vinoth
>>
>> On Tue, Nov 24, 2020 at 5:54 AM Kizhakkel Jose, Felix <[email protected]> wrote:
>>
>>> Hi Raymond,
>>>
>>> Thanks a lot for the reply.
>>>
>>> For approach 2, the reason for prepending datetime is to have an incrementing id, otherwise your uuid is a purely random id and won't support range pruning while writing, correct? In a given date partition I am expecting to get 10s of billions of records, and having an incrementing id helps BLOOM filtering?
>>> This is the only intent of having the datetime prefix (int64 representation).
>>>
>>> Yes, I also see Approach 3 as really too big, adding a lot to the storage footprint.
>>>
>>> My initial approach was Approach 1 (a UUID generated from all 4 fields); then I heard that range pruning can make writes faster, so I thought of datetime as a prefix. Do you see any benefit, or can the UUID itself be sufficient, since it's generated from the 4 input fields?
>>>
>>> Regards,
>>> Felix K Jose
>>>
>>> From: Raymond Xu <[email protected]>
>>> Date: Tuesday, November 24, 2020 at 2:20 AM
>>> To: Kizhakkel Jose, Felix <[email protected]>
>>> Cc: [email protected] <[email protected]>, [email protected] <[email protected]>, [email protected] <[email protected]>
>>> Subject: Re: Hudi Record Key Best Practices
>>>
>>> Hi Felix,
>>>
>>> I'd prefer approach 1. The logic is simple: to ensure uniqueness in your dataset.
>>>
>>> For 2, I'm not very sure about the intention of prepending the datetime; it looks like duplicate info, given that you already partition by that field.
>>>
>>> For 3, it seems too long for a primary id.
>>>
>>> Hope this helps.
>>>
>>> On Mon, Nov 23, 2020 at 6:25 PM Kizhakkel Jose, Felix <[email protected]> wrote:
>>>
>>> @Vinoth Chandar <[email protected]>,
>>>
>>> Could you please take a look and let me know what the best approach is, or point me to someone who can help me with this?
>>>
>>> Regards,
>>> Felix K Jose
>>>
>>> From: Kizhakkel Jose, Felix <[email protected]>
>>> Date: Thursday, November 19, 2020 at 12:04 PM
>>> To: [email protected] <[email protected]>, Vinoth Chandar <[email protected]>, [email protected] <[email protected]>
>>> Cc: [email protected] <[email protected]>, [email protected] <[email protected]>
>>> Subject: Re: Hudi Record Key Best Practices
>>>
>>> Sure. I will see about partition key.
>>>
>>> Since RFC 21 is not yet implemented and available to consume, can anyone please suggest the best approach to constructing the record key I asked about in the original question:
>>>
>>> "
>>> My Write Use Cases:
>>> 1. Writes to partitioned HUDI table every 15 minutes
>>>    1. where 95% are inserts and 5% updates,
>>>    2. Also 95% of writes go to the same partition (current date); 5% of writes can span multiple partitions
>>> 2. GDPR request to delete records from the table using User Identifier field (F3)
>>>
>>> Record Key Construction:
>>> Approach 1:
>>> Generate a UUID from the concatenated String of all these 4 fields [eg: str(F1) + "_" + str(F2) + "_" + str(F3) + "_" + str(F4)] and use that newly generated field as Record Key
>>>
>>> Approach 2:
>>> Generate a UUID from the concatenated String of 3 fields except datetime field (F1) [eg: str(F2) + "_" + str(F3) + "_" + str(F4)], prepend the datetime field to the generated UUID, and use that newly generated field as Record Key, i.e. F1_<uuid>
>>>
>>> Approach 3:
>>> Record Key as a composite key of all 4 fields (F1, F2, F3, F4)
>>> "
>>>
>>> Regards,
>>> Felix K Jose
>>>
>>> From: Raymond Xu <[email protected]>
>>> Date: Wednesday, November 18, 2020 at 5:30 PM
>>> To: [email protected] <[email protected]>
>>> Cc: [email protected] <[email protected]>, [email protected] <[email protected]>
>>> Subject: Re: Hudi Record Key Best Practices
>>>
>>> Hi Felix, I wasn't suggesting partitioning by user id, that'll be too many partitions; just saying that maybe spreading the writes more evenly could be better. Effectively, with 95% of writes going to one partition, it's like writing to a single-partition dataset. Hourly partitions could mitigate the situation, since you also have date-range queries. Just some rough ideas; the strategy really depends on your data pattern and requirements.
>>>
>>> For the development timeline on RFC 21, probably Vinoth or Balaji could give more info.
>>>
>>> On Wed, Nov 18, 2020 at 7:38 AM Kizhakkel Jose, Felix <[email protected]> wrote:
>>>
>>> > Hi Raymond,
>>> > Thank you for the response.
>>> >
>>> > Yes, the virtual key is definitely going to help reduce the storage footprint. When do you think it is going to be available, and will it be compatible with all downstream processing engines (Presto, Redshift Spectrum etc.)? We have started our development activities and are expecting to get into PROD in the March-April timeframe.
>>> >
>>> > Regarding the partition key, we get data every day from 10-20 million users, and we are currently planning to partition the data by Date (YYYY-MM-DD), so that we will have consistent partitions for downstream systems (every partition has the same amount of data [20 million users' data in each partition], rather than skewed partitions). And most of our queries are date range queries for a given user-Id.
>>> >
>>> > If I partition by user-Id, then I will have millions of partitions, and I have read that having a large number of partitions has a major read impact (metadata management etc.). What do you think? Is my understanding correct?
>>> >
>>> > Yes, for the current day most of the data will be for that day, so do you think it's going to be a problem while writing (won't the BLOOM index help)? That's what I am trying to understand, to land on a better-performing solution.
>>> >
>>> > Meanwhile, I would like to look at my record key construction as well, to see how it can help with write performance and the downstream requirement to support GDPR, and to avoid any reprocessing/migration down the line.
>>> >
>>> > Regards,
>>> > Felix K Jose
>>> >
>>> > From: Raymond Xu <[email protected]>
>>> > Date: Tuesday, November 17, 2020 at 6:18 PM
>>> > To: [email protected] <[email protected]>
>>> > Cc: [email protected] <[email protected]>, [email protected] <[email protected]>, [email protected] <[email protected]>
>>> > Subject: Re: Hudi Record Key Best Practices
>>> >
>>> > Hi Felix, looks like the use case will benefit from the virtual key feature in this RFC:
>>> >
>>> > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+21+%3A+Allow+HoodieRecordKey+to+be+Virtual
>>> >
>>> > Once this is implemented, you don't have to create a separate key.
>>> >
>>> > A rough thought: you mentioned 95% of writes go to the same partition. Rather than the record key, maybe consider improving the partition field, e.g. to have more even writes across partitions?
>>> >
>>> > On Sat, Nov 14, 2020 at 8:46 PM Kizhakkel Jose, Felix <[email protected]> wrote:
>>> >
>>> > > Hello All,
>>> > >
>>> > > I have asked generic questions regarding record key in the Slack channel, but I just want to consolidate everything regarding Record Key and the suggested best practices for Record Key construction to get better write performance.
>>> > >
>>> > > Table Type: COW
>>> > > Partition Path: Date
>>> > >
>>> > > My record uniqueness is derived from a combination of 4 fields:
>>> > >
>>> > > 1. F1: Datetime (record's origination datetime)
>>> > > 2. F2: String (11-char serial number)
>>> > > 3. F3: UUID (User Identifier)
>>> > > 4. F4: String (12-char statistic name)
>>> > >
>>> > > Note: My record is a nested document and some of the above fields are nested fields
>>> > >
>>> > > My Write Use Cases:
>>> > > 1. Writes to partitioned HUDI table every 15 minutes
>>> > >    1. where 95% are inserts and 5% updates,
>>> > >    2. Also 95% of writes go to the same partition (current date); 5% of writes can span multiple partitions
>>> > > 2. GDPR request to delete records from the table using User Identifier field (F3)
>>> > >
>>> > > Record Key Construction:
>>> > > Approach 1:
>>> > > Generate a UUID from the concatenated String of all these 4 fields [eg: str(F1) + "_" + str(F2) + "_" + str(F3) + "_" + str(F4)] and use that newly generated field as Record Key
>>> > >
>>> > > Approach 2:
>>> > > Generate a UUID from the concatenated String of 3 fields except datetime field (F1) [eg: str(F2) + "_" + str(F3) + "_" + str(F4)], prepend the datetime field to the generated UUID, and use that newly generated field as Record Key, i.e. F1_<uuid>
>>> > >
>>> > > Approach 3:
>>> > > Record Key as a composite key of all 4 fields (F1, F2, F3, F4)
>>> > >
>>> > > Which approach would you suggest? Could you please help me?
>>> > >
>>> > > Regards,
>>> > > Felix K Jose
>>> > >
>>> > > ________________________________
>>> > > The information contained in this message may be confidential and legally protected under applicable law. The message is intended solely for the addressee(s). If you are not the intended recipient, you are hereby notified that any use, forwarding, dissemination, or reproduction of this message is strictly prohibited and may be unlawful. If you are not the intended recipient, please contact the sender by return e-mail and destroy all copies of the original message.
>
> --
> Regards,
> -Sivabalan
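
For anyone reading the thread later, the range-pruning argument above can be illustrated with a toy simulation. This is plain Python and does not use Hudi at all: "files" are just chunks of keys in arrival order, each summarized by its (min, max) key, and a lookup counts how many files could contain the key. The helper names, file size, and base timestamp are all made up for the sketch.

```python
import random
import uuid


def build_files(keys, per_file):
    """Chunk keys in arrival order and keep only each chunk's (min, max) key."""
    chunks = (keys[i:i + per_file] for i in range(0, len(keys), per_file))
    return [(min(chunk), max(chunk)) for chunk in chunks]


def candidate_files(files, key):
    """Count files whose [min, max] key range could contain the key."""
    return sum(1 for lo, hi in files if lo <= key <= hi)


rng = random.Random(42)
n, per_file = 10_000, 500

# Purely random ids, standing in for approach 1 keys.
random_keys = [str(uuid.UUID(int=rng.getrandbits(128))) for _ in range(n)]

# The same ids with a zero-padded, increasing epoch-millis prefix (approach 2).
base_ms = 1606176000000  # 2020-11-24 00:00:00 UTC
prefixed_keys = [f"{base_ms + i:013d}_{k}" for i, k in enumerate(random_keys)]

random_files = build_files(random_keys, per_file)
prefixed_files = build_files(prefixed_keys, per_file)

# Probe with the median random key so it is not an edge case.
probe = random_keys.index(sorted(random_keys)[n // 2])
random_hits = candidate_files(random_files, random_keys[probe])
prefixed_hits = candidate_files(prefixed_files, prefixed_keys[probe])

print(f"files to check with random keys:   {random_hits} of {len(random_files)}")
print(f"files to check with prefixed keys: {prefixed_hits} of {len(prefixed_files)}")
```

With random keys, every file's key range covers almost the whole key space, so range pruning eliminates next to nothing and the bloom filters have to do all the work. With an increasing prefix, the ranges barely overlap. The simulation assumes records arrive in timestamp order, which is the same caveat Siva raised.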
