Sounds good to me. We are always looking to add more contributors. https://github.com/apache/hudi/pull/2263 is the PR under review for clustering.
RFC 18/19 have the details as well.

On Wed, Nov 25, 2020 at 6:20 AM Kizhakkel Jose, Felix <felix.j...@philips.com> wrote:

> Hi Vinoth, Siva,
>
> I know you guys are so busy, but I always get a quick response from one of the hoodiers. Thank you so much for the detailed information.
>
> Yes, as suggested, for UPSERTs I will go with *Approach 2*.
>
> For deletes, clustering can help me. I am also happy to see that we don't need to duplicate that field as part of the record key to get it clustered. Where can I find the PR/RFC for the clustering implementation, to read about it and get a better understanding? And I believe this is something similar to bucketing in Hive?
>
> Also, RFC-21 is going to help a lot with the storage footprint.
>
> All interesting stuff. Once I complete my major data lake implementation project I definitely would like to start contributing to Hudi.
>
> Thank you @Vinoth Chandar <vin...@apache.org> @Siva once again for all of your help. And @Raymond, thank you for answering and clarifying things throughout this.
>
> Regards,
> Felix K Jose
>
> *From:* Vinoth Chandar <vin...@apache.org>
> *Date:* Tuesday, November 24, 2020 at 5:52 PM
> *To:* Sivabalan <n.siv...@gmail.com>
> *Cc:* Kizhakkel Jose, Felix <felix.j...@philips.com>, Raymond Xu <xu.shiyan.raym...@gmail.com>, dev@hudi.apache.org <dev@hudi.apache.org>
> *Subject:* Re: Hudi Record Key Best Practices
>
> Agree with Siva's suggestions.
>
> For clustering, it's not necessary for it to be part of the key. (Satish can correct me if I missed something.)
>
> On Tue, Nov 24, 2020 at 2:01 PM Sivabalan <n.siv...@gmail.com> wrote:
>
> Here are the discussion points we had in Slack.
>
> The suggestion is to go with Approach 2, based on these points:
>
> - Prefixing F1 (including timestamp) will help prune some file slices even within a day (within a partition), if records are properly ordered by timestamp.
> - Deletes are occasional compared to upserts.
> So, optimizing for upserts makes sense, and hence Approach 2 is fine. Also, deleting records is a two-part execution anyway: first a query to Hudi like "select HoodieKey from hudi_tbl where user_id = 'X'", and then a DELETE operation to Hudi for those HoodieKeys. For the first query, I assume embedding user_id in the record keys does not matter, because that query filters on a specific column of the dataset.
>
> So initially I thought there was not much value in embedding the user id in the record key. But as Vinoth suggested, clustering could come in handy, so let's have userId as part of the record keys too.
>
> - In Approach 3, the record keys could be too large, so we may not want to go this route.
>
> On Tue, Nov 24, 2020 at 11:58 AM Vinoth Chandar <vin...@apache.org> wrote:
>
> Hi Felix,
>
> I will try to be faster going forward. Apologies for the late reply. Thanks Raymond for all the great clarifications.
>
> On RFC-21, I think it's safe to assume it will be available by Jan or so, in 0.8.0. (Uber folks, correct me if I am wrong.)
>
> >> For approach 2 – the reason for prepending datetime is to have an incrementing id; otherwise your uuid is a purely random id and won't support range pruning while writing, correct?
>
> You are right. In general, we only have the following levers to control performance. I take it that "origination datetime" is not monotonically increasing? Otherwise Approach 1 is good, right?
>
> If you want to optimize for upsert performance:
>
> - Prepending a timestamp field would help. If you simply prepend the date, which is already also the partition path, then all keys in that partition will have the same prefix and no additional pruning opportunities exist.
> - Advise using dynamic bloom filters (config hoodie.bloom.index.filter.type=DYNAMIC_V0), to ensure the bloom filters filter out enough files after range pruning.
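[Editor's note] The advice above can be collected into a set of Spark datasource write options. This is a hedged sketch: `hoodie.bloom.index.filter.type=DYNAMIC_V0` is quoted directly from the thread, while the clustering option names are assumptions based on later Hudi releases (clustering was still landing in 0.7.0 when this was written).

```python
# Hedged sketch of the Hudi write options discussed in this thread.
hudi_write_opts = {
    # Quoted from the thread: dynamic bloom filters, so that after range
    # pruning the remaining files' bloom filters still filter out enough
    # candidate files.
    "hoodie.bloom.index.filter.type": "DYNAMIC_V0",
    # Assumed clustering configs (post-0.7.0 naming): sort/pack each
    # user's records into as few files as possible in older partitions,
    # which keeps per-user GDPR deletes from rewriting whole partitions.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.plan.strategy.sort.columns": "user_id",
}
```

These would be passed to the writer via `.options(**hudi_write_opts)` alongside the usual record key and partition path settings.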
>
> For good delete performance, we can cluster records by user_id for older partitions, such that all of a user's records are packed into the smallest number of files. This way, when only a small number of users leave, your delete won't rewrite the entire partition's files. Clustering support is landing by the end of the year in 0.7.0. (There is a PR out already, if you want to test/play.)
>
> All of this is also highly workload specific, so we can get into those details if that helps. MOR is a much better alternative for dealing with deletes, IMO. It was specifically designed and used for those cases, since it can absorb the deletes into log files and apply them later, amortizing costs.
>
> The future is good, since we are investing in record-level indexes that could also natively index secondary fields like user_id. Again, expect that to be there in 0.9.0 or so, around March. For now, we have to play with how we lay out the data to squeeze out performance.
>
> Hope that helps.
>
> thanks
> vinoth
>
> On Tue, Nov 24, 2020 at 5:54 AM Kizhakkel Jose, Felix <felix.j...@philips.com> wrote:
>
> Hi Raymond,
>
> Thanks a lot for the reply.
>
> For approach 2 – the reason for prepending datetime is to have an incrementing id; otherwise your uuid is a purely random id and won't support range pruning while writing, correct? In a given date partition I expect to get tens of billions of records, and having an incrementing id helps BLOOM filtering? This is the only intent of the datetime prefix (int64 representation).
>
> Yes, I also see Approach 3 as really too big, adding a lot to the storage footprint.
>
> My initial approach was Approach 1 (a uuid generated from all 4 fields); then I heard that range pruning can make writes faster – so I thought of datetime as a prefix. Do you see any benefit, or can the UUID itself be sufficient – since it's been generated from the 4 input fields?
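[Editor's note] The two-part delete execution Siva described earlier (first query the HoodieKeys for a user, then issue a DELETE for those keys) can be sketched roughly as below. This is a hypothetical illustration, not the thread's actual code: the table name, field names, and paths are placeholders, shown here as the query string and writer options rather than a live Spark job.

```python
# Part 1: query Hudi for the keys of the records belonging to one user.
# _hoodie_record_key and _hoodie_partition_path are Hudi's metadata columns.
lookup_sql = (
    "SELECT _hoodie_record_key, _hoodie_partition_path "
    "FROM hudi_tbl WHERE user_id = 'X'"
)

# Part 2: issue a DELETE operation for those keys through the Spark
# datasource writer (operation=delete is Hudi's delete-by-key path).
delete_opts = {
    "hoodie.datasource.write.operation": "delete",
    "hoodie.table.name": "hudi_tbl",
}

# With a live SparkSession this would run roughly as:
#   keys_df = spark.sql(lookup_sql)
#   keys_df.write.format("hudi").options(**delete_opts).mode("append").save(base_path)
```

Note how the first query filters on the user_id column itself, which is why embedding user_id in the record key does not help this step; it is clustering by user_id that limits how many files the second step rewrites.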
>
> Regards,
> Felix K Jose
>
> *From:* Raymond Xu <xu.shiyan.raym...@gmail.com>
> *Date:* Tuesday, November 24, 2020 at 2:20 AM
> *To:* Kizhakkel Jose, Felix <felix.j...@philips.com>
> *Cc:* dev@hudi.apache.org <dev@hudi.apache.org>, vin...@apache.org <vin...@apache.org>, n.siv...@gmail.com <n.siv...@gmail.com>
> *Subject:* Re: Hudi Record Key Best Practices
>
> Hi Felix,
>
> I'd prefer approach 1. The logic is simple: to ensure uniqueness in your dataset.
>
> For 2, I am not very sure about the intention of prepending the datetime; it looks like duplicate info, knowing that you already partitioned by that field.
>
> For 3, it seems too long for a primary id.
>
> Hope this helps.
>
> On Mon, Nov 23, 2020 at 6:25 PM Kizhakkel Jose, Felix <felix.j...@philips.com> wrote:
>
> @Vinoth Chandar <vin...@apache.org>,
>
> Could you please take a look and let me know what the best approach is, or could you see who can help me on this?
>
> Regards,
> Felix K Jose
>
> *From:* Kizhakkel Jose, Felix <felix.j...@philips.com.INVALID>
> *Date:* Thursday, November 19, 2020 at 12:04 PM
> *To:* dev@hudi.apache.org <dev@hudi.apache.org>, Vinoth Chandar <vin...@apache.org>, xu.shiyan.raym...@gmail.com <xu.shiyan.raym...@gmail.com>
> *Cc:* vin...@apache.org <vin...@apache.org>, n.siv...@gmail.com <n.siv...@gmail.com>
> *Subject:* Re: Hudi Record Key Best Practices
>
> Sure, I will look into the partition key.
>
> Since RFC 21 is not yet implemented and available to consume, can anyone please suggest the best approach I should follow to construct the record key I asked about in the original question:
>
> "
> My Write Use Cases:
> 1. Writes to a partitioned HUDI table every 15 minutes
>    1. where 95% are inserts and 5% are updates,
>    2. Also, 95% of writes go to the same partition (current date); 5% of writes can span multiple partitions
> 2.
GDPR requests to delete records from the table using the User Identifier field (F3)
>
> Record Key Construction:
>
> Approach 1:
> Generate a UUID from the concatenated string of all 4 fields [e.g. str(F1) + "_" + str(F2) + "_" + str(F3) + "_" + str(F4)] and use that newly generated field as the record key.
>
> Approach 2:
> Generate a UUID from the concatenated string of the 3 fields except the datetime field (F1) [e.g. str(F2) + "_" + str(F3) + "_" + str(F4)], prepend the datetime field to the generated UUID, and use that newly generated field as the record key: F1_<uuid>
>
> Approach 3:
> Record key as a composite key of all 4 fields (F1, F2, F3, F4)
> "
>
> Regards,
> Felix K Jose
>
> From: Raymond Xu <xu.shiyan.raym...@gmail.com>
> Date: Wednesday, November 18, 2020 at 5:30 PM
> To: dev@hudi.apache.org <dev@hudi.apache.org>
> Cc: vin...@apache.org <vin...@apache.org>, n.siv...@gmail.com <n.siv...@gmail.com>
> Subject: Re: Hudi Record Key Best Practices
>
> Hi Felix, I wasn't suggesting partitioning by user id; that would be too many partitions. I was just saying that spreading the writes more evenly across partitions could be better. Effectively, with 95% of writes going to one partition, it's like writing to a single-partition dataset. Hourly partitions could mitigate the situation, since you also have date-range queries. Just some rough ideas; the strategy really depends on your data pattern and requirements.
>
> For the development timeline on RFC 21, probably Vinoth or Balaji could give more info.
>
> On Wed, Nov 18, 2020 at 7:38 AM Kizhakkel Jose, Felix <felix.j...@philips.com.invalid> wrote:
> >
> > Hi Raymond,
> >
> > Thank you for the response.
> >
> > Yes, the virtual key is definitely going to help reduce the storage footprint. When do you think it is going to be available, and will it be compatible with all downstream processing engines (Presto, Redshift Spectrum, etc.)? We have started our development activities and expect to get into PROD in the March-April timeframe.
> >
> > Regarding the partition key: we get data every day from 10-20 million users, and currently we are planning to partition the data by date (YYYY-MM-DD). That gives us consistent partitions for downstream systems (every partition has the same amount of data [20 million users' data in each partition, rather than skewed partitions]). And most of our queries are date-range queries for a given user-Id.
> >
> > If I partition by user-Id, then I will have millions of partitions, and I have read that having a large number of partitions has a major read impact (metadata management, etc.). What do you think? Is my understanding correct?
> >
> > Yes, for the current day most of the data will be for that day – so do you think it's going to be a problem while writing (won't the BLOOM index help)? That's what I am trying to understand, to land on a more performant solution.
> >
> > Meanwhile, I would like to settle my record key construct as well, to see how it can help with write performance and the downstream requirement to support GDPR, and to avoid any reprocessing/migration down the line.
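[Editor's note] The date partitioning scheme described above (one partition per day, YYYY-MM-DD, derived from the record's origination datetime) is a one-liner; a minimal sketch follows, with the function and field names being placeholders, not names from the thread.

```python
from datetime import datetime, timezone

def partition_path(origination_dt: datetime) -> str:
    """Map a record's origination datetime to its YYYY-MM-DD date partition."""
    return origination_dt.strftime("%Y-%m-%d")

# A 15-minute batch on Nov 24 lands almost entirely in the 2020-11-24
# partition, matching the "95% of writes hit the current date" pattern.
event_time = datetime(2020, 11, 24, 10, 30, tzinfo=timezone.utc)
print(partition_path(event_time))  # -> 2020-11-24
```

Hourly partitioning, as Raymond suggests, would simply extend the format string (e.g. `"%Y-%m-%d-%H"`) at the cost of 24x more partitions per day.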
> >
> > Regards,
> > Felix K Jose
> >
> > From: Raymond Xu <xu.shiyan.raym...@gmail.com>
> > Date: Tuesday, November 17, 2020 at 6:18 PM
> > To: dev@hudi.apache.org <dev@hudi.apache.org>
> > Cc: vin...@apache.org <vin...@apache.org>, n.siv...@gmail.com <n.siv...@gmail.com>, v.bal...@ymail.com.invalid <v.bal...@ymail.com.invalid>
> > Subject: Re: Hudi Record Key Best Practices
> >
> > Hi Felix, it looks like the use case will benefit from the virtual key feature in this RFC:
> >
> > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+21+%3A+Allow+HoodieRecordKey+to+be+Virtual
> >
> > Once this is implemented, you don't have to create a separate key.
> >
> > A rough thought: you mentioned 95% of writes go to the same partition. Rather than the record key, maybe consider improving on the partition field, e.g. to have more even writes across partitions?
> >
> > On Sat, Nov 14, 2020 at 8:46 PM Kizhakkel Jose, Felix <felix.j...@philips.com.invalid> wrote:
> > >
> > > Hello All,
> > >
> > > I have asked generic questions regarding the record key in the Slack channel, but I just want to consolidate everything regarding the record key and the suggested best practices for record key construction to get better write performance.
> > >
> > > Table Type: COW
> > > Partition Path: Date
> > >
> > > My record uniqueness is derived from a combination of 4 fields:
> > >
> > > 1. F1: Datetime (record's origination datetime)
> > > 2. F2: String (11-char serial number)
> > > 3. F3: UUID (User Identifier)
> > > 4. F4: String (12-char statistic name)
> > >
> > > Note: My record is a nested document, and some of the above fields are nested fields.
> > >
> > > My Write Use Cases:
> > > 1. Writes to a partitioned HUDI table every 15 minutes
> > >    1. where 95% are inserts and 5% are updates,
> > >    2. Also, 95% of writes go to the same partition (current date); 5% of writes can span multiple partitions
> > > 2. GDPR requests to delete records from the table using the User Identifier field (F3)
> > >
> > > Record Key Construction:
> > >
> > > Approach 1:
> > > Generate a UUID from the concatenated string of all 4 fields [e.g. str(F1) + "_" + str(F2) + "_" + str(F3) + "_" + str(F4)] and use that newly generated field as the record key.
> > >
> > > Approach 2:
> > > Generate a UUID from the concatenated string of the 3 fields except the datetime field (F1) [e.g. str(F2) + "_" + str(F3) + "_" + str(F4)], prepend the datetime field to the generated UUID, and use that newly generated field as the record key: F1_<uuid>
> > >
> > > Approach 3:
> > > Record key as a composite key of all 4 fields (F1, F2, F3, F4)
> > >
> > > Which approach would you suggest? Could you please help me?
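[Editor's note] The three candidate record keys can be sketched in plain Python. This is only an illustration: `uuid5` is used as one deterministic way to "generate a UUID from the concatenated string" (the thread does not specify the mechanism), and all field values are made-up placeholders.

```python
import uuid
from datetime import datetime, timezone

# Illustrative field values only (the real F1-F4 come from the record).
f1 = datetime(2020, 11, 24, 10, 30, 0, tzinfo=timezone.utc)   # origination datetime
f2 = "SN123456789"                                            # 11-char serial number
f3 = str(uuid.uuid5(uuid.NAMESPACE_URL, "user-42"))           # user identifier
f4 = "STAT_NAME_01"                                           # 12-char statistic name

# Approach 1: one UUID derived from all four concatenated fields.
key1 = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{f1.isoformat()}_{f2}_{f3}_{f4}"))

# Approach 2: datetime prefix (int64 epoch millis, per the thread) plus a
# UUID of the remaining three fields. Keys now sort by event time within a
# partition, which is what enables range pruning of file slices on write.
prefix = int(f1.timestamp() * 1000)
key2 = f"{prefix}_{uuid.uuid5(uuid.NAMESPACE_URL, f'{f2}_{f3}_{f4}')}"

# Approach 3: plain composite of all four fields (largest on disk).
key3 = f"{f1.isoformat()}_{f2}_{f3}_{f4}"
```

This makes the trade-offs concrete: Approach 1 yields a fixed 36-char but randomly ordered key; Approach 2 keeps the key sortable by time at a modest size; Approach 3 is the longest, which is the storage-footprint concern raised in the thread.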
> > >
> > > Regards,
> > > Felix K Jose
> > >
> > > ________________________________
> > > The information contained in this message may be confidential and legally protected under applicable law. The message is intended solely for the addressee(s). If you are not the intended recipient, you are hereby notified that any use, forwarding, dissemination, or reproduction of this message is strictly prohibited and may be unlawful. If you are not the intended recipient, please contact the sender by return e-mail and destroy all copies of the original message.

--
Regards,
-Sivabalan