Agree with Siva's suggestions.

For clustering, user_id does not need to be part of the record key. (Satish
can correct me if I missed something.)

On Tue, Nov 24, 2020 at 2:01 PM Sivabalan <[email protected]> wrote:

> Here are the discussion points we had in Slack.
>
> The suggestion is to go with approach 2, based on these points:
> - Prefixing F1 (including the timestamp) will help prune some file slices
> even within a day (within a partition) if records are properly ordered
> by timestamp.
> - Deletes are occasional compared to upserts, so optimizing for upserts
> makes sense and hence approach 2 is fine. Also, deleting records is anyway
> a two-part execution: first a query to Hudi like "select HoodieKey from
> hudi_tbl where user_id = 'X'", and then a DELETE operation to Hudi for these
> HoodieKeys. For the first query, I assume embedding user_id in record keys
> does not matter, because this query filters on a specific column in the
> dataset.
> So, initially I thought there was not much value in embedding user id in
> the record key. But as Vinoth suggested, clustering could come in handy, so
> let's have userId too as part of record keys.
> - In approach 3, the record keys could be too large, so we may not want to
> go this route.
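
For reference, a minimal Python sketch of how an approach 2 key could be built: F1 as a zero-padded epoch-millisecond prefix, followed by a deterministic UUID derived from F2, F3 and F4. The function name, the choice of uuid5, the namespace and the 13-digit padding are assumptions for illustration only and were not settled in this thread.

```python
import uuid
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# Assumption: any fixed namespace works, as long as every writer uses the same one.
KEY_NAMESPACE = uuid.NAMESPACE_OID


def make_record_key(f1: datetime, f2: str, f3: str, f4: str) -> str:
    """Build an approach 2 style key: <F1 as epoch millis>_<uuid5 of F2_F3_F4>."""
    if f1.tzinfo is None:
        raise ValueError("f1 must be timezone-aware")
    # Integer division of timedeltas avoids float rounding on the millisecond value.
    millis = (f1 - EPOCH) // timedelta(milliseconds=1)
    # uuid5 is deterministic, so an update to the same record yields the same key.
    stable_uuid = uuid.uuid5(KEY_NAMESPACE, "_".join((f2, f3, f4)))
    # Zero-padding keeps lexicographic key order aligned with time order,
    # which is what min/max key range pruning relies on.
    return f"{millis:013d}_{stable_uuid}"


earlier = make_record_key(
    datetime(2020, 11, 24, 14, 1, tzinfo=timezone.utc),
    "SN000000001", "user-x", "STAT00000001",
)
later = make_record_key(
    datetime(2020, 11, 24, 14, 16, tzinfo=timezone.utc),
    "SN000000001", "user-x", "STAT00000001",
)
```

Keys for later timestamps sort after keys for earlier ones, so files written in time order end up with mostly disjoint key ranges.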
>
>
>
>
>
> On Tue, Nov 24, 2020 at 11:58 AM Vinoth Chandar <[email protected]> wrote:
>
>> Hi Felix,
>>
>> I will try to be faster going forward. Apologies for the late reply.
>> Thanks Raymond for all the great clarifications.
>>
>> On RFC-21, I think it's safe to assume it will be available by Jan or so,
>> in 0.8.0. (Uber folks, correct me if I am wrong.)
>>
>> >>For approach 2 – the reason for prepending datetime is to have an
>> incrementing id, otherwise your uuid is a purely random id and won't support
>> range pruning, while writing, correct?
>> You are right. In general, we only have the following levers to control
>> performance. I take it that "origination datetime" is not monotonically
>> increasing? Otherwise Approach 1 is good, right?
>>
>> If you want to optimize for upsert performance,
>> - Prepending a timestamp field would help. If you simply prepend the
>> date, which is already the partition path, then all keys in that
>> partition will have the same prefix and no additional pruning opportunities
>> exist.
>> - I advise using dynamic bloom filters
>> (config hoodie.bloom.index.filter.type=DYNAMIC_V0), to ensure the bloom
>> filters filter out enough files after range pruning.
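
The two upsert levers above could be wired into writer options roughly as follows. This is only a sketch: the helper name, the table name and the column names are placeholders, and other settings a real job needs (for example a precombine field) are left out.

```python
def bloom_upsert_options(table_name: str, record_key_field: str, partition_field: str) -> dict:
    """Return Hudi writer options for a COW upsert using the bloom index."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.recordkey.field": record_key_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.index.type": "BLOOM",
        # Skip files whose min/max record key range cannot contain the incoming key.
        "hoodie.bloom.index.prune.by.ranges": "true",
        # Dynamic bloom filter, as suggested above.
        "hoodie.bloom.index.filter.type": "DYNAMIC_V0",
    }


# Placeholder names: "record_key" holds the timestamp-prefixed key,
# "event_date" is the date partition column.
options = bloom_upsert_options("hudi_tbl", "record_key", "event_date")

# With PySpark this would typically be passed as:
# df.write.format("hudi").options(**options).mode("append").save(base_path)
```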
>>
>> For good delete performance, we can cluster records by user_id for older
>> partitions, such that all records of a user are packed into the smallest
>> number of files. This way, when only a small number of users leave,
>> your delete won't rewrite the entire partition's files. Clustering
>> support is landing by the end of the year in 0.7.0. (There is a PR out
>> already, if you want to test/play.)
>>
>> All of this is also highly workload specific. So we can get into those
>> details, if that helps. MOR is a much better alternative for dealing with
>> deletes IMO.
>> It was specifically designed and used for those, since it can absorb the
>> deletes into log files and apply them later, amortizing costs.
>>
>> The future looks good, since we are investing in record level indexes that
>> could also natively index secondary fields like user_id. Again, expect that
>> to be there in 0.9.0 or so, around Mar.
>> For now, we have to play with how we lay out the data to squeeze out
>> performance.
>>
>> Hope that helps.
>>
>> thanks
>> vinoth
>>
>>
>>
>>
>>
>> On Tue, Nov 24, 2020 at 5:54 AM Kizhakkel Jose, Felix <
>> [email protected]> wrote:
>>
>>> Hi Raymond,
>>>
>>> Thanks a lot for the reply.
>>>
>>> For approach 2 – the reason for prepending datetime is to have an
>>> incrementing id, otherwise your uuid is a purely random id and won't support
>>> range pruning, while writing, correct? In a given date partition I am
>>> expecting to get 10s of billions of records, and having an incrementing id
>>> helps BLOOM filtering? This is the only intent of having the datetime
>>> prefix (int64 representation).
>>>
>>> Yes, I also see Approach 3 as really too big, adding a lot to the storage
>>> footprint.
>>>
>>> My initial approach was Approach 1 (generate a uuid from all 4
>>> fields), then I heard that range pruning can make writes faster, so I
>>> thought of datetime as a prefix. Do you see any benefit, or can the UUID
>>> itself be sufficient, since it's generated from the 4 input fields?
>>>
>>>
>>>
>>> Regards,
>>>
>>> Felix K Jose
>>>
>>> *From: *Raymond Xu <[email protected]>
>>> *Date: *Tuesday, November 24, 2020 at 2:20 AM
>>> *To: *Kizhakkel Jose, Felix <[email protected]>
>>> *Cc: *[email protected] <[email protected]>, [email protected] <
>>> [email protected]>, [email protected] <[email protected]>
>>> *Subject: *Re: Hudi Record Key Best Practices
>>>
>>> Hi Felix,
>>>
>>> I'd prefer approach 1. The logic is simple: to ensure uniqueness in your
>>> dataset.
>>>
>>> For 2, not very sure about the intention of prepending the datetime,
>>> looks like duplicate info knowing that you already partitioned it by that
>>> field.
>>>
>>> For 3, it seems too long for a primary id.
>>>
>>> Hope this helps.
>>>
>>>
>>>
>>> On Mon, Nov 23, 2020 at 6:25 PM Kizhakkel Jose, Felix <
>>> [email protected]> wrote:
>>>
>>> @Vinoth Chandar <[email protected]>,
>>>
>>> Could you please take a look and let me know what the best
>>> approach is, or point me to someone who can help me with this?
>>>
>>>
>>>
>>> Regards,
>>>
>>> Felix K Jose
>>>
>>> *From: *Kizhakkel Jose, Felix <[email protected]>
>>> *Date: *Thursday, November 19, 2020 at 12:04 PM
>>> *To: *[email protected] <[email protected]>, Vinoth Chandar <
>>> [email protected]>, [email protected] <
>>> [email protected]>
>>> *Cc: *[email protected] <[email protected]>, [email protected] <
>>> [email protected]>
>>> *Subject: *Re: Hudi Record Key Best Practices
>>>
>>> Sure. I will see about partition key.
>>>
>>> Since RFC 21 is not yet implemented and available to consume, can anyone
>>> please suggest the best approach I should follow to construct
>>> the record key, as asked in the original question:
>>>
>>> “
>>> My Write Use Cases:
>>> 1. Writes to partitioned HUDI table every 15 minutes
>>>
>>>   1.  where 95% inserts and 5% updates,
>>>   2.  Also 95% write goes to same partition (current date) 5% write can
>>> span across multiple partitions
>>> 2. GDPR request to delete records from the table using User Identifier
>>> field (F3)
>>>
>>>
>>> Record Key Construction:
>>> Approach 1:
>>> Generate a UUID  from the concatenated String of all these 4 fields [eg:
>>> str(F1) + “_” + str(F2) + “_” + str(F3) + “_” + str(F4) ] and use that
>>> newly generated field as Record Key
>>>
>>> Approach 2:
>>> Generate a UUID from the concatenated String of 3 fields except
>>> datetime field (F1) [eg: str(F2) + “_” + str(F3) + “_” + str(F4)] and
>>> prepend datetime field to the generated UUID and use that newly generated
>>> field as Record Key: F1_<uuid>
>>>
>>> Approach 3:
>>> Record Key as a composite key of all 4 fields (F1, F2, F3, F4)
>>> “
>>>
>>> Regards,
>>> Felix K Jose
>>> From: Raymond Xu <[email protected]>
>>> Date: Wednesday, November 18, 2020 at 5:30 PM
>>> To: [email protected] <[email protected]>
>>> Cc: [email protected] <[email protected]>, [email protected] <
>>> [email protected]>
>>> Subject: Re: Hudi Record Key Best Practices
>>> Hi Felix, I wasn't suggesting partitioning by user id, that'll be too many;
>>> just saying maybe making the writes more evenly spread could be
>>> better. Effectively, with 95% of writes going to one partition, it's like
>>> writing to a single-partition dataset. Hourly partitions could mitigate the
>>> situation, since you also have date-range queries. Just some rough ideas,
>>> the strategy really depends on your data pattern and requirements.
>>>
>>> For the development timeline on RFC 21, probably Vinoth or Balaji
>>> could give more info.
>>>
>>> On Wed, Nov 18, 2020 at 7:38 AM Kizhakkel Jose, Felix
>>> <[email protected]> wrote:
>>>
>>> > Hi Raymond,
>>> > Thank you for the response.
>>> >
>>> > Yes, the virtual key is definitely going to help reduce the storage
>>> > footprint. When do you think it is going to be available and will it be
>>> > compatible with all downstream processing engines (Presto, Redshift
>>> > Spectrum etc.)? We have started our development activities and are
>>> > expecting to get into PROD by the March-April timeframe.
>>> >
>>> > Regarding the partition key, we get data every day from 10-20 million
>>> > users, and currently we are planning to partition the data by Date
>>> > (YYYY-MM-DD) and thereby we will have consistent partitions for
>>> > downstream systems (every partition has the same amount of data [20
>>> > million user data in each partition], rather than skewed partitions).
>>> > And most of our queries are date range queries for a given user-Id.
>>> >
>>> > If I partition by user-Id, then I will have millions of partitions, and
>>> > I have read that having a large number of partitions has a major read
>>> > impact (metadata management etc.), what do you think? Is my
>>> > understanding correct?
>>> >
>>> > Yes, for the current day most of the data will be for that day – so do
>>> > you think it's going to be a problem while writing (won't the BLOOM index
>>> > help)? That's what I am trying to understand, to land on a more
>>> > performant solution.
>>> >
>>> > Meanwhile I would like to review my record key construction as well, to
>>> > see how it can help with write performance and the downstream
>>> > requirement to support GDPR, and to avoid any reprocessing/migration
>>> > down the line.
>>> >
>>> > Regards,
>>> > Felix K Jose
>>> >
>>> > From: Raymond Xu <[email protected]>
>>> > Date: Tuesday, November 17, 2020 at 6:18 PM
>>> > To: [email protected] <[email protected]>
>>> > Cc: [email protected] <[email protected]>, [email protected] <
>>> > [email protected]>, [email protected]
>>> > <[email protected]>
>>> > Subject: Re: Hudi Record Key Best Practices
>>> > Hi Felix, looks like the use case will benefit from the virtual key
>>> > feature in this RFC:
>>> >
>>> >
>>> >
>>> > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+21+%3A+Allow+HoodieRecordKey+to+be+Virtual
>>> >
>>> > Once this is implemented, you don't have to create a separate key.
>>> >
>>> > A rough thought: you mentioned 95% of writes go to the same partition.
>>> > Rather than the record key, maybe consider improving the partition
>>> > field, e.g. to have more even writes across partitions?
>>> >
>>> > On Sat, Nov 14, 2020 at 8:46 PM Kizhakkel Jose, Felix
>>> > <[email protected]> wrote:
>>> >
>>> > > Hello All,
>>> > >
>>> > > I have asked generic questions regarding record key in the Slack
>>> > > channel, but I just want to consolidate everything regarding Record Key
>>> > > and the suggested best practices of Record Key construction to get
>>> > > better write performance.
>>> > >
>>> > > Table Type: COW
>>> > > Partition Path: Date
>>> > >
>>> > > My record uniqueness is derived from a combination of 4 fields:
>>> > >
>>> > >   1.  F1: Datetime (record’s origination datetime)
>>> > >   2.  F2: String (11 char long serial number)
>>> > >   3.  F3: UUID (User Identifier)
>>> > >   4.  F4: String (12 char statistic name)
>>> > >
>>> > > Note: My record is a nested document and some of the above fields are
>>> > > nested fields
>>> > >
>>> > > My Write Use Cases:
>>> > > 1. Writes to partitioned HUDI table every 15 minutes
>>> > >
>>> > >   1.  where 95% inserts and 5% updates,
>>> > >   2.  Also 95% write goes to same partition (current date) 5% write
>>> > > can span across multiple partitions
>>> > > 2. GDPR request to delete records from the table using User
>>> > > Identifier field (F3)
>>> > >
>>> > >
>>> > > Record Key Construction:
>>> > > Approach 1:
>>> > > Generate a UUID from the concatenated String of all these 4 fields
>>> > > [eg: str(F1) + “_” + str(F2) + “_” + str(F3) + “_” + str(F4)] and use
>>> > > that newly generated field as Record Key
>>> > >
>>> > > Approach 2:
>>> > > Generate a UUID from the concatenated String of 3 fields except
>>> > > datetime field (F1) [eg: str(F2) + “_” + str(F3) + “_” + str(F4)] and
>>> > > prepend datetime field to the generated UUID and use that newly
>>> > > generated field as Record Key: F1_<uuid>
>>> > >
>>> > > Approach 3:
>>> > > Record Key as a composite key of all 4 fields (F1, F2, F3, F4)
>>> > >
>>> > > Which approach would you suggest? Could you please help me?
>>> > >
>>> > > Regards,
>>> > > Felix K Jose
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > ________________________________
>>> > > The information contained in this message may be confidential and
>>> > > legally protected under applicable law. The message is intended solely
>>> > > for the addressee(s). If you are not the intended recipient, you are
>>> > > hereby notified that any use, forwarding, dissemination, or
>>> > > reproduction of this message is strictly prohibited and may be
>>> > > unlawful. If you are not the intended recipient, please contact the
>>> > > sender by return e-mail and destroy all copies of the original message.
>>> > >
>>> >
>>> >
>>>
>>>
>>>
>
> --
> Regards,
> -Sivabalan
>
