Thanks Prashant. To answer your questions:

1) Yes, each key is around 5-8 alphanumeric characters, but since the record key is a composite of 3 domain keys, its total size ends up roughly comparable to a UUID.

4) That is the business need: we have to keep an audit trail of every inserted record. We had 2 options - (a) update the existing record and move the old version into an audit table, or (b) keep pushing new rows into the same table with a timestamp so that it always works in append mode. We chose option (b); a rough sketch of that write path is below.

5) That is exactly what I want to understand: how bloom filters would be useful here, and whether Hudi uses the bloom filter on reads at all. I understand the write path where it is used, but my belief is that on reads, once the right parquet files are picked, Hudi delegates the read to Spark. Please correct me if I am wrong here.

6) We will only query on the domain object keys, excluding create_date (a sketch of that read pattern is at the bottom of this mail, after the quoted thread).
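To make option (b) concrete, here is a minimal sketch of the insert-only write path I have in mind: the record key is a composite of the domain keys plus create_date, and every batch is appended as an INSERT into a copy-on-write table. The table name, column names and paths are made up for illustration; the option keys are the standard Hudi Spark datasource write configs as I understand them.

// Sketch only: composite record key = domain keys + create_date, insert-only COW table.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("audit-append").getOrCreate()

// Hypothetical incoming batch with columns:
// domain_key_1, domain_key_2, create_date, partition_col, payload...
val batch = spark.read.json("/tmp/incoming_batch")

val hudiOptions = Map(
  "hoodie.table.name"                           -> "audit_events",
  "hoodie.datasource.write.table.type"          -> "COPY_ON_WRITE",
  // ComplexKeyGenerator concatenates several fields into one record key
  "hoodie.datasource.write.keygenerator.class"  -> "org.apache.hudi.keygen.ComplexKeyGenerator",
  "hoodie.datasource.write.recordkey.field"     -> "domain_key_1,domain_key_2,create_date",
  "hoodie.datasource.write.partitionpath.field" -> "partition_col",
  "hoodie.datasource.write.precombine.field"    -> "create_date",
  "hoodie.datasource.write.operation"           -> "insert",   // never upsert, audit rows only grow
  "hoodie.index.type"                           -> "BLOOM"
)

batch.write.format("hudi").
  options(hudiOptions).
  mode(SaveMode.Append).
  save("/data/hudi/audit_events")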
On 2020/10/16 18:53:21, Prashant Wason <[email protected]> wrote:
> Hi Tanu,
>
> Some points to consider:
> 1. UUID is fixed size compared to domain_object_keys (dont know the size).
>    Smaller keys will reduce the storage requirements.
> 2. UUIDs don't compress. Your domain object keys may compress better.
> 3. From the bloom filter perspective, I dont think there is any difference
>    unless the size difference of keys is very large.
> 4. If the domain object keys are already unique, what is the use of
>    suffixing the create_date?
> 5. If you query by "primary key minus timestamp", the entire record key
>    column will have to be read to match it. So bloom filters won't be
>    useful here.
> 6. What do the domain object keys look like? Are they going to be included
>    in any other field in the record? Would you ever want to query on
>    domain object keys?
>
> Thanks
> Prashant
>
> On Thu, Oct 15, 2020 at 8:21 PM tanu dua <[email protected]> wrote:
> >
> > read query pattern will be (partition key + primary key minus timestamp)
> > where my primary key is domain keys + timestamp.
> >
> > Read Write queries are as per dataset but mostly all the tables are read
> > and write frequently and equally
> >
> > Read will be mostly done by providing the partitions and not by blanket
> > query.
> >
> > If we have to choose between read and write I will choose write but I
> > want to stick only with COW table.
> >
> > Please let me know if you need more information.
> >
> > On Thu, 15 Oct 2020 at 5:48 PM, Sivabalan <[email protected]> wrote:
> > >
> > > Can you give us a sense of how your read workload looks like? Depending
> > > on that read perf could vary.
> > >
> > > On Thu, Oct 15, 2020 at 4:06 AM Tanuj <[email protected]> wrote:
> > >
> > > > Hi all,
> > > > We don't have an "UPDATE" use case and all ingested rows will be
> > > > "INSERT" so what is the best way to define PRIMARY key. As of now we
> > > > have designed primary key as per domain object with create_date
> > > > which is -
> > > > <domain_object_key_1>,<domain_object_key_2>,<create_date>
> > > >
> > > > Since its always an INSERT for us, I can potentially use UUID as well.
> > > >
> > > > We use keys for Bloom Index in HUDI so just wanted to know if I get a
> > > > better performance in writing if I will have the UUID vs composite
> > > > domain keys.
> > > >
> > > > I believe read is not impacted as per the Primary Key as its not
> > > > being considered ?
> > > >
> > > > Please suggest
> > >
> > > --
> > > Regards,
> > > -Sivabalan
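P.S. For point 6, this is roughly how I picture the read side (same made-up table, column and path names as in the write sketch above). My understanding is that such a query is served by partition pruning plus Spark's parquet reader with predicate pushdown on the domain key columns, not by the bloom index, which is why I was asking whether bloom filters play any role on reads at all.

// Sketch only: snapshot query filtered by partition and domain keys, without create_date.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("audit-read").getOrCreate()

val snapshot = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "snapshot").
  load("/data/hudi/audit_events")   // older Hudi releases may need a /*/* glob path here

snapshot.
  where("partition_col = '2020-10-15'").                        // partition pruning
  where("domain_key_1 = 'ABC12' and domain_key_2 = 'XYZ34'").   // pushed down to parquet
  show()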
