Re: HUDI Table Primary Key - UUID or Custom For Better Performance

Vinoth Chandar Fri, 23 Oct 2020 19:09:07 -0700

Got it; please feel free to raise a JIRA for the future.

On Wed, Oct 21, 2020 at 9:47 PM tanu dua <[email protected]> wrote:


> Thanks got it. Unfortunately it’s not very straightforward for me to
> provide ordered keys. So far I am getting a decent write performance so
> will revisit if required.
>
> On Wed, 21 Oct 2020 at 7:45 AM, Vinoth Chandar <[email protected]> wrote:
>
> > For now, bloom filters are not leveraged in the read/query path; they
> > are used only by the writer, which performs the index lookup for
> > upserts. Hudi is write-optimized like an OLTP store and read-optimized
> > like OLAP, if that makes sense.
> >
> > As for bloom index performance, our tuning guide and FAQ talk about this.
> > If you eventually want to support de-duplication, say, it might be good to
> > pick a key that is ordered, something like _hoodie_seq_no that keeps
> > increasing with new commits. Then the bloom indexing mechanism will also
> > be able to do range pruning effectively, improving performance
> > significantly. Pure UUID keys are not very conducive to range pruning,
> > i.e., files written during each commit will overlap in key range with
> > almost every other file.
> >
> > Thanks
> > Vinoth
> >
> > On Fri, Oct 16, 2020 at 8:42 PM Tanuj <[email protected]> wrote:
> >
> > > Thanks Prashant. To answer your questions:
> > > 1) Yes, the size of the keys is around 5-8 alphanumeric characters, but
> > > since it's a composite key of 3 domain keys I believe it will be almost
> > > equal to a UUID.
> > > 4) That's the business need. We need to keep a track/audit for every
> > > insertion of a new record. We had 2 options: update the existing record
> > > and make an audit table to store old records, or keep pushing into the
> > > same table with a timestamp so that it always works in append mode. We
> > > chose option 2.
> > > 5) That's what I want to understand: how will bloom filters be useful
> > > here? And in general, is the bloom filter used in HUDI for reads? I
> > > understand the write process where it's being used, but is it used in
> > > reads as well? I believe that after picking up the correct parquet file,
> > > Hudi delegates the read to Spark. Please correct me if I am wrong here.
> > > 6) We will only query on domain object keys, excluding create_date.
> > >
> > > On 2020/10/16 18:53:21, Prashant Wason <[email protected]> wrote:
> > > > Hi Tanu,
> > > >
> > > > Some points to consider:
> > > > 1. UUID is fixed size compared to domain_object_keys (don't know the
> > > > size). Smaller keys will reduce the storage requirements.
> > > > 2. UUIDs don't compress. Your domain object keys may compress better.
> > > > 3. From the bloom filter perspective, I don't think there is any
> > > > difference unless the size difference of keys is very large.
> > > > 4. If the domain object keys are already unique, what is the use of
> > > > suffixing the create_date?
> > > > 5. If you query by "primary key minus timestamp", the entire record key
> > > > column will have to be read to match it. So bloom filters won't be
> > > > useful here.
> > > > 6. What do the domain object keys look like? Are they going to be
> > > > included in any other field in the record? Would you ever want to query
> > > > on domain object keys?
> > > >
> > > > Thanks
> > > > Prashant
> > > >
> > > >
> > > > On Thu, Oct 15, 2020 at 8:21 PM tanu dua <[email protected]> wrote:
> > > >
> > > > > The read query pattern will be (partition key + primary key minus
> > > > > timestamp), where my primary key is domain keys + timestamp.
> > > > >
> > > > > Read and write queries depend on the dataset, but mostly all the
> > > > > tables are read and written frequently and equally.
> > > > >
> > > > > Reads will mostly be done by providing the partitions and not by
> > > > > blanket queries.
> > > > >
> > > > > If we have to choose between read and write I will choose write, but
> > > > > I want to stick only with COW tables.
> > > > >
> > > > > Please let me know if you need more information.
> > > > >
> > > > >
> > > > > On Thu, 15 Oct 2020 at 5:48 PM, Sivabalan <[email protected]> wrote:
> > > > >
> > > > > > Can you give us a sense of what your read workload looks like?
> > > > > > Depending on that, read perf could vary.
> > > > > >
> > > > > > On Thu, Oct 15, 2020 at 4:06 AM Tanuj <[email protected]> wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > > We don't have an "UPDATE" use case and all ingested rows will be
> > > > > > > "INSERT", so what is the best way to define the PRIMARY key? As of
> > > > > > > now we have designed the primary key as per the domain object with
> > > > > > > create_date, which is:
> > > > > > > <domain_object_key_1>,<domain_object_key_2>,<create_date>
> > > > > > >
> > > > > > > Since it's always an INSERT for us, I can potentially use a UUID as
> > > > > > > well.
> > > > > > >
> > > > > > > We use keys for the Bloom Index in HUDI, so I just wanted to know
> > > > > > > whether I would get better write performance with a UUID vs.
> > > > > > > composite domain keys.
> > > > > > >
> > > > > > > I believe read is not impacted by the primary key, as it's not
> > > > > > > being considered?
> > > > > > >
> > > > > > > Please suggest
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > Regards,
> > > > > > -Sivabalan
> > > > > >
> > > > >
> > > >
> > >
> >
>
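
The range-pruning point raised earlier in the thread (ordered keys prune well, pure UUID keys do not) can be illustrated with a small simulation. This is only a sketch and not Hudi code or anything posted by the participants: it assumes each commit writes one file, that each file records the minimum and maximum record key it holds, and that a lookup can skip any file whose range excludes the key. The names `file_ranges`, `candidate_files`, `NUM_COMMITS` and `KEYS_PER_COMMIT` are hypothetical.

```python
import random
import uuid

NUM_COMMITS = 20        # hypothetical: one file written per commit
KEYS_PER_COMMIT = 500   # hypothetical: records per file


def file_ranges(keys_per_commit):
    """Return the (min_key, max_key) range for each file's keys."""
    return [(min(keys), max(keys)) for keys in keys_per_commit]


def candidate_files(ranges, key):
    """Indices of files whose key range could contain `key`."""
    return [i for i, (lo, hi) in enumerate(ranges) if lo <= key <= hi]


# Ordered keys: zero-padded sequence numbers that grow with every commit.
ordered_commits = [
    [f"{c * KEYS_PER_COMMIT + i:010d}" for i in range(KEYS_PER_COMMIT)]
    for c in range(NUM_COMMITS)
]

# Random keys: UUIDs drawn from a seeded generator so runs are repeatable.
rng = random.Random(42)
uuid_commits = [
    [str(uuid.UUID(int=rng.getrandbits(128), version=4))
     for _ in range(KEYS_PER_COMMIT)]
    for _ in range(NUM_COMMITS)
]

ordered_ranges = file_ranges(ordered_commits)
uuid_ranges = file_ranges(uuid_commits)

# Look up one key that was written by commit 7 in each scheme.
ordered_hits = candidate_files(ordered_ranges, ordered_commits[7][123])
uuid_hits = candidate_files(uuid_ranges, uuid_commits[7][123])

print("ordered keys, files to check:", len(ordered_hits))
print("uuid keys, files to check:   ", len(uuid_hits))
```

With ordered keys the ranges of different files do not overlap, so only the file that actually holds the key survives pruning. With random UUIDs each file's range spans nearly the whole key space, so almost every file remains a candidate and has to be checked further.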

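Point 5 in the thread (querying by "primary key minus timestamp" cannot use bloom filters) follows from how a Bloom filter works: it only answers membership questions for the exact value that was inserted. The toy filter below is a sketch under that assumption and is not Hudi's implementation; the `BloomFilter` class, its sizing parameters, and the sample record keys are all made up for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter for illustration only (not Hudi's implementation)."""

    def __init__(self, num_bits=8192, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive several bit positions from the full key string.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


# Hypothetical composite record keys: domain keys plus create_date.
record_keys = [
    "cust-001,order-17,2020-10-15T08:21:00",
    "cust-001,order-17,2020-10-16T09:00:00",
    "cust-002,order-03,2020-10-15T10:30:00",
]

bloom = BloomFilter()
for key in record_keys:
    bloom.add(key)

prefix = "cust-001,order-17"  # the "primary key minus timestamp"

# The exact key that was inserted is always reported as possibly present.
full_key_hit = bloom.might_contain(record_keys[0])
# The prefix was never inserted, so the filter has no information about it.
prefix_hit = bloom.might_contain(prefix)
# Answering the prefix query means scanning the stored keys themselves.
prefix_matches = [k for k in record_keys if k.startswith(prefix + ",")]

print(full_key_hit, prefix_hit, len(prefix_matches))
```

The prefix hashes to unrelated bit positions, so the filter cannot tell which files hold matching records, and the match has to be done by reading the record key column, as described in the thread.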