Got it; please feel free to raise a JIRA for the future.

On Wed, Oct 21, 2020 at 9:47 PM tanu dua <[email protected]> wrote:
> Thanks, got it. Unfortunately it's not very straightforward for me to
> provide ordered keys. So far I am getting decent write performance, so
> I will revisit if required.
>
> On Wed, 21 Oct 2020 at 7:45 AM, Vinoth Chandar <[email protected]> wrote:
>
> > For now, bloom filters are not actually leveraged in the read/query path
> > but only by the writer performing the index lookup for upserting. Hudi is
> > write optimized like an OLTP store and read optimized like OLAP, if
> > that makes sense.
> >
> > As for bloom index performance, our tuning guide and FAQ talk about this.
> > If you eventually want to support de-duplication, say, it might be good to
> > pick a key that is ordered. Something like _hoodie_seq_no that keeps
> > increasing with new commits; then the bloom indexing mechanism will also
> > be able to do range pruning effectively, improving performance
> > significantly. Pure UUID keys are not very conducive to range pruning,
> > i.e. files written during each commit will overlap in key range with
> > almost every other file.
> >
> > Thanks
> > Vinoth
> >
> > On Fri, Oct 16, 2020 at 8:42 PM Tanuj <[email protected]> wrote:
> >
> > > Thanks Prashant. To answer your questions -
> > > 1) Yes, the size of the keys is around 5-8 alphanumeric characters,
> > > but since it's a composite key of 3 domain keys I believe it will be
> > > almost equal to a UUID.
> > > 4) That's the business need. We need to keep a track/audit for every
> > > insertion of a new record. We had 2 options: update the existing
> > > record and make an audit table to store old records, or keep pushing
> > > into the same table with a timestamp so that it always works in
> > > append mode. We chose option 2.
> > > 5) That's what I want to understand: how bloom filters will be useful
> > > here. And in general, is the bloom filter used in Hudi for reads? I
> > > understand the write process where it's being used, but is it used in
> > > reads as well? I believe that after picking up the correct parquet
> > > file Hudi delegates the read to Spark. Please correct me if I am
> > > wrong here.
> > > 6) We will only query on domain object keys, excluding create_date.
> > >
> > > On 2020/10/16 18:53:21, Prashant Wason <[email protected]> wrote:
> > > > Hi Tanu,
> > > >
> > > > Some points to consider:
> > > > 1. UUID is fixed size compared to domain_object_keys (don't know
> > > > the size). Smaller keys will reduce the storage requirements.
> > > > 2. UUIDs don't compress. Your domain object keys may compress better.
> > > > 3. From the bloom filter perspective, I don't think there is any
> > > > difference unless the size difference of keys is very large.
> > > > 4. If the domain object keys are already unique, what is the use of
> > > > suffixing the create_date?
> > > > 5. If you query by "primary key minus timestamp", the entire record
> > > > key column will have to be read to match it. So bloom filters won't
> > > > be useful here.
> > > > 6. What do the domain object keys look like? Are they going to be
> > > > included in any other field in the record? Would you ever want to
> > > > query on domain object keys?
> > > >
> > > > Thanks
> > > > Prashant
> > > >
> > > > On Thu, Oct 15, 2020 at 8:21 PM tanu dua <[email protected]> wrote:
> > > >
> > > > > The read query pattern will be (partition key + primary key minus
> > > > > timestamp), where my primary key is domain keys + timestamp.
> > > > >
> > > > > Read and write queries vary per dataset, but mostly all the tables
> > > > > are read and written frequently and equally.
> > > > >
> > > > > Reads will mostly be done by providing the partitions and not by a
> > > > > blanket query.
> > > > >
> > > > > If we have to choose between read and write I will choose write,
> > > > > but I want to stick only with a COW table.
> > > > >
> > > > > Please let me know if you need more information.
> > > > >
> > > > > On Thu, 15 Oct 2020 at 5:48 PM, Sivabalan <[email protected]> wrote:
> > > > >
> > > > > > Can you give us a sense of what your read workload looks like?
> > > > > > Depending on that, read perf could vary.
> > > > > >
> > > > > > On Thu, Oct 15, 2020 at 4:06 AM Tanuj <[email protected]> wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > > We don't have an "UPDATE" use case and all ingested rows will
> > > > > > > be "INSERT", so what is the best way to define the PRIMARY
> > > > > > > key? As of now we have designed the primary key as per the
> > > > > > > domain object with create_date, which is -
> > > > > > > <domain_object_key_1>,<domain_object_key_2>,<create_date>
> > > > > > >
> > > > > > > Since it's always an INSERT for us, I can potentially use a
> > > > > > > UUID as well.
> > > > > > >
> > > > > > > We use keys for the Bloom Index in Hudi, so I just wanted to
> > > > > > > know if I get better write performance with the UUID vs
> > > > > > > composite domain keys.
> > > > > > >
> > > > > > > I believe read is not impacted by the primary key, as it's not
> > > > > > > being considered?
> > > > > > >
> > > > > > > Please suggest
> > > > > >
> > > > > > --
> > > > > > Regards,
> > > > > > -Sivabalan
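
The range-pruning point made in the thread (ordered keys prune well, pure UUID keys do not) can be illustrated with a small simulation. This is only a sketch and not Hudi code: it assumes one file per commit and models each file by nothing more than the minimum and maximum record key it contains. The names `key_ranges`, `candidate_files`, `NUM_COMMITS` and `KEYS_PER_COMMIT` are made up for the illustration.

```python
import random
import uuid

NUM_COMMITS = 20        # hypothetical: one file written per commit
KEYS_PER_COMMIT = 1000  # hypothetical: records per file


def key_ranges(files):
    """Return the (min_key, max_key) pair for each file."""
    return [(min(keys), max(keys)) for keys in files]


def candidate_files(ranges, key):
    """Indexes of files whose [min, max] range could contain `key`.

    Only these files would still need a bloom filter check after pruning.
    """
    return [i for i, (lo, hi) in enumerate(ranges) if lo <= key <= hi]


rng = random.Random(42)  # fixed seed so the run is repeatable

# Ordered keys: a zero-padded counter that keeps increasing across commits,
# so every file covers its own disjoint slice of the key space.
ordered_files = [
    [f"{commit * KEYS_PER_COMMIT + i:012d}" for i in range(KEYS_PER_COMMIT)]
    for commit in range(NUM_COMMITS)
]

# Random keys: UUID strings, so every file spans almost the whole key space.
uuid_files = [
    [str(uuid.UUID(int=rng.getrandbits(128))) for _ in range(KEYS_PER_COMMIT)]
    for _ in range(NUM_COMMITS)
]

# Look up one key that lives in file 7 under each scheme.
ordered_hits = candidate_files(key_ranges(ordered_files), ordered_files[7][500])
uuid_hits = candidate_files(key_ranges(uuid_files), uuid_files[7][500])

print("ordered keys:", len(ordered_hits), "candidate file(s)")
print("UUID keys:   ", len(uuid_hits), "candidate file(s)")
```

With ordered keys the lookup narrows to the single file that holds the key, while with UUID keys most files remain candidates because their key ranges overlap. That overlap is why range pruning gives little benefit for purely random keys in this simplified model.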
