For now, bloom filters are not used in the read/query path; only the writer uses them, when it performs the index lookup for upserts. Hudi is write-optimized like an OLTP store and read-optimized like an OLAP store, if that makes sense.
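To make the writer-side lookup a bit more concrete, here is a toy model in Python. It is not Hudi's actual code: the helper names (`write_commits`, `candidate_files`, `avg_candidates`) and the file sizes are made up for illustration. It only models the first step of the lookup, where files are skipped based on the min/max record key they contain; the bloom filter check that would follow on the remaining files is left out.

```python
import uuid

def write_commits(keys, keys_per_file):
    """Group keys into files in arrival order; return (min_key, max_key) per file."""
    files = []
    for i in range(0, len(keys), keys_per_file):
        chunk = keys[i:i + keys_per_file]
        files.append((min(chunk), max(chunk)))
    return files

def candidate_files(key, files):
    """Keep only the files whose [min_key, max_key] range could contain the key."""
    return [f for f in files if f[0] <= key <= f[1]]

def avg_candidates(keys, files):
    """Average number of files that survive range pruning per looked-up key."""
    return sum(len(candidate_files(k, files)) for k in keys) / len(keys)

n, per_file = 2000, 100  # hypothetical sizes, chosen only to keep the demo fast

# Keys that keep increasing with each write (zero-padded so string order matches numeric order).
ordered_keys = [f"{i:010d}" for i in range(n)]
# Random UUID keys: every file ends up spanning almost the whole key space.
random_keys = [str(uuid.uuid4()) for _ in range(n)]

ordered_files = write_commits(ordered_keys, per_file)
random_files = write_commits(random_keys, per_file)

ordered_avg = avg_candidates(ordered_keys, ordered_files)
random_avg = avg_candidates(random_keys, random_files)

print("ordered keys, avg candidate files per lookup:", ordered_avg)
print("uuid keys, avg candidate files per lookup:", random_avg)
```

With the ordered keys the file ranges are disjoint, so every lookup is narrowed down to exactly one file. With the UUID keys nearly all of the 20 files remain candidates, and each of them would then need its bloom filter checked.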
As for bloom index performance, our tuning guide and FAQ talk about this. If you eventually want to support de-duplication, say, it might be good to pick a key that is ordered. With something like _hoodie_seq_no, which keeps increasing with new commits, the bloom indexing mechanism will also be able to do range pruning effectively, improving performance significantly. Pure UUID keys are not very conducive to range pruning, i.e. files written during each commit will overlap in key range with almost every other file.

Thanks
Vinoth

On Fri, Oct 16, 2020 at 8:42 PM Tanuj <[email protected]> wrote:

> Thanks Prashant. To answer your questions -
> 1) Yes size of keys are something around 5-8 alphanumeric but since its
> composite key of 3 domain keys I believe it will be almost equal to UUID
> 4) Thats the business need. We need to keep a track/audit for every
> insertion of new record. We had 2 options - Update Existing Record , make
> an Audit Table to store old records or keep pushing in the same table with
> timestamp so that it always works with Append mode. We choose Option 2
> 5) Thats what I want to understand how Bloom Filters will be useful here.
> And in general also is bloom filter used in HUDI for read. I understand the
> write process where its being used but does it use in read as well as I
> believe after picking up the correct parquet file Hudi delegates the read
> to Spark . Please correct me if I am wrong here
> 6) We will only query on domain object keys excluding create_date.
>
> On 2020/10/16 18:53:21, Prashant Wason <[email protected]> wrote:
> > Hi Tanu,
> >
> > Some points to consider:
> > 1. UUID is fixed size compared to domain_object_keys (dont know the
> > size). Smaller keys will reduce the storage requirements.
> > 2. UUIDs don't compress. Your domain object keys may compress better.
> > 3. From the bloom filter perspective, I dont think there is any
> > difference unless the size difference of keys is very large.
> > 4. If the domain object keys are already unique, what is the use of
> > suffixing the create_date?
> > 5. If you query by "primary key minus timestamp", the entire record key
> > column will have to be read to match it. So bloom filters won't be useful
> > here.
> > 6. What do the domain object keys look like? Are they going to be
> > included in any other field in the record? Would you ever want to query
> > on domain object keys?
> >
> > Thanks
> > Prashant
> >
> >
> > On Thu, Oct 15, 2020 at 8:21 PM tanu dua <[email protected]> wrote:
> >
> > > read query pattern will be (partition key + primary key minus
> > > timestamp) where my primary key is domain keys + timestamp.
> > >
> > > Read Write queries are as per dataset but mostly all the tables are
> > > read and write frequently and equally
> > >
> > > Read will be mostly done by providing the partitions and not by blanket
> > > query.
> > >
> > > If we have to choose between read and write I will choose write but I
> > > want to stick only with COW table.
> > >
> > > Please let me know if you need more information.
> > >
> > >
> > > On Thu, 15 Oct 2020 at 5:48 PM, Sivabalan <[email protected]> wrote:
> > >
> > > > Can you give us a sense of how your read workload looks like?
> > > > Depending on that read perf could vary.
> > > >
> > > > On Thu, Oct 15, 2020 at 4:06 AM Tanuj <[email protected]> wrote:
> > > >
> > > > > Hi all,
> > > > > We don't have an "UPDATE" use case and all ingested rows will be
> > > > > "INSERT" so what is the best way to define PRIMARY key. As of now
> > > > > we have designed primary key as per domain object with create_date
> > > > > which is -
> > > > > <domain_object_key_1>,<domain_object_key_2>,<create_date>
> > > > >
> > > > > Since its always an INSERT for us , I can potentially use UUID as
> > > > > well .
> > > > >
> > > > > We use keys for Bloom Index in HUDI so just wanted to know if I get
> > > > > a better performance in writing if I will have the UUID vs
> > > > > composite domain keys.
> > > > >
> > > > > I believe read is not impacted as per the Primary Key as its not
> > > > > being considered ?
> > > > >
> > > > > Please suggest
> > > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > -Sivabalan
> > > >
> > >
> >
