Thanks Prashant. To answer your questions:

1) Yes, each key is around 5-8 alphanumeric characters, but since the record key is a composite of 3 domain keys, its total size ends up roughly comparable to a UUID.

4) That is the business need: we have to keep an audit trail of every inserted record. We had 2 options - (a) update the existing record and move the old version into an audit table, or (b) keep pushing new rows into the same table with a timestamp so that it always works in append mode. We chose option (b); a rough sketch of that write path is below.

5) That is exactly what I want to understand: how bloom filters would be useful here, and whether Hudi uses the bloom filter on reads at all. I understand the write path where it is used, but my belief is that on reads, once the right parquet files are picked, Hudi delegates the read to Spark. Please correct me if I am wrong here.

6) We will only query on the domain object keys, excluding create_date (a sketch of that read pattern is at the bottom of this mail, after the quoted thread).
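To make option (b) concrete, here is a minimal sketch of the insert-only write path I have in mind: the record key is a composite of the domain keys plus create_date, and every batch is appended as an INSERT into a copy-on-write table. The table name, column names and paths are made up for illustration; the option keys are the standard Hudi Spark datasource write configs as I understand them.

// Sketch only: composite record key = domain keys + create_date, insert-only COW table.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("audit-append").getOrCreate()

// Hypothetical incoming batch with columns:
// domain_key_1, domain_key_2, create_date, partition_col, payload...
val batch = spark.read.json("/tmp/incoming_batch")

val hudiOptions = Map(
  "hoodie.table.name"                           -> "audit_events",
  "hoodie.datasource.write.table.type"          -> "COPY_ON_WRITE",
  // ComplexKeyGenerator concatenates several fields into one record key
  "hoodie.datasource.write.keygenerator.class"  -> "org.apache.hudi.keygen.ComplexKeyGenerator",
  "hoodie.datasource.write.recordkey.field"     -> "domain_key_1,domain_key_2,create_date",
  "hoodie.datasource.write.partitionpath.field" -> "partition_col",
  "hoodie.datasource.write.precombine.field"    -> "create_date",
  "hoodie.datasource.write.operation"           -> "insert",   // never upsert, audit rows only grow
  "hoodie.index.type"                           -> "BLOOM"
)

batch.write.format("hudi").
  options(hudiOptions).
  mode(SaveMode.Append).
  save("/data/hudi/audit_events")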
On 2020/10/16 18:53:21, Prashant Wason <[email protected]> wrote:
> Hi Tanu,
>
> Some points to consider:
> 1. UUID is fixed size compared to domain_object_keys (dont know the size).
>    Smaller keys will reduce the storage requirements.
> 2. UUIDs don't compress. Your domain object keys may compress better.
> 3. From the bloom filter perspective, I dont think there is any difference
>    unless the size difference of keys is very large.
> 4. If the domain object keys are already unique, what is the use of
>    suffixing the create_date?
> 5. If you query by "primary key minus timestamp", the entire record key
>    column will have to be read to match it. So bloom filters won't be
>    useful here.
> 6. What do the domain object keys look like? Are they going to be included
>    in any other field in the record? Would you ever want to query on
>    domain object keys?
>
> Thanks
> Prashant
>
> On Thu, Oct 15, 2020 at 8:21 PM tanu dua <[email protected]> wrote:
> >
> > read query pattern will be (partition key + primary key minus timestamp)
> > where my primary key is domain keys + timestamp.
> >
> > Read Write queries are as per dataset but mostly all the tables are read
> > and write frequently and equally
> >
> > Read will be mostly done by providing the partitions and not by blanket
> > query.
> >
> > If we have to choose between read and write I will choose write but I
> > want to stick only with COW table.
> >
> > Please let me know if you need more information.
> >
> > On Thu, 15 Oct 2020 at 5:48 PM, Sivabalan <[email protected]> wrote:
> > >
> > > Can you give us a sense of how your read workload looks like? Depending
> > > on that read perf could vary.
> > >
> > > On Thu, Oct 15, 2020 at 4:06 AM Tanuj <[email protected]> wrote:
> > >
> > > > Hi all,
> > > > We don't have an "UPDATE" use case and all ingested rows will be
> > > > "INSERT" so what is the best way to define PRIMARY key. As of now we
> > > > have designed primary key as per domain object with create_date
> > > > which is -
> > > > <domain_object_key_1>,<domain_object_key_2>,<create_date>
> > > >
> > > > Since its always an INSERT for us, I can potentially use UUID as well.
> > > >
> > > > We use keys for Bloom Index in HUDI so just wanted to know if I get a
> > > > better performance in writing if I will have the UUID vs composite
> > > > domain keys.
> > > >
> > > > I believe read is not impacted as per the Primary Key as its not
> > > > being considered ?
> > > >
> > > > Please suggest
> > >
> > > --
> > > Regards,
> > > -Sivabalan
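P.S. For point 6, this is roughly how I picture the read side (same made-up table, column and path names as in the write sketch above). My understanding is that such a query is served by partition pruning plus Spark's parquet reader with predicate pushdown on the domain key columns, not by the bloom index, which is why I was asking whether bloom filters play any role on reads at all.

// Sketch only: snapshot query filtered by partition and domain keys, without create_date.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("audit-read").getOrCreate()

val snapshot = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "snapshot").
  load("/data/hudi/audit_events")   // older Hudi releases may need a /*/* glob path here

snapshot.
  where("partition_col = '2020-10-15'").                        // partition pruning
  where("domain_key_1 = 'ABC12' and domain_key_2 = 'XYZ34'").   // pushed down to parquet
  show()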
