Hello All,

I have asked general questions about the record key in the Slack channel, but I would like to consolidate everything about the record key here and ask for the suggested best practices for constructing it to get better write performance.
Table Type: COW
Partition Path: Date

My record uniqueness is derived from a combination of 4 fields:
1. F1: Datetime (record's origination datetime)
2. F2: String (11-character serial number)
3. F3: UUID (User Identifier)
4. F4: String (12-character statistic name)

Note: My record is a nested document, and some of the above fields are nested fields.

My Write Use Cases:
1. Writes to the partitioned Hudi table every 15 minutes, where:
   1. 95% are inserts and 5% are updates
   2. 95% of writes go to the same partition (current date), and 5% of writes can span multiple partitions
2. GDPR requests to delete records from the table using the User Identifier field (F3)

Record Key Construction:

Approach 1: Generate a UUID from the concatenated string of all 4 fields [e.g. str(F1) + "_" + str(F2) + "_" + str(F3) + "_" + str(F4)] and use that newly generated field as the record key.

Approach 2: Generate a UUID from the concatenated string of the 3 fields other than the datetime field (F1) [e.g. str(F2) + "_" + str(F3) + "_" + str(F4)], prepend the datetime field to the generated UUID, and use that newly generated field as the record key. The key then has the form F1_<uuid>.

Approach 3: Use a composite record key made of all 4 fields (F1, F2, F3, F4).

Which approach would you suggest? Could you please help me?

Regards,
Felix K Jose
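
To make the three approaches concrete, here is a minimal Python sketch of how each key could be built. It is only an illustration under my own assumptions: the function names are hypothetical, the UUID is a deterministic name-based UUID (uuid5 with a fixed namespace) so that the same input always yields the same key, F1 is rendered with isoformat(), and the composite key in Approach 3 is shown as a plain "_"-joined string, which may differ from how a Hudi key generator actually formats a multi-field key.

```python
import uuid
from datetime import datetime

# Assumption: any fixed namespace works, as long as it never changes,
# because changing it would change every generated key.
KEY_NAMESPACE = uuid.NAMESPACE_OID


def _join(*parts):
    """Concatenate the field values with "_" as in the examples above."""
    return "_".join(str(p) for p in parts)


def key_approach_1(f1: datetime, f2: str, f3: str, f4: str) -> str:
    """Approach 1: UUID derived from all four fields."""
    return str(uuid.uuid5(KEY_NAMESPACE, _join(f1.isoformat(), f2, f3, f4)))


def key_approach_2(f1: datetime, f2: str, f3: str, f4: str) -> str:
    """Approach 2: datetime prefix + UUID derived from F2, F3 and F4."""
    suffix = uuid.uuid5(KEY_NAMESPACE, _join(f2, f3, f4))
    return f"{f1.isoformat()}_{suffix}"


def key_approach_3(f1: datetime, f2: str, f3: str, f4: str) -> str:
    """Approach 3: all four field values kept readable in one composite key."""
    return _join(f1.isoformat(), f2, f3, f4)


if __name__ == "__main__":
    # Made-up sample values matching the field descriptions above.
    f1 = datetime(2021, 2, 1, 10, 15, 0)
    f2 = "SN000000001"                             # 11-character serial number
    f3 = "123e4567-e89b-12d3-a456-426614174000"    # user identifier
    f4 = "heart_rate_a"                            # 12-character statistic name
    for build in (key_approach_1, key_approach_2, key_approach_3):
        print(build.__name__, "->", build(f1, f2, f3, f4))
```

In this sketch only Approaches 2 and 3 produce keys that sort by the record's origination datetime, and only Approach 3 keeps the User Identifier (F3) visible inside the key itself.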
