Hi all,

I would like to bring up the topic of how the precombine field is used and what its purpose is. I would also like to know what the plans for it are going forward.
At first glance, the precombine field looks like it's only used to deduplicate records in the incoming batch. But when digging deeper, it turns out it can also be used to:
1. combine records not before, but on write, to decide whether to update an existing record (e.g. with DefaultHoodieRecordPayload);
2. combine records on read for MoR tables, so that log and base files are merged correctly;
3. satisfy Spark SQL UPDATE, which requires a precombine field even though the user can't introduce duplicates with that statement anyway (sketch 1 in the P.S. below).

Regarding point 3, there's an inconsistency: the precombine field is not required in MERGE INTO UPDATE. Under the hood, UPSERT is switched to INSERT in upsert mode to update existing records (sketch 2 in the P.S.).

I know that Hudi does a lot of work to ensure PK uniqueness across/within partitions, so there is a need to deduplicate records before write, or to deduplicate existing data if duplicates were introduced, e.g. when using the non-strict insert mode.

What should happen, then, when a user does not want to, or cannot, provide a precombine field? It would then be on the user not to introduce duplicates, but it would make Hudi more generic and easier to use for "SQL" people.

Having no precombine field is already possible for CoW, but then UPSERT and SQL UPDATE are not supported (users can still update records using INSERT in non-strict mode or MERGE INTO UPDATE). There's also a difference between CoW and MoR here: for MoR the precombine field is a hard requirement, while it's optional for CoW. (UPDATEs with no precombine field are also possible in Flink, for both CoW and MoR, but not in Spark.)

Would it make sense, then, to take inspiration from some DBMS systems (e.g. Synapse) and allow updates and upserts when no precombine field is specified?

Scenario: say duplicates were introduced with INSERT in non-strict mode (sketch 3 in the P.S.) and no precombine field is specified. Then we have two options:
option 1) on UPDATE/UPSERT, Hudi deduplicates the existing records; as there's no precombine field, it's expected that we don't know which records will be removed and which will be effectively updated and preserved in the table. (This can also be achieved today by providing the same value in the precombine field for all records.)
option 2) on UPDATE/UPSERT, Hudi deduplicates the existing records; as there's no precombine field, the record with the latest _hoodie_commit_time is preserved and updated, and the other records with the same PK are removed.

In both cases, deduplication on UPDATE/UPSERT becomes a hard rule, whether we use a precombine field or not.

Then, regarding MoR and merging records on read (I found this in the Hudi format spec): can it be done using only _hoodie_commit_time in the absence of a precombine field? If so, could the precombine field become completely optional for both MoR and CoW?

I'm of course looking at this more from the user perspective; it would be nice to know what is and what is not possible from the design and developer perspective.

Best Regards,
Daniel Kaźmirski
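P.S. A few Spark SQL sketches to make the above concrete. Table names and values are made up, and the DDL follows the Hudi Spark SQL quickstart as I understand it; please correct me if I got any of the configs wrong.

Sketch 1: SQL UPDATE is accepted once a precombine field is declared. Passing the payload class through tblproperties is my assumption about how point 1 (combine-on-write with DefaultHoodieRecordPayload) would be wired up in pure SQL:

    CREATE TABLE hudi_mor_tbl (
      id INT,
      name STRING,
      price DOUBLE,
      ts BIGINT
    ) USING hudi
    TBLPROPERTIES (
      type = 'mor',
      primaryKey = 'id',
      preCombineField = 'ts',
      -- combine against the *stored* record on write (point 1);
      -- assumption: hoodie.* write configs can be passed via tblproperties
      'hoodie.datasource.write.payload.class' =
        'org.apache.hudi.common.model.DefaultHoodieRecordPayload'
    );

    INSERT INTO hudi_mor_tbl VALUES (1, 'a1', 20.0, 1000);

    -- accepted, because preCombineField is set:
    UPDATE hudi_mor_tbl SET price = price * 2 WHERE id = 1;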
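Sketch 2: the inconsistency from point 3. On a CoW table created without a precombine field, UPDATE is rejected, while the equivalent MERGE INTO UPDATE goes through (the exact error text may differ by version):

    CREATE TABLE hudi_cow_tbl (
      id INT,
      name STRING,
      price DOUBLE
    ) USING hudi
    TBLPROPERTIES (type = 'cow', primaryKey = 'id');

    INSERT INTO hudi_cow_tbl VALUES (1, 'a1', 20.0);

    -- rejected: Spark SQL UPDATE demands a precombine field
    UPDATE hudi_cow_tbl SET price = 30.0 WHERE id = 1;

    -- accepted: the same logical update via MERGE INTO
    MERGE INTO hudi_cow_tbl t
    USING (SELECT 1 AS id, 'a1' AS name, 30.0 AS price) s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *;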
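Sketch 3: introducing duplicates with the non-strict insert mode, and the metadata column that option 2 would key on. hoodie.sql.insert.mode is the session config I know of for this (newer versions may have renamed it); the SELECT just shows that _hoodie_commit_time already distinguishes the duplicate rows:

    SET hoodie.sql.insert.mode = non-strict;

    -- both inserts land, so PK id=2 now has two rows:
    INSERT INTO hudi_cow_tbl VALUES (2, 'a2', 40.0);
    INSERT INTO hudi_cow_tbl VALUES (2, 'a2', 50.0);

    -- under option 2, the row with the latest _hoodie_commit_time
    -- would be the one preserved on the next UPDATE/UPSERT:
    SELECT _hoodie_commit_time, id, price
    FROM hudi_cow_tbl
    WHERE id = 2;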