Hi all,

I would like to bring up the topic of how the precombine field is used and
what its purpose is. I would also like to know what the plans for it are
going forward.

At first glance the precombine field looks like it's only used to
deduplicate records in the incoming batch.
But digging deeper, it can also be used to:
1. combine records not before, but on write, to decide whether to update an
existing record (e.g. with DefaultHoodieRecordPayload);
2. combine records on read for MoR tables, so that log and base files are
merged correctly;
3. act as a hard requirement for Spark SQL UPDATE, even though the user
can't introduce duplicates with this statement anyway (see the sketch
below).
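
To make [3] concrete, a rough spark-shell sketch (table and column names
are made up, and the exact property names/behaviour may differ between Hudi
versions):

spark.sql("""
  CREATE TABLE t_pc (id INT, name STRING, ts LONG)
  USING hudi
  TBLPROPERTIES (type = 'cow', primaryKey = 'id', preCombineField = 'ts')
""")
// UPDATE is accepted here; my understanding is that the same statement
// is rejected on a table created without preCombineField:
spark.sql("UPDATE t_pc SET name = 'x' WHERE id = 1")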

Regarding [3], there is an inconsistency: the precombine field is not
required for MERGE INTO ... UPDATE. Under the hood, the UPSERT is switched
to an INSERT in upsert mode to update the existing records.
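
For example (a sketch, assuming a hypothetical CoW table t_nopc(id, name)
created as above but without preCombineField):

// MERGE INTO ... UPDATE goes through without a precombine field:
spark.sql("""
  MERGE INTO t_nopc
  USING (SELECT 1 AS id, 'x' AS name) s
  ON t_nopc.id = s.id
  WHEN MATCHED THEN UPDATE SET t_nopc.name = s.name
""")
// ...while, as far as I can tell, a plain UPDATE on the same table is
// rejected because of the missing precombine field:
spark.sql("UPDATE t_nopc SET name = 'x' WHERE id = 1")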

I know that Hudi does a lot of work to ensure PK uniqueness across/within
partitions, and that there is a need to deduplicate records before writing,
or to deduplicate existing data if duplicates were introduced, e.g. when
using non-strict insert mode.
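
For reference, this is roughly how such duplicates can appear today (the
config name is from memory and may differ per version):

// Non-strict mode skips the upsert/dedup path, so two inserts with the
// same primary key end up as two rows:
spark.sql("SET hoodie.sql.insert.mode = non-strict")
spark.sql("INSERT INTO t_nopc VALUES (1, 'a')")
spark.sql("INSERT INTO t_nopc VALUES (1, 'b')")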

What should happen, then, when the user does not want to or cannot provide
a precombine field? It is then on the user not to introduce duplicates, but
it makes Hudi more generic and easier to use for "SQL" people.

Having no precombine field is already possible for CoW, but then UPSERT and
SQL UPDATE are not supported (users can still update records using INSERT
in non-strict mode or MERGE INTO ... UPDATE).
There is also a difference between CoW and MoR: for MoR the precombine
field is a hard requirement, while it is optional for CoW.
(UPDATEs with no precombine field are also possible in Flink for both CoW
and MoR, but not in Spark.)
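
On the DataFrame side, this is roughly the case I mean (option names from
memory; my understanding is that the upsert below is rejected, while a
plain insert of the same frame works):

import spark.implicits._
val df = Seq((1, "a")).toDF("id", "name")
df.write.format("hudi")
  .option("hoodie.table.name", "t_nopc")
  .option("hoodie.datasource.write.recordkey.field", "id")
  // intentionally no hoodie.datasource.write.precombine.field
  .option("hoodie.datasource.write.operation", "upsert")
  .mode("append")
  .save("/tmp/t_nopc")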

Would it make sense, then, to take inspiration from some DBMS systems (e.g.
Synapse) and allow updates and upserts when no precombine field is
specified?
Scenario:
Say that duplicates were introduced with INSERT in non-strict mode and no
precombine field is specified; then we have two options (sketched below):
option 1) on UPDATE/UPSERT Hudi deduplicates the existing records; as there
is no precombine field, it is expected that we don't know which records
will be removed and which will effectively be updated and preserved in the
table. (This can also be achieved today by always providing the same value
in the precombine field for all records.)
option 2) on UPDATE/UPSERT Hudi deduplicates the existing records; as there
is no precombine field, the record with the latest _hoodie_commit_time is
preserved and updated, and the other records with the same PK are removed.
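
In other words (a sketch of the proposed behaviour, not of what happens
today; continuing the t_nopc example from above):

// existing rows for id = 1, written without a precombine field:
//   (1, 'a') and (1, 'b')
spark.sql("UPDATE t_nopc SET name = 'c' WHERE id = 1")
// option 1) an arbitrary survivor is kept and updated        -> (1, 'c')
// option 2) the survivor is the row with the latest
//           _hoodie_commit_time, here (1, 'b'), then updated -> (1, 'c')
// in both options the remaining duplicates for id = 1 are removed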

In both cases, deduplication on UPDATE/UPSERT becomes a hard rule, whether
we use a precombine field or not.

Then, regarding MoR and merging records on read (I found this in the Hudi
format spec): can it be done using only _hoodie_commit_time in the absence
of a precombine field?
If so, could the precombine field become completely optional for both MoR
and CoW?
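
Something along these lines is what I have in mind for merging when there
is no precombine field (hypothetical types, just to illustrate the rule):

// For each record key, keep the version coming from the slice with the
// greatest _hoodie_commit_time (commit times sort lexicographically):
case class Record(key: String, commitTime: String, data: Map[String, Any])

def mergeOnRead(base: Record, logRecords: Seq[Record]): Record =
  (base +: logRecords).maxBy(_.commitTime)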

I'm of course looking at this more from the user perspective; it would be
nice to know what is and what is not possible from the design and developer
perspective.

Best Regards,
Daniel Kaźmirski
