sivabalan narayanan created HUDI-6478:
-----------------------------------------
             Summary: Simplify INSERT_INTO configs
                 Key: HUDI-6478
                 URL: https://issues.apache.org/jira/browse/HUDI-6478
             Project: Apache Hudi
          Issue Type: Improvement
          Components: spark-sql
            Reporter: sivabalan narayanan


We have 2 to 3 different configs in the mix for the INSERT_INTO command: hoodie.sql.insert.mode, drop dups, hoodie.sql.bulk.insert.enable and datasource.operation.type. Let's try to simplify them.

Rough notes on the current configs (a usage sketch follows the proposal below):

hoodie.sql.bulk.insert.enable: true | false.

hoodie.sql.insert.mode: STRICT | NON_STRICT | UPSERT
* STRICT: we can't re-ingest the same record again; an exception is thrown if the incoming batch contains records that were already ingested.
* NON_STRICT: no such constraint, but it has to be set along with bulk_insert if bulk_insert is enabled; if not, an exception is thrown.
* UPSERT: the default insert mode (until a week back, when we switched to make bulk_insert the default for INSERT_INTO). It takes care of de-duplication and uses OverwriteWithLatestAvroPayload (which means we can update an existing record across batches).

datasource.operation.type: insert, bulk_insert, upsert.

drop.dups: drop new incoming records if they already exist in the table.

Proposal:
* Introduce a new config named "hoodie.sql.write.operation" which will have 3 values ("insert", "bulk_insert" and "upsert"). The default value will be "insert" for INSERT_INTO.
** Deprecate hoodie.sql.insert.mode and "hoodie.sql.bulk.insert.enable".
* Also, enable "hoodie.merge.allow.duplicate.on.inserts" = true if the operation type is "insert", for both spark-sql and spark-ds. This will retain duplicates but still help with small file management for "insert"s.
* Introduce a new config named "hoodie.datasource.insert.dedupe.policy" whose valid values are "ignore", "fail" and "drop", with "ignore" as the default. "fail" will mimic the "STRICT" mode we support as of now, so even spark-ds users can get the fail/STRICT behavior if need be.
** Deprecate hoodie.datasource.insert.drop.dups.
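
To illustrate the current behavior, here is a rough spark-sql sketch. The table name hudi_tbl and the sample values are made up for illustration; actual behavior also depends on the table's record key and precombine field.

    -- current configs (illustrative only)
    set hoodie.sql.bulk.insert.enable = true;
    set hoodie.sql.insert.mode = non-strict;    -- has to accompany bulk insert, otherwise an exception is thrown
    insert into hudi_tbl values (1, 'a1', 10.0, '2023-06-30');

    set hoodie.sql.bulk.insert.enable = false;
    set hoodie.sql.insert.mode = strict;        -- fails if an incoming record key was already ingested
    insert into hudi_tbl values (1, 'a1_new', 20.0, '2023-06-30');   -- expected to throw: key 1 already exists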
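And a sketch of how the proposed configs could look once implemented. These keys and values are only the proposal above, not shipped configs, and the table/values are again made up.

    -- proposed configs (hypothetical, per the proposal above)
    set hoodie.sql.write.operation = bulk_insert;          -- replaces hoodie.sql.insert.mode + hoodie.sql.bulk.insert.enable
    set hoodie.datasource.insert.dedupe.policy = fail;     -- mimics today's STRICT mode; "ignore" is the proposed default
    insert into hudi_tbl values (2, 'b1', 30.0, '2023-06-30');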