sivabalan narayanan created HUDI-6478:
-----------------------------------------

             Summary: Simplify INSERT_INTO configs
                 Key: HUDI-6478
                 URL: https://issues.apache.org/jira/browse/HUDI-6478
             Project: Apache Hudi
          Issue Type: Improvement
          Components: spark-sql
            Reporter: sivabalan narayanan


We currently have 2 to 3 different configs in the mix for the INSERT_INTO 
command. Let's try to simplify them.
 
The configs involved are hoodie.sql.insert.mode, drop dups, 
hoodie.sql.bulk.insert.enable and datasource.operation.type.
 
Rough notes:
 
hoodie.sql.bulk.insert.enable: true | false.
 
hoodie.sql.insert.mode: STRICT | NON_STRICT | UPSERT
STRICT: the same record cannot be re-ingested. An exception is thrown if the 
incoming batch contains records that already exist.
NON_STRICT: no such constraint. If bulk_insert is enabled, the insert mode has 
to be set to NON_STRICT; otherwise an exception is thrown.
UPSERT: the default insert mode (until a week ago, when we switched to making 
bulk_insert the default for INSERT_INTO). It takes care of de-dup and uses 
OverwriteWithLatestAvroPayload (which means that an existing record can be 
updated across batches).
 
datasource.operation.type: insert, bulk_insert, upsert
 
drop.dups: drop new incoming records if they already exist.
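 
The three insert modes can be sketched with a toy in-memory model. This is not 
Hudi code: the function name insert_into, the list-of-tuples table and the 
ValueError exceptions are all made up for illustration, and the bulk_insert 
check is only one reading of the NON_STRICT note above.

```python
from typing import List, Tuple

# Toy model only: a record is (record key, payload). Nothing here is a Hudi API.
Record = Tuple[str, str]


def insert_into(table: List[Record], batch: List[Record], mode: str,
                bulk_insert_enabled: bool = False) -> List[Record]:
    """Return the table contents after applying `batch` under `mode`."""
    mode = mode.upper()
    if mode not in {"STRICT", "NON_STRICT", "UPSERT"}:
        raise ValueError(f"unknown insert mode: {mode}")

    # Assumption drawn from the notes: bulk_insert is only valid with NON_STRICT.
    if bulk_insert_enabled and mode != "NON_STRICT":
        raise ValueError("bulk_insert requires NON_STRICT insert mode")

    existing_keys = {key for key, _ in table}

    if mode == "STRICT":
        # Re-ingesting a key that already exists is an error.
        dupes = sorted({key for key, _ in batch if key in existing_keys})
        if dupes:
            raise ValueError(f"records already exist for keys: {dupes}")
        return table + batch

    if mode == "NON_STRICT":
        # No constraint: duplicates are simply appended.
        return table + batch

    # UPSERT: the latest payload per key wins, similar in spirit to
    # OverwriteWithLatestAvroPayload.
    merged = dict(table)
    merged.update(batch)
    return list(merged.items())
```

With a table holding key "k1", re-inserting "k1" fails under STRICT, yields two 
rows under NON_STRICT, and overwrites the payload under UPSERT.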
 
Proposal:
 
 * We will introduce a new config named "hoodie.sql.write.operation" which will 
have 3 values ("insert", "bulk_insert" and "upsert"). Default value will be 
"insert" for INSERT_INTO.
 ** Deprecate hoodie.sql.insert.mode and "hoodie.sql.bulk.insert.enable".
 * Also, enable "hoodie.merge.allow.duplicate.on.inserts" = true if the operation 
type is "insert", for both spark-sql and spark-ds. This will retain duplicates 
but still help with small file management for "insert"s.
 * Introduce a new config named "hoodie.datasource.insert.dedupe.policy" whose 
valid values are "ignore", "fail" and "drop", with "ignore" as the default. 
"fail" will mimic the "STRICT" mode we support as of now. Even spark-ds users 
can use the fail/STRICT behavior if need be.
 ** Deprecate hoodie.datasource.insert.drop.dups.
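 
A rough sketch of how the proposed configs could resolve, again as a toy model 
rather than Hudi code. The config keys and defaults are the ones proposed above; 
the helper names resolve_proposed_configs and apply_dedupe_policy are 
hypothetical, and the "drop" behavior is assumed to match the existing drop.dups 
semantics (incoming records whose key already exists are discarded).

```python
from typing import Dict, List, Set, Tuple

Record = Tuple[str, str]  # (record key, payload); toy model only

VALID_OPERATIONS = ("insert", "bulk_insert", "upsert")
VALID_DEDUPE_POLICIES = ("ignore", "fail", "drop")


def resolve_proposed_configs(options: Dict[str, str]) -> Dict[str, str]:
    """Fill in the proposed defaults and derived settings (hypothetical helper)."""
    operation = options.get("hoodie.sql.write.operation", "insert").lower()
    policy = options.get("hoodie.datasource.insert.dedupe.policy", "ignore").lower()
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"invalid write operation: {operation}")
    if policy not in VALID_DEDUPE_POLICIES:
        raise ValueError(f"invalid dedupe policy: {policy}")

    resolved = {
        "hoodie.sql.write.operation": operation,
        "hoodie.datasource.insert.dedupe.policy": policy,
    }
    # Per the proposal: allow duplicates on "insert" so small file handling
    # still applies while duplicates are kept.
    if operation == "insert":
        resolved["hoodie.merge.allow.duplicate.on.inserts"] = "true"
    return resolved


def apply_dedupe_policy(existing_keys: Set[str], batch: List[Record],
                        policy: str) -> List[Record]:
    """Return the records that would be written under the given policy."""
    if policy == "ignore":
        return list(batch)
    dupes = sorted({key for key, _ in batch if key in existing_keys})
    if policy == "fail":
        # Mimics the current STRICT mode.
        if dupes:
            raise ValueError(f"records already exist for keys: {dupes}")
        return list(batch)
    if policy == "drop":
        # Assumed to match drop.dups: discard records whose key already exists.
        return [record for record in batch if record[0] not in existing_keys]
    raise ValueError(f"invalid dedupe policy: {policy}")
```

With no options set, this resolves to operation "insert", policy "ignore", and 
duplicates allowed on inserts, which is the default behavior the proposal 
describes for INSERT_INTO.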



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
