nsivabalan commented on code in PR #14052:
URL: https://github.com/apache/hudi/pull/14052#discussion_r2403477973
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:
##########
@@ -612,11 +612,11 @@ object DataSourceWriteOptions {
.markAdvanced()
.sinceVersion("0.14.0")
.withDocumentation("Sql write operation to use with INSERT_INTO spark sql command. This comes with 3 possible values, bulk_insert, " +
-  "insert and upsert. bulk_insert is generally meant for initial loads and is known to be performant compared to insert. But bulk_insert may not " +
-  "do small file management. If you prefer hudi to automatically manage small files, then you can go with \"insert\". There is no precombine " +
+  "insert and upsert. The default behavior is insert, which means that duplicates will be preserved when writing data. bulk_insert is generally meant for initial loads " +
Review Comment:
how about
```
Sql write operation to use with INSERT_INTO spark sql command. This comes with 3 possible values: bulk_insert, insert and upsert. \"bulk_insert\" is generally meant for initial loads and is known to be more performant than insert. \"insert\" is the default value for this config and, unlike bulk_insert, also does small file handling, but it will retain duplicates if any are ingested. If you use INSERT_INTO with a mutable dataset, you may have to set this config value to \"upsert\". With upsert, Hudi will merge multiple versions of a record, identified by the record key configuration, into one final record.
```
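To make the three values concrete, here is a rough Spark SQL sketch; the config key `hoodie.spark.sql.insert.into.operation` and the table names are assumptions for illustration, not taken from this diff:

```sql
-- Sketch: selecting the INSERT_INTO write operation via a session config.
-- Config key assumed: hoodie.spark.sql.insert.into.operation (since 0.14.0).

-- Default: "insert" does small file handling but retains duplicates.
SET hoodie.spark.sql.insert.into.operation = insert;
INSERT INTO hudi_tbl SELECT * FROM staging_tbl;

-- Initial load: "bulk_insert" is faster but may skip small file management.
SET hoodie.spark.sql.insert.into.operation = bulk_insert;
INSERT INTO hudi_tbl SELECT * FROM staging_tbl;

-- Mutable dataset: "upsert" merges versions of a record by record key.
SET hoodie.spark.sql.insert.into.operation = upsert;
INSERT INTO hudi_tbl SELECT * FROM staging_tbl;
```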
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]