nsivabalan commented on code in PR #14052:
URL: https://github.com/apache/hudi/pull/14052#discussion_r2403477973


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:
##########
@@ -612,11 +612,11 @@ object DataSourceWriteOptions {
     .markAdvanced()
     .sinceVersion("0.14.0")
     .withDocumentation("Sql write operation to use with INSERT_INTO spark sql command. This comes with 3 possible values, bulk_insert, " +
-      "insert and upsert. bulk_insert is generally meant for initial loads and is known to be performant compared to insert. But bulk_insert may not " +
-      "do small file management. If you prefer hudi to automatically manage small files, then you can go with \"insert\". There is no precombine " +
+      "insert and upsert. The default behavior is insert, which means that duplicates will be preserved when writing data. bulk_insert is generally meant for initial loads " +

Review Comment:
   how about 
   ```
   Sql write operation to use with the INSERT_INTO spark sql command. This comes with 3 possible values: bulk_insert, insert and upsert. "bulk_insert" is generally meant for initial loads and is known to be more performant than insert. "insert" is the default value for this config and does small file handling, which bulk_insert does not, but will retain duplicates if they are ingested. If you use INSERT_INTO with a mutable dataset, you may have to set this config value to "upsert". With upsert, Hudi will merge multiple versions of the record, identified by the record key configuration, into one final record.
   ```
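
   To make the three modes concrete, here is a minimal Spark SQL sketch. It assumes the config key is `hoodie.spark.sql.insert.into.operation` (the key this option registers under, per its 0.14.0 introduction); the table names are hypothetical.
   ```sql
   -- Sketch: selecting the INSERT_INTO behavior via a session config.
   -- Assumed config key: hoodie.spark.sql.insert.into.operation

   -- Default: plain insert; does small file handling but retains duplicates.
   SET hoodie.spark.sql.insert.into.operation = insert;

   -- Initial/bulk loads: fastest path, may skip small file management.
   SET hoodie.spark.sql.insert.into.operation = bulk_insert;

   -- Mutable datasets: merge record versions by the configured record key.
   SET hoodie.spark.sql.insert.into.operation = upsert;

   -- Hypothetical tables, for illustration only.
   INSERT INTO hudi_table SELECT * FROM staging_table;
   ```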



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
