[GitHub] [hudi] danny0405 commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql

2023-07-12 Thread via GitHub


danny0405 commented on code in PR #9123:
URL: https://github.com/apache/hudi/pull/9123#discussion_r1261912747


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -1094,6 +1094,11 @@ object HoodieSparkSqlWriter {
 if (mergedParams.contains(PRECOMBINE_FIELD.key())) {
   mergedParams.put(HoodiePayloadProps.PAYLOAD_ORDERING_FIELD_PROP_KEY, mergedParams(PRECOMBINE_FIELD.key()))
 }
+if (mergedParams.get(OPERATION.key()).get == INSERT_OPERATION_OPT_VAL && mergedParams.contains(DataSourceWriteOptions.INSERT_DUP_POLICY.key())
+  && mergedParams.get(DataSourceWriteOptions.INSERT_DUP_POLICY.key()).get != FAIL_INSERT_DUP_POLICY) {
+  // enable merge allow duplicates when operation type is insert
+  mergedParams.put(HoodieWriteConfig.MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE.key(), "true")

Review Comment:
   > we do enable hoodie.merge.allow.duplicate.on.inserts
   
   Do you mean disable?
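   For context, the guard in the diff above only takes effect when a duplicate policy other than `fail` is explicitly configured. A simplified restatement of its logic (a sketch for clarity, not the PR's exact code):
   
   ```scala
   // Sketch: restating the guard above. The flag is enabled (set to "true") when
   // the operation is INSERT and a dup policy other than "fail" is configured.
   val op = mergedParams.get(OPERATION.key())
   val policy = mergedParams.get(DataSourceWriteOptions.INSERT_DUP_POLICY.key())
   if (op.contains(INSERT_OPERATION_OPT_VAL) && policy.exists(_ != FAIL_INSERT_DUP_POLICY)) {
     mergedParams.put(HoodieWriteConfig.MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE.key(), "true")
   }
   ```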



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql

2023-07-11 Thread via GitHub


danny0405 commented on code in PR #9123:
URL: https://github.com/apache/hudi/pull/9123#discussion_r1259267932


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -1094,6 +1094,11 @@ object HoodieSparkSqlWriter {
 if (mergedParams.contains(PRECOMBINE_FIELD.key())) {
   mergedParams.put(HoodiePayloadProps.PAYLOAD_ORDERING_FIELD_PROP_KEY, mergedParams(PRECOMBINE_FIELD.key()))
 }
+if (mergedParams.get(OPERATION.key()).get == INSERT_OPERATION_OPT_VAL && mergedParams.contains(DataSourceWriteOptions.INSERT_DUP_POLICY.key())
+  && mergedParams.get(DataSourceWriteOptions.INSERT_DUP_POLICY.key()).get != FAIL_INSERT_DUP_POLICY) {
+  // enable merge allow duplicates when operation type is insert
+  mergedParams.put(HoodieWriteConfig.MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE.key(), "true")

Review Comment:
   Yeah, it's better if we can keep the strategy consistent.






[GitHub] [hudi] danny0405 commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql

2023-07-09 Thread via GitHub


danny0405 commented on code in PR #9123:
URL: https://github.com/apache/hudi/pull/9123#discussion_r1257660429


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -1094,6 +1094,11 @@ object HoodieSparkSqlWriter {
 if (mergedParams.contains(PRECOMBINE_FIELD.key())) {
   mergedParams.put(HoodiePayloadProps.PAYLOAD_ORDERING_FIELD_PROP_KEY, mergedParams(PRECOMBINE_FIELD.key()))
 }
+if (mergedParams.get(OPERATION.key()).get == INSERT_OPERATION_OPT_VAL && mergedParams.contains(DataSourceWriteOptions.INSERT_DUP_POLICY.key())
+  && mergedParams.get(DataSourceWriteOptions.INSERT_DUP_POLICY.key()).get != FAIL_INSERT_DUP_POLICY) {
+  // enable merge allow duplicates when operation type is insert
+  mergedParams.put(HoodieWriteConfig.MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE.key(), "true")

Review Comment:
   Yeah, it's kind of obscure at first sight. Why not just set the default value of `hoodie.merge.allow.duplicate.on.inserts` to true?
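   If the default were flipped at the config definition rather than overridden per write, it might look roughly like this (a sketch assuming Hudi's usual `ConfigProperty` builder; this is not the actual `HoodieWriteConfig` definition):
   
   ```scala
   // Hypothetical sketch: default the flag to true at the definition site.
   val MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE: ConfigProperty[String] = ConfigProperty
     .key("hoodie.merge.allow.duplicate.on.inserts")
     .defaultValue("true") // suggested default, instead of "false"
     .withDocumentation("Allow duplicates on inserts (doc text abbreviated for this sketch).")
   ```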






[GitHub] [hudi] danny0405 commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql

2023-07-06 Thread via GitHub


danny0405 commented on code in PR #9123:
URL: https://github.com/apache/hudi/pull/9123#discussion_r1255177050


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -1094,6 +1094,11 @@ object HoodieSparkSqlWriter {
 if (mergedParams.contains(PRECOMBINE_FIELD.key())) {
   mergedParams.put(HoodiePayloadProps.PAYLOAD_ORDERING_FIELD_PROP_KEY, mergedParams(PRECOMBINE_FIELD.key()))
 }
+if (mergedParams.get(OPERATION.key()).get == INSERT_OPERATION_OPT_VAL && mergedParams.contains(DataSourceWriteOptions.INSERT_DUP_POLICY.key())
+  && mergedParams.get(DataSourceWriteOptions.INSERT_DUP_POLICY.key()).get != FAIL_INSERT_DUP_POLICY) {
+  // enable merge allow duplicates when operation type is insert
+  mergedParams.put(HoodieWriteConfig.MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE.key(), "true")

Review Comment:
   There is not much point in keeping backward compatibility if the behavior itself does not match users' intuition: most users of the `INSERT` operation do not need deduplication, and they do not want to specify a record key either.
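   To make the intuition concrete, a hypothetical Spark SQL session (table and column names invented for the example) where a plain `INSERT INTO` is expected to behave as it would in an RDBMS:
   
   ```scala
   // Hypothetical example: no record key is specified, and duplicates survive.
   spark.sql("CREATE TABLE events (id INT, ts BIGINT) USING hudi")
   spark.sql("INSERT INTO events VALUES (1, 100)")
   spark.sql("INSERT INTO events VALUES (1, 100)") // same row again
   // RDBMS intuition: both rows are kept, no deduplication is applied.
   assert(spark.sql("SELECT * FROM events").count() == 2)
   ```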






[GitHub] [hudi] danny0405 commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql

2023-07-04 Thread via GitHub


danny0405 commented on code in PR #9123:
URL: https://github.com/apache/hudi/pull/9123#discussion_r1252497920


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala:
##
@@ -112,6 +113,36 @@ trait ProvidesHoodieConfig extends Logging {
 }
   }
 
+  private def deducePayloadClassNameLegacy(operation: String, tableType: String, insertMode: InsertMode): String = {
+    if (operation == UPSERT_OPERATION_OPT_VAL &&
+      tableType == COW_TABLE_TYPE_OPT_VAL && insertMode == InsertMode.STRICT) {
+      // Validate duplicate key for COW, for MOR it will do the merge with the DefaultHoodieRecordPayload
+      // on reading.
+      // TODO use HoodieSparkValidateDuplicateKeyRecordMerger when SparkRecordMerger is default
+      classOf[ValidateDuplicateKeyPayload].getCanonicalName
+    } else if (operation == INSERT_OPERATION_OPT_VAL && tableType == COW_TABLE_TYPE_OPT_VAL &&
+      insertMode == InsertMode.STRICT) {
+      // Validate duplicate key for inserts to COW table when using strict insert mode.
+      classOf[ValidateDuplicateKeyPayload].getCanonicalName
+    } else {
+      classOf[OverwriteWithLatestAvroPayload].getCanonicalName
+    }

Review Comment:
   By default, should we use `DefaultHoodieRecordPayload` instead?
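   A sketch of that suggestion (an assumption about the intent, not code from the PR), with `DefaultHoodieRecordPayload` as the fallback:
   
   ```scala
   // Hypothetical variant of deducePayloadClassNameLegacy with a different default.
   private def deducePayloadClassName(operation: String, tableType: String, insertMode: InsertMode): String = {
     if (insertMode == InsertMode.STRICT && tableType == COW_TABLE_TYPE_OPT_VAL &&
       (operation == UPSERT_OPERATION_OPT_VAL || operation == INSERT_OPERATION_OPT_VAL)) {
       // Strict mode on COW still validates duplicate keys.
       classOf[ValidateDuplicateKeyPayload].getCanonicalName
     } else {
       // Suggested fallback: DefaultHoodieRecordPayload instead of OverwriteWithLatestAvroPayload.
       classOf[DefaultHoodieRecordPayload].getCanonicalName
     }
   }
   ```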






[GitHub] [hudi] danny0405 commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql

2023-07-04 Thread via GitHub


danny0405 commented on code in PR #9123:
URL: https://github.com/apache/hudi/pull/9123#discussion_r1252496143


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -1094,6 +1094,11 @@ object HoodieSparkSqlWriter {
 if (mergedParams.contains(PRECOMBINE_FIELD.key())) {
   mergedParams.put(HoodiePayloadProps.PAYLOAD_ORDERING_FIELD_PROP_KEY, mergedParams(PRECOMBINE_FIELD.key()))
 }
+if (mergedParams.get(OPERATION.key()).get == INSERT_OPERATION_OPT_VAL && mergedParams.contains(DataSourceWriteOptions.INSERT_DUP_POLICY.key())
+  && mergedParams.get(DataSourceWriteOptions.INSERT_DUP_POLICY.key()).get != FAIL_INSERT_DUP_POLICY) {
+  // enable merge allow duplicates when operation type is insert
+  mergedParams.put(HoodieWriteConfig.MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE.key(), "true")

Review Comment:
   I feel that, by default, we should never dedup for the INSERT operation. That keeps the behavior in line with a regular RDBMS.
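   What that default could look like at the same spot in `HoodieSparkSqlWriter` (a sketch of the approach, using `getOrElseUpdate` so explicit user settings still win; an assumption, not the PR's code):
   
   ```scala
   // Sketch: for INSERT, skip pre-write combining and allow duplicates by default,
   // mirroring RDBMS INSERT semantics. Explicit user-provided values take precedence.
   if (mergedParams.get(OPERATION.key()).contains(INSERT_OPERATION_OPT_VAL)) {
     mergedParams.getOrElseUpdate(HoodieWriteConfig.COMBINE_BEFORE_INSERT.key(), "false")
     mergedParams.getOrElseUpdate(HoodieWriteConfig.MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE.key(), "true")
   }
   ```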


