[GitHub] [hudi] danny0405 commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql
danny0405 commented on code in PR #9123:
URL: https://github.com/apache/hudi/pull/9123#discussion_r1261912747

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala ##

@@ -1094,6 +1094,11 @@ object HoodieSparkSqlWriter {
       if (mergedParams.contains(PRECOMBINE_FIELD.key())) {
         mergedParams.put(HoodiePayloadProps.PAYLOAD_ORDERING_FIELD_PROP_KEY, mergedParams(PRECOMBINE_FIELD.key()))
       }
+      if (mergedParams.get(OPERATION.key()).get == INSERT_OPERATION_OPT_VAL && mergedParams.contains(DataSourceWriteOptions.INSERT_DUP_POLICY.key())
+        && mergedParams.get(DataSourceWriteOptions.INSERT_DUP_POLICY.key()).get != FAIL_INSERT_DUP_POLICY) {
+        // enable merge allow duplicates when operation type is insert
+        mergedParams.put(HoodieWriteConfig.MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE.key(), "true")

Review Comment:
> we do enable hoodie.merge.allow.duplicate.on.inserts

Do you mean disable?
[GitHub] [hudi] danny0405 commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql
danny0405 commented on code in PR #9123:
URL: https://github.com/apache/hudi/pull/9123#discussion_r1259267932

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala ##

@@ -1094,6 +1094,11 @@ object HoodieSparkSqlWriter {
       if (mergedParams.contains(PRECOMBINE_FIELD.key())) {
         mergedParams.put(HoodiePayloadProps.PAYLOAD_ORDERING_FIELD_PROP_KEY, mergedParams(PRECOMBINE_FIELD.key()))
       }
+      if (mergedParams.get(OPERATION.key()).get == INSERT_OPERATION_OPT_VAL && mergedParams.contains(DataSourceWriteOptions.INSERT_DUP_POLICY.key())
+        && mergedParams.get(DataSourceWriteOptions.INSERT_DUP_POLICY.key()).get != FAIL_INSERT_DUP_POLICY) {
+        // enable merge allow duplicates when operation type is insert
+        mergedParams.put(HoodieWriteConfig.MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE.key(), "true")

Review Comment:
Yeah, it's better if we can keep the strategy consistent.
[GitHub] [hudi] danny0405 commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql
danny0405 commented on code in PR #9123:
URL: https://github.com/apache/hudi/pull/9123#discussion_r1257660429

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala ##

@@ -1094,6 +1094,11 @@ object HoodieSparkSqlWriter {
       if (mergedParams.contains(PRECOMBINE_FIELD.key())) {
         mergedParams.put(HoodiePayloadProps.PAYLOAD_ORDERING_FIELD_PROP_KEY, mergedParams(PRECOMBINE_FIELD.key()))
       }
+      if (mergedParams.get(OPERATION.key()).get == INSERT_OPERATION_OPT_VAL && mergedParams.contains(DataSourceWriteOptions.INSERT_DUP_POLICY.key())
+        && mergedParams.get(DataSourceWriteOptions.INSERT_DUP_POLICY.key()).get != FAIL_INSERT_DUP_POLICY) {
+        // enable merge allow duplicates when operation type is insert
+        mergedParams.put(HoodieWriteConfig.MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE.key(), "true")

Review Comment:
Yeah, it's kind of obscure at first sight; why not just make the default value of `hoodie.merge.allow.duplicate.on.inserts` true?
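For context, a rough sketch of the two approaches being weighed in this comment. The code is illustrative only (it reuses `mergedParams` and the option keys from the writer code quoted above, but is not taken from the PR): either keep the config's default and force it on conditionally, or flip the default and drop the special case.

```scala
// Option A (what the diff above does): leave the config default untouched and
// turn the flag on only for INSERT with a non-fail dup policy.
if (mergedParams.get(OPERATION.key()).contains(INSERT_OPERATION_OPT_VAL) &&
  mergedParams.contains(DataSourceWriteOptions.INSERT_DUP_POLICY.key()) &&
  !mergedParams.get(DataSourceWriteOptions.INSERT_DUP_POLICY.key()).contains(FAIL_INSERT_DUP_POLICY)) {
  mergedParams.put(HoodieWriteConfig.MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE.key(), "true")
}

// Option B (the reviewer's question): ship hoodie.merge.allow.duplicate.on.inserts
// with a default of "true", so no special-casing is needed in the writer, and
// users who still want inserts merged/deduped can explicitly set it to "false".
```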
[GitHub] [hudi] danny0405 commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql
danny0405 commented on code in PR #9123:
URL: https://github.com/apache/hudi/pull/9123#discussion_r1255177050

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala ##

@@ -1094,6 +1094,11 @@ object HoodieSparkSqlWriter {
       if (mergedParams.contains(PRECOMBINE_FIELD.key())) {
         mergedParams.put(HoodiePayloadProps.PAYLOAD_ORDERING_FIELD_PROP_KEY, mergedParams(PRECOMBINE_FIELD.key()))
       }
+      if (mergedParams.get(OPERATION.key()).get == INSERT_OPERATION_OPT_VAL && mergedParams.contains(DataSourceWriteOptions.INSERT_DUP_POLICY.key())
+        && mergedParams.get(DataSourceWriteOptions.INSERT_DUP_POLICY.key()).get != FAIL_INSERT_DUP_POLICY) {
+        // enable merge allow duplicates when operation type is insert
+        mergedParams.put(HoodieWriteConfig.MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE.key(), "true")

Review Comment:
There is not much point in keeping backwards compatibility if the behavior itself does not match users' intuition, because most users of the `INSERT` operation do not need deduplication and do not want to specify a record key either.
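A hedged example of the usage pattern this comment describes, assuming the simplified spark-sql flow targeted by the PR also covers tables created without a record key (table and column names are made up):

```scala
// Hypothetical: an append-style table created without a primaryKey or
// preCombineField; INSERT INTO should simply append the rows as given.
spark.sql(
  """CREATE TABLE hudi_events (event_id STRING, payload STRING, ts BIGINT)
    |USING hudi""".stripMargin)

spark.sql("INSERT INTO hudi_events VALUES ('e1', 'a', 1), ('e1', 'a', 1)")

// Under the behavior argued for in this thread, no deduplication is applied,
// so both identical rows are kept.
spark.sql("SELECT count(*) FROM hudi_events").show()  // expected: 2
```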
[GitHub] [hudi] danny0405 commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql
danny0405 commented on code in PR #9123:
URL: https://github.com/apache/hudi/pull/9123#discussion_r1252497920

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala ##

@@ -112,6 +113,36 @@ trait ProvidesHoodieConfig extends Logging {
     }
   }

+  private def deducePayloadClassNameLegacy(operation: String, tableType: String, insertMode: InsertMode): String = {
+    if (operation == UPSERT_OPERATION_OPT_VAL &&
+      tableType == COW_TABLE_TYPE_OPT_VAL && insertMode == InsertMode.STRICT) {
+      // Validate duplicate key for COW, for MOR it will do the merge with the DefaultHoodieRecordPayload
+      // on reading.
+      // TODO use HoodieSparkValidateDuplicateKeyRecordMerger when SparkRecordMerger is default
+      classOf[ValidateDuplicateKeyPayload].getCanonicalName
+    } else if (operation == INSERT_OPERATION_OPT_VAL && tableType == COW_TABLE_TYPE_OPT_VAL &&
+      insertMode == InsertMode.STRICT){
+      // Validate duplicate key for inserts to COW table when using strict insert mode.
+      classOf[ValidateDuplicateKeyPayload].getCanonicalName
+    } else {
+      classOf[OverwriteWithLatestAvroPayload].getCanonicalName
+    }

Review Comment:
By default, should we use `DefaultHoodieRecordPayload` instead?
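A minimal sketch of the alternative the reviewer is raising, assuming the same helper shape as in the diff above (this is not the change in the PR): keep the strict-mode duplicate-key validation, but fall back to `DefaultHoodieRecordPayload` rather than `OverwriteWithLatestAvroPayload` in the default branch.

```scala
// Hypothetical variant of deducePayloadClassNameLegacy illustrating the suggestion;
// only the fallback branch differs from the diff quoted above.
private def deducePayloadClassNameLegacy(operation: String, tableType: String, insertMode: InsertMode): String = {
  if ((operation == UPSERT_OPERATION_OPT_VAL || operation == INSERT_OPERATION_OPT_VAL) &&
    tableType == COW_TABLE_TYPE_OPT_VAL && insertMode == InsertMode.STRICT) {
    // Strict insert mode on COW still validates duplicate keys at write time.
    classOf[ValidateDuplicateKeyPayload].getCanonicalName
  } else {
    // Suggested default: DefaultHoodieRecordPayload, which honors the ordering /
    // precombine field when merging, instead of OverwriteWithLatestAvroPayload.
    classOf[DefaultHoodieRecordPayload].getCanonicalName
  }
}
```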
[GitHub] [hudi] danny0405 commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql
danny0405 commented on code in PR #9123:
URL: https://github.com/apache/hudi/pull/9123#discussion_r1252496143

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala ##

@@ -1094,6 +1094,11 @@ object HoodieSparkSqlWriter {
       if (mergedParams.contains(PRECOMBINE_FIELD.key())) {
         mergedParams.put(HoodiePayloadProps.PAYLOAD_ORDERING_FIELD_PROP_KEY, mergedParams(PRECOMBINE_FIELD.key()))
       }
+      if (mergedParams.get(OPERATION.key()).get == INSERT_OPERATION_OPT_VAL && mergedParams.contains(DataSourceWriteOptions.INSERT_DUP_POLICY.key())
+        && mergedParams.get(DataSourceWriteOptions.INSERT_DUP_POLICY.key()).get != FAIL_INSERT_DUP_POLICY) {
+        // enable merge allow duplicates when operation type is insert
+        mergedParams.put(HoodieWriteConfig.MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE.key(), "true")

Review Comment:
I feel that, by default, we should never dedup for the INSERT operation. That keeps the behavior in line with a regular RDBMS.
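A minimal Spark SQL sketch of the semantics being argued for here (table and column names are hypothetical, and the shown outcome is the proposed default, not a guarantee of current behavior): two plain `INSERT INTO` statements with the same key should produce two rows, just as they would in a regular RDBMS table without a unique constraint.

```scala
// Hypothetical illustration of the intended default INSERT semantics.
spark.sql(
  """CREATE TABLE hudi_orders (id INT, amount DOUBLE, ts BIGINT)
    |USING hudi
    |TBLPROPERTIES (primaryKey = 'id', preCombineField = 'ts')""".stripMargin)

spark.sql("INSERT INTO hudi_orders VALUES (1, 10.0, 100)")
spark.sql("INSERT INTO hudi_orders VALUES (1, 20.0, 200)")

// With no dedup on INSERT (the proposed default), both rows survive instead of
// being silently merged into the latest record for id = 1.
spark.sql("SELECT * FROM hudi_orders").show()  // expected: 2 rows for id = 1
```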