xushiyan commented on code in PR #8697:
URL: https://github.com/apache/hudi/pull/8697#discussion_r1281457455
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##########
@@ -429,6 +416,40 @@ object HoodieSparkSqlWriter {
     }
   }

+  def deduceOperation(hoodieConfig: HoodieConfig, paramsWithoutDefaults: Map[String, String]): WriteOperationType = {
+    var operation = WriteOperationType.fromValue(hoodieConfig.getString(OPERATION))
+    // TODO clean up
+    // It does not make sense to allow the upsert() operation if INSERT_DROP_DUPS is true.
+    // Auto-correct the operation to "insert" if OPERATION is wrongly set to "upsert"
+    // or not set (in which case it will be set as "upsert" by parametersWithWriteDefaults()).
+    if (hoodieConfig.getBoolean(INSERT_DROP_DUPS) &&
+      operation == WriteOperationType.UPSERT) {
+
+      log.warn(s"$UPSERT_OPERATION_OPT_VAL is not applicable " +
+        s"when $INSERT_DROP_DUPS is set to be true, " +
+        s"overriding the $OPERATION to be $INSERT_OPERATION_OPT_VAL")
+
+      operation = WriteOperationType.INSERT
+      operation
+    } else {
+      // if there is no record key and no preCombine, we should treat it as an append-only
+      // workload and make bulk_insert the operation type.
+      if (!paramsWithoutDefaults.containsKey(DataSourceWriteOptions.RECORDKEY_FIELD.key())
+        && !paramsWithoutDefaults.containsKey(DataSourceWriteOptions.PRECOMBINE_FIELD.key())
+        && !paramsWithoutDefaults.containsKey(OPERATION.key())) {
+        log.warn(s"Choosing BULK_INSERT as the operation type since auto record key generation is applicable")
+        operation = WriteOperationType.BULK_INSERT
+      }
+      // if no record key is set, switch the default operation to INSERT (auto record key gen)
+      else if (!hoodieConfig.contains(DataSourceWriteOptions.RECORDKEY_FIELD.key())
+        && !paramsWithoutDefaults.containsKey(OPERATION.key())) {
+        log.warn(s"Choosing INSERT as the operation type since auto record key generation is applicable")
+        operation = WriteOperationType.INSERT

Review Comment:
   I meant "enforced" was conditioned on "when auto key gen is applicable".
   My question was more about the choice of "INSERT" vs "BULK_INSERT": both have the same semantics, but here it chooses INSERT when users set the precombine field and BULK_INSERT when they do not, and I don't see the rationale behind that. Shouldn't the choice between those two be based on some file-sizing flag rather than on whether a precombine field is set?

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
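[Editor's note] The precedence being discussed in the diff can be distilled into a small standalone sketch. All names below (`DeduceOperationSketch`, `deduce`, the boolean parameters) are hypothetical and illustrative only, not Hudi's actual API; the branch order mirrors the `deduceOperation` method quoted above under the assumption that "explicit operation set" corresponds to `OPERATION` being present in `paramsWithoutDefaults`.

```scala
// Hypothetical, self-contained sketch of the operation-deduction precedence
// discussed in the review; not Hudi's actual API.
object DeduceOperationSketch {
  sealed trait Op
  case object Upsert extends Op
  case object Insert extends Op
  case object BulkInsert extends Op

  def deduce(explicitOp: Option[Op],
             insertDropDups: Boolean,
             recordKeySet: Boolean,
             precombineSet: Boolean): Op = {
    // default operation is upsert when none is set explicitly
    val op = explicitOp.getOrElse(Upsert)
    if (insertDropDups && op == Upsert) {
      // upsert plus drop-duplicates is contradictory: fall back to insert
      Insert
    } else if (explicitOp.isEmpty && !recordKeySet && !precombineSet) {
      // append-only workload (no key, no precombine): bulk_insert
      BulkInsert
    } else if (explicitOp.isEmpty && !recordKeySet) {
      // auto key generation with a precombine field set: insert
      Insert
    } else op
  }
}
```

The sketch makes the reviewer's point visible: the only difference between the BULK_INSERT and INSERT branches is whether `precombineSet` is true, which is orthogonal to file sizing.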