pengzhiwei2018 commented on a change in pull request #3328: URL: https://github.com/apache/hudi/pull/3328#discussion_r674875373
########## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/HoodieOptionConfig.scala ##########

@@ -172,6 +178,15 @@ object HoodieOptionConfig {
     params.get(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY.key)
   }

+  /**
+   * Whether to enable bulk insert for a SQL insert statement when the table has no primaryKey.
+   */
+  def enableBulkInsert(options: Map[String, String]): Boolean = {

Review comment:
   I saw that ENABLE_ROW_WRITER_OPT_KEY is currently only used for bulk insert, so I reused this config.

########## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala ##########

@@ -159,7 +159,10 @@ object HoodieSparkSqlWriter {
     // Convert to RDD[HoodieRecord]
     val genericRecords: RDD[GenericRecord] = HoodieSparkUtils.createRdd(df, schema, structName, nameSpace)
-    val shouldCombine = parameters(INSERT_DROP_DUPS_OPT_KEY.key()).toBoolean || operation.equals(WriteOperationType.UPSERT);
+    val shouldCombine = parameters(INSERT_DROP_DUPS_OPT_KEY.key()).toBoolean ||
+      operation.equals(WriteOperationType.UPSERT) ||
+      parameters.getOrElse(HoodieWriteConfig.COMBINE_BEFORE_INSERT_PROP.key(),

Review comment:
   Yes, if COMBINE_BEFORE_INSERT_PROP is enabled for an insert, the precombine field value has not yet been computed, which produces incorrect results for inserts containing duplicate records.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
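The `shouldCombine` change above can be sketched as a standalone predicate. This is a simplified illustration, not Hudi's actual code: the config-key strings and the `Operation` stand-in below are assumptions made for the sketch, in place of `INSERT_DROP_DUPS_OPT_KEY`, `HoodieWriteConfig.COMBINE_BEFORE_INSERT_PROP`, and `WriteOperationType`.

```scala
// Minimal sketch of the combine decision discussed in the review.
// Records are precombined when drop-duplicates is on, the operation is
// an upsert, or combine-before-insert is enabled for an insert.
object ShouldCombineSketch {
  // Hypothetical stand-ins for Hudi's config keys.
  val InsertDropDupsKey = "hoodie.datasource.write.insert.drop.duplicates"
  val CombineBeforeInsertKey = "hoodie.combine.before.insert"

  sealed trait Operation
  case object Insert extends Operation
  case object Upsert extends Operation

  def shouldCombine(parameters: Map[String, String], operation: Operation): Boolean =
    parameters.getOrElse(InsertDropDupsKey, "false").toBoolean ||
      operation == Upsert ||
      (parameters.getOrElse(CombineBeforeInsertKey, "false").toBoolean &&
        operation == Insert)
}
```

Under this sketch, a plain insert with no options set skips combining, while enabling combine-before-insert forces the precombine field to be computed so duplicate records can be deduplicated correctly.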