kazdy commented on PR #7998: URL: https://github.com/apache/hudi/pull/7998#issuecomment-1441751764
There seems to be a bug with non-strict insert mode when using the Spark datasource: it can insert duplicates only in overwrite mode, or in append mode when data is written to the table for the first time. But if I insert in append mode a second time, it deduplicates the dataset as if it were operating in upsert mode.

```python
from pyspark.sql.functions import expr

# assumes an active SparkSession `spark` and a target table `path`

opt_insert = {
    'hoodie.table.name': 'huditbl',
    'hoodie.datasource.write.recordkey.field': 'keyid',
    'hoodie.datasource.write.table.name': 'huditbl',
    'hoodie.datasource.write.operation': 'insert',
    'hoodie.sql.insert.mode': 'non-strict',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
    'hoodie.combine.before.upsert': 'false',
    'hoodie.combine.before.insert': 'false',
    'hoodie.datasource.write.insert.drop.duplicates': 'false'
}

df = spark.range(0, 10).toDF("keyid") \
    .withColumn("age", expr("keyid + 1000"))

# first write: overwrite
df.write.format("hudi"). \
    options(**opt_insert). \
    mode("overwrite"). \
    save(path)

spark.read.format("hudi").load(path).count()  # returns 10

df = df.union(df)  # creates duplicates

# second write: append with the same insert options
df.write.format("hudi"). \
    options(**opt_insert). \
    mode("append"). \
    save(path)

spark.read.format("hudi").load(path).count()  # returns 10, but should return 20
```
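As a sanity check (not part of the original repro), grouping by the record key makes the deduplication visible directly; a minimal sketch, assuming the same `spark`, `path`, and `keyid` column as above:

```python
from pyspark.sql import functions as F

# Count rows per record key after the second (append) write.
# If non-strict insert mode honored duplicates, every keyid would
# appear twice; instead each appears once, i.e. the append behaved
# like an upsert.
(spark.read.format("hudi").load(path)
    .groupBy("keyid")
    .agg(F.count("*").alias("cnt"))
    .orderBy("keyid")
    .show())
```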