kazdy commented on PR #7998:
URL: https://github.com/apache/hudi/pull/7998#issuecomment-1441751764

   There seems to be a bug with non-strict insert mode.
   When using the Spark datasource, it can insert duplicates only in overwrite 
mode, or in append mode when data is written to the table for the first time. 
If I insert in append mode a second time, it deduplicates the dataset as if it 
were operating in upsert mode.
   
   ```python
   opt_insert = {
       'hoodie.table.name': 'huditbl',
       'hoodie.datasource.write.recordkey.field': 'keyid',
       'hoodie.datasource.write.table.name': 'huditbl',
       'hoodie.datasource.write.operation': 'insert',
       'hoodie.sql.insert.mode': 'non-strict',
       'hoodie.upsert.shuffle.parallelism': 2,
       'hoodie.insert.shuffle.parallelism': 2,
       'hoodie.combine.before.upsert': 'false',
       'hoodie.combine.before.insert': 'false',
       'hoodie.datasource.write.insert.drop.duplicates': 'false'
   }
   
   df = spark.range(0, 10).toDF("keyid") \
     .withColumn("age", expr("keyid + 1000"))
   
    df.write.format("hudi") \
        .options(**opt_insert) \
        .mode("overwrite") \
        .save(path)
   
   spark.read.format("hudi").load(path).count() # returns 10
   
   df = df.union(df) # creates duplicates
    df.write.format("hudi") \
        .options(**opt_insert) \
        .mode("append") \
        .save(path)
   
    spark.read.format("hudi").load(path).count() # returns 10 but should return 20
   ```
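   To make the expected versus observed behavior concrete, here is a plain-Python 
sketch (no Hudi or Spark involved; the dict-keyed-by-record-key collapse is only 
an analogy for upsert's combine-by-record-key step, not Hudi's actual code path):

   ```python
   # 10 records keyed by "keyid", mirroring the repro above
   records = [{"keyid": k, "age": k + 1000} for k in range(10)]
   batch = records + records  # the df.union(df) equivalent: 20 rows, every key duplicated

   # insert with combine.before.insert disabled should keep every row
   inserted = list(batch)
   print(len(inserted))  # 20 -- the count the second append should produce

   # upsert-like semantics collapse the batch to one row per record key
   upserted = {r["keyid"]: r for r in batch}
   print(len(upserted))  # 10 -- the count the second append actually produces
   ```

   In other words, the second append behaves as if the batch were run through 
the upsert-style collapse, even though combine-before-insert is disabled.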
   

