kazdy commented on PR #7998: URL: https://github.com/apache/hudi/pull/7998#issuecomment-1441751764
There seems to be a bug with non-strict insert mode when using the Spark datasource: it can insert duplicates only in overwrite mode, or in append mode when data is written to the table for the first time. But if I insert in append mode a second time, it deduplicates the dataset as if it were operating in upsert mode.

```python
from pyspark.sql.functions import expr

# assumes an active SparkSession `spark` and a target table `path`

opt_insert = {
    'hoodie.table.name': 'huditbl',
    'hoodie.datasource.write.recordkey.field': 'keyid',
    'hoodie.datasource.write.table.name': 'huditbl',
    'hoodie.datasource.write.operation': 'insert',
    'hoodie.sql.insert.mode': 'non-strict',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
    'hoodie.combine.before.upsert': 'false',
    'hoodie.combine.before.insert': 'false',
    'hoodie.datasource.write.insert.drop.duplicates': 'false'
}

df = spark.range(0, 10).toDF("keyid") \
    .withColumn("age", expr("keyid + 1000"))

# first write: overwrite
df.write.format("hudi"). \
    options(**opt_insert). \
    mode("overwrite"). \
    save(path)

spark.read.format("hudi").load(path).count()  # returns 10

df = df.union(df)  # creates duplicates

# second write: append with the same insert options
df.write.format("hudi"). \
    options(**opt_insert). \
    mode("append"). \
    save(path)

spark.read.format("hudi").load(path).count()  # returns 10, but should return 20
```
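As a sanity check (not part of the original repro), grouping by the record key makes the deduplication visible directly; a minimal sketch, assuming the same `spark`, `path`, and `keyid` column as above:

```python
from pyspark.sql import functions as F

# Count rows per record key after the second (append) write.
# If non-strict insert mode honored duplicates, every keyid would
# appear twice; instead each appears once, i.e. the append behaved
# like an upsert.
(spark.read.format("hudi").load(path)
    .groupBy("keyid")
    .agg(F.count("*").alias("cnt"))
    .orderBy("keyid")
    .show())
```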