[ https://issues.apache.org/jira/browse/HUDI-5257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Vexler closed HUDI-5257.
---------------------------------
    Resolution: Not A Problem

I think this was due to caching DataFrames incorrectly.

> Spark-Sql duplicates and re-uses record keys under certain configs and use cases
> --------------------------------------------------------------------------------
>
>                 Key: HUDI-5257
>                 URL: https://issues.apache.org/jira/browse/HUDI-5257
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: bootstrap, spark-sql
>            Reporter: Jonathan Vexler
>            Assignee: Jonathan Vexler
>            Priority: Major
>         Attachments: bad_data.txt
>
> On a new table with primary key _row_key, partitioned by partition_path, performing a bulk insert via:
> {code:java}
> insertDf.createOrReplaceTempView("insert_temp_table")
> spark.sql(s"set hoodie.datasource.write.operation=bulk_insert")
> spark.sql("set hoodie.sql.bulk.insert.enable=true")
> spark.sql("set hoodie.sql.insert.mode=non-strict")
> spark.sql(s"insert into $tableName select * from insert_temp_table")
> {code}
> yields the data in [^bad_data.txt]: multiple records have the same key even though they have different primary key values, and multiple files are written even though there are only 10 records.
> Changing hoodie.datasource.write.operation=bulk_insert to hoodie.datasource.write.operation=insert causes the data to be inserted correctly; I do not know whether bulk insert is still being used after this change.
>
> However, bulk insert with raw data, such as
> {code:java}
> spark.sql(s"""
>   | insert into $tableName values
>   | $values
>   |""".stripMargin
> )
> {code}
> where $values is something like
> {code:java}
> (1, 'a1', 10, 1000, "2021-01-05"),
> {code}
> works as expected with hoodie.datasource.write.operation=bulk_insert.
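Below is a minimal, self-contained sketch of the temp-view reproduction path described above. The table name, schema, and rows are hypothetical assumptions (the report does not include them); the DataFrame is explicitly cached and materialized before the write, since the resolution attributes the duplicated keys to incorrect DataFrame caching.

{code:java}
// Hypothetical reproduction sketch for the temp-view bulk insert path.
// Table name, schema, and rows are illustrative assumptions, not taken
// from the original report.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hudi-5257-repro")
  .master("local[2]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.sql.extensions",
    "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
  // Needed on Spark 3.2+ for Hudi Spark SQL DDL.
  .config("spark.sql.catalog.spark_catalog",
    "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
  .getOrCreate()
import spark.implicits._

val tableName = "hudi_5257_repro"
spark.sql(s"""
  create table $tableName (
    _row_key string,
    name string,
    price int,
    ts bigint,
    partition_path string
  ) using hudi
  tblproperties (primaryKey = '_row_key')
  partitioned by (partition_path)
""")

val insertDf = Seq(
  ("id1", "a1", 10, 1000L, "2021-01-05"),
  ("id2", "a2", 20, 1000L, "2021-01-06")
).toDF("_row_key", "name", "price", "ts", "partition_path")

// Materialize the source before registering it as a view. If the source
// were non-deterministic and re-evaluated per task, caching it first
// rules that out as the cause of the duplicated keys.
insertDf.cache()
insertDf.count()
insertDf.createOrReplaceTempView("insert_temp_table")

spark.sql("set hoodie.datasource.write.operation=bulk_insert")
spark.sql("set hoodie.sql.bulk.insert.enable=true")
spark.sql("set hoodie.sql.insert.mode=non-strict")
spark.sql(s"insert into $tableName select * from insert_temp_table")

// Every _row_key should appear exactly once; any rows returned here
// reproduce the duplicated keys from the report.
spark.sql(
  s"select _row_key, count(*) as cnt from $tableName group by _row_key having count(*) > 1"
).show()
{code}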
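For contrast, the literal VALUES form that the report says works as expected under bulk_insert can be run against the same hypothetical table and re-checked:

{code:java}
// Per the report, inserting literal values behaves correctly under
// bulk_insert. Rows are again illustrative; the same session and
// write settings as above are assumed.
spark.sql(s"""
  insert into $tableName values
  ('id3', 'a3', 30, 1000, '2021-01-07'),
  ('id4', 'a4', 40, 1000, '2021-01-08')
""")

// The duplicate check should still return no rows.
spark.sql(
  s"select _row_key, count(*) as cnt from $tableName group by _row_key having count(*) > 1"
).show()
{code}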