[ https://issues.apache.org/jira/browse/HUDI-5257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Vexler closed HUDI-5257.
---------------------------------
    Resolution: Not A Problem

I think this was due to caching DataFrames incorrectly.

> Spark-Sql duplicates and re-uses record keys under certain configs and use cases
> --------------------------------------------------------------------------------
>
>                 Key: HUDI-5257
>                 URL: https://issues.apache.org/jira/browse/HUDI-5257
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: bootstrap, spark-sql
>            Reporter: Jonathan Vexler
>            Assignee: Jonathan Vexler
>            Priority: Major
>         Attachments: bad_data.txt
>
> On a new table with primary key _row_key, partitioned by partition_path, performing a bulk insert via:
> {code:java}
> insertDf.createOrReplaceTempView("insert_temp_table")
> spark.sql(s"set hoodie.datasource.write.operation=bulk_insert")
> spark.sql("set hoodie.sql.bulk.insert.enable=true")
> spark.sql("set hoodie.sql.insert.mode=non-strict")
> spark.sql(s"insert into $tableName select * from insert_temp_table")
> {code}
> yields the data in [^bad_data.txt]: multiple records have the same key even though they have different primary key values, and multiple files are written even though there are only 10 records.
> Changing hoodie.datasource.write.operation=bulk_insert to hoodie.datasource.write.operation=insert causes the data to be inserted correctly; I do not know whether bulk insert is still being used after this change.
>
> However, bulk insert with raw data, such as
> {code:java}
> spark.sql(s"""
>   | insert into $tableName values
>   | $values
>   |""".stripMargin
> )
> {code}
> where $values is something like
> {code:java}
> (1, 'a1', 10, 1000, "2021-01-05"),
> {code}
> works as expected with hoodie.datasource.write.operation=bulk_insert.
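Below is a minimal, self-contained sketch of the temp-view reproduction path described above. The table name, schema, and rows are hypothetical assumptions (the report does not include them); the DataFrame is explicitly cached and materialized before the write, since the resolution attributes the duplicated keys to incorrect DataFrame caching.

{code:java}
// Hypothetical reproduction sketch for the temp-view bulk insert path.
// Table name, schema, and rows are illustrative assumptions, not taken
// from the original report.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hudi-5257-repro")
  .master("local[2]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.sql.extensions",
    "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
  // Needed on Spark 3.2+ for Hudi Spark SQL DDL.
  .config("spark.sql.catalog.spark_catalog",
    "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
  .getOrCreate()
import spark.implicits._

val tableName = "hudi_5257_repro"
spark.sql(s"""
  create table $tableName (
    _row_key string,
    name string,
    price int,
    ts bigint,
    partition_path string
  ) using hudi
  tblproperties (primaryKey = '_row_key')
  partitioned by (partition_path)
""")

val insertDf = Seq(
  ("id1", "a1", 10, 1000L, "2021-01-05"),
  ("id2", "a2", 20, 1000L, "2021-01-06")
).toDF("_row_key", "name", "price", "ts", "partition_path")

// Materialize the source before registering it as a view. If the source
// were non-deterministic and re-evaluated per task, caching it first
// rules that out as the cause of the duplicated keys.
insertDf.cache()
insertDf.count()
insertDf.createOrReplaceTempView("insert_temp_table")

spark.sql("set hoodie.datasource.write.operation=bulk_insert")
spark.sql("set hoodie.sql.bulk.insert.enable=true")
spark.sql("set hoodie.sql.insert.mode=non-strict")
spark.sql(s"insert into $tableName select * from insert_temp_table")

// Every _row_key should appear exactly once; any rows returned here
// reproduce the duplicated keys from the report.
spark.sql(
  s"select _row_key, count(*) as cnt from $tableName group by _row_key having count(*) > 1"
).show()
{code}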
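For contrast, the literal VALUES form that the report says works as expected under bulk_insert can be run against the same hypothetical table and re-checked:

{code:java}
// Per the report, inserting literal values behaves correctly under
// bulk_insert. Rows are again illustrative; the same session and
// write settings as above are assumed.
spark.sql(s"""
  insert into $tableName values
  ('id3', 'a3', 30, 1000, '2021-01-07'),
  ('id4', 'a4', 40, 1000, '2021-01-08')
""")

// The duplicate check should still return no rows.
spark.sql(
  s"select _row_key, count(*) as cnt from $tableName group by _row_key having count(*) > 1"
).show()
{code}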