Yao Zhang created HUDI-4765:
-------------------------------

             Summary: Compared with inserting data via spark-shell, inserting via spark-sql uses different _hoodie_record_key generation logic, which might affect data upsert
                 Key: HUDI-4765
                 URL: https://issues.apache.org/jira/browse/HUDI-4765
             Project: Apache Hudi
          Issue Type: Bug
          Components: spark, spark-sql
    Affects Versions: 0.11.1
         Environment: Spark 3.1.1
                      Hudi 0.11.1
            Reporter: Yao Zhang

Create a table using spark-sql:
{code:java}
create table hudi_mor_tbl (
  id int,
  name string,
  price double,
  ts bigint
) using hudi
tblproperties (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts'
)
location 'hdfs:///hudi/hudi_mor_tbl';
{code}
Then insert data via spark-shell and spark-sql respectively:
{code:java}
import org.apache.spark.sql._
import org.apache.spark.sql.types._
// The imports below supply the option constants and Append used in the write call
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

val fields = Array(
  StructField("id", IntegerType, true),
  StructField("name", StringType, true),
  StructField("price", DoubleType, true),
  StructField("ts", LongType, true)
)
val simpleSchema = StructType(fields)
val data = Seq(Row(2, "a2", 200.0, 100L))
val df = spark.createDataFrame(data, simpleSchema)
df.write.format("hudi").
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "id").
  option(TABLE_NAME, "hudi_mor_tbl").
  option(TABLE_TYPE_OPT_KEY, "MERGE_ON_READ").
  mode(Append).
  save("hdfs:///hudi/hudi_mor_tbl")
{code}
{code:java}
insert into hudi_mor_tbl select 1, 'a1', 20, 1000;
{code}
Querying the table afterwards returns the following two rows:
{code:java}
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+----+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|price|  ts|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+----+
|  20220902012710792|20220902012710792...|                 2|                      |c3eff8c8-fa47-48c...|  2|  a2|200.0| 100|
|  20220902012813658|20220902012813658...|              id:1|                      |c3eff8c8-fa47-48c...|  1|  a1| 20.0|1000|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+----+
{code}
The '_hoodie_record_key' value for the row inserted via spark-sql is 'id:1', while that for the row inserted via spark-shell is '2'.

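To make the mismatch concrete, here is a minimal Scala sketch that only mimics the two key shapes seen in the query output above. The object name `RecordKeyShapes` and the helpers `bareKey` and `prefixedKey` are hypothetical names introduced for illustration; they are not Hudi APIs and do not claim to reproduce Hudi's internal key generators.

```scala
// Hypothetical helpers that mimic the two '_hoodie_record_key' shapes
// observed in the query output. These are NOT Hudi APIs.
object RecordKeyShapes {
  // Shape observed for the row written through the DataFrame writer:
  // the bare value of the record key field.
  def bareKey(value: Any): String = value.toString

  // Shape observed for the row written through spark-sql:
  // "<field name>:<field value>".
  def prefixedKey(field: String, value: Any): String = s"$field:$value"

  def main(args: Array[String]): Unit = {
    // The same logical id rendered in both shapes.
    val fromShell = bareKey(1)
    val fromSql = prefixedKey("id", 1)
    println(s"spark-shell style key: $fromShell")
    println(s"spark-sql style key:   $fromSql")
    // The strings differ, so a match on the record key cannot pair them.
    println(s"keys equal: ${fromShell == fromSql}")
  }
}
```

Because the two strings differ for the same id, any lookup that matches rows on the record key string would treat them as distinct records.
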
It seems that spark-sql constructs the '_hoodie_record_key' field as '[primaryKey_field_name]:[primaryKey_field_value]', whereas spark-shell writes only the field value. As a result, if we insert a row via spark-sql and then upsert the same row via spark-shell, we end up with two duplicate rows, which is not what we expect.

Did I miss a configuration that would lead to this behavior? If not, I think the default record key generation logic should be made consistent between the two write paths.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)