peanut-chenzhong opened a new issue #3735: URL: https://github.com/apache/hudi/issues/3735
My understanding is that when using `OverwriteNonDefaultsWithLatestAvroPayload`, Hudi updates column by column: if some columns in the upsert data are null, Hudi should skip those columns and update only the others. The current behavior does not match this.

**Steps to reproduce the behavior:**

1. Use spark-sql to create the test data:

```sql
create table test_payload (par1 int, par2 int, key int, col0 string, col1 double, col2 date, col3 timestamp);
insert into test_payload select 1,20,100,'bb',220.22,'2011-02-10','2011-01-10 01:11:20';
insert into test_payload select 1,10,100,'cc',null,null,'2011-01-10 01:11:00';
```

2. Insert the first row into Hudi using `OverwriteNonDefaultsWithLatestAvroPayload`:

```scala
val base_data = sql("select * from test_payload where col0='aa' or col0='bb' ;")
base_data.write.format("hudi").
  option("hoodie.datasource.write.table.type", COW_TABLE_TYPE_OPT_VAL).
  option("hoodie.datasource.write.precombine.field", "col3").
  option("hoodie.datasource.write.recordkey.field", "key").
  option("hoodie.datasource.write.partitionpath.field", "").
  option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.payload.class", "org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload").
  option("hoodie.upsert.shuffle.parallelism", 4).
  option("hoodie.datasource.write.hive_style_partitioning", "true").
  option("hoodie.table.name", "tb_test_payload").
  mode(Overwrite).
  save(s"/tmp/huditest/tb_test_payload")
```

3. Upsert the second row into Hudi using the same payload class:

```scala
upsert_data.write.format("hudi").
  option("hoodie.datasource.write.table.type", COW_TABLE_TYPE_OPT_VAL).
  option("hoodie.datasource.write.precombine.field", "col3").
  option("hoodie.datasource.write.recordkey.field", "key").
  option("hoodie.datasource.write.partitionpath.field", "").
  option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
  option("hoodie.datasource.write.payload.class", "org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.upsert.shuffle.parallelism", 4).
  option("hoodie.datasource.write.hive_style_partitioning", "true").
  option("hoodie.table.name", "tb_test_payload").
  mode(Append).
  save(s"/tmp/huditest/tb_test_payload")
```

4. Query the table:

```scala
spark.read.format("org.apache.hudi").load("/tmp/huditest/tb_test_payload/*").createOrReplaceTempView("hudi_ro_table")
spark.sql("select * from hudi_ro_table").show(30, false)
```

```
+-------------------+--------------------+------------------+----------------------+-----------------------------------------------------------------------+----+----+---+----+----+----+-------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                      |par1|par2|key|col0|col1|col2|col3               |
+-------------------+--------------------+------------------+----------------------+-----------------------------------------------------------------------+----+----+---+----+----+----+-------------------+
|20210930083222     |20210930083222_0_6  |100               |                      |191bf655-bc6c-4944-b7bb-1f00304c033e-0_0-190-316_20210930083222.parquet|1   |10  |100|cc  |null|null|2011-01-10 01:11:00|
+-------------------+--------------------+------------------+----------------------+-----------------------------------------------------------------------+----+----+---+----+----+----+-------------------+
```

You can see that the whole row has been updated, even though col1 and col2 are null in the upsert data.

**Expected behavior**

col1 and col2 should not be updated; they should keep the values already stored in the table.

**Environment Description**

* Hudi version : 0.9
* Spark version : 3.1.1
* Hive version : 3.1
* Hadoop version : 3.1.1
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker?
(yes/no) : no
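To make the expectation concrete, here is a minimal, self-contained sketch (not Hudi's actual API; `Row`, `merge`, and the column maps are illustrative) of the column-by-column semantics the reporter expects from `OverwriteNonDefaultsWithLatestAvroPayload`: for each column, take the incoming value unless it is null/default, in which case keep the stored value.

```scala
// Simplified model of the expected merge: null incoming columns should
// NOT overwrite the stored values. This is an illustration only, not
// the actual Hudi payload implementation.
object ExpectedMergeSketch {
  // A row is modeled as column name -> optional value; None stands for SQL null.
  type Row = Map[String, Option[Any]]

  // For every column of the stored row, prefer the incoming value,
  // falling back to the stored one when the incoming value is null.
  def merge(stored: Row, incoming: Row): Row =
    stored.map { case (col, storedVal) =>
      col -> incoming.getOrElse(col, None).orElse(storedVal)
    }

  def main(args: Array[String]): Unit = {
    // Values from the reproduction steps above (dates kept as strings for brevity).
    val stored: Row = Map(
      "key"  -> Some(100),
      "col0" -> Some("bb"),
      "col1" -> Some(220.22),
      "col2" -> Some("2011-02-10"),
      "col3" -> Some("2011-01-10 01:11:20"))
    val upsert: Row = Map(
      "key"  -> Some(100),
      "col0" -> Some("cc"),
      "col1" -> None, // null in the upsert batch
      "col2" -> None, // null in the upsert batch
      "col3" -> Some("2011-01-10 01:11:00"))

    val merged = merge(stored, upsert)
    println(merged("col0")) // Some(cc)        -- updated
    println(merged("col1")) // Some(220.22)    -- kept, since upsert value is null
    println(merged("col2")) // Some(2011-02-10) -- kept, since upsert value is null
  }
}
```

Under these semantics the queried row would show `cc, 220.22, 2011-02-10, 2011-01-10 01:11:00`, whereas the observed output above has nulls in col1 and col2.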