peanut-chenzhong opened a new issue #3735:
URL: https://github.com/apache/hudi/issues/3735


   As I understand it, when using OverwriteNonDefaultsWithLatestAvroPayload, 
Hudi should update column by column: if the incoming upsert record has columns 
that are null, Hudi should ignore those columns and only update the others. 
But the current behavior does not seem correct.
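   To make the expected semantics concrete, here is a minimal, self-contained sketch of the field-by-field merge I would expect from this payload. It is not the actual Hudi implementation: plain `Map`s stand in for Avro `GenericRecord`s, and the class/method names (`OverwriteNonDefaultsSketch`, `merge`) are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of the merge semantics expected from
// OverwriteNonDefaultsWithLatestAvroPayload (not Hudi's real code).
public class OverwriteNonDefaultsSketch {

    // Start from the current (stored) record, then overwrite each field
    // with the incoming value ONLY when that value is non-null; null
    // (the default for nullable fields) leaves the stored value intact.
    static Map<String, Object> merge(Map<String, Object> current,
                                     Map<String, Object> incoming) {
        Map<String, Object> merged = new LinkedHashMap<>(current);
        for (Map.Entry<String, Object> e : incoming.entrySet()) {
            if (e.getValue() != null) {
                merged.put(e.getKey(), e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // The first row written to the table (step 2 below).
        Map<String, Object> current = new LinkedHashMap<>();
        current.put("par1", 1); current.put("par2", 20); current.put("key", 100);
        current.put("col0", "bb"); current.put("col1", 220.22);
        current.put("col2", "2011-02-10"); current.put("col3", "2011-01-10 01:11:20");

        // The second row upserted afterwards (step 3 below), with null col1/col2.
        Map<String, Object> incoming = new LinkedHashMap<>();
        incoming.put("par1", 1); incoming.put("par2", 10); incoming.put("key", 100);
        incoming.put("col0", "cc"); incoming.put("col1", null);
        incoming.put("col2", null); incoming.put("col3", "2011-01-10 01:11:00");

        Map<String, Object> merged = merge(current, incoming);
        // col1/col2 keep the stored values; all other columns are overwritten.
        System.out.println(merged.get("col1")); // 220.22
        System.out.println(merged.get("col2")); // 2011-02-10
        System.out.println(merged.get("col0")); // cc
    }
}
```

   Under these semantics, the merged row for key 100 would keep col1=220.22 and col2=2011-02-10, which is what the repro below shows Hudi 0.9 does not do.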
   
   Steps to reproduce the behavior:
   
   1. Use spark-sql to initialize the test data
   create table test_payload (par1 int, par2 int, key int, col0 string, col1 double, col2 date, col3 timestamp);
   insert into test_payload select 1,20,100,'bb',220.22,'2011-02-10','2011-01-10 01:11:20';
   insert into test_payload select 1,10,100,'cc',null,null,'2011-01-10 01:11:00';
   
   2. Insert the first row into Hudi using OverwriteNonDefaultsWithLatestAvroPayload
   val base_data = sql("select * from test_payload where col0='aa' or col0='bb' ;")
   base_data.write.format("hudi").
     option("hoodie.datasource.write.table.type", COW_TABLE_TYPE_OPT_VAL).
     option("hoodie.datasource.write.precombine.field", "col3").
     option("hoodie.datasource.write.recordkey.field", "key").
     option("hoodie.datasource.write.partitionpath.field", "").
     option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
     option("hoodie.datasource.write.operation", "upsert").
     option("hoodie.datasource.write.payload.class", "org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload").
     option("hoodie.upsert.shuffle.parallelism", 4).
     option("hoodie.datasource.write.hive_style_partitioning", "true").
     option("hoodie.table.name", "tb_test_payload").
     mode(Overwrite).save(s"/tmp/huditest/tb_test_payload")
   
   3. Upsert the second row into Hudi using OverwriteNonDefaultsWithLatestAvroPayload
   val upsert_data = sql("select * from test_payload where col0='cc' ;") // select the second row
   upsert_data.write.format("hudi").
     option("hoodie.datasource.write.table.type", COW_TABLE_TYPE_OPT_VAL).
     option("hoodie.datasource.write.precombine.field", "col3").
     option("hoodie.datasource.write.recordkey.field", "key").
     option("hoodie.datasource.write.partitionpath.field", "").
     option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
     option("hoodie.datasource.write.payload.class", "org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload").
     option("hoodie.datasource.write.operation", "upsert").
     option("hoodie.upsert.shuffle.parallelism", 4).
     option("hoodie.datasource.write.hive_style_partitioning", "true").
     option("hoodie.table.name", "tb_test_payload").
     mode(Append).save(s"/tmp/huditest/tb_test_payload")
   
   4. Query the table
   spark.read.format("org.apache.hudi").load("/tmp/huditest/tb_test_payload/*").createOrReplaceTempView("hudi_ro_table")
   spark.sql("select * from hudi_ro_table").show(30, false)
   
   +-------------------+--------------------+------------------+----------------------+-----------------------------------------------------------------------+----+----+---+----+----+----+-------------------+
   |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                      |par1|par2|key|col0|col1|col2|col3               |
   +-------------------+--------------------+------------------+----------------------+-----------------------------------------------------------------------+----+----+---+----+----+----+-------------------+
   |20210930083222     |20210930083222_0_6  |100               |                      |191bf655-bc6c-4944-b7bb-1f00304c033e-0_0-190-316_20210930083222.parquet|1   |10  |100|cc  |null|null|2011-01-10 01:11:00|
   +-------------------+--------------------+------------------+----------------------+-----------------------------------------------------------------------+----+----+---+----+----+----+-------------------+
   
   You can see the whole row has been updated, even though col1 and col2 are null in the incoming record.
   
   **Expected behavior**
   
   The expected behavior is that col1 and col2 should not be updated: they should keep the values from the first row (220.22 and 2011-02-10).
   
   
   **Environment Description**
   
   * Hudi version: 0.9
   
   * Spark version: 3.1.1
   
   * Hive version: 3.1
   
   * Hadoop version: 3.1.1
   
   * Storage (HDFS/S3/GCS..): HDFS
   
   * Running on Docker? (yes/no): no
   
   
   

