Writing the Hudi dataset as below:
import com.uber.hoodie.DataSourceWriteOptions
import com.uber.hoodie.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{current_timestamp, lit}

// Upsert a single record into a MERGE_ON_READ table keyed on emp_id.
ds.withColumn("emp_name", lit("upd1 Emily"))
  .withColumn("ts", current_timestamp())
  .write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "emp_mor_26")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "emp_id")
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, "MERGE_ON_READ")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "part_by")
  .option("hoodie.upsert.shuffle.parallelism", "4")
  .mode(SaveMode.Append)
  .save("/apps/hive/warehouse/emp_mor_26")
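The "read result" lines below come from reading the table back after each write. A minimal sketch of that read, assuming a plain datasource load with a partition glob:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Assumed read path: a plain load serves the read-optimized view of a
// MERGE_ON_READ table, i.e. only data already compacted into parquet base files.
val readDf = spark.read
  .format("com.uber.hoodie")
  .load("/apps/hive/warehouse/emp_mor_26/*/*/*")
readDf.select("emp_id", "emp_name").show()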
1st run: write record (1, "hudi_045", current_timestamp as ts)
    read result: 1, hudi_045
2nd run: write record (1, "hudi_046", current_timestamp as ts)
    read result: 1, hudi_046
3rd run: write record (1, "hoodie_123", current_timestamp as ts)
    read result: 1, hudi_046 (stale)
4th run: write record (1, "hdie_1232324", current_timestamp as ts)
    read result: 1, hudi_046 (stale)
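Each run is effectively a single-record upsert of the same key with a new emp_name. A sketch of the run 1 input, where the exact schema and partition value are my assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.current_timestamp

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Hypothetical single-record input for run 1; later runs only change emp_name.
val ds = Seq((1, "hudi_045", "2019/09/22"))
  .toDF("emp_id", "emp_name", "part_by")
  .withColumn("ts", current_timestamp())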
After multiple updates to the same record, the generated log.1 file contains multiple instances of that record. At this point, reads no longer return the latest update.
Partition directory listing (modification time, path):

14:45  /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1  (has the record as updated in run 1)
15:00  /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1  (has the record as updated in runs 2 and 3)
14:41  /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
14:41  /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet
So, is there any compaction that needs to be enabled, either before reading or while writing?
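To frame the question: if inline compaction is the missing piece, are these the write options to set? A sketch, where both config keys are assumptions on my part:

import com.uber.hoodie.DataSourceWriteOptions
import com.uber.hoodie.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

// Assumed knobs: compact inline after every delta commit so the parquet base
// file absorbs the log files and subsequent reads see the latest values.
ds.write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "emp_mor_26")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "emp_id")
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, "MERGE_ON_READ")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "part_by")
  .option("hoodie.compact.inline", "true")                 // assumed key
  .option("hoodie.compact.inline.max.delta.commits", "1")  // assumed key
  .mode(SaveMode.Append)
  .save("/apps/hive/warehouse/emp_mor_26")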