Writing the Hudi dataset as below:
import com.uber.hoodie.DataSourceWriteOptions
import com.uber.hoodie.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{current_timestamp, lit}

// Upsert a single record into a MERGE_ON_READ table keyed on emp_id.
ds.withColumn("emp_name", lit("upd1 Emily"))
  .withColumn("ts", current_timestamp())
  .write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "emp_mor_26")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "emp_id")
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, "MERGE_ON_READ")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "part_by")
  .option("hoodie.upsert.shuffle.parallelism", "4")
  .mode(SaveMode.Append)
  .save("/apps/hive/warehouse/emp_mor_26")
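The "read result" lines below come from reading the table back after each write. A minimal sketch of that read, assuming a plain datasource load with a partition glob:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Assumed read path: a plain load serves the read-optimized view of a
// MERGE_ON_READ table, i.e. only data already compacted into parquet base files.
val readDf = spark.read
  .format("com.uber.hoodie")
  .load("/apps/hive/warehouse/emp_mor_26/*/*/*")
readDf.select("emp_id", "emp_name").show()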
1st run: write record (1, "hudi_045", current_timestamp as ts)
    read result: 1, hudi_045
2nd run: write record (1, "hudi_046", current_timestamp as ts)
    read result: 1, hudi_046
3rd run: write record (1, "hoodie_123", current_timestamp as ts)
    read result: 1, hudi_046 (stale)
4th run: write record (1, "hdie_1232324", current_timestamp as ts)
    read result: 1, hudi_046 (stale)
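Each run is effectively a single-record upsert of the same key with a new emp_name. A sketch of the run 1 input, where the exact schema and partition value are my assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.current_timestamp

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Hypothetical single-record input for run 1; later runs only change emp_name.
val ds = Seq((1, "hudi_045", "2019/09/22"))
  .toDF("emp_id", "emp_name", "part_by")
  .withColumn("ts", current_timestamp())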
After multiple updates to the same record, the generated log.1 file contains multiple instances of that record. At this point, reads no longer return the latest update.
Partition directory listing (modification time, path):

14:45  /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1  (has the record as updated in run 1)
15:00  /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1  (has the record as updated in runs 2 and 3)
14:41  /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
14:41  /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet
So, is there any compaction that needs to be enabled, either before reading or while writing?
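To frame the question: if inline compaction is the missing piece, are these the write options to set? A sketch, where both config keys are assumptions on my part:

import com.uber.hoodie.DataSourceWriteOptions
import com.uber.hoodie.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

// Assumed knobs: compact inline after every delta commit so the parquet base
// file absorbs the log files and subsequent reads see the latest values.
ds.write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "emp_mor_26")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "emp_id")
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, "MERGE_ON_READ")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "part_by")
  .option("hoodie.compact.inline", "true")                 // assumed key
  .option("hoodie.compact.inline.max.delta.commits", "1")  // assumed key
  .mode(SaveMode.Append)
  .save("/apps/hive/warehouse/emp_mor_26")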