Writing the Hudi dataset as below:

    ds.withColumn("emp_name", lit("upd1 Emily"))
      .withColumn("ts", current_timestamp())
      .write.format("com.uber.hoodie")
      .option(HoodieWriteConfig.TABLE_NAME, "emp_mor_26")
      .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "emp_id")
      .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, "MERGE_ON_READ")
      .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "part_by")
      .option("hoodie.upsert.shuffle.parallelism", 4)
      .mode(SaveMode.Append)
      .save("/apps/hive/warehouse/emp_mor_26")
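For reference, my understanding is that MERGE_ON_READ upserts pick a winner among multiple versions of the same record key using a precombine field, which defaults to a column named ts (matching the column above). If that matters here, I assume it could also be set explicitly in the same writer chain; a minimal sketch, not something I have verified fixes anything:

      // assumed option: highest "ts" should win when duplicates of a key are merged
      .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "ts")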
1st run: write record (1, "hudi_045", current_timestamp as ts); read result: (1, "hudi_045")
2nd run: write record (1, "hudi_046", current_timestamp as ts); read result: (1, "hudi_046")
3rd run: write record (1, "hoodie_123", current_timestamp as ts); read result: (1, "hudi_046") -- stale
4th run: write record (1, "hdie_1232324", current_timestamp as ts); read result: (1, "hudi_046") -- stale

After multiple updates to the same record, the generated log.1 file contains multiple instances of that record, and at this point the latest update is no longer returned by the read.

Partition directory listing:

    14:45 /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1 -- has the record updated in run 1
    15:00 /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1 -- has the records updated in runs 2 and 3
    14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
    14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet

So, is there any compaction that needs to be enabled, either before reading or while writing?
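If inline compaction is what's missing, I assume it would be enabled on the writer with something like the sketch below. The keys hoodie.compact.inline and hoodie.compact.inline.max.delta.commits are my guesses from the compaction config docs (please correct me if they differ for com.uber.hoodie), and the delta-commit count of 1 is only to force compaction quickly while testing:

    ds.write.format("com.uber.hoodie")
      .option(HoodieWriteConfig.TABLE_NAME, "emp_mor_26")
      .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "emp_id")
      .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, "MERGE_ON_READ")
      .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "part_by")
      // assumed keys: run compaction inline once enough delta commits accumulate,
      // merging the log.1 files back into the base parquet file
      .option("hoodie.compact.inline", "true")
      .option("hoodie.compact.inline.max.delta.commits", "1")
      .mode(SaveMode.Append)
      .save("/apps/hive/warehouse/emp_mor_26")

Or is the stale read expected because I am reading the read-optimized view (base parquet only), and the merged result is only visible through the realtime view?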