Looks like you are querying the RO table? If so, the query only hits the parquet file, which was probably generated during the first upsert; all subsequent writes went to the log. Unless compaction runs, they won't show up in the RO table.
If you want the latest merged view, you need to query the RT table. Does that sound applicable?

On Fri, Apr 26, 2019 at 3:02 AM [email protected] <[email protected]> wrote:

> Writing to hudi is set up as below:
>
> ds.withColumn("emp_name", lit("upd1 Emily"))
>   .withColumn("ts", current_timestamp)
>   .write.format("com.uber.hoodie")
>   .option(HoodieWriteConfig.TABLE_NAME, "emp_mor_26")
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "emp_id")
>   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, "MERGE_ON_READ")
>   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "part_by")
>   .option("hoodie.upsert.shuffle.parallelism", 4)
>   .mode(SaveMode.Append)
>   .save("/apps/hive/warehouse/emp_mor_26")
>
> 1st run -- write record 1, "hudi_045", current_timestamp as ts
>   read result: 1, hudi_045
> 2nd run -- write record 1, "hudi_046", current_timestamp as ts
>   read result: 1, hudi_046
> 3rd run -- write record 1, "hoodie_123", current_timestamp as ts
>   read result: 1, hudi_046
> 4th run -- write record 1, "hdie_1232324", current_timestamp as ts
>   read result: 1, hudi_046
>
> After multiple updates to the same record, the generated log.1 has multiple
> instances of the same record. At this point the updated record is not fetched.
>
> 14:45 /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1
>   - has the record that was updated in run 1
> 15:00 /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1
>   - has the records that were updated in run 2 and run 3
> 14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
> 14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet
>
> So is there any compaction to be enabled before reading, or while writing?
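For reference, a minimal sketch of what enabling inline compaction on the writer could look like, added on top of the write shown above. This is an illustrative assumption: the option keys (hoodie.compact.inline, hoodie.compact.inline.max.delta.commits) should be verified against the Hudi/Hoodie version you are running, and the trigger threshold of 1 delta commit is just for demonstration.

```scala
// Sketch: same MERGE_ON_READ write as above, with inline compaction turned on
// so the parquet base file is rewritten after each delta commit.
// Option names are assumptions -- check them against your Hudi version's docs.
ds.withColumn("emp_name", lit("upd1 Emily"))
  .withColumn("ts", current_timestamp)
  .write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "emp_mor_26")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "emp_id")
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, "MERGE_ON_READ")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "part_by")
  .option("hoodie.upsert.shuffle.parallelism", 4)
  // run compaction inline as part of the write commit
  .option("hoodie.compact.inline", "true")
  // compact after every delta commit (illustrative; tune for your workload)
  .option("hoodie.compact.inline.max.delta.commits", "1")
  .mode(SaveMode.Append)
  .save("/apps/hive/warehouse/emp_mor_26")
```

With this, the RO table would eventually reflect the merged data after compaction runs; without it, querying the RT table (the `_rt` suffixed Hive table, if the dataset is synced to Hive) remains the way to see the latest merged view.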
