It looks like you are querying the RO table. If so, the query only hits the
parquet file, which was probably generated during the first upsert; all
later upserts went to the log. Unless compaction runs, they won't show up in
the RO table.

If you want the latest merged view you need to query the RT table.

Does that sound applicable?
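For reference, inline compaction can be turned on at write time. This is only a
sketch against the 0.4.x (com.uber.hoodie) API used in your snippet; the
`hoodie.compact.inline` / `hoodie.compact.inline.max.delta.commits` keys and the
`_rt` table suffix assume a setup like yours, so adjust names to your environment:

```scala
// Same write as in your snippet, with inline compaction enabled so delta
// logs get merged into parquet after each commit (sketch, not tested here).
ds.write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "emp_mor_26")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "emp_id")
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, "MERGE_ON_READ")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "part_by")
  .option("hoodie.upsert.shuffle.parallelism", 4)
  .option("hoodie.compact.inline", "true")                // compact inline
  .option("hoodie.compact.inline.max.delta.commits", "1") // after every delta commit
  .mode(SaveMode.Append)
  .save("/apps/hive/warehouse/emp_mor_26")
```

Alternatively, if the table is synced to Hive, the realtime view is typically
registered with an `_rt` suffix (e.g. `SELECT * FROM emp_mor_26_rt`), which
merges the logs on read without waiting for compaction.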



On Fri, Apr 26, 2019 at 3:02 AM [email protected] <
[email protected]> wrote:

> Writing to Hudi as set up below:
>
> ds.withColumn("emp_name",lit("upd1
> Emily")).withColumn("ts",current_timestamp).write.format("com.uber.hoodie")
> .option(HoodieWriteConfig.TABLE_NAME,"emp_mor_26")
> .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY,"emp_id")
> .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY,"MERGE_ON_READ")
> .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "part_by")
> .option("hoodie.upsert.shuffle.parallelism",4)
> .mode(SaveMode.Append)
> .save("/apps/hive/warehouse/emp_mor_26")
>
>
> 1st run - write record 1,"hudi_045",current_timestamp as ts
> read result -- 1, hudi_045
> 2nd run - write record 1,"hudi_046",current_timestamp as ts
> read result -- 1,hudi_046
> 3rd run -- write record 1, "hoodie_123",current_timestamp as ts
> read result --- 1,hudi_046
> 4th run -- write record 1, "hdie_1232324",current_timestamp as ts
> read result --- 1,hudi_046
>
> After multiple updates to the same record,
> the generated log.1 has multiple instances of the same record.
> At this point the updated record is not fetched.
>
> 14:45
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1
> - has record that was updated in run 1
> 15:00
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1
> - has record that was updated in run 2 and run 3
> 14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
> 14:41
> /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet
>
>
> So is there any compaction to be enabled before reading or while writing?
>
>
