I am querying the real-time view of the table.
This table (emp_mor_26_rt) was created after running the runsync tool.
So the first updated record is fetched from the log.1 file.

Only after the third update are both updates placed in the log files.
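
For reference, the difference between the two views discussed in this thread can be sketched with a toy model. This is not Hudi's reader: `Rec`, `MorViews`, `readOptimized`, and `realTime` are hypothetical names, the sample timestamps are made up, and the merge rule (the record with the latest `ts` per key wins) is an assumption for illustration only.

```scala
// Toy model of one merge-on-read file group: a base (parquet) file plus log records.
// Not Hudi code; every name below is hypothetical.
final case class Rec(empId: Int, empName: String, ts: Long)

object MorViews {
  // Read-optimized (RO) view: only the base file is visible; logs are deliberately ignored.
  def readOptimized(base: Seq[Rec], logs: Seq[Rec]): Map[Int, Rec] =
    base.map(r => r.empId -> r).toMap

  // Real-time (RT) view: log records are merged over the base file.
  // Assumed rule: for each key, the record with the largest ts wins.
  def realTime(base: Seq[Rec], logs: Seq[Rec]): Map[Int, Rec] =
    (base ++ logs).groupBy(_.empId).map { case (k, rs) => k -> rs.maxBy(_.ts) }

  // Sample data modelled on the runs in this thread (timestamps are placeholders).
  val base: Seq[Rec] = Seq(Rec(1, "hudi_045", 1L))
  val logs: Seq[Rec] = Seq(Rec(1, "hudi_046", 2L), Rec(1, "hoodie_123", 3L))
}
```

Under this model, the RO view keeps returning the base record until compaction rewrites the base file, while the RT view is expected to return the most recent update.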




On Fri 26 Apr, 2019, 6:30 PM Vinoth Chandar <[email protected]> wrote:

> Looks like you are querying the RO table? If so, the query only hits the
> parquet file, which was probably generated during the first upsert; all
> other writes went to the log. Unless compaction runs, they won't show up
> in the RO table.
>
> If you want the latest merged view you need to query the RT table.
>
> Does that sound applicable?
>
>
>
> On Fri, Apr 26, 2019 at 3:02 AM [email protected] <
> [email protected]> wrote:
>
> > Writing the Hudi dataset as below:
> >
> > ds.withColumn("emp_name", lit("upd1 Emily"))
> > .withColumn("ts", current_timestamp).write.format("com.uber.hoodie")
> > .option(HoodieWriteConfig.TABLE_NAME,"emp_mor_26")
> > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY,"emp_id")
> > .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY,"MERGE_ON_READ")
> > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "part_by")
> > .option("hoodie.upsert.shuffle.parallelism",4)
> > .mode(SaveMode.Append)
> > .save("/apps/hive/warehouse/emp_mor_26")
> >
> >
> > 1st run - write record 1,"hudi_045",current_timestamp as ts
> > read result -- 1, hudi_045
> > 2nd run - write record 1,"hudi_046",current_timestamp as ts
> > read result -- 1,hudi_046
> > 3rd run -- write record 1, "hoodie_123",current_timestamp as ts
> > read result --- 1,hudi_046
> > 4th run -- write record 1, "hdie_1232324",current_timestamp as ts
> > read result --- 1,hudi_046
> >
> > After multiple updates to the same record,
> > the generated log.1 has multiple instances of the same record.
> > At this point the updated record is not fetched.
> >
> > 14:45 /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1
> > - has record that was updated in run 1
> > 15:00 /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1
> > - has record that was updated in run 2 and run 3
> > 14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
> > 14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet
> >
> >
> > So is there any compaction that needs to be enabled before reading or
> > while writing?
> >
> >
>
