Yes, as this needed discussion, the thread was created in Google Groups for inputs. I am unable to read from the rt table after multiple updates.
14:45 /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1 - has the record that was updated in run 1
15:00 /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1 - has the records that were updated in run 2 and run 3
14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet

On Sat, Apr 27, 2019 at 7:24 PM SATISH SIDNAKOPPA <[email protected]> wrote:

> No, the issue is faced with the rt table created by the sync tool.
>
> On Fri 26 Apr, 2019, 11:53 PM Vinoth Chandar <[email protected]> wrote:
>
>> Once you registered the rt table, is this working now for you?
>>
>> On Fri, Apr 26, 2019 at 9:36 AM SATISH SIDNAKOPPA <[email protected]> wrote:
>>
>>> I am querying the real-time view of the table.
>>> This table (emp_mor_26_rt) was created after running the sync tool.
>>> So the record from the first update is fetched from the log.1 file.
>>> Only after the third update are both updates placed in log files.
>>>
>>> On Fri 26 Apr, 2019, 6:30 PM Vinoth Chandar <[email protected]> wrote:
>>>
>>>> Looks like you are querying the RO table? If so, the query only hits the
>>>> parquet file, which was probably generated during the first upsert, and
>>>> all the others went to the log. Unless compaction runs, they won't show
>>>> up on the RO table.
>>>>
>>>> If you want the latest merged view, you need to query the RT table.
>>>>
>>>> Does that sound applicable?
>>>>
>>>> On Fri, Apr 26, 2019 at 3:02 AM [email protected] <[email protected]> wrote:
>>>>
>>>>> Writing to Hudi is set up as below:
>>>>>
>>>>> ds.withColumn("emp_name", lit("upd1 Emily"))
>>>>>   .withColumn("ts", current_timestamp)
>>>>>   .write.format("com.uber.hoodie")
>>>>>   .option(HoodieWriteConfig.TABLE_NAME, "emp_mor_26")
>>>>>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "emp_id")
>>>>>   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, "MERGE_ON_READ")
>>>>>   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "part_by")
>>>>>   .option("hoodie.upsert.shuffle.parallelism", 4)
>>>>>   .mode(SaveMode.Append)
>>>>>   .save("/apps/hive/warehouse/emp_mor_26")
>>>>>
>>>>> 1st run - write record 1, "hudi_045", current_timestamp as ts
>>>>>           read result -- 1, hudi_045
>>>>> 2nd run - write record 1, "hudi_046", current_timestamp as ts
>>>>>           read result -- 1, hudi_046
>>>>> 3rd run - write record 1, "hoodie_123", current_timestamp as ts
>>>>>           read result -- 1, hudi_046
>>>>> 4th run - write record 1, "hdie_1232324", current_timestamp as ts
>>>>>           read result -- 1, hudi_046
>>>>>
>>>>> After multiple updates to the same record, the generated log.1 has
>>>>> multiple instances of the same record. At this point the updated record
>>>>> is not fetched.
>>>>>
>>>>> 14:45 /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1 - has the record that was updated in run 1
>>>>> 15:00 /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1 - has the records that were updated in run 2 and run 3
>>>>> 14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
>>>>> 14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet
>>>>>
>>>>> So is there any compaction to be enabled, either before reading or
>>>>> while writing?
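On the compaction question at the bottom of the thread: for MERGE_ON_READ, log files only fold into parquet when a compaction runs, and it can be triggered inline as part of each write. Below is a minimal sketch, assuming the com.uber.hoodie inline-compaction configs hoodie.compact.inline and hoodie.compact.inline.max.delta.commits; the option names and values should be checked against the build in use, and ds is the same DataFrame as in the original snippet.

// A hedged sketch, not a verified fix: enable inline compaction so the
// write itself compacts once enough delta commits have accumulated.
ds.withColumn("ts", current_timestamp)
  .write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "emp_mor_26")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "emp_id")
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, "MERGE_ON_READ")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "part_by")
  .option("hoodie.compact.inline", "true")                 // compact as part of the write
  .option("hoodie.compact.inline.max.delta.commits", "1")  // assumed: compact after every delta commit
  .option("hoodie.upsert.shuffle.parallelism", 4)
  .mode(SaveMode.Append)
  .save("/apps/hive/warehouse/emp_mor_26")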

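Once a compaction has completed, the read-optimized view should reflect the merged record even without going through the rt table. A read-back sketch through the Spark datasource follows; the three-level glob is an assumption based on the yyyy/MM/dd partition layout shown in the listings above.

// Hedged read-back: if compaction merged the logs into a new parquet file,
// the latest emp_name (hdie_1232324 after run 4) should show up here.
val roDf = spark.read
  .format("com.uber.hoodie")
  .load("/apps/hive/warehouse/emp_mor_26/*/*/*")
roDf.filter("emp_id = 1").select("emp_id", "emp_name", "ts").show(false)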