Hi Vinoth,

It was missed while copying. PFB the list of files:

14:45 /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1
- has the record that was updated in run 1
15:00 /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1
- has the records that were updated in run 2 and run 3
14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet


On Mon, Apr 29, 2019 at 8:26 PM Vinoth Chandar <[email protected]> wrote:

> Hi Satish,
>
> There are no parquet files? Can you share the full listing of files in the
> partition?
>
> Thanks
> Vinoth
>
> On Mon, Apr 29, 2019 at 7:22 AM SATISH SIDNAKOPPA <
> [email protected]> wrote:
>
> > Yes,
> > as this needed discussion, the thread was created in Google Groups for
> > inputs.
> > I am unable to read the updated record from the rt table after multiple
> > updates.
> >
> > 14:45 /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1
> > - *has the record that was updated in run 1*
> > 15:00 /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1
> > - *has the records that were updated in run 2 and run 3*
> > 14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
> > 14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet
> >
> >
> >
> >
> > On Sat, Apr 27, 2019 at 7:24 PM SATISH SIDNAKOPPA <
> > [email protected]> wrote:
> >
> > > No, the issue is faced with the rt table created by the sync tool.
> > >
> > > On Fri 26 Apr, 2019, 11:53 PM Vinoth Chandar <[email protected]> wrote:
> > >
> > >> once you registered the rt table, is this working now for you?
> > >>
> > >> On Fri, Apr 26, 2019 at 9:36 AM SATISH SIDNAKOPPA <
> > >> [email protected]> wrote:
> > >>
> > >> > I am querying the real-time view of the table.
> > >> > This table (emp_mor_26_rt) was created after running the sync tool.
> > >> > So the record updated in the first run is fetched from the log.1 file.
> > >> >
> > >> > Only after the third update are both updates placed in the log files.
> > >> >
> > >> >
> > >> >
> > >> >
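A side note on the _rt table mentioned above: it is registered in the Hive
metastore by Hudi's sync step. Hive sync can also be enabled on the write
path itself; this is a minimal sketch, assuming the hive-sync option
constants on com.uber.hoodie's DataSourceWriteOptions, with a placeholder
JDBC URL and database, so adjust for the version in use.

    // Enable Hive sync during the write so the RO/RT tables are
    // (re)registered after each commit; record-key and partition-path
    // options as in the write snippet quoted further below.
    ds.write.format("com.uber.hoodie")
      .option(HoodieWriteConfig.TABLE_NAME, "emp_mor_26")
      .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, "MERGE_ON_READ")
      .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
      .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, "default")
      .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, "emp_mor_26")
      .option(DataSourceWriteOptions.HIVE_URL_OPT_KEY, "jdbc:hive2://hiveserver:10000")
      .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "part_by")
      .mode(SaveMode.Append)
      .save("/apps/hive/warehouse/emp_mor_26")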
> > >> > On Fri 26 Apr, 2019, 6:30 PM Vinoth Chandar <[email protected]> wrote:
> > >> >
> > >> > > Looks like you are querying the RO table? If so, the query only
> > >> > > hits the parquet file, which was probably generated during the
> > >> > > first upsert; all the other updates went to the log. Unless
> > >> > > compaction runs, they won't show up in the RO table.
> > >> > >
> > >> > > If you want the latest merged view, you need to query the RT table.
> > >> > >
> > >> > > Does that sound applicable?
> > >> > >
> > >> > >
> > >> > >
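A minimal sketch of the distinction described above, assuming the sync tool
registered the read-optimized view as emp_mor_26 and the real-time view as
emp_mor_26_rt (the table name mentioned earlier in the thread):

    // RO view: reads only the compacted parquet base files, so upserts
    // still sitting in .log files are not visible yet.
    spark.sql("select emp_id, emp_name, ts from emp_mor_26 where emp_id = 1").show()

    // RT view: merges the parquet base file with its .log files at query
    // time, so the latest upserted value should be returned.
    spark.sql("select emp_id, emp_name, ts from emp_mor_26_rt where emp_id = 1").show()

Depending on the version, the RT view may need to be queried through Hive
(it relies on Hudi's realtime input format) rather than through Spark SQL
directly.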
> > >> > > On Fri, Apr 26, 2019 at 3:02 AM [email protected] <
> > >> > > [email protected]> wrote:
> > >> > >
> > >> > > > Writing the Hudi dataset as below:
> > >> > > >
> > >> > > > import org.apache.spark.sql.SaveMode
> > >> > > > import org.apache.spark.sql.functions.{current_timestamp, lit}
> > >> > > > import com.uber.hoodie.DataSourceWriteOptions
> > >> > > > import com.uber.hoodie.config.HoodieWriteConfig
> > >> > > >
> > >> > > > ds.withColumn("emp_name", lit("upd1 Emily"))
> > >> > > >   .withColumn("ts", current_timestamp())
> > >> > > >   .write.format("com.uber.hoodie")
> > >> > > >   .option(HoodieWriteConfig.TABLE_NAME, "emp_mor_26")
> > >> > > >   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "emp_id")
> > >> > > >   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, "MERGE_ON_READ")
> > >> > > >   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "part_by")
> > >> > > >   .option("hoodie.upsert.shuffle.parallelism", "4")
> > >> > > >   .mode(SaveMode.Append)
> > >> > > >   .save("/apps/hive/warehouse/emp_mor_26")
> > >> > > >
> > >> > > >
> > >> > > > 1st run -- write record 1,"hudi_045",current_timestamp as ts
> > >> > > > read result -- 1,hudi_045
> > >> > > > 2nd run -- write record 1,"hudi_046",current_timestamp as ts
> > >> > > > read result -- 1,hudi_046
> > >> > > > 3rd run -- write record 1,"hoodie_123",current_timestamp as ts
> > >> > > > read result -- 1,hudi_046
> > >> > > > 4th run -- write record 1,"hdie_1232324",current_timestamp as ts
> > >> > > > read result -- 1,hudi_046
> > >> > > >
> > >> > > > After multiple updates to the same record,
> > >> > > > the generated log.1 file has multiple instances of that record.
> > >> > > > At this point the updated record is not fetched.
> > >> > > >
> > >> > > > 14:45 /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1
> > >> > > > - has the record that was updated in run 1
> > >> > > > 15:00 /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1
> > >> > > > - has the records that were updated in run 2 and run 3
> > >> > > > 14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
> > >> > > > 14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet
> > >> > > >
> > >> > > >
> > >> > > > So is there any compaction to be enabled before reading or while
> > >> > > > writing?
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>
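On the compaction question that closes the quoted thread: compaction is what
folds the .log files back into new parquet base files, so it determines when
updates become visible to the RO view; the RT view merges logs at query time
and does not need it. A minimal sketch of enabling inline compaction on the
write path, assuming the hoodie.compact.inline and
hoodie.compact.inline.max.delta.commits config names apply to the Hudi
version in use:

    // Inline compaction: after every N delta commits, merge the log files
    // into a fresh parquet base file as part of the write itself.
    ds.write.format("com.uber.hoodie")
      .option(HoodieWriteConfig.TABLE_NAME, "emp_mor_26")
      .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, "MERGE_ON_READ")
      .option("hoodie.compact.inline", "true")
      .option("hoodie.compact.inline.max.delta.commits", "3")
      .mode(SaveMode.Append)
      .save("/apps/hive/warehouse/emp_mor_26")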
