+1 on the proposal. Thanks Sanjay for describing this in detail.
This feature can also help eliminate file listing from HDFS completely for
Hudi metadata information, for use cases that are very sensitive to file
listing.

Thanks,
Nishith

On Wed, Sep 9, 2020 at 4:59 PM Vinoth Chandar <vin...@apache.org> wrote:

> Hi Sanjay,
>
> Overall the two proposals sound reasonable to me. Thanks for describing
> them so well.
> A general comment: it seems like you are implementing multi-AZ replication
> by matching commit times across AZs?
>
> I do want to name these properties to be consistent with other Hudi
> terminology, but we can work that out on the PR itself.
>
> > If the tool can't update all the HMS it will rollback the updated ones
> The tool can also fail midway after updating one of the HMSs, so we need to
> handle recovery there as well?
>
> Thanks
> Vinoth
>
>
> On Tue, Sep 8, 2020 at 10:45 AM Satish Kotha <satishko...@uber.com.invalid>
> wrote:
>
> > Hi folks,
> >
> > Any thoughts on this? At a high level, we want to control the high
> > watermark commit through a property so that we can perform pre-commit and
> > post-commit hooks. Is this useful for anyone else?
> >
> > On Thu, Sep 3, 2020 at 11:12 AM Sanjay Sundaresan <ssan...@uber.com>
> > wrote:
> >
> > > Hello folks,
> > >
> > > We have a use case where the same Hudi datasets are stored in different
> > > DCs (for high availability / disaster recovery), and the data must be
> > > strongly consistent across DCs as well as pass all quality checks before
> > > it can be consumed by the users who query it. Currently, we have an
> > > offline service that runs quality checks and asynchronously syncs the
> > > Hudi datasets between the different DCs/AZs, but until the sync happens,
> > > queries running in these different DCs see inconsistent results. For
> > > some of our most critical datasets this inconsistency causes a lot of
> > > problems.
> > >
> > > We want to support the following use cases: 1) data consistency across
> > > DCs, and 2) adding data quality checks post-commit.
> > >
> > > Our flow looks like this (see the sketch below):
> > > 1) write a new batch of data at t1
> > > 2) user queries will not see data at t1 yet
> > > 3) data quality checks are run by setting a session property to include t1
> > > 4) optionally replicate t1 to other AZs and promote t1 so regular user
> > > queries will see data at t1
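> > >
> > > To make steps 2-4 concrete, here is a minimal Java sketch of how the two
> > > kinds of query sessions could be configured (change 1 below covers how the
> > > input format consumes these). The override property name is only an
> > > illustration, not a finalized config key:
> > >
> > > import org.apache.hadoop.mapred.JobConf;
> > >
> > > public class QuerySessionSetup {
> > >   public static void main(String[] args) {
> > >     // Regular user query: the query engine populates the promoted
> > >     // watermark (from the HMS table property), so only commits at or
> > >     // before it are visible; t1 stays hidden until step 4 promotes it.
> > >     JobConf userQuery = new JobConf();
> > >     userQuery.set("last_replication_timestamp", "20200903101500");
> > >
> > >     // Quality-check session (step 3): a hypothetical override property
> > >     // makes the not-yet-promoted commit t1 visible for validation.
> > >     JobConf qualityCheck = new JobConf();
> > >     qualityCheck.set("hoodie.replication.timestamp.override",
> > >         "20200903111200"); // t1
> > >   }
> > > }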
> > >
> > > We want to make the following changes to achieve this.
> > >
> > > 1. Change HoodieParquetInputFormat to look for the
> > > 'last_replication_timestamp' property in the JobConf and use it to create
> > > a new ActiveTimeline that limits the visible commits to those less than
> > > or equal to this timestamp. This can be overridden by a session property,
> > > which allows us to make such data visible for quality checks.
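> > >
> > > Here is a rough, self-contained sketch of the filtering logic for change
> > > 1. It only uses the Hadoop JobConf API; the class name and the session
> > > override property are illustrative assumptions, not the actual Hudi
> > > implementation:
> > >
> > > import java.util.List;
> > > import java.util.stream.Collectors;
> > > import org.apache.hadoop.mapred.JobConf;
> > >
> > > public class ReplicationAwareTimelineFilter {
> > >
> > >   // Table-level high watermark synced into the HMS table properties.
> > >   static final String REPLICATION_TS_PROP = "last_replication_timestamp";
> > >   // Hypothetical session-level override used by quality-check queries.
> > >   static final String OVERRIDE_PROP =
> > >       "hoodie.replication.timestamp.override";
> > >
> > >   // Returns only the commit times a query is allowed to see: everything
> > >   // less than or equal to the effective watermark. Hudi commit times are
> > >   // yyyyMMddHHmmss strings, so lexicographic order matches time order.
> > >   static List<String> visibleCommits(List<String> completedCommitTimes,
> > >                                      JobConf conf) {
> > >     String watermark =
> > >         conf.get(OVERRIDE_PROP, conf.get(REPLICATION_TS_PROP));
> > >     if (watermark == null) {
> > >       return completedCommitTimes; // no watermark set: behave as today
> > >     }
> > >     return completedCommitTimes.stream()
> > >         .filter(commitTime -> commitTime.compareTo(watermark) <= 0)
> > >         .collect(Collectors.toList());
> > >   }
> > > }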
> > >
> > > 2. We store this timestamp as a table property in the HiveMetaStore. To
> > > make it easier to update, we want to extend the HiveSyncTool to also
> > > update this table property when syncing a Hudi dataset to the HMS. The
> > > extended tool will take in a list of HMSs to be updated and will try to
> > > update each of them one by one. (In the case of a global HMS across all
> > > DCs this is just one, but with a region-local HMS per DC the update of
> > > all HMSs is not truly transactional, so there is a small window of time
> > > where queries can return inconsistent results.) If the tool can't update
> > > all the HMSs, it will roll back the ones it already updated (again not
> > > applicable for a global HMS).
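> > >
> > > And here is a rough sketch of the multi-HMS update with best-effort
> > > rollback for change 2. It uses the standard Hive IMetaStoreClient API, but
> > > the class and method below are illustrative, not the actual HiveSyncTool
> > > extension:
> > >
> > > import java.util.ArrayList;
> > > import java.util.List;
> > > import org.apache.hadoop.hive.metastore.IMetaStoreClient;
> > > import org.apache.hadoop.hive.metastore.api.Table;
> > >
> > > public class ReplicationTimestampSync {
> > >
> > >   static final String REPLICATION_TS_PROP = "last_replication_timestamp";
> > >
> > >   // Updates the watermark on every metastore; on partial failure, rolls
> > >   // back the ones that were already updated.
> > >   static void updateAll(List<IMetaStoreClient> clients, String db,
> > >                         String table, String newTs) throws Exception {
> > >     List<IMetaStoreClient> updated = new ArrayList<>();
> > >     List<String> previousValues = new ArrayList<>();
> > >     try {
> > >       for (IMetaStoreClient client : clients) {
> > >         Table t = client.getTable(db, table);
> > >         previousValues.add(t.getParameters().get(REPLICATION_TS_PROP));
> > >         t.getParameters().put(REPLICATION_TS_PROP, newTs);
> > >         client.alter_table(db, table, t);
> > >         updated.add(client);
> > >       }
> > >     } catch (Exception e) {
> > >       // Best-effort rollback of the metastores updated so far. If the
> > >       // rollback itself fails we are back to the recovery question raised
> > >       // above; real code needs more careful handling.
> > >       for (int i = 0; i < updated.size(); i++) {
> > >         Table t = updated.get(i).getTable(db, table);
> > >         if (previousValues.get(i) == null) {
> > >           t.getParameters().remove(REPLICATION_TS_PROP);
> > >         } else {
> > >           t.getParameters().put(REPLICATION_TS_PROP, previousValues.get(i));
> > >         }
> > >         updated.get(i).alter_table(db, table, t);
> > >       }
> > >       throw e;
> > >     }
> > >   }
> > > }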
> > >
> > > We have made the above changes in our internal branch and are running
> > > them successfully in production.
> > >
> > > Please let us know your feedback on this change.
> > >
> > > Sanjay
> > >
> >
>
