Re: [DISCUSS] enable cross AZ consistency and quality checks of hudi datasets

2020-09-10 Thread Nishith Agarwal
+1 on the proposal. Thanks, Sanjay, for describing this in detail.
This feature can also help eliminate file listing on HDFS entirely when
reading Hudi metadata, for use cases that are very sensitive to file-listing
cost.

Thanks,
Nishith


Re: [DISCUSS] enable cross AZ consistency and quality checks of hudi datasets

2020-09-09 Thread Vinoth Chandar
Hi Sanjay,

Overall, the two proposals sound reasonable to me. Thanks for describing
them so well.
General comment: it seems you are implementing multi-AZ replication by
matching commit times across AZs?

I do want to name these properties consistently with other Hudi
terminology, but we can work that out on the PR itself.

> If the tool can't update all the HMS it will rollback the updated ones
The tool can also fail midway after updating one of the HMSes, so we need
to handle recovery, etc.?

Thanks
Vinoth


Re: [DISCUSS] enable cross AZ consistency and quality checks of hudi datasets

2020-09-08 Thread Satish Kotha
Hi folks,

Any thoughts on this? At a high level, we want to control the
high-watermark commit through a property so that pre-commit and post-commit
hooks can be performed. Is this useful for anyone else?


[DISCUSS] enable cross AZ consistency and quality checks of hudi datasets

2020-09-03 Thread Sanjay Sundaresan
Hello folks,

We have a use case where data in the same Hudi datasets, stored in
different DCs (for high availability / disaster recovery), must be strongly
consistent and must pass all quality checks before it can be consumed by
the users who query it. Currently, we have an offline service that runs
quality checks and asynchronously syncs the Hudi datasets between the
different DCs/AZs, but until that sync happens, queries running in the
different DCs see inconsistent results. For some of our most critical
datasets this inconsistency is causing many problems.

We want to support the following use cases: 1) data consistency across DCs,
and 2) adding data quality checks post-commit.

Our flow looks like this:
1) write a new batch of data at t1
2) user queries do not yet see data at t1
3) data quality checks run with a session property set to include t1
4) optionally replicate t1 to other AZs, then promote t1 so that regular
user queries see data at t1
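
For illustration, here is a rough sketch of that flow as code. All the
helper names (writeBatch, runQualityChecks, replicateToOtherAZs,
promoteWatermark) are assumptions made up for this sketch, not existing
Hudi APIs.

// Hypothetical orchestration of the four steps above; every helper here is
// a stand-in for the real write / validation / replication / HMS-sync step.
public final class CommitPromotionFlow {

  public static void main(String[] args) {
    String t1 = writeBatch();            // 1) write a new batch at commit time t1
                                         // 2) regular queries still read only up to
                                         //    the previously promoted watermark
    boolean ok = runQualityChecks(t1);   // 3) checks run in a session that overrides
                                         //    the watermark to include t1
    if (ok) {
      replicateToOtherAZs(t1);           // 4a) optionally copy t1 to the other AZs
      promoteWatermark(t1);              // 4b) advance 'last_replication_timestamp'
                                         //     so regular queries now see t1
    }
  }

  static String writeBatch() { return "20200903111200"; }
  static boolean runQualityChecks(String commitTime) { return true; }
  static void replicateToOtherAZs(String commitTime) { }
  static void promoteWatermark(String commitTime) { }
}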

We want to make the following changes to achieve this.

1. Change HoodieParquetInputFormat to look for a
'last_replication_timestamp' property in the JobConf and use it to create a
new ActiveTimeline that limits the visible commits to those less than or
equal to this timestamp. A session property can override this, which lets
us make such data visible for quality checks.
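
Here is a rough sketch of the filter in (1). A plain Map and List stand in
for the real JobConf and ActiveTimeline, and the override property name is
just an example made up for this sketch; the only property name from the
proposal is 'last_replication_timestamp'.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the watermark-based visibility rule. Hudi commit times are
// lexicographically ordered timestamp strings, so a plain string compare
// is enough to filter the timeline.
public final class WatermarkFilter {

  static List<String> visibleCommits(Map<String, String> jobConf, List<String> commitTimes) {
    // Session-level override used by quality checks to see not-yet-promoted commits
    // ('hoodie.visibility.override.timestamp' is a made-up name for this sketch).
    String override = jobConf.get("hoodie.visibility.override.timestamp");
    String watermark = (override != null) ? override : jobConf.get("last_replication_timestamp");
    if (watermark == null) {
      return commitTimes; // no watermark set: fall back to the normal behavior
    }
    return commitTimes.stream()
        .filter(c -> c.compareTo(watermark) <= 0)
        .collect(Collectors.toList());
  }
}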

2. We store this timestamp as a table property in the HiveMetaStore. To
make it easier to update, we want to extend HiveSyncTool to also update
this table property when syncing a Hudi dataset to the HMS. The extended
tool takes a list of HMSes to update and tries to update each of them one
by one. (With a single global HMS across all DCs there is just one; with a
region-local HMS per DC, the update across all HMSes is not truly
transactional, so there is a small window of time in which queries can
return inconsistent results.) If the tool cannot update all the HMSes, it
rolls back the ones it has already updated (again, not applicable for a
global HMS).
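
Here is a sketch of the update-and-rollback logic in (2), written against a
hypothetical MetastoreClient interface (getTableProperty / setTableProperty
are placeholders; the real change would go through HiveSyncTool and the
Hive metastore client).

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of updating 'last_replication_timestamp' on a list of metastores
// one by one, rolling back the ones already updated if any update fails.
public final class WatermarkSync {

  interface MetastoreClient {
    String getTableProperty(String table, String key) throws Exception;
    void setTableProperty(String table, String key, String value) throws Exception;
  }

  static final String PROP = "last_replication_timestamp";

  static void updateAll(List<MetastoreClient> metastores, String table, String newTimestamp) {
    Map<MetastoreClient, String> previous = new HashMap<>();
    List<MetastoreClient> updated = new ArrayList<>();
    try {
      for (MetastoreClient hms : metastores) {
        previous.put(hms, hms.getTableProperty(table, PROP)); // remember old value for rollback
        hms.setTableProperty(table, PROP, newTimestamp);
        updated.add(hms);
      }
    } catch (Exception e) {
      // Best-effort rollback of the metastores that were already updated; as
      // noted above, this is not applicable (or needed) for a single global HMS.
      for (MetastoreClient hms : updated) {
        try {
          hms.setTableProperty(table, PROP, previous.get(hms));
        } catch (Exception ignored) {
          // the rollback itself can fail; recovery for that case still needs handling
        }
      }
      throw new RuntimeException("Failed to update all metastores; rolled back " + updated.size(), e);
    }
  }
}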

We have made the above changes in our internal branch and are running them
successfully in production.

Please let us know your feedback on this change.

Sanjay