Hi folks,

Any thoughts on this? At a high level, we want to control the
high-watermark commit through a property, so that pre-commit and post-commit
hooks can run. Would this be useful to anyone else?

On Thu, Sep 3, 2020 at 11:12 AM Sanjay Sundaresan <[email protected]> wrote:

> Hello folks,
>
> We have a use case where copies of the same Hudi dataset, stored in
> different DCs (for high availability / disaster recovery), must be strongly
> consistent and must pass all quality checks before they can be consumed
> by the users who query them. Currently, we have an offline service
> that runs quality checks and asynchronously syncs the Hudi datasets
> between DCs/AZs, but until the sync happens, queries running in these
> different DCs see inconsistent results. For some of our most critical
> datasets, this inconsistency causes many problems.
>
> We want to support the following use cases: 1) data consistency, and
> 2) data quality checks after commit.
>
> Our flow looks like this:
> 1) write new batch of data at t1
> 2) user queries will not see data at t1
> 3) data quality checks are done by setting a session property to include t1
> 4) optionally replicate t1 to other AZs and promote t1 so regular user
> queries will see data at t1
>
> We want to make the following changes to achieve this.
>
> 1. Change the HoodieParquetInputFormat to look for
> 'last_replication_timestamp' property in the JobConf and use this to create
> a new ActiveTimeline that limits the commits seen to be less than or
> equal to this timestamp. This can be overridden by a session property that
> will allow us to make such data visible for quality checks.
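
For readers following along, the filtering rule in change 1 can be sketched roughly as below. This is only an illustration in Python, not the actual HoodieParquetInputFormat change: the function name `visible_commits` and the `include_unpromoted` flag (standing in for the session-property override) are hypothetical, and the sketch assumes commit times are fixed-width timestamp strings that sort lexicographically.

```python
from typing import Iterable, List, Optional


def visible_commits(commits: Iterable[str],
                    last_replication_timestamp: Optional[str],
                    include_unpromoted: bool = False) -> List[str]:
    """Return the commit times a query should see, oldest first.

    commits: commit times as fixed-width strings (assumed to sort
        lexicographically in time order).
    last_replication_timestamp: the high watermark; None means the
        property is not set, so nothing is hidden.
    include_unpromoted: hypothetical stand-in for the session property
        that lets quality checks read past the watermark.
    """
    ordered = sorted(commits)
    if include_unpromoted or last_replication_timestamp is None:
        return ordered
    # Keep only commits at or before the high watermark.
    return [c for c in ordered if c <= last_replication_timestamp]


commits = ["20200903101500", "20200903111200", "20200903120000"]

# Regular user query: the commit after the watermark stays hidden.
regular = visible_commits(commits, "20200903111200")

# Quality-check session: the override exposes every commit.
quality_check = visible_commits(commits, "20200903111200",
                                include_unpromoted=True)
```

With the watermark set to the second commit, a regular query sees the first two commits only, while a quality-check session sees all three.
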
>
> 2. We store this timestamp as a table property in the
> Hive Metastore (HMS). To make it easier to update, we want to extend the
> HiveSyncTool to also update this table property when syncing a Hudi dataset
> to the HMS. The extended tool will take a list of HMSs to be updated
> and will try to update each of them one by one. (With a global HMS
> across all DCs this is just one; but with a region-local HMS per DC, the
> update of all HMSs is not truly transactional, so there is a small window of
> time in which queries can return inconsistent results.) If the tool can't
> update all the HMSs, it will roll back the ones already updated (again, not
> applicable to a global HMS).
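
The update-then-roll-back behaviour described in change 2 could look roughly like the sketch below. It is a toy model, not the extended HiveSyncTool: `InMemoryMetastore`, `MetastoreUpdateError`, and `promote_timestamp` are hypothetical names, and a real metastore client would replace the in-memory dictionary. As in the description above, the update is best effort and not transactional across metastores.

```python
from typing import Dict, List, Optional, Tuple


class MetastoreUpdateError(Exception):
    """Raised when a metastore rejects a property update."""


class InMemoryMetastore:
    """Toy stand-in for one HMS holding table properties."""

    def __init__(self, name: str, props: Optional[Dict[str, str]] = None,
                 fail_on_write: bool = False) -> None:
        self.name = name
        self.props: Dict[str, str] = dict(props or {})
        self.fail_on_write = fail_on_write  # simulate an unreachable HMS

    def get(self, key: str) -> Optional[str]:
        return self.props.get(key)

    def set(self, key: str, value: str) -> None:
        if self.fail_on_write:
            raise MetastoreUpdateError(f"{self.name}: update failed")
        self.props[key] = value

    def restore(self, key: str, value: Optional[str]) -> None:
        # Put back the previous value, or remove the key if it was unset.
        if value is None:
            self.props.pop(key, None)
        else:
            self.props[key] = value


def promote_timestamp(metastores: List[InMemoryMetastore],
                      key: str, new_value: str) -> bool:
    """Update each metastore in turn; undo earlier updates on failure."""
    updated: List[Tuple[InMemoryMetastore, Optional[str]]] = []
    for ms in metastores:
        previous = ms.get(key)
        try:
            ms.set(key, new_value)
        except MetastoreUpdateError:
            # Roll back, most recent first, the ones already updated.
            for done, old in reversed(updated):
                done.restore(key, old)
            return False
        updated.append((ms, previous))
    return True


KEY = "last_replication_timestamp"
dc1 = InMemoryMetastore("dc1", {KEY: "20200903101500"})
dc2 = InMemoryMetastore("dc2", {KEY: "20200903101500"})
dc3 = InMemoryMetastore("dc3", {KEY: "20200903101500"}, fail_on_write=True)

ok = promote_timestamp([dc1, dc2, dc3], KEY, "20200903111200")
```

Here the third metastore fails, so `ok` is false and the first two are restored to their previous value. Between the first update and the rollback, queries against different DCs can still disagree, which is the small inconsistency window mentioned above.
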
>
> We have made the above changes to our internal branch and we are
> successfully running it in production.
>
> Please let us know if you have any feedback on this change.
>
> Sanjay
>
