Hello folks,

We have a use case that requires data in the same Hudi datasets stored in
different DCs ( for high availability / disaster recovery ) to be strongly
consistent, and to pass all quality checks, before they can be consumed by
users who query them. Currently, we have an offline service that runs
quality checks and asynchronously syncs the Hudi datasets between the
different DCs/AZs, but until the sync completes, queries running in
different DCs see inconsistent results. For some of our most critical
datasets this inconsistency is causing significant problems.

We want to support the following use cases: 1) data consistency across
DCs, and 2) running data quality checks post commit.

Our flow looks like this:
1) write a new batch of data at t1
2) regular user queries do not yet see data at t1
3) data quality checks read data at t1 by setting a session property that
includes it ( sketched below )
4) optionally replicate t1 to the other AZs and promote t1 so that regular
user queries see data at t1
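
For illustration, a minimal sketch of the two session modes, assuming the
property names shown ( the override key 'hoodie.qc.include.pending.commits'
is hypothetical, not an existing Hudi config ):

import org.apache.hadoop.mapred.JobConf;

public class SessionVisibility {
  public static void main(String[] args) {
    // Regular query session: reads are capped at the last promoted commit
    // (the cap value comes from the HMS table property, see change 2 below).
    JobConf userConf = new JobConf();
    userConf.set("last_replication_timestamp", "20240101120000");

    // Quality-check session: a hypothetical override key that makes the
    // not-yet-promoted commit t1 visible to the checks.
    JobConf qcConf = new JobConf();
    qcConf.set("last_replication_timestamp", "20240101120000");
    qcConf.setBoolean("hoodie.qc.include.pending.commits", true);
  }
}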

We want to make the following changes to achieve this.

1. Change the HoodieParquetInputFormat to look for a
'last_replication_timestamp' property in the JobConf and use it to create
a new ActiveTimeline that limits the visible commits to those less than or
equal to this timestamp. A session property can override this cap, which
allows us to make the pending data visible for quality checks.
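
A rough sketch of the filtering the input format would apply, assuming it
is handed the list of completed instants; the helper class and the
override key are ours, not existing Hudi code. Hudi instant times are
fixed-width, lexicographically sortable strings, so a plain string
comparison is enough:

import java.util.List;
import java.util.stream.Collectors;
import org.apache.hadoop.mapred.JobConf;

public class ReplicationAwareTimelineFilter {
  // Property written to the table in HMS and propagated into the JobConf.
  static final String LAST_REPLICATION_TS = "last_replication_timestamp";
  // Hypothetical session-level override used by quality-check queries.
  static final String INCLUDE_PENDING = "hoodie.qc.include.pending.commits";

  /**
   * Returns only the commit instants that the current session should see:
   * those with timestamp <= last_replication_timestamp, unless the
   * quality-check override is set or no cap is configured.
   */
  static List<String> visibleInstants(JobConf conf, List<String> completedInstants) {
    String cap = conf.get(LAST_REPLICATION_TS);
    if (cap == null || conf.getBoolean(INCLUDE_PENDING, false)) {
      return completedInstants; // no cap set, or QC override in effect
    }
    return completedInstants.stream()
        .filter(ts -> ts.compareTo(cap) <= 0)
        .collect(Collectors.toList());
  }
}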

2. We store this timestamp as a table property in the Hive Metastore
(HMS). To make it easier to update, we want to extend the HiveSyncTool to
also update this table property when syncing a Hudi dataset to the HMS.
The extended tool takes a list of HMS instances and tries to update each
of them one by one. ( With a single global HMS across all DCs there is
only one to update, but with a region-local HMS per DC the updates are not
truly transactional, so there is a small window during which queries can
return inconsistent results. ) If the tool cannot update all of the HMS
instances, it rolls back the ones it already updated ( again, not
applicable for a global HMS ).
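
A simplified sketch of the multi-HMS update with best-effort rollback
( class and method names are ours; the real change lives in our extended
HiveSyncTool ):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

public class MultiHmsPropertySync {
  static final String LAST_REPLICATION_TS = "last_replication_timestamp";

  /**
   * Bumps last_replication_timestamp on the table in every HMS, one by
   * one. If any update fails, a best-effort rollback restores the previous
   * value on the HMS instances already updated. The gap between individual
   * updates is the window in which queries can see inconsistent results.
   */
  static void promoteCommit(List<HiveConf> hmsConfs, String db, String tbl,
                            String newTs) throws Exception {
    List<HiveMetaStoreClient> updated = new ArrayList<>();
    List<String> previousValues = new ArrayList<>();
    try {
      for (HiveConf conf : hmsConfs) {
        HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
        Table table = client.getTable(db, tbl);
        previousValues.add(table.getParameters().get(LAST_REPLICATION_TS));
        table.getParameters().put(LAST_REPLICATION_TS, newTs);
        client.alter_table(db, tbl, table);
        updated.add(client);
      }
    } catch (Exception e) {
      // Roll back the HMS instances we already bumped
      // (not applicable for a single global HMS).
      for (int i = 0; i < updated.size(); i++) {
        try {
          Table table = updated.get(i).getTable(db, tbl);
          if (previousValues.get(i) == null) {
            table.getParameters().remove(LAST_REPLICATION_TS);
          } else {
            table.getParameters().put(LAST_REPLICATION_TS, previousValues.get(i));
          }
          updated.get(i).alter_table(db, tbl, table);
        } catch (Exception rollbackFailure) {
          // best-effort: log and continue rolling back the rest
        }
      }
      throw e;
    } finally {
      updated.forEach(HiveMetaStoreClient::close);
    }
  }
}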

We have made the above changes in our internal branch and are running them
successfully in production.

Please let us know your feedback on this change.

Sanjay
