Hello folks,

We have a use case where copies of the same Hudi dataset stored in different DCs (for high availability / disaster recovery) must be strongly consistent, and must pass all quality checks, before they can be consumed by the users who query them. Today we have an offline service that runs the quality checks and asynchronously syncs the Hudi datasets between DCs/AZs, but until the sync completes, queries running in the different DCs see inconsistent results. For some of our most critical datasets this inconsistency causes significant problems.
We want to support the following: 1) data consistency across DCs, and 2) data quality checks that run post-commit.

The flow we have in mind is:
1) Write a new batch of data at t1.
2) Regular user queries do not yet see the data at t1.
3) Data quality checks run against t1 by setting a session property that makes it visible.
4) Optionally replicate t1 to the other AZs, then promote t1 so that regular user queries see the data at t1.

To achieve this we want to make the following changes:

1. Change HoodieParquetInputFormat to look for a 'last_replication_timestamp' property in the JobConf and use it to build a new ActiveTimeline that only exposes commits less than or equal to this timestamp. A session property can override this, so the not-yet-promoted data stays visible for the quality checks. (A sketch of this is included below.)

2. Store this timestamp as a table property in the Hive Metastore. To make it easy to update, we want to extend HiveSyncTool so that it also updates this table property when syncing a Hudi dataset to the HMS. The extended tool takes a list of HMSs and updates them one by one. (With a global HMS shared across all DCs this is a single update; with a region-local HMS per DC the updates are not truly transactional, so there is a small window in which queries can return inconsistent results.) If the tool cannot update all of the HMSs, it rolls back the ones it already updated (again not applicable to a global HMS). (A sketch of this is also included below.)
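To make change 1 concrete, here is a minimal sketch (not our actual patch) of how the input format could cap visible commits at the watermark carried in the JobConf. It assumes the 0.x Hudi timeline API, where getInstants() returns a Stream<HoodieInstant>, and relies on the fact that Hudi instant times are fixed-width timestamp strings that sort lexicographically. The helper class name and the session-override key are placeholders for illustration; only 'last_replication_timestamp' is the property described above.

import java.util.List;
import java.util.stream.Collectors;

import org.apache.hadoop.mapred.JobConf;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.table.timeline.HoodieInstant;
import org.apache.hudi.common.table.timeline.HoodieTimeline;

public class ReplicationAwareTimelineHelper {

  // Pushed down into the JobConf from the table property; name is part of this proposal.
  static final String LAST_REPLICATION_TS = "last_replication_timestamp";
  // Hypothetical session-level override so quality checks can see not-yet-promoted commits.
  static final String INCLUDE_PENDING_COMMITS = "hoodie.include.pending.replication.commits";

  /**
   * Returns the completed commit instants a query is allowed to see. Instant times
   * are lexicographically ordered strings, so a plain string compare is enough.
   */
  public static List<HoodieInstant> visibleCommits(JobConf job, HoodieTableMetaClient metaClient) {
    HoodieTimeline completed =
        metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants();

    String lastReplicatedTs = job.get(LAST_REPLICATION_TS);
    boolean includePending = job.getBoolean(INCLUDE_PENDING_COMMITS, false);

    if (lastReplicatedTs == null || includePending) {
      // No replication watermark set, or a quality-check session: behave as today.
      return completed.getInstants().collect(Collectors.toList());
    }
    // Hide every commit newer than the watermark until it has been promoted.
    return completed.getInstants()
        .filter(instant -> instant.getTimestamp().compareTo(lastReplicatedTs) <= 0)
        .collect(Collectors.toList());
  }
}

And here is a sketch of the HiveSyncTool extension in change 2, using the standard HiveMetaStoreClient calls (getTable / putToParameters / alter_table). The class and method names are again illustrative, not our internal code; the point is the update-all-or-roll-back loop across the list of metastores.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

public class MultiMetastorePropertySync {

  static final String LAST_REPLICATION_TS = "last_replication_timestamp";

  /** Sets the replication watermark on every metastore, or on none of them. */
  public static void promoteCommit(List<String> metastoreUris, String db, String table,
                                   String commitTime) throws Exception {
    // Remember the old value per metastore so a partial failure can be rolled back.
    Map<String, String> previousValues = new LinkedHashMap<>();
    try {
      for (String uri : metastoreUris) {
        previousValues.put(uri, setProperty(uri, db, table, commitTime));
      }
    } catch (Exception e) {
      // Best-effort rollback of the metastores that were already updated.
      for (Map.Entry<String, String> entry : previousValues.entrySet()) {
        try {
          setProperty(entry.getKey(), db, table, entry.getValue());
        } catch (Exception ignored) {
          // Rollback itself failed; queries in this DC may stay inconsistent until retried.
        }
      }
      throw e;
    }
  }

  /** Updates the table property on one metastore and returns the previous value. */
  private static String setProperty(String metastoreUri, String db, String table,
                                    String newValue) throws Exception {
    HiveConf conf = new HiveConf();
    conf.setVar(HiveConf.ConfVars.METASTOREURIS, metastoreUri);
    HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
    try {
      Table tbl = client.getTable(db, table);
      String previous = tbl.getParameters().get(LAST_REPLICATION_TS);
      if (newValue == null) {
        // Rolling back a metastore where the property did not exist before.
        tbl.getParameters().remove(LAST_REPLICATION_TS);
      } else {
        tbl.putToParameters(LAST_REPLICATION_TS, newValue);
      }
      client.alter_table(db, table, tbl);
      return previous;
    } finally {
      client.close();
    }
  }
}

Because the per-HMS alter_table calls are independent, the rollback is only best-effort, which is the small inconsistency window for region-local metastores mentioned above; with a single global HMS the update is just one call.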
We have made the above changes in our internal branch and are running them successfully in production. Please let us know your feedback on this change.

Sanjay