Hi folks,

Any thoughts on this? At a high level, we want to control the high-watermark commit visible to queries through a property, so that we can run pre-commit and post-commit hooks. Is this useful for anyone else?

On Thu, Sep 3, 2020 at 11:12 AM Sanjay Sundaresan <[email protected]> wrote:

> Hello folks,
>
> We have a use case where the same hudi datasets, stored in different
> DCs ( for high availability / disaster recovery ), must be strongly
> consistent and must pass all quality checks before they can be
> consumed by the users who query them. Currently, we have an offline
> service that runs quality checks and asynchronously syncs the hudi
> datasets between different DCs/AZs, but until the sync happens,
> queries running in these different DCs see inconsistent results. For
> some of our most critical datasets this inconsistency is causing many
> problems.
>
> We want to support the following use cases:
> 1) data consistency
> 2) adding data quality checks post commit
>
> Our flow looks like this:
> 1) write a new batch of data at t1
> 2) user queries will not see data at t1
> 3) data quality checks are done by setting a session property to
>    include t1
> 4) optionally replicate t1 to other AZs and promote t1 so regular user
>    queries will see data at t1
>
> We want to make the following changes to achieve this.
>
> 1. Change the HoodieParquetInputFormat to look for the
> 'last_replication_timestamp' property in the JobConf and use it to
> create a new ActiveTimeline that limits the commits seen to those
> less than or equal to this timestamp. This can be overridden by a
> session property that will allow us to make such data visible for
> quality checks.
>
> 2. We are storing this timestamp as a table property in
> HiveMetaStore. To make it easier to update, we want to extend the
> HiveSyncTool to also update this table property when syncing a hudi
> dataset to the HMS. The extended tool will take in a list of HMSs to
> be updated and will try to update each of them one by one. ( In the
> case of a global HMS across all DCs this is just one, but if there is
> a region-local HMS per DC, the update of all HMSs is not truly
> transactional, so there is a small window of time where queries can
> return inconsistent results. ) If the tool can't update all the HMSs,
> it will roll back the updated ones ( again, not applicable for a
> global HMS ).
>
> We have made the above changes to our internal branch and we are
> successfully running it in production.
>
> Please let us know if you have feedback on this change.
>
> Sanjay
>
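
The read-side filter described in change 1 of the quoted proposal can be sketched roughly as follows. This is a minimal illustration in plain Java collections, not Hudi's actual API: it does not touch HoodieParquetInputFormat, JobConf, or ActiveTimeline. The key 'last_replication_timestamp' comes from the proposal; the override key `include_unreplicated_commits` is a hypothetical name, since the proposal does not name the session property. It also assumes commit times are fixed-width timestamp strings, so string comparison matches chronological order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ReplicationVisibilitySketch {
    // Property name taken from the proposal.
    static final String LAST_REPLICATION_TS = "last_replication_timestamp";
    // Hypothetical name for the session override; the proposal does not name it.
    static final String INCLUDE_UNREPLICATED = "include_unreplicated_commits";

    /**
     * Returns the commit times a query should see.
     * Assumes commit times are fixed-width timestamp strings, so
     * lexicographic order equals chronological order.
     */
    static List<String> visibleCommits(List<String> commitTimes, Map<String, String> conf) {
        String limit = conf.get(LAST_REPLICATION_TS);
        // parseBoolean(null) is false, so a missing override means "not set".
        boolean override = Boolean.parseBoolean(conf.get(INCLUDE_UNREPLICATED));
        if (limit == null || override) {
            // No high watermark configured, or a quality-check session
            // explicitly asked to see unpromoted commits.
            return new ArrayList<>(commitTimes);
        }
        List<String> visible = new ArrayList<>();
        for (String commitTime : commitTimes) {
            // Keep only commits at or before the promoted timestamp.
            if (commitTime.compareTo(limit) <= 0) {
                visible.add(commitTime);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<String> commits =
            Arrays.asList("20200903100000", "20200903101500", "20200903103000");

        Map<String, String> userConf = new HashMap<>();
        userConf.put(LAST_REPLICATION_TS, "20200903101500");
        System.out.println("regular query sees: " + visibleCommits(commits, userConf));

        Map<String, String> checkConf = new HashMap<>(userConf);
        checkConf.put(INCLUDE_UNREPLICATED, "true");
        System.out.println("quality check sees: " + visibleCommits(commits, checkConf));
    }
}
```

In this sketch a regular query stops at the promoted timestamp, while a quality-check session that sets the override sees the newest batch as well; promoting t1 then amounts to advancing the stored timestamp.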
