Re: [DISCUSS] enable cross AZ consistency and quality checks of hudi datasets
+1 on the proposal. Thanks, Sanjay, for describing this in detail. This feature can also help eliminate file listing on HDFS completely for Hudi metadata, for use cases that are very sensitive to file listing.

Thanks,
Nishith

On Wed, Sep 9, 2020 at 4:59 PM Vinoth Chandar wrote:
Re: [DISCUSS] enable cross AZ consistency and quality checks of hudi datasets
Hi Sanjay,

Overall, the two proposals sound reasonable to me. Thanks for describing them so well. A general comment: it seems like you are implementing multi-AZ replication by matching commit times across AZs?

I do want to name these properties to be consistent with other Hudi terminology, but we can work that out on the PR itself.

> If the tool can't update all the HMS it will rollback the updated ones

The tool can also fail midway, after updating one of the HMS, so we need to handle recovery etc.?

Thanks,
Vinoth

On Tue, Sep 8, 2020 at 10:45 AM Satish Kotha wrote:
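The failure mode raised above (the tool dying after updating only some metastores) can be made concrete with a small sketch. This is an illustrative Python model, not the real HiveSyncTool API; the names (sync_replication_timestamp, FakeHms) are hypothetical. It shows the one-by-one update with best-effort rollback, and why the scheme is not truly transactional.

```python
# Hypothetical model of updating the 'last_replication_timestamp' table
# property across several metastores, rolling back the ones already updated
# if any update fails. Caveats from the thread still apply: a query racing
# with this loop can see mixed values across metastores, and a crash during
# the rollback itself still needs separate recovery handling.

def sync_replication_timestamp(metastores, new_ts,
                               prop="last_replication_timestamp"):
    """Set the property on every metastore; roll back on failure.

    Returns True if all updates succeeded, False if a rollback happened.
    """
    updated = []  # (metastore, previous value) pairs, kept for rollback
    for hms in metastores:
        prev = hms.get_property(prop)
        try:
            hms.set_property(prop, new_ts)
        except Exception:
            # Best-effort rollback of the metastores updated so far.
            for done, old in updated:
                done.set_property(prop, old)
            return False
        updated.append((hms, prev))
    return True


class FakeHms:
    """Stand-in for a metastore client, for this sketch only."""
    def __init__(self):
        self.props = {}

    def get_property(self, key):
        return self.props.get(key)

    def set_property(self, key, value):
        self.props[key] = value
```

With a single global HMS the list has one element and the rollback branch is irrelevant; with one HMS per DC, the window between the first and last successful `set_property` is exactly the inconsistency window described in the proposal.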
Re: [DISCUSS] enable cross AZ consistency and quality checks of hudi datasets
Hi folks,

Any thoughts on this? At a high level, we want to control the high-watermark commit through a property and to perform pre-commit and post-commit hooks. Is this useful for anyone else?

On Thu, Sep 3, 2020 at 11:12 AM Sanjay Sundaresan wrote:
[DISCUSS] enable cross AZ consistency and quality checks of hudi datasets
Hello folks,

We have a use case where data in the same Hudi datasets, stored in different DCs (for high availability / disaster recovery), must be strongly consistent and pass all quality checks before being consumed by the users who query them. Currently, we have an offline service that runs quality checks and asynchronously syncs the Hudi datasets between the DCs/AZs, but until the sync completes, queries running in the different DCs see inconsistent results. For some of our most critical datasets, this inconsistency causes serious problems.

We want to support the following use cases: 1) data consistency, and 2) adding data quality checks post-commit.

Our flow looks like this:
1) write a new batch of data at t1
2) user queries do not see data at t1
3) data quality checks run with a session property set to include t1
4) optionally replicate t1 to other AZs, then promote t1 so regular user queries see data at t1

We want to make the following changes to achieve this:

1. Change HoodieParquetInputFormat to look for a 'last_replication_timestamp' property in the JobConf and use it to create a new ActiveTimeline that limits the visible commits to those less than or equal to this timestamp. A session property can override this, which lets us make such data visible for quality checks.

2. We store this timestamp as a table property in the HiveMetaStore. To make it easier to update, we want to extend HiveSyncTool to also update this table property when syncing a Hudi dataset to the HMS. The extended tool will take a list of HMSs to update and will try to update each of them one by one. (With a global HMS across all DCs this is just one; but with a region-local HMS per DC, the update of all HMSs is not truly transactional, so there is a small window of time in which queries can return inconsistent results.) If the tool can't update all the HMSs, it will roll back the ones it already updated (again, not applicable for a global HMS).

We have made the above changes in our internal branch and are successfully running them in production.

Please let us know your feedback about this change.

Sanjay
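The commit filtering in change 1 can be modeled with a minimal sketch. This is hypothetical Python, not Hudi's actual Java API: in Hudi it would live in HoodieParquetInputFormat, reading 'last_replication_timestamp' from the JobConf and building a restricted ActiveTimeline.

```python
# Minimal model of the proposed visibility rule: regular queries only see
# commits at or before the promoted high-watermark timestamp, while a
# session-level override (used by quality-check queries) also exposes
# newer, not-yet-promoted commits.

def visible_commits(all_commits, last_replication_ts, include_pending=False):
    """Return the commits a query should see.

    all_commits: commit instants as strings (Hudi's yyyyMMddHHmmss-style
    instants compare correctly as strings).
    last_replication_ts: the promoted high-watermark, or None if the table
    property is unset (no filtering).
    include_pending: the session-level override for quality checks.
    """
    if last_replication_ts is None or include_pending:
        return list(all_commits)
    return [c for c in all_commits if c <= last_replication_ts]
```

For example, after writing t1 but before promoting it, a regular query whose `last_replication_ts` is the previous commit does not see t1, while a quality-check session with `include_pending=True` does; promoting t1 (step 4 of the flow) then makes it visible to everyone.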