[ https://issues.apache.org/jira/browse/HUDI-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Raymond Xu updated HUDI-1212:
-----------------------------
    Component/s: table-service
                     (was: incremental-query)
                     (was: writer-core)

> GDPR: Support deletions of records on all versions of Hudi dataset
> -------------------------------------------------------------------
>
>                 Key: HUDI-1212
>                 URL: https://issues.apache.org/jira/browse/HUDI-1212
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: table-service
>    Affects Versions: 0.9.0
>            Reporter: Balaji Varadarajan
>            Priority: Major
>             Fix For: 0.11.0
>
>
> Incremental Pull should also stop returning a record from the historical
> dataset once it has been deleted from the latest snapshot.
>
> Context from Mailing list email:
>
> Hello,
> I am Siva's colleague and I am working on the problem below as well.
> I would like to describe what we are trying to achieve with Hudi as well as
> our current way of working and our GDPR and "Right To Be Forgotten" (RTBF)
> compliance policies.
> Our requirements:
> - We wish to apply a strict interpretation of the RTBF. In other words, when
>   we remove a person's data, it should be removed throughout the historical
>   data and not just from the latest snapshot.
> - We wish to use Hudi to reduce our storage requirements using upserts and
>   don't want to have duplicates between commits.
> - We wish to retain history for persons who have not requested to be
>   forgotten and therefore we do not want to delete commit files from the
>   history as some have proposed.
> We have tried a couple of solutions, but so far without success:
> - Replay the data omitting the data of the persons who have requested to be
>   forgotten. We wanted to manipulate the commit times to rebuild the history.
>   We found that we couldn't manipulate the commit times and retain the
>   history.
> - Replay the data omitting the data of the persons who have requested to be
>   forgotten, but writing to a date-based partition folder using the
>   "partitionpath" parameter.
>   We found that upserts across the partitionpath folders do not ignore data
>   that is unchanged between 2 commit dates, unlike the default commit file
>   system, so we will not save on storage or speed up our processing using
>   this technique.
> So basically we would like to find a way to apply a strict RTBF under GDPR,
> maintain history and time travel (large history), and save storage space
> using Hudi.
> Can anyone see a way to achieve this?
> Kind Regards,
> David Rosalia
>
>

--
This message was sent by Atlassian Jira
(v8.20.1#820001)