[ https://issues.apache.org/jira/browse/HUDI-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinoth Chandar updated HUDI-1357:
---------------------------------
    Fix Version/s: 0.7.0

> Add a check to ensure there is no data loss when writing to HUDI dataset
> ------------------------------------------------------------------------
>
>                 Key: HUDI-1357
>                 URL: https://issues.apache.org/jira/browse/HUDI-1357
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Prashant Wason
>            Assignee: Prashant Wason
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.7.0
>
>
> When updating a HUDI dataset with updates + deletes, records from existing base files are read, merged with the updates + deletes, and finally written to newer base files.
> It should hold that:
> count(records_in_new_base_file) + num_deletes = count(records_in_old_base_file)
> In our internal production deployment, we hit an issue where, due to a parquet bug in schema handling, reading existing records returned null data. This led to many records from the older parquet file not being written out to the newer parquet file.
> This check ensures that such issues do not lead to silent data loss by triggering an exception when the expected record counts do not match. The check is off by default and is controlled through a HoodieWriteConfig parameter.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
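A minimal sketch of the invariant described above, as a post-write validation. The class and method names here (RecordCountValidator, validateRecordCounts) are illustrative only and are not the actual Hudi API; the real implementation lives in the Hudi write path and is gated by a HoodieWriteConfig flag.

```java
// Hypothetical sketch of the record-count check described in the issue.
// Invariant: records written to the new base file plus records deleted
// must equal the records read from the old base file.
public class RecordCountValidator {

    /** Thrown when the merge unexpectedly drops records (illustrative name). */
    public static class DataLossException extends RuntimeException {
        public DataLossException(String msg) {
            super(msg);
        }
    }

    /**
     * Validates that no records were lost during the merge.
     *
     * @param oldFileRecords records read from the older base file
     * @param newFileRecords records written to the newer base file
     * @param numDeletes     records deleted in this write
     * @param checkEnabled   mirrors the (off-by-default) config flag
     */
    public static void validateRecordCounts(long oldFileRecords,
                                            long newFileRecords,
                                            long numDeletes,
                                            boolean checkEnabled) {
        if (!checkEnabled) {
            return; // check is off by default
        }
        if (newFileRecords + numDeletes != oldFileRecords) {
            throw new DataLossException(String.format(
                "Record count mismatch: old=%d, new=%d, deletes=%d",
                oldFileRecords, newFileRecords, numDeletes));
        }
    }

    public static void main(String[] args) {
        // 100 records read, 10 deleted, 90 written out: consistent.
        validateRecordCounts(100, 90, 10, true);

        // 100 records read, 10 deleted, only 80 written out: data loss.
        boolean caught = false;
        try {
            validateRecordCounts(100, 80, 10, true);
        } catch (DataLossException e) {
            caught = true;
        }
        System.out.println(caught ? "mismatch detected" : "missed");
    }
}
```

When disabled (the default), the check is a no-op, so existing write pipelines are unaffected until the flag is turned on.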