[ https://issues.apache.org/jira/browse/HUDI-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinoth Chandar updated HUDI-1357:
---------------------------------
    Fix Version/s: 0.7.0

> Add a check to ensure there is no data loss when writing to HUDI dataset
> ------------------------------------------------------------------------
>
>                 Key: HUDI-1357
>                 URL: https://issues.apache.org/jira/browse/HUDI-1357
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Prashant Wason
>            Assignee: Prashant Wason
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.7.0
>
>
> When updating a HUDI dataset with updates + deletes, records from existing base files are read, merged with the updates + deletes, and finally written to newer base files.
> It should hold that:
> count(records_in_new_base_file) + num_deletes = count(records_in_old_base_file)
> In our internal production deployment, we hit an issue where, due to a parquet bug in schema handling, reading existing records returned null data. This led to many records from the older parquet file not being written out to the newer parquet file.
> This check ensures that such issues do not lead to silent data loss by triggering an exception when the expected record counts do not match. The check is off by default and is controlled through a HoodieWriteConfig parameter.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
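A minimal sketch of the invariant described above, as a post-write validation. The class and method names here (RecordCountValidator, validateRecordCounts) are illustrative only and are not the actual Hudi API; the real implementation lives in the Hudi write path and is gated by a HoodieWriteConfig flag.

```java
// Hypothetical sketch of the record-count check described in the issue.
// Invariant: records written to the new base file plus records deleted
// must equal the records read from the old base file.
public class RecordCountValidator {

    /** Thrown when the merge unexpectedly drops records (illustrative name). */
    public static class DataLossException extends RuntimeException {
        public DataLossException(String msg) {
            super(msg);
        }
    }

    /**
     * Validates that no records were lost during the merge.
     *
     * @param oldFileRecords records read from the older base file
     * @param newFileRecords records written to the newer base file
     * @param numDeletes     records deleted in this write
     * @param checkEnabled   mirrors the (off-by-default) config flag
     */
    public static void validateRecordCounts(long oldFileRecords,
                                            long newFileRecords,
                                            long numDeletes,
                                            boolean checkEnabled) {
        if (!checkEnabled) {
            return; // check is off by default
        }
        if (newFileRecords + numDeletes != oldFileRecords) {
            throw new DataLossException(String.format(
                "Record count mismatch: old=%d, new=%d, deletes=%d",
                oldFileRecords, newFileRecords, numDeletes));
        }
    }

    public static void main(String[] args) {
        // 100 records read, 10 deleted, 90 written out: consistent.
        validateRecordCounts(100, 90, 10, true);

        // 100 records read, 10 deleted, only 80 written out: data loss.
        boolean caught = false;
        try {
            validateRecordCounts(100, 80, 10, true);
        } catch (DataLossException e) {
            caught = true;
        }
        System.out.println(caught ? "mismatch detected" : "missed");
    }
}
```

When disabled (the default), the check is a no-op, so existing write pipelines are unaffected until the flag is turned on.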