[ 
https://issues.apache.org/jira/browse/HUDI-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-4287:
---------------------------------
    Fix Version/s: 0.14.1
                       (was: 0.14.0)

> Optimize Flink checkpoint meta mechanism to fix mistaken pending instants
> -------------------------------------------------------------------------
>
>                 Key: HUDI-4287
>                 URL: https://issues.apache.org/jira/browse/HUDI-4287
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: flink
>            Reporter: Shizhi Chen
>            Assignee: Shizhi Chen
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.14.1
>
>         Attachments: image-2022-06-27-19-42-14-676.png, 
> image-2022-06-27-19-55-20-210.png, image-2022-06-27-20-07-55-984.png, 
> image-2022-06-27-20-11-47-939.png, image-2022-06-27-20-29-49-897.png
>
>
> *Problem review*
> CkpMetadata was introduced into the flink module to reduce the burden on the 
> timeline, but its mechanism currently lacks a state that corresponds to 
> rolled-back instants. This may result in commit/delta commit instants being 
> deleted, so that StreamWriteOperatorCoordinator (meta end) and the write 
> function (data end) are no longer coordinated correctly.
> As a result, data files are deleted by mistake.
> The situation is especially easy to reproduce when 
> StreamWriteOperatorCoordinator spends a long time scheduling table services 
> between committing an instant and initializing the next one after restoring 
> from a checkpoint.
>  
> *Stable Reproduction Procedure*
>  * a. Before starting a job, modify 
> StreamWriteOperatorCoordinator#notifyCheckpointComplete as follows:
> !image-2022-06-27-19-42-14-676.png|width=479,height=293! 
> The change does nothing except mock long-running table services so that the 
> issue reproduces quickly.
>  * b. Start a simple flink hudi job, such as an append job, and kill it while 
> the 2nd checkpoint is INFLIGHT.
>  * c. Restart the job from the checkpoint. It is sure to hit the case after 
> another 2 checkpoints, possibly accompanied by a FileNotFoundException:
> !image-2022-06-27-20-29-49-897.png|width=503,height=386! 
> More importantly, the lack of coordination can be observed:
> !image-2022-06-27-20-07-55-984.png|width=517,height=109! 
> The screenshot above shows that at 2022-05-31 16:36 the instant should be 
> 20220531163135119, which was committed by StreamWriteOperatorCoordinator as 
> the meta end.
> !image-2022-06-27-20-11-47-939.png|width=517,height=155! 
> At the same time, the data files are written with the wrong base commit 
> instant, 20220531161923191. That instant was deleted during the rollback in 
> step c. because it was incomplete, and it should also have been evicted from 
> ckp_meta.
>  
> *Solution*
> The solution is to extend the mechanism with a CANCELLED CkpMessage state. 
> It has the highest priority and corresponds to the DELETE of an instant 
> during a rollback action.
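
The proposed state priority can be illustrated with a minimal sketch. It is a simplified model, not Hudi's implementation: CkpState, CkpMetaSketch, record and the numeric priorities are hypothetical names chosen for illustration, and only the CANCELLED state and its "highest priority" rule come from the description above. The sketch assumes that instants are fixed-width timestamp strings, so lexical order equals time order.

```java
// Hypothetical model of the proposal; none of these types are Hudi's actual classes.
enum CkpState {
    INFLIGHT(0),   // instant started, data is still being written
    COMPLETED(1),  // instant committed
    CANCELLED(2);  // instant deleted by a rollback; overrides every other state

    final int priority;

    CkpState(int priority) {
        this.priority = priority;
    }
}

final class CkpMetaSketch {
    // Instants sorted by timestamp (assumption: fixed-width strings).
    private final java.util.TreeMap<String, CkpState> states = new java.util.TreeMap<>();

    // Record a message for an instant; the higher-priority state always wins,
    // so a CANCELLED instant can never become INFLIGHT or COMPLETED again.
    void record(String instant, CkpState state) {
        states.merge(instant, state,
            (current, incoming) -> incoming.priority > current.priority ? incoming : current);
    }

    // Walk from the newest instant to the oldest, skipping cancelled ones.
    // The newest remaining instant is pending only if it is still INFLIGHT.
    String lastPendingInstant() {
        for (java.util.Map.Entry<String, CkpState> entry : states.descendingMap().entrySet()) {
            if (entry.getValue() == CkpState.CANCELLED) {
                continue;
            }
            return entry.getValue() == CkpState.INFLIGHT ? entry.getKey() : null;
        }
        return null;
    }
}
```

In this model, once a rollback records CANCELLED for 20220531161923191, a late INFLIGHT message for the same instant is ignored, so the data end would not pick that instant up as its base instant.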



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
