[ https://issues.apache.org/jira/browse/HUDI-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Prashant Wason updated HUDI-4287:
---------------------------------
    Fix Version/s: 0.14.1
                       (was: 0.14.0)

> Optimize Flink checkpoint meta mechanism to fix mistaken pending instants
> -------------------------------------------------------------------------
>
>                 Key: HUDI-4287
>                 URL: https://issues.apache.org/jira/browse/HUDI-4287
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: flink
>            Reporter: Shizhi Chen
>            Assignee: Shizhi Chen
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.14.1
>
>         Attachments: image-2022-06-27-19-42-14-676.png,
> image-2022-06-27-19-55-20-210.png, image-2022-06-27-20-07-55-984.png,
> image-2022-06-27-20-11-47-939.png, image-2022-06-27-20-29-49-897.png
>
>
> *Problem review*
> CkpMetadata was introduced into the Flink module to reduce the burden on the
> timeline, but its mechanism currently lacks a corresponding status for
> rollback instants. This may result in the deletion of commit/delta commit
> instants, so that StreamWriteOperatorCoordinator (meta end) and the write
> function (data end) are not coordinated correctly.
> In the end, data files are deleted by mistake.
> This is especially easy to reproduce when StreamWriteOperatorCoordinator
> spends a long time scheduling table services between the commit and the
> initialization of the next instant after restoring from a checkpoint.
>
> *Stable Reproduction Procedure*
> * a. Before starting a job, modify
> StreamWriteOperatorCoordinator#notifyCheckpointComplete as follows:
> !image-2022-06-27-19-42-14-676.png|width=479,height=293!
> The change does nothing but mock potentially long-running table services
> for fast reproduction.
> * b. Start a simple Flink Hudi job, such as an append job, and kill it while
> the 2nd checkpoint is INFLIGHT.
> * c. Restart the job from the checkpoint. It is sure to hit the case after
> another 2 checkpoints, which may be accompanied by a FileNotFoundException:
> !image-2022-06-27-20-29-49-897.png|width=503,height=386!
> More importantly, the lack of coordination can be observed directly:
> !image-2022-06-27-20-07-55-984.png|width=517,height=109!
> The screenshot above shows that the instant should be 20220531163135119 at
> 2022-05-31 16:36, which is the instant committed by
> StreamWriteOperatorCoordinator as the meta end.
> !image-2022-06-27-20-11-47-939.png|width=517,height=155!
> At the same time, the data files are written with the wrong base commit
> instant, 20220531161923191. That instant was deleted during the rollback in
> step c because it was incomplete, and it should also have been evicted from
> ckp_meta.
>
> *Solution*
> The solution is to extend the mechanism with a CANCELLED CkpMessage state,
> which has the highest priority and corresponds to the DELETE of an instant
> during the rollback action.
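
The idea behind the proposed CANCELLED state can be illustrated with a minimal sketch. This is not the Hudi implementation: the class `CkpStateSketch`, the nested `Msg` type, the helper methods, and the exact set of states (`INFLIGHT`, `COMPLETED`, `CANCELLED`) are assumptions made for illustration. Only the name CANCELLED, its highest priority, and the two instant times come from the description above.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical sketch, not the actual Hudi CkpMetadata/CkpMessage code.
class CkpStateSketch {
  /** Assumed states; a constant declared later has higher priority. */
  enum State { INFLIGHT, COMPLETED, CANCELLED }

  /** One hypothetical checkpoint message: an instant time plus a state. */
  static final class Msg {
    final String instant;
    final State state;

    Msg(String instant, State state) {
      this.instant = instant;
      this.state = state;
    }
  }

  /** Collapses all messages per instant, keeping the highest-priority state. */
  static TreeMap<String, State> resolve(List<Msg> messages) {
    TreeMap<String, State> resolved = new TreeMap<>();
    for (Msg m : messages) {
      // Enum.compareTo orders by declaration order, so CANCELLED always wins.
      resolved.merge(m.instant, m.state, (a, b) -> a.compareTo(b) >= 0 ? a : b);
    }
    return resolved;
  }

  /**
   * Returns the latest instant that is still INFLIGHT after resolution.
   * Instant times are fixed-width timestamps, so string order is time order.
   */
  static Optional<String> lastPendingInstant(List<Msg> messages) {
    return resolve(messages).descendingMap().entrySet().stream()
        .filter(e -> e.getValue() == State.INFLIGHT)
        .map(Map.Entry::getKey)
        .findFirst();
  }

  public static void main(String[] args) {
    List<Msg> messages = List.of(
        new Msg("20220531161923191", State.INFLIGHT),
        new Msg("20220531161923191", State.CANCELLED), // rolled back
        new Msg("20220531163135119", State.INFLIGHT));
    System.out.println(lastPendingInstant(messages));
  }
}
```

In this sketch, a rolled-back instant can never be reported as pending, because its CANCELLED message overrides any earlier INFLIGHT message for the same instant. The data end would therefore not pick up 20220531161923191 as its base commit instant.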