Joyce.Li created FLINK-25401:
--------------------------------

Summary: DefaultCompletedCheckpointStore may not return the latest CompletedCheckpoint after JM failover.
Key: FLINK-25401
URL: https://issues.apache.org/jira/browse/FLINK-25401
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing
Reporter: Joyce.Li

At present, when {{DefaultCompletedCheckpointStore}} is recovered, the retrieved {{CompletedCheckpoint}} entries are sorted by name in character (lexicographic) order:

{code:java}
// Get all there is first.
final List<Tuple2<RetrievableStateHandle<CompletedCheckpoint>, String>> initialCheckpoints =
        checkpointStateHandleStore.getAllAndLock();

// Sort checkpoints by name.
initialCheckpoints.sort(Comparator.comparing(o -> o.f1));
{code}

Consider the following situation: we retain 3 {{CompletedCheckpoint}} instances whose IDs are 99, 100 and 101. After a JM failover, {{DefaultCompletedCheckpointStore}} restores these three checkpoints, but because the names are compared as strings, the order becomes 100, 101, 99. When the state of the job is restored, the {{CompletedCheckpoint}} with ID 99 is taken as the latest one and used for the restore, which causes an error.

I think we should use {{CheckpointStoreUtil#nameToCheckpointID}} to convert the {{String}} to a {{long}} before sorting.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
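
The ordering problem described in this issue can be reproduced in isolation. The following is a minimal sketch and not Flink code: the {{checkpoint-}} name prefix, the class {{CheckpointOrderSketch}} and the helper {{nameToCheckpointId}} are hypothetical stand-ins, assuming only that a checkpoint name embeds its ID as an unpadded decimal number. It compares sorting by name with sorting by the parsed ID.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class CheckpointOrderSketch {

    // Hypothetical name prefix, used for illustration only.
    static final String PREFIX = "checkpoint-";

    // Hypothetical stand-in for CheckpointStoreUtil#nameToCheckpointID:
    // strips the prefix and parses the remainder as a long.
    static long nameToCheckpointId(String name) {
        return Long.parseLong(name.substring(PREFIX.length()));
    }

    // Behaviour described in the issue: sort by character order of the name.
    static List<String> sortByName(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    // Proposed behaviour: convert each name to its numeric ID before comparing.
    static List<String> sortById(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.comparingLong(CheckpointOrderSketch::nameToCheckpointId));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> names =
                Arrays.asList("checkpoint-99", "checkpoint-100", "checkpoint-101");

        // Character order puts 99 last, so it would be treated as the latest.
        System.out.println(sortByName(names));

        // Numeric order keeps 101 last, as expected.
        System.out.println(sortById(names));
    }
}
```

With the three names above, the name-based sort yields 100, 101, 99, while the ID-based sort yields 99, 100, 101. If names were zero-padded to a fixed width, both orders would coincide; the sketch assumes they are not.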