Joyce.Li created FLINK-25401:
--------------------------------

             Summary: DefaultCompletedCheckpointStore may not return the latest 
CompletedCheckpoint after JM failover.
                 Key: FLINK-25401
                 URL: https://issues.apache.org/jira/browse/FLINK-25401
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
            Reporter: Joyce.Li


At present, when we recover {{DefaultCompletedCheckpointStore}}, we sort the 
{{CompletedCheckpoint}} entries by name, in character (lexicographic) order.
{code:java}
// Get all there is first.
final List<Tuple2<RetrievableStateHandle<CompletedCheckpoint>, String>> initialCheckpoints =
        checkpointStateHandleStore.getAllAndLock();

// Sort checkpoints by name.
initialCheckpoints.sort(Comparator.comparing(o -> o.f1));{code}
This breaks when the checkpoint IDs have different numbers of digits. For 
example, suppose we retain 3 {{CompletedCheckpoint}} entries with IDs 99, 100, 
and 101. After a JM failover, {{DefaultCompletedCheckpointStore}} recovers all 
three, but in the order 100, 101, 99. When we restore the state of the job, the 
{{CompletedCheckpoint}} with ID 99 is treated as the latest one and used for 
the restore, which causes an error.
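
The ordering problem can be reproduced outside Flink with plain strings. This 
is only a minimal sketch: the {{checkpoint-<id>}} names and the class 
{{LexicographicSortDemo}} are hypothetical stand-ins, not the names Flink 
actually stores.
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class LexicographicSortDemo {

    // Sorts the names as plain strings, which is what
    // Comparator.comparing(o -> o.f1) does for the String field f1.
    static List<String> sortByName(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    public static void main(String[] args) {
        // Hypothetical checkpoint names for IDs 99, 100 and 101.
        List<String> names =
                Arrays.asList("checkpoint-99", "checkpoint-100", "checkpoint-101");

        // Prints [checkpoint-100, checkpoint-101, checkpoint-99]
        System.out.println(sortByName(names));
    }
}
{code}
Because '1' compares lower than '9', both three-digit IDs sort ahead of 99, so 
the last element of the list is not the newest checkpoint.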

I think we should use {{CheckpointStoreUtil#nameToCheckpointID}} to convert the 
{{String}} to {{long}} before sorting.
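
A rough sketch of the idea, independent of Flink: the local 
{{nameToCheckpointId}} helper below is a hypothetical stand-in for 
{{CheckpointStoreUtil#nameToCheckpointID}}, and it assumes names of the form 
{{checkpoint-<id>}}, which may not match the real naming scheme.
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class NumericSortSketch {

    // Hypothetical stand-in for CheckpointStoreUtil#nameToCheckpointID:
    // parses the numeric suffix after the last '-' in the name.
    static long nameToCheckpointId(String name) {
        return Long.parseLong(name.substring(name.lastIndexOf('-') + 1));
    }

    // Sorts by the parsed checkpoint ID instead of by the raw string.
    static List<String> sortById(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.comparingLong(NumericSortSketch::nameToCheckpointId));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> names =
                Arrays.asList("checkpoint-100", "checkpoint-101", "checkpoint-99");

        // Prints [checkpoint-99, checkpoint-100, checkpoint-101]
        System.out.println(sortById(names));
    }
}
{code}
With the numeric comparison, the last element is the checkpoint with the 
highest ID regardless of how many digits the IDs have.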


