Liu created FLINK-25486:
---------------------------

             Summary: Perjob can not recover from checkpoint when zookeeper 
leader changes
                 Key: FLINK-25486
                 URL: https://issues.apache.org/jira/browse/FLINK-25486
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Checkpointing
            Reporter: Liu


When high-availability.zookeeper.client.tolerate-suspended-connections is left 
at its default of false, the appMaster fails over as soon as the zk leader 
changes. In this case, the old appMaster cleans up all the zk info, so the new 
appMaster cannot recover from the latest checkpoint.

The process is as follows:
 # Start a perJob application.
 # Kill zk's leader node, which causes the perJob to suspend.
 # In MiniDispatcher's function jobReachedTerminalState, shutDownFuture is 
completed with UNKNOWN.
 # The future is passed on to ClusterEntrypoint, where the shutdown method is 
called with cleanupHaData set to true.
 # The zk data is cleaned up and the process exits.
 # The new appMaster does not find any checkpoint to start from, and the state 
is lost.

Since the job can recover automatically when the zk leader changes, it is 
reasonable to keep zk info for the coming recovery.
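
To make the proposal concrete, here is a minimal, self-contained sketch of the 
decision described above. It is a toy model, not Flink's actual code: the class 
HaCleanupSketch, its enums, and the methods toApplicationStatus, 
cleanupHaDataCurrent and cleanupHaDataProposed are hypothetical names 
introduced only for illustration, and the zk data is modeled as a plain map. 
The assumption is that a suspended job ends up with status UNKNOWN, as in the 
steps above, and that the fix would be to skip the HA cleanup for that status.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the shutdown decision; all names here are hypothetical
// stand-ins and do not mirror Flink's real classes or signatures.
public class HaCleanupSketch {

    enum JobStatus { FINISHED, CANCELED, FAILED, SUSPENDED }

    enum ApplicationStatus { SUCCEEDED, CANCELED, FAILED, UNKNOWN }

    // A suspended job is not globally terminal, so it maps to UNKNOWN.
    static ApplicationStatus toApplicationStatus(JobStatus status) {
        switch (status) {
            case FINISHED: return ApplicationStatus.SUCCEEDED;
            case CANCELED: return ApplicationStatus.CANCELED;
            case FAILED:   return ApplicationStatus.FAILED;
            default:       return ApplicationStatus.UNKNOWN;
        }
    }

    // Behaviour described in the report: HA data is always cleaned up.
    static boolean cleanupHaDataCurrent(ApplicationStatus status) {
        return true;
    }

    // Assumed fix: keep HA data when the outcome of the job is unknown.
    static boolean cleanupHaDataProposed(ApplicationStatus status) {
        return status != ApplicationStatus.UNKNOWN;
    }

    // Simulates the exit of the old appMaster against a map standing in for zk.
    static Map<String, String> shutDown(Map<String, String> zk, boolean cleanupHaData) {
        Map<String, String> remaining = new HashMap<>(zk);
        if (cleanupHaData) {
            remaining.clear();
        }
        return remaining;
    }

    public static void main(String[] args) {
        Map<String, String> zk = new HashMap<>();
        zk.put("latest-checkpoint", "chk-42"); // made-up checkpoint pointer

        ApplicationStatus status = toApplicationStatus(JobStatus.SUSPENDED);
        System.out.println("current:  " + shutDown(zk, cleanupHaDataCurrent(status)));
        System.out.println("proposed: " + shutDown(zk, cleanupHaDataProposed(status)));
    }
}
```

With the current rule the map is empty after shutdown, so the new appMaster has 
nothing to restore from. With the assumed rule the checkpoint pointer survives 
a suspension, while jobs that really finished, failed or were canceled are 
still cleaned up.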

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)