ming li created FLINK-18451:
-------------------------------

             Summary: Flink HA on yarn may appear TaskManager double running 
when HA is restored
                 Key: FLINK-18451
                 URL: https://issues.apache.org/jira/browse/FLINK-18451
             Project: Flink
          Issue Type: Bug
          Components: Deployment / YARN
    Affects Versions: 1.9.0
            Reporter: ming li


We found that when a NodeManager is lost, YARN's ResourceManager starts a new 
JobManager, which registers itself as the leader in ZooKeeper. The existing 
TaskManagers discover the new JobManager through ZooKeeper and close their 
connection to the old JobManager. At that point, all tasks on those 
TaskManagers fail. The new JobManager then recovers the job directly from the 
latest checkpoint.

However, if a TaskManager's connection to ZooKeeper is abnormal during the 
recovery process, it does not register with the new JobManager in time. Its 
tasks keep running (assuming a task can run independently inside the 
TaskManager) until one of the following timeouts expires:
1. The ZooKeeper connection timeout
2. The heartbeat timeout with the JobManager/ResourceManager
If HA recovery completes before then, some tasks run twice at the same time: 
the old attempt on the stale TaskManager and the restored attempt started by 
the new JobManager.
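
To make the size of that window concrete, here is a minimal sketch of the 
arithmetic. It is not Flink code: the class and method names are hypothetical, 
the durations are illustrative, and it assumes that both timeouts start 
counting when the old JobManager is lost and that the stale TaskManager stops 
its tasks as soon as the first of them fires.

```java
import java.time.Duration;

/** Hypothetical helper for reasoning about the issue; not part of Flink. */
final class DoubleRunWindow {

    /**
     * Returns how long an old task attempt and its restored replacement overlap.
     * Assumption: the stale TaskManager stops its tasks when the earlier of the
     * two timeouts expires, and all durations are measured from the same instant.
     */
    static Duration overlap(Duration zkTimeout,
                            Duration heartbeatTimeout,
                            Duration haRecoveryTime) {
        // The stale tasks stop at the earlier of the two timeouts.
        Duration staleTasksStop = zkTimeout.compareTo(heartbeatTimeout) <= 0
                ? zkTimeout
                : heartbeatTimeout;
        // Restored tasks start once HA recovery finishes.
        Duration window = staleTasksStop.minus(haRecoveryTime);
        // No overlap if recovery takes longer than the timeout.
        return window.isNegative() ? Duration.ZERO : window;
    }

    public static void main(String[] args) {
        // Illustrative values only; they are not taken from the report.
        System.out.println(overlap(
                Duration.ofSeconds(60),   // ZooKeeper timeout
                Duration.ofSeconds(50),   // heartbeat timeout
                Duration.ofSeconds(20))); // time for HA to restore the job
    }
}
```

Under these assumptions, the overlap shrinks to zero only when recovery takes 
at least as long as the shorter timeout, which is why a fast HA recovery makes 
the problem more likely rather than less.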

Should we keep a persistent record of the cluster resources allocated at 
runtime and use it, when HA is restored, to verify that all tasks have stopped?
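
One possible shape for such a record is sketched below. This is only an 
illustration of the idea in the question, not an existing Flink mechanism: 
`AllocationLedger` and all of its methods are hypothetical, and the in-memory 
sets stand in for whatever durable store would actually be used.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of a persistent allocation record; not a Flink class. */
final class AllocationLedger {

    // Stand-ins for durable state; a real version would have to survive
    // a JobManager failover.
    private final Set<String> allocated = new HashSet<>();
    private final Set<String> accountedFor = new HashSet<>();

    /** Called whenever a container is allocated for a TaskManager. */
    void recordAllocation(String containerId) {
        allocated.add(containerId);
    }

    /** Called whenever a container is released during normal operation. */
    void recordRelease(String containerId) {
        allocated.remove(containerId);
        accountedFor.remove(containerId);
    }

    /**
     * Called after failover once a recorded container has either re-registered
     * with the new JobManager or been confirmed stopped.
     */
    void markAccountedFor(String containerId) {
        if (allocated.contains(containerId)) {
            accountedFor.add(containerId);
        }
    }

    /** Containers that might still be running tasks from before the failover. */
    Set<String> unaccounted() {
        Set<String> pending = new HashSet<>(allocated);
        pending.removeAll(accountedFor);
        return Collections.unmodifiableSet(pending);
    }

    /** The restored job is deployed only when no stale container can remain. */
    boolean safeToRestore() {
        return unaccounted().isEmpty();
    }
}
```

With a record like this, the new JobManager could delay redeploying tasks 
until `safeToRestore()` returns true, at the cost of a slower recovery when a 
TaskManager is unreachable.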



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
