Yelei Feng created FLINK-6147:
---------------------------------
Summary: flink client can't detect cluster is down
Key: FLINK-6147
URL: https://issues.apache.org/jira/browse/FLINK-6147
Project: Flink
Issue Type: Bug
Components: Client
Affects Versions: 1.2.0, 1.3.0
Reporter: Yelei Feng
I tested in YARN mode. Steps to reproduce:
1. flink run xx.jar
2. Kill the YARN application
The CLI hangs, showing only "New JobManager elected. Connecting to null",
instead of cleaning up and exiting.
After some digging, I found that the main logic is in {{JobClientActor}}, which
terminates itself once it receives a {{ConnectionTimeout}} message. It receives
JobManager status changes from two sources: ZooKeeper and Akka death watch.
On a ZooKeeper notification, the client sets the current {{leaderSessionId}}
and unwatches the previous JobManager. On receiving {{Terminated}} for the
previous JobManager from Akka death watch, it sends {{ConnectionTimeout}} to
itself after 60s. There is a good chance that the two paths interfere with
each other.
Situation 1:
1. The client is notified by ZooKeeper and sets {{leaderSessionId}} to null
2. The client unwatches the previous JobManager
3. The {{Terminated}} message for the previous JobManager is never received,
so no {{ConnectionTimeout}} is scheduled
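
The ordering in Situation 1 can be sketched with a toy model. This is not Flink or Akka code: {{DeathWatch}}, {{terminatedDelivered}} and the target name are hypothetical stand-ins, and the sketch only assumes that a watcher stops receiving death notifications for a target once it has unwatched it.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of Situation 1; all names are hypothetical, not Flink or Akka APIs. */
public class UnwatchRaceSketch {

    /** Stand-in for a death watch: only watched targets produce a notification. */
    static class DeathWatch {
        private final Set<String> watched = new HashSet<>();

        void watch(String target) { watched.add(target); }

        void unwatch(String target) { watched.remove(target); }

        /** Returns true if a Terminated notification would reach the watcher. */
        boolean targetDied(String target) { return watched.remove(target); }
    }

    /** Replays one ordering of events and reports whether Terminated arrived. */
    static boolean terminatedDelivered(boolean zkNotificationFirst) {
        DeathWatch deathWatch = new DeathWatch();
        String jobManager = "jobmanager-1"; // hypothetical identifier
        deathWatch.watch(jobManager);
        if (zkNotificationFirst) {
            // Leader changed to "no leader": the client unwatches the old JobManager.
            deathWatch.unwatch(jobManager);
        }
        // The JobManager goes away (for example, the YARN application is killed).
        return deathWatch.targetDied(jobManager);
    }

    public static void main(String[] args) {
        System.out.println("ZooKeeper first: delivered = " + terminatedDelivered(true));
        System.out.println("Death first:     delivered = " + terminatedDelivered(false));
    }
}
```

In this model the notification is lost only when the ZooKeeper update is handled first, which matches the ordering listed above.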
Situation 2:
1. The {{Terminated}} message for the current JobManager is received
2. A {{ConnectionTimeout}} message is scheduled for 60s later
3. Within those 60s, the client is notified by ZooKeeper and sets
{{leaderSessionId}} to null
4. The {{ConnectionTimeout}} is filtered out because its {{leaderSessionId}}
no longer matches the current one
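
Situation 2 can be sketched in the same toy style. Again, this is not the actual {{JobClientActor}} code: the {{Client}} class and its methods are hypothetical, and the sketch assumes only that a delayed message carries the {{leaderSessionId}} that was current when it was scheduled and is dropped if that ID differs on arrival.

```java
import java.util.Objects;
import java.util.UUID;

/** Toy model of Situation 2; all names are hypothetical, not Flink or Akka APIs. */
public class TimeoutFilterSketch {

    /** Client that accepts only messages tagged with its current leader session ID. */
    static class Client {
        UUID leaderSessionId;
        boolean terminated = false;

        /** The delayed message is tagged with the session ID current at scheduling time. */
        UUID scheduleConnectionTimeout() { return leaderSessionId; }

        /** ZooKeeper-style notification; a null ID means there is no leader. */
        void onLeaderChange(UUID newLeaderSessionId) { leaderSessionId = newLeaderSessionId; }

        /** Terminates only if the tag still matches; otherwise the message is dropped. */
        void onConnectionTimeout(UUID taggedSessionId) {
            if (Objects.equals(taggedSessionId, leaderSessionId)) {
                terminated = true;
            }
        }
    }

    /** Replays one ordering of events and reports whether the client shut down. */
    static boolean clientTerminates(boolean leaderLostBeforeTimeout) {
        Client client = new Client();
        client.onLeaderChange(UUID.randomUUID());
        UUID tag = client.scheduleConnectionTimeout(); // Terminated was received
        if (leaderLostBeforeTimeout) {
            client.onLeaderChange(null); // arrives within the 60s delay
        }
        client.onConnectionTimeout(tag); // the delay has elapsed
        return client.terminated;
    }

    public static void main(String[] args) {
        System.out.println("Leader lost before timeout: terminates = " + clientTerminates(true));
        System.out.println("No leader change:           terminates = " + clientTerminates(false));
    }
}
```

In this model the client shuts down only when no leader change arrives during the delay, so a fix would need the timeout to survive a change of {{leaderSessionId}} to null.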
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)