Hi,

I have two questions about flink client in interactive mode.


One is for yarn-session.sh,  once the session CLI can't get cluster stauts 
(jobmanager is down), it will try to shutdown the cluster and cleanup related 
files even if new jobmanager will be created soon. As result,  yarn will fail 
to start a new jobmanager due to missing files on HDFS. As a workround, I can 
config `akka.lookup.timeout` to wait a bit longer,  say 60 seconds. But I'm 
wondering if it will affect other components.


Second is about flink cli. If cluster is down after submiting job using 'flink 
run xx.jar',  cli hangs there only showing "New JobManager elected. Connecting 
to null " instead of cleanup and close itself.


After some digging, I found the main logic is in JobClientActor. It receives 
jobmanager status changes from two sources: zookeeper and akka deathwatch. It 
would terminate itself once receiving message `ConnectionTimeout`.
Client sets current leaderSessionId and unwatch previous jobmanager from ZK; it 
receives `Teminated` of previous jobmanager from akka deathwatch and send 
`ConnectionTimeout` to itself after 60s. In a great chance, they would 
interfere with each other.

Situation1:

  1.  client get notified from zk, set leaderSessionId to null
  2.  client unwatch previous jobmanager
  3.  msg `Teminated` of previous jobmanager never got received

Situation 2:

  1.  msg `Teminated` of current jobmanager is received
  2.  schedule msg ConnectionTimeout after 60s
  3.  client get notified from zk, set `leaderSessionId` to null in less than 
60s
  4.  `ConnectionTimeout` will be filtered out due to different 
`leaderSessionId`


Both of the two problems only happen in interactive mode,  not in detached 
mode.  I wonder if it's issues for interactive mode, or we should only use 
detached mode in production environment?


Any insight is appreciated.


Thanks,

Yelei

Reply via email to