Hi Yelei,

thanks for investigating the problem and pointing out the problematic
parts. In fact, I recently stumbled across the very same problem in the
JobClientActor and wrote a fix for it. It is already merged into the
master. I hope that this fix solved the problem you've described.

Cheers,
Till

On Wed, Mar 22, 2017 at 3:42 PM, Yelei Feng <feng...@outlook.com> wrote:

> Hi,
>
>
> I have two questions about flink client in interactive mode.
>
>
> One is for yarn-session.sh,  once the session CLI can't get cluster stauts
> (jobmanager is down), it will try to shutdown the cluster and cleanup
> related files even if new jobmanager will be created soon. As result,  yarn
> will fail to start a new jobmanager due to missing files on HDFS. As a
> workround, I can config `akka.lookup.timeout` to wait a bit longer,  say 60
> seconds. But I'm wondering if it will affect other components.
>
>
> Second is about flink cli. If cluster is down after submiting job using
> 'flink run xx.jar',  cli hangs there only showing "New JobManager elected.
> Connecting to null " instead of cleanup and close itself.
>
>
> After some digging, I found the main logic is in JobClientActor. It
> receives jobmanager status changes from two sources: zookeeper and akka
> deathwatch. It would terminate itself once receiving message
> `ConnectionTimeout`.
> Client sets current leaderSessionId and unwatch previous jobmanager from
> ZK; it receives `Teminated` of previous jobmanager from akka deathwatch and
> send `ConnectionTimeout` to itself after 60s. In a great chance, they would
> interfere with each other.
>
> Situation1:
>
>   1.  client get notified from zk, set leaderSessionId to null
>   2.  client unwatch previous jobmanager
>   3.  msg `Teminated` of previous jobmanager never got received
>
> Situation 2:
>
>   1.  msg `Teminated` of current jobmanager is received
>   2.  schedule msg ConnectionTimeout after 60s
>   3.  client get notified from zk, set `leaderSessionId` to null in less
> than 60s
>   4.  `ConnectionTimeout` will be filtered out due to different
> `leaderSessionId`
>
>
> Both of the two problems only happen in interactive mode,  not in detached
> mode.  I wonder if it's issues for interactive mode, or we should only use
> detached mode in production environment?
>
>
> Any insight is appreciated.
>
>
> Thanks,
>
> Yelei
>
>

Reply via email to