Hi Deepak,

could you check the logs to see whether the JobManager has been quarantined and therefore can no longer be connected to? The logs should at least contain a hint as to why the TaskManager lost the connection in the first place.
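One quick way to do that check is to scan the JobManager and TaskManager log files for any line mentioning a quarantine. The following is only a minimal sketch: the `log` directory, the `*.log` glob, and the helper name `find_quarantine_lines` are assumptions for illustration, and it simply matches the substring "quarantin" rather than any specific log message.

```python
import re
import sys
from pathlib import Path

# Assumption: any line containing "quarantine"/"quarantined" is worth a look.
PATTERN = re.compile(r"quarantin", re.IGNORECASE)


def find_quarantine_lines(log_dir, glob="*.log"):
    """Return (file name, line number, text) for every matching log line."""
    hits = []
    for path in sorted(Path(log_dir).glob(glob)):
        # errors="replace" keeps the scan going if a log has odd bytes in it.
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if PATTERN.search(line):
                    hits.append((path.name, number, line.rstrip()))
    return hits


if __name__ == "__main__":
    # Hypothetical default: a "log" directory relative to the working directory.
    log_dir = sys.argv[1] if len(sys.argv) > 1 else "log"
    for name, number, text in find_quarantine_lines(log_dir):
        print(f"{name}:{number}: {text}")
```

If this prints nothing, the lines logged just before the TaskManager disconnected are the next place to look.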
Cheers,
Till

On Thu, Sep 8, 2016 at 7:08 PM, Deepak Jha <dkjhan...@gmail.com> wrote:

> Hi,
> I've set up Flink HA on AWS (3 TaskManagers and 2 JobManagers, each on an
> EC2 m4.large instance, with checkpointing enabled on S3). My topology works
> fine, but after a few hours I see that the TaskManagers get detached from
> the JobManager. I tried to reach the JobManager using telnet at the same
> time and it worked, but the TaskManager does not succeed in connecting
> again. It attaches only after I restart it. I tried the following settings
> but the problem still persists.
>
> akka.ask.timeout: 20 s
> akka.lookup.timeout: 20 s
> akka.watch.heartbeat.interval: 20 s
>
> Please find attached a snapshot of one of the TaskManagers. Is there any
> setting that I need to change?
>
> --
> Thanks,
> Deepak Jha