[ https://issues.apache.org/jira/browse/FLINK-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502783#comment-14502783 ]
ASF GitHub Bot commented on FLINK-1908:
---------------------------------------

Github user DarkKnightCZ commented on the pull request:

    https://github.com/apache/flink/pull/609#issuecomment-94451193

    @tillrohrmann The problem that occurred was that the JM bound the IP:PORT with some delay, so TMs failed to start, since they couldn't connect. When I tried it in a 5-node environment, sometimes 2 or 3 TMs failed because the JM wasn't ready at that point. There was no subsequent checking done; the TMs just stopped.

    I agree that a TM should indeed check several times whether the JM is available, so I will try to look at that as well.

> JobManager startup delay isn't considered when using start-cluster.sh script
> ----------------------------------------------------------------------------
>
>                 Key: FLINK-1908
>                 URL: https://issues.apache.org/jira/browse/FLINK-1908
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Runtime
>    Affects Versions: 0.9, 0.8.1
>         Environment: Linux
>            Reporter: Lukas Raska
>            Priority: Minor
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> When starting a Flink cluster via the start-cluster.sh script, JobManager startup
> can be delayed (as it's started asynchronously), which can result in failed
> startup of several task managers.
> The solution is to wait a certain amount of time and periodically check whether
> the RPC port is accessible, then proceed with starting the task managers.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
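
For illustration only, the "wait and periodically check if the RPC port is accessible" idea from the issue description could be sketched roughly as follows. This is not the code from the pull request. The function name `wait_for_port`, its parameters, and the default timeout and polling interval are all hypothetical choices made for this sketch, and Python is used only for readability, since the actual fix would live in the shell startup scripts.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll host:port until a TCP connection succeeds or the timeout expires.

    Hypothetical helper: returns True as soon as something accepts a
    connection on the port, and False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means a process is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused, unreachable, or timed out: not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A startup script could call something like `wait_for_port(jm_host, jm_rpc_port)` after launching the JobManager and start the task managers only if it returns True, where `jm_host` and `jm_rpc_port` stand in for whatever address the cluster configuration specifies. The same retry loop, placed on the TaskManager side, would correspond to the "check several times" approach mentioned in the comment above.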