Scott Kidder created FLINK-7022:
-----------------------------------

             Summary: Flink Job Manager Scheduler & Web Frontend out of sync when Zookeeper is unavailable on startup
                 Key: FLINK-7022
                 URL: https://issues.apache.org/jira/browse/FLINK-7022
             Project: Flink
          Issue Type: Bug
          Components: JobManager
    Affects Versions: 1.2.1, 1.3.0, 1.2.0
         Environment: Kubernetes cluster running:
* Flink 1.3.0 Job Manager & Task Manager on Java 8u131
* Zookeeper 3.4.10 cluster with 3 nodes
            Reporter: Scott Kidder

h2. Problem
Flink Job Manager web frontend is permanently unavailable if one or more Zookeeper nodes are unresolvable during startup. The job scheduler eventually recovers and assigns jobs to task managers, but the web frontend continues to respond with an HTTP 503 and the following message:
{noformat}Service temporarily unavailable due to an ongoing leader election. Please refresh.{noformat}

h2. Expected Behavior
Once Flink is able to interact with Zookeeper successfully, all aspects of the Job Manager (job scheduling & the web frontend) should be available.

h2. Environment Details
We're running Flink and Zookeeper in Kubernetes on CoreOS. CoreOS can run in a configuration that automatically detects and applies operating system updates. We have a Zookeeper node running on the same CoreOS instance as Flink. It's possible that the Zookeeper node will not yet be started when the Flink components are started. This could cause hostname resolution of the Zookeeper nodes to fail.

h3. Flink Task Manager Logs
{noformat}
2017-06-27 15:38:47,161 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.statsd.host, localhost
2017-06-27 15:38:47,161 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.statsd.port, 8125
2017-06-27 15:38:47,162 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.statsd.interval, 10 SECONDS
2017-06-27 15:38:47,254 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend, filesystem
2017-06-27 15:38:47,254 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend.fs.checkpointdir, hdfs://hdfs:8020/flink/checkpoints
2017-06-27 15:38:47,255 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.savepoints.dir, hdfs://hdfs:8020/flink/savepoints
2017-06-27 15:38:47,255 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: recovery.mode, zookeeper
2017-06-27 15:38:47,256 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: recovery.zookeeper.quorum, zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181
2017-06-27 15:38:47,256 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: recovery.zookeeper.storageDir, hdfs://hdfs:8020/flink/recovery
2017-06-27 15:38:47,256 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: recovery.jobmanager.port, 6123
2017-06-27 15:38:47,257 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: blob.server.port, 41479
2017-06-27 15:38:47,357 WARN org.apache.flink.configuration.Configuration - Config uses deprecated configuration key 'recovery.mode' instead of proper key 'high-availability'
2017-06-27 15:38:47,366 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager with high-availability
2017-06-27 15:38:47,366 WARN org.apache.flink.configuration.Configuration - Config uses deprecated configuration key 'recovery.jobmanager.port' instead of proper key 'high-availability.jobmanager.port'
2017-06-27 15:38:47,452 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager on flink:6123 with execution mode CLUSTER
2017-06-27 15:38:47,549 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, flink
2017-06-27 15:38:47,549 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2017-06-27 15:38:47,549 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 1024
2017-06-27 15:38:47,549 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.mb, 1024
2017-06-27 15:38:47,549 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2017-06-27 15:38:47,549 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.preallocate, false
2017-06-27 15:38:47,550 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2017-06-27 15:38:47,550 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.web.port, 8081
2017-06-27 15:38:47,550 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporters, statsd
2017-06-27 15:38:47,550 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.statsd.class, org.apache.flink.metrics.statsd.StatsDReporter
2017-06-27 15:38:47,551 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.statsd.host, localhost
2017-06-27 15:38:47,551 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.statsd.port, 8125
2017-06-27 15:38:47,551 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.statsd.interval, 10 SECONDS
2017-06-27 15:38:47,551 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend, filesystem
2017-06-27 15:38:47,551 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend.fs.checkpointdir, hdfs://hdfs:8020/flink/checkpoints
2017-06-27 15:38:47,552 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.savepoints.dir, hdfs://hdfs:8020/flink/savepoints
2017-06-27 15:38:47,552 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: recovery.mode, zookeeper
2017-06-27 15:38:47,552 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: recovery.zookeeper.quorum, zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181
2017-06-27 15:38:47,552 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: recovery.zookeeper.storageDir, hdfs://hdfs:8020/flink/recovery
2017-06-27 15:38:47,552 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: recovery.jobmanager.port, 6123
2017-06-27 15:38:47,552 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: blob.server.port, 41479
2017-06-27 15:38:48,055 INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to root (auth:SIMPLE)
2017-06-27 15:38:48,664 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager actor system reachable at flink:6123
2017-06-27 15:38:50,955 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2017-06-27 15:38:51,252 INFO Remoting - Starting remoting
2017-06-27 15:38:52,679 INFO Remoting - Remoting started; listening on addresses :[akka.tcp://flink@flink:6123]
2017-06-27 15:38:52,758 WARN org.apache.flink.configuration.Configuration - Config uses deprecated configuration key 'recovery.mode' instead of proper key 'high-availability'
2017-06-27 15:38:52,761 WARN org.apache.flink.configuration.Configuration - Config uses deprecated configuration key 'recovery.mode' instead of proper key 'high-availability'
2017-06-27 15:38:52,764 WARN org.apache.flink.configuration.Configuration - Config uses deprecated configuration key 'recovery.zookeeper.storageDir' instead of proper key 'high-availability.storageDir'
2017-06-27 15:38:52,854 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, flink
2017-06-27 15:38:52,854 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2017-06-27 15:38:52,854 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 1024
2017-06-27 15:38:52,854 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.mb, 1024
2017-06-27 15:38:52,854 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2017-06-27 15:38:52,854 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.preallocate, false
2017-06-27 15:38:52,854 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2017-06-27 15:38:52,854 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.web.port, 8081
2017-06-27 15:38:52,864 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporters, statsd
2017-06-27 15:38:52,865 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.statsd.class, org.apache.flink.metrics.statsd.StatsDReporter
2017-06-27 15:38:52,865 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.statsd.host, localhost
2017-06-27 15:38:52,865 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.statsd.port, 8125
2017-06-27 15:38:52,865 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: metrics.reporter.statsd.interval, 10 SECONDS
    at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
    at org.apache.flink.shaded.org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
    at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:150)
    at org.apache.flink.shaded.org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94)
    at org.apache.flink.shaded.org.apache.curator.HandleHolder.getZooKeeper(HandleHolder.java:55)
    at org.apache.flink.shaded.org.apache.curator.ConnectionState.reset(ConnectionState.java:262)
    at org.apache.flink.shaded.org.apache.curator.ConnectionState.start(ConnectionState.java:109)
    at org.apache.flink.shaded.org.apache.curator.CuratorZookeeperClient.start(CuratorZookeeperClient.java:191)
    at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.start(CuratorFrameworkImpl.java:259)
    at org.apache.flink.runtime.util.ZooKeeperUtils.startCuratorFramework(ZooKeeperUtils.java:128)
    at org.apache.flink.runtime.highavailability.HighAvailabilityServicesUtils.createHighAvailabilityServices(HighAvailabilityServicesUtils.java:96)
    at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2047)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply$mcV$sp(JobManager.scala:2139)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.flink.runtime.jobmanager.JobManager$.retryOnBindException(JobManager.scala:2172)
    at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2117)
    at org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1992)
    at org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1990)
    at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
    at org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1990)
    at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
2017-06-27 15:38:59,160 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager web frontend
2017-06-27 15:38:59,257 INFO org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined location of JobManager log file: /usr/local/flink-1.3.0/log/flink--jobmanager-0-flink-jobmanager-3380372638-1q7jb.log
2017-06-27 15:38:59,257 INFO org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined location of JobManager stdout file: /usr/local/flink-1.3.0/log/flink--jobmanager-0-flink-jobmanager-3380372638-1q7jb.out
2017-06-27 15:38:59,257 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Using directory /tmp/flink-web-252afcf4-d41d-4095-a082-f6ce5176c2f5 for the web interface files
2017-06-27 15:38:59,257 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Using directory /tmp/flink-web-2ca2cadf-a1b6-44af-9510-9c523a422022 for web frontend JAR file uploads
2017-06-27 15:39:01,060 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Web frontend listening at 0:0:0:0:0:0:0:0:8081
2017-06-27 15:39:01,060 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager actor
2017-06-27 15:39:01,253 INFO org.apache.flink.runtime.blob.BlobServer - Created BLOB server storage directory /tmp/blobStore-1f49aadd-0a7d-45d1-8fdc-fc2167ca93d5
2017-06-27 15:39:01,257 INFO org.apache.flink.runtime.blob.BlobServer - Started BLOB server at 0.0.0.0:41479 - max concurrent requests: 50 - max backlog: 1000
2017-06-27 15:39:01,851 INFO org.apache.flink.runtime.metrics.MetricRegistry - Configuring StatsDReporter with {interval=10 SECONDS, port=8125, host=localhost, class=org.apache.flink.metrics.statsd.StatsDReporter}.
2017-06-27 15:39:01,948 INFO org.apache.flink.metrics.statsd.StatsDReporter - Configured StatsDReporter with {host:localhost, port:8125}
2017-06-27 15:39:01,949 INFO org.apache.flink.runtime.metrics.MetricRegistry - Periodically reporting metrics in intervals of 10 SECONDS for reporter statsd of type org.apache.flink.metrics.statsd.StatsDReporter.
2017-06-27 15:39:02,050 INFO org.apache.flink.runtime.jobmanager.MemoryArchivist - Started memory archivist akka://flink/user/archive
2017-06-27 15:39:02,059 WARN org.apache.flink.configuration.Configuration - Config uses deprecated configuration key 'recovery.zookeeper.storageDir' instead of proper key 'high-availability.storageDir'
2017-06-27 15:39:17,252 ERROR org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection timed out for connection string (zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181) and timeout (15000) / elapsed (18395)
org.apache.flink.shaded.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.flink.shaded.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225)
    at org.apache.flink.shaded.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
    at org.apache.flink.shaded.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
    at org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl$1.call(NamespaceImpl.java:90)
    at org.apache.flink.shaded.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
    at org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl.fixForNamespace(NamespaceImpl.java:83)
    at org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl.newNamespaceAwareEnsurePath(NamespaceImpl.java:109)
    at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.newNamespaceAwareEnsurePath(CuratorFrameworkImpl.java:469)
    at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.<init>(ZooKeeperSubmittedJobGraphStore.java:116)
    at org.apache.flink.runtime.util.ZooKeeperUtils.createSubmittedJobGraphs(ZooKeeperUtils.java:263)
    at org.apache.flink.runtime.highavailability.zookeeper.ZooKeeperHaServices.getSubmittedJobGraphStore(ZooKeeperHaServices.java:149)
    at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2716)
    at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2641)
    at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2298)
    at org.apache.flink.runtime.jobmanager.JobManager$.liftedTree3$1(JobManager.scala:2053)
    at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2052)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply$mcV$sp(JobManager.scala:2139)
    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
    at org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1990)
    at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
2017-06-27 15:39:37,448 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@27cbfddf
2017-06-27 15:40:07,457 WARN org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection attempt unsuccessful after 68603 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.
2017-06-27 15:40:07,457 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@27cbfddf
2017-06-27 15:40:07,555 ERROR org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl - Ensure path threw exception
java.net.UnknownHostException: zookeeper-1.zookeeper: Name or service not known
    at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
    at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
    at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
    at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
    at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
    at org.apache.flink.shaded.org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
    at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:150)
    at org.apache.flink.shaded.org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94)
    at org.apache.flink.shaded.org.apache.curator.HandleHolder.internalClose(HandleHolder.java:128)
    at org.apache.flink.shaded.org.apache.curator.HandleHolder.closeAndReset(HandleHolder.java:77)
    at org.apache.flink.shaded.org.apache.curator.ConnectionState.reset(ConnectionState.java:261)
    at org.apache.flink.shaded.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:221)
    at org.apache.flink.shaded.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
    at org.apache.flink.shaded.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
    at org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl$1.call(NamespaceImpl.java:90)
    at org.apache.flink.shaded.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
    at org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl.fixForNamespace(NamespaceImpl.java:83)
    at org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl.newNamespaceAwareEnsurePath(NamespaceImpl.java:109)
    at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.newNamespaceAwareEnsurePath(CuratorFrameworkImpl.java:469)
    at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.<init>(ZooKeeperSubmittedJobGraphStore.java:116)
    at org.apache.flink.runtime.util.ZooKeeperUtils.createSubmittedJobGraphs(ZooKeeperUtils.java:263)
    at org.apache.flink.runtime.highavailability.zookeeper.ZooKeeperHaServices.getSubmittedJobGraphStore(ZooKeeperHaServices.java:149)
    at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2716)
    at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2641)
    at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2298)
    at org.apache.flink.runtime.jobmanager.JobManager$.liftedTree3$1(JobManager.scala:2053)
    at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2052)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply$mcV$sp(JobManager.scala:2139)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.flink.runtime.jobmanager.JobManager$.retryOnBindException(JobManager.scala:2172)
    at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2117)
    at org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1992)
    at org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1990)
    at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
    at org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1990)
    at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
2017-06-27 15:40:22,566 ERROR org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection timed out for connection string (zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181) and timeout (15000) / elapsed (15108)
org.apache.flink.shaded.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.flink.shaded.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225)
    at org.apache.flink.shaded.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
    at org.apache.flink.runtime.util.ZooKeeperUtils.createSubmittedJobGraphs(ZooKeeperUtils.java:263)
    at org.apache.flink.runtime.highavailability.zookeeper.ZooKeeperHaServices.getSubmittedJobGraphStore(ZooKeeperHaServices.java:149)
    at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2716)
    at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2641)
    at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2298)
    at org.apache.flink.runtime.jobmanager.JobManager$.liftedTree3$1(JobManager.scala:2053)
    at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2052)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply$mcV$sp(JobManager.scala:2139)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.flink.runtime.jobmanager.JobManager$.retryOnBindException(JobManager.scala:2172)
    at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2117)
    at org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1992)
    at org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1990)
    at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
    at org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1990)
    at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
2017-06-27 15:40:42,575 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@27cbfddf
2017-06-27 15:41:02,684 ERROR org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection timed out for connection string (zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181) and timeout (15000) / elapsed (55226)
org.apache.flink.shaded.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.flink.shaded.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225)
    at org.apache.flink.shaded.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
    at org.apache.flink.shaded.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
    at org.apache.flink.shaded.org.apache.curator.utils.EnsurePath$InitialHelper$1.call(EnsurePath.java:156)
    at org.apache.flink.shaded.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
    at org.apache.flink.shaded.org.apache.curator.utils.EnsurePath$InitialHelper.ensure(EnsurePath.java:149)
    at org.apache.flink.shaded.org.apache.curator.utils.EnsurePath.ensure(EnsurePath.java:102)
    at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.<init>(ZooKeeperSubmittedJobGraphStore.java:117)
    at org.apache.flink.runtime.util.ZooKeeperUtils.createSubmittedJobGraphs(ZooKeeperUtils.java:263)
    at org.apache.flink.runtime.highavailability.zookeeper.ZooKeeperHaServices.getSubmittedJobGraphStore(ZooKeeperHaServices.java:149)
    at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2716)
    at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2641)
    at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2298)
    at org.apache.flink.runtime.jobmanager.JobManager$.liftedTree3$1(JobManager.scala:2053)
    at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2052)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply$mcV$sp(JobManager.scala:2139)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.flink.runtime.jobmanager.JobManager$.retryOnBindException(JobManager.scala:2172)
    at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2117)
    at org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1992)
    at org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1990)
    at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
    at org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1990)
    at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
2017-06-27 15:41:02,684 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@27cbfddf
2017-06-27 15:41:02,803 WARN org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-1381454376626202001.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2017-06-27 15:41:02,804 ERROR org.apache.flink.shaded.org.apache.curator.ConnectionState - Authentication failed
2017-06-27 15:41:02,806 INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server ip-10-2-8-5.ec2.internal/10.2.8.5:2181
...
2017-06-27 16:00:51,490 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Try to restart or fail the job (022d8149808dd3297a8a7275a1fd3d6b) if no longer possible.
2017-06-27 16:00:51,490 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job (022d8149808dd3297a8a7275a1fd3d6b) switched from state FAILING to RESTARTING.
2017-06-27 16:00:51,490 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Restarting the job (022d8149808dd3297a8a7275a1fd3d6b).
2017-06-27 16:00:51,490 INFO org.apache.flink.runtime.executiongraph.restart.ExecutionGraphRestarter - Delaying retry of job execution for 10000 ms ...
2017-06-27 16:00:58,252 INFO org.apache.flink.runtime.jobmanager.JobManager - Task Manager Registration but not connected to ResourceManager
2017-06-27 16:00:58,254 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at flink-taskmanager-3116622558-zmmwq (akka.tcp://flink@10.2.8.11:6122/user/taskmanager) as 2a058f00bd1e25f44c1cb8f3e5dd726f. Current number of registered hosts is 1. Current number of alive task slots is 2.
2017-06-27 16:00:58,453 INFO org.apache.flink.runtime.jobmanager.JobManager - Task Manager Registration but not connected to ResourceManager
2017-06-27 16:01:01,491 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job (022d8149808dd3297a8a7275a1fd3d6b) switched from state RESTARTING to CREATED.
2017-06-27 16:01:01,491 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper.
2017-06-27 16:01:01,645 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 1 checkpoints in ZooKeeper.
2017-06-27 16:01:01,645 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 502.
2017-06-27 16:01:01,660 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Restoring from latest valid checkpoint: Checkpoint 502 @ 1498577858587 for 022d8149808dd3297a8a7275a1fd3d6b.
2017-06-27 16:01:01,661 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - No master state to restore
2017-06-27 16:01:01,661 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job (022d8149808dd3297a8a7275a1fd3d6b) switched from state CREATED to RUNNING.
{noformat}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
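
The startup condition described under Environment Details (one or more Zookeeper hostnames not yet resolvable, as in the {{java.net.UnknownHostException: zookeeper-1.zookeeper}} entry in the log) can be checked from inside the Job Manager container with a small script. This is only a diagnostic sketch and not part of Flink: the function names {{quorum_hosts}} and {{unresolved_hosts}} are hypothetical, and the quorum string is assumed to have the same {{host:port,host:port}} shape as the {{recovery.zookeeper.quorum}} value shown in the log.

```python
import socket
from typing import Callable, Dict, List


def quorum_hosts(quorum: str) -> List[str]:
    """Split a 'host:port,host:port' quorum string into bare hostnames."""
    hosts = []
    for entry in quorum.split(","):
        entry = entry.strip()
        if entry:
            # Drop the trailing ':port' if present (IPv6 literals are not handled).
            hosts.append(entry.rsplit(":", 1)[0])
    return hosts


def unresolved_hosts(
    quorum: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, str]:
    """Return {hostname: error text} for every quorum host that fails to resolve.

    `resolve` is injectable so the check can be exercised without real DNS.
    """
    failures = {}
    for host in quorum_hosts(quorum):
        try:
            resolve(host)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            failures[host] = str(exc)
    return failures
```

Calling {{unresolved_hosts("zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181")}} shortly before the Job Manager starts would show whether the process is about to hit the resolution failure seen above. An empty result only means the names resolve at that moment; it says nothing about whether the Zookeeper nodes are accepting connections.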