Scott Kidder created FLINK-7022:
-----------------------------------

             Summary: Flink Job Manager Scheduler & Web Frontend out of sync 
when Zookeeper is unavailable on startup
                 Key: FLINK-7022
                 URL: https://issues.apache.org/jira/browse/FLINK-7022
             Project: Flink
          Issue Type: Bug
          Components: JobManager
    Affects Versions: 1.2.1, 1.3.0, 1.2.0
         Environment: Kubernetes cluster running:
* Flink 1.3.0 Job Manager & Task Manager on Java 8u131
* Zookeeper 3.4.10 cluster with 3 nodes
            Reporter: Scott Kidder


h2. Problem
Flink Job Manager web frontend is permanently unavailable if one or more 
Zookeeper nodes are unresolvable during startup. The job scheduler eventually 
recovers and assigns jobs to task managers, but the web frontend continues to 
respond with an HTTP 503 and the following message:
{noformat}Service temporarily unavailable due to an ongoing leader election. 
Please refresh.{noformat}

h2. Expected Behavior
Once Flink is able to interact with Zookeeper successfully, all aspects of the 
Job Manager (job scheduling & the web frontend) should be available.

h2. Environment Details
We're running Flink and Zookeeper in Kubernetes on CoreOS. CoreOS can run in a 
configuration that automatically detects and applies operating system updates. 
We have a Zookeeper node running on the same CoreOS instance as Flink. It's 
possible that the Zookeeper node will not yet be started when the Flink 
components are started. This could cause hostname resolution of the Zookeeper 
nodes to fail.

h3. Flink Task Manager Logs
{noformat}
2017-06-27 15:38:47,161 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: metrics.reporter.statsd.host, localhost
2017-06-27 15:38:47,161 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: metrics.reporter.statsd.port, 8125
2017-06-27 15:38:47,162 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: metrics.reporter.statsd.interval, 10 SECONDS
2017-06-27 15:38:47,254 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: state.backend, filesystem
2017-06-27 15:38:47,254 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: state.backend.fs.checkpointdir, 
hdfs://hdfs:8020/flink/checkpoints
2017-06-27 15:38:47,255 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: state.savepoints.dir, hdfs://hdfs:8020/flink/savepoints
2017-06-27 15:38:47,255 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: recovery.mode, zookeeper
2017-06-27 15:38:47,256 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: recovery.zookeeper.quorum, 
zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181
2017-06-27 15:38:47,256 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: recovery.zookeeper.storageDir, 
hdfs://hdfs:8020/flink/recovery
2017-06-27 15:38:47,256 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: recovery.jobmanager.port, 6123
2017-06-27 15:38:47,257 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: blob.server.port, 41479
2017-06-27 15:38:47,357 WARN  org.apache.flink.configuration.Configuration      
            - Config uses deprecated configuration key 'recovery.mode' instead 
of proper key 'high-availability'
2017-06-27 15:38:47,366 INFO  org.apache.flink.runtime.jobmanager.JobManager    
            - Starting JobManager with high-availability
2017-06-27 15:38:47,366 WARN  org.apache.flink.configuration.Configuration      
            - Config uses deprecated configuration key 
'recovery.jobmanager.port' instead of proper key 
'high-availability.jobmanager.port'
2017-06-27 15:38:47,452 INFO  org.apache.flink.runtime.jobmanager.JobManager    
            - Starting JobManager on flink:6123 with execution mode CLUSTER
2017-06-27 15:38:47,549 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: jobmanager.rpc.address, flink
2017-06-27 15:38:47,549 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: jobmanager.rpc.port, 6123
2017-06-27 15:38:47,549 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: jobmanager.heap.mb, 1024
2017-06-27 15:38:47,549 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: taskmanager.heap.mb, 1024
2017-06-27 15:38:47,549 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: taskmanager.numberOfTaskSlots, 1
2017-06-27 15:38:47,549 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: taskmanager.memory.preallocate, false
2017-06-27 15:38:47,550 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: parallelism.default, 1
2017-06-27 15:38:47,550 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: jobmanager.web.port, 8081
2017-06-27 15:38:47,550 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: metrics.reporters, statsd
2017-06-27 15:38:47,550 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: metrics.reporter.statsd.class, 
org.apache.flink.metrics.statsd.StatsDReporter
2017-06-27 15:38:47,551 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: metrics.reporter.statsd.host, localhost
2017-06-27 15:38:47,551 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: metrics.reporter.statsd.port, 8125
2017-06-27 15:38:47,551 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: metrics.reporter.statsd.interval, 10 SECONDS
2017-06-27 15:38:47,551 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: state.backend, filesystem
2017-06-27 15:38:47,551 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: state.backend.fs.checkpointdir, 
hdfs://hdfs:8020/flink/checkpoints
2017-06-27 15:38:47,552 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: state.savepoints.dir, hdfs://hdfs:8020/flink/savepoints
2017-06-27 15:38:47,552 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: recovery.mode, zookeeper
2017-06-27 15:38:47,552 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: recovery.zookeeper.quorum, 
zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181
2017-06-27 15:38:47,552 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: recovery.zookeeper.storageDir, 
hdfs://hdfs:8020/flink/recovery
2017-06-27 15:38:47,552 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: recovery.jobmanager.port, 6123
2017-06-27 15:38:47,552 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: blob.server.port, 41479
2017-06-27 15:38:48,055 INFO  
org.apache.flink.runtime.security.modules.HadoopModule        - Hadoop user set 
to root (auth:SIMPLE)
2017-06-27 15:38:48,664 INFO  org.apache.flink.runtime.jobmanager.JobManager    
            - Starting JobManager actor system reachable at flink:6123
2017-06-27 15:38:50,955 INFO  akka.event.slf4j.Slf4jLogger                      
            - Slf4jLogger started
2017-06-27 15:38:51,252 INFO  Remoting                                          
            - Starting remoting
2017-06-27 15:38:52,679 INFO  Remoting                                          
            - Remoting started; listening on addresses 
:[akka.tcp://flink@flink:6123]
2017-06-27 15:38:52,758 WARN  org.apache.flink.configuration.Configuration      
            - Config uses deprecated configuration key 'recovery.mode' instead 
of proper key 'high-availability'
2017-06-27 15:38:52,761 WARN  org.apache.flink.configuration.Configuration      
            - Config uses deprecated configuration key 'recovery.mode' instead 
of proper key 'high-availability'
2017-06-27 15:38:52,764 WARN  org.apache.flink.configuration.Configuration      
            - Config uses deprecated configuration key 
'recovery.zookeeper.storageDir' instead of proper key 
'high-availability.storageDir'
2017-06-27 15:38:52,854 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: jobmanager.rpc.address, flink
2017-06-27 15:38:52,854 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: jobmanager.rpc.port, 6123
2017-06-27 15:38:52,854 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: jobmanager.heap.mb, 1024
2017-06-27 15:38:52,854 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: taskmanager.heap.mb, 1024
2017-06-27 15:38:52,854 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: taskmanager.numberOfTaskSlots, 1
2017-06-27 15:38:52,854 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: taskmanager.memory.preallocate, false
2017-06-27 15:38:52,854 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: parallelism.default, 1
2017-06-27 15:38:52,854 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: jobmanager.web.port, 8081
2017-06-27 15:38:52,864 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: metrics.reporters, statsd
2017-06-27 15:38:52,865 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: metrics.reporter.statsd.class, 
org.apache.flink.metrics.statsd.StatsDReporter
2017-06-27 15:38:52,865 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: metrics.reporter.statsd.host, localhost
2017-06-27 15:38:52,865 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: metrics.reporter.statsd.port, 8125
2017-06-27 15:38:52,865 INFO  
org.apache.flink.configuration.GlobalConfiguration            - Loading 
configuration property: metrics.reporter.statsd.interval, 10 SECONDS
        at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
        at 
org.apache.flink.shaded.org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
        at 
org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:150)
        at 
org.apache.flink.shaded.org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94)
        at 
org.apache.flink.shaded.org.apache.curator.HandleHolder.getZooKeeper(HandleHolder.java:55)
        at 
org.apache.flink.shaded.org.apache.curator.ConnectionState.reset(ConnectionState.java:262)
        at 
org.apache.flink.shaded.org.apache.curator.ConnectionState.start(ConnectionState.java:109)
        at 
org.apache.flink.shaded.org.apache.curator.CuratorZookeeperClient.start(CuratorZookeeperClient.java:191)
        at 
org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.start(CuratorFrameworkImpl.java:259)
        at 
org.apache.flink.runtime.util.ZooKeeperUtils.startCuratorFramework(ZooKeeperUtils.java:128)
        at 
org.apache.flink.runtime.highavailability.HighAvailabilityServicesUtils.createHighAvailabilityServices(HighAvailabilityServicesUtils.java:96)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2047)
        at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply$mcV$sp(JobManager.scala:2139)
        at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
        at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
        at scala.util.Try$.apply(Try.scala:192)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.retryOnBindException(JobManager.scala:2172)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2117)
        at 
org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1992)
        at 
org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1990)
        at 
org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
        at 
org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1990)
        at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
2017-06-27 15:38:59,160 INFO  org.apache.flink.runtime.jobmanager.JobManager    
            - Starting JobManager web frontend
2017-06-27 15:38:59,257 INFO  
org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined 
location of JobManager log file: 
/usr/local/flink-1.3.0/log/flink--jobmanager-0-flink-jobmanager-3380372638-1q7jb.log
2017-06-27 15:38:59,257 INFO  
org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined 
location of JobManager stdout file: 
/usr/local/flink-1.3.0/log/flink--jobmanager-0-flink-jobmanager-3380372638-1q7jb.out
2017-06-27 15:38:59,257 INFO  
org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Using directory 
/tmp/flink-web-252afcf4-d41d-4095-a082-f6ce5176c2f5 for the web interface files
2017-06-27 15:38:59,257 INFO  
org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Using directory 
/tmp/flink-web-2ca2cadf-a1b6-44af-9510-9c523a422022 for web frontend JAR file 
uploads
2017-06-27 15:39:01,060 INFO  
org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Web frontend 
listening at 0:0:0:0:0:0:0:0:8081
2017-06-27 15:39:01,060 INFO  org.apache.flink.runtime.jobmanager.JobManager    
            - Starting JobManager actor
2017-06-27 15:39:01,253 INFO  org.apache.flink.runtime.blob.BlobServer          
            - Created BLOB server storage directory 
/tmp/blobStore-1f49aadd-0a7d-45d1-8fdc-fc2167ca93d5
2017-06-27 15:39:01,257 INFO  org.apache.flink.runtime.blob.BlobServer          
            - Started BLOB server at 0.0.0.0:41479 - max concurrent requests: 
50 - max backlog: 1000
2017-06-27 15:39:01,851 INFO  org.apache.flink.runtime.metrics.MetricRegistry   
            - Configuring StatsDReporter with {interval=10 SECONDS, port=8125, 
host=localhost, class=org.apache.flink.metrics.statsd.StatsDReporter}.
2017-06-27 15:39:01,948 INFO  org.apache.flink.metrics.statsd.StatsDReporter    
            - Configured StatsDReporter with {host:localhost, port:8125}
2017-06-27 15:39:01,949 INFO  org.apache.flink.runtime.metrics.MetricRegistry   
            - Periodically reporting metrics in intervals of 10 SECONDS for 
reporter statsd of type org.apache.flink.metrics.statsd.StatsDReporter.
2017-06-27 15:39:02,050 INFO  
org.apache.flink.runtime.jobmanager.MemoryArchivist           - Started memory 
archivist akka://flink/user/archive
2017-06-27 15:39:02,059 WARN  org.apache.flink.configuration.Configuration      
            - Config uses deprecated configuration key 
'recovery.zookeeper.storageDir' instead of proper key 
'high-availability.storageDir'
2017-06-27 15:39:17,252 ERROR 
org.apache.flink.shaded.org.apache.curator.ConnectionState    - Connection 
timed out for connection string 
(zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181)
 and timeout (15000) / elapsed (18395)
org.apache.flink.shaded.org.apache.curator.CuratorConnectionLossException: 
KeeperErrorCode = ConnectionLoss
        at 
org.apache.flink.shaded.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225)
        at 
org.apache.flink.shaded.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
        at 
org.apache.flink.shaded.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
        at 
org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl$1.call(NamespaceImpl.java:90)
        at 
org.apache.flink.shaded.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
        at 
org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl.fixForNamespace(NamespaceImpl.java:83)
        at 
org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl.newNamespaceAwareEnsurePath(NamespaceImpl.java:109)
        at 
org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.newNamespaceAwareEnsurePath(CuratorFrameworkImpl.java:469)
        at 
org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.<init>(ZooKeeperSubmittedJobGraphStore.java:116)
        at 
org.apache.flink.runtime.util.ZooKeeperUtils.createSubmittedJobGraphs(ZooKeeperUtils.java:263)
        at 
org.apache.flink.runtime.highavailability.zookeeper.ZooKeeperHaServices.getSubmittedJobGraphStore(ZooKeeperHaServices.java:149)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2716)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2641)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2298)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.liftedTree3$1(JobManager.scala:2053)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2052)
        at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply$mcV$sp(JobManager.scala:2139)
        at 
org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1990)
        at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
2017-06-27 15:39:37,448 INFO  org.apache.zookeeper.ZooKeeper                    
            - Initiating client connection, 
connectString=zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181
 sessionTimeout=60000 
watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@27cbfddf
2017-06-27 15:40:07,457 WARN  
org.apache.flink.shaded.org.apache.curator.ConnectionState    - Connection 
attempt unsuccessful after 68603 (greater than max timeout of 60000). Resetting 
connection and trying again with a new connection.
2017-06-27 15:40:07,457 INFO  org.apache.zookeeper.ZooKeeper                    
            - Initiating client connection, 
connectString=zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181
 sessionTimeout=60000 
watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@27cbfddf
2017-06-27 15:40:07,555 ERROR 
org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl  
- Ensure path threw exception
java.net.UnknownHostException: zookeeper-1.zookeeper: Name or service not known
        at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
        at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
        at 
java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
        at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
        at java.net.InetAddress.getAllByName(InetAddress.java:1192)
        at java.net.InetAddress.getAllByName(InetAddress.java:1126)
        at 
org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
        at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
        at 
org.apache.flink.shaded.org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
        at 
org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:150)
        at 
org.apache.flink.shaded.org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94)
        at 
org.apache.flink.shaded.org.apache.curator.HandleHolder.internalClose(HandleHolder.java:128)
        at 
org.apache.flink.shaded.org.apache.curator.HandleHolder.closeAndReset(HandleHolder.java:77)
        at 
org.apache.flink.shaded.org.apache.curator.ConnectionState.reset(ConnectionState.java:261)
        at 
org.apache.flink.shaded.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:221)
        at 
org.apache.flink.shaded.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
        at 
org.apache.flink.shaded.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
        at 
org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl$1.call(NamespaceImpl.java:90)
        at 
org.apache.flink.shaded.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
        at 
org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl.fixForNamespace(NamespaceImpl.java:83)
        at 
org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl.newNamespaceAwareEnsurePath(NamespaceImpl.java:109)
        at 
org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.newNamespaceAwareEnsurePath(CuratorFrameworkImpl.java:469)
        at 
org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.<init>(ZooKeeperSubmittedJobGraphStore.java:116)
        at 
org.apache.flink.runtime.util.ZooKeeperUtils.createSubmittedJobGraphs(ZooKeeperUtils.java:263)
        at 
org.apache.flink.runtime.highavailability.zookeeper.ZooKeeperHaServices.getSubmittedJobGraphStore(ZooKeeperHaServices.java:149)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2716)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2641)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2298)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.liftedTree3$1(JobManager.scala:2053)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2052)
        at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply$mcV$sp(JobManager.scala:2139)
        at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
        at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
        at scala.util.Try$.apply(Try.scala:192)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.retryOnBindException(JobManager.scala:2172)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2117)
        at 
org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1992)
        at 
org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1990)
        at 
org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
        at 
org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1990)
        at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
2017-06-27 15:40:22,566 ERROR 
org.apache.flink.shaded.org.apache.curator.ConnectionState    - Connection 
timed out for connection string 
(zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181)
 and timeout (15000) / elapsed (15108)
org.apache.flink.shaded.org.apache.curator.CuratorConnectionLossException: 
KeeperErrorCode = ConnectionLoss
        at 
org.apache.flink.shaded.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225)
        at 
org.apache.flink.shaded.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
        at 
org.apache.flink.runtime.util.ZooKeeperUtils.createSubmittedJobGraphs(ZooKeeperUtils.java:263)
        at 
org.apache.flink.runtime.highavailability.zookeeper.ZooKeeperHaServices.getSubmittedJobGraphStore(ZooKeeperHaServices.java:149)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2716)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2641)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2298)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.liftedTree3$1(JobManager.scala:2053)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2052)
        at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply$mcV$sp(JobManager.scala:2139)
        at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
        at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
        at scala.util.Try$.apply(Try.scala:192)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.retryOnBindException(JobManager.scala:2172)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2117)
        at 
org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1992)
        at 
org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1990)
        at 
org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
        at 
org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1990)
        at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
2017-06-27 15:40:42,575 INFO  org.apache.zookeeper.ZooKeeper                    
            - Initiating client connection, 
connectString=zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181
 sessionTimeout=60000 
watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@27cbfddf
2017-06-27 15:41:02,684 ERROR 
org.apache.flink.shaded.org.apache.curator.ConnectionState    - Connection 
timed out for connection string 
(zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181)
 and timeout (15000) / elapsed (55226)
org.apache.flink.shaded.org.apache.curator.CuratorConnectionLossException: 
KeeperErrorCode = ConnectionLoss
        at 
org.apache.flink.shaded.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225)
        at 
org.apache.flink.shaded.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
        at 
org.apache.flink.shaded.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
        at 
org.apache.flink.shaded.org.apache.curator.utils.EnsurePath$InitialHelper$1.call(EnsurePath.java:156)
        at 
org.apache.flink.shaded.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
        at 
org.apache.flink.shaded.org.apache.curator.utils.EnsurePath$InitialHelper.ensure(EnsurePath.java:149)
        at 
org.apache.flink.shaded.org.apache.curator.utils.EnsurePath.ensure(EnsurePath.java:102)
        at 
org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.<init>(ZooKeeperSubmittedJobGraphStore.java:117)
        at 
org.apache.flink.runtime.util.ZooKeeperUtils.createSubmittedJobGraphs(ZooKeeperUtils.java:263)
        at 
org.apache.flink.runtime.highavailability.zookeeper.ZooKeeperHaServices.getSubmittedJobGraphStore(ZooKeeperHaServices.java:149)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2716)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2641)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2298)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.liftedTree3$1(JobManager.scala:2053)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2052)
        at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply$mcV$sp(JobManager.scala:2139)
        at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
        at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
        at scala.util.Try$.apply(Try.scala:192)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.retryOnBindException(JobManager.scala:2172)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2117)
        at 
org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1992)
        at 
org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1990)
        at 
org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
        at 
org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
        at 
org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1990)
        at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
2017-06-27 15:41:02,684 INFO  org.apache.zookeeper.ZooKeeper                    
            - Initiating client connection, 
connectString=zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181
 sessionTimeout=60000 
watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@27cbfddf
2017-06-27 15:41:02,803 WARN  org.apache.zookeeper.ClientCnxn                   
            - SASL configuration failed: 
javax.security.auth.login.LoginException: No JAAS configuration section named 
'Client' was found in specified JAAS configuration file: 
'/tmp/jaas-1381454376626202001.conf'. Will continue connection to Zookeeper 
server without SASL authentication, if Zookeeper server allows it.
2017-06-27 15:41:02,804 ERROR 
org.apache.flink.shaded.org.apache.curator.ConnectionState    - Authentication 
failed
2017-06-27 15:41:02,806 INFO  org.apache.zookeeper.ClientCnxn                   
            - Opening socket connection to server 
ip-10-2-8-5.ec2.internal/10.2.8.5:2181

...

2017-06-27 16:00:51,490 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Try to restart 
or fail the job  (022d8149808dd3297a8a7275a1fd3d6b) if no longer possible.
2017-06-27 16:00:51,490 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job  
(022d8149808dd3297a8a7275a1fd3d6b) switched from state FAILING to RESTARTING.
2017-06-27 16:00:51,490 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Restarting the 
job  (022d8149808dd3297a8a7275a1fd3d6b).
2017-06-27 16:00:51,490 INFO  
org.apache.flink.runtime.executiongraph.restart.ExecutionGraphRestarter  - 
Delaying retry of job execution for 10000 ms ...
2017-06-27 16:00:58,252 INFO  org.apache.flink.runtime.jobmanager.JobManager    
            - Task Manager Registration but not connected to ResourceManager
2017-06-27 16:00:58,254 INFO  org.apache.flink.runtime.instance.InstanceManager 
            - Registered TaskManager at flink-taskmanager-3116622558-zmmwq 
(akka.tcp://flink@10.2.8.11:6122/user/taskmanager) as 
2a058f00bd1e25f44c1cb8f3e5dd726f. Current number of registered hosts is 1. 
Current number of alive task slots is 2.
2017-06-27 16:00:58,453 INFO  org.apache.flink.runtime.jobmanager.JobManager    
            - Task Manager Registration but not connected to ResourceManager
2017-06-27 16:01:01,491 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job  
(022d8149808dd3297a8a7275a1fd3d6b) switched from state RESTARTING to CREATED.
2017-06-27 16:01:01,491 INFO  
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - 
Recovering checkpoints from ZooKeeper.
2017-06-27 16:01:01,645 INFO  
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Found 
1 checkpoints in ZooKeeper.
2017-06-27 16:01:01,645 INFO  
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Trying 
to retrieve checkpoint 502.
2017-06-27 16:01:01,660 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Restoring from 
latest valid checkpoint: Checkpoint 502 @ 1498577858587 for 
022d8149808dd3297a8a7275a1fd3d6b.
2017-06-27 16:01:01,661 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - No master state 
to restore
2017-06-27 16:01:01,661 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job  
(022d8149808dd3297a8a7275a1fd3d6b) switched from state CREATED to RUNNING.
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to