Hi Till,

Sorry for the late reply, I was waiting to update to Flink 1.6.0 to see if
the problem got fixed, but I still experience the first issue (jobgraph not
deleted from ZooKeeper when a job is cancelled). The second issue
(taskmanagers unable to register with the newly elected jobmanager) was
actually a configuration error: all jobmanagers had the
"jobmanager.rpc.address" option set to point to the same jobmanager, so I
guess the new leader was not registering its URL correctly in ZooKeeper.
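
For reference, the fix was simply to make each jobmanager host set its own
address in flink-conf.yaml instead of all pointing at a single host, i.e.
something like this (hostnames below are placeholders):

# flink-conf.yaml on the first jobmanager host
jobmanager.rpc.address: jobmanager-1.internal

# flink-conf.yaml on the second jobmanager host
jobmanager.rpc.address: jobmanager-2.internal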

As for the first issue, it seems to be a known bug:
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Old-job-resurrected-during-HA-failover-td22000.html
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/JobGraphs-not-cleaned-up-in-HA-mode-td22600.html
https://issues.apache.org/jira/browse/FLINK-10011

so I guess I'll have to wait until there is a fix, as I have not seen any
workaround other than manually removing the jobgraph from ZooKeeper after
cancelling the job.
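
In case it is useful to anyone else, the manual cleanup we do is roughly the
following with the ZooKeeper CLI (zkCli.sh); the job id is just an example
and the /flink/default prefix depends on your configured ZooKeeper
root/namespace:

ls /flink/default/jobgraphs
rmr /flink/default/jobgraphs/e403893e5208ca47ace886a77e405291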

Thanks,

Gerard


On Tue, Jul 24, 2018 at 1:46 PM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Gerard,
>
> the first log snippet from the client does not show anything suspicious.
> The warning just says that you cannot use the Yarn CLI because it lacks the
> Hadoop dependencies in the classpath.
>
> The second snippet is indeed more interesting. If the TaskExecutors are
> not notified about the changed leader, then this might indicate a problem
> with the ZooKeeper connection or the ZooKeeper cluster itself. This might
> also explain why the job deletion from ZooKeeper does not succeed.
>
> One thing you could check is whether the leader ZNode under
> `/flink/default/leader/dispatcher_lock` (if you are using the defaults)
> actually contains the address of the newly elected leader. The leader path
> should also be logged in the cluster entrypoint logs. You can use the
> ZooKeeper cli for accessing the ZNodes.
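>
> For example, something along these lines with zkCli.sh (the exact path
> depends on your high-availability configuration):
>
> zkCli.sh -server <zookeeper-host>:2181
> get /flink/default/leader/dispatcher_lock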
>
> Cheers,
> Till
>
> On Mon, Jul 23, 2018 at 4:07 PM Gerard Garcia <ger...@talaia.io> wrote:
>
>> We have just started experiencing a different problem that could be
>> related; maybe it helps to diagnose the issue.
>>
>> In the last 24h the jobmanager lost its connection to ZooKeeper a couple
>> of times. Each time, a new jobmanager (on a different node) was elected
>> leader correctly, but the taskmanagers kept trying to connect to the old
>> jobmanager. These are the final log messages before the taskmanager shut
>> itself down:
>>
>> 12:06:41.747 [flink-akka.actor.default-dispatcher-5] WARN
>> akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote
>> connection to [null] failed with java.net.ConnectException: Connection
>> refused: (...)1/192.168.1.9:35605
>> 12:06:41.748 [flink-akka.actor.default-dispatcher-2] INFO
>> org.apache.flink.runtime.taskexecutor.TaskExecutor  - Could not resolve
>> ResourceManager address akka.tcp://flink@(...)1:35605/user/resourcemanager,
>> retrying in 10000 ms: Could not connect to rpc endpoint under address
>> akka.tcp://flink@(...)1:35605/user/resourcemanager..
>> 12:06:41.748 [flink-akka.actor.default-dispatcher-5] WARN
>> akka.remote.ReliableDeliverySupervisor
>> flink-akka.remote.default-remote-dispatcher-15 - Association with remote
>> system [akka.tcp://flink@(...)1:35605] has failed, address is now gated
>> for [50] ms. Reason: [Association failed with 
>> [akka.tcp://flink@(...)1:35605]]
>> Caused by: [Connection refused: (...)1/192.168.1.9:35605]
>> 12:06:51.766 [flink-akka.actor.default-dispatcher-5] WARN
>> akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote
>> connection to [null] failed with java.net.ConnectException: Connection
>> refused: (...)1/192.168.1.9:35605
>> 12:06:51.767 [flink-akka.actor.default-dispatcher-2] INFO
>> org.apache.flink.runtime.taskexecutor.TaskExecutor  - Could not resolve
>> ResourceManager address akka.tcp://flink@(...)1:35605/user/resourcemanager,
>> retrying in 10000 ms: Could not connect to rpc endpoint under address
>> akka.tcp://flink@(...)1:35605/user/resourcemanager..
>> 12:06:51.767 [flink-akka.actor.default-dispatcher-5] WARN
>> akka.remote.ReliableDeliverySupervisor
>> flink-akka.remote.default-remote-dispatcher-7 - Association with remote
>> system [akka.tcp://flink@(...)1:35605] has failed, address is now gated
>> for [50] ms. Reason: [Association failed with 
>> [akka.tcp://flink@(...)1:35605]]
>> Caused by: [Connection refused: (...)1/192.168.1.9:35605]
>> 12:07:01.123 [flink-akka.actor.default-dispatcher-5] ERROR
>> org.apache.flink.runtime.taskexecutor.TaskExecutor  - Fatal error occurred
>> in TaskExecutor akka.tcp://flink@(...)2:33455/user/taskmanager_0.
>> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException:
>> Could not register at the ResourceManager within the specified maximum
>> registration duration 300000 ms. This indicates a problem with this
>> instance. Terminating now.
>>         at
>> org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1018)
>>         at
>> org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1004)
>>         at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
>>         at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
>>         at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>>         at
>> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>>         at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>>         at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>>         at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>>         at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>>         at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>>         at
>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>         at
>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>         at
>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>         at
>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>> 12:07:01.125 [flink-akka.actor.default-dispatcher-5] ERROR
>> org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - Fatal error
>> occurred while executing the TaskManager. Shutting it down...
>> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException:
>> Could not register at the ResourceManager within the specified maximum
>> registration duration 300000 ms. This indicates a problem with this
>> instance. Terminating now.
>>         at
>> org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1018)
>>         at
>> org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1004)
>>         at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
>>         at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
>>         at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>>         at
>> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>>         at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>>         at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>>         at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>>         at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>>         at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>>         at
>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>         at
>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>         at
>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>         at
>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>> 12:07:01.128 [flink-akka.actor.default-dispatcher-5] INFO
>> org.apache.flink.runtime.taskexecutor.TaskExecutor  - Stopping TaskExecutor
>> akka.tcp://flink@(...)2:33455/user/taskmanager_0.
>> 12:07:01.128 [flink-akka.actor.default-dispatcher-5] INFO
>> o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping
>> ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>> 12:07:01.128 [flink-akka.actor.default-dispatcher-5] INFO
>> o.a.flink.runtime.state.TaskExecutorLocalStateStoresManager  - Shutting
>> down TaskExecutorLocalStateStoresManager.
>> 12:07:01.130 [flink-akka.actor.default-dispatcher-5] INFO
>> org.apache.flink.runtime.io.disk.iomanager.IOManager  - I/O manager removed
>> spill file directory /home/tmp/flink-io-cf83b38e-53f1-4802-a097-e6db95b46084
>> 12:07:01.130 [flink-akka.actor.default-dispatcher-5] INFO
>> org.apache.flink.runtime.io.network.NetworkEnvironment  - Shutting down the
>> network environment and its components.
>> 12:07:01.131 [flink-akka.actor.default-dispatcher-5] INFO
>> org.apache.flink.runtime.io.network.netty.NettyClient  - Successful
>> shutdown (took 1 ms).
>> 12:07:01.132 [flink-akka.actor.default-dispatcher-5] INFO
>> org.apache.flink.runtime.io.network.netty.NettyServer  - Successful
>> shutdown (took 1 ms).
>> 12:07:01.141 [flink-akka.actor.default-dispatcher-5] INFO
>> org.apache.flink.runtime.taskexecutor.JobLeaderService  - Stop job leader
>> service.
>> 12:07:01.142 [flink-akka.actor.default-dispatcher-5] INFO
>> org.apache.flink.runtime.taskexecutor.TaskExecutor  - Stopped TaskExecutor
>> akka.tcp://flink@(...)2:33455/user/taskmanager_0.
>> 12:07:01.143 [flink-akka.actor.default-dispatcher-5] INFO
>> org.apache.flink.runtime.blob.PermanentBlobCache  - Shutting down BLOB cache
>> 12:07:01.143 [flink-akka.actor.default-dispatcher-5] INFO
>> org.apache.flink.runtime.blob.TransientBlobCache  - Shutting down BLOB cache
>> 12:07:01.148 [Curator-Framework-0] INFO
>> o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  -
>> backgroundOperationsLoop exiting
>> 12:07:01.173 [flink-akka.actor.default-dispatcher-5] INFO
>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session:
>> 0x3011955028f00f4 closed
>> 12:07:01.173 [main-EventThread] INFO
>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread
>> shut down for session: 0x3011955028f00f4
>> 12:07:01.173 [flink-akka.actor.default-dispatcher-5] INFO
>> org.apache.flink.runtime.rpc.akka.AkkaRpcService  - Stopping Akka RPC
>> service.
>> 12:07:01.178 [flink-akka.actor.default-dispatcher-3] INFO
>> akka.remote.RemoteActorRefProvider$RemotingTerminator
>> flink-akka.remote.default-remote-dispatcher-16 - Shutting down remote
>> daemon.
>> 12:07:01.178 [flink-akka.actor.default-dispatcher-3] INFO
>> akka.remote.RemoteActorRefProvider$RemotingTerminator
>> flink-akka.remote.default-remote-dispatcher-16 - Remote daemon shut down;
>> proceeding with flushing remote transports.
>> 12:07:01.190 [flink-akka.actor.default-dispatcher-5] INFO
>> akka.remote.RemoteActorRefProvider$RemotingTerminator
>> flink-akka.remote.default-remote-dispatcher-7 - Remoting shut down.
>>
>> It seems like there are problems updating ZooKeeper. I've also noticed
>> these messages in the ZooKeeper log:
>>
>>  WARN  [SyncThread:2:FileTxnLog@378] - fsync-ing the write ahead log in
>> SyncThread:2 took 1259ms which will adversely effect operation latency. See
>> the ZooKeeper troubleshooting guide
>>
>> Maybe Flink hits some timeout?
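>>
>> If it turns out to be a timeout, I guess the ZooKeeper client settings in
>> flink-conf.yaml would be the first thing to look at, e.g. (as far as I
>> know, these are the defaults):
>>
>> high-availability.zookeeper.client.session-timeout: 60000
>> high-availability.zookeeper.client.connection-timeout: 15000
>> high-availability.zookeeper.client.retry-wait: 5000
>> high-availability.zookeeper.client.max-retry-attempts: 3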
>>
>> Gerard
>>
>> On Mon, Jul 23, 2018 at 11:57 AM Gerard Garcia <ger...@talaia.io> wrote:
>>
>>> Hi Till,
>>>
>>> I can't post the full logs (as they contain internal info) but I've found
>>> this. Is this what you are looking for?
>>>
>>> 11:29:17.351 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
>>> --------------------------------------------------------------------------------
>>> 11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
>>> Starting Command Line Client (Version: 1.5-SNAPSHOT, Rev:a4fc4c6,
>>> Date:05.06.2018 @ 10:22:30 CEST)
>>> 11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  OS
>>> current user: (...)
>>> 11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
>>> Current Hadoop/Kerberos user: <no hadoop dependency found>
>>> 11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
>>> JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.131-b11
>>> 11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
>>> Maximum heap size: 14254 MiBytes
>>> 11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
>>> JAVA_HOME: (not set)
>>> 11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  No
>>> Hadoop Dependency available
>>> 11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
>>> JVM Options:
>>> 11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
>>>  -Dlog.file=/opt/flink/flink-1.5.0/log/flink-root-client-(...).log
>>> 11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
>>>  -Dlog4j.configuration=file:/opt/flink/flink-1.5.0/conf/log4j-cli.properties
>>> 11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
>>>  -Dlogback.configurationFile=file:/opt/flink/flink-1.5.0/conf/logback.xml
>>> 11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
>>> Program Arguments:
>>> 11:29:17.374 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
>>>  cancel
>>> 11:29:17.374 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
>>>  e403893e5208ca47ace886a77e405291
>>> 11:29:17.374 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
>>> Classpath:
>>> /opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
>>> 11:29:17.375 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
>>> --------------------------------------------------------------------------------
>>> 11:29:17.380 [main] WARN  org.apache.flink.client.cli.CliFrontend  -
>>> Could not load CLI class org.apache.flink.yarn.cli.FlinkYarnSessionCli.
>>> java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
>>>         at java.lang.Class.forName0(Native Method)
>>>         at java.lang.Class.forName(Class.java:264)
>>>         at
>>> org.apache.flink.client.cli.CliFrontend.loadCustomCommandLine(CliFrontend.java:1204)
>>>         at
>>> org.apache.flink.client.cli.CliFrontend.loadCustomCommandLines(CliFrontend.java:1160)
>>>         at
>>> org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1086)
>>> Caused by: java.lang.ClassNotFoundException:
>>> org.apache.hadoop.conf.Configuration
>>>         at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>>         at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
>>>         at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>         ... 5 common frames omitted
>>> 11:29:17.385 [main] INFO  org.apache.flink.core.fs.FileSystem  - Hadoop
>>> is not in the classpath/dependencies. The extended set of supported File
>>> Systems via Hadoop is not available.
>>> 11:29:17.479 [main] INFO
>>> o.apache.flink.runtime.security.modules.HadoopModuleFactory  - Cannot
>>> create Hadoop Security Module because Hadoop cannot be found in the
>>> Classpath.
>>> 11:29:17.489 [main] INFO
>>> org.apache.flink.runtime.security.SecurityUtils  - Cannot install
>>> HadoopSecurityContext because Hadoop cannot be found in the Classpath.
>>> 11:29:17.518 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
>>> Running 'cancel' command.
>>> 11:29:17.523 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
>>> Cancelling job e403893e5208ca47ace886a77e405291.
>>> 11:29:17.537 [main] INFO
>>> org.apache.flink.runtime.blob.FileSystemBlobStore  - Creating highly
>>> available BLOB storage directory at file:///home/nas/flink/ha//default/blob
>>> 11:29:17.538 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils
>>> - Enforcing default ACL for ZK connections
>>> 11:29:17.539 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils
>>> - Using '/flink/default' as Zookeeper namespace.
>>> 11:29:17.574 [main] INFO
>>> o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - Starting
>>> 11:29:17.577 [main] INFO
>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>> environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f,
>>> built on 03/23/2017 10:13 GMT
>>> 11:29:17.577 [main] INFO
>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>> environment:host.name=(...)
>>> 11:29:17.578 [main] INFO
>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>> environment:java.version=1.8.0_131
>>> 11:29:17.578 [main] INFO
>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>> environment:java.vendor=Oracle Corporation
>>> 11:29:17.579 [main] INFO
>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>> environment:java.home=/opt/jdk/jdk1.8.0_131/jre
>>> 11:29:17.580 [main] INFO
>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>> environment:java.class.path=/opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
>>> 11:29:17.580 [main] INFO
>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>> environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
>>> 11:29:17.580 [main] INFO
>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>> environment:java.io.tmpdir=/tmp
>>> 11:29:17.580 [main] INFO
>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>> environment:java.compiler=<NA>
>>> 11:29:17.580 [main] INFO
>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>> environment:os.name=Linux
>>> 11:29:17.580 [main] INFO
>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>> environment:os.arch=amd64
>>> 11:29:17.580 [main] INFO
>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>> environment:os.version=4.9.87-xxxx-std-ipv6-64
>>> 11:29:17.581 [main] INFO
>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>> environment:user.name=(...)
>>> 11:29:17.581 [main] INFO
>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>> environment:user.home=(...)
>>> 11:29:17.581 [main] INFO
>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>> environment:user.dir=/opt/flink/flink-1.5.0/bin
>>> 11:29:17.581 [main] INFO
>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating
>>> client connection, connectString=10.1.1.5:2181,10.1.1.6:2181,
>>> 10.1.1.7:2181 sessionTimeout=60000
>>> watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@4a003cbe
>>> 11:29:17.589 [main-SendThread(10.1.1.5:2181)] WARN
>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL
>>> configuration failed: javax.security.auth.login.LoginException: No JAAS
>>> configuration section named 'Client' was found in specified JAAS
>>> configuration file: '/tmp/jaas-3807415919448894740.conf'. Will continue
>>> connection to Zookeeper server without SASL authentication, if Zookeeper
>>> server allows it.
>>> 11:29:17.590 [main-SendThread(10.1.1.5:2181)] INFO
>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening
>>> socket connection to server 10.1.1.5/10.1.1.5:2181
>>> 11:29:17.590 [main-EventThread] ERROR
>>> o.a.flink.shaded.curator.org.apache.curator.ConnectionState  -
>>> Authentication failed
>>> 11:29:17.603 [main-SendThread(10.1.1.5:2181)] INFO
>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket
>>> connection established to 10.1.1.5/10.1.1.5:2181, initiating session
>>> 11:29:17.625 [main-SendThread(10.1.1.5:2181)] INFO
>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session
>>> establishment complete on server 10.1.1.5/10.1.1.5:2181, sessionid =
>>> 0x100571bda1903c3, negotiated timeout = 40000
>>> 11:29:17.626 [main-EventThread] INFO
>>> o.a.f.s.c.o.a.curator.framework.state.ConnectionStateManager  - State
>>> change: CONNECTED
>>> 11:29:17.764 [main] INFO  org.apache.flink.runtime.rest.RestClient  -
>>> Rest client endpoint started.
>>> 11:29:17.766 [main] INFO
>>> o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting
>>> ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
>>> 11:29:17.812 [main] INFO
>>> o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting
>>> ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
>>> 11:29:18.007 [main] INFO  org.apache.flink.runtime.rest.RestClient  -
>>> Shutting down rest endpoint.
>>> 11:29:18.008 [main] INFO  org.apache.flink.runtime.rest.RestClient  -
>>> Rest endpoint shutdown complete.
>>> 11:29:18.008 [main] INFO
>>> o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping
>>> ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
>>> 11:29:18.009 [main] INFO
>>> o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping
>>> ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
>>> 11:29:18.010 [Curator-Framework-0] INFO
>>> o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  -
>>> backgroundOperationsLoop exiting
>>> 11:29:18.030 [main] INFO
>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session:
>>> 0x100571bda1903c3 closed
>>> 11:29:18.030 [main-EventThread] INFO
>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread
>>> shut down for session: 0x100571bda1903c3
>>> 11:29:18.030 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
>>> Cancelled job e403893e5208ca47ace886a77e405291.
>>>
>>> Gerard
>>>
>>> On Fri, Jul 20, 2018 at 5:14 AM vino yang <yanghua1...@gmail.com> wrote:
>>>
>>>> Hi Till,
>>>>
>>>> You are right, we have also seen the problem you described. Curator
>>>> removes the specific job graph path asynchronously. But that path is the
>>>> only basis for recovery, right? Is there any plan to improve this?
>>>>
>>>> Thanks, vino.
>>>>
>>>> 2018-07-19 21:58 GMT+08:00 Till Rohrmann <trohrm...@apache.org>:
>>>>
>>>>> Hi Gerard,
>>>>>
>>>>> the logging statement `Removed job graph ... from ZooKeeper` is
>>>>> actually not 100% accurate. The actual deletion is executed as an
>>>>> asynchronous background task and the log statement is not printed in the
>>>>> callback (which it should be). Therefore, the deletion could still have
>>>>> failed. In order to see this, the full jobmanager/cluster entry point logs
>>>>> would be tremendously helpful.
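>>>>>
>>>>> Just to illustrate what the asynchronous background removal means (a
>>>>> minimal Curator sketch, not the actual Flink code), the log line is only
>>>>> really meaningful if it is emitted in the background callback after
>>>>> checking the result code:
>>>>>
>>>>> import org.apache.curator.framework.CuratorFramework;
>>>>> import org.apache.zookeeper.KeeperException;
>>>>>
>>>>> // Illustrative sketch: remove a job graph node and only report success
>>>>> // once ZooKeeper has confirmed the deletion in the background callback.
>>>>> void removeJobGraph(CuratorFramework client, String path) throws Exception {
>>>>>     client.delete()
>>>>>         .deletingChildrenIfNeeded()
>>>>>         .inBackground((c, event) -> {
>>>>>             if (event.getResultCode() == KeeperException.Code.OK.intValue()) {
>>>>>                 System.out.println("Removed job graph " + path + " from ZooKeeper.");
>>>>>             } else {
>>>>>                 System.out.println("Removing job graph " + path
>>>>>                     + " from ZooKeeper failed, rc=" + event.getResultCode());
>>>>>             }
>>>>>         })
>>>>>         .forPath(path);
>>>>> }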
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>>
>>>>> On Thu, Jul 19, 2018 at 1:33 PM Gerard Garcia <ger...@talaia.io>
>>>>> wrote:
>>>>>
>>>>>> Thanks Andrey,
>>>>>>
>>>>>> This is the log from the jobmanager just after it finished cancelling
>>>>>> the job:
>>>>>>
>>>>>> 11:29:18.716 [flink-akka.actor.default-dispatcher-15695] INFO
>>>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Stopping
>>>>>> checkpoint coordinator for job e403893e5208ca47ace886a77e405291.
>>>>>> 11:29:18.716 [flink-akka.actor.default-dispatcher-15695] INFO
>>>>>> o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Shutting 
>>>>>> down
>>>>>> 11:29:18.738 [flink-akka.actor.default-dispatcher-15695] INFO
>>>>>> o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Removing
>>>>>> /flink-eur/default/checkpoints/e403893e5208ca47ace886a77e405291 from
>>>>>> ZooKeeper
>>>>>> 11:29:18.780 [flink-akka.actor.default-dispatcher-15695] INFO
>>>>>> o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Shutting 
>>>>>> down.
>>>>>> 11:29:18.780 [flink-akka.actor.default-dispatcher-15695] INFO
>>>>>> o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Removing
>>>>>> /checkpoint-counter/e403893e5208ca47ace886a77e405291 from ZooKeeper
>>>>>> 11:29:18.827 [flink-akka.actor.default-dispatcher-15695] INFO
>>>>>> org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job
>>>>>> e403893e5208ca47ace886a77e405291 reached globally terminal state 
>>>>>> CANCELED.
>>>>>> 11:29:18.846 [flink-akka.actor.default-dispatcher-15675] INFO
>>>>>> org.apache.flink.runtime.jobmaster.JobMaster  - Stopping the JobMaster 
>>>>>> for
>>>>>> job (...)(e403893e5208ca47ace886a77e405291).
>>>>>> 11:29:18.848 [flink-akka.actor.default-dispatcher-15675] INFO
>>>>>> o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping
>>>>>> ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>>>>>> 11:29:18.864 [flink-akka.actor.default-dispatcher-15675] INFO
>>>>>> org.apache.flink.runtime.jobmaster.JobMaster  - Close ResourceManager
>>>>>> connection d5fbc30a895066054e29fb2fd60fb0f1: JobManager is shutting 
>>>>>> down..
>>>>>> 11:29:18.864 [flink-akka.actor.default-dispatcher-15695] INFO
>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Suspending 
>>>>>> SlotPool.
>>>>>> 11:29:18.864 [flink-akka.actor.default-dispatcher-15695] INFO
>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Stopping 
>>>>>> SlotPool.
>>>>>> 11:29:18.864 [flink-akka.actor.default-dispatcher-15688] INFO
>>>>>> o.a.flink.runtime.resourcemanager.StandaloneResourceManager  - Disconnect
>>>>>> job manager 
>>>>>> 9cf221e2340597629fb932c03aa14...@akka.tcp://flink@(...):33827/user/jobmanager_9
>>>>>> for job e403893e5208ca47ace886a77e405291 from the resource manager.
>>>>>> 11:29:18.864 [flink-akka.actor.default-dispatcher-15675] INFO
>>>>>> o.a.f.runtime.leaderelection.ZooKeeperLeaderElectionService  - Stopping
>>>>>> ZooKeeperLeaderElectionService
>>>>>> ZooKeeperLeaderElectionService{leaderPath='/leader/e403893e5208ca47ace886a77e405291/job_manager_lock'}.
>>>>>> 11:29:18.980 [flink-akka.actor.default-dispatcher-15695] INFO
>>>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed
>>>>>> checkpoint 31154 for job 5d8c376b10d358b9c9470b3e70113626 (132520 bytes 
>>>>>> in
>>>>>> 411 ms).
>>>>>> 11:29:19.025 [flink-akka.actor.default-dispatcher-15683] INFO
>>>>>> o.a.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Removed 
>>>>>> job
>>>>>> graph e403893e5208ca47ace886a77e405291 from ZooKeeper.
>>>>>>
>>>>>>
>>>>>> At the end it says it removed job graph e403893e5208ca47ace886a77e405291
>>>>>> from ZooKeeper, but I can still see it at /flink/default/jobgraphs:
>>>>>>
>>>>>> [zk: localhost:2181(CONNECTED) 14] ls
>>>>>> /flink/default/jobgraphs/e403893e5208ca47ace886a77e405291
>>>>>> [3fe9c3c8-5bec-404e-a720-75f9b188124f,
>>>>>> 36208299-0f6d-462c-bae4-2e3d53f50e8c]
>>>>>>
>>>>>> Gerard
>>>>>>
>>>>>> On Wed, Jul 18, 2018 at 4:24 PM Andrey Zagrebin <
>>>>>> and...@data-artisans.com> wrote:
>>>>>>
>>>>>>> Hi Gerard,
>>>>>>>
>>>>>>> There is an issue that was recently fixed for 1.5.2 and 1.6.0:
>>>>>>> https://issues.apache.org/jira/browse/FLINK-9575
>>>>>>> It might have caused your problem.
>>>>>>>
>>>>>>> Can you please provide the logs from the JobManager/entry point for
>>>>>>> further investigation?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Andrey
>>>>>>>
>>>>>>> On 18 Jul 2018, at 10:16, Gerard Garcia <ger...@talaia.io> wrote:
>>>>>>>
>>>>>>> Hi vino,
>>>>>>>
>>>>>>> It seems that job ids stay in /jobgraphs when we cancel jobs manually.
>>>>>>> For example, after cancelling the job with id
>>>>>>> 75e16686cb4fe0d33ead8e29af131d09 the entry is still in ZooKeeper's path
>>>>>>> /flink/default/jobgraphs, but the job disappeared from
>>>>>>> /home/nas/flink/ha/default/blob/.
>>>>>>>
>>>>>>> That is the client log:
>>>>>>>
>>>>>>> 09:20:58.492 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
>>>>>>> Cancelling job 75e16686cb4fe0d33ead8e29af131d09.
>>>>>>> 09:20:58.503 [main] INFO
>>>>>>> org.apache.flink.runtime.blob.FileSystemBlobStore  - Creating highly
>>>>>>> available BLOB storage directory at
>>>>>>> file:///home/nas/flink/ha//default/blob
>>>>>>> 09:20:58.505 [main] INFO
>>>>>>> org.apache.flink.runtime.util.ZooKeeperUtils  - Enforcing default ACL 
>>>>>>> for
>>>>>>> ZK connections
>>>>>>> 09:20:58.505 [main] INFO
>>>>>>> org.apache.flink.runtime.util.ZooKeeperUtils  - Using 
>>>>>>> '/flink-eur/default'
>>>>>>> as Zookeeper namespace.
>>>>>>> 09:20:58.539 [main] INFO
>>>>>>> o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - Starting
>>>>>>> 09:20:58.543 [main] INFO
>>>>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>>>>> environment:zookeeper.version=
>>>>>>> 3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017
>>>>>>> 10:13 GMT
>>>>>>> 09:20:58.543 [main] INFO
>>>>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>>>>> environment:host.name=flink-eur-production1
>>>>>>> 09:20:58.543 [main] INFO
>>>>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>>>>> environment:java.version=1.8.0_131
>>>>>>> 09:20:58.544 [main] INFO
>>>>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>>>>> environment:java.vendor=Oracle Corporation
>>>>>>> 09:20:58.546 [main] INFO
>>>>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>>>>> environment:java.home=/opt/jdk/jdk1.8.0_131/jre
>>>>>>> 09:20:58.546 [main] INFO
>>>>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>>>>> environment:java.class.path=/opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
>>>>>>> 09:20:58.546 [main] INFO
>>>>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>>>>> environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
>>>>>>> 09:20:58.546 [main] INFO
>>>>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>>>>> environment:java.io.tmpdir=/tmp
>>>>>>> 09:20:58.546 [main] INFO
>>>>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>>>>> environment:java.compiler=<NA>
>>>>>>> 09:20:58.547 [main] INFO
>>>>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>>>>> environment:os.name=Linux
>>>>>>> 09:20:58.547 [main] INFO
>>>>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>>>>> environment:os.arch=amd64
>>>>>>> 09:20:58.547 [main] INFO
>>>>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>>>>> environment:os.version=4.9.87-xxxx-std-ipv6-64
>>>>>>> 09:20:58.547 [main] INFO
>>>>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>>>>> environment:user.name=root
>>>>>>> 09:20:58.547 [main] INFO
>>>>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>>>>> environment:user.home=/root
>>>>>>> 09:20:58.547 [main] INFO
>>>>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client
>>>>>>> environment:user.dir=/opt/flink/flink-1.5.0/bin
>>>>>>> 09:20:58.548 [main] INFO
>>>>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating
>>>>>>> client connection, connectString=10.1.1.5:2181,10.1.1.6:2181,
>>>>>>> 10.1.1.7:2181 sessionTimeout=60000
>>>>>>> watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@4a003cbe
>>>>>>> 09:20:58.555 [main-SendThread(10.1.1.5:2181)] WARN
>>>>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL
>>>>>>> configuration failed: javax.security.auth.login.LoginException: No JAAS
>>>>>>> configuration section named 'Client' was found in specified JAAS
>>>>>>> configuration file: '/tmp/jaas-9143038863636945274.conf'. Will continue
>>>>>>> connection to Zookeeper server without SASL authentication, if Zookeeper
>>>>>>> server allows it.
>>>>>>> 09:20:58.556 [main-SendThread(10.1.1.5:2181)] INFO
>>>>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening
>>>>>>> socket connection to server 10.1.1.5/10.1.1.5:2181
>>>>>>> 09:20:58.556 [main-EventThread] ERROR 
>>>>>>> o.a.flink.shaded.curator.org.apache.curator.ConnectionState
>>>>>>> - Authentication failed
>>>>>>> 09:20:58.569 [main-SendThread(10.1.1.5:2181)] INFO
>>>>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket
>>>>>>> connection established to 10.1.1.5/10.1.1.5:2181, initiating session
>>>>>>> 09:20:58.592 [main-SendThread(10.1.1.5:2181)] INFO
>>>>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session
>>>>>>> establishment complete on server 10.1.1.5/10.1.1.5:2181, sessionid
>>>>>>> = 0x100571bda1903b7, negotiated timeout = 40000
>>>>>>> 09:20:58.593 [main-EventThread] INFO
>>>>>>> o.a.f.s.c.o.a.curator.framework.state.ConnectionStateManager  - State
>>>>>>> change: CONNECTED
>>>>>>> 09:20:58.711 [main] INFO  org.apache.flink.runtime.rest.RestClient
>>>>>>> - Rest client endpoint started.
>>>>>>> 09:20:58.713 [main] INFO
>>>>>>> o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting
>>>>>>> ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
>>>>>>> 09:20:58.755 [main] INFO
>>>>>>> o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting
>>>>>>> ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
>>>>>>> 09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient
>>>>>>> - Shutting down rest endpoint.
>>>>>>> 09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient
>>>>>>> - Rest endpoint shutdown complete.
>>>>>>> 09:20:58.947 [main] INFO
>>>>>>> o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping
>>>>>>> ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
>>>>>>> 09:20:58.948 [main] INFO
>>>>>>> o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping
>>>>>>> ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
>>>>>>> 09:20:58.949 [Curator-Framework-0] INFO
>>>>>>> o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  -
>>>>>>> backgroundOperationsLoop exiting
>>>>>>> 09:20:58.968 [main] INFO
>>>>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session:
>>>>>>> 0x100571bda1903b7 closed
>>>>>>> 09:20:58.968 [main] INFO  org.apache.flink.client.cli.CliFrontend  -
>>>>>>> Cancelled job 75e16686cb4fe0d33ead8e29af131d09.
>>>>>>> 09:20:58.969 [main-EventThread] INFO
>>>>>>> o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - 
>>>>>>> EventThread
>>>>>>> shut down for session: 0x100571bda1903b7
>>>>>>>
>>>>>>> I'm assuming that /jobgraphs should only contain the ids of jobs that
>>>>>>> are currently running (at least it seemed that way: when the jobmanager
>>>>>>> restarted it tried to restart the job ids stored there). Is that
>>>>>>> correct?
>>>>>>>
>>>>>>> Gerard
>>>>>>>
>>>>>>> On Wed, Jul 18, 2018 at 9:17 AM vino yang <yanghua1...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Gerard,
>>>>>>>>
>>>>>>>> From the information you provided, do you mean that the Zookeeper
>>>>>>>> path "/jobgraphs" contains more jobs than you submitted?
>>>>>>>> And they can not be restarted because the blob files can not be found?
>>>>>>>>
>>>>>>>> Can you provide more details: the stack trace, logs and which version
>>>>>>>> of Flink? Normally, a jobgraph should only be added to Zookeeper when
>>>>>>>> a job is submitted manually.
>>>>>>>>
>>>>>>>> Thanks, vino.
>>>>>>>>
>>>>>>>> 2018-07-16 21:19 GMT+08:00 gerardg <ger...@talaia.io>:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Our deployment consists of a standalone HA cluster of 8 machines
>>>>>>>>> with an external Zookeeper cluster. We have observed several times
>>>>>>>>> that when a jobmanager fails and a new one is elected, the new one
>>>>>>>>> tries to restart more jobs than the ones that were running and, since
>>>>>>>>> it can't find some files, it fails and gets stuck in a restart loop.
>>>>>>>>> That is the error that we see in the logs:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> These are the contents of /home/nas/flink/ha/default/blob/:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> We've checked zookeeper and there are actually a lot of jobgraphs in
>>>>>>>>> /flink/default/jobgraphs
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> There were only three jobs running, so neither zookeeper nor the
>>>>>>>>> flink 'ha' folder seems to have the correct number of jobgraphs
>>>>>>>>> stored.
>>>>>>>>>
>>>>>>>>> The only way we have to solve this is to remove everything at the
>>>>>>>>> path /flink in zookeeper and the 'ha' flink folder and restart the
>>>>>>>>> jobs manually.
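>>>>>>>>>
>>>>>>>>> (Concretely, with everything stopped, that means roughly "rmr /flink"
>>>>>>>>> from zkCli.sh plus removing the high-availability storage directory,
>>>>>>>>> /home/nas/flink/ha/default in our case, before restarting and
>>>>>>>>> resubmitting the jobs.)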
>>>>>>>>>
>>>>>>>>> I'll try to monitor whether some action (e.g. we have been
>>>>>>>>> cancelling and restoring jobs from savepoints quite often lately)
>>>>>>>>> leaves an entry in zookeeper's path /flink/default/jobgraphs for a
>>>>>>>>> job that is not running, but maybe someone can point us to some
>>>>>>>>> configuration problem that could cause this behavior.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Gerard
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Sent from:
>>>>>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>
