Hi, After this error/exception, it seems that taskmanager never connects to jobmanager anymore. Job stuck in failed state because there is not enough slots to recover the job.
let's assume there was a temp glitch btw jobmanager and zk. would it cause such a permanent failure in Flink? I checked the zookeeper record. * leader zknode seems to have the correct info for "job_manager_lock" * I am not sure how to read the leaderlatch zknode A little more about the job * standalone cluster mode * 1 jobmanager * 1 taskmanager Thanks, Steven *2018-04-11 01:11:48,007 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc).2018-04-11 01:11:48,007 INFO org.apache.flink.runtime.taskmanager.Task - Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc) switched from RUNNING to FAILED.java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@1.2.3.4:42787/user/jobmanager <http://flink@1.2.3.4:42787/user/jobmanager>: Old JobManager lost its leadership. at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1073) at org.apache.flink.runtime.taskmanager.TaskManager.org <http://org.apache.flink.runtime.taskmanager.TaskManager.org>$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerLeaderAddress(TaskManager.scala:1467) at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:277) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) at akka.actor.Actor$class.aroundReceive(Actor.scala:502) at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:121) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) at akka.actor.ActorCell.invoke(ActorCell.scala:495) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) at akka.dispatch.Mailbox.run(Mailbox.scala:224) at akka.dispatch.Mailbox.exec(Mailbox.scala:234) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)2018-04-11 01:11:48,011 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc).2018-04-11 01:11:48,013 INFO org.apache.flink.runtime.blob.PermanentBlobCache - Shutting down BLOB cache*