Hi,

After this error/exception, it seems that taskmanager never connects to
jobmanager anymore.  Job stuck in failed state because there is not enough
slots to recover the job.

let's assume there was a temp glitch btw jobmanager and zk. would it cause
such a permanent failure in Flink?

I checked the zookeeper record.
* leader zknode seems to have the correct info for "job_manager_lock"
* I am not sure how to read the leaderlatch zknode


A little more about the job
* standalone cluster mode
* 1 jobmanager
* 1 taskmanager

Thanks,
Steven

*2018-04-11 01:11:48,007 INFO  org.apache.flink.runtime.taskmanager.Task
                    - Attempting to fail task externally Source:
kafkasource -> Sink: s3sink (1/1)
(5a7dba2e186b9fdaebb62bdd703dc7dc).2018-04-11 01:11:48,007 INFO
 org.apache.flink.runtime.taskmanager.Task                     - Source:
kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc)
switched from RUNNING to FAILED.java.lang.Exception: TaskManager
akka://flink/user/taskmanager disconnects from JobManager
akka.tcp://flink@1.2.3.4:42787/user/jobmanager
<http://flink@1.2.3.4:42787/user/jobmanager>: Old JobManager lost its
leadership.        at
org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1073)
       at org.apache.flink.runtime.taskmanager.TaskManager.org
<http://org.apache.flink.runtime.taskmanager.TaskManager.org>$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerLeaderAddress(TaskManager.scala:1467)
       at
org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:277)
       at
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
       at
org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
       at
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
       at
org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
       at
org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
       at
scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
       at
org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
       at akka.actor.Actor$class.aroundReceive(Actor.scala:502)        at
org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:121)
       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
       at akka.actor.ActorCell.invoke(ActorCell.scala:495)        at
akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)        at
akka.dispatch.Mailbox.run(Mailbox.scala:224)        at
akka.dispatch.Mailbox.exec(Mailbox.scala:234)        at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
       at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
       at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
       at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)2018-04-11
01:11:48,011 INFO  org.apache.flink.runtime.taskmanager.Task
                    - Triggering cancellation of task code Source:
kafkasource -> Sink: s3sink (1/1)
(5a7dba2e186b9fdaebb62bdd703dc7dc).2018-04-11 01:11:48,013 INFO
 org.apache.flink.runtime.blob.PermanentBlobCache              - Shutting
down BLOB cache*

Reply via email to