from Flink UI on jobmanager, sometimes I saw taskmanager connected and heartbeat time got updated.
but then sometimes the taskmanager page become blank. maybe disconnected. On Wed, Apr 11, 2018 at 1:31 PM, Steven Wu <stevenz...@gmail.com> wrote: > Hi, > > After this error/exception, it seems that taskmanager never connects to > jobmanager anymore. Job stuck in failed state because there is not enough > slots to recover the job. > > let's assume there was a temp glitch btw jobmanager and zk. would it cause > such a permanent failure in Flink? > > I checked the zookeeper record. > * leader zknode seems to have the correct info for "job_manager_lock" > * I am not sure how to read the leaderlatch zknode > > > A little more about the job > * standalone cluster mode > * 1 jobmanager > * 1 taskmanager > > Thanks, > Steven > > *2018-04-11 01:11:48,007 INFO org.apache.flink.runtime.taskmanager.Task > - Attempting to fail task externally Source: > kafkasource -> Sink: s3sink (1/1) > (5a7dba2e186b9fdaebb62bdd703dc7dc).2018-04-11 01:11:48,007 INFO > org.apache.flink.runtime.taskmanager.Task - Source: > kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc) > switched from RUNNING to FAILED.java.lang.Exception: TaskManager > akka://flink/user/taskmanager disconnects from JobManager > akka.tcp://flink@1.2.3.4:42787/user/jobmanager > <http://flink@1.2.3.4:42787/user/jobmanager>: Old JobManager lost its > leadership. at > org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1073) > at org.apache.flink.runtime.taskmanager.TaskManager.org > <http://org.apache.flink.runtime.taskmanager.TaskManager.org>$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerLeaderAddress(TaskManager.scala:1467) > at > org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:277) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) > at > org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) > at > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) > at > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) > at > scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) > at > org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) > at akka.actor.Actor$class.aroundReceive(Actor.scala:502) at > org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:121) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) > at akka.actor.ActorCell.invoke(ActorCell.scala:495) at > akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) at > akka.dispatch.Mailbox.run(Mailbox.scala:224) at > akka.dispatch.Mailbox.exec(Mailbox.scala:234) at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)2018-04-11 > 01:11:48,011 INFO org.apache.flink.runtime.taskmanager.Task > - Triggering cancellation of task code Source: > kafkasource -> Sink: s3sink (1/1) > (5a7dba2e186b9fdaebb62bdd703dc7dc).2018-04-11 01:11:48,013 INFO > org.apache.flink.runtime.blob.PermanentBlobCache - Shutting > down BLOB cache* > >