In the Flink UI on the jobmanager, I sometimes see the taskmanager connected,
with its heartbeat time being updated.

But at other times the taskmanager page is blank, so it may have disconnected.
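
To tell a real disconnect from a UI glitch, the same information could be polled from the jobmanager's monitoring REST API instead of the web page. Below is a rough sketch, not a verified tool. It assumes that GET /taskmanagers returns a JSON object with a "taskmanagers" list whose entries carry "id", "slotsNumber" and "timeSinceLastHeartbeat" (taken here to be the epoch-millisecond timestamp of the last heartbeat). Those field names are assumptions to check against the Flink version in use, and the host and port in the usage note are hypothetical.

```python
import json
import time
from urllib.request import urlopen


def summarize_taskmanagers(payload, now_ms):
    """Return (id, slots, seconds since last heartbeat) per registered taskmanager."""
    rows = []
    for tm in payload.get("taskmanagers", []):
        # Assumed field: epoch millis of the last heartbeat seen by the jobmanager.
        last = tm.get("timeSinceLastHeartbeat")
        age_s = None if last is None else (now_ms - last) / 1000.0
        rows.append((tm.get("id"), tm.get("slotsNumber"), age_s))
    return rows


def poll(base_url, interval_s=10):
    """Print the registered taskmanagers every interval_s seconds until interrupted."""
    while True:
        with urlopen(base_url + "/taskmanagers", timeout=5) as resp:
            payload = json.load(resp)
        rows = summarize_taskmanagers(payload, int(time.time() * 1000))
        print(time.strftime("%H:%M:%S"), rows if rows else "no taskmanager registered")
        time.sleep(interval_s)
```

Calling poll("http://jobmanager-host:8081") and leaving it running would show whether the taskmanager list really goes empty or whether only the heartbeat age grows.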



On Wed, Apr 11, 2018 at 1:31 PM, Steven Wu <stevenz...@gmail.com> wrote:

> Hi,
>
> After this error/exception, it seems that the taskmanager never connects to
> the jobmanager again. The job is stuck in the failed state because there are
> not enough slots to recover it.
>
> Let's assume there was a temporary glitch between the jobmanager and
> ZooKeeper. Would it cause such a permanent failure in Flink?
>
> I checked the ZooKeeper records.
> * The leader zknode seems to have the correct info for "job_manager_lock".
> * I am not sure how to read the leaderlatch zknode.
>
>
> A little more about the job
> * standalone cluster mode
> * 1 jobmanager
> * 1 taskmanager
>
> Thanks,
> Steven
>
> 2018-04-11 01:11:48,007 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to fail task externally Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc).
> 2018-04-11 01:11:48,007 INFO  org.apache.flink.runtime.taskmanager.Task                     - Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc) switched from RUNNING to FAILED.
> java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@1.2.3.4:42787/user/jobmanager: Old JobManager lost its leadership.
>         at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1073)
>         at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerLeaderAddress(TaskManager.scala:1467)
>         at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:277)
>         at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>         at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
>         at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>         at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>         at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>         at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>         at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>         at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>         at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:121)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>         at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2018-04-11 01:11:48,011 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc).
> 2018-04-11 01:11:48,013 INFO  org.apache.flink.runtime.blob.PermanentBlobCache              - Shutting down BLOB cache
>
>
