Hi Lu,
Xintong has a professional analysis about TM heartbeat timeout in a
historical mail[1], please check if it could help.

[1]
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Heartbeat-of-TaskManager-timed-out-td36228.html

Best regards,
JING ZHANG

Lu Niu <qqib...@gmail.com> 于2021年6月11日周五 上午1:33写道:

> Hi, Flink User
>
> Several of our applications get heartbeat timeout occasionally. there is no
> GC, no OOM:
> ```
> - realtime conversion event filter (49/120)
> (16e3d3e7608eed0a30e0508eb52065fd) switched from RUNNING to FAILED on
> container_e05_1599158866703_129001_01_000111 @
> xenon-pii-prod-001-20191210-data-slave-prod-0a01bbdd.ec2.pin220.com
> (dataPort=39013). java.util.concurrent.TimeoutException: Heartbeat of
> TaskManager with id container_e05_1599158866703_129001_01_000111 timed out.
> at
>
> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1193)
> at
>
> org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:109)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) at
>
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
> at
>
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
> at
>
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
> at
>
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) at
> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) at
> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) at
> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) at
> akka.actor.Actor$class.aroundReceive(Actor.scala:517) at
> akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) at
> akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) at
> akka.actor.ActorCell.invoke(ActorCell.scala:561) at
> akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) at
> akka.dispatch.Mailbox.run(Mailbox.scala:225) at
> akka.dispatch.Mailbox.exec(Mailbox.scala:235) at
> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at
>
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at
>
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> ```
> Shall we try increase `heartbeat.timeout` ? Any side effects? E.g, would
> that lead to slower detection when TM container is killed by YARN?
>
> we use flink 1.11
>
> Best
> Lu
> Pinterest, Inc.
>

Reply via email to