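Hi,

The "unreachable" TM heartbeat error above usually means the TaskManager (TM) has exited; check why it exited.

For the checkpoint timeout below, check whether the job is under backpressure or has a similar problem. You can also look at how long the checkpoints take and consider increasing the checkpoint timeout.
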
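As a rough illustration of the timeout suggestion, the fragment below sketches how the checkpoint timeout might be raised in `flink-conf.yaml`. The keys are standard Flink configuration options, but the values are placeholders picked for illustration and are not taken from this thread. Choose them from the checkpoint durations you actually observe, and treat a longer timeout as a stopgap if backpressure is the real cause.

```yaml
# Illustrative values only (assumptions, not recommendations from this thread).

# Allow each checkpoint more time before it is declared expired
# (the Flink default is 10 min).
execution.checkpointing.timeout: 20 min

# Optionally tolerate a few failed checkpoints before the job is failed.
execution.checkpointing.tolerable-failed-checkpoints: 3
```
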
Best,
Shammon

On Thu, Feb 16, 2023 at 10:34 AM lxk <lxk7...@163.com> wrote:

> Hello, you could take a memory dump and analyze it.
>
> At 2023-02-16 10:05:19, "Fei Han" <hanfeizi0...@aliyun.com.INVALID> wrote:
> >@all
> >Hi everyone! My Flink version is 1.14.5 and my CDC version is 2.2.1. After the job has been running on YARN for a while, the following error appears:
> >
> >org.apache.flink.runtime.jobmaster.JobMasterException: TaskManager with id container_e506_1673750933366_49579_01_000002(hdp-server-010.yigongpin.com:8041) is no longer reachable.
> >    at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyTargetUnreachable(JobMaster.java:1359) ~[flink-dist_2.12-1.14.5.jar:1.14.5]
> >    at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.reportHeartbeatRpcFailure(HeartbeatMonitorImpl.java:123) ~[flink-dist_2.12-1.14.5.jar:1.14.5]
> >    at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.runIfHeartbeatMonitorExists(HeartbeatManagerImpl.java:275) ~[flink-dist_2.12-1.14.5.jar:1.14.5]
> >    at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.reportHeartbeatTargetUnreachable(HeartbeatManagerImpl.java:267) ~[flink-dist_2.12-1.14.5.jar:1.14.5]
> >    at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.handleHeartbeatRpcFailure(HeartbeatManagerImpl.java:262) ~[flink-dist_2.12-1.14.5.jar:1.14.5]
> >    at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.lambda$handleHeartbeatRpc$0(HeartbeatManagerImpl.java:248) ~[flink-dist_2.12-1.14.5.jar:1.14.5]
> >    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) ~[?:1.8.0_181]
> >    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) ~[?:1.8.0_181]
> >    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442) ~[?:1.8.0_181]
> >    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRunAsync$4(AkkaRpcActor.java:455) ~[flink-rpc-akka_dec09d13-99a1-420c-b835-8157413a3db0.jar:1.14.5]
> >    at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) ~[flink-rpc-akka_dec09d13-99a1-420c-b835-8157413a3db0.jar:1.14.5]
> >    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:455) ~[flink-rpc-akka_dec09d13-99a1-420c-b835-8157413a3db0.jar:1.14.5]
> >    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:213) ~[flink-rpc-akka_dec09d13-99a1-420c-b835-8157413a3db0.jar:1.14.5]
> >    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:78) ~[flink-rpc-akka_dec09d13-99a1-420c-b835-8157413a3db0.jar:1.14.5]
> >    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163) ~[flink-rpc-akka_dec09d13-99a1-420c-b835-8157413a3db0.jar:1.14.5]
> >    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) [flink-rpc-akka_dec09d13-99a1-420c-b835-8157413a3db0.jar:1.14.5]
> >    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) [flink-rpc-akka_dec09d13-99a1-420c-b835-8157413a3db0.jar:1.14.5]
> >    at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) [flink-rpc-akka_dec09d13-99a1-420c-b835-8157413a3db0.jar:1.14.5]
> >    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) [flink-rpc-akka_dec09d13-99a1-420c-b835-8157413a3db0.jar:1.14.5]
> >
> >After the error above, the following checkpoint error also appears:
> >
> >org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint expired before completing.
> >    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2000) [flink-dist_2.12-1.14.5.jar:1.14.5]
> >    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_181]
> >    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_181]
> >    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_181]
> >    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_181]
> >    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
> >    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
> >    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
> >
> >Could the experts here advise? How can these two issues be tuned, and is there a good approach?