Hi 集群负载比较大的时候,下游一直收不到request的partition,就会导致PartitionNotFoundException,建议增大 taskmanager.network.request-backoff.max [1][2] 以增大重试次数
[1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#taskmanager-network-request-backoff-max [2] https://juejin.cn/post/6844904185347964942#heading-8 祝好 唐云 ________________________________ From: 赵一旦 <hinobl...@gmail.com> Sent: Monday, November 23, 2020 13:08 To: user-zh@flink.apache.org <user-zh@flink.apache.org> Subject: Re: Flink任务启动偶尔报错PartitionNotFoundException,会自动恢复。 这个报错和kafka没有关系的哈,我大概理解是提交任务的瞬间,jobManager/taskManager机器压力较大,存在机器之间心跳超时什么的? 这个partition应该是指flink运行图中的数据partition,我感觉。没有具体细看,每次提交的瞬间可能遇到这个问题,然后会自动重试成功。 zhisheng <zhisheng2...@gmail.com> 于2020年11月18日周三 下午10:51写道: > 是不是有 kafka 机器挂了? > > Best > zhisheng > > hailongwang <18868816...@163.com> 于2020年11月18日周三 下午5:56写道: > > > 感觉还有其它 root cause,可以看下还有其它日志不? > > > > > > Best, > > Hailong > > > > At 2020-11-18 15:52:57, "赵一旦" <hinobl...@gmail.com> wrote: > > >2020-11-18 16:51:37 > > >org.apache.flink.runtime.io > .network.partition.PartitionNotFoundException: > > >Partition > > b225fa9143dfa179d3a3bd223165d5c5#3@3fee4d51f5a43001ef743f3f15e4cfb2 > > >not found. > > > at org.apache.flink.runtime.io.network.partition.consumer. > > >RemoteInputChannel.failPartitionRequest(RemoteInputChannel.java:267) > > > at org.apache.flink.runtime.io.network.partition.consumer. > > > > > >RemoteInputChannel.retriggerSubpartitionRequest(RemoteInputChannel.java:166) > > > at org.apache.flink.runtime.io.network.partition.consumer. > > >SingleInputGate.retriggerPartitionRequest(SingleInputGate.java:521) > > > at org.apache.flink.runtime.io.network.partition.consumer. > > > > > >SingleInputGate.lambda$triggerPartitionStateCheck$1(SingleInputGate.java:765 > > >) > > > at > java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture > > >.java:670) > > > at java.util.concurrent.CompletableFuture$UniAccept.tryFire( > > >CompletableFuture.java:646) > > > at java.util.concurrent.CompletableFuture$Completion.run( > > >CompletableFuture.java:456) > > > at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) > > > at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec( > > >ForkJoinExecutorConfigurator.scala:44) > > > at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > > > at > akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool > > >.java:1339) > > > at > > akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > > > at > > akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread > > >.java:107) > > > > > > > > >请问这是什么问题呢? > > >