Hi,

To increase the retry time for partition requests, you can raise the configuration option
`taskmanager.network.request-backoff.max`. The default is 10 seconds; see [1] for details.

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#full-taskmanageroptions
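
As a minimal sketch, assuming the job is configured through flink-conf.yaml as in the original report, the change could look like the fragment below. The value 30000 is only an illustrative assumption, not a recommendation; the option takes milliseconds, so the 10-second default corresponds to 10000.

```yaml
# flink-conf.yaml (sketch; the value below is an assumed example, tune it for your job)
# Maximum backoff in milliseconds for partition requests of input channels.
# Default: 10000 (10 seconds).
taskmanager.network.request-backoff.max: 30000
```

A larger value only gives the upstream task more time to come up; it does not explain why the upstream task was not deployed in the first place.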

Best,
Shammon FY

On Tue, Jul 4, 2023 at 11:38 AM zhan...@eastcom-sw.com <
zhan...@eastcom-sw.com> wrote:

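> From the earlier logs, after the restart the job loads checkpoint data from HDFS (about 100 MB),
> and that step seems to take a while; it also has to connect to Kafka to consume.
> Can the downstream timeout retry be configured, either as a retry count or as a duration?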
>
> From: Shammon FY
> Sent: 2023-07-04 10:12
> To: user-zh
> Subject: Re: PartitionNotFoundException restart loop
> Hi,
>
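> A PartitionNotFoundException usually means that a downstream task sent a partition
> request to an upstream task that has not been deployed successfully yet. Normally the
> downstream task retries and reports the exception once the retries time out. Check
> whether there are any other exception logs and find out why the upstream task was not
> deployed successfully.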
>
> Best,
> Shammon FY
>
> On Tue, Jul 4, 2023 at 9:30 AM zhan...@eastcom-sw.com <
> zhan...@eastcom-sw.com> wrote:
>
> >
> > Exception log:
> >
> > 2023-07-03 20:30:15,164 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Sink: Sink 3 (2/45) (79a20a2489a31465de9524eaf6b5ebf7_8fb6014c2df1d028b4c9ec6b86c8738f_1_3093) switched from RUNNING to FAILED on 10.252.210.63:2359-420157 @ nbiot-core-mpp-dcos-b-2.novalocal (dataPort=32769).
> > org.apache.flink.runtime.io.network.partition.PartitionNotFoundException: Partition 65e701af2579c0381a2c3e53bd66fed0#24@79a20a2489a31465de9524eaf6b5ebf7_d952d2a6aebfb900c453884c57f96b82_24_3093 not found.
> >         at org.apache.flink.runtime.io.network.partition.ResultPartitionManager.createSubpartitionView(ResultPartitionManager.java:70) ~[flink-dist-1.17.1.jar:1.17.1]
> >         at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.requestSubpartition(LocalInputChannel.java:136) ~[flink-dist-1.17.1.jar:1.17.1]
> >         at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel$1.run(LocalInputChannel.java:186) ~[flink-dist-1.17.1.jar:1.17.1]
> >         at java.util.TimerThread.mainLoop(Timer.java:555) ~[?:1.8.0_77]
> >         at java.util.TimerThread.run(Timer.java:505) ~[?:1.8.0_77]
> >
> >
> >
> > From: zhan...@eastcom-sw.com
> > Sent: 2023-07-04 09:25
> > To: user-zh
> > Subject: PartitionNotFoundException restart loop
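> >     Hi, I have two fairly high-traffic jobs (300 million / 600 million a day). After starting
> > up and running normally for about 5 or 6 days, they hit a PartitionNotFoundException and
> > then keep restarting in a loop.
> >
> >     After adding the following parameters to flink-conf.yaml, the same thing happens: after
> > 6 days PartitionNotFoundException is reported repeatedly and the jobs keep restarting.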
> >     taskmanager.network.tcp-connection.enable-reuse-across-jobs: false
> >     taskmanager.network.max-num-tcp-connections: 16
> >
> >     The current version is 1.17.1. The same jobs with the same data have always run fine on
> > 1.14.4. Is there any way to fix this?
> >
> >
>
