I would not recommend doing that, because it would only mask the underlying problem.

But if you really do need to configure the "retry count" or "timeout", several parameters are involved, for example akka.tcp.timeout,
taskmanager.network.netty.client.connectTimeoutSec,
and taskmanager.network.retries. See [1] for details.
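
As a rough sketch, the three options named above could be set in flink-conf.yaml like this. The values are illustrative placeholders only, not recommendations; please check [1] for the defaults and exact semantics in your Flink version.

```yaml
# flink-conf.yaml (sketch; every value below is an assumed placeholder)

# Timeout for outbound Akka (RPC) TCP connections.
akka.tcp.timeout: 60 s

# Netty client connection timeout between TaskManagers, in seconds.
taskmanager.network.netty.client.connectTimeoutSec: 300

# Number of retry attempts for network communication.
taskmanager.network.retries: 3
```

Raising these only buys more tolerance for a slow or flaky connection; it does not address whatever is wrong with the machine itself.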

[1]
https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/config/

Best,
Yuxin


On Mon, Mar 6, 2023 at 14:41, crazy <2463829...@qq.com.invalid> wrote:

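> Monitoring has not shown anything wrong with the machine so far. Could we mitigate this by increasing the "retry count" or the "timeout"? I am not sure which parameters need to be set.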
>
> crazy
> 2463829...@qq.com
>
> ------------------ Original Message ------------------
> From: "user-zh" <tanyuxinw...@gmail.com>
> Sent: Monday, March 6, 2023, 2:33 PM
> To: "user-zh" <user-zh@flink.apache.org>
> Subject: Re: Flink job TM "Connection timed out" exception
>
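> Given your description that "the job runs normally if the process is not scheduled onto this machine", it is indeed quite likely that machine A is the problem.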
>
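> You can check whether machine A's network, memory, and CPU metrics or monitoring look normal, and whether they differ from the other machines:
> for example, the network parameter configuration, whether the memory is faulty, and whether the machine has abnormal processes or unusual load.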
>
> If it is a hardware problem, the system logs may contain errors. You can also use machine inspection tools such as dmesg and vmstat.
>
> Best,
> Yuxin
>
>
> On Mon, Mar 6, 2023 at 14:23, crazy <2463829...@qq.com.invalid> wrote:
>
> > Hi all, one of our production jobs fails over frequently. The exception log is as follows:
> >
> > 2023-03-05 11:41:07,847 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Process (287/300) (b3ef27fec49fe3777f830802ef3501e9) switched from RUNNING to FAILED on container_e26_1646120234560_82135_01_000097 @ xx.xx.xx.xx (dataPort=26882).
> > org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: readAddress(..) failed: Connection timed out (connection to 'xxx/10.70.89.25:43923')
> >     at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:201) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1377) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:907) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.handleReadException(AbstractEpollStreamChannel.java:728) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:818) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_131]
> > Caused by: org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection timed out
> >
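> > Every failure is reported by the process on machine A: switched from RUNNING to FAILED on
> > container_e26_1646120234560_82135_01_000097 @A (dataPort=26882).
> > Machine A's load looks normal, and the job runs normally when the process is not scheduled onto this machine. We currently suspect this machine is causing the problem. How should we go about troubleshooting it? Many thanks.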
> >
> > ------------------------------
> > crazy
> > 2463829...@qq.com
