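I would not recommend this, because it would mask the underlying problem. If you do need to configure a retry count or a timeout, quite a few parameters are involved, for example akka.tcp.timeout, taskmanager.network.netty.client.connectTimeoutSec, and taskmanager.network.retries. See [1] for details.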
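As a rough sketch only, the three options named in this reply could be set in flink-conf.yaml as shown here. The values are illustrative placeholders, not recommendations or measured settings; check each option's documented default and semantics for your Flink version before changing anything. Whether any of them affects the read-side timeout in the quoted stack trace is not established in this thread.

```yaml
# flink-conf.yaml (sketch for Flink 1.13; all values are assumed placeholders)

# Timeout for outbound TCP connections made by the RPC layer.
akka.tcp.timeout: 60 s

# Netty client connection timeout between TaskManagers, in seconds.
taskmanager.network.netty.client.connectTimeoutSec: 300

# Number of retry attempts for network communication.
taskmanager.network.retries: 3
```

Larger values only make the job more tolerant of a slow or flaky peer. They do not address whatever is wrong on the affected machine, so diagnosing that machine is still the real fix.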
[1] https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/config/

Best,
Yuxin

On Mon, Mar 6, 2023 at 14:41, crazy <2463829...@qq.com.invalid> wrote:

> Monitoring has not shown any problem with the machine so far. Could increasing the retry count or the timeout mitigate this issue? I am not sure which parameters need to be set.
>
> crazy
> 2463829...@qq.com
>
> ------------------ Original Message ------------------
> From: "user-zh" <tanyuxinw...@gmail.com>;
> Sent: Monday, March 6, 2023, 2:33 PM
> To: "user-zh" <user-zh@flink.apache.org>;
> Subject: Re: Flink job TM "Connection timed out" exception
>
> "When the process is not scheduled onto this machine, the job runs fine": judging from this description, it is indeed quite likely that machine A has a problem.
>
> You can check whether machine A's network, memory, and CPU metrics or monitoring look normal, and whether they differ from the other machines. Examples include the network parameter configuration, faulty memory, or abnormal processes or load on the machine.
>
> If it is a hardware problem, the system logs may contain errors. You can also use machine diagnostic tools such as dmesg and vmstat.
>
> Best,
> Yuxin
>
> On Mon, Mar 6, 2023 at 14:23, crazy <2463829...@qq.com.invalid> wrote:
>
> > Hi all, one of our production jobs fails over frequently. The exception log is as follows:
> >
> > 2023-03-05 11:41:07,847 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Process (287/300) (b3ef27fec49fe3777f830802ef3501e9) switched from RUNNING to FAILED on container_e26_1646120234560_82135_01_000097 @ xx.xx.xx.xx (dataPort=26882).
> > org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: readAddress(..) failed: Connection timed out (connection to 'xxx/10.70.89.25:43923')
> >     at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:201) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1377) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:907) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.handleReadException(AbstractEpollStreamChannel.java:728) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:818) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> >     at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_131]
> > Caused by: org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection timed out
> >
> > Every failure is reported by the process on machine A: switched from RUNNING to FAILED on container_e26_1646120234560_82135_01_000097 @A (dataPort=26882).
> > The load on machine A looks normal, and when the process is not scheduled onto this machine the job runs fine. We currently suspect this machine is causing the problem. How should we go about troubleshooting it? Many thanks.
> >
> > ------------------------------
> > crazy
> > 2463829...@qq.com