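Given that "the job runs fine when the process is not scheduled onto this machine", your description does suggest that machine A is very likely the problem.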
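
To back this up with numbers, you could count the failures per host in the JobManager log. This is only a minimal sketch: it assumes the log lines look like the "switched from RUNNING to FAILED on <container> @ <host>" line quoted below, and the helper name failures_by_host is made up for illustration.

```python
import re
from collections import Counter

# Matches the "switched from RUNNING to FAILED on <container> @ <host>" pattern
# seen in the quoted log. \s+ tolerates a line wrap between the tokens.
FAILED_RE = re.compile(r"switched from RUNNING to FAILED\s+on\s+(\S+)\s+@\s+(\S+)")


def failures_by_host(log_text):
    """Count task failures per host in JobManager log text (hypothetical helper)."""
    return Counter(host for _container, host in FAILED_RE.findall(log_text))
```

If one host dominates the result of failures_by_host(open("jobmanager.log").read()).most_common(), that supports the suspicion. The file name here is just a placeholder.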

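You can check whether the network, memory, and CPU metrics or monitoring on machine A look normal, and whether they differ from those of the other machines. For example, check the network parameter configuration, whether the memory is faulty, and whether the machine has abnormal processes or unusual load.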
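
One way to compare the network parameter configuration is to dump the kernel settings on machine A and on a healthy machine (for example with `sysctl -a`) and diff them. A minimal sketch, assuming the dumps are in the usual "key = value" form; parse_sysctl and diff_settings are illustrative names, not part of any tool.

```python
def parse_sysctl(text):
    """Parse `sysctl -a` style output ("key = value" per line) into a dict."""
    settings = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank lines or anything that is not a setting
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def diff_settings(suspect, healthy):
    """Return {key: (suspect_value, healthy_value)} for settings that differ."""
    diffs = {}
    for key in sorted(set(suspect) | set(healthy)):
        a, b = suspect.get(key), healthy.get(key)
        if a != b:
            diffs[key] = (a, b)  # None means the key is missing on that side
    return diffs
```

Since the failure is a connection timeout, any difference under net.ipv4.tcp_* or net.core.* would be the first thing I would look at.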

If it is a hardware problem, the system logs may contain errors. You can also use machine diagnostic tools such as dmesg and vmstat.
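
If the dmesg output is long, a small filter can help surface the suspicious lines. This is only a sketch: the keyword list below is my own guess at common hardware and network error markers, not an authoritative list, so adjust it to the drivers and hardware on machine A.

```python
import re

# Hypothetical keyword list (an assumption, not exhaustive).
SUSPICIOUS = re.compile(
    r"hardware error|mce:|i/o error|link is down|out of memory|call trace",
    re.IGNORECASE,
)


def suspicious_lines(dmesg_text):
    """Return the lines of dmesg output that match any suspicious keyword."""
    return [line for line in dmesg_text.splitlines() if SUSPICIOUS.search(line)]
```

For example, save the output of `dmesg` to a file on machine A and pass its contents to suspicious_lines.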

Best,
Yuxin


On Mon, Mar 6, 2023 at 14:23, crazy <2463829...@qq.com.invalid> wrote:

> Hi all, one of our production jobs fails over frequently. The exception log is as follows:
>
> 2023-03-05 11:41:07,847 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Process 
> (287/300) (b3ef27fec49fe3777f830802ef3501e9) switched from RUNNING to FAILED 
> on container_e26_1646120234560_82135_01_000097 @ xx.xx.xx.xx (dataPort=26882).
> org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: 
> readAddress(..) failed: Connection timed out (connection to 
> 'xxx/10.70.89.25:43923')
>       at 
> org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:201)
>  ~[flink-dist_2.11-1.13.5.jar:1.13.5]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
>  ~[flink-dist_2.11-1.13.5.jar:1.13.5]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
>  ~[flink-dist_2.11-1.13.5.jar:1.13.5]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273)
>  ~[flink-dist_2.11-1.13.5.jar:1.13.5]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1377)
>  ~[flink-dist_2.11-1.13.5.jar:1.13.5]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
>  ~[flink-dist_2.11-1.13.5.jar:1.13.5]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
>  ~[flink-dist_2.11-1.13.5.jar:1.13.5]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:907)
>  ~[flink-dist_2.11-1.13.5.jar:1.13.5]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.handleReadException(AbstractEpollStreamChannel.java:728)
>  ~[flink-dist_2.11-1.13.5.jar:1.13.5]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:818)
>  ~[flink-dist_2.11-1.13.5.jar:1.13.5]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475)
>  ~[flink-dist_2.11-1.13.5.jar:1.13.5]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>  ~[flink-dist_2.11-1.13.5.jar:1.13.5]
>       at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>  ~[flink-dist_2.11-1.13.5.jar:1.13.5]
>       at 
> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>  ~[flink-dist_2.11-1.13.5.jar:1.13.5]
>       at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_131]
> Caused by: 
> org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException:
>  readAddress(..) failed: Connection timed out
>
>
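> Every failure is reported by the process on machine A: switched from RUNNING to FAILED on
> container_e26_1646120234560_82135_01_000097 @A (dataPort=26882).
> Machine A's load looks normal, and the job runs fine when the process is not scheduled onto this machine, so I currently suspect this machine is causing the problem. How should I go about troubleshooting this? Many thanks.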
>
>
> ------------------------------
> crazy
> 2463829...@qq.com
