如题,我举个例子。 我有个任务,经常失败,现在以其中一个例子。首先看exception history,发现: Time Exception Name Location 2022-03-17 15:09:44 org.apache.flink.runtime.io.network.netty.exception.LocalTransportException bal_ft_baiduid_subid_window_reduce(300s, nulls, 0s) (1/30) - execution #7 m1-sys-rpm109-7c88b.m1.baidu.com:2086
org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: readAddress(..) failed: Connection timed out (connection to ' 10.35.115.170/10.35.115.170:2045') at org.apache.flink.runtime.io.network.netty. CreditBasedPartitionRequestClientHandler.exceptionCaught( CreditBasedPartitionRequestClientHandler.java:201) at org.apache.flink.shaded.netty4.io.netty.channel. AbstractChannelHandlerContext.invokeExceptionCaught( AbstractChannelHandlerContext.java:302) at org.apache.flink.shaded.netty4.io.netty.channel. AbstractChannelHandlerContext.invokeExceptionCaught( AbstractChannelHandlerContext.java:281) at org.apache.flink.shaded.netty4.io.netty.channel. AbstractChannelHandlerContext.fireExceptionCaught( AbstractChannelHandlerContext.java:273) at org.apache.flink.shaded.netty4.io.netty.channel. DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline .java:1377) at org.apache.flink.shaded.netty4.io.netty.channel. AbstractChannelHandlerContext.invokeExceptionCaught( AbstractChannelHandlerContext.java:302) at org.apache.flink.shaded.netty4.io.netty.channel. AbstractChannelHandlerContext.invokeExceptionCaught( AbstractChannelHandlerContext.java:281) at org.apache.flink.shaded.netty4.io.netty.channel. DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:907) at org.apache.flink.shaded.netty4.io.netty.channel.epoll. AbstractEpollStreamChannel$EpollStreamUnsafe.handleReadException( AbstractEpollStreamChannel.java:728) at org.apache.flink.shaded.netty4.io.netty.channel.epoll. AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady( AbstractEpollStreamChannel.java:818) at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop .processReady(EpollEventLoop.java:475) at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop .run(EpollEventLoop.java:378) at org.apache.flink.shaded.netty4.io.netty.util.concurrent. SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) at org.apache.flink.shaded.netty4.io.netty.util.internal. ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.flink.shaded.netty4.io.netty.channel.unix. Errors$NativeIoException: readAddress(..) failed: Connection timed out 根据 location 我找到机器,登陆确认的确有该报错。 为了找到原因,我根据报错,继续查找 Connection timed out (connection to '10.35.115.170/10.35.115.170:2045') 中ip对应的机器,错误时间,找到相关日志如下: INFO 2022-03-17 15:09:44,526 org.apache.flink.runtime.taskmanager.Task - bal_ft_baiduid_Dr12Join -> xxx_TimestampWatermarkAssigner(-90s) (17/30)#7 (f4944df682b1642e0911b2aa39a1279b) switched from RUNNING to CANCELING. 其他日志类似,都是cancel吧啦的。 所以,不论是ui上的错误,还是跟踪一层后的日志(直接是cancel)。那为什么引发了cancel呢? 当然肯定是某中错误导致cancel然后自动恢复啥的。 我需要知道的是,我一个任务120个容器,我怎么去跟踪发现是哪一台容器第一个发生了某中错误,进而导致这一系列问题呢??? 任意一机器A假设网络故障,都可能导致另一个机器连接A出现timeout,但是我需要知道最开始或者真正出问题的是哪一台机器。这个大家都怎么解决一般。