Re: the remote task manager was lost
You could check the TaskManager (TM) log for the remote task and see whether it shows anything abnormal.

Best,
Congxian

赵一旦 wrote on Wed, Dec 2, 2020 at 6:17 PM:
> I always allocate resources in the 80G or 100G range...
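As a rough sketch of that log check, the snippet below scans log lines and keeps the ones that look abnormal. It assumes the TM log uses the same line layout as the JobManager excerpts in this thread (timestamp, level, logger, message). The helper name `find_suspicious` and the keyword list are hypothetical and not part of Flink.

```python
import re

# Assumed layout, copied from the log excerpts in this thread:
#   2020-10-28 00:29:47,353 WARN akka.remote.ReliableDeliverySupervisor [] - message
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<rest>.*)$"
)

# Hypothetical keyword list; adjust it to whatever you are hunting for.
KEYWORDS = ("Exception", "OutOfMemoryError", "Disassociated", "FAILED")


def find_suspicious(lines):
    """Return (timestamp, level, text) for WARN/ERROR lines or keyword hits."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # stack-trace frames and wrapped lines carry no timestamp
        if m["level"] in ("WARN", "ERROR") or any(k in m["rest"] for k in KEYWORDS):
            hits.append((m["ts"], m["level"], m["rest"]))
    return hits


if __name__ == "__main__":
    import sys

    # Usage (the log path is whatever your deployment writes):
    #   python scan_tm_log.py /path/to/taskmanager.log
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
            for ts, level, text in find_suspicious(fh):
                print(ts, level, text)
```

Looking at the entries just before 00:29:47 on the TaskManager side is usually the quickest way to see whether the process exited, was killed, or hit an exception of its own.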
Re: the remote task manager was lost
I always allocate resources in the 80G or 100G range...

guanxianchun wrote on Wed, Oct 28, 2020 at 5:02 PM:
> Flink version: flink-1.11
> taskmanager memory: 8G
> jobmanager memory: 2G
> akka.ask.timeout: 20s
> akka.retry-gate-closed-for: 5000
> client.timeout: 600s
>
> After running for a while, the job reports "the remote task manager was lost".
the remote task manager was lost
Flink version: flink-1.11
taskmanager memory: 8G
jobmanager memory: 2G
akka.ask.timeout: 20s
akka.retry-gate-closed-for: 5000
client.timeout: 600s

After running for a while, the job reports "the remote task manager was lost". The error output is as follows:

2020-10-28 00:25:30,608 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Completed checkpoint 411 for job 031e5f122711786fcc11ee6eb47291fa (2703770 bytes in 336 ms).
2020-10-28 00:27:30,273 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Triggering checkpoint 412 (type=CHECKPOINT) @ 1603816050239 for job 031e5f122711786fcc11ee6eb47291fa.
2020-10-28 00:27:30,776 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Completed checkpoint 412 for job 031e5f122711786fcc11ee6eb47291fa (3466688 bytes in 509 ms).
2020-10-28 00:29:30,246 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Triggering checkpoint 413 (type=CHECKPOINT) @ 1603816170239 for job 031e5f122711786fcc11ee6eb47291fa.
2020-10-28 00:29:30,597 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Completed checkpoint 413 for job 031e5f122711786fcc11ee6eb47291fa (2752681 bytes in 334 ms).
2020-10-28 00:29:47,353 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://fl...@hadoop01.dev.test.cn:13912] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2020-10-28 00:29:47,353 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink-metr...@hadoop01.dev.test.cn:31260] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2020-10-28 00:29:47,377 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - KeyedProcess -> async wait operator -> Map (1/3) (f84731e57528b326ad15ddc17821d1b8) switched from RUNNING to FAILED on org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@538198b8.
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'hadoop01.dev.test.cn/192.168.1.21:7527'. This might indicate that the remote task manager was lost.
    at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:144) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.io.network.netty.NettyMessageClientDecoderDelegate.channelInactive(NettyMessageClientDecoderDelegate.java:97) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1416) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:912) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:816) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.