Re: the remote task manager was lost

2020-12-02 Post by Congxian Qiu
You could look at the TM log for the remote task and check whether it shows anything abnormal.
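
As a rough illustration of that check, here is a minimal sketch that scans TaskManager log lines for entries that often accompany a lost TaskManager. The function name `find_suspicious_lines` and the pattern list are assumptions made for illustration, not anything Flink provides, and the patterns will likely need adjusting for a given deployment.

```python
import re
from typing import Iterable, List, Tuple

# Assumed patterns that often show up shortly before a TaskManager goes away.
# This list is illustrative only; extend or trim it for your own setup.
SUSPICIOUS = re.compile(
    r"\b(ERROR|FATAL)\b|OutOfMemoryError|Exception|SIGTERM|Shutting down"
)


def find_suspicious_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every log line matching SUSPICIOUS."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if SUSPICIOUS.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits
```

It could be applied to an open log file, for example `find_suspicious_lines(open(path, encoding="utf-8", errors="replace"))`, where `path` points at the log of the TaskManager on the lost host. Entries just before 00:29:47 would be the first place to look.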

Best,
Congxian


赵一旦 wrote on Wed, Dec 2, 2020 at 6:17 PM:

> I always allocate resources on the order of 80G or 100G...

Re: the remote task manager was lost

2020-12-02 Post by 赵一旦
I always allocate resources on the order of 80G or 100G...


the remote task manager was lost

2020-10-28 Post by guanxianchun
Flink version: flink-1.11
taskmanager memory: 8G
jobmanager memory: 2G
akka.ask.timeout:20s
akka.retry-gate-closed-for: 5000
client.timeout:600s

After running for a while, the job reports "the remote task manager was lost". The error log is as follows:
2020-10-28 00:25:30,608 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Completed checkpoint 411 for job 031e5f122711786fcc11ee6eb47291fa (2703770 bytes in 336 ms).
2020-10-28 00:27:30,273 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Triggering checkpoint 412 (type=CHECKPOINT) @ 1603816050239 for job 031e5f122711786fcc11ee6eb47291fa.
2020-10-28 00:27:30,776 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Completed checkpoint 412 for job 031e5f122711786fcc11ee6eb47291fa (3466688 bytes in 509 ms).
2020-10-28 00:29:30,246 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Triggering checkpoint 413 (type=CHECKPOINT) @ 1603816170239 for job 031e5f122711786fcc11ee6eb47291fa.
2020-10-28 00:29:30,597 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Completed checkpoint 413 for job 031e5f122711786fcc11ee6eb47291fa (2752681 bytes in 334 ms).
2020-10-28 00:29:47,353 WARN  akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://fl...@hadoop01.dev.test.cn:13912] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2020-10-28 00:29:47,353 WARN  akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink-metr...@hadoop01.dev.test.cn:31260] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2020-10-28 00:29:47,377 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - KeyedProcess -> async wait operator -> Map (1/3) (f84731e57528b326ad15ddc17821d1b8) switched from RUNNING to FAILED on org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@538198b8.
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'hadoop01.dev.test.cn/192.168.1.21:7527'. This might indicate that the remote task manager was lost.
    at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:144) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.io.network.netty.NettyMessageClientDecoderDelegate.channelInactive(NettyMessageClientDecoderDelegate.java:97) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1416) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:912) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:816) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.shaded.