TaskManagers Crushing

2023-08-19 Thread Kenan Kılıçtepe
Hi,

I have 4 TaskManagers running on 4 servers.
They all crash at the same time without any useful error logs.
The only log entries I can see are some Kafka disconnections for both
consumers and producers.
Any ideas or help would be appreciated.

Some logs from all TaskManagers:

I think server 4 crashes first, and that causes all the TaskManagers to crash.

JobManager:

2023-08-18 15:16:46,528 INFO  org.apache.kafka.clients.NetworkClient [] - [AdminClient clientId=47539-enumerator-admin-client] Node 2 disconnected.
2023-08-18 15:19:00,303 INFO  org.apache.kafka.clients.NetworkClient [] - [AdminClient clientId=tf_25464-enumerator-admin-client] Node 4 disconnected.
2023-08-18 15:19:16,668 INFO  org.apache.kafka.clients.NetworkClient [] - [AdminClient clientId=cpu_59942-enumerator-admin-client] Node 1 disconnected.
2023-08-18 15:19:16,764 INFO  org.apache.kafka.clients.NetworkClient [] - [AdminClient clientId=cpu_55128-enumerator-admin-client] Node 3 disconnected.
2023-08-18 15:19:27,913 WARN  akka.remote.transport.netty.NettyTransport [] - Remote connection to [/10.11.0.51:42778] failed with java.io.IOException: Connection reset by peer
2023-08-18 15:19:27,963 WARN  akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@tef-prod-flink-04:38835] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2023-08-18 15:19:27,967 WARN  akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink-metrics@tef-prod-flink-04:46491] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2023-08-18 15:19:29,225 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph [] - RouterReplacementAlgorithm -> kafkaSink_sinkFaultyRouter_windowMode: Writer -> kafkaSink_sinkFaultyRouter_windowMode: Committer (3/4) (f6fd65e3fc049bd9021093d8f532bbaf_a47f4a3b960228021159de8de51dbb1f_2_0) switched from RUNNING to FAILED on injection-assia-3-pro-cloud-tef-gcp-europe-west1:39011-b24b1d @ injection-assia-3-pro-cloud-tef-gcp-europe-west1 (dataPort=35223).
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'tef-prod-flink-04/10.11.0.51:37505 [ tef-prod-flink-04:38835-e3ca4d ] '. This might indicate that the remote task manager was lost.
    at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:134) ~[flink-dist-1.16.2.jar:1.16.2]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262) ~[flink-dist-1.16.2.jar:1.16.2]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248) ~[flink-dist-1.16.2.jar:1.16.2]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241) ~[flink-dist-1.16.2.jar:1.16.2]
    at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81) ~[flink-dist-1.16.2.jar:1.16.2]
    at org.apache.flink.runtime.io.network.netty.NettyMessageClientDecoderDelegate.channelInactive(NettyMessageClientDecoderDelegate.java:94) ~[flink-dist-1.16.2.jar:1.16.2]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262) ~[flink-dist-1.16.2.jar:1.16.2]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248) ~[flink-dist-1.16.2.jar:1.16.2]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241) ~[flink-dist-1.16.2.jar:1.16.2]
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405) ~[flink-dist-1.16.2.jar:1.16.2]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262) ~[flink-dist-1.16.2.jar:1.16.2]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248) ~[flink-dist-1.16.2.jar:1.16.2]
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901) ~[flink-dist-1.16.2.jar:1.16.2]
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:831) ~[flink-dist-1.16.2.jar:1.16.2]
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.
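
The guess that server 4 goes down first can be checked against the JobManager log rather than by eye. The following is a minimal Python sketch, assuming each log entry sits on a single line as it would in the log file on disk. The function name `first_lost_host` and both regular expressions are illustrative and not part of Flink; the sample lines are copied from the excerpt above.

```python
import re
from datetime import datetime

# Start of a log entry: timestamp, level, remainder of the line.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+(INFO|WARN|ERROR)\s+(.*)$"
)
# Host named in an Akka "Association with remote system [...] has failed" warning.
AKKA_RE = re.compile(
    r"Association with remote system \[akka\.tcp://[\w-]+@([\w.-]+):\d+\] has failed"
)


def first_lost_host(lines):
    """Return (timestamp, host) of the earliest Akka association failure, or None."""
    hits = []
    for line in lines:
        entry = LINE_RE.match(line)
        if not entry:
            continue  # stack-trace frames and continuation lines carry no timestamp
        failure = AKKA_RE.search(entry.group(3))
        if failure:
            ts = datetime.strptime(entry.group(1), "%Y-%m-%d %H:%M:%S,%f")
            hits.append((ts, failure.group(1)))
    return min(hits) if hits else None


# Sample entries taken from the JobManager excerpt in this thread.
SAMPLE_LINES = [
    "2023-08-18 15:19:16,764 INFO  org.apache.kafka.clients.NetworkClient [] - "
    "[AdminClient clientId=cpu_55128-enumerator-admin-client] Node 3 disconnected.",
    "2023-08-18 15:19:27,967 WARN  akka.remote.ReliableDeliverySupervisor [] - "
    "Association with remote system [akka.tcp://flink-metrics@tef-prod-flink-04:46491] "
    "has failed, address is now gated for [50] ms. Reason: [Disassociated]",
    "2023-08-18 15:19:27,963 WARN  akka.remote.ReliableDeliverySupervisor [] - "
    "Association with remote system [akka.tcp://flink@tef-prod-flink-04:38835] "
    "has failed, address is now gated for [50] ms. Reason: [Disassociated]",
]

print(first_lost_host(SAMPLE_LINES))
```

If the earliest failure in the full log names a host other than `tef-prod-flink-04`, the "server 4 first" assumption would need revisiting.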

Re: TaskManagers Crushing

2023-08-20 Thread liu ron
Hi,

Maybe you need to check what changed on the Kafka side at that time.

Best,
Ron
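
One way to judge whether the Kafka side is related is to measure how long before the first Flink-side failure each Kafka disconnect was logged. The sketch below is a rough Python illustration under the assumption that each log entry is on one line; `kafka_disconnect_leads` and its regex are hypothetical helpers, not a Flink or Kafka API. The reference time is the first WARN entry in the posted JobManager log (15:19:27,913).

```python
import re
from datetime import datetime

TS_FMT = "%Y-%m-%d %H:%M:%S,%f"
# Kafka client INFO entry of the form "... Node <id> disconnected."
DISCONNECT_RE = re.compile(
    r"^(\S+ \S+)\s+INFO\s+org\.apache\.kafka\.clients\.NetworkClient .*"
    r"Node (\d+) disconnected\."
)


def kafka_disconnect_leads(lines, failure_ts):
    """Map Kafka node id to seconds between its disconnect entry and failure_ts."""
    leads = {}
    for line in lines:
        m = DISCONNECT_RE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), TS_FMT)
            leads[int(m.group(2))] = (failure_ts - ts).total_seconds()
    return leads


# Entries copied from the JobManager log posted in this thread.
KAFKA_LINES = [
    "2023-08-18 15:16:46,528 INFO  org.apache.kafka.clients.NetworkClient [] - "
    "[AdminClient clientId=47539-enumerator-admin-client] Node 2 disconnected.",
    "2023-08-18 15:19:00,303 INFO  org.apache.kafka.clients.NetworkClient [] - "
    "[AdminClient clientId=tf_25464-enumerator-admin-client] Node 4 disconnected.",
    "2023-08-18 15:19:16,668 INFO  org.apache.kafka.clients.NetworkClient [] - "
    "[AdminClient clientId=cpu_59942-enumerator-admin-client] Node 1 disconnected.",
    "2023-08-18 15:19:16,764 INFO  org.apache.kafka.clients.NetworkClient [] - "
    "[AdminClient clientId=cpu_55128-enumerator-admin-client] Node 3 disconnected.",
]

# Timestamp of the first WARN entry (NettyTransport "Connection reset by peer").
FIRST_WARN = datetime(2023, 8, 18, 15, 19, 27, 913000)

for node, seconds in sorted(kafka_disconnect_leads(KAFKA_LINES, FIRST_WARN).items()):
    print(f"Kafka node {node}: disconnected {seconds:.3f} s before the first WARN")
```

The gaps alone do not prove or rule out a Kafka cause; they only show whether the disconnects cluster around the failure or are spread out well before it.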


Re: TaskManagers Crushing

2023-08-20 Thread Kenan Kılıçtepe
Hi,

Nothing interesting on the Kafka side. Just some partition delete/create logs.
Also, I can't understand why all the TaskManagers stop at the same time
without any error log.

Thanks
Kenan




Re: TaskManagers Crushing

2023-08-20 Thread Shammon FY
Hi,

It seems that the node `tef-prod-flink-04/10.11.0.51:37505 [
tef-prod-flink-04:38835-e3ca4d ]` exited unexpectedly. You can check whether
there are any errors in the log of that TM or in K8S.

Best,
Shammon FY
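
When checking a TM log for this, it can help to narrow it down to WARN and ERROR entries logged shortly before the failure. The following is a small Python sketch, assuming the log uses the same timestamp and level layout as the JobManager excerpt in this thread and one entry per line. `suspicious_entries`, its parameters, and the 60-second default window are illustrative choices. No TaskManager log was posted, so the sample reuses JobManager lines from the thread.

```python
import re
from datetime import datetime, timedelta

ENTRY_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+(\w+)\s+(.*)$"
)


def suspicious_entries(lines, crash_ts, window_s=60, levels=("WARN", "ERROR")):
    """Return (timestamp, level, message) for entries at the given levels
    logged within window_s seconds before crash_ts (both ends inclusive)."""
    start = crash_ts - timedelta(seconds=window_s)
    found = []
    for line in lines:
        m = ENTRY_RE.match(line)
        if not m:
            continue  # continuation lines such as stack frames are skipped
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if m.group(2) in levels and start <= ts <= crash_ts:
            found.append((ts, m.group(2), m.group(3)))
    return found


# JobManager entries from this thread, reused only to exercise the function.
LOG_LINES = [
    "2023-08-18 15:16:46,528 INFO  org.apache.kafka.clients.NetworkClient [] - "
    "[AdminClient clientId=47539-enumerator-admin-client] Node 2 disconnected.",
    "2023-08-18 15:19:27,913 WARN  akka.remote.transport.netty.NettyTransport [] - "
    "Remote connection to [/10.11.0.51:42778] failed with java.io.IOException: "
    "Connection reset by peer",
    "2023-08-18 15:19:27,963 WARN  akka.remote.ReliableDeliverySupervisor [] - "
    "Association with remote system [akka.tcp://flink@tef-prod-flink-04:38835] "
    "has failed, address is now gated for [50] ms. Reason: [Disassociated]",
]

# Time at which the ExecutionGraph reported the task as FAILED.
FAILED_AT = datetime(2023, 8, 18, 15, 19, 29, 225000)

for ts, level, message in suspicious_entries(LOG_LINES, FAILED_AT):
    print(ts, level, message[:60])
```

If the TM log shows nothing at these levels before it goes silent, the process may have been terminated from outside the JVM, so the host or K8S events are the next place to look.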



Re: [EXTERNAL] TaskManagers Crushing

2023-11-29 Thread Ivan Webber via user
Were you ever able to find a workaround for this? I also have transient 
failures due to
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException.
