[ 
https://issues.apache.org/jira/browse/KAFKA-19905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Federico Valeri updated KAFKA-19905:
------------------------------------
    Description: 
During shutdown, nodes 1 and 2 (the brokers) get stuck in a tight reconnection 
loop, trying to connect to node 0 (the controller) every 50 ms. The issue is 
timing sensitive, but it can easily be reproduced by shutting down all nodes at 
the same time (e.g. pkill -SIGTERM kafka.Kafka when running them all on the 
same machine).

The problem is that the NodeToControllerRequestThread keeps running even during 
shutdown. The RaftControllerNodeProvider still returns node 0 as the controller 
based on cached Raft metadata, but node 0 has already terminated 
(NodeToControllerChannelManager:323).
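
To make the failure mode concrete, here is a minimal Scala sketch of the 
pattern described above. It is not the actual Kafka code; all names 
(ReconnectLoopSketch, StaleNodeProvider, tryConnect) are illustrative. The loop 
keeps asking the provider for the controller, fails to connect, backs off 50 ms 
and retries, and nothing in it reacts to the broker's own shutdown.

{code}
// Illustrative sketch only, not the Kafka implementation.
object ReconnectLoopSketch {

  // Stands in for RaftControllerNodeProvider: it keeps returning the cached
  // controller (node 0) even after that process has terminated.
  final class StaleNodeProvider {
    def controllerNode(): Option[Int] = Some(0)
  }

  // Stands in for the connection attempt made by the heartbeat thread.
  // It always fails here, because the controller is already gone.
  private def tryConnect(nodeId: Int): Boolean = false

  def main(args: Array[String]): Unit = {
    val provider = new StaleNodeProvider
    val retryBackoffMs = 50L // matches the ~50 ms cadence in the logs below

    // The loop keeps re-reading the stale controller and retrying; nothing in
    // it checks whether the broker itself is shutting down. It is bounded to
    // ten iterations here only so the sketch terminates.
    var attempts = 0
    while (attempts < 10) {
      provider.controllerNode() match {
        case Some(id) if !tryConnect(id) =>
          println(s"Connection to node $id could not be established, retrying")
          Thread.sleep(retryBackoffMs)
        case _ => // connected, or no controller known
      }
      attempts += 1
    }
  }
}
{code}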

Looking at the logs, the controller shut down at 12:31:38 while the brokers 
were still in controlled shutdown. The sequence is:

1. Nodes 1 and 2 request controlled shutdown
2. The controller grants the shutdown
3. The controller itself shuts down (RaftManager shutdown at 12:31:38)
4. Nodes 1 and 2 keep trying to heartbeat to the now-dead controller
5. They get stuck in the reconnection loop because the 
NodeToControllerRequestThread is still running and has not been shut down 
properly

{code}
[2025-11-21 12:31:38,515] INFO [NodeToControllerChannelManager id=2 
name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,515] WARN [NodeToControllerChannelManager id=2 
name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be 
established. Node may not be available. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,515] INFO 
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
controller, from now on will use node localhost:9090 (id: 0 rack: null 
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,566] INFO 
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
controller, from now on will use node localhost:9090 (id: 0 rack: null 
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,566] INFO [NodeToControllerChannelManager id=2 
name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,567] WARN [NodeToControllerChannelManager id=2 
name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be 
established. Node may not be available. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,567] INFO 
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
controller, from now on will use node localhost:9090 (id: 0 rack: null 
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,616] INFO 
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
controller, from now on will use node localhost:9090 (id: 0 rack: null 
isFenced: false) (kafka.server.NodeToControllerRequestThread)
{code}

The reconnection loop goes on for exactly 5 minutes, which is the shutdown 
timeout hard-coded in the KafkaBroker trait.
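
The "exactly 5 minutes" behaviour is consistent with a bounded wait of the 
following shape (illustrative sketch only, not the actual KafkaBroker code): 
the waiting side gives up once a hard-coded timeout expires, which is why 
shutdown eventually completes even though the heartbeat thread never stopped on 
its own.

{code}
// Illustrative sketch only, not the Kafka implementation.
import java.util.concurrent.{CountDownLatch, TimeUnit}

object BoundedShutdownWaitSketch {
  def main(args: Array[String]): Unit = {
    // Signalled when all broker components have shut down. In the bug
    // scenario the heartbeat thread never finishes, so it never counts down.
    val shutdownComplete = new CountDownLatch(1)

    // The timeout described above is 5 minutes; 2 seconds are used here only
    // so the sketch finishes quickly.
    val finishedInTime = shutdownComplete.await(2, TimeUnit.SECONDS)
    if (!finishedInTime)
      println("Shutdown wait timed out, exiting anyway")
  }
}
{code}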

This is what I have from another test run's logs for one of the brokers:

- SIGTERM received: 14:39:46,282
- Actual shutdown completed: 14:44:46,385
- Time elapsed: 5 minutes and 0.103 seconds (approximately 5 minutes)

I acknowledge that this is unlikely to happen with brokers running on different 
machines, but it is not so unlikely when running tests locally on a single 
physical machine.





> Tight reconnection loop during shutdown
> ---------------------------------------
>
>                 Key: KAFKA-19905
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19905
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 4.1.1
>            Reporter: Federico Valeri
>            Assignee: Federico Valeri
>            Priority: Major
>         Attachments: Screenshot From 2025-11-25 14-40-25.png, test.zip
>


