[jira] [Updated] (KAFKA-19905) Tight reconnection loop during shutdown

Federico Valeri (Jira) Fri, 21 Nov 2025 07:31:06 -0800


     [ 
https://issues.apache.org/jira/browse/KAFKA-19905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Federico Valeri updated KAFKA-19905:
------------------------------------
    Description: 
During shutdown, nodes 1 and 2 (brokers) are stuck in an infinite loop trying 
to connect to node 0 (the controller) every 50ms. The issue is time sensitive, 
but it can be reproduced easily shutting down all nodes at the same time.

The problem is that even during shutdown, the NodeToControllerRequestThread 
continues to run. The RaftControllerNodeProvider still returns node 0 as the 
controller from cached Raft metadata, but node 0 has already terminated 
(NodeToControllerChannelManager:323).

Looking at logs, the controller shut down at 12:31:38 while brokers were still 
in controlled shutdown. The sequence shows:

1. Node 1 and 2 request controlled shutdown
2. Controller grants the shutdown
3. Controller itself shuts down (RaftManager shutdown at 12:31:38)
4. Node 1 and 2 continue trying to heartbeat to the now-dead controller
5. They get stuck in this reconnection loop because the 
NodeToControllerRequestThread is still running and hasn't been shut down 
properly

{code}
[2025-11-21 12:31:38,515] INFO [NodeToControllerChannelManager id=2 
name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,515] WARN [NodeToControllerChannelManager id=2 
name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be 
established. Node may not be available. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,515] INFO 
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
controller, from now on will use node localhost:9090 (id: 0 rack: null 
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,566] INFO 
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
controller, from now on will use node localhost:9090 (id: 0 rack: null 
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,566] INFO [NodeToControllerChannelManager id=2 
name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,567] WARN [NodeToControllerChannelManager id=2 
name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be 
established. Node may not be available. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,567] INFO 
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
controller, from now on will use node localhost:9090 (id: 0 rack: null 
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,616] INFO 
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
controller, from now on will use node localhost:9090 (id: 0 rack: null 
isFenced: false) (kafka.server.NodeToControllerRequestThread)
{code}




  was:
During shutdown, nodes 1 and 2 (brokers) are stuck in an infinite loop trying 
to connect to node 0 (the controller) every 50ms. The issue is time sensitive, 
but it can be reproduced easily shutting down all nodes at the same time.

The problem is that even during shutdown, the NodeToControllerRequestThread 
continues to run. The RaftControllerNodeProvider still returns node 0 as the 
controller from cached Raft metadata, but node 0 has already terminated 
(NodeToControllerChannelManager:323).

Looking at logs, the controller shut down at 12:31:38 while brokers were still 
in controlled shutdown. The sequence shows:

1. Node 1 and 2 request controlled shutdown
2. Controller grants the shutdown
3. Controller itself shuts down (RaftManager shutdown at 12:31:38)
4. Node 1 and 2 continue trying to heartbeat to the now-dead controller
5. They get stuck in this reconnection loop because the 
NodeToControllerRequestThread is still running and hasn't been shut down 
properly

{code}
[2025-11-21 12:31:38,515] INFO [NodeToControllerChannelManager id=2 
name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,515] WARN [NodeToControllerChannelManager id=2 
name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be 
established. Node may not be available. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,515] INFO 
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
controller, from now on will use node localhost:9090 (id: 0 rack: null 
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,566] INFO 
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
controller, from now on will use node localhost:9090 (id: 0 rack: null 
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,566] INFO [NodeToControllerChannelManager id=2 
name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,567] WARN [NodeToControllerChannelManager id=2 
name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be 
established. Node may not be available. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,567] INFO 
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
controller, from now on will use node localhost:9090 (id: 0 rack: null 
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,616] INFO 
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
controller, from now on will use node localhost:9090 (id: 0 rack: null 
isFenced: false) (kafka.server.NodeToControllerRequestThread)
{code}

There are two solutions:

1. Make the NodeToControllerRequestThread interruptible 
(InterBrokerSendThread.isInterruptible=true)
2. Check isRunning() in the doWork() method before attempting controller 
operations

I went with solution 2 to avoid interrupting in-flight requests and could cause 
other issues.




> Tight reconnection loop during shutdown
> ---------------------------------------
>
>                 Key: KAFKA-19905
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19905
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 4.1.1
>            Reporter: Federico Valeri
>            Assignee: Federico Valeri
>            Priority: Major
>         Attachments: test.zip
>
>
> During shutdown, nodes 1 and 2 (brokers) are stuck in an infinite loop trying 
> to connect to node 0 (the controller) every 50ms. The issue is time 
> sensitive, but it can be reproduced easily shutting down all nodes at the 
> same time.
> The problem is that even during shutdown, the NodeToControllerRequestThread 
> continues to run. The RaftControllerNodeProvider still returns node 0 as the 
> controller from cached Raft metadata, but node 0 has already terminated 
> (NodeToControllerChannelManager:323).
> Looking at logs, the controller shut down at 12:31:38 while brokers were 
> still in controlled shutdown. The sequence shows:
> 1. Node 1 and 2 request controlled shutdown
> 2. Controller grants the shutdown
> 3. Controller itself shuts down (RaftManager shutdown at 12:31:38)
> 4. Node 1 and 2 continue trying to heartbeat to the now-dead controller
> 5. They get stuck in this reconnection loop because the 
> NodeToControllerRequestThread is still running and hasn't been shut down 
> properly
> {code}
> [2025-11-21 12:31:38,515] INFO [NodeToControllerChannelManager id=2 
> name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
> [2025-11-21 12:31:38,515] WARN [NodeToControllerChannelManager id=2 
> name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be 
> established. Node may not be available. 
> (org.apache.kafka.clients.NetworkClient)
> [2025-11-21 12:31:38,515] INFO 
> [broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
> controller, from now on will use node localhost:9090 (id: 0 rack: null 
> isFenced: false) (kafka.server.NodeToControllerRequestThread)
> [2025-11-21 12:31:38,566] INFO 
> [broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
> controller, from now on will use node localhost:9090 (id: 0 rack: null 
> isFenced: false) (kafka.server.NodeToControllerRequestThread)
> [2025-11-21 12:31:38,566] INFO [NodeToControllerChannelManager id=2 
> name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
> [2025-11-21 12:31:38,567] WARN [NodeToControllerChannelManager id=2 
> name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be 
> established. Node may not be available. 
> (org.apache.kafka.clients.NetworkClient)
> [2025-11-21 12:31:38,567] INFO 
> [broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
> controller, from now on will use node localhost:9090 (id: 0 rack: null 
> isFenced: false) (kafka.server.NodeToControllerRequestThread)
> [2025-11-21 12:31:38,616] INFO 
> [broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
> controller, from now on will use node localhost:9090 (id: 0 rack: null 
> isFenced: false) (kafka.server.NodeToControllerRequestThread)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (KAFKA-19905) Tight reconnection loop during shutdown

Reply via email to