[
https://issues.apache.org/jira/browse/KAFKA-19905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Federico Valeri updated KAFKA-19905:
------------------------------------
Description:
During shutdown, nodes 1 and 2 (brokers) are stuck in an infinite loop trying
to connect to node 0 (the controller) every 50ms. The issue is time sensitive,
but it can be reproduced easily shutting down all nodes at the same time.
The problem is that even during shutdown, the NodeToControllerRequestThread
continues to run. The RaftControllerNodeProvider still returns node 0 as the
controller from cached Raft metadata, but node 0 has already terminated
(NodeToControllerChannelManager:323).
Looking at logs, the controller shut down at 12:31:38 while brokers were still
in controlled shutdown. The sequence shows:
1. Node 1 and 2 request controlled shutdown
2. Controller grants the shutdown
3. Controller itself shuts down (RaftManager shutdown at 12:31:38)
4. Node 1 and 2 continue trying to heartbeat to the now-dead controller
5. They get stuck in this reconnection loop because the
NodeToControllerRequestThread is still running and hasn't been shut down
properly
{code}
[2025-11-21 12:31:38,515] INFO [NodeToControllerChannelManager id=2
name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,515] WARN [NodeToControllerChannelManager id=2
name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be
established. Node may not be available. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,515] INFO
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft
controller, from now on will use node localhost:9090 (id: 0 rack: null
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,566] INFO
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft
controller, from now on will use node localhost:9090 (id: 0 rack: null
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,566] INFO [NodeToControllerChannelManager id=2
name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,567] WARN [NodeToControllerChannelManager id=2
name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be
established. Node may not be available. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,567] INFO
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft
controller, from now on will use node localhost:9090 (id: 0 rack: null
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,616] INFO
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft
controller, from now on will use node localhost:9090 (id: 0 rack: null
isFenced: false) (kafka.server.NodeToControllerRequestThread)
{code}
was:
During shutdown, nodes 1 and 2 (brokers) are stuck in an infinite loop trying
to connect to node 0 (the controller) every 50ms. The issue is time sensitive,
but it can be reproduced easily shutting down all nodes at the same time.
The problem is that even during shutdown, the NodeToControllerRequestThread
continues to run. The RaftControllerNodeProvider still returns node 0 as the
controller from cached Raft metadata, but node 0 has already terminated
(NodeToControllerChannelManager:323).
Looking at logs, the controller shut down at 12:31:38 while brokers were still
in controlled shutdown. The sequence shows:
1. Node 1 and 2 request controlled shutdown
2. Controller grants the shutdown
3. Controller itself shuts down (RaftManager shutdown at 12:31:38)
4. Node 1 and 2 continue trying to heartbeat to the now-dead controller
5. They get stuck in this reconnection loop because the
NodeToControllerRequestThread is still running and hasn't been shut down
properly
{code}
[2025-11-21 12:31:38,515] INFO [NodeToControllerChannelManager id=2
name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,515] WARN [NodeToControllerChannelManager id=2
name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be
established. Node may not be available. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,515] INFO
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft
controller, from now on will use node localhost:9090 (id: 0 rack: null
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,566] INFO
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft
controller, from now on will use node localhost:9090 (id: 0 rack: null
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,566] INFO [NodeToControllerChannelManager id=2
name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,567] WARN [NodeToControllerChannelManager id=2
name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be
established. Node may not be available. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,567] INFO
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft
controller, from now on will use node localhost:9090 (id: 0 rack: null
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,616] INFO
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft
controller, from now on will use node localhost:9090 (id: 0 rack: null
isFenced: false) (kafka.server.NodeToControllerRequestThread)
{code}
There are two solutions:
1. Make the NodeToControllerRequestThread interruptible
(InterBrokerSendThread.isInterruptible=true)
2. Check isRunning() in the doWork() method before attempting controller
operations
I went with solution 2 to avoid interrupting in-flight requests and could cause
other issues.
> Tight reconnection loop during shutdown
> ---------------------------------------
>
> Key: KAFKA-19905
> URL: https://issues.apache.org/jira/browse/KAFKA-19905
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 4.1.1
> Reporter: Federico Valeri
> Assignee: Federico Valeri
> Priority: Major
> Attachments: test.zip
>
>
> During shutdown, nodes 1 and 2 (brokers) are stuck in an infinite loop trying
> to connect to node 0 (the controller) every 50ms. The issue is time
> sensitive, but it can be reproduced easily shutting down all nodes at the
> same time.
> The problem is that even during shutdown, the NodeToControllerRequestThread
> continues to run. The RaftControllerNodeProvider still returns node 0 as the
> controller from cached Raft metadata, but node 0 has already terminated
> (NodeToControllerChannelManager:323).
> Looking at logs, the controller shut down at 12:31:38 while brokers were
> still in controlled shutdown. The sequence shows:
> 1. Node 1 and 2 request controlled shutdown
> 2. Controller grants the shutdown
> 3. Controller itself shuts down (RaftManager shutdown at 12:31:38)
> 4. Node 1 and 2 continue trying to heartbeat to the now-dead controller
> 5. They get stuck in this reconnection loop because the
> NodeToControllerRequestThread is still running and hasn't been shut down
> properly
> {code}
> [2025-11-21 12:31:38,515] INFO [NodeToControllerChannelManager id=2
> name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
> [2025-11-21 12:31:38,515] WARN [NodeToControllerChannelManager id=2
> name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be
> established. Node may not be available.
> (org.apache.kafka.clients.NetworkClient)
> [2025-11-21 12:31:38,515] INFO
> [broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft
> controller, from now on will use node localhost:9090 (id: 0 rack: null
> isFenced: false) (kafka.server.NodeToControllerRequestThread)
> [2025-11-21 12:31:38,566] INFO
> [broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft
> controller, from now on will use node localhost:9090 (id: 0 rack: null
> isFenced: false) (kafka.server.NodeToControllerRequestThread)
> [2025-11-21 12:31:38,566] INFO [NodeToControllerChannelManager id=2
> name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
> [2025-11-21 12:31:38,567] WARN [NodeToControllerChannelManager id=2
> name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be
> established. Node may not be available.
> (org.apache.kafka.clients.NetworkClient)
> [2025-11-21 12:31:38,567] INFO
> [broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft
> controller, from now on will use node localhost:9090 (id: 0 rack: null
> isFenced: false) (kafka.server.NodeToControllerRequestThread)
> [2025-11-21 12:31:38,616] INFO
> [broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft
> controller, from now on will use node localhost:9090 (id: 0 rack: null
> isFenced: false) (kafka.server.NodeToControllerRequestThread)
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)