[
https://issues.apache.org/jira/browse/KAFKA-19905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Federico Valeri resolved KAFKA-19905.
-------------------------------------
Resolution: Invalid
> Tight reconnection loop during shutdown
> ---------------------------------------
>
> Key: KAFKA-19905
> URL: https://issues.apache.org/jira/browse/KAFKA-19905
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 4.1.1
> Reporter: Federico Valeri
> Assignee: Federico Valeri
> Priority: Major
> Attachments: Screenshot From 2025-11-25 14-40-25.png, test.zip
>
>
> During shutdown, nodes 1 and 2 (brokers) are stuck in an infinite loop trying
> to connect to node 0 (the controller) every 50ms. The issue is time
> sensitive, but it can be reproduced easily shutting down all nodes at the
> same time (i.e. pkill -SIGTERM kafka.Kafka when running them on the same
> machine).
> The problem is that even during shutdown, the NodeToControllerRequestThread
> continues to run. The RaftControllerNodeProvider still returns node 0 as the
> controller from cached Raft metadata, but node 0 has already terminated
> (NodeToControllerChannelManager:323).
> Looking at logs, the controller shut down at 12:31:38 while brokers were
> still in controlled shutdown. The sequence shows:
> 1. Node 1 and 2 request controlled shutdown
> 2. Controller grants the shutdown
> 3. Controller itself shuts down (RaftManager shutdown at 12:31:38)
> 4. Node 1 and 2 continue trying to heartbeat to the now-dead controller
> 5. They get stuck in this reconnection loop because the
> NodeToControllerRequestThread is still running and hasn't been shut down
> properly
> {code}
> [2025-11-21 12:31:38,515] INFO [NodeToControllerChannelManager id=2
> name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
> [2025-11-21 12:31:38,515] WARN [NodeToControllerChannelManager id=2
> name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be
> established. Node may not be available.
> (org.apache.kafka.clients.NetworkClient)
> [2025-11-21 12:31:38,515] INFO
> [broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft
> controller, from now on will use node localhost:9090 (id: 0 rack: null
> isFenced: false) (kafka.server.NodeToControllerRequestThread)
> [2025-11-21 12:31:38,566] INFO
> [broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft
> controller, from now on will use node localhost:9090 (id: 0 rack: null
> isFenced: false) (kafka.server.NodeToControllerRequestThread)
> [2025-11-21 12:31:38,566] INFO [NodeToControllerChannelManager id=2
> name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
> [2025-11-21 12:31:38,567] WARN [NodeToControllerChannelManager id=2
> name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be
> established. Node may not be available.
> (org.apache.kafka.clients.NetworkClient)
> [2025-11-21 12:31:38,567] INFO
> [broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft
> controller, from now on will use node localhost:9090 (id: 0 rack: null
> isFenced: false) (kafka.server.NodeToControllerRequestThread)
> [2025-11-21 12:31:38,616] INFO
> [broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft
> controller, from now on will use node localhost:9090 (id: 0 rack: null
> isFenced: false) (kafka.server.NodeToControllerRequestThread)
> {code}
> The reconnection loop goes on for exactly 5 minutes, which is the shutdown
> timeout hard coded in KafkaBroker trait.
> This is what I have from another test logs for one of the brokers:
> - SIGTERM received: 14:39:46,282
> - Actual shutdown completed: 14:44:46,385
> - Time elapsed: 5 minutes and 0.103 seconds (approximately 5 minutes)
> I acknowledge that this is unlikely to happen with brokers running on
> different machines, but not so unlikely when running tests locally on a
> single physical machine.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)