[
https://issues.apache.org/jira/browse/KAFKA-19905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18040612#comment-18040612
]
Federico Valeri commented on KAFKA-19905:
-----------------------------------------
The reconnection loop goes on for exactly 5 minutes, which is the shutdown
timeout hard-coded in the KafkaBroker trait.
This is what I see in the logs of another test run for one of the brokers:
- SIGTERM received: 14:39:46,282
- Actual shutdown completed: 14:44:46,385
- Time elapsed: 5 minutes and 0.103 seconds (approximately 5 minutes)
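
The elapsed time above can be checked mechanically. The following is a minimal sketch, not part of Kafka: the class name ShutdownElapsed and the helper elapsed are made up for illustration, and it assumes both timestamps fall on the same day and use the comma-separated millisecond format shown in the broker logs.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

public class ShutdownElapsed {
    // Time-of-day format as it appears in the logs: comma before the milliseconds.
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("HH:mm:ss,SSS");

    // Hypothetical helper: duration between two same-day log timestamps.
    static Duration elapsed(String start, String end) {
        return Duration.between(LocalTime.parse(start, LOG_TIME),
                                LocalTime.parse(end, LOG_TIME));
    }

    public static void main(String[] args) {
        // SIGTERM received vs. actual shutdown completed, from the list above.
        Duration d = elapsed("14:39:46,282", "14:44:46,385");
        System.out.println(d.toMillis() + " ms"); // 300103 ms
    }
}
```

The result is 103 ms over the 5-minute (300000 ms) timeout, which is consistent with the broker waiting out the full shutdown timeout rather than stopping on its own.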
> Tight reconnection loop during shutdown
> ---------------------------------------
>
> Key: KAFKA-19905
> URL: https://issues.apache.org/jira/browse/KAFKA-19905
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 4.1.1
> Reporter: Federico Valeri
> Assignee: Federico Valeri
> Priority: Major
> Attachments: Screenshot From 2025-11-25 14-40-25.png, test.zip
>
>
> During shutdown, nodes 1 and 2 (brokers) are stuck in an infinite loop trying
> to connect to node 0 (the controller) every 50ms. The issue is timing
> sensitive, but it can be reproduced easily by shutting down all nodes at the
> same time.
> The problem is that even during shutdown, the NodeToControllerRequestThread
> continues to run. The RaftControllerNodeProvider still returns node 0 as the
> controller from cached Raft metadata, but node 0 has already terminated
> (NodeToControllerChannelManager:323).
> Looking at logs, the controller shut down at 12:31:38 while brokers were
> still in controlled shutdown. The sequence shows:
> 1. Node 1 and 2 request controlled shutdown
> 2. Controller grants the shutdown
> 3. Controller itself shuts down (RaftManager shutdown at 12:31:38)
> 4. Node 1 and 2 continue trying to heartbeat to the now-dead controller
> 5. They get stuck in this reconnection loop because the
> NodeToControllerRequestThread is still running and hasn't been shut down
> properly
> {code}
> [2025-11-21 12:31:38,515] INFO [NodeToControllerChannelManager id=2
> name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
> [2025-11-21 12:31:38,515] WARN [NodeToControllerChannelManager id=2
> name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be
> established. Node may not be available.
> (org.apache.kafka.clients.NetworkClient)
> [2025-11-21 12:31:38,515] INFO
> [broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft
> controller, from now on will use node localhost:9090 (id: 0 rack: null
> isFenced: false) (kafka.server.NodeToControllerRequestThread)
> [2025-11-21 12:31:38,566] INFO
> [broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft
> controller, from now on will use node localhost:9090 (id: 0 rack: null
> isFenced: false) (kafka.server.NodeToControllerRequestThread)
> [2025-11-21 12:31:38,566] INFO [NodeToControllerChannelManager id=2
> name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
> [2025-11-21 12:31:38,567] WARN [NodeToControllerChannelManager id=2
> name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be
> established. Node may not be available.
> (org.apache.kafka.clients.NetworkClient)
> [2025-11-21 12:31:38,567] INFO
> [broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft
> controller, from now on will use node localhost:9090 (id: 0 rack: null
> isFenced: false) (kafka.server.NodeToControllerRequestThread)
> [2025-11-21 12:31:38,616] INFO
> [broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft
> controller, from now on will use node localhost:9090 (id: 0 rack: null
> isFenced: false) (kafka.server.NodeToControllerRequestThread)
> {code}
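
To give a sense of scale for the loop described above, here is a minimal simulation sketch. It is not the Kafka implementation: the class ReconnectLoopSketch, the method failedAttempts, and the stopWhenShuttingDown flag are all hypothetical, time is simulated rather than slept, and the flag only illustrates what an early exit on shutdown could look like, not an actual or proposed fix. It assumes one failed connection attempt per 50 ms interval until a 5-minute shutdown timeout expires.

```java
public class ReconnectLoopSketch {
    /**
     * Hypothetical model of the request thread after the controller is gone.
     * Returns the number of failed connection attempts made before the loop ends.
     * Time is simulated: nothing actually sleeps or opens a socket.
     */
    static long failedAttempts(long backoffMs, long shutdownTimeoutMs,
                               boolean stopWhenShuttingDown) {
        long nowMs = 0;              // simulated time since the controller went away
        long attempts = 0;
        boolean shuttingDown = true; // the broker has already received SIGTERM
        while (nowMs < shutdownTimeoutMs) {
            if (stopWhenShuttingDown && shuttingDown) {
                break;               // assumed guard: stop retrying during shutdown
            }
            attempts++;              // cached metadata still names node 0; connect fails
            nowMs += backoffMs;      // wait one retry interval (simulated)
        }
        return attempts;
    }

    public static void main(String[] args) {
        long timeoutMs = 5 * 60 * 1000L;
        System.out.println(failedAttempts(50, timeoutMs, false)); // 6000
        System.out.println(failedAttempts(50, timeoutMs, true));  // 0
    }
}
```

Under these assumptions, each channel manager makes about 6000 failed attempts before the timeout ends the loop, which matches the volume of repeated log lines in the excerpt above.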