Tal Asulin created KAFKA-19586:
----------------------------------
Summary: Potential
Key: KAFKA-19586
URL: https://issues.apache.org/jira/browse/KAFKA-19586
Project: Kafka
Issue Type: Bug
Components: core
Affects Versions: 4.0.0, 3.9.1, 3.9.0
Environment: Running Kafka over Kubernetes
Kafka 3.9.0
Reporter: Tal Asulin
After upgrading our Kafka clusters to *Kafka 3.9.0* with *KRaft mode enabled*
in production, we started observing behavior during rolling restarts of broker
nodes that we had never seen before.
When a broker is {*}gracefully shut down by the KRaft controller{*}, it
restarts immediately. Shortly afterward, while it is busy {*}replicating
missing data{*}, the broker suddenly {*}freezes for approximately 20–50
seconds{*}. During this time it produces {*}no logs, no metrics{*}, and *no
heartbeat messages* to the controllers (note the gap between the timestamps
below).
{code:java}
2025-07-30 09:21:27,224 INFO [Broker id=8] Skipped the become-follower state
change for my-topic-215 with topic id Some(yO4CQayIRbyESrHHVPdOrQ) and
partition state LeaderAndIsrPartitionState(topicName='my-topic',
partitionIndex=215, controllerEpoch=-1, leader=4, leaderEpoch=54, isr=[4, 8],
partitionEpoch=102, replicas=[4, 8], addingReplicas=[], removingReplicas=[],
isNew=false, leaderRecoveryState=0) since it is already a follower with leader
epoch 54. (state.change.logger) [kafka-8-metadata-loader-event-handler]
2025-07-30 09:21:51,887 INFO [Broker id=8] Transitioning 427 partition(s) to
local followers. (state.change.logger) [kafka-8-metadata-loader-event-handler]
{code}
After this “hanging” period, the broker *resumes normal operation* without
emitting any error or warning messages, as if nothing had happened. However,
because the broker fails to send heartbeats to the KRaft controllers during
this gap, it gets {*}fenced out of the cluster{*}, which leads to
{*}partitions going offline and, ultimately, data loss{*}.
{code:java}
2025-07-30 09:21:40,325 INFO [QuorumController id=300] Fencing broker 8 because
its session has timed out.
(org.apache.kafka.controller.ReplicationControlManager)
[quorum-controller-300-event-handler]{code}
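The fencing is consistent with the freeze simply outlasting the broker
session: in KRaft mode the controller expires a broker whose heartbeats stop
arriving for longer than {{broker.session.timeout.ms}} (9000 ms by default;
assuming our cluster does not override it). A quick sanity check of the log
timestamps, as a standalone sketch:
{code:java}
import java.time.Duration;
import java.time.LocalTime;

public class FreezeGap {
    // Milliseconds between two HH:mm:ss.SSS log timestamps on the same day.
    static long gapMs(String from, String to) {
        return Duration.between(LocalTime.parse(from), LocalTime.parse(to)).toMillis();
    }

    public static void main(String[] args) {
        // Timestamps copied from the log excerpts above.
        long silent     = gapMs("09:21:27.224", "09:21:51.887"); // broker silent: 24663 ms
        long untilFence = gapMs("09:21:27.224", "09:21:40.325"); // fenced after: 13101 ms
        System.out.println("silent=" + silent + " ms, fenced after " + untilFence + " ms");
    }
}
{code}
So the broker was silent for roughly 24.7 s and was fenced about 13.1 s in,
well past a 9 s session timeout, which matches the controller's "session has
timed out" message.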
I haven’t been able to reproduce or simulate this behavior in a test
environment. However, I can confirm that it *consistently occurs in our
production clusters* every time we perform a rolling restart of the brokers.
Our primary suspicion is that the Kafka broker process is *“choking” under
the load*: it is replicating a large amount of data while simultaneously
taking over leadership for many partitions. This may stall the JVM (for
example, in a long stop-the-world GC pause), leading to the observed freeze.
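One way to confirm or rule out a JVM stall is a pause-detector thread: sleep
for a short, fixed interval and measure how much longer the sleep actually
took. A process-wide stop-the-world event (GC, safepoint stall, host or
cgroup freeze) shows up as a large overrun. A minimal, self-contained sketch
(a diagnostic idea of ours, not Kafka code; all names are made up):
{code:java}
public class PauseDetector {
    static final long INTERVAL_MS = 100;

    // Largest observed overrun of a fixed-length sleep, in milliseconds.
    static long maxOverrunMs(int iterations) throws InterruptedException {
        long maxOverrun = 0;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            Thread.sleep(INTERVAL_MS);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            maxOverrun = Math.max(maxOverrun, elapsedMs - INTERVAL_MS);
        }
        return maxOverrun;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("max sleep overrun: " + maxOverrunMs(50) + " ms");
    }
}
{code}
If a sidecar thread like this reports a 20–50 s overrun exactly during the
silent window, the stall is process-wide (pointing at GC or the host) rather
than a Kafka-level deadlock, and GC logs ({{-Xlog:gc*}}) over the same window
should then show the cause.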
--
This message was sent by Atlassian Jira
(v8.20.10#820010)