[
https://issues.apache.org/jira/browse/KAFKA-19586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tal Asulin updated KAFKA-19586:
-------------------------------
Attachment: flame3.html
> Kafka broker freezes and gets fenced during rolling restart with KRaft mode
> ---------------------------------------------------------------------------
>
> Key: KAFKA-19586
> URL: https://issues.apache.org/jira/browse/KAFKA-19586
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 3.9.0, 3.9.1, 4.0.0
> Environment: Running Kafka over Kubernetes
> Kafka 3.9.0
> Reporter: Tal Asulin
> Priority: Blocker
> Attachments: flame3.html
>
>
> After upgrading our production Kafka clusters to *Kafka 3.9.0* with
> *KRaft mode enabled*, we started observing behavior during rolling restarts
> of broker nodes that we had never seen before.
> When a broker is {*}gracefully shut down by the KRaft controller{*}, it
> immediately restarts. Shortly afterward, while it is busy {*}replicating
> missing data{*}, the broker suddenly {*}freezes for approximately 20–50
> seconds{*}. During this time, it produces {*}no logs, no metrics{*}, and *no
> heartbeat messages* to the controllers (see the timestamps below).
>
> {code:java}
> 2025-07-30 09:21:27,224 INFO [Broker id=8] Skipped the become-follower state
> change for my-topic-215 with topic id Some(yO4CQayIRbyESrHHVPdOrQ) and
> partition state LeaderAndIsrPartitionState(topicName='my-topic',
> partitionIndex=215, controllerEpoch=-1, leader=4, leaderEpoch=54, isr=[4, 8],
> partitionEpoch=102, replicas=[4, 8], addingReplicas=[], removingReplicas=[],
> isNew=false, leaderRecoveryState=0) since it is already a follower with
> leader epoch 54. (state.change.logger) [kafka-8-metadata-loader-event-handler]
> 2025-07-30 09:21:51,887 INFO [Broker id=8] Transitioning 427 partition(s) to
> local followers. (state.change.logger)
> [kafka-8-metadata-loader-event-handler] {code}
>
> After this “hanging” period, the broker *resumes normal operation* without
> emitting any error or warning messages, as if nothing had happened. However,
> because the broker sends no heartbeats to the KRaft controllers during this
> gap, it gets *fenced out of the cluster* once the
> [9s session timeout|https://kafka.apache.org/documentation/#brokerconfigs_broker.session.timeout.ms]
> expires, which leads to {*}partitions going offline and, ultimately, data loss{*}.
> {code:java}
> 2025-07-30 09:21:40,325 INFO [QuorumController id=300] Fencing broker 8
> because its session has timed out.
> (org.apache.kafka.controller.ReplicationControlManager)
> [quorum-controller-300-event-handler]{code}
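>
> For context, the fencing window here is bounded by two KRaft settings, shown
> below with what we believe are the defaults in effect on our clusters (2 s
> heartbeat interval, 9 s session timeout), so a 20–50 s stall always exceeds
> the session. The raised value at the end is only an illustrative, unvalidated
> mitigation, not something we have tested:
> {code:java}
> # Defaults (per the broker configuration docs linked above):
> broker.heartbeat.interval.ms=2000   # how often the broker heartbeats the controllers
> broker.session.timeout.ms=9000      # silence after which the controller fences the broker
>
> # Illustrative, unvalidated mitigation while the root cause is investigated:
> # widen the session so a 20-50 s stall no longer fences the broker, e.g.
> # broker.session.timeout.ms=60000
> {code}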
>
> We were able to reproduce this behavior when provisioning clusters with
> high disk utilization.
> Our primary suspicion is that the Kafka broker process might be
> *“choking” under the load*: replicating a large amount of data while
> simultaneously taking over leadership of many partitions may stall the
> JVM, leading to the observed freeze.
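>
> To check this hypothesis, a minimal diagnostic sketch, assuming Java 11+
> unified logging and that extra flags reach the broker JVM via KAFKA_OPTS
> (paths, file names and the PID placeholder below are illustrative only):
> {code:java}
> # Log GC pauses and safepoint pauses; a 20-50 s JVM stall should show up here
> # as a long pause or a long time-to-safepoint during the silent window.
> export KAFKA_OPTS="-Xlog:gc*,safepoint:file=/var/log/kafka/gc-safepoint.log:time,uptime"
>
> # During the freeze, a thread dump (e.g. via kubectl exec into the broker pod)
> # shows whether broker threads are blocked on disk I/O (PID is a placeholder):
> jcmd <broker-pid> Thread.print > /tmp/broker-threads.txt
> {code}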