[
https://issues.apache.org/jira/browse/KAFKA-20075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Saravanan updated KAFKA-20075:
------------------------------
Summary: Side effects of 'preferred leader election' schedule change (was:
Side effects of 'preferred leader election' schedule change is causing side
effects)
> Side effects of 'preferred leader election' schedule change
> -----------------------------------------------------------
>
> Key: KAFKA-20075
> URL: https://issues.apache.org/jira/browse/KAFKA-20075
> Project: Kafka
> Issue Type: Task
> Components: config, controller, group-coordinator
> Affects Versions: 4.0.1, 4.1.0, 4.1.1
> Reporter: Saravanan
> Priority: Major
> Labels: kraft, leader-imbalance, preferred-leader-election
>
> After upgrading from Kafka 3.9.1 to 4.1.0 in KRaft mode, the behavior of
> preferred leader election with auto.leader.rebalance.enable=true appears to
> have changed. The effective semantics of
> leader.imbalance.check.interval.seconds are different: in 3.9.1, preferred
> leader election for imbalanced partitions consistently occurred ~300 seconds
> after a broker failure/recovery, whereas in 4.1.0 it can occur at any time
> between 0 and 300 seconds after a broker comes back. This earlier rebalance
> can overlap with partition unloading from the old leader, causing prolonged
> consumer impact.
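> For reference, a minimal sketch of the broker/controller settings under which
> this behavior was observed; only auto.leader.rebalance.enable and
> leader.imbalance.check.interval.seconds are taken from this report, and the
> per-broker percentage shown is simply the documented default:
> {code}
> auto.leader.rebalance.enable=true
> leader.imbalance.check.interval.seconds=300
> leader.imbalance.per.broker.percentage=10
> {code}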
> *In Kafka 3.9.1 KRaft:*
> When a broker goes down and later comes back up, preferred leader election
> for affected partitions (e.g., __consumer_offsets) consistently happens about
> 5 minutes (300 seconds) after the broker's failure/recovery sequence.
> From an operator's perspective, the effective behavior is:
> _"Preferred leader election runs ~300s after the broker event."_
> This aligns intuitively with leader.imbalance.check.interval.seconds=300, and
> the interval appears tied to the time when the broker failure/imbalance
> started.
> *In Kafka 4.1.0 KRaft:*
> With the same configuration (auto.leader.rebalance.enable=true,
> leader.imbalance.check.interval.seconds=300), preferred leader election is
> now driven by the new periodic task scheduler in QuorumController (e.g.,
> PeriodicTask("electPreferred", ...)), plus per‑broker imbalance logic.
> In practice, this means:
> * Preferred leader election can occur at any time between 0 and 300 seconds
> after a broker comes back, depending on where the controller's periodic
> schedule currently is.
> * The timing is no longer intuitively "300 seconds after the broker event" but
> "on the next periodic electPreferred tick," which is decoupled from broker
> failure/recovery.
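> To illustrate the timing effect, here is a minimal sketch of a fixed-interval
> periodic task (this is not the actual QuorumController code; the class name
> and log message below are made up for illustration):
> {code:java}
> import java.util.concurrent.Executors;
> import java.util.concurrent.ScheduledExecutorService;
> import java.util.concurrent.TimeUnit;
>
> public class PeriodicElectionSketch {
>     public static void main(String[] args) {
>         // Interval corresponding to leader.imbalance.check.interval.seconds=300.
>         long intervalSeconds = 300;
>
>         ScheduledExecutorService scheduler =
>             Executors.newSingleThreadScheduledExecutor();
>
>         // The check fires on its own fixed cadence, independent of broker events.
>         scheduler.scheduleAtFixedRate(
>             () -> System.out.println("electPreferred tick: elect preferred leaders"),
>             intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
>
>         // A broker that re-registers 1s before the next tick is rebalanced almost
>         // immediately; one that re-registers 1s after the tick waits nearly the
>         // full interval. The delay relative to the broker event is therefore
>         // anywhere in [0, interval), not a fixed ~interval.
>     }
> }
> {code}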
> This semantic change is important because of the interaction with partition
> load/unload. When a broker that was a preferred leader comes back:
> * The previous leader may still be unloading partitions (closing
> producers/consumers, flushing state, checkpoints, etc.).
> * If preferred leader election fires early (close to the broker's return), the
> new preferred leader may start loading those same partitions while the old
> leader is still unloading them.
> This overlapping unload/load window causes:
> * Extended recovery times for __consumer_offsets and other system topics.
> * Noticeable consumer-side delays and lag spikes.
> * Infrequent but high-impact incidents in production.
> Conceptually, the change in 4.x is an improvement (cleaner periodic task
> infrastructure, an explicit electPreferred task, a per-broker imbalance
> threshold), but it also effectively changes the semantics of
> leader.imbalance.check.interval.seconds as understood by operators:
> * Previously (3.9.1), operators could treat it as "roughly how long after a
> broker event before preferred leader rebalance kicks in."
> * Now (4.1.0+), it is the frequency of a global periodic check, not aligned to
> broker status changes, which leads to leader rebalances occurring much earlier
> than expected relative to broker recovery.
> *Impact*
> * Overlapping partition unloading/loading between the old and new preferred
> leaders.
> * Longer recovery and stabilization time for critical internal topics like
> __consumer_offsets.
> * Noticeable and sometimes severe delays for consumers during these rare but
> critical windows.
> * Operational confusion: existing tuning based on 3.9.1's behavior no longer
> matches what is observed in 4.1.0.
> *Clarifications / Requests*
> 1. Intended semantics of leader.imbalance.check.interval.seconds in 4.x
> * In 3.9.1, preferred leader election effectively happened ~300s after broker
> failure/recovery.
> * In 4.1.0, with the periodic electPreferred task, it can happen anywhere from
> 0 to 300s after a broker comes back.
> * Is this changed timing relative to broker events intentional?
> 2. Interaction with the new imbalance logic
> * How do leader.imbalance.per.broker.percentage and the new KRaft controller
> logic influence when preferred leader election is triggered (beyond the
> periodic task)?
> * Are there now event-driven triggers that can cause rebalancing earlier than
> the configured interval?
> 3. Operational guidance to avoid overlap/unload issues
> * What is the recommended way in 4.1.0+ to avoid preferred leader election
> overlapping with partition unloading on the old leader (and loading on the
> new one) after broker recovery?
> * Should operators tune leader.imbalance.per.broker.percentage,
> leader.imbalance.check.interval.seconds, or use another mechanism to delay
> automatic preferred leader rebalance after a broker comes back? (One possible
> approach is sketched after this list.)
> 4. Documentation expectations for upgrades
> * If the new behavior is expected, can the docs explicitly state that
> leader.imbalance.check.interval.seconds is a periodic scheduler interval, not
> a post-broker-event delay, and that the actual rebalance relative to broker
> events may occur anywhere between 0 and the configured interval?
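> As one possible stopgap (an assumption on the reporter's side, not confirmed
> guidance), one could disable the automatic periodic rebalance and trigger
> preferred leader election manually once the recovered broker has finished
> loading, using the stock kafka-leader-election.sh tool:
> {code}
> # broker/controller config: turn off automatic preferred leader rebalance
> auto.leader.rebalance.enable=false
>
> # later, once the recovered broker is stable, elect preferred leaders explicitly
> bin/kafka-leader-election.sh --bootstrap-server localhost:9092 \
>   --election-type PREFERRED --all-topic-partitions
> {code}
> Whether this, or tuning leader.imbalance.check.interval.seconds /
> leader.imbalance.per.broker.percentage, is the recommended approach is part of
> what this ticket asks to clarify.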
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)