[ https://issues.apache.org/jira/browse/KAFKA-20075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saravanan updated KAFKA-20075:
------------------------------
    Summary: Side effects of 'preferred leader election' schedule change  (was: 
Side effects of 'preferred leader election' schedule change is causing side 
effects)

> Side effects of 'preferred leader election' schedule change
> -----------------------------------------------------------
>
>                 Key: KAFKA-20075
>                 URL: https://issues.apache.org/jira/browse/KAFKA-20075
>             Project: Kafka
>          Issue Type: Task
>          Components: config, controller, group-coordinator
>    Affects Versions: 4.0.1, 4.1.0, 4.1.1
>            Reporter: Saravanan
>            Priority: Major
>              Labels: kraft, leader-imbalance, preferred-leader-election
>
> After upgrading from Kafka 3.9.1 to 4.1.0 in KRaft mode, the behavior of 
> preferred leader election with auto.leader.rebalance.enable=true appears to 
> have changed. The effective semantics of 
> leader.imbalance.check.interval.seconds are different: in 3.9.1, preferred 
> leader election for imbalanced partitions consistently occurred ~300 seconds 
> after a broker failure/recovery, whereas in 4.1.0 it can occur at any time 
> between 0 and 300 seconds after a broker comes back. This earlier rebalance 
> can overlap with partition unloading from the old leader, causing prolonged 
> consumer impact.
> *In Kafka 3.9.1 KRaft:*
> When a broker goes down and later comes back up, preferred leader election 
> for affected partitions (e.g., __consumer_offsets) consistently happens about 
> 5 minutes (300 seconds) after the broker's failure/recovery sequence.
> From an operator's perspective, the effective behavior is:
> _"Preferred leader election runs ~300s after the broker event."_
> This aligns intuitively with leader.imbalance.check.interval.seconds=300, and 
> the interval appears tied to the time when the broker failure/imbalance 
> started.
> *In Kafka 4.1.0 KRaft:*
> With the same configuration (auto.leader.rebalance.enable=true, 
> leader.imbalance.check.interval.seconds=300), preferred leader election is 
> now driven by the new periodic task scheduler in QuorumController (e.g., 
> PeriodicTask("electPreferred", ...)), plus per-broker imbalance logic.
> In practice, this means:
> - Preferred leader election can occur at any time between 0 and 300 seconds 
> after a broker comes back, depending on where the controller's periodic 
> schedule currently sits.
> - The timing is no longer the intuitive "300 seconds after the broker 
> event" but "on the next periodic electPreferred tick," which is decoupled 
> from broker failure/recovery.
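> The following is a minimal sketch, not the actual QuorumController code, of 
> why a fixed-period check observes a broker event with a delay anywhere in 
> [0, interval): the schedule is anchored at the controller, not at the 
> broker event. Only the task name electPreferred and the 300s interval come 
> from this report; everything else is illustrative:
> {code:java}
> import java.util.concurrent.Executors;
> import java.util.concurrent.ScheduledExecutorService;
> import java.util.concurrent.TimeUnit;
>
> public class PeriodicElectSketch {
>     public static void main(String[] args) {
>         // leader.imbalance.check.interval.seconds=300 acts as a fixed-rate
>         // schedule: ticks land at t = 300, 600, 900, ... from controller
>         // start, independent of any broker registration event.
>         ScheduledExecutorService scheduler =
>                 Executors.newSingleThreadScheduledExecutor();
>         long intervalSec = 300;
>         long start = System.nanoTime();
>         scheduler.scheduleAtFixedRate(() -> {
>             long t = TimeUnit.NANOSECONDS.toSeconds(System.nanoTime() - start);
>             // Each tick elects preferred leaders for whatever partitions
>             // are imbalanced right now, regardless of when a broker
>             // re-registered.
>             System.out.printf("electPreferred tick at t=%ds%n", t);
>         }, intervalSec, intervalSec, TimeUnit.SECONDS);
>         // A broker returning at time T is picked up on the next tick, i.e.
>         // after (intervalSec - T % intervalSec) seconds: anywhere from ~0
>         // up to the full 300.
>     }
> }
> {code}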
> This semantic change is important because of the interaction with partition 
> load/unload. When a broker that was a preferred leader comes back:
> - The previous leader may still be unloading partitions (closing 
> producers/consumers, flushing state, writing checkpoints, etc.).
> - If preferred leader election fires early (close to the broker's return), 
> the new preferred leader may start loading those same partitions while the 
> old leader is still unloading them.
> This overlapping unload/load window causes:
> - Extended recovery times for __consumer_offsets and other system topics.
> - Noticeable consumer-side delays and lag spikes.
> - Infrequent but high-impact incidents in production.
> Conceptually, the change in 4.x is an improvement (cleaner periodic task 
> infrastructure, an explicit electPreferred task, a per-broker imbalance 
> threshold), but it also effectively changes the semantics of 
> leader.imbalance.check.interval.seconds as operators understand it:
> - Previously (3.9.1), operators could treat it as "roughly how long after a 
> broker event before preferred leader rebalance kicks in."
> - Now (4.1.0+), it is "the frequency of a global periodic check," not 
> aligned to broker status changes, which leads to leader rebalances 
> occurring much earlier than expected relative to broker recovery (see the 
> worked example below).
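> A concrete, hypothetical timeline shows the arithmetic behind "much earlier 
> than expected"; the timestamps below are invented, only the 300s interval 
> comes from the configuration above:
> {code:java}
> public class ObservedDelayExample {
>     public static void main(String[] args) {
>         long intervalSec = 300;     // leader.imbalance.check.interval.seconds
>         long brokerReturnSec = 590; // hypothetical: broker re-registers 590s
>                                     // after the controller anchored its
>                                     // periodic schedule
>         long nextTickSec = ((brokerReturnSec / intervalSec) + 1) * intervalSec;
>         // Next tick fires at t=600s, so leaders move back only ~10s after
>         // recovery, not ~300s as the 3.9.1 behavior suggested.
>         System.out.println("observed delay: "
>                 + (nextTickSec - brokerReturnSec) + "s");
>     }
> }
> {code}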
> *Impact*
> - Overlapping partition unloading/loading between old and new preferred 
> leaders.
> - Longer recovery and stabilization time for critical internal topics like 
> __consumer_offsets.
> - Noticeable and sometimes severe delays for consumers during these rare 
> but critical windows.
> - Operational confusion: existing tuning based on 3.9.1's behavior no 
> longer matches what is observed in 4.1.0.
> *Clarifications / Requests*
> 1. Intended semantics of leader.imbalance.check.interval.seconds in 4.x
> - In 3.9.1, preferred leader election effectively happened ~300s after 
> broker failure/recovery.
> - In 4.1.0, with the periodic electPreferred task, it can happen anytime 
> between 0 and 300s after a broker comes back.
> - Is this changed timing relative to broker events intentional?
> 2. Interaction with the new imbalance logic
> - How do leader.imbalance.per.broker.percentage and the new KRaft 
> controller logic influence when preferred leader election is triggered 
> (beyond the periodic task)?
> - Are there now event-driven triggers that can cause earlier rebalancing 
> than the configured interval?
> 3. Operational guidance to avoid overlap/unload issues
> - What is the recommended way in 4.1.0+ to avoid preferred leader election 
> overlapping with partition unloading on the old leader (and loading on the 
> new one) after broker recovery?
> - Should operators tune leader.imbalance.per.broker.percentage, 
> leader.imbalance.check.interval.seconds, or use another mechanism (such as 
> manual preferred election, sketched after this list) to delay automatic 
> preferred leader rebalance after a broker comes back?
> 4. Documentation expectations for upgrades
> - If the new behavior is expected, can the docs explicitly state that 
> leader.imbalance.check.interval.seconds is a periodic scheduler interval, 
> not a post-broker-event delay, and that the actual rebalance relative to 
> broker events may occur anywhere between 0 and the configured interval?
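> As a sketch for question 3, and not an official recommendation: one way to 
> guarantee the old leader has finished unloading is to set 
> auto.leader.rebalance.enable=false and trigger preferred election manually 
> (via kafka-leader-election.sh --election-type preferred, or the public 
> Admin API below) once the recovered broker has settled. The bootstrap 
> address and the single partition are placeholders:
> {code:java}
> import java.util.Properties;
> import java.util.Set;
> import org.apache.kafka.clients.admin.Admin;
> import org.apache.kafka.clients.admin.AdminClientConfig;
> import org.apache.kafka.common.ElectionType;
> import org.apache.kafka.common.TopicPartition;
>
> public class ManualPreferredElection {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
>         try (Admin admin = Admin.create(props)) {
>             // Placeholder partition; passing null instead elects the
>             // preferred leader for every eligible partition.
>             Set<TopicPartition> partitions =
>                     Set.of(new TopicPartition("__consumer_offsets", 0));
>             // Blocks until the election completes (or fails per partition).
>             admin.electLeaders(ElectionType.PREFERRED, partitions)
>                  .partitions().get();
>         }
>     }
> }
> {code}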
>  


