Thanks, Tom, for updating the KIP with topic-level configs. I still have one question.
With leader.election.prefer.early.local.log.start.offset=true, replicas bootstrapped via tiered storage are deprioritized for leadership until they accumulate local data. In a 3-broker, 3-rack (RF=3) cluster, if one rack is replaced and bootstraps this way, all its replicas will have a higher localLogStartOffset, so leadership would concentrate on the other two racks for some time. At larger scale (e.g., AZ-level recovery), this could mean one failure domain is effectively excluded from leadership, shifting most leader load to the remaining racks.

Is this temporary rack/AZ-level leader skew an expected trade-off? If so, is there guidance to plan for N-1 capacity or to rebalance leadership during recovery?

Regards,
Manan

On Wed, Apr 29, 2026 at 9:18 PM Thomas Thornton via dev <[email protected]> wrote:

> Hi Manan,
>
> Thanks for the discussion.
>
> MG1: Good point. I've updated the KIP to add topic-level overrides for
> both configs so operators can tune the threshold per topic if the
> cluster-wide default doesn't fit a particular workload.
>
> MG2: This shouldn't cause skew. The localLogStartOffset sorting only
> pushes down newly-bootstrapped replicas that have significantly less
> local data. For existing replicas with comparable local data, they'll
> be within the threshold and fall back to the original assignment
> order, same behavior as today. We're not introducing a new dimension
> that would systematically favor certain racks or brokers.
>
> MG3: When tiered storage is not enabled for a topic, we will not send
> AlterPartition requests to report localLogStartOffset. There will be
> no extra control-plane overhead for clusters or topics that don't use
> tiered storage. When enabled, the additional AlterPartition calls only
> fire when an ISR member's localLogStartOffset actually changes, which
> reuses the existing protocol and should be infrequent.
>
> Thanks,
> Tom
>
> On Wed, Apr 29, 2026 at 5:42 PM Thomas Thornton
> <[email protected]> wrote:
> >
> > Hi Ivan,
> >
> > Thanks for the feedback.
> >
> > IY1: Yes, the sort is stable. Replicas within the threshold are
> > considered equivalent and retain their original assignment order.
> > We're only reordering replicas that have significantly less local
> > data; the existing replicas keep the same relative ordering as before.
> > Updated that part of the KIP to reflect this.
> >
> > IY2: Good idea. I've updated the KIP to add topic-level overrides for
> > both configs. This follows the standard Kafka pattern (like
> > `retention.ms`, `log.retention.ms`). The cluster-wide default applies
> > unless overridden per topic.
> >
> > Thanks,
> > Tom
> >
> > On Fri, Apr 24, 2026 at 3:18 PM Ivan Yurchenko <[email protected]> wrote:
> > >
> > > Hi Thomas,
> > >
> > > Thank you for the KIP. The motivation makes sense to me. I have a
> > > couple of comments:
> > >
> > > IY1:
> > > > When `leader.election.prefer.early.local.log.start.offset` is
> > > > enabled, the key change is to sort targetReplicas by
> > > > local-log-start-offset (ascending) before selecting a leader. This
> > > > ensures replicas with more local data (lower local-log-start-offset)
> > > > are considered first in both election paths.
> > >
> > > I assume here it meant to say "sort stably", to preserve the original
> > > preference order as much as possible?
> > >
> > > IY2:
> > > Can we find a reason for a particular topic to not follow the new
> > > leader election algorithm, or is it strictly better and once enabled
> > > it's not expected to be disabled? If the answer is yes, would you
> > > consider adding the topic-level versions of the new configs
> > > leader.election.prefer.early.local.log.start.offset and
> > > leader.election.local.log.start.offset.threshold?
> > >
> > > Best,
> > > Ivan
> > >
> > >
> > > On Mon, Mar 30, 2026, at 20:43, Thomas Thornton via dev wrote:
> > >
> > > Hi all,
> > >
> > > We want to start a discussion thread for KIP-1303: Deprioritize Tiered
> > > Storage Followers In Leader Election.
> > >
> > > The adopted KIP-1023 introduced an optimization allowing followers to
> > > skip replicating data already in remote storage, dramatically reducing
> > > ISR join time. However, as noted in KIP-1023, this creates a risk: if
> > > such a follower becomes leader, it may need to serve consumer requests
> > > from remote storage, impacting performance.
> > >
> > > This KIP proposes to mitigate this risk by preferring replicas with
> > > more local data (lower localLogStartOffset) during leader election.
> > > Key changes include:
> > > 1) New config leader.election.prefer.early.local.log.start.offset to
> > > enable the feature
> > > 2) New config leader.election.local.log.start.offset.threshold to
> > > avoid leader churn from minor retention timing differences
> > > 3) Extending FetchRequest and AlterPartition to propagate
> > > localLogStartOffset from followers → leader → controller
> > >
> > > The full KIP is available here:
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1303%3A+Deprioritize+Tiered+Storage+Followers+In+Leader+Election
> > >
> > > Thanks,
> > > Tom
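
To make the rack-skew question in this thread concrete, here is a rough sketch of how I read the ordering described above. It is not the KIP's implementation. The class and method names (LeaderOrderSketch, electionOrder) are hypothetical, ISR filtering is left out, and I am assuming the threshold is measured against the smallest localLogStartOffset among the candidates. I have also used a stable partition instead of a pairwise sort, because a comparator that treats "within threshold" as equal is not transitive. The KIP may define the comparison differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class LeaderOrderSketch {

    // Hypothetical helper, not Kafka code. Replicas whose localLogStartOffset
    // is within `threshold` of the smallest offset keep their assignment
    // order; the rest are moved behind them, also in assignment order.
    static List<Integer> electionOrder(List<Integer> assignment,
                                       Map<Integer, Long> localLogStartOffset,
                                       long threshold) {
        long min = Long.MAX_VALUE;
        for (int replica : assignment) {
            min = Math.min(min, localLogStartOffset.get(replica));
        }
        List<Integer> preferred = new ArrayList<>();
        List<Integer> deprioritized = new ArrayList<>();
        for (int replica : assignment) {
            if (localLogStartOffset.get(replica) - min <= threshold) {
                preferred.add(replica);
            } else {
                deprioritized.add(replica);
            }
        }
        preferred.addAll(deprioritized);
        return preferred;
    }

    public static void main(String[] args) {
        // Assumed scenario: brokers 1, 2, 3 sit in three racks, and broker 3
        // was just replaced and bootstrapped from tiered storage.
        Map<Integer, Long> offsets = Map.of(1, 0L, 2, 50L, 3, 10_000L);
        long threshold = 100L; // made-up value for illustration

        List<List<Integer>> assignments = List.of(
                List.of(1, 2, 3), List.of(2, 3, 1), List.of(3, 1, 2));

        for (List<Integer> assignment : assignments) {
            List<Integer> order = electionOrder(assignment, offsets, threshold);
            // The first replica in the order is the one tried first as leader.
            System.out.println(assignment + " -> " + order
                    + ", leader candidate " + order.get(0));
        }
    }
}
```

Under these assumptions the first-choice leaders are brokers 1, 2 and 1, while plain assignment order would give 1, 2 and 3. Broker 3 is first choice for no partition until its local data catches up, which is the skew I am asking about.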
