Re: [DISCUSS] KIP-1303: Deprioritize Tiered Storage Followers In Leader Election

Manan Gupta Thu, 30 Apr 2026 21:43:29 -0700

Thanks, Tom, for updating the KIP with topic-level configs.
I still have one question:


With leader.election.prefer.early.local.log.start.offset=true, replicas
bootstrapped via tiered storage are deprioritized for leadership until they
accumulate local data.

In a 3-broker, 3-rack (RF=3) cluster, if one rack is replaced and
bootstraps this way, all its replicas will have a higher localLogStartOffset,
so leadership would concentrate on the other two racks for some time.

At larger scale (e.g., AZ-level recovery), this could mean one failure
domain is effectively excluded from leadership, shifting most leader load
to the remaining racks.

Is this temporary rack/AZ-level leader skew an expected trade-off? If so,
is there guidance to plan for N-1 capacity or to rebalance leadership
during recovery?
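
To make the concern concrete, here is a minimal Python sketch of how I
picture the skew arising. It is not the KIP's implementation. It assumes
that a replica is "comparable" when its localLogStartOffset is within the
threshold of the lowest offset among the candidates, and that the first
replica in the resulting order becomes leader. The function names, broker
ids, rack names, and offsets are all made up for illustration, and offsets
are tracked per broker rather than per partition to keep it short.

```python
from collections import Counter


def order_candidates(assignment, local_log_start_offset, threshold):
    """Return replicas in election preference order.

    Assumption (not taken from the KIP text): a replica is deprioritized
    when its localLogStartOffset exceeds the lowest candidate offset by
    more than `threshold`.
    """
    lowest = min(local_log_start_offset[b] for b in assignment)
    # sorted() is stable, so replicas in the same bucket keep their
    # original assignment order (False sorts before True).
    return sorted(
        assignment,
        key=lambda b: local_log_start_offset[b] - lowest > threshold,
    )


# Hypothetical 3-broker, 3-rack cluster with RF=3.
RACK = {1: "rack-a", 2: "rack-b", 3: "rack-c"}
# Six partitions whose preferred leaders rotate evenly over the brokers.
ASSIGNMENTS = [[1, 2, 3], [2, 3, 1], [3, 1, 2], [1, 3, 2], [2, 1, 3], [3, 2, 1]]


def leaders_per_rack(offsets, threshold):
    """Count elected leaders per rack for the hypothetical assignments."""
    counts = Counter()
    for assignment in ASSIGNMENTS:
        leader = order_candidates(assignment, offsets, threshold)[0]
        counts[RACK[leader]] += 1
    return dict(counts)


# All replicas hold comparable local data: assignment order wins.
steady = {1: 1_000, 2: 1_050, 3: 990}
# Broker 3 (rack-c) was replaced and bootstrapped from tiered storage.
recovering = {1: 1_000, 2: 1_050, 3: 900_000}

print(leaders_per_rack(steady, threshold=500))
# {'rack-a': 2, 'rack-b': 2, 'rack-c': 2}
print(leaders_per_rack(recovering, threshold=500))
# {'rack-a': 3, 'rack-b': 3}
```

Under these assumptions rack-c holds no leaders until its replicas come
back within the threshold, and the other two racks each carry 50% more
leaders than in steady state.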


Regards,
Manan

On Wed, Apr 29, 2026 at 9:18 PM Thomas Thornton via dev <
[email protected]> wrote:

> Hi Manan,
>
> Thanks for the discussion.
>
> MG1: Good point. I've updated the KIP to add topic-level overrides for
> both configs so operators can tune the threshold per topic if the
> cluster-wide default doesn't fit a particular workload.
>
> MG2: This shouldn't cause skew. The localLogStartOffset sorting only
> pushes down newly-bootstrapped replicas that have significantly less
> local data. Existing replicas with comparable local data will be
> within the threshold and fall back to the original assignment
> order, the same behavior as today. We're not introducing a new dimension
> that would systematically favor certain racks or brokers.
>
> MG3: When tiered storage is not enabled for a topic, we will not send
> AlterPartition requests to report localLogStartOffset. There will be
> no extra control-plane overhead for clusters or topics that don't use
> tiered storage. When enabled, the additional AlterPartition calls only
> fire when an ISR member's localLogStartOffset actually changes, which
> reuses the existing protocol and should be infrequent.
>
> Thanks,
> Tom
>
> On Wed, Apr 29, 2026 at 5:42 PM Thomas Thornton
> <[email protected]> wrote:
> >
> > Hi Ivan,
> >
> > Thanks for the feedback.
> >
> > IY1: Yes, the sort is stable. Replicas within the threshold are
> > considered equivalent and retain their original assignment order.
> > We're only reordering replicas that have significantly less local
> > data; the existing replicas keep the same relative ordering as before.
> > I've updated that part of the KIP to reflect this.
> >
> > IY2: Good idea. I've updated the KIP to add topic-level overrides for
> > both configs. This follows the standard Kafka pattern (like topic-level
> > `retention.ms` overriding broker-level `log.retention.ms`). The
> > cluster-wide default applies unless overridden per topic.
> >
> > Thanks,
> > Tom
> >
> > On Fri, Apr 24, 2026 at 3:18 PM Ivan Yurchenko <[email protected]> wrote:
> > >
> > > Hi Thomas,
> > >
> > > Thank you for the KIP. The motivation makes sense to me. I have a
> > > couple of comments:
> > >
> > > IY1:
> > > > When `leader.election.prefer.early.local.log.start.offset` is
> > > > enabled, the key change is to sort targetReplicas by
> > > > local-log-start-offset (ascending) before selecting a leader. This
> > > > ensures replicas with more local data (lower local-log-start-offset)
> > > > are considered first in both election paths.
> > >
> > > I assume this is meant to say "sort stably", to preserve the original
> > > preference order as much as possible?
> > >
> > > IY2:
> > > Is there a reason a particular topic might not follow the new leader
> > > election algorithm, or is it strictly better, so that once enabled it
> > > is not expected to be disabled? If there is such a reason, would you
> > > consider adding topic-level versions of the new configs
> > > leader.election.prefer.early.local.log.start.offset and
> > > leader.election.local.log.start.offset.threshold?
> > >
> > > Best,
> > > Ivan
> > >
> > >
> > > On Mon, Mar 30, 2026, at 20:43, Thomas Thornton via dev wrote:
> > >
> > > Hi all,
> > >
> > > We want to start a discussion thread for KIP-1303: Deprioritize Tiered
> > > Storage Followers In Leader Election.
> > >
> > > The adopted KIP-1023 introduced an optimization allowing followers to
> > > skip replicating data already in remote storage, dramatically reducing
> > > ISR join time. However, as noted in KIP-1023, this creates a risk: if
> > > such a follower becomes leader, it may need to serve consumer requests
> > > from remote storage, impacting performance.
> > >
> > > This KIP proposes to mitigate this risk by preferring replicas with
> > > more local data (lower localLogStartOffset) during leader election.
> > > Key changes include:
> > > 1) New config leader.election.prefer.early.local.log.start.offset to
> > > enable the feature
> > > 2) New config leader.election.local.log.start.offset.threshold to
> > > avoid leader churn from minor retention timing differences
> > > 3) Extending FetchRequest and AlterPartition to propagate
> > > localLogStartOffset from followers → leader → controller
> > >
> > > The full KIP is available here:
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1303%3A+Deprioritize+Tiered+Storage+Followers+In+Leader+Election
> > >
> > > Thanks,
> > > Tom
> > >
> > >
>
