Re: [DISCUSS] KIP-1282: Prevent data loss during partition expansion for dynamically added partitions

Chia-Ping Tsai Tue, 21 Apr 2026 12:35:44 -0700

Hi Jun,

Thanks for the clarification. I think I misunderstood your previous point.
Let me summarize the scenarios to ensure we are fully aligned.


There are essentially three scenarios when a consumer needs to reset
offsets:

   1. Out-of-range (The group exists, but the offset has expired).
   2. Extended partition (The group exists, but encounters a newly added
      partition with no committed offset).
   3. No-offset (The group is completely new, or an existing group was deleted
      by the GC).
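
As a rough illustration of how these three cases differ, here is a minimal
Python sketch. The names (ResetScenario, classify, group_preexisting) are
hypothetical and are not part of the KIP or of any Kafka API. In particular,
how a client would learn that the group already existed is simply assumed to
be available as a flag.

```python
from enum import Enum
from typing import Optional


class ResetScenario(Enum):
    VALID = "committed offset is usable, no reset needed"
    OUT_OF_RANGE = "scenario 1"
    EXTENDED_PARTITION = "scenario 2"
    NO_OFFSET = "scenario 3"


def classify(group_preexisting: bool,
             committed: Optional[int],
             log_start: int,
             log_end: int) -> ResetScenario:
    """Map a partition's state onto the three reset scenarios.

    group_preexisting is an assumed input: True if the group existed
    before this partition was first seen without a committed offset.
    """
    if committed is None:
        # No committed offset: either a new partition for an existing
        # group, or a group that is new (or was removed by GC).
        if group_preexisting:
            return ResetScenario.EXTENDED_PARTITION
        return ResetScenario.NO_OFFSET
    if committed < log_start or committed > log_end:
        # The committed offset no longer points into the log.
        return ResetScenario.OUT_OF_RANGE
    return ResetScenario.VALID
```

For example, classify(True, None, 0, 10) falls into scenario 2, while
classify(True, 3, 5, 10) falls into scenario 1.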

We all agree that the primary goal of this KIP is to catch up on all
records for scenario 2. There are no objections here.

Regarding the inconsistency you pointed out between 1) and 3) under the
current to_start_time design, I completely see your point. If users are not
fully aware that to_start_time is designed to read all records since the
creation of the group, they might get confused.

However, to me, this "inconsistency" is actually a matter of
predictability. Users who explicitly choose the to_start_time policy do so
precisely because they do not want to skip any records when encountering an
out-of-range scenario.
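
To make the predictability argument concrete, the sketch below models the
single rule described earlier in the thread for to_start_time: look up the
first offset whose record timestamp is at or after the group creation time.
It is a toy model in Python, not the KIP's implementation. The function name
offset_for_time and the fallback to the log end offset when no record
qualifies are assumptions made for illustration.

```python
from typing import List, Tuple

Record = Tuple[int, int]  # (offset, timestamp_ms), sorted by offset


def offset_for_time(records: List[Record], log_end: int, ts: int) -> int:
    """Return the first offset whose timestamp is >= ts.

    If no surviving record qualifies, fall back to log_end
    (an assumed fallback, used here only for illustration).
    """
    for offset, timestamp in records:
        if timestamp >= ts:
            return offset
    return log_end


# New group created at t=1000, after every existing record:
# the rule lands on the log end, so old data is skipped.
new_group = offset_for_time([(0, 100), (1, 200), (2, 300)], 3, 1000)

# Group created at t=150; retention later removed offsets 0..4, so the
# committed offset is out of range. Every surviving record is newer than
# the group, so the same rule lands on the log start.
out_of_range = offset_for_time([(5, 600), (6, 700)], 7, 150)

# Partition added after the group was created at t=150:
# the same rule returns offset 0, so no record is missed.
extended = offset_for_time([(0, 500), (1, 600)], 2, 150)

print(new_group, out_of_range, extended)  # 3 5 0
```

The same lookup yields a latest-like result for a brand-new group and an
earliest-like result after truncation, which is the difference in outcome
being debated here.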

(I would prefer to set aside the topic of group GC for a moment. It is much
more important that we first focus our discussion on the "out-of-range"
scenario)

Best,

Chia-Ping

Jun Rao via dev <[email protected]> wrote on Wed, Apr 22, 2026 at 1:13 AM:

> Hi, Chia-Ping,
>
> Hmm, is that true? With the earliest policy, we treat an out-of-range
> offset the same as no offset (because the group is deleted) and always set
> it to the earliest offset, right? With to_start_time, an out-of-range
> offset is treated differently from no offset.
>
> Thanks,
>
> Jun
>
> On Tue, Apr 21, 2026 at 12:54 AM Chia-Ping Tsai <[email protected]>
> wrote:
>
> > hi Jun
> >
> > Nice point. Group GC is definitely an issue for to_start_time, but it is
> > actually an issue for other policies as well.
> >
> > For example, a consumer using the earliest policy will suddenly read all
> > historical records from scratch if it sleeps for a long while and gets
> > GC'd; otherwise, it just resumes from previous offsets if the group still
> > exists. It is equally hard to explain to users: "Oh, your group was GC'd,
> > so your offset behavior changed."
> >
> > Therefore, it seems to me the right approach to fix this "inconsistency"
> > is to offer a group-level GC timeout in a future KIP, allowing users to
> > explicitly protect critical groups from GC. This saves not only
> > to_start_time, but all other reset policies too.
> >
> > Best,
> > Chia-Ping
> >
> > On 2026/04/20 20:19:47 Jun Rao via dev wrote:
> > > Hi, Jiunn-Yang and Chia-Ping,
> > >
> > > Thanks for the reply.
> > >
> > > > The main concern I see with to_start_time is that its behavior on how
> > > > much data to consume when the offset is out of range is not consistent
> > > > and is hard to explain. If the group still exists, it will read from
> > > > the earliest offset. Otherwise, it will read from the latest.
> > >
> > > Jun
> > >
> > > On Mon, Apr 20, 2026 at 10:13 AM Chia-Ping Tsai <[email protected]>
> > wrote:
> > >
> > > > hi all,
> > > >
> > > > Just a note for a potential latest_v2:
> > > >
> > > > > Since the purpose is to read all records from extended partitions, we
> > > > > could leverage the group creation time to compare against the earliest
> > > > > record of a partition when there is no committed offset. If the group
> > > > > creation time is later than the earliest record's timestamp, we assume
> > > > > it is not an extended partition. Otherwise, we treat it as an extended
> > > > > partition.
> > > > >
> > > > > This approach allows us to catch all "possible" extended partitions,
> > > > > which includes both "true" extended partitions and old but truncated
> > > > > partitions. While there is a rare edge case where the cost is
> > > > > reprocessing some records we don't necessarily want, it is very easy
> > > > > to implement and guarantees we will never miss the actual extended
> > > > > partitions.
> > > >
> > > > Best,
> > > > Chia-Ping
> > > >
> > > > On 2026/04/20 13:33:31 黃竣陽 wrote:
> > > > > Hello all,
> > > > >
> > > > > > I have added a new "Future Work: latest_strict Policy" section to the
> > > > > > KIP. The idea is a future policy that uses latest semantics by default
> > > > > > but falls back to the group creation timestamp specifically for newly
> > > > > > added partitions during partition expansion. This would reuse the
> > > > > > group creation time anchor introduced by this KIP, making it a natural
> > > > > > extension with minimal additional protocol changes.
> > > > >
> > > > > Best Regards,
> > > > > Jiunn-Yang
> > > > >
> > > > > > > Chia-Ping Tsai <[email protected]> wrote on Apr 18, 2026 at 4:09 PM:
> > > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > > It is practically NP-hard to guess everyone's ideal use case right
> > > > > > > now. Also, I believe we all want to avoid falling back to the
> > > > > > > intricate multi-policy approach proposed in KIP-842.
> > > > > > >
> > > > > > > I prefer to keep this KIP focused and discuss a "v2 latest" policy
> > > > > > > in a separate KIP. That future policy could build upon the
> > > > > > > to_start_time anchor to fix data loss specifically for extended
> > > > > > > partitions. We could call it something like latest_strict.
> > > > > >
> > > > > > Thoughts?
> > > > > >
> > > > > >
> > > > > > > 黃竣陽 <[email protected]> wrote on Sat, Apr 18, 2026 at 3:24 PM:
> > > > > >
> > > > > >> Hello Jun,
> > > > > >>
> > > > > >> Thanks for the reply,
> > > > > >>
> > > > > > >> When the offset goes out of range, the user faces two options:
> > > > > > >>
> > > > > > >> 1. Skip to the end (latest behavior): risk losing data that was
> > > > > > >>    produced during the group's lifetime but not yet consumed.
> > > > > > >> 2. Seek back to the group creation time (to_start_time behavior):
> > > > > > >>    potentially reprocess some data, but guarantee no data from the
> > > > > > >>    group's lifetime is silently lost.
> > > > > > >>
> > > > > > >> to_start_time chooses option 2 because its core promise is "never
> > > > > > >> silently lose data produced after the group started." If we fell
> > > > > > >> back to latest on out-of-range, we would break this guarantee.
> > > > > > >>
> > > > > > >> I think users who prefer option 1 can simply use
> > > > > > >> auto.offset.reset=latest.
> > > > > >>
> > > > > >> Best Regards,
> > > > > >> Jiunn-Yang
> > > > > >>
> > > > > > >>> Jun Rao via dev <[email protected]> wrote on Apr 18, 2026 at 1:57 AM:
> > > > > >>>
> > > > > >>> Hi, Jiunn-Yang and Chia-Ping,
> > > > > >>>
> > > > > >>> Thanks for the reply.
> > > > > >>>
> > > > > >>> "The core semantic of to_start_time is to read all records
> since
> > the
> > > > > >>> creation of the group."
> > > > > >>>
> > > > > >>> I am just questioning whether this actually covers a common use
> > > > case. If
> > > > > >>> the offset doesn't go out of range, the logic makes sense to
> me.
> > I'm
> > > > not
> > > > > >>> sure about the logic if the offset is out of range. If a user
> > > > chooses to
> > > > > >>> skip the historical data when starting the group, it seems the
> > user
> > > > > >> likely
> > > > > >>> wants to do the same if the offset is out of range.
> > > > > >>>
> > > > > >>> Jun
> > > > > >>>
> > > > > >>> On Fri, Apr 17, 2026 at 5:23 AM 黃竣陽 <[email protected]>
> wrote:
> > > > > >>>
> > > > > >>>> Hello Jun,
> > > > > >>>>
> > > > > > >>>> Thanks for the feedback,
> > > > > >>>>
> > > > > >>>> Adding to the points above:
> > > > > >>>>
> > > > > > >>>> Regarding by_duration as an alternative to Scenario 1: beyond
> > > > > > >>>> clock skew and retry issues, there is also a usability concern.
> > > > > > >>>> by_duration requires users to reason about operational timing
> > > > > > >>>> ("how long does partition discovery take in my environment?")
> > > > > > >>>> and then translate that into a configuration value.
> > > > > > >>>> to_start_time requires no such reasoning. It simply anchors to
> > > > > > >>>> the group creation time recorded by the broker.
> > > > > > >>>>
> > > > > > >>>> Regarding Scenario 2: I'd also like to clarify that to_start_time
> > > > > > >>>> does not branch between "use latest" and "use earliest." It
> > > > > > >>>> applies the same ListOffsetsRequest with the group creation
> > > > > > >>>> timestamp in all cases. The difference in outcome:
> > > > > > >>>> - skipping old data on first start
> > > > > > >>>> - consuming surviving data after truncation
> > > > > > >>>> is a natural consequence of what data exists in the partition at
> > > > > > >>>> that point, not a different policy being applied. The rule is
> > > > > > >>>> always the same.
> > > > > >>>>
> > > > > >>>> Best Regards,
> > > > > >>>> Jiunn-Yang
> > > > > >>>>
> > > > > > >>>>> Chia-Ping Tsai <[email protected]> wrote on Apr 17, 2026 at 9:48 AM:
> > > > > >>>>>
> > > > > >>>>>
> > > > > > >>>>>> Jun Rao via dev <[email protected]> wrote on Apr 17, 2026 at 4:57 AM:
> > > > > >>>>>>
> > > > > > >>>>>> Also, a group is deleted after the consumer has been idle longer
> > > > > > >>>>>> than offsets.retention.minutes. What's the semantic of
> > > > > > >>>>>> to_start_time if the group creation time is unavailable?
> > > > > >>>>>
> > > > > > >>>>> If the group is recreated, a new creation time will be recorded.
> > > > > > >>>>> Hence, it acts like a new group. Plus, it throws an exception
> > > > > > >>>>> directly if the group truly has no creation time.
> > > > > >>>>
> > > > > >>>>
> > > > > >>
> > > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>
