Re: [DISCUSS] KIP-1282: Prevent data loss during partition expansion for dynamically added partitions

Jun Rao via dev Tue, 21 Apr 2026 10:13:46 -0700

Hi, Chia-Ping,

Hmm, is that true? With the earliest policy, we treat an out-of-range
offset the same as no offset (because the group is deleted) and always set
it to the earliest offset, right? With to_start_time, an out-of-range
offset is treated differently from no offset.
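
To make the contrast concrete, here is a minimal Python sketch (not Kafka
code). The record layout, the function names, and the fallback to the log end
offset when no record is at or after the requested timestamp are all
assumptions made for illustration only.

```python
# Illustrative model only: a partition is a list of (offset, timestamp) pairs.

def offset_for_time(records, ts, log_end):
    """Return the first offset whose timestamp is >= ts.

    Falling back to log_end when no such record exists is an assumption
    of this sketch.
    """
    for offset, record_ts in records:
        if record_ts >= ts:
            return offset
    return log_end


def reset_earliest(records, log_end):
    """earliest: same result for 'no offset' and 'offset out of range'."""
    return records[0][0] if records else log_end


def reset_to_start_time(records, log_end, group_created_at):
    """to_start_time: always look up the group creation timestamp."""
    return offset_for_time(records, group_created_at, log_end)


# Surviving records after truncation: offsets 50..99, timestamps 1050..1099.
records = [(o, 1000 + o) for o in range(50, 100)]
log_end = 100

# earliest does not depend on whether the group still exists.
earliest = reset_earliest(records, log_end)

# Group still exists (created at t=1010): behaves like earliest.
existing = reset_to_start_time(records, log_end, group_created_at=1010)

# Group was deleted and recreated (new creation time t=1200): behaves like latest.
recreated = reset_to_start_time(records, log_end, group_created_at=1200)
```

In this model, the to_start_time result depends on whether the original group
creation time survived, while the earliest result does not.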


Thanks,

Jun

On Tue, Apr 21, 2026 at 12:54 AM Chia-Ping Tsai <[email protected]> wrote:

> hi Jun
>
> Nice point. Group GC is definitely an issue for to_start_time, but it is
> actually an issue for other policies as well.
>
> For example, a consumer using the earliest policy will suddenly read all
> historical records from scratch if it sleeps for a long while and gets
> GC'd; otherwise, it just resumes from previous offsets if the group still
> exists. It is equally hard to explain to users: "Oh, your group was GC'd,
> so your offset behavior changed."
>
> Therefore, it seems to me the right approach to fix this "inconsistency"
> is to offer a group-level GC timeout in a future KIP, allowing users to
> explicitly protect critical groups from GC. This saves not only
> to_start_time, but all other reset policies too.
>
> Best,
> Chia-Ping
>
> On 2026/04/20 20:19:47 Jun Rao via dev wrote:
> > Hi, Jiunn-Yang and Chia-Ping,
> >
> > Thanks for the reply.
> >
> > The main concern I see with to_start_time is that its behavior on how much
> > data to consume when the offset is out of range is not consistent and is
> > hard to explain. If the group still exists, it will read from the earliest
> > offset. Otherwise, it will read from the latest.
> >
> > Jun
> >
> > On Mon, Apr 20, 2026 at 10:13 AM Chia-Ping Tsai <[email protected]> wrote:
> >
> > > hi all,
> > >
> > > Just a note for a potential latest_v2:
> > >
> > > Since the purpose is to read all records from extended partitions, we
> > > could leverage the group creation time to compare against the earliest
> > > record of a partition when there is no committed offset. If the group
> > > creation time is later than the earliest record's timestamp, we assume
> > > it is not an extended partition. Otherwise, we treat it as an extended
> > > partition.
> > >
> > > This approach allows us to catch all "possible" extended partitions, which
> > > includes both "true" extended partitions and old but truncated partitions.
> > > While there is a rare edge case where the cost is reprocessing some records
> > > we don't necessarily want, it is very easy to implement and guarantees we
> > > will never miss the actual extended partitions.
> > >
> > > Best,
> > > Chia-Ping
> > >
> > > On 2026/04/20 13:33:31 黃竣陽 wrote:
> > > > Hello all,
> > > >
> > > > I have added a new "Future Work: latest_strict Policy" section to the KIP.
> > > > The idea is a future policy that uses latest semantics by default but falls
> > > > back to the group creation timestamp specifically for newly added partitions
> > > > during partition expansion. This would reuse the group creation time anchor
> > > > introduced by this KIP, making it a natural extension with minimal additional
> > > > protocol changes.
> > > >
> > > > Best Regards,
> > > > Jiunn-Yang
> > > >
> > > > > On Apr 18, 2026, at 4:09 PM, Chia-Ping Tsai <[email protected]> wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > It is practically NP-hard to guess everyone's ideal use case right now.
> > > > > Also, I believe we all want to avoid falling back to the intricate
> > > > > multi-policy approach proposed in KIP-842.
> > > > >
> > > > > I prefer to keep this KIP focused and discuss a "v2 latest" policy in a
> > > > > separate KIP. That future policy could build upon the to_start_time anchor
> > > > > to fix data loss specifically for extended partitions. We could call it
> > > > > something like latest_strict.
> > > > >
> > > > > Thoughts?
> > > > >
> > > > >
> > > > > On Sat, Apr 18, 2026, at 3:24 PM, 黃竣陽 <[email protected]> wrote:
> > > > >
> > > > >> Hello Jun,
> > > > >>
> > > > >> Thanks for the reply,
> > > > >>
> > > > >> When the offset goes out of range, the user faces two options:
> > > > >>
> > > > >> 1. Skip to the end (latest behavior): risk losing data that was produced
> > > > >> during the group's lifetime but not yet consumed.
> > > > >> 2. Seek back to the group creation time (to_start_time behavior):
> > > > >> potentially reprocess some data, but guarantee no data from the group's
> > > > >> lifetime is silently lost.
> > > > >>
> > > > >> to_start_time chooses option 2 because its core promise is "never silently
> > > > >> lose data produced after the group started." If we fell back to latest on
> > > > >> out-of-range, we would break this guarantee.
> > > > >>
> > > > >> Users who prefer option 1 can simply use auto.offset.reset=latest.
> > > > >>
> > > > >> Best Regards,
> > > > >> Jiunn-Yang
> > > > >>
> > > > >>> On Apr 18, 2026, at 1:57 AM, Jun Rao via dev <[email protected]> wrote:
> > > > >>>
> > > > >>> Hi, Jiunn-Yang and Chia-Ping,
> > > > >>>
> > > > >>> Thanks for the reply.
> > > > >>>
> > > > >>> "The core semantic of to_start_time is to read all records since the
> > > > >>> creation of the group."
> > > > >>>
> > > > >>> I am just questioning whether this actually covers a common use case. If
> > > > >>> the offset doesn't go out of range, the logic makes sense to me. I'm not
> > > > >>> sure about the logic if the offset is out of range. If a user chooses to
> > > > >>> skip the historical data when starting the group, it seems the user likely
> > > > >>> wants to do the same if the offset is out of range.
> > > > >>>
> > > > >>> Jun
> > > > >>>
> > > > >>> On Fri, Apr 17, 2026 at 5:23 AM 黃竣陽 <[email protected]> wrote:
> > > > >>>
> > > > >>>> Hello Jun,
> > > > >>>>
> > > > >>>> Thanks for the feedback,
> > > > >>>>
> > > > >>>> Adding to the points above:
> > > > >>>>
> > > > >>>> Regarding by_duration as an alternative to Scenario 1: beyond clock skew
> > > > >>>> and retry issues, there is also a usability concern. by_duration requires
> > > > >>>> users to reason about operational timing ("how long does partition
> > > > >>>> discovery take in my environment?") and then translate that into a
> > > > >>>> configuration value. to_start_time requires no such reasoning. It simply
> > > > >>>> anchors to the group creation time recorded by the broker.
> > > > >>>>
> > > > >>>> Regarding Scenario 2: I'd also like to clarify that to_start_time does not
> > > > >>>> branch between "use latest" and "use earliest." It applies the same
> > > > >>>> ListOffsetsRequest with the group creation timestamp in all cases. The
> > > > >>>> difference in outcome:
> > > > >>>> - skipping old data on first start
> > > > >>>> - consuming surviving data after truncation
> > > > >>>> is a natural consequence of what data exists in the partition at that
> > > > >>>> point, not a different policy being applied. The rule is always the same.
> > > > >>>>
> > > > >>>> Best Regards,
> > > > >>>> Jiunn-Yang
> > > > >>>>
> > > > >>>>> On Apr 17, 2026, at 9:48 AM, Chia-Ping Tsai <[email protected]> wrote:
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>> On Apr 17, 2026, at 4:57 AM, Jun Rao via dev <[email protected]> wrote:
> > > > >>>>>>
> > > > >>>>>> Also, a group is deleted after the consumer has been idle longer
> > > > >>>>>> than offsets.retention.minutes. What's the semantic of to_start_time if
> > > > >>>>>> the group creation time is unavailable?
> > > > >>>>>
> > > > >>>>> If the group is recreated, a new creation time will be recorded. Hence,
> > > > >>>>> it acts like a new group. Plus, it throws an exception directly if the
> > > > >>>>> group truly has no creation time.
> > > > >>>>
> > > > >>>>
> > > > >>
> > > > >>
> > > >
> > > >
> > >
> >
>
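
For reference, the latest_v2 heuristic quoted above (compare the group
creation time with the earliest record's timestamp when there is no committed
offset) could be sketched roughly as follows. This is only an illustration in
Python, not the KIP's implementation; the function name, its parameters, and
the handling of an empty partition are assumptions.

```python
# Hypothetical helper illustrating the heuristic; none of these names come
# from the KIP or from the Kafka code base.

def latest_v2_reset(earliest_record_ts, earliest_offset, log_end_offset,
                    group_created_at):
    """Pick a starting offset for a partition with no committed offset."""
    if earliest_record_ts is None:
        # Empty partition: nothing to skip or replay (assumption of this sketch).
        return log_end_offset
    if group_created_at > earliest_record_ts:
        # The partition already held data before the group existed, so it is
        # assumed not to be an extended partition: behave like latest.
        return log_end_offset
    # Otherwise treat it as an extended partition and read everything in it.
    return earliest_offset


# Old partition, group created later: start at the end.
old = latest_v2_reset(1000, 0, 500, group_created_at=2000)

# Partition added after the group was created: start at the beginning.
extended = latest_v2_reset(3000, 0, 40, group_created_at=2000)

# Old partition truncated so that only records newer than the group remain:
# also treated as extended, which is the reprocessing edge case mentioned.
truncated = latest_v2_reset(2500, 300, 500, group_created_at=2000)
```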
