Re: [DISCUSS] KIP-1282: Prevent data loss during partition expansion for dynamically added partitions

Jun Rao via dev Wed, 22 Apr 2026 10:34:27 -0700

Hi, Chia-Ping,

Thanks for the reply.


Let's try to understand this from the user's perspective. When a user starts
the group for the first time, they face a choice on whether to process the
backlog or not. When the offset is out of range, they face the same choice
regarding backlog processing. It seems that most users would want to make
the same choice in both cases.

"Users who explicitly choose the to_start_time policy do so precisely
because they do not want to skip any records when encountering an
out-of-range scenario."
This argument is weak because it only restates how to_start_time is designed;
we still need to justify why that design is a good choice in the first place.

Jun
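
For anyone skimming the thread, the two situations discussed above (first
start versus out-of-range) can be modeled with a small Python sketch. This is
a toy model, not Kafka code: reset_offset, offset_for_timestamp, and the
in-memory timestamp list are hypothetical names, and the lookup only
approximates what the thread describes as a ListOffsetsRequest with the group
creation timestamp. Returning the log end offset when no retained record is
new enough is also an assumption.

```python
from bisect import bisect_left


def offset_for_timestamp(timestamps, log_start, target_ms):
    """Offset of the first retained record with timestamp >= target_ms.

    timestamps[i] belongs to the record at offset log_start + i and is
    assumed to be non-decreasing. If no retained record qualifies, the log
    end offset is returned (an assumption made for this sketch).
    """
    return log_start + bisect_left(timestamps, target_ms)


def reset_offset(policy, timestamps, log_start, group_created_ms):
    """Toy model of where each reset policy would position the consumer."""
    log_end = log_start + len(timestamps)
    if policy == "earliest":
        return log_start
    if policy == "latest":
        return log_end
    if policy == "to_start_time":
        # Same rule in every situation: seek to the group creation time.
        return offset_for_timestamp(timestamps, log_start, group_created_ms)
    raise ValueError(f"unknown policy: {policy}")


# First start: offsets 0..9 already exist, and the group is created afterwards.
backlog = list(range(100, 200, 10))            # timestamps 100..190
print(reset_offset("to_start_time", backlog, 0, 200))     # 10, same as latest

# Later: offsets 0..14 were truncated, so a committed offset of 12 is out of range.
surviving = list(range(250, 300, 10))          # offsets 15..19
print(reset_offset("to_start_time", surviving, 15, 200))  # 15, same as earliest
```

In this model the rule never changes, yet the outcome matches latest on first
start and earliest after truncation, which is the behavior being debated.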

On Tue, Apr 21, 2026 at 12:35 PM Chia-Ping Tsai <[email protected]> wrote:

> Hi Jun,
>
> Thanks for the clarification. I think I misunderstood your previous point.
> Let me summarize the scenarios to ensure we are fully aligned.
>
> There are essentially three scenarios when a consumer needs to reset
> offsets:
>
>    1. Out-of-range (The group exists, but the offset has expired).
>    2. Extended partition (The group exists, but encounters a newly added
>       partition with no committed offset).
>    3. No-offset (The group is completely new, or an existing group was
>       deleted by the GC).
>
> We all agree that the primary goal of this KIP is to catch up on all
> records for scenario 2. There are no objections here.
>
> Regarding the inconsistency you pointed out between 1) and 3) under the
> current to_start_time design, I completely see your point. If users are
> not fully aware that to_start_time is designed to read all records since
> the creation of the group, they might get confused.
>
> However, to me, this "inconsistency" is actually a matter of
> predictability. Users who explicitly choose the to_start_time policy do
> so precisely because they do not want to skip any records when encountering
> an out-of-range scenario.
>
> (I would prefer to set aside the topic of group GC for a moment. It is
> much more important that we first focus our discussion on the
> "out-of-range" scenario)
>
> Best,
>
> Chia-Ping
>
> On Wed, Apr 22, 2026 at 1:13 AM Jun Rao via dev <[email protected]> wrote:
>
>> Hi, Chia-Ping,
>>
>> Hmm, is that true? With the earliest policy, we treat an out-of-range
>> offset the same as no offset (because the group is deleted) and always set
>> it to the earliest offset, right? With to_start_time, an out-of-range
>> offset is treated differently from no offset.
>>
>> Thanks,
>>
>> Jun
>>
>> On Tue, Apr 21, 2026 at 12:54 AM Chia-Ping Tsai <[email protected]>
>> wrote:
>>
>> > hi Jun
>> >
>> > Nice point. Group GC is definitely an issue for to_start_time, but it is
>> > actually an issue for other policies as well.
>> >
>> > For example, a consumer using the earliest policy will suddenly read all
>> > historical records from scratch if it sleeps for a long while and gets
>> > GC'd; otherwise, it just resumes from previous offsets if the group still
>> > exists. It is equally hard to explain to users: "Oh, your group was GC'd,
>> > so your offset behavior changed."
>> >
>> > Therefore, it seems to me the right approach to fix this "inconsistency"
>> > is to offer a group-level GC timeout in a future KIP, allowing users to
>> > explicitly protect critical groups from GC. This saves not only
>> > to_start_time, but all other reset policies too.
>> >
>> > Best,
>> > Chia-Ping
>> >
>> > On 2026/04/20 20:19:47 Jun Rao via dev wrote:
>> > > Hi, Jiunn-Yang and Chia-Ping,
>> > >
>> > > Thanks for the reply.
>> > >
>> > > The main concern I see with to_start_time is that its behavior on how much
>> > > data to consume when the offset is out of range is not consistent and is
>> > > hard to explain. If the group still exists, it will read from the earliest
>> > > offset. Otherwise, it will read from the latest.
>> > >
>> > > Jun
>> > >
>> > > On Mon, Apr 20, 2026 at 10:13 AM Chia-Ping Tsai <[email protected]>
>> > wrote:
>> > >
>> > > > hi all,
>> > > >
>> > > > Just a note for a potential latest_v2:
>> > > >
>> > > > Since the purpose is to read all records from extended partitions, we
>> > > > could leverage the group creation time to compare against the earliest
>> > > > record of a partition when there is no committed offset. If the group
>> > > > creation time is later than the earliest record's timestamp, we assume
>> > > > it is not an extended partition. Otherwise, we treat it as an extended
>> > > > partition.
>> > > >
>> > > > This approach allows us to catch all "possible" extended partitions, which
>> > > > includes both "true" extended partitions and old but truncated partitions.
>> > > > While there is a rare edge case where the cost is reprocessing some records
>> > > > we don't necessarily want, it is very easy to implement and guarantees we
>> > > > will never miss the actual extended partitions.
>> > > >
>> > > > Best,
>> > > > Chia-Ping
>> > > >
>> > > > On 2026/04/20 13:33:31 黃竣陽 wrote:
>> > > > > Hello all,
>> > > > >
>> > > > > I have added a new "Future Work: latest_strict Policy" section to the KIP.
>> > > > > The idea is a future policy that uses latest semantics by default but falls
>> > > > > back to the group creation timestamp specifically for newly added partitions
>> > > > > during partition expansion. This would reuse the group creation time anchor
>> > > > > introduced by this KIP, making it a natural extension with minimal additional
>> > > > > protocol changes.
>> > > > >
>> > > > > Best Regards,
>> > > > > Jiunn-Yang
>> > > > >
>> > > > > > On Apr 18, 2026, at 4:09 PM, Chia-Ping Tsai <[email protected]> wrote:
>> > > > > >
>> > > > > > Hi all,
>> > > > > >
>> > > > > > It is practically NP-hard to guess everyone's ideal use case right now.
>> > > > > > Also, I believe we all want to avoid falling back to the intricate
>> > > > > > multi-policy approach proposed in KIP-842.
>> > > > > >
>> > > > > > I prefer to keep this KIP focused and discuss a "v2 latest" policy in a
>> > > > > > separate KIP. That future policy could build upon the to_start_time anchor
>> > > > > > to fix data loss specifically for extended partitions. We could call it
>> > > > > > something like latest_strict.
>> > > > > >
>> > > > > > Thoughts?
>> > > > > >
>> > > > > >
>> > > > > > On Sat, Apr 18, 2026 at 3:24 PM 黃竣陽 <[email protected]> wrote:
>> > > > > >
>> > > > > >> Hello Jun,
>> > > > > >>
>> > > > > >> Thanks for the reply,
>> > > > > >>
>> > > > > >> When the offset goes out of range, the user faces two options:
>> > > > > >>
>> > > > > >> 1. Skip to the end (latest behavior): risk losing data that was produced
>> > > > > >> during the group's lifetime but not yet consumed.
>> > > > > >> 2. Seek back to the group creation time (to_start_time behavior):
>> > > > > >> potentially reprocess some data, but guarantee no data from the group's
>> > > > > >> lifetime is silently lost.
>> > > > > >>
>> > > > > >> to_start_time chooses option 2 because its core promise is "never silently
>> > > > > >> lose data produced after the group started." If we fell back to latest on
>> > > > > >> out-of-range, we would break this guarantee.
>> > > > > >>
>> > > > > >> I think users who prefer option 1 can simply use
>> > > > > >> auto.offset.reset=latest.
>> > > > > >>
>> > > > > >> Best Regards,
>> > > > > >> Jiunn-Yang
>> > > > > >>
>> > > > > >>> On Apr 18, 2026, at 1:57 AM, Jun Rao via dev <[email protected]> wrote:
>> > > > > >>>
>> > > > > >>> Hi, Jiunn-Yang and Chia-Ping,
>> > > > > >>>
>> > > > > >>> Thanks for the reply.
>> > > > > >>>
>> > > > > >>> "The core semantic of to_start_time is to read all records since the
>> > > > > >>> creation of the group."
>> > > > > >>>
>> > > > > >>> I am just questioning whether this actually covers a common use case. If
>> > > > > >>> the offset doesn't go out of range, the logic makes sense to me. I'm not
>> > > > > >>> sure about the logic if the offset is out of range. If a user chooses to
>> > > > > >>> skip the historical data when starting the group, it seems the user likely
>> > > > > >>> wants to do the same if the offset is out of range.
>> > > > > >>>
>> > > > > >>> Jun
>> > > > > >>>
>> > > > > >>> On Fri, Apr 17, 2026 at 5:23 AM 黃竣陽 <[email protected]>
>> wrote:
>> > > > > >>>
>> > > > > >>>> Hello Jun,
>> > > > > >>>>
>> > > > > >>>> Thanks for the feedback.
>> > > > > >>>>
>> > > > > >>>> Adding to the points above:
>> > > > > >>>>
>> > > > > >>>> Regarding by_duration as an alternative to Scenario 1: beyond clock skew
>> > > > > >>>> and retry issues, there is also a usability concern. by_duration requires
>> > > > > >>>> users to reason about operational timing ("how long does partition
>> > > > > >>>> discovery take in my environment?") and then translate that into a
>> > > > > >>>> configuration value. to_start_time requires no such reasoning. It simply
>> > > > > >>>> anchors to the group creation time recorded by the broker.
>> > > > > >>>>
>> > > > > >>>> Regarding Scenario 2: I'd also like to clarify that to_start_time does not
>> > > > > >>>> branch between "use latest" and "use earliest." It applies the same
>> > > > > >>>> ListOffsetsRequest with the group creation timestamp in all cases. The
>> > > > > >>>> difference in outcome:
>> > > > > >>>> - skipping old data on first start
>> > > > > >>>> - consuming surviving data after truncation
>> > > > > >>>> is a natural consequence of what data exists in the partition at that
>> > > > > >>>> point, not a different policy being applied. The rule is always the same.
>> > > > > >>>>
>> > > > > >>>> Best Regards,
>> > > > > >>>> Jiunn-Yang
>> > > > > >>>>
>> > > > > >>>>> On Apr 17, 2026, at 9:48 AM, Chia-Ping Tsai <[email protected]> wrote:
>> > > > > >>>>>
>> > > > > >>>>>
>> > > > > >>>>>> On Apr 17, 2026, at 4:57 AM, Jun Rao via dev <[email protected]> wrote:
>> > > > > >>>>>>
>> > > > > >>>>>> Also, a group is deleted after the consumer has been idle longer
>> > > > > >>>>>> than offsets.retention.minutes. What are the semantics of to_start_time
>> > > > > >>>>>> if the group creation time is unavailable?
>> > > > > >>>>>
>> > > > > >>>>> If the group is recreated, a new creation time will be recorded. Hence,
>> > > > > >>>>> it acts like a new group. Plus, it throws an exception directly if the
>> > > > > >>>>> group truly has no creation time.
>> > > > > >>>>
>> > > > > >>>>
>> > > > > >>
>> > > > > >>
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
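
The "latest_v2" heuristic quoted above (comparing the group creation time with
the earliest record's timestamp when a partition has no committed offset) can
be sketched in a few lines of Python. This is only an illustration of the rule
as worded in the thread, not a proposed implementation: the function names are
hypothetical, timestamps are plain integers in milliseconds, and treating an
empty partition as extended is an assumption added here.

```python
def looks_like_extended_partition(group_created_ms, earliest_record_ts_ms):
    """Apply the heuristic to a partition that has no committed offset.

    If the group was created after the earliest retained record, the
    partition is assumed to predate the group (not extended). Otherwise it is
    treated as a possibly extended partition.
    """
    if earliest_record_ts_ms is None:
        # Assumption for this sketch: an empty partition counts as extended,
        # since earliest and latest coincide there anyway.
        return True
    return group_created_ms <= earliest_record_ts_ms


def reset_without_committed_offset(group_created_ms, earliest_record_ts_ms):
    """Pick the reset target implied by the heuristic."""
    if looks_like_extended_partition(group_created_ms, earliest_record_ts_ms):
        return "earliest"
    return "latest"


GROUP_CREATED_MS = 1_000

# Partition that already held data when the group was created.
print(reset_without_committed_offset(GROUP_CREATED_MS, 500))    # latest

# Partition added after the group was created.
print(reset_without_committed_offset(GROUP_CREATED_MS, 1_500))  # earliest

# Old partition whose early records were truncated: also read from earliest,
# which is the rare reprocessing edge case mentioned in the thread.
print(reset_without_committed_offset(GROUP_CREATED_MS, 1_200))  # earliest
```

The last case shows why the heuristic catches "possible" rather than only
"true" extended partitions: truncation alone can move the earliest timestamp
past the group creation time.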
