Re: [DISCUSS] KIP-1282: Prevent data loss during partition expansion for dynamically added partitions

Chia-Ping Tsai Mon, 20 Apr 2026 10:13:47 -0700

Hi all,

Just a note for a potential latest_v2:


Since the purpose is to read all records from extended partitions, we could 
leverage the group creation time to compare against the earliest record of a 
partition when there is no committed offset. If the group creation time is 
later than the earliest record's timestamp, we assume it is not an extended 
partition. Otherwise, we treat it as an extended partition.

This approach allows us to catch all "possible" extended partitions, which 
includes both "true" extended partitions and old but truncated partitions. 
The cost is a rare edge case in which we reprocess some records we don't 
necessarily want, but the approach is very easy to implement and guarantees 
we will never miss the actual extended partitions.
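
A minimal sketch of the check described above, in plain Java with no Kafka dependency. The class, enum, and method names are hypothetical and are not part of the KIP or of any Kafka API. The handling of an empty partition is also an assumption, since the note above does not cover that case.

```java
import java.util.OptionalLong;

// Illustrative sketch only: all names here are hypothetical.
class LatestV2Sketch {

    enum ResetTarget { EARLIEST, LATEST }

    // Decides where to start on a partition that has no committed offset.
    static ResetTarget resolveReset(long groupCreationTimeMs,
                                    OptionalLong earliestRecordTimestampMs) {
        if (!earliestRecordTimestampMs.isPresent()) {
            // Assumption (not stated in the thread): an empty partition is
            // read from its start, so nothing written later can be skipped.
            return ResetTarget.EARLIEST;
        }
        if (groupCreationTimeMs > earliestRecordTimestampMs.getAsLong()) {
            // The partition already held data before the group existed:
            // assume it is not an extended partition, keep latest semantics.
            return ResetTarget.LATEST;
        }
        // The earliest record is at or after group creation: either a truly
        // extended partition or an old partition whose head was truncated.
        return ResetTarget.EARLIEST;
    }

    public static void main(String[] args) {
        // Old partition with data that predates the group.
        System.out.println(resolveReset(2_000L, OptionalLong.of(1_000L))); // LATEST
        // Partition whose earliest record is newer than the group.
        System.out.println(resolveReset(2_000L, OptionalLong.of(3_000L))); // EARLIEST
        // Partition with no records yet.
        System.out.println(resolveReset(2_000L, OptionalLong.empty()));    // EARLIEST
    }
}
```

The second case cannot distinguish a newly added partition from an old partition whose early records were removed by retention, which is the reprocessing edge case mentioned above.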

Best,
Chia-Ping

On 2026/04/20 13:33:31 黃竣陽 wrote:
> Hello all,
> 
> I have added a new "Future Work: latest_strict Policy" section to the KIP. 
> The idea is a future policy that uses latest semantics by default but falls 
> back to the group creation timestamp specifically for newly added partitions 
> during partition expansion. This would reuse the group creation time anchor 
> introduced by this KIP, making it a natural extension with minimal additional 
> protocol changes.
> 
> Best Regards,
> Jiunn-Yang
> 
> > On Apr 18, 2026, at 4:09 PM, Chia-Ping Tsai <[email protected]> wrote:
> > 
> > Hi all,
> > 
> > It is practically NP-hard to guess everyone's ideal use case right now.
> > Also, I believe we all want to avoid falling back to the intricate
> > multi-policy approach proposed in KIP-842.
> > 
> > I prefer to keep this KIP focused and discuss a "v2 latest" policy in a
> > separate KIP. That future policy could build upon the to_start_time anchor
> > to fix data loss specifically for extended partitions. We could call it
> > something like latest_strict.
> > 
> > Thoughts?
> > 
> > 
> > On Sat, Apr 18, 2026 at 3:24 PM 黃竣陽 <[email protected]> wrote:
> > 
> >> Hello Jun,
> >> 
> > >> Thanks for the reply.
> >> 
> >> When the offset goes out of range, the user faces two options:
> >> 
> > >> 1. Skip to the end (latest behavior): risk losing data that was produced
> > >> during the group's lifetime but not yet consumed.
> > >> 2. Seek back to the group creation time (to_start_time behavior):
> > >> potentially reprocess some data, but guarantee that no data from the
> > >> group's lifetime is silently lost.
> >> 
> > >> to_start_time chooses option 2 because its core promise is "never silently
> > >> lose data produced after the group started." If we fell back to latest on
> > >> out-of-range, we would break this guarantee.
> > >> 
> > >> I think users who prefer option 1 can simply use auto.offset.reset=latest.
> >> 
> >> Best Regards,
> >> Jiunn-Yang
> >> 
> > >>> On Apr 18, 2026, at 1:57 AM, Jun Rao via dev <[email protected]> wrote:
> >>> 
> >>> Hi, Jiunn-Yang and Chia-Ping,
> >>> 
> >>> Thanks for the reply.
> >>> 
> >>> "The core semantic of to_start_time is to read all records since the
> >>> creation of the group."
> >>> 
> > >>> I am just questioning whether this actually covers a common use case. If
> > >>> the offset doesn't go out of range, the logic makes sense to me. I'm not
> > >>> sure about the logic if the offset is out of range. If a user chooses to
> > >>> skip the historical data when starting the group, it seems the user likely
> > >>> wants to do the same if the offset is out of range.
> >>> 
> >>> Jun
> >>> 
> >>> On Fri, Apr 17, 2026 at 5:23 AM 黃竣陽 <[email protected]> wrote:
> >>> 
> >>>> Hello Jun,
> >>>> 
> > >>>> Thanks for the feedback.
> >>>> 
> >>>> Adding to the points above:
> >>>> 
> > >>>> Regarding by_duration as an alternative to Scenario 1: beyond clock skew
> > >>>> and retry issues, there is also a usability concern. by_duration requires
> > >>>> users to reason about operational timing ("how long does partition
> > >>>> discovery take in my environment?") and then translate that into a
> > >>>> configuration value. to_start_time requires no such reasoning. It simply
> > >>>> anchors to the group creation time recorded by the broker.
> >>>> 
> > >>>> Regarding Scenario 2: I'd also like to clarify that to_start_time does not
> > >>>> branch between "use latest" and "use earliest." It applies the same
> > >>>> ListOffsetsRequest with the group creation timestamp in all cases. The
> > >>>> difference in outcome:
> > >>>> - skipping old data on first start
> > >>>> - consuming surviving data after truncation
> > >>>> is a natural consequence of what data exists in the partition at that
> > >>>> point, not a different policy being applied. The rule is always the same.
> >>>> 
> >>>> Best Regards,
> >>>> Jiunn-Yang
> >>>> 
> > >>>>> On Apr 17, 2026, at 9:48 AM, Chia-Ping Tsai <[email protected]> wrote:
> >>>>> 
> >>>>> 
> > >>>>>> On Apr 17, 2026, at 4:57 AM, Jun Rao via dev <[email protected]> wrote:
> >>>>>> 
> > >>>>>> Also, a group is deleted after the consumer has been idle longer
> > >>>>>> than offsets.retention.minutes. What are the semantics of to_start_time
> > >>>>>> if the group creation time is unavailable?
> >>>>> 
> > >>>>> If the group is recreated, a new creation time will be recorded. Hence,
> > >>>>> it acts like a new group. Plus, it throws an exception directly if the
> > >>>>> group truly has no creation time.
> >>>> 
> >>>> 
> >> 
> >> 
> 
>
