Re: [DISCUSS] KIP-1282: Prevent data loss during partition expansion for dynamically added partitions

黃竣陽 Thu, 23 Apr 2026 04:17:41 -0700

Hello all,

Thanks for the feedback. I'd like to advocate for keeping the original 
to_start_time semantics.


Earlier in this thread, Jian, Ryan, and Chia-Ping all highlighted that
branched logic is the main UX concern:

Jian: "If we can define one basic rule… it would make it easier for
      everyone to stay on the same page."
Ryan: "The documentation might be difficult if it has to list and explain
      all the cases."
Chia-Ping: "Having an opinionated config with branched logic makes it hard
      to document and reason about."

to_start_time already follows this principle: it consistently issues a
ListOffsets request anchored to the group creation timestamp. Differences in
outcome are simply due to what data the broker retains, not to different rules
being applied. Changing out-of-range to latest would be the real
inconsistency, since the policy would then branch based on the reset scenario.
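
As a rough illustration of the "one rule, different outcomes" point, here is a
minimal Python sketch. It is a toy model, not Kafka code: resolve_by_timestamp
is a hypothetical helper standing in for a ListOffsets lookup by timestamp, the
offsets and timestamps are made up, and the fallback to the log end offset when
no retained record is new enough is an assumption made for this example.

```python
from bisect import bisect_left

def resolve_by_timestamp(records, log_end_offset, target_ts):
    """Toy stand-in for a ListOffsets lookup by timestamp (hypothetical helper).

    records: retained (offset, timestamp) pairs, sorted by offset, with
    non-decreasing timestamps. Returns the first retained offset whose
    timestamp is >= target_ts. Assumption: if no such record exists, fall
    back to the log end offset.
    """
    timestamps = [ts for _, ts in records]
    i = bisect_left(timestamps, target_ts)
    return records[i][0] if i < len(records) else log_end_offset

GROUP_CREATED_AT = 1_000  # hypothetical group creation timestamp

# First start: everything retained predates the group, so we land at the end.
old_partition = [(0, 100), (1, 200), (2, 300)]
first_start = resolve_by_timestamp(old_partition, 3, GROUP_CREATED_AT)

# Partition added after the group was created: every record is newer.
new_partition = [(0, 1_500), (1, 1_600)]
expansion = resolve_by_timestamp(new_partition, 2, GROUP_CREATED_AT)

# Out-of-range after truncation: offsets 0-4 are gone, survivors are newer.
truncated = [(5, 2_000), (6, 2_100)]
out_of_range = resolve_by_timestamp(truncated, 7, GROUP_CREATED_AT)

print(first_start, expansion, out_of_range)  # prints: 3 0 5
```

The same function and the same timestamp are used in all three calls; only the
retained records differ.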

Additionally, out-of-range and no-offset (group GC'd) are fundamentally
different situations. When the group exists, the creation timestamp is
available and should be honored. When the group is GC'd, the metadata is gone;
this is an orthogonal problem that affects all reset policies equally.
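
A small Python sketch of why these are separate questions. The names
anchor_timestamp and metadata are hypothetical, and the dict is only a
stand-in for whatever the coordinator persists; the recreated-group behavior
follows what was described earlier in this thread (a recreated group records a
new creation time and acts like a new group).

```python
def anchor_timestamp(group_metadata, group_id, now):
    """Pick the timestamp a to_start_time-style reset would seek to.

    group_metadata: dict mapping group id -> creation timestamp
    (hypothetical stand-in for persisted group metadata).
    If the group was GC'd its entry is gone, so it is recreated with a
    creation time of `now` and behaves like a brand-new group.
    """
    if group_id not in group_metadata:
        group_metadata[group_id] = now  # recreation records a new creation time
    return group_metadata[group_id]

metadata = {"orders": 1_000}  # made-up group id and creation timestamp

# Group still exists: an out-of-range reset keeps the original anchor.
existing = anchor_timestamp(metadata, "orders", now=9_000)

# Group was GC'd: the old anchor cannot be recovered, whatever the policy.
del metadata["orders"]
after_gc = anchor_timestamp(metadata, "orders", now=9_000)

print(existing, after_gc)  # prints: 1000 9000
```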

The strength of to_start_time is precisely its single, clean rule: "Always
seek to the group's creation time, and let ListOffsets resolve the rest."

Best Regards,
Jiunn-Yang

> On Apr 23, 2026, at 3:24 PM, Chia-Ping Tsai <[email protected]> wrote:
> 
> Hi all,
> 
> BTW, regardless of where we land on the "out-of-range" debate, the underlying 
> infrastructure of persisting the "group creation time" is still highly 
> valuable and worth merging.
> 
> From my conversations with users, there are diverse needs: some love the 
> "better earliest" idea to safely skip massive historical backlogs, while 
> others only care about fixing the data loss in latest during partition 
> expansion.
> 
> Simply having the creation time persisted and exposed is already a massive 
> step forward, as it gives users a reliable, objective anchor to manually fix 
> the issue via a ConsumerRebalanceListener. However, much like a DLQ (Dead
> Letter Queue), which users could also implement manually, a built-in reset
> policy makes the developer experience significantly more convenient and
> robust out of the box.
> 
> I believe Ken might chime in later with a different perspective as well :)
> 
> Best,
> Chia-Ping
> 
> 
>> On Apr 23, 2026, at 3:59 AM, Chia-Ping Tsai <[email protected]> wrote:
>> 
>> Hi Jun,
>> 
>> Thanks for the feedback. I agree that shifting this policy toward a "Smarter 
>> Latest" (rather than a better Earliest) is a more elegant path.
>> 
>> The refined behavior would be:
>> 
>> Out-of-range: Strictly follow latest semantics. This ensures a predictable 
>> "skip to end" behavior when users fall behind retention.
>> 
>> No-offset (Initial Start & Expansion): Leverage Group Creation Time for 
>> lookup.
>> 
>> • For new groups, this naturally results in latest behavior since creation 
>> time is "now".
>> 
>> • For existing groups discovering new partitions, this results in earliest 
>> behavior for those specific partitions.
>> 
>> Group GC: If a group is purged, it is treated as a brand-new group with a 
>> creation time of "now," consistently skipping to the end.
>> 
>> WDYT?
>> 
>> 
>>> On Apr 23, 2026, at 1:34 AM, Jun Rao via dev <[email protected]> wrote:
>>> 
>>> Hi, Chia-Ping,
>>> 
>>> Thanks for the reply.
>>> 
>>> Let's try to understand from the user's perspective. When the user starts
>>> the group for the first time, it faces a choice on whether to process the
>>> backlog or not. When the offset is out-of-range, the user faces the same
>>> choice regarding backlog processing. It seems that most users want to make
>>> the same choice regarding backlog processing.
>>> 
>>> "Users who explicitly choose the to_start_time policy do so precisely
>>> because they do not want to skip any records when encountering an
>>> out-of-range scenario."
>>> This argument is weak because that's how to_start_time is designed, but we
>>> need to justify why it is a good choice in the first place.
>>> 
>>> Jun
>>> 
>>>>> On Tue, Apr 21, 2026 at 12:35 PM Chia-Ping Tsai <[email protected]> 
>>>>> wrote:
>>>> 
>>>> Hi Jun,
>>>> 
>>>> Thanks for the clarification. I think I misunderstood your previous point.
>>>> Let me summarize the scenarios to ensure we are fully aligned.
>>>> 
>>>> There are essentially three scenarios when a consumer needs to reset
>>>> offsets:
>>>> 
>>>> 1. Out-of-range (The group exists, but the offset has expired).
>>>> 2. Extended partition (The group exists, but encounters a newly added
>>>>    partition with no committed offset).
>>>> 3. No-offset (The group is completely new, or an existing group was
>>>>    deleted by the GC).
>>>> 
>>>> We all agree that the primary goal of this KIP is to catch up on all
>>>> records for scenario 2. There are no objections here.
>>>> 
>>>> Regarding the inconsistency you pointed out between 1) and 3) under the
>>>> current to_start_time design, I completely see your point. If users are
>>>> not fully aware that to_start_time is designed to read all records since
>>>> the creation of the group, they might get confused.
>>>> 
>>>> However, to me, this "inconsistency" is actually a matter of
>>>> predictability. Users who explicitly choose the to_start_time policy do
>>>> so precisely because they do not want to skip any records when encountering
>>>> an out-of-range scenario.
>>>> 
>>>> (I would prefer to set aside the topic of group GC for a moment. It is
>>>> much more important that we first focus our discussion on the
>>>> "out-of-range" scenario)
>>>> 
>>>> Best,
>>>> 
>>>> Chia-Ping
>>>> 
>>>> On Wed, Apr 22, 2026 at 1:13 AM Jun Rao via dev <[email protected]> wrote:
>>>> 
>>>>> Hi, Chia-Ping,
>>>>> 
>>>>> Hmm, is that true? With the earliest policy, we treat an out-of-range
>>>>> offset the same as no offset (because the group is deleted) and always set
>>>>> it to the earliest offset, right? With to_start_time, an out-of-range
>>>>> offset is treated differently from no offset.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Jun
>>>>> 
>>>>> On Tue, Apr 21, 2026 at 12:54 AM Chia-Ping Tsai <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> hi Jun
>>>>>> 
>>>>>> Nice point. Group GC is definitely an issue for to_start_time, but it is
>>>>>> actually an issue for other policies as well.
>>>>>> 
>>>>>> For example, a consumer using the earliest policy will suddenly read all
>>>>>> historical records from scratch if it sleeps for a long while and gets
>>>>>> GC'd; otherwise, it just resumes from previous offsets if the group
>>>>>> still exists. It is equally hard to explain to users: "Oh, your group
>>>>>> was GC'd, so your offset behavior changed."
>>>>>> 
>>>>>> Therefore, it seems to me the right approach to fix this "inconsistency"
>>>>>> is to offer a group-level GC timeout in a future KIP, allowing users to
>>>>>> explicitly protect critical groups from GC. This saves not only
>>>>>> to_start_time, but all other reset policies too.
>>>>>> 
>>>>>> Best,
>>>>>> Chia-Ping
>>>>>> 
>>>>>> On 2026/04/20 20:19:47 Jun Rao via dev wrote:
>>>>>>> Hi, Jiunn-Yang and Chia-Ping,
>>>>>>> 
>>>>>>> Thanks for the reply.
>>>>>>> 
>>>>>>> The main concern I see with to_start_time is that its behavior on how
>>>>>>> much data to consume when the offset is out of range is not consistent
>>>>>>> and is hard to explain. If the group still exists, it will read from
>>>>>>> the earliest offset. Otherwise, it will read from the latest.
>>>>>>> 
>>>>>>> Jun
>>>>>>> 
>>>>>>> On Mon, Apr 20, 2026 at 10:13 AM Chia-Ping Tsai <[email protected]> wrote:
>>>>>>> 
>>>>>>>> hi all,
>>>>>>>> 
>>>>>>>> Just a note for a potential latest_v2:
>>>>>>>> 
>>>>>>>> Since the purpose is to read all records from extended partitions, we
>>>>>>>> could leverage the group creation time to compare against the earliest
>>>>>>>> record of a partition when there is no committed offset. If the group
>>>>>>>> creation time is larger than the earliest record's timestamp, we assume
>>>>>>>> it is not an extended partition. Otherwise, we treat it as an extended
>>>>>>>> partition.
>>>>>>>> 
>>>>>>>> This approach allows us to catch all "possible" extended partitions,
>>>>>>>> which includes both "true" extended partitions and old but truncated
>>>>>>>> partitions. While there is a rare edge case where the cost is
>>>>>>>> reprocessing some records we don't necessarily want, it is very easy to
>>>>>>>> implement and guarantees we will never miss the actual extended
>>>>>>>> partitions.
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Chia-Ping
>>>>>>>> 
>>>>>>>> On 2026/04/20 13:33:31 黃竣陽 wrote:
>>>>>>>>> Hello all,
>>>>>>>>> 
>>>>>>>>> I have added a new "Future Work: latest_strict Policy" section to the
>>>>>>>>> KIP. The idea is a future policy that uses latest semantics by default
>>>>>>>>> but falls back to the group creation timestamp specifically for newly
>>>>>>>>> added partitions during partition expansion. This would reuse the
>>>>>>>>> group creation time anchor introduced by this KIP, making it a natural
>>>>>>>>> extension with minimal additional protocol changes.
>>>>>>>>> 
>>>>>>>>> Best Regards,
>>>>>>>>> Jiunn-Yang
>>>>>>>>> 
>>>>>>>>>> On Apr 18, 2026, at 4:09 PM, Chia-Ping Tsai <[email protected]> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi all,
>>>>>>>>>> 
>>>>>>>>>> It is practically NP-hard to guess everyone's ideal use case right
>>>>>>>>>> now. Also, I believe we all want to avoid falling back to the
>>>>>>>>>> intricate multi-policy approach proposed in KIP-842.
>>>>>>>>>> 
>>>>>>>>>> I prefer to keep this KIP focused and discuss a "v2 latest" policy in
>>>>>>>>>> a separate KIP. That future policy could build upon the to_start_time
>>>>>>>>>> anchor to fix data loss specifically for extended partitions. We
>>>>>>>>>> could call it something like latest_strict.
>>>>>>>>>> 
>>>>>>>>>> Thoughts?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Sat, Apr 18, 2026 at 3:24 PM 黃竣陽 <[email protected]> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hello Jun,
>>>>>>>>>>> 
>>>>>>>>>>> Thanks for the reply.
>>>>>>>>>>> 
>>>>>>>>>>> When the offset goes out of range, the user faces two options:
>>>>>>>>>>> 
>>>>>>>>>>> 1. Skip to the end (latest behavior): risk losing data that was
>>>>>>>>>>>    produced during the group's lifetime but not yet consumed.
>>>>>>>>>>> 2. Seek back to the group creation time (to_start_time behavior):
>>>>>>>>>>>    potentially reprocess some data, but guarantee no data from the
>>>>>>>>>>>    group's lifetime is silently lost.
>>>>>>>>>>> 
>>>>>>>>>>> to_start_time chooses option 2 because its core promise is "never
>>>>>>>>>>> silently lose data produced after the group started." If we fell back
>>>>>>>>>>> to latest on out-of-range, we would break this guarantee.
>>>>>>>>>>> 
>>>>>>>>>>> Users who prefer option 1 can simply use auto.offset.reset=latest.
>>>>>>>>>>> 
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> Jiunn-Yang
>>>>>>>>>>> 
>>>>>>>>>>>> On Apr 18, 2026, at 1:57 AM, Jun Rao via dev <[email protected]> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi, Jiunn-Yang and Chia-Ping,
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks for the reply.
>>>>>>>>>>>> 
>>>>>>>>>>>> "The core semantic of to_start_time is to read all records since the
>>>>>>>>>>>> creation of the group."
>>>>>>>>>>>> 
>>>>>>>>>>>> I am just questioning whether this actually covers a common use case.
>>>>>>>>>>>> If the offset doesn't go out of range, the logic makes sense to me.
>>>>>>>>>>>> I'm not sure about the logic if the offset is out of range. If a user
>>>>>>>>>>>> chooses to skip the historical data when starting the group, it seems
>>>>>>>>>>>> the user likely wants to do the same if the offset is out of range.
>>>>>>>>>>>> 
>>>>>>>>>>>> Jun
>>>>>>>>>>>> 
>>>>>>>>>>>> On Fri, Apr 17, 2026 at 5:23 AM 黃竣陽 <[email protected]> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hello Jun,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks for the feedback.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Adding to the points above:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Regarding by_duration as an alternative to Scenario 1: beyond clock
>>>>>>>>>>>>> skew and retry issues, there is also a usability concern.
>>>>>>>>>>>>> by_duration requires users to reason about operational timing ("how
>>>>>>>>>>>>> long does partition discovery take in my environment?") and then
>>>>>>>>>>>>> translate that into a configuration value. to_start_time requires no
>>>>>>>>>>>>> such reasoning. It simply anchors to the group creation time
>>>>>>>>>>>>> recorded by the broker.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Regarding Scenario 2: I'd also like to clarify that to_start_time
>>>>>>>>>>>>> does not branch between "use latest" and "use earliest." It applies
>>>>>>>>>>>>> the same ListOffsetsRequest with the group creation timestamp in all
>>>>>>>>>>>>> cases. The difference in outcome:
>>>>>>>>>>>>> - skipping old data on first start
>>>>>>>>>>>>> - consuming surviving data after truncation
>>>>>>>>>>>>> is a natural consequence of what data exists in the partition at
>>>>>>>>>>>>> that point, not a different policy being applied. The rule is always
>>>>>>>>>>>>> the same.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>> Jiunn-Yang
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Apr 17, 2026, at 9:48 AM, Chia-Ping Tsai <[email protected]> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Apr 17, 2026, at 4:57 AM, Jun Rao via dev <[email protected]> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Also, a group is deleted after the consumer has been idle longer
>>>>>>>>>>>>>>> than offsets.retention.minutes. What's the semantic of
>>>>>>>>>>>>>>> to_start_time if the group creation time is unavailable?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> If the group is recreated, a new creation time will be recorded.
>>>>>>>>>>>>>> Hence, it acts like a new group. Plus, it throws an exception
>>>>>>>>>>>>>> directly if the group truly has no creation time.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>
