Re: [DISCUSS] KIP-1282: Prevent data loss during partition expansion for dynamically added partitions

黃竣陽 Tue, 28 Apr 2026 03:54:27 -0700

Hello all, 

I've created a new KIP to introduce this config. Here is the discussion thread:
<https://lists.apache.org/thread/bp4zk31zr1sdxjsspg7b7bqddmm9t4gn>


Feedback and comments are very welcome!

Best Regards,
Jiunn-Yang

> On Apr 28, 2026, at 12:49 AM, Jun Rao via dev <[email protected]> wrote:
> 
> Hi, Chia-Ping,
> 
> Thanks for the reply.
> 
> I agree that we should only add new options that cover a common use case.
> For auto.offset.reset.latest.max.age, it would be useful to compare it with
> by_duration.
> 
> Jun
> 
> On Sat, Apr 25, 2026 at 4:30 AM Chia-Ping Tsai <[email protected]> wrote:
> 
>> hi Jun
>> 
>> Honestly, we've seen a similar "case storm" in our local community
>> discussions. Some feel a new policy could revolutionize existing pipelines,
>> while others find it overly complicated to mentally juggle all these offset
>> edge cases.
>> 
>> I also realize that introducing a completely new policy just to overcome
>> the "data loss on partition expansion" issue might be a bit overkill for
>> now. We can always revisit a brand-new policy later.
>> 
>> For now, I'd like to pivot back to the original pain point: how to avoid
>> losing "hot" records from newly expanded partitions when using the latest
>> policy. The tricky part is that expanded partitions aren't always "hot" to
>> consumers. For instance, if a partition is expanded while the consumer is
>> offline for a long period, the user would likely prefer to skip to the end
>> upon resuming, as those records are no longer fresh.
>> 
>> Therefore, I'd like to propose a new consumer config:
>> auto.offset.reset.latest.max.age (inspired by Ryan's discussion). When a
>> consumer uses the latest policy, it relies on this threshold to decide
>> how to handle partitions without a committed offset. If the partition's
>> "age" is within the threshold (i.e., it's a recently expanded partition),
>> the consumer falls back to earliest to catch the hot data. If the age
>> exceeds the threshold, or is unavailable (e.g., older broker versions),
>> the consumer strictly adheres to latest.
>> 
>> This partition "age" could be returned via the consumer heartbeat. The age
>> would be calculated server-side by the coordinator: coordinator's current
>> time - partition creation time. This inherently means we would need to
>> modify the partition records to store the creation time, as well as update
>> the heartbeat RPC to pass this relative age.
>> 
>> We plan to draft a separate KIP for auto.offset.reset.latest.max.age and
>> start a new thread for it to keep things focused. We can leave this current
>> thread open for any broader discussions on completely new policies.
>> 
>> Any feedback on this new direction is highly welcome. Thanks everyone for
>> the incredible brainstorming session!
>> 
>> Best,
>> Chia-Ping
>> 
>> On 2026/04/23 20:44:41 Jun Rao via dev wrote:
>>> Hi, Chia-Ping,
>>> 
>>> Thanks for the reply.
>>> 
>>> "read all records produced since the group's birth."
>>> Let's consider this requirement a bit more. For CDC use cases, users don't
>>> want to lose any data. The easiest option is to consume data with the
>>> earliest offset. Sometimes, there are good reasons to skip the backlog. For
>>> example, the downstream system already obtains a database snapshot through
>>> another channel. However, in this case, the user usually needs to set the
>>> initial offsets carefully to match the snapshot's timestamp and avoid data
>>> loss. Starting from the group creation time doesn't seem to meet the
>>> business need in this case.
>>> 
>>> Jun
>>> 
>>> 
>>> On Thu, Apr 23, 2026 at 11:49 AM Chia-Ping Tsai <[email protected]> wrote:
>>> 
>>>> hi Jun
>>>> 
>>>>> This seems to fit the current auto.offset.reset framework more naturally.
>>>> 
>>>> Your point about the existing framework is well-taken, but it highlights a
>>>> key distinction this KIP aims to address.
>>>> 
>>>> If a user simply wants a "Smarter Latest" (one that avoids data loss from
>>>> extended partitions), they could indeed use by_duration=5mins as a
>>>> reasonable workaround.
>>>> 
>>>> However, there is currently no workaround for a policy that guarantees
>>>> "read all records produced since the group's birth." This is a critical
>>>> requirement for data pipelines like OLTP (MySQL/Postgres) -> Kafka -> OLAP
>>>> (ClickHouse/Snowflake). These users often use latest initially to avoid a
>>>> massive historical backlog, but they have a "Zero Data Loss" requirement
>>>> once the pipeline is active.
>>>> 
>>>> When these users encounter an "out-of-range" error, they want to consume
>>>> every surviving record in Kafka that belongs to their group's lifetime. If
>>>> we force them to jump to the end, they have to manually reload and
>>>> backfill significantly more "lost records" from the source OLTP, which
>>>> is a high-cost operational burden.
>>>> 
>>>> In short, the policy offered by this KIP is not just another option; it
>>>> provides a deterministic lifecycle anchor that cannot be emulated by the
>>>> current policies.
>>>> 
>>>> On Fri, Apr 24, 2026 at 1:38 AM Jun Rao via dev <[email protected]> wrote:
>>>> 
>>>>> Hi, Chia-Ping, Jiunn-Yang, and Jian,
>>>>> 
>>>>> Thanks for the reply. I appreciate your effort in trying to address a
>>>>> common issue.
>>>>> 
>>>>> To me, history and data are the same as the backlog. It's just that the
>>>>> amount of backlog can vary. When the group is first created or when the
>>>>> offset is out of range, the backlog is large. When a new partition is
>>>>> created and discovered by the consumer, the backlog is small (5 seconds of
>>>>> data for the new consumer, 5 minutes for the classic consumer). The
>>>>> question is how much backlog a user can tolerate. The to_start_time option
>>>>> implicitly assumes that a user can tolerate 0 backlog in one case but 5
>>>>> seconds or 5 minutes in another. This may or may not be what a user wants,
>>>>> but at least it seems inconsistent. An alternative is to document all
>>>>> cases where a backlog can occur and let the user choose how much backlog
>>>>> they can tolerate, configuring it with the existing by_duration option.
>>>>> This seems to fit the current auto.offset.reset framework more naturally.
>>>>> 
>>>>> Jun
>>>>> 
>>>>> 
>>>>> On Thu, Apr 23, 2026 at 6:23 AM jian fu <[email protected]> wrote:
>>>>> 
>>>>>> Hi All:
>>>>>> 
>>>>>> Since Jun Yang referenced my earlier discussion, I'd also like to join in
>>>>>> and share some of my thoughts.
>>>>>> 
>>>>>> The main point of (minor) divergence is how to handle this case:
>>>>>> "When the user starts the group for the first time, it faces a choice on
>>>>>> whether to process the backlog or not. When the offset is out-of-range,
>>>>>> the user faces the same choice regarding backlog processing."
>>>>>> 
>>>>>> So I think there are four options for handling these two key choices:
>>>>>> 1. latest: drop history + drop the data
>>>>>> 2. earliest: keep history + keep the data
>>>>>> 3. the mode this KIP proposes: drop history + keep the data
>>>>>> 4. unreasonable mode: keep history + drop the data
>>>>>> 
>>>>>> I think option 3 is a reasonable mode for users (setting aside naming and
>>>>>> implementation). Consider a real-life analogy: you may subscribe to a
>>>>>> magazine without buying the back issues, but you must not miss any issue
>>>>>> published after you subscribe just because you skipped the back issues.
>>>>>> 
>>>>>> Regards
>>>>>> Jian
>>>>>> 
>>>>>> 
>>>>>> On Thu, Apr 23, 2026 at 7:17 PM 黃竣陽 <[email protected]> wrote:
>>>>>> 
>>>>>>> Hello all,
>>>>>>> 
>>>>>>> Thanks for the feedback. I'd like to advocate for keeping the original
>>>>>>> to_start_time semantics.
>>>>>>> 
>>>>>>> Earlier in this thread, both Jian and Ryan highlighted that branched
>>>>>>> logic is the main UX concern:
>>>>>>> 
>>>>>>> Jian: "If we can define one basic rule… it would make it easier for
>>>>>>>        everyone to stay on the same page."
>>>>>>> Ryan: "The documentation might be difficult if it has to
>>>>>>>        list and explain all the cases."
>>>>>>> Chia-Ping: "Having an opinionated config with branched logic makes it
>>>>>>>        hard to document and reason about."
>>>>>>> 
>>>>>>> to_start_time already follows this principle: it consistently issues a
>>>>>>> ListOffsets request anchored to the group creation timestamp. Differences
>>>>>>> in outcome are simply due to what data the broker retains, not different
>>>>>>> rules being applied. Changing out-of-range to latest would be the real
>>>>>>> inconsistency, since the policy would then branch based on the reset
>>>>>>> scenario.
>>>>>>> 
>>>>>>> Additionally, out-of-range and no-offset (group GC'd) are fundamentally
>>>>>>> different situations. When the group exists, the creation timestamp is
>>>>>>> available and should be honored. When the group is GC'd, the metadata is
>>>>>>> gone; this is an orthogonal problem that affects all reset policies
>>>>>>> equally.
>>>>>>> 
>>>>>>> The strength of to_start_time is precisely its single, clean rule:
>>>>>>> "Always seek to the group's creation time, and let ListOffsets resolve
>>>>>>> the rest."
>>>>>>> 
>>>>>>> Best Regards,
>>>>>>> Jiunn-Yang
>>>>>>> 
>>>>>>>> On Apr 23, 2026, at 3:24 PM, Chia-Ping Tsai <[email protected]> wrote:
>>>>>>>> 
>>>>>>>> Hi all,
>>>>>>>> 
>>>>>>>> BTW, regardless of where we land on the "out-of-range" debate, the
>>>>>>>> underlying infrastructure of persisting the "group creation time" is
>>>>>>>> still highly valuable and worth merging.
>>>>>>>> 
>>>>>>>> From my conversations with users, there are diverse needs: some love
>>>>>>>> the "better earliest" idea to safely skip massive historical backlogs,
>>>>>>>> while others only care about fixing the data loss in latest during
>>>>>>>> partition expansion.
>>>>>>>> 
>>>>>>>> Simply having the creation time persisted and exposed is already a
>>>>>>>> massive step forward, as it gives users a reliable, objective anchor
>>>>>>>> to manually fix the issue via a ConsumerRebalanceListener. However,
>>>>>>>> much like the concept of a DLQ (Dead Letter Queue), while users could
>>>>>>>> implement it manually, providing a built-in reset policy makes the
>>>>>>>> developer experience significantly more convenient, robust, and
>>>>>>>> out-of-the-box.
>>>>>>>> 
>>>>>>>> I believe Ken might chime in later with a different perspective as
>>>>>>>> well :)
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Chia-Ping
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Apr 23, 2026, at 3:59 AM, Chia-Ping Tsai <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Jun,
>>>>>>>>> 
>>>>>>>>> Thanks for the feedback. I agree that shifting this policy toward a
>>>>>>>>> "Smarter Latest" (rather than a better Earliest) is a more elegant
>>>>>>>>> path.
>>>>>>>>> 
>>>>>>>>> The refined behavior would be:
>>>>>>>>> 
>>>>>>>>> Out-of-range: Strictly follow latest semantics. This ensures a
>>>>>>>>> predictable "skip to end" behavior when users fall behind retention.
>>>>>>>>> 
>>>>>>>>> No-offset (Initial Start & Expansion): Leverage Group Creation Time
>>>>>>>>> for lookup.
>>>>>>>>> 
>>>>>>>>> • For new groups, this naturally results in latest behavior since
>>>>>>>>> creation time is "now".
>>>>>>>>> 
>>>>>>>>> • For existing groups discovering new partitions, this results in
>>>>>>>>> earliest behavior for those specific partitions.
>>>>>>>>> 
>>>>>>>>> Group GC: If a group is purged, it is treated as a brand-new group
>>>>>>>>> with a creation time of "now," consistently skipping to the end.
>>>>>>>>> 
>>>>>>>>> WDYT?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Apr 23, 2026, at 1:34 AM, Jun Rao via dev <[email protected]> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi, Chia-Ping,
>>>>>>>>>> 
>>>>>>>>>> Thanks for the reply.
>>>>>>>>>> 
>>>>>>>>>> Let's try to understand from the user's perspective. When the user
>>>>>>>>>> starts the group for the first time, it faces a choice on whether to
>>>>>>>>>> process the backlog or not. When the offset is out-of-range, the user
>>>>>>>>>> faces the same choice regarding backlog processing. It seems that
>>>>>>>>>> most users want to make the same choice regarding backlog processing.
>>>>>>>>>> 
>>>>>>>>>> "Users who explicitly choose the to_start_time policy do so precisely
>>>>>>>>>> because they do not want to skip any records when encountering an
>>>>>>>>>> out-of-range scenario."
>>>>>>>>>> This argument is weak because that's how to_start_time is designed,
>>>>>>>>>> but we need to justify why it is a good choice in the first place.
>>>>>>>>>> 
>>>>>>>>>> Jun
>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Apr 21, 2026 at 12:35 PM Chia-Ping Tsai <[email protected]> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi Jun,
>>>>>>>>>>> 
>>>>>>>>>>> Thanks for the clarification. I think I misunderstood your previous
>>>>>>>>>>> point. Let me summarize the scenarios to ensure we are fully aligned.
>>>>>>>>>>> 
>>>>>>>>>>> There are essentially three scenarios when a consumer needs to reset
>>>>>>>>>>> offsets:
>>>>>>>>>>> 
>>>>>>>>>>> 1. Out-of-range (The group exists, but the offset has expired).
>>>>>>>>>>> 2. Extended partition (The group exists, but encounters a newly added
>>>>>>>>>>>    partition with no committed offset).
>>>>>>>>>>> 3. No-offset (The group is completely new, or an existing group was
>>>>>>>>>>>    deleted by the GC).
>>>>>>>>>>> 
>>>>>>>>>>> We all agree that the primary goal of this KIP is to catch up on all
>>>>>>>>>>> records for scenario 2. There are no objections here.
>>>>>>>>>>> 
>>>>>>>>>>> Regarding the inconsistency you pointed out between 1) and 3) under
>>>>>>>>>>> the current to_start_time design, I completely see your point. If
>>>>>>>>>>> users are not fully aware that to_start_time is designed to read all
>>>>>>>>>>> records since the creation of the group, they might get confused.
>>>>>>>>>>> 
>>>>>>>>>>> However, to me, this "inconsistency" is actually a matter of
>>>>>>>>>>> predictability. Users who explicitly choose the to_start_time policy
>>>>>>>>>>> do so precisely because they do not want to skip any records when
>>>>>>>>>>> encountering an out-of-range scenario.
>>>>>>>>>>> 
>>>>>>>>>>> (I would prefer to set aside the topic of group GC for a moment. It
>>>>>>>>>>> is much more important that we first focus our discussion on the
>>>>>>>>>>> "out-of-range" scenario)
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> 
>>>>>>>>>>> Chia-Ping
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Apr 22, 2026 at 1:13 AM Jun Rao via dev <[email protected]> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi, Chia-Ping,
>>>>>>>>>>>> 
>>>>>>>>>>>> Hmm, is that true? With the earliest policy, we treat an out-of-range
>>>>>>>>>>>> offset the same as no offset (because the group is deleted) and
>>>>>>>>>>>> always set it to the earliest offset, right? With to_start_time, an
>>>>>>>>>>>> out-of-range offset is treated differently from no offset.
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> 
>>>>>>>>>>>> Jun
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Apr 21, 2026 at 12:54 AM Chia-Ping Tsai <[email protected]> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> hi Jun
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Nice point. Group GC is definitely an issue for to_start_time, but
>>>>>>>>>>>>> it is actually an issue for other policies as well.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> For example, a consumer using the earliest policy will suddenly
>>>>>>>>>>>>> read all historical records from scratch if it sleeps for a long
>>>>>>>>>>>>> while and gets GC'd; otherwise, it just resumes from previous
>>>>>>>>>>>>> offsets if the group still exists. It is equally hard to explain to
>>>>>>>>>>>>> users: "Oh, your group was GC'd, so your offset behavior changed."
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Therefore, it seems to me the right approach to fix this
>>>>>>>>>>>>> "inconsistency" is to offer a group-level GC timeout in a future
>>>>>>>>>>>>> KIP, allowing users to explicitly protect critical groups from GC.
>>>>>>>>>>>>> This saves not only to_start_time, but all other reset policies too.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Chia-Ping
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On 2026/04/20 20:19:47 Jun Rao via dev wrote:
>>>>>>>>>>>>>> Hi, Jiunn-Yang and Chia-Ping,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks for the reply.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The main concern I see with to_start_time is that its behavior on
>>>>>>>>>>>>>> how much data to consume when the offset is out of range is not
>>>>>>>>>>>>>> consistent and is hard to explain. If the group still exists, it
>>>>>>>>>>>>>> will read from the earliest offset. Otherwise, it will read from
>>>>>>>>>>>>>> the latest.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Jun
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Mon, Apr 20, 2026 at 10:13 AM Chia-Ping Tsai <[email protected]> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> hi all,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Just a note for a potential latest_v2:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Since the purpose is to read all records from extended partitions,
>>>>>>>>>>>>>>> we could leverage the group creation time to compare against the
>>>>>>>>>>>>>>> earliest record of a partition when there is no committed offset.
>>>>>>>>>>>>>>> If the group creation time is larger than the earliest record's
>>>>>>>>>>>>>>> timestamp, we assume it is not an extended partition. Otherwise, we
>>>>>>>>>>>>>>> treat it as an extended partition.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> This approach allows us to catch all "possible" extended
>>>>>>>>>>>>>>> partitions, which includes both "true" extended partitions and old
>>>>>>>>>>>>>>> but truncated partitions. While there is a rare edge case where the
>>>>>>>>>>>>>>> cost is reprocessing some records we don't necessarily want, it is
>>>>>>>>>>>>>>> very easy to implement and guarantees we will never miss the actual
>>>>>>>>>>>>>>> extended partitions.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Chia-Ping
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 2026/04/20 13:33:31 黃竣陽 wrote:
>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I have added a new "Future Work: latest_strict Policy" section to
>>>>>>>>>>>>>>>> the KIP. The idea is a future policy that uses latest semantics by
>>>>>>>>>>>>>>>> default but falls back to the group creation timestamp
>>>>>>>>>>>>>>>> specifically for newly added partitions during partition
>>>>>>>>>>>>>>>> expansion. This would reuse the group creation time anchor
>>>>>>>>>>>>>>>> introduced by this KIP, making it a natural extension with minimal
>>>>>>>>>>>>>>>> additional protocol changes.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>> Jiunn-Yang
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Apr 18, 2026, at 4:09 PM, Chia-Ping Tsai <[email protected]> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> It is practically NP-hard to guess everyone's ideal use case right
>>>>>>>>>>>>>>>>> now. Also, I believe we all want to avoid falling back to the
>>>>>>>>>>>>>>>>> intricate multi-policy approach proposed in KIP-842.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I prefer to keep this KIP focused and discuss a "v2 latest" policy
>>>>>>>>>>>>>>>>> in a separate KIP. That future policy could build upon the
>>>>>>>>>>>>>>>>> to_start_time anchor to fix data loss specifically for extended
>>>>>>>>>>>>>>>>> partitions. We could call it something like latest_strict.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Sat, Apr 18, 2026 at 3:24 PM 黃竣陽 <[email protected]> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hello Jun,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks for the reply.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> When the offset goes out of range, the user faces two options:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 1. Skip to the end (latest behavior): risk losing data that was
>>>>>>>>>>>>>>>>>>    produced during the group's lifetime but not yet consumed.
>>>>>>>>>>>>>>>>>> 2. Seek back to the group creation time (to_start_time behavior):
>>>>>>>>>>>>>>>>>>    potentially reprocess some data, but guarantee no data from the
>>>>>>>>>>>>>>>>>>    group's lifetime is silently lost.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> to_start_time chooses option 2 because its core promise is "never
>>>>>>>>>>>>>>>>>> silently lose data produced after the group started." If we fell
>>>>>>>>>>>>>>>>>> back to latest on out-of-range, we would break this guarantee.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Users who prefer option 1 can simply use auto.offset.reset=latest.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>>>> Jiunn-Yang
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Apr 18, 2026, at 1:57 AM, Jun Rao via dev <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi, Jiunn-Yang and Chia-Ping,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks for the reply.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> "The core semantic of to_start_time is to read all records since
>>>>>>>>>>>>>>>>>>> the creation of the group."
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I am just questioning whether this actually covers a common use
>>>>>>>>>>>>>>>>>>> case. If the offset doesn't go out of range, the logic makes sense
>>>>>>>>>>>>>>>>>>> to me. I'm not sure about the logic if the offset is out of range.
>>>>>>>>>>>>>>>>>>> If a user chooses to skip the historical data when starting the
>>>>>>>>>>>>>>>>>>> group, it seems the user likely wants to do the same if the offset
>>>>>>>>>>>>>>>>>>> is out of range.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Jun
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Fri, Apr 17, 2026 at 5:23 AM 黃竣陽 <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hello Jun,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks for the feedback.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Adding to the points above:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Regarding by_duration as an alternative to Scenario 1: beyond clock
>>>>>>>>>>>>>>>>>>>> skew and retry issues, there is also a usability concern.
>>>>>>>>>>>>>>>>>>>> by_duration requires users to reason about operational timing
>>>>>>>>>>>>>>>>>>>> ("how long does partition discovery take in my environment?") and
>>>>>>>>>>>>>>>>>>>> then translate that into a configuration value. to_start_time
>>>>>>>>>>>>>>>>>>>> requires no such reasoning. It simply anchors to the group
>>>>>>>>>>>>>>>>>>>> creation time recorded by the broker.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Regarding Scenario 2: I'd also like to clarify that to_start_time
>>>>>>>>>>>>>>>>>>>> does not branch between "use latest" and "use earliest." It applies
>>>>>>>>>>>>>>>>>>>> the same ListOffsetsRequest with the group creation timestamp in
>>>>>>>>>>>>>>>>>>>> all cases. The difference in outcome:
>>>>>>>>>>>>>>>>>>>> - skipping old data on first start
>>>>>>>>>>>>>>>>>>>> - consuming surviving data after truncation
>>>>>>>>>>>>>>>>>>>> is a natural consequence of what data exists in the partition at
>>>>>>>>>>>>>>>>>>>> that point, not a different policy being applied. The rule is
>>>>>>>>>>>>>>>>>>>> always the same.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>>>>>> Jiunn-Yang
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Apr 17, 2026, at 9:48 AM, Chia-Ping Tsai <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Apr 17, 2026, at 4:57 AM, Jun Rao via dev <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Also, a group is deleted after the consumer has been idle longer
>>>>>>>>>>>>>>>>>>>>>> than offsets.retention.minutes. What's the semantic of
>>>>>>>>>>>>>>>>>>>>>> to_start_time if the group creation time is unavailable?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> If the group is recreated, a new creation time will be recorded.
>>>>>>>>>>>>>>>>>>>>> Hence, it acts like a new group. Plus, it throws an exception
>>>>>>>>>>>>>>>>>>>>> directly if the group truly has no creation time.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>
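
For readers following the thread, the decision rule that Chia-Ping's
2026/04/25 message describes for auto.offset.reset.latest.max.age might be
sketched roughly as below. This is only an illustration, not code from the
KIP or from Kafka: the function and enum names are hypothetical, treating an
age exactly equal to the threshold as "within" is an assumption the thread
does not settle, and clamping a negative age to zero is likewise an assumption.

```python
from enum import Enum
from typing import Optional


class ResetTarget(Enum):
    EARLIEST = "earliest"
    LATEST = "latest"


def compute_partition_age_ms(coordinator_now_ms: int, partition_created_ms: int) -> int:
    """Age as described in the thread: coordinator's current time minus the
    partition creation time. Clamping at zero is an assumption of this sketch."""
    return max(0, coordinator_now_ms - partition_created_ms)


def resolve_reset_target(partition_age_ms: Optional[int], max_age_ms: int) -> ResetTarget:
    """Pick the reset target for a partition with no committed offset when the
    consumer uses the latest policy (hypothetical helper)."""
    # Age unavailable (e.g. an older broker that does not report it):
    # strictly adhere to latest.
    if partition_age_ms is None:
        return ResetTarget.LATEST
    # Recently expanded partition: fall back to earliest to catch hot data.
    # The inclusive comparison is an assumption.
    if partition_age_ms <= max_age_ms:
        return ResetTarget.EARLIEST
    # Partition is older than the threshold: records are no longer fresh.
    return ResetTarget.LATEST
```

For example, with a threshold of 60000 ms, a partition reported as 5000 ms
old would resolve to earliest, while a partition with no reported age would
resolve to latest.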
