Hello all,

I’d like to bump this thread. If there is no further feedback, I will call a 
vote.

Best Regards,
Jiunn-Yang

> 黃竣陽 <[email protected]> 於 2026年3月18日 晚上7:28 寫道:
> 
> Hello all, 
> 
> I have updated the KIP. Any feedback or suggestions are welcome. 
> 
> Best Regards, 
> Jiunn-Yang
> 
>> Chia-Ping Tsai <[email protected]> 於 2026年3月16日 晚上11:19 寫道:
>> 
>> Hi Jiunn-Yang,
>> 
>> Thanks for the updates. I really love the simplification! :)
>> 
>> It seems the KIP page is a bit out-of-date now. Would you mind updating the
>> KIP to match the above discussion?
>> 
>> For example, the KIP still uses two timestamps. Plus, we need to describe
>> the risk caused by server time skew that you mentioned.
>> 
>> Best, Chia-Ping
>> 
>> 黃竣陽 <[email protected]> 於 2026年3月16日週一 下午10:53寫道:
>> 
>>> Hello all,
>>> 
>>> Thanks for all the feedback.
>>> 
>>> I believe we can simplify the approach further. Instead of comparing group
>>> creation
>>> time against partition creation time to decide between EARLIEST and
>>> LATEST, we
>>> can just always use the group creation timestamp to issue a
>>> ListOffsetsRequest for
>>> all partitions — whether there is no committed offset or the committed
>>> offset is out of
>>> range. If the partition was created after the group, the lookup will
>>> naturally resolve to
>>> the beginning of that partition. If the partition predates the group, the
>>> lookup will find the
>>> first message at or after the group's creation time. Even for the
>>> out-of-range case (e.g.,
>>> AdminClient.deleteRecords or log segment deletion), the same rule applies.
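
[Editor's note: a minimal sketch of this single rule in plain Java. It is an illustrative stand-in for the broker-side ListOffsetsRequest timestamp lookup, not the KIP's actual implementation; method and parameter names are invented for the example.]

```java
import java.util.List;

public class ByStartTimeReset {

    // Illustrative stand-in for a ListOffsetsRequest timestamp lookup:
    // recordTimestamps holds the record timestamp at each offset, in offset
    // order. Return the first offset whose timestamp is at or after the
    // group creation time, or the log end offset if every record is older.
    // The same rule covers "no committed offset" and "offset out of range".
    static long resetOffset(List<Long> recordTimestamps, long groupCreationMs) {
        for (int offset = 0; offset < recordTimestamps.size(); offset++) {
            if (recordTimestamps.get(offset) >= groupCreationMs) {
                return offset;
            }
        }
        // Nothing newer than the group: start at the log end offset.
        return recordTimestamps.size();
    }

    public static void main(String[] args) {
        // Partition created after the group: every record is newer, so the
        // lookup naturally resolves to the beginning of the partition.
        System.out.println(resetOffset(List.of(200L, 210L, 220L), 100L)); // 0
        // Partition predates the group: first record at or after creation time.
        System.out.println(resetOffset(List.of(50L, 90L, 150L, 300L), 100L)); // 2
    }
}
```

Note how both branches of the old EARLIEST/LATEST decision fall out of the one lookup, which is the point of the simplification.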
>>> 
>>> This approach is also more semantically accurate. The name by_start_time
>>> implies
>>> "start consuming from when my group started." Using the group creation
>>> timestamp
>>> as the sole reference point for all scenarios is exactly what this name
>>> promises —
>>> no hidden branching between EARLIEST and LATEST, no conditional
>>> comparisons.
>>> The behavior directly reflects the name.
>>> 
>>> This eliminates all conditional logic — the behavior becomes a single,
>>> predictable rule:
>>> always seek to the group's creation time. This simplification also aligns
>>> well with the basic
>>> rule proposed earlier in the discussion: the strategy only affects the
>>> initial startup decision,
>>> and after startup, the behavior remains consistent — we always try to
>>> consume as much
>>> data as possible regardless of the scenario.
>>> 
>>> Additionally, this reduces the scope of the protocol change, since we no
>>> longer need to pass
>>> partition creation timestamps in the heartbeat response. Only the group
>>> creation time is required.
>>> 
>>> There are two points still worth discussing:
>>> 
>>> 1. Server clock accuracy: The broker's clock may have some drift, which
>>> could affect the precision of the timestamp-based lookup. However, this
>>> is equally true for the previous approach that compared two server-side
>>> timestamps, so it does not introduce a new concern.
>>> 
>>> 2. User data timestamps: If producers set custom timestamps on their
>>> records that do not reflect real
>>> time, the ListOffsetsRequest may return unexpected results. However, this
>>> is fundamentally a data
>>> quality issue that users should be responsible for.
>>> 
>>> 
>>> Best Regards,
>>> Jiunn-Yang
>>> 
>>>> jian fu <[email protected]> 於 2026年3月16日 晚上8:10 寫道:
>>>> 
>>>> Hi all,
>>>> 
>>>> Thank you for introducing and discussing this KIP to address a
>>>> long-standing issue for users.
>>>> 
>>>> I think the current issue with the strategy is not only the data loss
>>>> during partition expansion, but also that it is difficult for users to
>>>> understand or remember in different scenarios.
>>>> 
>>>> If we can define one basic rule and agree on it in the KIP, it would
>>>> make it easier for everyone to stay on the same page.
>>>> 
>>>> The basic rule would be:
>>>> 
>>>> 1. The strategy only affects the initial startup of the consumer, to
>>>> decide whether historical data should be discarded.
>>>> 2. After startup, no matter which strategy is used, the behavior should
>>>> remain the same:
>>>> 
>>>> 2.1 We try not to lose any data. For example, in cases like partition
>>>> expansion, we should still attempt to consume all available data in the
>>>> future.
>>>> 2.2 If some data loss is unavoidable, such as when the start offset
>>>> changes (e.g., due to slow consumption or user-triggered data deletion),
>>>> we should lose the same amount of data regardless of the strategy. This
>>>> would require a small behavior change. Previously, when an out-of-range
>>>> situation occurred because of a start offset change, the latest strategy
>>>> could lose more data than earliest. After the change, both strategies
>>>> would behave the same: they would reset to the start offset (earliest
>>>> available offset), which also reduces the number of dropped messages.
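
[Editor's note: the out-of-range part of rule 2 above could be sketched as follows, in plain Java. All names are illustrative, not Kafka's actual implementation; the point is that the clamping is the same for every strategy.]

```java
public class OutOfRangeReset {

    // Sketch of the unified post-startup rule: when the committed offset
    // falls outside the valid [logStartOffset, logEndOffset] range, every
    // strategy clamps to the nearest surviving position, so latest no
    // longer drops more data than earliest.
    static long resolve(long committedOffset, long logStartOffset, long logEndOffset) {
        if (committedOffset < logStartOffset) {
            // Start offset moved forward (retention or deleteRecords):
            // resume from the earliest surviving record.
            return logStartOffset;
        }
        if (committedOffset > logEndOffset) {
            // Committed offset is beyond the log (e.g. truncation).
            return logEndOffset;
        }
        return committedOffset; // still valid, no reset needed
    }

    public static void main(String[] args) {
        System.out.println(resolve(5, 10, 100));   // stale offset -> 10
        System.out.println(resolve(150, 10, 100)); // truncated log -> 100
        System.out.println(resolve(42, 10, 100));  // in range -> 42
    }
}
```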
>>>> 
>>>> I think this rule would significantly reduce the difficulty for users
>>>> to understand the behavior, and the confusion in some cases. Users
>>>> would only need to focus on the initial start of the consumer group,
>>>> and there would be no differences in behavior in future scenarios.
>>>> 
>>>> WDYT? Thanks
>>>> 
>>>> Regards
>>>> 
>>>> Jian
>>>> 
>>>> On Mon, Mar 16, 2026 at 08:39, Ryan Leslie (BLOOMBERG/ NEW YORK)
>>>> <[email protected]> wrote:
>>>> 
>>>>> Hey Chia,
>>>>> 
>>>>> Sorry, I see the KIP had some recent updates. I see the current idea
>>>>> is to store partition time and group start time. I think this is good.
>>>>> For the 2a case (stale offset), it mentions the user is explicitly
>>>>> reset to earliest if they use by_start_time. As you said, they could
>>>>> use by_duration instead to avoid a backlog, but according to the KIP
>>>>> there is a possibility of data loss during partition expansion. But I
>>>>> agree with you that it's fair to assume most using this feature are
>>>>> interested in data completeness, and users are still free to manually
>>>>> adjust offsets.
>>>>> 
>>>>> One other comment is the feature may be tricky for users to fully
>>>>> grasp, since it combines multiple things. Most users won't really know
>>>>> what their exact group startup time is. If an offset was deleted, when
>>>>> consuming starts again it might end up being reset to earliest, or to
>>>>> some other time that they are not too aware of. It's less
>>>>> deterministic than before. And if the offset is out of range, then
>>>>> instead they just always get earliest, independent of the group start
>>>>> time. And that behavior is a bit different than the other one. While
>>>>> this may be sound reasoning to Kafka devs, having users understand
>>>>> what behavior to expect can be harder. The documentation might be
>>>>> difficult if it has to list out and explain all the cases as the KIP
>>>>> does.
>>>>> 
>>>>> The deleteRecords / stale offset use case is perhaps a separate one
>>>>> from partition expansion. We have sort of combined them into the one
>>>>> opinionated feature now.
>>>>> 
>>>>> Regardless, thank you guys once again for focusing on this long-lived
>>>>> issue in Kafka :)
>>>>> 
>>>>> From: [email protected] At: 03/15/26 18:30:45 UTC-4:00To:
>>>>> [email protected]
>>>>> Subject: Re: [DISCUSS] KIP-1282: Prevent data loss during partition
>>>>> expansion for dynamically added partitions
>>>>> 
>>>>> Hi Ryan,
>>>>> 
>>>>> It seems to me the by_duration policy is already perfectly suited to
>>>>> tackle the long downtime scenario (Case 2a).
>>>>> 
>>>>> If a user wants to avoid a massive backlog after being offline for a
>>>>> long time, they can configure something like by_duration=10m.
>>>>> 
>>>>> This allows them to skip the massive backlog of older records, while
>>>>> still capturing the most recent data (including anything in newly
>>>>> expanded partitions).
>>>>> 
>>>>> Since by_duration and latest already serve the users who want to skip
>>>>> data, shouldn't by_start_time remain strictly focused on users whose
>>>>> primary goal is data completeness (even if it means processing a
>>>>> backlog)?
>>>>> 
>>>>> Best,
>>>>> Chia-Ping
>>>>> 
>>>>> On Mon, Mar 16, 2026 at 4:57 AM, Ryan Leslie (BLOOMBERG/ NEW YORK)
>>>>> <[email protected]> wrote:
>>>>> 
>>>>>> Hey Chia,
>>>>>> 
>>>>>> That's an interesting point that there is another subcase, 2c, caused
>>>>>> by the user having deleted records. While the current behavior of
>>>>>> 'latest' would not be changed by adding something like
>>>>>> 'auto.no.offset.reset' mentioned earlier, the current suggestion of
>>>>>> by_start_time would address not only the case of newly added
>>>>>> partitions, but also this additional case when records are deleted
>>>>>> and the offset is invalid. It may be worth explicitly highlighting
>>>>>> this as another intended use case in the KIP.
>>>>>> 
>>>>>> Also, have you given any more thought to the case where a consumer
>>>>>> crashes during the partition expansion and its timestamp is lost or
>>>>>> incorrect when it restarts? If we decided to do something where the
>>>>>> timestamp is just stored by the broker as the creation time of the
>>>>>> consumer group, then after a short while (a typical retention period)
>>>>>> it may just degenerate into 'earliest' anyway. In that case the
>>>>>> behavior of 2a is also implicitly changed to 'earliest', which again
>>>>>> the user may not want, since they intentionally had the consumer
>>>>>> down. This is why I feel it can get kind of confusing to try to cover
>>>>>> all reset cases in a single tunable, auto.offset.reset. It's because
>>>>>> there are several different cases.
>>>>>> 
>>>>>> From: [email protected] At: 03/15/26 10:44:06 UTC-4:00To:
>>>>>> [email protected]
>>>>>> Subject: Re: [DISCUSS] KIP-1282: Prevent data loss during partition
>>>>>> expansion for dynamically added partitions
>>>>>> 
>>>>>> Hi Ryan,
>>>>>> 
>>>>>> Thanks for the detailed breakdown! Categorizing the scenarios into
>>>>>> 1a, 1b, and 2a is incredibly helpful. We are completely aligned that
>>>>>> Case 1b (a newly added partition) is the true issue and the primary
>>>>>> motivation for this KIP.
>>>>>> 
>>>>>> Regarding the "out of range" scenario (Case 2a), the situation you
>>>>>> described (a consumer being offline for a long time) is very real.
>>>>>> However, I'd like to offer another common operational perspective:
>>>>>> 
>>>>>> What if a user explicitly uses AdminClient.deleteRecords to advance
>>>>>> the start offset just to skip a batch of corrupted/poison-pill
>>>>>> records?
>>>>>> 
>>>>>> In this scenario, the consumer hasn't been hanging or offline for
>>>>>> long; it simply hits an OffsetOutOfRange error. If we force the
>>>>>> policy to jump to latest, the consumer will unexpectedly skip all the
>>>>>> remaining healthy records up to the High Watermark. This would cause
>>>>>> massive, unintended data loss for an operation that merely meant to
>>>>>> skip a few bad records.
>>>>>> 
>>>>>> Therefore, to make the semantic guarantee accurate and predictable:
>>>>>> if a user explicitly opts into policy=by_start_time (meaning "I want
>>>>>> every record since I started"), our contract should be to let them
>>>>>> process as much surviving data as possible. Falling back to earliest
>>>>>> during an out-of-range event fulfills this promise safely.
>>>>>> 
>>>>>> On 2026/03/13 23:07:04 "Ryan Leslie (BLOOMBERG/ NEW YORK)" wrote:
>>>>>>> Hi Chia,
>>>>>>> 
>>>>>>> I have a similar thought that knowing if the consumer group is new
>>>>>>> is helpful. It may be possible to avoid having to know the age of
>>>>>>> the partition. Thinking out loud a bit here, but:
>>>>>>> 
>>>>>>> The two cases for auto.offset.reset are:
>>>>>>> 
>>>>>>> 1. no offset
>>>>>>> 2. offset is out of range
>>>>>>> 
>>>>>>> These have subcases such as:
>>>>>>> 
>>>>>>> 1. no offset
>>>>>>> a. group is new
>>>>>>> b. partition is new
>>>>>>> c. offset previously existed was explicitly deleted
>>>>>>> 
>>>>>>> 2. offset is out of range
>>>>>>> a. group has stopped consuming or had no consumers for a while,
>>>>>>> offset is stale
>>>>>>> b. invalid offset was explicitly committed
>>>>>>> 
>>>>>>> Users of auto.offset.reset LATEST are perhaps mostly interested in
>>>>>>> not processing a backlog of messages when they first start up. But
>>>>>>> if there is a new partition they don't want to lose data. A typical
>>>>>>> user might want:
>>>>>>> 
>>>>>>> 1a - latest
>>>>>>> 
>>>>>>> 1b - earliest
>>>>>>> 
>>>>>>> 1c - Unclear, but this may be an advanced user. Since they may have
>>>>>>> explicitly used AdminClient to remove the offset, they can
>>>>>>> explicitly use it to configure a new one
>>>>>>> 
>>>>>>> 2a - latest may be appropriate - the group has most likely already
>>>>>>> lost messages
>>>>>>> 
>>>>>>> 2b - Similar to 1c, this may be an advanced user already in control
>>>>>>> over offsets with AdminClient
>>>>>>> 
>>>>>>> If we worry most about cases 1a, 1b, and 2a, only 1b is the
>>>>>>> exception (which is why this KIP exists). What we really want to
>>>>>>> know is whether the partition is new, but perhaps as long as we
>>>>>>> ensure that 1a and 2a are more explicitly configurable, then we may
>>>>>>> get support for 1b without a partition timestamp. For example, if
>>>>>>> the following new config existed:
>>>>>>> 
>>>>>>> Applies if the group already exists and no offset is committed for
>>>>>>> a partition. Group existence is defined as the group already having
>>>>>>> some offsets committed for partitions. Otherwise, auto.offset.reset
>>>>>>> applies.
>>>>>>> Note: we could possibly make this not applicable to static
>>>>>>> partition assignment cases
>>>>>>> 
>>>>>>> auto.no.offset.reset = earliest
>>>>>>> 
>>>>>>> Handles all other cases as it does today. auto.no.offset.reset
>>>>>>> takes precedence
>>>>>>> 
>>>>>>> auto.offset.reset = latest
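
[Editor's note: the precedence between these two configs could be sketched as follows in plain Java. `auto.no.offset.reset` is this message's hypothetical config, not a shipped Kafka setting, and all names here are illustrative.]

```java
public class ResetPolicyChooser {

    enum Policy { EARLIEST, LATEST }

    // Sketch of the proposed precedence rule: when the group already has
    // some committed offsets but this partition has none (case 1b),
    // auto.no.offset.reset wins; otherwise the existing auto.offset.reset
    // applies (cases 1a and 2a). This only runs when a reset is needed at
    // all, i.e. the partition's committed offset is absent or invalid.
    static Policy choose(boolean groupHasAnyOffsets,
                         boolean partitionHasOffset,
                         Policy autoNoOffsetReset,
                         Policy autoOffsetReset) {
        if (groupHasAnyOffsets && !partitionHasOffset) {
            return autoNoOffsetReset;
        }
        return autoOffsetReset;
    }

    public static void main(String[] args) {
        // New partition in an existing group resets to earliest...
        System.out.println(choose(true, false, Policy.EARLIEST, Policy.LATEST));
        // ...while a brand-new group still starts from latest.
        System.out.println(choose(false, false, Policy.EARLIEST, Policy.LATEST));
    }
}
```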
>>>>>>> 
>>>>>>> The question is whether the consumer can reliably determine if the
>>>>>>> group is new (has offsets or not), since there may be multiple
>>>>>>> consumers all committing at the same time around startup. Otherwise
>>>>>>> we would have to rely on the fact that the new consumer protocol
>>>>>>> centralizes things through the coordinator, but this may require
>>>>>>> more significant changes as you mentioned...
>>>>>>> 
>>>>>>> I agree with Jiunn that we could consider supporting the
>>>>>>> auto.offset.reset timestamp separately, but I'm still concerned
>>>>>>> about the case where a consumer crashes during the partition
>>>>>>> expansion and its timestamp is lost or incorrect when it restarts.
>>>>>>> A feature with a false sense of security can be problematic.
>>>>>>> 
>>>>>>> From: [email protected] At: 03/10/26 04:02:58 UTC-4:00To:
>>>>>> [email protected]
>>>>>>> Subject: Re: [DISCUSS] KIP-1282: Prevent data loss during partition
>>>>>> expansion
>>>>>> for dynamically added partitions
>>>>>>> 
>>>>>>> Hi Ryan,
>>>>>>> 
>>>>>>> Thanks for sharing your thoughts! We initially tried to leave the
>>>>>>> RPCs as they were, but you're absolutely right that the current KIP
>>>>>>> feels a bit like a kludge.
>>>>>>> 
>>>>>>> If we are open to touching the protocol (and can accept the
>>>>>>> potential time skew between nodes), I have another idea: what if we
>>>>>>> add a creation timestamp to both the partition and the group?
>>>>>>> 
>>>>>>> This information could be returned to the consumer via the new
>>>>>>> Heartbeat RPC. The async consumer could then seamlessly leverage
>>>>>>> these timestamps to make a deterministic decision between using
>>>>>>> "latest" or "earliest."
>>>>>>> 
>>>>>>> This approach would only work for the AsyncConsumer using
>>>>>>> subscribe() (since it relies on the group state), which actually
>>>>>>> serves as another great incentive for users to migrate to the new
>>>>>>> consumer!
>>>>>>> 
>>>>>>> Thoughts on this direction?
>>>>>>> 
>>>>>>> Best,
>>>>>>> 
>>>>>>> Chia-Ping
>>>>>>> 
>>>>>>> On Tue, Mar 10, 2026 at 6:38 AM, Ryan Leslie (BLOOMBERG/ NEW YORK)
>>>>>>> <[email protected]> wrote:
>>>>>>> 
>>>>>>>> Hi Chia and Jiunn,
>>>>>>>> 
>>>>>>>> Thanks for the response. I agree that the explicit timestamp gives
>>>>>>>> enough flexibility for the user to avoid the issue I mentioned
>>>>>>>> with the implicit timestamp at startup not matching the time the
>>>>>>>> group instance started.
>>>>>>>> 
>>>>>>>> One potential downside is that the user may have to store this
>>>>>>>> timestamp somewhere in between restarts. For the group instance
>>>>>>>> id, that is not always the case, since sometimes it can be derived
>>>>>>>> from the environment, such as the hostname, or hardcoded in an
>>>>>>>> environment variable where it typically doesn't need to be
>>>>>>>> updated.
>>>>>>>> 
>>>>>>>> Also, since static instances may be long-lived, preserving just
>>>>>>>> the initial timestamp of the first instance might feel a bit
>>>>>>>> awkward, since you may end up with static instances restarting and
>>>>>>>> passing timestamps that could be old, like two months ago. The
>>>>>>>> user could instead store something like the last time of restart
>>>>>>>> (and subtract metadata max age from it to be safe), but it can be
>>>>>>>> considered a burden and may fail if shutdown was not graceful,
>>>>>>>> i.e. a crash.
>>>>>>>> 
>>>>>>>> I agree that this KIP provides a workable solution to avoid data
>>>>>>>> loss without protocol or broker changes, so I'm +1. But it does
>>>>>>>> still feel a little like a kludge, since what the user really
>>>>>>>> needs is an easy, almost implicit, way to not lose data when a
>>>>>>>> recently added partition is discovered, and currently there is no
>>>>>>>> metadata for the creation time of a partition. The user may not
>>>>>>>> even want the same policy applied to older partitions for which
>>>>>>>> their offset was deleted.
>>>>>>>> 
>>>>>>>> Even for a consumer group not using static membership, suppose
>>>>>>>> partitions are added by a producer and new messages are published.
>>>>>>>> If at the same time there is a consumer group, e.g. with only 1
>>>>>>>> consumer, and it has crashed, when it comes back up it may lose
>>>>>>>> messages unless it knows what timestamp to pass.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Ryan
>>>>>>>> 
>>>>>>>> From: [email protected] At: 03/07/26 02:46:28 UTC-5:00To:
>>>>>>>> [email protected]
>>>>>>>> Subject: Re: [DISCUSS] KIP-1282: Prevent data loss during partition
>>>>>>>> expansion for dynamically added partitions
>>>>>>>> 
>>>>>>>> Hello Sikka,
>>>>>>>> 
>>>>>>>>> If the consumer restarts (app crash, bounce, etc.) after
>>>>>>>>> dynamically adding partitions, it would consume unread messages
>>>>>>>>> from the last committed offset for existing partitions but would
>>>>>>>>> still miss the messages from the new partition.
>>>>>>>> 
>>>>>>>> For dynamic consumers, a restart inherently means leaving and
>>>>>>>> rejoining the group as a new member, so recalculating
>>>>>>>> startupTimestamp = now() is semantically correct — the consumer is
>>>>>>>> genuinely starting fresh.
>>>>>>>> 
>>>>>>>> The gap you described only applies to static membership, where the
>>>>>>>> consumer can restart without triggering a rebalance, yet the local
>>>>>>>> timestamp still gets reset. For this scenario, as Chia suggested,
>>>>>>>> we could extend the configuration to accept an explicit timestamp.
>>>>>>>> This would allow users to pin a fixed reference point across
>>>>>>>> restarts, effectively closing the gap for static membership. For
>>>>>>>> dynamic consumers, the default by_start_time without an explicit
>>>>>>>> timestamp already provides the correct behavior and a significant
>>>>>>>> improvement over latest, which would miss data even without a
>>>>>>>> restart.
>>>>>>>> 
>>>>>>>>> If the offsets are deleted as mentioned in Scenario 2 (log
>>>>>>>>> truncation), how would this solution address that scenario?
>>>>>>>> 
>>>>>>>> For the log truncation scenario, when segments are deleted and the
>>>>>>>> consumer's committed offset becomes out of range,
>>>>>>>> auto.offset.reset is triggered. With latest, the consumer simply
>>>>>>>> jumps to the end of the partition, skipping all remaining
>>>>>>>> available data. With by_start_time, the consumer looks up the
>>>>>>>> position based on the startup timestamp rather than relying on
>>>>>>>> offsets. Since the lookup is timestamp-based, it is not affected
>>>>>>>> by offset invalidation due to truncation. Any data with timestamps
>>>>>>>> at or after the startup time will still be found and consumed.
>>>>>>>> 
>>>>>>>>> Do we need to care about clock skew or system time issues on the
>>>>>>>>> consumer client side? Should we use a timestamp on the
>>>>>>>>> server/broker side?
>>>>>>>> 
>>>>>>>> Clock skew is a fair concern, but using a server-side timestamp
>>>>>>>> does not necessarily make things safer. It would mean comparing
>>>>>>>> the Group Coordinator's time against the Partition Leader's time,
>>>>>>>> which are often different nodes. Without strict clock
>>>>>>>> synchronization across the Kafka cluster, this "happens-before"
>>>>>>>> relationship remains fundamentally unpredictable. On the other
>>>>>>>> hand, auto.offset.reset is strictly a client-level configuration —
>>>>>>>> consumers within the same group can intentionally use different
>>>>>>>> policies. Tying the timestamp to a global server-side state would
>>>>>>>> be a semantic mismatch. A local timestamp aligns much better with
>>>>>>>> the client-level nature of this config.
>>>>>>>> 
>>>>>>>>> Do you plan to have any metrics or observability for when the
>>>>>>>>> consumer resets offsets with by_start_time?
>>>>>>>> 
>>>>>>>> That's a great suggestion. I plan to expose the startup timestamp
>>>>>>>> used by by_start_time as a client-level metric, so users can
>>>>>>>> easily verify which reference point the consumer is using during
>>>>>>>> debugging.
>>>>>>>> 
>>>>>>>> Best Regards,
>>>>>>>> Jiunn-Yang
>>>>>>>> 
>>>>>>>>> On Mar 6, 2026, at 10:44 PM, Chia-Ping Tsai <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Ryan,
>>>>>>>>> 
>>>>>>>>> That is a fantastic point. A static member restarting and
>>>>>>>>> capturing a newer local timestamp is definitely a critical edge
>>>>>>>>> case.
>>>>>>>>> 
>>>>>>>>> Since users already need to inject a unique group.instance.id
>>>>>>>>> into the configuration for static members, my idea is to allow
>>>>>>>>> the auto.offset.reset policy to carry a dedicated timestamp to
>>>>>>>>> explicitly "lock" the startup time (for example, using a format
>>>>>>>>> like auto.offset.reset=startup:<timestamp>).
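
[Editor's note: parsing such a value could look like the sketch below. The "startup:<timestamp>" syntax is only this message's proposal, not a real Kafka config format, and the parser names are invented for illustration.]

```java
public class StartupResetConfig {

    // Hypothetical parser for the "startup:<timestamp>" format floated in
    // this message. A bare "startup" means "use the consumer's own start
    // time" (dynamic members), while a pinned epoch-millis value survives
    // static-member restarts.
    static long startupTimestamp(String value, long nowMs) {
        if (value.equals("startup")) {
            return nowMs;
        }
        if (value.startsWith("startup:")) {
            return Long.parseLong(value.substring("startup:".length()));
        }
        throw new IllegalArgumentException("not a startup reset policy: " + value);
    }

    public static void main(String[] args) {
        // Static member: the locked timestamp is used, not the current time.
        System.out.println(startupTimestamp("startup:1700000000000", 1234L)); // 1700000000000
        // Dynamic member: falls back to the consumer's own start time.
        System.out.println(startupTimestamp("startup", 1234L));               // 1234
    }
}
```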
>>>>>>>>> 
>>>>>>>>> This means that if users want to leverage this new policy with
>>>>>>>>> static membership, their deployment configuration would simply
>>>>>>>>> include these two specific injected values (the static ID and the
>>>>>>>>> locked timestamp).
>>>>>>>>> 
>>>>>>>>> This approach elegantly maintains the configuration semantics at
>>>>>>>>> the member-level, and most importantly, it avoids any need to
>>>>>>>>> update the RPC protocol.
>>>>>>>>> 
>>>>>>>>> What do you think of this approach for the static membership
>>>>>>>>> scenario?
>>>>>>>>> 
>>>>>>>>> On 2026/03/05 19:45:13 "Ryan Leslie (BLOOMBERG/ NEW YORK)" wrote:
>>>>>>>>>> Hey Jiunn,
>>>>>>>>>> 
>>>>>>>>>> Glad to see some progress around this issue.
>>>>>>>>>> 
>>>>>>>>>> I had a similar thought to David, that if the time is only known
>>>>>>>>>> client side there are still edge cases for data loss. One case
>>>>>>>>>> is static membership, where from the perspective of a client
>>>>>>>>>> they are free to restart their consumer task without actually
>>>>>>>>>> having left or meaningfully affected the group. However, I think
>>>>>>>>>> with the proposed implementation the timestamp is still reset
>>>>>>>>>> here. So if the restart happens just after a partition is added
>>>>>>>>>> and published to, but before the consumer metadata is refreshed,
>>>>>>>>>> the group still runs the risk of data loss.
>>>>>>>>>> 
>>>>>>>>>> It could be argued that keeping the group 'stable' is a
>>>>>>>>>> requirement for this feature to work, but sometimes it's not
>>>>>>>>>> possible to accomplish.
>>>>>>>>>> 
>>>>>>>>>> From: [email protected] At: 03/05/26 14:32:20 UTC-5:00To:
>>>>>>>> [email protected]
>>>>>>>>>> Subject: Re: [DISCUSS] KIP-1282: Prevent data loss during
>>>>> partition
>>>>>>>> expansion for dynamically added partitions
>>>>>>>>>> 
>>>>>>>>>> Hi Jiunn,
>>>>>>>>>> 
>>>>>>>>>> Thanks for the KIP!
>>>>>>>>>> 
>>>>>>>>>> I was also considering this solution while we discussed it in
>>>>>>>>>> the Jira. It seems to work in most of the cases, but not in all.
>>>>>>>>>> For instance, let's imagine a partition created just before a
>>>>>>>>>> new consumer joins or rejoins the group, and this consumer gets
>>>>>>>>>> the new partition. In this case, the consumer will have a start
>>>>>>>>>> time which is older than the partition creation time. This could
>>>>>>>>>> also happen with the truncation case. It makes the behavior kind
>>>>>>>>>> of unpredictable again.
>>>>>>>>>> 
>>>>>>>>>> Instead of relying on a local timestamp, one idea would be to
>>>>>>>>>> rely on a timestamp provided by the server. For instance, we
>>>>>>>>>> could define the time since the group became non-empty. This
>>>>>>>>>> would define the subscription time for the consumer group. The
>>>>>>>>>> downside is that it only works if the consumer is part of a
>>>>>>>>>> group.
>>>>>>>>>> 
>>>>>>>>>> In your missing semantic section, I don't fully understand how
>>>>>>>>>> the 4th point is improved by the KIP. It says start from
>>>>>>>>>> earliest, but with the change it would start from the
>>>>>>>>>> application start time. Could you elaborate?
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> David
>>>>>>>>>> 
>>>>>>>>>> On Thu, Mar 5, 2026 at 12:47, 黃竣陽 <[email protected]> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hello chia,
>>>>>>>>>>> 
>>>>>>>>>>> Thanks for your feedback, I have updated the KIP.
>>>>>>>>>>> 
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> Jiunn-Yang
>>>>>>>>>>> 
>>>>>>>>>>>> On Mar 5, 2026, at 7:24 PM, Chia-Ping Tsai <[email protected]> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi Jiunn,
>>>>>>>>>>>> 
>>>>>>>>>>>> chia_00: Would you mind mentioning KAFKA-19236 in the KIP? It
>>>>>>>>>>>> would be helpful to let readers know that "Dynamically at
>>>>>>>>>>>> partition discovery" is being tracked as a separate issue.
>>>>>>>>>>>> 
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Chia-Ping
>>>>>>>>>>>> 
>>>>>>>>>>>> On 2026/03/05 11:14:31 黃竣陽 wrote:
>>>>>>>>>>>>> Hello everyone,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I would like to start a discussion on KIP-1282: Prevent data
>>>>>>>>>>>>> loss during partition expansion for dynamically added
>>>>>>>>>>>>> partitions <https://cwiki.apache.org/confluence/x/mIY8G>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> This proposal aims to introduce a new auto.offset.reset
>>>>>>>>>>>>> policy, by_start_time, anchoring the offset reset to the
>>>>>>>>>>>>> consumer's startup timestamp rather than the partition
>>>>>>>>>>>>> discovery time, to prevent silent data loss during partition
>>>>>>>>>>>>> expansion.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Jiunn-Yang
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
> 
