Hi,

I have started a vote thread for this KIP, since all questions raised so far have been answered. I am happy to continue the discussion if needed; otherwise, this is a friendly reminder to vote on this KIP.

Thanks,
Lijun Tong

On Mon, Jan 19, 2026 at 17:59 Lijun Tong <[email protected]> wrote:

> Hey Kamal,
>
> Thanks for raising these questions. Here are my responses to your questions:
>
> Q1 and Q2:
> I think both questions boil down to how to release this new feature, and both are valid concerns. The solution I have in mind is that this feature is *gated by the metadata version*. The new tombstone semantics and the additional fields (for example in RemoteLogSegmentUpdateRecord) are only enabled once the cluster metadata version is upgraded to the version that introduces this feature. As long as the cluster metadata version is not bumped, the system will not produce tombstone records. Therefore, during rolling upgrades (mixed 4.2/4.3 brokers), the feature remains effectively disabled. Tombstones will only start being produced after the metadata version is upgraded, at which point all brokers are already required to support the new behavior.
>
> Since Kafka does not support metadata version downgrades at the moment, once a metadata version that supports this feature is enabled, it cannot be downgraded to a version that does not support it. I will add these details to the KIP later.
>
> Q3. This is an *editing mistake* in the KIP. Thanks for pointing it out; the enum value has already been corrected in the latest revision to remove the unused placeholder and keep the state values contiguous and consistent.
>
> Q4. So far I don't foresee the quota mechanism interfering with the state transition in any way; let me know if you have any concerns.
>
> Thanks,
> Lijun
>
> On Sun, Jan 18, 2026 at 00:40 Kamal Chandraprakash <[email protected]> wrote:
>
>> Hi Lijun,
>>
>> Thanks for updating the KIP!
>>
>> The updated migration plan looks clean to me. Few questions:
>>
>> 1. The ConsumerTask in the 4.2 Kafka build does not handle the tombstone records. Should the tombstone records be sent only when all the brokers are upgraded to the 4.3 version?
>>
>> 2. Once all the brokers are upgraded and the __remote_log_metadata topic cleanup policy is changed to compact, downgrading the brokers is not allowed, as records without a key will throw an error when produced to the compacted topic. Shall we mention this in the compatibility section?
>>
>> 3. In the RemoteLogSegmentState enum, why is the value 1 marked as unused?
>>
>> 4. Regarding the key (TopicIdPartition:EndOffset:BrokerLeaderEpoch), we may have to check for scenarios where there is segment lag due to the remote log write quota. Will the state transition work correctly? Will come back to this later.
>>
>> Thanks,
>> Kamal
>>
>> On Fri, Jan 16, 2026 at 4:50 AM jian fu <[email protected]> wrote:
>>
>> > Hi Lijun and Kamal
>> > I also think we don't need to keep the delete policy in the final config. If we did, we would always have to remember that the retention time of every topic must be less than that value, right? It is a safeguard that carries risk.
>> > Regards
>> > Jian
>> >
>> > On Fri, Jan 16, 2026 at 06:45 Lijun Tong <[email protected]> wrote:
>> >
>> > > Hey Kamal,
>> > >
>> > > Some additional points about Q4:
>> > >
>> > > > The user can decide when to change their internal topic cleanup policy to compact. If someone retains the data in the remote storage for 3 months, then they can migrate to the compacted topic after 3 months post rolling out this change. And, update their cleanup policy to [compact, delete].
>> > >
>> > > I don't think it's a good idea to keep delete in the final cleanup policy for the topic `__remote_log_metadata`, as this still requires the user to keep track of the max retention hours of topics that have remote storage enabled, which is an operational burden. It's also hard to reason about what will happen if the user configures the wrong retention.ms. I hope this makes sense.
>> > >
>> > > Thanks,
>> > > Lijun Tong
>> > >
>> > > On Thu, Jan 15, 2026 at 11:43 Lijun Tong <[email protected]> wrote:
>> > >
>> > > > Hey Kamal,
>> > > >
>> > > > Thanks for your reply! I am glad we are on the same page about making compaction of the __remote_log_metadata topic optional for the user. I will update the KIP with this change.
>> > > >
>> > > > For the Q2:
>> > > > With the key designed as TopicId:Partition:EndOffset:BrokerLeaderEpoch, even if the same broker retries the upload multiple times for the same log segment, the latest retry attempt with the latest segment UUID will overwrite the previous attempts' value since they share the same key. So we don't need to explicitly track the failed upload metadata, because it has already been replaced by the later attempt. That's my understanding of the RLMCopyTask; correct me if I am wrong.
>> > > >
>> > > > Thanks,
>> > > > Lijun Tong
>> > > >
>> > > > On Wed, Jan 14, 2026 at 21:18 Kamal Chandraprakash <[email protected]> wrote:
>> > > >
>> > > >> Hi Lijun,
>> > > >>
>> > > >> Thanks for the reply!
>> > > >>
>> > > >> Q1: Sounds good. Could you clarify in the KIP that the same partitioner will be used?
>> > > >>
>> > > >> Q2: With the TopicId:Partition:EndOffset:BrokerLeaderEpoch key, if the same broker retries the upload due to intermittent issues in object storage (or) RLMM, then the metadata for those failed uploads also needs to be cleared.
>> > > >>
>> > > >> Q3: We may have to skip the null value records in the ConsumerTask.
>> > > >>
>> > > >> Q4a: The idea is to keep the cleanup policy as "delete" and also send the tombstone markers to the existing `__remote_log_metadata` topic. And, handle the tombstone records in the ConsumerTask.
>> > > >>
>> > > >> The user can decide when to change their internal topic cleanup policy to compact. If someone retains the data in the remote storage for 3 months, then they can migrate to the compacted topic after 3 months post rolling out this change. And, update their cleanup policy to [compact, delete].
>> > > >>
>> > > >> Thanks,
>> > > >> Kamal
>> > > >>
>> > > >> On Thu, Jan 15, 2026 at 4:12 AM Lijun Tong <[email protected]> wrote:
>> > > >>
>> > > >> > Hey Jian,
>> > > >> >
>> > > >> > Thanks for taking the time to review this KIP. I appreciate that you proposed a simpler migration solution to onboard the new feature.
>> > > >> >
>> > > >> > There are 2 points that I think can be further refined:
>> > > >> >
>> > > >> > 1). Make compaction of the topic optional. The new feature will continue to emit tombstone messages for expired log segments even when the topic is still in time-based retention mode, so once the user switches to the compacted topic, those expired messages can still be deleted even though the topic is no longer retention based.
>> > > >> > 2). We need to expose some flag to the user to indicate whether the topic can be flipped to compacted, by checking whether all the old-format keyless messages have expired, and allow the user to flip to compacted only when the flag is true.
>> > > >> >
>> > > >> > Thanks for sharing your idea. I will update the KIP later with this new idea.
>> > > >> >
>> > > >> > Best,
>> > > >> > Lijun Tong
>> > > >> >
>> > > >> > On Mon, Jan 12, 2026 at 04:55 jian fu <[email protected]> wrote:
>> > > >> >
>> > > >> > > Hi Lijun Tong:
>> > > >> > >
>> > > >> > > Thanks for your KIP, which raises this critical issue.
>> > > >> > >
>> > > >> > > What about keeping just one topic instead of involving another topic? For migration of the existing topic data, maybe we can solve the issue this way:
>> > > >> > > (1) set the retention time of the metadata topic greater than the retention time of every topic that has remote storage enabled
>> > > >> > > (2) deploy the new Kafka version with the feature that sends messages with a key
>> > > >> > > (3) wait until all the old messages have expired and new messages with keys are coming to the topic
>> > > >> > > (4) convert the topic to compact
>> > > >> > >
>> > > >> > > I haven't tested it. I am just proposing this solution based on a code review, for your reference. The steps may be a little complex, but they avoid adding a new topic.
>> > > >> > >
>> > > >> > > Regards
>> > > >> > > Jian
>> > > >> > >
>> > > >> > > On Thu, Jan 8, 2026 at 09:17 Lijun Tong <[email protected]> wrote:
>> > > >> > >
>> > > >> > > > Hey Kamal,
>> > > >> > > >
>> > > >> > > > Thanks for your time for the review.
>> > > >> > > >
>> > > >> > > > Here is my response to your questions:
>> > > >> > > >
>> > > >> > > > Q1 At this point, I don’t see a need to change RemoteLogMetadataTopicPartitioner for this design. Nothing in the current approach appears to require a partitioner change, but I’m open to revisiting if a concrete need arises.
>> > > >> > > >
>> > > >> > > > Q2 I have some reservations about using SegmentId:State as the key. A practical challenge we see today is that the same logical segment can be retried multiple times with different SegmentIds across brokers. If the key is SegmentId-based, it becomes harder to discover and tombstone all related attempts when the segment eventually expires. The TopicId:Partition:EndOffset:BrokerLeaderEpoch key is deterministic for a logical segment attempt and helps group retries by epoch, which simplifies cleanup and reasoning about state. I’d love to understand the benefits you’re seeing with SegmentId:State compared to the offset/epoch-based key so we can weigh the trade-offs.
>> > > >> > > >
>> > > >> > > > On partitioning: with this proposal, all states for a given user topic-partition still map to the same metadata partition. That remains true for the existing __remote_log_metadata (unchanged partitioner) and for the new __remote_log_metadata_compacted, preserving the properties RemoteMetadataCache relies on.
>> > > >> > > >
>> > > >> > > > Q3 It should be fine for ConsumerTask to ignore tombstone records (null values) and no-op.
>> > > >> > > >
>> > > >> > > > Q4 Although TBRLMM is a sample RLMM implementation, it’s currently the only OSS option and is widely used. The new __remote_log_metadata_compacted topic offers clear operational benefits in that context. We can also provide a configuration to let users choose whether they want to keep the audit topic (__remote_log_metadata) in their cluster.
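
To make the Q2 trade-off above concrete, here is a minimal Python sketch (not Kafka code) of how retries of one logical segment land under each key scheme. The helper functions are hypothetical, and the topic name, offsets, epoch, and segment UUIDs are made-up values; only the two key shapes come from this thread.

```python
def offset_epoch_key(topic_id, partition, end_offset, leader_epoch):
    """Key shape proposed in the KIP discussion (illustrative helper)."""
    return f"{topic_id}:{partition}:{end_offset}:{leader_epoch}"


def segment_state_key(segment_id, state):
    """Alternative key shape raised in the review (illustrative helper)."""
    return f"{segment_id}:{state}"


# One logical segment: the first upload attempt fails and the retry
# gets a fresh segment UUID. All values here are invented for the example.
attempts = [
    {"segment_id": "uuid-1", "state": "COPY_SEGMENT_STARTED"},
    {"segment_id": "uuid-2", "state": "COPY_SEGMENT_STARTED"},
    {"segment_id": "uuid-2", "state": "COPY_SEGMENT_FINISHED"},
]

# A dict stands in for a compacted topic: the last write per key wins.
by_offset_epoch = {}
by_segment_state = {}
for attempt in attempts:
    by_offset_epoch[offset_epoch_key("topic-a", 0, 999, 5)] = attempt
    key = segment_state_key(attempt["segment_id"], attempt["state"])
    by_segment_state[key] = attempt

print(len(by_offset_epoch), "key(s) with the offset/epoch scheme")
print(len(by_segment_state), "key(s) with the SegmentId:State scheme")
```

With the offset/epoch key, a single tombstone is enough to clean up every attempt. With the SegmentId-based key, the broker would first have to discover each key that was ever written for the segment.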
>> > > >> > > >
>> > > >> > > > Q4a Enabling compaction on __remote_log_metadata alone may not fully address the unbounded growth, since we also need to emit tombstones for expired keys to delete them. Deferring compaction and tombstoning to user configuration could complicate the code flow, add operational complexity, and make outcomes less predictable. The proposal aims to provide a consistent experience by defining deterministic keys and emitting tombstones as part of the broker’s responsibilities, while still allowing users to opt out of the audit topic if they prefer. But I am open to more discussion if there is any concrete need I don't foresee.
>> > > >> > > >
>> > > >> > > > Thanks,
>> > > >> > > >
>> > > >> > > > Lijun Tong
>> > > >> > > >
>> > > >> > > > On Tue, Jan 6, 2026 at 01:01 Kamal Chandraprakash <[email protected]> wrote:
>> > > >> > > >
>> > > >> > > > > Hi Lijun,
>> > > >> > > > >
>> > > >> > > > > Thanks for the KIP! Went over the first pass.
>> > > >> > > > >
>> > > >> > > > > Few Questions:
>> > > >> > > > >
>> > > >> > > > > 1. Are we going to maintain the same RemoteLogMetadataTopicPartitioner <https://sourcegraph.com/github.com/apache/kafka/-/blob/storage/src/main/java/org/apache/kafka/server/log/remote/metadata/storage/RemoteLogMetadataTopicPartitioner.java> for both the topics? It is not clear in the KIP, could you clarify it?
>> > > >> > > > > 2. Can the key be changed to SegmentId:State instead of TopicId:Partition:EndOffset:BrokerLeaderEpoch if the same partitioner is used? It is good to maintain all the segment states for a user-topic-partition in the same metadata partition.
>> > > >> > > > > 3. Should we handle the records with null value (tombstone) in the ConsumerTask <https://sourcegraph.com/github.com/apache/kafka/-/blob/storage/src/main/java/org/apache/kafka/server/log/remote/metadata/storage/ConsumerTask.java?L166>?
>> > > >> > > > > 4. TBRLMM <https://sourcegraph.com/github.com/apache/kafka/-/blob/storage/src/main/java/org/apache/kafka/server/log/remote/metadata/storage/TopicBasedRemoteLogMetadataManager.java> is a sample plugin implementation of RLMM. Not sure whether the community will agree to add one more internal topic for this plugin impl.
>> > > >> > > > > 4a. Can we modify the new messages to the __remote_log_metadata topic to contain the key and leave it to the user to enable compaction for this topic if they need?
>> > > >> > > > >
>> > > >> > > > > Thanks,
>> > > >> > > > > Kamal
>> > > >> > > > >
>> > > >> > > > > On Tue, Jan 6, 2026 at 7:35 AM Lijun Tong <[email protected]> wrote:
>> > > >> > > > >
>> > > >> > > > > > Hey Henry,
>> > > >> > > > > >
>> > > >> > > > > > Thank you for your time and response! I really like your KIP-1248 about offloading the consumption of remote log away from the broker, and I think with that change, the topic that enables the tiered storage can also have longer retention configurations and would benefit from this KIP too.
>> > > >> > > > > >
>> > > >> > > > > > > Some suggestions: In your example scenarios, it would also be good to add an example of remote log segment deletion triggered by retention policy which will trigger generation of tombstone event into metadata topic and trigger log compaction/deletion 24 hour later, I think this is the key event to cap the metadata topic size.
>> > > >> > > > > >
>> > > >> > > > > > Regarding this suggestion, I am not sure whether Scenario 4 <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406618613#KIP1266:BoundingTheNumberOfRemoteLogMetadataMessagesviaCompactedTopic-Scenario4:SegmentDeletion> has covered it. I can add more rows to the Timeline Table, like T5+24hour, to explicitly show that the messages are deleted by then, and thus that the number of messages in the topic is capped.
>> > > >> > > > > >
>> > > >> > > > > > Regarding whether the topic __remote_log_metadata is still necessary, I am inclined to continue to have this topic at least for debugging purposes so we can build confidence about the compacted topic change. We can always choose to remove this topic in the future once we all agree it provides limited value for the users.
>> > > >> > > > > >
>> > > >> > > > > > Thanks,
>> > > >> > > > > > Lijun Tong
>> > > >> > > > > >
>> > > >> > > > > > On Mon, Jan 5, 2026 at 16:19 Henry Haiying Cai via dev <[email protected]> wrote:
>> > > >> > > > > >
>> > > >> > > > > > > Lijun,
>> > > >> > > > > > >
>> > > >> > > > > > > Thanks for the proposal and I liked your idea of using a compacted topic for tiered storage metadata topic.
>> > > >> > > > > > >
>> > > >> > > > > > > In our setup, we have set a shorter retention (3 days) for the tiered storage metadata topic to control the size growth. We can do that since we control all topic's retention policy in our clusters and we set a uniform retention.policy for all our tiered storage topics. I can see other users/companies will not be able to enforce that retention policy to all tiered storage topics.
>> > > >> > > > > > >
>> > > >> > > > > > > Some suggestions: In your example scenarios, it would also be good to add an example of remote log segment deletion triggered by retention policy which will trigger generation of tombstone event into metadata topic and trigger log compaction/deletion 24 hour later, I think this is the key event to cap the metadata topic size.
>> > > >> > > > > > >
>> > > >> > > > > > > For the original unbounded remote_log_metadata topic, I am not sure whether we still need it or not. If it is left only for audit trail purpose, people can set up a data ingestion pipeline to ingest the content of metadata topic into a separate storage location. I think we can have a flag to have only one metadata topic (the compacted version).
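
The deletion scenario described above (a tombstone written when a segment expires, with the record physically removed roughly 24 hours later) can be simulated in a few lines of Python. This is a simplified model of log compaction, not broker code: the `compact` helper and the record layout are hypothetical, and the 24-hour window assumes the topic keeps Kafka's default `delete.retention.ms` of 86400000 ms. The real log cleaner measures that window differently in detail.

```python
HOUR_MS = 60 * 60 * 1000
DELETE_RETENTION_MS = 24 * HOUR_MS  # assumed default delete.retention.ms


def compact(log, now_ms, delete_retention_ms=DELETE_RETENTION_MS):
    """Simplified compaction over a list of (timestamp_ms, key, value).

    Keeps only the newest record per key, and drops a tombstone
    (value None) once it is older than the retention window.
    """
    last_index = {key: i for i, (_, key, _) in enumerate(log)}
    kept = []
    for i, (ts, key, value) in enumerate(log):
        if last_index[key] != i:
            continue  # superseded by a later record with the same key
        if value is None and now_ms - ts >= delete_retention_ms:
            continue  # tombstone has outlived the retention window
        kept.append((ts, key, value))
    return kept


# Invented sample data: segment A expires at t=1000 ms, segment B stays live.
log = [
    (0, "topic-a:0:99:1", {"state": "COPY_SEGMENT_FINISHED"}),
    (0, "topic-a:0:199:1", {"state": "COPY_SEGMENT_FINISHED"}),
    (1000, "topic-a:0:99:1", None),  # tombstone for segment A
]

one_hour_later = compact(log, now_ms=1000 + HOUR_MS)
one_day_later = compact(log, now_ms=1000 + 25 * HOUR_MS)

print([key for _, key, _ in one_hour_later])
print([key for _, key, _ in one_day_later])
```

In this model the expired segment's key survives only as a tombstone for the first day and then disappears entirely, so the record count tracks the number of live segments rather than the full upload history.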
>> > > >> > > > > > >
>> > > >> > > > > > > On Monday, January 5, 2026 at 01:22:42 PM PST, Lijun Tong <[email protected]> wrote:
>> > > >> > > > > > >
>> > > >> > > > > > > Hello Kafka Community,
>> > > >> > > > > > >
>> > > >> > > > > > > I would like to start a discussion on KIP-1266, which proposes to add another new compacted remote log metadata topic for the tiered storage, to limit the number of messages that need to be iterated to build the remote metadata state.
>> > > >> > > > > > >
>> > > >> > > > > > > KIP link: KIP-1266 Bounding The Number Of RemoteLogMetadata Messages via Compacted RemoteLogMetadata Topic <https://cwiki.apache.org/confluence/display/KAFKA/KIP-1266%3A+Bounding+The+Number+Of+RemoteLogMetadata+Messages+via+Compacted+Topic>
>> > > >> > > > > > >
>> > > >> > > > > > > Background:
>> > > >> > > > > > > The current Tiered Storage implementation uses a __remote_log_metadata topic with infinite retention and delete-based cleanup policy, causing unbounded growth, slow broker bootstrap, no mechanism to clean up expired segment metadata, and inefficient re-reading from offset 0 during leadership changes.
>> > > >> > > > > > >
>> > > >> > > > > > > Proposal:
>> > > >> > > > > > > A dual-topic approach that introduces a new __remote_log_metadata_compacted topic using log compaction with deterministic offset-based keys, while preserving the existing topic for audit history; this allows brokers to build their metadata cache exclusively from the compacted topic, enables cleanup of expired segment metadata through tombstones, and includes a migration strategy to populate the new topic during upgrade, delivering bounded metadata growth and faster broker startup while maintaining backward compatibility.
>> > > >> > > > > > >
>> > > >> > > > > > > More details are in the attached KIP link.
>> > > >> > > > > > > Looking forward to your thoughts.
>> > > >> > > > > > >
>> > > >> > > > > > > Thank you for your time!
>> > > >> > > > > > >
>> > > >> > > > > > > Best,
>> > > >> > > > > > > Lijun Tong
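
As a rough illustration of building the metadata cache from the compacted topic, the sketch below replays keyed records in Python and treats tombstones as a no-op, in line with the ConsumerTask discussion above. The `replay` function, the dict-based cache, and the sample keys are assumptions for illustration only. In particular, evicting an entry on `DELETE_SEGMENT_FINISHED` is a simplification of whatever the real RemoteMetadataCache does.

```python
def replay(records):
    """Rebuild a toy metadata cache from (key, value) records in log order.

    A value of None is a tombstone and is skipped: it exists for the
    log cleaner, not for the consumer.
    """
    cache = {}
    for key, value in records:
        if value is None:
            continue  # tombstone: nothing to apply on the consumer side
        if value["state"] == "DELETE_SEGMENT_FINISHED":
            cache.pop(key, None)  # assumed: a deleted segment leaves the cache
        else:
            cache[key] = value
    return cache


# Invented sample data in the TopicId:Partition:EndOffset:BrokerLeaderEpoch shape.
records = [
    ("topic-a:0:99:1", {"state": "COPY_SEGMENT_FINISHED"}),
    ("topic-a:0:199:1", {"state": "COPY_SEGMENT_FINISHED"}),
    ("topic-a:0:99:1", {"state": "DELETE_SEGMENT_FINISHED"}),
    ("topic-a:0:99:1", None),  # tombstone emitted after the segment expired
]

cache = replay(records)
print(sorted(cache))
```

Because the deletion state arrives before the tombstone, skipping the tombstone loses nothing in this model: the cache already reflects the deletion by the time the null value is read.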
