RLMM was not designed for aggressively copying the latest data to tiered storage via small segment rollovers.
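For reference, the savings arithmetic Jun walks through later in this thread (replacing inter-broker replication transfer with S3 PUTs) can be reproduced with a short sketch. The prices and the 4MB batch size are the figures quoted in the discussion, treated here as assumptions rather than current list prices:

```python
# Back-of-envelope check of the replication-vs-PUT savings quoted in this
# thread. Prices are the AWS figures Jun cites ($0.02/GB inter-broker
# transfer, $0.005 per 1000 PUT requests); treat them as assumptions.
NETWORK_COST_PER_MB = 2e-5  # $/MB transferred between brokers
S3_PUT_COST = 0.5e-5        # $/PUT request

def savings(mb_replaced: float, s3_puts: int) -> float:
    """Fraction of network-transfer cost saved when replication is
    replaced by the given number of S3 PUTs."""
    network = mb_replaced * NETWORK_COST_PER_MB
    puts = s3_puts * S3_PUT_COST
    return (network - puts) / network

print(savings(4, 2))  # ~0.875, the 87.5% savings with 2 PUTs
print(savings(4, 6))  # ~0.625, the 62.5% savings with 6 PUTs
```

This matches the thread's point: tripling the PUT count per 4MB written cuts the savings from 87.5% to 62.5%.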
+1 to Jun on leaving the existing RLMM in place for classic topics with tiered storage and building the more efficient metadata management system that diskless topics require. On Tue, 12 May 2026 at 23:59, Jun Rao via dev <[email protected]> wrote: > > Hi, Viktor, > > Thanks for the reply. > > JR1. (A) and (B) Yes, your summary matches my thinking. > (C) "Generally I think that (i) (ii) (iii) and (iv) may be addressed with > an aggressive tiered storage consolidation (the first approach)". > Hmm, I am confused by the above statement. By "the first approach", do you > mean aggressive tiering with faster segment rolling through the existing > RLMM? I don't think the existing RLMM is designed to solve these issues due > to inefficiencies in cost, metadata propagation and metadata storage as we > previously discussed. > > JR11. I was thinking we leave the existing RLMM as is and continue to use > it for classic topics. We design a new, more efficient metadata management > component independent of RLMM. This new component will be the only metadata > component that diskless topics depend on. > > Jun > > On Tue, May 12, 2026 at 8:43 AM Viktor Somogyi-Vass <[email protected]> > wrote: > > > Hi Jun, > > > > JR1 > > (1)-(2)-(3) I'd address these together, so let me explain our current idea > > to solve the tiny object problem because I'm not sure if we're 100% talking > > about the same thing. I have two approaches in mind for TS consolidation > > ((A) and (B)) and I'm not sure if we're both assuming the same idea, so > > let's clarify this. > > > > (A) > > This is our current assumption. This uses local disks (creating classic > > local logs with UnifiedLog) to consolidate logs into the classic log format, > > and uses RSM and RLMM to store them in tiered storage. This way we're not > > limited by the need to have short rollovers. Local logs become a form of > > staging environment to serve reads and accumulate records for tiered > > storage. 
This means that: > > (a) Once a message is consolidated into the classic log format, we can > > use it for serving lagging consumers. Diskless reads should really be used > > for the head of the log and after a few seconds logs should be consolidated. > > (b) The real cost is much closer to that 87.5% (and in fact my google > > sheet I shared also assumes this model) because we have more freedom in > > choosing the retention parameters of the classic log. > > (c) Metadata is smaller as we only need to keep diskless segments until > > the tiered offset surpasses the individual batches' offset. > > (d) RLMM metadata is also somewhat manageable due to the larger segment > > sizes but it's still possible to run into the metadata explosion problem. > > (e) It needs to rebuild this local log on reassignment to serve lagging > > consumers effectively, so reassignment is a bit messier. > > (f) It's not optimal when partitions have a single replica: on failure we > > can only fall back to diskless mode until the partition is reassigned to a > > functioning broker. > > > > (B) > > Compared to the above there can be an alternative approach, which is to > > consolidate when diskless segments expire (after 15 minutes for instance). > > In that case your points seem to fit better as: > > (a) we can only use the classic, consolidated logs to serve lagging > > consumers after they have been tiered > > (b) to be more efficient with lagging consumers we have to stick to a > > short rollover > > (c) it's more costly due to the short rollovers > > (d) the RLMM bottleneck still exists due to the short rollovers > > (e) it's not a given whether we use local disks for transforming logs as we > > can do it in memory too (which can be inefficient and more expensive) but > > perhaps a “chunked transfer encoding” that S3 supports or similar with > > other providers is a cost-effective way. If we know the final size in advance, > > we can upload data in chunks and still get billed for 1 put. 
> > (f) reassignment and failover are cleaner and faster as > > there is no need to rebuild local caches. > > > > (C) > > Apart from the first 2 approaches there is a 3rd, which is WAL merging. To > > understand your points, let me summarize what I could gather so far as the > > reasons for WAL merging (and please correct me if I missed something): > > (i) protecting lagging consumers: small WAL files create inefficient objects > > for lagging consumers, so larger objects should be more efficient > > (ii) avoiding the RLMM replay bottleneck: managing small segments with > > RLMM is very inefficient (100s of GB metadata) > > (iii) reducing batch metadata overhead: merging WAL files may reduce the > > metadata we need to store, but it depends on the merge algorithm and how we > > can compact batch data > > (iv) cost effectiveness: retrieving merged WAL files reduces the number > > of get requests to object storage > > (v) architectural redundancy with RLMM: ideally we wouldn't need 2 > > solutions to 2 somewhat similar problems (tiered storage and diskless) > > > > Generally I think that (i) (ii) (iii) and (iv) may be addressed with an > > aggressive tiered storage consolidation (the first approach), so the only > > remaining gap would be (v). I also agree that having 2 different solutions > > for metadata handling isn't ideal and perhaps there is a possibility of > > improvement here. It should be possible to redesign RLMM to be more similar > > to the diskless coordinator or design a common solution. > > > > JR11 > > "If we support merging in the diskless coordinator, I wonder how useful > > RLMM > > is. It seems simpler to manage all metadata from the object store in a > > single place." > > > > Could you please clarify this a little bit? Do you think that we should > > replace the RLMM with a solution that is more similar to the diskless > > coordinator or deprecate tiered storage altogether in favor of diskless? 
> > I'm not sure which option you're referring to: > > (1) Unify tiered storage and diskless under a single storage layer (and > > possibly deprecate tiered storage in favor of diskless with merging WAL > > segments). > > (2) Create a smart coordinator instead of RLMM and possibly unify > > metadata coordination with diskless. > > (3) Keep tiered storage and diskless separate with their own solutions > > for metadata (probably not optimal). > > > > Thanks, > > Viktor > > > > On Fri, May 1, 2026 at 9:08 PM Jun Rao via dev <[email protected]> > > wrote: > > > >> Hi, Viktor and Greg, > >> > >> Thanks for the reply. > >> > >> JR1. > >> 1) Thanks for verifying the cost estimation. I noticed a bug in my earlier > >> calculation. I estimated the per broker network transfer rate at 2MB/sec. > >> It should be 4MB/sec. If I correct it, the estimated savings are similar > >> to > >> yours. > >> The cost for transferring 4MB through the network is 4 * 2 * 10^-5 = $8 * > >> 10^-5 > >> If it's replaced with 2 S3 puts, the cost is $1 * 10^-5. The savings are > >> about 87.5%. > >> If it's replaced with 6 S3 puts, the cost is $3 * 10^-5. The savings are > >> 62.5%. > >> Savings are still significantly lower when using RLMM. > >> > >> "To me it seems like Greg's previous suggestion for a 15 min rollover > >> may be a bit too much. With 1 hour we can achieve better cost saving and > >> less coordinator metadata being stored." > >> This solves the cost issue, but it has other implications (see point 2) > >> below). > >> > >> 2) "Yes, I think this is to be expected and a lot depends on the > >> implementation. Ideally segments or chunks should be cached to minimize > >> the > >> number of times segments are pulled from remote storage." > >> In a classic topic, when a consumer lags, its requests are served either > >> from the local cache or from large objects in the object store. 
With the > >> current design in a diskless topic, lagging consumer requests might be > >> served from tiny 500-byte objects. This will significantly slow down the > >> consumer's catch-up, which is not expected user behavior. Ideally, we > >> don't > >> want those tiny objects to last more than a few minutes, let alone an > >> hour. > >> > >> 3) "I think if my calculations are correct (and we use a 60 minute > >> window), > >> then metadata generation should be slower, please see the google sheet I > >> linked above. I think given that traffic, the current topic based RLMM > >> should be able to handle it." > >> Why is a 60 minute window used? RLMM metadata needs to be retained for the > >> longest retention time among all topics. This means that the retention > >> window can be weeks instead of 1 hour. This means that RLMM might need to > >> replay over 100GB of data during reassignment, which is not what it is > >> designed for. > >> > >> JR10. "Your example of 100,000 1kb/s partitions is a borderline case, > >> where > >> there are some configurations which are not viable due to scale or cost, > >> and some that are. It would be up to the operator to tune their cluster, > >> by > >> changing diskless.segment.ms, > >> dividing up the cluster, or switching to a more scalable RLMM > >> implementation." > >> A broker with 4MB/sec produce throughput can probably be considered high > >> throughput. Even with 4K partitions per broker, we could still achieve an > >> 87.5% cost saving as listed above, if we do the right implementation. So, > >> ideally, it would be useful to support that as well. > >> > >> JR11. 
"We had a short conversation with Greg and we came to the conclusion > >> that because of the explosiveness of diskless metadata, it may be worth > >> revisiting the merging case as it can indeed buy us some more cost saving > >> for the added complexity. " > >> If we support merging in the diskless coordinator, I wonder how useful > >> RLMM > >> is. It seems simpler to manage all metadata from the object store in a > >> single place. > >> > >> Jun > >> > >> On Mon, Apr 27, 2026 at 4:17 PM Greg Harris <[email protected]> wrote: > >> > >> > Hi Jun, > >> > > >> > Thank you for scrutinizing the scalability of the current > >> > direct-to-tiered-storage strategy, and its metadata scalability. > >> > > >> > One of our implicit assumptions with this design was that users are able > >> > to choose between the Diskless and Classic mechanisms, and that any > >> > situations where the Diskless design was deficient, the Classic topics > >> > could continue to be used. > >> > This was originally applied to low-latency use-cases, but now also > >> applies > >> > to low-throughput use-cases too. When the throughput on a topic is low, > >> the > >> > benefit of using Diskless is also low, because it is proportional to the > >> > amount of data transferred, and it is more likely that the batch > >> overhead > >> > of the topics is significant. > >> > In other words, we've been treating cost-effective support for > >> arbitrarily > >> > low throughput topics as a non-goal. > >> > > >> > Your example of 100,000 1kb/s partitions is a borderline case, where > >> there > >> > are some configurations which are not viable due to scale or cost, and > >> some > >> > that are. 
It would be up to the operator to tune their cluster, by > >> changing > >> > diskless.segment.ms, > >> > dividing up the cluster, or switching to a more scalable RLMM > >> > implementation. > >> > > >> > Do you think we should have cost-effective support for arbitrarily > >> > low-throughput partitions in Diskless? How much total demand is there in > >> > partitions where batches are >1kb but the partition throughput is > >> <1kb/s? > >> > > >> > Thanks, > >> > Greg > >> > > >> > On Fri, Apr 24, 2026 at 10:23 AM Viktor Somogyi-Vass <[email protected] > >> > > >> > wrote: > >> > > >> >> Hi Jun, > >> >> > >> >> Regarding JR1. > >> >> We had a short conversation with Greg and we came to the conclusion > >> that > >> >> because of the explosiveness of diskless metadata, it may be worth > >> >> revisiting the merging case as it can indeed buy us some more cost > >> saving > >> >> for the added complexity. Also, it would support smaller topics and we > >> >> could somewhat manage the tiered storage consolidation costs. I think > >> that > >> >> we would still need to consolidate WAL segments into tiered storage. > >> >> Reasons are: to limit WAL metadata, to be able to dynamically > >> >> enable/disable diskless and to be compatible with existing and future > >> TS > >> >> improvements. > >> >> I'll try to refresh KIP-1165 and build it into the calculator above (if > >> >> it's possible at all :) ) and come back to you. > >> >> Regardless, I just wanted to give a short update in the meantime, > >> looking > >> >> forward to your answer. 
> >> >> > >> >> Best, > >> >> Viktor > >> >> > >> >> On Fri, Apr 24, 2026 at 3:46 PM Viktor Somogyi-Vass < > >> >> [email protected]> > >> >> wrote: > >> >> > >> >> > Hi Jun, > >> >> > > >> >> > Thanks for the quick reply. > >> >> > > >> >> > JR1. > >> >> > 1) Thanks for putting the numbers together. While your calculation > >> >> > seems to be correct in the sense that 6 PUTs would worsen the cost > >> >> saving > >> >> > benefits, I think that in a byte for byte comparison there is a > >> bigger > >> >> > difference. The reason is that the 4 tiered storage puts transfer > >> much > >> >> more > >> >> > data compared to the small WAL segments, so in practice there should > >> be > >> >> > fewer TS puts. > >> >> > I made a google sheet calculator for this which I'd like to share > >> with > >> >> > you: > >> >> > > >> >> > >> https://docs.google.com/spreadsheets/d/127GOTWfFSN27B5ezif14GPj8KtrghjBqsXG9GG6NxhI/edit?gid=749470906#gid=749470906 > >> >> > Please copy the sheet to modify the values. > >> >> > About my findings: I was trying to create a similar cluster model > >> that > >> >> has > >> >> > been discussed here previously to see how cost varies over different > >> >> > segment rollovers. To me it seems like Greg's previous suggestion > >> >> for a > >> >> > 15 min rollover may be a bit too much. With 1 hour we can achieve > >> better > >> >> > cost saving and less coordinator metadata being stored. 
I have also > >> >> tried to > >> >> > account for the producer batch metadata generated by diskless > >> partitions > >> >> > but to me it seems like a lower number than Greg's original numbers. > >> >> > > >> >> > 2) "Note that local storage could be lost on reassigned partitions. > >> In > >> >> > that case, lagging reads can only be served from the object store." > >> >> > Yes, I think this is to be expected and a lot depends on the > >> >> > implementation. Ideally segments or chunks should be cached to > >> minimize > >> >> the > >> >> > number of times segments pulled from remote storage. > >> >> > > >> >> > "The 2MB/sec I quoted is for a specific broker. Depending on the > >> broker > >> >> > instance type, a broker may only be able to handle low 10s of MB/sec > >> of > >> >> > data. So, 2MB/sec overhead is significant." > >> >> > Yes, I have indeed misunderstood, however I have updated my > >> calculator > >> >> > sheet with metadata calculation. Overall, the number of tiered > >> storage > >> >> > segments created seems to be much lower than in your calculations > >> given > >> >> the > >> >> > parameters of the cluster you specified earlier. Please take a look, > >> I'd > >> >> > like to really understand the thinking here because this is a crucial > >> >> point. > >> >> > > >> >> > 3) I think if my calculations are correct (and we use a 60 minute > >> >> window), > >> >> > then metadata generation should be slower, please see the google > >> sheet I > >> >> > linked above. I think given that traffic, the current topic based > >> RLMM > >> >> > should be able to handle it. > >> >> > In the case where we would need to make the RLMM capable of handling > >> a > >> >> > similar traffic as the diskless coordinator, then you're right, we > >> >> probably > >> >> > should consider how we can improve it. 
I think there are multiple > >> >> > possibilities as you mentioned, but ideally there should be a common > >> >> > implementation for metadata coordination that could handle these > >> cases. > >> >> > > >> >> > JR7. > >> >> > Yes, your expectation is totally reasonable, we should expect the get > >> >> and > >> >> > put operations to be strongly consistent for the read-after-write > >> >> > scenarios. And I think that since major cloud providers give strongly > >> >> > consistent object storages, it should be sufficient for a wide > >> >> user-group. > >> >> > So we could shrink the scope of the KIP a bit this way and avoid > >> adding > >> >> > complexity that is needed mostly on the margin. > >> >> > I can expect though that "list" can stay eventually consistent as the > >> >> KIP > >> >> > relies on it for only garbage collection where it is fine if a few > >> >> segments > >> >> > can be collected only in the next iteration. > >> >> > > >> >> > JR3. > >> >> > Since Greg hasn't replied yet, I'll try to catch up with him and > >> >> formulate > >> >> > an answer next week. > >> >> > > >> >> > Best, > >> >> > Viktor > >> >> > > >> >> > On Tue, Apr 21, 2026 at 8:16 PM Jun Rao via dev < > >> [email protected]> > >> >> > wrote: > >> >> > > >> >> >> Hi, Victor, > >> >> >> > >> >> >> Thanks for the reply. > >> >> >> > >> >> >> JR1. > >> >> >> 1) "So while it seems to be significant that we tripled the number > >> of > >> >> >> PUTs, cost-wise it doesn't seem to be significant." > >> >> >> Let's compare the savings achieved by replacing network replication > >> >> >> transfer with S3 puts in AWS. > >> >> >> network transfer cost: $0.02/GB = $2 * 10^-5/MB > >> >> >> S3 put cost: $0.005 per 1000 requests = $0.5 * 10^-5/request > >> >> >> > >> >> >> The KIP batches data up to 4MB. So, let's assume that we write 2MB > >> S3 > >> >> >> objects on average. 
> >> >> >> > >> >> >> The cost for transferring 2MB through the network is 2 * 2 * 10^-5 = > >> >> $4* > >> >> >> 10^-5 > >> >> >> If it's replaced with 2 S3 puts, the cost is $1 * 10^-5. The savings > >> >> are > >> >> >> about 75%. > >> >> >> If it's replaced with 6 S3 puts, the cost is $3 * 10^-5. The savings > >> >> are > >> >> >> 25%. As you can see, the savings are significantly lower. > >> >> >> > >> >> >> 2) "Therefore we could expect classic local segments to be present > >> >> which > >> >> >> could be used for catching up consumers." > >> >> >> Note that local storage could be lost on reassigned partitions. In > >> that > >> >> >> case, lagging reads can only be served from the object store. > >> >> >> > >> >> >> "Regarding the amount of metadata: 2MB/sec is well below the 2GB/s > >> >> >> throughput that Greg calculated previously, so I think it should be > >> >> >> manageable for a cluster with that amount of throughput," > >> >> >> It seems that you didn't make the correct comparison. 2GB/s that > >> Greg > >> >> >> mentioned is the throughput for the whole cluster. The 2MB/sec I > >> >> quoted is > >> >> >> for a specific broker. Depending on the broker instance type, a > >> broker > >> >> may > >> >> >> only be able to handle low 10s of MB/sec of data. So, 2MB/sec > >> overhead > >> >> is > >> >> >> significant. > >> >> >> > >> >> >> 3) "I'd separate it from the discussion of diskless core and > >> perhaps we > >> >> >> could address it in a separate KIP as it is mostly a redesign of the > >> >> >> RLMM." > >> >> >> Those problems don't exist in the existing usage of RLMM. They > >> manifest > >> >> >> because diskless tries to use RLMM in a way it wasn't designed for > >> >> (there > >> >> >> is at least a 20X increase in metadata). It would be useful to > >> consider > >> >> >> whether fixing those problems in RLMM or using a new approach is > >> >> >> better. For example, KIP-1164 already introduces a snapshotting > >> >> mechanism. 
> >> >> >> Adding another snapshotting mechanism to RLMM seems redundant. > >> >> >> > >> >> >> JR7. A typical object store supports 3 operations: puts, gets and > >> >> lists. > >> >> >> Which operations used by diskless can be eventually consistent? I'd > >> >> expect > >> >> >> that get should always see the result of the latest put. > >> >> >> > >> >> >> Jun > >> >> >> > >> >> >> On Mon, Apr 20, 2026 at 8:14 AM Viktor Somogyi-Vass < > >> [email protected] > >> >> > > >> >> >> wrote: > >> >> >> > >> >> >> > Hi Jun, > >> >> >> > > >> >> >> > I'd like to add my thoughts too until Greg has time to respond. > >> >> >> > > >> >> >> > JR1. I also think there are shortcomings in the current tiered > >> >> storage > >> >> >> > design, around the RLMM. > >> >> >> > 1) I think this is a correct observation, however if my > >> calculations > >> >> are > >> >> >> > correct, it actually comes down to a negligible amount of cost. > >> >> Taking > >> >> >> the > >> >> >> > AWS pricing sheet at > >> >> >> > > >> >> >> > >> >> > >> https://aws.amazon.com/s3/pricing/?nc2=h_pr_s3&trk=aebc39a1-139c-43bb-8354-211ac811b83a&sc_channel=ps > >> >> >> > it seems like the difference between 6 or 2 PUTs per second is > >> ~$52 > >> >> for > >> >> >> a > >> >> >> > month. The calculation follows > >> >> >> > as: 6*60*60*24*30*0.005/1000-2*60*60*24*30*0.005/1000=$51.84. 
So > >> >> while > >> >> >> it > >> >> >> > seems to be significant that we tripled the number of PUTs, > >> >> cost-wise it > >> >> >> > doesn't seem to be significant. > >> >> >> > 2) Reflecting to your original problem: the tiered storage > >> >> consolidation > >> >> >> > process should be continuously running and transforming WAL > >> segments > >> >> >> into > >> >> >> > classic logs. Therefore we could expect classic local segments to > >> be > >> >> >> > present which could be used for catching up consumers. So they > >> would > >> >> >> only > >> >> >> > switch to WAL reading when they're close to the end of the log. > >> Since > >> >> >> this > >> >> >> > offset space should be cached, the reads from there should be > >> fast. > >> >> >> > Regarding the amount of metadata: 2MB/sec is well below the 2GB/s > >> >> >> > throughput that Greg calculated previously, so I think it should > >> be > >> >> >> > manageable for a cluster with that amount of throughput, although > >> I > >> >> >> agree > >> >> >> > with your comment that the current topic based tiered metadata > >> >> manager > >> >> >> > isn't optimal and we could develop a better solution. > >> >> >> > 3) Tied to the previous point, I agree that your comments are > >> >> absolutely > >> >> >> > valid, however similarly to that, I'd separate it from the > >> >> discussion of > >> >> >> > diskless core and perhaps we could address it in a separate KIP as > >> >> it is > >> >> >> > mostly a redesign of the RLMM. > >> >> >> > > >> >> >> > JR2. Ack. We will raise a KIP in the near future. > >> >> >> > > >> >> >> > JR3. I'd leave answering this to Greg as I don't have too much > >> >> context > >> >> >> on > >> >> >> > this one. > >> >> >> > > >> >> >> > JR7. I think this could be similar to the tiered storage design, > >> so > >> >> any > >> >> >> > coordinator operation should be strongly consistent (since we're > >> >> using > >> >> >> > classic topics there). 
Therefore the WAL segment storage layer > >> could > >> >> be > >> >> >> > eventually consistent as we store its metadata in a strongly > >> >> consistent > >> >> >> > manner. I'm not sure though if this was the answer you're looking > >> >> for? > >> >> >> > > >> >> >> > Best, > >> >> >> > Viktor > >> >> >> > > >> >> >> > > >> >> >> > > >> >> >> > On Thu, Mar 26, 2026 at 11:43 PM Jun Rao via dev < > >> >> [email protected]> > >> >> >> > wrote: > >> >> >> > > >> >> >> >> Hi, Greg, > >> >> >> >> > >> >> >> >> Thanks for the reply. > >> >> >> >> > >> >> >> >> JR1. Rolling log segments every 15 minutes addresses the 3 > >> concerns > >> >> I > >> >> >> >> listed, but it introduces some new issues because it doesn't > >> quite > >> >> fit > >> >> >> the > >> >> >> >> design of the current tiered storage. (a) The current tiered > >> storage > >> >> >> >> design > >> >> >> >> stores a single partition per object. If we roll a log segment > >> >> every 15 > >> >> >> >> minutes, with 4K partitions per broker, this means an additional > >> 4 > >> >> S3 > >> >> >> puts > >> >> >> >> per second. The diskless design aims for 2 S3 puts per second. > >> So, > >> >> this > >> >> >> >> triples the S3 put cost and reduces the savings benefits. (b) > >> With > >> >> Tier > >> >> >> >> storage, each broker essentially needs to read the tier metadata > >> >> from > >> >> >> all > >> >> >> >> tier metadata partitions if the number of user partitions exceeds > >> >> 50. > >> >> >> >> Assuming that we generate 100 bytes of tier metadata per > >> partition > >> >> >> every > >> >> >> >> 15 > >> >> >> >> minutes. Assuming that each broker has 4K partitions and a > >> cluster > >> >> of > >> >> >> 500 > >> >> >> >> brokers. Each broker needs to receive tier metadata at a rate of > >> >> 100 * > >> >> >> 4K > >> >> >> >> * > >> >> >> >> 500 / (15 * 60) = 200KB/Sec. 
For a broker hosting one of the 50 > >> tier > >> >> >> >> metadata topic partitions, it needs to send out metadata at 100 * > >> >> 4K * > >> >> >> 500 > >> >> >> >> / 50 * 500 / (15 * 60) = 2MB/Sec. This increases unnecessary > >> network > >> >> >> and > >> >> >> >> CPU overhead. (c) Tier storage doesn't support snapshots. A > >> >> restarted > >> >> >> >> broker needs to replay the tier metadata log from the beginning > >> to > >> >> >> build > >> >> >> >> the tier metadata state. Suppose that the tier metadata log is > >> kept > >> >> >> for 7 > >> >> >> >> days. The total amount of tier metadata that needs to be > >> replayed is > >> >> >> 200KB > >> >> >> >> * 7 * 24 * 3600 = 120GB. > >> >> >> >> Does the merging optimization you mentioned address those new > >> >> >> concerns? If > >> >> >> >> so, could you describe how it works? > >> >> >> >> > >> >> >> >> JR2. It's fine to cover the default partition assignment strategy > >> >> for > >> >> >> >> diskless topics in a separate KIP. However, since this is > >> essential > >> >> for > >> >> >> >> achieving the cost saving goal, we need a solution before > >> releasing > >> >> the > >> >> >> >> diskless KIP. > >> >> >> >> > >> >> >> >> JR3. Sounds good. Could you document how this works? > >> >> >> >> > >> >> >> >> JR7. Could you describe which parts of the operation can be > >> >> eventually > >> >> >> >> consistent? > >> >> >> >> > >> >> >> >> Jun > >> >> >> >> > >> >> >> >> On Thu, Mar 19, 2026 at 1:35 PM Greg Harris < > >> [email protected]> > >> >> >> wrote: > >> >> >> >> > >> >> >> >> > Hi Jun, > >> >> >> >> > > >> >> >> >> > Thanks for your comments! 
> >> >> >> >> > > >> >> >> >> > JR1: > >> >> >> >> > You are correct that the segment rolling configurations are > >> >> currently > >> >> >> >> > critical to balance the scalability of Diskless and Tiered > >> >> Storage, > >> >> >> as > >> >> >> >> > larger roll configurations benefit tiered storage, and smaller > >> >> roll > >> >> >> >> > configurations benefit Diskless. > >> >> >> >> > > >> >> >> >> > To address your points specifically: > >> >> >> >> > (1) A Diskless topic which is cost-competitive with an > >> equivalent > >> >> >> >> Classic > >> >> >> >> > topic will have a metadata size <1% of the data size. A cluster > >> >> >> storing > >> >> >> >> > 360GB of metadata will have >36TB of data under management and > >> a > >> >> >> >> retention > >> >> >> >> > of 5hr implies a throughput of >2GB/s. This will require > >> multiple > >> >> >> >> Diskless > >> >> >> >> > coordinators, which can share the load of storing the Diskless > >> >> >> metadata, > >> >> >> >> > and serving Diskless requests. > >> >> >> >> > (2) Catching up consumers are intended to be served from tiered > >> >> >> storage > >> >> >> >> > and local segment caches. Brokers which are building their > >> local > >> >> >> segment > >> >> >> >> > caches will have to read many files, but will amortize those > >> >> reads by > >> >> >> >> > receiving data for multiple partitions in a single read. > >> >> >> >> > (3) This is a fundamental downside of storing data from > >> multiple > >> >> >> topics > >> >> >> >> in > >> >> >> >> > a single object, similar to classic segments. We can implement > >> a > >> >> >> >> > configurable cluster-wide maximum roll time, which would set > >> the > >> >> >> slowest > >> >> >> >> > cadence at which Tiered Storage segments are rolled from > >> Diskless > >> >> >> >> segments. > >> >> >> >> > If an individual partition has more aggressive roll settings, > >> it > >> >> may > >> >> >> be > >> >> >> >> > rolled earlier. 
> >> >> >> > This configuration would permit the cluster operator to > >> >> approximately > >> >> >> > bound the number of diskless WAL segments, which bounds the > >> total > >> >> >> size > >> >> >> >> of > >> >> >> > the WAL segments, disk cache, diskless coordinator state, and > >> >> >> excessive > >> >> >> > retention window. For example, a diskless.segment.ms of 15 minutes > >> >> >> >> would > >> >> >> > reduce the metadata storage to 18GB, WAL segments to 1.8TB, and > >> >> >> permit > >> >> >> > short-retention data to be physically deleted as soon as ~15 > >> >> minutes > >> >> >> >> after > >> >> >> > being produced. > >> >> >> >> > Of course, this will reduce the size of the tiered storage > >> >> segments > >> >> >> for > >> >> >> > topics that have low throughput, and where segment.ms > diskless.segment.ms, > >> >> increasing overhead in the RLMM. 
> >> >> >> >> > We can perform merging/optimization of Tiered Storage segments to achieve the per-topic segment.ms. There were some reasons why we retracted the prior file-merging approach, and why merging in tiered storage appears better:
> >> >> >> >> > * Rewriting files requires mutability for existing data, which adds complexity. Diskless batches or Remote Log Segments would need to be made mutable, and the remote log will be made mutable in KIP-1272 [1]
> >> >> >> >> > * Because a WAL Segment can contain batches from multiple Diskless Coordinators, multiple coordinators must also be involved in the merging step. The Tiered Storage design has exclusive ownership for remote log segments within the RLMM.
> >> >> >> >> > * Diskless file merging competes for resources with latency-sensitive producers and hot consumers. Tiered storage file merging competes for resources with lagging consumers, which are typically less latency sensitive.
> >> >> >> >> > * Implementing merging in Tiered Storage allows this optimization to benefit both classic topics and diskless topics, covering both high and low throughput partitions.
> >> >> >> >> > * Remote log segments may be optimized over much longer time windows, rather than performing optimization once in the first few hours of the life of a WAL segment and then freezing the arrangement of the data until it is deleted.
> >> >> >> >> > * File merging will need to rely on heuristics, which should be configurable by the user. Multi-partition heuristics are more complicated to describe and reason about than single-partition heuristics.
> >> >> >> >> > What do you think of this alternative?
> >> >> >> >> >
> >> >> >> >> > JR2:
> >> >> >> >> > Yes, the current default partition assignment strategy will need some improvement. This problem with Diskless WAL segments is analogous to the Classic topics' dense inter-broker connection graph.
> >> >> >> >> > The natural solution to this seems to be some sort of cellular design, where the replica placements tend to locate partitions in similar groups. Partitions in the same cell can generally share the same WAL Segments and the same Diskless Coordinator requests. This would also benefit Classic topics, which would need fewer connections and fetch requests.
> >> >> >> >> > Such a feature is out of scope of this KIP, and either we will publish a follow-up KIP or let operators and community tooling address this.
> >> >> >> >> >
> >> >> >> >> > JR3:
> >> >> >> >> > Yes, we will replace the ISR/ELR election logic for diskless topics, as they no longer rely on replicas for data integrity.
> >> >> >> >> > We will fully model the state/lifecycle of the diskless replicas in KRaft, and choose how we display this to clients.
> >> >> >> >> > For backwards compatibility, clients using older metadata requests should see diskless topics but interpret them as classic topics. We could tell older clients that the leader is in the ISR, even if it just started building its cache.
> >> >> >> >> > For clients using the latest metadata, they should see the true state of the diskless partition: which nodes can accept produce/fetch/sharefetch requests, which ranges of offsets are cached on-broker, etc. This could also be used to break apart the “leader” field into more granular fields, now that leadership has changed meaning.
> >> >> >> >> >
> >> >> >> >> > JR4:
> >> >> >> >> > Yes, we can replace the empty fetch requests to the leader nodes with cache hint fields in the requests to the Diskless Coordinator, and rely on the coordinator to distribute cache hints to all replicas. This should be low-overhead, and eliminate the inter-broker communication for brokers which only host Diskless topics.
> >> >> >> >> >
> >> >> >> >> > JR5.1:
> >> >> >> >> > You are correct and this text was ambiguous, only specifying that the controller waits for the sync to be complete. This section is now updated to explicitly say that local segments are built from object storage.
> >> >> >> >> > JR5.2:
> >> >> >> >> > Extending the JR2 discussion, reassignment of diskless topics would generally happen within a cell, where the marginal cost of reading an additional partition is very low. When cells are re-balanced and a partition is migrated between cells, there is a brief time (until the next Tiered Storage segment roll) when the marginal cost is doubled. This should be infrequent and well-amortized by other topics which aren’t being re-balanced between cells.
> >> >> >> >> >
> >> >> >> >> > JR6.1:
> >> >> >> >> > We plan to move data from Diskless to Tiered Storage. Once the data is in Tiered Storage, it can be compacted using the functionality described in KIP-1272 [1]
> >> >> >> >> >
> >> >> >> >> > JR6.2:
> >> >> >> >> > We will add details for this soon.
> >> >> >> >> >
> >> >> >> >> > JR7:
> >> >> >> >> > We specify the requirement of eventual consistency to allow Diskless Topics to be used with other object storage implementations which aren’t the three major public clouds, such as self-managed software or weaker consistency caches.
> >> >> >> >> > Thanks,
> >> >> >> >> > Greg
> >> >> >> >> >
> >> >> >> >> > [1] https://cwiki.apache.org/confluence/display/KAFKA/KIP-1272%3A+Support+compacted+topic+in+tiered+storage
> >> >> >> >> >
> >> >> >> >> > On Fri, Mar 6, 2026 at 4:14 PM Jun Rao via dev <[email protected]> wrote:
> >> >> >> >> >
> >> >> >> >> >> Hi, Ivan,
> >> >> >> >> >>
> >> >> >> >> >> Thanks for the KIP. A few comments below.
> >> >> >> >> >>
> >> >> >> >> >> JR1. I am concerned about the usage of the current tiered storage to control the number of small WAL files. Current tiered storage only tiers the data when a segment rolls, which can take hours. This causes three problems. (1) Much more metadata needs to be stored and maintained, which increases the cost. Suppose that each segment rolls every 5 hours, each partition generates 2 WAL files per second, and each WAL file's metadata takes 100 bytes. Each partition will generate 5 * 3.6K * 2 * 100 = 3.6MB of metadata. In a cluster with 100K partitions, this translates to 360GB of metadata stored on the diskless coordinators.
> >> >> >> >> >> (2) A catching-up consumer's performance degrades since it's forced to read data from many small WAL files. (3) The data in WAL files could be retained much longer than the retention time. Since the small WAL files aren't completely deleted until all partitions' data in them are obsolete, the deletion of the WAL files could be delayed by hours or more. If a WAL file includes a partition with a low retention time, the retention contract could be violated significantly. The earlier design of the KIP included a separate object merging process that combines small WAL files much more aggressively than tiered storage, which seems to be a much better choice.
> >> >> >> >> >>
> >> >> >> >> >> JR2. I don't think the current default partition assignment strategy for classic topics works for diskless topics. The current strategy tries to spread the replicas to as many brokers as possible. For example, if a broker has 100 partitions, their replicas could be spread over 100 brokers. If the broker generates a WAL file with 100 partitions, this WAL file will be read 100 times, once by each broker. S3 read cost is 1/12 of the cost of an S3 put. This assignment strategy will increase the S3 cost by about 8X, which is prohibitive. We need to design a cost-effective assignment strategy for diskless topics.
> >> >> >> >> >>
> >> >> >> >> >> JR3.
> >> >> >> >> >> We need to think through the leader election logic with diskless topics. The KIP tries to reuse the ISR logic for classic topics, but it doesn't seem very natural.
> >> >> >> >> >> JR3.1 In classic topics, the leader is always in the ISR. In diskless topics, the KIP says that a leader could be out of sync.
> >> >> >> >> >> JR3.2 The existing leader election logic based on ISR/ELR mainly tries to preserve previously acknowledged data. With diskless topics, since the object store provides durability, this logic seems no longer needed. The existing min.isr and unclean leader election logic also don't apply.
> >> >> >> >> >>
> >> >> >> >> >> JR4. "Despite that there is no inter-broker replication, replicas will still issue FetchRequest to leaders. Leaders will respond with empty (no records) FetchResponse."
> >> >> >> >> >> This seems unnatural. Could we avoid issuing inter-broker fetch requests for diskless topics?
> >> >> >> >> >>
> >> >> >> >> >> JR5. "The replica reassignment will follow the same flow as in classic topic:".
> >> >> >> >> >> JR5.1 Is this true? Since the inter-broker fetch response is always empty, it doesn't seem the current reassignment flow works for diskless topics. Also, since the source of the data is the object store, it seems more natural for a replica to backfill the data from the object store instead of from other replicas. This will also incur lower costs.
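The S3 cost estimate in JR2 above can be reproduced with a quick sketch. The 1/12 GET-to-PUT price ratio is the thread's assumption; absolute prices cancel out of the ratio.

```python
# Rough reproduction of the JR2 S3 cost estimate from this thread.
# Assumption (from the thread, not a quoted price list): an S3 GET costs
# roughly 1/12 of a PUT. Only the ratio matters.
PUT_COST = 1.0
GET_COST = PUT_COST / 12

def relative_cost(readers: int) -> float:
    """Cost of one WAL file PUT once and then GET by `readers` brokers,
    relative to the cost of the PUT alone."""
    return (PUT_COST + readers * GET_COST) / PUT_COST

# One broker's 100 partitions with replicas spread over 100 brokers:
# every WAL file it writes is fetched 100 times.
print(relative_cost(100))  # ~9.33: the reads add roughly 8x the PUT cost
```

This matches the "about 8X" increase Jun cites: 100 reads at 1/12 of a PUT each add 100/12 ≈ 8.3 PUT-equivalents on top of the write itself.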
> >> >> >> >> >> JR5.2 How do we prevent reassignment on diskless topics from causing the same cost issue described in JR2?
> >> >> >> >> >>
> >> >> >> >> >> JR6. "In other functional aspects, diskless topics are indistinguishable from classic topics. This includes durability guarantees, ordering guarantees, transactional and non-transactional producer API, consumer API, consumer groups, share groups, data retention (deletion & compact),"
> >> >> >> >> >> JR6.1 Could you describe how compacted diskless topics are supported?
> >> >> >> >> >> JR6.2 Neither this KIP nor KIP-1164 describes the transactional support in detail.
> >> >> >> >> >>
> >> >> >> >> >> JR7. "Object Storage: A shared, durable, concurrent, and eventually consistent storage supporting arbitrary sized byte values and a minimal set of atomic operations: put, delete, list, and ranged get."
> >> >> >> >> >> It seems that the object storage in all three major public clouds is strongly consistent.
> >> >> >> >> >>
> >> >> >> >> >> Jun
> >> >> >> >> >>
> >> >> >> >> >> On Mon, Mar 2, 2026 at 5:43 AM Ivan Yurchenko <[email protected]> wrote:
> >> >> >> >> >>
> >> >> >> >> >> > Hi all,
> >> >> >> >> >> >
> >> >> >> >> >> > The parent KIP-1150 was voted for and accepted. Let's now focus on the technical details presented in this KIP-1163 and also in KIP-1164: Diskless Coordinator [1].
> >> >> >> >> >> > Best,
> >> >> >> >> >> > Ivan
> >> >> >> >> >> >
> >> >> >> >> >> > [1] https://cwiki.apache.org/confluence/display/KAFKA/KIP-1164%3A+Diskless+Coordinator
> >> >> >> >> >> >
> >> >> >> >> >> > On Wed, Apr 23, 2025, at 11:41, Ivan Yurchenko wrote:
> >> >> >> >> >> > > Hi all!
> >> >> >> >> >> > >
> >> >> >> >> >> > > We want to start the discussion thread for KIP-1163: Diskless Core [1], which is a sub-KIP for KIP-1150 [2].
> >> >> >> >> >> > >
> >> >> >> >> >> > > Let's use the main KIP-1150 discuss thread [3] for high-level questions, motivation, and general direction of the feature and this thread for particular details of implementation.
> >> >> >> >> >> > > Best,
> >> >> >> >> >> > > Ivan
> >> >> >> >> >> > >
> >> >> >> >> >> > > [1] https://cwiki.apache.org/confluence/display/KAFKA/KIP-1163%3A+Diskless+Core
> >> >> >> >> >> > > [2] https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics
> >> >> >> >> >> > > [3] https://lists.apache.org/thread/ljxc495nf39myp28pmf77sm2xydwjm6d
