Hi Justine, Luke, and others, I believe a 3.8 version would make sense, and I would say KIP-853 should be part of it as well.
Best, On Wed, Dec 20, 2023 at 4:11 PM Justine Olshan <jols...@confluent.io.invalid> wrote: > Hey Luke, > > I think your point is valid. This is another good reason to have a 3.8 > release. > Would you say that implementing KIP-966 in 3.8 would be an acceptable way > to move forward? > > Thanks, > Justine > > > On Tue, Dec 19, 2023 at 4:35 AM Luke Chen <show...@gmail.com> wrote: > > > Hi Justine, > > > > Thanks for your reply. > > > > > I think that for folks that want to prioritize availability over > > durability, the aggressive recovery strategy from KIP-966 should be > > preferable to the old unclean leader election configuration. > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas#KIP966:EligibleLeaderReplicas-Uncleanrecovery > > > > Yes, I'm aware that we're going to implement the new way of leader > election > > in KIP-966. > > But obviously, KIP-966 is not included in v3.7.0. > > What I'm worried about is the users who prioritize availability over > > durability and enable unclean leader election in ZK mode. > > Once they migrate to KRaft, there will be an availability impact when > unclean > > leader election is needed. > > And like you said, they can run unclean leader election via the CLI, but > again, > > the availability is already impacted, which might be unacceptable in some > > cases. > > > > IMO, we should prioritize this missing feature and include it in a 3.x > > release. > > Including it in a 3.x release means users can migrate to KRaft in dual-write > > mode, and run it for a while to make sure everything works fine, before > > they decide to upgrade to 4.0. > > > > Does that make sense? > > > > Thanks. 
> > Luke > > > > On Tue, Dec 19, 2023 at 12:15 AM Justine Olshan > > <jols...@confluent.io.invalid> wrote: > > > > > Hey Luke -- > > > > > > There were some previous discussions on the mailing list about this, but > > > it looks like we didn't file the ticket > > > https://lists.apache.org/thread/sqsssos1d9whgmo92vdn81n9r5woy1wk > > > > > > When I asked some of the folks who worked on KRaft about this, they > > > communicated to me that it was intentional to make unclean leader > > election > > > a manual action. > > > > > > I think that for folks that want to prioritize availability over > > > durability, the aggressive recovery strategy from KIP-966 should be > > > preferable to the old unclean leader election configuration. > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas#KIP966:EligibleLeaderReplicas-Uncleanrecovery > > > > > > Let me know if we don't think this is sufficient. > > > > > > Justine > > > > > > On Mon, Dec 18, 2023 at 4:39 AM Luke Chen <show...@gmail.com> wrote: > > > > > > > Hi all, > > > > > > > > We found that currently (on the latest trunk branch), unclean leader > > > > election is not supported in KRaft mode. > > > > That is, when users enable `unclean.leader.election.enable` in KRaft > > > mode, > > > > the config won't take effect, and the cluster just behaves as if > > > > `unclean.leader.election.enable` were disabled. > > > > KAFKA-12670 <https://issues.apache.org/jira/browse/KAFKA-12670> was > > > opened > > > > for this and is still not resolved. > > > > > > > > I think this is a regression in KRaft mode, and we should > > complete > > > > this missing feature in a 3.x release, instead of adding it in 4.0. > > > > Does anyone know the status of this issue? > > > > > > > > Thanks. 
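[Editor's note] The "manual action" Justine mentions is exposed through the `kafka-leader-election.sh` tool that ships with Kafka. A minimal sketch of what an operator would run; the topic name, partition, and bootstrap address below are placeholders, and the command is only assembled and printed here, since actually running it requires a live cluster:

```shell
# Placeholders for illustration only.
topic="my-topic"
partition=0
bootstrap="localhost:9092"

# Manually trigger an unclean election for one partition.
# kafka-leader-election.sh also accepts --all-topic-partitions
# or --path-to-json-file for batches of partitions.
cmd="bin/kafka-leader-election.sh --bootstrap-server $bootstrap --election-type unclean --topic $topic --partition $partition"

# Printed rather than executed, since it needs a running cluster.
echo "$cmd"
```

This is the per-partition escape hatch; the point being debated above is that invoking it by hand still means the partition was unavailable until an operator noticed.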
> > > > Luke > > > > > > > > > > > > > > > > On Mon, Nov 27, 2023 at 4:38 PM Colin McCabe <cmcc...@apache.org> > > wrote: > > > > > > > > > On Fri, Nov 24, 2023, at 03:47, Anton Agestam wrote: > > > > > > In your last message you wrote: > > > > > > > > > > > > > But, on the KRaft side, I still maintain that nothing is > missing > > > > except > > > > > > > JBOD, which we already have a plan for. > > > > > > > > > > > > But earlier in this thread you mentioned an issue with "torn > > writes", > > > > > > possibly missing tests, as well as the fact that the recommended > > > method > > > > > of > > > > > > replacing controller nodes is undocumented. Would you mind > > clarifying > > > > > what > > > > > > your stance is on these three issues? Do you think that they are > > > > > important > > > > > > enablers of upgrade paths or not? > > > > > > > > > > Hi Anton, > > > > > > > > > > There shouldn't be anything blocking controller disk replacement > now. > > > > From > > > > > memory (not looking at the code now), we do log recovery on our > > single > > > > log > > > > > directory every time we start the controller, so it should handle > > > partial > > > > > records there. I do agree that a test would be good, and some > > > > > documentation. I'll probably take a look at that this week if I get > > > some > > > > > time. > > > > > > > > > > > > Well, the line was drawn in KIP-833. If we redraw it, what is > to > > > stop > > > > > us > > > > > > > from redrawing it again and again? > > > > > > > > > > > > I'm fairly new to the Kafka community so please forgive me if I'm > > > > missing > > > > > > things that have been said in earlier discussions, but reading up > > on > > > > that > > > > > > KIP I see it has language like "Note: this timeline is very rough > > and > > > > > > subject to change." 
in the versions section, but it also says > > "As > > > > > > outlined above, we expect to close these gaps soon" in relation > > to > > > > the > > > > > > outstanding features. From my perspective this doesn't really > look > > > like > > > > > an > > > > > > agreement that dynamic quorum membership changes shall not be a > > > blocker > > > > > for > > > > > > 4.0. > > > > > > > > > > The timeline was rough because we wrote that in 2022, trying to > look > > > > > forward multiple releases. The gaps that were discussed have all > been > > > > > closed -- except for JBOD, which we are working on this quarter. > > > > > > > > > > The set of features needed for 4.0 is very clearly described in > > > KIP-833. > > > > > There's no uncertainty on that point. > > > > > > > > > > > To answer the specific question you pose here, "what is to stop > us > > > from > > > > > > redrawing it again and again?", wouldn't the suggestion of > parallel > > > > work > > > > > > lanes brought up by Josep address this concern? > > > > > > > > > > It's very important not to fragment the community by supporting > > > multiple > > > > > long-running branch lines. At the end of the day, once branch 3's > > time > > > > has > > > > > come, it needs to fade away, just like JDK 6 support or the old > Scala > > > > > producer. > > > > > > > > > > best, > > > > > Colin > > > > > > > > > > > BR, > > > > > > Anton > > > > > > > > > > > > On Thu, Nov 23, 2023 at 05:48 Colin McCabe < > > > cmcc...@apache.org > > > > > wrote: > > > > > > > > > > >> On Tue, Nov 21, 2023, at 19:30, Luke Chen wrote: > > > > > >> > Yes, KIP-853 and disk failure support are both very important > > > > missing > > > > > >> > features. For the disk failure support, I don't think this is > a > > > > > >> > "good-to-have" feature, it should be a "must-have" IMO. We > can't > > > > > announce > > > > > >> > the 4.0 release without a good solution for disk failure in > > KRaft. 
> > > > >> > > > > >> Hi Luke, > > > > >> > > > > >> Thanks for the reply. > > > > >> > > > > >> Controller disk failure support is not missing from KRaft. I > > > described > > > > how > > > > >> to handle controller disk failures earlier in this thread. > > > > >> > > > > >> I should note here that the broker in ZooKeeper mode also > requires > > > > manual > > > > >> handling of disk failures. Restarting a broker with the same ID, > > but > > > an > > > > >> empty disk, breaks the invariants of replication when in ZK > mode. > > > > Consider: > > > > >> > > > > >> 1. Broker 1 goes down. A ZK state change notification for > /brokers > > > > fires > > > > >> and goes on the controller queue. > > > > >> > > > > >> 2. Broker 1 comes back up with an empty disk. > > > > >> > > > > >> 3. The controller processes the ZK state change notification for > > > > /brokers. > > > > >> Since broker 1 is up, no action is taken. > > > > >> > > > > >> 4. Now broker 1 is in the ISR for any partitions it was in > > previously, > > > > but > > > > >> has no data. If it is or becomes leader for any partitions, > > > > irreversible > > > > >> data loss will occur. > > > > >> > > > > >> This problem is more than theoretical. We at Confluent have > > observed > > > > it > > > > > in > > > > >> production and put in place special workarounds for the ZK > > clusters > > we > > > > >> still have. > > > > >> > > > > >> KRaft has never had this problem because brokers are removed > from > > > ISRs > > > > >> when a new incarnation of the broker registers. > > > > >> > > > > >> So perhaps ZK mode is not ready for production for Aiven? Since > > disk > > > > >> failures do in fact require special handling there. (And/or > > bringing > > > > up > > > > > new > > > > >> nodes with empty disks, which seems to be their main concern.) 
> > > > >> > > > > > > >> > It’s also worth thinking about how Apache Kafka users who > depend > > > on > > > > > JBOD > > > > > >> > might look at the risks of not having a 3.8 release. JBOD > > support > > > on > > > > > >> KRaft > > > > > >> > is planned to be added in 3.7, and is still in progress so > far. > > So > > > > > it’s > > > > > >> > hard to say whether it’s a blocker or not. But in practice, even if > the > > > > > feature > > > > > >> > makes it into 3.7 in time, a lot of new code for this feature is > > > > unlikely > > > > > to > > > > > >> be > > > > > >> > entirely bug-free. We need to maintain the confidence of those > > > > users, > > > > > and > > > > > >> > forcing them to migrate through 3.7 where this new code is > > hardly > > > > > >> > battle-tested doesn’t appear to do that. > > > > > >> > > > > > > >> > > > > > >> As Ismael said, if there are JBOD bugs in 3.7, we will do > > follow-on > > > > > point > > > > > >> releases to address them. > > > > > >> > > > > > >> > Our goal for 4.0 should be that all the “main” features in > KRaft > > > are > > > > > in > > > > > >> > a production-ready state. To reach the goal, I think having one > > more > > > > > >> release > > > > > >> > makes sense. We can have different opinions about what the > “main > > > > > >> features” > > > > > >> > in KRaft are, but we should all agree that JBOD is one of them. > > > > > >> > > > > > >> The current plan is for JBOD to be production-ready in the 3.7 > > > branch. > > > > > >> > > > > > >> The other features of KRaft have been in a production-ready state > > > since > > > > > the > > > > > >> 3.3 release. (Well, except for delegation tokens and SCRAM, > which > > > were > > > > > >> implemented in 3.5 and 3.6.) > > > > > >> > > > > > >> > I totally agree with you. We can keep delaying the 4.0 release > > > > > forever. > > > > > >> I'd > > > > > >> > also like to draw a line. 
So, in my opinion, the 3.8 > > release > > is > > > the > > > > > >> > line. No 3.9, 3.10 releases after that. If this is the > decision, > > > > will > > > > > >> your > > > > > >> > concern about this infinite loop disappear? > > > > > >> > > > > > >> Well, the line was drawn in KIP-833. If we redraw it, what is to > > > stop > > > > us > > > > > >> from redrawing it again and again? > > > > > >> > > > > > > >> > Final note: Speaking of the missing features, I can always > > > cooperate > > > > > with > > > > > >> > you and all other community contributors to make them happen, > > like > > > > we > > > > > >> have > > > > > >> > discussed earlier. Just let me know. > > > > > >> > > > > > > >> > > > > > >> Thanks, Luke. I appreciate the offer. > > > > > >> > > > > > >> But, on the KRaft side, I still maintain that nothing is missing > > > > except > > > > > >> JBOD, which we already have a plan for. > > > > > >> > > > > > >> best, > > > > > >> Colin > > > > > >> > > > > > >> > > > > > >> > Thank you. > > > > > >> > Luke > > > > > >> > > > > > > >> > On Wed, Nov 22, 2023 at 2:54 AM Colin McCabe < > > cmcc...@apache.org> > > > > > wrote: > > > > > >> > > > > > > >> >> On Tue, Nov 21, 2023, at 03:47, Josep Prat wrote: > > > > > >> >> > Hi Colin, > > > > > >> >> > > > > > > >> >> > I think it's great that Confluent runs KRaft clusters in > > > > > production, > > > > > >> >> > and it means that it is production ready for Confluent and > > its > > > > > users. > > > > > >> >> > But luckily for Kafka, the community is bigger than this > > (self-managed > > > > > >> >> > in the cloud or on-prem, or customers of other SaaS > > companies). > > > > > >> >> > > > > > >> >> Hi Josep, > > > > > >> >> > > > > > >> >> Confluent is not the only company using or developing KRaft. > > Most > > > > of > > > > > the > > > > > >> >> big organizations developing Kafka are involved. 
I mentioned > > > > > Confluent's > > > > > >> >> deployments because I wanted to be clear that KRaft mode is not > > > > > >> >> experimental or new. Talking about software in production is a > > > good > > > > > way > > > > > >> to > > > > > >> >> clear up these misconceptions. > > > > > >> >> > > > > > >> >> Indeed, KRaft mode is many years old. It started around 2020, > > and > > > > > became > > > > > >> >> production-ready in AK 3.3 in 2022. ZK mode was deprecated in > > AK > > > > 3.5, > > > > > >> which > > > > > >> >> was released June 2023. If we release AK 4.0 around April (or > > > > maybe a > > > > > >> month > > > > > >> >> or two later), then that will be almost a full year between > > > > > deprecation > > > > > >> and > > > > > >> >> removal of ZK mode. We've talked about this a lot, in KIPs, in > > > > Apache > > > > > >> blog > > > > > >> >> posts, at conferences, and so forth. > > > > > >> >> > > > > > >> >> > We've heard from at least one SaaS company, Aiven > (disclaimer, > > it > > > > is > > > > > my > > > > > >> >> > employer) where the current feature set makes it not > trivial > > to > > > > > >> >> > migrate. This same issue might happen not only at Aiven but > > > with > > > > > any > > > > > >> >> > user of Kafka who uses immutable infrastructure. > > > > > >> >> > > > > > >> >> Can you discuss why you feel it is "not trivial to migrate"? > > From > > > > the > > > > > >> >> discussion above, the main gap is that we should improve the > > > > > >> documentation > > > > > >> >> for handling failed disks. > > > > > >> >> > > > > > >> >> > Another case is for > > > > > >> >> > users that have hundreds (or more) of clusters and more than > > > 100k > > > > > >> nodes > > > > > >> >> > and experience node failures multiple times during a single > day. 
In > > > this > > >> >> > situation, not having KIP-853 makes these power users unable > > to > > > > join > > > >> >> > the game, as introducing a new error-prone manual operation (or > > one > > > > > >> >> > that needs to be automated) is usually a huge no-go. > > > > > >> >> > > > > > >> >> We have thousands of KRaft clusters in production and haven't > > > seen > > > > > these > > > > > >> >> problems, as I described above. > > > > > >> >> > > > > > >> >> best, > > > > > >> >> Colin > > > > > >> >> > > > > > > >> >> > But I hear the concerns of delaying 4.0 for another 3 to 4 > > > > months. > > > > > >> >> > Would it help if we aimed at shortening the timeline for > > > > 3.8.0 > > > > > >> >> > and started with 4.0.0 a bit earlier? > > > > > >> >> > Maybe we could work on 3.8.0 almost in parallel with 4.0.0: > > > > > >> >> > - Start with the 3.8.0 release process > > > > > >> >> > - After a short time (let's say a week) create the release > > > branch > > > > > >> >> > - Start with the 4.0.0 release process as usual > > > > > >> >> > - Cherry-pick KRaft-related issues to 3.8.0 > > > > > >> >> > - Release 3.8.0 > > > > > >> >> > I suspect 4.0.0 will need a bit more time than usual to > > ensure > > > > the > > > > > >> code > > > > > >> >> > is cleaned up of deprecated classes and methods on top of > the > > > > usual > > > > > >> >> > work we have. For this reason I think there would be enough > > > time > > > > > >> >> > between releasing 3.8.0 and 4.0.0. > > > > > >> >> > > > > > > >> >> > What do you all think? > > > > > >> >> > > > > > > >> >> > Best, > > > > > >> >> > Josep Prat > > > > > >> >> > > > > > > >> >> > On 2023/11/20 20:03:18 Colin McCabe wrote: > > > > > >> >> >> Hi Josep, > > > > > >> >> >> > > > > > >> >> >> I think there is some confusion here. Quorum > reconfiguration > > > is > > > > > not > > > > > >> >> needed for KRaft to become production ready. 
Confluent runs > > > > > thousands of > > > > > >> >> KRaft clusters without quorum reconfiguration, and has for > > years. > > > > > While > > > > > >> >> dynamic quorum reconfiguration is a nice feature, it doesn't > > > block > > > > > >> >> anything: not migration, not deployment. As best as I > > understand > > > > it, > > > > > the > > > > > >> >> use-case Aiven has isn't even reconfiguration per se, just > > > wiping a > > > > > >> disk. > > > > > >> >> There are ways to handle this -- I discussed some earlier in > > the > > > > > >> thread. I > > > > > >> >> think it would be productive to continue that discussion -- > > > > > especially > > > > > >> the > > > > > >> >> part around documentation and testing of these cases. > > > > > >> >> >> > > > > > >> >> >> A lot of people have done a lot of work to get Kafka 4.0 > > > ready. > > > > I > > > > > >> would > > > > > >> >> not want to delay that because we want an additional feature. > > And > > > > we > > > > > >> will > > > > > >> >> always want additional features. So I am concerned we will > end > > up > > > > in > > > > > an > > > > > >> >> infinite loop of people asking for "just one more feature" > > before > > > > > they > > > > > >> >> migrate. > > > > > >> >> >> > > > > > >> >> >> best, > > > > > >> >> >> Colin > > > > > >> >> >> > > > > > >> >> >> > > > > > >> >> >> On Mon, Nov 20, 2023, at 04:15, Josep Prat wrote: > > > > > >> >> >> > Hi all, > > > > > >> >> >> > > > > > > >> >> >> > I wanted to share my opinion regarding this topic. I > know > > > some > > > > > >> >> >> > discussions happened some time ago (over a year) but I > > > believe > > > > > it's > > > > > >> >> >> > wise to reflect and re-evaluate if those decisions are > > still > > > > > valid. > > > > > >> >> >> > KRaft, as of Kafka 3.6.x and 3.7.x, has not yet feature > > > parity > > > > > with > > > > > >> >> >> > Zookeeper. 
Zookeeper. By dropping Zookeeper altogether before > > achieving > > > such > > >> >> >> > parity, we are opening the door to leaving a chunk of > > Apache > > > > > Kafka > > > >> >> >> > users without an easy way to upgrade to 4.0. > > > > > >> >> >> > In favor of making upgrades as smooth as possible, I > propose > > > to > > > > > have > > > > > >> a > > > > > >> >> >> > Kafka version where KIP-853 is merged and Zookeeper is > still > > > > > >> >> supported. > > > > > >> >> >> > This will enable community members who can't migrate yet > > to > > > > > KRaft > > > > > >> to > > > > > >> >> do > > > > > >> >> >> > so in a safe way (rolling back if something goes wrong). > > > > > >> >> Additionally, > > > > > >> >> >> > this will give us more confidence that KRaft can > > successfully > > > > > >> >> >> > replace Zookeeper without any big problems, by > > > discovering > > > > > and > > > > > >> >> >> > fixing bugs or by confirming that KRaft works as > expected. > > > > > >> >> >> > For this I strongly believe we should have a 3.8.x > version > > > > > before > > > > > >> >> 4.0.x. > > > > > >> >> >> > > > > > > >> >> >> > What do others think in this regard? > > > > > >> >> >> > > > > > > >> >> >> > Best, > > > > > >> >> >> > > > > > > >> >> >> > On 2023/11/14 20:47:10 Colin McCabe wrote: > > > > > >> >> >> >> On Tue, Nov 14, 2023, at 04:37, Anton Agestam wrote: > > > > > >> >> >> >> > Hi Colin, > > > > > >> >> >> >> > > > > > > >> >> >> >> > Thank you for your thoughtful and comprehensive > > response. > > > > > >> >> >> >> > > > > > > >> >> >> >> >> KIP-853 is not a blocker for either 3.7 or 4.0. We > > > > discussed > > > > > >> this > > > > > >> >> in > > > > > >> >> >> >> >> several KIPs that happened this year and last year. > > The > > > > most > > > > > >> >> notable was > > > > > >> >> >> >> >> probably KIP-866, which was approved in May 2022. 
> > > > > >> >> >> >> > > > > > > >> >> >> >> > I understand this is the case, I'm raising my concern > > > > > because I > > > > > >> was > > > > > >> >> >> >> > foreseeing some major pain points as a consequence of > > > this > > > > > >> >> decision. Just > > > > > >> >> >> >> > to make it clear though: I am not asking for anyone > to > > do > > > > > work > > > > > >> for > > > > > >> >> me, and > > > > > >> >> >> >> > I understand the limitations of resources available > to > > > > > implement > > > > > >> >> features. > > > > > >> >> >> >> > What I was asking is rather to consider the > > implications > > > of > > > > > >> >> _removing_ > > > > > >> >> >> >> > features before there exists a replacement for them. > > > > > >> >> >> >> > > > > > > >> >> >> >> > I understand that the timeframe for 3.7 isn't > feasible, > > > and > > > > > >> >> because of that > > > > > >> >> >> >> > I think what I was asking is rather: can we make sure > > > that > > > > > there > > > > > >> >> are more > > > > > >> >> >> >> > 3.x releases until controller quorum online resizing > is > > > > > >> >> implemented? > > > > > >> >> >> >> > > > > > > >> >> >> >> > From your response, I gather that your stance is that > > > it's > > > > > >> >> important to > > > > > >> >> >> >> > drop ZK support sooner rather than later and that the > > > > > necessary > > > > > >> >> pieces for > > > > > >> >> >> >> > doing so are already in place. > > > > > >> >> >> >> > > > > > >> >> >> >> Hi Anton, > > > > > >> >> >> >> > > > > > >> >> >> >> Yes. I'm basically just repeating what we agreed upon > in > > > 2022 > > > > > as > > > > > >> >> part of KIP-833. > > > > > >> >> >> >> > > > > > >> >> >> >> > > > > > > >> >> >> >> > --- > > > > > >> >> >> >> > > > > > > >> >> >> >> > I want to make sure I've understood your suggested > > > sequence > > > > > for > > > > > >> >> controller > > > > > >> >> >> >> > node replacement. 
I hope the mentions of Kubernetes are > > > > > meant as > > >> > examples > > > > > >> >> >> >> > of how to carry things out, rather than saying "this is > > > > only > > > > > >> >> supported on > > > > > >> >> >> >> > Kubernetes"? > > > > > >> >> >> >> > > > > > >> >> >> >> Apache Kafka is supported in lots of environments, > > > including > > > > > >> non-k8s > > > > > >> >> ones. I was just pointing out that using k8s means that you > > > control > > > > > your > > > > > >> >> own DNS resolution, which simplifies matters. If you don't > > > control > > > > > DNS, > > > > > >> >> there are some extra steps for changing the quorum voters. > > > > > >> >> >> >> > > > > > >> >> >> >> > > > > > > >> >> >> >> > Given we have three existing nodes as such: > > > > > >> >> >> >> > > > > > > >> >> >> >> > - a.local -> 192.168.0.100 > > > > > >> >> >> >> > - b.local -> 192.168.0.101 > > > > > >> >> >> >> > - c.local -> 192.168.0.102 > > > > > >> >> >> >> > > > > > > >> >> >> >> > As well as a candidate node 192.168.0.103 that we > want > > to > > > > > swap in > > > > > >> >> for the > > > > > >> >> >> >> > role of c.local. > > > > > >> >> >> >> > > > > > > >> >> >> >> > 1. Shut down controller process on node .102 (to make > > > sure > > > > we > > > > > >> >> don't "go > > > > > >> >> >> >> > back in time"). > > > > > >> >> >> >> > 2. rsync state from leader to .103. > > > > > >> >> >> >> > 3. Start controller process on .103. > > > > > >> >> >> >> > 4. Point the c.local entry at .103. > > > > > >> >> >> >> > > > > > > >> >> >> >> > I have a few questions about this sequence: > > > > > >> >> >> >> > > > > > > >> >> >> >> > 1. Would this sequence be safe against leadership > > > changes? > > > > > >> >> >> >> > > > > > > >> >> >> >> > > > > > >> >> >> >> If the leader changes, the new leader should have all > of > > > the > > > > > >> >> committed entries that the old leader had. > > > > > >> >> >> >> > > > > > >> >> >> >> > 2. 
Does it work > > > > > >> >> >> >> > > > > > >> >> >> >> Probably the biggest issue is dealing with "torn > writes" > > > that > > > > > >> happen > > > > > >> >> because you're copying the current log segment while it's being > > > > > written > > > > > >> to. > > > > > >> >> The system should be robust against this. However, we don't > > > > > regularly do > > > > > >> >> this, so there hasn't been a lot of testing. > > > > > >> >> >> >> > > > > > >> >> >> >> I think Jose had a PR for improving the handling of this, > > > > which > > > > > we > > > > > >> >> might want to dig up. We'd want the system to auto-truncate > the > > > > > partial > > > > > >> >> record at the end of the log, if there is one. > > > > > >> >> >> >> > > > > > >> >> >> >> > 3. By "state", do we mean `metadata.log.dir`? > Something > > > > else? > > > > > >> >> >> >> > > > > > >> >> >> >> Yes, the state of the metadata.log.dir. Keep in mind > you > > > will > > > > > need > > > > > >> >> to change the node ID in meta.properties after copying, of > > > course. > > > > > >> >> >> >> > > > > > >> >> >> >> > 4. What are the effects on cluster availability? (I > > think > > > > > this > > > > > >> is > > > > > >> >> the same > > > > > >> >> >> >> > as asking what happens if a or b crashes during the > > > > process, > > > > > or > > > > > >> if > > > > > >> >> network > > > > > >> >> >> >> > partitions occur). > > > > > >> >> >> >> > > > > > >> >> >> >> Cluster metadata state tends to be pretty small, > > typically > > > a > > > > > >> hundred > > > > > >> >> megabytes or so. Therefore, I do not think it will take more > > > than a > > > > > >> second > > > > > >> >> or two to copy from one node to another. However, if you do > > > > > experience a > > > > > >> >> crash when one node out of three is down, then you will be > > > > > unavailable > > > > > >> >> until you can bring up a second node to regain a majority. 
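[Editor's note] Colin's notes above (rsync the leader's metadata.log.dir, then fix the node ID in meta.properties) could look roughly like the following sketch. Only the meta.properties rewrite is shown; the directory layout, cluster ID, and node IDs are invented for illustration, and a real meta.properties has a few more fields:

```shell
# Work in a scratch directory standing in for the copied metadata.log.dir.
dir=$(mktemp -d)

# Pretend this file was just rsynced over from the leader (node 100).
cat > "$dir/meta.properties" <<'EOF'
version=1
cluster.id=AbCdEfGhIjKlMnOpQrStUv
node.id=100
EOF

# The replacement controller takes over ID 102, so rewrite node.id
# before starting the controller process on the new machine.
sed -i 's/^node\.id=.*/node.id=102/' "$dir/meta.properties"

grep '^node.id=' "$dir/meta.properties"
```

The cluster.id must stay untouched; only the node identity changes to match the voter entry the new machine is taking over.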
> > > > >> >> >> >> > > > > >> >> >> >> > > > > > > >> >> >> >> > --- > > > > > >> >> >> >> > > > > > > >> >> >> >> > If this is considered the official way of handling > > > > controller > > > > > >> node > > > > > >> >> >> >> > replacements, does it make sense to improve > > documentation > > > > in > > > > > >> this > > > > > >> >> area? Is > > > > > >> >> >> >> > there already a plan for this documentation laid out > > in > > > > some > > > > > >> >> KIPs? This is > > > > > >> >> >> >> > something I'd be happy to contribute to. > > > > > >> >> >> >> > > > > > > >> >> >> >> > > > > > >> >> >> >> Yes, I think we should have official documentation > about > > > > this. > > > > > >> We'd > > > > > >> >> be happy to review anything in that area. > > > > > >> >> >> >> > > > > > >> >> >> >> >> To circle back to KIP-853, I think it stands a good > > > chance > > > > > of > > > > > >> >> making it > > > > > >> >> >> >> >> into AK 4.0. > > > > > >> >> >> >> > > > > > > >> >> >> >> > This sounds good, but the point I was making was > whether > > we > > > > could > > > > > >> have > > > > > >> >> a release > > > > > >> >> >> >> > with both KRaft and ZK supporting this feature to > ease > > > the > > > > > >> >> migration out of > > > > > >> >> >> >> > ZK. > > > > > >> >> >> >> > > > > > > >> >> >> >> > > > > > >> >> >> >> The problem is, supporting multiple controller > > > > implementations > > > > > is > > > > > >> a > > > > > >> >> huge burden. So we don't want to extend the 3.x release past > > the > > > > > point > > > > > >> >> that's needed to complete all the must-dos (SCRAM, delegation > > > > tokens, > > > > > >> JBOD). > > > > > >> >> >> >> > > > > > >> >> >> >> best, > > > > > >> >> >> >> Colin > > > > > >> >> >> >> > > > > > >> >> >> >> > > > > > >> >> >> >> > BR, > > > > > >> >> >> >> > Anton > > > > > >> >> >> >> > > > > > > >> >> >> >> > Den tors 9 nov. 
2023 kl 23:04 skrev Colin McCabe < > > > > > >> >> cmcc...@apache.org>: > > > > > >> >> >> >> > > > > > > >> >> >> >> >> Hi Anton, > > > > > >> >> >> >> >> > > > > > >> >> >> >> >> It rarely makes sense to scale up and down the > number > > of > > > > > >> >> controller nodes > > > > > >> >> >> >> >> in the cluster. Only one controller node will be > > active > > > at > > > > > any > > > > > >> >> given time. > > > > > >> >> >> >> >> The main reason to use 5 nodes would be to be able > to > > > > > tolerate > > > > > >> 2 > > > > > >> >> failures > > > > > >> >> >> >> >> instead of 1. > > > > > >> >> >> >> >> > > > > > >> >> >> >> >> At Confluent, we generally run KRaft with 3 > > controllers. > > > > We > > > > > >> have > > > > > >> >> not seen > > > > > >> >> >> >> >> problems with this setup, even with thousands of > > > clusters. > > > > > We > > > > > >> have > > > > > >> >> >> >> >> discussed using 5 node controller clusters on > certain > > > very > > > > > big > > > > > >> >> clusters, > > > > > >> >> >> >> >> but we haven't done that yet. This is all very > similar > > > to > > > > > ZK, > > > > > >> >> where most > > > > > >> >> >> >> >> deployments were 3 nodes as well. > > > > > >> >> >> >> >> > > > > > >> >> >> >> >> KIP-853 is not a blocker for either 3.7 or 4.0. We > > > > discussed > > > > > >> this > > > > > >> >> in > > > > > >> >> >> >> >> several KIPs that happened this year and last year. > > The > > > > most > > > > > >> >> notable was > > > > > >> >> >> >> >> probably KIP-866, which was approved in May 2022. > > > > > >> >> >> >> >> > > > > > >> >> >> >> >> Many users these days run in a Kubernetes > environment > > > > where > > > > > >> >> Kubernetes > > > > > >> >> >> >> >> actually controls the DNS. This makes changing the > set > > > of > > > > > >> voters > > > > > >> >> less > > > > > >> >> >> >> >> important than it was historically. 
> > > > >> >> >> >> > > > > > >> >> >> >> For example, in a world with static DNS, you might > > have > > > to > > > > > >> change > > > > > >> >> the > > > > > >> >> >> >> controller.quorum.voters setting from: > > > > > >> >> >> >> > > > > > >> >> >> >> 100@a.local:9073,101@b.local:9073,102@c.local:9073 > > > > > >> >> >> >> > > > > > >> >> >> >> to: > > > > > >> >> >> >> > > > > > >> >> >> >> 100@a.local:9073,101@b.local:9073,102@d.local:9073 > > > > > >> >> >> >> > > > > > >> >> >> >> In a world with k8s controlling the DNS, you simply > > > remap > > > > > >> c.local > > > > > >> >> to point > > > > > >> >> >> >> to the IP address of your new pod for controller > 102, > > > and > > > > > >> you're > > > > > >> >> done. No > > > > > >> >> >> >> need to update controller.quorum.voters. > > > > > >> >> >> >> > > > > > >> >> >> >> Another question is whether you re-create the pod data > > > > from > > > > > >> >> scratch every > > > > > >> >> >> >> time you add a new node. If you store the controller > > > data > > > > > on an > > > > > >> >> EBS volume > > > > > >> >> >> >> (or cloud-specific equivalent), you really only have > > to > > > > > detach > > > > > >> it > > > > > >> >> from the > > > > > >> >> >> >> previous pod and re-attach it to the new pod. k8s also > > > > > handles > > > > > >> >> this > > > > > >> >> >> >> automatically, of course. > > > > > >> >> >> >> > > > > > >> >> >> >> If you want to reconstruct the full controller pod > > state > > > > > each > > > > > >> >> time you > > > > > >> >> >> >> create a new pod (for example, so that you can use > > only > > > > > >> instance > > > > > >> >> storage), > > > > > >> >> >> >> you should be able to rsync that state from the > > leader. 
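[Editor's note] The static-DNS case Colin describes amounts to rewriting one entry of the controller.quorum.voters string on every node. The hostnames, IDs, and port come from his example; the sed edit below is just one way to express the change:

```shell
# Voters string before the replacement, per the example above.
voters="100@a.local:9073,101@b.local:9073,102@c.local:9073"

# Point voter 102 at the new host d.local; the voter ID and port
# stay the same, only the hostname changes.
new_voters=$(printf '%s' "$voters" | sed 's/102@c\.local/102@d.local/')

echo "$new_voters"
```

In the k8s case this whole step disappears, because c.local itself is remapped to the new pod's IP and the voters string never changes.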
In general, the invariant that we want to maintain is that the state should not "go back in time" -- if controller 102 promised to hold all log data up to offset X, it should come back with committed data up to at least that offset.

There are lots of new features we'd like to implement for KRaft, and Kafka in general. If you have some you really would like to see, I think everyone in the community would be happy to work with you. The flip side, of course, is that since there are an unlimited number of features we could do, we can't really block the release for any one feature.

To circle back to KIP-853, I think it stands a good chance of making it into AK 4.0. Jose, Alyssa, and some other people have worked on it. It definitely won't make it into 3.7, since we have only a few weeks left before that release happens.

best,
Colin

On Thu, Nov 9, 2023, at 00:20, Anton Agestam wrote:

Hi Luke,

We have been looking into what switching from ZK to KRaft will mean for Aiven.
We heavily depend on an “immutable infrastructure” model for deployments. This means that, when we perform upgrades, we introduce new nodes to our clusters, scale the cluster up to incorporate the new nodes, and then phase the old ones out once all partitions are moved to the new generation. This allows us, and anyone else using a similar model, to do upgrades as well as cluster resizing with zero downtime.

Reading up on KRaft and the ZK-to-KRaft migration path, this is somewhat worrying for us. It seems like, if KIP-853 is not included prior to dropping support for ZK, we will essentially have no satisfying upgrade path. Even if KIP-853 is included in 4.0, I’m unsure if that would allow a migration path for us, since a new cluster generation would not be able to use ZK during the migration step.
On the other hand, if KIP-853 were released in a version prior to dropping ZK support, then, because it allows online resizing of KRaft clusters, this would allow us and others that use an immutable infrastructure deployment model to provide a zero-downtime migration path.

For that reason, we’d like to raise awareness around this issue and encourage considering the implementation of KIP-853 or equivalent a blocker not only for 4.0, but for the last version prior to 4.0.

BR,
Anton

On 2023/10/11 12:17:23 Luke Chen wrote:

Hi all,

Now that Kafka 3.6.0 is released, I’d like to start the discussion for the “road to Kafka 4.0”. Based on the plan in KIP-833 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-833%3A+Mark+KRaft+as+Production+Ready#KIP833:MarkKRaftasProductionReady-Kafka3.7>, the next release, 3.7, will be the final release before moving to Kafka 4.0, which removes ZooKeeper from Kafka.
Before making this major change, I'd like to get consensus on the "must-have features/fixes for Kafka 4.0", to avoid some users being surprised when upgrading to Kafka 4.0. The intent is to have clear communication about what to expect in the following months. In particular, we should be signaling what features and configurations are not supported, or at risk (if no one is able to add support or fix known bugs).

Here is the JIRA ticket list <https://issues.apache.org/jira/issues/?jql=labels%20%3D%204.0-blocker> I labeled as "4.0-blocker". The criteria I used for "4.0-blocker" are:
1. The feature is supported in ZooKeeper mode, but not supported in KRaft mode yet (ex: KIP-858: JBOD in KRaft)
2. Critical bugs in KRaft (ex: KAFKA-15489: split brain in the KRaft controller quorum)

If you disagree with my current list, you are welcome to start a discussion in the specific JIRA ticket.
Or, if you think there are some tickets I missed, you are welcome to start a discussion in the JIRA ticket and ping me or other people. After we get consensus, we can label/unlabel it afterwards. Again, the goal is to have open communication with the community about what will be coming in 4.0.

Below is the high-level category of the list content:

1. Recovery from disk failure
KIP-856 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-856:+KRaft+Disk+Failure+Recovery>: KRaft Disk Failure Recovery

2. Prevote, to support more than 3 controllers
KIP-650 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-650%3A+Enhance+Kafkaesque+Raft+semantics>: Enhance Kafkaesque Raft semantics

3. JBOD support
KIP-858 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft>: Handle JBOD broker disk failure in KRaft

4. Scale up/down controllers
KIP-853 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Controller+Membership+Changes>: KRaft Controller Membership Changes

5. Modifying dynamic configurations on the KRaft controller

6. Critical bugs in KRaft

Does this make sense?
Any feedback is welcomed.

Thank you.
Luke

--
Josep Prat
Open Source Engineering Director, Aiven
josep.p...@aiven.io | +491715557497
Aiven Deutschland GmbH
Alexanderufer 3-7, 10117 Berlin
Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
Amtsgericht Charlottenburg, HRB 209739 B