I'll join Dima in thanking you, Luke. This does indeed seem to be a good way of enforcing safe restarts.
Thanks,

Frank

--
Francesco Burato | Software Development Engineer | Adobe | bur...@adobe.com | c. +44 747 9029370

From: Dima Brodsky <ddbrod...@gmail.com>
Date: Monday, 22 April 2024 at 05:16
To: users@kafka.apache.org <users@kafka.apache.org>
Subject: Re: Kraft controller readiness checks

Thanks Luke, this helps for our use case. It does not cover the buildout of
a new cluster where there are no brokers yet, but that should be remedied
by KIP-919, which looks to be resolved in 3.7.0.

ttyl
Dima

On Sun, Apr 21, 2024 at 9:06 PM Luke Chen <show...@gmail.com> wrote:

> Hi Frank,
>
> About your question:
>
> > Unless this is already available but not well publicised in the
> > documentation, ideally there should be a protocol working on the
> > controller ports that answers operational questions like "are metadata
> > partitions in sync?" and "has the current controller converged with
> > other members of the quorum?".
>
> I'm sorry, but the KRaft controller uses the Raft protocol, so there is
> no "in-sync replica" definition as there is in the data replication
> protocol. What we did for our check is described here:
> https://github.com/strimzi/proposals/blob/main/060-kafka-roller-kraft.md#the-new-quorum-check
> In short, we use `controller.quorum.fetch.timeout.ms` and
> `replicaLastCaughtUpTimestamp` to determine whether it is safe to roll
> this controller pod.
>
> Hope this helps.
>
> Thank you.
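The quorum check Luke points to can be sketched roughly as follows. This is an illustrative Python sketch, not Strimzi's actual implementation: the caught-up timestamps stand in for the `replicaLastCaughtUpTimestamp` values reported per voter (e.g. via the DescribeQuorum API), and the rule shown — every voter other than the one being rolled must have caught up within `controller.quorum.fetch.timeout.ms` — is a simplification of the proposal's logic.

```python
import time

def safe_to_roll(voters, rolling_id, fetch_timeout_ms, now_ms=None):
    """Return True if the controller `rolling_id` can be restarted.

    `voters` maps voter id -> last-caught-up timestamp in milliseconds.
    A voter is considered healthy if it fetched from the leader within
    controller.quorum.fetch.timeout.ms; we require every voter other
    than the one being rolled to be healthy, so the quorum still has a
    majority while the pod is down.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return all(
        now_ms - caught_up < fetch_timeout_ms
        for vid, caught_up in voters.items()
        if vid != rolling_id
    )
```

With three voters, rolling voter 3 is allowed only while voters 1 and 2 are both within the timeout window.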
> Luke

On Fri, Apr 19, 2024 at 5:06 PM Francesco Burato
<bur...@adobe.com.invalid> wrote:

> Hi Luke,
>
> Thanks for the answers. I understand what you are describing in terms of
> the rationale for using just the availability of the controller port to
> determine the readiness of the controller, but that is not fully
> satisfying from an operational perspective, at least given the lack of
> sufficient documentation on the matter. Based on my understanding of
> KRaft, which I admit is not considerable, the controllers host the
> cluster metadata partitions on disk and make them available to the
> brokers. So, presumably, one of the purposes of the controllers is to
> ensure that the metadata partitions are properly replicated. Hence, what
> happens if, even in a non-Kubernetes environment, all controllers go
> down? What sort of outage does the wider cluster experience in that
> circumstance?
>
> A complete outage of the controllers is of course an extreme scenario,
> but a more likely one is that a controller's disk goes offline and needs
> to be replaced. In this scenario, the controller will have to
> reconstruct the cluster metadata from scratch from the other controllers
> in the quorum, but it presumably cannot participate in the quorum until
> the metadata partitions are fully replicated. Based on this assumption,
> the mere availability of the controller port does not necessarily mean
> that I can safely shut down another controller, because replication may
> not have completed yet.
> As I mentioned earlier, I don't know the details of KRaft in sufficient
> depth to evaluate whether my assumptions are warranted, but the official
> documentation does not seem to go into much detail on how to safely
> operate a cluster in KRaft mode, while it provides very good information
> on how to safely operate a ZK cluster, highlighting that URPs and leader
> elections must be kept under control during restarts.
>
> Unless this is already available but not well publicised in the
> documentation, ideally there should be a protocol working on the
> controller ports that answers operational questions like "are metadata
> partitions in sync?" and "has the current controller converged with
> other members of the quorum?".
>
> It goes without saying that if any of these topics are properly covered
> anywhere in the docs, I am more than happy to be RTFMed to the right
> place.
>
> As for the other points you raise: we have a very particular set-up for
> our Kafka clusters that makes the circumstance you highlight not a
> problem. In particular, our consumers and producers are all internal to
> a namespace and can connect to non-ready brokers. Given that the URP
> script checks the global URP state rather than just the URP state of the
> individual broker, as long as even one broker is marked as ready, the
> entire cluster is safe. With the ordered rotation imposed by the
> statefulset rolling restart, together with the URP readiness check and
> the PDB, we are guaranteed not to cause any read or write errors.
> Rotations are rather long, but we don't really care about speed.
>
> Thanks,
>
> Frank
>
> --
> Francesco Burato | Software Development Engineer | Adobe |
> bur...@adobe.com | c.
> +44 747 9029370
>
> From: Luke Chen <show...@gmail.com>
> Date: Friday, 19 April 2024 at 05:21
> To: users@kafka.apache.org <users@kafka.apache.org>
> Subject: Re: Kraft controller readiness checks
>
> > Hello Frank,
> >
> > That's a good question.
> > I think we all know there is no "correct" answer to this question, but
> > I can share what our team did for it.
> >
> > Readiness: the controller is listening on the controller.listener.names
> > listeners.
> >
> > The rationale behind it is:
> > 1. The last step of controller node startup is to wait until all the
> > SocketServer ports are open and the Acceptors are started, and the
> > controller port is one of them.
> > 2. The controller listener is used to talk to other controllers
> > (voters) to form the Raft quorum, so if it is not open and listening,
> > the controller is basically not working at all.
> > 3. The controller listener is also used by brokers (observers) to get
> > updated Raft quorum info and fetch metadata.
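A readiness probe in this spirit only needs to confirm that the controller listener accepts TCP connections; no Kafka protocol exchange is required. A minimal sketch (host and port are deployment-specific assumptions, not values from the thread):

```python
import socket

def controller_port_open(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Readiness in the sense described above: can we complete a TCP
    handshake with the controller listener? Nothing is sent or read."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

In Kubernetes the same effect is usually achieved with a plain `tcpSocket` readiness probe pointed at the controller listener port.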
> > Compared with the Zookeeper cluster, which the KRaft quorum is trying
> > to replace, the liveness/readiness probe recommended in the Kubernetes
> > tutorial
> > https://kubernetes.io/docs/tutorials/stateful-application/zookeeper/#testing-for-liveness
> > is also doing a "ruok" check for the pod.
> > And the handler for this "ruok" command on the Zookeeper server side
> > https://github.com/apache/zookeeper/blob/d12aba599233b0fcba0b9b945ed3d2f45d4016f0/zookeeper-server/src/main/java/org/apache/zookeeper/server/command/RuokCommand.java#L32
> > returns "imok" directly, which means it is only doing a connection
> > check. So we think this check makes sense.
> > Here's our design proposal for the liveness and readiness probes in a
> > KRaft Kafka cluster, FYI:
> > https://github.com/strimzi/proposals/blob/main/046-kraft-liveness-readiness.md
> > But again, I still think there's no "correct" answer for it. If you
> > have any better ideas, please let us know.
> >
> > However, I have some suggestions for your readiness probe for brokers:
> >
> > > our brokers are configured to use a script which marks the
> > > containers as unready if under-replicated partitions exist. With
> > > this readiness check and a pod disruption budget of the minimum
> > > in-sync replicas - 1
> >
> > I understand it works well, but it has some drawbacks, and the biggest
> > issue I can think of is this: it can cause unavailability of some
> > partitions.
> > For example: there are 3 brokers in the cluster (0, 1, 2), and 10
> > topic partitions are hosted on broker 0.
> > a. Broker 0 is shutting down; all partitions on broker 0 become
> > followers.
> > b. Broker 0 is starting up; all the followers are trying to catch up
> > with the leaders.
> > c. 9 out of 10 partitions have caught up and rejoined the ISR. At this
> > point, the pod is still unready because one partition is still
> > under-replicated.
> > d. Some of the partitions on broker 0 become leaders, for example
> > because auto leader rebalancing is triggered.
> > e. The leader partitions on broker 0 are now unavailable: because the
> > pod is not in the ready state, it cannot serve incoming requests.
> >
> > In our team, we use the brokerState metric value = RUNNING for the
> > readiness probe. In KRaft mode, the broker enters the RUNNING state
> > after it has caught up with the controller's metadata and starts
> > serving requests from clients. We think that makes more sense.
> > Again, for more details, you can check the design proposal
> > https://github.com/strimzi/proposals/blob/main/046-kraft-liveness-readiness.md
> > for the
> > Liveness and Readiness probes in a KRaft Kafka cluster.
> >
> > Finally, I saw you don't have operators for your Kafka clusters.
> > I don't know how you manage all these Kafka clusters manually, but
> > there must be some cumbersome operations, like rolling pods.
> > Say you want to roll the pods one by one: which pod do you roll first,
> > and which goes last? Will you do any checks before rolling? How much
> > time does each roll take? ...
> >
> > I'm just listing some of the problems you might have. So I would
> > recommend deploying an operator to help manage the Kafka clusters.
> > This is our design proposal for the Kafka roller in the operator for
> > KRaft, FYI:
> > https://github.com/strimzi/proposals/blob/main/060-kafka-roller-kraft.md
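The brokerState-based probe Luke describes can be sketched like this. The numeric codes follow Kafka's `BrokerState` enum, exposed via the `kafka.server:type=KafkaServer,name=BrokerState` JMX metric; they are taken from recent Kafka versions and are worth re-checking against the release you actually run.

```python
# BrokerState codes exposed by kafka.server:type=KafkaServer,name=BrokerState.
BROKER_STATES = {
    0: "NOT_RUNNING",
    1: "STARTING",
    2: "RECOVERY",
    3: "RUNNING",
    6: "PENDING_CONTROLLED_SHUTDOWN",
    7: "SHUTTING_DOWN",
}

def broker_ready(state_value: int) -> bool:
    """Ready only in RUNNING: the broker has caught up with the
    controller's metadata and is serving client requests, even if some
    of its partitions are still catching up to the ISR."""
    return BROKER_STATES.get(state_value) == "RUNNING"
```

Unlike the URP check, this keeps the pod ready in step e of the scenario above, since the broker is serving requests even while one partition lags.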
> > And now, I'm totally biased, but Strimzi
> > https://github.com/strimzi/strimzi-kafka-operator
> > provides a fully open-source operator to manage Kafka clusters on
> > Kubernetes.
> > You are welcome to try it (hopefully it will help you manage your
> > Kafka clusters), join the community to ask questions, join
> > discussions, or contribute to it.
> >
> > Thank you.
> > Luke
> >
> > On Fri, Apr 19, 2024 at 4:19 AM Francesco Burato
> > <bur...@adobe.com.invalid> wrote:
> >
> > > Hello,
> > >
> > > I have a question regarding the deployment of Kafka using KRaft
> > > controllers in a Kubernetes environment. Our current Kafka cluster
> > > is deployed on K8S clusters as statefulsets without operators, and
> > > our brokers are configured to use a script which marks the
> > > containers as unready if under-replicated partitions exist.
> > > With this readiness check and a pod disruption budget of the
> > > minimum in-sync replicas - 1, we are able to perform rolling
> > > restarts of our brokers automatically without ever producing
> > > consumer or producer errors.
> > >
> > > We have started the process of transitioning to KRaft, and based on
> > > the recommended deployment strategy we are going to define
> > > dedicated nodes as controllers instead of using combined servers.
> > > However, defining nodes as controllers does not seem to allow us to
> > > use the same strategy for the readiness check, as kafka-topics.sh
> > > does not appear to be executable on controller brokers.
> > >
> > > The question is: what is a reliable readiness check for KRaft
> > > controllers that ensures a rolling restart can be performed safely?
> > >
> > > Thanks,
> > >
> > > Frank
> > >
> > > --
> > > Francesco Burato | Software Development Engineer | Adobe |
> > > bur...@adobe.com

--
ddbrod...@gmail.com
"The price of reliability is the pursuit of the utmost simplicity. It is a
price which the very rich find most hard to pay." (Sir Antony Hoare, 1980)
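For reference, the broker-side URP readiness check Frank describes at the start of the thread can be sketched as below. This is a sketch, not the team's actual script: it assumes `kafka-topics.sh` is on the PATH, and it relies on `--describe --under-replicated-partitions` printing one line per under-replicated partition and nothing when the cluster is healthy.

```python
import subprocess

def has_urps(describe_output: str) -> bool:
    """True if the output of `kafka-topics.sh --describe
    --under-replicated-partitions` lists any partitions."""
    return any(line.strip() for line in describe_output.splitlines())

def urp_check_ready(bootstrap: str) -> bool:
    """Mark the container ready only when no under-replicated partitions
    exist anywhere in the cluster (a global check, as described above)."""
    out = subprocess.run(
        ["kafka-topics.sh", "--bootstrap-server", bootstrap,
         "--describe", "--under-replicated-partitions"],
        capture_output=True, text=True, check=True,
    ).stdout
    return not has_urps(out)
```

As noted in the thread, this command works against brokers but not against dedicated KRaft controllers, which is what motivates the controller-specific checks discussed above.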