I’ll join Dima in the thanks, Luke. This does indeed seem to be a good way of 
enforcing safe restarts.

Thanks,

Frank

--
Francesco Burato | Software Development Engineer | Adobe | bur...@adobe.com | c. +44 747 9029370


From: Dima Brodsky <ddbrod...@gmail.com>
Date: Monday, 22 April 2024 at 05:16
To: users@kafka.apache.org <users@kafka.apache.org>
Subject: Re: Kraft controller readiness checks


Thanks Luke, this helps for our use case. It does not cover the buildout
of a new cluster where there are no brokers yet, but that should be remedied
by KIP-919, which looks to be resolved in 3.7.0.

ttyl
Dima


On Sun, Apr 21, 2024 at 9:06 PM Luke Chen <show...@gmail.com> wrote:

> Hi Frank,
>
> About your question:
> > Unless this is already available but not well publicised in the
> > documentation, ideally there should be a protocol available on the
> > controller ports that answers operational questions like “are metadata
> > partitions in sync?” and “has the current controller converged with the
> > other members of the quorum?”.
>
> Unfortunately, the KRaft controller uses the Raft protocol, so there is no
> "in-sync replica" definition as in the data replication protocol. What we did
> for our check is described here
> <https://github.com/strimzi/proposals/blob/main/060-kafka-roller-kraft.md#the-new-quorum-check>.
> In short, we use `controller.quorum.fetch.timeout.ms` and
> `replicaLastCaughtUpTimestamp` to determine if it's safe to roll this
> controller pod.
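>
> If it helps to make that concrete, below is a rough sketch of that kind of
> quorum check using the Java Admin client's describeMetadataQuorum() (Kafka
> 3.3+, KIP-836). The bootstrap address, the hard-coded fetch timeout, and the
> majority rule here are my illustrative assumptions, not the exact Strimzi
> logic:
>
> import java.util.Properties;
> import java.util.concurrent.TimeUnit;
> import org.apache.kafka.clients.admin.Admin;
> import org.apache.kafka.clients.admin.AdminClientConfig;
> import org.apache.kafka.clients.admin.QuorumInfo;
>
> public class QuorumRollCheck {
>     public static void main(String[] args) throws Exception {
>         int nodeToRoll = Integer.parseInt(args[0]); // controller id we want to restart
>         long fetchTimeoutMs = 2000L;                // controller.quorum.fetch.timeout.ms (default 2000)
>         Properties props = new Properties();
>         props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // assumed address
>
>         try (Admin admin = Admin.create(props)) {
>             QuorumInfo quorum = admin.describeMetadataQuorum()
>                     .quorumInfo().get(30, TimeUnit.SECONDS);
>             long now = System.currentTimeMillis();
>             int totalVoters = quorum.voters().size();
>             // Count the *other* voters that have caught up with the leader recently.
>             long caughtUpOthers = quorum.voters().stream()
>                     .filter(v -> v.replicaId() != nodeToRoll)
>                     .filter(v -> v.lastCaughtUpTimestamp().isPresent()
>                             && now - v.lastCaughtUpTimestamp().getAsLong() < fetchTimeoutMs)
>                     .count();
>             // The quorum needs a majority of all voters, and the rolled node drops out.
>             boolean safe = caughtUpOthers >= (totalVoters / 2) + 1;
>             System.out.println(safe ? "safe to roll" : "NOT safe to roll");
>         }
>     }
> }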
>
> Hope this helps.
>
> Thank you.
> Luke
>
>
>
>
> On Fri, Apr 19, 2024 at 5:06 PM Francesco Burato <bur...@adobe.com.invalid> wrote:
>
> > Hi Luke,
> >
> > Thanks for the answers. I understand the rationale you describe for using
> > just the availability of the controller port to determine the readiness of
> > the controller, but that is not fully satisfying from an operational
> > perspective, at least given the lack of sufficient documentation on the
> > matter. Based on my understanding of KRaft, which I admit is limited, the
> > controllers host the cluster metadata partitions on disk and make them
> > available to the brokers. So, presumably, one of the purposes of the
> > controllers is to ensure that the metadata partitions are properly
> > replicated. Hence, what happens, even in a non-K8s environment, if all
> > controllers go down? What sort of outage does the wider cluster experience
> > in that circumstance?
> >
> > A complete outage of the controllers is of course an extreme scenario, but
> > a more likely one is that a controller's disk goes offline and needs to be
> > replaced. In this scenario, the controller will have to reconstruct the
> > cluster metadata from scratch from the other controllers in the quorum, but
> > it presumably cannot participate in the quorum until the metadata
> > partitions are fully replicated. Based on this assumption, the mere
> > availability of the controller port does not necessarily mean that I can
> > safely shut down another controller, because replication may not have
> > completed yet.
> >
> > As I mentioned earlier, I don’t know the details of KRaft well enough to
> > evaluate whether my assumptions are warranted, but the official
> > documentation does not seem to go into much detail on how to safely operate
> > a cluster in KRaft mode, while it provides very good information on how to
> > safely operate a ZK-based cluster by highlighting that URPs and leader
> > elections must be kept under control during restarts.
> >
> > Unless this is already available but not well publicised in the
> > documentation, ideally there should be a protocol available on the
> > controller ports that answers operational questions like “are metadata
> > partitions in sync?” and “has the current controller converged with the
> > other members of the quorum?”.
> >
> > It goes without saying that if any of these topics are properly covered
> > anywhere in the docs, I am more than happy to be RTFMed to the right place.
> >
> > As for the other points you raise: we have a very particular set-up for
> > our Kafka clusters that makes the circumstance you highlight a non-problem.
> > In particular, our consumers and producers are all internal to a namespace
> > and can connect to non-ready brokers. Given that the URP script checks the
> > global URP state rather than just the URP state of the individual broker,
> > as long as even one broker is marked as ready the entire cluster is safe.
> > With the ordered rotation imposed by the statefulset parallel rolling
> > restart, together with the URP readiness check and the PDB, we are
> > guaranteed not to cause any read or write errors. Rotations are rather
> > long, but we don’t really care about speed.
> >
> > Thanks,
> >
> > Frank
> >
> > --
> > Francesco Burato | Software Development Engineer | Adobe | bur...@adobe.com | c. +44 747 9029370
> >
> >
> > From: Luke Chen <show...@gmail.com>
> > Date: Friday, 19 April 2024 at 05:21
> > To: users@kafka.apache.org <users@kafka.apache.org>
> > Subject: Re: Kraft controller readiness checks
> >
> >
> > Hello Frank,
> >
> > That's a good question.
> > I think we all know there is no "correct" answer for this question. But I
> > can share with you what our team did for it.
> >
> > Readiness: the controller is listening on controller.listener.names
> >
> > The rationale behind it is:
> > 1. The last step of controller node startup is to wait for all the
> > SocketServer ports to be open and the Acceptors to be started, and the
> > controller port is one of them.
> > 2. This controller listener is used to talk to other controllers (voters)
> > to form the raft quorum, so if it is not open and listening, the
> controller
> > is basically not working at all.
> > 3. The controller listener is also used by brokers (observers) to get the
> > updated raft quorum info and fetch metadata.
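> >
> > A minimal sketch of that readiness idea, assuming the probe only needs to
> > open a TCP connection to the controller listener (host, port, and timeout
> > here are illustrative):
> >
> > import java.io.IOException;
> > import java.net.InetSocketAddress;
> > import java.net.Socket;
> >
> > public class ControllerPortProbe {
> >     public static void main(String[] args) {
> >         try (Socket socket = new Socket()) {
> >             // 9093 is a common choice for the controller listener; adjust to yours.
> >             socket.connect(new InetSocketAddress("localhost", 9093), 3000);
> >             System.exit(0); // port is open and accepting: ready
> >         } catch (IOException e) {
> >             System.exit(1); // not ready
> >         }
> >     }
> > }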
> >
> > Compared with a Zookeeper cluster, which the KRaft quorum is trying to
> > replace, the liveness/readiness probe recommended in the Kubernetes
> > tutorial
> > <https://kubernetes.io/docs/tutorials/stateful-application/zookeeper/#testing-for-liveness>
> > is also just doing a "ruok" check on the pod. And the handler for this
> > "ruok" command
> > <https://github.com/apache/zookeeper/blob/d12aba599233b0fcba0b9b945ed3d2f45d4016f0/zookeeper-server/src/main/java/org/apache/zookeeper/server/command/RuokCommand.java#L32>
> > on the Zookeeper server side returns "imok" directly, which means it is
> > only doing a connection check. So we think this check makes sense.
> >
> > Here's our design proposal
> > <https://github.com/strimzi/proposals/blob/main/046-kraft-liveness-readiness.md>
> > for the liveness and readiness probes in a KRaft Kafka cluster, FYI.
> > But again, I still think there's no "correct" answer for it. If you have
> > any better ideas, please let us know.
> >
> > However, I have some suggestions for your readiness probe for brokers:
> >
> > > our brokers are configured to use a script which marks the containers as
> > > unready if under-replicated partitions exist. With this readiness check
> > > and a pod disruption budget of the minimum in-sync replicas - 1
> >
> > I understand it works well, but it has some drawbacks, and the biggest
> > issue I can think of is that it can cause unavailability for some
> > partitions.
> > For example: 3 brokers in the cluster: 0, 1, 2, and 10 topic partitions
> > are hosted on broker 0.
> > a. Broker 0 is shutting down; all partitions on broker 0 become followers.
> > b. Broker 0 is starting up; all the followers are trying to catch up with
> > the leaders.
> > c. 9 out of 10 partitions have caught up and rejoined the ISR. At this
> > point, the pod is still unready because 1 partition is still
> > under-replicated.
> > d. Some of the partitions on broker 0 become leaders, for example when
> > auto leader rebalancing is triggered.
> > e. The leader partitions on broker 0 are now unavailable: because the pod
> > is not in the ready state, it cannot serve incoming requests.
> >
> > In our team, we use the brokerState metric value = RUNNING state for the
> > readiness probe. In KRaft mode, the broker enters the RUNNING state after
> > it has caught up with the controller's metadata and starts serving
> > requests from clients. We think that makes more sense.
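> >
> > A rough sketch of a readiness probe built on that metric, assuming remote
> > JMX is enabled on port 9999 (host and port are illustrative; the MBean is
> > the standard kafka.server:type=KafkaServer,name=BrokerState gauge, and
> > RUNNING maps to the value 3):
> >
> > import javax.management.MBeanServerConnection;
> > import javax.management.ObjectName;
> > import javax.management.remote.JMXConnector;
> > import javax.management.remote.JMXConnectorFactory;
> > import javax.management.remote.JMXServiceURL;
> >
> > public class BrokerStateProbe {
> >     public static void main(String[] args) throws Exception {
> >         JMXServiceURL url = new JMXServiceURL(
> >                 "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
> >         try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
> >             MBeanServerConnection conn = connector.getMBeanServerConnection();
> >             ObjectName name =
> >                     new ObjectName("kafka.server:type=KafkaServer,name=BrokerState");
> >             int state = ((Number) conn.getAttribute(name, "Value")).intValue();
> >             System.exit(state == 3 ? 0 : 1); // 3 == RUNNING
> >         }
> >     }
> > }
> >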
> > Again, for more details, you can check the design proposal
> > <https://github.com/strimzi/proposals/blob/main/046-kraft-liveness-readiness.md>
> > for the liveness and readiness probes in a KRaft Kafka cluster.
> >
> > Finally, I saw you don't have an operator for your Kafka clusters.
> > I don't know how you manage all these Kafka clusters manually, but there
> > must be some cumbersome operations, like rolling pods.
> > Say you want to roll the pods one by one: which pod do you roll first?
> > Which pod goes last?
> > Do you run any checks before rolling?
> > How long does each roll take?
> > ...
> >
> > I'm just listing some of the problems you might hit. So I would recommend
> > deploying an operator to help manage the Kafka clusters.
> > This is our design proposal
> > <https://github.com/strimzi/proposals/blob/main/060-kafka-roller-kraft.md>
> > for the Kafka roller in the operator for KRaft, FYI.
> >
> > And now, I'm totally biased, but Strimzi
> > <https://github.com/strimzi/strimzi-kafka-operator>
> > provides a fully open-source operator to manage Kafka clusters on
> > Kubernetes.
> > You're welcome to try it (hopefully it will help you manage your Kafka
> > clusters), and to join the community to ask questions, join discussions,
> > or contribute to it.
> >
> > Thank you.
> > Luke
> >
> > On Fri, Apr 19, 2024 at 4:19 AM Francesco Burato <bur...@adobe.com.invalid> wrote:
> >
> > > Hello,
> > >
> > > I have a question regarding the deployment of Kafka using KRaft
> > > controllers in a Kubernetes environment. Our current Kafka clusters are
> > > deployed on K8S as statefulsets without operators, and our brokers are
> > > configured to use a script which marks the containers as unready if
> > > under-replicated partitions exist. With this readiness check and a pod
> > > disruption budget of the minimum in-sync replicas - 1, we are able to
> > > perform rollout restarts of our brokers automatically without ever
> > > producing consumer or producer errors.
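> > >
> > > For illustration, a hedged sketch of that kind of URP check using the
> > > Java Admin client instead of kafka-topics.sh (the bootstrap address is
> > > assumed, and the real script may differ): the probe reports unready
> > > while any partition in the cluster has fewer in-sync replicas than
> > > assigned replicas.
> > >
> > > import java.util.Properties;
> > > import java.util.Set;
> > > import java.util.concurrent.TimeUnit;
> > > import org.apache.kafka.clients.admin.Admin;
> > > import org.apache.kafka.clients.admin.AdminClientConfig;
> > >
> > > public class UrpProbe {
> > >     public static void main(String[] args) throws Exception {
> > >         Properties props = new Properties();
> > >         props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
> > >         try (Admin admin = Admin.create(props)) {
> > >             Set<String> topics = admin.listTopics().names().get(30, TimeUnit.SECONDS);
> > >             // A partition is under-replicated if its ISR is smaller than its replica set.
> > >             boolean anyUrp = admin.describeTopics(topics).allTopicNames()
> > >                     .get(60, TimeUnit.SECONDS).values().stream()
> > >                     .flatMap(d -> d.partitions().stream())
> > >                     .anyMatch(p -> p.isr().size() < p.replicas().size());
> > >             System.exit(anyUrp ? 1 : 0); // non-zero (unready) while URPs exist
> > >         }
> > >     }
> > > }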
> > >
> > > We have started the process of transitioning to KRaft, and based on the
> > > recommended deployment strategy we are going to define dedicated nodes
> > > as controllers instead of using combined servers. However, defining
> > > nodes as controllers does not seem to allow us to use the same strategy
> > > for the readiness check, as kafka-topics.sh does not appear to be
> > > executable against controller nodes.
> > >
> > > The question is: what is a reliable readiness check that can be used
> > > for KRaft controllers and that ensures rollout restarts can be performed
> > > safely?
> > >
> > > Thanks,
> > >
> > > Frank
> > >
> > > --
> > > Francesco Burato | Software Development Engineer | Adobe | bur...@adobe.com
> > >
> > >
> >
>


--
ddbrod...@gmail.com

"The price of reliability is the pursuit of the utmost simplicity.
It is a price which the very rich find the most hard to pay."
                                                                   (Sir
Antony Hoare, 1980)
