Thanks Luke, this helps for our use case.  It does not cover the buildout
of a new cluster where there are no brokers yet, but that should be remedied
by KIP-919, which looks to be resolved in 3.7.0.
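
For what it's worth, my (hedged) understanding of KIP-919 is that the
AdminClient gains a bootstrap.controllers config so that tools can talk to
the controller quorum directly, with no brokers running yet. A rough sketch
of what that might look like; the config name and addresses are assumptions
on my part, I have not tried 3.7.0:

import java.util.Properties;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.admin.Admin;

public class ControllerOnlyQuorumDescribe {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed KIP-919 config name: point the client at the controllers
        // instead of bootstrap.servers (placeholder addresses).
        props.put("bootstrap.controllers",
                  "controller-0:9093,controller-1:9093,controller-2:9093");

        try (Admin admin = Admin.create(props)) {
            var quorum = admin.describeMetadataQuorum()
                              .quorumInfo().get(10, TimeUnit.SECONDS);
            System.out.println("leader = " + quorum.leaderId()
                    + ", voters = " + quorum.voters().size());
        }
    }
}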

ttyl
Dima


On Sun, Apr 21, 2024 at 9:06 PM Luke Chen <show...@gmail.com> wrote:

> Hi Frank,
>
> About your question:
> > Unless this is already available but not well publicised in the
> documentation, ideally there should be protocol working on the controller
> ports that answers to operational questions like “are metadata partitions
> in sync?”, “has the current controller converged with other members of the
> quorum?”.
>
> I'm sorry, but the KRaft controller uses the Raft protocol, so there is no
> "in-sync replica" definition like in the data replication protocol. What we
> did for our check is described here
> <https://github.com/strimzi/proposals/blob/main/060-kafka-roller-kraft.md#the-new-quorum-check>.
> In short, we use `controller.quorum.fetch.timeout.ms` and
> `replicaLastCaughtUpTimestamp` to determine if it's safe to roll this
> controller pod.
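>
> For illustration only, here is a rough sketch of that kind of check done
> with the Java AdminClient's describeMetadataQuorum() API rather than our
> actual roller code; the bootstrap address and the 2000 ms threshold (the
> controller.quorum.fetch.timeout.ms default) are placeholder assumptions:
>
> import java.util.Properties;
> import java.util.concurrent.TimeUnit;
> import org.apache.kafka.clients.admin.Admin;
> import org.apache.kafka.clients.admin.AdminClientConfig;
> import org.apache.kafka.clients.admin.QuorumInfo;
>
> public class QuorumCatchUpCheck {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         // Placeholder bootstrap address for the AdminClient.
>         props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
>         // Same threshold idea as controller.quorum.fetch.timeout.ms.
>         long fetchTimeoutMs = 2000L;
>
>         try (Admin admin = Admin.create(props)) {
>             QuorumInfo quorum = admin.describeMetadataQuorum()
>                     .quorumInfo().get(10, TimeUnit.SECONDS);
>             long now = System.currentTimeMillis();
>             // A voter counts as caught up if it reached the leader's log
>             // end within the fetch timeout (the leader trivially qualifies).
>             boolean safeToRoll = quorum.voters().stream().allMatch(v ->
>                     v.replicaId() == quorum.leaderId()
>                     || (v.lastCaughtUpTimestamp().isPresent()
>                         && now - v.lastCaughtUpTimestamp().getAsLong()
>                            < fetchTimeoutMs));
>             System.out.println(safeToRoll ? "safe to roll" : "do not roll");
>             System.exit(safeToRoll ? 0 : 1);
>         }
>     }
> }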
>
> Hope this helps.
>
> Thank you.
> Luke
>
>
>
>
> On Fri, Apr 19, 2024 at 5:06 PM Francesco Burato
> <bur...@adobe.com.invalid> wrote:
>
> > Hi Luke,
> >
> > Thanks for the answers. I understand what you are describing in terms of
> > the rationale for using just the availability of the controller port to
> > determine the readiness of the controller, but that is not fully
> > satisfying from an operational perspective, at least given the lack of
> > sufficient documentation on the matter. Based on my understanding of
> > KRaft, which I admit is not considerable, the controllers host the cluster
> > metadata partitions on disk and make them available to the brokers. So,
> > presumably, one of the purposes of the controllers is to ensure that the
> > metadata partitions are properly replicated. Hence, what happens if, even
> > in a non-K8s environment, all controllers go down? What sort of outage
> > does the wider cluster experience in that circumstance?
> >
> > A complete outage of the controllers is of course an extreme scenario, but
> > a more likely one is that a controller's disk goes offline and needs to be
> > replaced. In this scenario, the controller will have to reconstruct the
> > cluster metadata from scratch from the other controllers in the quorum,
> > but it presumably cannot participate in the quorum until the metadata
> > partitions are fully replicated. Based on this assumption, the mere
> > availability of the controller port does not necessarily mean that I can
> > safely shut down another controller, because replication may not have
> > completed yet.
> >
> > As I mentioned earlier, I don't know the details of KRaft in sufficient
> > depth to evaluate whether my assumptions are warranted, but the official
> > documentation does not seem to go into much detail on how to safely
> > operate a cluster in KRaft mode, while it provides very good information
> > on how to safely operate a ZK-based cluster by highlighting that URPs and
> > leader elections must be kept under control during restarts.
> >
> > Unless this is already available but not well publicised in the
> > documentation, ideally there should be a protocol available on the
> > controller ports that answers operational questions like "are the metadata
> > partitions in sync?" and "has the current controller converged with the
> > other members of the quorum?".
> >
> > It goes without saying that if any of these topics are properly covered
> > anywhere in the docs, I am more than happy to be RTFMed to the right place.
> >
> > As for the other points you raise: we have a very particular set-up for
> > our Kafka clusters that makes the circumstance you highlight not a
> > problem. In particular, our consumers and producers are all internal to a
> > namespace and can connect to non-ready brokers. Given that the URP script
> > checks the global URP state rather than just the URP state of the
> > individual broker, as long as even one broker is marked as ready, the
> > entire cluster is safe. With the ordered rotation imposed by the
> > statefulset rolling restart, together with the URP readiness check and the
> > PDB, we are guaranteed not to cause any read or write errors. Rotations
> > are rather long, but we don't really care about speed.
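> >
> > For reference, a simplified sketch of the kind of global URP check our
> > readiness script performs, re-expressed here with the Java AdminClient
> > instead of kafka-topics.sh (the bootstrap address is a placeholder and the
> > real script differs in detail):
> >
> > import java.util.Properties;
> > import java.util.concurrent.TimeUnit;
> > import org.apache.kafka.clients.admin.Admin;
> > import org.apache.kafka.clients.admin.AdminClientConfig;
> > import org.apache.kafka.clients.admin.ListTopicsOptions;
> >
> > public class UnderReplicatedCheck {
> >     public static void main(String[] args) throws Exception {
> >         Properties props = new Properties();
> >         props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
> >                   "localhost:9092"); // placeholder
> >
> >         try (Admin admin = Admin.create(props)) {
> >             // Describe every topic, including internal ones.
> >             var names = admin.listTopics(new ListTopicsOptions().listInternal(true))
> >                              .names().get(30, TimeUnit.SECONDS);
> >             var topics = admin.describeTopics(names)
> >                               .allTopicNames().get(30, TimeUnit.SECONDS);
> >
> >             // A partition is under-replicated when its ISR is smaller
> >             // than its replica set.
> >             long urp = topics.values().stream()
> >                     .flatMap(t -> t.partitions().stream())
> >                     .filter(p -> p.isr().size() < p.replicas().size())
> >                     .count();
> >
> >             // Non-zero exit marks the pod as unready.
> >             System.exit(urp == 0 ? 0 : 1);
> >         }
> >     }
> > }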
> >
> > Thanks,
> >
> > Frank
> >
> > --
> > Francesco Burato | Software Development Engineer | Adobe |
> > bur...@adobe.com | c. +44 747 9029370
> >
> >
> > From: Luke Chen <show...@gmail.com>
> > Date: Friday, 19 April 2024 at 05:21
> > To: users@kafka.apache.org <users@kafka.apache.org>
> > Subject: Re: Kraft controller readiness checks
> >
> > Hello Frank,
> >
> > That's a good question.
> > I think we all know there is no "correct" answer for this question. But I
> > can share with you what our team did for it.
> >
> > Readiness: the controller is listening on the controller.listener.names
> > port (a minimal probe sketch follows the rationale below).
> >
> > The rationale behind it is:
> > 1. The last step of the controller node startup is to wait for all the
> > SocketServer ports to be open and the Acceptors to be started, and the
> > controller port is one of them.
> > 2. This controller listener is used to talk to other controllers (voters)
> > to form the raft quorum, so if it is not open and listening, the
> > controller is basically not working at all.
> > 3. The controller listener is also used by brokers (observers) to get the
> > updated raft quorum info and fetch metadata.
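> >
> > As a minimal sketch, such a probe can be as simple as a TCP connect to the
> > controller listener; the host, port and timeout below are placeholders:
> >
> > import java.io.IOException;
> > import java.net.InetSocketAddress;
> > import java.net.Socket;
> >
> > public class ControllerPortCheck {
> >     public static void main(String[] args) {
> >         // Placeholders: controller listener host/port from the
> >         // listeners / controller.listener.names configuration.
> >         String host = args.length > 0 ? args[0] : "localhost";
> >         int port = args.length > 1 ? Integer.parseInt(args[1]) : 9093;
> >
> >         try (Socket socket = new Socket()) {
> >             // If the Acceptor is up and listening, the connect succeeds.
> >             socket.connect(new InetSocketAddress(host, port), 3000);
> >             System.exit(0); // ready
> >         } catch (IOException e) {
> >             System.exit(1); // not ready
> >         }
> >     }
> > }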
> >
> > Compared with the ZooKeeper cluster, which the KRaft quorum is trying to
> > replace, the liveness/readiness probe recommended in the Kubernetes
> > tutorial
> > <https://kubernetes.io/docs/tutorials/stateful-application/zookeeper/#testing-for-liveness>
> > is also just doing a "ruok" check on the pod. And the handler for this
> > "ruok" command
> > <https://github.com/apache/zookeeper/blob/d12aba599233b0fcba0b9b945ed3d2f45d4016f0/zookeeper-server/src/main/java/org/apache/zookeeper/server/command/RuokCommand.java#L32>
> > on the ZooKeeper server side returns "imok" directly, which means it is
> > only doing a connection check. So we think this check makes sense.
> >
> > Here's our design proposal
> > <https://github.com/strimzi/proposals/blob/main/046-kraft-liveness-readiness.md>
> > for the Liveness and Readiness probes in a KRaft Kafka cluster, FYI.
> > But again, I still think there's no "correct" answer for it. If you have
> > any better ideas, please let us know.
> >
> > However, I have some suggestions for your readiness probe for brokers:
> >
> > > our brokers are configured to use a script which marks the containers as
> > > unready if under-replicated partitions exist. With this readiness check
> > > and a pod disruption budget of the minimum in sync replica - 1
> >
> > I understand it works well, but it has some drawbacks, and the biggest
> > issue I can think of is that it can cause unavailability for some
> > partitions.
> > For example: there are 3 brokers in the cluster: 0, 1, 2, and 10 topic
> > partitions are hosted on broker 0.
> > a. Broker 0 is shut down, and all partitions on broker 0 become followers.
> > b. Broker 0 is starting up, and all the followers are trying to catch up
> > with the leaders.
> > c. 9 out of 10 partitions have caught up and rejoined the ISR. At this
> > point, the pod is still unready because 1 partition is still
> > under-replicated.
> > d. Some of the partitions on broker 0 become leaders, for example because
> > auto leader rebalance is triggered.
> > e. The leader partitions on broker 0 are now unavailable, because the pod
> > is not in the ready state and cannot serve incoming requests.
> >
> > In our team, we use the brokerState metric value = RUNNING for the
> > readiness probe. In KRaft mode, the broker enters the RUNNING state after
> > it has caught up with the controller on metadata and starts serving
> > requests from clients. We think that makes more sense.
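> >
> > For illustration, a probe along these lines could poll the
> > kafka.server:type=KafkaServer,name=BrokerState JMX metric and compare it
> > against the RUNNING value (3). This is only a sketch assuming a JMX port
> > is exposed on the broker; the endpoint below is a placeholder and our
> > actual probe may differ:
> >
> > import javax.management.MBeanServerConnection;
> > import javax.management.ObjectName;
> > import javax.management.remote.JMXConnector;
> > import javax.management.remote.JMXConnectorFactory;
> > import javax.management.remote.JMXServiceURL;
> >
> > public class BrokerStateCheck {
> >     public static void main(String[] args) throws Exception {
> >         // Placeholder JMX endpoint; assumes the broker exposes JMX on 9999.
> >         String url = "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi";
> >
> >         try (JMXConnector conn = JMXConnectorFactory.connect(new JMXServiceURL(url))) {
> >             MBeanServerConnection mbs = conn.getMBeanServerConnection();
> >             ObjectName name =
> >                 new ObjectName("kafka.server:type=KafkaServer,name=BrokerState");
> >             // The gauge value is the broker state's numeric code; RUNNING is 3.
> >             int state = ((Number) mbs.getAttribute(name, "Value")).intValue();
> >             System.exit(state == 3 ? 0 : 1);
> >         }
> >     }
> > }
> >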
> > Again, for more details, you can check the design proposal
> > <https://github.com/strimzi/proposals/blob/main/046-kraft-liveness-readiness.md>
> > for the Liveness and Readiness probes in a KRaft Kafka cluster.
> >
> > Finally, I saw you don't have an operator for your Kafka clusters.
> > I don't know how you manage all these Kafka clusters manually, but there
> > must be some cumbersome operations, like rolling pods.
> > Let's say you want to roll the pods one by one: which pod will you roll
> > first?
> > Which pod goes last?
> > Will you do any checks before rolling?
> > How much time does each roll take?
> > ...
> >
> > I'm just listing some of the problems you might face. So I would recommend
> > deploying an operator to help manage the Kafka clusters.
> > This is our design proposal
> > <https://github.com/strimzi/proposals/blob/main/060-kafka-roller-kraft.md>
> > for the Kafka roller in the operator for KRaft, FYI.
> >
> > And now, I'm totally biased, but Strimzi
> > <https://github.com/strimzi/strimzi-kafka-operator> provides a fully
> > open-source operator to manage Kafka clusters on Kubernetes.
> > You're welcome to try it (hopefully it will help you manage your Kafka
> > clusters), join the community to ask questions, join discussions, or
> > contribute to it.
> >
> > Thank you.
> > Luke
> >
> > On Fri, Apr 19, 2024 at 4:19 AM Francesco Burato
> > <bur...@adobe.com.invalid> wrote:
> >
> > > Hello,
> > >
> > > I have a question regarding the deployment of Kafka using KRaft
> > > controllers in a Kubernetes environment. Our current Kafka cluster is
> > > deployed on K8S clusters as statefulsets without operators, and our
> > > brokers are configured to use a script which marks the containers as
> > > unready if under-replicated partitions exist. With this readiness check
> > > and a pod disruption budget of the minimum in-sync replicas - 1, we are
> > > able to perform rollout restarts of our brokers automatically without
> > > ever producing consumer or producer errors.
> > >
> > > We have started the process of transitioning to KRaft, and based on the
> > > recommended deployment strategy we are going to define dedicated nodes
> > > as controllers instead of using combined servers. However, defining
> > > nodes as controllers does not seem to allow using the same strategy for
> > > the readiness check, as kafka-topics.sh does not appear to be executable
> > > against controller nodes.
> > >
> > > The question is: what is a reliable readiness check for KRaft
> > > controllers that ensures a rollout restart can be performed safely?
> > >
> > > Thanks,
> > >
> > > Frank
> > >
> > > --
> > > Francesco Burato | Software Development Engineer | Adobe |
> > > bur...@adobe.com
> > >
> > >
> >
>


-- 
ddbrod...@gmail.com

"The price of reliability is the pursuit of the utmost simplicity.
It is a price which the very rich find the most hard to pay."
                                                                   (Sir
Antony Hoare, 1980)
