Hi Frank,

About your question:
> Unless this is already available but not well publicised in the
> documentation, ideally there should be a protocol on the controller
> ports that answers operational questions like “are metadata partitions
> in sync?”, “has the current controller converged with other members of the
> quorum?”.

I'm afraid that since the KRaft controller uses the Raft protocol, there
is no "in-sync replica" definition like in the data replication protocol.
What we did for our check is described here
<https://github.com/strimzi/proposals/blob/main/060-kafka-roller-kraft.md#the-new-quorum-check>.
In short, we use `controller.quorum.fetch.timeout.ms` and
`replicaLastCaughtUpTimestamp` to determine whether it's safe to roll a
controller pod.
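
For illustration, a check along those lines can be scripted on top of the
kafka-metadata-quorum.sh tool, which reports a LastCaughtUpTimestamp per
replica. This is only a rough sketch, not our exact implementation: the
awk column, the bootstrap address, and the 2000 ms threshold (mirroring
the default of controller.quorum.fetch.timeout.ms) are assumptions to
adapt to your Kafka version and config:

  #!/usr/bin/env bash
  # Refuse to roll a controller unless every quorum replica has caught up
  # within the fetch timeout.
  FETCH_TIMEOUT_MS=2000   # mirror controller.quorum.fetch.timeout.ms
  NOW_MS=$(date +%s%3N)   # GNU date, epoch milliseconds
  while read -r caught_up; do
    if (( NOW_MS - caught_up > FETCH_TIMEOUT_MS )); then
      echo "quorum replica lagging; not safe to roll this controller"
      exit 1
    fi
  done < <(bin/kafka-metadata-quorum.sh --bootstrap-server localhost:9092 \
             describe --replication | tail -n +2 | awk '{print $5}')
  echo "all quorum replicas caught up; safe to roll"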

Hope this helps.

Thank you.
Luke




On Fri, Apr 19, 2024 at 5:06 PM Francesco Burato <bur...@adobe.com.invalid>
wrote:

> Hi Luke,
>
> Thanks for the answers. I understand what you are describing in terms of
> the rationale for using just the availability of the controller port to
> determine the readiness of the controller, but that is not fully satisfying
> from an operational perspective, at least given the lack of sufficient
> documentation on the matter. Based on my understanding of KRaft, which I
> admit is not considerable, the controllers host the cluster metadata
> partitions on disk and make them available to the brokers. So, presumably,
> one of the purposes of the controllers is to ensure that the metadata
> partitions are properly replicated. Hence, what happens, even in a non-K8s
> environment, if all controllers go down? What sort of outage does the wider
> cluster experience in that circumstance?
>
> A complete outage of the controllers is of course an extreme scenario, but
> a more likely one is that a controller's disk goes offline and needs
> to be replaced. In this scenario, the controller will have to reconstruct
> the cluster metadata from scratch from the other controllers in the quorum,
> but it presumably cannot participate in the quorum until the metadata
> partitions are fully replicated. Based on this assumption, the mere
> availability of the controller port does not necessarily mean that I can
> safely shut down another controller, because replication might not have
> completed yet.
>
> As I mentioned earlier, I don’t know the details of KRaft in sufficient
> depth to evaluate whether my assumptions are warranted, but the official
> documentation does not seem to go into much detail on how to safely operate
> a cluster in KRaft mode, while it provides very good information on how to
> safely operate a ZK cluster, highlighting that URPs and leader
> elections must be kept under control during restarts.
>
> Unless this is already available but not well publicised in the
> documentation, ideally there should be a protocol on the controller
> ports that answers operational questions like “are metadata partitions
> in sync?”, “has the current controller converged with other members of the
> quorum?”.
>
> It goes without saying that if any of these topics are properly covered
> anywhere in the docs, I am more than happy to be RTFMed to the right place.
>
> As for the other points you raise: we have a very particular set-up for
> our Kafka clusters that makes the circumstance you highlight a non-issue.
> In particular, our consumers and producers are all internal to a namespace
> and can connect to non-ready brokers. Given that the URP script checks the
> global URP state rather than just the URP state of the individual broker,
> as long as even one broker is marked as ready, the entire cluster is safe.
> With the ordered rotation imposed by the statefulset parallel rolling
> restart, together with the URP readiness check and the PDB, we are
> guaranteed not to cause any read or write errors. Rotations are rather
> long, but we don’t really care about speed.
>
> Thanks,
>
> Frank
>
> --
> Francesco Burato | Software Development Engineer | Adobe |
> bur...@adobe.com | c. +44 747 9029370
>
>
> From: Luke Chen <show...@gmail.com>
> Date: Friday, 19 April 2024 at 05:21
> To: users@kafka.apache.org <users@kafka.apache.org>
> Subject: Re: Kraft controller readiness checks
>
>
> Hello Frank,
>
> That's a good question.
> I think we all know there is no "correct" answer for this question. But I
> can share with you what our team did for it.
>
> Readiness: the controller is listening on controller.listener.names
>
> The rationale behind it is:
> 1. The last step of controller node startup is to wait until all the
> SocketServer ports are open and the Acceptors are started, and the
> controller port is one of them.
> 2. The controller listener is used to talk to other controllers (voters)
> to form the raft quorum, so if it is not open and listening, the controller
> is basically not working at all.
> 3. The controller listener is also used by brokers (observers) to get
> updated raft quorum info and fetch metadata.
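>
> As a concrete illustration, the resulting probe can be as simple as a TCP
> check on the controller listener. A minimal sketch, assuming netcat is
> available in the container and that 9090 is the port that
> controller.listener.names maps to (adjust both to your setup):
>
>   # Succeed once the controller listener accepts TCP connections.
>   nc -z -w 2 localhost 9090 || exit 1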
>
> Compared with the ZooKeeper cluster, which the KRaft quorum is trying to
> replace, the liveness/readiness probe recommended in the Kubernetes
> tutorial
> <https://kubernetes.io/docs/tutorials/stateful-application/zookeeper/#testing-for-liveness>
> also does a "ruok" check for the pod. And the handler for this "ruok"
> command
> <https://github.com/apache/zookeeper/blob/d12aba599233b0fcba0b9b945ed3d2f45d4016f0/zookeeper-server/src/main/java/org/apache/zookeeper/server/command/RuokCommand.java#L32>
> on the ZooKeeper server side returns "imok" directly, which means
> it's just doing a connection check. So we think this check makes sense.
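>
> For reference, that ZooKeeper probe boils down to a one-liner like this
> (a sketch; 2181 is the standard client port):
>
>   # ZooKeeper replies "imok" as soon as it can accept a connection.
>   [ "$(echo ruok | nc -w 2 localhost 2181)" = "imok" ] || exit 1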
>
> Here's our design proposal
> <https://github.com/strimzi/proposals/blob/main/046-kraft-liveness-readiness.md>
> for the Liveness and Readiness probes in a KRaft Kafka cluster, FYI.
> But again, I still think there's no "correct" answer for it. If you have
> any better ideas, please let us know.
>
> However, I have some suggestions for your readiness probe for brokers:
>
> > our brokers are configured to use a script which marks the containers as
> > unready if under-replicated partitions exist. With this readiness check and
> > a pod disruption budget of the minimum in sync replica - 1
>
> I understand it works well, but it has some drawbacks, and the biggest
> issue I can think of is that it can cause unavailability for some
> partitions (the kind of global URP check I mean is sketched after this
> example).
> For example: there are 3 brokers in the cluster (0, 1, 2), and 10 topic
> partitions are hosted on broker 0.
> a. Broker 0 shuts down, and all partitions on broker 0 become followers.
> b. Broker 0 starts up, and all its followers try to catch up with the
> leaders.
> c. 9 out of 10 partitions have caught up and rejoined the ISR. At this
> point, the pod is still unready because 1 partition is still
> under-replicated.
> d. Some of the partitions on broker 0 become leaders, for example when
> auto leader rebalancing is triggered.
> e. The leader partitions on broker 0 are now unavailable: because the pod
> is not in the ready state, it cannot serve incoming requests.
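>
> For clarity, the kind of global URP readiness check discussed above is
> roughly the following sketch (the bootstrap address is an assumption):
>
>   # Unready while any under-replicated partitions exist anywhere in the
>   # cluster: one lagging partition keeps every pod unready.
>   urp=$(bin/kafka-topics.sh --bootstrap-server localhost:9092 \
>           --describe --under-replicated-partitions | wc -l)
>   [ "$urp" -eq 0 ] || exit 1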
>
> In our team, we use the brokerState metric value = RUNNING for the
> readiness probe. In KRaft mode, the broker enters the RUNNING state after
> it has caught up with the controller's metadata and has started serving
> requests from clients. We think that makes more sense.
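>
> As an illustration, a probe along these lines can read that metric over
> JMX with the JmxTool class shipped with Kafka. This is only a sketch: the
> JMX port 9999 and the output parsing are assumptions, and our real probe
> uses a small agent rather than JmxTool:
>
>   # Ready only when BrokerState == 3 (RUNNING); other values include
>   # 0=NOT_RUNNING, 1=STARTING, 2=RECOVERY, 7=SHUTTING_DOWN.
>   state=$(bin/kafka-run-class.sh kafka.tools.JmxTool \
>     --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
>     --object-name kafka.server:type=KafkaServer,name=BrokerState \
>     --one-time true | tail -n 1 | awk -F, '{print $2}')
>   [ "${state%.*}" = "3" ] || exit 1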
> Again, for more details, you can check the design proposal
> <https://github.com/strimzi/proposals/blob/main/046-kraft-liveness-readiness.md>
> for the Liveness and Readiness probes in a KRaft Kafka cluster.
>
> Finally, I saw you don't have an operator for your Kafka clusters.
> I don't know how you manage all these Kafka clusters manually, but there
> must be some cumbersome operations, like rolling pods.
> Let's say you want to roll the pods one by one: which pod do you roll
> first?
> And which pod goes last?
> Will you do any checks before rolling?
> How much time does each roll take?
> ...
>
> I'm just listing some of the problems you might run into. So I would
> recommend deploying an operator to help manage the Kafka clusters.
> This is our design proposal
> <https://github.com/strimzi/proposals/blob/main/060-kafka-roller-kraft.md>
> for the Kafka roller in the operator for KRaft, FYI.
>
> And now, I'm totally biased, but Strimzi
> <https://github.com/strimzi/strimzi-kafka-operator> provides a fully
> open-source operator to manage Kafka clusters on Kubernetes.
> You're welcome to try it (hopefully it will help you manage your Kafka
> clusters), join the community to ask questions, join discussions, or
> contribute to it.
>
> Thank you.
> Luke
>
>
> On Fri, Apr 19, 2024 at 4:19 AM Francesco Burato
> <bur...@adobe.com.invalid> wrote:
>
> > Hello,
> >
> > I have a question regarding the deployment of Kafka using Kraft
> > controllers in a Kubernetes environment. Our current Kafka cluster is
> > deployed on K8S clusters as statefulsets without operators, and our
> > brokers are configured to use a script which marks the containers as
> > unready if under-replicated partitions exist. With this readiness check
> > and a pod disruption budget of the minimum in sync replica - 1, we are
> > able to perform rollout restarts of our brokers automatically without
> > ever producing consumer or producer errors.
> >
> > We have started the process of transitioning to KRaft, and based on the
> > recommended deployment strategy we are going to define dedicated nodes as
> > controllers instead of using combined servers. However, defining nodes as
> > controllers does not seem to allow using the same readiness-check
> > strategy, as kafka-topics.sh does not appear to be executable against
> > controller brokers.
> >
> > The question is: what is a reliable readiness check that can be used for
> > KRaft controllers and ensures that rollout restarts can be performed
> > safely?
> >
> > Thanks,
> >
> > Frank
> >
> > --
> > Francesco Burato | Software Development Engineer | Adobe |
> > bur...@adobe.com
> >
> >
>
