Hi Luke,

Thanks for the answers. I understand the rationale you describe for using
just the availability of the controller port to determine the readiness of
the controller, but that is not fully satisfying from an operational
perspective, at least given the lack of documentation on the matter. Based
on my understanding of KRaft, which I admit is limited, the controllers host
the cluster metadata partitions on disk and make them available to the
brokers. So, presumably, one of the purposes of the controllers is to ensure
that the metadata partitions are properly replicated. Hence, what happens,
even in a non-K8s environment, if all controllers go down? What sort of
outage does the wider cluster experience in that circumstance?

A complete outage of the controllers is of course an extreme scenario, but a
more likely one is that a controller's disk goes offline and needs to be
replaced. In this scenario, the controller will have to reconstruct the
cluster metadata from scratch from the other controllers in the quorum, but
it presumably cannot participate in the quorum until the metadata partitions
are fully replicated. Based on this assumption, the mere availability of the
controller port does not necessarily mean that I can safely shut down another
controller, because replication may not have completed yet.

As I mentioned earlier, I don't know the details of KRaft in sufficient
detail to evaluate whether my assumptions are warranted, but the official
documentation does not seem to go into much detail on how to safely operate
a cluster in KRaft mode, whereas it provides very good information on how to
safely operate a ZK-based cluster, highlighting that URPs and leader
elections must be kept under control during restarts.

Unless this is already available but not well publicised in the
documentation, ideally there should be a protocol exposed on the controller
ports that answers operational questions like "are the metadata partitions
in sync?" and "has the current controller converged with the other members
of the quorum?".
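
The closest thing I have found so far is kafka-metadata-quorum.sh, which can
report follower lag on the metadata log, so a check along these lines is the
sort of thing I am picturing. This is only a sketch: the listener address is
a placeholder, the output format of "describe --status" may vary between
versions, and I have not verified whether controller-only nodes can serve
this directly on every release.

    #!/usr/bin/env bash
    # Sketch only: gate controller readiness on metadata-quorum follower lag.
    set -euo pipefail

    BOOTSTRAP="kafka-brokers-headless:9092"   # placeholder internal listener

    # 'describe --status' prints, among other fields, MaxFollowerLag for the
    # metadata log; a non-zero value means at least one voter/observer is
    # still catching up.
    max_lag=$(kafka-metadata-quorum.sh --bootstrap-server "$BOOTSTRAP" \
                describe --status \
              | awk '/MaxFollowerLag:/ {print $NF; exit}')

    # Ready only if every follower has fully caught up on the metadata log.
    [ -n "$max_lag" ] && [ "$max_lag" -eq 0 ]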

It goes without saying that if any of these topics are properly covered
anywhere in the docs, I am more than happy to be RTFMed to the right place.

As for the other points you raise: we have a very particular set-up for our
Kafka clusters that makes the circumstance you highlight not a problem. In
particular, our consumers and producers are all internal to a namespace and
can connect to non-ready brokers. Given that the URP script checks the
global URP state rather than just the URP state of the individual broker, as
long as even one broker is marked as ready the entire cluster is safe. With
the ordered rotation imposed by the statefulset parallel rolling restart,
together with the URP readiness check and the PDB, we are guaranteed not to
cause any read or write errors. Rotations are rather long, but we don't
really care about speed.
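
For reference, the broker check is essentially of the following shape. This
is a simplified sketch rather than our exact script, and the listener
address is a placeholder for our internal one.

    #!/usr/bin/env bash
    # Sketch of a URP-based broker readiness check: the pod is ready only
    # when no under-replicated partitions exist anywhere in the cluster.
    set -euo pipefail

    BOOTSTRAP="localhost:9092"   # placeholder internal listener

    urp=$(kafka-topics.sh --bootstrap-server "$BOOTSTRAP" \
            --describe --under-replicated-partitions)

    # Empty output means no URPs cluster-wide, so this pod can be marked
    # ready and the PDB will allow the next pod to be disrupted.
    [ -z "$urp" ]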

Thanks,

Frank

--
Francesco Burato | Software Development Engineer | Adobe | bur...@adobe.com
| c. +44 747 9029370


From: Luke Chen <show...@gmail.com>
Date: Friday, 19 April 2024 at 05:21
To: users@kafka.apache.org <users@kafka.apache.org>
Subject: Re: Kraft controller readiness checks


Hello Frank,

That's a good question.
I think we all know there is no "correct" answer for this question. But I
can share with you what our team did for it.

Readiness: controller is listening on the controller.listener.names

The rationale behind it is:
1. The last step of controller node startup is to wait for all the
SocketServer ports to be open and the Acceptors to be started, and the
controller port is one of them.
2. This controller listener is used to talk to other controllers (voters)
to form the raft quorum, so if it is not open and listening, the controller
is basically not working at all.
3. The controller listener is also used for brokers (observers) to get the
updated raft quorum info and fetch metadata.
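
As a rough illustration, the kind of probe this amounts to is just a TCP
connect to the controller listener, along these lines (the port and timeout
are illustrative only, standing in for whatever controller.listener.names
maps to):

    #!/usr/bin/env bash
    # Illustrative only: readiness == the controller listener accepts TCP
    # connections. Port 9093 is a placeholder for the controller listener.
    CONTROLLER_PORT="${CONTROLLER_PORT:-9093}"

    # bash's /dev/tcp pseudo-device opens a TCP connection; the exit status
    # is the probe result (0 = ready, non-zero = not ready).
    timeout 3 bash -c "exec 3<>/dev/tcp/127.0.0.1/${CONTROLLER_PORT}" 2>/dev/null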

Compared with a Zookeeper cluster, which is what the KRaft quorum is trying
to replace, the liveness/readiness probe recommended in the Kubernetes
tutorial
<https://kubernetes.io/docs/tutorials/stateful-application/zookeeper/#testing-for-liveness>
is also just doing a "ruok" check on the pod. And the handler for this
"ruok" command
<https://github.com/apache/zookeeper/blob/d12aba599233b0fcba0b9b945ed3d2f45d4016f0/zookeeper-server/src/main/java/org/apache/zookeeper/server/command/RuokCommand.java#L32>
on the Zookeeper server side returns "imok" directly, which means it is only
doing a connection check. So we think this check makes sense.
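
In other words, the recommended ZK probe boils down to something as simple
as the following (assuming the four-letter-word commands are whitelisted on
the server):

    # 'ruok' is answered with 'imok' without inspecting quorum or data
    # state, i.e. it is purely a connection-level check.
    echo ruok | nc 127.0.0.1 2181   # expect: imok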

Here's our design proposal
<https://github.com/strimzi/proposals/blob/main/046-kraft-liveness-readiness.md>
for the Liveness and Readiness probes in a KRaft Kafka cluster, FYI.
But again, I still think there's no "correct" answer for it. If you have
any better ideas, please let us know.

However, I have some suggestions for your readiness probe for brokers:

> our brokers are configured to use a script which marks the containers as
unready if under-replicated partitions exist. With this readiness check and
a pod disruption budget of the minimum in sync replica - 1

I understand it works well, but it has some drawbacks, and the biggest
issue I can think of is: it's possible to cause unavailability in some
partitions.
For example: there are 3 brokers in the cluster: 0, 1, 2, and 10 topic
partitions are hosted on broker 0.
a. Broker 0 is shutting down; all partitions on broker 0 become followers.
b. Broker 0 is starting up; all the followers are trying to catch up with
the leaders.
c. 9 out of 10 partitions have caught up and rejoined the ISR. At this
point, the pod is still unready because 1 partition is still
under-replicated.
d. Some of the partitions on broker 0 become leaders, for example because
auto leader rebalance is triggered.
e. The leader partitions on broker 0 are now unavailable because the pod is
not in the ready state and cannot serve incoming requests.

In our team, we use the brokerState metric value = RUNNING for the readiness
probe. In KRaft mode, the broker enters the RUNNING state after it has
caught up with the controller's metadata and starts to serve requests from
clients. We think that makes more sense.
Again, for more details, you can check the design proposal
<https://github.com/strimzi/proposals/blob/main/046-kraft-liveness-readiness.md>
for the Liveness and Readiness probes in a KRaft Kafka cluster.
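
As a rough illustration only (not our actual implementation): with a
JMX/Prometheus exporter exposing the kafka.server BrokerState MBean, the
check reduces to something like the sketch below. The port 9404 and the
exported metric name are assumptions that depend entirely on the exporter
configuration.

    #!/usr/bin/env bash
    # Sketch only: readiness == BrokerState gauge equals RUNNING (value 3).
    set -euo pipefail

    # Assumed exporter endpoint and metric name; adjust to your setup.
    state=$(curl -sf http://127.0.0.1:9404/metrics \
            | awk '/kafka_server_kafkaserver_brokerstate/ {print $NF; exit}')

    # 3 == RUNNING in the BrokerState enum; strip a possible ".0" suffix.
    [ "${state%%.*}" = "3" ]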

Finally, I saw you don't have an operator for your Kafka clusters.
I don't know how you manage all these Kafka clusters manually, but there
must be some cumbersome operations, like rolling pods.
Let's say you want to roll the pods one by one: which pod will you roll
first?
And which pod goes last?
Will you do any checks before rolling?
How much time does each roll take?
...

I'm just listing some of the problems you might run into. So I would
recommend deploying an operator to help manage the Kafka clusters.
This is our design proposal
<https://github.com/strimzi/proposals/blob/main/060-kafka-roller-kraft.md>
for the Kafka roller in the operator for KRaft, FYI.

And now, I'm totally biased, but Strimzi
<https://github.com/strimzi/strimzi-kafka-operator>
provides a fully open-source operator to manage Kafka clusters on
Kubernetes.
You're welcome to try it (hopefully it will help you manage your Kafka
clusters), join the community to ask questions, join discussions, or
contribute to it.

Thank you.
Luke


On Fri, Apr 19, 2024 at 4:19 AM Francesco Burato <bur...@adobe.com.invalid>
wrote:

> Hello,
>
> I have a question regarding the deployment of Kafka using Kraft
> controllers in a Kubernetes environment. Our current Kafka cluster is
> deployed on K8S clusters as statefulsets without operators and our brokers
> are configured to use a script which marks the containers as unready if
> under-replicated partitions exist. With this readiness check and a pod
> disruption budget of the minimum in sync replica - 1, we are able to
> perform rollout restarts of our brokers automatically without ever
> producing consumer or producer errors.
>
> We have started the processes of transitioning to Kraft and based on the
> recommended deployment strategy we are going to define dedicated nodes as
> controllers instead of using combined servers. However, defining nodes as
> controller does not seem to allow to use the same strategy for readiness
> check as the kafka-topics.sh does not appear to be executable on controller
> brokers.
>
> The question is: what is a reliable readiness check that can be used for
> Kraft controllers that ensures that rollout restart can be performed safely?
>
> Thanks,
>
> Frank
>
> --
> Francesco Burato | Software Development Engineer | Adobe |
> bur...@adobe.com<mailto:bur...@adobe.com>
>
>
