Hi David,

Thank you for your response and your interest in this one. I agree with the main counter-argument: a single control plane for all tenants would have a greater blast radius. I still think it would also be more cost effective. The idea is to have a multi-region (3 AZ) controller quorum plus dual-region (2 AZ) broker clusters, and to share this control plane with all the other Kafka clusters. This is actually not something new ¹; I have personally battle-tested this setup and it works just as expected.
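To make the setup concrete, here is a rough configuration sketch (hostnames, ports, and node IDs are illustrative, not from a real deployment). With ZooKeeper, several Kafka clusters can already re-use one metadata ensemble via chroot paths; with KRaft, the controller quorum can be stretched across three regions, though today one quorum serves a single cluster, which is exactly what the proposal would relax:

```properties
# ZooKeeper-based: two broker clusters sharing the same three-node
# metadata ensemble, isolated from each other by chroot paths.
# Cluster A server.properties:
zookeeper.connect=zk-1:2181,zk-2:2181,zk-3:2181/kafka-cluster-a
# Cluster B server.properties:
zookeeper.connect=zk-1:2181,zk-2:2181,zk-3:2181/kafka-cluster-b

# KRaft-based: a controller quorum stretched over three regions,
# one voter per region, on dedicated controller nodes.
# controller.properties on each controller node:
process.roles=controller
node.id=1
controller.quorum.voters=1@ctrl-region-a:9093,2@ctrl-region-b:9093,3@ctrl-region-c:9093
# Brokers in the dual-region (2 AZ) data plane point at the same quorum:
process.roles=broker
controller.quorum.voters=1@ctrl-region-a:9093,2@ctrl-region-b:9093,3@ctrl-region-c:9093
```

Since KRaft has no chroot equivalent, the second snippet is a stretched quorum for one cluster only; sharing it across broker clusters is the hypothetical being discussed here.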
With both ZK- and KRaft-based deployments one can have different availability levels for the metadata and data planes, although there is no chroot functionality for KRaft:

- chroot comes in handy as the scale grows and more Kafka clusters are provisioned.
- Within restricted network topologies (e.g. a DMZ, or GDPR or equivalent regulatory requirements), separating the metadata and data roles helps with complying with the regulations, and chroot in this case enables us to re-use the same metadata ensemble for new Kafka clusters.
- When the data and metadata planes are separated it is easier to achieve zero RTO and RPO, provided the metadata plane is distributed across three regions. The data plane does not have that requirement.²

The cost effectiveness comes from reduced compute resources and eased management requirements.

¹ https://docs.confluent.io/platform/current/multi-dc-deployments/multi-region-architectures.html#stretched-cluster-2-5-data-center-cp-only
² https://blog.empathybox.com/post/62279088548/a-few-notes-on-kafka-and-jepsen#:~:text=we%20can%20tolerate%20N%2D1%20Kafka%20node%20failures%2C%20but%20only%20N/2%2D1%20Zookeeper%20failures.

Kind regards,
OSB

On Mon, 1 Apr 2024 at 21:34, David Arthur <david.art...@confluent.io.invalid> wrote:
>
> Omer,
>
> Thanks for the email. This is an interesting thing to consider.
> Conceptually, there is no reason why the controllers couldn't manage the
> metadata for multiple brokers. The main counter-argument I can think of is
> essentially the same as the motivation -- less isolation. With a shared
> controller, one "noisy" broker cluster that put a lot of load on the
> controller could affect metadata availability/latency for other broker
> clusters. Related to this, having multiple broker clusters share one
> controller cluster means a larger blast radius for controller failures.
>
> The "noisy neighbor" problem could be mitigated with a good implementation,
> but the failure coupling cannot.
>
> In the containerized world, resources are abstracted away, so there is not
> so much overhead to run a set of dedicated controller nodes. Even with
> bare-metal hardware, controller processes can be run on the same nodes as
> broker processes if needed.
>
> The 2+1 data center example seems a bit tangential to me.
>
> > This way metadata and data would have different level of availability
> > and it enable enterprises to design a more cost effective solution by
> > separating metadata and data service layer
>
> Is the idea here to have a multi-region controller quorum and then single
> region broker clusters? Could you achieve the same thing with one large
> Kafka cluster spread across regions but with topics having assignments that
> kept them region local? Is the "cost effectiveness" you're after just
> inter-broker networking costs?
>
> Maybe you could expand on this scenario and help motivate it a bit more?
>
> -David