On Tue, 21 Mar 2023 11:47:23 +0100
Jérôme BECOT <jerome.be...@deveryware.com> wrote:
> On 21/03/2023 at 11:00, Jehan-Guillaume de Rorthais wrote:
> > Hi,
> >
> > On Tue, 21 Mar 2023 09:33:04 +0100
> > Jérôme BECOT <jerome.be...@deveryware.com> wrote:
> >
> >> We have several clusters running for different Zabbix components. Some
> >> of these clusters consist of two Zabbix proxies, where the nodes run
> >> MySQL, zabbix-proxy and a VIP, plus a corosync-qdevice.
> >
> > I'm not sure I understand your topology. The corosync-qdevice is not
> > supposed to be on a cluster node. It is supposed to be on a remote node
> > and provide some quorum features to one or more clusters without setting
> > up the whole Pacemaker/Corosync stack there.
>
> I was not clear: the qdevice is deployed on a remote node, as intended.

OK.

> >> The MySQL servers are always up to replicate, and are configured
> >> master/master (each replicates from the other, but only one is supposed
> >> to be updated by the proxy running on the master node).
> >
> > Why do you bother with master/master when a simple (I suppose; I'm not a
> > MySQL cluster guy) primary/secondary topology, or even shared storage,
> > would be enough and would keep your logic (writes on one node only) safe
> > from incidents, failures, errors, etc.?
> >
> > HA must be as simple as possible. Remove useless parts when you can.
>
> A shared storage moves the complexity somewhere else.

Yes, to the storage/SAN side.

> A classic primary/secondary setup can be an option if Pacemaker manages
> to start the client on the standby node,

I suppose this can be done using a location constraint.

> but it would become master/master during a split brain.

No. And if you do have a real split brain, then you might have something
wrong in your setup. See below.

> >> One cluster is prone to frequent sync errors, with duplicate entry
> >> errors in SQL.
> >> When I look at the logs, I can see "Mar 21 09:11:41
> >> zabbix-proxy-01 pacemaker-controld [948] (pcmk_cpg_membership)
> >> info: Group crmd event 89: zabbix-proxy-02 (node 2 pid 967) left via
> >> cluster exit", and within the next second, a rejoin. The same messages
> >> are in the other node's logs, suggesting a split brain, which should
> >> not happen, because there is a quorum device.
> >
> > Could it be that your SQL sync errors and the leave/join issues are
> > correlated, both symptoms of another failure? Look at your logs for some
> > explanation of why the node decided to leave the cluster.
>
> My guess is that maybe high network latency causes the node to leave the
> cluster, and starting zabbix-proxy on both nodes then causes the
> replication errors. It is configured to use the VIP, which is up locally
> on each node because of the split brain.

If you have a split brain, that means your quorum setup is failing. No node
should be able to start/promote a resource without having quorum. If a node
is isolated from the cluster and the quorum device, it should stop its
resources, not recover/promote them. If both nodes lose connection with
each other but are still connected to the quorum device, the latter should
be able to grant quorum to one side only.

Lastly, quorum is a split-brain protection when "things are going fine".
Fencing is a split-brain protection for all other situations. Fencing is
hard and painful, but it saves you from many split-brain situations.

> This is why I'm requesting guidance on how to check/monitor these nodes,
> to find out whether temporary network latency is causing the node to
> leave.

A cluster is always very sensitive to network latency/failures. You need to
build on stronger foundations.

Regards,

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
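For reference, the "qdevice grants quorum to one side only" behaviour discussed
in this thread corresponds to a quorum section along these lines in
corosync.conf on both cluster nodes (a sketch only; the hostname is
illustrative, and the exact settings must match your deployment):

```
quorum {
    provider: corosync_votequorum
    device {
        # one extra vote supplied by the remote qnetd daemon
        votes: 1
        model: net
        net {
            # remote host running corosync-qnetd (illustrative name)
            host: qnetd.example.com
            # ffsplit: on a 50/50 split, grant the vote to exactly
            # one partition, so only one side keeps quorum
            algorithm: ffsplit
        }
    }
}
```

On a running cluster, `corosync-quorumtool -s` on each node and
`corosync-qnetd-tool -l` on the qdevice host show whether the device vote is
actually being granted, which is a good first check when diagnosing
leave/rejoin events like the ones quoted above.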