On 09/10/19 09:58 +0200, Kadlecsik József wrote:
> The nodes in our cluster have backend and frontend interfaces: the
> former are for the storage and cluster (corosync) traffic, and the
> latter are for the public services of the KVM guests only.
>
> One of the nodes had a failure ("watchdog: BUG: soft lockup - CPU#7
> stuck for 23s"), which resulted in the node being able to process
> traffic on the backend interface but not on the frontend one. Thus the
> services became unavailable, but the cluster thought the node was all
> right and did not stonith it.
>
> How could we protect the cluster against such failures?
>
> We could configure a second corosync ring, but that would be a
> redundancy ring only.
>
> We could set up a second, independent corosync configuration for a
> second pacemaker instance with stonith agents only. Is it enough to
> specify the cluster name in the corosync config to pair pacemaker with
> corosync? And what about pairing pacemaker to this corosync instance --
> how can we tell pacemaker to connect to it?
Such pairing happens on a Unix-socket, system-wide singleton basis. In
other words, two instances of corosync on the same machine would
conflict -- only a single daemon can run at a time.

> Which is the best way to solve the problem?

The heuristics feature of corosync-qdevice, which could ping/attest your
frontend interface, looks like a way to go. You would need an additional
host in your setup, though.

-- 
Poki
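For illustration, such a setup might look roughly like the following in
corosync.conf -- a sketch only, with the qnetd host name and the ping
target (e.g. the frontend gateway) as placeholders you would replace;
see corosync-qdevice(8) for the exact option semantics:

```
quorum {
    provider: corosync_votequorum
    device {
        model: net
        votes: 1
        net {
            # the additional host running corosync-qnetd (placeholder name)
            host: qnetd.example.org
            algorithm: ffsplit
        }
        heuristics {
            mode: on
            # the node passes the heuristic only while it can reach
            # something on the frontend network (placeholder address)
            exec_ping: /usr/bin/ping -q -c 1 192.0.2.1
        }
    }
}
```

With heuristics enabled, a node whose frontend interface is dead would
fail the ping check and lose the qdevice vote, so the cluster could
notice the failure even though the backend ring is still healthy.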
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/