On Wed, 2019-10-09 at 09:58 +0200, Kadlecsik József wrote:
> Hello,
>
> The nodes in our cluster have got backend and frontend interfaces: the
> former ones are for the storage and cluster (corosync) traffic and the
> latter ones are for the public services of KVM guests only.
>
> One of the nodes has got a failure ("watchdog: BUG: soft lockup - CPU#7
> stuck for 23s"), which resulted that the node could process traffic on the
> backend interface but not on the frontend one. Thus the services became
> unavailable but the cluster thought the node is all right and did not
> stonith it.
>
> How could we protect the cluster against such failures?

See the ocf:heartbeat:ethmonitor agent (to monitor the interface itself)
and/or the ocf:pacemaker:ping agent (to monitor reachability of some IP
such as a gateway).

> We could configure a second corosync ring, but that would be a redundancy
> ring only.
>
> We could setup a second, independent corosync configuration for a second
> pacemaker just with stonith agents. Is it enough to specify the cluster
> name in the corosync config to pair pacemaker to corosync? What about the
> pairing of pacemaker to this corosync instance, how can we tell pacemaker
> to connect to this corosync instance?
>
> Which is the best way to solve the problem?
>
> Best regards,
> Jozsef
> --
> E-mail : kadlecsik.joz...@wigner.mta.hu
> PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
> Address: Wigner Research Centre for Physics
>          H-1525 Budapest 114, POB. 49, Hungary
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
--
Ken Gaillot <kgail...@redhat.com>
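
For illustration only, here is a rough sketch of how the two agents
suggested in the reply (ethmonitor and ping) might be wired up with pcs.
It is a small Python script that only assembles and prints the commands;
it does not run them. The interface name, gateway address, and guest
resource name are placeholders that do not come from this thread, and the
exact pcs syntax can differ between pcs versions. Note that location rules
like these move resources away from the affected node; they do not fence it.

```python
import shlex

# Hypothetical values, not taken from the thread: adjust to the real cluster.
FRONTEND_IF = "eth1"          # assumed name of the frontend interface
GATEWAY_IP = "192.0.2.1"      # assumed frontend gateway (documentation address)
GUEST_RESOURCE = "vm-guest1"  # assumed name of an existing KVM guest resource


def ethmonitor_commands(interface, resource):
    """Build pcs commands for a cloned ethmonitor resource and a location rule."""
    # By default the agent records link state in a node attribute named
    # "ethmonitor-<interface>" (1 = up, 0 = down).
    attr = f"ethmonitor-{interface}"
    return [
        ["pcs", "resource", "create", f"ethmon-{interface}",
         "ocf:heartbeat:ethmonitor", f"interface={interface}", "clone"],
        # Keep the guest off any node whose frontend interface is reported down.
        ["pcs", "constraint", "location", resource, "rule",
         "score=-INFINITY", attr, "eq", "0"],
    ]


def ping_commands(host, resource):
    """Build pcs commands for a cloned ping resource and a location rule."""
    return [
        ["pcs", "resource", "create", "ping-frontend", "ocf:pacemaker:ping",
         f"host_list={host}", "dampen=5s", "multiplier=1000", "clone"],
        # "pingd" is the agent's default attribute name; ban nodes where the
        # target is unreachable or the attribute has not been set yet.
        ["pcs", "constraint", "location", resource, "rule",
         "score=-INFINITY", "pingd", "lt", "1", "or", "not_defined", "pingd"],
    ]


if __name__ == "__main__":
    commands = (ethmonitor_commands(FRONTEND_IF, GUEST_RESOURCE)
                + ping_commands(GATEWAY_IP, GUEST_RESOURCE))
    for cmd in commands:
        # Print only, so the commands can be reviewed before running by hand.
        print(shlex.join(cmd))
```

The ping target should be an address that is reachable only through the
frontend network; otherwise the check could still succeed over the backend
interface and miss the failure described above.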