Re: [ClusterLabs] Howto stonith in the case of any interface failure?
On Wed, 2019-10-09 at 20:10 +0200, Kadlecsik József wrote:
> On Wed, 9 Oct 2019, Ken Gaillot wrote:
>
> > > One of the nodes has got a failure ("watchdog: BUG: soft lockup -
> > > CPU#7 stuck for 23s"), which resulted in the node being able to
> > > process traffic on the backend interface but not on the frontend
> > > one. Thus the services became unavailable, but the cluster thought
> > > the node was all right and did not stonith it.
> > >
> > > How could we protect the cluster against such failures?
> >
> > See the ocf:heartbeat:ethmonitor agent (to monitor the interface
> > itself) and/or the ocf:pacemaker:ping agent (to monitor reachability
> > of some IP such as a gateway)
>
> This looks really promising, thank you! Does the cluster regard it as
> a failure when an ocf:heartbeat:ethmonitor agent clone on a node does
> not run? :-)

If you configure it typically, so that it runs on all nodes, then a
start failure on any node will be recorded in the cluster status. To
get other resources to move off such a node, you would colocate them
with the ethmonitor resource.

> Best regards,
> Jozsef
> --
> E-mail : kadlecsik.joz...@wigner.mta.hu
> PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
> Address: Wigner Research Centre for Physics
>          H-1525 Budapest 114, POB. 49, Hungary
--
Ken Gaillot
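A minimal sketch of the configuration Ken describes, using pcs; the
interface name (eno2), the guest resource name (guest-vm1) and the
intervals are placeholders, not taken from this thread:

    # Monitor the frontend NIC on every node
    pcs resource create fe-ethmon ocf:heartbeat:ethmonitor \
        interface=eno2 op monitor interval=10s clone

    # Keep the guest only on nodes where the frontend monitor is running
    pcs constraint colocation add guest-vm1 with fe-ethmon-clone INFINITY

With something like this, a node where the ethmonitor instance fails to
start, or is stopped because its monitor reports the interface as down,
is no longer eligible to run the colocated guest.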
Re: [ClusterLabs] Howto stonith in the case of any interface failure?
On Wed, 9 Oct 2019, Digimer wrote:

> > One of the nodes has got a failure ("watchdog: BUG: soft lockup -
> > CPU#7 stuck for 23s"), which resulted in the node being able to
> > process traffic on the backend interface but not on the frontend one.
> > Thus the services became unavailable, but the cluster thought the
> > node was all right and did not stonith it.
> >
> > How could we protect the cluster against such failures?
>
> We use mode=1 (active-passive) bonded network interfaces for each
> network connection (we also have a back-end, front-end and a storage
> network). Each bond has a link going to one switch and the other link
> to a second switch. For fence devices, we use IPMI fencing connected
> via switch 1 and PDU fencing as the backup method connected on switch 2.
>
> With this setup, no matter what might fail, one of the fence methods
> will still be available. It's saved us in the field a few times now.

A bonded interface helps, but I suspect that in this case it could not
have saved the situation. It was not an interface failure but a strange
kind of system lockup: some of the already running processes were fine
(corosync), but sshd, for example, could not accept new connections
even from the direction of the seemingly fine backbone interface. In
the backend direction we have got bonded (LACP) interfaces - the
frontend uses single interfaces only.

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics
         H-1525 Budapest 114, POB. 49, Hungary
Re: [ClusterLabs] Howto stonith in the case of any interface failure?
On Wed, Oct 9, 2019 at 10:59 AM Kadlecsik József wrote:
>
> Hello,
>
> The nodes in our cluster have got backend and frontend interfaces: the
> former ones are for the storage and cluster (corosync) traffic and the
> latter ones are for the public services of KVM guests only.
>
> One of the nodes has got a failure ("watchdog: BUG: soft lockup -
> CPU#7 stuck for 23s"), which resulted in the node being able to
> process traffic on the backend interface but not on the frontend one.
> Thus the services became unavailable, but the cluster thought the node
> was all right and did not stonith it.
>
> How could we protect the cluster against such failures?
>
> We could configure a second corosync ring, but that would be a
> redundancy ring only.
>
> We could set up a second, independent corosync configuration for a
> second pacemaker just with stonith agents. Is it enough to specify the
> cluster name in the corosync config to pair pacemaker to corosync?
> What about the pairing of pacemaker to this corosync instance, how can
> we tell pacemaker to connect to this corosync instance?
>
> Which is the best way to solve the problem?

That really depends on what "node could process traffic" means. If it
is just about basic IP connectivity, you can use an ocf:pacemaker:ping
resource to monitor network availability and move resources if the
current node is considered "unconnected". This is actually documented
in Pacemaker Explained, 8.3.2. Moving Resources Due to Connectivity
Changes.

If "process traffic" means something else, you need a custom agent that
implements whatever checks are necessary to decide that the node cannot
process traffic anymore.
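A minimal sketch of that approach, again with placeholder names
(guest-vm1 for the guest resource, 192.0.2.1 for a frontend gateway):

    # Ping the frontend gateway from every node; the result is published
    # in the "pingd" node attribute
    pcs resource create fe-ping ocf:pacemaker:ping \
        host_list=192.0.2.1 dampen=5s multiplier=1000 \
        op monitor interval=15s clone

    # Move the guest away from nodes that cannot reach the gateway
    pcs constraint location guest-vm1 rule score=-INFINITY \
        pingd lt 1 or not_defined pingd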
Re: [ClusterLabs] Howto stonith in the case of any interface failure?
On 2019-10-09 3:58 a.m., Kadlecsik József wrote:
> Hello,
>
> The nodes in our cluster have got backend and frontend interfaces: the
> former ones are for the storage and cluster (corosync) traffic and the
> latter ones are for the public services of KVM guests only.
>
> One of the nodes has got a failure ("watchdog: BUG: soft lockup -
> CPU#7 stuck for 23s"), which resulted in the node being able to
> process traffic on the backend interface but not on the frontend one.
> Thus the services became unavailable, but the cluster thought the node
> was all right and did not stonith it.
>
> How could we protect the cluster against such failures?
>
> We could configure a second corosync ring, but that would be a
> redundancy ring only.
>
> We could set up a second, independent corosync configuration for a
> second pacemaker just with stonith agents. Is it enough to specify the
> cluster name in the corosync config to pair pacemaker to corosync?
> What about the pairing of pacemaker to this corosync instance, how can
> we tell pacemaker to connect to this corosync instance?
>
> Which is the best way to solve the problem?
>
> Best regards,
> Jozsef

We use mode=1 (active-passive) bonded network interfaces for each
network connection (we also have a back-end, front-end and a storage
network). Each bond has a link going to one switch and the other link
to a second switch. For fence devices, we use IPMI fencing connected
via switch 1 and PDU fencing as the backup method connected on switch 2.

With this setup, no matter what might fail, one of the fence methods
will still be available. It's saved us in the field a few times now.

--
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
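A rough sketch of such a two-level fencing setup in pcs terms; the
agent names (fence_ipmilan, fence_apc_snmp), addresses and credentials
are illustrative only, not Digimer's actual configuration:

    # Primary fence method: IPMI, reachable via switch 1
    pcs stonith create fence-node1-ipmi fence_ipmilan \
        ip=10.0.1.11 username=admin password=secret pcmk_host_list=node1

    # Backup fence method: switched PDU outlet, reachable via switch 2
    pcs stonith create fence-node1-pdu fence_apc_snmp \
        ip=10.0.2.21 port=3 pcmk_host_list=node1

    # Try IPMI first and fall back to the PDU if IPMI fencing fails
    pcs stonith level add 1 node1 fence-node1-ipmi
    pcs stonith level add 2 node1 fence-node1-pdu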
Re: [ClusterLabs] Howto stonith in the case of any interface failure?
On Wed, 9 Oct 2019, Ken Gaillot wrote:

> > One of the nodes has got a failure ("watchdog: BUG: soft lockup -
> > CPU#7 stuck for 23s"), which resulted in the node being able to
> > process traffic on the backend interface but not on the frontend one.
> > Thus the services became unavailable, but the cluster thought the
> > node was all right and did not stonith it.
> >
> > How could we protect the cluster against such failures?
>
> See the ocf:heartbeat:ethmonitor agent (to monitor the interface itself)
> and/or the ocf:pacemaker:ping agent (to monitor reachability of some IP
> such as a gateway)

This looks really promising, thank you! Does the cluster regard it as a
failure when an ocf:heartbeat:ethmonitor agent clone on a node does not
run? :-)

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics
         H-1525 Budapest 114, POB. 49, Hungary
Re: [ClusterLabs] Howto stonith in the case of any interface failure?
Hi,

On Wed, 9 Oct 2019, Jan Pokorný wrote:

> On 09/10/19 09:58 +0200, Kadlecsik József wrote:
> > The nodes in our cluster have got backend and frontend interfaces:
> > the former ones are for the storage and cluster (corosync) traffic
> > and the latter ones are for the public services of KVM guests only.
> >
> > One of the nodes has got a failure ("watchdog: BUG: soft lockup -
> > CPU#7 stuck for 23s"), which resulted in the node being able to
> > process traffic on the backend interface but not on the frontend
> > one. Thus the services became unavailable, but the cluster thought
> > the node was all right and did not stonith it.
>
> > Which is the best way to solve the problem?
>
> Looks like heuristics of corosync-qdevice that would ping/attest your
> frontend interface could be a way to go. You'd need an additional
> host in your setup, though.

As far as I see, corosync-qdevice can add/increase the votes for a node
but cannot decrease them. I hope I'm wrong, I wouldn't mind adding an
additional host :-)

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics
         H-1525 Budapest 114, POB. 49, Hungary
Re: [ClusterLabs] Howto stonith in the case of any interface failure?
On Wed, 2019-10-09 at 09:58 +0200, Kadlecsik József wrote:
> Hello,
>
> The nodes in our cluster have got backend and frontend interfaces: the
> former ones are for the storage and cluster (corosync) traffic and the
> latter ones are for the public services of KVM guests only.
>
> One of the nodes has got a failure ("watchdog: BUG: soft lockup -
> CPU#7 stuck for 23s"), which resulted in the node being able to
> process traffic on the backend interface but not on the frontend one.
> Thus the services became unavailable, but the cluster thought the node
> was all right and did not stonith it.
>
> How could we protect the cluster against such failures?

See the ocf:heartbeat:ethmonitor agent (to monitor the interface
itself) and/or the ocf:pacemaker:ping agent (to monitor reachability of
some IP such as a gateway)

> We could configure a second corosync ring, but that would be a
> redundancy ring only.
>
> We could set up a second, independent corosync configuration for a
> second pacemaker just with stonith agents. Is it enough to specify the
> cluster name in the corosync config to pair pacemaker to corosync?
> What about the pairing of pacemaker to this corosync instance, how can
> we tell pacemaker to connect to this corosync instance?
>
> Which is the best way to solve the problem?
>
> Best regards,
> Jozsef
> --
> E-mail : kadlecsik.joz...@wigner.mta.hu
> PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
> Address: Wigner Research Centre for Physics
>          H-1525 Budapest 114, POB. 49, Hungary
--
Ken Gaillot
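Besides colocating resources with the clone (as in the sketch further
up in the thread), the ethmonitor agent also keeps a node attribute,
by default named ethmonitor-<interface>, that can drive a location
rule. A rough sketch with the same placeholder names as before:

    # ethmonitor sets "ethmonitor-eno2" to 1 on nodes where the
    # interface is healthy
    pcs resource create fe-ethmon ocf:heartbeat:ethmonitor \
        interface=eno2 op monitor interval=10s clone

    # Ban the guest from nodes where the attribute is missing or not 1
    pcs constraint location guest-vm1 rule score=-INFINITY \
        not_defined ethmonitor-eno2 or ethmonitor-eno2 ne 1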
Re: [ClusterLabs] Howto stonith in the case of any interface failure?
On 09/10/19 09:58 +0200, Kadlecsik József wrote:
> The nodes in our cluster have got backend and frontend interfaces: the
> former ones are for the storage and cluster (corosync) traffic and the
> latter ones are for the public services of KVM guests only.
>
> One of the nodes has got a failure ("watchdog: BUG: soft lockup -
> CPU#7 stuck for 23s"), which resulted in the node being able to
> process traffic on the backend interface but not on the frontend one.
> Thus the services became unavailable, but the cluster thought the node
> was all right and did not stonith it.
>
> How could we protect the cluster against such failures?
>
> We could configure a second corosync ring, but that would be a
> redundancy ring only.
>
> We could set up a second, independent corosync configuration for a
> second pacemaker just with stonith agents. Is it enough to specify the
> cluster name in the corosync config to pair pacemaker to corosync?
> What about the pairing of pacemaker to this corosync instance, how can
> we tell pacemaker to connect to this corosync instance?

Such pairing happens on the Unix socket system-wide singleton basis.
IOW, two instances of corosync on the same machine would apparently
conflict -- only a single daemon can run at a time.

> Which is the best way to solve the problem?

Looks like heuristics of corosync-qdevice that would ping/attest your
frontend interface could be a way to go. You'd need an additional
host in your setup, though.

--
Poki
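What Jan refers to might look roughly like the following in
corosync.conf; qnetd-host, the heuristic name and the pinged address
192.0.2.1 are placeholders, and the exact behaviour (in particular how
a failing heuristic influences the vote) should be checked against
corosync-qdevice(8) for the version in use:

    quorum {
        provider: corosync_votequorum
        device {
            model: net
            votes: 1
            net {
                host: qnetd-host
                algorithm: ffsplit
            }
            heuristics {
                mode: on
                # corosync-qnetd takes the result of this check into
                # account when deciding which partition gets the vote
                exec_check_frontend: /bin/ping -q -c 1 -W 1 192.0.2.1
            }
        }
    }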