Re: [ClusterLabs] Howto stonith in the case of any interface failure?

2019-10-10 Thread Ken Gaillot
On Wed, 2019-10-09 at 20:10 +0200, Kadlecsik József wrote:
> On Wed, 9 Oct 2019, Ken Gaillot wrote:
> 
> > > One of the nodes has got a failure ("watchdog: BUG: soft lockup -
> > > CPU#7 stuck for 23s"), with the result that the node could process
> > > traffic on the backend interface but not on the frontend one. Thus
> > > the services became unavailable, but the cluster thought the node
> > > was all right and did not stonith it.
> > > 
> > > How could we protect the cluster against such failures?
> > 
> > See the ocf:heartbeat:ethmonitor agent (to monitor the interface
> > itself) and/or the ocf:pacemaker:ping agent (to monitor reachability
> > of some IP such as a gateway)
> 
> This looks really promising, thank you! Does the cluster regard it as
> a failure when an ocf:heartbeat:ethmonitor agent clone on a node does
> not run? :-)

If you configure it in the typical way, as a clone that runs on all
nodes, then a start failure on any node will be recorded in the cluster
status. To get other resources to move off such a node, you would
colocate them with the ethmonitor clone.
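
A minimal pcs sketch, assuming the frontend NIC is eth1 and the guest
resource is called my-vm (both names are placeholders):

    # Monitor the frontend NIC on every node as a clone
    pcs resource create frontend-ethmon ocf:heartbeat:ethmonitor \
        interface=eth1 clone

    # Keep the guest only on nodes where the monitor clone is running,
    # so a node whose ethmonitor fails (or cannot start) loses the guest
    pcs constraint colocation add my-vm with frontend-ethmon-clone INFINITY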

> 
> Best regards,
> Jozsef
> --
> E-mail : kadlecsik.joz...@wigner.mta.hu
> PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
> Address: Wigner Research Centre for Physics
>  H-1525 Budapest 114, POB. 49, Hungary
-- 
Ken Gaillot 


Re: [ClusterLabs] Howto stonith in the case of any interface failure?

2019-10-10 Thread Kadlecsik József
On Wed, 9 Oct 2019, Digimer wrote:

> > One of the nodes has got a failure ("watchdog: BUG: soft lockup -
> > CPU#7 stuck for 23s"), with the result that the node could process
> > traffic on the backend interface but not on the frontend one. Thus the
> > services became unavailable, but the cluster thought the node was all
> > right and did not stonith it.
> > 
> > How could we protect the cluster against such failures?
> > 
> We use mode=1 (active-passive) bonded network interfaces for each 
> network connection (we also have a back-end, front-end and a storage 
> network). Each bond has a link going to one switch and the other link to 
> a second switch. For fence devices, we use IPMI fencing connected via 
> switch 1 and PDU fencing as the backup method connected on switch 2.
> 
> With this setup, no matter what might fail, one of the fence methods
> will still be available. It's saved us in the field a few times now.

A bonded interface helps, but I suspect that in this case it could not
have saved the situation. It was not an interface failure but a strange
kind of system lockup: some of the already running processes were fine
(corosync), but sshd, for example, could not accept new connections even
via the seemingly fine backbone interface.

On the backend side we have bonded (LACP) interfaces; the frontend uses
single interfaces only.

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics
 H-1525 Budapest 114, POB. 49, Hungary


Re: [ClusterLabs] Howto stonith in the case of any interface failure?

2019-10-09 Thread Andrei Borzenkov
On Wed, Oct 9, 2019 at 10:59 AM Kadlecsik József wrote:
>
> Hello,
>
> The nodes in our cluster have got backend and frontend interfaces: the
> former ones are for the storage and cluster (corosync) traffic and the
> latter ones are for the public services of KVM guests only.
>
> One of the nodes has got a failure ("watchdog: BUG: soft lockup - CPU#7
> stuck for 23s"), with the result that the node could process traffic on the
> backend interface but not on the frontend one. Thus the services became
> unavailable, but the cluster thought the node was all right and did not
> stonith it.
>
> How could we protect the cluster against such failures?
>
> We could configure a second corosync ring, but that would be a redundancy
> ring only.
>
> We could set up a second, independent corosync configuration for a second
> pacemaker just with stonith agents. Is it enough to specify the cluster
> name in the corosync config to pair pacemaker with corosync? And how can
> we tell that second pacemaker to connect to this corosync instance?
>
> Which is the best way to solve the problem?
>

That really depends on what "node could process traffic" means. If it
is just about basic IP connectivity, you can use the ocf:pacemaker:ping
resource to monitor network availability and move resources when the
current node is considered "unconnected". This is actually documented in
Pacemaker Explained, 8.3.2 "Moving Resources Due to Connectivity
Changes".
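
For that connectivity case, a minimal pcs sketch along the lines of that
chapter, assuming 192.0.2.1 stands in for the frontend gateway and my-vm
for a resource to protect (both names are placeholders):

    # Ping the frontend gateway from every node; sets the "pingd" attribute
    pcs resource create frontend-ping ocf:pacemaker:ping \
        host_list=192.0.2.1 multiplier=1000 dampen=5s clone

    # Ban the resource from nodes that cannot reach the gateway
    pcs constraint location my-vm rule score=-INFINITY \
        pingd lt 1 or not_defined pingd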

If "process traffic" means something else, you need custom agent that
implements whatever checks are necessary to decide that node cannot
process traffic anymore.

Re: [ClusterLabs] Howto stonith in the case of any interface failure?

2019-10-09 Thread Digimer
On 2019-10-09 3:58 a.m., Kadlecsik József wrote:
> Hello,
> 
> The nodes in our cluster have got backend and frontend interfaces: the 
> former ones are for the storage and cluster (corosync) traffic and the 
> latter ones are for the public services of KVM guests only.
> 
> One of the nodes has got a failure ("watchdog: BUG: soft lockup - CPU#7
> stuck for 23s"), with the result that the node could process traffic on the
> backend interface but not on the frontend one. Thus the services became
> unavailable, but the cluster thought the node was all right and did not
> stonith it.
> 
> How could we protect the cluster against such failures?
> 
> We could configure a second corosync ring, but that would be a redundancy 
> ring only.
> 
> We could set up a second, independent corosync configuration for a second
> pacemaker just with stonith agents. Is it enough to specify the cluster
> name in the corosync config to pair pacemaker with corosync? And how can
> we tell that second pacemaker to connect to this corosync instance?
> 
> Which is the best way to solve the problem? 
> 
> Best regards,
> Jozsef

We use mode=1 (active-passive) bonded network interfaces for each
network connection (we also have a back-end, front-end and a storage
network). Each bond has a link going to one switch and the other link to
a second switch. For fence devices, we use IPMI fencing connected via
switch 1 and PDU fencing as the backup method connected on switch 2.

With this setup, no matter what might fail, one of the fence methods
will still be available. It's saved us in the field a few times now.
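
For reference, one way such a mode=1 bond could be built with nmcli (a
sketch only; the connection and interface names are examples):

    # mode=1 (active-backup) bond over two NICs going to different switches
    nmcli con add type bond con-name bond-fe ifname bond-fe \
        bond.options "mode=active-backup,miimon=100"
    nmcli con add type bond-slave con-name bond-fe-p1 ifname eth1 master bond-fe
    nmcli con add type bond-slave con-name bond-fe-p2 ifname eth2 master bond-fe
    nmcli con up bond-fe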

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

Re: [ClusterLabs] Howto stonith in the case of any interface failure?

2019-10-09 Thread Kadlecsik József
On Wed, 9 Oct 2019, Ken Gaillot wrote:

> > One of the nodes has got a failure ("watchdog: BUG: soft lockup -
> > CPU#7 stuck for 23s"), with the result that the node could process
> > traffic on the backend interface but not on the frontend one. Thus the
> > services became unavailable, but the cluster thought the node was all
> > right and did not stonith it.
> > 
> > How could we protect the cluster against such failures?
> 
> See the ocf:heartbeat:ethmonitor agent (to monitor the interface itself) 
> and/or the ocf:pacemaker:ping agent (to monitor reachability of some IP 
> such as a gateway)

This looks really promising, thank you! Does the cluster regard it as a
failure when an ocf:heartbeat:ethmonitor agent clone on a node does not
run? :-)

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics
 H-1525 Budapest 114, POB. 49, Hungary


Re: [ClusterLabs] Howto stonith in the case of any interface failure?

2019-10-09 Thread Kadlecsik József
Hi,

On Wed, 9 Oct 2019, Jan Pokorný wrote:

> On 09/10/19 09:58 +0200, Kadlecsik József wrote:
> > The nodes in our cluster have got backend and frontend interfaces: the 
> > former ones are for the storage and cluster (corosync) traffic and the 
> > latter ones are for the public services of KVM guests only.
> > 
> > One of the nodes has got a failure ("watchdog: BUG: soft lockup - CPU#7
> > stuck for 23s"), with the result that the node could process traffic on
> > the backend interface but not on the frontend one. Thus the services
> > became unavailable, but the cluster thought the node was all right and
> > did not stonith it.
> 
> > Which is the best way to solve the problem? 
> 
> Looks like heuristics of corosync-qdevice that would ping/attest your
> frontend interface could be a way to go.  You'd need an additional
> host in your setup, though.

As far as I can see, corosync-qdevice can only add/increase the votes for
a node and cannot decrease them. I hope I'm wrong; I wouldn't mind adding
an additional host :-)

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics
 H-1525 Budapest 114, POB. 49, Hungary

Re: [ClusterLabs] Howto stonith in the case of any interface failure?

2019-10-09 Thread Ken Gaillot
On Wed, 2019-10-09 at 09:58 +0200, Kadlecsik József wrote:
> Hello,
> 
> The nodes in our cluster have got backend and frontend interfaces: the
> former ones are for the storage and cluster (corosync) traffic and the
> latter ones are for the public services of KVM guests only.
> 
> One of the nodes has got a failure ("watchdog: BUG: soft lockup - CPU#7
> stuck for 23s"), with the result that the node could process traffic on
> the backend interface but not on the frontend one. Thus the services
> became unavailable, but the cluster thought the node was all right and
> did not stonith it.
> 
> How could we protect the cluster against such failures?

See the ocf:heartbeat:ethmonitor agent (to monitor the interface
itself) and/or the ocf:pacemaker:ping agent (to monitor reachability of
some IP such as a gateway)
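
One concrete way to act on the ethmonitor result: the agent maintains a
node attribute (named ethmonitor-<interface> by default) that tracks the
link state, so a location rule can keep resources away from a node whose
frontend NIC is down. A hedged pcs sketch, assuming an ethmonitor clone
already watches eth1 and my-vm is the resource to protect (placeholder
names):

    # Ban my-vm from nodes where the frontend link is down or not monitored
    # (assumes the ethmonitor clone sets ethmonitor-eth1 to 0 on link failure)
    pcs constraint location my-vm rule score=-INFINITY \
        not_defined ethmonitor-eth1 or ethmonitor-eth1 eq 0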

> 
> We could configure a second corosync ring, but that would be a
> redundancy ring only.
> 
> We could set up a second, independent corosync configuration for a
> second pacemaker just with stonith agents. Is it enough to specify the
> cluster name in the corosync config to pair pacemaker with corosync?
> And how can we tell that second pacemaker to connect to this corosync
> instance?
> 
> Which is the best way to solve the problem? 
> 
> Best regards,
> Jozsef
> --
> E-mail : kadlecsik.joz...@wigner.mta.hu
> PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
> Address: Wigner Research Centre for Physics
>  H-1525 Budapest 114, POB. 49, Hungary
-- 
Ken Gaillot 


Re: [ClusterLabs] Howto stonith in the case of any interface failure?

2019-10-09 Thread Jan Pokorný
On 09/10/19 09:58 +0200, Kadlecsik József wrote:
> The nodes in our cluster have got backend and frontend interfaces: the 
> former ones are for the storage and cluster (corosync) traffic and the 
> latter ones are for the public services of KVM guests only.
> 
> One of the nodes has got a failure ("watchdog: BUG: soft lockup - CPU#7
> stuck for 23s"), with the result that the node could process traffic on the
> backend interface but not on the frontend one. Thus the services became
> unavailable, but the cluster thought the node was all right and did not
> stonith it.
> 
> How could we protect the cluster against such failures?
> 
> We could configure a second corosync ring, but that would be a redundancy 
> ring only.
> 
> We could set up a second, independent corosync configuration for a second
> pacemaker just with stonith agents. Is it enough to specify the cluster
> name in the corosync config to pair pacemaker with corosync? And how can
> we tell that second pacemaker to connect to this corosync instance?

Such pairing happens via a system-wide singleton Unix socket. IOW, two
corosync instances on the same machine would conflict -- only a single
daemon can run at a time.

> Which is the best way to solve the problem? 

Looks like heuristics of corosync-qdevice that would ping/attest your
frontend interface could be a way to go.  You'd need an additional
host in your setup, though.
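
A minimal corosync.conf sketch of that idea, assuming a qnetd server is
reachable at qnetd.example.com and 192.0.2.1 stands in for the frontend
gateway (both are placeholders); when the heuristics fail on a node,
qnetd takes that into account and that node's partition becomes less
likely to receive the qdevice vote:

    quorum {
        provider: corosync_votequorum
        device {
            model: net
            net {
                host: qnetd.example.com
                algorithm: ffsplit
            }
            heuristics {
                mode: on
                # succeeds only if the frontend gateway answers a ping
                exec_ping_frontend: /bin/ping -q -c 1 -W 1 192.0.2.1
            }
        }
    }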

-- 
Poki



[ClusterLabs] Howto stonith in the case of any interface failure?

2019-10-09 Thread Kadlecsik József
Hello,

The nodes in our cluster have got backend and frontend interfaces: the 
former ones are for the storage and cluster (corosync) traffic and the 
latter ones are for the public services of KVM guests only.

One of the nodes has got a failure ("watchdog: BUG: soft lockup - CPU#7
stuck for 23s"), with the result that the node could process traffic on the
backend interface but not on the frontend one. Thus the services became
unavailable, but the cluster thought the node was all right and did not
stonith it.

How could we protect the cluster against such failures?

We could configure a second corosync ring, but that would be a redundancy 
ring only.

We could set up a second, independent corosync configuration for a second
pacemaker just with stonith agents. Is it enough to specify the cluster
name in the corosync config to pair pacemaker with corosync? And how can
we tell that second pacemaker to connect to this corosync instance?

Which is the best way to solve the problem? 

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics
 H-1525 Budapest 114, POB. 49, Hungary
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/