[ClusterLabs] Problem with the cluster becoming mostly unresponsive

2021-05-14 Thread Digimer
Hi all, I've run into an issue a couple of times now, and I'm not really sure what's causing it. I've got a RHEL 8 cluster that, after a while, will show one or more resources as 'FAILED'. When I try to do a cleanup, it marks the resources as stopped, despite them still running. After that, all a

Re: [ClusterLabs] Problem with the cluster becoming mostly unresponsive

2021-05-14 Thread kgaillot
On Fri, 2021-05-14 at 15:04 -0400, Digimer wrote:
> Hi all,
>
> I'm run into an issue a couple of times now, and I'm not really sure
> what's causing it. I've got a RHEL 8 cluster that, after a while, will
> show one or more resources as 'FAILED'. When I try to do a cleanup, it
> marks the

Re: [ClusterLabs] Problem with the cluster becoming mostly unresponsive

2021-05-14 Thread Digimer
On 2021-05-14 6:06 p.m., kgail...@redhat.com wrote:
> On Fri, 2021-05-14 at 15:04 -0400, Digimer wrote:
>> Hi all,
>>
>> I'm run into an issue a couple of times now, and I'm not really sure
>> what's causing it. I've got a RHEL 8 cluster that, after a while, will
>> show one or more resourc

Re: [ClusterLabs] Problem with the cluster becoming mostly unresponsive

2021-05-15 Thread Strahil Nikolov
> So a monitor failure on the fence agent rendered the cluster effectively unresponsive? How would I normally recover from this?
Actually, it will ban the stonith resource from the node once it reaches the maximum fail count. When the stonith resource is banned from all nodes, no node will be able to use
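
The banning behaviour described in this reply can be illustrated with a small toy model. This is only a sketch, not Pacemaker code or its API: the `ToyCluster` class, its method names, the node names, and the threshold value of 3 are all hypothetical. It assumes the "maximum fail count" acts like a per-resource threshold (comparable in spirit to Pacemaker's `migration-threshold` meta-attribute) and that a cleanup clears the recorded fail counts.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Toy model of fail-count banning; not Pacemaker code or its API."""

    nodes: list
    migration_threshold: int = 3  # hypothetical value chosen for the example
    fail_counts: dict = field(default_factory=dict)

    def record_failure(self, node):
        # Each failed operation (e.g. a monitor) bumps the node's fail count.
        self.fail_counts[node] = self.fail_counts.get(node, 0) + 1

    def banned(self, node):
        # The resource is banned from a node once the threshold is reached.
        return self.fail_counts.get(node, 0) >= self.migration_threshold

    def eligible_nodes(self):
        # Nodes that may still run the resource.
        return [n for n in self.nodes if not self.banned(n)]

    def cleanup(self, node=None):
        # Clearing the fail count lifts the ban (all nodes if none is given).
        if node is None:
            self.fail_counts.clear()
        else:
            self.fail_counts.pop(node, None)


if __name__ == "__main__":
    cluster = ToyCluster(nodes=["node1", "node2"])
    for name in cluster.nodes:
        for _ in range(cluster.migration_threshold):
            cluster.record_failure(name)
    print("eligible after failures:", cluster.eligible_nodes())
    cluster.cleanup()
    print("eligible after cleanup:", cluster.eligible_nodes())
```

In this model, once every node has reached the threshold, the list of eligible nodes is empty, which corresponds to the situation described above where the stonith resource is banned everywhere. Clearing the counts makes the nodes eligible again.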