Re: [ClusterLabs] issue during Pacemaker failover testing

2023-09-04 Thread David Dolan
Hi Klaus,

With default quorum options I've performed the following on my 3 node
cluster

Bring down cluster services on one node - the running services migrate to
another node
Wait 3 minutes
Bring down cluster services on one of the two remaining nodes - the
surviving node in the cluster is then fenced

Instead of the surviving node being fenced, I hoped that the services would
migrate and run on that remaining node.

Just looking for confirmation that my understanding is ok and if I'm
missing something?

Thanks
David



On Thu, 31 Aug 2023 at 11:59, David Dolan  wrote:

> I just tried removing all the quorum options setting back to defaults so
> no last_man_standing or wait_for_all.
> I still see the same behaviour where the third node is fenced if I bring
> down services on two nodes.
> Thanks
> David
>
> On Thu, 31 Aug 2023 at 11:44, Klaus Wenninger  wrote:
>
>>
>>
>> On Thu, Aug 31, 2023 at 12:28 PM David Dolan 
>> wrote:
>>
>>>
>>>
>>> On Wed, 30 Aug 2023 at 17:35, David Dolan  wrote:
>>>


 > Hi All,
> >
> > I'm running Pacemaker on Centos7
> > Name: pcs
> > Version : 0.9.169
> > Release : 3.el7.centos.3
> > Architecture: x86_64
> >
> >
> Besides the pcs-version versions of the other cluster-stack-components
> could be interesting. (pacemaker, corosync)
>
  rpm -qa | egrep "pacemaker|pcs|corosync|fence-agents"
 fence-agents-vmware-rest-4.2.1-41.el7_9.6.x86_64
 corosynclib-2.4.5-7.el7_9.2.x86_64
 pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
 fence-agents-common-4.2.1-41.el7_9.6.x86_64
 corosync-2.4.5-7.el7_9.2.x86_64
 pacemaker-cli-1.1.23-1.el7_9.1.x86_64
 pacemaker-1.1.23-1.el7_9.1.x86_64
 pcs-0.9.169-3.el7.centos.3.x86_64
 pacemaker-libs-1.1.23-1.el7_9.1.x86_64

>
>
> > I'm performing some cluster failover tests in a 3 node cluster. We
> have 3
> > resources in the cluster.
> > I was trying to see if I could get it working if 2 nodes fail at
> different
> > times. I'd like the 3 resources to then run on one node.
> >
> > The quorum options I've configured are as follows
> > [root@node1 ~]# pcs quorum config
> > Options:
> >   auto_tie_breaker: 1
> >   last_man_standing: 1
> >   last_man_standing_window: 1
> >   wait_for_all: 1
> >
> >
> Not sure if the combination of auto_tie_breaker and last_man_standing
> makes
> sense.
> And as you have a cluster with an odd number of nodes auto_tie_breaker
> should be
> disabled anyway I guess.
>
 Ah ok I'll try removing auto_tie_breaker and leave last_man_standing

>
>
> > [root@node1 ~]# pcs quorum status
> > Quorum information
> > --
> > Date: Wed Aug 30 11:20:04 2023
> > Quorum provider:  corosync_votequorum
> > Nodes:3
> > Node ID:  1
> > Ring ID:  1/1538
> > Quorate:  Yes
> >
> > Votequorum information
> > --
> > Expected votes:   3
> > Highest expected: 3
> > Total votes:  3
> > Quorum:   2
> > Flags:Quorate WaitForAll LastManStanding AutoTieBreaker
> >
> > Membership information
> > --
> > Nodeid  VotesQdevice Name
> >  1  1 NR node1 (local)
> >  2  1 NR node2
> >  3  1 NR node3
> >
> > If I stop the cluster services on node 2 and 3, the groups all
> failover to
> > node 1 since it is the node with the lowest ID
> > But if I stop them on node1 and node 2 or node1 and node3, the
> cluster
> > fails.
> >
> > I tried adding this line to corosync.conf and I could then bring
> down the
> > services on node 1 and 2 or node 2 and 3 but if I left node 2 until
> last,
> > the cluster failed
> > auto_tie_breaker_node: 1  3
> >
> > This line had the same outcome as using 1 3
> > auto_tie_breaker_node: 1  2 3
> >
> >
> Giving multiple auto_tie_breaker-nodes doesn't make sense to me but
> rather
> sounds dangerous if that configuration is possible at all.
>
> Maybe the misbehavior of last_man_standing is due to this (maybe not
> recognized) misconfiguration.
> Did you wait long enough between letting the 2 nodes fail?
>
 I've done it so many times that I believe so. But I'll try removing the
 auto_tie_breaker config, leaving last_man_standing. I'll also make sure
 I leave a couple of minutes between bringing down the nodes and post back.

>>> Just confirming I removed the auto_tie_breaker config and tested. Quorum
>>> configuration is as follows:
>>>  Options:
>>>   last_man_standing: 1
>>>   last_man_standing_window: 1
>>>   wait_for_all: 1
>>>
>>> I waited 2-3 minutes between stopping cluster services o

Re: [ClusterLabs] issue during Pacemaker failover testing

2023-09-04 Thread Andrei Borzenkov
On Mon, Sep 4, 2023 at 1:45 PM David Dolan  wrote:
>
> Hi Klaus,
>
> With default quorum options I've performed the following on my 3 node cluster
>
> Bring down cluster services on one node - the running services migrate to 
> another node
> Wait 3 minutes
> Bring down cluster services on one of the two remaining nodes - the surviving 
> node in the cluster is then fenced
>

Is it fenced or is it reset? It is not the same.

The default for no-quorum-policy is "stop". So you either have
"no-quorum-policy" set to "suicide", or node is reset by something
outside of pacemaker. This "something" may initiate fencing too.
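
For anyone following along, these settings can be checked directly on the
surviving node. A minimal sketch (pcs as shipped with CentOS 7; the log path
is an assumption for a default syslog setup):

  # current value of no-quorum-policy; empty output means it is unset,
  # i.e. the default "stop" applies
  pcs property show no-quorum-policy

  # configured stonith devices and their options
  pcs stonith show --full

  # look for fencing activity around the time the node went down
  grep -iE "stonith|fence" /var/log/messages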

> Instead of the surviving node being fenced, I hoped that the services would 
> migrate and run on that remaining node.
>
> Just looking for confirmation that my understanding is ok and if I'm missing 
> something?
>
> Thanks
> David
>
>
>
> On Thu, 31 Aug 2023 at 11:59, David Dolan  wrote:
>>
>> I just tried removing all the quorum options setting back to defaults so no 
>> last_man_standing or wait_for_all.
>> I still see the same behaviour where the third node is fenced if I bring 
>> down services on two nodes.
>> Thanks
>> David
>>
>> On Thu, 31 Aug 2023 at 11:44, Klaus Wenninger  wrote:
>>>
>>>
>>>
>>> On Thu, Aug 31, 2023 at 12:28 PM David Dolan  wrote:



 On Wed, 30 Aug 2023 at 17:35, David Dolan  wrote:
>
>
>
>> > Hi All,
>> >
>> > I'm running Pacemaker on Centos7
>> > Name: pcs
>> > Version : 0.9.169
>> > Release : 3.el7.centos.3
>> > Architecture: x86_64
>> >
>> >
>> Besides the pcs-version versions of the other cluster-stack-components
>> could be interesting. (pacemaker, corosync)
>
>  rpm -qa | egrep "pacemaker|pcs|corosync|fence-agents"
> fence-agents-vmware-rest-4.2.1-41.el7_9.6.x86_64
> corosynclib-2.4.5-7.el7_9.2.x86_64
> pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
> fence-agents-common-4.2.1-41.el7_9.6.x86_64
> corosync-2.4.5-7.el7_9.2.x86_64
> pacemaker-cli-1.1.23-1.el7_9.1.x86_64
> pacemaker-1.1.23-1.el7_9.1.x86_64
> pcs-0.9.169-3.el7.centos.3.x86_64
> pacemaker-libs-1.1.23-1.el7_9.1.x86_64
>>
>>
>>
>> > I'm performing some cluster failover tests in a 3 node cluster. We 
>> > have 3
>> > resources in the cluster.
>> > I was trying to see if I could get it working if 2 nodes fail at 
>> > different
>> > times. I'd like the 3 resources to then run on one node.
>> >
>> > The quorum options I've configured are as follows
>> > [root@node1 ~]# pcs quorum config
>> > Options:
>> >   auto_tie_breaker: 1
>> >   last_man_standing: 1
>> >   last_man_standing_window: 1
>> >   wait_for_all: 1
>> >
>> >
>> Not sure if the combination of auto_tie_breaker and last_man_standing 
>> makes
>> sense.
>> And as you have a cluster with an odd number of nodes auto_tie_breaker
>> should be
>> disabled anyway I guess.
>
> Ah ok I'll try removing auto_tie_breaker and leave last_man_standing
>>
>>
>>
>> > [root@node1 ~]# pcs quorum status
>> > Quorum information
>> > --
>> > Date: Wed Aug 30 11:20:04 2023
>> > Quorum provider:  corosync_votequorum
>> > Nodes:3
>> > Node ID:  1
>> > Ring ID:  1/1538
>> > Quorate:  Yes
>> >
>> > Votequorum information
>> > --
>> > Expected votes:   3
>> > Highest expected: 3
>> > Total votes:  3
>> > Quorum:   2
>> > Flags:Quorate WaitForAll LastManStanding AutoTieBreaker
>> >
>> > Membership information
>> > --
>> > Nodeid  VotesQdevice Name
>> >  1  1 NR node1 (local)
>> >  2  1 NR node2
>> >  3  1 NR node3
>> >
>> > If I stop the cluster services on node 2 and 3, the groups all 
>> > failover to
>> > node 1 since it is the node with the lowest ID
>> > But if I stop them on node1 and node 2 or node1 and node3, the cluster
>> > fails.
>> >
>> > I tried adding this line to corosync.conf and I could then bring down 
>> > the
>> > services on node 1 and 2 or node 2 and 3 but if I left node 2 until 
>> > last,
>> > the cluster failed
>> > auto_tie_breaker_node: 1  3
>> >
>> > This line had the same outcome as using 1 3
>> > auto_tie_breaker_node: 1  2 3
>> >
>> >
>> Giving multiple auto_tie_breaker-nodes doesn't make sense to me but 
>> rather
>> sounds dangerous if that configuration is possible at all.
>>
>> Maybe the misbehavior of last_man_standing is due to this (maybe not
>> recognized) misconfiguration.
>> Did you wait long enough between letting the 2 nodes fail?
>
> I've done it so many times so I 

Re: [ClusterLabs] issue during Pacemaker failover testing

2023-09-04 Thread Klaus Wenninger
On Mon, Sep 4, 2023 at 12:45 PM David Dolan  wrote:

> Hi Klaus,
>
> With default quorum options I've performed the following on my 3 node
> cluster
>
> Bring down cluster services on one node - the running services migrate to
> another node
> Wait 3 minutes
> Bring down cluster services on one of the two remaining nodes - the
> surviving node in the cluster is then fenced
>
> Instead of the surviving node being fenced, I hoped that the services
> would migrate and run on that remaining node.
>
> Just looking for confirmation that my understanding is ok and if I'm
> missing something?
>

As said I've never used it ...
Well when down to 2 nodes LMS per definition is getting into trouble as
after another
outage any of them is gonna be alone. In case of an ordered shutdown this
could
possibly be circumvented though. So I guess your first attempt to enable
auto-tie-breaker
was the right idea. Like this you will have further service at least on one
of the nodes.
So I guess what you were seeing is the right - and unfortunately only
possible - behavior.
Where LMS shines is probably scenarios with substantially more nodes.
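
For a 3-node cluster along those lines, the quorum section of corosync.conf
would roughly look like the sketch below (last_man_standing dropped,
auto_tie_breaker kept, wait_for_all left on). This is an untested outline,
not a verified configuration; after editing corosync.conf it has to be
synced to all nodes and corosync restarted, or the change made with
"pcs quorum update", which typically requires the cluster to be stopped.

  quorum {
      provider: corosync_votequorum
      wait_for_all: 1
      auto_tie_breaker: 1
      # lowest node id wins the tie; a single preferred nodeid could be
      # given here instead
      auto_tie_breaker_node: lowest
  }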

Klaus

>
> Thanks
> David
>
>
>
> On Thu, 31 Aug 2023 at 11:59, David Dolan  wrote:
>
>> I just tried removing all the quorum options setting back to defaults so
>> no last_man_standing or wait_for_all.
>> I still see the same behaviour where the third node is fenced if I bring
>> down services on two nodes.
>> Thanks
>> David
>>
>> On Thu, 31 Aug 2023 at 11:44, Klaus Wenninger 
>> wrote:
>>
>>>
>>>
>>> On Thu, Aug 31, 2023 at 12:28 PM David Dolan 
>>> wrote:
>>>


 On Wed, 30 Aug 2023 at 17:35, David Dolan 
 wrote:

>
>
> > Hi All,
>> >
>> > I'm running Pacemaker on Centos7
>> > Name: pcs
>> > Version : 0.9.169
>> > Release : 3.el7.centos.3
>> > Architecture: x86_64
>> >
>> >
>> Besides the pcs-version versions of the other cluster-stack-components
>> could be interesting. (pacemaker, corosync)
>>
>  rpm -qa | egrep "pacemaker|pcs|corosync|fence-agents"
> fence-agents-vmware-rest-4.2.1-41.el7_9.6.x86_64
> corosynclib-2.4.5-7.el7_9.2.x86_64
> pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
> fence-agents-common-4.2.1-41.el7_9.6.x86_64
> corosync-2.4.5-7.el7_9.2.x86_64
> pacemaker-cli-1.1.23-1.el7_9.1.x86_64
> pacemaker-1.1.23-1.el7_9.1.x86_64
> pcs-0.9.169-3.el7.centos.3.x86_64
> pacemaker-libs-1.1.23-1.el7_9.1.x86_64
>
>>
>>
>> > I'm performing some cluster failover tests in a 3 node cluster. We
>> have 3
>> > resources in the cluster.
>> > I was trying to see if I could get it working if 2 nodes fail at
>> different
>> > times. I'd like the 3 resources to then run on one node.
>> >
>> > The quorum options I've configured are as follows
>> > [root@node1 ~]# pcs quorum config
>> > Options:
>> >   auto_tie_breaker: 1
>> >   last_man_standing: 1
>> >   last_man_standing_window: 1
>> >   wait_for_all: 1
>> >
>> >
>> Not sure if the combination of auto_tie_breaker and last_man_standing
>> makes
>> sense.
>> And as you have a cluster with an odd number of nodes auto_tie_breaker
>> should be
>> disabled anyway I guess.
>>
> Ah ok I'll try removing auto_tie_breaker and leave last_man_standing
>
>>
>>
>> > [root@node1 ~]# pcs quorum status
>> > Quorum information
>> > --
>> > Date: Wed Aug 30 11:20:04 2023
>> > Quorum provider:  corosync_votequorum
>> > Nodes:3
>> > Node ID:  1
>> > Ring ID:  1/1538
>> > Quorate:  Yes
>> >
>> > Votequorum information
>> > --
>> > Expected votes:   3
>> > Highest expected: 3
>> > Total votes:  3
>> > Quorum:   2
>> > Flags:Quorate WaitForAll LastManStanding AutoTieBreaker
>> >
>> > Membership information
>> > --
>> > Nodeid  VotesQdevice Name
>> >  1  1 NR node1 (local)
>> >  2  1 NR node2
>> >  3  1 NR node3
>> >
>> > If I stop the cluster services on node 2 and 3, the groups all
>> failover to
>> > node 1 since it is the node with the lowest ID
>> > But if I stop them on node1 and node 2 or node1 and node3, the
>> cluster
>> > fails.
>> >
>> > I tried adding this line to corosync.conf and I could then bring
>> down the
>> > services on node 1 and 2 or node 2 and 3 but if I left node 2 until
>> last,
>> > the cluster failed
>> > auto_tie_breaker_node: 1  3
>> >
>> > This line had the same outcome as using 1 3
>> > auto_tie_breaker_node: 1  2 3
>> >
>> >
>> Giving multiple auto_tie_breaker-nodes doesn't make sense to me but
>> rather
>>

Re: [ClusterLabs] issue during Pacemaker failover testing

2023-09-04 Thread Klaus Wenninger
On Mon, Sep 4, 2023 at 1:18 PM Klaus Wenninger  wrote:

>
>
> On Mon, Sep 4, 2023 at 12:45 PM David Dolan  wrote:
>
>> Hi Klaus,
>>
>> With default quorum options I've performed the following on my 3 node
>> cluster
>>
>> Bring down cluster services on one node - the running services migrate to
>> another node
>> Wait 3 minutes
>> Bring down cluster services on one of the two remaining nodes - the
>> surviving node in the cluster is then fenced
>>
>> Instead of the surviving node being fenced, I hoped that the services
>> would migrate and run on that remaining node.
>>
>> Just looking for confirmation that my understanding is ok and if I'm
>> missing something?
>>
>
> As said I've never used it ...
> Well when down to 2 nodes LMS per definition is getting into trouble as
> after another
> outage any of them is gonna be alone. In case of an ordered shutdown this
> could
> possibly be circumvented though. So I guess your first attempt to enable
> auto-tie-breaker
> was the right idea. Like this you will have further service at least on
> one of the nodes.
> So I guess what you were seeing is the right - and unfortunately only
> possible - behavior.
> Where LMS shines is probably scenarios with substantially more nodes.
>

Or go for qdevice with LMS where I would expect it to be able to really go
down to
a single node left - any of the 2 last ones - as there is still qdevice.
Sry for the confusion btw.
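
In case it helps, that variant could be sketched roughly as follows (package
names and the host name "qnetd-host" are assumptions; the pcs syntax is the
one shipped with RHEL/CentOS 7):

  # on the arbiter host, outside the cluster
  yum install corosync-qnetd pcs
  pcs qdevice setup model net --enable --start

  # on the cluster nodes (the arbiter typically has to be authenticated
  # first, e.g. with "pcs cluster auth qnetd-host")
  yum install corosync-qdevice
  pcs quorum device add model net host=qnetd-host algorithm=lms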

Klaus

>
> Klaus
>
>>
>> Thanks
>> David
>>
>>
>>
>> On Thu, 31 Aug 2023 at 11:59, David Dolan  wrote:
>>
>>> I just tried removing all the quorum options setting back to defaults so
>>> no last_man_standing or wait_for_all.
>>> I still see the same behaviour where the third node is fenced if I bring
>>> down services on two nodes.
>>> Thanks
>>> David
>>>
>>> On Thu, 31 Aug 2023 at 11:44, Klaus Wenninger 
>>> wrote:
>>>


 On Thu, Aug 31, 2023 at 12:28 PM David Dolan 
 wrote:

>
>
> On Wed, 30 Aug 2023 at 17:35, David Dolan 
> wrote:
>
>>
>>
>> > Hi All,
>>> >
>>> > I'm running Pacemaker on Centos7
>>> > Name: pcs
>>> > Version : 0.9.169
>>> > Release : 3.el7.centos.3
>>> > Architecture: x86_64
>>> >
>>> >
>>> Besides the pcs-version versions of the other
>>> cluster-stack-components
>>> could be interesting. (pacemaker, corosync)
>>>
>>  rpm -qa | egrep "pacemaker|pcs|corosync|fence-agents"
>> fence-agents-vmware-rest-4.2.1-41.el7_9.6.x86_64
>> corosynclib-2.4.5-7.el7_9.2.x86_64
>> pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
>> fence-agents-common-4.2.1-41.el7_9.6.x86_64
>> corosync-2.4.5-7.el7_9.2.x86_64
>> pacemaker-cli-1.1.23-1.el7_9.1.x86_64
>> pacemaker-1.1.23-1.el7_9.1.x86_64
>> pcs-0.9.169-3.el7.centos.3.x86_64
>> pacemaker-libs-1.1.23-1.el7_9.1.x86_64
>>
>>>
>>>
>>> > I'm performing some cluster failover tests in a 3 node cluster. We
>>> have 3
>>> > resources in the cluster.
>>> > I was trying to see if I could get it working if 2 nodes fail at
>>> different
>>> > times. I'd like the 3 resources to then run on one node.
>>> >
>>> > The quorum options I've configured are as follows
>>> > [root@node1 ~]# pcs quorum config
>>> > Options:
>>> >   auto_tie_breaker: 1
>>> >   last_man_standing: 1
>>> >   last_man_standing_window: 1
>>> >   wait_for_all: 1
>>> >
>>> >
>>> Not sure if the combination of auto_tie_breaker and
>>> last_man_standing makes
>>> sense.
>>> And as you have a cluster with an odd number of nodes
>>> auto_tie_breaker
>>> should be
>>> disabled anyway I guess.
>>>
>> Ah ok I'll try removing auto_tie_breaker and leave last_man_standing
>>
>>>
>>>
>>> > [root@node1 ~]# pcs quorum status
>>> > Quorum information
>>> > --
>>> > Date: Wed Aug 30 11:20:04 2023
>>> > Quorum provider:  corosync_votequorum
>>> > Nodes:3
>>> > Node ID:  1
>>> > Ring ID:  1/1538
>>> > Quorate:  Yes
>>> >
>>> > Votequorum information
>>> > --
>>> > Expected votes:   3
>>> > Highest expected: 3
>>> > Total votes:  3
>>> > Quorum:   2
>>> > Flags:Quorate WaitForAll LastManStanding AutoTieBreaker
>>> >
>>> > Membership information
>>> > --
>>> > Nodeid  VotesQdevice Name
>>> >  1  1 NR node1 (local)
>>> >  2  1 NR node2
>>> >  3  1 NR node3
>>> >
>>> > If I stop the cluster services on node 2 and 3, the groups all
>>> failover to
>>> > node 1 since it is the node with the lowest ID
>>> > But if I stop them on node1 and node 2 or node1 and node3, the
>>> cluster
>>> > fails.
>>> >
>>> > I tried add

Re: [ClusterLabs] issue during Pacemaker failover testing

2023-09-04 Thread Andrei Borzenkov
On Mon, Sep 4, 2023 at 2:25 PM Klaus Wenninger  wrote:
>
>
> Or go for qdevice with LMS where I would expect it to be able to really go 
> down to
> a single node left - any of the 2 last ones - as there is still qdevice.
> Sry for the confusion btw.
>

According to documentation, "LMS is also incompatible with quorum
devices, if last_man_standing is specified in corosync.conf then the
quorum device will be disabled".


Re: [ClusterLabs] issue during Pacemaker failover testing

2023-09-04 Thread Andrei Borzenkov
On Mon, Sep 4, 2023 at 2:18 PM Klaus Wenninger  wrote:
>
>
>
> On Mon, Sep 4, 2023 at 12:45 PM David Dolan  wrote:
>>
>> Hi Klaus,
>>
>> With default quorum options I've performed the following on my 3 node cluster
>>
>> Bring down cluster services on one node - the running services migrate to 
>> another node
>> Wait 3 minutes
>> Bring down cluster services on one of the two remaining nodes - the 
>> surviving node in the cluster is then fenced
>>
>> Instead of the surviving node being fenced, I hoped that the services would 
>> migrate and run on that remaining node.
>>
>> Just looking for confirmation that my understanding is ok and if I'm missing 
>> something?
>
>
> As said I've never used it ...
> Well when down to 2 nodes LMS per definition is getting into trouble as after 
> another
> outage any of them is gonna be alone. In case of an ordered shutdown this 
> could
> possibly be circumvented though. So I guess your first attempt to enable
> auto-tie-breaker
> was the right idea. Like this you will have further service at least on one 
> of the nodes.
> So I guess what you were seeing is the right - and unfortunately only 
> possible - behavior.

I still do not see where fencing comes from. Pacemaker requests
fencing of the missing nodes. It also may request self-fencing, but
not in the default settings. It is rather hard to tell what happens
without logs from the last remaining node.

That said, the default action is to stop all resources, so the end
result is not very different :)


Re: [ClusterLabs] issue during Pacemaker failover testing

2023-09-04 Thread Klaus Wenninger
On Mon, Sep 4, 2023 at 1:44 PM Andrei Borzenkov  wrote:

> On Mon, Sep 4, 2023 at 2:25 PM Klaus Wenninger 
> wrote:
> >
> >
> > Or go for qdevice with LMS where I would expect it to be able to really
> go down to
> > a single node left - any of the 2 last ones - as there is still qdevice.
> > Sry for the confusion btw.
> >
>
> According to documentation, "LMS is also incompatible with quorum
> devices, if last_man_standing is specified in corosync.conf then the
> quorum device will be disabled".
>

That is why I said qdevice with LMS - but it was probably not explicit
enough without telling that I meant the qdevice algorithm and not
the corosync flag.

Klaus



Re: [ClusterLabs] issue during Pacemaker failover testing

2023-09-04 Thread Klaus Wenninger
On Mon, Sep 4, 2023 at 1:50 PM Andrei Borzenkov  wrote:

> On Mon, Sep 4, 2023 at 2:18 PM Klaus Wenninger 
> wrote:
> >
> >
> >
> > On Mon, Sep 4, 2023 at 12:45 PM David Dolan 
> wrote:
> >>
> >> Hi Klaus,
> >>
> >> With default quorum options I've performed the following on my 3 node
> cluster
> >>
> >> Bring down cluster services on one node - the running services migrate
> to another node
> >> Wait 3 minutes
> >> Bring down cluster services on one of the two remaining nodes - the
> surviving node in the cluster is then fenced
> >>
> >> Instead of the surviving node being fenced, I hoped that the services
> would migrate and run on that remaining node.
> >>
> >> Just looking for confirmation that my understanding is ok and if I'm
> missing something?
> >
> >
> > As said I've never used it ...
> > Well when down to 2 nodes LMS per definition is getting into trouble as
> after another
> > outage any of them is gonna be alone. In case of an ordered shutdown
> this could
> possibly be circumvented though. So I guess your first attempt to enable
> auto-tie-breaker
> > was the right idea. Like this you will have further service at least on
> one of the nodes.
> > So I guess what you were seeing is the right - and unfortunately only
> possible - behavior.
>
> I still do not see where fencing comes from. Pacemaker requests
> fencing of the missing nodes. It also may request self-fencing, but
> not in the default settings. It is rather hard to tell what happens
> without logs from the last remaining node.
>
> That said, the default action is to stop all resources, so the end
> result is not very different :)
>

But you are of course right. The expected behaviour would be that
the leftover node stops the resources.
But maybe we're missing something here. Hard to tell without
the exact configuration including fencing.
Again, as already said, I don't know anything about the LMS
implementation with corosync. In theory there were both arguments
to either suicide (but that would have to be done by pacemaker) or
to automatically switch to some 2-node-mode once the remaining
partition is reduced to just 2 followed by a fence-race (when done
without the precautions otherwise used for 2-node-clusters).
But I guess in this case it is none of those 2.

Klaus



Re: [ClusterLabs] issue during Pacemaker failover testing

2023-09-04 Thread David Dolan
Thanks Klaus\Andrei,

So if I understand correctly, what I'm trying probably shouldn't work.
Instead, I should set auto_tie_breaker in corosync and remove
last_man_standing.
Then, I should set up another server with qdevice and configure that using
the LMS algorithm.
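
Once that is in place, the result could be sanity-checked with something
like the following (a sketch; exact output differs between versions):

  pcs quorum config
  pcs quorum status

  # on a cluster node: shows whether the qdevice currently provides its vote
  corosync-qdevice-tool -sv

  # on the arbiter: lists the clusters served by the qnetd daemon
  corosync-qnetd-tool -l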

Thanks
David

On Mon, 4 Sept 2023 at 13:32, Klaus Wenninger  wrote:

>
>
> On Mon, Sep 4, 2023 at 1:50 PM Andrei Borzenkov 
> wrote:
>
>> On Mon, Sep 4, 2023 at 2:18 PM Klaus Wenninger 
>> wrote:
>> >
>> >
>> >
>> > On Mon, Sep 4, 2023 at 12:45 PM David Dolan 
>> wrote:
>> >>
>> >> Hi Klaus,
>> >>
>> >> With default quorum options I've performed the following on my 3 node
>> cluster
>> >>
>> >> Bring down cluster services on one node - the running services migrate
>> to another node
>> >> Wait 3 minutes
>> >> Bring down cluster services on one of the two remaining nodes - the
>> surviving node in the cluster is then fenced
>> >>
>> >> Instead of the surviving node being fenced, I hoped that the services
>> would migrate and run on that remaining node.
>> >>
>> >> Just looking for confirmation that my understanding is ok and if I'm
>> missing something?
>> >
>> >
>> > As said I've never used it ...
>> > Well when down to 2 nodes LMS per definition is getting into trouble as
>> after another
>> > outage any of them is gonna be alone. In case of an ordered shutdown
>> this could
>> possibly be circumvented though. So I guess your first attempt to enable
>> auto-tie-breaker
>> > was the right idea. Like this you will have further service at least on
>> one of the nodes.
>> > So I guess what you were seeing is the right - and unfortunately only
>> possible - behavior.
>>
>> I still do not see where fencing comes from. Pacemaker requests
>> fencing of the missing nodes. It also may request self-fencing, but
>> not in the default settings. It is rather hard to tell what happens
>> without logs from the last remaining node.
>>
>> That said, the default action is to stop all resources, so the end
>> result is not very different :)
>>
>
> But you are of course right. The expected behaviour would be that
> the leftover node stops the resources.
> But maybe we're missing something here. Hard to tell without
> the exact configuration including fencing.
> Again, as already said, I don't know anything about the LMS
> implementation with corosync. In theory there were both arguments
> to either suicide (but that would have to be done by pacemaker) or
> to automatically switch to some 2-node-mode once the remaining
> partition is reduced to just 2 followed by a fence-race (when done
> without the precautions otherwise used for 2-node-clusters).
> But I guess in this case it is none of those 2.
>
> Klaus
>


Re: [ClusterLabs] Centreon HA Cluster - VIP issue

2023-09-04 Thread Jan Friesse

Hi,


On 02/09/2023 17:16, Adil Bouazzaoui wrote:

  Hello,

My name is Adil; I work for Tman company. We are testing the Centreon HA
cluster to monitor our infrastructure for 13 companies. For now we are
using the 100 IT licence to test the platform; once everything is working
fine we can purchase a licence suitable for our case.

We're stuck at *scenario 2*: setting up the Centreon HA Cluster with Master &
Slave in different datacenters.
For *scenario 1*, setting up the Cluster with Master & Slave and the VIP
address on the same network (VLAN), it is working fine.

*Scenario 1: Cluster on Same network (same DC) ==> works fine*
Master in DC 1 VLAN 1: 172.30.15.10 /24
Slave in DC 1 VLAN 1: 172.30.15.20 /24
VIP in DC 1 VLAN 1: 172.30.15.30/24
Quorum in DC 1 LAN: 192.168.1.10/24
Poller in DC 1 LAN: 192.168.1.20/24

*Scenario 2: Cluster on different networks (2 separate DCs connected with
VPN) ==> still not working*


corosync on all nodes needs to have direct connection to any other node. 
VPN should work as long as routing is correctly configured. What exactly 
is "still not working"?



Master in DC 1 VLAN 1: 172.30.15.10 /24
Slave in DC 2 VLAN 2: 172.30.50.10 /24
VIP: for example 102.84.30.XXX. We used a public static IP from our internet
service provider; we thought that using an IP from a single site's network
wouldn't work, since if that site goes down the VIP won't be reachable!
Quorum: 192.168.1.10/24


No clue what you mean by Quorum, but placing it in DC1 doesn't feel right.


Poller: 192.168.1.20/24

Our *goal* is to have Master & Slave nodes on different sites, so that when
Site A goes down we keep monitoring with the slave.
The problem is that we don't know how to set up the VIP address, what kind
of VIP address will work, how the VIP address can work in this scenario, or
whether there is anything else that can replace the VIP address to make
things work.
Also, can we use a backup poller, so that if poller 1 on Site A goes down,
poller 2 on Site B can take the lead?

We looked everywhere (The watch, YouTube, Reddit, GitHub...) and we still
couldn't find a workaround!

the guide we used to deploy the 2 Nodes Cluster:
https://docs.centreon.com/docs/installation/installation-of-centreon-ha/overview/

attached the 2 DCs architecture example.

We appreciate your support.
Thank you in advance.


Adil Bouazzaoui
IT Infrastructure Engineer
TMAN
adil.bouazza...@tmandis.ma
adilb...@gmail.com
+212 656 29 2020




Re: [ClusterLabs] issue during Pacemaker failover testing

2023-09-04 Thread Andrei Borzenkov
On Mon, Sep 4, 2023 at 4:44 PM David Dolan  wrote:
>
> Thanks Klaus\Andrei,
>
> So if I understand correctly what I'm trying probably shouldn't work.

It is impossible to configure corosync (or any other cluster system
for that matter) to keep the *arbitrary* last node quorate. It is
possible to designate one node as "preferred" and to keep it quorate.
Returning to your example:

> I tried adding this line to corosync.conf and I could then bring down the 
> services on node 1 and 2 or node 2 and 3 but if I left node 2 until last, the 
> cluster failed
> auto_tie_breaker_node: 1  3
>

Correct. In your scenario the tie breaker is only relevant with two
nodes. When the first node is down, the remaining two nodes select the
tiebreaker. It can only be node 1 or 3.

> This line had the same outcome as using 1 3
> auto_tie_breaker_node: 1  2 3

If it really has the same outcome (i.e. cluster fails when node 2 is
left) it is a bug. This line makes nodes 1 or 2 a possible tiebreaker.
So the cluster must fail if node 3 is left, not node 2.

What most certainly *is* possible - no-quorum-policy=ignore + reliable
fencing. This worked just fine in two node clusters without two_node.
It does not make the last node quorate, but it allows pacemaker to
continue providing services on this node *and* taking over services
from other nodes if they were fenced successfully.
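
A minimal sketch of that variant (only safe when fencing is configured and
known to work; otherwise a split brain can run resources twice):

  # keep fencing enabled and let the remaining node carry on without quorum
  pcs property set stonith-enabled=true
  pcs property set no-quorum-policy=ignore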

> And I should attempt setting auto_tie_breaker in corosync and remove 
> last_man_standing.
> Then, I should set up another server with qdevice and configure that using 
> the LMS algorithm.
>
> Thanks
> David
>
> On Mon, 4 Sept 2023 at 13:32, Klaus Wenninger  wrote:
>>
>>
>>
>> On Mon, Sep 4, 2023 at 1:50 PM Andrei Borzenkov  wrote:
>>>
>>> On Mon, Sep 4, 2023 at 2:18 PM Klaus Wenninger  wrote:
>>> >
>>> >
>>> >
>>> > On Mon, Sep 4, 2023 at 12:45 PM David Dolan  wrote:
>>> >>
>>> >> Hi Klaus,
>>> >>
>>> >> With default quorum options I've performed the following on my 3 node 
>>> >> cluster
>>> >>
>>> >> Bring down cluster services on one node - the running services migrate 
>>> >> to another node
>>> >> Wait 3 minutes
>>> >> Bring down cluster services on one of the two remaining nodes - the 
>>> >> surviving node in the cluster is then fenced
>>> >>
>>> >> Instead of the surviving node being fenced, I hoped that the services 
>>> >> would migrate and run on that remaining node.
>>> >>
>>> >> Just looking for confirmation that my understanding is ok and if I'm 
>>> >> missing something?
>>> >
>>> >
>>> > As said I've never used it ...
>>> > Well when down to 2 nodes LMS per definition is getting into trouble as 
>>> > after another
>>> > outage any of them is gonna be alone. In case of an ordered shutdown this 
>>> > could
>>> > possibly be circumvented though. So I guess your first attempt to enable
>>> > auto-tie-breaker
>>> > was the right idea. Like this you will have further service at least on 
>>> > one of the nodes.
>>> > So I guess what you were seeing is the right - and unfortunately only 
>>> > possible - behavior.
>>>
>>> I still do not see where fencing comes from. Pacemaker requests
>>> fencing of the missing nodes. It also may request self-fencing, but
>>> not in the default settings. It is rather hard to tell what happens
>>> without logs from the last remaining node.
>>>
>>> That said, the default action is to stop all resources, so the end
>>> result is not very different :)
>>
>>
>> But you are of course right. The expected behaviour would be that
>> the leftover node stops the resources.
>> But maybe we're missing something here. Hard to tell without
>> the exact configuration including fencing.
>> Again, as already said, I don't know anything about the LMS
>> implementation with corosync. In theory there were both arguments
>> to either suicide (but that would have to be done by pacemaker) or
>> to automatically switch to some 2-node-mode once the remaining
>> partition is reduced to just 2 followed by a fence-race (when done
>> without the precautions otherwise used for 2-node-clusters).
>> But I guess in this case it is none of those 2.
>>
>> Klaus
>>>


Re: [ClusterLabs] Users Digest, Vol 104, Issue 5

2023-09-04 Thread Adil Bouazzaoui
Hi Jan,

to add more information, we deployed a Centreon 2-node HA cluster (Master in
DC 1 & Slave in DC 2); the quorum device, which is responsible for avoiding
split-brain, is in DC 1 too, and the poller, which is responsible for
monitoring, is in DC 1 too. The problem is that a VIP address is required
(attached to the Master node; in case of failover it is moved to the Slave)
and we don't know what VIP we should use. We also don't know the right setup
for our scenario so that if DC 1 goes down the Slave in DC 2 becomes the
Master; that's why we don't know where to place the quorum device and the
poller.

I hope to get some ideas so we can set up this cluster correctly.
Thanks in advance.

Adil Bouazzaoui
IT Infrastructure engineer
adil.bouazza...@tmandis.ma
adilb...@gmail.com

On Mon, 4 Sept 2023 at 15:24,  wrote:

> Today's Topics:
>
>1. Re: issue during Pacemaker failover testing (Klaus Wenninger)
>2. Re: issue during Pacemaker failover testing (Klaus Wenninger)
>3. Re: issue during Pacemaker failover testing (David Dolan)
>4. Re: Centreon HA Cluster - VIP issue (Jan Friesse)
>
>
> --
>
> Message: 1
> Date: Mon, 4 Sep 2023 14:15:52 +0200
> From: Klaus Wenninger 
> To: Cluster Labs - All topics related to open-source clustering
> welcomed 
> Cc: David Dolan 
> Subject: Re: [ClusterLabs] issue during Pacemaker failover testing
> Message-ID:
>  wody...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> On Mon, Sep 4, 2023 at 1:44 PM Andrei Borzenkov 
> wrote:
>
> > On Mon, Sep 4, 2023 at 2:25 PM Klaus Wenninger 
> > wrote:
> > >
> > >
> > > Or go for qdevice with LMS where I would expect it to be able to really
> > go down to
> > > a single node left - any of the 2 last ones - as there is still
> qdevice.
> > > Sry for the confusion btw.
> > >
> >
> > According to documentation, "LMS is also incompatible with quorum
> > devices, if last_man_standing is specified in corosync.conf then the
> > quorum device will be disabled".
> >
>
> That is why I said qdevice with LMS - but it was probably not explicit
> enough without telling that I meant the qdevice algorithm and not
> the corosync flag.
>
> Klaus
>
>
> --
>
> Message: 2
> Date: Mon, 4 Sep 2023 14:32:39 +0200
> From: Klaus Wenninger 
> To: Cluster Labs - All topics related to open-source clustering
> welcomed 
> Cc: David Dolan 
> Subject: Re: [ClusterLabs] issue during Pacemaker failover testing
> Message-ID:
> <
> calrdao0v8bxp4ajwcobkeae6pimvgg2xme6ia+ohxshesx9...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> On Mon, Sep 4, 2023 at 1:50 PM Andrei Borzenkov 
> wrote:
>
> > On Mon, Sep 4, 2023 at 2:18 PM Klaus Wenninger 
> > wrote:
> > >
> > >
> > >
> > > On Mon, Sep 4, 2023 at 12:45 PM David Dolan 
> > wrote:
> > >>
> > >> Hi Klaus,
> > >>
> > >> With default quorum options I've performed the following on my 3 node
> > cluster
> > >>
> > >> Bring down cluster services on one node - the running services migrate
> > to another node
> > >> Wait 3 minutes
> > >> Bring down cluster services on one of the two remaining nodes - the
> > surviving node in the cluster is then fenced
> > >>
> > >> Instead of the surviving node being fenced, I hoped that the services
> > would migrate and run on that remaining node.
> > >>
> > >> Just looking for confirmation

Re: [ClusterLabs] Users Digest, Vol 104, Issue 5

2023-09-04 Thread Klaus Wenninger via Users
Down below you replied to 2 threads. I think the latter is the one you
intended to ... very confusing ...
Sry for adding more spam - was hesitant - but I think there is a chance it
removes some confusion ...

Klaus

On Mon, Sep 4, 2023 at 10:29 PM Adil Bouazzaoui  wrote:

> Hi Jan,
>
> to add more information, we deployed Centreon 2 Node HA Cluster (Master in
> DC 1 & Slave in DC 2), quorum device which is responsible for split-brain
> is in DC 1 too, and the poller which is responsible for monitoring is in DC
> 1 too. The problem is that a VIP address is required (attached to Master
> node, in case of failover it will be moved to Slave) and we don't know what
> VIP we should use? also we don't know what is the perfect setup for our
> current scenario so if DC 1 goes down then the Slave on DC 2 will be the
> Master, that's why we don't know where to place the Quorum device and the
> poller?
>
> i hope to get some ideas so we can setup this cluster correctly.
> thanks in advance.
>
> Adil Bouazzaoui
> IT Infrastructure engineer
> adil.bouazza...@tmandis.ma
> adilb...@gmail.com
>
> On Mon, 4 Sept 2023 at 15:24,  wrote:
>
>> Today's Topics:
>>
>>1. Re: issue during Pacemaker failover testing (Klaus Wenninger)
>>2. Re: issue during Pacemaker failover testing (Klaus Wenninger)
>>3. Re: issue during Pacemaker failover testing (David Dolan)
>>4. Re: Centreon HA Cluster - VIP issue (Jan Friesse)
>>
>>
>> --
>>
>> Message: 1
>> Date: Mon, 4 Sep 2023 14:15:52 +0200
>> From: Klaus Wenninger 
>> To: Cluster Labs - All topics related to open-source clustering
>> welcomed 
>> Cc: David Dolan 
>> Subject: Re: [ClusterLabs] issue during Pacemaker failover testing
>> Message-ID:
>> > wody...@mail.gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> On Mon, Sep 4, 2023 at 1:44 PM Andrei Borzenkov 
>> wrote:
>>
>> > On Mon, Sep 4, 2023 at 2:25 PM Klaus Wenninger 
>> > wrote:
>> > >
>> > >
>> > > Or go for qdevice with LMS where I would expect it to be able to
>> really
>> > go down to
>> > > a single node left - any of the 2 last ones - as there is still
>> qdevice.
>> > > Sry for the confusion btw.
>> > >
>> >
>> > According to documentation, "LMS is also incompatible with quorum
>> > devices, if last_man_standing is specified in corosync.conf then the
>> > quorum device will be disabled".
>> >
>>
>> That is why I said qdevice with LMS - but it was probably not explicit
>> enough without telling that I meant the qdevice algorithm and not
>> the corosync flag.
>>
>> Klaus
>>
>>
>> --
>>
>> Message: 2
>> Date: Mon, 4 Sep 2023 14:32:39 +0200
>> From: Klaus Wenninger 
>> To: Cluster Labs - All topics related to open-source clustering
>> welcomed 
>> Cc: David Dolan 
>> Subject: Re: [ClusterLabs] issue during Pacemaker failover testing
>> Message-ID:
>> <
>> calrdao0v8bxp4ajwcobkeae6pimvgg2xme6ia+ohxshesx9...@mail.gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> On Mon, Sep 4, 2023 at 1:50 PM Andrei Borzenkov 
>> wrote:
>>
>> > On Mon, Sep 4, 2023 at 2:18 PM Klaus Wenninger 
>> > wrote:
>> > >
>> > >
>> > >
>> > > On Mon,