[ClusterLabs] Antw: [EXT] Re: Q: How to clean up a failed fencing operation?

2022-05-03 Thread Ulrich Windl
>>> Reid Wahl wrote on 03.05.2022 at 10:16 in message:
> On Tue, May 3, 2022 at 12:36 AM Ulrich Windl
>  wrote:
>>
>> Hi!
>>
>> I'm familiar with cleaning up various failed resource actions via
>> "crm_resource -C -r resource_name -N node_name -n operation".
>> However I wonder what the correct parameters for a failed fencing operation
>> (that lingers around) are.
> 
> stonith_admin --history '*' --cleanup

Ah, a completely different command! Interestingly, this does not produce any
logs in syslog (no DC action).
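
For reference, a minimal sketch of the full workflow (the '*' targets the
history of all nodes; a single node name works as well):

  # list the recorded fencing actions first, to see what will be cleared
  stonith_admin --history '*' --verbose
  # then clear the history, so crm_mon stops showing the failed action
  stonith_admin --history '*' --cleanup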

Regards,
Ulrich


> 
>>
>> crm_mon found:
>> Failed Fencing Actions:
>>   * reboot of h18 failed: delegate=h16, client=stonith_admin.controld.22336,
>> origin=h18, last-failed='2022-04-27 02:22:52 +02:00' (a later attempt succeeded)
>>
>> Regards,
>> Ulrich
>>
> 
> 
> --
> Regards,
> 
> Reid Wahl (He/Him), RHCA
> Senior Software Maintenance Engineer, Red Hat
> CEE - Platform Support Delivery - ClusterHA
> 





Re: [ClusterLabs] Antw: [EXT] Re: Help understanding recover of promotable resource after a "pcs cluster stop ‑‑all"

2022-05-03 Thread Andrei Borzenkov
On 03.05.2022 10:40, Ulrich Windl wrote:
> Hi!
> 
> I don't use DRBD, but I can imagine:
> If DRBD does asynchronous replication, it may make sense not to promote the
> slave as master after an interrupted connection (such as when the master
> died), as this will cause some data loss.
> Probably it only wants to switch roles when both nodes are online, to avoid
> that type of data loss (the master may have some newer data it wants to
> transfer first).
> 

Yes. See below.


>>>> # sudo crm_mon -1A
>>>> ...
>>>> Node Attributes:
>>>>   * Node: server2:
>>>> * master-DRBDData : 1
>>>
>>> In the scenario you described, only server1 is up. If there is no
>>> master score for server1, it cannot be master. It's up to the resource
>>> agent to set it. I'm not familiar enough with that agent to know why it
>>> might not.
>>>
>>
>> I can trivially reproduce it. When pacemaker with slave drbd instance is
>> stopped, DRBD disk state is set to "outdated". When it comes up, it will
>> not be selected for promotion. Setting a master score does not work; it
>> just results in a failed attempt to bring up the outdated replica. When
>> former master comes up, its disk state is "consistent" so it is selected
>> for promotion, becomes primary and synchronized with secondary.
>>
>> DRBD RA has an option to force outdated state on stop, but this option
>> is off by default as far as I can tell.
>>
>> This is probably something in DRBD configuration, but I am not familiar
>> with it on this deep level. Manually forcing primary on the outdated replica
>> works and is reflected at the pacemaker level (the resource goes into
>> promoted state).

Without any agent involved, doing "drbdadm down" on a secondary instance
with an active connection to the primary marks it "outdated". This is correct,
as from then on we do not know anything about the state of the primary. Doing
"drbdadm down" on a single replica without active connections leaves it
in the "consistent" state.

When the DRBD connection is active, both replicas have the "consistent" state,
and when the cluster nodes reboot after a crash, either one can assume the
master role.
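
A quick way to observe this from the shell, assuming a DRBD resource named
r0 (the name is a placeholder):

  # print local/peer disk states, e.g. "UpToDate/UpToDate" or "Outdated/DUnknown"
  drbdadm dstate r0
  # on DRBD 9, a fuller view of replication and disk states
  drbdsetup status r0 --verbose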

I guess it is the same operational issue as with pacemaker itself - can
we shut down both sides of DRBD leaving them in a consistent state? But
even if we can, pacemaker itself does not provide any means to initiate
such a cluster-wide shutdown, so it would not help at all.

OTOH it is not really a big problem. A cluster reboot is a manual action -
so the administrator will need to manually activate the remaining replica IF
THE ADMINISTRATOR IS SURE IT IS UP TO DATE. Rebooting individual nodes
sequentially should be OK.
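
The manual activation would be something along these lines, assuming resource
r0 and that the data has been verified to be current (--force skips exactly
the safety check discussed above, so it risks data loss otherwise):

  # on the surviving node, promote the replica despite its outdated state
  drbdadm primary --force r0
  # confirm that the agent now reports a promotion score
  crm_mon -1A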

>>



 Atenciosamente/Kind regards,
 Salatiel

 On Mon, May 2, 2022 at 12:26 PM Ken Gaillot 
 wrote:
> On Mon, 2022‑05‑02 at 09:58 ‑0300, Salatiel Filho wrote:
>> Hi, I am trying to understand the recovery process of a
>> promotable
>> resource after "pcs cluster stop --all" and shutdown of both
>> nodes.
>> I have a two-node + qdevice quorum with a DRBD resource.
>>
>> This is a summary of the resources before my test. Everything is
>> working just fine and server2 is the master of DRBD.
>>
>>  * fence‑server1(stonith:fence_vmware_rest): Started
>> server2
>>  * fence‑server2(stonith:fence_vmware_rest): Started
>> server1
>>  * Clone Set: DRBDData‑clone [DRBDData] (promotable):
>>* Masters: [ server2 ]
>>* Slaves: [ server1 ]
>>  * Resource Group: nfs:
>>* drbd_fs(ocf::heartbeat:Filesystem): Started server2
>>
>>
>>
>> then I issue "pcs cluster stop --all". The cluster will be
>> stopped on
>> both nodes as expected.
>> Now I restart server1 (previously the slave) and power off
>> server2 (previously the master). When server1 restarts it will
>> fence server2, and I can see that server2 is starting on vcenter,
>> but I just pressed a key in grub to make sure server2 would not
>> restart; instead it would just be "paused" at the grub screen.
>>
>> SSH'ing to server1 and running pcs status I get:
>>
>> Cluster name: cluster1
>> Cluster Summary:
>>   * Stack: corosync
>>   * Current DC: server1 (version 2.1.0‑8.el8‑7c3f660707) ‑
>> partition
>> with quorum
>>   * Last updated: Mon May  2 09:52:03 2022
>>   * Last change:  Mon May  2 09:39:22 2022 by root via cibadmin
>> on
>> server1
>>   * 2 nodes configured
>>   * 11 resource instances configured
>>
>> Node List:
>>   * Online: [ server1 ]
>>   * OFFLINE: [ server2 ]
>>
>> Full List of Resources:
>>   * fence‑server1(stonith:fence_vmware_rest): Stopped
>>   * fence‑server2(stonith:fence_vmware_rest): Started
>> server1
>>   * Clone Set: DRBDData‑clone [DRBDData] (promotable):
>> * Slaves: [ server1 ]
>> * Stopped: [ server2 ]

Re: [ClusterLabs] Q: How to clean up a failed fencing operation?

2022-05-03 Thread Reid Wahl
On Tue, May 3, 2022 at 12:36 AM Ulrich Windl
 wrote:
>
> Hi!
>
> I'm familiar with cleaning up various failed resource actions via 
> "crm_resource -C -r resource_name -N node_name -n operation".
> However I wonder what the correct parameters for a failed fencing operation 
> (that lingers around) are.

stonith_admin --history '*' --cleanup
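
A hypothetical per-node variant, if only one node's history entries should be
cleared (h18 as in the report below):

  stonith_admin --history h18 --cleanup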

>
> crm_mon found:
> Failed Fencing Actions:
>   * reboot of h18 failed: delegate=h16, client=stonith_admin.controld.22336, 
> origin=h18, last-failed='2022-04-27 02:22:52 +02:00' (a later attempt 
> succeeded)
>
> Regards,
> Ulrich
>


-- 
Regards,

Reid Wahl (He/Him), RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA



[ClusterLabs] Antw: [EXT] Re: Help understanding recover of promotable resource after a "pcs cluster stop ‑‑all"

2022-05-03 Thread Ulrich Windl
Hi!

I don't use DRBD, but I can imagine:
If DRBD does asynchronous replication, it may make sense not to promote the
slave as master after an interrupted connection (such as when the master died),
as this will cause some data loss.
Probably it only wants to switch roles when both nodes are online, to avoid
that type of data loss (the master may have some newer data it wants to
transfer first).

Regards,
Ulrich

>>> Andrei Borzenkov wrote on 03.05.2022 at 09:01 in
message <2b623fc1-1652-f332-e035-09dec045c...@gmail.com>:
> On 03.05.2022 00:25, Ken Gaillot wrote:
>> On Mon, 2022‑05‑02 at 13:11 ‑0300, Salatiel Filho wrote:
>>> Hi, Ken, here is the info you asked for.
>>>
>>>
>>> # pcs constraint
>>> Location Constraints:
>>>   Resource: fence‑server1
>>> Disabled on:
>>>   Node: server1 (score:‑INFINITY)
>>>   Resource: fence‑server2
>>> Disabled on:
>>>   Node: server2 (score:‑INFINITY)
>>> Ordering Constraints:
>>>   promote DRBDData‑clone then start nfs (kind:Mandatory)
>>> Colocation Constraints:
>>>   nfs with DRBDData‑clone (score:INFINITY) (rsc‑role:Started)
>>> (with‑rsc‑role:Master)
>>> Ticket Constraints:
>>>
>>> # sudo crm_mon -1A
>>> ...
>>> Node Attributes:
>>>   * Node: server2:
>>> * master-DRBDData : 1
>> 
>> In the scenario you described, only server1 is up. If there is no
>> master score for server1, it cannot be master. It's up to the resource
>> agent to set it. I'm not familiar enough with that agent to know why it
>> might not.
>> 
> 
> I can trivially reproduce it. When pacemaker with slave drbd instance is
> stopped, DRBD disk state is set to "outdated". When it comes up, it will
> not be selected for promotion. Setting a master score does not work; it
> just results in a failed attempt to bring up the outdated replica. When
> former master comes up, its disk state is "consistent" so it is selected
> for promotion, becomes primary and synchronized with secondary.
> 
> DRBD RA has an option to force outdated state on stop, but this option
> is off by default as far as I can tell.
> 
> This is probably something in DRBD configuration, but I am not familiar
> with it on this deep level. Manually forcing primary on the outdated replica
> works and is reflected at the pacemaker level (the resource goes into
> promoted state).
> 
>>>
>>>
>>>
>>> Atenciosamente/Kind regards,
>>> Salatiel
>>>
>>> On Mon, May 2, 2022 at 12:26 PM Ken Gaillot 
>>> wrote:
 On Mon, 2022‑05‑02 at 09:58 ‑0300, Salatiel Filho wrote:
> Hi, I am trying to understand the recovery process of a
> promotable
> resource after "pcs cluster stop --all" and shutdown of both
> nodes.
> I have a two-node + qdevice quorum with a DRBD resource.
>
> This is a summary of the resources before my test. Everything is
> working just fine and server2 is the master of DRBD.
>
>  * fence‑server1(stonith:fence_vmware_rest): Started
> server2
>  * fence‑server2(stonith:fence_vmware_rest): Started
> server1
>  * Clone Set: DRBDData‑clone [DRBDData] (promotable):
>* Masters: [ server2 ]
>* Slaves: [ server1 ]
>  * Resource Group: nfs:
>* drbd_fs(ocf::heartbeat:Filesystem): Started server2
>
>
>
> then I issue "pcs cluster stop --all". The cluster will be
> stopped on
> both nodes as expected.
> Now I restart server1 (previously the slave) and power off
> server2 (previously the master). When server1 restarts it will
> fence server2, and I can see that server2 is starting on vcenter,
> but I just pressed a key in grub to make sure server2 would not
> restart; instead it would just be "paused" at the grub screen.
>
> SSH'ing to server1 and running pcs status I get:
>
> Cluster name: cluster1
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: server1 (version 2.1.0‑8.el8‑7c3f660707) ‑
> partition
> with quorum
>   * Last updated: Mon May  2 09:52:03 2022
>   * Last change:  Mon May  2 09:39:22 2022 by root via cibadmin
> on
> server1
>   * 2 nodes configured
>   * 11 resource instances configured
>
> Node List:
>   * Online: [ server1 ]
>   * OFFLINE: [ server2 ]
>
> Full List of Resources:
>   * fence‑server1(stonith:fence_vmware_rest): Stopped
>   * fence‑server2(stonith:fence_vmware_rest): Started
> server1
>   * Clone Set: DRBDData‑clone [DRBDData] (promotable):
> * Slaves: [ server1 ]
> * Stopped: [ server2 ]
>   * Resource Group: nfs:
> * drbd_fs(ocf::heartbeat:Filesystem): Stopped
>
>
> So I can see there is quorum, but server1 is never promoted as
> DRBD master, so the remaining resources will be stopped until
> server2
> is back.
> 1) What do I need to do to force the promotion and recover
> without
> restarting server2?

[ClusterLabs] Q: How to clean up a failed fencing operation?

2022-05-03 Thread Ulrich Windl
Hi!

I'm familiar with cleaning up various failed resource actions via "crm_resource 
-C -r resource_name -N node_name -n operation".
However I wonder what the correct parameters for a failed fencing operation 
(that lingers around) are.

crm_mon found:
Failed Fencing Actions:
  * reboot of h18 failed: delegate=h16, client=stonith_admin.controld.22336, 
origin=h18, last-failed='2022-04-27 02:22:52 +02:00' (a later attempt succeeded)

Regards,
Ulrich





Re: [ClusterLabs] Help understanding recover of promotable resource after a "pcs cluster stop --all"

2022-05-03 Thread Andrei Borzenkov
On 03.05.2022 00:25, Ken Gaillot wrote:
> On Mon, 2022-05-02 at 13:11 -0300, Salatiel Filho wrote:
>> Hi, Ken, here is the info you asked for.
>>
>>
>> # pcs constraint
>> Location Constraints:
>>   Resource: fence-server1
>> Disabled on:
>>   Node: server1 (score:-INFINITY)
>>   Resource: fence-server2
>> Disabled on:
>>   Node: server2 (score:-INFINITY)
>> Ordering Constraints:
>>   promote DRBDData-clone then start nfs (kind:Mandatory)
>> Colocation Constraints:
>>   nfs with DRBDData-clone (score:INFINITY) (rsc-role:Started)
>> (with-rsc-role:Master)
>> Ticket Constraints:
>>
>> # sudo crm_mon -1A
>> ...
>> Node Attributes:
>>   * Node: server2:
>> * master-DRBDData : 1
> 
> In the scenario you described, only server1 is up. If there is no
> master score for server1, it cannot be master. It's up to the resource
> agent to set it. I'm not familiar enough with that agent to know why it
> might not.
> 

I can trivially reproduce it. When pacemaker with a slave drbd instance is
stopped, the DRBD disk state is set to "outdated". When it comes up, it will
not be selected for promotion. Setting a master score does not work; it
just results in a failed attempt to bring up the outdated replica. When the
former master comes up, its disk state is "consistent", so it is selected
for promotion, becomes primary, and synchronizes with the secondary.

The DRBD RA has an option to force the outdated state on stop, but this option
is off by default as far as I can tell.
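
With the linbit agent this is presumably the stop_outdates_secondary
parameter; rather than guessing, the agent's metadata and defaults can be
checked:

  # list the drbd resource agent's parameters and their defaults
  pcs resource describe ocf:linbit:drbd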

This is probably something in the DRBD configuration, but I am not familiar
with it on this deep level. Manually forcing primary on the outdated replica
works and is reflected at the pacemaker level (the resource goes into the
promoted state).

>>
>>
>>
>> Atenciosamente/Kind regards,
>> Salatiel
>>
>> On Mon, May 2, 2022 at 12:26 PM Ken Gaillot 
>> wrote:
>>> On Mon, 2022-05-02 at 09:58 -0300, Salatiel Filho wrote:
 Hi, I am trying to understand the recovery process of a
 promotable
 resource after "pcs cluster stop --all" and shutdown of both
 nodes.
 I have a two-node + qdevice quorum with a DRBD resource.

 This is a summary of the resources before my test. Everything is
 working just fine and server2 is the master of DRBD.

  * fence-server1(stonith:fence_vmware_rest): Started
 server2
  * fence-server2(stonith:fence_vmware_rest): Started
 server1
  * Clone Set: DRBDData-clone [DRBDData] (promotable):
* Masters: [ server2 ]
* Slaves: [ server1 ]
  * Resource Group: nfs:
* drbd_fs(ocf::heartbeat:Filesystem): Started server2



 then I issue "pcs cluster stop --all". The cluster will be
 stopped on
 both nodes as expected.
 Now I restart server1 (previously the slave) and power off
 server2 (previously the master). When server1 restarts it will
 fence server2, and I can see that server2 is starting on vcenter,
 but I just pressed a key in grub to make sure server2 would not
 restart; instead it would just be "paused" at the grub screen.

 SSH'ing to server1 and running pcs status I get:

 Cluster name: cluster1
 Cluster Summary:
   * Stack: corosync
   * Current DC: server1 (version 2.1.0-8.el8-7c3f660707) -
 partition
 with quorum
   * Last updated: Mon May  2 09:52:03 2022
   * Last change:  Mon May  2 09:39:22 2022 by root via cibadmin
 on
 server1
   * 2 nodes configured
   * 11 resource instances configured

 Node List:
   * Online: [ server1 ]
   * OFFLINE: [ server2 ]

 Full List of Resources:
   * fence-server1(stonith:fence_vmware_rest): Stopped
   * fence-server2(stonith:fence_vmware_rest): Started
 server1
   * Clone Set: DRBDData-clone [DRBDData] (promotable):
 * Slaves: [ server1 ]
 * Stopped: [ server2 ]
   * Resource Group: nfs:
 * drbd_fs(ocf::heartbeat:Filesystem): Stopped


 So I can see there is quorum, but server1 is never promoted as
 DRBD master, so the remaining resources will be stopped until
 server2
 is back.
 1) What do I need to do to force the promotion and recover
 without
 restarting server2?
 2) Why, if I instead reboot server2 and power off server1 (rather than
 rebooting server1 and powering off server2), can the cluster recover by
 itself?


 Thanks!

>>>
>>> You shouldn't need to force promotion; that is the default behavior
>>> in
>>> that situation. There must be something else in the configuration
>>> that
>>> is preventing promotion.
>>>
>>> The DRBD resource agent should set a promotion score for the node.
>>> You
>>> can run "crm_mon -1A" to show all node attributes; there should be
>>> one
>>> like "master-DRBDData" for the active node.
>>>
>>> You can also show the constraints in the cluster to see if there is
>>> anything 

[ClusterLabs] Antw: [EXT] Re: Help understanding recover of promotable resource after a "pcs cluster stop ‑‑all"

2022-05-03 Thread Ulrich Windl
>>> Ken Gaillot wrote on 02.05.2022 at 23:25 in message
<94927004ea4d4dca222ebc842f62711ff73b0a2a.ca...@redhat.com>:
> On Mon, 2022‑05‑02 at 13:11 ‑0300, Salatiel Filho wrote:
>> Hi, Ken, here is the info you asked for.
>> 
>> 
>> # pcs constraint
>> Location Constraints:
>>   Resource: fence‑server1
>> Disabled on:
>>   Node: server1 (score:‑INFINITY)
>>   Resource: fence‑server2
>> Disabled on:
>>   Node: server2 (score:‑INFINITY)
>> Ordering Constraints:
>>   promote DRBDData‑clone then start nfs (kind:Mandatory)
>> Colocation Constraints:
>>   nfs with DRBDData‑clone (score:INFINITY) (rsc‑role:Started)
>> (with‑rsc‑role:Master)
>> Ticket Constraints:
>> 
>> # sudo crm_mon -1A
>> ...
>> Node Attributes:
>>   * Node: server2:
>> * master-DRBDData : 1
> 
> In the scenario you described, only server1 is up. If there is no
> master score for server1, it cannot be master. It's up to the resource
> agent to set it. I'm not familiar enough with that agent to know why it
> might not.

Additional RA output (syslog) may be helpful as well.
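
The promotion score can also be queried directly; a sketch, using the
attribute name from the crm_mon output quoted above:

  # query the transient "master" node attribute on a given node
  attrd_updater --query --name master-DRBDData --node server1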

> 
>> 
>> 
>> 
>> Atenciosamente/Kind regards,
>> Salatiel
>> 
>> On Mon, May 2, 2022 at 12:26 PM Ken Gaillot 
>> wrote:
>> > On Mon, 2022‑05‑02 at 09:58 ‑0300, Salatiel Filho wrote:
>> > > Hi, I am trying to understand the recovery process of a
>> > > promotable
>> > > resource after "pcs cluster stop --all" and shutdown of both
>> > > nodes.
>> > > I have a two-node + qdevice quorum with a DRBD resource.
>> > > 
>> > > This is a summary of the resources before my test. Everything is
>> > > working just fine and server2 is the master of DRBD.
>> > > 
>> > >  * fence‑server1(stonith:fence_vmware_rest): Started
>> > > server2
>> > >  * fence‑server2(stonith:fence_vmware_rest): Started
>> > > server1
>> > >  * Clone Set: DRBDData‑clone [DRBDData] (promotable):
>> > >* Masters: [ server2 ]
>> > >* Slaves: [ server1 ]
>> > >  * Resource Group: nfs:
>> > >* drbd_fs(ocf::heartbeat:Filesystem): Started server2
>> > > 
>> > > 
>> > > 
>> > > then I issue "pcs cluster stop --all". The cluster will be
>> > > stopped on
>> > > both nodes as expected.
>> > > Now I restart server1 (previously the slave) and power off
>> > > server2 (previously the master). When server1 restarts it will
>> > > fence server2, and I can see that server2 is starting on vcenter,
>> > > but I just pressed a key in grub to make sure server2 would not
>> > > restart; instead it would just be "paused" at the grub screen.
>> > > 
>> > > SSH'ing to server1 and running pcs status I get:
>> > > 
>> > > Cluster name: cluster1
>> > > Cluster Summary:
>> > >   * Stack: corosync
>> > >   * Current DC: server1 (version 2.1.0‑8.el8‑7c3f660707) ‑
>> > > partition
>> > > with quorum
>> > >   * Last updated: Mon May  2 09:52:03 2022
>> > >   * Last change:  Mon May  2 09:39:22 2022 by root via cibadmin
>> > > on
>> > > server1
>> > >   * 2 nodes configured
>> > >   * 11 resource instances configured
>> > > 
>> > > Node List:
>> > >   * Online: [ server1 ]
>> > >   * OFFLINE: [ server2 ]
>> > > 
>> > > Full List of Resources:
>> > >   * fence‑server1(stonith:fence_vmware_rest): Stopped
>> > >   * fence‑server2(stonith:fence_vmware_rest): Started
>> > > server1
>> > >   * Clone Set: DRBDData‑clone [DRBDData] (promotable):
>> > > * Slaves: [ server1 ]
>> > > * Stopped: [ server2 ]
>> > >   * Resource Group: nfs:
>> > > * drbd_fs(ocf::heartbeat:Filesystem): Stopped
>> > > 
>> > > 
>> > > So I can see there is quorum, but server1 is never promoted as
>> > > DRBD master, so the remaining resources will be stopped until
>> > > server2
>> > > is back.
>> > > 1) What do I need to do to force the promotion and recover
>> > > without
>> > > restarting server2?
>> > > 2) Why, if I instead reboot server2 and power off server1 (rather
>> > > than rebooting server1 and powering off server2), can the cluster
>> > > recover by itself?
>> > > 
>> > > 
>> > > Thanks!
>> > > 
>> > 
>> > You shouldn't need to force promotion; that is the default behavior
>> > in
>> > that situation. There must be something else in the configuration
>> > that
>> > is preventing promotion.
>> > 
>> > The DRBD resource agent should set a promotion score for the node.
>> > You
>> > can run "crm_mon -1A" to show all node attributes; there should be
>> > one
>> > like "master-DRBDData" for the active node.
>> > 
>> > You can also show the constraints in the cluster to see if there is
>> > anything relevant to the master role.
> 
> --
> Ken Gaillot 
> 


