Re: [ClusterLabs] How to unfence without reboot (fence_mpath)
Hello all,

I think I found the problem. On the fenced node, after a restart of the cluster stack, I saw the following:

controld(dlm)[13025]: ERROR: Uncontrolled lockspace exists, system must reboot. Executing suicide fencing

I was so focused on the DC logs that I missed it.
I guess with HA-LVM there would be no need to reboot, but once dlm/clvmd have been interrupted, the only way out is a reboot.

Best Regards,
Strahil Nikolov

On Monday, 17 February 2020, 15:36:39 GMT+2, Ondrej wrote:

Hello Strahil,

On 2/17/20 3:39 PM, Strahil Nikolov wrote:
> Hello Ondrej,
>
> thanks for your reply. I really appreciate that.
>
> I have picked fence_mpath as I'm preparing for my EX436 and I can't know what agent will be useful on the exam.
> Also, according to https://access.redhat.com/solutions/3201072, there could be a race condition with fence_scsi.

I believe the exam is about testing knowledge of configuration, not knowledge of which race-condition bugs are present and how to handle them :)
If you have access to the learning materials for the EX436 exam, I would recommend trying those out - they have labs and comprehensive review exercises that are useful in preparation for the exam.

> So, I've checked the cluster when fencing and the node immediately goes offline.
> Last messages from pacemaker are:
>
> Feb 17 08:21:57 node1.localdomain stonith-ng[23808]: notice: Client stonith_admin.controld.23888.b57ceee7 wants to fence (reboot) 'node1.localdomain' with device '(any)'
> Feb 17 08:21:57 node1.localdomain stonith-ng[23808]: notice: Requesting peer fencing (reboot) of node1.localdomain
> Feb 17 08:21:57 node1.localdomain stonith-ng[23808]: notice: FENCING can fence (reboot) node1.localdomain (aka. '1'): static-list
> Feb 17 08:21:58 node1.localdomain stonith-ng[23808]: notice: Operation reboot of node1.localdomain by node2.localdomain for stonith_admin.controld.23888@node1.localdomain.ede38ffb: OK

- This part looks OK - meaning the fencing looks like a success.

> Feb 17 08:21:58 node1.localdomain crmd[23812]: crit: We were allegedly just fenced by node2.localdomain for node1.localdomai

- This is also normal, as the node just announces that it was fenced by another node.

> Which for me means - node1 just got fenced again. Actually fencing works, as I/O is immediately blocked and the reservation is removed.
>
> I've used https://access.redhat.com/solutions/2766611 to set up fence_mpath, but I could have messed something up.

- Note related to the exam: you will not have Internet access on the exam, so I would expect that you would have to configure something that does not require access to this (and as Dan Swartzendruber pointed out in another email - we cannot* even see RH links without an account).

* You can get a free developer account to read them, but ideally that should not be needed, and it is certainly inconvenient for a wide public audience.

> Cluster config is:
> [root@node3 ~]# pcs config show
> Cluster Name: HACLUSTER2
> Corosync Nodes:
>  node1.localdomain node2.localdomain node3.localdomain
> Pacemaker Nodes:
>  node1.localdomain node2.localdomain node3.localdomain
>
> Resources:
>  Clone: dlm-clone
>   Meta Attrs: interleave=true ordered=true
>   Resource: dlm (class=ocf provider=pacemaker type=controld)
>    Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
>                start interval=0s timeout=90 (dlm-start-interval-0s)
>                stop interval=0s timeout=100 (dlm-stop-interval-0s)
>  Clone: clvmd-clone
>   Meta Attrs: interleave=true ordered=true
>   Resource: clvmd (class=ocf provider=heartbeat type=clvm)
>    Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
>                start interval=0s timeout=90s (clvmd-start-interval-0s)
>                stop interval=0s timeout=90s (clvmd-stop-interval-0s)
>  Clone: TESTGFS2-clone
>   Meta Attrs: interleave=true
>   Resource: TESTGFS2 (class=ocf provider=heartbeat type=Filesystem)
>    Attributes: device=/dev/TEST/gfs2 directory=/GFS2 fstype=gfs2 options=noatime run_fsck=no
>    Operations: monitor interval=15s on-fail=fence OCF_CHECK_LEVEL=20 (TESTGFS2-monitor-interval-15s)
>                notify interval=0s timeout=60s (TESTGFS2-notify-interval-0s)
>                start interval=0s timeout=60s (TESTGFS2-start-interval-0s)
>                stop interval=0s timeout=60s (TESTGFS2-stop-interval-0s)
>
> Stonith Devices:
>  Resource: FENCING (class=stonith type=fence_mpath)
>   Attributes: devices=/dev/mapper/36001405cb123d000 pcmk_host_argument=key pcmk_host_map=node1.localdomain:1;node2.localdomain:2;node3.localdomain:3 pcmk_monitor_action=metadata pcmk_reboot_action=off
>   Meta Attrs: provides=unfencing
>   Operations: monitor interval=60s (FENCING-monitor-interval-60s)
> Fencing Levels:
>
> Location Constraints:
> Ordering Constraints:
>   start
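
A minimal sketch of how the check described at the top of this message could be scripted. The function name `needs_reboot` is hypothetical, and reading from the systemd journal is an assumption; only the error string itself is taken from the log line quoted in this message.

```shell
#!/bin/sh
# needs_reboot: hypothetical helper, not part of pacemaker or dlm.
# Reads log text on stdin and succeeds (exit 0) if it contains the
# controld(dlm) error quoted in this message; fails (exit 1) otherwise.
needs_reboot() {
    grep -q 'Uncontrolled lockspace exists, system must reboot'
}

# Example use on the fenced node (assumes the messages are in the journal):
#   journalctl -b | needs_reboot && echo "dlm lockspace left behind: reboot needed"
```

The function only inspects text, so it can be pointed at any log source; it does not touch dlm or the cluster itself.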
Re: [ClusterLabs] How to unfence without reboot (fence_mpath)
Hello Strahil,

On 2/17/20 3:39 PM, Strahil Nikolov wrote:
> Hello Ondrej,
>
> thanks for your reply. I really appreciate that.
>
> I have picked fence_mpath as I'm preparing for my EX436 and I can't know what agent will be useful on the exam.
> Also, according to https://access.redhat.com/solutions/3201072, there could be a race condition with fence_scsi.

I believe the exam is about testing knowledge of configuration, not knowledge of which race-condition bugs are present and how to handle them :)
If you have access to the learning materials for the EX436 exam, I would recommend trying those out - they have labs and comprehensive review exercises that are useful in preparation for the exam.

> So, I've checked the cluster when fencing and the node immediately goes offline.
> Last messages from pacemaker are:
>
> Feb 17 08:21:57 node1.localdomain stonith-ng[23808]: notice: Client stonith_admin.controld.23888.b57ceee7 wants to fence (reboot) 'node1.localdomain' with device '(any)'
> Feb 17 08:21:57 node1.localdomain stonith-ng[23808]: notice: Requesting peer fencing (reboot) of node1.localdomain
> Feb 17 08:21:57 node1.localdomain stonith-ng[23808]: notice: FENCING can fence (reboot) node1.localdomain (aka. '1'): static-list
> Feb 17 08:21:58 node1.localdomain stonith-ng[23808]: notice: Operation reboot of node1.localdomain by node2.localdomain for stonith_admin.controld.23888@node1.localdomain.ede38ffb: OK

- This part looks OK - meaning the fencing looks like a success.

> Feb 17 08:21:58 node1.localdomain crmd[23812]: crit: We were allegedly just fenced by node2.localdomain for node1.localdomai

- This is also normal, as the node just announces that it was fenced by another node.

> Which for me means - node1 just got fenced again. Actually fencing works, as I/O is immediately blocked and the reservation is removed.
>
> I've used https://access.redhat.com/solutions/2766611 to set up fence_mpath, but I could have messed something up.

- Note related to the exam: you will not have Internet access on the exam, so I would expect that you would have to configure something that does not require access to this (and as Dan Swartzendruber pointed out in another email - we cannot* even see RH links without an account).

* You can get a free developer account to read them, but ideally that should not be needed, and it is certainly inconvenient for a wide public audience.

> Cluster config is:
> [root@node3 ~]# pcs config show
> Cluster Name: HACLUSTER2
> Corosync Nodes:
>  node1.localdomain node2.localdomain node3.localdomain
> Pacemaker Nodes:
>  node1.localdomain node2.localdomain node3.localdomain
>
> Resources:
>  Clone: dlm-clone
>   Meta Attrs: interleave=true ordered=true
>   Resource: dlm (class=ocf provider=pacemaker type=controld)
>    Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
>                start interval=0s timeout=90 (dlm-start-interval-0s)
>                stop interval=0s timeout=100 (dlm-stop-interval-0s)
>  Clone: clvmd-clone
>   Meta Attrs: interleave=true ordered=true
>   Resource: clvmd (class=ocf provider=heartbeat type=clvm)
>    Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
>                start interval=0s timeout=90s (clvmd-start-interval-0s)
>                stop interval=0s timeout=90s (clvmd-stop-interval-0s)
>  Clone: TESTGFS2-clone
>   Meta Attrs: interleave=true
>   Resource: TESTGFS2 (class=ocf provider=heartbeat type=Filesystem)
>    Attributes: device=/dev/TEST/gfs2 directory=/GFS2 fstype=gfs2 options=noatime run_fsck=no
>    Operations: monitor interval=15s on-fail=fence OCF_CHECK_LEVEL=20 (TESTGFS2-monitor-interval-15s)
>                notify interval=0s timeout=60s (TESTGFS2-notify-interval-0s)
>                start interval=0s timeout=60s (TESTGFS2-start-interval-0s)
>                stop interval=0s timeout=60s (TESTGFS2-stop-interval-0s)
>
> Stonith Devices:
>  Resource: FENCING (class=stonith type=fence_mpath)
>   Attributes: devices=/dev/mapper/36001405cb123d000 pcmk_host_argument=key pcmk_host_map=node1.localdomain:1;node2.localdomain:2;node3.localdomain:3 pcmk_monitor_action=metadata pcmk_reboot_action=off
>   Meta Attrs: provides=unfencing
>   Operations: monitor interval=60s (FENCING-monitor-interval-60s)
> Fencing Levels:
>
> Location Constraints:
> Ordering Constraints:
>   start dlm-clone then start clvmd-clone (kind:Mandatory) (id:order-dlm-clone-clvmd-clone-mandatory)
>   start clvmd-clone then start TESTGFS2-clone (kind:Mandatory) (id:order-clvmd-clone-TESTGFS2-clone-mandatory)
> Colocation Constraints:
>   clvmd-clone with dlm-clone (score:INFINITY) (id:colocation-clvmd-clone-dlm-clone-INFINITY)
>   TESTGFS2-clone with clvmd-clone (score:INFINITY) (id:colocation-TESTGFS2-clone-clvmd-clone-INFINITY)
> Ticket Constraints:
>
> Alerts:
>  No alerts defined
>
> Resources Defaults:
>  No defaults set
>
> [root@node3 ~]# crm_mon -r1
> Stack: corosync
> Current DC: node3.localdomain (version 1.1.20-5.el7_7.2-3c4c782f70) - partition with quorum
> Last up
Re: [ClusterLabs] How to unfence without reboot (fence_mpath)
Many people don't have Red Hat access, so linking those URLs is not useful.

On February 17, 2020, at 1:40 AM, Strahil Nikolov wrote:
> Hello Ondrej,
>
> thanks for your reply. I really appreciate that.
>
> I have picked fence_mpath as I'm preparing for my EX436 and I can't know what agent will be useful on the exam.
> Also, according to https://access.redhat.com/solutions/3201072, there could be a race condition with fence_scsi.
>
> So, I've checked the cluster when fencing and the node immediately goes offline.
> Last messages from pacemaker are:
>
> Feb 17 08:21:57 node1.localdomain stonith-ng[23808]: notice: Client stonith_admin.controld.23888.b57ceee7 wants to fence (reboot) 'node1.localdomain' with device '(any)'
> Feb 17 08:21:57 node1.localdomain stonith-ng[23808]: notice: Requesting peer fencing (reboot) of node1.localdomain
> Feb 17 08:21:57 node1.localdomain stonith-ng[23808]: notice: FENCING can fence (reboot) node1.localdomain (aka. '1'): static-list
> Feb 17 08:21:58 node1.localdomain stonith-ng[23808]: notice: Operation reboot of node1.localdomain by node2.localdomain for stonith_admin.controld.23888@node1.localdomain.ede38ffb: OK
> Feb 17 08:21:58 node1.localdomain crmd[23812]: crit: We were allegedly just fenced by node2.localdomain for node1.localdomai
>
> Which for me means - node1 just got fenced again. Actually fencing works, as I/O is immediately blocked and the reservation is removed.
>
> I've used https://access.redhat.com/solutions/2766611 to set up fence_mpath, but I could have messed something up.
Re: [ClusterLabs] How to unfence without reboot (fence_mpath)
Hello Ondrej,

thanks for your reply. I really appreciate that.

I have picked fence_mpath as I'm preparing for my EX436 and I can't know what agent will be useful on the exam.
Also, according to https://access.redhat.com/solutions/3201072, there could be a race condition with fence_scsi.

So, I've checked the cluster when fencing and the node immediately goes offline.
Last messages from pacemaker are:

Feb 17 08:21:57 node1.localdomain stonith-ng[23808]: notice: Client stonith_admin.controld.23888.b57ceee7 wants to fence (reboot) 'node1.localdomain' with device '(any)'
Feb 17 08:21:57 node1.localdomain stonith-ng[23808]: notice: Requesting peer fencing (reboot) of node1.localdomain
Feb 17 08:21:57 node1.localdomain stonith-ng[23808]: notice: FENCING can fence (reboot) node1.localdomain (aka. '1'): static-list
Feb 17 08:21:58 node1.localdomain stonith-ng[23808]: notice: Operation reboot of node1.localdomain by node2.localdomain for stonith_admin.controld.23888@node1.localdomain.ede38ffb: OK
Feb 17 08:21:58 node1.localdomain crmd[23812]: crit: We were allegedly just fenced by node2.localdomain for node1.localdomai

Which for me means - node1 just got fenced again. Actually fencing works, as I/O is immediately blocked and the reservation is removed.

I've used https://access.redhat.com/solutions/2766611 to set up fence_mpath, but I could have messed something up.

Cluster config is:
[root@node3 ~]# pcs config show
Cluster Name: HACLUSTER2
Corosync Nodes:
 node1.localdomain node2.localdomain node3.localdomain
Pacemaker Nodes:
 node1.localdomain node2.localdomain node3.localdomain

Resources:
 Clone: dlm-clone
  Meta Attrs: interleave=true ordered=true
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
               start interval=0s timeout=90 (dlm-start-interval-0s)
               stop interval=0s timeout=100 (dlm-stop-interval-0s)
 Clone: clvmd-clone
  Meta Attrs: interleave=true ordered=true
  Resource: clvmd (class=ocf provider=heartbeat type=clvm)
   Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
               start interval=0s timeout=90s (clvmd-start-interval-0s)
               stop interval=0s timeout=90s (clvmd-stop-interval-0s)
 Clone: TESTGFS2-clone
  Meta Attrs: interleave=true
  Resource: TESTGFS2 (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/TEST/gfs2 directory=/GFS2 fstype=gfs2 options=noatime run_fsck=no
   Operations: monitor interval=15s on-fail=fence OCF_CHECK_LEVEL=20 (TESTGFS2-monitor-interval-15s)
               notify interval=0s timeout=60s (TESTGFS2-notify-interval-0s)
               start interval=0s timeout=60s (TESTGFS2-start-interval-0s)
               stop interval=0s timeout=60s (TESTGFS2-stop-interval-0s)

Stonith Devices:
 Resource: FENCING (class=stonith type=fence_mpath)
  Attributes: devices=/dev/mapper/36001405cb123d000 pcmk_host_argument=key pcmk_host_map=node1.localdomain:1;node2.localdomain:2;node3.localdomain:3 pcmk_monitor_action=metadata pcmk_reboot_action=off
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (FENCING-monitor-interval-60s)
Fencing Levels:

Location Constraints:
Ordering Constraints:
  start dlm-clone then start clvmd-clone (kind:Mandatory) (id:order-dlm-clone-clvmd-clone-mandatory)
  start clvmd-clone then start TESTGFS2-clone (kind:Mandatory) (id:order-clvmd-clone-TESTGFS2-clone-mandatory)
Colocation Constraints:
  clvmd-clone with dlm-clone (score:INFINITY) (id:colocation-clvmd-clone-dlm-clone-INFINITY)
  TESTGFS2-clone with clvmd-clone (score:INFINITY) (id:colocation-TESTGFS2-clone-clvmd-clone-INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set

[root@node3 ~]# crm_mon -r1
Stack: corosync
Current DC: node3.localdomain (version 1.1.20-5.el7_7.2-3c4c782f70) - partition with quorum
Last updated: Mon Feb 17 08:39:30 2020
Last change: Sun Feb 16 18:44:06 2020 by root via cibadmin on node1.localdomain

3 nodes configured
10 resources configured

Online: [ node2.localdomain node3.localdomain ]
OFFLINE: [ node1.localdomain ]

Full list of resources:

 FENCING (stonith:fence_mpath): Started node2.localdomain
 Clone Set: dlm-clone [dlm]
     Started: [ node2.localdomain node3.localdomain ]
     Stopped: [ node1.localdomain ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ node2.localdomain node3.localdomain ]
     Stopped: [ node1.localdomain ]
 Clone Set: TESTGFS2-clone [TESTGFS2]
     Started: [ node2.localdomain node3.localdomain ]
     Stopped: [ node1.localdomain ]

In the logs, I've noticed that the node is first unfenced and later it is fenced again...
For the unfence, I believe "meta provides=unfencing" is 'guilty', yet I'm not sure about the action from node2.

So far I have used SCSI reservations only with ServiceGuard, and SBD on SUSE - and I was wondering if the setu
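
To make the `pcmk_host_map` setting in the configuration above a little more concrete, here is a small sketch that looks up the key assigned to a node. The function name `key_for_node` is hypothetical and is not part of pcs or the fence agents; the map string and device path are the ones from the configuration above. The commented `mpathpersist` call is one way to list the keys currently registered on the device, assuming the multipath tools are installed.

```shell
#!/bin/sh
# key_for_node: hypothetical helper, not part of pcs or fence-agents.
# $1 = pcmk_host_map string (name:key pairs separated by ';')
# $2 = node name to look up
# Prints the key mapped to that node, or nothing if the node is absent.
key_for_node() {
    printf '%s\n' "$1" | tr ';' '\n' | awk -F: -v n="$2" '$1 == n { print $2 }'
}

# List the reservation keys registered on the shared device (run as root):
#   mpathpersist --in -k -d /dev/mapper/36001405cb123d000
```

For example, `key_for_node 'node1.localdomain:1;node2.localdomain:2;node3.localdomain:3' node1.localdomain` prints the key that should reappear among the registered keys once node1 has been unfenced.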
Re: [ClusterLabs] How to unfence without reboot (fence_mpath)
Hello Strahil,

On 2/17/20 11:54 AM, Strahil Nikolov wrote:
> Hello Community,
>
> This is my first interaction with pacemaker and SCSI reservations and I was wondering how to unfence a node without rebooting it?

For a first encounter with SCSI reservations I would recommend 'fence_scsi' over 'fence_mpath', because it is easier to configure :)
If everything works correctly, then a simple restart of the cluster on the fenced node should be enough.

Side NOTE: Last year there was a discussion about a change that introduced the ability to choose what happens when a node is fenced by a storage-based fence agent (like fence_mpath/fence_scsi); as of now it defaults to 'shutdown the cluster'. Newer pacemaker versions have an option that can change this to 'shutdown the cluster and panic the node, making it reboot'.

> I tried to stop & start the cluster stack - it just powers off itself.
> Adding the reservation before starting the cluster stack - same.

It sounds like the node was fenced again after the start, or at least fencing was attempted. Are there any errors about fencing/stonith in the logs (/var/log/cluster/corosync.log or similar) from around the time when the cluster is started again on the node?

> Only a reboot works.

What does the state of the cluster look like on a living node when another node is fenced? I wonder if the fenced node is reported as Offline or UNCLEAN - you can use 'crm_mon -1f' on a living node to get the current cluster state, including the failures.

> Thanks for answering my question.
>
> Best Regards,
> Strahil Nikolov

--
Ondrej Famera
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
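
The two checks requested in this reply could be gathered roughly as follows. This is only a sketch: the function name `fencing_events` is hypothetical, the search pattern is an assumption about which log lines are relevant, and the default log path is simply the one mentioned above and may differ on other systems.

```shell
#!/bin/sh
# fencing_events: hypothetical helper that prints fencing-related lines
# from a log file. $1 is the log path; the default is the path mentioned
# in the reply above.
fencing_events() {
    # Case-insensitive match on "stonith" or any word containing "fenc".
    grep -iE 'stonith|fenc' "${1:-/var/log/cluster/corosync.log}"
}

# Cluster state with fail counts, run once on a surviving node:
#   crm_mon -1f
```

Running `fencing_events` on the fenced node right after restarting the cluster stack narrows the log down to the lines around the repeated fencing.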