On February 17, 2020 3:36:27 PM GMT+02:00, Ondrej <ondrej-clusterl...@famera.cz> wrote:
>Hello Strahil,
>
>On 2/17/20 3:39 PM, Strahil Nikolov wrote:
>> Hello Ondrej,
>>
>> thanks for your reply. I really appreciate that.
>>
>> I have picked fence_multipath as I'm preparing for my EX436 and I can't know which agent will be useful on the exam.
>> Also, according to https://access.redhat.com/solutions/3201072, there could be a race condition with fence_scsi.
>
>I believe that the exam is about testing knowledge of configuration, not about knowing which race-condition bugs are present and how to handle them :)
>If you have access to the learning materials for the EX436 exam I would recommend trying those out - they have labs and comprehensive review exercises that are useful in preparation for the exam.
>
>> So, I've checked the cluster when fencing, and the node immediately goes offline.
>> Last messages from pacemaker are:
>> <snip>
>> Feb 17 08:21:57 node1.localdomain stonith-ng[23808]: notice: Client stonith_admin.controld.23888.b57ceee7 wants to fence (reboot) 'node1.localdomain' with device '(any)'
>> Feb 17 08:21:57 node1.localdomain stonith-ng[23808]: notice: Requesting peer fencing (reboot) of node1.localdomain
>> Feb 17 08:21:57 node1.localdomain stonith-ng[23808]: notice: FENCING can fence (reboot) node1.localdomain (aka. '1'): static-list
>> Feb 17 08:21:58 node1.localdomain stonith-ng[23808]: notice: Operation reboot of node1.localdomain by node2.localdomain for stonith_admin.controld.23888@node1.localdomain.ede38ffb: OK
>- This part looks OK - meaning the fencing looks like a success.
>> Feb 17 08:21:58 node1.localdomain crmd[23812]: crit: We were allegedly just fenced by node2.localdomain for node1.localdomai
>- this is also normal, as the node just announces that it was fenced by the other node
>
>> <snip>
>>
>> Which for me means - node1 just got fenced again. Actually the fencing works, as I/O is immediately blocked and the reservation is removed.
>>
>> I've used https://access.redhat.com/solutions/2766611 to set up the fence_mpath, but I could have messed up something.
>- a note related to the exam: you will not have Internet access on the exam, so I would expect that you would have to configure something that does not require access to this (and as Dan Swartzendruber pointed out in another email - we cannot* even see RH links without an account)
>
>* you can get a free developer account to read them, but ideally that should not be needed, and it is certainly inconvenient for a wide public audience
>
>>
>> Cluster config is:
>> [root@node3 ~]# pcs config show
>> Cluster Name: HACLUSTER2
>> Corosync Nodes:
>>  node1.localdomain node2.localdomain node3.localdomain
>> Pacemaker Nodes:
>>  node1.localdomain node2.localdomain node3.localdomain
>>
>> Resources:
>>  Clone: dlm-clone
>>   Meta Attrs: interleave=true ordered=true
>>   Resource: dlm (class=ocf provider=pacemaker type=controld)
>>    Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
>>                start interval=0s timeout=90 (dlm-start-interval-0s)
>>                stop interval=0s timeout=100 (dlm-stop-interval-0s)
>>  Clone: clvmd-clone
>>   Meta Attrs: interleave=true ordered=true
>>   Resource: clvmd (class=ocf provider=heartbeat type=clvm)
>>    Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
>>                start interval=0s timeout=90s (clvmd-start-interval-0s)
>>                stop interval=0s timeout=90s (clvmd-stop-interval-0s)
>>  Clone: TESTGFS2-clone
>>   Meta Attrs: interleave=true
>>   Resource: TESTGFS2 (class=ocf provider=heartbeat type=Filesystem)
>>    Attributes: device=/dev/TEST/gfs2 directory=/GFS2 fstype=gfs2 options=noatime run_fsck=no
>>    Operations: monitor interval=15s on-fail=fence OCF_CHECK_LEVEL=20 (TESTGFS2-monitor-interval-15s)
>>                notify interval=0s timeout=60s (TESTGFS2-notify-interval-0s)
>>                start interval=0s timeout=60s (TESTGFS2-start-interval-0s)
>>                stop interval=0s timeout=60s (TESTGFS2-stop-interval-0s)
>>
>> Stonith Devices:
>>  Resource: FENCING (class=stonith type=fence_mpath)
>>   Attributes: devices=/dev/mapper/36001405cb123d0000000000000000000 pcmk_host_argument=key pcmk_host_map=node1.localdomain:1;node2.localdomain:2;node3.localdomain:3 pcmk_monitor_action=metadata pcmk_reboot_action=off
>>   Meta Attrs: provides=unfencing
>>   Operations: monitor interval=60s (FENCING-monitor-interval-60s)
>> Fencing Levels:
>>
>> Location Constraints:
>> Ordering Constraints:
>>   start dlm-clone then start clvmd-clone (kind:Mandatory) (id:order-dlm-clone-clvmd-clone-mandatory)
>>   start clvmd-clone then start TESTGFS2-clone (kind:Mandatory) (id:order-clvmd-clone-TESTGFS2-clone-mandatory)
>> Colocation Constraints:
>>   clvmd-clone with dlm-clone (score:INFINITY) (id:colocation-clvmd-clone-dlm-clone-INFINITY)
>>   TESTGFS2-clone with clvmd-clone (score:INFINITY) (id:colocation-TESTGFS2-clone-clvmd-clone-INFINITY)
>> Ticket Constraints:
>>
>> Alerts:
>>  No alerts defined
>>
>> Resources Defaults:
>>  No defaults set
>>
>> [root@node3 ~]# crm_mon -r1
>> Stack: corosync
>> Current DC: node3.localdomain (version 1.1.20-5.el7_7.2-3c4c782f70) - partition with quorum
>> Last updated: Mon Feb 17 08:39:30 2020
>> Last change: Sun Feb 16 18:44:06 2020 by root via cibadmin on node1.localdomain
>>
>> 3 nodes configured
>> 10 resources configured
>>
>> Online: [ node2.localdomain node3.localdomain ]
>> OFFLINE: [ node1.localdomain ]
>>
>> Full list of resources:
>>
>>  FENCING (stonith:fence_mpath): Started node2.localdomain
>>  Clone Set: dlm-clone [dlm]
>>      Started: [ node2.localdomain node3.localdomain ]
>>      Stopped: [ node1.localdomain ]
>>  Clone Set: clvmd-clone [clvmd]
>>      Started: [ node2.localdomain node3.localdomain ]
>>      Stopped: [ node1.localdomain ]
>>  Clone Set: TESTGFS2-clone [TESTGFS2]
>>      Started: [ node2.localdomain node3.localdomain ]
>>      Stopped: [ node1.localdomain ]
>>
>>
>> In the logs, I've noticed that the node is first unfenced and later it is fenced again... For the unfence, I believe "meta provides=unfencing" is 'guilty', yet I'm not sure about the action from node2.
>
>'Unfencing' is exactly the expected behavior when provides=unfencing is present (and it should be present with fence_scsi and fence_multipath).
>
>Here the important part is "first unfenced and later it is fenced again". If everything is in a normal state, then the node should not just be fenced again, so it would make sense to me to investigate that 'fencing' after the unfencing. I would expect that one of the nodes will have more verbose logs that would give an idea of why the fencing was ordered. (My lucky guess would be a failed 'monitor' operation on any of the resources, as all of them have 'on-fail=fence', but this would need support from the logs to be sure.)
>Also, the logs from the fenced node can provide some information on what happened on the node - if that was the cause of the fencing.
>
>> So far I have used SCSI reservations only with ServiceGuard, and SBD on SUSE - and I was wondering if the setup is correctly done.
>I don't see anything particularly bad-looking from a configuration point of view. The best place to look for the reason is now the logs from the other nodes after the 'unfencing' and before the 'fencing again'.
>
>> Storage in this test setup is a Highly Available iSCSI Cluster on top of DRBD /RHEL 7 again/, and it seems that SCSI Reservations Support is OK.
>From the logs you have provided so far, the reservation keys work, as fencing is happening and reports OK.
>
>> Best Regards,
>> Strahil Nikolov
>
>Example of fencing because the 'monitor' operation of resource 'testtest' failed, from the logs:
>
>Feb 17 22:32:15 [1289] fastvm-centos-7-7-174    pengine:  warning: pe_fence_node:  Cluster node fastvm-centos-7-7-175 will be fenced: testtest failed there
>Feb 17 22:32:15 [1289] fastvm-centos-7-7-174    pengine:   notice: LogNodeActions:  * Fence (reboot) fastvm-centos-7-7-175 'testtest failed there'
>
>--
>Ondrej
Hey Ondrej,

Sadly, the lab in the training uses a customized fencing mechanism that cannot be reproduced outside of the Red Hat training lab. As I don't know what the exam environment will be (Red Hat prevents any disclosure on that), I have to pick a fencing mechanism that will work in any environment, and 'fence_mpath' matches those criteria.
Sadly, Red Hat expects the engineer to be able to deal with bugs (an interview with Red Hat's CEO from several years ago confirmed that), so if I know that fence_scsi can have issues, it is better to play it safe and avoid it.

I'm sorry for quoting Red Hat's Solutions. The one I followed mentions that each node should have a unique reservation_key (in /etc/multipath.conf), and my stonith agent is not defined with the otherwise mandatory 'key' - the per-node keys are supplied via pcmk_host_map instead (a sketch of how I read that setup is below).

Best Regards,
Strahil Nikolov
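
P.S. To make it concrete, here is a minimal sketch of how I currently understand the fence_mpath setup from that solution. The WWID, the resource name FENCING and the 1/2/3 keys are taken from my config above; the exact reservation_key format and the multipathd restart step are my assumptions, not something I have verified against the article:

# /etc/multipath.conf on each node - a unique reservation key that corresponds to
# the node's value in pcmk_host_map (node1 -> 1, node2 -> 2, node3 -> 3):
defaults {
        reservation_key 0x1     # assumption: 0x2 on node2, 0x3 on node3
}

# pick up the new key (assumption: restarting multipathd is enough):
systemctl restart multipathd

# the stonith resource itself, written as a single create command that should be
# equivalent to what 'pcs config show' prints above:
pcs stonith create FENCING fence_mpath \
        devices=/dev/mapper/36001405cb123d0000000000000000000 \
        pcmk_host_map="node1.localdomain:1;node2.localdomain:2;node3.localdomain:3" \
        pcmk_host_argument=key \
        pcmk_monitor_action=metadata \
        pcmk_reboot_action=off \
        op monitor interval=60s \
        meta provides=unfencing

# after unfencing, all three keys should show up as registered on the device:
mpathpersist --in --read-keys /dev/mapper/36001405cb123d0000000000000000000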
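
P.P.S. Regarding the second fencing right after the unfence - if I understand your hint correctly, something like this on node2/node3 should show why it was ordered (assuming pacemaker logs to /var/log/messages, as on a default RHEL 7 install):

# the pengine on the DC logs the reason for the fencing:
grep -E "will be fenced|pe_fence_node" /var/log/messages

# and a one-shot status with failures should show any failed 'monitor'
# operation that would trigger on-fail=fence:
crm_mon -1rf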