Re: [ClusterLabs] cluster doesn't do HA as expected, pingd doesn't help
On 19.12.2023 21:42, Artem wrote:
> Andrei and Klaus, thanks for the prompt reply and clarification! As I
> understand it, the design and behavior of Pacemaker are tightly coupled
> with the stonith concept. But isn't it too rigid?

If you insist on shooting yourself in the foot, Pacemaker gives you the gun. It just does not load it by default and does not shoot itself. Seriously, this topic has been beaten to death; just do some research.

You can avoid fencing and rely on quorum in a shared-nothing case. The prime example I have seen is NetApp C-Mode ONTAP, where the set of management processes goes read-only, preventing any modification, when a node goes out of quorum. But as soon as you have a shared resource, ignoring fencing will lead to data corruption sooner or later.

> Is there a way to leverage self-monitoring or pingd rules to trigger an
> isolated node to umount its FS? Like the vSphere High Availability host
> isolation response. Can resource-stickiness=off (auto-failback) decrease
> the risk of corruption from an unresponsive node coming back online? Is
> there a quorum feature not for the cluster but for resource start/stop?
> Got the lock - welcome to mount; unable to refresh the lease - force
> unmount. Can on-fail=ignore break manual failover logic (stopped will be
> considered as failed and thus ignored)?
>
> best regards, Artem
>
> On Tue, 19 Dec 2023 at 17:03, Klaus Wenninger wrote:
>> On Tue, Dec 19, 2023 at 10:00 AM Andrei Borzenkov wrote:
>>> On Tue, Dec 19, 2023 at 10:41 AM Artem wrote:
>>> ...
>>>> Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (update_resource_action_runnable) warning: OST4_stop_0 on lustre4 is unrunnable (node is offline)
>>>> Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre3
>>>> Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Stop OST4 ( lustre4 ) blocked
>>>
>>> This is the default for a failed stop operation. The only way Pacemaker
>>> can resolve a failure to stop a resource is to fence the node where the
>>> resource was active. If that is not possible (and IIRC you refuse to use
>>> stonith), Pacemaker has no other choice than to block it. If you insist,
>>> you can of course set on-fail=ignore, but this means an unreachable node
>>> will continue to run resources. Whether that can lead to corruption in
>>> your case I cannot guess.
>>
>> Don't know if I'm reading that correctly, but I understand from what you
>> wrote above that you try to trigger the failover by stopping the VM
>> (lustre4) without an ordered shutdown. With fencing disabled, what we
>> are seeing is exactly what we would expect: the state of the resource is
>> unknown - Pacemaker tries to stop it - that doesn't work because the
>> node is offline - no fencing is configured - so all it can do is wait
>> until there is info on whether the resource is up or not. I guess the
>> strange output below appears because fencing is disabled - quite an
>> unusual, and not recommended, configuration - so this may not have
>> shown up often in that way.
>>
>> Klaus
>>
>>>> Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__create_graph) crit: Cannot fence lustre4 because of OST4: blocked (OST4_stop_0)
>>>
>>> That is a rather strange phrase. The resource is blocked because
>>> Pacemaker could not fence the node, not the other way round.

___
Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
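To make the two options concrete: the choice described above is between configuring fencing (so a failed stop can be resolved by powering the node off) and setting on-fail=ignore (accepting the corruption risk). A hedged sketch of both with pcs; the fence agent, vCenter address, credentials, and host map below are illustrative assumptions, not from the thread:

```shell
# Option the poster was warned against: ignore a failed stop for OST4.
# An unreachable node may then keep the Lustre target mounted - risky
# on shared storage, as stated above.
pcs resource update OST4 op stop on-fail=ignore

# Recommended alternative: configure a fence device so the cluster can
# resolve "stop unrunnable (node is offline)" by fencing. Since the
# nodes here are VMs, a hypervisor fence agent is a natural fit
# (parameters below are made-up placeholders).
pcs stonith create vmfence fence_vmware_soap \
    ip=vcenter.example.com username=fenceuser password=secret ssl=1 \
    pcmk_host_map="lustre3:vm-lustre3;lustre4:vm-lustre4"
pcs property set stonith-enabled=true
```

With fencing in place, the "Cannot fence lustre4 because of OST4: blocked" situation in the logs below would instead resolve as: node fenced, stop considered complete, OST4 restarted on lustre3.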
Re: [ClusterLabs] cluster doesn't do HA as expected, pingd doesn't help
What if a node (especially a VM) freezes for several minutes and then continues to write to a shared disk where the other nodes have already put their data? In my opinion fencing, preferably two-level, is mandatory for Lustre. Trust me, I developed the whole HA stack for both Exascaler and PangeaFS, and we have seen so many points where data loss may occur...

On December 19, 2023 19:42:56 Artem wrote:
> Andrei and Klaus, thanks for the prompt reply and clarification! As I
> understand it, the design and behavior of Pacemaker are tightly coupled
> with the stonith concept. But isn't it too rigid? Is there a way to
> leverage self-monitoring or pingd rules to trigger an isolated node to
> umount its FS? Like the vSphere High Availability host isolation
> response. Can resource-stickiness=off (auto-failback) decrease the risk
> of corruption from an unresponsive node coming back online? Is there a
> quorum feature not for the cluster but for resource start/stop? Got the
> lock - welcome to mount; unable to refresh the lease - force unmount.
> Can on-fail=ignore break manual failover logic (stopped will be
> considered as failed and thus ignored)?
>
> best regards, Artem
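The two-level fencing recommended here maps to Pacemaker's fencing topology: levels are tried in order, and the next level is used only if the previous one fails. A sketch with pcs; the agents, addresses, and device names are illustrative assumptions, not configuration from the thread:

```shell
# Level 1: power fencing via the node's BMC (placeholder credentials).
pcs stonith create power-fence fence_ipmilan \
    ip=10.0.0.10 username=admin password=secret pcmk_host_list=lustre4

# Level 2: SCSI persistent-reservation fencing on the shared Lustre disk,
# cutting the node off from storage even if power fencing fails.
pcs stonith create disk-fence fence_scsi \
    devices=/dev/disk/by-id/wwn-0xDEADBEEF pcmk_host_list=lustre4 \
    meta provides=unfencing

# Try power fencing first, storage fencing as the fallback.
pcs stonith level add 1 lustre4 power-fence
pcs stonith level add 2 lustre4 disk-fence
```

The appeal for a frozen-VM scenario is that level 2 protects the data even when the node cannot be powered off: a host that wakes up after a freeze finds its reservation revoked and can no longer write.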
Re: [ClusterLabs] cluster doesn't do HA as expected, pingd doesn't help
Andrei and Klaus, thanks for the prompt reply and clarification!

As I understand it, the design and behavior of Pacemaker are tightly coupled with the stonith concept. But isn't it too rigid? Is there a way to leverage self-monitoring or pingd rules to trigger an isolated node to umount its FS? Like the vSphere High Availability host isolation response. Can resource-stickiness=off (auto-failback) decrease the risk of corruption from an unresponsive node coming back online? Is there a quorum feature not for the cluster but for resource start/stop? Got the lock - welcome to mount; unable to refresh the lease - force unmount. Can on-fail=ignore break manual failover logic (stopped will be considered as failed and thus ignored)?

best regards, Artem
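Regarding the question about an isolated node unmounting its own FS: the closest stock Pacemaker mechanism to a vSphere-style host isolation response is watchdog-based SBD self-fencing, where a node that loses quorum stops feeding its watchdog and reboots itself. A sketch, assuming the sbd package and a watchdog device are available on every node (this mechanism is not proposed anywhere in the thread; it is offered here only as a possible answer to the question):

```shell
# Enable diskless SBD: each node arms /dev/watchdog, and an isolated
# (quorum-less) node self-resets instead of continuing to write.
pcs stonith sbd enable --watchdog=/dev/watchdog

# Tell the scheduler it may assume a lost node is dead after the
# watchdog timeout has expired (value is an illustrative choice).
pcs property set stonith-watchdog-timeout=10s

# sbd only takes effect after a full cluster restart.
pcs cluster stop --all && pcs cluster start --all
```

Unlike on-fail=ignore, this keeps the "unknown node state" problem solved: the surviving partition can safely restart the resource once the watchdog timeout has passed.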
Re: [ClusterLabs] cluster doesn't do HA as expected, pingd doesn't help
On Tue, Dec 19, 2023 at 10:00 AM Andrei Borzenkov wrote:
> On Tue, Dec 19, 2023 at 10:41 AM Artem wrote:
> ...
>> Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (update_resource_action_runnable) warning: OST4_stop_0 on lustre4 is unrunnable (node is offline)
>> Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre3
>> Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Stop OST4 ( lustre4 ) blocked
>
> This is the default for a failed stop operation. The only way Pacemaker
> can resolve a failure to stop a resource is to fence the node where the
> resource was active. If that is not possible (and IIRC you refuse to use
> stonith), Pacemaker has no other choice than to block it. If you insist,
> you can of course set on-fail=ignore, but this means an unreachable node
> will continue to run resources. Whether that can lead to corruption in
> your case I cannot guess.

Don't know if I'm reading that correctly, but I understand from what you wrote above that you try to trigger the failover by stopping the VM (lustre4) without an ordered shutdown. With fencing disabled, what we are seeing is exactly what we would expect: the state of the resource is unknown - Pacemaker tries to stop it - that doesn't work because the node is offline - no fencing is configured - so all it can do is wait until there is info on whether the resource is up or not. I guess the strange output below appears because fencing is disabled - quite an unusual, and not recommended, configuration - so this may not have shown up often in that way.

Klaus

>> Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__create_graph) crit: Cannot fence lustre4 because of OST4: blocked (OST4_stop_0)
>
> That is a rather strange phrase. The resource is blocked because
> Pacemaker could not fence the node, not the other way round.
Re: [ClusterLabs] cluster doesn't do HA as expected, pingd doesn't help
On Tue, Dec 19, 2023 at 10:41 AM Artem wrote:
...
> Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (update_resource_action_runnable) warning: OST4_stop_0 on lustre4 is unrunnable (node is offline)
> Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre3
> Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Stop OST4 ( lustre4 ) blocked

This is the default for a failed stop operation. The only way Pacemaker can resolve a failure to stop a resource is to fence the node where the resource was active. If that is not possible (and IIRC you refuse to use stonith), Pacemaker has no other choice than to block it. If you insist, you can of course set on-fail=ignore, but this means an unreachable node will continue to run resources. Whether that can lead to corruption in your case I cannot guess.

> Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__create_graph) crit: Cannot fence lustre4 because of OST4: blocked (OST4_stop_0)

That is a rather strange phrase. The resource is blocked because Pacemaker could not fence the node, not the other way round.
Re: [ClusterLabs] cluster doesn't do HA as expected, pingd doesn't help
Hi Ken,

I rolled back the settings to 100:100 scores without ping and ran the simulation again. I checked pacemaker.log, and the only meaningful entries are the following; it still doesn't make sense to me:

Actions: Stop OST4 ( lustre4 ) blocked
crit: Cannot fence lustre4 because of OST4: blocked (OST4_stop_0)

Entries in pacemaker.log of the 1st cluster node (lustre-mgs):

Dec 19 09:48:13 lustre-mgs.ntslab.ru pacemaker-based [3833057] (log_info) info: ++ /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='lustre4']:
Dec 19 09:48:13 lustre-mds1.ntslab.ru pacemaker-attrd [2457589] (attrd_cib_callback) info: CIB update 9 result for last-failure-OST4#monitor_2: OK | rc=0
Dec 19 09:48:13 lustre-mds1.ntslab.ru pacemaker-attrd [2457589] (attrd_cib_callback) info: * last-failure-OST4#monitor_2[lustre4]=1702968493
Dec 19 09:48:13 lustre-mds1.ntslab.ru pacemaker-based [2457586] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/transient_attributes[@id='lustre4']/instance_attributes[@id='status-lustre4']:
Dec 19 09:48:13 lustre-mds1.ntslab.ru pacemaker-attrd [2457589] (attrd_cib_callback) info: CIB update 10 result for fail-count-OST4#monitor_2: OK | rc=0
Dec 19 09:48:13 lustre-mds1.ntslab.ru pacemaker-attrd [2457589] (attrd_cib_callback) info: * fail-count-OST4#monitor_2[lustre4]=1

Again, the OST4 resource is not mentioned except in the first seconds of the failure; there are no logged attempts to restart it elsewhere.

Lastly, pacemaker.log from the 3rd cluster node - the same 1-minute silence 09:48:13 - 09:49:14, but this time more entries regarding OST4:

[root@lustre-mds2 ~]# grep OST4 /var/log/pacemaker/pacemaker.log
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-based [785103] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/lrm[@id='lustre4']/lrm_resources/lrm_resource[@id='OST4']:
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-based [785103] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/transient_attributes[@id='lustre4']/instance_attributes[@id='status-lustre4']:
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-controld [785108] (abort_transition_graph) info: Transition 3 aborted by status-lustre4-fail-count-OST4.monitor_2 doing create fail-count-OST4#monitor_2=1: Transient attribute change | cib=0.467.213 source=abort_unless_down:297 path=/cib/status/node_state[@id='lustre4']/transient_attributes[@id='lustre4']/instance_attributes[@id='status-lustre4'] complete=true
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:37:12 2023 | exit-status=1 id=OST4_last_failure_0
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__threshold_reached) info: OST4 can fail 99 more times on lustre4 before reaching migration threshold (100)
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (update_resource_action_runnable) warning: OST4_stop_0 on lustre4 is unrunnable (node is offline)
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre3
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Stop OST4 ( lustre4 ) blocked
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__create_graph) crit: Cannot fence lustre4 because of OST4: blocked (OST4_stop_0)
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:37:12 2023 | exit-status=1 id=OST4_last_failure_0
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__threshold_reached) info: OST4 can fail 99 more times on lustre4 before reaching migration threshold (100)
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (update_resource_action_runnable) warning: OST4_stop_0 on lustre4 is unrunnable (node is offline)
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre3
Dec 19 09:48:13
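As an aside, the recorded failure and the blocked stop seen in these logs can be inspected and cleared with standard Pacemaker tools once the node's real state is known. A sketch using the resource and node names from the logs (crm_resource --why requires a reasonably recent Pacemaker):

```shell
# Query the fail count the scheduler is acting on
# (fail-count-OST4#monitor_2=1 in the log above).
crm_failcount --query -r OST4 -N lustre4

# Ask the cluster directly why OST4 is not running.
crm_resource --resource OST4 --why

# After the node is confirmed down (or back up cleanly), clear the
# failure record so the scheduler re-evaluates placement.
pcs resource cleanup OST4
```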
Re: [ClusterLabs] cluster doesn't do HA as expected, pingd doesn't help
On Mon, 2023-12-18 at 23:39 +0300, Artem wrote:
> Hello experts.
>
> I previously played with a dummy resource and it worked as expected.
> Now I'm switching to a Lustre OST resource and cannot make it work,
> nor can I understand why.
>
> ### Initial setup:
> # pcs resource defaults update resource-stickiness=110
> # for i in {1..4}; do pcs cluster node add-remote lustre$i reconnect_interval=60; done
> # for i in {1..4}; do pcs constraint location lustre$i prefers lustre-mgs lustre-mds1 lustre-mds2; done
> # pcs resource create OST3 ocf:lustre:Lustre target=/dev/disk/by-id/wwn-0x6000c291b7f7147f826bb95153e2eaca mountpoint=/lustre/oss3
> # pcs resource create OST4 ocf:lustre:Lustre target=/dev/disk/by-id/wwn-0x6000c292c41eaae60bccdd3a752913b3 mountpoint=/lustre/oss4
> (I also tried ocf:heartbeat:Filesystem device=... directory=... fstype=lustre force_unmount=safe --> same behavior)
>
> # pcs constraint location OST3 prefers lustre3=100
> # pcs constraint location OST3 prefers lustre4=100
> # pcs constraint location OST4 prefers lustre3=100
> # pcs constraint location OST4 prefers lustre4=100
> # for i in lustre-mgs lustre-mds1 lustre-mds2 lustre{1..2}; do pcs constraint location OST3 avoids $i; done
> # for i in lustre-mgs lustre-mds1 lustre-mds2 lustre{1..2}; do pcs constraint location OST4 avoids $i; done
>
> ### Checking all is good
> # crm_simulate --simulate --live-check --show-scores
> pcmk__primitive_assign: OST4 allocation score on lustre3: 100
> pcmk__primitive_assign: OST4 allocation score on lustre4: 210
> # pcs status
> * OST3 (ocf::lustre:Lustre): Started lustre3
> * OST4 (ocf::lustre:Lustre): Started lustre4
>
> ### VM with lustre4 (OST4) is OFF
>
> # crm_simulate --simulate --live-check --show-scores
> pcmk__primitive_assign: OST4 allocation score on lustre3: 100
> pcmk__primitive_assign: OST4 allocation score on lustre4: 100
> Start OST4 ( lustre3 )
> Resource action: OST4 start on lustre3
> Resource action: OST4 monitor=2 on lustre3
> # pcs status
> * OST3 (ocf::lustre:Lustre): Started lustre3
> * OST4 (ocf::lustre:Lustre): Stopped
>
> 1) I see crm_simulate guessed that it has to restart the failed OST4 on
> lustre3. After making such a decision I suspect it evaluates the 100:100
> scores of both lustre3 and lustre4, but lustre3 is already running a
> service. So it decides to run OST4 again on lustre4, which has failed.
> Thus it cannot restart on the surviving nodes. Right?

No. I'd start with figuring out this case. There's no reason, given the configuration above, why OST4 would be stopped. In fact the simulation shows it should be started, so that suggests that maybe the actual start failed. Do the logs show any errors around this time?

> 2) Ok, let's try not to give a specific score - nothing changed, see below:
> ### did remove old constraints; clear all resources; cleanup all resources; cluster stop; cluster start
>
> # pcs constraint location OST3 prefers lustre3 lustre4
> # pcs constraint location OST4 prefers lustre3 lustre4
> # for i in lustre-mgs lustre-mds1 lustre-mds2 lustre{1..2}; do pcs constraint location OST3 avoids $i; done
> # for i in lustre-mgs lustre-mds1 lustre-mds2 lustre{1..2}; do pcs constraint location OST4 avoids $i; done
> # crm_simulate --simulate --live-check --show-scores
> pcmk__primitive_assign: OST4 allocation score on lustre3: INFINITY
> pcmk__primitive_assign: OST4 allocation score on lustre4: INFINITY
> # pcs status
> * OST3 (ocf::lustre:Lustre): Started lustre3
> * OST4 (ocf::lustre:Lustre): Started lustre4
>
> ### VM with lustre4 (OST4) is OFF
>
> # crm_simulate --simulate --live-check --show-scores
> pcmk__primitive_assign: OST4 allocation score on lustre3: INFINITY
> pcmk__primitive_assign: OST4 allocation score on lustre4: INFINITY
> Start OST4 ( lustre3 )
> Resource action: OST4 start on lustre3
> Resource action: OST4 monitor=2 on lustre3
> # pcs status
> * OST3 (ocf::lustre:Lustre): Started lustre3
> * OST4 (ocf::lustre:Lustre): Stopped
>
> 3) Ok, let's try to set different scores with preference to nodes and
> affect it with pingd:
> ### did remove old constraints; clear all resources; cleanup all resources; cluster stop; cluster start
>
> # pcs constraint location OST3 prefers lustre3=100
> # pcs constraint location OST3 prefers lustre4=90
> # pcs constraint location OST4 prefers lustre3=90
> # pcs constraint location OST4 prefers lustre4=100
> # for i in lustre-mgs lustre-mds1 lustre-mds2 lustre{1..2}; do pcs constraint location OST3 avoids $i; done
> # for i in lustre-mgs lustre-mds1 lustre-mds2 lustre{1..2}; do pcs constraint location OST4 avoids $i; done
> # pcs resource create ping ocf:pacemaker:ping dampen=5s host_list=192.168.34.250 op monitor interval=3s timeout=7s meta target-role="started" globally-unique="false" clone
> # for i in lustre-mgs
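The message is truncated above, but for context: a ping clone like the one being created is normally tied to resource placement with a location rule on the pingd node attribute maintained by ocf:pacemaker:ping. An illustrative sketch using the resource names from the thread, not the poster's actual (cut-off) command:

```shell
# Keep OST4 off any node that cannot reach the ping target: nodes where
# the "pingd" attribute is missing or zero get a -INFINITY score.
pcs constraint location OST4 rule score=-INFINITY pingd lt 1 or not_defined pingd
```

Note this only influences placement; as discussed earlier in the thread, it cannot substitute for fencing, because a node that stops answering may still hold the mount.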