Hi Ken,

I rolled back the settings to 100:100 scores without ping and ran the simulation again. I checked pacemaker.log, and the only meaningful entries are the following, yet they still don't make sense to me:

Actions: Stop OST4 ( lustre4 ) blocked
crit: Cannot fence lustre4 because of OST4: blocked (OST4_stop_0)
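That "Cannot fence" message makes me suspect the stop is blocked because the cluster cannot fence the lost remote node, so my next step is to double-check the fencing side, roughly like this (a sketch from memory - on older pcs "property config" would be "property show"):

# pcs stonith status
# pcs property config stonith-enabled
# crm_verify --live-check --verbose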
Entries in pacemaker.log on the 1st cluster node (lustre-mgs):

Dec 19 09:48:13 lustre-mgs.ntslab.ru pacemaker-based [3833057] (log_info) info: ++ /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='lustre4']: <lrm_rsc_op id="lustre4_last_failure_0" operation_key="lustre4_monitor_5000" operation="monitor" crm-debug-origin="controld_update_resource_history" crm_feature_set="3.17.4" transition-key="48:2:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" transition-magic="2:1;48:2:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" exit-reason="Remote executor did
Dec 19 09:48:13 lustre-mgs.ntslab.ru pacemaker-based [3833057] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/lrm[@id='lustre4']/lrm_resources/lrm_resource[@id='OST4']: <lrm_rsc_op id="OST4_last_failure_0" operation_key="OST4_monitor_20000" operation="monitor" crm-debug-origin="controld_update_resource_history" crm_feature_set="3.17.4" transition-key="52:2:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" transition-magic="8:1;52:2:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" exit-reason="Action was pend
Dec 19 09:48:13 lustre-mgs.ntslab.ru pacemaker-attrd [3833060] (update_attr_on_host) notice: Setting last-failure-lustre4#monitor_5000[lustre-mds1] in instance_attributes: (unset) -> 1702968493 | from lustre-mds2 with no write delay
Dec 19 09:48:13 lustre-mgs.ntslab.ru pacemaker-attrd [3833060] (update_attr_on_host) notice: Setting fail-count-lustre4#monitor_5000[lustre-mds1] in instance_attributes: (unset) -> 1 | from lustre-mds2 with no write delay
Dec 19 09:48:13 lustre-mgs.ntslab.ru pacemaker-attrd [3833060] (update_attr_on_host) notice: Setting last-failure-OST4#monitor_20000[lustre4] in instance_attributes: (unset) -> 1702968493 | from lustre-mds2 with no write delay
Dec 19 09:48:13 lustre-mgs.ntslab.ru pacemaker-attrd [3833060] (update_attr_on_host) notice: Setting fail-count-OST4#monitor_20000[lustre4] in instance_attributes: (unset) -> 1 | from lustre-mds2 with no write delay
Dec 19 09:48:13 lustre-mgs.ntslab.ru pacemaker-controld [3833062] (update_peer_state_iter) notice: Node lustre4 state is now lost | nodeid=0 previous=member source=handle_remote_state
Dec 19 09:48:13 lustre-mgs.ntslab.ru pacemaker-attrd [3833060] (attrd_peer_remove) notice: Removing all lustre4 attributes for peer lustre-mds1
Dec 19 09:48:13 lustre-mgs.ntslab.ru pacemaker-controld [3833062] (peer_update_callback) info: Remote node lustre4 is now lost (was member)
Dec 19 09:48:13 lustre-mgs.ntslab.ru pacemaker-attrd [3833060] (reap_crm_member) info: No peers with id=0 and/or uname=lustre4 to purge from the membership cache
Dec 19 09:48:13 lustre-mgs.ntslab.ru pacemaker-based [3833057] (log_info) info: + /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='lustre4']/lrm_rsc_op[@id='lustre4_last_0']: @operation_key=lustre4_stop_0, @operation=stop, @transition-key=9:5:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390, @transition-magic=-1:193;9:5:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390, @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1702968493
Dec 19 09:48:13 lustre-mgs.ntslab.ru pacemaker-based [3833057] (cib_process_request) info: Completed cib_delete operation for section //node_state[@uname='lustre4']/transient_attributes: OK (rc=0, origin=lustre-mds1/crmd/352, version=0.467.217)
Dec 19 09:48:13 lustre-mgs.ntslab.ru pacemaker-based [3833057] (log_info) info: + /cib/status/node_state[@id='lustre4']: @in_ccm=false, @crm-debug-origin=remote_node_down

>>>> The logs are then silent for 1 MIN, and there is nothing regarding "OST4" later in the logs!
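To see what the scheduler actually decided during that silent minute, I plan to replay the saved scheduler input from the DC, roughly like this (the pengine directory is the default path here, and NNN is a placeholder for the pe-input number referenced in the schedulerd log):

# ls -t /var/lib/pacemaker/pengine/pe-input-*.bz2 | head
# crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-input-NNN.bz2 --show-scores

The lustre-mgs log only resumes a minute later: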
Dec 19 09:49:14 lustre-mgs.ntslab.ru pacemaker-based [3833057] (cib_process_request) info: Completed cib_delete operation for section /cib/status/node_state[@uname='lustre-mds1']/lrm/lrm_resources/lrm_resource[@id='lustre4']/lrm_rsc_op[@id='lustre4_last_failure_0']: OK (rc=0, origin=lustre-mds1/crmd/367, version=0.467.221)
Dec 19 09:49:14 lustre-mgs.ntslab.ru pacemaker-based [3833057] (log_info) info: -- /cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']/nvpair[@id='status-2-last-failure-lustre4.monitor_5000']
Dec 19 09:49:14 lustre-mgs.ntslab.ru pacemaker-based [3833057] (log_info) info: -- /cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']/nvpair[@id='status-2-fail-count-lustre4.monitor_5000']

... perhaps the lustre4 RA was running on the lustre-mds1 cluster node, so below is its log (the same 1-minute silence follows the node failure):

Dec 19 09:48:13 lustre-mds1.ntslab.ru pacemaker-controld [2457591] (monitor_timeout_cb) info: Timed out waiting for remote poke response from lustre4
Dec 19 09:48:13 lustre-mds1.ntslab.ru pacemaker-controld [2457591] (log_executor_event) error: Result of monitor operation for lustre4 on lustre-mds1: Timed Out after 10s (Remote executor did not respond) | graph action unconfirmed; call=8 key=lustre4_monitor_5000
Dec 19 09:48:13 lustre-mds1.ntslab.ru pacemaker-controld [2457591] (remote_lrm_op_callback) error: Lost connection to Pacemaker Remote node lustre4
Dec 19 09:48:13 lustre-mds1.ntslab.ru pacemaker-controld [2457591] (log_executor_event) error: Result of monitor operation for OST4 on lustre4: Internal communication failure (Action was pending when executor connection was dropped) | graph action confirmed; call=28122 key=OST4_monitor_20000
Dec 19 09:48:13 lustre-mds1.ntslab.ru pacemaker-based [2457586] (log_info) info: ++ /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='lustre4']: <lrm_rsc_op id="lustre4_last_failure_0" operation_key="lustre4_monitor_5000" operation="monitor" crm-debug-origin="controld_update_resource_history" crm_feature_set="3.17.4" transition-key="48:2:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" transition-magic="2:1;48:2:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" exit-reason="Remote executor di
Dec 19 09:48:13 lustre-mds1.ntslab.ru pacemaker-based [2457586] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/lrm[@id='lustre4']/lrm_resources/lrm_resource[@id='OST4']: <lrm_rsc_op id="OST4_last_failure_0" operation_key="OST4_monitor_20000" operation="monitor" crm-debug-origin="controld_update_resource_history" crm_feature_set="3.17.4" transition-key="52:2:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" transition-magic="8:1;52:2:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" exit-reason="Action was pen
Dec 19 09:48:13 lustre-mds1.ntslab.ru pacemaker-attrd [2457589] (update_attr_on_host) notice: Setting last-failure-OST4#monitor_20000[lustre4] in instance_attributes: (unset) -> 1702968493 | from lustre-mds2 with no write delay
Dec 19 09:48:13 lustre-mds1.ntslab.ru pacemaker-attrd [2457589] (attrd_write_attribute) info: Sent CIB request 9 with 1 change for last-failure-OST4#monitor_20000 (id n/a, set n/a)
Dec 19 09:48:13 lustre-mds1.ntslab.ru pacemaker-attrd [2457589] (update_attr_on_host) notice: Setting fail-count-OST4#monitor_20000[lustre4] in instance_attributes: (unset) -> 1 | from lustre-mds2 with no write delay
Dec 19 09:48:13 lustre-mds1.ntslab.ru pacemaker-attrd [2457589] (attrd_write_attribute) info: Sent CIB request 10 with 1 change for fail-count-OST4#monitor_20000 (id n/a, set n/a)
Dec 19 09:48:13 lustre-mds1.ntslab.ru pacemaker-based [2457586] (log_info) info: ++ <nvpair id="status-lustre4-last-failure-OST4.monitor_20000" name="last-failure-OST4#monitor_20000" value="1702968493"/>
Dec 19 09:48:13 lustre-mds1.ntslab.ru pacemaker-attrd [2457589] (attrd_cib_callback) info: CIB update 9 result for last-failure-OST4#monitor_20000: OK | rc=0
Dec 19 09:48:13 lustre-mds1.ntslab.ru pacemaker-attrd [2457589] (attrd_cib_callback) info: * last-failure-OST4#monitor_20000[lustre4]=1702968493
Dec 19 09:48:13 lustre-mds1.ntslab.ru pacemaker-based [2457586] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/transient_attributes[@id='lustre4']/instance_attributes[@id='status-lustre4']: <nvpair id="status-lustre4-fail-count-OST4.monitor_20000" name="fail-count-OST4#monitor_20000" value="1"/>
Dec 19 09:48:13 lustre-mds1.ntslab.ru pacemaker-attrd [2457589] (attrd_cib_callback) info: CIB update 10 result for fail-count-OST4#monitor_20000: OK | rc=0
Dec 19 09:48:13 lustre-mds1.ntslab.ru pacemaker-attrd [2457589] (attrd_cib_callback) info: * fail-count-OST4#monitor_20000[lustre4]=1

Again, the OST4 resource is not mentioned except for the first seconds of the failure, and there are no logged attempts to restart it elsewhere. The last pacemaker.log is from the 3rd cluster node - the same 1-minute silence from 09:48:13 to 09:49:14, but this time with more entries regarding OST4:

[root@lustre-mds2 ~]# grep OST4 /var/log/pacemaker/pacemaker.log
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-based [785103] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/lrm[@id='lustre4']/lrm_resources/lrm_resource[@id='OST4']: <lrm_rsc_op id="OST4_last_failure_0" operation_key="OST4_monitor_20000" operation="monitor" crm-debug-origin="controld_update_resource_history" crm_feature_set="3.17.4" transition-key="52:2:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" transition-magic="8:1;52:2:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" exit-reason="Action was pend
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-controld [785108] (abort_transition_graph) info: Transition 3 aborted by operation OST4_monitor_20000 'create' on lustre-mds1: Change in recurring result | magic=8:1;52:2:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390 cib=0.467.208 source=process_graph_event:500 complete=true
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-controld [785108] (update_failcount) info: Updating failcount for OST4 on lustre4 after failed monitor: rc=1 (update=value++, time=1702968493)
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-attrd [785106] (handle_value_expansion) info: Expanded fail-count-OST4#monitor_20000=value++ to 1
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-controld [785108] (process_graph_event) notice: Transition 2 action 52 (OST4_monitor_20000 on lustre-mds1): expected 'ok' but got 'error' | target-rc=0 rc=1 call-id=28122 event='arrived after initial scheduling'
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-attrd [785106] (update_attr_on_host) notice: Setting last-failure-OST4#monitor_20000[lustre4] in instance_attributes: (unset) -> 1702968493 | from lustre-mds2 with no write delay
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-attrd [785106] (update_attr_on_host) notice: Setting fail-count-OST4#monitor_20000[lustre4] in instance_attributes: (unset) -> 1 | from lustre-mds2 with no write delay
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-based [785103] (log_info) info: ++ <nvpair id="status-lustre4-last-failure-OST4.monitor_20000" name="last-failure-OST4#monitor_20000" value="1702968493"/>
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-based [785103] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/transient_attributes[@id='lustre4']/instance_attributes[@id='status-lustre4']: <nvpair id="status-lustre4-fail-count-OST4.monitor_20000" name="fail-count-OST4#monitor_20000" value="1"/>
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-controld [785108] (abort_transition_graph) info: Transition 3 aborted by status-lustre4-fail-count-OST4.monitor_20000 doing create fail-count-OST4#monitor_20000=1: Transient attribute change | cib=0.467.213 source=abort_unless_down:297 path=/cib/status/node_state[@id='lustre4']/transient_attributes[@id='lustre4']/instance_attributes[@id='status-lustre4'] complete=true
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:37:12 2023 | exit-status=1 id=OST4_last_failure_0
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__threshold_reached) info: OST4 can fail 999999 more times on lustre4 before reaching migration threshold (1000000)
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (update_resource_action_runnable) warning: OST4_stop_0 on lustre4 is unrunnable (node is offline)
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre3
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Stop OST4 ( lustre4 ) blocked
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__create_graph) crit: Cannot fence lustre4 because of OST4: blocked (OST4_stop_0)
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:37:12 2023 | exit-status=1 id=OST4_last_failure_0
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__threshold_reached) info: OST4 can fail 999999 more times on lustre4 before reaching migration threshold (1000000)
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (update_resource_action_runnable) warning: OST4_stop_0 on lustre4 is unrunnable (node is offline)
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre3
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Stop OST4 ( lustre4 ) blocked
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__create_graph) crit: Cannot fence lustre4 because of OST4: blocked (OST4_stop_0)
Dec 19 09:49:14 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4
Dec 19 09:49:14 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4
Dec 19 09:49:14 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Start OST4 ( lustre4 )
Dec 19 09:49:17 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:37:12 2023 | exit-status=1 id=OST4_last_failure_0
Dec 19 09:49:17 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4
Dec 19 09:49:17 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4
Dec 19 09:49:17 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4)
Dec 19 09:49:17 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:37:12 2023 | exit-status=1 id=OST4_last_failure_0
Dec 19 09:49:17 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4
Dec 19 09:49:17 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4
Dec 19 09:49:17 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4)
Dec 19 09:49:22 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:37:12 2023 | exit-status=1 id=OST4_last_failure_0
Dec 19 09:49:22 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4
Dec 19 09:49:22 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4
Dec 19 09:49:22 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4)
Dec 19 09:49:22 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:37:12 2023 | exit-status=1 id=OST4_last_failure_0
Dec 19 09:49:22 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4
Dec 19 09:49:22 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4
Dec 19 09:49:22 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4)
Dec 19 09:49:27 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:37:12 2023 | exit-status=1 id=OST4_last_failure_0
Dec 19 09:49:27 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4
Dec 19 09:49:27 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4
Dec 19 09:49:27 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4)
Dec 19 09:49:27 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:37:12 2023 | exit-status=1 id=OST4_last_failure_0
Dec 19 09:49:27 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4
Dec 19 09:49:27 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (update_resource_action_runnable) warning: OST4_stop_0 on lustre4 is unrunnable (node is offline)
Dec 19 09:49:27 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre3
Dec 19 09:49:27 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Stop OST4 ( lustre4 ) blocked
Dec 19 09:49:27 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__create_graph) crit: Cannot fence lustre4 because of OST4: blocked (OST4_stop_0)
Dec 19 09:50:18 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4
Dec 19 09:50:18 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4
Dec 19 09:50:18 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Start OST4 ( lustre4 )
Dec 19 09:50:21 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:37:12 2023 | exit-status=1 id=OST4_last_failure_0
Dec 19 09:50:21 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4
Dec 19 09:50:21 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4
Dec 19 09:50:21 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4)
Dec 19 09:50:21 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:37:12 2023 | exit-status=1 id=OST4_last_failure_0

Do I need to set migration-threshold=1? None of the basic guides mention it - why, if it might be critical for failover?
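For reference, this is roughly how I intend to double-check the recorded fail count and the scheduler's view of OST4 (command spellings are from memory, and crm_resource --why needs a reasonably recent Pacemaker):

# crm_failcount --query --resource OST4 --node lustre4
# crm_resource --why --resource OST4
# pcs resource config OST4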
OK, let's try it this way:

[root@lustre-mgs ~]# pcs resource update OST3 meta migration-threshold=1 failure-timeout=180s
Warning: Agent 'ocf:lustre:Lustre' implements unsupported OCF version '1.0.1', supported versions are: '1.0', '1.1'; assumed version '1.0'
[root@lustre-mgs ~]# pcs resource update OST4 meta migration-threshold=1 failure-timeout=180s
Warning: Agent 'ocf:lustre:Lustre' implements unsupported OCF version '1.0.1', supported versions are: '1.0', '1.1'; assumed version '1.0'

At 10:26:01 the VM lustre4 was powered OFF. The result is the same - OST4 didn't restart on lustre3.

[root@lustre-mgs ~]# grep OST4 /var/log/pacemaker/pacemaker.log
Dec 19 10:26:13 lustre-mgs.ntslab.ru pacemaker-based [3833057] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/lrm[@id='lustre4']/lrm_resources/lrm_resource[@id='OST4']: <lrm_rsc_op id="OST4_last_failure_0" operation_key="OST4_monitor_20000" operation="monitor" crm-debug-origin="controld_update_resource_history" crm_feature_set="3.17.4" transition-key="42:48:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" transition-magic="8:1;42:48:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" exit-reason="Action was pe
Dec 19 10:26:13 lustre-mgs.ntslab.ru pacemaker-attrd [3833060] (update_attr_on_host) notice: Setting last-failure-OST4#monitor_20000[lustre4] in instance_attributes: (unset) -> 1702970773 | from lustre-mds2 with no write delay
Dec 19 10:26:13 lustre-mgs.ntslab.ru pacemaker-attrd [3833060] (update_attr_on_host) notice: Setting fail-count-OST4#monitor_20000[lustre4] in instance_attributes: (unset) -> 1 | from lustre-mds2 with no write delay
Dec 19 10:26:13 lustre-mgs.ntslab.ru pacemaker-based [3833057] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/transient_attributes[@id='lustre4']/instance_attributes[@id='status-lustre4']: <nvpair id="status-lustre4-last-failure-OST4.monitor_20000" name="last-failure-OST4#monitor_20000" value="1702970773"/>
Dec 19 10:26:13 lustre-mgs.ntslab.ru pacemaker-based [3833057] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/transient_attributes[@id='lustre4']/instance_attributes[@id='status-lustre4']: <nvpair id="status-lustre4-fail-count-OST4.monitor_20000" name="fail-count-OST4#monitor_20000" value="1"/>

[root@lustre-mds1 ~]# grep OST4 /var/log/pacemaker/pacemaker.log
Dec 19 10:26:13 lustre-mds1.ntslab.ru pacemaker-controld [2457591] (log_executor_event) error: Result of monitor operation for OST4 on lustre4: Internal communication failure (Action was pending when executor connection was dropped) | graph action confirmed; call=74 key=OST4_monitor_20000
Dec 19 10:26:13 lustre-mds1.ntslab.ru pacemaker-based [2457586] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/lrm[@id='lustre4']/lrm_resources/lrm_resource[@id='OST4']: <lrm_rsc_op id="OST4_last_failure_0" operation_key="OST4_monitor_20000" operation="monitor" crm-debug-origin="controld_update_resource_history" crm_feature_set="3.17.4" transition-key="42:48:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" transition-magic="8:1;42:48:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" exit-reason="Action was p
Dec 19 10:26:13 lustre-mds1.ntslab.ru pacemaker-attrd [2457589] (update_attr_on_host) notice: Setting last-failure-OST4#monitor_20000[lustre4] in instance_attributes: (unset) -> 1702970773 | from lustre-mds2 with no write delay
Dec 19 10:26:13 lustre-mds1.ntslab.ru pacemaker-attrd [2457589] (attrd_write_attribute) info: Sent CIB request 80 with 1 change for last-failure-OST4#monitor_20000 (id n/a, set n/a)
Dec 19 10:26:13 lustre-mds1.ntslab.ru pacemaker-attrd [2457589]
(update_attr_on_host) notice: Setting fail-count-OST4#monitor_20000[lustre4] in instance_attributes: (unset) -> 1 | from lustre-mds2 with no write delay Dec 19 10:26:13 lustre-mds1.ntslab.ru pacemaker-attrd [2457589] (attrd_write_attribute) info: Sent CIB request 81 with 1 change for fail-count-OST4#monitor_20000 (id n/a, set n/a) Dec 19 10:26:13 lustre-mds1.ntslab.ru pacemaker-based [2457586] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/transient_attributes[@id='lustre4']/instance_attributes[@id='status-lustre4']: <nvpair id="status-lustre4-last-failure-OST4.monitor_20000" name="last-failure-OST4#monitor_20000" value="1702970773"/> Dec 19 10:26:13 lustre-mds1.ntslab.ru pacemaker-attrd [2457589] (attrd_cib_callback) info: CIB update 80 result for last-failure-OST4#monitor_20000: OK | rc=0 Dec 19 10:26:13 lustre-mds1.ntslab.ru pacemaker-attrd [2457589] (attrd_cib_callback) info: * last-failure-OST4#monitor_20000[lustre4]=1702970773 Dec 19 10:26:13 lustre-mds1.ntslab.ru pacemaker-based [2457586] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/transient_attributes[@id='lustre4']/instance_attributes[@id='status-lustre4']: <nvpair id="status-lustre4-fail-count-OST4.monitor_20000" name="fail-count-OST4#monitor_20000" value="1"/> Dec 19 10:26:13 lustre-mds1.ntslab.ru pacemaker-attrd [2457589] (attrd_cib_callback) info: CIB update 81 result for fail-count-OST4#monitor_20000: OK | rc=0 Dec 19 10:26:13 lustre-mds1.ntslab.ru pacemaker-attrd [2457589] (attrd_cib_callback) info: * fail-count-OST4#monitor_20000[lustre4]=1 [root@lustre-mds2 ~]# grep OST4 /var/log/pacemaker/pacemaker.log Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-based [785103] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/lrm[@id='lustre4']/lrm_resources/lrm_resource[@id='OST4']: <lrm_rsc_op id="OST4_last_failure_0" operation_key="OST4_monitor_20000" operation="monitor" crm-debug-origin="controld_update_resource_history" crm_feature_set="3.17.4" transition-key="42:48:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" transition-magic="8:1;42:48:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" exit-reason="Action was pe Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-controld [785108] (abort_transition_graph) info: Transition 54 aborted by operation OST4_monitor_20000 'create' on lustre-mds1: Change in recurring result | magic=8:1;42:48:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390 cib=0.469.3 source=process_graph_event:500 complete=true Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-controld [785108] (update_failcount) info: Updating failcount for OST4 on lustre4 after failed monitor: rc=1 (update=value++, time=1702970773) Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-attrd [785106] (handle_value_expansion) info: Expanded fail-count-OST4#monitor_20000=value++ to 1 Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-controld [785108] (process_graph_event) notice: Transition 48 action 42 (OST4_monitor_20000 on lustre-mds1): expected 'ok' but got 'error' | target-rc=0 rc=1 call-id=74 event='arrived after initial scheduling' Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-attrd [785106] (update_attr_on_host) notice: Setting last-failure-OST4#monitor_20000[lustre4] in instance_attributes: (unset) -> 1702970773 | from lustre-mds2 with no write delay Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-attrd [785106] (update_attr_on_host) notice: Setting fail-count-OST4#monitor_20000[lustre4] in instance_attributes: (unset) -> 1 | from lustre-mds2 with no write delay Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-based [785103] 
(log_info) info: ++ /cib/status/node_state[@id='lustre4']/transient_attributes[@id='lustre4']/instance_attributes[@id='status-lustre4']: <nvpair id="status-lustre4-last-failure-OST4.monitor_20000" name="last-failure-OST4#monitor_20000" value="1702970773"/> Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-based [785103] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/transient_attributes[@id='lustre4']/instance_attributes[@id='status-lustre4']: <nvpair id="status-lustre4-fail-count-OST4.monitor_20000" name="fail-count-OST4#monitor_20000" value="1"/> Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-controld [785108] (abort_transition_graph) info: Transition 54 aborted by status-lustre4-last-failure-OST4.monitor_20000 doing create last-failure-OST4#monitor_20000=1702970773: Transient attribute change | cib=0.469.8 source=abort_unless_down:297 path=/cib/status/node_state[@id='lustre4']/transient_attributes[@id='lustre4']/instance_attributes[@id='status-lustre4'] complete=true Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-controld [785108] (abort_transition_graph) info: Transition 54 aborted by status-lustre4-fail-count-OST4.monitor_20000 doing create fail-count-OST4#monitor_20000=1: Transient attribute change | cib=0.469.9 source=abort_unless_down:297 path=/cib/status/node_state[@id='lustre4']/transient_attributes[@id='lustre4']/instance_attributes[@id='status-lustre4'] complete=true Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4 Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4 Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4 Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4 Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0 Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4 Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4 Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4 Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__threshold_reached) warning: OST4 cannot run on lustre4 due to reaching migration threshold (clean up resource to allow again)| failures=1 migration-threshold=1 Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4 Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (update_resource_action_runnable) warning: OST4_stop_0 on lustre4 is unrunnable (node is offline) Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre3 Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Stop OST4 ( lustre4 ) blocked Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__create_graph) crit: Cannot fence lustre4 because of OST4: blocked (OST4_stop_0) Dec 19 10:26:13 
lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4 Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4 Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4 Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4 Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0 Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4 Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4 Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4 Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__threshold_reached) warning: OST4 cannot run on lustre4 due to reaching migration threshold (clean up resource to allow again)| failures=1 migration-threshold=1 Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4 Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (update_resource_action_runnable) warning: OST4_stop_0 on lustre4 is unrunnable (node is offline) Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre3 Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Stop OST4 ( lustre4 ) blocked Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__create_graph) crit: Cannot fence lustre4 because of OST4: blocked (OST4_stop_0) Dec 19 10:27:14 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4 Dec 19 10:27:14 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4 Dec 19 10:27:14 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Start OST4 ( lustre4 ) Dec 19 10:27:18 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0 Dec 19 10:27:18 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4 Dec 19 10:27:18 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4 Dec 19 10:27:18 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4) Dec 19 10:27:18 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0 Dec 19 10:27:18 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: 
Unassigning OST4 Dec 19 10:27:18 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4 Dec 19 10:27:18 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4) Dec 19 10:27:21 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0 Dec 19 10:27:21 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4 Dec 19 10:27:21 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4 Dec 19 10:27:21 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4) Dec 19 10:27:21 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0 Dec 19 10:27:21 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4 Dec 19 10:27:21 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4 Dec 19 10:27:21 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4) Dec 19 10:27:26 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0 Dec 19 10:27:26 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4 Dec 19 10:27:26 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4 Dec 19 10:27:26 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4) Dec 19 10:27:26 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0 Dec 19 10:27:26 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4 Dec 19 10:27:26 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (update_resource_action_runnable) warning: OST4_stop_0 on lustre4 is unrunnable (node is offline) Dec 19 10:27:26 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre3 Dec 19 10:27:26 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Stop OST4 ( lustre4 ) blocked Dec 19 10:27:26 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__create_graph) crit: Cannot fence lustre4 because of OST4: blocked (OST4_stop_0) Dec 19 10:28:19 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4 Dec 19 10:28:19 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor 
for OST4 on lustre4 Dec 19 10:28:19 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Start OST4 ( lustre4 ) Dec 19 10:28:22 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0 Dec 19 10:28:22 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4 Dec 19 10:28:22 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4 Dec 19 10:28:22 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4) Dec 19 10:28:22 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0 Dec 19 10:28:22 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4 Dec 19 10:28:22 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4 Dec 19 10:28:22 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4) Dec 19 10:28:22 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4 Dec 19 10:28:22 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4 Dec 19 10:28:22 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Start OST4 ( lustre4 ) Dec 19 10:28:26 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0 Dec 19 10:28:26 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4 Dec 19 10:28:26 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4 Dec 19 10:28:26 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4) Dec 19 10:28:26 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0 Dec 19 10:28:26 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4 Dec 19 10:28:26 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (update_resource_action_runnable) warning: OST4_stop_0 on lustre4 is unrunnable (node is offline) Dec 19 10:28:26 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre3 Dec 19 10:28:26 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Stop OST4 ( lustre4 ) blocked Dec 19 10:28:26 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__create_graph) crit: Cannot fence lustre4 because of OST4: blocked (OST4_stop_0) 
Dec 19 10:28:27 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4
Dec 19 10:28:27 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4
Dec 19 10:28:27 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Start OST4 ( lustre4 )
Dec 19 10:28:30 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0
Dec 19 10:28:30 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4
Dec 19 10:28:30 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4
Dec 19 10:28:30 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4)
Dec 19 10:28:31 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0
Dec 19 10:28:31 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4

Sorry for so many log lines, but I don't understand what's going on.

best regards,
Artem

On Tue, 19 Dec 2023 at 00:13, Ken Gaillot <kgail...@redhat.com> wrote:
> On Mon, 2023-12-18 at 23:39 +0300, Artem wrote: > > Hello experts. > > > > I previously played with a dummy resource and it worked as expected. > > Now I'm switching to a Lustre OST resource and cannot make it. > > Neither can I understand. > > > > > > ### Initial setup: > > # pcs resource defaults update resource-stickness=110 > > # for i in {1..4}; do pcs cluster node add-remote lustre$i > > reconnect_interval=60; done > > # for i in {1..4}; do pcs constraint location lustre$i prefers > > lustre-mgs lustre-mds1 lustre-mds2; done > > # pcs resource create OST3 ocf:lustre:Lustre target=/dev/disk/by- > > id/wwn-0x6000c291b7f7147f826bb95153e2eaca mountpoint=/lustre/oss3 > > # pcs resource create OST4 ocf:lustre:Lustre target=/dev/disk/by- > > id/wwn-0x6000c292c41eaae60bccdd3a752913b3 mountpoint=/lustre/oss4 > > (I also tried ocf:heartbeat:Filesystem device=... directory=...
> > fstype=lustre force_unmount=safe --> same behavior) > > > > # pcs constraint location OST3 prefers lustre3=100 > > # pcs constraint location OST3 prefers lustre4=100 > > # pcs constraint location OST4 prefers lustre3=100 > > # pcs constraint location OST4 prefers lustre4=100 > > # for i in lustre-mgs lustre-mds1 lustre-mds2 lustre{1..2}; do pcs > > constraint location OST3 avoids $i; done > > # for i in lustre-mgs lustre-mds1 lustre-mds2 lustre{1..2}; do pcs > > constraint location OST4 avoids $i; done > > > > ### Checking all is good > > # crm_simulate --simulate --live-check --show-scores > > pcmk__primitive_assign: OST4 allocation score on lustre3: 100 > > pcmk__primitive_assign: OST4 allocation score on lustre4: 210 > > # pcs status > > * OST3 (ocf::lustre:Lustre): Started lustre3 > > * OST4 (ocf::lustre:Lustre): Started lustre4 > > > > ### VM with lustre4 (OST4) is OFF > > > > # crm_simulate --simulate --live-check --show-scores > > pcmk__primitive_assign: OST4 allocation score on lustre3: 100 > > pcmk__primitive_assign: OST4 allocation score on lustre4: 100 > > Start OST4 ( lustre3 ) > > Resource action: OST4 start on lustre3 > > Resource action: OST4 monitor=20000 on lustre3 > > # pcs status > > * OST3 (ocf::lustre:Lustre): Started lustre3 > > * OST4 (ocf::lustre:Lustre): Stopped > > > > 1) I see crm_simulate guesed that it has to restart failed OST4 on > > lustre3. After making such decision I suspect it evaluates 100:100 > > scores of both lustre3 and lustre4, but lustre3 is already running a > > service. So it decides to run OST4 again on lustre4, which is failed. > > Thus it cannot restart on surviving nodes. Right? > > No. I'd start with figuring out this case. There's no reason, given the > configuration above, why OST4 would be stopped. In fact the simulation > shows it should be started, so that suggests that maybe the actual > start failed. > > Do the logs show any errors around this time? 
> > > 2) Ok, let's try not to give specific score - nothing changed, see > > below: > > ### did remove old constraints; clear all resources; cleanup all > > resources; cluster stop; cluster start > > > > # pcs constraint location OST3 prefers lustre3 lustre4 > > # pcs constraint location OST4 prefers lustre3 lustre4 > > # for i in lustre-mgs lustre-mds1 lustre-mds2 lustre{1..2}; do pcs > > constraint location OST3 avoids $i; done > > # for i in lustre-mgs lustre-mds1 lustre-mds2 lustre{1..2}; do pcs > > constraint location OST4 avoids $i; done > > # crm_simulate --simulate --live-check --show-scores > > pcmk__primitive_assign: OST4 allocation score on lustre3: INFINITY > > pcmk__primitive_assign: OST4 allocation score on lustre4: INFINITY > > # pcs status > > * OST3 (ocf::lustre:Lustre): Started lustre3 > > * OST4 (ocf::lustre:Lustre): Started lustre4 > > > > ### VM with lustre4 (OST4) is OFF > > > > # crm_simulate --simulate --live-check --show-scores > > pcmk__primitive_assign: OST4 allocation score on lustre3: INFINITY > > pcmk__primitive_assign: OST4 allocation score on lustre4: INFINITY > > Start OST4 ( lustre3 ) > > Resource action: OST4 start on lustre3 > > Resource action: OST4 monitor=20000 on lustre3 > > # pcs status > > * OST3 (ocf::lustre:Lustre): Started lustre3 > > * OST4 (ocf::lustre:Lustre): Stopped > > > > 3) Ok lets try to set different scores with preference to nodes and > > affect it with pingd: > > ### did remove old constraints; clear all resources; cleanup all > > resources; cluster stop; cluster start > > > > # pcs constraint location OST3 prefers lustre3=100 > > # pcs constraint location OST3 prefers lustre4=90 > > # pcs constraint location OST4 prefers lustre3=90 > > # pcs constraint location OST4 prefers lustre4=100 > > # for i in lustre-mgs lustre-mds1 lustre-mds2 lustre{1..2}; do pcs > > constraint location OST3 avoids $i; done > > # for i in lustre-mgs lustre-mds1 lustre-mds2 lustre{1..2}; do pcs > > constraint location OST4 avoids $i; done > > # pcs resource create ping ocf:pacemaker:ping dampen=5s > > host_list=192.168.34.250 op monitor interval=3s timeout=7s meta > > target-role="started" globally-unique="false" clone > > # for i in lustre-mgs lustre-mds{1..2} lustre{1..4}; do pcs > > constraint location ping-clone prefers $i; done > > # pcs constraint location OST3 rule score=0 pingd lt 1 or not_defined > > pingd > > # pcs constraint location OST4 rule score=0 pingd lt 1 or not_defined > > pingd > > # pcs constraint location OST3 rule score=125 defined pingd > > # pcs constraint location OST4 rule score=125 defined pingd > > > > ### same home base: > > # crm_simulate --simulate --live-check --show-scores > > pcmk__primitive_assign: OST4 allocation score on lustre3: 90 > > pcmk__primitive_assign: OST4 allocation score on lustre4: 210 > > # pcs status > > * OST3 (ocf::lustre:Lustre): Started lustre3 > > * OST4 (ocf::lustre:Lustre): Started lustre4 > > > > ### VM with lustre4 (OST4) is OFF. > > > > # crm_simulate --simulate --live-check --show-scores > > pcmk__primitive_assign: OST4 allocation score on lustre3: 90 > > pcmk__primitive_assign: OST4 allocation score on lustre4: 100 > > Start OST4 ( lustre3 ) > > Resource action: OST4 start on lustre3 > > Resource action: OST4 monitor=20000 on lustre3 > > # pcs status > > * OST3 (ocf::lustre:Lustre): Started lustre3 > > * OST4 (ocf::lustre:Lustre): Stopped > > > > Again lustre3 seems unable to overrule due to lower score and pingd > > DOESN'T help at all! 
> > > > > > 4) Can I make a reliable HA failover without pingd to keep things as > > simple as possible? > > 5) Pings might help to affect cluster decisions in case GW is lost, > > but its not working as all the guides say. Why? > > > > > > Thanks in advance, > > Artem > > _______________________________________________ > > Manage your subscription: > > https://lists.clusterlabs.org/mailman/listinfo/users > > > > ClusterLabs home: https://www.clusterlabs.org/ > -- > Ken Gaillot <kgail...@redhat.com> > > _______________________________________________ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ >
_______________________________________________ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/