On Wed, 2023-11-29 at 12:56 +0300, Artem wrote:
> Hello,
>
> I deployed a Lustre cluster with 3 nodes (metadata) as
> pacemaker/corosync and 4 nodes as Remote Agents (for data). Initially
> all went well: I set up the MGS and MDS resources, checked failover
> and failback, and the remote agents were online.
>
> Then I tried to create a resource for OST on two nodes which are
> remote agents. I also set a location constraint preference for them,
> a colocation constraint (OST1 and OST2, score=-50), and an ordering
> constraint (MDS then OST[12]). Then I read that colocation and
> ordering constraints should not be used for RA, so I deleted these
> constraints. At some stage I used reconnect_interval=5s, but then
> found a bug report advising to set it higher, so I reverted to the
> defaults.
>
> Only then did I check pcs status and notice that the RA were Offline.
> I tried to remove the RA and add them again, restart the cluster,
> destroy and recreate it, and reboot the nodes - nothing helped: from
> the very beginning of cluster setup the agents were persistently
> RemoteOFFLINE, even before the OST resource was created and
> preferentially located on the RA (lustre1 and lustre2). I found
> nothing helpful in /var/log/pacemaker/pacemaker.log. Please help me
> investigate and fix it.
>
> [root@lustre-mgs ~]# rpm -qa | grep -E "corosync|pacemaker|pcs"
> pacemaker-cli-2.1.6-8.el8.x86_64
> pacemaker-schemas-2.1.6-8.el8.noarch
> pcs-0.10.17-2.el8.x86_64
> pacemaker-libs-2.1.6-8.el8.x86_64
> corosync-3.1.7-1.el8.x86_64
> pacemaker-cluster-libs-2.1.6-8.el8.x86_64
> pacemaker-2.1.6-8.el8.x86_64
> corosynclib-3.1.7-1.el8.x86_64
>
> [root@lustre-mgs ~]# ssh lustre1 "rpm -qa | grep resource-agents"
> resource-agents-4.9.0-49.el8.x86_64
>
> [root@lustre-mgs ~]# pcs status
> Cluster name: cl-lustre
> Cluster Summary:
>   * Stack: corosync (Pacemaker is running)
>   * Current DC: lustre-mds1 (version 2.1.6-8.el8-6fdc9deea29) -
>     partition with quorum
>   * Last updated: Wed Nov 29 12:40:37 2023 on lustre-mgs
>   * Last change: Wed Nov 29 12:11:21 2023 by root via cibadmin on
>     lustre-mgs
>   * 7 nodes configured
>   * 6 resource instances configured
> Node List:
>   * Online: [ lustre-mds1 lustre-mds2 lustre-mgs ]
>   * RemoteOFFLINE: [ lustre1 lustre2 lustre3 lustre4 ]
> Full List of Resources:
>   * lustre2 (ocf::pacemaker:remote): Stopped
>   * lustre3 (ocf::pacemaker:remote): Stopped
>   * lustre4 (ocf::pacemaker:remote): Stopped
>   * lustre1 (ocf::pacemaker:remote): Stopped
>   * MGT (ocf::heartbeat:Filesystem): Started lustre-mgs
>   * MDT00 (ocf::heartbeat:Filesystem): Started lustre-mds1
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
>
> [root@lustre-mgs ~]# pcs cluster verify --full
> [root@lustre-mgs ~]#
>
> [root@lustre-mgs ~]# pcs constraint show --full
> Warning: This command is deprecated and will be removed. Please use
> 'pcs constraint config' instead.
> Location Constraints:
>   Resource: MDT00
>     Enabled on:
>       Node: lustre-mds1 (score:100) (id:location-MDT00-lustre-mds1-100)
>       Node: lustre-mds2 (score:100) (id:location-MDT00-lustre-mds2-100)
>   Resource: MGT
>     Enabled on:
>       Node: lustre-mgs (score:100) (id:location-MGT-lustre-mgs-100)
>       Node: lustre-mds2 (score:50) (id:location-MGT-lustre-mds2-50)
> Ordering Constraints:
>   start MGT then start MDT00 (kind:Optional) (id:order-MGT-MDT00-Optional)
> Colocation Constraints:
> Ticket Constraints:
>
> [root@lustre-mgs ~]# pcs resource show lustre1
> Warning: This command is deprecated and will be removed. Please use
> 'pcs resource config' instead.
> Resource: lustre1 (class=ocf provider=pacemaker type=remote)
>   Attributes: lustre1-instance_attributes
>     server=lustre1
>   Operations:
>     migrate_from: lustre1-migrate_from-interval-0s
>       interval=0s
>       timeout=60s
>     migrate_to: lustre1-migrate_to-interval-0s
>       interval=0s
>       timeout=60s
>     monitor: lustre1-monitor-interval-60s
>       interval=60s
>       timeout=30s
>     reload: lustre1-reload-interval-0s
>       interval=0s
>       timeout=60s
>     reload-agent: lustre1-reload-agent-interval-0s
>       interval=0s
>       timeout=60s
>     start: lustre1-start-interval-0s
>       interval=0s
>       timeout=60s
>     stop: lustre1-stop-interval-0s
>       interval=0s
>       timeout=60s
>
> I also changed some properties:
> pcs property set stonith-enabled=false
> pcs property set symmetric-cluster=false

Hi,

An asymmetric cluster requires that all resources be enabled on
particular nodes with location constraints. Since you don't have any
for your remote connections, they can't start anywhere.
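
For example, something along these lines should let the connection
resources start (a sketch only - the node placements and scores here
are illustrative, not a recommendation):

  # The ocf:pacemaker:remote connection resources run on full cluster
  # members, so enable them on those nodes (example scores):
  pcs constraint location lustre1 prefers lustre-mgs=100
  pcs constraint location lustre2 prefers lustre-mds1=100
  pcs constraint location lustre3 prefers lustre-mds2=100
  pcs constraint location lustre4 prefers lustre-mgs=100

With symmetric-cluster=false, every resource (including the OSTs you
add later) needs at least one such enabling constraint, because a
resource's default score on every node is -INFINITY.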

> pcs property set batch-limit=100
> pcs resource defaults update resource-stickness=1000
> pcs cluster config update
>
> [root@lustre-mgs ~]# ssh lustre1 "systemctl status pcsd pacemaker-remote resource-agents-deps.target"
> ● pcsd.service - PCS GUI and remote configuration interface
>    Loaded: loaded (/usr/lib/systemd/system/pcsd.service; enabled;
>            vendor preset: disabled)
>    Active: active (running) since Tue 2023-11-28 19:01:49 MSK; 17h ago
>      Docs: man:pcsd(8)
>            man:pcs(8)
>  Main PID: 1752 (pcsd)
>     Tasks: 1 (limit: 408641)
>    Memory: 28.0M
>    CGroup: /system.slice/pcsd.service
>            └─1752 /usr/libexec/platform-python -Es /usr/sbin/pcsd
> Nov 28 19:01:49 lustre1.ntslab.ru systemd[1]: Starting PCS GUI and
> remote configuration interface...
> Nov 28 19:01:49 lustre1.ntslab.ru systemd[1]: Started PCS GUI and
> remote configuration interface.
>
> ● pacemaker_remote.service - Pacemaker Remote executor daemon
>    Loaded: loaded (/usr/lib/systemd/system/pacemaker_remote.service;
>            enabled; vendor preset: disabled)
>    Active: active (running) since Wed 2023-11-29 11:08:14 MSK; 1h 37min ago
>      Docs: man:pacemaker-remoted
>            https://clusterlabs.org/pacemaker/doc/
>  Main PID: 3040 (pacemaker-remot)
>     Tasks: 1
>    Memory: 1.4M
>    CGroup: /system.slice/pacemaker_remote.service
>            └─3040 /usr/sbin/pacemaker-remoted
> Nov 29 11:08:14 lustre1.ntslab.ru systemd[1]: Started Pacemaker Remote
> executor daemon.
>
> ● resource-agents-deps.target - resource-agents dependencies
>    Loaded: loaded (/usr/lib/systemd/system/resource-agents-deps.target;
>            static; vendor preset: disabled)
>    Active: active since Tue 2023-11-28 19:01:47 MSK; 17h ago
>
> Attempt to re-add:
> [root@lustre-mgs ~]# date;pcs cluster node remove-remote lustre1
> Wed Nov 29 12:49:59 MSK 2023
> Requesting 'pacemaker_remote disable', 'pacemaker_remote stop' on 'lustre1'
> lustre1: successful run of 'pacemaker_remote disable'
> lustre1: successful run of 'pacemaker_remote stop'
> Requesting remove 'pacemaker authkey' from 'lustre1'
> lustre1: successful removal of the file 'pacemaker authkey'
> Deleting Resource - lustre1
> [root@lustre-mgs ~]# date;pcs cluster node add-remote lustre1
> Wed Nov 29 12:50:08 MSK 2023
> No addresses specified for host 'lustre1', using 'lustre1'
> Sending 'pacemaker authkey' to 'lustre1'
> lustre1: successful distribution of the file 'pacemaker authkey'
> Requesting 'pacemaker_remote enable', 'pacemaker_remote start' on 'lustre1'
> lustre1: successful run of 'pacemaker_remote enable'
> lustre1: successful run of 'pacemaker_remote start'
> [root@lustre-mgs ~]# date; pcs status
> Wed Nov 29 12:50:35 MSK 2023
> Cluster name: cl-lustre
> Cluster Summary:
>   * Stack: corosync (Pacemaker is running)
>   * Current DC: lustre-mds1 (version 2.1.6-8.el8-6fdc9deea29) -
>     partition with quorum
>   * Last updated: Wed Nov 29 12:50:35 2023 on lustre-mgs
>   * Last change: Wed Nov 29 12:50:11 2023 by root via cibadmin on
>     lustre-mgs
>   * 7 nodes configured
>   * 6 resource instances configured
> Node List:
>   * Online: [ lustre-mds1 lustre-mds2 lustre-mgs ]
>   * RemoteOFFLINE: [ lustre1 lustre2 lustre3 lustre4 ]
> Full List of Resources:
>   * lustre2 (ocf::pacemaker:remote): Stopped
>   * lustre3 (ocf::pacemaker:remote): Stopped
>   * lustre4 (ocf::pacemaker:remote): Stopped
>   * MGT (ocf::heartbeat:Filesystem): Started lustre-mgs
>   * MDT00 (ocf::heartbeat:Filesystem): Started lustre-mds1
>   * lustre1 (ocf::pacemaker:remote): Stopped
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
>
> [root@lustre-mgs ~]# grep lustre1 /var/log/pacemaker/pacemaker.log
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481] (cib_process_request) info: Forwarding cib_delete operation for section //primitive[@id='lustre1'] to all (origin=local/cibadmin/2)
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info) info: -- /cib/configuration/resources/primitive[@id='lustre1']
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481] (cib_process_request) info: Completed cib_delete operation for section //primitive[@id='lustre1']: OK (rc=0, origin=lustre-mgs/cibadmin/2, version=0.25.0)
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-fenced [2482] (stonith_device_remove) info: Device 'lustre1' not found (0 active devices)
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info) info: -- /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='lustre1']
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481] (cib_process_request) info: Completed cib_delete operation for section //node_state[@uname='lustre-mds1']/lrm/lrm_resources/lrm_resource[@id='lustre1']: OK (rc=0, origin=lustre-mds1/crmd/157, version=0.25.0)
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info) info: -- /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='lustre1']
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481] (cib_process_request) info: Completed cib_delete operation for section //node_state[@uname='lustre-mds1']/lrm/lrm_resources/lrm_resource[@id='lustre1']: OK (rc=0, origin=lustre-mds1/crmd/158, version=0.25.1)
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481] (cib_process_request) info: Forwarding cib_delete operation for section //node_state[@uname='lustre-mgs']/lrm/lrm_resources/lrm_resource[@id='lustre1'] to all (origin=local/crmd/39)
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info) info: -- /cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resource[@id='lustre1']
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481] (cib_process_request) info: Completed cib_delete operation for section //node_state[@uname='lustre-mds2']/lrm/lrm_resources/lrm_resource[@id='lustre1']: OK (rc=0, origin=lustre-mds2/crmd/35, version=0.25.1)
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info) info: -- /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='lustre1']
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481] (cib_process_request) info: Completed cib_delete operation for section //node_state[@uname='lustre-mgs']/lrm/lrm_resources/lrm_resource[@id='lustre1']: OK (rc=0, origin=lustre-mgs/crmd/39, version=0.25.1)
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-controld [2486] (delete_resource) info: Removing resource lustre1 from executor for tengine
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-controld [2486] (controld_delete_resource_history) info: Clearing resource history for lustre1 on lustre-mgs (via CIB call 40) | xpath=//node_state[@uname='lustre-mgs']/lrm/lrm_resources/lrm_resource[@id='lustre1']
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-controld [2486] (notify_deleted) info: Notifying tengine on lustre-mds1 that lustre1 was deleted
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info) info: -- /cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resource[@id='lustre1']
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481] (cib_process_request) info: Completed cib_delete operation for section //node_state[@uname='lustre-mds2']/lrm/lrm_resources/lrm_resource[@id='lustre1']: OK (rc=0, origin=lustre-mds2/crmd/36, version=0.25.2)
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481] (cib_process_request) info: Forwarding cib_delete operation for section //node_state[@uname='lustre-mgs']/lrm/lrm_resources/lrm_resource[@id='lustre1'] to all (origin=local/crmd/40)
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info) info: -- /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='lustre1']
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481] (cib_process_request) info: Completed cib_delete operation for section //node_state[@uname='lustre-mgs']/lrm/lrm_resources/lrm_resource[@id='lustre1']: OK (rc=0, origin=lustre-mgs/crmd/40, version=0.25.3)
> Nov 29 12:50:03 lustre-mgs.ntslab.ru pacemaker-controld [2486] (reap_crm_member) info: No peers with id=0 and/or uname=lustre1 to purge from the membership cache
> Nov 29 12:50:03 lustre-mgs.ntslab.ru pacemaker-fenced [2482] (reap_crm_member) info: No peers with id=0 and/or uname=lustre1 to purge from the membership cache
> Nov 29 12:50:03 lustre-mgs.ntslab.ru pacemaker-attrd [2484] (attrd_client_peer_remove) info: Client e1142409-f793-4839-a938-f512958a925e is requesting all values for lustre1 be removed
> Nov 29 12:50:03 lustre-mgs.ntslab.ru pacemaker-attrd [2484] (attrd_peer_remove) notice: Removing all lustre1 attributes for peer lustre-mgs
> Nov 29 12:50:03 lustre-mgs.ntslab.ru pacemaker-attrd [2484] (reap_crm_member) info: No peers with id=0 and/or uname=lustre1 to purge from the membership cache
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info) info: ++ /cib/configuration/resources: <primitive class="ocf" id="lustre1" provider="pacemaker" type="remote"/>
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info) info: ++ <instance_attributes id="lustre1-instance_attributes">
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info) info: ++ <nvpair id="lustre1-instance_attributes-server" name="server" value="lustre1"/>
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info) info: ++ <op id="lustre1-migrate_from-interval-0s" interval="0s" name="migrate_from" timeout="60s"/>
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info) info: ++ <op id="lustre1-migrate_to-interval-0s" interval="0s" name="migrate_to" timeout="60s"/>
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info) info: ++ <op id="lustre1-monitor-interval-60s" interval="60s" name="monitor" timeout="30s"/>
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info) info: ++ <op id="lustre1-reload-interval-0s" interval="0s" name="reload" timeout="60s"/>
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info) info: ++ <op id="lustre1-reload-agent-interval-0s" interval="0s" name="reload-agent" timeout="60s"/>
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info) info: ++ <op id="lustre1-start-interval-0s" interval="0s" name="start" timeout="60s"/>
interval="0s" name="start" > timeout="60s"/> > Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] > (log_info) info: ++ <op > id="lustre1-stop-interval-0s" interval="0s" name="stop" > timeout="60s"/> > Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-execd [2483] > (process_lrmd_get_rsc_info) info: Agent information for 'lustre1' > not in cache > Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-controld [2486] > (do_lrm_rsc_op) notice: Requesting local execution of probe > operation for lustre1 on lustre-mgs | transition_key=5:88:7:288b2e10- > 0bee-498d-b9eb-4bc5f0f8d5bf op_key=lustre1_monitor_0 > Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-controld [2486] > (log_executor_event) notice: Result of probe operation for lustre1 > on lustre-mgs: not running (Remote connection inactive) | graph > action confirmed; call=7 key=lustre1_monitor_0 rc=7 > Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] > (log_info) info: ++ > /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources: > <lrm_resource id="lustre1" class="ocf" provider="pacemaker" > type="remote"/> > Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] > (log_info) info: ++ > <lrm_rsc_op id="lustre1_last_0" > operation_key="lustre1_monitor_0" operation="monitor" crm-debug- > origin="controld_update_resource_history" crm_feature_set="3.17.4" > transition-key="3:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf" > transition-magic="-1:193;3:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf" > exit-reason="" on_node="lustre-mds1" call-id="-1" rc-code="193" op-st > Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] > (log_info) info: + > /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resou > rce[@id='lustre1']/lrm_rsc_op[@id='lustre1_last_0']: @transition- > magic=0:7;3:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf, @call-id=7, > @rc-code=7, @op-status=0 > Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] > (log_info) info: ++ > /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources: > <lrm_resource id="lustre1" class="ocf" provider="pacemaker" > type="remote"/> > Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] > (log_info) info: ++ > <lrm_rsc_op id="lustre1_last_0" > operation_key="lustre1_monitor_0" operation="monitor" crm-debug- > origin="controld_update_resource_history" crm_feature_set="3.17.4" > transition-key="5:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf" > transition-magic="-1:193;5:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf" > exit-reason="" on_node="lustre-mgs" call-id="-1" rc-code="193" op-sta > Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] > (log_info) info: ++ > /cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources: > <lrm_resource id="lustre1" class="ocf" provider="pacemaker" > type="remote"/> > Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] > (log_info) info: ++ > <lrm_rsc_op id="lustre1_last_0" > operation_key="lustre1_monitor_0" operation="monitor" crm-debug- > origin="controld_update_resource_history" crm_feature_set="3.17.4" > transition-key="4:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf" > transition-magic="-1:193;4:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf" > exit-reason="" on_node="lustre-mds2" call-id="-1" rc-code="193" op-st > Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] > (log_info) info: + > /cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resou > rce[@id='lustre1']/lrm_rsc_op[@id='lustre1_last_0']: @transition- > magic=0:7;4:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf, @call-id=7, > @rc-code=7, 
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info) info: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='lustre1']/lrm_rsc_op[@id='lustre1_last_0']: @transition-magic=0:7;5:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf, @call-id=7, @rc-code=7, @op-status=0

-- 
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/