Re: [ClusterLabs] RemoteOFFLINE status, permanently
Thank you very much, Ken! I missed this step. Now I clearly see it in
Morrone_LUG2017.pdf. I added the constraint and the RA came online.

What bugs me is the following: I destroyed and recreated the cluster
with the same settings on the designated hosts and nothing worked; the
remote agents were always RemoteOFFLINE. But when I repeated these
steps on a fresh install of 3 VMs on my laptop, it worked out of the
box (the RA was Online).

On Mon, 4 Dec 2023 at 23:21, Ken Gaillot wrote:
> Hi,
>
> An asymmetric cluster requires that all resources be enabled on
> particular nodes with location constraints. Since you don't have any
> for your remote connections, they can't start anywhere.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
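For readers hitting the same symptom: with symmetric-cluster=false, every resource is opt-in, including the ocf:pacemaker:remote connection resources, so each needs a location constraint on a full cluster node before its remote node can come online. A minimal sketch using the node and resource names from this thread; the node-to-resource pairing and the score of 100 are illustrative assumptions, not taken from the thread:

```shell
# The remote *connection* resources run on full cluster nodes
# (lustre-mgs, lustre-mds1, lustre-mds2), not on the remote hosts
# themselves, so they must be enabled there in an asymmetric cluster.
# Scores and node choices below are illustrative.
pcs constraint location lustre1 prefers lustre-mgs=100
pcs constraint location lustre2 prefers lustre-mds1=100
pcs constraint location lustre3 prefers lustre-mds2=100
pcs constraint location lustre4 prefers lustre-mgs=100

# The remote nodes should now leave the RemoteOFFLINE state.
pcs status
```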
Re: [ClusterLabs] RemoteOFFLINE status, permanently
On Wed, 2023-11-29 at 12:56 +0300, Artem wrote:
> Hello,
>
> I deployed a Lustre cluster with 3 nodes (metadata) as
> pacemaker/corosync and 4 nodes as remote agents (for data). Initially
> all went well: I set up MGS and MDS resources, checked failover and
> failback, and the remote agents were online.
>
> Then I tried to create a resource for OST on two nodes which are
> remote agents. I also set a location constraint preference for them,
> a colocation constraint (OST1 and OST2, score=-50) and an ordering
> constraint (MDS then OST[12]). Then I read that colocation and
> ordering constraints should not be used for RAs, so I deleted these
> constraints. At some stage I used reconnect_interval=5s, but then
> found a bug report advising to set it higher, so I reverted to the
> defaults.
>
> Only then did I check pcs status and notice that the RAs were
> Offline. I tried to remove the RAs, add them again, restart the
> cluster, destroy and recreate it, reboot the nodes - nothing helped:
> from the very beginning of cluster setup the agents were persistently
> RemoteOFFLINE, even before creation of the OST resource and locating
> it preferably on the RAs (lustre1 and lustre2). I found nothing
> helpful in /var/log/pacemaker/pacemaker.log. Please help me
> investigate and fix it.
>
> [root@lustre-mgs ~]# rpm -qa | grep -E "corosync|pacemaker|pcs"
> pacemaker-cli-2.1.6-8.el8.x86_64
> pacemaker-schemas-2.1.6-8.el8.noarch
> pcs-0.10.17-2.el8.x86_64
> pacemaker-libs-2.1.6-8.el8.x86_64
> corosync-3.1.7-1.el8.x86_64
> pacemaker-cluster-libs-2.1.6-8.el8.x86_64
> pacemaker-2.1.6-8.el8.x86_64
> corosynclib-3.1.7-1.el8.x86_64
>
> [root@lustre-mgs ~]# ssh lustre1 "rpm -qa | grep resource-agents"
> resource-agents-4.9.0-49.el8.x86_64
>
> [root@lustre-mgs ~]# pcs status
> Cluster name: cl-lustre
> Cluster Summary:
>   * Stack: corosync (Pacemaker is running)
>   * Current DC: lustre-mds1 (version 2.1.6-8.el8-6fdc9deea29) -
>     partition with quorum
>   * Last updated: Wed Nov 29 12:40:37 2023 on lustre-mgs
>   * Last change: Wed Nov 29 12:11:21 2023 by root via cibadmin on
>     lustre-mgs
>   * 7 nodes configured
>   * 6 resource instances configured
> Node List:
>   * Online: [ lustre-mds1 lustre-mds2 lustre-mgs ]
>   * RemoteOFFLINE: [ lustre1 lustre2 lustre3 lustre4 ]
> Full List of Resources:
>   * lustre2 (ocf::pacemaker:remote): Stopped
>   * lustre3 (ocf::pacemaker:remote): Stopped
>   * lustre4 (ocf::pacemaker:remote): Stopped
>   * lustre1 (ocf::pacemaker:remote): Stopped
>   * MGT (ocf::heartbeat:Filesystem): Started lustre-mgs
>   * MDT00 (ocf::heartbeat:Filesystem): Started lustre-mds1
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
>
> [root@lustre-mgs ~]# pcs cluster verify --full
> [root@lustre-mgs ~]#
>
> [root@lustre-mgs ~]# pcs constraint show --full
> Warning: This command is deprecated and will be removed. Please use
> 'pcs constraint config' instead.
> Location Constraints:
>   Resource: MDT00
>     Enabled on:
>       Node: lustre-mds1 (score:100) (id:location-MDT00-lustre-mds1-100)
>       Node: lustre-mds2 (score:100) (id:location-MDT00-lustre-mds2-100)
>   Resource: MGT
>     Enabled on:
>       Node: lustre-mgs (score:100) (id:location-MGT-lustre-mgs-100)
>       Node: lustre-mds2 (score:50) (id:location-MGT-lustre-mds2-50)
> Ordering Constraints:
>   start MGT then start MDT00 (kind:Optional) (id:order-MGT-MDT00-Optional)
> Colocation Constraints:
> Ticket Constraints:
>
> [root@lustre-mgs ~]# pcs resource show lustre1
> Warning: This command is deprecated and will be removed. Please use
> 'pcs resource config' instead.
> Resource: lustre1 (class=ocf provider=pacemaker type=remote)
>   Attributes: lustre1-instance_attributes
>     server=lustre1
>   Operations:
>     migrate_from: lustre1-migrate_from-interval-0s
>       interval=0s timeout=60s
>     migrate_to: lustre1-migrate_to-interval-0s
>       interval=0s timeout=60s
>     monitor: lustre1-monitor-interval-60s
>       interval=60s timeout=30s
>     reload: lustre1-reload-interval-0s
>       interval=0s timeout=60s
>     reload-agent: lustre1-reload-agent-interval-0s
>       interval=0s timeout=60s
>     start: lustre1-start-interval-0s
>       interval=0s timeout=60s
>     stop: lustre1-stop-interval-0s
>       interval=0s timeout=60s
>
> I also changed some properties:
> pcs property set stonith-enabled=false
> pcs property set symmetric-cluster=false

Hi,

An asymmetric cluster requires that all resources be enabled on
particular nodes with location constraints. Since you don't have any
for your remote connections, they can't start anywhere.

> pcs property set batch-limit=100
> pcs resource defaults update resource-stickness=1000
> pcs cluster config update
>
> [root@lustre-mgs ~]# ssh lustre1 "systemctl status pcsd pacemaker-remote resource-agents-deps.target"
> ● pcsd.service - PCS GUI and remote configuration int
[ClusterLabs] RemoteOFFLINE status, permanently
Hello,

I deployed a Lustre cluster with 3 nodes (metadata) as
pacemaker/corosync and 4 nodes as remote agents (for data). Initially
all went well: I set up MGS and MDS resources, checked failover and
failback, and the remote agents were online.

Then I tried to create a resource for OST on two nodes which are
remote agents. I also set a location constraint preference for them, a
colocation constraint (OST1 and OST2, score=-50) and an ordering
constraint (MDS then OST[12]). Then I read that colocation and
ordering constraints should not be used for RAs, so I deleted these
constraints. At some stage I used reconnect_interval=5s, but then
found a bug report advising to set it higher, so I reverted to the
defaults.

Only then did I check pcs status and notice that the RAs were Offline.
I tried to remove the RAs, add them again, restart the cluster,
destroy and recreate it, reboot the nodes - nothing helped: from the
very beginning of cluster setup the agents were persistently
RemoteOFFLINE, even before creation of the OST resource and locating
it preferably on the RAs (lustre1 and lustre2). I found nothing
helpful in /var/log/pacemaker/pacemaker.log. Please help me
investigate and fix it.
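[Editor's note: the exact commands used to create the remote-agent resources are not shown in the thread. A hedged sketch of the typical pcs commands for the setup described above; the resource name and the reconnect_interval handling are assumptions for illustration:]

```shell
# Create a Pacemaker Remote connection resource for one data node.
# "server" is the host running pacemaker-remoted; the resource name
# becomes the remote node's name in pcs status.
pcs resource create lustre1 ocf:pacemaker:remote server=lustre1

# reconnect_interval controls how long the cluster waits before
# retrying a lost remote connection; setting it is optional.
pcs resource update lustre1 reconnect_interval=5s

# Revert to the default by clearing the option (empty value removes
# the instance attribute in pcs).
pcs resource update lustre1 reconnect_interval=
```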
[root@lustre-mgs ~]# rpm -qa | grep -E "corosync|pacemaker|pcs"
pacemaker-cli-2.1.6-8.el8.x86_64
pacemaker-schemas-2.1.6-8.el8.noarch
pcs-0.10.17-2.el8.x86_64
pacemaker-libs-2.1.6-8.el8.x86_64
corosync-3.1.7-1.el8.x86_64
pacemaker-cluster-libs-2.1.6-8.el8.x86_64
pacemaker-2.1.6-8.el8.x86_64
corosynclib-3.1.7-1.el8.x86_64

[root@lustre-mgs ~]# ssh lustre1 "rpm -qa | grep resource-agents"
resource-agents-4.9.0-49.el8.x86_64

[root@lustre-mgs ~]# pcs status
Cluster name: cl-lustre
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: lustre-mds1 (version 2.1.6-8.el8-6fdc9deea29) -
    partition with quorum
  * Last updated: Wed Nov 29 12:40:37 2023 on lustre-mgs
  * Last change: Wed Nov 29 12:11:21 2023 by root via cibadmin on
    lustre-mgs
  * 7 nodes configured
  * 6 resource instances configured
Node List:
  * Online: [ lustre-mds1 lustre-mds2 lustre-mgs ]
  * RemoteOFFLINE: [ lustre1 lustre2 lustre3 lustre4 ]
Full List of Resources:
  * lustre2 (ocf::pacemaker:remote): Stopped
  * lustre3 (ocf::pacemaker:remote): Stopped
  * lustre4 (ocf::pacemaker:remote): Stopped
  * lustre1 (ocf::pacemaker:remote): Stopped
  * MGT (ocf::heartbeat:Filesystem): Started lustre-mgs
  * MDT00 (ocf::heartbeat:Filesystem): Started lustre-mds1
Daemon Status:
  corosync: active/disabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@lustre-mgs ~]# pcs cluster verify --full
[root@lustre-mgs ~]#

[root@lustre-mgs ~]# pcs constraint show --full
Warning: This command is deprecated and will be removed. Please use
'pcs constraint config' instead.
Location Constraints:
  Resource: MDT00
    Enabled on:
      Node: lustre-mds1 (score:100) (id:location-MDT00-lustre-mds1-100)
      Node: lustre-mds2 (score:100) (id:location-MDT00-lustre-mds2-100)
  Resource: MGT
    Enabled on:
      Node: lustre-mgs (score:100) (id:location-MGT-lustre-mgs-100)
      Node: lustre-mds2 (score:50) (id:location-MGT-lustre-mds2-50)
Ordering Constraints:
  start MGT then start MDT00 (kind:Optional) (id:order-MGT-MDT00-Optional)
Colocation Constraints:
Ticket Constraints:

[root@lustre-mgs ~]# pcs resource show lustre1
Warning: This command is deprecated and will be removed. Please use
'pcs resource config' instead.
Resource: lustre1 (class=ocf provider=pacemaker type=remote)
  Attributes: lustre1-instance_attributes
    server=lustre1
  Operations:
    migrate_from: lustre1-migrate_from-interval-0s
      interval=0s timeout=60s
    migrate_to: lustre1-migrate_to-interval-0s
      interval=0s timeout=60s
    monitor: lustre1-monitor-interval-60s
      interval=60s timeout=30s
    reload: lustre1-reload-interval-0s
      interval=0s timeout=60s
    reload-agent: lustre1-reload-agent-interval-0s
      interval=0s timeout=60s
    start: lustre1-start-interval-0s
      interval=0s timeout=60s
    stop: lustre1-stop-interval-0s
      interval=0s timeout=60s

I also changed some properties:
pcs property set stonith-enabled=false
pcs property set symmetric-cluster=false
pcs property set batch-limit=100
pcs resource defaults update resource-stickness=1000
pcs cluster config update

[root@lustre-mgs ~]# ssh lustre1 "systemctl status pcsd pacemaker-remote resource-agents-deps.target"
● pcsd.service - PCS GUI and remote configuration interface
   Loaded: loaded (/usr/lib/systemd/system/pcsd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2023-11-28 19:01:49 MSK; 17h ago
     Docs: man:pcsd(8)
           man:pcs(8)
 Main PID: 1752 (pcsd)
    Tasks: 1 (limit: 408641)
   Memory: 28.0M
   CGroup: /system.slice/pcsd.service
           └─1752 /usr/libexec/platform-python -Es /usr/sbin/pcsd

Nov 28 19:01:49 lustre1.ntslab.ru systemd[1]: Starting PCS GUI and remote configuration interface...
Nov 28 19:01:4