Re: [ClusterLabs] RemoteOFFLINE status, permanently

2023-12-04 Thread Artem
Thank you very much, Ken! I had missed this step. Now I clearly see it in
Morrone_LUG2017.pdf.
I added the constraint and the remote agents came online.
What still bugs me is the following: I destroyed and recreated the cluster
with the same settings on the designated hosts and nothing worked - the
agents were always RemoteOFFLINE. But when I repeated the same steps on a
fresh install of 3 VMs on my laptop, everything worked out of the box (the
remote agents were Online).


On Mon, 4 Dec 2023 at 23:21, Ken Gaillot wrote:

> Hi,
>

> An asymmetric cluster requires that all resources be enabled on
> particular nodes with location constraints. Since you don't have any
> for your remote connections, they can't start anywhere.
>
>
>


Re: [ClusterLabs] RemoteOFFLINE status, permanently

2023-12-04 Thread Ken Gaillot
On Wed, 2023-11-29 at 12:56 +0300, Artem wrote:
> Hello,
> 
> I deployed a Lustre cluster with 3 metadata nodes as full
> pacemaker/corosync members and 4 data nodes as remote agents. Initially
> all went well: I set up the MGS and MDS resources, checked failover and
> failback, and the remote agents were online.
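
For reference, remote agent connection resources like these are typically
created either with "pcs cluster node add-remote <host>" or directly as an
ocf:pacemaker:remote resource. The exact commands used in this cluster are
not shown in the thread, but they would have looked something like:

  pcs cluster node add-remote lustre1
  # or, to create the connection resource explicitly:
  pcs resource create lustre1 ocf:pacemaker:remote server=lustre1

Either form produces a connection resource like the lustre1 resource shown
in the "pcs resource show lustre1" output further down.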
> 
> Then I tried to create OST resources on two of the nodes that are
> remote agents. I also set a location constraint preference for them, a
> colocation constraint (OST1 and OST2, score=-50) and an ordering
> constraint (MDS then OST[12]). Then I read that colocation and ordering
> constraints should not be used with remote agents, so I deleted those
> constraints. At some point I set reconnect_interval=5s, but then found a
> bug report advising to set it higher, so I reverted to the default.
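
For illustration only, the since-deleted constraints described above would
have looked roughly like this (resource names and scores are reconstructed
from the description; MDT00 stands in for the MDS resource shown in the
config below):

  pcs constraint location OST1 prefers lustre1=100
  pcs constraint location OST2 prefers lustre2=100
  pcs constraint colocation add OST2 with OST1 -50
  pcs constraint order MDT00 then OST1
  pcs constraint order MDT00 then OST2

and reconnect_interval would have been set with something like
"pcs resource update lustre1 reconnect_interval=5s".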
> 
> Only then did I check pcs status and notice that the remote agents were
> Offline. I tried removing and re-adding the remote agents, restarting
> the cluster, destroying and recreating it, and rebooting the nodes -
> nothing helped: from the very beginning of cluster setup the agents were
> persistently RemoteOFFLINE, even before the OST resource was created and
> preferentially located on the remote agents (lustre1 and lustre2). I
> found nothing helpful in /var/log/pacemaker/pacemaker.log. Please help
> me investigate and fix it.
> 
> 
> [root@lustre-mgs ~]# rpm -qa | grep -E "corosync|pacemaker|pcs"
> pacemaker-cli-2.1.6-8.el8.x86_64
> pacemaker-schemas-2.1.6-8.el8.noarch
> pcs-0.10.17-2.el8.x86_64
> pacemaker-libs-2.1.6-8.el8.x86_64
> corosync-3.1.7-1.el8.x86_64
> pacemaker-cluster-libs-2.1.6-8.el8.x86_64
> pacemaker-2.1.6-8.el8.x86_64
> corosynclib-3.1.7-1.el8.x86_64
> 
> [root@lustre-mgs ~]# ssh lustre1 "rpm -qa | grep resource-agents"
> resource-agents-4.9.0-49.el8.x86_64
> 
> [root@lustre-mgs ~]# pcs status
> Cluster name: cl-lustre
> Cluster Summary:
>   * Stack: corosync (Pacemaker is running)
>   * Current DC: lustre-mds1 (version 2.1.6-8.el8-6fdc9deea29) -
> partition with quorum
>   * Last updated: Wed Nov 29 12:40:37 2023 on lustre-mgs
>   * Last change:  Wed Nov 29 12:11:21 2023 by root via cibadmin on
> lustre-mgs
>   * 7 nodes configured
>   * 6 resource instances configured
> Node List:
>   * Online: [ lustre-mds1 lustre-mds2 lustre-mgs ]
>   * RemoteOFFLINE: [ lustre1 lustre2 lustre3 lustre4 ]
> Full List of Resources:
>   * lustre2 (ocf::pacemaker:remote): Stopped
>   * lustre3 (ocf::pacemaker:remote): Stopped
>   * lustre4 (ocf::pacemaker:remote): Stopped
>   * lustre1 (ocf::pacemaker:remote): Stopped
>   * MGT (ocf::heartbeat:Filesystem): Started lustre-mgs
>   * MDT00   (ocf::heartbeat:Filesystem): Started lustre-mds1
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> 
> [root@lustre-mgs ~]# pcs cluster verify --full
> [root@lustre-mgs ~]# 
> 
> [root@lustre-mgs ~]# pcs constraint show --full
> Warning: This command is deprecated and will be removed. Please use
> 'pcs constraint config' instead.
> Location Constraints:
>   Resource: MDT00
> Enabled on:
>   Node: lustre-mds1 (score:100) (id:location-MDT00-lustre-mds1-100)
>   Node: lustre-mds2 (score:100) (id:location-MDT00-lustre-mds2-100)
>   Resource: MGT
> Enabled on:
>   Node: lustre-mgs (score:100) (id:location-MGT-lustre-mgs-100)
>   Node: lustre-mds2 (score:50) (id:location-MGT-lustre-mds2-50)
> Ordering Constraints:
>   start MGT then start MDT00 (kind:Optional) (id:order-MGT-MDT00-Optional)
> Colocation Constraints:
> Ticket Constraints:
> 
> [root@lustre-mgs ~]# pcs resource show lustre1
> Warning: This command is deprecated and will be removed. Please use
> 'pcs resource config' instead.
> Resource: lustre1 (class=ocf provider=pacemaker type=remote)
>   Attributes: lustre1-instance_attributes
> server=lustre1
>   Operations:
> migrate_from: lustre1-migrate_from-interval-0s
>   interval=0s
>   timeout=60s
> migrate_to: lustre1-migrate_to-interval-0s
>   interval=0s
>   timeout=60s
> monitor: lustre1-monitor-interval-60s
>   interval=60s
>   timeout=30s
> reload: lustre1-reload-interval-0s
>   interval=0s
>   timeout=60s
> reload-agent: lustre1-reload-agent-interval-0s
>   interval=0s
>   timeout=60s
> start: lustre1-start-interval-0s
>   interval=0s
>   timeout=60s
> stop: lustre1-stop-interval-0s
>   interval=0s
>   timeout=60s
> 
> I also changed some properties:
> pcs property set stonith-enabled=false
> pcs property set symmetric-cluster=false

Hi,

An asymmetric cluster requires that all resources be enabled on
particular nodes with location constraints. Since you don't have any
for your remote connections, they can't start anywhere.
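
In this configuration, the MGT and MDT00 constraints shown above are what
allow those resources to run; nothing enables the lustre1-lustre4
connection resources on any cluster node, so they stay Stopped /
RemoteOFFLINE. A location constraint along these lines for each connection
resource (the scores are only illustrative) lets them start:

  pcs constraint location lustre1 prefers \
      lustre-mgs=100 lustre-mds1=100 lustre-mds2=100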

> pcs property set batch-limit=100
> pcs resource defaults update resource-stickiness=1000
> pcs cluster config update
> 
> [root@lustre-mgs ~]# ssh lustre1 "systemctl status pcsd pacemaker-remote resource-agents-deps.target"
> ● pcsd.service - PCS GUI and remote configuration