Re: [ClusterLabs] RemoteOFFLINE status, permanently
Thank you very much, Ken! I missed this step. Now I clearly see it in
Morrone_LUG2017.pdf. I added the constraint and the RA came online.

What bugs me is the following: I destroyed and recreated the cluster
with the same settings on the designated hosts and nothing worked; the
remote agents were always RemoteOFFLINE. But when I repeated these
steps on a fresh install of 3 VMs on my laptop, it worked out of the
box (the RA was Online).

On Mon, 4 Dec 2023 at 23:21, Ken Gaillot wrote:
> Hi,
>
> An asymmetric cluster requires that all resources be enabled on
> particular nodes with location constraints. Since you don't have any
> for your remote connections, they can't start anywhere.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
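For readers hitting the same symptom: with symmetric-cluster=false, every resource is opt-in, including the ocf:pacemaker:remote connection resources, so each needs a location constraint on a full cluster node before its remote node can come online. A minimal sketch using the node and resource names from this thread; the node-to-resource pairing and the score of 100 are illustrative assumptions, not taken from the thread:

```shell
# The remote *connection* resources run on full cluster nodes
# (lustre-mgs, lustre-mds1, lustre-mds2), not on the remote hosts
# themselves, so they must be enabled there in an asymmetric cluster.
# Scores and node choices below are illustrative.
pcs constraint location lustre1 prefers lustre-mgs=100
pcs constraint location lustre2 prefers lustre-mds1=100
pcs constraint location lustre3 prefers lustre-mds2=100
pcs constraint location lustre4 prefers lustre-mgs=100

# The remote nodes should now leave the RemoteOFFLINE state.
pcs status
```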
Re: [ClusterLabs] RemoteOFFLINE status, permanently
On Wed, 2023-11-29 at 12:56 +0300, Artem wrote:
> Hello,
>
> I deployed a Lustre cluster with 3 nodes (metadata) as
> pacemaker/corosync and 4 nodes as remote agents (for data). Initially
> all went well: I set up MGS and MDS resources, checked failover and
> failback, and the remote agents were online.
>
> Then I tried to create a resource for OST on two nodes which are
> remote agents. I also set a location constraint preference for them,
> a colocation constraint (OST1 and OST2, score=-50) and an ordering
> constraint (MDS then OST[12]). Then I read that colocation and
> ordering constraints should not be used for RAs, so I deleted these
> constraints. At some stage I used reconnect_interval=5s, but then
> found a bug report advising to set it higher, so I reverted to the
> defaults.
>
> Only then did I check pcs status and notice that the RAs were
> Offline. I tried to remove the RAs, add them again, restart the
> cluster, destroy and recreate it, reboot the nodes - nothing helped:
> from the very beginning of cluster setup the agents were persistently
> RemoteOFFLINE, even before creation of the OST resource and locating
> it preferably on the RAs (lustre1 and lustre2). I found nothing
> helpful in /var/log/pacemaker/pacemaker.log. Please help me
> investigate and fix it.
>
> [root@lustre-mgs ~]# rpm -qa | grep -E "corosync|pacemaker|pcs"
> pacemaker-cli-2.1.6-8.el8.x86_64
> pacemaker-schemas-2.1.6-8.el8.noarch
> pcs-0.10.17-2.el8.x86_64
> pacemaker-libs-2.1.6-8.el8.x86_64
> corosync-3.1.7-1.el8.x86_64
> pacemaker-cluster-libs-2.1.6-8.el8.x86_64
> pacemaker-2.1.6-8.el8.x86_64
> corosynclib-3.1.7-1.el8.x86_64
>
> [root@lustre-mgs ~]# ssh lustre1 "rpm -qa | grep resource-agents"
> resource-agents-4.9.0-49.el8.x86_64
>
> [root@lustre-mgs ~]# pcs status
> Cluster name: cl-lustre
> Cluster Summary:
>   * Stack: corosync (Pacemaker is running)
>   * Current DC: lustre-mds1 (version 2.1.6-8.el8-6fdc9deea29) -
>     partition with quorum
>   * Last updated: Wed Nov 29 12:40:37 2023 on lustre-mgs
>   * Last change: Wed Nov 29 12:11:21 2023 by root via cibadmin on
>     lustre-mgs
>   * 7 nodes configured
>   * 6 resource instances configured
> Node List:
>   * Online: [ lustre-mds1 lustre-mds2 lustre-mgs ]
>   * RemoteOFFLINE: [ lustre1 lustre2 lustre3 lustre4 ]
> Full List of Resources:
>   * lustre2 (ocf::pacemaker:remote): Stopped
>   * lustre3 (ocf::pacemaker:remote): Stopped
>   * lustre4 (ocf::pacemaker:remote): Stopped
>   * lustre1 (ocf::pacemaker:remote): Stopped
>   * MGT (ocf::heartbeat:Filesystem): Started lustre-mgs
>   * MDT00 (ocf::heartbeat:Filesystem): Started lustre-mds1
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
>
> [root@lustre-mgs ~]# pcs cluster verify --full
> [root@lustre-mgs ~]#
>
> [root@lustre-mgs ~]# pcs constraint show --full
> Warning: This command is deprecated and will be removed. Please use
> 'pcs constraint config' instead.
> Location Constraints:
>   Resource: MDT00
>     Enabled on:
>       Node: lustre-mds1 (score:100) (id:location-MDT00-lustre-mds1-100)
>       Node: lustre-mds2 (score:100) (id:location-MDT00-lustre-mds2-100)
>   Resource: MGT
>     Enabled on:
>       Node: lustre-mgs (score:100) (id:location-MGT-lustre-mgs-100)
>       Node: lustre-mds2 (score:50) (id:location-MGT-lustre-mds2-50)
> Ordering Constraints:
>   start MGT then start MDT00 (kind:Optional) (id:order-MGT-MDT00-Optional)
> Colocation Constraints:
> Ticket Constraints:
>
> [root@lustre-mgs ~]# pcs resource show lustre1
> Warning: This command is deprecated and will be removed. Please use
> 'pcs resource config' instead.
> Resource: lustre1 (class=ocf provider=pacemaker type=remote)
>   Attributes: lustre1-instance_attributes
>     server=lustre1
>   Operations:
>     migrate_from: lustre1-migrate_from-interval-0s
>       interval=0s timeout=60s
>     migrate_to: lustre1-migrate_to-interval-0s
>       interval=0s timeout=60s
>     monitor: lustre1-monitor-interval-60s
>       interval=60s timeout=30s
>     reload: lustre1-reload-interval-0s
>       interval=0s timeout=60s
>     reload-agent: lustre1-reload-agent-interval-0s
>       interval=0s timeout=60s
>     start: lustre1-start-interval-0s
>       interval=0s timeout=60s
>     stop: lustre1-stop-interval-0s
>       interval=0s timeout=60s
>
> I also changed some properties:
> pcs property set stonith-enabled=false
> pcs property set symmetric-cluster=false

Hi,

An asymmetric cluster requires that all resources be enabled on
particular nodes with location constraints. Since you don't have any
for your remote connections, they can't start anywhere.

> pcs property set batch-limit=100
> pcs resource defaults update resource-stickness=1000
> pcs cluster config update
>
> [root@lustre-mgs ~]# ssh lustre1 "systemctl status pcsd pacemaker-remote resource-agents-deps.target"
> ● pcsd.service - PCS GUI and remote configuration int
[ClusterLabs] RemoteOFFLINE status, permanently
Hello,

I deployed a Lustre cluster with 3 nodes (metadata) as
pacemaker/corosync and 4 nodes as remote agents (for data). Initially
all went well: I set up MGS and MDS resources, checked failover and
failback, and the remote agents were online.

Then I tried to create a resource for OST on two nodes which are
remote agents. I also set a location constraint preference for them, a
colocation constraint (OST1 and OST2, score=-50) and an ordering
constraint (MDS then OST[12]). Then I read that colocation and
ordering constraints should not be used for RAs, so I deleted these
constraints. At some stage I used reconnect_interval=5s, but then
found a bug report advising to set it higher, so I reverted to the
defaults.

Only then did I check pcs status and notice that the RAs were Offline.
I tried to remove the RAs, add them again, restart the cluster,
destroy and recreate it, reboot the nodes - nothing helped: from the
very beginning of cluster setup the agents were persistently
RemoteOFFLINE, even before creation of the OST resource and locating
it preferably on the RAs (lustre1 and lustre2). I found nothing
helpful in /var/log/pacemaker/pacemaker.log. Please help me
investigate and fix it.
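[Editor's note: the exact commands used to create the remote-agent resources are not shown in the thread. A hedged sketch of the typical pcs commands for the setup described above; the resource name and the reconnect_interval handling are assumptions for illustration:]

```shell
# Create a Pacemaker Remote connection resource for one data node.
# "server" is the host running pacemaker-remoted; the resource name
# becomes the remote node's name in pcs status.
pcs resource create lustre1 ocf:pacemaker:remote server=lustre1

# reconnect_interval controls how long the cluster waits before
# retrying a lost remote connection; setting it is optional.
pcs resource update lustre1 reconnect_interval=5s

# Revert to the default by clearing the option (empty value removes
# the instance attribute in pcs).
pcs resource update lustre1 reconnect_interval=
```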
[root@lustre-mgs ~]# rpm -qa | grep -E "corosync|pacemaker|pcs"
pacemaker-cli-2.1.6-8.el8.x86_64
pacemaker-schemas-2.1.6-8.el8.noarch
pcs-0.10.17-2.el8.x86_64
pacemaker-libs-2.1.6-8.el8.x86_64
corosync-3.1.7-1.el8.x86_64
pacemaker-cluster-libs-2.1.6-8.el8.x86_64
pacemaker-2.1.6-8.el8.x86_64
corosynclib-3.1.7-1.el8.x86_64

[root@lustre-mgs ~]# ssh lustre1 "rpm -qa | grep resource-agents"
resource-agents-4.9.0-49.el8.x86_64

[root@lustre-mgs ~]# pcs status
Cluster name: cl-lustre
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: lustre-mds1 (version 2.1.6-8.el8-6fdc9deea29) -
    partition with quorum
  * Last updated: Wed Nov 29 12:40:37 2023 on lustre-mgs
  * Last change: Wed Nov 29 12:11:21 2023 by root via cibadmin on
    lustre-mgs
  * 7 nodes configured
  * 6 resource instances configured
Node List:
  * Online: [ lustre-mds1 lustre-mds2 lustre-mgs ]
  * RemoteOFFLINE: [ lustre1 lustre2 lustre3 lustre4 ]
Full List of Resources:
  * lustre2 (ocf::pacemaker:remote): Stopped
  * lustre3 (ocf::pacemaker:remote): Stopped
  * lustre4 (ocf::pacemaker:remote): Stopped
  * lustre1 (ocf::pacemaker:remote): Stopped
  * MGT (ocf::heartbeat:Filesystem): Started lustre-mgs
  * MDT00 (ocf::heartbeat:Filesystem): Started lustre-mds1
Daemon Status:
  corosync: active/disabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@lustre-mgs ~]# pcs cluster verify --full
[root@lustre-mgs ~]#

[root@lustre-mgs ~]# pcs constraint show --full
Warning: This command is deprecated and will be removed. Please use
'pcs constraint config' instead.
Location Constraints:
  Resource: MDT00
    Enabled on:
      Node: lustre-mds1 (score:100) (id:location-MDT00-lustre-mds1-100)
      Node: lustre-mds2 (score:100) (id:location-MDT00-lustre-mds2-100)
  Resource: MGT
    Enabled on:
      Node: lustre-mgs (score:100) (id:location-MGT-lustre-mgs-100)
      Node: lustre-mds2 (score:50) (id:location-MGT-lustre-mds2-50)
Ordering Constraints:
  start MGT then start MDT00 (kind:Optional) (id:order-MGT-MDT00-Optional)
Colocation Constraints:
Ticket Constraints:

[root@lustre-mgs ~]# pcs resource show lustre1
Warning: This command is deprecated and will be removed. Please use
'pcs resource config' instead.
Resource: lustre1 (class=ocf provider=pacemaker type=remote)
  Attributes: lustre1-instance_attributes
    server=lustre1
  Operations:
    migrate_from: lustre1-migrate_from-interval-0s
      interval=0s timeout=60s
    migrate_to: lustre1-migrate_to-interval-0s
      interval=0s timeout=60s
    monitor: lustre1-monitor-interval-60s
      interval=60s timeout=30s
    reload: lustre1-reload-interval-0s
      interval=0s timeout=60s
    reload-agent: lustre1-reload-agent-interval-0s
      interval=0s timeout=60s
    start: lustre1-start-interval-0s
      interval=0s timeout=60s
    stop: lustre1-stop-interval-0s
      interval=0s timeout=60s

I also changed some properties:
pcs property set stonith-enabled=false
pcs property set symmetric-cluster=false
pcs property set batch-limit=100
pcs resource defaults update resource-stickness=1000
pcs cluster config update

[root@lustre-mgs ~]# ssh lustre1 "systemctl status pcsd pacemaker-remote resource-agents-deps.target"
● pcsd.service - PCS GUI and remote configuration interface
   Loaded: loaded (/usr/lib/systemd/system/pcsd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2023-11-28 19:01:49 MSK; 17h ago
     Docs: man:pcsd(8)
           man:pcs(8)
 Main PID: 1752 (pcsd)
    Tasks: 1 (limit: 408641)
   Memory: 28.0M
   CGroup: /system.slice/pcsd.service
           └─1752 /usr/libexec/platform-python -Es /usr/sbin/pcsd

Nov 28 19:01:49 lustre1.ntslab.ru systemd[1]: Starting PCS GUI and remote configuration interface...
Nov 28 19:01:4