[ClusterLabs] Help with tweaking an active/passive NFS cluster

2023-03-30 Thread Ronny Adsetts
Hi,

I wonder if someone more familiar with the workings of pacemaker/corosync would 
be able to assist in solving an issue.

I have a 3-node NFS cluster which exports several iSCSI LUNs. The LUNs are 
presented to the nodes via multipathd.

This all works fine except that I can't stop just one export. Sometimes I need 
to take a single filesystem offline for maintenance, for example, or a filesystem 
develops a problem, goes offline and can't come back.

There's a trimmed-down config below, but essentially I want all the NFS exports 
on one node without any single export blocking the others. So it's OK to stop 
(or fail) a single export.

My config has a group for each export and filesystem and another group for the 
NFS server and VIP. I then co-locate them together.

Cut-down config to limit the number of exports:

node 1: nfs-01
node 2: nfs-02
node 3: nfs-03
primitive NFSExportAdminHomes exportfs \
params clientspec="172.16.40.0/24" options="rw,async,no_root_squash" directory="/srv/adminhomes" fsid=dcfd1bbb-c026-4d6d-8541-7fc29d6fef1a \
op monitor timeout=20 interval=10 \
op_params interval=10
primitive NFSExportArchive exportfs \
params clientspec="172.16.40.0/24" options="rw,async,no_root_squash" directory="/srv/archive" fsid=3abb6e34-bff2-4896-b8ff-fc1123517359 \
op monitor timeout=20 interval=10 \
op_params interval=10 \
meta target-role=Started
primitive NFSExportDBBackups exportfs \
params clientspec="172.16.40.0/24" options="rw,async,no_root_squash" directory="/srv/dbbackups" fsid=df58b9c0-593b-45c0-9923-155b3d7d9483 \
op monitor timeout=20 interval=10 \
op_params interval=10
primitive NFSFSAdminHomes Filesystem \
params device="/dev/mapper/adminhomes-part1" directory="/srv/adminhomes" fstype=xfs \
op start interval=0 timeout=120 \
op monitor interval=60 timeout=60 \
op_params OCF_CHECK_LEVEL=20 \
op stop interval=0 timeout=240
primitive NFSFSArchive Filesystem \
params device="/dev/mapper/archive-part1" directory="/srv/archive" fstype=xfs \
op start interval=0 timeout=120 \
op monitor interval=60 timeout=60 \
op_params OCF_CHECK_LEVEL=20 \
op stop interval=0 timeout=240 \
meta target-role=Started
primitive NFSFSDBBackups Filesystem \
params device="/dev/mapper/dbbackups-part1" directory="/srv/dbbackups" fstype=xfs \
op start timeout=60 interval=0 \
op monitor interval=20 timeout=40 \
op stop timeout=60 interval=0 \
op_params OCF_CHECK_LEVEL=20
primitive NFSIP-01 IPaddr2 \
params ip=172.16.40.17 cidr_netmask=24 nic=ens14 \
op monitor interval=30s
group AdminHomes NFSFSAdminHomes NFSExportAdminHomes \
meta target-role=Started
group Archive NFSFSArchive NFSExportArchive \
meta target-role=Started
group DBBackups NFSFSDBBackups NFSExportDBBackups \
meta target-role=Started
group NFSServerIP NFSIP-01 NFSServer \
meta target-role=Started
colocation NFSMaster inf: NFSServerIP AdminHomes Archive DBBackups
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=2.0.1-9e909a5bdd \
cluster-infrastructure=corosync \
cluster-name=nfs-cluster \
stonith-enabled=false \
last-lrm-refresh=1675344768
rsc_defaults rsc-options: \
resource-stickiness=200


The problem is that if one export fails, none of the following exports will be 
attempted. Reading the docs, that's to be expected, as each item in the 
colocation set needs the preceding item to succeed.

I tried changing the colocation line like so to remove the dependency:

colocation NFSMaster inf: NFSServerIP ( AdminHomes Archive DBBackups )

but this gave me two problems:

1. Issuing a "resource stop DBBackups" took everything offline briefly

2. Issuing a "resource start DBBackups" brought it back on a different node to 
NFSServerIP 
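
For illustration, an untested variant (just a sketch, constraint names made up) 
would be to drop the set entirely and give each export group its own colocation 
with NFSServerIP:

colocation AdminHomesWithIP inf: AdminHomes NFSServerIP
colocation ArchiveWithIP inf: Archive NFSServerIP
colocation DBBackupsWithIP inf: DBBackups NFSServerIP

The idea being that each group would then depend only on NFSServerIP, so 
stopping or failing one group shouldn't move the VIP or disturb the other groups.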

I'm very obviously missing something here.

Could someone kindly point me in the right direction?

TIA.

Ronny

-- 
Ronny Adsetts
Technical Director
Amazing Internet Ltd, London
t: +44 20 8977 8943
w: www.amazinginternet.com

Registered office: 85 Waldegrave Park, Twickenham, TW1 4TJ
Registered in England. Company No. 4042957

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] [ClusterLabs Developers] ERROR: LXC container name not set!

2023-03-30 Thread Reid Wahl
Moving to users list

On Thu, Mar 30, 2023 at 3:04 AM John via Developers wrote:
>
> Hey guys! I've found a problem with /usr/lib/ocf/resource.d/heartbeat/lxc:
>
> OS: Debian 11 (And debian 10)
>
> Kernel: 5.10.0-15-amd64
>
> Env: resource-agents 1:4.7.0-1~bpo10+1, pacemaker 2.0.5-2, corosync 3.1.2-2, lxc 1:4.0.6-2
>
> Just trying to add a new resource
>
>
> lxc-start -n front-2.fr
> pcs resource create front-2.fr ocf:heartbeat:lxc config=/mnt/cluster_volumes/lxc2/front-2.fr/config container=front-2.fr
>
>
> After ~5 min I want to remove it
>
>
> pcs resource remove front-2.fr --force
>
>
> I got an error and the cluster starts to migrate
>
>
> Mar 29 23:28:51 cse2.fr lxc(front-2.fr)[2103391]: ERROR: LXC container name not set!
>
> As far as I can see in /usr/lib/ocf/resource.d/heartbeat/lxc, the error appears when 
> the agent can't get the OCF_RESKEY_container variable.
> This bug only shows up on clusters that have been running without a reboot for a long 
> time. For example, after fencing I can add/remove lxc resources and everything is fine 
> for a while.
>
> The question is: why? And how to debug it?
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/developers
>
> ClusterLabs home: https://www.clusterlabs.org/
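
As for how to debug it: one generic approach (a rough sketch using the names from 
your report, not a statement about this agent's internals) is to run the agent by 
hand with the OCF environment set and see what it does with the container parameter:

OCF_ROOT=/usr/lib/ocf \
OCF_RESKEY_container=front-2.fr \
OCF_RESKEY_config=/mnt/cluster_volumes/lxc2/front-2.fr/config \
/usr/lib/ocf/resource.d/heartbeat/lxc monitor; echo $?

Comparing that with what Pacemaker logs for the failing operation may show whether 
the container parameter is being lost on the Pacemaker side or inside the agent.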



-- 
Regards,

Reid Wahl (He/Him)
Senior Software Engineer, Red Hat
RHEL High Availability - Pacemaker

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Totem decrypt with Wireshark

2023-03-30 Thread Justino, Fabiana
Hi,

I have corosync version 3.1.7-1 with encrypted totem messages and would like to 
know how to decrypt them.
I tried to disable encryption with crypto_cipher set to No and crypto_hash set to 
No, but the traffic stays encrypted.
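
For reference, these are the settings I mean, in the totem section of 
corosync.conf (a trimmed sketch, not my exact file):

totem {
    version: 2
    crypto_cipher: none
    crypto_hash: none
}

As far as I understand these aren't picked up at runtime, so does corosync need a 
full restart on every node before the traffic goes out unencrypted?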

Thank you in advance.
Fabiana.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Could not initialize corosync configuration API error 2

2023-03-30 Thread S Sathish S via Users
Hi Team,

we are unable to start corosync service which is already part of existing 
cluster same is running fine for longer time. Now we are seeing corosync
server unable to join "Could not initialize corosync configuration API error 
2". Please find the below logs.

[root@node1 ~]# systemctl status corosync
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2023-03-30 10:49:58 WAT; 7min ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 9922 ExecStop=/usr/share/corosync/corosync stop (code=exited, status=0/SUCCESS)
  Process: 9937 ExecStart=/usr/share/corosync/corosync start (code=exited, status=1/FAILURE)

Mar 30 10:48:57 node1 systemd[1]: Starting Corosync Cluster Engine...
Mar 30 10:49:58 node1 corosync[9937]: Starting Corosync Cluster Engine (corosync): [FAILED]
Mar 30 10:49:58 node1 systemd[1]: corosync.service: control process exited, code=exited status=1
Mar 30 10:49:58 node1 systemd[1]: Failed to start Corosync Cluster Engine.
Mar 30 10:49:58 node1 systemd[1]: Unit corosync.service entered failed state.
Mar 30 10:49:58 node1 systemd[1]: corosync.service failed.

Please find the corosync logs error:

Mar 30 10:49:52 [9947] node1 corosync debug   [MAIN  ] Denied connection, corosync is not ready
Mar 30 10:49:52 [9947] node1 corosync warning [QB    ] Denied connection, is not ready (9948-10497-23)
Mar 30 10:49:52 [9947] node1 corosync debug   [MAIN  ] cs_ipcs_connection_destroyed()
Mar 30 10:49:52 [9947] node1 corosync debug   [MAIN  ] Denied connection, corosync is not ready
Mar 30 10:49:57 [9947] node1 corosync debug   [MAIN  ] cs_ipcs_connection_destroyed()
Mar 30 10:49:58 [9947] node1 corosync notice  [MAIN  ] Node was shut down by a signal
Mar 30 10:49:58 [9947] node1 corosync notice  [SERV  ] Unloading all Corosync service engines.
Mar 30 10:49:58 [9947] node1 corosync info    [QB    ] withdrawing server sockets
Mar 30 10:49:58 [9947] node1 corosync debug   [QB    ] qb_ipcs_unref() - destroying
Mar 30 10:49:58 [9947] node1 corosync notice  [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Mar 30 10:49:58 [9947] node1 corosync info    [QB    ] withdrawing server sockets
Mar 30 10:49:58 [9947] node1 corosync debug   [QB    ] qb_ipcs_unref() - destroying
Mar 30 10:49:58 [9947] node1 corosync notice  [SERV  ] Service engine unloaded: corosync configuration map access
Mar 30 10:49:58 [9947] node1 corosync info    [QB    ] withdrawing server sockets
Mar 30 10:49:58 [9947] node1 corosync debug   [QB    ] qb_ipcs_unref() - destroying
Mar 30 10:49:58 [9947] node1 corosync notice  [SERV  ] Service engine unloaded: corosync configuration service
Mar 30 10:49:58 [9947] node1 corosync info    [QB    ] withdrawing server sockets
Mar 30 10:49:58 [9947] node1 corosync debug   [QB    ] qb_ipcs_unref() - destroying
Mar 30 10:49:58 [9947] node1 corosync notice  [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Mar 30 10:49:58 [9947] node1 corosync info    [QB    ] withdrawing server sockets
Mar 30 10:49:58 [9947] node1 corosync debug   [QB    ] qb_ipcs_unref() - destroying
Mar 30 10:49:58 [9947] node1 corosync notice  [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Mar 30 10:49:58 [9947] node1 corosync notice  [SERV  ] Service engine unloaded: corosync profile loading service
Mar 30 10:49:58 [9947] node1 corosync debug   [TOTEM ] sending join/leave message
Mar 30 10:49:58 [9947] node1 corosync notice  [MAIN  ] Corosync Cluster Engine exiting normally
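
For completeness, one way to get more detail (a generic debugging sketch, nothing 
specific to this failure) would be to make sure nothing from the failed start is 
left running and then start corosync in the foreground so the startup messages 
print straight to the terminal:

systemctl stop corosync            # clear any half-started instance
ps -ef | grep '[c]orosync'         # confirm no corosync process is left behind
corosync -f                        # run in the foreground and watch the startup errors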


When trying to start the corosync service manually, we also get the error below.


[root@node1 ~]# bash -x /usr/share/corosync/corosync start
+ desc='Corosync Cluster Engine'
+ prog=corosync
+ PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/sbin
+ '[' -f /etc/sysconfig/corosync ']'
+ . /etc/sysconfig/corosync
++ COROSYNC_INIT_TIMEOUT=60
++ COROSYNC_OPTIONS=
+ case '/etc/sysconfig' in
+ '[' -f /etc/init.d/functions ']'
+ . /etc/init.d/functions
++ TEXTDOMAIN=initscripts
++ umask 022
++ PATH=/sbin:/usr/sbin:/bin:/usr/bin
++ export PATH
++ '[' 28864 -ne 1 -a -z '' ']'
++ '[' -d /run/systemd/system ']'
++ case "$0" in
++ '[' -z '' ']'
++ COLUMNS=80
++ '[' -z '' ']'
++ '[' -c /dev/stderr -a -r /dev/stderr ']'
+++ /sbin/consoletype
++ CONSOLETYPE=pty
++ '[' -z '' ']'
++ '[' -z '' ']'
++ '[' -f /etc/sysconfig/i18n -o -f /etc/locale.conf ']'
++ . /etc/profile.d/lang.sh
++ unset LANGSH_SOURCED
++ '[' -z '' ']'
++ '[' -f /etc/sysconfig/init ']'
++ . /etc/sysconfig/init
+++ BOOTUP=color
+++ RES_COL=60
+++ MOVE_TO_COL='echo -en \033[60G'
+++ SETCOLOR_SUCCESS='echo -en \033[0;32m'
+++ SETCOLOR_FAILURE='echo -en \033[0;31m'
+++ SETCOLOR_WARNING='echo -en \033[0;33m'
+++ SETCOLOR_NORMAL='echo -en \033[0;39m'
++ '[' pty = serial ']'
++ __sed_discard_ignored_files='/\(~\|\.bak\|\.orig\|\.rpmnew\|\.rpmorig\|\.