[ClusterLabs] Help with tweaking an active/passive NFS cluster
Hi,

I wonder if someone more familiar with the workings of pacemaker/corosync would be able to assist in solving an issue.

I have a 3-node NFS cluster which exports several iSCSI LUNs. The LUNs are presented to the nodes via multipathd. This all works fine except that I can't stop just one export. Sometimes I need to take a single filesystem offline for maintenance, for example. Or there's an issue and a filesystem goes offline and can't come back.

There's a trimmed-down config below, but essentially I want all the NFS exports on one node and I don't want any of the exports to block the others. So it's OK to stop (or fail) a single export.

My config has a group for each export and filesystem, and another group for the NFS server and VIP. I then co-locate them together.

Cut-down config to limit the number of exports:

node 1: nfs-01
node 2: nfs-02
node 3: nfs-03
primitive NFSExportAdminHomes exportfs \
        params clientspec="172.16.40.0/24" options="rw,async,no_root_squash" directory="/srv/adminhomes" fsid=dcfd1bbb-c026-4d6d-8541-7fc29d6fef1a \
        op monitor timeout=20 interval=10 \
        op_params interval=10
primitive NFSExportArchive exportfs \
        params clientspec="172.16.40.0/24" options="rw,async,no_root_squash" directory="/srv/archive" fsid=3abb6e34-bff2-4896-b8ff-fc1123517359 \
        op monitor timeout=20 interval=10 \
        op_params interval=10 \
        meta target-role=Started
primitive NFSExportDBBackups exportfs \
        params clientspec="172.16.40.0/24" options="rw,async,no_root_squash" directory="/srv/dbbackups" fsid=df58b9c0-593b-45c0-9923-155b3d7d9483 \
        op monitor timeout=20 interval=10 \
        op_params interval=10
primitive NFSFSAdminHomes Filesystem \
        params device="/dev/mapper/adminhomes-part1" directory="/srv/adminhomes" fstype=xfs \
        op start interval=0 timeout=120 \
        op monitor interval=60 timeout=60 \
        op_params OCF_CHECK_LEVEL=20 \
        op stop interval=0 timeout=240
primitive NFSFSArchive Filesystem \
        params device="/dev/mapper/archive-part1" directory="/srv/archive" fstype=xfs \
        op start interval=0 timeout=120 \
        op monitor interval=60 timeout=60 \
        op_params OCF_CHECK_LEVEL=20 \
        op stop interval=0 timeout=240 \
        meta target-role=Started
primitive NFSFSDBBackups Filesystem \
        params device="/dev/mapper/dbbackups-part1" directory="/srv/dbbackups" fstype=xfs \
        op start timeout=60 interval=0 \
        op monitor interval=20 timeout=40 \
        op stop timeout=60 interval=0 \
        op_params OCF_CHECK_LEVEL=20
primitive NFSIP-01 IPaddr2 \
        params ip=172.16.40.17 cidr_netmask=24 nic=ens14 \
        op monitor interval=30s
group AdminHomes NFSFSAdminHomes NFSExportAdminHomes \
        meta target-role=Started
group Archive NFSFSArchive NFSExportArchive \
        meta target-role=Started
group DBBackups NFSFSDBBackups NFSExportDBBackups \
        meta target-role=Started
group NFSServerIP NFSIP-01 NFSServer \
        meta target-role=Started
colocation NFSMaster inf: NFSServerIP AdminHomes Archive DBBackups
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=2.0.1-9e909a5bdd \
        cluster-infrastructure=corosync \
        cluster-name=nfs-cluster \
        stonith-enabled=false \
        last-lrm-refresh=1675344768
rsc_defaults rsc-options: \
        resource-stickiness=200

The problem is that if one export fails, none of the following exports will be attempted. Reading the docs, that's to be expected as each item in the colocation needs the preceding item to succeed.

I tried changing the colocation line like so to remove the dependency:

colocation NFSMaster inf: NFSServerIP ( AdminHomes Archive DBBackups )

but this gave me two problems:

1. Issuing a "resource stop DBBackups" took everything offline briefly
2. Issuing a "resource start DBBackups" brought it back on a different node to NFSServerIP

I'm very obviously missing something here. Could someone kindly point me in the right direction?

TIA.

Ronny

-- 
Ronny Adsetts
Technical Director
Amazing Internet Ltd, London
t: +44 20 8977 8943
w: www.amazinginternet.com

Registered office: 85 Waldegrave Park, Twickenham, TW1 4TJ
Registered in England. Company No. 4042957
Re: [ClusterLabs] [ClusterLabs Developers] ERROR: LXC container name not set!
Moving to users list

On Thu, Mar 30, 2023 at 3:04 AM John via Developers wrote:
>
> Hey guys! I found some problem with /usr/lib/ocf/resource.d/heartbeat/lxc :
>
> OS: Debian 11 (and Debian 10)
> Kernel: 5.10.0-15-amd64
> Env: resource-agents 1:4.7.0-1~bpo10+1, pacemaker 2.0.5-2, corosync 3.1.2-2, lxc 1:4.0.6-2
>
> Just trying to add a new resource:
>
> lxc-start -n front-2.fr
> pcs resource create front-2.fr ocf:heartbeat:lxc config=/mnt/cluster_volumes/lxc2/front-2.fr/config container=front-2.fr
>
> After ~5 min I want to remove it:
>
> pcs resource remove front-2.fr --force
>
> I get an error and the cluster starts to migrate:
>
> Mar 29 23:28:51 cse2.fr lxc(front-2.fr)[2103391]: ERROR: LXC container name not set!
>
> As I can see in /usr/lib/ocf/resource.d/heartbeat/lxc, the error appears when the agent can't get the OCF_RESKEY_container variable.
> This bug only shows up on clusters that have been running without a reboot for a long time. For example, after fencing I can add/remove lxc resources and everything is fine for a while.
>
> The question is: why? And how do we debug it?

-- 
Regards,

Reid Wahl (He/Him)
Senior Software Engineer, Red Hat
RHEL High Availability - Pacemaker
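For context on where that message comes from: OCF resource agents receive their parameters as OCF_RESKEY_* environment variables set by Pacemaker, and the error text suggests the agent's parameter check fired because OCF_RESKEY_container was empty when an action was invoked. The snippet below is an illustrative sketch of such a guard, not the actual heartbeat/lxc source:

#!/bin/sh
# Illustrative sketch of an OCF-style parameter guard (not the real agent).
# Agents source the OCF helper library for ocf_log and the OCF_ERR_* codes:
: "${OCF_FUNCTIONS_DIR:=${OCF_ROOT:-/usr/lib/ocf}/lib/heartbeat}"
. "${OCF_FUNCTIONS_DIR}/ocf-shellfuncs"

# Pacemaker passes resource parameters as OCF_RESKEY_* variables; if the
# 'container' parameter is missing when an action runs, the agent bails out.
if [ -z "$OCF_RESKEY_container" ]; then
    ocf_log err "LXC container name not set!"
    exit "$OCF_ERR_CONFIGURED"
fi

So the question is really why Pacemaker invoked an action without that variable set on a long-running cluster, which is what the reporter is asking.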
[ClusterLabs] Totem decrypt with Wireshark
Hi,

I have corosync version 3.1.7-1 with encrypted totem messages and would like to know how to decrypt them in Wireshark. I tried to disable encryption with crypto_cipher set to No and crypto_hash set to No, but the traffic stays encrypted.

Thank you in advance.

Fabiana.
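For reference, a hedged sketch of the relevant totem settings, assuming a corosync 3.x corosync.conf: the value documented in corosync.conf(5) for disabling these options is "none" rather than "No", the same setting has to be applied on every node, and corosync must be restarted for it to take effect, otherwise the traffic stays encrypted.

totem {
    version: 2
    # Disable encryption/authentication of totem traffic so it can be read
    # in Wireshark. Apply on ALL nodes and restart corosync afterwards;
    # nodes with mismatched crypto settings typically cannot form a membership.
    crypto_cipher: none
    crypto_hash: none
}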
[ClusterLabs] Could not initialize corosync configuration API error 2
Hi Team,

We are unable to start the corosync service on a node that is already part of an existing cluster and had been running fine for a long time. Now corosync fails to start and join with "Could not initialize corosync configuration API error 2". Please find the logs below.

[root@node1 ~]# systemctl status corosync
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2023-03-30 10:49:58 WAT; 7min ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 9922 ExecStop=/usr/share/corosync/corosync stop (code=exited, status=0/SUCCESS)
  Process: 9937 ExecStart=/usr/share/corosync/corosync start (code=exited, status=1/FAILURE)

Mar 30 10:48:57 node1 systemd[1]: Starting Corosync Cluster Engine...
Mar 30 10:49:58 node1 corosync[9937]: Starting Corosync Cluster Engine (corosync): [FAILED]
Mar 30 10:49:58 node1 systemd[1]: corosync.service: control process exited, code=exited status=1
Mar 30 10:49:58 node1 systemd[1]: Failed to start Corosync Cluster Engine.
Mar 30 10:49:58 node1 systemd[1]: Unit corosync.service entered failed state.
Mar 30 10:49:58 node1 systemd[1]: corosync.service failed.

Please find the corosync log errors:

Mar 30 10:49:52 [9947] node1 corosync debug   [MAIN  ] Denied connection, corosync is not ready
Mar 30 10:49:52 [9947] node1 corosync warning [QB    ] Denied connection, is not ready (9948-10497-23)
Mar 30 10:49:52 [9947] node1 corosync debug   [MAIN  ] cs_ipcs_connection_destroyed()
Mar 30 10:49:52 [9947] node1 corosync debug   [MAIN  ] Denied connection, corosync is not ready
Mar 30 10:49:57 [9947] node1 corosync debug   [MAIN  ] cs_ipcs_connection_destroyed()
Mar 30 10:49:58 [9947] node1 corosync notice  [MAIN  ] Node was shut down by a signal
Mar 30 10:49:58 [9947] node1 corosync notice  [SERV  ] Unloading all Corosync service engines.
Mar 30 10:49:58 [9947] node1 corosync info    [QB    ] withdrawing server sockets
Mar 30 10:49:58 [9947] node1 corosync debug   [QB    ] qb_ipcs_unref() - destroying
Mar 30 10:49:58 [9947] node1 corosync notice  [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Mar 30 10:49:58 [9947] node1 corosync info    [QB    ] withdrawing server sockets
Mar 30 10:49:58 [9947] node1 corosync debug   [QB    ] qb_ipcs_unref() - destroying
Mar 30 10:49:58 [9947] node1 corosync notice  [SERV  ] Service engine unloaded: corosync configuration map access
Mar 30 10:49:58 [9947] node1 corosync info    [QB    ] withdrawing server sockets
Mar 30 10:49:58 [9947] node1 corosync debug   [QB    ] qb_ipcs_unref() - destroying
Mar 30 10:49:58 [9947] node1 corosync notice  [SERV  ] Service engine unloaded: corosync configuration service
Mar 30 10:49:58 [9947] node1 corosync info    [QB    ] withdrawing server sockets
Mar 30 10:49:58 [9947] node1 corosync debug   [QB    ] qb_ipcs_unref() - destroying
Mar 30 10:49:58 [9947] node1 corosync notice  [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Mar 30 10:49:58 [9947] node1 corosync info    [QB    ] withdrawing server sockets
Mar 30 10:49:58 [9947] node1 corosync debug   [QB    ] qb_ipcs_unref() - destroying
Mar 30 10:49:58 [9947] node1 corosync notice  [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Mar 30 10:49:58 [9947] node1 corosync notice  [SERV  ] Service engine unloaded: corosync profile loading service
Mar 30 10:49:58 [9947] node1 corosync debug   [TOTEM ] sending join/leave message
Mar 30 10:49:58 [9947] node1 corosync notice  [MAIN  ] Corosync Cluster Engine exiting normally

When trying to start the corosync service manually, we also get the error below.

[root@node1 ~]# bash -x /usr/share/corosync/corosync start
+ desc='Corosync Cluster Engine'
+ prog=corosync
+ PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/sbin
+ '[' -f /etc/sysconfig/corosync ']'
+ . /etc/sysconfig/corosync
++ COROSYNC_INIT_TIMEOUT=60
++ COROSYNC_OPTIONS=
+ case '/etc/sysconfig' in
+ '[' -f /etc/init.d/functions ']'
+ . /etc/init.d/functions
++ TEXTDOMAIN=initscripts
++ umask 022
++ PATH=/sbin:/usr/sbin:/bin:/usr/bin
++ export PATH
++ '[' 28864 -ne 1 -a -z '' ']'
++ '[' -d /run/systemd/system ']'
++ case "$0" in
++ '[' -z '' ']'
++ COLUMNS=80
++ '[' -z '' ']'
++ '[' -c /dev/stderr -a -r /dev/stderr ']'
+++ /sbin/consoletype
++ CONSOLETYPE=pty
++ '[' -z '' ']'
++ '[' -z '' ']'
++ '[' -f /etc/sysconfig/i18n -o -f /etc/locale.conf ']'
++ . /etc/profile.d/lang.sh
++ unset LANGSH_SOURCED
++ '[' -z '' ']'
++ '[' -f /etc/sysconfig/init ']'
++ . /etc/sysconfig/init
+++ BOOTUP=color
+++ RES_COL=60
+++ MOVE_TO_COL='echo -en \033[60G'
+++ SETCOLOR_SUCCESS='echo -en \033[0;32m'
+++ SETCOLOR_FAILURE='echo -en \033[0;31m'
+++ SETCOLOR_WARNING='echo -en \033[0;33m'
+++ SETCOLOR_NORMAL='echo -en \033[0;39m'
++ '[' pty = serial ']'
++ __sed_discard_ignored_files='/\(~\|\.bak\|\.orig\|\.rpmnew\|\.rpmorig\|\.
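The trace above is cut off, but independent of it, one generic way to see why the daemon gives up during start-up is to run corosync in the foreground, outside the init wrapper, and watch where it exits. This is only a debugging suggestion, not something taken from the original report:

# Stop the systemd unit first so nothing else tries to start the daemon,
# then run corosync without forking (-f) so the process stays attached to
# the terminal and the point of failure is immediately visible.
systemctl stop corosync
corosync -f

Comparing corosync.conf on this node against a working node may also help narrow down why it no longer joins.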