Your message dated Wed, 01 Feb 2023 08:13:14 +0100
with message-id <[email protected]>
and subject line Re: Bug#970084: corosync: Corosync becomes unresponsive and disconnects from the rest of the cluster when primary link is lost
has caused the Debian Bug report #970084,
regarding corosync: Corosync becomes unresponsive and disconnects from the rest of the cluster when primary link is lost
to be marked as done.
This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact [email protected]
immediately.)

--
970084: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970084
Debian Bug Tracking System
Contact [email protected] with problems
--- Begin Message ---
Package: corosync
Version: 3.0.1-2+deb10u1
Severity: important

Dear Maintainer,

* What led up to the situation?

** 2-node cluster with Corosync 3.0.1 on Debian Buster.
** 2 knet links: ring0 on eth0 (front-facing interface), ring1 on eth1 (back-to-back link).
** Services running on cluster-node01.
** Cluster is running just fine; both nodes are online and see each other.
** crm_mon shows 2 online nodes and running resources without errors.

* What exactly did you do (or not do) that was effective (or ineffective)?

For failover testing we disconnected the eth0 interface on the active node
(cluster-node01).

* What was the outcome of this action?

** Situation on the active node (cluster-node01)

Corosync on the node becomes unresponsive: it does not respond to commands
such as corosync-cfgtool and corosync-quorumtool. In crm_mon, however, the
cluster status looks just fine; it claims both nodes are online and services
are healthy. Yet the corosync logs indicate that the cluster is disconnected:

####### corosync.log #######
Sep 11 10:06:45 [1946] cluster-node01 corosync warning [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
#########################

** Situation on the passive node (cluster-node02)

Corosync does respond to commands such as corosync-cfgtool and shows that
cluster-node01 is offline on all links.

####### corosync.log #######
Sep 11 10:06:09 [1941] cluster-node02 corosync info    [KNET  ] link: host: 1 link: 0 is down
Sep 11 10:06:09 [1941] cluster-node02 corosync info    [KNET  ] host: host: 1 has 1 active links
Sep 11 10:06:10 [1941] cluster-node02 corosync notice  [TOTEM ] Token has not been received in 2250 ms
Sep 11 10:06:11 [1941] cluster-node02 corosync notice  [TOTEM ] A processor failed, forming new configuration.
Sep 11 10:06:15 [1941] cluster-node02 corosync notice  [TOTEM ] A new membership (2:16) was formed. Members left: 1
Sep 11 10:06:15 [1941] cluster-node02 corosync notice  [TOTEM ] Failed to receive the leave message. failed: 1
Sep 11 10:06:15 [1941] cluster-node02 corosync warning [CPG   ] downlist left_list: 1 received
Sep 11 10:06:15 [1941] cluster-node02 corosync notice  [QUORUM] Members[1]: 2
Sep 11 10:06:15 [1941] cluster-node02 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
Sep 11 10:06:16 [1941] cluster-node02 corosync info    [KNET  ] link: host: 1 link: 1 is down
Sep 11 10:06:16 [1941] cluster-node02 corosync info    [KNET  ] host: host: 1 has 0 active links
Sep 11 10:06:16 [1941] cluster-node02 corosync warning [KNET  ] host: host: 1 has no active links
#########################

## corosync-cfgtool -s ##
root@cluster-node02:~# corosync-cfgtool -s
Printing link status.
Local node ID 2
LINK ID 0
	addr	= ###.###.###.###
	status:
		node 0: link enabled:1 link connected:0
		node 1: link enabled:1 link connected:1
LINK ID 1
	addr	= ###.###.###.###
	status:
		node 0: link enabled:1 link connected:1
		node 1: link enabled:0 link connected:1
#########################

##### crm_mon -rfA1 #####
root@cluster-node02:~# crm_mon -rfA1
Stack: corosync
Current DC: cluster-node02 (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Fri Sep 11 10:45:53 2020
Last change: Fri Sep 11 10:42:26 2020 by root via cibadmin on cluster-node02

2 nodes configured
7 resources configured

Online: [ cluster-node02 ]
OFFLINE: [ cluster-node01 ]
#########################

Pacemaker therefore tries to perform a failover.

* What outcome did you expect instead?

With our configuration the cluster should not take any action, and both
nodes should still see each other on link 1.

* Tests with Corosync 3.0.3 from Debian testing

We installed packages from Debian testing and fulfilled dependencies from
Debian backports:
#########################
apt install libnozzle1=1.16-2~bpo10+1 libknet1=1.16-2~bpo10+1 \
    libnl-3-200 libnl-route-3-200 libknet-dev=1.16-2~bpo10+1 \
    ./corosync_3.0.3-2_amd64.deb ./libcorosync-common4_3.0.3-2_amd64.deb
#########################

The described problem does not occur with the 3.0.3 version from Debian
testing.

-- System Information:
Debian Release: 10.5
  APT prefers stable
  APT policy: (550, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 4.19.0-10-amd64 (SMP w/2 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=C.UTF-8 (charmap=UTF-8), LANGUAGE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)

Versions of packages corosync depends on:
ii  adduser              3.118
ii  init-system-helpers  1.56+nmu1
ii  libc6                2.28-10
ii  libcfg7              3.0.1-2+deb10u1
ii  libcmap4             3.0.1-2+deb10u1
ii  libcorosync-common4  3.0.1-2+deb10u1
ii  libcpg4              3.0.1-2+deb10u1
ii  libknet1             1.8-2
ii  libqb0               1.0.5-1
ii  libquorum5           3.0.1-2+deb10u1
ii  libstatgrab10        0.91-1+b2
ii  libsystemd0          241-7~deb10u4
ii  libvotequorum8       3.0.1-2+deb10u1
ii  lsb-base             10.2019051400
ii  xsltproc             1.1.32-2.2~deb10u1

corosync recommends no packages.

corosync suggests no packages.

-- Configuration Files:
/etc/corosync/corosync.conf changed:
totem {
    version: 2
    cluster_name: debian
    token: 3000
    token_retransmits_before_loss_const: 10
    crypto_model: nss
    crypto_cipher: aes256
    crypto_hash: sha256
    link_mode: active
    keyfile: /etc/corosync/authkey
}
nodelist {
    node {
        nodeid: 1
        name: cluster-node01
        ring0_addr: ###.###.###.142
        ring1_addr: 192.168.14.1
    }
    node {
        nodeid: 2
        name: cluster-node02
        ring0_addr: ###.###.###.143
        ring1_addr: 192.168.14.2
    }
}
logging {
    fileline: off
    to_stderr: no
    to_syslog: no
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    debug: off
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}
quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
    wait_for_all: 1
    auto_tie_breaker: 0
}

-- no debconf information
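The corosync-cfgtool -s output quoted above can be checked mechanically, for
example from a monitoring script. A minimal sketch in shell, assuming the
3.0.x output format shown in this report; the sample data and the /tmp path
are illustrative, not part of the original diagnosis:

```shell
# Write a sample of `corosync-cfgtool -s` output (format as quoted in this
# report, with placeholder addresses) to a scratch file.
cat <<'EOF' > /tmp/cfgtool.sample
Printing link status.
Local node ID 2
LINK ID 0
	addr	= 10.0.0.1
	status:
		node 0: link enabled:1 link connected:0
		node 1: link enabled:1 link connected:1
LINK ID 1
	addr	= 192.168.14.2
	status:
		node 0: link enabled:1 link connected:1
		node 1: link enabled:0 link connected:1
EOF

# Report every peer whose link shows "link connected:0".
# prints "link 0 node 0 down"
awk '/^LINK ID/ { link = $3 }
     /link connected:0/ { sub(":", "", $2); print "link " link " node " $2 " down" }' \
    /tmp/cfgtool.sample
```

In a live cluster one would pipe `corosync-cfgtool -s` straight into the awk
filter instead of reading a sample file.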
--- End Message ---
--- Begin Message ---
This bug affects only oldstable (buster), where the suggested fix is to
upgrade libknet1 from backports to 1.16-2~bpo10+1.
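On a buster system, applying that fix could look roughly like the following.
This is a sketch under assumptions: it presumes buster-backports is not yet
in the APT sources, uses deb.debian.org as the mirror, and adds a corosync
restart that the report does not mention.

```shell
# Enable buster-backports (skip if a backports entry already exists;
# mirror choice is an assumption).
echo 'deb http://deb.debian.org/debian buster-backports main' \
    > /etc/apt/sources.list.d/buster-backports.list
apt update

# Pull libknet1 1.16-2~bpo10+1 from backports.
apt install -t buster-backports libknet1

# Restart corosync so it loads the upgraded library (do this node by node
# to avoid losing quorum on a two-node cluster).
systemctl restart corosync
```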
--- End Message ---

