Your message dated Wed, 01 Feb 2023 08:13:14 +0100
with message-id <[email protected]>
and subject line Re: Bug#970084: corosync: Corosync becomes unresponsive and 
disconnects from the rest of the cluster when primary link is lost
has caused the Debian Bug report #970084,
regarding corosync: Corosync becomes unresponsive and disconnects from the rest 
of the cluster when primary link is lost
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact [email protected]
immediately.)


-- 
970084: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970084
Debian Bug Tracking System
Contact [email protected] with problems
--- Begin Message ---
Package: corosync
Version: 3.0.1-2+deb10u1
Severity: important

Dear Maintainer,

* What led up to the situation?
** Two-node cluster running Corosync 3.0.1 on Debian buster.
** Two knet links: ring0 on eth0 (the front-facing interface), ring1 on eth1
(a back-to-back link).
** Services running on cluster-node01.
** Cluster is running fine; both nodes are online and see each other.
** crm_mon shows two online nodes and running resources without errors.

* What exactly did you do (or not do) that was effective (or ineffective)?
For failover testing we disconnected the eth0 interface on the active node
(cluster-node01).
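
The test step above could be scripted roughly as follows. This is a hedged sketch, not the reporter's actual procedure: the interface and node names come from the report, and the commands are printed rather than executed so the sketch is safe to run without root on a non-cluster machine.

```shell
# Sketch of the failover test: take the primary (ring0) link down on the
# active node and observe corosync. Commands are printed, not executed.
set -eu

IFACE="eth0"        # ring0, the front-facing primary link (from the report)
BACKUP_IFACE="eth1" # ring1, the back-to-back link that should keep the cluster up

cat <<EOF
ip link set ${IFACE} down   # simulate loss of the primary link
corosync-cfgtool -s         # expected: still responsive, ring1 (${BACKUP_IFACE}) connected
corosync-quorumtool -s      # expected: quorum unchanged, both nodes listed
ip link set ${IFACE} up     # restore the link after the test
EOF
```

In the bug described here, the two corosync-* commands hang on the active node instead of reporting the surviving ring1 link.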

* What was the outcome of this action?
** Situation on the active node (cluster-node01)
Corosync on the node becomes unresponsive: it does not respond to commands
such as corosync-cfgtool and corosync-quorumtool.
In crm_mon, however, the cluster status looks fine: it claims both nodes
are online and the services are healthy.
The corosync log, on the other hand, indicates that the cluster is disconnected.
####### corosync.log #######
Sep 11 10:06:45 [1946] cluster-node01 corosync warning [MAIN  ] Totem is
unable to form a cluster because of an operating system or network fault
(reason: totem is continuously in gather state). The most common cause of
this message is that the local firewall is configured improperly.
#########################

** Situation on the passive node (cluster-node02)
Corosync responds to commands such as corosync-cfgtool and shows that
cluster-node01 is offline on all links.

#########################
####### corosync.log #######
Sep 11 10:06:09 [1941] cluster-node02 corosync info    [KNET  ] link: host:
1 link: 0 is down
Sep 11 10:06:09 [1941] cluster-node02 corosync info    [KNET  ] host: host:
1 has 1 active links
Sep 11 10:06:10 [1941] cluster-node02 corosync notice  [TOTEM ] Token has
not been received in 2250 ms
Sep 11 10:06:11 [1941] cluster-node02 corosync notice  [TOTEM ] A processor
failed, forming new configuration.
Sep 11 10:06:15 [1941] cluster-node02 corosync notice  [TOTEM ] A new
membership (2:16) was formed. Members left: 1
Sep 11 10:06:15 [1941] cluster-node02 corosync notice  [TOTEM ] Failed to
receive the leave message. failed: 1
Sep 11 10:06:15 [1941] cluster-node02 corosync warning [CPG   ] downlist
left_list: 1 received
Sep 11 10:06:15 [1941] cluster-node02 corosync notice  [QUORUM] Members[1]:
2
Sep 11 10:06:15 [1941] cluster-node02 corosync notice  [MAIN  ] Completed
service synchronization, ready to provide service.
Sep 11 10:06:16 [1941] cluster-node02 corosync info    [KNET  ] link: host:
1 link: 1 is down
Sep 11 10:06:16 [1941] cluster-node02 corosync info    [KNET  ] host: host:
1 has 0 active links
Sep 11 10:06:16 [1941] cluster-node02 corosync warning [KNET  ] host: host:
1 has no active links
#########################

#########################
## corosync-cfgtool -s ##
root@cluster-node02:~# corosync-cfgtool -s
Printing link status.
Local node ID 2
LINK ID 0
    addr    = ###.###.###.###
    status:
        node 0: link enabled:1  link connected:0
        node 1: link enabled:1  link connected:1
LINK ID 1
    addr    = ###.###.###.###
    status:
        node 0: link enabled:1  link connected:1
        node 1: link enabled:0  link connected:1
#########################


#########################
##### crm_mon -rfA1 #######
root@cluster-node02:~# crm_mon -rfA1
Stack: corosync
Current DC: cluster-node02 (version 2.0.1-9e909a5bdd) - partition with
quorum
Last updated: Fri Sep 11 10:45:53 2020
Last change: Fri Sep 11 10:42:26 2020 by root via cibadmin on cluster-node02

2 nodes configured
7 resources configured

Online: [ cluster-node02 ]
OFFLINE: [ cluster-node01 ]
#########################

Pacemaker therefore attempts a failover.

* What outcome did you expect instead?
With our configuration, the cluster should take no action; both nodes
should still see each other on link 1.

* Tests with Corosync 3.0.3 from Debian testing.
We installed the packages from Debian testing and satisfied their
dependencies from Debian backports.

#########################
apt install libnozzle1=1.16-2~bpo10+1 libknet1=1.16-2~bpo10+1 libnl-3-200
libnl-route-3-200 libknet-dev=1.16-2~bpo10+1 ./corosync_3.0.3-2_amd64.deb
./libcorosync-common4_3.0.3-2_amd64.deb
#########################
The described problem does not occur with version 3.0.3 from Debian
testing.


-- System Information:
Debian Release: 10.5
  APT prefers stable
  APT policy: (550, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 4.19.0-10-amd64 (SMP w/2 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=C.UTF-8 (charmap=UTF-8),
LANGUAGE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)

Versions of packages corosync depends on:
ii  adduser              3.118
ii  init-system-helpers  1.56+nmu1
ii  libc6                2.28-10
ii  libcfg7              3.0.1-2+deb10u1
ii  libcmap4             3.0.1-2+deb10u1
ii  libcorosync-common4  3.0.1-2+deb10u1
ii  libcpg4              3.0.1-2+deb10u1
ii  libknet1             1.8-2
ii  libqb0               1.0.5-1
ii  libquorum5           3.0.1-2+deb10u1
ii  libstatgrab10        0.91-1+b2
ii  libsystemd0          241-7~deb10u4
ii  libvotequorum8       3.0.1-2+deb10u1
ii  lsb-base             10.2019051400
ii  xsltproc             1.1.32-2.2~deb10u1

corosync recommends no packages.

corosync suggests no packages.

-- Configuration Files:
/etc/corosync/corosync.conf changed:
totem {
    version: 2
    cluster_name: debian
    token: 3000
    token_retransmits_before_loss_const: 10
    crypto_model: nss
    crypto_cipher: aes256
    crypto_hash: sha256
    link_mode: active
    keyfile: /etc/corosync/authkey
}
nodelist {
    node {
        nodeid: 1
        name: cluster-node01
        ring0_addr: ###.###.###.142
        ring1_addr: 192.168.14.1
    }
    node {
        nodeid: 2
        name: cluster-node02
        ring0_addr: ###.###.###.143
        ring1_addr: 192.168.14.2
    }
}
logging {
    fileline: off
    to_stderr: no
    to_syslog: no
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    debug: off
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}
quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
    wait_for_all: 1
    auto_tie_breaker: 0
}


-- no debconf information

--- End Message ---
--- Begin Message ---
This bug affects only oldstable (buster), where the suggested fix is to
upgrade libknet1 from backports to 1.16-2~bpo10+1.
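
The suggested fix could look roughly like this on a buster node. This is a hedged sketch, assuming buster-backports is already enabled in the APT sources; the commands are printed rather than executed, and the restart step is my addition (a running corosync does not pick up a new library on its own).

```shell
# Sketch of the suggested fix on buster (oldstable): upgrade libknet1 to
# the backports build named in the closing message. Commands are printed,
# not executed.
set -eu

PKG="libknet1=1.16-2~bpo10+1"   # version named in the closing message

cat <<EOF
apt install -t buster-backports ${PKG}
systemctl restart corosync   # assumed step: restart to load the new library
EOF
```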

--- End Message ---
