I have no knet experience, but the symptoms really sound odd.

>>> "Leditzky, Fabian via Users" <users@clusterlabs.org> wrote on 19.05.2022 at 10:16 in message
<co1pr08mb6707cd476bba2d4fa52f891896...@co1pr08mb6707.namprd08.prod.outlook.com>:
> Hello
>
> We have been dealing with our pacemaker/corosync clusters becoming
> unstable. The OS is Debian 10 and we use Debian packages for pacemaker
> and corosync, versions 3.0.1-5+deb10u1 and 3.0.1-2+deb10u1 respectively.
> We use knet over UDP transport.
>
> We run multiple 2-node and 4-8 node clusters, primarily managing VIP
> resources.
> The issue we experience presents itself as a spontaneous disagreement
> about the status of cluster members. In two-node clusters, each node
> spontaneously sees the other node as offline, despite network
> connectivity being OK.
> In larger clusters, the status can be inconsistent across the nodes.
> E.g., node 1 sees 2 and 4 as offline, node 2 sees 1 and 4 as offline,
> while nodes 3 and 4 see every node as online.
> The cluster becomes generally unresponsive to resource actions in this
> state.
> Thus far we have been unable to restore cluster health without
> restarting corosync.
>
> We are running packet captures 24/7 on the clusters and have custom
> tooling to detect lost UDP packets on knet ports. So far we could not
> see significant packet loss trigger an event; at most we have seen a
> single UDP packet dropped some seconds before the cluster fails.
>
> However, even if the root cause is indeed a flaky network, we do not
> understand why the cluster cannot recover on its own in any way. The
> issues definitely persist beyond the presence of any intermittent
> network problem.
>
> We were able to artificially break clusters by inducing packet loss
> with an iptables rule.
> Dropping packets on a single node of an 8-node cluster can cause
> malfunctions on multiple other cluster nodes. The expected behavior
> would be detecting that the artificially broken node failed but keeping
> the rest of the cluster stable.
> We were able to reproduce this also on Debian 11 with more recent
> corosync/pacemaker versions.
>
> Our configuration is basic; we do not significantly deviate from the
> defaults.
>
> We will be very grateful for any insights into this problem.
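
For reference, the exact rule isn't shown above, but a packet-loss
injection of the kind described might look like the sketch below. The
interface name, the port (5405 is the corosync knet default for link 0)
and the drop probability are my assumptions, not taken from the report:

  # capture knet traffic for later analysis (adjust interface/port)
  tcpdump -ni eth0 -w knet.pcap udp port 5405

  # randomly drop ~20% of inbound knet packets on the node under test
  iptables -A INPUT -i eth0 -p udp --dport 5405 \
      -m statistic --mode random --probability 0.2 -j DROP

  # delete the same rule again after the test
  iptables -D INPUT -i eth0 -p udp --dport 5405 \
      -m statistic --mode random --probability 0.2 -j DROP
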
>
> Thanks,
> Fabian
>
> // corosync.conf
> totem {
>     version: 2
>     cluster_name: cluster01
>     crypto_cipher: aes256
>     crypto_hash: sha512
>     transport: knet
> }
> logging {
>     fileline: off
>     to_stderr: no
>     to_logfile: no
>     to_syslog: yes
>     debug: off
>     timestamp: on
>     logger_subsys {
>         subsys: QUORUM
>         debug: off
>     }
> }
> quorum {
>     provider: corosync_votequorum
>     two_node: 1
>     expected_votes: 2
> }
> nodelist {
>     node {
>         name: node01
>         nodeid: 01
>         ring0_addr: 10.0.0.10
>     }
>     node {
>         name: node02
>         nodeid: 02
>         ring0_addr: 10.0.0.11
>     }
> }
>
> // crm config show
> node 1: node01 \
>     attributes standby=off
> node 2: node02 \
>     attributes standby=off maintenance=off
> primitive IP-clusterC1 IPaddr2 \
>     params ip=10.0.0.20 nic=eth0 cidr_netmask=24 \
>     meta migration-threshold=2 target-role=Started is-managed=true \
>     op monitor interval=20 timeout=60 on-fail=restart
> primitive IP-clusterC2 IPaddr2 \
>     params ip=10.0.0.21 nic=eth0 cidr_netmask=24 \
>     meta migration-threshold=2 target-role=Started is-managed=true \
>     op monitor interval=20 timeout=60 on-fail=restart
> location STICKY-IP-clusterC1 IP-clusterC1 100: node01
> location STICKY-IP-clusterC2 IP-clusterC2 100: node02
> property cib-bootstrap-options: \
>     have-watchdog=false \
>     dc-version=2.0.1-9e909a5bdd \
>     cluster-infrastructure=corosync \
>     cluster-name=cluster01 \
>     stonith-enabled=no \
>     no-quorum-policy=ignore \
>     last-lrm-refresh=1632230917
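
When a cluster is stuck in that state, it might help to record each
node's local view with the stock corosync tools before restarting
anything; a diagnostic sketch, not something the report above mentions
running:

  # quorum and membership as seen by the local node
  corosync-quorumtool -s

  # status of each configured knet link
  corosync-cfgtool -s

  # runtime statistics kept by corosync 3.x
  corosync-cmapctl -m stats

Comparing that output across nodes should show whether corosync itself
disagrees about membership or whether only pacemaker's view is stale.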