Re: [ClusterLabs] Corosync node gets unique Ring ID
Hi Christine,

Thanks for your input!

> It worries me that corosync-quorumtool behaves differently on some nodes
> - some show names, some just IP addresses. That could be a cause of some
> inconsistency.

We haven't noticed this; we use static node names (not hostnames), so this
is even more strange.

> So what seems to be happening is that the cluster is being partitioned
> somehow (I can't tell why, that's something you'll need to investigate)
> and corosync isn't recovering very well from it. One of the things that
> can make this happen is doing "ifdown" - which that old version of
> corosync doesn't cope with very well. Even if that's not exactly what
> you are doing (and I see no reason to believe you are) I do wonder if
> something similar is happening by other means - NetworkManager perhaps?

These are Proxmox VMs on one physical host; no changes to packet filtering
were made, for sure, so network issues are largely excluded. As for
"ifdown", we don't touch it either, but yes, NetworkManager is configured
there. I have disabled reloading (since it is not necessary for the
current reconfiguration approach) and will keep an eye on it.

> Oh and also, try to upgrade to corosync 2.4.5 at least. I'm sure that
> will help.

We froze updates at some point because the new Pacemaker release changed
the "stickiness=-1" logic:

> resource-stickiness processing has been changed somehow in Pacemaker
> 1.1.16, as resource-stickiness=-1 moves cloned resources from one node
> to another constantly.
> Fair to mention, resource-stickiness=0 is not working as expected even
> in previous versions, so -1 was used as a workaround.

However, at the moment we do not use cloned resources, so we will consider
updating, thanks.

On Wed, Jan 27, 2021 at 10:18 AM Christine Caulfield wrote:
> A few things really stand out from this report; I think the inconsistent
> ring_id is just a symptom.
>
> It worries me that corosync-quorumtool behaves differently on some nodes
> - some show names, some just IP addresses. That could be a cause of some
> inconsistency.
>
> Also the messages
> "
> Jan 26 02:10:45 [13191] destination-standby corosync warning [MAIN ]
> Totem is unable to form a cluster because of an operating system or
> network fault. The most common cause of this message is that the local
> firewall is configured improperly.
> Jan 26 02:10:47 [13191] destination-standby corosync warning [MAIN ]
> Totem is unable to form a cluster because of an operating system or
> network fault. The most common cause of this message is that the local
> firewall is configured improperly.
> Jan 26 02:10:48 [13191] destination-standby corosync debug [TOTEM ]
> The consensus timeout expired.
> Jan 26 02:10:48 [13191] destination-standby corosync debug [TOTEM ]
> entering GATHER state from 3(The consensus timeout expired.).
> Jan 26 02:10:48 [13191] destination-standby corosync warning [MAIN ]
> Totem is unable to form a cluster because of an operating system or
> network fault. The most common cause of this message is that the local
> firewall is configured improperly."
>
> are a BAD sign. All of this is contributing to the problems, as is the
> timeout on reload (which is really not a good thing). Those messages
> are not caused by the reload; they are caused by some networking problem.
>
> So what seems to be happening is that the cluster is being partitioned
> somehow (I can't tell why, that's something you'll need to investigate)
> and corosync isn't recovering very well from it. One of the things that
> can make this happen is doing "ifdown" - which that old version of
> corosync doesn't cope with very well. Even if that's not exactly what
> you are doing (and I see no reason to believe you are) I do wonder if
> something similar is happening by other means - NetworkManager perhaps?
>
> So firstly, check the networking setup and be sure that all the nodes
> are consistently configured, and check that the network is not closing
> down interfaces or ports at the time of the incident.
>
> Oh and also, try to upgrade to corosync 2.4.5 at least. I'm sure that
> will help.
>
> Chrissie
>
> On 26/01/2021 02:45, Igor Tverdovskiy wrote:
> > Hi All,
> >
> > > pacemakerd -$
> > Pacemaker 1.1.15-11.el7
> >
> > > corosync -v
> > Corosync Cluster Engine, version '2.4.0'
> >
> > > rpm -qi libqb
> > Name    : libqb
> > Version : 1.0.1
> >
> > Please assist. We recently faced a strange bug (I suppose), where one
> > of the cluster nodes gets a "Ring ID" different from the others, for
> > example after a corosync config reload, e.g.:
> >
> > *Affected node:*
> >
> > (target.standby)> sudo corosync-quorumtool
> > Quorum information
> > ------------------
> > Date:             Tue Jan 26 01:58:54 2021
> > Quorum provider:  corosync_votequorum
> > Nodes:            5
> > Node ID:          5
> > Ring ID:          *7/59268*  <<<
> > Quorate:          Yes
> >
> > Votequorum information
> > ----------------------
> > Expected votes:   5
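A practical way to rule out the NetworkManager theory discussed above is to take the totem interface out of NetworkManager's control entirely, so nothing can ifdown it behind corosync's back. A sketch of such a drop-in; the file name and the interface name ens18 are hypothetical examples, not taken from this thread:

```ini
# /etc/NetworkManager/conf.d/99-unmanaged-cluster.conf (example path)
# Tell NetworkManager to leave the cluster interface alone; replace ens18
# with the interface corosync actually binds to.
[keyfile]
unmanaged-devices=interface-name:ens18
```

After restarting NetworkManager, `nmcli device` should report the interface as unmanaged, and NM reloads or restarts can no longer take the totem link down.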
Re: [ClusterLabs] Corosync node gets unique Ring ID
A few things really stand out from this report; I think the inconsistent
ring_id is just a symptom.

It worries me that corosync-quorumtool behaves differently on some nodes -
some show names, some just IP addresses. That could be a cause of some
inconsistency.

Also the messages

"
Jan 26 02:10:45 [13191] destination-standby corosync warning [MAIN ]
Totem is unable to form a cluster because of an operating system or
network fault. The most common cause of this message is that the local
firewall is configured improperly.
Jan 26 02:10:47 [13191] destination-standby corosync warning [MAIN ]
Totem is unable to form a cluster because of an operating system or
network fault. The most common cause of this message is that the local
firewall is configured improperly.
Jan 26 02:10:48 [13191] destination-standby corosync debug [TOTEM ]
The consensus timeout expired.
Jan 26 02:10:48 [13191] destination-standby corosync debug [TOTEM ]
entering GATHER state from 3(The consensus timeout expired.).
Jan 26 02:10:48 [13191] destination-standby corosync warning [MAIN ]
Totem is unable to form a cluster because of an operating system or
network fault. The most common cause of this message is that the local
firewall is configured improperly.
"

are a BAD sign. All of this is contributing to the problems, as is the
timeout on reload (which is really not a good thing). Those messages are
not caused by the reload; they are caused by some networking problem.

So what seems to be happening is that the cluster is being partitioned
somehow (I can't tell why, that's something you'll need to investigate)
and corosync isn't recovering very well from it. One of the things that
can make this happen is doing "ifdown" - which that old version of
corosync doesn't cope with very well. Even if that's not exactly what you
are doing (and I see no reason to believe you are) I do wonder if
something similar is happening by other means - NetworkManager perhaps?
So firstly, check the networking setup and be sure that all the nodes are
consistently configured, and check that the network is not closing down
interfaces or ports at the time of the incident.

Oh and also, try to upgrade to corosync 2.4.5 at least. I'm sure that will
help.

Chrissie

On 26/01/2021 02:45, Igor Tverdovskiy wrote:
> Hi All,
>
> > pacemakerd -$
> Pacemaker 1.1.15-11.el7
>
> > corosync -v
> Corosync Cluster Engine, version '2.4.0'
>
> > rpm -qi libqb
> Name    : libqb
> Version : 1.0.1
>
> Please assist. We recently faced a strange bug (I suppose), where one of
> the cluster nodes gets a "Ring ID" different from the others, for
> example after a corosync config reload, e.g.:
>
> *Affected node:*
>
> (target.standby)> sudo corosync-quorumtool
> Quorum information
> ------------------
> Date:             Tue Jan 26 01:58:54 2021
> Quorum provider:  corosync_votequorum
> Nodes:            5
> Node ID:          5
> Ring ID:          *7/59268*  <<<
> Quorate:          Yes
>
> Votequorum information
> ----------------------
> Expected votes:   5
> Highest expected: 5
> Total votes:      5
> Quorum:           3
> Flags:            Quorate
>
> Membership information
> ----------------------
>     Nodeid      Votes Name
>          7          1 dispatching-sbc
>          8          1 dispatching-sbc-2-6
>          3          1 10.27.77.202
>          5          1 cassandra-3 (local)
>          6          1 10.27.77.205
>
> *OK nodes:*
>
> > sudo corosync-quorumtool
> Quorum information
> ------------------
> Date:             Tue Jan 26 01:59:13 2021
> Quorum provider:  corosync_votequorum
> Nodes:            4
> Node ID:          8
> Ring ID:          *7/59300*  <<<
> Quorate:          Yes
>
> Votequorum information
> ----------------------
> Expected votes:   5
> Highest expected: 5
> Total votes:      4
> Quorum:           3
> Flags:            Quorate
>
> Membership information
> ----------------------
>     Nodeid      Votes Name
>          7          1 10.27.77.106
>          8          1 10.27.77.107 (local)
>          3          1 10.27.77.202
>          6          1 10.27.77.205
>
> Also strange is that *crm status shows only two of five nodes* on the
> affected node, while at the same time *"sudo crm_node -l" shows all 5
> nodes as members*.
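Since the mismatch only shows up when the two outputs are diffed by eye, a small helper that pulls the Ring ID out of corosync-quorumtool output makes the cross-node check scriptable (e.g. run over ssh on each node and compared). A minimal sketch; the sample text is pasted from the affected node's output above:

```shell
# Sketch: extract the "Ring ID" field from corosync-quorumtool output so
# it can be compared across nodes. Sample output pasted from the report.
quorum_output='Quorum information
------------------
Date:             Tue Jan 26 01:58:54 2021
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          5
Ring ID:          7/59268
Quorate:          Yes'

# Split on "colon + spaces"; field 2 of the line starting "Ring ID" is
# the value.
ring_id=$(printf '%s\n' "$quorum_output" | awk -F': *' '/^Ring ID/ {print $2}')
echo "$ring_id"
```

Running the same extraction against live `corosync-quorumtool` output on every node would flag the 7/59268 vs 7/59300 split immediately.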
> (target.standby)> sudo crm_node -l
> 5 target.standby member
> 7 target.dsbc1 member
> 3 target.sip member
> 8 target.dsbc member
> 6 target.sec.sip member
>
> ---
>
> (target.standby)> sudo crm status
> Stack: corosync
> Current DC: target.sip (version 1.1.15-11.el7-e174ec8) - partition with
> quorum
> Last updated: Tue Jan 26 02:08:02 2021
> Last change: Mon Jan 25 14:27:18 2021 by root via crm_node on
> target.sec.sip
>
> 2 nodes and 7 resources configured
>
> Online: [ target.sec.sip target.sip ]  <<
>
> Full list of resources:
>
> The issue here is that crm configure operations fail with a timeout
> error:
>
> (target.standby)> sudo crm configure property maintenance-mode=true
> *Call cib_apply_diff failed (-62): Timer expired*
> ERROR: could not patch cib (rc=62)
> INFO: offending
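The crm_node vs crm status disagreement can also be checked mechanically: `crm_node -l` prints one `<id> <name> <state>` line per node, so counting the "member" entries and comparing the count with the `Online:` list from `crm status` exposes the membership layer and the CIB disagreeing (5 vs 2 here). A sketch using the output pasted above:

```shell
# Count the nodes that the membership layer reports as "member"
# (sample pasted from the report's "crm_node -l" output).
crm_node_output='5 target.standby member
7 target.dsbc1 member
3 target.sip member
8 target.dsbc member
6 target.sec.sip member'

# Each membership line ends in " member"; grep -c counts matching lines.
members=$(printf '%s\n' "$crm_node_output" | grep -c ' member$')
echo "$members"
```

On a healthy node both counts match; on the affected node this prints 5 while crm status shows only 2 nodes online.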
Re: [ClusterLabs] Corosync node gets unique Ring ID
BTW, I checked the log; the corosync reload failed:

-- Logs begin at Thu 2021-01-14 15:41:10 UTC, end at Tue 2021-01-26 01:42:48 UTC. --
Jan 22 14:33:09 destination-standby corosync[13180]: Starting Corosync Cluster Engine (corosync): [ OK ]
Jan 22 14:33:09 destination-standby systemd[1]: Started Corosync Cluster Engine.
Jan 26 01:41:18 destination-standby systemd[1]: Reloading Corosync Cluster Engine.
Jan 26 01:42:48 destination-standby systemd[1]: corosync.service reload operation timed out. Stopping.
Jan 26 01:42:48 destination-standby systemd[1]: Reload failed for Corosync Cluster Engine.

However, the issue started exactly after the reload (see corosync.log in
the previous email).

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
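One detail worth noting in the journal excerpt: the reload was killed exactly 90 seconds after it started, which lines up with systemd's default job timeout (DefaultTimeoutStartSec=90s). That suggests corosync never completed the reload at all, rather than merely running slowly. A quick check of the gap between the two journal timestamps:

```shell
# Gap between "Reloading" (01:41:18) and "reload operation timed out"
# (01:42:48) in the journal excerpt.
to_secs() {
  # Convert HH:MM:SS to seconds since midnight; 10# forces base-10 so
  # leading zeros are not parsed as octal.
  IFS=: read -r h m s <<< "$1"
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}
gap=$(( $(to_secs 01:42:48) - $(to_secs 01:41:18) ))
echo "${gap}s"
```

A 90 s gap matching the systemd default is consistent with the reload hanging until systemd gave up, which fits Chrissie's point that the reload timeout is a symptom of the underlying networking problem.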