Re: [ClusterLabs] Corosync node gets unique Ring ID

2021-01-27 Thread Igor Tverdovskiy
Hi Christine,

Thanks for your input!

> It worries me that corosync-quorumtool behaves differently on some nodes
> - some show names, some just IP addresses. That could be a cause of some
> inconsistency.

I hadn't noticed this; we use static node names (not hostnames), so
this is even stranger.

> So what seems to be happening is that the cluster is being partitioned
> somehow (I can't tell why, that's something you'll need to investigate)
> and corosync isn't recovering very well from it. One of the things that
> can make this happen is doing "ifdown" - which that old version of
> corosync doesn't cope with very well. Even if that's not exactly what
> you are doing (and I see no reason to believe you are) I do wonder if
> something similar is happening by other means (NetworkManager perhaps?)

These are Proxmox VMs on one physical host; no changes to filtering
were made for sure, so network issues are largely ruled out.
As for "ifdown", we do not touch it either, but yes, NetworkManager
is configured there.
I have disabled reloading (since it is not necessary for the current
reconfiguration approach) and will keep an eye on it.
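
(For reference, one way to keep NetworkManager away from the corosync ring
interface is to mark it unmanaged. A minimal sketch, assuming the ring runs
over "eth1" - the interface name and file name are placeholders, not our
actual setup:

  # /etc/NetworkManager/conf.d/99-cluster-unmanaged.conf
  [keyfile]
  unmanaged-devices=interface-name:eth1

  # or, at runtime only:
  nmcli device set eth1 managed no

NetworkManager has to re-read its configuration for the conf.d file to take
effect.)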

> Oh and also, try to upgrade to corosync 2.4.5 at least. I'm sure that
> will help.

We froze updates at some point because a newer Pacemaker release
changed the "stickiness=-1" logic.
> resource-stickiness processing changed somehow in Pacemaker 1.1.16:
> resource-stickiness=-1 moves cloned resources from one node to another
> constantly.
> To be fair, resource-stickiness=0 was not working as expected even in
> previous versions, so -1 was used as a workaround.

However, at the moment we do not use cloned resources, so we will
consider updating. Thanks.
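
(For context, the stickiness default is normally set cluster-wide via
rsc_defaults; a hedged sketch with crmsh - the value 100 is only an
illustration, not the value we actually use:

  crm configure rsc_defaults resource-stickiness=100
  crm configure show | grep -A1 rsc_defaults

A positive value discourages Pacemaker from moving resources once they are
placed, which is the usual alternative to the -1 workaround.)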


On Wed, Jan 27, 2021 at 10:18 AM Christine Caulfield
 wrote:
>
> A few things really stand out from this report; I think the inconsistent
> ring_id is just a symptom.
>
> It worries me that corosync-quorumtool behaves differently on some nodes
> - some show names, some just IP addresses. That could be a cause of some
> inconsistency.
>
> Also the messages
> "
> Jan 26 02:10:45 [13191] destination-standby corosync warning [MAIN  ]
> Totem is unable to form a cluster because of an operating system or
> network fault. The most common cause of this message is that the local
> firewall is configured improperly.
> Jan 26 02:10:47 [13191] destination-standby corosync warning [MAIN  ]
> Totem is unable to form a cluster because of an operating system or
> network fault. The most common cause of this message is that the local
> firewall is configured improperly.
> Jan 26 02:10:48 [13191] destination-standby corosync debug   [TOTEM ]
> The consensus timeout expired.
> Jan 26 02:10:48 [13191] destination-standby corosync debug   [TOTEM ]
> entering GATHER state from 3(The consensus timeout expired.).
> Jan 26 02:10:48 [13191] destination-standby corosync warning [MAIN  ]
> Totem is unable to form a cluster because of an operating system or
> network fault. The most common cause of this message is that the local
> firewall is configured improperly."
>
> are a BAD sign. All of this is contributing to the problems, and so is
> the timeout on reload (which is really not a good thing). Those messages
> are not caused by the reload; they are caused by some networking problem.
>
> So what seems to be happening is that the cluster is being partitioned
> somehow (I can't tell why, that's something you'll need to investigate)
> and corosync isn't recovering very well from it. One of the things that
> can make this happen is doing "ifdown" - which that old version of
> corosync doesn't cope with very well. Even if that's not exactly what
> you are doing (and I see no reason to believe you are) I do wonder if
> something similar is happening by other means (NetworkManager perhaps?)
>
> So firstly, check the networking setup and be sure that all the nodes are
> consistently configured, and check that the network is not closing down
> interfaces or ports at the time of the incident.
>
> Oh and also, try to upgrade to corosync 2.4.5 at least. I'm sure that
> will help.
>
> Chrissie
>
>
>
> On 26/01/2021 02:45, Igor Tverdovskiy wrote:
> > Hi All,
> >
> >  > pacemakerd -$
> > Pacemaker 1.1.15-11.el7
> >
> >  > corosync -v
> > Corosync Cluster Engine, version '2.4.0'
> >
> >  > rpm -qi libqb
> > Name: libqb
> > Version : 1.0.1
> >
> > Please assist. We recently hit a strange bug (I suppose) where one of the
> > cluster nodes gets a "Ring ID" different from the other nodes, for example
> > after a corosync config reload, e.g.:
> >
> >
> > *Affected node:*
> > 
> > (target.standby)> sudo corosync-quorumtool
> > Quorum information
> > --
> > Date: Tue Jan 26 01:58:54 2021
> > Quorum provider:  corosync_votequorum
> > Nodes:5
> > Node ID:  5
> > Ring ID: *7/59268* <<<
> > Quorate:  Yes
> >
> > Votequorum information
> > --
> > Expected votes:   5
> > 

Re: [ClusterLabs] Corosync node gets unique Ring ID

2021-01-27 Thread Christine Caulfield
A few things really stand out from this report; I think the inconsistent
ring_id is just a symptom.


It worries me that corosync-quorumtool behaves differently on some nodes 
- some show names, some just IP addresses. That could be a cause of some 
inconsistency.
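
(For reference, whether the tool prints a name or an address usually depends
on name resolution on each node and on how the nodes are declared in
corosync.conf. A hedged sketch of one nodelist entry - the address and name
are taken from the quorumtool output quoted below, the rest is assumed:

  nodelist {
      node {
          ring0_addr: 10.27.77.106
          name: dispatching-sbc
          nodeid: 7
      }
  }

Keeping these entries, and /etc/hosts, identical on every node is one way to
get a consistent display.)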


Also the messages
"
Jan 26 02:10:45 [13191] destination-standby corosync warning [MAIN  ] 
Totem is unable to form a cluster because of an operating system or 
network fault. The most common cause of this message is that the local 
firewall is configured improperly.
Jan 26 02:10:47 [13191] destination-standby corosync warning [MAIN  ] 
Totem is unable to form a cluster because of an operating system or 
network fault. The most common cause of this message is that the local 
firewall is configured improperly.
Jan 26 02:10:48 [13191] destination-standby corosync debug   [TOTEM ] 
The consensus timeout expired.
Jan 26 02:10:48 [13191] destination-standby corosync debug   [TOTEM ] 
entering GATHER state from 3(The consensus timeout expired.).
Jan 26 02:10:48 [13191] destination-standby corosync warning [MAIN  ] 
Totem is unable to form a cluster because of an operating system or 
network fault. The most common cause of this message is that the local 
firewall is configured improperly."


are a BAD sign. All of this is contributing to the problems, and so is
the timeout on reload (which is really not a good thing). Those messages
are not caused by the reload; they are caused by some networking problem.


So what seems to be happening is that the cluster is being partitioned 
somehow (I can't tell why, that's something you'll need to investigate) 
and corosync isn't recovering very well from it. One of the things that 
can make this happen is doing "ifdown" - which that old version of 
corosync doesn't cope with very well. Even if that's not exactly what
you are doing (and I see no reason to believe you are) I do wonder if
something similar is happening by other means (NetworkManager perhaps?)


So firstly, check the networking setup and be sure that all the nodes are
consistently configured, and check that the network is not closing down
interfaces or ports at the time of the incident.
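
(For reference, corosync uses UDP ports 5404-5405 by default. On a
firewalld-based system, a hedged way to check/open them is the predefined
high-availability service - verify the service definition exists on your
distribution:

  firewall-cmd --permanent --add-service=high-availability
  firewall-cmd --reload
  firewall-cmd --list-services

If firewalld is not in use, check iptables/nftables for rules dropping UDP
5404-5405 between the nodes.)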


Oh and also, try to upgrade to corosync 2.4.5 at least. I'm sure that 
will help.


Chrissie



On 26/01/2021 02:45, Igor Tverdovskiy wrote:

Hi All,

 > pacemakerd -$
Pacemaker 1.1.15-11.el7

 > corosync -v
Corosync Cluster Engine, version '2.4.0'

 > rpm -qi libqb
Name        : libqb
Version     : 1.0.1

Please assist. We recently hit a strange bug (I suppose) where one of the
cluster nodes gets a "Ring ID" different from the other nodes, for example
after a corosync config reload, e.g.:



*Affected node:*

(target.standby)> sudo corosync-quorumtool
Quorum information
--
Date:             Tue Jan 26 01:58:54 2021
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          5
Ring ID: *7/59268* <<<
Quorate:          Yes

Votequorum information
--
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3
Flags:            Quorate

Membership information
--
     Nodeid      Votes Name
          7          1 dispatching-sbc
          8          1 dispatching-sbc-2-6
          3          1 10.27.77.202
          5          1 cassandra-3 (local)
          6          1 10.27.77.205



*OK nodes:*
 > sudo corosync-quorumtool
Quorum information
--
Date:             Tue Jan 26 01:59:13 2021
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          8
Ring ID: *7/59300* <<<
Quorate:          Yes

Votequorum information
--
Expected votes:   5
Highest expected: 5
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
--
     Nodeid      Votes Name
          7          1 10.27.77.106
          8          1 10.27.77.107 (local)
          3          1 10.27.77.202
          6          1 10.27.77.205
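
(As a quick cross-check, corosync's own view of the rings can be printed on
every node; a hedged sketch, the exact output differs slightly between
corosync versions:

  corosync-cfgtool -s

It prints the local node ID and, for each configured ring, the bound address
and fault status, which helps confirm whether the affected node still sees a
healthy ring.)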



Also strange is that *crm status shows only two of five nodes* on the 
affected node, but at the same time

*"sudo crm_node -l" shows all 5 nodes as members*.

(target.standby)> sudo crm_node -l
5 target.standby member
7 target.dsbc1 member
3 target.sip member
8 target.dsbc member
6 target.sec.sip member

---

(target.standby)> sudo crm status
Stack: corosync
Current DC: target.sip (version 1.1.15-11.el7-e174ec8) - partition with 
quorum
Last updated: Tue Jan 26 02:08:02 2021          Last change: Mon Jan 25 
14:27:18 2021 by root via crm_node on target.sec.sip


2 nodes and 7 resources configured

Online: [ target.sec.sip target.sip ] <<

Full list of resources:
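
(One hedged way to see what the affected node's CIB itself contains,
independent of how crm status summarizes it, is to query the nodes and status
sections directly:

  cibadmin -Q -o nodes
  cibadmin -Q -o status | grep node_state

Comparing this between the affected node and a healthy one should show whether
the missing nodes are absent from the CIB or just not reported as online.)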


The issue here is that crm configure operations fail with a timeout error:

(target.standby)> sudo crm configure property maintenance-mode=true
*Call cib_apply_diff failed (-62): Timer expired*
ERROR: could not patch cib (rc=62)
INFO: offending 

Re: [ClusterLabs] Corosync node gets unique Ring ID

2021-01-25 Thread Igor Tverdovskiy
BTW, I checked the log; the corosync reload failed:
-- Logs begin at Thu 2021-01-14 15:41:10 UTC, end at Tue 2021-01-26
01:42:48 UTC. --
Jan 22 14:33:09 destination-standby corosync[13180]: Starting Corosync
Cluster Engine (corosync): [ OK ]
Jan 22 14:33:09 destination-standby systemd[1]: Started Corosync Cluster Engine.
Jan 26 01:41:18 destination-standby systemd[1]: Reloading Corosync
Cluster Engine.
Jan 26 01:42:48 destination-standby systemd[1]: corosync.service
reload operation timed out. Stopping.
Jan 26 01:42:48 destination-standby systemd[1]: Reload failed for
Corosync Cluster Engine.

However, the issue started right after the reload (see corosync.log in
the previous email).
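
(When a reload times out like this, one hedged way to see what the running
corosync actually holds is to dump its in-memory configuration map on each
node and compare it with corosync.conf on disk; key names can differ slightly
between corosync versions:

  corosync-cmapctl | grep '^nodelist\.'
  corosync-cmapctl | grep -E 'runtime.*members'
)
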
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/