A few things really stand out from this report; I think the inconsistent ring_id is just a symptom.

It worries me that corosync-quorumtool behaves differently across the nodes - some show names, some just IP addresses. That could be one cause of the inconsistency.
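A quick way to check for that kind of drift (just a sketch, assuming the config should be identical on all five nodes and the names should resolve everywhere) is to compare the config and the lookups on every node:

  # the nodelist should hash identically on all five nodes
  md5sum /etc/corosync/corosync.conf
  # if I remember right, quorumtool resolves the ring address to a name and falls
  # back to the raw IP when that fails, so inconsistent /etc/hosts or DNS across
  # the nodes would explain the mixed output
  getent hosts 10.27.77.106 10.27.77.107 10.27.77.202 10.27.77.204 10.27.77.205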

Also the messages
"
Jan 26 02:10:45 [13191] destination-standby corosync warning [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly. Jan 26 02:10:47 [13191] destination-standby corosync warning [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly. Jan 26 02:10:48 [13191] destination-standby corosync debug [TOTEM ] The consensus timeout expired. Jan 26 02:10:48 [13191] destination-standby corosync debug [TOTEM ] entering GATHER state from 3(The consensus timeout expired.). Jan 26 02:10:48 [13191] destination-standby corosync warning [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly."

are a BAD sign. All of this is contributing to the problems, as is the timeout on reload (which is really not a good thing). Those messages are not caused by the reload; they are caused by some networking problem.

So what seems to be happening is that the cluster is being partitioned somehow (I can't tell why; that's something you'll need to investigate) and corosync isn't recovering very well from it. One of the things that can make this happen is doing "ifdown", which that old version of corosync doesn't cope with very well. Even if that's not exactly what you are doing (and I see no reason to believe you are), I do wonder if something similar is happening by other means - NetworkManager perhaps?
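If you want to rule that out, something along these lines (a sketch, assuming systemd and NetworkManager on EL7), run against the time window of the incident, is usually enough:

  nmcli device status
  journalctl -u NetworkManager --since "2021-01-26 01:30" --until "2021-01-26 02:15"
  # link flaps also show up in the kernel log
  journalctl -k --since "2021-01-26 01:30" | grep -iE 'link|carrier'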

So firstly, check the networking setup, make sure that all the nodes are consistently configured, and check that the network is not closing down interfaces or ports at the time of the incident.
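With udpu that means UDP port 5405 must be open in both directions between all five nodes. A rough checklist (assuming firewalld/iptables on EL7):

  sudo firewall-cmd --list-all      # or: sudo iptables -L -n -v
  sudo ss -unlp | grep corosync     # is corosync bound to the expected address/port?
  sudo corosync-cfgtool -s          # ring status as corosync itself sees it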

Oh and also, try to upgrade to corosync 2.4.5 at least. I'm sure that will help.
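On EL7 that should just be a package update, although which version you actually get depends on your repositories:

  sudo yum update corosync corosynclib
  corosync -v   # confirm the new version on every node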

Chrissie



On 26/01/2021 02:45, Igor Tverdovskiy wrote:
Hi All,

 > pacemakerd -$
Pacemaker 1.1.15-11.el7

 > corosync -v
Corosync Cluster Engine, version '2.4.0'

 > rpm -qi libqb
Name        : libqb
Version     : 1.0.1

Please assist. I recently faced a strange bug (I suppose), where one of the cluster nodes ends up with a "Ring ID" different from the others, for example after a corosync config reload, e.g.:


*Affected node:*
============
(target.standby)> sudo corosync-quorumtool
Quorum information
------------------
Date:             Tue Jan 26 01:58:54 2021
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          5
Ring ID: *7/59268* <<<<<<<
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3
Flags:            Quorate

Membership information
----------------------
     Nodeid      Votes Name
          7          1 dispatching-sbc
          8          1 dispatching-sbc-2-6
          3          1 10.27.77.202
          5          1 cassandra-3 (local)
          6          1 10.27.77.205

============

*OK nodes:*
 > sudo corosync-quorumtool
Quorum information
------------------
Date:             Tue Jan 26 01:59:13 2021
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          8
Ring ID: *7/59300* <<<<<<<
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
----------------------
     Nodeid      Votes Name
          7          1 10.27.77.106
          8          1 10.27.77.107 (local)
          3          1 10.27.77.202
          6          1 10.27.77.205
============


Also strange is that *crm status shows only two of five nodes* on the affected node, but at the same time
*"sudo crm_node -l" shows all 5 nodes as members*.
============
(target.standby)> sudo crm_node -l
5 target.standby member
7 target.dsbc1 member
3 target.sip member
8 target.dsbc member
6 target.sec.sip member

-------

(target.standby)> sudo crm status
Stack: corosync
Current DC: target.sip (version 1.1.15-11.el7-e174ec8) - partition with quorum
Last updated: Tue Jan 26 02:08:02 2021          Last change: Mon Jan 25 14:27:18 2021 by root via crm_node on target.sec.sip

2 nodes and 7 resources configured

Online: [ target.sec.sip target.sip ] <<<<<<

Full list of resources:
============
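For reference, corosync's own view of the current membership (assuming corosync-cmapctl is available) can be dumped on the affected node and compared with the crm_node output above:

(target.standby)> sudo corosync-cmapctl | grep members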

The issue here is that crm configure operations fail with a timeout error:
============
(target.standby)> sudo crm configure property maintenance-mode=true
*Call cib_apply_diff failed (-62): Timer expired*
ERROR: could not patch cib (rc=62)
INFO: offending xml diff: <diff format="2">
  <change operation="modify" path="/cib/configuration/crm_config/cluster_property_set[@id=&apos;cib-bootstrap-options&apos;]/nvpair[@id=&apos;cib-bootstrap-options-maintenance-mode&apos;]">
     <change-list>
       <change-attr name="value" operation="set" value="true"/>
     </change-list>
     <change-result>
      <nvpair name="maintenance-mode" value="true" id="cib-bootstrap-options-maintenance-mode"/>
     </change-result>
   </change>
</diff>
============


In the log there are errors that totem is unable to form a cluster:
============
(target.standby)

*The first entries seem to be caused by the config reload (corosync-cfgtool -R):*
Jan 26 01:40:35 [13190] destination-standby corosync notice  [CFG   ] Config reload requested by node 7
Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ] removing dynamic member 10.27.77.106 for ring 0
Jan 26 01:40:35 [13190] destination-standby corosync notice  [TOTEM ] removing UDPU member {10.27.77.106}
Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ] Closing socket to: {10.27.77.106}
Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ] removing dynamic member 10.27.77.107 for ring 0
Jan 26 01:40:35 [13190] destination-standby corosync notice  [TOTEM ] removing UDPU member {10.27.77.107}
Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ] Closing socket to: {10.27.77.107}
Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ] removing dynamic member 10.27.77.204 for ring 0
Jan 26 01:40:35 [13190] destination-standby corosync notice  [TOTEM ] removing UDPU member {10.27.77.204}
Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ] Closing socket to: {10.27.77.204}
Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ] Configuration reloaded. Dumping actual totem config.
Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ] Token Timeout (5000 ms) retransmit timeout (1190 ms)
Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ] token hold (942 ms) retransmits before loss (4 retrans)
Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ] join (50 ms) send_join (0 ms) consensus (6000 ms) merge (200 ms)
Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ] downcheck (1000 ms) fail to recv const (2500 msgs)
Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ] seqno unchanged const (30 rotations) Maximum network MTU 1369
Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ] window size per rotation (50 messages) maximum messages per rotation (17 messages)
Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ] missed count const (5 messages)
Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ] RRP token expired timeout (1190 ms)
Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ] RRP token problem counter (2000 ms)
Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ] RRP threshold (10 problem count)
Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ] RRP multicast threshold (100 problem count)
Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ] RRP automatic recovery check timeout (1000 ms)
Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ] RRP mode set to none.
Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ] heartbeat_failures_allowed (0)
Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ] max_network_delay (50 ms)
Jan 26 01:40:35 [13190] destination-standby corosync debug   [VOTEQ ] Reading configuration (runtime: 1)
Jan 26 01:40:35 [13190] destination-standby corosync debug   [VOTEQ ] No nodelist defined or our node is not in the nodelist
Jan 26 01:40:35 [13190] destination-standby corosync debug   [VOTEQ ] ev_tracking=0, ev_tracking_barrier = 0: expected_votes = 2
Jan 26 01:40:35 [13190] destination-standby corosync debug   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ] got nodeinfo message from cluster node 8
Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ] nodeinfo message[8]: votes: 1, expected: 5 flags: 1
Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ] total_votes=5, expected_votes=2
Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ] Sending expected votes callback
Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ] node 3 state=1, votes=1, expected=5
Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ] node 5 state=1, votes=1, expected=5
Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ] node 6 state=1, votes=1, expected=5
Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ] node 7 state=1, votes=1, expected=5
Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ] node 8 state=1, votes=1, expected=5
Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ] lowest node id: 3 us: 5
Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ] highest node id: 8 us: 5
Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ] got nodeinfo message from cluster node 8
Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Jan 26 01:40:38 [13191] destination-standby corosync error   [TOTEM ] FAILED TO RECEIVE
Jan 26 01:40:38 [13191] destination-standby corosync debug   [TOTEM ] entering GATHER state from 6(failed to receive).
Jan 26 01:40:44 [13191] destination-standby corosync debug   [TOTEM ] entering GATHER state from 0(consensus timeout).
Jan 26 01:40:48 [13240] destination-standby       crmd:    debug: throttle_cib_load:    cib load: 0.000333 (1 ticks in 30s)
Jan 26 01:40:48 [13240] destination-standby       crmd:    debug: throttle_load_avg:    Current load is 1.010000 (full: 1.01 0.52 0.45 2/471 10513)
Jan 26 01:40:48 [13240] destination-standby       crmd:    debug: throttle_io_load:     Current IO load is 0.000000
Jan 26 01:40:55 [13191] destination-standby corosync debug   [TOTEM ] The consensus timeout expired.
Jan 26 01:40:55 [13191] destination-standby corosync debug   [TOTEM ] entering GATHER state from 3(The consensus timeout expired.).
Jan 26 01:41:06 [13191] destination-standby corosync debug   [TOTEM ] The consensus timeout expired.
Jan 26 01:41:06 [13191] destination-standby corosync debug   [TOTEM ] entering GATHER state from 3(The consensus timeout expired.).
Jan 26 01:41:17 [13191] destination-standby corosync debug   [TOTEM ] The consensus timeout expired.
Jan 26 01:41:17 [13191] destination-standby corosync debug   [TOTEM ] entering GATHER state from 3(The consensus timeout expired.).
Jan 26 01:41:18 [13191] destination-standby corosync debug   [QB    ] IPC credentials authenticated (13191-14633-16)
Jan 26 01:41:18 [13191] destination-standby corosync debug   [QB    ] connecting to client [14633]
Jan 26 01:41:18 [13191] destination-standby corosync debug   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Jan 26 01:41:18 [13191] destination-standby corosync debug   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Jan 26 01:41:18 [13191] destination-standby corosync debug   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Jan 26 01:41:18 [13191] destination-standby corosync debug   [MAIN  ] connection created
Jan 26 01:41:18 [13240] destination-standby       crmd:    debug: throttle_cib_load:    cib load: 0.000000 (0 ticks in 30s)
Jan 26 01:41:18 [13240] destination-standby       crmd:    debug: throttle_load_avg:    Current load is 1.360000 (full: 1.36 0.64 0.49 1/475 14633)
Jan 26 01:41:18 [13240] destination-standby       crmd:    debug: throttle_io_load:     Current IO load is 0.000000
Jan 26 01:41:28 [13191] destination-standby corosync debug   [TOTEM ] The consensus timeout expired.
Jan 26 01:41:28 [13191] destination-standby corosync debug   [TOTEM ] entering GATHER state from 3(The consensus timeout expired.).
Jan 26 01:41:28 [13191] destination-standby corosync warning [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
Jan 26 01:41:30 [13191] destination-standby corosync warning [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.

Jan 26 02:10:42 [13191] destination-standby corosync warning [MAIN  ] *Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.*
Jan 26 02:10:42 [13235] destination-standby        cib:    debug: crm_client_new:       Connecting 0x560a4bef1f60 for uid=0 gid=0 pid=6916 id=5a414534-bc62-4544-a5d2-6deb772a6b49
Jan 26 02:10:42 [13235] destination-standby        cib:    debug: handle_new_connection:        IPC credentials authenticated (13235-6916-13)
Jan 26 02:10:42 [13235] destination-standby        cib:    debug: qb_ipcs_shm_connect:  connecting to client [6916]
Jan 26 02:10:42 [13235] destination-standby        cib:    debug: qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
Jan 26 02:10:42 [13235] destination-standby        cib:    debug: qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
Jan 26 02:10:42 [13235] destination-standby        cib:    debug: qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
Jan 26 02:10:42 [13235] destination-standby        cib:    debug: cib_acl_enabled:      CIB ACL is disabled
Jan 26 02:10:42 [13235] destination-standby        cib:    debug: qb_ipcs_dispatch_connection_request:  HUP conn (13235-6916-13)
Jan 26 02:10:42 [13235] destination-standby        cib:    debug: qb_ipcs_disconnect:   qb_ipcs_disconnect(13235-6916-13) state:2
Jan 26 02:10:42 [13235] destination-standby        cib:    debug: crm_client_destroy:   Destroying 0 events
Jan 26 02:10:42 [13235] destination-standby        cib:    debug: qb_rb_close_helper:   Free'ing ringbuffer: /dev/shm/qb-cib_rw-response-13235-6916-13-header
Jan 26 02:10:42 [13235] destination-standby        cib:    debug: qb_rb_close_helper:   Free'ing ringbuffer: /dev/shm/qb-cib_rw-event-13235-6916-13-header
Jan 26 02:10:42 [13235] destination-standby        cib:    debug: qb_rb_close_helper:   Free'ing ringbuffer: /dev/shm/qb-cib_rw-request-13235-6916-13-header
Jan 26 02:10:42 [13240] destination-standby       crmd:    debug: crm_client_new:       Connecting 0x55ce3bb0e6f0 for uid=0 gid=0 pid=6919 id=3a72b777-daa0-4b0e-acc2-fc58a07f31a6
Jan 26 02:10:42 [13240] destination-standby       crmd:    debug: handle_new_connection:        IPC credentials authenticated (13240-6919-13)
Jan 26 02:10:42 [13240] destination-standby       crmd:    debug: qb_ipcs_shm_connect:  connecting to client [6919]
Jan 26 02:10:42 [13240] destination-standby       crmd:    debug: qb_rb_open_2: shm size:131085; real_size:135168; rb->word_size:33792
Jan 26 02:10:42 [13240] destination-standby       crmd:    debug: qb_rb_open_2: shm size:131085; real_size:135168; rb->word_size:33792
Jan 26 02:10:42 [13240] destination-standby       crmd:    debug: qb_rb_open_2: shm size:131085; real_size:135168; rb->word_size:33792
Jan 26 02:10:44 [13191] destination-standby corosync warning [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
Jan 26 02:10:44 [13240] destination-standby       crmd:    debug: qb_ipcs_dispatch_connection_request:  HUP conn (13240-6125-14)
Jan 26 02:10:44 [13240] destination-standby       crmd:    debug: qb_ipcs_disconnect:   qb_ipcs_disconnect(13240-6125-14) state:2
Jan 26 02:10:44 [13240] destination-standby       crmd:    debug: crm_client_destroy:   Destroying 0 events
Jan 26 02:10:44 [13240] destination-standby       crmd:    debug: qb_rb_close_helper:   Free'ing ringbuffer: /dev/shm/qb-crmd-response-13240-6125-14-header
Jan 26 02:10:44 [13240] destination-standby       crmd:    debug: qb_rb_close_helper:   Free'ing ringbuffer: /dev/shm/qb-crmd-event-13240-6125-14-header
Jan 26 02:10:44 [13240] destination-standby       crmd:    debug: qb_rb_close_helper:   Free'ing ringbuffer: /dev/shm/qb-crmd-request-13240-6125-14-header
Jan 26 02:10:44 [13240] destination-standby       crmd:    debug: crm_client_new:       Connecting 0x55ce3bb22960 for uid=0 gid=0 pid=6928 id=039044f3-e674-4afa-9857-47459d1f0d0a
Jan 26 02:10:44 [13240] destination-standby       crmd:    debug: handle_new_connection:        IPC credentials authenticated (13240-6928-14)
Jan 26 02:10:44 [13240] destination-standby       crmd:    debug: qb_ipcs_shm_connect:  connecting to client [6928]
Jan 26 02:10:44 [13240] destination-standby       crmd:    debug: qb_rb_open_2: shm size:131085; real_size:135168; rb->word_size:33792
Jan 26 02:10:44 [13240] destination-standby       crmd:    debug: qb_rb_open_2: shm size:131085; real_size:135168; rb->word_size:33792
Jan 26 02:10:44 [13240] destination-standby       crmd:    debug: qb_rb_open_2: shm size:131085; real_size:135168; rb->word_size:33792
Jan 26 02:10:44 [13235] destination-standby        cib:    debug: crm_client_new:       Connecting 0x560a4bef1f60 for uid=0 gid=0 pid=6928 id=45726483-e7b8-4ed5-8388-c6e8578d3366
Jan 26 02:10:44 [13235] destination-standby        cib:    debug: handle_new_connection:        IPC credentials authenticated (13235-6928-13)
Jan 26 02:10:44 [13235] destination-standby        cib:    debug: qb_ipcs_shm_connect:  connecting to client [6928]
Jan 26 02:10:44 [13235] destination-standby        cib:    debug: qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
Jan 26 02:10:44 [13235] destination-standby        cib:    debug: qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
Jan 26 02:10:44 [13235] destination-standby        cib:    debug: qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
Jan 26 02:10:44 [13235] destination-standby        cib:    debug: cib_acl_enabled:      CIB ACL is disabled
Jan 26 02:10:44 [13235] destination-standby        cib:    debug: qb_ipcs_dispatch_connection_request:  HUP conn (13235-6928-13)
Jan 26 02:10:44 [13235] destination-standby        cib:    debug: qb_ipcs_disconnect:   qb_ipcs_disconnect(13235-6928-13) state:2
Jan 26 02:10:44 [13235] destination-standby        cib:    debug: crm_client_destroy:   Destroying 0 events
Jan 26 02:10:44 [13235] destination-standby        cib:    debug: qb_rb_close_helper:   Free'ing ringbuffer: /dev/shm/qb-cib_rw-response-13235-6928-13-header
Jan 26 02:10:44 [13235] destination-standby        cib:    debug: qb_rb_close_helper:   Free'ing ringbuffer: /dev/shm/qb-cib_rw-event-13235-6928-13-header
Jan 26 02:10:44 [13235] destination-standby        cib:    debug: qb_rb_close_helper:   Free'ing ringbuffer: /dev/shm/qb-cib_rw-request-13235-6928-13-header
Jan 26 02:10:44 [13240] destination-standby       crmd:    debug: qb_ipcs_dispatch_connection_request:  HUP conn (13240-6928-14)
Jan 26 02:10:44 [13240] destination-standby       crmd:    debug: qb_ipcs_disconnect:   qb_ipcs_disconnect(13240-6928-14) state:2
Jan 26 02:10:44 [13240] destination-standby       crmd:    debug: crm_client_destroy:   Destroying 0 events
Jan 26 02:10:44 [13240] destination-standby       crmd:    debug: qb_rb_close_helper:   Free'ing ringbuffer: /dev/shm/qb-crmd-response-13240-6928-14-header
Jan 26 02:10:44 [13240] destination-standby       crmd:    debug: qb_rb_close_helper:   Free'ing ringbuffer: /dev/shm/qb-crmd-event-13240-6928-14-header
Jan 26 02:10:44 [13240] destination-standby       crmd:    debug: qb_rb_close_helper:   Free'ing ringbuffer: /dev/shm/qb-crmd-request-13240-6928-14-header
Jan 26 02:10:45 [13191] destination-standby corosync warning [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
Jan 26 02:10:47 [13191] destination-standby corosync warning [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
Jan 26 02:10:48 [13191] destination-standby corosync debug   [TOTEM ] The consensus timeout expired.
Jan 26 02:10:48 [13191] destination-standby corosync debug   [TOTEM ] entering GATHER state from 3(The consensus timeout expired.).
Jan 26 02:10:48 [13191] destination-standby corosync warning [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
============

(target.standby)> cat /etc/corosync/corosync.conf
totem {
         token: 5000
         version: 2
         secauth: off
         threads: 0
         interface {
                 ringnumber: 0
                 mcastport: 5405
                 bindnetaddr: 10.27.77.204
         }
         transport: udpu
}

nodelist {
         node {
             ring0_addr: 10.27.77.202
             name: target.sip
             nodeid: 3

         }
         node {
             ring0_addr: 10.27.77.205
             name: target.sec.sip
             nodeid: 6

         }
         node {
             ring0_addr: 10.27.77.106
             name: target.dsbc1
             nodeid: 7

         }
         node {
             ring0_addr: 10.27.77.107
             name: target.dsbc
             nodeid: 8

         }
         node {
             ring0_addr: 10.27.77.204
             name: target.standby
             nodeid: 5

         }
}

logging {
         fileline: off
         to_stderr: no
         to_logfile: yes
         to_syslog: no
         logfile: /var/10.27.77.204/log/cluster-suite/corosync.log
         syslog_facility: local0
         debug: on
         timestamp: on
         logger_subsys {
                 subsys: QUORUM
                 debug: on
         }
}

quorum {
         provider: corosync_votequorum
         expected_votes: 5
}

===================

(target.standby)> ping 10.27.77.106 (target.dsbc1)
PING 10.27.77.106 (10.27.77.106) 56(84) bytes of data.
64 bytes from 10.27.77.106: icmp_seq=1 ttl=64 time=0.177 ms
64 bytes from 10.27.77.106: icmp_seq=2 ttl=64 time=0.173 ms

===================

 > sudo nmap -sU -p 5405 10.27.77.106

Starting Nmap 6.40 ( http://nmap.org ) at 2021-01-26 02:24 UTC
Nmap scan report for dispatching-sbc (10.27.77.106)
Host is up (0.00029s latency).
PORT     STATE         SERVICE
5405/udp open|filtered unknown

=========
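Since nmap can only report UDP ports as open|filtered, a more conclusive two-way test (a sketch, assuming omping is installed; by default it uses its own port, so it checks general UDP reachability between the hosts rather than port 5405 itself) is to run the same command on both nodes at the same time:

 > omping 10.27.77.106 10.27.77.204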

My observations show that it can be fixed by stopping all nodes and starting them one by one.
A restart of only the affected node may help as well.
Nodes see each other when the Ring ID is equal on all nodes.

Do you have any ideas?
Could you explain what the Ring ID is in terms of corosync-quorumtool (RRP is not configured)?
Why does it differ?
What additional info may I provide if reproduced again?

It reproduces sporadically. Sometimes the issue is combined with 100% CPU usage by corosync, which also requires restarting pacemaker/corosync on the affected node (and sometimes on all nodes). In this particular case CPU usage is normal.
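Would something like the following, collected right after the next occurrence, be useful (assuming crm_report and corosync-blackbox are available on the nodes)?

 > sudo corosync-blackbox > /tmp/corosync-blackbox.txt
 > sudo crm_report -f "2021-01-26 01:30:00" -t "2021-01-26 02:30:00" /tmp/ringid-incident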

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

