At a quick glance this message seems to indicate the problem:

    crmd[1214]: warning: Input I_ELECTION_DC received in state S_INTEGRATION from ...
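If you can still log in to the surviving node (and to the witness) while reproducing this, it would help to capture what the qdevice link looks like at that moment. A minimal check, assuming the stock corosync-qdevice/qnetd command-line tools are installed, would be something like:

    # on the surviving cluster node: is corosync-qdevice still connected
    # to qnetd, and what was its last vote?
    corosync-qdevice-tool -s -v

    # on the 'witness' host: which clients qnetd currently sees
    # and how it voted for them
    corosync-qnetd-tool -l -v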
To me it looks as if more than one node had a communication problem. How did you simulate node failure?

>>> <spl...@ya.ru> wrote on 08.08.2019 at 17:41 in message <4bc322b7-fe8c-48ab-be20-94def65cf...@ya.ru>:
> Hello all.
>
> I have a test bed with several virtual machines to test pacemaker. I simulate
> random failures on one of the nodes. The cluster will span several data
> centres, so there is no stonith device; instead I use qnetd in the third data
> centre and a watchdog (softdog). And sometimes (not always), on a failure of
> one node, the second node is also rebooted by its watchdog due to loss of
> quorum. I doubled the quorum timeouts:
>
> for qnetd: COROSYNC_QNETD_OPTIONS="-S dpd_interval=20000 -d"
> for pacemaker: pcs quorum device add sync_timeout=60000 timeout=20000 model net host='witness' algorithm=ffsplit
>
> Also I set the -I 60 option (net timeout) for sbd.
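Regarding the timeouts above: if I read those values correctly, the pcs command should end up as a quorum.device section in corosync.conf roughly like the sketch below (the exact layout may differ between pcs versions), and as far as I understand sync_timeout is the one that applies while the membership is changing, which is exactly the window in your logs. What I would compare first is that 60 s against the watchdog/servant timeouts sbd is running with (/etc/sysconfig/sbd on RHEL-style systems), because in the log further down the node sits without quorum for roughly a minute before it reboots. The values shown are only taken from your commands or the usual defaults, so please check against your actual files:

    # corosync.conf (excerpt) -- roughly what the pcs command should generate
    quorum {
        provider: corosync_votequorum
        device {
            model: net
            timeout: 20000        # ms, from your pcs command
            sync_timeout: 60000   # ms, from your pcs command
            net {
                host: witness
                algorithm: ffsplit
            }
        }
    }

    # /etc/sysconfig/sbd (excerpt) -- watchdog-only setup; 5 s is only the
    # common default, check what your installation actually uses
    SBD_WATCHDOG_TIMEOUT=5

If the time to get the qdevice vote back can exceed what the sbd side tolerates, the surviving node will self-fence exactly as you describe.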
> But the effect still exists.
>
> Logs, after one of the nodes "tuchanka1a" was powered off at 17:13:53.
>
> From server 'witness' with qnetd:
>
> Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug Client ::ffff:192.168.89.12:39144 (cluster krogan1, node_id 2) sent membership node list.
> Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug   msg seq num = 7
> Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug   ring id = (2.4c)
> Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug   heuristics = Undefined
> Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug   node list:
> Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug     node_id = 2, data_center_id = 0, node_state = not set
> Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug ffsplit: Membership for cluster krogan1 is not yet stable
> Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug Algorithm result vote is Wait for reply
> Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug Client ::ffff:192.168.89.12:39144 (cluster krogan1, node_id 2) sent quorum node list.
> Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug   msg seq num = 8
> Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug   quorate = 0
> Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug   node list:
> Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug     node_id = 1, data_center_id = 0, node_state = dead
> Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug     node_id = 2, data_center_id = 0, node_state = member
> Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug Algorithm result vote is No change
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 warning Client ::ffff:192.168.89.11:47456 doesn't sent any message during 40000ms. Disconnecting
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug Client ::ffff:192.168.89.11:47456 (init_received 1, cluster krogan1, node_id 1) disconnect
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug ffsplit: Membership for cluster krogan1 is now stable
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug ffsplit: Quorate partition selected
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug   node list:
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug     node_id = 2, data_center_id = 0, node_state = not set
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug ffsplit: No client gets NACK
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug Sending vote info to client ::ffff:192.168.89.12:39144 (cluster krogan1, node_id 2)
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug   msg seq num = 2
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug   vote = ACK
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug Client ::ffff:192.168.89.12:39144 (cluster krogan1, node_id 2) replied back to vote info message
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug   msg seq num = 2
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug ffsplit: All ACK votes sent for cluster krogan1
> Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 warning Client ::ffff:192.168.89.12:39144 doesn't sent any message during 40000ms. Disconnecting
> Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 debug Client ::ffff:192.168.89.12:39144 (init_received 1, cluster krogan1, node_id 2) disconnect
> Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 debug ffsplit: Membership for cluster krogan1 is now stable
> Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 debug ffsplit: No quorate partition was selected
> Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 debug ffsplit: No client gets NACK
> Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 debug ffsplit: No client gets ACK
>
> From the other node, 'tuchanka1b'. I deleted all sbd rows saying "info: notify_parent: Notifying parent: healthy" and "info: notify_parent: Not notifying parent: state transient (2)".
>
> Aug 8 17:13:54 tuchanka1b corosync[1185]: [TOTEM ] A processor failed, forming new configuration.
> Aug 8 17:13:55 tuchanka1b corosync[1185]: [TOTEM ] A new membership (192.168.89.12:76) was formed. Members left: 1
> Aug 8 17:13:55 tuchanka1b corosync[1185]: [TOTEM ] Failed to receive the leave message. failed: 1
> Aug 8 17:13:55 tuchanka1b corosync[1185]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 60000 ms)
> Aug 8 17:13:55 tuchanka1b stonith-ng[1210]: notice: Node tuchanka1a state is now lost
> Aug 8 17:13:55 tuchanka1b stonith-ng[1210]: notice: Purged 1 peer with id=1 and/or uname=tuchanka1a from the membership cache
> Aug 8 17:13:55 tuchanka1b crmd[1214]: notice: Our peer on the DC (tuchanka1a) is dead
> Aug 8 17:13:55 tuchanka1b crmd[1214]: notice: State transition S_NOT_DC -> S_ELECTION
> Aug 8 17:13:55 tuchanka1b attrd[1212]: notice: Node tuchanka1a state is now lost
> Aug 8 17:13:55 tuchanka1b attrd[1212]: notice: Removing all tuchanka1a attributes for peer loss
> Aug 8 17:13:55 tuchanka1b attrd[1212]: notice: Lost attribute writer tuchanka1a
> Aug 8 17:13:55 tuchanka1b attrd[1212]: notice: Purged 1 peer with id=1 and/or uname=tuchanka1a from the membership cache
> Aug 8 17:13:55 tuchanka1b cib[1209]: notice: Node tuchanka1a state is now lost
> Aug 8 17:13:55 tuchanka1b cib[1209]: notice: Purged 1 peer with id=1 and/or uname=tuchanka1a from the membership cache
> Aug 8 17:14:25 tuchanka1b crmd[1214]: notice: Deletion of "//node_state[@uname='tuchanka1a']/transient_attributes": Timer expired (rc=-62)
> Aug 8 17:14:55 tuchanka1b corosync[1185]: [VOTEQ ] lost contact with quorum device Qdevice
> Aug 8 17:14:55 tuchanka1b corosync[1185]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 3
> Aug 8 17:14:55 tuchanka1b corosync[1185]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
> Aug 8 17:14:55 tuchanka1b corosync[1185]: [QUORUM] Members[1]: 2
> Aug 8 17:14:55 tuchanka1b corosync[1185]: [MAIN  ] Completed service synchronization, ready to provide service.
> Aug 8 17:14:55 tuchanka1b pacemakerd[1208]: warning: Quorum lost
> Aug 8 17:14:55 tuchanka1b pacemakerd[1208]: notice: Node tuchanka1a state is now lost
> Aug 8 17:14:55 tuchanka1b crmd[1214]: warning: Quorum lost
> Aug 8 17:14:55 tuchanka1b crmd[1214]: notice: Node tuchanka1a state is now lost
> Aug 8 17:14:55 tuchanka1b sbd[1182]: cluster: info: notify_parent: Notifying parent: healthy
> Aug 8 17:14:55 tuchanka1b crmd[1214]: notice: State transition S_ELECTION -> S_INTEGRATION
> Aug 8 17:14:55 tuchanka1b crmd[1214]: warning: Input I_ELECTION_DC received in state S_INTEGRATION from do_election_check
> Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: Watchdog will be used via SBD if fencing is required
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Fencing and resource management disabled due to lack of quorum
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Cluster node tuchanka1a is unclean: peer is no longer part of the cluster
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Node tuchanka1a is unclean
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_demote_0 on tuchanka1a is unrunnable (offline)
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_stop_0 on tuchanka1a is unrunnable (offline)
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_demote_0 on tuchanka1a is unrunnable (offline)
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_stop_0 on tuchanka1a is unrunnable (offline)
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_demote_0 on tuchanka1a is unrunnable (offline)
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_stop_0 on tuchanka1a is unrunnable (offline)
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_demote_0 on tuchanka1a is unrunnable (offline)
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_stop_0 on tuchanka1a is unrunnable (offline)
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1IP_stop_0 on tuchanka1a is unrunnable (offline)
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Node tuchanka1a is unclean!
> Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)
> Aug 8 17:14:56 tuchanka1b pengine[1213]: notice:  * Stop  krogan1DB:0  ( Slave tuchanka1b )  due to no quorum
> Aug 8 17:14:56 tuchanka1b pengine[1213]: notice:  * Stop  krogan1DB:1  ( Master tuchanka1a )  due to node availability (blocked)
> Aug 8 17:14:56 tuchanka1b pengine[1213]: notice:  * Stop  krogan1IP  ( tuchanka1a )  due to no quorum (blocked)
> Aug 8 17:14:56 tuchanka1b pengine[1213]: notice:  * Stop  krogan1sIP  ( tuchanka1b )  due to no quorum
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Calculated transition 0 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-4.bz2
> Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Initiating cancel operation krogan1DB_monitor_15000 locally on tuchanka1b
> Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Initiating cancel operation krogan1DB_monitor_17000 locally on tuchanka1b
> Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Initiating notify operation krogan1DB_pre_notify_stop_0 locally on tuchanka1b
> Aug 8 17:14:56 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used via SBD if fencing is required
> Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Initiating stop operation krogan1sIP_stop_0 locally on tuchanka1b
> Aug 8 17:14:56 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used via SBD if fencing is required
> Aug 8 17:14:56 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used via SBD if fencing is required
> Aug 8 17:14:56 tuchanka1b IPaddr2(krogan1sIP)[5245]: INFO: IP status = ok, IP_CIP=
> Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Result of stop operation for krogan1sIP on tuchanka1b: 0 (ok)
> Aug 8 17:14:56 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used via SBD if fencing is required
> Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Result of notify operation for krogan1DB on tuchanka1b: 0 (ok)
> Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Initiating stop operation krogan1DB_stop_0 locally on tuchanka1b
> Aug 8 17:14:56 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used via SBD if fencing is required
> Aug 8 17:14:56 tuchanka1b sbd[1181]: pcmk: warning: cluster_status: Fencing and resource management disabled due to lack of quorum
> Aug 8 17:14:56 tuchanka1b sbd[1181]: pcmk: warning: pe_fence_node: Cluster node tuchanka1a is unclean: peer is no longer part of the cluster
> Aug 8 17:14:56 tuchanka1b sbd[1181]: pcmk: warning: determine_online_status: Node tuchanka1a is unclean
> Aug 8 17:14:56 tuchanka1b sbd[1181]: pcmk: info: set_servant_health: Quorum lost: Stop ALL resources
> Aug 8 17:14:56 tuchanka1b sbd[1181]: pcmk: info: notify_parent: Not notifying parent: state transient (2)
> Aug 8 17:14:57 tuchanka1b pgsqlms(krogan1DB)[5312]: INFO: Instance "krogan1DB" stopped
> Aug 8 17:14:57 tuchanka1b crmd[1214]: notice: Result of stop operation for krogan1DB on tuchanka1b: 0 (ok)
> Aug 8 17:14:57 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used via SBD if fencing is required
> Aug 8 17:14:57 tuchanka1b crmd[1214]: notice: Transition 0 (Complete=11, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-4.bz2): Complete
> Aug 8 17:14:57 tuchanka1b crmd[1214]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
> Aug 8 17:14:57 tuchanka1b sbd[1181]: pcmk: info: notify_parent: Not notifying parent: state transient (2)
> Aug 8 17:14:58 tuchanka1b ntpd[557]: Deleting interface #8 eth0, 192.168.89.104#123, interface stats: received=0, sent=0, dropped=0, active_time=260 secs
>
> From here on the following sbd lines keep repeating:
>
> Aug 8 17:14:58 tuchanka1b sbd[1181]: pcmk: info: notify_parent: Not notifying parent: state transient (2)
> Aug 8 17:14:59 tuchanka1b sbd[1181]: pcmk: warning: cluster_status: Fencing and resource management disabled due to lack of quorum
> Aug 8 17:14:59 tuchanka1b sbd[1181]: pcmk: warning: pe_fence_node: Cluster node tuchanka1a is unclean: peer is no longer part of the cluster
> Aug 8 17:14:59 tuchanka1b sbd[1181]: pcmk: warning: determine_online_status: Node tuchanka1a is unclean
>
> Aug 8 17:15:00 tuchanka1b corosync[1185]: [VOTEQ ] Waiting for all cluster members. Current votes: 2 expected_votes: 3
> Aug 8 17:15:56 tuchanka1b sbd[1180]: warning: inquisitor_child: Servant pcmk is outdated (age: 61)
> Aug 8 17:15:59 tuchanka1b sbd[1180]: warning: inquisitor_child: Latency: No liveness for 4 s exceeds threshold of 3 s (healthy servants: 0)
> Aug 8 17:15:59 tuchanka1b sbd[1180]: warning: inquisitor_child: Latency: No liveness for 4 s exceeds threshold of 3 s (healthy servants: 0)
> Aug 8 17:16:00 tuchanka1b sbd[1180]: warning: inquisitor_child: Latency: No liveness for 5 s exceeds threshold of 3 s (healthy servants: 0)
>
> Rebooted.
>
> What is strange:
>
> 1. The node expects quorum within 60 s, but it got the quorum vote from qnetd 5 s later than that, so at 17:14:55 the node lost quorum.
> 2. At 17:15:00 it got the vote, but not quorum. Is that due to the "wait for all" option?
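On your point 2: the "Waiting for all cluster members. Current votes: 2 expected_votes: 3" line does look like the votequorum wait_for_all logic holding quorum back even after the qdevice ACK arrived, and on point 1 the 5 s gap matches the witness only declaring tuchanka1a dead at 17:15:00 ("doesn't sent any message during 40000ms"), i.e. just after your 60 s sync_timeout had already expired on tuchanka1b. Before changing anything I would verify which votequorum flags and qdevice settings are actually in effect; a minimal check, assuming the stock corosync tools are available (the grep patterns below are just what I would look for, not exact key names):

    # on the surviving node: vote counts and the active flags
    # (look for WaitForAll and Qdevice in the "Flags:" line)
    corosync-quorumtool -s

    # what votequorum / qdevice actually have configured at runtime
    corosync-cmapctl | grep -i wait_for_all
    corosync-cmapctl | grep -i quorum.device

If that reading is right, the more useful knob may be on the qnetd side: with dpd_interval raised to 20000 the witness only decides the dead node's fate after about 40 s, which here apparently lands just past the point where tuchanka1b stops waiting, so keeping the qnetd dead-peer detection comfortably shorter than quorum.device.sync_timeout might avoid the quorum loss in the first place.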