Олег Самойлов napsal(a):
Hello all.
I have a test bed with several virtual machines to test pacemaker. I simulate
random failure on one of the node. The cluster will be on several data centres,
so there is not stonith device, instead I use qnetd on the third data centre
and watchdog (softdog). And sometimes (not always) on failure on one node, the
second node also watchdoged due to lost of the quorum. I increased quorum
timeouts 2 times:
for qnetd: COROSYNC_QNETD_OPTIONS="-S dpd_interval=20000 -d"
Please do not set dpd_interval that high. dpd_interval on qnetd side is
not about how often is the ping is sent. Could you please retry your
test with dpd_interval=1000? I'm pretty sure it will work then.
Honza
for pacemaker: pcs quorum device add sync_timeout=60000 timeout=20000 model net
host=‘witness' algorithm=ffsplit
Also I set -I 60 option (net timeout) to sbd.
But effect is still exists:
Logs, after one of the nodes «tuchanka1a» was powered off in 17:13:53.
From server 'witness' with qnetd:
Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug Client
::ffff:192.168.89.12:39144 (cluster krogan1, node_id 2) sent membership node
list.
Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug msg seq num =
7
Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug ring id =
(2.4c)
Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug heuristics =
Undefined
Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug node list:
Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug node_id =
2, data_center_id = 0, node_state = not set
Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug ffsplit:
Membership for cluster krogan1 is not yet stable
Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug Algorithm
result vote is Wait for reply
Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug Client
::ffff:192.168.89.12:39144 (cluster krogan1, node_id 2) sent quorum node list.
Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug msg seq num =
8
Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug quorate = 0
Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug node list:
Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug node_id =
1, data_center_id = 0, node_state = dead
Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug node_id =
2, data_center_id = 0, node_state = member
Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug Algorithm
result vote is No change
Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 warning Client
::ffff:192.168.89.11:47456 doesn't sent any message during 40000ms.
Disconnecting
Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug Client
::ffff:192.168.89.11:47456 (init_received 1, cluster krogan1, node_id 1)
disconnect
Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug ffsplit:
Membership for cluster krogan1 is now stable
Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug ffsplit:
Quorate partition selected
Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug node list:
Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug node_id =
2, data_center_id = 0, node_state = not set
Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug ffsplit: No
client gets NACK
Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug Sending vote
info to client ::ffff:192.168.89.12:39144 (cluster krogan1, node_id 2)
Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug msg seq num =
2
Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug vote = ACK
Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug Client
::ffff:192.168.89.12:39144 (cluster krogan1, node_id 2) replied back to vote
info message
Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug msg seq num =
2
Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug ffsplit: All
ACK votes sent for cluster krogan1
Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 warning Client
::ffff:192.168.89.12:39144 doesn't sent any message during 40000ms.
Disconnecting
Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 debug Client
::ffff:192.168.89.12:39144 (init_received 1, cluster krogan1, node_id 2)
disconnect
Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 debug ffsplit:
Membership for cluster krogan1 is now stable
Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 debug ffsplit: No
quorate partition was selected
Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 debug ffsplit: No
client gets NACK
Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 debug ffsplit: No
client gets ACK
From other node ‘tuchanka1b’. I deleted all rows from sbd "info: notify_parent: Notifying
parent: healthy" and "info: notify_parent: Not notifying parent: state transient
(2)".
Aug 8 17:13:54 tuchanka1b corosync[1185]: [TOTEM ] A processor failed, forming
new configuration.
Aug 8 17:13:55 tuchanka1b corosync[1185]: [TOTEM ] A new membership
(192.168.89.12:76) was formed. Members left: 1
Aug 8 17:13:55 tuchanka1b corosync[1185]: [TOTEM ] Failed to receive the leave
message. failed: 1
Aug 8 17:13:55 tuchanka1b corosync[1185]: [VOTEQ ] waiting for quorum device
Qdevice poll (but maximum for 60000 ms)
Aug 8 17:13:55 tuchanka1b stonith-ng[1210]: notice: Node tuchanka1a state is
now lost
Aug 8 17:13:55 tuchanka1b stonith-ng[1210]: notice: Purged 1 peer with id=1
and/or uname=tuchanka1a from the membership cache
Aug 8 17:13:55 tuchanka1b crmd[1214]: notice: Our peer on the DC (tuchanka1a)
is dead
Aug 8 17:13:55 tuchanka1b crmd[1214]: notice: State transition S_NOT_DC ->
S_ELECTION
Aug 8 17:13:55 tuchanka1b attrd[1212]: notice: Node tuchanka1a state is now
lost
Aug 8 17:13:55 tuchanka1b attrd[1212]: notice: Removing all tuchanka1a
attributes for peer loss
Aug 8 17:13:55 tuchanka1b attrd[1212]: notice: Lost attribute writer
tuchanka1a
Aug 8 17:13:55 tuchanka1b attrd[1212]: notice: Purged 1 peer with id=1 and/or
uname=tuchanka1a from the membership cache
Aug 8 17:13:55 tuchanka1b cib[1209]: notice: Node tuchanka1a state is now lost
Aug 8 17:13:55 tuchanka1b cib[1209]: notice: Purged 1 peer with id=1 and/or
uname=tuchanka1a from the membership cacheAug 8 17:14:25 tuchanka1b crmd[1214]: notice:
Deletion of "//node_state[@uname='tuchanka1a']/transient_attributes": Timer
expired (rc=-62)
Aug 8 17:14:55 tuchanka1b corosync[1185]: [VOTEQ ] lost contact with quorum
device Qdevice
Aug 8 17:14:55 tuchanka1b corosync[1185]: [VOTEQ ] Waiting for all cluster
members. Current votes: 1 expected_votes: 3
Aug 8 17:14:55 tuchanka1b corosync[1185]: [QUORUM] This node is within the
non-primary component and will NOT provide any services.
Aug 8 17:14:55 tuchanka1b corosync[1185]: [QUORUM] Members[1]: 2
Aug 8 17:14:55 tuchanka1b corosync[1185]: [MAIN ] Completed service
synchronization, ready to provide service.
Aug 8 17:14:55 tuchanka1b pacemakerd[1208]: warning: Quorum lost
Aug 8 17:14:55 tuchanka1b pacemakerd[1208]: notice: Node tuchanka1a state is
now lost
Aug 8 17:14:55 tuchanka1b crmd[1214]: warning: Quorum lost
Aug 8 17:14:55 tuchanka1b crmd[1214]: notice: Node tuchanka1a state is now
lost
Aug 8 17:14:55 tuchanka1b sbd[1182]: cluster: info: notify_parent:
Notifying parent: healthy
Aug 8 17:14:55 tuchanka1b crmd[1214]: notice: State transition S_ELECTION ->
S_INTEGRATION
Aug 8 17:14:55 tuchanka1b crmd[1214]: warning: Input I_ELECTION_DC received in
state S_INTEGRATION from do_election_check
Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: Watchdog will be used via
SBD if fencing is required
Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Fencing and resource
management disabled due to lack of quorum
Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Cluster node tuchanka1a is
unclean: peer is no longer part of the cluster
Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Node tuchanka1a is unclean
Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_demote_0
on tuchanka1a is unrunnable (offline)
Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_stop_0 on
tuchanka1a is unrunnable (offline)
Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_demote_0
on tuchanka1a is unrunnable (offline)
Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_stop_0 on
tuchanka1a is unrunnable (offline)
Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_demote_0
on tuchanka1a is unrunnable (offline)
Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_stop_0 on
tuchanka1a is unrunnable (offline)
Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_demote_0
on tuchanka1a is unrunnable (offline)
Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_stop_0 on
tuchanka1a is unrunnable (offline)
Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1IP_stop_0 on
tuchanka1a is unrunnable (offline)
Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Node tuchanka1a is unclean!
Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: Cannot fence unclean nodes
until quorum is attained (or no-quorum-policy is set to ignore)
Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: * Stop krogan1DB:0
( Slave tuchanka1b ) due to no quorum
Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: * Stop krogan1DB:1
( Master tuchanka1a ) due to node availability (blocked)
Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: * Stop krogan1IP
( tuchanka1a ) due to no quorum (blocked)
Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: * Stop krogan1sIP
( tuchanka1b ) due to no quorum
Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Calculated transition 0
(with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-4.bz2
Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Initiating cancel operation
krogan1DB_monitor_15000 locally on tuchanka1b
Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Initiating cancel operation
krogan1DB_monitor_17000 locally on tuchanka1b
Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Initiating notify operation
krogan1DB_pre_notify_stop_0 locally on tuchanka1b
Aug 8 17:14:56 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used via
SBD if fencing is required
Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Initiating stop operation
krogan1sIP_stop_0 locally on tuchanka1b
Aug 8 17:14:56 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used via
SBD if fencing is required
Aug 8 17:14:56 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used via
SBD if fencing is required
Aug 8 17:14:56 tuchanka1b IPaddr2(krogan1sIP)[5245]: INFO: IP status = ok,
IP_CIP=
Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Result of stop operation for
krogan1sIP on tuchanka1b: 0 (ok)
Aug 8 17:14:56 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used via
SBD if fencing is required
Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Result of notify operation for
krogan1DB on tuchanka1b: 0 (ok)
Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Initiating stop operation
krogan1DB_stop_0 locally on tuchanka1b
Aug 8 17:14:56 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used via
SBD if fencing is required
Aug 8 17:14:56 tuchanka1b sbd[1181]: pcmk: warning: cluster_status:
Fencing and resource management disabled due to lack of quorum
Aug 8 17:14:56 tuchanka1b sbd[1181]: pcmk: warning: pe_fence_node:
Cluster node tuchanka1a is unclean: peer is no longer part of the cluster
Aug 8 17:14:56 tuchanka1b sbd[1181]: pcmk: warning:
determine_online_status: Node tuchanka1a is unclean
Aug 8 17:14:56 tuchanka1b sbd[1181]: pcmk: info: set_servant_health:
Quorum lost: Stop ALL resources
Aug 8 17:14:56 tuchanka1b sbd[1181]: pcmk: info: notify_parent: Not
notifying parent: state transient (2)
Aug 8 17:14:57 tuchanka1b pgsqlms(krogan1DB)[5312]: INFO: Instance "krogan1DB"
stopped
Aug 8 17:14:57 tuchanka1b crmd[1214]: notice: Result of stop operation for
krogan1DB on tuchanka1b: 0 (ok)
Aug 8 17:14:57 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used via
SBD if fencing is required
Aug 8 17:14:57 tuchanka1b crmd[1214]: notice: Transition 0 (Complete=11,
Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-warn-4.bz2): Complete
Aug 8 17:14:57 tuchanka1b crmd[1214]: notice: State transition
S_TRANSITION_ENGINE -> S_IDLEAug 8 17:14:57 tuchanka1b sbd[1181]: pcmk:
info: notify_parent: Not notifying parent: state transient (2)
Aug 8 17:14:58 tuchanka1b ntpd[557]: Deleting interface #8 eth0,
192.168.89.104#123, interface stats: received=0, sent=0, dropped=0,
active_time=260 secs
from now and later, repeatable Aug 8 17:14:58 tuchanka1b sbd[1181]: pcmk:
info: notify_parent: Not notifying parent: state transient (2)
from now and later, repeatable Aug 8 17:14:59 tuchanka1b sbd[1181]: pcmk:
warning: cluster_status: Fencing and resource management disabled due to lack
of quorum
from now and later, repeatable Aug 8 17:14:59 tuchanka1b sbd[1181]: pcmk:
warning: pe_fence_node: Cluster node tuchanka1a is unclean: peer is no longer
part of the cluster
from now and later, repeatable Aug 8 17:14:59 tuchanka1b sbd[1181]: pcmk:
warning: determine_online_status: Node tuchanka1a is unclean
Aug 8 17:15:00 tuchanka1b corosync[1185]: [VOTEQ ] Waiting for all cluster
members. Current votes: 2 expected_votes: 3
Aug 8 17:15:56 tuchanka1b sbd[1180]: warning: inquisitor_child: Servant pcmk
is outdated (age: 61)
Aug 8 17:15:59 tuchanka1b sbd[1180]: warning: inquisitor_child: Latency: No
liveness for 4 s exceeds threshold of 3 s (healthy servants: 0)
Aug 8 17:15:59 tuchanka1b sbd[1180]: warning: inquisitor_child: Latency: No
liveness for 4 s exceeds threshold of 3 s (healthy servants: 0)
Aug 8 17:16:00 tuchanka1b sbd[1180]: warning: inquisitor_child: Latency: No
liveness for 5 s exceeds threshold of 3 s (healthy servants: 0)
Rebooted
What is strange:
1. The node expect quorum in 60s, but it got quorum vote from qnetd on 5s
later. So in 17:14:55 the node lost quorum.
2. In 17:15:00 it got vote and but not quorum. Is it due to "wait for all"
option?
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/