On 27.02.2021 22:12, Andrei Borzenkov wrote:
> On 27.02.2021 17:08, Eric Robinson wrote:
>>
>> I agree, one node is expected to go out of quorum. Still the question is,
>> why didn't 001db01b take over the services? I just remembered that 001db01b
>> has services running on it, and those services did not stop, so it seems
>> that 001db01b did not lose quorum. So why didn't it take over the services
>> that were running on 001db01a?
>
> That I cannot answer. I cannot reproduce it using similar configuration.
Hmm ... actually I can. Two nodes ha1 and ha2 + qdevice. I blocked all communication *from* ha1 (to be precise, all packets with ha1's source MAC are dropped). This happened around 10:43:45. Now look: ha1 immediately stops all services:

Feb 28 10:43:44 ha1 corosync[3692]: [TOTEM ] A processor failed, forming new configuration.
Feb 28 10:43:47 ha1 corosync[3692]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Feb 28 10:43:47 ha1 corosync[3692]: [TOTEM ] A new membership (192.168.1.1:2944) was formed. Members left: 2
Feb 28 10:43:47 ha1 corosync[3692]: [TOTEM ] Failed to receive the leave message. failed: 2
Feb 28 10:43:47 ha1 corosync[3692]: [CPG ] downlist left_list: 1 received
Feb 28 10:43:47 ha1 pacemaker-attrd[3703]: notice: Node ha2 state is now lost
Feb 28 10:43:47 ha1 pacemaker-attrd[3703]: notice: Removing all ha2 attributes for peer loss
Feb 28 10:43:47 ha1 pacemaker-attrd[3703]: notice: Purged 1 peer with id=2 and/or uname=ha2 from the membership cache
Feb 28 10:43:47 ha1 pacemaker-based[3700]: notice: Node ha2 state is now lost
Feb 28 10:43:47 ha1 pacemaker-based[3700]: notice: Purged 1 peer with id=2 and/or uname=ha2 from the membership cache
Feb 28 10:43:47 ha1 pacemaker-controld[3705]: warning: Stonith/shutdown of node ha2 was not expected
Feb 28 10:43:47 ha1 pacemaker-controld[3705]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Feb 28 10:43:47 ha1 pacemaker-fenced[3701]: notice: Node ha2 state is now lost
Feb 28 10:43:47 ha1 pacemaker-fenced[3701]: notice: Purged 1 peer with id=2 and/or uname=ha2 from the membership cache
Feb 28 10:43:48 ha1 corosync[3692]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Feb 28 10:43:48 ha1 corosync[3692]: [TOTEM ] A new membership (192.168.1.1:2948) was formed. Members
Feb 28 10:43:48 ha1 corosync[3692]: [CPG ] downlist left_list: 0 received
Feb 28 10:43:50 ha1 corosync[3692]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Feb 28 10:43:50 ha1 corosync[3692]: [TOTEM ] A new membership (192.168.1.1:2952) was formed. Members
Feb 28 10:43:50 ha1 corosync[3692]: [CPG ] downlist left_list: 0 received
Feb 28 10:43:51 ha1 corosync[3692]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Feb 28 10:43:51 ha1 corosync[3692]: [TOTEM ] A new membership (192.168.1.1:2956) was formed. Members
Feb 28 10:43:51 ha1 corosync[3692]: [CPG ] downlist left_list: 0 received
Feb 28 10:43:56 ha1 corosync-qdevice[4522]: Server didn't send echo reply message on time
Feb 28 10:43:56 ha1 corosync-qdevice[4522]: Feb 28 10:43:56 error Server didn't send echo reply message on time
Feb 28 10:43:56 ha1 corosync[3692]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Feb 28 10:43:56 ha1 corosync[3692]: [QUORUM] Members[1]: 1
Feb 28 10:43:56 ha1 corosync[3692]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 28 10:43:56 ha1 pacemaker-controld[3705]: warning: Quorum lost
Feb 28 10:43:56 ha1 pacemaker-controld[3705]: notice: Node ha2 state is now lost
Feb 28 10:43:56 ha1 pacemaker-controld[3705]: warning: Stonith/shutdown of node ha2 was not expected
Feb 28 10:43:56 ha1 pacemaker-controld[3705]: notice: Updating quorum status to false (call=274)
Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]: warning: Fencing and resource management disabled due to lack of quorum
Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]: notice:  * Stop    p_drbd0:0     ( Master ha1 )  due to no quorum
Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]: notice:  * Stop    p_drbd1:0     ( Slave ha1 )   due to no quorum
Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]: notice:  * Stop    p_fs_clust01  ( ha1 )         due to no quorum
Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]: notice:  * Start   p_fs_clust02  ( ha1 )         due to no quorum (blocked)
Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]: notice:  * Stop    p_mysql_001   ( ha1 )         due to no quorum
Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]: notice:  * Start   p_mysql_006   ( ha1 )         due to no quorum (blocked)

ha2 *waits for 30 seconds* before doing anything:

Feb 28 10:43:44 ha2 corosync[5389]: [TOTEM ] A processor failed, forming new configuration.
Feb 28 10:43:45 ha2 corosync[5389]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Feb 28 10:43:45 ha2 corosync[5389]: [TOTEM ] A new membership (192.168.1.2:2936) was formed. Members left: 1
Feb 28 10:43:45 ha2 corosync[5389]: [TOTEM ] Failed to receive the leave message. failed: 1
Feb 28 10:43:45 ha2 corosync[5389]: [CPG ] downlist left_list: 1 received
Feb 28 10:43:45 ha2 pacemaker-attrd[5660]: notice: Lost attribute writer ha1
Feb 28 10:43:45 ha2 pacemaker-attrd[5660]: notice: Node ha1 state is now lost
Feb 28 10:43:45 ha2 pacemaker-attrd[5660]: notice: Removing all ha1 attributes for peer loss
Feb 28 10:43:45 ha2 pacemaker-attrd[5660]: notice: Purged 1 peer with id=1 and/or uname=ha1 from the membership cache
Feb 28 10:43:45 ha2 pacemaker-based[5657]: notice: Node ha1 state is now lost
Feb 28 10:43:45 ha2 pacemaker-based[5657]: notice: Purged 1 peer with id=1 and/or uname=ha1 from the membership cache
Feb 28 10:43:45 ha2 pacemaker-controld[5662]: notice: Our peer on the DC (ha1) is dead
Feb 28 10:43:45 ha2 pacemaker-controld[5662]: notice: State transition S_NOT_DC -> S_ELECTION
Feb 28 10:43:45 ha2 pacemaker-fenced[5658]: notice: Node ha1 state is now lost
Feb 28 10:43:45 ha2 pacemaker-fenced[5658]: notice: Purged 1 peer with id=1 and/or uname=ha1 from the membership cache
Feb 28 10:44:15 ha2 corosync[5389]: [VOTEQ ] lost contact with quorum device Qdevice
Feb 28 10:44:15 ha2 corosync[5389]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Feb 28 10:44:15 ha2 corosync[5389]: [QUORUM] Members[1]: 2
Feb 28 10:44:15 ha2 corosync[5389]: [MAIN ] Completed service synchronization, ready to provide service.

Now I recognize it, and I believe we have already seen variants of this. The key line is:

corosync[5389]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)

ha1 lost its connection to qnetd, so it gives up all hope immediately. ha2 retains its connection to qnetd, so it waits for the final decision before continuing. In your case, apparently, one node was completely disconnected for 15 seconds, after which connectivity resumed; the second node was still waiting for the qdevice/qnetd decision. So it appears to work as expected. Note that fencing would not have been initiated before the timeout either.
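The 30-second wait corresponds to qdevice's sync_timeout (default 30000 ms), which bounds how long votequorum will wait for the qdevice poll result during a membership change. For reference, a minimal quorum section for such a two-node + qnetd cluster might look like the sketch below; the qnetd host name and the algorithm choice are assumptions, not taken from the logs above:

```
quorum {
    provider: corosync_votequorum
    device {
        model: net
        # sync_timeout bounds the "waiting for quorum device Qdevice poll"
        # phase seen in the logs; 30000 ms is the default
        sync_timeout: 30000
        net {
            host: qnetd.example.com   # hypothetical qnetd host
            algorithm: ffsplit        # fifty-fifty split, typical for 2 nodes
        }
    }
}
```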
Fencing /may/ have been initiated after the nodes established connection again and saw that one resource had failed to stop. That would automatically have resolved your issue. I need to think about how to reproduce the stop failure.
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/