background:
we are upgrading a (very) old HA cluster running heartbeat, DRBD, and NFS, with no stonith, to a much more modern implementation. for both the existing cluster and the new one, disk space requirements make running a full three-node cluster infeasible, so i am trying to configure a quorum-only node using corosync-qnetd.

the installation went fine, the nodes can communicate, etc., and the cluster seems to perform as desired when gracefully shutting down or restarting a node. but during my torture testing, simulating a node crash by stopping the network on one node leaves the remaining node in limbo for approximately 20 seconds before it and the quorum-only node decide that they are indeed quorate.
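
(for reference, i have been watching the quorum state during these tests with the commands below. this is just a sketch of how i check it, on the assumption that these are the right tools; output trimmed.)

root@xen-nfs01:~# corosync-quorumtool -s     # overall quorum status, including the Qdevice vote
root@xen-nfs01:~# corosync-qdevice-tool -s   # qdevice daemon state on a cluster node
root@xen-quorum:~# corosync-qnetd-tool -l    # connected clients as seen by qnetd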

the problem:
the intended implementation involves DRBD, whose resource-level fencing freezes IO for as long as the remaining node is inquorate, in order to avoid any possible data divergence/split-brain. this precaution is obviously desirable, and it is the reason i am trying to configure this cluster "properly".
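
(to be concrete about what i mean: a minimal sketch of the DRBD side, assuming the in-kernel DRBD 8.4 on debian 10 and the stock crm-fence-peer handlers from drbd-utils; the resource name "r0" is a placeholder, not my actual config.)

resource r0 {
  disk {
    fencing resource-only;   # resource-level fencing, no stonith device involved
  }
  handlers {
    # stock drbd-utils scripts: IO stays frozen while fence-peer runs, which
    # places a pacemaker constraint on the peer; it is removed after resync
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}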

my (admittedly naive) expectation was that the remaining node and the quorum-only node would continue ticking along as if nothing had happened, and i am hoping that this delay is due to some misconfiguration/oversight/bone-headedness on my part.

so i am seeking to understand the reason for this delay, and whether there is any (prudent) way to reduce it. of course, any other advice on the intended setup is welcome as well.
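
(the only tunables i have turned up so far are the qdevice timeouts in corosync.conf, per corosync-qdevice(8); below is a sketch with example values that i have not actually tried. the 20000 ms window in the qnetd log looks like it could be 2x the default timeout of 10000 ms, and the "maximum for 30000 ms" in the corosync log matches the default sync_timeout, but those are guesses on my part.)

quorum {
  provider: corosync_votequorum
  device {
    model: net
    votes: 1
    timeout: 5000         # example value only; default 10000 ms
    sync_timeout: 15000   # example value only; default 30000 ms
    net {
      tls: on
      host: xen-quorum
      algorithm: ffsplit
    }
  }
}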

please let me know if you require any additional details.

TIA
Sherrard

details:
for testing purposes, i have stripped the cluster down to a single dummy resource:

root@xen-nfs01:~# crm configure show
node 1: xen-nfs01
node 2: xen-nfs02
primitive dummy Dummy
property cib-bootstrap-options: \
        have-watchdog=false \
        stonith-enabled=false \
        dc-version=2.0.1-9e909a5bdd \
        cluster-infrastructure=corosync \
        cluster-name=xen-nfs01_xen-nfs02 \
        last-lrm-refresh=1571422320 \
        maintenance-mode=false


all nodes are running debian 10, with the following package versions:

root@xen-nfs01:~# dpkg -l | grep -F -e corosync -e pacemaker
ii  corosync                   3.0.1-2                      amd64  cluster engine daemon and utilities
ii  corosync-qdevice            3.0.0-4                      amd64  cluster engine quorum device daemon
ii  crmsh                       4.0.0~git20190108.3d56538-3  all    CRM shell for the pacemaker cluster manager
ii  libcorosync-common4:amd64   3.0.1-2                      amd64  cluster engine common library
ii  pacemaker                   2.0.1-5                      amd64  cluster resource manager
ii  pacemaker-cli-utils         2.0.1-5                      amd64  cluster resource manager command line utilities
ii  pacemaker-common            2.0.1-5                      all    cluster resource manager common files
ii  pacemaker-resource-agents   2.0.1-5                      all    cluster resource manager general resource agents

root@xen-quorum:~# dpkg -l | grep -F -e corosync -e pacemaker
ii corosync-qnetd 3.0.0-4 amd64 cluster engine quorum device network daemon


/etc/corosync/corosync.conf:
totem {
  version: 2
  cluster_name: xen-nfs01_xen-nfs02
  crypto_cipher: aes256
  crypto_hash: sha512
}
logging {
  fileline: off
  to_stderr: yes
  to_logfile: yes
  logfile: /var/log/corosync/corosync.log
  to_syslog: yes
  debug: off
  logger_subsys {
    subsys: QUORUM
    debug: off
  }
}
nodelist {
  node {
    name: xen-nfs01
    nodeid: 1
    ring0_addr: 192.168.250.50
  }
  node {
    name: xen-nfs02
    nodeid: 2
    ring0_addr: 192.168.250.51
  }
}
quorum {
  provider: corosync_votequorum
  device {
    model: net
    votes: 1
    net {
      tls: on
      host: xen-quorum
      algorithm: ffsplit
    }
  }
}
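
(one thing that stands out in the qnetd log below is "heuristics = Undefined" — i have not configured any. in case it matters, my reading of corosync-qdevice(8) is that enabling them would look roughly like the sketch below, where the ping target is a placeholder for a gateway reachable only by the healthy partition.)

  device {
    ...
    heuristics {
      mode: on
      exec_ping: /usr/bin/ping -q -c 1 "192.168.250.1"
    }
  }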


logs:
Oct 24 02:33:48 xen-nfs01 corosync[9946]: [TOTEM ] Token has not been received in 750 ms
Oct 24 02:33:48 xen-nfs01 corosync[9946]: [TOTEM ] A processor failed, forming new configuration.
Oct 24 02:33:49 xen-nfs01 corosync[9946]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Oct 24 02:33:49 xen-nfs01 corosync[9946]: [TOTEM ] A new membership (1:1388) was formed. Members left: 2
Oct 24 02:33:49 xen-nfs01 corosync[9946]: [TOTEM ] Failed to receive the leave message. failed: 2
Oct 24 02:33:49 xen-nfs01 corosync[9946]: [CPG ] downlist left_list: 1 received
Oct 24 02:33:49 xen-nfs01 pacemaker-attrd[10527]: notice: Node xen-nfs02 state is now lost
Oct 24 02:33:49 xen-nfs01 pacemaker-attrd[10527]: notice: Removing all xen-nfs02 attributes for peer loss
Oct 24 02:33:49 xen-nfs01 pacemaker-attrd[10527]: notice: Purged 1 peer with id=2 and/or uname=xen-nfs02 from the membership cache
Oct 24 02:33:49 xen-nfs01 pacemaker-fenced[10525]: notice: Node xen-nfs02 state is now lost
Oct 24 02:33:49 xen-nfs01 pacemaker-fenced[10525]: notice: Purged 1 peer with id=2 and/or uname=xen-nfs02 from the membership cache
Oct 24 02:33:49 xen-nfs01 pacemaker-based[10524]: notice: Node xen-nfs02 state is now lost
Oct 24 02:33:49 xen-nfs01 pacemaker-based[10524]: notice: Purged 1 peer with id=2 and/or uname=xen-nfs02 from the membership cache
Oct 24 02:33:49 xen-nfs01 pacemaker-controld[10529]: warning: Stonith/shutdown of node xen-nfs02 was not expected
Oct 24 02:33:49 xen-nfs01 pacemaker-controld[10529]: notice: State transition S_IDLE -> S_POLICY_ENGINE

Oct 24 02:34:12 xen-nfs01 corosync[9946]:   [QUORUM] Members[1]: 1
Oct 24 02:34:12 xen-nfs01 corosync[9946]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 24 02:34:12 xen-nfs01 pacemakerd[10522]: notice: Node xen-nfs02 state is now lost
Oct 24 02:34:12 xen-nfs01 pacemaker-controld[10529]: notice: Node xen-nfs02 state is now lost
Oct 24 02:34:12 xen-nfs01 pacemaker-controld[10529]: warning: Stonith/shutdown of node xen-nfs02 was not expected
Oct 24 02:34:13 xen-nfs01 pacemaker-schedulerd[10528]: notice: Calculated transition 411, saving inputs in /var/lib/pacemaker/pengine/pe-input-1278.bz2
Oct 24 02:34:13 xen-nfs01 pacemaker-controld[10529]: notice: Transition 411 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1278.bz2): Complete
Oct 24 02:34:13 xen-nfs01 pacemaker-controld[10529]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE



Oct 24 02:33:49 xen-quorum corosync-qnetd[3353]: Oct 24 02:33:49 debug Client ::ffff:192.168.250.50:38362 (cluster xen-nfs01_xen-nfs02, node_id 1) sent membership node list.
Oct 24 02:33:49 xen-quorum corosync-qnetd[3353]: Oct 24 02:33:49 debug msg seq num = 61
Oct 24 02:33:49 xen-quorum corosync-qnetd[3353]: Oct 24 02:33:49 debug ring id = (1.56c)
Oct 24 02:33:49 xen-quorum corosync-qnetd[3353]: Oct 24 02:33:49 debug heuristics = Undefined
Oct 24 02:33:49 xen-quorum corosync-qnetd[3353]: Oct 24 02:33:49 debug node list:
Oct 24 02:33:49 xen-quorum corosync-qnetd[3353]: Oct 24 02:33:49 debug node_id = 1, data_center_id = 0, node_state = not set
Oct 24 02:33:49 xen-quorum corosync-qnetd[3353]: Oct 24 02:33:49 debug ffsplit: Membership for cluster xen-nfs01_xen-nfs02 is not yet stable
Oct 24 02:33:49 xen-quorum corosync-qnetd[3353]: Oct 24 02:33:49 debug Algorithm result vote is Wait for reply

Oct 24 02:34:12 xen-quorum corosync-qnetd[3353]: Oct 24 02:34:12 warning Client ::ffff:192.168.250.51:36628 doesn't sent any message during 20000ms. Disconnecting
Oct 24 02:34:12 xen-quorum corosync-qnetd[3353]: Oct 24 02:34:12 debug Client ::ffff:192.168.250.51:36628 (init_received 1, cluster xen-nfs01_xen-nfs02, node_id 2) disconnect
Oct 24 02:34:12 xen-quorum corosync-qnetd[3353]: Oct 24 02:34:12 debug ffsplit: Membership for cluster xen-nfs01_xen-nfs02 is now stable
Oct 24 02:34:12 xen-quorum corosync-qnetd[3353]: Oct 24 02:34:12 debug ffsplit: Quorate partition selected
Oct 24 02:34:12 xen-quorum corosync-qnetd[3353]: Oct 24 02:34:12 debug node list:
Oct 24 02:34:12 xen-quorum corosync-qnetd[3353]: Oct 24 02:34:12 debug node_id = 1, data_center_id = 0, node_state = not set
Oct 24 02:34:12 xen-quorum corosync-qnetd[3353]: Oct 24 02:34:12 debug ffsplit: No client gets NACK
Oct 24 02:34:12 xen-quorum corosync-qnetd[3353]: Oct 24 02:34:12 debug Sending vote info to client ::ffff:192.168.250.50:38362 (cluster xen-nfs01_xen-nfs02, node_id 1)
Oct 24 02:34:12 xen-quorum corosync-qnetd[3353]: Oct 24 02:34:12 debug msg seq num = 30
Oct 24 02:34:12 xen-quorum corosync-qnetd[3353]: Oct 24 02:34:12 debug vote = ACK
Oct 24 02:34:12 xen-quorum corosync-qnetd[3353]: Oct 24 02:34:12 debug Client ::ffff:192.168.250.50:38362 (cluster xen-nfs01_xen-nfs02, node_id 1) replied back to vote info message
Oct 24 02:34:12 xen-quorum corosync-qnetd[3353]: Oct 24 02:34:12 debug msg seq num = 30
Oct 24 02:34:12 xen-quorum corosync-qnetd[3353]: Oct 24 02:34:12 debug ffsplit: All ACK votes sent for cluster xen-nfs01_xen-nfs02
Oct 24 02:34:12 xen-quorum corosync-qnetd[3353]: Oct 24 02:34:12 debug Client ::ffff:192.168.250.50:38362 (cluster xen-nfs01_xen-nfs02, node_id 1) sent quorum node list.
Oct 24 02:34:12 xen-quorum corosync-qnetd[3353]: Oct 24 02:34:12 debug msg seq num = 62
Oct 24 02:34:12 xen-quorum corosync-qnetd[3353]: Oct 24 02:34:12 debug quorate = 1
Oct 24 02:34:12 xen-quorum corosync-qnetd[3353]: Oct 24 02:34:12 debug node list:
Oct 24 02:34:12 xen-quorum corosync-qnetd[3353]: Oct 24 02:34:12 debug node_id = 2, data_center_id = 0, node_state = dead
Oct 24 02:34:12 xen-quorum corosync-qnetd[3353]: Oct 24 02:34:12 debug node_id = 1, data_center_id = 0, node_state = member
Oct 24 02:34:12 xen-quorum corosync-qnetd[3353]: Oct 24 02:34:12 debug Algorithm result vote is No change


root@xen-nfs02:~# date; service networking stop; journalctl -f
Thu Oct 24 02:33:47 UTC 2019
-- Logs begin at Mon 2019-10-14 21:10:14 UTC. --
Oct 24 02:33:21 xen-nfs02 pacemakerd[32123]:  notice: Quorum acquired
Oct 24 02:33:21 xen-nfs02 pacemakerd[32123]: notice: Node xen-nfs01 state is now member
Oct 24 02:33:21 xen-nfs02 pacemaker-controld[32130]: notice: State transition S_IDLE -> S_INTEGRATION
Oct 24 02:33:21 xen-nfs02 pacemaker-controld[32130]: warning: Another DC detected: xen-nfs01 (op=noop)
Oct 24 02:33:21 xen-nfs02 pacemaker-controld[32130]: notice: State transition S_ELECTION -> S_RELEASE_DC
Oct 24 02:33:21 xen-nfs02 pacemaker-attrd[32128]: notice: Detected another attribute writer (xen-nfs01), starting new election
Oct 24 02:33:21 xen-nfs02 pacemaker-controld[32130]: notice: State transition S_PENDING -> S_NOT_DC
Oct 24 02:33:47 xen-nfs02 systemd[1]: Stopping Raise network interfaces...
Oct 24 02:33:47 xen-nfs02 systemd[1]: networking.service: Succeeded.
Oct 24 02:33:47 xen-nfs02 systemd[1]: Stopped Raise network interfaces.
Oct 24 02:33:48 xen-nfs02 corosync[30190]: [TOTEM ] Token has not been received in 750 ms
Oct 24 02:33:48 xen-nfs02 corosync[30190]: [TOTEM ] A processor failed, forming new configuration.
Oct 24 02:33:48 xen-nfs02 corosync[30190]: [KNET ] link: host: 1 link: 0 is down
Oct 24 02:33:48 xen-nfs02 corosync[30190]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 24 02:33:48 xen-nfs02 corosync[30190]: [KNET ] host: host: 1 has no active links
Oct 24 02:33:49 xen-nfs02 corosync[30190]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Oct 24 02:33:49 xen-nfs02 corosync[30190]: [TOTEM ] A new membership (2:1388) was formed. Members left: 1
Oct 24 02:33:49 xen-nfs02 corosync[30190]: [TOTEM ] Failed to receive the leave message. failed: 1
Oct 24 02:33:49 xen-nfs02 corosync[30190]: [CPG ] downlist left_list: 1 received
Oct 24 02:33:49 xen-nfs02 pacemaker-attrd[32128]: notice: Lost attribute writer xen-nfs01
Oct 24 02:33:49 xen-nfs02 pacemaker-attrd[32128]: notice: Node xen-nfs01 state is now lost
Oct 24 02:33:49 xen-nfs02 pacemaker-attrd[32128]: notice: Removing all xen-nfs01 attributes for peer loss
Oct 24 02:33:49 xen-nfs02 pacemaker-attrd[32128]: notice: Purged 1 peer with id=1 and/or uname=xen-nfs01 from the membership cache
Oct 24 02:33:49 xen-nfs02 pacemaker-based[32125]: notice: Node xen-nfs01 state is now lost
Oct 24 02:33:49 xen-nfs02 pacemaker-based[32125]: notice: Purged 1 peer with id=1 and/or uname=xen-nfs01 from the membership cache
Oct 24 02:33:49 xen-nfs02 pacemaker-fenced[32126]: notice: Node xen-nfs01 state is now lost
Oct 24 02:33:49 xen-nfs02 pacemaker-fenced[32126]: notice: Purged 1 peer with id=1 and/or uname=xen-nfs01 from the membership cache
Oct 24 02:33:49 xen-nfs02 pacemaker-controld[32130]: notice: Our peer on the DC (xen-nfs01) is dead
Oct 24 02:33:49 xen-nfs02 pacemaker-controld[32130]: notice: State transition S_NOT_DC -> S_ELECTION
Oct 24 02:34:00 xen-nfs02 corosync-qdevice[31543]: Server didn't send echo reply message on time
Oct 24 02:34:00 xen-nfs02 corosync[30190]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 24 02:34:00 xen-nfs02 corosync[30190]:   [QUORUM] Members[1]: 2
Oct 24 02:34:00 xen-nfs02 corosync[30190]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 24 02:34:00 xen-nfs02 pacemakerd[32123]:  warning: Quorum lost
Oct 24 02:34:00 xen-nfs02 pacemaker-controld[32130]:  warning: Quorum lost
Oct 24 02:34:00 xen-nfs02 pacemaker-controld[32130]: notice: Node xen-nfs01 state is now lost
Oct 24 02:34:00 xen-nfs02 pacemakerd[32123]: notice: Node xen-nfs01 state is now lost
Oct 24 02:34:01 xen-nfs02 pacemaker-attrd[32128]: notice: Recorded local node as attribute writer (was unset)
Oct 24 02:34:01 xen-nfs02 pacemaker-controld[32130]: notice: State transition S_ELECTION -> S_INTEGRATION
Oct 24 02:34:01 xen-nfs02 pacemaker-schedulerd[32129]: warning: Fencing and resource management disabled due to lack of quorum
Oct 24 02:34:01 xen-nfs02 pacemaker-schedulerd[32129]: notice: * Start dummy ( xen-nfs02 ) due to no quorum (blocked)
Oct 24 02:34:01 xen-nfs02 pacemaker-schedulerd[32129]: notice: Calculated transition 4, saving inputs in /var/lib/pacemaker/pengine/pe-input-471.bz2
Oct 24 02:34:01 xen-nfs02 pacemaker-controld[32130]: notice: Transition 4 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-471.bz2): Complete
Oct 24 02:34:01 xen-nfs02 pacemaker-controld[32130]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Oct 24 02:34:04 xen-nfs02 corosync-qdevice[31543]: Can't connect to qnetd host (-5980): Network address is presently unreachable
Oct 24 02:34:12 xen-nfs02 corosync-qdevice[31543]: Connect timeout