On Mon, 2024-01-29 at 14:35 -0800, Reid Wahl wrote:
> 
> On Monday, January 29, 2024, Ken Gaillot <kgail...@redhat.com> wrote:
> > On Mon, 2024-01-29 at 18:05 +0000, Faaland, Olaf P. via Users
> > wrote:
> >> Hi,
> >>
> >> I have configured clusters of node pairs, so each cluster has 2
> >> nodes. The cluster members are statically defined in
> >> corosync.conf before corosync or pacemaker is started, and
> >> quorum {two_node: 1} is set.
> >>
> >> When both nodes are powered off and I power them on, they do not
> >> start pacemaker at exactly the same time. The time difference may
> >> be a few minutes depending on other factors outside the nodes.
> >>
> >> My goals are (I call the first node to start pacemaker "node1"):
> >> 1) I want to control how long pacemaker on node1 waits before
> >> fencing node2 if node2 does not start pacemaker.
> >> 2) If node1 is part-way through that waiting period, and node2
> >> starts pacemaker so they detect each other, I would like them to
> >> proceed immediately to probing resource state and starting
> >> resources which are down, not wait until the end of that "grace
> >> period".
> >>
> >> It looks from the documentation like dc-deadtime is how #1 is
> >> controlled, and #2 is expected normal behavior. However, I'm
> >> seeing fence actions before dc-deadtime has passed.
> >>
> >> Am I misunderstanding Pacemaker's expected behavior and/or how
> >> dc-deadtime should be used?
> >
> > You have everything right. The problem is that you're starting
> > with an empty configuration every time, so the default dc-deadtime
> > is being used for the first election (before you can set the
> > desired value).
> 
> Why would there be fence actions before dc-deadtime expires though?

There aren't -- after the (default) dc-deadtime pops, the node elects
itself DC and runs the scheduler, which considers the other node
unseen and in need of startup fencing. The dc-deadtime has been raised
in the meantime, but that no longer matters.
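For reference, the two properties at play here can be checked with
crm_attribute (a quick sketch; note that if a property has never been
set, --query reports an error rather than the default, which is 20s
for dc-deadtime and true for startup-fencing):

  crm_attribute --type crm_config --name dc-deadtime --query
  crm_attribute --type crm_config --name startup-fencing --query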
> >
> > I can't think of anything you can do to get around that, since the
> > controller starts the timer as soon as it starts up. Would it be
> > possible to bake an initial configuration into the PXE image?
> >
> > When the timer value changes, we could stop the existing timer and
> > restart it. There's a risk that some external automation could
> > make repeated changes to the timeout, thus never letting it
> > expire, but that seems preferable to your problem. I've created an
> > issue for that:
> >
> > https://projects.clusterlabs.org/T764
> >
> > BTW there's also election-timeout. I'm not sure offhand how that
> > interacts; it might be necessary to raise that one as well.
> >
> >>
> >> One possibly unusual aspect of this cluster is that these two
> >> nodes are stateless - they PXE boot from an image on another
> >> server - and I build the cluster configuration at boot time with
> >> a series of pcs commands, because the nodes have no local storage
> >> for this purpose. The commands are:
> >>
> >> ['pcs', 'cluster', 'start']
> >> ['pcs', 'property', 'set', 'stonith-action=off']
> >> ['pcs', 'property', 'set', 'cluster-recheck-interval=60']
> >> ['pcs', 'property', 'set', 'start-failure-is-fatal=false']
> >> ['pcs', 'property', 'set', 'dc-deadtime=300']
> >> ['pcs', 'stonith', 'create', 'fence_gopher11', 'fence_powerman',
> >> 'ip=192.168.64.65', 'pcmk_host_check=static-list',
> >> 'pcmk_host_list=gopher11,gopher12']
> >> ['pcs', 'stonith', 'create', 'fence_gopher12', 'fence_powerman',
> >> 'ip=192.168.64.65', 'pcmk_host_check=static-list',
> >> 'pcmk_host_list=gopher11,gopher12']
> >> ['pcs', 'resource', 'create', 'gopher11_zpool', 'ocf:llnl:zpool',
> >> 'import_options="-f -N -d /dev/disk/by-vdev"', 'pool=gopher11',
> >> 'op', 'start', 'timeout=805']
> >> ...
> >> ['pcs', 'property', 'set', 'no-quorum-policy=ignore']
> >
> > BTW you don't need to change no-quorum-policy when you're using
> > two_node with Corosync.
> >
> >>
> >> I could, instead, generate a CIB so that when Pacemaker is
> >> started, it has a full config. Is that better?
> >>
> >> thanks,
> >> Olaf
> >>
> >> === corosync.conf:
> >> totem {
> >>     version: 2
> >>     cluster_name: gopher11
> >>     secauth: off
> >>     transport: udpu
> >> }
> >> nodelist {
> >>     node {
> >>         ring0_addr: gopher11
> >>         name: gopher11
> >>         nodeid: 1
> >>     }
> >>     node {
> >>         ring0_addr: gopher12
> >>         name: gopher12
> >>         nodeid: 2
> >>     }
> >> }
> >> quorum {
> >>     provider: corosync_votequorum
> >>     two_node: 1
> >> }
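Regarding generating a CIB up front (the question above): a rough
sketch of how the boot scripts could build the configuration before
pacemaker starts, so the first election already sees dc-deadtime=300.
This is untested, and the CIB path and ownership below are the usual
defaults but may differ by distro or build:

  cibadmin --empty > /tmp/cib.xml
  pcs -f /tmp/cib.xml property set dc-deadtime=300
  pcs -f /tmp/cib.xml property set stonith-action=off
  # ... remaining property/stonith/resource commands, each with -f ...
  install -o hacluster -g haclient -m 0600 /tmp/cib.xml \
      /var/lib/pacemaker/cib/cib.xml
  pcs cluster start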
> >>
> >> === Log excerpt
> >>
> >> Here's an excerpt from Pacemaker logs that reflect what I'm
> >> seeing. These are from gopher12, the node that came up first. The
> >> other node, which is not yet up, is gopher11.
> >>
> >> Jan 25 17:55:38 gopher12 pacemakerd [116033]
> >> (main) notice: Starting Pacemaker 2.1.7-1.t4 | build=2.1.7
> >> features: agent-manpages ascii-docs compat-2.0 corosync-ge-2
> >> default-concurrent-fencing generated-manpages monotonic nagios
> >> ncurses remote systemd
> >> Jan 25 17:55:39 gopher12 pacemaker-controld [116040]
> >> (peer_update_callback) info: Cluster node gopher12 is now member
> >> (was in unknown state)
> >> Jan 25 17:55:43 gopher12 pacemaker-based [116035]
> >> (cib_perform_op) info: ++
> >> /cib/configuration/crm_config/cluster_property_set[@id='cib-
> >> bootstrap-options']: <nvpair id="cib-bootstrap-options-dc-
> >> deadtime" name="dc-deadtime" value="300"/>
> >> Jan 25 17:56:00 gopher12 pacemaker-controld [116040]
> >> (crm_timer_popped) info: Election Trigger just popped |
> >> input=I_DC_TIMEOUT time=300000ms
> >> Jan 25 17:56:01 gopher12 pacemaker-based [116035]
> >> (cib_perform_op) info: ++
> >> /cib/configuration/crm_config/cluster_property_set[@id='cib-
> >> bootstrap-options']: <nvpair id="cib-bootstrap-options-no-quorum-
> >> policy" name="no-quorum-policy" value="ignore"/>
> >> Jan 25 17:56:01 gopher12 pacemaker-controld [116040]
> >> (abort_transition_graph) info: Transition 0 aborted by cib-
> >> bootstrap-options-no-quorum-policy doing create no-quorum-
> >> policy=ignore: Configuration change | cib=0.26.0
> >> source=te_update_diff_v2:464
> >> path=/cib/configuration/crm_config/cluster_property_set[@id='cib-
> >> bootstrap-options'] complete=true
> >> Jan 25 17:56:01 gopher12 pacemaker-controld [116040]
> >> (controld_execute_fence_action) notice: Requesting fencing (off)
> >> targeting node gopher11 | action=11 timeout=60
-- 
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/