On Mon, 2024-01-29 at 22:48 +0000, Faaland, Olaf P. wrote:
> Thank you, Ken.
> 
> I changed my configuration management system to put an initial
> cib.xml into /var/lib/pacemaker/cib/, which sets all the property
> values I was setting via pcs commands, including dc-deadtime. I
> removed those "pcs property set" commands from the ones that are run
> at startup time.
> 
> That worked in the sense that after Pacemaker starts, the node waits
> for my newly specified dc-deadtime of 300s before giving up on the
> partner node and fencing it, if the partner never appears as a
> member.
> 
> However, now it seems to wait that amount of time before it elects a
> DC, even when quorum is acquired earlier. In my log snippet below,
> with dc-deadtime 300s,
The dc-deadtime is not waiting for quorum, but for another DC to show
up. If all nodes show up, it can proceed, but otherwise it has to wait.

> 
> 14:14:24  Pacemaker starts on gopher12
> 14:17:04  quorum is acquired
> 14:19:26  Election Trigger just popped (start time + dc-deadtime seconds)
> 14:19:26  gopher12 wins the election
> 
> Is there other configuration that needs to be present in the cib at
> startup time?
> 
> thanks,
> Olaf
> 
> === log extract using new system of installing partial cib.xml before startup
> Jan 29 14:14:24 gopher12 pacemakerd          [123690] (main)  notice: Starting Pacemaker 2.1.7-1.t4 | build=2.1.7 features:agent-manpages ascii-docs compat-2.0 corosync-ge-2 default-concurrent-fencing generated-manpages monotonic nagios ncurses remote systemd
> Jan 29 14:14:25 gopher12 pacemaker-attrd     [123695] (attrd_start_election_if_needed)  info: Starting an election to determine the writer
> Jan 29 14:14:25 gopher12 pacemaker-attrd     [123695] (election_check)  info: election-attrd won by local node
> Jan 29 14:14:25 gopher12 pacemaker-controld  [123697] (peer_update_callback)  info: Cluster node gopher12 is now member (was in unknown state)
> Jan 29 14:17:04 gopher12 pacemaker-controld  [123697] (quorum_notification_cb)  notice: Quorum acquired | membership=54 members=2
> Jan 29 14:19:26 gopher12 pacemaker-controld  [123697] (crm_timer_popped)  info: Election Trigger just popped | input=I_DC_TIMEOUT time=300000ms
> Jan 29 14:19:26 gopher12 pacemaker-controld  [123697] (do_log)  warning: Input I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped
> Jan 29 14:19:26 gopher12 pacemaker-controld  [123697] (do_state_transition)  info: State transition S_PENDING -> S_ELECTION | input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped
> Jan 29 14:19:26 gopher12 pacemaker-controld  [123697] (election_check)  info: election-DC won by local node
> Jan 29 14:19:26 gopher12 pacemaker-controld  [123697] (do_log)  info: Input I_ELECTION_DC received in state S_ELECTION from election_win_cb
> Jan 29 14:19:26 gopher12 pacemaker-controld  [123697] (do_state_transition)  notice: State transition S_ELECTION -> S_INTEGRATION | input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=election_win_cb
> Jan 29 14:19:26 gopher12 pacemaker-schedulerd[123696] (recurring_op_for_active)  info: Start 10s-interval monitor for gopher11_zpool on gopher11
> Jan 29 14:19:26 gopher12 pacemaker-schedulerd[123696] (recurring_op_for_active)  info: Start 10s-interval monitor for gopher12_zpool on gopher12
> 
> === initial cib.xml contents
> <cib crm_feature_set="3.19.0" validate-with="pacemaker-3.9" epoch="9"
>      num_updates="0" admin_epoch="0" cib-last-written="Mon Jan 29 11:07:06 2024"
>      update-origin="gopher12" update-client="root" update-user="root"
>      have-quorum="0" dc-uuid="2">
>   <configuration>
>     <crm_config>
>       <cluster_property_set id="cib-bootstrap-options">
>         <nvpair id="cib-bootstrap-options-stonith-action" name="stonith-action" value="off"/>
>         <nvpair id="cib-bootstrap-options-have-watchdog" name="have-watchdog" value="false"/>
>         <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="2.1.7-1.t4-2.1.7"/>
>         <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
>         <nvpair id="cib-bootstrap-options-cluster-name" name="cluster-name" value="gopher11"/>
>         <nvpair id="cib-bootstrap-options-cluster-recheck-inte" name="cluster-recheck-interval" value="60"/>
>         <nvpair id="cib-bootstrap-options-start-failure-is-fat" name="start-failure-is-fatal" value="false"/>
>         <nvpair id="cib-bootstrap-options-dc-deadtime" name="dc-deadtime" value="300"/>
>       </cluster_property_set>
>     </crm_config>
>     <nodes>
>       <node id="1" uname="gopher11"/>
>       <node id="2" uname="gopher12"/>
>     </nodes>
>     <resources/>
>     <constraints/>
>   </configuration>
> </cib>
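
As an aside, here is a rough, untested sketch of one way a partial
cib.xml like the one above could be generated and checked offline
before Pacemaker is first started. The cibadmin --empty, pcs -f and
crm_verify invocations, and the hacluster:haclient ownership, are
assumptions to double-check against the versions you have installed:

  # Generate an empty CIB skeleton in a scratch file (no running
  # cluster is needed for this)
  cibadmin --empty > /tmp/bootstrap-cib.xml

  # Set the desired cluster properties against the file rather than
  # against a live cluster
  pcs -f /tmp/bootstrap-cib.xml property set stonith-action=off
  pcs -f /tmp/bootstrap-cib.xml property set cluster-recheck-interval=60
  pcs -f /tmp/bootstrap-cib.xml property set start-failure-is-fatal=false
  pcs -f /tmp/bootstrap-cib.xml property set dc-deadtime=300

  # Sanity-check that the result is still a valid CIB
  crm_verify --xml-file /tmp/bootstrap-cib.xml

  # Install it where pacemaker-based expects it, before Pacemaker starts
  install -o hacluster -g haclient -m 0600 \
      /tmp/bootstrap-cib.xml /var/lib/pacemaker/cib/cib.xml

The point of doing it against a file is the same as pre-seeding the CIB
in the image: the properties are already in place when the controller
starts its very first election timer.
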
> 
> ________________________________________
> From: Ken Gaillot <kgail...@redhat.com>
> Sent: Monday, January 29, 2024 10:51 AM
> To: Cluster Labs - All topics related to open-source clustering
> welcomed
> Cc: Faaland, Olaf P.
> Subject: Re: [ClusterLabs] controlling cluster behavior on startup
> 
> On Mon, 2024-01-29 at 18:05 +0000, Faaland, Olaf P. via Users wrote:
> > Hi,
> > 
> > I have configured clusters of node pairs, so each cluster has 2
> > nodes. The cluster members are statically defined in corosync.conf
> > before corosync or pacemaker is started, and quorum {two_node: 1}
> > is set.
> > 
> > When both nodes are powered off and I power them on, they do not
> > start pacemaker at exactly the same time. The time difference may
> > be a few minutes depending on other factors outside the nodes.
> > 
> > My goals are (I call the first node to start pacemaker "node1"):
> > 1) I want to control how long pacemaker on node1 waits before
> > fencing node2 if node2 does not start pacemaker.
> > 2) If node1 is part-way through that waiting period, and node2
> > starts pacemaker so they detect each other, I would like them to
> > proceed immediately to probing resource state and starting
> > resources which are down, not wait until the end of that "grace
> > period".
> > 
> > It looks from the documentation like dc-deadtime is how #1 is
> > controlled, and #2 is expected normal behavior. However, I'm
> > seeing fence actions before dc-deadtime has passed.
> > 
> > Am I misunderstanding Pacemaker's expected behavior and/or how
> > dc-deadtime should be used?
> 
> You have everything right. The problem is that you're starting with
> an empty configuration every time, so the default dc-deadtime is
> being used for the first election (before you can set the desired
> value).
> 
> I can't think of anything you can do to get around that, since the
> controller starts the timer as soon as it starts up. Would it be
> possible to bake an initial configuration into the PXE image?
> 
> When the timer value changes, we could stop the existing timer and
> restart it. There's a risk that some external automation could make
> repeated changes to the timeout, thus never letting it expire, but
> that seems preferable to your problem. I've created an issue for
> that:
> 
> https://projects.clusterlabs.org/T764
> 
> BTW there's also election-timeout. I'm not sure offhand how that
> interacts; it might be necessary to raise that one as well.
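
If election-timeout does turn out to need the same treatment, it can be
pre-seeded the same way as dc-deadtime. As a rough, untested example
(reusing the hypothetical scratch file from the sketch above; matching
the 300s value is arbitrary):

  pcs -f /tmp/bootstrap-cib.xml property set election-timeout=300

which just adds one more nvpair to the same cib-bootstrap-options
cluster_property_set.
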
> > One possibly unusual aspect of this cluster is that these two nodes
> > are stateless - they PXE boot from an image on another server - and
> > I build the cluster configuration at boot time with a series of pcs
> > commands, because the nodes have no local storage for this purpose.
> > The commands are:
> > 
> > ['pcs', 'cluster', 'start']
> > ['pcs', 'property', 'set', 'stonith-action=off']
> > ['pcs', 'property', 'set', 'cluster-recheck-interval=60']
> > ['pcs', 'property', 'set', 'start-failure-is-fatal=false']
> > ['pcs', 'property', 'set', 'dc-deadtime=300']
> > ['pcs', 'stonith', 'create', 'fence_gopher11', 'fence_powerman',
> >  'ip=192.168.64.65', 'pcmk_host_check=static-list',
> >  'pcmk_host_list=gopher11,gopher12']
> > ['pcs', 'stonith', 'create', 'fence_gopher12', 'fence_powerman',
> >  'ip=192.168.64.65', 'pcmk_host_check=static-list',
> >  'pcmk_host_list=gopher11,gopher12']
> > ['pcs', 'resource', 'create', 'gopher11_zpool', 'ocf:llnl:zpool',
> >  'import_options="-f -N -d /dev/disk/by-vdev"', 'pool=gopher11',
> >  'op', 'start', 'timeout=805']
> > ...
> > ['pcs', 'property', 'set', 'no-quorum-policy=ignore']
> 
> BTW you don't need to change no-quorum-policy when you're using
> two_node with Corosync.
> 
> > I could, instead, generate a CIB so that when Pacemaker is started,
> > it has a full config. Is that better?
> > 
> > thanks,
> > Olaf
> > 
> > === corosync.conf:
> > totem {
> >     version: 2
> >     cluster_name: gopher11
> >     secauth: off
> >     transport: udpu
> > }
> > nodelist {
> >     node {
> >         ring0_addr: gopher11
> >         name: gopher11
> >         nodeid: 1
> >     }
> >     node {
> >         ring0_addr: gopher12
> >         name: gopher12
> >         nodeid: 2
> >     }
> > }
> > quorum {
> >     provider: corosync_votequorum
> >     two_node: 1
> > }
> > 
> > === Log excerpt
> > 
> > Here's an excerpt from Pacemaker logs that reflects what I'm
> > seeing. These are from gopher12, the node that came up first. The
> > other node, which is not yet up, is gopher11.
> > 
> > Jan 25 17:55:38 gopher12 pacemakerd          [116033] (main)  notice: Starting Pacemaker 2.1.7-1.t4 | build=2.1.7 features:agent-manpages ascii-docs compat-2.0 corosync-ge-2 default-concurrent-fencing generated-manpages monotonic nagios ncurses remote systemd
> > Jan 25 17:55:39 gopher12 pacemaker-controld  [116040] (peer_update_callback)  info: Cluster node gopher12 is now member (was in unknown state)
> > Jan 25 17:55:43 gopher12 pacemaker-based     [116035] (cib_perform_op)  info: ++ /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']:  <nvpair id="cib-bootstrap-options-dc-deadtime" name="dc-deadtime" value="300"/>
> > Jan 25 17:56:00 gopher12 pacemaker-controld  [116040] (crm_timer_popped)  info: Election Trigger just popped | input=I_DC_TIMEOUT time=300000ms
> > Jan 25 17:56:01 gopher12 pacemaker-based     [116035] (cib_perform_op)  info: ++ /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']:  <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
> > Jan 25 17:56:01 gopher12 pacemaker-controld  [116040] (abort_transition_graph)  info: Transition 0 aborted by cib-bootstrap-options-no-quorum-policy doing create no-quorum-policy=ignore: Configuration change | cib=0.26.0 source=te_update_diff_v2:464 path=/cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options'] complete=true
> > Jan 25 17:56:01 gopher12 pacemaker-controld  [116040] (controld_execute_fence_action)  notice: Requesting fencing (off) targeting node gopher11 | action=11 timeout=60
> 
> --
> Ken Gaillot <kgail...@redhat.com>

--
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/