This is the last update from my side, and we can close the thread. We
made the change to shut down nodes sequentially, and after ~30 cycles
(each cycle covers 3 HA groups with 4 nodes and 3 storages) we can say
that the proposed workaround works as expected. I saw
https://github.com/ClusterLabs/pacemaker/pull/3177 and
https://github.com/ClusterLabs/pacemaker/pull/3178, but I haven't
checked how they work, so we kept the original version.
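For anyone who wants to script the same thing, the sequential variant
just stops the cluster on one node at a time and lets each stop finish
before the next one starts. A minimal sketch (assuming you manage the
cluster with pcs; the node names are examples from our setup):

    #!/bin/sh
    # Stop cluster services on the listed nodes one at a time, so that
    # the DC and the attribute writer are never shutting down at the
    # same moment. "pcs cluster stop <node>" waits for pacemaker and
    # corosync to stop on that node before returning.
    for node in vm03 vm04; do
        pcs cluster stop "$node" || exit 1
    done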
Thanks everybody,
Arthur Novik

> On Thu, 2023-08-03 at 12:37:18 -0500, Ken Gaillot wrote:
> In the other case, the problem turned out to be a timing issue that
> can occur when the DC and attribute writer are shutting down at the
> same time. Since the problem in this case also occurred after
> shutting down two nodes together, I'm thinking it's likely the same
> issue.
>
> A fix should be straightforward. A workaround in the meantime would
> be to shut down nodes in sequence rather than in parallel, when
> shutting down just some nodes. (Shutting down the entire cluster
> shouldn't be subject to the race condition.)
>
> > On Wed, 2023-08-02 at 16:53 -0500, Ken Gaillot wrote:
> > Ha! I didn't realize crm_report saves blackboxes as text. Always
> > something new to learn with Pacemaker :)
> >
> > As of 2.1.5, the controller gets agent metadata asynchronously,
> > which fixed bugs where synchronous calls blocked the controller.
> > Once the metadata action returns, the original action that required
> > the metadata is attempted.
> >
> > This led to the odd log messages. Normally, agent actions can't be
> > attempted once the shutdown sequence begins. However, in this case,
> > metadata actions were initiated before shutdown but completed after
> > shutdown began. The controller thus attempted the original actions
> > after it had already disconnected from the executor, resulting in
> > the odd logs.
> >
> > The fix for that is simple, but it addresses only the logs, not the
> > original problem that caused the controller to shut down. I'm still
> > looking into that.
> >
> > I've since heard about a similar case, and I suspect that one was
> > related to a node with an older version trying to join a cluster
> > with a newer version.
> >
> > On Fri, 2023-07-28 at 15:21 +0300, Novik Arthur wrote:
> > > 2023-07-21_pacemaker_debug.log.vm01.bz2
> > > 2023-07-21_pacemaker_debug.log.vm02.bz2
> > > 2023-07-21_pacemaker_debug.log.vm03.bz2
> > > 2023-07-21_pacemaker_debug.log.vm04.bz2
> > > blackbox_txt_vm04.tar.bz2
> > >
> > > On Thu, Jul 27 12:06:42 EDT 2023, Ken Gaillot kgaillot at
> > > redhat.com wrote:
> > > > Running "qb-blackbox /var/lib/pacemaker/blackbox/pacemaker-controld-4257.1"
> > > > (my version can't read it) will show trace logs that might give
> > > > a better idea of what exactly went wrong at this time (though
> > > > these issues are side effects, not the cause).
> > >
> > > Blackboxes were attached to crm_report, and they are in txt
> > > format. Just in case, I'm adding them to this email.
> > >
> > > > FYI, it's not necessary to set cluster-recheck-interval as low
> > > > as 1 minute. A long time ago that could be useful, but modern
> > > > Pacemaker doesn't need it to calculate things such as failure
> > > > expiration. I recommend leaving it at the default, or at least
> > > > raising it to 5 minutes or so.
> > >
> > > That's good to know. Those settings came from Pacemaker 1.x, and
> > > I follow the "don't touch it if it works" rule.
> > >
> > > > vm02, vm03, and vm04 all left the cluster at that time, leaving
> > > > only vm01. At this point, vm01 should have deleted the transient
> > > > attributes for all three nodes. Unfortunately, the logs for that
> > > > would only be in pacemaker.log, which crm_report appears not to
> > > > have grabbed, so I am not sure whether it tried.
> > > Please find debug logs for "Jul 21" from the DC (vm01) and the
> > > crashed node (vm04) attached.
> > >
> > > On Thu, Jul 27 12:06:42 EDT 2023, Ken Gaillot kgaillot at
> > > redhat.com wrote:
> > > > On Wed, 2023-07-26 at 13:29 -0700, Reid Wahl wrote:
> > > > > On Fri, Jul 21, 2023 at 9:51 AM Novik Arthur <freishutz at
> > > > > gmail.com> wrote:
> > > > > > Hello Andrew, Ken and the entire community!
> > > > > >
> > > > > > I faced a problem and I would like to ask for help.
> > > > > >
> > > > > > Preamble:
> > > > > > I have a dual-controller storage system (C0, C1) with 2 VMs
> > > > > > per controller (vm0[1,2] on C0, vm0[3,4] on C1).
> > > > > > I did an online controller upgrade (updating the firmware on
> > > > > > the physical controller), for which we have a special
> > > > > > procedure:
> > > > > >
> > > > > > 1. Put all VMs on the controller that will be updated into
> > > > > >    standby mode (vm0[3,4] in the logs).
> > > > > > 2. Once all resources have moved to the spare controller's
> > > > > >    VMs, turn on maintenance-mode (the DC is vm01).
> > > > > > 3. Shut down vm0[3,4] and perform the firmware update on C1
> > > > > >    (OS + KVM + HCA/HBA + BMC drivers will be updated).
> > > > > > 4. Reboot C1.
> > > > > > 5. Start vm0[3,4]. On this step I hit the problem.
> > > > > > 6. Do the same steps for C0 (turn off maintenance, bring
> > > > > >    nodes 3-4 online, put nodes 1-2 into standby and
> > > > > >    maintenance, etc.).
> > > > > >
> > > > > > Here is what I observed during step 5: vm03 started without
> > > > > > problems, but vm04 hit a critical error and its HA stack
> > > > > > died. If I manually start Pacemaker one more time, it starts
> > > > > > without problems and vm04 joins the cluster.
> > > > > >
> > > > > > Some logs from vm04:
> > > > > >
> > > > > > Jul 21 04:05:39 vm04 corosync[3061]: [QUORUM] This node is within the primary component and will provide service.
> > > > > > Jul 21 04:05:39 vm04 corosync[3061]: [QUORUM] Members[4]: 1 2 3 4
> > > > > > Jul 21 04:05:39 vm04 corosync[3061]: [MAIN ] Completed service synchronization, ready to provide service.
> > > > > > Jul 21 04:05:39 vm04 corosync[3061]: [KNET ] rx: host: 3 link: 1 is up
> > > > > > Jul 21 04:05:39 vm04 corosync[3061]: [KNET ] link: Resetting MTU for link 1 because host 3 joined
> > > > > > Jul 21 04:05:39 vm04 corosync[3061]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
> > > > > > Jul 21 04:05:39 vm04 pacemaker-attrd[4240]: notice: Setting ifspeed-lnet-o2ib-o2ib[vm02]: (unset) -> 600
> > > > > > Jul 21 04:05:40 vm04 corosync[3061]: [KNET ] pmtud: PMTUD link change for host: 3 link: 1 from 453 to 65413
> > > > > > Jul 21 04:05:40 vm04 corosync[3061]: [KNET ] pmtud: Global data MTU changed to: 1397
> > > > > > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ping-lnet-o2ib-o2ib[vm02]: (unset) -> 4000
> > > > > > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ifspeed-lnet-o2ib-o2ib[vm01]: (unset) -> 600
> > > > > > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ping-lnet-o2ib-o2ib[vm01]: (unset) -> 4000
> > > > > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: State transition S_NOT_DC -> S_STOPPING
> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot execute monitor of sfa-home-vd: No executor connection
> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot calculate digests for operation sfa-home-vd_monitor_0 because we have no connection to executor for vm04
> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of probe operation for sfa-home-vd on vm04: Error (No executor connection)
> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot execute monitor of ifspeed-lnet-o2ib-o2ib: No executor connection
> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot calculate digests for operation ifspeed-lnet-o2ib-o2ib_monitor_0 because we have no connection to executor for vm04
> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of probe operation for ifspeed-lnet-o2ib-o2ib on vm04: Error (No executor connection)
> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot execute monitor of ping-lnet-o2ib-o2ib: No executor connection
> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot calculate digests for operation ping-lnet-o2ib-o2ib_monitor_0 because we have no connection to executor for vm04
> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of probe operation for ping-lnet-o2ib-o2ib on vm04: Error (No executor connection)
> > > > > > Jul 21 04:05:49 vm04 pacemakerd[4127]: notice: pacemaker-controld[4257] is unresponsive to ipc after 1 tries
> > > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: warning: Shutting cluster down because pacemaker-controld[4257] had fatal failure
> > > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down Pacemaker
> > > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-schedulerd
> > > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-attrd
> > > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-execd
> > > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-fenced
> > > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-based
> > > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutdown complete
> > > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down and staying down after fatal error
> > > > > >
> > > > > > Jul 21 04:05:44 vm04 root[10111]: openibd: Set node_desc for mlx5_0: vm04 HCA-1
> > > > > > Jul 21 04:05:44 vm04 root[10113]: openibd: Set node_desc for mlx5_1: vm04 HCA-2
> > > > > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: error: Shutting down controller after unexpected shutdown request from vm01
> > > > > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: Problem detected at handle_shutdown_ack:954 (controld_messages.c), please see /var/lib/pacemaker/blackbox/pacemaker-controld-4257.1 for additional details
> > > >
> > > > Running "qb-blackbox /var/lib/pacemaker/blackbox/pacemaker-controld-4257.1"
> > > > (my version can't read it) will show trace logs that might give
> > > > a better idea of what exactly went wrong at this time (though
> > > > these issues are side effects, not the cause).
> > > >
> > > > FYI, it's not necessary to set cluster-recheck-interval as low
> > > > as 1 minute. A long time ago that could be useful, but modern
> > > > Pacemaker doesn't need it to calculate things such as failure
> > > > expiration. I recommend leaving it at the default, or at least
> > > > raising it to 5 minutes or so.
> > > > > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: State transition S_NOT_DC -> S_STOPPING
> > > > > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: Disconnected from the executor
> > > > > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: Disconnected from Corosync
> > > > > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: Disconnected from the CIB manager
> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: notice: Disconnected from the CIB manager
> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: crit: GLib: g_hash_table_lookup: assertion 'hash_table != NULL' failed
> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot execute monitor of sfa-home-vd: No executor connection
> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: crit: GLib: g_hash_table_lookup: assertion 'hash_table != NULL' failed
> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot calculate digests for operation sfa-home-vd_monitor_0 because we have no connection to executor for vm04
> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Resource update -107 failed: (rc=-107) Transport endpoint is not connected
> > > > >
> > > > > The controller disconnects from the executor and deletes the
> > > > > executor state table (lrm_state_table) in the middle of the
> > > > > shutdown process:
> > > > > https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.1.5/daemons/controld/controld_control.c#L240
> > > > >
> > > > > These crit messages are happening when we try to look up
> > > > > executor state when lrm_state_table is NULL. That shouldn't
> > > > > happen. I guess the lookups are happening while draining the
> > > > > mainloop:
> > > > > https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.1.5/daemons/controld/controld_control.c#L286-L294
> > > > >
> > > > > The blackbox should help confirm that.
> > > > > > The log from the DC, vm01:
> > > > > >
> > > > > > Jul 21 04:05:39 vm01 pacemaker-controld[4048]: notice: Transition 16 aborted: Peer Halt
> > > > > > Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Detected another attribute writer (vm04), starting new election
> > > > > > Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Setting #attrd-protocol[vm04]: (unset) -> 5
> > > > > > Jul 21 04:05:40 vm01 pacemaker-controld[4048]: notice: Finalizing join-2 for 1 node (sync'ing CIB from vm02)
> > > > > > Jul 21 04:05:40 vm01 pacemaker-controld[4048]: notice: Requested CIB version <generation_tuple crm_feature_set="3.16.2" validate-with="pacemaker-3.9" epoch="567" num_updates="111" admin_epoch="0" cib-last-written="Fri Jul 21 03:48:43 2023" update-origin="vm01" update-client="cibadmin" update-user="root" have-quorum="0" dc-uuid="1"/>
> > > > > > Jul 21 04:05:40 vm01 pacemaker-attrd[4017]: notice: Recorded local node as attribute writer (was unset)
> > > > > > Jul 21 04:05:40 vm01 pacemaker-attrd[4017]: notice: Setting #feature-set[vm04]: (unset) -> 3.16.2
> > > > > > Jul 21 04:05:41 vm01 pacemaker-controld[4048]: notice: Transition 16 aborted by deletion of lrm[@id='4']: Resource state removal
> > > > > > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice: No fencing will be done until there are resources to manage
> > > > > > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice:  * Shutdown vm04
> > > >
> > > > This is where things start to go wrong, and it has nothing to do
> > > > with any of the messages here. It means that the shutdown node
> > > > attribute was not erased when vm04 shut down the last time
> > > > before this. Going back, we see when that happened:
> > > >
> > > > Jul 21 03:49:06 vm01 pacemaker-attrd[4017]: notice: Setting shutdown[vm04]: (unset) -> 1689911346
> > > >
> > > > vm02, vm03, and vm04 all left the cluster at that time, leaving
> > > > only vm01. At this point, vm01 should have deleted the transient
> > > > attributes for all three nodes. Unfortunately, the logs for that
> > > > would only be in pacemaker.log, which crm_report appears not to
> > > > have grabbed, so I am not sure whether it tried.
> > > >
> > > > > > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice: Calculated transition 17, saving inputs in /var/lib/pacemaker/pengine/pe-input-940.bz2
> > > >
> > > > What's interesting in this transition is that we schedule probes
> > > > on vm04 even though we're shutting it down. That's a bug, and it
> > > > leads to the "No executor connection" messages we see on vm04.
> > > > I've added a task to our project manager to take care of that.
> > > > That's all a side effect, though, and not causing any real
> > > > problems.
> > > >
> > > > > > As far as I understand, vm04 was killed by the DC during the
> > > > > > election of a new attribute writer?
> > > > >
> > > > > Not sure yet; maybe someone else recognizes this.
> > > > >
> > > > > I see the transition was aborted due to peer halt right after
> > > > > node vm04 joined. A new election started due to detection of
> > > > > node vm04 as attribute writer.
> > > > > Node vm04's resource state was removed, which is a normal
> > > > > part of the join sequence; this caused another transition
> > > > > abort message for the same transition number.
> > > > >
> > > > > Jul 21 04:05:39 vm01 pacemaker-controld[4048]: notice: Node vm04 state is now member
> > > > > ...
> > > > > Jul 21 04:05:39 vm01 corosync[3134]: [KNET ] pmtud: Global data MTU changed to: 1397
> > > > > Jul 21 04:05:39 vm01 pacemaker-controld[4048]: notice: Transition 16 aborted: Peer Halt
> > > > > Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Detected another attribute writer (vm04), starting new election
> > > > > Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Setting #attrd-protocol[vm04]: (unset) -> 5
> > > > > ...
> > > > > Jul 21 04:05:41 vm01 pacemaker-controld[4048]: notice: Transition 16 aborted by deletion of lrm[@id='4']: Resource state removal
> > > > >
> > > > > Looking at pe-input-939 and pe-input-940, node vm04 was marked
> > > > > as shut down:
> > > > >
> > > > > Jul 21 04:05:38 vm01 pacemaker-schedulerd[4028]: notice: Calculated transition 16, saving inputs in /var/lib/pacemaker/pengine/pe-input-939.bz2
> > > > > Jul 21 04:05:44 vm01 pacemaker-controld[4048]: notice: Transition 16 (Complete=24, Pending=0, Fired=0, Skipped=34, Incomplete=34, Source=/var/lib/pacemaker/pengine/pe-input-939.bz2): Stopped
> > > > > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice:  * Shutdown vm04
> > > > > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice: Calculated transition 17, saving inputs in /var/lib/pacemaker/pengine/pe-input-940.bz2
> > > > >
> > > > > 939:
> > > > > <node_state id="4" uname="vm04" in_ccm="false" crmd="offline"
> > > > >     crm-debug-origin="do_state_transition" join="down" expected="down">
> > > > >   <transient_attributes id="4">
> > > > >     <instance_attributes id="status-4">
> > > > >       <nvpair id="status-4-.feature-set" name="#feature-set" value="3.16.2"/>
> > > > >       <nvpair id="status-4-shutdown" name="shutdown" value="1689911346"/>
> > > > >     </instance_attributes>
> > > > >   </transient_attributes>
> > > > >
> > > > > 940:
> > > > > <node_state id="4" uname="vm04" in_ccm="true" crmd="online"
> > > > >     crm-debug-origin="do_state_transition" join="member" expected="member">
> > > > >   <transient_attributes id="4">
> > > > >     <instance_attributes id="status-4">
> > > > >       <nvpair id="status-4-.feature-set" name="#feature-set" value="3.16.2"/>
> > > > >       <nvpair id="status-4-shutdown" name="shutdown" value="1689911346"/>
> > > > >     </instance_attributes>
> > > > >   </transient_attributes>
> > > > >
> > > > > I suppose that node vm04's state was not updated before the
> > > > > transition was aborted. So when the new transition (940) ran,
> > > > > the scheduler saw that node vm04 is expected to be in shutdown
> > > > > state, and it triggered a shutdown.
> > > > >
> > > > > This behavior might already be fixed upstream by the following
> > > > > commit:
> > > > > https://github.com/ClusterLabs/pacemaker/commit/5e3b3d14
> > > > >
> > > > > That commit introduced a regression, however, and I'm working
> > > > > on fixing it.
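A practical aside: the stale transient "shutdown" attribute visible in
the pe-input snippets above can be checked, and if necessary cleared,
by hand. A minimal sketch, assuming crm_attribute from Pacemaker 2.1.x
("reboot" is the lifetime used for transient attributes, and vm04 is
an example node name):

    # Query the transient "shutdown" attribute for vm04; after a clean
    # start this should report that the attribute is not set.
    crm_attribute --node vm04 --name shutdown --lifetime reboot --query

    # If a stale timestamp is still recorded, delete it so the
    # scheduler no longer considers vm04 to be pending shutdown.
    crm_attribute --node vm04 --name shutdown --lifetime reboot --delete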
> > > >
> > > > I suspect that's unrelated, because transient attributes are
> > > > cleared when a node leaves rather than when it joins.
> > > >
> > > > > > The issue is reproducible from time to time, and the
> > > > > > Pacemaker version is "2.1.5-8.1.el8_8-a3f44794f94" from
> > > > > > Rocky Linux 8.8. I attached a crm_report with the blackbox.
> > > > > > I have debug logs, but they are pretty heavy (~40 MB with
> > > > > > bzip --best). Please tell me if you need them.
> > > > > >
> > > > > > Thanks,
> > > > > > Arthur
> > > >
> > > > --
> > > > Ken Gaillot <kgaillot at redhat.com>
> > >
> > > --
> > > Arthur Novik
>
> --
> Ken Gaillot <kgaillot at redhat.com>
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/