On Fri, Jul 21, 2023 at 9:51 AM Novik Arthur <freish...@gmail.com> wrote:
>
> Hello Andrew, Ken and the entire community!
>
> I faced a problem and I would like to ask for help.
>
> Preamble:
> I have a dual-controller storage system (C0, C1) with 2 VMs per controller
> (vm0[1,2] on C0, vm0[3,4] on C1).
> I did an online controller upgrade (updating the firmware on the physical
> controller), and for that purpose we have a special procedure:
>
> 1. Put all VMs on the controller that will be updated into standby mode
>    (vm0[3,4] in the logs).
> 2. Once all resources are moved to the spare controller's VMs, turn on
>    maintenance-mode (the DC machine is vm01).
> 3. Shut down vm0[3,4] and perform the firmware update on C1 (OS + KVM +
>    HCA/HBA + BMC drivers will be updated).
> 4. Reboot C1.
> 5. Start vm0[3,4].
>    On this step I hit the problem.
> 6. Do the same steps for C0 (turn off maintenance-mode, put nodes 3-4
>    online, put nodes 1-2 into standby, maintenance-mode, etc.).
>
> Here is what I observed during step 5.
> Machine vm03 started without problems, but vm04 caught a critical error and
> its HA stack died. If I manually start Pacemaker one more time, it starts
> without problems and vm04 joins the cluster.
>
> Some logs from vm04:
>
> Jul 21 04:05:39 vm04 corosync[3061]: [QUORUM] This node is within the primary component and will provide service.
> Jul 21 04:05:39 vm04 corosync[3061]: [QUORUM] Members[4]: 1 2 3 4
> Jul 21 04:05:39 vm04 corosync[3061]: [MAIN ] Completed service synchronization, ready to provide service.
> Jul 21 04:05:39 vm04 corosync[3061]: [KNET ] rx: host: 3 link: 1 is up
> Jul 21 04:05:39 vm04 corosync[3061]: [KNET ] link: Resetting MTU for link 1 because host 3 joined
> Jul 21 04:05:39 vm04 corosync[3061]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
> Jul 21 04:05:39 vm04 pacemaker-attrd[4240]: notice: Setting ifspeed-lnet-o2ib-o2ib[vm02]: (unset) -> 600
> Jul 21 04:05:40 vm04 corosync[3061]: [KNET ] pmtud: PMTUD link change for host: 3 link: 1 from 453 to 65413
> Jul 21 04:05:40 vm04 corosync[3061]: [KNET ] pmtud: Global data MTU changed to: 1397
> Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ping-lnet-o2ib-o2ib[vm02]: (unset) -> 4000
> Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ifspeed-lnet-o2ib-o2ib[vm01]: (unset) -> 600
> Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ping-lnet-o2ib-o2ib[vm01]: (unset) -> 4000
> Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: State transition S_NOT_DC -> S_STOPPING
> Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot execute monitor of sfa-home-vd: No executor connection
> Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot calculate digests for operation sfa-home-vd_monitor_0 because we have no connection to executor for vm04
> Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of probe operation for sfa-home-vd on vm04: Error (No executor connection)
> Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot execute monitor of ifspeed-lnet-o2ib-o2ib: No executor connection
> Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot calculate digests for operation ifspeed-lnet-o2ib-o2ib_monitor_0 because we have no connection to executor for vm04
> Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of probe operation for ifspeed-lnet-o2ib-o2ib on vm04: Error (No executor connection)
> Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot execute monitor of ping-lnet-o2ib-o2ib: No executor connection
> Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot calculate digests for operation ping-lnet-o2ib-o2ib_monitor_0 because we have no connection to executor for vm04
> Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of probe operation for ping-lnet-o2ib-o2ib on vm04: Error (No executor connection)
> Jul 21 04:05:49 vm04 pacemakerd[4127]: notice: pacemaker-controld[4257] is unresponsive to ipc after 1 tries
> Jul 21 04:05:52 vm04 pacemakerd[4127]: warning: Shutting cluster down because pacemaker-controld[4257] had fatal failure
> Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down Pacemaker
> Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-schedulerd
> Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-attrd
> Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-execd
> Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-fenced
> Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-based
> Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutdown complete
> Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down and staying down after fatal error
>
> Jul 21 04:05:44 vm04 root[10111]: openibd: Set node_desc for mlx5_0: vm04 HCA-1
> Jul 21 04:05:44 vm04 root[10113]: openibd: Set node_desc for mlx5_1: vm04 HCA-2
> Jul 21 04:05:47 vm04 pacemaker-controld[4257]: error: Shutting down controller after unexpected shutdown request from vm01
> Jul 21 04:05:47 vm04 pacemaker-controld[4257]: Problem detected at handle_shutdown_ack:954 (controld_messages.c), please see /var/lib/pacemaker/blackbox/pacemaker-controld-4257.1 for additional details
> Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: State transition S_NOT_DC -> S_STOPPING
> Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: Disconnected from the executor
> Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: Disconnected from Corosync
> Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: Disconnected from the CIB manager
> Jul 21 04:05:48 vm04 pacemaker-controld[4257]: notice: Disconnected from the CIB manager
> Jul 21 04:05:48 vm04 pacemaker-controld[4257]: crit: GLib: g_hash_table_lookup: assertion 'hash_table != NULL' failed
> Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot execute monitor of sfa-home-vd: No executor connection
> Jul 21 04:05:48 vm04 pacemaker-controld[4257]: crit: GLib: g_hash_table_lookup: assertion 'hash_table != NULL' failed
> Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot calculate digests for operation sfa-home-vd_monitor_0 because we have no connection to executor for vm04
> Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Resource update -107 failed: (rc=-107) Transport endpoint is not connected
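The crit lines above show g_hash_table_lookup() being handed a NULL table after the controller's executor state has already been torn down. As a schematic illustration of that failure mode and of a defensive guard (a hypothetical Python sketch, not Pacemaker's actual code, which is C and uses a GHashTable; the names do_probe and do_probe_guarded are invented):

```python
# Sketch of the teardown race: the executor-state table is destroyed
# during shutdown, but callbacks still queued on the mainloop try to
# look entries up afterwards.

lrm_state_table = {"vm04": {"connected": True}}  # stands in for the hash table
pending_callbacks = []


def do_probe(node):
    # Unguarded lookup: once the table is gone this is the moral
    # equivalent of g_hash_table_lookup(NULL, ...), which trips
    # GLib's "assertion 'hash_table != NULL' failed".
    if lrm_state_table is None:
        raise AssertionError("assertion 'hash_table != NULL' failed")
    return lrm_state_table.get(node)


def do_probe_guarded(node):
    # Defensive variant: treat "table already freed" as "no executor state".
    if lrm_state_table is None:
        return None
    return lrm_state_table.get(node)


# Shutdown path: the table is dropped first...
pending_callbacks.append(lambda: do_probe_guarded("vm04"))
lrm_state_table = None

# ...then the mainloop is drained, firing the leftover callbacks.
results = [cb() for cb in pending_callbacks]
print(results)
```

The guarded variant simply reports "no executor connection" instead of asserting, which matches the error/warning lines that accompany each crit message above.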
The controller disconnects from the executor and deletes the executor state table (lrm_state_table) in the middle of the shutdown process:
https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.1.5/daemons/controld/controld_control.c#L240

These crit messages happen when we try to look up executor state while lrm_state_table is NULL. That shouldn't happen. I guess the lookups are happening while the mainloop is being drained:
https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.1.5/daemons/controld/controld_control.c#L286-L294

> the log from DC vm01:
> Jul 21 04:05:39 vm01 pacemaker-controld[4048]: notice: Transition 16 aborted: Peer Halt
> Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Detected another attribute writer (vm04), starting new election
> Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Setting #attrd-protocol[vm04]: (unset) -> 5
> Jul 21 04:05:40 vm01 pacemaker-controld[4048]: notice: Finalizing join-2 for 1 node (sync'ing CIB from vm02)
> Jul 21 04:05:40 vm01 pacemaker-controld[4048]: notice: Requested CIB version <generation_tuple crm_feature_set="3.16.2" validate-with="pacemaker-3.9" epoch="567" num_updates="111" admin_epoch="0" cib-last-written="Fri Jul 21 03:48:43 2023" update-origin="vm01" update-client="cibadmin" update-user="root" have-quorum="0" dc-uuid="1"/>
> Jul 21 04:05:40 vm01 pacemaker-attrd[4017]: notice: Recorded local node as attribute writer (was unset)
> Jul 21 04:05:40 vm01 pacemaker-attrd[4017]: notice: Setting #feature-set[vm04]: (unset) -> 3.16.2
> Jul 21 04:05:41 vm01 pacemaker-controld[4048]: notice: Transition 16 aborted by deletion of lrm[@id='4']: Resource state removal
> Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice: No fencing will be done until there are resources to manage
> Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice:  * Shutdown vm04
> Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice: Calculated transition 17, saving inputs in /var/lib/pacemaker/pengine/pe-input-940.bz2
>
> As far as I understand, vm04 was killed by DC during the election of a new attr writer?

Not sure yet; maybe someone else recognizes this.

I see the transition was aborted due to "Peer Halt" right after node vm04 joined. A new election started when node vm04 was detected as another attribute writer. Node vm04's resource state was removed, which is a normal part of the join sequence; this caused another transition abort message for the same transition number.

Jul 21 04:05:39 vm01 pacemaker-controld[4048]: notice: Node vm04 state is now member
...
Jul 21 04:05:39 vm01 corosync[3134]: [KNET ] pmtud: Global data MTU changed to: 1397
Jul 21 04:05:39 vm01 pacemaker-controld[4048]: notice: Transition 16 aborted: Peer Halt
Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Detected another attribute writer (vm04), starting new election
Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Setting #attrd-protocol[vm04]: (unset) -> 5
...
Jul 21 04:05:41 vm01 pacemaker-controld[4048]: notice: Transition 16 aborted by deletion of lrm[@id='4']: Resource state removal

Looking at pe-input-939 and pe-input-940, node vm04 was marked as shut down:

Jul 21 04:05:38 vm01 pacemaker-schedulerd[4028]: notice: Calculated transition 16, saving inputs in /var/lib/pacemaker/pengine/pe-input-939.bz2
Jul 21 04:05:44 vm01 pacemaker-controld[4048]: notice: Transition 16 (Complete=24, Pending=0, Fired=0, Skipped=34, Incomplete=34, Source=/var/lib/pacemaker/pengine/pe-input-939.bz2): Stopped
Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice:  * Shutdown vm04
Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice: Calculated transition 17, saving inputs in /var/lib/pacemaker/pengine/pe-input-940.bz2

939:
<node_state id="4" uname="vm04" in_ccm="false" crmd="offline" crm-debug-origin="do_state_transition" join="down" expected="down">
  <transient_attributes id="4">
    <instance_attributes id="status-4">
      <nvpair id="status-4-.feature-set" name="#feature-set" value="3.16.2"/>
      <nvpair id="status-4-shutdown" name="shutdown" value="1689911346"/>
    </instance_attributes>
  </transient_attributes>

940:
<node_state id="4" uname="vm04" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" expected="member">
  <transient_attributes id="4">
    <instance_attributes id="status-4">
      <nvpair id="status-4-.feature-set" name="#feature-set" value="3.16.2"/>
      <nvpair id="status-4-shutdown" name="shutdown" value="1689911346"/>
    </instance_attributes>
  </transient_attributes>

I suppose that node vm04's state was not updated before the transition was aborted. So when the new transition (940) ran, the scheduler saw that node vm04 was still expected to be in the shutdown state, and it triggered a shutdown.

This behavior might already be fixed upstream by the following commit:
https://github.com/ClusterLabs/pacemaker/commit/5e3b3d14

That commit introduced a regression, however, and I'm working on fixing it.

> The issue is reproducible from time to time, and the Pacemaker version is
> "2.1.5-8.1.el8_8-a3f44794f94" from Rocky Linux 8.8.
>
> I attached crm_report with the blackbox. I have debug logs, but they are
> pretty heavy (~40 MB with bzip --best). Please tell me if you need them.
>
> Thanks,
> Arthur

--
Regards,
Reid Wahl (He/Him)
Senior Software Engineer, Red Hat
RHEL High Availability - Pacemaker

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
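P.S. The reading of the two pe-inputs can be checked mechanically: the node_state from pe-input-940 reports vm04 as a full member again, yet the transient "shutdown" attribute recorded during the earlier shutdown is still present, which is what the scheduler acts on. A standalone Python sketch (the XML fragments are copied from the node_state excerpts quoted above, completed with closing tags so they parse; shutdown_pending is an invented helper, not a Pacemaker API):

```python
import xml.etree.ElementTree as ET

# node_state fragments as quoted from pe-input-939 and pe-input-940
PE_939 = """<node_state id="4" uname="vm04" in_ccm="false" crmd="offline"
  crm-debug-origin="do_state_transition" join="down" expected="down">
  <transient_attributes id="4">
    <instance_attributes id="status-4">
      <nvpair id="status-4-.feature-set" name="#feature-set" value="3.16.2"/>
      <nvpair id="status-4-shutdown" name="shutdown" value="1689911346"/>
    </instance_attributes>
  </transient_attributes>
</node_state>"""

PE_940 = """<node_state id="4" uname="vm04" in_ccm="true" crmd="online"
  crm-debug-origin="do_state_transition" join="member" expected="member">
  <transient_attributes id="4">
    <instance_attributes id="status-4">
      <nvpair id="status-4-.feature-set" name="#feature-set" value="3.16.2"/>
      <nvpair id="status-4-shutdown" name="shutdown" value="1689911346"/>
    </instance_attributes>
  </transient_attributes>
</node_state>"""


def shutdown_pending(node_state_xml):
    """True if the node carries a nonzero transient 'shutdown' attribute."""
    state = ET.fromstring(node_state_xml)
    for nv in state.iter("nvpair"):
        if nv.get("name") == "shutdown" and nv.get("value", "0") != "0":
            return True
    return False


for label, frag in (("939", PE_939), ("940", PE_940)):
    state = ET.fromstring(frag)
    print(label, "join:", state.get("join"),
          "expected:", state.get("expected"),
          "shutdown pending:", shutdown_pending(frag))
# In 940 the node is back to join="member"/expected="member", but the
# stale shutdown attribute survived the rejoin, which is why transition
# 17 scheduled "* Shutdown vm04".
```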