Attachments:
  2023-07-21_pacemaker_debug.log.vm01.bz2 <https://drive.google.com/file/d/1zPgIjKQw0_COTax1nHr4FJdDKHwW8qMN/view?usp=drive_web>
  2023-07-21_pacemaker_debug.log.vm02.bz2 <https://drive.google.com/file/d/1wxEh4ZjmGwmy0ockclvYPSlMbsxj6v4k/view?usp=drive_web>
  2023-07-21_pacemaker_debug.log.vm03.bz2 <https://drive.google.com/file/d/1YA_nsbuXA_0B2I1u0DbeI_Dd3llbftoP/view?usp=drive_web>
  2023-07-21_pacemaker_debug.log.vm04.bz2 <https://drive.google.com/file/d/1FQkoWM20PNW7VZzoduEKYZqQmb9-Ij5w/view?usp=drive_web>
  blackbox_txt_vm04.tar.bz2 <https://drive.google.com/file/d/1fZXzg4RRSBBWIHqDq-Af4hNsWRxZngQ-/view?usp=drive_web>
On Thu, Jul 27 12:06:42 EDT 2023, Ken Gaillot <kgaillot at redhat.com> wrote:

> Running "qb-blackbox /var/lib/pacemaker/blackbox/pacemaker-controld-
> 4257.1" (my version can't read it) will show trace logs that might give
> a better idea of what exactly went wrong at this time (though these
> issues are side effects, not the cause).

The blackboxes were attached to the crm_report, and they are in txt
format. Just in case, I'm adding them to this email as well.

> FYI, it's not necessary to set cluster-recheck-interval as low as 1
> minute. A long time ago that could be useful, but modern Pacemaker
> doesn't need it to calculate things such as failure expiration. I
> recommend leaving it at default, or at least raising it to 5 minutes or
> so.

That's good to know. Those settings came from pacemaker-1.x, and I'm an
adherent of the "don't touch it if it works" rule.
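For the record, raising it is a one-liner. A minimal sketch, assuming the
cluster is managed with pcs (the crm_attribute form below should be
equivalent if it is not):

    # Raise the recheck interval to the suggested 5 minutes
    $ pcs property set cluster-recheck-interval=5min

    # The same change with Pacemaker's own tool
    $ crm_attribute --type crm_config --name cluster-recheck-interval --update 5min

    # Deleting the property falls back to the built-in default
    $ crm_attribute --type crm_config --name cluster-recheck-interval --delete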
> vm02, vm03, and vm04 all left the cluster at that time, leaving only
> vm01. At this point, vm01 should have deleted the transient attributes
> for all three nodes. Unfortunately, the logs for that would only be in
> pacemaker.log, which crm_report appears not to have grabbed, so I am
> not sure whether it tried.

Please find the debug logs for "Jul 21" from the DC (vm01) and the
crashed node (vm04) attached.

On Thu, Jul 27 12:06:42 EDT 2023, Ken Gaillot <kgaillot at redhat.com> wrote:

> On Wed, 2023-07-26 at 13:29 -0700, Reid Wahl wrote:
> > On Fri, Jul 21, 2023 at 9:51 AM Novik Arthur <freishutz at gmail.com> wrote:
> > > Hello Andrew, Ken and the entire community!
> > >
> > > I faced a problem and I would like to ask for help.
> > >
> > > Preamble:
> > > I have dual-controller storage (C0, C1) with 2 VMs per controller
> > > (vm0[1,2] on C0, vm0[3,4] on C1).
> > > I performed an online controller upgrade (updating the firmware on
> > > a physical controller), for which we have a special procedure:
> > >
> > > 1. Put all VMs on the controller to be updated into standby mode
> > >    (vm0[3,4] in the logs).
> > > 2. Once all resources have moved to the spare controller's VMs,
> > >    turn on maintenance-mode (the DC is vm01).
> > > 3. Shut down vm0[3,4] and perform the firmware update on C1 (OS +
> > >    KVM + HCA/HBA + BMC drivers are updated).
> > > 4. Reboot C1.
> > > 5. Start vm0[3,4]. (On this step I hit the problem.)
> > > 6. Do the same steps for C0 (turn off maintenance-mode, put nodes
> > >    3-4 online, put nodes 1-2 into standby and maintenance, etc.).
> > >
> > > Here is what I observed during step 5.
> > > Machine vm03 started without problems, but vm04 hit a critical
> > > error and its HA stack died. If Pacemaker is then started manually
> > > one more time, it starts without problems and vm04 joins the
> > > cluster.
> > >
> > > Some logs from vm04:
> > >
> > > Jul 21 04:05:39 vm04 corosync[3061]: [QUORUM] This node is within the primary component and will provide service.
> > > Jul 21 04:05:39 vm04 corosync[3061]: [QUORUM] Members[4]: 1 2 3 4
> > > Jul 21 04:05:39 vm04 corosync[3061]: [MAIN  ] Completed service synchronization, ready to provide service.
> > > Jul 21 04:05:39 vm04 corosync[3061]: [KNET  ] rx: host: 3 link: 1 is up
> > > Jul 21 04:05:39 vm04 corosync[3061]: [KNET  ] link: Resetting MTU for link 1 because host 3 joined
> > > Jul 21 04:05:39 vm04 corosync[3061]: [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
> > > Jul 21 04:05:39 vm04 pacemaker-attrd[4240]: notice: Setting ifspeed-lnet-o2ib-o2ib[vm02]: (unset) -> 600
> > > Jul 21 04:05:40 vm04 corosync[3061]: [KNET  ] pmtud: PMTUD link change for host: 3 link: 1 from 453 to 65413
> > > Jul 21 04:05:40 vm04 corosync[3061]: [KNET  ] pmtud: Global data MTU changed to: 1397
> > > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ping-lnet-o2ib-o2ib[vm02]: (unset) -> 4000
> > > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ifspeed-lnet-o2ib-o2ib[vm01]: (unset) -> 600
> > > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ping-lnet-o2ib-o2ib[vm01]: (unset) -> 4000
> > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: State transition S_NOT_DC -> S_STOPPING
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot execute monitor of sfa-home-vd: No executor connection
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot calculate digests for operation sfa-home-vd_monitor_0 because we have no connection to executor for vm04
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of probe operation for sfa-home-vd on vm04: Error (No executor connection)
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot execute monitor of ifspeed-lnet-o2ib-o2ib: No executor connection
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot calculate digests for operation ifspeed-lnet-o2ib-o2ib_monitor_0 because we have no connection to executor for vm04
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of probe operation for ifspeed-lnet-o2ib-o2ib on vm04: Error (No executor connection)
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot execute monitor of ping-lnet-o2ib-o2ib: No executor connection
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot calculate digests for operation ping-lnet-o2ib-o2ib_monitor_0 because we have no connection to executor for vm04
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of probe operation for ping-lnet-o2ib-o2ib on vm04: Error (No executor connection)
> > > Jul 21 04:05:49 vm04 pacemakerd[4127]: notice: pacemaker-controld[4257] is unresponsive to ipc after 1 tries
> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: warning: Shutting cluster down because pacemaker-controld[4257] had fatal failure
> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down Pacemaker
> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-schedulerd
> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-attrd
> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-execd
> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-fenced
> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-based
> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutdown complete
> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down and staying down after fatal error
> > >
> > > Jul 21 04:05:44 vm04 root[10111]: openibd: Set node_desc for mlx5_0: vm04 HCA-1
> > > Jul 21 04:05:44 vm04 root[10113]: openibd: Set node_desc for mlx5_1: vm04 HCA-2
> > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: error: Shutting down controller after unexpected shutdown request from vm01
> > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: Problem detected at handle_shutdown_ack:954 (controld_messages.c), please see /var/lib/pacemaker/blackbox/pacemaker-controld-4257.1 for additional details
>
> Running "qb-blackbox /var/lib/pacemaker/blackbox/pacemaker-controld-
> 4257.1" (my version can't read it) will show trace logs that might give
> a better idea of what exactly went wrong at this time (though these
> issues are side effects, not the cause).
>
> FYI, it's not necessary to set cluster-recheck-interval as low as 1
> minute. A long time ago that could be useful, but modern Pacemaker
> doesn't need it to calculate things such as failure expiration. I
> recommend leaving it at default, or at least raising it to 5 minutes or
> so.
>
> > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: State transition S_NOT_DC -> S_STOPPING
> > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: Disconnected from the executor
> > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: Disconnected from Corosync
> > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: Disconnected from the CIB manager
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: notice: Disconnected from the CIB manager
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: crit: GLib: g_hash_table_lookup: assertion 'hash_table != NULL' failed
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot execute monitor of sfa-home-vd: No executor connection
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: crit: GLib: g_hash_table_lookup: assertion 'hash_table != NULL' failed
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot calculate digests for operation sfa-home-vd_monitor_0 because we have no connection to executor for vm04
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Resource update -107 failed: (rc=-107) Transport endpoint is not connected
> >
> > The controller disconnects from the executor and deletes the executor
> > state table (lrm_state_table) in the middle of the shutdown process:
> > https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.1.5/daemons/controld/controld_control.c#L240
> >
> > These crit messages are happening when we try to look up executor
> > state when lrm_state_table is NULL. That shouldn't happen. I guess the
> > lookups are happening while draining the mainloop:
> > https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.1.5/daemons/controld/controld_control.c#L286-L294
> >
> > The blackbox should help confirm that.
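For reference, this is how the attached txt dumps were produced. A minimal
sketch, assuming libqb's qb-blackbox tool is run on the node that wrote the
file (the dump format appears to be tied to the libqb build, which would
explain why Ken's copy can't read the raw file); the grep pattern is only an
illustrative guess on my part:

    # Dump the binary flight-recorder file into readable trace logs
    $ qb-blackbox /var/lib/pacemaker/blackbox/pacemaker-controld-4257.1 \
          > blackbox_pacemaker-controld-4257.1.txt

    # Look for executor-state lookups around the shutdown
    $ grep -n 'lrm_state' blackbox_pacemaker-controld-4257.1.txt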
> > > the log from DC vm01:
> > >
> > > Jul 21 04:05:39 vm01 pacemaker-controld[4048]: notice: Transition 16 aborted: Peer Halt
> > > Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Detected another attribute writer (vm04), starting new election
> > > Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Setting #attrd-protocol[vm04]: (unset) -> 5
> > > Jul 21 04:05:40 vm01 pacemaker-controld[4048]: notice: Finalizing join-2 for 1 node (sync'ing CIB from vm02)
> > > Jul 21 04:05:40 vm01 pacemaker-controld[4048]: notice: Requested CIB version <generation_tuple crm_feature_set="3.16.2" validate-with="pacemaker-3.9" epoch="567" num_updates="111" admin_epoch="0" cib-last-written="Fri Jul 21 03:48:43 2023" update-origin="vm01" update-client="cibadmin" update-user="root" have-quorum="0" dc-uuid="1"/>
> > > Jul 21 04:05:40 vm01 pacemaker-attrd[4017]: notice: Recorded local node as attribute writer (was unset)
> > > Jul 21 04:05:40 vm01 pacemaker-attrd[4017]: notice: Setting #feature-set[vm04]: (unset) -> 3.16.2
> > > Jul 21 04:05:41 vm01 pacemaker-controld[4048]: notice: Transition 16 aborted by deletion of lrm[@id='4']: Resource state removal
> > > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice: No fencing will be done until there are resources to manage
> > > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice:  * Shutdown vm04
>
> This is where things start to go wrong, and it has nothing to do with
> any of the messages here. It means that the shutdown node attribute was
> not erased when vm04 shut down the last time before this. Going back,
> we see when that happened:
>
> Jul 21 03:49:06 vm01 pacemaker-attrd[4017]: notice: Setting shutdown[vm04]: (unset) -> 1689911346
>
> vm02, vm03, and vm04 all left the cluster at that time, leaving only
> vm01. At this point, vm01 should have deleted the transient attributes
> for all three nodes. Unfortunately, the logs for that would only be in
> pacemaker.log, which crm_report appears not to have grabbed, so I am
> not sure whether it tried.
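A side note for anyone who hits the same thing: the leftover attribute can
be checked and, while the node is cleanly down, cleared by hand. A minimal
sketch with Pacemaker's attrd_updater; treating manual cleanup as a safe
workaround here is my assumption, not Ken's advice:

    # Is a stale transient "shutdown" attribute still recorded for vm04?
    $ attrd_updater --name shutdown --node vm04 --query

    # If so, delete it before starting Pacemaker on vm04 again
    $ attrd_updater --name shutdown --node vm04 --delete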
> > > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice: Calculated transition 17, saving inputs in /var/lib/pacemaker/pengine/pe-input-940.bz2
>
> What's interesting in this transition is that we schedule probes on
> vm04 even though we're shutting it down. That's a bug, and leads to the
> "No executor connection" messages we see on vm04. I've added a task to
> our project manager to take care of that. That's all a side effect
> though and not causing any real problems.
>
> > > As far as I understand, vm04 was killed by DC during the election
> > > of a new attr writer?
> >
> > Not sure yet, maybe someone else recognizes this.
> >
> > I see the transition was aborted due to peer halt right after node
> > vm04 joined. A new election started due to detection of node vm04 as
> > attribute writer. Node vm04's resource state was removed, which is a
> > normal part of the join sequence; this caused another transition abort
> > message for the same transition number.
> >
> > Jul 21 04:05:39 vm01 pacemaker-controld[4048]: notice: Node vm04 state is now member
> > ...
> > Jul 21 04:05:39 vm01 corosync[3134]: [KNET  ] pmtud: Global data MTU changed to: 1397
> > Jul 21 04:05:39 vm01 pacemaker-controld[4048]: notice: Transition 16 aborted: Peer Halt
> > Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Detected another attribute writer (vm04), starting new election
> > Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Setting #attrd-protocol[vm04]: (unset) -> 5
> > ...
> > Jul 21 04:05:41 vm01 pacemaker-controld[4048]: notice: Transition 16 aborted by deletion of lrm[@id='4']: Resource state removal
> >
> > Looking at pe-input-939 and pe-input-940, node vm04 was marked as
> > shut down:
> >
> > Jul 21 04:05:38 vm01 pacemaker-schedulerd[4028]: notice: Calculated transition 16, saving inputs in /var/lib/pacemaker/pengine/pe-input-939.bz2
> > Jul 21 04:05:44 vm01 pacemaker-controld[4048]: notice: Transition 16 (Complete=24, Pending=0, Fired=0, Skipped=34, Incomplete=34, Source=/var/lib/pacemaker/pengine/pe-input-939.bz2): Stopped
> > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice:  * Shutdown vm04
> > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice: Calculated transition 17, saving inputs in /var/lib/pacemaker/pengine/pe-input-940.bz2
> >
> > 939:
> > <node_state id="4" uname="vm04" in_ccm="false" crmd="offline"
> >             crm-debug-origin="do_state_transition" join="down" expected="down">
> >   <transient_attributes id="4">
> >     <instance_attributes id="status-4">
> >       <nvpair id="status-4-.feature-set" name="#feature-set" value="3.16.2"/>
> >       <nvpair id="status-4-shutdown" name="shutdown" value="1689911346"/>
> >     </instance_attributes>
> >   </transient_attributes>
> >
> > 940:
> > <node_state id="4" uname="vm04" in_ccm="true" crmd="online"
> >             crm-debug-origin="do_state_transition" join="member" expected="member">
> >   <transient_attributes id="4">
> >     <instance_attributes id="status-4">
> >       <nvpair id="status-4-.feature-set" name="#feature-set" value="3.16.2"/>
> >       <nvpair id="status-4-shutdown" name="shutdown" value="1689911346"/>
> >     </instance_attributes>
> >   </transient_attributes>
> >
> > I suppose that node vm04's state was not updated before the transition
> > was aborted. So when the new transition (940) ran, the scheduler saw
> > that node vm04 is expected to be in shutdown state, and it triggered a
> > shutdown.
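Since the pe-input files are in the attached crm_report, that decision can
be replayed offline. A minimal sketch with crm_simulate; this is just how I
would inspect it, under the assumption the files come from the crm_report
above:

    # Re-run the scheduler on the saved input and show the planned actions
    $ crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-input-940.bz2

    # Add the allocation scores to the output as well
    $ crm_simulate --show-scores --xml-file /var/lib/pacemaker/pengine/pe-input-940.bz2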
> >
> > This behavior might already be fixed upstream by the following commit:
> > https://github.com/ClusterLabs/pacemaker/commit/5e3b3d14
> >
> > That commit introduced a regression, however, and I'm working on
> > fixing it.
>
> I suspect that's unrelated, because transient attributes are cleared
> when a node leaves rather than when it joins.
>
> > > The issue is reproducible from time to time, and the Pacemaker
> > > version is 2.1.5-8.1.el8_8-a3f44794f94 from Rocky Linux 8.8.
> > >
> > > I attached the crm_report with the blackbox. I have debug logs, but
> > > they are pretty heavy (~40 MB with bzip --best). Please tell me if
> > > you need them.
> > >
> > > Thanks,
> > > Arthur
> --
> Ken Gaillot <kgaillot at redhat.com>

--
Arthur Novik