Re: [ClusterLabs] Need a help with "(crm_glib_handler) crit: GLib: g_hash_table_lookup: assertion 'hash_table != NULL' failed"

2023-08-14 Thread Novik Arthur
This is the last update from my side, and we can close the thread.

We made the change to shut the nodes down sequentially, and after ~30
cycles (each cycle covers 3 HA groups with 4 nodes and 3 storages) we can say
that the proposed workaround works as expected.
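
For the record, the change is nothing more than stopping the cluster on one
node at a time instead of on both nodes at once. A minimal sketch of what we
run (assuming pcs is used to manage the cluster; the node names are just the
ones from this thread):

    # stop cluster services on one node at a time rather than in parallel
    for node in vm03 vm04; do
        pcs cluster stop "$node"   # finish this node before touching the next
    done
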
I saw https://github.com/ClusterLabs/pacemaker/pull/3177 and
https://github.com/ClusterLabs/pacemaker/pull/3178, but I haven't checked how
they work, so we kept the original version.

Thanks everybody,
Arthur Novik

> On Thu, 2023-08-03 at 12:37:18 -0500, Ken Gaillot wrote:
> In the other case, the problem turned out to be a timing issue that can
> occur when the DC and attribute writer are shutting down at the same
> time. Since the problem in this case also occurred after shutting down
> two nodes together, I'm thinking it's likely the same issue.
>
> A fix should be straightforward. A workaround in the meantime would be
> to shut down nodes in sequence rather than in parallel, when shutting
> down just some nodes. (Shutting down the entire cluster shouldn't be
> subject to the race condition.)
>
> On Wed, 2023-08-02 at 16:53 -0500, Ken Gaillot wrote:
> > Ha! I didn't realize crm_report saves blackboxes as text. Always
> > something new to learn with Pacemaker :)
> >
> > As of 2.1.5, the controller now gets agent metadata asynchronously,
> > which fixed bugs with synchronous calls blocking the controller. Once
> > the metadata action returns, the original action that required the
> > metadata is attempted.
> >
> > This led to the odd log messages. Normally, agent actions can't be
> > attempted once the shutdown sequence begins. However, in this case,
> > metadata actions were initiated before shutdown, but completed after
> > shutdown began. The controller thus attempted the original actions
> > after it had already disconnected from the executor, resulting in the
> > odd logs.
> >
> > The fix for that is simple, but addresses only the logs, not the
> > original problem that caused the controller to shut down. I'm still
> > looking into that.
> >
> > I've since heard about a similar case, and I suspect in that case, it
> > was related to having a node with an older version trying to join a
> > cluster with a newer version.
> >
> > On Fri, 2023-07-28 at 15:21 +0300, Novik Arthur wrote:
> > >  2023-07-21_pacemaker_debug.log.vm01.bz2
> > >  2023-07-21_pacemaker_debug.log.vm02.bz2
> > >  2023-07-21_pacemaker_debug.log.vm03.bz2
> > >  2023-07-21_pacemaker_debug.log.vm04.bz2
> > >  blackbox_txt_vm04.tar.bz2
> > > On Thu, Jul 27 12:06:42 EDT 2023, Ken Gaillot kgaillot at
> > > redhat.com
> > > wrote:
> > >
> > > > Running "qb-blackbox /var/lib/pacemaker/blackbox/pacemaker-
> > > controld-
> > > > 4257.1" (my version can't read it) will show trace logs that
> > > > might
> > > give
> > > > a better idea of what exactly went wrong at this time (though
> > > > these
> > > > issues are side effects, not the cause).
> > >
> > > The blackboxes were attached to crm_report, and they are in txt format.
> > > Just in case, I'm attaching them to this email as well.
> > >
> > > > FYI, it's not necessary to set cluster-recheck-interval as low as 1
> > > > minute. A long time ago that could be useful, but modern Pacemaker
> > > > doesn't need it to calculate things such as failure expiration. I
> > > > recommend leaving it at default, or at least raising it to 5 minutes
> > > > or so.
> > >
> > > That's good to know, since those settings came from pacemaker-1.x and
> > > I'm a follower of the "don't touch it if it works" rule.
> > >
> > > > vm02, vm03, and vm04 all left the cluster at that time, leaving only
> > > > vm01. At this point, vm01 should have deleted the transient
> > > > attributes for all three nodes. Unfortunately, the logs for that
> > > > would only be in pacemaker.log, which crm_report appears not to have
> > > > grabbed, so I am not sure whether it tried.
> > > Please find debug logs for "Jul 21" from DC (vm01) and crashed node
> > > (vm04) in an attachment.
> > > > Thu, Jul 27 12:06:42 EDT 2023, Ken Gaillot kgaillot at redhat.com
> > > wrote:
> > > > On Wed, 2023-07-26 at 13:29 -0700, Reid Wahl wrote:
> > > > > On Fri, Jul 21, 2023 at 9:51 AM Novik Arthur wrote:
> > > > > > Hello Andrew, Ken and the entire community!
> > > > > >
> > > > > > I faced a problem and I would like to ask for help.
> > > > > >
> > > > > > Preamble:
> > > > > > I have a dual-controller storage system (C0, C1) with 2 VMs per
> > > > > > controller (vm0[1,2] on C0, vm0[3,4] on C1).
> > > > > > I did an online controller upgrade (updating the firmware on the
> > > > > > physical controller), and for that purpose we have a special
> > > > > > procedure:
> > > > > >
> > > > > > Put all VMs on the controller to be updated into standby mode
> > > > > > (vm0[3,4] in the logs).
> > > > > > Once all resources have moved to the spare controller's VMs, turn
> > > > > > on maintenance-mode (the DC is vm01).
> > > > > > Shut down vm0[3,4] and perform 

Re: [ClusterLabs] Need a help with "(crm_glib_handler) crit: GLib: g_hash_table_lookup: assertion 'hash_table != NULL' failed"

2023-08-03 Thread Ken Gaillot
In the other case, the problem turned out to be a timing issue that can
occur when the DC and attribute writer are shutting down at the same
time. Since the problem in this case also occurred after shutting down
two nodes together, I'm thinking it's likely the same issue.

A fix should be straightforward. A workaround in the meantime would be
to shut down nodes in sequence rather than in parallel, when shutting
down just some nodes. (Shutting down the entire cluster shouldn't be
subject to the race condition.)

On Wed, 2023-08-02 at 16:53 -0500, Ken Gaillot wrote:
> Ha! I didn't realize crm_report saves blackboxes as text. Always
> something new to learn with Pacemaker :)
> 
> As of 2.1.5, the controller now gets agent metadata asynchronously,
> which fixed bugs with synchronous calls blocking the controller. Once
> the metadata action returns, the original action that required the
> metadata is attempted.
> 
> This led to the odd log messages. Normally, agent actions can't be
> attempted once the shutdown sequence begins. However, in this case,
> metadata actions were initiated before shutdown, but completed after
> shutdown began. The controller thus attempted the original actions
> after it had already disconnected from the executor, resulting in the
> odd logs.
> 
> The fix for that is simple, but addresses only the logs, not the
> original problem that caused the controller to shut down. I'm still
> looking into that.
> 
> I've since heard about a similar case, and I suspect in that case, it
> was related to having a node with an older version trying to join a
> cluster with a newer version.
> 
> On Fri, 2023-07-28 at 15:21 +0300, Novik Arthur wrote:
> >  2023-07-21_pacemaker_debug.log.vm01.bz2
> >  2023-07-21_pacemaker_debug.log.vm02.bz2
> >  2023-07-21_pacemaker_debug.log.vm03.bz2
> >  2023-07-21_pacemaker_debug.log.vm04.bz2
> >  blackbox_txt_vm04.tar.bz2
> > On Thu, Jul 27 12:06:42 EDT 2023, Ken Gaillot kgaillot at
> > redhat.com
> > wrote:
> > 
> > > Running "qb-blackbox /var/lib/pacemaker/blackbox/pacemaker-
> > controld-
> > > 4257.1" (my version can't read it) will show trace logs that
> > > might
> > give
> > > a better idea of what exactly went wrong at this time (though
> > > these
> > > issues are side effects, not the cause).
> > 
> > The blackboxes were attached to crm_report, and they are in txt format.
> > Just in case, I'm attaching them to this email as well.
> > 
> > > FYI, it's not necessary to set cluster-recheck-interval as low as 1
> > > minute. A long time ago that could be useful, but modern Pacemaker
> > > doesn't need it to calculate things such as failure expiration. I
> > > recommend leaving it at default, or at least raising it to 5 minutes
> > > or so.
> > 
> > That's good to know, since those settings came from pacemaker-1.x and
> > I'm a follower of the "don't touch it if it works" rule.
> > 
> > > vm02, vm03, and vm04 all left the cluster at that time, leaving only
> > > vm01. At this point, vm01 should have deleted the transient attributes
> > > for all three nodes. Unfortunately, the logs for that would only be in
> > > pacemaker.log, which crm_report appears not to have grabbed, so I am
> > > not sure whether it tried.
> > Please find debug logs for "Jul 21" from DC (vm01) and crashed node
> > (vm04) in an attachment.
> > > Thu, Jul 27 12:06:42 EDT 2023, Ken Gaillot kgaillot at redhat.com
> > wrote:
> > > On Wed, 2023-07-26 at 13:29 -0700, Reid Wahl wrote:
> > > > On Fri, Jul 21, 2023 at 9:51 AM Novik Arthur wrote:
> > > > > Hello Andrew, Ken and the entire community!
> > > > > 
> > > > > I faced a problem and I would like to ask for help.
> > > > > 
> > > > > Preamble:
> > > > > I have a dual-controller storage system (C0, C1) with 2 VMs per
> > > > > controller (vm0[1,2] on C0, vm0[3,4] on C1).
> > > > > I did an online controller upgrade (updating the firmware on the
> > > > > physical controller), and for that purpose we have a special
> > > > > procedure:
> > > > > 
> > > > > Put all VMs on the controller to be updated into standby mode
> > > > > (vm0[3,4] in the logs).
> > > > > Once all resources have moved to the spare controller's VMs, turn
> > > > > on maintenance-mode (the DC is vm01).
> > > > > Shut down vm0[3,4] and perform the firmware update on C1 (OS + KVM
> > > > > + HCA/HBA + BMC drivers will be updated).
> > > > > Reboot C1.
> > > > > Start vm0[3,4].
> > > > > This is the step where I hit the problem.
> > > > > Do the same steps for C0 (turn off maintenance, bring nodes 3 and 4
> > > > > back online, put nodes 1 and 2 into standby and maintenance, and so
> > > > > on).
> > > > > 
> > > > > Here is what I observed during step 5.
> > > > > Machine vm03 started without problems, but vm04 caught a critical
> > > > > error and the HA stack died. If I manually start Pacemaker one more
> > > > > time, it starts without problems and vm04 joins the cluster.
> > > > > Some logs from vm04:
> > > > > 
> > > > > Jul 21 04:05:39 vm04 corosync[3061]:  [QUORUM] This node is
> > 

Re: [ClusterLabs] Need a help with "(crm_glib_handler) crit: GLib: g_hash_table_lookup: assertion 'hash_table != NULL' failed"

2023-08-02 Thread Ken Gaillot
Ha! I didn't realize crm_report saves blackboxes as text. Always
something new to learn with Pacemaker :)

As of 2.1.5, the controller now gets agent metadata asynchronously,
which fixed bugs with synchronous calls blocking the controller. Once
the metadata action returns, the original action that required the
metadata is attempted.

This led to the odd log messages. Normally, agent actions can't be
attempted once the shutdown sequence begins. However, in this case,
metadata actions were initiated before shutdown, but completed after
shutdown began. The controller thus attempted the original actions
after it had already disconnected from the executor, resulting in the
odd logs.

The fix for that is simple, but addresses only the logs, not the
original problem that caused the controller to shut down. I'm still
looking into that.

I've since heard about a similar case, and I suspect in that case, it
was related to having a node with an older version trying to join a
cluster with a newer version.

On Fri, 2023-07-28 at 15:21 +0300, Novik Arthur wrote:
>  2023-07-21_pacemaker_debug.log.vm01.bz2
>  2023-07-21_pacemaker_debug.log.vm02.bz2
>  2023-07-21_pacemaker_debug.log.vm03.bz2
>  2023-07-21_pacemaker_debug.log.vm04.bz2
>  blackbox_txt_vm04.tar.bz2
> On Thu, Jul 27 12:06:42 EDT 2023, Ken Gaillot kgaillot at redhat.com
> wrote:
> 
> > Running "qb-blackbox /var/lib/pacemaker/blackbox/pacemaker-
> controld-
> > 4257.1" (my version can't read it) will show trace logs that might
> give
> > a better idea of what exactly went wrong at this time (though these
> > issues are side effects, not the cause).
> 
> The blackboxes were attached to crm_report, and they are in txt format.
> Just in case, I'm attaching them to this email as well.
> 
> > FYI, it's not necessary to set cluster-recheck-interval as low as 1
> > minute. A long time ago that could be useful, but modern Pacemaker
> > doesn't need it to calculate things such as failure expiration. I
> > recommend leaving it at default, or at least raising it to 5 minutes or
> > so.
> 
> That's good to know, since those settings came from pacemaker-1.x and
> I'm a follower of the "don't touch it if it works" rule.
> 
> > vm02, vm03, and vm04 all left the cluster at that time, leaving only
> > vm01. At this point, vm01 should have deleted the transient attributes
> > for all three nodes. Unfortunately, the logs for that would only be in
> > pacemaker.log, which crm_report appears not to have grabbed, so I am
> > not sure whether it tried.
> Please find debug logs for "Jul 21" from DC (vm01) and crashed node
> (vm04) in an attachment.
> > Thu, Jul 27 12:06:42 EDT 2023, Ken Gaillot kgaillot at redhat.com
> wrote:
> > On Wed, 2023-07-26 at 13:29 -0700, Reid Wahl wrote:
> > > On Fri, Jul 21, 2023 at 9:51 AM Novik Arthur wrote:
> > > > Hello Andrew, Ken and the entire community!
> > > > 
> > > > I faced a problem and I would like to ask for help.
> > > > 
> > > > Preamble:
> > > > I have a dual-controller storage system (C0, C1) with 2 VMs per
> > > > controller (vm0[1,2] on C0, vm0[3,4] on C1).
> > > > I did an online controller upgrade (updating the firmware on the
> > > > physical controller), and for that purpose we have a special
> > > > procedure:
> > > > 
> > > > Put all VMs on the controller to be updated into standby mode
> > > > (vm0[3,4] in the logs).
> > > > Once all resources have moved to the spare controller's VMs, turn on
> > > > maintenance-mode (the DC is vm01).
> > > > Shut down vm0[3,4] and perform the firmware update on C1 (OS + KVM +
> > > > HCA/HBA + BMC drivers will be updated).
> > > > Reboot C1.
> > > > Start vm0[3,4].
> > > > This is the step where I hit the problem.
> > > > Do the same steps for C0 (turn off maintenance, bring nodes 3 and 4
> > > > back online, put nodes 1 and 2 into standby and maintenance, and so
> > > > on).
> > > > 
> > > > Here is what I observed during step 5.
> > > > Machine vm03 started without problems, but vm04 caught a critical
> > > > error and the HA stack died. If I manually start Pacemaker one more
> > > > time, it starts without problems and vm04 joins the cluster.
> > > > 
> > > > Some logs from vm04:
> > > > 
> > > > Jul 21 04:05:39 vm04 corosync[3061]:  [QUORUM] This node is within
> > > > the primary component and will provide service.
> > > > Jul 21 04:05:39 vm04 corosync[3061]:  [QUORUM] Members[4]: 1 2 3 4
> > > > Jul 21 04:05:39 vm04 corosync[3061]:  [MAIN  ] Completed service
> > > > synchronization, ready to provide service.
> > > > Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] rx: host: 3 link: 1
> > > > is up
> > > > Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] link: Resetting MTU
> > > > for link 1 because host 3 joined
> > > > Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] host: host: 3
> > > > (passive) best link: 0 (pri: 1)
> > > > Jul 21 04:05:39 vm04 pacemaker-attrd[4240]: notice: Setting
> > > > ifspeed-lnet-o2ib-o2ib[vm02]: (unset) -> 600
> > > > Jul 21 04:05:40 vm04 corosync[3061]:  [KNET  ] pmtud: PMTUD link
> > > > change for host: 3 link: 1 from 453 to 

Re: [ClusterLabs] Need a help with "(crm_glib_handler) crit: GLib: g_hash_table_lookup: assertion 'hash_table != NULL' failed"

2023-07-28 Thread Novik Arthur
 2023-07-21_pacemaker_debug.log.vm01.bz2

 2023-07-21_pacemaker_debug.log.vm02.bz2

 2023-07-21_pacemaker_debug.log.vm03.bz2

 2023-07-21_pacemaker_debug.log.vm04.bz2

 blackbox_txt_vm04.tar.bz2


On Thu, Jul 27 12:06:42 EDT 2023, Ken Gaillot kgaillot at redhat.com wrote:
> Running "qb-blackbox /var/lib/pacemaker/blackbox/pacemaker-controld-
> 4257.1" (my version can't read it) will show trace logs that might give
> a better idea of what exactly went wrong at this time (though these
> issues are side effects, not the cause).

The blackboxes were attached to crm_report, and they are in txt format.
Just in case, I'm attaching them to this email as well.
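
For anyone who wants to convert the dumps themselves, something along these
lines should work (a rough sketch; it assumes the qb-blackbox tool from libqb
and the dump file names mentioned above):

    # turn each blackbox dump into a plain-text trace log
    for f in /var/lib/pacemaker/blackbox/pacemaker-controld-4257.*; do
        qb-blackbox "$f" > "$(basename "$f").txt"
    done
    # then search around the time of the failure (adjust the pattern to the
    # timestamp format that appears in the dump)
    grep -n '04:05:4' pacemaker-controld-4257.*.txt
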

> FYI, it's not necessary to set cluster-recheck-interval as low as 1
> minute. A long time ago that could be useful, but modern Pacemaker
> doesn't need it to calculate things such as failure expiration. I
> recommend leaving it at default, or at least raising it to 5 minutes or
> so.

That's good to know, since those settings came from pacemaker-1.x and I'm
a follower of the "don't touch it if it works" rule.
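
If we ever do raise it, it looks like a one-liner; either of the following
should work, depending on the tooling in use (a sketch only, not something we
have applied here):

    # with pcs
    pcs property set cluster-recheck-interval=5min

    # or with the lower-level Pacemaker tool
    crm_attribute --type crm_config --name cluster-recheck-interval --update 5min
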

> vm02, vm03, and vm04 all left the cluster at that time, leaving only
> vm01. At this point, vm01 should have deleted the transient attributes
> for all three nodes. Unfortunately, the logs for that would only be in
> pacemaker.log, which crm_report appears not to have grabbed, so I am
> not sure whether it tried.

Please find debug logs for "Jul 21" from DC (vm01) and crashed node (vm04)
in an attachment.

> Thu, Jul 27 12:06:42 EDT 2023, Ken Gaillot kgaillot at redhat.com wrote:
> On Wed, 2023-07-26 at 13:29 -0700, Reid Wahl wrote:
> > On Fri, Jul 21, 2023 at 9:51 AM Novik Arthur wrote:
> > > Hello Andrew, Ken and the entire community!
> > >
> > > I faced a problem and I would like to ask for help.
> > >
> > > Preamble:
> > > I have a dual-controller storage system (C0, C1) with 2 VMs per
> > > controller (vm0[1,2] on C0, vm0[3,4] on C1).
> > > I did an online controller upgrade (updating the firmware on the
> > > physical controller), and for that purpose we have a special procedure:
> > >
> > > Put all VMs on the controller to be updated into standby mode
> > > (vm0[3,4] in the logs).
> > > Once all resources have moved to the spare controller's VMs, turn on
> > > maintenance-mode (the DC is vm01).
> > > Shut down vm0[3,4] and perform the firmware update on C1 (OS + KVM +
> > > HCA/HBA + BMC drivers will be updated).
> > > Reboot C1.
> > > Start vm0[3,4].
> > > This is the step where I hit the problem.
> > > Do the same steps for C0 (turn off maintenance, bring nodes 3 and 4
> > > back online, put nodes 1 and 2 into standby and maintenance, and so
> > > on).
> > >
> > > Here is what I observed during step 5.
> > > Machine vm03 started without problems, but vm04 caught a critical
> > > error and the HA stack died. If I manually start Pacemaker one more
> > > time, it starts without problems and vm04 joins the cluster.
> > >
> > > Some logs from vm04:
> > >
> > > Jul 21 04:05:39 vm04 corosync[3061]:  [QUORUM] This node is within
> > > the primary component and will provide service.
> > > Jul 21 04:05:39 vm04 corosync[3061]:  [QUORUM] Members[4]: 1 2 3 4
> > > Jul 21 04:05:39 vm04 corosync[3061]:  [MAIN  ] Completed service
> > > synchronization, ready to provide service.
> > > Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] rx: host: 3 link: 1
> > > is up
> > > Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] link: Resetting MTU
> > > for link 1 because host 3 joined
> > > Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] host: host: 3
> > > (passive) best link: 0 (pri: 1)
> > > Jul 21 04:05:39 vm04 pacemaker-attrd[4240]: notice: Setting
> > > ifspeed-lnet-o2ib-o2ib[vm02]: (unset) -> 600
> > > Jul 21 04:05:40 vm04 corosync[3061]:  [KNET  ] pmtud: PMTUD link
> > > change for host: 3 link: 1 from 453 to 65413
> > > Jul 21 04:05:40 vm04 corosync[3061]:  [KNET  ] pmtud: Global data
> > > MTU changed to: 1397
> > > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ping-
> > > lnet-o2ib-o2ib[vm02]: (unset) -> 4000
> > > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting
> > > ifspeed-lnet-o2ib-o2ib[vm01]: (unset) -> 600
> > > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ping-
> > > lnet-o2ib-o2ib[vm01]: (unset) -> 4000
> > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: State
> > > transition S_NOT_DC -> S_STOPPING
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot
> > > execute monitor of sfa-home-vd: No executor connection
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot
> > > calculate digests for operation sfa-home-vd_monitor_0 because we
> > > 

Re: [ClusterLabs] Need a help with "(crm_glib_handler) crit: GLib: g_hash_table_lookup: assertion 'hash_table != NULL' failed"

2023-07-27 Thread Ken Gaillot
On Wed, 2023-07-26 at 13:29 -0700, Reid Wahl wrote:
> On Fri, Jul 21, 2023 at 9:51 AM Novik Arthur wrote:
> > Hello Andrew, Ken and the entire community!
> > 
> > I faced a problem and I would like to ask for help.
> > 
> > Preamble:
> > I have a dual-controller storage system (C0, C1) with 2 VMs per
> > controller (vm0[1,2] on C0, vm0[3,4] on C1).
> > I did an online controller upgrade (updating the firmware on the
> > physical controller), and for that purpose we have a special procedure:
> > 
> > Put all VMs on the controller to be updated into standby mode
> > (vm0[3,4] in the logs).
> > Once all resources have moved to the spare controller's VMs, turn on
> > maintenance-mode (the DC is vm01).
> > Shut down vm0[3,4] and perform the firmware update on C1 (OS + KVM +
> > HCA/HBA + BMC drivers will be updated).
> > Reboot C1.
> > Start vm0[3,4].
> > This is the step where I hit the problem.
> > Do the same steps for C0 (turn off maintenance, bring nodes 3 and 4 back
> > online, put nodes 1 and 2 into standby and maintenance, and so on).
> > 
> > Here is what I observed during step 5.
> > Machine vm03 started without problems, but vm04 caught a critical
> > error and the HA stack died. If I manually start Pacemaker one more
> > time, it starts without problems and vm04 joins the cluster.
> > 
> > Some logs from vm04:
> > 
> > Jul 21 04:05:39 vm04 corosync[3061]:  [QUORUM] This node is within
> > the primary component and will provide service.
> > Jul 21 04:05:39 vm04 corosync[3061]:  [QUORUM] Members[4]: 1 2 3 4
> > Jul 21 04:05:39 vm04 corosync[3061]:  [MAIN  ] Completed service
> > synchronization, ready to provide service.
> > Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] rx: host: 3 link: 1
> > is up
> > Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] link: Resetting MTU
> > for link 1 because host 3 joined
> > Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] host: host: 3
> > (passive) best link: 0 (pri: 1)
> > Jul 21 04:05:39 vm04 pacemaker-attrd[4240]: notice: Setting
> > ifspeed-lnet-o2ib-o2ib[vm02]: (unset) -> 600
> > Jul 21 04:05:40 vm04 corosync[3061]:  [KNET  ] pmtud: PMTUD link
> > change for host: 3 link: 1 from 453 to 65413
> > Jul 21 04:05:40 vm04 corosync[3061]:  [KNET  ] pmtud: Global data
> > MTU changed to: 1397
> > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ping-
> > lnet-o2ib-o2ib[vm02]: (unset) -> 4000
> > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting
> > ifspeed-lnet-o2ib-o2ib[vm01]: (unset) -> 600
> > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ping-
> > lnet-o2ib-o2ib[vm01]: (unset) -> 4000
> > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: State
> > transition S_NOT_DC -> S_STOPPING
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot
> > execute monitor of sfa-home-vd: No executor connection
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot
> > calculate digests for operation sfa-home-vd_monitor_0 because we
> > have no connection to executor for vm04
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of
> > probe operation for sfa-home-vd on vm04: Error (No executor
> > connection)
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot
> > execute monitor of ifspeed-lnet-o2ib-o2ib: No executor connection
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot
> > calculate digests for operation ifspeed-lnet-o2ib-o2ib_monitor_0
> > because we have no connection to executor for vm04
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of
> > probe operation for ifspeed-lnet-o2ib-o2ib on vm04: Error (No
> > executor connection)
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot
> > execute monitor of ping-lnet-o2ib-o2ib: No executor connection
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot
> > calculate digests for operation ping-lnet-o2ib-o2ib_monitor_0
> > because we have no connection to executor for vm04
> > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of
> > probe operation for ping-lnet-o2ib-o2ib on vm04: Error (No executor
> > connection)
> > Jul 21 04:05:49 vm04 pacemakerd[4127]: notice: pacemaker-
> > controld[4257] is unresponsive to ipc after 1 tries
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: warning: Shutting cluster
> > down because pacemaker-controld[4257] had fatal failure
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down
> > Pacemaker
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-
> > schedulerd
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-
> > attrd
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-
> > execd
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-
> > fenced
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-
> > based
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutdown complete
> > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down and
> > staying down after fatal error
> > 
> > Jul 21 04:05:44 vm04 

Re: [ClusterLabs] Need a help with "(crm_glib_handler) crit: GLib: g_hash_table_lookup: assertion 'hash_table != NULL' failed"

2023-07-26 Thread Reid Wahl
On Fri, Jul 21, 2023 at 9:51 AM Novik Arthur wrote:
>
> Hello Andrew, Ken and the entire community!
>
> I faced a problem and I would like to ask for help.
>
> Preamble:
> I have a dual-controller storage system (C0, C1) with 2 VMs per controller
> (vm0[1,2] on C0, vm0[3,4] on C1).
> I did an online controller upgrade (updating the firmware on the physical
> controller), and for that purpose we have a special procedure:
>
> Put all VMs on the controller to be updated into standby mode (vm0[3,4] in
> the logs).
> Once all resources have moved to the spare controller's VMs, turn on
> maintenance-mode (the DC is vm01).
> Shut down vm0[3,4] and perform the firmware update on C1 (OS + KVM + HCA/HBA
> + BMC drivers will be updated).
> Reboot C1.
> Start vm0[3,4].
> This is the step where I hit the problem.
> Do the same steps for C0 (turn off maintenance, bring nodes 3 and 4 back
> online, put nodes 1 and 2 into standby and maintenance, and so on).
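
In pcs terms, the procedure above is roughly the following (a sketch only; it
assumes pcs is the management tool, and crmsh users would substitute the
equivalent crm commands):

    # before updating C1
    pcs node standby vm03 vm04              # move resources off vm03/vm04
    pcs property set maintenance-mode=true
    pcs cluster stop vm03
    pcs cluster stop vm04                   # then power the VMs off

    # ... update the firmware, reboot C1, start the VMs again ...

    pcs cluster start vm03
    pcs cluster start vm04
    pcs property set maintenance-mode=false
    pcs node unstandby vm03 vm04
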
>
> Here is what I observed during step 5.
> Machine vm03 started without problems, but vm04 caught a critical error and
> the HA stack died. If I manually start Pacemaker one more time, it starts
> without problems and vm04 joins the cluster.
>
> Some logs from vm04:
>
> Jul 21 04:05:39 vm04 corosync[3061]:  [QUORUM] This node is within the 
> primary component and will provide service.
> Jul 21 04:05:39 vm04 corosync[3061]:  [QUORUM] Members[4]: 1 2 3 4
> Jul 21 04:05:39 vm04 corosync[3061]:  [MAIN  ] Completed service 
> synchronization, ready to provide service.
> Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] rx: host: 3 link: 1 is up
> Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] link: Resetting MTU for link 1 
> because host 3 joined
> Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] host: host: 3 (passive) best 
> link: 0 (pri: 1)
> Jul 21 04:05:39 vm04 pacemaker-attrd[4240]: notice: Setting 
> ifspeed-lnet-o2ib-o2ib[vm02]: (unset) -> 600
> Jul 21 04:05:40 vm04 corosync[3061]:  [KNET  ] pmtud: PMTUD link change for 
> host: 3 link: 1 from 453 to 65413
> Jul 21 04:05:40 vm04 corosync[3061]:  [KNET  ] pmtud: Global data MTU changed 
> to: 1397
> Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting 
> ping-lnet-o2ib-o2ib[vm02]: (unset) -> 4000
> Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting 
> ifspeed-lnet-o2ib-o2ib[vm01]: (unset) -> 600
> Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting 
> ping-lnet-o2ib-o2ib[vm01]: (unset) -> 4000
> Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: State transition 
> S_NOT_DC -> S_STOPPING
> Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot execute monitor 
> of sfa-home-vd: No executor connection
> Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot calculate 
> digests for operation sfa-home-vd_monitor_0 because we have no connection to 
> executor for vm04
> Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of probe 
> operation for sfa-home-vd on vm04: Error (No executor connection)
> Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot execute monitor 
> of ifspeed-lnet-o2ib-o2ib: No executor connection
> Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot calculate 
> digests for operation ifspeed-lnet-o2ib-o2ib_monitor_0 because we have no 
> connection to executor for vm04
> Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of probe 
> operation for ifspeed-lnet-o2ib-o2ib on vm04: Error (No executor connection)
> Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot execute monitor 
> of ping-lnet-o2ib-o2ib: No executor connection
> Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot calculate 
> digests for operation ping-lnet-o2ib-o2ib_monitor_0 because we have no 
> connection to executor for vm04
> Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of probe 
> operation for ping-lnet-o2ib-o2ib on vm04: Error (No executor connection)
> Jul 21 04:05:49 vm04 pacemakerd[4127]: notice: pacemaker-controld[4257] is 
> unresponsive to ipc after 1 tries
> Jul 21 04:05:52 vm04 pacemakerd[4127]: warning: Shutting cluster down because 
> pacemaker-controld[4257] had fatal failure
> Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down Pacemaker
> Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-schedulerd
> Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-attrd
> Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-execd
> Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-fenced
> Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-based
> Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutdown complete
> Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down and staying down 
> after fatal error
>
> Jul 21 04:05:44 vm04 root[10111]: openibd: Set node_desc for mlx5_0: vm04 
> HCA-1
> Jul 21 04:05:44 vm04 root[10113]: openibd: Set node_desc for mlx5_1: vm04 
> HCA-2
> Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  error: Shutting down 
> controller after unexpected