On Fri, Feb 25, 2022 at 3:22 AM Reid Wahl <nw...@redhat.com> wrote:
>
> On Thu, Feb 24, 2022 at 4:22 AM Ulrich Windl
> <ulrich.wi...@rz.uni-regensburg.de> wrote:
> >
> > Hi!
> >
> > After reading about fence_kdump and fence_kdump_send I wonder:
> > Does anybody use that in production?
>
> Quite a lot of people, in fact.
>
> > Having the networking and bonding in initrd does not sound like a good idea
> > to me.
> > Wouldn't it be easier to integrate that functionality into sbd?
> > I mean: Let sbd wait for a "kdump-ed" message that initrd could send when
> > kdump is complete.
> > Basically that would be the same mechanism, but using storage instead of
> > networking.
> >
> > If I get it right, the original fence_kdump would also introduce an extra
> > fencing delay, and I wonder what happens with a hardware watchdog while a
> > kdump is in progress...
> >
> > The background of all this is that our nodes kernel-panic, and support says
> > the kdumps are all incomplete.
> > The events are most likely:
> > node1: panics (kdump)
> > other_node: sees node1 has failed and fences it (via sbd).
> >
> > However sbd fencing won't work while kdump is executing (IMHO)
> >
> > So what happens most likely is that the watchdog terminates the kdump.
> > In that case all the mess with fence_kdump won't help, right?
>
> You can configure extra_modules in your /etc/kdump.conf file to
> include the watchdog module, and then restart kdump.service. For
> example:
>
> # grep ^extra_modules /etc/kdump.conf
> extra_modules i6300esb
>
> If you're not sure of the name of your watchdog module, wdctl can help
> you find it. sbd needs to be stopped first, because it keeps the
> watchdog device timer busy.
>
> # pcs cluster stop --all
> # wdctl | grep Identity
> Identity: i6300ESB timer [version 0]
> # lsmod | grep -i i6300ESB
> i6300esb 13566 0
>
>
> If you're also using fence_sbd (poison-pill fencing via block device),
> then you should be able to protect yourself from that during a dump by
> configuring fencing levels so that fence_kdump is level 1 and
> fence_sbd is level 2.
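
For the fencing levels, here is a rough dry-run sketch. It only prints the `pcs stonith level add` commands so they can be reviewed before anything is run. The stonith resource names `kdump` and `sbd` and the node names `node1` and `node2` are placeholders I made up for illustration; substitute the IDs of your own fence_kdump and fence_sbd resources.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the pcs commands that would make
# fence_kdump level 1 and fence_sbd level 2 for each node.
# ASSUMPTION: "kdump", "sbd", "node1" and "node2" are hypothetical names,
# not taken from any real cluster.
print_fencing_levels() {
    kdump_dev=$1    # ID of the fence_kdump stonith resource
    sbd_dev=$2      # ID of the fence_sbd stonith resource
    shift 2         # remaining arguments are node names
    for node in "$@"; do
        # Level 1 is tried first, so a node that is busy dumping gets to finish.
        echo "pcs stonith level add 1 $node $kdump_dev"
        # Level 2 is tried only if level 1 fails.
        echo "pcs stonith level add 2 $node $sbd_dev"
    done
}

print_fencing_levels kdump sbd node1 node2
```

If the printed commands look right, they can be run as root on one cluster node, and `pcs stonith level` should then list the configured levels.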

RHKB, for anyone interested:
  - sbd watchdog timeout causes node to reboot during crash kernel
    execution (https://access.redhat.com/solutions/3552201)

> >
> >
> > Regards,
> > Ulrich
> >
> >
> > _______________________________________________
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
> >
>
>
> --
> Regards,
>
> Reid Wahl (He/Him), RHCA
> Senior Software Maintenance Engineer, Red Hat
> CEE - Platform Support Delivery - ClusterHA