On Mon, Feb 28, 2022 at 2:46 PM Klaus Wenninger <kwenn...@redhat.com> wrote:
> > > On Sat, Feb 26, 2022 at 7:14 AM Strahil Nikolov via Users <users@clusterlabs.org> wrote:
> >
>> I always used this one for triggering kdump when using sbd:
>> https://www.suse.com/support/kb/doc/?id=000019873
>>
>> On Fri, Feb 25, 2022 at 21:34, Reid Wahl <nw...@redhat.com> wrote:
>> On Fri, Feb 25, 2022 at 3:47 AM Andrei Borzenkov <arvidj...@gmail.com> wrote:
>> >
>> > On Fri, Feb 25, 2022 at 2:23 PM Reid Wahl <nw...@redhat.com> wrote:
>> > >
>> > > On Fri, Feb 25, 2022 at 3:22 AM Reid Wahl <nw...@redhat.com> wrote:
>> > > >
>> > ...
>> > > > >
>> > > > > So what happens most likely is that the watchdog terminates the kdump.
>> > > > > In that case all the mess with fence_kdump won't help, right?
>> > > >
>> > > > You can configure extra_modules in your /etc/kdump.conf file to
>> > > > include the watchdog module, and then restart kdump.service. For
>> > > > example:
>> > > >
>> > > > # grep ^extra_modules /etc/kdump.conf
>> > > > extra_modules i6300esb
>> > > >
>> > > > If you're not sure of the name of your watchdog module, wdctl can help
>> > > > you find it. sbd needs to be stopped first, because it keeps the
>> > > > watchdog device timer busy.
>> > > >
>> > > > # pcs cluster stop --all
>> > > > # wdctl | grep Identity
>> > > > Identity:      i6300ESB timer [version 0]
>> > > > # lsmod | grep -i i6300ESB
>> > > > i6300esb              13566  0
>> > > >
>> > > > If you're also using fence_sbd (poison-pill fencing via block device),
>> > > > then you should be able to protect yourself from that during a dump by
>> > > > configuring fencing levels so that fence_kdump is level 1 and
>> > > > fence_sbd is level 2.
>> > >
>> > > RHKB, for anyone interested:
>> > > - sbd watchdog timeout causes node to reboot during crash kernel
>> > >   execution (https://access.redhat.com/solutions/3552201)
>> >
>> > What is not clear from this KB (and quotes from it above) - what
>> > instance updates watchdog?
>> > Quoting (emphasis mine):
>> >
>> > --><--
>> > With the module loaded, the timer *CAN* be updated so that it does not
>> > expire and force a reboot in the middle of vmcore generation.
>> > --><--
>> >
>> > Sure it can, but what program exactly updates the watchdog during
>> > kdump execution? I am pretty sure that sbd does not run at this point.
>>
>> That's a valid question. I found this approach to work back in 2018
>> after a fair amount of frustration, and didn't question it too deeply
>> at the time.
>>
>> The answer seems to be that the kernel does it.
>> - https://stackoverflow.com/a/2020717
>> - https://stackoverflow.com/a/42589110
>>
> I think in most cases nobody would be triggering the running watchdog,
> except maybe in the case of the two drivers mentioned.
> The behavior is that if there is no watchdog timeout defined for the
> crashdump case, sbd will (at least try to) disable the watchdog.
> If disabling isn't prohibited, or impossible with a certain watchdog,
> this should lead to the hardware watchdog being really disabled, without
> anything needing to trigger it anymore.
> If the crashdump watchdog timeout is configured to the same value as the
> watchdog timeout engaged before, sbd won't touch the watchdog at all
> (it closes the device without stopping it).
> That being said, I'd suppose that the only somewhat production-safe
> configuration is setting both watchdog timeouts to the same value.
> Unfortunately this setting isn't the default, and thus contradicts the
> usual paradigm that defaults should be safe settings. Changing it now -
> or even back when I fixed setting the crashdump timeout - would
> unfortunately break existing setups. So my suggestion is to stay with
> what we have and be aware of the non-safe behavior.
> I doubt that we can assume that all I/O from the host - initiated prior
> to triggering the transition to the crashdump kernel - is stopped
> immediately. All other nodes will assume that I/O will be stopped within
> watchdog-timeout, though.
> When we disable the watchdog, we can't be sure that the subsequent
> transition to the crashdump kernel will even happen.
> So leaving watchdog-timeout at the previous value seems to be the only
> way to really assure that the node is silenced by a hardware reset
> within the timeout assumed by the rest of the nodes.
> In case the watchdog driver has the running-detection mentioned in the
> links above, the safe way would probably be to have the module removed
> from the crash kernel.
>
> Klaus
>
>> --
>> Regards,
>>
>> Reid Wahl (He/Him), RHCA
>> Senior Software Maintenance Engineer, Red Hat
>> CEE - Platform Support Delivery - ClusterHA
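[Editorial sketch of the configuration Klaus describes: keep the crashdump watchdog timeout equal to the normal watchdog timeout so sbd re-arms the watchdog rather than disabling it before the crash-kernel transition. The `-C` option, `SBD_TIMEOUT_ACTION` values, and the 5-second timeout are assumptions based on recent sbd versions; verify against `man sbd` and `/etc/sysconfig/sbd` comments on your distribution before relying on them.]

```shell
# /etc/sysconfig/sbd (fragment, not a complete file)

# Normal sbd watchdog timeout in seconds (example value; must match the
# timeout the rest of the cluster assumes).
SBD_WATCHDOG_TIMEOUT=5

# On self-fence, flush and trigger a crashdump instead of a plain reboot
# (assumes an sbd version supporting the "crashdump" timeout action).
SBD_TIMEOUT_ACTION=flush,crashdump

# "-C <N>" sets the watchdog timeout sbd programs before handing over to
# the crash kernel. The default (0) disables the watchdog - the unsafe
# behavior discussed above. Matching SBD_WATCHDOG_TIMEOUT keeps the node
# guaranteed to be hardware-reset within the timeout other nodes assume.
SBD_OPTS="-C 5"
```

If instead you want the watchdog driver kept out of the crash kernel entirely (Klaus's suggestion for drivers with running-detection), the usual mechanism on dracut-based distributions is a `rd.driver.blacklist=<module>` entry in the crash kernel's command line; the exact file that carries it varies by distribution.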
_______________________________________________ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
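[Editorial sketch of the fencing-levels setup Reid describes above: try fence_kdump first, and escalate to fence_sbd only if kdump acknowledgement never arrives. The node names and stonith resource names below are hypothetical, and the stonith resources are assumed to already exist; this is live-cluster configuration, not a runnable test.]

```shell
# Assumed existing stonith resources: "fence-kdump" (fence_kdump agent)
# and "fence-sbd" (fence_sbd agent). Node names are examples.
for node in node1 node2; do
    # Level 1: wait for the fence_kdump acknowledgement first...
    pcs stonith level add 1 "$node" fence-kdump
    # ...and only fall back to poison-pill fencing if level 1 fails.
    pcs stonith level add 2 "$node" fence-sbd
done

# Verify the configured fencing topology.
pcs stonith level
```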