Re: [PATCH v2 08/13] tracing: Improve panic/die notifiers
On Tue, Aug 16, 2022 at 10:14:45AM -0400, Steven Rostedt wrote: > On Tue, 19 Jul 2022 16:53:21 -0300 > "Guilherme G. Piccoli" wrote: > > > Sorry for the late review, but this fell to the bottom of my queue :-/ > > > +/* > > + * The idea is to execute the following die/panic callback early, in order > > + * to avoid showing irrelevant information in the trace (like other panic > > + * notifier functions); we are the 2nd to run, after hung_task/rcu_stall > > + * warnings get disabled (to prevent potential log flooding). > > + */ > > +static int trace_die_panic_handler(struct notifier_block *self, > > + unsigned long ev, void *unused) > > +{ > > + if (!ftrace_dump_on_oops) > > + goto out; > > + > > + if (self == _die_notifier && ev != DIE_OOPS) > > + goto out; > > I really hate gotos that are not for clean ups. > > > + > > + ftrace_dump(ftrace_dump_on_oops); > > + > > +out: > > + return NOTIFY_DONE; > > +} > > + > > Just do: > > static int trace_die_panic_handler(struct notifier_block *self, > unsigned long ev, void *unused) > { > if (!ftrace_dump_on_oops) > return NOTIFY_DONE; > > /* The die notifier requires DIE_OOPS to trigger */ > if (self == _die_notifier && ev != DIE_OOPS) > return NOTIFY_DONE; > > ftrace_dump(ftrace_dump_on_oops); > > return NOTIFY_DONE; > } Or better yet: if (ftrace_dump_on_oops) { /* The die notifier requires DIE_OOPS to trigger */ if (self != _die_notifier || ev == DIE_OOPS) ftrace_dump(ftrace_dump_on_oops); } return NOTIFY_DONE; Alan Stern > Thanks, > > Other than that, Acked-by: Steven Rostedt (Google) > > -- Steve ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH V2] notifier/panic: Introduce panic_notifier_filter
On Thu, Jan 06, 2022 at 05:00:07PM -0300, Guilherme G. Piccoli wrote: > The kernel notifier infrastructure allows function callbacks to be > added in multiple lists, which are then called in the proper time, > like in a reboot or panic event. The panic_notifier_list specifically > contains the callbacks that are executed during a panic event. As any > other notifier list, the panic one has no filtering and all functions > previously registered are executed. > > The kdump infrastructure, on the other hand, enables users to set > a crash kernel that is kexec'ed in a panic event, and vmcore/logs > are collected in such crash kernel. When kdump is set, by default > the panic notifiers are ignored - the kexec jumps to the crash kernel > before the list is checked and callbacks executed. > > There are some cases though in which kdump users might want to > allow panic notifier callbacks to execute _before_ the kexec to > the crash kernel, for a variety of reasons - for example, users > may think kexec is very prone to fail and want to give a chance > to kmsg dumpers to run (and save logs using pstore), or maybe > some panic notifier is required to properly quiesce some hardware > that must be used to the crash kernel. For these cases, we have > the kernel parameter "crash_kexec_post_notifiers". > > But there's a problem: currently it's an "all-or-nothing" situation, > the kdump user choice is either to execute all panic notifiers or > none of them. Given that panic notifiers may increase the risk of a > kdump failure, this is a tough decision and may affect the debug of > hard to reproduce bugs, if for some reason the user choice is to > enable panic notifiers, but kdump then fails. > > So, this patch aims to ease this decision: we hereby introduce a filter > for the panic notifier list, in which users may select specifically > which callbacks they wish to run, allowing a safer kdump. The allowlist > should be provided using the parameter "panic_notifier_filter=a,b,..." > where a, b are valid callback names. Invalid symbols are discarded. > > Currently up to 16 symbols may be passed in this list, we consider > that this numbers allows enough flexibility (and no matter what > architecture is used, at most 30 panic callbacks are registered). > In an experiment using a qemu x86 virtual machine, by default only > six callbacks are registered in the panic notifier list. > Once a valid callback name is provided in the list, such function > is allowed to be registered/unregistered in the panic_notifier_list; > all other panic callbacks are ignored. Notice that this filter is > only for the panic notifiers and has no effect in the other notifiers. > > Signed-off-by: Guilherme G. Piccoli > diff --git a/kernel/notifier.c b/kernel/notifier.c > index b8251dc0bc0f..04cb9e956058 100644 > --- a/kernel/notifier.c > +++ b/kernel/notifier.c > @@ -140,10 +163,16 @@ int atomic_notifier_chain_register(struct > atomic_notifier_head *nh, > struct notifier_block *n) > { > unsigned long flags; > - int ret; > + int ret = 0; > > spin_lock_irqsave(>lock, flags); > + if (unlikely(panic_nf_count) && nh == _notifier_list) > + if (!is_panic_notifier_filtered(n)) > + goto panic_filtered_out; Forget the unlikely(); this is not a hot path. > + > ret = notifier_chain_register(>head, n); > + > +panic_filtered_out: > spin_unlock_irqrestore(>lock, flags); > return ret; > } It would be simpler to do: if (!(nh == _notifier_list && panic_nf_count > 0 && is_panic_notifier_filtered(n))) ret = notifier_chain_register(>head, n); If there were special-purpose functions just for registering and unregistering callbacks on the panic_notifier_list, the design would be cleaner (no need to modify core notifier code). But making that change would mean altering a lot of call sites. :-( > @@ -162,10 +194,16 @@ int atomic_notifier_chain_unregister(struct > atomic_notifier_head *nh, > struct notifier_block *n) > { > unsigned long flags; > - int ret; > + int ret = 0; > > spin_lock_irqsave(>lock, flags); > + if (unlikely(panic_nf_count) && nh == _notifier_list) > + if (!is_panic_notifier_filtered(n)) > + goto panic_filtered_out; > + > ret = notifier_chain_unregister(>head, n); > + > +panic_filtered_out: Same idea here. Alan Stern ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [linux-pm] [PATCH -mm 1/2] kexec jump -v12: kexec jump
On Fri, 11 Jul 2008, Eric W. Biederman wrote: Forcing the second set of requests to filter through an extra software layer is a clumsy way of accomplishing this. There ought to be a better approach. The point was something different. The reasons we can not store the state of the system with the hardware devices logically hot unplugged (and thus reuse all of the find device hotplug methods) is because things like the filesystem layer don't know how to cope with their block devices going away an coming back. This is not how the procedure works. During hibernation, block devices are not logically hot-unplugged. (If they were then they couldn't be used for writing the memory image.) Instead, they are quiesced or suspended and their input queues are plugged. That is the problem inserting an virtual software device in the middle can solve. If that works should there be a better way? Certainly but to prove it out starting with a block device wrapper is a trivial way to go. This sounds like a solution to a non-existent problem. Alan Stern ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [linux-pm] [PATCH -mm 1/2] kexec jump -v12: kexec jump
On Fri, 11 Jul 2008, Eric W. Biederman wrote: I just realized with a little care the block layer does have support for this, or something very close. You setup a software raid mirror with one disk device.The physical device can come in and out while the filesystems depend on the real device. Do you mean the filesystems depend on the logical RAID device? What's to prevent userspace from accessing the physical device directly? What this amounts to, in the end, is having a way to distinguish the set of I/O requests coming from the hibernation code (reading or writing the memory image) from the set of all other I/O requests. The driver or the block layer has to be set up to allow the first set through while blocking the second set. (And don't forget about the complications caused by error-recovery I/O during the hibernation activity!) Forcing the second set of requests to filter through an extra software layer is a clumsy way of accomplishing this. There ought to be a better approach. Alan Stern ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [linux-pm] [PATCH -mm] kexec jump -v9
On Wed, 14 May 2008, Eric W. Biederman wrote: My take on the situation is this. For proper handling we need driver device_detach and device_reattach methods. With the following semantics. The device_detach methods will disable DMA and place the hardware in a sane state from which the device driver can reclaim and reinitialize it, but the hardware will not be touched. device_reattach reattaches the driver to the hardware. How would these differ from the already-existing remove and probe methods? Alan Stern ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [linux-pm] [PATCH -mm] kexec jump -v9
On Sat, 22 Mar 2008, Rafael J. Wysocki wrote: However, as far as the ACPI NVS area is concerned, this is probably not necessary, because the spec wants us to restore the ACPI NVS before calling _WAK, which is just after the image kernel gets the control back. So, in theory, the ACPI NVS data could be stored in the image and restored by the image kernel from a location known to it (the procedure may be to copy the ACPI NVS data into a region of regular RAM before creating the image and copy them back into the ACPI NVS area in platform-leave(), for example), but I suspect that for this to work we'll have to switch ACPI off in the boot kernel, just prior to passing control back to the image kernel. That sounds by far the simplest solution. If the boot kernel can tell (by looking at some header field in the image or any other way) that the hibernation used S5 instead of S4, then it should just turn off ACPI before passing control to the image kernel. Then the image kernel can turn ACPI back on and all should be well. If you do this, does the NVS region still need to be preserved? Alan Stern ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [linux-pm] [PATCH -mm] kexec jump -v9
On Tue, 18 Mar 2008, Eric W. Biederman wrote: Alan Stern [EMAIL PROTECTED] writes: Could it be connected with the way the boot kernel hands control over to the image kernel? Presumably ACPI isn't prepared to deal with that sort of thing during a boot from S5. It would have to be fooled into thinking the two kernels were one and the same. It should be easy to test if it is a hand over problem, by turning off the laptop by placing it in S5 (shutdown -h now) and then booting same kernel again. ? Doesn't this happen every time Rafael turns the computer off and then turns it back on? Do you mean that Rafael should do an S5-type hibernate, but then reboot in such a way that the image isn't loaded and resumed? Alan Stern ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [linux-pm] [PATCH -mm] kexec jump -v9
On Wed, 12 Mar 2008, Huang, Ying wrote: I think kexec based hibernation is the only currently available possible method to write out image without freezer (after driver works are done). If other process is running, how to prevent them from writing to disk without freezing them in current implementation? This is a very good question. It's a matter of managing the block layer's request queues. Somehow the existing I/O requests must remain blocked while the requests needed for writing the image must be allowed to proceed. I don't know what would be needed to make this work, but it ought to be possible somehow... Alan Stern ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [linux-pm] Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump
On Fri, 21 Sep 2007, Rafael J. Wysocki wrote: Well, the problem is that apparently some systems (eg. my HP nx6325) expect us to execute the _PTS ACPI global control method before creating the image _and_ to execute acpi_enter_sleep_state(ACPI_STATE_S4) in order to finally put the system into the sleep state. In particular, on nx6325, if we don't do that, then after the restore the status of the AC power will not be reported correctly (and if you replace the battery while in the sleep state, the battery status will not be updated correctly after the restore). Similar issues have been reported for other machines. Suppose that instead of using ACPI S4 state at all, you instead just power off. Yes, you'll lose wakeup event functionality, and flashy LEDs, but doesn't this take care of the problem? Nope. The firmware shouldn't see the hibernate as anything other than a shutdown and reboot. Actually, this assumption is apparently wrong. One gets the impression that the hibernation image includes a memory area used by the firmware. That could explain why devices need to be in a low-power state when the image is created -- so that when the image is restored, the firmware doesn't get confused about the device states. It would also explain why the firmware sees resume-from-power-off-hibernation as different from a regular reboot: because its data area gets overwritten as part of the resume. In reality it's probably more complicated than this, with weird interactions between the firmware and the various ACPI methods. Nevertheless, the main idea seems valid. Alan Stern ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec