Re: [PATCH v2 08/13] tracing: Improve panic/die notifiers

2022-08-16 Thread Alan Stern
On Tue, Aug 16, 2022 at 10:14:45AM -0400, Steven Rostedt wrote:
> On Tue, 19 Jul 2022 16:53:21 -0300
> "Guilherme G. Piccoli"  wrote:
> 
> 
> Sorry for the late review, but this fell to the bottom of my queue :-/
> 
> > +/*
> > + * The idea is to execute the following die/panic callback early, in order
> > + * to avoid showing irrelevant information in the trace (like other panic
> > + * notifier functions); we are the 2nd to run, after hung_task/rcu_stall
> > + * warnings get disabled (to prevent potential log flooding).
> > + */
> > +static int trace_die_panic_handler(struct notifier_block *self,
> > +   unsigned long ev, void *unused)
> > +{
> > +   if (!ftrace_dump_on_oops)
> > +   goto out;
> > +
> > +   if (self == _die_notifier && ev != DIE_OOPS)
> > +   goto out;
> 
> I really hate gotos that are not for clean ups.
> 
> > +
> > +   ftrace_dump(ftrace_dump_on_oops);
> > +
> > +out:
> > +   return NOTIFY_DONE;
> > +}
> > +
> 
> Just do:
> 
> static int trace_die_panic_handler(struct notifier_block *self,
>   unsigned long ev, void *unused)
> {
>   if (!ftrace_dump_on_oops)
>   return NOTIFY_DONE;
> 
>   /* The die notifier requires DIE_OOPS to trigger */
>   if (self == _die_notifier && ev != DIE_OOPS)
>   return NOTIFY_DONE;
> 
>   ftrace_dump(ftrace_dump_on_oops);
> 
>   return NOTIFY_DONE;
> }

Or better yet:

if (ftrace_dump_on_oops) {

    /* The die notifier requires DIE_OOPS to trigger */
if (self != _die_notifier || ev == DIE_OOPS)
ftrace_dump(ftrace_dump_on_oops);
}
return NOTIFY_DONE;

Alan Stern

> Thanks,
> 
> Other than that, Acked-by: Steven Rostedt (Google) 
> 
> -- Steve

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH V2] notifier/panic: Introduce panic_notifier_filter

2022-01-06 Thread Alan Stern
On Thu, Jan 06, 2022 at 05:00:07PM -0300, Guilherme G. Piccoli wrote:
> The kernel notifier infrastructure allows function callbacks to be
> added in multiple lists, which are then called in the proper time,
> like in a reboot or panic event. The panic_notifier_list specifically
> contains the callbacks that are executed during a panic event. As any
> other notifier list, the panic one has no filtering and all functions
> previously registered are executed.
> 
> The kdump infrastructure, on the other hand, enables users to set
> a crash kernel that is kexec'ed in a panic event, and vmcore/logs
> are collected in such crash kernel. When kdump is set, by default
> the panic notifiers are ignored - the kexec jumps to the crash kernel
> before the list is checked and callbacks executed.
> 
> There are some cases though in which kdump users might want to
> allow panic notifier callbacks to execute _before_ the kexec to
> the crash kernel, for a variety of reasons - for example, users
> may think kexec is very prone to fail and want to give a chance
> to kmsg dumpers to run (and save logs using pstore), or maybe
> some panic notifier is required to properly quiesce some hardware
> that must be used to the crash kernel. For these cases, we have
> the kernel parameter "crash_kexec_post_notifiers".
> 
> But there's a problem: currently it's an "all-or-nothing" situation,
> the kdump user choice is either to execute all panic notifiers or
> none of them. Given that panic notifiers may increase the risk of a
> kdump failure, this is a tough decision and may affect the debug of
> hard to reproduce bugs, if for some reason the user choice is to
> enable panic notifiers, but kdump then fails.
> 
> So, this patch aims to ease this decision: we hereby introduce a filter
> for the panic notifier list, in which users may select specifically
> which callbacks they wish to run, allowing a safer kdump. The allowlist
> should be provided using the parameter "panic_notifier_filter=a,b,..."
> where a, b are valid callback names. Invalid symbols are discarded.
> 
> Currently up to 16 symbols may be passed in this list, we consider
> that this numbers allows enough flexibility (and no matter what
> architecture is used, at most 30 panic callbacks are registered).
> In an experiment using a qemu x86 virtual machine, by default only
> six callbacks are registered in the panic notifier list.
> Once a valid callback name is provided in the list, such function
> is allowed to be registered/unregistered in the panic_notifier_list;
> all other panic callbacks are ignored. Notice that this filter is
> only for the panic notifiers and has no effect in the other notifiers.
> 
> Signed-off-by: Guilherme G. Piccoli 

> diff --git a/kernel/notifier.c b/kernel/notifier.c
> index b8251dc0bc0f..04cb9e956058 100644
> --- a/kernel/notifier.c
> +++ b/kernel/notifier.c
> @@ -140,10 +163,16 @@ int atomic_notifier_chain_register(struct 
> atomic_notifier_head *nh,
>   struct notifier_block *n)
>  {
>   unsigned long flags;
> - int ret;
> + int ret = 0;
>  
>   spin_lock_irqsave(>lock, flags);
> + if (unlikely(panic_nf_count) && nh == _notifier_list)
> + if (!is_panic_notifier_filtered(n))
> + goto panic_filtered_out;

Forget the unlikely(); this is not a hot path.

> +
>   ret = notifier_chain_register(>head, n);
> +
> +panic_filtered_out:
>   spin_unlock_irqrestore(>lock, flags);
>   return ret;
>  }

It would be simpler to do:

if (!(nh == _notifier_list && panic_nf_count > 0 &&
is_panic_notifier_filtered(n)))
ret = notifier_chain_register(>head, n);

If there were special-purpose functions just for registering and 
unregistering callbacks on the panic_notifier_list, the design would be 
cleaner (no need to modify core notifier code).  But making that change 
would mean altering a lot of call sites.  :-(

> @@ -162,10 +194,16 @@ int atomic_notifier_chain_unregister(struct 
> atomic_notifier_head *nh,
>   struct notifier_block *n)
>  {
>   unsigned long flags;
> - int ret;
> + int ret = 0;
>  
>   spin_lock_irqsave(>lock, flags);
> + if (unlikely(panic_nf_count) && nh == _notifier_list)
> + if (!is_panic_notifier_filtered(n))
> + goto panic_filtered_out;
> +
>   ret = notifier_chain_unregister(>head, n);
> +
> +panic_filtered_out:

Same idea here.

Alan Stern

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [linux-pm] [PATCH -mm 1/2] kexec jump -v12: kexec jump

2008-07-12 Thread Alan Stern
On Fri, 11 Jul 2008, Eric W. Biederman wrote:

  Forcing the second set of requests to filter through an extra software 
  layer is a clumsy way of accomplishing this.  There ought to be a 
  better approach.
 
 The point was something different.  The reasons we can not store the
 state of the system with the hardware devices logically hot unplugged
 (and thus reuse all of the find device hotplug methods) is because
 things like the filesystem layer don't know how to cope with their
 block devices going away an coming back.

This is not how the procedure works.  During hibernation, block devices
are not logically hot-unplugged.  (If they were then they couldn't be
used for writing the memory image.)  Instead, they are quiesced or
suspended and their input queues are plugged.

 That is the problem inserting an virtual software device in the middle
 can solve.  If that works should there be a better way?  Certainly but
 to prove it out starting with a block device wrapper is a trivial way to
 go.

This sounds like a solution to a non-existent problem.

Alan Stern


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [linux-pm] [PATCH -mm 1/2] kexec jump -v12: kexec jump

2008-07-11 Thread Alan Stern
On Fri, 11 Jul 2008, Eric W. Biederman wrote:

 I just realized with a little care the block layer does have support for this,
 or something very close.
 
 You setup a software raid mirror with one disk device.The physical
 device can come in and out while the filesystems depend on the real device.

Do you mean the filesystems depend on the logical RAID device?  

What's to prevent userspace from accessing the physical device 
directly?

What this amounts to, in the end, is having a way to distinguish the
set of I/O requests coming from the hibernation code (reading or
writing the memory image) from the set of all other I/O requests.  The
driver or the block layer has to be set up to allow the first set
through while blocking the second set.  (And don't forget about the 
complications caused by error-recovery I/O during the hibernation 
activity!)

Forcing the second set of requests to filter through an extra software 
layer is a clumsy way of accomplishing this.  There ought to be a 
better approach.

Alan Stern


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [linux-pm] [PATCH -mm] kexec jump -v9

2008-05-15 Thread Alan Stern
On Wed, 14 May 2008, Eric W. Biederman wrote:

 My take on the situation is this.  For proper handling we
 need driver device_detach and device_reattach methods.
 
 With the following semantics.  The device_detach methods
 will disable DMA and place the hardware in a sane state
 from which the device driver can reclaim and reinitialize it,
 but the hardware will not be touched.
 
 device_reattach reattaches the driver to the hardware.

How would these differ from the already-existing remove and probe 
methods?

Alan Stern


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [linux-pm] [PATCH -mm] kexec jump -v9

2008-03-22 Thread Alan Stern
On Sat, 22 Mar 2008, Rafael J. Wysocki wrote:

 However, as far as the ACPI NVS area is concerned, this is probably not
 necessary, because the spec wants us to restore the ACPI NVS before calling
 _WAK, which is just after the image kernel gets the control back.  So, in
 theory, the ACPI NVS data could be stored in the image and restored by
 the image kernel from a location known to it (the procedure may be to copy
 the ACPI NVS data into a region of regular RAM before creating the image and
 copy them back into the ACPI NVS area in platform-leave(), for example), but
 I suspect that for this to work we'll have to switch ACPI off in the boot
 kernel, just prior to passing control back to the image kernel.

That sounds by far the simplest solution.  If the boot kernel can tell
(by looking at some header field in the image or any other way) that
the hibernation used S5 instead of S4, then it should just turn off 
ACPI before passing control to the image kernel.  Then the image kernel 
can turn ACPI back on and all should be well.  If you do this, does the 
NVS region still need to be preserved?

Alan Stern


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [linux-pm] [PATCH -mm] kexec jump -v9

2008-03-19 Thread Alan Stern
On Tue, 18 Mar 2008, Eric W. Biederman wrote:

 Alan Stern [EMAIL PROTECTED] writes:

  Could it be connected with the way the boot kernel hands control over
  to the image kernel?  Presumably ACPI isn't prepared to deal with that
  sort of thing during a boot from S5.  It would have to be fooled into
  thinking the two kernels were one and the same.
 
 It should be easy to test if it is a hand over problem, by turning off
 the laptop by placing it in S5 (shutdown -h now) and then booting same
 kernel again.

?  Doesn't this happen every time Rafael turns the computer off and 
then turns it back on?

Do you mean that Rafael should do an S5-type hibernate, but then reboot 
in such a way that the image isn't loaded and resumed?

Alan Stern


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [linux-pm] [PATCH -mm] kexec jump -v9

2008-03-12 Thread Alan Stern
On Wed, 12 Mar 2008, Huang, Ying wrote:

 I think kexec based hibernation is the only currently available
 possible method to write out image without freezer (after driver works
 are done). If other process is running, how to prevent them from writing
 to disk without freezing them in current implementation?

This is a very good question.

It's a matter of managing the block layer's request queues.  Somehow 
the existing I/O requests must remain blocked while the requests needed 
for writing the image must be allowed to proceed.

I don't know what would be needed to make this work, but it ought to be 
possible somehow...

Alan Stern


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [linux-pm] Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump

2007-09-22 Thread Alan Stern
On Fri, 21 Sep 2007, Rafael J. Wysocki wrote:

   Well, the problem is that apparently some systems (eg. my HP nx6325) 
   expect us
   to execute the _PTS ACPI global control method before creating the image 
   _and_
   to execute acpi_enter_sleep_state(ACPI_STATE_S4) in order to finally put 
   the
   system into the sleep state.  In particular, on nx6325, if we don't do 
   that,
   then after the restore the status of the AC power will not be reported
   correctly (and if you replace the battery while in the sleep state, the
   battery status will not be updated correctly after the restore).  Similar
   issues have been reported for other machines.
  
  Suppose that instead of using ACPI S4 state at all, you instead just
  power off.  Yes, you'll lose wakeup event functionality, and flashy
  LEDs, but doesn't this take care of the problem?
 
 Nope.
 
  The firmware shouldn't see the hibernate as anything other than a shutdown
  and reboot.
 
 Actually, this assumption is apparently wrong.

One gets the impression that the hibernation image includes a memory 
area used by the firmware.  That could explain why devices need to be 
in a low-power state when the image is created -- so that when the 
image is restored, the firmware doesn't get confused about the device 
states.

It would also explain why the firmware sees
resume-from-power-off-hibernation as different from a regular reboot:
because its data area gets overwritten as part of the resume.

In reality it's probably more complicated than this, with weird 
interactions between the firmware and the various ACPI methods.  
Nevertheless, the main idea seems valid.

Alan Stern


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec