Re: [PATCH] kexec: Add --lite option

2015-12-09 Thread Pratyush Anand
Hi James,

On 08/12/2015:04:00:17 PM, James Morse wrote:
> Hi Pratyush,
> 
> On 07/12/15 14:07, Pratyush Anand wrote:
> > On 07/12/2015:01:16:06 PM, James Morse wrote:
> >> I haven't benchmarked this, but:
> >>
> >> util_lib/sha256.c contains calls out to memcpy().
> >> In your case 1, this will use the glibc version. In case 2, it will use
> >> the version implemented in purgatory/string.c, which is a byte-by-byte 
> >> copy.
> >>
> > 
> > Yes, I agree that byte copy is too slow. But, memcpy() in sha256_update() 
> > will
> > copy only few bytes (I think max 126 bytes). Most of the data will be 
> > processed
> > using loop while( length >= 64 ){}, where we do not have any memcpy.So, I 
> > do not
> > think that this would be causing such a difference.
> 
> You're right, I benchmarked the two sha256.o files checksumming a 10MB
> buffer - one takes 0.6s, the other 1.7s, we can probably expect a couple
> of seconds to do this.
> 
> Is the sha256 really useful? Purgatory can't print out an error message,
> if it fails...

kdump needs the sha256 integrity checks. Please see previous reply from Dave,
Vivek and Eric in this thread.

Purgatory prints error message, in case sha256 fails. It prints expected and
calculated sha256 values. You must pass --port with proper value to see print
messages. Geoff has been able to see purgatory debug messages on foundation
model only with --port, however I need below patch and pass --port-lsr as well
in order to print all the characters properly.
https://github.com/pratyushanand/kexec-tools/commit/ab30f4015189b177dd2e78980f5b7e47c2d22fe4

So, on a system having pl011 base address 0xe101, I pass --port=0xe101
--port-lsr=0xe1010018,0x80 to the kexec command.

> 
> 
> > Could it be the case that I am not using perfect memory attributes while 
> > setting
> > up identity mapping and enabling D-cache. My implementation is here:
> > https://github.com/pratyushanand/kexec-tools/commit/8efdbc56b52f99a8a074edd0ddc519d7b68be82f
> 

First of all, thanks a lot for taking out your time to review it :-)

> I'm no expert, but that looks like you're setting it up as 'normal'

Me too not an expert.

Shouldn't it be normal type memory? I think, I will need to define only UART
area as device type (currently it is not defined, and so I am not able to use
print message while mmu is enabled).

> memory. You're missing some isb-s and tlbi-s: depending on how long the
> changes to system state take, you may be using old memory-attributes or
> page-tables. If you depend on a change to system state, (like turning
> the mmu on), you need explicit synchronisation, see section 12.3.2 of
> the 'Architecture Programmers Guide' (arm den0024a), and D7.1.2 of the
> ARM ARM.

I will go through these docs and kernel arch/arm64/kernel/head.S and will
rewrite the cache implementation.

> 
> I haven't managed to get your kexec-tools branch to work with v10 of

May be you can try master branch. It is almost same as that of Geoff's.
Additionally, it has support to "wait for transmit completion before next
character transmission".

> Geoff's series. It looks like you save registers to the stack, which
> give stale values once you turn the mmu off. You also do the opposite,
> saving registers with the mmu off, then cleaning cache lines over the
> top, corrupting the saved registers.
> 
> The page size of 64K is hard coded. Kexec-ing from a 4K kernel, to a 4K
> kernel will work, but only if the hardware also supports 64K, this will
> be surprising to debug.

OK, I will take care in the re-implementation.

Thanks

~Pratyush

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v12 05/16] arm64: Add back cpu_reset routines

2015-12-09 Thread Geoff Levand
Hi Will,

On Thu, 2015-12-03 at 09:32 +, Will Deacon wrote:
> On Wed, Dec 02, 2015 at 02:57:30PM -0800, Geoff Levand wrote:
> > On Mon, 2015-11-30 at 10:40 +, Marc Zyngier wrote:
> > 
> > > All that can be solved in C, and mostly at compile time. Using an
> > > assembler trampoline is complicating things for no good reason:
> > 
> > I added this into my kexec-v12.1 branch.  After some more testing I'll
> > post a v13 to the list.
> 
> You may well need some notrace annotations if you want to run C code
> here. Can you give it a spin with things like ftrace and kprobes?

I tested ftrace with and without the notrace annotation on
cpu_soft_restart() and could do kexec re-boots either way.

When cpu_soft_restart() is called we've only done setup_mm_for_reboot(),
so I would think C code should still be OK.

-Geoff


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[V6 PATCH 2/6] panic/x86: Allow CPUs to save registers even if they are looping in NMI context

2015-12-09 Thread Hidehiro Kawai
Currently, kdump_nmi_shootdown_cpus(), a subroutine of crash_kexec(),
sends NMI IPI to non-panic CPUs to stop them and save their register
information and doing some cleanups for crash dumping.  However, if
a non-panic CPU is infinitely looping in NMI context, we fail to
save its register information into the crash dump.

For example, this can happen when unknown NMIs are broadcast to all
CPUs as follows (before applying this patch series):

  CPU 0 CPU 1
  ===   ==
  receive an unknown NMI
  unknown_nmi_error()
panic() receive an unknown NMI
  spin_trylock(_lock) unknown_nmi_error()
  crash_kexec()   panic()
spin_trylock(_lock)
panic_smp_self_stop()
  infinite loop
kdump_nmi_shootdown_cpus()
  issue NMI IPI ---> blocked until IRET
  infinite loop...

  Here, since CPU 1 is in NMI context, additional NMI from CPU 0
  is blocked until CPU 1 executes IRET.  However, CPU 1 never
  executes IRET, so the NMI is not handled and the callback function
  to save registers is never called.

  Actually, this can happen on some servers which broadcast NMIs to
  all CPUs when the dump button is pushed.

To save registers in this case, we need to:

  a) Return from NMI handler instead of looping infinitely
  or
  b) Call the callback function directly from the infinite loop

Inherently, a) is risky because NMI is also used to prevent corrupted
data from being propagated to devices.  So, we chose b).

This patch does following things:

1. Move the timing of `infinite loop in NMI context' (actually
   done by panic_smp_self_stop()) outside of panic() to enable us to
   refer pt_regs.  Please note that panic_smp_self_stop() is still
   used for normal context
2. Call a callback of kdump_nmi_shootdown_cpus() directly to save
   registers and do some cleanups after setting waiting_for_crash_ipi
   which is used for counting down the number of CPUs which handled
   the callback

V6:
- Revise the commit description and many comments
- Move changes involved with nmi_reason_lock to a later patch because
  it turned out there is no problem at this point
- Rename crash_ipi_done to crash_ipi_issued to clarify its meaning

V5:
- Use WRITE_ONCE() when setting crash_ipi_done to 1 so that the
  compiler doesn't change the instruction order
- Support the case of b in the above description
- Add poll_crash_ipi_and_callback()

V4:
- Rewrite the patch description

V3:
- Newly introduced

Signed-off-by: Hidehiro Kawai 
Cc: Andrew Morton 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Peter Zijlstra 
Cc: Eric Biederman 
Cc: Vivek Goyal 
Cc: Michal Hocko 
---
 arch/x86/kernel/nmi.c|6 +++---
 arch/x86/kernel/reboot.c |   20 
 include/linux/kernel.h   |   14 +++---
 kernel/panic.c   |   10 ++
 kernel/watchdog.c|2 +-
 5 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 5131714..5e00de7 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -231,7 +231,7 @@ pci_serr_error(unsigned char reason, struct pt_regs *regs)
 #endif
 
if (panic_on_unrecovered_nmi)
-   nmi_panic("NMI: Not continuing");
+   nmi_panic(regs, "NMI: Not continuing");
 
pr_emerg("Dazed and confused, but trying to continue\n");
 
@@ -256,7 +256,7 @@ io_check_error(unsigned char reason, struct pt_regs *regs)
show_regs(regs);
 
if (panic_on_io_nmi) {
-   nmi_panic("NMI IOCK error: Not continuing");
+   nmi_panic(regs, "NMI IOCK error: Not continuing");
 
/*
 * If we return from nmi_panic(), it means we have received
@@ -305,7 +305,7 @@ unknown_nmi_error(unsigned char reason, struct pt_regs 
*regs)
 
pr_emerg("Do you have a strange power saving mode enabled?\n");
if (unknown_nmi_panic || panic_on_unrecovered_nmi)
-   nmi_panic("NMI: Not continuing");
+   nmi_panic(regs, "NMI: Not continuing");
 
pr_emerg("Dazed and confused, but trying to continue\n");
 }
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 02693dd..1da1302 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -718,6 +718,7 @@ static int crashing_cpu;
 static nmi_shootdown_cb shootdown_callback;
 
 static atomic_t waiting_for_crash_ipi;
+static int crash_ipi_issued;
 
 static int crash_nmi_callback(unsigned int val, struct pt_regs *regs)
 {
@@ -780,6 +781,9 @@ void 

[V6 PATCH 5/6] x86/nmi: Fix to save registers for crash dump on external NMI broadcast

2015-12-09 Thread Hidehiro Kawai
Now, multiple CPUs can receive external NMI simultaneously by
specifying "apic_extnmi=all" as an boot option.  When we take a
crash dump by using external NMI with this option, we fail to save
register values into the crash dump.  This happens as follows:

  CPU 0  CPU 1
     =
  receive an external NMI
  default_do_nmi()   receive an external NMI
spin_lock(_reason_lock)  default_do_nmi()
io_check_error()   spin_lock(_reason_lock)
  panic()busy loop
  ...
kdump_nmi_shootdown_cpus()
  issue NMI IPI ---> blocked until IRET
 busy loop...

  Here, since CPU 1 is in NMI context, additional NMI from CPU 0
  is blocked until CPU 1 executes IRET.  However, CPU 1 never
  executes IRET, so the NMI is not handled and the callback function
  to save registers is never called.

To solve this issue, we check if the IPI for crash dumping was
issued while waiting for nmi_reason_lock to be released, and if so,
call its callback function directly.  If the IPI is not issued (e.g.
kdump is disabled), the actual behavior doesn't change.

V6:
- Separated from the former patch `panic/x86: Allow cpus to save
  registers even if they are looping in NMI context'
- Fix comments
- Remove unneeded UP version of poll_crash_ipi_and_calback
- Rename poll_crash_ipi_and_callback to run_crash_ipi_callback

Signed-off-by: Hidehiro Kawai 
Cc: Andrew Morton 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
---
 arch/x86/include/asm/reboot.h |1 +
 arch/x86/kernel/nmi.c |   11 ++-
 arch/x86/kernel/reboot.c  |   18 +-
 3 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/reboot.h b/arch/x86/include/asm/reboot.h
index a82c4f1..2cb1cc2 100644
--- a/arch/x86/include/asm/reboot.h
+++ b/arch/x86/include/asm/reboot.h
@@ -25,5 +25,6 @@ void __noreturn machine_real_restart(unsigned int type);
 
 typedef void (*nmi_shootdown_cb)(int, struct pt_regs*);
 void nmi_shootdown_cpus(nmi_shootdown_cb callback);
+void run_crash_ipi_callback(struct pt_regs *regs);
 
 #endif /* _ASM_X86_REBOOT_H */
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 5e00de7..cbfa0b5 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define CREATE_TRACE_POINTS
 #include 
@@ -357,7 +358,15 @@ static void default_do_nmi(struct pt_regs *regs)
}
 
/* Non-CPU-specific NMI: NMI sources can be processed on any CPU */
-   raw_spin_lock(_reason_lock);
+   /*
+* Another CPU may be processing panic routines while holding
+* nmi_reason_lock.  Check if the CPU issued the IPI for crash
+* dumping, and if so, call its callback directly.  If there is
+* no CPU preparing crash dump, we simply loop here without doing
+* special things.
+*/
+   while (!raw_spin_trylock(_reason_lock))
+   run_crash_ipi_callback(regs);
reason = x86_platform.get_nmi_reason();
 
if (reason & NMI_REASON_MASK) {
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 1da1302..60a216b 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -793,17 +793,25 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
/* Leave the nmi callback set */
 }
 
+/*
+ * Wait for the crash dumping IPI to be issued, and then call its callback
+ * directly.  This function is used when we have already been in NMI handler.
+ */
+void run_crash_ipi_callback(struct pt_regs *regs)
+{
+   if (crash_ipi_issued)
+   crash_nmi_callback(0, regs); /* Don't return */
+}
+
 /* Override the weak function in kernel/panic.c */
 void nmi_panic_self_stop(struct pt_regs *regs)
 {
while (1) {
/*
-* Wait for the crash dumping IPI to be issued, and then
-* call its callback directly.
+* If there is no CPU preparing crash dump, we simply loop
+* here without doing special things.
 */
-   if (READ_ONCE(crash_ipi_issued))
-   crash_nmi_callback(0, regs); /* Don't return */
-
+   run_crash_ipi_callback(regs);
cpu_relax();
}
 }



___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[V6 PATCH 4/6] x86/apic: Introduce apic_extnmi boot option

2015-12-09 Thread Hidehiro Kawai
This patch introduces new boot option, apic_extnmi:

 apic_extnmi={ bsp | all | none}

The default value is "bsp" and this is the current behavior; only
BSP receives external NMI.  "all" allows external NMIs to be
broadcast to all CPUs.  This would raise the success rate of panic
on NMI when BSP hangs up in NMI context or the external NMI is
swallowed by other NMI handlers on BSP.  If you specified "none",
any CPUs don't receive external NMIs.  This is useful for dump
capture kernel so that it wouldn't be shot down while saving a
crash dump.

V6:
- Fix comments
- Change to use strncmp to parse the value of "apic_extnmi="
- Make apic_set_extnmi() return -EINVAL if the given value is invalid

V5:
- Rename the option from "noextnmi" to "apic_extnmi"
- Add apic_extnmi=all feature
- Fix the wrong documentation about "noextnmi" (apic_extnmi=none)

Signed-off-by: Hidehiro Kawai 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Jonathan Corbet 
---
 Documentation/kernel-parameters.txt |9 +
 arch/x86/include/asm/apic.h |5 +
 arch/x86/kernel/apic/apic.c |   35 +--
 3 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 742f69d..74acea5 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -472,6 +472,15 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
Change the amount of debugging information output
when initialising the APIC and IO-APIC components.
 
+   apic_extnmi=[APIC,X86] External NMI delivery setting
+   Format: { bsp (default) | all | none }
+   bsp:  External NMI is delivered only to CPU 0
+   all:  External NMIs are broadcast to all CPUs as a
+ backup of CPU 0
+   none: External NMI is masked for all CPUs. This is
+ useful so that a dump capture kernel won't be
+ shot down by NMI
+
autoconf=   [IPV6]
See Documentation/networking/ipv6.txt.
 
diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 7f62ad4..c80f6b6 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -23,6 +23,11 @@
 #define APIC_VERBOSE 1
 #define APIC_DEBUG   2
 
+/* Macros for apic_extnmi which controls external NMI masking */
+#define APIC_EXTNMI_BSP0 /* Default */
+#define APIC_EXTNMI_ALL1
+#define APIC_EXTNMI_NONE   2
+
 /*
  * Define the default level of output to be very little
  * This can be turned up by using apic=verbose for more
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 8d7df74..162d2bd 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -82,6 +82,12 @@ physid_mask_t phys_cpu_present_map;
 static unsigned int disabled_cpu_apicid __read_mostly = BAD_APICID;
 
 /*
+ * This variable controls which CPUs receive external NMIs.  By default,
+ * external NMIs are delivered only to the BSP.
+ */
+static int apic_extnmi = APIC_EXTNMI_BSP;
+
+/*
  * Map cpu index to physical APIC ID
  */
 DEFINE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid, BAD_APICID);
@@ -1161,6 +1167,8 @@ void __init init_bsp_APIC(void)
value = APIC_DM_NMI;
if (!lapic_is_integrated()) /* 82489DX */
value |= APIC_LVT_LEVEL_TRIGGER;
+   if (apic_extnmi == APIC_EXTNMI_NONE)
+   value |= APIC_LVT_MASKED;
apic_write(APIC_LVT1, value);
 }
 
@@ -1378,9 +1386,11 @@ void setup_local_APIC(void)
apic_write(APIC_LVT0, value);
 
/*
-* only the BP should see the LINT1 NMI signal, obviously.
+* Only the BP sees the LINT1 NMI signal by default.  This can be
+* modified by apic_extnmi= boot option.
 */
-   if (!cpu)
+   if ((!cpu && apic_extnmi != APIC_EXTNMI_NONE) ||
+   apic_extnmi == APIC_EXTNMI_ALL)
value = APIC_DM_NMI;
else
value = APIC_DM_NMI | APIC_LVT_MASKED;
@@ -2557,3 +2567,24 @@ static int __init apic_set_disabled_cpu_apicid(char *arg)
return 0;
 }
 early_param("disable_cpu_apicid", apic_set_disabled_cpu_apicid);
+
+static int __init apic_set_extnmi(char *arg)
+{
+   if (!arg)
+   return -EINVAL;
+
+   if (!strncmp("all", arg, 3))
+   apic_extnmi = APIC_EXTNMI_ALL;
+   else if (!strncmp("none", arg, 4))
+   apic_extnmi = APIC_EXTNMI_NONE;
+   else if (!strncmp("bsp", arg, 3))
+   apic_extnmi = APIC_EXTNMI_BSP;
+   else {
+   pr_warn("Unknown external NMI delivery mode `%s' ignored\n",
+ 

[V6 PATCH 3/6] kexec: Fix race between panic() and crash_kexec() called directly

2015-12-09 Thread Hidehiro Kawai
Currently, panic() and crash_kexec() can be called at the same time.
For example (x86 case):

CPU 0:
  oops_end()
crash_kexec()
  mutex_trylock() // acquired
nmi_shootdown_cpus() // stop other CPUs

CPU 1:
  panic()
crash_kexec()
  mutex_trylock() // failed to acquire
smp_send_stop() // stop other CPUs
infinite loop

If CPU 1 calls smp_send_stop() before nmi_shootdown_cpus(), kdump
fails.

In another case:

CPU 0:
  oops_end()
crash_kexec()
  mutex_trylock() // acquired

io_check_error()
  panic()
crash_kexec()
  mutex_trylock() // failed to acquire
infinite loop

Clearly, this is an undesirable result.

To fix this problem, this patch changes crash_kexec() to exclude
others by using atomic_t panic_cpu.

V6:
- Use PANIC_CPU_INVALID instead of hard-coded -1
- Add comments about __crash_kexec()

V5:
- Add missing dummy __crash_kexec() for !CONFIG_KEXEC_CORE case
- Replace atomic_xchg() with atomic_set() in crash_kexec() because
  it is used as a release operation and there is no need of memory
  barrier effect.  This change also removes an unused value warning

V4:
- Use new __crash_kexec(), no exclusion check version of crash_kexec(),
  instead of checking if panic_cpu is the current CPU or not

V2:
- Use atomic_cmpxchg() instead of spin_trylock() on panic_lock
  to exclude concurrent accesses
- Don't introduce no-lock version of crash_kexec()

Signed-off-by: Hidehiro Kawai 
Cc: Eric Biederman 
Cc: Vivek Goyal 
Cc: Andrew Morton 
Cc: Michal Hocko 
---
 include/linux/kexec.h |2 ++
 kernel/kexec_core.c   |   30 +-
 kernel/panic.c|4 ++--
 3 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index d140b1e..7b68d27 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -237,6 +237,7 @@ extern int kexec_purgatory_get_set_symbol(struct kimage 
*image,
  unsigned int size, bool get_value);
 extern void *kexec_purgatory_get_symbol_addr(struct kimage *image,
 const char *name);
+extern void __crash_kexec(struct pt_regs *);
 extern void crash_kexec(struct pt_regs *);
 int kexec_should_crash(struct task_struct *);
 void crash_save_cpu(struct pt_regs *regs, int cpu);
@@ -332,6 +333,7 @@ int __weak arch_kexec_apply_relocations(const Elf_Ehdr 
*ehdr, Elf_Shdr *sechdrs,
 #else /* !CONFIG_KEXEC_CORE */
 struct pt_regs;
 struct task_struct;
+static inline void __crash_kexec(struct pt_regs *regs) { }
 static inline void crash_kexec(struct pt_regs *regs) { }
 static inline int kexec_should_crash(struct task_struct *p) { return 0; }
 #define kexec_in_progress false
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 11b64a6..c823f30 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -853,7 +853,12 @@ struct kimage *kexec_image;
 struct kimage *kexec_crash_image;
 int kexec_load_disabled;
 
-void crash_kexec(struct pt_regs *regs)
+/*
+ * No panic_cpu check version of crash_kexec().  This function is called
+ * only when panic_cpu holds the current CPU number; this is the only CPU
+ * which processes crash_kexec routines.
+ */
+void __crash_kexec(struct pt_regs *regs)
 {
/* Take the kexec_mutex here to prevent sys_kexec_load
 * running on one cpu from replacing the crash kernel
@@ -876,6 +881,29 @@ void crash_kexec(struct pt_regs *regs)
}
 }
 
+void crash_kexec(struct pt_regs *regs)
+{
+   int old_cpu, this_cpu;
+
+   /*
+* Only one CPU is allowed to execute the crash_kexec() code as with
+* panic().  Otherwise parallel calls of panic() and crash_kexec()
+* may stop each other.  To exclude them, we use panic_cpu here too.
+*/
+   this_cpu = raw_smp_processor_id();
+   old_cpu = atomic_cmpxchg(_cpu, PANIC_CPU_INVALID, this_cpu);
+   if (old_cpu == PANIC_CPU_INVALID) {
+   /* This is the 1st CPU which comes here, so go ahead. */
+   __crash_kexec(regs);
+
+   /*
+* Reset panic_cpu to allow another panic()/crash_kexec()
+* call.
+*/
+   atomic_set(_cpu, PANIC_CPU_INVALID);
+   }
+}
+
 size_t crash_get_memory_size(void)
 {
size_t size = 0;
diff --git a/kernel/panic.c b/kernel/panic.c
index 3d6c3f1..81a0a3d 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -138,7 +138,7 @@ void panic(const char *fmt, ...)
 * the "crash_kexec_post_notifiers" option to the kernel.
 */
if (!crash_kexec_post_notifiers)
-   crash_kexec(NULL);
+   __crash_kexec(NULL); /* bypass panic_cpu check */
 
/*
 * Note smp_send_stop is the usual smp shutdown function, which
@@ -163,7 +163,7 @@ 

[V6 PATCH 6/6] Documentation: Add documentation for kernel.panic_on_io_nmi sysctl

2015-12-09 Thread Hidehiro Kawai
kernel.panic_on_io_nmi sysctl was introduced by commit 5211a242d0cb
("x86: Add sysctl to allow panic on IOCK NMI error"), but its
documentation is missing. So, add it.

V6:
- Newly added

Signed-off-by: Hidehiro Kawai 
Cc: Jonathan Corbet 
Cc: Andrew Morton 
---
 Documentation/sysctl/kernel.txt |   15 +++
 1 file changed, 15 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index af70d15..235f804 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -551,6 +551,21 @@ the recommended setting is 60.
 
 ==
 
+panic_on_io_nmi:
+
+Controls the kernel's behavior when a CPU receives an NMI caused by
+an IO error.
+
+0: try to continue operation (default)
+
+1: panic immediately. The IO error triggered NMI indicates a serious
+   system condition, which could result in IO data corruption. Rather
+   than continuing, panicking might be a better choice. Some servers
+   issue this sort of NMI when the dump button is pushed, and you
+   can use this option to take a crash dump
+
+==
+
 panic_on_oops:
 
 Controls the kernel's behaviour when an oops or BUG is encountered.



___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[V6 PATCH 0/6] Fix race issues among panic, NMI and crash_kexec

2015-12-09 Thread Hidehiro Kawai
When an HA clustering software or administrator detects unresponsiveness
of a host, they issue an NMI to the host to completely stop current
works and take a crash dump.  If the kernel has already panicked
or is capturing a crash dump at that time, further NMI can cause
a crash dump failure.

Also, crash_kexec() called from oops context and panic() can
cause race conditions.

To solve these issues, this patch set does following things:

- Don't call panic() on NMI if the kernel has already panicked
- Extend exclusion control currently done by panic_lock to crash_kexec
- Introduce "apic_extnmi=none" boot option which masks external NMI
  NMI at the boot time

Additionally, "apic_extnmi=all" is provieded.  This option unmasks
external NMI for all CPUs.  This would help cause kernel panic even if
CPU 0 can't handle an external NMI due to hang-up in NMI context
or being handled by other NMI handlers.

This patch set can be applied to current -tip tree.

V6:
- Update comments and patch descriptions all over the patch series
- Add documentation for kernel.panic_on_io_nmi sysctl (PATCH 6/6)
- Separate PATCH 5/6 from PATCH 2/6 because the portion is actually
  needed for "apic_extnmi=all" case introduced by PATCH 4/6
- ...and various fixes (please see the change logs in each patch
  description for details)

V5: https://lkml.org/lkml/2015/11/20/228
- Use WRITE_ONCE() for crash_ipi_done to keep the instruction order
  (PATCH 2/4)
- Address concurrent unknown/external NMI case, too (PATCH 2/4)
- Fix build errors (PATCH 3/4)
- Rename "noextnmi" boot option to "apic_extnmi" and expand its
  feature (PATCH 4/4)

V4: https://lkml.org/lkml/2015/9/25/193
- Improve comments and descriptions (PATCH 1/4 to 3/4)
- Use new __crash_kexec(), no exclusion check version of crash_kexec(),
  instead of checking if panic_cpu is the current cpu or not
  (PATCH 3/4)

V3: https://lkml.org/lkml/2015/8/6/39
- Introduce nmi_panic() macro to reduce code duplication
- In the case of panic on NMI, don't return from NMI handlers
  if another cpu already panicked

V2: https://lkml.org/lkml/2015/7/27/31
- Use atomic_cmpxchg() instead of current spin_trylock() to exclude
  concurrent accesses to panic() and crash_kexec()
- Don't introduce no-lock version of panic() and crash_kexec()

V1: https://lkml.org/lkml/2015/7/22/81

---

Hidehiro Kawai (6):
  panic/x86: Fix re-entrance problem due to panic on NMI
  panic/x86: Allow CPUs to save registers even if they are looping in NMI 
context
  kexec: Fix race between panic() and crash_kexec() called directly
  x86/apic: Introduce apic_extnmi boot option
  x86/nmi: Fix to save registers for crash dump on external NMI broadcast
  Documentation: Add documentation for kernel.panic_on_io_nmi sysctl


 Documentation/kernel-parameters.txt |9 +
 Documentation/sysctl/kernel.txt |   15 +++
 arch/x86/include/asm/apic.h |5 +
 arch/x86/include/asm/reboot.h   |1 +
 arch/x86/kernel/apic/apic.c |   35 +--
 arch/x86/kernel/nmi.c   |   27 ++-
 arch/x86/kernel/reboot.c|   28 
 include/linux/kernel.h  |   29 +
 include/linux/kexec.h   |2 ++
 kernel/kexec_core.c |   30 +-
 kernel/panic.c  |   29 -
 kernel/watchdog.c   |2 +-
 12 files changed, 198 insertions(+), 14 deletions(-)


-- 
Hidehiro Kawai
Hitachi, Ltd. Research & Development Group



___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[V6 PATCH 1/6] panic/x86: Fix re-entrance problem due to panic on NMI

2015-12-09 Thread Hidehiro Kawai
If panic on NMI happens just after panic() on the same CPU, panic()
is recursively called.  As the result, it stalls after failing to
acquire panic_lock.

To avoid this problem, don't call panic() in NMI context if
we've already entered panic().

V6:
- Add a comment about panic_cpu
- Replace the magic number -1 for panic_cpu with a macro

V4:
- Improve comments in io_check_error() and panic()

V3:
- Introduce nmi_panic() macro to reduce code duplication
- In the case of panic on NMI, don't return from NMI handlers
  if another CPU already panicked

V2:
- Use atomic_cmpxchg() instead of current spin_trylock() to
  exclude concurrent accesses to the panic routines
- Don't introduce no-lock version of panic()

Signed-off-by: Hidehiro Kawai 
Cc: Andrew Morton 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
---
 arch/x86/kernel/nmi.c  |   16 
 include/linux/kernel.h |   21 +
 kernel/panic.c |   15 ---
 kernel/watchdog.c  |2 +-
 4 files changed, 46 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 697f90d..5131714 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -231,7 +231,7 @@ pci_serr_error(unsigned char reason, struct pt_regs *regs)
 #endif
 
if (panic_on_unrecovered_nmi)
-   panic("NMI: Not continuing");
+   nmi_panic("NMI: Not continuing");
 
pr_emerg("Dazed and confused, but trying to continue\n");
 
@@ -255,8 +255,16 @@ io_check_error(unsigned char reason, struct pt_regs *regs)
 reason, smp_processor_id());
show_regs(regs);
 
-   if (panic_on_io_nmi)
-   panic("NMI IOCK error: Not continuing");
+   if (panic_on_io_nmi) {
+   nmi_panic("NMI IOCK error: Not continuing");
+
+   /*
+* If we return from nmi_panic(), it means we have received
+* NMI while processing panic().  So, simply return without
+* a delay and re-enabling NMI.
+*/
+   return;
+   }
 
/* Re-enable the IOCK line, wait for a few seconds */
reason = (reason & NMI_REASON_CLEAR_MASK) | NMI_REASON_CLEAR_IOCHK;
@@ -297,7 +305,7 @@ unknown_nmi_error(unsigned char reason, struct pt_regs 
*regs)
 
pr_emerg("Do you have a strange power saving mode enabled?\n");
if (unknown_nmi_panic || panic_on_unrecovered_nmi)
-   panic("NMI: Not continuing");
+   nmi_panic("NMI: Not continuing");
 
pr_emerg("Dazed and confused, but trying to continue\n");
 }
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 350dfb0..db66867 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -446,6 +446,27 @@ extern int sysctl_panic_on_stackoverflow;
 extern bool crash_kexec_post_notifiers;
 
 /*
+ * panic_cpu holds a panicking CPU number and is used for exclusive
+ * execution of panic and crash_kexec routines. If the value is
+ * PANIC_CPU_INVALID, it means that none of CPU has entered panic or
+ * crash_kexec.
+ */
+extern atomic_t panic_cpu;
+#define PANIC_CPU_INVALID  -1
+
+/*
+ * A variant of panic() called from NMI context.
+ * If we've already panicked on this CPU, return from here.
+ */
+#define nmi_panic(fmt, ...)\
+   do {\
+   int this_cpu = raw_smp_processor_id();  \
+   if (atomic_cmpxchg(_cpu, PANIC_CPU_INVALID, this_cpu) \
+   != this_cpu)\
+   panic(fmt, ##__VA_ARGS__);  \
+   } while (0)
+
+/*
  * Only to be used by arch init code. If the user over-wrote the default
  * CONFIG_PANIC_TIMEOUT, honor it.
  */
diff --git a/kernel/panic.c b/kernel/panic.c
index 4b150bc..3261e2d 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -61,6 +61,8 @@ void __weak panic_smp_self_stop(void)
cpu_relax();
 }
 
+atomic_t panic_cpu = ATOMIC_INIT(PANIC_CPU_INVALID);
+
 /**
  * panic - halt the system
  * @fmt: The text string to print
@@ -71,17 +73,17 @@ void __weak panic_smp_self_stop(void)
  */
 void panic(const char *fmt, ...)
 {
-   static DEFINE_SPINLOCK(panic_lock);
static char buf[1024];
va_list args;
long i, i_next = 0;
int state = 0;
+   int old_cpu, this_cpu;
 
/*
 * Disable local interrupts. This will prevent panic_smp_self_stop
 * from deadlocking the first cpu that invokes the panic, since
 * there is nothing to prevent an interrupt handler (that runs
-* after the panic_lock is acquired) from invoking panic again.
+ 

RE: Re: [V6 PATCH 5/6] x86/nmi: Fix to save registers for crash dump on external NMI broadcast

2015-12-09 Thread 河合英宏 / KAWAI,HIDEHIRO
Hi Steven,

> From: Steven Rostedt [mailto:rost...@goodmis.org]
> On Tue, Nov 24, 2015 at 11:48:53AM +0100, Borislav Petkov wrote:
> > > +  */
> > > + while (!raw_spin_trylock(_reason_lock))
> > > + poll_crash_ipi_and_callback(regs);
> >
> > Waaait a minute: so if we're getting NMIs broadcasted on every core but
> > we're *not* crash dumping, we will run into here too. This can't be
> > right. :-\
> 
> This only does something if crash_ipi_done is set, which means you are killing
> the box. But perhaps a comment that states that here would be useful, or maybe
> just put in the check here. There's no need to make it depend on SMP, as
> raw_spin_trylock() will turn to just ({1}) for UP, and that code wont even be
> hit.

It seems that poll_crash_ipi_and_callback (now renamed to 
run_crash_ipi_callback)
is referred for UP if CONFIG_DEBUG_SPINLOCK=y, and it causes a build error
as below.  run_crash_ipi_callback refers crash_ipi_issued and 
crash_nmi_callback,
which are defined only if CONFIG_SMP=y.  So we need to defined it separately
for SMP and UP.

I'll resend this patch later.

> Hi Hidehiro,
> 
> [auto build test ERROR on v4.4-rc4]
> [also build test ERROR on next-20151209]
> [cannot apply to tip/x86/core]
> 
> url:
> https://github.com/0day-ci/linux/commits/Hidehiro-Kawai/Fix-race-issues-among-panic-NMI-and-crash_kexec/20151210-095
> 254
> config: x86_64-randconfig-s4-12101030 (attached as .config)
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=x86_64
> 
> All errors (new ones prefixed by >>):
> 
>arch/x86/built-in.o: In function `do_nmi':
> >> (.text+0x7339): undefined reference to `run_crash_ipi_callback'
> 
> ---
> 0-DAY kernel test infrastructureOpen Source Technology Center
> https://lists.01.org/pipermail/kbuild-all   Intel Corporation
___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[V6.1 PATCH 5/6] x86/nmi: Fix to save registers for crash dump on external NMI broadcast

2015-12-09 Thread Hidehiro Kawai
Now, multiple CPUs can receive external NMI simultaneously by
specifying "apic_extnmi=all" as an boot option.  When we take a
crash dump by using external NMI with this option, we fail to save
register values into the crash dump.  This happens as follows:

  CPU 0  CPU 1
     =
  receive an external NMI
  default_do_nmi()   receive an external NMI
spin_lock(_reason_lock)  default_do_nmi()
io_check_error()   spin_lock(_reason_lock)
  panic()busy loop
  ...
kdump_nmi_shootdown_cpus()
  issue NMI IPI ---> blocked until IRET
 busy loop...

  Here, since CPU 1 is in NMI context, additional NMI from CPU 0
  is blocked until CPU 1 executes IRET.  However, CPU 1 never
  executes IRET, so the NMI is not handled and the callback function
  to save registers is never called.

To solve this issue, we check if the IPI for crash dumping was
issued while waiting for nmi_reason_lock to be released, and if so,
call its callback function directly.  If the IPI is not issued (e.g.
kdump is disabled), the actual behavior doesn't change.

V6.1:
- Reintroduce the UP version of run_crash_ipi_callback to fix build
  error for CONFIG_SMP=n and CONFIG_DEBUG_SPINLOCK=y case

V6:
- Separated from the former patch `panic/x86: Allow cpus to save
  registers even if they are looping in NMI context'
- Fix comments
- Remove unneeded UP version of poll_crash_ipi_and_calback
- Rename poll_crash_ipi_and_callback to run_crash_ipi_callback

Signed-off-by: Hidehiro Kawai 
Cc: Andrew Morton 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
---
 arch/x86/include/asm/reboot.h |1 +
 arch/x86/kernel/nmi.c |   11 ++-
 arch/x86/kernel/reboot.c  |   22 +-
 3 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/reboot.h b/arch/x86/include/asm/reboot.h
index a82c4f1..2cb1cc2 100644
--- a/arch/x86/include/asm/reboot.h
+++ b/arch/x86/include/asm/reboot.h
@@ -25,5 +25,6 @@ void __noreturn machine_real_restart(unsigned int type);
 
 typedef void (*nmi_shootdown_cb)(int, struct pt_regs*);
 void nmi_shootdown_cpus(nmi_shootdown_cb callback);
+void run_crash_ipi_callback(struct pt_regs *regs);
 
 #endif /* _ASM_X86_REBOOT_H */
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 5e00de7..cbfa0b5 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define CREATE_TRACE_POINTS
 #include 
@@ -357,7 +358,15 @@ static void default_do_nmi(struct pt_regs *regs)
}
 
/* Non-CPU-specific NMI: NMI sources can be processed on any CPU */
-   raw_spin_lock(_reason_lock);
+   /*
+* Another CPU may be processing panic routines while holding
+* nmi_reason_lock.  Check if the CPU issued the IPI for crash
+* dumping, and if so, call its callback directly.  If there is
+* no CPU preparing crash dump, we simply loop here without doing
+* special things.
+*/
+   while (!raw_spin_trylock(_reason_lock))
+   run_crash_ipi_callback(regs);
reason = x86_platform.get_nmi_reason();
 
if (reason & NMI_REASON_MASK) {
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 1da1302..8a184e3 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -793,17 +793,25 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
/* Leave the nmi callback set */
 }
 
+/*
+ * Wait for the crash dumping IPI to be issued, and then call its callback
+ * directly.  This function is used when we have already been in NMI handler.
+ */
+void run_crash_ipi_callback(struct pt_regs *regs)
+{
+   if (crash_ipi_issued)
+   crash_nmi_callback(0, regs); /* Don't return */
+}
+
 /* Override the weak function in kernel/panic.c */
 void nmi_panic_self_stop(struct pt_regs *regs)
 {
while (1) {
/*
-* Wait for the crash dumping IPI to be issued, and then
-* call its callback directly.
+* If there is no CPU preparing crash dump, we simply loop
+* here without doing special things.
 */
-   if (READ_ONCE(crash_ipi_issued))
-   crash_nmi_callback(0, regs); /* Don't return */
-
+   run_crash_ipi_callback(regs);
cpu_relax();
}
 }
@@ -813,4 +821,8 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
 {
/* No other CPUs to shoot down */
 }
+
+void run_crash_ipi_callback(struct pt_regs *regs)
+{
+}
 #endif