Re: [PATCH] x86: Support large number of memory ranges

2017-03-23 Thread Simon Horman
On Thu, Mar 23, 2017 at 01:27:08PM +, Ramsay, Frank wrote:
> 
> From: Xunlei Pang 
> Sent: Thursday, March 23, 2017 7:16:59 AM
> To: kexec@lists.infradead.org
> Cc: Dave Young; Xunlei Pang; Ramsay, Frank
> Subject: [PATCH] x86: Support large number of memory ranges
> 
> We got a problem on one SGI 64TB machine, the current kexec-tools
> failed to work due to the insufficient ranges(MAX_MEMORY_RANGES)
> allowed which is defined as 1024(less than the ranges on the machine).
> The kcore header is insufficient due to the same reason as well.
> 
> To solve this, this patch simply doubles "MAX_MEMORY_RANGES" and
> "KCORE_ELF_HEADERS_SIZE".
> 
> Cc: Frank Ramsay 
> Signed-off-by: Xunlei Pang 
> 
> acking I tested this
> Tested-by: Frank Ramsay 

Thanks, applied.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v33 00/14] add kdump support

2017-03-23 Thread David Woodhouse
On Fri, 2017-03-17 at 17:10 +, Marc Zyngier wrote:
> 
> > > > 
> > > > FWIW if I trigger a crash on CPU 1 my kdump (still 4.9.8+v32) doesn't 
> > > > work.
> > > > I end up booting the kdump kernel on CPU#1 and then it gets distinctly 
> > > > unhappy...
> > > > 
> > > > [0.00] Booting Linux on physical CPU 0x1
> > > > ...
> > > > [0.017125] Detected PIPT I-cache on CPU1
> > > > [0.017138] GICv3: CPU1: found redistributor 0 region 
> > > > 0:0xf028
> > > > [0.017147] CPU1: Booted secondary processor [411fd073]
> > > > [0.017339] Detected PIPT I-cache on CPU2
> > > > [0.017347] GICv3: CPU2: found redistributor 2 region 
> > > > 0:0xf02c
> > > > [0.017354] CPU2: Booted secondary processor [411fd073]
> > > > [0.017537] Detected PIPT I-cache on CPU3
> > > > [0.017545] GICv3: CPU3: found redistributor 3 region 
> > > > 0:0xf02e
> > > > [0.017551] CPU3: Booted secondary processor [411fd073]
> > > > [0.017576] Brought up 4 CPUs
> > > > [0.017587] SMP: Total of 4 processors activated.
> > > > ...
> > > > [   31.745809] INFO: rcu_sched detected stalls on CPUs/tasks:
> > > > [   31.751299]  1-...: (30 GPs behind) idle=c90/0/0 softirq=0/0 fqs=0 
> > > > [   31.757557]  2-...: (30 GPs behind) idle=608/0/0 softirq=0/0 fqs=0 
> > > > [   31.763814]  3-...: (30 GPs behind) idle=604/0/0 softirq=0/0 fqs=0 
> > > > [   31.770069]  (detected by 0, t=5252 jiffies, g=-270, c=-271, q=0)
> > > > [   31.776161] Task dump for CPU 1:
> > > > [   31.779381] swapper/1   R  running task0 0  1 
> > > > 0x0080
> > > > [   31.786446] Task dump for CPU 2:
> > > > [   31.789666] swapper/2   R  running task0 0  1 
> > > > 0x0080
> > > > [   31.796725] Task dump for CPU 3:
> > > > [   31.799945] swapper/3   R  running task0 0  1 
> > > > 0x0080
> > > > 
> > > > Is some of that platform-specific?
> > > That sounds like timer interrupts aren't being taken.
> > > 
> > > Given that the CPUs have come up, my suspicion would be that the GIC's
> > > been left in some odd state, that the kdump kernel hasn't managed to
> > > recover from.
> > > 
> > > Marc may have an idea.
> > I thought kdump was UP only? Anyway, this doesn't look too good.
> > 
> > It would be interesting to find out whether we're still taking
> > interrupts. Also, being able to reproduce this on mainline would be useful.
> > 
> > I wonder if we don't have a bug when booting on something other than
> > CPU#0, possibly on a GICv3 platform... I'll give it a go.
> Went ahead and tried a couple of kexecs with various CPUs disabled in
> order to force kexec not to boot on CPU#0, and the VM did boot just fine.
> 
> So I'd really appreciate a mainline reproducer.

I booted an up-to-date 4.11-rc2 kernel with the v33 patch set. I cannot
reproduce.

But then again, I can't reproduce it on 4.9 *either* any more. And that
is precisely the same kernel image I uploaded earlier. So it appears to
be sporadic, and just *happened* to hit me the first time I tried...
which is probably just as well or I'd never have tried that again :)

I'll keep trying.


smime.p7s
Description: S/MIME cryptographic signature
___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v33 00/14] add kdump support

2017-03-23 Thread David Woodhouse
On Fri, 2017-03-17 at 15:33 +, Mark Rutland wrote:
> 
> We can certainly log a better message, e.g.
> 
> bool kdump = (image == kexec_crash_image);
> bool stuck_cpus = cpus_are_stuck_in_kernel() ||
>   num_online_cpus() > 1;
> 
> BUG_ON(stuck_cpus && !kdump);
> WARN(stuck_cpus, "Unable to offline CPUs, kdump will be 
> unreliable.\n");

No, in this case the CPUs *were* offlined correctly, or at least "as
designed", by smp_send_crash_stop(). And if that hadn't worked, as
verified by *its* synchronisation method based on the atomic_t
waiting_for_crash_ipi, then *it* would have complained for itself:

if (atomic_read(&waiting_for_crash_ipi) > 0)
pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
   cpumask_pr_args(cpu_online_mask));

It's just that smp_send_crash_stop() (or more specifically
ipi_cpu_crash_stop()) doesn't touch the online cpu mask. Unlike the
ARM32 equivalent function machien_crash_nonpanic_core(), which does.

It wasn't clear if that was *intentional*, to allow the original
contents of the online mask before the crash to be seen in the
resulting vmcore... or purely an accident. 

FWIW if I trigger a crash on CPU 1 my kdump (still 4.9.8+v32) doesn't work.
I end up booting the kdump kernel on CPU#1 and then it gets distinctly 
unhappy...

[0.00] Booting Linux on physical CPU 0x1
...
[0.017125] Detected PIPT I-cache on CPU1
[0.017138] GICv3: CPU1: found redistributor 0 region 0:0xf028
[0.017147] CPU1: Booted secondary processor [411fd073]
[0.017339] Detected PIPT I-cache on CPU2
[0.017347] GICv3: CPU2: found redistributor 2 region 0:0xf02c
[0.017354] CPU2: Booted secondary processor [411fd073]
[0.017537] Detected PIPT I-cache on CPU3
[0.017545] GICv3: CPU3: found redistributor 3 region 0:0xf02e
[0.017551] CPU3: Booted secondary processor [411fd073]
[0.017576] Brought up 4 CPUs
[0.017587] SMP: Total of 4 processors activated.
...
[   31.745809] INFO: rcu_sched detected stalls on CPUs/tasks:
[   31.751299]  1-...: (30 GPs behind) idle=c90/0/0 softirq=0/0 fqs=0 
[   31.757557]  2-...: (30 GPs behind) idle=608/0/0 softirq=0/0 fqs=0 
[   31.763814]  3-...: (30 GPs behind) idle=604/0/0 softirq=0/0 fqs=0 
[   31.770069]  (detected by 0, t=5252 jiffies, g=-270, c=-271, q=0)
[   31.776161] Task dump for CPU 1:
[   31.779381] swapper/1   R  running task0 0  1 0x0080
[   31.786446] Task dump for CPU 2:
[   31.789666] swapper/2   R  running task0 0  1 0x0080
[   31.796725] Task dump for CPU 3:
[   31.799945] swapper/3   R  running task0 0  1 0x0080

Is some of that platform-specific?

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 701c085..41d238e 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -129,7 +129,7 @@ static struct sysrq_key_op sysrq_unraw_op = {
 #define sysrq_unraw_op (*(struct sysrq_key_op *)NULL)
 #endif /* CONFIG_VT */
 
-static void sysrq_handle_crash(int key)
+static void do_sysrq_handle_crash(int key)
 {
    char *killer = NULL;
 
@@ -143,6 +143,12 @@ static void sysrq_handle_crash(int key)
    wmb();
    *killer = 1;
 }
+
+static void sysrq_handle_crash(int key)
+{
+   smp_call_on_cpu(1, (void *)do_sysrq_handle_crash, 0, 1);
+}
+
 static struct sysrq_key_op sysrq_crash_op = {
    .handler= sysrq_handle_crash,
    .help_msg   = "crash(c)",


smime.p7s
Description: S/MIME cryptographic signature
___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v3 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-03-23 Thread Michael Holzheu
Am Thu, 23 Mar 2017 17:23:53 +0800
schrieb Xunlei Pang :

> On 03/23/2017 at 04:48 AM, Michael Holzheu wrote:
> > Am Wed, 22 Mar 2017 12:30:04 +0800
> > schrieb Dave Young :
> >
> >> On 03/21/17 at 10:18pm, Eric W. Biederman wrote:
> >>> Dave Young  writes:
> >>>
> > [snip]
> >
>  I think makedumpfile is using it, but I also vote to remove the
>  CRASHTIME. It is better not to do this while crashing and a makedumpfile
>  userspace patch is needed to drop the use of it.
> 
> > As we are looking at reliability concerns removing CRASHTIME should make
> > everything in vmcoreinfo a boot time constant.  Which should simplify
> > everything considerably.
>  It is a nice improvement..
> >>> We also need to take a close look at what s390 is doing with vmcoreinfo.
> >>> As apparently it is reading it in a different kind of crashdump process.
> >> Yes, need careful review from s390 and maybe ppc64 especially about
> >> patch 2/3, better to have comments from IBM about s390 dump tool and ppc
> >> fadump. Added more cc.
> > On s390 we have at least an issue with patch 1/3. For stand-alone dump
> > and also because we create the ELF header for kdump in the new
> > kernel we save the pointer to the vmcoreinfo note in the old kernel on a
> > defined memory address in our absolute zero lowcore.
> >
> > This is done in arch/s390/kernel/setup.c:
> >
> > static void __init setup_vmcoreinfo(void)
> > {
> > mem_assign_absolute(S390_lowcore.vmcore_info, 
> > paddr_vmcoreinfo_note());
> > }
> >
> > Since with patch 1/3 paddr_vmcoreinfo_note() returns NULL at this point in
> > time we have a problem here.
> >
> > To solve this - I think - we could move the initialization to
> > arch/s390/kernel/machine_kexec.c:
> >
> > void arch_crash_save_vmcoreinfo(void)
> > {
> > VMCOREINFO_SYMBOL(lowcore_ptr);
> > VMCOREINFO_SYMBOL(high_memory);
> > VMCOREINFO_LENGTH(lowcore_ptr, NR_CPUS);
> > mem_assign_absolute(S390_lowcore.vmcore_info, 
> > paddr_vmcoreinfo_note());
> > }
> >
> > Probably related to this is my observation that patch 3/3 leads to
> > an empty VMCOREINFO note for kdump on s390. The note is there ...
> >
> > # readelf -n /var/crash/127.0.0.1-2017-03-22-21:14:39/vmcore | grep VMCORE
> >   VMCOREINFO   0x068e   Unknown note type: (0x)
> >
> > But it contains only zeros.
> 
> Yes, this is a good catch, I will do more tests.

Hello Xunlei,

After spending some time on this, I now understood the problem:

In patch 3/3 you copy vmcoreinfo into the control page before
machine_kexec_prepare() is called. For s390 we give back all the
crashkernel memory to the hypervisor before the new crashkernel
is loaded:

/*
 * Give back memory to hypervisor before new kdump is loaded
 */
static int machine_kexec_prepare_kdump(void)
{
#ifdef CONFIG_CRASH_DUMP
if (MACHINE_IS_VM)
diag10_range(PFN_DOWN(crashk_res.start),
 PFN_DOWN(crashk_res.end - crashk_res.start + 1));
return 0;
#else
return -EINVAL;
#endif
}

So after machine_kexec_prepare_kdump() the contents of your control page
is gone and therefore the vmcorinfo ELF note contains only zeros.

If you call kimage_crash_copy_vmcoreinfo() after
machine_kexec_prepare_kdump() the problem should be solved for s390.

Regards
Michael


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v4 1/5] crash: move crashkernel parsing and vmcore related code under CONFIG_CRASH_CORE

2017-03-23 Thread Hari Bathini

Hi Michael,

It's been a while since this patchset is Ack'ed.
Should this go through powerpc-tree or some other?

Thanks
Hari

On Thursday 05 January 2017 10:59 PM, Hari Bathini wrote:

Traditionally, kdump is used to save vmcore in case of a crash. Some
architectures like powerpc can save vmcore using architecture specific
support instead of kexec/kdump mechanism. Such architecture specific
support also needs to reserve memory, to be used by dump capture kernel.
crashkernel parameter can be a reused, for memory reservation, by such
architecture specific infrastructure.

But currently, code related to vmcoreinfo and parsing of crashkernel
parameter is built under CONFIG_KEXEC_CORE. This patch introduces
CONFIG_CRASH_CORE and moves the above mentioned code under this config,
allowing code reuse without dependency on CONFIG_KEXEC. There is no
functional change with this patch.

Signed-off-by: Hari Bathini 
---

Changes from v3:
* Renamed log_buf_kexec_setup()to log_buf_vmcoreinfo_setup() instead of
   log_buf_crash_setup().

Changes from v2:
* Used CONFIG_CRASH_CORE instead of CONFIG_KEXEC_CORE at
   appropriate places in printk and ksysfs.


  arch/Kconfig   |4
  include/linux/crash_core.h |   65 ++
  include/linux/kexec.h  |   57 --
  include/linux/printk.h |4
  kernel/Makefile|1
  kernel/crash_core.c|  445 
  kernel/kexec_core.c|  404 
  kernel/ksysfs.c|8 +
  kernel/printk/printk.c |6 -
  9 files changed, 531 insertions(+), 463 deletions(-)
  create mode 100644 include/linux/crash_core.h
  create mode 100644 kernel/crash_core.c

diff --git a/arch/Kconfig b/arch/Kconfig
index 99839c2..82e6f99 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -2,7 +2,11 @@
  # General architecture dependent options
  #

+config CRASH_CORE
+   bool
+
  config KEXEC_CORE
+   select CRASH_CORE
bool

  config HAVE_IMA_KEXEC
diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
new file mode 100644
index 000..18d0f94
--- /dev/null
+++ b/include/linux/crash_core.h
@@ -0,0 +1,65 @@
+#ifndef LINUX_CRASH_CORE_H
+#define LINUX_CRASH_CORE_H
+
+#include 
+#include 
+#include 
+
+#define CRASH_CORE_NOTE_NAME  "CORE"
+#define CRASH_CORE_NOTE_HEAD_BYTES ALIGN(sizeof(struct elf_note), 4)
+#define CRASH_CORE_NOTE_NAME_BYTES ALIGN(sizeof(CRASH_CORE_NOTE_NAME), 4)
+#define CRASH_CORE_NOTE_DESC_BYTES ALIGN(sizeof(struct elf_prstatus), 4)
+
+#define CRASH_CORE_NOTE_BYTES ((CRASH_CORE_NOTE_HEAD_BYTES * 2) +  \
+CRASH_CORE_NOTE_NAME_BYTES +   \
+CRASH_CORE_NOTE_DESC_BYTES)
+
+#define VMCOREINFO_BYTES  (4096)
+#define VMCOREINFO_NOTE_NAME  "VMCOREINFO"
+#define VMCOREINFO_NOTE_NAME_BYTES ALIGN(sizeof(VMCOREINFO_NOTE_NAME), 4)
+#define VMCOREINFO_NOTE_SIZE  ((CRASH_CORE_NOTE_HEAD_BYTES * 2) +  \
+VMCOREINFO_NOTE_NAME_BYTES +   \
+VMCOREINFO_BYTES)
+
+typedef u32 note_buf_t[CRASH_CORE_NOTE_BYTES/4];
+
+void crash_save_vmcoreinfo(void);
+void arch_crash_save_vmcoreinfo(void);
+__printf(1, 2)
+void vmcoreinfo_append_str(const char *fmt, ...);
+phys_addr_t paddr_vmcoreinfo_note(void);
+
+#define VMCOREINFO_OSRELEASE(value) \
+   vmcoreinfo_append_str("OSRELEASE=%s\n", value)
+#define VMCOREINFO_PAGESIZE(value) \
+   vmcoreinfo_append_str("PAGESIZE=%ld\n", value)
+#define VMCOREINFO_SYMBOL(name) \
+   vmcoreinfo_append_str("SYMBOL(%s)=%lx\n", #name, (unsigned long)&name)
+#define VMCOREINFO_SIZE(name) \
+   vmcoreinfo_append_str("SIZE(%s)=%lu\n", #name, \
+ (unsigned long)sizeof(name))
+#define VMCOREINFO_STRUCT_SIZE(name) \
+   vmcoreinfo_append_str("SIZE(%s)=%lu\n", #name, \
+ (unsigned long)sizeof(struct name))
+#define VMCOREINFO_OFFSET(name, field) \
+   vmcoreinfo_append_str("OFFSET(%s.%s)=%lu\n", #name, #field, \
+ (unsigned long)offsetof(struct name, field))
+#define VMCOREINFO_LENGTH(name, value) \
+   vmcoreinfo_append_str("LENGTH(%s)=%lu\n", #name, (unsigned long)value)
+#define VMCOREINFO_NUMBER(name) \
+   vmcoreinfo_append_str("NUMBER(%s)=%ld\n", #name, (long)name)
+#define VMCOREINFO_CONFIG(name) \
+   vmcoreinfo_append_str("CONFIG_%s=y\n", #name)
+
+extern u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
+extern size_t vmcoreinfo_size;
+extern size_t vmcoreinfo_max_size;
+
+int __init parse_crashkernel(char *cmdline, unsigned long long system_ram,
+   unsigned long long *crash_size, unsigned long long *crash_base);
+int parse_crashkernel_high(char *cmdline, unsigned long long system_ram,
+   unsigned long long *crash_size, unsigned long long *crash_base);
+int parse_crashkernel_low(char *cmdline, unsigned long long system

Re: [PATCH] x86: Support large number of memory ranges

2017-03-23 Thread Ramsay, Frank
acking I tested this
Tested-by: Frank Ramsay 


From: Xunlei Pang 
Sent: Thursday, March 23, 2017 7:16:59 AM
To: kexec@lists.infradead.org
Cc: Dave Young; Xunlei Pang; Ramsay, Frank
Subject: [PATCH] x86: Support large number of memory ranges

We got a problem on one SGI 64TB machine, the current kexec-tools
failed to work due to the insufficient ranges(MAX_MEMORY_RANGES)
allowed which is defined as 1024(less than the ranges on the machine).
The kcore header is insufficient due to the same reason as well.

To solve this, this patch simply doubles "MAX_MEMORY_RANGES" and
"KCORE_ELF_HEADERS_SIZE".

Cc: Frank Ramsay 
Signed-off-by: Xunlei Pang 
---
 kexec/arch/i386/kexec-x86.h | 2 +-
 kexec/crashdump.h   | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kexec/arch/i386/kexec-x86.h b/kexec/arch/i386/kexec-x86.h
index 33df352..51855f8 100644
--- a/kexec/arch/i386/kexec-x86.h
+++ b/kexec/arch/i386/kexec-x86.h
@@ -1,7 +1,7 @@
 #ifndef KEXEC_X86_H
 #define KEXEC_X86_H

-#define MAX_MEMORY_RANGES 1024
+#define MAX_MEMORY_RANGES 2048

 enum coretype {
CORE_TYPE_UNDEF = 0,
diff --git a/kexec/crashdump.h b/kexec/crashdump.h
index 86e1ef2..18bd691 100644
--- a/kexec/crashdump.h
+++ b/kexec/crashdump.h
@@ -7,8 +7,8 @@ extern int get_xen_vmcoreinfo(uint64_t *addr, uint64_t *len);

 /* Need to find a better way to determine per cpu notes section size. */
 #define MAX_NOTE_BYTES 1024
-/* Expecting ELF headers to fit in 32K. Increase it if you need more. */
-#define KCORE_ELF_HEADERS_SIZE  32768
+/* Expecting ELF headers to fit in 64K. Increase it if you need more. */
+#define KCORE_ELF_HEADERS_SIZE  65536
 /* The address of the ELF header is passed to the secondary kernel
  * using the kernel command line option memmap=nnn.
  * The smallest unit the kernel accepts is in kilobytes,
--
1.8.3.1


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v33 05/14] arm64: mm: allow for unmapping part of kernel mapping

2017-03-23 Thread AKASHI Takahiro
On Tue, Mar 21, 2017 at 10:35:53AM +, James Morse wrote:
> Hi Akashi,
> 
> On 15/03/17 09:59, AKASHI Takahiro wrote:
> > create_pgd_mapping() is enhanced here so that it will accept
> > PAGE_KERNEL_INVALID protection attribute and unmap a given range of memory.
> > 
> > The feature will be used in a later kdump patch to implement the protection
> > against possible corruption of crash dump kernel memory which is to be set
> > aside from ther other memory on primary kernel.
> 
> Nit: ther- > the

Fix it.

> > Note that, in this implementation, it assumes that all the range of memory
> > to be processed is mapped in page-level since the only current user is
> > kdump where page mappings are also required.
> 
> Using create_pgd_mapping() like this means the mappings will be updated via 
> the
> fixmap which is unnecessary as the page tables will be part of mapped memory. 
> In

This might be a reason that we would go for (un)map_kernel_range()
over create_pgd_mapping() (? not sure)

> the worst case this adds an extra tlbi for every 2MB of crash image when we 
> map
> or unmap it. I don't think this matters.
> 
> This code used to be __init and it is the only user of FIX_PTE, so there won't
> be any existing runtime users. The two arch_kexec_unprotect_crashkres() calls 
> in
> kexec are both protected by the kexec_mutex, and the call in hibernate happens
> after disable_nonboot_cpus(), so these callers can't race with each other.
> 
> This looks safe to me.
> 
> 
> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index d28dbcf596b6..cb359a3927ef 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -128,7 +128,10 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long 
> > addr,
> > do {
> > pte_t old_pte = *pte;
> >  
> > -   set_pte(pte, pfn_pte(pfn, prot));
> > +   if (pgprot_val(prot))
> > +   set_pte(pte, pfn_pte(pfn, prot));
> > +   else
> > +   pte_clear(null, null, pte);
> 
> Lowercase NULLs? This relies on these values never being used... 
> __set_fixmap()
> in the same file passes &init_mm and the address, can we do the same to be
> consistent?

OK.

Thanks,
-Takahiro AKASHI

> 
> > pfn++;
> >  
> > /*
> 
> Reviewed-by: James Morse 
> 
> 
> Thanks,
> 
> James
> 
> 

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v33 07/14] arm64: hibernate: preserve kdump image around hibernation

2017-03-23 Thread AKASHI Takahiro
On Tue, Mar 21, 2017 at 06:25:46PM +, James Morse wrote:
> Hi Akashi,
> 
> On 15/03/17 09:59, AKASHI Takahiro wrote:
> > Since arch_kexec_protect_crashkres() removes a mapping for crash dump
> > kernel image, the loaded data won't be preserved around hibernation.
> > 
> > In this patch, helper functions, kexec_prepare_suspend()/
> > kexec_post_resume(), are additionally called before/after hibernation so
> > that the relevant memory segments will be mapped again and preserved just
> > as the others are.
> > 
> > In addition, to minimize the size of hibernation image,
> > kexec_is_chraskres_nosave() is added to pfn_is_nosave() in order to
> 
> (crashkres)

Yes.

> > recoginize only the pages that hold loaded crash dump kernel image as
> 
> (recognize)

Ah, yes ...

> > saveable. Hibernation excludes any pages that are marked as Reserved and
> > yet "nosave."
> 
> 
> Neat! I didn't think this would be possible without hacking 
> kernel/power/snapshot.c.
> 
> I've given this a spin on Juno and Seattle, I even added debug_pagealloc, but
> that doesn't trick it because your kexec_prepare_suspend() puts the mapping 
> back.
> 
> Reviewed-by: James Morse 
> 
> 
> Thanks,
> 
> James
> 
> 
> Some nits about comments:
> 
> > diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
> > index 97a7384100f3..1e10fafa59bd 100644
> > --- a/arch/arm64/kernel/hibernate.c
> > +++ b/arch/arm64/kernel/hibernate.c
> > @@ -286,6 +288,9 @@ int swsusp_arch_suspend(void)
> > local_dbg_save(flags);
> >  
> > if (__cpu_suspend_enter(&state)) {
> > +   /* make the crash dump kernel image visible/saveable */
> > +   kexec_prepare_suspend();
> 
> Strictly this is kdump not kexec, but the comment makes that clear.

I hesitated to use "kdump_" here as there are no functions with
such a prefix (say, even arch_kexec_protect_crashkres()).
So probably "crash_" or "crash_kexec_" would sound much better.

> > +
> > sleep_cpu = smp_processor_id();
> > ret = swsusp_save();
> > } else {
> 
> > diff --git a/arch/arm64/kernel/machine_kexec.c 
> > b/arch/arm64/kernel/machine_kexec.c
> > index 02e4f929db3b..82f48db589cf 100644
> > --- a/arch/arm64/kernel/machine_kexec.c
> > +++ b/arch/arm64/kernel/machine_kexec.c
> 
> > @@ -220,7 +221,6 @@ void arch_kexec_protect_crashkres(void)
> > kexec_crash_image->segment[i].memsz,
> > PAGE_KERNEL_INVALID, true);
> >  
> > -
> 
> Stray whitespace change from a previous patch?

Oops, fix it.

> > flush_tlb_all();
> >  }
> >  
> > @@ -233,3 +233,74 @@ void arch_kexec_unprotect_crashkres(void)
> 
> > +/*
> > + * kexec_is_crashres_nosave
> > + *
> > + * Return true only if a page is part of reserved memory for crash dump 
> > kernel,
> > + * but does not hold any data of loaded kernel image.
> > + *
> > + * Note that all the pages in crash dump kernel memory have been initially
> > + * marked as Reserved in kexec_reserve_crashkres_pages().
> > + *
> > + * In hibernation, the pages which are Reserved and yet "nosave"
> > + * are excluded from the hibernation iamge. kexec_is_crashkres_nosave()
> > + * does this check for crash dump kernel and will reduce the total size
> > + * of hibernation image.
> > + */
> > +
> > +bool kexec_is_crashkres_nosave(unsigned long pfn)
> > +{
> > +   int i;
> > +   phys_addr_t addr;
> > +
> > +   /* in reserved memory? */
> 
> Comment in the wrong place?

This is after my deep thinking :) but

> 
> > +   if (!crashk_res.end)
> > +   return false;
> > +
> > +   addr = __pfn_to_phys(pfn);
> 
> (makes more sense here)

if you think so, I will follow you.

> 
> > +   if ((addr < crashk_res.start) || (crashk_res.end < addr))
> > +   return false;
> > +
> 
> > +   /* not part of loaded kernel image? */
> 
> Comment in the wrong place?
> 
> 
> > +   if (!kexec_crash_image)
> > +   return true;
> > +
> 
> (makes more sense here)

ditto

> 
> > +   for (i = 0; i < kexec_crash_image->nr_segments; i++)
> > +   if (addr >= kexec_crash_image->segment[i].mem &&
> > +   addr < (kexec_crash_image->segment[i].mem +
> > +   kexec_crash_image->segment[i].memsz))
> > +   return false;
> > +
> > +   return true;
> > +}
> > +
> 
> 
> Thanks,
> 
> James


Thank you for your review!
-Takahiro AKASHI

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH] x86: Support large number of memory ranges

2017-03-23 Thread Xunlei Pang
We got a problem on one SGI 64TB machine, the current kexec-tools
failed to work due to the insufficient ranges(MAX_MEMORY_RANGES)
allowed which is defined as 1024(less than the ranges on the machine).
The kcore header is insufficient due to the same reason as well.

To solve this, this patch simply doubles "MAX_MEMORY_RANGES" and
"KCORE_ELF_HEADERS_SIZE".

Cc: Frank Ramsay 
Signed-off-by: Xunlei Pang 
---
 kexec/arch/i386/kexec-x86.h | 2 +-
 kexec/crashdump.h   | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kexec/arch/i386/kexec-x86.h b/kexec/arch/i386/kexec-x86.h
index 33df352..51855f8 100644
--- a/kexec/arch/i386/kexec-x86.h
+++ b/kexec/arch/i386/kexec-x86.h
@@ -1,7 +1,7 @@
 #ifndef KEXEC_X86_H
 #define KEXEC_X86_H
 
-#define MAX_MEMORY_RANGES 1024
+#define MAX_MEMORY_RANGES 2048
 
 enum coretype {
CORE_TYPE_UNDEF = 0,
diff --git a/kexec/crashdump.h b/kexec/crashdump.h
index 86e1ef2..18bd691 100644
--- a/kexec/crashdump.h
+++ b/kexec/crashdump.h
@@ -7,8 +7,8 @@ extern int get_xen_vmcoreinfo(uint64_t *addr, uint64_t *len);
 
 /* Need to find a better way to determine per cpu notes section size. */
 #define MAX_NOTE_BYTES 1024
-/* Expecting ELF headers to fit in 32K. Increase it if you need more. */
-#define KCORE_ELF_HEADERS_SIZE  32768
+/* Expecting ELF headers to fit in 64K. Increase it if you need more. */
+#define KCORE_ELF_HEADERS_SIZE  65536
 /* The address of the ELF header is passed to the secondary kernel
  * using the kernel command line option memmap=nnn.
  * The smallest unit the kernel accepts is in kilobytes,
-- 
1.8.3.1


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v33 05/14] arm64: mm: allow for unmapping part of kernel mapping

2017-03-23 Thread AKASHI Takahiro
Ard,

On Tue, Mar 21, 2017 at 11:16:34AM +, Ard Biesheuvel wrote:
> On 15 March 2017 at 09:59, AKASHI Takahiro  wrote:
> > create_pgd_mapping() is enhanced here so that it will accept
> > PAGE_KERNEL_INVALID protection attribute and unmap a given range of memory.
> >
> > The feature will be used in a later kdump patch to implement the protection
> > against possible corruption of crash dump kernel memory which is to be set
> > aside from ther other memory on primary kernel.
> >
> > Note that, in this implementation, it assumes that all the range of memory
> > to be processed is mapped in page-level since the only current user is
> > kdump where page mappings are also required.
> >
> > Signed-off-by: AKASHI Takahiro 
> 
> Couldn't we use unmap_kernel_range() for this?

I've almost forgotten this function, but my understanding is that
this function (and map counterpart) is mainly for "VM area" (< PAGE_OFFSET).
While it seems to (actually does) work instead of create_pgd_mapping() for now,
I'm not sure whether it is an expected usage (now and future).

So I think that it would be safe to keep using create_pgd_mapping().

Thanks,
-Takahiro AKASHI


> > ---
> >  arch/arm64/include/asm/pgtable-prot.h |  1 +
> >  arch/arm64/mm/mmu.c   | 17 +++--
> >  2 files changed, 12 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/pgtable-prot.h 
> > b/arch/arm64/include/asm/pgtable-prot.h
> > index 2142c7726e76..945d84cd5df7 100644
> > --- a/arch/arm64/include/asm/pgtable-prot.h
> > +++ b/arch/arm64/include/asm/pgtable-prot.h
> > @@ -54,6 +54,7 @@
> >  #define PAGE_KERNEL_ROX__pgprot(_PAGE_DEFAULT | PTE_UXN | 
> > PTE_DIRTY | PTE_RDONLY)
> >  #define PAGE_KERNEL_EXEC   __pgprot(_PAGE_DEFAULT | PTE_UXN | 
> > PTE_DIRTY | PTE_WRITE)
> >  #define PAGE_KERNEL_EXEC_CONT  __pgprot(_PAGE_DEFAULT | PTE_UXN | 
> > PTE_DIRTY | PTE_WRITE | PTE_CONT)
> > +#define PAGE_KERNEL_INVALID__pgprot(0)
> >
> >  #define PAGE_HYP   __pgprot(_PAGE_DEFAULT | PTE_HYP | 
> > PTE_HYP_XN)
> >  #define PAGE_HYP_EXEC  __pgprot(_PAGE_DEFAULT | PTE_HYP | 
> > PTE_RDONLY)
> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index d28dbcf596b6..cb359a3927ef 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -128,7 +128,10 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long 
> > addr,
> > do {
> > pte_t old_pte = *pte;
> >
> > -   set_pte(pte, pfn_pte(pfn, prot));
> > +   if (pgprot_val(prot))
> > +   set_pte(pte, pfn_pte(pfn, prot));
> > +   else
> > +   pte_clear(null, null, pte);
> > pfn++;
> >
> > /*
> > @@ -309,12 +312,14 @@ static void __init create_mapping_noalloc(phys_addr_t 
> > phys, unsigned long virt,
> > __create_pgd_mapping(init_mm.pgd, phys, virt, size, prot, NULL, 
> > false);
> >  }
> >
> > -void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
> > -  unsigned long virt, phys_addr_t size,
> > -  pgprot_t prot, bool page_mappings_only)
> > +/*
> > + * Note that PAGE_KERNEL_INVALID should be used with page_mappings_only
> > + * true for now.
> > + */
> > +void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
> > +   unsigned long virt, phys_addr_t size,
> > +   pgprot_t prot, bool page_mappings_only)
> >  {
> > -   BUG_ON(mm == &init_mm);
> > -
> > __create_pgd_mapping(mm->pgd, phys, virt, size, prot,
> >  pgd_pgtable_alloc, page_mappings_only);
> >  }
> > --
> > 2.11.1
> >
> >
> > ___
> > linux-arm-kernel mailing list
> > linux-arm-ker...@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v3 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-03-23 Thread Xunlei Pang
On 03/23/2017 at 04:48 AM, Michael Holzheu wrote:
> Am Wed, 22 Mar 2017 12:30:04 +0800
> schrieb Dave Young :
>
>> On 03/21/17 at 10:18pm, Eric W. Biederman wrote:
>>> Dave Young  writes:
>>>
> [snip]
>
 I think makedumpfile is using it, but I also vote to remove the
 CRASHTIME. It is better not to do this while crashing and a makedumpfile
 userspace patch is needed to drop the use of it.

> As we are looking at reliability concerns removing CRASHTIME should make
> everything in vmcoreinfo a boot time constant.  Which should simplify
> everything considerably.
 It is a nice improvement..
>>> We also need to take a close look at what s390 is doing with vmcoreinfo.
>>> As apparently it is reading it in a different kind of crashdump process.
>> Yes, need careful review from s390 and maybe ppc64 especially about
>> patch 2/3, better to have comments from IBM about s390 dump tool and ppc
>> fadump. Added more cc.
> On s390 we have at least an issue with patch 1/3. For stand-alone dump
> and also because we create the ELF header for kdump in the new
> kernel we save the pointer to the vmcoreinfo note in the old kernel on a
> defined memory address in our absolute zero lowcore.
>
> This is done in arch/s390/kernel/setup.c:
>
> static void __init setup_vmcoreinfo(void)
> {
> mem_assign_absolute(S390_lowcore.vmcore_info, 
> paddr_vmcoreinfo_note());
> }
>
> Since with patch 1/3 paddr_vmcoreinfo_note() returns NULL at this point in
> time we have a problem here.
>
> To solve this - I think - we could move the initialization to
> arch/s390/kernel/machine_kexec.c:
>
> void arch_crash_save_vmcoreinfo(void)
> {
> VMCOREINFO_SYMBOL(lowcore_ptr);
> VMCOREINFO_SYMBOL(high_memory);
> VMCOREINFO_LENGTH(lowcore_ptr, NR_CPUS);
> mem_assign_absolute(S390_lowcore.vmcore_info, 
> paddr_vmcoreinfo_note());
> }
>
> Probably related to this is my observation that patch 3/3 leads to
> an empty VMCOREINFO note for kdump on s390. The note is there ...
>
> # readelf -n /var/crash/127.0.0.1-2017-03-22-21:14:39/vmcore | grep VMCORE
>   VMCOREINFO   0x068e   Unknown note type: (0x)
>
> But it contains only zeros.

Yes, this is a good catch, I will do more tests.

Thanks,
Xunlei

>
> Unfortunately I have not yet understood the reason for this.
>
> Michael
>
>
> ___
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v33 00/14] add kdump support

2017-03-23 Thread David Woodhouse
On Tue, 2017-03-21 at 16:34 +0900, AKASHI Takahiro wrote:
> Yes, it is intentional. I removed 'offline' code in my v14 (2016/3/4).
> As you assumed, I'd expect 'online' status of all CPUs to be kept
> unchanged in the core dump.

I wonder if it would be better to take a *copy* of it and put it back
after we're done taking the CPUs down? As things stand, we now have
*three* different methods of taking down all the CPUs... and *none* of
them allow a platform to override it with an NMI-based or STONITH-based 
method, which seems like something of an oversight.

> If you can agree, I would like to modify this disputed warning code to:
> 
> + BUG_ON(!in_kexec_crash && (stuck_cpus || (num_online_cpus() > 1)));
> + WARN(in_kexec_crash && (stuck_cpus || smp_crash_stop_failed()),
> + "Some CPUs may be stale, kdump will be unreliable.\n");

That works; thanks.

FWIW I'm currently blaming my platform's firmware for my sporadic
crash-on-CPU#1 failures. If your testing includes crashes on non-boot
CPUs (perhaps using the sysrq hack I posted) and it reliably passes for
you, then let's ignore that for now.

smime.p7s
Description: S/MIME cryptographic signature
___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec