[PATCH] kexec: add resriction on the kexec_load
From: zhong jiangI hit the following question when run trinity in my system. The kernel is 3.4 version. but the mainline have same question to be solved. The root cause is the segment size is too large, it can expand the most of the area or the whole memory, therefore, it may waste an amount of time to abtain a useable page. and other cases will block until the test case quit. at the some time, OOM will come up. ck time:20160628120131-243c5 rlock reason:SOFT-WATCHDOG detected! on cpu 5. CPU 5 Pid: 9485, comm: trinity-c5 RIP: 0010:[] [] next_zones_zonelist+0x3f/0x60 RSP: 0018:88088783bc38 EFLAGS: 0283 RAX: 8808bffd9b08 RBX: 88088783bbb8 RCX: 88088783bd30 RDX: 88088f15a248 RSI: 0002 RDI: RBP: 88088783bc38 R08: 8808bffd8d80 R09: 000412c4d000 R10: 000412c4e000 R11: R12: 0002 R13: R14: 8808bffd9b00 R15: FS: 7f91137ee700() GS:88089f2a() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0016161a CR3: 00088782 CR4: 000407e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process trinity-c5 (pid: 9485, threadinfo 88088783a000, task 88088f159980) Stack: 88088783bd88 81106eac 8808bffd8d80 8124c2be 0001 001e 8124c2be 0002 8124c2be Call Trace: [] __alloc_pages_nodemask+0x14c/0x8f0 [] ? trace_hardirqs_on_thunk+0x3a/0x3c [] ? trace_hardirqs_on_thunk+0x3a/0x3c [] ? trace_hardirqs_on_thunk+0x3a/0x3c [] ? trace_hardirqs_on_thunk+0x3a/0x3c [] ? trace_hardirqs_on_thunk+0x3a/0x3c [] alloc_pages_current+0xaf/0x120 [] kimage_alloc_pages+0x10/0x60 [] kimage_alloc_control_pages+0x5d/0x270 [] machine_kexec_prepare+0xe5/0x6c0 [] ? kimage_free_page_list+0x52/0x70 [] sys_kexec_load+0x141/0x600 [] ? vfs_write+0x100/0x180 [] system_call_fastpath+0x16/0x1b The patch just add condition on sanity_check_segment_list to restriction the segment size. Signed-off-by: zhong jiang --- arch/x86/include/asm/kexec.h | 1 + kernel/kexec_core.c | 12 2 files changed, 13 insertions(+) diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h index d2434c1..b31a723 100644 --- a/arch/x86/include/asm/kexec.h +++ b/arch/x86/include/asm/kexec.h @@ -67,6 +67,7 @@ struct kimage; /* Memory to backup during crash kdump */ #define KEXEC_BACKUP_SRC_START (0UL) #define KEXEC_BACKUP_SRC_END (640 * 1024UL) /* 640K */ +#define KEXEC_MAX_SEGMENT_SIZE (5 * 1024 * 1024UL) /* 5M */ /* * CPU does not save ss and sp on stack if execution is already diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index 448127d..35c5159 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -209,6 +209,18 @@ int sanity_check_segment_list(struct kimage *image) return result; } + + /* Verity all segment size donnot exceed the specified size. +* if segment size from user space is too large, a large +* amount of time will be wasted when allocating page. so, +* softlockup may be come up. +*/ + for (i = 0; i< nr_segments; i++) { + if (image->segment[i].memsz > KEXEC_MAX_SEGMENT_SIZE) + return result; + } + + /* * Verify we have good destination addresses. Normally * the caller is responsible for making certain we don't -- 1.8.3.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [RFC 3/3] kexec: extend kexec_file_load system call
On 07/18/16 at 11:07am, Mark Rutland wrote: > On Mon, Jul 18, 2016 at 10:30:24AM +0800, Dave Young wrote: > > On 07/15/16 at 02:19pm, Mark Rutland wrote: > > > On Fri, Jul 15, 2016 at 09:09:55AM -0400, Vivek Goyal wrote: > > > > On Tue, Jul 12, 2016 at 10:42:01AM +0900, AKASHI Takahiro wrote: > > > > > > > > [..] > > > > > -SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd, > > > > > +SYSCALL_DEFINE6(kexec_file_load, int, kernel_fd, int, initrd_fd, > > > > > unsigned long, cmdline_len, const char __user *, > > > > > cmdline_ptr, > > > > > - unsigned long, flags) > > > > > + unsigned long, flags, const struct kexec_fdset __user > > > > > *, ufdset) > > > > > > > > Can one add more parameters to existing syscall. Can it break existing > > > > programs with new kernel? I was of the impression that one can't do > > > > that. > > > > But may be I am missing something. > > > > > > I think the idea was that we would only look at the new params if a new > > > flags was set, and otherwise it would behave as the old syscall. > > > > > > Regardless, I think it makes far more sense to add a kexec_file_load2 > > > syscall if we're going to modify the prototype at all. It's a rather > > > different proposition to the existing syscall, and needs to be treated > > > as such. > > > > I do not think it is worth to add another syscall for extra fds. > > We have open(2) as an example for different numbers of arguments > > already. > > Did we change the syscall interface for that? > > I was under the impression that there was always one underlying syscall, > and the C library did the right thing to pass the expected information > to the underlying syscall. I'm not sure kexec_load and kexec_file_load were included in glibc, we use syscall directly in kexec-tools. kexec_load man pages says there are no wrappers for both kexec_load and kexec_file_load in glibc. > > That's rather different to changing the underlying syscall. > > Regardless of how this is wrapped in userspace, I do not think modifying > the existing prototype is a good idea, and I think this kind of > extension needs to be a new syscall. Hmm, as I replied to Vivek, there is one case about the flags, previously the new flag will be regarded as invalid, but not we extend it it will be valid, this maybe the only potential bad case. Thanks Dave ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v22 2/8] arm64: limit memory regions based on DT property, usable-memory-range
Hi! (CC: Dennis Chen) On 12/07/16 06:05, AKASHI Takahiro wrote: > Crash dump kernel will be run with a limited range of memory as System > RAM. > > On arm64, we will use a device-tree property under /chosen, >linux,usable-memory-range = > in order for primary kernel either on uefi or non-uefi (device tree only) > system to hand over the information about usable memory region to crash > dump kernel. This property will supercede entries in uefi memory map table > and "memory" nodes in a device tree. > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c > index 51b1302..d8b296f 100644 > --- a/arch/arm64/mm/init.c > +++ b/arch/arm64/mm/init.c > @@ -300,10 +300,48 @@ static int __init early_mem(char *p) > } > early_param("mem", early_mem); > > +static int __init early_init_dt_scan_usablemem(unsigned long node, > + const char *uname, int depth, void *data) > +{ > + struct memblock_region *usablemem = (struct memblock_region *)data; > + const __be32 *reg; > + int len; > + > + usablemem->size = 0; > + > + if (depth != 1 || strcmp(uname, "chosen") != 0) > + return 0; > + > + reg = of_get_flat_dt_prop(node, "linux,usable-memory-range", ); > + if (!reg || (len < (dt_root_addr_cells + dt_root_size_cells))) > + return 1; > + > + usablemem->base = dt_mem_next_cell(dt_root_addr_cells, ); > + usablemem->size = dt_mem_next_cell(dt_root_size_cells, ); > + > + return 1; > +} > + > +static void __init fdt_enforce_memory_region(void) > +{ > + struct memblock_region reg; > + > + of_scan_flat_dt(early_init_dt_scan_usablemem, ); > + > + if (reg.size) { > + memblock_remove(0, PAGE_ALIGN(reg.base)); > + memblock_remove(round_down(reg.base + reg.size, PAGE_SIZE), > + ULLONG_MAX); I think this is a new way to trip the problem Dennis Chen has been working on [0]. If I kdump with --reuse-cmdline on a kernel booted with 'acpi=on', I get the panic below [1]... It looks like Dennis's fix involves changes in mm/memblock.c, maybe they can be extended to support a range instead of just a limit? (It looks like x86 explicitly adds the acpi regions to the crash-kernels memory map in crash_setup_memmap_entries()). Is it possible for the kernel text to be outside this range? (a bug in kexec-tools, or another user of the DT property) If we haven't already failed in this case, it may be worth printing a warning, or refusing to restrict-memory/expose-vmcore. Thanks, James [0] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-July/443356.html [1] [0.00] efi: Getting EFI parameters from FDT: [0.00] efi: EFI v2.50 by ARM Juno EFI Nov 24 2015 12:36:35 [0.00] efi: ACPI=0xf95b ACPI 2.0=0xf95b0014 PROP=0xfe8db4d8 [0.00] Reserving 1KB of memory at 0x9f000 for elfcorehdr [0.00] cma: Reserved 16 MiB at 0x0009fec0 [0.00] ACPI: Early table checksum verification disabled [0.00] ACPI: RSDP 0xF95B0014 24 (v02 ARMLTD) [0.00] ACPI: XSDT 0xF95A00E8 4C (v01 ARMLTD ARM-JUNO 2014072 7 0113) [0.00] ACPI: FACP 0xF950 00010C (v05 ARMLTD ARM-JUNO 2014072 7 ARM 0099) [0.00] ACPI: DSDT 0xF94C 000396 (v01 ARMLTD ARM-JUNO 2014072 7 INTL 20150619) [0.00] ACPI: GTDT 0xF94F 60 (v02 ARMLTD ARM-JUNO 2014072 7 ARM 0099) [0.00] ACPI: APIC 0xF94E 000224 (v03 ARMLTD ARM-JUNO 2014072 7 ARM 0099) [0.00] ACPI: SSDT 0xF94D 0001E3 (v01 ARMLTD ARM-JUNO 2014072 7 INTL 20150619) [0.00] ACPI: MCFG 0xF94B 3C (v01 ARMLTD ARM-JUNO 2014072 7 ARM 0099) ... [0.737577] Serial: AMBA PL011 UART driver [0.786086] HugeTLB registered 2 MB page size, pre-allocated 0 pages [0.794203] ACPI: Added _OSI(Module Device) [0.798659] ACPI: Added _OSI(Processor Device) [0.803190] ACPI: Added _OSI(3.0 _SCP Extensions) [0.807973] ACPI: Added _OSI(Processor Aggregator Device) [0.813653] Unable to handle kernel paging request at virtual address 000 00804e027 [0.821704] pgd = 08cce000 [0.825155] [0804e027] *pgd=0009d003, *pud=0009c003, *pmd=0009b003, *pte=00e8f94c0707 [0.836319] Internal error: Oops: 9621 [#1] PREEMPT SMP [0.841972] Modules linked in: [0.845073] CPU: 2 PID: 1 Comm: swapper/0 Tainted: G S 4.7.0-rc4 + #4569 [0.852927] Hardware name: ARM Juno development board (r1) (DT) [0.858936] task: 80003d898000 ti: 80003d894000 task.ti: 80003d89 4000 [0.866537] PC is at acpi_ns_lookup+0x23c/0x378 [0.871131] LR is at acpi_ds_load1_begin_op+0x88/0x260 [0.876340] pc : [] lr : [] pstate: 6 045 [0.883846] sp : 80003d8979b0 [0.887206] x29: 80003d8979b0 x28: [0.892596] x27: 001b x26:
Re: [RFC 0/3] extend kexec_file_load system call
On Mon, Jul 18, 2016 at 09:26:29AM -0400, Vivek Goyal wrote: > On Mon, Jul 18, 2016 at 10:46:04PM +1000, Balbir Singh wrote: > > On Wed, 2016-07-13 at 14:22 -0400, Vivek Goyal wrote: > > > On Wed, Jul 13, 2016 at 06:40:10PM +0100, Russell King - ARM Linux wrote: > > > > > > > > On Wed, Jul 13, 2016 at 09:03:38AM -0400, Vivek Goyal wrote: > > > > > > > > > > On Wed, Jul 13, 2016 at 09:26:39AM +0100, Russell King - ARM Linux > > > > > wrote: > > > > > > > > > > > > Indeed - maybe Eric knows better, but I can't see any situation > > > > > > where > > > > > > the dtb we load via kexec should ever affect "the bootloader", > > > > > > unless > > > > > > the "kernel" that's being loaded into kexec is "the bootloader". > > > > > > > > > > > > Now, going back to the more fundamental issue raised in my first > > > > > > reply, > > > > > > about the kernel command line. > > > > > > > > > > > > On x86, I can see that it _is_ possible for userspace to specify a > > > > > > command line, and the kernel loading the image provides the command > > > > > > line to the to-be-kexeced kernel with very little checking. So, if > > > > > > your kernel is signed, what stops the "insecure userspace" loading > > > > > > a signed kernel but giving it an insecure rootfs and/or console? > > > > > It is not kexec specific. I could do this for regular boot too, right? > > > > > > > > > > Command line options are not signed. I thought idea behind secureboot > > > > > was to execute only trusted code and command line options don't > > > > > enforce > > > > > you to execute unsigned code. > > > > > > > > > You can set module.sig_enforce=0 and open up the system a bit assuming > > that you can get a module to load with another attack > > IIUC, sig_enforce bool_enable_only so it can only be enabled. Default > value of it is 0 if CONFIG_MODULE_SIG_FORCE=n. > > IOW, if your kernel forced signature verification, you should not be > able to do sig_enforce=0. If you kernel did not have > CONFIG_MODULE_SIG_FORCE=y, then sig_enforce should be 0 by default anyway > and you are not making it worse using command line. [ CC Matthew Garrett ] I think on top of this there were patches by Matthew Garrett, which disallowed loading of unsigned modules if booted with secureboot on. I think those patches never made upstream though. Vivek > > > > > > > > So it sounds like different class of security problems which you are > > > > > referring to and not necessarily covered by secureboot or signed > > > > > kernel. > > > > Let me give you an example. > > > > > > > > You have a secure boot setup, where the firmware/ROM validates the boot > > > > loader. Good, the boot loader hasn't been tampered with. > > > > > > > > You interrupt the boot loader and are able to modify the command line > > > > for the booted kernel. > > > > > > > > The boot loader loads the kernel and verifies the kernel's signature. > > > > Good, the kernel hasn't been tampered with. The kernel starts running. > > > > > > > > You've plugged in a USB drive to the device, and specified a partition > > > > containing a root filesystem that you control to the kernel. The > > > > validated kernel finds the USB drive, and mounts it, and executes > > > > your own binaries on the USB drive. > > > You will require physical access to the machine to be able to > > > insert your usb drive. And IIRC, argument was that if attacker has > > > physical access to machine, all bets are off anyway. > > > > > > > You don't need physical access -- your machine controller BMC can > > do the magic for you. So its not always physical access, is it? > > Well, idea was that if you have physical access to machine, then all > bets are off. If BMC can do something which allows running unsigned > code at ring level 0, its a problem I think from secureboot model of > security. > > > > > > > > > > > > > > > You run a shell on the console. You now have control of the system, > > > > and can mount the real rootfs, inspect it, and work out what it does, > > > > etc. > > > > > > > > At this point, what use was all the validation that the secure boot > > > > has done? Absolutely useless. > > > > > > > > If you can change the command line arguments given to the kernel, you > > > > have no security, no matter how much you verify signatures. It's > > > > the illusion of security, nothing more, nothing less. > > > > > > > > I agree, if you can change command line arguments, all bets are of lesser > > value > > If changing command line allows execution of unsigned code at ring level > 0, then it is a problem. Otherwise we are talking of security issues which > are not covered by secureboot model. > > Vivek ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [RFC 0/3] extend kexec_file_load system call
On Mon, Jul 18, 2016 at 10:46:04PM +1000, Balbir Singh wrote: > On Wed, 2016-07-13 at 14:22 -0400, Vivek Goyal wrote: > > On Wed, Jul 13, 2016 at 06:40:10PM +0100, Russell King - ARM Linux wrote: > > > > > > On Wed, Jul 13, 2016 at 09:03:38AM -0400, Vivek Goyal wrote: > > > > > > > > On Wed, Jul 13, 2016 at 09:26:39AM +0100, Russell King - ARM Linux > > > > wrote: > > > > > > > > > > Indeed - maybe Eric knows better, but I can't see any situation where > > > > > the dtb we load via kexec should ever affect "the bootloader", unless > > > > > the "kernel" that's being loaded into kexec is "the bootloader". > > > > > > > > > > Now, going back to the more fundamental issue raised in my first > > > > > reply, > > > > > about the kernel command line. > > > > > > > > > > On x86, I can see that it _is_ possible for userspace to specify a > > > > > command line, and the kernel loading the image provides the command > > > > > line to the to-be-kexeced kernel with very little checking. So, if > > > > > your kernel is signed, what stops the "insecure userspace" loading > > > > > a signed kernel but giving it an insecure rootfs and/or console? > > > > It is not kexec specific. I could do this for regular boot too, right? > > > > > > > > Command line options are not signed. I thought idea behind secureboot > > > > was to execute only trusted code and command line options don't enforce > > > > you to execute unsigned code. > > > > > > You can set module.sig_enforce=0 and open up the system a bit assuming > that you can get a module to load with another attack IIUC, sig_enforce bool_enable_only so it can only be enabled. Default value of it is 0 if CONFIG_MODULE_SIG_FORCE=n. IOW, if your kernel forced signature verification, you should not be able to do sig_enforce=0. If you kernel did not have CONFIG_MODULE_SIG_FORCE=y, then sig_enforce should be 0 by default anyway and you are not making it worse using command line. > > > > > So it sounds like different class of security problems which you are > > > > referring to and not necessarily covered by secureboot or signed > > > > kernel. > > > Let me give you an example. > > > > > > You have a secure boot setup, where the firmware/ROM validates the boot > > > loader. Good, the boot loader hasn't been tampered with. > > > > > > You interrupt the boot loader and are able to modify the command line > > > for the booted kernel. > > > > > > The boot loader loads the kernel and verifies the kernel's signature. > > > Good, the kernel hasn't been tampered with. The kernel starts running. > > > > > > You've plugged in a USB drive to the device, and specified a partition > > > containing a root filesystem that you control to the kernel. The > > > validated kernel finds the USB drive, and mounts it, and executes > > > your own binaries on the USB drive. > > You will require physical access to the machine to be able to > > insert your usb drive. And IIRC, argument was that if attacker has > > physical access to machine, all bets are off anyway. > > > > You don't need physical access -- your machine controller BMC can > do the magic for you. So its not always physical access, is it? Well, idea was that if you have physical access to machine, then all bets are off. If BMC can do something which allows running unsigned code at ring level 0, its a problem I think from secureboot model of security. > > > > > > > > > > You run a shell on the console. You now have control of the system, > > > and can mount the real rootfs, inspect it, and work out what it does, > > > etc. > > > > > > At this point, what use was all the validation that the secure boot > > > has done? Absolutely useless. > > > > > > If you can change the command line arguments given to the kernel, you > > > have no security, no matter how much you verify signatures. It's > > > the illusion of security, nothing more, nothing less. > > > > > I agree, if you can change command line arguments, all bets are of lesser > value If changing command line allows execution of unsigned code at ring level 0, then it is a problem. Otherwise we are talking of security issues which are not covered by secureboot model. Vivek ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [RFC 0/3] extend kexec_file_load system call
On Wed, 2016-07-13 at 14:22 -0400, Vivek Goyal wrote: > On Wed, Jul 13, 2016 at 06:40:10PM +0100, Russell King - ARM Linux wrote: > > > > On Wed, Jul 13, 2016 at 09:03:38AM -0400, Vivek Goyal wrote: > > > > > > On Wed, Jul 13, 2016 at 09:26:39AM +0100, Russell King - ARM Linux wrote: > > > > > > > > Indeed - maybe Eric knows better, but I can't see any situation where > > > > the dtb we load via kexec should ever affect "the bootloader", unless > > > > the "kernel" that's being loaded into kexec is "the bootloader". > > > > > > > > Now, going back to the more fundamental issue raised in my first reply, > > > > about the kernel command line. > > > > > > > > On x86, I can see that it _is_ possible for userspace to specify a > > > > command line, and the kernel loading the image provides the command > > > > line to the to-be-kexeced kernel with very little checking. So, if > > > > your kernel is signed, what stops the "insecure userspace" loading > > > > a signed kernel but giving it an insecure rootfs and/or console? > > > It is not kexec specific. I could do this for regular boot too, right? > > > > > > Command line options are not signed. I thought idea behind secureboot > > > was to execute only trusted code and command line options don't enforce > > > you to execute unsigned code. > > > You can set module.sig_enforce=0 and open up the system a bit assuming that you can get a module to load with another attack > > > So it sounds like different class of security problems which you are > > > referring to and not necessarily covered by secureboot or signed > > > kernel. > > Let me give you an example. > > > > You have a secure boot setup, where the firmware/ROM validates the boot > > loader. Good, the boot loader hasn't been tampered with. > > > > You interrupt the boot loader and are able to modify the command line > > for the booted kernel. > > > > The boot loader loads the kernel and verifies the kernel's signature. > > Good, the kernel hasn't been tampered with. The kernel starts running. > > > > You've plugged in a USB drive to the device, and specified a partition > > containing a root filesystem that you control to the kernel. The > > validated kernel finds the USB drive, and mounts it, and executes > > your own binaries on the USB drive. > You will require physical access to the machine to be able to > insert your usb drive. And IIRC, argument was that if attacker has > physical access to machine, all bets are off anyway. > You don't need physical access -- your machine controller BMC can do the magic for you. So its not always physical access, is it? > > > > > > You run a shell on the console. You now have control of the system, > > and can mount the real rootfs, inspect it, and work out what it does, > > etc. > > > > At this point, what use was all the validation that the secure boot > > has done? Absolutely useless. > > > > If you can change the command line arguments given to the kernel, you > > have no security, no matter how much you verify signatures. It's > > the illusion of security, nothing more, nothing less. > > I agree, if you can change command line arguments, all bets are of lesser value Balbir Singh ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [RFC 3/3] kexec: extend kexec_file_load system call
On Mon, Jul 18, 2016 at 10:30:24AM +0800, Dave Young wrote: > On 07/15/16 at 02:19pm, Mark Rutland wrote: > > On Fri, Jul 15, 2016 at 09:09:55AM -0400, Vivek Goyal wrote: > > > On Tue, Jul 12, 2016 at 10:42:01AM +0900, AKASHI Takahiro wrote: > > > > > > [..] > > > > -SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd, > > > > +SYSCALL_DEFINE6(kexec_file_load, int, kernel_fd, int, initrd_fd, > > > > unsigned long, cmdline_len, const char __user *, > > > > cmdline_ptr, > > > > - unsigned long, flags) > > > > + unsigned long, flags, const struct kexec_fdset __user > > > > *, ufdset) > > > > > > Can one add more parameters to existing syscall. Can it break existing > > > programs with new kernel? I was of the impression that one can't do that. > > > But may be I am missing something. > > > > I think the idea was that we would only look at the new params if a new > > flags was set, and otherwise it would behave as the old syscall. > > > > Regardless, I think it makes far more sense to add a kexec_file_load2 > > syscall if we're going to modify the prototype at all. It's a rather > > different proposition to the existing syscall, and needs to be treated > > as such. > > I do not think it is worth to add another syscall for extra fds. > We have open(2) as an example for different numbers of arguments > already. Did we change the syscall interface for that? I was under the impression that there was always one underlying syscall, and the C library did the right thing to pass the expected information to the underlying syscall. That's rather different to changing the underlying syscall. Regardless of how this is wrapped in userspace, I do not think modifying the existing prototype is a good idea, and I think this kind of extension needs to be a new syscall. Thanks, Mark. ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [V3 PATCH 1/2] x86/panic: Replace smp_send_stop() with kdump friendly version
Hi, On 07/15/16 at 11:50am, 河合英宏 / KAWAI,HIDEHIRO wrote: > Hi Dave, > > Thanks for your reply. > > > From: 'Dave Young' [mailto:dyo...@redhat.com] > > Sent: Wednesday, July 13, 2016 11:04 AM > > > > On 07/12/16 at 02:49am, 河合英宏 / KAWAI,HIDEHIRO wrote: > > > Hi Dave, > > > > > > Thanks for the comments. > > > > > > > From: Dave Young [mailto:dyo...@redhat.com] > > > > Sent: Monday, July 11, 2016 5:35 PM > > > > > > > > On 07/05/16 at 08:33pm, Hidehiro Kawai wrote: > > > > > This patch fixes one of the problems reported by Daniel Walker > > > > > (https://lkml.org/lkml/2015/6/24/44). > > > > > > > > > > If crash_kexec_post_notifiers boot option is specified, other CPUs > > > > > are stopped by smp_send_stop() instead of machine_crash_shutdown() > > > > > in crash_kexec() path. This behavior change leads two problems. > > > > > > > > > > Problem 1: > > > > > octeon_generic_shutdown() for MIPS OCTEON assumes that other CPUs > > > > > are still online and try to stop their watchdog timer. If > > > > > smp_send_stop() is called before octeon_generic_shutdown(), > > > > > stopping watchdog timer will fail because other CPUs have been > > > > > offlined by smp_send_stop(). > > > > > > > > > >panic() > > > > > if crash_kexec_post_notifiers == 1 > > > > >smp_send_stop() > > > > >atomic_notifier_call_chain() > > > > >kmsg_dump() > > > > > crash_kexec() > > > > >machine_crash_shutdown() > > > > > octeon_generic_shutdown() // shutdown watchdog for ONLINE > > > > > CPUs > > > > > > > > > > Problem 2: > > > > > Most of architectures stop other CPUs in machine_crash_shutdown() > > > > > path, and they also do something needed for kdump. For example, > > > > > they save registers, disable virtualization extensions, and so on. > > > > > However, if smp_send_stop() stops other CPUs before > > > > > machine_crash_shutdown(), we miss those operations. > > > > > > > > > > How do we fix these problems? In the first place, we should stop > > > > > other CPUs as soon as possible when panic() was called, otherwise > > > > > other CPUs may wipe out a clue to the cause of the failure. So, > > > > > we replace smp_send_stop() with more suitable one for kdump. > > > > > > > > We have been avoiding extra things in panic path, but unfortunately > > > > crash_kexec_post_notifiers were added. I tend to agree the best > > > > place for this stuff is in 2nd kernel or purgatory instead of in 1st > > > > kernel. > > > > > > Several months ago, I posted a patch set which writes regs to SEL, > > > generate an event to send SNMP message, and start/stop BMC's watchdog > > > timer in purgatory. This feature requires BMC with KCS (Keyboard > > > Controller Style) I/F, but the most of enterprise grade server would have > > > it. > > > (http://thread.gmane.org/gmane.linux.kernel.kexec/15382) > > > > > > Doing kmsg_dump things in purgatory wouldn't be suitable (should be > > > done in the 2nd kernel before enabling devices and IRQs?) > > > > In theory it is doable maybe do something like oldmem_kmsg_dump while > > /proc/vmcore initializing? > > Hmm, I checked the case of using ACPI ERST as a persistent > storage. The feature is initialized by device_initcall(): > > device_initcall > erst_init > pstore_register > > And vmcore is initialized by fs_initcall() which is called > before device_initcall(). We may be able to change the sequence, > but anyway, these are done in later phase of the kernel initialization. > So, it may get less reliable although it would be doable. Agreed, it is just an idea, it may need more experiments if you need. > > > > > As for this patch I'm not sure it is safe to replace the > > > > smp_send_stop with the kdump friendly function. I'm also not sure if > > > > the kdump friendly function is safe for kdump. Will glad to hear > > > > opinions from other arch experts. > > > > > > This stuff depends on architectures, so I speak only about > > > x86 (the logic doesn't change on other architectures at this time). > > > > > > kdump path with crash_kexec_post_notifiers disabled: > > > panic() > > >__crash_kexec() > > > crash_setup_regs() > > > crash_save_vmcoreinfo() > > > machine_crash_shutdown() > > >native_machine_crash_shutdown() > > > panic_smp_send_stop() /* mostly same as original > > > * kdump_nmi_shootdown_cpus() > > > */ > > > > > > kdump path with crash_kexec_post_notifiers enabled: > > > panic() > > >panic_smp_send_stop() > > >__crash_kexec() > > > crash_setup_regs() > > > crash_save_vmcoreinfo() > > > machine_crash_shutdown() > > >native_machine_crash_shutdown() > > > panic_smp_send_stop() // do nothing > > > > > > The difference is that stopping other CPUs before crash_setup_regs() > > > and crash_save_vmcoreinfo() or not. Since crash_setup_regs() and > > > crash_save_vmcoreinfo() just