[PATCH] kexec: add resriction on the kexec_load

2016-07-18 Thread zhongjiang
From: zhong jiang 

I hit the following question when run trinity in my system. The
kernel is 3.4 version. but the mainline have same question to be
solved. The root cause is the segment size is too large, it can
expand the most of the area or the whole memory, therefore, it
may waste an amount of time to abtain a useable page. and other
cases will block until the test case quit. at the some time,
OOM will come up.

ck time:20160628120131-243c5
rlock reason:SOFT-WATCHDOG detected! on cpu 5.
CPU 5 Pid: 9485, comm: trinity-c5
RIP: 0010:[]  [] 
next_zones_zonelist+0x3f/0x60
RSP: 0018:88088783bc38  EFLAGS: 0283
RAX: 8808bffd9b08 RBX: 88088783bbb8 RCX: 88088783bd30
RDX: 88088f15a248 RSI: 0002 RDI: 
RBP: 88088783bc38 R08: 8808bffd8d80 R09: 000412c4d000
R10: 000412c4e000 R11:  R12: 0002
R13:  R14: 8808bffd9b00 R15: 
FS:  7f91137ee700() GS:88089f2a() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 0016161a CR3: 00088782 CR4: 000407e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process trinity-c5 (pid: 9485, threadinfo 88088783a000, task 
88088f159980)
Stack:
 88088783bd88 81106eac 8808bffd8d80 
  8124c2be 0001 001e
  8124c2be 0002 8124c2be
Call Trace:
 [] __alloc_pages_nodemask+0x14c/0x8f0
 [] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [] alloc_pages_current+0xaf/0x120
 [] kimage_alloc_pages+0x10/0x60
 [] kimage_alloc_control_pages+0x5d/0x270
 [] machine_kexec_prepare+0xe5/0x6c0
 [] ? kimage_free_page_list+0x52/0x70
 [] sys_kexec_load+0x141/0x600
 [] ? vfs_write+0x100/0x180
 [] system_call_fastpath+0x16/0x1b

The patch just add condition on sanity_check_segment_list to
restriction the segment size.

Signed-off-by: zhong jiang 
---
 arch/x86/include/asm/kexec.h |  1 +
 kernel/kexec_core.c  | 12 
 2 files changed, 13 insertions(+)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index d2434c1..b31a723 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -67,6 +67,7 @@ struct kimage;
 /* Memory to backup during crash kdump */
 #define KEXEC_BACKUP_SRC_START (0UL)
 #define KEXEC_BACKUP_SRC_END   (640 * 1024UL)  /* 640K */
+#define KEXEC_MAX_SEGMENT_SIZE (5 * 1024 * 1024UL) /* 5M */
 
 /*
  * CPU does not save ss and sp on stack if execution is already
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 448127d..35c5159 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -209,6 +209,18 @@ int sanity_check_segment_list(struct kimage *image)
return result;
}
 
+
+   /* Verity all segment size donnot exceed the specified size.
+* if segment size from user space is too large,  a large 
+* amount of time will be wasted when allocating page. so,
+* softlockup may be come up.
+*/
+   for (i = 0; i< nr_segments; i++) {
+   if (image->segment[i].memsz > KEXEC_MAX_SEGMENT_SIZE)
+   return result;
+   }
+
+
/*
 * Verify we have good destination addresses.  Normally
 * the caller is responsible for making certain we don't
-- 
1.8.3.1


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [RFC 3/3] kexec: extend kexec_file_load system call

2016-07-18 Thread Dave Young
On 07/18/16 at 11:07am, Mark Rutland wrote:
> On Mon, Jul 18, 2016 at 10:30:24AM +0800, Dave Young wrote:
> > On 07/15/16 at 02:19pm, Mark Rutland wrote:
> > > On Fri, Jul 15, 2016 at 09:09:55AM -0400, Vivek Goyal wrote:
> > > > On Tue, Jul 12, 2016 at 10:42:01AM +0900, AKASHI Takahiro wrote:
> > > > 
> > > > [..]
> > > > > -SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
> > > > > +SYSCALL_DEFINE6(kexec_file_load, int, kernel_fd, int, initrd_fd,
> > > > >   unsigned long, cmdline_len, const char __user *, 
> > > > > cmdline_ptr,
> > > > > - unsigned long, flags)
> > > > > + unsigned long, flags, const struct kexec_fdset __user 
> > > > > *, ufdset)
> > > > 
> > > > Can one add more parameters to existing syscall. Can it break existing
> > > > programs with new kernel? I was of the impression that one can't do 
> > > > that.
> > > > But may be I am missing something.
> > > 
> > > I think the idea was that we would only look at the new params if a new
> > > flags was set, and otherwise it would behave as the old syscall.
> > > 
> > > Regardless, I think it makes far more sense to add a kexec_file_load2
> > > syscall if we're going to modify the prototype at all. It's a rather
> > > different proposition to the existing syscall, and needs to be treated
> > > as such.
> > 
> > I do not think it is worth to add another syscall for extra fds.
> > We have open(2) as an example for different numbers of arguments
> > already.
> 
> Did we change the syscall interface for that?
> 
> I was under the impression that there was always one underlying syscall,
> and the C library did the right thing to pass the expected information
> to the underlying syscall.

I'm not sure kexec_load and kexec_file_load were included in glibc, we use
syscall directly in kexec-tools.

kexec_load man pages says there are no wrappers for both kexec_load and
kexec_file_load in glibc.

> 
> That's rather different to changing the underlying syscall.
> 
> Regardless of how this is wrapped in userspace, I do not think modifying
> the existing prototype is a good idea, and I think this kind of
> extension needs to be a new syscall.

Hmm, as I replied to Vivek, there is one case about the flags, previously
the new flag will be regarded as invalid, but not we extend it it will be
valid, this maybe the only potential bad case.

Thanks
Dave

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v22 2/8] arm64: limit memory regions based on DT property, usable-memory-range

2016-07-18 Thread James Morse
Hi!

(CC: Dennis Chen)

On 12/07/16 06:05, AKASHI Takahiro wrote:
> Crash dump kernel will be run with a limited range of memory as System
> RAM.
> 
> On arm64, we will use a device-tree property under /chosen,
>linux,usable-memory-range = 
> in order for primary kernel either on uefi or non-uefi (device tree only)
> system to hand over the information about usable memory region to crash
> dump kernel. This property will supercede entries in uefi memory map table
> and "memory" nodes in a device tree.

> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 51b1302..d8b296f 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -300,10 +300,48 @@ static int __init early_mem(char *p)
>  }
>  early_param("mem", early_mem);
>  
> +static int __init early_init_dt_scan_usablemem(unsigned long node,
> + const char *uname, int depth, void *data)
> +{
> + struct memblock_region *usablemem = (struct memblock_region *)data;
> + const __be32 *reg;
> + int len;
> +
> + usablemem->size = 0;
> +
> + if (depth != 1 || strcmp(uname, "chosen") != 0)
> + return 0;
> +
> + reg = of_get_flat_dt_prop(node, "linux,usable-memory-range", );
> + if (!reg || (len < (dt_root_addr_cells + dt_root_size_cells)))
> + return 1;
> +
> + usablemem->base = dt_mem_next_cell(dt_root_addr_cells, );
> + usablemem->size = dt_mem_next_cell(dt_root_size_cells, );
> +
> + return 1;
> +}
> +
> +static void __init fdt_enforce_memory_region(void)
> +{
> + struct memblock_region reg;
> +
> + of_scan_flat_dt(early_init_dt_scan_usablemem, );
> +
> + if (reg.size) {
> + memblock_remove(0, PAGE_ALIGN(reg.base));
> + memblock_remove(round_down(reg.base + reg.size, PAGE_SIZE),
> + ULLONG_MAX);

I think this is a new way to trip the problem Dennis Chen has been working on
[0]. If I kdump with --reuse-cmdline on a kernel booted with 'acpi=on', I get
the panic below [1]...

It looks like Dennis's fix involves changes in mm/memblock.c, maybe they can be
extended to support a range instead of just a limit?

(It looks like x86 explicitly adds the acpi regions to the crash-kernels memory
map in crash_setup_memmap_entries()).



Is it possible for the kernel text to be outside this range? (a bug in
kexec-tools, or another user of the DT property) If we haven't already failed in
this case, it may be worth printing a warning, or refusing to
restrict-memory/expose-vmcore.



Thanks,

James


[0] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-July/443356.html
[1]
[0.00] efi: Getting EFI parameters from FDT:
[0.00] efi: EFI v2.50 by ARM Juno EFI Nov 24 2015 12:36:35
[0.00] efi:  ACPI=0xf95b  ACPI 2.0=0xf95b0014  PROP=0xfe8db4d8
[0.00] Reserving 1KB of memory at 0x9f000 for elfcorehdr
[0.00] cma: Reserved 16 MiB at 0x0009fec0
[0.00] ACPI: Early table checksum verification disabled
[0.00] ACPI: RSDP 0xF95B0014 24 (v02 ARMLTD)
[0.00] ACPI: XSDT 0xF95A00E8 4C (v01 ARMLTD ARM-JUNO 2014072
7  0113)
[0.00] ACPI: FACP 0xF950 00010C (v05 ARMLTD ARM-JUNO 2014072
7 ARM  0099)
[0.00] ACPI: DSDT 0xF94C 000396 (v01 ARMLTD ARM-JUNO 2014072
7 INTL 20150619)
[0.00] ACPI: GTDT 0xF94F 60 (v02 ARMLTD ARM-JUNO 2014072
7 ARM  0099)
[0.00] ACPI: APIC 0xF94E 000224 (v03 ARMLTD ARM-JUNO 2014072
7 ARM  0099)
[0.00] ACPI: SSDT 0xF94D 0001E3 (v01 ARMLTD ARM-JUNO 2014072
7 INTL 20150619)
[0.00] ACPI: MCFG 0xF94B 3C (v01 ARMLTD ARM-JUNO 2014072
7 ARM  0099)
...
[0.737577] Serial: AMBA PL011 UART driver
[0.786086] HugeTLB registered 2 MB page size, pre-allocated 0 pages
[0.794203] ACPI: Added _OSI(Module Device)
[0.798659] ACPI: Added _OSI(Processor Device)
[0.803190] ACPI: Added _OSI(3.0 _SCP Extensions)
[0.807973] ACPI: Added _OSI(Processor Aggregator Device)
[0.813653] Unable to handle kernel paging request at virtual address 000
00804e027
[0.821704] pgd = 08cce000
[0.825155] [0804e027] *pgd=0009d003, *pud=0009c003,
*pmd=0009b003, *pte=00e8f94c0707
[0.836319] Internal error: Oops: 9621 [#1] PREEMPT SMP
[0.841972] Modules linked in:
[0.845073] CPU: 2 PID: 1 Comm: swapper/0 Tainted: G S  4.7.0-rc4
+ #4569
[0.852927] Hardware name: ARM Juno development board (r1) (DT)
[0.858936] task: 80003d898000 ti: 80003d894000 task.ti: 80003d89
4000
[0.866537] PC is at acpi_ns_lookup+0x23c/0x378
[0.871131] LR is at acpi_ds_load1_begin_op+0x88/0x260
[0.876340] pc : [] lr : [] pstate: 6
045
[0.883846] sp : 80003d8979b0
[0.887206] x29: 80003d8979b0 x28: 
[0.892596] x27: 001b x26: 

Re: [RFC 0/3] extend kexec_file_load system call

2016-07-18 Thread Vivek Goyal
On Mon, Jul 18, 2016 at 09:26:29AM -0400, Vivek Goyal wrote:
> On Mon, Jul 18, 2016 at 10:46:04PM +1000, Balbir Singh wrote:
> > On Wed, 2016-07-13 at 14:22 -0400, Vivek Goyal wrote:
> > > On Wed, Jul 13, 2016 at 06:40:10PM +0100, Russell King - ARM Linux wrote:
> > > > 
> > > > On Wed, Jul 13, 2016 at 09:03:38AM -0400, Vivek Goyal wrote:
> > > > > 
> > > > > On Wed, Jul 13, 2016 at 09:26:39AM +0100, Russell King - ARM Linux 
> > > > > wrote:
> > > > > > 
> > > > > > Indeed - maybe Eric knows better, but I can't see any situation 
> > > > > > where
> > > > > > the dtb we load via kexec should ever affect "the bootloader", 
> > > > > > unless
> > > > > > the "kernel" that's being loaded into kexec is "the bootloader".
> > > > > > 
> > > > > > Now, going back to the more fundamental issue raised in my first 
> > > > > > reply,
> > > > > > about the kernel command line.
> > > > > > 
> > > > > > On x86, I can see that it _is_ possible for userspace to specify a
> > > > > > command line, and the kernel loading the image provides the command
> > > > > > line to the to-be-kexeced kernel with very little checking.  So, if
> > > > > > your kernel is signed, what stops the "insecure userspace" loading
> > > > > > a signed kernel but giving it an insecure rootfs and/or console?
> > > > > It is not kexec specific. I could do this for regular boot too, right?
> > > > > 
> > > > > Command line options are not signed. I thought idea behind secureboot
> > > > > was to execute only trusted code and command line options don't 
> > > > > enforce
> > > > > you to execute unsigned code.
> > > > > 
> > 
> > You can set module.sig_enforce=0 and open up the system a bit assuming
> > that you can get a module to load with another attack
> 
> IIUC, sig_enforce bool_enable_only so it can only be enabled. Default
> value of it is 0 if CONFIG_MODULE_SIG_FORCE=n.
> 
> IOW, if your kernel forced signature verification, you should not be
> able to do sig_enforce=0. If you kernel did not have
> CONFIG_MODULE_SIG_FORCE=y, then sig_enforce should be 0 by default anyway
> and you are not making it worse using command line.

[ CC Matthew Garrett ]

I think on top of this there were patches by Matthew Garrett, which
disallowed loading of unsigned modules if booted with secureboot on. I
think those patches never made upstream though.

Vivek

> 
> > 
> > > > > So it sounds like different class of security problems which you are
> > > > > referring to and not necessarily covered by secureboot or signed
> > > > > kernel.
> > > > Let me give you an example.
> > > > 
> > > > You have a secure boot setup, where the firmware/ROM validates the boot
> > > > loader.  Good, the boot loader hasn't been tampered with.
> > > > 
> > > > You interrupt the boot loader and are able to modify the command line
> > > > for the booted kernel.
> > > > 
> > > > The boot loader loads the kernel and verifies the kernel's signature.
> > > > Good, the kernel hasn't been tampered with.  The kernel starts running.
> > > > 
> > > > You've plugged in a USB drive to the device, and specified a partition
> > > > containing a root filesystem that you control to the kernel.  The
> > > > validated kernel finds the USB drive, and mounts it, and executes
> > > > your own binaries on the USB drive.
> > > You will require physical access to the machine to be able to
> > > insert your usb drive. And IIRC, argument was that if attacker has
> > > physical access to machine, all bets are off anyway.
> > >
> > 
> > You don't need physical access -- your machine controller BMC can
> > do the magic for you. So its not always physical access, is it?
> 
> Well, idea was that if you have physical access to machine, then all
> bets are off. If BMC can do something which allows running unsigned
> code at ring level 0, its a problem I think from secureboot model of
> security.
> 
> >  
> > > > 
> > > > 
> > > > You run a shell on the console.  You now have control of the system,
> > > > and can mount the real rootfs, inspect it, and work out what it does,
> > > > etc.
> > > > 
> > > > At this point, what use was all the validation that the secure boot
> > > > has done?  Absolutely useless.
> > > > 
> > > > If you can change the command line arguments given to the kernel, you
> > > > have no security, no matter how much you verify signatures.  It's
> > > > the illusion of security, nothing more, nothing less.
> > > > 
> > 
> > I agree, if you can change command line arguments, all bets are of lesser 
> > value
> 
> If changing command line allows execution of unsigned code at ring level
> 0, then it is a problem. Otherwise we are talking of security issues which
> are not covered by secureboot model.
> 
> Vivek

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [RFC 0/3] extend kexec_file_load system call

2016-07-18 Thread Vivek Goyal
On Mon, Jul 18, 2016 at 10:46:04PM +1000, Balbir Singh wrote:
> On Wed, 2016-07-13 at 14:22 -0400, Vivek Goyal wrote:
> > On Wed, Jul 13, 2016 at 06:40:10PM +0100, Russell King - ARM Linux wrote:
> > > 
> > > On Wed, Jul 13, 2016 at 09:03:38AM -0400, Vivek Goyal wrote:
> > > > 
> > > > On Wed, Jul 13, 2016 at 09:26:39AM +0100, Russell King - ARM Linux 
> > > > wrote:
> > > > > 
> > > > > Indeed - maybe Eric knows better, but I can't see any situation where
> > > > > the dtb we load via kexec should ever affect "the bootloader", unless
> > > > > the "kernel" that's being loaded into kexec is "the bootloader".
> > > > > 
> > > > > Now, going back to the more fundamental issue raised in my first 
> > > > > reply,
> > > > > about the kernel command line.
> > > > > 
> > > > > On x86, I can see that it _is_ possible for userspace to specify a
> > > > > command line, and the kernel loading the image provides the command
> > > > > line to the to-be-kexeced kernel with very little checking.  So, if
> > > > > your kernel is signed, what stops the "insecure userspace" loading
> > > > > a signed kernel but giving it an insecure rootfs and/or console?
> > > > It is not kexec specific. I could do this for regular boot too, right?
> > > > 
> > > > Command line options are not signed. I thought idea behind secureboot
> > > > was to execute only trusted code and command line options don't enforce
> > > > you to execute unsigned code.
> > > > 
> 
> You can set module.sig_enforce=0 and open up the system a bit assuming
> that you can get a module to load with another attack

IIUC, sig_enforce bool_enable_only so it can only be enabled. Default
value of it is 0 if CONFIG_MODULE_SIG_FORCE=n.

IOW, if your kernel forced signature verification, you should not be
able to do sig_enforce=0. If you kernel did not have
CONFIG_MODULE_SIG_FORCE=y, then sig_enforce should be 0 by default anyway
and you are not making it worse using command line.

> 
> > > > So it sounds like different class of security problems which you are
> > > > referring to and not necessarily covered by secureboot or signed
> > > > kernel.
> > > Let me give you an example.
> > > 
> > > You have a secure boot setup, where the firmware/ROM validates the boot
> > > loader.  Good, the boot loader hasn't been tampered with.
> > > 
> > > You interrupt the boot loader and are able to modify the command line
> > > for the booted kernel.
> > > 
> > > The boot loader loads the kernel and verifies the kernel's signature.
> > > Good, the kernel hasn't been tampered with.  The kernel starts running.
> > > 
> > > You've plugged in a USB drive to the device, and specified a partition
> > > containing a root filesystem that you control to the kernel.  The
> > > validated kernel finds the USB drive, and mounts it, and executes
> > > your own binaries on the USB drive.
> > You will require physical access to the machine to be able to
> > insert your usb drive. And IIRC, argument was that if attacker has
> > physical access to machine, all bets are off anyway.
> >
> 
> You don't need physical access -- your machine controller BMC can
> do the magic for you. So its not always physical access, is it?

Well, idea was that if you have physical access to machine, then all
bets are off. If BMC can do something which allows running unsigned
code at ring level 0, its a problem I think from secureboot model of
security.

>  
> > > 
> > > 
> > > You run a shell on the console.  You now have control of the system,
> > > and can mount the real rootfs, inspect it, and work out what it does,
> > > etc.
> > > 
> > > At this point, what use was all the validation that the secure boot
> > > has done?  Absolutely useless.
> > > 
> > > If you can change the command line arguments given to the kernel, you
> > > have no security, no matter how much you verify signatures.  It's
> > > the illusion of security, nothing more, nothing less.
> > > 
> 
> I agree, if you can change command line arguments, all bets are of lesser 
> value

If changing command line allows execution of unsigned code at ring level
0, then it is a problem. Otherwise we are talking of security issues which
are not covered by secureboot model.

Vivek

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [RFC 0/3] extend kexec_file_load system call

2016-07-18 Thread Balbir Singh
On Wed, 2016-07-13 at 14:22 -0400, Vivek Goyal wrote:
> On Wed, Jul 13, 2016 at 06:40:10PM +0100, Russell King - ARM Linux wrote:
> > 
> > On Wed, Jul 13, 2016 at 09:03:38AM -0400, Vivek Goyal wrote:
> > > 
> > > On Wed, Jul 13, 2016 at 09:26:39AM +0100, Russell King - ARM Linux wrote:
> > > > 
> > > > Indeed - maybe Eric knows better, but I can't see any situation where
> > > > the dtb we load via kexec should ever affect "the bootloader", unless
> > > > the "kernel" that's being loaded into kexec is "the bootloader".
> > > > 
> > > > Now, going back to the more fundamental issue raised in my first reply,
> > > > about the kernel command line.
> > > > 
> > > > On x86, I can see that it _is_ possible for userspace to specify a
> > > > command line, and the kernel loading the image provides the command
> > > > line to the to-be-kexeced kernel with very little checking.  So, if
> > > > your kernel is signed, what stops the "insecure userspace" loading
> > > > a signed kernel but giving it an insecure rootfs and/or console?
> > > It is not kexec specific. I could do this for regular boot too, right?
> > > 
> > > Command line options are not signed. I thought idea behind secureboot
> > > was to execute only trusted code and command line options don't enforce
> > > you to execute unsigned code.
> > > 

You can set module.sig_enforce=0 and open up the system a bit assuming
that you can get a module to load with another attack

> > > So it sounds like different class of security problems which you are
> > > referring to and not necessarily covered by secureboot or signed
> > > kernel.
> > Let me give you an example.
> > 
> > You have a secure boot setup, where the firmware/ROM validates the boot
> > loader.  Good, the boot loader hasn't been tampered with.
> > 
> > You interrupt the boot loader and are able to modify the command line
> > for the booted kernel.
> > 
> > The boot loader loads the kernel and verifies the kernel's signature.
> > Good, the kernel hasn't been tampered with.  The kernel starts running.
> > 
> > You've plugged in a USB drive to the device, and specified a partition
> > containing a root filesystem that you control to the kernel.  The
> > validated kernel finds the USB drive, and mounts it, and executes
> > your own binaries on the USB drive.
> You will require physical access to the machine to be able to
> insert your usb drive. And IIRC, argument was that if attacker has
> physical access to machine, all bets are off anyway.
>

You don't need physical access -- your machine controller BMC can
do the magic for you. So its not always physical access, is it?
 
> > 
> > 
> > You run a shell on the console.  You now have control of the system,
> > and can mount the real rootfs, inspect it, and work out what it does,
> > etc.
> > 
> > At this point, what use was all the validation that the secure boot
> > has done?  Absolutely useless.
> > 
> > If you can change the command line arguments given to the kernel, you
> > have no security, no matter how much you verify signatures.  It's
> > the illusion of security, nothing more, nothing less.
> > 

I agree, if you can change command line arguments, all bets are of lesser value

Balbir Singh

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [RFC 3/3] kexec: extend kexec_file_load system call

2016-07-18 Thread Mark Rutland
On Mon, Jul 18, 2016 at 10:30:24AM +0800, Dave Young wrote:
> On 07/15/16 at 02:19pm, Mark Rutland wrote:
> > On Fri, Jul 15, 2016 at 09:09:55AM -0400, Vivek Goyal wrote:
> > > On Tue, Jul 12, 2016 at 10:42:01AM +0900, AKASHI Takahiro wrote:
> > > 
> > > [..]
> > > > -SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
> > > > +SYSCALL_DEFINE6(kexec_file_load, int, kernel_fd, int, initrd_fd,
> > > > unsigned long, cmdline_len, const char __user *, 
> > > > cmdline_ptr,
> > > > -   unsigned long, flags)
> > > > +   unsigned long, flags, const struct kexec_fdset __user 
> > > > *, ufdset)
> > > 
> > > Can one add more parameters to existing syscall. Can it break existing
> > > programs with new kernel? I was of the impression that one can't do that.
> > > But may be I am missing something.
> > 
> > I think the idea was that we would only look at the new params if a new
> > flags was set, and otherwise it would behave as the old syscall.
> > 
> > Regardless, I think it makes far more sense to add a kexec_file_load2
> > syscall if we're going to modify the prototype at all. It's a rather
> > different proposition to the existing syscall, and needs to be treated
> > as such.
> 
> I do not think it is worth to add another syscall for extra fds.
> We have open(2) as an example for different numbers of arguments
> already.

Did we change the syscall interface for that?

I was under the impression that there was always one underlying syscall,
and the C library did the right thing to pass the expected information
to the underlying syscall.

That's rather different to changing the underlying syscall.

Regardless of how this is wrapped in userspace, I do not think modifying
the existing prototype is a good idea, and I think this kind of
extension needs to be a new syscall.

Thanks,
Mark.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [V3 PATCH 1/2] x86/panic: Replace smp_send_stop() with kdump friendly version

2016-07-18 Thread 'Dave Young'
Hi,

On 07/15/16 at 11:50am, 河合英宏 / KAWAI,HIDEHIRO wrote:
> Hi Dave,
> 
> Thanks for your reply.
> 
> > From: 'Dave Young' [mailto:dyo...@redhat.com]
> > Sent: Wednesday, July 13, 2016 11:04 AM
> > 
> > On 07/12/16 at 02:49am, 河合英宏 / KAWAI,HIDEHIRO wrote:
> > > Hi Dave,
> > >
> > > Thanks for the comments.
> > >
> > > > From: Dave Young [mailto:dyo...@redhat.com]
> > > > Sent: Monday, July 11, 2016 5:35 PM
> > > >
> > > > On 07/05/16 at 08:33pm, Hidehiro Kawai wrote:
> > > > > This patch fixes one of the problems reported by Daniel Walker
> > > > > (https://lkml.org/lkml/2015/6/24/44).
> > > > >
> > > > > If crash_kexec_post_notifiers boot option is specified, other CPUs
> > > > > are stopped by smp_send_stop() instead of machine_crash_shutdown()
> > > > > in crash_kexec() path.  This behavior change leads two problems.
> > > > >
> > > > >  Problem 1:
> > > > >  octeon_generic_shutdown() for MIPS OCTEON assumes that other CPUs
> > > > > are  still online and try to stop their watchdog timer.  If
> > > > >  smp_send_stop() is called before octeon_generic_shutdown(),
> > > > > stopping  watchdog timer will fail because other CPUs have been
> > > > > offlined by  smp_send_stop().
> > > > >
> > > > >panic()
> > > > >  if crash_kexec_post_notifiers == 1
> > > > >smp_send_stop()
> > > > >atomic_notifier_call_chain()
> > > > >kmsg_dump()
> > > > >  crash_kexec()
> > > > >machine_crash_shutdown()
> > > > >  octeon_generic_shutdown() // shutdown watchdog for ONLINE
> > > > > CPUs
> > > > >
> > > > >  Problem 2:
> > > > >  Most of architectures stop other CPUs in machine_crash_shutdown()
> > > > > path, and they also do something needed for kdump.  For example,
> > > > > they save registers, disable virtualization extensions, and so on.
> > > > >  However, if smp_send_stop() stops other CPUs before
> > > > > machine_crash_shutdown(), we miss those operations.
> > > > >
> > > > > How do we fix these problems?  In the first place, we should stop
> > > > > other CPUs as soon as possible when panic() was called, otherwise
> > > > > other CPUs may wipe out a clue to the cause of the failure.  So,
> > > > > we replace smp_send_stop() with more suitable one for kdump.
> > > >
> > > > We have been avoiding extra things in panic path, but unfortunately
> > > > crash_kexec_post_notifiers were added. I tend to agree the best
> > > > place for this stuff is in 2nd kernel or purgatory instead of in 1st 
> > > > kernel.
> > >
> > > Several months ago, I posted a patch set which writes regs to SEL,
> > > generate an event to send SNMP message, and start/stop BMC's watchdog
> > > timer in purgatory.  This feature requires BMC with KCS (Keyboard
> > > Controller Style) I/F, but the most of enterprise grade server would have 
> > > it.
> > > (http://thread.gmane.org/gmane.linux.kernel.kexec/15382)
> > >
> > > Doing kmsg_dump things in purgatory wouldn't be suitable (should be
> > > done in the 2nd kernel before enabling devices and IRQs?)
> > 
> > In theory it is doable maybe do something like oldmem_kmsg_dump while 
> > /proc/vmcore initializing?
> 
> Hmm, I checked the case of using ACPI ERST as a persistent
> storage. The feature is initialized by device_initcall():
> 
> device_initcall
>   erst_init
> pstore_register
> 
> And vmcore is initialized by fs_initcall() which is called
> before device_initcall().  We may be able to change the sequence,
> but anyway, these are done in later phase of the kernel initialization.
> So, it may get less reliable although it would be doable.

Agreed, it is just an idea, it may need more experiments if you need.

> 
> > > > As for this patch I'm not sure it is safe to replace the
> > > > smp_send_stop with the kdump friendly function. I'm also not sure if
> > > > the kdump friendly function is safe for kdump. Will glad to hear
> > > > opinions from other arch experts.
> > >
> > > This stuff depends on architectures, so I speak only about
> > > x86 (the logic doesn't change on other architectures at this time).
> > >
> > > kdump path with crash_kexec_post_notifiers disabled:
> > >  panic()
> > >__crash_kexec()
> > >  crash_setup_regs()
> > >  crash_save_vmcoreinfo()
> > >  machine_crash_shutdown()
> > >native_machine_crash_shutdown()
> > >  panic_smp_send_stop() /* mostly same as original
> > > * kdump_nmi_shootdown_cpus()
> > > */
> > >
> > > kdump path with crash_kexec_post_notifiers enabled:
> > >  panic()
> > >panic_smp_send_stop()
> > >__crash_kexec()
> > >  crash_setup_regs()
> > >  crash_save_vmcoreinfo()
> > >  machine_crash_shutdown()
> > >native_machine_crash_shutdown()
> > >  panic_smp_send_stop() // do nothing
> > >
> > > The difference is that stopping other CPUs before crash_setup_regs()
> > > and crash_save_vmcoreinfo() or not.  Since crash_setup_regs() and
> > > crash_save_vmcoreinfo() just