Re: [PATCH] kexec: add restriction on the kexec_load

2016-07-21 Thread zhong jiang
On 2016/7/21 16:10, Dave Young wrote:
> On 07/19/16 at 09:07pm, Eric W. Biederman wrote:
>> zhongjiang  writes:
>>
>>> From: zhong jiang 
>>>
>>> I hit the following issue when running trinity on my system. The
>>> kernel is version 3.4, but mainline has the same issue. The root
>>> cause is that the segment size is too large: it can cover most of
>>> memory, or even all of it, so a lot of time may be wasted trying
>>> to obtain a usable page. Other test cases block until this test
>>> case quits, and at the same time OOM occurs.
>> 5MiB is way too small.  I have seen vmlinux images not to mention
>> ramdisks that get larger than that.  Depending on the system
>> 1GiB might not be an unreasonable ramdisk size.  AKA run an entire live
>> system out of a ramfs.  It works well if you have enough memory.
> There was a use case from Michael Holzheu about a 1.5G ramdisk, see below
> kexec-tools commit:
>
> commit 95741713e790fa6bde7780bbfb772ad88e81a744
> Author: Michael Holzheu 
> Date:   Fri Oct 30 16:02:04 2015 +0100
>
> kexec/s390x: use mmap instead of read for slurp_file()
> 
> The slurp_fd() function allocates memory and uses the read() system
> call. This results in double memory consumption for image and initrd:
> 
>  1) Memory allocated in user space by the kexec tool
>  2) Memory allocated in kernel by the kexec() system call
> 
> The following illustrates the use case that we have on s390x:
> 
>  1) Boot a 4 GB Linux system
>  2) Copy kernel and 1,5 GB ramdisk from external source into tmpfs (ram)
>  3) Use kexec to boot kernel with ramdisk
> 
> Therefore for kexec runtime we need:
> 
>  1,5 GB (tmpfs) + 1,5 GB (kexec malloc) + 1,5 GB (kernel memory) = 4,5 GB
> 
> This patch introduces slurp_file_mmap() which for "normal" files uses
> mmap() instead of malloc()/read(). This reduces the runtime memory
> consumption of the kexec tool as follows:
> 
>  1,5 GB (tmpfs) + 1,5 GB (kernel memory) = 3 GB
> 
> Signed-off-by: Michael Holzheu 
> Reviewed-by: Dave Young 
> Signed-off-by: Simon Horman 
>
>> I think there is a practical limit at about 50% of memory (because we
>> need two copies in memory the source and the destination pages), but
>> anything else is pretty much reasonable and should have a fair chance of
>> working.
>>
>> A limit that reflected that reality above would be interesting.
>> Anything else will likely cause someone trouble in the future.
> Maybe one should test the ramdisk first to ensure it works before
> really using it.
>
> Thanks
> Dave
>
> .
>
 Thank you for the reply. I only tested the kexec_load syscall; I did not
actually boot a machine with a kexec image.
 Recently I hit this issue, and I fixed it by passing reasonable parameters to
the kernel from user space.
 There is no functional change. Is that right?
 I agree with Eric W. Biederman's advice.


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v2] kexec: add restriction on the kexec_load

2016-07-21 Thread zhongjiang
From: zhong jiang 

I hit the following issue when running trinity on my system. The
kernel is version 3.4, but mainline has the same issue. The root
cause is that the segment size is too large: it can cover most of
memory, or even all of it, so a lot of time may be wasted trying
to obtain a usable page. Other test cases block until this test
case quits, and at the same time OOM occurs.

Call Trace:
 [] __alloc_pages_nodemask+0x14c/0x8f0
 [] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [] alloc_pages_current+0xaf/0x120
 [] kimage_alloc_pages+0x10/0x60
 [] kimage_alloc_control_pages+0x5d/0x270
 [] machine_kexec_prepare+0xe5/0x6c0
 [] ? kimage_free_page_list+0x52/0x70
 [] sys_kexec_load+0x141/0x600
 [] ? vfs_write+0x100/0x180
 [] system_call_fastpath+0x16/0x1b

This patch adds a check to sanity_check_segment_list() to
restrict the segment size.

Signed-off-by: zhong jiang 
---
 kernel/kexec_core.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 56b3ed0..1f58824 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -148,6 +148,7 @@ static struct page *kimage_alloc_page(struct kimage *image,
 int sanity_check_segment_list(struct kimage *image)
 {
 	int result, i;
+	unsigned long total_segments = 0;
 	unsigned long nr_segments = image->nr_segments;
 
 	/*
@@ -209,6 +210,21 @@ int sanity_check_segment_list(struct kimage *image)
 			return result;
 	}
 
+	/* Verify that no segment size exceeds the specified size.
+	 * If a segment size from user space is too large, a large
+	 * amount of time will be wasted when allocating pages, so
+	 * a soft lockup may occur.
+	 */
+	for (i = 0; i < nr_segments; i++) {
+		if (image->segment[i].memsz > (totalram_pages / 2))
+			return result;
+
+		total_segments += image->segment[i].memsz;
+	}
+
+	if (total_segments > (totalram_pages / 2))
+		return result;
+
 	/*
 	 * Verify we have good destination addresses.  Normally
 	 * the caller is responsible for making certain we don't
-- 
1.8.3.1
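
For illustration only, here is a user-space sketch of the "no more than half of RAM" check discussed in this thread. It is not the kernel code: struct seg, segments_fit() and its parameters are hypothetical names. As far as I can tell, memsz is a byte count while totalram_pages counts pages, so the sketch converts each size to pages before comparing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct kexec_segment. */
struct seg {
	size_t memsz;	/* destination size in bytes */
};

/*
 * Return true if no single segment, and no running total, exceeds
 * half of RAM. Sizes are converted to pages first so that both
 * sides of the comparison use the same unit.
 */
static bool segments_fit(const struct seg *segs, size_t nr,
			 unsigned long total_ram_pages, size_t page_size)
{
	unsigned long limit = total_ram_pages / 2;
	unsigned long total = 0;
	size_t i;

	assert(page_size != 0);

	for (i = 0; i < nr; i++) {
		/* Round up to whole pages without overflowing. */
		unsigned long pages = segs[i].memsz / page_size +
				      (segs[i].memsz % page_size != 0);

		if (pages > limit)
			return false;

		/* Cannot overflow: both terms are at most "limit". */
		total += pages;
		if (total > limit)
			return false;
	}
	return true;
}
```

With 100 pages of RAM and 4096-byte pages, two segments of 10 and 20 pages pass, while a single 51-page segment, or two segments of 30 and 21 pages, are rejected.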




[PATCH v2] kexec: add restriction on the kexec_load

2016-07-21 Thread zhongjiang
From: zhong jiang 

I hit the following issue when running trinity on my system. The
kernel is version 3.4, but mainline has the same issue. The root
cause is that the segment size is too large: it can cover most of
memory, or even all of it, so a lot of time may be wasted trying
to obtain a usable page. Other test cases block until this test
case quits, and at the same time OOM occurs.

Call Trace:
 [] __alloc_pages_nodemask+0x14c/0x8f0
 [] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [] alloc_pages_current+0xaf/0x120
 [] kimage_alloc_pages+0x10/0x60
 [] kimage_alloc_control_pages+0x5d/0x270
 [] machine_kexec_prepare+0xe5/0x6c0
 [] ? kimage_free_page_list+0x52/0x70
 [] sys_kexec_load+0x141/0x600
 [] ? vfs_write+0x100/0x180
 [] system_call_fastpath+0x16/0x1b

This patch adds a check to sanity_check_segment_list() to
restrict the segment size.

Signed-off-by: zhong jiang 
---
 kernel/kexec_core.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 56b3ed0..b8751c3 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -148,6 +148,7 @@ static struct page *kimage_alloc_page(struct kimage *image,
 int sanity_check_segment_list(struct kimage *image)
 {
 	int result, i;
+	unsigned long total_segments = 0;
 	unsigned long nr_segments = image->nr_segments;
 
 	/*
@@ -209,6 +210,21 @@ int sanity_check_segment_list(struct kimage *image)
 			return result;
 	}
 
+	/* Verify that no segment size exceeds the specified size.
+	 * If a segment size from user space is too large, a large
+	 * amount of time will be wasted when allocating pages, so
+	 * a soft lockup may occur.
+	 */
+	for (i = 0; i < nr_segments; i++) {
+		if (image->segment[i].memsz > (totalram_pages / 2))
+			return result;
+
+		total_segments += image->segment[i].memsz;
+	}
+
+	if (total_segments > (totalram_pages / 2))
+		return result;
+
 	/*
 	 * Verify we have good destination addresses.  Normally
 	 * the caller is responsible for making certain we don't
-- 
1.8.3.1




Re: [PATCH v1 3/4] arm64: Add arm64 kexec support

2016-07-21 Thread AKASHI Takahiro
On Fri, Jul 22, 2016 at 09:38:42AM +0530, Pratyush Anand wrote:
> On 21/07/2016:02:49:36 PM, Geoff Levand wrote:
> > On Thu, 2016-07-21 at 11:50 +0100, Robin Murphy wrote:
> > > The Exynos UART (drivers/tty/serial/samsung.c) is one which comes to
> > > mind as definitely existing, and on arm64 systems to boot. The TX
> > > register is at offset 0x20 there.
> > 
> > Here's what I came up with.
> > 
> > 
> > +   struct data {const char *name; int tx_offset;};
> > +   static const struct data ok_list[] = {
> > +   /*  {"armada-3700-uart", ?},*/
> > +   {"exynos4210-uart", 0x20},
> > +   /*  {"ls1021a-lpuart", ?},  */
> > +   /*  {"meson-uart", ?},  */
> > +   /*  {"mt6577-uart", ?}, */
> > +   {"ns16550", 0},
> > +   {"ns16550a", 0},
> > +   {"pl011", 0},
> > +   {NULL, 0}
> > +   };
> 
> The sink functionality is only for debugging the scenario when something goes
> wrong in purgatory. IMHO, it should be disabled by default.

+1

-Takahiro AKASHI

> So why not keep it as simple as possible? It is low-level debugging, mainly
> for developers, so the user should know the absolute address. Therefore, I
> think there is no need to parse earlycon or earlyprintk from the command
> line. Whatever the user passes in --port can be treated as the address of
> the TX register. If the TX offset is 0x20, the user can pass --port as
> base+0x20. Additionally, we could pass the TX register width as well. So
> what about something like "--port=0x1c02,1", where 0x1c02 is the TX
> register address and 1 is its width in bytes?
> 
> ~Pratyush



Re: [PATCH v1 3/4] arm64: Add arm64 kexec support

2016-07-21 Thread Pratyush Anand
On 21/07/2016:02:49:36 PM, Geoff Levand wrote:
> On Thu, 2016-07-21 at 11:50 +0100, Robin Murphy wrote:
> > The Exynos UART (drivers/tty/serial/samsung.c) is one which comes to
> > mind as definitely existing, and on arm64 systems to boot. The TX
> > register is at offset 0x20 there.
> 
> Here's what I came up with.
> 
> 
> +	struct data {const char *name; int tx_offset;};
> +	static const struct data ok_list[] = {
> +		/*  {"armada-3700-uart", ?}, */
> +		{"exynos4210-uart", 0x20},
> +		/*  {"ls1021a-lpuart", ?}, */
> +		/*  {"meson-uart", ?}, */
> +		/*  {"mt6577-uart", ?}, */
> +		{"ns16550", 0},
> +		{"ns16550a", 0},
> +		{"pl011", 0},
> +		{NULL, 0}
> +	};

The sink functionality is only for debugging the scenario when something goes
wrong in purgatory. IMHO, it should be disabled by default. So why not keep it
as simple as possible? It is low-level debugging, mainly for developers, so the
user should know the absolute address. Therefore, I think there is no need to
parse earlycon or earlyprintk from the command line. Whatever the user passes
in --port can be treated as the address of the TX register. If the TX offset is
0x20, the user can pass --port as base+0x20. Additionally, we could pass the TX
register width as well. So what about something like "--port=0x1c02,1", where
0x1c02 is the TX register address and 1 is its width in bytes?

~Pratyush
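
A rough sketch of how such a "<addr>[,<width>]" value could be parsed, for illustration only. struct port_arg and parse_port() are hypothetical names, and the default width of 1 and the accepted widths (1, 2, 4, 8) are assumptions rather than anything agreed in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical result of parsing "<addr>[,<width>]". */
struct port_arg {
	uint64_t addr;	/* absolute address of the TX register */
	unsigned width;	/* register width in bytes */
};

/* Return 0 and fill *out on success, -1 on any parse error. */
static int parse_port(const char *s, struct port_arg *out)
{
	unsigned long long addr, width = 1;	/* assumed default width */
	char *end;

	assert(s != NULL && out != NULL);

	errno = 0;
	addr = strtoull(s, &end, 0);
	if (errno || end == s)
		return -1;

	if (*end == ',') {
		const char *w = end + 1;

		errno = 0;
		width = strtoull(w, &end, 0);
		if (errno || end == w)
			return -1;
	}

	/* Trailing characters after the last number are an error. */
	if (*end != '\0')
		return -1;

	/* Assumption: only 1, 2, 4 and 8 byte registers are useful. */
	if (width != 1 && width != 2 && width != 4 && width != 8)
		return -1;

	out->addr = addr;
	out->width = (unsigned)width;
	return 0;
}
```

Because strtoull() is called with base 0, both "0x1c02,1" and a decimal address are accepted.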



Re: [RFC 0/3] extend kexec_file_load system call

2016-07-21 Thread Michael Ellerman
Thiago Jung Bauermann  writes:

> On Friday, 15 July 2016, 18:03:35, Thiago Jung Bauermann wrote:
>> On Friday, 15 July 2016, 22:26:09, Arnd Bergmann wrote:
>> > However, the powerpc-specific RTAS runtime services provide a similar
>> > interface to the UEFI runtime support and allow calling into
>> > binary code from the kernel, which gets mapped from a physical
>> > address in the "linux,rtas-base" property in the rtas device node.
>> > 
>> > Modifying the /rtas node will definitely give you a backdoor into
>> > privileged code, but modifying only /chosen should not let you get
>> > in through that specific method.
>> 
>> Except that arch/powerpc/kernel/rtas.c looks for any node in the tree
>> called "rtas", so it will try to use /chosen/rtas, or /chosen/foo/rtas.
>> 
>> We can forbid subnodes in /chosen in the dtb passed to kexec_file_load,
>> though that means userspace can't use the simple-framebuffer binding via
>> this mechanism.
>> 
>> We also have to blacklist the device_type and compatible properties in
>> /chosen to avoid the problem Mark mentioned.
>> 
>> Still doable, but not ideal. :-/
>
> So even if not ideal, the solution above is desirable for powerpc. We would 
> like to preserve the ability of allowing userspace to pass parameters to the 
> OS via the DTB, even if secure boot is enabled.
>
> I would like to turn the above into a proposal:
>
> Extend the syscall as shown in this RFC from Takahiro AKASHI, but instead of 
> accepting a complete DTB from userspace, the syscall accepts a DTB 
> containing only a /chosen node. If the DTB contains any other node, the 
> syscall fails with EINVAL. If the DTB contains any subnode in /chosen, or if 
> there's a compatible or device_type property in /chosen, the syscall fails 
> with EINVAL as well.
>
> The kernel can then add the properties in /chosen to the device tree that it 
> will pass to the next kernel.
>
> What do you think?

I think we will inevitably have someone who wants to pass something
other than a child of /chosen.

At that point we would be faced with adding yet another syscall, or at
best a new flag.

I think we'd be better allowing userspace to pass a DTB, and having an
explicit whitelist (in the kernel) of which nodes & properties are
allowed in that DTB.

For starters it would only contain /chosen/stdout-path (for example).
But we would be able to add new nodes & properties in future.

The downside is userspace would have no way of detecting the content of
the white list, other than trial and error. But in practice I'm not sure
that would be a big problem.

cheers
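
As a rough illustration of the whitelist idea, the check could be as small as a table lookup on the full property path. This is only a sketch: dtb_whitelist and dtb_path_allowed() are hypothetical names, and the single entry is the example given above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical whitelist; only the example path from above is listed. */
static const char *const dtb_whitelist[] = {
	"/chosen/stdout-path",
	NULL
};

/* Return true if the full property path is on the whitelist. */
static bool dtb_path_allowed(const char *path)
{
	size_t i;

	assert(path != NULL);

	for (i = 0; dtb_whitelist[i]; i++) {
		if (strcmp(dtb_whitelist[i], path) == 0)
			return true;
	}
	return false;
}
```

Adding a new node or property later would then mean adding one line to the table rather than a new syscall or flag.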



Re: [RFC 0/3] extend kexec_file_load system call

2016-07-21 Thread Jeremy Kerr
Hi Thiago,

> So even if not ideal, the solution above is desirable for powerpc. We would 
> like to preserve the ability of allowing userspace to pass parameters to the 
> OS via the DTB, even if secure boot is enabled.
> 
> I would like to turn the above into a proposal:
> 
> Extend the syscall as shown in this RFC from Takahiro AKASHI, but instead of 
> accepting a complete DTB from userspace, the syscall accepts a DTB 
> containing only a /chosen node. If the DTB contains any other node, the 
> syscall fails with EINVAL. If the DTB contains any subnode in /chosen, or if 
> there's a compatible or device_type property in /chosen, the syscall fails 
> with EINVAL as well.

This works for me. We could even have it as just a DTB fragment that is
merged *at* the /chosen/ node of the kernel-device tree - so would not
contain a /chosen node itself, and it would be impossible to provide
nodes outside of /chosen. Either is fine.

Thanks!


Jeremy




Re: [RFC 0/3] extend kexec_file_load system call

2016-07-21 Thread Thiago Jung Bauermann
On Friday, 15 July 2016, 18:03:35, Thiago Jung Bauermann wrote:
> On Friday, 15 July 2016, 22:26:09, Arnd Bergmann wrote:
> > However, the powerpc-specific RTAS runtime services provide a similar
> > interface to the UEFI runtime support and allow calling into
> > binary code from the kernel, which gets mapped from a physical
> > address in the "linux,rtas-base" property in the rtas device node.
> > 
> > Modifying the /rtas node will definitely give you a backdoor into
> > privileged code, but modifying only /chosen should not let you get
> > in through that specific method.
> 
> Except that arch/powerpc/kernel/rtas.c looks for any node in the tree
> called "rtas", so it will try to use /chosen/rtas, or /chosen/foo/rtas.
> 
> We can forbid subnodes in /chosen in the dtb passed to kexec_file_load,
> though that means userspace can't use the simple-framebuffer binding via
> this mechanism.
> 
> We also have to blacklist the device_type and compatible properties in
> /chosen to avoid the problem Mark mentioned.
> 
> Still doable, but not ideal. :-/

So even if not ideal, the solution above is desirable for powerpc. We would 
like to preserve the ability of allowing userspace to pass parameters to the 
OS via the DTB, even if secure boot is enabled.

I would like to turn the above into a proposal:

Extend the syscall as shown in this RFC from Takahiro AKASHI, but instead of 
accepting a complete DTB from userspace, the syscall accepts a DTB 
containing only a /chosen node. If the DTB contains any other node, the 
syscall fails with EINVAL. If the DTB contains any subnode in /chosen, or if 
there's a compatible or device_type property in /chosen, the syscall fails 
with EINVAL as well.

The kernel can then add the properties in /chosen to the device tree that it 
will pass to the next kernel.

What do you think?

-- 
[]'s
Thiago Jung Bauermann
IBM Linux Technology Center
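
To make the proposed rule concrete, here is a small sketch over a flattened list of properties. It is an illustration under assumptions, not kernel code: struct dt_prop and check_chosen_only() are hypothetical, and a real implementation would walk the DTB itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical flattened view of a DTB: one entry per property,
 * giving the path of the node that owns it and the property name.
 */
struct dt_prop {
	const char *node;
	const char *name;
};

/*
 * Return 0 if every property sits directly in /chosen and none is
 * "compatible" or "device_type"; otherwise return -EINVAL.
 * A real implementation would also have to reject empty subnodes,
 * which this flattened model cannot represent.
 */
static int check_chosen_only(const struct dt_prop *props, size_t nr)
{
	size_t i;

	assert(props != NULL || nr == 0);

	for (i = 0; i < nr; i++) {
		/* Any other node, including /chosen/foo, is rejected. */
		if (strcmp(props[i].node, "/chosen") != 0)
			return -EINVAL;
		if (strcmp(props[i].name, "compatible") == 0 ||
		    strcmp(props[i].name, "device_type") == 0)
			return -EINVAL;
	}
	return 0;
}
```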




Re: [PATCH v1 3/4] arm64: Add arm64 kexec support

2016-07-21 Thread Geoff Levand
On Thu, 2016-07-21 at 11:50 +0100, Robin Murphy wrote:
> The Exynos UART (drivers/tty/serial/samsung.c) is one which comes to
> mind as definitely existing, and on arm64 systems to boot. The TX
> register is at offset 0x20 there.

Here's what I came up with.


+	struct data {const char *name; int tx_offset;};
+	static const struct data ok_list[] = {
+		/*  {"armada-3700-uart", ?}, */
+		{"exynos4210-uart", 0x20},
+		/*  {"ls1021a-lpuart", ?}, */
+		/*  {"meson-uart", ?}, */
+		/*  {"mt6577-uart", ?}, */
+		{"ns16550", 0},
+		{"ns16550a", 0},
+		{"pl011", 0},
+		{NULL, 0}
+	};
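
A sketch of how a table like this might be consulted, for illustration only. The entries with unknown offsets are left out, and find_tx_offset() is a hypothetical helper that is not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct data {const char *name; int tx_offset;};

/* Only the entries whose TX offset is given in the table above. */
static const struct data ok_list[] = {
	{"exynos4210-uart", 0x20},
	{"ns16550", 0},
	{"ns16550a", 0},
	{"pl011", 0},
	{NULL, 0}
};

/*
 * Hypothetical helper: return the TX register offset for a known
 * UART name, or -1 if the name is not in the table.
 */
static int find_tx_offset(const char *name)
{
	const struct data *d;

	assert(name != NULL);

	/* The table is terminated by an entry with a NULL name. */
	for (d = ok_list; d->name; d++) {
		if (strcmp(d->name, name) == 0)
			return d->tx_offset;
	}
	return -1;
}
```

A caller would add the returned offset to the UART base address, and refuse to enable the debug output when -1 comes back.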



Re: IO memory read from /proc/vmcore leads to hang.

2016-07-21 Thread Daniel Walker

On 07/21/2016 12:33 PM, Maxim Uvarov wrote:

2016-07-21 18:19 GMT+03:00 Daniel Walker :

  There appears to be no code that checks what is or is not System RAM,
nothing that checks the device tree to see what is IO memory, and nothing
that reads /proc/iomem. So AFAIK nothing cares whether it's IO memory or
System RAM, and there's no way to configure things to skip any memory in the
system, except in makedumpfile, which can skip symbols but not IO memory.



Daniel, unfortunately it has been a long time since I looked at the powerpc
code. But I just checked, and this:

kexec-tools-2.0.6/kexec/arch/ppc64/kexec-ppc64.c

is probably what you need.


I have a powerpc32. In the powerpc64 file I only see something called
"reserved-ranges", which may do what I want; however, that doesn't exist
in the 32-bit version. It appears that reserved-ranges is used by OPAL
firmware, which I don't have. There doesn't appear to be anything
generic in ppc64 to exclude device IO memory.


Daniel



Re: IO memory read from /proc/vmcore leads to hang.

2016-07-21 Thread Daniel Walker


 There appears to be no code that checks what is or is not System RAM,
nothing that checks the device tree to see what is IO memory, and
nothing that reads /proc/iomem. So AFAIK nothing cares whether it's IO
memory or System RAM, and there's no way to configure things to skip
any memory in the system, except in makedumpfile, which can skip symbols
but not IO memory.
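
For illustration, this is roughly what reading one /proc/iomem line and telling System RAM from everything else could look like. It is a sketch only: parse_iomem_line() is a hypothetical helper, not something that exists in kexec-tools or makedumpfile.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one /proc/iomem line such as
 *   "00100000-7fffffff : System RAM"
 * Returns false if the line does not have that shape.
 */
static bool parse_iomem_line(const char *line, unsigned long long *start,
			     unsigned long long *end, bool *is_ram)
{
	char name[64];

	assert(line && start && end && is_ram);

	/* %63[^\n] reads the resource name up to the end of the line. */
	if (sscanf(line, " %llx-%llx : %63[^\n]", start, end, name) != 3)
		return false;

	*is_ram = strcmp(name, "System RAM") == 0;
	return true;
}
```

A tool could then leave out every range for which is_ram comes back false when it builds the list of memory segments.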



On 07/21/2016 12:34 AM, Maxim Uvarov wrote:

The second kernel should already know that this is not System RAM of the
first kernel, and in that case makedumpfile will not dump that memory.
A simple way is to pass an additional kernel argument to kexec when you
load the kernel. If that works, then you can think about a better way to
pass this parameter. Options might be request_resource() in the first
kernel, or adding some logic to kexec-tools.

Best regards,
Maxim.

2016-07-20 22:18 GMT+03:00 Daniel Walker :

Mahesh, I didn't get your email for some reason . I saw it in the Archives.

makedumpfile doesn't appear to have a way to drop free form memory areas. So
I need to drop 0080 to 00807fff , but I don't see a way to do that. Any
other suggestions on how to prevent this hang ?



On 07/11/2016 02:46 PM, Daniel Walker wrote:


Hi,

I found that on my PowerPC machine there is some IO memory which
will cause the box to hang if I read it. It's a custom device that was added
to the board for a special purpose.

I was looking for a way to exclude this memory from the dump, and while
doing that I found that kexec makes a list of memory segments that go into
the core file. I was wondering why most of the kexec architectures don't
appear to exclude device memory like what's listed in /proc/iomem.

Is there a good reason why that's not the case?

Daniel





Re: [PATCH v1 3/4] arm64: Add arm64 kexec support

2016-07-21 Thread Robin Murphy
On 21/07/16 11:31, Mark Rutland wrote:
[...]
>>>> +	if (*p == 0)
>>>> +		return 0;
>>>> +
>>>> +	errno = 0;
>>>> +
>>>> +	v = strtoull(p, NULL, 0);
>>>> +
>>>> +	if (errno)
>>>> +		return 0;
>>>> +
>>>> +	return v;
>>>> +}
>>>
>>> It looks like the purgatory code expects angel SWI as the earlycon,
>>
>> Maybe you saw the debug_brk macro in entry.S?  I should remove
>> that and just loop.
> 
> Ah, sorry. For some reason I got that confused with the sink code. My
> bad.
> 
> Now I see that's assuming an 8-bit MMIO register.
> 
>>> whereas many other earlycons exist (with pl011 being extremely popular).
>>> Regardless, if we assume a particular UART type, we should explicitly
>>> verify that here. Otherwise the purgatory code will likely bring down
>>> the system, and it will be very painful to debug.
>>>
>>> Please explicitly check for the supported earlycon name.
>>
>> Purgatory just writes bytes to the address given.  Are there
>> UARTs that don't have TX as the first port?
> 
> I'm not sure, but it's certainly possible. The generic earlycon binding
> doesn't guarantee that the first address is a TX register. Even if they
> don't exist today, they could in a month's time, so I don't think we
> should assume anything.

The Exynos UART (drivers/tty/serial/samsung.c) is one which comes to
mind as definitely existing, and on arm64 systems to boot. The TX
register is at offset 0x20 there.

Robin.



Re: [RFC 3/3] kexec: extend kexec_file_load system call

2016-07-21 Thread Russell King - ARM Linux
On Wed, Jul 20, 2016 at 11:41:35AM +, David Laight wrote:
> From: Dave Young
> > I do not think it is worth to add another syscall for extra fds.
> > We have open(2) as an example for different numbers of arguments
> > already.
> 
> Probably works 'by luck' and no one has actually thought about why.
> That ioctl() works is (probably) even more lucky.
> 
> There are ABIs that use different calling conventions for varargs functions
> (e.g. always stacking all the arguments). I guess Linux doesn't run on any of
> them.
> 
> ioctl() is a particular problem because the 'arg' might be an integer or a
> pointer. Fortunately all the 64-bit ABIs Linux uses pass the arg parameter in
> a register (and don't use different registers for pointer and data arguments).
> 
> You could have two 'libc' functions that refer to the same system call entry.
> Certainly safer than a varargs function.

Don't forget that the syscall API is not a normal C function API - it's
special, because there's little point stacking arguments on the userspace
stack and then having the kernel function try and read them off the
kernelspace stack.

If an architecture does such a thing, then it needs special veneers to
handle that (reading off the userspace stack and placing them onto the
kernelspace stack, or the arch needs to define some other method of
handling the situation.)

So, really, the actual C APIs don't matter that much - what matters more
is the definition of a sane way to pass such arguments.  Given the
extensive historical nature of open() and ioctl(), it would be completely
silly not to create something which allows these calls to work.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.



Re: [PATCH] kexec: add restriction on the kexec_load

2016-07-21 Thread Dave Young
On 07/19/16 at 09:07pm, Eric W. Biederman wrote:
> zhongjiang  writes:
> 
> > From: zhong jiang 
> >
> > I hit the following issue when running trinity on my system. The
> > kernel is version 3.4, but mainline has the same issue. The root
> > cause is that the segment size is too large: it can cover most of
> > memory, or even all of it, so a lot of time may be wasted trying
> > to obtain a usable page. Other test cases block until this test
> > case quits, and at the same time OOM occurs.
> 
> 5MiB is way too small.  I have seen vmlinux images not to mention
> ramdisks that get larger than that.  Depending on the system
> 1GiB might not be an unreasonable ramdisk size.  AKA run an entire live
> system out of a ramfs.  It works well if you have enough memory.

There was a use case from Michael Holzheu about a 1.5G ramdisk, see below
kexec-tools commit:

commit 95741713e790fa6bde7780bbfb772ad88e81a744
Author: Michael Holzheu 
Date:   Fri Oct 30 16:02:04 2015 +0100

kexec/s390x: use mmap instead of read for slurp_file()

The slurp_fd() function allocates memory and uses the read() system
call. This results in double memory consumption for image and initrd:

 1) Memory allocated in user space by the kexec tool
 2) Memory allocated in kernel by the kexec() system call

The following illustrates the use case that we have on s390x:

 1) Boot a 4 GB Linux system
 2) Copy kernel and 1,5 GB ramdisk from external source into tmpfs (ram)
 3) Use kexec to boot kernel with ramdisk

Therefore for kexec runtime we need:

 1,5 GB (tmpfs) + 1,5 GB (kexec malloc) + 1,5 GB (kernel memory) = 4,5 GB

This patch introduces slurp_file_mmap() which for "normal" files uses
mmap() instead of malloc()/read(). This reduces the runtime memory
consumption of the kexec tool as follows:

 1,5 GB (tmpfs) + 1,5 GB (kernel memory) = 3 GB

Signed-off-by: Michael Holzheu 
Reviewed-by: Dave Young 
Signed-off-by: Simon Horman 
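
The saving described in that commit message comes from mapping the file instead of copying it into a malloc() buffer. A minimal sketch of the idea follows; it is not the kexec-tools implementation, and map_file_readonly() is a hypothetical name.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper, not the kexec-tools slurp_file_mmap():
 * map a regular file read-only so that its pages are shared with
 * the page cache (or tmpfs) instead of being copied into a
 * malloc() buffer. Returns NULL on failure; on success *size
 * holds the file length.
 */
static void *map_file_readonly(const char *path, size_t *size)
{
	struct stat st;
	void *buf;
	int fd;

	assert(path != NULL && size != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return NULL;

	/* mmap() of a zero-length file fails, so reject it up front. */
	if (fstat(fd, &st) < 0 || !S_ISREG(st.st_mode) || st.st_size == 0) {
		close(fd);
		return NULL;
	}

	buf = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd);	/* the mapping stays valid after close() */
	if (buf == MAP_FAILED)
		return NULL;

	*size = (size_t)st.st_size;
	return buf;
}
```

The caller releases the mapping with munmap(buf, size) once the image has been handed to the kernel.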

> 
> I think there is a practical limit at about 50% of memory (because we
> need two copies in memory the source and the destination pages), but
> anything else is pretty much reasonable and should have a fair chance of
> working.
> 
> A limit that reflected that reality above would be interesting.
> Anything else will likely cause someone trouble in the future.

Maybe one should test the ramdisk first to ensure it works before
really using it.

Thanks
Dave



Re: IO memory read from /proc/vmcore leads to hang.

2016-07-21 Thread Maxim Uvarov
The second kernel should already know that this is not System RAM of the
first kernel, and in that case makedumpfile will not dump that memory.
A simple way is to pass an additional kernel argument to kexec when you
load the kernel. If that works, then you can think about a better way to
pass this parameter. Options might be request_resource() in the first
kernel, or adding some logic to kexec-tools.

Best regards,
Maxim.

2016-07-20 22:18 GMT+03:00 Daniel Walker :
>
> Mahesh, I didn't get your email for some reason . I saw it in the Archives.
>
> makedumpfile doesn't appear to have a way to drop free form memory areas. So
> I need to drop 0080 to 00807fff , but I don't see a way to do that. Any
> other suggestions on how to prevent this hang ?
>
>
>
> On 07/11/2016 02:46 PM, Daniel Walker wrote:
>>
>>
>> Hi,
>>
>> I found that on my PowerPC machine there is some IO memory which
>> will cause the box to hang if I read it. It's a custom device that was added
>> to the board for a special purpose.
>>
>> I was looking for a way to exclude this memory from the dump, and while
>> doing that I found that kexec makes a list of memory segments that go into
>> the core file. I was wondering why most of the kexec architectures don't
>> appear to exclude device memory like what's listed in /proc/iomem.
>>
>> Is there a good reason why that's not the case?
>>
>> Daniel
>
>
>



-- 
Best regards,
Maxim Uvarov
