Re: [PATCH 1/4] pid: Replace pid bitmap implementation with IDR API

2017-09-25 Thread Rik van Riel
On Mon, 2017-09-25 at 08:56 -0400, Gargi Sharma wrote:
> This patch replaces the current bitmap implementation for
> Process ID allocation. Functions that are no longer required,
> for example, free_pidmap(), alloc_pidmap(), etc. are removed.
> The rest of the functions are modified to use the IDR API.
> The change was made to make the PID allocation less complex by
> replacing custom code with calls to a generic API.

Nice work, Gargi!

> Signed-off-by: Gargi Sharma <gs051...@gmail.com>

> --- a/arch/powerpc/platforms/cell/spufs/sched.c
> +++ b/arch/powerpc/platforms/cell/spufs/sched.c
> @@ -1093,7 +1093,7 @@ static int show_spu_loadavg(struct seq_file *s,
> void *private)
>   LOAD_INT(c), LOAD_FRAC(c),
>   count_active_contexts(),
>   atomic_read(&nr_spu_contexts),
> - task_active_pid_ns(current)->last_pid);
> + task_active_pid_ns(current)->idr.idr_next-1);
>   return 0;
>  }

The repeated use of "idr.idr_next - 1" suggests that maybe this could
be hidden behind a new IDR API, but that could be something for a
follow-up patch.

There already is the idr_get_cursor function, but you would still
have to subtract 1 from the value returned by idr_get_cursor...
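
Something along these lines (the helper name is made up for
illustration; it is not an existing kernel API) would hide the
arithmetic:

static inline unsigned int idr_get_last_alloc(const struct idr *idr)
{
        /* idr_get_cursor() returns the id the next allocation will try */
        return idr_get_cursor(idr) - 1;
}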

Other than that nitpick, this patch looks good to me.

Reviewed-by: Rik van Riel <r...@redhat.com>

kind regards,

Rik van Riel
-- 
All Rights Reversed.


Re: [PATCH] mm: madvise: add description for MADV_WIPEONFORK and MADV_KEEPONFORK

2017-09-22 Thread Rik van Riel
On Sat, 2017-09-23 at 05:55 +0800, Yang Shi wrote:
> mm/madvise.c has a brief description of all the MADV_ flags; add
> descriptions for the newly added MADV_WIPEONFORK and MADV_KEEPONFORK.
> 
> Although the man page has similar information, it is better to keep
> madvise.c consistent with the other flags.
> 
> Signed-off-by: Yang Shi <yan...@alibaba-inc.com>
> 
Thank you for spotting that I missed a location!

Reviewed-by: Rik van Riel <r...@redhat.com>


Re: [kernel-hardening] [PATCH v3 3/3] x86/fpu: reinitialize FPU registers if restoring FPU state fails

2017-09-21 Thread Rik van Riel
On Thu, 2017-09-21 at 11:52 -0700, Eric Biggers wrote:
> From: Eric Biggers <ebigg...@google.com>
> 
> Userspace can change the FPU state of a task using the ptrace() or
> rt_sigreturn() system calls.  Because reserved bits in the FPU state
> can
> cause the XRSTOR instruction to fail, the kernel has to carefully
> validate that no reserved bits or other invalid values are being set.
> 
> Unfortunately, there have been bugs in this validation code.  For
> example, we were not checking that the 'xcomp_bv' field in the
> xstate_header was 0.  As-is, such bugs are exploitable to read the
> FPU
> registers of other processes on the system.  To do so, an attacker
> can
> create a task, assign to it an invalid FPU state, then spin in a loop
> and monitor the values of the FPU registers.  Because the task's FPU
> registers are not being restored, sometimes the FPU registers will
> have
> the values from another process.
> 

Reviewed-by: Rik van Riel <r...@redhat.com>

-- 
All rights reversed


Re: [kernel-hardening] [PATCH v3 2/3] x86/fpu: tighten validation of user-supplied xstate_header

2017-09-21 Thread Rik van Riel
On Thu, 2017-09-21 at 11:52 -0700, Eric Biggers wrote:
> From: Eric Biggers <ebigg...@google.com>
> 
> Move validation of user-supplied xstate_headers into a helper
> function
> and call it from both the ptrace and sigreturn syscall paths.  The
> new
> function also considers it to be an error if *any* reserved bits are
> set, whereas before we were just clearing most of them.
> 
> This should reduce the chance of bugs that fail to correctly validate
> user-supplied XSAVE areas.  It also will expose any broken userspace
> programs that set the other reserved bits; this is desirable because
> such programs will lose compatibility with future CPUs and kernels if
> those bits are ever used for anything.  (There shouldn't be any such
> programs, and in fact in the case where the compacted format is in
> use
> we were already validating xfeatures.  But you never know...)
> 
> Reviewed-by: Kees Cook <keesc...@chromium.org>
> Acked-by: Dave Hansen <dave.han...@linux.intel.com>
> Cc: Andy Lutomirski <l...@kernel.org>
> Cc: Dmitry Vyukov <dvyu...@google.com>
> Cc: Fenghua Yu <fenghua...@intel.com>
> Cc: Ingo Molnar <mi...@kernel.org>
> Cc: Kevin Hao <haoke...@gmail.com>
> Cc: Oleg Nesterov <o...@redhat.com>
> Cc: Wanpeng Li <wanpeng...@hotmail.com>
> Cc: Yu-cheng Yu <yu-cheng...@intel.com>
> Signed-off-by: Eric Biggers <ebigg...@google.com>
> 
Reviewed-by: Rik van Riel <r...@redhat.com>
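
For readers following along, the helper being described is roughly of
this shape (a paraphrase based on the description above, using the
pre-4.15 xfeatures_mask name; see the posted diff for the
authoritative version):

static int validate_xstate_header(const struct xstate_header *hdr)
{
        /* No unknown or supervisor features may be set */
        if (hdr->xfeatures & (~xfeatures_mask | XFEATURE_MASK_SUPERVISOR))
                return -EINVAL;

        /* Userspace must use the uncompacted format */
        if (hdr->xcomp_bv)
                return -EINVAL;

        /* No reserved bits may be set */
        if (memchr_inv(hdr->reserved, 0, sizeof(hdr->reserved)))
                return -EINVAL;

        return 0;
}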

-- 
All rights reversed


Re: [kernel-hardening] [PATCH v3 1/3] x86/fpu: don't let userspace set bogus xcomp_bv

2017-09-21 Thread Rik van Riel
On Thu, 2017-09-21 at 11:52 -0700, Eric Biggers wrote:
> From: Eric Biggers <ebigg...@google.com>
> 
> Fix the bug by checking that the user-supplied value of xcomp_bv is 0
> in
> the uncompacted case, and returning an error otherwise.
> 
Reviewed-by: Rik van Riel <r...@redhat.com>

-- 
All rights reversed


Re: [patch v2] madvise.2: Add MADV_WIPEONFORK documentation

2017-09-19 Thread Rik van Riel
On Tue, 2017-09-19 at 21:07 +0200, Michael Kerrisk (man-pages) wrote:

> Thanks. I applied this, and tweaked the madvise.2 text a little, to
> read as follows (please let me know if I messed anything up):
> 
>    MADV_WIPEONFORK (since Linux 4.14)
>   Present the child process with zero-filled
> memory  in  this
>   range  after  a fork(2).  This is useful in forking
> servers
>   in order to ensure that  sensitive  per-
> process  data  (for
>   example,  PRNG  seeds, cryptographic secrets, and so
> on) is
>   not handed to child processes.
> 
>   The MADV_WIPEONFORK operation can be applied
> only  to  pri‐
>   vate anonymous pages (see mmap(2)).

That looks great. Thank you, Michael!

-- 
All rights reversed


Re: [PATCH] mm: Fix typo in VM_MPX definition

2017-09-18 Thread Rik van Riel
On Mon, 2017-09-18 at 17:02 +0300, Kirill A. Shutemov wrote:
> There's a typo in the recent change of the VM_MPX definition. We want it to be
> VM_HIGH_ARCH_4, not VM_HIGH_ARCH_BIT_4.

Ugh, indeed! Good catch!

> Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
> Fixes: df3735c5b40f ("x86,mpx: make mpx depend on x86-64 to free up
> VMA flag")
> Cc: Rik van Riel <r...@redhat.com>
> Cc: Dave Hansen <dave.han...@intel.com>

Reviewed-by: Rik van Riel <r...@redhat.com>


[patch v2] madvise.2: Add MADV_WIPEONFORK documentation

2017-09-14 Thread Rik van Riel
v2: implement the improvements suggested by Colm, and add
Colm's text to the fork.2 man page
(Colm, I have added a signed-off-by in your name - is that ok?)

Add MADV_WIPEONFORK and MADV_KEEPONFORK documentation to
madvise.2.  The new functionality was recently merged by
Linus, and should be in the 4.14 kernel.

While documenting what EINVAL means for MADV_WIPEONFORK,
I realized that MADV_FREE has the same thing going on,
so I documented EINVAL for both in the ERRORS section.

This patch documents the following kernel commit:

commit d2cd9ede6e193dd7d88b6d27399e96229a551b19
Author: Rik van Riel <r...@redhat.com>
Date:   Wed Sep 6 16:25:15 2017 -0700

mm,fork: introduce MADV_WIPEONFORK

Signed-off-by: Rik van Riel <r...@redhat.com>
Signed-off-by: Colm MacCárthaigh <c...@allcosts.net>

diff --git a/man2/fork.2 b/man2/fork.2
index b5af58ca08c0..b11e750e3876 100644
--- a/man2/fork.2
+++ b/man2/fork.2
@@ -140,6 +140,12 @@ Memory mappings that have been marked with the
 flag are not inherited across a
 .BR fork ().
 .IP *
+Memory in mappings that have been marked with the
+.BR madvise (2)
+.B MADV_WIPEONFORK
+flag is zeroed in the child after a
+.BR fork ().
+.IP *
 The termination signal of the child is always
 .B SIGCHLD
 (see
diff --git a/man2/madvise.2 b/man2/madvise.2
index dfb31b63dba3..bb0ac469c509 100644
--- a/man2/madvise.2
+++ b/man2/madvise.2
@@ -31,6 +31,9 @@
 .\" 2010-06-19, Andi Kleen, Add documentation of MADV_SOFT_OFFLINE.
 .\" 2011-09-18, Doug Goldstein <car...@cardoe.com>
 .\" Document MADV_HUGEPAGE and MADV_NOHUGEPAGE
+.\" 2017-09-14, Rik van Riel <r...@redhat.com>
+.\" Document MADV_WIPEONFORK and MADV_KEEPONFORK
+.\" commit d2cd9ede6e193dd7d88b6d27399e96229a551b19
 .\"
 .TH MADVISE 2 2017-07-13 "Linux" "Linux Programmer's Manual"
 .SH NAME
@@ -405,6 +408,22 @@ can be applied only to private anonymous pages (see
 .BR mmap (2)).
 On a swapless system, freeing pages in a given range happens instantly,
 regardless of memory pressure.
+.TP
+.BR MADV_WIPEONFORK " (since Linux 4.14)"
+Present the child process with zero-filled memory in this range after a
+.BR fork (2).
+This is useful for per-process data in forking servers that should be
+re-initialized in the child process after a fork, for example PRNG seeds,
+cryptographic secrets, etc.
+.IP
+The
+.B MADV_WIPEONFORK
+operation can only be applied to private anonymous pages (see
+.BR mmap (2)).
+.TP
+.BR MADV_KEEPONFORK " (since Linux 4.14)"
+Undo the effect of an earlier
+.BR MADV_WIPEONFORK .
 .SH RETURN VALUE
 On success,
 .BR madvise ()
@@ -457,6 +476,18 @@ or
 but the kernel was not configured with
 .BR CONFIG_KSM .
 .TP
+.B EINVAL
+.I advice
+is
+.BR MADV_FREE
+or
+.BR MADV_WIPEONFORK
+but the specified address range includes file, Huge TLB,
+.BR MAP_SHARED ,
+or
+.BR VM_PFNMAP
+ranges.
+.TP
 .B EIO
 (for
 .BR MADV_WILLNEED )


[patch] madvise.2: Add MADV_WIPEONFORK documentation

2017-09-14 Thread Rik van Riel
Add MADV_WIPEONFORK and MADV_KEEPONFORK documentation to
madvise.2.  The new functionality was recently merged by
Linus, and should be in the 4.14 kernel.

While documenting what EINVAL means for MADV_WIPEONFORK,
I realized that MADV_FREE has the same thing going on,
so I documented EINVAL for both in the ERRORS section.

This patch documents the following kernel commit:

commit d2cd9ede6e193dd7d88b6d27399e96229a551b19
Author: Rik van Riel <r...@redhat.com>
Date:   Wed Sep 6 16:25:15 2017 -0700

mm,fork: introduce MADV_WIPEONFORK

Signed-off-by: Rik van Riel <r...@redhat.com>

index dfb31b63dba3..4f987ddfae79 100644
--- a/man2/madvise.2
+++ b/man2/madvise.2
@@ -31,6 +31,8 @@
 .\" 2010-06-19, Andi Kleen, Add documentation of MADV_SOFT_OFFLINE.
 .\" 2011-09-18, Doug Goldstein <car...@cardoe.com>
 .\" Document MADV_HUGEPAGE and MADV_NOHUGEPAGE
+.\" 2017-09-14, Rik van Riel <r...@redhat.com>
+.\" Document MADV_WIPEONFORK and MADV_KEEPONFORK
 .\"
 .TH MADVISE 2 2017-07-13 "Linux" "Linux Programmer's Manual"
 .SH NAME
@@ -405,6 +407,22 @@ can be applied only to private anonymous pages (see
 .BR mmap (2)).
 On a swapless system, freeing pages in a given range happens instantly,
 regardless of memory pressure.
+.TP
+.BR MADV_WIPEONFORK " (since Linux 4.14)"
+Present the child process with zero-filled memory in this range after a
+.BR fork (2).
+This is useful for per-process data in forking servers that should be
+re-initialized in the child process after a fork, for example PRNG seeds,
+cryptographic data, etc.
+.IP
+The
+.B MADV_WIPEONFORK
+operation can only be applied to private anonymous pages (see
+.BR mmap (2)).
+.TP
+.BR MADV_KEEPONFORK " (since Linux 4.14)"
+Undo the effect of an earlier
+.BR MADV_WIPEONFORK .
 .SH RETURN VALUE
 On success,
 .BR madvise ()
@@ -457,6 +475,18 @@ or
 but the kernel was not configured with
 .BR CONFIG_KSM .
 .TP
+.B EINVAL
+.I advice
+is
+.BR MADV_FREE
+or
+.BR MADV_WIPEONFORK
+but the specified address range includes file, Huge TLB,
+.BR MAP_SHARED ,
+or
+.BR VM_PFNMAP
+ranges.
+.TP
 .B EIO
 (for
 .BR MADV_WILLNEED )



Re: [lkp-robot] [sched/fair] 6d46bd3d97: netperf.Throughput_tps -11.3% regression

2017-09-14 Thread Rik van Riel
On Sun, 2017-09-10 at 23:32 -0700, Joel Fernandes wrote:
> 
> To make the load check more meaningful, I am thinking if using
> wake_affine()'s balance check is a better thing to do than the
> 'nr_running < 2' check I used in this patch. Then again, since commit
> 3fed382b46baac ("sched/numa: Implement NUMA node level
> wake_affine()"),
> wake_affine() doesn't do balance check for CPUs within a socket so
> probably bringing back something like the *old* wake_affine that
> checked load between different CPUs within a socket is needed to
> avoid
> a potentially disastrous sync decision? 

This is because regardless of whether or not we did
an affine wakeup, the code called select_idle_sibling
within that socket, anyway.

In other words, the behavior for within-socket
wakeups was not substantially different with or
without an affine wakeup.

All that changed is which CPU select_idle_sibling
starts searching at, and that only if the woken
task's previous CPU is not idle.

>  The commit I refer to was
> added with the reason that select_idle_sibling was selecting cores
> anywhere within a socket, but with my patch we're more specifically
> selecting the waker's CPU on passing the sync flag. Could you share
> your thoughts about this?

On systems with SMT, it may make more sense for
sync wakeups to look for idle threads of the same
core, than to have the woken task end up on the 
same thread, and wait for the current task to stop
running.
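
Purely as a sketch of that idea (a hypothetical helper, assuming
CONFIG_SCHED_SMT, not code from any posted patch), a sync wakeup
could scan the waker's SMT mask first:

static int find_idle_smt_sibling(int cpu)
{
        int sibling;

        for_each_cpu(sibling, cpu_smt_mask(cpu)) {
                if (sibling == cpu)
                        continue;
                /* prefer an idle hardware thread on the waker's core */
                if (idle_cpu(sibling))
                        return sibling;
        }
        return -1;
}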

"Strong sync" wakeups like you propose would also
change the semantics of wake_wide() and potentially
other bits of code...

-- 
All rights reversed


Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

2017-09-11 Thread Rik van Riel
On Sun, 2017-09-10 at 18:46 -0700, Andy Lutomirski wrote:
> 
> No, nothing stops the problematic speculative load.  Here's the
> issue.
> One CPU removes a reference to a page table from a higher-level page
> table, flushes, and then frees the page table.  Then it re-allocates
> it and writes something unrelated there.  Another CPU that has CR3
> pointing to the page hierarchy in question could have a reference to
> the freed table in its paging structure cache.  Even if it's
> guaranteed to not try to access the addresses in question (because
> they're user addresses and the other CPU is in kernel mode, etc), but
> there is never a guarantee that the CPU doesn't randomly try to fill
> its TLB for the affected addresses.  This results in invalid PTEs in
> the TLB, possible accesses using bogus memory types, and maybe even
> reads from IO space.

Good point, I had forgotten all about memory accesses
that do not originate with software behavior.


Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

2017-09-10 Thread Rik van Riel
On Sat, 2017-09-09 at 12:28 -0700, Andy Lutomirski wrote:
> -
> I propose the following fix.  If PCID is on, then, in
> enter_lazy_tlb(), we switch to init_mm with the no-flush flag set.
> (And we give init_mm its own dedicated ASID to keep it simple and
> fast
> -- no need to use the LRU ASID mapping to assign one
> dynamically.)  We
> clear the bit in mm_cpumask.  That is, we more or less just skip the
> whole lazy TLB optimization and rely on PCID CPUs having reasonably
> fast CR3 writes.  No extra IPIs.

Avoiding the IPIs is probably what matters the most, especially
on systems with deep C states, and virtual machines where the
host may be running something else, causing the IPI service time
to go through the roof for idle VCPUs.

> Also, sorry Rik, this means your old increased laziness optimization
> is dead in the water.  It will have exactly the same speculative load
> problem.

Doesn't a memory barrier solve that speculative load
problem?

The memory barrier could be added only to the path
that potentially skips reloading the TLB, under the
assumption that a memory barrier is cheaper than a
TLB reload (even with ASID).


Re: [RFC 1/2] proc: Return if nothing to unmount

2017-09-10 Thread Rik van Riel
On Sat, 2017-09-09 at 19:31 +0100, Al Viro wrote:
> On Sat, Sep 09, 2017 at 06:03:16PM +0530, Gargi Sharma wrote:
> > If a task exits before procfs is mounted, proc_flush_task_mnt will
> > be called with a NULL mnt parameter. In that case, not only is
> > there
> > nothing to unhash, but trying to do so will oops the kernel with a
> > null pointer dereference.
> 
> You are misreading that sucker.  It's not about userland mounts, it's
> about
> the internal ones in pidns, for each pidns the process belongs to.
> 
> IOW, what you are adding is dead code.  The very first alloc_pid() in
> that pidns should've called pid_ns_prepare_proc(), which creates that
> vfsmount.

Looking at the code (now that I am home, and no longer
reading this email on my phone), I see the cause of this
problem.

A previous version of Gargi's code had RESERVED_PIDS as
the lower bound for idr_alloc_cyclic, even on the very
first PID allocation cycle in a PID namespace.

With the code changed to have 1 as the lower bound during
the first allocation cycle, pid_ns_prepare_proc() should
be called correctly, and things should work as expected.
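
For reference, a minimal sketch of that fix as it would look in
alloc_pid() (tmp being the pid namespace currently being walked;
this paraphrases the idea rather than quoting the patch):

        int pid_min = 1;

        /*
         * Only raise the lower bound to RESERVED_PIDS once the cyclic
         * cursor has moved past it, so the very first allocation cycle
         * can still hand out pid 1 and trigger pid_ns_prepare_proc().
         */
        if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
                pid_min = RESERVED_PIDS;

        nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min, pid_max,
                              GFP_ATOMIC);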

Gargi, can you drop this patch 1/2, and make sure the code
still works fine?

-- 
All Rights Reversed.


Re: [RFC 1/2] proc: Return if nothing to unmount

2017-09-10 Thread Rik van Riel


On September 9, 2017 2:31:35 PM EDT, Al Viro  wrote:
>On Sat, Sep 09, 2017 at 06:03:16PM +0530, Gargi Sharma wrote:
>> If a task exits before procfs is mounted, proc_flush_task_mnt will
>> be called with a NULL mnt parameter. In that case, not only is there
>> nothing to unhash, but trying to do so will oops the kernel with a
>> null pointer dereference.
>
>You are misreading that sucker.  It's not about userland mounts, it's about
>the internal ones in pidns, for each pidns the process belongs to.
>
>IOW, what you are adding is dead code.  The very first alloc_pid() in
>that pidns should've called pid_ns_prepare_proc(), which creates that
>vfsmount.

Huh, my bad. I wonder why Gargi's code ran into a null pointer dereference on a 
null mnt pointer, then...
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [RFC PATCH 01/12] housekeeping: Move housekeeping related code to its own file

2017-08-31 Thread Rik van Riel
On Wed, 2017-08-23 at 03:51 +0200, Frederic Weisbecker wrote:
> The housekeeping code is currently tied to the nohz code. As we are
> planning to make housekeeping independent from it, start with moving
> the relevant code to its own file.
> 
Why are nohz full and housekeeping being
decoupled from each other?

Won't people want to use them together?

What use case am I missing?

-- 
All rights reversed


Re: [kernel-hardening] [PATCH v2 24/30] fork: Define usercopy region in mm_struct slab caches

2017-08-30 Thread Rik van Riel
On Mon, 2017-08-28 at 14:35 -0700, Kees Cook wrote:
> From: David Windsor <d...@nullcore.net>
> 
> In support of usercopy hardening, this patch defines a region in the
> mm_struct slab caches in which userspace copy operations are allowed.
> Only the auxv field is copied to userspace.
> 
Acked-by: Rik van Riel <r...@redhat.com>

-- 
All rights reversed


Re: [kernel-hardening] [PATCH v2 25/30] fork: Define usercopy region in thread_stack slab caches

2017-08-30 Thread Rik van Riel
On Mon, 2017-08-28 at 14:35 -0700, Kees Cook wrote:
> From: David Windsor <d...@nullcore.net>
> 
> In support of usercopy hardening, this patch defines a region in the
> thread_stack slab caches in which userspace copy operations are
> allowed.
> Since the entire thread_stack needs to be available to userspace, the
> entire slab contents are whitelisted. Note that the slab-based thread
> stack is only present on systems with THREAD_SIZE < PAGE_SIZE and
> !CONFIG_VMAP_STACK.
> 

Acked-by: Rik van Riel <r...@redhat.com>

-- 
All rights reversed


Re: [kernel-hardening] [PATCH v2 27/30] x86: Implement thread_struct whitelist for hardened usercopy

2017-08-30 Thread Rik van Riel
On Mon, 2017-08-28 at 14:35 -0700, Kees Cook wrote:
> This whitelists the FPU register state portion of the thread_struct
> for
> copying to userspace, instead of the default entire struct.
> 
> Cc: Thomas Gleixner <t...@linutronix.de>
> Cc: Ingo Molnar <mi...@redhat.com>
> Cc: "H. Peter Anvin" <h...@zytor.com>
> Cc: x...@kernel.org
> Cc: Borislav Petkov <b...@suse.de>
> Cc: Andy Lutomirski <l...@kernel.org>
> Cc: Mathias Krause <mini...@googlemail.com>
> Signed-off-by: Kees Cook <keesc...@chromium.org>
> ---
>  arch/x86/Kconfig | 1 +
>  arch/x86/include/asm/processor.h | 8 ++++++++
>  2 files changed, 9 insertions(+)
> 
Acked-by: Rik van Riel <r...@redhat.com>

-- 
All rights reversed


Re: [kernel-hardening] [PATCH v2 26/30] fork: Provide usercopy whitelisting for task_struct

2017-08-30 Thread Rik van Riel
On Mon, 2017-08-28 at 14:35 -0700, Kees Cook wrote:
> While the blocked and saved_sigmask fields of task_struct are copied
> to
> userspace (via sigmask_to_save() and setup_rt_frame()), it is always
> copied with a static length (i.e. sizeof(sigset_t)).
> 
> The only portion of task_struct that is potentially dynamically sized
> and
> may be copied to userspace is in the architecture-specific
> thread_struct
> at the end of task_struct.
> 
Acked-by: Rik van Riel <r...@redhat.com>

-- 
All rights reversed


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-18 Thread Rik van Riel
On Fri, 2017-08-18 at 11:15 -0700, Andrew Morton wrote:
> On Fri, 18 Aug 2017 12:28:29 -0400 Rik van Riel <r...@redhat.com>
> wrote:
> 
> > On Thu, 2017-08-17 at 15:50 -0700, Andrew Morton wrote:
> > > On Tue, 15 Aug 2017 22:18:19 -0400 Rik van Riel <r...@redhat.com>
> > > wrote:
> > > 
> > > > > > --- a/mm/madvise.c
> > > > > > +++ b/mm/madvise.c
> > > > > > @@ -80,6 +80,17 @@ static long madvise_behavior(struct
> > > > > > vm_area_struct *vma,
> > > > > >     }
> > > > > >     new_flags &= ~VM_DONTCOPY;
> > > > > >     break;
> > > > > > +   case MADV_WIPEONFORK:
> > > > > > +   /* MADV_WIPEONFORK is only supported on
> > > > > > anonymous
> > > > > > memory. */
> > > > > > +   if (vma->vm_file || vma->vm_flags &
> > > > > > VM_SHARED)
> > > > > > {
> > > > > > +   error = -EINVAL;
> > > > > > +   goto out;
> > > > > > +   }
> > > > > > +   new_flags |= VM_WIPEONFORK;
> > > > > > +   break;
> > > > > > +   case MADV_KEEPONFORK:
> > > > > > +   new_flags &= ~VM_WIPEONFORK;
> > > > > > +   break;
> > > > > >     case MADV_DONTDUMP:
> > > > > >     new_flags |= VM_DONTDUMP;
> > > > > >     break;
> > > > > 
> > > > > It seems odd to permit MADV_KEEPONFORK against other-than-
> > > > > anon
> > > > > vmas?
> > > > 
> > > > Given that the only way to set VM_WIPEONFORK is through
> > > > MADV_WIPEONFORK, calling MADV_KEEPONFORK on an
> > > > other-than-anon vma would be equivalent to a noop.
> > > > 
> > > > If new_flags == vma->vm_flags, madvise_behavior() will
> > > > immediately exit.
> > > 
> > > Yes, but calling MADV_WIPEONFORK against an other-than-anon vma
> > > is
> > > presumably a userspace bug.  A bug which will probably result
> > > in
> > > userspace having WIPEONFORK memory which it didn't want.  The
> > > kernel
> > > can trivially tell userspace that it has this bug so why not do
> > > so?
> > 
> > Uh, what?
> > 
> 
> Braino.  I meant MADV_KEEPONFORK.  Calling MADV_KEEPONFORK against an
> other-than-anon vma is a presumptive userspace bug and the kernel
> should report that.

All MADV_KEEPONFORK does is clear the flag set by
MADV_WIPEONFORK. Since there is no way to set the
WIPEONFORK flag on other-than-anon VMAs, that means
MADV_KEEPONFORK is always a noop for those VMAs.

You remind me that I should send in a man page
addition, though, when this code gets sent to
Linus.


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-18 Thread Rik van Riel
On Thu, 2017-08-17 at 15:50 -0700, Andrew Morton wrote:
> On Tue, 15 Aug 2017 22:18:19 -0400 Rik van Riel <r...@redhat.com>
> wrote:
> 
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -80,6 +80,17 @@ static long madvise_behavior(struct
> > > > vm_area_struct *vma,
> > > >     }
> > > >     new_flags &= ~VM_DONTCOPY;
> > > >     break;
> > > > +   case MADV_WIPEONFORK:
> > > > +   /* MADV_WIPEONFORK is only supported on
> > > > anonymous
> > > > memory. */
> > > > +   if (vma->vm_file || vma->vm_flags & VM_SHARED)
> > > > {
> > > > +   error = -EINVAL;
> > > > +   goto out;
> > > > +   }
> > > > +   new_flags |= VM_WIPEONFORK;
> > > > +   break;
> > > > +   case MADV_KEEPONFORK:
> > > > +   new_flags &= ~VM_WIPEONFORK;
> > > > +   break;
> > > >     case MADV_DONTDUMP:
> > > >     new_flags |= VM_DONTDUMP;
> > > >     break;
> > > 
> > > It seems odd to permit MADV_KEEPONFORK against other-than-anon
> > > vmas?
> > 
> > Given that the only way to set VM_WIPEONFORK is through
> > MADV_WIPEONFORK, calling MADV_KEEPONFORK on an
> > other-than-anon vma would be equivalent to a noop.
> > 
> > If new_flags == vma->vm_flags, madvise_behavior() will
> > immediately exit.
> 
> Yes, but calling MADV_WIPEONFORK against an other-than-anon vma is
> presumably a userspace bug.  A bug which will probably result in
> userspace having WIPEONFORK memory which it didn't want.  The kernel
> can trivially tell userspace that it has this bug so why not do so?

Uh, what?

Calling MADV_WIPEONFORK on an other-than-anon vma results
in NOT getting VM_WIPEONFORK semantics on that VMA.

The code you are commenting on is the bit that CLEARS
the VM_WIPEONFORK flag, not the bit where it gets set.


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-15 Thread Rik van Riel
On Tue, 2017-08-15 at 15:51 -0700, Andrew Morton wrote:
> On Fri, 11 Aug 2017 17:28:29 -0400 r...@redhat.com wrote:
> 
> > A further complication is the proliferation of clone flags,
> > programs bypassing glibc's functions to call clone directly,
> > and programs calling unshare, causing the glibc pthread_atfork
> > hook to not get called.
> > 
> > It would be better to have the kernel take care of this
> > automatically.
> 
> I'll add "The patch also adds MADV_KEEPONFORK, to undo the effects of
> a
> prior MADV_WIPEONFORK." here.
> 
> I guess it isn't worth mentioning that these things can cause VMA
> merges and splits. 

That's the same as every other Linux-specific madvise operation.

> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -80,6 +80,17 @@ static long madvise_behavior(struct
> > vm_area_struct *vma,
> >     }
> >     new_flags &= ~VM_DONTCOPY;
> >     break;
> > +   case MADV_WIPEONFORK:
> > +   /* MADV_WIPEONFORK is only supported on anonymous
> > memory. */
> > +   if (vma->vm_file || vma->vm_flags & VM_SHARED) {
> > +   error = -EINVAL;
> > +   goto out;
> > +   }
> > +   new_flags |= VM_WIPEONFORK;
> > +   break;
> > +   case MADV_KEEPONFORK:
> > +   new_flags &= ~VM_WIPEONFORK;
> > +   break;
> >     case MADV_DONTDUMP:
> >     new_flags |= VM_DONTDUMP;
> >     break;
> 
> It seems odd to permit MADV_KEEPONFORK against other-than-anon vmas?

Given that the only way to set VM_WIPEONFORK is through
MADV_WIPEONFORK, calling MADV_KEEPONFORK on an
other-than-anon vma would be equivalent to a noop.

If new_flags == vma->vm_flags, madvise_behavior() will
immediately exit.
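
For completeness, here is the semantics under discussion as a minimal
userspace test (assumes headers that define MADV_WIPEONFORK, i.e.
Linux 4.14 or later; error checking omitted for brevity):

#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        p[0] = 42;
        madvise(p, 4096, MADV_WIPEONFORK);

        if (fork() == 0) {
                printf("child:  %d\n", p[0]);   /* wiped: prints 0 */
                _exit(0);
        }
        wait(NULL);
        printf("parent: %d\n", p[0]);           /* prints 42 */
        return 0;
}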



Re: [kernel-hardening] Re: x86: PIE support and option to extend KASLR randomization

2017-08-15 Thread Rik van Riel
On Tue, 2017-08-15 at 08:15 -0400, Jordan Glover wrote:
> Hello,
> I'm writing to offer a different perspective on the topic. I'm glad
> that kernel developers care about performance optimizations, and I
> see how a 10% overhead can be a problem for some. On the other hand,
> the last ten years gave us 1000% faster hardware, which trumps
> anything software can do. For many users like us, performance isn't
> a problem; we have plenty of it, and if we don't, we can buy more.
> It can be a money problem, not a software engineering problem.

CPUs stopped getting faster at a dramatic rate, though,
while network cards continue to get faster.

In some areas, the kernel is more squeezed for CPU cycles
today than it has ever been before.

Ingo is suggesting that the security enhancement be
implemented in a way that doesn't come with a 10% performance
cost. There are other ways to relocate a kernel than with PIE.


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-11 Thread Rik van Riel
On Fri, 2017-08-11 at 12:42 -0700, Linus Torvalds wrote:
> On Fri, Aug 11, 2017 at 12:19 PM,   wrote:
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 0e517be91a89..f9b0ad7feb57 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1134,6 +1134,16 @@ int copy_page_range(struct mm_struct
> > *dst_mm, struct mm_struct *src_mm,
> > !vma->anon_vma)
> > return 0;
> > 
> > +   /*
> > +* With VM_WIPEONFORK, the child inherits the VMA from the
> > +* parent, but not its contents.
> > +*
> > +* A child accessing VM_WIPEONFORK memory will see all
> > zeroes;
> > +* a child accessing VM_DONTCOPY memory receives a
> > segfault.
> > +*/
> > +   if (vma->vm_flags & VM_WIPEONFORK)
> > +   return 0;
> > +
> 
> Is this right?
> 
> Yes, you don't do the page table copies. Fine. But you leave vma with
> the anon_vma pointer - doesn't that mean that it's still
> connected
> to the original anonvma chain, and we might end up swapping something
> in?

Swapping something in would require there to be a swap entry in
the page table entries, which we are not copying, so this should
not be a correctness issue.

> And even if that ends up not being an issue, I'd expect that you'd
> want to break the anon_vma chain just to not make it grow
> unnecessarily.

This is a good point. I can send a v4 that skips the anon_vma_fork()
call if VM_WIPEONFORK, and calls anon_vma_prepare(), instead.

> So my gut feel is that doing this in "copy_page_range()" is wrong,
> and
> the logic should be moved up to dup_mmap(), where we can also
> short-circuit the anon_vma chain entirely.
> 
> No?

There is another test in copy_page_range already which ends up
skipping the page table copy when it should not be done.

If you want, I can move that test into a should_copy_page_range()
function, and call that from dup_mmap(), skipping the call to
copy_page_range() if should_copy_page_range() returns false.

Having only one of the two sets of tests in dup_mmap(), and
the other in copy_page_range() seems wrong.
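
Roughly something like this (a sketch only; the flag mask mirrors
the test that copy_page_range() open-codes today):

	static bool should_copy_page_range(struct vm_area_struct *vma)
	{
		/* A wiped-on-fork VMA starts out empty in the child. */
		if (vma->vm_flags & VM_WIPEONFORK)
			return false;

		/* An anonymous VMA with no pages has nothing to copy. */
		if (!(vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) &&
		    !vma->anon_vma)
			return false;

		return true;
	}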

Just let me know what you prefer, and I'll put that in v4.

> The madvise() interface looks fine to me.

That was the main reason for adding you to the thread :)

kind regards,

Rik


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-11 Thread Rik van Riel
On Fri, 2017-08-11 at 09:36 -0700, Mike Kravetz wrote:
> On 08/11/2017 08:23 AM, Rik van Riel wrote:
> > On Thu, 2017-08-10 at 17:23 +0200, Michal Hocko wrote:
> > > On Sun 06-08-17 10:04:25, Rik van Riel wrote:
> > > [...]
> > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > index 17921b0390b4..db1fb2802ecc 100644
> > > > --- a/kernel/fork.c
> > > > +++ b/kernel/fork.c
> > > > @@ -659,6 +659,13 @@ static __latent_entropy int
> > > > dup_mmap(struct
> > > > mm_struct *mm,
> > > >     tmp->vm_flags &= ~(VM_LOCKED |
> > > > VM_LOCKONFAULT);
> > > >     tmp->vm_next = tmp->vm_prev = NULL;
> > > >     file = tmp->vm_file;
> > > > +
> > > > +   /* With VM_WIPEONFORK, the child gets an empty
> > > > VMA. */
> > > > +   if (tmp->vm_flags & VM_WIPEONFORK) {
> > > > +   tmp->vm_file = file = NULL;
> > > > +   tmp->vm_ops = NULL;
> > > > +   }
> > > 
> > > What about VM_SHARED/|VM)MAYSHARE flags. Is it OK to keep the
> > > around?
> > > At
> > > least do_anonymous_page SIGBUS on !vm_ops && VM_SHARED. Or do I
> > > miss
> > > where those flags are cleared?
> > 
> > Huh, good spotting.  That makes me wonder why the test case that
> > Mike and I ran worked just fine on a MAP_SHARED|MAP_ANONYMOUS VMA,
> > and returned zero-filled memory when read by the child process.
> 
> Well, I think I still got a BUG with a MAP_SHARED|MAP_ANONYMOUS vma
> on
> your v2 patch.  Did not really want to start a discussion on the
> implementation until the issue of exactly what VM_WIPEONFORK was
> supposed
> to do was settled.

It worked here, but now I don't understand why :)

> > 
> > OK, I'll do a minimal implementation for now, which will return
> > -EINVAL if MADV_WIPEONFORK is called on a VMA with MAP_SHARED
> > and/or an mmapped file.
> > 
> > It will work the way it is supposed to with anonymous MAP_PRIVATE
> > memory, which is likely the only memory it will be used on, anyway.
> > 
> 
> Seems reasonable.
> 
> You should also add VM_HUGETLB to those returning -EINVAL.  IIRC, a
> VM_HUGETLB vma even without VM_SHARED expects vm_file != NULL.

In other words (flags & MAP_SHARED || vma->vm_file) would catch
hugetlbfs, too?
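
A sketch of that check; because hugetlb VMAs always carry a non-NULL
vm_file, it would reject them as well:

	/* MADV_WIPEONFORK is only supported on anonymous memory. */
	if (vma->vm_file || vma->vm_flags & VM_SHARED) {
		error = -EINVAL;
		goto out;
	}
	new_flags |= VM_WIPEONFORK;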



Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-11 Thread Rik van Riel
On Thu, 2017-08-10 at 17:23 +0200, Michal Hocko wrote:
> On Sun 06-08-17 10:04:25, Rik van Riel wrote:
> [...]
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 17921b0390b4..db1fb2802ecc 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -659,6 +659,13 @@ static __latent_entropy int dup_mmap(struct
> > mm_struct *mm,
> >     tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
> >     tmp->vm_next = tmp->vm_prev = NULL;
> >     file = tmp->vm_file;
> > +
> > +   /* With VM_WIPEONFORK, the child gets an empty
> > VMA. */
> > +   if (tmp->vm_flags & VM_WIPEONFORK) {
> > +   tmp->vm_file = file = NULL;
> > +   tmp->vm_ops = NULL;
> > +   }
> 
> What about VM_SHARED/|VM)MAYSHARE flags. Is it OK to keep the around?
> At
> least do_anonymous_page SIGBUS on !vm_ops && VM_SHARED. Or do I miss
> where those flags are cleared?

Huh, good spotting.  That makes me wonder why the test case that
Mike and I ran worked just fine on a MAP_SHARED|MAP_ANONYMOUS VMA,
and returned zero-filled memory when read by the child process.

OK, I'll do a minimal implementation for now, which will return
-EINVAL if MADV_WIPEONFORK is called on a VMA with MAP_SHARED
and/or an mmapped file.

It will work the way it is supposed to with anonymous MAP_PRIVATE
memory, which is likely the only memory it will be used on, anyway.


[tip:sched/core] sched/numa: Slow down scan rate if shared faults dominate

2017-08-10 Thread tip-bot for Rik van Riel
Commit-ID:  37ec97deb3a8c68a7adfab61beb261ffeab19d09
Gitweb: http://git.kernel.org/tip/37ec97deb3a8c68a7adfab61beb261ffeab19d09
Author: Rik van Riel <r...@redhat.com>
AuthorDate: Mon, 31 Jul 2017 15:28:46 -0400
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Thu, 10 Aug 2017 12:18:16 +0200

sched/numa: Slow down scan rate if shared faults dominate

The comment above update_task_scan_period() says the scan period should
be increased (scanning slows down) if the majority of memory accesses
are on the local node, or if the majority of the page accesses are
shared with other tasks.

However, with the current code, all a high ratio of shared accesses
does is slow down the rate at which scanning is made faster.

This patch changes things so either lots of shared accesses or
lots of local accesses will slow down scanning, and numa scanning
is sped up only when there are lots of private faults on remote
memory pages.
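
As a worked example (assuming the kernel's NUMA_PERIOD_SLOTS = 10 and
NUMA_PERIOD_THRESHOLD = 7): a task with 80% private faults
(ps_ratio = 8) takes the first branch and slows scanning down
regardless of locality; a task with mostly shared faults but 80%
local faults (lr_ratio = 8) takes the second branch and also slows
down; only a task below both thresholds, say 40% local (lr_ratio = 4)
and 50% private (ps_ratio = 5), reaches the final branch, with
ratio = max(4, 5) = 5 and diff = -(7 - 5) * period_slot, shrinking
the scan period, i.e. faster scanning.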

Signed-off-by: Rik van Riel <r...@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Acked-by: Mel Gorman <mgor...@suse.de>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: jhla...@redhat.com
Cc: lvena...@redhat.com
Link: http://lkml.kernel.org/r/20170731192847.23050-2-r...@redhat.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 kernel/sched/fair.c | 39 +--
 1 file changed, 25 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ef5b66b..cb6b7c8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1892,7 +1892,7 @@ static void update_task_scan_period(struct task_struct *p,
unsigned long shared, unsigned long private)
 {
unsigned int period_slot;
-   int ratio;
+   int lr_ratio, ps_ratio;
int diff;
 
unsigned long remote = p->numa_faults_locality[0];
@@ -1922,25 +1922,36 @@ static void update_task_scan_period(struct task_struct *p,
 *   >= NUMA_PERIOD_THRESHOLD scan period increases (scan slower)
 */
period_slot = DIV_ROUND_UP(p->numa_scan_period, NUMA_PERIOD_SLOTS);
-   ratio = (local * NUMA_PERIOD_SLOTS) / (local + remote);
-   if (ratio >= NUMA_PERIOD_THRESHOLD) {
-   int slot = ratio - NUMA_PERIOD_THRESHOLD;
+   lr_ratio = (local * NUMA_PERIOD_SLOTS) / (local + remote);
+   ps_ratio = (private * NUMA_PERIOD_SLOTS) / (private + shared);
+
+   if (ps_ratio >= NUMA_PERIOD_THRESHOLD) {
+   /*
+* Most memory accesses are local. There is no need to
+* do fast NUMA scanning, since memory is already local.
+*/
+   int slot = ps_ratio - NUMA_PERIOD_THRESHOLD;
+   if (!slot)
+   slot = 1;
+   diff = slot * period_slot;
+   } else if (lr_ratio >= NUMA_PERIOD_THRESHOLD) {
+   /*
+* Most memory accesses are shared with other tasks.
+* There is no point in continuing fast NUMA scanning,
+* since other tasks may just move the memory elsewhere.
+*/
+   int slot = lr_ratio - NUMA_PERIOD_THRESHOLD;
if (!slot)
slot = 1;
diff = slot * period_slot;
} else {
-   diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
-
/*
-* Scale scan rate increases based on sharing. There is an
-* inverse relationship between the degree of sharing and
-* the adjustment made to the scanning period. Broadly
-* speaking the intent is that there is little point
-* scanning faster if shared accesses dominate as it may
-* simply bounce migrations uselessly
+* Private memory faults exceed (SLOTS-THRESHOLD)/SLOTS,
+* yet they are not on the local NUMA node. Speed up
+* NUMA scanning to get the memory moved over.
 */
-   ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared + 1));
-   diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
+   int ratio = max(lr_ratio, ps_ratio);
+   diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
}
 
p->numa_scan_period = clamp(p->numa_scan_period + diff,


[tip:sched/core] sched/numa: Scale scan period with tasks in group and shared/private

2017-08-10 Thread tip-bot for Rik van Riel
Commit-ID:  b5dd77c8bdada7b6262d0cba02a6ed525bf4e6e1
Gitweb: http://git.kernel.org/tip/b5dd77c8bdada7b6262d0cba02a6ed525bf4e6e1
Author: Rik van Riel <r...@redhat.com>
AuthorDate: Mon, 31 Jul 2017 15:28:47 -0400
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Thu, 10 Aug 2017 12:18:16 +0200

sched/numa: Scale scan period with tasks in group and shared/private

Running 80 tasks in the same group, or as threads of the same process,
results in the memory getting scanned 80x as fast as it would be if a
single task was using the memory.

This really hurts some workloads.

Scale the scan period by the number of tasks in the numa group, and
the shared / private ratio, so the average rate at which memory in
the group is scanned corresponds roughly to the rate at which a single
task would scan its memory.
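
As a rough worked example (numbers purely illustrative): a numa_group
with 80 tasks (refcount = 80) whose faults are mostly shared, say
shared = 3000 and private = 1000, ends up with a per-task period of
smin * 80 * 3001 / 4001, roughly 60 * smin. The 80 tasks together
then scan the group's memory at about the rate a single task would
on its own.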

Signed-off-by: Rik van Riel <r...@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Acked-by: Mel Gorman <mgor...@suse.de>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: jhla...@redhat.com
Cc: lvena...@redhat.com
Link: http://lkml.kernel.org/r/20170731192847.23050-3-r...@redhat.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 kernel/sched/fair.c | 111 
 1 file changed, 86 insertions(+), 25 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cb6b7c8..a7f1c3b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1071,6 +1071,29 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+struct numa_group {
+   atomic_t refcount;
+
+   spinlock_t lock; /* nr_tasks, tasks */
+   int nr_tasks;
+   pid_t gid;
+   int active_nodes;
+
+   struct rcu_head rcu;
+   unsigned long total_faults;
+   unsigned long max_faults_cpu;
+   /*
+* Faults_cpu is used to decide whether memory should move
+* towards the CPU. As a consequence, these stats are weighted
+* more by CPU use than by memory faults.
+*/
+   unsigned long *faults_cpu;
+   unsigned long faults[0];
+};
+
+static inline unsigned long group_faults_priv(struct numa_group *ng);
+static inline unsigned long group_faults_shared(struct numa_group *ng);
+
 static unsigned int task_nr_scan_windows(struct task_struct *p)
 {
unsigned long rss = 0;
@@ -1107,13 +1130,47 @@ static unsigned int task_scan_min(struct task_struct *p)
return max_t(unsigned int, floor, scan);
 }
 
+static unsigned int task_scan_start(struct task_struct *p)
+{
+   unsigned long smin = task_scan_min(p);
+   unsigned long period = smin;
+
+   /* Scale the maximum scan period with the amount of shared memory. */
+   if (p->numa_group) {
+   struct numa_group *ng = p->numa_group;
+   unsigned long shared = group_faults_shared(ng);
+   unsigned long private = group_faults_priv(ng);
+
+   period *= atomic_read(&ng->refcount);
+   period *= shared + 1;
+   period /= private + shared + 1;
+   }
+
+   return max(smin, period);
+}
+
 static unsigned int task_scan_max(struct task_struct *p)
 {
-   unsigned int smin = task_scan_min(p);
-   unsigned int smax;
+   unsigned long smin = task_scan_min(p);
+   unsigned long smax;
 
/* Watch for min being lower than max due to floor calculations */
smax = sysctl_numa_balancing_scan_period_max / task_nr_scan_windows(p);
+
+   /* Scale the maximum scan period with the amount of shared memory. */
+   if (p->numa_group) {
+   struct numa_group *ng = p->numa_group;
+   unsigned long shared = group_faults_shared(ng);
+   unsigned long private = group_faults_priv(ng);
+   unsigned long period = smax;
+
+   period *= atomic_read(&ng->refcount);
+   period *= shared + 1;
+   period /= private + shared + 1;
+
+   smax = max(smax, period);
+   }
+
return max(smin, smax);
 }
 
@@ -1129,26 +1186,6 @@ static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
rq->nr_preferred_running -= (p->numa_preferred_nid == task_node(p));
 }
 
-struct numa_group {
-   atomic_t refcount;
-
-   spinlock_t lock; /* nr_tasks, tasks */
-   int nr_tasks;
-   pid_t gid;
-   int active_nodes;
-
-   struct rcu_head rcu;
-   unsigned long total_faults;
-   unsigned long max_faults_cpu;
-   /*
-* Faults_cpu is used to decide whether memory should move
-* towards the CPU. As a consequence, these stats are weighted
-* more by CPU use than by memory faults.
-*/
-   unsigned long *faults_cpu;
-   unsigned long faults[0];
-};
-
 /* Shared or private faults. */
 #define NR_NUMA_HINT_FAULT_TYPES 2
 
@@ -1198,6 +1235,30 @@ static inline unsigned long group_faults_cpu(struct numa_group *group, int ni


Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK

2017-08-09 Thread Rik van Riel
On Wed, 2017-08-09 at 12:59 +0300, Kirill A. Shutemov wrote:
> On Mon, Aug 07, 2017 at 10:59:51AM -0400, Rik van Riel wrote:
> > On Mon, 2017-08-07 at 15:46 +0200, Michal Hocko wrote:
> > > On Mon 07-08-17 15:22:57, Michal Hocko wrote:
> > > > This is an user visible API so make sure you CC linux-api
> > > > (added)
> > > > 
> > > > On Sun 06-08-17 10:04:23, Rik van Riel wrote:
> > > > > 
> > > > > A further complication is the proliferation of clone flags,
> > > > > programs bypassing glibc's functions to call clone directly,
> > > > > and programs calling unshare, causing the glibc
> > > > > pthread_atfork
> > > > > hook to not get called.
> > > > > 
> > > > > It would be better to have the kernel take care of this
> > > > > automatically.
> > > > > 
> > > > > This is similar to the OpenBSD minherit syscall with
> > > > > MAP_INHERIT_ZERO:
> > > > > 
> > > > > https://man.openbsd.org/minherit.2
> > > 
> > > I would argue that a MAP_$FOO flag would be more appropriate. Or
> > > do
> > > you
> > > see any cases where such a special mapping would need to change
> > > the
> > > semantic and inherit the content over the fork again?
> > > 
> > > I do not like the madvise because it is an advise and as such it
> > > can
> > > be
> > > ignored/not implemented and that shouldn't have any correctness
> > > effects
> > > on the child process.
> > 
> > Too late for that. VM_DONTFORK is already implemented
> > through MADV_DONTFORK & MADV_DOFORK, in a way that is
> > very similar to the MADV_WIPEONFORK from these patches.
> 
> It's not obvious to me what would break if kernel would ignore
> MADV_DONTFORK or MADV_DONTDUMP.
> 
You might end up with multiple processes holding open a device
that can only handle one process at a time (sketched below).

Another thing that could go wrong is that if overcommit_memory=2,
a very large process with MADV_DONTFORK on a large memory area
suddenly fails to fork (due to there not being enough available
memory), and is unable to start a helper process.
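
A sketch of that device pattern (the device path is hypothetical,
and error handling is omitted):

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	static void *map_exclusive_device(void)
	{
		int fd = open("/dev/some_device", O_RDWR);
		void *buf = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
				 MAP_SHARED, fd, 0);

		/*
		 * Keep the mapping out of any future child: the device
		 * can only cope with one process using it at a time.
		 * After fork(), the child simply has no mapping here.
		 */
		madvise(buf, 1 << 20, MADV_DONTFORK);
		return buf;
	}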


Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK

2017-08-08 Thread Rik van Riel
On Tue, 2017-08-08 at 09:52 -0700, Matthew Wilcox wrote:
> On Tue, Aug 08, 2017 at 11:46:08AM -0400, Rik van Riel wrote:
> > On Tue, 2017-08-08 at 08:19 -0700, Mike Kravetz wrote:
> > > If the use case is fairly specific, then perhaps it makes sense
> > > to
> > > make MADV_WIPEONFORK not applicable (EINVAL) for mappings where
> > > the
> > > result is 'questionable'.
> > 
> > That would be a question for Florian and Colm.
> > 
> > If they are OK with MADV_WIPEONFORK only working on
> > anonymous VMAs (no file mapping), that certainly could
> > be implemented.
> > 
> > On the other hand, I am not sure that introducing cases
> > where MADV_WIPEONFORK does not implement wipe-on-fork
> > semantics would reduce user confusion...
> 
> It'll simply do exactly what it does today, so it won't introduce any
> new fallback code.

Sure, but actually implementing MADV_WIPEONFORK in a
way that turns file mapped VMAs into zero page backed
anonymous VMAs after fork takes no more code than
implementing it in a way that refuses to work on VMAs
that have a file backing.

There is no complexity argument for or against either
approach.

The big question is, what is the best for users?

Should we return -EINVAL when MADV_WIPEONFORK is called
on a VMA that has a file backing, and only succeed on
anonymous VMAs?

Or, should we simply turn every memory range that has
MADV_WIPEONFORK done to it into an anonymous VMA in the
child process?


Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK

2017-08-08 Thread Rik van Riel
On Tue, 2017-08-08 at 08:19 -0700, Mike Kravetz wrote:

> The other question I was trying to bring up is "What does
> MADV_WIPEONFORK
> mean for various types of mappings?"  For example, if we allow
> MADV_WIPEONFORK on a file backed mapping what does that mapping look
> like in the child after fork?  Does it have any connection at all to
> the
> file?  Or, do we drop all references to the file and essentially
> transform
> it to a private (or shared?) anonymous mapping after fork.  What
> about
> System V shared memory?  What about hugetlb?

My current patch turns any file-backed VMA into an empty
anonymous VMA if MADV_WIPEONFORK was used on that VMA.

> If the use case is fairly specific, then perhaps it makes sense to
> make MADV_WIPEONFORK not applicable (EINVAL) for mappings where the
> result is 'questionable'.

That would be a question for Florian and Colm.

If they are OK with MADV_WIPEONFORK only working on
anonymous VMAs (no file mapping), that certainly could
be implemented.

On the other hand, I am not sure that introducing cases
where MADV_WIPEONFORK does not implement wipe-on-fork
semantics would reduce user confusion...


Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK

2017-08-08 Thread Rik van Riel
On Tue, 2017-08-08 at 11:58 +0200, Florian Weimer wrote:
> On 08/07/2017 08:23 PM, Mike Kravetz wrote:
> > If my thoughts above are correct, what about returning EINVAL if
> > one
> > attempts to set MADV_DONTFORK on mappings set up for sharing?
> 
> That's my preference as well.  If there is a use case for shared or
> non-anonymous mappings, then we can implement MADV_DONTFORK with the
> semantics for this use case.  If we pick some arbitrary semantics
> now,
> without any use case, we might end up with something that's not
> actually
> useful.

MADV_DONTFORK is existing semantics, and it is enforced
on shared, non-anonymous mappings. It is frequently used
for things like device mappings, which should not be
inherited by a child process, because the device can only
be used by one process at a time.

When someone requests MADV_DONTFORK on a shared VMA, they
will get it. The later madvise request overrides the mmap
flags that were used earlier.

The question is, should MADV_WIPEONFORK (introduced by
this series) have not just different semantics, but also
totally different behavior from MADV_DONTFORK?

Does the principle of least surprise dictate that the
last request determines the policy on an area, or should
later requests not be able to override policy that was
set at mmap time?




Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK

2017-08-07 Thread Rik van Riel
On Mon, 2017-08-07 at 15:46 +0200, Michal Hocko wrote:
> On Mon 07-08-17 15:22:57, Michal Hocko wrote:
> > This is an user visible API so make sure you CC linux-api (added)
> > 
> > On Sun 06-08-17 10:04:23, Rik van Riel wrote:
> > > 
> > > A further complication is the proliferation of clone flags,
> > > programs bypassing glibc's functions to call clone directly,
> > > and programs calling unshare, causing the glibc pthread_atfork
> > > hook to not get called.
> > > 
> > > It would be better to have the kernel take care of this
> > > automatically.
> > > 
> > > This is similar to the OpenBSD minherit syscall with
> > > MAP_INHERIT_ZERO:
> > > 
> > > https://man.openbsd.org/minherit.2
> 
> I would argue that a MAP_$FOO flag would be more appropriate. Or do
> you
> see any cases where such a special mapping would need to change the
> semantic and inherit the content over the fork again?
> 
> I do not like the madvise because it is an advise and as such it can
> be
> ignored/not implemented and that shouldn't have any correctness
> effects
> on the child process.

Too late for that. VM_DONTFORK is already implemented
through MADV_DONTFORK & MADV_DOFORK, in a way that is
very similar to the MADV_WIPEONFORK from these patches.

I wonder if that was done because MAP_* flags are a
bitmap, with a very limited number of values as a result,
while MADV_* constants have an essentially unlimited
numerical namespace available.


Re: [PATCH 0/2] mm,fork,security: introduce MADV_WIPEONFORK

2017-08-05 Thread Rik van Riel
On Sat, 2017-08-05 at 02:44 +0300, Kirill A. Shutemov wrote:
> On Fri, Aug 04, 2017 at 03:07:28PM -0400, r...@redhat.com wrote:
> > [resend because half the recipients got dropped due to IPv6
> > firewall issues]
> > 
> > Introduce MADV_WIPEONFORK semantics, which result in a VMA being
> > empty in the child process after fork. This differs from
> > MADV_DONTFORK
> > in one important way.
> > 
> > If a child process accesses memory that was MADV_WIPEONFORK, it
> > will get zeroes. The address ranges are still valid, they are just
> > empty.
> 
> I feel like we are repeating mistake we made with MADV_DONTNEED.
> 
> MADV_WIPEONFORK would require a specific action from kernel, ignoring
> the /advise/ would likely lead to application misbehaviour.
> 
> Is it something we really want to see from madvise()?

We already have various mandatory madvise behaviors in Linux,
including MADV_REMOVE, MADV_DONTFORK, and MADV_DONTDUMP.
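
A minimal userspace demonstration of the semantics quoted above (this
assumes a kernel with the patch applied; the MADV_WIPEONFORK value is
taken from the series in case installed headers lack it, and error
handling is omitted):

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <unistd.h>

	#ifndef MADV_WIPEONFORK
	#define MADV_WIPEONFORK 18	/* value from the patch series */
	#endif

	int main(void)
	{
		char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		strcpy(p, "session secret");
		madvise(p, 4096, MADV_WIPEONFORK);

		if (fork() == 0) {
			/* Still mapped and valid, but reads back zeroes. */
			printf("child:  \"%s\"\n", p);	/* prints "" */
			_exit(0);
		}
		wait(NULL);
		printf("parent: \"%s\"\n", p);	/* prints "session secret" */
		return 0;
	}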



Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-05 Thread Rik van Riel
On Fri, 2017-08-04 at 16:09 -0700, Mike Kravetz wrote:
> On 08/04/2017 12:07 PM, r...@redhat.com wrote:
> > From: Rik van Riel <r...@redhat.com>
> > 
> > Introduce MADV_WIPEONFORK semantics, which result in a VMA being
> > empty in the child process after fork. This differs from
> > MADV_DONTFORK
> > in one important way.
> > 
> > If a child process accesses memory that was MADV_WIPEONFORK, it
> > will get zeroes. The address ranges are still valid, they are just
> > empty.
> > 
> This didn't seem 'quite right' to me for shared mappings and/or file
> backed mappings.  I wasn't exactly sure what it 'should' do in such
> cases.  So, I tried it with a mapping created as follows:
> 
> addr = mmap(ADDR, page_size,
> PROT_READ | PROT_WRITE,
> MAP_ANONYMOUS|MAP_SHARED, -1, 0);

Your test program is pretty much the same as the one I used, except
I used MAP_PRIVATE instead of MAP_SHARED.

Let me see how the code paths differ for both cases...


> When setting MADV_WIPEONFORK on the vma/mapping, I got the following
> at task exit time:
> 
> [  694.558290] [ cut here ]
> [  694.558978] kernel BUG at mm/filemap.c:212!
> [  694.559476] invalid opcode:  [#1] SMP
> [  694.560023] Modules linked in: ip6t_REJECT nf_reject_ipv6
> ip6t_rpfilter xt_conntrack ebtable_broute bridge stp llc ebtable_nat
> ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6
> ip6table_raw ip6table_mangle ip6table_security iptable_nat
> nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack
> iptable_raw iptable_mangle 9p iptable_security ebtable_filter
> ebtables ip6table_filter ip6_tables snd_hda_codec_generic
> snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_seq ppdev
> snd_seq_device joydev crct10dif_pclmul crc32_pclmul crc32c_intel
> snd_pcm ghash_clmulni_intel 9pnet_virtio virtio_balloon snd_timer
> 9pnet parport_pc snd parport i2c_piix4 soundcore nfsd auth_rpcgss
> nfs_acl lockd grace sunrpc virtio_net virtio_blk virtio_console
> 8139too qxl drm_kms_helper ttm drm serio_raw 8139cp
> [  694.571554]  mii virtio_pci ata_generic virtio_ring virtio
> pata_acpi
> [  694.572608] CPU: 3 PID: 1200 Comm: test_wipe2 Not tainted 4.13.0-
> rc3+ #8
> [  694.573778] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.9.1-1.fc24 04/01/2014
> [  694.574917] task: 880137178040 task.stack: c900019d4000
> [  694.575650] RIP: 0010:__delete_from_page_cache+0x344/0x410
> [  694.576409] RSP: 0018:c900019d7a88 EFLAGS: 00010082
> [  694.577238] RAX: 0021 RBX: ea00047d0e00 RCX:
> 0006
> [  694.578537] RDX:  RSI: 0096 RDI:
> 88023fd0db90
> [  694.579774] RBP: c900019d7ad8 R08: 000882b6 R09:
> 028a
> [  694.580754] R10: c900019d7da8 R11: 8211184d R12:
> ea00047d0e00
> [  694.582040] R13:  R14: 0202 R15:
> 8801384439e8
> [  694.583236] FS:  () GS:88023fd0()
> knlGS:
> [  694.584607] CS:  0010 DS:  ES:  CR0: 80050033
> [  694.585409] CR2: 7ff77a8da618 CR3: 01e09000 CR4:
> 001406e0
> [  694.586547] Call Trace:
> [  694.586996]  delete_from_page_cache+0x54/0x110
> [  694.587481]  truncate_inode_page+0xab/0x120
> [  694.588110]  shmem_undo_range+0x498/0xa50
> [  694.588813]  ? save_stack_trace+0x1b/0x20
> [  694.589529]  ? set_track+0x70/0x140
> [  694.590150]  ? init_object+0x69/0xa0
> [  694.590722]  ? __inode_wait_for_writeback+0x73/0xe0
> [  694.591525]  shmem_truncate_range+0x16/0x40
> [  694.592268]  shmem_evict_inode+0xb1/0x190
> [  694.592735]  evict+0xbb/0x1c0
> [  694.593147]  iput+0x1c0/0x210
> [  694.593497]  dentry_unlink_inode+0xb4/0x150
> [  694.593982]  __dentry_kill+0xc1/0x150
> [  694.594400]  dput+0x1c8/0x1e0
> [  694.594745]  __fput+0x172/0x1e0
> [  694.595103]  fput+0xe/0x10
> [  694.595463]  task_work_run+0x80/0xa0
> [  694.595886]  do_exit+0x2d6/0xb50
> [  694.596323]  ? __do_page_fault+0x288/0x4a0
> [  694.596818]  do_group_exit+0x47/0xb0
> [  694.597249]  SyS_exit_group+0x14/0x20
> [  694.597682]  entry_SYSCALL_64_fastpath+0x1a/0xa5
> [  694.598198] RIP: 0033:0x7ff77a5e78c8
> [  694.598612] RSP: 002b:7ffc5aece318 EFLAGS: 0246 ORIG_RAX:
> 00e7
> [  694.599804] RAX: ffda RBX:  RCX:
> 7ff77a5e78c8
> [  694.600609] RDX:  RSI: 003c RDI:
> 
> [  694.601424] RBP: 7ff77a8da618 R08: 00e7 R09:
> ff98
> [  694.602224] R10: 0003 R11: 0246 R12:
> 0001

Re: [PATCH -mm -v3 1/6] mm, swap: Add swap cache statistics sysfs interface

2017-07-25 Thread Rik van Riel
On Tue, 2017-07-25 at 09:51 +0800, Huang, Ying wrote:
> From: Huang Ying 
> 
> The swap cache stats can currently be obtained only via sysrq, which
> isn't convenient in some situations.  So a sysfs interface for the
> swap cache stats is added.  The added sysfs directories/files are as
> follows:
> 
> /sys/kernel/mm/swap
> /sys/kernel/mm/swap/cache_find_total
> /sys/kernel/mm/swap/cache_find_success
> /sys/kernel/mm/swap/cache_add
> /sys/kernel/mm/swap/cache_del
> /sys/kernel/mm/swap/cache_pages
> 
What is the advantage of this vs new fields in
/proc/vmstat, which is where most of the VM
statistics seem to live?

-- 
All rights reversed

signature.asc
Description: This is a digitally signed message part


Re: [PATCH -mm -v3 02/12] mm, THP, swap: Support to reclaim swap space for THP swapped out

2017-07-25 Thread Rik van Riel
On Mon, 2017-07-24 at 13:18 +0800, Huang, Ying wrote:
> From: Huang Ying <ying.hu...@intel.com>
> 
> The normal swap slot reclaiming can be done when the swap count
> reaches SWAP_HAS_CACHE.  But for the swap slot which is backing a
> THP,
> all swap slots backing one THP must be reclaimed together, because
> the
> swap slot may be used again when the THP is swapped out again later.
> So the swap slots backing one THP can be reclaimed together when the
> swap count for all swap slots for the THP reached SWAP_HAS_CACHE.  In
> the patch, the functions to check whether the swap count for all swap
> slots backing one THP reached SWAP_HAS_CACHE are implemented and used
> when checking whether a swap slot can be reclaimed.
> 
> To make it easier to determine whether a swap slot is backing a THP,
> a
> new swap cluster flag named CLUSTER_FLAG_HUGE is added to mark a swap
> cluster which is backing a THP (Transparent Huge Page).  Because
> swapping in a THP as a whole isn't supported yet, the
> CLUSTER_FLAG_HUGE flag will be cleared after the THP is deleted from
> the swap cache (for example, when swap-out has finished), so that the
> normal pages inside the THP can be swapped in individually.
> 
> Signed-off-by: "Huang, Ying" <ying.hu...@intel.com>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: Minchan Kim <minc...@kernel.org>
> Cc: Hugh Dickins <hu...@google.com>
> Cc: Shaohua Li <s...@kernel.org>
> Cc: Rik van Riel <r...@redhat.com>
> 
Acked-by: Rik van Riel <r...@redhat.com>

-- 
All rights reversed
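
To make the reclaim-together rule concrete, here is a minimal
kernel-internal sketch of the kind of check being described, assuming
SWAPFILE_CLUSTER slots back one THP (the helper name is illustrative,
not the patch's):

        /*
         * Sketch only: true when every swap slot backing the THP has
         * reached SWAP_HAS_CACHE, i.e. the swap cache holds the only
         * reference and the whole cluster may be reclaimed together.
         */
        static bool thp_swap_cache_only(struct swap_info_struct *si,
                                        unsigned long offset)
        {
                int i;

                offset = round_down(offset, SWAPFILE_CLUSTER);
                for (i = 0; i < SWAPFILE_CLUSTER; i++)
                        if (si->swap_map[offset + i] != SWAP_HAS_CACHE)
                                return false;
                return true;
        }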



Re: [PATCH -mm -v3 01/12] mm, THP, swap: Support to clear swap cache flag for THP swapped out

2017-07-25 Thread Rik van Riel
On Mon, 2017-07-24 at 13:18 +0800, Huang, Ying wrote:
> From: Huang Ying <ying.hu...@intel.com>
> 
> Previously, swapcache_free_cluster() was used only in the error path
> of shrink_page_list() to free the swap cluster just allocated if the
> THP (Transparent Huge Page) fails to be split.  In this patch, it
> is enhanced to clear the swap cache flag (SWAP_HAS_CACHE) for the
> swap
> cluster that holds the contents of THP swapped out.
> 
> This will be used by the support for delaying the splitting of a THP
> after swapping it out.  Because there is no support for swapping in a
> THP as a whole yet, after clearing the swap cache flag, the swap
> cluster backing the swapped-out THP will be split, so that the swap
> slots in the swap cluster can be swapped in as normal pages later.
> 
> Signed-off-by: "Huang, Ying" <ying.hu...@intel.com>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: Minchan Kim <minc...@kernel.org>
> Cc: Hugh Dickins <hu...@google.com>
> Cc: Shaohua Li <s...@kernel.org>
> Cc: Rik van Riel <r...@redhat.com>
> 

Acked-by: Rik van Riel <r...@redhat.com>

-- 
All rights reversed



Re: [RFC][PATCH] exec: Use init rlimits for setuid exec

2017-07-10 Thread Rik van Riel
On Mon, 2017-07-10 at 20:16 +0200, Michal Hocko wrote:

> OK, I misread the code. 32b applications on 64b systems do top down
> by
> default and only if they override this by ADDR_COMPAT_LAYOUT
> personality. For some reason I thought that 32b userspace goes a
> different path and makes sure that they are always doing bottom up.
> 
> Anyway even if somebody really needs to grow stack really large we
> have
> the personality to give them the legacy layout.

I think what will happen when rlimit_stack is RLIMIT_INFINITY
is that mmap_base() will end up placing mm->mmap_base at 512MB
(task_size / 6 * 5 below the top of the address space) for 32 bit
kernels, and we eventually fall back to a bottom-up search
if the space below mmap_base is exhausted (if it ever is).
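
A simplified sketch of the placement logic being described, loosely
following the x86 mmap_base() code (the clamp against the minimum gap
and the exact constants are omitted):

        /*
         * Sketch only: with an unlimited stack rlimit the gap is
         * clamped to TASK_SIZE / 6 * 5, leaving mmap_base roughly at
         * TASK_SIZE / 6 (about 512MB on a 32-bit, 3GB task size).
         */
        static unsigned long mmap_base_sketch(unsigned long rnd,
                                              unsigned long stack_rlim)
        {
                unsigned long gap = stack_rlim;
                unsigned long max_gap = TASK_SIZE / 6 * 5;

                if (gap > max_gap)      /* RLIM_INFINITY lands here */
                        gap = max_gap;

                return PAGE_ALIGN(TASK_SIZE - gap - rnd);
        }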


Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

2017-07-10 Thread Rik van Riel
On Mon, 2017-07-10 at 09:48 +0200, Michal Hocko wrote:

> Johannes and Rik had some concerns that this could lead to premature
> OOM kills. I agree with them that we need a better throttling
> mechanism. Until now we didn't give the issue described above a high
> priority because it usually required a really insane workload to
> trigger. But it seems that the issue can be reproduced also without
> having an insane number of competing threads [3].

My worries stand, but let's fix the real observed bug, and not worry
too much about the theoretical bug for now.

Acked-by: Rik van Riel <r...@redhat.com>


Re: [PATCH v3] mm: Add SLUB free list pointer obfuscation

2017-07-06 Thread Rik van Riel
On Thu, 2017-07-06 at 10:55 -0500, Christoph Lameter wrote:
> On Thu, 6 Jul 2017, Kees Cook wrote:
> 
> > On Thu, Jul 6, 2017 at 6:43 AM, Christoph Lameter 
> > wrote:
> > > On Wed, 5 Jul 2017, Kees Cook wrote:
> > > 
> > > > @@ -3536,6 +3565,9 @@ static int kmem_cache_open(struct
> > > > kmem_cache *s, unsigned long flags)
> > > >  {
> > > >   s->flags = kmem_cache_flags(s->size, flags, s->name, s-
> > > > >ctor);
> > > >   s->reserved = 0;
> > > > +#ifdef CONFIG_SLAB_FREELIST_HARDENED
> > > > + s->random = get_random_long();
> > > > +#endif
> > > > 
> > > >   if (need_reserve_slab_rcu && (s->flags &
> > > > SLAB_TYPESAFE_BY_RCU))
> > > >   s->reserved = sizeof(struct rcu_head);
> > > > 
> > > 
> > > So if an attacker knows the internal structure of data then he
> > > can simply
> > > dereference page->kmem_cache->random to decode the freepointer.
> > 
> > That requires a series of arbitrary reads. This is protecting
> > against
> > attacks that use an adjacent slab object write overflow to write
> > the
> > freelist pointer. This internal structure is very reliable, and has
> > been the basis of freelist attacks against the kernel for a decade.
> 
> These reads are not arbitrary. You can usually calculate the page
> struct
> address easily from the address and then do a couple of loads to get
> there.
> 
> Ok so you get rid of the old attacks because we did not have that
> hardening in effect when they designed their approaches?

The hardening protects against situations where
attackers do not yet have arbitrary code execution and
memory read access in the kernel, with the goal
of preventing them from acquiring those abilities.

> > It is a probabilistic defense, but then so is the stack protector.
> > This is a similar defense; while not perfect it makes the class of
> > attack much more difficult to mount.
> 
> Na I am not convinced of the "much more difficult". Maybe they will
> just
> have to upgrade their approaches to fetch the proper values to
> decode.

Easier said than done. Most of the time that an
unpatched vulnerability is outstanding, it is the only
known issue, until the user updates the kernel to a
version that does not have that issue.

Bypassing kernel hardening typically requires the
use of multiple vulnerabilities, and the absence of
roadblocks (like hardening) that make a type of
vulnerability exploitable.

Between usercopy hardening, and these slub freelist
canaries (which is what they effectively are), several
classes of exploits are no longer usable.

-- 
All rights reversed
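
The scheme under discussion boils down to storing the free pointer
XORed with a per-cache secret and its own storage address.  A sketch
along those lines, as one way to implement it (the revision under
discussion may differ in detail):

        /*
         * Sketch only: a blind overwrite of the stored freelist
         * pointer now decodes to garbage instead of an
         * attacker-chosen address.
         */
        static inline void *freelist_ptr(const struct kmem_cache *s,
                                         void *ptr,
                                         unsigned long ptr_addr)
        {
        #ifdef CONFIG_SLAB_FREELIST_HARDENED
                return (void *)((unsigned long)ptr ^ s->random ^ ptr_addr);
        #else
                return ptr;
        #endif
        }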



Re: [tip:sched/core] sched/cputime: Refactor the cputime_adjust() code

2017-06-30 Thread Rik van Riel
On Fri, 2017-06-30 at 06:10 -0700, tip-bot for Gustavo A. R. Silva
wrote:

> +++ b/kernel/sched/cputime.c
> @@ -615,19 +615,13 @@ static void cputime_adjust(struct task_cputime
> *curr,
>    * userspace. Once a task gets some ticks, the monotonicy
> code at
>    * 'update' will ensure things converge to the observed
> ratio.
>    */
> - if (stime == 0) {
> - utime = rtime;
> - goto update;
> + if (stime != 0) {
> + if (utime == 0)
> + stime = rtime;
> + else
> + stime = scale_stime(stime, rtime, stime +
> utime);
>   }
>  
> - if (utime == 0) {
> - stime = rtime;
> - goto update;
> - }
> -
> - stime = scale_stime(stime, rtime, stime + utime);
> -
> -update:

Wait, what?

This gets rid of the utime = rtime assignment when
stime == 0.  That could be a correctness issue.

>   /*
>    * Make sure stime doesn't go backwards; this preserves
> monotonicity
>    * for utime because rtime is monotonic.
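
A structure that keeps both special cases intact could look like this
(a sketch, not necessarily the committed fix):

        if (stime == 0)
                utime = rtime;
        else if (utime == 0)
                stime = rtime;
        else
                stime = scale_stime(stime, rtime, stime + utime);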


Re: [PATCH 5/5] sched: Accumulate vtime on top of nsec clocksource

2017-06-29 Thread Rik van Riel
On Thu, 2017-06-29 at 19:15 +0200, Frederic Weisbecker wrote:
> From: Wanpeng Li <kernel...@gmail.com>
> 
> Currently the cputime source used by vtime is jiffies. When we cross
> a context boundary and jiffies have changed since the last snapshot,
> the
> pending cputime is accounted to the switching out context.
> 
> This system works ok if the ticks are not aligned across CPUs. If
> they
> instead are aligned (ie: all fire at the same time) and the CPUs run
> in
> userspace, the jiffies change is only observed on tick exit and
> therefore
> the user cputime is accounted as system cputime. This is because the
> CPU that maintains timekeeping fires its tick at the same time as the
> others. It updates jiffies in the middle of the tick and the other
> CPUs
> see that update on IRQ exit:
> 
> CPU 0 (timekeeper)               CPU 1
> ------------------               -----
>                                  jiffies = N
> ...                              run in userspace for a jiffy
> tick entry                       tick entry (sees jiffies = N)
> set jiffies = N + 1
> tick exit                        tick exit (sees jiffies = N + 1)
>                                  account 1 jiffy as stime
> 
> Fix this by using a nanosec clock source instead of jiffies. The
> cputime is then accumulated and flushed every time the pending delta
> reaches a jiffy, in order to mitigate the accounting overhead.

Glad to hear this could be done without dramatically
increasing the accounting overhead!

> [fweisbec: changelog, rebase on struct vtime, field renames, add
> delta
> on cputime readers, keep idle vtime as-is (low overhead accounting),
> harmonize clock sources]
> 
> Reported-by: Luiz Capitulino <lcapitul...@redhat.com>
> Suggested-by: Thomas Gleixner <t...@linutronix.de>
> Not-Yet-Signed-off-by: Wanpeng Li <kernel...@gmail.com>
> Cc: Rik van Riel <r...@redhat.com>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Thomas Gleixner <t...@linutronix.de>
> Cc: Wanpeng Li <kernel...@gmail.com>
> Cc: Ingo Molnar <mi...@kernel.org>
> Cc: Luiz Capitulino <lcapitul...@redhat.com>
> Signed-off-by: Frederic Weisbecker <fweis...@gmail.com>

Acked-by: Rik van Riel <r...@redhat.com>
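
The accumulate-and-flush idea can be sketched as follows; the names
approximate the series (struct vtime comes from the previous patch),
and the seqcount and guest-time handling are left out:

        static u64 vtime_delta(struct vtime *vtime)
        {
                return local_clock() - vtime->starttime;
        }

        /*
         * Sketch only: fold the pending nanoseconds into the task's
         * cputime only once a full jiffy worth has accumulated, so
         * the per-context-switch overhead stays low.
         */
        static void vtime_account_user_sketch(struct task_struct *tsk,
                                              struct vtime *vtime)
        {
                vtime->utime += vtime_delta(vtime);
                vtime->starttime = local_clock();

                if (vtime->utime >= TICK_NSEC) {
                        account_user_time(tsk, vtime->utime);
                        vtime->utime = 0;
                }
        }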



Re: [PATCH 4/5] sched: Move vtime task fields to their own struct

2017-06-29 Thread Rik van Riel
On Thu, 2017-06-29 at 19:15 +0200, Frederic Weisbecker wrote:
> We are about to add vtime accumulation fields to the task struct.
> Let's avoid further bloat and gather the vtime information into its
> own struct.
> 
> Cc: Wanpeng Li <kernel...@gmail.com>
> Cc: Rik van Riel <r...@redhat.com>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Thomas Gleixner <t...@linutronix.de>
> Cc: Ingo Molnar <mi...@kernel.org>
> Cc: Luiz Capitulino <lcapitul...@redhat.com>
> Signed-off-by: Frederic Weisbecker <fweis...@gmail.com>

Acked-by: Rik van Riel <r...@redhat.com>
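
Roughly, the consolidation gathers the bookkeeping into something like
the following (a sketch; the field names approximate the series):

        struct vtime {
                seqcount_t              seqcount;       /* protects readers */
                unsigned long long      starttime;      /* last snapshot, ns */
                enum vtime_state        state;          /* sys/user/guest */
                u64                     utime;
                u64                     stime;
                u64                     gtime;
        };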


Re: [PATCH 3/5] sched: Rename vtime fields

2017-06-29 Thread Rik van Riel
On Thu, 2017-06-29 at 19:15 +0200, Frederic Weisbecker wrote:
> The current "snapshot" based naming on vtime fields suggests we
> record
> some past event but that's a low level picture of their actual
> purpose
> which comes out blurry. The real point of these fields is to run a
> basic
> state machine that tracks down cputime entry while switching between
> contexts.
> 
> So lets reflect that with more meaningful names.
> 
> Cc: Wanpeng Li <kernel...@gmail.com>
> Cc: Rik van Riel <r...@redhat.com>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Thomas Gleixner <t...@linutronix.de>
> Cc: Ingo Molnar <mi...@kernel.org>
> Cc: Luiz Capitulino <lcapitul...@redhat.com>
> Signed-off-by: Frederic Weisbecker <fweis...@gmail.com>

Acked-by: Rik van Riel <r...@redhat.com>


Re: [PATCH 2/5] sched: Always set vtime_snap_whence after accounting vtime

2017-06-29 Thread Rik van Riel
On Thu, 2017-06-29 at 19:15 +0200, Frederic Weisbecker wrote:
> Even though it doesn't have functional consequences, setting
> the task's new context state after we actually accounted the pending
> vtime from the old context state makes more sense from a review
> perspective.
> 
> vtime_user_exit() is the only function that doesn't follow that rule
> and that can bug the reviewer for a little while until he realizes
> there
> is no reason for this special case.
> 
> Cc: Wanpeng Li <kernel...@gmail.com>
> Cc: Rik van Riel <r...@redhat.com>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Thomas Gleixner <t...@linutronix.de>
> Cc: Ingo Molnar <mi...@kernel.org>
> Cc: Luiz Capitulino <lcapitul...@redhat.com>
> Signed-off-by: Frederic Weisbecker <fweis...@gmail.com>

Acked-by: Rik van Riel <r...@redhat.com>


Re: [PATCH 1/5] vtime: Remove vtime_account_user()

2017-06-29 Thread Rik van Riel
On Thu, 2017-06-29 at 19:15 +0200, Frederic Weisbecker wrote:
> It's an unnecessary function between vtime_user_exit() and
> account_user_time().
> 
> Cc: Wanpeng Li <kernel...@gmail.com>
> Cc: Rik van Riel <r...@redhat.com>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Thomas Gleixner <t...@linutronix.de>
> Cc: Ingo Molnar <mi...@kernel.org>
> Cc: Luiz Capitulino <lcapitul...@redhat.com>
> Signed-off-by: Frederic Weisbecker <fweis...@gmail.com>

Acked-by: Rik van Riel <r...@redhat.com>


Re: [PATCH v2] mm: Add SLUB free list pointer obfuscation

2017-06-29 Thread Rik van Riel
On Thu, 2017-06-29 at 10:47 -0700, Kees Cook wrote:
> On Thu, Jun 29, 2017 at 10:05 AM, Christoph Lameter 
> wrote:
> > On Sun, 25 Jun 2017, Kees Cook wrote:
> > 
> > > The difference gets lost in the noise, but if the above is
> > > sensible,
> > > it's 0.07% slower. ;)
> > 
> > Hmmm... These differences add up. Also in a repetitive benchmark
> > like that
> > you do not see the impact that the additional cacheline use in the
> > cpu
> > cache has on larger workloads. Those may be pushed over the edge of
> > l1 or
> > l2 capacity at some point which then causes drastic regressions.
> 
> Even if that is true, it may be worth it to some people to have the
> protection. Given that it significantly hampers a large class of heap
> overflow attacks[1], I think it's an important change to have. I'm
> not
> suggesting this be on by default, it's cleanly behind
> CONFIG-controlled macros, and is very limited in scope. If you can
> Ack
> it we can let system builders decide if they want to risk a possible
> performance hit. I'm pretty sure most distros would like to have this
> protection.

I could certainly see it being useful for all kinds of portable
and network-connected systems where security is simply much
more important than performance.




Re: [PATCH] sched/numa: fix build without CONFIG_NUMA_BALANCING

2017-06-28 Thread Rik van Riel
On Wed, 2017-06-28 at 22:05 +0200, Arnd Bergmann wrote:
> I ran into a build error:
> 
> kernel/sched/fair.c:2655:44: error: 'struct sched_domain' declared
> inside parameter list will not be visible outside of this definition
> or declaration [-Werror]
> 
> Adding a forward declaration for the struct name is sufficient
> to avoid it.
> 
> Fixes: 3fed382b46ba ("sched/numa: Implement NUMA node level
> wake_affine()")
> Signed-off-by: Arnd Bergmann 

Peter pointed out another problem with that patch.

I am running a few tests now on a kernel with that
other bug fixed, which should also magically get
rid of this compiler warning.

-- 
All rights reversed
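
The warning happens because a prototype that mentions struct
sched_domain before any declaration is in scope creates a declaration
local to that parameter list.  A one-line forward declaration fixes it
(the function name below is illustrative):

        struct sched_domain;    /* forward declaration */

        static int numa_wake_affine_sketch(struct sched_domain *sd,
                                           struct task_struct *p,
                                           int this_cpu);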



Re: [PATCH 4/4] sched,fair: remove effective_load

2017-06-27 Thread Rik van Riel
On Mon, 2017-06-26 at 18:12 +0200, Peter Zijlstra wrote:

> Also note that your use of task_h_load() in the new numa thing
> suffers
> from exactly the problem effective_load() is trying to solve.

Thinking about this some more, I suspect that
using effective_load() in combination with
select_idle_sibling(), which will often place
the task on a different CPU than the one specified,
may not lead to entirely useful results...

-- 
All rights reversed


