[PATCH v2] prctl: PR_SET_MM - unify copying of user's auvx

2021-03-23 Thread Cyrill Gorcunov
prctl(PR_SET_MM, PR_SET_MM_AUXV | PR_SET_MM_MAP, ...) copies user
provided auxiliary vector to kernel and saves it to mm::saved_auxv,
this involves same code in to places. Lets move it into one helper
instead.

When we copy data from user space we make sure that the vector ends
up with AT_NULL key/value pair as specification requires. And here
is a bit vague moment if task is running in compat mode: instead of
one last value we zeroing two entries at the end. This is done for
code simplicity (if arch supports compat mode then the initial vector
size must be big enough to store values needed for the native mode
as well, that's why we define the vector as an array of longs. In
particular when Elf executable is loaded the vector is considered
as pairs of elf_addr_t elements, which is 4 byte per each on 32
bit environment and 8 byte per each in 64 bit kernel).

Same time lets drop useless task_lock()/task_unlock() calls from
PR_SET_MM_AUXV. It doesn't protect anything here and seems to be
sneaked in accidentally (Oleg pointed me this moment).

Reported-by: Alexey Dobriyan 
Reported-by: Oleg Nesterov 
CC: Andrey Vagin 
CC: Dmitry Safonov <0x7f454...@gmail.com>
CC: Andrew Morton 
Signed-off-by: Cyrill Gorcunov 
---
 kernel/sys.c |   70 ---
 1 file changed, 38 insertions(+), 32 deletions(-)

Index: linux-tip.git/kernel/sys.c
===
--- linux-tip.git.orig/kernel/sys.c
+++ linux-tip.git/kernel/sys.c
@@ -1961,6 +1961,30 @@ out:
return error;
 }
 
+static int copy_auxv_from_user(unsigned long *auxv, size_t auxv_size,
+  const void __user *addr, size_t len)
+{
+   BUG_ON(auxv_size != sizeof(current->mm->saved_auxv));
+
+   if (!addr || len > auxv_size)
+   return -EINVAL;
+
+   memset(auxv, 0, auxv_size);
+   if (len && copy_from_user(auxv, addr, len))
+   return -EFAULT;
+
+   /*
+* Specification requires the vector to be
+* ended up with AT_NULL entry so user space
+* will notice where to stop enumerating.
+*/
+   if (len == auxv_size) {
+   auxv[AT_VECTOR_SIZE - 2] = AT_NULL;
+   auxv[AT_VECTOR_SIZE - 1] = AT_NULL;
+   }
+   return 0;
+}
+
 #ifdef CONFIG_CHECKPOINT_RESTORE
 static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long 
data_size)
 {
@@ -1987,22 +2011,12 @@ static int prctl_set_mm_map(int opt, con
return error;
 
if (prctl_map.auxv_size) {
-   /*
-* Someone is trying to cheat the auxv vector.
-*/
-   if (!prctl_map.auxv ||
-   prctl_map.auxv_size > sizeof(mm->saved_auxv))
-   return -EINVAL;
-
-   memset(user_auxv, 0, sizeof(user_auxv));
-   if (copy_from_user(user_auxv,
-  (const void __user *)prctl_map.auxv,
-  prctl_map.auxv_size))
-   return -EFAULT;
-
-   /* Last entry must be AT_NULL as specification requires */
-   user_auxv[AT_VECTOR_SIZE - 2] = AT_NULL;
-   user_auxv[AT_VECTOR_SIZE - 1] = AT_NULL;
+   int error = copy_auxv_from_user(user_auxv,
+   sizeof(user_auxv),
+   prctl_map.auxv,
+   prctl_map.auxv_size);
+   if (error)
+   return error;
}
 
if (prctl_map.exe_fd != (u32)-1) {
@@ -2079,25 +2093,17 @@ static int prctl_set_auxv(struct mm_stru
 * up to the caller to provide sane values here, otherwise userspace
 * tools which use this vector might be unhappy.
 */
-   unsigned long user_auxv[AT_VECTOR_SIZE] = {};
-
-   if (len > sizeof(user_auxv))
-   return -EINVAL;
-
-   if (copy_from_user(user_auxv, (const void __user *)addr, len))
-   return -EFAULT;
-
-   /* Make sure the last entry is always AT_NULL */
-   user_auxv[AT_VECTOR_SIZE - 2] = 0;
-   user_auxv[AT_VECTOR_SIZE - 1] = 0;
+   unsigned long user_auxv[AT_VECTOR_SIZE];
+   int error;
 
BUILD_BUG_ON(sizeof(user_auxv) != sizeof(mm->saved_auxv));
 
-   task_lock(current);
-   memcpy(mm->saved_auxv, user_auxv, len);
-   task_unlock(current);
-
-   return 0;
+   error = copy_auxv_from_user(user_auxv, sizeof(user_auxv),
+   (const void __user *)addr,
+   len);
+   if (!error)
+   memcpy(mm->saved_auxv, user_auxv, len);
+   return error;
 }
 
 static int prctl_set_mm(int opt, unsigned long addr,


Re: [PATCH] prctl: fix overwrite last but one entry in auxv vector

2021-03-22 Thread Cyrill Gorcunov
On Sun, Mar 21, 2021 at 11:36:42PM +0300, Cyrill Gorcunov wrote:
> Alexey reported that current PR_SET_MM_AUXV (and PR_SET_MM_MAP) overwrite
> too many entries on non 64bit kernels. This is because auxv is defined
> as an array of longs and in result access to AT_VECTOR_SIZE - 2 entry
> is not a type of auxv entry but rather an entry before the last one.

Drop this patch, please. I'll make a new version.


[PATCH] prctl: fix overwrite last but one entry in auxv vector

2021-03-21 Thread Cyrill Gorcunov
Alexey reported that current PR_SET_MM_AUXV (and PR_SET_MM_MAP) overwrite
too many entries on non 64bit kernels. This is because auxv is defined
as an array of longs and in result access to AT_VECTOR_SIZE - 2 entry
is not a type of auxv entry but rather an entry before the last one.

Since it is a common code for all architectures lets use __BITS_PER_LONG
definition to determinate each type/value pair in auxv_t is fitting
into `long` or not.

Note that on compat mode (ie Elf32 running in 64bit compiled kernel)
the preallocated vector size will be big enough to carry all entries
and zapping two entries at the end of the vector won't cause problems.

Same time lets drop useless task_lock()/task_unlock() calls from
PR_SET_MM_AUXV. It doesn't protect anything here and seems to be
sneaked in accidentally (Oleg pointed me to this moment).

Reported-by: Alexey Dobriyan 
CC: Oleg Nesterov 
CC: Andrey Vagin 
CC: Dmitry Safonov <0x7f454...@gmail.com>
CC: Andrew Morton 
Signed-off-by: Cyrill Gorcunov 
---
Take a look please, once time permit. The issue on its own
should not be that critical but better to fix it. I tested
it manually via trivial test but I think it is not enough.
Need to implement some selftesting as well. Also obviously
I ran test on x86 only.

 kernel/sys.c |   73 +--
 1 file changed, 41 insertions(+), 32 deletions(-)

Index: linux-tip.git/kernel/sys.c
===
--- linux-tip.git.orig/kernel/sys.c
+++ linux-tip.git/kernel/sys.c
@@ -1961,6 +1961,33 @@ out:
return error;
 }
 
+static int copy_auxv_from_user(unsigned long *auxv, size_t auxv_size,
+  const void __user *addr, size_t len)
+{
+   BUG_ON(auxv_size != sizeof(current->mm->saved_auxv));
+
+   if (!addr || len > auxv_size)
+   return -EINVAL;
+
+   memset(auxv, 0, auxv_size);
+   if (len && copy_from_user(auxv, addr, len))
+   return -EFAULT;
+
+   /*
+* Specification requires the vector to be
+* ended up with AT_NULL entry thus userspace
+* will notice where to stop enumerating. Thus
+* if someone is passing a screwed data make sure
+* at least it has the end of vector sign.
+*/
+   if (len == auxv_size) {
+   if (__BITS_PER_LONG == 64)
+   auxv[AT_VECTOR_SIZE - 2] = AT_NULL;
+   auxv[AT_VECTOR_SIZE - 1] = AT_NULL;
+   }
+   return 0;
+}
+
 #ifdef CONFIG_CHECKPOINT_RESTORE
 static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long 
data_size)
 {
@@ -1987,22 +2014,12 @@ static int prctl_set_mm_map(int opt, con
return error;
 
if (prctl_map.auxv_size) {
-   /*
-* Someone is trying to cheat the auxv vector.
-*/
-   if (!prctl_map.auxv ||
-   prctl_map.auxv_size > sizeof(mm->saved_auxv))
-   return -EINVAL;
-
-   memset(user_auxv, 0, sizeof(user_auxv));
-   if (copy_from_user(user_auxv,
-  (const void __user *)prctl_map.auxv,
-  prctl_map.auxv_size))
-   return -EFAULT;
-
-   /* Last entry must be AT_NULL as specification requires */
-   user_auxv[AT_VECTOR_SIZE - 2] = AT_NULL;
-   user_auxv[AT_VECTOR_SIZE - 1] = AT_NULL;
+   int error = copy_auxv_from_user(user_auxv,
+   sizeof(user_auxv),
+   prctl_map.auxv,
+   prctl_map.auxv_size);
+   if (error)
+   return error;
}
 
if (prctl_map.exe_fd != (u32)-1) {
@@ -2079,25 +2096,17 @@ static int prctl_set_auxv(struct mm_stru
 * up to the caller to provide sane values here, otherwise userspace
 * tools which use this vector might be unhappy.
 */
-   unsigned long user_auxv[AT_VECTOR_SIZE] = {};
-
-   if (len > sizeof(user_auxv))
-   return -EINVAL;
-
-   if (copy_from_user(user_auxv, (const void __user *)addr, len))
-   return -EFAULT;
-
-   /* Make sure the last entry is always AT_NULL */
-   user_auxv[AT_VECTOR_SIZE - 2] = 0;
-   user_auxv[AT_VECTOR_SIZE - 1] = 0;
+   unsigned long user_auxv[AT_VECTOR_SIZE];
+   int error;
 
BUILD_BUG_ON(sizeof(user_auxv) != sizeof(mm->saved_auxv));
 
-   task_lock(current);
-   memcpy(mm->saved_auxv, user_auxv, len);
-   task_unlock(current);
-
-   return 0;
+   error = copy_auxv_from_user(user_auxv, sizeof(user_auxv),
+   (const void __user *)addr,
+   len);
+   if (!error)
+  

Re: auxv stuff (Re: [PATCH] prctl: fix PR_SET_MM_AUXV kernel stack leak)

2021-03-16 Thread Cyrill Gorcunov
On Tue, Mar 16, 2021 at 09:50:35PM +0300, Alexey Dobriyan wrote:
> > > 
> > > There is another (non-security) one. Compat 32-bit process will report
> > > 2 longs too many:
> > 
> > Good catch! Alexey, should I address it? Or you have fixed it already?
> 
> I didn't and I don't know how frankly.
> Something I've noticed during more important auxv rewrite.

OK. I will then. Thanks!


Re: [PATCH] prctl: fix PR_SET_MM_AUXV kernel stack leak

2021-03-15 Thread Cyrill Gorcunov
On Mon, Mar 15, 2021 at 02:19:12PM +0100, Oleg Nesterov wrote:
> > >
> > > And why task_lock(current) ? What does it try to protect?
> >
> > As far as I remember this was related to reading from procfs
> > at time the patch was written for first time. Looks like this
> > not relevant anymore and could be dropped.
> 
> I think this was never relevant, ->alloc_lock is per-thread, not per mm.

Then we can safely drop it. I'll take one more look once time permit
and prepare a patch.


Re: [PATCH] prctl: fix PR_SET_MM_AUXV kernel stack leak

2021-03-15 Thread Cyrill Gorcunov
On Mon, Mar 15, 2021 at 01:08:03PM +0100, Oleg Nesterov wrote:
> On 03/14, Alexey Dobriyan wrote:
> >
> > prctl(PR_SET_MM, PR_SET_MM_AUXV, addr, 1);
> >
> > will copy 1 byte from userspace to (quite big) on-stack array
> > and then stash everything to mm->saved_auxv.
> 
> I too don't understand, memcpy(mm->saved_auxv, user_auxv, len) will
> copy 1 byte...

Indeed. I overlooked that we pass @len when copying. I should
not reply at night :(

> 
> And why task_lock(current) ? What does it try to protect?

As far as I remember this was related to reading from procfs
at time the patch was written for first time. Looks like this
not relevant anymore and could be dropped.


Re: auxv stuff (Re: [PATCH] prctl: fix PR_SET_MM_AUXV kernel stack leak)

2021-03-15 Thread Cyrill Gorcunov
On Mon, Mar 15, 2021 at 09:00:00AM +0300, Alexey Dobriyan wrote:
> On Sun, Mar 14, 2021 at 02:40:05PM -0700, Linus Torvalds wrote:
> > [mm->saved_auxv]
> > 
> > That's a separate issue, and I can't find it in myself to care (and
> > nobody has ever complained), but I thought I'd mention it.
> 
> There is another (non-security) one. Compat 32-bit process will report
> 2 longs too many:

Good catch! Alexey, should I address it? Or you have fixed it already?


Re: [PATCH] prctl: fix PR_SET_MM_AUXV kernel stack leak

2021-03-14 Thread Cyrill Gorcunov
On Sun, Mar 14, 2021 at 02:40:05PM -0700, Linus Torvalds wrote:
> Applied directly, since I'm just about to tag rc3 and was just looking
> that there were no last-minute pull requests.
> 
> Andrew, no need to pick it up into your queue.
> 
> Side note: I think we should return -EINVAL more aggressively: right
> now we fill up potentially all of user_auxv[] and return success, but
> we will have always cleared that last auxv pointer pair.
> 
> So we actually return "success" even when the user supplies us with
> more data than we then really accept.

Yes, this is somehow weird and probably we should start complaining
if last two elements in the user array is not AT_NULL but I fear
this might break backward compatibility? Dunno if someone relies
on kernel to setup last two entries unconditionally.

> 
> IOW, tightening that up might be worth it (maybe actually check that
> they are valid user pointers at the same time).
> 
> That's a separate issue, and I can't find it in myself to care (and
> nobody has ever complained), but I thought I'd mention it.


Re: [PATCH] prctl: fix PR_SET_MM_AUXV kernel stack leak

2021-03-14 Thread Cyrill Gorcunov
On Sun, Mar 14, 2021 at 11:51:14PM +0300, Alexey Dobriyan wrote:
>   prctl(PR_SET_MM, PR_SET_MM_AUXV, addr, 1);
> 
> will copy 1 byte from userspace to (quite big) on-stack array
> and then stash everything to mm->saved_auxv.
> AT_NULL terminator will be inserted at the very end.
> 
> /proc/*/auxv handler will find that AT_NULL terminator
> and copy original stack contents to userspace.
> 
> This devious scheme requires CAP_SYS_RESOURCE.
> 
> Signed-off-by: Alexey Dobriyan 
> ---

Thanks for catching up, Alexey!


Re: [PATCH] fcntl: make F_GETOWN(EX) return 0 on dead owner task

2021-02-07 Thread Cyrill Gorcunov
On Wed, Feb 03, 2021 at 03:41:56PM +0300, Pavel Tikhomirov wrote:
> Currently there is no way to differentiate the file with alive owner
> from the file with dead owner but pid of the owner reused. That's why
> CRIU can't actually know if it needs to restore file owner or not,
> because if it restores owner but actual owner was dead, this can
> introduce unexpected signals to the "false"-owner (which reused the
> pid).
> 
> Let's change the api, so that F_GETOWN(EX) returns 0 in case actual
> owner is dead already.
> 
> Cc: Jeff Layton 
> Cc: "J. Bruce Fields" 
> Cc: Alexander Viro 
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Cyrill Gorcunov 
> Cc: Andrei Vagin 
> Signed-off-by: Pavel Tikhomirov 

I can't imagine a scenario where we could break some backward
compatibility with this change, so

Reviewed-by: Cyrill Gorcunov 


Re: [PATCH v3] kcmp: Support selection of SYS_kcmp without CHECKPOINT_RESTORE

2021-02-06 Thread Cyrill Gorcunov
On Fri, Feb 05, 2021 at 10:00:12PM +, Chris Wilson wrote:
> Userspace has discovered the functionality offered by SYS_kcmp and has
> started to depend upon it. In particular, Mesa uses SYS_kcmp for
> os_same_file_description() in order to identify when two fd (e.g. device
> or dmabuf) point to the same struct file. Since they depend on it for
> core functionality, lift SYS_kcmp out of the non-default
> CONFIG_CHECKPOINT_RESTORE into the selectable syscall category.
> 
...
Reviewed-by: Cyrill Gorcunov 


Re: [PATCH v19 12/25] mm: Introduce VM_SHSTK for shadow stack memory

2021-02-04 Thread Cyrill Gorcunov
On Wed, Feb 03, 2021 at 02:55:34PM -0800, Yu-cheng Yu wrote:
>  
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 602e3a52884d..59623dcd92bb 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -661,6 +661,9 @@ static void show_smap_vma_flags(struct seq_file *m, 
> struct vm_area_struct *vma)
>   [ilog2(VM_PKEY_BIT4)]   = "",
>  #endif
>  #endif /* CONFIG_ARCH_HAS_PKEYS */
> +#ifdef CONFIG_X86_CET
> + [ilog2(VM_SHSTK)]   = "ss",
> +#endif
>   };

IIRC we've these abbreviations explained in documentaion
(proc.rst file). Could you please update it once time
permit? I think it can be done on top of the series.


Re: [PATCH] fcntl: make F_GETOWN(EX) return 0 on dead owner task

2021-02-03 Thread Cyrill Gorcunov
On Thu, Feb 04, 2021 at 12:35:42AM +0300, Pavel Tikhomirov wrote:
> 
> AFAICS if pid is held only by 1) fowner refcount and by 2) single process
> (without threads, group and session for simplicity), on process exit we go
> through:
> 
> do_exit
>   exit_notify
> release_task
>   __exit_signal
> __unhash_process
>   detach_pid
> __change_pid
>   free_pid
> idr_remove
> 
> So pid is removed from idr, and after that alloc_pid can reuse pid numbers
> even if old pid structure is still alive and is still held by fowner.
...
> Hope this answers your question, Thanks!

Yeah, indeed, thanks! So the change is sane still I'm
a bit worried about backward compatibility, gimme some
time I'll try to refresh my memory first, in a couple
of days or weekend (though here are a number of experienced
developers CC'ed maybe they reply even faster).


Re: [PATCH] fcntl: make F_GETOWN(EX) return 0 on dead owner task

2021-02-03 Thread Cyrill Gorcunov
On Wed, Feb 03, 2021 at 03:41:56PM +0300, Pavel Tikhomirov wrote:
> Currently there is no way to differentiate the file with alive owner
> from the file with dead owner but pid of the owner reused. That's why
> CRIU can't actually know if it needs to restore file owner or not,
> because if it restores owner but actual owner was dead, this can
> introduce unexpected signals to the "false"-owner (which reused the
> pid).

Hi! Thanks for the patch. You know I manage to forget the fowner internals.
Could you please enlighten me -- when owner is set with some pid we do

f_setown_ex
  __f_setown
f_modown
  filp->f_owner.pid = get_pid(pid);

Thus pid get refcount incremented. Then the owner exits but refcounter
should be still up and running and pid should not be reused, no? Or
I miss something obvious?

The patch itself looks ok on a first glance.


[PATCH] prctl: allow to setup brk for et_dyn executables

2021-01-21 Thread Cyrill Gorcunov
Keno Fischer reported that when a binray loaded via
ld-linux-x the prctl(PR_SET_MM_MAP) doesn't allow to
setup brk value because it lays before mm:end_data.

For example a test program shows

 | # ~/t
 |
 | start_code  401000
 | end_code401a15
 | start_stack 7ffce4577dd0
 | start_data  403e10
 | end_data40408c
 | start_brk   b5b000
 | sbrk(0) b5b000

and when executed via ld-linux

 | # /lib64/ld-linux-x86-64.so.2 ~/t
 |
 | start_code  7fc25b0a4000
 | end_code7fc25b0c4524
 | start_stack 7fffcc6b2400
 | start_data  7fc25b0ce4c0
 | end_data7fc25b0cff98
 | start_brk   5710c000
 | sbrk(0) 5710c000

This of course prevent criu from restoring such programs.
Looking into how kernel operates with brk/start_brk inside
brk() syscall I don't see any problem if we allow to setup
brk/start_brk without checking for end_data. Even if someone
pass some weird address here on a purpose then the worst
possible result will be an unexpected unmapping of existing
vma (own vma, since prctl works with the callers memory) but
test for RLIMIT_DATA is still valid and a user won't be able
to gain more memory in case of expanding VMAs via new values
shipped with prctl call.

Reported-by: Keno Fischer 
Signed-off-by: Cyrill Gorcunov 
CC: Andrew Morton 
CC: Dmitry Safonov <0x7f454...@gmail.com>
CC: Andrey Vagin 
CC: Kirill Tkhai 
CC: Eric W. Biederman 
---
Guys, take a look please once time permit. Hopefully I didn't
miss something 'cause made this patch via code reading only.

Andrey, do we still have a criu container which tests new kernels,
right? Would be great to run criu tests with this patch applied
to make sure everything is intact.

 kernel/sys.c |7 ---
 1 file changed, 7 deletions(-)

Index: linux-tip.git/kernel/sys.c
===
--- linux-tip.git.orig/kernel/sys.c
+++ linux-tip.git/kernel/sys.c
@@ -1943,13 +1943,6 @@ static int validate_prctl_map_addr(struc
error = -EINVAL;
 
/*
-* @brk should be after @end_data in traditional maps.
-*/
-   if (prctl_map->start_brk <= prctl_map->end_data ||
-   prctl_map->brk <= prctl_map->end_data)
-   goto out;
-
-   /*
 * Neither we should allow to override limits if they set.
 */
if (check_data_rlimit(rlimit(RLIMIT_DATA), prctl_map->brk,


Re: brk checks in PR_SET_MM code

2020-12-17 Thread Cyrill Gorcunov
On Thu, Dec 17, 2020 at 10:42:02AM +0300, Cyrill Gorcunov wrote:
> On Wed, Dec 16, 2020 at 08:29:30PM -0500, Keno Fischer wrote:
> > Hi all,
> > 
> > The code in prctl(PR_SET_MM, ...) performs a number of sanity checks,
> > among them
> > 
> > ```
> > /*
> >  * @brk should be after @end_data in traditional maps.
> >  */
> > if (prctl_map->start_brk <= prctl_map->end_data ||
> > prctl_map->brk <= prctl_map->end_data)
> > goto out;
> > ```
> > 
> 
> Thanks for pointing, Keno! I don't remember the details right now,
> gimme some time and once I refresh my memory I'll reply with
> details.

Indeed, when loaded via ld directly we've got a different layout:

# /lib64/ld-linux-x86-64.so.2 ~/t

start_code  7fc25b0a4000
end_code7fc25b0c4524
start_stack 7fffcc6b2400
start_data  7fc25b0ce4c0
end_data7fc25b0cff98
start_brk   5710c000
sbrk(0) 5710c000

Note though that as far as I understand the layout is provided by
ld loader. I contrast the regular load

# ~/t

start_code  401000
end_code401a15
start_stack 7ffce4577dd0
start_data  403e10
end_data40408c
start_brk   b5b000
sbrk(0) b5b000

I fear we've not been using ld's loaded programs in c/r procedure much
that's why it has not been noted earlier. Need to think how to fix it.
Using the whole memory map for verification procedure is a correct way
thus the commit you mention is doing exactly what it should but we need
to figure out how to deal with fdpic loaded files... I'll back once I
figure it out (hopefully more-less soon). Thanks a huge for report!


Re: brk checks in PR_SET_MM code

2020-12-16 Thread Cyrill Gorcunov
On Wed, Dec 16, 2020 at 08:29:30PM -0500, Keno Fischer wrote:
> Hi all,
> 
> The code in prctl(PR_SET_MM, ...) performs a number of sanity checks,
> among them
> 
> ```
> /*
>  * @brk should be after @end_data in traditional maps.
>  */
> if (prctl_map->start_brk <= prctl_map->end_data ||
> prctl_map->brk <= prctl_map->end_data)
> goto out;
> ```
> 

Thanks for pointing, Keno! I don't remember the details right now,
gimme some time and once I refresh my memory I'll reply with
details.


Re: [PATCH v2 13/24] kcmp: In get_file_raw_ptr use task_lookup_fd_rcu

2020-11-23 Thread Cyrill Gorcunov
On Fri, Nov 20, 2020 at 05:14:30PM -0600, Eric W. Biederman wrote:
> Modify get_file_raw_ptr to use task_lookup_fd_rcu.  The helper
> task_lookup_fd_rcu does the work of taking the task lock and verifying
> that task->files != NULL and then calls files_lookup_fd_rcu.  So let
> use the helper to make a simpler implementation of get_file_raw_ptr.
> 
> Signed-off-by: "Eric W. Biederman" 
Acked-by: Cyrill Gorcunov 

Since I wrote this kcmp code in first place. Thanks Eric!


Re: [PATCH v2 11/24] file: Implement task_lookup_fd_rcu

2020-11-22 Thread Cyrill Gorcunov
On Sun, Nov 22, 2020 at 07:00:20AM -0600, Eric W. Biederman wrote:
> Cyrill Gorcunov  writes:
...
> That is present in files_lookup_fd_rcu, so this code should
> be good from the warning side.

Indeed, thanks!


Re: [PATCH v2 11/24] file: Implement task_lookup_fd_rcu

2020-11-21 Thread Cyrill Gorcunov
On Fri, Nov 20, 2020 at 05:14:28PM -0600, Eric W. Biederman wrote:
> 
> diff --git a/fs/file.c b/fs/file.c
> index 5861c4f89419..6448523ca29e 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -865,6 +865,21 @@ struct file *fget_task(struct task_struct *task, 
> unsigned int fd)
>   return file;
>  }
>  
> +struct file *task_lookup_fd_rcu(struct task_struct *task, unsigned int fd)
> +{
> + /* Must be called with rcu_read_lock held */

Eric, maybe worth to have something like

RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
 "suspicious task_lookup_fd_rcu() usage");

here?


Re: Inconsistent capability requirements for prctl_set_mm_exe_file()

2020-10-27 Thread Cyrill Gorcunov
On Tue, Oct 27, 2020 at 08:22:11PM +0300, Kirill Tkhai wrote:
> 1)Before my commit there also were different checks
> 
> !capable(CAP_SYS_RESOURCE))
> and
>   uid_eq(cred->uid, make_kuid(ns, 0)) && gid_eq(cred->gid, make_kgid(ns, 
> 0))
> 
> so it is not the initial reason. The commit even decreased the checks 
> difference
> and it made both the checks are about capability().
> 
> 2)As I understand new PR_SET_MM_MAP interface differs in the way, that it 
> allows to batch
> a setup of prctl_mm_map parameters. So, instead of 14 prlctl calls with 
> different arguments:
> PR_SET_MM_START_CODE, PR_SET_MM_END_CODE, PR_SET_MM_START_DATA, .., 
> PR_SET_MM_ENV_END,
> we set then all at once and the performance is better.
> 
> The second advantage is that the new interface is more comfortable in case of 
> we set all
> of 14 parameters. Old interface requires special order of calls: sometimes 
> you have to
> set PR_SET_MM_START_CODE first and then PR_SET_MM_END_CODE second, some times 
> it is vice
> versa. Otherwise __prctl_check_order() in validate_prctl_map() will fail.

Since I've been the person who introduced the former PR_SET_MM_X interface I 
already
explained that the PR_SET_MM_X is simply shitty and better be vanished and 
forgotten.
Which we simply can't do :/ In turn PR_SET_MM_MAP is the only right way to 
verify
all fields in a one pass not only because of speed but otherwise the validation
of parameters is not even associative and in result (as you mentioned) _order_ 
of
calls does matter for start_code/end_code.

> 3)For me it looks like any combinations of parameters acceptable to be set by 
> both interfaces
> are the same (in case of we don't see on permissions checks). In case of we 
> can assign a set of
> parameters {A} using old interface, we can assign it from new interface and 
> vice versa.
> Isn't this so?! If so, we should use the same permissions check.

Yup, if only I'm not missing something obvious we could drop cap(sys-resource) 
here
because internally after this check we jump into traditional verification 
procedure
where the map is filled from runtime data.

I'm quite sceptic that LSM hook Jann mentioned gonna be better. It might be an
addition but not the replacement. Moreover if we add it here someone from CRIU
camp have to verify that it won't break current CRIU tests.


Re: Inconsistent capability requirements for prctl_set_mm_exe_file()

2020-10-27 Thread Cyrill Gorcunov
On Tue, Oct 27, 2020 at 01:11:40PM +0100, Michael Kerrisk (man-pages) wrote:
> Hello Nicolas, Cyrill, and others,
> 
> @Nicolas, your commit ebd6de6812387a changed the capability 
> requirements for the prctl_set_mm_exe_file() operation from
> 
> ns_capable(CAP_SYS_ADMIN)
> 
> to
> 
> ns_capable(CAP_SYS_ADMIN) || ns_capable(CAP_CHECKPOINT_RESTORE).
> 
> That's fine I guess, but while looking at that change, I found
> an anomaly.
> 
> The same prctl_set_mm_exe_file() functionality is also available
> via the prctl() PR_SET_MM_EXE_FILE operation, which was added
> by Cyrill's commit b32dfe377102ce668. However, there the 
> prctl_set_mm_exe_file() operation is guarded by a check
> 
> capable(CAP_SYS_RESOURCE).
> 
> There are two things I note:
> 
> * The capability requirements are different in the two cases.
> * In one case the checks are with ns_capable(), while in the 
>   other case the check is with capable().
> 
> In both cases, the inconsistencies predate Nicolas's patch,
> and appear to have been introduced in Kirill Tkhai's commit
> 4d28df6152aa3ff.
> 
> I'm not sure what is right, but those inconsistencies seem
> seem odd, and presumably unintended. Similarly, I'm not
> sure what fix, if any, should be applied. However, I thought
> it worth mentioning these details, since the situation is odd
> and surprising.

Hi Michael! This is more likely due to historical reasons:
the initial version of prctl(PR_SET_MM, ...) been operating
with individual fields and this was very unsafe. Because of
this we left it under CAP_SYS_RESOURCE (because you must have
enough rights to change such deep fields). Later we switched
to PR_SET_MM_MAP which is a safe version and allows to modify
memory map as a "whole" so we can do a precise check. And this
allowed us to relax requirements.

As to me the old PR_SET_MM should be deprecated and finally
removed from the kernel, but since it is a part of API we
can't do such thing easily.

Same time current PR_SET_MM internally is rather an alias
for PR_SET_MM_MAP because we create a temporary map and
pass it to the verification procedure so it looks like
we can relax requirements here to match the PR_SET_MM_MAP
call. But need to think maybe I miss something obvious here.


Re: [PATCH v14 03/26] x86/fpu/xstate: Introduce CET MSR XSAVES supervisor states

2020-10-12 Thread Cyrill Gorcunov
On Mon, Oct 12, 2020 at 08:38:27AM -0700, Yu-cheng Yu wrote:
...
>  /*
>   * x86-64 Task Priority Register, CR8
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 038e19c0019e..705fd9b94e31 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -38,6 +38,9 @@ static const char *xfeature_names[] =
>   "Processor Trace (unused)"  ,
>   "Protection Keys User registers",
>   "unknown xstate feature",
> + "Control-flow User registers"   ,
> + "Control-flow Kernel registers" ,
> + "unknown xstate feature",
>  };
>  
>  static short xsave_cpuid_features[] __initdata = {
> @@ -51,6 +54,9 @@ static short xsave_cpuid_features[] __initdata = {
>   X86_FEATURE_AVX512F,
>   X86_FEATURE_INTEL_PT,
>   X86_FEATURE_PKU,
> + -1,/* Unused */
> + X86_FEATURE_SHSTK, /* XFEATURE_CET_USER */
> + X86_FEATURE_SHSTK, /* XFEATURE_CET_KERNEL */
>  };

Why do you need "-1" here in the array? The only 1:1 mapping is between
the names itselves and values, not indices of arrays so i don't understand
why we need this unused value. Sorry if it is a dumb questions and
been discussed already.


Re: [PATCH v14 1/7] x86/cet/ibt: Add Kconfig option for user-mode Indirect Branch Tracking

2020-10-12 Thread Cyrill Gorcunov
On Mon, Oct 12, 2020 at 08:45:24AM -0700, Yu-cheng Yu wrote:
...
> +   the application support it.  When this feature is enabled,
> +   legacy non-IBT applications continue to work, but without
> +   IBT protection.
> +   Support for this feature is only known to be present on
> +   processors released in 2020 or later.  CET features are also
> +   known to increase kernel text size by 3.7 KB.

It seems the last sentence is redundant - new features always bloat
the kernel code and precise size may differ depending on compiler
and options. Surely this can be patched on top.


Re: [PATCH 09/17] file: Implement fnext_task

2020-08-20 Thread Cyrill Gorcunov
On Mon, Aug 17, 2020 at 05:04:17PM -0500, Eric W. Biederman wrote:
> As a companion to fget_task and fcheck_task implement fnext_task that
> will return the struct file for the first file descriptor show number
> is equal or greater than the fd argument value, or NULL if there is
> no such struct file.
> 
> This allows file descriptors of foreign processes to be iterated through
> safely, without needed to increment the count on files_struct.
> 
> Signed-off-by: "Eric W. Biederman" 
> ---
>  fs/file.c   | 21 +
>  include/linux/fdtable.h |  1 +
>  2 files changed, 22 insertions(+)
> 
> diff --git a/fs/file.c b/fs/file.c
> index 8d4b385055e9..88f9f78869f8 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -876,6 +876,27 @@ struct file *fcheck_task(struct task_struct *task, 
> unsigned int fd)
>   return file;
>  }
>  
> +struct file *fnext_task(struct task_struct *task, unsigned int *ret_fd)
> +{
> + /* Must be called with rcu_read_lock held */
> + struct files_struct *files;
> + unsigned int fd = *ret_fd;
> + struct file *file = NULL;
> +
> + task_lock(task);
> + files = task->files;
> + if (files) {
> + for (; fd < files_fdtable(files)->max_fds; fd++) {
> + file = fcheck_files(files, fd);
> + if (file)
> + break;
> + }
> + }
> + task_unlock(task);
> + *ret_fd = fd;
> + return file;
> +}

Eric, if only I'm not missing something obvious you could
escape @fd/@ret_fd operations in case if task->files = NULL,
iow

if (files) {
unsigned int fd = *ret_fd;
for (; fd < files_fdtable(files)->max_fds; fd++) {
file = fcheck_files(files, fd);
if (file)
break;
}
*ret_fd = fd;
}


Re: [PATCH 04/17] kcmp: In kcmp_epoll_target use fget_task

2020-08-20 Thread Cyrill Gorcunov
On Mon, Aug 17, 2020 at 05:04:12PM -0500, Eric W. Biederman wrote:
> Use the helper fget_task and simplify the code.
> 
> As well as simplifying the code this removes one unnecessary increment of
> struct files_struct.  This unnecessary increment of files_struct.count can
> result in exec unnecessarily unsharing files_struct and breaking posix
> locks, and it can result in fget_light having to fallback to fget reducing
> performance.
> 
> Suggested-by: Oleg Nesterov 
> Signed-off-by: "Eric W. Biederman" 
Reviewed-by: Cyrill Gorcunov 

I really like this simplification!


Re: [PATCH] x86/fpu/xstate: Fix an xstate size check warning

2020-07-20 Thread Cyrill Gorcunov
On Mon, Jul 20, 2020 at 06:50:51AM -0700, kan.li...@linux.intel.com wrote:
...
>  static unsigned int __init get_xsave_size(void)
>  {
>   unsigned int eax, ebx, ecx, edx;
> @@ -710,7 +741,7 @@ static int __init init_xstate_size(void)
>   xsave_size = get_xsave_size();
>  
>   if (boot_cpu_has(X86_FEATURE_XSAVES))
> - possible_xstate_size = get_xsaves_size();
> + possible_xstate_size = get_xsaves_size_no_dynamic();
>   else
>   possible_xstate_size = xsave_size;

Hi! Maybe we could enhance get_xsaves_size instead ? The get_xsaves_size is
static and __init function (thus not a hot path) used once as far as I see.
Say

static unsigned int __init get_xsaves_size(void)
{
u64 mask = xfeatures_mask_dynamic();
unsigned int eax, ebx, ecx, edx;

/*
 * In case if dynamic features are present make
 * sure they are not accounted in the result since
 * the buffer should be allocated separately from
 * task->fpu.
 */
if (mask)
wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor());

/*
 * - CPUID function 0DH, sub-function 1:
 *EBX enumerates the size (in bytes) required by
 *the XSAVES instruction for an XSAVE area
 *containing all the state components
 *corresponding to bits currently set in
 *XCR0 | IA32_XSS.
 */
cpuid_count(XSTATE_CPUID, 1, , , , );

if (mask)
wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() | mask);

return ebx;
}

but if you expect more use of get_xsaves_size_no_dynamic() and
get_xsaves_size() in future then sure, we need a separate function.

The benefit from such extension is that when you read get_xsaves_size
you'll notice the dependency on dynamic features immediaely.

Though I'm fine with current patch as well, up to you. Thanks for the patch!

Reviewed-by: Cyrill Gorcunov 


Re: [PATCH v5 4/6] proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE

2020-07-15 Thread Cyrill Gorcunov
On Wed, Jul 15, 2020 at 04:49:52PM +0200, Adrian Reber wrote:
> Opening files in /proc/pid/map_files when the current user is
> CAP_CHECKPOINT_RESTORE capable in the root namespace is useful for
> checkpointing and restoring to recover files that are unreachable via
> the file system such as deleted files, or memfd files.
> 
> Signed-off-by: Adrian Reber 
> Signed-off-by: Nicolas Viennot 

I still have a plan to make this code been usable without
capabilities requirements but due to lack of spare time
for deep investigation this won't happen anytime soon.
Thus the patch looks OK to me, fwiw

Reviewed-by: Cyrill Gorcunov 


Re: [PATCH] kcmp: add separate Kconfig symbol for kcmp syscall

2020-07-10 Thread Cyrill Gorcunov
On Fri, Jul 10, 2020 at 11:05:11AM +0200, Rasmus Villemoes wrote:
> >> I deliberately drop the ifdef in the eventpoll.h header rather than
> >> replace with KCMP_SYSCALL; it's harmless to declare a function that
> >> isn't defined anywhere.
> > 
> > Could you please point why setting #fidef KCMP_SYSCALL in eventpoll.h
> > is not suitable?
> 
> It's just from a general "avoid ifdef clutter if possible" POV. The
> conditional declaration of the function doesn't really serve any
> purpose.

OK, thanks for explanation.


Re: [PATCH] kcmp: add separate Kconfig symbol for kcmp syscall

2020-07-10 Thread Cyrill Gorcunov
On Fri, Jul 10, 2020 at 09:56:31AM +0200, Rasmus Villemoes wrote:
> The ability to check open file descriptions for equality (without
> resorting to unreliable fstat() and fcntl(F_GETFL) comparisons) can be
> useful outside of the checkpoint/restore use case - for example,
> systemd uses kcmp() to deduplicate the per-service file descriptor
> store.
> 
> Make it possible to have the kcmp() syscall without the full
> CONFIG_CHECKPOINT_RESTORE.
> 
> Signed-off-by: Rasmus Villemoes 
> ---
> I deliberately drop the ifdef in the eventpoll.h header rather than
> replace with KCMP_SYSCALL; it's harmless to declare a function that
> isn't defined anywhere.

Could you please point why setting #fidef KCMP_SYSCALL in eventpoll.h
is not suitable? Still the overall idea is fine for me, thanks!

Reviewed-by: Cyrill Gorcunov 


Re: [PATCH v4 3/3] prctl: Allow ptrace capable processes to change /proc/self/exe

2020-07-07 Thread Cyrill Gorcunov
On Tue, Jul 07, 2020 at 05:45:04PM +0200, Christian Brauner wrote:
...
> 
> Ok, so the original patch proposal was presented in [4] in 2014. The
> final version of that patch added the PR_SET_MM_MAP we know today. The
> initial version presented in [4] did not require _any_ privilege.
> 

True. I still think that relyng on /proc//exe being immutable (or
guarded by caps) in a sake of security is a bit misleading, this link
only a hint without any guarantees of what code is being executed once
we pass cs:rip to userspace right after exec is completed. Nowadays I rather
think we might need to call audit_log() here or something similar to point
that exe link is changed (by criu or someone else) and simply notify
node's administrator, that's all. But as you pointed tomoyo may be
affected if we simply drops all caps from here. Thus I agree that
the new cap won't make situation worse.

Still I'm not in touch with kernel code for a couple of years already
and might be missing something obvious here.


Re: [PATCH 21/21] perf/x86/intel/lbr: Support XSAVES for arch LBR read

2020-06-22 Thread Cyrill Gorcunov
On Mon, Jun 22, 2020 at 03:11:07PM -0400, Liang, Kan wrote:
> > 
> > The goto and "return" statement before the "rollback" label
> > looks pretty ugly. I'm sorry I didn't follow the series
> > in details so if you plan to add more handlers at "rollback"
> > then sure.
> > 
> 
> There were several handlers when I first implemented the function, but they
> are removed now. I don't think I will add more handlers in the next version.
> I will remove the "rollback" label.
> 
> Thanks for pointing it out.

This could be done on top of the series of course, no need to resend
for this sole change I think.


Re: [PATCH 21/21] perf/x86/intel/lbr: Support XSAVES for arch LBR read

2020-06-22 Thread Cyrill Gorcunov
On Fri, Jun 19, 2020 at 07:04:09AM -0700, kan.li...@linux.intel.com wrote:
...
> +static void intel_pmu_arch_lbr_read_xsave(struct cpu_hw_events *cpuc)
> +{
> + struct x86_perf_task_context_arch_lbr_xsave *xsave = cpuc->lbr_xsave;
> + struct arch_lbr_entry *lbr;
> + int i;
> +
> + if (!xsave)
> + goto rollback;

Why not make it simplier?

if (!xsave) {
intel_pmu_arch_lbr_read(cpuc);
return;
}

The goto and "return" statement before the "rollback" label
looks pretty ugly. I'm sorry I didn't follow the series
in details so if you plan to add more handlers at "rollback"
then sure.


Re: [PATCH v2 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

2020-06-09 Thread Cyrill Gorcunov
On Tue, Jun 09, 2020 at 08:09:49PM +, Nicolas Viennot wrote:
> >>  proc_map_files_get_link(struct dentry *dentry,
> >>struct inode *inode,
> >>struct delayed_call *done)
> >>  {
> >> -  if (!capable(CAP_SYS_ADMIN))
> >> +  if (!(capable(CAP_SYS_ADMIN) || capable(CAP_CHECKPOINT_RESTORE)))
> >>return ERR_PTR(-EPERM);
> 
> > First of all -- sorry for late reply. You know, looking into this code more 
> > I think
> this CAP_SYS_ADMIN is simply wrong: for example I can't even fetch links for 
> /proc/self/map_files.
> Still /proc/$pid/maps (which as well points to the files opened) test for 
> ptrace-read permission.
> I think we need ptrace-may-attach test here instead of these capabilities (if 
> I can attach to
> a process I can read any data needed, including the content of the mapped 
> files, if only
> I'm not missing something obvious).
> 

Nikolas, could you please split the text lines next time, I've had to add 
newlines into reply manually :)

> Currently /proc/pid/map_files/* have exactly the same permission checks as 
> /proc/pid/fd/*, with the exception
> of the extra CAP_SYS_ADMIN check. The check originated from the following 
> discussions where 3 security issues are discussed:
> http://lkml.iu.edu/hypermail/linux/kernel/1505.2/02524.html
> http://lkml.iu.edu/hypermail/linux/kernel/1505.2/04030.html
> 
> From what I understand, the extra CAP_SYS_ADMIN comes from the following 
> issues:
> 1. Being able to open dma-buf / kdbus region (referred in the referenced 
> email as problem #1).
> I don't fully understand what the dangers are, but perhaps we could do 
> CAP_SYS_ADMIN check
> only for such dangerous files, as opposed to all files.

As far as I remember we only need to read the content of mmap'ed files and if 
I've ptrace-attach
permission we aready can inject own code into a process and read anything we 
wish. That said we probably
should fixup this interface like -- test for open mode and if it is read only 
then ptrace-attach
should be enough, if it is write mode -- then we require being node's admin 
instead of just adding
a new capability here. And thanks a huge for mail reference, I'll take a look 
once time permit.


Re: [PATCH v2 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

2020-06-09 Thread Cyrill Gorcunov
On Wed, Jun 03, 2020 at 06:23:26PM +0200, Adrian Reber wrote:
> This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
> checkpoint/restore for non-root users.
> 
> Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has been
> asked numerous times if it is possible to checkpoint/restore a process as
> non-root. The answer usually was: 'almost'.
> 
> The main blocker to restore a process as non-root was to control the PID of 
> the
> restored process. This feature available via the clone3 system call, or via
> /proc/sys/kernel/ns_last_pid is unfortunately guarded by CAP_SYS_ADMIN.
...
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index d86c0afc8a85..ce02f3a4b2d7 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2189,16 +2189,16 @@ struct map_files_info {
>  };
>  
>  /*
> - * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how 
> the
> - * symlinks may be used to bypass permissions on ancestor directories in the
> - * path to the file in question.
> + * Only allow CAP_SYS_ADMIN and CAP_CHECKPOINT_RESTORE to follow the links, 
> due
> + * to concerns about how the symlinks may be used to bypass permissions on
> + * ancestor directories in the path to the file in question.
>   */
>  static const char *
>  proc_map_files_get_link(struct dentry *dentry,
>   struct inode *inode,
>   struct delayed_call *done)
>  {
> - if (!capable(CAP_SYS_ADMIN))
> + if (!(capable(CAP_SYS_ADMIN) || capable(CAP_CHECKPOINT_RESTORE)))
>   return ERR_PTR(-EPERM);

First of all -- sorry for late reply. You know, looking into this code more
I think this CAP_SYS_ADMIN is simply wrong: for example I can't even fetch
links for /proc/self/map_files. Still /proc/$pid/maps (which as well points
to the files opened) test for ptrace-read permission. I think we need
ptrace-may-attach test here instead of these capabilities (if I can attach
to a process I can read any data needed, including the content of the
mapped files, if only I'm not missing something obvious).


Re: [PATCH v2 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

2020-06-03 Thread Cyrill Gorcunov
On Wed, Jun 03, 2020 at 06:23:26PM +0200, Adrian Reber wrote:
...
>  
>  /*
> - * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how 
> the
> - * symlinks may be used to bypass permissions on ancestor directories in the
> - * path to the file in question.
> + * Only allow CAP_SYS_ADMIN and CAP_CHECKPOINT_RESTORE to follow the links, 
> due
> + * to concerns about how the symlinks may be used to bypass permissions on
> + * ancestor directories in the path to the file in question.
>   */
>  static const char *
>  proc_map_files_get_link(struct dentry *dentry,
>   struct inode *inode,
>   struct delayed_call *done)
>  {
> - if (!capable(CAP_SYS_ADMIN))
> + if (!(capable(CAP_SYS_ADMIN) || capable(CAP_CHECKPOINT_RESTORE)))
>   return ERR_PTR(-EPERM);

You know, I'm still not sure if we need this capable() check at all since
we have proc_fd_access_allowed() called but anyway can we please make this
if() condition more explicit

if (!capable(CAP_SYS_ADMIN) && !capable(CAP_CHECKPOINT_RESTORE))
return ERR_PTR(-EPERM);

though I won't insist. And I'll reread the series a bit later once I've
some spare time to.


Re: [PATCH] x86/dumpstack/64: Don't evaluate exception stacks before setup

2019-10-23 Thread Cyrill Gorcunov
On Wed, Oct 23, 2019 at 11:31:40AM -0700, Matthew Wilcox wrote:
> On Wed, Oct 23, 2019 at 08:05:49PM +0200, Thomas Gleixner wrote:
> > Prevent this by checking the validity of the cea_exception_stack base
> > address and bailing out if it is zero.
> 
> Could also initialise cea_exception_stack to -1?  That would lead to it
> being caught by ...
> 
> > end = begin + sizeof(struct cea_exception_stacks);
> > /* Bail if @stack is outside the exception stack area. */
> > if (stk < begin || stk >= end)
> 
> this existing check.

As to me this would be a hack and fragile :/ In turn the current explicit
test Thomas made is a way more readable.


Re: [BUG -tip] kmemleak and stacktrace cause page faul

2019-10-23 Thread Cyrill Gorcunov
On Wed, Oct 23, 2019 at 03:47:57PM +0200, Thomas Gleixner wrote:
> 
> So the fix is trivial.

Works like a charm.

Tested-by: Cyrill Gorcunov 


Re: [BUG -tip] kmemleak and stacktrace cause page faul

2019-10-23 Thread Cyrill Gorcunov
On Wed, Oct 23, 2019 at 03:47:57PM +0200, Thomas Gleixner wrote:
> 
> And you are right this happens because cea_exception_stacks is not yet
> initialized which makes begin = 0 and therefore point into nirvana.
> 
> So the fix is trivial.

Great! Thanks a lot for sych detailed analysis! I'll test the patch.


Re: [BUG -tip] kmemleak and stacktrace cause page faul

2019-10-23 Thread Cyrill Gorcunov
On Wed, Oct 23, 2019 at 03:38:40PM +0200, Thomas Gleixner wrote:
> > > 
> > > [0,2,4,5,6,8,10,12] are guard pages so 0 is not that crappy at all
> > 
> > Wait, Thomas, I might be wrong, but per-cpu is initialized to the pointer,
> > the memory for this estack_pages has not yet been allocated, no?
> 
> static const
> struct estack_pages estack_pages[CEA_ESTACK_PAGES] cacheline_aligned = {
> EPAGERANGE(DF),
>   EPAGERANGE(NMI),
>   EPAGERANGE(DB1),
>   EPAGERANGE(DB),
> EPAGERANGE(MCE),
> };
> 
> It's statically allocated. So it's available from the very beginning.

Indeed, thanks! I happened to overlooked this moment.
...
> And as I explained to you properly decoded the values _ARE_ correct and
> make sense.

Yes, just posted the diff itself to be sure. Thanks a huge for
explanation, Thomas!


Re: [BUG -tip] kmemleak and stacktrace cause page faul

2019-10-23 Thread Cyrill Gorcunov
On Wed, Oct 23, 2019 at 03:21:05PM +0200, Thomas Gleixner wrote:
> On Tue, 22 Oct 2019, Cyrill Gorcunov wrote:
> > On Tue, Oct 22, 2019 at 05:23:25PM +0300, Cyrill Gorcunov wrote:
> > > 
> > > I presume the kmemleak tries to save stack trace too early when 
> > > estack_pages are not
> > > yet filled.
> > 
> > Indeed, at this stage of boot the percpu_setup_exception_stacks has not 
> > been called
> > yet and estack_pages full of crap
> > 
> > [0.157502] stk 0x1008 k 1 begin 0x0 end 0xd000 estack_pages 
> > 0x82014880 ep 0x82014888
> > [0.159395] estack_pages[0] = 0x0
> > [0.160046] estack_pages[1] = 0x510001000
> > [0.160881] estack_pages[2] = 0x0
> > [0.161530] estack_pages[3] = 0x610003000
> > [0.162343] estack_pages[4] = 0x0
> > [0.162962] estack_pages[5] = 0x0
> > [0.163523] estack_pages[6] = 0x0
> > [0.164065] estack_pages[7] = 0x810007000
> > [0.164978] estack_pages[8] = 0x0
> > [0.165624] estack_pages[9] = 0x910009000
> > [0.166448] estack_pages[10] = 0x0
> > [0.167064] estack_pages[11] = 0xa1000b000
> > [0.168055] estack_pages[12] = 0x0
> 
> Errm. estack_pages is statically initialized and it's an array of:.
> 
> struct estack_pages {
> u32 offs;
> u16 size;
> u16 type;
> };
> 
> [0,2,4,5,6,8,10,12] are guard pages so 0 is not that crappy at all

Wait, Thomas, I might be wrong, but per-cpu is initialized to the pointer,
the memory for this estack_pages has not yet been allocated, no?

> The rest looks completely valid if you actually decode it proper.

The diff I made to fetch the values are

diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 753b8cfe8b8a..bf0d755b6079 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -101,8 +101,18 @@ static bool in_exception_stack(unsigned long *stack, 
struct stack_info *info)
 
/* Calc page offset from start of exception stacks */
k = (stk - begin) >> PAGE_SHIFT;
+
/* Lookup the page descriptor */
ep = _pages[k];
+
+   printk("stk 0x%lx k %u begin 0x%lx end 0x%lx estack_pages 0x%lx ep 
0x%lx\n",
+  stk, k, begin, end, (long)(void *)_pages[0], (long)(void 
*)ep);
+
+   for (k = 0; k < CEA_ESTACK_PAGES; k++) {
+   long v = *(long *)(void *)_pages[k];
+   printk("estack_pages[%d] = 0x%lx\n", k, v);
+   }
+
/* Guard page? */
if (!ep->size)
return false;


> 
> e.g. 0x51000 1000
> 
>  bit  0-31: 1000  Offset 0x1000: 1 Page
>  bit 32-47: 1000  Size 0x1000:   1 Page
>  bit 48-63: 5 Type 5: STACK_TYPE_EXCEPTION + ESTACK_DF
> 
> So, no. This is NOT the problem.

I drop the left of your reply. True, I agreed with anything you said.
You know I didn't manage to dive more into this problem yesterday
but if time permits I'll continue today. It is easily triggering
under kvm (the kernel I'm building is almost without modules so
I simply upload bzImage into the guest). FWIW, the config I'm
using is https://gist.github.com/cyrillos/7cd5d2510a99af8ea872f07ac6f9095b


Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.

2019-10-23 Thread Cyrill Gorcunov
On Tue, Oct 22, 2019 at 09:11:04PM -0700, Andy Lutomirski wrote:
> Trying again.  It looks like I used the wrong address for Pavel.

Thanks for CC Andy! I must confess I didn't dive into userfaultfd engine
personally but let me CC more people involved from criu side. (overquoting
left untouched for their sake).

> 
> On Sat, Oct 12, 2019 at 6:14 PM Andy Lutomirski  wrote:
> >
> > [adding more people because this is going to be an ABI break, sigh]
> >
> > On Sat, Oct 12, 2019 at 5:52 PM Daniel Colascione  wrote:
> > >
> > > On Sat, Oct 12, 2019 at 4:10 PM Andy Lutomirski  wrote:
> > > >
> > > > On Sat, Oct 12, 2019 at 12:16 PM Daniel Colascione  
> > > > wrote:
> > > > >
> > > > > The new secure flag makes userfaultfd use a new "secure" anonymous
> > > > > file object instead of the default one, letting security modules
> > > > > supervise userfaultfd use.
> > > > >
> > > > > Requiring that users pass a new flag lets us avoid changing the
> > > > > semantics for existing callers.
> > > >
> > > > Is there any good reason not to make this be the default?
> > > >
> > > >
> > > > The only downside I can see is that it would increase the memory usage
> > > > of userfaultfd(), but that doesn't seem like such a big deal.  A
> > > > lighter-weight alternative would be to have a single inode shared by
> > > > all userfaultfd instances, which would require a somewhat different
> > > > internal anon_inode API.
> > >
> > > I'd also prefer to just make SELinux use mandatory, but there's a
> > > nasty interaction with UFFD_EVENT_FORK. Adding a new UFFD_SECURE mode
> > > which blocks UFFD_EVENT_FORK sidesteps this problem. Maybe you know a
> > > better way to deal with it.
> >
> > ...
> >
> > > But maybe we can go further: let's separate authentication and
> > > authorization, as we do in other LSM hooks. Let's split my
> > > inode_init_security_anon into two hooks, inode_init_security_anon and
> > > inode_create_anon. We'd define the former to just initialize the file
> > > object's security information --- in the SELinux case, figuring out
> > > its class and SID --- and define the latter to answer the yes/no
> > > question of whether a particular anonymous inode creation should be
> > > allowed. Normally, anon_inode_getfile2() would just call both hooks.
> > > We'd add another anon_inode_getfd flag, ANON_INODE_SKIP_AUTHORIZATION
> > > or something, that would tell anon_inode_getfile2() to skip calling
> > > the authorization hook, effectively making the creation always
> > > succeed. We can then make the UFFD code pass
> > > ANON_INODE_SKIP_AUTHORIZATION when it's creating a file object in the
> > > fork child while creating UFFD_EVENT_FORK messages.
> >
> > That sounds like an improvement.  Or maybe just teach SELinux that
> > this particular fd creation is actually making an anon_inode that is a
> > child of an existing anon inode and that the context should be copied
> > or whatever SELinux wants to do.  Like this, maybe:
> >
> > static int resolve_userfault_fork(struct userfaultfd_ctx *ctx,
> >   struct userfaultfd_ctx *new,
> >   struct uffd_msg *msg)
> > {
> > int fd;
> >
> > Change this:
> >
> > fd = anon_inode_getfd("[userfaultfd]", _fops, new,
> >   O_RDWR | (new->flags & 
> > UFFD_SHARED_FCNTL_FLAGS));
> >
> > to something like:
> >
> >   fd = anon_inode_make_child_fd(..., ctx->inode, ...);
> >
> > where ctx->inode is the one context's inode.
> >
> > *** HOWEVER *** !!!
> >
> > Now that you've pointed this mechanism out, it is utterly and
> > completely broken and should be removed from the kernel outright or at
> > least severely restricted.  A .read implementation MUST NOT ACT ON THE
> > CALLING TASK.  Ever.  Just imagine the effect of passing a userfaultfd
> > as stdin to a setuid program.
> >
> > So I think the right solution might be to attempt to *remove*
> > UFFD_EVENT_FORK.  Maybe the solution is to say that, unless the
> > creator of a userfaultfd() has global CAP_SYS_ADMIN, then it cannot
> > use UFFD_FEATURE_EVENT_FORK) and print a warning (once) when
> > UFFD_FEATURE_EVENT_FORK is allowed.  And, after some suitable
> > deprecation period, just remove it.  If it's genuinely useful, it
> > needs an entirely new API based on ioctl() or a syscall.  Or even
> > recvmsg() :)
> >
> > And UFFD_SECURE should just become automatic, since you don't have a
> > problem any more. :-p
> >
> > --Andy
> 

Cyrill


Re: [BUG -tip] kmemleak and stacktrace cause page faul

2019-10-22 Thread Cyrill Gorcunov
On Tue, Oct 22, 2019 at 05:23:25PM +0300, Cyrill Gorcunov wrote:
> 
> I presume the kmemleak tries to save stack trace too early when estack_pages 
> are not
> yet filled.

Indeed, at this stage of boot the percpu_setup_exception_stacks has not been 
called
yet and estack_pages full of crap

[0.157502] stk 0x1008 k 1 begin 0x0 end 0xd000 estack_pages 
0x82014880 ep 0x82014888
[0.159395] estack_pages[0] = 0x0
[0.160046] estack_pages[1] = 0x510001000
[0.160881] estack_pages[2] = 0x0
[0.161530] estack_pages[3] = 0x610003000
[0.162343] estack_pages[4] = 0x0
[0.162962] estack_pages[5] = 0x0
[0.163523] estack_pages[6] = 0x0
[0.164065] estack_pages[7] = 0x810007000
[0.164978] estack_pages[8] = 0x0
[0.165624] estack_pages[9] = 0x910009000
[0.166448] estack_pages[10] = 0x0
[0.167064] estack_pages[11] = 0xa1000b000
[0.168055] estack_pages[12] = 0x0
[0.168891] BUG: unable to handle page fault for address: 1ff0



Re: [BUG -tip] kmemleak and stacktrace cause page faul

2019-10-22 Thread Cyrill Gorcunov
On Sat, Oct 19, 2019 at 02:44:21PM +0300, Cyrill Gorcunov wrote:
...
> 
> I nailed it down to the following kmemleak code
> 
> create_object
>   ...
>   object->trace_len = __save_stack_trace(object->trace);
> 
> if I drop this line out it boots fine. Just wanted to share the observation,
> probably it is known issue already.
> 
> Sidenote: The last -tip kernel which I've been working with is dated Sep 18
> so the changes which cause the problem should be introduced last month.

I've just tried to boot with fresh -tip

commit c2e50c28eeb90c0f3309d43c2ce0bd292a6751b3 (HEAD -> master, origin/master, 
origin/HEAD)
Merge: aec1ea9d4f48 27a0a90d6301
Author: Ingo Molnar 
Date:   Tue Oct 22 01:16:59 2019 +0200

Merge branch 'perf/core'

Conflicts:
tools/perf/check-headers.sh

but got the same issue. So I tried to go deeper, and here is a result: we're 
failing
in arch/x86/kernel/dumpstack_64.c:in_exception_stack routine, precisely at line

ep = _pages[k];
/* Guard page? */
if (!ep->size)
return false;

so I added a logline here

[0.082275] stk 0x1010 k 1 begin 0x0 end 0xd000 estack_pages 
0x82014880 ep 0x82014888
[0.084951] BUG: unable to handle page fault for address: 1ff0
[0.086724] #PF: supervisor read access in kernel mode
[0.088506] #PF: error_code(0x) - not-present page
[0.090265] PGD 0 P4D 0 
[0.090846] Oops:  [#2] PREEMPT SMP PTI
[0.091734] CPU: 0 PID: 0 Comm: swapper Not tainted 
5.4.0-rc4-00258-gc2e50c28eeb9-dirty #114
[0.093514] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[0.096993] RIP: 0010:get_stack_info+0xdc/0x173
[0.097994] Code: 84 48 01 82 66 85 c0 74 27 42 8b 14 f5 80 48 01 82 49 01 
d5 42 0f b7 14 f5 86 48 01 82 4c 01 e8 4c 89 6b 08 89 13 48 89 43 10 <48> 8b 40 
f0 eb 2b 65 48 8b 05 16 f4 f9 7e 48 8d 90 00 c0 ff ff 48

I presume the kmemleak tries to save stack trace too early when estack_pages 
are not
yet filled.



[BUG -tip] kmemleak and stacktrace cause page faul

2019-10-19 Thread Cyrill Gorcunov
Hi! I'm not sure if I've CC'ed proper persons, so please sorry if I did.
Anyway, today's -tip (07b4dbf1d830) refused to boot

[0.024793] No NUMA configuration found
[0.025406] Faking a node at [mem 0x-0x7ffdefff]
[0.026462] NODE_DATA(0) allocated [mem 0x7ffdb000-0x7ffdefff]
[0.027246] BUG: unable to handle page fault for address: 1ff0
[0.028160] #PF: supervisor read access in kernel mode
[0.028992] #PF: error_code(0x) - not-present page
[0.029820] PGD 0 P4D 0 
[0.030226] Oops:  [#1] PREEMPT SMP PTI
[0.031069] CPU: 0 PID: 0 Comm: swapper Not tainted 
5.4.0-rc3-00258-g07b4dbf1d830 #93
[0.032317] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[0.034163] RIP: 0010:get_stack_info+0xb3/0x148
[0.034903] Code: 04 d5 84 48 01 82 66 85 c0 74 25 8b 0c d5 80 48 01 82 0f 
b7 14 d5 86 48 01 82 48 01 f1 89 13 48 01 c8 48 89 4b 08 48 89 43 10 <48> 8b 40 
f0 eb 2b 65 48 8b 05 1f f4 f9 7e 48 8d 90 00 c0 ff ff 48
[0.037579] RSP: :82603be0 EFLAGS: 00010006

I nailed it down to the following kmemleak code

create_object
  ...
  object->trace_len = __save_stack_trace(object->trace);

if I drop this line out it boots fine. Just wanted to share the observation,
probably it is known issue already.

Sidenote: The last -tip kernel which I've been working with is dated Sep 18
so the changes which cause the problem should be introduced last month.


Re: [PATCH] mm, memcg: assign shrinker_map before kvfree

2019-09-20 Thread Cyrill Gorcunov
On Fri, Sep 20, 2019 at 05:11:00PM +0300, Kirill Tkhai wrote:
> 
> The current scheme is following. We allocate shrinker_map in css_online,
> while normal freeing happens in css_free. The NULLifying of pointer is needed
> in case of "abnormal freeing", when memcg_free_shrinker_maps() is called
> from memcg_alloc_shrinker_maps(). The NULLifying guarantees, we won't free
> pn->shrinker_map twice.
> 
> There are no races or problems with kvfree() and rcu_assign_pointer() order,
> because of nobody can reference shrinker_map before memcg is online.
> 
> In case of this rcu_assign_pointer() confuses, we may just remove is from
> the function, and call it only on css_free. Something like the below:

Kirill, I know that there is no problem now (as I pointed in changelog),
simply a regular pattern of free after assign is being reversed, which
made me nervious. Anyway dropping assigns doesn't help much from my pov
so lets leave it as is. The good point is that we've this conversation
and if someone get a bit confused in future the google will reveal this
text. Which is enough I think.


Re: [PATCH] mm, memcg: assign shrinker_map before kvfree

2019-09-20 Thread Cyrill Gorcunov
On Fri, Sep 20, 2019 at 04:21:14PM +0300, Kirill A. Shutemov wrote:
> On Fri, Sep 20, 2019 at 03:29:07PM +0300, Cyrill Gorcunov wrote:
> > Currently there is a small gap between fetching pointer, calling
> > kvfree and assign its value to nil. In current callgraph it is
> > not a problem (since memcg_free_shrinker_maps is running from
> > memcg_alloc_shrinker_maps and mem_cgroup_css_free only) still
> > this looks suspicious and we can easily eliminate the gap at all.
> 
> With this logic it will still look suspicious since you don't wait
> a grace period before freeing the map.

Probably, but as far as I see we're using mutex here to order
requests. I'm not sure, maybe ktkhai@ made the code to use free
before the assign intentionally? As I said there is no bug it
the code right now just forced me to stop and reread it several
times due to this gap. If you look into other code places where
we use similar technique we always assign before free.


[PATCH] mm, memcg: assign shrinker_map before kvfree

2019-09-20 Thread Cyrill Gorcunov
Currently there is a small gap between fetching pointer, calling
kvfree and assign its value to nil. In current callgraph it is
not a problem (since memcg_free_shrinker_maps is running from
memcg_alloc_shrinker_maps and mem_cgroup_css_free only) still
this looks suspicious and we can easily eliminate the gap at all.

Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Kirill Tkhai 
Signed-off-by: Cyrill Gorcunov 
---
 mm/memcontrol.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-tip.git/mm/memcontrol.c
===
--- linux-tip.git.orig/mm/memcontrol.c
+++ linux-tip.git/mm/memcontrol.c
@@ -364,9 +364,9 @@ static void memcg_free_shrinker_maps(str
for_each_node(nid) {
pn = mem_cgroup_nodeinfo(memcg, nid);
map = rcu_dereference_protected(pn->shrinker_map, true);
+   rcu_assign_pointer(pn->shrinker_map, NULL);
if (map)
kvfree(map);
-   rcu_assign_pointer(pn->shrinker_map, NULL);
}
 }
 


Re: [PATCH 4/9] select: Micro-optimise __estimate_accuracy()

2019-09-19 Thread Cyrill Gorcunov
On Mon, Sep 09, 2019 at 11:23:35AM +0100, Dmitry Safonov wrote:
> Shift on s64 is faster than division, use it instead.
> 
> As the result of the patch there is a hardly user-visible effect:
> poll(), select(), etc syscalls will be a bit more precise on ~2.3%
> than before because 1000 != 1024 :)
> 
> Signed-off-by: Dmitry Safonov 
> ---
>  fs/select.c | 9 -
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/select.c b/fs/select.c
> index 12cdefd3be2d..2477c202631e 100644
> --- a/fs/select.c
> +++ b/fs/select.c
> @@ -51,15 +51,14 @@
>  
>  static long __estimate_accuracy(ktime_t slack)
>  {
> - int divfactor = 1000;
> -
>   if (slack < 0)
>   return 0;

Btw, don't you better use <= here?


[PATCH -tip] x86/boot: Reserve enough space for any cpu level string

2019-09-19 Thread Cyrill Gorcunov
Currently when we're validating cpu on boot stage we assume that cpu level
won't ever be greater than two digits. In particular the check_cpu() helper
ensures to be so. Still the cpu_name() helper is a way too far from
check_cpu() routine internals so I think better to be on a safe side
and prevent any possible overrun errors when printing cpu name.
For this sake just increase the buffer up to be able to keep any level.

CC: Ingo Molnar 
CC: Thomas Gleixner 
Signed-off-by: Cyrill Gorcunov 
---
 arch/x86/boot/cpu.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-tip.git/arch/x86/boot/cpu.c
===
--- linux-tip.git.orig/arch/x86/boot/cpu.c
+++ linux-tip.git/arch/x86/boot/cpu.c
@@ -20,7 +20,7 @@
 
 static char *cpu_name(int level)
 {
-   static char buf[6];
+   static char buf[16];
 
if (level == 64) {
return "x86-64";


Re: [PATCH 4/9] select: Micro-optimise __estimate_accuracy()

2019-09-09 Thread Cyrill Gorcunov
On Mon, Sep 09, 2019 at 12:50:27PM +0100, Dmitry Safonov wrote:
> Hi Cyrill,
> 
> On Mon, 9 Sep 2019 at 12:18, Cyrill Gorcunov  wrote:
> > Compiler precompute constants so it doesn't do division here.
> > But I didn't read the series yet so I might be missing
> > something obvious.
> 
> Heh, like a division is in ktime_divns()?

Ah, you meant the ktime_divns you've dropped out. I thought
you were talking about the constant value we've had here before
your patch. Seems I didn't got the changelog right. Anyway
need to take more precise look on the series. Hopefully soon.


Re: [PATCH 4/9] select: Micro-optimise __estimate_accuracy()

2019-09-09 Thread Cyrill Gorcunov
On Mon, Sep 09, 2019 at 11:23:35AM +0100, Dmitry Safonov wrote:
> Shift on s64 is faster than division, use it instead.
> 
> As the result of the patch there is a hardly user-visible effect:
> poll(), select(), etc syscalls will be a bit more precise on ~2.3%
> than before because 1000 != 1024 :)
> 
> Signed-off-by: Dmitry Safonov 

> ---
>  fs/select.c | 9 -
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/select.c b/fs/select.c
> index 12cdefd3be2d..2477c202631e 100644
> --- a/fs/select.c
> +++ b/fs/select.c
> @@ -51,15 +51,14 @@
>  
>  static long __estimate_accuracy(ktime_t slack)
>  {
> - int divfactor = 1000;
> -
>   if (slack < 0)
>   return 0;
>  
> - if (task_nice(current) > 0)
> - divfactor = divfactor / 5;
> + /* A bit more precise than 0.1% */
> + slack = slack >> 10;
>  
> - slack = ktime_divns(slack, divfactor);
> + if (task_nice(current) > 0)
> + slack = slack * 5;
>  
>   if (slack > MAX_SLACK)
>   return MAX_SLACK;

Compiler precompute constants so it doesn't do division here.
But I didn't read the series yet so I might be missing
something obvious.


Re: [PATCH v2] sys_prctl(): remove unsigned comparision with less than zero

2019-07-24 Thread Cyrill Gorcunov
On Wed, Jul 24, 2019 at 10:11:48AM +0800, Yang Xu wrote:
> Currently, when calling prctl(PR_SET_TIMERSLACK, arg2), arg2 is an
> unsigned long value, arg2 will never < 0. Negative judgment is
> meaningless, so remove it.
> 
> Fixes: 6976675d9404 ("hrtimer: create a "timer_slack" field in the task 
> struct")
> Signed-off-by: Yang Xu 
> Cc: Cyrill Gorcunov 
Acked-by: Cyrill Gorcunov 


Re: [PATCH] sys_prctl(): simplify arg2 judgment when calling PR_SET_TIMERSLACK

2019-07-23 Thread Cyrill Gorcunov
On Tue, Jul 23, 2019 at 04:11:09PM +0800, Yang Xu wrote:
> > 2) according to man page passing negative value should be acceptable,
> > though it never worked as expected. I've been grepping "git log"
> > for this file and the former API is coming from
> > 
> > commit 6976675d94042fbd446231d1bd8b7de71a980ada
> > Author: Arjan van de Ven
> > Date:   Mon Sep 1 15:52:40 2008 -0700
> > 
> >  hrtimer: create a "timer_slack" field in the task struct
> > 
> > which is 11 years old by now. Nobody complained so far even when man
> > page is saying pretty obviously
> > 
> > PR_SET_TIMERSLACK (since Linux 2.6.28)
> >Each thread has two associated timer slack values:  a  
> > "default"
> >value, and a "current" value.  This operation sets the 
> > "current"
> >timer slack value for the calling  thread.   If  the  
> > nanosecond
> >value  supplied in arg2 is greater than zero, then the 
> > "current"
> >value is set to this value.  If arg2 is less than  or  equal 
> >  to
> >zero,  the  "current"  timer  slack  is  reset  to  the 
> > thread's
> >"default" timer slack value.
> > 
> > So i think to match the man page (and assuming that accepting negative value
> > has been supposed) we should rather do
> > 
> > if ((long)arg2<  0)
> Looks correct. But if we set a ULONG_MAX(PR_GET_TIMERSLACK also limits 
> ULONG_MAX)
> value(about 4s) on 32bit machine, this code will think this value is a 
> negative value and use default value.
> 
> I guess man page was written as "less than or equal to zero" because of this 
> confusing code(arg2<=0, but arg2
> is an unsinged long value).
> I think we can change this man page and also add bounds value description.

OK, seems reasonable. I think we should use comparision with zero
and simply update a man page.


Re: [PATCH] sys_prctl(): simplify arg2 judgment when calling PR_SET_TIMERSLACK

2019-07-23 Thread Cyrill Gorcunov
On Tue, Jul 23, 2019 at 11:30:53AM +0800, Yang Xu wrote:
> arg2 will never < 0, for its type is 'unsigned long'. So negative
> judgment is meaningless.
> 
> Signed-off-by: Yang Xu 
> ---
>  kernel/sys.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 2969304c29fe..399457d26bef 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2372,11 +2372,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, 
> arg2, unsigned long, arg3,
>   error = current->timer_slack_ns;
>   break;
>   case PR_SET_TIMERSLACK:
> - if (arg2 <= 0)
> + if (arg2)
> + current->timer_slack_ns = arg2;
> + else
>   current->timer_slack_ns =
>   current->default_timer_slack_ns;
> - else
> - current->timer_slack_ns = arg2;
>   break;
>   case PR_MCE_KILL:
>   if (arg4 | arg5)

>From a glance it looks correct to me, but then...

1) you might simply compare with zero, iow if (arg2 == 0)
   instead of changing 7 lines
2) according to man page passing negative value should be acceptable,
   though it never worked as expected. I've been grepping "git log"
   for this file and the former API is coming from

commit 6976675d94042fbd446231d1bd8b7de71a980ada
Author: Arjan van de Ven 
Date:   Mon Sep 1 15:52:40 2008 -0700

hrtimer: create a "timer_slack" field in the task struct

which is 11 years old by now. Nobody complained so far even when man
page is saying pretty obviously

   PR_SET_TIMERSLACK (since Linux 2.6.28)
  Each thread has two associated timer slack values:  a  "default"
  value, and a "current" value.  This operation sets the "current"
  timer slack value for the calling  thread.   If  the  nanosecond
  value  supplied in arg2 is greater than zero, then the "current"
  value is set to this value.  If arg2 is less than  or  equal  to
  zero,  the  "current"  timer  slack  is  reset  to  the thread's
  "default" timer slack value.

So i think to match the man page (and assuming that accepting negative value
has been supposed) we should rather do

if ((long)arg2 < 0)

Thoughts?


Re: [PATCH v2 5/6] proc: use down_read_killable mmap_sem for /proc/pid/map_files

2019-06-13 Thread Cyrill Gorcunov
On Wed, Jun 12, 2019 at 04:14:28PM -0700, Andrei Vagin wrote:
> On Sun, Jun 09, 2019 at 01:09:00PM +0300, Konstantin Khlebnikov wrote:
> > Do not stuck forever if something wrong.
> > Killable lock allows to cleanup stuck tasks and simplifies investigation.
> 
> This patch breaks the CRIU project, because stat() returns EINTR instead
> of ENOENT:
> 
> [root@fc24 criu]# stat /proc/self/map_files/0-0
> stat: cannot stat '/proc/self/map_files/0-0': Interrupted system call
> 
> Here is one inline comment with the fix for this issue.
> 
> > 
> > It seems ->d_revalidate() could return any error (except ECHILD) to
> > abort validation and pass error as result of lookup sequence.
> > 
> > Signed-off-by: Konstantin Khlebnikov 
> > Reviewed-by: Roman Gushchin 
> > Reviewed-by: Cyrill Gorcunov 
> > Reviewed-by: Kirill Tkhai 
> 
> It was nice to see all four of you in one place :).

Holymoly ;) And we all managed to miss this error code.
Thanks, Andrew!


Re: [RFC PATCH] binfmt_elf: Protect mm_struct access with mmap_sem

2019-06-12 Thread Cyrill Gorcunov
On Wed, Jun 12, 2019 at 10:51:59AM -0700, Matthew Wilcox wrote:
> On Wed, Jun 12, 2019 at 07:29:15PM +0200, Michal Koutný wrote:
> > On Wed, Jun 12, 2019 at 10:00:34AM -0700, Matthew Wilcox 
> >  wrote:
> > > On Wed, Jun 12, 2019 at 04:28:11PM +0200, Michal Koutný wrote:
> > > > -   /* N.B. passed_fileno might not be initialized? */
> > > > +
> > > 
> > > Why did you delete this comment?
> > The variable got removed in
> > d20894a23708 ("Remove a.out interpreter support in ELF loader")
> > so it is not relevant anymore.
> 
> Better put that in the changelog for v2 then.  or even make it a
> separate patch.

Just updated changelog should be fine, I guess. A separate commit
just to remove an obsolete comment is too much.


Re: [RFC PATCH] binfmt_elf: Protect mm_struct access with mmap_sem

2019-06-12 Thread Cyrill Gorcunov
On Wed, Jun 12, 2019 at 04:28:11PM +0200, Michal Koutný wrote:
> find_extend_vma assumes the caller holds mmap_sem as a reader (explained
> in expand_downwards()). The path when we are extending the stack VMA to
> accomodate argv[] pointers happens without the lock.
> 
> I was not able to cause an mm_struct corruption but
> BUG_ON(!rwsem_is_locked(>mmap_sem)) in find_extend_vma could be
> triggered as
> 
> #  xargs: echo: terminated by signal 11
> 
> (bigfile needs to have more than RLIMIT_STACK / sizeof(char *) rows)
> 
> Other accesses to mm_struct in exec path are protected by mmap_sem, so
> conservatively, protect also this one. Besides that, explain why we omit
> mm_struct.arg_lock in the exec(2) path.
> 
> Cc: Cyrill Gorcunov 
> Signed-off-by: Michal Koutný 
> ---
> 
> When I was attempting to reduce usage of mmap_sem I came across this
> unprotected access and increased number of its holders :-/
> 
> I'm not sure whether there is a real concurrent writer at this early
> stages (I considered khugepaged especially as setup_arg_pages invokes
> khugepaged_enter_vma_merge but we're lucky because khugepaged skips it
> because of VM_STACK_INCOMPLETE_SETUP).
> 
> A nicer approach would perhaps be to do all this exec setup when the
> mm_struct is still not exposed via current->mm (and hence no need to
> synchronize via mmap_sem). But I didn't look enough into binfmt specific
> whether it is even doable and worth it.
> 
> So I'm sending this for a discussion.
> 
>  fs/binfmt_elf.c | 10 +-
>  fs/exec.c   |  3 ++-
>  2 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 8264b468f283..48e169760a9c 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -299,7 +299,11 @@ create_elf_tables(struct linux_binprm *bprm, struct 
> elfhdr *exec,
>* Grow the stack manually; some architectures have a limit on how
>* far ahead a user-space access may be in order to grow the stack.
>*/
> + if (down_read_killable(>mm->mmap_sem))
> + return -EINTR;
>   vma = find_extend_vma(current->mm, bprm->p);
> + up_read(>mm->mmap_sem);
> +

Good catch, Michal! Actually the loader code is heavy on its own so
I think having readlock taken here should not cause any perf problems
but worth having for consistency.

Reviewed-by: Cyrill Gorcunov 


Re: [PATCH] kernel/sys.c: fix possible spectre-v1 in do_prlimit()

2019-05-30 Thread Cyrill Gorcunov
On Thu, May 30, 2019 at 01:45:16PM +0800, Dianzhang Chen wrote:
>
> Though syscall `getrlimit` , it seems not works after 
> check_prlimit_permission.
> 
> And the speculation windows are large, as said[1]:
> >> Can the speculation proceed past the task_lock()?  Or is the policy to
> >> ignore such happy happenstances even if they are available?
> >
> > Locks are not in the way of speculation. Speculation has almost no limits
> > except serializing instructions. At least they respect the magic AND
> > limitation in array_index_nospec().
> 
> [1] https://do-db2.lkml.org/lkml/2018/5/15/1056

Please, stop top-posting, it trashes conversation context. You miss the point:
before bpu hits misprediction we've a number of branches, second in case of
misprediction the kernel's stack value is cached, so I'm not convinced at all
that teaching bpu and read the cache is easy here (or possible at all). Thus,
the final solution is up to maintainers. Another reason why I complaining about
the patch -- it is not the patch body, as I said I'm fine with it, but the patch
title: it implies the fix should go in stable kernels, that's what I dont agree
with. But again, I'm not a maintainer and might be wrong.

> 
> On Wed, May 29, 2019 at 8:18 PM Cyrill Gorcunov  wrote:
> >
> > On Wed, May 29, 2019 at 10:39:52AM +0800, Dianzhang Chen wrote:
> > > Hi,
> > >
> > > Although when detect it is misprediction and drop the execution, but
> > > it can not drop all the effects of speculative execution, like the
> > > cache state. During the speculative execution, the:
> > >
> > >
> > > rlim = tsk->signal->rlim + resource;// use resource as index
> > >
> > > ...
> > >
> > > *old_rlim = *rlim;
> > >
> > >
> > > may read some secret data into cache.
> > >
> > > and then the attacker can use side-channel attack to find out what the
> > > secret data is.
> >
> > This code works after check_prlimit_permission call, which means you already
> > should have a permission granted. And you implies that misprediction gonna
> > be that deep which involves a number of 
> > calls/read/writes/jumps/locks-rb-wb-flushes
> > and a bunch or other instructions, moreover all conditions are 
> > "mispredicted".
> > This is very bold and actually unproved claim!
> >
> > Note that I pointed the patch is fine in cleanup context but seriously I
> > don't see how this all can be exploitable in sense of spectre.
> 

Cyrill


Re: [PATCH] kernel/sys.c: fix possible spectre-v1 in do_prlimit()

2019-05-29 Thread Cyrill Gorcunov
On Wed, May 29, 2019 at 10:39:52AM +0800, Dianzhang Chen wrote:
> Hi,
> 
> Although when detect it is misprediction and drop the execution, but
> it can not drop all the effects of speculative execution, like the
> cache state. During the speculative execution, the:
> 
> 
> rlim = tsk->signal->rlim + resource;// use resource as index
> 
> ...
> 
> *old_rlim = *rlim;
> 
> 
> may read some secret data into cache.
> 
> and then the attacker can use side-channel attack to find out what the
> secret data is.

This code works after check_prlimit_permission call, which means you already
should have a permission granted. And you implies that misprediction gonna
be that deep which involves a number of 
calls/read/writes/jumps/locks-rb-wb-flushes
and a bunch or other instructions, moreover all conditions are "mispredicted".
This is very bold and actually unproved claim!

Note that I pointed the patch is fine in cleanup context but seriously I
don't see how this all can be exploitable in sense of spectre.


Re: [PATCH] kernel/sys.c: fix possible spectre-v1 in do_prlimit()

2019-05-28 Thread Cyrill Gorcunov
On Tue, May 28, 2019 at 10:37:10AM +0800, Dianzhang Chen wrote:
> Hi,
> Because when i reply your email,i always get 'Message rejected' from
> gmail(get this rejection from all the recipients). I still don't know
> how to deal with it, so i reply your email here:

Hi! This is weird. Next time simply reply to LKML (I CC'ed it back).

> Because of speculative execution, the attacker can bypass the bound
> check `if (resource >= RLIM_NLIMITS)`.

And then misprediction get detected and execution is dropped. So I
still don't see a problem here, since we don't leak info even in
such case.

That said I don't mind for this patch but rather in a sake of
code clarity, not because of spectre issue since it has
nothing to do here.

> as for array_index_nospec(index, size), it will clamp the index within
> the range of [0, size), and attacker can't exploit speculative
> execution to make the index out of range [0, size).
> 
> 
> For more detail, please check the link below:
> 
> https://github.com/torvalds/linux/commit/f3804203306e098dae9ca51540fcd5eb700d7f40


Re: [PATCH] kernel/sys.c: fix possible spectre-v1 in do_prlimit()

2019-05-27 Thread Cyrill Gorcunov
On Mon, May 27, 2019 at 03:23:08PM +0800, Dianzhang Chen wrote:
> The `resource` in do_prlimit() is controlled by userspace via syscall: 
> setrlimit(defined in kernel/sys.c), hence leading to a potential exploitation 
> of the Spectre variant 1 vulnerability.
> The relevant code in do_prlimit() is as below:
> 
> if (resource >= RLIM_NLIMITS)
> return -EINVAL;
> ...
> rlim = tsk->signal->rlim + resource;// use resource as index
> ...
> *old_rlim = *rlim;
> 
> Fix this by sanitizing resource before using it to index tsk->signal->rlim.
> 
> Signed-off-by: Dianzhang Chen 
> ---
>  kernel/sys.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/sys.c b/kernel/sys.c
> index bdbfe8d..7eba1ca 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1532,6 +1532,8 @@ int do_prlimit(struct task_struct *tsk, unsigned int 
> resource,
>  
>   if (resource >= RLIM_NLIMITS)
>   return -EINVAL;
> +
> + resource = array_index_nospec(resource, RLIM_NLIMITS);
>   if (new_rlim) {
>   if (new_rlim->rlim_cur > new_rlim->rlim_max)
>   return -EINVAL;

Could you please explain in details how array_index_nospec is different
from resource >= RLIM_NLIMITS? Since I don't get how it is related to
spectre issue.


Re: [PATCH 1/5] proc: use down_read_killable for /proc/pid/maps

2019-05-15 Thread Cyrill Gorcunov
On Wed, May 15, 2019 at 11:41:12AM +0300, Konstantin Khlebnikov wrote:
> Do not stuck forever if something wrong.
> This function also used for /proc/pid/smaps.
> 
> Signed-off-by: Konstantin Khlebnikov 

All patches in series look ok to me (actually I thought if there
is a scenario where might_sleep may trigger a warning, and didn't
find one, so should be safe).

Reviewed-by: Cyrill Gorcunov 


Re: [PATCH v2 1/2] kernel/sys: add PR_GET_TASK_SIZE option to prctl(2)

2019-05-03 Thread Cyrill Gorcunov
On Thu, May 02, 2019 at 05:46:08PM -0400, Joel Savitz wrote:
> > Won't be possible to use put_user here? Something like
> >
> > static int prctl_get_tasksize(unsigned long __user *uaddr)
> > {
> > return put_user(TASK_SIZE, uaddr) ? -EFAULT : 0;
> > }
> 
> What would be the benefit of using put_user() over copy_to_user() in
> this context?

It is a common pattern to use put_user with native types, where
copy_to_user more biased for composed types transfer.


Re: [PATCH v2 0/2] sys/prctl: expose TASK_SIZE value to userspace

2019-05-02 Thread Cyrill Gorcunov
On Thu, May 02, 2019 at 05:01:38PM -0400, Waiman Long wrote:
> 
> What did you change in v2 versus v1?

Seems unsigned long long has been changed to unsigned long.


Re: [PATCH v2 1/2] kernel/sys: add PR_GET_TASK_SIZE option to prctl(2)

2019-05-02 Thread Cyrill Gorcunov
On Thu, May 02, 2019 at 04:52:21PM -0400, Joel Savitz wrote:
>  
> +static int prctl_get_tasksize(void __user * uaddr)
> +{
> + unsigned long task_size = TASK_SIZE;
> + return copy_to_user(uaddr, _size, sizeof(unsigned long))
> + ? -EFAULT : 0;
> +}

Won't be possible to use put_user here? Something like

static int prctl_get_tasksize(unsigned long __user *uaddr)
{
return put_user(TASK_SIZE, uaddr) ? -EFAULT : 0;
}


Re: [PATCH v3 1/2] prctl_set_mm: Refactor checks from validate_prctl_map

2019-05-02 Thread Cyrill Gorcunov
On Thu, May 02, 2019 at 02:52:02PM +0200, Michal Koutný wrote:
> Despite comment of validate_prctl_map claims there are no capability
> checks, it is not completely true since commit 4d28df6152aa ("prctl:
> Allow local CAP_SYS_ADMIN changing exe_file"). Extract the check out of
> the function and make the function perform purely arithmetic checks.
> 
> This patch should not change any behavior, it is mere refactoring for
> following patch.
> 
> v1, v2: ---
> v3: Remove unused mm variable from validate_prctl_map_addr
> 
> CC: Kirill Tkhai 
> CC: Cyrill Gorcunov 
> Signed-off-by: Michal Koutný 
> Reviewed-by: Kirill Tkhai 
Reviewed-by: Cyrill Gorcunov 


Re: [PATCH v3 2/2] prctl_set_mm: downgrade mmap_sem to read lock

2019-05-02 Thread Cyrill Gorcunov
On Thu, May 02, 2019 at 02:52:03PM +0200, Michal Koutný wrote:
> The commit a3b609ef9f8b ("proc read mm's {arg,env}_{start,end} with mmap
> semaphore taken.") added synchronization of reading argument/environment
> boundaries under mmap_sem. Later commit 88aa7cc688d4 ("mm: introduce
> arg_lock to protect arg_start|end and env_start|end in mm_struct")
> avoided the coarse use of mmap_sem in similar situations. But there
> still remained two places that (mis)use mmap_sem.
> 
> get_cmdline should also use arg_lock instead of mmap_sem when it reads the
> boundaries.
> 
> The second place that should use arg_lock is in prctl_set_mm. By
> protecting the boundaries fields with the arg_lock, we can downgrade
> mmap_sem to reader lock (analogous to what we already do in
> prctl_set_mm_map).
> 
> v2: call find_vma without arg_lock held
> v3: squashed get_cmdline arg_lock patch
> 
> Fixes: 88aa7cc688d4 ("mm: introduce arg_lock to protect arg_start|end and 
> env_start|end in mm_struct")
> Cc: Yang Shi 
> Cc: Mateusz Guzik 
> CC: Cyrill Gorcunov 
> Co-developed-by: Laurent Dufour 
> Signed-off-by: Laurent Dufour 
> Signed-off-by: Michal Koutný 
Reviewed-by: Cyrill Gorcunov 


Re: [PATCH 1/3] mm: get_cmdline use arg_lock instead of mmap_sem

2019-04-30 Thread Cyrill Gorcunov
On Tue, Apr 30, 2019 at 12:56:10PM +0200, Michal Koutný wrote:
> On Tue, Apr 30, 2019 at 01:45:17PM +0300, Cyrill Gorcunov 
>  wrote:
> > It setups these parameters unconditionally. I need to revisit
> > this moment. Technically (if only I'm not missing something
> > obvious) we might have a race here with prctl setting up new
> > params, but this should be harmless since most of them (except
> > stack setup) are purely informative data.
>
> FTR, when I reviewed that usage, I noticed it was missing the
> synchronization. My understanding was that the mm_struct isn't yet
> shared at this moment. I can see some of the operations take place after
> flush_old_exec (where current->mm = mm_struct), so potentially it is
> shared since then. OTOH, I guess there aren't concurrent parties that
> could access the field at this stage of exec.

Just revisited this code -- we're either executing prctl, either execve.
Since both operates with current task we're safe.


Re: [PATCH 1/3] mm: get_cmdline use arg_lock instead of mmap_sem

2019-04-30 Thread Cyrill Gorcunov
On Tue, Apr 30, 2019 at 12:53:51PM +0300, Kirill Tkhai wrote:
> > 
> > Well, strictly speaking we probably should but you know setup of
> > the @arg_start by kernel's elf loader doesn't cause any side
> > effects as far as I can tell (its been working this lockless
> > way for years, mmap_sem is taken later in the loader code).
> > Though for consistency sake we probably should set it up
> > under the spinlock.
> 
> Ok, so elf loader doesn't change these parameters.
> Thanks for the explanation.

It setups these parameters unconditionally. I need to revisit
this moment. Technically (if only I'm not missing something
obvious) we might have a race here with prctl setting up new
params, but this should be harmless since most of them (except
stack setup) are purely informative data.


Re: [PATCH 1/3] mm: get_cmdline use arg_lock instead of mmap_sem

2019-04-30 Thread Cyrill Gorcunov
On Tue, Apr 30, 2019 at 12:09:57PM +0300, Kirill Tkhai wrote:
> 
> This looks OK for me.
> 
> But speaking about existing code it's a secret for me, why we ignore arg_lock
> in binfmt code, e.g. in load_elf_binary().

Well, strictly speaking we probably should but you know setup of
the @arg_start by kernel's elf loader doesn't cause any side
effects as far as I can tell (its been working this lockless
way for years, mmap_sem is taken later in the loader code).
Though for consistency sake we probably should set it up
under the spinlock.


Re: [PATCH 3/3] prctl_set_mm: downgrade mmap_sem to read lock

2019-04-30 Thread Cyrill Gorcunov
On Tue, Apr 30, 2019 at 11:55:45AM +0300, Kirill Tkhai wrote:
> > -   up_write(>mmap_sem);
> > +   spin_unlock(>arg_lock);
> > +   up_read(>mmap_sem);
> > return error;
> 
> Hm, shouldn't spin_lock()/spin_unlock() pair go as a fixup to existing code
> in a separate patch? 
> 
> Without them, the existing code has a problem at least in get_mm_cmdline().

Seems reasonable to merge it into patch 1.


Re: [PATCHv3 04/27] timens: Introduce CLOCK_BOOTTIME offset

2019-04-25 Thread Cyrill Gorcunov
On Thu, Apr 25, 2019 at 05:13:53PM +0100, Dmitry Safonov wrote:
> 
> diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
> index 8f75d34cf34a..5f0da6858b10 100644
> --- a/include/linux/time_namespace.h
> +++ b/include/linux/time_namespace.h
> @@ -48,6 +48,14 @@ static inline void timens_add_monotonic(struct timespec64 
> *ts)
>  *ts = timespec64_add(*ts, ns_offsets->monotonic_time_offset);
>  }
>  
> +static inline void timens_add_boottime(struct timespec64 *ts)
> +{
> +struct timens_offsets *ns_offsets = 
> current->nsproxy->time_ns->offsets;
> +
> +if (ns_offsets)
> +*ts = timespec64_add(*ts, 
> ns_offsets->monotonic_boottime_offset);
> +}
>
...

>  struct timens_offsets {
>   struct timespec64  monotonic_time_offset;
> + struct timespec64  monotonic_boottime_offset;
>  };

Guys, is not the _offset postfix here redundant? timens_offsets
already has it and ns_offsets->monotonic_boottime_offset looks
too much.

static always_inline current_time_ns_offsets(void)
{
return current->nsproxy->time_ns->offsets;
}

static inline void timens_add_boottime(struct timespec64 *ts)
{
struct timens_offsets *tns_off = current_time_ns_offsets();

if (tns_off)
*ts = timespec64_add(*ts, tns_off->monotonic_boottime);
}

Hmm? Up to you though.


Re: [PATCH] prctl_set_mm: downgrade mmap_sem to read lock

2019-04-18 Thread Cyrill Gorcunov
On Thu, Apr 18, 2019 at 03:50:39PM +0200, Michal Koutný wrote:
> I learnt, it's, alas, too late to drop the non PRCTL_SET_MM_MAP calls
> [1], so at least downgrade the write acquisition of mmap_sem as in the
> patch below (that should be stacked on the previous one or squashed).
> 
> Cyrill, you mentioned lock changes in [1] but the link seems empty. Is
> it supposed to be [2]? That could be an alternative to this patch after
> some refreshments and clarifications.
> 
> 
> [1] https://lore.kernel.org/lkml/20190417165632.gc3...@uranus.lan/
> [2] https://lore.kernel.org/lkml/20180507075606.870903...@gmail.com/
> 
> 
> 
> Since commit 88aa7cc688d4 ("mm: introduce arg_lock to protect
> arg_start|end and env_start|end in mm_struct") we use arg_lock for
> boundaries modifications. Synchronize prctl_set_mm with this lock and
> keep mmap_sem for reading only (analogous to what we already do in
> prctl_set_mm_map).
> 
> Also, save few cycles by looking up VMA only after performing basic
> arguments validation.
> 
> Signed-off-by: Michal Koutný 
Reviewed-by: Cyrill Gorcunov 

As Laurent mentioned we might move vma lookup before the spinlock,
but this might be done on top of the series.


Re: [PATCH] prctl_set_mm: downgrade mmap_sem to read lock

2019-04-18 Thread Cyrill Gorcunov
On Thu, Apr 18, 2019 at 03:50:39PM +0200, Michal Koutný wrote:
> I learnt, it's, alas, too late to drop the non PRCTL_SET_MM_MAP calls
> [1], so at least downgrade the write acquisition of mmap_sem as in the
> patch below (that should be stacked on the previous one or squashed).
> 
> Cyrill, you mentioned lock changes in [1] but the link seems empty. Is
> it supposed to be [2]? That could be an alternative to this patch after
> some refreshments and clarifications.

Yes, seems so. From a glance the patch shold be ok. Michal will review
it more carefully today. Thanks!


Re: [PATCH] prctl: Don't compile some of prctl functions when CRUI

2019-04-17 Thread Cyrill Gorcunov
On Wed, Apr 17, 2019 at 04:44:54PM +0200, Michal Koutný wrote:
> On Wed, Apr 17, 2019 at 03:38:41PM +0300, Cyrill Gorcunov 
>  wrote:
> > I've a bit vague memory what we've ended up with, but iirc there was
> > a problem with brk() syscall or similar. Then I think we left everything
> > as is.
> Was this related to the removal of non PR_SET_MM_MAP operations too?
> Do you have any pointers to the topic?

Here some details: suprisingly there are already users of PR_SET_MM_x
operands, so we can't deprecate it (https://lkml.org/lkml/2018/5/6/127).
And then there was an attempt to change locks 
https://lore.kernel.org/patchwork/patch/
but since we're modifying mm-> entries we should gurd them against
simultaneous sys_brk call. Or I misunderstood you and you meant somthing
different?


Re: [PATCH] prctl: Don't compile some of prctl functions when CRUI

2019-04-17 Thread Cyrill Gorcunov
On Wed, Apr 17, 2019 at 04:44:54PM +0200, Michal Koutný wrote:
> On Wed, Apr 17, 2019 at 03:38:41PM +0300, Cyrill Gorcunov 
>  wrote:
> > I've a bit vague memory what we've ended up with, but iirc there was
> > a problem with brk() syscall or similar. Then I think we left everything
> > as is.
>
> Was this related to the removal of non PR_SET_MM_MAP operations too?
> Do you have any pointers to the topic?

Gimme some time, will reply later.


Re: [PATCH] prctl: Don't compile some of prctl functions when CRUI

2019-04-17 Thread Cyrill Gorcunov
On Wed, Apr 17, 2019 at 02:23:50PM +0200, Michal Koutný wrote:
> Hi.
> 
> I see this discussion somewhat faded away since the previous year.
> 
> There was rework [1] that reduced (ab)use of mmap_sem in prctl
> functions.
> 
> Actually, there still remains the down_write() in prctl_set_mm.
> I considered at least replacing it with the mm_struct.arg_lock +
> down_read() but then I learnt about this thread intending to remove that
> part completely. I wouldn't oppose if CRIU is the sole (aware) user.
> 
> Ad the bot build issue, I could build the kernel both with
> CONFIG_CHECKPOINT_RESTORE and without CONFIG_CHECKPOINT_RESTORE just
> fine after applying the two proposed patches.
> 
> What is the current state? Perhaps, this change should be CCed to
> linux-...@vger.kernel.org(?).
> 
> [1] 
> https://lore.kernel.org/lkml/1523730291-109696-1-git-send-email-yang@linux.alibaba.com/T/

Hi! I've a bit vague memory what we've ended up with, but iirc there was
a problem with brk() syscall or similar. Then I think we left everything
as is. I think there is no much activity in this prctl area now as far
as i know (replying to what is current state).


Re: perf: perf_fuzzer crashes on Pentium 4 systems

2019-04-09 Thread Cyrill Gorcunov
On Tue, Apr 09, 2019 at 12:38:25PM -0400, Vince Weaver wrote:
> 
> It still crashes at the same place with this patch and my reproducible 
> test case.

Thank for testing, Vince! At least now we know it is unrelated to
alias events. Will ping you once I get more ideas.


[PATCH -next] prctl: Fix false positive in validate_prctl_map

2019-04-08 Thread Cyrill Gorcunov
While validating new map we require the @start_data to be strictly less
than @end_data, which is fine for regular applications (this is why this
nit didn't trigger for that long). These members are set from executable
loaders such as elf halders, still it is pretty valid to have a loadable
data section with zero size in file, in such case the start_data is equal
to end_data once kernel loader finishes.

In result when we'are trying to restore such program the procedure fails
and kernel returns -EINVAL. From the image dump of a program:

 | "mm_start_code": "0x40",
 | "mm_end_code": "0x8f5fb4",
 | "mm_start_data": "0xf1bfb0",
 | "mm_end_data": "0xf1bfb0",

Thus we need to change validate_prctl_map from strictly less to less or
equal operator use.

Fixes: f606b77f1a9e362451aca8f81d8f36a3a112139e
Signed-off-by: Cyrill Gorcunov 
CC: Andrey Vagin 
CC: Dmitry Safonov <0x7f454...@gmail.com>
CC: Pavel Emelyanov 
CC: Andrew Morton 
---
I don't consider this issue as a critical one, since it triggered first
time for in more than 4 years period (and report came from a proprietary
program).

 kernel/sys.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-next.git/kernel/sys.c
===
--- linux-next.git.orig/kernel/sys.c
+++ linux-next.git/kernel/sys.c
@@ -1924,7 +1924,7 @@ static int validate_prctl_map(struct prc
((unsigned long)prctl_map->__m1 __op\
 (unsigned long)prctl_map->__m2) ? 0 : -EINVAL
error  = __prctl_check_order(start_code, <, end_code);
-   error |= __prctl_check_order(start_data, <, end_data);
+   error |= __prctl_check_order(start_data,<=, end_data);
error |= __prctl_check_order(start_brk, <=, brk);
error |= __prctl_check_order(arg_start, <=, arg_end);
error |= __prctl_check_order(env_start, <=, env_end);


Re: perf: perf_fuzzer crashes on Pentium 4 systems

2019-04-07 Thread Cyrill Gorcunov
On Thu, Apr 04, 2019 at 11:20:38PM +0300, Cyrill Gorcunov wrote:
> > the machine still crashes after this, but not right away.
> 
> yes, exactly, if look into disasm code we will see that 0x158
> offset points to hwc from event. Vince, gimme some time, probably
> the weekend so I would dive into the perf code more deeply and
> will try to make some debugging patch for more precise tracking
> of events. The kernel you're running is the latest -tip?

Vince, could you please disable alias events and see if it change
anything, once you have time? Note once we've aliases disabled the
counter for cpu cycles get used for NMI watchdog so PERF_COUNT_HW_CPU_CYCLES
won't be available in "perf" tool itself, but I guess perf_fuzzer uses
direct kernel syscall.
---
 arch/x86/events/intel/p4.c |2 ++
 1 file changed, 2 insertions(+)

Index: linux-tip.git/arch/x86/events/intel/p4.c
===
--- linux-tip.git.orig/arch/x86/events/intel/p4.c
+++ linux-tip.git/arch/x86/events/intel/p4.c
@@ -622,6 +622,8 @@ static u64 p4_get_alias_event(u64 config
u64 config_match;
int i;
 
+   return 0;
+
/*
 * Only event with special mark is allowed,
 * we're to be sure it didn't come as malformed


Re: perf: perf_fuzzer crashes on Pentium 4 systems

2019-04-04 Thread Cyrill Gorcunov
On Thu, Apr 04, 2019 at 03:01:14PM -0400, Vince Weaver wrote:
> 
> I do have a lot of this automated already from tracking down past bugs, 
> but it turns out that most of the fuzzer-found bugs aren't deterministic 
> so it doesn't always work.
> 
> For example this bug, while I can easily repeat it, doesn't happen at 
> the same time each time.  I suspect something corrupts things, but the
> crash doesn't trigger until a context switch happens.

I fear so, I've readin code around to figure out where it might came
from but without much luck yet.

> For what it's worth I've put code in p4_pmu_enable_all() to see what's 
> going on when the NULL dereference happens, and sure enough the printk is 
> triggered where I'd expect.
> 
> [  138.132889] VMW: p4_pmu_enable_all: idx 4 is NULL
...
> 
> the machine still crashes after this, but not right away.

yes, exactly, if look into disasm code we will see that 0x158
offset points to hwc from event. Vince, gimme some time, probably
the weekend so I would dive into the perf code more deeply and
will try to make some debugging patch for more precise tracking
of events. The kernel you're running is the latest -tip?


Re: perf: perf_fuzzer crashes on Pentium 4 systems

2019-04-04 Thread Cyrill Gorcunov
On Thu, Apr 04, 2019 at 12:37:18PM -0400, Vince Weaver wrote:
> On Thu, 4 Apr 2019, Cyrill Gorcunov wrote:
> 
> > On Thu, Apr 04, 2019 at 09:25:47AM -0400, Vince Weaver wrote:
> > > 
> > > It looks like there are at least two bugs here, one that's a full 
> > > hardlockup with nothing on serial console.  The other is the NULL 
> > > dereference.
> 
> OK, it turns out the hard-lock and the null pointer dereference might be 
> the same, I have a random seed for the fuzzer from a hard-lock crash that 
> reproduces and it generated the null pointer crash.  (This is with your 
> patch applied).

I see. My patch simply eliminates wrong event for unimplemented general
events, but it definitely won't help with nil deref, so it mostly to
eliminate some potential side effects.

> I can try to see if I can bisect down to a specific event sequence that 
> triggers this, but that can be tricky sometimes if things lock up so fast 
> that the event log doesn't get written out before the crash.

Oh, Vince, I suspect such kind of bisection might consume a lot of your
time :( Maybe we could update perf fuzzer so that it would send events
to some net-storage first then write them to the counters, iow to automatize
this all stuff somehow?


Re: perf: perf_fuzzer crashes on Pentium 4 systems

2019-04-04 Thread Cyrill Gorcunov
On Thu, Apr 04, 2019 at 09:25:47AM -0400, Vince Weaver wrote:
> 
> It looks like there are at least two bugs here, one that's a full 
> hardlockup with nothing on serial console.  The other is the NULL 
> dereference.
> 

Seems so. I've spent plenty of time yesterday trying to figure out how
we even reach the case when event = nil but without much luck.

> Just ran with your patch applied and it hit the hard lockup case.
> 
> I'll have to see if things are reproducible and I can try to see if I can 
> get a reproducible value for what even caused the issue.  perf_fuzzer has 
> some infrastructure for determining that but it's hit or miss if you can 
> get anything useful from it.

At least the sequence of events migh give us some ideas, maybe not indeed.
Thanks a huge, Vince!

> I'll keep running things, but I'm a bit busy at work here the next few 
> days so there might be some delay in the results.

Sure, take your time. I think this problem with p4 is not urgent. I'll ping
you if I get some more ideas.


Re: perf: perf_fuzzer crashes on Pentium 4 systems

2019-04-03 Thread Cyrill Gorcunov
On Wed, Apr 03, 2019 at 10:19:44PM +0300, Cyrill Gorcunov wrote:
> 
> You know, seems I got what happened -- p4_general_events do
> not cover all general events, they stop at PERF_COUNT_HW_BUS_CYCLES,
> while more 3 general event left. This is 'cause I've not been following
> pmu evolution in code. I will try to cover this events hopefully more
> less soon and send you a patch to test (if you don't mind).

Still this should not cause nil deref, continue investigating. Vince
could oyu please apply the patch below, I doubt if it help with nil
issue but worth having anyway
---
From: Cyrill Gorcunov 
Subject: [PATCH] perf/x86/intel/p4: Limit p4_general_events down to real ones

p4_general_events are allocated up to PERF_COUNT_HW_MAX while this constant
is bigger than the number of general events we do support by now. Thus the
all other entries are equal to zero and maps to P4_EVENT_TC_DELIVER_MODE
which is wrong of course. Instead drop off PERF_COUNT_HW_MAX constant
from declaration, we use ARRAY_SIZE for max_events.

Signed-off-by: Cyrill Gorcunov 
---
 arch/x86/events/intel/p4.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-tip.git/arch/x86/events/intel/p4.c
===
--- linux-tip.git.orig/arch/x86/events/intel/p4.c
+++ linux-tip.git/arch/x86/events/intel/p4.c
@@ -648,7 +648,7 @@ static u64 p4_get_alias_event(u64 config
return config_match | (config & P4_CONFIG_EVENT_ALIAS_IMMUTABLE_BITS);
 }
 
-static u64 p4_general_events[PERF_COUNT_HW_MAX] = {
+static u64 p4_general_events[] = {
   /* non-halted CPU clocks */
   [PERF_COUNT_HW_CPU_CYCLES] =
p4_config_pack_escr(P4_ESCR_EVENT(P4_EVENT_GLOBAL_POWER_EVENTS) 
|


Re: perf: perf_fuzzer crashes on Pentium 4 systems

2019-04-03 Thread Cyrill Gorcunov
On Wed, Apr 03, 2019 at 10:59:32AM -0400, Vince Weaver wrote:
> 
> so moving this to its own thread.
> 
> There was a two-part question asked.
>   1. Can the perf-fuzzer crash a Pentium 4 system
>   2. Does anyone care anymore?
> 
> The answer to #1 turns out to be "yes"
> I'm not sure about #2 (but it's telling my p4 test system hadn't been 
> turned on in over 3 years).
> 
> In any case the perf_fuzzer can crash my p4 system within an hour or so.  
> The debugging from this isn't great, I forget what the preferred debug 
> things to enable in the kernel hacking menu are.
> 
> Here is one crash that just happened:
> 
> The instruction at RIP is unhelpfully
>   ./arch/x86/include/asm/processor.h:400
> which is
>   DECLARE_PER_CPU_FIRST(union irq_stack_union, irq_stack_union) __visible;
> 
> Though looking at the assembly it looks like
>   p4_pmu_enable_event() is called with NULL as the paramater.
> 

You know, seems I got what happened -- p4_general_events do
not cover all general events, they stop at PERF_COUNT_HW_BUS_CYCLES,
while more 3 general event left. This is 'cause I've not been following
pmu evolution in code. I will try to cover this events hopefully more
less soon and send you a patch to test (if you don't mind).


Re: perf: perf_fuzzer crashes on Pentium 4 systems

2019-04-03 Thread Cyrill Gorcunov
On Wed, Apr 03, 2019 at 10:59:32AM -0400, Vince Weaver wrote:
> 
> so moving this to its own thread.
> 
> There was a two-part question asked.
>   1. Can the perf-fuzzer crash a Pentium 4 system
>   2. Does anyone care anymore?
> 
> The answer to #1 turns out to be "yes"
> I'm not sure about #2 (but it's telling my p4 test system hadn't been 
> turned on in over 3 years).
> 
> In any case the perf_fuzzer can crash my p4 system within an hour or so.  
> The debugging from this isn't great, I forget what the preferred debug 
> things to enable in the kernel hacking menu are.
> 
> Here is one crash that just happened:
> 
> The instruction at RIP is unhelpfully
>   ./arch/x86/include/asm/processor.h:400
> which is
>   DECLARE_PER_CPU_FIRST(union irq_stack_union, irq_stack_union) __visible;
> 
> Though looking at the assembly it looks like
>   p4_pmu_enable_event() is called with NULL as the paramater.
>

Interesting! I'll look more carefully at evening. As far as I remember we
rely on active_mask bit set completely, not sure how it could happen that
we get nil here. Thanks for pointing!


Re: [RFC PATCH v3 0/3] x86/perf/amd: AMD PMC counters and NMI latency

2019-04-03 Thread Cyrill Gorcunov
On Wed, Apr 03, 2019 at 05:27:54PM +0300, Cyrill Gorcunov wrote:
> On Wed, Apr 03, 2019 at 10:15:33AM -0400, Vince Weaver wrote:
> > > Mind to point me where json events should lay, I could try to convert
> > > names.
> > 
> > I was mostly joking about that.  But the event lists are in the kernel 
> > tree in
> > tools/perf/pmu-events/arch/x86/
> >
> > I don't think anything older than Nehalem is included.
> 
> I see ;)
> 
> > After letting the fuzzer run a bit longer I did manage to get it 
> > to hard-lock with no messages in the log, though eventually while I was 
> > fiddling with alt-sysrq over serial port did get this to trigger.
> > Though if I'm going to start reporting p4 crashes I should start a 
> > separate thread.
> 
> Oh. Will try to take a look on, thanks a lot!

Btw, Vince, is there a chance to fetch last seuqence the
fuzzer inserted into events input?


Re: [RFC PATCH v3 0/3] x86/perf/amd: AMD PMC counters and NMI latency

2019-04-03 Thread Cyrill Gorcunov
On Wed, Apr 03, 2019 at 10:15:33AM -0400, Vince Weaver wrote:
> > Mind to point me where json events should lay, I could try to convert
> > names.
> 
> I was mostly joking about that.  But the event lists are in the kernel 
> tree in
>   tools/perf/pmu-events/arch/x86/
>
> I don't think anything older than Nehalem is included.

I see ;)

> After letting the fuzzer run a bit longer I did manage to get it 
> to hard-lock with no messages in the log, though eventually while I was 
> fiddling with alt-sysrq over serial port did get this to trigger.
> Though if I'm going to start reporting p4 crashes I should start a 
> separate thread.

Oh. Will try to take a look on, thanks a lot!


Re: [RFC PATCH v3 0/3] x86/perf/amd: AMD PMC counters and NMI latency

2019-04-02 Thread Cyrill Gorcunov
On Tue, Apr 02, 2019 at 05:13:15PM -0400, Vince Weaver wrote:
> 
> > You know, running fuzzer on p4 might worth in anycase. As to potential
> > problems to fix -- i could try find some time slot for, still quite
> > limited too 'cause of many other duties :(
> 
> 
> Well I fired up the Pentium 4
>   /dev/sda1 has gone 1457 days without being checked

Heh :)

> and eventually got it up to date (running "git pull" in a 5-year old Linux 
> tree plus running "apt-get dist-upgrade" at the same time was maybe a 
> mistake on a system with 1GB of RAM).
> 
> Anyway I have it fuzzing current git and surprisingly while it's hit a few 
> WARNINGs and some NMI dazed+confused messages it hasn't actually crashed 
> yet.  Not sure if I want to let it fuzz overnight if I'm not here though.

Wow, sounds impressive! Ping me if you find something.

> 
> Shame on Intel though for not providing perf JSON files for the 
> Pentium 4 event names.

Mind to point me where json events should lay, I could try to convert
names.


Re: [RFC PATCH v3 0/3] x86/perf/amd: AMD PMC counters and NMI latency

2019-04-02 Thread Cyrill Gorcunov
On Tue, Apr 02, 2019 at 10:53:38AM -0400, Vince Weaver wrote:
> On Tue, 2 Apr 2019, Cyrill Gorcunov wrote:
> > On Tue, Apr 02, 2019 at 03:03:02PM +0200, Peter Zijlstra wrote:
> > > I have vague memories of the P4 thing crashing with Vince's perf_fuzzer,
> > > but maybe I'm wrong.
> > 
> > No, you're correct. p4 was crashing many times before we manage to make
> > it more-less stable. The main problem though that to find working p4 box
> > is really a problem.
> 
> I do have some a functioning p4 system I can test on.
> I can easily run the fuzzer and report crashes, but I only have limited 
> time/skills to actually fix the problems it turns up.

You know, running fuzzer on p4 might worth in anycase. As to potential
problems to fix -- i could try find some time slot for, still quite
limited too 'cause of many other duties :(

> One nice thing is that as of Linux 5.0 *finally* the fuzzer can run 
> indefinitely on most modern Intel chips without crashing (still triggers a 
> few warnings).  So finally we have the ability to tell when a new crash is 
> a regression and potentially can bisect it.  Although obviously this 
> doesn't necessarily apply to the p4.
> 
> I do think the number of people trying to run perf on a p4 is probably 
> pretty small these days.

Quite agree. Moreover current p4 perf code doesn't cover all potential
functionality (for example cascaded counters) and nobody ever complained
about it, I think exactly because not that many p4 boxes left and putting
efforts into this perf module development doesn't look like a good investment,
better to stick with more modern cpus and deprecate p4 with small steps.


Re: [RFC PATCH v3 0/3] x86/perf/amd: AMD PMC counters and NMI latency

2019-04-02 Thread Cyrill Gorcunov
On Tue, Apr 02, 2019 at 03:03:02PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 01, 2019 at 09:46:33PM +, Lendacky, Thomas wrote:
> > This patch series addresses issues with increased NMI latency in newer
> > AMD processors that can result in unknown NMI messages when PMC counters
> > are active.
> > 
> > The following fixes are included in this series:
> > 
> > - Resolve a race condition when disabling an overflowed PMC counter,
> >   specifically when updating the PMC counter with a new value.
> > - Resolve handling of active PMC counter overflows in the perf NMI
> >   handler and when to report that the NMI is not related to a PMC.
> > - Remove earlier workaround for spurious NMIs by re-ordering the
> >   PMC stop sequence to disable the PMC first and then remove the PMC
> >   bit from the active_mask bitmap. As part of disabling the PMC, the
> >   code will wait for an overflow to be reset.
> > 
> > The last patch re-works the order of when the PMC is removed from the
> > active_mask. There was a comment from a long time ago about having
> > to clear the bit in active_mask before disabling the counter because
> > the perf NMI handler could re-enable the PMC again. Looking at the
> > handler today, I don't see that as possible, hence the reordering. The
> > question will be whether the Intel PMC support will now have issues.
> > There is still support for using x86_pmu_handle_irq() in the Intel
> > core.c file. Did Intel have any issues with spurious NMIs in the past?
> > Peter Z, any thoughts on this?
> 
> I can't remember :/ I suppose we'll see if anything pops up after these
> here patches. At least then we get a chance to properly document things.
> 
> > Also, I couldn't completely get rid of the "running" bit because it
> > is used by arch/x86/events/intel/p4.c. An old commit comment that
> > seems to indicate the p4 code suffered the spurious interrupts:
> > 03e22198d237 ("perf, x86: Handle in flight NMIs on P4 platform").
> > So maybe that partially answers my previous question...
> 
> Yeah, the P4 code is magic, and I don't have any such machines left, nor
> do I think does Cyrill who wrote much of that.

It was so long ago :) What I remember from the head is some of the counters
were borken on hardware level so that I had to use only one counter instead
of two present in the system. And there were spurious NMIs too. I think
we can move this "running" bit to per-cpu base declared inside p4 code
only, so get rid of it from cpu_hw_events?

> I have vague memories of the P4 thing crashing with Vince's perf_fuzzer,
> but maybe I'm wrong.

No, you're correct. p4 was crashing many times before we manage to make
it more-less stable. The main problem though that to find working p4 box
is really a problem.

> Ideally we'd find a willing victim to maintain that thing, or possibly
> just delete it, dunno if anybody still cares.

As to me, I would rather mark this p4pmu code as deprecated, until there
is *real* need for its support.

> 
> Anyway, I like these patches, but I cannot apply since you send them
> base64 encoded and my script chokes on that.


Re: [PATCH 05/32] timerfd/timens: Take into account ns clock offsets

2019-02-06 Thread Cyrill Gorcunov
On Wed, Feb 06, 2019 at 11:52:03AM +0300, Cyrill Gorcunov wrote:
...
> >  
> > -   if ((flags & ~TFD_SETTIME_FLAGS) ||
> > -!itimerspec64_valid(new))
> > -   return -EINVAL;
> 
> Please don't defer this early test of a @flags value. Otherwise
> if @flags is invalid you continue fget/put/clock-to-host even
> if result will be dropped out then.

Just to clarify -- this could be done on top of the series to
not resend the whole bunch.


Re: [PATCH 05/32] timerfd/timens: Take into account ns clock offsets

2019-02-06 Thread Cyrill Gorcunov
On Wed, Feb 06, 2019 at 12:10:39AM +, Dmitry Safonov wrote:
> From: Andrei Vagin 
> 
> Make timerfd respect timens offsets.
> Provide two helpers timens_clock_to_host() timens_clock_from_host() that
> are useful to wire up timens to different kernel subsystems.
> Following patches will use timens_clock_from_host(), added here for
> completeness.
> 
> Signed-off-by: Andrei Vagin 
> Co-developed-by: Dmitry Safonov 
> Signed-off-by: Dmitry Safonov 
> ---
>  fs/timerfd.c | 16 +++-
>  1 file changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/timerfd.c b/fs/timerfd.c
> index 803ca070d42e..c7ae1e371912 100644
> --- a/fs/timerfd.c
> +++ b/fs/timerfd.c
> @@ -26,6 +26,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  struct timerfd_ctx {
>   union {
> @@ -433,22 +434,27 @@ SYSCALL_DEFINE2(timerfd_create, int, clockid, int, 
> flags)
>  }
>  
>  static int do_timerfd_settime(int ufd, int flags, 
> - const struct itimerspec64 *new,
> + struct itimerspec64 *new,
>   struct itimerspec64 *old)
>  {
>   struct fd f;
>   struct timerfd_ctx *ctx;
>   int ret;
>  
> - if ((flags & ~TFD_SETTIME_FLAGS) ||
> -  !itimerspec64_valid(new))
> - return -EINVAL;

Please don't defer this early test of a @flags value. Otherwise
if @flags is invalid you continue fget/put/clock-to-host even
if result will be dropped out then.


Re: [PATCH next] fs/splice: iter_to_pipe -- Use PIPE_DEF_BUFFERS instead of hardcoded number

2019-01-30 Thread Cyrill Gorcunov
On Wed, Jan 30, 2019 at 06:48:51PM +0300, Cyrill Gorcunov wrote:
> On Wed, Jan 30, 2019 at 03:42:45PM +, Al Viro wrote:
> > On Wed, Jan 30, 2019 at 06:39:11PM +0300, Cyrill Gorcunov wrote:
> > > And use ARRAY_SIZE for easier code modification if we ever need in future.
> > 
> > Umm...  Why PIPE_DEF_BUFFERS, though?  Because it's a constant from more or
> > less the same area that happens to be equal to the value we use here?
> 
> Yes. Since plain 16 value completely unclear where it comes from. I looked
> into your commit which made this change and seems previously we've had this
> constant related (79fddc4efd5d4de5cf210fe5ecf4d2734140849a). Am I wrong?

Thinking more I'm more convinced that i'm wrong. Also it is related to
this default size, the code can pretty well use (alsmost) any other
constant here. Al, drop it please.


Re: [PATCH next] fs/splice: iter_to_pipe -- Use PIPE_DEF_BUFFERS instead of hardcoded number

2019-01-30 Thread Cyrill Gorcunov
On Wed, Jan 30, 2019 at 03:42:45PM +, Al Viro wrote:
> On Wed, Jan 30, 2019 at 06:39:11PM +0300, Cyrill Gorcunov wrote:
> > And use ARRAY_SIZE for easier code modification if we ever need in future.
> 
> Umm...  Why PIPE_DEF_BUFFERS, though?  Because it's a constant from more or
> less the same area that happens to be equal to the value we use here?

Yes. Since plain 16 value completely unclear where it comes from. I looked
into your commit which made this change and seems previously we've had this
constant related (79fddc4efd5d4de5cf210fe5ecf4d2734140849a). Am I wrong?


[PATCH next] fs/splice: iter_to_pipe -- Use PIPE_DEF_BUFFERS instead of hardcoded number

2019-01-30 Thread Cyrill Gorcunov
And use ARRAY_SIZE for easier code modification if we ever need in future.

CC: Al Viro 
Signed-off-by: Cyrill Gorcunov 
---
 fs/splice.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-next.git/fs/splice.c
===
--- linux-next.git.orig/fs/splice.c
+++ linux-next.git/fs/splice.c
@@ -1203,12 +1203,12 @@ static int iter_to_pipe(struct iov_iter
bool failed = false;
 
while (iov_iter_count(from) && !failed) {
-   struct page *pages[16];
+   struct page *pages[PIPE_DEF_BUFFERS];
ssize_t copied;
size_t start;
int n;
 
-   copied = iov_iter_get_pages(from, pages, ~0UL, 16, );
+   copied = iov_iter_get_pages(from, pages, ~0UL, 
ARRAY_SIZE(pages), );
if (copied <= 0) {
ret = copied;
break;


[PATCH v2 -next] rcu: rcu_qs -- Use raise_softirq_irqoff to not save irqs twice

2019-01-24 Thread Cyrill Gorcunov
The rcu_qs is disabling IRQs by self so no need to do the same in raise_softirq
but instead we can save some cycles using raise_softirq_irqoff directly.

CC: Paul E. McKenney 
Signed-off-by: Cyrill Gorcunov 
---
The prev patch body has been screwed, sorry.

 kernel/rcu/tiny.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-next.git/kernel/rcu/tiny.c
===
--- linux-next.git.orig/kernel/rcu/tiny.c
+++ linux-next.git/kernel/rcu/tiny.c
@@ -65,7 +65,7 @@ void rcu_qs(void)
local_irq_save(flags);
if (rcu_ctrlblk.donetail != rcu_ctrlblk.curtail) {
rcu_ctrlblk.donetail = rcu_ctrlblk.curtail;
-   raise_softirq(RCU_SOFTIRQ);
+   raise_softirq_irqoff(RCU_SOFTIRQ);
}
local_irq_restore(flags);
 }


[PATCH -next] rcu: rcu_qs -- Use raise_softirq_irqoff to not save irqs twice

2019-01-24 Thread Cyrill Gorcunov
The rcu_qs is disabling IRQs by self so no need to do the same in raise_softirq,
instead we can save some cycles using raise_softirq_irqoff directly.

CC: Paul E. McKenney 
Signed-off-by: Cyrill Gorcunov 
---
 kernel/rcu/tiny.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-next.git/kernel/rcu/tiny.c
===
--- linux-next.git.orig/kernel/rcu/tiny.c
+++ linux-next.git/kernel/rcu/tiny.c
@@ -65,7 +65,7 @@ void rcu_qs(void)
local_irq_save(flags);
if (rcu_ctrlblk.donetail != rcu_ctrlblk.curtail) {
rcu_ctrlblk.donetail = rcu_ctrlblk.curtail;
-   raise_softirq(RCU_SOFTIRQ);
+   raise_softirq_irqoff(RCU_SOFTIRQ);
}
local_irq_restore(flags);
 }
Subject: [PATCH]


Re: [PATCH]: xarray: Fix potential out of bounds access

2019-01-14 Thread Cyrill Gorcunov
On Mon, Jan 14, 2019 at 11:01:20AM -0800, Matthew Wilcox wrote:
> On Mon, Jan 14, 2019 at 09:47:41PM +0300, Cyrill Gorcunov wrote:
> > Since the mark is used as an array index we should use
> > preincrement to not access the XA_MARK_MAX index.
> 
> But XA_MARK_MAX is inclusive:
> 
> include/linux/xarray.h:#define XA_MARK_MAX  XA_MARK_2

Indeed, I misread the variable name.

> so we actually want to access XA_MARK_MAX.  Now, we don't have a test
> in the test-suite that fails as a result of your patch, so that needs to get
> fixed.  How about this:

Looks great. Thank you!


  1   2   3   4   5   6   7   8   9   10   >