Re: [PATCH] fs: Replace kmap{,_atomic}() with kmap_local_page()

2022-06-30 Thread Eric W. Biederman
"Fabio M. De Francesco"  writes:

> The use of kmap() and kmap_atomic() are being deprecated in favor of
> kmap_local_page().
>
> With kmap_local_page(), the mappings are per thread, CPU local and not
> globally visible. Furthermore, the mappings can be acquired from any
> context (including interrupts).
>
> Therefore, use kmap_local_page() in exec.c because these mappings are per
> thread, CPU local, and not globally visible.
>
> Tested with xfstests on a QEMU + KVM 32-bit VM booting a kernel with
> HIGHMEM64GB enabled.

Can someone please refresh my memory on what is going on.

I remember kmap_atomic had limitations that were hard to meet, so
something (kmap_local, I think) was invented and created to be the
kmap_atomic replacement.

What are the requirements on kmap_local?  In copy_strings kmap is
called in contexts that can sleep in page faults, so nearly any
requirement except a thread-local use is invalidated.

As you have described kmap_local above it does not sound like kmap_local
is safe in this context, but that could just be a problem in the
description, with my poor memory not recalling the necessary details to
correct.
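
For what it is worth, my rough recollection of the kmap_local_page()
rules is below (an illustrative sketch, not code from this patch):
the mappings are per thread, must be released in reverse (stack)
order, and unlike kmap_atomic() the task may take a page fault and
sleep while holding one, because the mappings follow the task across
context switches.

	void *a = kmap_local_page(page1);
	void *b = kmap_local_page(page2);
	/* may fault or sleep here; the mappings migrate with the task */
	kunmap_local(b);	/* LIFO: unmap in the reverse order */
	kunmap_local(a);

Is that description accurate?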

Eric

> Suggested-by: Ira Weiny 
> Signed-off-by: Fabio M. De Francesco 
> ---
>  fs/exec.c | 14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 0989fb8472a1..4a2129c0d422 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -583,11 +583,11 @@ static int copy_strings(int argc, struct user_arg_ptr argv,
>  
>   if (kmapped_page) {
>   flush_dcache_page(kmapped_page);
> - kunmap(kmapped_page);
> + kunmap_local(kaddr);
>   put_arg_page(kmapped_page);
>   }
>   kmapped_page = page;
> - kaddr = kmap(kmapped_page);
> + kaddr = kmap_local_page(kmapped_page);
>   kpos = pos & PAGE_MASK;
>   flush_arg_page(bprm, kpos, kmapped_page);
>   }
> @@ -601,7 +601,7 @@ static int copy_strings(int argc, struct user_arg_ptr argv,
>  out:
>   if (kmapped_page) {
>   flush_dcache_page(kmapped_page);
> - kunmap(kmapped_page);
> + kunmap_local(kaddr);
>   put_arg_page(kmapped_page);
>   }
>   return ret;
> @@ -883,11 +883,11 @@ int transfer_args_to_stack(struct linux_binprm *bprm,
>  
>   for (index = MAX_ARG_PAGES - 1; index >= stop; index--) {
>   unsigned int offset = index == stop ? bprm->p & ~PAGE_MASK : 0;
> - char *src = kmap(bprm->page[index]) + offset;
> + char *src = kmap_local_page(bprm->page[index]) + offset;
>   sp -= PAGE_SIZE - offset;
>   if (copy_to_user((void *) sp, src, PAGE_SIZE - offset) != 0)
>   ret = -EFAULT;
> - kunmap(bprm->page[index]);
> + kunmap_local(src);
>   if (ret)
>   goto out;
>   }
> @@ -1680,13 +1680,13 @@ int remove_arg_zero(struct linux_binprm *bprm)
>   ret = -EFAULT;
>   goto out;
>   }
> - kaddr = kmap_atomic(page);
> + kaddr = kmap_local_page(page);
>  
>   for (; offset < PAGE_SIZE && kaddr[offset];
>   offset++, bprm->p++)
>   ;
>  
> - kunmap_atomic(kaddr);
> + kunmap_local(kaddr);
>   put_arg_page(page);
>   } while (offset == PAGE_SIZE);



Re: [PATCH] capabilities: require CAP_SETFCAP to map uid 0 (v3.2)

2021-04-18 Thread Eric W. Biederman


Giuseppe, can you take a look at this?

This is a second attempt at tightening up the semantics of writing to
file capabilities from a user namespace.

The first attempt was reverted with 3b0c2d3eaa83 ("Revert 95ebabde382c
("capabilities: Don't allow writing ambiguous v3 file capabilities")"),
which corrected the issue reported in:
https://github.com/containers/buildah/issues/3071

There is a report that the podman testsuite passes.  While different,
this looks in many ways much more strict than the code that was
reverted.  So while I can imagine this change doesn't cause problems
as is, I will be surprised.

Eric



"Serge E. Hallyn"  writes:

> A process running as uid 0 but without cap_setfcap currently can simply
> unshare a new user namespace with uid 0 mapped to 0.  While this task
> will not have new capabilities against the parent namespace, there is
> a loophole due to the way namespaced file capabilities work.  File
> capabilities valid in userns 1 are distinguished from file capabilities
> valid in userns 2 by the kuid which underlies uid 0.  Therefore
> the restricted root process can unshare a new self-mapping namespace,
> add a namespaced file capability onto a file, then use that file
> capability in the parent namespace.
>
> To prevent that, do not allow mapping uid 0 if the process which
> opened the uid_map file does not have CAP_SETFCAP, which is the capability
> for setting file capabilities.
>
> A further wrinkle:  a task can unshare its user namespace, then
> open its uid_map file itself, and map (only) its own uid.  In this
> case we do not have the credential from before unshare,  which was
> potentially more restricted.  So, when creating a user namespace, we
> record whether the creator had CAP_SETFCAP.  Then we can use that
> during map_write().
>
> With this patch:
>
> 1. unprivileged user can still unshare -Ur
>
> ubuntu@caps:~$ unshare -Ur
> root@caps:~# logout
>
> 2. root user can still unshare -Ur
>
> ubuntu@caps:~$ sudo bash
> root@caps:/home/ubuntu# unshare -Ur
> root@caps:/home/ubuntu# logout
>
> 3. root user without CAP_SETFCAP cannot unshare -Ur:
>
> root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
> root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
> unable to set CAP_SETFCAP effective capability: Operation not permitted
> root@caps:/home/ubuntu# unshare -Ur
> unshare: write failed /proc/self/uid_map: Operation not permitted
>
> Signed-off-by: Serge Hallyn 
>
> Changelog:
>* fix logic in the case of writing to another task's uid_map
>* rename 'ns' to 'map_ns', and make a file_ns local variable
>* use /* comments */
>* update the CAP_SETFCAP comment in capability.h
>* rename parent_unpriv to parent_can_setfcap (and reverse the
>  logic)
>* remove printks
>* clarify (i hope) the code comments
>* update capability.h comment
>* renamed parent_can_setfcap to parent_could_setfcap
>* made the check its own disallowed_0_mapping() fn
>* moved the check into new_idmap_permitted
>* rename disallowed_0_mapping to verify_root_mapping
>* change verify_root_mapping to Christian's suggested flow
> ---
>  include/linux/user_namespace.h  |  3 ++
>  include/uapi/linux/capability.h |  3 +-
>  kernel/user_namespace.c | 66 +++--
>  3 files changed, 68 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index 64cf8ebdc4ec..f6c5f784be5a 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -63,6 +63,9 @@ struct user_namespace {
>   kgid_t  group;
>   struct ns_commonns;
>   unsigned long   flags;
> + /* parent_could_setfcap: true if the creator of this ns had CAP_SETFCAP
> +  * in its effective capability set at the child ns creation time. */
> + boolparent_could_setfcap;
>  
>  #ifdef CONFIG_KEYS
>   /* List of joinable keyrings in this namespace.  Modification access of
> diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> index c6ca33034147..2ddb4226cd23 100644
> --- a/include/uapi/linux/capability.h
> +++ b/include/uapi/linux/capability.h
> @@ -335,7 +335,8 @@ struct vfs_ns_cap_data {
>  
>  #define CAP_AUDIT_CONTROL30
>  
> -/* Set or remove capabilities on files */
> +/* Set or remove capabilities on files.
> +   Map uid=0 into a child user namespace. */
>  
>  #define CAP_SETFCAP   31
>  
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index af612945a4d0..2ead291177b0 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -106,6 +106,7 @@ int create_user_ns(struct cred *new)
>   if (!ns)
>   goto fail_dec;
>  
> + ns->parent_could_setfcap = cap_raised(new->cap_effective, CAP_SETFCAP);
>   ret = ns_alloc_inum(&ns->ns);
>   if (ret)
>   goto fail_free;
> @@ -841,6 +842,61 @@ static int 

Re: 08ed4efad6: stress-ng.sigsegv.ops_per_sec -41.9% regression

2021-04-08 Thread Eric W. Biederman
Linus Torvalds  writes:

> On Thu, Apr 8, 2021 at 1:32 AM kernel test robot  
> wrote:
>>
>> FYI, we noticed a -41.9% regression of stress-ng.sigsegv.ops_per_sec due to 
>> commit
>> 08ed4efad684 ("[PATCH v10 6/9] Reimplement RLIMIT_SIGPENDING on top of 
>> ucounts")
>
> Ouch.

We were cautiously optimistic, when no test problems showed up from
the last posting, that there was nothing to look at here.

Unfortunately it looks like the bots just missed the last posting. 

So it seems we are finally pretty much at correct code in need
of performance tuning.

> I *think* this test may be testing "send so many signals that it
> triggers the signal queue overflow case".
>
> And I *think* that the performance degradation may be due to lots of
> unnecessary allocations, because ity looks like that commit changes
> __sigqueue_alloc() to do
>
> struct sigqueue *q = kmem_cache_alloc(sigqueue_cachep, flags);
>
> *before* checking the signal limit, and then if the signal limit was
> exceeded, it will just be free'd instead.
>
> The old code would check the signal count against RLIMIT_SIGPENDING
> *first*, and if there were more pending signals then it wouldn't do
> anything at all (including not incrementing that expensive atomic
> count).

This is an interesting test in a lot of ways as it is testing the
synchronous signal delivery path caused by an exception.  The test
is either executing *ptr = 0 (where ptr points to a read-only page)
or it executes an x86 instruction that is excessively long.

I have found the code but I haven't figured out how it is being
called yet.  The core loop is just:
	for (;;) {
		sigaction(SIGSEGV, &act, NULL);
		sigaction(SIGILL, &act, NULL);
		sigaction(SIGBUS, &act, NULL);

		ret = sigsetjmp(jmp_env, 1);
		if (done())
			break;
		if (ret) {
			/* verify signal */
		} else {
			*ptr = 0;
		}
	}

Code like that fundamentally cannot be multi-threaded.  So the only way
the sigpending limit is being hit is if there are more processes running
that code simultaneously than the size of the limit.

Further it looks like stress-ng pushes RLIMIT_SIGPENDING as high as it
will go before the test starts.


> Also, the old code was very careful to only do the "get_user()" for
> the *first* signal it added to the queue, and do the "put_user()" for
> when removing the last signal. Exactly because those atomics are very
> expensive.
>
> The new code just does a lot of these atomics unconditionally.

Yes. That seems a likely culprit.

> I dunno. The profile data in there is a bit hard to read, but there's
> a lot more cachee misses, and a *lot* of node crossers:
>
>>    5961544  +190.4%   17314361   perf-stat.i.cache-misses
>>   22107466  +119.2%   48457656   perf-stat.i.cache-references
>>     163292 ±  3%  +4582.0%    7645410   perf-stat.i.node-load-misses
>>     227388 ±  2%  +3708.8%    8660824   perf-stat.i.node-loads
>
> and (probably as a result) average instruction costs have gone up enormously:
>
>>   3.47   +66.8%   5.79    perf-stat.overall.cpi
>>  22849   -65.6%   7866    perf-stat.overall.cycles-between-cache-misses
>
> and it does seem to be at least partly about "put_ucounts()":
>
>>   0.00   +4.5   4.46    perf-profile.calltrace.cycles-pp.put_ucounts.__sigqueue_free.get_signal.arch_do_signal_or_restart.exit_to_user_mode_prepare
>
> and a lot of "get_ucounts()".
>
> But it may also be that the new "get sigpending" is just *so* much
> more expensive than it used to be.

That too is possible.

That node-load-misses number does look like something is bouncing back
and forth between the nodes a lot more.  So I suspect stress-ng is
running multiple copies of the sigsegv test in different processes at
once.



That really suggests cache line ping pong from get_ucounts and
incrementing sigpending.

It surprises me that obtaining the cache lines exclusively is
the dominant cost on this code path but obtaining two cache lines
exclusively instead of one cache line exclusively is consistent
with causing the exception delivery to take nearly twice as long.

For the optimization we only care about the leaf count so with a little
care we can restore the optimization.  So that is probably the thing
to do here.  The fewer changes to worry about the less likely to find
surprises.
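
The shape I have in mind is something like this (an illustrative
sketch only; it assumes inc_rlimit_ucounts() returns the new leaf
count):

	struct sigqueue *q = NULL;
	long sigpending;

	/* Bump the leaf count first and only pay for the allocation
	 * when we are under the limit, as the old user_struct code did.
	 */
	sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
	if (override_rlimit || sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))
		q = kmem_cache_alloc(sigqueue_cachep, flags);
	if (!q)
		dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);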



That said for this specific case there is a lot of potential room for
improvement.  As this is a per-thread signal the code could update
sigpending in commit_cred and never worry about needing to pin the struct
user_struct or struct ucounts.  As this is a synchronous signal we could
skip the sigpending increment, skip the signal queue entirely, and
deliver the signal to user-space immediately.  The removal of all cache
ping pongs might 

Re: [PATCH v9 4/8] Reimplement RLIMIT_NPROC on top of ucounts

2021-04-07 Thread Eric W. Biederman
Alexey Gladkov  writes:

> On Mon, Apr 05, 2021 at 11:56:35AM -0500, Eric W. Biederman wrote:
>>
>> Also when setting ns->ucount_max[] in create_user_ns because one value
>> is signed and the other is unsigned.  Care should be taken so that
>> rlimit_infinity is translated into the largest positive value the
>> type can hold.
>
> You mean like that ?
>
> ns->ucount_max[UCOUNT_RLIMIT_NPROC] = rlimit(RLIMIT_NPROC) <= LONG_MAX ?
>   rlimit(RLIMIT_NPROC) : LONG_MAX;
> ns->ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = rlimit(RLIMIT_MSGQUEUE) <= LONG_MAX ?
>   rlimit(RLIMIT_MSGQUEUE) : LONG_MAX;
> ns->ucount_max[UCOUNT_RLIMIT_SIGPENDING] = rlimit(RLIMIT_SIGPENDING) <= LONG_MAX ?
>   rlimit(RLIMIT_SIGPENDING) : LONG_MAX;
> ns->ucount_max[UCOUNT_RLIMIT_MEMLOCK] = rlimit(RLIMIT_MEMLOCK) <= LONG_MAX ?
>   rlimit(RLIMIT_MEMLOCK) : LONG_MAX;

Yes.

I only got as far as:

	if (rlimit(RLIMIT_NNN) == RLIM_INFINITY) {
		ns->ucount_max[UCOUNT_RLIMIT_NNN] = LONG_MAX;
	} else {
		ns->ucount_max[UCOUNT_RLIMIT_NNN] = rlimit(RLIMIT_NNN);
	}

But forcing everything above LONG_MAX to LONG_MAX actually looks better
in practice.  Especially as that is effectively RLIM_INFINITY anyway.
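
A small helper would keep the clamping in one place (a sketch, the
name is made up):

	static void set_userns_rlimit_max(struct user_namespace *ns,
					  enum ucount_type type,
					  unsigned long rlim)
	{
		/* rlim is unsigned and ucount_max is signed: clamp to
		 * LONG_MAX, which also maps RLIM_INFINITY to an
		 * effectively unlimited value.
		 */
		ns->ucount_max[type] = rlim <= LONG_MAX ? rlim : LONG_MAX;
	}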

Eric


Re: [PATCH v9 0/8] Count rlimits in each user namespace

2021-04-05 Thread Eric W. Biederman
Alexey Gladkov  writes:

> Preface
> ---
> These patches are for binding the rlimit counters to a user in user namespace.
> This patch set can be applied on top of:
>
> git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git v5.12-rc4
>
> Problem
> ---
> The RLIMIT_NPROC, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING, RLIMIT_MSGQUEUE rlimits
> implementation places the counters in user_struct [1]. These limits are global
> between processes and persists for the lifetime of the process, even if
> processes are in different user namespaces.
>
> To illustrate the impact of rlimits, let's say there is a program that
> does not fork. Some service-A wants to run this program as user X in
> multiple containers. Since the program never forks the service wants to
> set RLIMIT_NPROC=1.
>
> service-A
>  \- program (uid=1000, container1, rlimit_nproc=1)
>  \- program (uid=1000, container2, rlimit_nproc=1)
>
> The service-A sets RLIMIT_NPROC=1 and runs the program in container1. When the
> service-A tries to run a program with RLIMIT_NPROC=1 in container2 it fails
> since user X already has one running process.
>
> The problem is not that the limit from container1 affects container2. The
> problem is that limit is verified against the global counter that reflects
> the number of processes in all containers.
>
> This problem can be worked around by using different users for each container
> but in this case we face a different problem of uid mapping when transferring
> files from one container to another.
>
> Eric W. Biederman mentioned this issue [2][3].
>
> Introduced changes
> --
> To address the problem, we bind rlimit counters to user namespace. Each
> counter reflects the number of processes in a given uid in a given user
> namespace. The result is a tree of rlimit counters with the biggest
> value at the root (aka init_user_ns). The limit is considered exceeded
> if it's exceeded up in the tree.
>
> [1]: https://lore.kernel.org/containers/87imd2incs@x220.int.ebiederm.org/
> [2]: https://lists.linuxfoundation.org/pipermail/containers/2020-August/042096.html
> [3]: https://lists.linuxfoundation.org/pipermail/containers/2020-October/042524.html
>
> Changelog
> -
> v9:
> * Used a negative value to check that the ucounts->count is close to overflow.
> * Rebased onto v5.12-rc4.
>
> v8:
> * Used atomic_t for ucounts reference counting. Also added counter overflow
>   check (thanks to Linus Torvalds for the idea).
> * Fixed other issues found by lkp-tests project in the patch that Reimplements
>   RLIMIT_MEMLOCK on top of ucounts.
>
> v7:
> * Fixed issues found by lkp-tests project in the patch that Reimplements
>   RLIMIT_MEMLOCK on top of ucounts.
>
> v6:
> * Fixed issues found by lkp-tests project.
> * Rebased onto v5.11.
>
> v5:
> * Split the first commit into two commits: change ucounts.count type to 
> atomic_long_t
>   and add ucounts to cred. These commits were merged by mistake during the 
> rebase.
> * The __get_ucounts() renamed to alloc_ucounts().
> * The cred.ucounts update has been moved from commit_creds() as it did not 
> allow
>   to handle errors.
> * Added error handling of set_cred_ucounts().
>
> v4:
> * Reverted the type change of ucounts.count to refcount_t.
> * Fixed typo in the kernel/cred.c
>
> v3:
> * Added get_ucounts() function to increase the reference count. The existing
>   get_counts() function renamed to __get_ucounts().
> * The type of ucounts.count changed from atomic_t to refcount_t.
> * Dropped 'const' from set_cred_ucounts() arguments.
> * Fixed a bug with freeing the cred structure after calling 
> cred_alloc_blank().
> * Commit messages have been updated.
> * Added selftest.
>
> v2:
> * RLIMIT_MEMLOCK, RLIMIT_SIGPENDING and RLIMIT_MSGQUEUE are migrated to 
> ucounts.
> * Added ucounts for pair uid and user namespace into cred.
> * Added the ability to increase ucount by more than 1.
>
> v1:
> * After discussion with Eric W. Biederman, I increased the size of ucounts to
>   atomic_long_t.
> * Added ucount_max to avoid the fork bomb.
>
> --
>
> Alexey Gladkov (8):
>   Increase size of ucounts to atomic_long_t
>   Add a reference to ucounts for each cred
>   Use atomic_t for ucounts reference counting
>   Reimplement RLIMIT_NPROC on top of ucounts
>   Reimplement RLIMIT_MSGQUEUE on top of ucounts
>   Reimplement RLIMIT_SIGPENDING on top of ucounts
>   Reimplement RLIMIT_MEMLOCK on top of ucounts
>   kselftests: Add test to check for rlimit changes in different user 
> namespaces
>

Overall this looks good, and it is a very good sign that the automatic
testing bots have not found anything.  I found a few little nits,
but things are looking very good.
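
For reference, the hierarchical check described in the cover letter
reduces to something like this (an illustrative sketch, not the exact
helper in the series):

	static bool over_limit(struct ucounts *ucounts, enum ucount_type type)
	{
		struct ucounts *iter;

		/* Walk from the leaf toward init_user_ns; the limit is
		 * exceeded if any level of the tree is over its max.
		 */
		for (iter = ucounts; iter; iter = iter->ns->ucounts) {
			long max = READ_ONCE(iter->ns->ucount_max[type]);

			if (atomic_long_read(&iter->ucount[type]) > max)
				return true;
		}
		return false;
	}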

Eric



Re: [PATCH v9 3/8] Use atomic_t for ucounts reference counting

2021-04-05 Thread Eric W. Biederman
Alexey Gladkov  writes:

> The current implementation of the ucounts reference counter requires the
> use of spin_lock. We're going to use get_ucounts() in more performance
> critical areas like a handling of RLIMIT_SIGPENDING.
>
> Now we need to use spin_lock only if we want to change the hashtable.
>
> v9:
> * Use a negative value to check that the ucounts->count is close to
>   overflow.


Overall this looks good, one small issue below.

Eric

> diff --git a/kernel/ucount.c b/kernel/ucount.c
> index 50cc1dfb7d28..7bac19bb3f1e 100644
> --- a/kernel/ucount.c
> +++ b/kernel/ucount.c
> @@ -11,7 +11,7 @@
>  struct ucounts init_ucounts = {
>   .ns= &init_user_ns,
>   .uid   = GLOBAL_ROOT_UID,
> - .count = 1,
> + .count = ATOMIC_INIT(1),
>  };
>  
>  #define UCOUNTS_HASHTABLE_BITS 10
> @@ -139,6 +139,15 @@ static void hlist_add_ucounts(struct ucounts *ucounts)
>   spin_unlock_irq(&ucounts_lock);
>  }
>  
> +struct ucounts *get_ucounts(struct ucounts *ucounts)
> +{
> + if (ucounts && atomic_add_negative(1, &ucounts->count)) {
> + atomic_dec(&ucounts->count);
^^^

To handle the pathological case of all of the other users calling
put_ucounts after the value goes negative, the above should
be put_ucounts instead of atomic_dec.


> + ucounts = NULL;
> + }
> + return ucounts;
> +}
> +
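
In other words, something like this (sketch):

	struct ucounts *get_ucounts(struct ucounts *ucounts)
	{
		if (ucounts && atomic_add_negative(1, &ucounts->count)) {
			/* The count was already at its limit: undo with a
			 * full put so the last reference can free the entry.
			 */
			put_ucounts(ucounts);
			ucounts = NULL;
		}
		return ucounts;
	}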
>  struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
>  {
>   struct hlist_head *hashent = ucounts_hashentry(ns, uid);
> @@ -155,7 +164,7 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
>  
>   new->ns = ns;
>   new->uid = uid;
> - new->count = 0;
> + atomic_set(&new->count, 1);
>  
>   spin_lock_irq(&ucounts_lock);
>   ucounts = find_ucounts(ns, uid, hashent);
> @@ -163,33 +172,12 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
>   kfree(new);
>   } else {
>   hlist_add_head(&new->node, hashent);
> - ucounts = new;
> + spin_unlock_irq(&ucounts_lock);
> + return new;
>   }
>   }
> - if (ucounts->count == INT_MAX)
> - ucounts = NULL;
> - else
> - ucounts->count += 1;
>   spin_unlock_irq(&ucounts_lock);
> - return ucounts;
> -}
> -
> -struct ucounts *get_ucounts(struct ucounts *ucounts)
> -{
> - unsigned long flags;
> -
> - if (!ucounts)
> - return NULL;
> -
> - spin_lock_irqsave(&ucounts_lock, flags);
> - if (ucounts->count == INT_MAX) {
> - WARN_ONCE(1, "ucounts: counter has reached its maximum value");
> - ucounts = NULL;
> - } else {
> - ucounts->count += 1;
> - }
> - spin_unlock_irqrestore(&ucounts_lock, flags);
> -
> + ucounts = get_ucounts(ucounts);
>   return ucounts;
>  }
>  
> @@ -197,15 +185,12 @@ void put_ucounts(struct ucounts *ucounts)
>  {
>   unsigned long flags;
>  
> - spin_lock_irqsave(&ucounts_lock, flags);
> - ucounts->count -= 1;
> - if (!ucounts->count)
> + if (atomic_dec_and_test(&ucounts->count)) {
> + spin_lock_irqsave(&ucounts_lock, flags);
>   hlist_del_init(&ucounts->node);
> - else
> - ucounts = NULL;
> - spin_unlock_irqrestore(&ucounts_lock, flags);
> -
> - kfree(ucounts);
> + spin_unlock_irqrestore(&ucounts_lock, flags);
> + kfree(ucounts);
> + }
>  }
>  
>  static inline bool atomic_long_inc_below(atomic_long_t *v, int u)


Re: [PATCH v9 6/8] Reimplement RLIMIT_SIGPENDING on top of ucounts

2021-04-05 Thread Eric W. Biederman


A small bug below.

Eric


> diff --git a/kernel/signal.c b/kernel/signal.c
> index f2a1b898da29..1b537d9de447 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -413,49 +413,44 @@ void task_join_group_stop(struct task_struct *task)
>  static struct sigqueue *
> __sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimit)
>  {
> - struct sigqueue *q = NULL;
> - struct user_struct *user;
> - int sigpending;
> + struct sigqueue *q = kmem_cache_alloc(sigqueue_cachep, flags);
>  
> - /*
> -  * Protect access to @t credentials. This can go away when all
> -  * callers hold rcu read lock.
> -  *
> -  * NOTE! A pending signal will hold on to the user refcount,
> -  * and we get/put the refcount only when the sigpending count
> -  * changes from/to zero.
> -  */
> - rcu_read_lock();
> - user = __task_cred(t)->user;
> - sigpending = atomic_inc_return(&user->sigpending);
> - if (sigpending == 1)
> - get_uid(user);
> - rcu_read_unlock();
> + if (likely(q != NULL)) {
> + bool overlimit;
>  
> - if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
> - q = kmem_cache_alloc(sigqueue_cachep, flags);
> - } else {
> - print_dropped_signal(sig);
> - }
> -
> - if (unlikely(q == NULL)) {
> - if (atomic_dec_and_test(&user->sigpending))
> - free_uid(user);
> - } else {
>   INIT_LIST_HEAD(&q->list);
>   q->flags = 0;
> - q->user = user;
> +
> + /*
> +  * Protect access to @t credentials. This can go away when all
> +  * callers hold rcu read lock.
> +  */
> + rcu_read_lock();
> + q->ucounts = get_ucounts(task_ucounts(t));
> + if (q->ucounts) {
> + overlimit = inc_rlimit_ucounts_and_test(q->ucounts, UCOUNT_RLIMIT_SIGPENDING,
> + 1, task_rlimit(t, RLIMIT_SIGPENDING));
> +
> + if (override_rlimit || likely(!overlimit)) {
> + rcu_read_unlock();
> + return q;
> + }
> + }
> + rcu_read_unlock();

I believe you need to call __sigqueue_free here.
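
That is, something like (sketch):

		rcu_read_unlock();
		__sigqueue_free(q);	/* drops the ucount reference and frees q */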

>   }
>  
> - return q;
> + print_dropped_signal(sig);
> + return NULL;
>  }
>  
>  static void __sigqueue_free(struct sigqueue *q)
>  {
>   if (q->flags & SIGQUEUE_PREALLOC)
>   return;
> - if (atomic_dec_and_test(&q->user->sigpending))
> - free_uid(q->user);
> + if (q->ucounts) {
> + dec_rlimit_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
> + put_ucounts(q->ucounts);
> + }
>   kmem_cache_free(sigqueue_cachep, q);
>  }
>  

Eric


Re: [PATCH v9 4/8] Reimplement RLIMIT_NPROC on top of ucounts

2021-04-05 Thread Eric W. Biederman
Alexey Gladkov  writes:

> The rlimit counter is tied to uid in the user_namespace. This allows
> rlimit values to be specified in userns even if they are already
> globally exceeded by the user. However, the value of the previous
> user_namespaces cannot be exceeded.
>
> To illustrate the impact of rlimits, let's say there is a program that
> does not fork. Some service-A wants to run this program as user X in
> multiple containers. Since the program never forks the service wants to
> set RLIMIT_NPROC=1.
>
> service-A
>  \- program (uid=1000, container1, rlimit_nproc=1)
>  \- program (uid=1000, container2, rlimit_nproc=1)
>
> The service-A sets RLIMIT_NPROC=1 and runs the program in container1.
> When the service-A tries to run a program with RLIMIT_NPROC=1 in
> container2 it fails since user X already has one running process.
>
> We cannot use existing inc_ucounts / dec_ucounts because they do not
> allow us to exceed the maximum for the counter. Some rlimits can be
> overlimited by root or if the user has the appropriate capability.

In general this looks good.

My comments are going to be on the infrastructure this patch
introduces rather than the change to RLIMIT_NPROC.  Perhaps
these would fare better as separate patches.

To preserve the existing semantics of incrementing and decrementing
rlimits in places separate from where the limits are checked you
correctly introduce inc_rlimit_ucounts and dec_rlimit_ucounts.

This separation means that for a short period of time the values
in the counts can overflow.  Which is something that the current
ucount helpers don't allow.  Taking advantage of the entire
negative range for the counts, just like we take advantage
of the entire negative range in get_ucounts seems a reasonable thing
to do.

On 32bit using "atomic_long_t" instead of "unsigned long" to hold the
counts has the downside that RLIMIT_MSGQUEUE and RLIMIT_MEMLOCK are
limited to 2GiB instead of 4GiB.  I don't think anyone cares but
it should be mentioned in case someone does.

The functions inc_rlimit_ucounts and inc_rlimit_ucounts_and_test need
to return overflow in the unlikely event the count value becomes
negative.
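
Something along these lines would do (a sketch of the idea only,
untested):

	long inc_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v)
	{
		struct ucounts *iter;
		long ret = 0;

		for (iter = ucounts; iter; iter = iter->ns->ucounts) {
			long new = atomic_long_add_return(v, &iter->ucount[type]);

			if (new < 0)		/* wrapped into the negative range */
				ret = LONG_MAX;	/* report overflow */
			else if (iter == ucounts)
				ret = new;	/* otherwise report the leaf count */
		}
		return ret;
	}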

Also when setting ns->ucount_max[] in create_user_ns because one value
is signed and the other is unsigned.  Care should be taken so that
rlimit_infinity is translated into the largest positive value the
type can hold.

Eric

> Signed-off-by: Alexey Gladkov 
> ---
>  fs/exec.c  |  2 +-
>  include/linux/cred.h   |  2 ++
>  include/linux/sched/user.h |  1 -
>  include/linux/user_namespace.h | 13 
>  kernel/cred.c  | 10 +++---
>  kernel/exit.c  |  2 +-
>  kernel/fork.c  |  9 ++---
>  kernel/sys.c   |  2 +-
>  kernel/ucount.c| 61 ++
>  kernel/user.c  |  1 -
>  kernel/user_namespace.c|  3 +-
>  11 files changed, 91 insertions(+), 15 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index d7c4187ca023..f2bcdbeb3afb 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1878,7 +1878,7 @@ static int do_execveat_common(int fd, struct filename *filename,
>* whether NPROC limit is still exceeded.
>*/
>   if ((current->flags & PF_NPROC_EXCEEDED) &&
> - atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
> + is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
>   retval = -EAGAIN;
>   goto out_ret;
>   }
> diff --git a/include/linux/cred.h b/include/linux/cred.h
> index 66436e655032..5ca1e8a1d035 100644
> --- a/include/linux/cred.h
> +++ b/include/linux/cred.h
> @@ -372,6 +372,7 @@ static inline void put_cred(const struct cred *_cred)
>  
>  #define task_uid(task)   (task_cred_xxx((task), uid))
>  #define task_euid(task)  (task_cred_xxx((task), euid))
> +#define task_ucounts(task)   (task_cred_xxx((task), ucounts))
>  
>  #define current_cred_xxx(xxx)\
>  ({   \
> @@ -388,6 +389,7 @@ static inline void put_cred(const struct cred *_cred)
>  #define current_fsgid()  (current_cred_xxx(fsgid))
>  #define current_cap()(current_cred_xxx(cap_effective))
>  #define current_user()   (current_cred_xxx(user))
> +#define current_ucounts()(current_cred_xxx(ucounts))
>  
>  extern struct user_namespace init_user_ns;
>  #ifdef CONFIG_USER_NS
> diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
> index a8ec3b6093fc..d33d867ad6c1 100644
> --- a/include/linux/sched/user.h
> +++ b/include/linux/sched/user.h
> @@ -12,7 +12,6 @@
>   */
>  struct user_struct {
>   refcount_t __count; /* reference count */
> - atomic_t processes; /* How many processes does this user have? */
>   atomic_t sigpending;    /* How many pending signals does this user have? */
>  #ifdef CONFIG_FANOTIFY

Re: [PATCH] psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files

2021-04-01 Thread Eric W. Biederman
Kees Cook  writes:

> On Wed, Mar 31, 2021 at 11:36:28PM -0500, Eric W. Biederman wrote:
>> Josh Hunt  writes:
>> 
>> > Currently only root can write files under /proc/pressure. Relax this to
>> > allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be
>> > able to write to these files.
>> 
>> The test for CAP_SYS_RESOURCE really needs to be in open rather
>> than in write.
>> 
>> Otherwise a suid root executable could have stdout redirected
>> into these files.
>
> Right. Or check against f_cred. (See uses of kallsyms_show_value())
> https://www.kernel.org/doc/html/latest/security/credentials.html#open-file-credentials

We really want to limit checking against f_cred to those cases where we
break userspace by checking in open.  AKA the cases where we made the
mistake of putting the permission check in the wrong place and now can't
fix it.

Since this change is changing the permissions that open uses already, I
don't see any reason we can't perform a proper check in open.
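
Roughly (a sketch, untested; it assumes the psi_io_open()/psi_io_show()
handler naming in kernel/sched/psi.c):

	static int psi_io_open(struct inode *inode, struct file *file)
	{
		if (!capable(CAP_SYS_RESOURCE))
			return -EPERM;
		return single_open(file, psi_io_show, NULL);
	}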

Eric


Re: [PATCH] psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files

2021-03-31 Thread Eric W. Biederman
Josh Hunt  writes:

> Currently only root can write files under /proc/pressure. Relax this to
> allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be
> able to write to these files.

The test for CAP_SYS_RESOURCE really needs to be in open rather
than in write.

Otherwise a suid root executable could have stdout redirected
into these files.

Eric


> Signed-off-by: Josh Hunt 
> ---
>  kernel/sched/psi.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index b1b00e9bd7ed..98ff7baf1ba8 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -1270,6 +1270,9 @@ static ssize_t psi_write(struct file *file, const char __user *user_buf,
>   if (!nbytes)
>   return -EINVAL;
>  
> + if (!capable(CAP_SYS_RESOURCE))
> + return -EPERM;
> +
>   buf_size = min(nbytes, sizeof(buf));
>   if (copy_from_user(buf, user_buf, buf_size))
>   return -EFAULT;
> @@ -1353,9 +1356,9 @@ static int __init psi_proc_init(void)
>  {
>   if (psi_enable) {
>   proc_mkdir("pressure", NULL);
> - proc_create("pressure/io", 0, NULL, &psi_io_proc_ops);
> - proc_create("pressure/memory", 0, NULL, &psi_memory_proc_ops);
> - proc_create("pressure/cpu", 0, NULL, &psi_cpu_proc_ops);
> + proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops);
> + proc_create("pressure/memory", 0666, NULL, &psi_memory_proc_ops);
> + proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops);
>   }
>   return 0;
>  }


Re: [PATCH 2/7] io_uring: handle signals for IO threads like a normal thread

2021-03-27 Thread Eric W. Biederman
Jens Axboe  writes:

> On 3/26/21 4:38 PM, Jens Axboe wrote:
>> OK good point, and follows the same logic even if it won't make a
>> difference in my case. I'll make the change.
>
> Made the suggested edits and ran the quick tests and the KILL/STOP
> testing, and no ill effects observed. Kicked off the longer runs now.
>
> Not a huge amount of changes from the posted series, but please peruse
> here if you want to double check:
>
> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-5.12
>
> And diff against v2 posted is below. Thanks!

That looks good.  Thanks.

Acked-by: "Eric W. Biederman" 

>
> diff --git a/fs/io-wq.c b/fs/io-wq.c
> index 3e2f059a1737..7434eb40ca8c 100644
> --- a/fs/io-wq.c
> +++ b/fs/io-wq.c
> @@ -505,10 +505,9 @@ static int io_wqe_worker(void *data)
>   if (signal_pending(current)) {
>   struct ksignal ksig;
>  
> - if (fatal_signal_pending(current))
> - break;
> - if (get_signal(&ksig))
> + if (!get_signal(&ksig))
>   continue;
> + break;
>   }
>   if (ret)
>   continue;
> @@ -722,10 +721,9 @@ static int io_wq_manager(void *data)
>   if (signal_pending(current)) {
>   struct ksignal ksig;
>  
> - if (fatal_signal_pending(current))
> - set_bit(IO_WQ_BIT_EXIT, &wq->state);
> - else if (get_signal(&ksig))
> + if (!get_signal(&ksig))
>   continue;
> + set_bit(IO_WQ_BIT_EXIT, &wq->state);
>   }
>   } while (!test_bit(IO_WQ_BIT_EXIT, &wq->state));
>  
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 66ae46874d85..880abd8b6d31 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -6746,10 +6746,9 @@ static int io_sq_thread(void *data)
>   if (signal_pending(current)) {
>   struct ksignal ksig;
>  
> - if (fatal_signal_pending(current))
> - break;
> - if (get_signal(&ksig))
> + if (!get_signal(&ksig))
>   continue;
> + break;
>   }
>   sqt_spin = false;
>   cap_entries = !list_is_singular(&sqd->ctx_list);
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 5b75fbe3d2d6..f2718350bf4b 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2752,15 +2752,6 @@ bool get_signal(struct ksignal *ksig)
>*/
>   current->flags |= PF_SIGNALED;
>  
> - /*
> -  * PF_IO_WORKER threads will catch and exit on fatal signals
> -  * themselves. They have cleanup that must be performed, so
> -  * we cannot call do_exit() on their behalf. coredumps also
> -  * do not apply to them.
> -  */
> - if (current->flags & PF_IO_WORKER)
> - return false;
> -
>   if (sig_kernel_coredump(signr)) {
>   if (print_fatal_signals)
>   print_fatal_signal(ksig->info.si_signo);
> @@ -2776,6 +2767,14 @@ bool get_signal(struct ksignal *ksig)
>   do_coredump(&ksig->info);
>   }
>  
> + /*
> +  * PF_IO_WORKER threads will catch and exit on fatal signals
> +  * themselves. They have cleanup that must be performed, so
> +  * we cannot call do_exit() on their behalf.
> +  */
> + if (current->flags & PF_IO_WORKER)
> + goto out;
> +
>   /*
>* Death signals, no core dump.
>*/
> @@ -2783,7 +2782,7 @@ bool get_signal(struct ksignal *ksig)
>   /* NOTREACHED */
>   }
>   spin_unlock_irq(&sighand->siglock);
> -
> +out:
>   ksig->sig = signr;
>  
>   if (!(ksig->ka.sa.sa_flags & SA_EXPOSE_TAGBITS))


Re: [PATCH 2/7] io_uring: handle signals for IO threads like a normal thread

2021-03-26 Thread Eric W. Biederman
Jens Axboe  writes:

> On 3/26/21 4:23 PM, Eric W. Biederman wrote:
>> Jens Axboe  writes:
>> 
>>> On 3/26/21 2:29 PM, Eric W. Biederman wrote:
>>>> Jens Axboe  writes:
>>>>
>>>>> We go through various hoops to disallow signals for the IO threads, but
>>>>> there's really no reason why we cannot just allow them. The IO threads
>>>>> never return to userspace like a normal thread, and hence don't go through
>>>>> normal signal processing. Instead, just check for a pending signal as part
>>>>> of the work loop, and call get_signal() to handle it for us if anything
>>>>> is pending.
>>>>>
>>>>> With that, we can support receiving signals, including special ones like
>>>>> SIGSTOP.
>>>>>
>>>>> Signed-off-by: Jens Axboe 
>>>>> ---
>>>>>  fs/io-wq.c| 24 +---
>>>>>  fs/io_uring.c | 12 
>>>>>  2 files changed, 25 insertions(+), 11 deletions(-)
>>>>>
>>>>> diff --git a/fs/io-wq.c b/fs/io-wq.c
>>>>> index b7c1fa932cb3..3e2f059a1737 100644
>>>>> --- a/fs/io-wq.c
>>>>> +++ b/fs/io-wq.c
>>>>> @@ -16,7 +16,6 @@
>>>>>  #include 
>>>>>  #include 
>>>>>  #include 
>>>>> -#include 
>>>>>  
>>>>>  #include "../kernel/sched/sched.h"
>>>>>  #include "io-wq.h"
>>>>> @@ -503,10 +502,16 @@ static int io_wqe_worker(void *data)
>>>>>   if (io_flush_signals())
>>>>>   continue;
>>>>>   ret = schedule_timeout(WORKER_IDLE_TIMEOUT);
>>>>> - if (try_to_freeze() || ret)
>>>>> + if (signal_pending(current)) {
>>>>> + struct ksignal ksig;
>>>>> +
>>>>> + if (fatal_signal_pending(current))
>>>>> + break;
>>>>> + if (get_signal(&ksig))
>>>>> + continue;
>>>> ^^
>>>>
>>>> That is wrong.  You are promising to deliver a signal to signal
>>>> handler and them simply discarding it.  Perhaps:
>>>>
>>>>if (!get_signal(&ksig))
>>>>continue;
>>>>WARN_ON(!sig_kernel_stop(ksig->sig));
>>>> break;
>>>
>>> Thanks, updated.
>> 
>> Gah.  Kill the WARN_ON.
>> 
>> I was thinking "WARN_ON(!sig_kernel_fatal(ksig->sig));"
>> The function sig_kernel_fatal does not exist.
>> 
>> Fatal is the state that is left when a signal is neither
>> ignored nor a stop signal, and does not have a handler.
>> 
>> The rest of the logic still works.
>
> I've just come to the same conclusion myself after testing it.
> Of the 3 cases, most of them can do the continue, but doesn't
> really matter with the way the loop is structured. Anyway, looks
> like this now:

This idiom in the code:
> + if (signal_pending(current)) {
> + struct ksignal ksig;
> +
> + if (fatal_signal_pending(current))
> + break;
> + if (!get_signal(&ksig))
> + continue;
>  }

Needs to be:
> + if (signal_pending(current)) {
> + struct ksignal ksig;
> +
> + if (!get_signal(&ksig))
> + continue;
> + break;
>  }

Because any signal returned from get_signal is fatal in this case.
It might make sense to "WARN_ON(ksig->ka.sa.sa_handler != SIG_DFL)".
As the io workers don't handle that case.

It won't happen because you have everything blocked.

The extra fatal_signal_pending(current) logic is just confusing in this
case.
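
With that the loop simply becomes (sketch):

	if (signal_pending(current)) {
		struct ksignal ksig;

		if (!get_signal(&ksig))
			continue;
		/* everything get_signal() hands back here is fatal */
		WARN_ON_ONCE(ksig.ka.sa.sa_handler != SIG_DFL);
		break;
	}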

Eric


Re: [PATCH 2/7] io_uring: handle signals for IO threads like a normal thread

2021-03-26 Thread Eric W. Biederman
Jens Axboe  writes:

> On 3/26/21 2:29 PM, Eric W. Biederman wrote:
>> Jens Axboe  writes:
>> 
>>> We go through various hoops to disallow signals for the IO threads, but
>>> there's really no reason why we cannot just allow them. The IO threads
>>> never return to userspace like a normal thread, and hence don't go through
>>> normal signal processing. Instead, just check for a pending signal as part
>>> of the work loop, and call get_signal() to handle it for us if anything
>>> is pending.
>>>
>>> With that, we can support receiving signals, including special ones like
>>> SIGSTOP.
>>>
>>> Signed-off-by: Jens Axboe 
>>> ---
>>>  fs/io-wq.c| 24 +---
>>>  fs/io_uring.c | 12 
>>>  2 files changed, 25 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/fs/io-wq.c b/fs/io-wq.c
>>> index b7c1fa932cb3..3e2f059a1737 100644
>>> --- a/fs/io-wq.c
>>> +++ b/fs/io-wq.c
>>> @@ -16,7 +16,6 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> -#include 
>>>  
>>>  #include "../kernel/sched/sched.h"
>>>  #include "io-wq.h"
>>> @@ -503,10 +502,16 @@ static int io_wqe_worker(void *data)
>>> if (io_flush_signals())
>>> continue;
>>> ret = schedule_timeout(WORKER_IDLE_TIMEOUT);
>>> -   if (try_to_freeze() || ret)
>>> +   if (signal_pending(current)) {
>>> +   struct ksignal ksig;
>>> +
>>> +   if (fatal_signal_pending(current))
>>> +   break;
>>> +   if (get_signal(&ksig))
>>> +   continue;
>> ^^
>> 
>> That is wrong.  You are promising to deliver a signal to signal
>> handler and them simply discarding it.  Perhaps:
>> 
>>  if (!get_signal(&ksig))
>>  continue;
>>  WARN_ON(!sig_kernel_stop(ksig->sig));
>> break;
>
> Thanks, updated.

Gah.  Kill the WARN_ON.

I was thinking "WARN_ON(!sig_kernel_fatal(ksig->sig));"
The function sig_kernel_fatal does not exist.

Fatal is the state that is left when a signal is neither
ignored nor a stop signal, and does not have a handler.

The rest of the logic still works.

Eric



Re: [PATCH 3/4] exec: simplify the compat syscall handling

2021-03-26 Thread Eric W. Biederman
Christoph Hellwig  writes:


> diff --git a/fs/exec.c b/fs/exec.c
> index 06e07278b456fa..b34c1eb9e7ad8e 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -391,47 +391,34 @@ static int bprm_mm_init(struct linux_binprm *bprm)
>   return err;
>  }
>  
> -struct user_arg_ptr {
> -#ifdef CONFIG_COMPAT
> - bool is_compat;
> -#endif
> - union {
> - const char __user *const __user *native;
> -#ifdef CONFIG_COMPAT
> - const compat_uptr_t __user *compat;
> -#endif
> - } ptr;
> -};
> -
> -static const char __user *get_user_arg_ptr(struct user_arg_ptr argv, int nr)
> +static const char __user *
> +get_user_arg_ptr(const char __user *const __user *argv, int nr)
>  {
> - const char __user *native;
> -
> -#ifdef CONFIG_COMPAT
> - if (unlikely(argv.is_compat)) {
> + if (in_compat_syscall()) {
> + const compat_uptr_t __user *compat_argv =
> + compat_ptr((unsigned long)argv);

Ouch!  Passing a pointer around as the wrong type through the kernel!

Perhaps we should reduce everything to do_execveat and
do_execveat_compat.  Then there would be no need for anything
to do anything odd with the pointer types.

I think the big change would be to factor out a copy_string out
of copy_strings, that performs all of the work once we know the proper
pointer value.
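
Something with this shape (a hand-waving sketch; copy_string() is the
factored-out helper I am imagining, it does not exist today):

	static int copy_strings_compat(struct linux_binprm *bprm,
				       const compat_uptr_t __user *argv,
				       int argc)
	{
		int i;

		for (i = 0; i < argc; i++) {
			compat_uptr_t compat;

			if (get_user(compat, argv + i))
				return -EFAULT;
			/* copy_string() copies a single argument string once
			 * the user pointer has been resolved.
			 */
			if (copy_string(bprm, compat_ptr(compat)))
				return -EFAULT;
		}
		return 0;
	}

Then neither path ever carries a pointer of the wrong type.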

Casting pointers from one type to another scares me as one mistake means
we are doing something wrong and probably exploitable.


Eric




>   compat_uptr_t compat;
>  
> - if (get_user(compat, argv.ptr.compat + nr))
> + if (get_user(compat, compat_argv + nr))
>   return ERR_PTR(-EFAULT);
> -
>   return compat_ptr(compat);
> - }
> -#endif
> -
> - if (get_user(native, argv.ptr.native + nr))
> - return ERR_PTR(-EFAULT);
> + } else {
> + const char __user *native;
>  
> - return native;
> + if (get_user(native, argv + nr))
> + return ERR_PTR(-EFAULT);
> + return native;
> + }
>  }
>  


Re: [PATCH 3/7] kernel: stop masking signals in create_io_thread()

2021-03-26 Thread Eric W. Biederman
Jens Axboe  writes:

> This is racy - move the blocking into when the task is created and
> we're marking it as PF_IO_WORKER anyway. The IO threads are now
> prepared to handle signals like SIGSTOP as well, so clear that from
> the mask to allow proper stopping of IO threads.

Acked-by: "Eric W. Biederman" 

>
> Reported-by: Oleg Nesterov 
> Signed-off-by: Jens Axboe 
> ---
>  kernel/fork.c | 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index d3171e8e88e5..ddaa15227071 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1940,8 +1940,14 @@ static __latent_entropy struct task_struct *copy_process(
>   p = dup_task_struct(current, node);
>   if (!p)
>   goto fork_out;
> - if (args->io_thread)
> + if (args->io_thread) {
> + /*
> +  * Mark us an IO worker, and block any signal that isn't
> +  * fatal or STOP
> +  */
>   p->flags |= PF_IO_WORKER;
> + siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
> + }
>  
>   /*
>* This _must_ happen before we call free_task(), i.e. before we jump
> @@ -2430,14 +2436,8 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
>   .stack_size = (unsigned long)arg,
>   .io_thread  = 1,
>   };
> - struct task_struct *tsk;
>  
> - tsk = copy_process(NULL, 0, node, &args);
> - if (!IS_ERR(tsk)) {
> - sigfillset(&tsk->blocked);
> - sigdelsetmask(&tsk->blocked, sigmask(SIGKILL));
> - }
> - return tsk;
> + return copy_process(NULL, 0, node, &args);
>  }
>  
>  /*


Re: [PATCH 1/7] kernel: don't call do_exit() for PF_IO_WORKER threads

2021-03-26 Thread Eric W. Biederman
Jens Axboe  writes:

> Right now we're never calling get_signal() from PF_IO_WORKER threads, but
> in preparation for doing so, don't handle a fatal signal for them. The
> workers have state they need to cleanup when exiting, and they don't do
> coredumps, so just return instead of performing either a dump or calling
> do_exit() on their behalf. The threads themselves will detect a fatal
> signal and do proper shutdown.
>
> Signed-off-by: Jens Axboe 
> ---
>  kernel/signal.c | 9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index f2a1b898da29..e3e1b8fbfe8a 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2756,6 +2756,15 @@ bool get_signal(struct ksignal *ksig)
>*/
>   current->flags |= PF_SIGNALED;
>  
> + /*
> +  * PF_IO_WORKER threads will catch and exit on fatal signals
> +  * themselves. They have cleanup that must be performed, so
> +  * we cannot call do_exit() on their behalf. coredumps also
> +  * do not apply to them.
> +  */
> + if (current->flags & PF_IO_WORKER)
> + return false;
> +

Returning false when get_signal needs the caller to handle a signal
adds a very weird and awkward special case to how get_signal returns
arguments.

Instead you should simply break and let get_signal return SIGKILL like
any other signal that has a handler that the caller of get_signal needs
to handle.

Something like:
> + /*
> +  * PF_IO_WORKER have cleanup that must be performed,
> +  * before calling do_exit().
> +  */
> + if (current->flags & PF_IO_WORKER)
> + break;


As do_coredump does not call do_exit there is no reason to skip calling into
the coredump handling either.   And allowing it will remove yet another
special case from the io worker code.

>   if (sig_kernel_coredump(signr)) {
>   if (print_fatal_signals)
>   print_fatal_signal(ksig->info.si_signo);

Eric


Re: [PATCH 2/7] io_uring: handle signals for IO threads like a normal thread

2021-03-26 Thread Eric W. Biederman
Jens Axboe  writes:

> We go through various hoops to disallow signals for the IO threads, but
> there's really no reason why we cannot just allow them. The IO threads
> never return to userspace like a normal thread, and hence don't go through
> normal signal processing. Instead, just check for a pending signal as part
> of the work loop, and call get_signal() to handle it for us if anything
> is pending.
>
> With that, we can support receiving signals, including special ones like
> SIGSTOP.
>
> Signed-off-by: Jens Axboe 
> ---
>  fs/io-wq.c| 24 +---
>  fs/io_uring.c | 12 
>  2 files changed, 25 insertions(+), 11 deletions(-)
>
> diff --git a/fs/io-wq.c b/fs/io-wq.c
> index b7c1fa932cb3..3e2f059a1737 100644
> --- a/fs/io-wq.c
> +++ b/fs/io-wq.c
> @@ -16,7 +16,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  
>  #include "../kernel/sched/sched.h"
>  #include "io-wq.h"
> @@ -503,10 +502,16 @@ static int io_wqe_worker(void *data)
>   if (io_flush_signals())
>   continue;
>   ret = schedule_timeout(WORKER_IDLE_TIMEOUT);
> - if (try_to_freeze() || ret)
> + if (signal_pending(current)) {
> + struct ksignal ksig;
> +
> + if (fatal_signal_pending(current))
> + break;
> + if (get_signal(&ksig))
> + continue;
^^

That is wrong.  You are promising to deliver a signal to signal
handler and them simply discarding it.  Perhaps:

if (!get_signal(&ksig))
continue;
WARN_ON(!sig_kernel_stop(ksig->sig));
break;


Eric


Re: [PATCH 0/2] Don't show PF_IO_WORKER in /proc//task/

2021-03-25 Thread Eric W. Biederman
Oleg Nesterov  writes:

> On 03/25, Linus Torvalds wrote:
>>
>> The whole "signals are very special for IO threads" thing has caused
>> so many problems, that maybe the solution is simply to _not_ make them
>> special?
>
> Or may be IO threads should not abuse CLONE_THREAD?
>
> Why does create_io_thread() abuse CLONE_THREAD ?
>
> One reason (I think) is that this implies SIGKILL when the process
> exits/execs, anything else?

A lot.

The io workers perform work on behalf of the ordinary userspace threads.
Some of that work is opening files.  For things like rlimits to work
properly you need to share the signal_struct.  But odds are if you find
anything in signal_struct (not counting signals) there will be an
io_uring code path that can exercise it as io_uring can traverse the
filesystem, open files and read/write files.  So io_uring can exercise
all of proc.

Using create_io_thread with CLONE_THREAD is the least problematic way
(including all of the signal and ptrace problems we are looking at right
now) to implement the io worker threads.

They _really_ are threads of the process that just never execute any
code in userspace.

Eric



Re: [PATCH 0/2] Don't show PF_IO_WORKER in /proc//task/

2021-03-25 Thread Eric W. Biederman
Oleg Nesterov  writes:

> On 03/25, Eric W. Biederman wrote:
>>
>> So looking quickly the flip side of the coin is gdb (and other
>> debuggers) needs a way to know these threads are special, so it can know
>> not to attach.
>
> may be,
>
>> I suspect getting -EPERM (or possibly a different error code) when
>> attempting attach is the right way to know that a thread is not
>> available to be debugged.
>
> may be.
>
> But I don't think we can blame gdb. The kernel changed the rules, and this
> broke gdb. IOW, I don't agree this is gdb bug.

My point would be it is not strictly a regression either.  It is gdb not
handling new functionality.

If we can be backwards compatible and make ptrace_attach work that is
preferable.  If we can't, saying the handful of ptrace-using applications
need an upgrade to support processes that use io_uring may be
acceptable.

I don't see any easy to implement path that is guaranteed to work.

Eric





Re: [PATCH 0/2] Don't show PF_IO_WORKER in /proc//task/

2021-03-25 Thread Eric W. Biederman
Linus Torvalds  writes:

> On Thu, Mar 25, 2021 at 12:42 PM Linus Torvalds
>  wrote:
>>
>> On Thu, Mar 25, 2021 at 12:38 PM Linus Torvalds
>>  wrote:
>> >
>> > I don't know what the gdb logic is, but maybe there's some other
>> > option that makes gdb not react to them?
>>
>> .. maybe we could have a different name for them under the task/
>> subdirectory, for example (not  just the pid)? Although that probably
>> messes up 'ps' too..
>
> Actually, maybe the right model is to simply make all the io threads
> take signals, and get rid of all the special cases.
>
> Sure, the signals will never be delivered to user space, but if we
>
>  - just made the thread loop do "get_signal()" when there are pending signals
>
>  - allowed ptrace_attach on them
>
> they'd look pretty much like regular threads that just never do the
> user-space part of signal handling.
>
> The whole "signals are very special for IO threads" thing has caused
> so many problems, that maybe the solution is simply to _not_ make them
> special?

The special case in check_kill_permission is certainly unnecessary.
Having the signal blocked is enough to prevent signal_pending() from
being true. 


The most straight forward thing I can see is to allow ptrace_attach and
to modify ptrace_check_attach to always return -ESRCH for io workers
unless ignore_state is set causing none of the other ptrace operations
to work.

That is what a long running in-kernel thread would do today so
user-space aka gdb may actually cope with it.
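
Something like (sketch, untested):

	/* in ptrace_check_attach(), before waiting for the tracee to stop */
	if ((child->flags & PF_IO_WORKER) && !ignore_state)
		return -ESRCH;	/* io workers never stop in TASK_TRACED */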


We might be able to support if io workers start supporting SIGSTOP but I
am not at all certain.

Eric



Re: [PATCH 0/2] Don't show PF_IO_WORKER in /proc//task/

2021-03-25 Thread Eric W. Biederman
Jens Axboe  writes:

> On 3/25/21 1:42 PM, Linus Torvalds wrote:
>> On Thu, Mar 25, 2021 at 12:38 PM Linus Torvalds
>>  wrote:
>>>
>>> I don't know what the gdb logic is, but maybe there's some other
>>> option that makes gdb not react to them?
>> 
>> .. maybe we could have a different name for them under the task/
>> subdirectory, for example (not  just the pid)? Although that probably
>> messes up 'ps' too..
>
> Heh, I can try, but my guess is that it would mess up _something_, if
> not ps/top.

Hmm.

So looking quickly the flip side of the coin is gdb (and other
debuggers) needs a way to know these threads are special, so it can know
not to attach.

I suspect getting -EPERM (or possibly a different error code) when
attempting attach is the right way to know that a thread is not
available to be debugged.

Eric



Re: [PATCH 0/2] Don't show PF_IO_WORKER in /proc//task/

2021-03-25 Thread Eric W. Biederman
Jens Axboe  writes:

> Hi,
>
> Stefan reports that attaching to a task with io_uring will leave gdb
> very confused and just repeatedly attempting to attach to the IO threads,
> even though it receives an -EPERM every time. This patchset proposes to
> skip PF_IO_WORKER threads as same_thread_group(), except for accounting
> purposes which we still desire.
>
> We also skip listing the IO threads in /proc//task/ so that gdb
> doesn't think it should stop and attach to them. This makes us consistent
> with earlier kernels, where these async threads were not related to the
> ring owning task, and hence gdb (and others) ignored them anyway.
>
> Seems to me that this is the right approach, but open to comments on if
> others agree with this. Oleg, I did see your messages as well on SIGSTOP,
> and as was discussed with Eric as well, this is something we most
> certainly can revisit. I do think that the visibility of these threads
> is a separate issue. Even with SIGSTOP implemented (which I did try as
> well), we're never going to allow ptrace attach and hence gdb would still
> be broken. Hence I'd rather treat them as separate issues to attack.

A quick skim shows that these threads are not showing up anywhere in
proc which appears to be a problem, as it hides them from top.

Sysadmins need the ability to dig into a system and find out where all
their cpu usage or io's have gone when there is a problem.  In general I
think this argues that these threads should show up as threads of the
process so I am not even certain this is the right fix to deal with gdb.

Eric


Re: [PATCH AUTOSEL 5.11 43/44] signal: don't allow STOP on PF_IO_WORKER threads

2021-03-25 Thread Eric W. Biederman
Stefan Metzmacher  writes:

> Am 25.03.21 um 12:24 schrieb Sasha Levin:
>> From: "Eric W. Biederman" 
>> 
>> [ Upstream commit 4db4b1a0d1779dc159f7b87feb97030ec0b12597 ]
>> 
>> Just like we don't allow normal signals to IO threads, don't deliver a
>> STOP to a task that has PF_IO_WORKER set. The IO threads don't take
>> signals in general, and have no means of flushing out a stop either.
>> 
>> Longer term, we may want to look into allowing stop of these threads,
>> as it relates to eg process freezing. For now, this prevents a spin
>> issue if a SIGSTOP is delivered to the parent task.
>> 
>> Reported-by: Stefan Metzmacher 
>> Signed-off-by: Jens Axboe 
>> Signed-off-by: "Eric W. Biederman" 
>> Signed-off-by: Sasha Levin 
>> ---
>>  kernel/signal.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>> 
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 55526b941011..00a3840f6037 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -288,7 +288,8 @@ bool task_set_jobctl_pending(struct task_struct *task, unsigned long mask)
>>  JOBCTL_STOP_SIGMASK | JOBCTL_TRAPPING));
>>  BUG_ON((mask & JOBCTL_TRAPPING) && !(mask & JOBCTL_PENDING_MASK));
>>  
>> -if (unlikely(fatal_signal_pending(task) || (task->flags & PF_EXITING)))
>> +if (unlikely(fatal_signal_pending(task) ||
>> + (task->flags & (PF_EXITING | PF_IO_WORKER))))
>>  return false;
>>  
>>  if (mask & JOBCTL_STOP_SIGMASK)
>> 
>
> Again, why is this proposed for 5.11 and 5.10 already?

Has the bit about the io worker kthreads been backported?
If so this isn't horrible.  If not this is nonsense.

Eric




Re: [PATCHSET 0/2] PF_IO_WORKER signal tweaks

2021-03-21 Thread Eric W. Biederman
Jens Axboe  writes:

> On 3/20/21 4:08 PM, Eric W. Biederman wrote:
>> 
>> Added criu because I just realized that io_uring (which can open files
>> from an io worker thread) looks to require some special handling for
>> stopping and freezing processes.  If not in the SIGSTOP case in the
>> related cgroup freezer case.
>> 
>> Linus Torvalds  writes:
>> 
>>> On Sat, Mar 20, 2021 at 10:51 AM Linus Torvalds
>>>  wrote:
>>>>
>>>> Alternatively, make it not use
>>>> CLONE_SIGHAND|CLONE_THREAD at all, but that would make it
>>>> unnecessarily allocate its own signal state, so that's "cleaner" but
>>>> not great either.
>>>
>>> Thinking some more about that, it would be problematic for things like
>>> the resource counters too. They'd be much better shared.
>>>
>>> Not adding it to the thread list etc might be clever, but feels a bit too 
>>> scary.
>>>
>>> So on the whole I think Jens' minor patches to just not have IO helper
>>> threads accept signals are probably the right thing to do.
>> 
>> The way I see it we have two options:
>> 
>> 1) Don't ask PF_IO_WORKERs to stop in do_signal_stop and in
>>task_join_group_stop.
>> 
>>The easiest comprehensive implementation looks like just
>>updating task_set_jobctl_pending to treat PF_IO_WORKER
>>as it treats PF_EXITING.
>> 
>> 2) Have the main loop of the kernel thread test for JOBCTL_STOP_PENDING
>>and call into do_signal_stop.
>> 
>> It is a wee bit trickier to modify the io_workers to stop, but it does
>> not look prohibitively difficult.
>> 
>> All of the work performed by the io worker is work scheduled via
>> io_uring by the process being stopped.
>> 
>> - Is the amount of work performed by the io worker thread sufficiently
>>   negligible that we don't care?
>> 
>> - Or is the amount of work performed by the io worker so great that it
>>   becomes a way for an errant process to escape SIGSTOP?
>> 
>> As the code is all intermingled with the cgroup_freezer, I am also
>> wondering whether creating checkpoints needs additional stopping guarantees.
>
> The work done is the same as a syscall, basically. So it could be long
> running and essentially not doing anything (eg read from a socket, no
> data is there), or it's pretty short lived (eg read from a file, just
> waiting on DMA).
>
> This is outside of my domain of expertise, which is exactly why I added
> you and Linus to make some calls on what the best approach here would
> be. My two patches obviously go route #1 in terms of STOP. And fwiw,
> I tested this:
>
>> To solve the issue that SIGSTOP is simply broken right now I am totally
>> fine with something like:
>> 
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index ba4d1ef39a9e..cb9acdfb32fa 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -288,7 +288,8 @@ bool task_set_jobctl_pending(struct task_struct *task, unsigned long mask)
>>  JOBCTL_STOP_SIGMASK | JOBCTL_TRAPPING));
>>  BUG_ON((mask & JOBCTL_TRAPPING) && !(mask & JOBCTL_PENDING_MASK));
>>  
>> -	if (unlikely(fatal_signal_pending(task) || (task->flags & PF_EXITING)))
>> +	if (unlikely(fatal_signal_pending(task) ||
>> +		     (task->flags & (PF_EXITING | PF_IO_WORKER))))
>>  return false;
>>  
>>  if (mask & JOBCTL_STOP_SIGMASK)
>
> and can confirm it works fine for me with 2/2 reverted and this applied
> instead.
>
>> Which just keeps from creating unstoppable processes today.  I am just
>> not convinced that is what we want as a long term solution.
>
> How about we go with either my 2/2 or yours above to at least ensure we
> don't leave workers looping as schedule() is a nop with sigpending? If
> there's a longer timeline concern that "evading" SIGSTOP is a concern, I
> have absolutely no qualms with making the IO threads participate. But
> since it seems conceptually simple but with potentially lurking minor
> issues, probably not the ideal approach for right now.


Here is the signoff for mine.

Signed-off-by: "Eric W. Biederman" 

Yours misses the joining of group stop during fork.  So we better use
mine.

As far as I can see that fixes the outstanding bugs.

Jens can you make a proper patch out of it and send it to Linus for
-rc4?  I unfortunately have other commitments and this is all I can do
for today.

Eric


Re: [PATCH 1/2] signal: don't allow sending any signals to PF_IO_WORKER threads

2021-03-21 Thread Eric W. Biederman
Jens Axboe  writes:

> On 3/20/21 3:38 PM, Eric W. Biederman wrote:
>> Linus Torvalds  writes:
>> 
>>> On Sat, Mar 20, 2021 at 9:19 AM Eric W. Biederman  
>>> wrote:
>>>>
>>>> The creds should be reasonably in-sync with the rest of the threads.
>>>
>>> It's not about credentials (despite the -EPERM).
>>>
>>> It's about the fact that kernel threads cannot handle signals, and
>>> then get caught in endless loops of "if (sigpending()) return
>>> -EAGAIN".
>>>
>>> For a normal user thread, that "return -EAGAIN" (or whatever) will end
>>> up returning an error to user space - and before it does that, it will
>>> go through the "oh, returning to user space, so handle signal" path.
>>> Which will clear sigpending etc.
>>>
>>> A thread that never returns to user space fundamentally cannot handle
>>> this. The sigpending() stays on forever, the signal never gets
>>> handled, the thread can't do anything.
>>>
>>> So delivering a signal to a kernel thread fundamentally cannot work
>>> (although we do have some threads that explicitly see "oh, if I was
>>> killed, I will exit" - think things like in-kernel nfsd etc).
>> 
>> I agree that getting a kernel thread to receive a signal is quite
>> tricky.  But that is not what the patch affects.
>> 
>> The patch covers the case when instead of specifying the pid of the
>> process to kill(2) someone specifies the tid of a thread.  Which implies
>> that type is PIDTYPE_TGID, and in turn the signal is being placed on the
>> t->signal->shared_pending queue.  Not the thread specific t->pending
>> queue.
>> 
>> So my question is since the signal is delivered to the process as a
>> whole why do we care if someone specifies the tid of a kernel thread,
>> rather than the tid of a userspace thread?
>
> Right, that's what this first patch does, and in all honesty, it's not
> required like the 2/2 patch is. I do think it makes it more consistent,
> though - the threads don't take signals, period. Allowing delivery from
> eg kill(2) and then pass it to the owning task of the io_uring is
> somewhat counterintuitive, and differs from earlier kernels where there
> was no relationship between that owning task and the async worker
> thread.
>
> That's why I think the patch DOES make sense. These threads may share a
> personality with the owning task, but I don't think we should be able to
> manipulate them from userspace at all. That includes SIGSTOP, of course,
> but also regular signals.
>
> Hence I do think we should do something like this.

I agree about signals.  Especially because being able to use kill(2)
with the tid of thread is a linuxism and a backwards compatibility thing
from before we had CLONE_THREAD.

I think for kill(2) we should just return -ESRCH.
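Something like this, an untested sketch on top of your patch 1/2, with
the error changed so a kernel-internal tid simply looks nonexistent:

--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -833,6 +833,9 @@ static int check_kill_permission(int sig, struct kernel_siginfo *info,
 	if (!valid_signal(sig))
 		return -EINVAL;
+	/* PF_IO_WORKER threads are not visible signal targets */
+	if (t->flags & PF_IO_WORKER)
+		return -ESRCH;
 
 	if (!si_fromuser(info))
 		return 0;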

Thank you for providing the reasoning that is what I really saw missing
in the patches.  The why.  And software is difficult to maintain without
the why.





Eric





Re: [PATCHSET 0/2] PF_IO_WORKER signal tweaks

2021-03-20 Thread Eric W. Biederman


Added criu because I just realized that io_uring (which can open files
from an io worker thread) looks to require some special handling for
stopping and freezing processes.  If not in the SIGSTOP case in the
related cgroup freezer case.

Linus Torvalds  writes:

> On Sat, Mar 20, 2021 at 10:51 AM Linus Torvalds
>  wrote:
>>
>> Alternatively, make it not use
>> CLONE_SIGHAND|CLONE_THREAD at all, but that would make it
>> unnecessarily allocate its own signal state, so that's "cleaner" but
>> not great either.
>
> Thinking some more about that, it would be problematic for things like
> the resource counters too. They'd be much better shared.
>
> Not adding it to the thread list etc might be clever, but feels a bit too 
> scary.
>
> So on the whole I think Jens' minor patches to just not have IO helper
> threads accept signals are probably the right thing to do.

The way I see it we have two options:

1) Don't ask PF_IO_WORKERs to stop do_signal_stop and in
   task_join_group_stop.

   The easiest comprehensive implementation looks like just
   updating task_set_jobctl_pending to treat PF_IO_WORKER
   as it treats PF_EXITING.

2) Have the main loop of the kernel thread test for JOBCTL_STOP_PENDING
   and call into do_signal_stop.

It is a wee bit trickier to modify the io_workers to stop, but it does
not look prohibitively difficult.

All of the work performed by the io worker is work scheduled via
io_uring by the process being stopped.

- Is the amount of work performed by the io worker thread sufficiently
  negligible that we don't care?

- Or is the amount of work performed by the io worker so great that it
  becomes a way for an errant process to escape SIGSTOP?

As the code is all intermingled with the cgroup_freezer, I am also
wondering whether creating checkpoints needs additional stopping guarantees.


To solve the issue that SIGSTOP is simply broken right now I am totally
fine with something like:

diff --git a/kernel/signal.c b/kernel/signal.c
index ba4d1ef39a9e..cb9acdfb32fa 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -288,7 +288,8 @@ bool task_set_jobctl_pending(struct task_struct *task, unsigned long mask)
JOBCTL_STOP_SIGMASK | JOBCTL_TRAPPING));
BUG_ON((mask & JOBCTL_TRAPPING) && !(mask & JOBCTL_PENDING_MASK));
 
-	if (unlikely(fatal_signal_pending(task) || (task->flags & PF_EXITING)))
+	if (unlikely(fatal_signal_pending(task) ||
+		     (task->flags & (PF_EXITING | PF_IO_WORKER))))
return false;
 
if (mask & JOBCTL_STOP_SIGMASK)



Which just keeps from creating unstoppable processes today.  I am just
not convinced that is what we want as a long term solution.

Eric


Re: [PATCH 1/2] signal: don't allow sending any signals to PF_IO_WORKER threads

2021-03-20 Thread Eric W. Biederman
Linus Torvalds  writes:

> On Sat, Mar 20, 2021 at 9:19 AM Eric W. Biederman  
> wrote:
>>
>> The creds should be reasonably in-sync with the rest of the threads.
>
> It's not about credentials (despite the -EPERM).
>
> It's about the fact that kernel threads cannot handle signals, and
> then get caught in endless loops of "if (sigpending()) return
> -EAGAIN".
>
> For a normal user thread, that "return -EAGAIN" (or whatever) will end
> up returning an error to user space - and before it does that, it will
> go through the "oh, returning to user space, so handle signal" path.
> Which will clear sigpending etc.
>
> A thread that never returns to user space fundamentally cannot handle
> this. The sigpending() stays on forever, the signal never gets
> handled, the thread can't do anything.
>
> So delivering a signal to a kernel thread fundamentally cannot work
> (although we do have some threads that explicitly see "oh, if I was
> killed, I will exit" - think things like in-kernel nfsd etc).

I agree that getting a kernel thread to receive a signal is quite
tricky.  But that is not what the patch affects.

The patch covers the case when instead of specifying the pid of the
process to kill(2) someone specifies the tid of a thread.  Which implies
that type is PIDTYPE_TGID, and in turn the signal is being placed on the
t->signal->shared_pending queue.  Not the thread specific t->pending
queue.

So my question is since the signal is delivered to the process as a
whole why do we care if someone specifies the tid of a kernel thread,
rather than the tid of a userspace thread?
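For reference, the queue choice comes down to one test in
__send_signal(), roughly this, from kernel/signal.c:

	/*
	 * Anything wider than PIDTYPE_PID lands on the process-wide
	 * queue; only a thread-directed signal uses t->pending.
	 */
	pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending
					: &t->pending;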

Eric


Re: [PATCHSET 0/2] PF_IO_WORKER signal tweaks

2021-03-20 Thread Eric W. Biederman
Jens Axboe  writes:

> Hi,
>
> Been trying to ensure that we do the right thing wrt signals and
> PF_IO_WORKER threads, and I think there are two cases we need to handle
> explicitly:
>
> 1) Just don't allow signals to them in general. We do mask everything
>as blocked, outside of SIGKILL, so things like wants_signal() will
>never return true for them. But it's still possible to send them a
>signal via (ultimately) group_send_sig_info(). This will then deliver
>the signal to the original io_uring owning task, and that seems a bit
>unexpected. So just don't allow them in general.
>
> 2) STOP is done a bit differently, and we should not allow that either.
>
> Outside of that, I've been looking at same_thread_group(). This will
> currently return true for an io_uring task and it's IO workers, since
> they do share ->signal. From looking at the kernel users of this, that
> actually seems OK for the cases I checked. One is accounting related,
> which we obviously want, and others are related to permissions between
> tasks. FWIW, I ran with the below and didn't observe any ill effects,
> but I'd like someone to actually think about and verify that PF_IO_WORKER
> same_thread_group() usage is sane.

They are helper threads running in kernel space.  Allowing the ordinary
threads not to block.  Why would same_thread_group be a problem?

I don't hate this set of patches.   But I also don't see much
explanation why the changes are the right thing semantically.

That makes me uneasy.  Because especially the SIGSTOP changes feels like
it is the wrong thing semantically.  The group_send_sig_info change
simply feels like it is unnecessary.

Like we are maybe playing whak-a-mole with symptoms rather than make
certain these are the right fixes for the symptoms.

Eric

> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 3f6a0fcaa10c..a580bc0f8aa3 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -667,10 +667,17 @@ static inline bool thread_group_leader(struct task_struct *p)
>   return p->exit_signal >= 0;
>  }
>  
> +static inline
> +bool same_thread_group_account(struct task_struct *p1, struct task_struct *p2)
> +{
> +	return p1->signal == p2->signal;
> +}
> +
>  static inline
>  bool same_thread_group(struct task_struct *p1, struct task_struct *p2)
>  {
> - return p1->signal == p2->signal;
> + return same_thread_group_account(p1, p2) &&
> + !((p1->flags | p2->flags) & PF_IO_WORKER);
>  }
>  
>  static inline struct task_struct *next_thread(const struct task_struct *p)
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 5f611658eeab..625110cacc2a 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -307,7 +307,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>* those pending times and rely only on values updated on tick or
>* other scheduler action.
>*/
> - if (same_thread_group(current, tsk))
> + if (same_thread_group_account(current, tsk))
>   (void) task_sched_runtime(current);
>  
>   rcu_read_lock();


Re: [PATCH 2/2] signal: don't allow STOP on PF_IO_WORKER threads

2021-03-20 Thread Eric W. Biederman
Jens Axboe  writes:

> Just like we don't allow normal signals to IO threads, don't deliver a
> STOP to a task that has PF_IO_WORKER set. The IO threads don't take
> signals in general, and have no means of flushing out a stop either.

At first glance this seems safe.  This is before we count all of the
threads, and it needs to be a non io_uring thread.

Further we can change this decision later if necessary, as this is not
really exposed to userspace.

But please tell me why is it not the right thing to have the io_uring
helper threads stop?  Why is it ok for that process to go on consuming
cpu resources in a non-stoppable way.

Eric


> Reported-by: Stefan Metzmacher 
> Signed-off-by: Jens Axboe 
> ---
>  kernel/signal.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 730ecd3d6faf..b113bf647fb4 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2349,6 +2349,10 @@ static bool do_signal_stop(int signr)
>  
>   t = current;
>   while_each_thread(current, t) {
> + /* don't STOP PF_IO_WORKER threads */
> + if (t->flags & PF_IO_WORKER)
> + continue;
> +
>   /*
>* Setting state to TASK_STOPPED for a group
>* stop is always done with the siglock held,


Re: [PATCH 1/2] signal: don't allow sending any signals to PF_IO_WORKER threads

2021-03-20 Thread Eric W. Biederman
Jens Axboe  writes:

> They don't take signals individually, and even if they share signals with
> the parent task, don't allow them to be delivered through the worker
> thread.

This is silly I know, but why do we care?

The creds should be reasonably in-sync with the rest of the threads.

There are other threads that will receive the signal, especially when
you worry about group_send_sig_info.  Which signal sending code paths
are actually a problem.

> Reported-by: Stefan Metzmacher 
> Signed-off-by: Jens Axboe 
> ---
>  kernel/signal.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index ba4d1ef39a9e..730ecd3d6faf 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -833,6 +833,9 @@ static int check_kill_permission(int sig, struct kernel_siginfo *info,
>  
>   if (!valid_signal(sig))
>   return -EINVAL;
> + /* PF_IO_WORKER threads don't take any signals */
> + if (t->flags & PF_IO_WORKER)
> + return -EPERM;
>  
>   if (!si_fromuser(info))
>   return 0;


Re: [PATCH V3] exit: trigger panic when global init has exited

2021-03-18 Thread Eric W. Biederman
Oleg Nesterov  writes:

> On 03/18, qianli zhao wrote:
>>
>> Hi,Oleg
>>
>> Thank you for your reply.
>>
>> >> When init sub-threads running on different CPUs exit at the same time,
>> >> zap_pid_ns_processe()->BUG() may be happened.
>>
>> > and why do you think your patch can't prevent this?
>>
>> > Sorry, I must have missed something. But it seems to me that you are trying
>> > to fix the wrong problem. Yes, zap_pid_ns_processes() must not be called in
>> > the root namespace, and this has nothing to do with CONFIG_PID_NS.
>>
>> Yes, i try to fix this exception by test SIGNAL_GROUP_EXIT and call
>> panic before setting PF_EXITING to prevent zap_pid_ns_processes()
>> being called when init do_exit().
>
> Ah, I didn't notice your patch does atomic_dec_and_test(signal->live)
> before exit_signals() which sets PF_EXITING. Thanks for correcting me.
>
> So yes, I was wrong, your patch can prevent this. Although I'd like to
> recheck if every do-something-if-group-dead action is correct in the
> case we have a non-PF_EXITING thread...
>
> But then I don't understand the SIGNAL_GROUP_EXIT check added by your
> patch. Do we really need it if we want to avoid zap_pid_ns_processes()
> when the global init exits?
>
>> In addition, the patch also protects the init process state to
>> successfully get usable init coredump.
>
> Could you spell please?
>
> Does this connect to SIGNAL_GROUP_EXIT check? Do you mean that you want
> to panic earlier, before other init's sub-threads exit?

That is my understanding.

As I understand it this patch has two purposes:
1. Avoid the BUG_ON in zap_pid_ns_processes when !CONFIG_PID_NS
2. panic as early as possible so exiting threads don't removing
   interesting debugging state.


It is a bit tricky to tell if the movement of the decrement of
signal->live is safe.  That affects current_is_single_threaded()
which is used by unshare, setns of the time namespace, and setting
the selinux part of creds.

The usage in kernel/cgroup/cgroup.c:css_task_iter_advance seems safe.
Hmm, maybe not.  Today cgroup_threadgroup_change_begin() is held around
setting PF_EXITING before signal->live is decremented.  So there seem to
be some subtle cgroup dependencies.

The usages of group_dead in do_exit seem safe, as except for the new
one everything is the same.

We could definitely take advantage of knowing group_dead in exit_signals
to simplify its optimization of not rerouting signals to living
threads.


I think if we are going to move the decrement of signal->live that
should be it's own patch and be accompanied with a good description of
why it is safe instead of having the decrement of signal->live be there
as a side effect of another change.

Eric


Re: [GIT PULL] userns regression fix for v5.12-rc3

2021-03-13 Thread Eric W. Biederman
Linus Torvalds  writes:

> On Fri, Mar 12, 2021 at 1:34 PM Eric W. Biederman  
> wrote:
>>
>> Please pull the for-v5.12-rc3 branch from the git tree.
>>
>> Removing the ambiguity broke userspace so please revert the change:
>> It turns out that there are in fact userspace implementations that
>> care and this recent change caused a regression.
>>
>> https://github.com/containers/buildah/issues/3071
>
> Pulled.
>
> But please cc the kernel mailing list (or something like that) for
> these things, so that you get the pr-tracker-bot reply automatically.
>
> Perhaps you don't care about the pr-tracker-bot that much, but even
> aside from that, I just like the pulls to be cc'd on the lists, if
> only as "documentation".
>
> (Plus since I've gotten used to seeing the pr-tracker-bot
> confirmations, I then get antsy and wondering if something is wrong
> with the bot when it doesn't respond fully when I push things out,
> which is how I noticed this one).

Sorry about that.  I meant to CC lkml but my fingers forgot to include
it this time.

I just knew I had to get this out so we could start unbreaking users.

Eric


Re: [PATCH v5] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

2021-03-12 Thread Eric W. Biederman
Jim Newsome  writes:

> On 3/12/21 14:29, Eric W. Biederman wrote:
>> When I looked at this a second time it became apparent that using
>> pid_task twice should actually be faster as it removes a dependent load
>> caused by thread_group_leader, and replaces it by accessing two adjacent
>> pointers in the same cache line.
>> 
>> I know the algorithmic improvement is the main advantage, but removing
>> 60ns or so for a dependent load can't hurt.
>> 
>> Plus I think using the two pid types really makes it clear that one
>> is always a process and the other is always potentially a thread.
>> 
>> /*
>>  * Optimization for waiting on PIDTYPE_PID. No need to iterate through child
>>  * and tracee lists to find the target task.
>>  */
>> static int do_wait_pid(struct wait_opts *wo)
>> {
>>  bool ptrace;
>>  struct task_struct *target;
>>  int retval;
>> 
>>  ptrace = false;
>>  target = pid_task(wo->wo_pid, PIDTYPE_TGID);
>>  if (target && is_effectively_child(wo, ptrace, target)) {
>>  retval = wait_consider_task(wo, ptrace, target);
>>  if (retval)
>>  return retval;
>>  }
>> 
>>  ptrace = true;
>>  target = pid_task(wo->wo_pid, PIDTYPE_PID);
>>  if (target && target->ptrace &&
>> is_effectively_child(wo, ptrace, target)) {
>>  retval = wait_consider_task(wo, ptrace, target);
>>  if (retval)
>>  return retval;
>>  }
>> 
>>  return 0;
>> }
>
> I'm fine with either way.
>
> Part of what made my earlier version with the double-lookup a bit
> awkward was only doing the second lookup if the first lookup failed. I'm
> happy to take your word though that making the second lookup conditional
> is unnecessary or even detrimental :).

Oh absolutely.  The two lookups are independent.

> It did cross my mind that it
> might not be a very consistent branch for a branch-predictor, but I also
> figured pid_task's synchronization might outweigh that.

pid_task has a lot of verbiage but it is only reading a pointer,
verifying the pointer is not NULL and calling container_of on the result
of the pointer read.
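Condensed, and ignoring the RCU and lockdep annotations, pid_task()
amounts to roughly:

	struct task_struct *pid_task(struct pid *pid, enum pid_type type)
	{
		struct hlist_node *first;

		if (!pid)
			return NULL;
		first = rcu_dereference(hlist_first_rcu(&pid->tasks[type]));
		if (!first)
			return NULL;
		return hlist_entry(first, struct task_struct, pid_links[type]);
	}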

Eric



Re: [PATCH v5] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

2021-03-12 Thread Eric W. Biederman
Jim Newsome  writes:

> do_wait is an internal function used to implement waitpid, waitid,
> wait4, etc. To handle the general case, it does an O(n) linear scan of
> the thread group's children and tracees.
>
> This patch adds a special-case when waiting on a pid to skip these scans
> and instead do an O(1) lookup. This improves performance when waiting on
> a pid from a thread group with many children and/or tracees.

I am going to kibitz just a little bit more.

When I looked at this a second time it became apparent that using
pid_task twice should actually be faster as it removes a dependent load
caused by thread_group_leader, and replaces it by accessing two adjacent
pointers in the same cache line.

I know the algorithmic improvement is the main advantage, but removing
60ns or so for a dependent load can't hurt.

Plus I think using the two pid types really makes it clear that one
is always a process and the other is always potentially a thread.

/*
 * Optimization for waiting on PIDTYPE_PID. No need to iterate through child
 * and tracee lists to find the target task.
 */
static int do_wait_pid(struct wait_opts *wo)
{
	bool ptrace;
	struct task_struct *target;
	int retval;

	ptrace = false;
	target = pid_task(wo->wo_pid, PIDTYPE_TGID);
	if (target && is_effectively_child(wo, ptrace, target)) {
		retval = wait_consider_task(wo, ptrace, target);
		if (retval)
			return retval;
	}

	ptrace = true;
	target = pid_task(wo->wo_pid, PIDTYPE_PID);
	if (target && target->ptrace &&
	    is_effectively_child(wo, ptrace, target)) {
		retval = wait_consider_task(wo, ptrace, target);
		if (retval)
			return retval;
	}

	return 0;
}

Since the patch probably needs to be respun to include the improved
description, can we look at my micro performance improvement?

Eric


Re: [PATCH V2] exit: trigger panic when global init has exited

2021-03-12 Thread Eric W. Biederman
Qianli Zhao  writes:

> From: Qianli Zhao 
>
> When init sub-threads running on different CPUs exit at the same time,
> zap_pid_ns_processes()->BUG() may happen.
> And every thread's status is abnormal after exit (PF_EXITING set,
> task->mm=NULL, etc.),
> which makes it difficult to parse a coredump from a fulldump normally.
> In order to fix the above problem, when any init thread has
> SIGNAL_GROUP_EXIT set,
> trigger panic immediately, and prevent other init threads from continuing
> to exit.
>
> [   24.705376] Kernel panic - not syncing: Attempted to kill init! 
> exitcode=0x7f00
> [   24.705382] CPU: 4 PID: 552 Comm: init Tainted: G S O
> 4.14.180-perf-g4483caa8ae80-dirty #1
> [   24.705390] kernel BUG at include/linux/pid_namespace.h:98!
>
> PID: 552   CPU: 4   COMMAND: "init"
> PID: 1 CPU: 7   COMMAND: "init"
> core4                              core7
> ...                                sys_exit_group()
>                                    do_group_exit()
>                                     - sig->flags = SIGNAL_GROUP_EXIT
>                                     - zap_other_threads()
>                                    do_exit() //PF_EXITING is set
> ret_to_user()
> do_notify_resume()
> get_signal()
>  - signal_group_exit
>  - goto fatal;
> do_group_exit()
> do_exit() //PF_EXITING is set
>  - panic("Attempted to kill init! exitcode=0x%08x\n")
>                                    exit_notify()
>                                     find_alive_thread() //no alive sub-threads
>                                     zap_pid_ns_processes() //CONFIG_PID_NS is not set
>                                     BUG()
>
> Signed-off-by: Qianli Zhao 

The changelog is much better thank you.

As Oleg pointer out we need to do something like the code below.

diff --git a/kernel/exit.c b/kernel/exit.c
index 04029e35e69a..bc676c06ef9a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -785,15 +785,16 @@ void __noreturn do_exit(long code)
 	sync_mm_rss(tsk->mm);
 	acct_update_integrals(tsk);
 	group_dead = atomic_dec_and_test(&tsk->signal->live);
+	/*
+	 * If the global init has exited, panic immediately to get a
+	 * useable coredump.
+	 */
+	if (unlikely(is_global_init(tsk) &&
+		     (group_dead || (tsk->signal->flags & SIGNAL_GROUP_EXIT)))) {
+		panic("Attempted to kill init! exitcode=0x%08x\n",
+		      tsk->signal->group_exit_code ?: (int)code);
+	}
 	if (group_dead) {
-		/*
-		 * If the last thread of global init has exited, panic
-		 * immediately to get a useable coredump.
-		 */
-		if (unlikely(is_global_init(tsk)))
-			panic("Attempted to kill init! exitcode=0x%08x\n",
-				tsk->signal->group_exit_code ?: (int)code);
-
 #ifdef CONFIG_POSIX_TIMERS
 		hrtimer_cancel(&tsk->signal->real_timer);
 		exit_itimers(tsk->signal);

There is still a race that could lead to the BUG in zap_pid_ns_processes.
We still have a case where the last two threads of a process call
pthread_exit (aka do_exit not do_group_exit in the kernel).

Thread A                                Thread B
do_exit()                               do_exit()

  exit_signals()
    tsk->flags |= PF_EXITING;
  group_dead = false;
                                        exit_signals()
                                          tsk->flags |= PF_EXITING;
  exit_notify()
    forget_original_parent
      find_child_reaper
        reaper = find_alive_thread()
        zap_pid_ns_processes()
          BUG()
                                        group_dead = true;
                                        if (is_global_init())
                                            panic("Attempted to kill init");

As we are guaranteed to see the panic with my change above I suggest
we augment it by simply removing the BUG in zap_pid_ns_processes.

Or maybe not if there is a better way to write the panic code.  I don't
think having pid namespaces compiled out is a particularly common case.
So whatever we can do to keep the code correct and reduce testing.

Eric


Re: [patch V2 0/3] signals: Allow caching one sigqueue object per task

2021-03-11 Thread Eric W. Biederman
Thomas Gleixner  writes:

> This is a follow up to the initial submission which can be found here:
>
>   https://lore.kernel.org/r/20210303142025.wbbt2nnr6dtgw...@linutronix.de
>
> Signal sending requires a kmem cache allocation at the sender side and the
> receiver hands it back to the kmem cache when consuming the signal.
>
> This works pretty well even for realtime workloads except for the case when
> the kmem cache allocation has to go into the slow path which is rare but
> happens.
>
> Preempt-RT carries a patch which allows caching of one sigqueue object per
> task. The object is not preallocated. It's cached when the task receives a
> signal. The cache is freed when the task exits.

I am probably skimming fast and missed your explanation but is there
a reason the caching is per task (aka thread) and not per signal_struct
(aka process)?

My sense is most signal delivery is per process.  Are realtime workloads
that extensively use pthread_sigqueue?  The ordinary sigqueue interface
only allows targeting a process.

Mostly I am just trying to get a sense of the workloads that are
improved by this.
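Just so I am sure I understand the shape of the thing, is it roughly
this, with a hypothetical one-slot sigqueue_cache pointer hung off
task_struct?  (A sketch of the idea as I read it, not the actual patch.)

	static struct sigqueue *sigqueue_cache_get(struct task_struct *t)
	{
		/* xchg so two concurrent allocators cannot both win the slot */
		return xchg(&t->sigqueue_cache, NULL);
	}

	static bool sigqueue_cache_put(struct task_struct *t, struct sigqueue *q)
	{
		/* keep at most one entry; on failure the caller frees to the slab */
		return cmpxchg(&t->sigqueue_cache, NULL, q) == NULL;
	}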

Eric


Re: [PATCH v3] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

2021-03-11 Thread Eric W. Biederman
Oleg Nesterov  writes:

> On 03/10, Eric W. Biederman wrote:
>>
>> Jim Newsome  writes:
>>
>> > +static int do_wait_pid(struct wait_opts *wo)
>> > +{
>> > +  struct task_struct *target = pid_task(wo->wo_pid, PIDTYPE_PID);
>>  ^
>> This is a subtle change in behavior.
>>
>> Today on the task->children list we only place thread group leaders.
>
> Aaah, yes, thanks Eric!
>
>> So the code either needs a thread_group_leader filter on target before
>> the ptrace=0 case or we need to use "pid_task(wo->wo_pid, PIDTYPE_TGID)"
>> and "pid_task(wo->wo_pid, PIDTYPE_PID)" for the "ptrace=1" case.
>
> Agreed,
>
>> I would like to make thread_group_leaders go away
>
> Hmm, why?

Mostly because we have a class of very nasty bugs to fix because code
thinks one thread is special.

There has been and I think still is code that mishandles zombie thread
group leaders.

Particularly nasty are zombie thread group leaders after userspace has
called setresuid in a way that changes signal permissions.

Eric


Re: [PATCH] signal: Allow RT tasks to cache one sigqueue struct

2021-03-11 Thread Eric W. Biederman
Thomas Gleixner  writes:

> On Wed, Mar 10 2021 at 15:57, Eric W. Biederman wrote:
>> Thomas Gleixner  writes:
>>> IMO, not bothering with an extra counter and rlimit plus the required
>>> atomic operations is just fine and having this for all tasks
>>> unconditionally looks like a clear win.
>>>
>>> I'll post an updated version of this soonish.
>>
>> That looks like a good analysis.
>>
>> I see that there is a sigqueue_cachep.  As I recall there are per cpu
>> caches and all kinds of other good stuff when using kmem_cache_alloc.
>>
>> Are those goodies falling down?
>>
>> I am just a little unclear on why a slab allocation is sufficiently
>> problematic that we want to avoid it.
>
> In the normal case it's not problematic at all. i.e. when the per cpu
> cache can directly fullfil the allocation in the fast path. Once that
> fails you're off into latency land...
>
> For the usual setup probably not an issue at all, but for real time
> processing it matters.
>
> Vs. the dedicated kmem cache for sigqueue. That's a red herring. By
> default kmem caches are shared/merged as I learned today and if you want
> dedicated ones you need to boot with 'slab_nomerge' on the command line.
>
> So without that option (which is of course not backwards compatible
> because the original behaviour was the other way around) your signal
> kmem cache might end up in a shared/merged kmem cache. Just do:
>
>   cat /proc/slabinfo | grep sig
>
> and the default will find:
>
> signal_cache    6440   6440   1152   28    8 : tunables    0    0    0 : slabdata    230    230      0
> sighand_cache   3952   4035   2112   15    8 : tunables    0    0    0 : slabdata    269    269      0
>
> But of course there is no way to figure out where your cache actually
> landed and then with with 'slab_nomerge' you'll get:
>
> sigqueue        3264   3264     80   51    1 : tunables    0    0    0 : slabdata     64     64      0
> signal_cache    6440   6440   1152   28    8 : tunables    0    0    0 : slabdata    230    230      0
> sighand_cache   3952   4035   2112   15    8 : tunables    0    0    0 : slabdata    269    269      0
>
> Don't worry about the 'active objects' field. That's just bonkers
> because SLUB has no proper accounting for active objects. That number is
> useless ...
>
> Not even CONFIG_SLUB_STATS=y will give you anything useful. I had to
> hack my own statistics into the signal code to gather these numbers
> !$@**!^?#!
>
> But why I'm not surprised? This stuff is optimized for High Frequency
> Trading which is useless by definition. Oh well...
>
> Rant aside, there is no massive benefit of doing that caching in
> general, but there is not much of a downside either and for particular
> use cases it's useful even outside of PREEMPT_RT.
>
> IMO, having it there unconditionally is better than yet another special
> cased hackery.

Sounds reasonable, and thank you for actually looking into it.  I think
a comment saying this gives a strong guarantee that as long as userspace
plays by the rules (aka max one outstanding signal per process)
userspace gets a low latency guarantee.

Eric


Re: [PATCH v3] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

2021-03-11 Thread Eric W. Biederman
Jim Newsome  writes:

> On 3/10/21 16:40, Eric W. Biederman wrote:
>>> +// Optimization for waiting on PIDTYPE_PID. No need to iterate through child
>>> +// and tracee lists to find the target task.
>>
>> Minor nit:  C++ style comments look very out of place in this file
>> which uses old school C /* */ comment delimiters for
>> all of it's block comments.
>
> Will do
>
>>> +static int do_wait_pid(struct wait_opts *wo)
>>> +{
>>> +   struct task_struct *target = pid_task(wo->wo_pid, PIDTYPE_PID);
>>  ^
>> This is a subtle change in behavior.
>> 
>> Today on the task->children list we only place thread group leaders.
>
> Shouldn't we allow waiting on clone children if __WALL or __WCLONE is set?
>
> This is already checked later in `eligible_child`, called from
> `wait_consider_task`, so I *think* the current form should already do
> the right thing. Now I'm confused though how the general path (through
> `do_wait_thread`) works if clone children aren't on the task->children
> list...?
>
> (In any case it seems this will need another version with at least an
> explanatory comment here)

What I am worried about are not clone children.  AKA ordinary children
that have a different exit signal but CLONE_THREAD children that are
never put on the children list so are naturally excluded from today's
do_wait (except in the case of ptrace). These are also known as threads.
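The fork-side invariant behind that, condensed from copy_process() in
kernel/fork.c:

	/* only group leaders are ever linked into the parent's ->children */
	if (thread_group_leader(p)) {
		...
		list_add_tail(&p->sibling, &p->real_parent->children);
		...
	}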

Maybe I am missing it but I don't see anything in wait_consider_task
or in the way that you are calling it that would exclude CLONE_THREAD
children for the non-ptrace case.

Eric


Re: [PATCH v3] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

2021-03-10 Thread Eric W. Biederman
Jim Newsome  writes:

> do_wait is an internal function used to implement waitpid, waitid,
> wait4, etc. To handle the general case, it does an O(n) linear scan of
> the thread group's children and tracees.
>
> This patch adds a special-case when waiting on a pid to skip these scans
> and instead do an O(1) lookup. This improves performance when waiting on
> a pid from a thread group with many children and/or tracees.
>
> Signed-off-by: James Newsome 
> ---
>  kernel/exit.c | 53 +--
>  1 file changed, 43 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 04029e35e69a..c2438d4ba262 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -1439,9 +1439,34 @@ void __wake_up_parent(struct task_struct *p, struct task_struct *parent)
>  TASK_INTERRUPTIBLE, p);
>  }
>  
> +// Optimization for waiting on PIDTYPE_PID. No need to iterate through child
> +// and tracee lists to find the target task.

Minor nit:  C++ style comments look very out of place in this file
which uses old school C /* */ comment delimiters for
all of it's block comments.
  
> +static int do_wait_pid(struct wait_opts *wo)
> +{
> + struct task_struct *target = pid_task(wo->wo_pid, PIDTYPE_PID);
 ^
This is a subtle change in behavior.

Today on the task->children list we only place thread group leaders.

Which means that your do_wait_pid wait for thread of someone else's
process and that is a change in behavior.

So the code either needs a thread_group_leader filter on target before
the ptrace=0 case or we need to use "pid_task(wo->wo_pid, PIDTYPE_TGID)"
and "pid_task(wo->wo_pid, PIDTYPE_PID)" for the "ptrace=1" case.

I would like to make thread_group_leaders go away so I would favor two
pid_task calls.  But either will work right now.
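For completeness, the filter variant would look something like this
(untested sketch against your v3):

	struct task_struct *target = pid_task(wo->wo_pid, PIDTYPE_PID);
	int retval;

	if (target && thread_group_leader(target) &&
	    (current == target->real_parent ||
	     (!(wo->wo_flags & __WNOTHREAD) &&
	      same_thread_group(current, target->real_parent)))) {
		retval = wait_consider_task(wo, /* ptrace= */ 0, target);
		if (retval)
			return retval;
	}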

Eric

 
> + int retval;
> +
> + if (!target)
> + return 0;
> + if (current == target->real_parent ||
> + (!(wo->wo_flags & __WNOTHREAD) &&
> +  same_thread_group(current, target->real_parent))) {
> + retval = wait_consider_task(wo, /* ptrace= */ 0, target);
> + if (retval)
> + return retval;
> + }
> + if (target->ptrace && (current == target->parent ||
> +(!(wo->wo_flags & __WNOTHREAD) &&
> +  same_thread_group(current, target->parent)))) {
> + retval = wait_consider_task(wo, /* ptrace= */ 1, target);
> + if (retval)
> + return retval;
> + }
> + return 0;
> +}
> +
>  static long do_wait(struct wait_opts *wo)
>  {
> - struct task_struct *tsk;
>   int retval;
>  
>   trace_sched_process_wait(wo->wo_pid);
> @@ -1463,19 +1488,27 @@ static long do_wait(struct wait_opts *wo)
>  
>   set_current_state(TASK_INTERRUPTIBLE);
>   read_lock(&tasklist_lock);
> - tsk = current;
> - do {
> - retval = do_wait_thread(wo, tsk);
> - if (retval)
> - goto end;
>  
> - retval = ptrace_do_wait(wo, tsk);
> + if (wo->wo_type == PIDTYPE_PID) {
> + retval = do_wait_pid(wo);
>   if (retval)
>   goto end;
> + } else {
> + struct task_struct *tsk = current;
>  
> - if (wo->wo_flags & __WNOTHREAD)
> - break;
> - } while_each_thread(current, tsk);
> + do {
> + retval = do_wait_thread(wo, tsk);
> + if (retval)
> + goto end;
> +
> + retval = ptrace_do_wait(wo, tsk);
> + if (retval)
> + goto end;
> +
> + if (wo->wo_flags & __WNOTHREAD)
> + break;
> + } while_each_thread(current, tsk);
> + }
>   read_unlock(&tasklist_lock);
>  
>  notask:


Re: [PATCH] exit: trigger panic when init process is set to SIGNAL_GROUP_EXIT

2021-03-10 Thread Eric W. Biederman
Oleg Nesterov  writes:

> On 03/10, Eric W. Biederman wrote:
>>
>>  /* If global init has exited,
>>  * panic immediately to get a useable coredump.
>>  */
>>  if (unlikely(is_global_init(tsk) &&
>>  (thread_group_empty(tsk) ||
>> (tsk->signal->flags & SIGNAL_GROUP_EXIT)))) {
>>  panic("Attempted to kill init!  exitcode=0x%08x\n",
>>  tsk->signal->group_exit_code ?: (int)code);
>>  }
>>
>> The thread_group_empty test is needed to handle single threaded
>> inits.
>
> But we can't rely on thread_group_empty(). Just suppose that the main
> thread exit first, then the 2nd (last) thread exits too.

It took me a minute.  I think you are pointing out that there is a case
where we do not set SIGNAL_GROUP_EXIT and that init actually exits.

The case where all of the threads do pthread_exit() aka do_exit().

I think that implies that to have a comprehensive test would
need to do:

group_dead = atomic_dec_and_test(&tsk->signal->live);

/* If global init has exited,
 * panic immediately to get a useable coredump.
*/
if (unlikely(is_global_init(tsk) &&
(group_dead || thread_group_empty(tsk) ||
 (tsk->signal->flags & SIGNAL_GROUP_EXIT)))) {
panic("Attempted to kill init!  exitcode=0x%08x\n",
tsk->signal->group_exit_code ?: (int)code);
}

Leaving the test where it is.  Yes.  I think that should work.


Eric



Re: [PATCH] signal: Allow RT tasks to cache one sigqueue struct

2021-03-10 Thread Eric W. Biederman
Thomas Gleixner  writes:

> On Thu, Mar 04 2021 at 21:58, Thomas Gleixner wrote:
>> On Thu, Mar 04 2021 at 13:04, Eric W. Biederman wrote:
>>> Thomas Gleixner  writes:
>>>>
>>>> We could of course do the caching unconditionally for all tasks.
>>>
>>> Is there any advantage to only doing this for realtime tasks?
>>
>> It was mostly to avoid tons of cached entries hanging around all over
>> the place. So I limited it to the case which the RT users deeply cared
>> about. Also related to the accounting question below.
>>
>>> If not it probably makes sense to do the caching for all tasks.
>>>
>>> I am wondering if we want to count the cached sigqueue structure to the
>>> users rt signal rlimit?
>>
>> That makes some sense, but that's a user visible change as a single
>> signal will up the count for a tasks lifetime while today it is removed
>> from accounting again once the signal is delivered. So that needs some
>> thought.
>
> Thought more about it. To make this accounting useful we'd need:
>
>   - a seperate user::sigqueue_cached counter
>   - a seperate RLIMIT_SIGQUEUE_CACHED
>
> Then you need to think about the defaults. Any signal heavy application
> will want this enabled and obviously automagically.
>
> Also there is an argument not to have this due to possible pointless
> memory consumption.
>
> But what are we talking about? 80 bytes worth of memory per task in the
> worst case. Which is compared to the rest of a task's memory consumption
> just noise.
>
> Looking at some statistics from a devel system there are less than 10
> items cached when the machine is fully idle after boot. During a kernel
> compile the cache utilization goes up to ~150 at max (make -j128 and 64
> CPUs). What's interesting is the allocation statistics after boot and
> full kernel compile:
>
>   from slab:23996
>   from task cache:52223
>
> A typical pattern there is:
>
> <...>-58490   [010] d..2  7765.664198: __sigqueue_alloc: 58488 from slab 8881132df460 10
> <...>-58488   [002] d..1  7765.664294: __sigqueue_free.part.35: cache 8881132df460 10
> <...>-58488   [002] d..2  7765.665146: __sigqueue_alloc: 1149 from cache 8881103dc550 10
>  bash-1149    [000] d..2  7765.665220: exit_task_sighand: free 8881132df460 8 9
>  bash-1149    [000] d..1  7765.665662: __sigqueue_free.part.35: cache 8881103dc550 9
>
> 58488 grabs the sigqueue from bash's task cache and bash sticks it back
> in. Lather, rinse and repeat. 
>
> IMO, not bothering with an extra counter and rlimit plus the required
> atomic operations is just fine and having this for all tasks
> unconditionally looks like a clear win.
>
> I'll post an updated version of this soonish.

That looks like a good analysis.

I see that there is a sigqueue_cachep.  As I recall there are per cpu
caches and all kinds of other good stuff when using kmem_cache_alloc.

Are those goodies falling down?

I am just a little unclear on why a slab allocation is sufficiently
problematic that we want to avoid it.

Eric


Re: [PATCH v2 1/1] fs: Allow no_new_privs tasks to call chroot(2)

2021-03-10 Thread Eric W. Biederman
Mickaël Salaün  writes:

> From: Mickaël Salaün 
>
> Being able to easily change root directories enable to ease some
> development workflow and can be used as a tool to strengthen
> unprivileged security sandboxes.  chroot(2) is not an access-control
> mechanism per se, but it can be used to limit the absolute view of the
> filesystem, and then limit ways to access data and kernel interfaces
> (e.g. /proc, /sys, /dev, etc.).
>
> Users may not wish to expose namespace complexity to potentially
> malicious processes, or limit their use because of limited resources.
> The chroot feature is much more simple (and limited) than the mount
> namespace, but can still be useful.  As for containers, users of
> chroot(2) should take care of file descriptors or data accessible by
> other means (e.g. current working directory, leaked FDs, passed FDs,
> devices, mount points, etc.).  There is a lot of literature that discuss
> the limitations of chroot, and users of this feature should be aware of
> the multiple ways to bypass it.  Using chroot(2) for security purposes
> can make sense if it is combined with other features (e.g. dedicated
> user, seccomp, LSM access-controls, etc.).
>
> One could argue that chroot(2) is useless without a properly populated
> root hierarchy (i.e. without /dev and /proc).  However, there are
> multiple use cases that don't require the chrooting process to create
> file hierarchies with special files nor mount points, e.g.:
> * A process sandboxing itself, once all its libraries are loaded, may
>   not need files other than regular files, or even no file at all.
> * Some pre-populated root hierarchies could be used to chroot into,
>   provided for instance by development environments or tailored
>   distributions.
> * Processes executed in a chroot may not require access to these special
>   files (e.g. with minimal runtimes, or by emulating some special files
>   with a LD_PRELOADed library or seccomp).
>
> Allowing a task to change its own root directory is not a threat to the
> system if we can prevent confused deputy attacks, which could be
> performed through execution of SUID-like binaries.  This can be
> prevented if the calling task sets PR_SET_NO_NEW_PRIVS on itself with
> prctl(2).  To only affect this task, its filesystem information must not
> be shared with other tasks, which can be achieved by not passing
> CLONE_FS to clone(2).  A similar no_new_privs check is already used by
> seccomp to avoid the same kind of security issues.  Furthermore, because
> of its security use and to avoid giving a new way for attackers to get
> out of a chroot (e.g. using /proc/<pid>/root), an unprivileged chroot is
> only allowed if the new root directory is the same or beneath the
> current one.  This still allows a process to use a subset of its
> legitimate filesystem to chroot into and then further reduce its view of
> the filesystem.
>
> This change may not impact systems relying on other permission models
> than POSIX capabilities (e.g. Tomoyo).  Being able to use chroot(2) on
> such systems may require to update their security policies.
>
> Only the chroot system call is relaxed with this no_new_privs check; the
> init_chroot() helper doesn't require such change.
>
> Allowing unprivileged users to use chroot(2) is one of the initial
> objectives of no_new_privs:
> https://www.kernel.org/doc/html/latest/userspace-api/no_new_privs.html
> This patch is a follow-up of a previous one sent by Andy Lutomirski, but
> with less limitations:
> https://lore.kernel.org/lkml/0e2f0f54e19bff53a3739ecfddb4ffa9a6dbde4d.1327858005.git.l...@amacapital.net/
>
> Cc: Al Viro 
> Cc: Andy Lutomirski 
> Cc: Christian Brauner 
> Cc: Christoph Hellwig 
> Cc: David Howells 
> Cc: Dominik Brodowski 
> Cc: Eric W. Biederman 
> Cc: James Morris 
> Cc: John Johansen 
> Cc: Kees Cook 
> Cc: Kentaro Takeda 
> Cc: Serge Hallyn 
> Cc: Tetsuo Handa 
> Signed-off-by: Mickaël Salaün 
> Link: https://lore.kernel.org/r/20210310181857.401675-2-...@digikod.net
> ---
>
> Changes since v1:
> * Replace custom is_path_beneath() with existing path_is_under().

Neither is_path_beneath nor path_is_under really help prevent escapes;
except for open files and files accessible from proc, chroot already
disallows going up.  The reason is that the path is resolved with the
current root before switching to it.

My brain was fuzzy.  I had the classic escape scenario confused.
It isn't chroot("../../..");

The actual classic chroot escape is:
chdir("/");
chroot("/somedir");
chdir("../../../..");

Your change would disallow changing the root directory, but it doesn't
much help as all files in the mount namespace are visible anyway.
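Spelled out as code, with the final re-chroot that completes the
escape.  A userspace sketch; it assumes the process keeps
CAP_SYS_CHROOT so the second chroot() is allowed:

	#include <unistd.h>

	int escape(void)
	{
		chdir("/");           /* leave the cwd outside the new root */
		chroot("/somedir");   /* the root moves, the cwd does not */
		chdir("../../../.."); /* ".." is only clamped at the root,
		                         and the cwd is not beneath it */
		return chroot(".");   /* re-anchor at the real root */
	}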

Furthermore changing chdir to

Re: [PATCH] exit: trigger panic when init process is set to SIGNAL_GROUP_EXIT

2021-03-10 Thread Eric W. Biederman
Oleg Nesterov  writes:

> On 03/10, Eric W. Biederman wrote:
>>
>>  /* If global init has exited,
>>  * panic immediately to get a useable coredump.
>>  */
>>  if (unlikely(is_global_init(tsk) &&
>>  (thread_group_empty(tsk) ||
>> (tsk->signal->flags & SIGNAL_GROUP_EXIT)))) {
>>  panic("Attempted to kill init!  exitcode=0x%08x\n",
>>  tsk->signal->group_exit_code ?: (int)code);
>>  }
>>
>> The thread_group_empty test is needed to handle single threaded
>> inits.
>
> But we can't rely on thread_group_empty(). Just suppose that the main
> thread exit first, then the 2nd (last) thread exits too.

My code above is designed so that every thread calls panic.
Only the first thread into panic actually writes the panic (That is in
panic itself).
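The relevant mechanism, roughly as it appears in kernel/panic.c: only
the first CPU through the cmpxchg gets to report, everyone else parks
itself.

	this_cpu = raw_smp_processor_id();
	old_cpu  = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu);

	if (old_cpu != PANIC_CPU_INVALID && old_cpu != this_cpu)
		panic_smp_self_stop();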

By testing for thread_group_empty() || SIGNAL_GROUP_EXIT
I am just trying to allow threads of init to exit.

Maybe thread_group_empty isn't the exact test we need to allow those.


Eric


Re: [RFC PATCH] mm: fork: Prevent a NULL deref by getting mm only if the refcount isn't 0

2021-03-10 Thread Eric W. Biederman
Filippo Sironi  writes:

> We've seen a number of crashes with the following signature:
>
> BUG: kernel NULL pointer dereference, address: 
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x) - not-present page
> ...
> Oops:  [#1] SMP PTI
> ...
> RIP: 0010:__rb_erase_color+0xc2/0x260
> ...
> Call Trace:
>  unlink_file_vma+0x36/0x50
>  free_pgtables+0x62/0x110
>  exit_mmap+0xd5/0x160
>  ? put_dec+0x3a/0x90
>  ? num_to_str+0xa8/0xc0
>  mmput+0x11/0xb0
>  do_task_stat+0x940/0xc80
>  proc_single_show+0x49/0x80
>  ? __check_object_size+0xcc/0x1a0
>  seq_read+0xd3/0x400
>  vfs_read+0x72/0xb0
>  ksys_read+0x9c/0xd0
>  do_syscall_64+0x69/0x400
>  ? schedule+0x2a/0x90
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> ...
>
> This happens when a process goes through the tasks stats in procfs while
> another is exiting.  This looks like a race where the process that's
> exiting drops the last reference on the mm (with mmput) while the other
> increases it (with mmget).  By only increasing when the reference isn't
> 0 to begin with, we prevent this from happening.

For this to be a race with exit this would require racing with exit_mm
where current->mm is cleared.

Looking at exit_mm() the code does:

	struct mm_struct *mm = current->mm;

	mmap_read_lock(mm);
	mmgrab(mm);
	task_lock(current);
	local_irq_disable();
	current->mm = NULL;
	local_irq_enable();
	task_unlock(current);
	mmap_read_unlock(mm);

	mmput(mm);

Which seems to guarantee "mm_users > 0" if "task->mm != NULL" under
tasklist_lock.
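If you do want the read side to tolerate a dying mm on its own, the
helper for that already exists, mmget_not_zero() in
include/linux/sched/mm.h:

	static inline bool mmget_not_zero(struct mm_struct *mm)
	{
		return atomic_inc_not_zero(&mm->mm_users);
	}

But given the guarantee above that check should never fire here.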

So I suggest you instrument your failing kernels and find what is
improperly decrementing mm_users.

Eric


Re: [PATCH v1 1/1] fs: Allow no_new_privs tasks to call chroot(2)

2021-03-10 Thread Eric W. Biederman
Mickaël Salaün  writes:

> From: Mickaël Salaün 
>
> Being able to easily change root directories enable to ease some
> development workflow and can be used as a tool to strengthen
> unprivileged security sandboxes.  chroot(2) is not an access-control
> mechanism per se, but it can be used to limit the absolute view of the
> filesystem, and then limit ways to access data and kernel interfaces
> (e.g. /proc, /sys, /dev, etc.).

Actually chroot does not so limit the view of things.  It only limits
the default view.

A process that is chrooted can always escape by something like
chroot("../../../../../../../../..").

So I don't see the point of allowing chroot once you are in your locked
down sandbox.

> Users may not wish to expose namespace complexity to potentially
> malicious processes, or limit their use because of limited resources.
> The chroot feature is much more simple (and limited) than the mount
> namespace, but can still be useful.  As for containers, users of
> chroot(2) should take care of file descriptors or data accessible by
> other means (e.g. current working directory, leaked FDs, passed FDs,
> devices, mount points, etc.).  There is a lot of literature that discuss
> the limitations of chroot, and users of this feature should be aware of
> the multiple ways to bypass it.  Using chroot(2) for security purposes
> can make sense if it is combined with other features (e.g. dedicated
> user, seccomp, LSM access-controls, etc.).
>
> One could argue that chroot(2) is useless without a properly populated
> root hierarchy (i.e. without /dev and /proc).  However, there are
> multiple use cases that don't require the chrooting process to create
> file hierarchies with special files nor mount points, e.g.:
> * A process sandboxing itself, once all its libraries are loaded, may
>   not need files other than regular files, or even no file at all.
> * Some pre-populated root hierarchies could be used to chroot into,
>   provided for instance by development environments or tailored
>   distributions.
> * Processes executed in a chroot may not require access to these special
>   files (e.g. with minimal runtimes, or by emulating some special files
>   with a LD_PRELOADed library or seccomp).
>
> Allowing a task to change its own root directory is not a threat to the
> system if we can prevent confused deputy attacks, which could be
> performed through execution of SUID-like binaries.  This can be
> prevented if the calling task sets PR_SET_NO_NEW_PRIVS on itself with
> prctl(2).  To only affect this task, its filesystem information must not
> be shared with other tasks, which can be achieved by not passing
> CLONE_FS to clone(2).  A similar no_new_privs check is already used by
> seccomp to avoid the same kind of security issues.  Furthermore, because
> of its security use and to avoid giving a new way for attackers to get
> out of a chroot (e.g. using /proc/<pid>/root), an unprivileged chroot is
> only allowed if the new root directory is the same or beneath the
> current one.  This still allows a process to use a subset of its
> legitimate filesystem to chroot into and then further reduce its view of
> the filesystem.
>
> This change may not impact systems relying on other permission models
> than POSIX capabilities (e.g. Tomoyo).  Being able to use chroot(2) on
> such systems may require to update their security policies.
>
> Only the chroot system call is relaxed with this no_new_privs check; the
> init_chroot() helper doesn't require such change.
>
> Allowing unprivileged users to use chroot(2) is one of the initial
> objectives of no_new_privs:
> https://www.kernel.org/doc/html/latest/userspace-api/no_new_privs.html
> This patch is a follow-up of a previous one sent by Andy Lutomirski, but
> with less limitations:
> https://lore.kernel.org/lkml/0e2f0f54e19bff53a3739ecfddb4ffa9a6dbde4d.1327858005.git.l...@amacapital.net/

Last time I remember talking architecture we agreed that user namespaces
would be used for enabling features and that no_new_privs would just be
used to lock-down userspace.  That way no_new_privs could be kept simple
and trivial to audit and understand.

You can build your sandbox and use chroot if you use a user namespace at
the start.  A mount namespace would also help lock things down.  Still
allowing chroot after the sanbox has been built, a seccomp filter has
been installed and no_new_privs has been enabled seems like it is asking
for trouble and may weaken existing sandboxes.
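Concretely, the ordering I mean is something like this sketch, with
the uid_map/gid_map writes and error handling elided:

	#define _GNU_SOURCE
	#include <sched.h>
	#include <unistd.h>

	static int build_sandbox(const char *newroot)
	{
		/* a new user namespace grants CAP_SYS_CHROOT over itself */
		if (unshare(CLONE_NEWUSER | CLONE_NEWNS))
			return -1;
		/* ... set up mounts, write uid_map/gid_map ... */
		if (chroot(newroot) || chdir("/"))
			return -1;
		/* only now install the seccomp filter and no_new_privs */
		return 0;
	}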

So I think we need a pretty compelling use case to consider allowing
chroot(2).  You haven't even mentioned what your usecase is at this
point so I don't know why we would tackle that complexity.

Eric



Re: [PATCH] exit: trigger panic when init process is set to SIGNAL_GROUP_EXIT

2021-03-10 Thread Eric W. Biederman
qianli zhao  writes:

> Hi,Oleg
>
> Thanks for your replay.
>
>> To be honest, I don't understand the changelog. It seems that you want
>> to uglify the kernel to simplify the debugging of buggy init? Or what?
>
> My patch is for the following purpose:
> 1. I hope to fix the occurrence of unexpected panic
> - [   24.705390] kernel BUG at include/linux/pid_namespace.h:98!
> This problem occurs when two init threads enter do_exit at the same time.
> One of the init threads issues the syscall sys_exit_group and sets SIGNAL_GROUP_EXIT.
> The other init thread performs ret_to_user()->get_signal(), finds
> SIGNAL_GROUP_EXIT set, and then calls do_group_exit()->do_exit().
> Since there are no alive init threads it finally goes to
> zap_pid_ns_processes() and BUG().
> Timing details are in the changelog.

At the start of your changelog and your patch subject you describe what
you are doing but not why.  For the next revision of the patch please
lead with the why; it makes what you are trying to do much easier to
understand.

> 2. I hope to fix the problem that coredump cannot be parsed after init
> crash. Before my patch, every init sub-thread will finish do_exit(), and
> the last thread will trigger panic().
> Except for the thread that triggered the panic, the state (such as
> PF_EXITING etc.) of the other threads already shows them exiting, so we
> can't parse a coredump from the fulldump.
> In fact, once any init thread has SIGNAL_GROUP_EXIT set, we can trigger
> the panic and prevent the other init threads from continuing to exit.
>
>> Nor can I understand the patch. I fail to understand the games with
>> SIGNAL_UNKILLABLE and ->siglock.
>
> When the first init thread panics, I don't want other init threads to
> keep exiting, as this would destroy their state.
> So I use SIGNAL_UNKILLABLE to mark this state and prevent other init
> threads from continuing to exit.
> In addition I use siglock to protect tsk->signal->flags.

It does not work to use SIGNAL_UNKILLABLE for this.  Normally init
has SIGNAL_UNKILLABLE set.  The only case that clears SIGNAL_UNKILLABLE
is force_sig_info_to_task.  If the init process exits with exit(2)
SIGNAL_UNKILLABLE will already be set.  Which means testing
SIGNAL_UNKILLABLE as your patch does will prevent the panic.

Further simply calling panic is sufficient to guarantee that the other
threads don't exit, and that whichever thread calls panic first
will be the reporting thread.  The rest of the threads will be caught
in panic_smp_self_stop(), if they happen to be running on other cpus.

So I would make the whole thing just be:

/* If global init has exited,
 * panic immediately to get a useable coredump.
 */
if (unlikely(is_global_init(tsk) &&
             (thread_group_empty(tsk) ||
              (tsk->signal->flags & SIGNAL_GROUP_EXIT)))) {
        panic("Attempted to kill init!  exitcode=0x%08x\n",
              tsk->signal->group_exit_code ?: (int)code);
}

The thread_group_empty test is needed to handle single threaded
inits.

Do you think you can respin your patch as something like that?

Eric

>
>> And iiuc with this patch the kernel will crash if init's sub-thread execs,
>> signal_group_exit() returns T in this case.
>
> Oleg Nesterov  于2021年3月10日周三 上午2:27写道:
>>
>> On 03/09, Qianli Zhao wrote:
>> >
>> > From: Qianli Zhao 
>> >
>> > Once any init thread finds SIGNAL_GROUP_EXIT, trigger panic immediately
>> > instead of last thread of global init has exited, and do not allow other
>> > init threads to exit, protect task/memory state of all sub-threads for
>> > get reliable init coredump
>>
>> To be honest, I don't understand the changelog. It seems that you want
>> to uglify the kernel to simplify the debugging of buggy init? Or what?
>>
>> Nor can I understand the patch. I fail to understand the games with
>> SIGNAL_UNKILLABLE and ->siglock.
>>
>> And iiuc with this patch the kernel will crash if init's sub-thread execs,
>> signal_group_exit() returns T in this case.
>>
>> Oleg.
>>
>> > [   24.705376] Kernel panic - not syncing: Attempted to kill init! 
>> > exitcode=0x7f00
>> > [   24.705382] CPU: 4 PID: 552 Comm: init Tainted: G S O
>> > 4.14.180-perf-g4483caa8ae80-dirty #1
>> > [   24.705390] kernel BUG at include/linux/pid_namespace.h:98!
>> >
>> > PID: 552   CPU: 4   COMMAND: "init"
>> > PID: 1 CPU: 7   COMMAND: "init"
>> > core4 core7
>> > ...   sys_exit_group()
>> >   do_group_exit()
>> >   - sig->flags = SIGNAL_GROUP_EXIT
>> >   - zap_other_threads()
>> >   do_exit()
>> >   - PF_EXITING is set
>> > ret_to_user()
>> > do_notify_resume()
>> > get_signal()
>> > - signal_group_exit
>> > - goto fatal;
>> > do_group_exit()
>> > do_exit()
>> > - PF_EXITING is set
>> > - panic("Attempted to kill init! exitcode=0x%08x\n")
>> >

Re: kernel panic: Attempted to kill init!

2021-03-09 Thread Eric W. Biederman
Al Viro  writes:

> On Tue, Mar 09, 2021 at 11:29:14AM +0530, Palash Oswal wrote:
>
>> I observe the following result(notice the segfault in systemd):
>> root@sandbox:~# ./repro
>> [9.457767] got to 221
>> [9.457791] got to 183
>> [9.459144] got to 201
>> [9.459471] got to 208
>> [9.459773] got to 210
>> [9.462602] got to 270
>> [9.488551] systemd[1]: segfault at 7ffe59fd7fb8 ip
>> 55be8f20b466 sp 7ffe59fd7fc0 error 6 in
>> systemd[55be8f15f000+ed000]
>> [9.490723] Code: 00 00 00 00 41 57 41 56 41 55 41 54 55 53 89 fd
>> 48 81 ec 48 01 00 00 64 48 8b 04 25 28 00 00 00 48 89 84 24 38 01 00
>> 00 31 c0  f5 bf f7 ff 83 f8 01 0f 84 b7 00 00 00 48 8d 9c 240
>> [9.492637] Kernel panic - not syncing: Attempted to kill init!
>> exitcode=0x000b
>
> Lovely.  So something in that sequence of syscalls manages to trigger
> segfault in unrelated process.  What happens if you put it to sleep
> right after open_by_handle_at() (e.g. by read(2) from fd 0, etc.)?

There is the creation of at least one file.  I wonder if inotify or
another notification mechanism is being triggered in systemd, and
systemd handling the notification badly and falling over.

Eric



Re: d28296d248: stress-ng.sigsegv.ops_per_sec -82.7% regression

2021-03-05 Thread Eric W. Biederman
Alexey Gladkov  writes:

> On Wed, Feb 24, 2021 at 12:50:21PM -0600, Eric W. Biederman wrote:
>> Alexey Gladkov  writes:
>> 
>> > On Wed, Feb 24, 2021 at 10:54:17AM -0600, Eric W. Biederman wrote:
>> >> kernel test robot  writes:
>> >> 
>> >> > Greeting,
>> >> >
>> >> > FYI, we noticed a -82.7% regression of stress-ng.sigsegv.ops_per_sec 
>> >> > due to commit:
>> >> >
>> >> >
>> >> > commit: d28296d2484fa11e94dff65e93eb25802a443d47 ("[PATCH v7 5/7] 
>> >> > Reimplement RLIMIT_SIGPENDING on top of ucounts")
>> >> > url: 
>> >> > https://github.com/0day-ci/linux/commits/Alexey-Gladkov/Count-rlimits-in-each-user-namespace/20210222-175836
>> >> > base: 
>> >> > https://git.kernel.org/cgit/linux/kernel/git/shuah/linux-kselftest.git 
>> >> > next
>> >> >
>> >> > in testcase: stress-ng
>> >> > on test machine: 48 threads Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz 
>> >> > with 112G memory
>> >> > with following parameters:
>> >> >
>> >> > nr_threads: 100%
>> >> > disk: 1HDD
>> >> > testtime: 60s
>> >> > class: interrupt
>> >> > test: sigsegv
>> >> > cpufreq_governor: performance
>> >> > ucode: 0x42e
>> >> >
>> >> >
>> >> > In addition to that, the commit also has significant impact on the
>> >> > following tests:
>> >> 
>> >> Thank you.  Now we have a sense of where we need to test the performance
>> >> of these changes carefully.
>> >
>> > One of the reasons for this is that I rolled back the patch that changed
>> > the ucounts.count type to atomic_t. Now get_ucounts() is forced to use a
>> > spin_lock to increase the reference count.
>> 
>> Which given the hiccups with getting a working version seems justified.
>> 
>> Now we can add incremental patches on top to improve the performance.
>
> I'm not sure that get_ucounts() should be used in __sigqueue_alloc() [1].
> I tried removing it and running KASAN tests that were failing before. So
> far, I have not found any problems.
>
> [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/legion/linux.git/tree/kernel/signal.c?h=patchset/per-userspace-rlimit/v7.1&id=2d4a2e2be7db42c95acb98abfc2a9b370ddd0604#n428

Hmm.  The code you posted still seems to include the get_ucounts.

I like the idea of not needing to increment and decrement the ucount
reference count every time a signal is sent, unfortunately there is a
problem.  The way we have implemented setresuid allows different threads
in a threaded application to have different cred->user values.

That is actually an extension of what posix supports and pthreads will
keep the creds of a process in sync.  Still I recall looking into this a
few years ago and there were a few applications that take advantage of
the linux behavior.
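
For anyone who has not run into this Linux extension: the raw
setresuid() syscall only changes the calling task's creds; it is glibc
that broadcasts the change to every thread of the process.  A
hypothetical sketch (run as root; error checking omitted, and note
that on 32-bit x86 the raw syscall number maps to the legacy 16-bit
variant):

	#define _GNU_SOURCE
	#include <pthread.h>
	#include <stdio.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static void *drop(void *arg)
	{
		/* affects this kernel task only, unlike glibc setresuid() */
		syscall(SYS_setresuid, 1000, 1000, 1000);
		printf("thread uid: %d\n", (int)geteuid());
		return NULL;
	}

	int main(void)
	{
		pthread_t t;

		pthread_create(&t, NULL, drop, NULL);
		pthread_join(t, NULL);
		printf("main uid:   %d\n", (int)geteuid()); /* still 0 */
		return 0;
	}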

In principle I think it is possible to hold a ucount reference in
somewhere such as task->signal.  In practice there are enough
complicating factors I don't immediately see how to implement that.

If the creds were stored in signal_struct instead of in task_struct
we could simply move the sigpending counts in set_user, when the uid
of a process changed.

With the current state I don't know how to pick which is the real user.

Eric


Re: [PATCH] signal: Allow RT tasks to cache one sigqueue struct

2021-03-04 Thread Eric W. Biederman
Thomas Gleixner  writes:

> On Thu, Mar 04 2021 at 09:11, Sebastian Andrzej Siewior wrote:
>> On 2021-03-03 16:09:05 [-0600], Eric W. Biederman wrote:
>>> Sebastian Andrzej Siewior  writes:
>>> 
>>> > From: Thomas Gleixner 
>>> >
>>> > Allow realtime tasks to cache one sigqueue in task struct. This avoids an
>>> > allocation which can increase the latency or fail.
>>> > Ideally the sigqueue is cached after first successful delivery and will be
>>> > available for next signal delivery. This works under the assumption that 
>>> > the RT
>>> > task never has an unprocessed signal while a new one is about to be queued.
>>> >
>>> > The caching is not used for SIGQUEUE_PREALLOC because this kind of 
>>> > sigqueue is
>>> > handled differently (and not used for regular signal delivery).
>>> 
>>> What part of this is about real time tasks?  This allows any task
>>> to cache a sigqueue entry.
>>
>> It is limited to realtime tasks (SCHED_FIFO/RR/DL):
>>
>> +static void __sigqueue_cache_or_free(struct sigqueue *q)
>> +{
>> …
>> +	if (!task_is_realtime(current) || !sigqueue_add_cache(current, q))
>> +		kmem_cache_free(sigqueue_cachep, q);
>> +}
>
> We could of course do the caching unconditionally for all tasks.

Is there any advantage to only doing this for realtime tasks?

If not it probably makes sense to do the caching for all tasks.

I am wondering if we want to count the cached sigqueue structure to the
users rt signal rlimit?

Eric


Re: [PATCH] signal: Allow RT tasks to cache one sigqueue struct

2021-03-04 Thread Eric W. Biederman
Sebastian Andrzej Siewior  writes:

> On 2021-03-03 16:09:05 [-0600], Eric W. Biederman wrote:
>> Sebastian Andrzej Siewior  writes:
>> 
>> > From: Thomas Gleixner 
>> >
>> > Allow realtime tasks to cache one sigqueue in task struct. This avoids an
>> > allocation which can increase the latency or fail.
>> > Ideally the sigqueue is cached after first successful delivery and will be
>> > available for next signal delivery. This works under the assumption that 
>> > the RT
>> > task never has an unprocessed signal while a new one is about to be queued.
>> >
>> > The caching is not used for SIGQUEUE_PREALLOC because this kind of 
>> > sigqueue is
>> > handled differently (and not used for regular signal delivery).
>> 
>> What part of this is about real time tasks?  This allows any task
>> to cache a sigqueue entry.
>
> It is limited to realtime tasks (SCHED_FIFO/RR/DL):
>
> +static void __sigqueue_cache_or_free(struct sigqueue *q)
> +{
> …
> +	if (!task_is_realtime(current) || !sigqueue_add_cache(current, q))
> +		kmem_cache_free(sigqueue_cachep, q);
> +}

I see now.  I was looking for it somewhere in the allocation side.
Oleg's suggestion of simply adding a few additional lines to
__sigqueue_free would have made this stand out more.

A __sigqueue_free that takes the relevant task_struct instead of always
assuming current would be nice here.
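
Something along these lines (an untested sketch built on the cache
helpers from the patch and on the current shape of __sigqueue_free();
the name __sigqueue_free_to is made up here):

	static void __sigqueue_free_to(struct task_struct *t, struct sigqueue *q)
	{
		if (q->flags & SIGQUEUE_PREALLOC)
			return;
		if (atomic_dec_and_test(&q->user->sigpending))
			free_uid(q->user);
		/* cache on behalf of an explicit task, not current */
		if (!task_is_realtime(t) || !sigqueue_add_cache(t, q))
			kmem_cache_free(sigqueue_cachep, q);
	}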


>> Either the patch is buggy or the description is.  Overall caching one
>> sigqueue entry doesn't look insane. But it would help to have a clear
>> description of what is going on.
>
> Does this clear things up or is my logic somehow broken here?

No I just missed the task_is_realtime limitation.

Eric



Re: [GIT PULL] idmapped mounts for v5.12

2021-03-03 Thread Eric W. Biederman
Christian Brauner  writes:

> Hi Linus,

> This series comes with an extensive xfstests suite covering both ext4 and xfs
> https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts
> It covers truncation, creation, opening, xattrs, vfscaps, setid execution,
> setgid inheritance and more both with idmapped and non-idmapped mounts.
> It already helped to discover an unrelated xfs setgid inheritance bug which 
> has
> since been fixed in mainline. It will be sent for inclusion with the xfstests
> project should you decide to merge this.

And yet chown is broken (details below), and in a very predictable way.

This is not considering that the entire concept is giving people a
loaded footgun, that is very difficult to use safely.


When the user namespace was implemented the two kinds of uids were very
carefully separated from each other by type, so it would take
deliberate action to mix them.  These changes introduce a third type
of uid and does not use the type system to keep them separate.  In just
a little bit of looking since I realized this problem I have found two
bugs in chown where the wrong values are compared.

We now have the following types of uids and gids:
- The userspace values.
- The kernel values that are used for comparisons.
  (The old fashioned kuid_t and kgid_t)
- The values used for interfacing with the filesystems
  underneath a mount.
  (The beneath mount kuid_t and kgid_t)
- The values stored in the filesystem.

The third type is new, and the code mixes old fashioned kuid_t and
kgid_t with the below mount kuid_t and kgid_t.

Starting with chown_common the code does:

int chown_common(const struct path *path, uid_t user, gid_t group)
{
...
uid = make_kuid(current_user_ns(), user);
gid = make_kgid(current_user_ns(), group);

mnt_userns = mnt_user_ns(path->mnt);
uid = kuid_from_mnt(mnt_userns, uid);
gid = kgid_from_mnt(mnt_userns, gid);

retry_deleg:
newattrs.ia_valid =  ATTR_CTIME;
if (user != (uid_t) -1) {
if (!uid_valid(uid))
return -EINVAL;
newattrs.ia_valid |= ATTR_UID;
newattrs.ia_uid = uid;
}
if (group != (gid_t) -1) {
if (!gid_valid(gid))
return -EINVAL;
newattrs.ia_valid |= ATTR_GID;
newattrs.ia_gid = gid;
}
if (!S_ISDIR(inode->i_mode))
newattrs.ia_valid |=
ATTR_KILL_SUID | ATTR_KILL_SGID | ATTR_KILL_PRIV;
inode_lock(inode);
error = security_path_chown(path, uid, gid);
if (!error)
error = notify_change(mnt_userns, path->dentry, &newattrs,
                      &delegated_inode);
inode_unlock(inode);
...
}

Here security_path_chown is expecting the old fashioned kuid_t and
kgid_t but looking at the top of the function we can see that
security_path_chown is getting the kuid_t and kgid_t from below the
mount.

The Tomoyo lsm cares.


Notice that ia_uid and ia_gid in struct newattrs are below mount values.


Now looking in notify_change:

int notify_change(struct user_namespace *mnt_userns, struct dentry *dentry,
  struct iattr *attr, struct inode **delegated_inode)
{
...
if (inode->i_op->setattr)
error = inode->i_op->setattr(mnt_userns, dentry, attr);
else
error = simple_setattr(mnt_userns, dentry, attr);
...
}


int simple_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
   struct iattr *iattr)
{
...
error = setattr_prepare(mnt_userns, dentry, iattr);
if (error)
return error;
...
}


int setattr_prepare(struct user_namespace *mnt_userns, struct dentry *dentry,
struct iattr *attr)
{
...
/* Make sure a caller can chown. */
if ((ia_valid & ATTR_UID) && !chown_ok(mnt_userns, inode, attr->ia_uid))
return -EPERM;

/* Make sure caller can chgrp. */
if ((ia_valid & ATTR_GID) && !chgrp_ok(mnt_userns, inode, attr->ia_gid))
return -EPERM;
...
}

static bool chown_ok(struct user_namespace *mnt_userns,
 const struct inode *inode,
 kuid_t uid)
{
kuid_t kuid = i_uid_into_mnt(mnt_userns, inode);
if (uid_eq(current_fsuid(), kuid) && uid_eq(uid, kuid))
                                            ^^
        return true;
...
}

The comparison of uid and kuid in chown_ok is nonsense.  As the kuid is
the old fashioned kuid.  While the uid is attr->ia_uid is the below
mount value.

I found these both within just a couple of minutes by creating
a type vfsuid_t and vfsgid_t and using it for the values stored in
struct inode and struct iattr.
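
The kind of type separation I mean mirrors how kuid_t is already
defined in include/linux/uidgid.h: a one-member struct, so mixing the
value spaces becomes a compile error instead of a runtime surprise.
A sketch:

	typedef struct {
		uid_t val;
	} vfsuid_t;

	typedef struct {
		gid_t val;
	} vfsgid_t;

	static inline bool vfsuid_eq(vfsuid_t left, vfsuid_t right)
	{
		return left.val == right.val;
	}

With ia_uid as a vfsuid_t, the uid_eq(uid, kuid) in chown_ok() above
simply fails to compile, which is how the two bugs stood out.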

There are probably more cases of inappropriate mixing.  I stopped as I

Re: [PATCH] signal: Allow RT tasks to cache one sigqueue struct

2021-03-03 Thread Eric W. Biederman
Sebastian Andrzej Siewior  writes:

> From: Thomas Gleixner 
>
> Allow realtime tasks to cache one sigqueue in task struct. This avoids an
> allocation which can increase the latency or fail.
> Ideally the sigqueue is cached after first successful delivery and will be
> available for next signal delivery. This works under the assumption that the 
> RT
> task never has an unprocessed signal while a new one is about to be queued.
>
> The caching is not used for SIGQUEUE_PREALLOC because this kind of sigqueue is
> handled differently (and not used for regular signal delivery).

What part of this is about real time tasks?  This allows any task
to cache a sigqueue entry.

Either the patch is buggy or the description is.  Overall caching one
sigqueue entry doesn't look insane. But it would help to have a clear
description of what is going on.

Eric


> [bigeasy: With a fix from Matt Fleming ]
> Signed-off-by: Thomas Gleixner 
> Signed-off-by: Sebastian Andrzej Siewior 
> ---
>  include/linux/sched.h  |  1 +
>  include/linux/signal.h |  1 +
>  kernel/exit.c  |  2 +-
>  kernel/fork.c  |  1 +
>  kernel/signal.c| 65 +++---
>  5 files changed, 65 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ef00bb22164cd..7009b25f48160 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -985,6 +985,7 @@ struct task_struct {
>   /* Signal handlers: */
>   struct signal_struct*signal;
>   struct sighand_struct __rcu *sighand;
> + struct sigqueue *sigqueue_cache;
>   sigset_tblocked;
>   sigset_treal_blocked;
>   /* Restored if set_restore_sigmask() was used: */
> diff --git a/include/linux/signal.h b/include/linux/signal.h
> index 205526c4003aa..d47a86790edc8 100644
> --- a/include/linux/signal.h
> +++ b/include/linux/signal.h
> @@ -265,6 +265,7 @@ static inline void init_sigpending(struct sigpending *sig)
>  }
>  
>  extern void flush_sigqueue(struct sigpending *queue);
> +extern void flush_task_sigqueue(struct task_struct *tsk);
>  
>  /* Test if 'sig' is valid signal. Use this instead of testing _NSIG directly 
> */
>  static inline int valid_signal(unsigned long sig)
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 04029e35e69af..346f7b76cecaa 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -152,7 +152,7 @@ static void __exit_signal(struct task_struct *tsk)
>* Do this under ->siglock, we can race with another thread
>* doing sigqueue_free() if we have SIGQUEUE_PREALLOC signals.
>*/
> -	flush_sigqueue(&tsk->pending);
> + flush_task_sigqueue(tsk);
>   tsk->sighand = NULL;
> 	spin_unlock(&tsk->siglock);
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index d66cd1014211b..a767e4e49a692 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1993,6 +1993,7 @@ static __latent_entropy struct task_struct 
> *copy_process(
> 	spin_lock_init(&p->alloc_lock);
>  
> 	init_sigpending(&p->pending);
> + p->sigqueue_cache = NULL;
>  
>   p->utime = p->stime = p->gtime = 0;
>  #ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
> diff --git a/kernel/signal.c b/kernel/signal.c
> index ba4d1ef39a9ea..d99273b798085 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -20,6 +20,7 @@
>  #include <linux/sched/task.h>
>  #include <linux/sched/task_stack.h>
>  #include <linux/sched/cputime.h>
> +#include <linux/sched/rt.h>
>  #include <linux/file.h>
>  #include <linux/fs.h>
>  #include <linux/proc_fs.h>
> @@ -404,13 +405,30 @@ void task_join_group_stop(struct task_struct *task)
>   task_set_jobctl_pending(task, mask | JOBCTL_STOP_PENDING);
>  }
>  
> +static struct sigqueue *sigqueue_from_cache(struct task_struct *t)
> +{
> + struct sigqueue *q = t->sigqueue_cache;
> +
> + if (q && cmpxchg(>sigqueue_cache, q, NULL) == q)
> + return q;
> + return NULL;
> +}
> +
> +static bool sigqueue_add_cache(struct task_struct *t, struct sigqueue *q)
> +{
> + if (!t->sigqueue_cache && cmpxchg(>sigqueue_cache, NULL, q) == NULL)
> + return true;
> + return false;
> +}
> +
>  /*
>   * allocate a new signal queue record
>   * - this may be called without locks if and only if t == current, otherwise 
> an
>   *   appropriate lock must be held to stop the target task from exiting
>   */
>  static struct sigqueue *
> -__sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int 
> override_rlimit)
> +__sigqueue_do_alloc(int sig, struct task_struct *t, gfp_t flags,
> + int override_rlimit, bool fromslab)
>  {
>   struct sigqueue *q = NULL;
>   struct user_struct *user;
> @@ -432,7 +450,10 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t 
> flags, int override_rlimi
>   rcu_read_unlock();
>  
>   if (override_rlimit || likely(sigpending <= task_rlimit(t, 
> RLIMIT_SIGPENDING))) {
> - q = kmem_cache_alloc(sigqueue_cachep, flags);
> + if (!fromslab)
> + q = 

Re: exec error: BUG: Bad rss-counter

2021-03-03 Thread Eric W. Biederman
Ilya Lipnitskiy  writes:

> On Wed, Mar 3, 2021 at 7:50 AM Eric W. Biederman  
> wrote:
>>
>> Ilya Lipnitskiy  writes:
>>
>> > On Tue, Mar 2, 2021 at 11:37 AM Eric W. Biederman  
>> > wrote:
>> >>
>> >> Ilya Lipnitskiy  writes:
>> >>
>> >> > On Mon, Mar 1, 2021 at 12:43 PM Eric W. Biederman 
>> >> >  wrote:
>> >> >>
>> >> >> Ilya Lipnitskiy  writes:
>> >> >>
>> >> >> > Eric, All,
>> >> >> >
>> >> >> > The following error appears when running Linux 5.10.18 on an embedded
>> >> >> > MIPS mt7621 target:
>> >> >> > [0.301219] BUG: Bad rss-counter state mm:(ptrval) 
>> >> >> > type:MM_ANONPAGES val:1
>> >> >> >
>> >> >> > Being a very generic error, I started digging and added a stack dump
>> >> >> > before the BUG:
>> >> >> > Call Trace:
>> >> >> > [<80008094>] show_stack+0x30/0x100
>> >> >> > [<8033b238>] dump_stack+0xac/0xe8
>> >> >> > [<800285e8>] __mmdrop+0x98/0x1d0
>> >> >> > [<801a6de8>] free_bprm+0x44/0x118
>> >> >> > [<801a86a8>] kernel_execve+0x160/0x1d8
>> >> >> > [<800420f4>] call_usermodehelper_exec_async+0x114/0x194
>> >> >> > [<80003198>] ret_from_kernel_thread+0x14/0x1c
>> >> >> >
>> >> >> > So that's how I got to looking at fs/exec.c and noticed quite a few
>> >> >> > changes last year. Turns out this message only occurs once very early
>> >> >> > at boot during the very first call to kernel_execve. current->mm is
>> >> >> > NULL at this stage, so acct_arg_size() is effectively a no-op.
>> >> >>
>> >> >> If you believe this is a new error you could bisect the kernel
>> >> >> to see which change introduced the behavior you are seeing.
>> >> >>
>> >> >> > More digging, and I traced the RSS counter increment to:
>> >> >> > [<8015adb4>] add_mm_counter_fast+0xb4/0xc0
>> >> >> > [<80160d58>] handle_mm_fault+0x6e4/0xea0
>> >> >> > [<80158aa4>] __get_user_pages.part.78+0x190/0x37c
>> >> >> > [<8015992c>] __get_user_pages_remote+0x128/0x360
>> >> >> > [<801a6d9c>] get_arg_page+0x34/0xa0
>> >> >> > [<801a7394>] copy_string_kernel+0x194/0x2a4
>> >> >> > [<801a880c>] kernel_execve+0x11c/0x298
>> >> >> > [<800420f4>] call_usermodehelper_exec_async+0x114/0x194
>> >> >> > [<80003198>] ret_from_kernel_thread+0x14/0x1c
>> >> >> >
>> >> >> > In fact, I also checked vma_pages(bprm->vma) and lo and behold it is 
>> >> >> > set to 1.
>> >> >> >
>> >> >> > How is fs/exec.c supposed to handle implied RSS increments that 
>> >> >> > happen
>> >> >> > due to page faults when discarding the bprm structure? In this case,
>> >> >> > the bug-generating kernel_execve call never succeeded, it returned 
>> >> >> > -2,
>> >> >> > but I didn't trace exactly what failed.
>> >> >>
>> >> >> Unless I am mistaken any left over pages should be purged by exit_mmap
>> >> >> which is called by mmput before mmput calls mmdrop.
>> >> > Good to know. Some more digging and I can say that we hit this error
>> >> > when trying to unmap PFN 0 (is_zero_pfn(pfn) returns TRUE,
>> >> > vm_normal_page returns NULL, zap_pte_range does not decrement
>> >> > MM_ANONPAGES RSS counter). Is my understanding correct that PFN 0 is
>> >> > usable, but special? Or am I totally off the mark here?
>> >>
>> >> It would be good to know if that is the page that get_user_pages_remote
>> >> returned to copy_string_kernel.  The zero page that is always zero,
>> >> should never be returned when a writable mapping is desired.
>> >
>> > Indeed, pfn 0 is returned from get_arg_page: (page is 0x809cf000,
>> > page_to_pfn(page) is 0) and it is the same page that is being freed and not
>> > refcounted in mmput/zap_pte_range. Confirmed with good old printk.

Re: exec error: BUG: Bad rss-counter

2021-03-03 Thread Eric W. Biederman
Ilya Lipnitskiy  writes:

> On Tue, Mar 2, 2021 at 11:37 AM Eric W. Biederman  
> wrote:
>>
>> Ilya Lipnitskiy  writes:
>>
>> > On Mon, Mar 1, 2021 at 12:43 PM Eric W. Biederman  
>> > wrote:
>> >>
>> >> Ilya Lipnitskiy  writes:
>> >>
>> >> > Eric, All,
>> >> >
>> >> > The following error appears when running Linux 5.10.18 on an embedded
>> >> > MIPS mt7621 target:
>> >> > [0.301219] BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES 
>> >> > val:1
>> >> >
>> >> > Being a very generic error, I started digging and added a stack dump
>> >> > before the BUG:
>> >> > Call Trace:
>> >> > [<80008094>] show_stack+0x30/0x100
>> >> > [<8033b238>] dump_stack+0xac/0xe8
>> >> > [<800285e8>] __mmdrop+0x98/0x1d0
>> >> > [<801a6de8>] free_bprm+0x44/0x118
>> >> > [<801a86a8>] kernel_execve+0x160/0x1d8
>> >> > [<800420f4>] call_usermodehelper_exec_async+0x114/0x194
>> >> > [<80003198>] ret_from_kernel_thread+0x14/0x1c
>> >> >
>> >> > So that's how I got to looking at fs/exec.c and noticed quite a few
>> >> > changes last year. Turns out this message only occurs once very early
>> >> > at boot during the very first call to kernel_execve. current->mm is
>> >> > NULL at this stage, so acct_arg_size() is effectively a no-op.
>> >>
>> >> If you believe this is a new error you could bisect the kernel
>> >> to see which change introduced the behavior you are seeing.
>> >>
>> >> > More digging, and I traced the RSS counter increment to:
>> >> > [<8015adb4>] add_mm_counter_fast+0xb4/0xc0
>> >> > [<80160d58>] handle_mm_fault+0x6e4/0xea0
>> >> > [<80158aa4>] __get_user_pages.part.78+0x190/0x37c
>> >> > [<8015992c>] __get_user_pages_remote+0x128/0x360
>> >> > [<801a6d9c>] get_arg_page+0x34/0xa0
>> >> > [<801a7394>] copy_string_kernel+0x194/0x2a4
>> >> > [<801a880c>] kernel_execve+0x11c/0x298
>> >> > [<800420f4>] call_usermodehelper_exec_async+0x114/0x194
>> >> > [<80003198>] ret_from_kernel_thread+0x14/0x1c
>> >> >
>> >> > In fact, I also checked vma_pages(bprm->vma) and lo and behold it is 
>> >> > set to 1.
>> >> >
>> >> > How is fs/exec.c supposed to handle implied RSS increments that happen
>> >> > due to page faults when discarding the bprm structure? In this case,
>> >> > the bug-generating kernel_execve call never succeeded, it returned -2,
>> >> > but I didn't trace exactly what failed.
>> >>
>> >> Unless I am mistaken any left over pages should be purged by exit_mmap
>> >> which is called by mmput before mmput calls mmdrop.
>> > Good to know. Some more digging and I can say that we hit this error
>> > when trying to unmap PFN 0 (is_zero_pfn(pfn) returns TRUE,
>> > vm_normal_page returns NULL, zap_pte_range does not decrement
>> > MM_ANONPAGES RSS counter). Is my understanding correct that PFN 0 is
>> > usable, but special? Or am I totally off the mark here?
>>
>> It would be good to know if that is the page that get_user_pages_remote
>> returned to copy_string_kernel.  The zero page that is always zero,
>> should never be returned when a writable mapping is desired.
>
> Indeed, pfn 0 is returned from get_arg_page: (page is 0x809cf000,
> page_to_pfn(page) is 0) and it is the same page that is being freed and not
> refcounted in mmput/zap_pte_range. Confirmed with good old printk. Also,
> ZERO_PAGE(0)==0x809fc000 -> PFN 5120.
>
> I think I have found the problem though, after much digging and thanks to all
> the information provided. init_zero_pfn() gets called too late (after
> the call to
> is_zero_pfn(0) from mmput returns true), until then zero_pfn == 0, and after,
> zero_pfn == 5120. Boom.
>
> So PFN 0 is special, but only for a little bit, enough for something
> on my system
> to call kernel_execve :)
>
> Question: is my system not supposed to be calling kernel_execve this
> early or does
> init_zero_pfn() need to happen earlier? init_zero_pfn is currently a
> core_initcall.

Looking quickly it seems that init_zero_pfn() is in mm/memory.c and is
common for both mips and x86.  Further it appears init_zero_pfn() has
been that was since 2009 a13ea5b75964 ("mm: reinstate ZERO_PAGE").

Given the testing that x86 gets and that nothing like this has been
reported it looks like whatever driver is triggering the kernel_execve
is doing something wrong. 

Because honestly.  If the zero page isn't working there is not a chance
that anything in userspace is working so it is clearly much too early.

I suspect there is some driver that is initialized very early that is
doing something that looks innocuous (like triggering a hotplug event)
and that happens to cause a call_usermodehelper() which then calls
kernel_execve.
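
For illustration, the suspected shape of such a path (a hypothetical
driver; call_usermodehelper() is what ends up in
call_usermodehelper_exec_async() -> kernel_execve()):

	static int early_driver_init(void)
	{
		char *argv[] = { "/sbin/hotplug", NULL };
		char *envp[] = { "HOME=/", NULL };

		/* At core_initcall time userspace does not exist yet; this
		 * can also run before init_zero_pfn() in the same initcall
		 * level, and the exec fails with -ENOENT as in the report. */
		return call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC);
	}
	core_initcall(early_driver_init);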

Eric


Re: exec error: BUG: Bad rss-counter

2021-03-02 Thread Eric W. Biederman
Ilya Lipnitskiy  writes:

> On Mon, Mar 1, 2021 at 12:43 PM Eric W. Biederman  
> wrote:
>>
>> Ilya Lipnitskiy  writes:
>>
>> > Eric, All,
>> >
>> > The following error appears when running Linux 5.10.18 on an embedded
>> > MIPS mt7621 target:
>> > [0.301219] BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES 
>> > val:1
>> >
>> > Being a very generic error, I started digging and added a stack dump
>> > before the BUG:
>> > Call Trace:
>> > [<80008094>] show_stack+0x30/0x100
>> > [<8033b238>] dump_stack+0xac/0xe8
>> > [<800285e8>] __mmdrop+0x98/0x1d0
>> > [<801a6de8>] free_bprm+0x44/0x118
>> > [<801a86a8>] kernel_execve+0x160/0x1d8
>> > [<800420f4>] call_usermodehelper_exec_async+0x114/0x194
>> > [<80003198>] ret_from_kernel_thread+0x14/0x1c
>> >
>> > So that's how I got to looking at fs/exec.c and noticed quite a few
>> > changes last year. Turns out this message only occurs once very early
>> > at boot during the very first call to kernel_execve. current->mm is
>> > NULL at this stage, so acct_arg_size() is effectively a no-op.
>>
>> If you believe this is a new error you could bisect the kernel
>> to see which change introduced the behavior you are seeing.
>>
>> > More digging, and I traced the RSS counter increment to:
>> > [<8015adb4>] add_mm_counter_fast+0xb4/0xc0
>> > [<80160d58>] handle_mm_fault+0x6e4/0xea0
>> > [<80158aa4>] __get_user_pages.part.78+0x190/0x37c
>> > [<8015992c>] __get_user_pages_remote+0x128/0x360
>> > [<801a6d9c>] get_arg_page+0x34/0xa0
>> > [<801a7394>] copy_string_kernel+0x194/0x2a4
>> > [<801a880c>] kernel_execve+0x11c/0x298
>> > [<800420f4>] call_usermodehelper_exec_async+0x114/0x194
>> > [<80003198>] ret_from_kernel_thread+0x14/0x1c
>> >
>> > In fact, I also checked vma_pages(bprm->vma) and lo and behold it is set 
>> > to 1.
>> >
>> > How is fs/exec.c supposed to handle implied RSS increments that happen
>> > due to page faults when discarding the bprm structure? In this case,
>> > the bug-generating kernel_execve call never succeeded, it returned -2,
>> > but I didn't trace exactly what failed.
>>
>> Unless I am mistaken any left over pages should be purged by exit_mmap
>> which is called by mmput before mmput calls mmdrop.
> Good to know. Some more digging and I can say that we hit this error
> when trying to unmap PFN 0 (is_zero_pfn(pfn) returns TRUE,
> vm_normal_page returns NULL, zap_pte_range does not decrement
> MM_ANONPAGES RSS counter). Is my understanding correct that PFN 0 is
> usable, but special? Or am I totally off the mark here?

It would be good to know if that is the page that get_user_pages_remote
returned to copy_string_kernel.  The zero page that is always zero,
should never be returned when a writable mapping is desired.

> Here is the (optimized) stack trace when the counter does not get decremented:
> [<8015b078>] vm_normal_page+0x114/0x1a8
> [<8015dc98>] unmap_page_range+0x388/0xacc
> [<8015e5a0>] unmap_vmas+0x6c/0x98
> [<80166194>] exit_mmap+0xd8/0x1ac
> [<800290c0>] mmput+0x58/0xf8
> [<801a6f8c>] free_bprm+0x2c/0xc4
> [<801a8890>] kernel_execve+0x160/0x1d8
> [<800420e0>] call_usermodehelper_exec_async+0x114/0x194
> [<80003198>] ret_from_kernel_thread+0x14/0x1c
>
>>
>> AKA it looks very very fishy this happens and this does not look like
>> an execve error.
> I think you are right, I'm probably wrong to bother you. However,
> since the thread is already started, let me add linux-mm here :)

It happens during exec.  I don't mind looking and pointing you a useful
direction.

>>
>> On the other hand it would be good to know why kernel_execve is failing.
>> Then the error handling paths could be scrutinized, and we can check to
>> see if everything that should happen on an error path does.
> I can check on this, but likely it's the init system not doing things
> quite in the right order on my platform, or something similar. The
> error is ENOENT from do_open_execat().

That does narrow things down considerably.
After the error all we do is:
Clear in_execve and fs->in_exec.
Return from bprm_execve
Call free_bprm
Which does:
	if (bprm->mm) {
		acct_arg_size(bprm, 0);
		mmput(bprm->mm);
	}

So it really needs to be the mmput that cleans things up.

I would really verify the correspondence between what get_arg_page
returns and what gets freed in mmput if it is not too difficult.
I think it should just be a page or two.

Eric


Re: exec error: BUG: Bad rss-counter

2021-03-01 Thread Eric W. Biederman
Ilya Lipnitskiy  writes:

> Eric, All,
>
> The following error appears when running Linux 5.10.18 on an embedded
> MIPS mt7621 target:
> [0.301219] BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:1
>
> Being a very generic error, I started digging and added a stack dump
> before the BUG:
> Call Trace:
> [<80008094>] show_stack+0x30/0x100
> [<8033b238>] dump_stack+0xac/0xe8
> [<800285e8>] __mmdrop+0x98/0x1d0
> [<801a6de8>] free_bprm+0x44/0x118
> [<801a86a8>] kernel_execve+0x160/0x1d8
> [<800420f4>] call_usermodehelper_exec_async+0x114/0x194
> [<80003198>] ret_from_kernel_thread+0x14/0x1c
>
> So that's how I got to looking at fs/exec.c and noticed quite a few
> changes last year. Turns out this message only occurs once very early
> at boot during the very first call to kernel_execve. current->mm is
> NULL at this stage, so acct_arg_size() is effectively a no-op.

If you believe this is a new error you could bisect the kernel
to see which change introduced the behavior you are seeing.

> More digging, and I traced the RSS counter increment to:
> [<8015adb4>] add_mm_counter_fast+0xb4/0xc0
> [<80160d58>] handle_mm_fault+0x6e4/0xea0
> [<80158aa4>] __get_user_pages.part.78+0x190/0x37c
> [<8015992c>] __get_user_pages_remote+0x128/0x360
> [<801a6d9c>] get_arg_page+0x34/0xa0
> [<801a7394>] copy_string_kernel+0x194/0x2a4
> [<801a880c>] kernel_execve+0x11c/0x298
> [<800420f4>] call_usermodehelper_exec_async+0x114/0x194
> [<80003198>] ret_from_kernel_thread+0x14/0x1c
>
> In fact, I also checked vma_pages(bprm->vma) and lo and behold it is set to 1.
>
> How is fs/exec.c supposed to handle implied RSS increments that happen
> due to page faults when discarding the bprm structure? In this case,
> the bug-generating kernel_execve call never succeeded, it returned -2,
> but I didn't trace exactly what failed.

Unless I am mistaken any left over pages should be purged by exit_mmap
which is called by mmput before mmput calls mmdrop.

AKA it looks very very fishy this happens and this does not look like
an execve error.

On the other hand it would be good to know why kernel_execve is failing.
Then the error handling paths could be scrutinized, and we can check to
see if everything that should happen on an error path does.

> Interestingly, this "BUG:" message is timing-dependent. If I wait a
> bit before calling free_bprm after bprm_execve the message seems to go
> away (there are 3 other cores running and calling into kernel_execve
> at the same time, so there is that). The error also only ever happens
> once (probably because no more page faults happen?).
>
> I don't know enough to propose a proper fix here. Is it decrementing
> the bprm->mm RSS counter to account for that page fault? Or is
> current->mm being NULL a bigger problem?

This is call_usermode_helper calls kernel_execve from a kernel thread
forked by kthreadd.  Which means current->mm == NULL is expected, and
current->active_mm == &init_mm.

Similarly, bprm->mm having an incremented RSS counter appears correct.

The question is why doesn't that count get consistently cleaned up.

> Apologies in advance, but I have looked hard and do not see a clear
> resolution for this even in the latest kernel code.

I may be blind but I see two possibilities.

1) There is a memory stomp that happens early on and bad timing causes
   the memory stomp to result in an elevated rss count.

2) There is a buggy error handling path, and whatever failure you are
running into that early in boot walks through that buggy failure
path.

I don't think this is a widespread issue or yours would not be the first
report like this I have seen.

The two productive paths I can see for tracing down your problem are:
1) git bisect (assuming you have a known good version)
2) Figuring out what exec failed.

I really think exec_mmap should have cleaned up anything in the mm.  So
the fact that it doesn't worries me.

Eric


Re: [PATCH v14 01/11] x86: kdump: replace the hard-coded alignment with macro CRASH_ALIGN

2021-02-26 Thread Eric W. Biederman
chenzhou  writes:

> On 2021/2/25 15:25, Baoquan He wrote:
>> On 02/24/21 at 02:19pm, Catalin Marinas wrote:
>>> On Sat, Jan 30, 2021 at 03:10:15PM +0800, Chen Zhou wrote:
 Move CRASH_ALIGN to header asm/kexec.h for later use. Besides, the
 alignment of crash kernel regions in x86 is 16M(CRASH_ALIGN), but
 function reserve_crashkernel() also used 1M alignment. So just
 replace hard-coded alignment 1M with macro CRASH_ALIGN.
>>> [...]
 @@ -510,7 +507,7 @@ static void __init reserve_crashkernel(void)
} else {
unsigned long long start;
  
 -  start = memblock_phys_alloc_range(crash_size, SZ_1M, crash_base,
 +  start = memblock_phys_alloc_range(crash_size, CRASH_ALIGN, 
 crash_base,
  crash_base + crash_size);
if (start != crash_base) {
pr_info("crashkernel reservation failed - memory is in 
 use.\n");
>>> There is a small functional change here for x86. Prior to this patch,
>>> crash_base passed by the user on the command line is allowed to be 1MB
>>> aligned. With this patch, such reservation will fail.
>>>
>>> Is the current behaviour a bug in the current x86 code or it does allow
>>> 1MB-aligned reservations?
>> Hmm, you are right. Here we should keep 1MB alignment as is because
>> users specify the address and size, their intention should be respected.
>> The 1MB alignment for fixed memory region reservation was introduced in
>> below commit, but it doesn't tell what is Eric's request at that time, I
>> guess it meant respecting users' specifying.


> I think we could make the alignment unified. Why are the alignments for
> system-reserved and user-specified regions different? Besides, there is
> no documentation about the 1MB alignment.
> How about documenting the alignment size (16MB) for user-specified
> start addresses, as arm64 does?

Looking at what the code is doing.  Attempting to reserve a crash region
at the location the user specified.  Adding unnecessary alignment
constraints is totally broken. 

I am not even certain enforcing a 1MB alignment makes sense.  I suspect
it was added so that we don't accidentally reserve low memory on x86.
Frankly I am not even certain that makes sense.

Now in practice there might be an argument for 2MB alignment that goes
with huge page sizes on x86.  But until someone finds that there are
actual problems with 1MB alignment I would not touch it.

The proper response to something that isn't documented and confusing is
not to arbitrarily change it and risk breaking users.  Especially in
this case where it is clear that adding additional alignment is total
nonsense.  The proper response to something that isn't clear and
documented is to dig in and document it, or to leave it alone and let it
be the next person's problem.

In this case there is no reason for changing this bit of code.
All CRASH_ALIGN is about is a default alignment when none is specified.
It is not a functional requirement but just something so that things
come out nicely.
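
To spell out the two cases (a loose paraphrase of x86's
reserve_crashkernel(); error handling is elided, and crash_addr_max
here just stands in for the real upper-bound selection):

	if (!crash_base) {
		/* "crashkernel=size": the kernel picks the address, and
		 * CRASH_ALIGN (16M) is merely the default placement alignment */
		crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
						       CRASH_ALIGN, crash_addr_max);
	} else {
		/* "crashkernel=size@offset": the user chose the address, so
		 * only the historical 1M alignment is enforced */
		start = memblock_phys_alloc_range(crash_size, SZ_1M, crash_base,
						  crash_base + crash_size);
	}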


Eric


Re: d28296d248: stress-ng.sigsegv.ops_per_sec -82.7% regression

2021-02-24 Thread Eric W. Biederman
Alexey Gladkov  writes:

> On Wed, Feb 24, 2021 at 10:54:17AM -0600, Eric W. Biederman wrote:
>> kernel test robot  writes:
>> 
>> > Greeting,
>> >
>> > FYI, we noticed a -82.7% regression of stress-ng.sigsegv.ops_per_sec due 
>> > to commit:
>> >
>> >
>> > commit: d28296d2484fa11e94dff65e93eb25802a443d47 ("[PATCH v7 5/7] 
>> > Reimplement RLIMIT_SIGPENDING on top of ucounts")
>> > url: 
>> > https://github.com/0day-ci/linux/commits/Alexey-Gladkov/Count-rlimits-in-each-user-namespace/20210222-175836
>> > base: 
>> > https://git.kernel.org/cgit/linux/kernel/git/shuah/linux-kselftest.git next
>> >
>> > in testcase: stress-ng
>> > on test machine: 48 threads Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz with 
>> > 112G memory
>> > with following parameters:
>> >
>> >nr_threads: 100%
>> >disk: 1HDD
>> >testtime: 60s
>> >class: interrupt
>> >test: sigsegv
>> >cpufreq_governor: performance
>> >ucode: 0x42e
>> >
>> >
>> > In addition to that, the commit also has significant impact on the
>> > following tests:
>> 
>> Thank you.  Now we have a sense of where we need to test the performance
>> of these changes carefully.
>
> One of the reasons for this is that I rolled back the patch that changed
> the ucounts.count type to atomic_t. Now get_ucounts() is forced to use a
> spin_lock to increase the reference count.

Which given the hiccups with getting a working version seems justified.

Now we can add incremental patches on top to improve the performance.


Eric



Re: d28296d248: stress-ng.sigsegv.ops_per_sec -82.7% regression

2021-02-24 Thread Eric W. Biederman
kernel test robot  writes:

> Greeting,
>
> FYI, we noticed a -82.7% regression of stress-ng.sigsegv.ops_per_sec due to 
> commit:
>
>
> commit: d28296d2484fa11e94dff65e93eb25802a443d47 ("[PATCH v7 5/7] Reimplement 
> RLIMIT_SIGPENDING on top of ucounts")
> url: 
> https://github.com/0day-ci/linux/commits/Alexey-Gladkov/Count-rlimits-in-each-user-namespace/20210222-175836
> base: https://git.kernel.org/cgit/linux/kernel/git/shuah/linux-kselftest.git 
> next
>
> in testcase: stress-ng
> on test machine: 48 threads Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz with 
> 112G memory
> with following parameters:
>
>   nr_threads: 100%
>   disk: 1HDD
>   testtime: 60s
>   class: interrupt
>   test: sigsegv
>   cpufreq_governor: performance
>   ucode: 0x42e
>
>
> In addition to that, the commit also has significant impact on the
> following tests:

Thank you.  Now we have a sense of where we need to test the performance
of these changes carefully.

Eric


> +--+---+
> | testcase: change | stress-ng: stress-ng.sigq.ops_per_sec -56.1% regression |
> | test machine | 48 threads Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz with 112G memory |
> | test parameters | class=interrupt |
> | | cpufreq_governor=performance |
> | | disk=1HDD |
> | | nr_threads=100% |
> | | test=sigq |
> | | testtime=60s |
> | | ucode=0x42e |
> +--+---+
>
>
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot 
>
>
> Details are as below:
> -->
>
>
> To reproduce:
>
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> bin/lkp install job.yaml  # job file is attached in this email
> bin/lkp split-job --compatible job.yaml
> bin/lkp run compatible-job.yaml
>
> =
> class/compiler/cpufreq_governor/disk/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime/ucode:
>   
> interrupt/gcc-9/performance/1HDD/x86_64-rhel-8.3/100%/debian-10.4-x86_64-20200603.cgz/lkp-ivb-2ep1/sigsegv/stress-ng/60s/0x42e
>
> commit: 
>   4660d663b4 ("Reimplement RLIMIT_MSGQUEUE on top of ucounts")
>   d28296d248 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
>
> 4660d663b4207ce6 d28296d2484fa11e94dff65e93e 
>  --- 
>fail:runs  %reproductionfail:runs
>| | |
>  14:6 -217%   1:6 
> perf-profile.children.cycles-pp.error_entry
>  12:6 -179%   1:6 
> perf-profile.self.cycles-pp.error_entry
>  %stddev %change %stddev
>  \  |\  
>  4.766e+08   -82.7%   82358308stress-ng.sigsegv.ops
>7942807   -82.7%1372639stress-ng.sigsegv.ops_per_sec
>  29408   -77.5%   6621 ± 61%  
> stress-ng.time.file_system_inputs
>   1566   +69.4%   2653stress-ng.time.system_time
>   1274   -84.8% 193.22 ±  8%  stress-ng.time.user_time
>  12458 ±  5% +37.2%  17097 ±  5%  numa-meminfo.node1.Active(anon)
>  51.41   +66.5%  85.59iostat.cpu.system
>  41.17   -84.2%   6.50 ±  7%  iostat.cpu.user
>   3040 ±  4% +37.9%   4193 ±  4%  numa-vmstat.node1.nr_active_anon
>   3040 ±  4% +37.9%   4193 ±  4%  
> numa-vmstat.node1.nr_zone_active_anon
>  50.83   +67.2%  85.00vmstat.cpu.sy
>  40.50   -85.6%   5.83 ± 11%  vmstat.cpu.us
> 225.33   -77.7%  50.33 ± 62%  vmstat.io.bi
>   7.00  -100.0%   0.00vmstat.memory.buff
>  20735 ±  2% -14.1%  17812 ±  5%  meminfo.Active
>  13506 ±  3% +31.9%  17812 ±  5%  meminfo.Active(anon)
>   7228  -100.0%   0.00meminfo.Active(file)
>  29308   +18.4%  34687 ±  2%  meminfo.Shmem
> 202067-9.5% 182899meminfo.VmallocUsed
>   0.01 ± 17%  -0.00.00 ± 10%  mpstat.cpu.all.iowait%
>   1.04-0.10.92

Re: [PATCH v6 0/7] Count rlimits in each user namespace

2021-02-22 Thread Eric W. Biederman
Linus Torvalds  writes:

> On Mon, Feb 15, 2021 at 4:42 AM Alexey Gladkov  
> wrote:
>>
>> These patches are for binding the rlimit counters to a user in user 
>> namespace.
>
> So this is now version 6, but I think the kernel test robot keeps
> complaining about them causing KASAN issues.
>
> The complaints seem to change, so I'm hoping they get fixed, but it
> does seem like every version there's a new one. Hmm?

I have been keeping an eye on this as well, and yes the issues are
getting fixed.

My current plan is to aim at getting v7 rebased onto -rc1 into a branch.
Review the changes very closely.  Get some performance testing and some
other testing against it.  Then to get this code into linux-next.

If everything goes smoothly I will send you a pull request next merge
window.  I have no intention of shipping this (or sending you a pull
request) before it is ready.

Eric


Re: [PATCH v2] kexec: move machine_kexec_post_load() to public interface

2021-02-22 Thread Eric W. Biederman
Will Deacon  writes:

> On Fri, 19 Feb 2021 14:51:42 -0500, Pavel Tatashin wrote:
>> machine_kexec_post_load() is called after kexec load is finished. It must
>> declared in public header not in kexec_internal.h
>> 
>> Fixes the following compiler warning:
>> 
>> arch/arm64/kernel/machine_kexec.c:62:5: warning: no previous prototype for
>> function 'machine_kexec_post_load' [-Wmissing-prototypes]
>>int machine_kexec_post_load(struct kimage *kimage)
>
> Applied to arm64 (for-next/fixes), thanks!
>
> [1/1] kexec: move machine_kexec_post_load() to public interface
>   https://git.kernel.org/arm64/c/2596b6ae412b

As you have already applied this it seems a bit late,
but I will mention that change could legitimately have the following
tag.

Fixes: de68e4daea90 ("kexec: add machine_kexec_post_load()")

So far arm64 is the only implementation of that hook.

Eric


Re: [PATCH v2] sparc: make copy_thread honor pid namespaces

2021-02-22 Thread Eric W. Biederman
"Dmitry V. Levin"  writes:

> On sparc, fork and clone syscalls have an unusual semantics of
> returning the pid of the parent process to the child process.
>
> Apparently, the implementation did not honor pid namespaces at all,
> so the child used to get the pid of its parent in the init namespace.
>
> Fortunately, most users of these syscalls are not affected by this bug
> because they use another register to distinguish the parent process
> from its child, and the pid of the parent process is often discarded.
>
> Reproducer:
>
>   $ gcc -Wall -O2 -xc - <<'EOF'
>   #define _GNU_SOURCE
>   #include <err.h>
>   #include <errno.h>
>   #include <sched.h>
>   #include <stdio.h>
>   #include <stdlib.h>
>   #include <sys/syscall.h>
>   #include <sys/wait.h>
>   #include <unistd.h>
>   #include 
>   static void test_fork(void)
>   {
>   int pid = syscall(__NR_fork);
>   if (pid < 0)
>   err(1, "fork");
>   fprintf(stderr, "current: %d, parent: %d, fork returned: %d\n",
>   getpid(), getppid(), pid);
>   int status;
>   if (wait(&status) < 0) {
>   if (errno == ECHILD)
>   _exit(0);
>   err(1, "wait");
>   }
>   if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
>   errx(1, "wait: %#x", status);
>   }
>   int main(void)
>   {
>   test_fork();
>   if (unshare(CLONE_NEWPID | CLONE_NEWUSER) < 0)
>   err(1, "unshare");
>   test_fork();
>   return 0;
>   }
>   EOF
>   $ sh -c ./a.out
>   current: 10001, parent: 1, fork returned: 10002
>   current: 10002, parent: 10001, fork returned: 10001
>   current: 10001, parent: 1, fork returned: 10003
>   current: 1, parent: 0, fork returned: 10001
>
> This bug was found by strace test suite.
>
> Cc: Eric W. Biederman 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Dmitry V. Levin 
> ---
>
> v2: Replaced task_active_pid_ns(p) with current->nsproxy->pid_ns_for_children
> as suggested by Eric.

I can't test this either.  But from just reading through the code earlier.
Acked-by: "Eric W. Biederman" 

>  arch/sparc/kernel/process_32.c | 3 ++-
>  arch/sparc/kernel/process_64.c | 3 ++-
>  2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/sparc/kernel/process_32.c b/arch/sparc/kernel/process_32.c
> index a02363735915..3be653e40204 100644
> --- a/arch/sparc/kernel/process_32.c
> +++ b/arch/sparc/kernel/process_32.c
> @@ -368,7 +368,8 @@ int copy_thread(unsigned long clone_flags, unsigned long 
> sp, unsigned long arg,
>  #endif
>  
>   /* Set the return value for the child. */
> - childregs->u_regs[UREG_I0] = current->pid;
> + childregs->u_regs[UREG_I0] =
> + task_pid_nr_ns(current, current->nsproxy->pid_ns_for_children);
>   childregs->u_regs[UREG_I1] = 1;
>  
>   /* Set the return value for the parent. */
> diff --git a/arch/sparc/kernel/process_64.c b/arch/sparc/kernel/process_64.c
> index 6f8c7822fc06..f53ef5cff46a 100644
> --- a/arch/sparc/kernel/process_64.c
> +++ b/arch/sparc/kernel/process_64.c
> @@ -629,7 +629,8 @@ int copy_thread(unsigned long clone_flags, unsigned long 
> sp, unsigned long arg,
>   t->utraps[0]++;
>  
>   /* Set the return value for the child. */
> - t->kregs->u_regs[UREG_I0] = current->pid;
> + t->kregs->u_regs[UREG_I0] =
> + task_pid_nr_ns(current, current->nsproxy->pid_ns_for_children);
>   t->kregs->u_regs[UREG_I1] = 1;
>  
>   /* Set the second return value for the parent. */


Re: [RESEND PATCH v4 0/3] proc: Relax check of mount visibility

2021-02-22 Thread Eric W. Biederman
Alexey Gladkov  writes:

> If only the dynamic part of procfs is mounted (subset=pid), then there is no
> need to check if procfs is fully visible to the user in the new user
> namespace.


A couple of things.

1) Allowing the mount should come in the last patch.  So we don't have a
bisect hazard.

2) We should document that we still require a mount of proc to match on
atime and readonly mount attributes.

3) If we can find a way to safely not require a previous mount of proc
this will be much more valuable.

Eric



Re: [PATCH] sparc: make copy_thread honor pid namespaces

2021-02-18 Thread Eric W. Biederman
"Dmitry V. Levin"  writes:

> On sparc, fork and clone syscalls have an unusual semantics of
> returning the pid of the parent process to the child process.
>
> Apparently, the implementation did not honor pid namespaces at all,
> so the child used to get the pid of its parent in the init namespace.
>
> This bug was found by strace test suite.
>
> Reproducer:
>
>   $ gcc -Wall -O2 -xc - <<'EOF'
>   #define _GNU_SOURCE
>   #include <err.h>
>   #include <errno.h>
>   #include <sched.h>
>   #include <stdio.h>
>   #include <stdlib.h>
>   #include <sys/syscall.h>
>   #include <sys/wait.h>
>   #include <unistd.h>
>   #include 
>   int main(void)
>   {
> if (unshare(CLONE_NEWPID | CLONE_NEWUSER) < 0)
>   err(1, "unshare");
> int pid = syscall(__NR_fork);
> if (pid < 0)
>   err(1, "fork");
> fprintf(stderr, "current: %d, parent: %d, fork returned: %d\n",
> getpid(), getppid(), pid);
> int status;
> if (wait(&status) < 0) {
>   if (errno == ECHILD)
> _exit(0);
>   err(1, "wait");
> }
> return !WIFEXITED(status) || WEXITSTATUS(status) != 0;
>   }
>   EOF
>   $ sh -c ./a.out
>   current: 10001, parent: 1, fork returned: 10002
>   current: 1, parent: 0, fork returned: 10001

From glibc's sysdeps/unix/sparc/fork.S:
> SYSCALL__ (fork, 0)
>   /* %o1 is now 0 for the parent and 1 for the child.  Decrement it to
>  make it -1 (all bits set) for the parent, and 0 (no bits set)
>  for the child.  Then AND it with %o0, so the parent gets
>  %o0&-1==pid, and the child gets %o0&0==0.  */
>   sub %o1, 1, %o1
>   retl
>   and %o0, %o1, %o0
> libc_hidden_def (__fork)
> 
> weak_alias (__fork, fork)

This needs to be shared to make the test case make sense.

The change below looks reasonable.  Unfortunately because copy_thread
comes before init_task_pid, task_active_pid_ns(p) will not work
correctly.

The code below needs to use current->nsproxy->pid_ns_for_children
instead of task_active_pid_ns(p).

For sparc people.  Do we know of anyone who actually uses the parent pid
returned from fork to the child process?  If not the code can simply
return 0 and we don't have to do this.

Eric

> Cc: Eric W. Biederman 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Dmitry V. Levin 
> ---
>
> Although the fix seems to be obvious, I have no means to test it myself,
> so any help with the testing is much appreciated.
>
>  arch/sparc/kernel/process_32.c | 2 +-
>  arch/sparc/kernel/process_64.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/sparc/kernel/process_32.c b/arch/sparc/kernel/process_32.c
> index a02363735915..7a89969befa8 100644
> --- a/arch/sparc/kernel/process_32.c
> +++ b/arch/sparc/kernel/process_32.c
> @@ -368,7 +368,7 @@ int copy_thread(unsigned long clone_flags, unsigned long 
> sp, unsigned long arg,
>  #endif
>  
>   /* Set the return value for the child. */
> - childregs->u_regs[UREG_I0] = current->pid;
> + childregs->u_regs[UREG_I0] = task_pid_nr_ns(current, 
> task_active_pid_ns(p));
>   childregs->u_regs[UREG_I1] = 1;
>  
>   /* Set the return value for the parent. */
> diff --git a/arch/sparc/kernel/process_64.c b/arch/sparc/kernel/process_64.c
> index 6f8c7822fc06..ec97217ab970 100644
> --- a/arch/sparc/kernel/process_64.c
> +++ b/arch/sparc/kernel/process_64.c
> @@ -629,7 +629,7 @@ int copy_thread(unsigned long clone_flags, unsigned long 
> sp, unsigned long arg,
>   t->utraps[0]++;
>  
>   /* Set the return value for the child. */
> - t->kregs->u_regs[UREG_I0] = current->pid;
> + t->kregs->u_regs[UREG_I0] = task_pid_nr_ns(current, 
> task_active_pid_ns(p));
>   t->kregs->u_regs[UREG_I1] = 1;
>  
>   /* Set the second return value for the parent. */


Re: [PATCH] proc: Convert S_ permission uses to octal

2021-02-12 Thread Eric W. Biederman
Matthew Wilcox  writes:

> On Fri, Feb 12, 2021 at 04:01:48PM -0600, Eric W. Biederman wrote:
>> Joe Perches  writes:
>> 
>> > Convert S_ permissions to the more readable octal.
>> >
>> > Done using:
>> > $ ./scripts/checkpatch.pl -f --fix-inplace --types=SYMBOLIC_PERMS 
>> > fs/proc/*.[ch]
>> >
>> > No difference in generated .o files allyesconfig x86-64
>> >
>> > Link:
>> > https://lore.kernel.org/lkml/ca+55afw5v23t-zvdzp-mmd_eyxf8wbafwwb59934fv7g21u...@mail.gmail.com/
>> 
>> 
>> I will be frank.  I don't know what 0644 means.  I can never remember
>> which bit is read, write or execute.  So I like symbolic constants.
>
> Heh, I'm the other way, I can't remember what S_IRUGO means.
>
> but I think there's another way which improves the information
> density:
>
> #define DIR_RO_ALL(NAME, iops, fops)  DIR(NAME, 0555, iops, fops)
> ...
> (or S_IRUGO or whatever expands to 0555)
>
> There's really only a few combinations --
>   root read-only,
>   everybody read-only
>   root-write, others-read
>   everybody-write
>
> and execute is only used by proc for directories, not files, so I think
> there's only 8 combinations we'd need (and everybody-write is almost
> unused ...)

I guess it depends on which part of proc.  For fs/proc/base.c and it's
per process relatives something like that seems reasonable.

I don't know about fs/proc/generic.c where everyone from all over the
kernel registers new proc entries.

>> Perhaps we can do something like:
>> 
>> #define S_IRWX 7
>> #define S_IRW_ 6
>> #define S_IR_X 5
>> #define S_IR__ 4
>> #define S_I_WX 3
>> #define S_I_W_ 2
>> #define S_I__X 1
>> #define S_I___ 0
>> 
>> #define MODE(TYPE, USER, GROUP, OTHER) \
>>  (((S_IF##TYPE) << 9) | \
>>  ((S_I##USER)  << 6) | \
>>  ((S_I##GROUP) << 3) | \
>>  (S_I##OTHER))
>> 
>> Which would be used something like:
>> MODE(DIR, RWX, R_X, R_X)
>> MODE(REG, RWX, R__, R__)
>> 
>> Something like that should be able to address the readability while
>> still using symbolic constants.
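
A quick userspace sanity check of the idea (a sketch only; the S_IF*
type bits are OR'd in directly, since <sys/stat.h> already defines
them in their final position, and the nibble helpers are renamed P_*
here to avoid colliding with the real S_I* macros):

	#include <assert.h>
	#include <sys/stat.h>

	#define P_RWX 7
	#define P_RW_ 6
	#define P_R_X 5
	#define P_R__ 4

	#define MODE(TYPE, USER, GROUP, OTHER) \
		((S_IF##TYPE) | ((P_##USER) << 6) | ((P_##GROUP) << 3) | (P_##OTHER))

	int main(void)
	{
		assert(MODE(DIR, RWX, R_X, R_X) == (S_IFDIR | 0755)); /* drwxr-xr-x */
		assert(MODE(REG, RW_, R__, R__) == (S_IFREG | 0644)); /* -rw-r--r-- */
		return 0;
	}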
>
> I think that's been proposed before.

I don't think it has ever been shot down.  Just no one cared enough to
implement it.

Come to think of it, that has the nice property that if we cared we
could make it type safe as well.  Something we can't do with the octal
for obvious reasons.

Eric



Re: [PATCH] proc: Convert S_ permission uses to octal

2021-02-12 Thread Eric W. Biederman
Joe Perches  writes:

> On Fri, 2021-02-12 at 16:01 -0600, Eric W. Biederman wrote:
>> Joe Perches  writes:
>> 
>> > Convert S_ permissions to the more readable octal.
>> > 
>> > Done using:
>> > $ ./scripts/checkpatch.pl -f --fix-inplace --types=SYMBOLIC_PERMS 
>> > fs/proc/*.[ch]
>> > 
>> > No difference in generated .o files allyesconfig x86-64
>> > 
>> > Link:
>> > https://lore.kernel.org/lkml/ca+55afw5v23t-zvdzp-mmd_eyxf8wbafwwb59934fv7g21u...@mail.gmail.com/
>> 
>> 
>> I will be frank.  I don't know what 0644 means.  I can never remember
>> which bit is read, write or execute.  So I like symbolic constants.
>> 
>> I don't see a compelling reason to change the existing code.
>
> Did you read Linus' message in the Link: above?
>
> It was a reply to what Ingo Molnar suggested here:
>
> https://lore.kernel.org/lkml/20160803081140.ga7...@gmail.com/

Only if you read in reverse chronological order.

Ingo's message was in reply to Linus and it received somewhat favorable
replies and was not shot down.

I certainly do not see sufficient consensus to go around changing code
other people maintain.

My suggestion has the nice property that it handles all 512 different
combinations.  I think that was the only real downside of Ingo's
suggestion.  There are just too many different combinations to define
a set of macros to cover all of the cases.

Eric


Re: [PATCH] proc: Convert S_ permission uses to octal

2021-02-12 Thread Eric W. Biederman
Joe Perches  writes:

> Convert S_ permissions to the more readable octal.
>
> Done using:
> $ ./scripts/checkpatch.pl -f --fix-inplace --types=SYMBOLIC_PERMS 
> fs/proc/*.[ch]
>
> No difference in generated .o files allyesconfig x86-64
>
> Link:
> https://lore.kernel.org/lkml/ca+55afw5v23t-zvdzp-mmd_eyxf8wbafwwb59934fv7g21u...@mail.gmail.com/


I will be frank.  I don't know what 0644 means.  I can never remember
which bit is read, write or execute.  So I like symbolic constants.

I don't see a compelling reason to change the existing code.

There are definitely information density and clarity issues with
the symbolic constants.  So I can understand a desire to do something
better.

If we could somehow get something closer to ls output where it
lists things as rwxr-xr-x  I think that would be much better.

Perhaps we can do something like:

#define S_IRWX 7
#define S_IRW_ 6
#define S_IR_X 5
#define S_IR__ 4
#define S_I_WX 3
#define S_I_W_ 2
#define S_I__X 1
#define S_I___ 0

#define MODE(TYPE, USER, GROUP, OTHER) \
	((S_IF##TYPE) | \
 ((S_I##USER)  << 6) | \
 ((S_I##GROUP) << 3) | \
 (S_I##OTHER))

Which would be used something like:
MODE(DIR, RWX, R_X, R_X)
MODE(REG, RWX, R__, R__)

Something like that should be able to address the readability while
still using symbolic constants.
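
To make that concrete, here is a standalone userspace sketch (my
illustration, not proposed kernel code; it assumes the standard
S_IFDIR/S_IFREG definitions from <sys/stat.h>, which already carry the
file type bits in their final position, so the type is OR'd in rather
than shifted):

#include <stdio.h>
#include <sys/stat.h>

#define S_IRWX 7
#define S_IRW_ 6
#define S_IR_X 5
#define S_IR__ 4
#define S_I_WX 3
#define S_I_W_ 2
#define S_I__X 1
#define S_I___ 0

#define MODE(TYPE, USER, GROUP, OTHER) \
	((S_IF##TYPE) | \
	 ((S_I##USER)  << 6) | \
	 ((S_I##GROUP) << 3) | \
	 (S_I##OTHER))

int main(void)
{
	/* Prints 40755 and 100644, the familiar octal values. */
	printf("%o\n", MODE(DIR, RWX, R_X, R_X));
	printf("%o\n", MODE(REG, RW_, R__, R__));
	return 0;
}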

Eric



> Signed-off-by: Joe Perches 
> ---
>  fs/proc/base.c| 216 
> +-
>  fs/proc/fd.c  |   6 +-
>  fs/proc/generic.c |   8 +-
>  fs/proc/kcore.c   |   2 +-
>  fs/proc/kmsg.c|   2 +-
>  fs/proc/namespaces.c  |   2 +-
>  fs/proc/nommu.c   |   2 +-
>  fs/proc/page.c|   6 +-
>  fs/proc/proc_sysctl.c |   8 +-
>  fs/proc/proc_tty.c|   2 +-
>  fs/proc/root.c|   2 +-
>  fs/proc/self.c|   2 +-
>  fs/proc/thread_self.c |   2 +-
>  fs/proc/vmcore.c  |   2 +-
>  14 files changed, 131 insertions(+), 131 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 3851bfcdba56..49cf949fb295 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -135,7 +135,7 @@ struct pid_entry {
>  #define DIR(NAME, MODE, iops, fops)  \
>   NOD(NAME, (S_IFDIR|(MODE)), &iops, &fops, {} )
>  #define LNK(NAME, get_link)  \
> - NOD(NAME, (S_IFLNK|S_IRWXUGO),  \
> + NOD(NAME, (S_IFLNK | 0777), \
>   &proc_pid_link_inode_operations, NULL,  \
>   { .proc_get_link = get_link } )
>  #define REG(NAME, MODE, fops)\
> @@ -1840,7 +1840,7 @@ void task_dump_owner(struct task_struct *task, umode_t 
> mode,
>* made this apply to all per process world readable and executable
>* directories.
>*/
> - if (mode != (S_IFDIR|S_IRUGO|S_IXUGO)) {
> + if (mode != (S_IFDIR | 0555)) {
>   struct mm_struct *mm;
>   task_lock(task);
>   mm = task->mm;
> @@ -2235,8 +2235,8 @@ proc_map_files_instantiate(struct dentry *dentry,
>   struct inode *inode;
>  
>   inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK |
> - ((mode & FMODE_READ ) ? S_IRUSR : 0) |
> - ((mode & FMODE_WRITE) ? S_IWUSR : 0));
> + ((mode & FMODE_READ) ? 0400 : 0) |
> + ((mode & FMODE_WRITE) ? 0200 : 0));
>   if (!inode)
>   return ERR_PTR(-ENOENT);
>  
> @@ -3156,114 +3156,114 @@ static const struct file_operations 
> proc_task_operations;
>  static const struct inode_operations proc_task_inode_operations;
>  
>  static const struct pid_entry tgid_base_stuff[] = {
> - DIR("task",   S_IRUGO|S_IXUGO, proc_task_inode_operations, 
> proc_task_operations),
> - DIR("fd", S_IRUSR|S_IXUSR, proc_fd_inode_operations, 
> proc_fd_operations),
> - DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, 
> proc_map_files_operations),
> - DIR("fdinfo", S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, 
> proc_fdinfo_operations),
> - DIR("ns", S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, 
> proc_ns_dir_operations),
> + DIR("task",   0555, proc_task_inode_operations, 
> proc_task_operations),
> + DIR("fd", 0500, proc_fd_inode_operations, proc_fd_operations),
> + DIR("map_files",  0500, proc_map_files_inode_operations, 
> proc_map_files_operations),
> + DIR("fdinfo", 0500, proc_fdinfo_inode_operations, 
> proc_fdinfo_operations),
> + DIR("ns", 0511, proc_ns_dir_inode_operations, 
> proc_ns_dir_operations),
>  #ifdef CONFIG_NET
> - DIR("net",S_IRUGO|S_IXUGO, proc_net_inode_operations, 
> proc_net_operations),
> + DIR("net",0555, proc_net_inode_operations, proc_net_operations),
>  #endif
> - REG("environ",S_IRUSR, proc_environ_operations),
> - REG("auxv",   S_IRUSR, 

Re: [PATCH v11 0/6] arm64: MMU enabled kexec relocation

2021-02-04 Thread Eric W. Biederman
Pavel Tatashin  writes:

>> > I understand that having an extra set of page tables could potentially
>> > waste memory, especially if VAs are sparse, but in this case we use
>> > page tables exclusively for contiguous VA space (copy [src, src +
>> > size]). Therefore, the extra memory usage is tiny. The ratio for
>> > kernels with  4K page_size is (size of relocated memory) / 512.  A
>> > normal initrd + kernel is usually under 64M, an extra space which
>> > means ~128K for the page table. Even with a huge relocation, where
>> > initrd is ~512M the extra memory usage in the worst case is just ~1M.
>> > I really doubt we will have any problem from users because of such
>> > small overhead in comparison to the total kexec-load size.
>
> Hi Eric,
>
>>
>> Foolish question.
>
> Thank you for your e-mail, you gave some interesting insights.
>
>>
>> Does arm64 have something like 2M pages that it can use for the
>> linear map?
>
> Yes, with 4K pages arm64 as well has 2M pages, but arm64 also has a
> choice of 16K and 64K and second level pages are bigger there.

>> On x86_64 we always generate page tables, because they are necessary to
>> be in 64bit mode.  As I recall on x86_64 we always use 2M pages which
>> means for each 4K of page tables we map 1GiB of memory.   Which is very
>> tiny.
>>
>> If you do as well as x86_64 for arm64 I suspect that will be good enough
>> for people to not claim regression.
>>
>> Would a variation on the x86_64 implementation that allocates page
>> tables work for arm64?
> ...
>>
>> As long as the page table provided is a linear mapping of physical
>> memory (aka it looks like paging is disabled).  The the code that
>> relocates memory should be pretty much the same.
>>
>> My experience with other architectures suggests only a couple of
>> instructions need to be different to deal with a MMU being enabled.
>
> I think what you are proposing is similar to what James proposed. Yes,
> for a linear map relocation should be pretty much the same as we do
> relocation as with MMU disabled.
>
> Linear map still uses memory, because page tables must be outside of
> destination addresses of segments of the next kernel. Therefore, we
> must allocate a page table for the linear map. It might be a little
> smaller, but in reality the difference is small with 4K pages, and
> insignificant with 64K pages. The benefit of my approach is that the
> assembly copy loop is simpler, and allows hardware prefetching to
> work.
>
> The regular relocation loop works like this:
>
> for (entry = head; !(entry & IND_DONE); entry = *ptr++) {
> addr = __va(entry & PAGE_MASK);
>
> switch (entry & IND_FLAGS) {
> case IND_DESTINATION:
> dest = addr;
> break;
> case IND_INDIRECTION:
> ptr = addr;
> break;
> case IND_SOURCE:
> copy_page(dest, addr);
> dest += PAGE_SIZE;
> }
> }
>
> The entry for the next relocation page has to be always fetched, and
> therefore prefetching cannot help with the actual loop.

True.

In the common case the loop looks like:
> for (entry = head; !(entry & IND_DONE); entry = *ptr++) {
> addr = __va(entry & PAGE_MASK);
>
> switch (entry & IND_FLAGS) {
> case IND_SOURCE:
> copy_page(dest, addr);
> dest += PAGE_SIZE;
> }
> }

Which is a read of the source address followed by the copy_page.
I suspect the overhead of that loop is small enough that it is swamped
by the cost of the copy_page.

If not and a better data structure can be proposed we can look at that.
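
For reference, the flag encoding that loop decodes is the generic
kimage entry format, as defined in include/linux/kexec.h.  Each entry
is a physical address whose low bits carry one of these flags, which is
why the loop masks with PAGE_MASK before using the address:

#define IND_DESTINATION  (1 << 0) /* start copying to this address */
#define IND_INDIRECTION  (1 << 1) /* continue at this page of entries */
#define IND_DONE         (1 << 2) /* end of the entry list */
#define IND_SOURCE       (1 << 3) /* copy one page from this address */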

> In comparison, the loop that I am proposing is like this:
>
> for (addr = head; addr < end; addr += PAGE_SIZE, dst += PAGE_SIZE)
> copy_page(dest, addr);
>
> Here is assembly code for my loop:
>
> 1: copy_page x1, x2, x3, x4, x5, x6, x7, x8, x9, x10
> sub x11, x11, #PAGE_SIZE
> cbnz x11, 1b

I think you may be hiding the cost of that loop in the page table
fetches themselves.

It is possible though unlikely that a page table with huge pages
(and thus smaller page fault costs) and the original loop is actually
cheaper.

> That said, if James and you agree that linear map is the way to go
> forward, I am OK with that as well, as it is still much better than
> having no caching at all.

The big advantage of a linear map is that the kexec'd code can continue
to use it until it sets up its own page tables.

I probably did not document it well enough, but a linear map, the
equivalent of not having virtual addresses at all, was always my
intention for the hand-off state of kexec between kernels.

So please try the linear map.  If it is noticeably slower than your
optimized page table give numbers and we can see if there is a way to
improve the generic kexec data structures.

Eric


Re: [PATCH v11 0/6] arm64: MMU enabled kexec relocation

2021-02-03 Thread Eric W. Biederman
Pavel Tatashin  writes:

> Hi James,
>
>> The problem I see with this is rewriting the relocation code. It needs to 
>> work whether the
>> machine has enough memory to enable the MMU during kexec, or not.
>>
>> In off-list mail to Pavel I proposed an alternative implementation here:
>> https://gitlab.arm.com/linux-arm/linux-jm/-/tree/kexec+mmu/v0
>>
>> By using a copy of the linear map, and passing the phys_to_virt offset into
>> arm64_relocate_new_kernel() it's possible to use the same code when we fail 
>> to allocate the
>> page tables, and run with the MMU off as it does today.
>> I'm convinced someone will crawl out of the woodwork screaming 'regression' 
>> if we
>> substantially increase the amount of memory needed to kexec at all.
>>
>> From that discussion: this didn't meet Pavel's timing needs.
>> If you depend on having all the src/dst pages lined up in a single line, it 
>> sounds like
>> you've over-tuned this to depend on the CPU's streaming mode. What causes 
>> the CPU to
>> start/stop that stuff is very implementation specific (and firmware 
>> configurable).
>> I don't think we should let this rule out systems that can kexec today, but 
>> don't have
>> enough extra memory for the page tables.
>> Having two copies of the relocation code is obviously a bad idea.
>
> I understand that having an extra set of page tables could potentially
> waste memory, especially if VAs are sparse, but in this case we use
> page tables exclusively for contiguous VA space (copy [src, src +
> size]). Therefore, the extra memory usage is tiny. The ratio for
> kernels with  4K page_size is (size of relocated memory) / 512.  A
> normal initrd + kernel is usually under 64M, an extra space which
> means ~128K for the page table. Even with a huge relocation, where
> initrd is ~512M the extra memory usage in the worst case is just ~1M.
> I really doubt we will have any problem from users because of such
> small overhead in comparison to the total kexec-load size.

Foolish question.

Does arm64 have something like 2M pages that it can use for the
linear map?

On x86_64 we always generate page tables, because they are necessary to
be in 64bit mode.  As I recall on x86_64 we always use 2M pages which
means for each 4K of page tables we map 1GiB of memory.   Which is very
tiny.

If you do as well as x86_64 for arm64 I suspect that will be good enough
for people to not claim regression.

Would a variation on the x86_64 implementation that allocates page
tables work for arm64?

>> (as before: ) Instead of trying to make the relocations run quickly, can we 
>> reduce them?
>> This would benefit other architectures too.
>
> This was exactly my first approach [1] where I tried to pre-reserve
> memory similar to how it is done for a crash kernel, but I was asked
> to go away [2] as this is an ARM64 specific problem, where current
> relocation performance is prohibitively slow. I have tested on x86,
> and it does not suffer from this problem, relocation performance is
> just as fast as with MMU enabled ARM64.
>
>>
>> Can the kexec core code allocate higher order pages, instead of doing 
>> everything page at
>> at time?
>
> Yes, however, failures during kexec-load due to failure to coalesce
> huge pages can add extra hassle to users, and therefore this should be
> only an optimization with fallback to base pages.
>
>>
>> If you have a crash kernel reservation, can we use that to eliminate the 
>> relocations
>> completely?
>> (I think this suggestion has been lost in translation each time I make it.
>> I mean like this:
>> https://gitlab.arm.com/linux-arm/linux-jm/-/tree/kexec/kexec_in_crashk/v0
>> Runes to test it:
>> | sudo ./kexec -p -u
>> | sudo cat /proc/iomem | grep Crash
>> |  b020-f01f : Crash kernel
>> | sudo ./kexec --mem-min=0xb020 --mem-max=0xf01ff -l ~/Image 
>> --reuse-cmdline
>>
>> I bet its even faster!)
>
> There is a problem with this approach. While, with kexec_load() call
> it is possible to specify physical destinations for each segment, with
> kexec_file_load() it is not possible. The secure systems that do IMA
> checks during kexec load require kexec_file_load(), and we cannot
> ahead of time specify destinations for these segments (at least
> without substantially changing common kexec code, which is not going to
> happen for this arm64 specific problem).


>> I think 'as fast as possible' and 'memory constrained' are mutually exclusive
>> requirements. We need to make the page tables optional with a single 
>> implementation.

In my experience the slowdown from disabling a cpu's cache (which
apparently happens on arm64 when the MMU is disabled) is freakishly
huge.

Enabling the cache shouldn't be 'as fast as possible' but simply
disengaging the parking brake.

> In my opinion having two different types of relocations will only add
> extra corner cases, confusion about different performance, and bugs.
> It is better to have two types: 1. crash kernel type without
> relocation, 2. fast 

Re: [PATCH 2/2] security.capability: fix conversions on getxattr

2021-01-31 Thread Eric W. Biederman
"Serge E. Hallyn"  writes:

> On Fri, Jan 29, 2021 at 04:55:29PM -0600, Eric W. Biederman wrote:
>> "Serge E. Hallyn"  writes:
>> 
>> > On Thu, Jan 28, 2021 at 02:19:13PM -0600, Eric W. Biederman wrote:
>> >> "Serge E. Hallyn"  writes:
>> >> 
>> >> > On Tue, Jan 19, 2021 at 07:34:49PM -0600, Eric W. Biederman wrote:
>> >> >> Miklos Szeredi  writes:
>> >> >> 
>> >> >> > If a capability is stored on disk in v2 format 
>> >> >> > cap_inode_getsecurity() will
>> >> >> > currently return in v2 format unconditionally.
>> >> >> >
>> >> >> > This is wrong: v2 cap should be equivalent to a v3 cap with zero 
>> >> >> > rootid,
>> >> >> > and so the same conversions performed on it.
>> >> >> >
>> >> >> > If the rootid cannot be mapped v3 is returned unconverted.  Fix this 
>> >> >> > so
>> >> >> > that both v2 and v3 return -EOVERFLOW if the rootid (or the owner of 
>> >> >> > the fs
>> >> >> > user namespace in case of v2) cannot be mapped in the current user
>> >> >> > namespace.
>> >> >> 
>> >> >> This looks like a good cleanup.
>> >> >
>> >> > Sorry, I'm not following.  Why is this a good cleanup?  Why should
>> >> > the xattr be shown as faked v3 in this case?
>> >> 
>> If the reader is in &init_user_ns.  If the filesystem was mounted in a
>> user namespace.  Then the reader loses the information that the
>> >
>> > Can you be more precise about "filesystem was mounted in a user namespace"?
>> > Is this a FUSE thing, the fs is marked as being mounted in a non-init 
>> > userns?
>> > If that's a possible case, then yes that must be represented as v3.  Using
>> > is_v2header() may be the simpler way to check for that, but the more 
>> > accurate
>> > check would be "is it v2 header and mounted by init_user_ns".
>> 
>> I think the filesystems currently relevant are fuse, overlayfs, ramfs, and tmpfs.
>> 
>> > Basically yes, in as many cases as possible we want to just give a v2
>> > cap because more userspace knows what to do with that, but a 
>> > non-init-userns
>> > mounted fs which provides a v2 fscap should have it represented as v3 cap
>> > with rootid being the kuid that owns the userns.
>> 
>> That is the case that is being fixed in the patch.
>> 
>> > Or am I still thinking wrongly?  Wouldn't be entirely surprised :)
>> 
>> No you got it.
>
> So then can we make faking a v3 gated on whether
> sb->s_user_ns != &init_user_ns ?

Sort of.

What Miklos's patch implements is always treating a v2 cap xattr on disk
as v3 internally.

>   if (is_v2header((size_t) ret, cap)) {
>   root = 0;
>   } else if (is_v3header((size_t) ret, cap)) {
>   nscap = (struct vfs_ns_cap_data *) tmpbuf;
>   root = le32_to_cpu(nscap->rootid);
>   } else {
>   size = -EINVAL;
>   goto out_free;
>   }

Then v3 is returned if:
>   /* If the root kuid maps to a valid uid in current ns, then return
>* this as a nscap. */
>   mappedroot = from_kuid(current_user_ns(), kroot);
>   if (mappedroot != (uid_t)-1 && mappedroot != (uid_t)0) {

After that we verify that the fs capability can be seen by the caller
as a v2 cap xattr with:

> > if (!rootid_owns_currentns(kroot)) {
> > size = -EOVERFLOW;
> > goto out_free;

Anything that passes that test and does not encounter a memory
allocation error is returned as a v2.

...

Which in practice does mean that if sb->s_user_ns != &init_user_ns, 
then mappedroot != 0, and is returned as a v3.

The rest of the logic takes care of all of the other crazy silly
combinations.  Like a user namespace that identity maps uid 0,
and then mounts a filesystem.
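
Summarizing the resulting behavior as a table (a reader's summary of
the patched cap_inode_getsecurity(), not code from the patch; a v2
xattr is treated throughout as a v3 with rootid 0 in the filesystem's
user namespace):

  on-disk xattr   rootid seen from the caller's ns    returned to caller
  -------------   --------------------------------    ------------------
  v2 or v3        maps to a nonzero uid               v3, rootid converted
  v2 or v3        rootid owns the caller's ns         v2
  v2 or v3        does not map                        -EOVERFLOW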

Eric





Re: [PATCH 2/2] security.capability: fix conversions on getxattr

2021-01-29 Thread Eric W. Biederman
"Serge E. Hallyn"  writes:

> On Thu, Jan 28, 2021 at 08:44:26PM +0100, Miklos Szeredi wrote:
>> On Thu, Jan 28, 2021 at 6:09 PM Serge E. Hallyn  wrote:
>> >
>> > On Tue, Jan 19, 2021 at 07:34:49PM -0600, Eric W. Biederman wrote:
>> > > Miklos Szeredi  writes:
>> > >
>> > > > if (!rootid_owns_currentns(kroot)) {
>> > > > -   kfree(tmpbuf);
>> > > > -   return -EOPNOTSUPP;
>> > > > +   size = -EOVERFLOW;
>> >
>> > Why this change?  Christian (cc:d) noticed that this is a user visible 
>> > change.
>> > Without this change, if you are in a userns which has different rootid, the
>> > EOVERFLOW tells vfs_getxattr to vall back to __vfs_getxattr() and so you 
>> > can
>> > see the v3 capability with its rootid.
>> >
>> > With this change, you instead just get EOVERFLOW.
>> 
>> Why would the user want to see nonsense (in its own userns) rootid and
>> what would it do with it?
>
> They would know that the data is there.

But an error of -EOVERFLOW still indicates data is there.
You just don't get the data because it can not be represented.

>> Please give an example where an untranslatable rootid would make any
>> sense at all to the user.
>
> I may have accidentally, from init_user_ns, as uid 1000, set an
> fscap with rootid 11 instead of 10, and wonder why the
> cap is not working in the container where 10 is root.

Getting -EOVERFLOW when attempting to read the cap from inside
the user namespace will immediately tell you what is wrong. The rootid
does not map.

That is how all the non-mapping situations are handled.  Either
-EOVERFLOW or returning INVALID_UID/the unmapped user id aka nobody.

The existing code is wrong because it returns a completely untranslated
uid, which is nonsense.

An argument could be made for returning a rootid of 0xffffffff aka
INVALID_UID in a v3 cap xattr when the rootid can not be mapped.  I
think that is what we do with posix_acls that contain ids that don't
map.  My sense is returning -EOVERFLOW inside the container and
returning the v3 cap xattr outside the container will most quickly get
the problem diagnosed, and will be the most likely to not cause
problems.

If there is a good case for returning a v3 cap with rootid of 0xffffffff
instead of -EOVERFLOW we can do that.  Right now I don't see anything
that would be compelling in either direction.

Eric






Re: [PATCH 2/2] security.capability: fix conversions on getxattr

2021-01-29 Thread Eric W. Biederman
"Serge E. Hallyn"  writes:

> On Thu, Jan 28, 2021 at 02:19:13PM -0600, Eric W. Biederman wrote:
>> "Serge E. Hallyn"  writes:
>> 
>> > On Tue, Jan 19, 2021 at 07:34:49PM -0600, Eric W. Biederman wrote:
>> >> Miklos Szeredi  writes:
>> >> 
>> >> > If a capability is stored on disk in v2 format cap_inode_getsecurity() 
>> >> > will
>> >> > currently return in v2 format unconditionally.
>> >> >
>> >> > This is wrong: v2 cap should be equivalent to a v3 cap with zero rootid,
>> >> > and so the same conversions performed on it.
>> >> >
>> >> > If the rootid cannot be mapped v3 is returned unconverted.  Fix this so
>> >> > that both v2 and v3 return -EOVERFLOW if the rootid (or the owner of 
>> >> > the fs
>> >> > user namespace in case of v2) cannot be mapped in the current user
>> >> > namespace.
>> >> 
>> >> This looks like a good cleanup.
>> >
>> > Sorry, I'm not following.  Why is this a good cleanup?  Why should
>> > the xattr be shown as faked v3 in this case?
>> 
>> If the reader is in &init_user_ns.  If the filesystem was mounted in a
>> user namespace.  Then the reader loses the information that the
>
> Can you be more precise about "filesystem was mounted in a user namespace"?
> Is this a FUSE thing, the fs is marked as being mounted in a non-init userns?
> If that's a possible case, then yes that must be represented as v3.  Using
> is_v2header() may be the simpler way to check for that, but the more accurate
> check would be "is it v2 header and mounted by init_user_ns".

I think the filesystems currently relevant are fuse, overlayfs, ramfs, and tmpfs.

> Basically yes, in as many cases as possible we want to just give a v2
> cap because more userspace knows what to do with that, but a non-init-userns
> mounted fs which provides a v2 fscap should have it represented as v3 cap
> with rootid being the kuid that owns the userns.

That is the case that is being fixed in the patch.

> Or am I still thinking wrongly?  Wouldn't be entirely surprised :)

No you got it.

Eric


Re: [PATCH 2/2] security.capability: fix conversions on getxattr

2021-01-28 Thread Eric W. Biederman
Miklos Szeredi  writes:

> On Thu, Jan 28, 2021 at 9:24 PM Eric W. Biederman  
> wrote:
>
>> 
>> From our previous discussions I would also argue it would be good
>> if there was a bypass that skipped all conversions if the reader
>> and the filesystem are in the same user namespace.
>> 
>
> That's however just an optimization (AFAICS) that only makes sense if
> it helps a read world workload.   I'm not convinced that that's the
> case.

It is definitely a different issue.

From previous conversations with Serge, there is a concern with a
sysadmin wanting to see what is actually on disk.  In case there are
bugs that care about the different layout.  Just passing everything
through when no translation is necessary will allow that kind of
diagnosis.

As your patch demonstrates we already have had bugs in this area
so being able to get at the raw data may help people if they get into a
situation where bugs matter.

Eric


Re: [PATCH 2/2] security.capability: fix conversions on getxattr

2021-01-28 Thread Eric W. Biederman
"Serge E. Hallyn"  writes:

> On Tue, Jan 19, 2021 at 07:34:49PM -0600, Eric W. Biederman wrote:
>> Miklos Szeredi  writes:
>> 
>> > If a capability is stored on disk in v2 format cap_inode_getsecurity() will
>> > currently return in v2 format unconditionally.
>> >
>> > This is wrong: v2 cap should be equivalent to a v3 cap with zero rootid,
>> > and so the same conversions performed on it.
>> >
>> > If the rootid cannot be mapped v3 is returned unconverted.  Fix this so
>> > that both v2 and v3 return -EOVERFLOW if the rootid (or the owner of the fs
>> > user namespace in case of v2) cannot be mapped in the current user
>> > namespace.
>> 
>> This looks like a good cleanup.
>
> Sorry, I'm not following.  Why is this a good cleanup?  Why should
> the xattr be shown as faked v3 in this case?

If the reader is in &init_user_ns.  If the filesystem was mounted in a
user namespace.  Then the reader loses the information that the
capability xattr only applies to a subset of user namespaces.

A trivial place where this would be important is if userspace were to
copy the file and the associated capability xattr to another
filesystem that is mounted differently.



From our previous discussions I would also argue it would be good
if there was a bypass that skipped all conversions if the reader
and the filesystem are in the same user namespace.



> A separate question below.
>
>> I do wonder how well this works with stacking.  In particular
>> ovl_xattr_set appears to call vfs_getxattr without overriding the creds.
>> What the purpose of that is I haven't quite figured out.  It looks like
>> it is just a probe to see if an xattr is present so maybe it is ok.
>> 
>> Acked-by: "Eric W. Biederman" 
>> 
>> >
>> > Signed-off-by: Miklos Szeredi 
>> > ---
>> >  security/commoncap.c | 67 
>> >  1 file changed, 43 insertions(+), 24 deletions(-)
>> >
>> > diff --git a/security/commoncap.c b/security/commoncap.c
>> > index baccd871..c9d99f8f4c82 100644
>> > --- a/security/commoncap.c
>> > +++ b/security/commoncap.c
>> > @@ -371,10 +371,11 @@ int cap_inode_getsecurity(struct inode *inode, const 
>> > char *name, void **buffer,
>> >  {
>> >int size, ret;
>> >kuid_t kroot;
>> > +  __le32 nsmagic, magic;
>> >uid_t root, mappedroot;
>> >char *tmpbuf = NULL;
>> >struct vfs_cap_data *cap;
>> > -  struct vfs_ns_cap_data *nscap;
>> > +  struct vfs_ns_cap_data *nscap = NULL;
>> >struct dentry *dentry;
>> >struct user_namespace *fs_ns;
>> >  
>> > @@ -396,46 +397,61 @@ int cap_inode_getsecurity(struct inode *inode, const 
>> > char *name, void **buffer,
>> >fs_ns = inode->i_sb->s_user_ns;
>> >cap = (struct vfs_cap_data *) tmpbuf;
>> >if (is_v2header((size_t) ret, cap)) {
>> > -  /* If this is sizeof(vfs_cap_data) then we're ok with the
>> > -   * on-disk value, so return that.  */
>> > -  if (alloc)
>> > -  *buffer = tmpbuf;
>> > -  else
>> > -  kfree(tmpbuf);
>> > -  return ret;
>> > -  } else if (!is_v3header((size_t) ret, cap)) {
>> > -  kfree(tmpbuf);
>> > -  return -EINVAL;
>> > +  root = 0;
>> > +  } else if (is_v3header((size_t) ret, cap)) {
>> > +  nscap = (struct vfs_ns_cap_data *) tmpbuf;
>> > +  root = le32_to_cpu(nscap->rootid);
>> > +  } else {
>> > +  size = -EINVAL;
>> > +  goto out_free;
>> >}
>> >  
>> > -  nscap = (struct vfs_ns_cap_data *) tmpbuf;
>> > -  root = le32_to_cpu(nscap->rootid);
>> >kroot = make_kuid(fs_ns, root);
>> >  
>> >/* If the root kuid maps to a valid uid in current ns, then return
>> > * this as a nscap. */
>> >mappedroot = from_kuid(current_user_ns(), kroot);
>> >if (mappedroot != (uid_t)-1 && mappedroot != (uid_t)0) {
>> > +  size = sizeof(struct vfs_ns_cap_data);
>> >if (alloc) {
>> > -  *buffer = tmpbuf;
>> > +  if (!nscap) {
>> > +  /* v2 -> v3 conversion */
>> > +  nscap = kzalloc(size, GFP_ATOMIC);
>> > +  if (!nscap) {
>> > +

Re: [PATCH v2 1/1] kexec: dump kmessage before machine_kexec

2021-01-28 Thread Eric W. Biederman
Pavel Tatashin  writes:

> kmsg_dump(KMSG_DUMP_SHUTDOWN) is called before
> machine_restart(), machine_halt(), machine_power_off(), the only one that
> is missing is  machine_kexec().
>
> The dmesg output that it contains can be used to study the shutdown
> performance of both kernel and systemd during kexec reboot.
>
> Here is example of dmesg data collected after kexec:

As long as we keep kmsg_dump out of the crash_kexec path, where it
completely breaks kexec on panic, this seems a reasonable thing to do.
On the ordinary kernel_kexec path everything is expected to be working.

Is kmsg_dump expected to work after all of the device drivers
are shut down?  Otherwise this placement of kmsg_dump is too late.
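
For context, the call order in kernel_kexec() with this patch applied
is roughly (abridged, ordinary non-crash path only):

	machine_shutdown();             /* drivers already quiesced here */
	kmsg_dump(KMSG_DUMP_SHUTDOWN);  /* the new call */
	machine_kexec(kexec_image);

so a kmsg dumper that still needs a live device is already out of luck
at this point.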

Eric

> root@dplat-cp22:~# cat /sys/fs/pstore/dmesg-ramoops-0 | tail
> ...
> <6>[   70.914592] psci: CPU3 killed (polled 0 ms)
> <5>[   70.915705] CPU4: shutdown
> <6>[   70.916643] psci: CPU4 killed (polled 4 ms)
> <5>[   70.917715] CPU5: shutdown
> <6>[   70.918725] psci: CPU5 killed (polled 0 ms)
> <5>[   70.919704] CPU6: shutdown
> <6>[   70.920726] psci: CPU6 killed (polled 4 ms)
> <5>[   70.921642] CPU7: shutdown
> <6>[   70.922650] psci: CPU7 killed (polled 0 ms)
>
> Signed-off-by: Pavel Tatashin 
> Reviewed-by: Kees Cook 
> Reviewed-by: Petr Mladek 
> Reviewed-by: Bhupesh Sharma 
> ---
>  kernel/kexec_core.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index 4f8efc278aa7..e253c8b59145 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -37,6 +37,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/kmsg_dump.h>
>  
>  #include 
>  #include 
> @@ -1180,6 +1181,7 @@ int kernel_kexec(void)
>   machine_shutdown();
>   }
>  
> + kmsg_dump(KMSG_DUMP_SHUTDOWN);
>   machine_kexec(kexec_image);
>  
>  #ifdef CONFIG_KEXEC_JUMP


Re: [RFC PATCH v3 1/8] Use refcount_t for ucounts reference counting

2021-01-21 Thread Eric W. Biederman
Alexey Gladkov  writes:

> On Tue, Jan 19, 2021 at 07:57:36PM -0600, Eric W. Biederman wrote:
>> Alexey Gladkov  writes:
>> 
>> > On Mon, Jan 18, 2021 at 12:34:29PM -0800, Linus Torvalds wrote:
>> >> On Mon, Jan 18, 2021 at 11:46 AM Alexey Gladkov
>> >>  wrote:
>> >> >
>> >> > Sorry about that. I thought that this code is not needed when switching
>> >> > from int to refcount_t. I was wrong.
>> >> 
>> >> Well, you _may_ be right. I personally didn't check how the return
>> >> value is used.
>> >> 
>> >> I only reacted to "it certainly _may_ be used, and there is absolutely
>> >> no comment anywhere about why it wouldn't matter".
>> >
>> > I have not found examples where checked the overflow after calling
>> > refcount_inc/refcount_add.
>> >
>> > For example in kernel/fork.c:2298 :
>> >
>> >current->signal->nr_threads++;   
>> >atomic_inc(&current->signal->live);
>> >refcount_inc(&current->signal->sigcnt);
>> >
>> > $ semind search signal_struct.sigcnt
>> def include/linux/sched/signal.h:83   refcount_t sigcnt;
>> m-- kernel/fork.c:723  put_signal_struct  if (refcount_dec_and_test(&sig->sigcnt))
>> m-- kernel/fork.c:1571 copy_signal        refcount_set(&sig->sigcnt, 1);
>> m-- kernel/fork.c:2298 copy_process       refcount_inc(&current->signal->sigcnt);
>> >
>> > It seems to me that the only way is to use __refcount_inc and then compare
>> > the old value with REFCOUNT_MAX
>> >
>> > Since I have not seen examples of such checks, I thought that this is
>> > acceptable. Sorry once again. I have not tried to hide these changes.
>> 
>> The current ucount code does check for overflow and fails the increment
>> in every case.
>> 
>> So arguably it will be a regression and inferior error handling behavior
>> if the code switches to the ``better'' refcount_t data structure.
>> 
>> I originally didn't use refcount_t because silently saturating and not
>> bothering to handle the error makes me uncomfortable.
>> 
>> Not having to acquire the ucounts_lock every time seems nice.  Perhaps
>> the path forward would be to start with stupid/correct code that always
>> takes the ucounts_lock for every increment of ucounts->count, that is
>> later replaced with something more optimal.
>> 
>> Not impacting performance in the non-namespace cases and having good
>> performance in the other cases is a fundamental requirement of merging
>> code like this.
>
> Did I understand your suggestion correctly that you suggest to use
> spin_lock for atomic_read and atomic_inc ?
>
> If so, then we are already incrementing the counter under ucounts_lock.
>
>   ...
>   if (atomic_read(&ucounts->count) == INT_MAX)
>   ucounts = NULL;
>   else
>   atomic_inc(&ucounts->count);
>   spin_unlock_irq(&ucounts_lock);
>   return ucounts;
>
> something like this ?

Yes.  But without atomics.  Something a bit more like:
>   ...
>   if (ucounts->count == INT_MAX)
>   ucounts = NULL;
>   else
>   ucounts->count++;
>   spin_unlock_irq(&ucounts_lock);
>   return ucounts;

I do believe at some point we will want to say using the spin_lock for
ucounts->count is cumbersome, and suboptimal and we want to change it to
get a better performing implementation.

Just for getting the semantics correct we should be able to use just
ucounts_lock for locking.  Then when everything is working we can
profile and optimize the code.

I just don't want figuring out what is needed to get hung up over little
details that we can change later.
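
Concretely, a first pass might look like this (a sketch only, with a
hypothetical helper name; the point is that overflow fails the
reference explicitly instead of saturating silently the way refcount_t
would):

static struct ucounts *try_get_ucounts(struct ucounts *ucounts)
{
	struct ucounts *ret = ucounts;

	spin_lock_irq(&ucounts_lock);
	if (ucounts->count == INT_MAX)
		ret = NULL;	/* refuse; the caller sees the failure */
	else
		ucounts->count++;
	spin_unlock_irq(&ucounts_lock);
	return ret;
}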

Eric



[RFC][PATCH] apparmor: Enforce progressively tighter permissions for no_new_privs

2021-01-20 Thread Eric W. Biederman


The current understanding of apparmor with respect to no_new_privs is at
odds with how no_new_privs is implemented and understood by the rest of
the kernel.

The documentation of no_new_privs states:
> With ``no_new_privs`` set, ``execve()`` promises not to grant the
> privilege to do anything that could not have been done without the
> execve call.

And reading through the kernel except for apparmor that description
matches what is implemented.

There are two major divergences of apparmor from this definition:
- proc_setattr enforces limitations when no_new_privs is set.
- the limitation is enforced from the apparent time when no_new_privs is
  set instead of guaranteeing that each execve has progressively more
  narrow permissions.

The code in apparmor that attempts to discover the apparmor label at the
point where no_new_privs is set is not robust.  The capture happens a
long time after no_new_privs is set.

Capturing the label at the point where no_new_privs is set is
practically impossible to implement robustly.  Today the rule is struct
cred can only be changed by its current task.  Today
SECCOMP_FILTER_FLAG_TSYNC sets no_new_privs from another thread.  A
robust implementation would require changing something fundamental in
how creds are managed for SECCOMP_FILTER_FLAG_TSYNC to be able to
capture the cred at the point it is set.

Furthermore, given the consistent documentation and how everything else
implements no_new_privs, not having the permissions get progressively
tighter is a footgun aimed at userspace.  I fully expect it to break any
security sensitive software that uses no_new_privs and was not
deliberately designed and tested against apparmor.

Avoid the questionable and hard to fix implementation and the
potential to confuse userspace by having no_new_privs enforce
progressively tighter permissions.

Fixes: 9fcf78cca198 ("apparmor: update domain transitions that are subsets of 
confinement at nnp")
Signed-off-by: Eric W. Biederman 
---

I came across this while examining the places cred_guard_mutex is
used and trying to find a way to make those code paths less insane.

If it would be more palatable I would not mind removing the
task_no_new_privs test entirely from aa_change_hat and aa_change_profile
as those are not part of exec, so arguably no_new_privs should not care
about them at all.

Can we please get rid of the huge semantic wart and pain in the implementation?
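
For reference, the userspace side of the contract being enforced (a
minimal illustration, not part of the patch):

#include <sys/prctl.h>
#include <unistd.h>

int main(void)
{
	/* Once set, no_new_privs can never be cleared, and it is
	 * inherited across fork and execve. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return 1;
	/* Every execve from here on must run with no more privilege
	 * than the caller already had: setuid bits and file
	 * capabilities on the executed program are not honored. */
	execl("/bin/true", "true", (char *)NULL);
	return 1;
}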

 security/apparmor/domain.c   | 39 
 security/apparmor/include/task.h |  4 
 security/apparmor/task.c |  7 --
 3 files changed, 4 insertions(+), 46 deletions(-)

diff --git a/security/apparmor/domain.c b/security/apparmor/domain.c
index f919ebd042fd..8f77059bf890 100644
--- a/security/apparmor/domain.c
+++ b/security/apparmor/domain.c
@@ -869,17 +869,6 @@ int apparmor_bprm_creds_for_exec(struct linux_binprm *bprm)
 
label = aa_get_newest_label(cred_label(bprm->cred));
 
-   /*
-* Detect no new privs being set, and store the label it
-* occurred under. Ideally this would happen when nnp
-* is set but there isn't a good way to do that yet.
-*
-* Testing for unconfined must be done before the subset test
-*/
-   if ((bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS) && !unconfined(label) &&
-   !ctx->nnp)
-   ctx->nnp = aa_get_label(label);
-
/* buffer freed below, name is pointer into buffer */
buffer = aa_get_buffer(false);
if (!buffer) {
@@ -915,7 +904,7 @@ int apparmor_bprm_creds_for_exec(struct linux_binprm *bprm)
 */
if ((bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS) &&
!unconfined(label) &&
-   !aa_label_is_unconfined_subset(new, ctx->nnp)) {
+   !aa_label_is_unconfined_subset(new, label)) {
error = -EPERM;
info = "no new privs";
goto audit;
@@ -1158,16 +1147,6 @@ int aa_change_hat(const char *hats[], int count, u64 
token, int flags)
label = aa_get_newest_cred_label(cred);
previous = aa_get_newest_label(ctx->previous);
 
-   /*
-* Detect no new privs being set, and store the label it
-* occurred under. Ideally this would happen when nnp
-* is set but there isn't a good way to do that yet.
-*
-* Testing for unconfined must be done before the subset test
-*/
-   if (task_no_new_privs(current) && !unconfined(label) && !ctx->nnp)
-   ctx->nnp = aa_get_label(label);
-
if (unconfined(label)) {
info = "unconfined can not change_hat";
error = -EPERM;
@@ -1193,7 +1172,7 @@ int aa_change_hat(const char *hats[], int count, u64 
token, int flags)
 * reduce restrictions.
 */
  

Re: [RFC][PATCH] apparmor: Enforce progressively tighter permissions for no_new_privs

2021-01-20 Thread Eric W. Biederman


TL;DR selinux and apparmor ignore no_new_privs

What?


John Johansen  writes:

> On 1/20/21 1:26 PM, Eric W. Biederman wrote:
>> 
>> The current understanding of apparmor with respect to no_new_privs is at
>> odds with how no_new_privs is implemented and understood by the rest of
>> the kernel.
>> 
>> The documentation of no_new_privs states:
>>> With ``no_new_privs`` set, ``execve()`` promises not to grant the
>>> privilege to do anything that could not have been done without the
>>> execve call.
>> 
>> And reading through the kernel except for apparmor that description
>> matches what is implemented.
>> 
>
> That is not correct.
>
> commit 7b0d0b40cd78 ("selinux: Permit bounded transitions under
> NO_NEW_PRIVS or NOSUID.")
>
> Allows for bound transitions under selinux
> and

As I understand it, a bounded transition is a transition to a state
with a set of permissions that are a subset of what was previously
held.  Which is consistent with the mandate of no_new_privs.

> commit af63f4193f9f selinux: Generalize support for NNP/nosuid SELinux
> domain transitions
>
> goes further and "Decouple NNP/nosuid from SELinux transitions".

Yes.  Looking at that commit I do see that selinux appears to be
deliberately ignoring no_new_privs in specific cases.

WTF.

>> There are two major divergences of apparmor from this definition:
>> - proc_setattr enforces limitations when no_new_privs is set.
>> - the limitation is enforced from the apparent time when no_new_privs is
>>   set instead of guaranteeing that each execve has progressively more
>>   narrow permissions.
>> 
>> The code in apparmor that attempts to discover the apparmor label at the
>> point where no_new_privs is set is not robust.  The capture happens a
>> long time after no_new_privs is set.
>> 
>
> yes, but that shouldn't matter. As apparmor has not changed its label
> at any point between when no_new_privs was set and when the check is
> done. AppArmor is attempting to change it label, and if it finds NNP
> has been set we capture what the confinement was.
>
>> Capturing the label at the point where no_new_privs is set is
>> practically impossible to implement robustly.  Today the rule is struct
>> cred can only be changed by it's current task.  Today
>
> right, and apparmor only ever has the task update its own label.
>
>> SECCOMP_FILTER_FLAG_TSYNC sets no_new_privs from another thread.  A
>> robust implementation would require changing something fundamental in
>> how creds are managed for SECCOMP_FILTER_FLAG_TSYNC to be able to
>> capture the cred at the point it is set.
>> 
> I am open to supporting something like that.

I can't see how it would be possible to be robust without completely
changing the locking.  Locking that, even in today's simpler model, we
have not figured out how to make obviously correct.

>> Furthermore, given the consistent documentation and how everything else
>> implements no_new_privs, not having the permissions get progressively
>
> Again see above

Except where selinux deliberately ignores no_new_privs this is
consistent.

>> tighter is a footgun aimed at userspace.  I fully expect it to break any
>
> tighter is somewhat relative, nor is it only progressively tighter; it
> is bounded against the snapshot of the label that was on the task.

Which is the BUG I am reporting.  It should be progressively tighter.

>> security sensitive software that uses no_new_privs and was not
>> deliberately designed and tested against apparmor.
>> 
>
> Currently the situation has become either an either or choice between
> the LSM and NNP. We are trying to walk a balance. Ideally apparmor
> would like to do something similar to selinux and decouple the label
> transition from NNP and nosuid via an internal capability, but we
> have not gone there yet.

Why do you need to escape no_new_privs?  Why does anyone need to escape
no_new_privs?

>> Avoid the questionable and hard to fix implementation and the
>> potential to confuse userspace by having no_new_privs enforce
>> progressinvely tighter permissions.
>> 
>
> This would completely break several use cases.

Enforcing no_new_privs as documented would break userspace?

Isn't the opposite true that you are breaking people by not enforcing
it?

>> Fixes: 9fcf78cca198 ("apparmor: update domain transitions that are subsets 
>> of confinement at nnp")
>> Signed-off-by: Eric W. Biederman 
>> ---
>> 
>> I came across this while examining the places cred_guard_mutex is
>> used and trying to find a way to make those code paths less insane.
>> 

Re: [RFC][PATCH] apparmor: Enforce progressively tighter permissions for no_new_privs

2021-01-20 Thread Eric W. Biederman


This should now Cc the correct email address for James Morris.

ebied...@xmission.com (Eric W. Biederman) writes:

> The current understanding of apparmor with respect to no_new_privs is at
> odds with how no_new_privs is implemented and understood by the rest of
> the kernel.
>
> The documentation of no_new_privs states:
>> With ``no_new_privs`` set, ``execve()`` promises not to grant the
>> privilege to do anything that could not have been done without the
>> execve call.
>
> And reading through the kernel except for apparmor that description
> matches what is implemented.
>
> There are two major divergences of apparmor from this definition:
> - proc_setattr enforces limitations when no_new_privs is set.
> - the limitation is enforced from the apparent time when no_new_privs is
>   set instead of guaranteeing that each execve has progressively more
>   narrow permissions.
>
> The code in apparmor that attempts to discover the apparmor label at the
> point where no_new_privs is set is not robust.  The capture happens a
> long time after no_new_privs is set.
>
> Capturing the label at the point where no_new_privs is set is
> practically impossible to implement robustly.  Today the rule is struct
> cred can only be changed by its current task.  Today
> SECCOMP_FILTER_FLAG_TSYNC sets no_new_privs from another thread.  A
> robust implementation would require changing something fundamental in
> how creds are managed for SECCOMP_FILTER_FLAG_TSYNC to be able to
> capture the cred at the point it is set.
>
> Furthermore, given the consistent documentation and how everything else
> implements no_new_privs, not having the permissions get progressively
> tighter is a footgun aimed at userspace.  I fully expect it to break any
> security sensitive software that uses no_new_privs and was not
> deliberately designed and tested against apparmor.
>
> Avoid the questionable and hard to fix implementation and the
> potential to confuse userspace by having no_new_privs enforce
> progressively tighter permissions.
>
> Fixes: 9fcf78cca198 ("apparmor: update domain transitions that are subsets of 
> confinement at nnp")
> Signed-off-by: Eric W. Biederman 
> ---
>
> I came across this while examining the places cred_guard_mutex is
> used and trying to find a way to make those code paths less insane.
>
> If it would be more palatable I would not mind removing the
> task_no_new_privs test entirely from aa_change_hat and aa_change_profile
> as those are not part of exec, so arguably no_new_privs should not care
> about them at all.
>
> Can we please get rid of the huge semantic wart and pain in the 
> implementation?
>
>  security/apparmor/domain.c   | 39 
>  security/apparmor/include/task.h |  4 
>  security/apparmor/task.c |  7 --
>  3 files changed, 4 insertions(+), 46 deletions(-)
>
> diff --git a/security/apparmor/domain.c b/security/apparmor/domain.c
> index f919ebd042fd..8f77059bf890 100644
> --- a/security/apparmor/domain.c
> +++ b/security/apparmor/domain.c
> @@ -869,17 +869,6 @@ int apparmor_bprm_creds_for_exec(struct linux_binprm 
> *bprm)
>  
>   label = aa_get_newest_label(cred_label(bprm->cred));
>  
> - /*
> -  * Detect no new privs being set, and store the label it
> -  * occurred under. Ideally this would happen when nnp
> -  * is set but there isn't a good way to do that yet.
> -  *
> -  * Testing for unconfined must be done before the subset test
> -  */
> - if ((bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS) && !unconfined(label) &&
> - !ctx->nnp)
> - ctx->nnp = aa_get_label(label);
> -
>   /* buffer freed below, name is pointer into buffer */
>   buffer = aa_get_buffer(false);
>   if (!buffer) {
> @@ -915,7 +904,7 @@ int apparmor_bprm_creds_for_exec(struct linux_binprm 
> *bprm)
>*/
>   if ((bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS) &&
>   !unconfined(label) &&
> - !aa_label_is_unconfined_subset(new, ctx->nnp)) {
> + !aa_label_is_unconfined_subset(new, label)) {
>   error = -EPERM;
>   info = "no new privs";
>   goto audit;
> @@ -1158,16 +1147,6 @@ int aa_change_hat(const char *hats[], int count, u64 
> token, int flags)
>   label = aa_get_newest_cred_label(cred);
>   previous = aa_get_newest_label(ctx->previous);
>  
> - /*
> -  * Detect no new privs being set, and store the label it
> -  * occurred under. Ideally this would happen when nnp
> -  * is set but there isn't a good way to do that yet.
> - 

Re: [RFC PATCH v3 1/8] Use refcount_t for ucounts reference counting

2021-01-19 Thread Eric W. Biederman
ebied...@xmission.com (Eric W. Biederman) writes:

> Alexey Gladkov  writes:
>
>> On Mon, Jan 18, 2021 at 12:34:29PM -0800, Linus Torvalds wrote:
>>> On Mon, Jan 18, 2021 at 11:46 AM Alexey Gladkov
>>>  wrote:
>>> >
>>> > Sorry about that. I thought that this code is not needed when switching
>>> > from int to refcount_t. I was wrong.
>>> 
>>> Well, you _may_ be right. I personally didn't check how the return
>>> value is used.
>>> 
>>> I only reacted to "it certainly _may_ be used, and there is absolutely
>>> no comment anywhere about why it wouldn't matter".
>>
>> I have not found examples where checked the overflow after calling
>> refcount_inc/refcount_add.
>>
>> For example in kernel/fork.c:2298 :
>>
>>current->signal->nr_threads++;   
>>atomic_inc(&current->signal->live);
>>refcount_inc(&current->signal->sigcnt);
>>
>> $ semind search signal_struct.sigcnt
>> def include/linux/sched/signal.h:83   refcount_t sigcnt;
>> m-- kernel/fork.c:723  put_signal_struct  if (refcount_dec_and_test(&sig->sigcnt))
>> m-- kernel/fork.c:1571 copy_signal        refcount_set(&sig->sigcnt, 1);
>> m-- kernel/fork.c:2298 copy_process       refcount_inc(&current->signal->sigcnt);
>>
>> It seems to me that the only way is to use __refcount_inc and then compare
>> the old value with REFCOUNT_MAX
>>
>> Since I have not seen examples of such checks, I thought that this is
>> acceptable. Sorry once again. I have not tried to hide these changes.
>
> The current ucount code does check for overflow and fails the increment
> in every case.
>
> So arguably it will be a regression and inferior error handling behavior
> if the code switches to the ``better'' refcount_t data structure.
>
> I originally didn't use refcount_t because silently saturating and not
> bothering to handle the error makes me uncomfortable.
>
> Not having to acquire the ucounts_lock every time seems nice.  Perhaps
> the path forward would be to start with stupid/correct code that always
> takes the ucounts_lock for every increment of ucounts->count, that is
> later replaced with something more optimal.
>
> Not impacting performance in the non-namespace cases and having good
> performance in the other cases is a fundamental requirement of merging
> code like this.

So starting with something easy to comprehend and simple, may make it
easier to figure out how to optimize the code.

Eric



Re: [RFC PATCH v3 1/8] Use refcount_t for ucounts reference counting

2021-01-19 Thread Eric W. Biederman
Alexey Gladkov  writes:

> On Mon, Jan 18, 2021 at 12:34:29PM -0800, Linus Torvalds wrote:
>> On Mon, Jan 18, 2021 at 11:46 AM Alexey Gladkov
>>  wrote:
>> >
>> > Sorry about that. I thought that this code is not needed when switching
>> > from int to refcount_t. I was wrong.
>> 
>> Well, you _may_ be right. I personally didn't check how the return
>> value is used.
>> 
>> I only reacted to "it certainly _may_ be used, and there is absolutely
>> no comment anywhere about why it wouldn't matter".
>
> I have not found examples where checked the overflow after calling
> refcount_inc/refcount_add.
>
> For example in kernel/fork.c:2298 :
>
>current->signal->nr_threads++;   
>atomic_inc(&current->signal->live);
>refcount_inc(&current->signal->sigcnt);
>
> $ semind search signal_struct.sigcnt
> def include/linux/sched/signal.h:83   refcount_t  sigcnt;
> m-- kernel/fork.c:723  put_signal_struct  if (refcount_dec_and_test(&sig->sigcnt))
> m-- kernel/fork.c:1571 copy_signal        refcount_set(&sig->sigcnt, 1);
> m-- kernel/fork.c:2298 copy_process       refcount_inc(&current->signal->sigcnt);
>
> It seems to me that the only way is to use __refcount_inc and then compare
> the old value with REFCOUNT_MAX
>
> Since I have not seen examples of such checks, I thought that this is
> acceptable. Sorry once again. I have not tried to hide these changes.

The current ucount code does check for overflow and fails the increment
in every case.

So arguably it will be a regression and inferior error handling behavior
if the code switches to the ``better'' refcount_t data structure.

I originally didn't use refcount_t because silently saturating and not
bothering to handle the error makes me uncomfortable.

Not having to acquire the ucounts_lock every time seems nice.  Perhaps
the path forward would be to start with stupid/correct code that always
takes the ucounts_lock for every increment of ucounts->count, that is
later replaced with something more optimal.

Not impacting performance in the non-namespace cases and having good
performance in the other cases is a fundamental requirement of merging
code like this.

Eric



Re: [PATCH 2/2] security.capability: fix conversions on getxattr

2021-01-19 Thread Eric W. Biederman
Miklos Szeredi  writes:

> If a capability is stored on disk in v2 format cap_inode_getsecurity() will
> currently return in v2 format unconditionally.
>
> This is wrong: v2 cap should be equivalent to a v3 cap with zero rootid,
> and so the same conversions performed on it.
>
> If the rootid cannot be mapped v3 is returned unconverted.  Fix this so
> that both v2 and v3 return -EOVERFLOW if the rootid (or the owner of the fs
> user namespace in case of v2) cannot be mapped in the current user
> namespace.

This looks like a good cleanup.

I do wonder how well this works with stacking.  In particular
ovl_xattr_set appears to call vfs_getxattr without overriding the creds.
What the purpose of that is I haven't quite figured out.  It looks like
it is just a probe to see if an xattr is present so maybe it is ok.

Acked-by: "Eric W. Biederman" 

>
> Signed-off-by: Miklos Szeredi 
> ---
>  security/commoncap.c | 67 
>  1 file changed, 43 insertions(+), 24 deletions(-)
>
> diff --git a/security/commoncap.c b/security/commoncap.c
> index baccd871..c9d99f8f4c82 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -371,10 +371,11 @@ int cap_inode_getsecurity(struct inode *inode, const 
> char *name, void **buffer,
>  {
>   int size, ret;
>   kuid_t kroot;
> + __le32 nsmagic, magic;
>   uid_t root, mappedroot;
>   char *tmpbuf = NULL;
>   struct vfs_cap_data *cap;
> - struct vfs_ns_cap_data *nscap;
> + struct vfs_ns_cap_data *nscap = NULL;
>   struct dentry *dentry;
>   struct user_namespace *fs_ns;
>  
> @@ -396,46 +397,61 @@ int cap_inode_getsecurity(struct inode *inode, const 
> char *name, void **buffer,
>   fs_ns = inode->i_sb->s_user_ns;
>   cap = (struct vfs_cap_data *) tmpbuf;
>   if (is_v2header((size_t) ret, cap)) {
> - /* If this is sizeof(vfs_cap_data) then we're ok with the
> -  * on-disk value, so return that.  */
> - if (alloc)
> - *buffer = tmpbuf;
> - else
> - kfree(tmpbuf);
> - return ret;
> - } else if (!is_v3header((size_t) ret, cap)) {
> - kfree(tmpbuf);
> - return -EINVAL;
> + root = 0;
> + } else if (is_v3header((size_t) ret, cap)) {
> + nscap = (struct vfs_ns_cap_data *) tmpbuf;
> + root = le32_to_cpu(nscap->rootid);
> + } else {
> + size = -EINVAL;
> + goto out_free;
>   }
>  
> - nscap = (struct vfs_ns_cap_data *) tmpbuf;
> - root = le32_to_cpu(nscap->rootid);
>   kroot = make_kuid(fs_ns, root);
>  
>   /* If the root kuid maps to a valid uid in current ns, then return
>* this as a nscap. */
>   mappedroot = from_kuid(current_user_ns(), kroot);
>   if (mappedroot != (uid_t)-1 && mappedroot != (uid_t)0) {
> + size = sizeof(struct vfs_ns_cap_data);
>   if (alloc) {
> - *buffer = tmpbuf;
> + if (!nscap) {
> + /* v2 -> v3 conversion */
> + nscap = kzalloc(size, GFP_ATOMIC);
> + if (!nscap) {
> + size = -ENOMEM;
> + goto out_free;
> + }
> + nsmagic = VFS_CAP_REVISION_3;
> + magic = le32_to_cpu(cap->magic_etc);
> + if (magic & VFS_CAP_FLAGS_EFFECTIVE)
> + nsmagic |= VFS_CAP_FLAGS_EFFECTIVE;
> + memcpy(&nscap->data, &cap->data, sizeof(__le32) * 2 * VFS_CAP_U32);
> + nscap->magic_etc = cpu_to_le32(nsmagic);
> + } else {
> + /* use allocated v3 buffer */
> + tmpbuf = NULL;
> + }
>   nscap->rootid = cpu_to_le32(mappedroot);
> - } else
> - kfree(tmpbuf);
> - return size;
> + *buffer = nscap;
> + }
> + goto out_free;
>   }
>  
>   if (!rootid_owns_currentns(kroot)) {
> - kfree(tmpbuf);
> - return -EOPNOTSUPP;
> + size = -EOVERFLOW;
> + goto out_free;
>   }
>  
>   /* This comes from a parent namespace.  Return as a v2 capability */
>   size = sizeof(struct vfs_cap_data);
>   if (alloc) {
> - *buffer = 

Re: [PATCH 0/2] capability conversion fixes

2021-01-19 Thread Eric W. Biederman
Miklos Szeredi  writes:

> It turns out overlayfs is actually okay wrt. mutliple conversions, because
> it uses the right context for lower operations.  I.e. before calling
> vfs_{set,get}xattr() on underlying fs, it overrides creds with that of the
> mounter, so the current user ns will now match that of
> overlay_sb->s_user_ns, meaning that the caps will be converted to just the
> right format for the next layer
>
> OTOH ecryptfs, which is the only other one affected by commit 7c03e2cda4a5
> ("vfs: move cap_convert_nscap() call into vfs_setxattr()") needs to be
> fixed up, since it doesn't do the cap override thing that overlayfs does.
>
> I don't have an ecryptfs setup, so untested, but it's a fairly trivial
> change.
>
> My other observation was that cap_inode_getsecurity() messes up conversion
> of caps in more than one case.  This is independent of the overlayfs user
> ns enablement but affects it as well.
>
> Maybe we can revisit the infrastructure improvements we discussed, but I
> think these fixes are more appropriate for the current cycle.

I mostly agree.  Fixing the bugs in a back-portable way is important.

However we need to sort out the infrastructure, and implementation.

As far as I can tell it is only the fact that overlayfs does not support
the new mount api aka fs_context that allows this fix to work and be
correct.

I believe the new mount api would allow specifying a different userns
than current_user_ns for the overlay filesystem, and that would break
this.

So while I agree with making a minimal fix for now, we need a good
fix, because this code is much too subtle and it can break very easily
with no one noticing.

Eric





> Thanks,
> Miklos
>
> Miklos Szeredi (2):
>   ecryptfs: fix uid translation for setxattr on security.capability
>   security.capability: fix conversions on getxattr
>
>  fs/ecryptfs/inode.c  | 10 +--
>  security/commoncap.c | 67 
>  2 files changed, 50 insertions(+), 27 deletions(-)


Re: [PATCH 1/2] ecryptfs: fix uid translation for setxattr on security.capability

2021-01-19 Thread Eric W. Biederman
Miklos Szeredi  writes:

> Prior to commit 7c03e2cda4a5 ("vfs: move cap_convert_nscap() call into
> vfs_setxattr()") the translation of nscap->rootid did not take stacked
> filesystems (overlayfs and ecryptfs) into account.
>
> That patch fixed the overlay case, but made the ecryptfs case worse.
>
> Restore the old behavior for ecryptfs that existed before the overlayfs
> fix.  This does not fix ecryptfs's handling of complex user namespace
> setups, but it does make sure existing setups don't regress.

Today vfs_setxattr handles a delegated_inode and breaking leases,
code that is enabled with CONFIG_FILE_LOCKING.  So unless I am missing
something this introduces a different regression into ecryptfs.

>
> Reported-by: Eric W. Biederman 
> Cc: Tyler Hicks 
> Fixes: 7c03e2cda4a5 ("vfs: move cap_convert_nscap() call into vfs_setxattr()")
> Signed-off-by: Miklos Szeredi 
> ---
>  fs/ecryptfs/inode.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
> index e23752d9a79f..58d0f7187997 100644
> --- a/fs/ecryptfs/inode.c
> +++ b/fs/ecryptfs/inode.c
> @@ -1016,15 +1016,19 @@ ecryptfs_setxattr(struct dentry *dentry, struct inode 
> *inode,
>  {
>   int rc;
>   struct dentry *lower_dentry;
> + struct inode *lower_inode;
>  
>   lower_dentry = ecryptfs_dentry_to_lower(dentry);
> - if (!(d_inode(lower_dentry)->i_opflags & IOP_XATTR)) {
> + lower_inode = d_inode(lower_dentry);
> + if (!(lower_inode->i_opflags & IOP_XATTR)) {
>   rc = -EOPNOTSUPP;
>   goto out;
>   }
> - rc = vfs_setxattr(lower_dentry, name, value, size, flags);
> + inode_lock(lower_inode);
> + rc = __vfs_setxattr_locked(lower_dentry, name, value, size, flags, 
> NULL);
> + inode_unlock(lower_inode);
>   if (!rc && inode)
> - fsstack_copy_attr_all(inode, d_inode(lower_dentry));
> + fsstack_copy_attr_all(inode, lower_inode);
>  out:
>   return rc;
>  }

Eric


Re: [RFC PATCH v2 1/8] Use atomic type for ucounts reference counting

2021-01-13 Thread Eric W. Biederman
Alexey Gladkov  writes:

We might want to use refcount_t instead of atomic_t.  Not a big deal
either way.

> Signed-off-by: Alexey Gladkov 
> ---
>  include/linux/user_namespace.h |  2 +-
>  kernel/ucount.c| 10 +-
>  2 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index 64cf8ebdc4ec..84fefa9247c4 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -92,7 +92,7 @@ struct ucounts {
>   struct hlist_node node;
>   struct user_namespace *ns;
>   kuid_t uid;
> - int count;
> + atomic_t count;
>   atomic_t ucount[UCOUNT_COUNTS];
>  };
>  
> diff --git a/kernel/ucount.c b/kernel/ucount.c
> index 11b1596e2542..0f2c7c11df19 100644
> --- a/kernel/ucount.c
> +++ b/kernel/ucount.c
> @@ -141,7 +141,8 @@ static struct ucounts *get_ucounts(struct user_namespace 
> *ns, kuid_t uid)
>  
>   new->ns = ns;
>   new->uid = uid;
> - new->count = 0;
> +
> + atomic_set(&new->count, 0);
>  
> spin_lock_irq(&ucounts_lock);
>   ucounts = find_ucounts(ns, uid, hashent);
> @@ -152,10 +153,10 @@ static struct ucounts *get_ucounts(struct 
> user_namespace *ns, kuid_t uid)
>   ucounts = new;
>   }
>   }
> - if (ucounts->count == INT_MAX)
> + if (atomic_read(&ucounts->count) == INT_MAX)
>   ucounts = NULL;
>   else
> - ucounts->count += 1;
> + atomic_inc(&ucounts->count);
> spin_unlock_irq(&ucounts_lock);
>   return ucounts;
>  }
> @@ -165,8 +166,7 @@ static void put_ucounts(struct ucounts *ucounts)
>   unsigned long flags;
>  
> spin_lock_irqsave(&ucounts_lock, flags);
> - ucounts->count -= 1;
> - if (!ucounts->count)
> + if (atomic_dec_and_test(&ucounts->count))
> hlist_del_init(&ucounts->node);
>   else
>   ucounts = NULL;


This can become:
static void put_ucounts(struct ucounts *ucounts)
{
unsigned long flags;

        if (atomic_dec_and_lock_irqsave(&ucounts->count, &ucounts_lock, flags)) {
                hlist_del_init(&ucounts->node);
                spin_unlock_irqrestore(&ucounts_lock, flags);
                kfree(ucounts);
        }
}
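Note that atomic_dec_and_lock_irqsave() only takes ucounts_lock when
the count actually drops to zero, so the common case put stays
lock-free.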



Re: [RFC PATCH v2 2/8] Add a reference to ucounts for each user

2021-01-13 Thread Eric W. Biederman


The subject is wrong.  This should be:
[RFC PATCH v2 2/8] Add a reference to ucounts for each cred.

Further the explanation could use a little work.  Something along the
lines of:

For RLIMIT_NPROC and some other rlimits the user_struct that holds the
global count is kept alive for the lifetime of a process by keeping it
in struct cred.  Add a ucounts reference to struct cred, so that
RLIMIT_NPROC can switch from using a per-user limit to using a
per-user, per-user-namespace limit.

Nits about the code below.

Alexey Gladkov  writes:

> Before this, only the owner of the user namespace had an entry in ucounts.
> This entry addressed the user in the given user namespace.
>
> Now we create such an entry in ucounts for all users in the user namespace.
> Each user has only one entry for each user namespace.
>
> This commit is in preparation for migrating rlimits to ucounts.
>
> Signed-off-by: Alexey Gladkov 
> ---
>  include/linux/cred.h   |  1 +
>  include/linux/user_namespace.h |  2 ++
>  kernel/cred.c  | 17 +++--
>  kernel/ucount.c| 12 +++-
>  kernel/user_namespace.c|  1 +
>  5 files changed, 30 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/cred.h b/include/linux/cred.h
> index 18639c069263..307744fcc387 100644
> --- a/include/linux/cred.h
> +++ b/include/linux/cred.h
> @@ -144,6 +144,7 @@ struct cred {
>  #endif
>   struct user_struct *user;   /* real user ID subscription */
>   struct user_namespace *user_ns; /* user_ns the caps and keyrings are 
> relative to. */
> + struct ucounts *ucounts;
>   struct group_info *group_info;  /* supplementary groups for euid/fsgid 
> */
>   /* RCU deletion */
>   union {
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index 84fefa9247c4..483568a56f7f 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -102,6 +102,8 @@ bool setup_userns_sysctls(struct user_namespace *ns);
>  void retire_userns_sysctls(struct user_namespace *ns);
>  struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum 
> ucount_type type);
>  void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
> +void put_ucounts(struct ucounts *ucounts);
> +void set_cred_ucounts(const struct cred *cred, struct user_namespace *ns, 
> kuid_t uid);
>  
>  #ifdef CONFIG_USER_NS
>  
> diff --git a/kernel/cred.c b/kernel/cred.c
> index 421b1149c651..d19e2e97092c 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -119,6 +119,7 @@ static void put_cred_rcu(struct rcu_head *rcu)
>   if (cred->group_info)
>   put_group_info(cred->group_info);
>   free_uid(cred->user);
> + put_ucounts(cred->ucounts);
>   put_user_ns(cred->user_ns);
>   kmem_cache_free(cred_jar, cred);
>  }
> @@ -144,6 +145,9 @@ void __put_cred(struct cred *cred)
>   BUG_ON(cred == current->cred);
>   BUG_ON(cred == current->real_cred);
>  
> + BUG_ON(cred->ucounts == NULL);
> + BUG_ON(cred->ucounts->ns != cred->user_ns);
> +
>   if (cred->non_rcu)
> put_cred_rcu(&cred->rcu);
>   else
> @@ -271,6 +275,9 @@ struct cred *prepare_creds(void)
>   get_uid(new->user);
>   get_user_ns(new->user_ns);
>  
> + new->ucounts = NULL;
> + set_cred_ucounts(new, new->user_ns, new->euid);
> +
This hunk should be:
atomic_inc(&new->ucounts->count);

That means you get to skip the lookup by uid and user_ns, which, while
it should be cheap, is completely unnecessary in this case.

>  #ifdef CONFIG_KEYS
>   key_get(new->session_keyring);
>   key_get(new->process_keyring);
> @@ -363,6 +370,7 @@ int copy_creds(struct task_struct *p, unsigned long 
> clone_flags)
>   ret = create_user_ns(new);
>   if (ret < 0)
>   goto error_put;
> + set_cred_ucounts(new, new->user_ns, new->euid);
>   }
>  
>  #ifdef CONFIG_KEYS
> @@ -485,8 +493,11 @@ int commit_creds(struct cred *new)
>* in set_user().
>*/
>   alter_cred_subscribers(new, 2);
> - if (new->user != old->user)
> - atomic_inc(&new->user->processes);
> + if (new->user != old->user || new->user_ns != old->user_ns) {
> + if (new->user != old->user)
> + atomic_inc(&new->user->processes);
> + set_cred_ucounts(new, new->user_ns, new->euid);
> + }
>   rcu_assign_pointer(task->real_cred, new);
>   rcu_assign_pointer(task->cred, new);
>   if (new->user != old->user)
> @@ -661,6 +672,7 @@ void __init cred_init(void)
>   /* allocate a slab in which we can store credentials */
>   cred_jar = kmem_cache_create("cred_jar", sizeof(struct cred), 0,
>   SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
> + set_cred_ucounts(&init_cred, &init_user_ns, GLOBAL_ROOT_UID);
Unfortunately this is needed here because this is the first cred
and there is no ucounts reference to copy.
>  }
>  
> 

Re: [PATCH v2 01/10] vfs: move cap_convert_nscap() call into vfs_setxattr()

2021-01-12 Thread Eric W. Biederman
ebied...@xmission.com (Eric W. Biederman) writes:

> So there is the basic question do we want to read the raw bytes on disk
> or do we want to return something meaningful to the reader.  As the
> existing tools use the xattr interface to set/clear fscaps, returning
> data to user space rather than raw bytes seems the preferred interface.
>
> My ideal semantics would be:
>
> - If current_user_ns() == sb->s_user_ns return the raw data.
>
>   I don't know how to implement this first scenario while permitting
>   stacked filesystems.

After a little more thought I do.

In getxattr, if the get_cap method is not implemented by the
filesystem and current_user_ns() == sb->s_user_ns, simply treat it as
an ordinary xattr read/write.

Otherwise call vfs_get_cap and translate the result as described
below.

The key point of this is it allows for seeing what is actually on
disk (when it is not confusing).

> - Calculate the cpu_vfs_cap_data as get_vfs_caps_from_disk does.
>   That gives the meaning of the xattr.
>
> - If "from_kuid(current_userns(), krootid) == 0" return a v2 cap.
>
> - If "rootid_owns_currentns()" return a v2 cap.
>
> - Else return an error.  Probably a permission error.
>
>   The fscap simply can not make sense to the user if the rootid does not
>   map.  Returning a v2 cap would imply that the caps are present on the
>   executable (in the current context) when they are not.


Eric


Re: [PATCH v2 01/10] vfs: move cap_convert_nscap() call into vfs_setxattr()

2021-01-12 Thread Eric W. Biederman
Miklos Szeredi  writes:

> On Tue, Jan 12, 2021 at 1:15 AM Eric W. Biederman  
> wrote:
>>
>> Miklos Szeredi  writes:
>>
>> > On Fri, Jan 01, 2021 at 11:35:16AM -0600, Eric W. Biederman wrote:
>
>> > For one: a v2 fscap is supposed to be equivalent to a v3 fscap with a 
>> > rootid of
>> > zero, right?
>>
>> Yes.  This assumes that everything is translated into the uids of the
>> target filesystem.
>>
>> > If so, why does cap_inode_getsecurity() treat them differently (v2 fscap
>> > succeeding unconditionally while v3 one being either converted to v2, 
>> > rejected
>> > or left as v3 depending on current_user_ns())?
>>
>> As I understand it v2 fscaps have always succeeded unconditionally.  The
>> only case I can see where reading a v2 fscap might not succeed is if
>> the filesystem is outside of the initial user namespace.
>
> Looking again, it's rather confusing.  cap_inode_getsecurity()
> currently handles the following cases:
>
> v1: -> fails with -EINVAL
>
> v2: -> returns unconverted xattr
>
> v3:
>  a) rootid is mapped in the current namespace to non-zero:
>  -> convert rootid
>
>  b) rootid owns the current or ancerstor namespace:
>  -> convert to v2
>
>  c) rootid is not mapped and is not owner:
>  -> return -EOPNOTSUPP -> falls back to unconverted v3
>
> So lets take the example, where a tmpfs is created in a private user
> namespace and one file has a v2 cap and the other an equivalent v3 cap
> with a zero rootid.  This is the result when looking at it from
>
> 1) the namespace of the fs:
> ---
> t = cap_dac_override+eip
> tt = cap_dac_override+eip
>
> 2) the initial namespace:
> ---
> t = cap_dac_override+eip
> tt = cap_dac_override+eip [rootid=1000]
>
> 3) an unrelated namespace:
> ---
> t = cap_dac_override+eip
> tt = cap_dac_override+eip
>
> Note: in this last case getxattr will actually return a v3 cap with
> zero rootid for "tt" which getcap does not display due to being zero.
> I could do a setup with a nested namespaces that better demonstrate
> the confusing nature of this, but I think this also proves the point.

Yes.  There is real confusion on the reading case when the namespaces
do not match.

> At this point userspace simply cannot determine whether the returned
> cap is in any way valid or not.
>
> The following semantics would make a ton more sense, since getting a
> v2 would indicate that rootid is unknown:

> - if cap is v2 convert to v3 with zero rootid
> - after this, check if rootid needs to be translated, if not return v3
> - if yes, try to translate to current ns, if succeeds return translated v3
> - if not mappable, return v2
>
> Hmm?

So there is the basic question do we want to read the raw bytes on disk
or do we want to return something meaningful to the reader.  As the
existing tools use the xattr interface to set/clear fscaps, returning
data to user space rather than raw bytes seems the preferred interface.

My ideal semantics would be:

- If current_user_ns() == sb->s_user_ns return the raw data.

  I don't know how to implement this first scenario while permitting
  stacked filesystems.
  
- Calculate the cpu_vfs_cap_data as get_vfs_caps_from_disk does.
  That gives the meaning of the xattr.

- If "from_kuid(current_userns(), krootid) == 0" return a v2 cap.

- If "rootid_owns_currentns()" return a v2 cap.

- Else return an error.  Probably a permission error.

  The fscap simply can not make sense to the user if the rootid does not
  map.  Returning a v2 cap would imply that the caps are present on the
  executable (in the current context) when they are not.
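In rough code form, with vfs_cap_get and fill_v2_cap as stand-ins for
helpers that do not exist yet:

	struct cpu_vfs_cap_data caps;
	int ret;

	ret = vfs_cap_get(dentry, &caps);	/* the meaning of the xattr */
	if (ret)
		return ret;
	if (from_kuid(current_user_ns(), caps.rootid) == 0 ||
	    rootid_owns_currentns(caps.rootid))
		return fill_v2_cap(buffer, &caps);
	return -EPERM;	/* rootid does not map, the cap is meaningless here */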


>> > Anyway, here's a patch that I think fixes getxattr() layering for
>> > security.capability.  Does basically what you suggested.  Slight change of
>> > semantics vs. v1 caps, not sure if that is still needed, 
>> > getxattr()/setxattr()
>> > hasn't worked for these since the introduction of v3 in 4.14.
>> > Untested.
>>
>> Taking a look.  The goal of changing how these operate is to make it so
>> that layered filesystems can just pass through the data if they don't
>> want to change anything (even when the user namespaces of the
>> filesystems in question are different).
>>
>> Feedback on the code below:
>> - cap_get should be in inode_operations like get_acl and set_acl.
>
> So it's not clear to me why xattr ops are per-sb and acl ops are per-inode.

I don't know why either.  What I do see is everything except
inode->i_sb->

Re: [PATCH v2 01/10] vfs: move cap_convert_nscap() call into vfs_setxattr()

2021-01-11 Thread Eric W. Biederman
Miklos Szeredi  writes:

> On Fri, Jan 01, 2021 at 11:35:16AM -0600, Eric W. Biederman wrote:
>> Miklos Szeredi  writes:
>> 
>> > cap_convert_nscap() does permission checking as well as conversion of the
>> > xattr value conditionally based on fs's user-ns.
>> >
>> > This is needed by overlayfs and probably other layered fs (ecryptfs) and is
>> > what vfs_foo() is supposed to do anyway.
>> 
>> Well crap.
>> 
>> I just noticed this and it turns out this change is wrong.
>> 
>> The problem is that it reads the rootid from the v3 fscap, using
>> current_user_ns() and then writes it using the sb->s_user_ns.
>> 
>> So any time the stacked filesystems sb->s_user_ns do not match or
>> current_user_ns does not match sb->s_user_ns this could be a problem.
>> 
>> In a stacked filesystem a second pass through vfs_setxattr will result
>> in the rootid being translated a second time (with potentially the wrong
>> namespaces).  I think because of the security checks we won't write
>> something we shouldn't be able to write to the filesystem.  Still we
>> will be writing the wrong v3 fscap which can go quite badly.
>> 
>> This doesn't look terribly difficult to fix.
>> 
>> Probably convert this into a fs independent form using uids in
>> init_user_ns at input and have cap_convert_nscap convert the v3 fscap
>> into the filesystem dependent form.  With some way for stackable
>> filesystems to just skip converting it from the filesystem independent
>> format.
>> 
>> Uids in xattrs that are expected to go directly to disk, but aren't
>> always suitable for going directly to disk are tricky.
>
> I've been looking at this for a couple of days and can't say I clearly
> understand everything yet.
>
> For one: a v2 fscap is supposed to be equivalent to a v3 fscap with a rootid 
> of
> zero, right?

Yes.  This assumes that everything is translated into the uids of the
target filesystem.

> If so, why does cap_inode_getsecurity() treat them differently (v2 fscap
> succeeding unconditionally while v3 one being either converted to v2, rejected
> or left as v3 depending on current_user_ns())?

As I understand it v2 fscaps have always succeeded unconditionally.  The
only case I can see where reading a v2 fscap might not succeed is if
the filesystem is outside of the initial user namespace.


> Anyway, here's a patch that I think fixes getxattr() layering for
> security.capability.  Does basically what you suggested.  Slight change of
> semantics vs. v1 caps, not sure if that is still needed, getxattr()/setxattr()
> hasn't worked for these since the introduction of v3 in 4.14.
> Untested.

Taking a look.  The goal of changing how these operate is to make it so
that layered filesystems can just pass through the data if they don't
want to change anything (even when the user namespaces of the
filesystems in question are different).

Feedback on the code below:
- cap_get should be in inode_operations like get_acl and set_acl.

- cap_get should return a cpu_vfs_cap_data.

  Which means that only make_kuid is needed when reading the cap from
  disk.

  Which means that except for the rootid_owns_currentns check (which
  needs to happen elsewhere) default_cap_get should be today's
  get_vfs_cap_from_disk.

- With the introduction of cap_get I believe commoncap should stop
  implementing the security_inode_getsecurity hook, and rather have
  getxattr observe that it is the file capability xattr and call the new
  vfs_cap_get, then translate to a v2 or v3 cap as appropriate when
  returning the cap to userspace.

I think that would put the code on a solid comprehensible foundation.
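Something like the following shape, purely as a sketch of where the
hook would live:

	/* in include/linux/fs.h, next to get_acl/set_acl */
	struct inode_operations {
		...
		int (*cap_get)(struct dentry *dentry,
			       struct cpu_vfs_cap_data *caps);
	};

with a default_cap_get that reads and decodes the raw xattr for
ordinary filesystems.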

Eric


> I still need to wrap my head around the permission requirements for the
> setxattr() case...
>
> Thanks,
> Miklos
>
> ---
>  fs/overlayfs/super.c   |   15 +++
>  include/linux/capability.h |2 
>  include/linux/fs.h |1 
>  security/commoncap.c   |  210 
> -
>  4 files changed, 132 insertions(+), 96 deletions(-)
>
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -395,6 +395,20 @@ static int ovl_remount(struct super_bloc
>   return ret;
>  }
>  
> +static int ovl_cap_get(struct dentry *dentry,
> +struct vfs_ns_cap_data *nscap)
> +{
> + int res;
> + const struct cred *old_cred;
> + struct dentry *realdentry = ovl_dentry_real(dentry);
> +
> + old_cred = ovl_override_creds(dentry->d_sb);
> + res = vfs_cap_get(realdentry, nscap);
> + revert_creds(old_cred);
> +
> + return res;
> +}
> +
>  static const struct super_operations ovl_super_operations = {
>   .a

Re: [RFC PATCH v2 0/8] Count rlimits in each user namespace

2021-01-11 Thread Eric W. Biederman
Linus Torvalds  writes:

> On Sun, Jan 10, 2021 at 9:34 AM Alexey Gladkov  
> wrote:
>>
>> To address the problem, we bind rlimit counters to each user namespace. The
>> result is a tree of rlimit counters with the biggest value at the root (aka
>> init_user_ns). The rlimit counter increment/decrement occurs in the current 
>> and
>> all parent user namespaces.
>
> I'm not seeing why this is necessary.
>
> Maybe it's the right approach, but none of the patches (or this cover
> letter email) really explain it to me.
>
> I understand why you might want the _limits_ themselves would form a
> tree like this - with the "master limit" limiting the limits in the
> user namespaces under it.
>
> But I don't understand why the _counts_ should do that. The 'struct
> user_struct' should be shared across even user namespaces for the same
> user.
>
> IOW, the very example of the problem you quote seems to argue against this:
>
>> For example, there are two containers (A and B) created by one user. The
>> container A sets RLIMIT_NPROC=1 and starts one process. Everything is fine, 
>> but
>> when container B tries to do the same it will fail because the number of
>> processes is counted globally for each user and user has one process already.
>
> Note how the problem was _not_ that the _count_ was global. That part
> was fine and all good.

The problem is fundamentally that the per process RLIMIT_NPROC was
compared against the user_struct->processes.

I have only heard the problem described but I believe it is either the
RLIMIT_NPROC test in fork or at the beginning of do_execveat_common that
is failing.

From fs/exec.c line 1866:
>   /*
>* We move the actual failure in case of RLIMIT_NPROC excess from
>* set*uid() to execve() because too many poorly written programs
>* don't check setuid() return code.  Here we additionally recheck
>* whether NPROC limit is still exceeded.
>*/
>   if ((current->flags & PF_NPROC_EXCEEDED) &&
> atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
>   retval = -EAGAIN;
>   goto out_ret;
>   }

From kernel/fork.c line 1966:
>   retval = -EAGAIN;
> if (atomic_read(&p->real_cred->user->processes) >=
>   task_rlimit(p, RLIMIT_NPROC)) {
>   if (p->real_cred->user != INIT_USER &&
>   !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
>   goto bad_fork_free;
>   }
>   current->flags &= ~PF_NPROC_EXCEEDED;

In both the cases the RLIMIT_NPROC value comes from
task->signal->rlim[RLIMIT_NPROC] and the count of processes
comes from task->cred->user->processes.

> No, the problem was that the _limit_ in container A also ended up
> affecting container B.

The description I have is that both containers run the same service,
which sets its RLIMIT_NPROC to 1 in both containers.

> So to me, that says that it would make sense to continue to use the
> resource counts in 'struct user_struct' (because if user A has a hard
> limit of X, then creating a new namespace shouldn't expand that
> limit), but then have the ability to make per-container changes to the
> resource limits (as long as they are within the bounds of the parent
> user namespace resource limit).

I agree that needs to work as well.

> Maybe there is some reason for this ucounts approach, but if so, I
> feel it was not explained at all.

Let me see if I can state the example a little more clearly.

Suppose there is a service never_fork that runs as never_fork_user and
sets RLIMIT_NPROC to 1 in its systemd service file.

Further suppose there is a user bob who has two containers he wants to
run: container1 and container2.  Both containers start the never_fork
service.

Bob first starts container1 and inside it the never_fork service starts.
Bob starts container2 and the never_fork service fails to start.

Does that make it clear that the problem is the count of processes,
which would exceed 1 if both instances of the never_fork service
started?
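To spell out the arithmetic, assuming both containers use the same uid
mapping for never_fork_user so they share one user_struct: after
container1 starts the service, user->processes for that kuid is 1.
When container2's copy forks, the check compares
atomic_read(&user->processes) >= task_rlimit(p, RLIMIT_NPROC), that is
1 >= 1, and fails with -EAGAIN even though container2 is running zero
never_fork processes.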

Eric


Re: [PATCH] x86/vm86/32: Remove VM86_SCREEN_BITMAP support

2021-01-09 Thread Eric W. Biederman
Andy Lutomirski  writes:

> The implementation was rather buggy.  It unconditionally marked PTEs
> read-only, even for VM_SHARED mappings.  I'm not sure whether this is
> actually a problem, but it certainly seems unwise.  More importantly, it
> released the mmap lock before flushing the TLB, which could allow a racing
> CoW operation to falsely believe that the underlying memory was not
> writable.
>
> I can't find any users at all of this mechanism, so just remove it.

In another age this was used by dosemu.  Have you looked at dosemu to
see if it still uses this support (on 32bit where dosemu can use vm86)?

It may still be a valid removal target; I just wanted to point out what
the original user was.

Eric

> Cc: Andrea Arcangeli 
> Cc: Linux-MM 
> Cc: Jason Gunthorpe 
> Cc: x...@kernel.org
> Cc: Linus Torvalds 
> Cc: Matthew Wilcox 
> Cc: Jann Horn 
> Cc: Jan Kara 
> Cc: Yu Zhao 
> Cc: Peter Xu 
> Signed-off-by: Andy Lutomirski 
> ---
>  arch/x86/include/uapi/asm/vm86.h |  2 +-
>  arch/x86/kernel/vm86_32.c| 55 ++--
>  2 files changed, 10 insertions(+), 47 deletions(-)
>
> diff --git a/arch/x86/include/uapi/asm/vm86.h 
> b/arch/x86/include/uapi/asm/vm86.h
> index d2ee4e307ef8..50004fb4590d 100644
> --- a/arch/x86/include/uapi/asm/vm86.h
> +++ b/arch/x86/include/uapi/asm/vm86.h
> @@ -106,7 +106,7 @@ struct vm86_struct {
>  /*
>   * flags masks
>   */
> -#define VM86_SCREEN_BITMAP   0x0001
> +#define VM86_SCREEN_BITMAP   0x0001/* no longer supported */
>  
>  struct vm86plus_info_struct {
>   unsigned long force_return_for_pic:1;
> diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
> index 764573de3996..28b9e8d511e1 100644
> --- a/arch/x86/kernel/vm86_32.c
> +++ b/arch/x86/kernel/vm86_32.c
> @@ -160,49 +160,6 @@ void save_v86_state(struct kernel_vm86_regs *regs, int 
> retval)
>   do_exit(SIGSEGV);
>  }
>  
> -static void mark_screen_rdonly(struct mm_struct *mm)
> -{
> - struct vm_area_struct *vma;
> - spinlock_t *ptl;
> - pgd_t *pgd;
> - p4d_t *p4d;
> - pud_t *pud;
> - pmd_t *pmd;
> - pte_t *pte;
> - int i;
> -
> - mmap_write_lock(mm);
> - pgd = pgd_offset(mm, 0xA0000);
> - if (pgd_none_or_clear_bad(pgd))
> - goto out;
> - p4d = p4d_offset(pgd, 0xA0000);
> - if (p4d_none_or_clear_bad(p4d))
> - goto out;
> - pud = pud_offset(p4d, 0xA0000);
> - if (pud_none_or_clear_bad(pud))
> - goto out;
> - pmd = pmd_offset(pud, 0xA0000);
> -
> - if (pmd_trans_huge(*pmd)) {
> - vma = find_vma(mm, 0xA0000);
> - split_huge_pmd(vma, pmd, 0xA0000);
> - }
> - if (pmd_none_or_clear_bad(pmd))
> - goto out;
> - pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl);
> - for (i = 0; i < 32; i++) {
> - if (pte_present(*pte))
> - set_pte(pte, pte_wrprotect(*pte));
> - pte++;
> - }
> - pte_unmap_unlock(pte, ptl);
> -out:
> - mmap_write_unlock(mm);
> - flush_tlb_mm_range(mm, 0xA0000, 0xA0000 + 32*PAGE_SIZE, PAGE_SHIFT, 
> false);
> -}
> -
> -
> -
>  static int do_vm86_irq_handling(int subfunction, int irqnumber);
>  static long do_sys_vm86(struct vm86plus_struct __user *user_vm86, bool plus);
>  
> @@ -282,6 +239,15 @@ static long do_sys_vm86(struct vm86plus_struct __user 
> *user_vm86, bool plus)
>   offsetof(struct vm86_struct, int_revectored)))
>   return -EFAULT;
>  
> +
> + /* VM86_SCREEN_BITMAP had numerous bugs and appears to have no users. */
> + if (v.flags & VM86_SCREEN_BITMAP) {
> + char comm[TASK_COMM_LEN];
> +
> + pr_info_once("vm86: '%s' uses VM86_SCREEN_BITMAP, which is no 
> longer supported\n", get_task_comm(comm, current));
> + return -EINVAL;
> + }
> +
>   memset(, 0, sizeof(vm86regs));
>  
>   vm86regs.pt.bx = v.regs.ebx;
> @@ -370,9 +336,6 @@ static long do_sys_vm86(struct vm86plus_struct __user 
> *user_vm86, bool plus)
>   update_task_stack(tsk);
>   preempt_enable();
>  
> - if (vm86->flags & VM86_SCREEN_BITMAP)
> - mark_screen_rdonly(tsk->mm);
> -
>   memcpy((struct kernel_vm86_regs *)regs, , sizeof(vm86regs));
>   return regs->ax;
>  }


Re: in_compat_syscall() on x86

2021-01-05 Thread Eric W. Biederman
Al Viro  writes:

> On Mon, Jan 04, 2021 at 06:47:38PM -0600, Eric W. Biederman wrote:
>> >> It is defined in the Ubuntu kernel configs I've got lurking:
>> >> Both 3.8.0-19_generic (Ubuntu 13.04) and 5.4.0-56_generic (probably 
>> >> 20.04).
>> >> Which is probably why it is in my test builds (I've just cut out
>> >> a lot of modules).
>> 
>> Interesting.  That sounds like something a gentle prod to the Ubuntu
>> kernel team might get them to disable.  Especially if there are not any
>> x32 binaries in sight.
>
> What for?

Any code that no one uses is better off disabled or deleted.

There are maintenance and support costs to such code, as it causes
extra work when maintaining the kernel, and because the code is
practically never tested, bugs inevitably slip through, some of which
turn into security issues.

Maybe I am wrong and there are interesting users of x32.  All I remember
is that last time this was discussed someone found a distro that
actually shipped an x32 build to users, which was just enough users to
keep x32 alive.  Given that distros are in the process of dropping 32bit
support I suspect that distro may be going away if it is not already gone.

There are a lot of weird x32 special cases that it would be nice to get
rid of if no one uses x32, and could certainly be made less of an issue
if distro's that don't actually care about x32 simply stopped compiling
it in.

>> The core dump code is currently tied to what binary you exec.
>> The code in exec sets mm->binfmt, and the coredump code uses mm->binfmt
>> to pick the coredump handler.
>> 
>> An x32 binary will make all kinds of 64bit calls where it doesn't need
>> the compat handling.  And of course x32 binaries run in 64bit mode with
>> 32bit pointers so looking at the current execution mode doesn't help.
>> 
>> Further fun compat_binfmt_elf is shared between x32 and ia32, because
>> except for a few stray places they do exactly the same thing.
>
> FWIW, there's a series cleaning that crap up nicely; as a side benefit,
> it converts both compats on mips (o32 and n32) to regular compat_binfmt_elf.c
> Yes, the current mainline is bloody awful in that area (PRSTATUS_SIZE and
> SET_PR_FPVALID are not for weak stomach), but that's really not hard to
> get into sane shape - -next had that done in last cycle and I'm currently
> testing (well, building the test kernel) of port of that to 5.11-rc1.

That does sound interesting.  Anytime we can clean up arch specific
weirdness so that it simply becomes generic weirdness that can be
tested and maintained by more people is always nice.

> I really don't see the point of getting rid of x32 - mips n32 is *not*
> going away, and that's an exact parallel.

From what I maintain, x32 and n32 are *not* exact parallels.  The change
in size of the timestamps of SIGCHLD is not present on any other
architecture.

The truth is someone years ago royally messed up struct siginfo and
took a structure that should have been the same on 32bit and 64bit and
wound up with a horrible monstrosity of unions.

> PS: if anything, I wonder if we would better off with binfmt_elf{64,32}.o,
> built from fs/binfmt_elf.c; it's not that hard to do.  With arseloads
> of weirdness going away if we do that...

It sounds like we are already on our way to having that.  I suspect you
would need a binfmt_elf_ops to handle the architecture specific bits,
but it should not be too bad.
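
A rough shape for that, with a hypothetical ops structure (nothing
like it is in the tree today):

	/* fs/binfmt_elf.c built once per word size, parameterized by: */
	struct binfmt_elf_ops {
		int  (*check_arch)(struct elfhdr *hdr);
		void (*set_personality)(struct linux_binprm *bprm,
					struct elfhdr *hdr);
		/* ... the other arch specific hooks ... */
	};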

Eric


Re: in_compat_syscall() on x86

2021-01-04 Thread Eric W. Biederman
Andy Lutomirski  writes:

>> On Jan 4, 2021, at 2:36 PM, David Laight  wrote:
>> 
>> From: Eric W. Biederman
>>> Sent: 04 January 2021 20:41
>>> 
>>> Al Viro  writes:
>>> 
>>>> On Mon, Jan 04, 2021 at 12:16:56PM +, David Laight wrote:
>>>>> On x86 in_compat_syscall() is defined as:
>>>>>in_ia32_syscall() || in_x32_syscall()
>>>>> 
>>>>> Now in_ia32_syscall() is a simple check of the TS_COMPAT flag.
>>>>> However in_x32_syscall() is a horrid beast that has to indirect
>>>>> through to the original %eax value (ie the syscall number) and
>>>>> check for a bit there.
>>>>> 
>>>>> So on a kernel with x32 support (probably most distro kernels)
>>>>> the in_compat_syscall() check is rather more expensive than
>>>>> one might expect.
>>> 
>>> I suggest you check the distro kernels.  I suspect they don't compile in
>>> support for x32.  As far as I can tell x32 is an undead beast of a
>>> subarchitecture that just enough people use that it can't be removed,
but few enough people use it that it likely has a few lurking scary bugs.
>> 
>> It is defined in the Ubuntu kernel configs I've got lurking:
>> Both 3.8.0-19_generic (Ubuntu 13.04) and 5.4.0-56_generic (probably 20.04).
>> Which is probably why it is in my test builds (I've just cut out
>> a lot of modules).

Interesting.  That sounds like something a gentle prod to the Ubuntu
kernel team might get them to disable.  Especially if there are not any
x32 binaries in sight.

Maybe Ubuntu has a reason or maybe someone just enabled the option
because it was there and they could.

>>>>> It would be much better if both checks could be done together.
>>>>> I think this would require the syscall entry code to set a
>>>>> value in both the 64bit and x32 entry paths.
>>>>> (Can a process make both 64bit and x32 system calls?)
>>>> 
>>>> Yes, it bloody well can.
>>>> 
>>>> And I see no benefit in pushing that logics into syscall entry,
>>>> since anything that calls in_compat_syscall() more than once
>>>> per syscall execution is doing the wrong thing.  Moreover,
>>>> in quite a few cases we don't call the sucker at all, and for
>>>> all of those pushing that crap into syscall entry logics is
>>>> pure loss.
>>> 
>>> The x32 system calls have their own system call table and it would be
>>> trivial to set a flag like TS_COMPAT when looking up a system call from
>>> that table.  I expect such a change would be purely in the noise.
>> 
>> Certainly a write of 0/1/2 into a dirtied cache line of 'current'
>> could easily cost absolutely nothing.
>> Especially if current has already been read.
>> 
>> I also wondered about resetting it to zero when an x32 system call
>> exits (rather than entry to a 64bit one).
>> 
>> For ia32 the flag is set (with |=) on every syscall entry.
>> Even though I'm pretty sure it can only change during exec.
>
> It can change for every syscall. I have tests that do this.
>
>>>> What's the point, really?
>>> 
>>> Before we came up with the current games with __copy_siginfo_to_user
>>> and x32_copy_siginfo_to_user I was wondering if we should make such
>>> a change.  The delivery of compat signal frames and core dumps which
>>> do not go through the system call entry path could almost benefit from
>>> a flag that could be set/tested when on those paths.
>> 
>> For signal delivery it should (probably) depend on the system call
>> that setup the signal handler.
>
> I think it has worked this way for some time now.

It always has, but there is code that is called as part of signal delivery
that needs to know if it is ia32 or x32 code (namely
copy_siginfo_to_user32).  The code paths are short enough we don't
strictly need the runtime test on that path and we have been able to
remove it, but it is an example of the kind of path that is not a
syscall entry where it would be nice to set the flag.

>> Although I'm sure I remember one kernel where some of it was done
>> in libc (with a single entrypoint for all handlers).
>> 
>>> The fact that only SIGCHLD (which can not trigger a coredump) is
>>> different saves the coredump code from needing such a test.
>>> 
>>> The fact that the signal frame code is simple enough it can directly
>>> call x32_copy_siginfo_to_user or __copy_siginfo_to_user saves us there.
>>> 
>>> So 

Re: in_compat_syscall() on x86

2021-01-04 Thread Eric W. Biederman
Al Viro  writes:

> On Mon, Jan 04, 2021 at 12:16:56PM +, David Laight wrote:
>> On x86 in_compat_syscall() is defined as:
>> in_ia32_syscall() || in_x32_syscall()
>> 
>> Now in_ia32_syscall() is a simple check of the TS_COMPAT flag.
>> However in_x32_syscall() is a horrid beast that has to indirect
>> through to the original %eax value (ie the syscall number) and
>> check for a bit there.
>> 
>> So on a kernel with x32 support (probably most distro kernels)
>> the in_compat_syscall() check is rather more expensive than
>> one might expect.

I suggest you check the distro kernels.  I suspect they don't compile in
support for x32.  As far as I can tell x32 is an undead beast of a
subarchitecture that just enough people use that it can't be removed,
but few enough people use it that it likely has a few lurking scary bugs.

>> It would be much better if both checks could be done together.
>> I think this would require the syscall entry code to set a
>> value in both the 64bit and x32 entry paths.
>> (Can a process make both 64bit and x32 system calls?)
>
> Yes, it bloody well can.
>
> And I see no benefit in pushing that logics into syscall entry,
> since anything that calls in_compat_syscall() more than once
> per syscall execution is doing the wrong thing.  Moreover,
> in quite a few cases we don't call the sucker at all, and for
> all of those pushing that crap into syscall entry logics is
> pure loss.

The x32 system calls have their own system call table and it would be
trivial to set a flag like TS_COMPAT when looking up a system call from
that table.  I expect such a change would be purely in the noise.
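For illustration, roughly what that could look like in the x32 dispatch
path.  The flag name is made up and the dispatch code is simplified:

	/* sketch: tag the task when an x32 table entry is used */
	if (nr < X32_NR_syscalls) {
		current_thread_info()->status |= TS_COMPAT_X32; /* hypothetical */
		regs->ax = x32_sys_call_table[nr](regs);
	}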

> What's the point, really?

Before we came up with the current games with __copy_siginfo_to_user
and x32_copy_siginfo_to_user I was wondering if we should make such
a change.  The delivery of compat signal frames and core dumps which
do not go through the system call entry path could almost benefit from
a flag that could be set/tested when on those paths.

The fact that only SIGCHLD (which can not trigger a coredump) is
different saves the coredump code from needing such a test.

The fact that the signal frame code is simple enough it can directly
call x32_copy_siginfo_to_user or __copy_siginfo_to_user saves us there.

So I don't think we have any cases where we actually need a flag that
is independent of the system call but we have come very close.


For people who want to optimize I suggest tracking down the handful of
users of x32 and seeing if x32 can be made to just go away.

Eric



Re: [PATCH v2 01/10] vfs: move cap_convert_nscap() call into vfs_setxattr()

2021-01-01 Thread Eric W. Biederman
Miklos Szeredi  writes:

> cap_convert_nscap() does permission checking as well as conversion of the
> xattr value conditionally based on fs's user-ns.
>
> This is needed by overlayfs and probably other layered fs (ecryptfs) and is
> what vfs_foo() is supposed to do anyway.

Well crap.

I just noticed this and it turns out this change is wrong.

The problem is that it reads the rootid from the v3 fscap, using
current_user_ns() and then writes it using the sb->s_user_ns.

So any time the stacked filesystems sb->s_user_ns do not match or
current_user_ns does not match sb->s_user_ns this could be a problem.

In a stacked filesystem a second pass through vfs_setxattr will result
in the rootid being translated a second time (with potentially the wrong
namespaces).  I think because of the security checks we won't write
something we shouldn't be able to write to the filesystem.  Still we
will be writing the wrong v3 fscap which can go quite badly.

This doesn't look terribly difficult to fix.

Probably convert this into a fs independent form using uids in
init_user_ns at input and have cap_convert_nscap convert the v3 fscap
into the filesystem dependent form.  With some way for stackable
filesystems to just skip converting it from the filesystem independent
format.

Uids in xattrs that are expected to go directly to disk, but aren't
always suitable for going directly to disk are tricky.

Eric

> Signed-off-by: Miklos Szeredi 
> ---
>  fs/xattr.c | 17 +++--
>  include/linux/capability.h |  2 +-
>  security/commoncap.c   |  3 +--
>  3 files changed, 13 insertions(+), 9 deletions(-)
>
> diff --git a/fs/xattr.c b/fs/xattr.c
> index cd7a563e8bcd..fd57153b1f61 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -276,8 +276,16 @@ vfs_setxattr(struct dentry *dentry, const char *name, 
> const void *value,
>  {
>   struct inode *inode = dentry->d_inode;
>   struct inode *delegated_inode = NULL;
> + const void  *orig_value = value;
>   int error;
>  
> + if (size && strcmp(name, XATTR_NAME_CAPS) == 0) {
> + error = cap_convert_nscap(dentry, &value, size);
> + if (error < 0)
> + return error;
> + size = error;
> + }
> +
>  retry_deleg:
>   inode_lock(inode);
>   error = __vfs_setxattr_locked(dentry, name, value, size, flags,
> @@ -289,6 +297,9 @@ vfs_setxattr(struct dentry *dentry, const char *name, 
> const void *value,
>   if (!error)
>   goto retry_deleg;
>   }
> + if (value != orig_value)
> + kfree(value);
> +
>   return error;
>  }
>  EXPORT_SYMBOL_GPL(vfs_setxattr);
> @@ -537,12 +548,6 @@ setxattr(struct dentry *d, const char __user *name, 
> const void __user *value,
>   if ((strcmp(kname, XATTR_NAME_POSIX_ACL_ACCESS) == 0) ||
>   (strcmp(kname, XATTR_NAME_POSIX_ACL_DEFAULT) == 0))
>   posix_acl_fix_xattr_from_user(kvalue, size);
> - else if (strcmp(kname, XATTR_NAME_CAPS) == 0) {
> - error = cap_convert_nscap(d, &kvalue, size);
> - if (error < 0)
> - goto out;
> - size = error;
> - }
>   }
>  
>   error = vfs_setxattr(d, kname, kvalue, size, flags);
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index 1e7fe311cabe..b2f698915c0f 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -270,6 +270,6 @@ static inline bool checkpoint_restore_ns_capable(struct 
> user_namespace *ns)
>  /* audit system wants to get cap info from files as well */
>  extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct 
> cpu_vfs_cap_data *cpu_caps);
>  
> -extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t 
> size);
> +extern int cap_convert_nscap(struct dentry *dentry, const void **ivalue, 
> size_t size);
>  
>  #endif /* !_LINUX_CAPABILITY_H */
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 59bf3c1674c8..baccd871 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -473,7 +473,7 @@ static bool validheader(size_t size, const struct 
> vfs_cap_data *cap)
>   *
>   * If all is ok, we return the new size, on error return < 0.
>   */
> -int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size)
> +int cap_convert_nscap(struct dentry *dentry, const void **ivalue, size_t 
> size)
>  {
>   struct vfs_ns_cap_data *nscap;
>   uid_t nsrootid;
> @@ -516,7 +516,6 @@ int cap_convert_nscap(struct dentry *dentry, void 
> **ivalue, size_t size)
>   nscap->magic_etc = cpu_to_le32(nsmagic);
> memcpy(&nscap->data, &cap->data, sizeof(__le32) * 2 * VFS_CAP_U32);
>  
> - kvfree(*ivalue);
>   *ivalue = nscap;
>   return newsize;
>  }


Re: Does uaccess_kernel() work for detecting kernel thread?

2020-12-22 Thread Eric W. Biederman
Tetsuo Handa  writes:

> Commit db68ce10c4f0a27c ("new helper: uaccess_kernel()") replaced 
> segment_eq(get_fs(), KERNEL_DS)
> with uaccess_kernel(). But uaccess_kernel() became an unconditional "false" 
> for some architectures
> due to commit 5e6e9852d6f76e01 ("uaccess: add infrastructure for kernel 
> builds with set_fs()") and
> follow up changes in Linux 5.10. As a result, I guess that uaccess_kernel() 
> can no longer be used
> as a condition for checking whether current thread is a kernel thread or not.
>
> For example, if uaccess_kernel() is "false" due to CONFIG_SET_FS=n,
> isn't sg_check_file_access() failing to detect kernel context?
>
> static int sg_check_file_access(struct file *filp, const char *caller)
> {
>   if (filp->f_cred != current_real_cred()) {
>   pr_err_once("%s: process %d (%s) changed security contexts 
> after opening file descriptor, this is not allowed.\n",
>   caller, task_tgid_vnr(current), current->comm);
>   return -EPERM;
>   }
>   if (uaccess_kernel()) {
>   pr_err_once("%s: process %d (%s) called from kernel context, 
> this is not allowed.\n",
>   caller, task_tgid_vnr(current), current->comm);
>   return -EACCES;
>   }
>   return 0;
> }
>
> For another example, if uaccess_kernel() is "false" due to CONFIG_SET_FS=n,
> isn't TOMOYO unexpectedly checking permissions for socket operations?
>
> static bool tomoyo_kernel_service(void)
> {
>   /* Nothing to do if I am a kernel service. */
>   return uaccess_kernel();
> }
>
> static u8 tomoyo_sock_family(struct sock *sk)
> {
>   u8 family;
>
>   if (tomoyo_kernel_service())
>   return 0;
>   family = sk->sk_family;
>   switch (family) {
>   case PF_INET:
>   case PF_INET6:
>   case PF_UNIX:
>   return family;
>   default:
>   return 0;
>   }
> }
>
> Don't we need to replace such usage with something like (current->flags & 
> PF_KTHREAD) ?
> I don't know about io_uring, but according to
> https://lkml.kernel.org/r/dacfb329-de66-d0cf-dcf9-f030ea137...@schaufler-ca.com
>  ,
> should (current->flags & (PF_KTHREAD | PF_IO_WORKER)) == PF_KTHREAD be used 
> instead?

I think you are reading the situation properly.

I skimmed the tomoyo code and it appears that you are excluding kernel
threads so as not to limit kernel threads such as nfsd.  For
PF_IO_WORKER kernel threads which are running code on behalf of a user
we want to perform the ordinary permission checks.  So you want
the idiom you pasted above.

I do wonder though if perhaps we should create an is_user_cred helper to
detect the difference between the creds of kernel threads and the creds
of ordinary userspace threads.  That would more cleanly handle io_uring,
which copies creds around and checks them at a later time.
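
A minimal sketch of the task flag version of that idiom (the helper
name is made up, and a cred based variant would need a marker in struct
cred itself):

	/* true for userspace and for io_uring workers acting on behalf
	 * of a user, false for ordinary kernel threads */
	static inline bool is_user_cred(void)
	{
		return (current->flags & (PF_KTHREAD | PF_IO_WORKER)) != PF_KTHREAD;
	}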

Eric




Re: [RFC PATCH] ptrace: make ptrace() fail if the tracee changed its pid unexpectedly

2020-12-21 Thread Eric W. Biederman
Oleg Nesterov  writes:

> On 12/17, Eric W. Biederman wrote:
>>
>> Oleg Nesterov  writes:
>>
> Suppose we have 2 threads, the group-leader L and a sub-thread T,
>> > both parked in ptrace_stop(). Debugger tries to resume both threads
>> > and does
>> >
>> >ptrace(PTRACE_CONT, T);
>> >ptrace(PTRACE_CONT, L);
>> >
> If the sub-thread T execs in between, the 2nd PTRACE_CONT does not
> resume the old leader L, it resumes the post-exec thread T which was
> actually now stopped in PTRACE_EVENT_EXEC. In this case the
> PTRACE_EVENT_EXEC event is lost, and the tracer can't know that the
> tracee changed its pid.
>>
>> The change seems sensible.  I don't expect this is common but it looks
>> painful to deal with if it happens.
>
> Yes, this is not a bug, but gdb can't handle this case without some help
> from the kernel.

>> I admit this threaded PTRACE_EVENT_EXEC case is the only event we are
>> likely to miss, but still.
>
> Yes, this is the only event debugger can miss even if it uses wait()
> correctly.

I think that is my confusion with the patch.  The uniqueness of this
case is not described well.


Eric


Re: [PATCH] Smack: Handle io_uring kernel thread privileges.

2020-12-21 Thread Eric W. Biederman
Casey Schaufler  writes:

> Smack assumes that kernel threads are privileged for smackfs
> operations. This was necessary because the credential of the
> kernel thread was not related to a user operation. With io_uring
> the credential does reflect a user's rights and can be used.

Acked-by: "Eric W. Biederman" 

>
> Suggested-by: Jens Axboe 
> Signed-off-by: Casey Schaufler 
> ---
>  security/smack/smack_access.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/security/smack/smack_access.c b/security/smack/smack_access.c
> index efe2406a3960..7eabb448acab 100644
> --- a/security/smack/smack_access.c
> +++ b/security/smack/smack_access.c
> @@ -688,9 +688,10 @@ bool smack_privileged_cred(int cap, const struct cred 
> *cred)
>  bool smack_privileged(int cap)
>  {
>   /*
> -  * All kernel tasks are privileged
> +  * Kernel threads may not have credentials we can use.
> +  * The io_uring kernel threads do have reliable credentials.
>*/
> - if (unlikely(current->flags & PF_KTHREAD))
> + if ((current->flags & (PF_KTHREAD | PF_IO_WORKER)) == PF_KTHREAD)
>   return true;
>  
>   return smack_privileged_cred(cap, current_cred());


Re: [PATCH] signal: Don't init struct kernel_siginfo fields to zero again

2020-12-21 Thread Eric W. Biederman
Leesoo Ahn  writes:

> clear_siginfo() is responsible for clearing a struct kernel_siginfo object.
> It's obvious that manually initializing those fields is needless, as
> a commit[1] explains why the function was introduced and its guarantee
> that all bits in the struct are cleared after it.

The fields that are explicitly set to zero are fields that must be
zero, not fields whose value should default to zero.

Given how difficult it is to keep track of which fields are relevant
in struct siginfo, I prefer the explicit approach that currently
exists.  Especially as the compiler appears smart enough to optimize
through this.

Mostly I am being touchy and conservative because getting all of these
fields set properly took me several kernel development cycles, and
before that there were bugs in setting those fields that persisted for
decades.

I think the current form allows someone to glance at the entry and see
that a field like si_errno is present and set to 0.  With your
suggested change someone needs to walk through the definition of the
union and see which union member is being invoked to see which fields
are present, to see if si_errno even exists, before it is possible
to see that si_errno is 0.
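
For SI_USER the fields in play are effectively this, a simplified view
of the generic siginfo layout:

	int	si_signo;
	int	si_errno;
	int	si_code;	/* selects the union member below */
	/* valid because si_code == SI_USER: */
	pid_t	si_pid;
	uid_t	si_uid;

The explicit stores document exactly which of those fields the reader
can rely on.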

So unless there is a reason more compelling than a few less lines of
code let's keep it this way for now.

Eric


> [1]: commit 8c5dbf2ae00b ("signal: Introduce clear_siginfo")
>
> Signed-off-by: Leesoo Ahn 
> ---
>  kernel/signal.c | 21 -
>  1 file changed, 21 deletions(-)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 5736c55aaa1a..8f49fa3ade33 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -603,10 +603,7 @@ static void collect_signal(int sig, struct sigpending 
> *list, kernel_siginfo_t *i
>*/
>   clear_siginfo(info);
>   info->si_signo = sig;
> - info->si_errno = 0;
>   info->si_code = SI_USER;
> - info->si_pid = 0;
> - info->si_uid = 0;
>   }
>  }
>  
> @@ -1120,7 +1117,6 @@ static int __send_signal(int sig, struct kernel_siginfo 
> *info, struct task_struc
>   case (unsigned long) SEND_SIG_NOINFO:
> clear_siginfo(&q->info);
>   q->info.si_signo = sig;
> - q->info.si_errno = 0;
>   q->info.si_code = SI_USER;
>   q->info.si_pid = task_tgid_nr_ns(current,
>   task_active_pid_ns(t));
> @@ -1133,10 +1129,7 @@ static int __send_signal(int sig, struct 
> kernel_siginfo *info, struct task_struc
>   case (unsigned long) SEND_SIG_PRIV:
> clear_siginfo(&q->info);
>   q->info.si_signo = sig;
> - q->info.si_errno = 0;
>   q->info.si_code = SI_KERNEL;
> - q->info.si_pid = 0;
> - q->info.si_uid = 0;
>   break;
>   default:
> copy_siginfo(&q->info, info);
> @@ -1623,10 +1616,7 @@ void force_sig(int sig)
>  
> clear_siginfo(&info);
>   info.si_signo = sig;
> - info.si_errno = 0;
>   info.si_code = SI_KERNEL;
> - info.si_pid = 0;
> - info.si_uid = 0;
> force_sig_info(&info);
>  }
>  EXPORT_SYMBOL(force_sig);
> @@ -1659,7 +1649,6 @@ int force_sig_fault_to_task(int sig, int code, void 
> __user *addr
>  
> clear_siginfo(&info);
>   info.si_signo = sig;
> - info.si_errno = 0;
>   info.si_code  = code;
>   info.si_addr  = addr;
>  #ifdef __ARCH_SI_TRAPNO
> @@ -1691,7 +1680,6 @@ int send_sig_fault(int sig, int code, void __user *addr
>  
> clear_siginfo(&info);
>   info.si_signo = sig;
> - info.si_errno = 0;
>   info.si_code  = code;
>   info.si_addr  = addr;
>  #ifdef __ARCH_SI_TRAPNO
> @@ -1712,7 +1700,6 @@ int force_sig_mceerr(int code, void __user *addr, short 
> lsb)
>   WARN_ON((code != BUS_MCEERR_AO) && (code != BUS_MCEERR_AR));
> clear_siginfo(&info);
>   info.si_signo = SIGBUS;
> - info.si_errno = 0;
>   info.si_code = code;
>   info.si_addr = addr;
>   info.si_addr_lsb = lsb;
> @@ -1726,7 +1713,6 @@ int send_sig_mceerr(int code, void __user *addr, short 
> lsb, struct task_struct *
>   WARN_ON((code != BUS_MCEERR_AO) && (code != BUS_MCEERR_AR));
> clear_siginfo(&info);
>   info.si_signo = SIGBUS;
> - info.si_errno = 0;
>   info.si_code = code;
>   info.si_addr = addr;
>   info.si_addr_lsb = lsb;
> @@ -1740,7 +1726,6 @@ int force_sig_bnderr(void __user *addr, void __user 
> *lower, void __user *upper)
>  
> clear_siginfo(&info);
>   info.si_signo = SIGSEGV;
> - info.si_errno = 0;
>   info.si_code  = SEGV_BNDERR;
>   info.si_addr  = addr;
>   info.si_lower = lower;
> @@ -1755,7 +1740,6 @@ int force_sig_pkuerr(void __user *addr, u32 pkey)
>  
> clear_siginfo(&info);
>   info.si_signo = SIGSEGV;
> - info.si_errno = 0;
