from:"Alexey Gladkov"

Re: 08ed4efad6: stress-ng.sigsegv.ops_per_sec -41.9% regression

2021-04-16 Thread Alexey Gladkov

On Thu, Apr 08, 2021 at 01:44:43PM -0500, Eric W. Biederman wrote:
> Linus Torvalds writes:
> 
> > On Thu, Apr 8, 2021 at 1:32 AM kernel test robot wrote:
> >>
> >> FYI, we noticed a -41.9% regression of stress-ng.sigsegv.ops_per_sec due to commit
> >> 08ed4efad684 ("[PATCH v10 6/9] Reimplement RLIMIT_SIGPENDING on top of ucounts")
> >
> > Ouch.
> 
> We were cautiously optimistic when no test problems showed up from
> the last posting that there was nothing to look at here.
> 
> Unfortunately it looks like the bots just missed the last posting. 
> 
> So it seems we are finally pretty much at correct code in need
> of performance tuning.
> 
> > I *think* this test may be testing "send so many signals that it
> > triggers the signal queue overflow case".
> >
> > And I *think* that the performance degradation may be due to lots of
> > unnecessary allocations, because it looks like that commit changes
> > __sigqueue_alloc() to do
> >
> > struct sigqueue *q = kmem_cache_alloc(sigqueue_cachep, flags);
> >
> > *before* checking the signal limit, and then if the signal limit was
> > exceeded, it will just be free'd instead.
> >
> > The old code would check the signal count against RLIMIT_SIGPENDING
> > *first*, and if there were more pending signals then it wouldn't do
> > anything at all (including not incrementing that expensive atomic
> > count).
> 
> This is an interesting test in a lot of ways as it is testing the
> synchronous signal delivery path caused by an exception.  The test
> is either executing *ptr = 0 (where ptr points to a read-only page)
> or it executes an x86 instruction that is excessively long.
> 
> I have found the code but I haven't figured out how it is being
> called yet.  The core loop is just:
>   for(;;) {
>           sigaction(SIGSEGV, , NULL);
>           sigaction(SIGILL, , NULL);
>           sigaction(SIGBUS, , NULL);
> 
>           ret = sigsetjmp(jmp_env, 1);
>           if (done())
>                   break;
>           if (ret) {
>                   /* verify signal */
>           } else {
>                   *ptr = 0;
>           }
>   }
> 
> Code like that fundamentally can not be multi-threaded.  So the only way
> the sigpending limit is being hit is if there are more processes running
> that code simultaneously than the size of the limit.
> 
> Further it looks like stress-ng pushes RLIMIT_SIGPENDING as high as it
> will go before the test starts.
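
For reference, here is a minimal user-space sketch of the kind of loop
quoted above, assuming Linux and covering only the SIGSEGV case. The names
fault_handler() and count_recovered_faults() are made up for illustration
and are not taken from stress-ng; the real test also installs handlers for
SIGILL and SIGBUS.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

static sigjmp_buf jmp_env;
static volatile sig_atomic_t last_signal;

/* Record which signal fired and jump back to the sigsetjmp() point. */
static void fault_handler(int sig)
{
	last_signal = sig;
	siglongjmp(jmp_env, 1);
}

/*
 * Hypothetical helper: provoke `rounds` faults by writing to a read-only
 * page and return how many came back as SIGSEGV (-1 on setup failure).
 */
int count_recovered_faults(int rounds)
{
	struct sigaction action, old_action;
	volatile int recovered = 0;	/* volatile: modified after sigsetjmp() */
	volatile int i = 0;
	volatile char *ptr;
	void *page;

	assert(rounds >= 0);

	page = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (page == MAP_FAILED)
		return -1;
	ptr = page;

	memset(&action, 0, sizeof(action));
	action.sa_handler = fault_handler;
	sigemptyset(&action.sa_mask);
	if (sigaction(SIGSEGV, &action, &old_action) < 0) {
		munmap(page, 4096);
		return -1;
	}

	while (i < rounds) {
		/* savemask=1 so the blocked SIGSEGV is unblocked again on return */
		if (sigsetjmp(jmp_env, 1)) {
			if (last_signal == SIGSEGV)
				recovered++;
			i++;
		} else {
			*ptr = 0;	/* faults: the page is mapped read-only */
			break;		/* only reached if no fault was delivered */
		}
	}

	sigaction(SIGSEGV, &old_action, NULL);
	munmap(page, 4096);
	return recovered;
}
```

Every iteration takes one synchronous exception, so the signal delivery
path (including the sigqueue allocation) is exercised once per loop.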
> 
> 
> > Also, the old code was very careful to only do the "get_user()" for
> > the *first* signal it added to the queue, and do the "put_user()" for
> > when removing the last signal. Exactly because those atomics are very
> > expensive.
> >
> > The new code just does a lot of these atomics unconditionally.
> 
> Yes. That seems a likely culprit.
> 
> > I dunno. The profile data in there is a bit hard to read, but there's
> > a lot more cache misses, and a *lot* of node crossers:
> >
> >>    5961544          +190.4%   17314361        perf-stat.i.cache-misses
> >>   22107466          +119.2%   48457656        perf-stat.i.cache-references
> >>     163292 ±  3%   +4582.0%    7645410        perf-stat.i.node-load-misses
> >>     227388 ±  2%   +3708.8%    8660824        perf-stat.i.node-loads
> >
> > and (probably as a result) average instruction costs have gone up enormously:
> >
> >>       3.47           +66.8%       5.79        perf-stat.overall.cpi
> >>      22849           -65.6%       7866        perf-stat.overall.cycles-between-cache-misses
> >
> > and it does seem to be at least partly about "put_ucounts()":
> >
> >>       0.00            +4.5        4.46        perf-profile.calltrace.cycles-pp.put_ucounts.__sigqueue_free.get_signal.arch_do_signal_or_restart.exit_to_user_mode_prepare
> >
> > and a lot of "get_ucounts()".
> >
> > But it may also be that the new "get sigpending" is just *so* much
> > more expensive than it used to be.
> 
> That too is possible.
> 
> That node-load-misses number does look like something is bouncing back
> and forth between the nodes a lot more.  So I suspect stress-ng is
> running multiple copies of the sigsegv test in different processes at
> once.
> 
> 
> 
> That really suggests cache line ping pong from get_ucounts and
> incrementing sigpending.
> 
> It surprises me that obtaining the cache lines exclusively is
> the dominant cost on this code path, but obtaining two cache lines
> exclusively instead of one cache line is consistent with
> the exception delivery taking nearly twice as long.
> 
> For the optimization we only care about the leaf count so with a little
> care we can restore the optimization.  So that is probably the thing
> to do here.  The fewer changes to worry about the less likely to find
> surprises.
> 
> 
> 
> That said for this specific case there is a lot of potential room for
> improvement.  As this is a per thread signal the code update

Re: 08ed4efad6: stress-ng.sigsegv.ops_per_sec -41.9% regression

2021-04-08 Thread Alexey Gladkov

On Thu, Apr 08, 2021 at 09:22:40AM -0700, Linus Torvalds wrote:
> On Thu, Apr 8, 2021 at 1:32 AM kernel test robot wrote:
> >
> > FYI, we noticed a -41.9% regression of stress-ng.sigsegv.ops_per_sec due to commit
> > 08ed4efad684 ("[PATCH v10 6/9] Reimplement RLIMIT_SIGPENDING on top of ucounts")
> 
> Ouch.
> 
> I *think* this test may be testing "send so many signals that it
> triggers the signal queue overflow case".
> 
> And I *think* that the performance degradation may be due to lots of
> unnecessary allocations, because it looks like that commit changes
> __sigqueue_alloc() to do
> 
> struct sigqueue *q = kmem_cache_alloc(sigqueue_cachep, flags);
> 
> *before* checking the signal limit, and then if the signal limit was
> exceeded, it will just be free'd instead.
> 
> The old code would check the signal count against RLIMIT_SIGPENDING
> *first*, and if there were more pending signals then it wouldn't do
> anything at all (including not incrementing that expensive atomic
> count).
> 
> Also, the old code was very careful to only do the "get_user()" for
> the *first* signal it added to the queue, and do the "put_user()" for
> when removing the last signal. Exactly because those atomics are very
> expensive.
> 
> The new code just does a lot of these atomics unconditionally.

Yes and right now I'm trying to rewrite this patch.

> I dunno. The profile data in there is a bit hard to read, but there's
> a lot more cache misses, and a *lot* of node crossers:
> 
> >    5961544          +190.4%   17314361        perf-stat.i.cache-misses
> >   22107466          +119.2%   48457656        perf-stat.i.cache-references
> >     163292 ±  3%   +4582.0%    7645410        perf-stat.i.node-load-misses
> >     227388 ±  2%   +3708.8%    8660824        perf-stat.i.node-loads
> 
> and (probably as a result) average instruction costs have gone up enormously:
> 
> >       3.47           +66.8%       5.79        perf-stat.overall.cpi
> >      22849           -65.6%       7866        perf-stat.overall.cycles-between-cache-misses
> 
> and it does seem to be at least partly about "put_ucounts()":
> 
> >       0.00            +4.5        4.46        perf-profile.calltrace.cycles-pp.put_ucounts.__sigqueue_free.get_signal.arch_do_signal_or_restart.exit_to_user_mode_prepare
> 
> and a lot of "get_ucounts()".
> 
> But it may also be that the new "get sigpending" is just *so* much
> more expensive than it used to be.

Thanks for decoding this! I spent some time trying to understand this
report and still wasn't sure I understood it.

-- 
Rgrds, legion
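
To make the ordering discussed above concrete, here is a small user-space
model of the old behaviour, written after the lines that patch 6/9 removes
from __sigqueue_alloc(). It is only a sketch: struct user_model,
struct sigqueue_model and the model_*() functions are hypothetical
stand-ins, C11 atomics replace the kernel's atomic_t, and malloc() replaces
kmem_cache_alloc().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-user state; not the kernel's struct. */
struct user_model {
	atomic_int sigpending;	/* queued signals charged to this user */
	atomic_int refs;	/* stand-in for the user refcount */
};

struct sigqueue_model {
	struct user_model *user;
};

struct sigqueue_model *model_sigqueue_alloc(struct user_model *user,
					    int limit, bool override_rlimit)
{
	struct sigqueue_model *q = NULL;
	int pending = atomic_fetch_add(&user->sigpending, 1) + 1;

	/* Take the reference only on the 0 -> 1 transition. */
	if (pending == 1)
		atomic_fetch_add(&user->refs, 1);

	/* Check the limit first; allocate only if the signal may be queued. */
	if (override_rlimit || pending <= limit)
		q = malloc(sizeof(*q));

	if (q == NULL) {
		/* Undo the charge; drop the reference on the 1 -> 0 transition. */
		if (atomic_fetch_sub(&user->sigpending, 1) == 1)
			atomic_fetch_sub(&user->refs, 1);
		return NULL;
	}

	q->user = user;
	return q;
}

void model_sigqueue_free(struct sigqueue_model *q)
{
	struct user_model *user = q->user;
	int prev = atomic_fetch_sub(&user->sigpending, 1);

	assert(prev > 0);
	if (prev == 1)
		atomic_fetch_sub(&user->refs, 1);
	free(q);
}
```

In this model an over-limit request never reaches the allocator, and the
reference count is touched only when the pending count moves between zero
and one.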

[PATCH v10 9/9] ucounts: Set ucount_max to the largest positive value the type can hold

2021-04-07 Thread Alexey Gladkov

ns->ucount_max[] is a signed long, so it cannot hold every value an
rlimit (an unsigned long) can take. We have to protect ucount_max[] from
overflow and store at most the largest value it can hold.

On 32-bit, using "long" instead of "unsigned long" to hold the counts has
the downside that RLIMIT_MSGQUEUE and RLIMIT_MEMLOCK are limited to 2GiB
instead of 4GiB. I don't think anyone cares, but it should be mentioned
in case someone does.

RLIMIT_NPROC and RLIMIT_SIGPENDING used atomic_t, so their maximum
hasn't changed.

Signed-off-by: Alexey Gladkov 
---
 include/linux/user_namespace.h | 6 ++++++
 kernel/fork.c                  | 8 ++++----
 kernel/user_namespace.c        | 8 ++++----
 3 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 82851fba7278..1c778182f5d5 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -123,6 +123,12 @@ bool inc_rlimit_ucounts_and_test(struct ucounts *ucounts, enum ucount_type type,
 void dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
 bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long max);
 
+static inline void set_rlimit_ucount_max(struct user_namespace *ns,
+   enum ucount_type type, unsigned long max)
+{
+   ns->ucount_max[type] = max <= LONG_MAX ? max : LONG_MAX;
+}
+
 #ifdef CONFIG_USER_NS
 
 static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
diff --git a/kernel/fork.c b/kernel/fork.c
index a3a5e317c3c0..2cd01c443196 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -822,10 +822,10 @@ void __init fork_init(void)
for (i = 0; i < MAX_PER_NAMESPACE_UCOUNTS; i++)
init_user_ns.ucount_max[i] = max_threads/2;
 
-   init_user_ns.ucount_max[UCOUNT_RLIMIT_NPROC] = task_rlimit(&init_task, RLIMIT_NPROC);
-   init_user_ns.ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = task_rlimit(&init_task, RLIMIT_MSGQUEUE);
-   init_user_ns.ucount_max[UCOUNT_RLIMIT_SIGPENDING] = task_rlimit(&init_task, RLIMIT_SIGPENDING);
-   init_user_ns.ucount_max[UCOUNT_RLIMIT_MEMLOCK] = task_rlimit(&init_task, RLIMIT_MEMLOCK);
+   set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_NPROC, task_rlimit(&init_task, RLIMIT_NPROC));
+   set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_MSGQUEUE, task_rlimit(&init_task, RLIMIT_MSGQUEUE));
+   set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_SIGPENDING, task_rlimit(&init_task, RLIMIT_SIGPENDING));
+   set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_MEMLOCK, task_rlimit(&init_task, RLIMIT_MEMLOCK));
 
 #ifdef CONFIG_VMAP_STACK
cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:vm_stack_cache",
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 5ef0d4b182ba..df7651935fd5 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -121,10 +121,10 @@ int create_user_ns(struct cred *new)
for (i = 0; i < MAX_PER_NAMESPACE_UCOUNTS; i++) {
ns->ucount_max[i] = INT_MAX;
}
-   ns->ucount_max[UCOUNT_RLIMIT_NPROC] = rlimit(RLIMIT_NPROC);
-   ns->ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = rlimit(RLIMIT_MSGQUEUE);
-   ns->ucount_max[UCOUNT_RLIMIT_SIGPENDING] = rlimit(RLIMIT_SIGPENDING);
-   ns->ucount_max[UCOUNT_RLIMIT_MEMLOCK] = rlimit(RLIMIT_MEMLOCK);
+   set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC));
+   set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE));
+   set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING));
+   set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK));
ns->ucounts = ucounts;
 
/* Inherit USERNS_SETGROUPS_ALLOWED from our parent */
-- 
2.29.3
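
As a quick illustration of the clamping done by set_rlimit_ucount_max()
in the patch above, here is a user-space copy of the same expression. The
function name clamp_rlimit_to_long() is hypothetical; only the comparison
against LONG_MAX comes from the patch.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical user-space copy of the clamping expression in the patch. */
long clamp_rlimit_to_long(unsigned long max)
{
	long result = max <= LONG_MAX ? (long)max : LONG_MAX;

	/* The stored maximum is never negative, whatever the rlimit was. */
	assert(result >= 0);
	return result;
}
```

Without the clamp, an rlimit above LONG_MAX (for example an all-ones
"unlimited" value) would turn into a negative number when stored in a
signed long, and every count would then look over the limit.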

[PATCH v10 6/9] Reimplement RLIMIT_SIGPENDING on top of ucounts

2021-04-07 Thread Alexey Gladkov

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the limits of the parent
user_namespaces still cannot be exceeded.

v10:
* Fix memory leak on get_ucounts failure.

Signed-off-by: Alexey Gladkov 
---
 fs/proc/array.c|  2 +-
 include/linux/sched/user.h |  1 -
 include/linux/signal_types.h   |  4 ++-
 include/linux/user_namespace.h |  1 +
 kernel/fork.c  |  1 +
 kernel/signal.c| 58 --
 kernel/ucount.c|  1 +
 kernel/user.c  |  1 -
 kernel/user_namespace.c|  1 +
 9 files changed, 35 insertions(+), 35 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index bb87e4d89cd8..74b0ea4b7e38 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -284,7 +284,7 @@ static inline void task_sig(struct seq_file *m, struct task_struct *p)
collect_sigign_sigcatch(p, , );
num_threads = get_nr_threads(p);
rcu_read_lock();  /* FIXME: is this correct? */
-   qsize = atomic_read(&__task_cred(p)->user->sigpending);
+   qsize = get_ucounts_value(task_ucounts(p), UCOUNT_RLIMIT_SIGPENDING);
rcu_read_unlock();
qlim = task_rlimit(p, RLIMIT_SIGPENDING);
unlock_task_sighand(p, );
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 8a34446681aa..8ba9cec4fb99 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -12,7 +12,6 @@
  */
 struct user_struct {
refcount_t __count; /* reference count */
-   atomic_t sigpending;    /* How many pending signals does this user have? */
 #ifdef CONFIG_FANOTIFY
atomic_t fanotify_listeners;
 #endif
diff --git a/include/linux/signal_types.h b/include/linux/signal_types.h
index 68e06c75c5b2..34cb28b8f16c 100644
--- a/include/linux/signal_types.h
+++ b/include/linux/signal_types.h
@@ -13,6 +13,8 @@ typedef struct kernel_siginfo {
__SIGINFO;
 } kernel_siginfo_t;
 
+struct ucounts;
+
 /*
  * Real Time signals may be queued.
  */
@@ -21,7 +23,7 @@ struct sigqueue {
struct list_head list;
int flags;
kernel_siginfo_t info;
-   struct user_struct *user;
+   struct ucounts *ucounts;
 };
 
 /* flags values. */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index d0fea0306394..6e8736c7aa29 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -52,6 +52,7 @@ enum ucount_type {
 #endif
UCOUNT_RLIMIT_NPROC,
UCOUNT_RLIMIT_MSGQUEUE,
+   UCOUNT_RLIMIT_SIGPENDING,
UCOUNT_COUNTS,
 };
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 85c6094f5a48..741f896c156e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -824,6 +824,7 @@ void __init fork_init(void)
 
init_user_ns.ucount_max[UCOUNT_RLIMIT_NPROC] = task_rlimit(&init_task, RLIMIT_NPROC);
init_user_ns.ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = task_rlimit(&init_task, RLIMIT_MSGQUEUE);
+   init_user_ns.ucount_max[UCOUNT_RLIMIT_SIGPENDING] = task_rlimit(&init_task, RLIMIT_SIGPENDING);
 
 #ifdef CONFIG_VMAP_STACK
cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:vm_stack_cache",
diff --git a/kernel/signal.c b/kernel/signal.c
index f2a1b898da29..4e80386acec7 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -413,49 +413,45 @@ void task_join_group_stop(struct task_struct *task)
 static struct sigqueue *
__sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimit)
 {
-   struct sigqueue *q = NULL;
-   struct user_struct *user;
-   int sigpending;
+   struct sigqueue *q = kmem_cache_alloc(sigqueue_cachep, flags);
 
-   /*
-* Protect access to @t credentials. This can go away when all
-* callers hold rcu read lock.
-*
-* NOTE! A pending signal will hold on to the user refcount,
-* and we get/put the refcount only when the sigpending count
-* changes from/to zero.
-*/
-   rcu_read_lock();
-   user = __task_cred(t)->user;
-   sigpending = atomic_inc_return(&user->sigpending);
-   if (sigpending == 1)
-   get_uid(user);
-   rcu_read_unlock();
+   if (likely(q != NULL)) {
+   bool overlimit;
 
-   if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
-   q = kmem_cache_alloc(sigqueue_cachep, flags);
-   } else {
-   print_dropped_signal(sig);
-   }
-
-   if (unlikely(q == NULL)) {
-   if (atomic_dec_and_test(&user->sigpending))
-   free_uid(user);
-   } else {
INIT_LIST_HEAD(&q->list);
q->flags = 0;
-   q->user = user;
+
+   /*
+* Protect access to

[PATCH v10 7/9] Reimplement RLIMIT_MEMLOCK on top of ucounts

2021-04-07 Thread Alexey Gladkov

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the limits of the parent
user_namespaces still cannot be exceeded.

Changelog
v8:
* Fix issues found by lkp-tests project.

v7:
* Keep only ucounts for RLIMIT_MEMLOCK checks instead of struct cred.

v6:
* Fix bug in hugetlb_file_setup() detected by trinity.

Reported-by: kernel test robot 
Signed-off-by: Alexey Gladkov 
---
 fs/hugetlbfs/inode.c   | 16 
 include/linux/hugetlb.h|  4 ++--
 include/linux/mm.h |  4 ++--
 include/linux/sched/user.h |  1 -
 include/linux/shmem_fs.h   |  2 +-
 include/linux/user_namespace.h |  1 +
 ipc/shm.c  | 26 +-
 kernel/fork.c  |  1 +
 kernel/ucount.c|  1 +
 kernel/user.c  |  1 -
 kernel/user_namespace.c|  1 +
 mm/memfd.c |  4 ++--
 mm/mlock.c | 23 +++
 mm/mmap.c  |  4 ++--
 mm/shmem.c |  8 
 15 files changed, 53 insertions(+), 44 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 701c82c36138..be519fc9559a 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1443,7 +1443,7 @@ static int get_hstate_idx(int page_size_log)
  * otherwise hugetlb_reserve_pages reserves one less hugepages than intended.
  */
 struct file *hugetlb_file_setup(const char *name, size_t size,
-   vm_flags_t acctflag, struct user_struct **user,
+   vm_flags_t acctflag, struct ucounts **ucounts,
int creat_flags, int page_size_log)
 {
struct inode *inode;
@@ -1455,20 +1455,20 @@ struct file *hugetlb_file_setup(const char *name, size_t size,
if (hstate_idx < 0)
return ERR_PTR(-ENODEV);
 
-   *user = NULL;
+   *ucounts = NULL;
mnt = hugetlbfs_vfsmount[hstate_idx];
if (!mnt)
return ERR_PTR(-ENOENT);
 
if (creat_flags == HUGETLB_SHMFS_INODE && !can_do_hugetlb_shm()) {
-   *user = current_user();
-   if (user_shm_lock(size, *user)) {
+   *ucounts = current_ucounts();
+   if (user_shm_lock(size, *ucounts)) {
task_lock(current);
pr_warn_once("%s (%d): Using mlock ulimits for SHM_HUGETLB is deprecated\n",
current->comm, current->pid);
task_unlock(current);
} else {
-   *user = NULL;
+   *ucounts = NULL;
return ERR_PTR(-EPERM);
}
}
@@ -1495,9 +1495,9 @@ struct file *hugetlb_file_setup(const char *name, size_t size,
 
iput(inode);
 out:
-   if (*user) {
-   user_shm_unlock(size, *user);
-   *user = NULL;
+   if (*ucounts) {
+   user_shm_unlock(size, *ucounts);
+   *ucounts = NULL;
}
return file;
 }
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index cccd1aab69dd..96d63dbdec65 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -434,7 +434,7 @@ static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode)
 extern const struct file_operations hugetlbfs_file_operations;
 extern const struct vm_operations_struct hugetlb_vm_ops;
 struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
-   struct user_struct **user, int creat_flags,
+   struct ucounts **ucounts, int creat_flags,
int page_size_log);
 
 static inline bool is_file_hugepages(struct file *file)
@@ -454,7 +454,7 @@ static inline struct hstate *hstate_inode(struct inode *i)
 #define is_file_hugepages(file)false
 static inline struct file *
 hugetlb_file_setup(const char *name, size_t size, vm_flags_t acctflag,
-   struct user_struct **user, int creat_flags,
+   struct ucounts **ucounts, int creat_flags,
int page_size_log)
 {
return ERR_PTR(-ENOSYS);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 64a71bf20536..7466eab000d0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1658,8 +1658,8 @@ extern bool can_do_mlock(void);
 #else
 static inline bool can_do_mlock(void) { return false; }
 #endif
-extern int user_shm_lock(size_t, struct user_struct *);
-extern void user_shm_unlock(size_t, struct user_struct *);
+extern int user_shm_lock(size_t, struct ucounts *);
+extern void user_shm_unlock(size_t, struct ucounts *);
 
 /*
  * Parameter block passed down to zap_pte_range in exceptional cases.
diff --git a/include/linux/sched/u

[PATCH v10 8/9] kselftests: Add test to check for rlimit changes in different user namespaces

2021-04-07 Thread Alexey Gladkov

The testcase runs a few instances of the program with RLIMIT_NPROC=1 as
user uid=6, in different user namespaces.

Signed-off-by: Alexey Gladkov 
---
 tools/testing/selftests/Makefile  |   1 +
 tools/testing/selftests/rlimits/.gitignore|   2 +
 tools/testing/selftests/rlimits/Makefile  |   6 +
 tools/testing/selftests/rlimits/config|   1 +
 .../selftests/rlimits/rlimits-per-userns.c| 161 ++
 5 files changed, 171 insertions(+)
 create mode 100644 tools/testing/selftests/rlimits/.gitignore
 create mode 100644 tools/testing/selftests/rlimits/Makefile
 create mode 100644 tools/testing/selftests/rlimits/config
 create mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 6c575cf34a71..a4ea1481bd9a 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -48,6 +48,7 @@ TARGETS += proc
 TARGETS += pstore
 TARGETS += ptrace
 TARGETS += openat2
+TARGETS += rlimits
 TARGETS += rseq
 TARGETS += rtc
 TARGETS += seccomp
diff --git a/tools/testing/selftests/rlimits/.gitignore b/tools/testing/selftests/rlimits/.gitignore
new file mode 100644
index ..091021f255b3
--- /dev/null
+++ b/tools/testing/selftests/rlimits/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+rlimits-per-userns
diff --git a/tools/testing/selftests/rlimits/Makefile b/tools/testing/selftests/rlimits/Makefile
new file mode 100644
index ..03aadb406212
--- /dev/null
+++ b/tools/testing/selftests/rlimits/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+CFLAGS += -Wall -O2 -g
+TEST_GEN_PROGS := rlimits-per-userns
+
+include ../lib.mk
diff --git a/tools/testing/selftests/rlimits/config b/tools/testing/selftests/rlimits/config
new file mode 100644
index ..416bd53ce982
--- /dev/null
+++ b/tools/testing/selftests/rlimits/config
@@ -0,0 +1 @@
+CONFIG_USER_NS=y
diff --git a/tools/testing/selftests/rlimits/rlimits-per-userns.c b/tools/testing/selftests/rlimits/rlimits-per-userns.c
new file mode 100644
index ..26dc949e93ea
--- /dev/null
+++ b/tools/testing/selftests/rlimits/rlimits-per-userns.c
@@ -0,0 +1,161 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Alexey Gladkov 
+ */
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define NR_CHILDS 2
+
+static char *service_prog;
+static uid_t user   = 6;
+static uid_t group  = 6;
+
+static void setrlimit_nproc(rlim_t n)
+{
+   pid_t pid = getpid();
+   struct rlimit limit = {
+   .rlim_cur = n,
+   .rlim_max = n
+   };
+
+   warnx("(pid=%d): Setting RLIMIT_NPROC=%ld", pid, n);
+
+   if (setrlimit(RLIMIT_NPROC, &limit) < 0)
+   err(EXIT_FAILURE, "(pid=%d): setrlimit(RLIMIT_NPROC)", pid);
+}
+
+static pid_t fork_child(void)
+{
+   pid_t pid = fork();
+
+   if (pid < 0)
+   err(EXIT_FAILURE, "fork");
+
+   if (pid > 0)
+   return pid;
+
+   pid = getpid();
+
+   warnx("(pid=%d): New process starting ...", pid);
+
+   if (prctl(PR_SET_PDEATHSIG, SIGKILL) < 0)
+   err(EXIT_FAILURE, "(pid=%d): prctl(PR_SET_PDEATHSIG)", pid);
+
+   signal(SIGUSR1, SIG_DFL);
+
+   warnx("(pid=%d): Changing to uid=%d, gid=%d", pid, user, group);
+
+   if (setgid(group) < 0)
+   err(EXIT_FAILURE, "(pid=%d): setgid(%d)", pid, group);
+   if (setuid(user) < 0)
+   err(EXIT_FAILURE, "(pid=%d): setuid(%d)", pid, user);
+
+   warnx("(pid=%d): Service running ...", pid);
+
+   warnx("(pid=%d): Unshare user namespace", pid);
+   if (unshare(CLONE_NEWUSER) < 0)
+   err(EXIT_FAILURE, "unshare(CLONE_NEWUSER)");
+
+   char *const argv[] = { "service", NULL };
+   char *const envp[] = { "I_AM_SERVICE=1", NULL };
+
+   warnx("(pid=%d): Executing real service ...", pid);
+
+   execve(service_prog, argv, envp);
+   err(EXIT_FAILURE, "(pid=%d): execve", pid);
+}
+
+int main(int argc, char **argv)
+{
+   size_t i;
+   pid_t child[NR_CHILDS];
+   int wstatus[NR_CHILDS];
+   int childs = NR_CHILDS;
+   pid_t pid;
+
+   if (getenv("I_AM_SERVICE")) {
+   pause();
+   exit(EXIT_SUCCESS);
+   }
+
+   service_prog = argv[0];
+   pid = getpid();
+
+   warnx("(pid=%d) Starting testcase", pid);
+
+   /*
+* This rlimit is not a problem for root because it can be exceeded.
+*/
+   setrlimit_nproc(1);
+
+   for (i = 0; i < NR_CH

[PATCH v10 2/9] Add a reference to ucounts for each cred

2021-04-07 Thread Alexey Gladkov

For RLIMIT_NPROC and some other rlimits, the user_struct that holds the
global limit is kept alive for the lifetime of a process by keeping it
in struct cred. Adding a pointer to ucounts in struct cred makes it
possible to track RLIMIT_NPROC not only per user in the system, but also
per user in a user_namespace.

Updating ucounts may require a memory allocation, which may fail. So we
cannot change cred.ucounts in commit_creds(), because that function
cannot fail and must always return 0. For this reason, we modify
cred.ucounts before calling commit_creds().

Changelog

v6:
* Fix null-ptr-deref in is_ucounts_overlimit() detected by trinity. This
  error was caused by the fact that cred_alloc_blank() left the ucounts
  pointer empty.

Reported-by: kernel test robot 
Signed-off-by: Alexey Gladkov 
---
 fs/exec.c  |  4 
 include/linux/cred.h   |  2 ++
 include/linux/user_namespace.h |  4 
 kernel/cred.c  | 40 ++
 kernel/fork.c  |  6 +
 kernel/sys.c   | 12 ++
 kernel/ucount.c| 40 +++---
 kernel/user_namespace.c|  3 +++
 8 files changed, 108 insertions(+), 3 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 18594f11c31f..d7c4187ca023 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1360,6 +1360,10 @@ int begin_new_exec(struct linux_binprm * bprm)
WRITE_ONCE(me->self_exec_id, me->self_exec_id + 1);
flush_signal_handlers(me, 0);
 
+   retval = set_cred_ucounts(bprm->cred);
+   if (retval < 0)
+   goto out_unlock;
+
/*
 * install the new credentials for this executable
 */
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 4c6350503697..66436e655032 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -144,6 +144,7 @@ struct cred {
 #endif
struct user_struct *user;   /* real user ID subscription */
struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
+   struct ucounts *ucounts;
struct group_info *group_info;  /* supplementary groups for euid/fsgid */
/* RCU deletion */
union {
@@ -170,6 +171,7 @@ extern int set_security_override_from_ctx(struct cred *, const char *);
 extern int set_create_files_as(struct cred *, struct inode *);
 extern int cred_fscmp(const struct cred *, const struct cred *);
 extern void __init cred_init(void);
+extern int set_cred_ucounts(struct cred *);
 
 /*
  * check for validity of credentials
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 0bb833fd41f4..f71b5a4a3e74 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -97,11 +97,15 @@ struct ucounts {
 };
 
 extern struct user_namespace init_user_ns;
+extern struct ucounts init_ucounts;
 
 bool setup_userns_sysctls(struct user_namespace *ns);
 void retire_userns_sysctls(struct user_namespace *ns);
 struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum ucount_type type);
 void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
+struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid);
+struct ucounts *get_ucounts(struct ucounts *ucounts);
+void put_ucounts(struct ucounts *ucounts);
 
 #ifdef CONFIG_USER_NS
 
diff --git a/kernel/cred.c b/kernel/cred.c
index 421b1149c651..58a8a9e24347 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -60,6 +60,7 @@ struct cred init_cred = {
.user   = INIT_USER,
.user_ns= &init_user_ns,
.group_info = &init_groups,
+   .ucounts= &init_ucounts,
 };
 
 static inline void set_cred_subscribers(struct cred *cred, int n)
@@ -119,6 +120,8 @@ static void put_cred_rcu(struct rcu_head *rcu)
if (cred->group_info)
put_group_info(cred->group_info);
free_uid(cred->user);
+   if (cred->ucounts)
+   put_ucounts(cred->ucounts);
put_user_ns(cred->user_ns);
kmem_cache_free(cred_jar, cred);
 }
@@ -222,6 +225,7 @@ struct cred *cred_alloc_blank(void)
 #ifdef CONFIG_DEBUG_CREDENTIALS
new->magic = CRED_MAGIC;
 #endif
+   new->ucounts = get_ucounts(&init_ucounts);
 
if (security_cred_alloc_blank(new, GFP_KERNEL_ACCOUNT) < 0)
goto error;
@@ -284,6 +288,11 @@ struct cred *prepare_creds(void)
 
if (security_prepare_creds(new, old, GFP_KERNEL_ACCOUNT) < 0)
goto error;
+
+   new->ucounts = get_ucounts(new->ucounts);
+   if (!new->ucounts)
+   goto error;
+
validate_creds(new);
return new;
 
@@ -363,6 +372,8 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
ret = create_user_ns(new);
if (ret < 0)
goto error_put;
+   if (set_cred_uc

[PATCH v10 5/9] Reimplement RLIMIT_MSGQUEUE on top of ucounts

2021-04-07 Thread Alexey Gladkov

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the limits of the parent
user_namespaces still cannot be exceeded.
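
The accounting idea described above can be modelled in a few lines of
user-space C. This is a simplified, single-threaded sketch and not the
kernel implementation: struct ns_counter, charge_and_test(), uncharge()
and try_charge() are hypothetical names, each level carries its own
maximum, and plain longs stand in for the kernel's atomic counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-namespace counter for one user; parent is the outer ns. */
struct ns_counter {
	long count;
	long max;
	struct ns_counter *parent;
};

/* Add v at every level; report whether any level ended up over its max. */
bool charge_and_test(struct ns_counter *c, long v)
{
	bool overlimit = false;

	assert(v >= 0);
	for (struct ns_counter *it = c; it != NULL; it = it->parent) {
		it->count += v;
		if (it->count > it->max)
			overlimit = true;
	}
	return overlimit;
}

/* Undo a previous charge of v at every level. */
void uncharge(struct ns_counter *c, long v)
{
	for (struct ns_counter *it = c; it != NULL; it = it->parent)
		it->count -= v;
}

/* Charge v, rolling back if any level in the chain would be exceeded. */
bool try_charge(struct ns_counter *c, long v)
{
	if (charge_and_test(c, v)) {
		uncharge(c, v);
		return false;
	}
	return true;
}
```

With a child namespace whose own limit is generous but whose parent allows
only 3 units, a second charge of 2 fails and leaves both counters at 2.
The charge-then-roll-back shape is the same one the ipc/mqueue.c hunk in
this patch uses around inc_rlimit_ucounts_and_test() and
dec_rlimit_ucounts().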

Signed-off-by: Alexey Gladkov 
---
 include/linux/sched/user.h |  4 
 include/linux/user_namespace.h |  1 +
 ipc/mqueue.c   | 41 ++
 kernel/fork.c  |  1 +
 kernel/ucount.c|  1 +
 kernel/user_namespace.c|  1 +
 6 files changed, 26 insertions(+), 23 deletions(-)

diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index d33d867ad6c1..8a34446681aa 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -18,10 +18,6 @@ struct user_struct {
 #endif
 #ifdef CONFIG_EPOLL
atomic_long_t epoll_watches; /* The number of file descriptors currently watched */
-#endif
-#ifdef CONFIG_POSIX_MQUEUE
-   /* protected by mq_lock */
-   unsigned long mq_bytes; /* How many bytes can be allocated to mqueue? */
 #endif
unsigned long locked_shm; /* How many pages of mlocked shm ? */
unsigned long unix_inflight; /* How many files in flight in unix sockets */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 9d1ca370c201..d0fea0306394 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -51,6 +51,7 @@ enum ucount_type {
UCOUNT_INOTIFY_WATCHES,
 #endif
UCOUNT_RLIMIT_NPROC,
+   UCOUNT_RLIMIT_MSGQUEUE,
UCOUNT_COUNTS,
 };
 
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 8031464ed4ae..f0f8f013dae2 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -144,7 +144,7 @@ struct mqueue_inode_info {
struct pid *notify_owner;
u32 notify_self_exec_id;
struct user_namespace *notify_user_ns;
-   struct user_struct *user;   /* user who created, for accounting */
+   struct ucounts *ucounts;/* user who created, for accounting */
struct sock *notify_sock;
struct sk_buff *notify_cookie;
 
@@ -292,7 +292,6 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
struct ipc_namespace *ipc_ns, umode_t mode,
struct mq_attr *attr)
 {
-   struct user_struct *u = current_user();
struct inode *inode;
int ret = -ENOMEM;
 
@@ -321,7 +320,7 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
info->notify_owner = NULL;
info->notify_user_ns = NULL;
info->qsize = 0;
-   info->user = NULL;  /* set when all is ok */
+   info->ucounts = NULL;   /* set when all is ok */
info->msg_tree = RB_ROOT;
info->msg_tree_rightmost = NULL;
info->node_cache = NULL;
@@ -371,19 +370,24 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
if (mq_bytes + mq_treesize < mq_bytes)
goto out_inode;
mq_bytes += mq_treesize;
-   spin_lock(&mq_lock);
-   if (u->mq_bytes + mq_bytes < u->mq_bytes ||
-   u->mq_bytes + mq_bytes > rlimit(RLIMIT_MSGQUEUE)) {
+   info->ucounts = get_ucounts(current_ucounts());
+   if (info->ucounts) {
+   bool overlimit;
+
+   spin_lock(&mq_lock);
+   overlimit = inc_rlimit_ucounts_and_test(info->ucounts, UCOUNT_RLIMIT_MSGQUEUE,
+   mq_bytes, rlimit(RLIMIT_MSGQUEUE));
+   if (overlimit) {
+   dec_rlimit_ucounts(info->ucounts, UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
+   spin_unlock(&mq_lock);
+   put_ucounts(info->ucounts);
+   info->ucounts = NULL;
+   /* mqueue_evict_inode() releases info->messages */
+   ret = -EMFILE;
+   goto out_inode;
+   }
spin_unlock(&mq_lock);
-   /* mqueue_evict_inode() releases info->messages */
-   ret = -EMFILE;
-   goto out_inode;
}
-   u->mq_bytes += mq_bytes;
-   spin_unlock(&mq_lock);
-
-   /* all is ok */
-   info->user = get_uid(u);
} else if (S_ISDIR(mode)) {
inc_nlink(inode);
/* Some things misbehave if size == 0 on a directory */
@@ -497,7 +501,6 @@ static void mqueue_free_inode(struct inode *inode)
 static void mqueue_evict_inode(struct inode *inode)
 {
struct mqueue_inode_info *info;
-   struct user_struct *user;
struct ipc_namespace *ipc_ns;
struct msg_msg

[PATCH v10 3/9] Use atomic_t for ucounts reference counting

2021-04-07 Thread Alexey Gladkov

The current implementation of the ucounts reference counter requires
taking a spin_lock. We're going to use get_ucounts() in more
performance-critical areas, such as the handling of RLIMIT_SIGPENDING.

With an atomic_t counter, spin_lock is needed only when changing the
hashtable.

v10:
* Always try to put ucounts in case we cannot increase ucounts->count.
  This covers the case where all consumers return their ucounts at once.

v9:
* Use a negative value to check that the ucounts->count is close to
  overflow.
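
The get/put scheme described above can be modeled in userspace roughly as
follows. This is only a sketch using C11 atomics, not the kernel code:
struct obj, obj_get() and obj_put() are hypothetical names, and a flag
stands in for kfree() and the hashtable removal.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ucounts; only the refcount is modeled. */
struct obj {
	atomic_int count;
	bool freed;	/* set instead of calling free() so the effect is observable */
};

/* Drop one reference; the caller that drops the last one releases the object. */
void obj_put(struct obj *o)
{
	/* atomic_fetch_sub() returns the previous value. */
	if (atomic_fetch_sub(&o->count, 1) == 1)
		o->freed = true;
}

/*
 * Take one reference without a lock. Mirrors the idea of
 * atomic_add_negative(1, ...): if the incremented value is negative, the
 * counter has wrapped, so undo the increment and report failure.
 */
struct obj *obj_get(struct obj *o)
{
	if (!o)
		return NULL;

	int old = atomic_fetch_add(&o->count, 1);

	/* old + 1 is negative exactly when old == INT_MAX (wrap) or old < -1. */
	if (old == INT_MAX || old < -1) {
		obj_put(o);
		return NULL;
	}
	return o;
}
```

The failing path still calls obj_put(), which matches the v10 note: the
increment has already happened, so it has to be given back even when the
reference is refused.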

Signed-off-by: Alexey Gladkov 
---
 include/linux/user_namespace.h |  4 +--
 kernel/ucount.c| 53 --
 2 files changed, 21 insertions(+), 36 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index f71b5a4a3e74..d84cc2c0b443 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -92,7 +92,7 @@ struct ucounts {
struct hlist_node node;
struct user_namespace *ns;
kuid_t uid;
-   int count;
+   atomic_t count;
atomic_long_t ucount[UCOUNT_COUNTS];
 };
 
@@ -104,7 +104,7 @@ void retire_userns_sysctls(struct user_namespace *ns);
 struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum 
ucount_type type);
 void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
 struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid);
-struct ucounts *get_ucounts(struct ucounts *ucounts);
+struct ucounts * __must_check get_ucounts(struct ucounts *ucounts);
 void put_ucounts(struct ucounts *ucounts);
 
 #ifdef CONFIG_USER_NS
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 50cc1dfb7d28..365865f368ec 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -11,7 +11,7 @@
 struct ucounts init_ucounts = {
.ns= &init_user_ns,
.uid   = GLOBAL_ROOT_UID,
-   .count = 1,
+   .count = ATOMIC_INIT(1),
 };
 
 #define UCOUNTS_HASHTABLE_BITS 10
@@ -139,6 +139,15 @@ static void hlist_add_ucounts(struct ucounts *ucounts)
spin_unlock_irq(&ucounts_lock);
 }
 
+struct ucounts *get_ucounts(struct ucounts *ucounts)
+{
+   if (ucounts && atomic_add_negative(1, &ucounts->count)) {
+   put_ucounts(ucounts);
+   ucounts = NULL;
+   }
+   return ucounts;
+}
+
 struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
 {
struct hlist_head *hashent = ucounts_hashentry(ns, uid);
@@ -155,7 +164,7 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, 
kuid_t uid)
 
new->ns = ns;
new->uid = uid;
-   new->count = 0;
+   atomic_set(&new->count, 1);
 
spin_lock_irq(&ucounts_lock);
ucounts = find_ucounts(ns, uid, hashent);
@@ -163,33 +172,12 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, 
kuid_t uid)
kfree(new);
} else {
hlist_add_head(&new->node, hashent);
-   ucounts = new;
+   spin_unlock_irq(&ucounts_lock);
+   return new;
}
}
-   if (ucounts->count == INT_MAX)
-   ucounts = NULL;
-   else
-   ucounts->count += 1;
spin_unlock_irq(&ucounts_lock);
-   return ucounts;
-}
-
-struct ucounts *get_ucounts(struct ucounts *ucounts)
-{
-   unsigned long flags;
-
-   if (!ucounts)
-   return NULL;
-
-   spin_lock_irqsave(&ucounts_lock, flags);
-   if (ucounts->count == INT_MAX) {
-   WARN_ONCE(1, "ucounts: counter has reached its maximum value");
-   ucounts = NULL;
-   } else {
-   ucounts->count += 1;
-   }
-   spin_unlock_irqrestore(&ucounts_lock, flags);
-
+   ucounts = get_ucounts(ucounts);
return ucounts;
 }
 
@@ -197,15 +185,12 @@ void put_ucounts(struct ucounts *ucounts)
 {
unsigned long flags;
 
-   spin_lock_irqsave(&ucounts_lock, flags);
-   ucounts->count -= 1;
-   if (!ucounts->count)
+   if (atomic_dec_and_test(&ucounts->count)) {
+   spin_lock_irqsave(&ucounts_lock, flags);
hlist_del_init(&ucounts->node);
-   else
-   ucounts = NULL;
-   spin_unlock_irqrestore(&ucounts_lock, flags);
-
-   kfree(ucounts);
+   spin_unlock_irqrestore(&ucounts_lock, flags);
+   kfree(ucounts);
+   }
 }
 
 static inline bool atomic_long_inc_below(atomic_long_t *v, int u)
-- 
2.29.3

[PATCH v10 4/9] Reimplement RLIMIT_NPROC on top of ucounts

2021-04-07 Thread Alexey Gladkov

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

To illustrate the impact of rlimits, let's say there is a program that
does not fork. Some service-A wants to run this program as user X in
multiple containers. Since the program never forks, the service wants to
set RLIMIT_NPROC=1.

service-A
 \- program (uid=1000, container1, rlimit_nproc=1)
 \- program (uid=1000, container2, rlimit_nproc=1)

The service-A sets RLIMIT_NPROC=1 and runs the program in container1.
When the service-A tries to run a program with RLIMIT_NPROC=1 in
container2 it fails since user X already has one running process.

We cannot use the existing inc_ucount() / dec_ucount() because they do not
allow us to exceed the maximum for the counter. Some rlimits can be
exceeded by root or by a user with the appropriate capability.
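
The difference from a hard-capped counter can be sketched like this. It is
a simplified userspace model, not the code from this patch: struct level,
level_charge() and level_uncharge() are hypothetical names, and each level
stands for one (uid, user namespace) pair with the limit that applies to it.
The charge is always applied and the caller is merely told whether some
level went over, so a privileged caller can choose to proceed anyway.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one counter level in the namespace hierarchy. */
struct level {
	atomic_long value;	/* current usage at this level */
	long max;		/* limit that applies at this level (assumed) */
	struct level *parent;	/* NULL at the root */
};

/*
 * Add v at the leaf and at every ancestor. The increment is never refused;
 * the return value only reports whether any level is now above its limit.
 */
bool level_charge(struct level *leaf, long v)
{
	bool overlimit = false;

	for (struct level *it = leaf; it; it = it->parent) {
		long now = atomic_fetch_add(&it->value, v) + v;

		if (now > it->max)
			overlimit = true;
	}
	return overlimit;
}

/* Undo a previous charge along the same path. */
void level_uncharge(struct level *leaf, long v)
{
	for (struct level *it = leaf; it; it = it->parent)
		atomic_fetch_sub(&it->value, v);
}
```

With a root level and two child levels limited to 1 each, one charge per
child succeeds, which is the container1/container2 case above; a second
charge on the same child reports overlimit and can be rolled back with
level_uncharge().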

Signed-off-by: Alexey Gladkov 
---
 fs/exec.c  |  2 +-
 include/linux/cred.h   |  2 ++
 include/linux/sched/user.h |  1 -
 include/linux/user_namespace.h | 13 +
 kernel/cred.c  | 10 +++
 kernel/exit.c  |  2 +-
 kernel/fork.c  |  9 +++---
 kernel/sys.c   |  2 +-
 kernel/ucount.c| 51 ++
 kernel/user.c  |  1 -
 kernel/user_namespace.c|  3 +-
 11 files changed, 81 insertions(+), 15 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index d7c4187ca023..f2bcdbeb3afb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1878,7 +1878,7 @@ static int do_execveat_common(int fd, struct filename 
*filename,
 * whether NPROC limit is still exceeded.
 */
if ((current->flags & PF_NPROC_EXCEEDED) &&
-   atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
+   is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, 
rlimit(RLIMIT_NPROC))) {
retval = -EAGAIN;
goto out_ret;
}
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 66436e655032..5ca1e8a1d035 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -372,6 +372,7 @@ static inline void put_cred(const struct cred *_cred)
 
 #define task_uid(task) (task_cred_xxx((task), uid))
 #define task_euid(task)(task_cred_xxx((task), euid))
+#define task_ucounts(task) (task_cred_xxx((task), ucounts))
 
 #define current_cred_xxx(xxx)  \
 ({ \
@@ -388,6 +389,7 @@ static inline void put_cred(const struct cred *_cred)
 #define current_fsgid()(current_cred_xxx(fsgid))
 #define current_cap()  (current_cred_xxx(cap_effective))
 #define current_user() (current_cred_xxx(user))
+#define current_ucounts()  (current_cred_xxx(ucounts))
 
 extern struct user_namespace init_user_ns;
 #ifdef CONFIG_USER_NS
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index a8ec3b6093fc..d33d867ad6c1 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -12,7 +12,6 @@
  */
 struct user_struct {
refcount_t __count; /* reference count */
-   atomic_t processes; /* How many processes does this user have? */
atomic_t sigpending;/* How many pending signals does this user 
have? */
 #ifdef CONFIG_FANOTIFY
atomic_t fanotify_listeners;
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index d84cc2c0b443..9d1ca370c201 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -50,9 +50,12 @@ enum ucount_type {
UCOUNT_INOTIFY_INSTANCES,
UCOUNT_INOTIFY_WATCHES,
 #endif
+   UCOUNT_RLIMIT_NPROC,
UCOUNT_COUNTS,
 };
 
+#define MAX_PER_NAMESPACE_UCOUNTS UCOUNT_RLIMIT_NPROC
+
 struct user_namespace {
struct uid_gid_map  uid_map;
struct uid_gid_map  gid_map;
@@ -107,6 +110,16 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, 
kuid_t uid);
 struct ucounts * __must_check get_ucounts(struct ucounts *ucounts);
 void put_ucounts(struct ucounts *ucounts);
 
+static inline long get_ucounts_value(struct ucounts *ucounts, enum ucount_type 
type)
+{
+   return atomic_long_read(&ucounts->ucount[type]);
+}
+
+bool inc_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long 
v);
+bool inc_rlimit_ucounts_and_test(struct ucounts *ucounts, enum ucount_type 
type, long v, long max);
+void dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long 
v);
+bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, 
unsigned long max);
+
 #ifdef CONFIG_USER_NS
 
 static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
diff --git a/kernel/cred.c b/kernel/cred.c
index 58a8a9e24347..dcfa30b337c5 100644
--- a/kernel/cred.c
+++ b

[PATCH v10 1/9] Increase size of ucounts to atomic_long_t

2021-04-07 Thread Alexey Gladkov

RLIMIT_MSGQUEUE and RLIMIT_MEMLOCK use unsigned long to store their
counters. In preparation for moving rlimits onto ucounts, we need to
increase the size of the ucounts counters to long.
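
For readers unfamiliar with the bounded increment that this patch widens,
here is a rough userspace equivalent written with C11 atomics. It is a
sketch only: long_inc_below() is a hypothetical name, and unlike the kernel
helper it takes the bound as a long.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Increment *v by one only while it is below limit.
 * Returns true if the increment happened, false if the bound was reached.
 */
bool long_inc_below(atomic_long *v, long limit)
{
	long c = atomic_load(v);

	for (;;) {
		if (c >= limit)
			return false;
		/*
		 * On failure the current value is written back into c,
		 * so the bound is re-checked before the next attempt.
		 */
		if (atomic_compare_exchange_weak(v, &c, c + 1))
			return true;
	}
}
```

The loop never blocks; a concurrent update only costs one extra iteration.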

Signed-off-by: Alexey Gladkov 
---
 include/linux/user_namespace.h |  4 ++--
 kernel/ucount.c| 16 
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 64cf8ebdc4ec..0bb833fd41f4 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -85,7 +85,7 @@ struct user_namespace {
struct ctl_table_header *sysctls;
 #endif
struct ucounts  *ucounts;
-   int ucount_max[UCOUNT_COUNTS];
+   long ucount_max[UCOUNT_COUNTS];
 } __randomize_layout;
 
 struct ucounts {
@@ -93,7 +93,7 @@ struct ucounts {
struct user_namespace *ns;
kuid_t uid;
int count;
-   atomic_t ucount[UCOUNT_COUNTS];
+   atomic_long_t ucount[UCOUNT_COUNTS];
 };
 
 extern struct user_namespace init_user_ns;
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 11b1596e2542..04c561751af1 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -175,14 +175,14 @@ static void put_ucounts(struct ucounts *ucounts)
kfree(ucounts);
 }
 
-static inline bool atomic_inc_below(atomic_t *v, int u)
+static inline bool atomic_long_inc_below(atomic_long_t *v, int u)
 {
-   int c, old;
-   c = atomic_read(v);
+   long c, old;
+   c = atomic_long_read(v);
for (;;) {
if (unlikely(c >= u))
return false;
-   old = atomic_cmpxchg(v, c, c+1);
+   old = atomic_long_cmpxchg(v, c, c+1);
if (likely(old == c))
return true;
c = old;
@@ -196,17 +196,17 @@ struct ucounts *inc_ucount(struct user_namespace *ns, 
kuid_t uid,
struct user_namespace *tns;
ucounts = get_ucounts(ns, uid);
for (iter = ucounts; iter; iter = tns->ucounts) {
-   int max;
+   long max;
tns = iter->ns;
max = READ_ONCE(tns->ucount_max[type]);
-   if (!atomic_inc_below(&iter->ucount[type], max))
+   if (!atomic_long_inc_below(&iter->ucount[type], max))
goto fail;
}
return ucounts;
 fail:
bad = iter;
for (iter = ucounts; iter != bad; iter = iter->ns->ucounts)
-   atomic_dec(&iter->ucount[type]);
+   atomic_long_dec(&iter->ucount[type]);
 
put_ucounts(ucounts);
return NULL;
@@ -216,7 +216,7 @@ void dec_ucount(struct ucounts *ucounts, enum ucount_type 
type)
 {
struct ucounts *iter;
for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-   int dec = atomic_dec_if_positive(&iter->ucount[type]);
+   long dec = atomic_long_dec_if_positive(&iter->ucount[type]);
WARN_ON_ONCE(dec < 0);
}
put_ucounts(ucounts);
-- 
2.29.3

[PATCH v10 0/9] Count rlimits in each user namespace

2021-04-07 Thread Alexey Gladkov

Preface
---
These patches bind the rlimit counters to a user in a user namespace.
This patch set can be applied on top of:

git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git v5.12-rc4

Problem
---
The RLIMIT_NPROC, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING, RLIMIT_MSGQUEUE rlimits
implementation places the counters in user_struct [1]. These limits are global
between processes and persist for the lifetime of the process, even if
processes are in different user namespaces.

To illustrate the impact of rlimits, let's say there is a program that does not
fork. Some service-A wants to run this program as user X in multiple containers.
Since the program never forks, the service wants to set RLIMIT_NPROC=1.

service-A
 \- program (uid=1000, container1, rlimit_nproc=1)
 \- program (uid=1000, container2, rlimit_nproc=1)

The service-A sets RLIMIT_NPROC=1 and runs the program in container1. When the
service-A tries to run a program with RLIMIT_NPROC=1 in container2 it fails
since user X already has one running process.

The problem is not that the limit from container1 affects container2. The
problem is that the limit is verified against the global counter that reflects
the number of processes in all containers.

This problem can be worked around by using different users for each container
but in this case we face a different problem of uid mapping when transferring
files from one container to another.

Eric W. Biederman mentioned this issue [2][3].

Introduced changes
--
To address the problem, we bind rlimit counters to a user namespace. Each counter
reflects the number of processes in a given uid in a given user namespace. The
result is a tree of rlimit counters with the biggest value at the root (aka
init_user_ns). The limit is considered exceeded if it's exceeded up in the tree.
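
The "exceeded up in the tree" rule can be illustrated with a small
read-only check. This is a hedged sketch, not the series' implementation:
struct ns_counter and counter_overlimit() are hypothetical names, the
counters are plain longs for brevity, and the per-level max is assumed to
be the limit recorded for that namespace.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: one counter per (uid, user namespace) pair. */
struct ns_counter {
	long value;				/* current usage at this level */
	long max;				/* limit recorded for this level (assumed) */
	const struct ns_counter *parent;	/* NULL at the root */
};

/*
 * The limit counts as exceeded if the leaf is above the caller's own limit,
 * or if the leaf or any ancestor is above the limit recorded for its level.
 */
bool counter_overlimit(const struct ns_counter *leaf, long own_limit)
{
	if (leaf && leaf->value > own_limit)
		return true;

	for (const struct ns_counter *it = leaf; it; it = it->parent)
		if (it->value > it->max)
			return true;

	return false;
}
```

A process in a nested namespace therefore passes the check only if every
level between it and the root is within bounds.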

[1]: https://lore.kernel.org/containers/87imd2incs@x220.int.ebiederm.org/
[2]: 
https://lists.linuxfoundation.org/pipermail/containers/2020-August/042096.html
[3]: 
https://lists.linuxfoundation.org/pipermail/containers/2020-October/042524.html

Changelog
-
v10:
* Fixed memory leak in __sigqueue_alloc.
* Handled the unlikely situation where all consumers return ucounts at once.
* Addressed other review comments from Eric W. Biederman.

v9:
* Used a negative value to check that the ucounts->count is close to overflow.
* Rebased onto v5.12-rc4.

v8:
* Used atomic_t for ucounts reference counting. Also added counter overflow
  check (thanks to Linus Torvalds for the idea).
* Fixed other issues found by the lkp-tests project in the patch that
  reimplements RLIMIT_MEMLOCK on top of ucounts.

v7:
* Fixed issues found by the lkp-tests project in the patch that reimplements
  RLIMIT_MEMLOCK on top of ucounts.

v6:
* Fixed issues found by lkp-tests project.
* Rebased onto v5.11.

v5:
* Split the first commit into two commits: change ucounts.count type to
  atomic_long_t and add ucounts to cred. These commits were merged by mistake
  during the rebase.
* Renamed __get_ucounts() to alloc_ucounts().
* Moved the cred.ucounts update out of commit_creds(), since it did not allow
  errors to be handled.
* Added error handling of set_cred_ucounts().

v4:
* Reverted the type change of ucounts.count to refcount_t.
* Fixed a typo in kernel/cred.c.

v3:
* Added a get_ucounts() function to increase the reference count. The existing
  get_ucounts() function was renamed to __get_ucounts().
* Changed the type of ucounts.count from atomic_t to refcount_t.
* Dropped 'const' from set_cred_ucounts() arguments.
* Fixed a bug with freeing the cred structure after calling cred_alloc_blank().
* Commit messages have been updated.
* Added selftest.

v2:
* RLIMIT_MEMLOCK, RLIMIT_SIGPENDING and RLIMIT_MSGQUEUE are migrated to ucounts.
* Added ucounts for the (uid, user namespace) pair to cred.
* Added the ability to increase ucount by more than 1.

v1:
* After discussion with Eric W. Biederman, I increased the size of ucounts to
  atomic_long_t.
* Added ucount_max to avoid the fork bomb.

--

Alexey Gladkov (9):
  Increase size of ucounts to atomic_long_t
  Add a reference to ucounts for each cred
  Use atomic_t for ucounts reference counting
  Reimplement RLIMIT_NPROC on top of ucounts
  Reimplement RLIMIT_MSGQUEUE on top of ucounts
  Reimplement RLIMIT_SIGPENDING on top of ucounts
  Reimplement RLIMIT_MEMLOCK on top of ucounts
  kselftests: Add test to check for rlimit changes in different user
namespaces
  ucounts: Set ucount_max to the largest positive value the type can
hold

 fs/exec.c |   6 +-
 fs/hugetlbfs/inode.c  |  16 +-
 fs/proc/array.c   |   2 +-
 include/linux/cred.h  |   4 +
 include/linux/hugetlb.h   |   4 +-
 include/linux/mm.h|   4 +-
 include/linux/sched/user.h|   7 -
 include/linux/shmem_fs.h  |   2 +-
 include/li

Re: [PATCH v9 4/8] Reimplement RLIMIT_NPROC on top of ucounts

2021-04-06 Thread Alexey Gladkov

On Mon, Apr 05, 2021 at 11:56:35AM -0500, Eric W. Biederman wrote:
>
> Also when setting ns->ucount_max[] in create_user_ns because one value
> is signed and the other is unsigned.  Care should be taken so that
> rlimit_infinity is translated into the largest positive value the
> type can hold.

You mean like this?

ns->ucount_max[UCOUNT_RLIMIT_NPROC] =
	rlimit(RLIMIT_NPROC) <= LONG_MAX ? rlimit(RLIMIT_NPROC) : LONG_MAX;
ns->ucount_max[UCOUNT_RLIMIT_MSGQUEUE] =
	rlimit(RLIMIT_MSGQUEUE) <= LONG_MAX ? rlimit(RLIMIT_MSGQUEUE) : LONG_MAX;
ns->ucount_max[UCOUNT_RLIMIT_SIGPENDING] =
	rlimit(RLIMIT_SIGPENDING) <= LONG_MAX ? rlimit(RLIMIT_SIGPENDING) : LONG_MAX;
ns->ucount_max[UCOUNT_RLIMIT_MEMLOCK] =
	rlimit(RLIMIT_MEMLOCK) <= LONG_MAX ? rlimit(RLIMIT_MEMLOCK) : LONG_MAX;
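
The four assignments repeat one conversion, which could be factored into a
helper along these lines. This is a sketch, and clamp_rlimit_to_long() is a
hypothetical name that does not appear in the series. It relies on the
rlimit value being an unsigned long whose "infinity" is larger than
LONG_MAX (on Linux, RLIM_INFINITY is typically the all-ones value).

```c
#include <limits.h>

/*
 * Convert an unsigned rlimit value to the signed type used for ucount_max.
 * Anything that does not fit, including an "infinity" value, is mapped to
 * the largest positive long.
 */
long clamp_rlimit_to_long(unsigned long limit)
{
	return limit <= LONG_MAX ? (long)limit : LONG_MAX;
}
```

Each assignment would then read
ns->ucount_max[...] = clamp_rlimit_to_long(rlimit(...)).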

-- 
Rgrds, legion

[PATCH v9 5/8] Reimplement RLIMIT_MSGQUEUE on top of ucounts

2021-03-23 Thread Alexey Gladkov

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

Signed-off-by: Alexey Gladkov 
---
 include/linux/sched/user.h |  4 
 include/linux/user_namespace.h |  1 +
 ipc/mqueue.c   | 41 ++
 kernel/fork.c  |  1 +
 kernel/ucount.c|  1 +
 kernel/user_namespace.c|  1 +
 6 files changed, 26 insertions(+), 23 deletions(-)

diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index d33d867ad6c1..8a34446681aa 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -18,10 +18,6 @@ struct user_struct {
 #endif
 #ifdef CONFIG_EPOLL
atomic_long_t epoll_watches; /* The number of file descriptors 
currently watched */
-#endif
-#ifdef CONFIG_POSIX_MQUEUE
-   /* protected by mq_lock */
-   unsigned long mq_bytes; /* How many bytes can be allocated to mqueue? */
 #endif
unsigned long locked_shm; /* How many pages of mlocked shm ? */
unsigned long unix_inflight;/* How many files in flight in unix 
sockets */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 9d1ca370c201..d0fea0306394 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -51,6 +51,7 @@ enum ucount_type {
UCOUNT_INOTIFY_WATCHES,
 #endif
UCOUNT_RLIMIT_NPROC,
+   UCOUNT_RLIMIT_MSGQUEUE,
UCOUNT_COUNTS,
 };
 
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 8031464ed4ae..f0f8f013dae2 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -144,7 +144,7 @@ struct mqueue_inode_info {
struct pid *notify_owner;
u32 notify_self_exec_id;
struct user_namespace *notify_user_ns;
-   struct user_struct *user;   /* user who created, for accounting */
+   struct ucounts *ucounts;/* user who created, for accounting */
struct sock *notify_sock;
struct sk_buff *notify_cookie;
 
@@ -292,7 +292,6 @@ static struct inode *mqueue_get_inode(struct super_block 
*sb,
struct ipc_namespace *ipc_ns, umode_t mode,
struct mq_attr *attr)
 {
-   struct user_struct *u = current_user();
struct inode *inode;
int ret = -ENOMEM;
 
@@ -321,7 +320,7 @@ static struct inode *mqueue_get_inode(struct super_block 
*sb,
info->notify_owner = NULL;
info->notify_user_ns = NULL;
info->qsize = 0;
-   info->user = NULL;  /* set when all is ok */
+   info->ucounts = NULL;   /* set when all is ok */
info->msg_tree = RB_ROOT;
info->msg_tree_rightmost = NULL;
info->node_cache = NULL;
@@ -371,19 +370,24 @@ static struct inode *mqueue_get_inode(struct super_block 
*sb,
if (mq_bytes + mq_treesize < mq_bytes)
goto out_inode;
mq_bytes += mq_treesize;
-   spin_lock(&mq_lock);
-   if (u->mq_bytes + mq_bytes < u->mq_bytes ||
-   u->mq_bytes + mq_bytes > rlimit(RLIMIT_MSGQUEUE)) {
+   info->ucounts = get_ucounts(current_ucounts());
+   if (info->ucounts) {
+   bool overlimit;
+
+   spin_lock(&mq_lock);
+   overlimit = inc_rlimit_ucounts_and_test(info->ucounts, 
UCOUNT_RLIMIT_MSGQUEUE,
+   mq_bytes, rlimit(RLIMIT_MSGQUEUE));
+   if (overlimit) {
+   dec_rlimit_ucounts(info->ucounts, 
UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
+   spin_unlock(&mq_lock);
+   put_ucounts(info->ucounts);
+   info->ucounts = NULL;
+   /* mqueue_evict_inode() releases info->messages 
*/
+   ret = -EMFILE;
+   goto out_inode;
+   }
spin_unlock(&mq_lock);
-   /* mqueue_evict_inode() releases info->messages */
-   ret = -EMFILE;
-   goto out_inode;
}
-   u->mq_bytes += mq_bytes;
-   spin_unlock(&mq_lock);
-
-   /* all is ok */
-   info->user = get_uid(u);
} else if (S_ISDIR(mode)) {
inc_nlink(inode);
/* Some things misbehave if size == 0 on a directory */
@@ -497,7 +501,6 @@ static void mqueue_free_inode(struct inode *inode)
 static void mqueue_evict_inode(struct inode *inode)
 {
struct mqueue_inode_info *info;
-   struct user_struct *user;
struct ipc_namespace *ipc_ns;
struct msg_msg

[PATCH v9 7/8] Reimplement RLIMIT_MEMLOCK on top of ucounts

2021-03-23 Thread Alexey Gladkov

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

Changelog
v8:
* Fix issues found by lkp-tests project.

v7:
* Keep only ucounts for RLIMIT_MEMLOCK checks instead of struct cred.

v6:
* Fix bug in hugetlb_file_setup() detected by trinity.

Reported-by: kernel test robot 
Signed-off-by: Alexey Gladkov 
---
 fs/hugetlbfs/inode.c   | 16 
 include/linux/hugetlb.h|  4 ++--
 include/linux/mm.h |  4 ++--
 include/linux/sched/user.h |  1 -
 include/linux/shmem_fs.h   |  2 +-
 include/linux/user_namespace.h |  1 +
 ipc/shm.c  | 26 +-
 kernel/fork.c  |  1 +
 kernel/ucount.c|  1 +
 kernel/user.c  |  1 -
 kernel/user_namespace.c|  1 +
 mm/memfd.c |  4 ++--
 mm/mlock.c | 23 +++
 mm/mmap.c  |  4 ++--
 mm/shmem.c |  8 
 15 files changed, 53 insertions(+), 44 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 701c82c36138..be519fc9559a 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1443,7 +1443,7 @@ static int get_hstate_idx(int page_size_log)
  * otherwise hugetlb_reserve_pages reserves one less hugepages than intended.
  */
 struct file *hugetlb_file_setup(const char *name, size_t size,
-   vm_flags_t acctflag, struct user_struct **user,
+   vm_flags_t acctflag, struct ucounts **ucounts,
int creat_flags, int page_size_log)
 {
struct inode *inode;
@@ -1455,20 +1455,20 @@ struct file *hugetlb_file_setup(const char *name, 
size_t size,
if (hstate_idx < 0)
return ERR_PTR(-ENODEV);
 
-   *user = NULL;
+   *ucounts = NULL;
mnt = hugetlbfs_vfsmount[hstate_idx];
if (!mnt)
return ERR_PTR(-ENOENT);
 
if (creat_flags == HUGETLB_SHMFS_INODE && !can_do_hugetlb_shm()) {
-   *user = current_user();
-   if (user_shm_lock(size, *user)) {
+   *ucounts = current_ucounts();
+   if (user_shm_lock(size, *ucounts)) {
task_lock(current);
pr_warn_once("%s (%d): Using mlock ulimits for 
SHM_HUGETLB is deprecated\n",
current->comm, current->pid);
task_unlock(current);
} else {
-   *user = NULL;
+   *ucounts = NULL;
return ERR_PTR(-EPERM);
}
}
@@ -1495,9 +1495,9 @@ struct file *hugetlb_file_setup(const char *name, size_t 
size,
 
iput(inode);
 out:
-   if (*user) {
-   user_shm_unlock(size, *user);
-   *user = NULL;
+   if (*ucounts) {
+   user_shm_unlock(size, *ucounts);
+   *ucounts = NULL;
}
return file;
 }
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index cccd1aab69dd..96d63dbdec65 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -434,7 +434,7 @@ static inline struct hugetlbfs_inode_info 
*HUGETLBFS_I(struct inode *inode)
 extern const struct file_operations hugetlbfs_file_operations;
 extern const struct vm_operations_struct hugetlb_vm_ops;
 struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
-   struct user_struct **user, int creat_flags,
+   struct ucounts **ucounts, int creat_flags,
int page_size_log);
 
 static inline bool is_file_hugepages(struct file *file)
@@ -454,7 +454,7 @@ static inline struct hstate *hstate_inode(struct inode *i)
 #define is_file_hugepages(file)false
 static inline struct file *
 hugetlb_file_setup(const char *name, size_t size, vm_flags_t acctflag,
-   struct user_struct **user, int creat_flags,
+   struct ucounts **ucounts, int creat_flags,
int page_size_log)
 {
return ERR_PTR(-ENOSYS);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 64a71bf20536..7466eab000d0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1658,8 +1658,8 @@ extern bool can_do_mlock(void);
 #else
 static inline bool can_do_mlock(void) { return false; }
 #endif
-extern int user_shm_lock(size_t, struct user_struct *);
-extern void user_shm_unlock(size_t, struct user_struct *);
+extern int user_shm_lock(size_t, struct ucounts *);
+extern void user_shm_unlock(size_t, struct ucounts *);
 
 /*
  * Parameter block passed down to zap_pte_range in exceptional cases.
diff --git a/include/linux/sched/u

[PATCH v9 8/8] kselftests: Add test to check for rlimit changes in different user namespaces

2021-03-23 Thread Alexey Gladkov

The testcase runs a few instances of the program with RLIMIT_NPROC=1 as
user uid=6, in different user namespaces.

Signed-off-by: Alexey Gladkov 
---
 tools/testing/selftests/Makefile  |   1 +
 tools/testing/selftests/rlimits/.gitignore|   2 +
 tools/testing/selftests/rlimits/Makefile  |   6 +
 tools/testing/selftests/rlimits/config|   1 +
 .../selftests/rlimits/rlimits-per-userns.c| 161 ++
 5 files changed, 171 insertions(+)
 create mode 100644 tools/testing/selftests/rlimits/.gitignore
 create mode 100644 tools/testing/selftests/rlimits/Makefile
 create mode 100644 tools/testing/selftests/rlimits/config
 create mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 6c575cf34a71..a4ea1481bd9a 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -48,6 +48,7 @@ TARGETS += proc
 TARGETS += pstore
 TARGETS += ptrace
 TARGETS += openat2
+TARGETS += rlimits
 TARGETS += rseq
 TARGETS += rtc
 TARGETS += seccomp
diff --git a/tools/testing/selftests/rlimits/.gitignore 
b/tools/testing/selftests/rlimits/.gitignore
new file mode 100644
index ..091021f255b3
--- /dev/null
+++ b/tools/testing/selftests/rlimits/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+rlimits-per-userns
diff --git a/tools/testing/selftests/rlimits/Makefile 
b/tools/testing/selftests/rlimits/Makefile
new file mode 100644
index ..03aadb406212
--- /dev/null
+++ b/tools/testing/selftests/rlimits/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+CFLAGS += -Wall -O2 -g
+TEST_GEN_PROGS := rlimits-per-userns
+
+include ../lib.mk
diff --git a/tools/testing/selftests/rlimits/config 
b/tools/testing/selftests/rlimits/config
new file mode 100644
index ..416bd53ce982
--- /dev/null
+++ b/tools/testing/selftests/rlimits/config
@@ -0,0 +1 @@
+CONFIG_USER_NS=y
diff --git a/tools/testing/selftests/rlimits/rlimits-per-userns.c 
b/tools/testing/selftests/rlimits/rlimits-per-userns.c
new file mode 100644
index ..26dc949e93ea
--- /dev/null
+++ b/tools/testing/selftests/rlimits/rlimits-per-userns.c
@@ -0,0 +1,161 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Alexey Gladkov 
+ */
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define NR_CHILDS 2
+
+static char *service_prog;
+static uid_t user   = 6;
+static uid_t group  = 6;
+
+static void setrlimit_nproc(rlim_t n)
+{
+   pid_t pid = getpid();
+   struct rlimit limit = {
+   .rlim_cur = n,
+   .rlim_max = n
+   };
+
+   warnx("(pid=%d): Setting RLIMIT_NPROC=%ld", pid, n);
+
+   if (setrlimit(RLIMIT_NPROC, &limit) < 0)
+   err(EXIT_FAILURE, "(pid=%d): setrlimit(RLIMIT_NPROC)", pid);
+}
+
+static pid_t fork_child(void)
+{
+   pid_t pid = fork();
+
+   if (pid < 0)
+   err(EXIT_FAILURE, "fork");
+
+   if (pid > 0)
+   return pid;
+
+   pid = getpid();
+
+   warnx("(pid=%d): New process starting ...", pid);
+
+   if (prctl(PR_SET_PDEATHSIG, SIGKILL) < 0)
+   err(EXIT_FAILURE, "(pid=%d): prctl(PR_SET_PDEATHSIG)", pid);
+
+   signal(SIGUSR1, SIG_DFL);
+
+   warnx("(pid=%d): Changing to uid=%d, gid=%d", pid, user, group);
+
+   if (setgid(group) < 0)
+   err(EXIT_FAILURE, "(pid=%d): setgid(%d)", pid, group);
+   if (setuid(user) < 0)
+   err(EXIT_FAILURE, "(pid=%d): setuid(%d)", pid, user);
+
+   warnx("(pid=%d): Service running ...", pid);
+
+   warnx("(pid=%d): Unshare user namespace", pid);
+   if (unshare(CLONE_NEWUSER) < 0)
+   err(EXIT_FAILURE, "unshare(CLONE_NEWUSER)");
+
+   char *const argv[] = { "service", NULL };
+   char *const envp[] = { "I_AM_SERVICE=1", NULL };
+
+   warnx("(pid=%d): Executing real service ...", pid);
+
+   execve(service_prog, argv, envp);
+   err(EXIT_FAILURE, "(pid=%d): execve", pid);
+}
+
+int main(int argc, char **argv)
+{
+   size_t i;
+   pid_t child[NR_CHILDS];
+   int wstatus[NR_CHILDS];
+   int childs = NR_CHILDS;
+   pid_t pid;
+
+   if (getenv("I_AM_SERVICE")) {
+   pause();
+   exit(EXIT_SUCCESS);
+   }
+
+   service_prog = argv[0];
+   pid = getpid();
+
+   warnx("(pid=%d) Starting testcase", pid);
+
+   /*
+* This rlimit is not a problem for root because it can be exceeded.
+*/
+   setrlimit_nproc(1);
+
+   for (i = 0; i < NR_CH

[PATCH v9 0/8] Count rlimits in each user namespace

2021-03-23 Thread Alexey Gladkov

Preface
---
These patches bind the rlimit counters to a user in a user namespace.
This patch set can be applied on top of:

git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git v5.12-rc4

Problem
---
The RLIMIT_NPROC, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING, RLIMIT_MSGQUEUE rlimits
implementation places the counters in user_struct [1]. These limits are global
between processes and persist for the lifetime of the process, even if
processes are in different user namespaces.

To illustrate the impact of rlimits, let's say there is a program that does not
fork. Some service-A wants to run this program as user X in multiple containers.
Since the program never forks, the service wants to set RLIMIT_NPROC=1.

service-A
 \- program (uid=1000, container1, rlimit_nproc=1)
 \- program (uid=1000, container2, rlimit_nproc=1)

The service-A sets RLIMIT_NPROC=1 and runs the program in container1. When the
service-A tries to run a program with RLIMIT_NPROC=1 in container2 it fails
since user X already has one running process.

The problem is not that the limit from container1 affects container2. The
problem is that the limit is verified against the global counter that reflects
the number of processes in all containers.

This problem can be worked around by using different users for each container
but in this case we face a different problem of uid mapping when transferring
files from one container to another.

Eric W. Biederman mentioned this issue [2][3].

Introduced changes
--
To address the problem, we bind rlimit counters to user namespace. Each counter
reflects the number of processes in a given uid in a given user namespace. The
result is a tree of rlimit counters with the biggest value at the root (aka
init_user_ns). The limit is considered exceeded if it's exceeded up in the tree.

[1]: https://lore.kernel.org/containers/87imd2incs@x220.int.ebiederm.org/
[2]: 
https://lists.linuxfoundation.org/pipermail/containers/2020-August/042096.html
[3]: 
https://lists.linuxfoundation.org/pipermail/containers/2020-October/042524.html

Changelog
-
v9:
* Used a negative value to check that the ucounts->count is close to overflow.
* Rebased onto v5.12-rc4.

v8:
* Used atomic_t for ucounts reference counting. Also added counter overflow
  check (thanks to Linus Torvalds for the idea).
* Fixed other issues found by lkp-tests project in the patch that Reimplements
  RLIMIT_MEMLOCK on top of ucounts.

v7:
* Fixed issues found by lkp-tests project in the patch that Reimplements
  RLIMIT_MEMLOCK on top of ucounts.

v6:
* Fixed issues found by lkp-tests project.
* Rebased onto v5.11.

v5:
* Split the first commit into two commits: change ucounts.count type to
  atomic_long_t and add ucounts to cred. These commits were merged by mistake
  during the rebase.
* Renamed __get_ucounts() to alloc_ucounts().
* Moved the cred.ucounts update out of commit_creds(), because doing it there
  did not allow errors to be handled.
* Added error handling of set_cred_ucounts().

v4:
* Reverted the type change of ucounts.count to refcount_t.
* Fixed typo in the kernel/cred.c

v3:
* Added get_ucounts() function to increase the reference count. The existing
  get_ucounts() function was renamed to __get_ucounts().
* The type of ucounts.count changed from atomic_t to refcount_t.
* Dropped 'const' from set_cred_ucounts() arguments.
* Fixed a bug with freeing the cred structure after calling cred_alloc_blank().
* Commit messages have been updated.
* Added selftest.

v2:
* RLIMIT_MEMLOCK, RLIMIT_SIGPENDING and RLIMIT_MSGQUEUE are migrated to ucounts.
* Added ucounts for pair uid and user namespace into cred.
* Added the ability to increase ucount by more than 1.

v1:
* After discussion with Eric W. Biederman, I increased the size of ucounts to
  atomic_long_t.
* Added ucount_max to avoid the fork bomb.

--

Alexey Gladkov (8):
  Increase size of ucounts to atomic_long_t
  Add a reference to ucounts for each cred
  Use atomic_t for ucounts reference counting
  Reimplement RLIMIT_NPROC on top of ucounts
  Reimplement RLIMIT_MSGQUEUE on top of ucounts
  Reimplement RLIMIT_SIGPENDING on top of ucounts
  Reimplement RLIMIT_MEMLOCK on top of ucounts
  kselftests: Add test to check for rlimit changes in different user
namespaces

 fs/exec.c |   6 +-
 fs/hugetlbfs/inode.c  |  16 +-
 fs/proc/array.c   |   2 +-
 include/linux/cred.h  |   4 +
 include/linux/hugetlb.h   |   4 +-
 include/linux/mm.h|   4 +-
 include/linux/sched/user.h|   7 -
 include/linux/shmem_fs.h  |   2 +-
 include/linux/signal_types.h  |   4 +-
 include/linux/user_namespace.h|  26 ++-
 ipc/mqueue.c  |  41 ++---
 ipc/shm.c |  26 +--
 kernel/cre

[PATCH v9 4/8] Reimplement RLIMIT_NPROC on top of ucounts

2021-03-23 Thread Alexey Gladkov

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the limits of the parent
user namespaces cannot be exceeded.

To illustrate the impact of rlimits, let's say there is a program that
does not fork. Some service-A wants to run this program as user X in
multiple containers. Since the program never forks, the service wants to
set RLIMIT_NPROC=1.

service-A
 \- program (uid=1000, container1, rlimit_nproc=1)
 \- program (uid=1000, container2, rlimit_nproc=1)

The service-A sets RLIMIT_NPROC=1 and runs the program in container1.
When the service-A tries to run a program with RLIMIT_NPROC=1 in
container2 it fails since user X already has one running process.

We cannot use existing inc_ucounts / dec_ucounts because they do not
allow us to exceed the maximum for the counter. Some rlimits can be
overlimited by root or if the user has the appropriate capability.
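
A minimal sketch of that idea, assuming a plain non-atomic counter for
brevity: the increment always happens, the helper only reports whether the
new value is above the maximum, and the caller decides whether a privileged
task may keep the charge. The names rl_counter, inc_and_test(), dec() and
try_charge() are hypothetical and are not the helpers added by this patch.

```c
#include <stdbool.h>

/* Hypothetical counter; the kernel uses atomic_long_t here. */
struct rl_counter {
	long count;
};

/* Always add v; report whether the new value is above max. */
static bool inc_and_test(struct rl_counter *c, long v, long max)
{
	c->count += v;
	return c->count > max;
}

static void dec(struct rl_counter *c, long v)
{
	c->count -= v;
}

/*
 * Returns true if the charge is kept. A privileged caller keeps the
 * charge even when the counter ends up above the limit.
 */
static bool try_charge(struct rl_counter *c, long max, bool privileged)
{
	bool over = inc_and_test(c, 1, max);

	if (over && !privileged) {
		dec(c, 1); /* undo the charge for an unprivileged caller */
		return false;
	}
	return true;
}
```

A helper that refuses to go above the maximum could not express the
privileged case, which is why the sketch separates "increment" from
"decide".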

Signed-off-by: Alexey Gladkov 
---
 fs/exec.c  |  2 +-
 include/linux/cred.h   |  2 ++
 include/linux/sched/user.h |  1 -
 include/linux/user_namespace.h | 13 
 kernel/cred.c  | 10 +++---
 kernel/exit.c  |  2 +-
 kernel/fork.c  |  9 ++---
 kernel/sys.c   |  2 +-
 kernel/ucount.c| 61 ++
 kernel/user.c  |  1 -
 kernel/user_namespace.c|  3 +-
 11 files changed, 91 insertions(+), 15 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index d7c4187ca023..f2bcdbeb3afb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1878,7 +1878,7 @@ static int do_execveat_common(int fd, struct filename 
*filename,
 * whether NPROC limit is still exceeded.
 */
if ((current->flags & PF_NPROC_EXCEEDED) &&
-   atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
+   is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, 
rlimit(RLIMIT_NPROC))) {
retval = -EAGAIN;
goto out_ret;
}
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 66436e655032..5ca1e8a1d035 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -372,6 +372,7 @@ static inline void put_cred(const struct cred *_cred)
 
 #define task_uid(task) (task_cred_xxx((task), uid))
 #define task_euid(task)(task_cred_xxx((task), euid))
+#define task_ucounts(task) (task_cred_xxx((task), ucounts))
 
 #define current_cred_xxx(xxx)  \
 ({ \
@@ -388,6 +389,7 @@ static inline void put_cred(const struct cred *_cred)
 #define current_fsgid()(current_cred_xxx(fsgid))
 #define current_cap()  (current_cred_xxx(cap_effective))
 #define current_user() (current_cred_xxx(user))
+#define current_ucounts()  (current_cred_xxx(ucounts))
 
 extern struct user_namespace init_user_ns;
 #ifdef CONFIG_USER_NS
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index a8ec3b6093fc..d33d867ad6c1 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -12,7 +12,6 @@
  */
 struct user_struct {
refcount_t __count; /* reference count */
-   atomic_t processes; /* How many processes does this user have? */
atomic_t sigpending;/* How many pending signals does this user 
have? */
 #ifdef CONFIG_FANOTIFY
atomic_t fanotify_listeners;
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index d84cc2c0b443..9d1ca370c201 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -50,9 +50,12 @@ enum ucount_type {
UCOUNT_INOTIFY_INSTANCES,
UCOUNT_INOTIFY_WATCHES,
 #endif
+   UCOUNT_RLIMIT_NPROC,
UCOUNT_COUNTS,
 };
 
+#define MAX_PER_NAMESPACE_UCOUNTS UCOUNT_RLIMIT_NPROC
+
 struct user_namespace {
struct uid_gid_map  uid_map;
struct uid_gid_map  gid_map;
@@ -107,6 +110,16 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, 
kuid_t uid);
 struct ucounts * __must_check get_ucounts(struct ucounts *ucounts);
 void put_ucounts(struct ucounts *ucounts);
 
+static inline long get_ucounts_value(struct ucounts *ucounts, enum ucount_type 
type)
+{
+   return atomic_long_read(&ucounts->ucount[type]);
+}
+
+bool inc_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long 
v);
+bool inc_rlimit_ucounts_and_test(struct ucounts *ucounts, enum ucount_type 
type, long v, long max);
+void dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long 
v);
+bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, 
unsigned long max);
+
 #ifdef CONFIG_USER_NS
 
 static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
diff --git a/kernel/cred.c b/kernel/cred.c
index 58a8a9e24347..dcfa30b337c5 100644
--- a/kernel/cred.c
+++ b

[PATCH v9 3/8] Use atomic_t for ucounts reference counting

2021-03-23 Thread Alexey Gladkov

The current implementation of the ucounts reference counter requires the
use of spin_lock. We're going to use get_ucounts() in more
performance-critical areas such as the handling of RLIMIT_SIGPENDING.

Now we need to use spin_lock only if we want to change the hashtable.

v9:
* Use a negative value to check that the ucounts->count is close to
  overflow.
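
The negative-value check can be illustrated with C11 atomics. This is a
userspace sketch, not the kernel code: struct obj, obj_get() and obj_put()
are hypothetical names, and atomic_fetch_add() stands in for the kernel's
atomic_add_negative(). C11 defines signed atomic arithmetic to wrap silently,
so incrementing past INT_MAX yields a negative value that can be detected
and undone without taking a lock.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted object. */
struct obj {
	atomic_int count;
};

/* Take a reference; return NULL if the counter would become negative. */
static struct obj *obj_get(struct obj *o)
{
	if (o) {
		int old = atomic_fetch_add(&o->count, 1);

		/* The new value is negative iff old was INT_MAX or below -1. */
		if (old == INT_MAX || old < -1) {
			atomic_fetch_sub(&o->count, 1); /* undo the increment */
			return NULL;
		}
	}
	return o;
}

/* Drop a reference; return true when the last reference is gone. */
static bool obj_put(struct obj *o)
{
	return atomic_fetch_sub(&o->count, 1) == 1;
}
```

Only the caller that sees obj_put() return true would need a lock, to unlink
the object from a shared table before freeing it.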

Signed-off-by: Alexey Gladkov 
---
 include/linux/user_namespace.h |  4 +--
 kernel/ucount.c| 53 --
 2 files changed, 21 insertions(+), 36 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index f71b5a4a3e74..d84cc2c0b443 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -92,7 +92,7 @@ struct ucounts {
struct hlist_node node;
struct user_namespace *ns;
kuid_t uid;
-   int count;
+   atomic_t count;
atomic_long_t ucount[UCOUNT_COUNTS];
 };
 
@@ -104,7 +104,7 @@ void retire_userns_sysctls(struct user_namespace *ns);
 struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum 
ucount_type type);
 void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
 struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid);
-struct ucounts *get_ucounts(struct ucounts *ucounts);
+struct ucounts * __must_check get_ucounts(struct ucounts *ucounts);
 void put_ucounts(struct ucounts *ucounts);
 
 #ifdef CONFIG_USER_NS
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 50cc1dfb7d28..7bac19bb3f1e 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -11,7 +11,7 @@
 struct ucounts init_ucounts = {
.ns= _user_ns,
.uid   = GLOBAL_ROOT_UID,
-   .count = 1,
+   .count = ATOMIC_INIT(1),
 };
 
 #define UCOUNTS_HASHTABLE_BITS 10
@@ -139,6 +139,15 @@ static void hlist_add_ucounts(struct ucounts *ucounts)
spin_unlock_irq(_lock);
 }
 
+struct ucounts *get_ucounts(struct ucounts *ucounts)
+{
+   if (ucounts && atomic_add_negative(1, &ucounts->count)) {
+   atomic_dec(&ucounts->count);
+   ucounts = NULL;
+   }
+   return ucounts;
+}
+
 struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
 {
struct hlist_head *hashent = ucounts_hashentry(ns, uid);
@@ -155,7 +164,7 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, 
kuid_t uid)
 
new->ns = ns;
new->uid = uid;
-   new->count = 0;
+   atomic_set(>count, 1);
 
spin_lock_irq(_lock);
ucounts = find_ucounts(ns, uid, hashent);
@@ -163,33 +172,12 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, 
kuid_t uid)
kfree(new);
} else {
hlist_add_head(>node, hashent);
-   ucounts = new;
+   spin_unlock_irq(_lock);
+   return new;
}
}
-   if (ucounts->count == INT_MAX)
-   ucounts = NULL;
-   else
-   ucounts->count += 1;
spin_unlock_irq(_lock);
-   return ucounts;
-}
-
-struct ucounts *get_ucounts(struct ucounts *ucounts)
-{
-   unsigned long flags;
-
-   if (!ucounts)
-   return NULL;
-
-   spin_lock_irqsave(_lock, flags);
-   if (ucounts->count == INT_MAX) {
-   WARN_ONCE(1, "ucounts: counter has reached its maximum value");
-   ucounts = NULL;
-   } else {
-   ucounts->count += 1;
-   }
-   spin_unlock_irqrestore(_lock, flags);
-
+   ucounts = get_ucounts(ucounts);
return ucounts;
 }
 
@@ -197,15 +185,12 @@ void put_ucounts(struct ucounts *ucounts)
 {
unsigned long flags;
 
-   spin_lock_irqsave(_lock, flags);
-   ucounts->count -= 1;
-   if (!ucounts->count)
+   if (atomic_dec_and_test(&ucounts->count)) {
+   spin_lock_irqsave(&ucounts_lock, flags);
hlist_del_init(&ucounts->node);
-   else
-   ucounts = NULL;
-   spin_unlock_irqrestore(_lock, flags);
-
-   kfree(ucounts);
+   spin_unlock_irqrestore(_lock, flags);
+   kfree(ucounts);
+   }
 }
 
 static inline bool atomic_long_inc_below(atomic_long_t *v, int u)
-- 
2.29.3

[PATCH v9 2/8] Add a reference to ucounts for each cred

2021-03-23 Thread Alexey Gladkov

For RLIMIT_NPROC and some other rlimits the user_struct that holds the
global limit is kept alive for the lifetime of a process by keeping it
in struct cred. Adding a pointer to ucounts to struct cred makes it
possible to track RLIMIT_NPROC not only per user in the system, but also
per user in a user_namespace.

Updating ucounts may require memory allocation which may fail. So, we
cannot change cred.ucounts in the commit_creds() because this function
cannot fail and it should always return 0. For this reason, we modify
cred.ucounts before calling the commit_creds().
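
The ordering constraint can be sketched as a prepare-then-commit pattern. The
code below is a userspace illustration only: struct creds, struct counts,
creds_set_counts(), creds_commit() and install() are hypothetical names, and
the simulate_oom flag exists purely to exercise the failure path.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct ucounts and struct cred. */
struct counts {
	int refs;
};

struct creds {
	struct counts *counts;
};

/* Fallible step: it may allocate, so it can return -ENOMEM. */
static int creds_set_counts(struct creds *new_creds, int simulate_oom)
{
	struct counts *c = simulate_oom ? NULL : calloc(1, sizeof(*c));

	if (!c)
		return -ENOMEM;
	c->refs = 1;
	new_creds->counts = c;
	return 0;
}

/* Infallible step: only swaps a pointer. */
static void creds_commit(struct creds **cur, struct creds *new_creds)
{
	*cur = new_creds;
}

/* All work that can fail is done before the commit. */
static int install(struct creds **cur, struct creds *new_creds, int simulate_oom)
{
	int ret = creds_set_counts(new_creds, simulate_oom);

	if (ret < 0)
		return ret; /* *cur is left untouched */
	creds_commit(cur, new_creds);
	return 0;
}
```

If the fallible step fails, the current credentials are never replaced, so
the commit step can remain a function that always succeeds.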

Changelog

v6:
* Fix null-ptr-deref in is_ucounts_overlimit() detected by trinity. This
  error was caused by the fact that cred_alloc_blank() left the ucounts
  pointer empty.

Reported-by: kernel test robot 
Signed-off-by: Alexey Gladkov 
---
 fs/exec.c  |  4 
 include/linux/cred.h   |  2 ++
 include/linux/user_namespace.h |  4 
 kernel/cred.c  | 40 ++
 kernel/fork.c  |  6 +
 kernel/sys.c   | 12 ++
 kernel/ucount.c| 40 +++---
 kernel/user_namespace.c|  3 +++
 8 files changed, 108 insertions(+), 3 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 18594f11c31f..d7c4187ca023 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1360,6 +1360,10 @@ int begin_new_exec(struct linux_binprm * bprm)
WRITE_ONCE(me->self_exec_id, me->self_exec_id + 1);
flush_signal_handlers(me, 0);
 
+   retval = set_cred_ucounts(bprm->cred);
+   if (retval < 0)
+   goto out_unlock;
+
/*
 * install the new credentials for this executable
 */
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 4c6350503697..66436e655032 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -144,6 +144,7 @@ struct cred {
 #endif
struct user_struct *user;   /* real user ID subscription */
struct user_namespace *user_ns; /* user_ns the caps and keyrings are 
relative to. */
+   struct ucounts *ucounts;
struct group_info *group_info;  /* supplementary groups for euid/fsgid 
*/
/* RCU deletion */
union {
@@ -170,6 +171,7 @@ extern int set_security_override_from_ctx(struct cred *, 
const char *);
 extern int set_create_files_as(struct cred *, struct inode *);
 extern int cred_fscmp(const struct cred *, const struct cred *);
 extern void __init cred_init(void);
+extern int set_cred_ucounts(struct cred *);
 
 /*
  * check for validity of credentials
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 0bb833fd41f4..f71b5a4a3e74 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -97,11 +97,15 @@ struct ucounts {
 };
 
 extern struct user_namespace init_user_ns;
+extern struct ucounts init_ucounts;
 
 bool setup_userns_sysctls(struct user_namespace *ns);
 void retire_userns_sysctls(struct user_namespace *ns);
 struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum 
ucount_type type);
 void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
+struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid);
+struct ucounts *get_ucounts(struct ucounts *ucounts);
+void put_ucounts(struct ucounts *ucounts);
 
 #ifdef CONFIG_USER_NS
 
diff --git a/kernel/cred.c b/kernel/cred.c
index 421b1149c651..58a8a9e24347 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -60,6 +60,7 @@ struct cred init_cred = {
.user   = INIT_USER,
.user_ns= _user_ns,
.group_info = _groups,
+   .ucounts= _ucounts,
 };
 
 static inline void set_cred_subscribers(struct cred *cred, int n)
@@ -119,6 +120,8 @@ static void put_cred_rcu(struct rcu_head *rcu)
if (cred->group_info)
put_group_info(cred->group_info);
free_uid(cred->user);
+   if (cred->ucounts)
+   put_ucounts(cred->ucounts);
put_user_ns(cred->user_ns);
kmem_cache_free(cred_jar, cred);
 }
@@ -222,6 +225,7 @@ struct cred *cred_alloc_blank(void)
 #ifdef CONFIG_DEBUG_CREDENTIALS
new->magic = CRED_MAGIC;
 #endif
+   new->ucounts = get_ucounts(_ucounts);
 
if (security_cred_alloc_blank(new, GFP_KERNEL_ACCOUNT) < 0)
goto error;
@@ -284,6 +288,11 @@ struct cred *prepare_creds(void)
 
if (security_prepare_creds(new, old, GFP_KERNEL_ACCOUNT) < 0)
goto error;
+
+   new->ucounts = get_ucounts(new->ucounts);
+   if (!new->ucounts)
+   goto error;
+
validate_creds(new);
return new;
 
@@ -363,6 +372,8 @@ int copy_creds(struct task_struct *p, unsigned long 
clone_flags)
ret = create_user_ns(new);
if (ret < 0)
goto error_put;
+   if (set_cred_uc

[PATCH v9 6/8] Reimplement RLIMIT_SIGPENDING on top of ucounts

2021-03-23 Thread Alexey Gladkov

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the limits of the parent
user namespaces cannot be exceeded.

Signed-off-by: Alexey Gladkov 
---
 fs/proc/array.c|  2 +-
 include/linux/sched/user.h |  1 -
 include/linux/signal_types.h   |  4 ++-
 include/linux/user_namespace.h |  1 +
 kernel/fork.c  |  1 +
 kernel/signal.c| 57 --
 kernel/ucount.c|  1 +
 kernel/user.c  |  1 -
 kernel/user_namespace.c|  1 +
 9 files changed, 34 insertions(+), 35 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index bb87e4d89cd8..74b0ea4b7e38 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -284,7 +284,7 @@ static inline void task_sig(struct seq_file *m, struct 
task_struct *p)
collect_sigign_sigcatch(p, , );
num_threads = get_nr_threads(p);
rcu_read_lock();  /* FIXME: is this correct? */
-   qsize = atomic_read(&__task_cred(p)->user->sigpending);
+   qsize = get_ucounts_value(task_ucounts(p), 
UCOUNT_RLIMIT_SIGPENDING);
rcu_read_unlock();
qlim = task_rlimit(p, RLIMIT_SIGPENDING);
unlock_task_sighand(p, );
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 8a34446681aa..8ba9cec4fb99 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -12,7 +12,6 @@
  */
 struct user_struct {
refcount_t __count; /* reference count */
-   atomic_t sigpending;/* How many pending signals does this user 
have? */
 #ifdef CONFIG_FANOTIFY
atomic_t fanotify_listeners;
 #endif
diff --git a/include/linux/signal_types.h b/include/linux/signal_types.h
index 68e06c75c5b2..34cb28b8f16c 100644
--- a/include/linux/signal_types.h
+++ b/include/linux/signal_types.h
@@ -13,6 +13,8 @@ typedef struct kernel_siginfo {
__SIGINFO;
 } kernel_siginfo_t;
 
+struct ucounts;
+
 /*
  * Real Time signals may be queued.
  */
@@ -21,7 +23,7 @@ struct sigqueue {
struct list_head list;
int flags;
kernel_siginfo_t info;
-   struct user_struct *user;
+   struct ucounts *ucounts;
 };
 
 /* flags values. */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index d0fea0306394..6e8736c7aa29 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -52,6 +52,7 @@ enum ucount_type {
 #endif
UCOUNT_RLIMIT_NPROC,
UCOUNT_RLIMIT_MSGQUEUE,
+   UCOUNT_RLIMIT_SIGPENDING,
UCOUNT_COUNTS,
 };
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 85c6094f5a48..741f896c156e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -824,6 +824,7 @@ void __init fork_init(void)
 
init_user_ns.ucount_max[UCOUNT_RLIMIT_NPROC] = task_rlimit(_task, 
RLIMIT_NPROC);
init_user_ns.ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = 
task_rlimit(_task, RLIMIT_MSGQUEUE);
+   init_user_ns.ucount_max[UCOUNT_RLIMIT_SIGPENDING] = 
task_rlimit(_task, RLIMIT_SIGPENDING);
 
 #ifdef CONFIG_VMAP_STACK
cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:vm_stack_cache",
diff --git a/kernel/signal.c b/kernel/signal.c
index f2a1b898da29..1b537d9de447 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -413,49 +413,44 @@ void task_join_group_stop(struct task_struct *task)
 static struct sigqueue *
 __sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int 
override_rlimit)
 {
-   struct sigqueue *q = NULL;
-   struct user_struct *user;
-   int sigpending;
+   struct sigqueue *q = kmem_cache_alloc(sigqueue_cachep, flags);
 
-   /*
-* Protect access to @t credentials. This can go away when all
-* callers hold rcu read lock.
-*
-* NOTE! A pending signal will hold on to the user refcount,
-* and we get/put the refcount only when the sigpending count
-* changes from/to zero.
-*/
-   rcu_read_lock();
-   user = __task_cred(t)->user;
-   sigpending = atomic_inc_return(&user->sigpending);
-   if (sigpending == 1)
-   get_uid(user);
-   rcu_read_unlock();
+   if (likely(q != NULL)) {
+   bool overlimit;
 
-   if (override_rlimit || likely(sigpending <= task_rlimit(t, 
RLIMIT_SIGPENDING))) {
-   q = kmem_cache_alloc(sigqueue_cachep, flags);
-   } else {
-   print_dropped_signal(sig);
-   }
-
-   if (unlikely(q == NULL)) {
-   if (atomic_dec_and_test(&user->sigpending))
-   free_uid(user);
-   } else {
INIT_LIST_HEAD(&q->list);
q->flags = 0;
-   q->user = user;
+
+   /*
+* Protect access to @t credentials. This can go away when all
+

[PATCH v9 1/8] Increase size of ucounts to atomic_long_t

2021-03-23 Thread Alexey Gladkov

RLIMIT_MSGQUEUE and RLIMIT_MEMLOCK use unsigned long to store their
counters. In preparation for moving these rlimits onto ucounts, we need
to increase the size of the ucounts counters to long.
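
The "increment only while below the limit" helper that this patch widens can
be sketched with C11 atomics. This is a userspace illustration, not the
kernel function: long_inc_below() is a hypothetical name, and unlike the
diff below it takes the limit as a long so that the bound has the same width
as the counter.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Atomically increment *v unless it has already reached limit. */
static bool long_inc_below(atomic_long *v, long limit)
{
	long c = atomic_load(v);

	for (;;) {
		if (c >= limit)
			return false;
		/* On failure, c is refreshed with the current value of *v. */
		if (atomic_compare_exchange_weak(v, &c, c + 1))
			return true;
	}
}
```

The compare-and-exchange loop retries only when another thread changed the
counter between the load and the update.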

Signed-off-by: Alexey Gladkov 
---
 include/linux/user_namespace.h |  4 ++--
 kernel/ucount.c| 16 
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 64cf8ebdc4ec..0bb833fd41f4 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -85,7 +85,7 @@ struct user_namespace {
struct ctl_table_header *sysctls;
 #endif
struct ucounts  *ucounts;
-   int ucount_max[UCOUNT_COUNTS];
+   long ucount_max[UCOUNT_COUNTS];
 } __randomize_layout;
 
 struct ucounts {
@@ -93,7 +93,7 @@ struct ucounts {
struct user_namespace *ns;
kuid_t uid;
int count;
-   atomic_t ucount[UCOUNT_COUNTS];
+   atomic_long_t ucount[UCOUNT_COUNTS];
 };
 
 extern struct user_namespace init_user_ns;
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 11b1596e2542..04c561751af1 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -175,14 +175,14 @@ static void put_ucounts(struct ucounts *ucounts)
kfree(ucounts);
 }
 
-static inline bool atomic_inc_below(atomic_t *v, int u)
+static inline bool atomic_long_inc_below(atomic_long_t *v, int u)
 {
-   int c, old;
-   c = atomic_read(v);
+   long c, old;
+   c = atomic_long_read(v);
for (;;) {
if (unlikely(c >= u))
return false;
-   old = atomic_cmpxchg(v, c, c+1);
+   old = atomic_long_cmpxchg(v, c, c+1);
if (likely(old == c))
return true;
c = old;
@@ -196,17 +196,17 @@ struct ucounts *inc_ucount(struct user_namespace *ns, 
kuid_t uid,
struct user_namespace *tns;
ucounts = get_ucounts(ns, uid);
for (iter = ucounts; iter; iter = tns->ucounts) {
-   int max;
+   long max;
tns = iter->ns;
max = READ_ONCE(tns->ucount_max[type]);
-   if (!atomic_inc_below(&iter->ucount[type], max))
+   if (!atomic_long_inc_below(&iter->ucount[type], max))
goto fail;
}
return ucounts;
 fail:
bad = iter;
for (iter = ucounts; iter != bad; iter = iter->ns->ucounts)
-   atomic_dec(&iter->ucount[type]);
+   atomic_long_dec(&iter->ucount[type]);
 
put_ucounts(ucounts);
return NULL;
@@ -216,7 +216,7 @@ void dec_ucount(struct ucounts *ucounts, enum ucount_type 
type)
 {
struct ucounts *iter;
for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-   int dec = atomic_dec_if_positive(&iter->ucount[type]);
+   long dec = atomic_long_dec_if_positive(&iter->ucount[type]);
WARN_ON_ONCE(dec < 0);
}
put_ucounts(ucounts);
-- 
2.29.3

Re: [PATCH] proc: delete redundant subset=pid check

2021-03-20 Thread Alexey Gladkov

On Sat, Mar 20, 2021 at 06:46:08PM +0300, Alexey Dobriyan wrote:
> Two checks in lookup and readdir code should be enough to not have
> third check in open code.
> 
> Can't open what can't be looked up?

As far as I remember, I first added pidonly processing here and only later
disabled lookup. Now this check is unnecessary.

Acked-by: Alexey Gladkov 

> Signed-off-by: Alexey Dobriyan 
> ---
> 
>  fs/proc/inode.c |4 
>  1 file changed, 4 deletions(-)
> 
> --- a/fs/proc/inode.c
> +++ b/fs/proc/inode.c
> @@ -483,7 +483,6 @@ proc_reg_get_unmapped_area(struct file *file, unsigned 
> long orig_addr,
>  
>  static int proc_reg_open(struct inode *inode, struct file *file)
>  {
> - struct proc_fs_info *fs_info = proc_sb_info(inode->i_sb);
>   struct proc_dir_entry *pde = PDE(inode);
>   int rv = 0;
>   typeof_member(struct proc_ops, proc_open) open;
> @@ -497,9 +496,6 @@ static int proc_reg_open(struct inode *inode, struct file 
> *file)
>   return rv;
>   }
>  
> - if (fs_info->pidonly == PROC_PIDONLY_ON)
> - return -ENOENT;
> -
>   /*
>* Ensure that
>* 1) PDE's ->release hook will be called no matter what
> 

-- 
Rgrds, legion

Re: [PATCH] proc: test subset=pid

2021-03-20 Thread Alexey Gladkov

On Sat, Mar 20, 2021 at 06:48:55PM +0300, Alexey Dobriyan wrote:
> Test that /proc instance mounted with
> 
>   mount -t proc -o subset=pid
> 
> contains only ".", "..", "self", "thread-self" and pid directories.
> 
> Note:
> Currently "subset=pid" doesn't return "." and ".." via readdir.
> This must be a bug.

Oops. Good catch! Looks good to me.

Acked-by: Alexey Gladkov 

> Signed-off-by: Alexey Dobriyan 
> ---
> 
>  tools/testing/selftests/proc/Makefile  |1 
>  tools/testing/selftests/proc/proc-subset-pid.c |  121 
> +
>  2 files changed, 122 insertions(+)
> 
> --- a/tools/testing/selftests/proc/Makefile
> +++ b/tools/testing/selftests/proc/Makefile
> @@ -12,6 +12,7 @@ TEST_GEN_PROGS += proc-self-map-files-001
>  TEST_GEN_PROGS += proc-self-map-files-002
>  TEST_GEN_PROGS += proc-self-syscall
>  TEST_GEN_PROGS += proc-self-wchan
> +TEST_GEN_PROGS += proc-subset-pid
>  TEST_GEN_PROGS += proc-uptime-001
>  TEST_GEN_PROGS += proc-uptime-002
>  TEST_GEN_PROGS += read
> new file mode 100644
> --- /dev/null
> +++ b/tools/testing/selftests/proc/proc-subset-pid.c
> @@ -0,0 +1,121 @@
> +/*
> + * Copyright (c) 2021 Alexey Dobriyan 
> + *
> + * Permission to use, copy, modify, and distribute this software for any
> + * purpose with or without fee is hereby granted, provided that the above
> + * copyright notice and this permission notice appear in all copies.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
> + * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
> + * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
> + * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
> + * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
> + * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
> + * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
> + */
> +/*
> + * Test that "mount -t proc -o subset=pid" hides everything but pids,
> + * /proc/self and /proc/thread-self.
> + */
> +#undef NDEBUG
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +static inline bool streq(const char *a, const char *b)
> +{
> + return strcmp(a, b) == 0;
> +}
> +
> +static void make_private_proc(void)
> +{
> + if (unshare(CLONE_NEWNS) == -1) {
> + if (errno == ENOSYS || errno == EPERM) {
> + exit(4);
> + }
> + exit(1);
> + }
> + if (mount(NULL, "/", NULL, MS_PRIVATE|MS_REC, NULL) == -1) {
> + exit(1);
> + }
> + if (mount(NULL, "/proc", "proc", 0, "subset=pid") == -1) {
> + exit(1);
> + }
> +}
> +
> +static bool string_is_pid(const char *s)
> +{
> + while (1) {
> + switch (*s++) {
> + case '0':case '1':case '2':case '3':case '4':
> + case '5':case '6':case '7':case '8':case '9':
> + continue;
> +
> + case '\0':
> + return true;
> +
> + default:
> + return false;
> + }
> + }
> +}
> +
> +int main(void)
> +{
> + make_private_proc();
> +
> + DIR *d = opendir("/proc");
> + assert(d);
> +
> + struct dirent *de;
> +
> + bool dot = false;
> + bool dot_dot = false;
> + bool self = false;
> + bool thread_self = false;
> +
> + while ((de = readdir(d))) {
> + if (streq(de->d_name, ".")) {
> + assert(!dot);
> + dot = true;
> + assert(de->d_type == DT_DIR);
> + } else if (streq(de->d_name, "..")) {
> + assert(!dot_dot);
> + dot_dot = true;
> + assert(de->d_type == DT_DIR);
> + } else if (streq(de->d_name, "self")) {
> + assert(!self);
> + self = true;
> + assert(de->d_type == DT_LNK);
> + } else if (streq(de->d_name, "thread-self")) {
> + assert(!thread_self);
> + thread_self = true;
> + assert(de->d_type == DT_LNK);
> + } else {
> + if (!string_is_pid(de->d_name)) {
> + fprintf(stderr, "d_name '%s'\n", de->d_name);
> + assert(0);
> + }
> + assert(de->d_type == DT_DIR);
> + }
> + }
> +
> + char c;
> + int rv = readlink("/proc/cpuinfo", &c, 1);
> + assert(rv == -1 && errno == ENOENT);
> +
> + int fd = open("/proc/cpuinfo", O_RDONLY);
> + assert(fd == -1 && errno == ENOENT);
> +
> + return 0;
> +}
> 

-- 
Rgrds, legion

[PATCH v6 1/5] docs: proc: add documentation about mount restrictions

2021-03-12 Thread Alexey Gladkov

Signed-off-by: Alexey Gladkov 
---
 Documentation/filesystems/proc.rst | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/Documentation/filesystems/proc.rst 
b/Documentation/filesystems/proc.rst
index 2fa69f710e2a..5a1bb0e081fd 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -50,6 +50,7 @@ fixes/update part 1.1  Stefani Seibold   
  June 9 2009
 
   4Configuring procfs
   4.1  Mount options
+  4.2  Mount restrictions
 
   5Filesystem behavior
 
@@ -2175,6 +2176,19 @@ information about processes information, just add identd 
to this group.
 subset=pid hides all top level files and directories in the procfs that
 are not related to tasks.
 
+4.2Mount restrictions
+--
+
+If user namespaces are in use, the kernel additionally checks the instances of
+procfs available to the mounter and will not allow procfs to be mounted if:
+
+  1. This mount is not fully visible.
+
+ a. It's root directory is not the root directory of the filesystem.
+ b. If any file or non-empty procfs directory is hidden by another mount.
+
+  2. A new mount overrides the readonly option or any option from atime family.
+
 Chapter 5: Filesystem behavior
 ==
 
-- 
2.29.3

[PATCH v6 3/5] proc: Disable cancellation of subset=pid option

2021-03-12 Thread Alexey Gladkov

When procfs is mounted with the subset=pid option, there is no way to
remount it with this option removed. This is done so that whatever was
hidden does not become visible, since some checks occur only during mount.

This patch makes the limitation explicit and prints an error message.
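
The rule can be modelled in a few lines of userspace C. This is a sketch
under stated assumptions: struct mount_state, struct mount_request and
apply_options() are hypothetical names that only mirror the shape of the
check in the diff below, and returning -EINVAL is assumed to match what the
kernel's invalf() helper reports.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical mirror of the kernel's pidonly setting. */
enum pidonly { PIDONLY_OFF, PIDONLY_ON };

struct mount_state {
	enum pidonly pidonly;     /* what the mounted instance uses now */
};

struct mount_request {
	bool has_subset;          /* was a subset= option passed? */
	enum pidonly pidonly;     /* requested value */
};

/* Apply options; refuse to switch subset=pid off once it is on. */
static int apply_options(struct mount_state *st, const struct mount_request *req)
{
	if (req->has_subset) {
		if (req->pidonly != PIDONLY_ON && st->pidonly == PIDONLY_ON)
			return -EINVAL;
		st->pidonly = req->pidonly;
	}
	return 0;
}
```

A request that does not mention subset at all leaves the state unchanged and
succeeds, so unrelated remount options keep working.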

Signed-off-by: Alexey Gladkov 
---
 fs/proc/root.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/proc/root.c b/fs/proc/root.c
index 6a75ac717455..0d20bb67e79a 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -145,7 +145,7 @@ static int proc_parse_param(struct fs_context *fc, struct 
fs_parameter *param)
return 0;
 }
 
-static void proc_apply_options(struct proc_fs_info *fs_info,
+static int proc_apply_options(struct proc_fs_info *fs_info,
   struct fs_context *fc,
   struct user_namespace *user_ns)
 {
@@ -155,8 +155,12 @@ static void proc_apply_options(struct proc_fs_info 
*fs_info,
fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
if (ctx->mask & (1 << Opt_hidepid))
fs_info->hide_pid = ctx->hidepid;
-   if (ctx->mask & (1 << Opt_subset))
+   if (ctx->mask & (1 << Opt_subset)) {
+   if (ctx->pidonly != PROC_PIDONLY_ON && fs_info->pidonly == 
PROC_PIDONLY_ON)
+   return invalf(fc, "proc: subset=pid cannot be unset\n");
fs_info->pidonly = ctx->pidonly;
+   }
+   return 0;
 }
 
 static int proc_fill_super(struct super_block *s, struct fs_context *fc)
@@ -172,7 +176,9 @@ static int proc_fill_super(struct super_block *s, struct 
fs_context *fc)
 
fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
fs_info->mounter_cred = get_cred(fc->cred);
-   proc_apply_options(fs_info, fc, current_user_ns());
+   ret = proc_apply_options(fs_info, fc, current_user_ns());
+   if (ret)
+   return ret;
 
/* User space would break if executables or devices appear on proc */
s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -224,8 +230,7 @@ static int proc_reconfigure(struct fs_context *fc)
put_cred(fs_info->mounter_cred);
fs_info->mounter_cred = get_cred(fc->cred);
 
-   proc_apply_options(fs_info, fc, current_user_ns());
-   return 0;
+   return proc_apply_options(fs_info, fc, current_user_ns());
 }
 
 static int proc_get_tree(struct fs_context *fc)
-- 
2.29.3

[PATCH v6 5/5] docs: proc: add documentation about relaxing visibility restrictions

2021-03-12 Thread Alexey Gladkov

Signed-off-by: Alexey Gladkov 
---
 Documentation/filesystems/proc.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.rst 
b/Documentation/filesystems/proc.rst
index 5a1bb0e081fd..9d993aef7f1c 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -2182,7 +2182,8 @@ are not related to tasks.
 If user namespaces are in use, the kernel additionally checks the instances of
 procfs available to the mounter and will not allow procfs to be mounted if:
 
-  1. This mount is not fully visible.
+  1. This mount is not fully visible unless the new procfs is going to be
+ mounted with subset=pid option.
 
  a. It's root directory is not the root directory of the filesystem.
  b. If any file or non-empty procfs directory is hidden by another mount.
-- 
2.29.3

[PATCH v6 4/5] proc: Relax check of mount visibility

2021-03-12 Thread Alexey Gladkov

Allow procfs to be mounted with the subset=pid option even if the entire
procfs is not fully accessible to the user.
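
As a rough illustration of the relaxed rule, here is a minimal userspace
sketch. It is not the kernel's mnt_already_visible(): the structs, field
names and the function already_visible() are hypothetical, the atime
comparison is left out, and "dynamic" merely stands in for SB_I_DYNAMIC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one procfs mount the mounter can already see. */
struct seen_mount {
	bool is_fs_root;      /* mount root is the filesystem root */
	bool locked_readonly; /* the read-only flag is locked */
	bool covered;         /* a locked child mount hides a non-empty directory */
};

/* Hypothetical model of the mount being requested. */
struct new_mount {
	bool dynamic;         /* stands in for SB_I_DYNAMIC (subset=pid) */
	bool readonly;
};

/* Return true if at least one existing mount permits the new mount. */
static bool already_visible(const struct seen_mount *seen, size_t n,
			    const struct new_mount *req)
{
	assert(seen || n == 0);
	assert(req);

	for (size_t i = 0; i < n; i++) {
		const struct seen_mount *m = &seen[i];

		/* A partial (bind-mounted) view only matters for a full procfs. */
		if (!req->dynamic && !m->is_fs_root)
			continue;
		/* A locked read-only flag may never be dropped. */
		if (m->locked_readonly && !req->readonly)
			continue;
		/* Covered directories only matter for a full procfs. */
		if (!req->dynamic && m->covered)
			continue;
		return true;
	}
	/* No suitable previous mount: the new mount is refused. */
	return false;
}
```

In this model a partially hidden procfs refuses a full mount but accepts a
subset=pid one, while the read-only check applies in both cases and some
previous mount of procfs is still required.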

Signed-off-by: Alexey Gladkov 
---
 fs/namespace.c | 30 ++
 fs/proc/root.c | 16 ++--
 include/linux/fs.h |  1 +
 3 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 9d33909d0f9e..f38570fdfc3f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3951,7 +3951,8 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
/* This mount is not fully visible if it's root directory
 * is not the root directory of the filesystem.
 */
-   if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
+   if (!(sb->s_iflags & SB_I_DYNAMIC) &&
+   mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
continue;
 
/* A local view of the mount flags */
@@ -3971,18 +3972,23 @@ static bool mnt_already_visible(struct mnt_namespace 
*ns,
((mnt_flags & MNT_ATIME_MASK) != (new_flags & 
MNT_ATIME_MASK)))
continue;
 
-   /* This mount is not fully visible if there are any
-* locked child mounts that cover anything except for
-* empty directories.
+   /* If this filesystem is completely dynamic, then it
+* makes no sense to check for any child mounts.
 */
-   list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
-   struct inode *inode = child->mnt_mountpoint->d_inode;
-   /* Only worry about locked mounts */
-   if (!(child->mnt.mnt_flags & MNT_LOCKED))
-   continue;
-   /* Is the directory permanetly empty? */
-   if (!is_empty_dir_inode(inode))
-   goto next;
+   if (!(sb->s_iflags & SB_I_DYNAMIC)) {
+   /* This mount is not fully visible if there are any
+* locked child mounts that cover anything except for
+* empty directories.
+*/
+   list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
+   struct inode *inode = child->mnt_mountpoint->d_inode;
+   /* Only worry about locked mounts */
+   if (!(child->mnt.mnt_flags & MNT_LOCKED))
+   continue;
+   /* Is the directory permanetly empty? */
+   if (!is_empty_dir_inode(inode))
+   goto next;
+   }
}
/* Preserve the locked attributes */
*new_mnt_flags |= mnt_flags & (MNT_LOCK_READONLY | \
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 0d20bb67e79a..c739ed94246c 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -145,18 +145,21 @@ static int proc_parse_param(struct fs_context *fc, struct 
fs_parameter *param)
return 0;
 }
 
-static int proc_apply_options(struct proc_fs_info *fs_info,
+static int proc_apply_options(struct super_block *s,
   struct fs_context *fc,
   struct user_namespace *user_ns)
 {
struct proc_fs_context *ctx = fc->fs_private;
+   struct proc_fs_info *fs_info = proc_sb_info(s);
 
if (ctx->mask & (1 << Opt_gid))
fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
if (ctx->mask & (1 << Opt_hidepid))
fs_info->hide_pid = ctx->hidepid;
if (ctx->mask & (1 << Opt_subset)) {
-   if (ctx->pidonly != PROC_PIDONLY_ON && fs_info->pidonly == 
PROC_PIDONLY_ON)
+   if (ctx->pidonly == PROC_PIDONLY_ON)
+   s->s_iflags |= SB_I_DYNAMIC;
+   else if (fs_info->pidonly == PROC_PIDONLY_ON)
return invalf(fc, "proc: subset=pid cannot be unset\n");
fs_info->pidonly = ctx->pidonly;
}
@@ -176,9 +179,6 @@ static int proc_fill_super(struct super_block *s, struct 
fs_context *fc)
 
fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
fs_info->mounter_cred = get_cred(fc->cred);
-   ret = proc_apply_options(fs_info, fc, current_user_ns());
-   if (ret)
-   return ret;
 
/* User space would break if executables or devices appear on proc */
s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -190,6 +190,10 @@ static int proc_fill_super(struct super_block *s, struct 
fs_context *fc)
s->s_time_gran = 1;

[PATCH v6 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN

2021-03-12 Thread Alexey Gladkov

Cache the mounter's credentials and make access to the net directories
contingent on the permissions of the mounter of proc.

Do not show /proc/self/net when proc is mounted with the subset=pid option
and the mounter does not have CAP_NET_ADMIN.

Signed-off-by: Alexey Gladkov 
---
 fs/proc/proc_net.c  | 8 
 fs/proc/root.c  | 5 +
 include/linux/proc_fs.h | 1 +
 3 files changed, 14 insertions(+)

diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 18601042af99..a198f74cdb3b 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -259,6 +260,7 @@ static struct net *get_proc_task_net(struct inode *dir)
struct task_struct *task;
struct nsproxy *ns;
struct net *net = NULL;
+   struct proc_fs_info *fs_info = proc_sb_info(dir->i_sb);
 
rcu_read_lock();
task = pid_task(proc_pid(dir), PIDTYPE_PID);
@@ -271,6 +273,12 @@ static struct net *get_proc_task_net(struct inode *dir)
}
rcu_read_unlock();
 
+   if (net && (fs_info->pidonly == PROC_PIDONLY_ON) &&
+   security_capable(fs_info->mounter_cred, net->user_ns, 
CAP_NET_ADMIN, CAP_OPT_NONE) < 0) {
+   put_net(net);
+   net = NULL;
+   }
+
return net;
 }
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 5e444d4f9717..6a75ac717455 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -171,6 +171,7 @@ static int proc_fill_super(struct super_block *s, struct 
fs_context *fc)
return -ENOMEM;
 
fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
+   fs_info->mounter_cred = get_cred(fc->cred);
proc_apply_options(fs_info, fc, current_user_ns());
 
/* User space would break if executables or devices appear on proc */
@@ -220,6 +221,9 @@ static int proc_reconfigure(struct fs_context *fc)
 
sync_filesystem(sb);
 
+   put_cred(fs_info->mounter_cred);
+   fs_info->mounter_cred = get_cred(fc->cred);
+
proc_apply_options(fs_info, fc, current_user_ns());
return 0;
 }
@@ -274,6 +278,7 @@ static void proc_kill_sb(struct super_block *sb)
 
kill_anon_super(sb);
put_pid_ns(fs_info->pid_ns);
+   put_cred(fs_info->mounter_cred);
kfree(fs_info);
 }
 
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 000cc0533c33..ffa871941bd0 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -64,6 +64,7 @@ struct proc_fs_info {
kgid_t pid_gid;
enum proc_hidepid hide_pid;
enum proc_pidonly pidonly;
+   const struct cred *mounter_cred;
 };
 
 static inline struct proc_fs_info *proc_sb_info(struct super_block *sb)
-- 
2.29.3

[PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility

2021-03-12 Thread Alexey Gladkov

Allow procfs to be mounted with the subset=pid option even if the entire
procfs is not fully accessible to the mounter.
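
For reference, a minimal sketch of how userspace might request such a
mount through mount(2). The helper name mount_proc_pidonly() and the
choice of mount flags are assumptions made for illustration; the call
needs mount privileges in the caller's namespace and a kernel that
understands the subset=pid option.

```c
#include <errno.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: try to mount a pid-only procfs at `target`.
 * Returns 0 on success or a negative errno value on failure.
 */
static int mount_proc_pidonly(const char *target)
{
	/* The filesystem-specific option string is passed as mount data. */
	if (mount("proc", target, "proc",
		  MS_NOSUID | MS_NODEV | MS_NOEXEC, "subset=pid") < 0)
		return -errno;
	return 0;
}
```

A caller would typically create the target directory first and treat
-EPERM as "the visibility or permission checks refused the mount".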

Changelog
-
v6:
* Add documentation about procfs mount restrictions.
* Reorder commits for better review.

v4:
* Set SB_I_DYNAMIC only if pidonly is set.
* Add an error message if subset=pid is canceled during remount.

v3:
* Add 'const' to struct cred *mounter_cred (fix kernel test robot warning).

v2:
* Cache the mounter's credentials and make access to the net directories
  contingent on the permissions of the mounter of procfs.

--

Alexey Gladkov (5):
  docs: proc: add documentation about mount restrictions
  proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN
  proc: Disable cancellation of subset=pid option
  proc: Relax check of mount visibility
  docs: proc: add documentation about relaxing visibility restrictions

 Documentation/filesystems/proc.rst | 15 +++
 fs/namespace.c | 30 ++
 fs/proc/proc_net.c |  8 
 fs/proc/root.c | 24 +++-
 include/linux/fs.h |  1 +
 include/linux/proc_fs.h|  1 +
 6 files changed, 62 insertions(+), 17 deletions(-)

-- 
2.29.3

Re: [PATCH v5 0/5] proc: Relax check of mount visibility

2021-03-10 Thread Alexey Gladkov

On Wed, Mar 10, 2021 at 07:19:55PM +0100, Alexey Gladkov wrote:
> If only the dynamic part of procfs is mounted (subset=pid), then there is no
> need to check if procfs is fully visible to the user in the new user 
> namespace.

I'm sorry about that unfinished patch set. Please ignore it.

> Changelog
> -
> v4:
> * Set SB_I_DYNAMIC only if pidonly is set.
> * Add an error message if subset=pid is canceled during remount.
> 
> v3:
> * Add 'const' to struct cred *mounter_cred (fix kernel test robot warning).
> 
> v2:
> * cache the mounters credentials and make access to the net directories
>   contingent of the permissions of the mounter of procfs.
> 
> --
> 
> Alexey Gladkov (5):
>   docs: proc: add documentation about mount restrictions
>   proc: Show /proc/self/net only for CAP_NET_ADMIN
>   proc: Disable cancellation of subset=pid option
>   proc: Relax check of mount visibility
>   docs: proc: add documentation about relaxing visibility restrictions
> 
>  Documentation/filesystems/proc.rst | 18 ++
>  fs/namespace.c | 27 ---
>  fs/proc/proc_net.c |  8 
>  fs/proc/root.c | 25 +++--
>  include/linux/fs.h |  1 +
>  include/linux/proc_fs.h|  1 +
>  6 files changed, 63 insertions(+), 17 deletions(-)
> 
> -- 
> 2.29.2
> 

-- 
Rgrds, legion

Re: [RESEND PATCH v4 0/3] proc: Relax check of mount visibility

2021-03-10 Thread Alexey Gladkov

On Mon, Feb 22, 2021 at 09:44:40AM -0600, Eric W. Biederman wrote:
> Alexey Gladkov  writes:
> 
> > If only the dynamic part of procfs is mounted (subset=pid), then there is no
> > need to check if procfs is fully visible to the user in the new user
> > namespace.
> 
> 
> A couple of things.
> 
> 1) Allowing the mount should come in the last patch.  So we don't have a
> bisect hazard.
> 
> 2) We should document that we still require a mount of proc to match on
> atime and readonly mount attributes.

Ok. I will try to do it in v5.

> 3) If we can find a way to safely not require a previous mount of proc
> this will be much more valuable.

True, but for now I have no idea how to do it. I would prefer to move in
small steps.

-- 
Rgrds, legion

[PATCH v5 5/5] docs: proc: add documentation about relaxing visibility restrictions

2021-03-10 Thread Alexey Gladkov

Signed-off-by: Alexey Gladkov 
---
 Documentation/filesystems/proc.rst | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/Documentation/filesystems/proc.rst 
b/Documentation/filesystems/proc.rst
index 3daf0e7d1071..9d2985a7aad6 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -2190,6 +2190,8 @@ available to you and will not allow procfs to be mounted 
if:
   2. Mount is prohibited if a new mount overrides the readonly option or family
  of atime options.
   3. If any file or non-empty procfs directory is hidden by another filesystem.
+ You can still mount procfs even with overlapped directories if the
+ subset=pid option is used.
 
 Chapter 5: Filesystem behavior
 ==
-- 
2.29.2

[PATCH v5 4/5] proc: Relax check of mount visibility

2021-03-10 Thread Alexey Gladkov

Allow procfs to be mounted with the subset=pid option even if the entire
procfs is not fully accessible to the user.

Signed-off-by: Alexey Gladkov 
---
 fs/namespace.c | 27 ---
 fs/proc/root.c | 17 ++---
 include/linux/fs.h |  1 +
 3 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 9d33909d0f9e..f9a38584f865 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3971,18 +3971,23 @@ static bool mnt_already_visible(struct mnt_namespace 
*ns,
((mnt_flags & MNT_ATIME_MASK) != (new_flags & 
MNT_ATIME_MASK)))
continue;
 
-   /* This mount is not fully visible if there are any
-* locked child mounts that cover anything except for
-* empty directories.
+   /* If this filesystem is completely dynamic, then it
+* makes no sense to check for any child mounts.
 */
-   list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
-   struct inode *inode = child->mnt_mountpoint->d_inode;
-   /* Only worry about locked mounts */
-   if (!(child->mnt.mnt_flags & MNT_LOCKED))
-   continue;
-   /* Is the directory permanetly empty? */
-   if (!is_empty_dir_inode(inode))
-   goto next;
+   if (!(sb->s_iflags & SB_I_DYNAMIC)) {
+   /* This mount is not fully visible if there are any
+* locked child mounts that cover anything except for
+* empty directories.
+*/
+   list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
+   struct inode *inode = child->mnt_mountpoint->d_inode;
+   /* Only worry about locked mounts */
+   if (!(child->mnt.mnt_flags & MNT_LOCKED))
+   continue;
+   /* Is the directory permanetly empty? */
+   if (!is_empty_dir_inode(inode))
+   goto next;
+   }
}
/* Preserve the locked attributes */
*new_mnt_flags |= mnt_flags & (MNT_LOCK_READONLY | \
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 0d20bb67e79a..049d5c125f8f 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -145,18 +145,21 @@ static int proc_parse_param(struct fs_context *fc, struct 
fs_parameter *param)
return 0;
 }
 
-static int proc_apply_options(struct proc_fs_info *fs_info,
+static int proc_apply_options(struct super_block *s,
   struct fs_context *fc,
   struct user_namespace *user_ns)
 {
struct proc_fs_context *ctx = fc->fs_private;
+   struct proc_fs_info *fs_info = proc_sb_info(s);
 
if (ctx->mask & (1 << Opt_gid))
fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
if (ctx->mask & (1 << Opt_hidepid))
fs_info->hide_pid = ctx->hidepid;
if (ctx->mask & (1 << Opt_subset)) {
-   if (ctx->pidonly != PROC_PIDONLY_ON && fs_info->pidonly == 
PROC_PIDONLY_ON)
+   if (ctx->pidonly == PROC_PIDONLY_ON)
+   s->s_iflags |= SB_I_DYNAMIC;
+   else if (fs_info->pidonly == PROC_PIDONLY_ON)
return invalf(fc, "proc: subset=pid cannot be unset\n");
fs_info->pidonly = ctx->pidonly;
}
@@ -176,9 +179,6 @@ static int proc_fill_super(struct super_block *s, struct 
fs_context *fc)
 
fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
fs_info->mounter_cred = get_cred(fc->cred);
-   ret = proc_apply_options(fs_info, fc, current_user_ns());
-   if (ret)
-   return ret;
 
/* User space would break if executables or devices appear on proc */
s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -190,6 +190,10 @@ static int proc_fill_super(struct super_block *s, struct 
fs_context *fc)
s->s_time_gran = 1;
s->s_fs_info = fs_info;
 
+   ret = proc_apply_options(s, fc, current_user_ns());
+   if (ret)
+   return ret;
+
/*
 * procfs isn't actually a stacking filesystem; however, there is
 * too much magic going on inside it to permit stacking things on
@@ -223,14 +227,13 @@ static int proc_fill_super(struct super_block *s, struct 
fs_context *fc)
 static int proc_reconfigure(struct fs_context *fc)
 {
struct super_block *sb = fc->root->d_sb;
-   struct proc_fs_info

[PATCH v5 3/5] proc: Disable cancellation of subset=pid option

2021-03-10 Thread Alexey Gladkov

Once procfs has been mounted with the subset=pid option, it cannot be
remounted without it. This restriction keeps a remount from making visible
what was hidden, since some checks are performed only at mount time.

This patch makes the limitation explicit and reports an error.
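
The intended behaviour can be sketched in userspace as follows. This is
only a model of the check, not the kernel code: the types and the
function apply_subset() are hypothetical stand-ins for proc_fs_info,
proc_fs_context and proc_apply_options().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-ins for PROC_PIDONLY_OFF / PROC_PIDONLY_ON. */
enum pidonly { PIDONLY_OFF, PIDONLY_ON };

/* Hypothetical model of the state kept for a mounted procfs instance. */
struct fs_state {
	enum pidonly pidonly;
};

/* Hypothetical model of the options parsed for a mount or remount. */
struct mount_request {
	bool subset_given;    /* models ctx->mask & (1 << Opt_subset) */
	enum pidonly pidonly; /* value requested by subset= */
};

/* Apply the request; refuse to turn subset=pid off once it is on. */
static int apply_subset(struct fs_state *fs, const struct mount_request *req)
{
	assert(fs && req);

	if (!req->subset_given)
		return 0;
	if (req->pidonly != PIDONLY_ON && fs->pidonly == PIDONLY_ON)
		return -EINVAL;	/* "subset=pid cannot be unset" */
	fs->pidonly = req->pidonly;
	return 0;
}
```

Note that a remount which does not mention subset= at all leaves the
current setting untouched, so only an explicit attempt to unset the
option is rejected.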

Signed-off-by: Alexey Gladkov 
---
 fs/proc/root.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/proc/root.c b/fs/proc/root.c
index 6a75ac717455..0d20bb67e79a 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -145,7 +145,7 @@ static int proc_parse_param(struct fs_context *fc, struct 
fs_parameter *param)
return 0;
 }
 
-static void proc_apply_options(struct proc_fs_info *fs_info,
+static int proc_apply_options(struct proc_fs_info *fs_info,
   struct fs_context *fc,
   struct user_namespace *user_ns)
 {
@@ -155,8 +155,12 @@ static void proc_apply_options(struct proc_fs_info 
*fs_info,
fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
if (ctx->mask & (1 << Opt_hidepid))
fs_info->hide_pid = ctx->hidepid;
-   if (ctx->mask & (1 << Opt_subset))
+   if (ctx->mask & (1 << Opt_subset)) {
+   if (ctx->pidonly != PROC_PIDONLY_ON && fs_info->pidonly == 
PROC_PIDONLY_ON)
+   return invalf(fc, "proc: subset=pid cannot be unset\n");
fs_info->pidonly = ctx->pidonly;
+   }
+   return 0;
 }
 
 static int proc_fill_super(struct super_block *s, struct fs_context *fc)
@@ -172,7 +176,9 @@ static int proc_fill_super(struct super_block *s, struct 
fs_context *fc)
 
fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
fs_info->mounter_cred = get_cred(fc->cred);
-   proc_apply_options(fs_info, fc, current_user_ns());
+   ret = proc_apply_options(fs_info, fc, current_user_ns());
+   if (ret)
+   return ret;
 
/* User space would break if executables or devices appear on proc */
s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -224,8 +230,7 @@ static int proc_reconfigure(struct fs_context *fc)
put_cred(fs_info->mounter_cred);
fs_info->mounter_cred = get_cred(fc->cred);
 
-   proc_apply_options(fs_info, fc, current_user_ns());
-   return 0;
+   return proc_apply_options(fs_info, fc, current_user_ns());
 }
 
 static int proc_get_tree(struct fs_context *fc)
-- 
2.29.2

[PATCH v5 1/5] docs: proc: add documentation about mount restrictions

2021-03-10 Thread Alexey Gladkov

Signed-off-by: Alexey Gladkov 
---
 Documentation/filesystems/proc.rst | 16 
 1 file changed, 16 insertions(+)

diff --git a/Documentation/filesystems/proc.rst 
b/Documentation/filesystems/proc.rst
index 2fa69f710e2a..3daf0e7d1071 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -50,6 +50,7 @@ fixes/update part 1.1  Stefani Seibold   
  June 9 2009
 
   4Configuring procfs
   4.1  Mount options
+  4.2  Mount restrictions
 
   5Filesystem behavior
 
@@ -2175,6 +2176,21 @@ information about processes information, just add identd 
to this group.
 subset=pid hides all top level files and directories in the procfs that
 are not related to tasks.
 
+4.2Mount restrictions
+--
+
+The procfs can be mounted without any special restrictions if user namespace is
+not used. You only need to have permission to mount (CAP_SYS_ADMIN).
+
+If you are inside the user namespace, the kernel checks the instances of procfs
+available to you and will not allow procfs to be mounted if:
+
+  1. There is a bind mount of part of procfs visible. Whoever mounts should be
+ able to see the entire filesystem.
+  2. Mount is prohibited if a new mount overrides the readonly option or family
+ of atime options.
+  3. If any file or non-empty procfs directory is hidden by another filesystem.
+
 Chapter 5: Filesystem behavior
 ==
 
-- 
2.29.2

[PATCH v5 2/5] proc: Show /proc/self/net only for CAP_NET_ADMIN

2021-03-10 Thread Alexey Gladkov

Cache the mounter's credentials and make access to the net directories
contingent on the permissions of the mounter of proc.

When proc is mounted with the subset=pid option, show /proc/self/net only
if the mounter has CAP_NET_ADMIN.

Signed-off-by: Alexey Gladkov 
---
 fs/proc/proc_net.c  | 8 
 fs/proc/root.c  | 5 +
 include/linux/proc_fs.h | 1 +
 3 files changed, 14 insertions(+)

diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 18601042af99..a198f74cdb3b 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -259,6 +260,7 @@ static struct net *get_proc_task_net(struct inode *dir)
struct task_struct *task;
struct nsproxy *ns;
struct net *net = NULL;
+   struct proc_fs_info *fs_info = proc_sb_info(dir->i_sb);
 
rcu_read_lock();
task = pid_task(proc_pid(dir), PIDTYPE_PID);
@@ -271,6 +273,12 @@ static struct net *get_proc_task_net(struct inode *dir)
}
rcu_read_unlock();
 
+   if (net && (fs_info->pidonly == PROC_PIDONLY_ON) &&
+   security_capable(fs_info->mounter_cred, net->user_ns, 
CAP_NET_ADMIN, CAP_OPT_NONE) < 0) {
+   put_net(net);
+   net = NULL;
+   }
+
return net;
 }
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 5e444d4f9717..6a75ac717455 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -171,6 +171,7 @@ static int proc_fill_super(struct super_block *s, struct 
fs_context *fc)
return -ENOMEM;
 
fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
+   fs_info->mounter_cred = get_cred(fc->cred);
proc_apply_options(fs_info, fc, current_user_ns());
 
/* User space would break if executables or devices appear on proc */
@@ -220,6 +221,9 @@ static int proc_reconfigure(struct fs_context *fc)
 
sync_filesystem(sb);
 
+   put_cred(fs_info->mounter_cred);
+   fs_info->mounter_cred = get_cred(fc->cred);
+
proc_apply_options(fs_info, fc, current_user_ns());
return 0;
 }
@@ -274,6 +278,7 @@ static void proc_kill_sb(struct super_block *sb)
 
kill_anon_super(sb);
put_pid_ns(fs_info->pid_ns);
+   put_cred(fs_info->mounter_cred);
kfree(fs_info);
 }
 
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 000cc0533c33..ffa871941bd0 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -64,6 +64,7 @@ struct proc_fs_info {
kgid_t pid_gid;
enum proc_hidepid hide_pid;
enum proc_pidonly pidonly;
+   const struct cred *mounter_cred;
 };
 
 static inline struct proc_fs_info *proc_sb_info(struct super_block *sb)
-- 
2.29.2

[PATCH v5 0/5] proc: Relax check of mount visibility

2021-03-10 Thread Alexey Gladkov

If only the dynamic part of procfs is mounted (subset=pid), then there is no
need to check if procfs is fully visible to the user in the new user namespace.

Changelog
-
v4:
* Set SB_I_DYNAMIC only if pidonly is set.
* Add an error message if subset=pid is canceled during remount.

v3:
* Add 'const' to struct cred *mounter_cred (fix kernel test robot warning).

v2:
* Cache the mounter's credentials and make access to the net directories
  contingent on the permissions of the mounter of procfs.

--

Alexey Gladkov (5):
  docs: proc: add documentation about mount restrictions
  proc: Show /proc/self/net only for CAP_NET_ADMIN
  proc: Disable cancellation of subset=pid option
  proc: Relax check of mount visibility
  docs: proc: add documentation about relaxing visibility restrictions

 Documentation/filesystems/proc.rst | 18 ++
 fs/namespace.c | 27 ---
 fs/proc/proc_net.c |  8 
 fs/proc/root.c | 25 +++--
 include/linux/fs.h |  1 +
 include/linux/proc_fs.h|  1 +
 6 files changed, 63 insertions(+), 17 deletions(-)

-- 
2.29.2

[PATCH v8 7/8] Reimplement RLIMIT_MEMLOCK on top of ucounts

2021-03-10 Thread Alexey Gladkov

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

Changelog
v8:
* Fix issues found by lkp-tests project.

v7:
* Keep only ucounts for RLIMIT_MEMLOCK checks instead of struct cred.

v6:
* Fix bug in hugetlb_file_setup() detected by trinity.

Reported-by: kernel test robot 
Signed-off-by: Alexey Gladkov 
---
 fs/hugetlbfs/inode.c   | 16 
 include/linux/hugetlb.h|  4 ++--
 include/linux/mm.h |  4 ++--
 include/linux/sched/user.h |  1 -
 include/linux/shmem_fs.h   |  2 +-
 include/linux/user_namespace.h |  1 +
 ipc/shm.c  | 26 +-
 kernel/fork.c  |  1 +
 kernel/ucount.c|  1 +
 kernel/user.c  |  1 -
 kernel/user_namespace.c|  1 +
 mm/memfd.c |  4 ++--
 mm/mlock.c | 23 +++
 mm/mmap.c  |  4 ++--
 mm/shmem.c |  8 
 15 files changed, 53 insertions(+), 44 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 21c20fd5f9ee..cea98b68f271 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1452,7 +1452,7 @@ static int get_hstate_idx(int page_size_log)
  * otherwise hugetlb_reserve_pages reserves one less hugepages than intended.
  */
 struct file *hugetlb_file_setup(const char *name, size_t size,
-   vm_flags_t acctflag, struct user_struct **user,
+   vm_flags_t acctflag, struct ucounts **ucounts,
int creat_flags, int page_size_log)
 {
struct inode *inode;
@@ -1464,20 +1464,20 @@ struct file *hugetlb_file_setup(const char *name, 
size_t size,
if (hstate_idx < 0)
return ERR_PTR(-ENODEV);
 
-   *user = NULL;
+   *ucounts = NULL;
mnt = hugetlbfs_vfsmount[hstate_idx];
if (!mnt)
return ERR_PTR(-ENOENT);
 
if (creat_flags == HUGETLB_SHMFS_INODE && !can_do_hugetlb_shm()) {
-   *user = current_user();
-   if (user_shm_lock(size, *user)) {
+   *ucounts = current_ucounts();
+   if (user_shm_lock(size, *ucounts)) {
task_lock(current);
pr_warn_once("%s (%d): Using mlock ulimits for 
SHM_HUGETLB is deprecated\n",
current->comm, current->pid);
task_unlock(current);
} else {
-   *user = NULL;
+   *ucounts = NULL;
return ERR_PTR(-EPERM);
}
}
@@ -1504,9 +1504,9 @@ struct file *hugetlb_file_setup(const char *name, size_t 
size,
 
iput(inode);
 out:
-   if (*user) {
-   user_shm_unlock(size, *user);
-   *user = NULL;
+   if (*ucounts) {
+   user_shm_unlock(size, *ucounts);
+   *ucounts = NULL;
}
return file;
 }
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b5807f23caf8..12b78ae587a2 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -434,7 +434,7 @@ static inline struct hugetlbfs_inode_info 
*HUGETLBFS_I(struct inode *inode)
 extern const struct file_operations hugetlbfs_file_operations;
 extern const struct vm_operations_struct hugetlb_vm_ops;
 struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
-   struct user_struct **user, int creat_flags,
+   struct ucounts **ucounts, int creat_flags,
int page_size_log);
 
 static inline bool is_file_hugepages(struct file *file)
@@ -454,7 +454,7 @@ static inline struct hstate *hstate_inode(struct inode *i)
 #define is_file_hugepages(file)false
 static inline struct file *
 hugetlb_file_setup(const char *name, size_t size, vm_flags_t acctflag,
-   struct user_struct **user, int creat_flags,
+   struct ucounts **ucounts, int creat_flags,
int page_size_log)
 {
return ERR_PTR(-ENOSYS);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..64927c5492f2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1628,8 +1628,8 @@ extern bool can_do_mlock(void);
 #else
 static inline bool can_do_mlock(void) { return false; }
 #endif
-extern int user_shm_lock(size_t, struct user_struct *);
-extern void user_shm_unlock(size_t, struct user_struct *);
+extern int user_shm_lock(size_t, struct ucounts *);
+extern void user_shm_unlock(size_t, struct ucounts *);
 
 /*
  * Parameter block passed down to zap_pte_range in exceptional cases.
diff --git a/include/linux/sched/u

[PATCH v8 8/8] kselftests: Add test to check for rlimit changes in different user namespaces

2021-03-10 Thread Alexey Gladkov

The testcase runs a few instances of the program with RLIMIT_NPROC=1 as
user uid=6, in different user namespaces.
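
For readers unfamiliar with the rlimit calls the test relies on, here is
a small standalone sketch, separate from the selftest itself. The helper
names set_soft_nproc() and get_soft_nproc() are made up for this example;
it only lowers the soft limit and leaves the hard limit alone.

```c
#include <sys/resource.h>

/*
 * Hypothetical helper: set the soft RLIMIT_NPROC to n, clamped to the
 * current hard limit. Returns 0 on success, -1 on failure (errno set).
 */
static int set_soft_nproc(rlim_t n)
{
	struct rlimit lim;

	if (getrlimit(RLIMIT_NPROC, &lim) < 0)
		return -1;
	if (n > lim.rlim_max)
		n = lim.rlim_max;
	lim.rlim_cur = n;
	return setrlimit(RLIMIT_NPROC, &lim);
}

/* Hypothetical helper: read back the soft RLIMIT_NPROC into *out. */
static int get_soft_nproc(rlim_t *out)
{
	struct rlimit lim;

	if (getrlimit(RLIMIT_NPROC, &lim) < 0)
		return -1;
	*out = lim.rlim_cur;
	return 0;
}
```

The limit is inherited across fork() and execve(), which is why the
testcase can set it once in the parent before starting its children.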

Signed-off-by: Alexey Gladkov 
---
 tools/testing/selftests/Makefile  |   1 +
 tools/testing/selftests/rlimits/.gitignore|   2 +
 tools/testing/selftests/rlimits/Makefile  |   6 +
 tools/testing/selftests/rlimits/config|   1 +
 .../selftests/rlimits/rlimits-per-userns.c| 161 ++
 5 files changed, 171 insertions(+)
 create mode 100644 tools/testing/selftests/rlimits/.gitignore
 create mode 100644 tools/testing/selftests/rlimits/Makefile
 create mode 100644 tools/testing/selftests/rlimits/config
 create mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 8a917cb4426a..a6d3fde4a617 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -46,6 +46,7 @@ TARGETS += proc
 TARGETS += pstore
 TARGETS += ptrace
 TARGETS += openat2
+TARGETS += rlimits
 TARGETS += rseq
 TARGETS += rtc
 TARGETS += seccomp
diff --git a/tools/testing/selftests/rlimits/.gitignore 
b/tools/testing/selftests/rlimits/.gitignore
new file mode 100644
index ..091021f255b3
--- /dev/null
+++ b/tools/testing/selftests/rlimits/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+rlimits-per-userns
diff --git a/tools/testing/selftests/rlimits/Makefile 
b/tools/testing/selftests/rlimits/Makefile
new file mode 100644
index ..03aadb406212
--- /dev/null
+++ b/tools/testing/selftests/rlimits/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+CFLAGS += -Wall -O2 -g
+TEST_GEN_PROGS := rlimits-per-userns
+
+include ../lib.mk
diff --git a/tools/testing/selftests/rlimits/config 
b/tools/testing/selftests/rlimits/config
new file mode 100644
index ..416bd53ce982
--- /dev/null
+++ b/tools/testing/selftests/rlimits/config
@@ -0,0 +1 @@
+CONFIG_USER_NS=y
diff --git a/tools/testing/selftests/rlimits/rlimits-per-userns.c 
b/tools/testing/selftests/rlimits/rlimits-per-userns.c
new file mode 100644
index ..26dc949e93ea
--- /dev/null
+++ b/tools/testing/selftests/rlimits/rlimits-per-userns.c
@@ -0,0 +1,161 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Alexey Gladkov 
+ */
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define NR_CHILDS 2
+
+static char *service_prog;
+static uid_t user   = 6;
+static uid_t group  = 6;
+
+static void setrlimit_nproc(rlim_t n)
+{
+   pid_t pid = getpid();
+   struct rlimit limit = {
+   .rlim_cur = n,
+   .rlim_max = n
+   };
+
+   warnx("(pid=%d): Setting RLIMIT_NPROC=%ld", pid, n);
+
+   if (setrlimit(RLIMIT_NPROC, &limit) < 0)
+   err(EXIT_FAILURE, "(pid=%d): setrlimit(RLIMIT_NPROC)", pid);
+}
+
+static pid_t fork_child(void)
+{
+   pid_t pid = fork();
+
+   if (pid < 0)
+   err(EXIT_FAILURE, "fork");
+
+   if (pid > 0)
+   return pid;
+
+   pid = getpid();
+
+   warnx("(pid=%d): New process starting ...", pid);
+
+   if (prctl(PR_SET_PDEATHSIG, SIGKILL) < 0)
+   err(EXIT_FAILURE, "(pid=%d): prctl(PR_SET_PDEATHSIG)", pid);
+
+   signal(SIGUSR1, SIG_DFL);
+
+   warnx("(pid=%d): Changing to uid=%d, gid=%d", pid, user, group);
+
+   if (setgid(group) < 0)
+   err(EXIT_FAILURE, "(pid=%d): setgid(%d)", pid, group);
+   if (setuid(user) < 0)
+   err(EXIT_FAILURE, "(pid=%d): setuid(%d)", pid, user);
+
+   warnx("(pid=%d): Service running ...", pid);
+
+   warnx("(pid=%d): Unshare user namespace", pid);
+   if (unshare(CLONE_NEWUSER) < 0)
+   err(EXIT_FAILURE, "unshare(CLONE_NEWUSER)");
+
+   char *const argv[] = { "service", NULL };
+   char *const envp[] = { "I_AM_SERVICE=1", NULL };
+
+   warnx("(pid=%d): Executing real service ...", pid);
+
+   execve(service_prog, argv, envp);
+   err(EXIT_FAILURE, "(pid=%d): execve", pid);
+}
+
+int main(int argc, char **argv)
+{
+   size_t i;
+   pid_t child[NR_CHILDS];
+   int wstatus[NR_CHILDS];
+   int childs = NR_CHILDS;
+   pid_t pid;
+
+   if (getenv("I_AM_SERVICE")) {
+   pause();
+   exit(EXIT_SUCCESS);
+   }
+
+   service_prog = argv[0];
+   pid = getpid();
+
+   warnx("(pid=%d) Starting testcase", pid);
+
+   /*
+* This rlimit is not a problem for root because it can be exceeded.
+*/
+   setrlimit_nproc(1);
+
+   for (i = 0; i < NR_CH

[PATCH v8 4/8] Reimplement RLIMIT_NPROC on top of ucounts

2021-03-10 Thread Alexey Gladkov

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

To illustrate the impact of rlimits, let's say there is a program that
does not fork. Some service-A wants to run this program as user X in
multiple containers. Since the program never forks, the service wants to
set RLIMIT_NPROC=1.

service-A
 \- program (uid=1000, container1, rlimit_nproc=1)
 \- program (uid=1000, container2, rlimit_nproc=1)

The service-A sets RLIMIT_NPROC=1 and runs the program in container1.
When the service-A tries to run a program with RLIMIT_NPROC=1 in
container2 it fails since user X already has one running process.

We cannot use the existing inc_ucounts / dec_ucounts because they do not
allow us to exceed the maximum for the counter. Some rlimits can be
exceeded by root or by a user with the appropriate capability.
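
The container scenario above can be modelled in userspace roughly as
follows. This is a simplified sketch, not the kernel's ucounts code:
struct counter, counter_inc() and counter_dec() are hypothetical names,
there is no locking or atomicity, and each level simply carries its own
maximum.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-user, per-namespace counter. */
struct counter {
	struct counter *parent; /* same user in the parent namespace, or NULL */
	long count;             /* units charged at this level and below */
	long max;               /* limit that applies at this level */
};

/*
 * Charge v units at every level from c up to the root. Returns true if
 * any level is now above its maximum. The charge is kept either way,
 * because a privileged caller may be allowed to exceed the limit.
 */
static bool counter_inc(struct counter *c, long v)
{
	bool over = false;

	for (struct counter *it = c; it; it = it->parent) {
		assert(v >= 0 && it->count <= LONG_MAX - v);
		it->count += v;
		if (it->count > it->max)
			over = true;
	}
	return over;
}

/* Undo a previous charge of v units at every level. */
static void counter_dec(struct counter *c, long v)
{
	for (struct counter *it = c; it; it = it->parent) {
		assert(v >= 0 && it->count >= v);
		it->count -= v;
	}
}
```

With one counter per container for the same user, each container can run
one process under a limit of 1, while the counter in the parent namespace
still bounds the total number of processes.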

Signed-off-by: Alexey Gladkov 
---
 fs/exec.c  |  2 +-
 fs/io-wq.c | 22 ++--
 fs/io-wq.h |  2 +-
 fs/io_uring.c  |  2 +-
 include/linux/cred.h   |  2 ++
 include/linux/sched/user.h |  1 -
 include/linux/user_namespace.h | 13 
 kernel/cred.c  | 10 +++---
 kernel/exit.c  |  2 +-
 kernel/fork.c  |  9 ++---
 kernel/sys.c   |  2 +-
 kernel/ucount.c| 61 ++
 kernel/user.c  |  1 -
 kernel/user_namespace.c|  3 +-
 14 files changed, 103 insertions(+), 29 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 0371a3400be5..e6d7f186f33c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1874,7 +1874,7 @@ static int do_execveat_common(int fd, struct filename 
*filename,
 * whether NPROC limit is still exceeded.
 */
if ((current->flags & PF_NPROC_EXCEEDED) &&
-   atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
+   is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, 
rlimit(RLIMIT_NPROC))) {
retval = -EAGAIN;
goto out_ret;
}
diff --git a/fs/io-wq.c b/fs/io-wq.c
index a564f36e260c..5b6940c90c61 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "../kernel/sched/sched.h"
 #include "io-wq.h"
@@ -120,7 +121,7 @@ struct io_wq {
io_wq_work_fn *do_work;
 
struct task_struct *manager;
-   struct user_struct *user;
+   const struct cred *cred;
refcount_t refs;
struct completion done;
 
@@ -234,7 +235,7 @@ static void io_worker_exit(struct io_worker *worker)
if (worker->flags & IO_WORKER_F_RUNNING)
atomic_dec(&acct->nr_running);
if (!(worker->flags & IO_WORKER_F_BOUND))
-   atomic_dec(&wqe->wq->user->processes);
+   dec_rlimit_ucounts(wqe->wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 
1);
worker->flags = 0;
preempt_enable();
 
@@ -364,15 +365,15 @@ static void __io_worker_busy(struct io_wqe *wqe, struct 
io_worker *worker,
worker->flags |= IO_WORKER_F_BOUND;
wqe->acct[IO_WQ_ACCT_UNBOUND].nr_workers--;
wqe->acct[IO_WQ_ACCT_BOUND].nr_workers++;
-   atomic_dec(&wqe->wq->user->processes);
+   dec_rlimit_ucounts(wqe->wq->cred->ucounts, 
UCOUNT_RLIMIT_NPROC, 1);
} else {
worker->flags &= ~IO_WORKER_F_BOUND;
wqe->acct[IO_WQ_ACCT_UNBOUND].nr_workers++;
wqe->acct[IO_WQ_ACCT_BOUND].nr_workers--;
-   atomic_inc(&wqe->wq->user->processes);
+   inc_rlimit_ucounts(wqe->wq->cred->ucounts, 
UCOUNT_RLIMIT_NPROC, 1);
}
io_wqe_inc_running(wqe, worker);
-}
+   }
 }
 
 /*
@@ -707,7 +708,7 @@ static bool create_io_worker(struct io_wq *wq, struct 
io_wqe *wqe, int index)
raw_spin_unlock_irq(&wqe->lock);
 
if (index == IO_WQ_ACCT_UNBOUND)
-   atomic_inc(&wq->user->processes);
+   inc_rlimit_ucounts(wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 1);
 
refcount_inc(&wq->refs);
wake_up_process(worker->task);
@@ -838,7 +839,7 @@ static bool io_wq_can_queue(struct io_wqe *wqe, struct 
io_wqe_acct *acct,
if (free_worker)
return true;
 
-   if (atomic_read(&wqe->wq->user->processes) >= acct->max_workers &&
+   if (is_ucounts_overlimit(wqe->wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 
acct->max_workers) &&
!(capable(CAP_SYS_RESOURCE) || capable(CAP_SYS_ADMIN)))

[PATCH v8 6/8] Reimplement RLIMIT_SIGPENDING on top of ucounts

2021-03-10 Thread Alexey Gladkov

The rlimit counter is tied to a uid in the user_namespace. This allows
rlimit values to be specified in a userns even if they are already
globally exceeded by the user. However, the limits of the ancestor
user_namespaces still cannot be exceeded.

Signed-off-by: Alexey Gladkov 
---
 fs/proc/array.c|  2 +-
 include/linux/sched/user.h |  1 -
 include/linux/signal_types.h   |  4 ++-
 include/linux/user_namespace.h |  1 +
 kernel/fork.c  |  1 +
 kernel/signal.c| 57 --
 kernel/ucount.c|  1 +
 kernel/user.c  |  1 -
 kernel/user_namespace.c|  1 +
 9 files changed, 34 insertions(+), 35 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index bb87e4d89cd8..74b0ea4b7e38 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -284,7 +284,7 @@ static inline void task_sig(struct seq_file *m, struct 
task_struct *p)
collect_sigign_sigcatch(p, &ignored, &caught);
num_threads = get_nr_threads(p);
rcu_read_lock();  /* FIXME: is this correct? */
-   qsize = atomic_read(&__task_cred(p)->user->sigpending);
+   qsize = get_ucounts_value(task_ucounts(p), 
UCOUNT_RLIMIT_SIGPENDING);
rcu_read_unlock();
qlim = task_rlimit(p, RLIMIT_SIGPENDING);
unlock_task_sighand(p, &flags);
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 8a34446681aa..8ba9cec4fb99 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -12,7 +12,6 @@
  */
 struct user_struct {
refcount_t __count; /* reference count */
-   atomic_t sigpending;/* How many pending signals does this user 
have? */
 #ifdef CONFIG_FANOTIFY
atomic_t fanotify_listeners;
 #endif
diff --git a/include/linux/signal_types.h b/include/linux/signal_types.h
index 68e06c75c5b2..34cb28b8f16c 100644
--- a/include/linux/signal_types.h
+++ b/include/linux/signal_types.h
@@ -13,6 +13,8 @@ typedef struct kernel_siginfo {
__SIGINFO;
 } kernel_siginfo_t;
 
+struct ucounts;
+
 /*
  * Real Time signals may be queued.
  */
@@ -21,7 +23,7 @@ struct sigqueue {
struct list_head list;
int flags;
kernel_siginfo_t info;
-   struct user_struct *user;
+   struct ucounts *ucounts;
 };
 
 /* flags values. */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index d0fea0306394..6e8736c7aa29 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -52,6 +52,7 @@ enum ucount_type {
 #endif
UCOUNT_RLIMIT_NPROC,
UCOUNT_RLIMIT_MSGQUEUE,
+   UCOUNT_RLIMIT_SIGPENDING,
UCOUNT_COUNTS,
 };
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 0a939332efcc..99b10b9fe4b6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -824,6 +824,7 @@ void __init fork_init(void)
 
init_user_ns.ucount_max[UCOUNT_RLIMIT_NPROC] = task_rlimit(&init_task, 
RLIMIT_NPROC);
init_user_ns.ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = 
task_rlimit(&init_task, RLIMIT_MSGQUEUE);
+   init_user_ns.ucount_max[UCOUNT_RLIMIT_SIGPENDING] = 
task_rlimit(&init_task, RLIMIT_SIGPENDING);
 
 #ifdef CONFIG_VMAP_STACK
cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:vm_stack_cache",
diff --git a/kernel/signal.c b/kernel/signal.c
index 5ad8566534e7..a515e36a8a11 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -412,49 +412,44 @@ void task_join_group_stop(struct task_struct *task)
 static struct sigqueue *
 __sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int 
override_rlimit)
 {
-   struct sigqueue *q = NULL;
-   struct user_struct *user;
-   int sigpending;
+   struct sigqueue *q = kmem_cache_alloc(sigqueue_cachep, flags);
 
-   /*
-* Protect access to @t credentials. This can go away when all
-* callers hold rcu read lock.
-*
-* NOTE! A pending signal will hold on to the user refcount,
-* and we get/put the refcount only when the sigpending count
-* changes from/to zero.
-*/
-   rcu_read_lock();
-   user = __task_cred(t)->user;
-   sigpending = atomic_inc_return(&user->sigpending);
-   if (sigpending == 1)
-   get_uid(user);
-   rcu_read_unlock();
+   if (likely(q != NULL)) {
+   bool overlimit;
 
-   if (override_rlimit || likely(sigpending <= task_rlimit(t, 
RLIMIT_SIGPENDING))) {
-   q = kmem_cache_alloc(sigqueue_cachep, flags);
-   } else {
-   print_dropped_signal(sig);
-   }
-
-   if (unlikely(q == NULL)) {
-   if (atomic_dec_and_test(&user->sigpending))
-   free_uid(user);
-   } else {
INIT_LIST_HEAD(&q->list);
q->flags = 0;
-   q->user = user;
+
+   /*
+* Protect access to @t credentials. This can go away when all
+

[PATCH v8 3/8] Use atomic_t for ucounts reference counting

2021-03-10 Thread Alexey Gladkov

The current implementation of the ucounts reference counter requires the
use of spin_lock. We're going to use get_ucounts() in more performance
critical areas, such as the handling of RLIMIT_SIGPENDING.

With this change, spin_lock is needed only when the hashtable is modified.
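As an illustration of the lock-free get/put path this patch introduces,
here is a minimal userspace sketch using C11 atomics in place of the
kernel's atomic_t. The struct and function names ending in `_model` are
hypothetical; only the "+ 127u <= 127u" check is taken from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ucounts; only the refcount is modeled. */
struct ucounts_model {
    atomic_int count;
};

/*
 * True when the count is 0 or one of the 127 values just below zero,
 * i.e. the counter is dead or has wrapped around after an overflow.
 * Adding 127 as unsigned maps exactly -127..0 into the range 0..127.
 */
static bool refcount_zero_or_close_to_overflow(struct ucounts_model *uc)
{
    return (unsigned int)atomic_load(&uc->count) + 127u <= 127u;
}

/* Take a reference without a lock; NULL means the counter looks unsafe. */
static struct ucounts_model *get_ucounts_model(struct ucounts_model *uc)
{
    if (uc) {
        if (refcount_zero_or_close_to_overflow(uc))
            return NULL;
        atomic_fetch_add(&uc->count, 1);
    }
    return uc;
}

/* Drop a reference; returns true if this was the last one. */
static bool put_ucounts_model(struct ucounts_model *uc)
{
    assert(uc != NULL);
    /* atomic_fetch_sub returns the value held before the decrement. */
    return atomic_fetch_sub(&uc->count, 1) == 1;
}
```

In the patch itself, only the caller that sees the count drop to zero
takes the spin_lock, and only to unlink the entry from the hashtable.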

Signed-off-by: Alexey Gladkov 
---
 include/linux/user_namespace.h |  4 +--
 kernel/ucount.c| 60 +++---
 2 files changed, 28 insertions(+), 36 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index f71b5a4a3e74..d84cc2c0b443 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -92,7 +92,7 @@ struct ucounts {
struct hlist_node node;
struct user_namespace *ns;
kuid_t uid;
-   int count;
+   atomic_t count;
atomic_long_t ucount[UCOUNT_COUNTS];
 };
 
@@ -104,7 +104,7 @@ void retire_userns_sysctls(struct user_namespace *ns);
 struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum 
ucount_type type);
 void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
 struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid);
-struct ucounts *get_ucounts(struct ucounts *ucounts);
+struct ucounts * __must_check get_ucounts(struct ucounts *ucounts);
 void put_ucounts(struct ucounts *ucounts);
 
 #ifdef CONFIG_USER_NS
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 50cc1dfb7d28..bb3203039b5e 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -11,7 +11,7 @@
 struct ucounts init_ucounts = {
.ns= &init_user_ns,
.uid   = GLOBAL_ROOT_UID,
-   .count = 1,
+   .count = ATOMIC_INIT(1),
 };
 
 #define UCOUNTS_HASHTABLE_BITS 10
@@ -139,6 +139,22 @@ static void hlist_add_ucounts(struct ucounts *ucounts)
spin_unlock_irq(&ucounts_lock);
 }
 
+/* 127: arbitrary random number, small enough to assemble well */
+#define refcount_zero_or_close_to_overflow(ucounts) \
+   ((unsigned int) atomic_read(&ucounts->count) + 127u <= 127u)
+
+struct ucounts *get_ucounts(struct ucounts *ucounts)
+{
+   if (ucounts) {
+   if (refcount_zero_or_close_to_overflow(ucounts)) {
+   WARN_ONCE(1, "ucounts: counter has reached its maximum 
value");
+   return NULL;
+   }
+   atomic_inc(&ucounts->count);
+   }
+   return ucounts;
+}
+
 struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
 {
struct hlist_head *hashent = ucounts_hashentry(ns, uid);
@@ -155,7 +171,7 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, 
kuid_t uid)
 
new->ns = ns;
new->uid = uid;
-   new->count = 0;
+   atomic_set(&new->count, 1);
 
spin_lock_irq(&ucounts_lock);
ucounts = find_ucounts(ns, uid, hashent);
@@ -163,33 +179,12 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, 
kuid_t uid)
kfree(new);
} else {
hlist_add_head(&new->node, hashent);
-   ucounts = new;
+   spin_unlock_irq(&ucounts_lock);
+   return new;
}
}
-   if (ucounts->count == INT_MAX)
-   ucounts = NULL;
-   else
-   ucounts->count += 1;
spin_unlock_irq(&ucounts_lock);
-   return ucounts;
-}
-
-struct ucounts *get_ucounts(struct ucounts *ucounts)
-{
-   unsigned long flags;
-
-   if (!ucounts)
-   return NULL;
-
-   spin_lock_irqsave(&ucounts_lock, flags);
-   if (ucounts->count == INT_MAX) {
-   WARN_ONCE(1, "ucounts: counter has reached its maximum value");
-   ucounts = NULL;
-   } else {
-   ucounts->count += 1;
-   }
-   spin_unlock_irqrestore(&ucounts_lock, flags);
-
+   ucounts = get_ucounts(ucounts);
return ucounts;
 }
 
@@ -197,15 +192,12 @@ void put_ucounts(struct ucounts *ucounts)
 {
unsigned long flags;
 
-   spin_lock_irqsave(&ucounts_lock, flags);
-   ucounts->count -= 1;
-   if (!ucounts->count)
+   if (atomic_dec_and_test(&ucounts->count)) {
+   spin_lock_irqsave(&ucounts_lock, flags);
hlist_del_init(&ucounts->node);
-   else
-   ucounts = NULL;
-   spin_unlock_irqrestore(&ucounts_lock, flags);
-
-   kfree(ucounts);
+   spin_unlock_irqrestore(&ucounts_lock, flags);
+   kfree(ucounts);
+   }
 }
 
 static inline bool atomic_long_inc_below(atomic_long_t *v, int u)
-- 
2.29.2

[PATCH v8 1/8] Increase size of ucounts to atomic_long_t

2021-03-10 Thread Alexey Gladkov

RLIMIT_MSGQUEUE and RLIMIT_MEMLOCK use unsigned long to store their
counters. In preparation for moving these rlimits onto ucounts, we need
to increase the size of the ucounts variable to long.
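A minimal userspace sketch of the widened "increment only while below
the maximum" loop, written with C11 atomics. The function name
`long_inc_below()` is hypothetical, and unlike the diff in this patch
the bound is taken as a long here so that byte-sized limits fit; treat
that as an assumption of the sketch rather than a statement about the
kernel helper.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Atomically increment *v, but only while its value is below max.
 * Returns true if the increment happened, false if the bound was hit.
 */
static bool long_inc_below(atomic_long *v, long max)
{
    long c;

    assert(v != NULL);
    c = atomic_load(v);
    for (;;) {
        if (c >= max)
            return false;
        /*
         * On failure the current value is written back into c,
         * so the loop re-checks the bound with fresh data.
         */
        if (atomic_compare_exchange_weak(v, &c, c + 1))
            return true;
    }
}
```

With an int counter, a byte count such as an RLIMIT_MEMLOCK value of a
few gigabytes could not be represented, which is the motivation stated
in the commit message.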

Signed-off-by: Alexey Gladkov 
---
 include/linux/user_namespace.h |  4 ++--
 kernel/ucount.c| 16 
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 64cf8ebdc4ec..0bb833fd41f4 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -85,7 +85,7 @@ struct user_namespace {
struct ctl_table_header *sysctls;
 #endif
struct ucounts  *ucounts;
-   int ucount_max[UCOUNT_COUNTS];
+   long ucount_max[UCOUNT_COUNTS];
 } __randomize_layout;
 
 struct ucounts {
@@ -93,7 +93,7 @@ struct ucounts {
struct user_namespace *ns;
kuid_t uid;
int count;
-   atomic_t ucount[UCOUNT_COUNTS];
+   atomic_long_t ucount[UCOUNT_COUNTS];
 };
 
 extern struct user_namespace init_user_ns;
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 11b1596e2542..04c561751af1 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -175,14 +175,14 @@ static void put_ucounts(struct ucounts *ucounts)
kfree(ucounts);
 }
 
-static inline bool atomic_inc_below(atomic_t *v, int u)
+static inline bool atomic_long_inc_below(atomic_long_t *v, int u)
 {
-   int c, old;
-   c = atomic_read(v);
+   long c, old;
+   c = atomic_long_read(v);
for (;;) {
if (unlikely(c >= u))
return false;
-   old = atomic_cmpxchg(v, c, c+1);
+   old = atomic_long_cmpxchg(v, c, c+1);
if (likely(old == c))
return true;
c = old;
@@ -196,17 +196,17 @@ struct ucounts *inc_ucount(struct user_namespace *ns, 
kuid_t uid,
struct user_namespace *tns;
ucounts = get_ucounts(ns, uid);
for (iter = ucounts; iter; iter = tns->ucounts) {
-   int max;
+   long max;
tns = iter->ns;
max = READ_ONCE(tns->ucount_max[type]);
-   if (!atomic_inc_below(&iter->ucount[type], max))
+   if (!atomic_long_inc_below(&iter->ucount[type], max))
goto fail;
}
return ucounts;
 fail:
bad = iter;
for (iter = ucounts; iter != bad; iter = iter->ns->ucounts)
-   atomic_dec(&iter->ucount[type]);
+   atomic_long_dec(&iter->ucount[type]);
 
put_ucounts(ucounts);
return NULL;
@@ -216,7 +216,7 @@ void dec_ucount(struct ucounts *ucounts, enum ucount_type 
type)
 {
struct ucounts *iter;
for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-   int dec = atomic_dec_if_positive(&iter->ucount[type]);
+   long dec = atomic_long_dec_if_positive(&iter->ucount[type]);
WARN_ON_ONCE(dec < 0);
}
put_ucounts(ucounts);
-- 
2.29.2

[PATCH v8 5/8] Reimplement RLIMIT_MSGQUEUE on top of ucounts

2021-03-10 Thread Alexey Gladkov

The rlimit counter is tied to a uid in the user_namespace. This allows
rlimit values to be specified in a userns even if they are already
globally exceeded by the user. However, the limits of the ancestor
user_namespaces still cannot be exceeded.

Signed-off-by: Alexey Gladkov 
---
 include/linux/sched/user.h |  4 
 include/linux/user_namespace.h |  1 +
 ipc/mqueue.c   | 41 ++
 kernel/fork.c  |  1 +
 kernel/ucount.c|  1 +
 kernel/user_namespace.c|  1 +
 6 files changed, 26 insertions(+), 23 deletions(-)

diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index d33d867ad6c1..8a34446681aa 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -18,10 +18,6 @@ struct user_struct {
 #endif
 #ifdef CONFIG_EPOLL
atomic_long_t epoll_watches; /* The number of file descriptors 
currently watched */
-#endif
-#ifdef CONFIG_POSIX_MQUEUE
-   /* protected by mq_lock */
-   unsigned long mq_bytes; /* How many bytes can be allocated to mqueue? */
 #endif
unsigned long locked_shm; /* How many pages of mlocked shm ? */
unsigned long unix_inflight;/* How many files in flight in unix 
sockets */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 9d1ca370c201..d0fea0306394 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -51,6 +51,7 @@ enum ucount_type {
UCOUNT_INOTIFY_WATCHES,
 #endif
UCOUNT_RLIMIT_NPROC,
+   UCOUNT_RLIMIT_MSGQUEUE,
UCOUNT_COUNTS,
 };
 
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index beff0cfcd1e8..75dba8780c80 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -144,7 +144,7 @@ struct mqueue_inode_info {
struct pid *notify_owner;
u32 notify_self_exec_id;
struct user_namespace *notify_user_ns;
-   struct user_struct *user;   /* user who created, for accounting */
+   struct ucounts *ucounts;/* user who created, for accounting */
struct sock *notify_sock;
struct sk_buff *notify_cookie;
 
@@ -292,7 +292,6 @@ static struct inode *mqueue_get_inode(struct super_block 
*sb,
struct ipc_namespace *ipc_ns, umode_t mode,
struct mq_attr *attr)
 {
-   struct user_struct *u = current_user();
struct inode *inode;
int ret = -ENOMEM;
 
@@ -321,7 +320,7 @@ static struct inode *mqueue_get_inode(struct super_block 
*sb,
info->notify_owner = NULL;
info->notify_user_ns = NULL;
info->qsize = 0;
-   info->user = NULL;  /* set when all is ok */
+   info->ucounts = NULL;   /* set when all is ok */
info->msg_tree = RB_ROOT;
info->msg_tree_rightmost = NULL;
info->node_cache = NULL;
@@ -371,19 +370,24 @@ static struct inode *mqueue_get_inode(struct super_block 
*sb,
if (mq_bytes + mq_treesize < mq_bytes)
goto out_inode;
mq_bytes += mq_treesize;
-   spin_lock(&mq_lock);
-   if (u->mq_bytes + mq_bytes < u->mq_bytes ||
-   u->mq_bytes + mq_bytes > rlimit(RLIMIT_MSGQUEUE)) {
+   info->ucounts = get_ucounts(current_ucounts());
+   if (info->ucounts) {
+   bool overlimit;
+
+   spin_lock(&mq_lock);
+   overlimit = inc_rlimit_ucounts_and_test(info->ucounts, 
UCOUNT_RLIMIT_MSGQUEUE,
+   mq_bytes, rlimit(RLIMIT_MSGQUEUE));
+   if (overlimit) {
+   dec_rlimit_ucounts(info->ucounts, 
UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
+   spin_unlock(&mq_lock);
+   put_ucounts(info->ucounts);
+   info->ucounts = NULL;
+   /* mqueue_evict_inode() releases info->messages 
*/
+   ret = -EMFILE;
+   goto out_inode;
+   }
spin_unlock(&mq_lock);
-   /* mqueue_evict_inode() releases info->messages */
-   ret = -EMFILE;
-   goto out_inode;
}
-   u->mq_bytes += mq_bytes;
-   spin_unlock(&mq_lock);
-
-   /* all is ok */
-   info->user = get_uid(u);
} else if (S_ISDIR(mode)) {
inc_nlink(inode);
/* Some things misbehave if size == 0 on a directory */
@@ -497,7 +501,6 @@ static void mqueue_free_inode(struct inode *inode)
 static void mqueue_evict_inode(struct inode *inode)
 {
struct mqueue_inode_info *info;
-   struct user_struct *user;
struct ipc_namespace *ipc_ns;
struct msg_msg

[PATCH v8 2/8] Add a reference to ucounts for each cred

2021-03-10 Thread Alexey Gladkov

For RLIMIT_NPROC and some other rlimits, the user_struct that holds the
global limit is kept alive for the lifetime of a process by keeping it
in struct cred. Adding a pointer to ucounts in struct cred makes it
possible to track RLIMIT_NPROC not only per user in the system, but per
user in the user_namespace.

Updating ucounts may require a memory allocation, which may fail. So we
cannot change cred.ucounts in commit_creds(), because this function
cannot fail and must always return 0. For this reason, we modify
cred.ucounts before calling commit_creds().
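The ordering constraint above can be sketched in plain userspace C: do
the step that may fail first, and keep the commit step trivial so it
cannot fail. Everything here (`cred_model`, `prepare_counts()`,
`commit()`, the `fail_alloc` knob) is hypothetical and only models the
pattern, not the kernel's cred code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

struct counts_model { int refs; };
struct cred_model { struct counts_model *counts; };

/* What the modeled "task" currently uses. */
static struct cred_model *installed;

/*
 * Fallible step: may need to allocate. fail_alloc is a test knob that
 * simulates an allocation failure.
 */
static int prepare_counts(struct cred_model *new, int fail_alloc)
{
    struct counts_model *c;

    assert(new != NULL);
    c = fail_alloc ? NULL : calloc(1, sizeof(*c));
    if (!c)
        return -ENOMEM;
    c->refs = 1;
    new->counts = c;
    return 0;
}

/* Infallible step: only swaps a pointer and always returns 0. */
static int commit(struct cred_model *new)
{
    assert(new != NULL && new->counts != NULL);
    installed = new;
    return 0;
}

/* Caller: handle the error before the point of no return. */
static int change_creds(struct cred_model *new, int fail_alloc)
{
    int ret = prepare_counts(new, fail_alloc);

    if (ret < 0)
        return ret;
    return commit(new);
}
```

If prepare_counts() fails, nothing has been installed yet, so the caller
can simply unwind and report the error.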

Changelog

v6:
* Fix null-ptr-deref in is_ucounts_overlimit() detected by trinity. This
  error was caused by the fact that cred_alloc_blank() left the ucounts
  pointer empty.

Reported-by: kernel test robot 
Signed-off-by: Alexey Gladkov 
---
 fs/exec.c  |  4 
 include/linux/cred.h   |  2 ++
 include/linux/user_namespace.h |  4 
 kernel/cred.c  | 40 ++
 kernel/fork.c  |  6 +
 kernel/sys.c   | 12 ++
 kernel/ucount.c| 40 +++---
 kernel/user_namespace.c|  3 +++
 8 files changed, 108 insertions(+), 3 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 5d4d52039105..0371a3400be5 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1360,6 +1360,10 @@ int begin_new_exec(struct linux_binprm * bprm)
WRITE_ONCE(me->self_exec_id, me->self_exec_id + 1);
flush_signal_handlers(me, 0);
 
+   retval = set_cred_ucounts(bprm->cred);
+   if (retval < 0)
+   goto out_unlock;
+
/*
 * install the new credentials for this executable
 */
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 18639c069263..ad160e5fe5c6 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -144,6 +144,7 @@ struct cred {
 #endif
struct user_struct *user;   /* real user ID subscription */
struct user_namespace *user_ns; /* user_ns the caps and keyrings are 
relative to. */
+   struct ucounts *ucounts;
struct group_info *group_info;  /* supplementary groups for euid/fsgid 
*/
/* RCU deletion */
union {
@@ -170,6 +171,7 @@ extern int set_security_override_from_ctx(struct cred *, 
const char *);
 extern int set_create_files_as(struct cred *, struct inode *);
 extern int cred_fscmp(const struct cred *, const struct cred *);
 extern void __init cred_init(void);
+extern int set_cred_ucounts(struct cred *);
 
 /*
  * check for validity of credentials
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 0bb833fd41f4..f71b5a4a3e74 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -97,11 +97,15 @@ struct ucounts {
 };
 
 extern struct user_namespace init_user_ns;
+extern struct ucounts init_ucounts;
 
 bool setup_userns_sysctls(struct user_namespace *ns);
 void retire_userns_sysctls(struct user_namespace *ns);
 struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum 
ucount_type type);
 void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
+struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid);
+struct ucounts *get_ucounts(struct ucounts *ucounts);
+void put_ucounts(struct ucounts *ucounts);
 
 #ifdef CONFIG_USER_NS
 
diff --git a/kernel/cred.c b/kernel/cred.c
index 421b1149c651..58a8a9e24347 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -60,6 +60,7 @@ struct cred init_cred = {
.user   = INIT_USER,
.user_ns= &init_user_ns,
.group_info = &init_groups,
+   .ucounts= &init_ucounts,
 };
 
 static inline void set_cred_subscribers(struct cred *cred, int n)
@@ -119,6 +120,8 @@ static void put_cred_rcu(struct rcu_head *rcu)
if (cred->group_info)
put_group_info(cred->group_info);
free_uid(cred->user);
+   if (cred->ucounts)
+   put_ucounts(cred->ucounts);
put_user_ns(cred->user_ns);
kmem_cache_free(cred_jar, cred);
 }
@@ -222,6 +225,7 @@ struct cred *cred_alloc_blank(void)
 #ifdef CONFIG_DEBUG_CREDENTIALS
new->magic = CRED_MAGIC;
 #endif
+   new->ucounts = get_ucounts(&init_ucounts);
 
if (security_cred_alloc_blank(new, GFP_KERNEL_ACCOUNT) < 0)
goto error;
@@ -284,6 +288,11 @@ struct cred *prepare_creds(void)
 
if (security_prepare_creds(new, old, GFP_KERNEL_ACCOUNT) < 0)
goto error;
+
+   new->ucounts = get_ucounts(new->ucounts);
+   if (!new->ucounts)
+   goto error;
+
validate_creds(new);
return new;
 
@@ -363,6 +372,8 @@ int copy_creds(struct task_struct *p, unsigned long 
clone_flags)
ret = create_user_ns(new);
if (ret < 0)
goto error_put;
+   if (set_cred_uc

[PATCH v8 0/8] Count rlimits in each user namespace

2021-03-10 Thread Alexey Gladkov

Preface
-------
These patches bind the rlimit counters to a user in a user namespace.
This patch set can be applied on top of:

git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git v5.11

Problem
-------
The RLIMIT_NPROC, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING and RLIMIT_MSGQUEUE rlimit
implementations place their counters in user_struct [1]. These limits are global
between processes and persist for the lifetime of the process, even if
processes are in different user namespaces.

To illustrate the impact of rlimits, let's say there is a program that does not
fork. Some service-A wants to run this program as user X in multiple containers.
Since the program never forks, the service wants to set RLIMIT_NPROC=1.

service-A
 \- program (uid=1000, container1, rlimit_nproc=1)
 \- program (uid=1000, container2, rlimit_nproc=1)

The service-A sets RLIMIT_NPROC=1 and runs the program in container1. When the
service-A tries to run the program with RLIMIT_NPROC=1 in container2, it fails
because user X already has one running process.

The problem is not that the limit from container1 affects container2. The
problem is that the limit is verified against the global counter, which
reflects the number of processes in all containers.

This problem can be worked around by using a different user for each container,
but then we face a different problem: uid mapping when transferring files from
one container to another.

Eric W. Biederman mentioned this issue [2][3].

Introduced changes
------------------
To address the problem, we bind rlimit counters to the user namespace. Each
counter reflects the number of processes of a given uid in a given user
namespace. The result is a tree of rlimit counters with the biggest value at the
root (aka init_user_ns). The limit is considered exceeded if it is exceeded
anywhere up the tree.
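To make the two-container example concrete, here is a small userspace
model of such a counter tree. It is a sketch under assumptions, not the
kernel implementation: `struct uid_count`, `add_process()` and
`over_limit()` are hypothetical names, and the choice to compare the
caller's rlimit only against its own level is an assumption of this
model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-(uid, user namespace) process counter. */
struct uid_count {
    struct uid_count *parent; /* same uid in the parent namespace */
    long ns_max;              /* ceiling attached to this namespace */
    long nproc;               /* processes counted at this level */
};

/* A new process is counted in its own namespace and in every ancestor. */
static void add_process(struct uid_count *c)
{
    for (; c; c = c->parent)
        c->nproc++;
}

/*
 * The caller's rlimit is compared against its own namespace's counter;
 * every level (including the caller's) is also compared against that
 * level's ceiling.
 */
static bool over_limit(const struct uid_count *c, long rlimit)
{
    const struct uid_count *iter;

    assert(c != NULL);
    if (c->nproc > rlimit)
        return true;
    for (iter = c; iter; iter = iter->parent)
        if (iter->nproc > iter->ns_max)
            return true;
    return false;
}
```

With one host counter and two container counters below it, starting one
process in each container leaves each container at 1 and the host at 2.
A check with rlimit 1 then passes in container2, whereas the same check
against the host-wide count of 2 would fail, which is the behaviour the
old global counter produced.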

[1]: https://lore.kernel.org/containers/87imd2incs@x220.int.ebiederm.org/
[2]: https://lists.linuxfoundation.org/pipermail/containers/2020-August/042096.html
[3]: https://lists.linuxfoundation.org/pipermail/containers/2020-October/042524.html

Changelog
---------
v8:
* Used atomic_t for ucounts reference counting. Also added a counter overflow
  check (thanks to Linus Torvalds for the idea).
* Fixed other issues found by the lkp-tests project in the patch that
  reimplements RLIMIT_MEMLOCK on top of ucounts.

v7:
* Fixed issues found by the lkp-tests project in the patch that reimplements
  RLIMIT_MEMLOCK on top of ucounts.

v6:
* Fixed issues found by lkp-tests project.
* Rebased onto v5.11.

v5:
* Split the first commit into two commits: change ucounts.count type to
  atomic_long_t and add ucounts to cred. These commits were merged by
  mistake during the rebase.
* Renamed __get_ucounts() to alloc_ucounts().
* Moved the cred.ucounts update out of commit_creds(), as it did not allow
  errors to be handled.
* Added error handling of set_cred_ucounts().

v4:
* Reverted the type change of ucounts.count to refcount_t.
* Fixed typo in the kernel/cred.c

v3:
* Added get_ucounts() function to increase the reference count. The existing
  get_ucounts() function was renamed to __get_ucounts().
* Changed the type of ucounts.count from atomic_t to refcount_t.
* Dropped 'const' from set_cred_ucounts() arguments.
* Fixed a bug with freeing the cred structure after calling cred_alloc_blank().
* Commit messages have been updated.
* Added selftest.

v2:
* RLIMIT_MEMLOCK, RLIMIT_SIGPENDING and RLIMIT_MSGQUEUE are migrated to ucounts.
* Added ucounts for pair uid and user namespace into cred.
* Added the ability to increase ucount by more than 1.

v1:
* After discussion with Eric W. Biederman, I increased the size of ucounts to
  atomic_long_t.
* Added ucount_max to avoid the fork bomb.

--

Alexey Gladkov (8):
  Increase size of ucounts to atomic_long_t
  Add a reference to ucounts for each cred
  Use atomic_t for ucounts reference counting
  Reimplement RLIMIT_NPROC on top of ucounts
  Reimplement RLIMIT_MSGQUEUE on top of ucounts
  Reimplement RLIMIT_SIGPENDING on top of ucounts
  Reimplement RLIMIT_MEMLOCK on top of ucounts
  kselftests: Add test to check for rlimit changes in different user
namespaces

 fs/exec.c |   6 +-
 fs/hugetlbfs/inode.c  |  16 +-
 fs/io-wq.c|  22 ++-
 fs/io-wq.h|   2 +-
 fs/io_uring.c |   2 +-
 fs/proc/array.c   |   2 +-
 include/linux/cred.h  |   4 +
 include/linux/hugetlb.h   |   4 +-
 include/linux/mm.h|   4 +-
 include/linux/sched/user.h|   7 -
 include/linux/shmem_fs.h  |   2 +-
 include/linux/signal_types.h  |   4 +-
 include/linux/user_namespace.h|  26 ++-
 ipc/mqueue.c  |  41 ++---
 ipc/shm.c

Re: d28296d248: stress-ng.sigsegv.ops_per_sec -82.7% regression

2021-02-25 Thread Alexey Gladkov

On Wed, Feb 24, 2021 at 12:50:21PM -0600, Eric W. Biederman wrote:
> Alexey Gladkov  writes:
> 
> > On Wed, Feb 24, 2021 at 10:54:17AM -0600, Eric W. Biederman wrote:
> >> kernel test robot  writes:
> >> 
> >> > Greeting,
> >> >
> >> > FYI, we noticed a -82.7% regression of stress-ng.sigsegv.ops_per_sec due 
> >> > to commit:
> >> >
> >> >
> >> > commit: d28296d2484fa11e94dff65e93eb25802a443d47 ("[PATCH v7 5/7] 
> >> > Reimplement RLIMIT_SIGPENDING on top of ucounts")
> >> > url: 
> >> > https://github.com/0day-ci/linux/commits/Alexey-Gladkov/Count-rlimits-in-each-user-namespace/20210222-175836
> >> > base: 
> >> > https://git.kernel.org/cgit/linux/kernel/git/shuah/linux-kselftest.git 
> >> > next
> >> >
> >> > in testcase: stress-ng
> >> > on test machine: 48 threads Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz 
> >> > with 112G memory
> >> > with following parameters:
> >> >
> >> >  nr_threads: 100%
> >> >  disk: 1HDD
> >> >  testtime: 60s
> >> >  class: interrupt
> >> >  test: sigsegv
> >> >  cpufreq_governor: performance
> >> >  ucode: 0x42e
> >> >
> >> >
> >> > In addition to that, the commit also has significant impact on the
> >> > following tests:
> >> 
> >> Thank you.  Now we have a sense of where we need to test the performance
> >> of these changes carefully.
> >
> > One of the reasons for this is that I rolled back the patch that changed
> > the ucounts.count type to atomic_t. Now get_ucounts() is forced to use a
> > spin_lock to increase the reference count.
> 
> Which given the hickups with getting a working version seems justified.
> 
> Now we can add incremental patches on top to improve the performance.

I'm not sure that get_ucounts() should be used in __sigqueue_alloc() [1].
I tried removing it and running KASAN tests that were failing before. So
far, I have not found any problems.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/legion/linux.git/tree/kernel/signal.c?h=patchset/per-userspace-rlimit/v7.1=2d4a2e2be7db42c95acb98abfc2a9b370ddd0604#n428

-- 
Rgrds, legion

Re: d28296d248: stress-ng.sigsegv.ops_per_sec -82.7% regression

2021-02-24 Thread Alexey Gladkov

On Wed, Feb 24, 2021 at 10:54:17AM -0600, Eric W. Biederman wrote:
> kernel test robot  writes:
> 
> > Greeting,
> >
> > FYI, we noticed a -82.7% regression of stress-ng.sigsegv.ops_per_sec due to 
> > commit:
> >
> >
> > commit: d28296d2484fa11e94dff65e93eb25802a443d47 ("[PATCH v7 5/7] 
> > Reimplement RLIMIT_SIGPENDING on top of ucounts")
> > url: 
> > https://github.com/0day-ci/linux/commits/Alexey-Gladkov/Count-rlimits-in-each-user-namespace/20210222-175836
> > base: 
> > https://git.kernel.org/cgit/linux/kernel/git/shuah/linux-kselftest.git next
> >
> > in testcase: stress-ng
> > on test machine: 48 threads Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz with 
> > 112G memory
> > with following parameters:
> >
> > nr_threads: 100%
> > disk: 1HDD
> > testtime: 60s
> > class: interrupt
> > test: sigsegv
> > cpufreq_governor: performance
> > ucode: 0x42e
> >
> >
> > In addition to that, the commit also has significant impact on the
> > following tests:
> 
> Thank you.  Now we have a sense of where we need to test the performance
> of these changes carefully.

One of the reasons for this is that I rolled back the patch that changed
the ucounts.count type to atomic_t. Now get_ucounts() is forced to use a
spin_lock to increase the reference count.

-- 
Rgrds, legion

Re: [PATCH v6 0/7] Count rlimits in each user namespace

2021-02-22 Thread Alexey Gladkov

On Sun, Feb 21, 2021 at 02:20:00PM -0800, Linus Torvalds wrote:
> On Mon, Feb 15, 2021 at 4:42 AM Alexey Gladkov  
> wrote:
> >
> > These patches are for binding the rlimit counters to a user in user 
> > namespace.
> 
> So this is now version 6, but I think the kernel test robot keeps
> complaining about them causing KASAN issues.
> 
> The complaints seem to change, so I'm hoping they get fixed, but it
> does seem like every version there's a new one. Hmm?

First, KASAN found an unexpected bug in the second patch (Add a reference
to ucounts for each cred): I had missed that cred_alloc_blank() is used
more widely than I had found.

Now KASAN has found problems in the RLIMIT_MEMLOCK patch, which I believe
I fixed in v7.

-- 
Rgrds, legion

Re: [PATCH v6 3/7] Reimplement RLIMIT_NPROC on top of ucounts

2021-02-22 Thread Alexey Gladkov

On Sun, Feb 21, 2021 at 04:38:10PM -0700, Jens Axboe wrote:
> On 2/15/21 5:41 AM, Alexey Gladkov wrote:
> > diff --git a/fs/io-wq.c b/fs/io-wq.c
> > index a564f36e260c..5b6940c90c61 100644
> > --- a/fs/io-wq.c
> > +++ b/fs/io-wq.c
> > @@ -1090,10 +1091,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct 
> > io_wq_data *data)
> > wqe->node = alloc_node;
> > wqe->acct[IO_WQ_ACCT_BOUND].max_workers = bounded;
> > atomic_set(&wqe->acct[IO_WQ_ACCT_BOUND].nr_running, 0);
> > -   if (wq->user) {
> > -   wqe->acct[IO_WQ_ACCT_UNBOUND].max_workers =
> > -   task_rlimit(current, RLIMIT_NPROC);
> > -   }
> > +   wqe->acct[IO_WQ_ACCT_UNBOUND].max_workers = 
> > task_rlimit(current, RLIMIT_NPROC);
> 
> This doesn't look like an equivalent transformation. But that may be
> moot if we merge the io_uring-worker.v3 series, as then you would not
> have to touch io-wq at all.

In the current code the wq->user is always set to current_user():

io_uring_create [1]
`- io_sq_offload_create
   `- io_init_wq_offload [2]
      `- io_wq_create [3]

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/io_uring.c#n9752
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/io_uring.c#n8107
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/io-wq.c#n1070

So max_workers is always set, with or without the wq->user check.

-- 
Rgrds, legion

[PATCH v7 7/7] kselftests: Add test to check for rlimit changes in different user namespaces

2021-02-22 Thread Alexey Gladkov

The testcase runs a few instances of the program with RLIMIT_NPROC=1 as
user uid=6, each in a different user namespace.

Signed-off-by: Alexey Gladkov 
---
 tools/testing/selftests/Makefile  |   1 +
 tools/testing/selftests/rlimits/.gitignore|   2 +
 tools/testing/selftests/rlimits/Makefile  |   6 +
 tools/testing/selftests/rlimits/config|   1 +
 .../selftests/rlimits/rlimits-per-userns.c| 161 ++
 5 files changed, 171 insertions(+)
 create mode 100644 tools/testing/selftests/rlimits/.gitignore
 create mode 100644 tools/testing/selftests/rlimits/Makefile
 create mode 100644 tools/testing/selftests/rlimits/config
 create mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 8a917cb4426a..a6d3fde4a617 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -46,6 +46,7 @@ TARGETS += proc
 TARGETS += pstore
 TARGETS += ptrace
 TARGETS += openat2
+TARGETS += rlimits
 TARGETS += rseq
 TARGETS += rtc
 TARGETS += seccomp
diff --git a/tools/testing/selftests/rlimits/.gitignore 
b/tools/testing/selftests/rlimits/.gitignore
new file mode 100644
index 000000000000..091021f255b3
--- /dev/null
+++ b/tools/testing/selftests/rlimits/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+rlimits-per-userns
diff --git a/tools/testing/selftests/rlimits/Makefile 
b/tools/testing/selftests/rlimits/Makefile
new file mode 100644
index 000000000000..03aadb406212
--- /dev/null
+++ b/tools/testing/selftests/rlimits/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+CFLAGS += -Wall -O2 -g
+TEST_GEN_PROGS := rlimits-per-userns
+
+include ../lib.mk
diff --git a/tools/testing/selftests/rlimits/config 
b/tools/testing/selftests/rlimits/config
new file mode 100644
index 000000000000..416bd53ce982
--- /dev/null
+++ b/tools/testing/selftests/rlimits/config
@@ -0,0 +1 @@
+CONFIG_USER_NS=y
diff --git a/tools/testing/selftests/rlimits/rlimits-per-userns.c 
b/tools/testing/selftests/rlimits/rlimits-per-userns.c
new file mode 100644
index 000000000000..26dc949e93ea
--- /dev/null
+++ b/tools/testing/selftests/rlimits/rlimits-per-userns.c
@@ -0,0 +1,161 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Alexey Gladkov 
+ */
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define NR_CHILDS 2
+
+static char *service_prog;
+static uid_t user   = 6;
+static uid_t group  = 6;
+
+static void setrlimit_nproc(rlim_t n)
+{
+   pid_t pid = getpid();
+   struct rlimit limit = {
+   .rlim_cur = n,
+   .rlim_max = n
+   };
+
+   warnx("(pid=%d): Setting RLIMIT_NPROC=%ld", pid, n);
+
+   if (setrlimit(RLIMIT_NPROC, &limit) < 0)
+   err(EXIT_FAILURE, "(pid=%d): setrlimit(RLIMIT_NPROC)", pid);
+}
+
+static pid_t fork_child(void)
+{
+   pid_t pid = fork();
+
+   if (pid < 0)
+   err(EXIT_FAILURE, "fork");
+
+   if (pid > 0)
+   return pid;
+
+   pid = getpid();
+
+   warnx("(pid=%d): New process starting ...", pid);
+
+   if (prctl(PR_SET_PDEATHSIG, SIGKILL) < 0)
+   err(EXIT_FAILURE, "(pid=%d): prctl(PR_SET_PDEATHSIG)", pid);
+
+   signal(SIGUSR1, SIG_DFL);
+
+   warnx("(pid=%d): Changing to uid=%d, gid=%d", pid, user, group);
+
+   if (setgid(group) < 0)
+   err(EXIT_FAILURE, "(pid=%d): setgid(%d)", pid, group);
+   if (setuid(user) < 0)
+   err(EXIT_FAILURE, "(pid=%d): setuid(%d)", pid, user);
+
+   warnx("(pid=%d): Service running ...", pid);
+
+   warnx("(pid=%d): Unshare user namespace", pid);
+   if (unshare(CLONE_NEWUSER) < 0)
+   err(EXIT_FAILURE, "unshare(CLONE_NEWUSER)");
+
+   char *const argv[] = { "service", NULL };
+   char *const envp[] = { "I_AM_SERVICE=1", NULL };
+
+   warnx("(pid=%d): Executing real service ...", pid);
+
+   execve(service_prog, argv, envp);
+   err(EXIT_FAILURE, "(pid=%d): execve", pid);
+}
+
+int main(int argc, char **argv)
+{
+   size_t i;
+   pid_t child[NR_CHILDS];
+   int wstatus[NR_CHILDS];
+   int childs = NR_CHILDS;
+   pid_t pid;
+
+   if (getenv("I_AM_SERVICE")) {
+   pause();
+   exit(EXIT_SUCCESS);
+   }
+
+   service_prog = argv[0];
+   pid = getpid();
+
+   warnx("(pid=%d) Starting testcase", pid);
+
+   /*
+* This rlimit is not a problem for root because it can be exceeded.
+*/
+   setrlimit_nproc(1);
+
+   for (i = 0; i < NR_CH

[PATCH v7 5/7] Reimplement RLIMIT_SIGPENDING on top of ucounts

2021-02-22 Thread Alexey Gladkov

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the limits of the parent user
namespaces still cannot be exceeded.

Signed-off-by: Alexey Gladkov 
---
 fs/proc/array.c|  2 +-
 include/linux/sched/user.h |  1 -
 include/linux/signal_types.h   |  4 ++-
 include/linux/user_namespace.h |  1 +
 kernel/fork.c  |  1 +
 kernel/signal.c| 57 --
 kernel/ucount.c|  1 +
 kernel/user.c  |  1 -
 kernel/user_namespace.c|  1 +
 9 files changed, 34 insertions(+), 35 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index bb87e4d89cd8..74b0ea4b7e38 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -284,7 +284,7 @@ static inline void task_sig(struct seq_file *m, struct 
task_struct *p)
collect_sigign_sigcatch(p, , );
num_threads = get_nr_threads(p);
rcu_read_lock();  /* FIXME: is this correct? */
-   qsize = atomic_read(&__task_cred(p)->user->sigpending);
+   qsize = get_ucounts_value(task_ucounts(p), 
UCOUNT_RLIMIT_SIGPENDING);
rcu_read_unlock();
qlim = task_rlimit(p, RLIMIT_SIGPENDING);
unlock_task_sighand(p, );
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 8a34446681aa..8ba9cec4fb99 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -12,7 +12,6 @@
  */
 struct user_struct {
refcount_t __count; /* reference count */
-   atomic_t sigpending;/* How many pending signals does this user 
have? */
 #ifdef CONFIG_FANOTIFY
atomic_t fanotify_listeners;
 #endif
diff --git a/include/linux/signal_types.h b/include/linux/signal_types.h
index 68e06c75c5b2..34cb28b8f16c 100644
--- a/include/linux/signal_types.h
+++ b/include/linux/signal_types.h
@@ -13,6 +13,8 @@ typedef struct kernel_siginfo {
__SIGINFO;
 } kernel_siginfo_t;
 
+struct ucounts;
+
 /*
  * Real Time signals may be queued.
  */
@@ -21,7 +23,7 @@ struct sigqueue {
struct list_head list;
int flags;
kernel_siginfo_t info;
-   struct user_struct *user;
+   struct ucounts *ucounts;
 };
 
 /* flags values. */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 52453143fe23..f84b68832c56 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -52,6 +52,7 @@ enum ucount_type {
 #endif
UCOUNT_RLIMIT_NPROC,
UCOUNT_RLIMIT_MSGQUEUE,
+   UCOUNT_RLIMIT_SIGPENDING,
UCOUNT_COUNTS,
 };
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 0a939332efcc..99b10b9fe4b6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -824,6 +824,7 @@ void __init fork_init(void)
 
init_user_ns.ucount_max[UCOUNT_RLIMIT_NPROC] = task_rlimit(&init_task, 
RLIMIT_NPROC);
init_user_ns.ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = 
task_rlimit(&init_task, RLIMIT_MSGQUEUE);
+   init_user_ns.ucount_max[UCOUNT_RLIMIT_SIGPENDING] = 
task_rlimit(&init_task, RLIMIT_SIGPENDING);
 
 #ifdef CONFIG_VMAP_STACK
cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:vm_stack_cache",
diff --git a/kernel/signal.c b/kernel/signal.c
index 5ad8566534e7..a515e36a8a11 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -412,49 +412,44 @@ void task_join_group_stop(struct task_struct *task)
 static struct sigqueue *
 __sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int 
override_rlimit)
 {
-   struct sigqueue *q = NULL;
-   struct user_struct *user;
-   int sigpending;
+   struct sigqueue *q = kmem_cache_alloc(sigqueue_cachep, flags);
 
-   /*
-* Protect access to @t credentials. This can go away when all
-* callers hold rcu read lock.
-*
-* NOTE! A pending signal will hold on to the user refcount,
-* and we get/put the refcount only when the sigpending count
-* changes from/to zero.
-*/
-   rcu_read_lock();
-   user = __task_cred(t)->user;
-   sigpending = atomic_inc_return(&user->sigpending);
-   if (sigpending == 1)
-   get_uid(user);
-   rcu_read_unlock();
+   if (likely(q != NULL)) {
+   bool overlimit;
 
-   if (override_rlimit || likely(sigpending <= task_rlimit(t, 
RLIMIT_SIGPENDING))) {
-   q = kmem_cache_alloc(sigqueue_cachep, flags);
-   } else {
-   print_dropped_signal(sig);
-   }
-
-   if (unlikely(q == NULL)) {
-   if (atomic_dec_and_test(&user->sigpending))
-   free_uid(user);
-   } else {
INIT_LIST_HEAD(&q->list);
q->flags = 0;
-   q->user = user;
+
+   /*
+* Protect access to @t credentials. This can go away when all
+

[PATCH v7 6/7] Reimplement RLIMIT_MEMLOCK on top of ucounts

2021-02-22 Thread Alexey Gladkov

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the limits of the parent user
namespaces still cannot be exceeded.

Changelog

v7:
* Keep only ucounts for RLIMIT_MEMLOCK checks instead of struct cred.

v6:
* Fix bug in hugetlb_file_setup() detected by trinity.

Reported-by: kernel test robot 
Signed-off-by: Alexey Gladkov 
---
 fs/hugetlbfs/inode.c   | 16 
 include/linux/hugetlb.h|  4 ++--
 include/linux/mm.h |  4 ++--
 include/linux/sched/user.h |  1 -
 include/linux/shmem_fs.h   |  2 +-
 include/linux/user_namespace.h |  1 +
 ipc/shm.c  | 26 +-
 kernel/fork.c  |  1 +
 kernel/ucount.c|  1 +
 kernel/user.c  |  1 -
 kernel/user_namespace.c|  1 +
 mm/memfd.c |  4 ++--
 mm/mlock.c | 20 
 mm/mmap.c  |  4 ++--
 mm/shmem.c |  8 
 15 files changed, 50 insertions(+), 44 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 21c20fd5f9ee..cea98b68f271 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1452,7 +1452,7 @@ static int get_hstate_idx(int page_size_log)
  * otherwise hugetlb_reserve_pages reserves one less hugepages than intended.
  */
 struct file *hugetlb_file_setup(const char *name, size_t size,
-   vm_flags_t acctflag, struct user_struct **user,
+   vm_flags_t acctflag, struct ucounts **ucounts,
int creat_flags, int page_size_log)
 {
struct inode *inode;
@@ -1464,20 +1464,20 @@ struct file *hugetlb_file_setup(const char *name, 
size_t size,
if (hstate_idx < 0)
return ERR_PTR(-ENODEV);
 
-   *user = NULL;
+   *ucounts = NULL;
mnt = hugetlbfs_vfsmount[hstate_idx];
if (!mnt)
return ERR_PTR(-ENOENT);
 
if (creat_flags == HUGETLB_SHMFS_INODE && !can_do_hugetlb_shm()) {
-   *user = current_user();
-   if (user_shm_lock(size, *user)) {
+   *ucounts = current_ucounts();
+   if (user_shm_lock(size, *ucounts)) {
task_lock(current);
pr_warn_once("%s (%d): Using mlock ulimits for 
SHM_HUGETLB is deprecated\n",
current->comm, current->pid);
task_unlock(current);
} else {
-   *user = NULL;
+   *ucounts = NULL;
return ERR_PTR(-EPERM);
}
}
@@ -1504,9 +1504,9 @@ struct file *hugetlb_file_setup(const char *name, size_t 
size,
 
iput(inode);
 out:
-   if (*user) {
-   user_shm_unlock(size, *user);
-   *user = NULL;
+   if (*ucounts) {
+   user_shm_unlock(size, *ucounts);
+   *ucounts = NULL;
}
return file;
 }
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b5807f23caf8..12b78ae587a2 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -434,7 +434,7 @@ static inline struct hugetlbfs_inode_info 
*HUGETLBFS_I(struct inode *inode)
 extern const struct file_operations hugetlbfs_file_operations;
 extern const struct vm_operations_struct hugetlb_vm_ops;
 struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
-   struct user_struct **user, int creat_flags,
+   struct ucounts **ucounts, int creat_flags,
int page_size_log);
 
 static inline bool is_file_hugepages(struct file *file)
@@ -454,7 +454,7 @@ static inline struct hstate *hstate_inode(struct inode *i)
 #define is_file_hugepages(file)false
 static inline struct file *
 hugetlb_file_setup(const char *name, size_t size, vm_flags_t acctflag,
-   struct user_struct **user, int creat_flags,
+   struct ucounts **ucounts, int creat_flags,
int page_size_log)
 {
return ERR_PTR(-ENOSYS);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..64927c5492f2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1628,8 +1628,8 @@ extern bool can_do_mlock(void);
 #else
 static inline bool can_do_mlock(void) { return false; }
 #endif
-extern int user_shm_lock(size_t, struct user_struct *);
-extern void user_shm_unlock(size_t, struct user_struct *);
+extern int user_shm_lock(size_t, struct ucounts *);
+extern void user_shm_unlock(size_t, struct ucounts *);
 
 /*
  * Parameter block passed down to zap_pte_range in exceptional cases.
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 8ba9ce

[PATCH v7 2/7] Add a reference to ucounts for each cred

2021-02-22 Thread Alexey Gladkov

For RLIMIT_NPROC and some other rlimits the user_struct that holds the
global limit is kept alive for the lifetime of a process by keeping it
in struct cred. Adding a pointer to ucounts in struct cred makes it
possible to track RLIMIT_NPROC not only per user in the system, but per
user in a user namespace.

Updating ucounts may require a memory allocation, which may fail. We
therefore cannot change cred.ucounts in commit_creds(), because that
function cannot fail and must always return 0. For this reason, we
modify cred.ucounts before calling commit_creds().
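
The ordering described above (do the fallible part first, keep the commit
step infallible) can be sketched in plain userspace C. This is only a model:
struct cred_model, struct counters_model, cred_model_set_counters() and
cred_model_commit() are hypothetical names invented for the illustration and
are not the kernel's struct cred, set_cred_ucounts() or commit_creds().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration only, not kernel types. */
struct counters_model {
	long refs;
};

struct cred_model {
	struct counters_model *counters;
};

/*
 * Fallible step: may allocate, so it can return -ENOMEM.
 * It has to run before the commit step below.
 */
static int cred_model_set_counters(struct cred_model *next)
{
	struct counters_model *c;

	if (next->counters)
		return 0;	/* already prepared, nothing to do */

	c = calloc(1, sizeof(*c));
	if (!c)
		return -ENOMEM;

	c->refs = 1;
	next->counters = c;
	return 0;
}

/*
 * Infallible step: only a pointer assignment, so it always returns 0.
 * It relies on the caller having prepared the counters beforehand.
 */
static int cred_model_commit(struct cred_model **cur, struct cred_model *next)
{
	assert(next->counters != NULL);
	*cur = next;
	return 0;
}
```

A caller would check the return value of cred_model_set_counters() and bail
out on error, and only then call cred_model_commit(), which has no error
path left to handle.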

Changelog

v6:
* Fix null-ptr-deref in is_ucounts_overlimit() detected by trinity. This
  error was caused by the fact that cred_alloc_blank() left the ucounts
  pointer empty.

Reported-by: kernel test robot 
Signed-off-by: Alexey Gladkov 
---
 fs/exec.c  |  4 
 include/linux/cred.h   |  2 ++
 include/linux/user_namespace.h |  4 
 kernel/cred.c  | 40 ++
 kernel/fork.c  |  6 +
 kernel/sys.c   | 12 ++
 kernel/ucount.c| 40 +++---
 kernel/user_namespace.c|  3 +++
 8 files changed, 108 insertions(+), 3 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 5d4d52039105..0371a3400be5 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1360,6 +1360,10 @@ int begin_new_exec(struct linux_binprm * bprm)
WRITE_ONCE(me->self_exec_id, me->self_exec_id + 1);
flush_signal_handlers(me, 0);
 
+   retval = set_cred_ucounts(bprm->cred);
+   if (retval < 0)
+   goto out_unlock;
+
/*
 * install the new credentials for this executable
 */
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 18639c069263..ad160e5fe5c6 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -144,6 +144,7 @@ struct cred {
 #endif
struct user_struct *user;   /* real user ID subscription */
struct user_namespace *user_ns; /* user_ns the caps and keyrings are 
relative to. */
+   struct ucounts *ucounts;
struct group_info *group_info;  /* supplementary groups for euid/fsgid 
*/
/* RCU deletion */
union {
@@ -170,6 +171,7 @@ extern int set_security_override_from_ctx(struct cred *, 
const char *);
 extern int set_create_files_as(struct cred *, struct inode *);
 extern int cred_fscmp(const struct cred *, const struct cred *);
 extern void __init cred_init(void);
+extern int set_cred_ucounts(struct cred *);
 
 /*
  * check for validity of credentials
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 0bb833fd41f4..f71b5a4a3e74 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -97,11 +97,15 @@ struct ucounts {
 };
 
 extern struct user_namespace init_user_ns;
+extern struct ucounts init_ucounts;
 
 bool setup_userns_sysctls(struct user_namespace *ns);
 void retire_userns_sysctls(struct user_namespace *ns);
 struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum 
ucount_type type);
 void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
+struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid);
+struct ucounts *get_ucounts(struct ucounts *ucounts);
+void put_ucounts(struct ucounts *ucounts);
 
 #ifdef CONFIG_USER_NS
 
diff --git a/kernel/cred.c b/kernel/cred.c
index 421b1149c651..58a8a9e24347 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -60,6 +60,7 @@ struct cred init_cred = {
.user   = INIT_USER,
.user_ns= &init_user_ns,
.group_info = &init_groups,
+   .ucounts= &init_ucounts,
 };
 
 static inline void set_cred_subscribers(struct cred *cred, int n)
@@ -119,6 +120,8 @@ static void put_cred_rcu(struct rcu_head *rcu)
if (cred->group_info)
put_group_info(cred->group_info);
free_uid(cred->user);
+   if (cred->ucounts)
+   put_ucounts(cred->ucounts);
put_user_ns(cred->user_ns);
kmem_cache_free(cred_jar, cred);
 }
@@ -222,6 +225,7 @@ struct cred *cred_alloc_blank(void)
 #ifdef CONFIG_DEBUG_CREDENTIALS
new->magic = CRED_MAGIC;
 #endif
+   new->ucounts = get_ucounts(&init_ucounts);
 
if (security_cred_alloc_blank(new, GFP_KERNEL_ACCOUNT) < 0)
goto error;
@@ -284,6 +288,11 @@ struct cred *prepare_creds(void)
 
if (security_prepare_creds(new, old, GFP_KERNEL_ACCOUNT) < 0)
goto error;
+
+   new->ucounts = get_ucounts(new->ucounts);
+   if (!new->ucounts)
+   goto error;
+
validate_creds(new);
return new;
 
@@ -363,6 +372,8 @@ int copy_creds(struct task_struct *p, unsigned long 
clone_flags)
ret = create_user_ns(new);
if (ret < 0)
goto error_put;
+   if (set_cred_uc

[PATCH v7 3/7] Reimplement RLIMIT_NPROC on top of ucounts

2021-02-22 Thread Alexey Gladkov

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the limits of the parent user
namespaces still cannot be exceeded.

To illustrate the impact of rlimits, let's say there is a program that
does not fork. Some service-A wants to run this program as user X in
multiple containers. Since the program never forks, the service wants to
set RLIMIT_NPROC=1.

service-A
 \- program (uid=1000, container1, rlimit_nproc=1)
 \- program (uid=1000, container2, rlimit_nproc=1)

The service-A sets RLIMIT_NPROC=1 and runs the program in container1.
When the service-A tries to run a program with RLIMIT_NPROC=1 in
container2 it fails since user X already has one running process.

We cannot use the existing inc_ucount() / dec_ucount() because they do not
allow us to exceed the maximum for the counter. Some rlimits can be
exceeded by root or by a user that has the appropriate capability.
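
The difference can be sketched in userspace C: a counter that is always
charged first and tested afterwards, so that a privileged caller may end up
above the limit while an unprivileged one is rolled back. This is a minimal,
non-atomic model and not the kernel code; struct soft_counter, soft_charge()
and the may_exceed flag are hypothetical names used only here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one rlimit counter; no locking or atomics. */
struct soft_counter {
	long count;
};

/*
 * Charge n units, then test against the limit.
 * A caller with may_exceed set keeps the charge even above the limit;
 * any other caller is rolled back and refused.
 */
static bool soft_charge(struct soft_counter *c, long n, long limit,
			bool may_exceed)
{
	assert(n > 0);

	c->count += n;
	if (c->count > limit && !may_exceed) {
		c->count -= n;	/* roll back the refused charge */
		return false;
	}
	return true;
}
```

With a limit of 1, a second unprivileged soft_charge() is refused and leaves
the counter at 1, while the same call with may_exceed set succeeds and leaves
it at 2. A strict "increment only while below the maximum" helper cannot
express the second case.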

Signed-off-by: Alexey Gladkov 
---
 fs/exec.c  |  2 +-
 fs/io-wq.c | 22 ++--
 fs/io-wq.h |  2 +-
 fs/io_uring.c  |  2 +-
 include/linux/cred.h   |  2 ++
 include/linux/sched/user.h |  1 -
 include/linux/user_namespace.h | 13 
 kernel/cred.c  | 10 +++---
 kernel/exit.c  |  2 +-
 kernel/fork.c  |  9 ++---
 kernel/sys.c   |  2 +-
 kernel/ucount.c| 61 ++
 kernel/user.c  |  1 -
 kernel/user_namespace.c|  3 +-
 14 files changed, 103 insertions(+), 29 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 0371a3400be5..e6d7f186f33c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1874,7 +1874,7 @@ static int do_execveat_common(int fd, struct filename 
*filename,
 * whether NPROC limit is still exceeded.
 */
if ((current->flags & PF_NPROC_EXCEEDED) &&
-   atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
+   is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, 
rlimit(RLIMIT_NPROC))) {
retval = -EAGAIN;
goto out_ret;
}
diff --git a/fs/io-wq.c b/fs/io-wq.c
index a564f36e260c..5b6940c90c61 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "../kernel/sched/sched.h"
 #include "io-wq.h"
@@ -120,7 +121,7 @@ struct io_wq {
io_wq_work_fn *do_work;
 
struct task_struct *manager;
-   struct user_struct *user;
+   const struct cred *cred;
refcount_t refs;
struct completion done;
 
@@ -234,7 +235,7 @@ static void io_worker_exit(struct io_worker *worker)
if (worker->flags & IO_WORKER_F_RUNNING)
atomic_dec(>nr_running);
if (!(worker->flags & IO_WORKER_F_BOUND))
-   atomic_dec(&wqe->wq->user->processes);
+   dec_rlimit_ucounts(wqe->wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 
1);
worker->flags = 0;
preempt_enable();
 
@@ -364,15 +365,15 @@ static void __io_worker_busy(struct io_wqe *wqe, struct 
io_worker *worker,
worker->flags |= IO_WORKER_F_BOUND;
wqe->acct[IO_WQ_ACCT_UNBOUND].nr_workers--;
wqe->acct[IO_WQ_ACCT_BOUND].nr_workers++;
-   atomic_dec(&wqe->wq->user->processes);
+   dec_rlimit_ucounts(wqe->wq->cred->ucounts, 
UCOUNT_RLIMIT_NPROC, 1);
} else {
worker->flags &= ~IO_WORKER_F_BOUND;
wqe->acct[IO_WQ_ACCT_UNBOUND].nr_workers++;
wqe->acct[IO_WQ_ACCT_BOUND].nr_workers--;
-   atomic_inc(&wqe->wq->user->processes);
+   inc_rlimit_ucounts(wqe->wq->cred->ucounts, 
UCOUNT_RLIMIT_NPROC, 1);
}
io_wqe_inc_running(wqe, worker);
-}
+   }
 }
 
 /*
@@ -707,7 +708,7 @@ static bool create_io_worker(struct io_wq *wq, struct 
io_wqe *wqe, int index)
raw_spin_unlock_irq(>lock);
 
if (index == IO_WQ_ACCT_UNBOUND)
-   atomic_inc(&wq->user->processes);
+   inc_rlimit_ucounts(wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 1);
 
refcount_inc(>refs);
wake_up_process(worker->task);
@@ -838,7 +839,7 @@ static bool io_wq_can_queue(struct io_wqe *wqe, struct 
io_wqe_acct *acct,
if (free_worker)
return true;
 
-   if (atomic_read(&wqe->wq->user->processes) >= acct->max_workers &&
+   if (is_ucounts_overlimit(wqe->wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 
acct->max_workers) &&
!(capable(CAP_SYS_RESOURCE) || capable(CAP_SYS_ADMIN)))

[PATCH v7 4/7] Reimplement RLIMIT_MSGQUEUE on top of ucounts

2021-02-22 Thread Alexey Gladkov

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the limits of the parent user
namespaces still cannot be exceeded.

Signed-off-by: Alexey Gladkov 
---
 include/linux/sched/user.h |  4 
 include/linux/user_namespace.h |  1 +
 ipc/mqueue.c   | 41 ++
 kernel/fork.c  |  1 +
 kernel/ucount.c|  1 +
 kernel/user_namespace.c|  1 +
 6 files changed, 26 insertions(+), 23 deletions(-)

diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index d33d867ad6c1..8a34446681aa 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -18,10 +18,6 @@ struct user_struct {
 #endif
 #ifdef CONFIG_EPOLL
atomic_long_t epoll_watches; /* The number of file descriptors 
currently watched */
-#endif
-#ifdef CONFIG_POSIX_MQUEUE
-   /* protected by mq_lock */
-   unsigned long mq_bytes; /* How many bytes can be allocated to mqueue? */
 #endif
unsigned long locked_shm; /* How many pages of mlocked shm ? */
unsigned long unix_inflight;/* How many files in flight in unix 
sockets */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 0a27cd049404..52453143fe23 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -51,6 +51,7 @@ enum ucount_type {
UCOUNT_INOTIFY_WATCHES,
 #endif
UCOUNT_RLIMIT_NPROC,
+   UCOUNT_RLIMIT_MSGQUEUE,
UCOUNT_COUNTS,
 };
 
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index beff0cfcd1e8..75dba8780c80 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -144,7 +144,7 @@ struct mqueue_inode_info {
struct pid *notify_owner;
u32 notify_self_exec_id;
struct user_namespace *notify_user_ns;
-   struct user_struct *user;   /* user who created, for accounting */
+   struct ucounts *ucounts;/* user who created, for accounting */
struct sock *notify_sock;
struct sk_buff *notify_cookie;
 
@@ -292,7 +292,6 @@ static struct inode *mqueue_get_inode(struct super_block 
*sb,
struct ipc_namespace *ipc_ns, umode_t mode,
struct mq_attr *attr)
 {
-   struct user_struct *u = current_user();
struct inode *inode;
int ret = -ENOMEM;
 
@@ -321,7 +320,7 @@ static struct inode *mqueue_get_inode(struct super_block 
*sb,
info->notify_owner = NULL;
info->notify_user_ns = NULL;
info->qsize = 0;
-   info->user = NULL;  /* set when all is ok */
+   info->ucounts = NULL;   /* set when all is ok */
info->msg_tree = RB_ROOT;
info->msg_tree_rightmost = NULL;
info->node_cache = NULL;
@@ -371,19 +370,24 @@ static struct inode *mqueue_get_inode(struct super_block 
*sb,
if (mq_bytes + mq_treesize < mq_bytes)
goto out_inode;
mq_bytes += mq_treesize;
-   spin_lock(&mq_lock);
-   if (u->mq_bytes + mq_bytes < u->mq_bytes ||
-   u->mq_bytes + mq_bytes > rlimit(RLIMIT_MSGQUEUE)) {
+   info->ucounts = get_ucounts(current_ucounts());
+   if (info->ucounts) {
+   bool overlimit;
+
+   spin_lock(&mq_lock);
+   overlimit = inc_rlimit_ucounts_and_test(info->ucounts, 
UCOUNT_RLIMIT_MSGQUEUE,
+   mq_bytes, rlimit(RLIMIT_MSGQUEUE));
+   if (overlimit) {
+   dec_rlimit_ucounts(info->ucounts, 
UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
+   spin_unlock(&mq_lock);
+   put_ucounts(info->ucounts);
+   info->ucounts = NULL;
+   /* mqueue_evict_inode() releases info->messages 
*/
+   ret = -EMFILE;
+   goto out_inode;
+   }
spin_unlock(&mq_lock);
-   /* mqueue_evict_inode() releases info->messages */
-   ret = -EMFILE;
-   goto out_inode;
}
-   u->mq_bytes += mq_bytes;
-   spin_unlock(&mq_lock);
-
-   /* all is ok */
-   info->user = get_uid(u);
} else if (S_ISDIR(mode)) {
inc_nlink(inode);
/* Some things misbehave if size == 0 on a directory */
@@ -497,7 +501,6 @@ static void mqueue_free_inode(struct inode *inode)
 static void mqueue_evict_inode(struct inode *inode)
 {
struct mqueue_inode_info *info;
-   struct user_struct *user;
struct ipc_namespace *ipc_ns;
struct msg_msg

[PATCH v7 0/7] Count rlimits in each user namespace

2021-02-22 Thread Alexey Gladkov

Preface
-------
These patches are for binding the rlimit counters to a user in user namespace.
This patch set can be applied on top of:

git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git v5.11

Problem
-------
The RLIMIT_NPROC, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING, RLIMIT_MSGQUEUE rlimits
implementation places the counters in user_struct [1]. These counters are
global across processes and persist for the lifetime of the process, even if
the processes are in different user namespaces.

To illustrate the impact of rlimits, let's say there is a program that does not
fork. Some service-A wants to run this program as user X in multiple containers.
Since the program never forks, the service wants to set RLIMIT_NPROC=1.

service-A
 \- program (uid=1000, container1, rlimit_nproc=1)
 \- program (uid=1000, container2, rlimit_nproc=1)

The service-A sets RLIMIT_NPROC=1 and runs the program in container1. When the
service-A tries to run a program with RLIMIT_NPROC=1 in container2 it fails
since user X already has one running process.

The problem is not that the limit from container1 affects container2. The
problem is that the limit is verified against the global counter that reflects
the number of processes in all containers.

This problem can be worked around by using different users for each container
but in this case we face a different problem of uid mapping when transferring
files from one container to another.

Eric W. Biederman mentioned this issue [2][3].

Introduced changes
------------------
To address the problem, we bind rlimit counters to the user namespace. Each
counter reflects the number of processes for a given uid in a given user
namespace. The result is a tree of rlimit counters with the biggest value at
the root (aka init_user_ns). The limit is considered exceeded if it is
exceeded at any level up the tree.
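
As a rough userspace model of such a tree (not the kernel implementation),
each level below holds its own count and maximum and points to the counter
of its parent namespace. struct level_counter, charge_and_test() and
uncharge() are hypothetical names for this sketch, and the model ignores
atomicity and reference counting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model only: one counter per (uid, user namespace) pair. */
struct level_counter {
	struct level_counter *parent;	/* parent namespace, NULL at the root */
	long count;			/* units currently charged here */
	long max;			/* limit configured for this level */
};

/*
 * Charge n units at ctr and at every ancestor. Returns true if any
 * level ends up above its own max, i.e. the limit is exceeded
 * somewhere up the tree.
 */
static bool charge_and_test(struct level_counter *ctr, long n)
{
	bool over = false;

	assert(n > 0);
	for (struct level_counter *it = ctr; it; it = it->parent) {
		it->count += n;
		if (it->count > it->max)
			over = true;
	}
	return over;
}

/* Undo a previous charge of n units on ctr and all of its ancestors. */
static void uncharge(struct level_counter *ctr, long n)
{
	for (struct level_counter *it = ctr; it; it = it->parent)
		it->count -= n;
}
```

With a root whose max is 2 and two children whose max is 1, each child can
be charged once without going over the limit, which is the container1 /
container2 case from the problem description. A second charge on either
child reports the limit as exceeded and would be undone with uncharge().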

[1] https://lore.kernel.org/containers/87imd2incs@x220.int.ebiederm.org/
[2] https://lists.linuxfoundation.org/pipermail/containers/2020-August/042096.html
[3] https://lists.linuxfoundation.org/pipermail/containers/2020-October/042524.html

Changelog
---------
v7:
* Fixed issues found by the lkp-tests project in the patch that reimplements
  RLIMIT_MEMLOCK on top of ucounts.

v6:
* Fixed issues found by lkp-tests project.
* Rebased onto v5.11.

v5:
* Split the first commit into two commits: change ucounts.count type to
  atomic_long_t and add ucounts to cred. These commits were merged by
  mistake during the rebase.
* Renamed __get_ucounts() to alloc_ucounts().
* Moved the cred.ucounts update out of commit_creds(), as it did not allow
  errors to be handled.
* Added error handling of set_cred_ucounts().

v4:
* Reverted the type change of ucounts.count to refcount_t.
* Fixed typo in the kernel/cred.c

v3:
* Added get_ucounts() function to increase the reference count. The existing
  get_ucounts() function was renamed to __get_ucounts().
* Changed the type of ucounts.count from atomic_t to refcount_t.
* Dropped 'const' from set_cred_ucounts() arguments.
* Fixed a bug with freeing the cred structure after calling cred_alloc_blank().
* Commit messages have been updated.
* Added selftest.

v2:
* RLIMIT_MEMLOCK, RLIMIT_SIGPENDING and RLIMIT_MSGQUEUE are migrated to ucounts.
* Added ucounts for the (uid, user namespace) pair into cred.
* Added the ability to increase ucount by more than 1.

v1:
* After discussion with Eric W. Biederman, I increased the size of ucounts to
  atomic_long_t.
* Added ucount_max to avoid the fork bomb.

--

Alexey Gladkov (7):
  Increase size of ucounts to atomic_long_t
  Add a reference to ucounts for each cred
  Reimplement RLIMIT_NPROC on top of ucounts
  Reimplement RLIMIT_MSGQUEUE on top of ucounts
  Reimplement RLIMIT_SIGPENDING on top of ucounts
  Reimplement RLIMIT_MEMLOCK on top of ucounts
  kselftests: Add test to check for rlimit changes in different user
namespaces

 fs/exec.c |   6 +-
 fs/hugetlbfs/inode.c  |  16 +-
 fs/io-wq.c|  22 ++-
 fs/io-wq.h|   2 +-
 fs/io_uring.c |   2 +-
 fs/proc/array.c   |   2 +-
 include/linux/cred.h  |   4 +
 include/linux/hugetlb.h   |   4 +-
 include/linux/mm.h|   4 +-
 include/linux/sched/user.h|   7 -
 include/linux/shmem_fs.h  |   2 +-
 include/linux/signal_types.h  |   4 +-
 include/linux/user_namespace.h|  24 ++-
 ipc/mqueue.c  |  41 ++---
 ipc/shm.c |  26 +--
 kernel/cred.c |  50 +-
 kernel/exit.c |   2 +-
 kernel/fork.c |  18 +-
 kernel/signal.c   |  57 +++
 kernel/sys.c  |  14

[PATCH v7 1/7] Increase size of ucounts to atomic_long_t

2021-02-22 Thread Alexey Gladkov

RLIMIT_MSGQUEUE and RLIMIT_MEMLOCK use unsigned long to store their
counters. In preparation for moving rlimits on top of ucounts, we need
to increase the size of the variable to long.

Signed-off-by: Alexey Gladkov 
---
 include/linux/user_namespace.h |  4 ++--
 kernel/ucount.c| 16 
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 64cf8ebdc4ec..0bb833fd41f4 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -85,7 +85,7 @@ struct user_namespace {
struct ctl_table_header *sysctls;
 #endif
struct ucounts  *ucounts;
-   int ucount_max[UCOUNT_COUNTS];
+   long ucount_max[UCOUNT_COUNTS];
 } __randomize_layout;
 
 struct ucounts {
@@ -93,7 +93,7 @@ struct ucounts {
struct user_namespace *ns;
kuid_t uid;
int count;
-   atomic_t ucount[UCOUNT_COUNTS];
+   atomic_long_t ucount[UCOUNT_COUNTS];
 };
 
 extern struct user_namespace init_user_ns;
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 11b1596e2542..04c561751af1 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -175,14 +175,14 @@ static void put_ucounts(struct ucounts *ucounts)
kfree(ucounts);
 }
 
-static inline bool atomic_inc_below(atomic_t *v, int u)
+static inline bool atomic_long_inc_below(atomic_long_t *v, int u)
 {
-   int c, old;
-   c = atomic_read(v);
+   long c, old;
+   c = atomic_long_read(v);
for (;;) {
if (unlikely(c >= u))
return false;
-   old = atomic_cmpxchg(v, c, c+1);
+   old = atomic_long_cmpxchg(v, c, c+1);
if (likely(old == c))
return true;
c = old;
@@ -196,17 +196,17 @@ struct ucounts *inc_ucount(struct user_namespace *ns, 
kuid_t uid,
struct user_namespace *tns;
ucounts = get_ucounts(ns, uid);
for (iter = ucounts; iter; iter = tns->ucounts) {
-   int max;
+   long max;
tns = iter->ns;
max = READ_ONCE(tns->ucount_max[type]);
-   if (!atomic_inc_below(&iter->ucount[type], max))
+   if (!atomic_long_inc_below(&iter->ucount[type], max))
goto fail;
}
return ucounts;
 fail:
bad = iter;
for (iter = ucounts; iter != bad; iter = iter->ns->ucounts)
-   atomic_dec(&iter->ucount[type]);
+   atomic_long_dec(&iter->ucount[type]);
 
put_ucounts(ucounts);
return NULL;
@@ -216,7 +216,7 @@ void dec_ucount(struct ucounts *ucounts, enum ucount_type 
type)
 {
struct ucounts *iter;
for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-   int dec = atomic_dec_if_positive(&iter->ucount[type]);
+   long dec = atomic_long_dec_if_positive(&iter->ucount[type]);
WARN_ON_ONCE(dec < 0);
}
put_ucounts(ucounts);
-- 
2.29.2

[RESEND PATCH v4 1/3] proc: Relax check of mount visibility

2021-02-17 Thread Alexey Gladkov

Allow procfs to be mounted with the subset=pid option even if the entire
procfs is not fully accessible to the user.

Signed-off-by: Alexey Gladkov 
---
 fs/namespace.c | 27 ---
 fs/proc/root.c | 17 ++---
 include/linux/fs.h |  1 +
 3 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 9d33909d0f9e..f9a38584f865 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3971,18 +3971,23 @@ static bool mnt_already_visible(struct mnt_namespace 
*ns,
((mnt_flags & MNT_ATIME_MASK) != (new_flags & 
MNT_ATIME_MASK)))
continue;
 
-   /* This mount is not fully visible if there are any
-* locked child mounts that cover anything except for
-* empty directories.
+   /* If this filesystem is completely dynamic, then it
+* makes no sense to check for any child mounts.
 */
-   list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
-   struct inode *inode = child->mnt_mountpoint->d_inode;
-   /* Only worry about locked mounts */
-   if (!(child->mnt.mnt_flags & MNT_LOCKED))
-   continue;
-   /* Is the directory permanetly empty? */
-   if (!is_empty_dir_inode(inode))
-   goto next;
+   if (!(sb->s_iflags & SB_I_DYNAMIC)) {
+   /* This mount is not fully visible if there are any
+* locked child mounts that cover anything except for
+* empty directories.
+*/
+   list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
+   struct inode *inode = child->mnt_mountpoint->d_inode;
+   /* Only worry about locked mounts */
+   if (!(child->mnt.mnt_flags & MNT_LOCKED))
+   continue;
+   /* Is the directory permanetly empty? */
+   if (!is_empty_dir_inode(inode))
+   goto next;
+   }
}
/* Preserve the locked attributes */
*new_mnt_flags |= mnt_flags & (MNT_LOCK_READONLY | \
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 5e444d4f9717..051ffe5e67ce 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -145,18 +145,22 @@ static int proc_parse_param(struct fs_context *fc, struct 
fs_parameter *param)
return 0;
 }
 
-static void proc_apply_options(struct proc_fs_info *fs_info,
+static void proc_apply_options(struct super_block *s,
   struct fs_context *fc,
   struct user_namespace *user_ns)
 {
struct proc_fs_context *ctx = fc->fs_private;
+   struct proc_fs_info *fs_info = proc_sb_info(s);
 
if (ctx->mask & (1 << Opt_gid))
fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
if (ctx->mask & (1 << Opt_hidepid))
fs_info->hide_pid = ctx->hidepid;
-   if (ctx->mask & (1 << Opt_subset))
+   if (ctx->mask & (1 << Opt_subset)) {
+   if (ctx->pidonly == PROC_PIDONLY_ON)
+   s->s_iflags |= SB_I_DYNAMIC;
fs_info->pidonly = ctx->pidonly;
+   }
 }
 
 static int proc_fill_super(struct super_block *s, struct fs_context *fc)
@@ -170,9 +174,6 @@ static int proc_fill_super(struct super_block *s, struct 
fs_context *fc)
if (!fs_info)
return -ENOMEM;
 
-   fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
-   proc_apply_options(fs_info, fc, current_user_ns());
-
/* User space would break if executables or devices appear on proc */
s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
s->s_flags |= SB_NODIRATIME | SB_NOSUID | SB_NOEXEC;
@@ -183,6 +184,9 @@ static int proc_fill_super(struct super_block *s, struct 
fs_context *fc)
s->s_time_gran = 1;
s->s_fs_info = fs_info;
 
+   fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
+   proc_apply_options(s, fc, current_user_ns());
+
/*
 * procfs isn't actually a stacking filesystem; however, there is
 * too much magic going on inside it to permit stacking things on
@@ -216,11 +220,10 @@ static int proc_fill_super(struct super_block *s, struct 
fs_context *fc)
 static int proc_reconfigure(struct fs_context *fc)
 {
struct super_block *sb = fc->root->d_sb;
-   struct proc_fs_info *fs_info = proc_sb_info(sb);
 
sync_filesystem(sb);
 
-   proc_apply_options(fs_info, fc, current_user_ns());
+

[RESEND PATCH v4 3/3] proc: Disable cancellation of subset=pid option

2021-02-17 Thread Alexey Gladkov

Once procfs has been mounted with the subset=pid option, it cannot be
remounted without that option. Some checks occur only at mount time, so
dropping the option on remount would make visible what was hidden.

This patch makes this limitation explicit and reports an error.

Signed-off-by: Alexey Gladkov 
---
 fs/proc/root.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/proc/root.c b/fs/proc/root.c
index 0ab90e24d9ae..d4a91f48c430 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -145,7 +145,7 @@ static int proc_parse_param(struct fs_context *fc, struct 
fs_parameter *param)
return 0;
 }
 
-static void proc_apply_options(struct super_block *s,
+static int proc_apply_options(struct super_block *s,
   struct fs_context *fc,
   struct user_namespace *user_ns)
 {
@@ -159,8 +159,11 @@ static void proc_apply_options(struct super_block *s,
if (ctx->mask & (1 << Opt_subset)) {
if (ctx->pidonly == PROC_PIDONLY_ON)
s->s_iflags |= SB_I_DYNAMIC;
+   else if (fs_info->pidonly == PROC_PIDONLY_ON)
+   return invalf(fc, "proc: subset=pid cannot be unset\n");
fs_info->pidonly = ctx->pidonly;
}
+   return 0;
 }
 
 static int proc_fill_super(struct super_block *s, struct fs_context *fc)
@@ -187,7 +190,10 @@ static int proc_fill_super(struct super_block *s, struct 
fs_context *fc)
fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
fs_info->mounter_cred = get_cred(fc->cred);
 
-   proc_apply_options(s, fc, current_user_ns());
+   ret = proc_apply_options(s, fc, current_user_ns());
+   if (ret) {
+   return ret;
+   }
 
/*
 * procfs isn't actually a stacking filesystem; however, there is
@@ -229,8 +235,7 @@ static int proc_reconfigure(struct fs_context *fc)
put_cred(fs_info->mounter_cred);
fs_info->mounter_cred = get_cred(fc->cred);
 
-   proc_apply_options(sb, fc, current_user_ns());
-   return 0;
+   return proc_apply_options(sb, fc, current_user_ns());
 }
 
 static int proc_get_tree(struct fs_context *fc)
-- 
2.29.2

[RESEND PATCH v4 0/3] proc: Relax check of mount visibility

2021-02-17 Thread Alexey Gladkov

If only the dynamic part of procfs is mounted (subset=pid), then there is no
need to check if procfs is fully visible to the user in the new user namespace.

Changelog
---------
v4:
* Set SB_I_DYNAMIC only if pidonly is set.
* Add an error message if subset=pid is canceled during remount.

v3:
* Add 'const' to struct cred *mounter_cred (fix kernel test robot warning).

v2:
* Cache the mounter's credentials and make access to the net directories
  contingent on the permissions of the mounter of procfs.

--

Alexey Gladkov (3):
  proc: Relax check of mount visibility
  proc: Show /proc/self/net only for CAP_NET_ADMIN
  proc: Disable cancellation of subset=pid option

 fs/namespace.c  | 27 ---
 fs/proc/proc_net.c  |  8 
 fs/proc/root.c  | 29 ++---
 include/linux/fs.h  |  1 +
 include/linux/proc_fs.h |  1 +
 5 files changed, 48 insertions(+), 18 deletions(-)

-- 
2.29.2

[RESEND PATCH v4 2/3] proc: Show /proc/self/net only for CAP_NET_ADMIN

2021-02-17 Thread Alexey Gladkov

Cache the mounter's credentials and make access to the net directories
contingent on the permissions of the mounter of proc.

When proc is mounted with the subset=pid option, show /proc/self/net only
if the mounter has CAP_NET_ADMIN.

Signed-off-by: Alexey Gladkov 
---
 fs/proc/proc_net.c  | 8 
 fs/proc/root.c  | 7 +++
 include/linux/proc_fs.h | 1 +
 3 files changed, 16 insertions(+)

diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 18601042af99..a198f74cdb3b 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -259,6 +260,7 @@ static struct net *get_proc_task_net(struct inode *dir)
struct task_struct *task;
struct nsproxy *ns;
struct net *net = NULL;
+   struct proc_fs_info *fs_info = proc_sb_info(dir->i_sb);
 
rcu_read_lock();
task = pid_task(proc_pid(dir), PIDTYPE_PID);
@@ -271,6 +273,12 @@ static struct net *get_proc_task_net(struct inode *dir)
}
rcu_read_unlock();
 
+   if (net && (fs_info->pidonly == PROC_PIDONLY_ON) &&
+   security_capable(fs_info->mounter_cred, net->user_ns, 
CAP_NET_ADMIN, CAP_OPT_NONE) < 0) {
+   put_net(net);
+   net = NULL;
+   }
+
return net;
 }
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 051ffe5e67ce..0ab90e24d9ae 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -185,6 +185,8 @@ static int proc_fill_super(struct super_block *s, struct 
fs_context *fc)
s->s_fs_info = fs_info;
 
fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
+   fs_info->mounter_cred = get_cred(fc->cred);
+
proc_apply_options(s, fc, current_user_ns());
 
/*
@@ -220,9 +222,13 @@ static int proc_fill_super(struct super_block *s, struct 
fs_context *fc)
 static int proc_reconfigure(struct fs_context *fc)
 {
struct super_block *sb = fc->root->d_sb;
+   struct proc_fs_info *fs_info = proc_sb_info(sb);
 
sync_filesystem(sb);
 
+   put_cred(fs_info->mounter_cred);
+   fs_info->mounter_cred = get_cred(fc->cred);
+
proc_apply_options(sb, fc, current_user_ns());
return 0;
 }
@@ -277,6 +283,7 @@ static void proc_kill_sb(struct super_block *sb)
 
kill_anon_super(sb);
put_pid_ns(fs_info->pid_ns);
+   put_cred(fs_info->mounter_cred);
kfree(fs_info);
 }
 
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 000cc0533c33..ffa871941bd0 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -64,6 +64,7 @@ struct proc_fs_info {
kgid_t pid_gid;
enum proc_hidepid hide_pid;
enum proc_pidonly pidonly;
+   const struct cred *mounter_cred;
 };
 
 static inline struct proc_fs_info *proc_sb_info(struct super_block *sb)
-- 
2.29.2

[PATCH v7 6/7] Reimplement RLIMIT_MEMLOCK on top of ucounts

2021-02-16 Thread Alexey Gladkov

The rlimit counter is tied to the uid in the user namespace. This allows
rlimit values to be specified in a user namespace even if they are already
globally exceeded by the user. However, the limits of the ancestor
user namespaces still cannot be exceeded.

Changelog

v7:
* Fix hugetlb_file_setup() declaration if CONFIG_HUGETLBFS=n. A `const'
  was missing from one of the arguments.

v6:
* Fix bug in hugetlb_file_setup() detected by trinity.

Reported-by: kernel test robot 
Signed-off-by: Alexey Gladkov 
---
 fs/hugetlbfs/inode.c   | 16 
 include/linux/hugetlb.h|  4 ++--
 include/linux/mm.h |  4 ++--
 include/linux/sched/user.h |  1 -
 include/linux/shmem_fs.h   |  2 +-
 include/linux/user_namespace.h |  1 +
 ipc/shm.c  | 30 +++--
 kernel/fork.c  |  1 +
 kernel/ucount.c|  1 +
 kernel/user.c  |  1 -
 kernel/user_namespace.c|  1 +
 mm/memfd.c |  5 ++---
 mm/mlock.c | 35 +-
 mm/mmap.c  |  4 ++--
 mm/shmem.c |  8 
 15 files changed, 54 insertions(+), 60 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 21c20fd5f9ee..a8757e39cefa 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1452,7 +1452,7 @@ static int get_hstate_idx(int page_size_log)
  * otherwise hugetlb_reserve_pages reserves one less hugepages than intended.
  */
 struct file *hugetlb_file_setup(const char *name, size_t size,
-   vm_flags_t acctflag, struct user_struct **user,
+   vm_flags_t acctflag, const struct cred **cred,
int creat_flags, int page_size_log)
 {
struct inode *inode;
@@ -1464,20 +1464,20 @@ struct file *hugetlb_file_setup(const char *name, 
size_t size,
if (hstate_idx < 0)
return ERR_PTR(-ENODEV);
 
-   *user = NULL;
+   *cred = NULL;
mnt = hugetlbfs_vfsmount[hstate_idx];
if (!mnt)
return ERR_PTR(-ENOENT);
 
if (creat_flags == HUGETLB_SHMFS_INODE && !can_do_hugetlb_shm()) {
-   *user = current_user();
-   if (user_shm_lock(size, *user)) {
+   *cred = current_cred();
+   if (user_shm_lock(size, *cred)) {
task_lock(current);
pr_warn_once("%s (%d): Using mlock ulimits for 
SHM_HUGETLB is deprecated\n",
current->comm, current->pid);
task_unlock(current);
} else {
-   *user = NULL;
+   *cred = NULL;
return ERR_PTR(-EPERM);
}
}
@@ -1504,9 +1504,9 @@ struct file *hugetlb_file_setup(const char *name, size_t 
size,
 
iput(inode);
 out:
-   if (*user) {
-   user_shm_unlock(size, *user);
-   *user = NULL;
+   if (*cred) {
+   user_shm_unlock(size, *cred);
+   *cred = NULL;
}
return file;
 }
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b5807f23caf8..0a19897b773b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -434,7 +434,7 @@ static inline struct hugetlbfs_inode_info 
*HUGETLBFS_I(struct inode *inode)
 extern const struct file_operations hugetlbfs_file_operations;
 extern const struct vm_operations_struct hugetlb_vm_ops;
 struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
-   struct user_struct **user, int creat_flags,
+   const struct cred **cred, int creat_flags,
int page_size_log);
 
 static inline bool is_file_hugepages(struct file *file)
@@ -454,7 +454,7 @@ static inline struct hstate *hstate_inode(struct inode *i)
 #define is_file_hugepages(file)false
 static inline struct file *
 hugetlb_file_setup(const char *name, size_t size, vm_flags_t acctflag,
-   struct user_struct **user, int creat_flags,
+   const struct cred **cred, int creat_flags,
int page_size_log)
 {
return ERR_PTR(-ENOSYS);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..30a37aef1ab9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1628,8 +1628,8 @@ extern bool can_do_mlock(void);
 #else
 static inline bool can_do_mlock(void) { return false; }
 #endif
-extern int user_shm_lock(size_t, struct user_struct *);
-extern void user_shm_unlock(size_t, struct user_struct *);
+extern int user_shm_lock(size_t, const struct cred *);
+extern void user_shm_unlock(size_t, const struct cred *);
 
 /*
  * Parameter block passed down to zap_pte_range in exceptional cases.
diff --git a/include/linux/sched/user.h b

[PATCH v6 6/7] Reimplement RLIMIT_MEMLOCK on top of ucounts

2021-02-15 Thread Alexey Gladkov

The rlimit counter is tied to the uid in the user namespace. This allows
rlimit values to be specified in a user namespace even if they are already
globally exceeded by the user. However, the limits of the ancestor
user namespaces still cannot be exceeded.

Changelog

v6:
* Fix bug in hugetlb_file_setup() detected by trinity.

Reported-by: kernel test robot 
Signed-off-by: Alexey Gladkov 
---
 fs/hugetlbfs/inode.c   | 16 
 include/linux/hugetlb.h|  4 ++--
 include/linux/mm.h |  4 ++--
 include/linux/sched/user.h |  1 -
 include/linux/shmem_fs.h   |  2 +-
 include/linux/user_namespace.h |  1 +
 ipc/shm.c  | 30 +++--
 kernel/fork.c  |  1 +
 kernel/ucount.c|  1 +
 kernel/user.c  |  1 -
 kernel/user_namespace.c|  1 +
 mm/memfd.c |  5 ++---
 mm/mlock.c | 35 +-
 mm/mmap.c  |  4 ++--
 mm/shmem.c |  8 
 15 files changed, 54 insertions(+), 60 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 21c20fd5f9ee..a8757e39cefa 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1452,7 +1452,7 @@ static int get_hstate_idx(int page_size_log)
  * otherwise hugetlb_reserve_pages reserves one less hugepages than intended.
  */
 struct file *hugetlb_file_setup(const char *name, size_t size,
-   vm_flags_t acctflag, struct user_struct **user,
+   vm_flags_t acctflag, const struct cred **cred,
int creat_flags, int page_size_log)
 {
struct inode *inode;
@@ -1464,20 +1464,20 @@ struct file *hugetlb_file_setup(const char *name, 
size_t size,
if (hstate_idx < 0)
return ERR_PTR(-ENODEV);
 
-   *user = NULL;
+   *cred = NULL;
mnt = hugetlbfs_vfsmount[hstate_idx];
if (!mnt)
return ERR_PTR(-ENOENT);
 
if (creat_flags == HUGETLB_SHMFS_INODE && !can_do_hugetlb_shm()) {
-   *user = current_user();
-   if (user_shm_lock(size, *user)) {
+   *cred = current_cred();
+   if (user_shm_lock(size, *cred)) {
task_lock(current);
pr_warn_once("%s (%d): Using mlock ulimits for 
SHM_HUGETLB is deprecated\n",
current->comm, current->pid);
task_unlock(current);
} else {
-   *user = NULL;
+   *cred = NULL;
return ERR_PTR(-EPERM);
}
}
@@ -1504,9 +1504,9 @@ struct file *hugetlb_file_setup(const char *name, size_t 
size,
 
iput(inode);
 out:
-   if (*user) {
-   user_shm_unlock(size, *user);
-   *user = NULL;
+   if (*cred) {
+   user_shm_unlock(size, *cred);
+   *cred = NULL;
}
return file;
 }
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b5807f23caf8..de5ce8a11b5e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -434,7 +434,7 @@ static inline struct hugetlbfs_inode_info 
*HUGETLBFS_I(struct inode *inode)
 extern const struct file_operations hugetlbfs_file_operations;
 extern const struct vm_operations_struct hugetlb_vm_ops;
 struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
-   struct user_struct **user, int creat_flags,
+   const struct cred **cred, int creat_flags,
int page_size_log);
 
 static inline bool is_file_hugepages(struct file *file)
@@ -454,7 +454,7 @@ static inline struct hstate *hstate_inode(struct inode *i)
 #define is_file_hugepages(file)false
 static inline struct file *
 hugetlb_file_setup(const char *name, size_t size, vm_flags_t acctflag,
-   struct user_struct **user, int creat_flags,
+   struct cred **cred, int creat_flags,
int page_size_log)
 {
return ERR_PTR(-ENOSYS);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..30a37aef1ab9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1628,8 +1628,8 @@ extern bool can_do_mlock(void);
 #else
 static inline bool can_do_mlock(void) { return false; }
 #endif
-extern int user_shm_lock(size_t, struct user_struct *);
-extern void user_shm_unlock(size_t, struct user_struct *);
+extern int user_shm_lock(size_t, const struct cred *);
+extern void user_shm_unlock(size_t, const struct cred *);
 
 /*
  * Parameter block passed down to zap_pte_range in exceptional cases.
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 8ba9cec4fb99..82bd2532da6b 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/

[PATCH v6 0/7] Count rlimits in each user namespace

2021-02-15 Thread Alexey Gladkov

Preface
-------
These patches are for binding the rlimit counters to a user in user namespace.
This patch set can be applied on top of:

git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git v5.11

Problem
-------
The RLIMIT_NPROC, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING and RLIMIT_MSGQUEUE rlimit
implementations place their counters in user_struct [1]. These limits are global
across processes and persist for the lifetime of the process, even if the
processes are in different user namespaces.

To illustrate the impact of rlimits, let's say there is a program that does not
fork. Some service-A wants to run this program as user X in multiple containers.
Since the program never forks, the service wants to set RLIMIT_NPROC=1.

service-A
 \- program (uid=1000, container1, rlimit_nproc=1)
 \- program (uid=1000, container2, rlimit_nproc=1)

The service-A sets RLIMIT_NPROC=1 and runs the program in container1. When the
service-A tries to run a program with RLIMIT_NPROC=1 in container2 it fails
since user X already has one running process.

The problem is not that the limit from container1 affects container2. The
problem is that the limit is verified against the global counter that reflects
the number of processes in all containers.

This problem can be worked around by using different users for each container
but in this case we face a different problem of uid mapping when transferring
files from one container to another.

Eric W. Biederman mentioned this issue [2][3].

Introduced changes
------------------
To address the problem, we bind rlimit counters to the user namespace. Each
counter reflects the number of processes for a given uid in a given user
namespace. The result is a tree of rlimit counters with the biggest value at
the root (aka init_user_ns). The limit is considered exceeded if it is
exceeded at any level up the tree.
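
The tree of counters described above can be sketched in plain C. This is a
simplified, single-threaded illustration rather than the code in this series:
struct ns_counter, ns_counter_charge() and ns_counter_uncharge() are
hypothetical names, and each node is assumed to carry its own maximum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One counter per (uid, user namespace); parent is NULL at the root. */
struct ns_counter {
	long count;
	long max;
	struct ns_counter *parent;
};

/*
 * Charge one unit at the leaf and at every ancestor. If any level is
 * already at its maximum, undo the levels charged so far and fail.
 */
static bool ns_counter_charge(struct ns_counter *leaf)
{
	struct ns_counter *iter, *bad;

	assert(leaf != NULL);
	for (iter = leaf; iter; iter = iter->parent) {
		if (iter->count >= iter->max)
			goto fail;
		iter->count++;
	}
	return true;
fail:
	bad = iter;
	for (iter = leaf; iter != bad; iter = iter->parent)
		iter->count--;
	return false;
}

/* Release one unit at the leaf and at every ancestor. */
static void ns_counter_uncharge(struct ns_counter *leaf)
{
	struct ns_counter *iter;

	for (iter = leaf; iter; iter = iter->parent)
		if (iter->count > 0)
			iter->count--;
}
```

With a root that allows three processes and two child namespaces that allow
one each, both containers from the service-A example can start their program,
while a second process in either container is refused.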

[1] https://lore.kernel.org/containers/87imd2incs@x220.int.ebiederm.org/
[2] https://lists.linuxfoundation.org/pipermail/containers/2020-August/042096.html
[3] https://lists.linuxfoundation.org/pipermail/containers/2020-October/042524.html

Changelog
---------
v6:
* Fixed issues found by lkp-tests project.
* Rebased onto v5.11.

v5:
* Split the first commit into two commits: change ucounts.count type to
  atomic_long_t and add ucounts to cred. These commits were merged by mistake
  during the rebase.
* Renamed __get_ucounts() to alloc_ucounts().
* The cred.ucounts update has been moved out of commit_creds() because it did
  not allow errors to be handled there.
* Added error handling of set_cred_ucounts().

v4:
* Reverted the type change of ucounts.count to refcount_t.
* Fixed typo in the kernel/cred.c

v3:
* Added get_ucounts() function to increase the reference count. The existing
  get_counts() function renamed to __get_ucounts().
* The type of ucounts.count changed from atomic_t to refcount_t.
* Dropped 'const' from set_cred_ucounts() arguments.
* Fixed a bug with freeing the cred structure after calling cred_alloc_blank().
* Commit messages have been updated.
* Added selftest.

v2:
* RLIMIT_MEMLOCK, RLIMIT_SIGPENDING and RLIMIT_MSGQUEUE are migrated to ucounts.
* Added ucounts for pair uid and user namespace into cred.
* Added the ability to increase ucount by more than 1.

v1:
* After discussion with Eric W. Biederman, I increased the size of ucounts to
  atomic_long_t.
* Added ucount_max to avoid the fork bomb.

--

Alexey Gladkov (7):
  Increase size of ucounts to atomic_long_t
  Add a reference to ucounts for each cred
  Reimplement RLIMIT_NPROC on top of ucounts
  Reimplement RLIMIT_MSGQUEUE on top of ucounts
  Reimplement RLIMIT_SIGPENDING on top of ucounts
  Reimplement RLIMIT_MEMLOCK on top of ucounts
  kselftests: Add test to check for rlimit changes in different user
namespaces

 fs/exec.c |   6 +-
 fs/hugetlbfs/inode.c  |  16 +-
 fs/io-wq.c|  22 ++-
 fs/io-wq.h|   2 +-
 fs/io_uring.c |   2 +-
 fs/proc/array.c   |   2 +-
 include/linux/cred.h  |   4 +
 include/linux/hugetlb.h   |   4 +-
 include/linux/mm.h|   4 +-
 include/linux/sched/user.h|   7 -
 include/linux/shmem_fs.h  |   2 +-
 include/linux/signal_types.h  |   4 +-
 include/linux/user_namespace.h|  24 ++-
 ipc/mqueue.c  |  29 ++--
 ipc/shm.c |  30 ++--
 kernel/cred.c |  50 +-
 kernel/exit.c |   2 +-
 kernel/fork.c |  18 +-
 kernel/signal.c   |  53 +++---
 kernel/sys.c  |  14 +-
 kernel/ucount.c   | 120 +++--
 kernel/user.c

[PATCH v6 7/7] kselftests: Add test to check for rlimit changes in different user namespaces

2021-02-15 Thread Alexey Gladkov

The testcase runs a few instances of the program with RLIMIT_NPROC=1 as
user uid=6, in different user namespaces.

Signed-off-by: Alexey Gladkov 
---
 tools/testing/selftests/Makefile  |   1 +
 tools/testing/selftests/rlimits/.gitignore|   2 +
 tools/testing/selftests/rlimits/Makefile  |   6 +
 tools/testing/selftests/rlimits/config|   1 +
 .../selftests/rlimits/rlimits-per-userns.c| 161 ++
 5 files changed, 171 insertions(+)
 create mode 100644 tools/testing/selftests/rlimits/.gitignore
 create mode 100644 tools/testing/selftests/rlimits/Makefile
 create mode 100644 tools/testing/selftests/rlimits/config
 create mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 8a917cb4426a..a6d3fde4a617 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -46,6 +46,7 @@ TARGETS += proc
 TARGETS += pstore
 TARGETS += ptrace
 TARGETS += openat2
+TARGETS += rlimits
 TARGETS += rseq
 TARGETS += rtc
 TARGETS += seccomp
diff --git a/tools/testing/selftests/rlimits/.gitignore 
b/tools/testing/selftests/rlimits/.gitignore
new file mode 100644
index ..091021f255b3
--- /dev/null
+++ b/tools/testing/selftests/rlimits/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+rlimits-per-userns
diff --git a/tools/testing/selftests/rlimits/Makefile 
b/tools/testing/selftests/rlimits/Makefile
new file mode 100644
index ..03aadb406212
--- /dev/null
+++ b/tools/testing/selftests/rlimits/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+CFLAGS += -Wall -O2 -g
+TEST_GEN_PROGS := rlimits-per-userns
+
+include ../lib.mk
diff --git a/tools/testing/selftests/rlimits/config 
b/tools/testing/selftests/rlimits/config
new file mode 100644
index ..416bd53ce982
--- /dev/null
+++ b/tools/testing/selftests/rlimits/config
@@ -0,0 +1 @@
+CONFIG_USER_NS=y
diff --git a/tools/testing/selftests/rlimits/rlimits-per-userns.c 
b/tools/testing/selftests/rlimits/rlimits-per-userns.c
new file mode 100644
index ..26dc949e93ea
--- /dev/null
+++ b/tools/testing/selftests/rlimits/rlimits-per-userns.c
@@ -0,0 +1,161 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Alexey Gladkov 
+ */
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define NR_CHILDS 2
+
+static char *service_prog;
+static uid_t user   = 6;
+static uid_t group  = 6;
+
+static void setrlimit_nproc(rlim_t n)
+{
+   pid_t pid = getpid();
+   struct rlimit limit = {
+   .rlim_cur = n,
+   .rlim_max = n
+   };
+
+   warnx("(pid=%d): Setting RLIMIT_NPROC=%ld", pid, n);
+
+   if (setrlimit(RLIMIT_NPROC, &limit) < 0)
+   err(EXIT_FAILURE, "(pid=%d): setrlimit(RLIMIT_NPROC)", pid);
+}
+
+static pid_t fork_child(void)
+{
+   pid_t pid = fork();
+
+   if (pid < 0)
+   err(EXIT_FAILURE, "fork");
+
+   if (pid > 0)
+   return pid;
+
+   pid = getpid();
+
+   warnx("(pid=%d): New process starting ...", pid);
+
+   if (prctl(PR_SET_PDEATHSIG, SIGKILL) < 0)
+   err(EXIT_FAILURE, "(pid=%d): prctl(PR_SET_PDEATHSIG)", pid);
+
+   signal(SIGUSR1, SIG_DFL);
+
+   warnx("(pid=%d): Changing to uid=%d, gid=%d", pid, user, group);
+
+   if (setgid(group) < 0)
+   err(EXIT_FAILURE, "(pid=%d): setgid(%d)", pid, group);
+   if (setuid(user) < 0)
+   err(EXIT_FAILURE, "(pid=%d): setuid(%d)", pid, user);
+
+   warnx("(pid=%d): Service running ...", pid);
+
+   warnx("(pid=%d): Unshare user namespace", pid);
+   if (unshare(CLONE_NEWUSER) < 0)
+   err(EXIT_FAILURE, "unshare(CLONE_NEWUSER)");
+
+   char *const argv[] = { "service", NULL };
+   char *const envp[] = { "I_AM_SERVICE=1", NULL };
+
+   warnx("(pid=%d): Executing real service ...", pid);
+
+   execve(service_prog, argv, envp);
+   err(EXIT_FAILURE, "(pid=%d): execve", pid);
+}
+
+int main(int argc, char **argv)
+{
+   size_t i;
+   pid_t child[NR_CHILDS];
+   int wstatus[NR_CHILDS];
+   int childs = NR_CHILDS;
+   pid_t pid;
+
+   if (getenv("I_AM_SERVICE")) {
+   pause();
+   exit(EXIT_SUCCESS);
+   }
+
+   service_prog = argv[0];
+   pid = getpid();
+
+   warnx("(pid=%d) Starting testcase", pid);
+
+   /*
+* This rlimit is not a problem for root because it can be exceeded.
+*/
+   setrlimit_nproc(1);
+
+   for (i = 0; i < NR_CH

[PATCH v6 5/7] Reimplement RLIMIT_SIGPENDING on top of ucounts

2021-02-15 Thread Alexey Gladkov

The rlimit counter is tied to the uid in the user namespace. This allows
rlimit values to be specified in a user namespace even if they are already
globally exceeded by the user. However, the limits of the ancestor
user namespaces still cannot be exceeded.

Signed-off-by: Alexey Gladkov 
---
 fs/proc/array.c|  2 +-
 include/linux/sched/user.h |  1 -
 include/linux/signal_types.h   |  4 ++-
 include/linux/user_namespace.h |  1 +
 kernel/fork.c  |  1 +
 kernel/signal.c| 53 ++
 kernel/ucount.c|  1 +
 kernel/user.c  |  1 -
 kernel/user_namespace.c|  1 +
 9 files changed, 30 insertions(+), 35 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index bb87e4d89cd8..74b0ea4b7e38 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -284,7 +284,7 @@ static inline void task_sig(struct seq_file *m, struct 
task_struct *p)
collect_sigign_sigcatch(p, &ignored, &caught);
num_threads = get_nr_threads(p);
rcu_read_lock();  /* FIXME: is this correct? */
-   qsize = atomic_read(&__task_cred(p)->user->sigpending);
+   qsize = get_ucounts_value(task_ucounts(p), 
UCOUNT_RLIMIT_SIGPENDING);
rcu_read_unlock();
qlim = task_rlimit(p, RLIMIT_SIGPENDING);
unlock_task_sighand(p, &flags);
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 8a34446681aa..8ba9cec4fb99 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -12,7 +12,6 @@
  */
 struct user_struct {
refcount_t __count; /* reference count */
-   atomic_t sigpending;/* How many pending signals does this user 
have? */
 #ifdef CONFIG_FANOTIFY
atomic_t fanotify_listeners;
 #endif
diff --git a/include/linux/signal_types.h b/include/linux/signal_types.h
index 68e06c75c5b2..34cb28b8f16c 100644
--- a/include/linux/signal_types.h
+++ b/include/linux/signal_types.h
@@ -13,6 +13,8 @@ typedef struct kernel_siginfo {
__SIGINFO;
 } kernel_siginfo_t;
 
+struct ucounts;
+
 /*
  * Real Time signals may be queued.
  */
@@ -21,7 +23,7 @@ struct sigqueue {
struct list_head list;
int flags;
kernel_siginfo_t info;
-   struct user_struct *user;
+   struct ucounts *ucounts;
 };
 
 /* flags values. */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 52453143fe23..f84b68832c56 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -52,6 +52,7 @@ enum ucount_type {
 #endif
UCOUNT_RLIMIT_NPROC,
UCOUNT_RLIMIT_MSGQUEUE,
+   UCOUNT_RLIMIT_SIGPENDING,
UCOUNT_COUNTS,
 };
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 0a939332efcc..99b10b9fe4b6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -824,6 +824,7 @@ void __init fork_init(void)
 
init_user_ns.ucount_max[UCOUNT_RLIMIT_NPROC] = task_rlimit(&init_task, RLIMIT_NPROC);
init_user_ns.ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = task_rlimit(&init_task, RLIMIT_MSGQUEUE);
+   init_user_ns.ucount_max[UCOUNT_RLIMIT_SIGPENDING] = task_rlimit(&init_task, RLIMIT_SIGPENDING);
 
 #ifdef CONFIG_VMAP_STACK
cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:vm_stack_cache",
diff --git a/kernel/signal.c b/kernel/signal.c
index 5ad8566534e7..a37dd66f9358 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -412,49 +412,40 @@ void task_join_group_stop(struct task_struct *task)
 static struct sigqueue *
 __sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int 
override_rlimit)
 {
-   struct sigqueue *q = NULL;
-   struct user_struct *user;
-   int sigpending;
+   struct sigqueue *q = kmem_cache_alloc(sigqueue_cachep, flags);
 
-   /*
-* Protect access to @t credentials. This can go away when all
-* callers hold rcu read lock.
-*
-* NOTE! A pending signal will hold on to the user refcount,
-* and we get/put the refcount only when the sigpending count
-* changes from/to zero.
-*/
-   rcu_read_lock();
-   user = __task_cred(t)->user;
-   sigpending = atomic_inc_return(&user->sigpending);
-   if (sigpending == 1)
-   get_uid(user);
-   rcu_read_unlock();
+   if (likely(q != NULL)) {
+   bool overlimit;
 
-   if (override_rlimit || likely(sigpending <= task_rlimit(t, 
RLIMIT_SIGPENDING))) {
-   q = kmem_cache_alloc(sigqueue_cachep, flags);
-   } else {
-   print_dropped_signal(sig);
-   }
-
-   if (unlikely(q == NULL)) {
-   if (atomic_dec_and_test(&user->sigpending))
-   free_uid(user);
-   } else {
INIT_LIST_HEAD(&q->list);
q->flags = 0;
-   q->user = user;
+
+   /*
+* Protect access to @t credentials. This can go away when all
+

[PATCH v6 3/7] Reimplement RLIMIT_NPROC on top of ucounts

2021-02-15 Thread Alexey Gladkov

The rlimit counter is tied to a uid in the user_namespace. This allows
rlimit values to be specified in a userns even if they are already
globally exceeded by the user. However, the limits of the parent
user_namespaces still cannot be exceeded.

To illustrate the impact of rlimits, let's say there is a program that
does not fork. Some service-A wants to run this program as user X in
multiple containers. Since the program never forks, the service wants to
set RLIMIT_NPROC=1.

service-A
 \- program (uid=1000, container1, rlimit_nproc=1)
 \- program (uid=1000, container2, rlimit_nproc=1)

The service-A sets RLIMIT_NPROC=1 and runs the program in container1.
When the service-A tries to run a program with RLIMIT_NPROC=1 in
container2 it fails since user X already has one running process.

We cannot use the existing inc_ucount() / dec_ucount() because they do not
allow us to exceed the maximum for the counter. Some rlimits may be
exceeded by root or by a user that has the appropriate capability.
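
The difference between a capped increment and an "increment, then test"
helper can be sketched in userspace. This is only an illustration under
stated assumptions: C11 atomics stand in for the kernel's atomic_long_t, and
the names rlimit_inc_and_test(), rlimit_dec() and charge() are hypothetical,
not the helpers added by this patch. The counter is always incremented, the
caller learns whether the limit was crossed, and only an unprivileged caller
has to undo the charge.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical helper: always add n, then report whether the new value
 * is above max. Unlike a capped increment, the counter may exceed max. */
static bool rlimit_inc_and_test(atomic_long *v, long n, long max)
{
	long newval = atomic_fetch_add(v, n) + n;

	return newval > max;	/* true means "over the limit" */
}

/* Hypothetical helper: undo a previous charge of n. */
static void rlimit_dec(atomic_long *v, long n)
{
	atomic_fetch_sub(v, n);
}

/* Charge n units; a privileged caller is allowed to stay over the limit,
 * everyone else gets the charge rolled back. */
static bool charge(atomic_long *v, long n, long max, bool privileged)
{
	if (rlimit_inc_and_test(v, n, max) && !privileged) {
		rlimit_dec(v, n);
		return false;
	}
	return true;
}
```

With a limit of 1, a second unprivileged charge() is refused and leaves the
counter at 1, while a privileged one succeeds and leaves it at 2.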

Signed-off-by: Alexey Gladkov 
---
 fs/exec.c  |  2 +-
 fs/io-wq.c | 22 ++--
 fs/io-wq.h |  2 +-
 fs/io_uring.c  |  2 +-
 include/linux/cred.h   |  2 ++
 include/linux/sched/user.h |  1 -
 include/linux/user_namespace.h | 13 
 kernel/cred.c  | 10 +++---
 kernel/exit.c  |  2 +-
 kernel/fork.c  |  9 ++---
 kernel/sys.c   |  2 +-
 kernel/ucount.c| 61 ++
 kernel/user.c  |  1 -
 kernel/user_namespace.c|  3 +-
 14 files changed, 103 insertions(+), 29 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 0371a3400be5..e6d7f186f33c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1874,7 +1874,7 @@ static int do_execveat_common(int fd, struct filename 
*filename,
 * whether NPROC limit is still exceeded.
 */
if ((current->flags & PF_NPROC_EXCEEDED) &&
-   atomic_read(_user()->processes) > rlimit(RLIMIT_NPROC)) {
+   is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, 
rlimit(RLIMIT_NPROC))) {
retval = -EAGAIN;
goto out_ret;
}
diff --git a/fs/io-wq.c b/fs/io-wq.c
index a564f36e260c..5b6940c90c61 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "../kernel/sched/sched.h"
 #include "io-wq.h"
@@ -120,7 +121,7 @@ struct io_wq {
io_wq_work_fn *do_work;
 
struct task_struct *manager;
-   struct user_struct *user;
+   const struct cred *cred;
refcount_t refs;
struct completion done;
 
@@ -234,7 +235,7 @@ static void io_worker_exit(struct io_worker *worker)
if (worker->flags & IO_WORKER_F_RUNNING)
atomic_dec(>nr_running);
if (!(worker->flags & IO_WORKER_F_BOUND))
-   atomic_dec(>wq->user->processes);
+   dec_rlimit_ucounts(wqe->wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 
1);
worker->flags = 0;
preempt_enable();
 
@@ -364,15 +365,15 @@ static void __io_worker_busy(struct io_wqe *wqe, struct 
io_worker *worker,
worker->flags |= IO_WORKER_F_BOUND;
wqe->acct[IO_WQ_ACCT_UNBOUND].nr_workers--;
wqe->acct[IO_WQ_ACCT_BOUND].nr_workers++;
-   atomic_dec(>wq->user->processes);
+   dec_rlimit_ucounts(wqe->wq->cred->ucounts, 
UCOUNT_RLIMIT_NPROC, 1);
} else {
worker->flags &= ~IO_WORKER_F_BOUND;
wqe->acct[IO_WQ_ACCT_UNBOUND].nr_workers++;
wqe->acct[IO_WQ_ACCT_BOUND].nr_workers--;
-   atomic_inc(>wq->user->processes);
+   inc_rlimit_ucounts(wqe->wq->cred->ucounts, 
UCOUNT_RLIMIT_NPROC, 1);
}
io_wqe_inc_running(wqe, worker);
-}
+   }
 }
 
 /*
@@ -707,7 +708,7 @@ static bool create_io_worker(struct io_wq *wq, struct 
io_wqe *wqe, int index)
raw_spin_unlock_irq(>lock);
 
if (index == IO_WQ_ACCT_UNBOUND)
-   atomic_inc(>user->processes);
+   inc_rlimit_ucounts(wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 1);
 
refcount_inc(>refs);
wake_up_process(worker->task);
@@ -838,7 +839,7 @@ static bool io_wq_can_queue(struct io_wqe *wqe, struct 
io_wqe_acct *acct,
if (free_worker)
return true;
 
-   if (atomic_read(>wq->user->processes) >= acct->max_workers &&
+   if (is_ucounts_overlimit(wqe->wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 
acct->max_workers) &&
!(capable(CAP_SYS_RESOURCE) || capable(CAP_SYS_ADMIN)))

[PATCH v6 4/7] Reimplement RLIMIT_MSGQUEUE on top of ucounts

2021-02-15 Thread Alexey Gladkov

The rlimit counter is tied to a uid in the user_namespace. This allows
rlimit values to be specified in a userns even if they are already
globally exceeded by the user. However, the limits of the parent
user_namespaces still cannot be exceeded.

Signed-off-by: Alexey Gladkov 
---
 include/linux/sched/user.h |  4 
 include/linux/user_namespace.h |  1 +
 ipc/mqueue.c   | 29 +++--
 kernel/fork.c  |  1 +
 kernel/ucount.c|  1 +
 kernel/user_namespace.c|  1 +
 6 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index d33d867ad6c1..8a34446681aa 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -18,10 +18,6 @@ struct user_struct {
 #endif
 #ifdef CONFIG_EPOLL
atomic_long_t epoll_watches; /* The number of file descriptors 
currently watched */
-#endif
-#ifdef CONFIG_POSIX_MQUEUE
-   /* protected by mq_lock */
-   unsigned long mq_bytes; /* How many bytes can be allocated to mqueue? */
 #endif
unsigned long locked_shm; /* How many pages of mlocked shm ? */
unsigned long unix_inflight;/* How many files in flight in unix 
sockets */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 0a27cd049404..52453143fe23 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -51,6 +51,7 @@ enum ucount_type {
UCOUNT_INOTIFY_WATCHES,
 #endif
UCOUNT_RLIMIT_NPROC,
+   UCOUNT_RLIMIT_MSGQUEUE,
UCOUNT_COUNTS,
 };
 
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index beff0cfcd1e8..05fcf067131f 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -144,7 +144,7 @@ struct mqueue_inode_info {
struct pid *notify_owner;
u32 notify_self_exec_id;
struct user_namespace *notify_user_ns;
-   struct user_struct *user;   /* user who created, for accounting */
+   struct ucounts *ucounts;/* user who created, for accounting */
struct sock *notify_sock;
struct sk_buff *notify_cookie;
 
@@ -292,7 +292,6 @@ static struct inode *mqueue_get_inode(struct super_block 
*sb,
struct ipc_namespace *ipc_ns, umode_t mode,
struct mq_attr *attr)
 {
-   struct user_struct *u = current_user();
struct inode *inode;
int ret = -ENOMEM;
 
@@ -309,6 +308,8 @@ static struct inode *mqueue_get_inode(struct super_block 
*sb,
if (S_ISREG(mode)) {
struct mqueue_inode_info *info;
unsigned long mq_bytes, mq_treesize;
+   struct ucounts *ucounts;
+   bool overlimit;
 
inode->i_fop = _file_operations;
inode->i_size = FILENT_SIZE;
@@ -321,7 +322,7 @@ static struct inode *mqueue_get_inode(struct super_block 
*sb,
info->notify_owner = NULL;
info->notify_user_ns = NULL;
info->qsize = 0;
-   info->user = NULL;  /* set when all is ok */
+   info->ucounts = NULL;   /* set when all is ok */
info->msg_tree = RB_ROOT;
info->msg_tree_rightmost = NULL;
info->node_cache = NULL;
@@ -371,19 +372,19 @@ static struct inode *mqueue_get_inode(struct super_block 
*sb,
if (mq_bytes + mq_treesize < mq_bytes)
goto out_inode;
mq_bytes += mq_treesize;
+   ucounts = current_ucounts();
spin_lock(_lock);
-   if (u->mq_bytes + mq_bytes < u->mq_bytes ||
-   u->mq_bytes + mq_bytes > rlimit(RLIMIT_MSGQUEUE)) {
+   overlimit = inc_rlimit_ucounts_and_test(ucounts, 
UCOUNT_RLIMIT_MSGQUEUE,
+   mq_bytes, rlimit(RLIMIT_MSGQUEUE));
+   if (overlimit) {
+   dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MSGQUEUE, 
mq_bytes);
spin_unlock(_lock);
/* mqueue_evict_inode() releases info->messages */
ret = -EMFILE;
goto out_inode;
}
-   u->mq_bytes += mq_bytes;
spin_unlock(_lock);
-
-   /* all is ok */
-   info->user = get_uid(u);
+   info->ucounts = get_ucounts(ucounts);
} else if (S_ISDIR(mode)) {
inc_nlink(inode);
/* Some things misbehave if size == 0 on a directory */
@@ -497,7 +498,7 @@ static void mqueue_free_inode(struct inode *inode)
 static void mqueue_evict_inode(struct inode *inode)
 {
struct mqueue_inode_info *info;
-   struct user_struct *user;
+   struct ucounts *ucounts;
struct ipc_namespace *ipc_ns;
struct msg_msg *msg, *nmsg;
LIST_HEAD(tmp_msg);
@@ -520,8 +521,8 @@ static void mqueue_evict_inode(struct inode *in

[PATCH v6 1/7] Increase size of ucounts to atomic_long_t

2021-02-15 Thread Alexey Gladkov

RLIMIT_MSGQUEUE and RLIMIT_MEMLOCK use unsigned long to store their
counters. In preparation for moving these rlimits onto ucounts, we need
to widen the ucounts counter and its maximum from int to long.
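
As a rough userspace analogue of the capped increment that this patch widens,
the compare-and-exchange loop can be written with C11 atomics. The sketch
assumes atomic_long as a stand-in for the kernel's atomic_long_t, and
counter_inc_below() is a hypothetical name chosen for illustration. A long
counter matters because byte-based limits such as RLIMIT_MSGQUEUE can be
larger than INT_MAX.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Increment *v by one only while it is strictly below max.
 * Returns false, leaving the counter untouched, once max is reached. */
static bool counter_inc_below(atomic_long *v, long max)
{
	long c = atomic_load(v);

	for (;;) {
		if (c >= max)
			return false;
		/* On failure, c is refreshed with the current value and
		 * the bound is checked again before retrying. */
		if (atomic_compare_exchange_weak(v, &c, c + 1))
			return true;
	}
}
```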

Signed-off-by: Alexey Gladkov 
---
 include/linux/user_namespace.h |  4 ++--
 kernel/ucount.c| 16 
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 64cf8ebdc4ec..0bb833fd41f4 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -85,7 +85,7 @@ struct user_namespace {
struct ctl_table_header *sysctls;
 #endif
struct ucounts  *ucounts;
-   int ucount_max[UCOUNT_COUNTS];
+   long ucount_max[UCOUNT_COUNTS];
 } __randomize_layout;
 
 struct ucounts {
@@ -93,7 +93,7 @@ struct ucounts {
struct user_namespace *ns;
kuid_t uid;
int count;
-   atomic_t ucount[UCOUNT_COUNTS];
+   atomic_long_t ucount[UCOUNT_COUNTS];
 };
 
 extern struct user_namespace init_user_ns;
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 11b1596e2542..04c561751af1 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -175,14 +175,14 @@ static void put_ucounts(struct ucounts *ucounts)
kfree(ucounts);
 }
 
-static inline bool atomic_inc_below(atomic_t *v, int u)
+static inline bool atomic_long_inc_below(atomic_long_t *v, int u)
 {
-   int c, old;
-   c = atomic_read(v);
+   long c, old;
+   c = atomic_long_read(v);
for (;;) {
if (unlikely(c >= u))
return false;
-   old = atomic_cmpxchg(v, c, c+1);
+   old = atomic_long_cmpxchg(v, c, c+1);
if (likely(old == c))
return true;
c = old;
@@ -196,17 +196,17 @@ struct ucounts *inc_ucount(struct user_namespace *ns, 
kuid_t uid,
struct user_namespace *tns;
ucounts = get_ucounts(ns, uid);
for (iter = ucounts; iter; iter = tns->ucounts) {
-   int max;
+   long max;
tns = iter->ns;
max = READ_ONCE(tns->ucount_max[type]);
-   if (!atomic_inc_below(>ucount[type], max))
+   if (!atomic_long_inc_below(>ucount[type], max))
goto fail;
}
return ucounts;
 fail:
bad = iter;
for (iter = ucounts; iter != bad; iter = iter->ns->ucounts)
-   atomic_dec(>ucount[type]);
+   atomic_long_dec(>ucount[type]);
 
put_ucounts(ucounts);
return NULL;
@@ -216,7 +216,7 @@ void dec_ucount(struct ucounts *ucounts, enum ucount_type 
type)
 {
struct ucounts *iter;
for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-   int dec = atomic_dec_if_positive(>ucount[type]);
+   long dec = atomic_long_dec_if_positive(>ucount[type]);
WARN_ON_ONCE(dec < 0);
}
put_ucounts(ucounts);
-- 
2.29.2

[PATCH v6 2/7] Add a reference to ucounts for each cred

2021-02-15 Thread Alexey Gladkov

For RLIMIT_NPROC and some other rlimits the user_struct that holds the
global limit is kept alive for the lifetime of a process by keeping it
in struct cred. Adding a pointer to ucounts in struct cred makes it
possible to track RLIMIT_NPROC not only per user in the system, but per
user in the user_namespace.

Updating ucounts may require a memory allocation, which may fail. We
therefore cannot change cred.ucounts in commit_creds(), because that
function cannot fail and must always return 0. For this reason, we modify
cred.ucounts before calling commit_creds().

Changelog

v6:
* Fix null-ptr-deref in is_ucounts_overlimit() detected by trinity. This
  error was caused by the fact that cred_alloc_blank() left the ucounts
  pointer empty.

Reported-by: kernel test robot 
Signed-off-by: Alexey Gladkov 
---
 fs/exec.c  |  4 
 include/linux/cred.h   |  2 ++
 include/linux/user_namespace.h |  4 
 kernel/cred.c  | 40 ++
 kernel/fork.c  |  6 +
 kernel/sys.c   | 12 ++
 kernel/ucount.c| 40 +++---
 kernel/user_namespace.c|  3 +++
 8 files changed, 108 insertions(+), 3 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 5d4d52039105..0371a3400be5 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1360,6 +1360,10 @@ int begin_new_exec(struct linux_binprm * bprm)
WRITE_ONCE(me->self_exec_id, me->self_exec_id + 1);
flush_signal_handlers(me, 0);
 
+   retval = set_cred_ucounts(bprm->cred);
+   if (retval < 0)
+   goto out_unlock;
+
/*
 * install the new credentials for this executable
 */
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 18639c069263..ad160e5fe5c6 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -144,6 +144,7 @@ struct cred {
 #endif
struct user_struct *user;   /* real user ID subscription */
struct user_namespace *user_ns; /* user_ns the caps and keyrings are 
relative to. */
+   struct ucounts *ucounts;
struct group_info *group_info;  /* supplementary groups for euid/fsgid 
*/
/* RCU deletion */
union {
@@ -170,6 +171,7 @@ extern int set_security_override_from_ctx(struct cred *, 
const char *);
 extern int set_create_files_as(struct cred *, struct inode *);
 extern int cred_fscmp(const struct cred *, const struct cred *);
 extern void __init cred_init(void);
+extern int set_cred_ucounts(struct cred *);
 
 /*
  * check for validity of credentials
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 0bb833fd41f4..f71b5a4a3e74 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -97,11 +97,15 @@ struct ucounts {
 };
 
 extern struct user_namespace init_user_ns;
+extern struct ucounts init_ucounts;
 
 bool setup_userns_sysctls(struct user_namespace *ns);
 void retire_userns_sysctls(struct user_namespace *ns);
 struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum 
ucount_type type);
 void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
+struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid);
+struct ucounts *get_ucounts(struct ucounts *ucounts);
+void put_ucounts(struct ucounts *ucounts);
 
 #ifdef CONFIG_USER_NS
 
diff --git a/kernel/cred.c b/kernel/cred.c
index 421b1149c651..58a8a9e24347 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -60,6 +60,7 @@ struct cred init_cred = {
.user   = INIT_USER,
.user_ns= _user_ns,
.group_info = _groups,
+   .ucounts= _ucounts,
 };
 
 static inline void set_cred_subscribers(struct cred *cred, int n)
@@ -119,6 +120,8 @@ static void put_cred_rcu(struct rcu_head *rcu)
if (cred->group_info)
put_group_info(cred->group_info);
free_uid(cred->user);
+   if (cred->ucounts)
+   put_ucounts(cred->ucounts);
put_user_ns(cred->user_ns);
kmem_cache_free(cred_jar, cred);
 }
@@ -222,6 +225,7 @@ struct cred *cred_alloc_blank(void)
 #ifdef CONFIG_DEBUG_CREDENTIALS
new->magic = CRED_MAGIC;
 #endif
+   new->ucounts = get_ucounts(_ucounts);
 
if (security_cred_alloc_blank(new, GFP_KERNEL_ACCOUNT) < 0)
goto error;
@@ -284,6 +288,11 @@ struct cred *prepare_creds(void)
 
if (security_prepare_creds(new, old, GFP_KERNEL_ACCOUNT) < 0)
goto error;
+
+   new->ucounts = get_ucounts(new->ucounts);
+   if (!new->ucounts)
+   goto error;
+
validate_creds(new);
return new;
 
@@ -363,6 +372,8 @@ int copy_creds(struct task_struct *p, unsigned long 
clone_flags)
ret = create_user_ns(new);
if (ret < 0)
goto error_put;
+   if (set_cred_uc

[PATCH v5 6/7] Reimplement RLIMIT_MEMLOCK on top of ucounts

2021-02-01 Thread Alexey Gladkov

The rlimit counter is tied to a uid in the user_namespace. This allows
rlimit values to be specified in a userns even if they are already
globally exceeded by the user. However, the limits of the parent
user_namespaces still cannot be exceeded.

Signed-off-by: Alexey Gladkov 
---
 fs/hugetlbfs/inode.c   | 17 -
 include/linux/hugetlb.h|  3 +--
 include/linux/mm.h |  4 ++--
 include/linux/sched/user.h |  1 -
 include/linux/shmem_fs.h   |  2 +-
 include/linux/user_namespace.h |  1 +
 ipc/shm.c  | 31 --
 kernel/fork.c  |  1 +
 kernel/ucount.c|  1 +
 kernel/user.c  |  1 -
 kernel/user_namespace.c|  1 +
 mm/memfd.c |  4 +---
 mm/mlock.c | 35 +-
 mm/mmap.c  |  3 +--
 mm/shmem.c |  8 
 15 files changed, 52 insertions(+), 61 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index b5c109703daa..82298412f020 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1451,34 +1451,35 @@ static int get_hstate_idx(int page_size_log)
  * otherwise hugetlb_reserve_pages reserves one less hugepages than intended.
  */
 struct file *hugetlb_file_setup(const char *name, size_t size,
-   vm_flags_t acctflag, struct user_struct **user,
+   vm_flags_t acctflag,
int creat_flags, int page_size_log)
 {
struct inode *inode;
struct vfsmount *mnt;
int hstate_idx;
struct file *file;
+   const struct cred *cred;
 
hstate_idx = get_hstate_idx(page_size_log);
if (hstate_idx < 0)
return ERR_PTR(-ENODEV);
 
-   *user = NULL;
mnt = hugetlbfs_vfsmount[hstate_idx];
if (!mnt)
return ERR_PTR(-ENOENT);
 
if (creat_flags == HUGETLB_SHMFS_INODE && !can_do_hugetlb_shm()) {
-   *user = current_user();
-   if (user_shm_lock(size, *user)) {
+   cred = current_cred();
+   if (user_shm_lock(size, cred)) {
task_lock(current);
pr_warn_once("%s (%d): Using mlock ulimits for 
SHM_HUGETLB is deprecated\n",
current->comm, current->pid);
task_unlock(current);
} else {
-   *user = NULL;
return ERR_PTR(-EPERM);
}
+   } else {
+   cred = NULL;
}
 
file = ERR_PTR(-ENOSPC);
@@ -1503,10 +1504,8 @@ struct file *hugetlb_file_setup(const char *name, size_t 
size,
 
iput(inode);
 out:
-   if (*user) {
-   user_shm_unlock(size, *user);
-   *user = NULL;
-   }
+   if (cred)
+   user_shm_unlock(size, cred);
return file;
 }
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ebca2ef02212..fbd36c452648 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -434,8 +434,7 @@ static inline struct hugetlbfs_inode_info 
*HUGETLBFS_I(struct inode *inode)
 extern const struct file_operations hugetlbfs_file_operations;
 extern const struct vm_operations_struct hugetlb_vm_ops;
 struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
-   struct user_struct **user, int creat_flags,
-   int page_size_log);
+   int creat_flags, int page_size_log);
 
 static inline bool is_file_hugepages(struct file *file)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..30a37aef1ab9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1628,8 +1628,8 @@ extern bool can_do_mlock(void);
 #else
 static inline bool can_do_mlock(void) { return false; }
 #endif
-extern int user_shm_lock(size_t, struct user_struct *);
-extern void user_shm_unlock(size_t, struct user_struct *);
+extern int user_shm_lock(size_t, const struct cred *);
+extern void user_shm_unlock(size_t, const struct cred *);
 
 /*
  * Parameter block passed down to zap_pte_range in exceptional cases.
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 8ba9cec4fb99..82bd2532da6b 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -18,7 +18,6 @@ struct user_struct {
 #ifdef CONFIG_EPOLL
atomic_long_t epoll_watches; /* The number of file descriptors 
currently watched */
 #endif
-   unsigned long locked_shm; /* How many pages of mlocked shm ? */
unsigned long unix_inflight;/* How many files in flight in unix 
sockets */
atomic_long_t pipe_bufs;  /* how many pages are allocated in pipe 
buffers */
 
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index

[PATCH v5 0/7] Count rlimits in each user namespace

2021-02-01 Thread Alexey Gladkov

Preface
---
These patches bind the rlimit counters to a user in a user namespace.
This patch set can be applied on top of:

git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git v5.11-rc2

Problem
---
The implementation of the RLIMIT_NPROC, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING and
RLIMIT_MSGQUEUE rlimits places the counters in user_struct [1]. These limits are
global between processes and persist for the lifetime of the process, even if
processes are in different user namespaces.

To illustrate the impact of rlimits, let's say there is a program that does not
fork. Some service-A wants to run this program as user X in multiple containers.
Since the program never forks, the service wants to set RLIMIT_NPROC=1.

service-A
 \- program (uid=1000, container1, rlimit_nproc=1)
 \- program (uid=1000, container2, rlimit_nproc=1)

The service-A sets RLIMIT_NPROC=1 and runs the program in container1. When the
service-A tries to run a program with RLIMIT_NPROC=1 in container2 it fails
since user X already has one running process.

The problem is not that the limit from container1 affects container2. The
problem is that the limit is verified against the global counter that reflects
the number of processes in all containers.

This problem can be worked around by using different users for each container
but in this case we face a different problem of uid mapping when transferring
files from one container to another.

Eric W. Biederman mentioned this issue [2][3].

Introduced changes
--
To address the problem, we bind rlimit counters to the user namespace. Each
counter reflects the number of processes for a given uid in a given user
namespace. The result is a tree of rlimit counters with the biggest value at the
root (aka init_user_ns). The limit is considered exceeded if it is exceeded at
any level up the tree.
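
The counter tree described above can be pictured with a small userspace
model. This is a simplified sketch, not the kernel code: struct ns_counter,
tree_inc_and_test() and tree_dec() are hypothetical names, plain longs are
used instead of atomics, and each node carries its own maximum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One counter per (uid, user namespace); parent points towards the root. */
struct ns_counter {
	struct ns_counter *parent;	/* NULL for the root (init_user_ns) */
	long count;
	long max;
};

/* Charge n at this level and at every ancestor. Returns true if the
 * limit is exceeded at any level on the way up to the root. */
static bool tree_inc_and_test(struct ns_counter *c, long n)
{
	bool over = false;

	for (struct ns_counter *it = c; it != NULL; it = it->parent) {
		it->count += n;
		if (it->count > it->max)
			over = true;
	}
	return over;
}

/* Undo a charge of n at this level and at every ancestor. */
static void tree_dec(struct ns_counter *c, long n)
{
	for (struct ns_counter *it = c; it != NULL; it = it->parent)
		it->count -= n;
}
```

In this model two containers that each allow one process can both start one,
because each checks its own counter, while the root counter still sees the
sum of both.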

[1] https://lore.kernel.org/containers/87imd2incs@x220.int.ebiederm.org/
[2] https://lists.linuxfoundation.org/pipermail/containers/2020-August/042096.html
[3] https://lists.linuxfoundation.org/pipermail/containers/2020-October/042524.html

Changelog
-
v5:
* Split the first commit into two commits: change ucounts.count type to
  atomic_long_t and add ucounts to cred. These commits were merged by mistake
  during the rebase.
* The __get_ucounts() was renamed to alloc_ucounts().
* The cred.ucounts update has been moved out of commit_creds() because it did
  not allow errors to be handled.
* Added error handling of set_cred_ucounts().

v4:
* Reverted the type change of ucounts.count to refcount_t.
* Fixed typo in the kernel/cred.c

v3:
* Added get_ucounts() function to increase the reference count. The existing
  get_ucounts() function was renamed to __get_ucounts().
* The type of ucounts.count changed from atomic_t to refcount_t.
* Dropped 'const' from set_cred_ucounts() arguments.
* Fixed a bug with freeing the cred structure after calling cred_alloc_blank().
* Commit messages have been updated.
* Added selftest.

v2:
* RLIMIT_MEMLOCK, RLIMIT_SIGPENDING and RLIMIT_MSGQUEUE are migrated to ucounts.
* Added ucounts for pair uid and user namespace into cred.
* Added the ability to increase ucount by more than 1.

v1:
* After discussion with Eric W. Biederman, I increased the size of ucounts to
  atomic_long_t.
* Added ucount_max to avoid the fork bomb.

--

Alexey Gladkov (7):
  Increase size of ucounts to atomic_long_t
  Add a reference to ucounts for each cred
  Reimplement RLIMIT_NPROC on top of ucounts
  Reimplement RLIMIT_MSGQUEUE on top of ucounts
  Reimplement RLIMIT_SIGPENDING on top of ucounts
  Reimplement RLIMIT_MEMLOCK on top of ucounts
  kselftests: Add test to check for rlimit changes in different user
namespaces

 fs/exec.c |   6 +-
 fs/hugetlbfs/inode.c  |  17 +-
 fs/io-wq.c|  22 ++-
 fs/io-wq.h|   2 +-
 fs/io_uring.c |   2 +-
 fs/proc/array.c   |   2 +-
 include/linux/cred.h  |   4 +
 include/linux/hugetlb.h   |   3 +-
 include/linux/mm.h|   4 +-
 include/linux/sched/user.h|   7 -
 include/linux/shmem_fs.h  |   2 +-
 include/linux/signal_types.h  |   4 +-
 include/linux/user_namespace.h|  23 ++-
 ipc/mqueue.c  |  29 ++--
 ipc/shm.c |  31 ++--
 kernel/cred.c |  56 +-
 kernel/exit.c |   2 +-
 kernel/fork.c |  18 +-
 kernel/signal.c   |  53 +++---
 kernel/sys.c  |  14 +-
 kernel/ucount.c   | 105 ++--
 kernel/user.c |   3 -
 kernel/user_namespace.c   |   9

[PATCH v5 7/7] kselftests: Add test to check for rlimit changes in different user namespaces

2021-02-01 Thread Alexey Gladkov

The testcase runs a few instances of the program with RLIMIT_NPROC=1 from
user uid=6, in different user namespaces.

Signed-off-by: Alexey Gladkov 
---
 tools/testing/selftests/Makefile  |   1 +
 tools/testing/selftests/rlimits/.gitignore|   2 +
 tools/testing/selftests/rlimits/Makefile  |   6 +
 tools/testing/selftests/rlimits/config|   1 +
 .../selftests/rlimits/rlimits-per-userns.c| 161 ++
 5 files changed, 171 insertions(+)
 create mode 100644 tools/testing/selftests/rlimits/.gitignore
 create mode 100644 tools/testing/selftests/rlimits/Makefile
 create mode 100644 tools/testing/selftests/rlimits/config
 create mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index afbab4aeef3c..4dbeb5686f7b 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -46,6 +46,7 @@ TARGETS += proc
 TARGETS += pstore
 TARGETS += ptrace
 TARGETS += openat2
+TARGETS += rlimits
 TARGETS += rseq
 TARGETS += rtc
 TARGETS += seccomp
diff --git a/tools/testing/selftests/rlimits/.gitignore 
b/tools/testing/selftests/rlimits/.gitignore
new file mode 100644
index ..091021f255b3
--- /dev/null
+++ b/tools/testing/selftests/rlimits/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+rlimits-per-userns
diff --git a/tools/testing/selftests/rlimits/Makefile 
b/tools/testing/selftests/rlimits/Makefile
new file mode 100644
index ..03aadb406212
--- /dev/null
+++ b/tools/testing/selftests/rlimits/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+CFLAGS += -Wall -O2 -g
+TEST_GEN_PROGS := rlimits-per-userns
+
+include ../lib.mk
diff --git a/tools/testing/selftests/rlimits/config 
b/tools/testing/selftests/rlimits/config
new file mode 100644
index ..416bd53ce982
--- /dev/null
+++ b/tools/testing/selftests/rlimits/config
@@ -0,0 +1 @@
+CONFIG_USER_NS=y
diff --git a/tools/testing/selftests/rlimits/rlimits-per-userns.c 
b/tools/testing/selftests/rlimits/rlimits-per-userns.c
new file mode 100644
index ..26dc949e93ea
--- /dev/null
+++ b/tools/testing/selftests/rlimits/rlimits-per-userns.c
@@ -0,0 +1,161 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Alexey Gladkov 
+ */
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define NR_CHILDS 2
+
+static char *service_prog;
+static uid_t user   = 6;
+static uid_t group  = 6;
+
+static void setrlimit_nproc(rlim_t n)
+{
+   pid_t pid = getpid();
+   struct rlimit limit = {
+   .rlim_cur = n,
+   .rlim_max = n
+   };
+
+   warnx("(pid=%d): Setting RLIMIT_NPROC=%ld", pid, n);
+
+   if (setrlimit(RLIMIT_NPROC, &limit) < 0)
+   err(EXIT_FAILURE, "(pid=%d): setrlimit(RLIMIT_NPROC)", pid);
+}
+
+static pid_t fork_child(void)
+{
+   pid_t pid = fork();
+
+   if (pid < 0)
+   err(EXIT_FAILURE, "fork");
+
+   if (pid > 0)
+   return pid;
+
+   pid = getpid();
+
+   warnx("(pid=%d): New process starting ...", pid);
+
+   if (prctl(PR_SET_PDEATHSIG, SIGKILL) < 0)
+   err(EXIT_FAILURE, "(pid=%d): prctl(PR_SET_PDEATHSIG)", pid);
+
+   signal(SIGUSR1, SIG_DFL);
+
+   warnx("(pid=%d): Changing to uid=%d, gid=%d", pid, user, group);
+
+   if (setgid(group) < 0)
+   err(EXIT_FAILURE, "(pid=%d): setgid(%d)", pid, group);
+   if (setuid(user) < 0)
+   err(EXIT_FAILURE, "(pid=%d): setuid(%d)", pid, user);
+
+   warnx("(pid=%d): Service running ...", pid);
+
+   warnx("(pid=%d): Unshare user namespace", pid);
+   if (unshare(CLONE_NEWUSER) < 0)
+   err(EXIT_FAILURE, "unshare(CLONE_NEWUSER)");
+
+   char *const argv[] = { "service", NULL };
+   char *const envp[] = { "I_AM_SERVICE=1", NULL };
+
+   warnx("(pid=%d): Executing real service ...", pid);
+
+   execve(service_prog, argv, envp);
+   err(EXIT_FAILURE, "(pid=%d): execve", pid);
+}
+
+int main(int argc, char **argv)
+{
+   size_t i;
+   pid_t child[NR_CHILDS];
+   int wstatus[NR_CHILDS];
+   int childs = NR_CHILDS;
+   pid_t pid;
+
+   if (getenv("I_AM_SERVICE")) {
+   pause();
+   exit(EXIT_SUCCESS);
+   }
+
+   service_prog = argv[0];
+   pid = getpid();
+
+   warnx("(pid=%d) Starting testcase", pid);
+
+   /*
+* This rlimit is not a problem for root because it can be exceeded.
+*/
+   setrlimit_nproc(1);
+
+   for (i = 0; i < NR_CH

[PATCH v5 5/7] Reimplement RLIMIT_SIGPENDING on top of ucounts

2021-02-01 Thread Alexey Gladkov

The rlimit counter is tied to a uid in the user_namespace. This allows
rlimit values to be specified in a userns even if they are already
globally exceeded by the user. However, the limits of the parent
user_namespaces still cannot be exceeded.

Signed-off-by: Alexey Gladkov 
---
 fs/proc/array.c|  2 +-
 include/linux/sched/user.h |  1 -
 include/linux/signal_types.h   |  4 ++-
 include/linux/user_namespace.h |  1 +
 kernel/fork.c  |  1 +
 kernel/signal.c| 53 ++
 kernel/ucount.c|  1 +
 kernel/user.c  |  1 -
 kernel/user_namespace.c|  1 +
 9 files changed, 30 insertions(+), 35 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index bb87e4d89cd8..74b0ea4b7e38 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -284,7 +284,7 @@ static inline void task_sig(struct seq_file *m, struct 
task_struct *p)
collect_sigign_sigcatch(p, , );
num_threads = get_nr_threads(p);
rcu_read_lock();  /* FIXME: is this correct? */
-   qsize = atomic_read(&__task_cred(p)->user->sigpending);
+   qsize = get_ucounts_value(task_ucounts(p), 
UCOUNT_RLIMIT_SIGPENDING);
rcu_read_unlock();
qlim = task_rlimit(p, RLIMIT_SIGPENDING);
unlock_task_sighand(p, );
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 8a34446681aa..8ba9cec4fb99 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -12,7 +12,6 @@
  */
 struct user_struct {
refcount_t __count; /* reference count */
-   atomic_t sigpending;/* How many pending signals does this user 
have? */
 #ifdef CONFIG_FANOTIFY
atomic_t fanotify_listeners;
 #endif
diff --git a/include/linux/signal_types.h b/include/linux/signal_types.h
index 68e06c75c5b2..34cb28b8f16c 100644
--- a/include/linux/signal_types.h
+++ b/include/linux/signal_types.h
@@ -13,6 +13,8 @@ typedef struct kernel_siginfo {
__SIGINFO;
 } kernel_siginfo_t;
 
+struct ucounts;
+
 /*
  * Real Time signals may be queued.
  */
@@ -21,7 +23,7 @@ struct sigqueue {
struct list_head list;
int flags;
kernel_siginfo_t info;
-   struct user_struct *user;
+   struct ucounts *ucounts;
 };
 
 /* flags values. */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 66d471753bed..66aebabc6c7f 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -52,6 +52,7 @@ enum ucount_type {
 #endif
UCOUNT_RLIMIT_NPROC,
UCOUNT_RLIMIT_MSGQUEUE,
+   UCOUNT_RLIMIT_SIGPENDING,
UCOUNT_COUNTS,
 };
 
diff --git a/kernel/fork.c b/kernel/fork.c
index bdd5be6062ab..4c91709af704 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -825,6 +825,7 @@ void __init fork_init(void)
 
init_user_ns.ucount_max[UCOUNT_RLIMIT_NPROC] = task_rlimit(&init_task, RLIMIT_NPROC);
init_user_ns.ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = task_rlimit(&init_task, RLIMIT_MSGQUEUE);
+   init_user_ns.ucount_max[UCOUNT_RLIMIT_SIGPENDING] = task_rlimit(&init_task, RLIMIT_SIGPENDING);
 
 #ifdef CONFIG_VMAP_STACK
cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:vm_stack_cache",
diff --git a/kernel/signal.c b/kernel/signal.c
index 5736c55aaa1a..b01c2007a282 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -412,49 +412,40 @@ void task_join_group_stop(struct task_struct *task)
 static struct sigqueue *
 __sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int 
override_rlimit)
 {
-   struct sigqueue *q = NULL;
-   struct user_struct *user;
-   int sigpending;
+   struct sigqueue *q = kmem_cache_alloc(sigqueue_cachep, flags);
 
-   /*
-* Protect access to @t credentials. This can go away when all
-* callers hold rcu read lock.
-*
-* NOTE! A pending signal will hold on to the user refcount,
-* and we get/put the refcount only when the sigpending count
-* changes from/to zero.
-*/
-   rcu_read_lock();
-   user = __task_cred(t)->user;
-   sigpending = atomic_inc_return(&user->sigpending);
-   if (sigpending == 1)
-   get_uid(user);
-   rcu_read_unlock();
+   if (likely(q != NULL)) {
+   bool overlimit;
 
-   if (override_rlimit || likely(sigpending <= task_rlimit(t, 
RLIMIT_SIGPENDING))) {
-   q = kmem_cache_alloc(sigqueue_cachep, flags);
-   } else {
-   print_dropped_signal(sig);
-   }
-
-   if (unlikely(q == NULL)) {
-   if (atomic_dec_and_test(>sigpending))
-   free_uid(user);
-   } else {
INIT_LIST_HEAD(>list);
q->flags = 0;
-   q->user = user;
+
+   /*
+* Protect access to @t credentials. This can go away when all
+

[PATCH v5 3/7] Reimplement RLIMIT_NPROC on top of ucounts

2021-02-01 Thread Alexey Gladkov

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the limits of the parent
user namespaces cannot be exceeded.

To illustrate the impact of rlimits, let's say there is a program that
does not fork. Some service-A wants to run this program as user X in
multiple containers. Since the program never forks, the service wants to
set RLIMIT_NPROC=1.

service-A
 \- program (uid=1000, container1, rlimit_nproc=1)
 \- program (uid=1000, container2, rlimit_nproc=1)

The service-A sets RLIMIT_NPROC=1 and runs the program in container1.
When the service-A tries to run a program with RLIMIT_NPROC=1 in
container2 it fails since user X already has one running process.
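
The scenario above can be modelled in userspace. This is only a sketch under
stated assumptions: struct ns_counter, model_inc_and_test() and model_dec()
are hypothetical names, not the kernel's struct ucounts or its helpers, and
the per-level ceiling handling is a guess at the intended semantics rather
than the code from this series. Each container gets its own counter that
points at the counter of the same user in the parent namespace; a charge is
applied at every level and is never refused, and the caller only learns
whether some level is now over its limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; not the kernel's struct ucounts. */
struct ns_counter {
	struct ns_counter *parent; /* same user, one namespace level up */
	long count;                /* usage charged at this level */
	long max;                  /* ceiling applied to the parent's count */
};

/*
 * Charge v units at every level without ever refusing the charge.
 * The innermost level is compared against the caller's rlimit, each
 * outer level against the ceiling recorded by the level below it.
 * Returns true if any level ends up above its limit.
 */
bool model_inc_and_test(struct ns_counter *c, long v, long rlimit)
{
	bool overlimit = false;
	long limit = rlimit;

	assert(v >= 0);
	for (struct ns_counter *it = c; it != NULL; it = it->parent) {
		it->count += v;
		if (it->count > limit)
			overlimit = true;
		limit = it->max;
	}
	return overlimit;
}

/* Undo a charge at every level, e.g. after an overlimit result. */
void model_dec(struct ns_counter *c, long v)
{
	for (struct ns_counter *it = c; it != NULL; it = it->parent)
		it->count -= v;
}
```

With one root counter and one child counter per container, charging one
process in each container with rlimit 1 reports no overlimit for either,
while a second process in the same container does. The caller decides what
to do with an overlimit result, which is what leaves room for a privileged
override.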

We cannot use the existing inc_ucount / dec_ucount because they do not
allow us to exceed the maximum for the counter. Some rlimits can be
exceeded by root or by a user with the appropriate capability.

Signed-off-by: Alexey Gladkov 
---
 fs/exec.c  |  2 +-
 fs/io-wq.c | 22 ++---
 fs/io-wq.h |  2 +-
 fs/io_uring.c  |  2 +-
 include/linux/cred.h   |  2 ++
 include/linux/sched/user.h |  1 -
 include/linux/user_namespace.h | 13 
 kernel/cred.c  | 11 ---
 kernel/exit.c  |  2 +-
 kernel/fork.c  |  9 ++---
 kernel/sys.c   |  2 +-
 kernel/ucount.c| 60 ++
 kernel/user.c  |  1 -
 kernel/user_namespace.c|  3 +-
 14 files changed, 103 insertions(+), 29 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 0371a3400be5..e6d7f186f33c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1874,7 +1874,7 @@ static int do_execveat_common(int fd, struct filename 
*filename,
 * whether NPROC limit is still exceeded.
 */
if ((current->flags & PF_NPROC_EXCEEDED) &&
-   atomic_read(_user()->processes) > rlimit(RLIMIT_NPROC)) {
+   is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, 
rlimit(RLIMIT_NPROC))) {
retval = -EAGAIN;
goto out_ret;
}
diff --git a/fs/io-wq.c b/fs/io-wq.c
index a564f36e260c..5b6940c90c61 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "../kernel/sched/sched.h"
 #include "io-wq.h"
@@ -120,7 +121,7 @@ struct io_wq {
io_wq_work_fn *do_work;
 
struct task_struct *manager;
-   struct user_struct *user;
+   const struct cred *cred;
refcount_t refs;
struct completion done;
 
@@ -234,7 +235,7 @@ static void io_worker_exit(struct io_worker *worker)
if (worker->flags & IO_WORKER_F_RUNNING)
atomic_dec(>nr_running);
if (!(worker->flags & IO_WORKER_F_BOUND))
-   atomic_dec(>wq->user->processes);
+   dec_rlimit_ucounts(wqe->wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 
1);
worker->flags = 0;
preempt_enable();
 
@@ -364,15 +365,15 @@ static void __io_worker_busy(struct io_wqe *wqe, struct 
io_worker *worker,
worker->flags |= IO_WORKER_F_BOUND;
wqe->acct[IO_WQ_ACCT_UNBOUND].nr_workers--;
wqe->acct[IO_WQ_ACCT_BOUND].nr_workers++;
-   atomic_dec(>wq->user->processes);
+   dec_rlimit_ucounts(wqe->wq->cred->ucounts, 
UCOUNT_RLIMIT_NPROC, 1);
} else {
worker->flags &= ~IO_WORKER_F_BOUND;
wqe->acct[IO_WQ_ACCT_UNBOUND].nr_workers++;
wqe->acct[IO_WQ_ACCT_BOUND].nr_workers--;
-   atomic_inc(>wq->user->processes);
+   inc_rlimit_ucounts(wqe->wq->cred->ucounts, 
UCOUNT_RLIMIT_NPROC, 1);
}
io_wqe_inc_running(wqe, worker);
-}
+   }
 }
 
 /*
@@ -707,7 +708,7 @@ static bool create_io_worker(struct io_wq *wq, struct 
io_wqe *wqe, int index)
raw_spin_unlock_irq(>lock);
 
if (index == IO_WQ_ACCT_UNBOUND)
-   atomic_inc(>user->processes);
+   inc_rlimit_ucounts(wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 1);
 
refcount_inc(>refs);
wake_up_process(worker->task);
@@ -838,7 +839,7 @@ static bool io_wq_can_queue(struct io_wqe *wqe, struct 
io_wqe_acct *acct,
if (free_worker)
return true;
 
-   if (atomic_read(>wq->user->processes) >= acct->max_workers &&
+   if (is_ucounts_overlimit(wqe->wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 
acct->max_workers) &&
!(capable(CAP_SYS_RESOURCE) || capable(CAP_SYS_ADMIN)))

[PATCH v5 1/7] Increase size of ucounts to atomic_long_t

2021-02-01 Thread Alexey Gladkov

RLIMIT_MSGQUEUE and RLIMIT_MEMLOCK use unsigned long to store their
counters. In preparation for reimplementing rlimits on top of ucounts,
we need to widen the ucounts counters from int to long.

Signed-off-by: Alexey Gladkov 
---
 include/linux/user_namespace.h |  4 ++--
 kernel/ucount.c| 16 
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 64cf8ebdc4ec..0bb833fd41f4 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -85,7 +85,7 @@ struct user_namespace {
struct ctl_table_header *sysctls;
 #endif
struct ucounts  *ucounts;
-   int ucount_max[UCOUNT_COUNTS];
+   long ucount_max[UCOUNT_COUNTS];
 } __randomize_layout;
 
 struct ucounts {
@@ -93,7 +93,7 @@ struct ucounts {
struct user_namespace *ns;
kuid_t uid;
int count;
-   atomic_t ucount[UCOUNT_COUNTS];
+   atomic_long_t ucount[UCOUNT_COUNTS];
 };
 
 extern struct user_namespace init_user_ns;
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 11b1596e2542..04c561751af1 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -175,14 +175,14 @@ static void put_ucounts(struct ucounts *ucounts)
kfree(ucounts);
 }
 
-static inline bool atomic_inc_below(atomic_t *v, int u)
+static inline bool atomic_long_inc_below(atomic_long_t *v, int u)
 {
-   int c, old;
-   c = atomic_read(v);
+   long c, old;
+   c = atomic_long_read(v);
for (;;) {
if (unlikely(c >= u))
return false;
-   old = atomic_cmpxchg(v, c, c+1);
+   old = atomic_long_cmpxchg(v, c, c+1);
if (likely(old == c))
return true;
c = old;
@@ -196,17 +196,17 @@ struct ucounts *inc_ucount(struct user_namespace *ns, 
kuid_t uid,
struct user_namespace *tns;
ucounts = get_ucounts(ns, uid);
for (iter = ucounts; iter; iter = tns->ucounts) {
-   int max;
+   long max;
tns = iter->ns;
max = READ_ONCE(tns->ucount_max[type]);
-   if (!atomic_inc_below(>ucount[type], max))
+   if (!atomic_long_inc_below(>ucount[type], max))
goto fail;
}
return ucounts;
 fail:
bad = iter;
for (iter = ucounts; iter != bad; iter = iter->ns->ucounts)
-   atomic_dec(>ucount[type]);
+   atomic_long_dec(>ucount[type]);
 
put_ucounts(ucounts);
return NULL;
@@ -216,7 +216,7 @@ void dec_ucount(struct ucounts *ucounts, enum ucount_type 
type)
 {
struct ucounts *iter;
for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-   int dec = atomic_dec_if_positive(>ucount[type]);
+   long dec = atomic_long_dec_if_positive(>ucount[type]);
WARN_ON_ONCE(dec < 0);
}
put_ucounts(ucounts);
-- 
2.29.2
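
The atomic_long_inc_below() helper touched by the patch above is a
compare-and-swap loop: increment only while the value is still below a
limit. A userspace analogue using C11 atomics might look like the sketch
below. model_inc_below() is a hypothetical name, and the sketch takes the
limit as a long, whereas the patch as posted keeps the `int u` parameter.

```c
#include <assert.h> /* only needed by callers that assert on the result */
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Increment *v by one only if its current value is below limit.
 * Returns true if the increment happened, false if the limit was hit.
 */
bool model_inc_below(atomic_long *v, long limit)
{
	long cur = atomic_load(v);

	for (;;) {
		if (cur >= limit)
			return false;
		/*
		 * On failure the call stores the value it observed into
		 * cur, so the next iteration re-checks against fresh data.
		 */
		if (atomic_compare_exchange_weak(v, &cur, cur + 1))
			return true;
	}
}
```

Starting from zero with a limit of 2, the first two calls succeed and the
third is refused, leaving the counter at 2. Unlike a plain fetch-and-add,
the counter never overshoots the limit, which is why this style of helper
cannot be reused for rlimits that a privileged task is allowed to exceed.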

[PATCH v5 2/7] Add a reference to ucounts for each cred

2021-02-01 Thread Alexey Gladkov

For RLIMIT_NPROC and some other rlimits, the user_struct that holds the
global limit is kept alive for the lifetime of a process by keeping it
in struct cred. Adding a pointer to ucounts in struct cred will allow
RLIMIT_NPROC to be tracked not only per user in the system, but also
per user in a user namespace.

Updating ucounts may require memory allocation which may fail. So, we
cannot change cred.ucounts in the commit_creds() because this function
cannot fail and it should always return 0. For this reason, we modify
cred.ucounts before calling the commit_creds().
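
The ordering constraint described above is the usual "do everything that can
fail first, then commit" pattern. A minimal userspace sketch, assuming
hypothetical types and names (struct model_cred, model_set_counts(),
model_commit()) that only stand in for the kernel's cred handling:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct ucounts and struct cred. */
struct model_counts { int refs; };
struct model_cred { struct model_counts *counts; int uid; };

/* Fallible step: allocates, so it must run before the commit. */
int model_set_counts(struct model_cred *new_cred)
{
	struct model_counts *c = calloc(1, sizeof(*c));

	if (c == NULL)
		return -1;
	c->refs = 1;
	new_cred->counts = c;
	return 0;
}

/* Infallible step: only publishes a fully prepared credential. */
void model_commit(struct model_cred **current, struct model_cred *new_cred)
{
	assert(new_cred->counts != NULL);
	*current = new_cred;
}
```

A caller checks the return value of model_set_counts() and bails out on
failure; only then does it call model_commit(), which has nothing left that
could go wrong.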

Signed-off-by: Alexey Gladkov 
---
 fs/exec.c  |  4 +++
 include/linux/cred.h   |  2 ++
 include/linux/user_namespace.h |  3 +++
 kernel/cred.c  | 45 ++
 kernel/fork.c  |  6 +
 kernel/sys.c   | 12 +
 kernel/ucount.c| 26 +---
 kernel/user_namespace.c|  3 +++
 8 files changed, 98 insertions(+), 3 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 5d4d52039105..0371a3400be5 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1360,6 +1360,10 @@ int begin_new_exec(struct linux_binprm * bprm)
WRITE_ONCE(me->self_exec_id, me->self_exec_id + 1);
flush_signal_handlers(me, 0);
 
+   retval = set_cred_ucounts(bprm->cred);
+   if (retval < 0)
+   goto out_unlock;
+
/*
 * install the new credentials for this executable
 */
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 18639c069263..ad160e5fe5c6 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -144,6 +144,7 @@ struct cred {
 #endif
struct user_struct *user;   /* real user ID subscription */
struct user_namespace *user_ns; /* user_ns the caps and keyrings are 
relative to. */
+   struct ucounts *ucounts;
struct group_info *group_info;  /* supplementary groups for euid/fsgid 
*/
/* RCU deletion */
union {
@@ -170,6 +171,7 @@ extern int set_security_override_from_ctx(struct cred *, 
const char *);
 extern int set_create_files_as(struct cred *, struct inode *);
 extern int cred_fscmp(const struct cred *, const struct cred *);
 extern void __init cred_init(void);
+extern int set_cred_ucounts(struct cred *);
 
 /*
  * check for validity of credentials
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 0bb833fd41f4..43f5847cc897 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -102,6 +102,9 @@ bool setup_userns_sysctls(struct user_namespace *ns);
 void retire_userns_sysctls(struct user_namespace *ns);
 struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum 
ucount_type type);
 void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
+struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid);
+struct ucounts *get_ucounts(struct ucounts *ucounts);
+void put_ucounts(struct ucounts *ucounts);
 
 #ifdef CONFIG_USER_NS
 
diff --git a/kernel/cred.c b/kernel/cred.c
index 421b1149c651..8194a59e283f 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -119,6 +119,8 @@ static void put_cred_rcu(struct rcu_head *rcu)
if (cred->group_info)
put_group_info(cred->group_info);
free_uid(cred->user);
+   if (cred->ucounts)
+   put_ucounts(cred->ucounts);
put_user_ns(cred->user_ns);
kmem_cache_free(cred_jar, cred);
 }
@@ -284,6 +286,11 @@ struct cred *prepare_creds(void)
 
if (security_prepare_creds(new, old, GFP_KERNEL_ACCOUNT) < 0)
goto error;
+
+   new->ucounts = get_ucounts(new->ucounts);
+   if (!new->ucounts)
+   goto error;
+
validate_creds(new);
return new;
 
@@ -363,6 +370,8 @@ int copy_creds(struct task_struct *p, unsigned long 
clone_flags)
ret = create_user_ns(new);
if (ret < 0)
goto error_put;
+   if (set_cred_ucounts(new) < 0)
+   goto error_put;
}
 
 #ifdef CONFIG_KEYS
@@ -653,6 +662,33 @@ int cred_fscmp(const struct cred *a, const struct cred *b)
 }
 EXPORT_SYMBOL(cred_fscmp);
 
+int set_cred_ucounts(struct cred *new)
+{
+   struct task_struct *task = current;
+   const struct cred *old = task->real_cred;
+   struct ucounts *old_ucounts = new->ucounts;
+
+   BUG_ON(task->cred != old);
+
+   if (new->user == old->user && new->user_ns == old->user_ns)
+   return 0;
+
+   /*
+* This optimization is needed because alloc_ucounts() uses locks
+* for table lookups.
+*/
+   if (old_ucounts && old_ucounts->ns == new->user_ns && 
uid_eq(old_ucounts->uid, new->euid))
+   return 0;
+
+   if (!(new->ucounts = alloc_ucounts(new->user_ns,

[PATCH v5 4/7] Reimplement RLIMIT_MSGQUEUE on top of ucounts

2021-02-01 Thread Alexey Gladkov

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the limits of the parent
user namespaces cannot be exceeded.

Signed-off-by: Alexey Gladkov 
---
 include/linux/sched/user.h |  4 
 include/linux/user_namespace.h |  1 +
 ipc/mqueue.c   | 29 +++--
 kernel/fork.c  |  1 +
 kernel/ucount.c|  1 +
 kernel/user_namespace.c|  1 +
 6 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index d33d867ad6c1..8a34446681aa 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -18,10 +18,6 @@ struct user_struct {
 #endif
 #ifdef CONFIG_EPOLL
atomic_long_t epoll_watches; /* The number of file descriptors 
currently watched */
-#endif
-#ifdef CONFIG_POSIX_MQUEUE
-   /* protected by mq_lock */
-   unsigned long mq_bytes; /* How many bytes can be allocated to mqueue? */
 #endif
unsigned long locked_shm; /* How many pages of mlocked shm ? */
unsigned long unix_inflight;/* How many files in flight in unix 
sockets */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index bca7ba20f552..66d471753bed 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -51,6 +51,7 @@ enum ucount_type {
UCOUNT_INOTIFY_WATCHES,
 #endif
UCOUNT_RLIMIT_NPROC,
+   UCOUNT_RLIMIT_MSGQUEUE,
UCOUNT_COUNTS,
 };
 
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index beff0cfcd1e8..05fcf067131f 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -144,7 +144,7 @@ struct mqueue_inode_info {
struct pid *notify_owner;
u32 notify_self_exec_id;
struct user_namespace *notify_user_ns;
-   struct user_struct *user;   /* user who created, for accounting */
+   struct ucounts *ucounts;/* user who created, for accounting */
struct sock *notify_sock;
struct sk_buff *notify_cookie;
 
@@ -292,7 +292,6 @@ static struct inode *mqueue_get_inode(struct super_block 
*sb,
struct ipc_namespace *ipc_ns, umode_t mode,
struct mq_attr *attr)
 {
-   struct user_struct *u = current_user();
struct inode *inode;
int ret = -ENOMEM;
 
@@ -309,6 +308,8 @@ static struct inode *mqueue_get_inode(struct super_block 
*sb,
if (S_ISREG(mode)) {
struct mqueue_inode_info *info;
unsigned long mq_bytes, mq_treesize;
+   struct ucounts *ucounts;
+   bool overlimit;
 
inode->i_fop = _file_operations;
inode->i_size = FILENT_SIZE;
@@ -321,7 +322,7 @@ static struct inode *mqueue_get_inode(struct super_block 
*sb,
info->notify_owner = NULL;
info->notify_user_ns = NULL;
info->qsize = 0;
-   info->user = NULL;  /* set when all is ok */
+   info->ucounts = NULL;   /* set when all is ok */
info->msg_tree = RB_ROOT;
info->msg_tree_rightmost = NULL;
info->node_cache = NULL;
@@ -371,19 +372,19 @@ static struct inode *mqueue_get_inode(struct super_block 
*sb,
if (mq_bytes + mq_treesize < mq_bytes)
goto out_inode;
mq_bytes += mq_treesize;
+   ucounts = current_ucounts();
spin_lock(_lock);
-   if (u->mq_bytes + mq_bytes < u->mq_bytes ||
-   u->mq_bytes + mq_bytes > rlimit(RLIMIT_MSGQUEUE)) {
+   overlimit = inc_rlimit_ucounts_and_test(ucounts, 
UCOUNT_RLIMIT_MSGQUEUE,
+   mq_bytes, rlimit(RLIMIT_MSGQUEUE));
+   if (overlimit) {
+   dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MSGQUEUE, 
mq_bytes);
spin_unlock(_lock);
/* mqueue_evict_inode() releases info->messages */
ret = -EMFILE;
goto out_inode;
}
-   u->mq_bytes += mq_bytes;
spin_unlock(_lock);
-
-   /* all is ok */
-   info->user = get_uid(u);
+   info->ucounts = get_ucounts(ucounts);
} else if (S_ISDIR(mode)) {
inc_nlink(inode);
/* Some things misbehave if size == 0 on a directory */
@@ -497,7 +498,7 @@ static void mqueue_free_inode(struct inode *inode)
 static void mqueue_evict_inode(struct inode *inode)
 {
struct mqueue_inode_info *info;
-   struct user_struct *user;
+   struct ucounts *ucounts;
struct ipc_namespace *ipc_ns;
struct msg_msg *msg, *nmsg;
LIST_HEAD(tmp_msg);
@@ -520,8 +521,8 @@ static void mqueue_evict_inode(struct inode *in

[PATCH v4 5/7] Move RLIMIT_MEMLOCK counter to ucounts

2021-01-22 Thread Alexey Gladkov

Signed-off-by: Alexey Gladkov 
---
 fs/hugetlbfs/inode.c   | 17 -
 include/linux/hugetlb.h|  3 +--
 include/linux/mm.h |  4 ++--
 include/linux/shmem_fs.h   |  2 +-
 include/linux/user_namespace.h |  1 +
 ipc/shm.c  | 31 --
 kernel/fork.c  |  1 +
 kernel/ucount.c|  1 +
 kernel/user_namespace.c|  1 +
 mm/memfd.c |  4 +---
 mm/mlock.c | 35 +-
 mm/mmap.c  |  3 +--
 mm/shmem.c |  8 
 13 files changed, 52 insertions(+), 59 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index b5c109703daa..82298412f020 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1451,34 +1451,35 @@ static int get_hstate_idx(int page_size_log)
  * otherwise hugetlb_reserve_pages reserves one less hugepages than intended.
  */
 struct file *hugetlb_file_setup(const char *name, size_t size,
-   vm_flags_t acctflag, struct user_struct **user,
+   vm_flags_t acctflag,
int creat_flags, int page_size_log)
 {
struct inode *inode;
struct vfsmount *mnt;
int hstate_idx;
struct file *file;
+   const struct cred *cred;
 
hstate_idx = get_hstate_idx(page_size_log);
if (hstate_idx < 0)
return ERR_PTR(-ENODEV);
 
-   *user = NULL;
mnt = hugetlbfs_vfsmount[hstate_idx];
if (!mnt)
return ERR_PTR(-ENOENT);
 
if (creat_flags == HUGETLB_SHMFS_INODE && !can_do_hugetlb_shm()) {
-   *user = current_user();
-   if (user_shm_lock(size, *user)) {
+   cred = current_cred();
+   if (user_shm_lock(size, cred)) {
task_lock(current);
pr_warn_once("%s (%d): Using mlock ulimits for 
SHM_HUGETLB is deprecated\n",
current->comm, current->pid);
task_unlock(current);
} else {
-   *user = NULL;
return ERR_PTR(-EPERM);
}
+   } else {
+   cred = NULL;
}
 
file = ERR_PTR(-ENOSPC);
@@ -1503,10 +1504,8 @@ struct file *hugetlb_file_setup(const char *name, size_t 
size,
 
iput(inode);
 out:
-   if (*user) {
-   user_shm_unlock(size, *user);
-   *user = NULL;
-   }
+   if (cred)
+   user_shm_unlock(size, cred);
return file;
 }
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ebca2ef02212..fbd36c452648 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -434,8 +434,7 @@ static inline struct hugetlbfs_inode_info 
*HUGETLBFS_I(struct inode *inode)
 extern const struct file_operations hugetlbfs_file_operations;
 extern const struct vm_operations_struct hugetlb_vm_ops;
 struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
-   struct user_struct **user, int creat_flags,
-   int page_size_log);
+   int creat_flags, int page_size_log);
 
 static inline bool is_file_hugepages(struct file *file)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..30a37aef1ab9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1628,8 +1628,8 @@ extern bool can_do_mlock(void);
 #else
 static inline bool can_do_mlock(void) { return false; }
 #endif
-extern int user_shm_lock(size_t, struct user_struct *);
-extern void user_shm_unlock(size_t, struct user_struct *);
+extern int user_shm_lock(size_t, const struct cred *);
+extern void user_shm_unlock(size_t, const struct cred *);
 
 /*
  * Parameter block passed down to zap_pte_range in exceptional cases.
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index d82b6f396588..10f50b1c4e0e 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -65,7 +65,7 @@ extern struct file *shmem_file_setup_with_mnt(struct vfsmount 
*mnt,
 extern int shmem_zero_setup(struct vm_area_struct *);
 extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
unsigned long len, unsigned long pgoff, unsigned long flags);
-extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
+extern int shmem_lock(struct file *file, int lock, const struct cred *cred);
 #ifdef CONFIG_SHMEM
 extern const struct address_space_operations shmem_aops;
 static inline bool shmem_mapping(struct address_space *mapping)
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 26000aea53c4..9f8ecf3063dd 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -53,

[PATCH v4 3/7] Move RLIMIT_MSGQUEUE counter to ucounts

2021-01-22 Thread Alexey Gladkov

Signed-off-by: Alexey Gladkov 
---
 include/linux/sched/user.h |  4 
 include/linux/user_namespace.h |  1 +
 ipc/mqueue.c   | 29 +++--
 kernel/fork.c  |  1 +
 kernel/ucount.c|  1 +
 kernel/user_namespace.c|  1 +
 6 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index d33d867ad6c1..8a34446681aa 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -18,10 +18,6 @@ struct user_struct {
 #endif
 #ifdef CONFIG_EPOLL
atomic_long_t epoll_watches; /* The number of file descriptors 
currently watched */
-#endif
-#ifdef CONFIG_POSIX_MQUEUE
-   /* protected by mq_lock */
-   unsigned long mq_bytes; /* How many bytes can be allocated to mqueue? */
 #endif
unsigned long locked_shm; /* How many pages of mlocked shm ? */
unsigned long unix_inflight;/* How many files in flight in unix 
sockets */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 68a87d05d8d5..1766cf503d1b 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -51,6 +51,7 @@ enum ucount_type {
UCOUNT_INOTIFY_WATCHES,
 #endif
UCOUNT_RLIMIT_NPROC,
+   UCOUNT_RLIMIT_MSGQUEUE,
UCOUNT_COUNTS,
 };
 
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index beff0cfcd1e8..05fcf067131f 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -144,7 +144,7 @@ struct mqueue_inode_info {
struct pid *notify_owner;
u32 notify_self_exec_id;
struct user_namespace *notify_user_ns;
-   struct user_struct *user;   /* user who created, for accounting */
+   struct ucounts *ucounts;/* user who created, for accounting */
struct sock *notify_sock;
struct sk_buff *notify_cookie;
 
@@ -292,7 +292,6 @@ static struct inode *mqueue_get_inode(struct super_block 
*sb,
struct ipc_namespace *ipc_ns, umode_t mode,
struct mq_attr *attr)
 {
-   struct user_struct *u = current_user();
struct inode *inode;
int ret = -ENOMEM;
 
@@ -309,6 +308,8 @@ static struct inode *mqueue_get_inode(struct super_block 
*sb,
if (S_ISREG(mode)) {
struct mqueue_inode_info *info;
unsigned long mq_bytes, mq_treesize;
+   struct ucounts *ucounts;
+   bool overlimit;
 
inode->i_fop = _file_operations;
inode->i_size = FILENT_SIZE;
@@ -321,7 +322,7 @@ static struct inode *mqueue_get_inode(struct super_block 
*sb,
info->notify_owner = NULL;
info->notify_user_ns = NULL;
info->qsize = 0;
-   info->user = NULL;  /* set when all is ok */
+   info->ucounts = NULL;   /* set when all is ok */
info->msg_tree = RB_ROOT;
info->msg_tree_rightmost = NULL;
info->node_cache = NULL;
@@ -371,19 +372,19 @@ static struct inode *mqueue_get_inode(struct super_block 
*sb,
if (mq_bytes + mq_treesize < mq_bytes)
goto out_inode;
mq_bytes += mq_treesize;
+   ucounts = current_ucounts();
spin_lock(_lock);
-   if (u->mq_bytes + mq_bytes < u->mq_bytes ||
-   u->mq_bytes + mq_bytes > rlimit(RLIMIT_MSGQUEUE)) {
+   overlimit = inc_rlimit_ucounts_and_test(ucounts, 
UCOUNT_RLIMIT_MSGQUEUE,
+   mq_bytes, rlimit(RLIMIT_MSGQUEUE));
+   if (overlimit) {
+   dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MSGQUEUE, 
mq_bytes);
spin_unlock(_lock);
/* mqueue_evict_inode() releases info->messages */
ret = -EMFILE;
goto out_inode;
}
-   u->mq_bytes += mq_bytes;
spin_unlock(_lock);
-
-   /* all is ok */
-   info->user = get_uid(u);
+   info->ucounts = get_ucounts(ucounts);
} else if (S_ISDIR(mode)) {
inc_nlink(inode);
/* Some things misbehave if size == 0 on a directory */
@@ -497,7 +498,7 @@ static void mqueue_free_inode(struct inode *inode)
 static void mqueue_evict_inode(struct inode *inode)
 {
struct mqueue_inode_info *info;
-   struct user_struct *user;
+   struct ucounts *ucounts;
struct ipc_namespace *ipc_ns;
struct msg_msg *msg, *nmsg;
LIST_HEAD(tmp_msg);
@@ -520,8 +521,8 @@ static void mqueue_evict_inode(struct inode *inode)
free_msg(msg);
}
 
-   user = info->user;
-   if (user) {
+   ucounts = info->ucounts;
+   if (ucounts) {
unsigned long mq_bytes, mq_treesize;
 
/* Total amou

[PATCH v4 2/7] Move RLIMIT_NPROC counter to ucounts

2021-01-22 Thread Alexey Gladkov

RLIMIT_NPROC is implemented on top of ucounts. The process counter is
tied to the user in the user namespace. Therefore, there is no longer
one single counter for the user. Instead, there is now one counter for
each user namespace. Thus, getting the RLIMIT_NPROC counter value to
check the rlimit becomes meaningless.

We cannot use the existing inc_ucount / dec_ucount because they do not
allow us to exceed the maximum for the counter. Some rlimits can be
exceeded if the user has the appropriate capability.

Signed-off-by: Alexey Gladkov 
---
 fs/exec.c  |  2 +-
 fs/io-wq.c | 22 ++---
 fs/io-wq.h |  2 +-
 fs/io_uring.c  |  2 +-
 include/linux/cred.h   |  2 ++
 include/linux/sched/user.h |  1 -
 include/linux/user_namespace.h | 13 
 kernel/cred.c  | 10 +++---
 kernel/exit.c  |  2 +-
 kernel/fork.c  |  9 ++---
 kernel/sys.c   |  2 +-
 kernel/ucount.c| 60 ++
 kernel/user.c  |  1 -
 kernel/user_namespace.c|  3 +-
 14 files changed, 102 insertions(+), 29 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 5d4d52039105..f62fd2632104 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1870,7 +1870,7 @@ static int do_execveat_common(int fd, struct filename 
*filename,
 * whether NPROC limit is still exceeded.
 */
if ((current->flags & PF_NPROC_EXCEEDED) &&
-   atomic_read(_user()->processes) > rlimit(RLIMIT_NPROC)) {
+   is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, 
rlimit(RLIMIT_NPROC))) {
retval = -EAGAIN;
goto out_ret;
}
diff --git a/fs/io-wq.c b/fs/io-wq.c
index a564f36e260c..5b6940c90c61 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "../kernel/sched/sched.h"
 #include "io-wq.h"
@@ -120,7 +121,7 @@ struct io_wq {
io_wq_work_fn *do_work;
 
struct task_struct *manager;
-   struct user_struct *user;
+   const struct cred *cred;
refcount_t refs;
struct completion done;
 
@@ -234,7 +235,7 @@ static void io_worker_exit(struct io_worker *worker)
if (worker->flags & IO_WORKER_F_RUNNING)
atomic_dec(>nr_running);
if (!(worker->flags & IO_WORKER_F_BOUND))
-   atomic_dec(>wq->user->processes);
+   dec_rlimit_ucounts(wqe->wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 
1);
worker->flags = 0;
preempt_enable();
 
@@ -364,15 +365,15 @@ static void __io_worker_busy(struct io_wqe *wqe, struct 
io_worker *worker,
worker->flags |= IO_WORKER_F_BOUND;
wqe->acct[IO_WQ_ACCT_UNBOUND].nr_workers--;
wqe->acct[IO_WQ_ACCT_BOUND].nr_workers++;
-   atomic_dec(>wq->user->processes);
+   dec_rlimit_ucounts(wqe->wq->cred->ucounts, 
UCOUNT_RLIMIT_NPROC, 1);
} else {
worker->flags &= ~IO_WORKER_F_BOUND;
wqe->acct[IO_WQ_ACCT_UNBOUND].nr_workers++;
wqe->acct[IO_WQ_ACCT_BOUND].nr_workers--;
-   atomic_inc(>wq->user->processes);
+   inc_rlimit_ucounts(wqe->wq->cred->ucounts, 
UCOUNT_RLIMIT_NPROC, 1);
}
io_wqe_inc_running(wqe, worker);
-}
+   }
 }
 
 /*
@@ -707,7 +708,7 @@ static bool create_io_worker(struct io_wq *wq, struct 
io_wqe *wqe, int index)
raw_spin_unlock_irq(>lock);
 
if (index == IO_WQ_ACCT_UNBOUND)
-   atomic_inc(>user->processes);
+   inc_rlimit_ucounts(wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 1);
 
refcount_inc(>refs);
wake_up_process(worker->task);
@@ -838,7 +839,7 @@ static bool io_wq_can_queue(struct io_wqe *wqe, struct 
io_wqe_acct *acct,
if (free_worker)
return true;
 
-   if (atomic_read(>wq->user->processes) >= acct->max_workers &&
+   if (is_ucounts_overlimit(wqe->wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 
acct->max_workers) &&
!(capable(CAP_SYS_RESOURCE) || capable(CAP_SYS_ADMIN)))
return false;
 
@@ -1074,7 +1075,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct 
io_wq_data *data)
wq->do_work = data->do_work;
 
/* caller must already hold a reference to this */
-   wq->user = data->user;
+   wq->cred = data->cred;
 
ret = -ENOMEM;
for_each_node(node) {
@@ -1090,10 +1091,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct 
io_wq_data *data)
wqe-

[PATCH v4 4/7] Move RLIMIT_SIGPENDING counter to ucounts

2021-01-22 Thread Alexey Gladkov

Signed-off-by: Alexey Gladkov 
---
 fs/proc/array.c|  2 +-
 include/linux/sched/user.h |  1 -
 include/linux/signal_types.h   |  4 ++-
 include/linux/user_namespace.h |  1 +
 kernel/fork.c  |  1 +
 kernel/signal.c| 53 ++
 kernel/ucount.c|  1 +
 kernel/user.c  |  1 -
 kernel/user_namespace.c|  1 +
 9 files changed, 30 insertions(+), 35 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index bb87e4d89cd8..74b0ea4b7e38 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -284,7 +284,7 @@ static inline void task_sig(struct seq_file *m, struct 
task_struct *p)
collect_sigign_sigcatch(p, , );
num_threads = get_nr_threads(p);
rcu_read_lock();  /* FIXME: is this correct? */
-   qsize = atomic_read(&__task_cred(p)->user->sigpending);
+   qsize = get_ucounts_value(task_ucounts(p), 
UCOUNT_RLIMIT_SIGPENDING);
rcu_read_unlock();
qlim = task_rlimit(p, RLIMIT_SIGPENDING);
unlock_task_sighand(p, );
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 8a34446681aa..8ba9cec4fb99 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -12,7 +12,6 @@
  */
 struct user_struct {
refcount_t __count; /* reference count */
-   atomic_t sigpending;/* How many pending signals does this user 
have? */
 #ifdef CONFIG_FANOTIFY
atomic_t fanotify_listeners;
 #endif
diff --git a/include/linux/signal_types.h b/include/linux/signal_types.h
index 68e06c75c5b2..34cb28b8f16c 100644
--- a/include/linux/signal_types.h
+++ b/include/linux/signal_types.h
@@ -13,6 +13,8 @@ typedef struct kernel_siginfo {
__SIGINFO;
 } kernel_siginfo_t;
 
+struct ucounts;
+
 /*
  * Real Time signals may be queued.
  */
@@ -21,7 +23,7 @@ struct sigqueue {
struct list_head list;
int flags;
kernel_siginfo_t info;
-   struct user_struct *user;
+   struct ucounts *ucounts;
 };
 
 /* flags values. */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 1766cf503d1b..26000aea53c4 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -52,6 +52,7 @@ enum ucount_type {
 #endif
UCOUNT_RLIMIT_NPROC,
UCOUNT_RLIMIT_MSGQUEUE,
+   UCOUNT_RLIMIT_SIGPENDING,
UCOUNT_COUNTS,
 };
 
diff --git a/kernel/fork.c b/kernel/fork.c
index f61a5a3dc02f..a7be5790392e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -825,6 +825,7 @@ void __init fork_init(void)
 
init_user_ns.ucount_max[UCOUNT_RLIMIT_NPROC] = task_rlimit(_task, 
RLIMIT_NPROC);
init_user_ns.ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = 
task_rlimit(_task, RLIMIT_MSGQUEUE);
+   init_user_ns.ucount_max[UCOUNT_RLIMIT_SIGPENDING] = 
task_rlimit(_task, RLIMIT_SIGPENDING);
 
 #ifdef CONFIG_VMAP_STACK
cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:vm_stack_cache",
diff --git a/kernel/signal.c b/kernel/signal.c
index 5736c55aaa1a..b01c2007a282 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -412,49 +412,40 @@ void task_join_group_stop(struct task_struct *task)
 static struct sigqueue *
 __sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimit)
 {
-	struct sigqueue *q = NULL;
-	struct user_struct *user;
-	int sigpending;
+	struct sigqueue *q = kmem_cache_alloc(sigqueue_cachep, flags);
 
-	/*
-	 * Protect access to @t credentials. This can go away when all
-	 * callers hold rcu read lock.
-	 *
-	 * NOTE! A pending signal will hold on to the user refcount,
-	 * and we get/put the refcount only when the sigpending count
-	 * changes from/to zero.
-	 */
-	rcu_read_lock();
-	user = __task_cred(t)->user;
-	sigpending = atomic_inc_return(&user->sigpending);
-	if (sigpending == 1)
-		get_uid(user);
-	rcu_read_unlock();
+	if (likely(q != NULL)) {
+		bool overlimit;
 
-	if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
-		q = kmem_cache_alloc(sigqueue_cachep, flags);
-	} else {
-		print_dropped_signal(sig);
-	}
-
-	if (unlikely(q == NULL)) {
-		if (atomic_dec_and_test(&user->sigpending))
-			free_uid(user);
-	} else {
 		INIT_LIST_HEAD(&q->list);
 		q->flags = 0;
-		q->user = user;
+
+		/*
+		 * Protect access to @t credentials. This can go away when all
+		 * callers hold rcu read lock.
+		 */
+		rcu_read_lock();
+		q->ucounts = get_ucounts(task_ucounts(t));
+		overlimit = inc_rlimit_ucounts_and_test(q->ucounts, UCOUNT_RLIMIT_SIGPENDING,
+

[PATCH v4 6/7] Move RLIMIT_NPROC check to the place where we increment the counter

2021-01-22 Thread Alexey Gladkov

After calling set_user(), we always have to call commit_creds() to apply
the new credentials to the current task, so there is no need to separate
the limit check from the counter increment.
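
As a rough userspace illustration of folding the check into the increment
(this is not the kernel implementation: demo_task, demo_inc_and_test(),
demo_commit() and DEMO_NPROC_EXCEEDED are hypothetical stand-ins for the task
flags, inc_rlimit_ucounts_and_test(), commit_creds() and PF_NPROC_EXCEEDED),
the increment can report whether the limit is exceeded, and the caller only
records that fact so a later exec step can fail:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for PF_NPROC_EXCEEDED. */
#define DEMO_NPROC_EXCEEDED 0x1u

/* Hypothetical stand-in for the part of task_struct that holds flags. */
struct demo_task {
	unsigned int flags;
};

/* Add @delta to @counter and report whether the new value exceeds @max. */
static bool demo_inc_and_test(atomic_long *counter, long delta, long max)
{
	long newval = atomic_fetch_add(counter, delta) + delta;

	return newval > max;
}

/*
 * Charge one process to @nproc.  The charge always succeeds; an excess is
 * only remembered in the task flags, and never for the root user.
 */
static void demo_commit(struct demo_task *t, atomic_long *nproc, long max,
			bool is_root_user)
{
	if (demo_inc_and_test(nproc, 1, max) && !is_root_user)
		t->flags |= DEMO_NPROC_EXCEEDED;
	else
		t->flags &= ~DEMO_NPROC_EXCEEDED;
}
```

Because the check and the increment are a single atomic step, there is no
window in which another task can slip in between them.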

Signed-off-by: Alexey Gladkov 
---
 kernel/cred.c | 22 +-
 kernel/sys.c  | 13 -
 2 files changed, 17 insertions(+), 18 deletions(-)

diff --git a/kernel/cred.c b/kernel/cred.c
index fdb40adc2ebd..334d2c9ae519 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -487,14 +487,26 @@ int commit_creds(struct cred *new)
 	if (!gid_eq(new->fsgid, old->fsgid))
 		key_fsgid_changed(new);
 
-	/* do it
-	 * RLIMIT_NPROC limits on user->processes have already been checked
-	 * in set_user().
-	 */
 	alter_cred_subscribers(new, 2);
 	if (new->user != old->user || new->user_ns != old->user_ns) {
+		bool overlimit;
+
 		set_cred_ucounts(new, new->user_ns, new->euid);
-		inc_rlimit_ucounts(new->ucounts, UCOUNT_RLIMIT_NPROC, 1);
+
+		overlimit = inc_rlimit_ucounts_and_test(new->ucounts, UCOUNT_RLIMIT_NPROC,
+				1, rlimit(RLIMIT_NPROC));
+
+		/*
+		 * We don't fail in case of NPROC limit excess here because too many
+		 * poorly written programs don't check set*uid() return code, assuming
+		 * it never fails if called by root.  We may still enforce NPROC limit
+		 * for programs doing set*uid()+execve() by harmlessly deferring the
+		 * failure to the execve() stage.
+		 */
+		if (overlimit && new->user != INIT_USER)
+			current->flags |= PF_NPROC_EXCEEDED;
+		else
+			current->flags &= ~PF_NPROC_EXCEEDED;
 	}
 	rcu_assign_pointer(task->real_cred, new);
 	rcu_assign_pointer(task->cred, new);
diff --git a/kernel/sys.c b/kernel/sys.c
index c2734ab9474e..180c4e06064f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -467,19 +467,6 @@ static int set_user(struct cred *new)
 	if (!new_user)
 		return -EAGAIN;
 
-	/*
-	 * We don't fail in case of NPROC limit excess here because too many
-	 * poorly written programs don't check set*uid() return code, assuming
-	 * it never fails if called by root.  We may still enforce NPROC limit
-	 * for programs doing set*uid()+execve() by harmlessly deferring the
-	 * failure to the execve() stage.
-	 */
-	if (is_ucounts_overlimit(new->ucounts, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) &&
-			new_user != INIT_USER)
-		current->flags |= PF_NPROC_EXCEEDED;
-	else
-		current->flags &= ~PF_NPROC_EXCEEDED;
-
 	free_uid(new->user);
 	new->user = new_user;
 	return 0;
-- 
2.29.2

[PATCH v4 7/7] kselftests: Add test to check for rlimit changes in different user namespaces

2021-01-22 Thread Alexey Gladkov

The test case runs a few instances of the program with RLIMIT_NPROC=1 as
user uid=6, in different user namespaces.

Signed-off-by: Alexey Gladkov 
---
 tools/testing/selftests/Makefile  |   1 +
 tools/testing/selftests/rlimits/.gitignore|   2 +
 tools/testing/selftests/rlimits/Makefile  |   6 +
 tools/testing/selftests/rlimits/config|   1 +
 .../selftests/rlimits/rlimits-per-userns.c| 161 ++
 5 files changed, 171 insertions(+)
 create mode 100644 tools/testing/selftests/rlimits/.gitignore
 create mode 100644 tools/testing/selftests/rlimits/Makefile
 create mode 100644 tools/testing/selftests/rlimits/config
 create mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index afbab4aeef3c..4dbeb5686f7b 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -46,6 +46,7 @@ TARGETS += proc
 TARGETS += pstore
 TARGETS += ptrace
 TARGETS += openat2
+TARGETS += rlimits
 TARGETS += rseq
 TARGETS += rtc
 TARGETS += seccomp
diff --git a/tools/testing/selftests/rlimits/.gitignore b/tools/testing/selftests/rlimits/.gitignore
new file mode 100644
index 000000000000..091021f255b3
--- /dev/null
+++ b/tools/testing/selftests/rlimits/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+rlimits-per-userns
diff --git a/tools/testing/selftests/rlimits/Makefile b/tools/testing/selftests/rlimits/Makefile
new file mode 100644
index 000000000000..03aadb406212
--- /dev/null
+++ b/tools/testing/selftests/rlimits/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+CFLAGS += -Wall -O2 -g
+TEST_GEN_PROGS := rlimits-per-userns
+
+include ../lib.mk
diff --git a/tools/testing/selftests/rlimits/config b/tools/testing/selftests/rlimits/config
new file mode 100644
index 000000000000..416bd53ce982
--- /dev/null
+++ b/tools/testing/selftests/rlimits/config
@@ -0,0 +1 @@
+CONFIG_USER_NS=y
diff --git a/tools/testing/selftests/rlimits/rlimits-per-userns.c b/tools/testing/selftests/rlimits/rlimits-per-userns.c
new file mode 100644
index 000000000000..26dc949e93ea
--- /dev/null
+++ b/tools/testing/selftests/rlimits/rlimits-per-userns.c
@@ -0,0 +1,161 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Alexey Gladkov 
+ */
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define NR_CHILDS 2
+
+static char *service_prog;
+static uid_t user   = 6;
+static uid_t group  = 6;
+
+static void setrlimit_nproc(rlim_t n)
+{
+   pid_t pid = getpid();
+   struct rlimit limit = {
+   .rlim_cur = n,
+   .rlim_max = n
+   };
+
+   warnx("(pid=%d): Setting RLIMIT_NPROC=%ld", pid, n);
+
+   if (setrlimit(RLIMIT_NPROC, &limit) < 0)
+   err(EXIT_FAILURE, "(pid=%d): setrlimit(RLIMIT_NPROC)", pid);
+}
+
+static pid_t fork_child(void)
+{
+   pid_t pid = fork();
+
+   if (pid < 0)
+   err(EXIT_FAILURE, "fork");
+
+   if (pid > 0)
+   return pid;
+
+   pid = getpid();
+
+   warnx("(pid=%d): New process starting ...", pid);
+
+   if (prctl(PR_SET_PDEATHSIG, SIGKILL) < 0)
+   err(EXIT_FAILURE, "(pid=%d): prctl(PR_SET_PDEATHSIG)", pid);
+
+   signal(SIGUSR1, SIG_DFL);
+
+   warnx("(pid=%d): Changing to uid=%d, gid=%d", pid, user, group);
+
+   if (setgid(group) < 0)
+   err(EXIT_FAILURE, "(pid=%d): setgid(%d)", pid, group);
+   if (setuid(user) < 0)
+   err(EXIT_FAILURE, "(pid=%d): setuid(%d)", pid, user);
+
+   warnx("(pid=%d): Service running ...", pid);
+
+   warnx("(pid=%d): Unshare user namespace", pid);
+   if (unshare(CLONE_NEWUSER) < 0)
+   err(EXIT_FAILURE, "unshare(CLONE_NEWUSER)");
+
+   char *const argv[] = { "service", NULL };
+   char *const envp[] = { "I_AM_SERVICE=1", NULL };
+
+   warnx("(pid=%d): Executing real service ...", pid);
+
+   execve(service_prog, argv, envp);
+   err(EXIT_FAILURE, "(pid=%d): execve", pid);
+}
+
+int main(int argc, char **argv)
+{
+   size_t i;
+   pid_t child[NR_CHILDS];
+   int wstatus[NR_CHILDS];
+   int childs = NR_CHILDS;
+   pid_t pid;
+
+   if (getenv("I_AM_SERVICE")) {
+   pause();
+   exit(EXIT_SUCCESS);
+   }
+
+   service_prog = argv[0];
+   pid = getpid();
+
+   warnx("(pid=%d) Starting testcase", pid);
+
+   /*
+* This rlimit is not a problem for root because it can be exceeded.
+*/
+   setrlimit_nproc(1);
+
+   for (i = 0; i < NR_CH

[PATCH v4 1/7] Add a reference to ucounts for each cred

2021-01-22 Thread Alexey Gladkov

For RLIMIT_NPROC and some other rlimits the user_struct that holds the
global limit is kept alive for the lifetime of a process by keeping it
in struct cred.  Add a ucounts reference to struct cred, so that
RLIMIT_NPROC can switch from using a per-user limit to using a per-user,
per-user-namespace limit.

Signed-off-by: Alexey Gladkov 
---
 include/linux/cred.h   |  1 +
 include/linux/user_namespace.h |  7 --
 kernel/cred.c  | 20 +--
 kernel/ucount.c| 46 ++
 kernel/user_namespace.c|  1 +
 5 files changed, 61 insertions(+), 14 deletions(-)

diff --git a/include/linux/cred.h b/include/linux/cred.h
index 18639c069263..307744fcc387 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -144,6 +144,7 @@ struct cred {
 #endif
struct user_struct *user;   /* real user ID subscription */
struct user_namespace *user_ns; /* user_ns the caps and keyrings are 
relative to. */
+   struct ucounts *ucounts;
struct group_info *group_info;  /* supplementary groups for euid/fsgid 
*/
/* RCU deletion */
union {
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 64cf8ebdc4ec..4cf93f9f93a6 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -85,7 +85,7 @@ struct user_namespace {
struct ctl_table_header *sysctls;
 #endif
struct ucounts  *ucounts;
-   int ucount_max[UCOUNT_COUNTS];
+   long ucount_max[UCOUNT_COUNTS];
 } __randomize_layout;
 
 struct ucounts {
@@ -93,7 +93,7 @@ struct ucounts {
struct user_namespace *ns;
kuid_t uid;
int count;
-   atomic_t ucount[UCOUNT_COUNTS];
+   atomic_long_t ucount[UCOUNT_COUNTS];
 };
 
 extern struct user_namespace init_user_ns;
@@ -102,6 +102,9 @@ bool setup_userns_sysctls(struct user_namespace *ns);
 void retire_userns_sysctls(struct user_namespace *ns);
 struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum 
ucount_type type);
 void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
+struct ucounts *get_ucounts(struct ucounts *ucounts);
+void put_ucounts(struct ucounts *ucounts);
+void set_cred_ucounts(struct cred *cred, struct user_namespace *ns, kuid_t 
uid);
 
 #ifdef CONFIG_USER_NS
 
diff --git a/kernel/cred.c b/kernel/cred.c
index 421b1149c651..9473e71e784c 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -119,6 +119,8 @@ static void put_cred_rcu(struct rcu_head *rcu)
if (cred->group_info)
put_group_info(cred->group_info);
free_uid(cred->user);
+   if (cred->ucounts)
+   put_ucounts(cred->ucounts);
put_user_ns(cred->user_ns);
kmem_cache_free(cred_jar, cred);
 }
@@ -144,6 +146,9 @@ void __put_cred(struct cred *cred)
BUG_ON(cred == current->cred);
BUG_ON(cred == current->real_cred);
 
+   if (cred->ucounts)
+   BUG_ON(cred->ucounts->ns != cred->user_ns);
+
if (cred->non_rcu)
put_cred_rcu(&cred->rcu);
else
@@ -270,6 +275,7 @@ struct cred *prepare_creds(void)
get_group_info(new->group_info);
get_uid(new->user);
get_user_ns(new->user_ns);
+   get_ucounts(new->ucounts);
 
 #ifdef CONFIG_KEYS
key_get(new->session_keyring);
@@ -363,6 +369,7 @@ int copy_creds(struct task_struct *p, unsigned long 
clone_flags)
ret = create_user_ns(new);
if (ret < 0)
goto error_put;
+   set_cred_ucounts(new, new->user_ns, new->euid);
}
 
 #ifdef CONFIG_KEYS
@@ -485,8 +492,11 @@ int commit_creds(struct cred *new)
 * in set_user().
 */
alter_cred_subscribers(new, 2);
-   if (new->user != old->user)
-   atomic_inc(&new->user->processes);
+   if (new->user != old->user || new->user_ns != old->user_ns) {
+   if (new->user != old->user)
+   atomic_inc(&new->user->processes);
+   set_cred_ucounts(new, new->user_ns, new->euid);
+   }
rcu_assign_pointer(task->real_cred, new);
rcu_assign_pointer(task->cred, new);
if (new->user != old->user)
@@ -661,6 +671,11 @@ void __init cred_init(void)
/* allocate a slab in which we can store credentials */
cred_jar = kmem_cache_create("cred_jar", sizeof(struct cred), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
+   /*
+* This is needed here because this is the first cred and there is no
+* ucount reference to copy.
+*/
+   set_cred_ucounts(&init_cred, &init_user_ns, GLOBAL_ROOT_UID);
 }
 
 /**
@@ -704,6 +719,7 @@ struct cred *prepare_kernel_cred(struct task_struct *daemon)
get_uid(new->user);
get_user_ns(new->user_ns);
g

[PATCH v4 0/7] Count rlimits in each user namespace

2021-01-22 Thread Alexey Gladkov

Preface
-------
These patches bind the rlimit counters to a user in a user namespace.
This patch set can be applied on top of:

git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git v5.11-rc2

Problem
-------
The implementation of the RLIMIT_NPROC, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING and
RLIMIT_MSGQUEUE rlimits places the counters in user_struct [1]. These limits
are global between processes and persist for the lifetime of the process, even
if processes are in different user namespaces.

To illustrate the impact of rlimits, let's say there is a program that does not
fork. Some service-A wants to run this program as user X in multiple containers.
Since the program never forks, the service wants to set RLIMIT_NPROC=1.

service-A
 \- program (uid=1000, container1, rlimit_nproc=1)
 \- program (uid=1000, container2, rlimit_nproc=1)

The service-A sets RLIMIT_NPROC=1 and runs the program in container1. When the
service-A tries to run a program with RLIMIT_NPROC=1 in container2 it fails
since user X already has one running process.

The problem is not that the limit from container1 affects container2. The
problem is that the limit is verified against the global counter that reflects
the number of processes in all containers.

This problem can be worked around by using different users for each container
but in this case we face a different problem of uid mapping when transferring
files from one container to another.

Eric W. Biederman mentioned this issue [2][3].

Introduced changes
------------------
To address the problem, we bind rlimit counters to a user namespace. Each counter
reflects the number of processes for a given uid in a given user namespace. The
result is a tree of rlimit counters with the biggest value at the root (aka
init_user_ns). The limit is considered exceeded if it's exceeded anywhere up
the tree.
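
A minimal userspace sketch of such a counter tree, under a simplified model:
struct rl_counter, rl_inc_and_test() and rl_dec() are hypothetical names used
only for illustration, the counters are plain longs with no locking or
atomics, and each level carries its own limit. A charge is applied at the leaf
and at every ancestor, and the limit counts as exceeded if any level on the
way up is over its maximum.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: one counter per (uid, user namespace) pair. */
struct rl_counter {
	struct rl_counter *parent;	/* same user in the parent namespace, NULL at the root */
	long count;			/* charges at this level and in all descendants */
	long max;			/* limit applied at this level */
};

/*
 * Charge @delta at @c and at every ancestor.  Returns true if the limit
 * is exceeded at any level between @c and the root.
 */
static bool rl_inc_and_test(struct rl_counter *c, long delta)
{
	bool overlimit = false;

	for (; c != NULL; c = c->parent) {
		c->count += delta;
		if (c->count > c->max)
			overlimit = true;
	}
	return overlimit;
}

/* Undo a charge of @delta at @c and at every ancestor. */
static void rl_dec(struct rl_counter *c, long delta)
{
	for (; c != NULL; c = c->parent)
		c->count -= delta;
}
```

With a root counter whose maximum is large and two children whose maximum is
1, one charge in each child succeeds independently, while the root still sees
the sum of both. This mirrors the container1/container2 example above.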

[1] https://lore.kernel.org/containers/87imd2incs@x220.int.ebiederm.org/
[2] 
https://lists.linuxfoundation.org/pipermail/containers/2020-August/042096.html
[3] 
https://lists.linuxfoundation.org/pipermail/containers/2020-October/042524.html

Changelog
---------
v4:
* Reverted the type change of ucounts.count to refcount_t.
* Fixed a typo in kernel/cred.c.

v3:
* Added a get_ucounts() function to increase the reference count. The existing
  get_ucounts() function was renamed to __get_ucounts().
* The type of ucounts.count changed from atomic_t to refcount_t.
* Dropped 'const' from set_cred_ucounts() arguments.
* Fixed a bug with freeing the cred structure after calling cred_alloc_blank().
* Commit messages have been updated.
* Added selftest.

v2:
* RLIMIT_MEMLOCK, RLIMIT_SIGPENDING and RLIMIT_MSGQUEUE are migrated to ucounts.
* Added ucounts for the (uid, user namespace) pair to cred.
* Added the ability to increase ucount by more than 1.

v1:
* After discussion with Eric W. Biederman, I increased the size of ucounts to
  atomic_long_t.
* Added ucount_max to avoid the fork bomb.

--

Alexey Gladkov (7):
  Add a reference to ucounts for each cred
  Move RLIMIT_NPROC counter to ucounts
  Move RLIMIT_MSGQUEUE counter to ucounts
  Move RLIMIT_SIGPENDING counter to ucounts
  Move RLIMIT_MEMLOCK counter to ucounts
  Move RLIMIT_NPROC check to the place where we increment the counter
  kselftests: Add test to check for rlimit changes in different user
namespaces

 fs/exec.c |   2 +-
 fs/hugetlbfs/inode.c  |  17 +-
 fs/io-wq.c|  22 ++-
 fs/io-wq.h|   2 +-
 fs/io_uring.c |   2 +-
 fs/proc/array.c   |   2 +-
 include/linux/cred.h  |   3 +
 include/linux/hugetlb.h   |   3 +-
 include/linux/mm.h|   4 +-
 include/linux/sched/user.h|   6 -
 include/linux/shmem_fs.h  |   2 +-
 include/linux/signal_types.h  |   4 +-
 include/linux/user_namespace.h|  23 ++-
 ipc/mqueue.c  |  29 ++--
 ipc/shm.c |  31 ++--
 kernel/cred.c |  46 -
 kernel/exit.c |   2 +-
 kernel/fork.c |  12 +-
 kernel/signal.c   |  53 +++---
 kernel/sys.c  |  13 --
 kernel/ucount.c   | 109 ++--
 kernel/user.c |   2 -
 kernel/user_namespace.c   |   7 +-
 mm/memfd.c|   4 +-
 mm/mlock.c|  35 ++--
 mm/mmap.c |   3 +-
 mm/shmem.c|   8 +-
 tools/testing/selftests/Makefile  |   1 +
 tools/testing/selftests/rlimits/.gitignore|   2 +
 tools/testing/selftests/rlimits

Re: [RFC PATCH v3 1/8] Use refcount_t for ucounts reference counting

2021-01-21 Thread Alexey Gladkov

On Thu, Jan 21, 2021 at 09:50:34AM -0600, Eric W. Biederman wrote:
> >> The current ucount code does check for overflow and fails the increment
> >> in every case.
> >> 
> >> So arguably it will be a regression and inferior error handling behavior
> >> if the code switches to the ``better'' refcount_t data structure.
> >> 
> >> I originally didn't use refcount_t because silently saturating and not
> >> bothering to handle the error makes me uncomfortable.
> >> 
> >> Not having to acquire the ucounts_lock every time seems nice.  Perhaps
> >> the path forward would be to start with stupid/correct code that always
> >> takes the ucounts_lock for every increment of ucounts->count, that is
> >> later replaced with something more optimal.
> >> 
> >> Not impacting performance in the non-namespace cases and having good
> >> performance in the other cases is a fundamental requirement of merging
> >> code like this.
> >
> > Did I understand your suggestion correctly that you suggest to use
> > spin_lock for atomic_read and atomic_inc ?
> >
> > If so, then we are already incrementing the counter under ucounts_lock.
> >
> > ...
> > 	if (atomic_read(&ucounts->count) == INT_MAX)
> > 		ucounts = NULL;
> > 	else
> > 		atomic_inc(&ucounts->count);
> > 	spin_unlock_irq(&ucounts_lock);
> > 	return ucounts;
> >
> > something like this ?
> 
> Yes.  But without atomics.  Something a bit more like:
> > ...
> > 	if (ucounts->count == INT_MAX)
> > 		ucounts = NULL;
> > 	else
> > 		ucounts->count++;
> > 	spin_unlock_irq(&ucounts_lock);
> > 	return ucounts;

This is the original code.
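
For reference, the same check-then-increment pattern can be sketched in
userspace, assuming a pthread mutex in place of the spinlock. The names
demo_ucounts, demo_lock and demo_get() are hypothetical and only model the
quoted snippet; this is not the kernel code.

```c
#include <limits.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical model of the reference count guarded by a single lock. */
struct demo_ucounts {
	int count;	/* protected by demo_lock */
};

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

/* Take a reference; return NULL instead of letting the count overflow. */
static struct demo_ucounts *demo_get(struct demo_ucounts *u)
{
	pthread_mutex_lock(&demo_lock);
	if (u->count == INT_MAX)
		u = NULL;
	else
		u->count++;
	pthread_mutex_unlock(&demo_lock);
	return u;
}
```

The caller sees a NULL return on saturation and can fail the operation, which
is the error handling that a silently saturating counter would lose.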

> I do believe at some point we will want to say using the spin_lock for
> ucounts->count is cumbersome, and suboptimal and we want to change it to
> get a better performing implementation.
> 
> Just for getting the semantics correct we should be able to use just
> ucounts_lock for locking.  Then when everything is working we can
> profile and optimize the code.
> 
> I just don't want figuring out what is needed to get hung up over little
> details that we can change later.

OK. So I will drop this change of mine for now.

-- 
Rgrds, legion

Re: [RFC PATCH v3 1/8] Use refcount_t for ucounts reference counting

2021-01-21 Thread Alexey Gladkov

On Tue, Jan 19, 2021 at 07:57:36PM -0600, Eric W. Biederman wrote:
> Alexey Gladkov  writes:
> 
> > On Mon, Jan 18, 2021 at 12:34:29PM -0800, Linus Torvalds wrote:
> >> On Mon, Jan 18, 2021 at 11:46 AM Alexey Gladkov
> >>  wrote:
> >> >
> >> > Sorry about that. I thought that this code is not needed when switching
> >> > from int to refcount_t. I was wrong.
> >> 
> >> Well, you _may_ be right. I personally didn't check how the return
> >> value is used.
> >> 
> >> I only reacted to "it certainly _may_ be used, and there is absolutely
> >> no comment anywhere about why it wouldn't matter".
> >
> > I have not found examples where checked the overflow after calling
> > refcount_inc/refcount_add.
> >
> > For example in kernel/fork.c:2298 :
> >
> >    current->signal->nr_threads++;
> >    atomic_inc(&current->signal->live);
> >    refcount_inc(&current->signal->sigcnt);
> >
> > $ semind search signal_struct.sigcnt
> > def include/linux/sched/signal.h:83 refcount_t sigcnt;
> > m-- kernel/fork.c:723 put_signal_struct if (refcount_dec_and_test(&sig->sigcnt))
> > m-- kernel/fork.c:1571 copy_signal refcount_set(&sig->sigcnt, 1);
> > m-- kernel/fork.c:2298 copy_process refcount_inc(&current->signal->sigcnt);
> >
> > It seems to me that the only way is to use __refcount_inc and then compare
> > the old value with REFCOUNT_MAX
> >
> > Since I have not seen examples of such checks, I thought that this is
> > acceptable. Sorry once again. I have not tried to hide these changes.
> 
> The current ucount code does check for overflow and fails the increment
> in every case.
> 
> So arguably it will be a regression and inferior error handling behavior
> if the code switches to the ``better'' refcount_t data structure.
> 
> I originally didn't use refcount_t because silently saturating and not
> bothering to handle the error makes me uncomfortable.
> 
> Not having to acquire the ucounts_lock every time seems nice.  Perhaps
> the path forward would be to start with stupid/correct code that always
> takes the ucounts_lock for every increment of ucounts->count, that is
> later replaced with something more optimal.
> 
> Not impacting performance in the non-namespace cases and having good
> performance in the other cases is a fundamental requirement of merging
> code like this.

Do I understand correctly that you are suggesting taking the spin_lock
around atomic_read and atomic_inc?

If so, then we are already incrementing the counter under ucounts_lock.

	...
	if (atomic_read(&ucounts->count) == INT_MAX)
		ucounts = NULL;
	else
		atomic_inc(&ucounts->count);
	spin_unlock_irq(&ucounts_lock);
	return ucounts;

Something like this?

-- 
Rgrds, legion

Re: [RFC PATCH v3 1/8] Use refcount_t for ucounts reference counting

2021-01-18 Thread Alexey Gladkov

On Mon, Jan 18, 2021 at 12:34:29PM -0800, Linus Torvalds wrote:
> On Mon, Jan 18, 2021 at 11:46 AM Alexey Gladkov
>  wrote:
> >
> > Sorry about that. I thought that this code is not needed when switching
> > from int to refcount_t. I was wrong.
> 
> Well, you _may_ be right. I personally didn't check how the return
> value is used.
> 
> I only reacted to "it certainly _may_ be used, and there is absolutely
> no comment anywhere about why it wouldn't matter".

I have not found any examples where overflow is checked after calling
refcount_inc/refcount_add.

For example, in kernel/fork.c:2298:

   current->signal->nr_threads++;
   atomic_inc(&current->signal->live);
   refcount_inc(&current->signal->sigcnt);

$ semind search signal_struct.sigcnt
def include/linux/sched/signal.h:83 refcount_t sigcnt;
m-- kernel/fork.c:723 put_signal_struct if (refcount_dec_and_test(&sig->sigcnt))
m-- kernel/fork.c:1571 copy_signal refcount_set(&sig->sigcnt, 1);
m-- kernel/fork.c:2298 copy_process refcount_inc(&current->signal->sigcnt);

It seems to me that the only way is to use __refcount_inc and then compare
the old value with REFCOUNT_MAX.
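
To make the idea of an increment that reports failure concrete, here is a
userspace sketch using C11 atomics. It uses a compare-and-swap loop that
refuses to go past a ceiling, which is a different technique from
incrementing first and comparing the old value afterwards, and it is not how
refcount_t behaves. DEMO_REF_MAX and demo_ref_inc_checked() are hypothetical
names.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical ceiling, standing in for REFCOUNT_MAX. */
#define DEMO_REF_MAX INT_MAX

/*
 * Increment @ref unless it already sits at DEMO_REF_MAX.
 * Returns false when the increment was refused, so the caller can
 * propagate an error instead of silently saturating.
 */
static bool demo_ref_inc_checked(atomic_int *ref)
{
	int old = atomic_load(ref);

	do {
		if (old == DEMO_REF_MAX)
			return false;
		/* On failure the CAS reloads @old with the current value. */
	} while (!atomic_compare_exchange_weak(ref, &old, old + 1));

	return true;
}
```

A caller would treat a false return the same way the existing code treats a
NULL return from the locked increment.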

Since I have not seen examples of such checks, I thought that this is
acceptable. Sorry once again. I have not tried to hide these changes.

-- 
Rgrds, legion

Re: [RFC PATCH v3 1/8] Use refcount_t for ucounts reference counting

2021-01-18 Thread Alexey Gladkov

On Mon, Jan 18, 2021 at 11:14:48AM -0800, Linus Torvalds wrote:
> On Fri, Jan 15, 2021 at 6:59 AM Alexey Gladkov  
> wrote:
> >
> > @@ -152,10 +153,7 @@ static struct ucounts *get_ucounts(struct 
> > user_namespace *ns, kuid_t uid)
> > ucounts = new;
> > }
> > }
> > -   if (ucounts->count == INT_MAX)
> > -   ucounts = NULL;
> > -   else
> > -   ucounts->count += 1;
> > +   refcount_inc(&ucounts->count);
> > spin_unlock_irq(&ucounts_lock);
> > return ucounts;
> >  }
> 
> This is wrong.
> 
> It used to return NULL when the count saturated.
> 
> Now it just silently saturates.
> 
> I'm not sure how many people care, but that NULL return ends up being
> returned quite widely (through "inc_uncount()" and friends).
> 
> The fact that this has no commit message at all to explain what it is
> doing and why is also a grounds for just NAK.

Sorry about that. I thought that this code is not needed when switching
from int to refcount_t. I was wrong. I'll think about how best to check
it.

-- 
Rgrds, legion

[PATCH v4 2/8] Add a reference to ucounts for each cred

2021-01-18 Thread Alexey Gladkov

For RLIMIT_NPROC and some other rlimits the user_struct that holds the
global limit is kept alive for the lifetime of a process by keeping it
in struct cred.  Add a ucounts reference to struct cred, so that
RLIMIT_NPROC can switch from using a per-user limit to using a per-user,
per-user-namespace limit.

Changelog
---------
v4:
* Fixed a typo in kernel/cred.c.

Signed-off-by: Alexey Gladkov 
---
 include/linux/cred.h   |  1 +
 include/linux/user_namespace.h | 13 +++--
 kernel/cred.c  | 20 ++--
 kernel/ucount.c| 30 --
 kernel/user_namespace.c|  1 +
 5 files changed, 51 insertions(+), 14 deletions(-)

diff --git a/include/linux/cred.h b/include/linux/cred.h
index 18639c069263..307744fcc387 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -144,6 +144,7 @@ struct cred {
 #endif
struct user_struct *user;   /* real user ID subscription */
struct user_namespace *user_ns; /* user_ns the caps and keyrings are 
relative to. */
+   struct ucounts *ucounts;
struct group_info *group_info;  /* supplementary groups for euid/fsgid 
*/
/* RCU deletion */
union {
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index f84fc2d9ce20..9a3ba69e9223 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -85,7 +85,7 @@ struct user_namespace {
struct ctl_table_header *sysctls;
 #endif
struct ucounts  *ucounts;
-   int ucount_max[UCOUNT_COUNTS];
+   long ucount_max[UCOUNT_COUNTS];
 } __randomize_layout;
 
 struct ucounts {
@@ -93,7 +93,7 @@ struct ucounts {
struct user_namespace *ns;
kuid_t uid;
refcount_t count;
-   atomic_t ucount[UCOUNT_COUNTS];
+   atomic_long_t ucount[UCOUNT_COUNTS];
 };
 
 extern struct user_namespace init_user_ns;
@@ -102,6 +102,15 @@ bool setup_userns_sysctls(struct user_namespace *ns);
 void retire_userns_sysctls(struct user_namespace *ns);
 struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum 
ucount_type type);
 void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
+void put_ucounts(struct ucounts *ucounts);
+void set_cred_ucounts(struct cred *cred, struct user_namespace *ns, kuid_t 
uid);
+
+static inline struct ucounts *get_ucounts(struct ucounts *ucounts)
+{
+   if (ucounts)
+   refcount_inc(&ucounts->count);
+   return ucounts;
+}
 
 #ifdef CONFIG_USER_NS
 
diff --git a/kernel/cred.c b/kernel/cred.c
index 421b1149c651..9473e71e784c 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -119,6 +119,8 @@ static void put_cred_rcu(struct rcu_head *rcu)
if (cred->group_info)
put_group_info(cred->group_info);
free_uid(cred->user);
+   if (cred->ucounts)
+   put_ucounts(cred->ucounts);
put_user_ns(cred->user_ns);
kmem_cache_free(cred_jar, cred);
 }
@@ -144,6 +146,9 @@ void __put_cred(struct cred *cred)
BUG_ON(cred == current->cred);
BUG_ON(cred == current->real_cred);
 
+   if (cred->ucounts)
+   BUG_ON(cred->ucounts->ns != cred->user_ns);
+
if (cred->non_rcu)
put_cred_rcu(&cred->rcu);
else
@@ -270,6 +275,7 @@ struct cred *prepare_creds(void)
get_group_info(new->group_info);
get_uid(new->user);
get_user_ns(new->user_ns);
+   get_ucounts(new->ucounts);
 
 #ifdef CONFIG_KEYS
key_get(new->session_keyring);
@@ -363,6 +369,7 @@ int copy_creds(struct task_struct *p, unsigned long 
clone_flags)
ret = create_user_ns(new);
if (ret < 0)
goto error_put;
+   set_cred_ucounts(new, new->user_ns, new->euid);
}
 
 #ifdef CONFIG_KEYS
@@ -485,8 +492,11 @@ int commit_creds(struct cred *new)
 * in set_user().
 */
alter_cred_subscribers(new, 2);
-   if (new->user != old->user)
-   atomic_inc(&new->user->processes);
+   if (new->user != old->user || new->user_ns != old->user_ns) {
+   if (new->user != old->user)
+   atomic_inc(&new->user->processes);
+   set_cred_ucounts(new, new->user_ns, new->euid);
+   }
rcu_assign_pointer(task->real_cred, new);
rcu_assign_pointer(task->cred, new);
if (new->user != old->user)
@@ -661,6 +671,11 @@ void __init cred_init(void)
/* allocate a slab in which we can store credentials */
cred_jar = kmem_cache_create("cred_jar", sizeof(struct cred), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
+   /*
+* This is needed here because this is the first cred and there is no
+* ucount reference to copy.
+*/
+   set_cred_ucounts(&init_cred, &init_user_ns, G

[RFC PATCH v3 8/8] kselftests: Add test to check for rlimit changes in different user namespaces

2021-01-15 Thread Alexey Gladkov

The test case runs a few instances of the program with RLIMIT_NPROC=1 as
user uid=6, in different user namespaces.

Signed-off-by: Alexey Gladkov 
---
 tools/testing/selftests/Makefile  |   1 +
 tools/testing/selftests/rlimits/.gitignore|   2 +
 tools/testing/selftests/rlimits/Makefile  |   6 +
 tools/testing/selftests/rlimits/config|   1 +
 .../selftests/rlimits/rlimits-per-userns.c| 161 ++
 5 files changed, 171 insertions(+)
 create mode 100644 tools/testing/selftests/rlimits/.gitignore
 create mode 100644 tools/testing/selftests/rlimits/Makefile
 create mode 100644 tools/testing/selftests/rlimits/config
 create mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index afbab4aeef3c..4dbeb5686f7b 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -46,6 +46,7 @@ TARGETS += proc
 TARGETS += pstore
 TARGETS += ptrace
 TARGETS += openat2
+TARGETS += rlimits
 TARGETS += rseq
 TARGETS += rtc
 TARGETS += seccomp
diff --git a/tools/testing/selftests/rlimits/.gitignore b/tools/testing/selftests/rlimits/.gitignore
new file mode 100644
index 000000000000..091021f255b3
--- /dev/null
+++ b/tools/testing/selftests/rlimits/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+rlimits-per-userns
diff --git a/tools/testing/selftests/rlimits/Makefile b/tools/testing/selftests/rlimits/Makefile
new file mode 100644
index 000000000000..03aadb406212
--- /dev/null
+++ b/tools/testing/selftests/rlimits/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+CFLAGS += -Wall -O2 -g
+TEST_GEN_PROGS := rlimits-per-userns
+
+include ../lib.mk
diff --git a/tools/testing/selftests/rlimits/config b/tools/testing/selftests/rlimits/config
new file mode 100644
index 000000000000..416bd53ce982
--- /dev/null
+++ b/tools/testing/selftests/rlimits/config
@@ -0,0 +1 @@
+CONFIG_USER_NS=y
diff --git a/tools/testing/selftests/rlimits/rlimits-per-userns.c b/tools/testing/selftests/rlimits/rlimits-per-userns.c
new file mode 100644
index 000000000000..26dc949e93ea
--- /dev/null
+++ b/tools/testing/selftests/rlimits/rlimits-per-userns.c
@@ -0,0 +1,161 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Alexey Gladkov 
+ */
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define NR_CHILDS 2
+
+static char *service_prog;
+static uid_t user   = 6;
+static uid_t group  = 6;
+
+static void setrlimit_nproc(rlim_t n)
+{
+   pid_t pid = getpid();
+   struct rlimit limit = {
+   .rlim_cur = n,
+   .rlim_max = n
+   };
+
+   warnx("(pid=%d): Setting RLIMIT_NPROC=%ld", pid, n);
+
+   if (setrlimit(RLIMIT_NPROC, &limit) < 0)
+   err(EXIT_FAILURE, "(pid=%d): setrlimit(RLIMIT_NPROC)", pid);
+}
+
+static pid_t fork_child(void)
+{
+   pid_t pid = fork();
+
+   if (pid < 0)
+   err(EXIT_FAILURE, "fork");
+
+   if (pid > 0)
+   return pid;
+
+   pid = getpid();
+
+   warnx("(pid=%d): New process starting ...", pid);
+
+   if (prctl(PR_SET_PDEATHSIG, SIGKILL) < 0)
+   err(EXIT_FAILURE, "(pid=%d): prctl(PR_SET_PDEATHSIG)", pid);
+
+   signal(SIGUSR1, SIG_DFL);
+
+   warnx("(pid=%d): Changing to uid=%d, gid=%d", pid, user, group);
+
+   if (setgid(group) < 0)
+   err(EXIT_FAILURE, "(pid=%d): setgid(%d)", pid, group);
+   if (setuid(user) < 0)
+   err(EXIT_FAILURE, "(pid=%d): setuid(%d)", pid, user);
+
+   warnx("(pid=%d): Service running ...", pid);
+
+   warnx("(pid=%d): Unshare user namespace", pid);
+   if (unshare(CLONE_NEWUSER) < 0)
+   err(EXIT_FAILURE, "unshare(CLONE_NEWUSER)");
+
+   char *const argv[] = { "service", NULL };
+   char *const envp[] = { "I_AM_SERVICE=1", NULL };
+
+   warnx("(pid=%d): Executing real service ...", pid);
+
+   execve(service_prog, argv, envp);
+   err(EXIT_FAILURE, "(pid=%d): execve", pid);
+}
+
+int main(int argc, char **argv)
+{
+   size_t i;
+   pid_t child[NR_CHILDS];
+   int wstatus[NR_CHILDS];
+   int childs = NR_CHILDS;
+   pid_t pid;
+
+   if (getenv("I_AM_SERVICE")) {
+   pause();
+   exit(EXIT_SUCCESS);
+   }
+
+   service_prog = argv[0];
+   pid = getpid();
+
+   warnx("(pid=%d) Starting testcase", pid);
+
+   /*
+* This rlimit is not a problem for root because it can be exceeded.
+*/
+   setrlimit_nproc(1);
+
+   for (i = 0; i < NR_CH

[RFC PATCH v3 7/8] Move RLIMIT_NPROC check to the place where we increment the counter

2021-01-15 Thread Alexey Gladkov

After calling set_user(), we always have to call commit_creds() to apply
the new credentials to the current task, so there is no need to separate
the limit check from the counter increment.
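
The deferral that this patch moves into commit_creds() can be sketched in plain userspace C. This is only a model under stated assumptions: struct task_model and the model_* functions are hypothetical stand-ins, and the fields merely mimic PF_NPROC_EXCEEDED, the per-user process count and rlimit(RLIMIT_NPROC). The credential switch never fails; it only records that the limit was exceeded, and the exec step fails later if the flag is set and the count is still above the limit.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a task; none of these names are kernel API. */
struct task_model {
	bool nproc_exceeded;	/* stands in for PF_NPROC_EXCEEDED */
	long nproc;		/* processes charged to the new user */
	long nproc_limit;	/* stands in for rlimit(RLIMIT_NPROC) */
	bool is_init_user;	/* stands in for new->user == INIT_USER */
};

/*
 * Credential switch: charge the counter and test the limit in one step.
 * It never fails; it only remembers that the limit was exceeded.
 */
static int model_commit_creds(struct task_model *t)
{
	bool overlimit;

	t->nproc += 1;
	overlimit = t->nproc > t->nproc_limit;
	t->nproc_exceeded = overlimit && !t->is_init_user;
	return 0;
}

/*
 * Exec step: fail only if the flag was set earlier and the count is
 * still above the limit; otherwise clear the flag and proceed.
 */
static int model_execve(struct task_model *t)
{
	if (t->nproc_exceeded && t->nproc > t->nproc_limit)
		return -EAGAIN;
	t->nproc_exceeded = false;
	return 0;
}
```

In this model a caller that ignores the return value of the credential switch is unaffected, while a set*uid()+execve() sequence still hits the limit at exec time.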

Signed-off-by: Alexey Gladkov 
---
 kernel/cred.c | 22 +-
 kernel/sys.c  | 13 -
 2 files changed, 17 insertions(+), 18 deletions(-)

diff --git a/kernel/cred.c b/kernel/cred.c
index c43e30407d22..991c43559ee8 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -487,14 +487,26 @@ int commit_creds(struct cred *new)
if (!gid_eq(new->fsgid, old->fsgid))
key_fsgid_changed(new);
 
-   /* do it
-* RLIMIT_NPROC limits on user->processes have already been checked
-* in set_user().
-*/
alter_cred_subscribers(new, 2);
if (new->user != old->user || new->user_ns != old->user_ns) {
+   bool overlimit;
+
set_cred_ucounts(new, new->user_ns, new->euid);
-   inc_rlimit_ucounts(new->ucounts, UCOUNT_RLIMIT_NPROC, 1);
+
+   overlimit = inc_rlimit_ucounts_and_test(new->ucounts, UCOUNT_RLIMIT_NPROC,
+   1, rlimit(RLIMIT_NPROC));
+
+   /*
+* We don't fail in case of NPROC limit excess here because too many
+* poorly written programs don't check set*uid() return code, assuming
+* it never fails if called by root.  We may still enforce NPROC limit
+* for programs doing set*uid()+execve() by harmlessly deferring the
+* failure to the execve() stage.
+*/
+   if (overlimit && new->user != INIT_USER)
+   current->flags |= PF_NPROC_EXCEEDED;
+   else
+   current->flags &= ~PF_NPROC_EXCEEDED;
}
rcu_assign_pointer(task->real_cred, new);
rcu_assign_pointer(task->cred, new);
diff --git a/kernel/sys.c b/kernel/sys.c
index c2734ab9474e..180c4e06064f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -467,19 +467,6 @@ static int set_user(struct cred *new)
if (!new_user)
return -EAGAIN;
 
-   /*
-* We don't fail in case of NPROC limit excess here because too many
-* poorly written programs don't check set*uid() return code, assuming
-* it never fails if called by root.  We may still enforce NPROC limit
-* for programs doing set*uid()+execve() by harmlessly deferring the
-* failure to the execve() stage.
-*/
-   if (is_ucounts_overlimit(new->ucounts, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) &&
-   new_user != INIT_USER)
-   current->flags |= PF_NPROC_EXCEEDED;
-   else
-   current->flags &= ~PF_NPROC_EXCEEDED;
-
free_uid(new->user);
new->user = new_user;
return 0;
-- 
2.29.2

[RFC PATCH v3 5/8] Move RLIMIT_SIGPENDING counter to ucounts

2021-01-15 Thread Alexey Gladkov

Signed-off-by: Alexey Gladkov 
---
 fs/proc/array.c|  2 +-
 include/linux/sched/user.h |  1 -
 include/linux/signal_types.h   |  4 ++-
 include/linux/user_namespace.h |  1 +
 kernel/fork.c  |  1 +
 kernel/signal.c| 53 ++
 kernel/ucount.c|  1 +
 kernel/user.c  |  1 -
 kernel/user_namespace.c|  1 +
 9 files changed, 30 insertions(+), 35 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index bb87e4d89cd8..74b0ea4b7e38 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -284,7 +284,7 @@ static inline void task_sig(struct seq_file *m, struct task_struct *p)
collect_sigign_sigcatch(p, &ignored, &caught);
num_threads = get_nr_threads(p);
rcu_read_lock();  /* FIXME: is this correct? */
-   qsize = atomic_read(&__task_cred(p)->user->sigpending);
+   qsize = get_ucounts_value(task_ucounts(p), UCOUNT_RLIMIT_SIGPENDING);
rcu_read_unlock();
qlim = task_rlimit(p, RLIMIT_SIGPENDING);
unlock_task_sighand(p, &flags);
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 8a34446681aa..8ba9cec4fb99 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -12,7 +12,6 @@
  */
 struct user_struct {
refcount_t __count; /* reference count */
-   atomic_t sigpending;/* How many pending signals does this user have? */
 #ifdef CONFIG_FANOTIFY
atomic_t fanotify_listeners;
 #endif
diff --git a/include/linux/signal_types.h b/include/linux/signal_types.h
index 68e06c75c5b2..34cb28b8f16c 100644
--- a/include/linux/signal_types.h
+++ b/include/linux/signal_types.h
@@ -13,6 +13,8 @@ typedef struct kernel_siginfo {
__SIGINFO;
 } kernel_siginfo_t;
 
+struct ucounts;
+
 /*
  * Real Time signals may be queued.
  */
@@ -21,7 +23,7 @@ struct sigqueue {
struct list_head list;
int flags;
kernel_siginfo_t info;
-   struct user_struct *user;
+   struct ucounts *ucounts;
 };
 
 /* flags values. */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index ff96a906d7da..852b7bc40318 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -52,6 +52,7 @@ enum ucount_type {
 #endif
UCOUNT_RLIMIT_NPROC,
UCOUNT_RLIMIT_MSGQUEUE,
+   UCOUNT_RLIMIT_SIGPENDING,
UCOUNT_COUNTS,
 };
 
diff --git a/kernel/fork.c b/kernel/fork.c
index f61a5a3dc02f..a7be5790392e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -825,6 +825,7 @@ void __init fork_init(void)
 
init_user_ns.ucount_max[UCOUNT_RLIMIT_NPROC] = task_rlimit(&init_task, RLIMIT_NPROC);
init_user_ns.ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = task_rlimit(&init_task, RLIMIT_MSGQUEUE);
+   init_user_ns.ucount_max[UCOUNT_RLIMIT_SIGPENDING] = task_rlimit(&init_task, RLIMIT_SIGPENDING);
 
 #ifdef CONFIG_VMAP_STACK
cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:vm_stack_cache",
diff --git a/kernel/signal.c b/kernel/signal.c
index 5736c55aaa1a..b01c2007a282 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -412,49 +412,40 @@ void task_join_group_stop(struct task_struct *task)
 static struct sigqueue *
 __sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimit)
 {
-   struct sigqueue *q = NULL;
-   struct user_struct *user;
-   int sigpending;
+   struct sigqueue *q = kmem_cache_alloc(sigqueue_cachep, flags);
 
-   /*
-* Protect access to @t credentials. This can go away when all
-* callers hold rcu read lock.
-*
-* NOTE! A pending signal will hold on to the user refcount,
-* and we get/put the refcount only when the sigpending count
-* changes from/to zero.
-*/
-   rcu_read_lock();
-   user = __task_cred(t)->user;
-   sigpending = atomic_inc_return(&user->sigpending);
-   if (sigpending == 1)
-   get_uid(user);
-   rcu_read_unlock();
+   if (likely(q != NULL)) {
+   bool overlimit;
 
-   if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
-   q = kmem_cache_alloc(sigqueue_cachep, flags);
-   } else {
-   print_dropped_signal(sig);
-   }
-
-   if (unlikely(q == NULL)) {
-   if (atomic_dec_and_test(&user->sigpending))
-   free_uid(user);
-   } else {
INIT_LIST_HEAD(&q->list);
q->flags = 0;
-   q->user = user;
+
+   /*
+* Protect access to @t credentials. This can go away when all
+* callers hold rcu read lock.
+*/
+   rcu_read_lock();
+   q->ucounts = get_ucounts(task_ucounts(t));
+   overlimit = inc_rlimit_ucounts_and_test(q->ucounts, 
UCOUNT_RLIMIT_SIGPENDING,
+

[RFC PATCH v3 6/8] Move RLIMIT_MEMLOCK counter to ucounts

2021-01-15 Thread Alexey Gladkov

Signed-off-by: Alexey Gladkov 
---
 fs/hugetlbfs/inode.c   | 17 -
 include/linux/hugetlb.h|  3 +--
 include/linux/mm.h |  4 ++--
 include/linux/shmem_fs.h   |  2 +-
 include/linux/user_namespace.h |  1 +
 ipc/shm.c  | 31 --
 kernel/fork.c  |  1 +
 kernel/ucount.c|  1 +
 kernel/user_namespace.c|  1 +
 mm/memfd.c |  4 +---
 mm/mlock.c | 35 +-
 mm/mmap.c  |  3 +--
 mm/shmem.c |  8 
 13 files changed, 52 insertions(+), 59 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index b5c109703daa..82298412f020 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1451,34 +1451,35 @@ static int get_hstate_idx(int page_size_log)
  * otherwise hugetlb_reserve_pages reserves one less hugepages than intended.
  */
 struct file *hugetlb_file_setup(const char *name, size_t size,
-   vm_flags_t acctflag, struct user_struct **user,
+   vm_flags_t acctflag,
int creat_flags, int page_size_log)
 {
struct inode *inode;
struct vfsmount *mnt;
int hstate_idx;
struct file *file;
+   const struct cred *cred;
 
hstate_idx = get_hstate_idx(page_size_log);
if (hstate_idx < 0)
return ERR_PTR(-ENODEV);
 
-   *user = NULL;
mnt = hugetlbfs_vfsmount[hstate_idx];
if (!mnt)
return ERR_PTR(-ENOENT);
 
if (creat_flags == HUGETLB_SHMFS_INODE && !can_do_hugetlb_shm()) {
-   *user = current_user();
-   if (user_shm_lock(size, *user)) {
+   cred = current_cred();
+   if (user_shm_lock(size, cred)) {
task_lock(current);
pr_warn_once("%s (%d): Using mlock ulimits for SHM_HUGETLB is deprecated\n",
current->comm, current->pid);
task_unlock(current);
} else {
-   *user = NULL;
return ERR_PTR(-EPERM);
}
+   } else {
+   cred = NULL;
}
 
file = ERR_PTR(-ENOSPC);
@@ -1503,10 +1504,8 @@ struct file *hugetlb_file_setup(const char *name, size_t size,
 
iput(inode);
 out:
-   if (*user) {
-   user_shm_unlock(size, *user);
-   *user = NULL;
-   }
+   if (cred)
+   user_shm_unlock(size, cred);
return file;
 }
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ebca2ef02212..fbd36c452648 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -434,8 +434,7 @@ static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode)
 extern const struct file_operations hugetlbfs_file_operations;
 extern const struct vm_operations_struct hugetlb_vm_ops;
 struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
-   struct user_struct **user, int creat_flags,
-   int page_size_log);
+   int creat_flags, int page_size_log);
 
 static inline bool is_file_hugepages(struct file *file)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..30a37aef1ab9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1628,8 +1628,8 @@ extern bool can_do_mlock(void);
 #else
 static inline bool can_do_mlock(void) { return false; }
 #endif
-extern int user_shm_lock(size_t, struct user_struct *);
-extern void user_shm_unlock(size_t, struct user_struct *);
+extern int user_shm_lock(size_t, const struct cred *);
+extern void user_shm_unlock(size_t, const struct cred *);
 
 /*
  * Parameter block passed down to zap_pte_range in exceptional cases.
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index d82b6f396588..10f50b1c4e0e 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -65,7 +65,7 @@ extern struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt,
 extern int shmem_zero_setup(struct vm_area_struct *);
 extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
unsigned long len, unsigned long pgoff, unsigned long flags);
-extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
+extern int shmem_lock(struct file *file, int lock, const struct cred *cred);
 #ifdef CONFIG_SHMEM
 extern const struct address_space_operations shmem_aops;
 static inline bool shmem_mapping(struct address_space *mapping)
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 852b7bc40318..701903a8beeb 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -53,

[RFC PATCH v3 3/8] Move RLIMIT_NPROC counter to ucounts

2021-01-15 Thread Alexey Gladkov

RLIMIT_NPROC is implemented on top of ucounts. The process counter is
tied to the user in the user namespace. Therefore, there is no longer
one single counter for the user. Instead, there is now one counter for
each user namespace. Thus, getting the RLIMIT_NPROC counter value to
check the rlimit becomes meaningless.

We cannot use existing inc_ucounts / dec_ucounts because they do not
allow us to exceed the maximum for the counter. Some rlimits can be
exceeded if the user has the appropriate capability.
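
To make the distinction concrete, here is a minimal userspace sketch, assuming a single-level counter built on C11 atomics. The names struct counter, counter_inc_strict() and counter_inc_rlimit() are hypothetical and only illustrate the two styles; they are not the kernel helpers. The strict style refuses and rolls back an increment that would exceed the maximum, while the rlimit style always accounts the increment and leaves the comparison (and any capability-based override) to the caller.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical single-level counter, for illustration only. */
struct counter {
	atomic_long value;
	long max;
};

/* Strict style: refuse and roll back when the increment exceeds max. */
static bool counter_inc_strict(struct counter *c)
{
	long now = atomic_fetch_add(&c->value, 1) + 1;

	if (now > c->max) {
		atomic_fetch_sub(&c->value, 1);
		return false;
	}
	return true;
}

/*
 * rlimit style: always account the increment and return the new value.
 * The caller compares it with its own limit and may still proceed,
 * for example when it holds a capability that overrides the limit.
 */
static long counter_inc_rlimit(struct counter *c, long v)
{
	return atomic_fetch_add(&c->value, v) + v;
}
```

With the second style the counter can legitimately hold a value above max, which is what a privileged caller exceeding its rlimit requires.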

Signed-off-by: Alexey Gladkov 
---
 fs/exec.c  |  2 +-
 fs/io-wq.c | 22 ++---
 fs/io-wq.h |  2 +-
 fs/io_uring.c  |  2 +-
 include/linux/cred.h   |  2 ++
 include/linux/sched/user.h |  1 -
 include/linux/user_namespace.h | 13 
 kernel/cred.c  | 10 +++---
 kernel/exit.c  |  2 +-
 kernel/fork.c  |  9 ++---
 kernel/sys.c   |  2 +-
 kernel/ucount.c| 60 ++
 kernel/user.c  |  1 -
 kernel/user_namespace.c|  3 +-
 14 files changed, 102 insertions(+), 29 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 5d4d52039105..f62fd2632104 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1870,7 +1870,7 @@ static int do_execveat_common(int fd, struct filename *filename,
 * whether NPROC limit is still exceeded.
 */
if ((current->flags & PF_NPROC_EXCEEDED) &&
-   atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
+   is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
retval = -EAGAIN;
goto out_ret;
}
diff --git a/fs/io-wq.c b/fs/io-wq.c
index a564f36e260c..5b6940c90c61 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "../kernel/sched/sched.h"
 #include "io-wq.h"
@@ -120,7 +121,7 @@ struct io_wq {
io_wq_work_fn *do_work;
 
struct task_struct *manager;
-   struct user_struct *user;
+   const struct cred *cred;
refcount_t refs;
struct completion done;
 
@@ -234,7 +235,7 @@ static void io_worker_exit(struct io_worker *worker)
if (worker->flags & IO_WORKER_F_RUNNING)
atomic_dec(>nr_running);
if (!(worker->flags & IO_WORKER_F_BOUND))
-   atomic_dec(&wqe->wq->user->processes);
+   dec_rlimit_ucounts(wqe->wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 1);
worker->flags = 0;
preempt_enable();
 
@@ -364,15 +365,15 @@ static void __io_worker_busy(struct io_wqe *wqe, struct io_worker *worker,
worker->flags |= IO_WORKER_F_BOUND;
wqe->acct[IO_WQ_ACCT_UNBOUND].nr_workers--;
wqe->acct[IO_WQ_ACCT_BOUND].nr_workers++;
-   atomic_dec(&wqe->wq->user->processes);
+   dec_rlimit_ucounts(wqe->wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 1);
} else {
worker->flags &= ~IO_WORKER_F_BOUND;
wqe->acct[IO_WQ_ACCT_UNBOUND].nr_workers++;
wqe->acct[IO_WQ_ACCT_BOUND].nr_workers--;
-   atomic_inc(&wqe->wq->user->processes);
+   inc_rlimit_ucounts(wqe->wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 1);
}
io_wqe_inc_running(wqe, worker);
-}
+   }
 }
 
 /*
@@ -707,7 +708,7 @@ static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
raw_spin_unlock_irq(>lock);
 
if (index == IO_WQ_ACCT_UNBOUND)
-   atomic_inc(&wq->user->processes);
+   inc_rlimit_ucounts(wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 1);
 
refcount_inc(>refs);
wake_up_process(worker->task);
@@ -838,7 +839,7 @@ static bool io_wq_can_queue(struct io_wqe *wqe, struct io_wqe_acct *acct,
if (free_worker)
return true;
 
-   if (atomic_read(&wqe->wq->user->processes) >= acct->max_workers &&
+   if (is_ucounts_overlimit(wqe->wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, acct->max_workers) &&
!(capable(CAP_SYS_RESOURCE) || capable(CAP_SYS_ADMIN)))
return false;
 
@@ -1074,7 +1075,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
wq->do_work = data->do_work;
 
/* caller must already hold a reference to this */
-   wq->user = data->user;
+   wq->cred = data->cred;
 
ret = -ENOMEM;
for_each_node(node) {
@@ -1090,10 +1091,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
wqe-

[RFC PATCH v3 2/8] Add a reference to ucounts for each cred

2021-01-15 Thread Alexey Gladkov

For RLIMIT_NPROC and some other rlimits the user_struct that holds the
global limit is kept alive for the lifetime of a process by keeping it
in struct cred.  Add a ucounts reference to struct cred, so that
RLIMIT_NPROC can switch from using a per user limit to using a per user
per user namespace limit.

Signed-off-by: Alexey Gladkov 
---
 include/linux/cred.h   |  1 +
 include/linux/user_namespace.h | 13 +++--
 kernel/cred.c  | 20 ++--
 kernel/ucount.c| 30 --
 kernel/user_namespace.c|  1 +
 5 files changed, 51 insertions(+), 14 deletions(-)

diff --git a/include/linux/cred.h b/include/linux/cred.h
index 18639c069263..307744fcc387 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -144,6 +144,7 @@ struct cred {
 #endif
struct user_struct *user;   /* real user ID subscription */
struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
+   struct ucounts *ucounts;
struct group_info *group_info;  /* supplementary groups for euid/fsgid */
/* RCU deletion */
union {
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index f84fc2d9ce20..9a3ba69e9223 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -85,7 +85,7 @@ struct user_namespace {
struct ctl_table_header *sysctls;
 #endif
struct ucounts  *ucounts;
-   int ucount_max[UCOUNT_COUNTS];
+   long ucount_max[UCOUNT_COUNTS];
 } __randomize_layout;
 
 struct ucounts {
@@ -93,7 +93,7 @@ struct ucounts {
struct user_namespace *ns;
kuid_t uid;
refcount_t count;
-   atomic_t ucount[UCOUNT_COUNTS];
+   atomic_long_t ucount[UCOUNT_COUNTS];
 };
 
 extern struct user_namespace init_user_ns;
@@ -102,6 +102,15 @@ bool setup_userns_sysctls(struct user_namespace *ns);
 void retire_userns_sysctls(struct user_namespace *ns);
 struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum ucount_type type);
 void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
+void put_ucounts(struct ucounts *ucounts);
+void set_cred_ucounts(struct cred *cred, struct user_namespace *ns, kuid_t uid);
+
+static inline struct ucounts *get_ucounts(struct ucounts *ucounts)
+{
+   if (ucounts)
+   refcount_inc(&ucounts->count);
+   return ucounts;
+}
 
 #ifdef CONFIG_USER_NS
 
diff --git a/kernel/cred.c b/kernel/cred.c
index 421b1149c651..a27d725c7c79 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -119,6 +119,8 @@ static void put_cred_rcu(struct rcu_head *rcu)
if (cred->group_info)
put_group_info(cred->group_info);
free_uid(cred->user);
+   if (cred->ucounts)
+   put_ucounts(cred->ucounts);
put_user_ns(cred->user_ns);
kmem_cache_free(cred_jar, cred);
 }
@@ -144,6 +146,9 @@ void __put_cred(struct cred *cred)
BUG_ON(cred == current->cred);
BUG_ON(cred == current->real_cred);
 
+   if (cred->ucounts)
+   BUG_ON(cred->ucounts->ns != cred->user_ns);
+
if (cred->non_rcu)
put_cred_rcu(&cred->rcu);
else
@@ -270,6 +275,7 @@ struct cred *prepare_creds(void)
get_group_info(new->group_info);
get_uid(new->user);
get_user_ns(new->user_ns);
+   get_ucounts(new->ucounts);
 
 #ifdef CONFIG_KEYS
key_get(new->session_keyring);
@@ -363,6 +369,7 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
ret = create_user_ns(new);
if (ret < 0)
goto error_put;
+   set_cred_ucounts(new, new->user_ns, new->euid);
}
 
 #ifdef CONFIG_KEYS
@@ -485,8 +492,11 @@ int commit_creds(struct cred *new)
 * in set_user().
 */
alter_cred_subscribers(new, 2);
-   if (new->user != old->user)
-   atomic_inc(&new->user->processes);
+   if (new->user != old->user || new->user_ns != old->user_ns) {
+   if (new->user != old->user)
+   atomic_inc(&new->user->processes);
+   set_cred_ucounts(new, new->user_ns, new->euid);
+   }
rcu_assign_pointer(task->real_cred, new);
rcu_assign_pointer(task->cred, new);
if (new->user != old->user)
@@ -661,6 +671,11 @@ void __init cred_init(void)
/* allocate a slab in which we can store credentials */
cred_jar = kmem_cache_create("cred_jar", sizeof(struct cred), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
+   /*
+* This is needed here because this is the first cred and there is no
+* ucount reference to copy.
+*/
+   set_cred_ucounts(&init_cred, &init_user_ns, GLOBAL_ROOT_UID);
 }
 
 /**
@@ -704,6 +719,7 @@ struct cred *

[RFC PATCH v3 0/8] Count rlimits in each user namespace

2021-01-15 Thread Alexey Gladkov

Preface
---
These patches are for binding the rlimit counters to a user in a user namespace.
This patch set can be applied on top of:

git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git v5.11-rc2

Problem
---
The implementation of the RLIMIT_NPROC, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING and
RLIMIT_MSGQUEUE rlimits places the counters in user_struct [1]. These limits are
global between processes and persist for the lifetime of the process, even if
processes are in different user namespaces.

To illustrate the impact of rlimits, let's say there is a program that does not
fork. Some service-A wants to run this program as user X in multiple containers.
Since the program never forks, the service wants to set RLIMIT_NPROC=1.

service-A
 \- program (uid=1000, container1, rlimit_nproc=1)
 \- program (uid=1000, container2, rlimit_nproc=1)

The service-A sets RLIMIT_NPROC=1 and runs the program in container1. When the
service-A tries to run a program with RLIMIT_NPROC=1 in container2 it fails
since user X already has one running process.

The problem is not that the limit from container1 affects container2. The
problem is that the limit is verified against the global counter that reflects
the number of processes in all containers.

This problem can be worked around by using different users for each container
but in this case we face a different problem of uid mapping when transferring
files from one container to another.

Eric W. Biederman mentioned this issue [2][3].

Introduced changes
--
To address the problem, we bind rlimit counters to a user namespace. Each counter
reflects the number of processes for a given uid in a given user namespace. The
result is a tree of rlimit counters with the biggest value at the root (aka
init_user_ns). The limit is considered exceeded if it is exceeded at any level
up the tree.
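
The tree of counters can be sketched as follows. This is a simplified userspace model, not the code from this series: struct counter_node, its per-node max field and the two functions are assumptions made for illustration. Charging a node also charges every ancestor, so the root always holds the largest value, and the charge is reported as over the limit if any level on the way up exceeds its own maximum.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one (uid, user namespace) counter. */
struct counter_node {
	struct counter_node *parent;	/* counter in the parent namespace, NULL at the root */
	long count;			/* charges made at this level and below */
	long max;			/* limit applied at this level */
};

/*
 * Charge v to the node and to every ancestor. Returns true if any
 * level ends up above its limit. The charge is kept either way; the
 * caller decides whether to undo it with counter_uncharge().
 */
static bool counter_charge(struct counter_node *node, long v)
{
	bool over = false;

	for (struct counter_node *it = node; it != NULL; it = it->parent) {
		it->count += v;
		if (it->count > it->max)
			over = true;
	}
	return over;
}

/* Undo a previous charge of v on the node and all of its ancestors. */
static void counter_uncharge(struct counter_node *node, long v)
{
	for (struct counter_node *it = node; it != NULL; it = it->parent)
		it->count -= v;
}
```

In the container example above, container1 and container2 would each get their own node with a limit of 1 under a shared root, so starting the program once in each container succeeds, while a second process in the same container is over the limit.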

[1] https://lore.kernel.org/containers/87imd2incs@x220.int.ebiederm.org/
[2] https://lists.linuxfoundation.org/pipermail/containers/2020-August/042096.html
[3] https://lists.linuxfoundation.org/pipermail/containers/2020-October/042524.html

Changelog
-
v3:
* Added get_ucounts() function to increase the reference count. The existing
  get_counts() function renamed to __get_ucounts().
* The type of ucounts.count changed from atomic_t to refcount_t.
* Dropped 'const' from set_cred_ucounts() arguments.
* Fixed a bug with freeing the cred structure after calling cred_alloc_blank().
* Commit messages have been updated.
* Added selftest.

v2:
* RLIMIT_MEMLOCK, RLIMIT_SIGPENDING and RLIMIT_MSGQUEUE are migrated to ucounts.
* Added ucounts for each (uid, user namespace) pair to cred.
* Added the ability to increase ucount by more than 1.

v1:
* After discussion with Eric W. Biederman, I increased the size of ucounts to
  atomic_long_t.
* Added ucount_max to avoid the fork bomb.

--

Alexey Gladkov (8):
  Use refcount_t for ucounts reference counting
  Add a reference to ucounts for each cred
  Move RLIMIT_NPROC counter to ucounts
  Move RLIMIT_MSGQUEUE counter to ucounts
  Move RLIMIT_SIGPENDING counter to ucounts
  Move RLIMIT_MEMLOCK counter to ucounts
  Move RLIMIT_NPROC check to the place where we increment the counter
  kselftests: Add test to check for rlimit changes in different user
namespaces

 fs/exec.c |   2 +-
 fs/hugetlbfs/inode.c  |  17 +-
 fs/io-wq.c|  22 ++-
 fs/io-wq.h|   2 +-
 fs/io_uring.c |   2 +-
 fs/proc/array.c   |   2 +-
 include/linux/cred.h  |   3 +
 include/linux/hugetlb.h   |   3 +-
 include/linux/mm.h|   4 +-
 include/linux/sched/user.h|   6 -
 include/linux/shmem_fs.h  |   2 +-
 include/linux/signal_types.h  |   4 +-
 include/linux/user_namespace.h|  31 +++-
 ipc/mqueue.c  |  29 ++--
 ipc/shm.c |  31 ++--
 kernel/cred.c |  46 -
 kernel/exit.c |   2 +-
 kernel/fork.c |  12 +-
 kernel/signal.c   |  53 +++---
 kernel/sys.c  |  13 --
 kernel/ucount.c   | 111 +---
 kernel/user.c |   2 -
 kernel/user_namespace.c   |   7 +-
 mm/memfd.c|   4 +-
 mm/mlock.c|  35 ++--
 mm/mmap.c |   3 +-
 mm/shmem.c|   8 +-
 tools/testing/selftests/Makefile  |   1 +
 tools/testing/selftests/rlimits/.gitignore|   2 +
 tools/testing/selftests/rlimits/Makefile  |   6 +
 tools/testing/selftests/rlimits
