Re: [PATCH v14 4/4] cgroup: implement the PIDs subsystem

2015-07-14 Thread Tejun Heo
On Wed, Jun 10, 2015 at 01:53:04PM +0900, Tejun Heo wrote:
> On Tue, Jun 09, 2015 at 09:32:10PM +1000, Aleksa Sarai wrote:
> > Adds a new single-purpose PIDs subsystem to limit the number of
> > tasks that can be forked inside a cgroup. Essentially this is an
> > implementation of RLIMIT_NPROC that applies to a cgroup rather than a
> > process tree.
> > 
> > However, it should be noted that organisational operations (adding and
> > removing tasks from a PIDs hierarchy) will *not* be prevented. Rather,
> > the number of tasks in the hierarchy cannot exceed the limit through
> > forking. This is due to the fact that, in the unified hierarchy, attach
> > cannot fail (and it is not possible for a task to overcome its PIDs
> > cgroup policy limit by attaching to a child cgroup -- even if migrating
> > mid-fork it must be able to fork in the parent first).
> > 
> > PIDs are fundamentally a global resource, and it is possible to reach
> > PID exhaustion inside a cgroup without hitting any reasonable kmemcg
> > policy. Once you've hit PID exhaustion, you're only in a marginally
> > better state than OOM. This subsystem allows PID exhaustion inside a
> > cgroup to be prevented.
> 
> Patches 3-4 look good to me.  Will apply once v4.3 dev window opens.

Applied 3-4 to cgroup/for-4.3.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v14 4/4] cgroup: implement the PIDs subsystem

2015-06-10 Thread Aleksa Sarai
Hi Tejun,

> Patches 3-4 look good to me.  Will apply once v4.3 dev window opens.

Do you want me to update Documentation/cgroups/ to include a
description of this?

--
Aleksa Sarai (cyphar)
www.cyphar.com


Re: [PATCH v14 4/4] cgroup: implement the PIDs subsystem

2015-06-09 Thread Tejun Heo
On Tue, Jun 09, 2015 at 09:32:10PM +1000, Aleksa Sarai wrote:
> Adds a new single-purpose PIDs subsystem to limit the number of
> tasks that can be forked inside a cgroup. Essentially this is an
> implementation of RLIMIT_NPROC that applies to a cgroup rather than a
> process tree.
> 
> However, it should be noted that organisational operations (adding and
> removing tasks from a PIDs hierarchy) will *not* be prevented. Rather,
> the number of tasks in the hierarchy cannot exceed the limit through
> forking. This is due to the fact that, in the unified hierarchy, attach
> cannot fail (and it is not possible for a task to overcome its PIDs
> cgroup policy limit by attaching to a child cgroup -- even if migrating
> mid-fork it must be able to fork in the parent first).
> 
> PIDs are fundamentally a global resource, and it is possible to reach
> PID exhaustion inside a cgroup without hitting any reasonable kmemcg
> policy. Once you've hit PID exhaustion, you're only in a marginally
> better state than OOM. This subsystem allows PID exhaustion inside a
> cgroup to be prevented.

Patches 3-4 look good to me.  Will apply once v4.3 dev window opens.

Thanks.

-- 
tejun


[PATCH v14 4/4] cgroup: implement the PIDs subsystem

2015-06-09 Thread Aleksa Sarai
Adds a new single-purpose PIDs subsystem to limit the number of
tasks that can be forked inside a cgroup. Essentially this is an
implementation of RLIMIT_NPROC that applies to a cgroup rather than a
process tree.

However, it should be noted that organisational operations (adding and
removing tasks from a PIDs hierarchy) will *not* be prevented. Rather,
the number of tasks in the hierarchy cannot exceed the limit through
forking. This is due to the fact that, in the unified hierarchy, attach
cannot fail (and it is not possible for a task to overcome its PIDs
cgroup policy limit by attaching to a child cgroup -- even if migrating
mid-fork it must be able to fork in the parent first).

PIDs are fundamentally a global resource, and it is possible to reach
PID exhaustion inside a cgroup without hitting any reasonable kmemcg
policy. Once you've hit PID exhaustion, you're only in a marginally
better state than OOM. This subsystem allows PID exhaustion inside a
cgroup to be prevented.

Signed-off-by: Aleksa Sarai 
---
 CREDITS                       |   5 +
 include/linux/cgroup_subsys.h |   5 +
 init/Kconfig                  |  16 ++
 kernel/Makefile               |   1 +
 kernel/cgroup_pids.c          | 366 ++
 5 files changed, 393 insertions(+)
 create mode 100644 kernel/cgroup_pids.c
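
For readers following the description above, the hierarchical fork limit can
be sketched as a small userspace model. This is only an illustration under
stated assumptions, not the code in kernel/cgroup_pids.c: the struct and every
pids_model_* name, as well as PIDS_MAX_UNLIMITED, are hypothetical. It assumes
a fork is charged against the cgroup and each of its ancestors, that a failed
charge is rolled back and reported as -EAGAIN, and that attaching charges
unconditionally.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for writing "max" to pids.max (no limit). */
#define PIDS_MAX_UNLIMITED (-1L)

/* Hypothetical model of one cgroup in a pids hierarchy. */
struct pids_model {
	struct pids_model *parent; /* NULL at the top of the hierarchy */
	long counter;              /* models pids.current */
	long limit;                /* models pids.max */
};

/* Undo a charge of num tasks on cg and all of its ancestors. */
static void pids_model_uncharge(struct pids_model *cg, long num)
{
	for (; cg; cg = cg->parent)
		cg->counter -= num;
}

/*
 * Charge num tasks (a fork) against cg and every ancestor. If any level
 * would exceed its limit, roll back the levels already charged and fail
 * with -EAGAIN.
 */
static int pids_model_try_charge(struct pids_model *cg, long num)
{
	struct pids_model *p, *q;

	for (p = cg; p; p = p->parent) {
		p->counter += num;
		if (p->limit != PIDS_MAX_UNLIMITED && p->counter > p->limit)
			goto revert;
	}
	return 0;

revert:
	/* Roll back every level from cg up to and including p. */
	for (q = cg; q != p; q = q->parent)
		q->counter -= num;
	p->counter -= num;
	return -EAGAIN;
}

/* Attaching never fails in this model, so counter may exceed limit. */
static void pids_model_attach(struct pids_model *cg, long num)
{
	for (; cg; cg = cg->parent)
		cg->counter += num;
}
```

With a parent limited to 2 tasks and a child left unlimited, two charges on
the child succeed and a third fails because the parent's limit is hit, which
is the hierarchical behaviour described above. Attaching a task afterwards
still succeeds and leaves the parent's counter above its limit, and further
charges keep failing until enough tasks are uncharged.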

diff --git a/CREDITS b/CREDITS
index 40cc4bf..0727426 100644
--- a/CREDITS
+++ b/CREDITS
@@ -3215,6 +3215,11 @@ S: 69 rue Dunois
 S: 75013 Paris
 S: France
 
+N: Aleksa Sarai
+E: cyp...@cyphar.com
+W: https://www.cyphar.com/
+D: `pids` cgroup subsystem
+
 N: Dipankar Sarma
 E: dipan...@in.ibm.com
 D: RCU
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index ec43bce..1f36945 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -62,6 +62,11 @@ SUBSYS(hugetlb)
  * Subsystems that implement the can_fork() family of callbacks.
  */
 SUBSYS_TAG(CANFORK_START)
+
+#if IS_ENABLED(CONFIG_CGROUP_PIDS)
+SUBSYS(pids)
+#endif
+
 SUBSYS_TAG(CANFORK_END)
 
 /*
diff --git a/init/Kconfig b/init/Kconfig
index b9b824b..f4e4918 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -968,6 +968,22 @@ config CGROUP_FREEZER
  Provides a way to freeze and unfreeze all tasks in a
  cgroup.
 
+config CGROUP_PIDS
+   bool "PIDs cgroup subsystem"
+   help
+ Provides enforcement of process number limits in the scope of a
+ cgroup. Any attempt to fork more processes than is allowed in the
+ cgroup will fail. PIDs are fundamentally a global resource because it
+ is fairly trivial to reach PID exhaustion before you reach even a
+ conservative kmemcg limit. As a result, it is possible to grind a
+ system to a halt without being limited by other cgroup policies. The
+ PIDs cgroup subsystem is designed to stop this from happening.
+
+ It should be noted that organisational operations (such as attaching
+ to a cgroup hierarchy) will *not* be blocked by the PIDs subsystem,
+ since the PIDs limit only affects a process's ability to fork, not to
+ attach to a cgroup.
+
 config CGROUP_DEVICE
bool "Device controller for cgroups"
help
diff --git a/kernel/Makefile b/kernel/Makefile
index 0f8f8b0..df5406c 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -55,6 +55,7 @@ obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
+obj-$(CONFIG_CGROUP_PIDS) += cgroup_pids.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
 obj-$(CONFIG_USER_NS) += user_namespace.o
diff --git a/kernel/cgroup_pids.c b/kernel/cgroup_pids.c
new file mode 100644
index 0000000..d754888
--- /dev/null
+++ b/kernel/cgroup_pids.c
@@ -0,0 +1,366 @@
+/*
+ * Process number limiting controller for cgroups.
+ *
+ * Used to allow a cgroup hierarchy to stop any new processes from fork()ing
+ * after a certain limit is reached.
+ *
+ * Since it is trivial to hit the task limit without hitting any kmemcg limits
+ * in place, PIDs are a fundamental resource. As such, PID exhaustion must be
+ * preventable in the scope of a cgroup hierarchy by allowing resource limiting
+ * of the number of tasks in a cgroup.
+ *
+ * In order to use the `pids` controller, set the maximum number of tasks in
+ * pids.max (this is not available in the root cgroup for obvious reasons). The
+ * number of processes currently in the cgroup is given by pids.current.
+ * Organisational operations are not blocked by cgroup policies, so it is
+ * possible to have pids.current > pids.max. However, it is not possible to
+ * violate a cgroup policy through fork(). fork() will return -EAGAIN if forking
+ * would cause a cgroup policy to be violated.
+ *
+ * To set a cgroup to have no limit, set pids.max to "max". This is the default
+ * for all new cgroups (N.B. that PID limits are hierarchical, so