Because ->pre_destroy() could fail and can't be called under
cgroup_mutex, cgroup destruction did something very ugly.

  1. Grab cgroup_mutex and verify it can be destroyed; fail otherwise.

  2. Release cgroup_mutex and call ->pre_destroy().

  3. Re-grab cgroup_mutex and verify it can still be destroyed; fail
     otherwise.

  4. Continue destroying.

In addition to being ugly, it has always been broken in various ways.
For example, memcg ->pre_destroy() expects the cgroup to be inactive
after it's done but tasks can be attached and detached between #2 and
#3 and the conditions that memcg verified in ->pre_destroy() might no
longer hold by the time control reaches #3.

Now that ->pre_destroy() is no longer allowed to fail, we can switch
to the following.

  1. Grab cgroup_mutex and verify it can be destroyed; fail
     otherwise.

  2. Deactivate CSS's and mark the cgroup removed, thus preventing any
     further operations which can invalidate the verification from #1.

  3. Release cgroup_mutex and call ->pre_destroy().

  4. Re-grab cgroup_mutex and continue destroying.
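
The four steps above can be modeled in userspace as a rough sketch.
Everything here is hypothetical (struct group, group_mutex and the
group_*() helpers are stand-ins, not kernel API), and the check for an
already-removed group in group_rmdir() is an addition for the model
only.  The point it illustrates is that the removed mark is set before
the lock is dropped, so nothing can invalidate the verification while
the callback runs unlocked.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; none of these names are kernel API. */
static pthread_mutex_t group_mutex = PTHREAD_MUTEX_INITIALIZER;

struct group {
	int users;              /* models cgrp->count */
	int children;           /* models the cgrp->children list */
	bool removed;           /* models CGRP_REMOVED */
	int pre_destroy_calls;  /* how many times the callback ran */
};

/* Models task attach or child creation: refused once removal started. */
static int group_attach(struct group *g)
{
	int ret = 0;

	pthread_mutex_lock(&group_mutex);
	if (g->removed)
		ret = -ENODEV;
	else
		g->users++;
	pthread_mutex_unlock(&group_mutex);
	return ret;
}

/* Models ->pre_destroy(): runs unlocked and may take group_mutex itself. */
static void group_pre_destroy(struct group *g)
{
	pthread_mutex_lock(&group_mutex);
	assert(g->removed);     /* removal is already committed */
	g->pre_destroy_calls++;
	pthread_mutex_unlock(&group_mutex);
}

static int group_rmdir(struct group *g)
{
	/* 1. Verify under the lock; fail otherwise. */
	pthread_mutex_lock(&group_mutex);
	if (g->removed) {       /* model-only guard against a second rmdir */
		pthread_mutex_unlock(&group_mutex);
		return -ENODEV;
	}
	if (g->users || g->children) {
		pthread_mutex_unlock(&group_mutex);
		return -EBUSY;
	}

	/* 2. Commit to removal so the checks above stay true. */
	g->removed = true;

	/* 3. Drop the lock and run the callback. */
	pthread_mutex_unlock(&group_mutex);
	group_pre_destroy(g);

	/* 4. Re-grab the lock and continue; no re-verification is needed. */
	pthread_mutex_lock(&group_mutex);
	/* the rest of the teardown would go here */
	pthread_mutex_unlock(&group_mutex);
	return 0;
}
```

In this model, group_attach() on a group that has gone through
group_rmdir() returns -ENODEV, and group_rmdir() on a group with users
returns -EBUSY without ever invoking the callback.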

After this change, controllers can safely assume that ->pre_destroy()
will be called only once for a given cgroup and that, once
->pre_destroy() is called, the cgroup will stay dormant until it's
destroyed.
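
Part of what keeps the cgroup dormant is that css_tryget() starts
failing once the refcnt is deactivated by adding CSS_DEACT_BIAS.  The
general idea of deactivating a counter with a bias might be sketched
with C11 atomics as below.  This is not the kernel implementation:
struct ref, the ref_*() helpers and the choice of INT_MIN as the bias
are all assumptions made for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed bias for this sketch: drives any non-negative count negative. */
#define DEACT_BIAS INT_MIN

/* Hypothetical stand-in for a css refcnt. */
struct ref {
	atomic_int cnt;  /* >= 0 while active, negative once deactivated */
};

/* Take a reference unless the counter has been deactivated. */
static bool ref_tryget(struct ref *r)
{
	int v = atomic_load(&r->cnt);

	while (v >= 0) {
		/* On failure, v is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&r->cnt, &v, v + 1))
			return true;
	}
	return false;
}

/* Drop a reference; works both before and after deactivation. */
static void ref_put(struct ref *r)
{
	atomic_fetch_sub(&r->cnt, 1);
}

/* Make all future ref_tryget() calls fail while preserving the count. */
static void ref_deactivate(struct ref *r)
{
	assert(atomic_load(&r->cnt) >= 0);  /* must not be deactivated twice */
	atomic_fetch_add(&r->cnt, DEACT_BIAS);
}

/* Number of references held, with the bias removed if it is present. */
static int ref_count(struct ref *r)
{
	int v = atomic_load(&r->cnt);

	return v >= 0 ? v : v - DEACT_BIAS;
}
```

Existing holders can still drop their references after deactivation, so
the count continues to drain toward zero while new ref_tryget() calls
are refused.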

Signed-off-by: Tejun Heo <t...@kernel.org>
---
 kernel/cgroup.c | 41 +++++++++++++++++++----------------------
 1 file changed, 19 insertions(+), 22 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b3010ae..66204a6 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4058,18 +4058,6 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
        struct cgroup_event *event, *tmp;
        struct cgroup_subsys *ss;
 
-       /* the vfs holds both inode->i_mutex already */
-       mutex_lock(&cgroup_mutex);
-       if (atomic_read(&cgrp->count) != 0) {
-               mutex_unlock(&cgroup_mutex);
-               return -EBUSY;
-       }
-       if (!list_empty(&cgrp->children)) {
-               mutex_unlock(&cgroup_mutex);
-               return -EBUSY;
-       }
-       mutex_unlock(&cgroup_mutex);
-
        /*
         * In general, subsystem has no css->refcnt after pre_destroy(). But
         * in racy cases, subsystem may have to get css->refcnt after
@@ -4081,14 +4069,7 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
         */
        set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 
-       /*
-        * Call pre_destroy handlers of subsys. Notify subsystems
-        * that rmdir() request comes.
-        */
-       for_each_subsys(cgrp->root, ss)
-               if (ss->pre_destroy)
-                       WARN_ON_ONCE(ss->pre_destroy(cgrp));
-
+       /* the vfs holds both inode->i_mutex already */
        mutex_lock(&cgroup_mutex);
        parent = cgrp->parent;
        if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
@@ -4098,13 +4079,30 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
        }
        prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
 
-       /* block new css_tryget() by deactivating refcnt */
+       /*
+        * Block new css_tryget() by deactivating refcnt and mark @cgrp
+        * removed.  This makes future css_tryget() and child creation
+        * attempts fail thus maintaining the removal conditions verified
+        * above.
+        */
        for_each_subsys(cgrp->root, ss) {
                struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
 
                WARN_ON(atomic_read(&css->refcnt) < 0);
                atomic_add(CSS_DEACT_BIAS, &css->refcnt);
        }
+       set_bit(CGRP_REMOVED, &cgrp->flags);
+
+       /*
+        * Tell subsystems to initiate destruction.  pre_destroy() should be
+        * called with cgroup_mutex unlocked.  See 3fa59dfbc3 ("cgroup: fix
+        * potential deadlock in pre_destroy") for details.
+        */
+       mutex_unlock(&cgroup_mutex);
+       for_each_subsys(cgrp->root, ss)
+               if (ss->pre_destroy)
+                       WARN_ON_ONCE(ss->pre_destroy(cgrp));
+       mutex_lock(&cgroup_mutex);
 
        /*
         * Put all the base refs.  Each css holds an extra reference to the
@@ -4120,7 +4118,6 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
        clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 
        raw_spin_lock(&release_list_lock);
-       set_bit(CGRP_REMOVED, &cgrp->flags);
        if (!list_empty(&cgrp->release_list))
                list_del_init(&cgrp->release_list);
        raw_spin_unlock(&release_list_lock);
-- 
1.7.11.7
