On Fri, 22 Nov 2013, Tejun Heo wrote:

> Hello, Hugh.
> 
> I applied the following patch to cgroup/for-3.13-fixes.

Looks good, thanks a lot.

> For longer
> term, I think it'd be better to pull workqueue init before cgroup one
> but this one should be easier to backport for now.

Yes, that's the right direction, but this is the right fix for now.

Hugh

> 
> Thanks!
> 
> ----- >8 -----
> From e5fca243abae1445afbfceebda5f08462ef869d3 Mon Sep 17 00:00:00 2001
> From: Tejun Heo <t...@kernel.org>
> Date: Fri, 22 Nov 2013 17:14:39 -0500
> 
> Since be44562613851 ("cgroup: remove synchronize_rcu() from
> cgroup_diput()"), cgroup destruction path makes use of workqueue.  css
> freeing is performed from a work item from that point on and a later
> commit, ea15f8ccdb430 ("cgroup: split cgroup destruction into two
> steps"), moves css offlining to workqueue too.
> 
> As cgroup destruction isn't depended upon for memory reclaim, the
> destruction work items were put on the system_wq; unfortunately, some
> controller may block in the destruction path for considerable duration
> while holding cgroup_mutex.  As large part of destruction path is
> synchronized through cgroup_mutex, when combined with high rate of
> cgroup removals, this has potential to fill up system_wq's max_active
> of 256.
> 
> Also, it turns out that memcg's css destruction path ends up queueing
> and waiting for work items on system_wq through work_on_cpu().  If
> such operation happens while system_wq is fully occupied by cgroup
> destruction work items, work_on_cpu() can't make forward progress
> because system_wq is full and other destruction work items on
> system_wq can't make forward progress because the work item waiting
> for work_on_cpu() is holding cgroup_mutex, leading to deadlock.
> 
> This can be fixed by queueing destruction work items on a separate
> workqueue.  This patch creates a dedicated workqueue -
> cgroup_destroy_wq - for this purpose.  As these work items shouldn't
> have inter-dependencies and mostly serialized by cgroup_mutex anyway,
> giving high concurrency level doesn't buy anything and the workqueue's
> @max_active is set to 1 so that destruction work items are executed
> one by one on each CPU.
> 
> Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
> cgroup_destroy_wq can't be allocated from cgroup_init().  Do it from a
> separate core_initcall().  In the future, we probably want to reorder
> so that workqueue init happens before cgroup_init().
> 
> Signed-off-by: Tejun Heo <t...@kernel.org>
> Reported-by: Hugh Dickins <hu...@google.com>
> Reported-by: Shawn Bohrer <shawn.boh...@gmail.com>
> Link: http://lkml.kernel.org/r/20131111220626.ga7...@sbohrermbp13-local.rgmadvisors.com
> Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
> Cc: sta...@vger.kernel.org # v3.9+
> ---
>  kernel/cgroup.c | 30 +++++++++++++++++++++++++++---
>  1 file changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 4c62513..a7b98ee 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -90,6 +90,14 @@ static DEFINE_MUTEX(cgroup_mutex);
>  static DEFINE_MUTEX(cgroup_root_mutex);
>  
>  /*
> + * cgroup destruction makes heavy use of work items and there can be a lot
> + * of concurrent destructions.  Use a separate workqueue so that cgroup
> + * destruction work items don't end up filling up max_active of system_wq
> + * which may lead to deadlock.
> + */
> +static struct workqueue_struct *cgroup_destroy_wq;
> +
> +/*
>   * Generate an array of cgroup subsystem pointers. At boot time, this is
>   * populated with the built in subsystems, and modular subsystems are
>   * registered after that. The mutable section of this array is protected by
> @@ -871,7 +879,7 @@ static void cgroup_free_rcu(struct rcu_head *head)
>       struct cgroup *cgrp = container_of(head, struct cgroup, rcu_head);
>  
>       INIT_WORK(&cgrp->destroy_work, cgroup_free_fn);
> -     schedule_work(&cgrp->destroy_work);
> +     queue_work(cgroup_destroy_wq, &cgrp->destroy_work);
>  }
>  
>  static void cgroup_diput(struct dentry *dentry, struct inode *inode)
> @@ -4249,7 +4257,7 @@ static void css_free_rcu_fn(struct rcu_head *rcu_head)
>        * css_put().  dput() requires process context which we don't have.
>        */
>       INIT_WORK(&css->destroy_work, css_free_work_fn);
> -     schedule_work(&css->destroy_work);
> +     queue_work(cgroup_destroy_wq, &css->destroy_work);
>  }
>  
>  static void css_release(struct percpu_ref *ref)
> @@ -4539,7 +4547,7 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
>               container_of(ref, struct cgroup_subsys_state, refcnt);
>  
>       INIT_WORK(&css->destroy_work, css_killed_work_fn);
> -     schedule_work(&css->destroy_work);
> +     queue_work(cgroup_destroy_wq, &css->destroy_work);
>  }
>  
>  /**
> @@ -5063,6 +5071,22 @@ out:
>       return err;
>  }
>  
> +static int __init cgroup_wq_init(void)
> +{
> +     /*
> +      * There isn't much point in executing destruction path in
> +      * parallel.  Good chunk is serialized with cgroup_mutex anyway.
> +      * Use 1 for @max_active.
> +      *
> +      * We would prefer to do this in cgroup_init() above, but that
> +      * is called before init_workqueues(): so leave this until after.
> +      */
> +     cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
> +     BUG_ON(!cgroup_destroy_wq);
> +     return 0;
> +}
> +core_initcall(cgroup_wq_init);
> +
>  /*
>   * proc_cgroup_show()
>   *  - Print task's cgroup paths into seq_file, one line for each hierarchy
> -- 
> 1.8.4.2
> 
> 
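For readers following the commit message above, the slot-exhaustion argument can be sketched as a small userspace toy model. This is not kernel code: struct wq_model, wq_try_start() and holder_can_progress() are hypothetical names made up for illustration, and the model ignores that the real max_active limit is applied per CPU. It only counts slots: destruction items occupy slots on one queue, and the item holding cgroup_mutex needs one free slot on the queue used by its work_on_cpu()-style helper.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of a workqueue's concurrency limit. */
struct wq_model {
	int max_active;	/* maximum number of concurrently running items */
	int active;	/* items currently occupying a slot */
};

/* Try to start one work item; fails if every slot is already taken. */
static bool wq_try_start(struct wq_model *wq)
{
	assert(wq->max_active > 0);
	if (wq->active >= wq->max_active)
		return false;
	wq->active++;
	return true;
}

/*
 * Queue n_destroy destruction items on destroy_wq.  In this model every
 * started item keeps its slot: one holds the mutex, the rest block on it.
 * The mutex holder then needs a free slot on helper_wq for its helper
 * work item.  Returns true if that helper can start, i.e. no deadlock.
 */
static bool holder_can_progress(struct wq_model *destroy_wq,
				struct wq_model *helper_wq, int n_destroy)
{
	for (int i = 0; i < n_destroy; i++) {
		if (!wq_try_start(destroy_wq))
			break;	/* remaining items just stay queued */
	}
	return wq_try_start(helper_wq);
}
```

Passing the same queue (limit 256) for both arguments with 256 destruction items leaves no slot for the helper, which corresponds to the deadlock described above. Passing a separate destruction queue with a limit of 1 leaves the helper's queue untouched, so the helper can start.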