Re: 3.10.16 cgroup_mutex deadlock

2013-12-02 Thread Li Zefan
On 2013/12/2 18:31, William Dauchy wrote:
> Hi Li,
> 
> On Mon, Nov 25, 2013 at 2:20 AM, Li Zefan  wrote:
>> I'll do this after the patch hits mainline, if Tejun doesn't plan to.
> 
> Do you have some news about it?
> 

Tejun has already done the backport. :)

It has been included in 3.10.22, which will be released in a couple of days.

http://article.gmane.org/gmane.linux.kernel.stable/71292
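
For anyone tracking the stable release, one way to confirm that a given tree carries the backport is to look for the dedicated workqueue it introduces; a minimal sketch, assuming a checkout of the stable git tree with the 3.10.22 tag available:

git log --oneline v3.10.21..v3.10.22 -- kernel/cgroup.c   # list cgroup changes that went into 3.10.22
grep -n cgroup_destroy_wq kernel/cgroup.c                 # present once the backport is applied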


Re: 3.10.16 cgroup_mutex deadlock

2013-12-02 Thread William Dauchy
Hi Li,

On Mon, Nov 25, 2013 at 2:20 AM, Li Zefan  wrote:
> I'll do this after the patch hits mainline, if Tejun doesn't plan to.

Do you have some news about it?
-- 
William


Re: 3.10.16 cgroup_mutex deadlock

2013-11-24 Thread Li Zefan
On 2013/11/23 6:54, William Dauchy wrote:
> Hi Tejun,
> 
> On Fri, Nov 22, 2013 at 11:18 PM, Tejun Heo  wrote:
>> Just applied to cgroup/for-3.13-fixes w/ stable cc'd.  Will push to
>> Linus next week.
> 
> Thank you for your quick reply. Do you also have a backport for
> v3.10.x already available?
> 

I'll do this after the patch hits mainline, if Tejun doesn't plan to.


Re: 3.10.16 cgroup_mutex deadlock

2013-11-22 Thread William Dauchy
Hi Tejun,

On Fri, Nov 22, 2013 at 11:18 PM, Tejun Heo  wrote:
> Just applied to cgroup/for-3.13-fixes w/ stable cc'd.  Will push to
> Linus next week.

Thank you for your quick reply. Do you also have a backport for
v3.10.x already available?

Best regards,
-- 
William


Re: 3.10.16 cgroup_mutex deadlock

2013-11-22 Thread Tejun Heo
On Fri, Nov 22, 2013 at 09:59:37PM +0100, William Dauchy wrote:
> Hugh, Tejun,
> 
> Do we have some news about this patch? I'm also hitting this bug on a 3.10.x kernel.

Just applied to cgroup/for-3.13-fixes w/ stable cc'd.  Will push to
Linus next week.

Thanks.

-- 
tejun


Re: 3.10.16 cgroup_mutex deadlock

2013-11-22 Thread William Dauchy
On Mon, Nov 18, 2013 at 3:17 AM, Hugh Dickins  wrote:
> Sorry for the delay: I was on the point of reporting success last
> night, when I tried a debug kernel: and that didn't work so well
> (got spinlock bad magic report in pwq_adjust_max_active(), and
> tests wouldn't run at all).
>
> Even the non-early cgroup_init() is called well before the
> early_initcall init_workqueues(): though only the debug (lockdep
> and spinlock debug) kernel appeared to have a problem with that.
>
> Here's the patch I ended up with successfully on a 3.11.7-based
> kernel (though below I've rediffed it against 3.11.8): the
> schedule_work->queue_work hunks are slightly different on 3.11
> than in your patch against current, and I did alloc_workqueue()
> from a separate core_initcall.
>
> The interval between cgroup_init and that is a bit of a worry;
> but we don't seem to have suffered from the interval between
> cgroup_init and init_workqueues before (when system_wq is NULL)
> - though you may have more courage than I to reorder them!
>
> Initially I backed out my system_highpri_wq workaround, and
> verified that it was still easy to reproduce the problem with
> one of our cgroup stresstests.  Yes it was, then your modified
> patch below convincingly fixed it.
>
> I ran with Johannes's patch adding extra mem_cgroup_reparent_charges:
> as I'd expected, that didn't solve this issue (though it's worth
> our keeping it in to rule out another source of problems).  And I
> checked back on dumps of failures: they indeed show the tell-tale
> 256 kworkers doing cgroup_offline_fn, just as you predicted.

Hugh, Tejun,

Do we have some news about this patch? I'm also hitting this bug on a 3.10.x kernel.

Thanks,
-- 
William


Re: 3.10.16 cgroup_mutex deadlock

2013-11-20 Thread Shawn Bohrer
On Tue, Nov 19, 2013 at 10:55:18AM +0800, Li Zefan wrote:
> > Thanks Tejun and Hugh.  Sorry for my late entry in getting around to
> > testing this fix. On the surface it sounds correct however I'd like to
> > test this on top of 3.10.* since that is what we'll likely be running.
> > I've tried to apply Hugh's patch above on top of 3.10.19 but it
> > appears there are a number of conflicts.  Looking over the changes and
> > my understanding of the problem I believe on 3.10 only the
> > cgroup_free_fn needs to be run in a separate workqueue.  Below is the
> > patch I've applied on top of 3.10.19, which I'm about to start
> > testing.  If it looks like I botched the backport in any way please
> > let me know so I can test a proper fix on top of 3.10.19.
> > 
> 
> You didn't move css free_work to the dedicated wq as Tejun's patch does.
> css free_work won't acquire cgroup_mutex, but when destroying a lot of
> cgroups, we can have a lot of css free_work in the workqueue, so I'd
> suggest you also use cgroup_destroy_wq for it.

Well, I didn't move the css free_work, but I did test the patch I
posted on top of 3.10.19 and I am unable to reproduce the lockup so it
appears my patch was sufficient for 3.10.*.  Hopefully we can get this
fix applied and backported into stable.

Thanks,
Shawn


Re: 3.10.16 cgroup_mutex deadlock

2013-11-18 Thread Li Zefan
> Thanks Tejun and Hugh.  Sorry for my late entry in getting around to
> testing this fix. On the surface it sounds correct however I'd like to
> test this on top of 3.10.* since that is what we'll likely be running.
> I've tried to apply Hugh's patch above on top of 3.10.19 but it
> appears there are a number of conflicts.  Looking over the changes and
> my understanding of the problem I believe on 3.10 only the
> cgroup_free_fn needs to be run in a separate workqueue.  Below is the
> patch I've applied on top of 3.10.19, which I'm about to start
> testing.  If it looks like I botched the backport in any way please
> let me know so I can test a proper fix on top of 3.10.19.
> 

You didn't move css free_work to the dedicated wq as Tejun's patch does.
css free_work won't acquire cgroup_mutex, but when destroying a lot of
cgroups, we can have a lot of css free_work in the workqueue, so I'd
suggest you also use cgroup_destroy_wq for it.


Re: 3.10.16 cgroup_mutex deadlock

2013-11-18 Thread Shawn Bohrer
On Sun, Nov 17, 2013 at 06:17:17PM -0800, Hugh Dickins wrote:
> Sorry for the delay: I was on the point of reporting success last
> night, when I tried a debug kernel: and that didn't work so well
> (got spinlock bad magic report in pwq_adjust_max_active(), and
> tests wouldn't run at all).
> 
> Even the non-early cgroup_init() is called well before the
> early_initcall init_workqueues(): though only the debug (lockdep
> and spinlock debug) kernel appeared to have a problem with that.
> 
> Here's the patch I ended up with successfully on a 3.11.7-based
> kernel (though below I've rediffed it against 3.11.8): the
> schedule_work->queue_work hunks are slightly different on 3.11
> than in your patch against current, and I did alloc_workqueue()
> from a separate core_initcall.
> 
> The interval between cgroup_init and that is a bit of a worry;
> but we don't seem to have suffered from the interval between
> cgroup_init and init_workqueues before (when system_wq is NULL)
> - though you may have more courage than I to reorder them!
> 
> Initially I backed out my system_highpri_wq workaround, and
> verified that it was still easy to reproduce the problem with
> one of our cgroup stresstests.  Yes it was, then your modified
> patch below convincingly fixed it.
> 
> I ran with Johannes's patch adding extra mem_cgroup_reparent_charges:
> as I'd expected, that didn't solve this issue (though it's worth
> our keeping it in to rule out another source of problems).  And I
> checked back on dumps of failures: they indeed show the tell-tale
> 256 kworkers doing cgroup_offline_fn, just as you predicted.
> 
> Thanks!
> Hugh
> 
> ---
>  kernel/cgroup.c | 30 +++---
>  1 file changed, 27 insertions(+), 3 deletions(-)
> 
> --- 3.11.8/kernel/cgroup.c  2013-11-17 17:40:54.200640692 -0800
> +++ linux/kernel/cgroup.c 2013-11-17 17:43:10.876643941 -0800
> @@ -89,6 +89,14 @@ static DEFINE_MUTEX(cgroup_mutex);
>  static DEFINE_MUTEX(cgroup_root_mutex);
>  
>  /*
> + * cgroup destruction makes heavy use of work items and there can be a lot
> + * of concurrent destructions.  Use a separate workqueue so that cgroup
> + * destruction work items don't end up filling up max_active of system_wq
> + * which may lead to deadlock.
> + */
> +static struct workqueue_struct *cgroup_destroy_wq;
> +
> +/*
>   * Generate an array of cgroup subsystem pointers. At boot time, this is
>   * populated with the built in subsystems, and modular subsystems are
>   * registered after that. The mutable section of this array is protected by
> @@ -890,7 +898,7 @@ static void cgroup_free_rcu(struct rcu_h
>   struct cgroup *cgrp = container_of(head, struct cgroup, rcu_head);
>  
>   INIT_WORK(&cgrp->destroy_work, cgroup_free_fn);
> - schedule_work(&cgrp->destroy_work);
> + queue_work(cgroup_destroy_wq, &cgrp->destroy_work);
>  }
>  
>  static void cgroup_diput(struct dentry *dentry, struct inode *inode)
> @@ -4205,7 +4213,7 @@ static void css_release(struct percpu_re
>   struct cgroup_subsys_state *css =
>   container_of(ref, struct cgroup_subsys_state, refcnt);
>  
> - schedule_work(&css->dput_work);
> + queue_work(cgroup_destroy_wq, &css->dput_work);
>  }
>  
>  static void init_cgroup_css(struct cgroup_subsys_state *css,
> @@ -4439,7 +4447,7 @@ static void cgroup_css_killed(struct cgr
>  
>   /* percpu ref's of all css's are killed, kick off the next step */
>   INIT_WORK(&cgrp->destroy_work, cgroup_offline_fn);
> - schedule_work(&cgrp->destroy_work);
> + queue_work(cgroup_destroy_wq, &cgrp->destroy_work);
>  }
>  
>  static void css_ref_killed_fn(struct percpu_ref *ref)
> @@ -4967,6 +4975,22 @@ out:
>   return err;
>  }
>  
> +static int __init cgroup_destroy_wq_init(void)
> +{
> + /*
> +  * There isn't much point in executing destruction path in
> +  * parallel.  Good chunk is serialized with cgroup_mutex anyway.
> +  * Use 1 for @max_active.
> +  *
> +  * We would prefer to do this in cgroup_init() above, but that
> +  * is called before init_workqueues(): so leave this until after.
> +  */
> + cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
> + BUG_ON(!cgroup_destroy_wq);
> + return 0;
> +}
> +core_initcall(cgroup_destroy_wq_init);
> +
>  /*
>   * proc_cgroup_show()
>   *  - Print task's cgroup paths into seq_file, one line for each hierarchy

Thanks Tejun and Hugh.  Sorry for my late entry in getting around to
testing this fix. On the surface it sounds correct however I'd like to
test this on top of 3.10.* since that is what we'll likely be running.
I've tried to apply Hugh's patch above on top of 3.10.19 but it
appears there are a number of conflicts.  Looking over the changes and
my understanding of the problem I believe on 3.10 only the
cgroup_free_fn needs to be run in a separate workqueue.  Below is the
patch I've applied on top of 3.10.19, which I'm about to start
testing.  If it looks like I botched the backport in any way please
let me know so I can test a proper fix on top of 3.10.19.


Re: 3.10.16 cgroup_mutex deadlock

2013-11-17 Thread Hugh Dickins
On Fri, 15 Nov 2013, Tejun Heo wrote:

> Hello,
> 
> Shawn, Hugh, can you please verify whether the attached patch makes
> the deadlock go away?

Thanks a lot, Tejun: report below.

> 
> Thanks.
> 
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index e0839bc..dc9dc06 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -90,6 +90,14 @@ static DEFINE_MUTEX(cgroup_mutex);
>  static DEFINE_MUTEX(cgroup_root_mutex);
>  
>  /*
> + * cgroup destruction makes heavy use of work items and there can be a lot
> + * of concurrent destructions.  Use a separate workqueue so that cgroup
> + * destruction work items don't end up filling up max_active of system_wq
> + * which may lead to deadlock.
> + */
> +static struct workqueue_struct *cgroup_destroy_wq;
> +
> +/*
>   * Generate an array of cgroup subsystem pointers. At boot time, this is
>   * populated with the built in subsystems, and modular subsystems are
>   * registered after that. The mutable section of this array is protected by
> @@ -871,7 +879,7 @@ static void cgroup_free_rcu(struct rcu_head *head)
>   struct cgroup *cgrp = container_of(head, struct cgroup, rcu_head);
>  
>   INIT_WORK(&cgrp->destroy_work, cgroup_free_fn);
> - schedule_work(&cgrp->destroy_work);
> + queue_work(cgroup_destroy_wq, &cgrp->destroy_work);
>  }
>  
>  static void cgroup_diput(struct dentry *dentry, struct inode *inode)
> @@ -4254,7 +4262,7 @@ static void css_free_rcu_fn(struct rcu_head *rcu_head)
>* css_put().  dput() requires process context which we don't have.
>*/
>   INIT_WORK(&css->destroy_work, css_free_work_fn);
> - schedule_work(&css->destroy_work);
> + queue_work(cgroup_destroy_wq, &css->destroy_work);
>  }
>  
>  static void css_release(struct percpu_ref *ref)
> @@ -4544,7 +4552,7 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
>   container_of(ref, struct cgroup_subsys_state, refcnt);
>  
>   INIT_WORK(&css->destroy_work, css_killed_work_fn);
> - schedule_work(&css->destroy_work);
> + queue_work(cgroup_destroy_wq, &css->destroy_work);
>  }
>  
>  /**
> @@ -5025,6 +5033,17 @@ int __init cgroup_init(void)
>   if (err)
>   return err;
>  
> + /*
> +  * There isn't much point in executing destruction path in
> +  * parallel.  Good chunk is serialized with cgroup_mutex anyway.
> +  * Use 1 for @max_active.
> +  */
> + cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
> + if (!cgroup_destroy_wq) {
> + err = -ENOMEM;
> + goto out;
> + }
> +
>   for_each_builtin_subsys(ss, i) {
>   if (!ss->early_init)
>   cgroup_init_subsys(ss);
> @@ -5062,9 +5081,11 @@ int __init cgroup_init(void)
>   proc_create("cgroups", 0, NULL, &proc_cgroupstats_operations);
>  
>  out:
> - if (err)
> + if (err) {
> + if (cgroup_destroy_wq)
> + destroy_workqueue(cgroup_destroy_wq);
>   bdi_destroy(&cgroup_backing_dev_info);
> -
> + }
>   return err;
>  }
>  

Sorry for the delay: I was on the point of reporting success last
night, when I tried a debug kernel: and that didn't work so well
(got spinlock bad magic report in pwq_adjust_max_active(), and
tests wouldn't run at all).

Even the non-early cgroup_init() is called well before the
early_initcall init_workqueues(): though only the debug (lockdep
and spinlock debug) kernel appeared to have a problem with that.

Here's the patch I ended up with successfully on a 3.11.7-based
kernel (though below I've rediffed it against 3.11.8): the
schedule_work->queue_work hunks are slightly different on 3.11
than in your patch against current, and I did alloc_workqueue()
from a separate core_initcall.

The interval between cgroup_init and that is a bit of a worry;
but we don't seem to have suffered from the interval between
cgroup_init and init_workqueues before (when system_wq is NULL)
- though you may have more courage than I to reorder them!

Initially I backed out my system_highpri_wq workaround, and
verified that it was still easy to reproduce the problem with
one of our cgroup stresstests.  Yes it was, then your modified
patch below convincingly fixed it.

I ran with Johannes's patch adding extra mem_cgroup_reparent_charges:
as I'd expected, that didn't solve this issue (though it's worth
our keeping it in to rule out another source of problems).  And I
checked back on dumps of failures: they indeed show the tell-tale
256 kworkers doing cgroup_offline_fn, just as you predicted.

Thanks!
Hugh

---
 kernel/cgroup.c | 30 +++---
 1 file changed, 27 insertions(+), 3 deletions(-)

--- 3.11.8/kernel/cgroup.c  2013-11-17 17:40:54.200640692 -0800
+++ linux/kernel/cgroup.c   2013-11-17 17:43:10.876643941 -0800
@@ -89,6 +89,14 @@ static DEFINE_MUTEX(cgroup_mutex);
 static DEFINE_MUTEX(cgroup_root_mutex);
 
 /*
+ * cgroup destruction makes heavy use of work items and there can be a lot
+ * of concurrent destructions.  Use a separate workqueue so that cgroup
+ * destruction work items don't end up filling up max_active of system_wq
+ * which may lead to deadlock.
+ */


Re: 3.10.16 cgroup_mutex deadlock

2013-11-14 Thread Tejun Heo
Hello,

Shawn, Hugh, can you please verify whether the attached patch makes
the deadlock go away?

Thanks.

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index e0839bc..dc9dc06 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -90,6 +90,14 @@ static DEFINE_MUTEX(cgroup_mutex);
 static DEFINE_MUTEX(cgroup_root_mutex);
 
 /*
+ * cgroup destruction makes heavy use of work items and there can be a lot
+ * of concurrent destructions.  Use a separate workqueue so that cgroup
+ * destruction work items don't end up filling up max_active of system_wq
+ * which may lead to deadlock.
+ */
+static struct workqueue_struct *cgroup_destroy_wq;
+
+/*
  * Generate an array of cgroup subsystem pointers. At boot time, this is
  * populated with the built in subsystems, and modular subsystems are
  * registered after that. The mutable section of this array is protected by
@@ -871,7 +879,7 @@ static void cgroup_free_rcu(struct rcu_head *head)
struct cgroup *cgrp = container_of(head, struct cgroup, rcu_head);
 
 INIT_WORK(&cgrp->destroy_work, cgroup_free_fn);
-   schedule_work(&cgrp->destroy_work);
+   queue_work(cgroup_destroy_wq, &cgrp->destroy_work);
 }
 
 static void cgroup_diput(struct dentry *dentry, struct inode *inode)
@@ -4254,7 +4262,7 @@ static void css_free_rcu_fn(struct rcu_head *rcu_head)
 * css_put().  dput() requires process context which we don't have.
 */
 INIT_WORK(&css->destroy_work, css_free_work_fn);
-   schedule_work(&css->destroy_work);
+   queue_work(cgroup_destroy_wq, &css->destroy_work);
 }
 
 static void css_release(struct percpu_ref *ref)
@@ -4544,7 +4552,7 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
container_of(ref, struct cgroup_subsys_state, refcnt);
 
 INIT_WORK(&css->destroy_work, css_killed_work_fn);
-   schedule_work(&css->destroy_work);
+   queue_work(cgroup_destroy_wq, &css->destroy_work);
 }
 
 /**
@@ -5025,6 +5033,17 @@ int __init cgroup_init(void)
if (err)
return err;
 
+   /*
+* There isn't much point in executing destruction path in
+* parallel.  Good chunk is serialized with cgroup_mutex anyway.
+* Use 1 for @max_active.
+*/
+   cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
+   if (!cgroup_destroy_wq) {
+   err = -ENOMEM;
+   goto out;
+   }
+
for_each_builtin_subsys(ss, i) {
if (!ss->early_init)
cgroup_init_subsys(ss);
@@ -5062,9 +5081,11 @@ int __init cgroup_init(void)
 proc_create("cgroups", 0, NULL, &proc_cgroupstats_operations);
 
 out:
-   if (err)
+   if (err) {
+   if (cgroup_destroy_wq)
+   destroy_workqueue(cgroup_destroy_wq);
 bdi_destroy(&cgroup_backing_dev_info);
-
+   }
return err;
 }
 

Re: 3.10.16 cgroup_mutex deadlock

2013-11-14 Thread Tejun Heo
Hello,

On Thu, Nov 14, 2013 at 04:56:49PM -0600, Shawn Bohrer wrote:
> After running both concurrently on 40 machines for about 12 hours I've
> managed to reproduce the issue at least once, possibly more.  One
> machine looked identical to this reported issue.  It has a bunch of
> stuck cgroup_free_fn() kworker threads and one thread in cpuset_attach
> waiting on lru_add_drain_all().  A sysrq+l shows all CPUs are idle
> except for the one triggering the sysrq+l.  The sysrq+w unfortunately
> wrapped dmesg so we didn't get the stacks of all blocked tasks.  We
> did however also cat /proc/<pid>/stack of all kworker threads on the
> system.  There were 265 kworker threads that all have the following
> stack:

Umm... so, WQ_DFL_ACTIVE is 256.  It's just an arbitrarily largish
number which is supposed to serve as protection against runaway
kworker creation.  The assumption there is that there won't be a
dependency chain which can be longer than that and if there are it
should be separated out into a separate workqueue.  It looks like we
*can* have such long chain of dependency with high enough rate of
cgroup destruction.  kworkers trying to destroy cgroups get blocked by
an earlier one which is holding cgroup_mutex.  If the blocked ones
completely consume max_active and then the earlier one tries to
perform an operation which makes use of the system_wq, the forward
progress guarantee gets broken.

So, yeah, it makes sense now.  We're just gonna have to separate out
cgroup destruction to a separate workqueue.  Hugh's temp fix achieved
about the same effect by putting the affected part of destruction to a
different workqueue.  I probably should have realized that we were
hitting max_active when I was told that moving some part to a
different workqueue makes the problem go away.

Will send out a patch soon.

Thanks.

-- 
tejun


Re: 3.10.16 cgroup_mutex deadlock

2013-11-14 Thread Shawn Bohrer
On Tue, Nov 12, 2013 at 05:55:04PM +0100, Michal Hocko wrote:
> On Tue 12-11-13 09:55:30, Shawn Bohrer wrote:
> > On Tue, Nov 12, 2013 at 03:31:47PM +0100, Michal Hocko wrote:
> > > On Tue 12-11-13 18:17:20, Li Zefan wrote:
> > > > Cc more people
> > > > 
> > > > On 2013/11/12 6:06, Shawn Bohrer wrote:
> > > > > Hello,
> > > > > 
> > > > > This morning I had a machine running 3.10.16 go unresponsive but
> > > > > before we killed it we were able to get the information below.  I'm
> > > > > not an expert here but it looks like most of the tasks below are
> > > > > blocking waiting on the cgroup_mutex.  You can see that the
> > > > > resource_alloca:16502 task is holding the cgroup_mutex and that task
> > > > > appears to be waiting on a lru_add_drain_all() to complete.
> > > 
> > > Do you have sysrq+l output as well by any chance? That would tell
> > > us what the current CPUs are doing. Dumping all kworker stacks
> > > might be helpful as well. We know that lru_add_drain_all waits for
> > > schedule_on_each_cpu to return so it is waiting for workers to finish.
> > > I would be really curious why some of lru_add_drain_cpu cannot finish
> > > properly. The only reason would be that some work item(s) do not get CPU
> > > or somebody is holding lru_lock.
> > 
> > In fact the sys-admin did manage to fire off a sysrq+l, I've put all
> > of the info from the syslog below.  I've looked it over and I'm not
> > sure it reveals anything.  First looking at the timestamps it appears
> > we ran the sysrq+l 19.2 hours after the cgroup_mutex lockup I
> > previously sent.
> 
> I would expect sysrq+w would still show those kworkers blocked on the
> same cgroup mutex?

Yes, I believe so.

> > I also have atop logs over that whole time period
> > that show hundreds of zombie processes which to me indicates that over
> > that 19.2 hours systemd remained wedged on the cgroup_mutex.  Looking
> > at the backtraces from the sysrq+l it appears most of the CPUs were
> > idle
> 
> Right so either we managed to sleep with the lru_lock held which sounds
> a bit improbable - but who knows - or there is some other problem. I
> would expect the latter to be true.
> 
> lru_add_drain executes per-cpu with preemption disabled; this means that
> its work item cannot be preempted, so the only logical explanation seems
> to be that the work item has never got scheduled.

Meaning you think there would be no kworker thread for the
lru_add_drain at this point?  If so you might be correct.

> OK. In case the issue happens again. It would be very helpful to get the
> kworker and per-cpu stacks. Maybe Tejun can help with some waitqueue
> debugging tricks.

I set up one of my test pools with two scripts trying to reproduce the
problem.  One essentially puts tasks into several cpuset groups that
have cpuset.memory_migrate set, then takes them back out.  It also
occasionally switches cpuset.mems in those groups to try to keep the
memory of those tasks migrating between nodes.  The second script is:

$ cat /home/hbi/cgroup_mutex_cgroup_maker.sh 
#!/bin/bash

session_group=$(ps -o pid,cmd,cgroup -p $$ | grep -E 'c[0-9]+' -o)
cd /sys/fs/cgroup/systemd/user/hbi/${session_group}
pwd

while true; do
    for x in $(seq 1 1000); do
        mkdir $x
        echo $$ > ${x}/tasks
        echo $$ > tasks
        rmdir $x
    done
    sleep .1
    date
done
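
The first script is not included above; the following is a minimal sketch of what it might have looked like, based on the description (a cpuset group with cpuset.memory_migrate enabled, tasks moved in and out, and cpuset.mems switched between nodes). The mount point, group name and node numbers are assumptions rather than the actual script:

#!/bin/bash
# Hypothetical cpuset exerciser (a sketch, not the original script): keeps a
# task's memory migrating between NUMA nodes by flipping cpuset.mems while
# cpuset.memory_migrate is enabled.  Assumes a v1 cpuset hierarchy mounted at
# /sys/fs/cgroup/cpuset and at least two memory nodes (0 and 1).
cs=/sys/fs/cgroup/cpuset/migrate_test
mkdir -p "$cs"
cat /sys/fs/cgroup/cpuset/cpuset.cpus > "$cs/cpuset.cpus"
echo 0-1 > "$cs/cpuset.mems"
echo 1 > "$cs/cpuset.memory_migrate"

while true; do
    echo $$ > "$cs/tasks"                     # move this task into the group
    echo $((RANDOM % 2)) > "$cs/cpuset.mems"  # switch mems to trigger migration
    echo $$ > /sys/fs/cgroup/cpuset/tasks     # and move it back out
    sleep .1
done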

After running both concurrently on 40 machines for about 12 hours I've
managed to reproduce the issue at least once, possibly more.  One
machine looked identical to this reported issue.  It has a bunch of
stuck cgroup_free_fn() kworker threads and one thread in cpuset_attach
waiting on lru_add_drain_all().  A sysrq+l shows all CPUs are idle
except for the one triggering the sysrq+l.  The sysrq+w unfortunately
wrapped dmesg so we didn't get the stacks of all blocked tasks.  We
did however also cat /proc/<pid>/stack of all kworker threads on the
system.  There were 265 kworker threads that all have the following
stack:

[kworker/2:1]
[] cgroup_free_fn+0x2c/0x120
[] process_one_work+0x174/0x490
[] worker_thread+0x11c/0x370
[] kthread+0xc0/0xd0
[] ret_from_fork+0x7c/0xb0
[] 0x

And there were another 101 that had stacks like the following:

[kworker/0:0]
[] worker_thread+0x1bf/0x370
[] kthread+0xc0/0xd0
[] ret_from_fork+0x7c/0xb0
[] 0x

That's it.  Again I'm not sure if that is helpful at all but it seems
to imply that the lru_add_drain_work was not scheduled.
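
For reference, a rough sketch of how per-kworker stacks like those above can be collected and tallied; the exact commands are an assumption (it needs root and a kernel that exposes /proc/<pid>/stack):

#!/bin/bash
# Save the kernel stack of every kworker thread, then count how many are
# sitting in cgroup_free_fn -- 256 of them is enough to exhaust the default
# max_active (WQ_DFL_ACTIVE) of system_wq.
mkdir -p /tmp/kworker-stacks
for pid in $(pgrep kworker); do
    cat "/proc/$pid/stack" > "/tmp/kworker-stacks/$pid" 2>/dev/null
done
grep -l cgroup_free_fn /tmp/kworker-stacks/* | wc -l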

I also managed to kill another two machines running my test.  One of
them we didn't get anything out of, and the other looks like I
deadlocked on the css_set_lock lock.  I'll follow up with the
css_set_lock deadlock in another email since it doesn't look related
to this one.  But it does seem that I can probably reproduce this if
anyone has some debugging ideas.

--
Shawn


Re: 3.10.16 cgroup_mutex deadlock

2013-11-12 Thread Michal Hocko
On Tue 12-11-13 09:55:30, Shawn Bohrer wrote:
> On Tue, Nov 12, 2013 at 03:31:47PM +0100, Michal Hocko wrote:
> > On Tue 12-11-13 18:17:20, Li Zefan wrote:
> > > Cc more people
> > > 
> > > On 2013/11/12 6:06, Shawn Bohrer wrote:
> > > > Hello,
> > > > 
> > > > This morning I had a machine running 3.10.16 go unresponsive but
> > > > before we killed it we were able to get the information below.  I'm
> > > > not an expert here but it looks like most of the tasks below are
> > > > blocking waiting on the cgroup_mutex.  You can see that the
> > > > resource_alloca:16502 task is holding the cgroup_mutex and that task
> > > > appears to be waiting on a lru_add_drain_all() to complete.
> > 
> > Do you have sysrq+l output as well by any chance? That would tell
> > us what the current CPUs are doing. Dumping all kworker stacks
> > might be helpful as well. We know that lru_add_drain_all waits for
> > schedule_on_each_cpu to return so it is waiting for workers to finish.
> > I would be really curious why some of lru_add_drain_cpu cannot finish
> > properly. The only reason would be that some work item(s) do not get CPU
> > or somebody is holding lru_lock.
> 
> In fact the sys-admin did manage to fire off a sysrq+l, I've put all
> of the info from the syslog below.  I've looked it over and I'm not
> sure it reveals anything.  First looking at the timestamps it appears
> we ran the sysrq+l 19.2 hours after the cgroup_mutex lockup I
> previously sent.

I would expect sysrq+w would still show those kworkers blocked on the
same cgroup mutex?

> I also have atop logs over that whole time period
> that show hundreds of zombie processes which to me indicates that over
> that 19.2 hours systemd remained wedged on the cgroup_mutex.  Looking
> at the backtraces from the sysrq+l it appears most of the CPUs were
> idle

Right, so either we managed to sleep with the lru_lock held, which sounds
a bit improbable - but who knows - or there is some other problem. I
would expect the latter to be true.

lru_add_drain executes per-cpu with preemption disabled, which means that
its work item cannot be preempted, so the only logical explanation seems
to be that the work item never got scheduled.
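
For reference, the whole path is quite short; roughly, from mm/swap.c of that era (comments added for this sketch):

static void lru_add_drain_per_cpu(struct work_struct *dummy)
{
	lru_add_drain();
}

int lru_add_drain_all(void)
{
	/* one work item per online CPU; returns only after all have run */
	return schedule_on_each_cpu(lru_add_drain_per_cpu);
}

void lru_add_drain(void)
{
	/* get_cpu() disables preemption while this CPU's pagevecs drain */
	lru_add_drain_cpu(get_cpu());
	put_cpu();
}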

> except there are a few where ptpd is trying to step the clock
> with clock_settime.  The ptpd process also appears to get stuck for a
> bit but it looks like it recovers because it moves CPUs and the
> previous CPUs become idle.

It gets a soft lockup because it is waiting for its own IPIs, which got
preempted by the NMI trace dumper. But this is unrelated.

> The fact that ptpd is stepping the clock
> at all at this time means that timekeeping is a mess at this point and
> the system clock is way out of sync.  There are also a few of these
> NMI messages in there that I don't understand but at this point the
> machine was a sinking ship.
> 
> Nov 11 07:03:29 sydtest0 kernel: [764305.327043] Uhhuh. NMI received for 
> unknown reason 21 on CPU 26.
> Nov 11 07:03:29 sydtest0 kernel: [764305.327043] Do you have a strange power 
> saving mode enabled?
> Nov 11 07:03:29 sydtest0 kernel: [764305.327043] Dazed and confused, but 
> trying to continue
> Nov 11 07:03:29 sydtest0 kernel: [764305.327143] Uhhuh. NMI received for 
> unknown reason 31 on CPU 27.
> Nov 11 07:03:29 sydtest0 kernel: [764305.327144] Do you have a strange power 
> saving mode enabled?
> Nov 11 07:03:29 sydtest0 kernel: [764305.327144] Dazed and confused, but 
> trying to continue
> Nov 11 07:03:29 sydtest0 kernel: [764305.327242] Uhhuh. NMI received for 
> unknown reason 31 on CPU 28.
> Nov 11 07:03:29 sydtest0 kernel: [764305.327242] Do you have a strange power 
> saving mode enabled?
> Nov 11 07:03:29 sydtest0 kernel: [764305.327243] Dazed and confused, but 
> trying to continue
> 
> Perhaps there is another task blocking somewhere holding the lru_lock, but at
> this point the machine has been rebooted so I'm not sure how we'd figure out
> what task that might be. Anyway here is the full output of sysrq+l plus
> whatever else ended up in the syslog.

OK. In case the issue happens again, it would be very helpful to get the
kworker and per-cpu stacks. Maybe Tejun can help with some waitqueue
debugging tricks.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.10.16 cgroup_mutex deadlock

2013-11-12 Thread Michal Hocko
On Tue 12-11-13 18:17:20, Li Zefan wrote:
> Cc more people
> 
> On 2013/11/12 6:06, Shawn Bohrer wrote:
> > Hello,
> > 
> > This morning I had a machine running 3.10.16 go unresponsive but
> > before we killed it we were able to get the information below.  I'm
> > not an expert here but it looks like most of the tasks below are
> > blocking waiting on the cgroup_mutex.  You can see that the
> > resource_alloca:16502 task is holding the cgroup_mutex and that task
> > appears to be waiting on a lru_add_drain_all() to complete.

Do you have sysrq+l output as well by any chance? That would tell
us what the current CPUs are doing. Dumping all kworker stacks
might be helpful as well. We know that lru_add_drain_all waits for
schedule_on_each_cpu to return so it is waiting for workers to finish.
I would be really curious why some of lru_add_drain_cpu cannot finish
properly. The only reason would be that some work item(s) do not get CPU
or somebody is holding lru_lock.
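
For context, schedule_on_each_cpu() in that era's kernel/workqueue.c is roughly the following, which is why lru_add_drain_all() cannot return until every per-CPU work item has actually run:

int schedule_on_each_cpu(work_func_t func)
{
	int cpu;
	struct work_struct __percpu *works;

	works = alloc_percpu(struct work_struct);
	if (!works)
		return -ENOMEM;

	get_online_cpus();

	for_each_online_cpu(cpu) {
		struct work_struct *work = per_cpu_ptr(works, cpu);

		INIT_WORK(work, func);
		schedule_work_on(cpu, work);	/* queued on that CPU's system_wq pool */
	}

	for_each_online_cpu(cpu)
		flush_work(per_cpu_ptr(works, cpu));	/* blocks until each item has run */

	put_online_cpus();
	free_percpu(works);
	return 0;
}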

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.10.16 cgroup_mutex deadlock

2013-11-12 Thread Li Zefan
Cc more people

On 2013/11/12 6:06, Shawn Bohrer wrote:
> Hello,
> 
> This morning I had a machine running 3.10.16 go unresponsive but
> before we killed it we were able to get the information below.  I'm
> not an expert here but it looks like most of the tasks below are
> blocking waiting on the cgroup_mutex.  You can see that the
> resource_alloca:16502 task is holding the cgroup_mutex and that task
> appears to be waiting on a lru_add_drain_all() to complete.

Ouch, another bug report!

This looks like the same bug that Hugh saw.
(http://permalink.gmane.org/gmane.linux.kernel.cgroups/9351)

What's new in your report is that the lru_add_drain_all() comes from
cpuset_attach() instead of memcg. Moreover, I thought it was a
3.11-specific bug.

> 
> Initially I thought the deadlock might simply be that the per cpu
> workqueue work from lru_add_drain_all() is stuck waiting on the
> cgroup_free_fn to complete.  However I've read
> Documentation/workqueue.txt and it sounds like the current workqueue
> has multiple kworker threads per cpu and thus this should not happen.
> Both the cgroup_free_fn work and lru_add_drain_all() work run on the
> system_wq which has max_active set to 0 so I believe multiple kworker
> threads should run.  This also appears to be true since all of the
> cgroup_free_fn are running on kworker/12 thread and there are multiple
> blocked.
> 
> Perhaps someone with more experience in the cgroup and workqueue code
> can look at the stacks below and identify the problem, or explain why
> the lru_add_drain_all() work has not completed:
> 
> 
> [694702.013850] INFO: task systemd:1 blocked for more than 120 seconds.
> [694702.015794] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [694702.018217] systemd D 81607820 0 1  0 
> 0x
> [694702.020505]  88041dcc1d78 0086 88041dc7f100 
> 8110ad54
> [694702.023006]  0001 88041dc78000 88041dcc1fd8 
> 88041dcc1fd8
> [694702.025508]  88041dcc1fd8 88041dc78000 88041a1e8698 
> 81a417c0
> [694702.028011] Call Trace:
> [694702.028788]  [] ? vma_merge+0x124/0x330
> [694702.030468]  [] schedule+0x29/0x70
> [694702.032011]  [] schedule_preempt_disabled+0xe/0x10
> [694702.033982]  [] __mutex_lock_slowpath+0x112/0x1b0
> [694702.035926]  [] ? kmem_cache_alloc_trace+0x12d/0x160
> [694702.037948]  [] mutex_lock+0x2a/0x50
> [694702.039546]  [] proc_cgroup_show+0x67/0x1d0
> [694702.041330]  [] seq_read+0x16b/0x3e0
> [694702.042927]  [] vfs_read+0xb0/0x180
> [694702.044498]  [] SyS_read+0x52/0xa0
> [694702.046042]  [] system_call_fastpath+0x16/0x1b
> [694702.047917] INFO: task kworker/12:1:203 blocked for more than 120 seconds.
> [694702.050044] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [694702.052467] kworker/12:1D  0   203  2 
> 0x
> [694702.054756] Workqueue: events cgroup_free_fn
> [694702.056139]  88041bc1fcf8 0046 88038e7b46a0 
> 00030001
> [694702.058642]  88041bc1fd84 88041da6e9f0 88041bc1ffd8 
> 88041bc1ffd8
> [694702.061144]  88041bc1ffd8 88041da6e9f0 0087 
> 81a417c0
> [694702.063647] Call Trace:
> [694702.064423]  [] schedule+0x29/0x70
> [694702.065966]  [] schedule_preempt_disabled+0xe/0x10
> [694702.067936]  [] __mutex_lock_slowpath+0x112/0x1b0
> [694702.069879]  [] mutex_lock+0x2a/0x50
> [694702.071476]  [] cgroup_free_fn+0x2c/0x120
> [694702.073209]  [] process_one_work+0x174/0x490
> [694702.075019]  [] worker_thread+0x11c/0x370
> [694702.076748]  [] ? manage_workers+0x2c0/0x2c0
> [694702.078560]  [] kthread+0xc0/0xd0
> [694702.080078]  [] ? flush_kthread_worker+0xb0/0xb0
> [694702.081995]  [] ret_from_fork+0x7c/0xb0
> [694702.083671]  [] ? flush_kthread_worker+0xb0/0xb0
> [694702.085595] INFO: task systemd-logind:2885 blocked for more than 120 
> seconds.
> [694702.087801] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [694702.090225] systemd-logind  D 81607820 0  2885  1 
> 0x
> [694702.092513]  88041ac6fd88 0082 88041dd8aa60 
> 88041d9bc1a8
> [694702.095014]  88041ac6fda0 88041cac9530 88041ac6ffd8 
> 88041ac6ffd8
> [694702.097517]  88041ac6ffd8 88041cac9530 0c36 
> 81a417c0
> [694702.100019] Call Trace:
> [694702.100793]  [] schedule+0x29/0x70
> [694702.102338]  [] schedule_preempt_disabled+0xe/0x10
> [694702.104309]  [] __mutex_lock_slowpath+0x112/0x1b0
> [694702.198316]  [] mutex_lock+0x2a/0x50
> [694702.292456]  [] cgroup_lock_live_group+0x1d/0x40
> [694702.386833]  [] cgroup_mkdir+0xa8/0x4b0
> [694702.480679]  [] vfs_mkdir+0x84/0xd0
> [694702.574124]  [] SyS_mkdirat+0x5e/0xe0
> [694702.666986]  [] SyS_mkdir+0x19/0x20
> [694702.758969]  [] system_call_fastpath+0x16/0x1b
> [694702.848295] INFO: task kworker/12:2:11512 blocked for more than 120 
> seconds.
> 


3.10.16 cgroup_mutex deadlock

2013-11-11 Thread Shawn Bohrer
Hello,

This morning I had a machine running 3.10.16 go unresponsive but
before we killed it we were able to get the information below.  I'm
not an expert here but it looks like most of the tasks below are
blocking waiting on the cgroup_mutex.  You can see that the
resource_alloca:16502 task is holding the cgroup_mutex and that task
appears to be waiting on a lru_add_drain_all() to complete.

Initially I thought the deadlock might simply be that the per cpu
workqueue work from lru_add_drain_all() is stuck waiting on the
cgroup_free_fn to complete.  However I've read
Documentation/workqueue.txt and it sounds like the current workqueue
has multiple kworker threads per cpu and thus this should not happen.
Both the cgroup_free_fn work and lru_add_drain_all() work run on the
system_wq which has max_active set to 0 so I believe multiple kworker
threads should run.  This also appears to be true since all of the
cgroup_free_fn are running on kworker/12 thread and there are multiple
blocked.
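
For reference, and assuming the 3.10-era workqueue code, a max_active of 0 at creation time simply means "use the default per-CPU limit" of 256, so a single stuck work item should not by itself keep other system_wq items from running; roughly:

/*
 * Roughly, from include/linux/workqueue.h and kernel/workqueue.c of that
 * era; a max_active of 0 is replaced with the default limit:
 */
#define WQ_MAX_ACTIVE	512			/* hard per-CPU limit */
#define WQ_DFL_ACTIVE	(WQ_MAX_ACTIVE / 2)	/* 256 */

	system_wq = alloc_workqueue("events", 0, 0);	/* in init_workqueues() */

	/* in __alloc_workqueue_key(): */
	max_active = max_active ?: WQ_DFL_ACTIVE;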

Perhaps someone with more experience in the cgroup and workqueue code
can look at the stacks below and identify the problem, or explain why
the lru_add_drain_all() work has not completed:


[694702.013850] INFO: task systemd:1 blocked for more than 120 seconds.
[694702.015794] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
[694702.018217] systemd D 81607820 0 1  0 0x
[694702.020505]  88041dcc1d78 0086 88041dc7f100 
8110ad54
[694702.023006]  0001 88041dc78000 88041dcc1fd8 
88041dcc1fd8
[694702.025508]  88041dcc1fd8 88041dc78000 88041a1e8698 
81a417c0
[694702.028011] Call Trace:
[694702.028788]  [] ? vma_merge+0x124/0x330
[694702.030468]  [] schedule+0x29/0x70
[694702.032011]  [] schedule_preempt_disabled+0xe/0x10
[694702.033982]  [] __mutex_lock_slowpath+0x112/0x1b0
[694702.035926]  [] ? kmem_cache_alloc_trace+0x12d/0x160
[694702.037948]  [] mutex_lock+0x2a/0x50
[694702.039546]  [] proc_cgroup_show+0x67/0x1d0
[694702.041330]  [] seq_read+0x16b/0x3e0
[694702.042927]  [] vfs_read+0xb0/0x180
[694702.044498]  [] SyS_read+0x52/0xa0
[694702.046042]  [] system_call_fastpath+0x16/0x1b
[694702.047917] INFO: task kworker/12:1:203 blocked for more than 120 seconds.
[694702.050044] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
[694702.052467] kworker/12:1D  0   203  2 0x
[694702.054756] Workqueue: events cgroup_free_fn
[694702.056139]  88041bc1fcf8 0046 88038e7b46a0 
00030001
[694702.058642]  88041bc1fd84 88041da6e9f0 88041bc1ffd8 
88041bc1ffd8
[694702.061144]  88041bc1ffd8 88041da6e9f0 0087 
81a417c0
[694702.063647] Call Trace:
[694702.064423]  [] schedule+0x29/0x70
[694702.065966]  [] schedule_preempt_disabled+0xe/0x10
[694702.067936]  [] __mutex_lock_slowpath+0x112/0x1b0
[694702.069879]  [] mutex_lock+0x2a/0x50
[694702.071476]  [] cgroup_free_fn+0x2c/0x120
[694702.073209]  [] process_one_work+0x174/0x490
[694702.075019]  [] worker_thread+0x11c/0x370
[694702.076748]  [] ? manage_workers+0x2c0/0x2c0
[694702.078560]  [] kthread+0xc0/0xd0
[694702.080078]  [] ? flush_kthread_worker+0xb0/0xb0
[694702.081995]  [] ret_from_fork+0x7c/0xb0
[694702.083671]  [] ? flush_kthread_worker+0xb0/0xb0
[694702.085595] INFO: task systemd-logind:2885 blocked for more than 120 
seconds.
[694702.087801] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
[694702.090225] systemd-logind  D 81607820 0  2885  1 0x
[694702.092513]  88041ac6fd88 0082 88041dd8aa60 
88041d9bc1a8
[694702.095014]  88041ac6fda0 88041cac9530 88041ac6ffd8 
88041ac6ffd8
[694702.097517]  88041ac6ffd8 88041cac9530 0c36 
81a417c0
[694702.100019] Call Trace:
[694702.100793]  [] schedule+0x29/0x70
[694702.102338]  [] schedule_preempt_disabled+0xe/0x10
[694702.104309]  [] __mutex_lock_slowpath+0x112/0x1b0
[694702.198316]  [] mutex_lock+0x2a/0x50
[694702.292456]  [] cgroup_lock_live_group+0x1d/0x40
[694702.386833]  [] cgroup_mkdir+0xa8/0x4b0
[694702.480679]  [] vfs_mkdir+0x84/0xd0
[694702.574124]  [] SyS_mkdirat+0x5e/0xe0
[694702.666986]  [] SyS_mkdir+0x19/0x20
[694702.758969]  [] system_call_fastpath+0x16/0x1b
[694702.848295] INFO: task kworker/12:2:11512 blocked for more than 120 seconds.
[694702.935749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
[694703.023603] kworker/12:2D 816079c0 0 11512  2 0x
[694703.109993] Workqueue: events cgroup_free_fn
[694703.193213]  88041b9dfcf8 0046 88041da6e9f0 
ea00106fd240
[694703.278353]  88041f803c00 8803824254c0 88041b9dffd8 
88041b9dffd8
[694703.363757]  88041b9dffd8 8803824254c0 001f17887bb1 
81a417c0
[694703.448550] Call Trace:
[694703.531773]  [] 
