Re: 3.10.16 cgroup_mutex deadlock
On 2013/12/2 18:31, William Dauchy wrote:
> Hi Li,
>
> On Mon, Nov 25, 2013 at 2:20 AM, Li Zefan wrote:
>> I'll do this after the patch hits mainline, if Tejun doesn't plan to.
>
> Do you have some news about it?

Tejun has already done the backport. :) It has been included in 3.10.22,
which will be released in a couple of days.

http://article.gmane.org/gmane.linux.kernel.stable/71292

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: 3.10.16 cgroup_mutex deadlock
Hi Li,

On Mon, Nov 25, 2013 at 2:20 AM, Li Zefan wrote:
> I'll do this after the patch hits mainline, if Tejun doesn't plan to.

Do you have some news about it?

--
William
Re: 3.10.16 cgroup_mutex deadlock
On 2013/11/23 6:54, William Dauchy wrote:
> Hi Tejun,
>
> On Fri, Nov 22, 2013 at 11:18 PM, Tejun Heo wrote:
>> Just applied to cgroup/for-3.13-fixes w/ stable cc'd. Will push to
>> Linus next week.
>
> Thank you for your quick reply. Do you also have a backport for
> v3.10.x already available?

I'll do this after the patch hits mainline, if Tejun doesn't plan to.
Re: 3.10.16 cgroup_mutex deadlock
Hi Tejun,

On Fri, Nov 22, 2013 at 11:18 PM, Tejun Heo wrote:
> Just applied to cgroup/for-3.13-fixes w/ stable cc'd. Will push to
> Linus next week.

Thank you for your quick reply. Do you also have a backport for
v3.10.x already available?

Best regards,
--
William
Re: 3.10.16 cgroup_mutex deadlock
On Fri, Nov 22, 2013 at 09:59:37PM +0100, William Dauchy wrote:
> Hugh, Tejun,
>
> Do we have some news about this patch? I'm also hitting this bug on a 3.10.x

Just applied to cgroup/for-3.13-fixes w/ stable cc'd. Will push to
Linus next week.

Thanks.

--
tejun
Re: 3.10.16 cgroup_mutex deadlock
On Mon, Nov 18, 2013 at 3:17 AM, Hugh Dickins wrote:
> Sorry for the delay: I was on the point of reporting success last
> night, when I tried a debug kernel: and that didn't work so well
> (got spinlock bad magic report in pwq_adjust_max_active(), and
> tests wouldn't run at all).
>
> Even the non-early cgroup_init() is called well before the
> early_initcall init_workqueues(): though only the debug (lockdep
> and spinlock debug) kernel appeared to have a problem with that.
>
> Here's the patch I ended up with successfully on a 3.11.7-based
> kernel (though below I've rediffed it against 3.11.8): the
> schedule_work->queue_work hunks are slightly different on 3.11
> than in your patch against current, and I did alloc_workqueue()
> from a separate core_initcall.
>
> The interval between cgroup_init and that is a bit of a worry;
> but we don't seem to have suffered from the interval between
> cgroup_init and init_workqueues before (when system_wq is NULL)
> - though you may have more courage than I to reorder them!
>
> Initially I backed out my system_highpri_wq workaround, and
> verified that it was still easy to reproduce the problem with
> one of our cgroup stresstests. Yes it was, then your modified
> patch below convincingly fixed it.
>
> I ran with Johannes's patch adding extra mem_cgroup_reparent_charges:
> as I'd expected, that didn't solve this issue (though it's worth
> our keeping it in to rule out another source of problems). And I
> checked back on dumps of failures: they indeed show the tell-tale
> 256 kworkers doing cgroup_offline_fn, just as you predicted.

Hugh, Tejun,

Do we have some news about this patch? I'm also hitting this bug on a 3.10.x

Thanks,
--
William
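[Editor's note] The failure mode Hugh's dumps show (256 kworkers stuck in cgroup_offline_fn, saturating max_active of system_wq) can be sketched in userspace. This is an illustrative Python analogue only, not kernel code; the pool, `gate`, and event names are invented stand-ins for system_wq and the held cgroup_mutex:

```python
# Userspace sketch: a bounded pool whose every slot is held by a blocked
# work item cannot start any further work queued on the same pool.
from concurrent.futures import ThreadPoolExecutor
import threading
import time

pool = ThreadPoolExecutor(max_workers=4)   # stands in for system_wq's max_active
gate = threading.Event()                   # stands in for the held cgroup_mutex

# Fill every slot with items that block waiting on the shared resource.
blockers = [pool.submit(gate.wait) for _ in range(4)]
time.sleep(0.2)                            # let all four occupy their slots

started = threading.Event()
extra = pool.submit(started.set)           # one more item on the saturated pool

stuck = not started.wait(timeout=0.5)      # the extra item cannot run yet
gate.set()                                 # release the blockers
ran = started.wait(timeout=5.0)            # now it runs
print(stuck, ran)
```

In the kernel case the "extra item" is itself part of releasing the resource, which is why the real situation deadlocks rather than merely stalling.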
Re: 3.10.16 cgroup_mutex deadlock
On Tue, Nov 19, 2013 at 10:55:18AM +0800, Li Zefan wrote:
>> Thanks Tejun and Hugh. Sorry for my late entry in getting around to
>> testing this fix. On the surface it sounds correct; however, I'd like to
>> test this on top of 3.10.* since that is what we'll likely be running.
>> I've tried to apply Hugh's patch above on top of 3.10.19, but it
>> appears there are a number of conflicts. Looking over the changes and
>> my understanding of the problem, I believe on 3.10 only
>> cgroup_free_fn needs to be run in a separate workqueue. Below is the
>> patch I've applied on top of 3.10.19, which I'm about to start
>> testing. If it looks like I botched the backport in any way, please
>> let me know so I can test a proper fix on top of 3.10.19.
>
> You didn't move css free_work to the dedicated wq as Tejun's patch does.
> css free_work won't acquire cgroup_mutex, but when destroying a lot of
> cgroups, we can have a lot of css free_work in the workqueue, so I'd
> suggest you also use cgroup_destroy_wq for it.

Well, I didn't move the css free_work, but I did test the patch I posted
on top of 3.10.19 and I am unable to reproduce the lockup, so it appears
my patch was sufficient for 3.10.*. Hopefully we can get this fix applied
and backported into stable.

Thanks,
Shawn
Re: 3.10.16 cgroup_mutex deadlock
> Thanks Tejun and Hugh. Sorry for my late entry in getting around to
> testing this fix. On the surface it sounds correct; however, I'd like to
> test this on top of 3.10.* since that is what we'll likely be running.
> I've tried to apply Hugh's patch above on top of 3.10.19, but it
> appears there are a number of conflicts. Looking over the changes and
> my understanding of the problem, I believe on 3.10 only
> cgroup_free_fn needs to be run in a separate workqueue. Below is the
> patch I've applied on top of 3.10.19, which I'm about to start
> testing. If it looks like I botched the backport in any way, please
> let me know so I can test a proper fix on top of 3.10.19.

You didn't move css free_work to the dedicated wq as Tejun's patch does.
css free_work won't acquire cgroup_mutex, but when destroying a lot of
cgroups, we can have a lot of css free_work in the workqueue, so I'd
suggest you also use cgroup_destroy_wq for it.
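[Editor's note] Li's suggestion (route every piece of the destruction path, including the many css free_work items, through the dedicated cgroup_destroy_wq) can be mimicked in userspace. A minimal Python sketch under invented names (`free_css`, `offline_cgroup`, the two executors), showing a burst of destruction items draining through a dedicated single-worker queue without occupying any slot of the shared pool:

```python
from concurrent.futures import ThreadPoolExecutor
import threading

system_wq = ThreadPoolExecutor(max_workers=4)          # shared pool, bounded
cgroup_destroy_wq = ThreadPoolExecutor(max_workers=1)  # dedicated, max_active = 1

cgroup_mutex = threading.Lock()
freed = []

def free_css(name):
    # Stand-in for css free_work: cheap, but there can be very many of
    # these at once when a lot of cgroups are destroyed together.
    freed.append(name)

def offline_cgroup(name):
    # Stand-in for cgroup_offline_fn: serialized under cgroup_mutex, then
    # queues the follow-up free work on the same dedicated queue.
    with cgroup_mutex:
        cgroup_destroy_wq.submit(free_css, name)

futures = [cgroup_destroy_wq.submit(offline_cgroup, f"cg{i}") for i in range(256)]
for f in futures:
    f.result()
cgroup_destroy_wq.shutdown(wait=True)   # drain the queued free work
print(len(freed))
```

The point of the design choice is that none of these 256 + 256 items ever touch `system_wq`, so unrelated work that needs `cgroup_mutex` can still run there.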
Re: 3.10.16 cgroup_mutex deadlock
On Sun, Nov 17, 2013 at 06:17:17PM -0800, Hugh Dickins wrote:
> Sorry for the delay: I was on the point of reporting success last
> night, when I tried a debug kernel: and that didn't work so well
> (got spinlock bad magic report in pwq_adjust_max_active(), and
> tests wouldn't run at all).
>
> Even the non-early cgroup_init() is called well before the
> early_initcall init_workqueues(): though only the debug (lockdep
> and spinlock debug) kernel appeared to have a problem with that.
>
> Here's the patch I ended up with successfully on a 3.11.7-based
> kernel (though below I've rediffed it against 3.11.8): the
> schedule_work->queue_work hunks are slightly different on 3.11
> than in your patch against current, and I did alloc_workqueue()
> from a separate core_initcall.
>
> The interval between cgroup_init and that is a bit of a worry;
> but we don't seem to have suffered from the interval between
> cgroup_init and init_workqueues before (when system_wq is NULL)
> - though you may have more courage than I to reorder them!
>
> Initially I backed out my system_highpri_wq workaround, and
> verified that it was still easy to reproduce the problem with
> one of our cgroup stresstests. Yes it was, then your modified
> patch below convincingly fixed it.
>
> I ran with Johannes's patch adding extra mem_cgroup_reparent_charges:
> as I'd expected, that didn't solve this issue (though it's worth
> our keeping it in to rule out another source of problems). And I
> checked back on dumps of failures: they indeed show the tell-tale
> 256 kworkers doing cgroup_offline_fn, just as you predicted.
>
> Thanks!
> Hugh
>
> ---
>  kernel/cgroup.c |   30 +++++++++++++++++++++++++++---
>  1 file changed, 27 insertions(+), 3 deletions(-)
>
> --- 3.11.8/kernel/cgroup.c	2013-11-17 17:40:54.200640692 -0800
> +++ linux/kernel/cgroup.c	2013-11-17 17:43:10.876643941 -0800
> @@ -89,6 +89,14 @@ static DEFINE_MUTEX(cgroup_mutex);
>  static DEFINE_MUTEX(cgroup_root_mutex);
>  
>  /*
> + * cgroup destruction makes heavy use of work items and there can be a lot
> + * of concurrent destructions.  Use a separate workqueue so that cgroup
> + * destruction work items don't end up filling up max_active of system_wq
> + * which may lead to deadlock.
> + */
> +static struct workqueue_struct *cgroup_destroy_wq;
> +
> +/*
>   * Generate an array of cgroup subsystem pointers. At boot time, this is
>   * populated with the built in subsystems, and modular subsystems are
>   * registered after that. The mutable section of this array is protected by
> @@ -890,7 +898,7 @@ static void cgroup_free_rcu(struct rcu_h
>  	struct cgroup *cgrp = container_of(head, struct cgroup, rcu_head);
>  
>  	INIT_WORK(&cgrp->destroy_work, cgroup_free_fn);
> -	schedule_work(&cgrp->destroy_work);
> +	queue_work(cgroup_destroy_wq, &cgrp->destroy_work);
>  }
>  
>  static void cgroup_diput(struct dentry *dentry, struct inode *inode)
> @@ -4205,7 +4213,7 @@ static void css_release(struct percpu_re
>  	struct cgroup_subsys_state *css =
>  		container_of(ref, struct cgroup_subsys_state, refcnt);
>  
> -	schedule_work(&css->dput_work);
> +	queue_work(cgroup_destroy_wq, &css->dput_work);
>  }
>  
>  static void init_cgroup_css(struct cgroup_subsys_state *css,
> @@ -4439,7 +4447,7 @@ static void cgroup_css_killed(struct cgr
>  
>  	/* percpu ref's of all css's are killed, kick off the next step */
>  	INIT_WORK(&cgrp->destroy_work, cgroup_offline_fn);
> -	schedule_work(&cgrp->destroy_work);
> +	queue_work(cgroup_destroy_wq, &cgrp->destroy_work);
>  }
>  
>  static void css_ref_killed_fn(struct percpu_ref *ref)
> @@ -4967,6 +4975,22 @@ out:
>  	return err;
>  }
>  
> +static int __init cgroup_destroy_wq_init(void)
> +{
> +	/*
> +	 * There isn't much point in executing destruction path in
> +	 * parallel.  Good chunk is serialized with cgroup_mutex anyway.
> +	 * Use 1 for @max_active.
> +	 *
> +	 * We would prefer to do this in cgroup_init() above, but that
> +	 * is called before init_workqueues(): so leave this until after.
> +	 */
> +	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
> +	BUG_ON(!cgroup_destroy_wq);
> +	return 0;
> +}
> +core_initcall(cgroup_destroy_wq_init);
> +
>  /*
>   * proc_cgroup_show()
>   *  - Print task's cgroup paths into seq_file, one line for each hierarchy

Thanks Tejun and Hugh. Sorry for my late entry in getting around to
testing this fix. On the surface it sounds correct; however, I'd like to
test this on top of 3.10.* since that is what we'll likely be running.
I've tried to apply Hugh's patch above on top of 3.10.19, but it
appears there are a number of conflicts. Looking over the changes and
my understanding of the problem, I believe on 3.10 only
cgroup_free_fn needs to be run in a separate workqueue. Below is the
patch I've applied on top of 3.10.19, which I'm about to start
testing. If it looks like I botched the backport in any way, please
let me know so I can test a proper fix on top of 3.10.19.
Re: 3.10.16 cgroup_mutex deadlock
On Fri, 15 Nov 2013, Tejun Heo wrote:
> Hello,
>
> Shawn, Hugh, can you please verify whether the attached patch makes
> the deadlock go away?

Thanks a lot, Tejun: report below.

> Thanks.
>
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index e0839bc..dc9dc06 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -90,6 +90,14 @@ static DEFINE_MUTEX(cgroup_mutex);
>  static DEFINE_MUTEX(cgroup_root_mutex);
>  
>  /*
> + * cgroup destruction makes heavy use of work items and there can be a lot
> + * of concurrent destructions.  Use a separate workqueue so that cgroup
> + * destruction work items don't end up filling up max_active of system_wq
> + * which may lead to deadlock.
> + */
> +static struct workqueue_struct *cgroup_destroy_wq;
> +
> +/*
>   * Generate an array of cgroup subsystem pointers. At boot time, this is
>   * populated with the built in subsystems, and modular subsystems are
>   * registered after that. The mutable section of this array is protected by
> @@ -871,7 +879,7 @@ static void cgroup_free_rcu(struct rcu_head *head)
>  	struct cgroup *cgrp = container_of(head, struct cgroup, rcu_head);
>  
>  	INIT_WORK(&cgrp->destroy_work, cgroup_free_fn);
> -	schedule_work(&cgrp->destroy_work);
> +	queue_work(cgroup_destroy_wq, &cgrp->destroy_work);
>  }
>  
>  static void cgroup_diput(struct dentry *dentry, struct inode *inode)
> @@ -4254,7 +4262,7 @@ static void css_free_rcu_fn(struct rcu_head *rcu_head)
>  	 * css_put().  dput() requires process context which we don't have.
>  	 */
>  	INIT_WORK(&css->destroy_work, css_free_work_fn);
> -	schedule_work(&css->destroy_work);
> +	queue_work(cgroup_destroy_wq, &css->destroy_work);
>  }
>  
>  static void css_release(struct percpu_ref *ref)
> @@ -4544,7 +4552,7 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
>  		container_of(ref, struct cgroup_subsys_state, refcnt);
>  
>  	INIT_WORK(&css->destroy_work, css_killed_work_fn);
> -	schedule_work(&css->destroy_work);
> +	queue_work(cgroup_destroy_wq, &css->destroy_work);
>  }
>  
>  /**
> @@ -5025,6 +5033,17 @@ int __init cgroup_init(void)
>  	if (err)
>  		return err;
>  
> +	/*
> +	 * There isn't much point in executing destruction path in
> +	 * parallel.  Good chunk is serialized with cgroup_mutex anyway.
> +	 * Use 1 for @max_active.
> +	 */
> +	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
> +	if (!cgroup_destroy_wq) {
> +		err = -ENOMEM;
> +		goto out;
> +	}
> +
>  	for_each_builtin_subsys(ss, i) {
>  		if (!ss->early_init)
>  			cgroup_init_subsys(ss);
> @@ -5062,9 +5081,11 @@ int __init cgroup_init(void)
>  	proc_create("cgroups", 0, NULL, &proc_cgroupstats_operations);
>  
>  out:
> -	if (err)
> +	if (err) {
> +		if (cgroup_destroy_wq)
> +			destroy_workqueue(cgroup_destroy_wq);
>  		bdi_destroy(&cgroup_backing_dev_info);
> -
> +	}
>  	return err;
>  }

Sorry for the delay: I was on the point of reporting success last
night, when I tried a debug kernel: and that didn't work so well
(got spinlock bad magic report in pwq_adjust_max_active(), and
tests wouldn't run at all).

Even the non-early cgroup_init() is called well before the
early_initcall init_workqueues(): though only the debug (lockdep
and spinlock debug) kernel appeared to have a problem with that.

Here's the patch I ended up with successfully on a 3.11.7-based
kernel (though below I've rediffed it against 3.11.8): the
schedule_work->queue_work hunks are slightly different on 3.11
than in your patch against current, and I did alloc_workqueue()
from a separate core_initcall.

The interval between cgroup_init and that is a bit of a worry;
but we don't seem to have suffered from the interval between
cgroup_init and init_workqueues before (when system_wq is NULL)
- though you may have more courage than I to reorder them!

Initially I backed out my system_highpri_wq workaround, and
verified that it was still easy to reproduce the problem with
one of our cgroup stresstests. Yes it was, then your modified
patch below convincingly fixed it.

I ran with Johannes's patch adding extra mem_cgroup_reparent_charges:
as I'd expected, that didn't solve this issue (though it's worth
our keeping it in to rule out another source of problems). And I
checked back on dumps of failures: they indeed show the tell-tale
256 kworkers doing cgroup_offline_fn, just as you predicted.

Thanks!
Hugh

---
 kernel/cgroup.c |   30 +++++++++++++++++++++++++++---
 1 file changed, 27 insertions(+), 3 deletions(-)

--- 3.11.8/kernel/cgroup.c	2013-11-17 17:40:54.200640692 -0800
+++ linux/kernel/cgroup.c	2013-11-17 17:43:10.876643941 -0800
@@ -89,6 +89,14 @@ static DEFINE_MUTEX(cgroup_mutex);
 static DEFINE_MUTEX(cgroup_root_mutex);
 
 /*
+ * cgroup destruction makes heavy use of work items and there can be a lot
+ * of concurrent destructions.  Use a separate workqueue so that cgroup
+ * destruction work items don't end up filling up max_active of system_wq
+ * which may lead to deadlock.
+ */
+static struct workqueue_struct *cgroup_destroy_wq;
+
+/*
  * Generate an array of cgroup subsystem pointers. At boot time, this is
  * populated with the built in subsystems, and modular subsystems are
  * registered after that. The mutable section of this array is protected by
@@ -890,7 +898,7 @@ static void cgroup_free_rcu(struct rcu_h
 	struct cgroup *cgrp = container_of(head, struct cgroup, rcu_head);
 
 	INIT_WORK(&cgrp->destroy_work, cgroup_free_fn);
-	schedule_work(&cgrp->destroy_work);
+	queue_work(cgroup_destroy_wq, &cgrp->destroy_work);
 }
 
 static void cgroup_diput(struct dentry *dentry, struct inode *inode)
@@ -4205,7 +4213,7 @@ static void css_release(struct percpu_re
 	struct cgroup_subsys_state *css =
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
-	schedule_work(&css->dput_work);
+	queue_work(cgroup_destroy_wq, &css->dput_work);
 }
 
 static void init_cgroup_css(struct cgroup_subsys_state *css,
@@ -4439,7 +4447,7 @@ static void cgroup_css_killed(struct cgr
 
 	/* percpu ref's of all css's are killed, kick off the next step */
 	INIT_WORK(&cgrp->destroy_work, cgroup_offline_fn);
-	schedule_work(&cgrp->destroy_work);
+	queue_work(cgroup_destroy_wq, &cgrp->destroy_work);
 }
 
 static void css_ref_killed_fn(struct percpu_ref *ref)
@@ -4967,6 +4975,22 @@ out:
 	return err;
 }
 
+static int __init cgroup_destroy_wq_init(void)
+{
+	/*
+	 * There isn't much point in executing destruction path in
+	 * parallel.  Good chunk is serialized with cgroup_mutex anyway.
+	 * Use 1 for @max_active.
+	 *
+	 * We would prefer to do this in cgroup_init() above, but that
+	 * is called before init_workqueues(): so leave this until after.
+	 */
+	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
+	BUG_ON(!cgroup_destroy_wq);
+	return 0;
+}
+core_initcall(cgroup_destroy_wq_init);
+
 /*
  * proc_cgroup_show()
  *  - Print task's cgroup paths into seq_file, one line for each hierarchy
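[Editor's note] Tejun's in-patch comment ("There isn't much point in executing destruction path in parallel... Use 1 for @max_active") means destruction items on cgroup_destroy_wq run strictly one at a time. A rough userspace analogue in Python (invented names; `ThreadPoolExecutor(max_workers=1)` playing the role of the single-slot workqueue):

```python
from concurrent.futures import ThreadPoolExecutor
import threading

in_flight = 0
peak = 0
counter_lock = threading.Lock()
order = []

def destroy_work(i):
    # Stand-in for a destruction work item; records how many run at once.
    global in_flight, peak
    with counter_lock:
        in_flight += 1
        peak = max(peak, in_flight)
    order.append(i)          # only the single worker thread ever runs this
    with counter_lock:
        in_flight -= 1

wq = ThreadPoolExecutor(max_workers=1)   # analogue of alloc_workqueue("cgroup_destroy", 0, 1)
for f in [wq.submit(destroy_work, i) for i in range(16)]:
    f.result()

print(peak, order == list(range(16)))
```

With one worker, concurrency never exceeds 1 and items complete in submission order, which is harmless here because the real path is mostly serialized under cgroup_mutex anyway.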
Re: 3.10.16 cgroup_mutex deadlock
Hello, Shawn, Hugh, can you please verify whether the attached patch makes the deadlock go away? Thanks.

diff --git a/kernel/cgroup.c b/kernel/cgroup.c index e0839bc..dc9dc06 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c
@@ -90,6 +90,14 @@ static DEFINE_MUTEX(cgroup_mutex); static DEFINE_MUTEX(cgroup_root_mutex); /* + * cgroup destruction makes heavy use of work items and there can be a lot + * of concurrent destructions. Use a separate workqueue so that cgroup + * destruction work items don't end up filling up max_active of system_wq + * which may lead to deadlock. + */ +static struct workqueue_struct *cgroup_destroy_wq; + +/* * Generate an array of cgroup subsystem pointers. At boot time, this is * populated with the built in subsystems, and modular subsystems are * registered after that. The mutable section of this array is protected by
@@ -871,7 +879,7 @@ static void cgroup_free_rcu(struct rcu_head *head) struct cgroup *cgrp = container_of(head, struct cgroup, rcu_head); INIT_WORK(&cgrp->destroy_work, cgroup_free_fn); - schedule_work(&cgrp->destroy_work); + queue_work(cgroup_destroy_wq, &cgrp->destroy_work); } static void cgroup_diput(struct dentry *dentry, struct inode *inode)
@@ -4254,7 +4262,7 @@ static void css_free_rcu_fn(struct rcu_head *rcu_head) * css_put(). dput() requires process context which we don't have. */ INIT_WORK(&css->destroy_work, css_free_work_fn); - schedule_work(&css->destroy_work); + queue_work(cgroup_destroy_wq, &css->destroy_work); } static void css_release(struct percpu_ref *ref)
@@ -4544,7 +4552,7 @@ static void css_killed_ref_fn(struct percpu_ref *ref) container_of(ref, struct cgroup_subsys_state, refcnt); INIT_WORK(&css->destroy_work, css_killed_work_fn); - schedule_work(&css->destroy_work); + queue_work(cgroup_destroy_wq, &css->destroy_work); } /**
@@ -5025,6 +5033,17 @@ int __init cgroup_init(void) if (err) return err; + /* + * There isn't much point in executing destruction path in + * parallel. Good chunk is serialized with cgroup_mutex anyway. + * Use 1 for @max_active. + */ + cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1); + if (!cgroup_destroy_wq) { + err = -ENOMEM; + goto out; + } + for_each_builtin_subsys(ss, i) { if (!ss->early_init) cgroup_init_subsys(ss);
@@ -5062,9 +5081,11 @@ int __init cgroup_init(void) proc_create("cgroups", 0, NULL, &proc_cgroupstats_operations); out: - if (err) + if (err) { + if (cgroup_destroy_wq) + destroy_workqueue(cgroup_destroy_wq); bdi_destroy(&cgroup_backing_dev_info); - + } return err; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.10.16 cgroup_mutex deadlock
Hello, On Thu, Nov 14, 2013 at 04:56:49PM -0600, Shawn Bohrer wrote: > After running both concurrently on 40 machines for about 12 hours I've > managed to reproduce the issue at least once, possibly more. One > machine looked identical to this reported issue. It has a bunch of > stuck cgroup_free_fn() kworker threads and one thread in cpuset_attach > waiting on lru_add_drain_all(). A sysrq+l shows all CPUs are idle > except for the one triggering the sysrq+l. The sysrq+w unfortunately > wrapped dmesg so we didn't get the stacks of all blocked tasks. We > did however also cat /proc/<pid>/stack of all kworker threads on the > system. There were 265 kworker threads that all have the following > stack: Umm... so, WQ_DFL_ACTIVE is 256. It's just an arbitrarily largish number which is supposed to serve as protection against runaway kworker creation. The assumption there is that there won't be a dependency chain which can be longer than that and if there is it should be separated out into a separate workqueue. It looks like we *can* have such a long chain of dependency with a high enough rate of cgroup destruction. kworkers trying to destroy cgroups get blocked by an earlier one which is holding cgroup_mutex. If the blocked ones completely consume max_active and then the earlier one tries to perform an operation which makes use of the system_wq, the forward progress guarantee gets broken. So, yeah, it makes sense now. We're just gonna have to separate out cgroup destruction to a separate workqueue. Hugh's temp fix achieved about the same effect by putting the affected part of destruction to a different workqueue. I probably should have realized that we were hitting max_active when I was told that moving some part to a different workqueue makes the problem go away. Will send out a patch soon. Thanks.
-- tejun
Re: 3.10.16 cgroup_mutex deadlock
On Tue, Nov 12, 2013 at 05:55:04PM +0100, Michal Hocko wrote: > On Tue 12-11-13 09:55:30, Shawn Bohrer wrote: > > On Tue, Nov 12, 2013 at 03:31:47PM +0100, Michal Hocko wrote: > > > On Tue 12-11-13 18:17:20, Li Zefan wrote: > > > > Cc more people > > > > > > > > On 2013/11/12 6:06, Shawn Bohrer wrote: > > > > > Hello, > > > > > > > > > > This morning I had a machine running 3.10.16 go unresponsive but > > > > > before we killed it we were able to get the information below. I'm > > > > > not an expert here but it looks like most of the tasks below are > > > > > blocking waiting on the cgroup_mutex. You can see that the > > > > > resource_alloca:16502 task is holding the cgroup_mutex and that task > > > > > appears to be waiting on a lru_add_drain_all() to complete. > > > > > > Do you have sysrq+l output as well by any chance? That would tell > > > us what the current CPUs are doing. Dumping all kworker stacks > > > might be helpful as well. We know that lru_add_drain_all waits for > > > schedule_on_each_cpu to return so it is waiting for workers to finish. > > > I would be really curious why some of lru_add_drain_cpu cannot finish > > > properly. The only reason would be that some work item(s) do not get CPU > > > or somebody is holding lru_lock. > > > > In fact the sys-admin did manage to fire off a sysrq+l, I've put all > > of the info from the syslog below. I've looked it over and I'm not > > sure it reveals anything. First looking at the timestamps it appears > > we ran the sysrq+l 19.2 hours after the cgroup_mutex lockup I > > previously sent. > > I would expect sysrq+w would still show those kworkers blocked on the > same cgroup mutex? Yes, I believe so. > > I also have atop logs over that whole time period > > that show hundreds of zombie processes which to me indicates that over > > that 19.2 hours systemd remained wedged on the cgroup_mutex. 
Looking > > at the backtraces from the sysrq+l it appears most of the CPUs were > > idle > > Right so either we managed to sleep with the lru_lock held which sounds > a bit improbable - but who knows - or there is some other problem. I > would expect the latter to be true. > > lru_add_drain executes per-cpu and preemption disabled this means that > its work item cannot be preempted so the only logical explanation seems > to be that the work item has never got scheduled. Meaning you think there would be no kworker thread for the lru_add_drain at this point? If so you might be correct. > OK. In case the issue happens again. It would be very helpful to get the > kworker and per-cpu stacks. Maybe Tejun can help with some waitqueue > debugging tricks. I set up one of my test pools with two scripts trying to reproduce the problem. One essentially puts tasks into several cpuset groups that have cpuset.memory_migrate set, then takes them back out. It also occasionally switches cpuset.mems in those groups to try to keep the memory of those tasks migrating between nodes. The second script is:

$ cat /home/hbi/cgroup_mutex_cgroup_maker.sh
#!/bin/bash
session_group=$(ps -o pid,cmd,cgroup -p $$ | grep -E 'c[0-9]+' -o)
cd /sys/fs/cgroup/systemd/user/hbi/${session_group}
pwd
while true; do
    for x in $(seq 1 1000); do
        mkdir $x
        echo $$ > ${x}/tasks
        echo $$ > tasks
        rmdir $x
    done
    sleep .1
    date
done

After running both concurrently on 40 machines for about 12 hours I've managed to reproduce the issue at least once, possibly more. One machine looked identical to this reported issue. It has a bunch of stuck cgroup_free_fn() kworker threads and one thread in cpuset_attach waiting on lru_add_drain_all(). A sysrq+l shows all CPUs are idle except for the one triggering the sysrq+l. The sysrq+w unfortunately wrapped dmesg so we didn't get the stacks of all blocked tasks. We did however also cat /proc/<pid>/stack of all kworker threads on the system.
There were 265 kworker threads that all have the following stack:

[kworker/2:1]
[<ffffffff810930ec>] cgroup_free_fn+0x2c/0x120
[<ffffffff81057c54>] process_one_work+0x174/0x490
[<ffffffff81058d0c>] worker_thread+0x11c/0x370
[<ffffffff8105f0b0>] kthread+0xc0/0xd0
[<ffffffff814c20dc>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

And there were another 101 that had stacks like the following:

[kworker/0:0]
[<ffffffff81058daf>] worker_thread+0x1bf/0x370
[<ffffffff8105f0b0>] kthread+0xc0/0xd0
[<ffffffff814c20dc>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

That's it. Again I'm not sure if that is helpful at all but it seems to imply that the lru_add_drain_work was not scheduled. I also managed to kill another two machines running my test. One of them we didn't get anything out of, and the other looks like I deadlocked on the css_set_lock lock. I'll follow up with the css_set_lock deadlock in another email since it doesn't look related to this one. But it does seem that I can probably reproduce this if anyone has some debugging ideas.

-- Shawn
Re: 3.10.16 cgroup_mutex deadlock
On Tue 12-11-13 09:55:30, Shawn Bohrer wrote: > On Tue, Nov 12, 2013 at 03:31:47PM +0100, Michal Hocko wrote: > > On Tue 12-11-13 18:17:20, Li Zefan wrote: > > > Cc more people > > > > > > On 2013/11/12 6:06, Shawn Bohrer wrote: > > > > Hello, > > > > > > > > This morning I had a machine running 3.10.16 go unresponsive but > > > > before we killed it we were able to get the information below. I'm > > > > not an expert here but it looks like most of the tasks below are > > > > blocking waiting on the cgroup_mutex. You can see that the > > > > resource_alloca:16502 task is holding the cgroup_mutex and that task > > > > appears to be waiting on a lru_add_drain_all() to complete. > > > > Do you have sysrq+l output as well by any chance? That would tell > > us what the current CPUs are doing. Dumping all kworker stacks > > might be helpful as well. We know that lru_add_drain_all waits for > > schedule_on_each_cpu to return so it is waiting for workers to finish. > > I would be really curious why some of lru_add_drain_cpu cannot finish > > properly. The only reason would be that some work item(s) do not get CPU > > or somebody is holding lru_lock. > > In fact the sys-admin did manage to fire off a sysrq+l, I've put all > of the info from the syslog below. I've looked it over and I'm not > sure it reveals anything. First looking at the timestamps it appears > we ran the sysrq+l 19.2 hours after the cgroup_mutex lockup I > previously sent. I would expect sysrq+w would still show those kworkers blocked on the same cgroup mutex? > I also have atop logs over that whole time period > that show hundreds of zombie processes which to me indicates that over > that 19.2 hours systemd remained wedged on the cgroup_mutex. Looking > at the backtraces from the sysrq+l it appears most of the CPUs were > idle Right so either we managed to sleep with the lru_lock held which sounds a bit improbable - but who knows - or there is some other problem. I would expect the latter to be true.
lru_add_drain executes per-cpu with preemption disabled; this means that its work item cannot be preempted, so the only logical explanation seems to be that the work item has never got scheduled. > except there are a few where ptpd is trying to step the clock > with clock_settime. The ptpd process also appears to get stuck for a > bit but it looks like it recovers because it moves CPUs and the > previous CPUs become idle. It gets a soft lockup because it is waiting for its own IPIs which got preempted by the NMI trace dumper. But this is unrelated. > The fact that ptpd is stepping the clock > at all at this time means that timekeeping is a mess at this point and > the system clock is way out of sync. There are also a few of these > NMI messages in there that I don't understand but at this point the > machine was a sinking ship. > > Nov 11 07:03:29 sydtest0 kernel: [764305.327043] Uhhuh. NMI received for > unknown reason 21 on CPU 26. > Nov 11 07:03:29 sydtest0 kernel: [764305.327043] Do you have a strange power > saving mode enabled? > Nov 11 07:03:29 sydtest0 kernel: [764305.327043] Dazed and confused, but > trying to continue > Nov 11 07:03:29 sydtest0 kernel: [764305.327143] Uhhuh. NMI received for > unknown reason 31 on CPU 27. > Nov 11 07:03:29 sydtest0 kernel: [764305.327144] Do you have a strange power > saving mode enabled? > Nov 11 07:03:29 sydtest0 kernel: [764305.327144] Dazed and confused, but > trying to continue > Nov 11 07:03:29 sydtest0 kernel: [764305.327242] Uhhuh. NMI received for > unknown reason 31 on CPU 28. > Nov 11 07:03:29 sydtest0 kernel: [764305.327242] Do you have a strange power > saving mode enabled? > Nov 11 07:03:29 sydtest0 kernel: [764305.327243] Dazed and confused, but > trying to continue > > Perhaps there is another task blocking somewhere holding the lru_lock, but at > this point the machine has been rebooted so I'm not sure how we'd figure out > what task that might be.
Anyway here is the full output of sysrq+l plus > whatever else ended up in the syslog. OK. In case the issue happens again, it would be very helpful to get the kworker and per-cpu stacks. Maybe Tejun can help with some waitqueue debugging tricks. -- Michal Hocko SUSE Labs
Re: 3.10.16 cgroup_mutex deadlock
On Tue 12-11-13 18:17:20, Li Zefan wrote: > Cc more people > > On 2013/11/12 6:06, Shawn Bohrer wrote: > > Hello, > > > > This morning I had a machine running 3.10.16 go unresponsive but > > before we killed it we were able to get the information below. I'm > > not an expert here but it looks like most of the tasks below are > > blocking waiting on the cgroup_mutex. You can see that the > > resource_alloca:16502 task is holding the cgroup_mutex and that task > > appears to be waiting on a lru_add_drain_all() to complete. Do you have sysrq+l output as well by any chance? That would tell us what the current CPUs are doing. Dumping all kworker stacks might be helpful as well. We know that lru_add_drain_all waits for schedule_on_each_cpu to return so it is waiting for workers to finish. I would be really curious why some of lru_add_drain_cpu cannot finish properly. The only reason would be that some work item(s) do not get CPU or somebody is holding lru_lock. -- Michal Hocko SUSE Labs
Re: 3.10.16 cgroup_mutex deadlock
Cc more people On 2013/11/12 6:06, Shawn Bohrer wrote: > Hello, > > This morning I had a machine running 3.10.16 go unresponsive but > before we killed it we were able to get the information below. I'm > not an expert here but it looks like most of the tasks below are > blocking waiting on the cgroup_mutex. You can see that the > resource_alloca:16502 task is holding the cgroup_mutex and that task > appears to be waiting on a lru_add_drain_all() to complete. Ouch, another bug report! This looks like the same bug that Hugh saw. (http://permalink.gmane.org/gmane.linux.kernel.cgroups/9351) What's new in your report is, the lru_add_drain_all() comes from cpuset_attach() instead of memcg. Moreover I thought it was a 3.11 specific bug. > > Initially I thought the deadlock might simply be that the per cpu > workqueue work from lru_add_drain_all() is stuck waiting on the > cgroup_free_fn to complete. However I've read > Documentation/workqueue.txt and it sounds like the current workqueue > has multiple kworker threads per cpu and thus this should not happen. > Both the cgroup_free_fn work and lru_add_drain_all() work run on the > system_wq which has max_active set to 0 so I believe multiple kworker > threads should run. This also appears to be true since all of the > cgroup_free_fn are running on kworker/12 thread and there are multiple > blocked. > > Perhaps someone with more experience in the cgroup and workqueue code > can look at the stacks below and identify the problem, or explain why > the lru_add_drain_all() work has not completed: > > > [694702.013850] INFO: task systemd:1 blocked for more than 120 seconds. > [694702.015794] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables > this message.
> [694702.018217] systemd D 81607820 0 1 0 > 0x > [694702.020505] 88041dcc1d78 0086 88041dc7f100 > 8110ad54 > [694702.023006] 0001 88041dc78000 88041dcc1fd8 > 88041dcc1fd8 > [694702.025508] 88041dcc1fd8 88041dc78000 88041a1e8698 > 81a417c0 > [694702.028011] Call Trace: > [694702.028788] [] ? vma_merge+0x124/0x330 > [694702.030468] [] schedule+0x29/0x70 > [694702.032011] [] schedule_preempt_disabled+0xe/0x10 > [694702.033982] [] __mutex_lock_slowpath+0x112/0x1b0 > [694702.035926] [] ? kmem_cache_alloc_trace+0x12d/0x160 > [694702.037948] [] mutex_lock+0x2a/0x50 > [694702.039546] [] proc_cgroup_show+0x67/0x1d0 > [694702.041330] [] seq_read+0x16b/0x3e0 > [694702.042927] [] vfs_read+0xb0/0x180 > [694702.044498] [] SyS_read+0x52/0xa0 > [694702.046042] [] system_call_fastpath+0x16/0x1b > [694702.047917] INFO: task kworker/12:1:203 blocked for more than 120 seconds. > [694702.050044] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables > this message. > [694702.052467] kworker/12:1D 0 203 2 > 0x > [694702.054756] Workqueue: events cgroup_free_fn > [694702.056139] 88041bc1fcf8 0046 88038e7b46a0 > 00030001 > [694702.058642] 88041bc1fd84 88041da6e9f0 88041bc1ffd8 > 88041bc1ffd8 > [694702.061144] 88041bc1ffd8 88041da6e9f0 0087 > 81a417c0 > [694702.063647] Call Trace: > [694702.064423] [] schedule+0x29/0x70 > [694702.065966] [] schedule_preempt_disabled+0xe/0x10 > [694702.067936] [] __mutex_lock_slowpath+0x112/0x1b0 > [694702.069879] [] mutex_lock+0x2a/0x50 > [694702.071476] [] cgroup_free_fn+0x2c/0x120 > [694702.073209] [] process_one_work+0x174/0x490 > [694702.075019] [] worker_thread+0x11c/0x370 > [694702.076748] [] ? manage_workers+0x2c0/0x2c0 > [694702.078560] [] kthread+0xc0/0xd0 > [694702.080078] [] ? flush_kthread_worker+0xb0/0xb0 > [694702.081995] [] ret_from_fork+0x7c/0xb0 > [694702.083671] [] ? flush_kthread_worker+0xb0/0xb0 > [694702.085595] INFO: task systemd-logind:2885 blocked for more than 120 > seconds. 
> [694702.087801] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables > this message. > [694702.090225] systemd-logind D 81607820 0 2885 1 > 0x > [694702.092513] 88041ac6fd88 0082 88041dd8aa60 > 88041d9bc1a8 > [694702.095014] 88041ac6fda0 88041cac9530 88041ac6ffd8 > 88041ac6ffd8 > [694702.097517] 88041ac6ffd8 88041cac9530 0c36 > 81a417c0 > [694702.100019] Call Trace: > [694702.100793] [] schedule+0x29/0x70 > [694702.102338] [] schedule_preempt_disabled+0xe/0x10 > [694702.104309] [] __mutex_lock_slowpath+0x112/0x1b0 > [694702.198316] [] mutex_lock+0x2a/0x50 > [694702.292456] [] cgroup_lock_live_group+0x1d/0x40 > [694702.386833] [] cgroup_mkdir+0xa8/0x4b0 > [694702.480679] [] vfs_mkdir+0x84/0xd0 > [694702.574124] [] SyS_mkdirat+0x5e/0xe0 > [694702.666986] [] SyS_mkdir+0x19/0x20 > [694702.758969] [] system_call_fastpath+0x16/0x1b > [694702.848295] INFO: task kworker/12:2:11512 blocked for more than 120 > seconds. >
Re: 3.10.16 cgroup_mutex deadlock
Cc more people

On 2013/11/12 6:06, Shawn Bohrer wrote:
> Hello,
>
> This morning I had a machine running 3.10.16 go unresponsive, but
> before we killed it we were able to get the information below.  I'm
> not an expert here, but it looks like most of the tasks below are
> blocked waiting on the cgroup_mutex.  You can see that the
> resource_alloca:16502 task is holding the cgroup_mutex and that task
> appears to be waiting on a lru_add_drain_all() to complete.

Ouch, another bug report!  This looks like the same bug that Hugh saw.
(http://permalink.gmane.org/gmane.linux.kernel.cgroups/9351)

What's new in your report is that the lru_add_drain_all() comes from
cpuset_attach() instead of memcg.  Moreover, I had thought it was a
3.11-specific bug.

> Initially I thought the deadlock might simply be that the per-cpu
> workqueue work from lru_add_drain_all() is stuck waiting on the
> cgroup_free_fn to complete.  However, I've read
> Documentation/workqueue.txt and it sounds like the current workqueue
> has multiple kworker threads per cpu, and thus this should not
> happen.  Both the cgroup_free_fn work and the lru_add_drain_all()
> work run on the system_wq, which has max_active set to 0, so I
> believe multiple kworker threads should run.  This also appears to
> be true, since all of the cgroup_free_fn are running on the
> kworker/12 thread and there are multiple blocked.  Perhaps someone
> with more experience in the cgroup and workqueue code can look at
> the stacks below and identify the problem, or explain why the
> lru_add_drain_all() work has not completed:
>
> [hung-task stack traces snipped; quoted in full in the original report in this thread]
Re: 3.10.16 cgroup_mutex deadlock
On Tue 12-11-13 18:17:20, Li Zefan wrote:
> Cc more people
>
> On 2013/11/12 6:06, Shawn Bohrer wrote:
>> Hello,
>>
>> This morning I had a machine running 3.10.16 go unresponsive, but
>> before we killed it we were able to get the information below.  I'm
>> not an expert here, but it looks like most of the tasks below are
>> blocked waiting on the cgroup_mutex.  You can see that the
>> resource_alloca:16502 task is holding the cgroup_mutex and that
>> task appears to be waiting on a lru_add_drain_all() to complete.

Do you have sysrq+l output as well by any chance?  That would tell us
what the current CPUs are doing.  Dumping all kworker stacks might be
helpful as well.

We know that lru_add_drain_all waits for schedule_on_each_cpu to
return, so it is waiting for workers to finish.  I would be really
curious why some of the lru_add_drain_cpu work items cannot finish
properly.  The only reasons would be that some work item(s) do not
get a CPU, or that somebody is holding the lru_lock.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: 3.10.16 cgroup_mutex deadlock
On Tue 12-11-13 09:55:30, Shawn Bohrer wrote:
> On Tue, Nov 12, 2013 at 03:31:47PM +0100, Michal Hocko wrote:
>> On Tue 12-11-13 18:17:20, Li Zefan wrote:
>>> Cc more people
>>>
>>> On 2013/11/12 6:06, Shawn Bohrer wrote:
>>>> Hello,
>>>>
>>>> This morning I had a machine running 3.10.16 go unresponsive, but
>>>> before we killed it we were able to get the information below.
>>>> I'm not an expert here, but it looks like most of the tasks below
>>>> are blocked waiting on the cgroup_mutex.  You can see that the
>>>> resource_alloca:16502 task is holding the cgroup_mutex and that
>>>> task appears to be waiting on a lru_add_drain_all() to complete.
>>
>> Do you have sysrq+l output as well by any chance?  That would tell
>> us what the current CPUs are doing.  Dumping all kworker stacks
>> might be helpful as well.
>>
>> We know that lru_add_drain_all waits for schedule_on_each_cpu to
>> return, so it is waiting for workers to finish.  I would be really
>> curious why some of the lru_add_drain_cpu work items cannot finish
>> properly.  The only reasons would be that some work item(s) do not
>> get a CPU, or that somebody is holding the lru_lock.
>
> In fact the sys-admin did manage to fire off a sysrq+l; I've put all
> of the info from the syslog below.  I've looked it over and I'm not
> sure it reveals anything.  First, looking at the timestamps, it
> appears we ran the sysrq+l 19.2 hours after the cgroup_mutex lockup
> I previously sent.  I would expect sysrq+w would still show those
> kworkers blocked on the same cgroup_mutex?  I also have atop logs
> over that whole time period that show hundreds of zombie processes,
> which to me indicates that over those 19.2 hours systemd remained
> wedged on the cgroup_mutex.  Looking at the backtraces from the
> sysrq+l it appears most of the CPUs were idle

Right, so either we managed to sleep with the lru_lock held -- which
sounds a bit improbable, but who knows -- or there is some other
problem.  I would expect the latter to be true.  lru_add_drain
executes per-cpu with preemption disabled, which means that its work
item cannot be preempted, so the only logical explanation seems to be
that the work item never got scheduled.

> except there are a few where ptpd is trying to step the clock with
> clock_settime.  The ptpd process also appears to get stuck for a
> bit, but it looks like it recovers because it moves CPUs and the
> previous CPUs become idle.

It gets a soft lockup because it is waiting for its own IPIs, which
got preempted by the NMI trace dumper.  But this is unrelated.  The
fact that ptpd is stepping the clock at all at this time means that
timekeeping is a mess at this point and the system clock is way out
of sync.

> There are also a few of these NMI messages in there that I don't
> understand, but at this point the machine was a sinking ship.
>
> Nov 11 07:03:29 sydtest0 kernel: [764305.327043] Uhhuh. NMI received for unknown reason 21 on CPU 26.
> Nov 11 07:03:29 sydtest0 kernel: [764305.327043] Do you have a strange power saving mode enabled?
> Nov 11 07:03:29 sydtest0 kernel: [764305.327043] Dazed and confused, but trying to continue
> Nov 11 07:03:29 sydtest0 kernel: [764305.327143] Uhhuh. NMI received for unknown reason 31 on CPU 27.
> Nov 11 07:03:29 sydtest0 kernel: [764305.327144] Do you have a strange power saving mode enabled?
> Nov 11 07:03:29 sydtest0 kernel: [764305.327144] Dazed and confused, but trying to continue
> Nov 11 07:03:29 sydtest0 kernel: [764305.327242] Uhhuh. NMI received for unknown reason 31 on CPU 28.
> Nov 11 07:03:29 sydtest0 kernel: [764305.327242] Do you have a strange power saving mode enabled?
> Nov 11 07:03:29 sydtest0 kernel: [764305.327243] Dazed and confused, but trying to continue
>
> Perhaps there is another task blocking somewhere holding the
> lru_lock, but at this point the machine has been rebooted, so I'm
> not sure how we'd figure out what task that might be.  Anyway, here
> is the full output of sysrq+l, plus whatever else ended up in the
> syslog.

OK.  In case the issue happens again, it would be very helpful to get
the kworker and per-cpu stacks.  Maybe Tejun can help with some
waitqueue debugging tricks.
-- 
Michal Hocko
SUSE Labs
3.10.16 cgroup_mutex deadlock
Hello,

This morning I had a machine running 3.10.16 go unresponsive, but
before we killed it we were able to get the information below.  I'm
not an expert here, but it looks like most of the tasks below are
blocked waiting on the cgroup_mutex.  You can see that the
resource_alloca:16502 task is holding the cgroup_mutex and that task
appears to be waiting on a lru_add_drain_all() to complete.

Initially I thought the deadlock might simply be that the per-cpu
workqueue work from lru_add_drain_all() is stuck waiting on the
cgroup_free_fn to complete.  However, I've read
Documentation/workqueue.txt and it sounds like the current workqueue
has multiple kworker threads per cpu, and thus this should not
happen.  Both the cgroup_free_fn work and the lru_add_drain_all()
work run on the system_wq, which has max_active set to 0 (i.e. the
default), so I believe multiple kworker threads should run.  This
also appears to be true, since all of the cgroup_free_fn are running
on the kworker/12 thread and there are multiple blocked.  Perhaps
someone with more experience in the cgroup and workqueue code can
look at the stacks below and identify the problem, or explain why the
lru_add_drain_all() work has not completed:

[694702.013850] INFO: task systemd:1 blocked for more than 120 seconds.
[694702.015794] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[694702.018217] systemd D 81607820 0 1 0 0x
[694702.020505] 88041dcc1d78 0086 88041dc7f100 8110ad54
[694702.023006] 0001 88041dc78000 88041dcc1fd8 88041dcc1fd8
[694702.025508] 88041dcc1fd8 88041dc78000 88041a1e8698 81a417c0
[694702.028011] Call Trace:
[694702.028788] [8110ad54] ? vma_merge+0x124/0x330
[694702.030468] [814b8eb9] schedule+0x29/0x70
[694702.032011] [814b918e] schedule_preempt_disabled+0xe/0x10
[694702.033982] [814b75b2] __mutex_lock_slowpath+0x112/0x1b0
[694702.035926] [8112a2bd] ? kmem_cache_alloc_trace+0x12d/0x160
[694702.037948] [814b742a] mutex_lock+0x2a/0x50
[694702.039546] [81095b77] proc_cgroup_show+0x67/0x1d0
[694702.041330] [8115925b] seq_read+0x16b/0x3e0
[694702.042927] [811383d0] vfs_read+0xb0/0x180
[694702.044498] [81138652] SyS_read+0x52/0xa0
[694702.046042] [814c2182] system_call_fastpath+0x16/0x1b
[694702.047917] INFO: task kworker/12:1:203 blocked for more than 120 seconds.
[694702.050044] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[694702.052467] kworker/12:1 D 0 203 2 0x
[694702.054756] Workqueue: events cgroup_free_fn
[694702.056139] 88041bc1fcf8 0046 88038e7b46a0 00030001
[694702.058642] 88041bc1fd84 88041da6e9f0 88041bc1ffd8 88041bc1ffd8
[694702.061144] 88041bc1ffd8 88041da6e9f0 0087 81a417c0
[694702.063647] Call Trace:
[694702.064423] [814b8eb9] schedule+0x29/0x70
[694702.065966] [814b918e] schedule_preempt_disabled+0xe/0x10
[694702.067936] [814b75b2] __mutex_lock_slowpath+0x112/0x1b0
[694702.069879] [814b742a] mutex_lock+0x2a/0x50
[694702.071476] [810930ec] cgroup_free_fn+0x2c/0x120
[694702.073209] [81057c54] process_one_work+0x174/0x490
[694702.075019] [81058d0c] worker_thread+0x11c/0x370
[694702.076748] [81058bf0] ? manage_workers+0x2c0/0x2c0
[694702.078560] [8105f0b0] kthread+0xc0/0xd0
[694702.080078] [8105eff0] ? flush_kthread_worker+0xb0/0xb0
[694702.081995] [814c20dc] ret_from_fork+0x7c/0xb0
[694702.083671] [8105eff0] ? flush_kthread_worker+0xb0/0xb0
[694702.085595] INFO: task systemd-logind:2885 blocked for more than 120 seconds.
[694702.087801] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[694702.090225] systemd-logind D 81607820 0 2885 1 0x
[694702.092513] 88041ac6fd88 0082 88041dd8aa60 88041d9bc1a8
[694702.095014] 88041ac6fda0 88041cac9530 88041ac6ffd8 88041ac6ffd8
[694702.097517] 88041ac6ffd8 88041cac9530 0c36 81a417c0
[694702.100019] Call Trace:
[694702.100793] [814b8eb9] schedule+0x29/0x70
[694702.102338] [814b918e] schedule_preempt_disabled+0xe/0x10
[694702.104309] [814b75b2] __mutex_lock_slowpath+0x112/0x1b0
[694702.198316] [814b742a] mutex_lock+0x2a/0x50
[694702.292456] [8108fa6d] cgroup_lock_live_group+0x1d/0x40
[694702.386833] [810946c8] cgroup_mkdir+0xa8/0x4b0
[694702.480679] [81145ea4] vfs_mkdir+0x84/0xd0
[694702.574124] [8114791e] SyS_mkdirat+0x5e/0xe0
[694702.666986] [811479b9] SyS_mkdir+0x19/0x20
[694702.758969] [814c2182] system_call_fastpath+0x16/0x1b
[694702.848295] INFO: task kworker/12:2:11512 blocked for more than 120 seconds.
[694702.935749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[694703.023603] kworker/12:2 D 816079c0 0 11512 2 0x
[694703.109993] Workqueue: events cgroup_free_fn
[694703.193213] 88041b9dfcf8 0046 88041da6e9f0 ea00106fd240
[694703.278353] 88041f803c00 8803824254c0 88041b9dffd8 88041b9dffd8
[694703.363757] 88041b9dffd8 8803824254c0 001f17887bb1 81a417c0
[694703.448550] Call Trace:
[694703.531773] []