Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)
Hello,

On Tue, Dec 20, 2011 at 09:50:24AM -0500, Vivek Goyal wrote:
> So IIUC, existing mempool implementation is not directly usable for my
> requirement and I need to write some code of my own for the caching
> layer which always allocates objects from reserve and fills in the
> pool asynchronously with the help of a worker thread.

I've been looking at it and don't think allowing the percpu allocator to
be called from the noio path is a good idea. The on-demand area filling
is tied into vmalloc area management, which in turn is tied to arch page
table code, and I really want to avoid pre-allocating a full chunk - it
can be huge. I'm trying to extend mempool to cover percpu areas, so
please wait a bit.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)
On Mon, Dec 19, 2011 at 02:56:35PM -0800, Tejun Heo wrote:
> Hello, Vivek.
>
> On Mon, Dec 19, 2011 at 01:27:17PM -0500, Vivek Goyal wrote:
> > Ok, that's good to know. If per cpu allocator can support this use case,
> > it will be good for 3.3 onwards. This seems to be right way to go to fix
> > the problem.
>
> Ummm... if we're gonna make percpu usable w/ GFP_NOIO, the right
> interim solution would be making a simplistic mempool so that later
> when percpu can do it, it can be swapped easily. I really can't see
> much benefit of adding refcnting on top of everything just for this.

Ok. So are you suggesting that I should write a simple mempool-like
implementation of my own for group and per cpu data allocation? Keep a
certain number of elements in the cache and trigger a worker thread to
allocate more elements once the number of free elements goes below a
threshold. If the pool has run out of pre-allocated elements, the
allocation fails and the IO is accounted to the root group?

I am looking at the mempool implementation (mempool_create()) and it
does not look suitable for my use case. mempool_alloc() will call into
the alloc function provided by me and pass the gfp flags along, but I
can't implement an alloc function that honors those flags, as the per
cpu allocator does not take any flags.

So IIUC, the existing mempool implementation is not directly usable for
my requirement, and I need to write some code of my own for the caching
layer which always allocates objects from the reserve and fills in the
pool asynchronously with the help of a worker thread.

Thanks
Vivek
Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)
On 2011-12-19 19:27, Vivek Goyal wrote:
> I think reverting the previous series is not going to be simple either. It
> had 13 patches.
>
> https://lkml.org/lkml/2011/5/19/560
>
> By making stats per cpu, I was able to reduce contention on request
> queue lock. Now we shall have to bring the lock back.

That's not going to happen; having that lock in there was a disaster for
small IO on fast devices. It's essentially the limiting factor on
benchmark runs on the RHEL kernels that have it included and enabled...

-- 
Jens Axboe
Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)
Hello, Vivek.

On Mon, Dec 19, 2011 at 01:27:17PM -0500, Vivek Goyal wrote:
> Ok, that's good to know. If per cpu allocator can support this use case,
> it will be good for 3.3 onwards. This seems to be right way to go to fix
> the problem.

Ummm... if we're gonna make percpu usable w/ GFP_NOIO, the right
interim solution would be making a simplistic mempool so that later
when percpu can do it, it can be swapped easily. I really can't see
much benefit of adding refcnting on top of everything just for this.

Thanks.

-- 
tejun
Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)
On Mon, Dec 19, 2011 at 09:35:19AM -0800, Tejun Heo wrote:
> On Mon, Dec 19, 2011 at 12:27:17PM -0500, Vivek Goyal wrote:
> > On Sun, Dec 18, 2011 at 03:25:48PM -0600, Nate Custer wrote:
> > >
> > > On Dec 16, 2011, at 2:29 PM, Vivek Goyal wrote:
> > > > Thanks for testing it Nate. I did some debugging and found out that
> > > > patch is doing double free on per cpu pointer hence the crash you are
> > > > running into. I could reproduce this problem on my box. It is just a
> > > > matter of doing rmdir on the blkio cgroup.
> > > >
> > > > I understood the cmpxchg() semantics wrong. I have fixed it now and
> > > > no crashes on directory removal. Can you please give this version a
> > > > try.
> > > >
> > > > Thanks
> > > > Vivek
> > >
> > > After 24 hours of stress testing the machine remains up and working
> > > without issue. I will continue to test it, but am reasonably confident
> > > that this patch resolves my issue.
> >
> > Hi Nate,
> >
> > I have come up with the final version of the patch. This time I have used
> > a non-reentrant work queue to queue the stat alloc work. This also gets
> > rid of the cmpxchg code, as there is only one writer at a time. There are
> > a couple of other small cleanups.
> >
> > Can you please give this patch a try too, to make sure I have not broken
> > something while doing the changes.
> >
> > This version is based on 3.2-rc6. Once you confirm the results, I will
> > rebase it on top of "linux-block for-3.3/core" and post it to Jens for
> > inclusion.
> >
> > Thanks
> > Vivek
> >
> > Block cgroup currently allocates percpu data for some stats. This
> > allocation happens in the IO submission path, which can recurse into the
> > IO stack.
> >
> > Percpu allocation API does not take any allocation flags as input, hence
> > to avoid the deadlock problem under memory pressure, alloc per cpu data
> > from a worker thread context.
> >
> > The only side effect of the delayed allocation is that we will lose the
> > blkio cgroup stats for that group for a small duration.
> >
> > In case per cpu memory allocation fails, the worker thread re-arms the
> > work with a delay of 1 second.
> >
> > This patch is generated on top of 3.2-rc6
> >
> > Signed-off-by: Vivek Goyal
> > Reported-by: Nate Custer
> > Tested-by: Nate Custer
>
> Hmmm... I really don't like this approach. It's so unnecessarily
> complex with extra refcounting and all when about the same thing can
> be achieved by implementing simple mempool which is filled
> asynchronously. Also, the fix seems way too invasive even for -rc6,
> let alone -stable. If reverting isn't gonna be invasive, maybe that's
> a better approach for now?

I think reverting the previous series is not going to be simple either. It
had 13 patches.

https://lkml.org/lkml/2011/5/19/560

By making the stats per cpu, I was able to reduce contention on the request
queue lock. Now we would have to bring the lock back.

So to me, neither solution is very clean/simple. Maybe we can take this
patch in as a temporary stop-gap measure for 3.2, while you cook a patch
for the per cpu allocator for 3.3 onwards.

This patch looks big, but it is not very complex. Now allocation of the
group happens without dropping the queue lock, and it gets rid of all the
trickery of dropping the lock, reacquiring it, checking for queue dead,
etc. Some code is related to making group referencing atomic again (which
was atomic in the past, so we just restore that functionality). The only
core piece is taking a reference on the group and asking a worker thread to
do the allocation. Maybe if I break the patch into small pieces, it will be
clearer.

So if I had to choose between reverting the series and taking this patch as
a stop-gap measure, I would prefer taking this patch. It can get some soak
time in Jens's tree, and if everything looks good there, push it for 3.2.
If things still look good, it can then be pulled into -stable. So it will
be a while before these changes are pulled into -stable.

> I've been thinking about it and I think the use case is legit and
> maybe making percpu allocator support that isn't such a bad idea. I'm
> not completely sure yet tho. I'll give it a try later today.

Ok, that's good to know. If the per cpu allocator can support this use
case, it will be good for 3.3 onwards. This seems to be the right way to go
to fix the problem.

Thanks
Vivek
Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)
On Mon, Dec 19, 2011 at 12:27:17PM -0500, Vivek Goyal wrote:
> On Sun, Dec 18, 2011 at 03:25:48PM -0600, Nate Custer wrote:
> >
> > On Dec 16, 2011, at 2:29 PM, Vivek Goyal wrote:
> > > Thanks for testing it Nate. I did some debugging and found out that patch
> > > is doing double free on per cpu pointer hence the crash you are running
> > > into. I could reproduce this problem on my box. It is just a matter of
> > > doing rmdir on the blkio cgroup.
> > >
> > > I understood the cmpxchg() semantics wrong. I have fixed it now and
> > > no crashes on directory removal. Can you please give this version a
> > > try.
> > >
> > > Thanks
> > > Vivek
> >
> > After 24 hours of stress testing the machine remains up and working
> > without issue. I will continue to test it, but am reasonably confident
> > that this patch resolves my issue.
>
> Hi Nate,
>
> I have come up with the final version of the patch. This time I have used
> a non-reentrant work queue to queue the stat alloc work. This also gets
> rid of the cmpxchg code, as there is only one writer at a time. There are
> a couple of other small cleanups.
>
> Can you please give this patch a try too, to make sure I have not broken
> something while doing the changes.
>
> This version is based on 3.2-rc6. Once you confirm the results, I will
> rebase it on top of "linux-block for-3.3/core" and post it to Jens for
> inclusion.
>
> Thanks
> Vivek
>
> Block cgroup currently allocates percpu data for some stats. This
> allocation happens in the IO submission path, which can recurse into the
> IO stack.
>
> Percpu allocation API does not take any allocation flags as input, hence
> to avoid the deadlock problem under memory pressure, alloc per cpu data
> from a worker thread context.
>
> The only side effect of the delayed allocation is that we will lose the
> blkio cgroup stats for that group for a small duration.
>
> In case per cpu memory allocation fails, the worker thread re-arms the
> work with a delay of 1 second.
>
> This patch is generated on top of 3.2-rc6
>
> Signed-off-by: Vivek Goyal
> Reported-by: Nate Custer
> Tested-by: Nate Custer

Hmmm... I really don't like this approach. It's so unnecessarily
complex with extra refcounting and all when about the same thing can
be achieved by implementing a simple mempool which is filled
asynchronously. Also, the fix seems way too invasive even for -rc6,
let alone -stable. If reverting isn't gonna be invasive, maybe that's
a better approach for now?

I've been thinking about it and I think the use case is legit and
maybe making the percpu allocator support that isn't such a bad idea.
I'm not completely sure yet tho. I'll give it a try later today.

Thank you.

-- 
tejun
Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)
On Sun, Dec 18, 2011 at 03:25:48PM -0600, Nate Custer wrote:
>
> On Dec 16, 2011, at 2:29 PM, Vivek Goyal wrote:
> > Thanks for testing it Nate. I did some debugging and found out that patch
> > is doing double free on per cpu pointer hence the crash you are running
> > into. I could reproduce this problem on my box. It is just a matter of
> > doing rmdir on the blkio cgroup.
> >
> > I understood the cmpxchg() semantics wrong. I have fixed it now and
> > no crashes on directory removal. Can you please give this version a
> > try.
> >
> > Thanks
> > Vivek
>
> After 24 hours of stress testing the machine remains up and working without
> issue. I will continue to test it, but am reasonably confident that this
> patch resolves my issue.

Hi Nate,

I have come up with the final version of the patch. This time I have used a
non-reentrant work queue to queue the stat alloc work. This also gets rid of
the cmpxchg code, as there is only one writer at a time. There are a couple
of other small cleanups.

Can you please give this patch a try too, to make sure I have not broken
something while doing the changes.

This version is based on 3.2-rc6. Once you confirm the results, I will
rebase it on top of "linux-block for-3.3/core" and post it to Jens for
inclusion.

Thanks
Vivek

Block cgroup currently allocates percpu data for some stats. This allocation
happens in the IO submission path, which can recurse into the IO stack.

The percpu allocation API does not take any allocation flags as input, hence
to avoid the deadlock problem under memory pressure, alloc the per cpu data
from a worker thread context.

The only side effect of the delayed allocation is that we will lose the
blkio cgroup stats for that group for a small duration.

In case the per cpu memory allocation fails, the worker thread re-arms the
work with a delay of 1 second.
This patch is generated on top of 3.2-rc6

Signed-off-by: Vivek Goyal
Reported-by: Nate Custer
Tested-by: Nate Custer
---
 block/blk-cgroup.c   |  31 -
 block/blk-cgroup.h   |  11 +++-
 block/blk-throttle.c | 115 ---
 block/cfq-iosched.c  | 111 +++--
 4 files changed, 173 insertions(+), 95 deletions(-)

Index: block/blk-throttle.c
===
--- block/blk-throttle.c.orig	2011-12-19 23:10:53.0 -0500
+++ block/blk-throttle.c	2011-12-19 23:11:14.0 -0500
@@ -81,6 +81,7 @@ struct throtl_grp {
 	int limits_changed;
 
 	struct rcu_head rcu_head;
+	struct delayed_work stat_alloc_work;
 };
 
 struct throtl_data
@@ -159,6 +160,12 @@ static void throtl_free_tg(struct rcu_he
 	struct throtl_grp *tg;
 
 	tg = container_of(head, struct throtl_grp, rcu_head);
+
+	/*
+	 * delayed per cpu allocation should be visible here due to
+	 * barrier guarantees offered by atomic_dec_and_test() called
+	 * during dropping reference.
+	 */
 	free_percpu(tg->blkg.stats_cpu);
 	kfree(tg);
 }
@@ -181,6 +188,48 @@ static void throtl_put_tg(struct throtl_
 	call_rcu(&tg->rcu_head, throtl_free_tg);
 }
 
+static inline void throtl_schedule_pcpu_stat_alloc(struct throtl_grp *tg,
+					unsigned long delay)
+{
+	WARN_ON(tg->blkg.stats_cpu != NULL);
+
+	/*
+	 * Schedule the group per cpu stat allocation through worker
+	 * thread
+	 */
+	throtl_ref_get_tg(tg);
+
+	/* queue work on non-reentrant work queue */
+	queue_delayed_work(system_nrt_wq, &tg->stat_alloc_work, delay);
+}
+
+/* No locks taken. A reference to throtl_grp taken before invocation */
+static void tg_stat_alloc_fn(struct work_struct *work)
+{
+	struct delayed_work *dwork = to_delayed_work(work);
+	struct throtl_grp *tg = container_of(dwork, struct throtl_grp,
+						stat_alloc_work);
+	void *stat_ptr = NULL;
+
+	stat_ptr = blkio_alloc_blkg_stats_pcpu();
+
+	if (stat_ptr == NULL) {
+		/*
+		 * If memory allocation failed, try again with delay of 1s
+		 * FIXME: Put an upper limit on number of retries?
+		 */
+		throtl_schedule_pcpu_stat_alloc(tg, HZ);
+		throtl_put_tg(tg);
+		return;
+	}
+
+	/*
+	 * There is only one writer at a time and that is worker thread. Hence
+	 * no synchronization should be required.
+	 */
+	blkio_set_blkg_pcpu_stats(&tg->blkg, stat_ptr);
+	throtl_put_tg(tg);
+}
+
 static void throtl_init_group(struct throtl_grp *tg)
 {
 	INIT_HLIST_NODE(&tg->tg_node);
@@ -188,6 +237,7 @@ static void throtl_init_group(struct thr
 	bio_list_init(&tg->bio_lists[0]);
 	bio_list_init(&tg->bio_lists[1]);
 	tg->limits_changed = false;
+	INIT_DELAYED_WORK(&tg->stat_alloc_work, tg_stat_alloc_fn);
 
 	/* Practically unlimited BW */
Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)
On Sun, Dec 18, 2011 at 03:25:48PM -0600, Nate Custer wrote:
>
> On Dec 16, 2011, at 2:29 PM, Vivek Goyal wrote:
> > Thanks for testing it Nate. I did some debugging and found out that patch
> > is doing double free on per cpu pointer hence the crash you are running
> > into. I could reproduce this problem on my box. It is just a matter of
> > doing rmdir on the blkio cgroup.
> >
> > I understood the cmpxchg() semantics wrong. I have fixed it now and
> > no crashes on directory removal. Can you please give this version a
> > try.
> >
> > Thanks
> > Vivek
>
> After 24 hours of stress testing the machine remains up and working without
> issue. I will continue to test it, but am reasonably confident that this
> patch resolves my issue.

That's good to know. While you continue to stress test, I will make some
minor modifications to the patch. (Thinking of delaying the retry of
allocation of per cpu memory in case previous attempts failed.)

Thanks
Vivek
Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)
On Dec 16, 2011, at 2:29 PM, Vivek Goyal wrote:
> Thanks for testing it Nate. I did some debugging and found out that patch
> is doing double free on per cpu pointer hence the crash you are running
> into. I could reproduce this problem on my box. It is just a matter of
> doing rmdir on the blkio cgroup.
>
> I understood the cmpxchg() semantics wrong. I have fixed it now and
> no crashes on directory removal. Can you please give this version a
> try.
>
> Thanks
> Vivek

After 24 hours of stress testing the machine remains up and working without
issue. I will continue to test it, but am reasonably confident that this
patch resolves my issue.

Nate Custer
Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)
On Fri, Dec 16, 2011 at 12:43:52PM -0600, Nate Custer wrote:
>
> On Dec 15, 2011, at 1:47 PM, Vivek Goyal wrote:
> > Ok, I continued to develop on the patch which tries to allocate per cpu
> > stats from worker thread context. Here is the patch.
> >
> > Can the reporter please try out the patch and see if it helps. I am not
> > sure if deadlock was because of mutex issue or not, but it should help
> > get rid of lockdep warning.
> >
> > This patch is on top of for-3.3/core branch of jens's linux-block tree.
> > If it does not apply on your kernel version, do let me know the version
> > you are testing with and I will generate another version of patch.
> >
> > If testing results are good, I will break down the patch in small pieces
> > and post as a series separately.
> >
> > Thanks
> > Vivek
>
> Running on a fedora-16 box with the patch applied to the linux-block tree I
> still had deadlocks. In fact it seemed to happen much faster and with
> lighter workloads.
>
> I was able to get netconsole setup and a full stacktrace is posted here:
>
> http://pastebin.com/9Rq68exU

Thanks for testing it Nate. I did some debugging and found out that the
patch is doing a double free on the per cpu pointer, hence the crash you
are running into. I could reproduce this problem on my box; it is just a
matter of doing rmdir on the blkio cgroup.

I understood the cmpxchg() semantics wrong. I have fixed it now and see
no crashes on directory removal. Can you please give this version a try.

Thanks
Vivek

Alloc per cpu data from a worker thread context to avoid the possibility
of a deadlock under memory pressure.

The only side effect of the delayed allocation is that we will lose the
blkio cgroup stats for that group for a small duration.

This patch is generated on top of the "for-3.3/core" branch of the
linux-block tree.

-v2: Fixed the issue of double free on the percpu pointer.
---
 block/blk-cgroup.c   |  31 -
 block/blk-cgroup.h   |   9 +++
 block/blk-throttle.c | 116 -
 block/cfq-iosched.c  | 119 ++-
 4 files changed, 180 insertions(+), 95 deletions(-)

Index: block/blk-throttle.c
===
--- block/blk-throttle.c.orig	2011-12-17 01:49:34.0 -0500
+++ block/blk-throttle.c	2011-12-17 02:21:50.0 -0500
@@ -81,6 +81,7 @@ struct throtl_grp {
 	int limits_changed;
 
 	struct rcu_head rcu_head;
+	struct work_struct stat_alloc_work;
 };
 
 struct throtl_data
@@ -159,6 +160,13 @@ static void throtl_free_tg(struct rcu_he
 	struct throtl_grp *tg;
 
 	tg = container_of(head, struct throtl_grp, rcu_head);
+
+	/*
+	 * Will delayed allocation be visible here for sure? I am relying
+	 * on the fact that after blkg.stats_cpu assignment, we drop
+	 * reference to group using atomic_dec() which should imply
+	 * barrier
+	 */
 	free_percpu(tg->blkg.stats_cpu);
 	kfree(tg);
 }
@@ -181,6 +189,48 @@ static void throtl_put_tg(struct throtl_
 	call_rcu(&tg->rcu_head, throtl_free_tg);
 }
 
+static inline void throtl_check_schedule_pcpu_stat_alloc(struct throtl_grp *tg)
+{
+	if (tg->blkg.stats_cpu != NULL)
+		return;
+
+	/*
+	 * Schedule the group per cpu stat allocation through worker
+	 * thread
+	 */
+	throtl_ref_get_tg(tg);
+	if (!schedule_work(&tg->stat_alloc_work)) {
+		/* Work is already scheduled by somebody */
+		throtl_put_tg(tg);
+	}
+}
+
+/* No locks taken. A reference to throtl_grp taken before invocation */
+static void tg_stat_alloc_fn(struct work_struct *work)
+{
+	struct throtl_grp *tg = container_of(work, struct throtl_grp,
+						stat_alloc_work);
+	void *stat_ptr = NULL;
+	unsigned long ret;
+
+	stat_ptr = blkio_alloc_blkg_stats_pcpu();
+
+	if (stat_ptr == NULL) {
+		/* If memory allocation failed, try again */
+		throtl_check_schedule_pcpu_stat_alloc(tg);
+		throtl_put_tg(tg);
+		return;
+	}
+
+	ret = blkio_cmpxchg_blkg_stats(&tg->blkg, 0,
+					(unsigned long)stat_ptr);
+
+	if (ret != 0) {
+		/* Somebody else got to it first */
+		free_percpu(stat_ptr);
+	}
+
+	throtl_put_tg(tg);
+}
+
 static void throtl_init_group(struct throtl_grp *tg)
 {
 	INIT_HLIST_NODE(&tg->tg_node);
@@ -188,6 +238,7 @@ static void throtl_init_group(struct thr
 	bio_list_init(&tg->bio_lists[0]);
 	bio_list_init(&tg->bio_lists[1]);
 	tg->limits_changed = false;
+	INIT_WORK(&tg->stat_alloc_work, tg_stat_alloc_fn);
 
 	/* Practically unlimited BW */
 	tg->bps[0] = tg->bps[1] = -1;
@@ -263,8 +314,25 @@ static void throtl_init_add_tg_lists(str
 	throtl_add_group_to_td_list(t
Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)
On Dec 15, 2011, at 1:47 PM, Vivek Goyal wrote:
> Ok, I continued to develop on the patch which tries to allocate per cpu
> stats from worker thread context. Here is the patch.
>
> Can the reporter please try out the patch and see if it helps. I am not
> sure if deadlock was because of mutex issue or not, but it should help
> get rid of lockdep warning.
>
> This patch is on top of for-3.3/core branch of jens's linux-block tree.
> If it does not apply on your kernel version, do let me know the version
> you are testing with and I will generate another version of patch.
>
> If testing results are good, I will break down the patch in small pieces
> and post as a series separately.
>
> Thanks
> Vivek

Running on a fedora-16 box with the patch applied to the linux-block tree I
still had deadlocks. In fact it seemed to happen much faster and with
lighter workloads.

I was able to get netconsole setup, and a full stacktrace is posted here:

http://pastebin.com/9Rq68exU

Nate
[RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)
On Wed, Dec 14, 2011 at 05:03:54PM +0100, Jens Axboe wrote:
> On 2011-12-14 14:43, Avi Kivity wrote:
> > On 12/14/2011 02:25 PM, Marcelo Tosatti wrote:
> >> On Mon, Dec 05, 2011 at 04:48:16PM -0600, Nate Custer wrote:
> >>> Hello,
> >>>
> >>> I am struggling with repeatable full hardware locks when running 8-12
> >>> KVM vms. At some point before the hard lock I get an inconsistent lock
> >>> state warning. An example of this can be found here:
> >>>
> >>> http://pastebin.com/8wKhgE2C
> >>>
> >>> After that the server continues to run for a while and then starts its
> >>> death spiral. When it reaches that point it fails to log anything
> >>> further to the disk, but by attaching a console I have been able to get
> >>> a stack trace documenting the final implosion:
> >>>
> >>> http://pastebin.com/PbcN76bd
> >>>
> >>> All of the cores end up hung and the server stops responding to all
> >>> input, including SysRq commands.
> >>>
> >>> I have seen this behavior on two machines (dual E5606 running Fedora 16);
> >>> both passed cpuburnin testing and memtest86 scans without error.
> >>>
> >>> I have reproduced the crash and stack traces from a Fedora debugging
> >>> kernel - 3.1.2-1 and with a vanilla 3.1.4 kernel.
> >>
> >> Busted hardware, apparently. Can you reproduce these issues with the
> >> same workload on different hardware?
> >
> > I don't think it's hardware related. The second trace (in the first
> > paste) is called during swap, so GFP_FS is set. The first one is not,
> > so GFP_FS is clear. Lockdep is worried about the following scenario:
> >
> >   acpi_early_init() is called
> >     calls pcpu_alloc(), which takes pcpu_alloc_mutex
> >     eventually, calls kmalloc(), or some other allocation function
> >       no memory, so swap
> >         call try_to_free_pages()
> >           submit_bio()
> >             blk_throtl_bio()
> >               blkio_alloc_blkg_stats()
> >                 alloc_percpu()
> >                   pcpu_alloc(), which takes pcpu_alloc_mutex
> >                   deadlock
> >
> > It's a little unlikely that acpi_early_init() will OOM, but lockdep
> > doesn't know that. Other callers of pcpu_alloc() could trigger the same
> > thing.
> >
> > When lockdep says
> >
> > [ 5839.924953] other info that might help us debug this:
> > [ 5839.925396]  Possible unsafe locking scenario:
> > [ 5839.925397]
> > [ 5839.925840]        CPU0
> > [ 5839.926063]
> > [ 5839.926287]   lock(pcpu_alloc_mutex);
> > [ 5839.926533]
> > [ 5839.926756]   lock(pcpu_alloc_mutex);
> > [ 5839.926986]
> >
> > It really means
> >
> > GFP_FS simply marks the beginning of a nested, unrelated context that
> > uses the same thread, just like an interrupt. Kudos to lockdep for
> > catching that.
> >
> > I think the allocation in blkio_alloc_blkg_stats() should be moved out
> > of the I/O path into some init function. Copying Jens.
>
> That's completely buggy, basically you end up with a GFP_KERNEL
> allocation from the IO submit path. Vivek, per_cpu data needs to be set
> up at init time. You can't allocate it dynamically off the IO path.

Ok, I continued to develop on the patch which tries to allocate the per
cpu stats from worker thread context. Here is the patch.

Can the reporter please try out the patch and see if it helps. I am not
sure if the deadlock was because of the mutex issue or not, but it should
help get rid of the lockdep warning.

This patch is on top of the for-3.3/core branch of Jens's linux-block
tree. If it does not apply on your kernel version, do let me know the
version you are testing with and I will generate another version of the
patch.
If testing results are good, I will break down the patch in small pieces
and post as a series separately.

Thanks
Vivek

Alloc per cpu data from a worker thread context to avoid the possibility
of a deadlock under memory pressure.

The only side effect of the delayed allocation is that we will lose the
blkio cgroup stats for that group for a small duration.

This patch is generated on top of the "for-3.3/core" branch of the
linux-block tree.

---
 block/blk-cgroup.c   |  31 -
 block/blk-cgroup.h   |   9 +++
 block/blk-throttle.c | 116 -
 block/cfq-iosched.c  | 119 ++-
 4 files changed, 180 insertions(+), 95 deletions(-)

Index: block/blk-throttle.c
===
--- block/blk-throttle.c.orig	2011-12-15 20:31:05.0 -0500
+++ block/blk-throttle.c	2011-12-15 20:32:15.0 -0500
@@ -81,6 +81,7 @@ struct throtl_grp {
 	int limits_changed;
 
 	struct rcu_head rcu_head;
+	struct work_struct stat_alloc_work;
 };
 
 struct throtl_data
@@ -159,6 +160,13 @@ static void throtl_free_tg(struct rcu_he
 	struct throtl_grp *tg;
 
 	tg = container_of(head, struct throtl_grp, rcu_head);
+
+	/*
+	 * Will delayed allocation be visible here for sure? I am relying
+	 * on the fact that after blkg.stats_cpu assignment, we drop