Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)

2011-12-20 Thread Tejun Heo
Hello,

On Tue, Dec 20, 2011 at 09:50:24AM -0500, Vivek Goyal wrote:
> So IIUC, existing mempool implementation is not directly usable for my
> requirement and I need to write some code of my own for the caching
> layer which always allocates objects from reserve and fills in the
> pool asynchronously with the help of a worker thread.

I've been looking at it and don't think allowing percpu allocator to
be called from no io path is a good idea.  The on-demand area filling
is tied into vmalloc area management which in turn is tied to arch
page table code and I really want to avoid pre-allocating full chunk -
it can be huge.  I'm trying to extend mempool to cover percpu areas,
so please wait a bit.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)

2011-12-20 Thread Vivek Goyal
On Mon, Dec 19, 2011 at 02:56:35PM -0800, Tejun Heo wrote:
> Hello, Vivek.
> 
> On Mon, Dec 19, 2011 at 01:27:17PM -0500, Vivek Goyal wrote:
> > Ok, that's good to know. If per cpu allocator can support this use case,
> > it will be good for 3.3 onwards. This seems to be right way to go to fix
> > the problem.
> 
> Ummm... if we're gonna make percpu usable w/ GFP_NOIO, the right
> interim solution would be making a simplistic mempool so that later
> when percpu can do it, it can be swapped easily.  I really can't see
> much benefit of adding refcnting on top of everything just for this.

Ok. So are you suggesting that I should write a simple mempool kind of
implementation of my own for group and per cpu data allocation. Keep
certain number of elements in the cache and trigger a worker thread to
allocate more elements once minimum number of elements go below
threshold. If pool has run out of pre-allocated elements then allocation
will fail and IO will be accounted to root group?

I am looking at the mempool implementation (mempool_create()) and looks
like that is not suitable for my use case. mempool_alloc() will call into
alloc function provided by me and pass the flags. I can't implement an
alloc function and honor that flag as per cpu alloc does not take any
flags.

So IIUC, existing mempool implementation is not directly usable for my
requirement and I need to write some code of my own for the caching
layer which always allocates objects from reserve and fills in the
pool asynchronously with the help of a worker thread.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)

2011-12-20 Thread Jens Axboe
On 2011-12-19 19:27, Vivek Goyal wrote:
> I think reverting the previous series is not going to be simple either. It
> had 13 patches.
> 
> https://lkml.org/lkml/2011/5/19/560
> 
> By making stats per cpu, I was able to reduce contention on request
> queue lock. Now we shall have to bring the lock back. 

That's not going to happen, having that lock in there was a disaster for
small IO on fast devices. It's essentially the limiting factor on
benchmark runs on the RHEL kernels that have it included and enabled...


-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)

2011-12-19 Thread Tejun Heo
Hello, Vivek.

On Mon, Dec 19, 2011 at 01:27:17PM -0500, Vivek Goyal wrote:
> Ok, that's good to know. If per cpu allocator can support this use case,
> it will be good for 3.3 onwards. This seems to be right way to go to fix
> the problem.

Ummm... if we're gonna make percpu usable w/ GFP_NOIO, the right
interim solution would be making a simplistic mempool so that later
when percpu can do it, it can be swapped easily.  I really can't see
much benefit of adding refcnting on top of everything just for this.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)

2011-12-19 Thread Vivek Goyal
On Mon, Dec 19, 2011 at 09:35:19AM -0800, Tejun Heo wrote:
> On Mon, Dec 19, 2011 at 12:27:17PM -0500, Vivek Goyal wrote:
> > On Sun, Dec 18, 2011 at 03:25:48PM -0600, Nate Custer wrote:
> > > 
> > > On Dec 16, 2011, at 2:29 PM, Vivek Goyal wrote:
> > > > Thanks for testing it Nate. I did some debugging and found out that 
> > > > patch
> > > > is doing double free on per cpu pointer hence the crash you are running
> > > > into. I could reproduce this problem on my box. It is just a matter of
> > > > doing rmdir on the blkio cgroup.
> > > > 
> > > > I understood the cmpxchg() semantics wrong. I have fixed it now and
> > > > no crashes on directory removal. Can you please give this version a
> > > > try.
> > > > 
> > > > Thanks
> > > > Vivek
> > > 
> > > After 24 hours of stress testing the machine remains up and working 
> > > without issue. I will continue to test it, but am reasonably confident 
> > > that this patch resolves my issue.
> > > 
> > 
> > Hi Nate,
> > 
> > I have come up with final version of the patch. This time I have used non
> > rentrant work queue to queue the stat alloc work. This also gets rid of
> > cmpxchg code as there is only one writer at a time. There are couple of
> > other small cleanups.
> > 
> > Can you please give patch also a try to make sure I have not broken
> > something while doing changes.
> > 
> > This version is based on 3.2-rc6. Once you confirm the results, I will
> > rebase it on top of "linux-block for-3.3/core" and post it to Jens for
> > inclusion.
> > 
> > Thanks
> > Vivek
> > 
> > Block cgroup currently allocates percpu data for some stats. This allocation
> > is happening in IO submission path which can recurse in IO stack.
> > 
> > Percpu allocation API does not take any allocation flags as input, hence
> > to avoid the deadlock problem under memory pressure, alloc per cpu data
> > from a worker thread context.
> > 
> > Only side affect of delayed allocation is that we will lose the blkio cgroup
> > stats for that group a small duration.
> > 
> > In case per cpu memory allocation fails, worker thread re-arms the work
> > with a delay of 1 second.
> > 
> > This patch is generated on top of 3.2-rc6
> > 
> > Signed-off-by: Vivek Goyal 
> > Reported-by: Nate Custer 
> > Tested-by: Nate Custer 
> 
> Hmmm... I really don't like this approach.  It's so unnecessarily
> complex with extra refcounting and all when about the same thing can
> be achieved by implementing simple mempool which is filled
> asynchronously.  Also, the fix seems way too invasive even for -rc6,
> let alone -stable.  If reverting isn't gonna be invasive, maybe that's
> a better approach for now?

I think reverting the previous series is not going to be simple either. It
had 13 patches.

https://lkml.org/lkml/2011/5/19/560

By making stats per cpu, I was able to reduce contention on request
queue lock. Now we shall have to bring the lock back. 

So to me, neither of the solution is very clean/simple. May be we can take
this patch in as temporary stop gap measure for 3.2, while you cook a
patch for per cpu allocator for 3.3 onwards.

This patch looks big, but is not very complex. Now allocation of group
happens without dropping queue lock and it gets rid of all the trickery
of dropping lock, reacquiring it and checking for queue dead etc.

Some code is related to making group refering atomic again. (Which was
atomic in the past. So we just resotore that functionality).

The only core piece is taking a reference on the group and asking
a worker thread to do the allocation. May be if I break the patch
into small pieces, it will be more clear.

So if I had to choose between reverting the series vs taking this patch
as stop gap measure, I will prefer taking this patch. It can get some
soak time in Jens's tree and if everything looks good then push it for
3.2. If things look good there, then it can be pulled into -stable. So
it will be a while before these changes are pulled into -stable.

> 
> I've been thinking about it and I think the use case is legit and
> maybe making percpu allocator support that isn't such a bad idea.  I'm
> not completely sure yet tho.  I'll give it a try later today.

Ok, that's good to know. If per cpu allocator can support this use case,
it will be good for 3.3 onwards. This seems to be right way to go to fix
the problem.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)

2011-12-19 Thread Tejun Heo
On Mon, Dec 19, 2011 at 12:27:17PM -0500, Vivek Goyal wrote:
> On Sun, Dec 18, 2011 at 03:25:48PM -0600, Nate Custer wrote:
> > 
> > On Dec 16, 2011, at 2:29 PM, Vivek Goyal wrote:
> > > Thanks for testing it Nate. I did some debugging and found out that patch
> > > is doing double free on per cpu pointer hence the crash you are running
> > > into. I could reproduce this problem on my box. It is just a matter of
> > > doing rmdir on the blkio cgroup.
> > > 
> > > I understood the cmpxchg() semantics wrong. I have fixed it now and
> > > no crashes on directory removal. Can you please give this version a
> > > try.
> > > 
> > > Thanks
> > > Vivek
> > 
> > After 24 hours of stress testing the machine remains up and working without 
> > issue. I will continue to test it, but am reasonably confident that this 
> > patch resolves my issue.
> > 
> 
> Hi Nate,
> 
> I have come up with final version of the patch. This time I have used non
> rentrant work queue to queue the stat alloc work. This also gets rid of
> cmpxchg code as there is only one writer at a time. There are couple of
> other small cleanups.
> 
> Can you please give patch also a try to make sure I have not broken
> something while doing changes.
> 
> This version is based on 3.2-rc6. Once you confirm the results, I will
> rebase it on top of "linux-block for-3.3/core" and post it to Jens for
> inclusion.
> 
> Thanks
> Vivek
> 
> Block cgroup currently allocates percpu data for some stats. This allocation
> is happening in IO submission path which can recurse in IO stack.
> 
> Percpu allocation API does not take any allocation flags as input, hence
> to avoid the deadlock problem under memory pressure, alloc per cpu data
> from a worker thread context.
> 
> Only side affect of delayed allocation is that we will lose the blkio cgroup
> stats for that group a small duration.
> 
> In case per cpu memory allocation fails, worker thread re-arms the work
> with a delay of 1 second.
> 
> This patch is generated on top of 3.2-rc6
> 
> Signed-off-by: Vivek Goyal 
> Reported-by: Nate Custer 
> Tested-by: Nate Custer 

Hmmm... I really don't like this approach.  It's so unnecessarily
complex with extra refcounting and all when about the same thing can
be achieved by implementing simple mempool which is filled
asynchronously.  Also, the fix seems way too invasive even for -rc6,
let alone -stable.  If reverting isn't gonna be invasive, maybe that's
a better approach for now?

I've been thinking about it and I think the use case is legit and
maybe making percpu allocator support that isn't such a bad idea.  I'm
not completely sure yet tho.  I'll give it a try later today.

Thank you.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)

2011-12-19 Thread Vivek Goyal
On Sun, Dec 18, 2011 at 03:25:48PM -0600, Nate Custer wrote:
> 
> On Dec 16, 2011, at 2:29 PM, Vivek Goyal wrote:
> > Thanks for testing it Nate. I did some debugging and found out that patch
> > is doing double free on per cpu pointer hence the crash you are running
> > into. I could reproduce this problem on my box. It is just a matter of
> > doing rmdir on the blkio cgroup.
> > 
> > I understood the cmpxchg() semantics wrong. I have fixed it now and
> > no crashes on directory removal. Can you please give this version a
> > try.
> > 
> > Thanks
> > Vivek
> 
> After 24 hours of stress testing the machine remains up and working without 
> issue. I will continue to test it, but am reasonably confident that this 
> patch resolves my issue.
> 

Hi Nate,

I have come up with final version of the patch. This time I have used non
rentrant work queue to queue the stat alloc work. This also gets rid of
cmpxchg code as there is only one writer at a time. There are couple of
other small cleanups.

Can you please give patch also a try to make sure I have not broken
something while doing changes.

This version is based on 3.2-rc6. Once you confirm the results, I will
rebase it on top of "linux-block for-3.3/core" and post it to Jens for
inclusion.

Thanks
Vivek

Block cgroup currently allocates percpu data for some stats. This allocation
is happening in IO submission path which can recurse in IO stack.

Percpu allocation API does not take any allocation flags as input, hence
to avoid the deadlock problem under memory pressure, alloc per cpu data
from a worker thread context.

Only side affect of delayed allocation is that we will lose the blkio cgroup
stats for that group a small duration.

In case per cpu memory allocation fails, worker thread re-arms the work
with a delay of 1 second.

This patch is generated on top of 3.2-rc6

Signed-off-by: Vivek Goyal 
Reported-by: Nate Custer 
Tested-by: Nate Custer 
---
 block/blk-cgroup.c   |   31 -
 block/blk-cgroup.h   |   11 +++-
 block/blk-throttle.c |  115 ---
 block/cfq-iosched.c  |  111 +++--
 4 files changed, 173 insertions(+), 95 deletions(-)

Index: block/blk-throttle.c
===
--- block/blk-throttle.c.orig   2011-12-19 23:10:53.0 -0500
+++ block/blk-throttle.c2011-12-19 23:11:14.0 -0500
@@ -81,6 +81,7 @@ struct throtl_grp {
int limits_changed;
 
struct rcu_head rcu_head;
+   struct delayed_work stat_alloc_work;
 };
 
 struct throtl_data
@@ -159,6 +160,12 @@ static void throtl_free_tg(struct rcu_he
struct throtl_grp *tg;
 
tg = container_of(head, struct throtl_grp, rcu_head);
+
+   /*
+* delayed per cpu allocation should be visible here due to
+* barrier guarantees offered by atomic_dec_and_test() called
+* during dropping reference.
+*/
free_percpu(tg->blkg.stats_cpu);
kfree(tg);
 }
@@ -181,6 +188,48 @@ static void throtl_put_tg(struct throtl_
call_rcu(&tg->rcu_head, throtl_free_tg);
 }
 
+static inline void throtl_schedule_pcpu_stat_alloc(struct throtl_grp *tg,
+   unsigned long delay) {
+   WARN_ON(tg->blkg.stats_cpu != NULL);
+
+   /*
+* Schedule the group per cpu stat allocation through worker
+* thread
+*/
+   throtl_ref_get_tg(tg);
+
+   /* queue work on non-rentrant work queue */
+   queue_delayed_work(system_nrt_wq, &tg->stat_alloc_work, delay);
+}
+
+/* No locks taken. A reference to throtl_grp taken before invocation */
+static void tg_stat_alloc_fn(struct work_struct *work)
+{
+   struct delayed_work *dwork = to_delayed_work(work);
+   struct throtl_grp *tg = container_of(dwork, struct throtl_grp,
+   stat_alloc_work);
+   void *stat_ptr = NULL;
+
+   stat_ptr = blkio_alloc_blkg_stats_pcpu();
+
+   if (stat_ptr == NULL) {
+   /*
+* If memory allocation failed, try again with delay of 1s
+* FIXME: Put an upper limit on number of retries?
+*/
+   throtl_schedule_pcpu_stat_alloc(tg, HZ);
+   throtl_put_tg(tg);
+   return;
+   }
+
+   /*
+* There is only one writer at a time and that is worker thread. Hence
+* no synchronization should be required.
+*/
+   blkio_set_blkg_pcpu_stats(&tg->blkg, stat_ptr);
+   throtl_put_tg(tg);
+}
+
 static void throtl_init_group(struct throtl_grp *tg)
 {
INIT_HLIST_NODE(&tg->tg_node);
@@ -188,6 +237,7 @@ static void throtl_init_group(struct thr
bio_list_init(&tg->bio_lists[0]);
bio_list_init(&tg->bio_lists[1]);
tg->limits_changed = false;
+   INIT_DELAYED_WORK(&tg->stat_alloc_work, tg_stat_alloc_fn);
 
/* Practically unlimited BW */

Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)

2011-12-19 Thread Vivek Goyal
On Sun, Dec 18, 2011 at 03:25:48PM -0600, Nate Custer wrote:
> 
> On Dec 16, 2011, at 2:29 PM, Vivek Goyal wrote:
> > Thanks for testing it Nate. I did some debugging and found out that patch
> > is doing double free on per cpu pointer hence the crash you are running
> > into. I could reproduce this problem on my box. It is just a matter of
> > doing rmdir on the blkio cgroup.
> > 
> > I understood the cmpxchg() semantics wrong. I have fixed it now and
> > no crashes on directory removal. Can you please give this version a
> > try.
> > 
> > Thanks
> > Vivek
> 
> After 24 hours of stress testing the machine remains up and working without 
> issue. I will continue to test it, but am reasonably confident that this 
> patch resolves my issue.

That's good to know. While you continue to stress test, I will make some
minor modifications to the patch. (Thinking of delaying the retry of
allocation of per cpu memory in case previous attempts failed).

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)

2011-12-18 Thread Nate Custer

On Dec 16, 2011, at 2:29 PM, Vivek Goyal wrote:
> Thanks for testing it Nate. I did some debugging and found out that patch
> is doing double free on per cpu pointer hence the crash you are running
> into. I could reproduce this problem on my box. It is just a matter of
> doing rmdir on the blkio cgroup.
> 
> I understood the cmpxchg() semantics wrong. I have fixed it now and
> no crashes on directory removal. Can you please give this version a
> try.
> 
> Thanks
> Vivek

After 24 hours of stress testing the machine remains up and working without 
issue. I will continue to test it, but am reasonably confident that this patch 
resolves my issue.

Nate Custer--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)

2011-12-16 Thread Vivek Goyal
On Fri, Dec 16, 2011 at 12:43:52PM -0600, Nate Custer wrote:
> 
> On Dec 15, 2011, at 1:47 PM, Vivek Goyal wrote:
> > Ok, I continued to develop on the patch which tries to allocate per cpu
> > stats from worker thread context. Here is the patch.
> > 
> > Can the reporter please try out the patch and see if it helps. I am not
> > sure if deadlock was because of mutex issue or not, but it should help
> > get rid of lockdep warning.
> > 
> > This patch is on top of for-3.3/core branch of jens's linux-block tree.
> > If it does not apply on your kernel version, do let me know the version 
> > you are testing with and I will generate another version of patch.
> > 
> > If testing results are good, I will break down the patch in small pieces
> > and post as a series separately.
> > 
> > Thanks
> > Vivek
> 
> Running on a fedora-16 box with the patch applied to the linux-block tree I 
> still had deadlocks. In fact it seemed to happen much faster and with ligher 
> workloads.
> 
> I was able to get netconsole setup and a full stacktrace is posted here:
> 
> http://pastebin.com/9Rq68exU

Thanks for testing it Nate. I did some debugging and found out that patch
is doing double free on per cpu pointer hence the crash you are running
into. I could reproduce this problem on my box. It is just a matter of
doing rmdir on the blkio cgroup.

I understood the cmpxchg() semantics wrong. I have fixed it now and
no crashes on directory removal. Can you please give this version a
try.

Thanks
Vivek

Alloc per cpu data from a worker thread context to avoid possibility of a
deadlock under memory pressure.

Only side affect of delayed allocation is that we will lose the blkio cgroup
stats for that group a small duration.

This patch is generated on top of "for-3.3/core" branch of linux-block tree.

-v2:
 Fixed the issue of double free on percpu pointer.
---
 block/blk-cgroup.c   |   31 -
 block/blk-cgroup.h   |9 +++
 block/blk-throttle.c |  116 -
 block/cfq-iosched.c  |  119 ++-
 4 files changed, 180 insertions(+), 95 deletions(-)

Index: block/blk-throttle.c
===
--- block/blk-throttle.c.orig   2011-12-17 01:49:34.0 -0500
+++ block/blk-throttle.c2011-12-17 02:21:50.0 -0500
@@ -81,6 +81,7 @@ struct throtl_grp {
int limits_changed;
 
struct rcu_head rcu_head;
+   struct work_struct stat_alloc_work;
 };
 
 struct throtl_data
@@ -159,6 +160,13 @@ static void throtl_free_tg(struct rcu_he
struct throtl_grp *tg;
 
tg = container_of(head, struct throtl_grp, rcu_head);
+
+   /*
+* Will delayed allocation be visible here for sure? I am relying
+* on the fact that after blkg.stats_cpu assignment, we drop
+* reference to group using atomic_dec() which should imply
+* barrier
+*/
free_percpu(tg->blkg.stats_cpu);
kfree(tg);
 }
@@ -181,6 +189,48 @@ static void throtl_put_tg(struct throtl_
call_rcu(&tg->rcu_head, throtl_free_tg);
 }
 
+static inline void throtl_check_schedule_pcpu_stat_alloc(struct throtl_grp 
*tg) {
+   if (tg->blkg.stats_cpu != NULL)
+   return;
+   /*
+* Schedule the group per cpu stat allocation through worker
+* thread
+*/
+   throtl_ref_get_tg(tg);
+   if (!schedule_work(&tg->stat_alloc_work)) {
+   /* Work is already scheduled by somebody */
+   throtl_put_tg(tg);
+   }
+}
+
+/* No locks taken. A reference to throtl_grp taken before invocation */
+static void tg_stat_alloc_fn(struct work_struct *work)
+{
+   struct throtl_grp *tg = container_of(work, struct throtl_grp,
+   stat_alloc_work);
+   void *stat_ptr = NULL;
+   unsigned long ret;
+
+   stat_ptr = blkio_alloc_blkg_stats_pcpu();
+
+   if (stat_ptr == NULL) {
+   /* If memory allocation failed, try again */
+   throtl_check_schedule_pcpu_stat_alloc(tg);
+   throtl_put_tg(tg);
+   return;
+   }
+
+   ret = blkio_cmpxchg_blkg_stats(&tg->blkg, 0,
+   (unsigned long)stat_ptr);
+
+   if (ret != 0) {
+   /* Somebody else got to it first */
+   free_percpu(stat_ptr);
+   }
+
+   throtl_put_tg(tg);
+}
+
 static void throtl_init_group(struct throtl_grp *tg)
 {
INIT_HLIST_NODE(&tg->tg_node);
@@ -188,6 +238,7 @@ static void throtl_init_group(struct thr
bio_list_init(&tg->bio_lists[0]);
bio_list_init(&tg->bio_lists[1]);
tg->limits_changed = false;
+   INIT_WORK(&tg->stat_alloc_work, tg_stat_alloc_fn);
 
/* Practically unlimited BW */
tg->bps[0] = tg->bps[1] = -1;
@@ -263,8 +314,25 @@ static void throtl_init_add_tg_lists(str
throtl_add_group_to_td_list(t

Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)

2011-12-16 Thread Nate Custer

On Dec 15, 2011, at 1:47 PM, Vivek Goyal wrote:
> Ok, I continued to develop on the patch which tries to allocate per cpu
> stats from worker thread context. Here is the patch.
> 
> Can the reporter please try out the patch and see if it helps. I am not
> sure if deadlock was because of mutex issue or not, but it should help
> get rid of lockdep warning.
> 
> This patch is on top of for-3.3/core branch of jens's linux-block tree.
> If it does not apply on your kernel version, do let me know the version 
> you are testing with and I will generate another version of patch.
> 
> If testing results are good, I will break down the patch in small pieces
> and post as a series separately.
> 
> Thanks
> Vivek

Running on a fedora-16 box with the patch applied to the linux-block tree I 
still had deadlocks. In fact it seemed to happen much faster and with ligher 
workloads.

I was able to get netconsole setup and a full stacktrace is posted here:

http://pastebin.com/9Rq68exU

Nate

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)

2011-12-15 Thread Vivek Goyal
On Wed, Dec 14, 2011 at 05:03:54PM +0100, Jens Axboe wrote:
> On 2011-12-14 14:43, Avi Kivity wrote:
> > On 12/14/2011 02:25 PM, Marcelo Tosatti wrote:
> >> On Mon, Dec 05, 2011 at 04:48:16PM -0600, Nate Custer wrote:
> >>> Hello,
> >>>
> >>> I am struggling with repeatable full hardware locks when running 8-12 KVM 
> >>> vms. At some point before the hard lock I get a inconsistent lock state 
> >>> warning. An example of this can be found here:
> >>>
> >>> http://pastebin.com/8wKhgE2C
> >>>
> >>> After that the server continues to run for a while and then starts its 
> >>> death spiral. When it reaches that point it fails to log anything further 
> >>> to the disk, but by attaching a console I have been able to get a stack 
> >>> trace documenting the final implosion:
> >>>
> >>> http://pastebin.com/PbcN76bd
> >>>
> >>> All of the cores end up hung and the server stops responding to all 
> >>> input, including SysRq commands. 
> >>>
> >>> I have seen this behavior on two machines (dual E5606 running Fedora 16) 
> >>> both passed cpuburnin testing and memtest86 scans without error. 
> >>>
> >>> I have reproduced the crash and stack traces from a Fedora debugging 
> >>> kernel - 3.1.2-1 and with a vanilla 3.1.4 kernel.
> >>
> >> Busted hardware, apparently. Can you reproduce these issues with the
> >> same workload on different hardware?
> > 
> > I don't think it's hardware related.  The second trace (in the first
> > paste) is called during swap, so GFP_FS is set.  The first one is not,
> > so GFP_FS is clear.  Lockdep is worried about the following scenario:
> > 
> >   acpi_early_init() is called
> >   calls pcpu_alloc(), which takes pcpu_alloc_mutex
> >   eventually, calls kmalloc(), or some other allocation function
> >   no memory, so swap
> >   call try_to_free_pages()
> >   submit_bio()
> >   blk_throtl_bio()
> >   blkio_alloc_blkg_stats()
> >   alloc_percpu()
> >   pcpu_alloc(), which takes pcpu_alloc_mutex
> >   deadlock
> > 
> > It's a little unlikely that acpi_early_init() will OOM, but lockdep
> > doesn't know that.  Other callers of pcpu_alloc() could trigger the same
> > thing.
> > 
> > When lockdep says
> > 
> > [ 5839.924953] other info that might help us debug this:
> > [ 5839.925396]  Possible unsafe locking scenario:
> > [ 5839.925397]
> > [ 5839.925840]CPU0
> > [ 5839.926063]
> > [ 5839.926287]   lock(pcpu_alloc_mutex);
> > [ 5839.926533]   
> > [ 5839.926756] lock(pcpu_alloc_mutex);
> > [ 5839.926986]
> > 
> > It really means
> > 
> >
> > 
> > GFP_FS simply marks the beginning of a nested, unrelated context that
> > uses the same thread, just like an interrupt.  Kudos to lockdep for
> > catching that.
> > 
> > I think the allocation in blkio_alloc_blkg_stats() should be moved out
> > of the I/O path into some init function. Copying Jens.
> 
> That's completely buggy, basically you end up with a GFP_KERNEL
> allocation from the IO submit path. Vivek, per_cpu data needs to be set
> up at init time. You can't allocate it dynamically off the IO path.

Ok, I continued to develop on the patch which tries to allocate per cpu
stats from worker thread context. Here is the patch.

Can the reporter please try out the patch and see if it helps. I am not
sure if deadlock was because of mutex issue or not, but it should help
get rid of lockdep warning.

This patch is on top of for-3.3/core branch of jens's linux-block tree.
If it does not apply on your kernel version, do let me know the version 
you are testing with and I will generate another version of patch.

If testing results are good, I will break down the patch in small pieces
and post as a series separately.

Thanks
Vivek

Alloc per cpu data from a worker thread context to avoid possibility of a
deadlock under memory pressure.

Only side affect of delayed allocation is that we will lose the blkio cgroup
stats for that group a small duration.

This patch is generated on top of "for-3.3/core" branch of linux-block tree.

---
 block/blk-cgroup.c   |   31 -
 block/blk-cgroup.h   |9 +++
 block/blk-throttle.c |  116 -
 block/cfq-iosched.c  |  119 ++-
 4 files changed, 180 insertions(+), 95 deletions(-)

Index: block/blk-throttle.c
===
--- block/blk-throttle.c.orig   2011-12-15 20:31:05.0 -0500
+++ block/blk-throttle.c2011-12-15 20:32:15.0 -0500
@@ -81,6 +81,7 @@ struct throtl_grp {
int limits_changed;
 
struct rcu_head rcu_head;
+   struct work_struct stat_alloc_work;
 };
 
 struct throtl_data
@@ -159,6 +160,13 @@ static void throtl_free_tg(struct rcu_he
struct throtl_grp *tg;
 
tg = container_of(head, struct throtl_grp, rcu_head);
+
+   /*
+* Will delayed allocation be visible here for sure? I am relying
+* on the fact that after blkg.stats_cpu assignment, we drop
+