Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context (Re: kvm deadlock)

2011-12-20 Thread Tejun Heo
Hello,

On Tue, Dec 20, 2011 at 09:50:24AM -0500, Vivek Goyal wrote:
> So IIUC, the existing mempool implementation is not directly usable for my
> requirement, and I need to write some code of my own for the caching
> layer, which always allocates objects from the reserve and refills the
> pool asynchronously with the help of a worker thread.

I've been looking at it, and I don't think allowing the percpu allocator to
be called from the NOIO path is a good idea.  The on-demand area filling
is tied into vmalloc area management, which in turn is tied to arch
page table code, and I really want to avoid pre-allocating a full chunk -
it can be huge.  I'm trying to extend mempool to cover percpu areas,
so please wait a bit.

Thanks.

-- 
tejun


Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context (Re: kvm deadlock)

2011-12-20 Thread Vivek Goyal
On Mon, Dec 19, 2011 at 02:56:35PM -0800, Tejun Heo wrote:
> Hello, Vivek.
> 
> On Mon, Dec 19, 2011 at 01:27:17PM -0500, Vivek Goyal wrote:
> > Ok, that's good to know. If the per cpu allocator can support this use
> > case, it will be good for 3.3 onwards. This seems to be the right way
> > to go to fix the problem.
> 
> Ummm... if we're gonna make percpu usable w/ GFP_NOIO, the right
> interim solution would be making a simplistic mempool so that later
> when percpu can do it, it can be swapped easily.  I really can't see
> much benefit of adding refcnting on top of everything just for this.

Ok. So are you suggesting that I write a simple mempool-like
implementation of my own for group and per cpu data allocation? Keep a
certain number of elements in the cache and trigger a worker thread to
allocate more elements once the number of elements goes below a
threshold. If the pool has run out of pre-allocated elements, then the
allocation will fail and the IO will be accounted to the root group?

I am looking at the mempool implementation (mempool_create()) and it
looks like it is not suitable for my use case. mempool_alloc() will call
into the alloc function provided by me and pass the gfp flags along. I
can't implement an alloc function that honors those flags, as the per
cpu allocator does not take any flags.
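
(Illustrative sketch, not part of the original mail: the 3.x signatures
show the mismatch directly. mempool's callback receives gfp flags, while
the percpu allocator of this era has nowhere to accept them, so a glue
callback like the hypothetical pcpu_pool_alloc() below is forced to drop
the flags on the floor.)

/* include/linux/mempool.h (3.x): the callback mempool expects */
typedef void * (mempool_alloc_t)(gfp_t gfp_mask, void *pool_data);

/* include/linux/percpu.h (3.x): no gfp parameter anywhere */
void __percpu *__alloc_percpu(size_t size, size_t align);

/*
 * A hypothetical glue callback has to ignore gfp_mask, which under
 * GFP_NOIO silently re-introduces the very deadlock the flag is
 * supposed to prevent.
 */
static void *pcpu_pool_alloc(gfp_t gfp_mask, void *pool_data)
{
        size_t size = (size_t)(unsigned long)pool_data;

        /* gfp_mask cannot be honored here */
        return (void *)__alloc_percpu(size, __alignof__(unsigned long long));
}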

So IIUC, the existing mempool implementation is not directly usable for my
requirement, and I need to write some code of my own for the caching
layer, which always allocates objects from the reserve and refills the
pool asynchronously with the help of a worker thread.
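
(A rough sketch of that caching layer, with hypothetical names; this is
not the patch that was eventually written, just the shape of the idea:
the IO path only ever dips into a pre-filled reserve, and a worker
refills the reserve in a context where sleeping in the percpu allocator
is safe.)

#define PCPU_STAT_RESERVE       16

static void __percpu *stat_reserve[PCPU_STAT_RESERVE];
static int stat_reserve_count;
static DEFINE_SPINLOCK(stat_reserve_lock);
static void stat_refill_fn(struct work_struct *work);
static DECLARE_WORK(stat_refill_work, stat_refill_fn);

/*
 * IO path: never calls the percpu allocator, only dips into the
 * reserve. If the reserve is empty we return NULL and the caller
 * simply loses stats until the refill worker catches up.
 */
static void __percpu *stat_pool_get(void)
{
        void __percpu *p = NULL;
        unsigned long flags;

        spin_lock_irqsave(&stat_reserve_lock, flags);
        if (stat_reserve_count)
                p = stat_reserve[--stat_reserve_count];
        spin_unlock_irqrestore(&stat_reserve_lock, flags);

        schedule_work(&stat_refill_work);       /* refill asynchronously */
        return p;
}

/* Worker context: sleeping is fine, so alloc_percpu() is safe here. */
static void stat_refill_fn(struct work_struct *work)
{
        for (;;) {
                void __percpu *p;
                unsigned long flags;

                p = alloc_percpu(struct blkio_group_stats_cpu);
                if (!p)
                        break;

                spin_lock_irqsave(&stat_reserve_lock, flags);
                if (stat_reserve_count == PCPU_STAT_RESERVE) {
                        spin_unlock_irqrestore(&stat_reserve_lock, flags);
                        free_percpu(p);
                        break;
                }
                stat_reserve[stat_reserve_count++] = p;
                spin_unlock_irqrestore(&stat_reserve_lock, flags);
        }
}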

Thanks
Vivek


Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context (Re: kvm deadlock)

2011-12-20 Thread Jens Axboe
On 2011-12-19 19:27, Vivek Goyal wrote:
> I think reverting the previous series is not going to be simple either. It
> had 13 patches.
> 
> https://lkml.org/lkml/2011/5/19/560
> 
> By making stats per cpu, I was able to reduce contention on request
> queue lock. Now we shall have to bring the lock back. 

That's not going to happen, having that lock in there was a disaster for
small IO on fast devices. It's essentially the limiting factor on
benchmark runs on the RHEL kernels that have it included and enabled...


-- 
Jens Axboe



Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context (Re: kvm deadlock)

2011-12-19 Thread Tejun Heo
Hello, Vivek.

On Mon, Dec 19, 2011 at 01:27:17PM -0500, Vivek Goyal wrote:
> Ok, that's good to know. If the per cpu allocator can support this use
> case, it will be good for 3.3 onwards. This seems to be the right way
> to go to fix the problem.

Ummm... if we're gonna make percpu usable w/ GFP_NOIO, the right
interim solution would be making a simplistic mempool so that later
when percpu can do it, it can be swapped easily.  I really can't see
much benefit of adding refcnting on top of everything just for this.

Thanks.

-- 
tejun


Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context (Re: kvm deadlock)

2011-12-19 Thread Vivek Goyal
On Mon, Dec 19, 2011 at 09:35:19AM -0800, Tejun Heo wrote:
> On Mon, Dec 19, 2011 at 12:27:17PM -0500, Vivek Goyal wrote:
> > On Sun, Dec 18, 2011 at 03:25:48PM -0600, Nate Custer wrote:
> > > 
> > > On Dec 16, 2011, at 2:29 PM, Vivek Goyal wrote:
> > > > Thanks for testing it Nate. I did some debugging and found out that
> > > > the patch is doing a double free on a per cpu pointer, hence the
> > > > crash you are running into. I could reproduce this problem on my box.
> > > > It is just a matter of doing rmdir on the blkio cgroup.
> > > > 
> > > > I understood the cmpxchg() semantics wrong. I have fixed it now and
> > > > see no crashes on directory removal. Can you please give this version
> > > > a try?
> > > > 
> > > > Thanks
> > > > Vivek
> > > 
> > > After 24 hours of stress testing the machine remains up and working 
> > > without issue. I will continue to test it, but am reasonably confident 
> > > that this patch resolves my issue.
> > > 
> > 
> > Hi Nate,
> > 
> > I have come up with the final version of the patch. This time I have
> > used a non-reentrant workqueue to queue the stat alloc work. This also
> > gets rid of the cmpxchg code, as there is only one writer at a time.
> > There are a couple of other small cleanups.
> > 
> > Can you please give this patch a try as well, to make sure I have not
> > broken something while making these changes.
> > 
> > This version is based on 3.2-rc6. Once you confirm the results, I will
> > rebase it on top of "linux-block for-3.3/core" and post it to Jens for
> > inclusion.
> > 
> > Thanks
> > Vivek
> > 
> > Block cgroup currently allocates percpu data for some stats. This
> > allocation happens in the IO submission path, which can recurse into
> > the IO stack.
> > 
> > The percpu allocation API does not take any allocation flags as input;
> > hence, to avoid the deadlock problem under memory pressure, allocate
> > the per cpu data from worker thread context.
> > 
> > The only side effect of the delayed allocation is that we will lose the
> > blkio cgroup stats for that group for a small duration.
> > 
> > In case the per cpu memory allocation fails, the worker thread re-arms
> > the work with a delay of 1 second.
> > 
> > This patch is generated on top of 3.2-rc6
> > 
> > Signed-off-by: Vivek Goyal 
> > Reported-by: Nate Custer 
> > Tested-by: Nate Custer 
> 
> Hmmm... I really don't like this approach.  It's so unnecessarily
> complex with extra refcounting and all, when about the same thing can
> be achieved by implementing a simple mempool which is filled
> asynchronously.  Also, the fix seems way too invasive even for -rc6,
> let alone -stable.  If reverting isn't gonna be invasive, maybe that's
> a better approach for now?

I think reverting the previous series is not going to be simple either. It
had 13 patches.

https://lkml.org/lkml/2011/5/19/560

By making the stats per cpu, I was able to reduce contention on the
request queue lock. Now we shall have to bring the lock back.
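
(Illustrative only, not code from the series; field names are
approximations of the blk-cgroup stats. The difference in the hot path
is roughly the following.)

/* Before the series: every stat update bounces the queue lock. */
spin_lock_irqsave(q->queue_lock, flags);
blkg->stats.sectors += bio_sectors(bio);
spin_unlock_irqrestore(q->queue_lock, flags);

/* After: per cpu counters, no shared lock on the fast path. */
stats_cpu = this_cpu_ptr(blkg->stats_cpu);
stats_cpu->sectors += bio_sectors(bio);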

So to me, neither of the solutions is very clean/simple. Maybe we can take
this patch in as a temporary stop-gap measure for 3.2, while you cook a
patch for the per cpu allocator for 3.3 onwards.

This patch looks big, but is not very complex. Now the allocation of a
group happens without dropping the queue lock, and it gets rid of all the
trickery of dropping the lock, reacquiring it, checking for queue death,
etc.

Some code is related to making group reference counting atomic again
(which was atomic in the past, so we just restore that functionality).

The only core piece is taking a reference on the group and asking
a worker thread to do the allocation. Maybe if I break the patch
into small pieces, it will be clearer.

So if I had to choose between reverting the series and taking this patch
as a stop-gap measure, I would prefer taking this patch. It can get some
soak time in Jens's tree, and if everything looks good, then push it for
3.2. If things look good there, then it can be pulled into -stable. So
it will be a while before these changes are pulled into -stable.

> 
> I've been thinking about it and I think the use case is legit and
> maybe making percpu allocator support that isn't such a bad idea.  I'm
> not completely sure yet tho.  I'll give it a try later today.

Ok, that's good to know. If the per cpu allocator can support this use
case, it will be good for 3.3 onwards. This seems to be the right way to
go to fix the problem.

Thanks
Vivek


Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context (Re: kvm deadlock)

2011-12-19 Thread Tejun Heo
On Mon, Dec 19, 2011 at 12:27:17PM -0500, Vivek Goyal wrote:
> On Sun, Dec 18, 2011 at 03:25:48PM -0600, Nate Custer wrote:
> > 
> > On Dec 16, 2011, at 2:29 PM, Vivek Goyal wrote:
> > > Thanks for testing it Nate. I did some debugging and found out that
> > > the patch is doing a double free on a per cpu pointer, hence the crash
> > > you are running into. I could reproduce this problem on my box. It is
> > > just a matter of doing rmdir on the blkio cgroup.
> > > 
> > > I understood the cmpxchg() semantics wrong. I have fixed it now and
> > > see no crashes on directory removal. Can you please give this version
> > > a try?
> > > 
> > > Thanks
> > > Vivek
> > 
> > After 24 hours of stress testing the machine remains up and working without 
> > issue. I will continue to test it, but am reasonably confident that this 
> > patch resolves my issue.
> > 
> 
> Hi Nate,
> 
> I have come up with the final version of the patch. This time I have used
> a non-reentrant workqueue to queue the stat alloc work. This also gets
> rid of the cmpxchg code, as there is only one writer at a time. There are
> a couple of other small cleanups.
> 
> Can you please give this patch a try as well, to make sure I have not
> broken something while making these changes.
> 
> This version is based on 3.2-rc6. Once you confirm the results, I will
> rebase it on top of "linux-block for-3.3/core" and post it to Jens for
> inclusion.
> 
> Thanks
> Vivek
> 
> Block cgroup currently allocates percpu data for some stats. This
> allocation happens in the IO submission path, which can recurse into the
> IO stack.
> 
> The percpu allocation API does not take any allocation flags as input;
> hence, to avoid the deadlock problem under memory pressure, allocate the
> per cpu data from worker thread context.
> 
> The only side effect of the delayed allocation is that we will lose the
> blkio cgroup stats for that group for a small duration.
> 
> In case the per cpu memory allocation fails, the worker thread re-arms
> the work with a delay of 1 second.
> 
> This patch is generated on top of 3.2-rc6
> 
> Signed-off-by: Vivek Goyal 
> Reported-by: Nate Custer 
> Tested-by: Nate Custer 

Hmmm... I really don't like this approach.  It's so unnecessarily
complex with extra refcounting and all, when about the same thing can
be achieved by implementing a simple mempool which is filled
asynchronously.  Also, the fix seems way too invasive even for -rc6,
let alone -stable.  If reverting isn't gonna be invasive, maybe that's
a better approach for now?

I've been thinking about it and I think the use case is legit and
maybe making percpu allocator support that isn't such a bad idea.  I'm
not completely sure yet tho.  I'll give it a try later today.

Thank you.

-- 
tejun


Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context (Re: kvm deadlock)

2011-12-19 Thread Vivek Goyal
On Sun, Dec 18, 2011 at 03:25:48PM -0600, Nate Custer wrote:
> 
> On Dec 16, 2011, at 2:29 PM, Vivek Goyal wrote:
> > Thanks for testing it Nate. I did some debugging and found out that the
> > patch is doing a double free on a per cpu pointer, hence the crash you
> > are running into. I could reproduce this problem on my box. It is just
> > a matter of doing rmdir on the blkio cgroup.
> > 
> > I understood the cmpxchg() semantics wrong. I have fixed it now and see
> > no crashes on directory removal. Can you please give this version a try?
> > 
> > Thanks
> > Vivek
> 
> After 24 hours of stress testing the machine remains up and working without 
> issue. I will continue to test it, but am reasonably confident that this 
> patch resolves my issue.
> 

Hi Nate,

I have come up with the final version of the patch. This time I have used
a non-reentrant workqueue to queue the stat alloc work. This also gets rid
of the cmpxchg code, as there is only one writer at a time. There are a
couple of other small cleanups.

Can you please give this patch a try as well, to make sure I have not
broken something while making these changes.

This version is based on 3.2-rc6. Once you confirm the results, I will
rebase it on top of "linux-block for-3.3/core" and post it to Jens for
inclusion.

Thanks
Vivek

Block cgroup currently allocates percpu data for some stats. This
allocation happens in the IO submission path, which can recurse into the
IO stack.

The percpu allocation API does not take any allocation flags as input;
hence, to avoid the deadlock problem under memory pressure, allocate the
per cpu data from worker thread context.

The only side effect of the delayed allocation is that we will lose the
blkio cgroup stats for that group for a small duration.

In case the per cpu memory allocation fails, the worker thread re-arms
the work with a delay of 1 second.

This patch is generated on top of 3.2-rc6

Signed-off-by: Vivek Goyal 
Reported-by: Nate Custer 
Tested-by: Nate Custer 
---
 block/blk-cgroup.c   |   31 -
 block/blk-cgroup.h   |   11 +++-
 block/blk-throttle.c |  115 ---
 block/cfq-iosched.c  |  111 +++--
 4 files changed, 173 insertions(+), 95 deletions(-)

Index: block/blk-throttle.c
===================================================================
--- block/blk-throttle.c.orig   2011-12-19 23:10:53.0 -0500
+++ block/blk-throttle.c2011-12-19 23:11:14.0 -0500
@@ -81,6 +81,7 @@ struct throtl_grp {
int limits_changed;
 
struct rcu_head rcu_head;
+   struct delayed_work stat_alloc_work;
 };
 
 struct throtl_data
@@ -159,6 +160,12 @@ static void throtl_free_tg(struct rcu_he
struct throtl_grp *tg;
 
tg = container_of(head, struct throtl_grp, rcu_head);
+
+   /*
+* delayed per cpu allocation should be visible here due to
+* barrier guarantees offered by atomic_dec_and_test() called
+* during dropping reference.
+*/
free_percpu(tg->blkg.stats_cpu);
kfree(tg);
 }
@@ -181,6 +188,48 @@ static void throtl_put_tg(struct throtl_
call_rcu(&tg->rcu_head, throtl_free_tg);
 }
 
+static inline void throtl_schedule_pcpu_stat_alloc(struct throtl_grp *tg,
+   unsigned long delay) {
+   WARN_ON(tg->blkg.stats_cpu != NULL);
+
+   /*
+* Schedule the group per cpu stat allocation through worker
+* thread
+*/
+   throtl_ref_get_tg(tg);
+
+   /* queue work on non-rentrant work queue */
+   queue_delayed_work(system_nrt_wq, &tg->stat_alloc_work, delay);
+}
+
+/* No locks taken. A reference to throtl_grp taken before invocation */
+static void tg_stat_alloc_fn(struct work_struct *work)
+{
+   struct delayed_work *dwork = to_delayed_work(work);
+   struct throtl_grp *tg = container_of(dwork, struct throtl_grp,
+   stat_alloc_work);
+   void *stat_ptr = NULL;
+
+   stat_ptr = blkio_alloc_blkg_stats_pcpu();
+
+   if (stat_ptr == NULL) {
+   /*
+* If memory allocation failed, try again with delay of 1s
+* FIXME: Put an upper limit on number of retries?
+*/
+   throtl_schedule_pcpu_stat_alloc(tg, HZ);
+   throtl_put_tg(tg);
+   return;
+   }
+
+   /*
+* There is only one writer at a time and that is worker thread. Hence
+* no synchronization should be required.
+*/
+   blkio_set_blkg_pcpu_stats(&tg->blkg, stat_ptr);
+   throtl_put_tg(tg);
+}
+
 static void throtl_init_group(struct throtl_grp *tg)
 {
INIT_HLIST_NODE(&tg->tg_node);
@@ -188,6 +237,7 @@ static void throtl_init_group(struct thr
bio_list_init(&tg->bio_lists[0]);
bio_list_init(&tg->bio_lists[1]);
tg->limits_changed = false;
+   INIT_DELAYED_WORK(&tg->stat_alloc_work, tg_stat_alloc_fn);
 
/* Practically unlimited BW */

Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context (Re: kvm deadlock)

2011-12-19 Thread Vivek Goyal
On Sun, Dec 18, 2011 at 03:25:48PM -0600, Nate Custer wrote:
> 
> On Dec 16, 2011, at 2:29 PM, Vivek Goyal wrote:
> > Thanks for testing it Nate. I did some debugging and found out that the
> > patch is doing a double free on a per cpu pointer, hence the crash you
> > are running into. I could reproduce this problem on my box. It is just
> > a matter of doing rmdir on the blkio cgroup.
> > 
> > I understood the cmpxchg() semantics wrong. I have fixed it now and see
> > no crashes on directory removal. Can you please give this version a try?
> > 
> > Thanks
> > Vivek
> 
> After 24 hours of stress testing the machine remains up and working without 
> issue. I will continue to test it, but am reasonably confident that this 
> patch resolves my issue.

That's good to know. While you continue to stress test, I will make some
minor modifications to the patch. (Thinking of delaying the retry of
allocation of per cpu memory in case previous attempts failed).

Thanks
Vivek


Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context (Re: kvm deadlock)

2011-12-18 Thread Nate Custer

On Dec 16, 2011, at 2:29 PM, Vivek Goyal wrote:
> Thanks for testing it Nate. I did some debugging and found out that the
> patch is doing a double free on a per cpu pointer, hence the crash you
> are running into. I could reproduce this problem on my box. It is just a
> matter of doing rmdir on the blkio cgroup.
> 
> I understood the cmpxchg() semantics wrong. I have fixed it now and see
> no crashes on directory removal. Can you please give this version a try?
> 
> Thanks
> Vivek

After 24 hours of stress testing the machine remains up and working without 
issue. I will continue to test it, but am reasonably confident that this patch 
resolves my issue.

Nate Custer


Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context (Re: kvm deadlock)

2011-12-16 Thread Vivek Goyal
On Fri, Dec 16, 2011 at 12:43:52PM -0600, Nate Custer wrote:
> 
> On Dec 15, 2011, at 1:47 PM, Vivek Goyal wrote:
> > Ok, I continued to develop the patch which tries to allocate the per
> > cpu stats from worker thread context. Here is the patch.
> > 
> > Can the reporter please try out the patch and see if it helps. I am not
> > sure whether the deadlock was because of the mutex issue or not, but it
> > should help get rid of the lockdep warning.
> > 
> > This patch is on top of the for-3.3/core branch of Jens's linux-block
> > tree. If it does not apply on your kernel version, do let me know the
> > version you are testing with and I will generate another version of the
> > patch.
> > 
> > If testing results are good, I will break the patch down into small
> > pieces and post it as a series separately.
> > 
> > Thanks
> > Vivek
> 
> Running on a fedora-16 box with the patch applied to the linux-block tree
> I still had deadlocks. In fact they seemed to happen much faster and with
> lighter workloads.
> 
> I was able to get netconsole setup and a full stacktrace is posted here:
> 
> http://pastebin.com/9Rq68exU

Thanks for testing it Nate. I did some debugging and found out that the
patch is doing a double free on a per cpu pointer, hence the crash you are
running into. I could reproduce this problem on my box. It is just a
matter of doing rmdir on the blkio cgroup.

I understood the cmpxchg() semantics wrong. I have fixed it now and see no
crashes on directory removal. Can you please give this version a try?

Thanks
Vivek

Alloc the per cpu data from worker thread context to avoid the possibility
of a deadlock under memory pressure.

The only side effect of the delayed allocation is that we will lose the
blkio cgroup stats for that group for a small duration.

This patch is generated on top of "for-3.3/core" branch of linux-block tree.

-v2:
 Fixed the issue of double free on percpu pointer.
---
 block/blk-cgroup.c   |   31 -
 block/blk-cgroup.h   |9 +++
 block/blk-throttle.c |  116 -
 block/cfq-iosched.c  |  119 ++-
 4 files changed, 180 insertions(+), 95 deletions(-)

Index: block/blk-throttle.c
===================================================================
--- block/blk-throttle.c.orig   2011-12-17 01:49:34.0 -0500
+++ block/blk-throttle.c2011-12-17 02:21:50.0 -0500
@@ -81,6 +81,7 @@ struct throtl_grp {
int limits_changed;
 
struct rcu_head rcu_head;
+   struct work_struct stat_alloc_work;
 };
 
 struct throtl_data
@@ -159,6 +160,13 @@ static void throtl_free_tg(struct rcu_he
struct throtl_grp *tg;
 
tg = container_of(head, struct throtl_grp, rcu_head);
+
+   /*
+* Will delayed allocation be visible here for sure? I am relying
+* on the fact that after blkg.stats_cpu assignment, we drop
+* reference to group using atomic_dec() which should imply
+* barrier
+*/
free_percpu(tg->blkg.stats_cpu);
kfree(tg);
 }
@@ -181,6 +189,48 @@ static void throtl_put_tg(struct throtl_
call_rcu(&tg->rcu_head, throtl_free_tg);
 }
 
+static inline void throtl_check_schedule_pcpu_stat_alloc(struct throtl_grp *tg) {
+   if (tg->blkg.stats_cpu != NULL)
+   return;
+   /*
+* Schedule the group per cpu stat allocation through worker
+* thread
+*/
+   throtl_ref_get_tg(tg);
+   if (!schedule_work(&tg->stat_alloc_work)) {
+   /* Work is already scheduled by somebody */
+   throtl_put_tg(tg);
+   }
+}
+
+/* No locks taken. A reference to throtl_grp taken before invocation */
+static void tg_stat_alloc_fn(struct work_struct *work)
+{
+   struct throtl_grp *tg = container_of(work, struct throtl_grp,
+   stat_alloc_work);
+   void *stat_ptr = NULL;
+   unsigned long ret;
+
+   stat_ptr = blkio_alloc_blkg_stats_pcpu();
+
+   if (stat_ptr == NULL) {
+   /* If memory allocation failed, try again */
+   throtl_check_schedule_pcpu_stat_alloc(tg);
+   throtl_put_tg(tg);
+   return;
+   }
+
+   ret = blkio_cmpxchg_blkg_stats(&tg->blkg, 0,
+   (unsigned long)stat_ptr);
+
+   if (ret != 0) {
+   /* Somebody else got to it first */
+   free_percpu(stat_ptr);
+   }
+
+   throtl_put_tg(tg);
+}
+
 static void throtl_init_group(struct throtl_grp *tg)
 {
INIT_HLIST_NODE(&tg->tg_node);
@@ -188,6 +238,7 @@ static void throtl_init_group(struct thr
bio_list_init(&tg->bio_lists[0]);
bio_list_init(&tg->bio_lists[1]);
tg->limits_changed = false;
+   INIT_WORK(&tg->stat_alloc_work, tg_stat_alloc_fn);
 
/* Practically unlimited BW */
tg->bps[0] = tg->bps[1] = -1;
@@ -263,8 +314,25 @@ static void throtl_init_add_tg_lists(str
throtl_add_group_to_td_list(t

Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context (Re: kvm deadlock)

2011-12-16 Thread Nate Custer

On Dec 15, 2011, at 1:47 PM, Vivek Goyal wrote:
> Ok, I continued to develop the patch which tries to allocate the per cpu
> stats from worker thread context. Here is the patch.
> 
> Can the reporter please try out the patch and see if it helps. I am not
> sure whether the deadlock was because of the mutex issue or not, but it
> should help get rid of the lockdep warning.
> 
> This patch is on top of the for-3.3/core branch of Jens's linux-block
> tree. If it does not apply on your kernel version, do let me know the
> version you are testing with and I will generate another version of the
> patch.
> 
> If testing results are good, I will break the patch down into small
> pieces and post it as a series separately.
> 
> Thanks
> Vivek

Running on a fedora-16 box with the patch applied to the linux-block tree I
still had deadlocks. In fact they seemed to happen much faster and with
lighter workloads.

I was able to get netconsole setup and a full stacktrace is posted here:

http://pastebin.com/9Rq68exU

Nate



[RFT PATCH] blkio: alloc per cpu data from worker thread context (Re: kvm deadlock)

2011-12-15 Thread Vivek Goyal
On Wed, Dec 14, 2011 at 05:03:54PM +0100, Jens Axboe wrote:
> On 2011-12-14 14:43, Avi Kivity wrote:
> > On 12/14/2011 02:25 PM, Marcelo Tosatti wrote:
> >> On Mon, Dec 05, 2011 at 04:48:16PM -0600, Nate Custer wrote:
> >>> Hello,
> >>>
> >>> I am struggling with repeatable full hardware locks when running 8-12 KVM 
> >>> vms. At some point before the hard lock I get a inconsistent lock state 
> >>> warning. An example of this can be found here:
> >>>
> >>> http://pastebin.com/8wKhgE2C
> >>>
> >>> After that the server continues to run for a while and then starts its 
> >>> death spiral. When it reaches that point it fails to log anything further 
> >>> to the disk, but by attaching a console I have been able to get a stack 
> >>> trace documenting the final implosion:
> >>>
> >>> http://pastebin.com/PbcN76bd
> >>>
> >>> All of the cores end up hung and the server stops responding to all 
> >>> input, including SysRq commands. 
> >>>
> >>> I have seen this behavior on two machines (dual E5606 running Fedora 16) 
> >>> both passed cpuburnin testing and memtest86 scans without error. 
> >>>
> >>> I have reproduced the crash and stack traces from a Fedora debugging 
> >>> kernel - 3.1.2-1 and with a vanilla 3.1.4 kernel.
> >>
> >> Busted hardware, apparently. Can you reproduce these issues with the
> >> same workload on different hardware?
> > 
> > I don't think it's hardware related.  The second trace (in the first
> > paste) is called during swap, so GFP_FS is set.  The first one is not,
> > so GFP_FS is clear.  Lockdep is worried about the following scenario:
> > 
> >   acpi_early_init() is called
> >   calls pcpu_alloc(), which takes pcpu_alloc_mutex
> >   eventually, calls kmalloc(), or some other allocation function
> >   no memory, so swap
> >   call try_to_free_pages()
> >   submit_bio()
> >   blk_throtl_bio()
> >   blkio_alloc_blkg_stats()
> >   alloc_percpu()
> >   pcpu_alloc(), which takes pcpu_alloc_mutex
> >   deadlock
> > 
> > It's a little unlikely that acpi_early_init() will OOM, but lockdep
> > doesn't know that.  Other callers of pcpu_alloc() could trigger the same
> > thing.
> > 
> > When lockdep says
> > 
> > [ 5839.924953] other info that might help us debug this:
> > [ 5839.925396]  Possible unsafe locking scenario:
> > [ 5839.925397]
> > [ 5839.925840]        CPU0
> > [ 5839.926063]        ----
> > [ 5839.926287]   lock(pcpu_alloc_mutex);
> > [ 5839.926533]   <Interrupt>
> > [ 5839.926756]     lock(pcpu_alloc_mutex);
> > [ 5839.926986]
> > 
> > It really means
> > 
> >   lock(pcpu_alloc_mutex);
> >   <GFP_FS allocation>
> >     lock(pcpu_alloc_mutex);
> > 
> > GFP_FS simply marks the beginning of a nested, unrelated context that
> > uses the same thread, just like an interrupt.  Kudos to lockdep for
> > catching that.
> > 
> > I think the allocation in blkio_alloc_blkg_stats() should be moved out
> > of the I/O path into some init function. Copying Jens.
> 
> That's completely buggy, basically you end up with a GFP_KERNEL
> allocation from the IO submit path. Vivek, per_cpu data needs to be set
> up at init time. You can't allocate it dynamically off the IO path.

Ok, I continued to develop the patch which tries to allocate the per cpu
stats from worker thread context. Here is the patch.

Can the reporter please try out the patch and see if it helps. I am not
sure whether the deadlock was because of the mutex issue or not, but it
should help get rid of the lockdep warning.

This patch is on top of the for-3.3/core branch of Jens's linux-block
tree. If it does not apply on your kernel version, do let me know the
version you are testing with and I will generate another version of the
patch.

If testing results are good, I will break the patch down into small
pieces and post it as a series separately.

Thanks
Vivek

Alloc the per cpu data from worker thread context to avoid the possibility
of a deadlock under memory pressure.

The only side effect of the delayed allocation is that we will lose the
blkio cgroup stats for that group for a small duration.

This patch is generated on top of "for-3.3/core" branch of linux-block tree.

---
 block/blk-cgroup.c   |   31 -
 block/blk-cgroup.h   |9 +++
 block/blk-throttle.c |  116 -
 block/cfq-iosched.c  |  119 ++-
 4 files changed, 180 insertions(+), 95 deletions(-)

Index: block/blk-throttle.c
===================================================================
--- block/blk-throttle.c.orig   2011-12-15 20:31:05.0 -0500
+++ block/blk-throttle.c2011-12-15 20:32:15.0 -0500
@@ -81,6 +81,7 @@ struct throtl_grp {
int limits_changed;
 
struct rcu_head rcu_head;
+   struct work_struct stat_alloc_work;
 };
 
 struct throtl_data
@@ -159,6 +160,13 @@ static void throtl_free_tg(struct rcu_he
struct throtl_grp *tg;
 
tg = container_of(head, struct throtl_grp, rcu_head);
+
+   /*
+* Will delayed allocation be visible here for sure? I am relying
+* on the fact that after blkg.stats_cpu assignment, we drop
+   

Re: kvm deadlock

2011-12-14 Thread Vivek Goyal
On Wed, Dec 14, 2011 at 01:41:34PM -0500, Vivek Goyal wrote:
> On Wed, Dec 14, 2011 at 10:16:23AM -0800, Tejun Heo wrote:
> 
> [..]
> > > > > Or maybe there is a safer version of pcpu alloc which returns
> > > > > without allocating if pcpu_alloc_mutex is already locked.
> > 
> > pcpu alloc depends on vmalloc allocation, so it isn't trivial.  We can
> > try to make percpu keep cache of areas for this type of allocation but
> > I personally think doing percpu allocation from atomic context or IO
> > path is a bad idea.  Hmmm...
> 
> Looks like I am running out of options here.  I can't find a suitable
> path where I can allocate these stats outside of the IO path. Because
> devices can be plugged in dynamically (and these stats are per cgroup,
> per device), and cgroups can be created dynamically after device
> creation, I can't do any static allocation outside of the IO path. So
> that kind of makes the use of per cpu memory areas for stats impossible
> in this case.
> 
> For a moment I thought of doing the allocation from a worker thread
> after taking a reference on the original group: allow the IO submission
> to continue without blocking, just that, till the per cpu areas are
> allocated, we will not collect any stats.
>

Ok, even though I am not convinced about it yet, I wrote a small patch
to allocate the per cpu data area from worker thread context asynchronously
(except for the root cgroup). We continue to submit the IO without blocking
on the per cpu data, and if the per cpu data is not allocated yet, we lose
the stats for that duration for that group.

Here is a half-baked test patch which compiles and boots. CFQ changes are
not done yet. I have yet to find out if I need any additional locking or
barriers. Right now I am just relying on the reference counting of group
objects.

I am posting it here to figure out whether this kind of approach is
acceptable or completely frowned upon.

Thanks
Vivek

Not-signed-half-baked-test-patch
---
 block/blk-cgroup.c   |   29 +++
 block/blk-cgroup.h   |2 +
 block/blk-throttle.c |   53 ++-
 3 files changed, 75 insertions(+), 9 deletions(-)

Index: block/blk-throttle.c
===================================================================
--- block/blk-throttle.c.orig   2011-12-14 23:25:00.0 -0500
+++ block/blk-throttle.c2011-12-14 23:46:58.0 -0500
@@ -81,6 +81,7 @@ struct throtl_grp {
int limits_changed;
 
struct rcu_head rcu_head;
+   struct work_struct stat_alloc_work;
 };
 
 struct throtl_data
@@ -159,6 +160,7 @@ static void throtl_free_tg(struct rcu_he
struct throtl_grp *tg;
 
tg = container_of(head, struct throtl_grp, rcu_head);
+   /* Will delayed allocation be visible here for sure? */
free_percpu(tg->blkg.stats_cpu);
kfree(tg);
 }
@@ -181,6 +183,24 @@ static void throtl_put_tg(struct throtl_
call_rcu(&tg->rcu_head, throtl_free_tg);
 }
 
+/* No locks taken. A reference to throtl_grp taken before invocation */
+static void tg_stat_alloc_fn(struct work_struct *work)
+{
+   struct throtl_grp *tg = container_of(work, struct throtl_grp,
+   stat_alloc_work);
+   void *stat_ptr = NULL;
+
+   stat_ptr = blkio_alloc_blkg_stats_pcpu();
+
+   if (stat_ptr == NULL) {
+   throtl_put_tg(tg);
+   return;
+   }
+
+   blkio_set_blkg_stats_pcpu(&tg->blkg, stat_ptr);
+   throtl_put_tg(tg);
+}
+
 static void throtl_init_group(struct throtl_grp *tg)
 {
INIT_HLIST_NODE(&tg->tg_node);
@@ -188,6 +208,7 @@ static void throtl_init_group(struct thr
bio_list_init(&tg->bio_lists[0]);
bio_list_init(&tg->bio_lists[1]);
tg->limits_changed = false;
+   INIT_WORK(&tg->stat_alloc_work, tg_stat_alloc_fn);
 
/* Practically unlimited BW */
tg->bps[0] = tg->bps[1] = -1;
@@ -264,7 +285,7 @@ static void throtl_init_add_tg_lists(str
 }
 
 /* Should be called without queue lock and outside of rcu period */
-static struct throtl_grp *throtl_alloc_tg(struct throtl_data *td)
+static struct throtl_grp *throtl_alloc_tg(struct throtl_data *td, bool async)
 {
struct throtl_grp *tg = NULL;
int ret;
@@ -273,14 +294,28 @@ static struct throtl_grp *throtl_alloc_t
if (!tg)
return NULL;
 
-   ret = blkio_alloc_blkg_stats(&tg->blkg);
+   if (!async) {
+   ret = blkio_alloc_blkg_stats(&tg->blkg);
 
-   if (ret) {
-   kfree(tg);
-   return NULL;
+   if (ret) {
+   kfree(tg);
+   return NULL;
+   }
}
 
throtl_init_group(tg);
+
+   if (async) {
+   /*
+* Schedule the group per cpu stat allocation through worker
+* thread
+*/
+   throtl_ref_get_tg(tg);
+   if(!schedule_work(&tg->stat_alloc_work))
+ 

Re: kvm deadlock

2011-12-14 Thread Vivek Goyal
On Wed, Dec 14, 2011 at 10:16:23AM -0800, Tejun Heo wrote:

[..]
> > > > Or maybe there is a safer version of pcpu alloc which returns
> > > > without allocating if pcpu_alloc_mutex is already locked.
> 
> pcpu alloc depends on vmalloc allocation, so it isn't trivial.  We can
> try to make percpu keep cache of areas for this type of allocation but
> I personally think doing percpu allocation from atomic context or IO
> path is a bad idea.  Hmmm...

Looks like I am running out of options here.  I can't find a suitable
path where I can allocate these stats outside of the IO path. Because
devices can be plugged in dynamically (and these stats are per cgroup,
per device), and cgroups can be created dynamically after device
creation, I can't do any static allocation outside of the IO path. So
that kind of makes the use of per cpu memory areas for stats impossible
in this case.

For a moment I thought of doing the allocation from a worker thread
after taking a reference on the original group: allow the IO submission
to continue without blocking, just that, till the per cpu areas are
allocated, we will not collect any stats.

But for locking we rely on the request queue lock, and the request queue
might be gone by the time the per cpu areas are allocated. That means the
group needs a reference on the request queue. Request queue referencing
and lifetime handling is already full of bugs, so I don't feel comfortable
adding more code there (at least till your cleanup patches go in).

Hmm, is a revert of the per cpu blkio group stats the only sane choice
left now?

Thanks
Vivek


Re: kvm deadlock

2011-12-14 Thread Tejun Heo
Hello,

On Wed, Dec 14, 2011 at 12:22:34PM -0500, Vivek Goyal wrote:
> [..]
> > __GFP_WAIT isn't the problem, you can block in the IO path. You cannot,
> > however, recurse back into IO submission. That's why CFQ is using
> > GFP_NOIO, implying that waiting is OK, but submitting new IO to satisfy
> > the allocation is not.
> 
> Ok. Got it. So even if we implement the mutex_is_locked() check, it does
> not protect me from the possibility of the per cpu allocation path
> recursing into IO submission. That means there needs to be a variant of
> per cpu allocation which can take the allocation flags as a parameter
> and honor them.

Slightly tangential but we actually have a bug here.  Under high
enough memory pressure, ioc or cic allocation can fail which will make
request alloc fail and retry, which isn't guaranteed to make forward
progress.  struct request itself is mempool backed but ioc/cic aren't.
It seems hitting this problem (and thus IO / memalloc deadlock) isn't
too difficult w/ memcg.

An easy fix would be using INSERT_BACK instead of INSERT_SORT if
elevator_set() fails.  I'll soon post patches to fix the problem.

> > > Or maybe there is a safer version of pcpu alloc which returns
> > > without allocating if pcpu_alloc_mutex is already locked.

pcpu alloc depends on vmalloc allocation, so it isn't trivial.  We can
try to make percpu keep cache of areas for this type of allocation but
I personally think doing percpu allocation from atomic context or IO
path is a bad idea.  Hmmm...

Thanks.

-- 
tejun


Re: kvm deadlock

2011-12-14 Thread Vivek Goyal
On Wed, Dec 14, 2011 at 06:09:55PM +0100, Jens Axboe wrote:
> On 2011-12-14 18:03, Vivek Goyal wrote:
> >>> I think the allocation in blkio_alloc_blkg_stats() should be moved out
> >>> of the I/O path into some init function. Copying Jens.
> >>
> >> That's completely buggy, basically you end up with a GFP_KERNEL
> >> allocation from the IO submit path. Vivek, per_cpu data needs to be set
> >> up at init time. You can't allocate it dynamically off the IO path.
> > 
> > Hi Jens,
> > 
> > I am wondering how CFQ gets away with blocking cfqq allocation in the
> > IO submit path. I see that blk_queue_bio() will do the following.
> > 
> > blk_queue_bio()
> >   get_request_wait()
> >     get_request(..,..,GFP_NOIO)
> >       blk_alloc_request()
> >         elv_set_request()
> >           cfq_set_request()
> >             ---> Can sleep and do memory allocation in the IO submit
> >                  path, as GFP_NOIO has __GFP_WAIT.
> > 
> > So that means sleeping allocation from IO submit path is not necessarily
> > a problem?
> 

[..]
> __GFP_WAIT isn't the problem, you can block in the IO path. You cannot,
> however, recurse back into IO submission. That's why CFQ is using
> GFP_NOIO, implying that waiting is OK, but submitting new IO to satisfy
> the allocation is not.

Ok. Got it. So even if we implement the mutex_is_locked() check, it does
not protect me from the possibility of the per cpu allocation path
recursing into IO submission. That means there needs to be a variant of
per cpu allocation which can take the allocation flags as a parameter and
honor them.
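
(No such variant existed in 3.2. A flag-taking interface would look
roughly like the sketch below; mainline eventually grew an
alloc_percpu_gfp() along these lines in 3.18.)

/* Sketch of a flag-aware percpu interface (not present in 3.2). */
void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp);

#define alloc_percpu_gfp(type, gfp)                                     \
        (typeof(type) __percpu *)__alloc_percpu_gfp(sizeof(type),      \
                                            __alignof__(type), gfp)

/*
 * The blkio path could then allocate safely from the IO path: without
 * __GFP_WAIT the allocator may only serve from already-populated
 * chunks and fails fast instead of recursing into reclaim.
 */
blkg->stats_cpu = alloc_percpu_gfp(struct blkio_group_stats_cpu,
                                   GFP_NOWAIT);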

> 
> > But in the case of per cpu data allocation, we might already be holding
> > pcpu_alloc_mutex at the time of calling the pcpu allocation again, and
> > that might lead to deadlock (as Avi mentioned). If yes, then it is a
> > problem.
> 
> It is both a problem and somewhat of a mess...
> 
> > Right now the allocation of the root group and its associated stats
> > happens at queue initialization time. For non-root cgroups, the group
> > allocation and the associated per cpu stats allocation happen
> > dynamically when the IO is submitted. So in this case maybe we are
> > creating a new blkio cgroup and then doing IO, which leads to this
> > warning.
> > 
> > I am not sure how to move this allocation to an init path. These stats
> > are per group, and groups are created dynamically as IO happens in
> > them. The only init path seems to be cgroup creation time. A blkg is an
> > object which is contained in a parent object, and at that time the
> > parent object is not available. It is created dynamically at IO time
> > (cfq_group, blkio_group, etc.).
> > 
> > Though it is a little hackish, can we just delay the allocation of
> > stats if pcpu_alloc_mutex is held? We shall have to make
> > pcpu_alloc_mutex non-static though. Delaying will just not capture the
> > stats for some time; sooner or later we will get regular IO with
> > pcpu_alloc_mutex not held, and we can do the per cpu allocation at that
> > time. I will write a test patch.
> > 
> > Or maybe there is a safer version of pcpu alloc which returns
> > without allocating if pcpu_alloc_mutex is already locked.
> 
> Both of these proposed solutions aren't really pretty. It would be
> cleaner, and have better runtime for the fast path, if you could alloc
> all of these when the non-root groups are set up. Why isn't it done that
> way right now?

Actually, the non-root group itself is set up when the first IO happens in
the group. When an IO is submitted, we determine which cgroup it belongs
to, then we see if we have already set up the group. If not, then we
go on to allocate one. (Currently I use GFP_ATOMIC for group allocation.)

So group allocation itself happens in the IO path (like cfqq allocation);
hence the per group, per cpu stats are also allocated in that path.

We can't set up blkio groups at cgroup creation time, as at that time
we don't even know how many request queues there are and on how many
of them CFQ is being used.

Setting up the root group at queue initialization time is easy. But we
can't set up all the other cgroups at that time, as that does not take
care of future cgroup creation. Also, it will waste memory if cgroups are
there but no IO is happening in them.

Thanks
Vivek


Re: kvm deadlock

2011-12-14 Thread Jens Axboe
On 2011-12-14 18:03, Vivek Goyal wrote:
>>> I think the allocation in blkio_alloc_blkg_stats() should be moved out
>>> of the I/O path into some init function. Copying Jens.
>>
>> That's completely buggy, basically you end up with a GFP_KERNEL
>> allocation from the IO submit path. Vivek, per_cpu data needs to be set
>> up at init time. You can't allocate it dynamically off the IO path.
> 
> Hi Jens,
> 
> I am wondering how CFQ gets away with blocking cfqq allocation in the
> IO submit path. I see that blk_queue_bio() will do the following.
> 
> blk_queue_bio()
>   get_request_wait()
>     get_request(..,..,GFP_NOIO)
>       blk_alloc_request()
>         elv_set_request()
>           cfq_set_request()
>             ---> Can sleep and do memory allocation in the IO submit
>                  path, as GFP_NOIO has __GFP_WAIT.
> 
> So that means sleeping allocation from IO submit path is not necessarily
> a problem?

__GFP_WAIT isn't the problem, you can block in the IO path. You cannot,
however, recurse back into IO submission. That's why CFQ is using
GFP_NOIO, implying that waiting is OK, but submitting new IO to satisfy
the allocation is not.
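
(For reference, the 3.x include/linux/gfp.h definitions make the
distinction concrete: GFP_NOIO may sleep but may not start new IO for
reclaim, while GFP_KERNEL may do both.)

#define GFP_NOWAIT      (GFP_ATOMIC & ~__GFP_HIGH)  /* no sleeping */
#define GFP_NOIO        (__GFP_WAIT)                /* may sleep, no new IO */
#define GFP_NOFS        (__GFP_WAIT | __GFP_IO)     /* IO ok, no FS recursion */
#define GFP_KERNEL      (__GFP_WAIT | __GFP_IO | __GFP_FS)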

> But in the case of per cpu data allocation, we might already be holding
> pcpu_alloc_mutex at the time of calling the pcpu allocation again, and
> that might lead to deadlock (as Avi mentioned). If yes, then it is a
> problem.

It is both a problem and somewhat of a mess...

> Right now the allocation of the root group and its associated stats
> happens at queue initialization time. For non-root cgroups, the group
> allocation and the associated per cpu stats allocation happen dynamically
> when the IO is submitted. So in this case maybe we are creating a new
> blkio cgroup and then doing IO, which leads to this warning.
> 
> I am not sure how to move this allocation to an init path. These stats
> are per group, and groups are created dynamically as IO happens in
> them. The only init path seems to be cgroup creation time. A blkg is an
> object which is contained in a parent object, and at that time the
> parent object is not available. It is created dynamically at IO time
> (cfq_group, blkio_group, etc.).
> 
> Though it is a little hackish, can we just delay the allocation of
> stats if pcpu_alloc_mutex is held? We shall have to make
> pcpu_alloc_mutex non-static though. Delaying will just not capture the
> stats for some time; sooner or later we will get regular IO with
> pcpu_alloc_mutex not held, and we can do the per cpu allocation at that
> time. I will write a test patch.
> 
> Or maybe there is a safer version of pcpu alloc which returns
> without allocating if pcpu_alloc_mutex is already locked.

Both of these proposed solutions aren't really pretty. It would be
cleaner, and have better runtime for the fast path, if you could alloc all
of these when the non-root groups are set up. Why isn't it done that way
right now?

-- 
Jens Axboe



Re: kvm deadlock

2011-12-14 Thread Vivek Goyal
On Wed, Dec 14, 2011 at 05:03:54PM +0100, Jens Axboe wrote:
> On 2011-12-14 14:43, Avi Kivity wrote:
> > On 12/14/2011 02:25 PM, Marcelo Tosatti wrote:
> >> On Mon, Dec 05, 2011 at 04:48:16PM -0600, Nate Custer wrote:
> >>> Hello,
> >>>
> >>> I am struggling with repeatable full hardware locks when running 8-12 KVM 
> >>> vms. At some point before the hard lock I get a inconsistent lock state 
> >>> warning. An example of this can be found here:
> >>>
> >>> http://pastebin.com/8wKhgE2C
> >>>
> >>> After that the server continues to run for a while and then starts its 
> >>> death spiral. When it reaches that point it fails to log anything further 
> >>> to the disk, but by attaching a console I have been able to get a stack 
> >>> trace documenting the final implosion:
> >>>
> >>> http://pastebin.com/PbcN76bd
> >>>
> >>> All of the cores end up hung and the server stops responding to all 
> >>> input, including SysRq commands. 
> >>>
> >>> I have seen this behavior on two machines (dual E5606 running Fedora 16) 
> >>> both passed cpuburnin testing and memtest86 scans without error. 
> >>>
> >>> I have reproduced the crash and stack traces from a Fedora debugging 
> >>> kernel - 3.1.2-1 and with a vanilla 3.1.4 kernel.
> >>
> >> Busted hardware, apparently. Can you reproduce these issues with the
> >> same workload on different hardware?
> > 
> > I don't think it's hardware related.  The second trace (in the first
> > paste) is called during swap, so GFP_FS is set.  The first one is not,
> > so GFP_FS is clear.  Lockdep is worried about the following scenario:
> > 
> >   acpi_early_init() is called
> >   calls pcpu_alloc(), which takes pcpu_alloc_mutex
> >   eventually, calls kmalloc(), or some other allocation function
> >   no memory, so swap
> >   call try_to_free_pages()
> >   submit_bio()
> >   blk_throtl_bio()
> >   blkio_alloc_blkg_stats()
> >   alloc_percpu()
> >   pcpu_alloc(), which takes pcpu_alloc_mutex
> >   deadlock
> > 
> > It's a little unlikely that acpi_early_init() will OOM, but lockdep
> > doesn't know that.  Other callers of pcpu_alloc() could trigger the same
> > thing.
> > 
> > When lockdep says
> > 
> > [ 5839.924953] other info that might help us debug this:
> > [ 5839.925396]  Possible unsafe locking scenario:
> > [ 5839.925397]
> > [ 5839.925840]        CPU0
> > [ 5839.926063]        ----
> > [ 5839.926287]   lock(pcpu_alloc_mutex);
> > [ 5839.926533]   <Interrupt>
> > [ 5839.926756]     lock(pcpu_alloc_mutex);
> > [ 5839.926986]
> > 
> > It really means
> > 
> >   lock(pcpu_alloc_mutex);
> >   <GFP_FS allocation>
> >     lock(pcpu_alloc_mutex);
> > 
> > GFP_FS simply marks the beginning of a nested, unrelated context that
> > uses the same thread, just like an interrupt.  Kudos to lockdep for
> > catching that.
> > 
> > I think the allocation in blkio_alloc_blkg_stats() should be moved out
> > of the I/O path into some init function. Copying Jens.
> 
> That's completely buggy, basically you end up with a GFP_KERNEL
> allocation from the IO submit path. Vivek, per_cpu data needs to be set
> up at init time. You can't allocate it dynamically off the IO path.

Hi Jens,

I am wondering how CFQ gets away with blocking cfqq allocation in the
IO submit path. I see that blk_queue_bio() will do the following.

blk_queue_bio()
  get_request_wait()
    get_request(..,..,GFP_NOIO)
      blk_alloc_request()
        elv_set_request()
          cfq_set_request()
            ---> Can sleep and do memory allocation in the IO submit
                 path, as GFP_NOIO has __GFP_WAIT.

So that means sleeping allocation from IO submit path is not necessarily
a problem?

But in the case of per cpu data allocation, we might already be holding
pcpu_alloc_mutex at the time of calling the pcpu allocation again, and
that might lead to deadlock (as Avi mentioned). If yes, then it is a
problem.

Right now the allocation of the root group and its associated stats
happens at queue initialization time. For non-root cgroups, the group
allocation and the associated per cpu stats allocation happen dynamically
when the IO is submitted. So in this case maybe we are creating a new
blkio cgroup and then doing IO, which leads to this warning.

I am not sure how to move this allocation to an init path. These stats are
per group, and groups are created dynamically as IO happens in them. The
only init path seems to be cgroup creation time. A blkg is an object which
is contained in a parent object, and at that time the parent object is not
available. It is created dynamically at IO time (cfq_group,
blkio_group, etc.).

Though it is a little hackish, can we just delay the allocation of stats
if pcpu_alloc_mutex is held? We shall have to make pcpu_alloc_mutex
non-static though. Delaying will just not capture the stats for some time;
sooner or later we will get regular IO with pcpu_alloc_mutex not held, and
we can do the per cpu allocation at that time. I will write a test patch.

Or maybe there is a safer version of pcpu alloc which returns without
allocating if pcpu_alloc_mutex is already locked.
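
(Sketch of that idea only: no such helper exists, pcpu_alloc_mutex is
static inside mm/percpu.c, and __pcpu_alloc_locked() is a hypothetical
lock-free body. As discussed elsewhere in the thread, this still would
not stop the allocation from recursing into IO submission, which is why
the idea was dropped.)

/* Hypothetical opportunistic variant that never sleeps on the mutex. */
void __percpu *alloc_percpu_try(size_t size, size_t align)
{
        void __percpu *ptr;

        if (!mutex_trylock(&pcpu_alloc_mutex))
                return NULL;            /* busy: caller just loses stats */

        ptr = __pcpu_alloc_locked(size, align); /* hypothetical helper */
        mutex_unlock(&pcpu_alloc_mutex);
        return ptr;
}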

CCing Tejun too.

Thanks
Vivek

Re: kvm deadlock

2011-12-14 Thread Jens Axboe
On 2011-12-14 14:43, Avi Kivity wrote:
> On 12/14/2011 02:25 PM, Marcelo Tosatti wrote:
>> On Mon, Dec 05, 2011 at 04:48:16PM -0600, Nate Custer wrote:
>>> Hello,
>>>
>>> I am struggling with repeatable full hardware locks when running 8-12 KVM 
>>> vms. At some point before the hard lock I get a inconsistent lock state 
>>> warning. An example of this can be found here:
>>>
>>> http://pastebin.com/8wKhgE2C
>>>
>>> After that the server continues to run for a while and then starts its 
>>> death spiral. When it reaches that point it fails to log anything further 
>>> to the disk, but by attaching a console I have been able to get a stack 
>>> trace documenting the final implosion:
>>>
>>> http://pastebin.com/PbcN76bd
>>>
>>> All of the cores end up hung and the server stops responding to all input, 
>>> including SysRq commands. 
>>>
>>> I have seen this behavior on two machines (dual E5606 running Fedora 16) 
>>> both passed cpuburnin testing and memtest86 scans without error. 
>>>
>>> I have reproduced the crash and stack traces from a Fedora debugging kernel 
>>> - 3.1.2-1 and with a vanilla 3.1.4 kernel.
>>
>> Busted hardware, apparently. Can you reproduce these issues with the
>> same workload on different hardware?
> 
> I don't think it's hardware related.  The second trace (in the first
> paste) is called during swap, so GFP_FS is set.  The first one is not,
> so GFP_FS is clear.  Lockdep is worried about the following scenario:
> 
>   acpi_early_init() is called
>   calls pcpu_alloc(), which takes pcpu_alloc_mutex
>   eventually, calls kmalloc(), or some other allocation function
>   no memory, so swap
>   call try_to_free_pages()
>   submit_bio()
>   blk_throtl_bio()
>   blkio_alloc_blkg_stats()
>   alloc_percpu()
>   pcpu_alloc(), which takes pcpu_alloc_mutex
>   deadlock
> 
> It's a little unlikely that acpi_early_init() will OOM, but lockdep
> doesn't know that.  Other callers of pcpu_alloc() could trigger the same
> thing.
> 
> When lockdep says
> 
> [ 5839.924953] other info that might help us debug this:
> [ 5839.925396]  Possible unsafe locking scenario:
> [ 5839.925397]
> [ 5839.925840]        CPU0
> [ 5839.926063]        ----
> [ 5839.926287]   lock(pcpu_alloc_mutex);
> [ 5839.926533]   <Interrupt>
> [ 5839.926756]     lock(pcpu_alloc_mutex);
> [ 5839.926986]
> 
> It really means
> 
>   lock(pcpu_alloc_mutex);
>   <GFP_FS allocation>
>     lock(pcpu_alloc_mutex);
> 
> GFP_FS simply marks the beginning of a nested, unrelated context that
> uses the same thread, just like an interrupt.  Kudos to lockdep for
> catching that.
> 
> I think the allocation in blkio_alloc_blkg_stats() should be moved out
> of the I/O path into some init function. Copying Jens.

That's completely buggy, basically you end up with a GFP_KERNEL
allocation from the IO submit path. Vivek, per_cpu data needs to be set
up at init time. You can't allocate it dynamically off the IO path.

-- 
Jens Axboe



Re: kvm deadlock

2011-12-14 Thread Avi Kivity
On 12/14/2011 04:17 PM, Nate Custer wrote:
> On Dec 14, 2011, at 8:06 AM, Marcelo Tosatti wrote:
> > I don't know. Its a hang ? It could be memory corruption (of the timer
> > olist) instead of a bogus NMI actually, the second.
>
>
> What is pasted in the second paste is what came scrolling across the console 
> right before the end of all responsiveness. It came from a dmesg dump, the 
> next dmesg command was not accepted via ssh and the console attached showed 
> the same stack trace. At that point the system refused to respond to any 
> direct keyboard input, including the SysRq commands that I expected to work 
> after a core dump. 
>
> The issue happened with two servers (same hardware, same build group so there 
> is a chance of a bad hardware batch). Switching to an older kernel/kvm setup 
> in RHEL 6.2 has corrected the issue, which suggests a software issue to me.
>

If you can, please set up netconsole, which should give us complete
dmesg captures.

-- 
error compiling committee.c: too many arguments to function



Re: kvm deadlock

2011-12-14 Thread Avi Kivity
On 12/14/2011 04:06 PM, Marcelo Tosatti wrote:
> On Wed, Dec 14, 2011 at 04:02:48PM +0200, Avi Kivity wrote:
> > On 12/14/2011 04:00 PM, Marcelo Tosatti wrote:
> > > The other traces have apparently bogus NMI interrupts, but it might be a
> > > software bug, OK.
> > 
> > These are from sysrq-blah, no?  I'm looking at them now.
>
> I don't know. Its a hang ? It could be memory corruption (of the timer
> olist) instead of a bogus NMI actually, the second.

Looks like lots of cpus are waiting on the smp_call_function_single()
lock.  Looks like rcu is complaining:


[ 4959.814010]  [] __const_udelay+0x2c/0x2e
[ 4959.814017]  []
native_safe_apic_wait_icr_idle+0x31/0x3d
[ 4959.814024]  []
__default_send_IPI_dest_field.constprop.0+0x23/0x5d
[ 4959.814032]  []
default_send_IPI_mask_sequence_phys+0x48/0x97
[ 4959.814039]  [] ? tick_nohz_handler+0xdf/0xdf
[ 4959.814044]  [] physflat_send_IPI_all+0x17/0x19
[ 4959.814052]  []
arch_trigger_all_cpu_backtrace+0x57/0x89
[ 4959.814057]  [] __rcu_pending+0x89/0x328
[ 4959.814063]  [] ? tick_nohz_handler+0xdf/0xdf
[ 4959.814067]  [] rcu_check_callbacks+0x88/0xb9
[ 4959.814071]  [] update_process_times+0x3f/0x75

Maybe the core issue is that CPU 3 is spinning in do_insn_fetch() and
denying rcu grace periods.  Nate, can you provide a few more dumps (this
is looking at the second paste, so more of the same)?

-- 
error compiling committee.c: too many arguments to function



Re: kvm deadlock

2011-12-14 Thread Marcelo Tosatti
On Wed, Dec 14, 2011 at 08:17:45AM -0600, Nate Custer wrote:
> 
> On Dec 14, 2011, at 8:06 AM, Marcelo Tosatti wrote:
> > I don't know. Its a hang ? It could be memory corruption (of the timer
> > olist) instead of a bogus NMI actually, the second.
> 
> 
> What is pasted in the second paste is what came scrolling across the console 
> right before the end of all responsiveness. It came from a dmesg dump, the 
> next dmesg command was not accepted via ssh and the console attached showed 
> the same stack trace. At that point the system refused to respond to any 
> direct keyboard input, including the SysRq commands that I expected to work 
> after a core dump. 
> 
> The issue happened with two servers (same hardware, same build group so there 
> is a chance of a bad hardware batch). Switching to an older kernel/kvm setup 
> in RHEL 6.2 has corrected the issue, which suggests a software issue to me.

Right. Perhaps try an older upstream kernel to find a culprit then.



Re: kvm deadlock

2011-12-14 Thread Nate Custer

On Dec 14, 2011, at 8:06 AM, Marcelo Tosatti wrote:
> I don't know. It's a hang? It could be memory corruption (of the timer
> olist) instead of a bogus NMI actually, the second.


What is pasted in the second paste is what came scrolling across the console 
right before the end of all responsiveness. It came from a dmesg dump, the next 
dmesg command was not accepted via ssh and the console attached showed the same 
stack trace. At that point the system refused to respond to any direct keyboard 
input, including the SysRq commands that I expected to work after a core dump. 

The issue happened with two servers (same hardware, same build group so there 
is a chance of a bad hardware batch). Switching to an older kernel/kvm setup in 
RHEL 6.2 has corrected the issue, which suggests a software issue to me.

Nate Custer





Re: kvm deadlock

2011-12-14 Thread Marcelo Tosatti
On Wed, Dec 14, 2011 at 04:02:48PM +0200, Avi Kivity wrote:
> On 12/14/2011 04:00 PM, Marcelo Tosatti wrote:
> > The other traces have apparently bogus NMI interrupts, but it might be a
> > software bug, OK.
> 
> These are from sysrq-blah, no?  I'm looking at them now.

I don't know. It's a hang? It could be memory corruption (of the timer
olist) instead of a bogus NMI actually, the second.

Nate?




Re: kvm deadlock

2011-12-14 Thread Avi Kivity
On 12/14/2011 04:00 PM, Marcelo Tosatti wrote:
> The other traces have apparently bogus NMI interrupts, but it might be a
> software bug, OK.

These are from sysrq-blah, no?  I'm looking at them now.



-- 
error compiling committee.c: too many arguments to function



Re: kvm deadlock

2011-12-14 Thread Marcelo Tosatti
On Wed, Dec 14, 2011 at 03:43:09PM +0200, Avi Kivity wrote:
> On 12/14/2011 02:25 PM, Marcelo Tosatti wrote:
> > On Mon, Dec 05, 2011 at 04:48:16PM -0600, Nate Custer wrote:
> > > Hello,
> > > 
> > > I am struggling with repeatable full hardware locks when running 8-12 KVM 
> > > vms. At some point before the hard lock I get an inconsistent lock state 
> > > warning. An example of this can be found here:
> > > 
> > > http://pastebin.com/8wKhgE2C
> > > 
> > > After that the server continues to run for a while and then starts its 
> > > death spiral. When it reaches that point it fails to log anything further 
> > > to the disk, but by attaching a console I have been able to get a stack 
> > > trace documenting the final implosion:
> > > 
> > > http://pastebin.com/PbcN76bd
> > > 
> > > All of the cores end up hung and the server stops responding to all 
> > > input, including SysRq commands. 
> > > 
> > > I have seen this behavior on two machines (dual E5606 running Fedora 16) 
> > > both passed cpuburnin testing and memtest86 scans without error. 
> > > 
> > > I have reproduced the crash and stack traces from a Fedora debugging 
> > > kernel - 3.1.2-1 and with a vanilla 3.1.4 kernel.
> >
> > Busted hardware, apparently. Can you reproduce these issues with the
> > same workload on different hardware?
> 
> I don't think it's hardware related.  The second trace (in the first
> paste) is called during swap, so GFP_FS is set.  The first one is not,
> so GFP_FS is clear.  Lockdep is worried about the following scenario:
> 
>   acpi_early_init() is called
>   calls pcpu_alloc(), which takes pcpu_alloc_mutex
>   eventually, calls kmalloc(), or some other allocation function
>   no memory, so swap
>   call try_to_free_pages()
>   submit_bio()
>   blk_throtl_bio()
>   blkio_alloc_blkg_stats()
>   alloc_percpu()
>   pcpu_alloc(), which takes pcpu_alloc_mutex
>   deadlock
> 
> It's a little unlikely that acpi_early_init() will OOM, but lockdep
> doesn't know that.  Other callers of pcpu_alloc() could trigger the same
> thing.
> 
> When lockdep says
> 
> [ 5839.924953] other info that might help us debug this:
> [ 5839.925396]  Possible unsafe locking scenario:
> [ 5839.925397]
> [ 5839.925840]        CPU0
> [ 5839.926063]        ----
> [ 5839.926287]   lock(pcpu_alloc_mutex);
> [ 5839.926533]   <Interrupt>
> [ 5839.926756]     lock(pcpu_alloc_mutex);
> [ 5839.926986]
> 
> It really means
> 
>   lock(pcpu_alloc_mutex);
>   <enter reclaim: a nested context with GFP_FS clear>
>     lock(pcpu_alloc_mutex);
> 
> GFP_FS simply marks the beginning of a nested, unrelated context that
> uses the same thread, just like an interrupt.  Kudos to lockdep for
> catching that.
> 
> I think the allocation in blkio_alloc_blkg_stats() should be moved out
> of the I/O path into some init function. Copying Jens.

The other traces have apparently bogus NMI interrupts, but it might be a
software bug, OK.



Re: kvm deadlock

2011-12-14 Thread Avi Kivity
On 12/14/2011 02:25 PM, Marcelo Tosatti wrote:
> On Mon, Dec 05, 2011 at 04:48:16PM -0600, Nate Custer wrote:
> > Hello,
> > 
> > I am struggling with repeatable full hardware locks when running 8-12 KVM 
> > vms. At some point before the hard lock I get an inconsistent lock state 
> > warning. An example of this can be found here:
> > 
> > http://pastebin.com/8wKhgE2C
> > 
> > After that the server continues to run for a while and then starts its 
> > death spiral. When it reaches that point it fails to log anything further 
> > to the disk, but by attaching a console I have been able to get a stack 
> > trace documenting the final implosion:
> > 
> > http://pastebin.com/PbcN76bd
> > 
> > All of the cores end up hung and the server stops responding to all input, 
> > including SysRq commands. 
> > 
> > I have seen this behavior on two machines (dual E5606 running Fedora 16) 
> > both passed cpuburnin testing and memtest86 scans without error. 
> > 
> > I have reproduced the crash and stack traces from a Fedora debugging kernel 
> > - 3.1.2-1 and with a vanilla 3.1.4 kernel.
>
> Busted hardware, apparently. Can you reproduce these issues with the
> same workload on different hardware?

I don't think it's hardware related.  The second trace (in the first
paste) is called during swap, so GFP_FS is set.  The first one is not,
so GFP_FS is clear.  Lockdep is worried about the following scenario:

  acpi_early_init() is called
  calls pcpu_alloc(), which takes pcpu_alloc_mutex
  eventually, calls kmalloc(), or some other allocation function
  no memory, so swap
  call try_to_free_pages()
  submit_bio()
  blk_throtl_bio()
  blkio_alloc_blkg_stats()
  alloc_percpu()
  pcpu_alloc(), which takes pcpu_alloc_mutex
  deadlock

It's a little unlikely that acpi_early_init() will OOM, but lockdep
doesn't know that.  Other callers of pcpu_alloc() could trigger the same
thing.
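
To make the recursion concrete, here is a minimal sketch; the lock name
and the call chain are real, but the function body is a stand-in, not
the actual mm/percpu.c code:

  #include <linux/mutex.h>
  #include <linux/slab.h>

  static DEFINE_MUTEX(pcpu_alloc_mutex);

  static void *pcpu_alloc_sketch(size_t size)
  {
          void *p;

          mutex_lock(&pcpu_alloc_mutex);
          /* The allocator may need backing pages; a GFP_KERNEL
           * allocation here is allowed to enter direct reclaim. */
          p = kzalloc(size, GFP_KERNEL);
          /* Reclaim can issue throttled writeback I/O:
           *   try_to_free_pages() -> submit_bio() -> blk_throtl_bio()
           *     -> blkio_alloc_blkg_stats() -> alloc_percpu()
           *       -> mutex_lock(&pcpu_alloc_mutex)
           * so the same thread takes pcpu_alloc_mutex twice: deadlock. */
          mutex_unlock(&pcpu_alloc_mutex);
          return p;
  }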

When lockdep says

[ 5839.924953] other info that might help us debug this:
[ 5839.925396]  Possible unsafe locking scenario:
[ 5839.925397]
[ 5839.925840]        CPU0
[ 5839.926063]        ----
[ 5839.926287]   lock(pcpu_alloc_mutex);
[ 5839.926533]   <Interrupt>
[ 5839.926756]     lock(pcpu_alloc_mutex);
[ 5839.926986]

It really means

  lock(pcpu_alloc_mutex);
  <enter reclaim: a nested context with GFP_FS clear>
    lock(pcpu_alloc_mutex);

GFP_FS simply marks the beginning of a nested, unrelated context that
uses the same thread, just like an interrupt.  Kudos to lockdep for
catching that.

I think the allocation in blkio_alloc_blkg_stats() should be moved out
of the I/O path into some init function. Copying Jens.
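
For concreteness, one shape such a fix could take is deferring the
percpu allocation to a worker so that it always runs in process
context; every name below is illustrative, not the actual blk-cgroup
code:

  #include <linux/kernel.h>
  #include <linux/percpu.h>
  #include <linux/types.h>
  #include <linux/workqueue.h>

  struct blkg_stats_cpu {
          u64 sectors;            /* ... other per-cpu counters ... */
  };

  struct blkg_sketch {
          struct blkg_stats_cpu __percpu *stats_cpu; /* NULL until worker runs */
          struct work_struct stats_alloc_work;
  };

  /* Runs in process context, where taking pcpu_alloc_mutex is safe. */
  static void blkg_stats_alloc_fn(struct work_struct *work)
  {
          struct blkg_sketch *blkg =
                  container_of(work, struct blkg_sketch, stats_alloc_work);

          blkg->stats_cpu = alloc_percpu(struct blkg_stats_cpu);
  }

  /* Called from the I/O path: only schedules work, never allocates. */
  static void blkg_sketch_init(struct blkg_sketch *blkg)
  {
          blkg->stats_cpu = NULL;
          INIT_WORK(&blkg->stats_alloc_work, blkg_stats_alloc_fn);
          schedule_work(&blkg->stats_alloc_work);
  }

Until stats_cpu becomes non-NULL, the stats update in the I/O path would
need a fallback (e.g. simply skipping the update); the point is only
that alloc_percpu() never runs under bio submission.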

-- 
error compiling committee.c: too many arguments to function



Re: kvm deadlock

2011-12-14 Thread Marcelo Tosatti
On Mon, Dec 05, 2011 at 04:48:16PM -0600, Nate Custer wrote:
> Hello,
> 
> I am struggling with repeatable full hardware locks when running 8-12 KVM 
> vms. At some point before the hard lock I get an inconsistent lock state 
> warning. An example of this can be found here:
> 
> http://pastebin.com/8wKhgE2C
> 
> After that the server continues to run for a while and then starts its death 
> spiral. When it reaches that point it fails to log anything further to the 
> disk, but by attaching a console I have been able to get a stack trace 
> documenting the final implosion:
> 
> http://pastebin.com/PbcN76bd
> 
> All of the cores end up hung and the server stops responding to all input, 
> including SysRq commands. 
> 
> I have seen this behavior on two machines (dual E5606 running Fedora 16) both 
> passed cpuburnin testing and memtest86 scans without error. 
> 
> I have reproduced the crash and stack traces from a Fedora debugging kernel - 
> 3.1.2-1 and with a vanilla 3.1.4 kernel.

Busted hardware, apparently. Can you reproduce these issues with the
same workload on different hardware?



kvm deadlock

2011-12-05 Thread Nate Custer
Hello,

I am struggling with repeatable full hardware locks when running 8-12 KVM vms. 
At some point before the hard lock I get an inconsistent lock state warning. An 
example of this can be found here:

http://pastebin.com/8wKhgE2C

After that the server continues to run for a while and then starts its death 
spiral. When it reaches that point it fails to log anything further to the 
disk, but by attaching a console I have been able to get a stack trace 
documenting the final implosion:

http://pastebin.com/PbcN76bd

All of the cores end up hung and the server stops responding to all input, 
including SysRq commands. 

I have seen this behavior on two machines (dual E5606 running Fedora 16) both 
passed cpuburnin testing and memtest86 scans without error. 

I have reproduced the crash and stack traces from a Fedora debugging kernel - 
3.1.2-1 and with a vanilla 3.1.4 kernel.

Nate Custer
QA Analyst
cPanel Inc.


Re: [kvm.git & 2.6.37-rc1] KVM deadlock with CONFIG_PREEMPT host

2010-11-09 Thread Avi Kivity

On 11/08/2010 12:28 PM, Jan Kiszka wrote:

On 08.11.2010 10:18, Markus Trippelsdorf wrote:
>  On Mon, Nov 08, 2010 at 10:04:00AM +0100, Jan Kiszka wrote:
>>  Hi,
>>
>>  I'm seeing lock-ups of the QEMU process on kvm.git as well as current
>>  upstream kernels. This is a backtrace of the hanging VCPU thread:
>>
>>  [] __stop_cpus+0x184/0x1a7
>>  [] try_stop_cpus+0x40/0x59
>>  [] synchronize_sched_expedited+0x84/0x9d
>>  [] __synchronize_srcu+0x33/0x72
>>  [] synchronize_srcu_expedited+0x15/0x17
>>  [] __kvm_set_memory_region+0x6a3/0x782 [kvm]
>>  [] kvm_set_memory_region+0x37/0x50 [kvm]
>>  [] kvm_vm_ioctl_set_memory_region+0x18/0x1a [kvm]
>>  [] kvm_vm_ioctl+0x22d/0x3b1 [kvm]
>>  [] do_vfs_ioctl+0x5a1/0x5e2
>>  [] sys_ioctl+0x56/0x79
>>  [] system_call_fastpath+0x16/0x1b
>>  [] 0x
>>
>>  This issue disappears when disabling CONFIG_PREEMPT on the host.
>>  According to some rough bisecting, it was imported into kvm.git with
>>  merge 146d3bb06b. Given that RCU is involved, I also tried
>>  force-enabling non-preemptible CONFIG_TREE_RCU again, but that made no
>>  difference as long as PREEMPT is on.
>>
>>  Can anyone confirm this or does someone have an idea what goes wrong? Of
>>  course, .config will be provided if required.
>
>  This patch should help (it fixes the problem in my case):
>  http://article.gmane.org/gmane.linux.kernel/1058018

Yeah, that works here as well.

Would be a nice-to-have in kvm.git until it's pulled via the next upstream
merge. At least for me this bug triggers at every VM boot.



I just did so.  Testing with CONFIG_PREEMPT is important.

--
error compiling committee.c: too many arguments to function



Re: [kvm.git & 2.6.37-rc1] KVM deadlock with CONFIG_PREEMPT host

2010-11-08 Thread Jan Kiszka
On 08.11.2010 10:18, Markus Trippelsdorf wrote:
> On Mon, Nov 08, 2010 at 10:04:00AM +0100, Jan Kiszka wrote:
>> Hi,
>>
>> I'm seeing lock-ups of the QEMU process on kvm.git as well as current
>> upstream kernels. This is a backtrace of the hanging VCPU thread:
>>
>> [] __stop_cpus+0x184/0x1a7
>> [] try_stop_cpus+0x40/0x59
>> [] synchronize_sched_expedited+0x84/0x9d
>> [] __synchronize_srcu+0x33/0x72
>> [] synchronize_srcu_expedited+0x15/0x17
>> [] __kvm_set_memory_region+0x6a3/0x782 [kvm]
>> [] kvm_set_memory_region+0x37/0x50 [kvm]
>> [] kvm_vm_ioctl_set_memory_region+0x18/0x1a [kvm]
>> [] kvm_vm_ioctl+0x22d/0x3b1 [kvm]
>> [] do_vfs_ioctl+0x5a1/0x5e2
>> [] sys_ioctl+0x56/0x79
>> [] system_call_fastpath+0x16/0x1b
>> [] 0x
>>
>> This issue disappears when disabling CONFIG_PREEMPT on the host.
>> According to some rough bisecting, it was imported into kvm.git with
>> merge 146d3bb06b. Given that RCU is involved, I also tried
>> force-enabling non-preemptible CONFIG_TREE_RCU again, but that made no
>> difference as long as PREEMPT is on.
>>
>> Can anyone confirm this or does someone have an idea what goes wrong? Of
>> course, .config will be provided if required.
> 
> This patch should help (it fixes the problem in my case):
> http://article.gmane.org/gmane.linux.kernel/1058018

Yeah, that works here as well.

Would be a nice-to-have in kvm.git until it's pulled via the next upstream
merge. At least for me this bug triggers at every VM boot.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [kvm.git & 2.6.37-rc1] KVM deadlock with CONFIG_PREEMPT host

2010-11-08 Thread Markus Trippelsdorf
On Mon, Nov 08, 2010 at 10:04:00AM +0100, Jan Kiszka wrote:
> Hi,
> 
> I'm seeing lock-ups of the QEMU process on kvm.git as well as current
> upstream kernels. This is a backtrace of the hanging VCPU thread:
> 
> [] __stop_cpus+0x184/0x1a7
> [] try_stop_cpus+0x40/0x59
> [] synchronize_sched_expedited+0x84/0x9d
> [] __synchronize_srcu+0x33/0x72
> [] synchronize_srcu_expedited+0x15/0x17
> [] __kvm_set_memory_region+0x6a3/0x782 [kvm]
> [] kvm_set_memory_region+0x37/0x50 [kvm]
> [] kvm_vm_ioctl_set_memory_region+0x18/0x1a [kvm]
> [] kvm_vm_ioctl+0x22d/0x3b1 [kvm]
> [] do_vfs_ioctl+0x5a1/0x5e2
> [] sys_ioctl+0x56/0x79
> [] system_call_fastpath+0x16/0x1b
> [] 0x
> 
> This issue disappears when disabling CONFIG_PREEMPT on the host.
> According to some rough bisecting, it was imported into kvm.git with
> merge 146d3bb06b. Given that RCU is involved, I also tried
> force-enabling non-preemptible CONFIG_TREE_RCU again, but that made no
> difference as long as PREEMPT is on.
> 
> Can anyone confirm this or does someone have an idea what goes wrong? Of
> course, .config will be provided if required.

This patch should help (it fixes the problem in my case):
http://article.gmane.org/gmane.linux.kernel/1058018
-- 
Markus


[kvm.git & 2.6.37-rc1] KVM deadlock with CONFIG_PREEMPT host

2010-11-08 Thread Jan Kiszka
Hi,

I'm seeing lock-ups of the QEMU process on kvm.git as well as current
upstream kernels. This is a backtrace of the hanging VCPU thread:

[] __stop_cpus+0x184/0x1a7
[] try_stop_cpus+0x40/0x59
[] synchronize_sched_expedited+0x84/0x9d
[] __synchronize_srcu+0x33/0x72
[] synchronize_srcu_expedited+0x15/0x17
[] __kvm_set_memory_region+0x6a3/0x782 [kvm]
[] kvm_set_memory_region+0x37/0x50 [kvm]
[] kvm_vm_ioctl_set_memory_region+0x18/0x1a [kvm]
[] kvm_vm_ioctl+0x22d/0x3b1 [kvm]
[] do_vfs_ioctl+0x5a1/0x5e2
[] sys_ioctl+0x56/0x79
[] system_call_fastpath+0x16/0x1b
[] 0x

This issue disappears when disabling CONFIG_PREEMPT on the host.
According to some rough bisecting, it was imported into kvm.git with
merge 146d3bb06b. Given that RCU is involved, I also tried
force-enabling non-preemptible CONFIG_TREE_RCU again, but that made no
difference as long as PREEMPT is on.

Can anyone confirm this or does someone have an idea what goes wrong? Of
course, .config will be provided if required.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux