[Qemu-devel] [PULL v2 for-2.4 v2 5/7] AioContext: fix broken ctx->dispatching optimization

2015-07-22 Thread Stefan Hajnoczi
From: Paolo Bonzini 

This patch rewrites the ctx->dispatching optimization, which was the cause
of some mysterious hangs that could be reproduced on aarch64 KVM only.
The hangs were indirectly caused by aio_poll() and in particular by
the flash memory update's call to blk_write(), which invokes aio_poll().
Fun stuff: they had an extremely short race window, so short that
adding all kinds of tracing to either the kernel or QEMU made it
go away (a single printf made it half as reproducible).

On the plus side, the failure mode (a hang until the next keypress)
made it very easy to examine the state of the process with a debugger.
And there was a very nice reproducer from Laszlo, which failed pretty
often (more than half of the time) on any version of QEMU with a non-debug
kernel; it also failed fast, while still in the firmware.  So, it could
have been worse.

For some unknown reason they happened only with virtio-scsi, but
that's not important.  It's more interesting that they disappeared with
io=native, making thread-pool.c a likely suspect for where the bug arose.
thread-pool.c is also one of the few places which use bottom halves
across threads, by the way.

I hope that no other similar bugs exist, but just in case :) I am
going to describe how the successful debugging went...  Since the
likely culprit was the ctx->dispatching optimization, which mostly
affects bottom halves, the first observation was that there are two
qemu_bh_schedule() invocations in the thread pool: the one in the aio
worker and the one in thread_pool_completion_bh.  The latter always
causes the optimization to trigger, while the former may or may not.  In
order to restrict the possibilities, I introduced new functions
qemu_bh_schedule_slow() and qemu_bh_schedule_fast():

    /* qemu_bh_schedule_slow: */
    ctx = bh->ctx;
    bh->idle = 0;
    if (atomic_xchg(&bh->scheduled, 1) == 0) {
        event_notifier_set(&ctx->notifier);
    }

    /* qemu_bh_schedule_fast: */
    ctx = bh->ctx;
    bh->idle = 0;
    assert(ctx->dispatching);
    atomic_xchg(&bh->scheduled, 1);

Notice how the atomic_xchg is still in qemu_bh_schedule_slow().  This
was already debated a few months ago, so I assumed it to be correct.
In retrospect this was a very good idea, as you'll see later.
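
For context, the two variants are just the two halves of the pre-fix path:
qemu_bh_schedule() always does the atomic_xchg(), and aio_notify() skips
event_notifier_set() while ctx->dispatching is set, so "slow" corresponds to
ctx->dispatching being false and "fast" to it being true.  A simplified
sketch of that pre-fix path (not the exact upstream code, which carries
extra comments and barriers):

    /* qemu_bh_schedule (pre-fix, simplified sketch): */
    ctx = bh->ctx;
    bh->idle = 0;
    if (atomic_xchg(&bh->scheduled, 1) == 0) {
        aio_notify(ctx);
    }

    /* aio_notify (pre-fix, simplified sketch): */
    smp_mb();   /* write bh->scheduled before reading ctx->dispatching */
    if (!ctx->dispatching) {
        event_notifier_set(&ctx->notifier);
    }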

Changing thread_pool_completion_bh() to qemu_bh_schedule_fast() didn't
trigger the assertion (as expected).  Changing the worker's invocation
to qemu_bh_schedule_slow() didn't hide the bug (another assumption
which luckily held).  This already heavily limited the amount of
interaction between the threads, hinting that the problematic events
must have been triggered around thread_pool_completion_bh().
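
In terms of call sites, the experiment boiled down to pinning each of the
two thread-pool.c invocations to one variant (a sketch, using the
hypothetical helpers above):

    /* in the aio worker thread, i.e. a cross-thread wakeup: */
    qemu_bh_schedule_slow(pool->completion_bh);

    /* in thread_pool_completion_bh(), i.e. while already dispatching: */
    qemu_bh_schedule_fast(pool->completion_bh);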

As mentioned earlier, invoking a debugger to examine the state of a
hung process was pretty easy; the iothread was always waiting on a
poll(..., -1) system call.  Infinite timeouts are much rarer on x86,
and this could be the reason why the bug was never observed there.
With the buggy sequence more or less resolved to an interaction between
thread_pool_completion_bh() and poll(..., -1), my "tracing" strategy was
to just add a few qemu_clock_get_ns(QEMU_CLOCK_REALTIME) calls, hoping
that the ordering of aio_ctx_prepare(), aio_ctx_dispatch(), poll() and
qemu_bh_schedule_fast() would provide some hint.  The output was:

(gdb) p last_prepare
$3 = 103885451
(gdb) p last_dispatch
$4 = 103876492
(gdb) p last_poll
$5 = 115909333
(gdb) p last_schedule
$6 = 115925212

Notice how the last call to qemu_poll_ns() came after aio_ctx_dispatch().
This makes little sense unless there is an aio_poll() call involved,
and indeed with a slightly different instrumentation you can see that
there is one:

(gdb) p last_prepare
$3 = 107569679
(gdb) p last_dispatch
$4 = 107561600
(gdb) p last_aio_poll
$5 = 110671400
(gdb) p last_schedule
$6 = 110698917
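
The "tracing" itself was nothing more than a handful of timestamp globals
that gdb could read out of the hung process afterwards; roughly (hypothetical
placement, mirroring the variable names above):

    /* globals readable from gdb */
    static int64_t last_prepare, last_dispatch, last_poll, last_schedule;

    /* in aio_ctx_prepare(): */
    last_prepare = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);

    /* in aio_ctx_dispatch(): */
    last_dispatch = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);

    /* right before qemu_poll_ns() (and, in the second run, in aio_poll()): */
    last_poll = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);

    /* in qemu_bh_schedule_fast(): */
    last_schedule = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);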

So the scenario becomes clearer:

   iothread                            VCPU thread
--------------------------------------------------------------------------
   aio_ctx_prepare
   aio_ctx_check
   qemu_poll_ns(timeout=-1)
                                       aio_poll
                                         aio_dispatch
                                           thread_pool_completion_bh
                                             qemu_bh_schedule()

At this point bh->scheduled = 1 and the iothread has not been woken up.
The solution must be close, but this alone should not be a problem,
because the bottom half is only rescheduled to account for rare situations
(see commit 3c80ca1, thread-pool: avoid deadlock in nested aio_poll()
calls, 2014-07-15).
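
For reference, the rescheduling in question looks roughly like this
(a simplified sketch of thread_pool_completion_bh(); the real code differs
in detail):

    static void thread_pool_completion_bh(void *opaque)
    {
        ThreadPool *pool = opaque;
        ThreadPoolElement *elem, *next;

    restart:
        QLIST_FOREACH_SAFE(elem, &pool->head, all, next) {
            if (elem->state != THREAD_DONE || !elem->common.cb) {
                continue;
            }
            QLIST_REMOVE(elem, all);
            /* Reschedule ourselves in case the callback nests aio_poll()
             * and waits for another request that completed at the same time.
             */
            qemu_bh_schedule(pool->completion_bh);
            elem->common.cb(elem->common.opaque, elem->ret);
            qemu_aio_unref(elem);
            goto restart;
        }
    }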

Introducing a third thread---a thread pool worker thread, which
also does qemu_bh_schedule()---does bring out the problematic case.
The third thread must be awakened *after* the callback is complete and
thread_pool_completion_bh has redone the whole loop, explaining the
short race window.  And then this is what happens:

  thread pool wo

Re: [Qemu-devel] [PULL v2 for-2.4 v2 5/7] AioContext: fix broken ctx->dispatching optimization

2015-07-27 Thread Stefan Hajnoczi
On Fri, Jul 24, 2015 at 7:46 AM, Cornelia Huck  wrote:
> On Thu, 23 Jul 2015 21:29:05 +0200
> Christian Borntraeger  wrote:
>
>> Am 23.07.2015 um 21:25 schrieb Paolo Bonzini:
>> >
>> >
>> > On 23/07/2015 21:19, Christian Borntraeger wrote:
>> >> Am 23.07.2015 um 20:19 schrieb Paolo Bonzini:
>> >>>
>> >>>
>> >>> On 23/07/2015 19:20, Paolo Bonzini wrote:
>> 
>> 
>>  On 23/07/2015 16:14, Cornelia Huck wrote:
>> > (gdb) bt
>> > #0  0x03fffc5871b4 in pthread_cond_wait@@GLIBC_2.3.2 ()
>> >from /lib64/libpthread.so.0
>> > #1  0x8024cfca in qemu_cond_wait (cond=cond@entry=0x9717d950,
>> > mutex=mutex@entry=0x9717d920)
>> > at /data/git/yyy/qemu/util/qemu-thread-posix.c:132
>> > #2  0x8025e83a in rfifolock_lock (r=0x9717d920)
>> > at /data/git/yyy/qemu/util/rfifolock.c:59
>> > #3  0x801b78fa in aio_context_acquire (ctx=<optimized out>)
>> > at /data/git/yyy/qemu/async.c:331
>> > #4  0x8007ceb4 in virtio_blk_data_plane_start (s=0x9717d710)
>> > at /data/git/yyy/qemu/hw/block/dataplane/virtio-blk.c:285
>> > #5  0x8007c64a in virtio_blk_handle_output (vdev=<optimized out>,
>> > vq=<optimized out>) at /data/git/yyy/qemu/hw/block/virtio-blk.c:599
>> > #6  0x801c56dc in qemu_iohandler_poll (pollfds=0x97142800,
>> > ret=ret@entry=1) at /data/git/yyy/qemu/iohandler.c:126
>> > #7  0x801c5178 in main_loop_wait (nonblocking=<optimized out>)
>> > at /data/git/yyy/qemu/main-loop.c:494
>> > #8  0x80013ee2 in main_loop () at /data/git/yyy/qemu/vl.c:1902
>> > #9  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
>> > at /data/git/yyy/qemu/vl.c:4653
>> >
>> > I've stripped down the setup to the following commandline:
>> >
>> > /data/git/yyy/qemu/build/s390x-softmmu/qemu-system-s390x  -machine
>> > s390-ccw-virtio-2.4,accel=kvm,usb=off -m 1024 -smp
>> > 4,sockets=4,cores=1,threads=1 -nographic -drive
>> > file=/dev/sda,if=none,id=drive-virtio-disk0,format=raw,serial=ccwzfcp1,cache=none,aio=native
>> > -device
>> > virtio-blk-ccw,scsi=off,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,x-data-plane=on
>> 
>>  What's the backtrace like for the other threads?  This is almost
>>  definitely a latent bug somewhere else.
>> >>>
>> >>> BTW, I can reproduce this---I'm asking because I cannot even attach gdb
>> >>> to the hung process.
>> >>>
>> >>> The simplest workaround is to reintroduce commit a0710f7995 (iothread:
>> >>> release iothread around aio_poll, 2015-02-20), though it also comes with
>> >>> some risk.  It avoids the bug because it limits the contention on the
>> >>> RFifoLock.
>> >>
>> >> I can reproduce this with the following backtrace (with --enable-debug 
>> >> info added qemu is the tag v2.4.0-rc2)
>> >
>> > Can you check that cherry-picking a0710f7995 works for you?
>> >
>>
>> Yes, this seems to help for me. Conny, can you verify?
>
> I can't reproduce the hang with this on top, so seems to help.

Thanks Christian and Cornelia.  I'm sending a fix.

Stefan



Re: [Qemu-devel] [PULL v2 for-2.4 v2 5/7] AioContext: fix broken ctx->dispatching optimization

2015-07-23 Thread Cornelia Huck
On Wed, 22 Jul 2015 12:43:43 +0100
Stefan Hajnoczi  wrote:

> From: Paolo Bonzini 
> 
> This patch rewrites the ctx->dispatching optimization, which was the cause
> of some mysterious hangs that could be reproduced on aarch64 KVM only.
> The hangs were indirectly caused by aio_poll() and in particular by
> the flash memory update's call to blk_write(), which invokes aio_poll().
> Fun stuff: they had an extremely short race window, so short that
> adding all kinds of tracing to either the kernel or QEMU made it
> go away (a single printf made it half as reproducible).

Sadly, this commit seems to *introduce* hangs on my s390x setup - i.e.
the guest does not react at all.

Thread 1 seems to be stuck in dataplane setup:

(gdb) bt 
#0  0x03fffc5871b4 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x8024cfca in qemu_cond_wait (cond=cond@entry=0x9717d950, 
mutex=mutex@entry=0x9717d920)
at /data/git/yyy/qemu/util/qemu-thread-posix.c:132
#2  0x8025e83a in rfifolock_lock (r=0x9717d920)
at /data/git/yyy/qemu/util/rfifolock.c:59
#3  0x801b78fa in aio_context_acquire (ctx=<optimized out>)
at /data/git/yyy/qemu/async.c:331
#4  0x8007ceb4 in virtio_blk_data_plane_start (s=0x9717d710)
at /data/git/yyy/qemu/hw/block/dataplane/virtio-blk.c:285
#5  0x8007c64a in virtio_blk_handle_output (vdev=<optimized out>, 
vq=<optimized out>) at /data/git/yyy/qemu/hw/block/virtio-blk.c:599
#6  0x801c56dc in qemu_iohandler_poll (pollfds=0x97142800, 
ret=ret@entry=1) at /data/git/yyy/qemu/iohandler.c:126
#7  0x801c5178 in main_loop_wait (nonblocking=<optimized out>)
at /data/git/yyy/qemu/main-loop.c:494
#8  0x80013ee2 in main_loop () at /data/git/yyy/qemu/vl.c:1902
#9  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
at /data/git/yyy/qemu/vl.c:4653

I've stripped down the setup to the following commandline:

/data/git/yyy/qemu/build/s390x-softmmu/qemu-system-s390x  -machine 
s390-ccw-virtio-2.4,accel=kvm,usb=off -m 1024 -smp 
4,sockets=4,cores=1,threads=1 -nographic -drive 
file=/dev/sda,if=none,id=drive-virtio-disk0,format=raw,serial=ccwzfcp1,cache=none,aio=native
 -device 
virtio-blk-ccw,scsi=off,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,x-data-plane=on

Instant hang for me :(




Re: [Qemu-devel] [PULL v2 for-2.4 v2 5/7] AioContext: fix broken ctx->dispatching optimization

2015-07-23 Thread Paolo Bonzini


On 23/07/2015 16:14, Cornelia Huck wrote:
> (gdb) bt 
> #0  0x03fffc5871b4 in pthread_cond_wait@@GLIBC_2.3.2 ()
>from /lib64/libpthread.so.0
> #1  0x8024cfca in qemu_cond_wait (cond=cond@entry=0x9717d950, 
> mutex=mutex@entry=0x9717d920)
> at /data/git/yyy/qemu/util/qemu-thread-posix.c:132
> #2  0x8025e83a in rfifolock_lock (r=0x9717d920)
> at /data/git/yyy/qemu/util/rfifolock.c:59
> #3  0x801b78fa in aio_context_acquire (ctx=<optimized out>)
> at /data/git/yyy/qemu/async.c:331
> #4  0x8007ceb4 in virtio_blk_data_plane_start (s=0x9717d710)
> at /data/git/yyy/qemu/hw/block/dataplane/virtio-blk.c:285
> #5  0x8007c64a in virtio_blk_handle_output (vdev=<optimized out>, 
> vq=<optimized out>) at /data/git/yyy/qemu/hw/block/virtio-blk.c:599
> #6  0x801c56dc in qemu_iohandler_poll (pollfds=0x97142800, 
> ret=ret@entry=1) at /data/git/yyy/qemu/iohandler.c:126
> #7  0x801c5178 in main_loop_wait (nonblocking=<optimized out>)
> at /data/git/yyy/qemu/main-loop.c:494
> #8  0x80013ee2 in main_loop () at /data/git/yyy/qemu/vl.c:1902
> #9  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
> at /data/git/yyy/qemu/vl.c:4653
> 
> I've stripped down the setup to the following commandline:
> 
> /data/git/yyy/qemu/build/s390x-softmmu/qemu-system-s390x  -machine
> s390-ccw-virtio-2.4,accel=kvm,usb=off -m 1024 -smp
> 4,sockets=4,cores=1,threads=1 -nographic -drive
> file=/dev/sda,if=none,id=drive-virtio-disk0,format=raw,serial=ccwzfcp1,cache=none,aio=native
> -device
> virtio-blk-ccw,scsi=off,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,x-data-plane=on

What's the backtrace like for the other threads?  This is almost
definitely a latent bug somewhere else.


Paolo



Re: [Qemu-devel] [PULL v2 for-2.4 v2 5/7] AioContext: fix broken ctx->dispatching optimization

2015-07-23 Thread Paolo Bonzini


On 23/07/2015 19:20, Paolo Bonzini wrote:
> 
> 
> On 23/07/2015 16:14, Cornelia Huck wrote:
>> (gdb) bt 
>> #0  0x03fffc5871b4 in pthread_cond_wait@@GLIBC_2.3.2 ()
>>from /lib64/libpthread.so.0
>> #1  0x8024cfca in qemu_cond_wait (cond=cond@entry=0x9717d950, 
>> mutex=mutex@entry=0x9717d920)
>> at /data/git/yyy/qemu/util/qemu-thread-posix.c:132
>> #2  0x8025e83a in rfifolock_lock (r=0x9717d920)
>> at /data/git/yyy/qemu/util/rfifolock.c:59
>> #3  0x801b78fa in aio_context_acquire (ctx=<optimized out>)
>> at /data/git/yyy/qemu/async.c:331
>> #4  0x8007ceb4 in virtio_blk_data_plane_start (s=0x9717d710)
>> at /data/git/yyy/qemu/hw/block/dataplane/virtio-blk.c:285
>> #5  0x8007c64a in virtio_blk_handle_output (vdev=<optimized out>, 
>> vq=<optimized out>) at /data/git/yyy/qemu/hw/block/virtio-blk.c:599
>> #6  0x801c56dc in qemu_iohandler_poll (pollfds=0x97142800, 
>> ret=ret@entry=1) at /data/git/yyy/qemu/iohandler.c:126
>> #7  0x801c5178 in main_loop_wait (nonblocking=<optimized out>)
>> at /data/git/yyy/qemu/main-loop.c:494
>> #8  0x80013ee2 in main_loop () at /data/git/yyy/qemu/vl.c:1902
>> #9  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
>> at /data/git/yyy/qemu/vl.c:4653
>>
>> I've stripped down the setup to the following commandline:
>>
>> /data/git/yyy/qemu/build/s390x-softmmu/qemu-system-s390x  -machine
>> s390-ccw-virtio-2.4,accel=kvm,usb=off -m 1024 -smp
>> 4,sockets=4,cores=1,threads=1 -nographic -drive
>> file=/dev/sda,if=none,id=drive-virtio-disk0,format=raw,serial=ccwzfcp1,cache=none,aio=native
>> -device
>> virtio-blk-ccw,scsi=off,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,x-data-plane=on
> 
> What's the backtrace like for the other threads?  This is almost
> definitely a latent bug somewhere else.

BTW, I can reproduce this---I'm asking because I cannot even attach gdb
to the hung process.

The simplest workaround is to reintroduce commit a0710f7995 (iothread:
release iothread around aio_poll, 2015-02-20), though it also comes with
some risk.  It avoids the bug because it limits the contention on the
RFifoLock.

Paolo



Re: [Qemu-devel] [PULL v2 for-2.4 v2 5/7] AioContext: fix broken ctx->dispatching optimization

2015-07-23 Thread Christian Borntraeger
Am 23.07.2015 um 20:19 schrieb Paolo Bonzini:
> 
> 
> On 23/07/2015 19:20, Paolo Bonzini wrote:
>>
>>
>> On 23/07/2015 16:14, Cornelia Huck wrote:
>>> (gdb) bt 
>>> #0  0x03fffc5871b4 in pthread_cond_wait@@GLIBC_2.3.2 ()
>>>from /lib64/libpthread.so.0
>>> #1  0x8024cfca in qemu_cond_wait (cond=cond@entry=0x9717d950, 
>>> mutex=mutex@entry=0x9717d920)
>>> at /data/git/yyy/qemu/util/qemu-thread-posix.c:132
>>> #2  0x8025e83a in rfifolock_lock (r=0x9717d920)
>>> at /data/git/yyy/qemu/util/rfifolock.c:59
>>> #3  0x801b78fa in aio_context_acquire (ctx=<optimized out>)
>>> at /data/git/yyy/qemu/async.c:331
>>> #4  0x8007ceb4 in virtio_blk_data_plane_start (s=0x9717d710)
>>> at /data/git/yyy/qemu/hw/block/dataplane/virtio-blk.c:285
>>> #5  0x8007c64a in virtio_blk_handle_output (vdev=<optimized out>, 
>>> vq=<optimized out>) at /data/git/yyy/qemu/hw/block/virtio-blk.c:599
>>> #6  0x801c56dc in qemu_iohandler_poll (pollfds=0x97142800, 
>>> ret=ret@entry=1) at /data/git/yyy/qemu/iohandler.c:126
>>> #7  0x801c5178 in main_loop_wait (nonblocking=<optimized out>)
>>> at /data/git/yyy/qemu/main-loop.c:494
>>> #8  0x80013ee2 in main_loop () at /data/git/yyy/qemu/vl.c:1902
>>> #9  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
>>> at /data/git/yyy/qemu/vl.c:4653
>>>
>>> I've stripped down the setup to the following commandline:
>>>
>>> /data/git/yyy/qemu/build/s390x-softmmu/qemu-system-s390x  -machine
>>> s390-ccw-virtio-2.4,accel=kvm,usb=off -m 1024 -smp
>>> 4,sockets=4,cores=1,threads=1 -nographic -drive
>>> file=/dev/sda,if=none,id=drive-virtio-disk0,format=raw,serial=ccwzfcp1,cache=none,aio=native
>>> -device
>>> virtio-blk-ccw,scsi=off,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,x-data-plane=on
>>
>> What's the backtrace like for the other threads?  This is almost
>> definitely a latent bug somewhere else.
> 
> BTW, I can reproduce this---I'm asking because I cannot even attach gdb
> to the hung process.
> 
> The simplest workaround is to reintroduce commit a0710f7995 (iothread:
> release iothread around aio_poll, 2015-02-20), though it also comes with
> some risk.  It avoids the bug because it limits the contention on the
> RFifoLock.
> 
> Paolo
> 


I can reproduce this with the following backtrace (with --enable-debug info 
added qemu is the tag v2.4.0-rc2)

Thread 4 (Thread 0x3fffb4ce910 (LWP 57750)):
#0  0x03fffc185e8e in syscall () from /lib64/libc.so.6
#1  0x803578ee in futex_wait (ev=0x8098486c , 
val=4294967295) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:301
#2  0x80357ae8 in qemu_event_wait (ev=0x8098486c 
) at 
/home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:399
#3  0x8036e202 in call_rcu_thread (opaque=0x0) at 
/home/cborntra/REPOS/qemu/util/rcu.c:233
#4  0x03fffc2374e6 in start_thread () from /lib64/libpthread.so.0
#5  0x03fffc18b0fa in thread_start () from /lib64/libc.so.6

Thread 3 (Thread 0x3fffacce910 (LWP 57751)):
#0  0x03fffc17f5d6 in ppoll () from /lib64/libc.so.6
#1  0x80294c2e in qemu_poll_ns (fds=0x3fff40008c0, nfds=1, timeout=-1) 
at /home/cborntra/REPOS/qemu/qemu-timer.c:310
#2  0x80296788 in aio_poll (ctx=0x809c2830, blocking=true) at 
/home/cborntra/REPOS/qemu/aio-posix.c:271
#3  0x80137f58 in iothread_run (opaque=0x809c2450) at 
/home/cborntra/REPOS/qemu/iothread.c:42
#4  0x03fffc2374e6 in start_thread () from /lib64/libpthread.so.0
#5  0x03fffc18b0fa in thread_start () from /lib64/libc.so.6

Thread 2 (Thread 0x3fff9411910 (LWP 57754)):
#0  0x03fffc23e662 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x03fffc2394a4 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x8035706e in qemu_mutex_lock (mutex=0x80559288 
) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:73
#3  0x8005ae46 in qemu_mutex_lock_iothread () at 
/home/cborntra/REPOS/qemu/cpus.c:1164
#4  0x8012f44e in kvm_arch_handle_exit (cs=0x82141930, 
run=0x3fffd31a000) at /home/cborntra/REPOS/qemu/target-s390x/kvm.c:2010
#5  0x800782f8 in kvm_cpu_exec (cpu=0x82141930) at 
/home/cborntra/REPOS/qemu/kvm-all.c:1901
#6  0x8005a73c in qemu_kvm_cpu_thread_fn (arg=0x82141930) at 
/home/cborntra/REPOS/qemu/cpus.c:977
#7  0x03fffc2374e6 in start_thread () from /lib64/libpthread.so.0
#8  0x03fffc18b0fa in thread_start () from /lib64/libc.so.6

Thread 1 (Thread 0x3fffb4d0bd0 (LWP 57736)):
#0  0x03fffc23b57c in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/libpthread.so.0
#1  0x803572fa in qemu_cond_wait (cond=0x809c28c0, mutex=0x809c2890) at 
/home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:132
#2  0x8036dc3e in rfifolock_lock (r=0x809c2890) at 
/home/cborntra/REPOS/qemu/util/rfifolock.c:59
#3  0x80281162 in aio_context_acquire (ctx=0x809c2830) at 
/home/cborntra/REPOS/qemu/async.c:331
#4  0x800a2f08 in virtio_blk_data_plane_start (s=0x80a0d6f0) at 
/home/cborntra/REPOS/qemu/hw/block/dataplane/virtio-blk.c:285

Re: [Qemu-devel] [PULL v2 for-2.4 v2 5/7] AioContext: fix broken ctx->dispatching optimization

2015-07-23 Thread Paolo Bonzini


On 23/07/2015 21:19, Christian Borntraeger wrote:
> Am 23.07.2015 um 20:19 schrieb Paolo Bonzini:
>>
>>
>> On 23/07/2015 19:20, Paolo Bonzini wrote:
>>>
>>>
>>> On 23/07/2015 16:14, Cornelia Huck wrote:
 (gdb) bt 
 #0  0x03fffc5871b4 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib64/libpthread.so.0
 #1  0x8024cfca in qemu_cond_wait (cond=cond@entry=0x9717d950, 
 mutex=mutex@entry=0x9717d920)
 at /data/git/yyy/qemu/util/qemu-thread-posix.c:132
 #2  0x8025e83a in rfifolock_lock (r=0x9717d920)
 at /data/git/yyy/qemu/util/rfifolock.c:59
 #3  0x801b78fa in aio_context_acquire (ctx=<optimized out>)
 at /data/git/yyy/qemu/async.c:331
 #4  0x8007ceb4 in virtio_blk_data_plane_start (s=0x9717d710)
 at /data/git/yyy/qemu/hw/block/dataplane/virtio-blk.c:285
 #5  0x8007c64a in virtio_blk_handle_output (vdev=<optimized out>, 
 vq=<optimized out>) at /data/git/yyy/qemu/hw/block/virtio-blk.c:599
 #6  0x801c56dc in qemu_iohandler_poll (pollfds=0x97142800, 
 ret=ret@entry=1) at /data/git/yyy/qemu/iohandler.c:126
 #7  0x801c5178 in main_loop_wait (nonblocking=<optimized out>)
 at /data/git/yyy/qemu/main-loop.c:494
 #8  0x80013ee2 in main_loop () at /data/git/yyy/qemu/vl.c:1902
 #9  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
 at /data/git/yyy/qemu/vl.c:4653

 I've stripped down the setup to the following commandline:

 /data/git/yyy/qemu/build/s390x-softmmu/qemu-system-s390x  -machine
 s390-ccw-virtio-2.4,accel=kvm,usb=off -m 1024 -smp
 4,sockets=4,cores=1,threads=1 -nographic -drive
 file=/dev/sda,if=none,id=drive-virtio-disk0,format=raw,serial=ccwzfcp1,cache=none,aio=native
 -device
 virtio-blk-ccw,scsi=off,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,x-data-plane=on
>>>
>>> What's the backtrace like for the other threads?  This is almost
>>> definitely a latent bug somewhere else.
>>
>> BTW, I can reproduce this---I'm asking because I cannot even attach gdb
>> to the hung process.
>>
>> The simplest workaround is to reintroduce commit a0710f7995 (iothread:
>> release iothread around aio_poll, 2015-02-20), though it also comes with
>> some risk.  It avoids the bug because it limits the contention on the
>> RFifoLock.
> 
> I can reproduce this with the following backtrace (with --enable-debug info 
> added qemu is the tag v2.4.0-rc2)

Can you check that cherry-picking a0710f7995 works for you?

Paolo

> Thread 4 (Thread 0x3fffb4ce910 (LWP 57750)):
> #0  0x03fffc185e8e in syscall () from /lib64/libc.so.6
> #1  0x803578ee in futex_wait (ev=0x8098486c , 
> val=4294967295) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:301
> #2  0x80357ae8 in qemu_event_wait (ev=0x8098486c 
> ) at 
> /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:399
> #3  0x8036e202 in call_rcu_thread (opaque=0x0) at 
> /home/cborntra/REPOS/qemu/util/rcu.c:233
> #4  0x03fffc2374e6 in start_thread () from /lib64/libpthread.so.0
> #5  0x03fffc18b0fa in thread_start () from /lib64/libc.so.6
> 
> Thread 3 (Thread 0x3fffacce910 (LWP 57751)):
> #0  0x03fffc17f5d6 in ppoll () from /lib64/libc.so.6
> #1  0x80294c2e in qemu_poll_ns (fds=0x3fff40008c0, nfds=1, 
> timeout=-1) at /home/cborntra/REPOS/qemu/qemu-timer.c:310
> #2  0x80296788 in aio_poll (ctx=0x809c2830, blocking=true) at 
> /home/cborntra/REPOS/qemu/aio-posix.c:271
> #3  0x80137f58 in iothread_run (opaque=0x809c2450) at 
> /home/cborntra/REPOS/qemu/iothread.c:42
> #4  0x03fffc2374e6 in start_thread () from /lib64/libpthread.so.0
> #5  0x03fffc18b0fa in thread_start () from /lib64/libc.so.6
> 
> Thread 2 (Thread 0x3fff9411910 (LWP 57754)):
> #0  0x03fffc23e662 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x03fffc2394a4 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #2  0x8035706e in qemu_mutex_lock (mutex=0x80559288 
> ) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:73
> #3  0x8005ae46 in qemu_mutex_lock_iothread () at 
> /home/cborntra/REPOS/qemu/cpus.c:1164
> #4  0x8012f44e in kvm_arch_handle_exit (cs=0x82141930, 
> run=0x3fffd31a000) at /home/cborntra/REPOS/qemu/target-s390x/kvm.c:2010
> #5  0x800782f8 in kvm_cpu_exec (cpu=0x82141930) at 
> /home/cborntra/REPOS/qemu/kvm-all.c:1901
> #6  0x8005a73c in qemu_kvm_cpu_thread_fn (arg=0x82141930) at 
> /home/cborntra/REPOS/qemu/cpus.c:977
> #7  0x03fffc2374e6 in start_thread () from /lib64/libpthread.so.0
> #8  0x03fffc18b0fa in thread_start () from /lib64/libc.so.6
> 
> Thread 1 (Thread 0x3fffb4d0bd0 (LWP 57736)):
> #0  0x03fffc23b57c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x803572fa in qemu_cond_wait (cond=0x809c28c0, mutex=0x809c2890) 
> at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:132
> #2  0x8036dc3e in rfifolock_lock (r=0x809c2890) at 
> /home/cborntra/REPOS/qemu/util/rfifol

Re: [Qemu-devel] [PULL v2 for-2.4 v2 5/7] AioContext: fix broken ctx->dispatching optimization

2015-07-23 Thread Christian Borntraeger
Am 23.07.2015 um 21:25 schrieb Paolo Bonzini:
> 
> 
> On 23/07/2015 21:19, Christian Borntraeger wrote:
>> Am 23.07.2015 um 20:19 schrieb Paolo Bonzini:
>>>
>>>
>>> On 23/07/2015 19:20, Paolo Bonzini wrote:


 On 23/07/2015 16:14, Cornelia Huck wrote:
> (gdb) bt 
> #0  0x03fffc5871b4 in pthread_cond_wait@@GLIBC_2.3.2 ()
>from /lib64/libpthread.so.0
> #1  0x8024cfca in qemu_cond_wait (cond=cond@entry=0x9717d950, 
> mutex=mutex@entry=0x9717d920)
> at /data/git/yyy/qemu/util/qemu-thread-posix.c:132
> #2  0x8025e83a in rfifolock_lock (r=0x9717d920)
> at /data/git/yyy/qemu/util/rfifolock.c:59
> #3  0x801b78fa in aio_context_acquire (ctx=<optimized out>)
> at /data/git/yyy/qemu/async.c:331
> #4  0x8007ceb4 in virtio_blk_data_plane_start (s=0x9717d710)
> at /data/git/yyy/qemu/hw/block/dataplane/virtio-blk.c:285
> #5  0x8007c64a in virtio_blk_handle_output (vdev=<optimized out>, 
> vq=<optimized out>) at /data/git/yyy/qemu/hw/block/virtio-blk.c:599
> #6  0x801c56dc in qemu_iohandler_poll (pollfds=0x97142800, 
> ret=ret@entry=1) at /data/git/yyy/qemu/iohandler.c:126
> #7  0x801c5178 in main_loop_wait (nonblocking=<optimized out>)
> at /data/git/yyy/qemu/main-loop.c:494
> #8  0x80013ee2 in main_loop () at /data/git/yyy/qemu/vl.c:1902
> #9  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
> at /data/git/yyy/qemu/vl.c:4653
>
> I've stripped down the setup to the following commandline:
>
> /data/git/yyy/qemu/build/s390x-softmmu/qemu-system-s390x  -machine
> s390-ccw-virtio-2.4,accel=kvm,usb=off -m 1024 -smp
> 4,sockets=4,cores=1,threads=1 -nographic -drive
> file=/dev/sda,if=none,id=drive-virtio-disk0,format=raw,serial=ccwzfcp1,cache=none,aio=native
> -device
> virtio-blk-ccw,scsi=off,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,x-data-plane=on

 What's the backtrace like for the other threads?  This is almost
 definitely a latent bug somewhere else.
>>>
>>> BTW, I can reproduce this---I'm asking because I cannot even attach gdb
>>> to the hung process.
>>>
>>> The simplest workaround is to reintroduce commit a0710f7995 (iothread:
>>> release iothread around aio_poll, 2015-02-20), though it also comes with
>>> some risk.  It avoids the bug because it limits the contention on the
>>> RFifoLock.
>>
>> I can reproduce this with the following backtrace (with --enable-debug info 
>> added qemu is the tag v2.4.0-rc2)
> 
> Can you check that cherry-picking a0710f7995 works for you?
> 

Yes, this seems to help for me. Conny, can you verify?


> Paolo
> 
>> Thread 4 (Thread 0x3fffb4ce910 (LWP 57750)):
>> #0  0x03fffc185e8e in syscall () from /lib64/libc.so.6
>> #1  0x803578ee in futex_wait (ev=0x8098486c , 
>> val=4294967295) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:301
>> #2  0x80357ae8 in qemu_event_wait (ev=0x8098486c 
>> ) at 
>> /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:399
>> #3  0x8036e202 in call_rcu_thread (opaque=0x0) at 
>> /home/cborntra/REPOS/qemu/util/rcu.c:233
>> #4  0x03fffc2374e6 in start_thread () from /lib64/libpthread.so.0
>> #5  0x03fffc18b0fa in thread_start () from /lib64/libc.so.6
>>
>> Thread 3 (Thread 0x3fffacce910 (LWP 57751)):
>> #0  0x03fffc17f5d6 in ppoll () from /lib64/libc.so.6
>> #1  0x80294c2e in qemu_poll_ns (fds=0x3fff40008c0, nfds=1, 
>> timeout=-1) at /home/cborntra/REPOS/qemu/qemu-timer.c:310
>> #2  0x80296788 in aio_poll (ctx=0x809c2830, blocking=true) at 
>> /home/cborntra/REPOS/qemu/aio-posix.c:271
>> #3  0x80137f58 in iothread_run (opaque=0x809c2450) at 
>> /home/cborntra/REPOS/qemu/iothread.c:42
>> #4  0x03fffc2374e6 in start_thread () from /lib64/libpthread.so.0
>> #5  0x03fffc18b0fa in thread_start () from /lib64/libc.so.6
>>
>> Thread 2 (Thread 0x3fff9411910 (LWP 57754)):
>> #0  0x03fffc23e662 in __lll_lock_wait () from /lib64/libpthread.so.0
>> #1  0x03fffc2394a4 in pthread_mutex_lock () from /lib64/libpthread.so.0
>> #2  0x8035706e in qemu_mutex_lock (mutex=0x80559288 
>> ) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:73
>> #3  0x8005ae46 in qemu_mutex_lock_iothread () at 
>> /home/cborntra/REPOS/qemu/cpus.c:1164
>> #4  0x8012f44e in kvm_arch_handle_exit (cs=0x82141930, 
>> run=0x3fffd31a000) at /home/cborntra/REPOS/qemu/target-s390x/kvm.c:2010
>> #5  0x800782f8 in kvm_cpu_exec (cpu=0x82141930) at 
>> /home/cborntra/REPOS/qemu/kvm-all.c:1901
>> #6  0x8005a73c in qemu_kvm_cpu_thread_fn (arg=0x82141930) at 
>> /home/cborntra/REPOS/qemu/cpus.c:977
>> #7  0x03fffc2374e6 in start_thread () from /lib64/libpthread.so.0
>> #8  0x03fffc18b0fa in thread_start () from /lib64/libc.so.6
>>
>> Thread 1 (Thread 0x3fffb4d0bd0 (LWP 57736)):
>> #0  0x03fffc23b57c in pthread_cond_wait@@GLIBC_2.3.2 () from 
>> /lib64/libpthread.so.0
>> #1  0x803572

Re: [Qemu-devel] [PULL v2 for-2.4 v2 5/7] AioContext: fix broken ctx->dispatching optimization

2015-07-23 Thread Cornelia Huck
On Thu, 23 Jul 2015 21:29:05 +0200
Christian Borntraeger  wrote:

> Am 23.07.2015 um 21:25 schrieb Paolo Bonzini:
> > 
> > 
> > On 23/07/2015 21:19, Christian Borntraeger wrote:
> >> Am 23.07.2015 um 20:19 schrieb Paolo Bonzini:
> >>>
> >>>
> >>> On 23/07/2015 19:20, Paolo Bonzini wrote:
> 
> 
>  On 23/07/2015 16:14, Cornelia Huck wrote:
> > (gdb) bt 
> > #0  0x03fffc5871b4 in pthread_cond_wait@@GLIBC_2.3.2 ()
> >from /lib64/libpthread.so.0
> > #1  0x8024cfca in qemu_cond_wait (cond=cond@entry=0x9717d950, 
> > mutex=mutex@entry=0x9717d920)
> > at /data/git/yyy/qemu/util/qemu-thread-posix.c:132
> > #2  0x8025e83a in rfifolock_lock (r=0x9717d920)
> > at /data/git/yyy/qemu/util/rfifolock.c:59
> > #3  0x801b78fa in aio_context_acquire (ctx=<optimized out>)
> > at /data/git/yyy/qemu/async.c:331
> > #4  0x8007ceb4 in virtio_blk_data_plane_start (s=0x9717d710)
> > at /data/git/yyy/qemu/hw/block/dataplane/virtio-blk.c:285
> > #5  0x8007c64a in virtio_blk_handle_output (vdev=<optimized out>, 
> > vq=<optimized out>) at /data/git/yyy/qemu/hw/block/virtio-blk.c:599
> > #6  0x801c56dc in qemu_iohandler_poll (pollfds=0x97142800, 
> > ret=ret@entry=1) at /data/git/yyy/qemu/iohandler.c:126
> > #7  0x801c5178 in main_loop_wait (nonblocking=<optimized out>)
> > at /data/git/yyy/qemu/main-loop.c:494
> > #8  0x80013ee2 in main_loop () at /data/git/yyy/qemu/vl.c:1902
> > #9  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
> > at /data/git/yyy/qemu/vl.c:4653
> >
> > I've stripped down the setup to the following commandline:
> >
> > /data/git/yyy/qemu/build/s390x-softmmu/qemu-system-s390x  -machine
> > s390-ccw-virtio-2.4,accel=kvm,usb=off -m 1024 -smp
> > 4,sockets=4,cores=1,threads=1 -nographic -drive
> > file=/dev/sda,if=none,id=drive-virtio-disk0,format=raw,serial=ccwzfcp1,cache=none,aio=native
> > -device
> > virtio-blk-ccw,scsi=off,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,x-data-plane=on
> 
>  What's the backtrace like for the other threads?  This is almost
>  definitely a latent bug somewhere else.
> >>>
> >>> BTW, I can reproduce this---I'm asking because I cannot even attach gdb
> >>> to the hung process.
> >>>
> >>> The simplest workaround is to reintroduce commit a0710f7995 (iothread:
> >>> release iothread around aio_poll, 2015-02-20), though it also comes with
> >>> some risk.  It avoids the bug because it limits the contention on the
> >>> RFifoLock.
> >>
> >> I can reproduce this with the following backtrace (with --enable-debug 
> >> info added qemu is the tag v2.4.0-rc2)
> > 
> > Can you check that cherry-picking a0710f7995 works for you?
> > 
> 
> Yes, this seems to help for me. Conny, can you verify?

I can't reproduce the hang with this on top, so seems to help.