Re: [Qemu-block] [Qemu-devel] [PATCH v2] mirror: follow AioContext change gracefully

2016-06-13 Thread Jason J. Herne

On 06/12/2016 02:51 AM, Fam Zheng wrote:
...

---

v2: Picked up Stefan's RFC patch and moved on towards a more complete
fix.  Please review!

Jason: it would be nice if you could test this version again. It differs
from the previous version.


No problem. I'll test v3 when it is available.


--
-- Jason J. Herne (jjhe...@linux.vnet.ibm.com)




Re: [Qemu-block] [Qemu-devel] coroutines: block: Co-routine re-entered recursively when migrating disk with iothreads

2016-06-09 Thread Jason J. Herne

On 06/09/2016 12:31 PM, Stefan Hajnoczi wrote:

On Mon, May 23, 2016 at 7:54 PM, Jason J. Herne
<jjhe...@linux.vnet.ibm.com> wrote:

Libvirt migration command:
virsh migrate --live --persistent --copy-storage-all --migrate-disks vdb
kvm1 qemu+ssh://dev1/system


I guess that this is the same scenario as a manual drive_mirror +
migrate test I have been doing.  I also get the "Co-routine re-entered
recursively" error message.

I've CCed you on an (unfinished) patch that solves the abort.  Please
test it and let me know!



Stefan, Yes!! This patch fixes the problem in our environment :) Thank 
you for digging into it and finding a solution. We'll patiently await a 
completed patch.






Re: [Qemu-block] [Qemu-devel] coroutines: block: Co-routine re-entered recursively when migrating disk with iothreads

2016-06-07 Thread Jason J. Herne

On 06/06/2016 10:44 PM, Fam Zheng wrote:

On Mon, 06/06 14:55, Jason J. Herne wrote:

I'll see if I can reproduce it here.

Fam



Hi Fam,
Have you had any luck reproducing this?


No, I cannot reproduce it so far.



I can hit the problem 100% of the time. Is there any info I can provide
to help with debugging or reproducing it?






Re: [Qemu-block] [Qemu-devel] coroutines: block: Co-routine re-entered recursively when migrating disk with iothreads

2016-06-06 Thread Jason J. Herne

On 05/25/2016 04:36 AM, Fam Zheng wrote:

On Tue, 05/24 11:05, Jason J. Herne wrote:

Thread 13 (Thread 0x3ff989ff910 (LWP 29452)):
#0  0x03ff99abe2c0 in raise () from /lib64/libc.so.6
#1  0x03ff99abfc26 in abort () from /lib64/libc.so.6
#2  0x80427d80 in qemu_coroutine_enter (co=0x9c5a4120, opaque=0x0)
at /root/kvmdev/qemu/util/qemu-coroutine.c:112
#3  0x8032246e in nbd_restart_write (opaque=0x9c5897b0) at
/root/kvmdev/qemu/block/nbd-client.c:114
#4  0x802b3a1c in aio_dispatch (ctx=0x9c530770) at
/root/kvmdev/qemu/aio-posix.c:341
#5  0x802b4332 in aio_poll (ctx=0x9c530770, blocking=true) at
/root/kvmdev/qemu/aio-posix.c:479
#6  0x80155aba in iothread_run (opaque=0x9c530200) at
/root/kvmdev/qemu/iothread.c:46
#7  0x03ff99c87c2c in start_thread () from /lib64/libpthread.so.0
#8  0x03ff99b8ec9a in thread_start () from /lib64/libc.so.6


This is the continuation of write request to the NBD target


Thread 1 (Thread 0x3ff9a6f2a90 (LWP 29433)):
#0  0x03ff99c8d68a in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1  0x8040932e in qemu_cond_wait (cond=0x9c530800, mutex=0x9c5307d0)
at /root/kvmdev/qemu/util/qemu-thread-posix.c:123
#2  0x80426a38 in rfifolock_lock (r=0x9c5307d0) at
/root/kvmdev/qemu/util/rfifolock.c:59
#3  0x802a1f72 in aio_context_acquire (ctx=0x9c530770) at
/root/kvmdev/qemu/async.c:373
#4  0x802b3f54 in aio_poll (ctx=0x9c530770, blocking=true) at
/root/kvmdev/qemu/aio-posix.c:415
#5  0x8031e7ac in bdrv_flush (bs=0x9c59b5c0) at
/root/kvmdev/qemu/block/io.c:2470
#6  0x802a8e6e in bdrv_close (bs=0x9c59b5c0) at
/root/kvmdev/qemu/block.c:2134
#7  0x802a9966 in bdrv_delete (bs=0x9c59b5c0) at
/root/kvmdev/qemu/block.c:2341
#8  0x802ac7c6 in bdrv_unref (bs=0x9c59b5c0) at
/root/kvmdev/qemu/block.c:3376
#9  0x80315340 in mirror_exit (job=0x9c956ed0, opaque=0x9c9570d0) at
/root/kvmdev/qemu/block/mirror.c:494
#10 0x802afb52 in block_job_defer_to_main_loop_bh
(opaque=0x9c90dc10) at /root/kvmdev/qemu/blockjob.c:476


... while this is the completion of mirror. They are not supposed to happen
together. Either the job is completed too early, or the nbd_restart_write
function is invoked incorrectly.

I'll see if I can reproduce it here.

Fam



Hi Fam,
Have you had any luck reproducing this?






[Qemu-block] coroutines: block: Co-routine re-entered recursively when migrating disk with iothreads

2016-05-23 Thread Jason J. Herne
Using libvirt to migrate a guest along with one guest disk that uses
iothreads causes QEMU to crash with the message:

Co-routine re-entered recursively

I've looked into this one a bit but I have not seen anything that 
immediately stands out.

Here is what I have found:

In qemu_coroutine_enter:

    if (co->caller) {
        fprintf(stderr, "Co-routine re-entered recursively\n");
        abort();
    }

The value of co->caller is actually changing between the time "if
(co->caller)" is evaluated and the time I print some debug statements
directly under the existing fprintf. I confirmed this by saving the
value in a local variable and printing both the local copy and
co->caller immediately after the existing fprintf. This certainly
indicates some kind of concurrency issue. However, it does not
necessarily explain how we ended up inside this if statement, because
co->caller was already non-NULL before it was trashed. Perhaps it was
trashed more than once? I figured the problem might be with coroutine
pools, so I disabled them (--disable-coroutine-pool) and still hit the
bug.


The backtrace is not always identical. Here is one instance:
(gdb) bt
#0  0x03ffa78be2c0 in raise () from /lib64/libc.so.6
#1  0x03ffa78bfc26 in abort () from /lib64/libc.so.6
#2  0x80427d80 in qemu_coroutine_enter (co=0xa2cf2b40, 
opaque=0x0) at /root/kvmdev/qemu/util/qemu-coroutine.c:112
#3  0x8032246e in nbd_restart_write (opaque=0xa2d0cd40) at 
/root/kvmdev/qemu/block/nbd-client.c:114
#4  0x802b3a1c in aio_dispatch (ctx=0xa2c907a0) at 
/root/kvmdev/qemu/aio-posix.c:341
#5  0x802b4332 in aio_poll (ctx=0xa2c907a0, blocking=true) at 
/root/kvmdev/qemu/aio-posix.c:479
#6  0x80155aba in iothread_run (opaque=0xa2c90260) at 
/root/kvmdev/qemu/iothread.c:46

#7  0x03ffa7a87c2c in start_thread () from /lib64/libpthread.so.0
#8  0x03ffa798ec9a in thread_start () from /lib64/libc.so.6

I've also noticed that co->entry sometimes (maybe always?) points to 
mirror_run. Though, given that co->caller changes unexpectedly, I don't 
know if we can trust co->entry.


I do not see the bug when I perform the same migration without migrating 
the disk.

I also do not see the bug when I remove the iothread from the guest.

I tested this scenario as far back as tag v2.4.0 and hit the bug every 
time. I was unable to test v2.3.0 due to unresolved guest hangs. I did, 
however, manage to get as far as this commit:


commit ca96ac44dcd290566090b2435bc828fded356ad9
Author: Stefan Hajnoczi <stefa...@redhat.com>
Date:   Tue Jul 28 18:34:09 2015 +0200
AioContext: force event loop iteration using BH

This commit fixes a hang that my test scenario experiences. I was able
to test even further back by cherry-picking ca96ac44 on top of the
earlier commits, but at that point I could no longer be sure whether the
bug was introduced by ca96ac44, so I stopped.


I am willing to run tests or collect any info needed. I'll keep 
investigating but I won't turn down any help :).


Qemu command line as taken from Libvirt log:
qemu-system-s390x
-name kvm1 -S -machine s390-ccw-virtio-2.6,accel=kvm,usb=off
-m 6144 -realtime mlock=off
-smp 1,sockets=1,cores=1,threads=1
-object iothread,id=iothread1
-uuid 3796d9f0-8555-4a1e-9d5c-fac56b8cbf56
-nographic -no-user-config -nodefaults
-chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-kvm1/monitor.sock,server,nowait

-mon chardev=charmonitor,id=monitor,mode=control
-rtc base=utc -no-shutdown
-boot strict=on -kernel /data/vms/kvm1/kvm1-image
-initrd /data/vms/kvm1/kvm1-initrd -append 'hvc_iucv=8 TERM=dumb'
-drive 
file=/dev/disk/by-path/ccw-0.0.c22b,format=raw,if=none,id=drive-virtio-disk0,cache=none
-device 
virtio-blk-ccw,scsi=off,devno=fe.0.,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-drive 
file=/data/vms/kvm1/kvm1.qcow,format=qcow2,if=none,id=drive-virtio-disk1,cache=none
-device 
virtio-blk-ccw,iothread=iothread1,scsi=off,devno=fe.0.0008,drive=drive-virtio-disk1,id=virtio-disk1

-netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=27
-device 
virtio-net-ccw,netdev=hostnet0,id=net0,mac=52:54:00:c9:86:2b,devno=fe.0.0001
-chardev pty,id=charconsole0 -device 
sclpconsole,chardev=charconsole0,id=console0
-device virtio-balloon-ccw,id=balloon0,devno=fe.0.0002 -msg 
timestamp=on


Libvirt migration command:
virsh migrate --live --persistent --copy-storage-all --migrate-disks vdb 
 kvm1 qemu+ssh://dev1/system

