On 27.10.2011 16:32, Kevin Wolf wrote:
> On 27.10.2011 16:15, Kevin Wolf wrote:
>> On 27.10.2011 15:57, Stefan Hajnoczi wrote:
>>> On Thu, Oct 27, 2011 at 03:26:23PM +0200, Kevin Wolf wrote:
>>>> On 19.09.2011 16:37, Frediano Ziglio wrote:
>>>>> Now that the iothread is always compiled in, sending a signal seems
>>>>> to be only an additional step. This patch also avoids writing to two
>>>>> pipes (one from the signal handler and one in qemu_service_io).
>>>>>
>>>>> Works with kvm enabled or disabled. strace output is more readable
>>>>> (fewer syscalls).
>>>>>
>>>>> Signed-off-by: Frediano Ziglio <fredd...@gmail.com>
>>>>
>>>> Something in this change has bad effects, in the sense that it seems
>>>> to break bdrv_read_em.
>>>
>>> How does it break bdrv_read_em? Are you seeing QEMU hang with 100% CPU
>>> utilization, or deadlock?
>>
>> Sorry, I should have been more detailed here.
>>
>> No, it's nothing obvious; it must be some subtle side effect. The
>> result of bdrv_read_em itself seems to be correct (return value and
>> checksum of the read buffer).
>>
>> However, instead of booting into the DOS setup I only get an error
>> message "Kein System oder Laufwerksfehler" (literally "no system or
>> drive error"; I don't know how it reads in English DOS versions),
>> which seems to be produced by the boot sector.
>>
>> I excluded all of the minor changes, so I'm sure that it's caused by
>> the switch from kill() to a direct call of the function that writes
>> into the pipe.
>>
>>> One interesting thing is that qemu_aio_wait() does not release the
>>> QEMU mutex, so we cannot write to a pipe with the mutex held and then
>>> spin waiting for the iothread to do work for us.
>>>
>>> Exactly how kill() and qemu_notify_event() were different I'm not
>>> sure right now, but it could be a factor.
>>
>> This would cause a hang, right? Then it isn't what I'm seeing.
>
> While trying out some more things, I added some fprintfs to
> posix_aio_process_queue() and suddenly it also failed with the kill()
> version. So what has changed might really just be the timing, and it
> could be a race somewhere that has always (?) existed.
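(To recap the mechanics for anyone jumping into the thread: going by the
commit message quoted above, before the patch an aio worker thread
announced completion with kill(); the signal handler then wrote a byte
into the posix-aio pipe and additionally kicked the main loop via
qemu_service_io(). After the patch the worker thread writes into the
pipe directly, so the wakeup happens immediately and from the worker's
context instead of whenever the signal happens to be delivered. Here is
a toy standalone program showing just the two wakeup paths -- this is
not qemu code, all names and the structure are made up for
illustration:

    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static int pipefd[2];

    /* Path A (old): the worker raises a signal; the handler writes to
     * the pipe, so the write happens in signal context and only when
     * the signal is actually delivered. */
    static void sig_handler(int signum)
    {
        char byte = 0;
        (void)signum;
        if (write(pipefd[1], &byte, 1) < 0) {
            /* nothing useful to do in a signal handler */
        }
    }

    static void *worker_signal(void *arg)
    {
        (void)arg;
        kill(getpid(), SIGUSR1);
        return NULL;
    }

    /* Path B (new): the worker writes to the pipe itself, immediately. */
    static void *worker_direct(void *arg)
    {
        char byte = 0;
        ssize_t ret = write(pipefd[1], &byte, 1);
        (void)arg;
        (void)ret;
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        char byte;

        signal(SIGUSR1, sig_handler);
        if (pipe(pipefd) < 0) {
            return 1;
        }

        pthread_create(&t, NULL, worker_signal, NULL);
        pthread_join(t, NULL);
        while (read(pipefd[0], &byte, 1) < 0) { /* retry on EINTR */ }
        printf("woken via signal handler\n");

        pthread_create(&t, NULL, worker_direct, NULL);
        pthread_join(t, NULL);
        while (read(pipefd[0], &byte, 1) < 0) { }
        printf("woken via direct write\n");

        return 0;
    }

Both versions wake the reader; what differs is the timing and the
context of the write, which fits the theory above that the patch merely
changed the timing and exposed a latent race.)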
Replying to myself again... It looks like there is a problem with
reentrancy in fdctrl_transfer_handler. I think this would have been
guarded by the AsyncContexts before, but we don't have them any more.

qemu-system-x86_64: /root/upstream/qemu/hw/fdc.c:1253:
fdctrl_transfer_handler: Assertion `reentrancy == 0' failed.

Program received signal SIGABRT, Aborted.
(gdb) bt
#0  0x0000003ccd2329a5 in raise () from /lib64/libc.so.6
#1  0x0000003ccd234185 in abort () from /lib64/libc.so.6
#2  0x0000003ccd22b935 in __assert_fail () from /lib64/libc.so.6
#3  0x000000000046ff09 in fdctrl_transfer_handler (opaque=<value optimized out>,
    nchan=<value optimized out>, dma_pos=<value optimized out>,
    dma_len=<value optimized out>) at /root/upstream/qemu/hw/fdc.c:1253
#4  0x000000000046702c in channel_run () at /root/upstream/qemu/hw/dma.c:348
#5  DMA_run () at /root/upstream/qemu/hw/dma.c:378
#6  0x000000000040b0e1 in qemu_bh_poll () at async.c:70
#7  0x000000000040aa19 in qemu_aio_wait () at aio.c:147
#8  0x000000000041c355 in bdrv_read_em (bs=0x131fd80, sector_num=19,
    buf=<value optimized out>, nb_sectors=1) at block.c:2896
#9  0x000000000041b3d2 in bdrv_read (bs=0x131fd80, sector_num=19,
    buf=0x1785a00 "IO SYS!", nb_sectors=1) at block.c:1062
#10 0x000000000041b3d2 in bdrv_read (bs=0x131f430, sector_num=19,
    buf=0x1785a00 "IO SYS!", nb_sectors=1) at block.c:1062
#11 0x000000000046fbb8 in do_fdctrl_transfer_handler (opaque=0x1785788,
    nchan=2, dma_pos=<value optimized out>, dma_len=512)
    at /root/upstream/qemu/hw/fdc.c:1178
#12 0x000000000046fecf in fdctrl_transfer_handler (opaque=<value optimized out>,
    nchan=<value optimized out>, dma_pos=<value optimized out>,
    dma_len=<value optimized out>) at /root/upstream/qemu/hw/fdc.c:1255
#13 0x000000000046702c in channel_run () at /root/upstream/qemu/hw/dma.c:348
#14 DMA_run () at /root/upstream/qemu/hw/dma.c:378
#15 0x000000000046e456 in fdctrl_start_transfer (fdctrl=0x1785788, direction=1)
    at /root/upstream/qemu/hw/fdc.c:1107
#16 0x0000000000558a41 in kvm_handle_io (env=0x1323ff0)
    at /root/upstream/qemu/kvm-all.c:834
#17 kvm_cpu_exec (env=0x1323ff0) at /root/upstream/qemu/kvm-all.c:976
#18 0x000000000053686a in qemu_kvm_cpu_thread_fn (arg=0x1323ff0)
    at /root/upstream/qemu/cpus.c:661
#19 0x0000003ccda077e1 in start_thread () from /lib64/libpthread.so.0
#20 0x0000003ccd2e151d in clone () from /lib64/libc.so.6

I'm afraid that we can only avoid things like this reliably if we
convert all devices to be direct users of AIO/coroutines. The current
block layer infrastructure doesn't emulate the behaviour of bdrv_read
accurately, because bottom halves can run in the nested main loop: the
DMA bottom half fires inside bdrv_read_em's qemu_aio_wait() loop and
re-enters the floppy transfer handler (frame #3 vs. frame #12 above).

For floppy, the following seems to be a quick fix (Lucas, Cleber, does
this solve your problems?), though it's not very satisfying. And I'm
not quite sure yet why it doesn't always happen with kill() in
posix-aio-compat.c.

diff --git a/hw/dma.c b/hw/dma.c
index 8a7302a..1d3b6f1 100644
--- a/hw/dma.c
+++ b/hw/dma.c
@@ -358,6 +358,13 @@ static void DMA_run (void)
     struct dma_cont *d;
     int icont, ichan;
     int rearm = 0;
+    static int running = 0;
+
+    if (running) {
+        goto out;
+    } else {
+        running = 1;
+    }
 
     d = dma_controllers;
 
@@ -374,6 +381,8 @@ static void DMA_run (void)
         }
     }
 
+out:
+    running = 0;
     if (rearm)
         qemu_bh_schedule_idle(dma_bh);
 }

Kevin
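PS: Here's a standalone toy model of the reentrancy -- not qemu code,
all names are stand-ins -- that has the same shape as frames #3..#14 of
the backtrace above and shows why the 'running' guard in the patch
helps:

    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    static bool dma_pending;

    static void dma_run(void);

    /* Stand-in for fdctrl_transfer_handler(): it performs a synchronous
     * read bdrv_read_em()-style, i.e. by spinning in a nested event
     * loop.  The "DMA bottom half" is still pending at that point, so
     * the nested loop calls back into this function. */
    static void transfer_handler(void)
    {
        static int reentrancy = 0;

        reentrancy++;
        assert(reentrancy == 1); /* fires if the guard below is removed */
        printf("transfer_handler\n");

        dma_run();               /* nested main loop polls bottom halves */

        dma_pending = false;     /* transfer finished */
        reentrancy--;
    }

    /* Stand-in for DMA_run(), including the guard from the patch. */
    static void dma_run(void)
    {
        static bool running = false;

        if (running) {
            return;              /* re-entered from the nested loop */
        }
        running = true;

        if (dma_pending) {
            transfer_handler();
        }

        running = false;
    }

    int main(void)
    {
        dma_pending = true;
        dma_run();
        return 0;
    }

With the guard, the nested dma_run() returns immediately and the
handler runs exactly once; delete the if (running) check and the
assertion fails on the second, nested invocation -- the same failure
mode as the fdc.c assertion above.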