On Tue, Sep 27, 2016 at 5:25 PM, Stefan Hajnoczi <stefa...@redhat.com> wrote: > On Tue, Sep 27, 2016 at 04:29:55PM +0200, Roman Penyaev wrote: >> On Tue, Sep 27, 2016 at 4:06 PM, Stefan Hajnoczi <stefa...@redhat.com> wrote: >> > Commit 0ed93d84edabc7656f5c998ae1a346fe8b94ca54 ("linux-aio: process >> > completions from ioq_submit()") added an optimization that processes >> > completions each time ioq_submit() returns with requests in flight. >> > This commit introduces a "Co-routine re-entered recursively" error which >> > can be triggered with -drive format=qcow2,aio=native. >> > >> > Fam Zheng <f...@redhat.com>, Kevin Wolf <kw...@redhat.com>, and I >> > debugged the following backtrace: >> > >> > (gdb) bt >> > #0 0x00007ffff0a046f5 in raise () at /lib64/libc.so.6 >> > #1 0x00007ffff0a062fa in abort () at /lib64/libc.so.6 >> > #2 0x0000555555ac0013 in qemu_coroutine_enter (co=0x5555583464d0) at >> > util/qemu-coroutine.c:113 >> > #3 0x0000555555a4b663 in qemu_laio_process_completions >> > (s=s@entry=0x555557e2f7f0) at block/linux-aio.c:218 >> > #4 0x0000555555a4b874 in ioq_submit (s=s@entry=0x555557e2f7f0) at >> > block/linux-aio.c:331 >> > #5 0x0000555555a4ba12 in laio_do_submit (fd=fd@entry=13, >> > laiocb=laiocb@entry=0x555559d38ae0, offset=offset@entry=2932727808, >> > type=type@entry=1) at block/linux-aio.c:383 >> > #6 0x0000555555a4bbd3 in laio_co_submit (bs=<optimized out>, >> > s=0x555557e2f7f0, fd=13, offset=2932727808, qiov=0x555559d38e20, type=1) >> > at block/linux-aio.c:402 >> > #7 0x0000555555a4fd23 in bdrv_driver_preadv >> > (bs=bs@entry=0x55555663bcb0, offset=offset@entry=2932727808, >> > bytes=bytes@entry=8192, qiov=qiov@entry=0x555559d38e20, flags=0) at >> > block/io.c:804 >> > #8 0x0000555555a52b34 in bdrv_aligned_preadv >> > (bs=bs@entry=0x55555663bcb0, req=req@entry=0x555559d38d20, >> > offset=offset@entry=2932727808, bytes=bytes@entry=8192, >> > align=align@entry=512, qiov=qiov@entry=0x555559d38e20, flags=0) at >> > block/io.c:1041 >> > #9 0x0000555555a52db8 in bdrv_co_preadv (child=<optimized out>, >> > offset=2932727808, bytes=8192, qiov=qiov@entry=0x555559d38e20, >> > flags=flags@entry=0) at block/io.c:1133 >> > #10 0x0000555555a29629 in qcow2_co_preadv (bs=0x555556635890, >> > offset=6178725888, bytes=8192, qiov=0x555557527840, flags=<optimized out>) >> > at block/qcow2.c:1509 >> > #11 0x0000555555a4fd23 in bdrv_driver_preadv >> > (bs=bs@entry=0x555556635890, offset=offset@entry=6178725888, >> > bytes=bytes@entry=8192, qiov=qiov@entry=0x555557527840, flags=0) at >> > block/io.c:804 >> > #12 0x0000555555a52b34 in bdrv_aligned_preadv >> > (bs=bs@entry=0x555556635890, req=req@entry=0x555559d39000, >> > offset=offset@entry=6178725888, bytes=bytes@entry=8192, >> > align=align@entry=1, qiov=qiov@entry=0x555557527840, flags=0) at >> > block/io.c:1041 >> > #13 0x0000555555a52db8 in bdrv_co_preadv (child=<optimized out>, >> > offset=offset@entry=6178725888, bytes=bytes@entry=8192, >> > qiov=qiov@entry=0x555557527840, flags=flags@entry=0) at block/io.c:1133 >> > #14 0x0000555555a4515a in blk_co_preadv (blk=0x5555566356d0, >> > offset=6178725888, bytes=8192, qiov=0x555557527840, flags=0) at >> > block/block-backend.c:783 >> > #15 0x0000555555a45266 in blk_aio_read_entry (opaque=0x5555577025e0) at >> > block/block-backend.c:991 >> > #16 0x0000555555ac0cfa in coroutine_trampoline (i0=<optimized out>, >> > i1=<optimized out>) at util/coroutine-ucontext.c:78 >> > >> > It turned out that re-entrant ioq_submit() and completion processing >> > between three requests caused this error. The following check is not >> > sufficient to prevent recursively entering coroutines: >> > >> > if (laiocb->co != qemu_coroutine_self()) { >> > qemu_coroutine_enter(laiocb->co); >> > } >> > >> > As the following coroutine backtrace shows, not just the current >> > coroutine (self) can be entered. There might also be other coroutines >> > that are currently entered and transferred control due to the qcow2 lock >> > (CoMutex): >> >> I doubt that that was introduced by the commit you've specified: >> 0ed93d84edab. >> >> Before my patch coroutine was unconditionally entered. The following >> is what was changed by 0ed93d84edab: >> >> if (laiocb->co) { >> - qemu_coroutine_enter(laiocb->co); >> + /* Jump and continue completion for foreign requests, don't do >> + * anything for current request, it will be completed shortly. */ >> + if (laiocb->co != qemu_coroutine_self()) { >> + qemu_coroutine_enter(laiocb->co); >> + } > > Unconditionally entering was safe prior to 0ed93d84edab since all > coroutines yielded and qemu_coroutine_entered() would be false all the > time. Therefore it wasn't necessary to protect against re-entering a > coroutine. > >> If you have a strong reproduction, could you please verify that. > > The bug is 100% deterministic. Just boot up a guest with -drive > format=qcow2,aio=native.
It turns out to be that everything is broken. I started all my tests with format=raw,aio=native and immediately got coroutine recursive. That is completely weird. So, what I did is the following: 1. Took latest master (nothing works) 2. Did interactive rebase to 12c8720 12c8720 2016-06-28 | Merge remote-tracking branch 'remotes/stefanha/tags/block-pull-request' into staging [Peter Maydell] this merge request includes all your patches related to virtio-blk and MQ support. 3. Applied 0ed93d84edab. Everything works fine. 4. Rebased up till 0647d47: 0647d47 2016-09-13 | qcow2: avoid memcpy(dst, NULL, len) [Stefan Hajnoczi] this is the point, after which 0ed93d84edab was applied on master. Got recursive coroutine, so nothing works. 5. Did a besect, which shows this commit: -- commit 1a62d0accdf85fbeac149018ee8d1728e754de73 Author: Eric Blake <ebl...@redhat.com> Date: Fri Jul 15 12:31:59 2016 -0600 block: Fragment reads to max transfer length -- So after this commit my commit 0ed93d84edab stops working. And now for me is completely not clear what is happening there. -- Roman