Am 14.08.2015 um 16:45 schrieb Peter Lieven:
> Am 14.08.2015 um 16:08 schrieb Kevin Wolf:
>> Am 14.08.2015 um 15:43 hat Peter Lieven geschrieben:
>>> Am 22.06.2015 um 23:54 schrieb John Snow:
>>>> On 06/22/2015 09:09 AM, Peter Lieven wrote:
>>>>> Am 22.06.2015 um 11:25 schrieb Stefan Hajnoczi:
>>>>>> On Fri, Jun 19, 2015 at 2:14 PM, Peter Lieven <p...@kamp.de> wrote:
>>>>>>> Am 18.06.2015 um 11:36 schrieb Stefan Hajnoczi:
>>>>>>>> On Thu, Jun 18, 2015 at 10:29 AM, Peter Lieven <p...@kamp.de> wrote:
>>>>>>>>> Am 18.06.2015 um 10:42 schrieb Kevin Wolf:
>>>>>>>>>> Am 18.06.2015 um 10:30 hat Peter Lieven geschrieben:
>>>>>>>>>>> Am 18.06.2015 um 09:45 schrieb Kevin Wolf:
>>>>>>>>>>>> Am 18.06.2015 um 09:12 hat Peter Lieven geschrieben:
>>>>>>>>>>>>> Thread 2 (Thread 0x7ffff5550700 (LWP 2636)):
>>>>>>>>>>>>> #0  0x00007ffff5d87aa3 in ppoll () from
>>>>>>>>>>>>> /lib/x86_64-linux-gnu/libc.so.6
>>>>>>>>>>>>> No symbol table info available.
>>>>>>>>>>>>> #1  0x0000555555955d91 in qemu_poll_ns (fds=0x5555563889c0,
>>>>>>>>>>>>> nfds=3,
>>>>>>>>>>>>>       timeout=4999424576) at qemu-timer.c:326
>>>>>>>>>>>>>           ts = {tv_sec = 4, tv_nsec = 999424576}
>>>>>>>>>>>>>           tvsec = 4
>>>>>>>>>>>>> #2  0x0000555555956feb in aio_poll (ctx=0x5555563528e0,
>>>>>>>>>>>>> blocking=true)
>>>>>>>>>>>>>       at aio-posix.c:231
>>>>>>>>>>>>>           node = 0x0
>>>>>>>>>>>>>           was_dispatching = false
>>>>>>>>>>>>>           ret = 1
>>>>>>>>>>>>>           progress = false
>>>>>>>>>>>>> #3  0x000055555594aeed in bdrv_prwv_co (bs=0x55555637eae0,
>>>>>>>>>>>>> offset=4292007936,
>>>>>>>>>>>>>       qiov=0x7ffff554f760, is_write=false, flags=0) at
>>>>>>>>>>>>> block.c:2699
>>>>>>>>>>>>>           aio_context = 0x5555563528e0
>>>>>>>>>>>>>           co = 0x5555563888a0
>>>>>>>>>>>>>           rwco = {bs = 0x55555637eae0, offset = 4292007936,
>>>>>>>>>>>>>             qiov = 0x7ffff554f760, is_write = false, ret =
>>>>>>>>>>>>> 2147483647,
>>>>>>>>>>>>> flags = 0}
>>>>>>>>>>>>> #4  0x000055555594afa9 in bdrv_rw_co (bs=0x55555637eae0,
>>>>>>>>>>>>> sector_num=8382828,
>>>>>>>>>>>>>       buf=0x7ffff44cc800 "(", nb_sectors=4, is_write=false,
>>>>>>>>>>>>> flags=0)
>>>>>>>>>>>>>       at block.c:2722
>>>>>>>>>>>>>           qiov = {iov = 0x7ffff554f780, niov = 1, nalloc = -1,
>>>>>>>>>>>>> size =
>>>>>>>>>>>>> 2048}
>>>>>>>>>>>>>           iov = {iov_base = 0x7ffff44cc800, iov_len = 2048}
>>>>>>>>>>>>> #5  0x000055555594b008 in bdrv_read (bs=0x55555637eae0,
>>>>>>>>>>>>> sector_num=8382828,
>>>>>>>>>>>>>       buf=0x7ffff44cc800 "(", nb_sectors=4) at block.c:2730
>>>>>>>>>>>>> No locals.
>>>>>>>>>>>>> #6  0x000055555599acef in blk_read (blk=0x555556376820,
>>>>>>>>>>>>> sector_num=8382828,
>>>>>>>>>>>>>       buf=0x7ffff44cc800 "(", nb_sectors=4) at
>>>>>>>>>>>>> block/block-backend.c:404
>>>>>>>>>>>>> No locals.
>>>>>>>>>>>>> #7  0x0000555555833ed2 in cd_read_sector (s=0x555556408f88,
>>>>>>>>>>>>> lba=2095707,
>>>>>>>>>>>>>       buf=0x7ffff44cc800 "(", sector_size=2048) at
>>>>>>>>>>>>> hw/ide/atapi.c:116
>>>>>>>>>>>>>           ret = 32767
>>>>>>>>>>>> Here is the problem: The ATAPI emulation uses synchronous
>>>>>>>>>>>> blk_read()
>>>>>>>>>>>> instead of the AIO or coroutine interfaces. This means that it
>>>>>>>>>>>> keeps
>>>>>>>>>>>> polling for request completion while it holds the BQL until the
>>>>>>>>>>>> request
>>>>>>>>>>>> is completed.
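
For reference, the synchronous path Kevin describes looks roughly like this
(a simplified sketch of the current hw/ide/atapi.c code, not the exact source):

    static int cd_read_sector(IDEState *s, int lba, uint8_t *buf,
                              int sector_size)
    {
        /* assuming sector_size == 2048, i.e. 4 x 512-byte sectors */
        /* blk_read() is synchronous: under the hood it creates a coroutine
         * and then spins in aio_poll() until the request completes, all
         * while the caller still holds the global mutex. */
        return blk_read(s->blk, (int64_t)lba << 2, buf, 4);
    }

If the backend (here: libnfs) never answers, that aio_poll() loop never
returns and the thread keeps the BQL indefinitely.
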
>>>>>>>>>>> I will look at this.
>>>>>>>>> I need some further help. My way to "emulate" a hung NFS server is to
>>>>>>>>> block it in the firewall. Currently I face the problem that I
>>>>>>>>> cannot mount a CD ISO via libnfs (nfs://) without hanging QEMU
>>>>>>>>> (I previously tried with a kernel NFS mount). It reads a few
>>>>>>>>> sectors and then stalls (maybe another bug):
>>>>>>>>>
>>>>>>>>> (gdb) thread apply all bt full
>>>>>>>>>
>>>>>>>>> Thread 3 (Thread 0x7ffff0c21700 (LWP 29710)):
>>>>>>>>> #0  qemu_cond_broadcast (cond=cond@entry=0x555556259940) at
>>>>>>>>> util/qemu-thread-posix.c:120
>>>>>>>>>          err = <optimized out>
>>>>>>>>>          __func__ = "qemu_cond_broadcast"
>>>>>>>>> #1  0x0000555555911164 in rfifolock_unlock
>>>>>>>>> (r=r@entry=0x555556259910) at
>>>>>>>>> util/rfifolock.c:75
>>>>>>>>>          __PRETTY_FUNCTION__ = "rfifolock_unlock"
>>>>>>>>> #2  0x0000555555875921 in aio_context_release
>>>>>>>>> (ctx=ctx@entry=0x5555562598b0)
>>>>>>>>> at async.c:329
>>>>>>>>> No locals.
>>>>>>>>> #3  0x000055555588434c in aio_poll (ctx=ctx@entry=0x5555562598b0,
>>>>>>>>> blocking=blocking@entry=true) at aio-posix.c:272
>>>>>>>>>          node = <optimized out>
>>>>>>>>>          was_dispatching = false
>>>>>>>>>          i = <optimized out>
>>>>>>>>>          ret = <optimized out>
>>>>>>>>>          progress = false
>>>>>>>>>          timeout = 611734526
>>>>>>>>>          __PRETTY_FUNCTION__ = "aio_poll"
>>>>>>>>> #4  0x00005555558bc43d in bdrv_prwv_co (bs=bs@entry=0x55555627c0f0,
>>>>>>>>> offset=offset@entry=7038976, qiov=qiov@entry=0x7ffff0c208f0,
>>>>>>>>> is_write=is_write@entry=false, flags=flags@entry=(unknown: 0)) at
>>>>>>>>> block/io.c:552
>>>>>>>>>          aio_context = 0x5555562598b0
>>>>>>>>>          co = <optimized out>
>>>>>>>>>          rwco = {bs = 0x55555627c0f0, offset = 7038976, qiov =
>>>>>>>>> 0x7ffff0c208f0, is_write = false, ret = 2147483647, flags =
>>>>>>>>> (unknown: 0)}
>>>>>>>>> #5  0x00005555558bc533 in bdrv_rw_co (bs=0x55555627c0f0,
>>>>>>>>> sector_num=sector_num@entry=13748, buf=buf@entry=0x555557874800 "(",
>>>>>>>>> nb_sectors=nb_sectors@entry=4, is_write=is_write@entry=false,
>>>>>>>>>      flags=flags@entry=(unknown: 0)) at block/io.c:575
>>>>>>>>>          qiov = {iov = 0x7ffff0c208e0, niov = 1, nalloc = -1, size
>>>>>>>>> = 2048}
>>>>>>>>>          iov = {iov_base = 0x555557874800, iov_len = 2048}
>>>>>>>>> #6  0x00005555558bc593 in bdrv_read (bs=<optimized out>,
>>>>>>>>> sector_num=sector_num@entry=13748, buf=buf@entry=0x555557874800 "(",
>>>>>>>>> nb_sectors=nb_sectors@entry=4) at block/io.c:583
>>>>>>>>> No locals.
>>>>>>>>> #7  0x00005555558af75d in blk_read (blk=<optimized out>,
>>>>>>>>> sector_num=sector_num@entry=13748, buf=buf@entry=0x555557874800 "(",
>>>>>>>>> nb_sectors=nb_sectors@entry=4) at block/block-backend.c:493
>>>>>>>>>          ret = <optimized out>
>>>>>>>>> #8  0x00005555557abb88 in cd_read_sector (sector_size=<optimized out>,
>>>>>>>>> buf=0x555557874800 "(", lba=3437, s=0x55555760db70) at
>>>>>>>>> hw/ide/atapi.c:116
>>>>>>>>>          ret = <optimized out>
>>>>>>>>> #9  ide_atapi_cmd_reply_end (s=0x55555760db70) at hw/ide/atapi.c:190
>>>>>>>>>          byte_count_limit = <optimized out>
>>>>>>>>>          size = <optimized out>
>>>>>>>>>          ret = 2
>>>>>>>> This is still the same scenario Kevin explained.
>>>>>>>>
>>>>>>>> The ATAPI CD-ROM emulation code is using synchronous blk_read().  This
>>>>>>>> function holds the QEMU global mutex while waiting for the I/O request
>>>>>>>> to complete.  This blocks other vcpu threads and the main loop thread.
>>>>>>>>
>>>>>>>> The solution is to convert the CD-ROM emulation code to use
>>>>>>>> blk_aio_readv() instead of blk_read().
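
A rough sketch of what such a conversion might look like (the callback name
and error handling are illustrative, not a finished patch):

    static void cd_read_sector_cb(void *opaque, int ret)
    {
        IDEState *s = opaque;

        s->status &= ~BUSY_STAT;
        if (ret < 0) {
            ide_atapi_io_error(s, ret);
            return;
        }
        /* hand the freshly read sector back to the ATAPI state machine */
        ide_atapi_cmd_reply_end(s);
    }

    static int cd_read_sector(IDEState *s, int lba, uint8_t *buf,
                              int sector_size)
    {
        /* assuming sector_size == 2048, i.e. 4 x 512-byte sectors */
        s->iov.iov_base = buf;
        s->iov.iov_len = 4 * BDRV_SECTOR_SIZE;
        qemu_iovec_init_external(&s->qiov, &s->iov, 1);

        s->status |= BUSY_STAT;
        blk_aio_readv(s->blk, (int64_t)lba << 2, &s->qiov, 4,
                      cd_read_sector_cb, s);
        return 0;
    }

The crucial difference is that cd_read_sector() now returns immediately; the
request completes later from the event loop instead of being polled for with
the BQL held.
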
>>>>>>> I tried a little, but I am stuck with my approach. It reads one sector
>>>>>>> and then doesn't continue. Maybe someone with more knowledge
>>>>>>> of ATAPI/IDE could help?
>>>>>> Converting synchronous code to asynchronous requires an understanding
>>>>>> of the device's state transitions.  Asynchronous code has to put the
>>>>>> device registers into a busy state until the request completes.  It
>>>>>> also needs to handle hardware register accesses that occur while the
>>>>>> request is still pending.
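
In IDE terms I read that as: set BSY when the request is submitted, clear it
(and raise DRQ) only in the completion callback, and make sure any register
access that arrives in between does not advance the transfer. Very roughly
(the guard placement here is illustrative):

    /* on submission */
    s->status |= BUSY_STAT;

    /* in the completion callback */
    s->status &= ~BUSY_STAT;
    s->status |= DRQ_STAT;

    /* e.g. in the PIO data port handler, while the request is pending */
    if (s->status & BUSY_STAT) {
        return 0;    /* nothing to transfer yet */
    }
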
>>>>> That was my assumption as well. But I don't know how to proceed...
>>>>>
>>>>>> I don't know ATAPI/IDE code well enough to suggest a fix.
>>>>> Maybe @John can help?
>>>>>
>>>>> Peter
>>>>>
>>> I looked into this again and it seems that the remaining problem (at least
>>> when the CDROM is mounted via libnfs) is the blk_drain_all() in
>>> bmdma_cmd_writeb. At least that is where I end up if I have a proper OS
>>> booted and cut off the NFS server. The VM remains responsive until the
>>> guest OS issues a DMA cancel.
>>>
>>> I do not know what the proper solution is. I had the following ideas so far 
>>> (not knowing if the
>>> approaches would be correct or not).
>>>
>>> a) Do not clear BM_STATUS_DMAING if we are not able to drain all requests.
>>> This works until the connection is reestablished. The guest OS issues DMA
>>> cancel operations again and again, but when connectivity is back I run into
>>> the following assertion:
>>>
>>> qemu-system-x86_64: ./hw/ide/pci.h:65: bmdma_active_if: Assertion 
>>> `bmdma->bus->retry_unit != (uint8_t)-1' failed.
>> I would have to check the specs to see if this is allowed.
>>
>>> b) Call the aiocb with -ECANCELED and somehow (?) turn all the callbacks of 
>>> the outstanding IOs into NOPs.
>> This wouldn't be correct for write requests: We would tell the guest
>> that the request is cancelled when it's actually still in flight. At
>> some point it could still complete, however, and that's not expected by
>> the guest.
>>
>>> c) Follow the hint in the comment in bmdma_cmd_writeb (however this works 
>>> out):
>>>              * In the future we'll be able to safely cancel the I/O if the
>>>              * whole DMA operation will be submitted to disk with a single
>>>              * aio operation with preadv/pwritev.
>> Not sure how likely it is that cancelling that single AIOCB will
>> actually cancel the operation and not end up doing bdrv_drain_all()
>> internally instead because there is no good way of cancelling the
>> request.
> Maybe this is a solution? It seems to work for the CDROM-only case:
>
> diff --git a/block/io.c b/block/io.c
> index d4bc83b..475d44c 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2075,7 +2075,9 @@ static const AIOCBInfo bdrv_em_co_aiocb_info = {
>  static void bdrv_co_complete(BlockAIOCBCoroutine *acb)
>  {
>      if (!acb->need_bh) {
> -        acb->common.cb(acb->common.opaque, acb->req.error);
> +        if (acb->common.cb) {
> +            acb->common.cb(acb->common.opaque, acb->req.error);
> +        }
>          qemu_aio_unref(acb);
>      }
>  }
> diff --git a/hw/ide/pci.c b/hw/ide/pci.c
> index d31ff88..fecfa3e 100644
> --- a/hw/ide/pci.c
> +++ b/hw/ide/pci.c
> @@ -252,11 +252,17 @@ void bmdma_cmd_writeb(BMDMAState *bm, uint32_t val)
>               * whole DMA operation will be submitted to disk with a single
>               * aio operation with preadv/pwritev.
>               */
> -            if (bm->bus->dma->aiocb) {
> -                blk_drain_all();
> -                assert(bm->bus->dma->aiocb == NULL);
> -            }
> -            bm->status &= ~BM_STATUS_DMAING;
> +             if (bm->bus->dma->aiocb) {
> +                 bool read_write = false;
> +                 read_write |= bm->bus->master && !blk_is_read_only(bm->bus->master->conf.blk);
> +                 read_write |= bm->bus->slave && !blk_is_read_only(bm->bus->slave->conf.blk);
> +                 if (read_write) {
> +                     blk_drain_all();
> +                 } else {
> +                     bm->bus->dma->aiocb->cb = NULL;
> +                 }
> +             }
> +             bm->status &= ~BM_STATUS_DMAING;
>          } else {
>              bm->cur_addr = bm->addr;
>              if (!(bm->status & BM_STATUS_DMAING)) {
>
>
>> Kevin

I have meanwhile tested a little with this approach. It seems to work
perfectly. I can even interrupt a booting vServer (after the kernel is loaded)
and shut off the NFS server for a long time. The vServer's main thread stays
responsive. When the NFS server comes back, I see the previously cancelled AIO
requests complete and their callbacks being ignored. The vServer then starts
up perfectly, as the guest OS does all the retrying.

I also found that I do not have to do such voodoo to find out whether I am
dealing with a read/write medium. The AIOCB has a pointer to the BDS ;-)
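Something like this (an untested sketch, just to illustrate the idea):

    BlockAIOCB *acb = bm->bus->dma->aiocb;

    if (acb) {
        if (bdrv_is_read_only(acb->bs)) {
            /* read-only medium (CD-ROM): detach the callback and let the
             * request complete or fail in the background */
            acb->cb = NULL;
        } else {
            /* writable medium: keep the conservative blk_drain_all() */
            blk_drain_all();
        }
    }
    bm->status &= ~BM_STATUS_DMAING;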

If no one has objections, I will create a proper patch (or two) for this.

Peter

