On Wed, Jan 24, 2024 at 06:38:28PM +0100, Hanna Czenczek wrote:
> Hi,
> 
> When registering callbacks via aio_set_event_notifier_poll(), the
> io_poll_end() callback is only invoked when polling actually ends.  If
> the notifiers are removed while in a polling section, it is not called.
> Therefore, io_poll_start() is not necessarily followed up by
> io_poll_end().
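
To make the sequence above concrete, here is a minimal standalone sketch. It is a toy model, not QEMU's AioContext code: every toy_* name and both structs are hypothetical, and the only thing it assumes from the description is that the poll callbacks are skipped once the handler has been removed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a virtqueue; NOT QEMU's VirtQueue. */
typedef struct {
    bool notifications_enabled;
} ToyQueue;

/* Toy stand-in for a registered handler with poll callbacks. */
typedef struct {
    ToyQueue *q;
    void (*io_poll_start)(ToyQueue *q);
    void (*io_poll_end)(ToyQueue *q);
    bool attached;
} ToyHandler;

/* While polling, notifications are off; they come back on at poll end. */
void toy_poll_start(ToyQueue *q) { q->notifications_enabled = false; }
void toy_poll_end(ToyQueue *q)   { q->notifications_enabled = true; }

void toy_enter_polling(ToyHandler *h)
{
    if (h->attached && h->io_poll_start) {
        h->io_poll_start(h->q);
    }
}

void toy_leave_polling(ToyHandler *h)
{
    /* A removed handler is skipped, so io_poll_end() never runs for it. */
    if (h->attached && h->io_poll_end) {
        h->io_poll_end(h->q);
    }
}

void toy_detach(ToyHandler *h)
{
    h->attached = false;
    h->io_poll_start = NULL;
    h->io_poll_end = NULL;
}

/* Runs one polling section and returns the final notification state. */
bool toy_poll_cycle(bool detach_mid_poll)
{
    ToyQueue q = { .notifications_enabled = true };
    ToyHandler h = {
        .q = &q,
        .io_poll_start = toy_poll_start,
        .io_poll_end = toy_poll_end,
        .attached = true,
    };

    toy_enter_polling(&h);
    if (detach_mid_poll) {
        toy_detach(&h);
    }
    toy_leave_polling(&h);
    return q.notifications_enabled;
}
```

In this model toy_poll_cycle(false) leaves notifications enabled, while toy_poll_cycle(true) leaves them disabled, which is the imbalance described above.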
> 
> It is not entirely clear whether this is good or bad behavior.  On one
> hand, it may be unexpected to callers.  On the other, it may be
> counterproductive to call io_poll_end() when the polling section has not
> ended yet.
> 
> Right now, there is only one user of aio_set_event_notifier_poll(), which is
> virtio_queue_aio_attach_host_notifier().  It does not expect this
> behavior, which leads to virtqueue notifiers remaining disabled if
> virtio_queue_aio_detach_host_notifier() is called while polling.  That
> can happen e.g. through virtio_scsi_drained_begin() or
> virtio_blk_drained_begin() (through virtio_blk_data_plane_detach()).
> In such a case, the virtqueue may not be processed for a while, letting
> the guest driver hang.  This can be reproduced by repeatedly
> hot-plugging and -unplugging a virtio-scsi device with a scsi-hd disk,
> because the guest will try to enumerate the virtio-scsi device while
> we’re attaching the scsi-hd disk, which causes a drain, which can cause
> the virtio-scsi virtqueue to stall as described.
> 
> Stefan has suggested ensuring we always follow up io_poll_start() by
> io_poll_end():
> 
> https://lists.nongnu.org/archive/html/qemu-block/2023-12/msg00163.html
> 
> I prefer changing the caller instead, because I don’t think we actually
> want the virtqueue notifier to be force-enabled when removing our AIO
> notifiers.  So I believe we actually only want to take care to
> force-enable it when we re-attach the AIO notifiers, and to kick
> virtqueue processing once, in case we missed any events while the AIO
> notifiers were not attached.
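
The approach in the paragraph above can be sketched with another toy model. Again, none of this is QEMU code: ToyVq and the toy_vq_* functions are hypothetical, and the model only assumes what the text states, namely that detaching leaves the notification state alone and that re-attaching force-enables notifications and kicks processing once.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy virtqueue state; NOT QEMU's VirtQueue. */
typedef struct {
    bool notifications_enabled;
    bool attached;   /* are our (toy) AIO notifiers attached? */
    int pending;     /* buffers queued by the guest, not yet processed */
    int processed;
} ToyVq;

void toy_vq_process(ToyVq *vq)
{
    vq->processed += vq->pending;
    vq->pending = 0;
}

/* The guest queues one buffer; we only notice it if we are listening. */
void toy_vq_guest_add(ToyVq *vq)
{
    vq->pending++;
    if (vq->attached && vq->notifications_enabled) {
        toy_vq_process(vq);
    }
}

/* Detach does not touch the notification state. */
void toy_vq_detach(ToyVq *vq)
{
    vq->attached = false;
}

/* Attach force-enables notifications and optionally kicks once. */
void toy_vq_attach(ToyVq *vq, bool kick)
{
    vq->attached = true;
    vq->notifications_enabled = true;
    if (kick) {
        toy_vq_process(vq);   /* pick up anything missed while detached */
    }
}

/* One drain cycle; returns how many buffers are still unprocessed. */
int toy_drain_cycle(bool kick_on_attach)
{
    /* Start as if a polling section had disabled notifications. */
    ToyVq vq = { .notifications_enabled = false, .attached = true };

    toy_vq_detach(&vq);        /* drained_begin */
    toy_vq_guest_add(&vq);     /* event missed while detached */
    toy_vq_attach(&vq, kick_on_attach);   /* drained_end */
    return vq.pending;
}
```

Without the kick, the buffer queued during the drain stays pending until the guest notifies again; with the kick, it is processed as soon as the notifiers are re-attached.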
> 
> That is done by patch 2.  We have already discussed a prior version of
> it here:
> 
> https://lists.nongnu.org/archive/html/qemu-block/2024-01/msg00001.html
> 
> And compared to that, based on the discussion, there are some changes:
> 1. Used virtio_queue_notify() instead of virtio_queue_notify_vq(), as
>    suggested by Paolo, because it’s thread-safe
> 2. Moved virtio_queue_notify() into
>    virtio_queue_aio_attach_host_notifier*(), because we always want it
> 3. Dropped virtio_queue_set_notification(vq, 0) from
>    virtio_queue_aio_detach_host_notifier(): Paolo wasn’t sure whether
>    that was safe to do from any context.  We don’t really need to call
>    it anyway, so I just dropped it.
> 4. Added patch 1:
> 
> Patch 1 fixes virtio_scsi_drained_end() so it won’t attach polling
> notifiers for the event virtqueue.  That didn’t turn out to be an issue
> so far, but with patch 2, Fiona saw the virtqueue processing queue
> spinning in a loop as described in
> 38738f7dbbda90fbc161757b7f4be35b52205552 ("virtio-scsi: don't waste CPU
> polling the event virtqueue").
> 
> 
> Note that as of eaad0fe26050c227dc5dad63205835bac4912a51 ("scsi: only
> access SCSIDevice->requests from one thread") there’s a different
> problem when trying to reproduce the bug via hot-plugging and
> -unplugging a virtio-scsi device, specifically, when unplugging, qemu
> may crash with an assertion failure[1].  I don’t have a full fix for
> that yet, but in case you need a work-around for the specific case of
> virtio-scsi hot-plugging and -unplugging, you can use this patch:
> 
> https://czenczek.de/0001-DONTMERGE-Fix-crash-on-scsi-unplug.patch
> 
> 
> [1] https://lists.nongnu.org/archive/html/qemu-block/2024-01/msg00317.html
> 
> 
> Hanna Czenczek (2):
>   virtio-scsi: Attach event vq notifier with no_poll
>   virtio: Keep notifications disabled during drain
> 
>  include/block/aio.h   |  7 ++++++-
>  hw/scsi/virtio-scsi.c |  7 ++++++-
>  hw/virtio/virtio.c    | 42 ++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 54 insertions(+), 2 deletions(-)

This patch series also fixes RHEL-7356.

Buglink: https://issues.redhat.com/browse/RHEL-7356
Tested-by: Stefan Hajnoczi <stefa...@redhat.com>
