Re: [RFC PATCH] docs/interop: define STANDALONE protocol feature for vhost-user

2023-07-20 Thread Stefan Hajnoczi
On Thu, Jul 06, 2023 at 12:48:20PM -0400, Michael S. Tsirkin wrote:
> On Tue, Jul 04, 2023 at 01:36:00PM +0100, Alex Bennée wrote:
> > Currently QEMU has to know some details about the back-end to be able
> > to setup the guest. While various parts of the setup can be delegated
> > to the backend (for example config handling) this is a very piecemeal
> > approach.
> 
> > This patch suggests a new feature flag (VHOST_USER_PROTOCOL_F_STANDALONE)
> > which the back-end can advertise which allows a probe message to be
> > sent to get all the details QEMU needs to know in one message.
> 
> The reason we do piecemeal is that these existing pieces can be reused
> as others evolve or fall by the wayside.
> 
> For example, I can think of instances where you want to connect
> specifically to e.g. networking backend, and specify it
> on command line. Reasons could be many, e.g. for debugging,
> or to prevent connecting to wrong device on wrong channel
> (kind of like type safety).
> 
> What is the reason to have 1 message? startup latency?
> How about we allow pipelining several messages then?
> Will be easier.

This flag effectively says that the back-end is a full VIRTIO device
with a Device Status Register, Configuration Space, Virtqueues, the
device type, etc. This is different from previous vhost-user devices
which sometimes just offloaded certain virtqueues without providing the
full VIRTIO device (parts were emulated in the VMM).

So for example, a vhost-user-net device does not support the controlq.
Alex's "standalone" device is a mode where the vhost-user protocol is
used but the back-end must implement a full virtio-net device.
Standalone devices are like vDPA devices in this respect.

I think it is important to have a protocol feature bit that advertises
that this is a standalone device, since the semantics are different for
traditional vhost-user-net devices.

However, I think having a single message is inflexible and duplicates
existing vhost-user protocol messages like VHOST_USER_GET_QUEUE_NUM. I
would prefer VHOST_USER_GET_DEVICE_ID and other messages.

Stefan




Re: [virtio-dev] [RFC PATCH] docs/interop: define STANDALONE protocol feature for vhost-user

2023-07-20 Thread Stefan Hajnoczi
On Fri, Jul 07, 2023 at 12:27:39PM +0200, Stefano Garzarella wrote:
> On Tue, Jul 04, 2023 at 04:02:42PM +0100, Alex Bennée wrote:
> > 
> > Stefano Garzarella  writes:
> > 
> > > On Tue, Jul 04, 2023 at 01:36:00PM +0100, Alex Bennée wrote:
> > > > diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> > > > index 5a070adbc1..85b1b1583a 100644
> > > > --- a/docs/interop/vhost-user.rst
> > > > +++ b/docs/interop/vhost-user.rst
> > > > @@ -275,6 +275,21 @@ Inflight description
> > > > 
> > > > :queue size: a 16-bit size of virtqueues
> > > > 
> > > > +Backend specifications
> > > > +^^^^^^^^^^^^^^^^^^^^^^
> > > > +
> > > > ++-----------+-------------+------------+------------+
> > > > +| device id | config size |   min_vqs  |   max_vqs  |
> > > > ++-----------+-------------+------------+------------+
> > > > +
> > > > +:device id: a 32-bit value holding the VirtIO device ID
> > > > +
> > > > +:config size: a 32-bit value holding the config size (see ``VHOST_USER_GET_CONFIG``)
> > > > +
> > > > +:min_vqs: a 32-bit value holding the minimum number of vqs supported
> > > 
> > > Why do we need the minimum?
> > 
> > We need to know the minimum number because some devices have fixed VQs
> > that must be present.
> 
> But does QEMU need to know this?
> 
> Or is it okay that the driver will then fail in the guest if there
> are not the right number of queues?

I don't understand why min_vqs is needed either. It's not the
front-end's job to ensure that the device will be used properly. A
spec-compliant driver will work with a spec-compliant device, so it's
not clear why the front-end needs this information.

Stefan




Re: [RFC PATCH] docs/interop: define STANDALONE protocol feature for vhost-user

2023-07-20 Thread Stefan Hajnoczi
On Tue, Jul 04, 2023 at 01:36:00PM +0100, Alex Bennée wrote:
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index c4e0cbd702..28b021d5d3 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -202,6 +202,13 @@ typedef struct VhostUserInflight {
>  uint16_t queue_size;
>  } VhostUserInflight;
>  
> +typedef struct VhostUserBackendSpecs {
> +uint32_t device_id;
> +uint32_t config_size;
> +uint32_t min_vqs;

You already answered my question about min_vqs in another sub-thread.
I'll continue there. Please ignore my question.

Stefan




Re: [RFC PATCH] docs/interop: define STANDALONE protocol feature for vhost-user

2023-07-20 Thread Stefan Hajnoczi
On Tue, Jul 04, 2023 at 01:36:00PM +0100, Alex Bennée wrote:
> Currently QEMU has to know some details about the back-end to be able
> to setup the guest. While various parts of the setup can be delegated
> to the backend (for example config handling) this is a very piecemeal
> approach.
> 
> This patch suggests a new feature flag (VHOST_USER_PROTOCOL_F_STANDALONE)
> which the back-end can advertise which allows a probe message to be
> sent to get all the details QEMU needs to know in one message.
> 
> Signed-off-by: Alex Bennée 
> 
> ---
> Initial RFC for discussion. I intend to prototype this work with QEMU
> and one of the rust-vmm vhost-user daemons.
> ---
>  docs/interop/vhost-user.rst | 37 +
>  hw/virtio/vhost-user.c  |  8 
>  2 files changed, 45 insertions(+)
> 
> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> index 5a070adbc1..85b1b1583a 100644
> --- a/docs/interop/vhost-user.rst
> +++ b/docs/interop/vhost-user.rst
> @@ -275,6 +275,21 @@ Inflight description
>  
>  :queue size: a 16-bit size of virtqueues
>  
> +Backend specifications
> +^^^^^^^^^^^^^^^^^^^^^^
> +
> ++-----------+-------------+------------+------------+
> +| device id | config size |   min_vqs  |   max_vqs  |
> ++-----------+-------------+------------+------------+
> +
> +:device id: a 32-bit value holding the VirtIO device ID
> +
> +:config size: a 32-bit value holding the config size (see ``VHOST_USER_GET_CONFIG``)
> +
> +:min_vqs: a 32-bit value holding the minimum number of vqs supported

What is the purpose of min_vqs? I'm not sure why the front-end needs to
know this.




Re: [PATCH v5 0/3] hw/ufs: Add Universal Flash Storage (UFS) support

2023-07-20 Thread Stefan Hajnoczi
Hi,
I'm ready to merge this but encountered a bug when testing:

  $ qemu-system-x86_64 --device ufs --device ufs-lu
  Segmentation fault (core dumped)

Please ensure there is an error message like with SCSI disks:

  $ qemu-system-x86_64 --device virtio-scsi-pci --device scsi-hd
  qemu-system-x86_64: --device scsi-hd: drive property not set

Thanks,
Stefan




Re: [PATCH 0/3] Support message-based DMA in vfio-user server

2023-07-20 Thread Stefan Hajnoczi
On Tue, Jul 04, 2023 at 01:06:24AM -0700, Mattias Nissler wrote:
> This series adds basic support for message-based DMA in qemu's vfio-user
> server. This is useful for cases where the client does not provide file
> descriptors for accessing system memory via memory mappings. My motivating use
> case is to hook up device models as PCIe endpoints to a hardware design. This
> works by bridging the PCIe transaction layer to vfio-user, and the endpoint
> does not access memory directly, but sends memory request TLPs to the
> hardware design in order to perform DMA.
> 
> Note that in addition to the 3 commits included, we also need a
> subprojects/libvfio-user roll to bring in this bugfix:
> https://github.com/nutanix/libvfio-user/commit/bb308a2e8ee9486a4c8b53d8d773f7c8faaeba08
> Stefan, can I ask you to kindly update the
> https://gitlab.com/qemu-project/libvfio-user mirror? I'll be happy to include
> an update to subprojects/libvfio-user.wrap in this series.

Done:
https://gitlab.com/qemu-project/libvfio-user/-/commits/master

Repository mirroring is automated now, so new upstream commits will
appear in the QEMU mirror repository from now on.

> 
> Finally, there is some more work required on top of this series to get
> message-based DMA to really work well:
> 
> * libvfio-user has a long-standing issue where socket communication gets
>   messed up when messages are sent from both ends at the same time. See
>   https://github.com/nutanix/libvfio-user/issues/279 for more details. I've
>   been engaging there and plan to contribute a fix.
> 
> * qemu currently breaks down DMA accesses into chunks of size 8 bytes at
>   maximum, each of which will be handled in a separate vfio-user DMA request
>   message. This is quite terrible for large DMA accesses, such as when nvme
>   reads and writes page-sized blocks for example. Thus, I would like to
>   improve qemu to be able to perform larger accesses, at least for indirect
>   memory regions. I have something working locally, but since this will
>   likely result in more involved surgery and discussion, I am leaving this
>   to be addressed in a separate patch.
> 
> Mattias Nissler (3):
>   softmmu: Support concurrent bounce buffers
>   softmmu: Remove DMA unmap notification callback
>   vfio-user: Message-based DMA support
> 
>  hw/remote/vfio-user-obj.c |  62 --
>  softmmu/dma-helpers.c |  28 
>  softmmu/physmem.c | 131 --
>  3 files changed, 83 insertions(+), 138 deletions(-)

Sorry for the late review. I was on vacation and am catching up on
emails.

Paolo worked on the QEMU memory API and can give input on how to make
this efficient for large DMA accesses. There is a chance that memory
dispatch with larger sizes will be needed for ENQCMD CPU instruction
emulation too.

Stefan




Re: [PATCH 3/3] vfio-user: Message-based DMA support

2023-07-20 Thread Stefan Hajnoczi
On Tue, Jul 04, 2023 at 01:06:27AM -0700, Mattias Nissler wrote:
> Wire up support for DMA for the case where the vfio-user client does not
> provide mmap()-able file descriptors, but DMA requests must be performed
> via the VFIO-user protocol. This installs an indirect memory region,
> which already works for pci_dma_{read,write}, and pci_dma_map works
> thanks to the existing DMA bounce buffering support.
> 
> Note that while simple scenarios work with this patch, there's a known
> race condition in libvfio-user that will mess up the communication
> channel: https://github.com/nutanix/libvfio-user/issues/279 I intend to
> contribute a fix for this problem, see discussion on the github issue
> for more details.
> 
> Signed-off-by: Mattias Nissler 
> ---
>  hw/remote/vfio-user-obj.c | 62 ++-
>  1 file changed, 55 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index 8b10c32a3c..9799580c77 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -300,6 +300,53 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
>  return count;
>  }
>  
> +static MemTxResult vfu_dma_read(void *opaque, hwaddr addr, uint64_t *val,
> +unsigned size, MemTxAttrs attrs)
> +{
> +MemoryRegion *region = opaque;
> +VfuObject *o = VFU_OBJECT(region->owner);
> +
> +dma_sg_t *sg = alloca(dma_sg_size());
> +vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
> +if (vfu_addr_to_sgl(o->vfu_ctx, vfu_addr, size, sg, 1, PROT_READ) < 0 ||
> +vfu_sgl_read(o->vfu_ctx, sg, 1, val) != 0) {

Does this work on big-endian host CPUs? It looks like reading 0x12345678
into uint64_t val would result in *val = 0x1234567800000000 instead of
0x12345678.

> +return MEMTX_ERROR;
> +}
> +
> +return MEMTX_OK;
> +}
> +
> +static MemTxResult vfu_dma_write(void *opaque, hwaddr addr, uint64_t val,
> + unsigned size, MemTxAttrs attrs)
> +{
> +MemoryRegion *region = opaque;
> +VfuObject *o = VFU_OBJECT(region->owner);
> +
> +dma_sg_t *sg = alloca(dma_sg_size());
> +vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
> +if (vfu_addr_to_sgl(o->vfu_ctx, vfu_addr, size, sg, 1, PROT_WRITE) < 0 ||
> +vfu_sgl_write(o->vfu_ctx, sg, 1, &val) != 0) {

Same potential endianness issue here.

Stefan




Re: [PATCH 1/3] softmmu: Support concurrent bounce buffers

2023-07-20 Thread Stefan Hajnoczi
On Tue, Jul 04, 2023 at 01:06:25AM -0700, Mattias Nissler wrote:
> +if (qatomic_dec_fetch(&bounce_buffers_in_use) == 1) {
> +cpu_notify_map_clients();
>  }

About my comment regarding removing this API: I see the next patch does
that.

Stefan




Re: [PATCH 2/3] softmmu: Remove DMA unmap notification callback

2023-07-20 Thread Stefan Hajnoczi
On Tue, Jul 04, 2023 at 01:06:26AM -0700, Mattias Nissler wrote:
> According to old commit messages, this was introduced to retry a DMA
> operation at a later point in case the single bounce buffer is found to
> be busy. This was never used widely - only the dma-helpers code made use
> of it, but there are other device models that use multiple DMA mappings
> (concurrently) and just failed.
> 
> After the improvement to support multiple concurrent bounce buffers,
> the condition the notification callback allowed to work around no
> longer exists, so we can just remove the logic and simplify the code.
> 
> Signed-off-by: Mattias Nissler 
> ---
>  softmmu/dma-helpers.c | 28 -
>  softmmu/physmem.c | 71 ---
>  2 files changed, 99 deletions(-)

I'm not sure if it will be possible to remove this once a limit is
placed on bounce buffer space.

> 
> diff --git a/softmmu/dma-helpers.c b/softmmu/dma-helpers.c
> index 2463964805..d05d226f11 100644
> --- a/softmmu/dma-helpers.c
> +++ b/softmmu/dma-helpers.c
> @@ -68,23 +68,10 @@ typedef struct {
>  int sg_cur_index;
>  dma_addr_t sg_cur_byte;
>  QEMUIOVector iov;
> -QEMUBH *bh;
>  DMAIOFunc *io_func;
>  void *io_func_opaque;
>  } DMAAIOCB;
>  
> -static void dma_blk_cb(void *opaque, int ret);
> -
> -static void reschedule_dma(void *opaque)
> -{
> -DMAAIOCB *dbs = (DMAAIOCB *)opaque;
> -
> -assert(!dbs->acb && dbs->bh);
> -qemu_bh_delete(dbs->bh);
> -dbs->bh = NULL;
> -dma_blk_cb(dbs, 0);
> -}
> -
>  static void dma_blk_unmap(DMAAIOCB *dbs)
>  {
>  int i;
> @@ -101,7 +88,6 @@ static void dma_complete(DMAAIOCB *dbs, int ret)
>  {
>  trace_dma_complete(dbs, ret, dbs->common.cb);
>  
> -assert(!dbs->acb && !dbs->bh);
>  dma_blk_unmap(dbs);
>  if (dbs->common.cb) {
>  dbs->common.cb(dbs->common.opaque, ret);
> @@ -164,13 +150,6 @@ static void dma_blk_cb(void *opaque, int ret)
>  }
>  }
>  
> -if (dbs->iov.size == 0) {
> -trace_dma_map_wait(dbs);
> -dbs->bh = aio_bh_new(ctx, reschedule_dma, dbs);
> -cpu_register_map_client(dbs->bh);
> -goto out;
> -}
> -
>  if (!QEMU_IS_ALIGNED(dbs->iov.size, dbs->align)) {
>  qemu_iovec_discard_back(&dbs->iov,
>  QEMU_ALIGN_DOWN(dbs->iov.size, dbs->align));
> @@ -189,18 +168,12 @@ static void dma_aio_cancel(BlockAIOCB *acb)
>  
>  trace_dma_aio_cancel(dbs);
>  
> -assert(!(dbs->acb && dbs->bh));
>  if (dbs->acb) {
>  /* This will invoke dma_blk_cb.  */
>  blk_aio_cancel_async(dbs->acb);
>  return;
>  }
>  
> -if (dbs->bh) {
> -cpu_unregister_map_client(dbs->bh);
> -qemu_bh_delete(dbs->bh);
> -dbs->bh = NULL;
> -}
>  if (dbs->common.cb) {
>  dbs->common.cb(dbs->common.opaque, -ECANCELED);
>  }
> @@ -239,7 +212,6 @@ BlockAIOCB *dma_blk_io(AioContext *ctx,
>  dbs->dir = dir;
>  dbs->io_func = io_func;
>  dbs->io_func_opaque = io_func_opaque;
> -dbs->bh = NULL;
>  qemu_iovec_init(&dbs->iov, sg->nsg);
>  dma_blk_cb(dbs, 0);
>  return &dbs->common;
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index 56130b5a1d..2b4123c127 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -2908,49 +2908,6 @@ typedef struct {
>  uint8_t buffer[];
>  } BounceBuffer;
>  
> -static size_t bounce_buffers_in_use;
> -
> -typedef struct MapClient {
> -QEMUBH *bh;
> -QLIST_ENTRY(MapClient) link;
> -} MapClient;
> -
> -QemuMutex map_client_list_lock;
> -static QLIST_HEAD(, MapClient) map_client_list
> -= QLIST_HEAD_INITIALIZER(map_client_list);
> -
> -static void cpu_unregister_map_client_do(MapClient *client)
> -{
> -QLIST_REMOVE(client, link);
> -g_free(client);
> -}
> -
> -static void cpu_notify_map_clients_locked(void)
> -{
> -MapClient *client;
> -
> -while (!QLIST_EMPTY(&map_client_list)) {
> -client = QLIST_FIRST(&map_client_list);
> -qemu_bh_schedule(client->bh);
> -cpu_unregister_map_client_do(client);
> -}
> -}
> -
> -void cpu_register_map_client(QEMUBH *bh)
> -{
> -MapClient *client = g_malloc(sizeof(*client));
> -
> -qemu_mutex_lock(&map_client_list_lock);
> -client->bh = bh;
> -QLIST_INSERT_HEAD(&map_client_list, client, link);
> -/* Write map_client_list before reading in_use.  */
> -smp_mb();
> -if (qatomic_read(&bounce_buffers_in_use)) {
> -cpu_notify_map_clients_locked();
> -}
> -qemu_mutex_unlock(&map_client_list_lock);
> -}
> -
>  void cpu_exec_init_all(void)
>  {
>  qemu_mutex_init(&ram_list.mutex);
> @@ -2964,28 +2921,6 @@ void cpu_exec_init_all(void)
>  finalize_target_page_bits();
>  io_mem_init();
>  memory_map_init();
> -qemu_mutex_init(&map_client_list_lock);
> -}
> -
> -void cpu_unregister_map_client(QEMUBH *bh)
> -{
> -MapClient *client;
> -
> -qemu_mutex_lock(&map_client_list_lock);
> -QLIST_FOREACH(client, 

Re: [PATCH 1/3] softmmu: Support concurrent bounce buffers

2023-07-20 Thread Stefan Hajnoczi
On Tue, Jul 04, 2023 at 01:06:25AM -0700, Mattias Nissler wrote:
> It is not uncommon for device models to request mapping of several DMA
> regions at the same time. An example is igb (and probably other net
> devices as well) when a packet is spread across multiple descriptors.
> 
> In order to support this when indirect DMA is used, as is the case when
> running the device model in a vfio-server process without mmap()-ed DMA,
> this change allocates DMA bounce buffers dynamically instead of
> supporting only a single buffer.
> 
> Signed-off-by: Mattias Nissler 
> ---
>  softmmu/physmem.c | 74 ++-
>  1 file changed, 35 insertions(+), 39 deletions(-)

Is this a functional change or purely a performance optimization? If
it's a performance optimization, please include benchmark results to
justify this change.

QEMU memory allocations must be bounded so that an untrusted guest
cannot cause QEMU to exhaust host memory. There must be a limit to the
amount of bounce buffer memory.

> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index bda475a719..56130b5a1d 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -2904,13 +2904,11 @@ void cpu_flush_icache_range(hwaddr start, hwaddr len)
>  
>  typedef struct {
>  MemoryRegion *mr;
> -void *buffer;
>  hwaddr addr;
> -hwaddr len;
> -bool in_use;
> +uint8_t buffer[];
>  } BounceBuffer;
>  
> -static BounceBuffer bounce;
> +static size_t bounce_buffers_in_use;
>  
>  typedef struct MapClient {
>  QEMUBH *bh;
> @@ -2947,7 +2945,7 @@ void cpu_register_map_client(QEMUBH *bh)
>  QLIST_INSERT_HEAD(&map_client_list, client, link);
>  /* Write map_client_list before reading in_use.  */
>  smp_mb();
> -if (!qatomic_read(&bounce.in_use)) {
> +if (qatomic_read(&bounce_buffers_in_use)) {
>  cpu_notify_map_clients_locked();
>  }
>  qemu_mutex_unlock(&map_client_list_lock);
> @@ -3076,31 +3074,24 @@ void *address_space_map(AddressSpace *as,
>  RCU_READ_LOCK_GUARD();
>  fv = address_space_to_flatview(as);
>  mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
> +memory_region_ref(mr);
>  
>  if (!memory_access_is_direct(mr, is_write)) {
> -if (qatomic_xchg(&bounce.in_use, true)) {
> -*plen = 0;
> -return NULL;
> -}
> -/* Avoid unbounded allocations */
> -l = MIN(l, TARGET_PAGE_SIZE);
> -bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, l);
> -bounce.addr = addr;
> -bounce.len = l;
> -
> -memory_region_ref(mr);
> -bounce.mr = mr;
> +qatomic_inc_fetch(&bounce_buffers_in_use);
> +
> +BounceBuffer *bounce = g_malloc(l + sizeof(BounceBuffer));
> +bounce->addr = addr;
> +bounce->mr = mr;
> +
>  if (!is_write) {
>  flatview_read(fv, addr, MEMTXATTRS_UNSPECIFIED,
> -   bounce.buffer, l);
> +  bounce->buffer, l);
>  }
>  
>  *plen = l;
> -return bounce.buffer;
> +return bounce->buffer;

Bounce buffer allocation always succeeds now. Can the
cpu_notify_map_clients*() be removed now that no one is waiting for
bounce buffers anymore?

>  }
>  
> -
> -memory_region_ref(mr);
>  *plen = flatview_extend_translation(fv, addr, len, mr, xlat,
>  l, is_write, attrs);
>  fuzz_dma_read_cb(addr, *plen, mr);
> @@ -3114,31 +3105,36 @@ void *address_space_map(AddressSpace *as,
>  void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
>   bool is_write, hwaddr access_len)
>  {
> -if (buffer != bounce.buffer) {
> -MemoryRegion *mr;
> -ram_addr_t addr1;
> +MemoryRegion *mr;
> +ram_addr_t addr1;
> +
> +mr = memory_region_from_host(buffer, &addr1);
> +if (mr == NULL) {
> +/*
> + * Must be a bounce buffer (unless the caller passed a pointer which
> + * wasn't returned by address_space_map, which is illegal).
> + */
> +BounceBuffer *bounce = container_of(buffer, BounceBuffer, buffer);
>  
> -mr = memory_region_from_host(buffer, &addr1);
> -assert(mr != NULL);
>  if (is_write) {
> -invalidate_and_set_dirty(mr, addr1, access_len);
> +address_space_write(as, bounce->addr, MEMTXATTRS_UNSPECIFIED,
> +bounce->buffer, access_len);
>  }
> -if (xen_enabled()) {
> -xen_invalidate_map_cache_entry(buffer);
> +memory_region_unref(bounce->mr);
> +g_free(bounce);
> +
> +if (qatomic_dec_fetch(&bounce_buffers_in_use) == 1) {
> +cpu_notify_map_clients();
>  }
> -memory_region_unref(mr);
>  return;
>  }
> +
> +if (xen_enabled()) {
> +xen_invalidate_map_cache_entry(buffer);
> +}
>  if (is_write) {
> -address_space_write(as, bounce.addr, MEMTXATTRS_UNSPECIFIED,

Re: [PATCH 6/6] vhost-user: Have reset_status fall back to reset

2023-07-20 Thread Stefan Hajnoczi
On Wed, Jul 19, 2023 at 04:27:58PM +0200, Hanna Czenczek wrote:
> On 19.07.23 16:11, Hanna Czenczek wrote:
> > On 18.07.23 17:10, Stefan Hajnoczi wrote:
> > > On Tue, Jul 11, 2023 at 05:52:28PM +0200, Hanna Czenczek wrote:
> > > > The only user of vhost_user_reset_status() is vhost_dev_stop(), which
> > > > only uses it as a fall-back to stop the back-end if it does not support
> > > > SUSPEND.  However, vhost-user's implementation is a no-op unless the
> > > > back-end supports SET_STATUS.
> > > > 
> > > > vhost-vdpa's implementation instead just calls
> > > > vhost_vdpa_reset_device(), implying that it's OK to fully reset the
> > > > device if SET_STATUS is not supported.
> > > > 
> > > > To be fair, vhost_vdpa_reset_device() does nothing but to set
> > > > the status
> > > > to zero.  However, that may well be because vhost-vdpa has no method
> > > > besides this to reset a device.  In contrast, vhost-user has
> > > > RESET_DEVICE and a RESET_OWNER, which can be used instead.
> > > > 
> > > > While it is not entirely clear from documentation or git logs, from
> > > > discussions and the order of vhost-user protocol features, it
> > > > appears to
> > > > me as if RESET_OWNER originally had no real meaning for vhost-user, and
> > > > was thus used to signal a device reset to the back-end.  Then,
> > > > RESET_DEVICE was introduced, to have a well-defined dedicated reset
> > > > command.  Finally, vhost-user received full STATUS support, including
> > > > SET_STATUS, so setting the device status to 0 is now the preferred way
> > > > of resetting a device.  Still, RESET_DEVICE and RESET_OWNER should
> > > > remain valid as fall-backs.
> > > > 
> > > > Therefore, have vhost_user_reset_status() fall back to
> > > > vhost_user_reset_device() if the back-end has no STATUS support.
> > > > 
> > > > Signed-off-by: Hanna Czenczek 
> > > > ---
> > > >   hw/virtio/vhost-user.c | 2 ++
> > > >   1 file changed, 2 insertions(+)
> > > > 
> > > > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > > > index 4507de5a92..53a881ec2a 100644
> > > > --- a/hw/virtio/vhost-user.c
> > > > +++ b/hw/virtio/vhost-user.c
> > > > @@ -2833,6 +2833,8 @@ static void vhost_user_reset_status(struct vhost_dev *dev)
> > > >   if (virtio_has_feature(dev->protocol_features,
> > > >  VHOST_USER_PROTOCOL_F_STATUS)) {
> > > >   vhost_user_set_status(dev, 0);
> > > > +    } else {
> > > > +    vhost_user_reset_device(dev);
> > > >   }
> > > >   }
> > > Did you check whether DPDK treats setting the status to 0 as equivalent
> > > to RESET_DEVICE?
> > 
> > If it doesn’t, what’s even the point of using reset_status?
> 
> Sorry, I’m being unclear, and I think this may be important because it ties
> into the question from patch 1, what qemu is even trying to do by running
> SET_STATUS(0) vhost_dev_stop(), so here’s what gave me the impression that
> SET_STATUS(0) and RESET_DEVICE should be equivalent:
> 
> vhost-vdpa.c runs SET_STATUS(0) in a function called
> vhost_vdpa_reset_device().  This is one thing that gave me the impression
> that this is about an actual full reset.
> 
> Another is the whole discussion that we’ve had.  vhost_dev_stop() does not
> call a `vhost_reset_device()` function, it calls `vhost_reset_status()`. 
> Still, we were always talking about resetting the device.

There is some hacky stuff with struct vhost_dev's vq_index_end and
multi-queue devices. I think it's because a multi-queue vhost-net device
consists of many vhost_devs and NetClientStates, so certain vhost
operations are skipped unless this is the "first" or "last" vhost_dev
from a large aggregate vhost-net device. That might be responsible for
part of the weirdness.

> 
> It doesn’t make sense to me that vDPA would provide no function to fully
> reset a device, while vhost-user does.  Being able to reset a device sounds
> vital to me.  This also gave me the impression that SET_STATUS(0) on vDPA at
> least is functionally equivalent to a full device reset.
> 
> 
> Maybe SET_STATUS(0) does mean a full device reset on vDPA, but not on
> vhost-user.  That would be a real shame, so I assumed this would not be the
> case; that SET_STATUS(0) does the same thing on both protocols.

Yes, exactly. It has the r

Re: [PATCH v2 1/4] vhost-user.rst: Migrating back-end-internal state

2023-07-20 Thread Stefan Hajnoczi
On Wed, 19 Jul 2023 at 12:35, Hanna Czenczek  wrote:
>
> On 18.07.23 17:57, Stefan Hajnoczi wrote:
> > On Wed, Jul 12, 2023 at 01:16:59PM +0200, Hanna Czenczek wrote:
> >> For vhost-user devices, qemu can migrate the virtio state, but not the
> >> back-end's internal state.  To do so, we need to be able to transfer
> >> this internal state between front-end (qemu) and back-end.
> >>
> >> At this point, this new feature is added for the purpose of virtio-fs
> >> migration.  Because virtiofsd's internal state will not be too large, we
> >> believe it is best to transfer it as a single binary blob after the
> >> streaming phase.
> >>
> >> These are the additions to the protocol:
> >> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE
> > It's not 100% clear whether "migratory" is related to live migration or
> > something else. I don't like the name :P.
> >
> > The name "VHOST_USER_PROTOCOL_F_DEVICE_STATE" would be more obviously
> > associated with SET_DEVICE_STATE_FD and CHECK_DEVICE_STATE than
> > "MIGRATORY_STATE".
>
> Sure, sure.  Naming things is hard. :)
>
> >> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> >>over which to transfer the state.
> > Does it need to be a pipe or can it be another type of file (e.g. UNIX
> > domain socket)?
>
> It’s difficult to say, honestly.  It can be anything, but I’m not sure
> how to describe that in this specification.
>
> It must be any FD into which the state sender can write the state and
> signal end of state by closing its FD; and from which the state receiver
> can read the state, terminated by seeing an EOF.  As you say, that
> doesn’t mean that the sender has to write the state into the FD, nor
> that the receiver has to read it (into memory), it’s just that either
> side must ensure the other can do it.
>
> > In the future the fd may become bi-directional. Pipes are
> > uni-directional on Linux.
> >
> > I suggest calling it a "file descriptor" and not mentioning "pipes"
> > explicitly.
>
> Works here in the commit message, but in the document, we need to be
> explicit about the requirements for this FD, i.e. the way in which
> front-end and back-end can expect the FD to be usable.  Calling it a
> “pipe” was a simple way, but you’re right, it’s more general than that.
>
> >> - CHECK_DEVICE_STATE: After the state has been transferred through the
> >>pipe, the front-end invokes this function to verify success.  There is
> >>no in-band way (through the pipe) to indicate failure, so we need to
> >>check explicitly.
> >>
> >> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> >> (which includes establishing the direction of transfer and migration
> >> phase), the sending side writes its data into the pipe, and the reading
> >> side reads it until it sees an EOF.  Then, the front-end will check for
> >> success via CHECK_DEVICE_STATE, which on the destination side includes
> >> checking for integrity (i.e. errors during deserialization).
> >>
> >> Suggested-by: Stefan Hajnoczi 
> >> Signed-off-by: Hanna Czenczek 
> >> ---
> >>   docs/interop/vhost-user.rst | 87 +
> >>   1 file changed, 87 insertions(+)
> >>
> >> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> >> index ac6be34c4c..c98dfeca25 100644
> >> --- a/docs/interop/vhost-user.rst
> >> +++ b/docs/interop/vhost-user.rst
> >> @@ -334,6 +334,7 @@ in the ancillary data:
> >>   * ``VHOST_USER_SET_VRING_ERR``
> >>   * ``VHOST_USER_SET_BACKEND_REQ_FD`` (previous name ``VHOST_USER_SET_SLAVE_REQ_FD``)
> >>   * ``VHOST_USER_SET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)
> >> +* ``VHOST_USER_SET_DEVICE_STATE_FD``
> >>
> >>   If *front-end* is unable to send the full message or receives a wrong
> >>   reply it will close the connection. An optional reconnection mechanism
> >> @@ -497,6 +498,44 @@ it performs WAKE ioctl's on the userfaultfd to wake the stalled
> >>   back-end.  The front-end indicates support for this via the
> >>   ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` feature.
> >>
> >> +.. _migrating_backend_state:
> >> +
> >> +Migrating back-end state
> >> +
> >> +
> >> +If the back-end has internal state that is to be sent from

Re: [PATCH 5/6] vhost-vdpa: Match vhost-user's status reset

2023-07-19 Thread Stefan Hajnoczi
On Wed, 19 Jul 2023 at 10:10, Hanna Czenczek  wrote:
>
> On 18.07.23 16:50, Stefan Hajnoczi wrote:
> > On Tue, Jul 11, 2023 at 05:52:27PM +0200, Hanna Czenczek wrote:
> >> vhost-vdpa and vhost-user differ in how they reset the status in their
> >> respective vhost_reset_status implementations: vhost-vdpa zeroes it,
> >> then re-adds the S_ACKNOWLEDGE and S_DRIVER config bits.  S_DRIVER_OK is
> >> then set in vhost_vdpa_dev_start().
> >>
> >> vhost-user in contrast just zeroes the status, and does not re-add any
> >> config bits until vhost_user_dev_start() (where it does re-add all of
> >> S_ACKNOWLEDGE, S_DRIVER, and S_DRIVER_OK).
> >>
> >> There is no documentation for vhost_reset_status, but its only caller is
> >> vhost_dev_stop().  So apparently, the device is to be stopped after
> >> vhost_reset_status, and therefore it makes more sense to keep the status
> >> field fully cleared until the back-end is re-started, which is how
> >> vhost-user does it.  Make vhost-vdpa do the same -- if nothing else it's
> >> confusing to have both vhost implementations handle this differently.
> >>
> >> Signed-off-by: Hanna Czenczek 
> >> ---
> >>   hw/virtio/vhost-vdpa.c | 6 +++---
> >>   1 file changed, 3 insertions(+), 3 deletions(-)
> > Hi Hanna,
> > The VIRTIO spec lists the Device Initialization sequence including the
> > bits set in the Device Status Register here:
> > https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-1070001
> >
> > ACKNOWLEDGE and DRIVER must be set before FEATURES_OK. DRIVER_OK is set
> > after FEATURES_OK.
> >
> > The driver may read the Device Configuration Space once ACKNOWLEDGE and
> > DRIVER are set.
> >
> > QEMU's vhost code should follow this sequence (especially for vDPA where
> > full VIRTIO devices are implemented).
> >
> > vhost-user is not faithful to the VIRTIO spec here. That's probably due
> > to the fact that vhost-user didn't have the concept of the Device Status
> > Register until recently and back-ends mostly ignore it.
> >
> > Please do the opposite of this patch: bring vhost-user in line with the
> > VIRTIO specification so that the Device Initialization sequence is
> > followed correctly. I think vhost-vdpa already does the right thing.
>
> Hm.  This sounds all very good, but what leaves me lost is the fact that
> we never actually expose the status field to the guest, as far as I can
> see.  We have no set_status callback, and as written in the commit
> message, the only caller of reset_status is vhost_dev_stop().  So the
> status field seems completely artificial in vhost right now.  That is
> why I’m wondering what the flags even really mean.

vhost (including vDPA and vhost-user) is not a 100% passthrough
solution. The VMM emulates a VIRTIO device (e.g. virtio-fs-pci) that
has some separate state from the vhost back-end, including the Device
Status Register. This is analogous to how passthrough PCI devices
still have emulated PCI registers that are not passed through to the
physical PCI device.

However, just because the back-end (vDPA, and now vhost-user with the
SET_STATUS message) is not directly exposed to the guest does not mean
it should diverge from the VIRTIO specification for no reason.

> Another point I made in the commit message is that it is strange that we
> reset the status to 0, and then add the ACKNOWLEDGE and DRIVER while the
> VM is still stopped.  It doesn’t make sense to me to set these flags
> while the guest driver is not operative.

While there is no harm in setting those bits, I agree that leaving the
Device Status Register at 0 while the VM is stopped would be nicer.

> If what you’re saying is that we must set FEATURES_OK only after
> ACKNOWLEDGE and DRIVER, wouldn’t it be still better to set all of these
> flags only in vhost_*_dev_start(), but do it in two separate SET_STATUS
> calls?

The device initialization sequence could be put into vhost_dev_start():
1. ACKNOWLEDGE | DRIVER
2. FEATURES_OK via vhost_dev_set_features()
3. DRIVER_OK via ->vhost_dev_start()

But note that the ->vhost_dev_start() callback is too late to set
ACKNOWLEDGE | DRIVER because feature negotiation happens earlier.
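
To illustrate, the required ordering could be sketched like this (a
plain-C sketch, not QEMU's actual API; set_status() stands in for the
vhost back-end call, and the constants mirror the VIRTIO status bits):

```c
#include <assert.h>
#include <stdint.h>

/* Device Status Register bits from the VIRTIO 1.2 spec (section 2.1). */
#define VIRTIO_CONFIG_S_ACKNOWLEDGE  1
#define VIRTIO_CONFIG_S_DRIVER       2
#define VIRTIO_CONFIG_S_DRIVER_OK    4
#define VIRTIO_CONFIG_S_FEATURES_OK  8

/* Stand-in for the real vhost_set_status back-end call. */
static uint8_t status;

static void set_status(uint8_t bits)
{
    status |= bits;
}

static void device_initialization(void)
{
    status = 0;                                /* reset */
    set_status(VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER);
    /* ... feature negotiation happens here ... */
    set_status(VIRTIO_CONFIG_S_FEATURES_OK);   /* only after ACK | DRIVER */
    /* ... virtqueue setup happens here ... */
    set_status(VIRTIO_CONFIG_S_DRIVER_OK);     /* only after FEATURES_OK */
}
```

The point is only the ordering: FEATURES_OK must not be set before
ACKNOWLEDGE and DRIVER, and DRIVER_OK must come last.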

> (You mentioned the configuration space – is that accessed while between
> vhost_dev_stop and vhost_dev_start?)

I don't think so.

>
> Hanna
>
> >> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> >> index f7fd19a203..0cde8b40de 100644
> >> --- a/hw/virtio/vhost-vdpa.c
> >> +++ b/hw/virtio/vhost-vdpa.c
> >> @@ -1294,8 +1294,6 @@ static void vhost_vdpa_reset_status(struct vhost_dev 

Re: [PATCH] vhost-user.rst: Clarify enabling/disabling vrings

2023-07-19 Thread Stefan Hajnoczi
On Wed, 19 Jul 2023 at 09:34, Hanna Czenczek  wrote:
>
> On 18.07.23 17:26, Stefan Hajnoczi wrote:
> > On Wed, Jul 12, 2023 at 11:17:04AM +0200, Hanna Czenczek wrote:
> >> Currently, the vhost-user documentation says that rings are to be
> >> initialized in a disabled state when VHOST_USER_F_PROTOCOL_FEATURES is
> >> negotiated.  However, by the time of feature negotiation, all rings have
> >> already been initialized, so it is not entirely clear what this means.
> >>
> >> At least the vhost-user-backend Rust crate's implementation interpreted
> >> it to mean that whenever this feature is negotiated, all rings are to be
> >> put into a disabled state, which means that every SET_FEATURES call
> >> would disable all rings, effectively halting the device.  This is
> >> problematic because the VHOST_F_LOG_ALL feature is also set or cleared
> >> this way, which happens during migration.  Doing so should not halt the
> >> device.
> >>
> >> Other implementations have interpreted this to mean that the device is
> >> to be initialized with all rings disabled, and a subsequent SET_FEATURES
> >> call that does not set VHOST_USER_F_PROTOCOL_FEATURES will enable all of
> >> them.  Here, SET_FEATURES will never disable any ring.
> >>
> >> This other interpretation does not suffer the problem of unintentionally
> >> halting the device whenever features are set or cleared, so it seems
> >> better and more reasonable.
> >>
> >> We should clarify this in the documentation.
> >>
> >> Signed-off-by: Hanna Czenczek 
> >> ---
> >>   docs/interop/vhost-user.rst | 23 +--
> >>   1 file changed, 17 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> >> index 5a070adbc1..ca0e899765 100644
> >> --- a/docs/interop/vhost-user.rst
> >> +++ b/docs/interop/vhost-user.rst
> >> @@ -383,12 +383,23 @@ and stop ring upon receiving 
> >> ``VHOST_USER_GET_VRING_BASE``.
> >>
> >>   Rings can be enabled or disabled by ``VHOST_USER_SET_VRING_ENABLE``.
> >>
> >> -If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the
> >> -ring starts directly in the enabled state.
> >> -
> >> -If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is
> >> -initialized in a disabled state and is enabled by
> >> -``VHOST_USER_SET_VRING_ENABLE`` with parameter 1.
> >> +Between initialization and the first ``VHOST_USER_SET_FEATURES`` call, it
> >> +is implementation-defined whether each ring is enabled or disabled.
> > What is the purpose of this statement? Rings cannot be used before
> > feature negotiation (with the possible exception of legacy devices that
> > allowed this to accommodate buggy drivers).
>
> Perfect :)
>
> > To me this statement complicates things and raises more questions than
> > it answers.
>
> OK.  The context for the statement is as follows: When the back-end
> supports F_PROTOCOL_FEATURES, it is supposed to initialize all vrings in
> a disabled state, so that when the flag is indeed negotiated, that will
> be the state they’re in.  In contrast, older back-ends that don’t
> support that flag will initialize them in an enabled state (because they
> won’t have support for disabled vrings).
>
> The statement was intended to make it clear that this difference in
> behavior is OK, and that the front-end must not rely on either of the
> two.  Only after SET_FEATURES will and must the state be well-defined.
>
> But if you find it just confusing because enabled/disabled has no
> meaning before a virtqueue is started anyway, and they mustn’t be
> started before negotiating features, I’m happy to drop it without
> replacement.

Yes, dropping this statement sounds good.

Stefan



Re: [PATCH v4 3/3] hw/ufs: Support for UFS logical unit

2023-07-18 Thread Stefan Hajnoczi
On Tue, Jul 04, 2023 at 05:33:59PM +0900, Jeuk Kim wrote:
> +static Property ufs_lu_props[] = {
> +DEFINE_PROP_DRIVE_IOTHREAD("drive", UfsLu, qdev.conf.blk),

This device is not aware of IOThreads, so I think DEFINE_PROP_DRIVE()
should be used instead.




Re: [PATCH v4 2/3] hw/ufs: Support for Query Transfer Requests

2023-07-18 Thread Stefan Hajnoczi
On Tue, Jul 04, 2023 at 05:33:58PM +0900, Jeuk Kim wrote:
> +static MemTxResult ufs_dma_read_prdt(UfsRequest *req)
> +{
> +UfsHc *u = req->hc;
> +uint16_t prdt_len = le16_to_cpu(req->utrd.prd_table_length);
> +uint16_t prdt_byte_off =
> +le16_to_cpu(req->utrd.prd_table_offset) * sizeof(uint32_t);
> +uint32_t prdt_size = prdt_len * sizeof(UfshcdSgEntry);
> +g_autofree UfshcdSgEntry *prd_entries = NULL;
> +hwaddr req_upiu_base_addr, prdt_base_addr;
> +int err;
> +
> +assert(!req->sg);
> +
> +if (prdt_len == 0) {
> +return MEMTX_OK;
> +}
> +
> +prd_entries = g_new(UfshcdSgEntry, prdt_size);
> +if (!prd_entries) {

g_new() never returns NULL. The process aborts if there is not enough
memory available.

Use g_try_new() if you want to handle memory allocation failure.
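
The difference can be sketched in plain C (xnew() mirrors g_new()'s
abort-on-failure behaviour; try_new() mirrors g_try_new(); both are
hypothetical helpers for illustration, not GLib code):

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Mirrors g_new(): abort the process on allocation failure,
 * so callers never see NULL and need no error path. */
static void *xnew(size_t nmemb, size_t size)
{
    void *p = calloc(nmemb, size);
    if (p == NULL) {
        fprintf(stderr, "out of memory\n");
        abort();
    }
    return p;
}

/* Mirrors g_try_new(): return NULL on failure so the caller
 * can handle it (e.g. trace the error and return MEMTX_ERROR). */
static void *try_new(size_t nmemb, size_t size)
{
    return calloc(nmemb, size);
}

static int demo(void)
{
    int *a = xnew(16, sizeof(int));    /* never NULL */
    int *b = try_new(16, sizeof(int)); /* may be NULL */
    int ok = 1;

    if (b == NULL) {
        /* only this path needs an error return */
        ok = 0;
    }
    free(a);
    free(b);
    return ok;
}
```

So checking the result of g_new() for NULL is dead code; only
g_try_new() makes the error path reachable.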

> +trace_ufs_err_memory_allocation();
> +return MEMTX_ERROR;
> +}
> +
> +req_upiu_base_addr = ufs_get_req_upiu_base_addr(&req->utrd);
> +prdt_base_addr = req_upiu_base_addr + prdt_byte_off;
> +
> +err = ufs_addr_read(u, prdt_base_addr, prd_entries, prdt_size);
> +if (err) {
> +trace_ufs_err_dma_read_prdt(req->slot, prdt_base_addr);
> +return err;
> +}
> +
> +req->sg = g_malloc0(sizeof(QEMUSGList));
> +if (!req->sg) {

g_malloc0() never returns NULL. The process aborts if there is not
enough memory available.





Re: [PATCH v4 1/3] hw/ufs: Initial commit for emulated Universal-Flash-Storage

2023-07-18 Thread Stefan Hajnoczi
On Tue, Jul 04, 2023 at 05:33:57PM +0900, Jeuk Kim wrote:
> From: Jeuk Kim 
> 
> Universal Flash Storage (UFS) is a high-performance mass storage device
> with a serial interface. It is primarily used as a high-performance
> data storage device for embedded applications.
> 
> This commit contains code for UFS device to be recognized
> as a UFS PCI device.
> Patches to handle UFS logical unit and Transfer Request will follow.
> 
> Signed-off-by: Jeuk Kim 
> ---
>  MAINTAINERS  |6 +
>  docs/specs/pci-ids.rst   |2 +
>  hw/Kconfig   |1 +
>  hw/meson.build   |1 +
>  hw/ufs/Kconfig   |4 +
>  hw/ufs/meson.build   |1 +
>  hw/ufs/trace-events  |   33 ++
>  hw/ufs/trace.h   |1 +
>  hw/ufs/ufs.c |  304 +++
>  hw/ufs/ufs.h |   42 ++
>  include/block/ufs.h  | 1048 ++
>  include/hw/pci/pci.h |1 +
>  include/hw/pci/pci_ids.h |1 +
>  meson.build  |1 +
>  14 files changed, 1446 insertions(+)
>  create mode 100644 hw/ufs/Kconfig
>  create mode 100644 hw/ufs/meson.build
>  create mode 100644 hw/ufs/trace-events
>  create mode 100644 hw/ufs/trace.h
>  create mode 100644 hw/ufs/ufs.c
>  create mode 100644 hw/ufs/ufs.h
>  create mode 100644 include/block/ufs.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 4feea49a6e..756aae8623 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2237,6 +2237,12 @@ F: tests/qtest/nvme-test.c
>  F: docs/system/devices/nvme.rst
>  T: git git://git.infradead.org/qemu-nvme.git nvme-next
>  
> +ufs
> +M: Jeuk Kim 
> +S: Supported
> +F: hw/ufs/*
> +F: include/block/ufs.h
> +
>  megasas
>  M: Hannes Reinecke 
>  L: qemu-bl...@nongnu.org
> diff --git a/docs/specs/pci-ids.rst b/docs/specs/pci-ids.rst
> index e302bea484..d6707fa069 100644
> --- a/docs/specs/pci-ids.rst
> +++ b/docs/specs/pci-ids.rst
> @@ -92,6 +92,8 @@ PCI devices (other than virtio):
>PCI PVPanic device (``-device pvpanic-pci``)
>  1b36:0012
>PCI ACPI ERST device (``-device acpi-erst``)
> +1b36:0013
> +  PCI UFS device (``-device ufs``)
>  
>  All these devices are documented in :doc:`index`.
>  
> diff --git a/hw/Kconfig b/hw/Kconfig
> index ba62ff6417..9ca7b38c31 100644
> --- a/hw/Kconfig
> +++ b/hw/Kconfig
> @@ -38,6 +38,7 @@ source smbios/Kconfig
>  source ssi/Kconfig
>  source timer/Kconfig
>  source tpm/Kconfig
> +source ufs/Kconfig
>  source usb/Kconfig
>  source virtio/Kconfig
>  source vfio/Kconfig
> diff --git a/hw/meson.build b/hw/meson.build
> index c7ac7d3d75..f01fac4617 100644
> --- a/hw/meson.build
> +++ b/hw/meson.build
> @@ -37,6 +37,7 @@ subdir('smbios')
>  subdir('ssi')
>  subdir('timer')
>  subdir('tpm')
> +subdir('ufs')
>  subdir('usb')
>  subdir('vfio')
>  subdir('virtio')
> diff --git a/hw/ufs/Kconfig b/hw/ufs/Kconfig
> new file mode 100644
> index 00..b7b3392e85
> --- /dev/null
> +++ b/hw/ufs/Kconfig
> @@ -0,0 +1,4 @@
> +config UFS_PCI
> +bool
> +default y if PCI_DEVICES
> +depends on PCI
> diff --git a/hw/ufs/meson.build b/hw/ufs/meson.build
> new file mode 100644
> index 00..eb5164bde9
> --- /dev/null
> +++ b/hw/ufs/meson.build
> @@ -0,0 +1 @@
> +system_ss.add(when: 'CONFIG_UFS_PCI', if_true: files('ufs.c'))
> diff --git a/hw/ufs/trace-events b/hw/ufs/trace-events
> new file mode 100644
> index 00..17793929b1
> --- /dev/null
> +++ b/hw/ufs/trace-events
> @@ -0,0 +1,33 @@
> +# ufs.c
> +ufs_irq_raise(void) "INTx"
> +ufs_irq_lower(void) "INTx"
> +ufs_mmio_read(uint64_t addr, uint64_t data, unsigned size) "addr 0x%"PRIx64" 
> data 0x%"PRIx64" size %d"
> +ufs_mmio_write(uint64_t addr, uint64_t data, unsigned size) "addr 
> 0x%"PRIx64" data 0x%"PRIx64" size %d"
> +ufs_process_db(uint32_t slot) "UTRLDBR slot %"PRIu32""
> +ufs_process_req(uint32_t slot) "UTRLDBR slot %"PRIu32""
> +ufs_complete_req(uint32_t slot) "UTRLDBR slot %"PRIu32""
> +ufs_sendback_req(uint32_t slot) "UTRLDBR slot %"PRIu32""
> +ufs_exec_nop_cmd(uint32_t slot) "UTRLDBR slot %"PRIu32""
> +ufs_exec_scsi_cmd(uint32_t slot, uint8_t lun, uint8_t opcode) "slot 
> %"PRIu32", lun 0x%"PRIx8", opcode 0x%"PRIx8""
> +ufs_exec_query_cmd(uint32_t slot, uint8_t opcode) "slot %"PRIu32", opcode 
> 0x%"PRIx8""
> +ufs_process_uiccmd(uint32_t uiccmd, uint32_t ucmdarg1, uint32_t ucmdarg2, 
> uint32_t ucmdarg3) "uiccmd 0x%"PRIx32", ucmdarg1 0x%"PRIx32", ucmdarg2 
> 0x%"PRIx32", ucmdarg3 0x%"PRIx32""
> +
> +# error condition
> +ufs_err_memory_allocation(void) "failed to allocate memory"
> +ufs_err_dma_read_utrd(uint32_t slot, uint64_t addr) "failed to read utrd. 
> UTRLDBR slot %"PRIu32", UTRD dma addr %"PRIu64""
> +ufs_err_dma_read_req_upiu(uint32_t slot, uint64_t addr) "failed to read req 
> upiu. UTRLDBR slot %"PRIu32", request upiu addr %"PRIu64""
> +ufs_err_dma_read_prdt(uint32_t slot, uint64_t addr) "failed to read prdt. 
> UTRLDBR slot %"PRIu32", prdt addr %"PRIu64""
> +ufs_err_dma_write_utrd(uint32_t slot, uint64_t addr) "failed to 

Re: PING: [PATCH v4 0/3] hw/ufs: Add Universal Flash Storage (UFS) support

2023-07-18 Thread Stefan Hajnoczi
On Tue, Jul 11, 2023 at 07:31:02PM +0900, Jeuk Kim wrote:
> Hi,
> Any more reviews...?
> 
> Dear Stefan
> If you don't mind, Could you give it "reviewed-by"?
> And is there anything else I should do...?

Sorry for the late reply. I was on vacation and am working my way
through pending code review and bug reports.

I have started reviewing this series and should be finished on Wednesday
or Thursday.

Stefan




Re: [PATCH v2 4/4] vhost-user-fs: Implement internal migration

2023-07-18 Thread Stefan Hajnoczi
On Wed, Jul 12, 2023 at 01:17:02PM +0200, Hanna Czenczek wrote:
> A virtio-fs device's VM state consists of:
> - the virtio device (vring) state (VMSTATE_VIRTIO_DEVICE)
> - the back-end's (virtiofsd's) internal state
> 
> We get/set the latter via the new vhost operations to transfer migratory
> state.  It is its own dedicated subsection, so that for external
> migration, it can be disabled.
> 
> Signed-off-by: Hanna Czenczek 
> ---
>  hw/virtio/vhost-user-fs.c | 101 +-
>  1 file changed, 100 insertions(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH v2 3/4] vhost: Add high-level state save/load functions

2023-07-18 Thread Stefan Hajnoczi
On Wed, Jul 12, 2023 at 01:17:01PM +0200, Hanna Czenczek wrote:
> vhost_save_backend_state() and vhost_load_backend_state() can be used by
> vhost front-ends to easily save and load the back-end's state to/from
> the migration stream.
> 
> Because we do not know the full state size ahead of time,
> vhost_save_backend_state() simply reads the data in 1 MB chunks, and
> writes each chunk consecutively into the migration stream, prefixed by
> its length.  EOF is indicated by a 0-length chunk.
> 
> Signed-off-by: Hanna Czenczek 
> ---
>  include/hw/virtio/vhost.h |  35 +++
>  hw/virtio/vhost.c | 204 ++
>  2 files changed, 239 insertions(+)
> 
> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> index d8877496e5..0c282abd4e 100644
> --- a/include/hw/virtio/vhost.h
> +++ b/include/hw/virtio/vhost.h
> @@ -425,4 +425,39 @@ int vhost_set_device_state_fd(struct vhost_dev *dev,
>   */
>  int vhost_check_device_state(struct vhost_dev *dev, Error **errp);
>  
> +/**
> + * vhost_save_backend_state(): High-level function to receive a vhost
> + * back-end's state, and save it in `f`.  Uses

I think the GtkDoc syntax is @f instead of `f`.
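
I.e. something like this (sketch of the corrected comment; only the
parameter markup changes):

```c
/**
 * vhost_save_backend_state(): High-level function to receive a vhost
 * back-end's state, and save it in @f.  Uses
 * vhost_set_device_state_fd() to get the data from the back-end, ...
 */
```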

> + * `vhost_set_device_state_fd()` to get the data from the back-end, and
> + * stores it in consecutive chunks that are each prefixed by their
> + * respective length (be32).  The end is marked by a 0-length chunk.
> + *
> + * Must only be called while the device and all its vrings are stopped
> + * (`VHOST_TRANSFER_STATE_PHASE_STOPPED`).
> + *
> + * @dev: The vhost device from which to save the state
> + * @f: Migration stream in which to save the state
> + * @errp: Potential error message
> + *
> + * Returns 0 on success, and -errno otherwise.
> + */
> +int vhost_save_backend_state(struct vhost_dev *dev, QEMUFile *f, Error 
> **errp);
> +
> +/**
> + * vhost_load_backend_state(): High-level function to load a vhost
> + * back-end's state from `f`, and send it over to the back-end.  Reads
> + * the data from `f` in the format used by `vhost_save_state()`, and
> + * uses `vhost_set_device_state_fd()` to transfer it to the back-end.
> + *
> + * Must only be called while the device and all its vrings are stopped
> + * (`VHOST_TRANSFER_STATE_PHASE_STOPPED`).
> + *
> + * @dev: The vhost device to which to send the state
> + * @f: Migration stream from which to load the state
> + * @errp: Potential error message
> + *
> + * Returns 0 on success, and -errno otherwise.
> + */
> +int vhost_load_backend_state(struct vhost_dev *dev, QEMUFile *f, Error 
> **errp);
> +
>  #endif
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 756b6d55a8..332d49a310 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -2128,3 +2128,207 @@ int vhost_check_device_state(struct vhost_dev *dev, 
> Error **errp)
> "vhost transport does not support migration state transfer");
>  return -ENOSYS;
>  }
> +
> +int vhost_save_backend_state(struct vhost_dev *dev, QEMUFile *f, Error 
> **errp)
> +{
> +/* Maximum chunk size in which to transfer the state */
> +const size_t chunk_size = 1 * 1024 * 1024;
> +void *transfer_buf = NULL;
> +g_autoptr(GError) g_err = NULL;
> +int pipe_fds[2], read_fd = -1, write_fd = -1, reply_fd = -1;
> +int ret;
> +
> +/* [0] for reading (our end), [1] for writing (back-end's end) */
> +if (!g_unix_open_pipe(pipe_fds, FD_CLOEXEC, &g_err)) {
> +error_setg(errp, "Failed to set up state transfer pipe: %s",
> +   g_err->message);
> +ret = -EINVAL;
> +goto fail;
> +}
> +
> +read_fd = pipe_fds[0];
> +write_fd = pipe_fds[1];
> +
> +/*
> + * VHOST_TRANSFER_STATE_PHASE_STOPPED means the device must be stopped.
> + * We cannot check dev->suspended, because the back-end may not support
> + * suspending.
> + */
> +assert(!dev->started);
> +
> +/* Transfer ownership of write_fd to the back-end */
> +ret = vhost_set_device_state_fd(dev,
> +VHOST_TRANSFER_STATE_DIRECTION_SAVE,
> +VHOST_TRANSFER_STATE_PHASE_STOPPED,
> +write_fd,
> +&reply_fd,
> +errp);
> +if (ret < 0) {
> +error_prepend(errp, "Failed to initiate state transfer: ");
> +goto fail;
> +}
> +
> +/* If the back-end wishes to use a different pipe, switch over */
> +if (reply_fd >= 0) {
> +close(read_fd);
> +read_fd = reply_fd;
> +}
> +
> +transfer_buf = g_malloc(chunk_size);
> +
> +while (true) {
> +ssize_t read_ret;
> +
> +read_ret = read(read_fd, transfer_buf, chunk_size);
> +if (read_ret < 0) {

Is it necessary to handle -EINTR?
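
For reference, an EINTR-tolerant read loop might look like this (a
sketch with a hypothetical helper name, not the patch's code):

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Retry read() when it is interrupted by a signal before any data
 * was transferred; all other errors are reported as -errno. */
static ssize_t read_retry_eintr(int fd, void *buf, size_t count)
{
    ssize_t ret;

    do {
        ret = read(fd, buf, count);
    } while (ret < 0 && errno == EINTR);

    return ret < 0 ? -errno : ret;
}
```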

> +ret = -errno;
> +error_setg_errno(errp, -ret, "Failed to receive state");
> +goto fail;
> +}
> +
> +

Re: [PATCH v2 2/4] vhost-user: Interface for migration state transfer

2023-07-18 Thread Stefan Hajnoczi
On Wed, Jul 12, 2023 at 01:17:00PM +0200, Hanna Czenczek wrote:
> Add the interface for transferring the back-end's state during migration
> as defined previously in vhost-user.rst.
> 
> Signed-off-by: Hanna Czenczek 
> ---
>  include/hw/virtio/vhost-backend.h |  24 +
>  include/hw/virtio/vhost.h |  79 
>  hw/virtio/vhost-user.c| 147 ++
>  hw/virtio/vhost.c |  37 
>  4 files changed, 287 insertions(+)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH v2 1/4] vhost-user.rst: Migrating back-end-internal state

2023-07-18 Thread Stefan Hajnoczi
On Wed, Jul 12, 2023 at 01:16:59PM +0200, Hanna Czenczek wrote:
> @@ -1471,6 +1511,53 @@ Front-end message types
>before.  The back-end must again begin processing rings that are not
>stopped, and it may resume background operations.
>  
> +``VHOST_USER_SET_DEVICE_STATE_FD``
> +  :id: 43
> +  :equivalent ioctl: N/A
> +  :request payload: device state transfer parameters

Where are these defined?




Re: [PATCH v2 1/4] vhost-user.rst: Migrating back-end-internal state

2023-07-18 Thread Stefan Hajnoczi
On Wed, Jul 12, 2023 at 01:16:59PM +0200, Hanna Czenczek wrote:
> For vhost-user devices, qemu can migrate the virtio state, but not the
> back-end's internal state.  To do so, we need to be able to transfer
> this internal state between front-end (qemu) and back-end.
> 
> At this point, this new feature is added for the purpose of virtio-fs
> migration.  Because virtiofsd's internal state will not be too large, we
> believe it is best to transfer it as a single binary blob after the
> streaming phase.
> 
> These are the additions to the protocol:
> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE

It's not 100% clear whether "migratory" is related to live migration or
something else. I don't like the name :P.

The name "VHOST_USER_PROTOCOL_F_DEVICE_STATE" would be more obviously
associated with SET_DEVICE_STATE_FD and CHECK_DEVICE_STATE than
"MIGRATORY_STATE".

> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
>   over which to transfer the state.

Does it need to be a pipe or can it be another type of file (e.g. UNIX
domain socket)?

In the future the fd may become bi-directional. Pipes are
uni-directional on Linux.

I suggest calling it a "file descriptor" and not mentioning "pipes"
explicitly.
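
To illustrate the difference: a pipe is unidirectional on Linux,
whereas a socketpair gives two connected, bidirectional file
descriptors, so the same fd could later carry traffic both ways (a
sketch, not a suggestion for the patch's implementation):

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* A pipe has a read end (fds[0]) and a write end (fds[1]).
 * A socketpair's two fds can each both read and write. */
static int make_bidirectional_channel(int fds[2])
{
    return socketpair(AF_UNIX, SOCK_STREAM, 0, fds);
}
```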

> - CHECK_DEVICE_STATE: After the state has been transferred through the
>   pipe, the front-end invokes this function to verify success.  There is
>   no in-band way (through the pipe) to indicate failure, so we need to
>   check explicitly.
> 
> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> (which includes establishing the direction of transfer and migration
> phase), the sending side writes its data into the pipe, and the reading
> side reads it until it sees an EOF.  Then, the front-end will check for
> success via CHECK_DEVICE_STATE, which on the destination side includes
> checking for integrity (i.e. errors during deserialization).
> 
> Suggested-by: Stefan Hajnoczi 
> Signed-off-by: Hanna Czenczek 
> ---
>  docs/interop/vhost-user.rst | 87 +
>  1 file changed, 87 insertions(+)
> 
> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> index ac6be34c4c..c98dfeca25 100644
> --- a/docs/interop/vhost-user.rst
> +++ b/docs/interop/vhost-user.rst
> @@ -334,6 +334,7 @@ in the ancillary data:
>  * ``VHOST_USER_SET_VRING_ERR``
>  * ``VHOST_USER_SET_BACKEND_REQ_FD`` (previous name 
> ``VHOST_USER_SET_SLAVE_REQ_FD``)
>  * ``VHOST_USER_SET_INFLIGHT_FD`` (if 
> ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)
> +* ``VHOST_USER_SET_DEVICE_STATE_FD``
>  
>  If *front-end* is unable to send the full message or receives a wrong
>  reply it will close the connection. An optional reconnection mechanism
> @@ -497,6 +498,44 @@ it performs WAKE ioctl's on the userfaultfd to wake the 
> stalled
>  back-end.  The front-end indicates support for this via the
>  ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` feature.
>  
> +.. _migrating_backend_state:
> +
> +Migrating back-end state
> +
> +------------------------
> +If the back-end has internal state that is to be sent from source to
> +destination,

Migration and the terms "source" and "destination" have not been
defined. Here is a suggestion for an introductory paragraph:

  Migrating device state involves transferring the state from one
  back-end, called the source, to another back-end, called the
  destination. After migration, the destination transparently resumes
  operation without requiring the driver to re-initialize the device at
  the VIRTIO level. If the migration fails, then the source can
  transparently resume operation until another migration attempt is
  made.

> the front-end may be able to store and transfer it via an
> +internal migration stream.  Support for this is negotiated with the
> +``VHOST_USER_PROTOCOL_F_MIGRATORY_STATE`` feature.
> +
> +First, a channel over which the state is transferred is established on
> +the source side using the ``VHOST_USER_SET_DEVICE_STATE_FD`` message.
> +This message has two parameters:
> +
> +* Direction of transfer: On the source, the data is saved, transferring
> +  it from the back-end to the front-end.  On the destination, the data
> +  is loaded, transferring it from the front-end to the back-end.
> +
> +* Migration phase: Currently, only the period after memory transfer

"memory transfer" is vague. This sentence is referring to VM live
migration and guest RAM but it may be better to focus on just the device
perspective and not the VM:

  Migration is currently only supported while the device is suspended
  and all of its rings are stopped. In the future, additional phases
  might be supported to allow iterative migration w

Re: [PATCH] vhost-user.rst: Clarify enabling/disabling vrings

2023-07-18 Thread Stefan Hajnoczi
On Wed, Jul 12, 2023 at 11:17:04AM +0200, Hanna Czenczek wrote:
> Currently, the vhost-user documentation says that rings are to be
> initialized in a disabled state when VHOST_USER_F_PROTOCOL_FEATURES is
> negotiated.  However, by the time of feature negotiation, all rings have
> already been initialized, so it is not entirely clear what this means.
> 
> At least the vhost-user-backend Rust crate's implementation interpreted
> it to mean that whenever this feature is negotiated, all rings are to be
> put into a disabled state, which means that every SET_FEATURES call
> would disable all rings, effectively halting the device.  This is
> problematic because the VHOST_F_LOG_ALL feature is also set or cleared
> this way, which happens during migration.  Doing so should not halt the
> device.
> 
> Other implementations have interpreted this to mean that the device is
> to be initialized with all rings disabled, and a subsequent SET_FEATURES
> call that does not set VHOST_USER_F_PROTOCOL_FEATURES will enable all of
> them.  Here, SET_FEATURES will never disable any ring.
> 
> This other interpretation does not suffer the problem of unintentionally
> halting the device whenever features are set or cleared, so it seems
> better and more reasonable.
> 
> We should clarify this in the documentation.
> 
> Signed-off-by: Hanna Czenczek 
> ---
>  docs/interop/vhost-user.rst | 23 +--
>  1 file changed, 17 insertions(+), 6 deletions(-)
> 
> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> index 5a070adbc1..ca0e899765 100644
> --- a/docs/interop/vhost-user.rst
> +++ b/docs/interop/vhost-user.rst
> @@ -383,12 +383,23 @@ and stop ring upon receiving 
> ``VHOST_USER_GET_VRING_BASE``.
>  
>  Rings can be enabled or disabled by ``VHOST_USER_SET_VRING_ENABLE``.
>  
> -If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the
> -ring starts directly in the enabled state.
> -
> -If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is
> -initialized in a disabled state and is enabled by
> -``VHOST_USER_SET_VRING_ENABLE`` with parameter 1.
> +Between initialization and the first ``VHOST_USER_SET_FEATURES`` call, it
> +is implementation-defined whether each ring is enabled or disabled.

What is the purpose of this statement? Rings cannot be used before
feature negotiation (with the possible exception of legacy devices that
allowed this to accommodate buggy drivers).

To me this statement complicates things and raises more questions than
it answers.

> +
> +If ``VHOST_USER_SET_FEATURES`` does not negotiate
> +``VHOST_USER_F_PROTOCOL_FEATURES``, each ring, when started, will be
> +enabled immediately.

This sentence can be simplified a little:
"each ring, when started, will be enabled immediately" ->
"rings are enabled immediately when started"

> +
> +If ``VHOST_USER_SET_FEATURES`` does negotiate
> +``VHOST_USER_F_PROTOCOL_FEATURES``, each ring will remain in the disabled
> +state until ``VHOST_USER_SET_VRING_ENABLE`` enables it with parameter 1.
> +
> +Back-end implementations that support ``VHOST_USER_F_PROTOCOL_FEATURES``
> +should implement this by initializing each ring in a disabled state, and
> +enabling them when ``VHOST_USER_SET_FEATURES`` is used without
> +negotiating ``VHOST_USER_F_PROTOCOL_FEATURES``.  Other than that, rings
> +should only be enabled and disabled through
> +``VHOST_USER_SET_VRING_ENABLE``.
>  
>  While processing the rings (whether they are enabled or not), the back-end
>  must support changing some configuration aspects on the fly.
> -- 
> 2.41.0
> 




Re: [PATCH 0/6] vhost-user: Add suspend/resume

2023-07-18 Thread Stefan Hajnoczi
On Tue, Jul 11, 2023 at 05:52:22PM +0200, Hanna Czenczek wrote:
> Hi,
> 
> As discussed on the previous version of the virtio-fs migration series
> (https://lists.nongnu.org/archive/html/qemu-devel/2023-04/msg01575.html),
> we currently don’t have a good way to have a vhost-user back-end fully
> cease all operations, including background operations.  To work around
> this, we reset it, which is not an option for stateful devices like
> virtio-fs.
> 
> Instead, we want the same SUSPEND/RESUME model that vhost-vdpa already
> has, so that we can suspend back-ends when we want them to stop doing
> anything (i.e. on VM stop), and resume them later (i.e. on VM resume).
> This series adds these vhost-user operations to the protocol and
> implements them in qemu.  Furthermore, it has vhost-user and vhost-vdpa
> do roughly the same thing in their reset paths, as far as possible.
> That path will still remain as a fall-back if SUSPEND/RESUME is not
> implemented, and, given that qemu’s vhost-vdpa code currently does not
> make use of RESUME, it is actually always used for vhost-vdpa (to take
> the device out of a suspended state).
> 
> 
> Hanna Czenczek (6):
>   vhost-user.rst: Add suspend/resume
>   vhost-vdpa: Move vhost_vdpa_reset_status() up
>   vhost: Do not reset suspended devices on stop
>   vhost-user: Implement suspend/resume
>   vhost-vdpa: Match vhost-user's status reset
>   vhost-user: Have reset_status fall back to reset
> 
>  docs/interop/vhost-user.rst|  35 +++-
>  include/hw/virtio/vhost-vdpa.h |   2 -
>  include/hw/virtio/vhost.h  |   8 +++
>  hw/virtio/vhost-user.c | 101 -
>  hw/virtio/vhost-vdpa.c |  41 ++---
>  hw/virtio/vhost.c  |   8 ++-
>  6 files changed, 169 insertions(+), 26 deletions(-)

Hi Hanna,
I posted comments but wanted to say great job! There was a long and
somewhat messy email discussion to figure out how to proceed and you
came up with a clean patch series that solves the issues.

Stefan




Re: [PATCH 6/6] vhost-user: Have reset_status fall back to reset

2023-07-18 Thread Stefan Hajnoczi
On Tue, Jul 11, 2023 at 05:52:28PM +0200, Hanna Czenczek wrote:
> The only user of vhost_user_reset_status() is vhost_dev_stop(), which
> only uses it as a fall-back to stop the back-end if it does not support
> SUSPEND.  However, vhost-user's implementation is a no-op unless the
> back-end supports SET_STATUS.
> 
> vhost-vdpa's implementation instead just calls
> vhost_vdpa_reset_device(), implying that it's OK to fully reset the
> device if SET_STATUS is not supported.
> 
> To be fair, vhost_vdpa_reset_device() does nothing but to set the status
> to zero.  However, that may well be because vhost-vdpa has no method
> besides this to reset a device.  In contrast, vhost-user has
> RESET_DEVICE and a RESET_OWNER, which can be used instead.
> 
> While it is not entirely clear from documentation or git logs, from
> discussions and the order of vhost-user protocol features, it appears to
> me as if RESET_OWNER originally had no real meaning for vhost-user, and
> was thus used to signal a device reset to the back-end.  Then,
> RESET_DEVICE was introduced, to have a well-defined dedicated reset
> command.  Finally, vhost-user received full STATUS support, including
> SET_STATUS, so setting the device status to 0 is now the preferred way
> of resetting a device.  Still, RESET_DEVICE and RESET_OWNER should
> remain valid as fall-backs.
> 
> Therefore, have vhost_user_reset_status() fall back to
> vhost_user_reset_device() if the back-end has no STATUS support.
> 
> Signed-off-by: Hanna Czenczek 
> ---
>  hw/virtio/vhost-user.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 4507de5a92..53a881ec2a 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -2833,6 +2833,8 @@ static void vhost_user_reset_status(struct vhost_dev *dev)
>  if (virtio_has_feature(dev->protocol_features,
> VHOST_USER_PROTOCOL_F_STATUS)) {
>  vhost_user_set_status(dev, 0);
> +} else {
> +vhost_user_reset_device(dev);
>  }
>  }

Did you check whether DPDK treats setting the status to 0 as equivalent
to RESET_DEVICE?

My understanding is that SET_STATUS is mostly ignored by vhost-user
back-ends today. Even those that implement it may not treat SET_STATUS 0
as equivalent to RESET_DEVICE.

If you decide it's safe to make this change, please also update
vhost-user.rst to document that front-ends should use SET_STATUS 0,
RESET_DEVICE, and RESET_OWNER in order of preference.
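To illustrate, the suggested order of preference could be sketched roughly like this (a hypothetical helper and struct for illustration only; the real front-end would test `dev->protocol_features` with `virtio_has_feature()`):

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical view of negotiated protocol features. */
struct backend_features {
    bool has_status;        /* VHOST_USER_PROTOCOL_F_STATUS negotiated */
    bool has_reset_device;  /* VHOST_USER_PROTOCOL_F_RESET_DEVICE negotiated */
};

/* Pick the reset mechanism in the order of preference described above:
 * SET_STATUS 0 first, then RESET_DEVICE, then RESET_OWNER as a
 * last-resort fallback. */
static const char *pick_reset_message(const struct backend_features *f)
{
    if (f->has_status) {
        return "SET_STATUS 0";
    }
    if (f->has_reset_device) {
        return "RESET_DEVICE";
    }
    return "RESET_OWNER";
}
```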

Stefan



Re: [PATCH 5/6] vhost-vdpa: Match vhost-user's status reset

2023-07-18 Thread Stefan Hajnoczi
On Tue, Jul 11, 2023 at 05:52:27PM +0200, Hanna Czenczek wrote:
> vhost-vdpa and vhost-user differ in how they reset the status in their
> respective vhost_reset_status implementations: vhost-vdpa zeroes it,
> then re-adds the S_ACKNOWLEDGE and S_DRIVER config bits.  S_DRIVER_OK is
> then set in vhost_vdpa_dev_start().
> 
> vhost-user in contrast just zeroes the status, and does not re-add any
> config bits until vhost_user_dev_start() (where it does re-add all of
> S_ACKNOWLEDGE, S_DRIVER, and S_DRIVER_OK).
> 
> There is no documentation for vhost_reset_status, but its only caller is
> vhost_dev_stop().  So apparently, the device is to be stopped after
> vhost_reset_status, and therefore it makes more sense to keep the status
> field fully cleared until the back-end is re-started, which is how
> vhost-user does it.  Make vhost-vdpa do the same -- if nothing else it's
> confusing to have both vhost implementations handle this differently.
> 
> Signed-off-by: Hanna Czenczek 
> ---
>  hw/virtio/vhost-vdpa.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)

Hi Hanna,
The VIRTIO spec lists the Device Initialization sequence including the
bits set in the Device Status Register here:
https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-1070001

ACKNOWLEDGE and DRIVER must be set before FEATURES_OK. DRIVER_OK is set
after FEATURES_OK.

The driver may read the Device Configuration Space once ACKNOWLEDGE and
DRIVER are set.

QEMU's vhost code should follow this sequence (especially for vDPA where
full VIRTIO devices are implemented).
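The ordering constraint from the spec can be modeled as a tiny predicate (a sketch only; the bit names are shortened from the `VIRTIO_CONFIG_S_*` constants, with the values defined by the VIRTIO spec):

```c
#include <stdint.h>
#include <stdbool.h>

/* Device Status Register bits, values per the VIRTIO spec. */
#define S_ACKNOWLEDGE 1u
#define S_DRIVER      2u
#define S_DRIVER_OK   4u
#define S_FEATURES_OK 8u

/* Returns true iff `next` may legally be set given the bits already in
 * `status`: ACKNOWLEDGE and DRIVER must precede FEATURES_OK, and
 * FEATURES_OK must precede DRIVER_OK. */
static bool status_bit_allowed(uint8_t status, uint8_t next)
{
    if (next == S_FEATURES_OK) {
        return (status & (S_ACKNOWLEDGE | S_DRIVER)) ==
               (S_ACKNOWLEDGE | S_DRIVER);
    }
    if (next == S_DRIVER_OK) {
        return (status & S_FEATURES_OK) != 0;
    }
    return true; /* ACKNOWLEDGE and DRIVER have no prerequisites here */
}
```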

vhost-user is not faithful to the VIRTIO spec here. That's probably due
to the fact that vhost-user didn't have the concept of the Device Status
Register until recently and back-ends mostly ignore it.

Please do the opposite of this patch: bring vhost-user in line with the
VIRTIO specification so that the Device Initialization sequence is
followed correctly. I think vhost-vdpa already does the right thing.

> 
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index f7fd19a203..0cde8b40de 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -1294,8 +1294,6 @@ static void vhost_vdpa_reset_status(struct vhost_dev *dev)
>  }
>  
>  vhost_vdpa_reset_device(dev);
> -vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
> -   VIRTIO_CONFIG_S_DRIVER);
>  memory_listener_unregister(&v->listener);
>  }
>  
> @@ -1334,7 +1332,9 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
>  }
>  memory_listener_register(&v->listener, dev->vdev->dma_as);
>  
> -return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
> +return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
> +  VIRTIO_CONFIG_S_DRIVER |
> +  VIRTIO_CONFIG_S_DRIVER_OK);
>  }
>  
>  return 0;
> -- 
> 2.41.0
> 



Re: [PATCH 3/6] vhost: Do not reset suspended devices on stop

2023-07-18 Thread Stefan Hajnoczi
On Tue, Jul 11, 2023 at 05:52:25PM +0200, Hanna Czenczek wrote:
> Move the `suspended` field from vhost_vdpa into the global vhost_dev
> struct, so vhost_dev_stop() can check whether the back-end has been
> suspended by `vhost_ops->vhost_dev_start(hdev, false)`.  If it has,
> there is no need to reset it; the reset is just a fall-back to stop
> device operations for back-ends that do not support suspend.
> 
> Unfortunately, for vDPA specifically, RESUME is not yet implemented, so
> when the device is re-started, we still have to do the reset to have it
> un-suspend.
> 
> Signed-off-by: Hanna Czenczek 
> ---
>  include/hw/virtio/vhost-vdpa.h |  2 --
>  include/hw/virtio/vhost.h  |  8 
>  hw/virtio/vhost-vdpa.c | 11 +++
>  hw/virtio/vhost.c  |  8 +++-
>  4 files changed, 22 insertions(+), 7 deletions(-)

Reviewed-by: Stefan Hajnoczi 



Re: [PATCH 4/6] vhost-user: Implement suspend/resume

2023-07-18 Thread Stefan Hajnoczi
On Tue, Jul 11, 2023 at 05:52:26PM +0200, Hanna Czenczek wrote:
> Implement SUSPEND/RESUME like vDPA does, by automatically using it in
> vhost_user_dev_start().  (Though our vDPA code does not implement RESUME
> yet, so there, the device is reset when it is to be resumed.)
> 
> Signed-off-by: Hanna Czenczek 
> ---
>  hw/virtio/vhost-user.c | 99 +-
>  1 file changed, 97 insertions(+), 2 deletions(-)

Reviewed-by: Stefan Hajnoczi 



Re: [PATCH 2/6] vhost-vdpa: Move vhost_vdpa_reset_status() up

2023-07-18 Thread Stefan Hajnoczi
On Tue, Jul 11, 2023 at 05:52:24PM +0200, Hanna Czenczek wrote:
> The next commit is going to have vhost_vdpa_dev_start() call this, so
> move it up to have the declaration where we are going to need it.
> 
> Signed-off-by: Hanna Czenczek 
> ---
>  hw/virtio/vhost-vdpa.c | 28 ++--
>  1 file changed, 14 insertions(+), 14 deletions(-)

Reviewed-by: Stefan Hajnoczi 



Re: [PATCH 1/6] vhost-user.rst: Add suspend/resume

2023-07-18 Thread Stefan Hajnoczi
On Tue, Jul 11, 2023 at 05:52:23PM +0200, Hanna Czenczek wrote:
> When stopping the VM, qemu wants all devices to fully cease any
> operation, too.  Currently, we can only have vhost-user back-ends stop
> processing vrings, but not background operations.  Add the SUSPEND and
> RESUME commands from vDPA, which allow the front-end (qemu) to tell
> back-ends to cease all operations, including those running in the
> background.
> 
> qemu's current work-around for this is to reset the back-end instead of
> suspending it, which will not work for back-ends that have internal
> state that must be preserved across e.g. stop/cont.
> 
> Note that the given specification requires the back-end to delay
> processing kicks (i.e. starting vrings) until the device is resumed,
> instead of requiring the front-end not to send kicks while suspended.
> qemu starts devices (and would just resume them) only when the VM is in
> a running state, so it would be difficult to have qemu delay kicks until
> the device is resumed, which is why this patch specifies handling of
> kicks as it does.
> 
> Signed-off-by: Hanna Czenczek 
> ---
>  docs/interop/vhost-user.rst | 35 +--
>  1 file changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> index 5a070adbc1..ac6be34c4c 100644
> --- a/docs/interop/vhost-user.rst
> +++ b/docs/interop/vhost-user.rst
> @@ -381,6 +381,10 @@ readable) on the descriptor specified by 
> ``VHOST_USER_SET_VRING_KICK``
>  or receiving the in-band message ``VHOST_USER_VRING_KICK`` if negotiated,
>  and stop ring upon receiving ``VHOST_USER_GET_VRING_BASE``.
>  
> +While the back-end is suspended (via ``VHOST_USER_SUSPEND``), it must
> +never process rings, and thus also delay handling kicks until the

If you respin this series, I suggest replacing "never" with "not" to
emphasize that ring processing is only skipped while the device is
suspended (rather than forever). "Never" feels too strong to use when
describing a temporary state.

> +back-end is resumed again.
> +
>  Rings can be enabled or disabled by ``VHOST_USER_SET_VRING_ENABLE``.
>  
>  If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the
> @@ -479,8 +483,9 @@ supplied by ``VhostUserLog``.
>  ancillary data, it may be used to inform the front-end that the log has
>  been modified.
>  
> -Once the source has finished migration, rings will be stopped by the
> -source. No further update must be done before rings are restarted.
> +Once the source has finished migration, the device will be suspended and
> +its rings will be stopped by the source. No further update must be done
> +before the device and its rings are resumed.

This paragraph is abstract and doesn't directly identify the mechanisms
or who does what:
- "the device will be suspended" via VHOST_USER_SUSPEND (or reset when
  VHOST_USER_SUSPEND is not supported?) or automatically by the device
  itself or some other mechanism?
- "before the device and its rings are resumed" via VHOST_USER_RESUME?
  And is this referring to the source device?

Please rephrase the paragraph to identify the vhost-user messages
involved.

>  
>  In postcopy migration the back-end is started before all the memory has
>  been received from the source host, and care must be taken to avoid
> @@ -885,6 +890,7 @@ Protocol features
>#define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS  15
>#define VHOST_USER_PROTOCOL_F_STATUS   16
>#define VHOST_USER_PROTOCOL_F_XEN_MMAP 17
> +  #define VHOST_USER_PROTOCOL_F_SUSPEND  18
>  
>  Front-end message types
>  ---
> @@ -1440,6 +1446,31 @@ Front-end message types
>query the back-end for its device status as defined in the Virtio
>specification.
>  
> +``VHOST_USER_SUSPEND``
> +  :id: 41
> +  :equivalent ioctl: VHOST_VDPA_SUSPEND
> +  :request payload: N/A
> +  :reply payload: N/A
> +
> +  When the ``VHOST_USER_PROTOCOL_F_SUSPEND`` protocol feature has been
> +  successfully negotiated, this message is submitted by the front-end to
> +  have the back-end cease all operations except for handling vhost-user
> +  requests.  The back-end must stop processing all virtqueues, and it
> +  must not perform any background operations.  It may not resume until a

"background operations" are not defined. What does it mean:
- Anything that writes to memory slots
- Anything that changes the visible state of the device
- Anything that changes the non-visible internal state of the device
- etc
?

> +  subsequent ``VHOST_USER_RESUME`` call.
> +
> +``VHOST_USER_RESUME``
> +  :id: 42
> +  :equivalent ioctl: VHOST_VDPA_RESUME
> +  :request payload: N/A
> +  :reply payload: N/A
> +
> +  When the ``VHOST_USER_PROTOCOL_F_SUSPEND`` protocol feature has been
> +  successfully negotiated, this message is submitted by the front-end to
> +  allow the back-end to resume operations after having been suspended
> +  before.  The back-end 

Re: [PATCH] thread-pool: signal "request_cond" while locked

2023-07-18 Thread Stefan Hajnoczi
On Fri, Jul 14, 2023 at 04:27:20PM +0100, Anthony PERARD wrote:
> From: Anthony PERARD 
> 
> thread_pool_free() might have been called on the `pool`, which would
> be a reason for worker_thread() to quit. In this case,
> `pool->request_cond` has been destroyed.
> 
> If worker_thread() didn't manage to signal `request_cond` before it
> was destroyed by thread_pool_free(), we get:
> util/qemu-thread-posix.c:198: qemu_cond_signal: Assertion 
> `cond->initialized' failed.
> 
> One backtrace:
> __GI___assert_fail (assertion=0x5614abcb "cond->initialized", 
> file=0x5614ab88 "util/qemu-thread-posix.c", line=198,
>   function=0x5614ad80 <__PRETTY_FUNCTION__.17104> "qemu_cond_signal") 
> at assert.c:101
> qemu_cond_signal (cond=0x7fffb800db30) at util/qemu-thread-posix.c:198
> worker_thread (opaque=0x7fffb800dab0) at util/thread-pool.c:129
> qemu_thread_start (args=0x7fffb8000b20) at util/qemu-thread-posix.c:505
> start_thread (arg=) at pthread_create.c:486
> 
> Reported here:
> 
> https://lore.kernel.org/all/ZJwoK50FcnTSfFZ8@MacBook-Air-de-Roger.local/T/#u
> 
> To avoid the issue, keep the lock held while signalling `request_cond`.
> 
> Fixes: 900fa208f506 ("thread-pool: replace semaphore with condition variable")
> Signed-off-by: Anthony PERARD 
> ---
> 
> There may be an issue in thread_pool_submit_aio() as well with
> signalling `request_cond`, but perhaps it's much less likely to
> trigger?

The caller must not submit work while destroying the pool, so I'm not
sure when this problem could occur with thread_pool_submit_aio()?
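For readers following along, the pattern the patch adopts can be sketched with plain pthreads (names are illustrative, not QEMU's):

```c
#include <pthread.h>
#include <stdbool.h>

struct pool {
    pthread_mutex_t lock;
    pthread_cond_t request_cond;
    bool stopping;
};

static void pool_init(struct pool *p)
{
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->request_cond, NULL);
    p->stopping = false;
}

/* Worker-exit path: signal while still holding the lock.  A destroyer
 * must acquire the lock before calling pthread_cond_destroy(), so the
 * condvar is guaranteed to still exist at the point of the signal.
 * Signalling after unlock reintroduces the destroy/signal race. */
static void pool_worker_exit(struct pool *p)
{
    pthread_mutex_lock(&p->lock);
    p->stopping = true;
    pthread_cond_signal(&p->request_cond);
    pthread_mutex_unlock(&p->lock);
}
```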

> ---
>  util/thread-pool.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

This is QEMU 8.1 material.

Reviewed-by: Stefan Hajnoczi 



Re: [PATCH v2] block: Fix pad_request's request restriction

2023-07-17 Thread Stefan Hajnoczi
On Fri, Jul 14, 2023 at 10:59:38AM +0200, Hanna Czenczek wrote:
> bdrv_pad_request() relies on requests' lengths not to exceed SIZE_MAX,
> which bdrv_check_qiov_request() does not guarantee.
> 
> bdrv_check_request32() however will guarantee this, and both of
> bdrv_pad_request()'s callers (bdrv_co_preadv_part() and
> bdrv_co_pwritev_part()) already run it before calling
> bdrv_pad_request().  Therefore, bdrv_pad_request() can safely call
> bdrv_check_request32() without expecting error, too.
> 
> In effect, this patch will not change guest-visible behavior.  It is a
> clean-up to tighten a condition to match what is guaranteed by our
> callers, and which exists purely to show clearly why the subsequent
> assertion (`assert(*bytes <= SIZE_MAX)`) is always true.
> 
> Note there is a difference between the interfaces of
> bdrv_check_qiov_request() and bdrv_check_request32(): The former takes
> an errp, the latter does not, so we can no longer just pass
> &error_abort.  Instead, we need to check the returned value.  While we
> do expect success (because the callers have already run this function),
> an assert(ret == 0) is not much simpler than just to return an error if
> it occurs, so let us handle errors by returning them up the stack now.
> 
> Reported-by: Peter Maydell 
> Fixes: 18743311b829cafc1737a5f20bc3248d5f91ee2a
>("block: Collapse padded I/O vecs exceeding IOV_MAX")
> Signed-off-by: Hanna Czenczek 
> ---
> v2:
> - Added paragraph to the commit message to express explicitly that this
>   patch will not change guest-visible behavior
> - (No code changes)
> ---
>  block/io.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
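A simplified stand-in for the guarantee the patch relies on (this is not the real bdrv_check_request32(), which performs additional checks; it only models the bound that makes the later `assert(*bytes <= SIZE_MAX)` trivially true):

```c
#include <stdint.h>
#include <limits.h>
#include <errno.h>

/* After this check succeeds, `bytes` fits in 32 bits (and hence in
 * size_t on all hosts QEMU supports), and offset + bytes cannot
 * overflow int64_t. */
static int check_request32_sketch(int64_t offset, int64_t bytes)
{
    if (offset < 0 || bytes < 0) {
        return -EIO;
    }
    if (bytes > INT_MAX) {
        return -EIO;   /* INT_MAX <= SIZE_MAX everywhere QEMU runs */
    }
    if (offset > INT64_MAX - bytes) {
        return -EIO;   /* offset + bytes must not overflow */
    }
    return 0;
}
```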

Thanks, applied to my block-next tree:
https://gitlab.com/stefanha/qemu/commits/block-next

Stefan



[PULL for-8.1 0/1] Block patches

2023-07-17 Thread Stefan Hajnoczi
The following changes since commit ed8ad9728a9c0eec34db9dff61dfa2f1dd625637:

  Merge tag 'pull-tpm-2023-07-14-1' of https://github.com/stefanberger/qemu-tpm 
into staging (2023-07-15 14:54:04 +0100)

are available in the Git repository at:

  https://gitlab.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to 66547f416a61e0cb711dc76821890242432ba193:

  block/nvme: invoke blk_io_plug_call() outside q->lock (2023-07-17 09:17:41 -0400)


Pull request

Fix the hang in the nvme:// block driver during startup.

----

Stefan Hajnoczi (1):
  block/nvme: invoke blk_io_plug_call() outside q->lock

 block/nvme.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

-- 
2.40.1




[PULL for-8.1 1/1] block/nvme: invoke blk_io_plug_call() outside q->lock

2023-07-17 Thread Stefan Hajnoczi
blk_io_plug_call() is invoked outside a blk_io_plug()/blk_io_unplug()
section while opening the NVMe drive from:

  nvme_file_open() ->
  nvme_init() ->
  nvme_identify() ->
  nvme_admin_cmd_sync() ->
  nvme_submit_command() ->
  blk_io_plug_call()

blk_io_plug_call() immediately invokes the given callback when the
current thread is not plugged, as is the case during nvme_file_open().

Unfortunately, nvme_submit_command() calls blk_io_plug_call() with
q->lock still held:

...
q->sq.tail = (q->sq.tail + 1) % NVME_QUEUE_SIZE;
q->need_kick++;
blk_io_plug_call(nvme_unplug_fn, q);
qemu_mutex_unlock(&q->lock);
^^^

nvme_unplug_fn() deadlocks trying to acquire q->lock because the lock is
already acquired by the same thread. The symptom is that QEMU hangs
during startup while opening the NVMe drive.

Fix this by moving the blk_io_plug_call() outside q->lock. This is safe
because no other thread runs code related to this queue and
blk_io_plug_call()'s internal state is immune to thread safety issues
since it is thread-local.
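The self-deadlock and its fix can be demonstrated with a small sketch (trylock stands in for the real lock acquisition so the example detects the deadlock instead of hanging; all names are illustrative, not the real nvme.c code):

```c
#include <pthread.h>
#include <stdbool.h>

struct queue {
    pthread_mutex_t lock;
    int need_kick;
};

/* Stand-in for nvme_unplug_fn(): it needs q->lock.  trylock lets the
 * sketch report "lock already held" instead of blocking forever (the
 * default pthread mutex is not recursive). */
static bool unplug_fn(struct queue *q)
{
    if (pthread_mutex_trylock(&q->lock) != 0) {
        return false;                 /* would have self-deadlocked */
    }
    pthread_mutex_unlock(&q->lock);
    return true;
}

/* Buggy ordering: callback invoked with the lock still held. */
static bool submit_buggy(struct queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->need_kick++;
    bool ok = unplug_fn(q);           /* deadlocks in the real code */
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/* Fixed ordering, as in the patch: drop the lock, then invoke. */
static bool submit_fixed(struct queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->need_kick++;
    pthread_mutex_unlock(&q->lock);
    return unplug_fn(q);
}
```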

Reported-by: Lukáš Doktor 
Signed-off-by: Stefan Hajnoczi 
Tested-by: Lukas Doktor 
Message-id: 20230712191628.252806-1-stefa...@redhat.com
Fixes: f2e590002bd6 ("block/nvme: convert to blk_io_plug_call() API")
Signed-off-by: Stefan Hajnoczi 
---
 block/nvme.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/nvme.c b/block/nvme.c
index 7ca85bc44a..b6e95f0b7e 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -501,8 +501,9 @@ static void nvme_submit_command(NVMeQueuePair *q, NVMeRequest *req,
q->sq.tail * NVME_SQ_ENTRY_BYTES, cmd, sizeof(*cmd));
 q->sq.tail = (q->sq.tail + 1) % NVME_QUEUE_SIZE;
 q->need_kick++;
+qemu_mutex_unlock(&q->lock);
+
 blk_io_plug_call(nvme_unplug_fn, q);
-qemu_mutex_unlock(&q->lock);
 }
 
 static void nvme_admin_cmd_sync_cb(void *opaque, int ret)
-- 
2.40.1




drain_call_rcu() vs nested event loops

2023-07-13 Thread Stefan Hajnoczi
Hi,
I've encountered a bug where two vcpu threads enter a device's MMIO
emulation callback at the same time. This is never supposed to happen
thanks to the Big QEMU Lock (BQL), but drain_call_rcu() and nested event
loops make it possible:

1. A device's MMIO emulation callback invokes AIO_WAIT_WHILE().
2. A device_add monitor command runs in AIO_WAIT_WHILE()'s aio_poll()
   nested event loop.
3. qmp_device_add() -> drain_call_rcu() is called and the BQL is
   temporarily dropped.
4. Another vcpu thread dispatches the same device's MMIO callback
   because it is now able to acquire the BQL.

I've included the backtraces below if you want to see the details. They
are from a RHEL qemu-kvm 6.2.0-35 coredump but I haven't found anything
in qemu.git/master that would fix this.

One fix is to make qmp_device_add() a coroutine and schedule a BH in the
iohandler AioContext. That way the coroutine must wait until the nested
event loop finishes before its BH executes. drain_call_rcu() will never
be called from a nested event loop and the problem does not occur
anymore.
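The effect of that fix can be illustrated with a toy model (all names hypothetical): a BH scheduled in the iohandler context is never dispatched by the nested loop, so the drain only ever runs from the outer loop:

```c
#include <stdbool.h>

struct toy_ctx {
    bool in_nested_loop;   /* true while AIO_WAIT_WHILE()'s aio_poll() runs */
    bool drain_bh_pending; /* BH scheduled in the iohandler context */
    int drain_calls;       /* times drain_call_rcu() actually ran */
};

static void qmp_device_add_toy(struct toy_ctx *c)
{
    /* Instead of calling drain_call_rcu() directly (and dropping the
     * BQL inside a nested event loop), defer it to a bottom half. */
    c->drain_bh_pending = true;
}

/* Only the outer event loop dispatches iohandler BHs. */
static void loop_iteration(struct toy_ctx *c)
{
    if (!c->in_nested_loop && c->drain_bh_pending) {
        c->drain_bh_pending = false;
        c->drain_calls++;  /* safe: no nested event loop on the stack */
    }
}
```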

Another possibility is to remove the following in monitor_qmp_dispatcher_co():

  /*
   * Move the coroutine from iohandler_ctx to qemu_aio_context for
   * executing the command handler so that it can make progress if it
   * involves an AIO_WAIT_WHILE().
   */
  aio_co_schedule(qemu_get_aio_context(), qmp_dispatcher_co);
  qemu_coroutine_yield();

By executing QMP commands in the iohandler AioContext by default, we can
prevent issues like this in the future. However, there might be some QMP
commands that assume they are running in the qemu_aio_context (e.g.
coroutine commands that yield) and they might need to manually move to
the qemu_aio_context.

What do you think?

Stefan
---
Thread 41 (Thread 0x7fdc3dffb700 (LWP 910296)):
#0  0x7fde88ac99bd in syscall () from /lib64/libc.so.6
#1  0x55bd7a2e066f in qemu_futex_wait (val=, f=) at 
/usr/src/debug/qemu-kvm-6.2.0-35.module+el8.9.0+19024+8193e2ac.x86_64/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0x7fdc3dffa2d0) at 
../util/qemu-thread-posix.c:510
#3  0x55bd7a2e8e54 in drain_call_rcu () at ../util/rcu.c:347
#4  0x55bd79f63d1e in qmp_device_add (qdict=, 
ret_data=, errp=) at ../softmmu/qdev-monitor.c:863
#5  0x55bd7a2d420d in do_qmp_dispatch_bh (opaque=0x7fde8c22aee0) at 
../qapi/qmp-dispatch.c:129
#6  0x55bd7a2ef3bd in aio_bh_call (bh=0x7fdc6015cd50) at ../util/async.c:174
#7  aio_bh_poll (ctx=ctx@entry=0x55bd7c910f40) at ../util/async.c:174
#8  0x55bd7a2dd3b2 in aio_poll (ctx=0x55bd7c910f40, 
blocking=blocking@entry=true) at ../util/aio-posix.c:659
#9  0x55bd7a2effea in aio_wait_bh_oneshot (ctx=0x55bd7ca980e0, 
cb=cb@entry=0x55bd7a11a9c0 , 
opaque=opaque@entry=0x55bd7e585c40) at ../util/aio-wait.c:85
#10 0x55bd7a11b30b in virtio_blk_data_plane_stop (vdev=) at 
../hw/block/dataplane/virtio-blk.c:333
#11 0x55bd7a0591e0 in virtio_bus_stop_ioeventfd 
(bus=bus@entry=0x55bd7cb57ba8) at ../hw/virtio/virtio-bus.c:258
#12 0x55bd7a05995f in virtio_bus_stop_ioeventfd 
(bus=bus@entry=0x55bd7cb57ba8) at ../hw/virtio/virtio-bus.c:250
#13 0x55bd7a05b238 in virtio_pci_stop_ioeventfd (proxy=0x55bd7cb4f9a0) at 
../hw/virtio/virtio-pci.c:1289
#14 virtio_pci_common_write (opaque=0x55bd7cb4f9a0, addr=, 
val=, size=) at ../hw/virtio/virtio-pci.c:1289
^^
#15 0x55bd7a0f6777 in memory_region_write_accessor (mr=0x55bd7cb50410, 
addr=, value=, size=1, shift=, 
mask=, attrs=...) at ../softmmu/memory.c:492
#16 0x55bd7a0f320e in access_with_adjusted_size (addr=addr@entry=20, 
value=value@entry=0x7fdc3dffa5c8, size=size@entry=1, access_size_min=, access_size_max=, 
access_fn=0x55bd7a0f6710 , mr=0x55bd7cb50410, 
attrs=...) at ../softmmu/memory.c:554
#17 0x55bd7a0f62a3 in memory_region_dispatch_write 
(mr=mr@entry=0x55bd7cb50410, addr=20, data=, op=, 
attrs=attrs@entry=...) at ../softmmu/memory.c:1504
#18 0x55bd7a0e7f2e in flatview_write_continue (fv=fv@entry=0x55bd7d17cad0, 
addr=addr@entry=4236247060, attrs=..., ptr=ptr@entry=0x7fde84003028, 
len=len@entry=1, addr1=, l=, 
mr=0x55bd7cb50410) at 
/usr/src/debug/qemu-kvm-6.2.0-35.module+el8.9.0+19024+8193e2ac.x86_64/include/qemu/host-utils.h:165
#19 0x55bd7a0e8093 in flatview_write (fv=0x55bd7d17cad0, addr=4236247060, 
attrs=..., buf=0x7fde84003028, len=1) at ../softmmu/physmem.c:2856
#20 0x55bd7a0ebc6f in address_space_write (as=, 
addr=, attrs=..., buf=, len=) at 
../softmmu/physmem.c:2952
#21 0x55bd7a1a28b9 in kvm_cpu_exec (cpu=cpu@entry=0x55bd7cc32bf0) at 
../accel/kvm/kvm-all.c:2995
#22 0x55bd7a1a36e5 in kvm_vcpu_thread_fn (arg=0x55bd7cc32bf0) at 
../accel/kvm/kvm-accel-ops.c:49
#23 0x55bd7a2dfdd4 in qemu_thread_start (args=0x55bd7cc41f20) at 
../util/qemu-thread-posix.c:585
#24 0x7fde88e5d1ca in start_thread () from /lib64/libpthread.so.0
#25 0x7fde88ac9e73 in clone () from /lib64/libc.so.6

Thread 1 (Thread 

[PULL 1/1] virtio-blk: fix host notifier issues during dataplane start/stop

2023-07-12 Thread Stefan Hajnoczi
The main loop thread can consume 100% CPU when using --device
virtio-blk-pci,iothread=. ppoll() constantly returns but
reading virtqueue host notifiers fails with EAGAIN. The file descriptors
are stale and remain registered with the AioContext because of bugs in
the virtio-blk dataplane start/stop code.

The problem is that the dataplane start/stop code involves drain
operations, which call virtio_blk_drained_begin() and
virtio_blk_drained_end() at points where the host notifier is not
operational:
- In virtio_blk_data_plane_start(), blk_set_aio_context() drains after
  vblk->dataplane_started has been set to true but the host notifier has
  not been attached yet.
- In virtio_blk_data_plane_stop(), blk_drain() and blk_set_aio_context()
  drain after the host notifier has already been detached but with
  vblk->dataplane_started still set to true.

I would like to simplify ->ioeventfd_start/stop() to avoid interactions
with drain entirely, but couldn't find a way to do that. Instead, this
patch accepts the fragile nature of the code and reorders it so that
vblk->dataplane_started is false during drain operations. This way the
virtio_blk_drained_begin() and virtio_blk_drained_end() calls don't
touch the host notifier. The result is that
virtio_blk_data_plane_start() and virtio_blk_data_plane_stop() have
complete control over the host notifier and stale file descriptors are
no longer left in the AioContext.

This patch fixes the 100% CPU consumption in the main loop thread and
correctly moves host notifier processing to the IOThread.

Fixes: 1665d9326fd2 ("virtio-blk: implement BlockDevOps->drained_begin()")
Reported-by: Lukáš Doktor 
Signed-off-by: Stefan Hajnoczi 
Tested-by: Lukas Doktor 
Message-id: 20230704151527.193586-1-stefa...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 hw/block/dataplane/virtio-blk.c | 67 +++--
 1 file changed, 38 insertions(+), 29 deletions(-)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index c227b39408..da36fcfd0b 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -219,13 +219,6 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
 
 memory_region_transaction_commit();
 
-/*
- * These fields are visible to the IOThread so we rely on implicit barriers
- * in aio_context_acquire() on the write side and aio_notify_accept() on
- * the read side.
- */
-s->starting = false;
-vblk->dataplane_started = true;
 trace_virtio_blk_data_plane_start(s);
 
 old_context = blk_get_aio_context(s->conf->conf.blk);
@@ -244,6 +237,18 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
 event_notifier_set(virtio_queue_get_host_notifier(vq));
 }
 
+/*
+ * These fields must be visible to the IOThread when it processes the
+ * virtqueue, otherwise it will think dataplane has not started yet.
+ *
+ * Make sure ->dataplane_started is false when blk_set_aio_context() is
+ * called above so that draining does not cause the host notifier to be
+ * detached/attached prematurely.
+ */
+s->starting = false;
+vblk->dataplane_started = true;
+smp_wmb(); /* paired with aio_notify_accept() on the read side */
+
 /* Get this show started by hooking up our callbacks */
 if (!blk_in_drain(s->conf->conf.blk)) {
 aio_context_acquire(s->ctx);
@@ -273,7 +278,6 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
   fail_guest_notifiers:
 vblk->dataplane_disabled = true;
 s->starting = false;
-vblk->dataplane_started = true;
 return -ENOSYS;
 }
 
@@ -327,6 +331,32 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
 aio_wait_bh_oneshot(s->ctx, virtio_blk_data_plane_stop_bh, s);
 }
 
+/*
+ * Batch all the host notifiers in a single transaction to avoid
+ * quadratic time complexity in address_space_update_ioeventfds().
+ */
+memory_region_transaction_begin();
+
+for (i = 0; i < nvqs; i++) {
+virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), i, false);
+}
+
+/*
+ * The transaction expects the ioeventfds to be open when it
+ * commits. Do it now, before the cleanup loop.
+ */
+memory_region_transaction_commit();
+
+for (i = 0; i < nvqs; i++) {
+virtio_bus_cleanup_host_notifier(VIRTIO_BUS(qbus), i);
+}
+
+/*
+ * Set ->dataplane_started to false before draining so that host notifiers
+ * are not detached/attached anymore.
+ */
+vblk->dataplane_started = false;
+
 aio_context_acquire(s->ctx);
 
 /* Wait for virtio_blk_dma_restart_bh() and in flight I/O to complete */
@@ -340,32 +370,11 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
 
 aio_context_release(s->ctx);
 
-/*
- * Batch all the host notifiers in a single transaction to avoid
- * quadratic time complexity in address_space_up

[PULL 0/1] Block patches

2023-07-12 Thread Stefan Hajnoczi
The following changes since commit 887cba855bb6ff4775256f7968409281350b568c:

  configure: Fix cross-building for RISCV host (v5) (2023-07-11 17:56:09 +0100)

are available in the Git repository at:

  https://gitlab.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to 75dcb4d790bbe5327169fd72b185960ca58e2fa6:

  virtio-blk: fix host notifier issues during dataplane start/stop (2023-07-12 
15:20:32 -0400)


Pull request



Stefan Hajnoczi (1):
  virtio-blk: fix host notifier issues during dataplane start/stop

 hw/block/dataplane/virtio-blk.c | 67 +++--
 1 file changed, 38 insertions(+), 29 deletions(-)

-- 
2.40.1




Re: [PATCH] block: Fix pad_request's request restriction

2023-07-12 Thread Stefan Hajnoczi
On Wed, 12 Jul 2023 at 10:51, Hanna Czenczek  wrote:
>
> On 12.07.23 16:15, Stefan Hajnoczi wrote:
> > On Wed, Jul 12, 2023 at 09:41:05AM +0200, Hanna Czenczek wrote:
> >> On 11.07.23 22:23, Stefan Hajnoczi wrote:
> >>> On Fri, Jun 09, 2023 at 10:33:16AM +0200, Hanna Czenczek wrote:
> >>>> bdrv_pad_request() relies on requests' lengths not to exceed SIZE_MAX,
> >>>> which bdrv_check_qiov_request() does not guarantee.
> >>>>
> >>>> bdrv_check_request32() however will guarantee this, and both of
> >>>> bdrv_pad_request()'s callers (bdrv_co_preadv_part() and
> >>>> bdrv_co_pwritev_part()) already run it before calling
> >>>> bdrv_pad_request().  Therefore, bdrv_pad_request() can safely call
> >>>> bdrv_check_request32() without expecting error, too.
> >>>>
> >>>> There is one difference between bdrv_check_qiov_request() and
> >>>> bdrv_check_request32(): The former takes an errp, the latter does not,
> >>>> so we can no longer just pass &error_abort.  Instead, we need to check
> >>>> the returned value.  While we do expect success (because the callers
> >>>> have already run this function), an assert(ret == 0) is not much simpler
> >>>> than just to return an error if it occurs, so let us handle errors by
> >>>> returning them up the stack now.
> >>> Is this patch intended to silence a Coverity warning or can this be
> >>> triggered by a guest?
> >> Neither.  There was a Coverity warning about the `assert(*bytes <=
> >> SIZE_MAX)`, which is always true on 32-bit architectures. Regardless of
> >> Coverity, Peter inquired how bdrv_check_qiov_request() would guarantee this
> >> condition (as the comments I’ve put above the assertions say).  It doesn’t,
> >> only bdrv_check_request32() does, which I was thinking of, and just 
> >> confused
> >> the two.
> > It's unclear to me whether this patch silences a Coverity warning or
> > not? You said "neither", but then you acknowledged there was a Coverity
> > warning. Maybe "was" (past-tense) means something else already fixed it
> > but I don't see any relevant commits in the git log.
>
> There was and is no fix for the Coverity warning.  I have mentioned that
> warning because the question as to why the code uses
> bdrv_check_qiov_request() came in the context of discussing it
> (https://lists.nongnu.org/archive/html/qemu-devel/2023-06/msg01809.html).
>
> I’m not planning on fixing the Coverity warning in the code. `assert(x
> <= SIZE_MAX)` to me is an absolutely reasonable piece of code, even if
> always true (on some platforms), in fact, I find it a good thing if
> asserted conditions are always true, not least because then the compiler
> can optimize them out.  I don’t think we should make it more complicated
> to make Coverity happier.
>
> >> As the commit message says, all callers already run bdrv_check_request32(),
> >> so I expect this change to functionally be a no-op.  (That is why the
> >> pre-patch code runs bdrv_check_qiov_request() with `&error_abort`.)
> > Okay, this means a guest cannot trigger the assertion failure.
> >
> > Please mention the intent in the commit description: a code cleanup
> > requested by Peter and/or a Coverity warning fix, but definitely not
> > guest triggerable assertion failure.
>
> Sure!
>
> >>> I find this commit description and patch confusing. Instead of checking
> >>> the actual SIZE_MAX value that bdrv_pad_request() relies on, we use a
> >>> 32-bit offsets/lengths helper because it checks INT_MAX or SIZE_MAX (but
> >>> really INT_MAX, because that's always smaller on host architectures that
> >>> QEMU supports).
> >> I preferred to use a bounds-checking function that we already use for
> >> requests, and that happens to be used to limit all I/O that ends up here in
> >> bdrv_pad_request() anyway, instead of adding a new specific limit.
> >>
> >> It doesn’t matter to me, though.  The callers already ensure that everything
> >> is in bounds, so I’d be happy with anything, ranging from keeping the bare
> >> assertions with no checks beforehand, over specifically checking SIZE_MAX
> >> and returning an error then, to bdrv_check_request32().
> >>
> >> (I thought repeating the simple bounds check that all callers already did
> >> for verbosity would be the most robust and obvious way to do it, but now I’m
> >> biting myself for not just using bare assertions annotated with “Caller must
> >> guarantee this” from the start...)

Re: [PATCH] virtio-blk: fix host notifier issues during dataplane start/stop

2023-07-12 Thread Stefan Hajnoczi
On Tue, Jul 04, 2023 at 05:15:27PM +0200, Stefan Hajnoczi wrote:
> The main loop thread can consume 100% CPU when using --device
> virtio-blk-pci,iothread=. ppoll() constantly returns but
> reading virtqueue host notifiers fails with EAGAIN. The file descriptors
> are stale and remain registered with the AioContext because of bugs in
> the virtio-blk dataplane start/stop code.
> 
> The problem is that the dataplane start/stop code involves drain
> operations, which call virtio_blk_drained_begin() and
> virtio_blk_drained_end() at points where the host notifier is not
> operational:
> - In virtio_blk_data_plane_start(), blk_set_aio_context() drains after
>   vblk->dataplane_started has been set to true but the host notifier has
>   not been attached yet.
> - In virtio_blk_data_plane_stop(), blk_drain() and blk_set_aio_context()
>   drain after the host notifier has already been detached but with
>   vblk->dataplane_started still set to true.
> 
> I would like to simplify ->ioeventfd_start/stop() to avoid interactions
> with drain entirely, but couldn't find a way to do that. Instead, this
> patch accepts the fragile nature of the code and reorders it so that
> vblk->dataplane_started is false during drain operations. This way the
> virtio_blk_drained_begin() and virtio_blk_drained_end() calls don't
> touch the host notifier. The result is that
> virtio_blk_data_plane_start() and virtio_blk_data_plane_stop() have
> complete control over the host notifier and stale file descriptors are
> no longer left in the AioContext.
> 
> This patch fixes the 100% CPU consumption in the main loop thread and
> correctly moves host notifier processing to the IOThread.
> 
> Fixes: 1665d9326fd2 ("virtio-blk: implement BlockDevOps->drained_begin()")
> Reported-by: Lukáš Doktor 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  hw/block/dataplane/virtio-blk.c | 67 +++--
>  1 file changed, 38 insertions(+), 29 deletions(-)

Thanks, applied to my block tree:
https://gitlab.com/stefanha/qemu/commits/block

Stefan




[PATCH] block/nvme: invoke blk_io_plug_call() outside q->lock

2023-07-12 Thread Stefan Hajnoczi
blk_io_plug_call() is invoked outside a blk_io_plug()/blk_io_unplug()
section while opening the NVMe drive from:

  nvme_file_open() ->
  nvme_init() ->
  nvme_identify() ->
  nvme_admin_cmd_sync() ->
  nvme_submit_command() ->
  blk_io_plug_call()

blk_io_plug_call() immediately invokes the given callback when the
current thread is not plugged, as is the case during nvme_file_open().

Unfortunately, nvme_submit_command() calls blk_io_plug_call() with
q->lock still held:

...
q->sq.tail = (q->sq.tail + 1) % NVME_QUEUE_SIZE;
q->need_kick++;
blk_io_plug_call(nvme_unplug_fn, q);
qemu_mutex_unlock(&q->lock);
^^^

nvme_unplug_fn() deadlocks trying to acquire q->lock because the lock is
already acquired by the same thread. The symptom is that QEMU hangs
during startup while opening the NVMe drive.

Fix this by moving the blk_io_plug_call() outside q->lock. This is safe
because no other thread runs code related to this queue and
blk_io_plug_call()'s internal state is immune to thread safety issues
since it is thread-local.
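
The lock-ordering bug can be sketched outside QEMU. Below is a minimal Python model — `threading.Lock` standing in for the non-recursive `q->lock` (a `QemuMutex`), with illustrative function names that only mimic `nvme_submit_command()`/`nvme_unplug_fn()`; it is a sketch of the ordering, not the real code:

```python
import threading

lock = threading.Lock()  # non-recursive, like QEMU's q->lock

def unplug_fn():
    """Stand-in for nvme_unplug_fn(): it must take the queue lock itself."""
    # A real QemuMutex would block forever here; acquire non-blocking so
    # the buggy ordering is observable instead of hanging the test.
    if not lock.acquire(blocking=False):
        return "deadlock"
    lock.release()
    return "ok"

def submit_pre_patch():
    # pre-patch ordering: callback invoked with q->lock still held
    with lock:
        return unplug_fn()

def submit_post_patch():
    # post-patch ordering: update queue state under the lock,
    # drop it, then invoke the callback
    with lock:
        pass  # q->sq.tail update, q->need_kick++ would happen here
    return unplug_fn()
```

Running `submit_pre_patch()` reproduces the self-deadlock; `submit_post_patch()` completes cleanly, matching the reordering in the patch below.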

Reported-by: Lukáš Doktor 
Fixes: f2e590002bd6 ("block/nvme: convert to blk_io_plug_call() API")
Signed-off-by: Stefan Hajnoczi 
---
 block/nvme.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/nvme.c b/block/nvme.c
index 7ca85bc44a..b6e95f0b7e 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -501,8 +501,9 @@ static void nvme_submit_command(NVMeQueuePair *q, 
NVMeRequest *req,
q->sq.tail * NVME_SQ_ENTRY_BYTES, cmd, sizeof(*cmd));
 q->sq.tail = (q->sq.tail + 1) % NVME_QUEUE_SIZE;
 q->need_kick++;
+qemu_mutex_unlock(&q->lock);
+
 blk_io_plug_call(nvme_unplug_fn, q);
-qemu_mutex_unlock(&q->lock);
 }
 
 static void nvme_admin_cmd_sync_cb(void *opaque, int ret)
-- 
2.40.1




Re: [PATCH] block: Fix pad_request's request restriction

2023-07-12 Thread Stefan Hajnoczi
On Wed, Jul 12, 2023 at 09:41:05AM +0200, Hanna Czenczek wrote:
> On 11.07.23 22:23, Stefan Hajnoczi wrote:
> > On Fri, Jun 09, 2023 at 10:33:16AM +0200, Hanna Czenczek wrote:
> > > bdrv_pad_request() relies on requests' lengths not to exceed SIZE_MAX,
> > > which bdrv_check_qiov_request() does not guarantee.
> > > 
> > > bdrv_check_request32() however will guarantee this, and both of
> > > bdrv_pad_request()'s callers (bdrv_co_preadv_part() and
> > > bdrv_co_pwritev_part()) already run it before calling
> > > bdrv_pad_request().  Therefore, bdrv_pad_request() can safely call
> > > bdrv_check_request32() without expecting error, too.
> > > 
> > > There is one difference between bdrv_check_qiov_request() and
> > > bdrv_check_request32(): The former takes an errp, the latter does not,
> > > so we can no longer just pass &error_abort.  Instead, we need to check
> > > the returned value.  While we do expect success (because the callers
> > > have already run this function), an assert(ret == 0) is not much simpler
> > > than just to return an error if it occurs, so let us handle errors by
> > > returning them up the stack now.
> > Is this patch intended to silence a Coverity warning or can this be
> > triggered by a guest?
> 
> Neither.  There was a Coverity warning about the `assert(*bytes <=
> SIZE_MAX)`, which is always true on 32-bit architectures. Regardless of
> Coverity, Peter inquired how bdrv_check_qiov_request() would guarantee this
> condition (as the comments I’ve put above the assertions say).  It doesn’t,
> only bdrv_check_request32() does, which I was thinking of, and just confused
> the two.

It's unclear to me whether this patch silences a Coverity warning or
not? You said "neither", but then you acknowledged there was a Coverity
warning. Maybe "was" (past-tense) means something else already fixed it
but I don't see any relevant commits in the git log.

> As the commit message says, all callers already run bdrv_check_request32(),
> so I expect this change to functionally be a no-op.  (That is why the
> pre-patch code runs bdrv_check_qiov_request() with `&error_abort`.)

Okay, this means a guest cannot trigger the assertion failure.

Please mention the intent in the commit description: a code cleanup
requested by Peter and/or a Coverity warning fix, but definitely not
guest triggerable assertion failure.

> 
> > I find this commit description and patch confusing. Instead of checking
> > the actual SIZE_MAX value that bdrv_pad_request() relies on, we use a
> > 32-bit offsets/lengths helper because it checks INT_MAX or SIZE_MAX (but
> > really INT_MAX, because that's always smaller on host architectures that
> > QEMU supports).
> 
> I preferred to use a bounds-checking function that we already use for
> requests, and that happens to be used to limit all I/O that ends up here in
> bdrv_pad_request() anyway, instead of adding a new specific limit.
> 
> It doesn’t matter to me, though.  The callers already ensure that everything
> is in bounds, so I’d be happy with anything, ranging from keeping the bare
> assertions with no checks beforehand, over specifically checking SIZE_MAX
> and returning an error then, to bdrv_check_request32().
> 
> (I thought repeating the simple bounds check that all callers already did
> for verbosity would be the most robust and obvious way to do it, but now I’m
> biting myself for not just using bare assertions annotated with “Caller must
> guarantee this” from the start...)

Okay. I looked at the code more and don't see a cleanup for the overall
problem of duplicated checks and type mismatches (size_t vs int64_t)
that is appropriate for this patch.

I'm okay with this fix, but please clarify the intent as mentioned above.

> 
> Hanna
> 
> > Vladimir: Is this the intended use of bdrv_check_request32()?
> > 
> > > Reported-by: Peter Maydell 
> > > Fixes: 18743311b829cafc1737a5f20bc3248d5f91ee2a
> > > ("block: Collapse padded I/O vecs exceeding IOV_MAX")
> > > Signed-off-by: Hanna Czenczek 
> > > ---
> > >   block/io.c | 8 ++--
> > >   1 file changed, 6 insertions(+), 2 deletions(-)
> > > diff --git a/block/io.c b/block/io.c
> > > index 30748f0b59..e43b4ad09b 100644
> > > --- a/block/io.c
> > > +++ b/block/io.c
> > > @@ -1710,7 +1710,11 @@ static int bdrv_pad_request(BlockDriverState *bs,
> > >   int sliced_niov;
> > >   size_t sliced_head, sliced_tail;
> > > -bdrv_check_qiov_request(*offset, *bytes, *qiov, *qiov_offset, 
> > > &error_abort);
> > > +/* Should have been checked by the caller already */

Re: [PATCH] block: Fix pad_request's request restriction

2023-07-11 Thread Stefan Hajnoczi
On Fri, Jun 09, 2023 at 10:33:16AM +0200, Hanna Czenczek wrote:
> bdrv_pad_request() relies on requests' lengths not to exceed SIZE_MAX,
> which bdrv_check_qiov_request() does not guarantee.
> 
> bdrv_check_request32() however will guarantee this, and both of
> bdrv_pad_request()'s callers (bdrv_co_preadv_part() and
> bdrv_co_pwritev_part()) already run it before calling
> bdrv_pad_request().  Therefore, bdrv_pad_request() can safely call
> bdrv_check_request32() without expecting error, too.
> 
> There is one difference between bdrv_check_qiov_request() and
> bdrv_check_request32(): The former takes an errp, the latter does not,
> so we can no longer just pass &error_abort.  Instead, we need to check
> the returned value.  While we do expect success (because the callers
> have already run this function), an assert(ret == 0) is not much simpler
> than just to return an error if it occurs, so let us handle errors by
> returning them up the stack now.

Is this patch intended to silence a Coverity warning or can this be
triggered by a guest?

I find this commit description and patch confusing. Instead of checking
the actual SIZE_MAX value that bdrv_pad_request() relies on, we use a
32-bit offsets/lengths helper because it checks INT_MAX or SIZE_MAX (but
really INT_MAX, because that's always smaller on host architectures that
QEMU supports).

Vladimir: Is this the intended use of bdrv_check_request32()?

> 
> Reported-by: Peter Maydell 
> Fixes: 18743311b829cafc1737a5f20bc3248d5f91ee2a
>("block: Collapse padded I/O vecs exceeding IOV_MAX")
> Signed-off-by: Hanna Czenczek 
> ---
>  block/io.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)

> 
> diff --git a/block/io.c b/block/io.c
> index 30748f0b59..e43b4ad09b 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -1710,7 +1710,11 @@ static int bdrv_pad_request(BlockDriverState *bs,
>  int sliced_niov;
>  size_t sliced_head, sliced_tail;
>  
> -bdrv_check_qiov_request(*offset, *bytes, *qiov, *qiov_offset, 
> &error_abort);
> +/* Should have been checked by the caller already */
> +ret = bdrv_check_request32(*offset, *bytes, *qiov, *qiov_offset);
> +if (ret < 0) {
> +return ret;
> +}
>  
>  if (!bdrv_init_padding(bs, *offset, *bytes, write, pad)) {
>  if (padded) {
> @@ -1723,7 +1727,7 @@ static int bdrv_pad_request(BlockDriverState *bs,
>  &sliced_head, &sliced_tail,
>  &sliced_niov);
>  
> -/* Guaranteed by bdrv_check_qiov_request() */
> +/* Guaranteed by bdrv_check_request32() */
>  assert(*bytes <= SIZE_MAX);
>  ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
>sliced_head, *bytes);
> -- 
> 2.40.1
> 




Re: [PATCH] Revert "virtio-scsi: Send "REPORTED LUNS CHANGED" sense data upon disk hotplug events"

2023-07-11 Thread Stefan Hajnoczi
On Tue, 11 Jul 2023 at 13:06, Stefano Garzarella  wrote:
>
> CCing `./scripts/get_maintainer.pl -f drivers/scsi/virtio_scsi.c`,
> since I found a few things in the virtio-scsi driver...
>
> FYI we have seen that Linux has problems with a QEMU patch for the
> virtio-scsi device (details at the bottom of this email in the revert
> commit message and BZ).
>
>
> This is what I found when I looked at the Linux code:
>
> In scsi_report_sense() in linux/drivers/scsi/scsi_error.c, Linux calls
> scsi_report_lun_change(), which sets `sdev_target->expecting_lun_change =
> 1` when we receive a UNIT ATTENTION with REPORT LUNS CHANGED
> (sshdr->asc == 0x3f && sshdr->ascq == 0x0e).
>
> When `sdev_target->expecting_lun_change = 1` is set and we call
> scsi_check_sense(), for example to check the next UNIT ATTENTION, it
> will return NEEDS_RETRY, which I think causes the issues we are
> seeing.
>
> `sdev_target->expecting_lun_change` is reset only in
> scsi_decide_disposition() when `REPORT_LUNS` command returns with
> SAM_STAT_GOOD.
> That command is issued in scsi_report_lun_scan() called by
> __scsi_scan_target(), called for example by scsi_scan_target(),
> scsi_scan_host(), etc.
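
The kernel-side behavior described above can be sketched as a toy state machine. This is a Python illustration only — the names mirror the kernel fields and functions but none of this is the kernel's actual code:

```python
NEEDS_RETRY, SUCCESS = "NEEDS_RETRY", "SUCCESS"

class ScsiTarget:
    """Toy model of the scsi_target state discussed above."""
    def __init__(self):
        self.expecting_lun_change = False

def scsi_report_sense(tgt, asc, ascq):
    # UNIT ATTENTION with REPORT LUNS CHANGED (asc 0x3f / ascq 0x0e)
    # sets the flag
    if (asc, ascq) == (0x3F, 0x0E):
        tgt.expecting_lun_change = True

def scsi_check_sense(tgt):
    # while the flag is set, subsequent unit attentions come back
    # NEEDS_RETRY
    return NEEDS_RETRY if tgt.expecting_lun_change else SUCCESS

def report_luns_done(tgt, sam_status_good):
    # only a REPORT_LUNS command completing with SAM_STAT_GOOD clears it
    if sam_status_good:
        tgt.expecting_lun_change = False
```

In this model, if no rescan ever issues REPORT_LUNS (as on the hot-unplug path), the target is stuck returning NEEDS_RETRY.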
>
> So, checking QEMU, we send VIRTIO_SCSI_EVT_RESET_RESCAN during hotplug
> and VIRTIO_SCSI_EVT_RESET_REMOVED during hotunplug. In both cases now we
> send also the UNIT ATTENTION.
>
> In the virtio-scsi driver, when we receive VIRTIO_SCSI_EVT_RESET_RESCAN
> (hotplug) we call scsi_scan_target() or scsi_add_device(). Both of them
> will call __scsi_scan_target() at some points, sending `REPORT_LUNS`
> command to the device. This does not happen for
> VIRTIO_SCSI_EVT_RESET_REMOVED (hotunplug). Indeed if I remove the
> UNIT ATTENTION from the hotunplug in QEMU, everything works well.
>
> So, I tried to add a scan also for VIRTIO_SCSI_EVT_RESET_REMOVED:
>
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index bd5633667d01..c57658a63097 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -291,6 +291,7 @@ static void virtscsi_handle_transport_reset(struct 
> virtio_scsi *vscsi,
>  }
>  break;
>  case VIRTIO_SCSI_EVT_RESET_REMOVED:
> +   scsi_scan_host(shost);
>  sdev = scsi_device_lookup(shost, 0, target, lun);
>  if (sdev) {
>  scsi_remove_device(sdev);
>
> This somehow helps, now linux only breaks if the plug/unplug frequency
> is really high. If I put a 5 second sleep between plug/unplug events, it
> doesn't break (at least for the duration of my test which has been
> running for about 30 minutes, before it used to break after about a
> minute).
>
> Another thing I noticed is that in QEMU maybe we should set the UNIT
> ATTENTION first and then send the event on the virtqueue, because the
> scan should happen after the unit attention, but I don't know whether
> the unit attention is guaranteed to be processed before the virtqueue
> event in all cases.
>
> I mean something like this:
>
> diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
> index 45b95ea070..13db40f4f3 100644
> --- a/hw/scsi/virtio-scsi.c
> +++ b/hw/scsi/virtio-scsi.c
> @@ -1079,8 +1079,8 @@ static void virtio_scsi_hotplug(HotplugHandler 
> *hotplug_dev, DeviceState *dev,
>   };
>
>   virtio_scsi_acquire(s);
> -virtio_scsi_push_event(s, &info);
>   scsi_bus_set_ua(&s->bus, SENSE_CODE(REPORTED_LUNS_CHANGED));
> +virtio_scsi_push_event(s, &info);
>   virtio_scsi_release(s);
>   }
>   }
> @@ -,8 +,8 @@ static void virtio_scsi_hotunplug(HotplugHandler 
> *hotplug_dev, DeviceState *dev,
>
>   if (virtio_vdev_has_feature(vdev, VIRTIO_SCSI_F_HOTPLUG)) {
>   virtio_scsi_acquire(s);
> -virtio_scsi_push_event(s, &info);
>   scsi_bus_set_ua(&s->bus, SENSE_CODE(REPORTED_LUNS_CHANGED));
> +virtio_scsi_push_event(s, &info);
>   virtio_scsi_release(s);
>   }
>   }

That is racy. It's up to the guest whether the event virtqueue or the
UNIT ATTENTION will be processed first.

If the device wants to ensure ordering then it must withhold the event
until the driver has responded to the UNIT ATTENTION. That may not be
a good idea though.

I'd like to understand the root cause before choosing a solution.

> At this point I think the problem is on the handling of the
> VIRTIO_SCSI_EVT_RESET_REMOVED event in the virtio-scsi driver, where
> somehow we have to redo the bus scan, but scsi_scan_host() doesn't seem
> to be enough when the event rate is very high.

Why is it necessary to rescan the whole bus instead of removing just
the device that has been unplugged?

> I don't know if along with this fix, we also need to limit the rate in
> QEMU somehow.

Why is a high rate problematic?

> Sorry for the length of this email, but I'm not familiar with SCSI and
> wanted some suggestions on how to proceed.
>
> Paolo, Stefan, Linux SCSI maintainers, any suggestion?

I don't know the Linux SCSI 

Re: [PATCH] net: add initial support for AF_XDP network backend

2023-07-10 Thread Stefan Hajnoczi
On Mon, 10 Jul 2023 at 06:55, Ilya Maximets  wrote:
>
> On 7/10/23 05:51, Jason Wang wrote:
> > On Fri, Jul 7, 2023 at 7:21 PM Ilya Maximets  wrote:
> >>
> >> On 7/7/23 03:43, Jason Wang wrote:
> >>> On Fri, Jul 7, 2023 at 3:08 AM Stefan Hajnoczi  wrote:
> >>>>
> >>>> On Wed, 5 Jul 2023 at 02:02, Jason Wang  wrote:
> >>>>>
> >>>>> On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi  
> >>>>> wrote:
> >>>>>>
> >>>>>> On Fri, 30 Jun 2023 at 09:41, Jason Wang  wrote:
> >>>>>>>
> >>>>>>> On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi  
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> On Thu, 29 Jun 2023 at 07:26, Jason Wang  wrote:
> >>>>>>>>>
> >>>>>>>>> On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi 
> >>>>>>>>>  wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Wed, 28 Jun 2023 at 10:19, Jason Wang  
> >>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi 
> >>>>>>>>>>>  wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, 28 Jun 2023 at 09:59, Jason Wang  
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi 
> >>>>>>>>>>>>>  wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, 28 Jun 2023 at 05:28, Jason Wang  
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets 
> >>>>>>>>>>>>>>>  wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 6/27/23 04:54, Jason Wang wrote:
> >>>>>>>>>>>>>>>>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets 
> >>>>>>>>>>>>>>>>>  wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On 6/26/23 08:32, Jason Wang wrote:
> >>>>>>>>>>>>>>>>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang 
> >>>>>>>>>>>>>>>>>>>  wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets 
> >>>>>>>>>>>>>>>>>>>>  wrote:
> >>>>>>>>>>>>>>>>>> It is noticeably more performant than a tap with vhost=on 
> >>>>>>>>>>>>>>>>>> in terms of PPS.
> >>>>>>>>>>>>>>>>>> So, that might be one case.  Taking into account that just 
> >>>>>>>>>>>>>>>>>> rcu lock and
> >>>>>>>>>>>>>>>>>> unlock in virtio-net code takes more time than a packet 
> >>>>>>>>>>>>>>>>>> copy, some batching
> >>>>>>>>>>>>>>>>>> on QEMU side should improve performance significantly.  
> >>>>>>>>>>>>>>>>>> And it shouldn't be
> >>>>>>>>>>>>>>>>>> too hard to implement.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Performance over virtual interfaces may potentially be 
> >>>>>>>>>>>>>>>>>> improved by creating
> >>>>>>>>>>>>>>>>>> a kernel thread for async Tx.  Similarly to w

Re: [PATCH] net: add initial support for AF_XDP network backend

2023-07-10 Thread Stefan Hajnoczi
On Thu, 6 Jul 2023 at 21:43, Jason Wang  wrote:
>
> On Fri, Jul 7, 2023 at 3:08 AM Stefan Hajnoczi  wrote:
> >
> > On Wed, 5 Jul 2023 at 02:02, Jason Wang  wrote:
> > >
> > > On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi  wrote:
> > > >
> > > > On Fri, 30 Jun 2023 at 09:41, Jason Wang  wrote:
> > > > >
> > > > > On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi  
> > > > > wrote:
> > > > > >
> > > > > > On Thu, 29 Jun 2023 at 07:26, Jason Wang  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Wed, 28 Jun 2023 at 10:19, Jason Wang  
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi 
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang 
> > > > > > > > > >  wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi 
> > > > > > > > > > >  wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang 
> > > > > > > > > > > >  wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets 
> > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On 6/27/23 04:54, Jason Wang wrote:
> > > > > > > > > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets 
> > > > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> On 6/26/23 08:32, Jason Wang wrote:
> > > > > > > > > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang 
> > > > > > > > > > > > > > >>>  wrote:
> > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets 
> > > > > > > > > > > > > > >>>>  wrote:
> > > > > > > > > > > > > > >> It is noticeably more performant than a tap with 
> > > > > > > > > > > > > > >> vhost=on in terms of PPS.
> > > > > > > > > > > > > > >> So, that might be one case.  Taking into account 
> > > > > > > > > > > > > > >> that just rcu lock and
> > > > > > > > > > > > > > >> unlock in virtio-net code takes more time than a 
> > > > > > > > > > > > > > >> packet copy, some batching
> > > > > > > > > > > > > > >> on QEMU side should improve performance 
> > > > > > > > > > > > > > >> significantly.  And it shouldn't be
> > > > > > > > > > > > > > >> too hard to implement.
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> Performance over virtual interfaces may 
> > > > > > > > > > > > > > >> potentially be improved by creating
> > > > > > > > > > > > > > >> a kernel thread for async Tx.  Similarly to what 
> > > > > > > > > > > > > > >> io_uring allows.  Currently
> > > > > > > > > > > > > > >> Tx on non-zero-copy interfaces is synchronous, 
> > > > > > > > > > > > > > >> and that doesn't allow to
> > > &

Re: [PATCH] net: add initial support for AF_XDP network backend

2023-07-06 Thread Stefan Hajnoczi
On Wed, 5 Jul 2023 at 02:02, Jason Wang  wrote:
>
> On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi  wrote:
> >
> > On Fri, 30 Jun 2023 at 09:41, Jason Wang  wrote:
> > >
> > > On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi  
> > > wrote:
> > > >
> > > > On Thu, 29 Jun 2023 at 07:26, Jason Wang  wrote:
> > > > >
> > > > > On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi  
> > > > > wrote:
> > > > > >
> > > > > > On Wed, 28 Jun 2023 at 10:19, Jason Wang  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang  
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi 
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang 
> > > > > > > > > >  wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets 
> > > > > > > > > > >  wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On 6/27/23 04:54, Jason Wang wrote:
> > > > > > > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets 
> > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> On 6/26/23 08:32, Jason Wang wrote:
> > > > > > > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang 
> > > > > > > > > > > > >>>  wrote:
> > > > > > > > > > > > >>>>
> > > > > > > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets 
> > > > > > > > > > > > >>>>  wrote:
> > > > > > > > > > > > >> It is noticeably more performant than a tap with 
> > > > > > > > > > > > >> vhost=on in terms of PPS.
> > > > > > > > > > > > >> So, that might be one case.  Taking into account 
> > > > > > > > > > > > >> that just rcu lock and
> > > > > > > > > > > > >> unlock in virtio-net code takes more time than a 
> > > > > > > > > > > > >> packet copy, some batching
> > > > > > > > > > > > >> on QEMU side should improve performance 
> > > > > > > > > > > > >> significantly.  And it shouldn't be
> > > > > > > > > > > > >> too hard to implement.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> Performance over virtual interfaces may potentially 
> > > > > > > > > > > > >> be improved by creating
> > > > > > > > > > > > >> a kernel thread for async Tx.  Similarly to what 
> > > > > > > > > > > > >> io_uring allows.  Currently
> > > > > > > > > > > > >> Tx on non-zero-copy interfaces is synchronous, and 
> > > > > > > > > > > > >> that doesn't allow to
> > > > > > > > > > > > >> scale well.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Interestingly, actually, there are a lot of 
> > > > > > > > > > > > > "duplication" between
> > > > > > > > > > > > > io_uring and AF_XDP:
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1) both have similar memory model (user register)
> > > > > > > > > > > > > 2) both use ring for communication
> > > > > &

[PULL 1/1] block/blkio: fix module_block.py parsing

2023-07-04 Thread Stefan Hajnoczi
When QEMU is built with --enable-modules, the module_block.py script
parses block/*.c to find block drivers that are built as modules. The
script generates a table of block drivers called block_driver_modules[].
This table is used for block driver module loading.

The blkio.c driver uses macros to define its BlockDriver structs. This
was done to avoid code duplication but the module_block.py script is
unable to parse the macro. The result is that libblkio-based block
drivers can be built as modules but will not be found at runtime.

One fix is to make the module_block.py script or build system fancier so
it can parse C macros (e.g. by parsing the preprocessed source code). I
chose not to do this because it raises the complexity of the build,
making future issues harder to debug.

Keep things simple: use the macro to avoid duplicating BlockDriver
function pointers but define .format_name and .protocol_name manually
for each BlockDriver. This way the module_block.py script is able to parse the
code.

Also get rid of the block driver name macros (e.g. DRIVER_IO_URING)
because module_block.py cannot parse them either.
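
As a hedged sketch of why a line-based scan misses macro-defined drivers (the regex below is illustrative; the real module_block.py uses its own patterns, but the limitation is the same):

```python
import re

# A driver defined with plain struct-initializer fields exposes a literal
# .format_name line that a textual scan can find...
PLAIN_STRUCT = '''
static BlockDriver bdrv_demo = {
    .format_name = "demo",
    .protocol_name = "demo",
};
'''

# ...while a macro-defined driver contains no such line at all before
# preprocessing, so a textual scan finds nothing.
MACRO_STRUCT = '''
static BlockDriver bdrv_demo = EXAMPLE_DRIVER_MACRO("demo");
'''

def find_driver_names(c_source):
    """Illustrative module_block.py-style scan: match literal
    .format_name = "..." initializers in unpreprocessed C source."""
    return re.findall(r'\.format_name\s*=\s*"([^"]+)"', c_source)
```

Against `PLAIN_STRUCT` the scan yields `["demo"]`; against `MACRO_STRUCT` it yields nothing, which is why the patch spells out `.format_name`/`.protocol_name` per driver instead of hiding them in the macro.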

Fixes: fd66dbd424f5 ("blkio: add libblkio block driver")
Reported-by: Qing Wang 
Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Stefano Garzarella 
Message-id: 20230704123436.187761-1-stefa...@redhat.com
Cc: Stefano Garzarella 
Signed-off-by: Stefan Hajnoczi 
---
 block/blkio.c | 108 ++
 1 file changed, 56 insertions(+), 52 deletions(-)

diff --git a/block/blkio.c b/block/blkio.c
index 527323d625..1798648134 100644
--- a/block/blkio.c
+++ b/block/blkio.c
@@ -22,16 +22,6 @@
 
 #include "block/block-io.h"
 
-/*
- * Keep the QEMU BlockDriver names identical to the libblkio driver names.
- * Using macros instead of typing out the string literals avoids typos.
- */
-#define DRIVER_IO_URING "io_uring"
-#define DRIVER_NVME_IO_URING "nvme-io_uring"
-#define DRIVER_VIRTIO_BLK_VFIO_PCI "virtio-blk-vfio-pci"
-#define DRIVER_VIRTIO_BLK_VHOST_USER "virtio-blk-vhost-user"
-#define DRIVER_VIRTIO_BLK_VHOST_VDPA "virtio-blk-vhost-vdpa"
-
 /*
  * Allocated bounce buffers are kept in a list sorted by buffer address.
  */
@@ -744,15 +734,15 @@ static int blkio_file_open(BlockDriverState *bs, QDict 
*options, int flags,
 return ret;
 }
 
-if (strcmp(blkio_driver, DRIVER_IO_URING) == 0) {
+if (strcmp(blkio_driver, "io_uring") == 0) {
 ret = blkio_io_uring_open(bs, options, flags, errp);
-} else if (strcmp(blkio_driver, DRIVER_NVME_IO_URING) == 0) {
+} else if (strcmp(blkio_driver, "nvme-io_uring") == 0) {
 ret = blkio_nvme_io_uring(bs, options, flags, errp);
-} else if (strcmp(blkio_driver, DRIVER_VIRTIO_BLK_VFIO_PCI) == 0) {
+} else if (strcmp(blkio_driver, "virtio-blk-vfio-pci") == 0) {
 ret = blkio_virtio_blk_common_open(bs, options, flags, errp);
-} else if (strcmp(blkio_driver, DRIVER_VIRTIO_BLK_VHOST_USER) == 0) {
+} else if (strcmp(blkio_driver, "virtio-blk-vhost-user") == 0) {
 ret = blkio_virtio_blk_common_open(bs, options, flags, errp);
-} else if (strcmp(blkio_driver, DRIVER_VIRTIO_BLK_VHOST_VDPA) == 0) {
+} else if (strcmp(blkio_driver, "virtio-blk-vhost-vdpa") == 0) {
 ret = blkio_virtio_blk_common_open(bs, options, flags, errp);
 } else {
 g_assert_not_reached();
@@ -1028,49 +1018,63 @@ static void blkio_refresh_limits(BlockDriverState *bs, 
Error **errp)
  * - truncate
  */
 
-#define BLKIO_DRIVER(name, ...) \
-{ \
-.format_name = name, \
-.protocol_name   = name, \
-.instance_size   = sizeof(BDRVBlkioState), \
-.bdrv_file_open  = blkio_file_open, \
-.bdrv_close  = blkio_close, \
-.bdrv_co_getlength   = blkio_co_getlength, \
-.bdrv_co_truncate= blkio_truncate, \
-.bdrv_co_get_info= blkio_co_get_info, \
-.bdrv_attach_aio_context = blkio_attach_aio_context, \
-.bdrv_detach_aio_context = blkio_detach_aio_context, \
-.bdrv_co_pdiscard= blkio_co_pdiscard, \
-.bdrv_co_preadv  = blkio_co_preadv, \
-.bdrv_co_pwritev = blkio_co_pwritev, \
-.bdrv_co_flush_to_disk   = blkio_co_flush, \
-.bdrv_co_pwrite_zeroes   = blkio_co_pwrite_zeroes, \
-.bdrv_refresh_limits = blkio_refresh_limits, \
-.bdrv_register_buf   = blkio_register_buf, \
-.bdrv_unregister_buf = blkio_unregister_buf, \
-__VA_ARGS__ \
-}
+/*
+ * Do not include .format_name and .protocol_name because module_block.py
+ * does not parse macros in the source code.
+ */
+#define BLKIO_DRIVER_COMMON \
+.instance_size   = sizeof(BDRVBlkioState), \
+.bdrv_file_open  = blkio_file_open, \
+.bdrv_close   

[PULL 0/1] Block patches

2023-07-04 Thread Stefan Hajnoczi
The following changes since commit d145c0da22cde391d8c6672d33146ce306e8bf75:

  Merge tag 'pull-tcg-20230701' of https://gitlab.com/rth7680/qemu into staging 
(2023-07-01 08:55:37 +0200)

are available in the Git repository at:

  https://gitlab.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to c21eae1ccc782440f320accb6f90c66cb8f45ee9:

  block/blkio: fix module_block.py parsing (2023-07-04 17:28:25 +0200)


Pull request

Fix --enable-modules with the blkio block driver.



Stefan Hajnoczi (1):
  block/blkio: fix module_block.py parsing

 block/blkio.c | 108 ++
 1 file changed, 56 insertions(+), 52 deletions(-)

-- 
2.40.1




[PATCH] virtio-blk: fix host notifier issues during dataplane start/stop

2023-07-04 Thread Stefan Hajnoczi
The main loop thread can consume 100% CPU when using --device
virtio-blk-pci,iothread=. ppoll() constantly returns but
reading virtqueue host notifiers fails with EAGAIN. The file descriptors
are stale and remain registered with the AioContext because of bugs in
the virtio-blk dataplane start/stop code.

The problem is that the dataplane start/stop code involves drain
operations, which call virtio_blk_drained_begin() and
virtio_blk_drained_end() at points where the host notifier is not
operational:
- In virtio_blk_data_plane_start(), blk_set_aio_context() drains after
  vblk->dataplane_started has been set to true but the host notifier has
  not been attached yet.
- In virtio_blk_data_plane_stop(), blk_drain() and blk_set_aio_context()
  drain after the host notifier has already been detached but with
  vblk->dataplane_started still set to true.

I would like to simplify ->ioeventfd_start/stop() to avoid interactions
with drain entirely, but couldn't find a way to do that. Instead, this
patch accepts the fragile nature of the code and reorders it so that
vblk->dataplane_started is false during drain operations. This way the
virtio_blk_drained_begin() and virtio_blk_drained_end() calls don't
touch the host notifier. The result is that
virtio_blk_data_plane_start() and virtio_blk_data_plane_stop() have
complete control over the host notifier and stale file descriptors are
no longer left in the AioContext.

This patch fixes the 100% CPU consumption in the main loop thread and
correctly moves host notifier processing to the IOThread.

Fixes: 1665d9326fd2 ("virtio-blk: implement BlockDevOps->drained_begin()")
Reported-by: Lukáš Doktor 
Signed-off-by: Stefan Hajnoczi 
---
 hw/block/dataplane/virtio-blk.c | 67 +++--
 1 file changed, 38 insertions(+), 29 deletions(-)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index c227b39408..da36fcfd0b 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -219,13 +219,6 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
 
 memory_region_transaction_commit();
 
-/*
- * These fields are visible to the IOThread so we rely on implicit barriers
- * in aio_context_acquire() on the write side and aio_notify_accept() on
- * the read side.
- */
-s->starting = false;
-vblk->dataplane_started = true;
 trace_virtio_blk_data_plane_start(s);
 
 old_context = blk_get_aio_context(s->conf->conf.blk);
@@ -244,6 +237,18 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
 event_notifier_set(virtio_queue_get_host_notifier(vq));
 }
 
+/*
+ * These fields must be visible to the IOThread when it processes the
+ * virtqueue, otherwise it will think dataplane has not started yet.
+ *
+ * Make sure ->dataplane_started is false when blk_set_aio_context() is
+ * called above so that draining does not cause the host notifier to be
+ * detached/attached prematurely.
+ */
+s->starting = false;
+vblk->dataplane_started = true;
+smp_wmb(); /* paired with aio_notify_accept() on the read side */
+
 /* Get this show started by hooking up our callbacks */
 if (!blk_in_drain(s->conf->conf.blk)) {
 aio_context_acquire(s->ctx);
@@ -273,7 +278,6 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
   fail_guest_notifiers:
 vblk->dataplane_disabled = true;
 s->starting = false;
-vblk->dataplane_started = true;
 return -ENOSYS;
 }
 
@@ -327,6 +331,32 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
 aio_wait_bh_oneshot(s->ctx, virtio_blk_data_plane_stop_bh, s);
 }
 
+/*
+ * Batch all the host notifiers in a single transaction to avoid
+ * quadratic time complexity in address_space_update_ioeventfds().
+ */
+memory_region_transaction_begin();
+
+for (i = 0; i < nvqs; i++) {
+virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), i, false);
+}
+
+/*
+ * The transaction expects the ioeventfds to be open when it
+ * commits. Do it now, before the cleanup loop.
+ */
+memory_region_transaction_commit();
+
+for (i = 0; i < nvqs; i++) {
+virtio_bus_cleanup_host_notifier(VIRTIO_BUS(qbus), i);
+}
+
+/*
+ * Set ->dataplane_started to false before draining so that host notifiers
+ * are not detached/attached anymore.
+ */
+vblk->dataplane_started = false;
+
 aio_context_acquire(s->ctx);
 
 /* Wait for virtio_blk_dma_restart_bh() and in flight I/O to complete */
@@ -340,32 +370,11 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
 
 aio_context_release(s->ctx);
 
-/*
- * Batch all the host notifiers in a single transaction to avoid
- * quadratic time complexity in address_space_update_ioeventfds().
- */
-memory_region_transaction_begin();
-
-for (i =

Re: [PATCH] block/blkio: fix module_block.py parsing

2023-07-04 Thread Stefan Hajnoczi
On Mon, 3 Jul 2023 at 12:55, Stefano Garzarella  wrote:
>
> On Mon, Jul 03, 2023 at 12:35:24PM +0200, Stefan Hajnoczi wrote:
> >When QEMU is built with --enable-modules, the module_block.py script
> >parses block/*.c to find block drivers that are built as modules. The
> >script generates a table of block drivers called block_driver_modules[].
> >This table is used for block driver module loading.
> >
> >The blkio.c driver uses macros to define its BlockDriver structs. This
> >was done to avoid code duplication but the module_block.py script is
> >unable to parse the macro. The result is that libblkio-based block
> >drivers can be built as modules but will not be found at runtime.
> >
> >One fix is to make the module_block.py script or build system fancier so
> >it can parse C macros (e.g. by parsing the preprocessed source code). I
> >chose not to do this because it raises the complexity of the build,
> >making future issues harder to debug.
> >
> >Keep things simple: use the macro to avoid duplicating BlockDriver
> >function pointers but define .format_name and .protocol_name manually
> >for each BlockDriver. This way the module_block.py script is able to
> >parse the code.
> >
> >Also get rid of the block driver name macros (e.g. DRIVER_IO_URING)
> >because module_block.py cannot parse them either.
> >
> >Fixes: fd66dbd424f5 ("blkio: add libblkio block driver")
> >Reported-by: Qing Wang 
> >Cc: Stefano Garzarella 
> >Signed-off-by: Stefan Hajnoczi 
> >---
> > block/blkio.c | 110 ++
> > 1 file changed, 57 insertions(+), 53 deletions(-)
> >
> >diff --git a/block/blkio.c b/block/blkio.c
> >index 527323d625..589f829a83 100644
> >--- a/block/blkio.c
> >+++ b/block/blkio.c
> >@@ -22,16 +22,6 @@
> >
> > #include "block/block-io.h"
> >
> >-/*
> >- * Keep the QEMU BlockDriver names identical to the libblkio driver names.
> >- * Using macros instead of typing out the string literals avoids typos.
> >- */
> >-#define DRIVER_IO_URING "io_uring"
> >-#define DRIVER_NVME_IO_URING "nvme-io_uring"
> >-#define DRIVER_VIRTIO_BLK_VFIO_PCI "virtio-blk-vfio-pci"
> >-#define DRIVER_VIRTIO_BLK_VHOST_USER "virtio-blk-vhost-user"
> >-#define DRIVER_VIRTIO_BLK_VHOST_VDPA "virtio-blk-vhost-vdpa"
> >-
> > /*
> >  * Allocated bounce buffers are kept in a list sorted by buffer address.
> >  */
> >@@ -744,15 +734,15 @@ static int blkio_file_open(BlockDriverState *bs, QDict 
> >*options, int flags,
> > return ret;
> > }
> >
> >-if (strcmp(blkio_driver, DRIVER_IO_URING) == 0) {
> >+if (strcmp(blkio_driver, "io_uring") == 0) {
> > ret = blkio_io_uring_open(bs, options, flags, errp);
> >-} else if (strcmp(blkio_driver, DRIVER_NVME_IO_URING) == 0) {
> >+} else if (strcmp(blkio_driver, "nvme-io_uring") == 0) {
> > ret = blkio_nvme_io_uring(bs, options, flags, errp);
> >-} else if (strcmp(blkio_driver, DRIVER_VIRTIO_BLK_VFIO_PCI) == 0) {
> >+} else if (strcmp(blkio_driver, "virtio-blk-vfio-pci") == 0) {
> > ret = blkio_virtio_blk_common_open(bs, options, flags, errp);
> >-} else if (strcmp(blkio_driver, DRIVER_VIRTIO_BLK_VHOST_USER) == 0) {
> >+} else if (strcmp(blkio_driver, "virtio-blk-vhost-user") == 0) {
> > ret = blkio_virtio_blk_common_open(bs, options, flags, errp);
> >-} else if (strcmp(blkio_driver, DRIVER_VIRTIO_BLK_VHOST_VDPA) == 0) {
> >+} else if (strcmp(blkio_driver, "virtio-blk-vhost-vdpa") == 0) {
> > ret = blkio_virtio_blk_common_open(bs, options, flags, errp);
> > } else {
> > g_assert_not_reached();
> >@@ -1028,49 +1018,63 @@ static void blkio_refresh_limits(BlockDriverState 
> >*bs, Error **errp)
> >  * - truncate
> >  */
> >
> >-#define BLKIO_DRIVER(name, ...) \
> >-{ \
> >-.format_name = name, \
> >-.protocol_name   = name, \
> >-.instance_size   = sizeof(BDRVBlkioState), \
> >-.bdrv_file_open  = blkio_file_open, \
> >-.bdrv_close  = blkio_close, \
> >-.bdrv_co_getlength   = blkio_co_getlength, \
> >-.bdrv_co_truncate= blkio_truncate, \
> >-.bdrv_co_get_info= blkio_co_get_info, \
> >-.bdrv_attach_aio_context = blkio_attach_aio_context, \
> >-.bdrv_detach_aio_c

[PATCH v2] block/blkio: fix module_block.py parsing

2023-07-04 Thread Stefan Hajnoczi
When QEMU is built with --enable-modules, the module_block.py script
parses block/*.c to find block drivers that are built as modules. The
script generates a table of block drivers called block_driver_modules[].
This table is used for block driver module loading.

The blkio.c driver uses macros to define its BlockDriver structs. This
was done to avoid code duplication but the module_block.py script is
unable to parse the macro. The result is that libblkio-based block
drivers can be built as modules but will not be found at runtime.

One fix is to make the module_block.py script or build system fancier so
it can parse C macros (e.g. by parsing the preprocessed source code). I
chose not to do this because it raises the complexity of the build,
making future issues harder to debug.

Keep things simple: use the macro to avoid duplicating BlockDriver
function pointers but define .format_name and .protocol_name manually
for each BlockDriver. This way the module_block.py script is able to
parse the code.

Also get rid of the block driver name macros (e.g. DRIVER_IO_URING)
because module_block.py cannot parse them either.

Fixes: fd66dbd424f5 ("blkio: add libblkio block driver")
Reported-by: Qing Wang 
Cc: Stefano Garzarella 
Signed-off-by: Stefan Hajnoczi 
---
v2:
- Drop unnecessary backslashes [Stefano]
---
 block/blkio.c | 108 ++
 1 file changed, 56 insertions(+), 52 deletions(-)

diff --git a/block/blkio.c b/block/blkio.c
index 527323d625..1798648134 100644
--- a/block/blkio.c
+++ b/block/blkio.c
@@ -22,16 +22,6 @@
 
 #include "block/block-io.h"
 
-/*
- * Keep the QEMU BlockDriver names identical to the libblkio driver names.
- * Using macros instead of typing out the string literals avoids typos.
- */
-#define DRIVER_IO_URING "io_uring"
-#define DRIVER_NVME_IO_URING "nvme-io_uring"
-#define DRIVER_VIRTIO_BLK_VFIO_PCI "virtio-blk-vfio-pci"
-#define DRIVER_VIRTIO_BLK_VHOST_USER "virtio-blk-vhost-user"
-#define DRIVER_VIRTIO_BLK_VHOST_VDPA "virtio-blk-vhost-vdpa"
-
 /*
  * Allocated bounce buffers are kept in a list sorted by buffer address.
  */
@@ -744,15 +734,15 @@ static int blkio_file_open(BlockDriverState *bs, QDict 
*options, int flags,
 return ret;
 }
 
-if (strcmp(blkio_driver, DRIVER_IO_URING) == 0) {
+if (strcmp(blkio_driver, "io_uring") == 0) {
 ret = blkio_io_uring_open(bs, options, flags, errp);
-} else if (strcmp(blkio_driver, DRIVER_NVME_IO_URING) == 0) {
+} else if (strcmp(blkio_driver, "nvme-io_uring") == 0) {
 ret = blkio_nvme_io_uring(bs, options, flags, errp);
-} else if (strcmp(blkio_driver, DRIVER_VIRTIO_BLK_VFIO_PCI) == 0) {
+} else if (strcmp(blkio_driver, "virtio-blk-vfio-pci") == 0) {
 ret = blkio_virtio_blk_common_open(bs, options, flags, errp);
-} else if (strcmp(blkio_driver, DRIVER_VIRTIO_BLK_VHOST_USER) == 0) {
+} else if (strcmp(blkio_driver, "virtio-blk-vhost-user") == 0) {
 ret = blkio_virtio_blk_common_open(bs, options, flags, errp);
-} else if (strcmp(blkio_driver, DRIVER_VIRTIO_BLK_VHOST_VDPA) == 0) {
+} else if (strcmp(blkio_driver, "virtio-blk-vhost-vdpa") == 0) {
 ret = blkio_virtio_blk_common_open(bs, options, flags, errp);
 } else {
 g_assert_not_reached();
@@ -1028,49 +1018,63 @@ static void blkio_refresh_limits(BlockDriverState *bs, 
Error **errp)
  * - truncate
  */
 
-#define BLKIO_DRIVER(name, ...) \
-{ \
-.format_name = name, \
-.protocol_name   = name, \
-.instance_size   = sizeof(BDRVBlkioState), \
-.bdrv_file_open  = blkio_file_open, \
-.bdrv_close  = blkio_close, \
-.bdrv_co_getlength   = blkio_co_getlength, \
-.bdrv_co_truncate= blkio_truncate, \
-.bdrv_co_get_info= blkio_co_get_info, \
-.bdrv_attach_aio_context = blkio_attach_aio_context, \
-.bdrv_detach_aio_context = blkio_detach_aio_context, \
-.bdrv_co_pdiscard= blkio_co_pdiscard, \
-.bdrv_co_preadv  = blkio_co_preadv, \
-.bdrv_co_pwritev = blkio_co_pwritev, \
-.bdrv_co_flush_to_disk   = blkio_co_flush, \
-.bdrv_co_pwrite_zeroes   = blkio_co_pwrite_zeroes, \
-.bdrv_refresh_limits = blkio_refresh_limits, \
-.bdrv_register_buf   = blkio_register_buf, \
-.bdrv_unregister_buf = blkio_unregister_buf, \
-__VA_ARGS__ \
-}
+/*
+ * Do not include .format_name and .protocol_name because module_block.py
+ * does not parse macros in the source code.
+ */
+#define BLKIO_DRIVER_COMMON \
+.instance_size   = sizeof(BDRVBlkioState), \
+.bdrv_file_open  = blkio_file_open, \
+.bdrv_close  = blkio_close, \
+.bdrv_co_getlength   = b

[PATCH] block/blkio: fix module_block.py parsing

2023-07-03 Thread Stefan Hajnoczi
When QEMU is built with --enable-modules, the module_block.py script
parses block/*.c to find block drivers that are built as modules. The
script generates a table of block drivers called block_driver_modules[].
This table is used for block driver module loading.

The blkio.c driver uses macros to define its BlockDriver structs. This
was done to avoid code duplication but the module_block.py script is
unable to parse the macro. The result is that libblkio-based block
drivers can be built as modules but will not be found at runtime.

One fix is to make the module_block.py script or build system fancier so
it can parse C macros (e.g. by parsing the preprocessed source code). I
chose not to do this because it raises the complexity of the build,
making future issues harder to debug.

Keep things simple: use the macro to avoid duplicating BlockDriver
function pointers but define .format_name and .protocol_name manually
for each BlockDriver. This way the module_block.py script is able to
parse the code.

Also get rid of the block driver name macros (e.g. DRIVER_IO_URING)
because module_block.py cannot parse them either.

Fixes: fd66dbd424f5 ("blkio: add libblkio block driver")
Reported-by: Qing Wang 
Cc: Stefano Garzarella 
Signed-off-by: Stefan Hajnoczi 
---
 block/blkio.c | 110 ++
 1 file changed, 57 insertions(+), 53 deletions(-)

diff --git a/block/blkio.c b/block/blkio.c
index 527323d625..589f829a83 100644
--- a/block/blkio.c
+++ b/block/blkio.c
@@ -22,16 +22,6 @@
 
 #include "block/block-io.h"
 
-/*
- * Keep the QEMU BlockDriver names identical to the libblkio driver names.
- * Using macros instead of typing out the string literals avoids typos.
- */
-#define DRIVER_IO_URING "io_uring"
-#define DRIVER_NVME_IO_URING "nvme-io_uring"
-#define DRIVER_VIRTIO_BLK_VFIO_PCI "virtio-blk-vfio-pci"
-#define DRIVER_VIRTIO_BLK_VHOST_USER "virtio-blk-vhost-user"
-#define DRIVER_VIRTIO_BLK_VHOST_VDPA "virtio-blk-vhost-vdpa"
-
 /*
  * Allocated bounce buffers are kept in a list sorted by buffer address.
  */
@@ -744,15 +734,15 @@ static int blkio_file_open(BlockDriverState *bs, QDict 
*options, int flags,
 return ret;
 }
 
-if (strcmp(blkio_driver, DRIVER_IO_URING) == 0) {
+if (strcmp(blkio_driver, "io_uring") == 0) {
 ret = blkio_io_uring_open(bs, options, flags, errp);
-} else if (strcmp(blkio_driver, DRIVER_NVME_IO_URING) == 0) {
+} else if (strcmp(blkio_driver, "nvme-io_uring") == 0) {
 ret = blkio_nvme_io_uring(bs, options, flags, errp);
-} else if (strcmp(blkio_driver, DRIVER_VIRTIO_BLK_VFIO_PCI) == 0) {
+} else if (strcmp(blkio_driver, "virtio-blk-vfio-pci") == 0) {
 ret = blkio_virtio_blk_common_open(bs, options, flags, errp);
-} else if (strcmp(blkio_driver, DRIVER_VIRTIO_BLK_VHOST_USER) == 0) {
+} else if (strcmp(blkio_driver, "virtio-blk-vhost-user") == 0) {
 ret = blkio_virtio_blk_common_open(bs, options, flags, errp);
-} else if (strcmp(blkio_driver, DRIVER_VIRTIO_BLK_VHOST_VDPA) == 0) {
+} else if (strcmp(blkio_driver, "virtio-blk-vhost-vdpa") == 0) {
 ret = blkio_virtio_blk_common_open(bs, options, flags, errp);
 } else {
 g_assert_not_reached();
@@ -1028,49 +1018,63 @@ static void blkio_refresh_limits(BlockDriverState *bs, 
Error **errp)
  * - truncate
  */
 
-#define BLKIO_DRIVER(name, ...) \
-{ \
-.format_name = name, \
-.protocol_name   = name, \
-.instance_size   = sizeof(BDRVBlkioState), \
-.bdrv_file_open  = blkio_file_open, \
-.bdrv_close  = blkio_close, \
-.bdrv_co_getlength   = blkio_co_getlength, \
-.bdrv_co_truncate= blkio_truncate, \
-.bdrv_co_get_info= blkio_co_get_info, \
-.bdrv_attach_aio_context = blkio_attach_aio_context, \
-.bdrv_detach_aio_context = blkio_detach_aio_context, \
-.bdrv_co_pdiscard= blkio_co_pdiscard, \
-.bdrv_co_preadv  = blkio_co_preadv, \
-.bdrv_co_pwritev = blkio_co_pwritev, \
-.bdrv_co_flush_to_disk   = blkio_co_flush, \
-.bdrv_co_pwrite_zeroes   = blkio_co_pwrite_zeroes, \
-.bdrv_refresh_limits = blkio_refresh_limits, \
-.bdrv_register_buf   = blkio_register_buf, \
-.bdrv_unregister_buf = blkio_unregister_buf, \
-__VA_ARGS__ \
-}
+/*
+ * Do not include .format_name and .protocol_name because module_block.py
+ * does not parse macros in the source code.
+ */
+#define BLKIO_DRIVER_COMMON \
+.instance_size   = sizeof(BDRVBlkioState), \
+.bdrv_file_open  = blkio_file_open, \
+.bdrv_close  = blkio_close, \
+.bdrv_co_getlength   = blkio_co_getlength, \
+.bdrv_co_truncate= blkio_t

Re: [PATCH] net: add initial support for AF_XDP network backend

2023-07-03 Thread Stefan Hajnoczi
On Fri, 30 Jun 2023 at 09:41, Jason Wang  wrote:
>
> On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi  wrote:
> >
> > On Thu, 29 Jun 2023 at 07:26, Jason Wang  wrote:
> > >
> > > On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi  
> > > wrote:
> > > >
> > > > On Wed, 28 Jun 2023 at 10:19, Jason Wang  wrote:
> > > > >
> > > > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi  
> > > > > wrote:
> > > > > >
> > > > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang  
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets 
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > On 6/27/23 04:54, Jason Wang wrote:
> > > > > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets 
> > > > > > > > > > >  wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> On 6/26/23 08:32, Jason Wang wrote:
> > > > > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang 
> > > > > > > > > > >>>  wrote:
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets 
> > > > > > > > > > >>>>  wrote:
> > > > > > > > > > >> It is noticeably more performant than a tap with 
> > > > > > > > > > >> vhost=on in terms of PPS.
> > > > > > > > > > >> So, that might be one case.  Taking into account that 
> > > > > > > > > > >> just rcu lock and
> > > > > > > > > > >> unlock in virtio-net code takes more time than a packet 
> > > > > > > > > > >> copy, some batching
> > > > > > > > > > >> on QEMU side should improve performance significantly.  
> > > > > > > > > > >> And it shouldn't be
> > > > > > > > > > >> too hard to implement.
> > > > > > > > > > >>
> > > > > > > > > > >> Performance over virtual interfaces may potentially be 
> > > > > > > > > > >> improved by creating
> > > > > > > > > > >> a kernel thread for async Tx.  Similarly to what 
> > > > > > > > > > >> io_uring allows.  Currently
> > > > > > > > > > >> Tx on non-zero-copy interfaces is synchronous, and that 
> > > > > > > > > > >> doesn't allow to
> > > > > > > > > > >> scale well.
> > > > > > > > > > >
> > > > > > > > > > > Interestingly, actually, there are a lot of "duplication" 
> > > > > > > > > > > between
> > > > > > > > > > > io_uring and AF_XDP:
> > > > > > > > > > >
> > > > > > > > > > > 1) both have similar memory model (user register)
> > > > > > > > > > > 2) both use ring for communication
> > > > > > > > > > >
> > > > > > > > > > > I wonder if we can let io_uring talks directly to AF_XDP.
> > > > > > > > > >
> > > > > > > > > > Well, if we submit poll() in QEMU main loop via io_uring, 
> > > > > > > > > > then we can
> > > > > > > > > > avoid cost of the synchronous Tx for non-zero-copy modes, 
> > > > > > > > > > i.e. for
> > > > > > > > > > virtual interfaces.  io_uring thread in the kernel will be 
> > > > > > > > > > able to
> > > > > > > > > > perform transmission for 

Re: [PATCH] net: add initial support for AF_XDP network backend

2023-06-29 Thread Stefan Hajnoczi
On Thu, 29 Jun 2023 at 07:26, Jason Wang  wrote:
>
> On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi  wrote:
> >
> > On Wed, 28 Jun 2023 at 10:19, Jason Wang  wrote:
> > >
> > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi  
> > > wrote:
> > > >
> > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang  wrote:
> > > > >
> > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi  
> > > > > wrote:
> > > > > >
> > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On 6/27/23 04:54, Jason Wang wrote:
> > > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets 
> > > > > > > > >  wrote:
> > > > > > > > >>
> > > > > > > > >> On 6/26/23 08:32, Jason Wang wrote:
> > > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang 
> > > > > > > > >>>  wrote:
> > > > > > > > >>>>
> > > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets 
> > > > > > > > >>>>  wrote:
> > > > > > > > >> It is noticeably more performant than a tap with vhost=on in 
> > > > > > > > >> terms of PPS.
> > > > > > > > >> So, that might be one case.  Taking into account that just 
> > > > > > > > >> rcu lock and
> > > > > > > > >> unlock in virtio-net code takes more time than a packet 
> > > > > > > > >> copy, some batching
> > > > > > > > >> on QEMU side should improve performance significantly.  And 
> > > > > > > > >> it shouldn't be
> > > > > > > > >> too hard to implement.
> > > > > > > > >>
> > > > > > > > >> Performance over virtual interfaces may potentially be 
> > > > > > > > >> improved by creating
> > > > > > > > >> a kernel thread for async Tx.  Similarly to what io_uring 
> > > > > > > > >> allows.  Currently
> > > > > > > > >> Tx on non-zero-copy interfaces is synchronous, and that 
> > > > > > > > >> doesn't allow to
> > > > > > > > >> scale well.
> > > > > > > > >
> > > > > > > > > Interestingly, actually, there are a lot of "duplication" 
> > > > > > > > > between
> > > > > > > > > io_uring and AF_XDP:
> > > > > > > > >
> > > > > > > > > 1) both have similar memory model (user register)
> > > > > > > > > 2) both use ring for communication
> > > > > > > > >
> > > > > > > > > I wonder if we can let io_uring talks directly to AF_XDP.
> > > > > > > >
> > > > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then 
> > > > > > > > we can
> > > > > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. 
> > > > > > > > for
> > > > > > > > virtual interfaces.  io_uring thread in the kernel will be able 
> > > > > > > > to
> > > > > > > > perform transmission for us.
> > > > > > >
> > > > > > > It would be nice if we can use iothread/vhost other than the main 
> > > > > > > loop
> > > > > > > even if io_uring can use kthreads. We can avoid the memory 
> > > > > > > translation
> > > > > > > cost.
> > > > > >
> > > > > > The QEMU event loop (AioContext) has io_uring code
> > > > > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm 
> > > > > > working
> > > > > > on patches to re-enable it and will probably send them in July. The
> > > > > > patches also add an API to submit arbitrary io_uring operations so
> &

Re: [PATCH] net: add initial support for AF_XDP network backend

2023-06-28 Thread Stefan Hajnoczi
On Wed, 28 Jun 2023 at 10:19, Jason Wang  wrote:
>
> On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi  wrote:
> >
> > On Wed, 28 Jun 2023 at 09:59, Jason Wang  wrote:
> > >
> > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi  
> > > wrote:
> > > >
> > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang  wrote:
> > > > >
> > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets  
> > > > > wrote:
> > > > > >
> > > > > > On 6/27/23 04:54, Jason Wang wrote:
> > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets 
> > > > > > >  wrote:
> > > > > > >>
> > > > > > >> On 6/26/23 08:32, Jason Wang wrote:
> > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang 
> > > > > > >>>  wrote:
> > > > > > >>>>
> > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets 
> > > > > > >>>>  wrote:
> > > > > > >> It is noticeably more performant than a tap with vhost=on in 
> > > > > > >> terms of PPS.
> > > > > > >> So, that might be one case.  Taking into account that just rcu 
> > > > > > >> lock and
> > > > > > >> unlock in virtio-net code takes more time than a packet copy, 
> > > > > > >> some batching
> > > > > > >> on QEMU side should improve performance significantly.  And it 
> > > > > > >> shouldn't be
> > > > > > >> too hard to implement.
> > > > > > >>
> > > > > > >> Performance over virtual interfaces may potentially be improved 
> > > > > > >> by creating
> > > > > > >> a kernel thread for async Tx.  Similarly to what io_uring 
> > > > > > >> allows.  Currently
> > > > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't 
> > > > > > >> allow to
> > > > > > >> scale well.
> > > > > > >
> > > > > > > Interestingly, actually, there are a lot of "duplication" between
> > > > > > > io_uring and AF_XDP:
> > > > > > >
> > > > > > > 1) both have similar memory model (user register)
> > > > > > > 2) both use ring for communication
> > > > > > >
> > > > > > > I wonder if we can let io_uring talks directly to AF_XDP.
> > > > > >
> > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we 
> > > > > > can
> > > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
> > > > > > virtual interfaces.  io_uring thread in the kernel will be able to
> > > > > > perform transmission for us.
> > > > >
> > > > > It would be nice if we can use iothread/vhost other than the main loop
> > > > > even if io_uring can use kthreads. We can avoid the memory translation
> > > > > cost.
> > > >
> > > > The QEMU event loop (AioContext) has io_uring code
> > > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
> > > > on patches to re-enable it and will probably send them in July. The
> > > > patches also add an API to submit arbitrary io_uring operations so
> > > > that you can do stuff besides file descriptor monitoring. Both the
> > > > main loop and IOThreads will be able to use io_uring on Linux hosts.
> > >
> > > Just to make sure I understand. If we still need a copy from guest to
> > > io_uring buffer, we still need to go via memory API for GPA which
> > > seems expensive.
> > >
> > > Vhost seems to be a shortcut for this.
> >
> > I'm not sure how exactly you're thinking of using io_uring.
> >
> > Simply using io_uring for the event loop (file descriptor monitoring)
> > doesn't involve an extra buffer, but the packet payload still needs to
> > reside in AF_XDP umem, so there is a copy between guest memory and
> > umem.
>
> So there would be a translation from GPA to HVA (unless io_uring
> support 2 stages) which needs to go via qemu memory core. And this
> part seems to be very expensive according to my test in the past.

Yes, but in the current approach where AF_XDP is implemented as a QEMU
netdev, there is already QEMU device emulation (e.g. virtio-net)
happening. So the GPA to HVA translation will happen anyway in device
emulation.

Are you thinking about AF_XDP passthrough where the guest directly
interacts with AF_XDP?

> > If umem encompasses guest memory,
>
> It requires you to pin the whole guest memory and a GPA to HVA
> translation is still required.

Ilya mentioned that umem uses relative offsets instead of absolute
memory addresses. In the AF_XDP passthrough case this means no address
translation needs to be added to AF_XDP.

Regarding pinning - I wonder if that's something that can be refined
in the kernel by adding an AF_XDP flag that enables on-demand pinning
of umem. That way only rx and tx buffers that are currently in use
will be pinned. The disadvantage is the runtime overhead to pin/unpin
pages. I'm not sure whether it's possible to implement this, I haven't
checked the kernel code.

Stefan



Re: [PATCH] net: add initial support for AF_XDP network backend

2023-06-28 Thread Stefan Hajnoczi
On Wed, 28 Jun 2023 at 09:59, Jason Wang  wrote:
>
> On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi  wrote:
> >
> > On Wed, 28 Jun 2023 at 05:28, Jason Wang  wrote:
> > >
> > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets  wrote:
> > > >
> > > > On 6/27/23 04:54, Jason Wang wrote:
> > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets  
> > > > > wrote:
> > > > >>
> > > > >> On 6/26/23 08:32, Jason Wang wrote:
> > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang  
> > > > >>> wrote:
> > > > >>>>
> > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets  
> > > > >>>> wrote:
> > > > >> It is noticeably more performant than a tap with vhost=on in terms 
> > > > >> of PPS.
> > > > >> So, that might be one case.  Taking into account that just rcu lock 
> > > > >> and
> > > > >> unlock in virtio-net code takes more time than a packet copy, some 
> > > > >> batching
> > > > >> on QEMU side should improve performance significantly.  And it 
> > > > >> shouldn't be
> > > > >> too hard to implement.
> > > > >>
> > > > >> Performance over virtual interfaces may potentially be improved by 
> > > > >> creating
> > > > >> a kernel thread for async Tx.  Similarly to what io_uring allows.  
> > > > >> Currently
> > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't 
> > > > >> allow to
> > > > >> scale well.
> > > > >
> > > > > Interestingly, actually, there are a lot of "duplication" between
> > > > > io_uring and AF_XDP:
> > > > >
> > > > > 1) both have similar memory model (user register)
> > > > > 2) both use ring for communication
> > > > >
> > > > > I wonder if we can let io_uring talks directly to AF_XDP.
> > > >
> > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can
> > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
> > > > virtual interfaces.  io_uring thread in the kernel will be able to
> > > > perform transmission for us.
> > >
> > > It would be nice if we can use iothread/vhost other than the main loop
> > > even if io_uring can use kthreads. We can avoid the memory translation
> > > cost.
> >
> > The QEMU event loop (AioContext) has io_uring code
> > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
> > on patches to re-enable it and will probably send them in July. The
> > patches also add an API to submit arbitrary io_uring operations so
> > that you can do stuff besides file descriptor monitoring. Both the
> > main loop and IOThreads will be able to use io_uring on Linux hosts.
>
> Just to make sure I understand. If we still need a copy from guest to
> io_uring buffer, we still need to go via memory API for GPA which
> seems expensive.
>
> Vhost seems to be a shortcut for this.

I'm not sure how exactly you're thinking of using io_uring.

Simply using io_uring for the event loop (file descriptor monitoring)
doesn't involve an extra buffer, but the packet payload still needs to
reside in AF_XDP umem, so there is a copy between guest memory and
umem. If umem encompasses guest memory, it may be possible to avoid
copying the packet payload.

Stefan



Re: [PATCH] net: add initial support for AF_XDP network backend

2023-06-28 Thread Stefan Hajnoczi
On Wed, 28 Jun 2023 at 05:28, Jason Wang  wrote:
>
> On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets  wrote:
> >
> > On 6/27/23 04:54, Jason Wang wrote:
> > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets  wrote:
> > >>
> > >> On 6/26/23 08:32, Jason Wang wrote:
> > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang  wrote:
> > 
> >  On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets  
> >  wrote:
> > >> It is noticeably more performant than a tap with vhost=on in terms of 
> > >> PPS.
> > >> So, that might be one case.  Taking into account that just rcu lock and
> > >> unlock in virtio-net code takes more time than a packet copy, some 
> > >> batching
> > >> on QEMU side should improve performance significantly.  And it shouldn't 
> > >> be
> > >> too hard to implement.
> > >>
> > >> Performance over virtual interfaces may potentially be improved by 
> > >> creating
> > >> a kernel thread for async Tx.  Similarly to what io_uring allows.  
> > >> Currently
> > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
> > >> scale well.
> > >
> > > Interestingly, actually, there are a lot of "duplication" between
> > > io_uring and AF_XDP:
> > >
> > > 1) both have similar memory model (user register)
> > > 2) both use ring for communication
> > >
> > > I wonder if we can let io_uring talks directly to AF_XDP.
> >
> > Well, if we submit poll() in QEMU main loop via io_uring, then we can
> > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
> > virtual interfaces.  io_uring thread in the kernel will be able to
> > perform transmission for us.
>
> It would be nice if we can use iothread/vhost other than the main loop
> even if io_uring can use kthreads. We can avoid the memory translation
> cost.

The QEMU event loop (AioContext) has io_uring code
(utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
on patches to re-enable it and will probably send them in July. The
patches also add an API to submit arbitrary io_uring operations so
that you can do stuff besides file descriptor monitoring. Both the
main loop and IOThreads will be able to use io_uring on Linux hosts.

Stefan



Re: [PATCH] net: add initial support for AF_XDP network backend

2023-06-27 Thread Stefan Hajnoczi
Can multiple VMs share a host netdev by filtering incoming traffic
based on each VM's MAC address and directing it to the appropriate
XSK? If yes, then I think AF_XDP is interesting when SR-IOV or similar
hardware features are not available.
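Mechanically, this kind of per-MAC demultiplexing is usually done with a
small XDP program that looks up the destination MAC in a BPF hash map and
redirects matching frames into a per-VM AF_XDP socket through an XSKMAP.
The kernel-side fragment below is an untested illustration only (map names
and sizes are invented; a real deployment would load it with libbpf/libxdp
and populate the maps from userspace):

```c
/* Compile with: clang -O2 -g -target bpf -c xsk_demux.c -o xsk_demux.o */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(key_size, ETH_ALEN);          /* guest MAC address */
    __uint(value_size, sizeof(__u32));   /* index into the XSKMAP below */
    __uint(max_entries, 64);
} mac_to_xsk SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
    __uint(max_entries, 64);
} xsks SEC(".maps");

SEC("xdp")
int demux_by_mac(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;

    if ((void *)(eth + 1) > data_end)
        return XDP_DROP;                 /* truncated frame */

    __u32 *idx = bpf_map_lookup_elem(&mac_to_xsk, eth->h_dest);
    if (!idx)
        return XDP_PASS;                 /* not a VM MAC: normal kernel path */

    /* hand the frame to that VM's AF_XDP socket, fall back to the stack */
    return bpf_redirect_map(&xsks, *idx, XDP_PASS);
}

char _license[] SEC("license") = "GPL";
```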

The idea of an AF_XDP passthrough device seems interesting because it
would minimize the overhead and avoid some of the existing software
limitations (mostly in QEMU's networking subsystem) that you
described. I don't know whether the AF_XDP API is suitable or can be
extended to build a hardware emulation interface, but it seems
plausible. When Stefano Garzarella played with io_uring passthrough
into the guest, one of the issues was guest memory translation (since
the guest doesn't use host userspace virtual addresses). I guess
AF_XDP would need an API for adding/removing memory translations or
operate in a mode where addresses are relative offsets from the start
of the umem regions (but this may be impractical if it limits where
the guest can allocate packet payload buffers).

Whether you pursue the passthrough approach or not, making -netdev
af-xdp work in an environment where QEMU runs unprivileged seems like
the most important practical issue to solve.

Stefan



Re: virtio-blk using a single iothread

2023-06-21 Thread Stefan Hajnoczi
Hi Sagi,
I just got back from a conference and am going to be offline for a
week starting tomorrow. I haven't had time to look through your email
but will reply when I'm back from vacation.

Stefan

On Sun, 11 Jun 2023 at 14:29, Sagi Grimberg  wrote:
>
>
>
> On 6/8/23 19:08, Stefan Hajnoczi wrote:
> > On Thu, Jun 08, 2023 at 10:40:57AM +0300, Sagi Grimberg wrote:
> >> Hey Stefan, Paolo,
> >>
> >> I just had a report from a user experiencing lower virtio-blk
> >> performance than he expected. This user is running virtio-blk on top of
> >> nvme-tcp device. The guest is running 12 CPU cores.
> >>
> >> The guest read/write throughput is capped at around 30% of the available
> >> throughput from the host (~800MB/s from the guest vs. 2800MB/s from the
> >> host - 25Gb/s nic). The workload running on the guest is a
> >> multi-threaded fio workload.
> >>
> >> What is observed is the fact that virtio-blk is using a single disk-wide
> >> iothread processing all the vqs. Specifically nvme-tcp (similar to other
> >> tcp based protocols) is negatively impacted by lack of thread
> >> concurrency that can distribute I/O requests to different TCP
> >> connections.
> >>
> >> We also attempted to move the iothread to a dedicated core, however that
> >> did not yield any meaningful performance improvements. The reason appears
> >> to be less about CPU utilization on the iothread core and more about
> >> single TCP connection serialization.
> >>
> >> Moving to io=threads does increase the throughput, however it sacrifices
> >> latency significantly.
> >>
> >> So the user finds themselves with available host CPUs and TCP connections
> >> that it could easily use to get maximum throughput, without the ability
> >> to leverage them. True, other guests will use different
> >> threads/contexts, however the goal here is to allow the full performance
> >> from a single device.
> >>
> >> I've seen several discussions and attempts in the past to allow a
> >> virtio-blk device leverage multiple iothreads, but around 2 years ago
> >> the discussions over this paused. So wanted to ask, are there any plans
> >> or anything in the works to address this limitation?
> >>
> >> I've seen that the spdk folks are heading in this direction with their
> >> vhost-blk implementation:
> >> https://review.spdk.io/gerrit/c/spdk/spdk/+/16068
> >
> > Hi Sagi,
> > Yes, there is an ongoing QEMU multi-queue block layer effort to make it
> > possible for multiple IOThreads to process disk I/O for the same
> > --blockdev in parallel.
>
> Great to know.
>
> > Most of my recent QEMU patches have been part of this effort. There is a
> > work-in-progress branch that supports mapping virtio-blk virtqueues to
> > specific IOThreads:
> > https://gitlab.com/stefanha/qemu/-/commits/virtio-blk-iothread-vq-mapping
>
> Thanks for the pointer.
>
> > The syntax is:
> >
> >--device 
> > '{"driver":"virtio-blk-pci","iothread-vq-mapping":[{"iothread":"iothread0"},{"iothread":"iothread1"}],"drive":"drive0"}'
> >
> > This says "assign virtqueues round-robin to iothread0 and iothread1".
> > Half the virtqueues will be processed by iothread0 and the other half by
> > iothread1. There is also syntax for assigning specific virtqueues to
> > each IOThread, but usually the automatic round-robin assignment is all
> > that's needed.
> >
> > This work is not finished yet. Basic I/O (e.g. fio) works without
> > crashes, but expect to hit issues if you use blockjobs, hotplug, etc.
> >
> > Performance optimization work has just begun, so it won't deliver all
> > the benefits yet. I ran a benchmark yesterday where going from 1 to 2
> > IOThreads increased performance by 25%. That's much less than we're
> > aiming for; attaching two independent virtio-blk devices improves the
> > performance by ~100%. I know we can get there eventually. Some of the
> > bottlenecks are known (e.g. block statistics collection causes lock
> > contention) and others are yet to be investigated.
>
> Hmm, I rebased this branch on top of mainline master and ran a naive
> test, and it seems that performance regressed quite a bit :(
>
> I'm running this test on my laptop (Intel(R) Core(TM) i7-8650U CPU
> @1.90GHz), so this is more qualitative test for BW only.
> I use null_blk as the host device.
>
> With mainline master I get ~9GB/s 64k 

Re: [RFC 5/6] migration: Deprecate block migration

2023-06-21 Thread Stefan Hajnoczi
On Mon, Jun 12, 2023 at 09:33:43PM +0200, Juan Quintela wrote:
> It is obsolete.  It is better to use driver_mirror+NBD instead.
> 
> CC: Kevin Wolf 
> CC: Eric Blake 
> CC: Stefan Hajnoczi 
> CC: Hanna Czenczek 
> 
> Signed-off-by: Juan Quintela 
> 
> ---
> 
> Can any of you give one example of how to use driver_mirror+NBD for
> deprecated.rst?

Please see "QMP invocation for live storage migration with
``drive-mirror`` + NBD" in docs/interop/live-block-operations.rst for a
detailed explanation.
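For deprecated.rst purposes, the flow documented there boils down to
something like the following QMP sketch (host names, ports and device names
are placeholders, and the commands are summarized from memory, so treat the
document itself as authoritative):

```
# Destination: start QEMU with an empty target image and -incoming, then
# export the target disk over NBD:
{"execute": "nbd-server-start",
 "arguments": {"addr": {"type": "inet",
                        "data": {"host": "dst-host", "port": "10809"}}}}
{"execute": "nbd-server-add",
 "arguments": {"device": "drive0", "writable": true}}

# Source: mirror the disk into the NBD export, wait for BLOCK_JOB_READY,
# then migrate RAM and device state as usual:
{"execute": "drive-mirror",
 "arguments": {"device": "drive0", "sync": "full", "mode": "existing",
               "target": "nbd:dst-host:10809:exportname=drive0"}}
{"execute": "migrate", "arguments": {"uri": "tcp:dst-host:4444"}}
```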

> 
> Thanks, Juan.
> ---
>  docs/about/deprecated.rst |  6 ++
>  qapi/migration.json   | 29 +
>  migration/block.c |  2 ++
>  migration/options.c   |  7 +++
>  4 files changed, 40 insertions(+), 4 deletions(-)
> 
> diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
> index 518672722d..173c5ba5cb 100644
> --- a/docs/about/deprecated.rst
> +++ b/docs/about/deprecated.rst
> @@ -454,3 +454,9 @@ Everything except ``-incoming defer`` are deprecated.  
> This allows to
>  setup parameters before launching the proper migration with
>  ``migrate-incoming uri``.
>  
> +block migration (since 8.1)
> +'''
> +
> +Block migration is too inflexible.  It needs to migrate all block
> +devices or none.  Use driver_mirror+NBD instead.

blockdev-mirror with NBD

> +
> diff --git a/qapi/migration.json b/qapi/migration.json
> index b71e00737e..a8497de48d 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -258,11 +258,16 @@
>  # blocked.  Present and non-empty when migration is blocked.
>  # (since 6.0)
>  #
> +# Features:
> +#
> +# @deprecated: @disk migration is deprecated.  Use driver_mirror+NBD

blockdev-mirror with NBD

> +# instead.
> +#
>  # Since: 0.14
>  ##
>  { 'struct': 'MigrationInfo',
>'data': {'*status': 'MigrationStatus', '*ram': 'MigrationStats',
> -   '*disk': 'MigrationStats',
> +   '*disk': { 'type': 'MigrationStats', 'features': ['deprecated'] },
> '*vfio': 'VfioStats',
> '*xbzrle-cache': 'XBZRLECacheStats',
> '*total-time': 'int',
> @@ -497,6 +502,9 @@
>  #
>  # Features:
>  #
> +# @deprecated: @block migration is deprecated.  Use driver_mirror+NBD

blockdev-mirror with NBD

> +# instead.
> +#
>  # @unstable: Members @x-colo and @x-ignore-shared are experimental.
>  #
>  # Since: 1.2
> @@ -506,7 +514,8 @@
> 'compress', 'events', 'postcopy-ram',
> { 'name': 'x-colo', 'features': [ 'unstable' ] },
> 'release-ram',
> -   'block', 'return-path', 'pause-before-switchover', 'multifd',
> +   { 'name': 'block', 'features': [ 'deprecated' ] },
> +   'return-path', 'pause-before-switchover', 'multifd',
> 'dirty-bitmaps', 'postcopy-blocktime', 'late-block-activate',
> { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
> 'validate-uuid', 'background-snapshot',
> @@ -789,6 +798,9 @@
>  #
>  # Features:
>  #
> +# @deprecated: Member @block-incremental is obsolete. Use
> +# driver_mirror+NBD instead.

blockdev-mirror with NBD

> +#
>  # @unstable: Member @x-checkpoint-delay is experimental.
>  #
>  # Since: 2.4
> @@ -803,7 +815,7 @@
> 'tls-creds', 'tls-hostname', 'tls-authz', 'max-bandwidth',
> 'downtime-limit',
> { 'name': 'x-checkpoint-delay', 'features': [ 'unstable' ] },
> -   'block-incremental',
> +   { 'name': 'block-incremental', 'features': [ 'deprecated' ] },
> 'multifd-channels',
> 'xbzrle-cache-size', 'max-postcopy-bandwidth',
> 'max-cpu-throttle', 'multifd-compression',
> @@ -945,6 +957,9 @@
>  #
>  # Features:
>  #
> +# @deprecated: Member @block-incremental is obsolete. Use
> +# driver_mirror+NBD instead.

blockdev-mirror with NBD

> +#
>  # @unstable: Member @x-checkpoint-delay is experimental.
>  #
>  # TODO: either fuse back into MigrationParameters, or make
> @@ -972,7 +987,8 @@
>  '*downtime-limit': 'uint64',
>  '*x-checkpoint-delay': { 'type': 'uint32',
>   'features': [ 'unstable' ] },
> -'*block-incremental': 'bool',
> +'*block-incremental': { 'type': 'bool',
> +'features': [ 'deprecated' ] },
>  '*multifd-channels': 'uint8',
>  '*xbzrle-cache-size': 'size',
>  '*max-postcopy-bandwidth': 'size',
> @@ -1137,6 +1153,9 @@
>  #
>  # Features:
>  #
> +# @deprecated: Member @block-incremental is obsolete. Use

Re: [RFC 2/4] qcow2: add configurations for zoned format extension

2023-06-20 Thread Stefan Hajnoczi
On Mon, Jun 19, 2023 at 10:50:31PM +0800, Sam Li wrote:
> Stefan Hajnoczi  wrote on Mon, Jun 19, 2023 at 22:42:
> >
> > On Mon, Jun 19, 2023 at 06:32:52PM +0800, Sam Li wrote:
> > > Stefan Hajnoczi  wrote on Mon, Jun 19, 2023 at 18:10:
> > > > On Mon, Jun 05, 2023 at 06:41:06PM +0800, Sam Li wrote:
> > > > > diff --git a/block/qcow2.h b/block/qcow2.h
> > > > > index 4f67eb912a..fe18dc4d97 100644
> > > > > --- a/block/qcow2.h
> > > > > +++ b/block/qcow2.h
> > > > > @@ -235,6 +235,20 @@ typedef struct Qcow2CryptoHeaderExtension {
> > > > >  uint64_t length;
> > > > >  } QEMU_PACKED Qcow2CryptoHeaderExtension;
> > > > >
> > > > > +typedef struct Qcow2ZonedHeaderExtension {
> > > > > +/* Zoned device attributes */
> > > > > +BlockZonedProfile zoned_profile;
> > > > > +BlockZoneModel zoned;
> > > > > +uint32_t zone_size;
> > > > > +uint32_t zone_capacity;
> > > > > +uint32_t nr_zones;
> > > > > +uint32_t zone_nr_conv;
> > > > > +uint32_t max_active_zones;
> > > > > +uint32_t max_open_zones;
> > > > > +uint32_t max_append_sectors;
> > > > > +uint8_t padding[3];
> > > >
> > > > This looks strange. Why is there 3 bytes of padding at the end? Normally
> > > > padding would align to an even power-of-two number of bytes like 2, 4,
> > > > 8, etc.
> > >
> > > It is calculated as 3 if sizeof(zoned+zoned_profile) = 8. Else if it's
> > > 16, the padding is 2.
> >
> > I don't understand. Can you explain why there is padding at the end of
> > this struct?
> 
> The overall size should be aligned with 64 bit, which leaves use one
> uint32_t and two fields zoned, zoned_profile. I am not sure the size
> of macros here and it used 4 for each. So it makes 3 (*8) + 32 + 8 =
> 64 in the end. If the macro size is wrong, then the padding will
> change as well.

The choice of the type (char or int) representing an enum is
implementation-defined according to the C17 standard (see "6.7.2.2
Enumeration specifiers").

Therefore it's not portable to use enums in structs exposed to the
outside world (on-disk formats or network protocols).

Please use uint8_t for the zoned_profile and zoned fields and move them
to the end of the struct so the uint32_t fields are naturally aligned.

I think only 2 bytes of padding will be required to align the struct to
a 64-bit boundary once you've done that.

Stefan




Re: [PATCH 05/12] hw/virtio: Add support for apple virtio-blk

2023-06-20 Thread Stefan Hajnoczi
On Wed, Jun 14, 2023 at 10:56:22PM +, Alexander Graf wrote:
> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> index 39e7f23fab..76b85bb3cb 100644
> --- a/hw/block/virtio-blk.c
> +++ b/hw/block/virtio-blk.c
> @@ -1120,6 +1120,20 @@ static int virtio_blk_handle_request(VirtIOBlockReq 
> *req, MultiReqBuffer *mrb)
>  
>  break;
>  }
> +case VIRTIO_BLK_T_APPLE1:
> +{
> +if (s->conf.x_apple_type) {
> +/* Only valid on Apple Virtio */
> +char buf[iov_size(in_iov, in_num)];

I'm concerned that a variable-sized stack buffer could be abused by a
malicious guest. Even if it's harmless in the Apple use case, someone
else might copy this approach and use it where it creates a security
problem. Please either implement iov_memset() or allocate the temporary
buffer using bdrv_blockalign() (and free it with qemu_vfree()).

> +memset(buf, 0, sizeof(buf));
> +iov_from_buf(in_iov, in_num, 0, buf, sizeof(buf));
> +virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);




Re: [PATCH v2 3/3] hw/ufs: Support for UFS logical unit

2023-06-19 Thread Stefan Hajnoczi
On Mon, Jun 19, 2023 at 08:16:27PM +0900, Jeuk Kim wrote:
> On Fri, Jun 19, 2023, Stefan Hajnoczi wrote:
> >On Fri, Jun 16, 2023 at 03:58:27PM +0900, Jeuk Kim wrote:
> >> This commit adds support for ufs logical unit.
> >> The LU handles processing for the SCSI command,
> >> unit descriptor query request.
> >> 
> >> This commit enables the UFS device to process
> >> IO requests.
> >
> >Is UFS a SCSI Host Bus Adapter capable of exposing any SCSI device? The
> >code is written as if UFS was a special-purpose SCSI bus that cannot
> >handle regular SCSI devices already emulated by QEMU (like scsi-hd). As
> >a result, it duplicates a lot of SCSI device code instead of just
> >focussing on unwrapping/wrapping the SCSI commands and responses from
> >the UFS interface.
> >
> >Would it be possible to have:
> >
> >  --device ufs,id=
> >  --device scsi-hd,bus=
> >
> >?
> >
> >I think that would involve less code and be more flexible.
> >
> 
> Unfortunately, UFS is not a generic SCSI Host Bus Adapter.
> UFS uses the SCSI specification to communicate with the driver,
> but its behaviour is very different from that of a typical SCSI device.
> (So it's intentional that UFS looks like a special-purpose SCSI bus.)
> 
> For example, UFS has the well-known lu.
> Unlike typical SCSI devices, where each lu is independent,
> UFS can control other lu's through the well-known lu.
> 
> Therefore, UFS can only work properly with ufs-lu, and not with
> other scsi devices such as scsi-hd. :'(
> 
> That's why I made the UFS bus and added the ufs_bus_check_address()
> to prevent normal scsi devices and UFS from connecting to each other.
> 
> Also, in the future, I will add more ufs-specific features
> like hibernation and zoned, which are different from normal SCSI devices.
> 
> So personally, I think we should define ufs-lu separately as we do now.
> Is that okay?

Yes, I think that makes sense. Thanks for explaining.

Paolo Bonzini is the QEMU SCSI emulation maintainer. He might have more
thoughts about this. I have CCed him, but I think you can continue with
the current approach unless Paolo decides to get involved in this patch
series.

Stefan




Re: [RFC 2/4] qcow2: add configurations for zoned format extension

2023-06-19 Thread Stefan Hajnoczi
On Mon, Jun 19, 2023 at 06:32:52PM +0800, Sam Li wrote:
> Stefan Hajnoczi  wrote on Mon, Jun 19, 2023 at 18:10:
> > On Mon, Jun 05, 2023 at 06:41:06PM +0800, Sam Li wrote:
> > > diff --git a/block/qcow2.h b/block/qcow2.h
> > > index 4f67eb912a..fe18dc4d97 100644
> > > --- a/block/qcow2.h
> > > +++ b/block/qcow2.h
> > > @@ -235,6 +235,20 @@ typedef struct Qcow2CryptoHeaderExtension {
> > >  uint64_t length;
> > >  } QEMU_PACKED Qcow2CryptoHeaderExtension;
> > >
> > > +typedef struct Qcow2ZonedHeaderExtension {
> > > +/* Zoned device attributes */
> > > +BlockZonedProfile zoned_profile;
> > > +BlockZoneModel zoned;
> > > +uint32_t zone_size;
> > > +uint32_t zone_capacity;
> > > +uint32_t nr_zones;
> > > +uint32_t zone_nr_conv;
> > > +uint32_t max_active_zones;
> > > +uint32_t max_open_zones;
> > > +uint32_t max_append_sectors;
> > > +uint8_t padding[3];
> >
> > This looks strange. Why is there 3 bytes of padding at the end? Normally
> > padding would align to an even power-of-two number of bytes like 2, 4,
> > 8, etc.
> 
> It is calculated as 3 if sizeof(zoned+zoned_profile) = 8. Else if it's
> 16, the padding is 2.

I don't understand. Can you explain why there is padding at the end of
this struct?




Re: [RFC 4/4] iotests: test the zoned format feature for qcow2 file

2023-06-19 Thread Stefan Hajnoczi
On Mon, Jun 05, 2023 at 06:41:08PM +0800, Sam Li wrote:
> The zoned format feature can be tested by:
> $ tests/qemu-iotests/check zoned-qcow2
> 
> Signed-off-by: Sam Li 
> ---
>  tests/qemu-iotests/tests/zoned-qcow2 | 110 +++
>  tests/qemu-iotests/tests/zoned-qcow2.out |  87 ++
>  2 files changed, 197 insertions(+)
>  create mode 100755 tests/qemu-iotests/tests/zoned-qcow2
>  create mode 100644 tests/qemu-iotests/tests/zoned-qcow2.out
> 
> diff --git a/tests/qemu-iotests/tests/zoned-qcow2 
> b/tests/qemu-iotests/tests/zoned-qcow2
> new file mode 100755
> index 00..6aa5ab3a03
> --- /dev/null
> +++ b/tests/qemu-iotests/tests/zoned-qcow2
> @@ -0,0 +1,110 @@
> +#!/usr/bin/env bash
> +#
> +# Test zone management operations for qcow2 file.
> +#
> +
> +seq="$(basename $0)"
> +echo "QA output created by $seq"
> +status=1 # failure is the default!
> +
> +file_name="zbc.qcow2"

Please use $TEST_IMG_FILE instead of defining your own variable here.
(TEST_IMG_FILE is already defined in common.rc.)

> +_cleanup()
> +{
> +  _cleanup_test_img
> +  _rm_test_img "$file_name"
> +}
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +# get standard environment, filters and checks
> +. ../common.rc
> +. ../common.filter
> +. ../common.qemu
> +
> +# This test only runs on Linux hosts with qcow2 image files.

Then you need to add:
_supported_fmt qcow2

> +_supported_proto file
> +_supported_os Linux

Is this test really Linux-specific?

> +
> +echo
> +echo "=== Initial image setup ==="
> +echo
> +
> +$QEMU_IMG create -f qcow2 $file_name -o size=768M -o zone_size=64M \
> +-o zone_capacity=64M -o zone_nr_conv=0 -o max_append_sectors=512 \
> +-o max_open_zones=0 -o max_active_zones=0 -o zoned_profile=zbc
> +
> +IMG="--image-opts -n driver=qcow2,file.driver=file,file.filename=$file_name"
> +QEMU_IO_OPTIONS=$QEMU_IO_OPTIONS_NO_FMT
> +
> +echo
> +echo "=== Testing a qcow2 img with zoned format ==="
> +echo
> +echo "case 1: if the operations work"
> +
> +echo "(1) report the first zone:"
> +$QEMU_IO $IMG -c "zrp 0 1"
> +echo
> +echo "report the first 10 zones"
> +$QEMU_IO $IMG -c "zrp 0 10"
> +echo
> +echo "report the last zone:"
> +$QEMU_IO $IMG -c "zrp 0x2C00 2" # 0x2C00 / 512 = 0x16
> +echo
> +echo
> +echo "(2) opening the first zone"
> +$QEMU_IO $IMG -c "zo 0 0x400" # 0x400 / 512 = 0x2
> +echo "report after:"
> +$QEMU_IO $IMG -c "zrp 0 1"
> +echo
> +echo "opening the second zone"
> +$QEMU_IO $IMG -c "zo 0x400 0x400"
> +echo "report after:"
> +$QEMU_IO $IMG -c "zrp 0x400 1"
> +echo
> +echo "opening the last zone"
> +$QEMU_IO $IMG -c "zo 0x2C00 0x400"
> +echo "report after:"
> +$QEMU_IO $IMG -c "zrp 0x2C00 2"
> +echo
> +echo
> +echo "(3) closing the first zone"
> +$QEMU_IO $IMG -c "zc 0 0x400"
> +echo "report after:"
> +$QEMU_IO $IMG -c "zrp 0 1"
> +echo
> +echo "closing the last zone"
> +$QEMU_IO $IMG -c "zc 0x3e7000 0x400"
> +echo "report after:"
> +$QEMU_IO $IMG -c "zrp 0x3e7000 2"
> +echo
> +echo
> +echo "(4) finishing the second zone"
> +$QEMU_IO $IMG -c "zf 0x400 0x400"
> +echo "After finishing a zone:"
> +$QEMU_IO $IMG -c "zrp 0x400 1"
> +echo
> +echo
> +echo "(5) resetting the second zone"
> +$QEMU_IO $IMG -c "zrs 0x400 0x400"
> +echo "After resetting a zone:"
> +$QEMU_IO $IMG -c "zrp 0x400 1"
> +echo
> +echo
> +echo "(6) append write" # the physical block size of the device is 4096
> +$QEMU_IO $IMG -c "zrp 0 1"
> +$QEMU_IO $IMG -c "zap -p 0 0x1000 0x2000"
> +echo "After appending the first zone firstly:"
> +$QEMU_IO $IMG -c "zrp 0 1"
> +$QEMU_IO $IMG -c "zap -p 0 0x1000 0x2000"
> +echo "After appending the first zone secondly:"
> +$QEMU_IO $IMG -c "zrp 0 1"
> +$QEMU_IO $IMG -c "zap -p 0x400 0x1000 0x2000"
> +echo "After appending the second zone firstly:"
> +$QEMU_IO $IMG -c "zrp 0x400 1"
> +$QEMU_IO $IMG -c "zap -p 0x400 0x1000 0x2000"
> +echo "After appending the second zone secondly:"
> +$QEMU_IO $IMG -c "zrp 0x400 1"
> +
> +# success, all done
> +echo "*** done"
> +rm -f $seq.full
> +status=0
> diff --git a/tests/qemu-iotests/tests/zoned-qcow2.out 
> b/tests/qemu-iotests/tests/zoned-qcow2.out
> new file mode 100644
> index 00..288bceffc4
> --- /dev/null
> +++ b/tests/qemu-iotests/tests/zoned-qcow2.out
> @@ -0,0 +1,87 @@
> +QA output created by zoned-qcow2
> +
> +=== Initial image setup ===
> +
> +Formatting 'zbc.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off 
> compression_type=zlib zoned_profile=zbc zone_size=67108864 
> zone_capacity=67108864 zone_nr_conv=0 max_append_sectors=512 
> max_active_zones=0 max_open_zones=0 size=805306368 lazy_refcounts=off 
> refcount_bits=16
> +
> +=== Testing a qcow2 img with zoned format ===
> +
> +case 1: if the operations work
> +(1) report the first zone:
> +start: 0x0, len 0x2, cap 0x2, wptr 0x0, zcond:1, [type: 2]
> +
> +report the first 10 zones
> +start: 0x0, len 0x2, cap 0x2, wptr 0x0, zcond:1, [type: 

Re: [RFC 3/4] qcow2: add zoned emulation capability

2023-06-19 Thread Stefan Hajnoczi
On Mon, Jun 05, 2023 at 06:41:07PM +0800, Sam Li wrote:
> By adding zone operations and zoned metadata, the zoned emulation
> capability enables full emulation support of zoned device using
> a qcow2 file. The zoned device metadata includes zone type,
> zoned device state and write pointer of each zone, which is stored
> to an array of unsigned integers.
> 
> Each zone of a zoned device makes state transitions following
> the zone state machine. The zone state machine mainly describes
> five states, IMPLICIT OPEN, EXPLICIT OPEN, FULL, EMPTY and CLOSED.
> READ ONLY and OFFLINE states will generally be affected by device
> internal events. The operations on zones cause corresponding state
> changing.
> 
> Zoned devices have a limit on zone resources, which puts constraints on
> write operations into zones.
> 
> Signed-off-by: Sam Li 
> ---
>  block/qcow2.c | 629 +-
>  block/qcow2.h |   2 +
>  2 files changed, 629 insertions(+), 2 deletions(-)
> 
> diff --git a/block/qcow2.c b/block/qcow2.c
> index b886dab42b..f030965d5d 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -194,6 +194,164 @@ qcow2_extract_crypto_opts(QemuOpts *opts, const char 
> *fmt, Error **errp)
>  return cryptoopts_qdict;
>  }
>  
> +#define QCOW2_ZT_IS_CONV(wp)(wp & 1ULL << 59)
> +
> +static inline int qcow2_get_wp(uint64_t wp)
> +{
> +/* clear state and type information */
> +return ((wp << 5) >> 5);
> +}
> +
> +static inline int qcow2_get_zs(uint64_t wp)
> +{
> +return (wp >> 60);
> +}
> +
> +static inline void qcow2_set_wp(uint64_t *wp, BlockZoneState zs)
> +{
> +uint64_t addr = qcow2_get_wp(*wp);
> +addr |= ((uint64_t)zs << 60);
> +*wp = addr;
> +}
> +
> +/*
> + * File wp tracking: reset zone, finish zone and append zone can
> + * change the value of write pointer. All zone operations will change
> + * the state of that/those zone.
> + * */
> +static inline void qcow2_wp_tracking_helper(int index, uint64_t wp) {
> +/* format: operations, the wp. */
> +printf("wps[%d]: 0x%x\n", index, qcow2_get_wp(wp)>>BDRV_SECTOR_BITS);
> +}
> +
> +/*
> + * Perform a state assignment and a flush operation that writes the new wp
> + * value to the dedicated location of the disk file.
> + */
> +static int qcow2_write_wp_at(BlockDriverState *bs, uint64_t *wp,
> + uint32_t index, BlockZoneState zs) {
> +BDRVQcow2State *s = bs->opaque;
> +int ret;
> +
> +qcow2_set_wp(wp, zs);
> +ret = bdrv_pwrite(bs->file, s->zoned_header.zonedmeta_offset
> ++ sizeof(uint64_t) * index, sizeof(uint64_t), wp, 0);
> +
> +if (ret < 0) {
> +goto exit;
> +}
> +qcow2_wp_tracking_helper(index, *wp);
> +return ret;
> +
> +exit:
> +error_report("Failed to write metadata with file");
> +return ret;
> +}
> +
> +static int qcow2_check_active(BlockDriverState *bs)
> +{
> +BDRVQcow2State *s = bs->opaque;
> +
> +if (!s->zoned_header.max_active_zones) {
> +return 0;
> +}
> +
> +if (s->nr_zones_exp_open + s->nr_zones_imp_open + s->nr_zones_closed
> +< s->zoned_header.max_active_zones) {
> +return 0;
> +}
> +
> +return -1;
> +}
> +
> +static int qcow2_check_open(BlockDriverState *bs)
> +{
> +BDRVQcow2State *s = bs->opaque;
> +int ret;
> +
> +if (!s->zoned_header.max_open_zones) {
> +return 0;
> +}
> +
> +if (s->nr_zones_exp_open + s->nr_zones_imp_open
> +< s->zoned_header.max_open_zones) {
> +return 0;
> +}
> +
> +if(s->nr_zones_imp_open) {
> +ret = qcow2_check_active(bs);
> +if (ret == 0) {
> +/* TODO: it takes O(n) time complexity (n = nr_zones).
> + * Optimizations required. */
> +/* close one implicitly open zones to make it available */
> +for (int i = s->zoned_header.zone_nr_conv;
> +i < bs->bl.nr_zones; ++i) {
> > +uint64_t *wp = &s->wps->wp[i];
> +if (qcow2_get_zs(*wp) == BLK_ZS_IOPEN) {
> +ret = qcow2_write_wp_at(bs, wp, i, BLK_ZS_CLOSED);
> +if (ret < 0) {
> +return ret;
> +}
> +s->wps->wp[i] = *wp;
> +s->nr_zones_imp_open--;
> +s->nr_zones_closed++;
> +break;
> +}
> +}
> +return 0;
> +}
> +return ret;
> +}
> +
> +return -1;
> +}
> +
> +/*
> + * The zoned device has limited zone resources of open, closed, active
> + * zones.
> + */
> +static int qcow2_check_zone_resources(BlockDriverState *bs,
> +  BlockZoneState zs)
> +{
> +int ret;
> +
> +switch (zs) {
> +case BLK_ZS_EMPTY:
> +ret = qcow2_check_active(bs);
> +if (ret < 0) {
> +error_report("No enough active zones");
> +return ret;
> +}
> 

Re: [RFC 2/4] qcow2: add configurations for zoned format extension

2023-06-19 Thread Stefan Hajnoczi
On Mon, Jun 05, 2023 at 06:41:06PM +0800, Sam Li wrote:
> To configure the zoned format feature on the qcow2 driver, it
> requires following arguments: the device size, zoned profile,
> zoned model, zone size, zone capacity, number of conventional
> zones, limits on zone resources (max append sectors, max open
> zones, and max_active_zones).
> 
> To create a qcow2 file with zoned format, use command like this:
> $ qemu-img create -f qcow2 test.qcow2 -o size=768M -o
> zone_size=64M -o zone_capacity=64M -o zone_nr_conv=0 -o
> max_append_sectors=512 -o max_open_zones=0 -o max_active_zones=0
>  -o zoned_profile=zbc
> 
> Signed-off-by: Sam Li 
> ---
>  block/qcow2.c| 119 +++
>  block/qcow2.h|  21 ++
>  include/block/block-common.h |   5 ++
>  include/block/block_int-common.h |   8 +++
>  qapi/block-core.json |  46 
>  5 files changed, 185 insertions(+), 14 deletions(-)
> 
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 7f3948360d..b886dab42b 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -73,6 +73,7 @@ typedef struct {
>  #define  QCOW2_EXT_MAGIC_CRYPTO_HEADER 0x0537be77
>  #define  QCOW2_EXT_MAGIC_BITMAPS 0x23852875
>  #define  QCOW2_EXT_MAGIC_DATA_FILE 0x44415441
> +#define  QCOW2_EXT_MAGIC_ZONED_FORMAT 0x7a6264
>  
>  static int coroutine_fn
>  qcow2_co_preadv_compressed(BlockDriverState *bs,
> @@ -210,6 +211,7 @@ qcow2_read_extensions(BlockDriverState *bs, uint64_t 
> start_offset,
>  uint64_t offset;
>  int ret;
>  Qcow2BitmapHeaderExt bitmaps_ext;
> +Qcow2ZonedHeaderExtension zoned_ext;
>  
>  if (need_update_header != NULL) {
>  *need_update_header = false;
> @@ -431,6 +433,37 @@ qcow2_read_extensions(BlockDriverState *bs, uint64_t 
> start_offset,
>  break;
>  }
>  
> +case QCOW2_EXT_MAGIC_ZONED_FORMAT:
> +{
> +if (ext.len != sizeof(zoned_ext)) {
> +error_setg_errno(errp, -ret, "zoned_ext: "
> + "Invalid extension length");
> +return -EINVAL;
> +}
> > +ret = bdrv_pread(bs->file, offset, ext.len, &zoned_ext, 0);
> +if (ret < 0) {
> +error_setg_errno(errp, -ret, "zoned_ext: "
> + "Could not read ext header");
> +return ret;
> +}
> +
> +zoned_ext.zone_size = be32_to_cpu(zoned_ext.zone_size);
> +zoned_ext.nr_zones = be32_to_cpu(zoned_ext.nr_zones);
> +zoned_ext.zone_nr_conv = be32_to_cpu(zoned_ext.zone_nr_conv);
> +zoned_ext.max_open_zones = be32_to_cpu(zoned_ext.max_open_zones);
> +zoned_ext.max_active_zones =
> +be32_to_cpu(zoned_ext.max_active_zones);
> +zoned_ext.max_append_sectors =
> +be32_to_cpu(zoned_ext.max_append_sectors);
> +s->zoned_header = zoned_ext;

Please validate these values. The image file is not trusted and may be
broken/corrupt. For example, zone_size=0 and nr_zones=0 must be rejected
because the code can't do anything useful when these values are zero
(similar for values that are not multiples of the block size).

> +
> +#ifdef DEBUG_EXT
> +printf("Qcow2: Got zoned format extension: "
> +   "offset=%" PRIu32 "\n", offset);
> +#endif
> +break;
> +}
> +
>  default:
>  /* unknown magic - save it in case we need to rewrite the header 
> */
>  /* If you add a new feature, make sure to also update the fast
> @@ -3071,6 +3104,31 @@ int qcow2_update_header(BlockDriverState *bs)
>  buflen -= ret;
>  }
>  
> +/* Zoned devices header extension */
> +if (s->zoned_header.zoned == BLK_Z_HM) {
> +Qcow2ZonedHeaderExtension zoned_header = {
> +.zoned_profile  = s->zoned_header.zoned_profile,
> +.zoned  = s->zoned_header.zoned,
> +.nr_zones   = cpu_to_be32(s->zoned_header.nr_zones),
> +.zone_size  = cpu_to_be32(s->zoned_header.zone_size),
> +.zone_capacity  = cpu_to_be32(s->zoned_header.zone_capacity),
> +.zone_nr_conv   = cpu_to_be32(s->zoned_header.zone_nr_conv),
> +.max_open_zones = 
> cpu_to_be32(s->zoned_header.max_open_zones),
> +.max_active_zones   =
> +cpu_to_be32(s->zoned_header.max_active_zones),
> +.max_append_sectors =
> +cpu_to_be32(s->zoned_header.max_append_sectors)
> +};
> +ret = header_ext_add(buf, QCOW2_EXT_MAGIC_ZONED_FORMAT,
> > + &zoned_header, sizeof(zoned_header),
> + buflen);
> +if (ret < 0) {
> +goto fail;
> +}
> +buf += ret;
> +buflen -= ret;
> +}
> +
>  /* Keep unknown header extensions 

Re: [PATCH v2 3/3] hw/ufs: Support for UFS logical unit

2023-06-19 Thread Stefan Hajnoczi
On Fri, Jun 16, 2023 at 03:58:27PM +0900, Jeuk Kim wrote:
> This commit adds support for ufs logical unit.
> The LU handles processing for the SCSI command,
> unit descriptor query request.
> 
> This commit enables the UFS device to process
> IO requests.

Is UFS a SCSI Host Bus Adapter capable of exposing any SCSI device? The
code is written as if UFS was a special-purpose SCSI bus that cannot
handle regular SCSI devices already emulated by QEMU (like scsi-hd). As
a result, it duplicates a lot of SCSI device code instead of just
focussing on unwrapping/wrapping the SCSI commands and responses from
the UFS interface.

Would it be possible to have:

  --device ufs,id=
  --device scsi-hd,bus=

?

I think that would involve less code and be more flexible.

> 
> Signed-off-by: Jeuk Kim 
> ---
>  hw/ufs/lu.c  | 1441 ++
>  hw/ufs/meson.build   |2 +-
>  hw/ufs/trace-events  |   25 +
>  hw/ufs/ufs.c |  252 ++-
>  hw/ufs/ufs.h |   43 ++
>  include/scsi/constants.h |1 +
>  6 files changed, 1757 insertions(+), 7 deletions(-)
>  create mode 100644 hw/ufs/lu.c
> 
> diff --git a/hw/ufs/lu.c b/hw/ufs/lu.c
> new file mode 100644
> index 00..ef69de61a5
> --- /dev/null
> +++ b/hw/ufs/lu.c
> @@ -0,0 +1,1441 @@
> +/*
> + * QEMU UFS Logical Unit
> + *
> + * Copyright (c) 2023 Samsung Electronics Co., Ltd. All rights reserved.
> + *
> + * Written by Jeuk Kim 
> + *
> + * This code is licensed under the GNU GPL v2 or later.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/units.h"
> +#include "qapi/error.h"
> +#include "qemu/memalign.h"
> +#include "hw/scsi/scsi.h"
> +#include "scsi/constants.h"
> +#include "sysemu/block-backend.h"
> +#include "qemu/cutils.h"
> +#include "trace.h"
> +#include "ufs.h"
> +
> +/*
> + * The code below handling SCSI commands is copied from hw/scsi/scsi-disk.c,
> + * with minor adjustments to make it work for UFS.
> + */
> +
> +#define SCSI_DMA_BUF_SIZE (128 * KiB)
> +#define SCSI_MAX_INQUIRY_LEN 256
> +#define SCSI_INQUIRY_DATA_SIZE 36
> +#define SCSI_MAX_MODE_LEN 256
> +
> +typedef struct UfsSCSIReq {
> +SCSIRequest req;
> +/* Both sector and sector_count are in terms of BDRV_SECTOR_SIZE bytes. */
> +uint64_t sector;
> +uint32_t sector_count;
> +uint32_t buflen;
> +bool started;
> +bool need_fua_emulation;
> +struct iovec iov;
> +QEMUIOVector qiov;
> +BlockAcctCookie acct;
> +} UfsSCSIReq;
> +
> +static void ufs_scsi_free_request(SCSIRequest *req)
> +{
> +UfsSCSIReq *r = DO_UPCAST(UfsSCSIReq, req, req);
> +
> +qemu_vfree(r->iov.iov_base);
> +}
> +
> +static void scsi_check_condition(UfsSCSIReq *r, SCSISense sense)
> +{
> +trace_ufs_scsi_check_condition(r->req.tag, sense.key, sense.asc,
> +   sense.ascq);
> +scsi_req_build_sense(&r->req, sense);
> +scsi_req_complete(&r->req, CHECK_CONDITION);
> +}
> +
> +static int ufs_scsi_emulate_vpd_page(SCSIRequest *req, uint8_t *outbuf,
> + uint32_t outbuf_len)
> +{
> +UfsHc *u = UFS(req->bus->qbus.parent);
> +UfsLu *lu = DO_UPCAST(UfsLu, qdev, req->dev);
> +uint8_t page_code = req->cmd.buf[2];
> +int start, buflen = 0;
> +
> +if (outbuf_len < SCSI_INQUIRY_DATA_SIZE) {
> +return -1;
> +}
> +
> +outbuf[buflen++] = lu->qdev.type & 0x1f;
> +outbuf[buflen++] = page_code;
> +outbuf[buflen++] = 0x00;
> +outbuf[buflen++] = 0x00;
> +start = buflen;
> +
> +switch (page_code) {
> +case 0x00: /* Supported page codes, mandatory */
> +{
> +trace_ufs_scsi_emulate_vpd_page_00(req->cmd.xfer);
> +outbuf[buflen++] = 0x00; /* list of supported pages (this page) */
> +if (u->params.serial) {
> +outbuf[buflen++] = 0x80; /* unit serial number */
> +}
> +outbuf[buflen++] = 0x87; /* mode page policy */
> +break;
> +}
> +case 0x80: /* Device serial number, optional */
> +{
> +int l;
> +
> +if (!u->params.serial) {
> +trace_ufs_scsi_emulate_vpd_page_80_not_supported();
> +return -1;
> +}
> +
> +l = strlen(u->params.serial);
> +if (l > SCSI_INQUIRY_DATA_SIZE) {
> +l = SCSI_INQUIRY_DATA_SIZE;
> +}
> +
> +trace_ufs_scsi_emulate_vpd_page_80(req->cmd.xfer);
> +memcpy(outbuf + buflen, u->params.serial, l);
> +buflen += l;
> +break;
> +}
> +case 0x87: /* Mode Page Policy, mandatory */
> +{
> +trace_ufs_scsi_emulate_vpd_page_87(req->cmd.xfer);
> +outbuf[buflen++] = 0x3f; /* apply to all mode pages and subpages */
> +outbuf[buflen++] = 0xff;
> +outbuf[buflen++] = 0; /* shared */
> +outbuf[buflen++] = 0;
> +break;
> +}
> +default:
> +return -1;
> +}
> +/* done with EVPD */
> +assert(buflen - start <= 255);
> +outbuf[start - 1] = 

Re: [PATCH v2 2/3] hw/ufs: Support for Query Transfer Requests

2023-06-19 Thread Stefan Hajnoczi
On Fri, Jun 16, 2023 at 03:58:25PM +0900, Jeuk Kim wrote:
> This commit makes the UFS device support query
> and nop out transfer requests.
> 
> The next patch would be support for UFS logical
> unit and scsi command transfer request.
> 
> Signed-off-by: Jeuk Kim 
> ---
>  hw/ufs/ufs.c | 968 ++-
>  hw/ufs/ufs.h |  45 +++
>  2 files changed, 1012 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ufs/ufs.c b/hw/ufs/ufs.c
> index 9dba1073a8..10ecc8cd7b 100644
> --- a/hw/ufs/ufs.c
> +++ b/hw/ufs/ufs.c
> @@ -19,6 +19,233 @@
>  #define UFS_MAX_NUTRS 32
>  #define UFS_MAX_NUTMRS 8
>  
> +static MemTxResult ufs_addr_read(UfsHc *u, hwaddr addr, void *buf, int size)
> +{
> +uint32_t cap = ldl_le_p(&u->reg.cap);
> +hwaddr hi = addr + size - 1;
> +
> +if (hi < addr) {
> +return MEMTX_DECODE_ERROR;
> +}
> +
> +if (!FIELD_EX32(cap, CAP, 64AS) && (hi >> 32)) {
> +return MEMTX_DECODE_ERROR;
> +}
> +
> +return pci_dma_read(PCI_DEVICE(u), addr, buf, size);
> +}
> +
> +static MemTxResult ufs_addr_write(UfsHc *u, hwaddr addr, const void *buf,
> +  int size)
> +{
> +uint32_t cap = ldl_le_p(&u->reg.cap);
> +hwaddr hi = addr + size - 1;
> +if (hi < addr) {
> +return MEMTX_DECODE_ERROR;
> +}
> +
> +if (!FIELD_EX32(cap, CAP, 64AS) && (hi >> 32)) {
> +return MEMTX_DECODE_ERROR;
> +}
> +
> +return pci_dma_write(PCI_DEVICE(u), addr, buf, size);
> +}
> +
> +static void ufs_complete_req(UfsRequest *req, UfsReqResult req_result);
> +
> +static inline hwaddr ufs_get_utrd_addr(UfsHc *u, uint32_t slot)
> +{
> +uint32_t utrlba = ldl_le_p(&u->reg.utrlba);
> +uint32_t utrlbau = ldl_le_p(&u->reg.utrlbau);
> +hwaddr utrl_base_addr = (((hwaddr)utrlbau) << 32) + utrlba;
> +hwaddr utrd_addr = utrl_base_addr + slot * sizeof(UtpTransferReqDesc);
> +
> +return utrd_addr;
> +}
> +
> +static inline hwaddr ufs_get_req_upiu_base_addr(const UtpTransferReqDesc *utrd)
> +{
> +uint32_t cmd_desc_base_addr_lo =
> +le32_to_cpu(utrd->command_desc_base_addr_lo);
> +uint32_t cmd_desc_base_addr_hi =
> +le32_to_cpu(utrd->command_desc_base_addr_hi);
> +
> +return (((hwaddr)cmd_desc_base_addr_hi) << 32) + cmd_desc_base_addr_lo;
> +}
> +
> +static inline hwaddr ufs_get_rsp_upiu_base_addr(const UtpTransferReqDesc *utrd)
> +{
> +hwaddr req_upiu_base_addr = ufs_get_req_upiu_base_addr(utrd);
> +uint32_t rsp_upiu_byte_off =
> +le16_to_cpu(utrd->response_upiu_offset) * sizeof(uint32_t);
> +return req_upiu_base_addr + rsp_upiu_byte_off;
> +}
> +
> +static MemTxResult ufs_dma_read_utrd(UfsRequest *req)
> +{
> +UfsHc *u = req->hc;
> +hwaddr utrd_addr = ufs_get_utrd_addr(u, req->slot);
> +MemTxResult ret;
> +
> +ret = ufs_addr_read(u, utrd_addr, &req->utrd, sizeof(req->utrd));
> +if (ret) {
> +trace_ufs_err_dma_read_utrd(req->slot, utrd_addr);
> +}
> +return ret;
> +}
> +
> +static MemTxResult ufs_dma_read_req_upiu(UfsRequest *req)
> +{
> +UfsHc *u = req->hc;
> +hwaddr req_upiu_base_addr = ufs_get_req_upiu_base_addr(&req->utrd);
> +UtpUpiuReq *req_upiu = &req->req_upiu;
> +uint32_t copy_size;
> +uint16_t data_segment_length;
> +MemTxResult ret;
> +
> +/*
> + * To know the size of the req_upiu, we need to read the
> + * data_segment_length in the header first.
> + */
> +ret = ufs_addr_read(u, req_upiu_base_addr, &req_upiu->header,
> +sizeof(UtpUpiuHeader));
> +if (ret) {
> +trace_ufs_err_dma_read_req_upiu(req->slot, req_upiu_base_addr);
> +return ret;
> +}
> +data_segment_length = be16_to_cpu(req_upiu->header.data_segment_length);
> +
> +copy_size = sizeof(UtpUpiuHeader) + UFS_TRANSACTION_SPECIFIC_FIELD_SIZE +
> +data_segment_length;
> +
> +ret = ufs_addr_read(u, req_upiu_base_addr, &req->req_upiu, copy_size);
> +if (ret) {
> +trace_ufs_err_dma_read_req_upiu(req->slot, req_upiu_base_addr);
> +}
> +return ret;
> +}
> +
> +static MemTxResult ufs_dma_read_prdt(UfsRequest *req)
> +{
> +UfsHc *u = req->hc;
> +uint16_t prdt_len = le16_to_cpu(req->utrd.prd_table_length);
> +uint16_t prdt_byte_off =
> +le16_to_cpu(req->utrd.prd_table_offset) * sizeof(uint32_t);
> +uint32_t prdt_size = prdt_len * sizeof(UfshcdSgEntry);
> +UfshcdSgEntry *prd_entries;
> +hwaddr req_upiu_base_addr, prdt_base_addr;
> +int err;
> +
> +assert(!req->sg);
> +
> +if (prdt_len == 0) {
> +return MEMTX_OK;
> +}
> +
> +prd_entries = g_new(UfshcdSgEntry, prdt_size);
> +if (!prd_entries) {
> +trace_ufs_err_memory_allocation();
> +return MEMTX_ERROR;
> +}
> +
> +req_upiu_base_addr = ufs_get_req_upiu_base_addr(&req->utrd);
> +prdt_base_addr = req_upiu_base_addr + prdt_byte_off;
> +
> +err = ufs_addr_read(u, prdt_base_addr, prd_entries, prdt_size);
> +if 

Re: [PATCH v2 1/3] hw/ufs: Initial commit for emulated Universal-Flash-Storage

2023-06-16 Thread Stefan Hajnoczi
On Fri, Jun 16, 2023 at 03:58:21PM +0900, Jeuk Kim wrote:
> Universal Flash Storage (UFS) is a high-performance mass storage device
> with a serial interface. It is primarily used as a high-performance
> data storage device for embedded applications.
> 
> This commit contains code for UFS device to be recognized
> as a UFS PCI device.
> Patches to handle UFS logical unit and Transfer Request will follow.
> 
> Signed-off-by: Jeuk Kim 
> ---
>  MAINTAINERS  |6 +
>  hw/Kconfig   |1 +
>  hw/meson.build   |1 +
>  hw/ufs/Kconfig   |4 +
>  hw/ufs/meson.build   |1 +
>  hw/ufs/trace-events  |   33 ++
>  hw/ufs/trace.h   |1 +
>  hw/ufs/ufs.c |  305 +++
>  hw/ufs/ufs.h |   42 ++
>  include/block/ufs.h  | 1048 ++
>  include/hw/pci/pci.h |1 +
>  include/hw/pci/pci_ids.h |1 +
>  meson.build  |1 +
>  13 files changed, 1445 insertions(+)
>  create mode 100644 hw/ufs/Kconfig
>  create mode 100644 hw/ufs/meson.build
>  create mode 100644 hw/ufs/trace-events
>  create mode 100644 hw/ufs/trace.h
>  create mode 100644 hw/ufs/ufs.c
>  create mode 100644 hw/ufs/ufs.h
>  create mode 100644 include/block/ufs.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 88b5a7ee0a..91c2bfbb09 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2231,6 +2231,12 @@ F: tests/qtest/nvme-test.c
>  F: docs/system/devices/nvme.rst
>  T: git git://git.infradead.org/qemu-nvme.git nvme-next
>  
> +ufs
> +M: Jeuk Kim 
> +S: Supported
> +F: hw/ufs/*
> +F: include/block/ufs.h

Thank you for stepping up as maintainer for UFS. The responsibilities of
maintainers are to:

1. Review other people's patches that modify the code.
2. Send pull requests to the qemu.git maintainer or post your
   acknowledgement of patches so another maintainer can merge patches
   into a larger subsystem branch. For example, you could ack patches
   and Klaus could include them in his pull requests for the time being.
3. Ensure a basic level of testing and CI integration to prevent
   bitrot and regressions.

For this last point, I suggest writing a libqos test that performs
device initialization and executes some basic I/O. Not all existing
emulated storage controllers have this level of testing, but I highly
recommend having an automated test case for your new device. There is
documentation on how to write tests here:
https://qemu.readthedocs.io/en/latest/devel/qtest.html

An example test case is the AHCI (SATA) controller test:
https://gitlab.com/qemu-project/qemu/-/blob/master/tests/qtest/ahci-test.c

>  megasas
>  M: Hannes Reinecke 
>  L: qemu-bl...@nongnu.org
> diff --git a/hw/Kconfig b/hw/Kconfig
> index ba62ff6417..9ca7b38c31 100644
> --- a/hw/Kconfig
> +++ b/hw/Kconfig
> @@ -38,6 +38,7 @@ source smbios/Kconfig
>  source ssi/Kconfig
>  source timer/Kconfig
>  source tpm/Kconfig
> +source ufs/Kconfig
>  source usb/Kconfig
>  source virtio/Kconfig
>  source vfio/Kconfig
> diff --git a/hw/meson.build b/hw/meson.build
> index c7ac7d3d75..f01fac4617 100644
> --- a/hw/meson.build
> +++ b/hw/meson.build
> @@ -37,6 +37,7 @@ subdir('smbios')
>  subdir('ssi')
>  subdir('timer')
>  subdir('tpm')
> +subdir('ufs')
>  subdir('usb')
>  subdir('vfio')
>  subdir('virtio')
> diff --git a/hw/ufs/Kconfig b/hw/ufs/Kconfig
> new file mode 100644
> index 00..b7b3392e85
> --- /dev/null
> +++ b/hw/ufs/Kconfig
> @@ -0,0 +1,4 @@
> +config UFS_PCI
> +bool
> +default y if PCI_DEVICES
> +depends on PCI
> diff --git a/hw/ufs/meson.build b/hw/ufs/meson.build
> new file mode 100644
> index 00..c1d90eeea6
> --- /dev/null
> +++ b/hw/ufs/meson.build
> @@ -0,0 +1 @@
> +softmmu_ss.add(when: 'CONFIG_UFS_PCI', if_true: files('ufs.c'))
> diff --git a/hw/ufs/trace-events b/hw/ufs/trace-events
> new file mode 100644
> index 00..17793929b1
> --- /dev/null
> +++ b/hw/ufs/trace-events
> @@ -0,0 +1,33 @@
> +# ufs.c
> +ufs_irq_raise(void) "INTx"
> +ufs_irq_lower(void) "INTx"
> +ufs_mmio_read(uint64_t addr, uint64_t data, unsigned size) "addr 0x%"PRIx64" data 0x%"PRIx64" size %d"
> +ufs_mmio_write(uint64_t addr, uint64_t data, unsigned size) "addr 0x%"PRIx64" data 0x%"PRIx64" size %d"
> +ufs_process_db(uint32_t slot) "UTRLDBR slot %"PRIu32""
> +ufs_process_req(uint32_t slot) "UTRLDBR slot %"PRIu32""
> +ufs_complete_req(uint32_t slot) "UTRLDBR slot %"PRIu32""
> +ufs_sendback_req(uint32_t slot) "UTRLDBR slot %"PRIu32""
> +ufs_exec_nop_cmd(uint32_t slot) "UTRLDBR slot %"PRIu32""
> +ufs_exec_scsi_cmd(uint32_t slot, uint8_t lun, uint8_t opcode) "slot %"PRIu32", lun 0x%"PRIx8", opcode 0x%"PRIx8""
> +ufs_exec_query_cmd(uint32_t slot, uint8_t opcode) "slot %"PRIu32", opcode 0x%"PRIx8""
> +ufs_process_uiccmd(uint32_t uiccmd, uint32_t ucmdarg1, uint32_t ucmdarg2, uint32_t ucmdarg3) "uiccmd 0x%"PRIx32", ucmdarg1 0x%"PRIx32", ucmdarg2 0x%"PRIx32", ucmdarg3 0x%"PRIx32""
> +
> +# error condition
> 

Re: [PATCH v2 0/3] hw/ufs: Add Universal Flash Storage (UFS) support

2023-06-16 Thread Stefan Hajnoczi
On Fri, Jun 16, 2023 at 03:58:16PM +0900, Jeuk Kim wrote:
> Since v1:
> - use macros of "hw/registerfields.h" (Addressed Philippe's review comments)
> 
> This patch series adds support for a new PCI-based UFS device.
> 
> The UFS pci device id (PCI_DEVICE_ID_REDHAT_UFS) is not registered
> in the Linux kernel yet, so it does not work right away, but I confirmed
> that it works with Linux when the UFS pci device id is registered.
> 
> I have also verified that it works with Windows 10.
> 
> Jeuk Kim (3):
>   hw/ufs: Initial commit for emulated Universal-Flash-Storage
>   hw/ufs: Support for Query Transfer Requests
>   hw/ufs: Support for UFS logical unit

For future patch series (no need to resend):

These patch emails are not threaded. Please use email threads so that CI
systems and patch management tools can easily identify which emails
belong together in a patch series:

  git send-email --thread --no-chain-reply-to ...

It is easiest to permanently set these options with:

  git config format.thread shallow
  git config sendemail.chainReplyTo false

Thanks,
Stefan

> 
>  MAINTAINERS  |6 +
>  hw/Kconfig   |1 +
>  hw/meson.build   |1 +
>  hw/ufs/Kconfig   |4 +
>  hw/ufs/lu.c  | 1441 
>  hw/ufs/meson.build   |1 +
>  hw/ufs/trace-events  |   58 ++
>  hw/ufs/trace.h   |1 +
>  hw/ufs/ufs.c | 1511 ++
>  hw/ufs/ufs.h |  130 
>  include/block/ufs.h  | 1048 ++
>  include/hw/pci/pci.h |1 +
>  include/hw/pci/pci_ids.h |1 +
>  include/scsi/constants.h |1 +
>  meson.build  |1 +
>  15 files changed, 4206 insertions(+)
>  create mode 100644 hw/ufs/Kconfig
>  create mode 100644 hw/ufs/lu.c
>  create mode 100644 hw/ufs/meson.build
>  create mode 100644 hw/ufs/trace-events
>  create mode 100644 hw/ufs/trace.h
>  create mode 100644 hw/ufs/ufs.c
>  create mode 100644 hw/ufs/ufs.h
>  create mode 100644 include/block/ufs.h
> 
> -- 
> 2.34.1
> 


signature.asc
Description: PGP signature


Re: [RFC 1/4] docs/qcow2: add the zoned format feature

2023-06-13 Thread Stefan Hajnoczi
On Mon, Jun 05, 2023 at 06:41:05PM +0800, Sam Li wrote:
> Add the specs for the zoned format feature of the qcow2 driver. Once
> the zoned_profile is set to `zbc`, then the qcow2 file can be taken
> as zoned devices and passed through by virtio-blk device to the guest.
> 
> Signed-off-by: Sam Li 
> ---
>  docs/system/qemu-block-drivers.rst.inc | 31 ++
>  1 file changed, 31 insertions(+)
> 
> diff --git a/docs/system/qemu-block-drivers.rst.inc b/docs/system/qemu-block-drivers.rst.inc
> index 105cb9679c..fdcf343652 100644
> --- a/docs/system/qemu-block-drivers.rst.inc
> +++ b/docs/system/qemu-block-drivers.rst.inc
> @@ -172,6 +172,37 @@ This section describes each format and the options that are supported for it.
>  filename`` to check if the NOCOW flag is set or not (Capital 'C' is
>  NOCOW flag).
>  
> +  .. option:: zoned_profile
> +
> +The option configures the zoned format feature on the qcow2 driver. If
> +this is set to ``zbc``, then it follows the basics of ZBC/ZAC protocol.

What about virtio-blk? NVMe ZNS? Please indicate what effect the profile
has and whether it works with all emulated storage controllers that
support zoned storage.

> +
> +  .. option:: zone_size
> +
> +The size of a zone of the zoned device. The zoned device have the same

"in bytes"? Please document the units.

> +size of zones with an optional smaller last zone.

"The device is divided into zones of this size with the exception of the
last zone, which may be smaller."

> +
> +  .. option:: zone_capacity
> +
> +The capacity of a zone of the zoned device.

This can be expanded:

  The initial capacity value for all zones. The capacity must be less
  than or equal to zone size. If the last zone is smaller, then its
  capacity is capped.

> The zoned device follows the
> +ZBC protocol tends to have the same size as its zone.
> +
> +  .. option:: zone_nr_conv
> +
> +The number of conventional zones of the zoned device.
> +
> +  .. option:: max_open_zones
> +
> +The maximal allowed open zones.
> +
> +  .. option:: max_active_zones
> +
> +The limit of the zones with implicit open, explicit open or closed state.
> +
> +  .. option:: max_append_sectors
> +
> +The maximal sectors that is allowed to append to zones while writing.

Does "sectors" mean 512B blocks or logical block size?

> +
>  .. program:: image-formats
>  .. option:: qed
>  
> -- 
> 2.40.1
> 


signature.asc
Description: PGP signature


Re: [PATCH] virtio-scsi: avoid dangling host notifier in ->ioeventfd_stop()

2023-06-11 Thread Stefan Hajnoczi
Buglink: https://gitlab.com/qemu-project/qemu/-/issues/1680

On Sun, Jun 11, 2023, 15:39 Stefan Hajnoczi  wrote:

> virtio_scsi_dataplane_stop() calls blk_drain_all(), which invokes
> ->drained_begin()/->drained_end() after we've already detached the host
> notifier. virtio_scsi_drained_end() currently attaches the host notifier
> again and leaves it dangling after dataplane has stopped.
>
> This results in the following assertion failure because
> virtio_scsi_defer_to_dataplane() is called from the IOThread instead of
> the main loop thread:
>
>   qemu-system-x86_64: ../softmmu/memory.c::
> memory_region_transaction_commit: Assertion `qemu_mutex_iothread_locked()'
> failed.
>
> Reported-by: Jean-Louis Dupond 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  hw/scsi/virtio-scsi.c | 20 ++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
> index 9c8ef0aaa6..45b95ea070 100644
> --- a/hw/scsi/virtio-scsi.c
> +++ b/hw/scsi/virtio-scsi.c
> @@ -1125,7 +1125,16 @@ static void virtio_scsi_drained_begin(SCSIBus *bus)
>  uint32_t total_queues = VIRTIO_SCSI_VQ_NUM_FIXED +
>  s->parent_obj.conf.num_queues;
>
> -if (!s->dataplane_started) {
> +/*
> + * Drain is called when stopping dataplane but the host notifier has
> + * already been detached. Detaching multiple times is a no-op if nothing
> + * else is monitoring the same file descriptor, but avoid it just in
> + * case.
> + *
> + * Also, don't detach if dataplane has not even been started yet because
> + * the host notifier isn't attached.
> + */
> +if (s->dataplane_stopping || !s->dataplane_started) {
>  return;
>  }
>
> @@ -1143,7 +1152,14 @@ static void virtio_scsi_drained_end(SCSIBus *bus)
>  uint32_t total_queues = VIRTIO_SCSI_VQ_NUM_FIXED +
>  s->parent_obj.conf.num_queues;
>
> -if (!s->dataplane_started) {
> +/*
> + * Drain is called when stopping dataplane. Keep the host notifier detached
> + * so it's not left dangling after dataplane is stopped.
> + *
> + * Also, don't attach if dataplane has not even been started yet. We're not
> + * ready.
> + */
> +if (s->dataplane_stopping || !s->dataplane_started) {
>  return;
>  }
>
> --
> 2.40.1
>
>
>


[PATCH] virtio-scsi: avoid dangling host notifier in ->ioeventfd_stop()

2023-06-11 Thread Stefan Hajnoczi
virtio_scsi_dataplane_stop() calls blk_drain_all(), which invokes
->drained_begin()/->drained_end() after we've already detached the host
notifier. virtio_scsi_drained_end() currently attaches the host notifier
again and leaves it dangling after dataplane has stopped.

This results in the following assertion failure because
virtio_scsi_defer_to_dataplane() is called from the IOThread instead of
the main loop thread:

  qemu-system-x86_64: ../softmmu/memory.c:: 
memory_region_transaction_commit: Assertion `qemu_mutex_iothread_locked()' 
failed.

Reported-by: Jean-Louis Dupond 
Signed-off-by: Stefan Hajnoczi 
---
 hw/scsi/virtio-scsi.c | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 9c8ef0aaa6..45b95ea070 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -1125,7 +1125,16 @@ static void virtio_scsi_drained_begin(SCSIBus *bus)
 uint32_t total_queues = VIRTIO_SCSI_VQ_NUM_FIXED +
 s->parent_obj.conf.num_queues;
 
-if (!s->dataplane_started) {
+/*
+ * Drain is called when stopping dataplane but the host notifier has
+ * already been detached. Detaching multiple times is a no-op if nothing
+ * else is monitoring the same file descriptor, but avoid it just in
+ * case.
+ *
+ * Also, don't detach if dataplane has not even been started yet because
+ * the host notifier isn't attached.
+ */
+if (s->dataplane_stopping || !s->dataplane_started) {
 return;
 }
 
@@ -1143,7 +1152,14 @@ static void virtio_scsi_drained_end(SCSIBus *bus)
 uint32_t total_queues = VIRTIO_SCSI_VQ_NUM_FIXED +
 s->parent_obj.conf.num_queues;
 
-if (!s->dataplane_started) {
+/*
+ * Drain is called when stopping dataplane. Keep the host notifier detached
+ * so it's not left dangling after dataplane is stopped.
+ *
+ * Also, don't attach if dataplane has not even been started yet. We're not
+ * ready.
+ */
+if (s->dataplane_stopping || !s->dataplane_started) {
 return;
 }
 
-- 
2.40.1




Re: virtio-blk using a single iothread

2023-06-08 Thread Stefan Hajnoczi
On Thu, Jun 08, 2023 at 10:40:57AM +0300, Sagi Grimberg wrote:
> Hey Stefan, Paolo,
> 
> I just had a report from a user experiencing lower virtio-blk
> performance than he expected. This user is running virtio-blk on top of
> nvme-tcp device. The guest is running 12 CPU cores.
> 
> The guest read/write throughput is capped at around 30% of the available
> throughput from the host (~800MB/s from the guest vs. 2800MB/s from the
> host - 25Gb/s nic). The workload running on the guest is a
> multi-threaded fio workload.
> 
> What is observed is the fact that virtio-blk is using a single disk-wide
> iothread processing all the vqs. Specifically nvme-tcp (similar to other
> tcp based protocols) is negatively impacted by lack of thread
> concurrency that can distribute I/O requests to different TCP
> connections.
> 
> We also attempted to move the iothread to a dedicated core, however that
> did not yield any meaningful performance improvements. The reason appears
> to be less about CPU utilization on the iothread core, but more around
> single TCP connection serialization.
> 
> Moving to io=threads does increase the throughput, however sacrificing
> latency significantly.
> 
> So the user find itself with available host cpus and TCP connections
> that it could easily use to get maximum throughput, without the ability
> to leverage them. True, other guests will use different
> threads/contexts, however the goal here is to allow the full performance
> from a single device.
> 
> I've seen several discussions and attempts in the past to allow a
> virtio-blk device leverage multiple iothreads, but around 2 years ago
> the discussions over this paused. So wanted to ask, are there any plans
> or anything in the works to address this limitation?
> 
> I've seen that the spdk folks are heading in this direction with their
> vhost-blk implementation:
> https://review.spdk.io/gerrit/c/spdk/spdk/+/16068

Hi Sagi,
Yes, there is an ongoing QEMU multi-queue block layer effort to make it
possible for multiple IOThreads to process disk I/O for the same
--blockdev in parallel.

Most of my recent QEMU patches have been part of this effort. There is a
work-in-progress branch that supports mapping virtio-blk virtqueues to
specific IOThreads:
https://gitlab.com/stefanha/qemu/-/commits/virtio-blk-iothread-vq-mapping

The syntax is:

  --device '{"driver":"virtio-blk-pci","iothread-vq-mapping":[{"iothread":"iothread0"},{"iothread":"iothread1"}],"drive":"drive0"}'

This says "assign virtqueues round-robin to iothread0 and iothread1".
Half the virtqueues will be processed by iothread0 and the other half by
iothread1. There is also syntax for assigning specific virtqueues to
each IOThread, but usually the automatic round-robin assignment is all
that's needed.
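The round-robin split described above can be modelled with a couple of toy helpers. This is an illustration of the assignment scheme only, not QEMU code, and the function names are made up:

```c
#include <assert.h>

/* Virtqueue vq is handled by IOThread vq % K under round-robin assignment */
static unsigned vq_to_iothread(unsigned vq, unsigned num_iothreads)
{
    return vq % num_iothreads;
}

/* With N virtqueues spread over K IOThreads, each IOThread ends up with
 * either floor(N/K) or ceil(N/K) virtqueues. */
static unsigned vqs_per_iothread(unsigned num_vqs, unsigned num_iothreads,
                                 unsigned iothread)
{
    return num_vqs / num_iothreads +
           (iothread < num_vqs % num_iothreads ? 1 : 0);
}
```

So with, say, 8 virtqueues over the two IOThreads in the example above, each IOThread processes 4 of them.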

This work is not finished yet. Basic I/O (e.g. fio) works without
crashes, but expect to hit issues if you use blockjobs, hotplug, etc.

Performance optimization work has just begun, so it won't deliver all
the benefits yet. I ran a benchmark yesterday where going from 1 to 2
IOThreads increased performance by 25%. That's much less than we're
aiming for; attaching two independent virtio-blk devices improves the
performance by ~100%. I know we can get there eventually. Some of the
bottlenecks are known (e.g. block statistics collection causes lock
contention) and others are yet to be investigated.

The Ansible playbook, libvirt XML, fio jobs, etc for the benchmark are
available here:
https://gitlab.com/stefanha/virt-playbooks/-/tree/8379665537c47c0901f426f0b9333ade8236ac3b

You are welcome to give the QEMU patches a try. I will be away next week
to attend KVM Forum, so I may not respond to emails quickly but am
interested in what you find.

Stefan


signature.asc
Description: PGP signature


Re: [PATCH] iotests: fix 194: filter out racy postcopy-active event

2023-06-07 Thread Stefan Hajnoczi
On Wed, Jun 07, 2023 at 05:36:06PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> The event is racy: it will not appear in the output if bitmap is
> migrated during downtime period of migration and postcopy phase is not
> started.
> 
> Fixes: ae00aa239847 "iotests: 194: test also migration of dirty bitmap"
> Reported-by: Richard Henderson 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
> 
> The patch fixes the problem described in
>   [PATCH] gitlab: Disable io-raw-194 for build-tcg-disabled
> and we can keep the test in gitlab ci
> 
>  tests/qemu-iotests/194 | 5 +
>  tests/qemu-iotests/194.out | 1 -
>  2 files changed, 5 insertions(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH 2/2] block/file-posix: fix wps checking in raw_co_prw

2023-06-07 Thread Stefan Hajnoczi
On Sun, Jun 04, 2023 at 02:16:58PM +0800, Sam Li wrote:
> If the write operation fails and the wps is NULL, then accessing it will
> lead to data corruption.
> 
> Solving the issue by adding a nullptr checking in get_zones_wp() where
> the wps is used.
> 
> This issue is found by Peter Maydell using the Coverity Tool (CID
> 1512459).
> 
> Signed-off-by: Sam Li 
> ---
>  block/file-posix.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 0d9d179a35..620942bf40 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1340,6 +1340,10 @@ static int get_zones_wp(BlockDriverState *bs, int fd, int64_t offset,
>  rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct blk_zone);
>  g_autofree struct blk_zone_report *rep = NULL;
>  
> +if (!wps) {
> +return -1;
> +}

An error will be printed every time this happens on a non-zoned device:

  static void update_zones_wp(BlockDriverState *bs, int fd, int64_t offset,
  unsigned int nrz)
  {
  if (get_zones_wp(bs, fd, offset, nrz, 0) < 0) {
  error_report("update zone wp failed");

Please change the following code to avoid the call to update_zones_wp():

  #if defined(CONFIG_BLKZONED)
  {
  BlockZoneWps *wps = bs->wps;
  if (ret == 0) {
  if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND))
  && wps && bs->bl.zone_size) {
  uint64_t *wp = &wps->wp[offset / bs->bl.zone_size];
  if (!BDRV_ZT_IS_CONV(*wp)) {
  if (type & QEMU_AIO_ZONE_APPEND) {
  *s->offset = *wp;
  trace_zbd_zone_append_complete(bs, *s->offset
  >> BDRV_SECTOR_BITS);
  }
  /* Advance the wp if needed */
  if (offset + bytes > *wp) {
  *wp = offset + bytes;
  }
  }
  }
  } else {
- if (type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) {
+ if (wps && (type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND))) {
  update_zones_wp(bs, s->fd, 0, 1);
  }
  }

Stefan


signature.asc
Description: PGP signature


Re: [PATCH 1/2] block/file-posix: fix g_file_get_contents return path

2023-06-07 Thread Stefan Hajnoczi
On Sun, Jun 04, 2023 at 02:16:57PM +0800, Sam Li wrote:
> The g_file_get_contents() function returns a g_boolean. If it fails, the
> returned value will be 0 instead of -1. Solve the issue by skipping
> assigning ret value.
> 
> This issue was found by Matthew Rosato using virtio-blk-{pci,ccw} backed
> by an NVMe partition e.g. /dev/nvme0n1p1 on s390x.
> 
> Signed-off-by: Sam Li 
> ---
>  block/file-posix.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)

The number of bytes returned was never used, so changing the return
value to 0 or -errno is fine:
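For illustration, the pitfall can be sketched with a stand-in function (the names below are made up, not the GLib API; only the TRUE/FALSE-versus-0/-errno distinction mirrors the real g_file_get_contents()):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for g_file_get_contents(): the real GLib function returns a
 * gboolean, TRUE on success and FALSE (i.e. 0) on failure -- never -1. */
static bool fake_get_contents(bool should_fail)
{
    return !should_fail;
}

/* Fixed pattern: translate the boolean result into the 0/-errno
 * convention instead of propagating the gboolean as an error code. */
static int read_value(bool should_fail)
{
    if (!fake_get_contents(should_fail)) {
        return -ENOENT;   /* illustrative errno choice */
    }
    return 0;
}
```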

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH v2 00/11] block: Re-enable the graph lock

2023-06-07 Thread Stefan Hajnoczi
On Mon, Jun 05, 2023 at 10:57:00AM +0200, Kevin Wolf wrote:
> This series fixes the deadlock that was observed before commit ad128dff
> ('graph-lock: Disable locking for now'), which just disabled the graph
> lock completely as a workaround to get 8.0.1 stable.
> 
> In theory the problem is simple: We can't poll while still holding the
> lock of a different AioContext. So bdrv_graph_wrlock() just needs to
> drop that lock before it polls. However, there are a number of callers
> that don't even hold the AioContext lock they are supposed to hold, so
> temporarily unlocking tries to unlock a mutex that isn't locked,
> resulting in assertion failures.
> 
> Therefore, much of this series is just for fixing AioContext locking
> correctness. It is only the last two patches that actually fix the
> deadlock and reenable the graph locking.
> 
> v2:
> - Fixed patch 2 to actually lock the correct AioContext even if the
>   device doesn't support iothreads
> - Improved the commit message for patch 7 [Eric]
> - Fixed mismerge in patch 11 (v1 incorrectly left an #if 0 around)
> 
> Kevin Wolf (11):
>   iotests: Test active commit with iothread and background I/O
>   qdev-properties-system: Lock AioContext for blk_insert_bs()
>   test-block-iothread: Lock AioContext for blk_insert_bs()
>   block: Fix AioContext locking in bdrv_open_child()
>   block: Fix AioContext locking in bdrv_attach_child_common()
>   block: Fix AioContext locking in bdrv_reopen_parse_file_or_backing()
>   block: Fix AioContext locking in bdrv_open_inherit()
>   block: Fix AioContext locking in bdrv_open_backing_file()
>   blockjob: Fix AioContext locking in block_job_add_bdrv()
>   graph-lock: Unlock the AioContext while polling
>   Revert "graph-lock: Disable locking for now"
> 
>  include/block/graph-lock.h|   6 +-
>  block.c   | 103 --
>  block/graph-lock.c|  42 ---
>  blockjob.c|  17 ++-
>  hw/core/qdev-properties-system.c  |   8 +-
>  tests/unit/test-block-iothread.c  |   7 +-
>  .../tests/iothreads-commit-active |  85 +++
>  .../tests/iothreads-commit-active.out |  23 
>  8 files changed, 250 insertions(+), 41 deletions(-)
>  create mode 100755 tests/qemu-iotests/tests/iothreads-commit-active
>  create mode 100644 tests/qemu-iotests/tests/iothreads-commit-active.out
> 
> -- 
> 2.40.1
> 

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH 0/1] update maintainers list for vfio-user & multi-process QEMU

2023-06-07 Thread Stefan Hajnoczi
On Wed, 7 Jun 2023 at 11:58, Jagannathan Raman  wrote:
>
> John Johnson doesn't work at Oracle anymore. I tried to contact him to
> get his updated email address, but I haven't heard anything from him.
>
> Jagannathan Raman (1):
>   maintainers: update maintainers list for vfio-user & multi-process
> QEMU
>
>  MAINTAINERS | 1 -
>  1 file changed, 1 deletion(-)

JJ's last email to qemu-devel was in February 2023. Since he no longer
works at Oracle, his email address is probably no longer functional.
Therefore, I think it makes sense to remove him from MAINTAINERS for
the time being. If he resumes work in this area he can be added back
with a new email address.

Reviewed-by: Stefan Hajnoczi 



Re: [PATCH] gitlab: Disable io-raw-194 for build-tcg-disabled

2023-06-07 Thread Stefan Hajnoczi
On Wed, 7 Jun 2023 at 10:39, Vladimir Sementsov-Ogievskiy
 wrote:
>
> On 06.06.23 19:25, Richard Henderson wrote:
> > This test consistently fails on Azure cloud build hosts in
> > a way that suggests a timing problem in the test itself:
> >
> > --- .../194.out
> > +++ .../194.out.bad
> > @@ -14,7 +14,6 @@
> >   {"return": {}}
> >   {"data": {"status": "setup"}, "event": "MIGRATION", "timestamp": 
> > {"microseconds": "USECS", "seconds": "SECS"}}
> >   {"data": {"status": "active"}, "event": "MIGRATION", "timestamp": 
> > {"microseconds": "USECS", "seconds": "SECS"}}
> > -{"data": {"status": "postcopy-active"}, "event": "MIGRATION", "timestamp": 
> > {"microseconds": "USECS", "seconds": "SECS"}}
> >   {"data": {"status": "completed"}, "event": "MIGRATION", "timestamp": 
> > {"microseconds": "USECS", "seconds": "SECS"}}
> >   Gracefully ending the `drive-mirror` job on source...
> >   {"return": {}}
> >
> > Signed-off-by: Richard Henderson 
> > ---
> >   .gitlab-ci.d/buildtest.yml | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
> > index 0f1be14cb6..62483f 100644
> > --- a/.gitlab-ci.d/buildtest.yml
> > +++ b/.gitlab-ci.d/buildtest.yml
> > @@ -236,7 +236,7 @@ build-tcg-disabled:
> >   - cd tests/qemu-iotests/
> >   - ./check -raw 001 002 003 004 005 008 009 010 011 012 021 025 032 
> > 033 048
> >   052 063 077 086 101 104 106 113 148 150 151 152 157 159 160 
> > 163
> > -170 171 183 184 192 194 208 221 226 227 236 253 277 
> > image-fleecing
> > +170 171 183 184 192 208 221 226 227 236 253 277 image-fleecing
> >   - ./check -qcow2 028 051 056 057 058 065 068 082 085 091 095 096 102 
> > 122
> >   124 132 139 142 144 145 151 152 155 157 165 194 196 200 202
> >   208 209 216 218 227 234 246 247 248 250 254 255 257 258
>
>
> There is actually a bug in the test, I've sent a patch:
>
> <20230607143606.1557395-1-vsement...@yandex-team.ru>
> [PATCH] iotests: fix 194: filter out racy postcopy-active event

Awesome, thank you!

Stefan



Re: [PATCH] gitlab: Disable io-raw-194 for build-tcg-disabled

2023-06-07 Thread Stefan Hajnoczi
The line of output that has changed was originally added by the
following commit:

commit ae00aa2398476824f0eca80461da215e7cdc1c3b
Author: Vladimir Sementsov-Ogievskiy 
Date:   Fri May 22 01:06:46 2020 +0300

iotests: 194: test also migration of dirty bitmap

Vladimir: Any idea why the postcopy-active event may not be emitted in
some cases?

Stefan

On Tue, 6 Jun 2023 at 12:26, Richard Henderson
 wrote:
>
> This test consistently fails on Azure cloud build hosts in
> a way that suggests a timing problem in the test itself:
>
> --- .../194.out
> +++ .../194.out.bad
> @@ -14,7 +14,6 @@
>  {"return": {}}
>  {"data": {"status": "setup"}, "event": "MIGRATION", "timestamp": 
> {"microseconds": "USECS", "seconds": "SECS"}}
>  {"data": {"status": "active"}, "event": "MIGRATION", "timestamp": 
> {"microseconds": "USECS", "seconds": "SECS"}}
> -{"data": {"status": "postcopy-active"}, "event": "MIGRATION", "timestamp": 
> {"microseconds": "USECS", "seconds": "SECS"}}
>  {"data": {"status": "completed"}, "event": "MIGRATION", "timestamp": 
> {"microseconds": "USECS", "seconds": "SECS"}}
>  Gracefully ending the `drive-mirror` job on source...
>  {"return": {}}
>
> Signed-off-by: Richard Henderson 
> ---
>  .gitlab-ci.d/buildtest.yml | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
> index 0f1be14cb6..62483f 100644
> --- a/.gitlab-ci.d/buildtest.yml
> +++ b/.gitlab-ci.d/buildtest.yml
> @@ -236,7 +236,7 @@ build-tcg-disabled:
>  - cd tests/qemu-iotests/
>  - ./check -raw 001 002 003 004 005 008 009 010 011 012 021 025 032 033 
> 048
>  052 063 077 086 101 104 106 113 148 150 151 152 157 159 160 163
> -170 171 183 184 192 194 208 221 226 227 236 253 277 
> image-fleecing
> +170 171 183 184 192 208 221 226 227 236 253 277 image-fleecing
>  - ./check -qcow2 028 051 056 057 058 065 068 082 085 091 095 096 102 122
>  124 132 139 142 144 145 151 152 155 157 165 194 196 200 202
>  208 209 216 218 227 234 246 247 248 250 254 255 257 258
> --
> 2.34.1
>
>



Re: [PATCH] xen-block: fix segv on unrealize

2023-06-06 Thread Stefan Hajnoczi
Sorry!

Reviewed-by: Stefan Hajnoczi 



[PULL 3/8] block/blkio: convert to blk_io_plug_call() API

2023-06-01 Thread Stefan Hajnoczi
Stop using the .bdrv_co_io_plug() API because it is not multi-queue
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
submission instead.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Eric Blake 
Reviewed-by: Stefano Garzarella 
Acked-by: Kevin Wolf 
Message-id: 20230530180959.1108766-4-stefa...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 block/blkio.c | 43 ---
 1 file changed, 24 insertions(+), 19 deletions(-)

diff --git a/block/blkio.c b/block/blkio.c
index 72117fa005..11be8787a3 100644
--- a/block/blkio.c
+++ b/block/blkio.c
@@ -17,6 +17,7 @@
 #include "qemu/error-report.h"
 #include "qapi/qmp/qdict.h"
 #include "qemu/module.h"
+#include "sysemu/block-backend.h"
 #include "exec/memory.h" /* for ram_block_discard_disable() */
 
 #include "block/block-io.h"
@@ -320,16 +321,30 @@ static void blkio_detach_aio_context(BlockDriverState *bs)
NULL, NULL, NULL);
 }
 
-/* Call with s->blkio_lock held to submit I/O after enqueuing a new request */
-static void blkio_submit_io(BlockDriverState *bs)
+/*
+ * Called by blk_io_unplug() or immediately if not plugged. Called without
+ * blkio_lock.
+ */
+static void blkio_unplug_fn(void *opaque)
 {
-if (qatomic_read(&bs->io_plugged) == 0) {
-BDRVBlkioState *s = bs->opaque;
+BDRVBlkioState *s = opaque;
 
+WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
 blkioq_do_io(s->blkioq, NULL, 0, 0, NULL);
 }
 }
 
+/*
+ * Schedule I/O submission after enqueuing a new request. Called without
+ * blkio_lock.
+ */
+static void blkio_submit_io(BlockDriverState *bs)
+{
+BDRVBlkioState *s = bs->opaque;
+
+blk_io_plug_call(blkio_unplug_fn, s);
+}
+
 static int coroutine_fn
 blkio_co_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes)
 {
@@ -340,9 +355,9 @@ blkio_co_pdiscard(BlockDriverState *bs, int64_t offset, 
int64_t bytes)
 
 WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
 blkioq_discard(s->blkioq, offset, bytes, &cod, 0);
-blkio_submit_io(bs);
 }
 
+blkio_submit_io(bs);
 qemu_coroutine_yield();
 return cod.ret;
 }
@@ -373,9 +388,9 @@ blkio_co_preadv(BlockDriverState *bs, int64_t offset, 
int64_t bytes,
 
 WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
 blkioq_readv(s->blkioq, offset, iov, iovcnt, &cod, 0);
-blkio_submit_io(bs);
 }
 
+blkio_submit_io(bs);
 qemu_coroutine_yield();
 
 if (use_bounce_buffer) {
@@ -418,9 +433,9 @@ static int coroutine_fn blkio_co_pwritev(BlockDriverState 
*bs, int64_t offset,
 
 WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
 blkioq_writev(s->blkioq, offset, iov, iovcnt, &cod, blkio_flags);
-blkio_submit_io(bs);
 }
 
+blkio_submit_io(bs);
 qemu_coroutine_yield();
 
 if (use_bounce_buffer) {
@@ -439,9 +454,9 @@ static int coroutine_fn blkio_co_flush(BlockDriverState *bs)
 
 WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
 blkioq_flush(s->blkioq, &cod, 0);
-blkio_submit_io(bs);
 }
 
+blkio_submit_io(bs);
 qemu_coroutine_yield();
 return cod.ret;
 }
@@ -467,22 +482,13 @@ static int coroutine_fn 
blkio_co_pwrite_zeroes(BlockDriverState *bs,
 
 WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
 blkioq_write_zeroes(s->blkioq, offset, bytes, &cod, blkio_flags);
-blkio_submit_io(bs);
 }
 
+blkio_submit_io(bs);
 qemu_coroutine_yield();
 return cod.ret;
 }
 
-static void coroutine_fn blkio_co_io_unplug(BlockDriverState *bs)
-{
-BDRVBlkioState *s = bs->opaque;
-
-WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
-blkio_submit_io(bs);
-}
-}
-
 typedef enum {
 BMRR_OK,
 BMRR_SKIP,
@@ -1004,7 +1010,6 @@ static void blkio_refresh_limits(BlockDriverState *bs, 
Error **errp)
 .bdrv_co_pwritev = blkio_co_pwritev, \
 .bdrv_co_flush_to_disk   = blkio_co_flush, \
 .bdrv_co_pwrite_zeroes   = blkio_co_pwrite_zeroes, \
-.bdrv_co_io_unplug   = blkio_co_io_unplug, \
 .bdrv_refresh_limits = blkio_refresh_limits, \
 .bdrv_register_buf   = blkio_register_buf, \
 .bdrv_unregister_buf = blkio_unregister_buf, \
-- 
2.40.1




[PULL 5/8] block/linux-aio: convert to blk_io_plug_call() API

2023-06-01 Thread Stefan Hajnoczi
Stop using the .bdrv_co_io_plug() API because it is not multi-queue
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
submission instead.

Note that a dev_max_batch check is dropped in laio_io_unplug() because
the semantics of unplug_fn() are different from .bdrv_co_io_unplug():
1. unplug_fn() is only called when the last blk_io_unplug() call occurs,
   not every time blk_io_unplug() is called.
2. unplug_fn() is per-thread, not per-BlockDriverState, so there is no
   way to get per-BlockDriverState fields like dev_max_batch.

Therefore this condition cannot be moved to laio_unplug_fn(). It is not
obvious that this condition affects performance in practice, so I am
removing it instead of trying to come up with a more complex mechanism
to preserve the condition.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Eric Blake 
Acked-by: Kevin Wolf 
Reviewed-by: Stefano Garzarella 
Message-id: 20230530180959.1108766-6-stefa...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 include/block/raw-aio.h |  7 ---
 block/file-posix.c  | 28 
 block/linux-aio.c   | 41 +++--
 3 files changed, 11 insertions(+), 65 deletions(-)

diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index da60ca13ef..0f63c2800c 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -62,13 +62,6 @@ int coroutine_fn laio_co_submit(int fd, uint64_t offset, 
QEMUIOVector *qiov,
 
 void laio_detach_aio_context(LinuxAioState *s, AioContext *old_context);
 void laio_attach_aio_context(LinuxAioState *s, AioContext *new_context);
-
-/*
- * laio_io_plug/unplug work in the thread's current AioContext, therefore the
- * caller must ensure that they are paired in the same IOThread.
- */
-void laio_io_plug(void);
-void laio_io_unplug(uint64_t dev_max_batch);
 #endif
 /* io_uring.c - Linux io_uring implementation */
 #ifdef CONFIG_LINUX_IO_URING
diff --git a/block/file-posix.c b/block/file-posix.c
index 7baa8491dd..ac1ed54811 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2550,26 +2550,6 @@ static int coroutine_fn raw_co_pwritev(BlockDriverState 
*bs, int64_t offset,
 return raw_co_prw(bs, offset, bytes, qiov, QEMU_AIO_WRITE);
 }
 
-static void coroutine_fn raw_co_io_plug(BlockDriverState *bs)
-{
-BDRVRawState __attribute__((unused)) *s = bs->opaque;
-#ifdef CONFIG_LINUX_AIO
-if (s->use_linux_aio) {
-laio_io_plug();
-}
-#endif
-}
-
-static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs)
-{
-BDRVRawState __attribute__((unused)) *s = bs->opaque;
-#ifdef CONFIG_LINUX_AIO
-if (s->use_linux_aio) {
-laio_io_unplug(s->aio_max_batch);
-}
-#endif
-}
-
 static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
 {
 BDRVRawState *s = bs->opaque;
@@ -3914,8 +3894,6 @@ BlockDriver bdrv_file = {
 .bdrv_co_copy_range_from = raw_co_copy_range_from,
 .bdrv_co_copy_range_to  = raw_co_copy_range_to,
 .bdrv_refresh_limits = raw_refresh_limits,
-.bdrv_co_io_plug= raw_co_io_plug,
-.bdrv_co_io_unplug  = raw_co_io_unplug,
 .bdrv_attach_aio_context = raw_aio_attach_aio_context,
 
 .bdrv_co_truncate   = raw_co_truncate,
@@ -4286,8 +4264,6 @@ static BlockDriver bdrv_host_device = {
 .bdrv_co_copy_range_from = raw_co_copy_range_from,
 .bdrv_co_copy_range_to  = raw_co_copy_range_to,
 .bdrv_refresh_limits = raw_refresh_limits,
-.bdrv_co_io_plug= raw_co_io_plug,
-.bdrv_co_io_unplug  = raw_co_io_unplug,
 .bdrv_attach_aio_context = raw_aio_attach_aio_context,
 
 .bdrv_co_truncate   = raw_co_truncate,
@@ -4424,8 +4400,6 @@ static BlockDriver bdrv_host_cdrom = {
 .bdrv_co_pwritev= raw_co_pwritev,
 .bdrv_co_flush_to_disk  = raw_co_flush_to_disk,
 .bdrv_refresh_limits= cdrom_refresh_limits,
-.bdrv_co_io_plug= raw_co_io_plug,
-.bdrv_co_io_unplug  = raw_co_io_unplug,
 .bdrv_attach_aio_context = raw_aio_attach_aio_context,
 
 .bdrv_co_truncate   = raw_co_truncate,
@@ -4552,8 +4526,6 @@ static BlockDriver bdrv_host_cdrom = {
 .bdrv_co_pwritev= raw_co_pwritev,
 .bdrv_co_flush_to_disk  = raw_co_flush_to_disk,
 .bdrv_refresh_limits= cdrom_refresh_limits,
-.bdrv_co_io_plug= raw_co_io_plug,
-.bdrv_co_io_unplug  = raw_co_io_unplug,
 .bdrv_attach_aio_context = raw_aio_attach_aio_context,
 
 .bdrv_co_truncate   = raw_co_truncate,
diff --git a/block/linux-aio.c b/block/linux-aio.c
index 916f001e32..561c71a9ae 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -15,6 +15,7 @@
 #include "qemu/event_notifier.h"
 #include "qemu/coroutine.h"
 #include "qapi/error.h"
+#include "sysemu/block-backend.h"
 
 /* Only used for assertions.  */
 #include "qemu/coroutine_int.h"
@@ -46,7 +47,6 @@ struct qemu_laiocb {

[PULL 4/8] block/io_uring: convert to blk_io_plug_call() API

2023-06-01 Thread Stefan Hajnoczi
Stop using the .bdrv_co_io_plug() API because it is not multi-queue
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
submission instead.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Eric Blake 
Reviewed-by: Stefano Garzarella 
Acked-by: Kevin Wolf 
Message-id: 20230530180959.1108766-5-stefa...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 include/block/raw-aio.h |  7 ---
 block/file-posix.c  | 10 --
 block/io_uring.c| 44 -
 block/trace-events  |  5 ++---
 4 files changed, 19 insertions(+), 47 deletions(-)

diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index 0fe85ade77..da60ca13ef 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -81,13 +81,6 @@ int coroutine_fn luring_co_submit(BlockDriverState *bs, int 
fd, uint64_t offset,
   QEMUIOVector *qiov, int type);
 void luring_detach_aio_context(LuringState *s, AioContext *old_context);
 void luring_attach_aio_context(LuringState *s, AioContext *new_context);
-
-/*
- * luring_io_plug/unplug work in the thread's current AioContext, therefore the
- * caller must ensure that they are paired in the same IOThread.
- */
-void luring_io_plug(void);
-void luring_io_unplug(void);
 #endif
 
 #ifdef _WIN32
diff --git a/block/file-posix.c b/block/file-posix.c
index 0ab158efba..7baa8491dd 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2558,11 +2558,6 @@ static void coroutine_fn raw_co_io_plug(BlockDriverState 
*bs)
 laio_io_plug();
 }
 #endif
-#ifdef CONFIG_LINUX_IO_URING
-if (s->use_linux_io_uring) {
-luring_io_plug();
-}
-#endif
 }
 
 static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs)
@@ -2573,11 +2568,6 @@ static void coroutine_fn 
raw_co_io_unplug(BlockDriverState *bs)
 laio_io_unplug(s->aio_max_batch);
 }
 #endif
-#ifdef CONFIG_LINUX_IO_URING
-if (s->use_linux_io_uring) {
-luring_io_unplug();
-}
-#endif
 }
 
 static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
diff --git a/block/io_uring.c b/block/io_uring.c
index 3a77480e16..69d9820928 100644
--- a/block/io_uring.c
+++ b/block/io_uring.c
@@ -16,6 +16,7 @@
 #include "block/raw-aio.h"
 #include "qemu/coroutine.h"
 #include "qapi/error.h"
+#include "sysemu/block-backend.h"
 #include "trace.h"
 
 /* Only used for assertions.  */
@@ -41,7 +42,6 @@ typedef struct LuringAIOCB {
 } LuringAIOCB;
 
 typedef struct LuringQueue {
-int plugged;
 unsigned int in_queue;
 unsigned int in_flight;
 bool blocked;
@@ -267,7 +267,7 @@ static void 
luring_process_completions_and_submit(LuringState *s)
 {
 luring_process_completions(s);
 
-if (!s->io_q.plugged && s->io_q.in_queue > 0) {
+if (s->io_q.in_queue > 0) {
 ioq_submit(s);
 }
 }
@@ -301,29 +301,17 @@ static void qemu_luring_poll_ready(void *opaque)
 static void ioq_init(LuringQueue *io_q)
 {
 QSIMPLEQ_INIT(&io_q->submit_queue);
-io_q->plugged = 0;
 io_q->in_queue = 0;
 io_q->in_flight = 0;
 io_q->blocked = false;
 }
 
-void luring_io_plug(void)
+static void luring_unplug_fn(void *opaque)
 {
-AioContext *ctx = qemu_get_current_aio_context();
-LuringState *s = aio_get_linux_io_uring(ctx);
-trace_luring_io_plug(s);
-s->io_q.plugged++;
-}
-
-void luring_io_unplug(void)
-{
-AioContext *ctx = qemu_get_current_aio_context();
-LuringState *s = aio_get_linux_io_uring(ctx);
-assert(s->io_q.plugged);
-trace_luring_io_unplug(s, s->io_q.blocked, s->io_q.plugged,
-   s->io_q.in_queue, s->io_q.in_flight);
-if (--s->io_q.plugged == 0 &&
-!s->io_q.blocked && s->io_q.in_queue > 0) {
+LuringState *s = opaque;
+trace_luring_unplug_fn(s, s->io_q.blocked, s->io_q.in_queue,
+   s->io_q.in_flight);
+if (!s->io_q.blocked && s->io_q.in_queue > 0) {
 ioq_submit(s);
 }
 }
@@ -370,14 +358,16 @@ static int luring_do_submit(int fd, LuringAIOCB 
*luringcb, LuringState *s,
 
 QSIMPLEQ_INSERT_TAIL(&s->io_q.submit_queue, luringcb, next);
 s->io_q.in_queue++;
-trace_luring_do_submit(s, s->io_q.blocked, s->io_q.plugged,
-   s->io_q.in_queue, s->io_q.in_flight);
-if (!s->io_q.blocked &&
-(!s->io_q.plugged ||
- s->io_q.in_flight + s->io_q.in_queue >= MAX_ENTRIES)) {
-ret = ioq_submit(s);
-trace_luring_do_submit_done(s, ret);
-return ret;
+trace_luring_do_submit(s, s->io_q.blocked, s->io_q.in_queue,
+   s->io_q.in_flight);
+if (!s->io_q.blocked) {
+if (s->io_q.in_flight + s->io_q.in_queue >= MAX_ENTRIES) {
+ret = ioq_submit(s);
+  

[PULL 1/8] block: add blk_io_plug_call() API

2023-06-01 Thread Stefan Hajnoczi
Introduce a new API for thread-local blk_io_plug() that does not
traverse the block graph. The goal is to make blk_io_plug() multi-queue
friendly.

Instead of having block drivers track whether or not we're in a plugged
section, provide an API that allows them to defer a function call until
we're unplugged: blk_io_plug_call(fn, opaque). If blk_io_plug_call() is
called multiple times with the same fn/opaque pair, then fn() is only
called once at the end of the function - resulting in batching.

This patch introduces the API and changes blk_io_plug()/blk_io_unplug().
blk_io_plug()/blk_io_unplug() no longer require a BlockBackend argument
because the plug state is now thread-local.

Later patches convert block drivers to blk_io_plug_call() and then we
can finally remove .bdrv_co_io_plug() once all block drivers have been
converted.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Eric Blake 
Reviewed-by: Stefano Garzarella 
Acked-by: Kevin Wolf 
Message-id: 20230530180959.1108766-2-stefa...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 MAINTAINERS   |   1 +
 include/sysemu/block-backend-io.h |  13 +--
 block/block-backend.c |  22 -
 block/plug.c  | 159 ++
 hw/block/dataplane/xen-block.c|   8 +-
 hw/block/virtio-blk.c |   4 +-
 hw/scsi/virtio-scsi.c |   6 +-
 block/meson.build |   1 +
 8 files changed, 173 insertions(+), 41 deletions(-)
 create mode 100644 block/plug.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 4b025a7b63..89f274f85e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2650,6 +2650,7 @@ F: util/aio-*.c
 F: util/aio-*.h
 F: util/fdmon-*.c
 F: block/io.c
+F: block/plug.c
 F: migration/block*
 F: include/block/aio.h
 F: include/block/aio-wait.h
diff --git a/include/sysemu/block-backend-io.h 
b/include/sysemu/block-backend-io.h
index d62a7ee773..be4dcef59d 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -100,16 +100,9 @@ void blk_iostatus_set_err(BlockBackend *blk, int error);
 int blk_get_max_iov(BlockBackend *blk);
 int blk_get_max_hw_iov(BlockBackend *blk);
 
-/*
- * blk_io_plug/unplug are thread-local operations. This means that multiple
- * IOThreads can simultaneously call plug/unplug, but the caller must ensure
- * that each unplug() is called in the same IOThread of the matching plug().
- */
-void coroutine_fn blk_co_io_plug(BlockBackend *blk);
-void co_wrapper blk_io_plug(BlockBackend *blk);
-
-void coroutine_fn blk_co_io_unplug(BlockBackend *blk);
-void co_wrapper blk_io_unplug(BlockBackend *blk);
+void blk_io_plug(void);
+void blk_io_unplug(void);
+void blk_io_plug_call(void (*fn)(void *), void *opaque);
 
 AioContext *blk_get_aio_context(BlockBackend *blk);
 BlockAcctStats *blk_get_stats(BlockBackend *blk);
diff --git a/block/block-backend.c b/block/block-backend.c
index 241f643507..4009ed5fed 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2582,28 +2582,6 @@ void blk_add_insert_bs_notifier(BlockBackend *blk, 
Notifier *notify)
 notifier_list_add(&blk->insert_bs_notifiers, notify);
 }
 
-void coroutine_fn blk_co_io_plug(BlockBackend *blk)
-{
-BlockDriverState *bs = blk_bs(blk);
-IO_CODE();
-GRAPH_RDLOCK_GUARD();
-
-if (bs) {
-bdrv_co_io_plug(bs);
-}
-}
-
-void coroutine_fn blk_co_io_unplug(BlockBackend *blk)
-{
-BlockDriverState *bs = blk_bs(blk);
-IO_CODE();
-GRAPH_RDLOCK_GUARD();
-
-if (bs) {
-bdrv_co_io_unplug(bs);
-}
-}
-
 BlockAcctStats *blk_get_stats(BlockBackend *blk)
 {
 IO_CODE();
diff --git a/block/plug.c b/block/plug.c
new file mode 100644
index 00..98a155d2f4
--- /dev/null
+++ b/block/plug.c
@@ -0,0 +1,159 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Block I/O plugging
+ *
+ * Copyright Red Hat.
+ *
+ * This API defers a function call within a blk_io_plug()/blk_io_unplug()
+ * section, allowing multiple calls to batch up. This is a performance
+ * optimization that is used in the block layer to submit several I/O requests
+ * at once instead of individually:
+ *
+ *   blk_io_plug(); <-- start of plugged region
+ *   ...
+ *   blk_io_plug_call(my_func, my_obj); <-- deferred my_func(my_obj) call
+ *   blk_io_plug_call(my_func, my_obj); <-- another
+ *   blk_io_plug_call(my_func, my_obj); <-- another
+ *   ...
+ *   blk_io_unplug(); <-- end of plugged region, my_func(my_obj) is called once
+ *
+ * This code is actually generic and not tied to the block layer. If another
+ * subsystem needs this functionality, it could be renamed.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/coroutine-tls.h"
+#include "qemu/notify.h"
+#include "qemu/thread.h"
+#include "sysemu/block-backend.h"
+
+/* A function call that has been deferred until unplug() */
+typedef struct {
+void (*fn)(void *);
+void *opaque;
+} UnplugFn;
+
+/* Per-thread state */
+ty

[PULL 6/8] block: remove bdrv_co_io_plug() API

2023-06-01 Thread Stefan Hajnoczi
No block driver implements .bdrv_co_io_plug() anymore. Get rid of the
function pointers.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Eric Blake 
Reviewed-by: Stefano Garzarella 
Acked-by: Kevin Wolf 
Message-id: 20230530180959.1108766-7-stefa...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 include/block/block-io.h |  3 ---
 include/block/block_int-common.h | 11 --
 block/io.c   | 37 
 3 files changed, 51 deletions(-)

diff --git a/include/block/block-io.h b/include/block/block-io.h
index a27e471a87..43af816d75 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -259,9 +259,6 @@ void coroutine_fn bdrv_co_leave(BlockDriverState *bs, 
AioContext *old_ctx);
 
 AioContext *child_of_bds_get_parent_aio_context(BdrvChild *c);
 
-void coroutine_fn GRAPH_RDLOCK bdrv_co_io_plug(BlockDriverState *bs);
-void coroutine_fn GRAPH_RDLOCK bdrv_co_io_unplug(BlockDriverState *bs);
-
 bool coroutine_fn GRAPH_RDLOCK
 bdrv_co_can_store_new_dirty_bitmap(BlockDriverState *bs, const char *name,
uint32_t granularity, Error **errp);
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index b1cbc1e00c..74195c3004 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -768,11 +768,6 @@ struct BlockDriver {
 void coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_debug_event)(
 BlockDriverState *bs, BlkdebugEvent event);
 
-/* io queue for linux-aio */
-void coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_io_plug)(BlockDriverState 
*bs);
-void coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_io_unplug)(
-BlockDriverState *bs);
-
 bool (*bdrv_supports_persistent_dirty_bitmap)(BlockDriverState *bs);
 
 bool coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_can_store_new_dirty_bitmap)(
@@ -1227,12 +1222,6 @@ struct BlockDriverState {
 unsigned int in_flight;
 unsigned int serialising_in_flight;
 
-/*
- * counter for nested bdrv_io_plug.
- * Accessed with atomic ops.
- */
-unsigned io_plugged;
-
 /* do we need to tell the quest if we have a volatile write cache? */
 int enable_write_cache;
 
diff --git a/block/io.c b/block/io.c
index 540bf8d26d..f2dfc7c405 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3223,43 +3223,6 @@ void *qemu_try_blockalign0(BlockDriverState *bs, size_t 
size)
 return mem;
 }
 
-void coroutine_fn bdrv_co_io_plug(BlockDriverState *bs)
-{
-BdrvChild *child;
-IO_CODE();
-assert_bdrv_graph_readable();
-
-QLIST_FOREACH(child, &bs->children, next) {
-bdrv_co_io_plug(child->bs);
-}
-
-if (qatomic_fetch_inc(&bs->io_plugged) == 0) {
-BlockDriver *drv = bs->drv;
-if (drv && drv->bdrv_co_io_plug) {
-drv->bdrv_co_io_plug(bs);
-}
-}
-}
-
-void coroutine_fn bdrv_co_io_unplug(BlockDriverState *bs)
-{
-BdrvChild *child;
-IO_CODE();
-assert_bdrv_graph_readable();
-
-assert(bs->io_plugged);
-if (qatomic_fetch_dec(&bs->io_plugged) == 1) {
-BlockDriver *drv = bs->drv;
-if (drv && drv->bdrv_co_io_unplug) {
-drv->bdrv_co_io_unplug(bs);
-}
-}
-
-QLIST_FOREACH(child, &bs->children, next) {
-bdrv_co_io_unplug(child->bs);
-}
-}
-
 /* Helper that undoes bdrv_register_buf() when it fails partway through */
 static void GRAPH_RDLOCK
 bdrv_register_buf_rollback(BlockDriverState *bs, void *host, size_t size,
-- 
2.40.1




[PULL 8/8] qapi: add '@fdset' feature for BlockdevOptionsVirtioBlkVhostVdpa

2023-06-01 Thread Stefan Hajnoczi
From: Stefano Garzarella 

The virtio-blk-vhost-vdpa driver in libblkio 1.3.0 supports the fd
passing through the new 'fd' property.

Since we now use qemu_open() on '@path' when the virtio-blk driver
supports fd passing, let's announce it.
In this way, the management layer can pass the file descriptor of an
already opened vhost-vdpa character device. This is useful especially
when the device can only be accessed with certain privileges.

Add the '@fdset' feature only when the virtio-blk-vhost-vdpa driver
in libblkio supports it.

Suggested-by: Markus Armbruster 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Stefano Garzarella 
Message-id: 20230530071941.8954-3-sgarz...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 qapi/block-core.json | 6 ++
 meson.build  | 4 
 2 files changed, 10 insertions(+)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 98d9116dae..4bf89171c6 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -3955,10 +3955,16 @@
 #
 # @path: path to the vhost-vdpa character device.
 #
+# Features:
+# @fdset: Member @path supports the special "/dev/fdset/N" path
+# (since 8.1)
+#
 # Since: 7.2
 ##
 { 'struct': 'BlockdevOptionsVirtioBlkVhostVdpa',
   'data': { 'path': 'str' },
+  'features': [ { 'name': 'fdset',
+                  'if': 'CONFIG_BLKIO_VHOST_VDPA_FD' } ],
   'if': 'CONFIG_BLKIO' }
 
 ##
diff --git a/meson.build b/meson.build
index bc76ea96bf..a61d3e9b06 100644
--- a/meson.build
+++ b/meson.build
@@ -2106,6 +2106,10 @@ config_host_data.set('CONFIG_LZO', lzo.found())
 config_host_data.set('CONFIG_MPATH', mpathpersist.found())
 config_host_data.set('CONFIG_MPATH_NEW_API', mpathpersist_new_api)
 config_host_data.set('CONFIG_BLKIO', blkio.found())
+if blkio.found()
+  config_host_data.set('CONFIG_BLKIO_VHOST_VDPA_FD',
+   blkio.version().version_compare('>=1.3.0'))
+endif
 config_host_data.set('CONFIG_CURL', curl.found())
 config_host_data.set('CONFIG_CURSES', curses.found())
 config_host_data.set('CONFIG_GBM', gbm.found())
-- 
2.40.1




[PULL 2/8] block/nvme: convert to blk_io_plug_call() API

2023-06-01 Thread Stefan Hajnoczi
Stop using the .bdrv_co_io_plug() API because it is not multi-queue
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
submission instead.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Eric Blake 
Reviewed-by: Stefano Garzarella 
Acked-by: Kevin Wolf 
Message-id: 20230530180959.1108766-3-stefa...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 block/nvme.c   | 44 
 block/trace-events |  1 -
 2 files changed, 12 insertions(+), 33 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 17937d398d..7ca85bc44a 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -25,6 +25,7 @@
 #include "qemu/vfio-helpers.h"
 #include "block/block-io.h"
 #include "block/block_int.h"
+#include "sysemu/block-backend.h"
 #include "sysemu/replay.h"
 #include "trace.h"
 
@@ -119,7 +120,6 @@ struct BDRVNVMeState {
 int blkshift;
 
 uint64_t max_transfer;
-bool plugged;
 
 bool supports_write_zeroes;
 bool supports_discard;
@@ -282,7 +282,7 @@ static void nvme_kick(NVMeQueuePair *q)
 {
 BDRVNVMeState *s = q->s;
 
-if (s->plugged || !q->need_kick) {
+if (!q->need_kick) {
 return;
 }
 trace_nvme_kick(s, q->index);
@@ -387,10 +387,6 @@ static bool nvme_process_completion(NVMeQueuePair *q)
 NvmeCqe *c;
 
 trace_nvme_process_completion(s, q->index, q->inflight);
-if (s->plugged) {
-trace_nvme_process_completion_queue_plugged(s, q->index);
-return false;
-}
 
 /*
  * Support re-entrancy when a request cb() function invokes aio_poll().
@@ -480,6 +476,15 @@ static void nvme_trace_command(const NvmeCmd *cmd)
 }
 }
 
+static void nvme_unplug_fn(void *opaque)
+{
+NVMeQueuePair *q = opaque;
+
+QEMU_LOCK_GUARD(&q->lock);
+nvme_kick(q);
+nvme_process_completion(q);
+}
+
 static void nvme_submit_command(NVMeQueuePair *q, NVMeRequest *req,
 NvmeCmd *cmd, BlockCompletionFunc cb,
 void *opaque)
@@ -496,8 +501,7 @@ static void nvme_submit_command(NVMeQueuePair *q, 
NVMeRequest *req,
q->sq.tail * NVME_SQ_ENTRY_BYTES, cmd, sizeof(*cmd));
 q->sq.tail = (q->sq.tail + 1) % NVME_QUEUE_SIZE;
 q->need_kick++;
-nvme_kick(q);
-nvme_process_completion(q);
+blk_io_plug_call(nvme_unplug_fn, q);
 qemu_mutex_unlock(&q->lock);
 }
 
@@ -1567,27 +1571,6 @@ static void nvme_attach_aio_context(BlockDriverState *bs,
 }
 }
 
-static void coroutine_fn nvme_co_io_plug(BlockDriverState *bs)
-{
-BDRVNVMeState *s = bs->opaque;
-assert(!s->plugged);
-s->plugged = true;
-}
-
-static void coroutine_fn nvme_co_io_unplug(BlockDriverState *bs)
-{
-BDRVNVMeState *s = bs->opaque;
-assert(s->plugged);
-s->plugged = false;
-for (unsigned i = INDEX_IO(0); i < s->queue_count; i++) {
-NVMeQueuePair *q = s->queues[i];
-qemu_mutex_lock(&q->lock);
-nvme_kick(q);
-nvme_process_completion(q);
-qemu_mutex_unlock(&q->lock);
-}
-}
-
 static bool nvme_register_buf(BlockDriverState *bs, void *host, size_t size,
   Error **errp)
 {
@@ -1664,9 +1647,6 @@ static BlockDriver bdrv_nvme = {
 .bdrv_detach_aio_context  = nvme_detach_aio_context,
 .bdrv_attach_aio_context  = nvme_attach_aio_context,
 
-.bdrv_co_io_plug  = nvme_co_io_plug,
-.bdrv_co_io_unplug= nvme_co_io_unplug,
-
 .bdrv_register_buf= nvme_register_buf,
 .bdrv_unregister_buf  = nvme_unregister_buf,
 };
diff --git a/block/trace-events b/block/trace-events
index 32665158d6..048ad27519 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -141,7 +141,6 @@ nvme_kick(void *s, unsigned q_index) "s %p q #%u"
 nvme_dma_flush_queue_wait(void *s) "s %p"
 nvme_error(int cmd_specific, int sq_head, int sqid, int cid, int status) "cmd_specific %d sq_head %d sqid %d cid %d status 0x%x"
 nvme_process_completion(void *s, unsigned q_index, int inflight) "s %p q #%u inflight %d"
-nvme_process_completion_queue_plugged(void *s, unsigned q_index) "s %p q #%u"
 nvme_complete_command(void *s, unsigned q_index, int cid) "s %p q #%u cid %d"
 nvme_submit_command(void *s, unsigned q_index, int cid) "s %p q #%u cid %d"
 nvme_submit_command_raw(int c0, int c1, int c2, int c3, int c4, int c5, int c6, int c7) "%02x %02x %02x %02x %02x %02x %02x %02x"
-- 
2.40.1
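The deferred-call pattern that blk_io_plug_call() introduces above (replacing the per-driver `plugged` flag and the plug/unplug callbacks) can be sketched in a few lines of Python. This is an illustrative model only, not QEMU's actual per-thread C implementation: calls requested while "plugged" are queued and deduplicated, then run once when the outermost unplug brings the nesting depth back to zero.

```python
class PlugCallBatcher:
    """Toy model of a blk_io_plug_call()-style deferred-call queue.

    Illustrative only: QEMU keeps this state thread-local in C.
    """

    def __init__(self):
        self.depth = 0      # nesting level of plug() sections
        self.pending = []   # deduplicated (fn, arg) pairs

    def plug(self):
        self.depth += 1

    def call(self, fn, arg):
        if self.depth == 0:
            fn(arg)         # not plugged: run immediately
        elif (fn, arg) not in self.pending:
            self.pending.append((fn, arg))

    def unplug(self):
        assert self.depth > 0
        self.depth -= 1
        if self.depth == 0:
            pending, self.pending = self.pending, []
            for fn, arg in pending:
                fn(arg)
```

With this shape, nvme_submit_command() only has to request `call(nvme_unplug_fn, q)`; repeated submissions to the same queue pair collapse into a single doorbell kick at unplug time.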




[PULL 0/8] Block patches

2023-06-01 Thread Stefan Hajnoczi
The following changes since commit c6a5fc2ac76c5ab709896ee1b0edd33685a67ed1:

  decodetree: Add --output-null for meson testing (2023-05-31 19:56:42 -0700)

are available in the Git repository at:

  https://gitlab.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to 98b126f5e3228a346c774e569e26689943b401dd:

  qapi: add '@fdset' feature for BlockdevOptionsVirtioBlkVhostVdpa (2023-06-01 
11:08:21 -0400)


Pull request

- Stefano Garzarella's blkio block driver 'fd' parameter
- My thread-local blk_io_plug() series



Stefan Hajnoczi (6):
  block: add blk_io_plug_call() API
  block/nvme: convert to blk_io_plug_call() API
  block/blkio: convert to blk_io_plug_call() API
  block/io_uring: convert to blk_io_plug_call() API
  block/linux-aio: convert to blk_io_plug_call() API
  block: remove bdrv_co_io_plug() API

Stefano Garzarella (2):
  block/blkio: use qemu_open() to support fd passing for virtio-blk
  qapi: add '@fdset' feature for BlockdevOptionsVirtioBlkVhostVdpa

 MAINTAINERS   |   1 +
 qapi/block-core.json  |   6 ++
 meson.build   |   4 +
 include/block/block-io.h  |   3 -
 include/block/block_int-common.h  |  11 ---
 include/block/raw-aio.h   |  14 ---
 include/sysemu/block-backend-io.h |  13 +--
 block/blkio.c |  96 --
 block/block-backend.c |  22 -
 block/file-posix.c|  38 ---
 block/io.c|  37 ---
 block/io_uring.c  |  44 -
 block/linux-aio.c |  41 +++-
 block/nvme.c  |  44 +++--
 block/plug.c  | 159 ++
 hw/block/dataplane/xen-block.c|   8 +-
 hw/block/virtio-blk.c |   4 +-
 hw/scsi/virtio-scsi.c |   6 +-
 block/meson.build |   1 +
 block/trace-events|   6 +-
 20 files changed, 293 insertions(+), 265 deletions(-)
 create mode 100644 block/plug.c

-- 
2.40.1




[PULL 7/8] block/blkio: use qemu_open() to support fd passing for virtio-blk

2023-06-01 Thread Stefan Hajnoczi
From: Stefano Garzarella 

Some virtio-blk drivers (e.g. virtio-blk-vhost-vdpa) support fd
passing. Let's expose this to the user, so the management layer
can pass the file descriptor of an already opened path.

If the libblkio virtio-blk driver supports fd passing, let's always
use qemu_open() to open the `path`, so we can handle fd passing
from the management layer through the "/dev/fdset/N" special path.

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Stefano Garzarella 
Message-id: 20230530071941.8954-2-sgarz...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 block/blkio.c | 53 ++-
 1 file changed, 44 insertions(+), 9 deletions(-)

diff --git a/block/blkio.c b/block/blkio.c
index 11be8787a3..527323d625 100644
--- a/block/blkio.c
+++ b/block/blkio.c
@@ -673,25 +673,60 @@ static int blkio_virtio_blk_common_open(BlockDriverState *bs,
 {
 const char *path = qdict_get_try_str(options, "path");
 BDRVBlkioState *s = bs->opaque;
-    int ret;
+    bool fd_supported = false;
+    int fd, ret;
 
 if (!path) {
 error_setg(errp, "missing 'path' option");
 return -EINVAL;
 }
 
-    ret = blkio_set_str(s->blkio, "path", path);
-    qdict_del(options, "path");
-    if (ret < 0) {
-        error_setg_errno(errp, -ret, "failed to set path: %s",
-                         blkio_get_error_msg());
-        return ret;
-    }
-
 if (!(flags & BDRV_O_NOCACHE)) {
 error_setg(errp, "cache.direct=off is not supported");
 return -EINVAL;
 }
+
+    if (blkio_get_int(s->blkio, "fd", &fd) == 0) {
+        fd_supported = true;
+    }
+
+    /*
+     * If the libblkio driver supports fd passing, let's always use qemu_open()
+     * to open the `path`, so we can handle fd passing from the management
+     * layer through the "/dev/fdset/N" special path.
+     */
+    if (fd_supported) {
+        int open_flags;
+
+        if (flags & BDRV_O_RDWR) {
+            open_flags = O_RDWR;
+        } else {
+            open_flags = O_RDONLY;
+        }
+
+        fd = qemu_open(path, open_flags, errp);
+        if (fd < 0) {
+            return -EINVAL;
+        }
+
+        ret = blkio_set_int(s->blkio, "fd", fd);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "failed to set fd: %s",
+                             blkio_get_error_msg());
+            qemu_close(fd);
+            return ret;
+        }
+    } else {
+        ret = blkio_set_str(s->blkio, "path", path);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "failed to set path: %s",
+                             blkio_get_error_msg());
+            return ret;
+        }
+    }
+
+    qdict_del(options, "path");
+
 return 0;
 }
 
-- 
2.40.1




[PULL v2 06/11] qapi: make the vcpu parameters deprecated for 8.1

2023-06-01 Thread Stefan Hajnoczi
From: Alex Bennée 

I don't think we can remove the parameters directly, but we can
certainly mark them as deprecated.

Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Richard Henderson 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Alex Bennée 
Message-id: 20230526165401.574474-7-alex.ben...@linaro.org
Message-Id: <20230524133952.3971948-6-alex.ben...@linaro.org>
Signed-off-by: Stefan Hajnoczi 
---
 docs/about/deprecated.rst |  7 +++
 qapi/trace.json   | 40 +--
 2 files changed, 24 insertions(+), 23 deletions(-)

diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
index 7c45a64363..0743459862 100644
--- a/docs/about/deprecated.rst
+++ b/docs/about/deprecated.rst
@@ -226,6 +226,13 @@ QEMU Machine Protocol (QMP) events
 
 Use the more generic event ``DEVICE_UNPLUG_GUEST_ERROR`` instead.
 
+``vcpu`` trace events (since 8.1)
+'''''''''''''''''''''''''''''''''
+
+The ability to instrument QEMU helper functions with vCPU-aware trace
+points was removed in 7.0. However QMP still exposed the vcpu
+parameter. This argument has now been deprecated and the remaining
+trace points that used it are selected just by name.
 
 Human Monitor Protocol (HMP) commands
 -
diff --git a/qapi/trace.json b/qapi/trace.json
index 6bf0af0946..39b752fc88 100644
--- a/qapi/trace.json
+++ b/qapi/trace.json
@@ -37,13 +37,14 @@
 #
 # @vcpu: Whether this is a per-vCPU event (since 2.7).
 #
-# An event is per-vCPU if it has the "vcpu" property in the
-# "trace-events" files.
+# Features:
+# @deprecated: Member @vcpu is deprecated, and always ignored.
 #
 # Since: 2.2
 ##
 { 'struct': 'TraceEventInfo',
-  'data': {'name': 'str', 'state': 'TraceEventState', 'vcpu': 'bool'} }
+  'data': {'name': 'str', 'state': 'TraceEventState',
+   'vcpu': { 'type': 'bool', 'features': ['deprecated'] } } }
 
 ##
 # @trace-event-get-state:
@@ -52,19 +53,15 @@
 #
 # @name: Event name pattern (case-sensitive glob).
 #
-# @vcpu: The vCPU to query (any by default; since 2.7).
+# @vcpu: The vCPU to query (since 2.7).
+#
+# Features:
+# @deprecated: Member @vcpu is deprecated, and always ignored.
 #
 # Returns: a list of @TraceEventInfo for the matching events
 #
-# An event is returned if:
-#
-# - its name matches the @name pattern, and
-# - if @vcpu is given, the event has the "vcpu" property.
-#
-# Therefore, if @vcpu is given, the operation will only match per-vCPU
-# events, returning their state on the specified vCPU. Special case:
-# if @name is an exact match, @vcpu is given and the event does not
-# have the "vcpu" property, an error is returned.
+# An event is returned if its name matches the @name pattern
+# (There are no longer any per-vCPU events).
 #
 # Since: 2.2
 #
@@ -75,7 +72,8 @@
 # <- { "return": [ { "name": "qemu_memalign", "state": "disabled", "vcpu": 
false } ] }
 ##
 { 'command': 'trace-event-get-state',
-  'data': {'name': 'str', '*vcpu': 'int'},
+  'data': {'name': 'str',
+   '*vcpu': {'type': 'int', 'features': ['deprecated'] } },
   'returns': ['TraceEventInfo'] }
 
 ##
@@ -91,15 +89,11 @@
 #
 # @vcpu: The vCPU to act upon (all by default; since 2.7).
 #
-# An event's state is modified if:
+# Features:
+# @deprecated: Member @vcpu is deprecated, and always ignored.
 #
-# - its name matches the @name pattern, and
-# - if @vcpu is given, the event has the "vcpu" property.
-#
-# Therefore, if @vcpu is given, the operation will only match per-vCPU
-# events, setting their state on the specified vCPU. Special case: if
-# @name is an exact match, @vcpu is given and the event does not have
-# the "vcpu" property, an error is returned.
+# An event is enabled if its name matches the @name pattern
+# (There are no longer any per-vCPU events).
 #
 # Since: 2.2
 #
@@ -111,4 +105,4 @@
 ##
 { 'command': 'trace-event-set-state',
   'data': {'name': 'str', 'enable': 'bool', '*ignore-unavailable': 'bool',
-   '*vcpu': 'int'} }
+   '*vcpu': {'type': 'int', 'features': ['deprecated'] } } }
-- 
2.40.1




Re: [PATCH v5 0/2] block/blkio: support fd passing for virtio-blk-vhost-vdpa driver

2023-06-01 Thread Stefan Hajnoczi
On Tue, May 30, 2023 at 09:19:39AM +0200, Stefano Garzarella wrote:
> v5:
> - moved `features` to the object level to simplify libvirt code [Jonathon]
> - wrapped a line too long in the documentation [Markus]
> - added Stefan R-b tags
> 
> v4: 
> https://lore.kernel.org/qemu-devel/20230526150304.158206-1-sgarz...@redhat.com/
> - added patch 02 to allow libvirt to discover we support fdset [Markus]
> - modified the commit description of patch 01
> 
> v3: 
> https://lore.kernel.org/qemu-devel/20230511091527.46620-1-sgarz...@redhat.com/
> - use qemu_open() on `path` to simplify libvirt code [Jonathon]
> - remove patch 01 since we are not using monitor_fd_param() anymore
> 
> v2: 
> https://lore.kernel.org/qemu-devel/20230504092843.62493-1-sgarz...@redhat.com/
> - added patch 01 to use monitor_fd_param() in the blkio module
> - use monitor_fd_param() to parse the fd like vhost devices [Stefan]
> 
> v1: 
> https://lore.kernel.org/qemu-devel/20230502145050.224615-1-sgarz...@redhat.com/
> 
> The virtio-blk-vhost-vdpa driver in libblkio 1.3.0 supports the new
> 'fd' property. Let's expose this to the user, so the management layer
> can pass the file descriptor of an already opened vhost-vdpa character
> device. This is useful especially when the device can only be accessed
> with certain privileges.
> 
> Stefano Garzarella (2):
>   block/blkio: use qemu_open() to support fd passing for virtio-blk
>   qapi: add '@fdset' feature for BlockdevOptionsVirtioBlkVhostVdpa
> 
>  meson.build  |  4 
>  qapi/block-core.json |  6 +
>  block/blkio.c| 53 
>  3 files changed, 54 insertions(+), 9 deletions(-)
> 
> -- 
> 2.40.1
> 

Thanks, applied to my block tree:
https://gitlab.com/stefanha/qemu/commits/block

Stefan




[PULL v2 05/11] docs/deprecated: move QMP events bellow QMP command section

2023-06-01 Thread Stefan Hajnoczi
From: Alex Bennée 

Also rename the section to make the fact this is part of the
management protocol even clearer.

Suggested-by: Markus Armbruster 
Signed-off-by: Alex Bennée 
Message-id: 20230526165401.574474-6-alex.ben...@linaro.org
Signed-off-by: Stefan Hajnoczi 
---
 docs/about/deprecated.rst | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
index e934e0a13a..7c45a64363 100644
--- a/docs/about/deprecated.rst
+++ b/docs/about/deprecated.rst
@@ -218,6 +218,15 @@ instruction per translated block" mode (which can be set 
on the
 command line or via the HMP, but not via QMP). The information remains
 available via the HMP 'info jit' command.
 
+QEMU Machine Protocol (QMP) events
+----------------------------------
+
+``MEM_UNPLUG_ERROR`` (since 6.2)
+''''''''''''''''''''''''''''''''
+
+Use the more generic event ``DEVICE_UNPLUG_GUEST_ERROR`` instead.
+
+
 Human Monitor Protocol (HMP) commands
 -
 
@@ -251,15 +260,6 @@ it. Since all recent x86 hardware from the past >10 years 
is capable of the
 64-bit x86 extensions, a corresponding 64-bit OS should be used instead.
 
 
-QEMU API (QAPI) events
-----------------------
-
-``MEM_UNPLUG_ERROR`` (since 6.2)
-''''''''''''''''''''''''''''''''
-
-Use the more generic event ``DEVICE_UNPLUG_GUEST_ERROR`` instead.
-
-
 System emulator machines
 
 
-- 
2.40.1




[PULL v2 11/11] accel/tcg: include cs_base in our hash calculations

2023-06-01 Thread Stefan Hajnoczi
From: Alex Bennée 

We weren't using cs_base in the hash calculations before. Since the
arm front end moved a chunk of flags in a378206a20 (target/arm: Move
mode specific TB flags to tb->cs_base) it now holds an important
part of the execution state.

Widen the tb_hash_func to include cs_base and expand to qemu_xxhash8()
to accommodate it.
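The widening amounts to folding one more execution-state word into the bucket hash, so two TBs that differ only in cs_base can no longer collide by construction of the inputs. A hedged Python sketch of the idea (blake2b stands in for qemu_xxhash8(), which is the actual function used; argument order mirrors tb_hash_func):

```python
import hashlib
import struct

def tb_hash(phys_pc, pc, flags, cs_base, cf_mask):
    """Sketch: pack all execution-state words, including cs_base,
    into one buffer and hash it down to a 32-bit bucket index.
    blake2b is only a stand-in for QEMU's qemu_xxhash8()."""
    words = struct.pack("<QQLQL", phys_pc, pc, flags, cs_base, cf_mask)
    return int.from_bytes(hashlib.blake2b(words, digest_size=4).digest(),
                          "little")
```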

My initial benchmark shows very little difference in the
runtime.

Before:

armhf

➜  hyperfine -w 2 -m 20 "./arm-softmmu/qemu-system-arm -cpu cortex-a15 -machine 
type=virt,highmem=off -display none -m 2048 -serial mon:stdio -netdev 
user,id=unet,hostfwd=tcp::-:22 -device virtio-net-pci,netdev=unet -device 
virtio-scsi-pci -blockdev 
driver=raw,node-name=hd,discard=unmap,file.driver=host_device,file.filename=/dev/zen-disk/debian-bullseye-armhf
 -device scsi-hd,drive=hd -smp 4 -kernel 
/home/alex/lsrc/linux.git/builds/arm/arch/arm/boot/zImage -append 
'console=ttyAMA0 root=/dev/sda2 systemd.unit=benchmark.service' -snapshot"
Benchmark 1: ./arm-softmmu/qemu-system-arm -cpu cortex-a15 -machine 
type=virt,highmem=off -display none -m 2048 -serial mon:stdio -netdev 
user,id=unet,hostfwd=tcp::-:22 -device virtio-net-pci,netdev=unet -device 
virtio-scsi-pci -blockdev 
driver=raw,node-name=hd,discard=unmap,file.driver=host_device,file.filename=/dev/zen-disk/debian-bullseye-armhf
 -device scsi-hd,drive=hd -smp 4 -kernel 
/home/alex/lsrc/linux.git/builds/arm/arch/arm/boot/zImage -append 
'console=ttyAMA0 root=/dev/sda2 systemd.unit=benchmark.service' -snapshot
  Time (mean ± σ): 24.627 s ±  2.708 s[User: 34.309 s, System: 1.797 s]
  Range (min … max):   22.345 s … 29.864 s20 runs

arm64

➜  hyperfine -w 2 -n 20 "./qemu-system-aarch64 -cpu max,pauth-impdef=on 
-machine type=virt,virtualization=on,gic-version=3 -display none -serial 
mon:stdio -netdev user,id=unet,hostfwd=tcp::-:22,hostfwd=tcp::1234-:1234 
-device virtio-net-pci,netdev=unet -device virtio-scsi-pci -blockdev 
driver=raw,node-name=hd,discard=unmap,file.driver=host_device,file.filename=/dev/zen-disk/debian-bullseye-arm64
 -device scsi-hd,drive=hd -smp 4 -kernel 
~/lsrc/linux.git/builds/arm64/arch/arm64/boot/Image.gz -append 'console=ttyAMA0 
root=/dev/sda2 systemd.unit=benchmark-pigz.service' -snapshot"
Benchmark 1: 20
  Time (mean ± σ): 62.559 s ±  2.917 s[User: 189.115 s, System: 4.089 s]
  Range (min … max):   59.997 s … 70.153 s10 runs

After:

armhf

Benchmark 1: ./arm-softmmu/qemu-system-arm -cpu cortex-a15 -machine 
type=virt,highmem=off -display none -m 2048 -serial mon:stdio -netdev 
user,id=unet,hostfwd=tcp::-:22 -device virtio-net-pci,netdev=unet -device 
virtio-scsi-pci -blockdev 
driver=raw,node-name=hd,discard=unmap,file.driver=host_device,file.filename=/dev/zen-disk/debian-bullseye-armhf
 -device scsi-hd,drive=hd -smp 4 -kernel 
/home/alex/lsrc/linux.git/builds/arm/arch/arm/boot/zImage -append 
'console=ttyAMA0 root=/dev/sda2 systemd.unit=benchmark.service' -snapshot
  Time (mean ± σ): 24.223 s ±  2.151 s[User: 34.284 s, System: 1.906 s]
  Range (min … max):   22.000 s … 28.476 s20 runs

arm64

hyperfine -w 2 -n 20 "./qemu-system-aarch64 -cpu max,pauth-impdef=on -machine 
type=virt,virtualization=on,gic-version=3 -display none -serial mon:stdio 
-netdev user,id=unet,hostfwd=tcp::-:22,hostfwd=tcp::1234-:1234 -device 
virtio-net-pci,netdev=unet -device virtio-scsi-pci -blockdev 
driver=raw,node-name=hd,discard=unmap,file.driver=host_device,file.filename=/dev/zen-disk/debian-bullseye-arm64
 -device scsi-hd,drive=hd -smp 4 -kernel 
~/lsrc/linux.git/builds/arm64/arch/arm64/boot/Image.gz -append 'console=ttyAMA0 
root=/dev/sda2 systemd.unit=benchmark-pigz.service' -snapshot"
Benchmark 1: 20
  Time (mean ± σ): 62.769 s ±  1.978 s[User: 188.431 s, System: 5.269 s]
  Range (min … max):   60.285 s … 66.868 s10 runs

Signed-off-by: Alex Bennée 
Reviewed-by: Richard Henderson 
Message-id: 20230526165401.574474-12-alex.ben...@linaro.org
Message-Id: <20230524133952.3971948-11-alex.ben...@linaro.org>
Signed-off-by: Stefan Hajnoczi 
---
 accel/tcg/tb-hash.h   |  4 ++--
 include/qemu/xxhash.h | 23 +--
 accel/tcg/cpu-exec.c  |  2 +-
 accel/tcg/tb-maint.c  |  4 ++--
 util/qsp.c|  2 +-
 5 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/accel/tcg/tb-hash.h b/accel/tcg/tb-hash.h
index 1d19c69caa..2ba2193731 100644
--- a/accel/tcg/tb-hash.h
+++ b/accel/tcg/tb-hash.h
@@ -62,9 +62,9 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
 
 static inline
 uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc,
-  uint32_t flags, uint32_t cf_mask)
+  uint32_t flags, uint64_t flags2, uint32_t cf_mask)
 {
-return qemu_xxhash6(phys_pc, pc, flags, cf_mask);
+return qemu_xxhash8(phys_pc, pc, flags2, flags, cf_mask);
 }
 
 #endif
diff --git a/include/qemu/xxhash.h b/include/qemu/xxhash.h
index c2dcccadb

[PULL v2 04/11] scripts/qapi: document the tool that generated the file

2023-06-01 Thread Stefan Hajnoczi
From: Alex Bennée 

This makes it a little easier for developers to find where things
are being generated.

Reviewed-by: Richard Henderson 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Markus Armbruster 
Signed-off-by: Alex Bennée 
Message-id: 20230526165401.574474-5-alex.ben...@linaro.org
Message-Id: <20230524133952.3971948-5-alex.ben...@linaro.org>
Signed-off-by: Stefan Hajnoczi 
---
 scripts/qapi/gen.py | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/scripts/qapi/gen.py b/scripts/qapi/gen.py
index 8f8f784f4a..70bc576a10 100644
--- a/scripts/qapi/gen.py
+++ b/scripts/qapi/gen.py
@@ -13,6 +13,7 @@
 
 from contextlib import contextmanager
 import os
+import sys
 import re
 from typing import (
 Dict,
@@ -162,7 +163,7 @@ def __init__(self, fname: str, blurb: str, pydoc: str):
 
 def _top(self) -> str:
 return mcgen('''
-/* AUTOMATICALLY GENERATED, DO NOT MODIFY */
+/* AUTOMATICALLY GENERATED by %(tool)s DO NOT MODIFY */
 
 /*
 %(blurb)s
@@ -174,6 +175,7 @@ def _top(self) -> str:
  */
 
 ''',
+ tool=os.path.basename(sys.argv[0]),
  blurb=self._blurb, copyright=self._copyright)
 
 def _bottom(self) -> str:
@@ -195,7 +197,10 @@ def _bottom(self) -> str:
 
 class QAPIGenTrace(QAPIGen):
 def _top(self) -> str:
-        return super()._top() + '# AUTOMATICALLY GENERATED, DO NOT MODIFY\n\n'
+        return (super()._top()
+                + '# AUTOMATICALLY GENERATED by '
+                + os.path.basename(sys.argv[0])
+                + ', DO NOT MODIFY\n\n')
 
 
 @contextmanager
-- 
2.40.1




[PULL v2 03/11] trace: remove vcpu_id from the TraceEvent structure

2023-06-01 Thread Stefan Hajnoczi
From: Alex Bennée 

This does involve temporarily stubbing out some helper functions
before we excise the rest of the code.

Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Richard Henderson 
Signed-off-by: Alex Bennée 
Message-id: 20230526165401.574474-4-alex.ben...@linaro.org
Message-Id: <20230524133952.3971948-4-alex.ben...@linaro.org>
Signed-off-by: Stefan Hajnoczi 
---
 trace/control-internal.h  |  4 ++--
 trace/event-internal.h|  2 --
 trace/control.c   | 10 --
 scripts/tracetool/format/c.py |  6 --
 scripts/tracetool/format/h.py | 11 +--
 5 files changed, 3 insertions(+), 30 deletions(-)

diff --git a/trace/control-internal.h b/trace/control-internal.h
index 8b2b50a7cf..0178121720 100644
--- a/trace/control-internal.h
+++ b/trace/control-internal.h
@@ -27,12 +27,12 @@ static inline uint32_t trace_event_get_id(TraceEvent *ev)
 
 static inline uint32_t trace_event_get_vcpu_id(TraceEvent *ev)
 {
-    return ev->vcpu_id;
+    return 0;
 }
 
 static inline bool trace_event_is_vcpu(TraceEvent *ev)
 {
-    return ev->vcpu_id != TRACE_VCPU_EVENT_NONE;
+    return false;
 }
 
 static inline const char * trace_event_get_name(TraceEvent *ev)
diff --git a/trace/event-internal.h b/trace/event-internal.h
index f63500b37e..0c24e01b52 100644
--- a/trace/event-internal.h
+++ b/trace/event-internal.h
@@ -19,7 +19,6 @@
 /**
  * TraceEvent:
  * @id: Unique event identifier.
- * @vcpu_id: Unique per-vCPU event identifier.
  * @name: Event name.
  * @sstate: Static tracing state.
  * @dstate: Dynamic tracing state
@@ -33,7 +32,6 @@
  */
 typedef struct TraceEvent {
 uint32_t id;
-uint32_t vcpu_id;
 const char * name;
 const bool sstate;
 uint16_t *dstate;
diff --git a/trace/control.c b/trace/control.c
index d24af91004..5dfb609954 100644
--- a/trace/control.c
+++ b/trace/control.c
@@ -68,16 +68,6 @@ void trace_event_register_group(TraceEvent **events)
 size_t i;
 for (i = 0; events[i] != NULL; i++) {
 events[i]->id = next_id++;
-        if (events[i]->vcpu_id == TRACE_VCPU_EVENT_NONE) {
-            continue;
-        }
-
-        if (likely(next_vcpu_id < CPU_TRACE_DSTATE_MAX_EVENTS)) {
-            events[i]->vcpu_id = next_vcpu_id++;
-        } else {
-            warn_report("too many vcpu trace events; dropping '%s'",
-                        events[i]->name);
-        }
 }
 event_groups = g_renew(TraceEventGroup, event_groups, nevent_groups + 1);
 event_groups[nevent_groups].events = events;
diff --git a/scripts/tracetool/format/c.py b/scripts/tracetool/format/c.py
index c390c1844a..69edf0d588 100644
--- a/scripts/tracetool/format/c.py
+++ b/scripts/tracetool/format/c.py
@@ -32,19 +32,13 @@ def generate(events, backend, group):
 out('uint16_t %s;' % e.api(e.QEMU_DSTATE))
 
 for e in events:
-        if "vcpu" in e.properties:
-            vcpu_id = 0
-        else:
-            vcpu_id = "TRACE_VCPU_EVENT_NONE"
 out('TraceEvent %(event)s = {',
 '.id = 0,',
-'.vcpu_id = %(vcpu_id)s,',
 '.name = \"%(name)s\",',
 '.sstate = %(sstate)s,',
 '.dstate = &%(dstate)s ',
 '};',
 event = e.api(e.QEMU_EVENT),
-vcpu_id = vcpu_id,
 name = e.name,
 sstate = "TRACE_%s_ENABLED" % e.name.upper(),
 dstate = e.api(e.QEMU_DSTATE))
diff --git a/scripts/tracetool/format/h.py b/scripts/tracetool/format/h.py
index e94f0be7da..285d7b03a9 100644
--- a/scripts/tracetool/format/h.py
+++ b/scripts/tracetool/format/h.py
@@ -74,16 +74,7 @@ def generate(events, backend, group):
 
 out('}')
 
-        # tracer wrapper with checks (per-vCPU tracing)
-        if "vcpu" in e.properties:
-            trace_cpu = next(iter(e.args))[1]
-            cond = "trace_event_get_vcpu_state(%(cpu)s,"\
-                   " TRACE_%(id)s)"\
-                   % dict(
-                       cpu=trace_cpu,
-                       id=e.name.upper())
-        else:
-            cond = "true"
+        cond = "true"
 
 out('',
 'static inline void %(api)s(%(args)s)',
-- 
2.40.1




[PULL v2 08/11] trace: remove control-vcpu.h

2023-06-01 Thread Stefan Hajnoczi
From: Alex Bennée 

Now we no longer have vcpu controlled trace events we can excise the
code that allows us to query its status.

Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Richard Henderson 
Signed-off-by: Alex Bennée 
Message-id: 20230526165401.574474-9-alex.ben...@linaro.org
Message-Id: <20230524133952.3971948-8-alex.ben...@linaro.org>
Signed-off-by: Stefan Hajnoczi 
---
 trace/control-vcpu.h  | 47 ---
 trace/qmp.c   |  2 +-
 scripts/tracetool/format/h.py |  5 +---
 3 files changed, 2 insertions(+), 52 deletions(-)
 delete mode 100644 trace/control-vcpu.h

diff --git a/trace/control-vcpu.h b/trace/control-vcpu.h
deleted file mode 100644
index 800fc5a219..0000000000
--- a/trace/control-vcpu.h
+++ /dev/null
@@ -1,47 +0,0 @@
-/*
- * Interface for configuring and controlling the state of tracing events.
- *
- * Copyright (C) 2011-2016 Lluís Vilanova 
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- */
-
-#ifndef TRACE__CONTROL_VCPU_H
-#define TRACE__CONTROL_VCPU_H
-
-#include "control.h"
-#include "event-internal.h"
-#include "hw/core/cpu.h"
-
-/**
- * trace_event_get_vcpu_state:
- * @vcpu: Target vCPU.
- * @id: Event identifier name.
- *
- * Get the tracing state of an event (both static and dynamic) for the given
- * vCPU.
- *
- * If the event has the disabled property, the check will have no performance
- * impact.
- */
-#define trace_event_get_vcpu_state(vcpu, id)                            \
-    ((id ##_ENABLED) &&                                                 \
-     trace_event_get_vcpu_state_dynamic_by_vcpu_id(                     \
-         vcpu, _ ## id ## _EVENT.vcpu_id))
-
-#include "control-internal.h"
-
-static inline bool
-trace_event_get_vcpu_state_dynamic_by_vcpu_id(CPUState *vcpu,
-                                              uint32_t vcpu_id)
-{
-    /* it's on fast path, avoid consistency checks (asserts) */
-    if (unlikely(trace_events_enabled_count)) {
-        return test_bit(vcpu_id, vcpu->trace_dstate);
-    } else {
-        return false;
-    }
-}
-
-#endif
diff --git a/trace/qmp.c b/trace/qmp.c
index aa760f1fc4..3e3971c6a8 100644
--- a/trace/qmp.c
+++ b/trace/qmp.c
@@ -10,7 +10,7 @@
 #include "qemu/osdep.h"
 #include "qapi/error.h"
 #include "qapi/qapi-commands-trace.h"
-#include "control-vcpu.h"
+#include "control.h"
 
 
 static bool check_events(bool ignore_unavailable, bool is_pattern,
diff --git a/scripts/tracetool/format/h.py b/scripts/tracetool/format/h.py
index 285d7b03a9..ea126b07ea 100644
--- a/scripts/tracetool/format/h.py
+++ b/scripts/tracetool/format/h.py
@@ -16,10 +16,7 @@
 
 
 def generate(events, backend, group):
-    if group == "root":
-        header = "trace/control-vcpu.h"
-    else:
-        header = "trace/control.h"
+    header = "trace/control.h"
 
 out('/* This file is autogenerated by tracetool, do not edit. */',
 '',
-- 
2.40.1



