Re: Using packed virtqueues in Confidential VMs

2023-11-20 Thread Stefan Hajnoczi
On Mon, Nov 20, 2023 at 10:13:15AM +, Reshetova, Elena wrote:
> Hi Stefan, 
> 
> Thank you for following up on this! Please find my comments inline. 
> 
> > -Original Message-
> > From: Stefan Hajnoczi 
> > Sent: Thursday, November 16, 2023 10:03 PM
> > To: Reshetova, Elena 
> > Cc: Michael S. Tsirkin ; virtio-dev@lists.oasis-open.org;
> > virtualizat...@lists.linux.dev
> > Subject: Using packed virtqueues in Confidential VMs
> > 
> > Hi Elena,
> > You raised concerns about using packed virtqueues with untrusted devices at
> > Linux Plumbers Conference. I reviewed the specification and did not find
> > fundamental issues that would preclude the use of packed virtqueues in
> > untrusted devices. Do you have more information about issues with packed
> > virtqueues?
> 
> First of all, a bit of clarification: our overall logic for our first 
> reference
> release of the Linux Intel TDX stack [1] was to enable only the minimal 
> required
> functionality, and this also applied to the numerous modes that virtio 
> provides, because for each enabled functionality we would have to do a code 
> audit and a proper fuzzing setup, all of which requires resources. 
> 
> The choice of the split queue was a natural first step since it is the most 
> straightforward
> to understand (at least it was for us; bear in mind we are not experts in 
> virtio as
> you are), and because work had already been done ([2] and other patches)
> to harden the descriptors for the split queue. 
> 
> [1] https://github.com/intel/tdx-tools 
> [2] 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/virtio?h=v6.6-rc4=72b5e8958738aaa453db5149e6ca3bcf416023b9
> 
> I remember looking at the packed queue long ago and noticing that at least
> some descriptor fields are under device control, and I wasn’t sure why 
> hardening
> similar to the split case was not done here. However, we had many
> issues to handle in the past, and since we didn’t need the packed queue, we
> never investigated this further. 
> It is also possible that we simply misunderstood the code at that point.
> 
> > 
> > I also reviewed Linux's virtio_ring.c to look for implementation issues. One
> > thing I noticed was that detach_buf_packed -> vring_unmap_desc_packed trusts
> > the fields of indirect descriptors that have been mapped to the device:
> > 
> >   flags = le16_to_cpu(desc->flags);
> > 
> >   dma_unmap_page(vring_dma_dev(vq),
> >  le64_to_cpu(desc->addr),
> >  le32_to_cpu(desc->len),
> >  (flags & VRING_DESC_F_WRITE) ?
> >  DMA_FROM_DEVICE : DMA_TO_DEVICE);
> 
> 
> > 
> > This could be problematic if the device is able to modify indirect 
> > descriptors.
> > However, the indirect descriptor table is mapped with DMA_TO_DEVICE:
> > 
> >   addr = vring_map_single(vq, desc,
> >   total_sg * sizeof(struct vring_packed_desc),
> >   DMA_TO_DEVICE);
> > 
> > There is no problem when there is an enforcing IOMMU that maps the page with
> > read-only permissions but that's not always the case. 
> 
> We don’t use IOMMU at the moment for the confidential guest, since we don’t
> have to (memory is encrypted/protected) and only explicitly shared pages are
> available for the host/devices to modify. 
> Do I understand it correctly that in our case the indirect descriptor table 
> will 
> end up mapped shared for this mode to work and then there is no protection? 

Correct.

> 
> > Software devices (QEMU,
> > vhost kernel, or vhost-user) usually have full access to guest RAM. They can
> > cause dma_unmap_page() to be invoked with arguments of their choice (except
> > for
> > the first argument) by modifying indirect descriptors.
> > 
> > I am not sure if this poses a danger since software devices already have 
> > access
> > to guest RAM, but I think this code is risky. It would be safer for the 
> > driver
> > to stash away the arguments needed for dma_unmap_page() in memory that is
> > not
> > mapped to the device.
> > 
> > Other than that, I didn't find any issues with the packed virtqueue
> > implementation.
> 
> Thank you for looking into this! Even if we didn’t need the packed queue, 
> I am sure other deployments might, and it would be best for virtio to 
> ensure that all the modes it provides are secure. 

I looked at the split virtqueue layout code and noticed
vring_unmap_one_split_indirect() does the same thing. So packed and
split virtqueues are the same with respect to loading descriptor fields
from memory that devices without an IOMMU may have overwritten.

The impact of letting an attacker choose most of the arguments to
dma_unmap_page() is not clear to me. There is swiotlb state associated
with every mapping. I haven't read the swiotlb code, so I'm not sure how
robust it is against invalid inputs.

Stefan


signature.asc
Description: PGP signature


[virtio-dev] Using packed virtqueues in Confidential VMs

2023-11-16 Thread Stefan Hajnoczi
Hi Elena,
You raised concerns about using packed virtqueues with untrusted devices at
Linux Plumbers Conference. I reviewed the specification and did not find
fundamental issues that would preclude the use of packed virtqueues in
untrusted devices. Do you have more information about issues with packed
virtqueues?

I also reviewed Linux's virtio_ring.c to look for implementation issues. One
thing I noticed was that detach_buf_packed -> vring_unmap_desc_packed trusts
the fields of indirect descriptors that have been mapped to the device:

  flags = le16_to_cpu(desc->flags);

  dma_unmap_page(vring_dma_dev(vq),
 le64_to_cpu(desc->addr),
 le32_to_cpu(desc->len),
 (flags & VRING_DESC_F_WRITE) ?
 DMA_FROM_DEVICE : DMA_TO_DEVICE);

This could be problematic if the device is able to modify indirect descriptors.
However, the indirect descriptor table is mapped with DMA_TO_DEVICE:

  addr = vring_map_single(vq, desc,
  total_sg * sizeof(struct vring_packed_desc),
  DMA_TO_DEVICE);

There is no problem when there is an enforcing IOMMU that maps the page with
read-only permissions but that's not always the case. Software devices (QEMU,
vhost kernel, or vhost-user) usually have full access to guest RAM. They can
cause dma_unmap_page() to be invoked with arguments of their choice (except for
the first argument) by modifying indirect descriptors.

I am not sure if this poses a danger since software devices already have access
to guest RAM, but I think this code is risky. It would be safer for the driver
to stash away the arguments needed for dma_unmap_page() in memory that is not
mapped to the device.
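The stashing approach suggested here can be illustrated with a small userspace sketch. This is not the actual virtio_ring.c code: the struct and function names are invented, loosely modeled on the shadow-state idea used to harden the split ring. The driver records addr/len/flags at map time in memory the device cannot reach, and the unmap path consults only that copy.

```c
#include <assert.h>
#include <stdint.h>

#define VRING_DESC_F_WRITE 2

/* Descriptor in memory shared with the device -- the device may rewrite it. */
struct vring_packed_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

/* Driver-private shadow, never mapped to the device (hypothetical name). */
struct desc_extra {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
};

/* Record the mapping parameters at map time. */
static void stash_mapping(struct desc_extra *extra, uint64_t addr,
                          uint32_t len, uint16_t flags)
{
    extra->addr  = addr;
    extra->len   = len;
    extra->flags = flags;
}

/* At detach time, return the length to pass to dma_unmap_page() using only
 * the stashed copy; the shared descriptor is deliberately ignored. */
static uint32_t unmap_len(const struct desc_extra *extra,
                          const struct vring_packed_desc *shared)
{
    (void)shared; /* untrusted: may have been modified by the device */
    return extra->len;
}
```

With this split, a device that overwrites `desc->len` after the buffer has been mapped cannot steer the arguments of the eventual `dma_unmap_page()` call.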

Other than that, I didn't find any issues with the packed virtqueue
implementation.

Stefan




[virtio-dev] Re: [RFC PATCH v2] docs/interop: define PROBE feature for vhost-user VirtIO devices

2023-09-08 Thread Stefan Hajnoczi
On Fri, Sep 08, 2023 at 01:03:26PM +0100, Alex Bennée wrote:
> 
> Stefan Hajnoczi  writes:
> 
> > On Fri, Sep 01, 2023 at 12:00:18PM +0100, Alex Bennée wrote:
> >> Currently QEMU has to know some details about the VirtIO device
> >> supported by a vhost-user daemon to be able to setup the guest. This
> >> makes it hard for QEMU to add support for additional vhost-user
> >> daemons without adding specific stubs for each additional VirtIO
> >> device.
> >> 
> >> This patch suggests a new feature flag (VHOST_USER_PROTOCOL_F_PROBE)
> >> which the back-end can advertise which allows a probe message to be
> >> sent to get all the details QEMU needs to know in one message.
> >> 
> >> Together with the existing features VHOST_USER_PROTOCOL_F_STATUS and
> >> VHOST_USER_PROTOCOL_F_CONFIG we can create "standalone" vhost-user
> >> daemons which are capable of handling all aspects of the VirtIO
> >> transactions with only a generic stub on the QEMU side. These daemons
> >> can also be used without QEMU in situations where there isn't a full
> >> VMM managing their setup.
> >> 
> >> Signed-off-by: Alex Bennée 
> >
> > I think the mindset for this change should be "vhost-user is becoming a
> > VIRTIO Transport". VIRTIO Transports have a reasonably well-defined
> > feature set in the VIRTIO specification. The goal should be to cover
> > every VIRTIO Transport operation via vhost-user protocol messages so
> > that the VIRTIO device model can be fully conveyed over vhost-user.
> 
> Is it though? The transport is a guest-visible construct whereas
> vhost-user is purely a backend implementation detail that should be
> invisible to the guest.

No, the transport is not necessarily guest-visible. The vhost-user model
is that the front-end emulates a VIRTIO device and some aspects of that
device are delegated to the vhost-user back-end.

In other words, the vhost-user device is not the same as the VIRTIO
device that the guest sees, but it's still important for the vhost-user
back-end to be a VIRTIO Transport because that's how we can be sure it
supports the VIRTIO device model properly.

> 
> Also the various backends do things a different set of ways. The
> differences between MMIO and PCI are mostly around where config space is
> and how IRQs are handled. For CCW we do actually have a set of commands
> we can look at:
> 
>   #define CCW_CMD_SET_VQ 0x13 
>   #define CCW_CMD_VDEV_RESET 0x33 
>   #define CCW_CMD_SET_IND 0x43 
>   #define CCW_CMD_SET_CONF_IND 0x53 
>   #define CCW_CMD_SET_IND_ADAPTER 0x73 
>   #define CCW_CMD_READ_FEAT 0x12 
>   #define CCW_CMD_WRITE_FEAT 0x11 
>   #define CCW_CMD_READ_CONF 0x22 
>   #define CCW_CMD_WRITE_CONF 0x21 
>   #define CCW_CMD_WRITE_STATUS 0x31 
>   #define CCW_CMD_READ_VQ_CONF 0x32 
>   #define CCW_CMD_SET_VIRTIO_REV 0x83 
>   #define CCW_CMD_READ_STATUS 0x72
> 
> which I think we already have mappings for.

Yes, there are differences between the transports. vhost-user uses
eventfds (callfd/kickfd) instead of interrupts.
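For context, here is a standalone sketch of that eventfd model (this is not vhost-user protocol code; the function names are invented): the back-end writes to a callfd to signal used buffers, and the reader drains the accumulated count.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* "Device"/back-end side: signal the front-end that buffers were used. */
static void signal_call(int callfd)
{
    uint64_t one = 1;
    ssize_t r = write(callfd, &one, sizeof(one));
    (void)r;
}

/* Front-end side: drain pending notifications. A non-semaphore eventfd
 * read returns the summed counter, so coalesced signals appear as one
 * read with a count greater than 1. */
static uint64_t drain_notifications(int callfd)
{
    uint64_t n = 0;
    ssize_t r = read(callfd, &n, sizeof(n));
    (void)r;
    return n;
}
```

This coalescing is one reason eventfds map naturally onto virtio's used-buffer notifications: multiple signals collapse into a single wakeup rather than a per-interrupt delivery.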

> > Anything less is yet another ad-hoc protocol extension that will lead to
> > more bugs and hacks when it turns out some VIRTIO devices cannot be
> > expressed due to limitations in the protocol.
> 
> I agree we want to do this right.
> 
> > This requires going through the VIRTIO spec to find a correspondence
> > between virtio-pci/virtio-mmio/virtio-ccw's interfaces and vhost-user
> > protocol messages. In most cases vhost-user already offers messages and
> > your patch adds more of what is missing. I think this effort is already
> > very close but missing the final check that it really matches the VIRTIO
> > spec.
> >
> > Please do the comparison against the VIRTIO Transports and then adjust
> > this patch to make it clear that the back-end is becoming a full-fledged
> > VIRTIO Transport:
> > - The name of the patch series should reflect that.
> > - The vhost-user protocol feature should be named F_TRANSPORT.
> > - The messages added in this patch should have a 1:1 correspondence with
> >   the VIRTIO spec including using the same terminology for consistency.
> >
> > Sorry for the hassle, but I think this is a really crucial point where
> > we have the chance to make vhost-user work smoothly in the future...but
> > only if we can faithfully expose VIRTIO Transport semantics.
> 
> I wonder if this should first be handled by cleaning up the VirtIO spec to
> make it clear what capabilities each transport needs to support?

It's a fair point that the VIRTIO spec does not provide an interface
definition for the VIRTIO Transport or at 

Re: [virtio-dev] [RFC PATCH v2] docs/interop: define PROBE feature for vhost-user VirtIO devices

2023-09-07 Thread Stefan Hajnoczi
On Tue, Sep 05, 2023 at 10:34:11AM +0100, Alex Bennée wrote:
> 
> Albert Esteve  writes:
> 
> > This looks great! Thanks for this proposal.
> >
> > On Fri, Sep 1, 2023 at 1:00 PM Alex Bennée  wrote:
> >
> >  Currently QEMU has to know some details about the VirtIO device
> >  supported by a vhost-user daemon to be able to setup the guest. This
> >  makes it hard for QEMU to add support for additional vhost-user
> >  daemons without adding specific stubs for each additional VirtIO
> >  device.
> >
> >  This patch suggests a new feature flag (VHOST_USER_PROTOCOL_F_PROBE)
> >  which the back-end can advertise which allows a probe message to be
> >  sent to get all the details QEMU needs to know in one message.
> >
> >  Together with the existing features VHOST_USER_PROTOCOL_F_STATUS and
> >  VHOST_USER_PROTOCOL_F_CONFIG we can create "standalone" vhost-user
> >  daemons which are capable of handling all aspects of the VirtIO
> >  transactions with only a generic stub on the QEMU side. These daemons
> >  can also be used without QEMU in situations where there isn't a full
> >  VMM managing their setup.
> >
> >  Signed-off-by: Alex Bennée 
> >
> >  ---
> >  v2
> >- dropped F_STANDALONE in favour of F_PROBE
> >- split probe details across several messages
> >- probe messages don't automatically imply a standalone daemon
> >- add wording where probe details interact (F_MQ/F_CONFIG)
> >- define VMM and make clear QEMU is only one of many potential VMMs
> >- reword commit message
> >  ---
> >   docs/interop/vhost-user.rst | 90 -
> >   hw/virtio/vhost-user.c  |  8 
> >   2 files changed, 88 insertions(+), 10 deletions(-)
> >
> >  diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> >  index 5a070adbc1..ba3b5e07b7 100644
> >  --- a/docs/interop/vhost-user.rst
> >  +++ b/docs/interop/vhost-user.rst
> >  @@ -7,6 +7,7 @@ Vhost-user Protocol
> >   ..
> > Copyright 2014 Virtual Open Systems Sarl.
> > Copyright 2019 Intel Corporation
> >  +  Copyright 2023 Linaro Ltd
> > Licence: This work is licensed under the terms of the GNU GPL,
> >  version 2 or later. See the COPYING file in the top-level
> >  directory.
> >  @@ -27,17 +28,31 @@ The protocol defines 2 sides of the communication, 
> > *front-end* and
> >   *back-end*. The *front-end* is the application that shares its 
> > virtqueues, in
> >   our case QEMU. The *back-end* is the consumer of the virtqueues.
> >
> >  -In the current implementation QEMU is the *front-end*, and the *back-end*
> >  -is the external process consuming the virtio queues, for example a
> >  -software Ethernet switch running in user space, such as Snabbswitch,
> >  -or a block device back-end processing read & write to a virtual
> >  -disk. In order to facilitate interoperability between various back-end
> >  -implementations, it is recommended to follow the :ref:`Backend program
> >  -conventions `.
> >  +In the current implementation a Virtual Machine Manager (VMM) such as
> >  +QEMU is the *front-end*, and the *back-end* is the external process
> >  +consuming the virtio queues, for example a software Ethernet switch
> >  +running in user space, such as Snabbswitch, or a block device back-end
> >  +processing read & write to a virtual disk. In order to facilitate
> >  +interoperability between various back-end implementations, it is
> >  +recommended to follow the :ref:`Backend program conventions
> >  +`.
> >
> >   The *front-end* and *back-end* can be either a client (i.e. connecting) or
> >   server (listening) in the socket communication.
> >
> >  +Probing device details
> >  +--
> >  +
> >  +Traditionally the vhost-user daemon *back-end* shares configuration
> >  +responsibilities with the VMM *front-end* which needs to know certain
> >  +key bits of information about the device. This means the VMM needs to
> >  +define at least a minimal stub for each VirtIO device it wants to
> >  +support. If the daemon supports the right set of protocol features the
> >  +VMM can probe the daemon for the information it needs to setup the
> >  +device. See :ref:`Probing features for standalone daemons
> >  +` for more details.
> >  +
> >  +
> >   Support for platforms other than Linux
> >   --
> >
> >  @@ -316,6 +331,7 @@ replies. Here is a list of the ones that do:
> >   * ``VHOST_USER_GET_VRING_BASE``
> >   * ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``)
> >   * ``VHOST_USER_GET_INFLIGHT_FD`` (if 
> > ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)
> >  +* ``VHOST_USER_GET_BACKEND_SPECS`` (if 
> > ``VHOST_USER_PROTOCOL_F_STANDALONE``)
> >
> >   .. seealso::
> >
> >  @@ -396,9 +412,10 @@ must support changing some configuration aspects on 
> > the fly.
> >   Multiple queue support
> >   --
> >
> >  -Many devices have a fixed number of virtqueues.  In this case the 
> > front-end
> >  

[virtio-dev] Re: [RFC PATCH v2] docs/interop: define PROBE feature for vhost-user VirtIO devices

2023-09-07 Thread Stefan Hajnoczi
On Fri, Sep 01, 2023 at 12:00:18PM +0100, Alex Bennée wrote:
> Currently QEMU has to know some details about the VirtIO device
> supported by a vhost-user daemon to be able to setup the guest. This
> makes it hard for QEMU to add support for additional vhost-user
> daemons without adding specific stubs for each additional VirtIO
> device.
> 
> This patch suggests a new feature flag (VHOST_USER_PROTOCOL_F_PROBE)
> which the back-end can advertise which allows a probe message to be
> sent to get all the details QEMU needs to know in one message.
> 
> Together with the existing features VHOST_USER_PROTOCOL_F_STATUS and
> VHOST_USER_PROTOCOL_F_CONFIG we can create "standalone" vhost-user
> daemons which are capable of handling all aspects of the VirtIO
> transactions with only a generic stub on the QEMU side. These daemons
> can also be used without QEMU in situations where there isn't a full
> VMM managing their setup.
> 
> Signed-off-by: Alex Bennée 

I think the mindset for this change should be "vhost-user is becoming a
VIRTIO Transport". VIRTIO Transports have a reasonably well-defined
feature set in the VIRTIO specification. The goal should be to cover
every VIRTIO Transport operation via vhost-user protocol messages so
that the VIRTIO device model can be fully conveyed over vhost-user.

Anything less is yet another ad-hoc protocol extension that will lead to
more bugs and hacks when it turns out some VIRTIO devices cannot be
expressed due to limitations in the protocol.

This requires going through the VIRTIO spec to find a correspondence
between virtio-pci/virtio-mmio/virtio-ccw's interfaces and vhost-user
protocol messages. In most cases vhost-user already offers messages and
your patch adds more of what is missing. I think this effort is already
very close but missing the final check that it really matches the VIRTIO
spec.

Please do the comparison against the VIRTIO Transports and then adjust
this patch to make it clear that the back-end is becoming a full-fledged
VIRTIO Transport:
- The name of the patch series should reflect that.
- The vhost-user protocol feature should be named F_TRANSPORT.
- The messages added in this patch should have a 1:1 correspondence with
  the VIRTIO spec including using the same terminology for consistency.

Sorry for the hassle, but I think this is a really crucial point where
we have the chance to make vhost-user work smoothly in the future...but
only if we can faithfully expose VIRTIO Transport semantics.

> 
> ---
> v2
>   - dropped F_STANDALONE in favour of F_PROBE
>   - split probe details across several messages
>   - probe messages don't automatically imply a standalone daemon
>   - add wording where probe details interact (F_MQ/F_CONFIG)
>   - define VMM and make clear QEMU is only one of many potential VMMs
>   - reword commit message
> ---
>  docs/interop/vhost-user.rst | 90 -
>  hw/virtio/vhost-user.c  |  8 
>  2 files changed, 88 insertions(+), 10 deletions(-)
> 
> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> index 5a070adbc1..ba3b5e07b7 100644
> --- a/docs/interop/vhost-user.rst
> +++ b/docs/interop/vhost-user.rst
> @@ -7,6 +7,7 @@ Vhost-user Protocol
>  ..
>Copyright 2014 Virtual Open Systems Sarl.
>Copyright 2019 Intel Corporation
> +  Copyright 2023 Linaro Ltd
>Licence: This work is licensed under the terms of the GNU GPL,
> version 2 or later. See the COPYING file in the top-level
> directory.
> @@ -27,17 +28,31 @@ The protocol defines 2 sides of the communication, 
> *front-end* and
>  *back-end*. The *front-end* is the application that shares its virtqueues, in
>  our case QEMU. The *back-end* is the consumer of the virtqueues.
>  
> -In the current implementation QEMU is the *front-end*, and the *back-end*
> -is the external process consuming the virtio queues, for example a
> -software Ethernet switch running in user space, such as Snabbswitch,
> -or a block device back-end processing read & write to a virtual
> -disk. In order to facilitate interoperability between various back-end
> -implementations, it is recommended to follow the :ref:`Backend program
> -conventions `.
> +In the current implementation a Virtual Machine Manager (VMM) such as
> +QEMU is the *front-end*, and the *back-end* is the external process
> +consuming the virtio queues, for example a software Ethernet switch
> +running in user space, such as Snabbswitch, or a block device back-end
> +processing read & write to a virtual disk. In order to facilitate
> +interoperability between various back-end implementations, it is
> +recommended to follow the :ref:`Backend program conventions
> +`.
>  
>  The *front-end* and *back-end* can be either a client (i.e. connecting) or
>  server (listening) in the socket communication.
>  
> +Probing device details
> +--
> +
> +Traditionally the vhost-user daemon *back-end* shares configuration
> +responsibilities with 

[virtio-dev] Re: [virtio-comment] [RFC PATCH 1/5] virtio: introduce SUSPEND bit in device status

2023-08-21 Thread Stefan Hajnoczi
On Fri, Aug 18, 2023 at 05:55:42PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 8/18/2023 12:04 AM, Stefan Hajnoczi wrote:
> > On Thu, Aug 17, 2023 at 05:15:16PM +0200, Eugenio Perez Martin wrote:
> > > On Tue, Aug 15, 2023 at 2:29 PM Stefan Hajnoczi  
> > > wrote:
> > > > On Tue, Aug 15, 2023 at 06:31:23PM +0800, Zhu, Lingshan wrote:
> > > > > 
> > > > > On 8/14/2023 10:30 PM, Stefan Hajnoczi wrote:
> > > > > > On Tue, Aug 15, 2023 at 03:29:00AM +0800, Zhu Lingshan wrote:
> > > > > > > This patch introduces a new status bit in the device status: 
> > > > > > > SUSPEND.
> > > > > > > 
> > > > > > > This SUSPEND bit can be used by the driver to suspend a device,
> > > > > > > in order to stabilize the device states and virtqueue states.
> > > > > > > 
> > > > > > > Its main use case is live migration.
> > > > > > > 
> > > > > > > Signed-off-by: Jason Wang 
> > > > > > > Signed-off-by: Eugenio PÃrez 
> > > > > > There is a character encoding issue in Eugenio's surname.
> > > > > Oh, I copied his SOB from his email; I will copy from the git log to 
> > > > > fix this,
> > > > > thanks for pointing it out.
> > > > > > > Signed-off-by: Zhu Lingshan 
> > > > > > > ---
> > > > > > >content.tex | 18 ++
> > > > > > >1 file changed, 18 insertions(+)
> > > > > > This patch hints at the asynchronous nature of the SUSPEND bit (the
> > > > > > driver must re-read the Device Status Field) but doesn't explain the
> > > > > > rationale or any limits.
> > > > > > 
> > > > > > For example, is there a timeout or should the driver re-read the 
> > > > > > Device
> > > > > > Status Field forever?
> > > > > It depends on the driver, normally we expect this operation can be 
> > > > > done
> > > > > successfully
> > > > > like how the driver/device handles FEATURES_OK.
> > > > > 
> > > > > Once failed due to:
> > > > > 1) driver timeout, the driver can reset the device
> > > > > 2) device failure, the device can set NEEDS_RESET.
> > > > I mention this because SUSPEND involves quiescing the device so that no
> > > > requests are in flight and that can take an unbounded amount of time on
> > > > a virtio-blk, virtio-scsi, or virtiofs device. If the driver is busy
> > > > waiting for the device to report the SUSPEND bit, then that could take a
> > > > long time/forever.
> > > > 
> > > > Imagine a virtiofs PCI device implemented in hardware that forwards I/O
> > > > to a distributed storage system. If the distributed storage system has a
> > > > request in flight then SUSPEND needs to wait for it to complete. The
> > > > device has no control over how long that will take, but if it does not
> > > > then corruption could occur later on.
> > > > 
> > > > There are two issues with long SUSPEND times:
> > > > 1. Busy wait CPU consumption. Since there is no interrupt that signals
> > > > when the bit is set, the best a driver can do is to back off
> > > > gradually and use timers to avoid hogging the CPU.
> > > > 2. Synchronous blocking. If the call stack that led the driver to set
> > > > SUSPEND is blocked until the device reports the SUSPEND bit, then
> > > > other parts of the system could experience blocking. For example, 
> > > > the
> > > > VMM might be blocked in a vhost ioctl() call, which makes the guest
> > > > unresponsive.
> > > > 
> > > I think all of this already happens with ring reset or even a plain
> > > device reset, doesn't it?
> > Yes, but keep in mind that the driver has typically already drained
> > requests when it invokes reset. If SUSPEND is used transparently during
> > live migration then there really will be requests in flight because the
> > guest driver is unaware.
> I agree, and as discussed before, I think in this series we should say:
> the device MUST wait until all descriptors being processed have finished and
> mark them as used.

Sounds good.

> Then in the following series, we should implement in-flight IO tracker.
> > 
> > > In my opinion the best thing the device can do here is to fail the
> > > request after a certain time, the same way it would fail if the
> > > backend distributed storage system gets disconnected or latency gets
> > > out of bounds.
> > In order to prevent corruption there needs to be a fence in addition to
> > a timeout. In other words, the storage backend needs to guarantee that
> > any requests sent before the fence will be ignored if they are still
> > encountered.
> Please correct me if I misunderstand anything.
> 
> I am not sure a remote target is aware of SUSPEND, but the device can
> fail SUSPEND by setting NEEDS_RESET for sure. In this series,
> maybe it is best to flush and wait for the IO requests.

Yes, it is necessary to wait for pending I/O.

I was pointing out that timeouts implemented inside the storage
initiator (client) are unsafe. The storage target (server) needs to
participate in stopping in-flight I/O one way or another (timeout,
cancellation, fence, etc).
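The fencing idea can be sketched with a generation ("epoch") number checked on the target side. This is a hypothetical illustration, not a concrete storage protocol; real systems implement it with reservations, generation numbers, or similar mechanisms.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct target {
    uint64_t epoch; /* current fence generation on the storage target */
};

/* Fence: invalidate all requests issued under earlier epochs. */
static void fence(struct target *t)
{
    t->epoch++;
}

/* A request carries the epoch it was issued under; the target applies it
 * only if no fence has happened since. */
static bool admit_request(const struct target *t, uint64_t req_epoch)
{
    return req_epoch == t->epoch;
}
```

With this scheme a timed-out request cannot corrupt state later: even if it reaches the target after SUSPEND completes, its stale epoch causes the target to discard it.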

Stefan




[virtio-dev] Re: [virtio-comment] [RFC PATCH 1/5] virtio: introduce SUSPEND bit in device status

2023-08-17 Thread Stefan Hajnoczi
On Thu, Aug 17, 2023 at 05:15:16PM +0200, Eugenio Perez Martin wrote:
> On Tue, Aug 15, 2023 at 2:29 PM Stefan Hajnoczi  wrote:
> >
> > On Tue, Aug 15, 2023 at 06:31:23PM +0800, Zhu, Lingshan wrote:
> > >
> > >
> > > On 8/14/2023 10:30 PM, Stefan Hajnoczi wrote:
> > > > On Tue, Aug 15, 2023 at 03:29:00AM +0800, Zhu Lingshan wrote:
> > > > > This patch introduces a new status bit in the device status: SUSPEND.
> > > > >
> > > > > This SUSPEND bit can be used by the driver to suspend a device,
> > > > > in order to stabilize the device states and virtqueue states.
> > > > >
> > > > > Its main use case is live migration.
> > > > >
> > > > > Signed-off-by: Jason Wang 
> > > > > Signed-off-by: Eugenio PÃrez 
> > > > There is a character encoding issue in Eugenio's surname.
> > > Oh, I copied his SOB from his email; I will copy from the git log to fix this,
> > > thanks for pointing it out.
> > > >
> > > > > Signed-off-by: Zhu Lingshan 
> > > > > ---
> > > > >   content.tex | 18 ++
> > > > >   1 file changed, 18 insertions(+)
> > > > This patch hints at the asynchronous nature of the SUSPEND bit (the
> > > > driver must re-read the Device Status Field) but doesn't explain the
> > > > rationale or any limits.
> > > >
> > > > For example, is there a timeout or should the driver re-read the Device
> > > > Status Field forever?
> > > It depends on the driver, normally we expect this operation can be done
> > > successfully
> > > like how the driver/device handles FEATURES_OK.
> > >
> > > Once failed due to:
> > > 1) driver timeout, the driver can reset the device
> > > 2) device failure, the device can set NEEDS_RESET.
> >
> > I mention this because SUSPEND involves quiescing the device so that no
> > requests are in flight and that can take an unbounded amount of time on
> > a virtio-blk, virtio-scsi, or virtiofs device. If the driver is busy
> > waiting for the device to report the SUSPEND bit, then that could take a
> > long time/forever.
> >
> > Imagine a virtiofs PCI device implemented in hardware that forwards I/O
> > to a distributed storage system. If the distributed storage system has a
> > request in flight then SUSPEND needs to wait for it to complete. The
> > device has no control over how long that will take, but if it does not
> > then corruption could occur later on.
> >
> > There are two issues with long SUSPEND times:
> > 1. Busy wait CPU consumption. Since there is no interrupt that signals
> >when the bit is set, the best a driver can do is to back off
> >gradually and use timers to avoid hogging the CPU.
> > 2. Synchronous blocking. If the call stack that led the driver to set
> >SUSPEND is blocked until the device reports the SUSPEND bit, then
> >other parts of the system could experience blocking. For example, the
> >VMM might be blocked in a vhost ioctl() call, which makes the guest
> >unresponsive.
> >
> 
> I think all of this already happens with ring reset or even a plain
> device reset, doesn't it?

Yes, but keep in mind that the driver has typically already drained
requests when it invokes reset. If SUSPEND is used transparently during
live migration then there really will be requests in flight because the
guest driver is unaware.

> 
> In my opinion the best thing the device can do here is to fail the
> request after a certain time, the same way it would fail if the
> backend distributed storage system gets disconnected or latency gets
> out of bounds.

In order to prevent corruption there needs to be a fence in addition to
a timeout. In other words, the storage backend needs to guarantee that
any requests sent before the fence will be ignored if they are still
encountered.

> > Making SUSPEND asynchronous is more complicated but would allow long
> > SUSPEND times to be handled gracefully.
> >
> 
> Maybe that should be the direction of the transport vq, so transport
> commands are asynchronous and we get rid of all the similar problems
> in one shot?

Yes, a transport virtqueue would make this operation asynchronous.
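Until such an asynchronous mechanism exists, the driver is left polling the Device Status Field. A sketch of the gradual back-off discussed earlier in the thread (the SUSPEND_BIT value and the callback-based accessors are invented for illustration; a real driver would use its transport's status read and a sleeping delay):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SUSPEND_BIT 0x40 /* hypothetical bit value, for illustration only */

/* Poll until status_read() reports SUSPEND, sleeping via delay_us() with
 * exponentially growing intervals, up to budget_us total sleep time.
 * Returns false on timeout, after which the caller could reset the device. */
static bool wait_for_suspend(uint8_t (*status_read)(void *),
                             void (*delay_us)(uint64_t),
                             void *dev, uint64_t budget_us)
{
    uint64_t slept = 0, interval = 1;

    while (!(status_read(dev) & SUSPEND_BIT)) {
        if (slept >= budget_us)
            return false;
        delay_us(interval);
        slept += interval;
        if (interval < 1024)
            interval *= 2; /* back off gradually, capped */
    }
    return true;
}

/* Test double: reports SUSPEND once the countdown passed via dev hits zero. */
static uint8_t countdown_status(void *dev)
{
    int *reads_left = dev;
    return (*reads_left)-- <= 0 ? SUSPEND_BIT : 0;
}

static void no_delay(uint64_t us) { (void)us; }
```

The back-off keeps CPU consumption bounded during a long quiesce, at the cost of added latency once the intervals have grown.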

Stefan

> 
> Thanks!
> 
> > > >
> > > > Does the driver need to re-read the Device Status Field after clearing
> > > > the SUSPEND bit?
> > > I think the driver should re-read, I will add this in the next version.
> > > >
> > > > > diff --git a/content.

[virtio-dev] Re: [virtio-comment] [RFC PATCH 3/5] virtio: The actions by the device upon SUSPEND

2023-08-16 Thread Stefan Hajnoczi
On Wed, Aug 16, 2023 at 12:25:39PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 8/15/2023 8:33 PM, Stefan Hajnoczi wrote:
> > On Tue, Aug 15, 2023 at 07:07:42PM +0800, Zhu, Lingshan wrote:
> > > 
> > > On 8/14/2023 11:00 PM, Stefan Hajnoczi wrote:
> > > > On Tue, Aug 15, 2023 at 03:29:02AM +0800, Zhu Lingshan wrote:
> > > > > This commit specifies the actions to be taken by the device upon
> > > > > SUSPEND.
> > > > > 
> > > > > Signed-off-by: Jason Wang 
> > > > > Signed-off-by: Eugenio PÃrez 
> > > > > Signed-off-by: Zhu Lingshan 
> > > > > ---
> > > > >content.tex | 9 +
> > > > >1 file changed, 9 insertions(+)
> > > > > 
> > > > > diff --git a/content.tex b/content.tex
> > > > > index 074f43e..43bd5de 100644
> > > > > --- a/content.tex
> > > > > +++ b/content.tex
> > > > > @@ -96,6 +96,15 @@ \section{\field{Device Status} 
> > > > > Field}\label{sec:Basic Facilities of a Virtio Dev
> > > > >If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set, the device 
> > > > > MUST clear SUSPEND
> > > > >and resumes operation upon DRIVER_OK.
> > > > > +If VIRTIO_F_SUSPEND is negotiated, when SUSPEND is set, the device 
> > > > > MUST perform the following operations:
> > > > "when SUSPEND is set" is ambiguous. It could mean when the driver writes
> > > > the Device Status Field, when the device transitions to SUSPEND, or
> > > > something else. Maybe "before the device reports the SUSPEND bit set in
> > > > the Device Status Field"?
> > > Nice catch! will fix: when the driver sets SUSPEND, before the device
> > > reports the
> > > SUSPEND bit set in the \field{device status}, the device MUST perform the
> > > following actions.
> > > > > +\begin{itemize}
> > > > > +\item Stop comsuming any descriptors
> > > > "consuming"
> > > yes
> > > > It may be more consistent to talk about virtqueue buffers (i.e.
> > > > requests) rather than descriptors here.
> > > yes
> > > > > +\item Mark all finished descriptors as used and send used buffer 
> > > > > notification to the driver
> > > > The device has to complete everything that is in flight, just completing
> > > > "finished" stuff is not enough. Does "finished descriptors" really mean
> > > > "in-flight virtqueue buffers"?
> > > A patch of tracking in-flight descriptors is planned. Here I expect
> > > "finished" means "done" buffers, transmitted and received ACK.
> > Yes, I don't mean tracking in-flight virtqueue buffers, I mean waiting
> > until they are marked as used. I think "finished" is unclear and the
> > text should explain that the device waits for in-flight virtqueue
> > buffers (in the future this could be relaxed with in-flight tracking
> > functionality).
> Sure, I can add this in the next version: the device MUST wait until all
> descriptors that are being processed have finished, and mark them as used

Sounds good.

> And we can replace it with tracking in-flight descriptors in the future
> 
> Thanks
> > 
> > > > "send a used buffer notification" or "send used buffer notifications"
> > > Yes will fix
> > > > > +\item Record Virtqueue State of each enabled virtqueue, see section 
> > > > > \ref{sec:Virtqueues / Virtqueue State}
> > > > Does "Record" mean that Virtqueue State fields can only be accessed by
> > > > the driver while device is paused (they may be outdated or invalid while
> > > > the device is unpaused)?
> > > Yes, to avoid race with the device and make device implementation easier,
> > > this is also mentioned in Patch 4.
> > > > > +\item Pause its operation and preserve all configurations in its 
> > > > > Device Configuration Space, see \ref{sec:Basic Facilities of a Virtio 
> > > > > Device / Device Configuration Space}
> > > > What does "preserve" mean? Does it mean the Device Configuration Space
> > > > is not allowed to change during SUSPEND?
> > > Yes, once the device presents SUSPEND in its device status, it should not
> > > change any
> > > fields in its Device Configuration Space.
> > > 
> > > Thanks for your review

[virtio-dev] Re: [virtio-comment] [RFC PATCH 3/5] virtio: The actions by the device upon SUSPEND

2023-08-15 Thread Stefan Hajnoczi
On Tue, Aug 15, 2023 at 07:07:42PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 8/14/2023 11:00 PM, Stefan Hajnoczi wrote:
> > On Tue, Aug 15, 2023 at 03:29:02AM +0800, Zhu Lingshan wrote:
> > > This commit specifies the actions to be taken by the device upon
> > > SUSPEND.
> > > 
> > > Signed-off-by: Jason Wang 
> > > Signed-off-by: Eugenio PÃrez 
> > > Signed-off-by: Zhu Lingshan 
> > > ---
> > >   content.tex | 9 +
> > >   1 file changed, 9 insertions(+)
> > > 
> > > diff --git a/content.tex b/content.tex
> > > index 074f43e..43bd5de 100644
> > > --- a/content.tex
> > > +++ b/content.tex
> > > @@ -96,6 +96,15 @@ \section{\field{Device Status} Field}\label{sec:Basic 
> > > Facilities of a Virtio Dev
> > >   If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set, the device MUST 
> > > clear SUSPEND
> > >   and resumes operation upon DRIVER_OK.
> > > +If VIRTIO_F_SUSPEND is negotiated, when SUSPEND is set, the device MUST 
> > > perform the following operations:
> > "when SUSPEND is set" is ambiguous. It could mean when the driver writes
> > the Device Status Field, when the device transitions to SUSPEND, or
> > something else. Maybe "before the device reports the SUSPEND bit set in
> > the Device Status Field"?
> Nice catch! will fix: when the driver sets SUSPEND, before the device
> reports the
> SUSPEND bit set in the \field{device status}, the device MUST perform the
> following actions.
> > 
> > > +\begin{itemize}
> > > +\item Stop comsuming any descriptors
> > "consuming"
> yes
> > 
> > It may be more consistent to talk about virtqueue buffers (i.e.
> > requests) rather than descriptors here.
> yes
> > 
> > > +\item Mark all finished descriptors as used and send used buffer 
> > > notification to the driver
> > The device has to complete everything that is in flight, just completing
> > "finished" stuff is not enough. Does "finished descriptors" really mean
> > "in-flight virtqueue buffers"?
> A patch of tracking in-flight descriptors is planned. Here I expect
> "finished" means "done" buffers, transmitted and received ACK.

Yes, I don't mean tracking in-flight virtqueue buffers, I mean waiting
until they are marked as used. I think "finished" is unclear and the
text should explain that the device waits for in-flight virtqueue
buffers (in the future this could be relaxed with in-flight tracking
functionality).
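The ordering being discussed here is: stop accepting new buffers, drain everything in flight (marking each buffer used), snapshot the virtqueue state, and only then expose the SUSPEND bit. A schematic device-side sketch; all names are hypothetical and the drain loop stands in for real completion handling:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct vq {
    bool accepting;        /* device still consumes new buffers */
    int  inflight;         /* buffers taken but not yet marked used */
    uint16_t last_avail;   /* snapshot target: Available State */
    uint16_t used_idx;     /* snapshot target: Used State */
};

static bool device_suspended = false;

/* Mark one in-flight buffer used and notify the driver (elided). */
static void complete_one(struct vq *q)
{
    q->inflight--;
}

/* Suspend sequence: stop consuming, drain in-flight buffers,
 * record virtqueue state, then report the SUSPEND bit. */
static void device_do_suspend(struct vq *q)
{
    q->accepting = false;        /* 1. stop consuming new descriptors  */
    while (q->inflight > 0)      /* 2. wait for in-flight buffers to   */
        complete_one(q);         /*    finish; each is marked as used  */
    /* 3. virtqueue state is now stable and can be recorded            */
    /* 4. config space is frozen; finally present the status bit:      */
    device_suspended = true;
}
```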

> > 
> > "send a used buffer notification" or "send used buffer notifications"
> Yes will fix
> > 
> > > +\item Record Virtqueue State of each enabled virtqueue, see section 
> > > \ref{sec:Virtqueues / Virtqueue State}
> > Does "Record" mean that Virtqueue State fields can only be accessed by
> > the driver while device is paused (they may be outdated or invalid while
> > the device is unpaused)?
> Yes, to avoid race with the device and make device implementation easier,
> this is also mentioned in Patch 4.
> > 
> > > +\item Pause its operation and preserve all configurations in its Device 
> > > Configuration Space, see \ref{sec:Basic Facilities of a Virtio Device / 
> > > Device Configuration Space}
> > What does "preserve" mean? Does it mean the Device Configuration Space
> > is not allowed to change during SUSPEND?
> Yes, once the device presents SUSPEND in its device status, it should not
> change any
> fields in its Device Configuration Space.
> 
> Thanks for your review
> > 
> > > +\item Present SUSPEND in \field{device status}
> > > +\end{itemize}
> > > +
> > >   \section{Feature Bits}\label{sec:Basic Facilities of a Virtio Device / 
> > > Feature Bits}
> > >   Each virtio device offers all the features it understands.  During
> > > -- 
> > > 2.35.3
> > > 
> > > 
> > > This publicly archived list offers a means to provide input to the
> > > OASIS Virtual I/O Device (VIRTIO) TC.
> > > 
> > > In order to verify user consent to the Feedback License terms and
> > > to minimize spam in the list archive, subscription is required
> > > before posting.
> > > 
> > > Subscribe: virtio-comment-subscr...@lists.oasis-open.org
> > > Unsubscribe: virtio-comment-unsubscr...@lists.oasis-open.org
> > > List help: virtio-comment-h...@lists.oasis-open.org
> > > List archive: https://lists.oasis-open.org/archives/virtio-com

[virtio-dev] Re: [virtio-comment] [RFC PATCH 1/5] virtio: introduce SUSPEND bit in device status

2023-08-15 Thread Stefan Hajnoczi
On Tue, Aug 15, 2023 at 06:31:23PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 8/14/2023 10:30 PM, Stefan Hajnoczi wrote:
> > On Tue, Aug 15, 2023 at 03:29:00AM +0800, Zhu Lingshan wrote:
> > > This patch introduces a new status bit in the device status: SUSPEND.
> > > 
> > > This SUSPEND bit can be used by the driver to suspend a device,
> > > in order to stabilize the device states and virtqueue states.
> > > 
> > > Its main use case is live migration.
> > > 
> > > Signed-off-by: Jason Wang 
> > > Signed-off-by: Eugenio PÃrez 
> > There is a character encoding issue in Eugenio's surname.
> Oh, I copied his SOB from his email; I will copy from git log to fix this,
> thanks for pointing it out.
> > 
> > > Signed-off-by: Zhu Lingshan 
> > > ---
> > >   content.tex | 18 ++
> > >   1 file changed, 18 insertions(+)
> > This patch hints at the asynchronous nature of the SUSPEND bit (the
> > driver must re-read the Device Status Field) but doesn't explain the
> > rationale or any limits.
> > 
> > For example, is there a timeout or should the driver re-read the Device
> > Status Field forever?
> It depends on the driver; normally we expect this operation to complete
> successfully, similar to how the driver/device handles FEATURES_OK.
> 
> If it fails due to:
> 1) driver timeout, the driver can reset the device
> 2) device failure, the device can set NEEDS_RESET.

I mention this because SUSPEND involves quiescing the device so that no
requests are in flight and that can take an unbounded amount of time on
a virtio-blk, virtio-scsi, or virtiofs device. If the driver is busy
waiting for the device to report the SUSPEND bit, then that could take a
long time/forever.

Imagine a virtiofs PCI device implemented in hardware that forwards I/O
to a distributed storage system. If the distributed storage system has a
request in flight then SUSPEND needs to wait for it to complete. The
device has no control over how long that will take, but if it does not
wait, corruption could occur later on.

There are two issues with long SUSPEND times:
1. Busy wait CPU consumption. Since there is no interrupt that signals
   when the bit is set, the best a driver can do is to back off
   gradually and use timers to avoid hogging the CPU.
2. Synchronous blocking. If the call stack that led the driver to set
   SUSPEND is blocked until the device reports the SUSPEND bit, then
   other parts of the system could experience blocking. For example, the
   VMM might be blocked in a vhost ioctl() call, which makes the guest
   unresponsive.

Making SUSPEND asynchronous is more complicated but would allow long
SUSPEND times to be handled gracefully.
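The backoff idea in point 1 can be made concrete: poll the status register, sleeping longer between reads, and give up after a time budget, at which point the driver would reset the device. A sketch under stated assumptions; the status-read helper is simulated here so the example is self-contained, and the delay and cap values are made up for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_STATUS_SUSPEND 16u   /* proposed SUSPEND bit (value 16) */

/* Hypothetical device-status read; simulated: the device reports
 * SUSPEND only after a few polls. */
static int polls_until_suspended = 3;

static uint8_t read_device_status(void)
{
    if (polls_until_suspended > 0) {
        polls_until_suspended--;
        return 0;
    }
    return VIRTIO_STATUS_SUSPEND;
}

/* Poll for SUSPEND, sleeping with exponential backoff between reads.
 * Gives up once accumulated sleep time exceeds budget_us; the caller
 * would then reset the device. */
static bool wait_for_suspend(uint64_t budget_us)
{
    uint64_t slept_us = 0;
    uint64_t delay_us = 1;

    while (!(read_device_status() & VIRTIO_STATUS_SUSPEND)) {
        if (slept_us >= budget_us)
            return false;           /* timeout: caller resets device */
        /* a real driver would usleep(delay_us) here */
        slept_us += delay_us;
        if (delay_us < 1000000)
            delay_us *= 2;          /* back off, capped at 1 second */
    }
    return true;
}
```

This bounds CPU consumption but not latency, which is why the text argues an asynchronous SUSPEND would still be preferable.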

> > 
> > Does the driver need to re-read the Device Status Field after clearing
> > the SUSPEND bit?
> I think the driver should re-read, I will add this in the next version.
> > 
> > > diff --git a/content.tex b/content.tex
> > > index 0a62dce..1bb4401 100644
> > > --- a/content.tex
> > > +++ b/content.tex
> > > @@ -47,6 +47,9 @@ \section{\field{Device Status} Field}\label{sec:Basic 
> > > Facilities of a Virtio Dev
> > >   \item[DRIVER_OK (4)] Indicates that the driver is set up and ready to
> > > drive the device.
> > > +\item[SUSPEND (16)] When VIRTIO_F_SUSPEND is negotiated, indicates that 
> > > the
> > > +  device has been suspended by the driver.
> > > +
> > >   \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
> > > an error from which it can't recover.
> > >   \end{description}
> > > @@ -73,6 +76,10 @@ \section{\field{Device Status} Field}\label{sec:Basic 
> > > Facilities of a Virtio Dev
> > >   recover by issuing a reset.
> > >   \end{note}
> > > +The driver MUST NOT set SUSPEND if FEATURES_OK is not set.
> > > +
> > > +When set SUSPEND, the driver MUST re-read \field{device status} to 
> > > ensure the SUSPEND bit is set.
> > "When setting SUSPEND, ..." would be grammatically correct. Another
> > option is "After setting the SUSPEND bit, ...".
> Will fix in the next version.
> > 
> > > +
> > >   \devicenormative{\subsection}{Device Status Field}{Basic Facilities of 
> > > a Virtio Device / Device Status Field}
> > >   The device MUST NOT consume buffers or send any used buffer
> > > @@ -82,6 +89,13 @@ \section{\field{Device Status} Field}\label{sec:Basic 
> > > Facilities of a Virtio Dev
> > >   that a reset is needed.  If DRIVER_OK is set, after it sets 
> > > DEVICE_NEEDS_RESET, the device
> > >   

Re: [virtio-dev] [RFC PATCH 4/5] virtqueue: constraints for virtqueue state

2023-08-14 Thread Stefan Hajnoczi
On Tue, Aug 15, 2023 at 03:29:03AM +0800, Zhu Lingshan wrote:
> This commit specifies the constraints of the virtqueue state,
> and the actions that should be taken by the device when SUSPEND
> and DRIVER_OK are set.
> 
> Signed-off-by: Jason Wang 
> Signed-off-by: Zhu Lingshan 
> ---
>  content.tex | 31 +++
>  1 file changed, 31 insertions(+)
> 
> diff --git a/content.tex b/content.tex
> index 43bd5de..f6ac581 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -587,6 +587,37 @@ \subsection{\field{Used State} Field}
>  
>  See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap Counters}.
>  
> +\drivernormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio 
> Device / Virtqueue State}
> +
> +If VIRTIO_F_QUEUE_STATE has been negotiated, the driver MUST set SUSPEND in 
> \field{device status}
> +first before getting or setting Virtqueue State of any virtqueues.
> +
> +If VIRTIO_F_QUEUE_STATE has been negotiaged but VIRTIO_RING_F_PACKED not 
> been negotiated,

"negotiated"

Please run a spell checker.

> +the driver MUST NOT access \field{Used State} of any virtqueues, it should 
> use the
> +used index in the used ring.
> +
> +\devicenormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio 
> Device / Virtqueue State}
> +
> +If VIRTIO_F_QUEUE_STATE has been negotiated but SUSPEND is not set in 
> \field{device status},
> +the device MUST ignore any accesses against Virtqueue State of any 
> virtqueues.

"accesses against" is not commonly used in English. "accesses to" is more 
natural.

> +
> +When VIRTIO_F_QUEUE_STATE has been negotiated but VIRTIO_RING_F_PACKED is 
> not,
> +the device MUST ignore any accesses against \field{Used State}.

Same here.

> +
> +If VIRTIO_F_QUEUE_STATE has been negotiaged, the device MUST reset
> +the Virtqueue State of every virtqueue upon a reset.

I'm not sure if this statement is necessary since Virtqueue State
becomes inaccessible when the device is reset. By definition, Virtqueue
State is the device-internal state of the virtqueue, so it follows
logically that next time the driver can access the Virtqueue State after
reset, it contains the current state and not previous state from before
reset.

> +
> +If VIRTIO_F_QUEUE_STATE and VIRTIO_RING_F_PACKED have been negotiaged, when 
> SUSPEND is set,
> +the device MUST record the Virtqueue State of every enabled virtqueue
> +in \field{Available State} and \field{Used State} respectively,
> +and correspondingly restore the Virtqueue State of every enabled virtqueue
> +from \field{Avaiable State} and \field{Used State} when DRIVER_OK is set.

"Available State"

> +
> +If VIRTIO_F_QUEUE_STATE has been negotiated but VIRTIO_RING_F_PACKED has 
> been not, when SUSPEND is set,
> +the device MUST record the available state of every enabled virtqueue in 
> \field{Available State},
> +and restore the available state of every enabled virtqueue from 
> \field{Avaiable State}

"Available State"

> +when DRIVER_OK is set.
> +
>  \input{admin.tex}
>  
>  \chapter{General Initialization And Device Operation}\label{sec:General 
> Initialization And Device Operation}
> -- 
> 2.35.3
> 
> 
> -
> To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
> 


signature.asc
Description: PGP signature


[virtio-dev] Re: [virtio-comment] [RFC PATCH 3/5] virtio: The actions by the device upon SUSPEND

2023-08-14 Thread Stefan Hajnoczi
On Tue, Aug 15, 2023 at 03:29:02AM +0800, Zhu Lingshan wrote:
> This commit specifies the actions to be taken by the device upon
> SUSPEND.
> 
> Signed-off-by: Jason Wang 
> Signed-off-by: Eugenio PÃrez 
> Signed-off-by: Zhu Lingshan 
> ---
>  content.tex | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/content.tex b/content.tex
> index 074f43e..43bd5de 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -96,6 +96,15 @@ \section{\field{Device Status} Field}\label{sec:Basic 
> Facilities of a Virtio Dev
>  If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set, the device MUST clear 
> SUSPEND
>  and resumes operation upon DRIVER_OK.
>  
> +If VIRTIO_F_SUSPEND is negotiated, when SUSPEND is set, the device MUST 
> perform the following operations:

"when SUSPEND is set" is ambiguous. It could mean when the driver writes
the Device Status Field, when the device transitions to SUSPEND, or
something else. Maybe "before the device reports the SUSPEND bit set in
the Device Status Field"?

> +\begin{itemize}
> +\item Stop comsuming any descriptors

"consuming"

It may be more consistent to talk about virtqueue buffers (i.e.
requests) rather than descriptors here.

> +\item Mark all finished descriptors as used and send used buffer 
> notification to the driver

The device has to complete everything that is in flight, just completing
"finished" stuff is not enough. Does "finished descriptors" really mean
"in-flight virtqueue buffers"?

"send a used buffer notification" or "send used buffer notifications"

> +\item Record Virtqueue State of each enabled virtqueue, see section 
> \ref{sec:Virtqueues / Virtqueue State}

Does "Record" mean that Virtqueue State fields can only be accessed by
the driver while the device is paused (they may be outdated or invalid while
the device is unpaused)?

> +\item Pause its operation and preserve all configurations in its Device 
> Configuration Space, see \ref{sec:Basic Facilities of a Virtio Device / 
> Device Configuration Space}

What does "preserve" mean? Does it mean the Device Configuration Space
is not allowed to change during SUSPEND?

> +\item Present SUSPEND in \field{device status}
> +\end{itemize}
> +
>  \section{Feature Bits}\label{sec:Basic Facilities of a Virtio Device / 
> Feature Bits}
>  
>  Each virtio device offers all the features it understands.  During
> -- 
> 2.35.3
> 
> 
> 




[virtio-dev] Re: [virtio-comment] [RFC PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-08-14 Thread Stefan Hajnoczi
On Tue, Aug 15, 2023 at 03:28:59AM +0800, Zhu Lingshan wrote:
> This series introduces
> 1) a new SUSPEND bit in the device status,
> which is used to suspend the device, so that the device states
> and virtqueue states are stabilized.
> 
> 2)virtqueue state and accessors, to get and set last_avail_idx
> and last_used_idx of virtqueues.
> 
> The main use case of these new facilities is Live Migration.
> 
> Future work: dirty page tracking and in-flight descriptors.
> 
> This is an RFC; this series carries on Jason and Eugenio's previous work.
> 
> Any comments are welcome.

In order to support stateful devices like virtiofs, virtio-gpu, or
virtio-crypto it would be necessary to add device state save/load
functionality too.

Even virtio-net needs device state save/load functionality if the VMM is
not intercepting the stateful controlq.

In-flight descriptors can be included in the device state.

In fact, the only reason why I think VIRTIO_F_QUEUE_STATE makes sense is
for easy integration into existing vhost software stacks. Otherwise we
should just define a device state save/load interface along with
standard device state serialization for devices where migration
interoperability is possible. VIRTIO_F_QUEUE_STATE is a small subset of
device state save/load.

I think the convenience of integrating into existing vhost software
stacks is worth the duplication, but I wanted to mention that this
interface will not be enough to live migrate all device types and we'll
need something more in the future.

Stefan

> 
> Zhu Lingshan (5):
>   virtio: introduce SUSPEND bit in device status
>   virtio: introduce vq state as basic facility
>   virtio: The actions by the device upon SUSPEND
>   virtqueue: constraints for virtqueue state
>   virtio-pci: implement VIRTIO_F_QUEUE_STATE
> 
>  content.tex   | 120 ++
>  transport-pci.tex |  15 ++
>  2 files changed, 135 insertions(+)
> 
> -- 
> 2.35.3
> 
> 
> 




Re: [virtio-dev] [RFC PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE

2023-08-14 Thread Stefan Hajnoczi
On Tue, Aug 15, 2023 at 03:29:04AM +0800, Zhu Lingshan wrote:
> This patch adds two new le16 fields to the common configuration structure
> to support VIRTIO_F_QUEUE_STATE in the PCI transport layer.
> 
> Signed-off-by: Jason Wang 
> Signed-off-by: Zhu Lingshan 
> ---
>  transport-pci.tex | 15 +++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/transport-pci.tex b/transport-pci.tex
> index a5c6719..d9bccb0 100644
> --- a/transport-pci.tex
> +++ b/transport-pci.tex
> @@ -321,6 +321,8 @@ \subsubsection{Common configuration structure 
> layout}\label{sec:Virtio Transport
>  le64 queue_device;  /* read-write */
>  le16 queue_notif_config_data;   /* read-only for driver */
>  le16 queue_reset;   /* read-write */
> +le16 queue_avail_state; /* read-write */
> +le16 queue_used_state;  /* read-write */
>  
>  /* About the administration virtqueue. */
>  le16 admin_queue_index; /* read-only for driver */

This is an incompatible change to the register layout. Please add new
registers at the end of struct virtio_pci_common_cfg so field offsets
don't change for existing drivers and devices.
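Concretely, keeping existing field offsets stable means appending after the last register that has shipped, never inserting mid-struct. An abridged sketch of the constraint; the field set is heavily trimmed and this is not the real struct virtio_pci_common_cfg layout, only an illustration of the ordering rule:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Abridged sketch, not the actual struct virtio_pci_common_cfg:
 * existing registers keep their offsets because the new fields
 * are appended at the end. */
struct common_cfg_sketch {
    uint16_t queue_notif_config_data;   /* existing */
    uint16_t queue_reset;               /* existing */
    uint16_t admin_queue_index;         /* existing */
    uint16_t admin_queue_num;           /* existing */
    /* new registers go here, after everything that shipped: */
    uint16_t queue_avail_state;
    uint16_t queue_used_state;
};
```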

> @@ -415,6 +417,16 @@ \subsubsection{Common configuration structure 
> layout}\label{sec:Virtio Transport
>  This field exists only if VIRTIO_F_RING_RESET has been
>  negotiated. (see \ref{sec:Basic Facilities of a Virtio Device / 
> Virtqueues / Virtqueue Reset}).
>  
> +\item[\field{queue_avail_state}]
> +This field is vaild only if VIRTIO_F_QUEUE_STATE has been

"valid"

> +negotiated. The driver sets and gets the available state of
> +the virtqueue here (see \ref{sec:Virtqueues / Virtqueue State}).
> +
> +\item[\field{queue_used_state}]
> +This field is vaild only if VIRTIO_F_QUEUE_STATE has been

"valid"

> +negotiated. The driver sets and gets the used state of the
> +virtqueue here (see \ref{sec:Virtqueues / Virtqueue State}).
> +
>  \item[\field{admin_queue_index}]
>  The device uses this to report the index of the first administration 
> virtqueue.
>  This field is valid only if VIRTIO_F_ADMIN_VQ has been negotiated.
> @@ -488,6 +500,9 @@ \subsubsection{Common configuration structure 
> layout}\label{sec:Virtio Transport
>  present either a value of 0 or a power of 2 in
>  \field{queue_size}.
>  
> +If VIRTIO_F_QUEUE_STATE has not been negotiated, the device MUST ignore
> +any accesses against \field{queue_avail_state} and \field{queue_used_state}.

"accesses to"

> +
>  If VIRTIO_F_ADMIN_VQ has been negotiated, the value
>  \field{admin_queue_index} MUST be equal to, or bigger than
>  \field{num_queues}; also, \field{admin_queue_num} MUST be
> -- 
> 2.35.3
> 
> 
> 




[virtio-dev] Re: [virtio-comment] [RFC PATCH 1/5] virtio: introduce SUSPEND bit in device status

2023-08-14 Thread Stefan Hajnoczi
On Tue, Aug 15, 2023 at 03:29:00AM +0800, Zhu Lingshan wrote:
> This patch introduces a new status bit in the device status: SUSPEND.
> 
> This SUSPEND bit can be used by the driver to suspend a device,
> in order to stabilize the device states and virtqueue states.
> 
> Its main use case is live migration.
> 
> Signed-off-by: Jason Wang 
> Signed-off-by: Eugenio PÃrez 

There is a character encoding issue in Eugenio's surname.

> Signed-off-by: Zhu Lingshan 
> ---
>  content.tex | 18 ++
>  1 file changed, 18 insertions(+)

This patch hints at the asynchronous nature of the SUSPEND bit (the
driver must re-read the Device Status Field) but doesn't explain the
rationale or any limits.

For example, is there a timeout or should the driver re-read the Device
Status Field forever?

Does the driver need to re-read the Device Status Field after clearing
the SUSPEND bit?

> 
> diff --git a/content.tex b/content.tex
> index 0a62dce..1bb4401 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -47,6 +47,9 @@ \section{\field{Device Status} Field}\label{sec:Basic 
> Facilities of a Virtio Dev
>  \item[DRIVER_OK (4)] Indicates that the driver is set up and ready to
>drive the device.
>  
> +\item[SUSPEND (16)] When VIRTIO_F_SUSPEND is negotiated, indicates that the
> +  device has been suspended by the driver.
> +
>  \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
>an error from which it can't recover.
>  \end{description}
> @@ -73,6 +76,10 @@ \section{\field{Device Status} Field}\label{sec:Basic 
> Facilities of a Virtio Dev
>  recover by issuing a reset.
>  \end{note}
>  
> +The driver MUST NOT set SUSPEND if FEATURES_OK is not set.
> +
> +When set SUSPEND, the driver MUST re-read \field{device status} to ensure 
> the SUSPEND bit is set.

"When setting SUSPEND, ..." would be grammatically correct. Another
option is "After setting the SUSPEND bit, ...".

> +
>  \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a 
> Virtio Device / Device Status Field}
>  
>  The device MUST NOT consume buffers or send any used buffer
> @@ -82,6 +89,13 @@ \section{\field{Device Status} Field}\label{sec:Basic 
> Facilities of a Virtio Dev
>  that a reset is needed.  If DRIVER_OK is set, after it sets 
> DEVICE_NEEDS_RESET, the device
>  MUST send a device configuration change notification to the driver.
>  
> +The device MUST ignore SUSPEND if FEATURES_OK is not set.
> +
> +The device MUST ignore SUSPEND if VIRTIO_F_SUSPEND is not negotiated.
> +
> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set, the device MUST clear 
> SUSPEND
> +and resumes operation upon DRIVER_OK.

I can't parse this sentence. If the driver writes SUSPEND | DRIVER_OK |
... to the Device Status Field, then the device accepts DRIVER_OK and
clears SUSPEND?

Why?

> +
>  \section{Feature Bits}\label{sec:Basic Facilities of a Virtio Device / 
> Feature Bits}
>  
>  Each virtio device offers all the features it understands.  During
> @@ -872,6 +886,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved 
> Feature Bits}
>   \ref{devicenormative:Basic Facilities of a Virtio Device / Feature 
> Bits} for
>   handling features reserved for future use.
>  
> +  \item[VIRTIO_F_SUSPEND(41)] This feature indicates that the driver can
> +   SUSPEND the device.
> +   See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}.
> +
>  \end{description}
>  
>  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> -- 
> 2.35.3
> 
> 
> 




Re: [virtio-dev] [RFC PATCH 2/5] virtio: introduce vq state as basic facility

2023-08-14 Thread Stefan Hajnoczi
On Tue, Aug 15, 2023 at 03:29:01AM +0800, Zhu Lingshan wrote:
> This patch adds new device facility to save and restore virtqueue
> state. The virtqueue state is split into two parts:
> 
> - The available state: The state that is used for reading the next
>   available buffer.
> - The used state: The state that is used for making a buffer used.
> 
> This will simplify the transport-specific method implementation (e.g. two
> le16 fields could be used instead of a single le32). For split virtqueue, we
> only need the available state since the used state is implemented in
> the virtqueue itself (the used index). For packed virtqueue, we need
> both the available state and the used state.
> 
> Those states are required to implement live migration support for
> virtio device.
>
> Signed-off-by: Jason Wang 
> Signed-off-by: Zhu Lingshan 
> ---
>  content.tex | 62 +
>  1 file changed, 62 insertions(+)
> 
> diff --git a/content.tex b/content.tex
> index 1bb4401..074f43e 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -516,6 +516,68 @@ \section{Exporting Objects}\label{sec:Basic Facilities 
> of a Virtio Device / Expo
>  types. It is RECOMMENDED that devices generate version 4
>  UUIDs as specified by \hyperref[intro:rfc4122]{[RFC4122]}.
>  
> +\section{Virtqueue State}\label{sec:Virtqueues / Virtqueue State}
> +
> +When VIRTIO_F_QUEUE_STATE is negotiated, the driver can set and
> +get the device internal virtqueue state through the following
> +fields. The implementation of the interfaces is transport specific.
> +
> +\subsection{\field{Available State} Field}
> +
> +The available state field is two bytes virtqueue state that is used by

"two bytes of virtqueue state"

> +the device to read the next available buffer.
> +
> +When VIRTIO_RING_F_PACKED is not negotiated, it contains:
> +
> +\begin{lstlisting}
> +le16 last_avail_idx;
> +\end{lstlisting}
> +
> +The \field{last_avail_idx} field is the free-running available ring
> +index where the device will read the next available head of a
> +descriptor chain.
> +
> +See also \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / The 
> Virtqueue Available Ring}.
> +
> +When VIRTIO_RING_F_PACKED is negotiated, it contains:
> +
> +\begin{lstlisting}
> +le16 {
> +  last_avail_idx : 15;
> +  last_avail_wrap_counter : 1;
> +};
> +\end{lstlisting}
> +
> +The \field{last_avail_idx} field is the free-running location
> +where the device reads the next descriptor from the virtqueue descriptor ring.
> +
> +The \field{last_avail_wrap_counter} field is the last driver ring wrap
> +counter that was observed by the device.
> +
> +See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap Counters}.
> +
> +\subsection{\field{Used State} Field}
> +
> +The used state field is two bytes of virtqueue state that is used by
> +the device when marking a buffer used.

Please specify the non-packed case:

"When VIRTIO_RING_F_PACKED is not negotiated, the 16-bit value is always
0."
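For illustration, the packed-ring state this patch encodes is a 15-bit index plus a 1-bit wrap counter in one le16 word. Explicit pack/unpack helpers (preferable to C bitfields, whose bit ordering is implementation-defined) might look like the following sketch; the function names are made up:

```c
#include <assert.h>
#include <stdint.h>

/* le16 state word: bits 0..14 hold the index, bit 15 the wrap counter. */
static uint16_t vq_state_pack(uint16_t idx, uint16_t wrap)
{
    return (uint16_t)((idx & 0x7fffu) | ((wrap & 1u) << 15));
}

static uint16_t vq_state_idx(uint16_t state)
{
    return state & 0x7fffu;
}

static uint16_t vq_state_wrap(uint16_t state)
{
    return state >> 15;
}
```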

> +
> +When VIRTIO_RING_F_PACKED is negotiated, the used state contains:
> +
> +\begin{lstlisting}
> +le16 {
> +  used_idx : 15;
> +  used_wrap_counter : 1;
> +};
> +\end{lstlisting}
> +
> +The \field{used_idx} field is the free-running location where the device 
> write next

"device writes the next"

> +used descriptor to the descriptor ring.
> +
> +The \field{used_wrap_counter} field is the wrap counter that is used
> +by the device.
> +
> +See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap Counters}.
> +
>  \input{admin.tex}
>  
>  \chapter{General Initialization And Device Operation}\label{sec:General 
> Initialization And Device Operation}
> -- 
> 2.35.3
> 
> 
> -
> To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
> 


signature.asc
Description: PGP signature


[virtio-dev] Re: [virtio-comment] [RFC PATCH 0/5] virtio: introduce SUSPEND bit and vq state

2023-08-14 Thread Stefan Hajnoczi
On Tue, Aug 15, 2023 at 03:28:59AM +0800, Zhu Lingshan wrote:
> This series introduces
> 1) a new SUSPEND bit in the device status,
> which is used to suspend the device, so that the device states
> and virtqueue states are stabilized.

Hanna, you might be interested in this because it has overlap with your
vhost-user device state migration work.

Stefan

> 
> 2) virtqueue state and accessors, to get and set last_avail_idx
> and last_used_idx of virtqueues.
> 
> The main usecase of these new facilities is Live Migration.
> 
> Future work: dirty page tracking and in-flight descriptors.
> 
> This is RFC, this series carries on Jason and Eugenio's previous work.
> 
> Any comments are welcome.
> 
> Zhu Lingshan (5):
>   virtio: introduce SUSPEND bit in device status
>   virtio: introduce vq state as basic facility
>   virtio: The actions by the device upon SUSPEND
>   virtqueue: constraints for virtqueue state
>   virtio-pci: implement VIRTIO_F_QUEUE_STATE
> 
>  content.tex   | 120 ++
>  transport-pci.tex |  15 ++
>  2 files changed, 135 insertions(+)
> 
> -- 
> 2.35.3
> 
> 
> This publicly archived list offers a means to provide input to the
> OASIS Virtual I/O Device (VIRTIO) TC.
> 
> In order to verify user consent to the Feedback License terms and
> to minimize spam in the list archive, subscription is required
> before posting.
> 
> Subscribe: virtio-comment-subscr...@lists.oasis-open.org
> Unsubscribe: virtio-comment-unsubscr...@lists.oasis-open.org
> List help: virtio-comment-h...@lists.oasis-open.org
> List archive: https://lists.oasis-open.org/archives/virtio-comment/
> Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
> List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
> Committee: https://www.oasis-open.org/committees/virtio/
> Join OASIS: https://www.oasis-open.org/join/
> 




[virtio-dev] Re: [RFC PATCH] docs/interop: define STANDALONE protocol feature for vhost-user

2023-07-26 Thread Stefan Hajnoczi
On Wed, Jul 26, 2023 at 12:02:33PM -0400, Michael S. Tsirkin wrote:
> On Mon, Jul 24, 2023 at 02:08:39PM -0400, Stefan Hajnoczi wrote:
> > On Thu, Jul 20, 2023 at 06:22:08PM -0400, Michael S. Tsirkin wrote:
> > > On Thu, Jul 20, 2023 at 05:31:03PM -0400, Stefan Hajnoczi wrote:
> > > > On Thu, 20 Jul 2023 at 17:15, Michael S. Tsirkin  
> > > > wrote:
> > > > >
> > > > > On Thu, Jul 20, 2023 at 03:58:37PM -0400, Stefan Hajnoczi wrote:
> > > > > > On Thu, Jul 06, 2023 at 12:48:20PM -0400, Michael S. Tsirkin wrote:
> > > > > > > On Tue, Jul 04, 2023 at 01:36:00PM +0100, Alex Bennée wrote:
> > > > > > > > Currently QEMU has to know some details about the back-end to 
> > > > > > > > be able
> > > > > > > > to setup the guest. While various parts of the setup can be 
> > > > > > > > delegated
> > > > > > > > to the backend (for example config handling) this is a very 
> > > > > > > > piecemeal
> > > > > > > > approach.
> > > > > > >
> > > > > > > > This patch suggests a new feature flag 
> > > > > > > > (VHOST_USER_PROTOCOL_F_STANDALONE)
> > > > > > > > which the back-end can advertise which allows a probe message 
> > > > > > > > to be
> > > > > > > > sent to get all the details QEMU needs to know in one message.
> > > > > > >
> > > > > > > The reason we do piecemeal is that these existing pieces can be 
> > > > > > > reused
> > > > > > > as others evolve or fall by wayside.
> > > > > > >
> > > > > > > For example, I can think of instances where you want to connect
> > > > > > > specifically to e.g. networking backend, and specify it
> > > > > > > on command line. Reasons could be many, e.g. for debugging,
> > > > > > > or to prevent connecting to wrong device on wrong channel
> > > > > > > (kind of like type safety).
> > > > > > >
> > > > > > > What is the reason to have 1 message? startup latency?
> > > > > > > How about we allow pipelining several messages then?
> > > > > > > Will be easier.
> > > > > >
> > > > > > This flag effectively says that the back-end is a full VIRTIO device
> > > > > > with a Device Status Register, Configuration Space, Virtqueues, the
> > > > > > device type, etc. This is different from previous vhost-user devices
> > > > > > which sometimes just offloaded certain virtqueues without providing 
> > > > > > the
> > > > > > full VIRTIO device (parts were emulated in the VMM).
> > > > > >
> > > > > > So for example, a vhost-user-net device does not support the 
> > > > > > controlq.
> > > > > > Alex's "standalone" device is a mode where the vhost-user protocol 
> > > > > > is
> > > > > > used but the back-end must implement a full virtio-net device.
> > > > > > Standalone devices are like vDPA device in this respect.
> > > > > >
> > > > > > I think it is important to have a protocol feature bit that 
> > > > > > advertises
> > > > > > that this is a standalone device, since the semantics are different 
> > > > > > for
> > > > > > traditional vhost-user-net devices.
> > > > >
> > > > > Not sure what that would gain as compared to a feature bit per
> > > > > message as we did previously.
> > > > 
> > > > Having a single feature bit makes it easier to distinguish between a
> > > > traditional vhost-user device and a standalone device.
> > > > 
> > > > For example, the presence of VHOST_USER_F_GET_DEVICE_ID doesn't tell
> > > > you whether this device is a standalone device that is appropriate for
> > > > a new generic QEMU --device vhost-user-device feature that Alex is
> > > > working on. It could be a traditional vhost-user device that is not
> > > > standalone but implements the VHOST_USER_GET_DEVICE_ID message.
> > > > 
> > > > How will we detect standalone devices? It will be messy if there is no
> > > > single feature bit that advertises that this back-end is a standalone
> > > > device.
> > > > 
> > > > Stefan
> > > 
> > > Looks like standalone implies some 5-6 messages to be supported.
> > > So just test the 6 bits are all ones.
> > 
> > It's not clear to me that the individual bits together mean this is
> > really a standalone device, but let's go with individual commands and
> > see if a front-end can distinguish standalone devices or not. If not,
> > then we can still add "standalone" feature bit before merging the code.
> > 
> > Stefan
> 
> 
> I think it just shows that what a "standalone" device is just isn't
> that well defined ;).

No, I think it's the opposite way around. The existing vhost model is
not well defined. From time to time someone has to extend it to add
things like Configuration Space support or VHOST_USER_GET_DEVICE_ID.

Standalone devices are well defined: they are what the VIRTIO
specification describes.

My concern is that new messages added for standalone devices might also
be useful for existing non-standalone devices. Or do you want to declare
these new messages off-limits to non-standalone devices?

Stefan




[virtio-dev] Re: [RFC PATCH] docs/interop: define STANDALONE protocol feature for vhost-user

2023-07-24 Thread Stefan Hajnoczi
On Thu, Jul 20, 2023 at 06:22:08PM -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 20, 2023 at 05:31:03PM -0400, Stefan Hajnoczi wrote:
> > On Thu, 20 Jul 2023 at 17:15, Michael S. Tsirkin  wrote:
> > >
> > > On Thu, Jul 20, 2023 at 03:58:37PM -0400, Stefan Hajnoczi wrote:
> > > > On Thu, Jul 06, 2023 at 12:48:20PM -0400, Michael S. Tsirkin wrote:
> > > > > On Tue, Jul 04, 2023 at 01:36:00PM +0100, Alex Bennée wrote:
> > > > > > Currently QEMU has to know some details about the back-end to be 
> > > > > > able
> > > > > > to setup the guest. While various parts of the setup can be 
> > > > > > delegated
> > > > > > to the backend (for example config handling) this is a very 
> > > > > > piecemeal
> > > > > > approach.
> > > > >
> > > > > > This patch suggests a new feature flag 
> > > > > > (VHOST_USER_PROTOCOL_F_STANDALONE)
> > > > > > which the back-end can advertise which allows a probe message to be
> > > > > > sent to get all the details QEMU needs to know in one message.
> > > > >
> > > > > The reason we do piecemeal is that these existing pieces can be reused
> > > > > as others evolve or fall by wayside.
> > > > >
> > > > > For example, I can think of instances where you want to connect
> > > > > specifically to e.g. networking backend, and specify it
> > > > > on command line. Reasons could be many, e.g. for debugging,
> > > > > or to prevent connecting to wrong device on wrong channel
> > > > > (kind of like type safety).
> > > > >
> > > > > What is the reason to have 1 message? startup latency?
> > > > > How about we allow pipelining several messages then?
> > > > > Will be easier.
> > > >
> > > > This flag effectively says that the back-end is a full VIRTIO device
> > > > with a Device Status Register, Configuration Space, Virtqueues, the
> > > > device type, etc. This is different from previous vhost-user devices
> > > > which sometimes just offloaded certain virtqueues without providing the
> > > > full VIRTIO device (parts were emulated in the VMM).
> > > >
> > > > So for example, a vhost-user-net device does not support the controlq.
> > > > Alex's "standalone" device is a mode where the vhost-user protocol is
> > > > used but the back-end must implement a full virtio-net device.
> > > > Standalone devices are like vDPA device in this respect.
> > > >
> > > > I think it is important to have a protocol feature bit that advertises
> > > > that this is a standalone device, since the semantics are different for
> > > > traditional vhost-user-net devices.
> > >
> > > Not sure what that would gain as compared to a feature bit per
> > > message as we did previously.
> > 
> > Having a single feature bit makes it easier to distinguish between a
> > traditional vhost-user device and a standalone device.
> > 
> > For example, the presence of VHOST_USER_F_GET_DEVICE_ID doesn't tell
> > you whether this device is a standalone device that is appropriate for
> > a new generic QEMU --device vhost-user-device feature that Alex is
> > working on. It could be a traditional vhost-user device that is not
> > standalone but implements the VHOST_USER_GET_DEVICE_ID message.
> > 
> > How will we detect standalone devices? It will be messy if there is no
> > single feature bit that advertises that this back-end is a standalone
> > device.
> > 
> > Stefan
> 
> Looks like standalone implies some 5-6 messages to be supported.
> So just test the 6 bits are all ones.

It's not clear to me that the individual bits together mean this is
really a standalone device, but let's go with individual commands and
see if a front-end can distinguish standalone devices or not. If not,
then we can still add "standalone" feature bit before merging the code.

Stefan




[virtio-dev] Re: [RFC PATCH] docs/interop: define STANDALONE protocol feature for vhost-user

2023-07-20 Thread Stefan Hajnoczi
On Thu, Jul 06, 2023 at 12:48:20PM -0400, Michael S. Tsirkin wrote:
> On Tue, Jul 04, 2023 at 01:36:00PM +0100, Alex Bennée wrote:
> > Currently QEMU has to know some details about the back-end to be able
> > to setup the guest. While various parts of the setup can be delegated
> > to the backend (for example config handling) this is a very piecemeal
> > approach.
> 
> > This patch suggests a new feature flag (VHOST_USER_PROTOCOL_F_STANDALONE)
> > which the back-end can advertise which allows a probe message to be
> > sent to get all the details QEMU needs to know in one message.
> 
> The reason we do piecemeal is that these existing pieces can be reused
> as others evolve or fall by wayside.
> 
> For example, I can think of instances where you want to connect
> specifically to e.g. networking backend, and specify it
> on command line. Reasons could be many, e.g. for debugging,
> or to prevent connecting to wrong device on wrong channel
> (kind of like type safety).
> 
> What is the reason to have 1 message? startup latency?
> How about we allow pipelining several messages then?
> Will be easier.

This flag effectively says that the back-end is a full VIRTIO device
with a Device Status Register, Configuration Space, Virtqueues, the
device type, etc. This is different from previous vhost-user devices
which sometimes just offloaded certain virtqueues without providing the
full VIRTIO device (parts were emulated in the VMM).

So for example, a vhost-user-net device does not support the controlq.
Alex's "standalone" device is a mode where the vhost-user protocol is
used but the back-end must implement a full virtio-net device.
Standalone devices are like vDPA device in this respect.

I think it is important to have a protocol feature bit that advertises
that this is a standalone device, since the semantics are different for
traditional vhost-user-net devices.

However, I think having a single message is inflexible and duplicates
existing vhost-user protocol messages like VHOST_USER_GET_QUEUE_NUM. I
would prefer VHOST_USER_GET_DEVICE_ID and other messages.

Stefan




Re: [virtio-dev] [RFC PATCH] docs/interop: define STANDALONE protocol feature for vhost-user

2023-07-20 Thread Stefan Hajnoczi
On Fri, Jul 07, 2023 at 12:27:39PM +0200, Stefano Garzarella wrote:
> On Tue, Jul 04, 2023 at 04:02:42PM +0100, Alex Bennée wrote:
> > 
> > Stefano Garzarella  writes:
> > 
> > > On Tue, Jul 04, 2023 at 01:36:00PM +0100, Alex Bennée wrote:
> > > > diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> > > > index 5a070adbc1..85b1b1583a 100644
> > > > --- a/docs/interop/vhost-user.rst
> > > > +++ b/docs/interop/vhost-user.rst
> > > > @@ -275,6 +275,21 @@ Inflight description
> > > > 
> > > > :queue size: a 16-bit size of virtqueues
> > > > 
> > > > +Backend specifications
> > > > +^^
> > > > +
> > > > ++---+-+++
> > > > +| device id | config size |   min_vqs  |   max_vqs  |
> > > > ++---+-+++
> > > > +
> > > > +:device id: a 32-bit value holding the VirtIO device ID
> > > > +
> > > > +:config size: a 32-bit value holding the config size (see 
> > > > ``VHOST_USER_GET_CONFIG``)
> > > > +
> > > > +:min_vqs: a 32-bit value holding the minimum number of vqs supported
> > > 
> > > Why do we need the minimum?
> > 
> > We need to know the minimum number because some devices have fixed VQs
> > that must be present.
> 
> But does QEMU need to know this?
> 
> Or is it okay that the driver will then fail in the guest if there
> are not the right number of queues?

I don't understand why min_vqs is needed either. It's not the
front-end's job to ensure that the device will be used properly. A
spec-compliant driver will work with a spec-compliant device, so it's
not clear why the front-end needs this information.

Stefan




[virtio-dev] Re: [RFC PATCH] docs/interop: define STANDALONE protocol feature for vhost-user

2023-07-20 Thread Stefan Hajnoczi
On Tue, Jul 04, 2023 at 01:36:00PM +0100, Alex Bennée wrote:
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index c4e0cbd702..28b021d5d3 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -202,6 +202,13 @@ typedef struct VhostUserInflight {
>  uint16_t queue_size;
>  } VhostUserInflight;
>  
> +typedef struct VhostUserBackendSpecs {
> +uint32_t device_id;
> +uint32_t config_size;
> +uint32_t min_vqs;

You already answered my question about min_vqs in another sub-thread.
I'll continue there. Please ignore my question.

Stefan




[virtio-dev] Re: [RFC PATCH] docs/interop: define STANDALONE protocol feature for vhost-user

2023-07-20 Thread Stefan Hajnoczi
On Tue, Jul 04, 2023 at 01:36:00PM +0100, Alex Bennée wrote:
> Currently QEMU has to know some details about the back-end to be able
> to setup the guest. While various parts of the setup can be delegated
> to the backend (for example config handling) this is a very piecemeal
> approach.
> 
> This patch suggests a new feature flag (VHOST_USER_PROTOCOL_F_STANDALONE)
> which the back-end can advertise which allows a probe message to be
> sent to get all the details QEMU needs to know in one message.
> 
> Signed-off-by: Alex Bennée 
> 
> ---
> Initial RFC for discussion. I intend to prototype this work with QEMU
> and one of the rust-vmm vhost-user daemons.
> ---
>  docs/interop/vhost-user.rst | 37 +
>  hw/virtio/vhost-user.c  |  8 
>  2 files changed, 45 insertions(+)
> 
> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> index 5a070adbc1..85b1b1583a 100644
> --- a/docs/interop/vhost-user.rst
> +++ b/docs/interop/vhost-user.rst
> @@ -275,6 +275,21 @@ Inflight description
>  
>  :queue size: a 16-bit size of virtqueues
>  
> +Backend specifications
> +^^
> +
> ++---+-+++
> +| device id | config size |   min_vqs  |   max_vqs  |
> ++---+-+++
> +
> +:device id: a 32-bit value holding the VirtIO device ID
> +
> +:config size: a 32-bit value holding the config size (see 
> ``VHOST_USER_GET_CONFIG``)
> +
> +:min_vqs: a 32-bit value holding the minimum number of vqs supported

What is the purpose of min_vqs? I'm not sure why the front-end needs to
know this.




Re: [virtio-dev] virtio-snd and snapshots (e.g. in QEMU) when audio is active

2023-07-20 Thread Stefan Hajnoczi
On Wed, Jul 19, 2023 at 04:21:08PM -0700, Roman Kiryanov wrote:
> Hi,
> 
> I work in Android Studio Emulator and we use virtio-snd (implemented
> ourselves) for audio output/input. According to the spec (1.2), the
> device has one TX virtqueue for all output streams and one RX
> virtqueue for all input streams. Each stream may have (and usually
> has) more than one period (I request 4 periods).
> 
> Because virtqueues are shared between streams (if there are more than
> one stream in the same direction), I cannot fetch vq messages when a
> stream needs one. I fetch vq messages (and put them into my own buffer
> to process them later) when the kernel puts them into a vq. I hope
> this is correct. I think I tried processing them immediately (at least
> for TX) but the kernel was not happy with this because I was draining
> the buffer too fast causing XRUN.
> 
> If a snapshot request comes when audio streams are active I may have
> several unprocessed messages for several streams for both TX and RX.
> In my case messages are VirtQueueElement* which I don't think can be
> saved directly.
> 
> Could you please advise what a device is expected to do in this case?

Do you mean QEMU's VirtQueueElement? There are devices in QEMU that
save/load in-flight VirtQueueElements. See qemu_put_virtqueue_element().

Often devices quiesce (e.g. by draining in-flight I/O requests) when the
VM is stopped before the device state is saved. That makes life simpler.

Stefan

> 
> Thank you.
> 
> Regards,
> Roman.
> 
> 




[virtio-dev] Re: [virtio-comment] [PROPOSAL] Virtio Over Fabrics(TCP/RDMA)

2023-04-25 Thread Stefan Hajnoczi
On Mon, Apr 24, 2023 at 11:40:02AM +0800, Jason Wang wrote:
> On Sun, Apr 23, 2023 at 7:31 PM zhenwei pi  wrote:
> > I develop an kernel initiator(unstable, WIP version, currently TCP/RDMA
> > supported):
> > https://github.com/pizhenwei/linux/tree/virtio-of-github
> 
> A quick glance at the code told me it's a mediation layer that convert
> descriptors in the vring to the fabric specific packet. This is the
> vDPA way.
>
> If we agree virtio of fabic is useful, we need invent facilities to
> allow building packet directly without bothering the virtqueue (the
> API is layout independent anyhow).

I agree. vrings makes sense for RDMA, but I think virtio_fabrics.c
should not be dependent on vrings.

Linux struct virtqueue is independent of vrings but the implementation
currently lives in virtio_ring.c because there has never been a
non-vring transport before.

It would be nice to implement virtqueue_add_sgs() specifically for
virtio_tcp.c without the use of vrings. Is a new struct
virtqueue_ops needed with .add_sgs() and related callbacks?

Luckily the API already supports this abstraction and
changes to existing device drivers should be unnecessary or minimal.

Stefan




[virtio-dev] Re: [virtio-comment] [PROPOSAL] Virtio Over Fabrics(TCP/RDMA)

2023-04-25 Thread Stefan Hajnoczi
On Mon, Apr 24, 2023 at 11:40:02AM +0800, Jason Wang wrote:
> On Sun, Apr 23, 2023 at 7:31 PM zhenwei pi  wrote:
> > "Virtio Over Fabrics" aims at "reuse virtio device specifications", and
> > provides network defined peripheral devices.
> > And this protocol also could be used in virtualization environment,
> > typically a hypervisor (or vhost-user process) handles requests from virtio
> > PCI/MMIO/CCW, remaps request and forwards to target by fabrics.
> 
> This requires meditation in the datapath, isn't it?
> 
> >
> > - Protocol
> > The detail protocol definition see:
> > https://github.com/pizhenwei/linux/blob/virtio-of-github/include/uapi/linux/virtio_of.h
> 
> I'd say a RFC patch for virtio spec is more suitable than the codes.

VIRTIO over TCP has long been anticipated but so far no one posted an
implementation. There are probably mentions of it from 10+ years ago.
I'm excited to see this!

Both the VIRTIO spec and the Linux drivers provide an abstraction that
allows fabrics (e.g. TCP) to fit in as a VIRTIO Transport. vrings are
not the only way to implement virtqueues.

Many VIRTIO devices will work fine over a message passing transport like
TCP. A few devices like the balloon device may not make sense. Shared
Memory Regions won't work.

Please define VIRTIO over Fabrics as a Transport in the VIRTIO spec so
that the integration with the VIRTIO device model is seamless. I look
forward to discussing spec patches.

Stefan




[virtio-dev] Re: [virtio-comment] [PATCH] transport-mmio: Replace virtual queue with virtqueue

2023-04-12 Thread Stefan Hajnoczi
On Tue, Apr 11, 2023 at 11:16:47PM +0300, Parav Pandit wrote:
> Basic facilities define the virtqueue construct for device <-> driver
> communication.
> 
> PCI transport and individual devices description also refers to it as
> virtqueue.
> 
> MMIO refers to it as 'virtual queue'.
> 
> Align MMIO transport description to call such object a virtqueue.
> 
> This patch is on top of [1].
> 
> [1] https://lists.oasis-open.org/archives/virtio-dev/202304/msg00253.html
> Signed-off-by: Parav Pandit 
> ---
>  transport-mmio.tex | 50 +++---
>  1 file changed, 25 insertions(+), 25 deletions(-)

Reviewed-by: Stefan Hajnoczi 




[virtio-dev] Re: [virtio] [RFC] virtio 1.3 schedule

2023-04-05 Thread Stefan Hajnoczi
On Wed, Apr 05, 2023 at 03:57:09PM +0200, Cornelia Huck wrote:
> That said, let me propose the following timeline:
> 
> - July 1st 2023: enter feature freeze; all proposed changes must at
>   least have an open github issue that refers to a proposal on-list
> - August 1st 2023: enter change freeze; everything needs to have been
>   voted upon (i.e. we can start with preparations for a release like
>   compiling the change log)
> - at the same time, fork virtio-next, which can be used for development
>   while we're working on the release
> - August/September 2023: prepare draft, initiate initial voting etc.
> - by September/October 2023: have the final 1.3 spec ready, voted upon,
>   and released
> - virtio-next can be merged into mainline, and we continue with business
>   as usual
> 
> Thoughts? We might want to move things a tad earlier, but I don't think
> we should push the schedule out much further.

Sounds good to me. If the date is earlier then it will be necessary to
scrutinize the list of features for 1.3 and think hard about which ones
fit, but freezing on July 1st sounds realistic to me.

Stefan




[virtio-dev] Re: [virtio-comment] [PATCH 00/11] Introduce transitional mmr pci device

2023-04-04 Thread Stefan Hajnoczi
On Mon, Apr 03, 2023 at 08:27:28PM +, Parav Pandit wrote:
> 
> > From: Stefan Hajnoczi 
> > Sent: Monday, April 3, 2023 3:10 PM
> 
> 
> > Maybe call it "Memory-mapped Transitional"? That name would be easier to
> > understand.
> >
> Sounds fine to me.
>  
> > > > Modern devices were added to Linux in 2014 and support SR-IOV.
> > >
> > > > Why is it
> > > > important to support Transitional (which really means Legacy
> > > > devices, otherwise Modern devices would be sufficient)?
> > > >
> > > To support guest VMs which only understand legacy devices and
> > > unfortunately they are still in much wider use by the users.
> > 
> > I wonder which guest software without Modern VIRTIO support will still be
> > supported by the time Transitional MMR software and hardware becomes
> > available. Are you aiming for particular guest software versions?
> 
> Transitional MMR hardware is available almost now.
> Transitional MMR software is also WIP to be available as soon as we ratify 
> the spec via tvq or via mmr or both.
> This will be support the guest sw version such as 2.6.32.754 kernel.

That is a RHEL 6 kernel. RHEL 6 entered Extended Life Cycle Support on
November 30, 2020 (https://access.redhat.com/articles/4665701).

I would use existing software emulation for really old guests, but if
you see an opportunity for hardware passthrough, then why not.

Stefan




[virtio-dev] Re: [PATCH v11 09/10] admin: conformance clauses

2023-04-04 Thread Stefan Hajnoczi
On Mon, Apr 03, 2023 at 11:03:56AM -0400, Michael S. Tsirkin wrote:
> Add conformance clauses for admin commands and admin virtqueues.
> 
> Signed-off-by: Michael S. Tsirkin 
> 
> ---
> typo and wording fixes reported by David
> clarify cmd list use can be repeated but not in parallel with other
> commands, and returning same value across uses - comment by Lingshan
> add conformance clauses for new error codes
> ---
>  admin.tex | 227 ++
>  1 file changed, 227 insertions(+)

Reviewed-by: Stefan Hajnoczi 




[virtio-dev] Re: [PATCH v11 08/10] admin: command list discovery

2023-04-04 Thread Stefan Hajnoczi
On Mon, Apr 03, 2023 at 11:03:49AM -0400, Michael S. Tsirkin wrote:
> Add commands to find out which commands does each group support,
> as well as enable their use by driver.
> This will be especially useful once we have multiple group types.
> 
> An alternative is per-type VQs. This is possible but will
> require more per-transport work. Discovery through the vq
> helps keep things contained.
> 
> e.g. lack of support for some command can switch to a legacy mode
> 
> note that commands are expected to be evolved by adding new
> fields to command specific data at the tail, so
> we generally do not need feature bits for compatibility.
> 
> Signed-off-by: Michael S. Tsirkin 
> 
> ---
> dropped S.O.B by Max
> explain in commit log how commands will evolve, comment by Stefan
> replace will use with is capable of use, comment by Stefan
> typo fix reported by David and Lingshan
> rename cmds to cmd_opcodes suggested by Jiri
> list group_type explicitly in command description suggested by Jiri
> clarify how can device not support some commands
> always say group administration commands, comment by Jiri
> ---
>  admin.tex | 102 +++++-
>  1 file changed, 101 insertions(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi 




[virtio-dev] Re: [PATCH v11 05/10] pci: add admin vq registers to virtio over pci

2023-04-04 Thread Stefan Hajnoczi
On Mon, Apr 03, 2023 at 11:03:29AM -0400, Michael S. Tsirkin wrote:
> @@ -1033,6 +1037,19 @@ \subsubsection{Common configuration structure 
> layout}\label{sec:Virtio Transport
>  This field exists only if VIRTIO_F_RING_RESET has been
>  negotiated. (see \ref{sec:Basic Facilities of a Virtio Device / 
> Virtqueues / Virtqueue Reset}).
>  
> +\item[\field{admin_queue_index}]
> +The device uses this to report the index of the first administration 
> virtqueue.
> +This field is valid only if VIRTIO_F_ADMIN_VQ has been negotiated.
> +\item[\field{admin_queue_num}]
> + The device uses this to report the number of the
> + supported administration virtqueues.
> + Virtqueues with index
> + between \field{admin_queue_index} and (\field{admin_queue_index} +
> + \field{admin_queue_num} - 1) inclusive serve as administration
> + virtqueues.
> + The value 0 indicates no supported administration virtqueues.
> + This field is valid only if VIRTIO_F_ADMIN_VQ has been
> + negotiated.
>  \end{description}

Maybe add a device-normative statement that [admin_queue_index,
admin_queue_index + admin_queue_num) must be located after
device-specific virtqueues?

That would remind implementers that the device-specific virtqueue layout
needs to be followed and cannot be modified by the presence of
Administration Virtqueues.

>  \devicenormative{\paragraph}{Common configuration structure layout}{Virtio 
> Transport Options / Virtio Over PCI Bus / PCI Device Layout / Common 
> configuration structure layout}
> @@ -1119,6 +1136,14 @@ \subsubsection{Common configuration structure 
> layout}\label{sec:Virtio Transport
>  were used before the queue reset.
>  (see \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Virtqueue 
> Reset}).
>  
> +If VIRTIO_F_ADMIN_VQ has been negotiated, and if the driver
> +configures any administration virtqueues, the driver MUST
> +configure the administration virtqueues using the index
> +in the range \field{admin_queue_index} to
> +\field{admin_queue_index} + \field{admin_queue_num} - 1 inclusive.
> +The driver MAY configure less administration virtqueues than

s/less/fewer/




[virtio-dev] Re: [PATCH v11 04/10] admin: introduce virtio admin virtqueues

2023-04-04 Thread Stefan Hajnoczi
On Mon, Apr 03, 2023 at 11:03:23AM -0400, Michael S. Tsirkin wrote:
> The admin virtqueues will be the first interface used to issue admin commands.
> 
> Currently the virtio specification defines control virtqueue to manipulate
> features and configuration of the device it operates on:
> virtio-net, virtio-scsi, etc all have existing control virtqueues. However,
> control virtqueue commands are device type specific, which makes it very
> difficult to extend for device agnostic commands.
> 
> Keeping the device-specific virtqueue separate from the admin virtqueue
> is simpler and has fewer potential problems. I don't think creating
> common infrastructure for device-specific control virtqueues across
> device types worthwhile or within the scope of this patch series.
> 
> To support this requirement in a more generic way, this patch introduces
> a new admin virtqueue interface.
> The admin virtqueue can be seen as the virtqueue analog to a transport.
> The admin queue thus does nothing device type-specific (net, scsi, etc)
> and instead focuses on transporting the admin commands.
> 
> We also support more than one admin virtqueue, for QoS and
> scalability requirements.
> 
> Signed-off-by: Michael S. Tsirkin 
> 
> ---
> explain ordering of commands as suggested by Stefan
> dropped Max's S.O.B
> reword commit log as suggested by David
> minor wording fixes suggested by David
> ---
>  admin.tex   | 75 +
>  content.tex |  7 +++--
>  2 files changed, 80 insertions(+), 2 deletions(-)

Reviewed-by: Stefan Hajnoczi 




[virtio-dev] Re: [PATCH v11 03/10] admin: introduce group administration commands

2023-04-04 Thread Stefan Hajnoczi
On Mon, Apr 03, 2023 at 11:03:16AM -0400, Michael S. Tsirkin wrote:
> This introduces a general structure for group administration commands,
> used to control device groups through their owner.
> 
> Following patches will introduce specific commands and an interface for
> submitting these commands to the owner.
> 
> Note that the commands are focused on controlling device groups:
> this is why group related fields are in the generic part of
> the structure.
> Without this the admin vq would become a "whatever" vq which does not do
> anything specific at all, just a general transport like thing.
> I feel going this way opens the design space to the point where
> we no longer know what belongs in e.g. config space
> what in the control q and what in the admin q.
> As it is, whatever deals with groups is in the admin q; other
> things not in the admin q.
> 
> There are specific exceptions such as query but that's an exception that
> proves the rule ;)
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
> 
> changes since v10:
>   explain the role of status vs status_qualifier, addresses
>   multiple comments
>   add more status values and appropriate qualifiers,
>   as suggested by Parav
>   clarify reserved command opcodes as suggested by Stefan
>   better formatting for ranges, comment by Jiri
>   make sure command-specific-data is a multiple of qword,
>   so things are aligned, suggested by Jiri
>   add Q_OK qualifier for success
> ---
>  admin.tex| 121 +++
>  introduction.tex |   3 ++
>  2 files changed, 124 insertions(+)

My thoughts on status/status_qualifier haven't changed, but it's a
question of taste:

Reviewed-by: Stefan Hajnoczi 




[virtio-dev] Re: [PATCH v11 02/10] admin: introduce device group and related concepts

2023-04-04 Thread Stefan Hajnoczi
On Mon, Apr 03, 2023 at 11:03:09AM -0400, Michael S. Tsirkin wrote:
> Each device group has a type. For now, define one initial group type:
> 
> SR-IOV type - PCI SR-IOV virtual functions (VFs) of a given
> PCI SR-IOV physical function (PF). This group may contain zero or more
> virtio devices according to NumVFs configured.
> 
> Each device within a group has a unique identifier. This identifier
> is the group member identifier.
> 
> Note: one can argue both ways whether the new device group handling
> functionality (this and following patches) is closer
> to a new device type or a new transport type.
> 
> However, I expect that we will add more features in the near future. To
> facilitate this as much as possible of the text is located in the new
> admin chapter.
> 
> I did my best to minimize transport-specific text.
> 
> There's a bit of duplication with 0x1 repeated twice and
> no special section for group type identifiers.
> I think it's ok to defer adding these until we have more group
> types.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
> changes in v11:
>   dropped Max's S.O.B.
>   dropped an unmatched ) as reported by Parav
>   document that member id is 1 to numvfs
>   document that vf enable must be set (will also be reflected in
>   the conformance section)
>   document that we deferred generalizing text as requested by Parav -
>   I think we can do it later
>   minor wording fixes to address comments by David
>   add concepts of owner and member driver. unused currently
>   but will come in handy later, as suggested by Stefan
> ---
>  admin.tex   | 63 +++++++++
>  content.tex |  2 ++
>  2 files changed, 65 insertions(+)
>  create mode 100644 admin.tex

Reviewed-by: Stefan Hajnoczi 




[virtio-dev] Re: [virtio-comment] [PATCH 00/11] Introduce transitional mmr pci device

2023-04-03 Thread Stefan Hajnoczi
On Mon, Apr 03, 2023 at 01:48:46PM -0400, Michael S. Tsirkin wrote:
> On Mon, Apr 03, 2023 at 10:53:29AM -0400, Parav Pandit wrote:
> > On 4/3/2023 10:45 AM, Stefan Hajnoczi wrote:
> > > On Fri, Mar 31, 2023 at 01:58:23AM +0300, Parav Pandit wrote:
> > > > 3. A hypervisor system prefers to have single stack regardless of
> > > > virtio device type (net/blk) and be future compatible with a
> > > > single vfio stack using SR-IOV or other scalable device
> > > > virtualization technology to map PCI devices to the guest VM.
> > > > (as transitional or otherwise)
> > > 
> > > What does this paragraph mean?
> > > 
> > It means regardless of a VF being a transitional MMR VF or a 1.x VF without any
> > MMR extensions, there is a single vfio virtio driver handling both types of
> > devices to map to the guest VM.
> 
> I don't think this can be vfio. You need a host layer translating
> things such as device ID etc.

An mdev driver?




[virtio-dev] Re: [virtio-comment] [PATCH 00/11] Introduce transitional mmr pci device

2023-04-03 Thread Stefan Hajnoczi
On Mon, Apr 03, 2023 at 10:53:29AM -0400, Parav Pandit wrote:
> 
> 
> On 4/3/2023 10:45 AM, Stefan Hajnoczi wrote:
> > On Fri, Mar 31, 2023 at 01:58:23AM +0300, Parav Pandit wrote:
> > > Overview:
> > > -
> > > The Transitional MMR device is a variant of the transitional PCI device.
> > 
> > What does "MMR" mean?
> > 
> memory mapped registers.
> Explained below in the design section and also in relevant patches 6 to 11.

Maybe call it "Memory-mapped Transitional"? That name would be easier to
understand.

> > Modern devices were added to Linux in 2014 and support SR-IOV.
> 
> > Why is it
> > important to support Transitional (which really means Legacy devices,
> > otherwise Modern devices would be sufficient)?
> > 
> To support guest VMs that only understand legacy devices; unfortunately,
> such guests are still in much wider use.

I wonder which guest software without Modern VIRTIO support will still
be supported by the time Transitional MMR software and hardware becomes
available. Are you aiming for particular guest software versions?

Stefan




[virtio-dev] Re: [virtio-comment] [PATCH 00/11] Introduce transitional mmr pci device

2023-04-03 Thread Stefan Hajnoczi
On Fri, Mar 31, 2023 at 01:58:23AM +0300, Parav Pandit wrote:
> Overview:
> -
> The Transitional MMR device is a variant of the transitional PCI device.

What does "MMR" mean?

> It has its own small Device ID range. It does not have I/O
> region BAR; instead it exposes legacy configuration and device
> specific registers at an offset in the memory region BAR.
> 
> Such transitional MMR devices will be used at the scale of
> thousands of devices using PCI SR-IOV and/or future scalable
> virtualization technology to provide backward
> compatibility (for legacy devices) and also future
> compatibility with new features.
> 
> Usecase:
> 
> 1. A hypervisor/system needs to provide transitional
>virtio devices to the guest VM at scale of thousands,
>typically, one to eight devices per VM.
> 
> 2. A hypervisor/system needs to provide such devices using a
>vendor agnostic driver in the hypervisor system.

Is the idea that the hypervisor configures the new Transitional MMR
devices and makes them appear like virtio-pci Transitional devices?

In other words, the guest doesn't know about Transitional MMR and does
not need any code changes.

> 3. A hypervisor system prefers to have single stack regardless of
>virtio device type (net/blk) and be future compatible with a
>single vfio stack using SR-IOV or other scalable device
>virtualization technology to map PCI devices to the guest VM.
>(as transitional or otherwise)

What does this paragraph mean?

> 
> Motivation/Background:
> --
> The existing transitional PCI device is missing support for
> PCI SR-IOV based devices. Currently it does not work beyond
> the PCI PF, or as a software-emulated device in reality. It currently
> has below cited system level limitations:
> 
> [a] PCIe spec citation:
> VFs do not support I/O Space and thus VF BARs shall not
> indicate I/O Space.
> 
> [b] cpu arch citation:
> Intel 64 and IA-32 Architectures Software Developer’s Manual:
> The processor’s I/O address space is separate and distinct from
> the physical-memory address space. The I/O address space consists
> of 64K individually addressable 8-bit I/O ports, numbered 0 through FFFFH.
> 
> [c] PCIe spec citation:
> If a bridge implements an I/O address range,...I/O address range
> will be aligned to a 4 KB boundary.
> 
> [d] I/O region accesses at the PCI system level are slow as they are non-posted
> operations in PCIe fabric.
> 
> The usecase requirements and limitations above can be solved by
> extending the transitional device, mapping legacy and device
> specific configuration registers in a memory PCI BAR instead
> of using a non-composable I/O region.
> 
> Please review.

Modern devices were added to Linux in 2014 and support SR-IOV. Why is it
important to support Transitional (which really means Legacy devices,
otherwise Modern devices would be sufficient)?

> 
> Patch summary:
> --
> patch 1 to 5 prepares the spec
> patch 6 to 11 defines transitional mmr device
> 
> patch-1 uses lowercase letters to name the device id
> patch-2 moves the transitional device id into the legacy section along with
> the revision id
> patch-3 splits the legacy feature bits description from the device id
> patch-4 renames and moves the virtio config registers next to the 1.x
> registers section
> patch-5 adds a missing helper verb in the terminology definitions
> patch-6 introduces transitional mmr device
> patch-7 introduces transitional mmr device pci device ids
> patch-8 introduces virtio extended pci capability
> patch-9 describes new pci capability to locate legacy mmr
> registers
> patch-10 extended usage of driver notification capability for
>  the transitional mmr device
> patch-11 adds conformance section of the transitional mmr device
> 
> This design and details further described below.
> 
> Design:
> ---
> Below picture captures the main small difference between current
> transitional PCI SR-IOV VF and transitional MMR SR-IOV VF.
> 
> +--+ ++ ++
> |virtio 1.x| |Transitional| |Transitional|
> |SRIOV VF  | |SRIOV VF| |MMR SRIOV VF|
> |  | || ||
> ++---+ | ++---+   | ++---+   |
> ||dev_id =   | | ||dev_id =   |   | ||dev_id =   |   |
> ||{0x1040-0x106C}| | ||{0x1000-0x103f}|   | ||{0x10f9-0x10ff}|   |
> |+---+ | |+---+   | |+---+   |
> |  | || ||
> |++| |++  | |+-+ |
> ||Memory BAR  || ||Memory BAR  |  | ||Memory BAR   | |
> |++| |++  | || | |
> |  | || || +--+| |
> |  | |+-+ | || |legacy virtio || |
> |  | ||IOBAR impossible | 
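The memory-BAR approach in the (truncated) diagram above can be illustrated from the driver side. This is a sketch only: the window offset and register offset are hypothetical stand-ins for what the proposed PCI capability would advertise, and `memcpy()` stands in for an MMIO read on real hardware:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical offset of the legacy register window inside the memory BAR;
 * a real device would advertise this via the new PCI capability. */
#define LEGACY_MMR_OFFSET 0x1000u

/* Read a 16-bit legacy register through the memory BAR instead of an I/O
 * port, which is impossible for SR-IOV VFs. */
static uint16_t legacy_mmr_read16(const uint8_t *bar, uint32_t reg)
{
    uint16_t v;
    memcpy(&v, bar + LEGACY_MMR_OFFSET + reg, sizeof(v));
    return v;
}
```

Because the window lives in a memory BAR, accesses can be posted and the VF needs no I/O space at all, addressing limitations [a] through [d] above.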

Re: [virtio-dev] [PATCH 1/2] virtio-blk: migrate to the latest patchset version

2023-03-28 Thread Stefan Hajnoczi
On Mon, Mar 27, 2023 at 10:29:27PM -0400, Dmitry Fomichev wrote:
> The merged patch series to support zoned block devices in virtio-blk
> is not the most up to date version. The merged patch can be found at
> 
> https://lore.kernel.org/linux-block/20221016034127.330942-3-dmitry.fomic...@wdc.com/
> 
> , but the latest and reviewed version is
> 
> https://lore.kernel.org/linux-block/20221110053952.3378990-3-dmitry.fomic...@wdc.com/
> 
> The differences between the two are mostly cleanups, but there is one
> change that is very important in terms of compatibility with the
> approved virtio-zbd specification.
> 
> Before it was approved, the OASIS virtio spec had a change in
> VIRTIO_BLK_T_ZONE_APPEND request layout that is not reflected in the
> current virtio-blk driver code. In the running code, the status is
> the first byte of the in-header that is followed by some pad bytes
> and the u64 that carries the sector at which the data has been written
> to the zone back to the driver, aka the append sector.
> 
> This layout turned out to be problematic to implement in QEMU, and
> the request status byte was eventually made the last byte of the
> in-header. The current code doesn't expect that, and this causes the
> append sector value to always come back as zero to the block layer. This needs
> to be fixed ASAP.
> 
> Fixes: 95bfec41bd3d ("virtio-blk: add support for zoned block devices")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Dmitry Fomichev 
> ---
>  drivers/block/virtio_blk.c  | 238 +---
>  include/uapi/linux/virtio_blk.h |  18 +--
>  2 files changed, 165 insertions(+), 91 deletions(-)

Reviewed-by: Stefan Hajnoczi 
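The layout change described in the commit message can be sketched as follows; the struct and field names here are illustrative, not the spec's, and the point is only that the status byte follows the append sector rather than preceding it:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative VIRTIO_BLK_T_ZONE_APPEND in-header layout after the fix:
 * the status is the *last* byte, after the append sector, instead of the
 * first byte followed by padding. */
struct zone_append_inhdr {
    uint64_t append_sector; /* sector at which the data was actually written */
    uint8_t  status;        /* request status, last byte of the in-header */
} __attribute__((packed));
```

A driver built against the old status-first layout would read zero where it expects the append sector, which is exactly the failure the patch fixes.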




Re: [virtio-dev] [PATCH 2/2] virtio-blk: fix ZBD probe in kernels without ZBD support

2023-03-28 Thread Stefan Hajnoczi
On Mon, Mar 27, 2023 at 10:29:28PM -0400, Dmitry Fomichev wrote:
> When the kernel is built without support for zoned block devices,
> virtio-blk probe needs to error out any host-managed device scans
> to prevent such devices from appearing in the system as non-zoned.
> The current virtio-blk code simply bypasses all ZBD checks if
> CONFIG_BLK_DEV_ZONED is not defined and this leads to host-managed
> block devices being presented as non-zoned in the OS. This is one of
> the main problems this patch series is aimed to fix.
> 
> In this patch, make the VIRTIO_BLK_F_ZONED feature defined even when
> CONFIG_BLK_DEV_ZONED is not. This change makes the code compliant with
> the voted revision of the virtio-blk ZBD spec. Modify the probe code to
> handle the situation where VIRTIO_BLK_F_ZONED is negotiated in a kernel
> that is built without ZBD support. In this case, the code checks
> the zoned model of the device and fails the probe if the device
> is host-managed.
> 
> The patch also adds a comment to clarify that the call to perform
> the zoned device probe is correctly placed after virtio_device_ready().
> 
> Fixes: 95bfec41bd3d ("virtio-blk: add support for zoned block devices")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Dmitry Fomichev 
> ---
>  drivers/block/virtio_blk.c | 34 ++--------
>  1 file changed, 18 insertions(+), 16 deletions(-)

Reviewed-by: Stefan Hajnoczi 
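The probe decision described in the commit message reduces to the check sketched below. All names are illustrative; in the real driver the kernel capability is a compile-time `CONFIG_BLK_DEV_ZONED` check and the failure would be an errno such as `-ENOTSUP`:

```c
#include <assert.h>
#include <stdbool.h>

enum zoned_model { ZONED_NONE, ZONED_HOST_AWARE, ZONED_HOST_MANAGED };

/* Sketch: when VIRTIO_BLK_F_ZONED is negotiated but the kernel lacks ZBD
 * support, a host-managed device must fail the probe rather than appear
 * in the system as a non-zoned disk. */
static int zbd_probe_check(bool kernel_has_zbd, bool f_zoned_negotiated,
                           enum zoned_model model)
{
    if (!f_zoned_negotiated)
        return 0;                  /* ordinary block device */
    if (!kernel_has_zbd && model == ZONED_HOST_MANAGED)
        return -1;                 /* refuse: cannot present it safely */
    return 0;
}
```

Host-aware devices are allowed through because they can still operate as regular disks; only host-managed devices require zoned I/O support.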




[virtio-dev] Re: [PATCH V3 0/2] qemu: vhost-user: Support Xen memory mapping quirks

2023-03-15 Thread Stefan Hajnoczi
On Thu, Mar 09, 2023 at 02:20:59PM +0530, Viresh Kumar wrote:
> Hello,
> 
> This patchset tries to update the vhost-user protocol to make it support 
> special
> memory mapping required in case of Xen hypervisor.
> 
> The first patch is mostly cleanup and second one introduces a new xen specific
> feature.
> 
> V2->V3:
> - Remove the extra message and instead update the memory regions to carry
>   additional data.
> 
> - Drop the one region one mmap relationship and allow back-end to map only 
> parts
>   of a region at once, required for Xen grant mappings.
> 
> - Additional cleanup patch 1/2.
> 
> V1->V2:
> - Make the custom mmap feature Xen specific, instead of being generic.
> - Clearly define which memory regions are impacted by this change.
> - Allow VHOST_USER_SET_XEN_MMAP to be called multiple times.
> - Additional Bit(2) property in flags.

Looks good, thanks!

Michael is the maintainer and this patch series will go through his tree.

Stefan




[virtio-dev] Re: [PATCH V3 2/2] docs: vhost-user: Add Xen specific memory mapping support

2023-03-15 Thread Stefan Hajnoczi
On Thu, Mar 09, 2023 at 02:21:01PM +0530, Viresh Kumar wrote:
> The current model of memory mapping at the back-end works fine where a
> standard call to mmap() (for the respective file descriptor) is enough
> before the front-end can start accessing the guest memory.
> 
> There are other complex cases though where the back-end needs more
> information and simple mmap() isn't enough. For example Xen, a type-1
> hypervisor, currently supports memory mapping via two different methods,
> foreign-mapping (via /dev/privcmd) and grant-dev (via /dev/gntdev). In
> both these cases, the back-end needs to call mmap() and ioctl(), with
> extra information like the Xen domain-id of the guest whose memory we
> are trying to map.
> 
> Add a new protocol feature, 'VHOST_USER_PROTOCOL_F_XEN_MMAP', which lets
> the back-end know about the additional memory mapping requirements.
> When this feature is negotiated, the front-end will send the additional
> information within the memory regions themselves.
> 
> Signed-off-by: Viresh Kumar 
> ---
>  docs/interop/vhost-user.rst | 21 +++++
>  1 file changed, 21 insertions(+)

Reviewed-by: Stefan Hajnoczi 




[virtio-dev] Re: [PATCH V3 1/2] docs: vhost-user: Define memory region separately

2023-03-15 Thread Stefan Hajnoczi
On Thu, Mar 09, 2023 at 02:21:00PM +0530, Viresh Kumar wrote:
> The same layout is defined twice, once in "single memory region
> description" and then in "memory regions description".
> 
> Separate out details of memory region from these two and reuse the same
> definition later on.
> 
> While at it, also rename "memory regions description" to "multiple
> memory regions description", to avoid potential confusion around similar
> names. And define single region before multiple ones.
> 
> This is just a documentation optimization, the protocol remains the same.
> 
> Signed-off-by: Viresh Kumar 
> ---
>  docs/interop/vhost-user.rst | 39 +--------
>  1 file changed, 18 insertions(+), 21 deletions(-)

Reviewed-by: Stefan Hajnoczi 




[virtio-dev] Re: [virtio-comment] Re: [virtio] Re: [PATCH v10 04/10] admin: introduce virtio admin virtqueues

2023-03-09 Thread Stefan Hajnoczi
On Thu, Mar 09, 2023 at 08:55:15AM -0500, Michael S. Tsirkin wrote:
> On Thu, Mar 09, 2023 at 07:35:42AM -0500, Stefan Hajnoczi wrote:
> > On Wed, Mar 08, 2023 at 12:21:46PM -0500, Michael S. Tsirkin wrote:
> > > On Wed, Mar 08, 2023 at 12:15:23PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > Or we could say that admin commands must complete within 
> > > > > > > > bounded time,
> > > > > > > > but I'm not sure that is implementable for some device types 
> > > > > > > > like
> > > > > > > > virtio-blk, virtio-scsi, and virtiofs.
> > > > > > > 
> > > > > > > No we can't.
> > > > > > > Some commands, for example FW upgrade can take 10 minutes and 
> > > > > > > it's perfectly
> > > > > > > fine. Other commands like setting feature bit will take 1 
> > > > > > > millisec.
> > > > > > > Each device implements commands in a different internal logic so 
> > > > > > > we can't
> > > > > > > expect to complete after X time.
> > > > > > 
> > > > > > When I say bounded time, I mean that it finishes in a finite amount 
> > > > > > of
> > > > > > time. I'm not saying there is a specific time X that all device
> > > > > > implementations must satisfy. Unbounded means it might never finish.
> > > > > 
> > > > > There might be a chance that any command for any virtio device type 
> > > > > will
> > > > > never finish. Nothing new here in the adminq.
> > > > > 
> > > > > what one can do is to set a timeout for himself and if this timeout 
> > > > > expire -
> > > > > check the device status. If it needs_reset - do a reset. if status is 
> > > > > ok,
> > > > > then wait some more time.
> > > > > After X retries, unmap buffers or reset the adminq.
> > > > 
> > > > Michael: What effect does resetting the group owner device have on group
> > > > member devices?
> > > 
> > > virtio level reset? It's a good question. I'd expect them all to be
> > > reset no?
> > > 
> > > > I'm concerned that this approach disrupts all group member devices. For
> > > > example, you try to add a new device but the command hangs. In order to
> > > > recover you now have to reset the group owner device and this breaks all
> > > > the group member devices.
> > > 
> > > 
> > > I agree. How about a VQ level reset though? Seems like exactly
> > > what's needed here?
> > 
> > Yes, a new virtqueue-level reset feature would take care of this case.
> > 
> > Stefan
> 
> Why would we need a new one? We already have VIRTIO_F_RING_RESET.

I was unaware :).

Stefan
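The timeout-and-recovery policy discussed in this thread (poll, check for needs_reset, retry, then fall back to a VIRTIO_F_RING_RESET of the admin virtqueue alone) can be sketched as below. This is a model only: `samples[i]` stands in for the device status the driver would observe after the i-th timeout expiry, and all names are mine:

```c
#include <assert.h>
#include <stdbool.h>

/* Device state as sampled by the driver after each timeout expiry. */
struct dev_state {
    bool cmd_done;
    bool needs_reset;
};

enum recovery { CMD_COMPLETED, DEVICE_RESET, ADMIN_VQ_RESET };

/* On each timeout: complete if done; reset the whole device only if it
 * reports needs_reset; otherwise keep waiting. After max_retries, reset
 * just the admin VQ so group member devices are not disrupted. */
static enum recovery wait_admin_cmd(const struct dev_state *samples,
                                    int max_retries)
{
    for (int i = 0; i < max_retries; i++) {
        if (samples[i].cmd_done)
            return CMD_COMPLETED;
        if (samples[i].needs_reset)
            return DEVICE_RESET;
        /* status OK: wait some more */
    }
    return ADMIN_VQ_RESET;
}
```

The key design point from the thread is the last branch: a virtqueue-level reset recovers a hung admin command without the device-wide reset that would break all group members.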




[virtio-dev] Re: [virtio-comment] Re: [virtio] Re: [PATCH v10 04/10] admin: introduce virtio admin virtqueues

2023-03-09 Thread Stefan Hajnoczi
On Wed, Mar 08, 2023 at 12:21:46PM -0500, Michael S. Tsirkin wrote:
> On Wed, Mar 08, 2023 at 12:15:23PM -0500, Stefan Hajnoczi wrote:
> > > > > > Or we could say that admin commands must complete within bounded 
> > > > > > time,
> > > > > > but I'm not sure that is implementable for some device types like
> > > > > > virtio-blk, virtio-scsi, and virtiofs.
> > > > > 
> > > > > No we can't.
> > > > > Some commands, for example FW upgrade can take 10 minutes and it's 
> > > > > perfectly
> > > > > fine. Other commands like setting feature bit will take 1 millisec.
> > > > > Each device implements commands in a different internal logic so we 
> > > > > can't
> > > > > expect to complete after X time.
> > > > 
> > > > When I say bounded time, I mean that it finishes in a finite amount of
> > > > time. I'm not saying there is a specific time X that all device
> > > > implementations must satisfy. Unbounded means it might never finish.
> > > 
> > > There might be a chance that any command for any virtio device type will
> > > never finish. Nothing new here in the adminq.
> > > 
> > > what one can do is to set a timeout for himself and if this timeout 
> > > expire -
> > > check the device status. If it needs_reset - do a reset. if status is ok,
> > > then wait some more time.
> > > After X retries, unmap buffers or reset the adminq.
> > 
> > Michael: What effect does resetting the group owner device have on group
> > member devices?
> 
> virtio level reset? It's a good question. I'd expect them all to be
> reset no?
> 
> > I'm concerned that this approach disrupts all group member devices. For
> > example, you try to add a new device but the command hangs. In order to
> > recover you now have to reset the group owner device and this breaks all
> > the group member devices.
> 
> 
> I agree. How about a VQ level reset though? Seems like exactly
> what's needed here?

Yes, a new virtqueue-level reset feature would take care of this case.

Stefan




[virtio-dev] Re: [virtio] Re: [virtio-comment] Re: [virtio] Re: [PATCH v10 04/10] admin: introduce virtio admin virtqueues

2023-03-08 Thread Stefan Hajnoczi
On Wed, Mar 08, 2023 at 01:57:43PM +0100, Jiri Pirko wrote:
> Wed, Mar 08, 2023 at 01:44:18PM CET, stefa...@redhat.com wrote:
> >On Wed, Mar 08, 2023 at 11:17:35AM +0100, Jiri Pirko wrote:
> >> Tue, Mar 07, 2023 at 08:03:47PM CET, stefa...@redhat.com wrote:
> >> >On Tue, Mar 07, 2023 at 04:07:54PM +0100, Jiri Pirko wrote:
> >> >> Tue, Mar 07, 2023 at 03:39:11PM CET, stefa...@redhat.com wrote:
> >> >> >On Tue, Mar 07, 2023 at 09:03:18AM +0100, Jiri Pirko wrote:
> >> >> >> Mon, Mar 06, 2023 at 07:37:31PM CET, m...@redhat.com wrote:
> >> >> >> >On Mon, Mar 06, 2023 at 06:03:40AM -0500, Stefan Hajnoczi wrote:
> >> >> >> >> On Sun, Mar 05, 2023 at 07:18:24PM -0500, Michael S. Tsirkin 
> >> >> >> >> wrote:
> >> >> >> >> > On Sun, Mar 05, 2023 at 07:03:02PM -0500, Stefan Hajnoczi wrote:
> >> >> >> >> > > On Sun, Mar 05, 2023 at 04:38:59AM -0500, Michael S. Tsirkin 
> >> >> >> >> > > wrote:
> >> >> >> >> > > > On Fri, Mar 03, 2023 at 03:21:33PM -0500, Stefan Hajnoczi 
> >> >> >> >> > > > wrote:
> >> >> >> >> > > > > What happens if a command takes 1 second to complete, is 
> >> >> >> >> > > > > the device
> >> >> >> >> > > > > allowed to process the next command from the virtqueue 
> >> >> >> >> > > > > during this time,
> >> >> >> >> > > > > possibly completing it before the first command?
> >> >> >> >> > > > > 
> >> >> >> >> > > > > This requires additional clarification in the spec 
> >> >> >> >> > > > > because "they are
> >> >> >> >> > > > > processed by the device in the order in which they are 
> >> >> >> >> > > > > queued" does not
> >> >> >> >> > > > > explain whether commands block the virtqueue (in order 
> >> >> >> >> > > > > completion) or
> >> >> >> >> > > > > not (out of order completion).
> >> >> >> >> > > > 
> >> >> >> >> > > > Oh I begin to see. Hmm how does e.g. virtio scsi handle 
> >> >> >> >> > > > this?
> >> >> >> >> > > 
> >> >> >> >> > > virtio-scsi, virtio-blk, and NVMe requests may complete out 
> >> >> >> >> > > of order.
> >> >> >> >> > > Several may be processed by the device at the same time.
> >> >> >> >> > 
> >> >> >> >> > Let's say I submit a write followed by read - is read
> >> >> >> >> > guaranteed to return an up to date info?
> >> >> >> >> 
> >> >> >> >> In general, no. The driver must wait for the write completion 
> >> >> >> >> before
> >> >> >> >> submitting the read if it wants consistency.
> >> >> >> >> 
> >> >> >> >> Stefan
> >> >> >> >
> >> >> >> >I see.  I think it's a good design to follow then.
> >> >> >> 
> >> >> >> Hmm, is it suitable to have this approach for configuration 
> >> >> >> interface?
> >> >> >> Storage device is a different beast, having parallel reads and writes
> >> >> >> makes complete sense for performance.
> >> >> >> 
> >> >> >> ->read a req
> >> >> >> ->read b req
> >> >> >> ->read c req
> >> >> >> <-read a rep
> >> >> >> <-read b rep
> >> >> >> <-read c rep
> >> >> >> 
> >> >> >> There is no dependency, even between writes.
> >> >> >> 
> >> >> >> But in case of configuration, does not make any sense to me.
> >> >> >> Why is it needed? To pass the burden of consistency of
> >> >> >> configuration to driver sounds odd at least.
> >> >> >> 
> >> >> >> I

[virtio-dev] Re: [virtio-comment] Re: [virtio] Re: [PATCH v10 04/10] admin: introduce virtio admin virtqueues

2023-03-08 Thread Stefan Hajnoczi
On Wed, Mar 08, 2023 at 06:19:41PM +0200, Max Gurtovoy wrote:
> 
> 
> On 08/03/2023 16:13, Stefan Hajnoczi wrote:
> > On Wed, Mar 08, 2023 at 01:17:33PM +0200, Max Gurtovoy wrote:
> > > 
> > > 
> > > On 06/03/2023 18:25, Stefan Hajnoczi wrote:
> > > > On Mon, Mar 06, 2023 at 05:28:03PM +0200, Max Gurtovoy wrote:
> > > > > 
> > > > > 
> > > > > On 06/03/2023 13:20, Stefan Hajnoczi wrote:
> > > > > > On Mon, Mar 06, 2023 at 04:00:50PM +0800, Jason Wang wrote:
> > > > > > > 
> > > > > > > 在 2023/3/6 08:03, Stefan Hajnoczi 写道:
> > > > > > > > On Sun, Mar 05, 2023 at 04:38:59AM -0500, Michael S. Tsirkin 
> > > > > > > > wrote:
> > > > > > > > > On Fri, Mar 03, 2023 at 03:21:33PM -0500, Stefan Hajnoczi 
> > > > > > > > > wrote:
> > > > > > > > > > What happens if a command takes 1 second to complete, is 
> > > > > > > > > > the device
> > > > > > > > > > allowed to process the next command from the virtqueue 
> > > > > > > > > > during this time,
> > > > > > > > > > possibly completing it before the first command?
> > > > > > > > > > 
> > > > > > > > > > This requires additional clarification in the spec because 
> > > > > > > > > > "they are
> > > > > > > > > > processed by the device in the order in which they are 
> > > > > > > > > > queued" does not
> > > > > > > > > > explain whether commands block the virtqueue (in order 
> > > > > > > > > > completion) or
> > > > > > > > > > not (out of order completion).
> > > > > > > > > Oh I begin to see. Hmm how does e.g. virtio scsi handle this?
> > > > > > > > virtio-scsi, virtio-blk, and NVMe requests may complete out of 
> > > > > > > > order.
> > > > > > > > Several may be processed by the device at the same time.
> > > > > > > > 
> > > > > > > > They rely on multi-queue for abort operations:
> > > > > > > > 
> > > > > > > > In virtio-scsi the abort requests 
> > > > > > > > (VIRTIO_SCSI_T_TMF_ABORT_TASK) are
> > > > > > > > sent on the control virtqueue. The request identifier 
> > > > > > > > namespace is
> > > > > > > > shared across all virtqueues so it's possible to abort a 
> > > > > > > > request that
> > > > > > > > was submitted to any command virtqueue.
> > > > > > > > 
> > > > > > > > NVMe also follows the same design where abort commands are sent 
> > > > > > > > on the
> > > > > > > > Admin Submission Queue instead of an I/O Submission Queue. It's 
> > > > > > > > possible
> > > > > > > > to identify NVMe requests by <Submission Queue ID, Command Identifier>.
> > > > > > > > 
> > > > > > > > virtio-blk doesn't support aborting requests.
> > > > > > > > 
> > > > > > > > I think the logic behind this design is that if a queue gets 
> > > > > > > > stuck
> > > > > > > > processing long-running requests, then the device should not be 
> > > > > > > > forced
> > > > > > > > to perform lookahead in the queue to find abort commands. A 
> > > > > > > > separate
> > > > > > > > control/admin queue is used for the abort requests.
> > > > > > > 
> > > > > > > 
> > > > > > > Or device need mandate some kind of QOS here, e.g a request must 
> > > > > > > be complete
> > > > > > > in some time. Otherwise we don't have sufficient reliability for 
> > > > > > > using it as
> > > > > > > management task?
> > > > > > 
> > > > > > Yes, if all commands can be executed in bounded time then a 
> > > > > > guarantee is
> > > > > > possible.
> > > > > > 
> > &
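The abort model described above (virtio-scsi TMF, NVMe Abort) boils down to naming a request by a (queue, request id) pair so that an abort submitted on a separate control/admin queue can target a request issued on any command queue. A sketch, with a purely illustrative in-flight table:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device in-flight request table. The identifier
 * namespace spans all command queues, so (queue, id) is unique. */
struct inflight_req {
    uint16_t queue;  /* command queue the request was submitted on */
    uint16_t id;     /* request identifier within the shared namespace */
    bool     active;
};

/* Process an abort received on the control/admin queue: find the named
 * request and abandon it. No lookahead in the stuck command queue is
 * needed, which is the point of using a separate queue for aborts. */
static bool abort_request(struct inflight_req *tbl, int n,
                          uint16_t queue, uint16_t id)
{
    for (int i = 0; i < n; i++) {
        if (tbl[i].active && tbl[i].queue == queue && tbl[i].id == id) {
            tbl[i].active = false;
            return true;
        }
    }
    return false; /* already completed or never submitted */
}
```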

[virtio-dev] Re: [virtio-comment] Re: [virtio] Re: [PATCH v10 04/10] admin: introduce virtio admin virtqueues

2023-03-08 Thread Stefan Hajnoczi
On Wed, Mar 08, 2023 at 01:17:33PM +0200, Max Gurtovoy wrote:
> 
> 
> On 06/03/2023 18:25, Stefan Hajnoczi wrote:
> > On Mon, Mar 06, 2023 at 05:28:03PM +0200, Max Gurtovoy wrote:
> > > 
> > > 
> > > On 06/03/2023 13:20, Stefan Hajnoczi wrote:
> > > > On Mon, Mar 06, 2023 at 04:00:50PM +0800, Jason Wang wrote:
> > > > > 
> > > > > 在 2023/3/6 08:03, Stefan Hajnoczi 写道:
> > > > > > On Sun, Mar 05, 2023 at 04:38:59AM -0500, Michael S. Tsirkin wrote:
> > > > > > > On Fri, Mar 03, 2023 at 03:21:33PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > What happens if a command takes 1 second to complete, is the 
> > > > > > > > device
> > > > > > > > allowed to process the next command from the virtqueue during 
> > > > > > > > this time,
> > > > > > > > possibly completing it before the first command?
> > > > > > > > 
> > > > > > > > This requires additional clarification in the spec because 
> > > > > > > > "they are
> > > > > > > > processed by the device in the order in which they are queued" 
> > > > > > > > does not
> > > > > > > > explain whether commands block the virtqueue (in order 
> > > > > > > > completion) or
> > > > > > > > not (out of order completion).
> > > > > > > Oh I begin to see. Hmm how does e.g. virtio scsi handle this?
> > > > > > virtio-scsi, virtio-blk, and NVMe requests may complete out of 
> > > > > > order.
> > > > > > Several may be processed by the device at the same time.
> > > > > > 
> > > > > > They rely on multi-queue for abort operations:
> > > > > > 
> > > > > > In virtio-scsi the abort requests (VIRTIO_SCSI_T_TMF_ABORT_TASK) are
> > > > > > sent on the control virtqueue. The request identifier namespace 
> > > > > > is
> > > > > > shared across all virtqueues so it's possible to abort a request 
> > > > > > that
> > > > > > was submitted to any command virtqueue.
> > > > > > 
> > > > > > NVMe also follows the same design where abort commands are sent on 
> > > > > > the
> > > > > > Admin Submission Queue instead of an I/O Submission Queue. It's 
> > > > > > possible
> > > > > > to identify NVMe requests by <Submission Queue ID, Command Identifier>.
> > > > > > 
> > > > > > virtio-blk doesn't support aborting requests.
> > > > > > 
> > > > > > I think the logic behind this design is that if a queue gets stuck
> > > > > > processing long-running requests, then the device should not be 
> > > > > > forced
> > > > > > to perform lookahead in the queue to find abort commands. A separate
> > > > > > control/admin queue is used for the abort requests.
> > > > > 
> > > > > 
> > > > > Or device need mandate some kind of QOS here, e.g a request must be 
> > > > > complete
> > > > > in some time. Otherwise we don't have sufficient reliability for 
> > > > > using it as
> > > > > management task?
> > > > 
> > > > Yes, if all commands can be executed in bounded time then a guarantee is
> > > > possible.
> > > > 
> > > > Here is an example where that's hard: imagine a virtio-blk device backed
> > > > by network storage. When an admin queue command is used to delete a
> > > > group member, any of the group member's in-flight I/O requests need to
> > > > be aborted. If the network hangs while the group member is being
> > > > deleted, then the device can't complete an orderly shutdown of I/O
> > > > requests in a reasonable time.
> > > > 
> > > > That example shows a basic group admin command that I think Michael is
> > > > about to propose. We can't avoid this problem by not making it a group
> > > > admin command - it needs to be a group admin command. So I think it's
> > > > likely that there will be admin commands that take an unbounded amount
> > > > of time to complete. One way to achieve what you mentioned is timeouts.
> > > 
> > > I think that you're g

[virtio-dev] Re: [virtio-comment] Re: [virtio] Re: [PATCH v10 04/10] admin: introduce virtio admin virtqueues

2023-03-08 Thread Stefan Hajnoczi
On Wed, Mar 08, 2023 at 11:17:35AM +0100, Jiri Pirko wrote:
> Tue, Mar 07, 2023 at 08:03:47PM CET, stefa...@redhat.com wrote:
> >On Tue, Mar 07, 2023 at 04:07:54PM +0100, Jiri Pirko wrote:
> >> Tue, Mar 07, 2023 at 03:39:11PM CET, stefa...@redhat.com wrote:
> >> >On Tue, Mar 07, 2023 at 09:03:18AM +0100, Jiri Pirko wrote:
> >> >> Mon, Mar 06, 2023 at 07:37:31PM CET, m...@redhat.com wrote:
> >> >> >On Mon, Mar 06, 2023 at 06:03:40AM -0500, Stefan Hajnoczi wrote:
> >> >> >> On Sun, Mar 05, 2023 at 07:18:24PM -0500, Michael S. Tsirkin wrote:
> >> >> >> > On Sun, Mar 05, 2023 at 07:03:02PM -0500, Stefan Hajnoczi wrote:
> >> >> >> > > On Sun, Mar 05, 2023 at 04:38:59AM -0500, Michael S. Tsirkin 
> >> >> >> > > wrote:
> >> >> >> > > > On Fri, Mar 03, 2023 at 03:21:33PM -0500, Stefan Hajnoczi 
> >> >> >> > > > wrote:
> >> >> >> > > > > What happens if a command takes 1 second to complete, is the 
> >> >> >> > > > > device
> >> >> >> > > > > allowed to process the next command from the virtqueue 
> >> >> >> > > > > during this time,
> >> >> >> > > > > possibly completing it before the first command?
> >> >> >> > > > > 
> >> >> >> > > > > This requires additional clarification in the spec because 
> >> >> >> > > > > "they are
> >> >> >> > > > > processed by the device in the order in which they are 
> >> >> >> > > > > queued" does not
> >> >> >> > > > > explain whether commands block the virtqueue (in order 
> >> >> >> > > > > completion) or
> >> >> >> > > > > not (out of order completion).
> >> >> >> > > > 
> >> >> >> > > > Oh I begin to see. Hmm how does e.g. virtio scsi handle this?
> >> >> >> > > 
> >> >> >> > > virtio-scsi, virtio-blk, and NVMe requests may complete out of 
> >> >> >> > > order.
> >> >> >> > > Several may be processed by the device at the same time.
> >> >> >> > 
> >> >> >> > Let's say I submit a write followed by read - is read
> >> >> >> > guaranteed to return an up to date info?
> >> >> >> 
> >> >> >> In general, no. The driver must wait for the write completion before
> >> >> >> submitting the read if it wants consistency.
> >> >> >> 
> >> >> >> Stefan
> >> >> >
> >> >> >I see.  I think it's a good design to follow then.
> >> >> 
> >> >> Hmm, is it suitable to have this approach for configuration interface?
> >> >> Storage device is a different beast, having parallel reads and writes
> >> >> makes complete sense for performance.
> >> >> 
> >> >> ->read a req
> >> >> ->read b req
> >> >> ->read c req
> >> >> <-read a rep
> >> >> <-read b rep
> >> >> <-read c rep
> >> >> 
> >> >> There is no dependency, even between writes.
> >> >> 
> >> >> But in case of configuration, does not make any sense to me.
> >> >> Why is it needed? To pass the burden of consistency of
> >> >> configuration to driver sounds odd at least.
> >> >> 
> >> >> I sense there is no concrete idea about what the "admin virtqueue" should
> >> >> serve for exactly.
> >> >
> >> >It's useful for long-running commands because they prevent other
> >> >commands from executing.
> >> >
> >> >An example I've given is that deleting a group member might require
> >> >waiting for the group member's I/O activity to finish. If that I/O
> >> >activity cannot be cancelled instantaneously, then it could take an
> >> >unbounded amount of time to delete the group member. The device would be
> >> >unable to process further admin commands.
> >> 
> >> I see. Then I believe that the device should handle the dependencies.
> >> Example 1:
> >> -> REQ cmd t

[virtio-dev] Re: [virtio] RE: [virtio-comment] Re: [virtio] Re: [PATCH v10 04/10] admin: introduce virtio admin virtqueues

2023-03-08 Thread Stefan Hajnoczi
On Wed, Mar 08, 2023 at 01:17:44PM +0800, Jason Wang wrote:
> On Wed, Mar 8, 2023 at 3:09 AM Parav Pandit  wrote:
> >
> >
> >
> > > From: Stefan Hajnoczi 
> > > Sent: Tuesday, March 7, 2023 2:04 PM
> >
> > > An alternative is unconditional out-of-order completion, where there are 
> > > no
> > > per-command ordering rules. The driver must wait for a command to complete
> > > if it relies on the results of that command for its next command. I like 
> > > this
> > > approach because it's less complex in the spec and for device 
> > > implementers,
> > > while the burden on the driver implementer is still reasonable.
> > +1.
> 
> Note that this is the way current ctrl virtqueue works.
> 
> > This has best of both.
> > 1. Command ordering knowledge and hence the decision left to the one which 
> > issues them. (driver).
> > 2. Ability to execute multiple unrelated commands using a single AQ.
> > 3. stateless device in AQ command execution
> 
> Does this mean if we want to migrate AQ (not use AQ to migrate), we
> need to wait for the AQ command to be completed?

It depends. Today, if the AQ is emulated by the VMM then it might be
possible to migrate while there is a command in-flight. If the AQ is
handled by the device then no, because VIRTIO currently does not support
device state migration.

In the future, device state migration will probably be added to
vhost-user and vDPA. In that case a device can migrate in-flight AQ
activity - assuming it's possible to pause it, serialize state,
deserialize it, and resume it. Otherwise it's still necessary to wait.

To be clear, I'm just talking about AQ commands that are currently in the
process of executing. Commands that are in the virtqueue but have not
yet been started by the device can be easily migrated using the existing
virtqueue migration infrastructure.

Stefan


signature.asc
Description: PGP signature


[virtio-dev] Re: [virtio-comment] Re: [virtio] Re: [PATCH v10 04/10] admin: introduce virtio admin virtqueues

2023-03-07 Thread Stefan Hajnoczi
On Tue, Mar 07, 2023 at 04:07:54PM +0100, Jiri Pirko wrote:
> Tue, Mar 07, 2023 at 03:39:11PM CET, stefa...@redhat.com wrote:
> >On Tue, Mar 07, 2023 at 09:03:18AM +0100, Jiri Pirko wrote:
> >> Mon, Mar 06, 2023 at 07:37:31PM CET, m...@redhat.com wrote:
> >> >On Mon, Mar 06, 2023 at 06:03:40AM -0500, Stefan Hajnoczi wrote:
> >> >> On Sun, Mar 05, 2023 at 07:18:24PM -0500, Michael S. Tsirkin wrote:
> >> >> > On Sun, Mar 05, 2023 at 07:03:02PM -0500, Stefan Hajnoczi wrote:
> >> >> > > On Sun, Mar 05, 2023 at 04:38:59AM -0500, Michael S. Tsirkin wrote:
> >> >> > > > On Fri, Mar 03, 2023 at 03:21:33PM -0500, Stefan Hajnoczi wrote:
> >> >> > > > > What happens if a command takes 1 second to complete, is the 
> >> >> > > > > device
> >> >> > > > > allowed to process the next command from the virtqueue during 
> >> >> > > > > this time,
> >> >> > > > > possibly completing it before the first command?
> >> >> > > > > 
> >> >> > > > > This requires additional clarification in the spec because 
> >> >> > > > > "they are
> >> >> > > > > processed by the device in the order in which they are queued" 
> >> >> > > > > does not
> >> >> > > > > explain whether commands block the virtqueue (in order 
> >> >> > > > > completion) or
> >> >> > > > > not (out of order completion).
> >> >> > > > 
> >> >> > > > Oh I begin to see. Hmm how does e.g. virtio scsi handle this?
> >> >> > > 
> >> >> > > virtio-scsi, virtio-blk, and NVMe requests may complete out of 
> >> >> > > order.
> >> >> > > Several may be processed by the device at the same time.
> >> >> > 
> >> >> > Let's say I submit a write followed by read - is read
> >> >> > guaranteed to return an up to date info?
> >> >> 
> >> >> In general, no. The driver must wait for the write completion before
> >> >> submitting the read if it wants consistency.
> >> >> 
> >> >> Stefan
> >> >
> >> >I see.  I think it's a good design to follow then.
> >> 
> >> Hmm, is it suitable to have this approach for configuration interface?
> >> Storage device is a different beast, having parallel reads and writes
> >> makes complete sense for performance.
> >> 
> >> ->read a req
> >> ->read b req
> >> ->read c req
> >> <-read a rep
> >> <-read b rep
> >> <-read c rep
> >> 
> >> There is no dependency, even between writes.
> >> 
> >> But in case of configuration, does not make any sense to me.
> >> Why is it needed? To pass the burden of consistency of
> >> configuration to driver sounds odd at least.
> >> 
> >> I sense there is no concrete idea about what the "admin virtqueue" should
> >> serve for exactly.
> >
> >It's useful for long-running commands because they prevent other
> >commands from executing.
> >
> >An example I've given is that deleting a group member might require
> >waiting for the group member's I/O activity to finish. If that I/O
> >activity cannot be cancelled instantaneously, then it could take an
> >unbounded amount of time to delete the group member. The device would be
> >unable to process further admin commands.
> 
> I see. Then I believe that the device should handle the dependencies.
> Example 1:
> -> REQ cmd to create group member A
> -> REQ cmd to create group member B
> <- REP cmd to create group member A
> <- REP cmd to create group member B
> 
> The device according to internal implementation can either serialize the
> 2 group member creations or do it in parallel, if it supports it.
> 
> Example 2:
> -> REQ cmd to create group member A
> -> REQ cmd config group member A
> <- REP cmd to create group member A
> <- REP cmd config group member A
> 
> Here the serialization is necessary and the device is the one to take
> care of it.
> 
> Makes sense?

Yes, I understand. The spec would need to define ordering rules for
specific commands and the device must implement them. It allows the
driver to pipeline commands while also allowing out-of-order completion
(parallelism) in some cases.
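As a rough illustration of that model (hypothetical command names, not a real driver API): independent commands are pipelined and may complete in any order, while a dependent command is only submitted after the completion it relies on.

```python
import random
import threading
import time

def device_worker(cmd, done_event, completions):
    # Commands take varying amounts of time, so completions can reorder.
    time.sleep(random.uniform(0, 0.01))
    completions.append(cmd)
    done_event.set()

completions = []
events = {}

# Independent commands: pipeline them without waiting (out-of-order is fine).
for cmd in ["create-A", "create-B", "create-C"]:
    events[cmd] = threading.Event()
    threading.Thread(target=device_worker,
                     args=(cmd, events[cmd], completions)).start()

# Dependent command: "config-A" relies on "create-A"'s result, so the
# driver waits for that completion before submitting it.
events["create-A"].wait()
ev = threading.Event()
threading.Thread(target=device_worker,
                 args=("config-A", ev, completions)).start()
ev.wait()
for e in events.values():
    e.wait()

# config-A always completes after create-A, whatever the other ordering is.
assert completions.index("config-A") > completions.index("create-A")
```

The alternative discussed in the thread moves this serialization into the device (the spec defining per-command ordering rules), trading driver simplicity for device complexity.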

[virtio-dev] Re: [PATCH V2] docs: vhost-user: Add Xen specific memory mapping support

2023-03-07 Thread Stefan Hajnoczi
On Tue, Mar 07, 2023 at 11:13:36AM +0530, Viresh Kumar wrote:
> On 06-03-23, 10:34, Stefan Hajnoczi wrote:
> > On Mon, Mar 06, 2023 at 04:40:24PM +0530, Viresh Kumar wrote:
> > > +Xen mmap description
> > > +
> > > +
> > > ++---+---+
> > > +| flags | domid |
> > > ++---+---+
> > > +
> > > +:flags: 64-bit bit field
> > > +
> > > +- Bit 0 is set for Xen foreign memory mapping.
> > > +- Bit 1 is set for Xen grant memory mapping.
> > > +- Bit 2 is set if the back-end can directly map additional memory (like
> > > +  descriptor buffers or indirect descriptors, which aren't part of 
> > > already
> > > +  shared memory regions) without the need of front-end sending an 
> > > additional
> > > +  memory region first.
> > 
> > I don't understand what Bit 2 does. Can you rephrase this? It's unclear
> > to me how additional memory can be mapped without a memory region
> > (especially the fd) being sent?
> 
> I (somehow) assumed we will be able to use the same file descriptor
> that was shared for the virtqueues memory regions and yes I can see
> now why it wouldn't work or create problems.
> 
> And I need suggestion now on how to make this work.
> 
> With Xen grants, the front end receives grant addresses from the
> guest kernel, they aren't physical addresses, kind of IOMMU stuff.
> 
> The back-end gets access for memory regions of the virtqueues alone
> initially.  When the back-end gets a request, it reads the descriptor
> and finds the buffer address, which isn't part of already shared
> regions. The same happens for descriptor addresses in case indirect
> descriptor feature is negotiated.
> 
> At this point I was thinking maybe the back-end can simply call the
> mmap/ioctl to map the memory, using the file descriptor used for the
> virtqueues.
> 
> How else can we make this work ? We also need to unmap/remove the
> memory region, as soon as the buffer is processed as the grant address
> won't be relevant for any subsequent request.
> 
> Should I use VHOST_USER_IOTLB_MSG for this ? I did look at it and I
> wasn't convinced if it was an exact fit. For example it says that a
> memory address reported with miss/access fail should be part of an
> already sent memory region, which isn't the case here.

VHOST_USER_IOTLB_MSG probably isn't necessary because address
translation is not required. It will also reduce performance by adding
extra communication.

Instead, you could change the 1 memory region : 1 mmap relationship that
existing non-Xen vhost-user back-end implementations have. In Xen
vhost-user back-ends, the memory region details (including the file
descriptor and Xen domain id) would be stashed away in the back-end when the
front-end adds memory regions. No mmap would be performed upon
VHOST_USER_ADD_MEM_REG or VHOST_USER_SET_MEM_TABLE.

Whenever the back-end needs to do DMA, it looks up the memory region and
performs the mmap + Xen-specific calls:
- A long-lived mmap of the vring is set up when
  VHOST_USER_SET_VRING_ENABLE is received.
- Short-lived mmaps of the indirect descriptors and memory pointed to by
  the descriptors are set up by the virtqueue processing code.

Does this sound workable to you?
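Sketched in Python (every name below is invented for illustration; the real back-end would do an mmap() plus the Xen foreign/grant ioctl where the placeholder is), the stash-then-map-on-demand idea might look like:

```python
# Back-end records region details on SET_MEM_TABLE/ADD_MEM_REG instead of
# mapping eagerly, then maps on demand when virtqueue processing needs DMA.

class XenRegion:
    def __init__(self, guest_addr, size, fd, domid):
        self.guest_addr = guest_addr
        self.size = size
        self.fd = fd        # file descriptor from the memory region message
        self.domid = domid  # Xen domain id needed for the mapping ioctl

class Backend:
    def __init__(self):
        self.regions = []  # stashed, not mapped

    def add_mem_region(self, region):
        # VHOST_USER_ADD_MEM_REG handler: no mmap here.
        self.regions.append(region)

    def map_for_dma(self, guest_addr, length):
        # Called from virtqueue processing; short-lived mapping.
        for r in self.regions:
            if (r.guest_addr <= guest_addr and
                    guest_addr + length <= r.guest_addr + r.size):
                return self._xen_mmap(r, guest_addr, length)
        raise LookupError("address not in any stashed memory region")

    def _xen_mmap(self, region, guest_addr, length):
        # Placeholder for mmap() + the Xen-specific ioctl on region.fd.
        return ("mapped", region.domid, guest_addr, length)

be = Backend()
be.add_mem_region(XenRegion(0x1000, 0x4000, fd=7, domid=3))
print(be.map_for_dma(0x2000, 0x100))  # mapped on demand for one request
```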

Stefan




[virtio-dev] Re: [virtio-comment] Re: [virtio] Re: [PATCH v10 04/10] admin: introduce virtio admin virtqueues

2023-03-07 Thread Stefan Hajnoczi
On Tue, Mar 07, 2023 at 09:03:18AM +0100, Jiri Pirko wrote:
> Mon, Mar 06, 2023 at 07:37:31PM CET, m...@redhat.com wrote:
> >On Mon, Mar 06, 2023 at 06:03:40AM -0500, Stefan Hajnoczi wrote:
> >> On Sun, Mar 05, 2023 at 07:18:24PM -0500, Michael S. Tsirkin wrote:
> >> > On Sun, Mar 05, 2023 at 07:03:02PM -0500, Stefan Hajnoczi wrote:
> >> > > On Sun, Mar 05, 2023 at 04:38:59AM -0500, Michael S. Tsirkin wrote:
> >> > > > On Fri, Mar 03, 2023 at 03:21:33PM -0500, Stefan Hajnoczi wrote:
> >> > > > > What happens if a command takes 1 second to complete, is the device
> >> > > > > allowed to process the next command from the virtqueue during this 
> >> > > > > time,
> >> > > > > possibly completing it before the first command?
> >> > > > > 
> >> > > > > This requires additional clarification in the spec because "they 
> >> > > > > are
> >> > > > > processed by the device in the order in which they are queued" 
> >> > > > > does not
> >> > > > > explain whether commands block the virtqueue (in order completion) 
> >> > > > > or
> >> > > > > not (out of order completion).
> >> > > > 
> >> > > > Oh I begin to see. Hmm how does e.g. virtio scsi handle this?
> >> > > 
> >> > > virtio-scsi, virtio-blk, and NVMe requests may complete out of order.
> >> > > Several may be processed by the device at the same time.
> >> > 
> >> > Let's say I submit a write followed by read - is read
> >> > guaranteed to return an up to date info?
> >> 
> >> In general, no. The driver must wait for the write completion before
> >> submitting the read if it wants consistency.
> >> 
> >> Stefan
> >
> >I see.  I think it's a good design to follow then.
> 
> Hmm, is it suitable to have this approach for configuration interface?
> Storage device is a different beast, having parallel reads and writes
> makes complete sense for performance.
> 
> ->read a req
> ->read b req
> ->read c req
> <-read a rep
> <-read b rep
> <-read c rep
> 
> There is no dependency, even between writes.
> 
> But in case of configuration, does not make any sense to me.
> Why is it needed? To pass the burden of consistency of
> configuration to driver sounds odd at least.
> 
> I sense there is no concrete idea about what the "admin virtqueue" should
> serve for exactly.

It's useful for long-running commands because they prevent other
commands from executing.

An example I've given is that deleting a group member might require
waiting for the group member's I/O activity to finish. If that I/O
activity cannot be cancelled instantaneously, then it could take an
unbounded amount of time to delete the group member. The device would be
unable to process further admin commands.

Group member creation might have similar issues if it involves acquiring
remote resources (e.g. connecting to a Ceph cluster or allocating ports
on a distributed network switch). It can be impossible to defer resource
acquisition/initialization because VIRTIO devices must be
available as soon as the driver can see them (i.e. how do you populate
Configuration Space fields if you don't have the details of the resource
yet?).

So I have raised two questions:

1. What are the admin queue command completion semantics: in-order or
   out-of-order command completion?

2. Will there be long-running commands and how will we deal with them
   when they hang?

Stefan




[virtio-dev] Re: [virtio] Re: [PATCH v10 04/10] admin: introduce virtio admin virtqueues

2023-03-06 Thread Stefan Hajnoczi
On Mon, Mar 06, 2023 at 01:37:31PM -0500, Michael S. Tsirkin wrote:
> On Mon, Mar 06, 2023 at 06:03:40AM -0500, Stefan Hajnoczi wrote:
> > On Sun, Mar 05, 2023 at 07:18:24PM -0500, Michael S. Tsirkin wrote:
> > > On Sun, Mar 05, 2023 at 07:03:02PM -0500, Stefan Hajnoczi wrote:
> > > > On Sun, Mar 05, 2023 at 04:38:59AM -0500, Michael S. Tsirkin wrote:
> > > > > On Fri, Mar 03, 2023 at 03:21:33PM -0500, Stefan Hajnoczi wrote:
> > > > > > What happens if a command takes 1 second to complete, is the device
> > > > > > allowed to process the next command from the virtqueue during this 
> > > > > > time,
> > > > > > possibly completing it before the first command?
> > > > > > 
> > > > > > This requires additional clarification in the spec because "they are
> > > > > > processed by the device in the order in which they are queued" does 
> > > > > > not
> > > > > > explain whether commands block the virtqueue (in order completion) 
> > > > > > or
> > > > > > not (out of order completion).
> > > > > 
> > > > > Oh I begin to see. Hmm how does e.g. virtio scsi handle this?
> > > > 
> > > > virtio-scsi, virtio-blk, and NVMe requests may complete out of order.
> > > > Several may be processed by the device at the same time.
> > > 
> > > Let's say I submit a write followed by read - is read
> > > guaranteed to return an up to date info?
> > 
> > In general, no. The driver must wait for the write completion before
> > submitting the read if it wants consistency.
> > 
> > Stefan
> 
> I see.  I think it's a good design to follow then.
> 
> I'll just copy
>   The driver queues requests to an arbitrary request queue, and
>   they are used by the device on that same queue. It is the
>   responsibility of the driver to ensure strict request ordering
>   for commands placed on different queues, because they will be
>   consumed with no order constraints.
> 
> replacing "request" with "admin".

That sounds like it's only about multi-queue because it says "... for
commands placed on different queues". What I mentioned about a write
followed by a read request also applies within a single queue.

Can you clarify the semantics in the single queue case?

Stefan




[virtio-dev] Re: [virtio] Re: [PATCH v10 04/10] admin: introduce virtio admin virtqueues

2023-03-06 Thread Stefan Hajnoczi
On Mon, Mar 06, 2023 at 05:28:03PM +0200, Max Gurtovoy wrote:
> 
> 
> On 06/03/2023 13:20, Stefan Hajnoczi wrote:
> > On Mon, Mar 06, 2023 at 04:00:50PM +0800, Jason Wang wrote:
> > > 
> > > 在 2023/3/6 08:03, Stefan Hajnoczi 写道:
> > > > On Sun, Mar 05, 2023 at 04:38:59AM -0500, Michael S. Tsirkin wrote:
> > > > > On Fri, Mar 03, 2023 at 03:21:33PM -0500, Stefan Hajnoczi wrote:
> > > > > > What happens if a command takes 1 second to complete, is the device
> > > > > > allowed to process the next command from the virtqueue during this 
> > > > > > time,
> > > > > > possibly completing it before the first command?
> > > > > > 
> > > > > > This requires additional clarification in the spec because "they are
> > > > > > processed by the device in the order in which they are queued" does 
> > > > > > not
> > > > > > explain whether commands block the virtqueue (in order completion) 
> > > > > > or
> > > > > > not (out of order completion).
> > > > > Oh I begin to see. Hmm how does e.g. virtio scsi handle this?
> > > > virtio-scsi, virtio-blk, and NVMe requests may complete out of order.
> > > > Several may be processed by the device at the same time.
> > > > 
> > > > They rely on multi-queue for abort operations:
> > > > 
> > > > In virtio-scsi the abort requests (VIRTIO_SCSI_T_TMF_ABORT_TASK) are
> > > > sent on the control virtqueue. The request identifier namespace is
> > > > shared across all virtqueues so it's possible to abort a request that
> > > > was submitted to any command virtqueue.
> > > > 
> > > > NVMe also follows the same design where abort commands are sent on the
> > > > Admin Submission Queue instead of an I/O Submission Queue. It's possible
> > > > to identify NVMe requests by <Submission Queue ID, Command Identifier>.
> > > > 
> > > > virtio-blk doesn't support aborting requests.
> > > > 
> > > > I think the logic behind this design is that if a queue gets stuck
> > > > processing long-running requests, then the device should not be forced
> > > > to perform lookahead in the queue to find abort commands. A separate
> > > > control/admin queue is used for the abort requests.
> > > 
> > > 
> > > Or device need mandate some kind of QOS here, e.g a request must be 
> > > complete
> > > in some time. Otherwise we don't have sufficient reliability for using it 
> > > as
> > > management task?
> > 
> > Yes, if all commands can be executed in bounded time then a guarantee is
> > possible.
> > 
> > Here is an example where that's hard: imagine a virtio-blk device backed
> > by network storage. When an admin queue command is used to delete a
> > group member, any of the group member's in-flight I/O requests need to
> > be aborted. If the network hangs while the group member is being
> > deleted, then the device can't complete an orderly shutdown of I/O
> > requests in a reasonable time.
> > 
> > That example shows a basic group admin command that I think Michael is
> > about to propose. We can't avoid this problem by not making it a group
> > admin command - it needs to be a group admin command. So I think it's
> > likely that there will be admin commands that take an unbounded amount
> > of time to complete. One way to achieve what you mentioned is timeouts.
> 
> I think that you're getting into device specific implementation details and
> I'm not sure it's necessary.
> 
> I don't think we need to abort admin commands. Admin commands can be
> flushed/aborted during the device reset phase.
> Only IO commands should have the possibility of being aborted as you
> mentioned in NVMe and SCSI (and potentially in virtio-blk).

It's a general design issue that should be clarified now rather than
being left unspecified.

I'm not saying that it must be possible to abort admin commands. There
are other options, like requiring the device itself to fail a command
after a timeout.

Or we could say that admin commands must complete within bounded time,
but I'm not sure that is implementable for some device types like
virtio-blk, virtio-scsi, and virtiofs.

> For your example, stopping a member is possible even it there are some
> errors in the network. You can for example destroy all the connections to
> the remote target and complete all the BIOs with some error.

Forgetting about in-flight requests doesn't necessarily make them go
away. It creates a race between forgotten requests and reconnection. In
the worst case a forgotten write request takes effect after
reconnection, causing data corruption.

Stefan




[virtio-dev] Re: [PATCH V2] docs: vhost-user: Add Xen specific memory mapping support

2023-03-06 Thread Stefan Hajnoczi
On Mon, Mar 06, 2023 at 04:40:24PM +0530, Viresh Kumar wrote:
> The current model of memory mapping at the back-end works fine where a
> standard call to mmap() (for the respective file descriptor) is enough
> before the front-end can start accessing the guest memory.
> 
> There are other complex cases though where the back-end needs more
> information and simple mmap() isn't enough. For example Xen, a type-1
> hypervisor, currently supports memory mapping via two different methods,
> foreign-mapping (via /dev/privcmd) and grant-dev (via /dev/gntdev). In
> both these cases, the back-end needs to call mmap() and ioctl(), and
> need to pass extra information via the ioctl(), like the Xen domain-id
> of the guest whose memory we are trying to map.
> 
> Add a new protocol feature, 'VHOST_USER_PROTOCOL_F_XEN_MMAP', which lets
> the back-end know about the additional memory mapping requirements.
> When this feature is negotiated, the front-end can send the
> 'VHOST_USER_SET_XEN_MMAP' message type to provide the additional
> information to the back-end.
> 
> Signed-off-by: Viresh Kumar 
> ---
> V1->V2:
> - Make the custom mmap feature Xen specific, instead of being generic.
> - Clearly define which memory regions are impacted by this change.
> - Allow VHOST_USER_SET_XEN_MMAP to be called multiple times.
> - Additional Bit(2) property in flags.
> 
>  docs/interop/vhost-user.rst | 36 
>  1 file changed, 36 insertions(+)
> 
> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> index 3f18ab424eb0..8be5f5eae941 100644
> --- a/docs/interop/vhost-user.rst
> +++ b/docs/interop/vhost-user.rst
> @@ -258,6 +258,24 @@ Inflight description
>  
>  :queue size: a 16-bit size of virtqueues
>  
> +Xen mmap description
> +
> +
> ++---+---+
> +| flags | domid |
> ++---+---+
> +
> +:flags: 64-bit bit field
> +
> > +- Bit 0 is set for Xen foreign memory mapping.
> > +- Bit 1 is set for Xen grant memory mapping.
> +- Bit 2 is set if the back-end can directly map additional memory (like
> +  descriptor buffers or indirect descriptors, which aren't part of already
> +  shared memory regions) without the need of front-end sending an additional
> +  memory region first.

I don't understand what Bit 2 does. Can you rephrase this? It's unclear
to me how additional memory can be mapped without a memory region
(especially the fd) being sent?

> +
> +:domid: a 64-bit Xen hypervisor specific domain id.
> +
>  C structure
>  ---
>  
> @@ -867,6 +885,7 @@ Protocol features
>#define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14
>#define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS  15
>#define VHOST_USER_PROTOCOL_F_STATUS   16
> +  #define VHOST_USER_PROTOCOL_F_XEN_MMAP 17
>  
>  Front-end message types
>  ---
> @@ -1422,6 +1441,23 @@ Front-end message types
>query the back-end for its device status as defined in the Virtio
>specification.
>  
> +``VHOST_USER_SET_XEN_MMAP``
> +  :id: 41
> +  :equivalent ioctl: N/A
> +  :request payload: Xen mmap description
> +  :reply payload: N/A
> +
> +  When the ``VHOST_USER_PROTOCOL_F_XEN_MMAP`` protocol feature has been
> +  successfully negotiated, this message is submitted by the front-end to set 
> the
> +  Xen hypervisor specific memory mapping configurations at the back-end.  
> These
> +  configurations should be used to mmap memory regions, virtqueues, 
> descriptors
> +  and descriptor buffers. The front-end must send this message before any
> +  memory-regions are sent to the back-end via ``VHOST_USER_SET_MEM_TABLE`` or
> +  ``VHOST_USER_ADD_MEM_REG`` message types. The front-end can send this 
> message
> +  multiple times, if different mmap configurations are required for different
> +  memory regions, where the most recent ``VHOST_USER_SET_XEN_MMAP`` must be 
> used
> +  by the back-end to map any newly shared memory regions.

This message modifies the behavior of subsequent
VHOST_USER_SET_MEM_TABLE and VHOST_USER_ADD_MEM_REG messages. The memory
region structs can be extended and then VHOST_USER_SET_XEN_MMAP isn't
needed.

In other words:

  When VHOST_USER_PROTOCOL_F_XEN_MMAP is negotiated, each "Memory
  regions description" and "Single memory region description" has the
  following additional fields appended:

  ++---+
  | xen_mmap_flags | domid |
  ++---+

  :xen_mmap_flags: 64-bit bit field
  :domid: a 64-bit Xen hypervisor specific domain id.
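As an illustration only (the exact wire format would be whatever the final spec text says), the two appended fields would pack as consecutive little-endian 64-bit values, matching other vhost-user payloads:

```python
import struct

# Flag bits from the proposed "Xen mmap description" (values per the patch).
XEN_FOREIGN_MMAP = 1 << 0  # bit 0: Xen foreign memory mapping
XEN_GRANT_MMAP   = 1 << 1  # bit 1: Xen grant memory mapping

def pack_region_suffix(xen_mmap_flags, domid):
    """Pack the hypothetical per-region suffix: two little-endian u64s."""
    return struct.pack("<QQ", xen_mmap_flags, domid)

suffix = pack_region_suffix(XEN_GRANT_MMAP, domid=5)
assert len(suffix) == 16  # 8 bytes of flags + 8 bytes of domid
flags, domid = struct.unpack("<QQ", suffix)
assert flags == XEN_GRANT_MMAP and domid == 5
```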

Stefan




[virtio-dev] Re: [virtio] Re: [PATCH v10 04/10] admin: introduce virtio admin virtqueues

2023-03-06 Thread Stefan Hajnoczi
On Mon, Mar 06, 2023 at 04:00:50PM +0800, Jason Wang wrote:
> 
> 在 2023/3/6 08:03, Stefan Hajnoczi 写道:
> > On Sun, Mar 05, 2023 at 04:38:59AM -0500, Michael S. Tsirkin wrote:
> > > On Fri, Mar 03, 2023 at 03:21:33PM -0500, Stefan Hajnoczi wrote:
> > > > What happens if a command takes 1 second to complete, is the device
> > > > allowed to process the next command from the virtqueue during this time,
> > > > possibly completing it before the first command?
> > > > 
> > > > This requires additional clarification in the spec because "they are
> > > > processed by the device in the order in which they are queued" does not
> > > > explain whether commands block the virtqueue (in order completion) or
> > > > not (out of order completion).
> > > Oh I begin to see. Hmm how does e.g. virtio scsi handle this?
> > virtio-scsi, virtio-blk, and NVMe requests may complete out of order.
> > Several may be processed by the device at the same time.
> > 
> > They rely on multi-queue for abort operations:
> > 
> > In virtio-scsi the abort requests (VIRTIO_SCSI_T_TMF_ABORT_TASK) are
> > sent on the control virtqueue. The request identifier namespace is
> > shared across all virtqueues so it's possible to abort a request that
> > was submitted to any command virtqueue.
> > 
> > NVMe also follows the same design where abort commands are sent on the
> > Admin Submission Queue instead of an I/O Submission Queue. It's possible
> > to identify NVMe requests by <Submission Queue ID, Command Identifier>.
> > 
> > virtio-blk doesn't support aborting requests.
> > 
> > I think the logic behind this design is that if a queue gets stuck
> > processing long-running requests, then the device should not be forced
> > to perform lookahead in the queue to find abort commands. A separate
> > control/admin queue is used for the abort requests.
> 
> 
> Or device need mandate some kind of QOS here, e.g a request must be complete
> in some time. Otherwise we don't have sufficient reliability for using it as
> management task?

Yes, if all commands can be executed in bounded time then a guarantee is
possible.

Here is an example where that's hard: imagine a virtio-blk device backed
by network storage. When an admin queue command is used to delete a
group member, any of the group member's in-flight I/O requests need to
be aborted. If the network hangs while the group member is being
deleted, then the device can't complete an orderly shutdown of I/O
requests in a reasonable time.

That example shows a basic group admin command that I think Michael is
about to propose. We can't avoid this problem by not making it a group
admin command - it needs to be a group admin command. So I think it's
likely that there will be admin commands that take an unbounded amount
of time to complete. One way to achieve what you mentioned is timeouts.
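A driver-side sketch of the timeout idea (illustrative, not from the thread or the spec: struct and function names are mine). The driver records when each admin command was submitted and gives up once a per-command budget is exceeded, even though the device may still complete the buffer later:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative driver-side bookkeeping for one in-flight admin command. */
struct admin_cmd_track {
    uint64_t submitted_ns; /* timestamp when the command was queued */
    uint64_t timeout_ns;   /* how long the driver is willing to wait */
};

/* Returns true once the command has been in flight at least as long as
 * its budget.  The driver would then report an error (e.g. -ETIMEDOUT)
 * to its caller instead of blocking forever on a hung backend. */
static bool admin_cmd_timed_out(const struct admin_cmd_track *t,
                                uint64_t now_ns)
{
    return now_ns - t->submitted_ns >= t->timeout_ns;
}
```

Note this only bounds the driver's wait; the device/backend is not forced to finish.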

Stefan


signature.asc
Description: PGP signature


[virtio-dev] Re: [virtio] Re: [PATCH v10 04/10] admin: introduce virtio admin virtqueues

2023-03-06 Thread Stefan Hajnoczi
On Sun, Mar 05, 2023 at 07:18:24PM -0500, Michael S. Tsirkin wrote:
> On Sun, Mar 05, 2023 at 07:03:02PM -0500, Stefan Hajnoczi wrote:
> > On Sun, Mar 05, 2023 at 04:38:59AM -0500, Michael S. Tsirkin wrote:
> > > On Fri, Mar 03, 2023 at 03:21:33PM -0500, Stefan Hajnoczi wrote:
> > > > What happens if a command takes 1 second to complete, is the device
> > > > allowed to process the next command from the virtqueue during this time,
> > > > possibly completing it before the first command?
> > > > 
> > > > This requires additional clarification in the spec because "they are
> > > > processed by the device in the order in which they are queued" does not
> > > > explain whether commands block the virtqueue (in order completion) or
> > > > not (out of order completion).
> > > 
> > > Oh I begin to see. Hmm how does e.g. virtio scsi handle this?
> > 
> > virtio-scsi, virtio-blk, and NVMe requests may complete out of order.
> > Several may be processed by the device at the same time.
> 
> Let's say I submit a write followed by read - is read
> guaranteed to return an up to date info?

In general, no. The driver must wait for the write completion before
submitting the read if it wants consistency.
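A sketch of that driver-side rule (mine, not from the spec): a dependent read is only queued once every overlapping write has completed. A simple pending-write counter stands in for real request tracking:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative: count writes still in flight; a read that must observe
 * those writes is held back until the count drops to zero. */
struct wr_barrier {
    unsigned pending_writes;
};

static void write_submitted(struct wr_barrier *b) { b->pending_writes++; }
static void write_completed(struct wr_barrier *b) { b->pending_writes--; }

/* True when no overlapping write is outstanding, i.e. the dependent
 * read may now be submitted and is guaranteed to see the write. */
static bool read_may_be_submitted(const struct wr_barrier *b)
{
    return b->pending_writes == 0;
}
```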

Stefan




[virtio-dev] Re: [virtio] Re: [PATCH v10 04/10] admin: introduce virtio admin virtqueues

2023-03-05 Thread Stefan Hajnoczi
On Sun, Mar 05, 2023 at 04:38:59AM -0500, Michael S. Tsirkin wrote:
> On Fri, Mar 03, 2023 at 03:21:33PM -0500, Stefan Hajnoczi wrote:
> > What happens if a command takes 1 second to complete, is the device
> > allowed to process the next command from the virtqueue during this time,
> > possibly completing it before the first command?
> > 
> > This requires additional clarification in the spec because "they are
> > processed by the device in the order in which they are queued" does not
> > explain whether commands block the virtqueue (in order completion) or
> > not (out of order completion).
> 
> Oh I begin to see. Hmm how does e.g. virtio scsi handle this?

virtio-scsi, virtio-blk, and NVMe requests may complete out of order.
Several may be processed by the device at the same time.

They rely on multi-queue for abort operations:

In virtio-scsi the abort requests (VIRTIO_SCSI_T_TMF_ABORT_TASK) are
sent on the control virtqueue. The request identifier namespace is
shared across all virtqueues, so it's possible to abort a request that
was submitted to any command virtqueue.

NVMe also follows the same design where abort commands are sent on the
Admin Submission Queue instead of an I/O Submission Queue. It's possible
to identify NVMe requests by their Submission Queue ID and Command Identifier.

virtio-blk doesn't support aborting requests.

I think the logic behind this design is that if a queue gets stuck
processing long-running requests, then the device should not be forced
to perform lookahead in the queue to find abort commands. A separate
control/admin queue is used for the abort requests.
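The virtio-scsi cross-queue abort can be sketched as follows. The struct layout and constants follow my reading of the virtio-scsi control-queue ABI (little-endian fields on the wire; treat this as illustrative, not authoritative):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Task-management request sent on the virtio-scsi control virtqueue. */
#define VIRTIO_SCSI_T_TMF            0
#define VIRTIO_SCSI_T_TMF_ABORT_TASK 0

struct virtio_scsi_ctrl_tmf {
    uint32_t type;    /* VIRTIO_SCSI_T_TMF */
    uint32_t subtype; /* which task-management function to perform */
    uint8_t  lun[8];  /* addressed logical unit */
    uint64_t id;      /* tag of the request to abort; the tag namespace is
                       * shared across all request virtqueues, which is
                       * what makes cross-queue abort possible */
};

/* Build an ABORT_TASK request for the command whose tag is `tag`. */
static void fill_abort(struct virtio_scsi_ctrl_tmf *tmf,
                       const uint8_t lun[8], uint64_t tag)
{
    memset(tmf, 0, sizeof(*tmf));
    tmf->type    = VIRTIO_SCSI_T_TMF;
    tmf->subtype = VIRTIO_SCSI_T_TMF_ABORT_TASK;
    memcpy(tmf->lun, lun, 8);
    tmf->id      = tag;
}
```

The request would then be queued on the control virtqueue while the aborted command sits on whichever request virtqueue it was submitted to.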

Stefan




[virtio-dev] Re: [PATCH v10 03/10] admin: introduce group administration commands

2023-03-03 Thread Stefan Hajnoczi
On Fri, Mar 03, 2023 at 08:18:43AM -0500, Michael S. Tsirkin wrote:
> On Fri, Mar 03, 2023 at 08:13:37AM -0500, Stefan Hajnoczi wrote:
> > On Thu, Mar 02, 2023 at 06:57:24PM -0500, Michael S. Tsirkin wrote:
> > > On Thu, Mar 02, 2023 at 03:10:11PM -0500, Stefan Hajnoczi wrote:
> > > > On Thu, Mar 02, 2023 at 08:05:02AM -0500, Michael S. Tsirkin wrote:
> > > > > This introduces a general structure for group administration commands,
> > > > > used to control device groups through their owner.
> > > > > 
> > > > > Following patches will introduce specific commands and an interface 
> > > > > for
> > > > > submitting these commands to the owner.
> > > > > 
> > > > > Signed-off-by: Max Gurtovoy 
> > > > > Signed-off-by: Michael S. Tsirkin 
> > > > > ---
> > > > >  admin.tex| 108 
> > > > > +++
> > > > >  introduction.tex |   3 ++
> > > > >  2 files changed, 111 insertions(+)
> > > > > 
> > > > > diff --git a/admin.tex b/admin.tex
> > > > > index 3dc47be..7e28b77 100644
> > > > > --- a/admin.tex
> > > > > +++ b/admin.tex
> > > > > @@ -46,4 +46,112 @@ \section{Device groups}\label{sec:Basic 
> > > > > Facilities of a Virtio Device / Device g
> > > > >  PCI transport (see \ref{sec:Virtio Transport Options / Virtio Over 
> > > > > PCI Bus}).
> > > > >  \end{description}
> > > > >  
> > > > > +\subsection{Group administration commands}\label{sec:Basic 
> > > > > Facilities of a Virtio Device / Device groups / Group administration 
> > > > > commands}
> > > > >  
> > > > > +The driver sends group administration commands to the owner device of
> > > > 
> > > > I notice that the terminology is simply "the driver". "Owner driver"
> > > > and "group member driver" might be clearer because there will be two
> > > > (possibly different) drivers involved.
> > > 
> > > Hmm I don't really want to repeat owner everywhere.
> > > I will clarify that in this section simple "driver" and "device" are
> > > owner, "member device" and "member driver" is always called explicitly.
> > 
> > Sounds good.
> > 
> > > > > +a group to control member devices of the group.
> > > > > +This mechanism can
> > > > > +be used, for example, to configure a member device before it is
> > > > > +initialized by its driver.
> > > > > +\footnote{The term "administration" is intended to be interpreted
> > > > > +widely to include any kind of control. See specific commands
> > > > > +for detail.}
> > > > > +
> > > > > +All the group administration commands are of the following form:
> > > > > +
> > > > > +\begin{lstlisting}
> > > > > +struct virtio_admin_cmd {
> > > > > +/* Device-readable part */
> > > > > +le16 opcode;
> > > > > +/*
> > > > > + * 1 - SR-IOV
> > > > > + * 2 - 65535 reserved
> > > > > + */
> > > > > +le16 group_type;
> > > > > +/* unused, reserved for future extensions */
> > > > > +u8 reserved1[12];
> > > > > +le64 group_member_id;
> > > > > +u8 command_specific_data[];
> > > > > +
> > > > > +/* Device-writable part */
> > > > > +le16 status;
> > > > > +le16 status_qualifier;
> > > > > +/* unused, reserved for future extensions */
> > > > > +u8 reserved2[4];
> > > > > +u8 command_specific_result[];
> > > > > +};
> > > > > +\end{lstlisting}
> > > > > +
> > > > > +For all commands, \field{opcode}, \field{group_type} and if
> > > > > +necessary \field{group_member_id} and \field{command_specific_data} 
> > > > > are
> > > > > +set by the driver, and the owner device sets \field{status} and if
> > > > > +needed \field{status_qualifier} and
> > > > > +\field{command_specific_result}.
> > > > > +
> > >

[virtio-dev] Re: [virtio] Re: [PATCH v10 04/10] admin: introduce virtio admin virtqueues

2023-03-03 Thread Stefan Hajnoczi
On Fri, Mar 03, 2023 at 08:37:31AM -0500, Michael S. Tsirkin wrote:
> On Fri, Mar 03, 2023 at 08:28:40AM -0500, Stefan Hajnoczi wrote:
> > On Thu, Mar 02, 2023 at 07:05:14PM -0500, Michael S. Tsirkin wrote:
> > > On Thu, Mar 02, 2023 at 03:40:07PM -0500, Stefan Hajnoczi wrote:
> > > > On Thu, Mar 02, 2023 at 08:05:06AM -0500, Michael S. Tsirkin wrote:
> > > > > The admin virtqueues will be the first interface to issue admin 
> > > > > commands.
> > > > > 
> > > > > Currently virtio specification defines control virtqueue to manipulate
> > > > > features and configuration of the device it operates on. However,
> > > > > control virtqueue commands are device type specific, which makes it 
> > > > > very
> > > > > difficult to extend for device agnostic commands.
> > > > > 
> > > > > To support this requirement in a more generic way, this patch 
> > > > > introduces
> > > > > a new admin virtqueue interface.
> > > > 
> > > > Is this referring to the existing virtio-net, virtio-scsi, etc control
> > > > virtqueues?
> > > > 
> > > > I see the admin virtqueue as the virtqueue equivalent to transport
> > > > feature bits. The admin queue does nothing device type-specific (net,
> > > > scsi, etc) and instead focusses on transport and core virtio tasks.
> > > > 
> > > > Keeping the device-specific virtqueue separate from the admin virtqueue
> > > > is simpler and has fewer potential problems. I don't think creating
> > > > common infrastructure for device-specific control virtqueues across
> > > > device types worthwhile or within the scope of this patch series.
> > > 
> > > yes this commit log is outdated. referred to original
> > > proposal by Max which also planned to replace cvq.
> > > will update.
> > > 
> > > 
> > > > > We also support more than one admin virtqueue, for QoS and
> > > > > scalability requirements.
> > > > > 
> > > > > Signed-off-by: Max Gurtovoy 
> > > > > Signed-off-by: Michael S. Tsirkin 
> > > > > ---
> > > > >  admin.tex   | 74 
> > > > > +
> > > > >  content.tex |  7 +++--
> > > > >  2 files changed, 79 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/admin.tex b/admin.tex
> > > > > index 7e28b77..3ffac2e 100644
> > > > > --- a/admin.tex
> > > > > +++ b/admin.tex
> > > > > @@ -155,3 +155,77 @@ \subsection{Group administration 
> > > > > commands}\label{sec:Basic Facilities of a Virti
> > > > >  \field{command_specific_data} and \field{command_specific_result}
> > > > >  depends on these structures and is described separately or is
> > > > >  implicit in the structure description.
> > > > > +
> > > > > +\section{Administration Virtqueues}\label{sec:Basic Facilities of a 
> > > > > Virtio Device / Administration Virtqueues}
> > > > > +
> > > > > +An administration virtqueue of an owner device is used to submit
> > > > > +group administration commands. An owner device can have more
> > > > > +than one administration virtqueue.
> > > > > +
> > > > > +If VIRTIO_F_ADMIN_VQ has been negotiated, an owner device exposes one
> > > > > +or more administration virtqueues. The number and locations of the
> > > > > +administration virtqueues are exposed by the owner device in a 
> > > > > transport
> > > > > +specific manner.
> > > > > +
> > > > > +The driver submits commands by queueing them to an arbitrary
> > > > > +administration virtqueue, and they are processed by the
> > > > > +device in the order in which they are queued. It is the
> > > > > +responsibility of the driver to ensure ordering for commands
> > > > > +placed on different administration virtqueues, because they will
> > > > > +be executed with no order constraints.
> > > > 
> > > > Does "they are processed by the device in the order in which they are
> > > > queued" mean only 1 admin command can be running at any given time and
> > > > therefore they will complete in order? This would allow pipelining
> > > >

[virtio-dev] Re: [PATCH v10 04/10] admin: introduce virtio admin virtqueues

2023-03-03 Thread Stefan Hajnoczi
On Thu, Mar 02, 2023 at 07:05:14PM -0500, Michael S. Tsirkin wrote:
> On Thu, Mar 02, 2023 at 03:40:07PM -0500, Stefan Hajnoczi wrote:
> > On Thu, Mar 02, 2023 at 08:05:06AM -0500, Michael S. Tsirkin wrote:
> > > The admin virtqueues will be the first interface to issue admin commands.
> > > 
> > > Currently virtio specification defines control virtqueue to manipulate
> > > features and configuration of the device it operates on. However,
> > > control virtqueue commands are device type specific, which makes it very
> > > difficult to extend for device agnostic commands.
> > > 
> > > To support this requirement in a more generic way, this patch introduces
> > > a new admin virtqueue interface.
> > 
> > Is this referring to the existing virtio-net, virtio-scsi, etc control
> > virtqueues?
> > 
> > I see the admin virtqueue as the virtqueue equivalent to transport
> > feature bits. The admin queue does nothing device type-specific (net,
> > scsi, etc) and instead focusses on transport and core virtio tasks.
> > 
> > Keeping the device-specific virtqueue separate from the admin virtqueue
> > is simpler and has fewer potential problems. I don't think creating
> > common infrastructure for device-specific control virtqueues across
> > device types worthwhile or within the scope of this patch series.
> 
> yes this commit log is outdated. referred to original
> proposal by Max which also planned to replace cvq.
> will update.
> 
> 
> > > We also support more than one admin virtqueue, for QoS and
> > > scalability requirements.
> > > 
> > > Signed-off-by: Max Gurtovoy 
> > > Signed-off-by: Michael S. Tsirkin 
> > > ---
> > >  admin.tex   | 74 +
> > >  content.tex |  7 +++--
> > >  2 files changed, 79 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/admin.tex b/admin.tex
> > > index 7e28b77..3ffac2e 100644
> > > --- a/admin.tex
> > > +++ b/admin.tex
> > > @@ -155,3 +155,77 @@ \subsection{Group administration 
> > > commands}\label{sec:Basic Facilities of a Virti
> > >  \field{command_specific_data} and \field{command_specific_result}
> > >  depends on these structures and is described separately or is
> > >  implicit in the structure description.
> > > +
> > > +\section{Administration Virtqueues}\label{sec:Basic Facilities of a 
> > > Virtio Device / Administration Virtqueues}
> > > +
> > > +An administration virtqueue of an owner device is used to submit
> > > +group administration commands. An owner device can have more
> > > +than one administration virtqueue.
> > > +
> > > +If VIRTIO_F_ADMIN_VQ has been negotiated, an owner device exposes one
> > > +or more administration virtqueues. The number and locations of the
> > > +administration virtqueues are exposed by the owner device in a transport
> > > +specific manner.
> > > +
> > > +The driver submits commands by queueing them to an arbitrary
> > > +administration virtqueue, and they are processed by the
> > > +device in the order in which they are queued. It is the
> > > +responsibility of the driver to ensure ordering for commands
> > > +placed on different administration virtqueues, because they will
> > > +be executed with no order constraints.
> > 
> > Does "they are processed by the device in the order in which they are
> > queued" mean only 1 admin command can be running at any given time and
> > therefore they will complete in order? This would allow pipelining
> > dependent commands but prevent long-running commands because the
> > virtqueue is blocked while executing a command.
> 
> we have multiple vqs for that.

This reminds me of the async challenges with QEMU's QMP monitor.

Will it ever be possible to abort an in-flight command? I guess the
abort command would need to be submitted on another virtqueue, but how
do you identify the in-flight command that you wish to cancel?

Please clarify the blocking semantics in the spec because it wasn't
clear to me.

Thanks,
Stefan




[virtio-dev] Re: [PATCH v10 03/10] admin: introduce group administration commands

2023-03-03 Thread Stefan Hajnoczi
On Thu, Mar 02, 2023 at 07:01:56PM -0500, Michael S. Tsirkin wrote:
> On Thu, Mar 02, 2023 at 03:19:12PM -0500, Stefan Hajnoczi wrote:
> > On Thu, Mar 02, 2023 at 06:40:29PM +, Parav Pandit wrote:
> > > 
> > > > From: Michael S. Tsirkin 
> > > > Sent: Thursday, March 2, 2023 8:05 AM
> > > 
> > > > +When \field{status} is VIRTIO_ADMIN_STATUS_OK, \field{status_qialifier}
> > > > +is reserved and set to zero by the device.
> > > > +
> > > s/status_qialifier/status_qualifier
> > > Missed from v10 of Feb.
> > > 
> > > > +When \field{status} is VIRTIO_ADMIN_STATUS_EINVAL, the following table
> > > > +describes possible \field{status_qialifier} values:
> > > s/status_qialifier/status_qualifier
> > > 
> > > Can you please add other useful error codes in addition to the EINVAL?
> > > Few that we are needed EAGAIN, ENOMEM, EBUSY, ENODEV.
> > 
> > Please define a unique constant for each error condition that can occur
> > instead of sharing catch-all errno constants between multiple error
> > conditions. If a driver wants to squash them together into an errno,
> > that's fine, but I think doing this at the hardware interface level is
> > just propagating the mistakes of errnos.
> > 
> > Only status_qualifier is needed and the vague status field can be
> > dropped. It's not clear to me why adding EAGAIN, ENOMEM, EBUSY, and
> > ENODEV is useful. They have no meaning to the driver, only the
> > status_qualifier really indicates what is going on.
> 
> At a high level at the moment we have only two cases:
> - ok
> - invalid input supplied by driver
> 
> maybe we will have more reasons for a failure - remains to
> be seen.
> 
> 
> 
> 
> 
> > 
> > I'm sure you guys have discussed this previously, but please provide
> > rationale in the spec because it looks weird to someone with fresh eyes.
> > 
> > Stefan
> 
> Really most drivers just want to propagate errno to userspace.
> All the detailed reporting is for sure well intentional but
> in the end it is at best printed into log - end to end
> people just end up with a switch statement
> converting these to errno codes.
> So we are passing them from device and this way there will be
> some uniformity.

Please clarify the rationale in the spec. I don't agree with it, as
explained in my earlier email, but as long as its documented, people can
try to follow it.
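For what it's worth, the "switch statement converting these to errno codes" that Michael describes might look like this in a driver. The status constants are illustrative placeholders, not the spec's final values (at this point the draft only defines OK and EINVAL):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Illustrative status values -- assumed for this sketch. */
enum {
    VIRTIO_ADMIN_STATUS_OK     = 0,
    VIRTIO_ADMIN_STATUS_EINVAL = 1,
};

/* Collapse the device's (status, status_qualifier) pair into an errno,
 * as drivers typically do before reporting to userspace.  The qualifier
 * carries the detailed reason and would normally just be logged. */
static int admin_status_to_errno(uint16_t status, uint16_t qualifier)
{
    (void)qualifier; /* detailed reason: log it rather than map it */
    switch (status) {
    case VIRTIO_ADMIN_STATUS_OK:     return 0;
    case VIRTIO_ADMIN_STATUS_EINVAL: return -EINVAL;
    default:                         return -EIO; /* unknown status */
    }
}
```

This illustrates both positions in the thread: the errno-style status is what the driver propagates, while the qualifier is the part that actually says what went wrong.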

Stefan




[virtio-dev] Re: [PATCH v10 03/10] admin: introduce group administration commands

2023-03-03 Thread Stefan Hajnoczi
On Thu, Mar 02, 2023 at 06:57:24PM -0500, Michael S. Tsirkin wrote:
> On Thu, Mar 02, 2023 at 03:10:11PM -0500, Stefan Hajnoczi wrote:
> > On Thu, Mar 02, 2023 at 08:05:02AM -0500, Michael S. Tsirkin wrote:
> > > This introduces a general structure for group administration commands,
> > > used to control device groups through their owner.
> > > 
> > > Following patches will introduce specific commands and an interface for
> > > submitting these commands to the owner.
> > > 
> > > Signed-off-by: Max Gurtovoy 
> > > Signed-off-by: Michael S. Tsirkin 
> > > ---
> > >  admin.tex| 108 +++
> > >  introduction.tex |   3 ++
> > >  2 files changed, 111 insertions(+)
> > > 
> > > diff --git a/admin.tex b/admin.tex
> > > index 3dc47be..7e28b77 100644
> > > --- a/admin.tex
> > > +++ b/admin.tex
> > > @@ -46,4 +46,112 @@ \section{Device groups}\label{sec:Basic Facilities of 
> > > a Virtio Device / Device g
> > >  PCI transport (see \ref{sec:Virtio Transport Options / Virtio Over PCI 
> > > Bus}).
> > >  \end{description}
> > >  
> > > +\subsection{Group administration commands}\label{sec:Basic Facilities of 
> > > a Virtio Device / Device groups / Group administration commands}
> > >  
> > > +The driver sends group administration commands to the owner device of
> > 
> > I notice that the terminology is simply "the driver". "Owner driver"
> > and "group member driver" might be clearer because there will be two
> > (possibly different) drivers involved.
> 
> Hmm I don't really want to repeat owner everywhere.
> I will clarify that in this section simple "driver" and "device" are
> owner, "member device" and "member driver" is always called explicitly.

Sounds good.

> > > +a group to control member devices of the group.
> > > +This mechanism can
> > > +be used, for example, to configure a member device before it is
> > > +initialized by its driver.
> > > +\footnote{The term "administration" is intended to be interpreted
> > > +widely to include any kind of control. See specific commands
> > > +for detail.}
> > > +
> > > +All the group administration commands are of the following form:
> > > +
> > > +\begin{lstlisting}
> > > +struct virtio_admin_cmd {
> > > +/* Device-readable part */
> > > +le16 opcode;
> > > +/*
> > > + * 1 - SR-IOV
> > > + * 2 - 65535 reserved
> > > + */
> > > +le16 group_type;
> > > +/* unused, reserved for future extensions */
> > > +u8 reserved1[12];
> > > +le64 group_member_id;
> > > +u8 command_specific_data[];
> > > +
> > > +/* Device-writable part */
> > > +le16 status;
> > > +le16 status_qualifier;
> > > +/* unused, reserved for future extensions */
> > > +u8 reserved2[4];
> > > +u8 command_specific_result[];
> > > +};
> > > +\end{lstlisting}
> > > +
> > > +For all commands, \field{opcode}, \field{group_type} and if
> > > +necessary \field{group_member_id} and \field{command_specific_data} are
> > > +set by the driver, and the owner device sets \field{status} and if
> > > +needed \field{status_qualifier} and
> > > +\field{command_specific_result}.
> > > +
> > > +Generally, any unused device-readable fields are set to zero by the 
> > > driver
> > > +and ignored by the device.  Any unused device-writeable fields are set 
> > > to zero
> > > +by the device and ignored by the driver.
> > > +
> > > +\field{opcode} specifies the command. The valid
> > > +values for \field{opcode} can be found in the following table:
> > > +
> > > +\begin{tabular}{|l|l|}
> > > +\hline
> > > +opcode & Name & Command Description \\
> > > +\hline \hline
> > > +0x0000 - 0x7FFF & - &  Group administration commands\\
> > > +\hline
> > > +0x8000 - 0xFFFF & - & Reserved\\
> > > +\hline
> > > +\end{tabular}
> > 
> > I thought all commands are "group administration commands" but this
> > table makes it look like they are just a subset (0x - 0x7FFF) of
> > group administration commands, which is 

[virtio-dev] Re: [PATCH v10 10/10] ccw: document more reserved features

2023-03-02 Thread Stefan Hajnoczi
On Thu, Mar 02, 2023 at 08:05:30AM -0500, Michael S. Tsirkin wrote:
> vq reset and shared memory are unsupported, too.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  content.tex | 2 ++
>  1 file changed, 2 insertions(+)

Reviewed-by: Stefan Hajnoczi 




[virtio-dev] Re: [PATCH v10 08/10] admin: command list discovery

2023-03-02 Thread Stefan Hajnoczi
On Thu, Mar 02, 2023 at 08:05:22AM -0500, Michael S. Tsirkin wrote:
> Add commands to find out which commands each group supports,
> as well as enable their use by driver.
> This will be especially useful once we have multiple group types.
> 
> An alternative is per-type VQs. This is possible but will
> require more per-transport work. Discovery through the vq
> helps keep things contained.
> 
> Signed-off-by: Max Gurtovoy 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  admin.tex | 97 ++-
>  1 file changed, 96 insertions(+), 1 deletion(-)
> 
> diff --git a/admin.tex b/admin.tex
> index 3ffac2e..1172054 100644
> --- a/admin.tex
> +++ b/admin.tex
> @@ -99,7 +99,9 @@ \subsection{Group administration commands}\label{sec:Basic 
> Facilities of a Virti
>  \hline
>  opcode & Name & Command Description \\
>  \hline \hline
> -0x0000 - 0x7FFF & - &  Group administration commands\\
> +0x0000 & VIRTIO_ADMIN_CMD_LIST_QUERY & Provides to driver list of commands 
> supported for this group type\\
> +0x0001 & VIRTIO_ADMIN_CMD_LIST_USE & Provides to device list of commands 
> used for this group type \\
> +0x0002 - 0x7FFF & - &  Group administration commands\\
>  \hline
> 0x8000 - 0xFFFF & - & Reserved\\
>  \hline
> @@ -156,6 +158,99 @@ \subsection{Group administration 
> commands}\label{sec:Basic Facilities of a Virti
>  depends on these structures and is described separately or is
>  implicit in the structure description.
>  
> +Before sending any administration commands to the device, the driver
> +needs to communicate to the device which commands it is going to
> +use. Initially (after reset), only two commands are assumed to be used:
> +VIRTIO_ADMIN_CMD_LIST_QUERY and VIRTIO_ADMIN_CMD_LIST_USE.
> +
> +Before sending any other commands for any member of a specific group to
> +the device, the driver queries the supported commands via
> +VIRTIO_ADMIN_CMD_LIST_QUERY and sends the commands it will use via

"will use" probably means "is capable of using" rather than "is sure to
use"? It might be worth tweaking the language here to make that clear.

> +VIRTIO_ADMIN_CMD_LIST_USE.
> +
> +Commands VIRTIO_ADMIN_CMD_LIST_QUERY and
> +VIRTIO_ADMIN_CMD_LIST_USE
> +both use the following structure describing the
> +command opcodes:
> +
> +\begin{lstlisting}
> +struct virtio_admin_cmd_list {
> +   /* Indicates which of the below fields were returned */
> +   le64 device_admin_cmds[];
> +};
> +\end{lstlisting}
> +
> +This structure is an array of 64 bit values in little-endian byte
> +order, in which a bit is set if the specific command opcode
> +is supported. Thus, \field{device_admin_cmds[0]} refers to the first 64-bit
> +value
> +in this array corresponding to opcodes 0 to 63,
> +\field{device_admin_cmds[1]} is the second 64-bit value
> +corresponding to opcodes 64 to 127, etc.
> +For example, the array of size 2 including
> +the values 0x3 in \field{device_admin_cmds[0]}
> +and 0x1 in \field{device_admin_cmds[1]} indicates that only
> +opcodes 0, 1 and 64 are supported.
> +The length of the array depends on the supported opcodes - it is
> +large enough to include bits set for all supported opcodes,
> +that is the length can be calculated by starting with the largest
> +supported opcode adding one, dividing by 64 and rounding up.
> +In other words, for
> +VIRTIO_ADMIN_CMD_LIST_QUERY and VIRTIO_ADMIN_CMD_LIST_USE the
> +length of \field{command_specific_result} and
> +\field{command_specific_data} respectively will be
> +$DIV_ROUND_UP(max_cmd, 64) * 8$ where DIV_ROUND_UP is integer division
> +with round up and \field{max_cmd} is the largest available command opcode.
> +
> +The array is also allowed to be larger and to additionally
> +include an arbitrary number of all-zero entries.
> +
> +Accordingly, bits 0 and 1 corresponding to opcode 0
> +(VIRTIO_ADMIN_CMD_LIST_QUERY) and 1
> +(VIRTIO_ADMIN_CMD_LIST_USE) are
> +always set in \field{device_admin_cmds[0]} returned by 
> VIRTIO_ADMIN_CMD_LIST_QUERY.
> +
> +For the command VIRTIO_ADMIN_CMD_LIST_QUERY, \field{opcode} is set to 0x0.
> +The \field{group_member_id} is unused. It is set to zero by driver.
> +This command has no command specific data.
> +The device, upon success, returns a result in
> +\field{command_specific_result} in the format
> +\field{struct virtio_admin_cmd_list} describing the
> +list of administration commands supported for the given group.
> +
> +
> +For the command VIRTIO_ADMIN_CMD_LIST_USE, \field{opcode}
> +is set to 0x1.
> +The \field{group_member_id} is unused. It is set to zero by driver.
> +The \field{command_specific_data} is in the format
> +\field{struct virtio_admin_cmd_list} describing the
> +list of administration commands used by the driver.
> +This command has no command specific result.
> +
> +The driver issues the command VIRTIO_ADMIN_CMD_LIST_QUERY to
> +query the list of commands valid for this group and before sending
> +any commands for any member of a group.
> +
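The bitmap arithmetic described in the quoted text can be sketched as follows (helper names are mine; I'm assuming the DIV_ROUND_UP(max_opcode + 1, 64) reading of the sizing rule, which matches the "opcodes 0, 1 and 64 need an array of size 2" example):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Minimum byte length of the command-list array when `max_opcode` is the
 * largest supported opcode: one le64 word covers 64 opcodes. */
static size_t admin_cmd_list_len(uint16_t max_opcode)
{
    return (size_t)((max_opcode + 1 + 63) / 64) * 8;
}

/* Check whether `opcode` is marked supported in an array of `n` words.
 * Opcodes beyond the array (including all-zero padding entries) are
 * simply unsupported. */
static bool admin_cmd_supported(const uint64_t *cmds, size_t n,
                                uint16_t opcode)
{
    if (opcode / 64 >= n)
        return false;
    return (cmds[opcode / 64] >> (opcode % 64)) & 1;
}
```

With the spec's example array {0x3, 0x1}, opcodes 0, 1 and 64 test as supported and everything else as unsupported.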

[virtio-dev] Re: [PATCH v10 07/10] ccw: document ADMIN_VQ as reserved

2023-03-02 Thread Stefan Hajnoczi
On Thu, Mar 02, 2023 at 08:05:18AM -0500, Michael S. Tsirkin wrote:
> Adding relevant registers needs more work and it's not
> clear what the use-case will be as currently only
> the PCI transport is supported. But let's keep the
> door open on this.
> We already say it's reserved in a central place, but it
> does not hurt to remind implementers to mask it.
> 
> Note: there are more features to add to this list.
> Will be done later with a patch on top.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  content.tex | 12 
>  1 file changed, 12 insertions(+)

Reviewed-by: Stefan Hajnoczi 




[virtio-dev] Re: [PATCH v10 06/10] mmio: document ADMIN_VQ as reserved

2023-03-02 Thread Stefan Hajnoczi
On Thu, Mar 02, 2023 at 08:05:14AM -0500, Michael S. Tsirkin wrote:
> Adding relevant registers needs more work and it's not
> clear what the use-case will be as currently only
> the PCI transport is supported. But let's keep the
> door open on this.
> We already say it's reserved in a central place, but it
> does not hurt to remind implementers to mask it.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  content.tex | 12 
>  1 file changed, 12 insertions(+)

Reviewed-by: Stefan Hajnoczi 




[virtio-dev] Re: [PATCH v10 05/10] pci: add admin vq registers to virtio over pci

2023-03-02 Thread Stefan Hajnoczi
On Thu, Mar 02, 2023 at 08:05:10AM -0500, Michael S. Tsirkin wrote:
> Add new registers to the PCI common configuration structure.
> 
> These registers will be used for querying the indices of the admin
> virtqueues of the owner device. To configure, reset or enable the admin
> virtqueues, the driver should follow existing queue configuration/setup
> sequence.
> 
> Signed-off-by: Max Gurtovoy 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  content.tex | 35 +++
>  1 file changed, 35 insertions(+)
> 
> diff --git a/content.tex b/content.tex
> index c8647c9..1076125 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -946,6 +946,10 @@ \subsubsection{Common configuration structure 
> layout}\label{sec:Virtio Transport
>  le64 queue_device;  /* read-write */
>  le16 queue_notify_data; /* read-only for driver */
>  le16 queue_reset;   /* read-write */
> +
> +/* About the administration virtqueue. */
> +le16 admin_queue_index; /* read-only for driver */
> +le16 admin_queue_num; /* read-only for driver */
>  };
>  \end{lstlisting}
>  
> @@ -1031,6 +1035,20 @@ \subsubsection{Common configuration structure 
> layout}\label{sec:Virtio Transport
>  This field exists only if VIRTIO_F_RING_RESET has been
>  negotiated. (see \ref{sec:Basic Facilities of a Virtio Device / 
> Virtqueues / Virtqueue Reset}).
>  
> +\item[\field{admin_queue_index}]
> +The device uses this to report the index of the first administration 
> virtqueue.
> +This field is valid only if VIRTIO_F_ADMIN_VQ has been negotiated.
> +\item[\field{admin_queue_num}]
> + The device uses this to report the number of the
> + supported administration virtqueues.
> + This is a zero based value.  Virtqueues with index
> + between \field{admin_queue_index} and (\field{admin_queue_index} +
> + \field{admin_queue_num}) inclusive serve as administration
> + virtqueues.
> + Thus the number of administration virtqueues equals
> + (\field{admin_queue_num} + 1).

Why is this value zero-based? virtio_pci_common_cfg's num_queues field
is le16 and not zero-based. It is not possible to have
admin_queue_index=0 admin_queue_num=0xFFFF because num_queues=0x10000
cannot be represented.

I suggest making this field behave the same as virtio_pci_common_cfg's
num_queues.
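To illustrate the asymmetry: under the patch's zero-based encoding the virtqueue count is always the stored value plus one, so zero admin virtqueues cannot be expressed, unlike a num_queues-style field. Helper names here are mine:

```c
#include <assert.h>
#include <stdint.h>

/* Zero-based encoding from the quoted patch: a stored value of N means
 * N + 1 administration virtqueues.  Note the range of expressible
 * counts is 1..0x10000 -- "zero admin VQs" has no representation. */
static uint32_t admin_vq_count_zero_based(uint16_t admin_queue_num)
{
    return (uint32_t)admin_queue_num + 1;
}

/* Index of the last administration virtqueue under that encoding. */
static uint32_t admin_vq_last_index(uint16_t admin_queue_index,
                                    uint16_t admin_queue_num)
{
    return (uint32_t)admin_queue_index + admin_queue_num;
}
```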

> + This field is valid only if VIRTIO_F_ADMIN_VQ has been
> + negotiated.
>  \end{description}
>  
>  \devicenormative{\paragraph}{Common configuration structure layout}{Virtio 
> Transport Options / Virtio Over PCI Bus / PCI Device Layout / Common 
> configuration structure layout}
> @@ -1117,6 +1135,14 @@ \subsubsection{Common configuration structure 
> layout}\label{sec:Virtio Transport
>  were used before the queue reset.
>  (see \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Virtqueue 
> Reset}).
>  
> +If VIRTIO_F_ADMIN_VQ has been negotiated, and if the driver
> +configures any administration virtqueues, the driver MUST
> +configure the administration virtqueues using the index
> +in the range \field{admin_queue_index} to
> +\field{admin_queue_index} + \field{admin_queue_num} inclusive.
> +The driver MAY configure less administration virtqueues than

s/less/fewer/

> +supported by the device.
> +
>  \subsubsection{Notification structure layout}\label{sec:Virtio Transport 
> Options / Virtio Over PCI Bus / PCI Device Layout / Notification capability}
>  
>  The notification location is found using the VIRTIO_PCI_CAP_NOTIFY_CFG
> @@ -7684,6 +7710,15 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved 
> Feature Bits}
>  
>\item[VIRTIO_F_ADMIN_VQ(41)] This feature indicates that the device 
> exposes one or more
>administration virtqueues.
> +  At the moment this feature is only supported for devices using
> +  \ref{sec:Virtio Transport Options / Virtio Over PCI
> +   Bus}~\nameref{sec:Virtio Transport Options / Virtio Over PCI Bus}
> +   as the transport and is reserved for future use for
> +   devices using other transports (see
> +   \ref{drivernormative:Basic Facilities of a Virtio Device / Feature 
> Bits}
> + and
> + \ref{devicenormative:Basic Facilities of a Virtio Device / Feature 
> Bits} for
> + handling features reserved for future use.
>  
>  \end{description}
>  
> -- 
> MST
> 


signature.asc
Description: PGP signature


[virtio-dev] Re: [PATCH v10 04/10] admin: introduce virtio admin virtqueues

2023-03-02 Thread Stefan Hajnoczi
On Thu, Mar 02, 2023 at 08:05:06AM -0500, Michael S. Tsirkin wrote:
> The admin virtqueues will be the first interface to issue admin commands.
> 
> Currently virtio specification defines control virtqueue to manipulate
> features and configuration of the device it operates on. However,
> control virtqueue commands are device type specific, which makes it very
> difficult to extend for device agnostic commands.
> 
> To support this requirement in a more generic way, this patch introduces
> a new admin virtqueue interface.

Is this referring to the existing virtio-net, virtio-scsi, etc control
virtqueues?

I see the admin virtqueue as the virtqueue equivalent to transport
feature bits. The admin queue does nothing device type-specific (net,
scsi, etc) and instead focusses on transport and core virtio tasks.

Keeping the device-specific virtqueue separate from the admin virtqueue
is simpler and has fewer potential problems. I don't think creating
common infrastructure for device-specific control virtqueues across
device types is worthwhile or within the scope of this patch series.

> We also support more than one admin virtqueue, for QoS and
> scalability requirements.
> 
> Signed-off-by: Max Gurtovoy 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  admin.tex   | 74 +
>  content.tex |  7 +++--
>  2 files changed, 79 insertions(+), 2 deletions(-)
> 
> diff --git a/admin.tex b/admin.tex
> index 7e28b77..3ffac2e 100644
> --- a/admin.tex
> +++ b/admin.tex
> @@ -155,3 +155,77 @@ \subsection{Group administration 
> commands}\label{sec:Basic Facilities of a Virti
>  \field{command_specific_data} and \field{command_specific_result}
>  depends on these structures and is described separately or is
>  implicit in the structure description.
> +
> +\section{Administration Virtqueues}\label{sec:Basic Facilities of a Virtio 
> Device / Administration Virtqueues}
> +
> +An administration virtqueue of an owner device is used to submit
> +group administration commands. An owner device can have more
> +than one administration virtqueue.
> +
> +If VIRTIO_F_ADMIN_VQ has been negotiated, an owner device exposes one
> +or more adminstration virtqueues. The number and locations of the
> +administration virtqueues are exposed by the owner device in a transport
> +specific manner.
> +
> +The driver submits commands by queueing them to an arbitrary
> +administration virtqueue, and they are processed by the
> +device in the order in which they are queued. It is the
> +responsibility of the driver to ensure ordering for commands
> +placed on different administration virtqueues, because they will
> +be executed with no order constraints.

Does "they are processed by the device in the order in which they are
queued" mean only 1 admin command can be running at any given time and
therefore they will complete in order? This would allow pipelining
dependent commands but prevent long-running commands because the
virtqueue is blocked while executing a command.

> +
> +Administration virtqueues are used as follows:
> +\begin{itemize}
> +\item The driver submits the command using the \field{struct 
> virtio_admin_cmd}
> +structure using a buffer consisting of two parts: a device-readable one 
> followed by a
> +device-writable one.
> +\item the device-readable part includes fields from \field{opcode}
> +through \field{command_specific_data}.
> +\item the device-writeable buffer includes fields from \field{status}
> +through \field{command_specific_result} inclusive.
> +\end{itemize}
> +
> +For each command, this specification describes a distinct
> +format structure used for \field{command_specific_data} and
> +\field{command_specific_result}, the length of these fields
> +depends on the command.
> +
> +However, to ensure forward compatibility
> +\begin{itemize}
> +\item drivers are allowed to submit buffers that are longer
> +than what the device expects
> +(that is, longer than the length of
> +\field{opcode} through \field{command_specific_data}).
> +This allows the driver to maintain
> +a single format structure even if some structure fields are
> +unused by the device.
> +\item drivers are allowed to submit buffers that are shorter
> +than what the device expects
> +(that is, shorter than the length of \field{status} through
> +\field{command_specific_result}). This allows the device to maintain
> +a single format structure even if some structure fields are
> +unused by the driver.
> +\end{itemize}
> +
> +The device compares the length of each part (device-readable and
> +device-writeable) of the buffer as submitted by driver to what it
> +expects and then silently truncates the structures to either the
> +length submitted by the driver, or the length described in this
> +specification, whichever is shorter.  The device silently ignores
> +any data falling outside the shorter of the two lengths. Any
> +missing fields are interpreted as set to zero.
> +
> +Similarly, the driver compares the used 

[virtio-dev] Re: [PATCH v10 03/10] admin: introduce group administration commands

2023-03-02 Thread Stefan Hajnoczi
On Thu, Mar 02, 2023 at 06:40:29PM +, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin 
> > Sent: Thursday, March 2, 2023 8:05 AM
> 
> > +When \field{status} is VIRTIO_ADMIN_STATUS_OK, \field{status_qialifier}
> > +is reserved and set to zero by the device.
> > +
> s/status_qialifier/status_qualifier
> Missed from v10 of Feb.
> 
> > +When \field{status} is VIRTIO_ADMIN_STATUS_EINVAL, the following table
> > +describes possible \field{status_qialifier} values:
> s/status_qialifier/status_qualifier
> 
> Can you please add other useful error codes in addition to the EINVAL?
> Few that we are needed EAGAIN, ENOMEM, EBUSY, ENODEV.

Please define a unique constant for each error condition that can occur
instead of sharing catch-all errno constants between multiple error
conditions. If a driver wants to squash them together into an errno,
that's fine, but I think doing this at the hardware interface level is
just propagating the mistakes of errnos.

Only status_qualifier is needed and the vague status field can be
dropped. It's not clear to me why adding EAGAIN, ENOMEM, EBUSY, and
ENODEV is useful. They have no meaning to the driver, only the
status_qualifier really indicates what is going on.

I'm sure you guys have discussed this previously, but please provide
rationale in the spec because it looks weird to someone with fresh eyes.

Stefan




[virtio-dev] Re: [PATCH v10 03/10] admin: introduce group administration commands

2023-03-02 Thread Stefan Hajnoczi
On Thu, Mar 02, 2023 at 08:05:02AM -0500, Michael S. Tsirkin wrote:
> This introduces a general structure for group administration commands,
> used to control device groups through their owner.
> 
> Following patches will introduce specific commands and an interface for
> submitting these commands to the owner.
> 
> Signed-off-by: Max Gurtovoy 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  admin.tex| 108 +++
>  introduction.tex |   3 ++
>  2 files changed, 111 insertions(+)
> 
> diff --git a/admin.tex b/admin.tex
> index 3dc47be..7e28b77 100644
> --- a/admin.tex
> +++ b/admin.tex
> @@ -46,4 +46,112 @@ \section{Device groups}\label{sec:Basic Facilities of a 
> Virtio Device / Device g
>  PCI transport (see \ref{sec:Virtio Transport Options / Virtio Over PCI Bus}).
>  \end{description}
>  
> +\subsection{Group administration commands}\label{sec:Basic Facilities of a 
> Virtio Device / Device groups / Group administration commands}
>  
> +The driver sends group administration commands to the owner device of

I notice that the terminology is simply "the driver". "Owner driver"
and "group member driver" might be clearer because there will be two
(possibly different) drivers involved.

> +a group to control member devices of the group.
> +This mechanism can
> +be used, for example, to configure a member device before it is
> +initialized by its driver.
> +\footnote{The term "administration" is intended to be interpreted
> +widely to include any kind of control. See specific commands
> +for detail.}
> +
> +All the group administration commands are of the following form:
> +
> +\begin{lstlisting}
> +struct virtio_admin_cmd {
> +/* Device-readable part */
> +le16 opcode;
> +/*
> + * 1 - SR-IOV
> + * 2 - 65535 reserved
> + */
> +le16 group_type;
> +/* unused, reserved for future extensions */
> +u8 reserved1[12];
> +le64 group_member_id;
> +u8 command_specific_data[];
> +
> +/* Device-writable part */
> +le16 status;
> +le16 status_qualifier;
> +/* unused, reserved for future extensions */
> +u8 reserved2[4];
> +u8 command_specific_result[];
> +};
> +\end{lstlisting}
> +
> +For all commands, \field{opcode}, \field{group_type} and if
> +necessary \field{group_member_id} and \field{command_specific_data} are
> +set by the driver, and the owner device sets \field{status} and if
> +needed \field{status_qualifier} and
> +\field{command_specific_result}.
> +
> +Generally, any unused device-readable fields are set to zero by the driver
> +and ignored by the device.  Any unused device-writeable fields are set to 
> zero
> +by the device and ignored by the driver.
> +
> +\field{opcode} specifies the command. The valid
> +values for \field{opcode} can be found in the following table:
> +
> +\begin{tabular}{|l|l|}
> +\hline
> +opcode & Name & Command Description \\
> +\hline \hline
> +0x - 0x7FFF & - &  Group administration commands\\
> +\hline
> +0x8000 - 0x & - & Reserved\\
> +\hline
> +\end{tabular}

I thought all commands are "group administration commands" but this
table makes it look like they are just a subset (0x - 0x7FFF) of
group administration commands, which is a paradox.

> +
> +The \field{group_type} specifies the group type identifier.
> +The \field{group_member_id} specifies the member identifier within the group.
> +See section \ref{sec:Introduction / Terminology / Device group}
> +for the definition of the group type identifier and group member
> +identifier.
> +
> +The following table describes possible \field{status} values;
> +to simplify common implementations, they are intentionally
> +matching common \hyperref[intro:errno]{Linux error names and numbers}:
> +
> +\begin{tabular}{|l|l|l|}
> +\hline
> +Status (decimal) & Name & Description \\
> +\hline \hline
> +00   & VIRTIO_ADMIN_STATUS_OK& successful completion  \\
> +\hline
> +22   & VIRTIO_ADMIN_STATUS_EINVAL& invalid command \\
> +\hline
> +other   & -& group administration command error  \\
> +\hline
> +\end{tabular}
> +
> +When \field{status} is VIRTIO_ADMIN_STATUS_OK, \field{status_qialifier}

s/qialifier/qualifier/

> +is reserved and set to zero by the device.
> +
> +When \field{status} is VIRTIO_ADMIN_STATUS_EINVAL,
> +the following table describes possible \field{status_qialifier} values:

s/qialifier/qualifier/

> +\begin{tabular}{|l|l|l|}
> +\hline
> +Status & Name & Description \\
> +\hline \hline
> +0x00   & VIRTIO_ADMIN_STATUS_Q_INVALID_COMMAND   & command error: no 
> additional information  \\
> +\hline
> +0x01   & VIRTIO_ADMIN_STATUS_Q_INVALID_OPCODE& unsupported or invalid 
> \field{opcode}  \\
> +\hline
> +0x02   & VIRTIO_ADMIN_STATUS_Q_INVALID_FIELD& unsupported or invalid 
> field within \field{command_specific_data}  \\
> +\hline
> +0x03   & VIRTIO_ADMIN_STATUS_Q_INVALID_GROUP& unsupported or invalid 
> 

[virtio-dev] Re: [virtio] [PATCH v10 02/10] admin: introduce device group and related concepts

2023-03-02 Thread Stefan Hajnoczi
On Thu, Mar 02, 2023 at 08:04:57AM -0500, Michael S. Tsirkin wrote:
> +The following group types, and their identifiers, are currently specified):

spurious ')'




Re: [virtio-dev] [RFC QEMU] docs: vhost-user: Add custom memory mapping support

2023-03-02 Thread Stefan Hajnoczi
On Thu, Mar 02, 2023 at 01:49:07PM +0530, Viresh Kumar wrote:
> On 01-03-23, 12:29, Stefan Hajnoczi wrote:
> > What is the advantage over defining separate messages? Separate messages
> > are cleaner and more typesafe.
> 
> I thought we wanted to keep single message for one kind of functionality, 
> which
> is mmap related quirks here. And so it would be better if we can reuse the 
> same
> for next hypervisor which may need this.
> 
> The value parameter is not fixed and is hypervisor specific, for Xen this is 
> the
> domain id, for others it may mean something else.

mmap-related quirks have no parameters or behavior in common so there's
no advantage in sharing a single vhost-user protocol message. Sharing
the same message just makes it awkward to build and parse the message.

> > I don't have a concrete example, but was thinking of a guest that shares
> > memory with other guests (like the experimental virtio-vhost-user
> > device). Maybe there would be a scenario where some memory belongs to
> > one domain and some belongs to another (but has been mapped into the
> > first domain), and the vhost-user back-end needs to access both.
> 
> These look tricky (and real) and I am not sure how we would want to handle
> these. Maybe wait until we have a real use-case ?

A way to deal with that is to include mmap information every time fds
are passed with a message instead of sending one global message at the
start of the vhost-user connection. This would allow each mmap to
associate extra information instead of forcing them all to use the same
information.

> > The other thing that comes to mind is that the spec must clearly state
> > which mmaps are affected by the Xen domain information. For example,
> > just mem table memory regions and not the
> > VHOST_USER_PROTOCOL_F_LOG_SHMFD feature?
> 
> Maybe we can mention that only the mmap's performed via /dev/xen/privcmd and
> /dev/xen/gntdev files are affected by this ?

No, this doesn't explain when mmap must be performed via
/dev/xen/privcmd and /dev/xen/gntdev. The spec should be explicit about
this instead of assuming that the device implementer already knows this.

Stefan




Re: [virtio-dev] [RFC QEMU] docs: vhost-user: Add custom memory mapping support

2023-03-01 Thread Stefan Hajnoczi
On Wed, Mar 01, 2023 at 04:31:41PM +, Alex Bennée wrote:
> 
> Stefan Hajnoczi  writes:
> 
> > [[PGP Signed Part:Undecided]]
> > Resend - for some reason my email didn't make it out.
> >
> > From: Stefan Hajnoczi 
> > Subject: Re: [virtio-dev] [RFC QEMU] docs: vhost-user: Add custom memory
> >  mapping support
> > To: Viresh Kumar 
> > Cc: qemu-de...@nongnu.org, virtio-dev@lists.oasis-open.org, "Michael S.
> >  Tsirkin" , Vincent Guittot ,
> >  Alex Bennée ,
> > stratos-...@op-lists.linaro.org, Oleksandr Tyshchenko
> >  , xen-de...@lists.xen.org, Andrew Cooper
> >  , Juergen Gross , Sebastien
> >  Boeuf
> > , Liu Jiang , 
> > Mathieu
> >  Poirier 
> > Date: Tue, 21 Feb 2023 10:17:01 -0500 (1 week, 1 day, 1 hour ago)
> > Flags: seen, signed, personal
> >
> > On Tue, Feb 21, 2023 at 03:20:41PM +0530, Viresh Kumar wrote:
> >> The current model of memory mapping at the back-end works fine with
> >> Qemu, where a standard call to mmap() for the respective file
> >> descriptor, passed from front-end, is generally all we need to do before
> >> the front-end can start accessing the guest memory.
> >> 
> >> There are other complex cases though, where we need more information at
> >> the backend and need to do more than just an mmap() call. For example,
> >> Xen, a type-1 hypervisor, currently supports memory mapping via two
> >> different methods, foreign-mapping (via /dev/privcmd) and grant-dev (via
> >> /dev/gntdev). In both these cases, the back-end needs to call mmap()
> >> followed by an ioctl() (or vice-versa), and need to pass extra
> >> information via the ioctl(), like the Xen domain-id of the guest whose
> >> memory we are trying to map.
> >> 
> >> Add a new protocol feature, 'VHOST_USER_PROTOCOL_F_CUSTOM_MMAP', which
> >> lets the back-end know about the additional memory mapping requirements.
> >> When this feature is negotiated, the front-end can send the
> >> 'VHOST_USER_CUSTOM_MMAP' message type to provide the additional
> >> information to the back-end.
> >> 
> >> Signed-off-by: Viresh Kumar 
> >> ---
> >>  docs/interop/vhost-user.rst | 32 
> >>  1 file changed, 32 insertions(+)
> >
> > The alternative to an in-band approach is to configure these details
> > out-of-band. For example, via command-line options to the vhost-user
> > back-end:
> >
> >   $ my-xen-device --mapping-type=foreign-mapping --domain-id=123
> >
> > I was thinking about both approaches and don't see an obvious reason to
> > choose one or the other. What do you think?
> 
> In-band has the nice property of being dynamic and not having to have
> some other thing construct command lines. We are also trying to keep the
> daemons from being Xen specific and keep the type of mmap as an
> implementation detail that is mostly elided by the rust-vmm memory
> traits.

Okay.

> >
> >> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> >> index 3f18ab424eb0..f2b1d705593a 100644
> >> --- a/docs/interop/vhost-user.rst
> >> +++ b/docs/interop/vhost-user.rst
> >> @@ -258,6 +258,23 @@ Inflight description
> >>  
> >>  :queue size: a 16-bit size of virtqueues
> >>  
> >> +Custom mmap description
> >> +^^^
> >> +
> >> ++---+---+
> >> +| flags | value |
> >> ++---+---+
> >> +
> >> +:flags: 64-bit bit field
> >> +
> >> +- Bit 0 is Xen foreign memory access flag - needs Xen foreign memory 
> >> mapping.
> >> +- Bit 1 is Xen grant memory access flag - needs Xen grant memory mapping.
> >> +
> >> +:value: a 64-bit hypervisor specific value.
> >> +
> >> +- For Xen foreign or grant memory access, this is set with guest's xen 
> >> domain
> >> +  id.
> >
> > This is highly Xen-specific. How about naming the feature XEN_MMAP
> > instead of CUSTOM_MMAP? If someone needs to add other mmap data later,
> > they should define their own struct instead of trying to squeeze into
> > the same fields as Xen.
> 
> We hope to support additional mmap mechanisms in the future - one
> proposal is to use the hypervisor specific interface to support an
> ioctl() that creates a domain specific device which can then be treated
> more generically.
> 
> Could we not declare the message as flag + n bytes of domain specific
> message as defined my msglen?

What is th

Re: [virtio-dev] [RFC QEMU] docs: vhost-user: Add custom memory mapping support

2023-03-01 Thread Stefan Hajnoczi
Resend - for some reason my email didn't make it out.

- Forwarded message from Stefan Hajnoczi  -

Date: Tue, 21 Feb 2023 10:17:01 -0500
From: Stefan Hajnoczi 
To: Viresh Kumar 
Cc: qemu-de...@nongnu.org, virtio-dev@lists.oasis-open.org, "Michael S. 
Tsirkin" , Vincent Guittot , Alex 
Bennée ,
stratos-...@op-lists.linaro.org, Oleksandr Tyshchenko 
, xen-de...@lists.xen.org, Andrew Cooper 
, Juergen Gross , Sebastien Boeuf
, Liu Jiang , 
Mathieu Poirier 
Subject: Re: [virtio-dev] [RFC QEMU] docs: vhost-user: Add custom memory 
mapping support
Message-ID: 

On Tue, Feb 21, 2023 at 03:20:41PM +0530, Viresh Kumar wrote:
> The current model of memory mapping at the back-end works fine with
> Qemu, where a standard call to mmap() for the respective file
> descriptor, passed from front-end, is generally all we need to do before
> the front-end can start accessing the guest memory.
> 
> There are other complex cases though, where we need more information at
> the backend and need to do more than just an mmap() call. For example,
> Xen, a type-1 hypervisor, currently supports memory mapping via two
> different methods, foreign-mapping (via /dev/privcmd) and grant-dev (via
> /dev/gntdev). In both these cases, the back-end needs to call mmap()
> followed by an ioctl() (or vice-versa), and need to pass extra
> information via the ioctl(), like the Xen domain-id of the guest whose
> memory we are trying to map.
> 
> Add a new protocol feature, 'VHOST_USER_PROTOCOL_F_CUSTOM_MMAP', which
> lets the back-end know about the additional memory mapping requirements.
> When this feature is negotiated, the front-end can send the
> 'VHOST_USER_CUSTOM_MMAP' message type to provide the additional
> information to the back-end.
> 
> Signed-off-by: Viresh Kumar 
> ---
>  docs/interop/vhost-user.rst | 32 
>  1 file changed, 32 insertions(+)

The alternative to an in-band approach is to configure these details
out-of-band. For example, via command-line options to the vhost-user
back-end:

  $ my-xen-device --mapping-type=foreign-mapping --domain-id=123

I was thinking about both approaches and don't see an obvious reason to
choose one or the other. What do you think?

> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> index 3f18ab424eb0..f2b1d705593a 100644
> --- a/docs/interop/vhost-user.rst
> +++ b/docs/interop/vhost-user.rst
> @@ -258,6 +258,23 @@ Inflight description
>  
>  :queue size: a 16-bit size of virtqueues
>  
> +Custom mmap description
> +^^^
> +
> ++---+---+
> +| flags | value |
> ++---+---+
> +
> +:flags: 64-bit bit field
> +
> +- Bit 0 is Xen foreign memory access flag - needs Xen foreign memory mapping.
> +- Bit 1 is Xen grant memory access flag - needs Xen grant memory mapping.
> +
> +:value: a 64-bit hypervisor specific value.
> +
> +- For Xen foreign or grant memory access, this is set with guest's xen domain
> +  id.

This is highly Xen-specific. How about naming the feature XEN_MMAP
instead of CUSTOM_MMAP? If someone needs to add other mmap data later,
they should define their own struct instead of trying to squeeze into
the same fields as Xen.

There is an assumption in this design that a single
VHOST_USER_CUSTOM_MMAP message provides the information necessary for
all mmaps. Are you sure the limitation that every mmap belongs to the
same domain will be workable in the future?

> +
>  C structure
>  ---
>  
> @@ -867,6 +884,7 @@ Protocol features
>#define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14
>#define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS  15
>#define VHOST_USER_PROTOCOL_F_STATUS   16
> +  #define VHOST_USER_PROTOCOL_F_CUSTOM_MMAP  17
>  
>  Front-end message types
>  ---
> @@ -1422,6 +1440,20 @@ Front-end message types
>query the back-end for its device status as defined in the Virtio
>specification.
>  
> +``VHOST_USER_CUSTOM_MMAP``

Most vhost-user protocol messages have a verb like
get/set/close/add/listen/etc. I suggest renaming this to
VHOST_USER_SET_XEN_MMAP_INFO.

> +  :id: 41
> +  :equivalent ioctl: N/A
> +  :request payload: Custom mmap description
> +  :reply payload: N/A
> +
> +  When the ``VHOST_USER_PROTOCOL_F_CUSTOM_MMAP`` protocol feature has been
> +  successfully negotiated, this message is submitted by the front-end to
> +  notify the back-end of the special memory mapping requirements, that the
> +  back-end needs to take care of, while mapping any memory regions sent
> +  over by the front-end. The front-end must send this message before
> +  any memory-regions are sent to the back-end via 
> ``VHOST_USER_SET_MEM_TABLE``
> +  or ``VHOST_USER_ADD_MEM_REG`` me

[virtio-dev] Re: [virtio-comment] [PATCH] virtio-blk: Define dev cfg layout before its fields

2023-02-23 Thread Stefan Hajnoczi
On Thu, Feb 23, 2023 at 03:52:05PM +0200, Parav Pandit wrote:
> Define device configuration layout structure before describing its
> individual fields.
> 
> This is an editorial change.
> 
> Suggested-by: Cornelia Huck 
> Reviewed-by: Max Gurtovoy 
> Signed-off-by: Parav Pandit 
> ---
>  device-types/blk/description.tex | 95 
>  1 file changed, 48 insertions(+), 47 deletions(-)
> 
> diff --git a/device-types/blk/description.tex 
> b/device-types/blk/description.tex
> index 20007e3..517b012 100644
> --- a/device-types/blk/description.tex
> +++ b/device-types/blk/description.tex
> @@ -83,6 +83,54 @@ \subsubsection{Legacy Interface: Feature 
> bits}\label{sec:Device Types / Block De
>  
>  \subsection{Device configuration layout}\label{sec:Device Types / Block 
> Device / Device configuration layout}
>  
> +The block device has the following device configuration layout.
> +
> +\begin{lstlisting}
> +struct virtio_blk_config {
> +le64 capacity;
> +le32 size_max;
> +le32 seg_max;
> +struct virtio_blk_geometry {
> +le16 cylinders;
> +u8 heads;
> +u8 sectors;
> +} geometry;
> +le32 blk_size;
> +struct virtio_blk_topology {
> +// # of logical blocks per physical block (log2)
> +u8 physical_block_exp;
> +// offset of first aligned logical block
> +u8 alignment_offset;
> +// suggested minimum I/O size in blocks
> +le16 min_io_size;
> +// optimal (suggested maximum) I/O size in blocks
> +le32 opt_io_size;
> +} topology;
> +u8 writeback;
> +u8 unused0;
> +u16 num_queues;
> +le32 max_discard_sectors;
> +le32 max_discard_seg;
> +le32 discard_sector_alignment;
> +le32 max_write_zeroes_sectors;
> +le32 max_write_zeroes_seg;
> +u8 write_zeroes_may_unmap;
> +u8 unused1[3];
> +le32 max_secure_erase_sectors;
> +le32 max_secure_erase_seg;
> +le32 secure_erase_sector_alignment;
> +struct virtio_blk_zoned_characteristics {
> +le32 zone_sectors;
> +le32 max_open_zones;
> +le32 max_active_zones;
> +le32 max_append_sectors;
> +le32 write_granularity;
> +u8 model;
> +u8 unused2[3];
> +} zoned;
> +};
> +\end{lstlisting}
> +
>  The \field{capacity} of the device (expressed in 512-byte sectors) is always
>  present. The availability of the others all depend on various feature
>  bits as indicated above.
> @@ -167,53 +215,6 @@ \subsection{Device configuration 
> layout}\label{sec:Device Types / Block Device /
>  terminated by the device with a "zone resources exceeded" error as defined 
> for
>  specific commands later.
>  
> -\begin{lstlisting}
> -struct virtio_blk_config {
> -le64 capacity;
> -le32 size_max;
> -le32 seg_max;
> -struct virtio_blk_geometry {
> -le16 cylinders;
> -u8 heads;
> -u8 sectors;
> -} geometry;
> -le32 blk_size;
> -struct virtio_blk_topology {
> -// # of logical blocks per physical block (log2)
> -u8 physical_block_exp;
> -// offset of first aligned logical block
> -u8 alignment_offset;
> -// suggested minimum I/O size in blocks
> -le16 min_io_size;
> -// optimal (suggested maximum) I/O size in blocks
> -le32 opt_io_size;
> -} topology;
> -u8 writeback;
> -u8 unused0;
> -u16 num_queues;
> -le32 max_discard_sectors;
> -le32 max_discard_seg;
> -le32 discard_sector_alignment;
> -le32 max_write_zeroes_sectors;
> -le32 max_write_zeroes_seg;
> -u8 write_zeroes_may_unmap;
> -u8 unused1[3];
> -le32 max_secure_erase_sectors;
> -le32 max_secure_erase_seg;
> -le32 secure_erase_sector_alignment;
> -struct virtio_blk_zoned_characteristics {
> -le32 zone_sectors;
> -le32 max_open_zones;
> -le32 max_active_zones;
> -le32 max_append_sectors;
> -le32 write_granularity;
> -u8 model;
> -u8 unused2[3];
> -} zoned;
> -};
> -\end{lstlisting}
> -
> -
>  \subsubsection{Legacy Interface: Device configuration 
> layout}\label{sec:Device Types / Block Device / Device configuration layout / 
> Legacy Interface: Device configuration layout}
>  When using the legacy interface, transitional devices and drivers
>  MUST format the fields in struct virtio_blk_config
> -- 
> 2.26.2

Yes, please!

Reviewed-by: Stefan Hajnoczi 




Re: [virtio-dev] [PATCH v4 2/2] virtio-blk: add support for zoned block devices

2022-11-02 Thread Stefan Hajnoczi
On Sun, Oct 30, 2022 at 12:35:45AM -0400, Dmitry Fomichev wrote:
> This patch adds support for Zoned Block Devices (ZBDs) to the kernel
> virtio-blk driver.
> 
> The patch accompanies the virtio-blk ZBD support draft that is now
> being proposed for standardization. The latest version of the draft is
> linked at
> 
> https://github.com/oasis-tcs/virtio-spec/issues/143 .
> 
> The QEMU zoned device code that implements these protocol extensions
> has been developed by Sam Li and it is currently in review at the QEMU
> mailing list.
> 
> A number of virtblk request structure changes has been introduced to
> accommodate the functionality that is specific to zoned block devices
> and, most importantly, make room for carrying the Zoned Append sector
> value from the device back to the driver along with the request status.
> 
> The zone-specific code in the patch is heavily influenced by NVMe ZNS
> code in drivers/nvme/host/zns.c, but it is simpler because the proposed
> virtio ZBD draft only covers the zoned device features that are
> relevant to the zoned functionality provided by Linux block layer.
> 
> Co-developed-by: Stefan Hajnoczi 
> Signed-off-by: Stefan Hajnoczi 
> Signed-off-by: Dmitry Fomichev 
> ---
>  drivers/block/virtio_blk.c  | 438 ++--
>  include/uapi/linux/virtio_blk.h | 105 
>  2 files changed, 524 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 3efe3da5f8c2..03b5302fac6e 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -15,6 +15,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #define PART_BITS 4
> @@ -80,22 +81,51 @@ struct virtio_blk {
>   int num_vqs;
>   int io_queues[HCTX_MAX_TYPES];
>   struct virtio_blk_vq *vqs;
> +
> + /* For zoned device */
> + unsigned int zone_sectors;
>  };
>  
>  struct virtblk_req {
> + /* Out header */
>   struct virtio_blk_outhdr out_hdr;
> - u8 status;
> +
> + /* In header */
> + union {
> + struct {
> + u8 status;
> + } common;
> +
> + /*
> +  * The zone append command has an extended in header.
> +  * The status field in zone_append_in_hdr must always
> +  * be the last byte.
> +  */
> + struct {
> + __virtio64 sector;
> + u8 status;
> + } zone_append;
> + } in_hdr;
> +
> + size_t in_hdr_len;
> +
>   struct sg_table sg_table;
>   struct scatterlist sg[];
>  };
>  
> -static inline blk_status_t virtblk_result(struct virtblk_req *vbr)
> +static inline blk_status_t virtblk_result(u8 status)
>  {
> - switch (vbr->status) {
> + switch (status) {
>   case VIRTIO_BLK_S_OK:
>   return BLK_STS_OK;
>   case VIRTIO_BLK_S_UNSUPP:
>   return BLK_STS_NOTSUPP;
> + case VIRTIO_BLK_S_ZONE_OPEN_RESOURCE:
> + return BLK_STS_ZONE_OPEN_RESOURCE;
> + case VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE:
> + return BLK_STS_ZONE_ACTIVE_RESOURCE;
> + case VIRTIO_BLK_S_IOERR:
> + case VIRTIO_BLK_S_ZONE_UNALIGNED_WP:
>   default:
>   return BLK_STS_IOERR;
>   }
> @@ -111,11 +141,11 @@ static inline struct virtio_blk_vq 
> *get_virtio_blk_vq(struct blk_mq_hw_ctx *hctx
>  
>  static int virtblk_add_req(struct virtqueue *vq, struct virtblk_req *vbr)
>  {
> - struct scatterlist hdr, status, *sgs[3];
> + struct scatterlist out_hdr, in_hdr, *sgs[3];
>   unsigned int num_out = 0, num_in = 0;
>  
> - sg_init_one(&hdr, &vbr->out_hdr, sizeof(vbr->out_hdr));
> - sgs[num_out++] = &hdr;
> + sg_init_one(&out_hdr, &vbr->out_hdr, sizeof(vbr->out_hdr));
> + sgs[num_out++] = &out_hdr;
>  
>   if (vbr->sg_table.nents) {
>   if (vbr->out_hdr.type & cpu_to_virtio32(vq->vdev, 
> VIRTIO_BLK_T_OUT))
> @@ -124,8 +154,8 @@ static int virtblk_add_req(struct virtqueue *vq, struct 
> virtblk_req *vbr)
>   sgs[num_out + num_in++] = vbr->sg_table.sgl;
>   }
>  
> - sg_init_one(&status, &vbr->status, sizeof(vbr->status));
> - sgs[num_out + num_in++] = &status;
> + sg_init_one(&in_hdr, &vbr->in_hdr.common.status, vbr->in_hdr_len);
> + sgs[num_out + num_in++] = &in_hdr;
>  
>   return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
>  }
> @@ -214,21 +244,22 @@ static blk_status_t virtblk_setup_cmd(struct 
> virtio_device *vdev,
> struct

[virtio-dev] Re: [virtio-comment] [PATCH v7] virtio-blk: add zoned block device specification

2022-10-31 Thread Stefan Hajnoczi
On Sun, Oct 30, 2022 at 11:06:32PM -0400, Dmitry Fomichev wrote:

A minor issue about normative vs non-normative sections:

> +Devices that offer the VIRTIO_BLK_F_ZONED feature while reporting the
> +VIRTIO_BLK_Z_NONE zoned model are drive-managed zoned block devices. In this
> +case, the driver should treat the device as a regular non-zoned block device.

The VIRTIO spec tries to use should/may/must/etc in the drivernormative
and devicenormative sections. The non-normative sections describe the
interface without making normative statements.

Use of "should" suggests that maybe this should be in a normative
section? If you decide not to move it to a normative section then you
could simply reword it:

  "the driver treats the device as a regular ..."

> +
> +Host-managed zoned block devices have their LBA range divided into Sequential
> +Write Required (SWR) zones that require some additional handling from the 
> host
> +for correct operation. All write requests to SWR zones must be sequential and

There is a "must" here. Same question as above.

Aside from that:

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [virtio-dev] Re: [External] Re: [PATCH v3 3/4] content: Introduce driver/device auxiliary notifications for MMIO

2022-10-31 Thread Stefan Hajnoczi
On Mon, Oct 31, 2022 at 02:22:22PM +, Usama Arif wrote:
> On 27/10/2022 22:27, Stefan Hajnoczi wrote:
> > On Fri, Oct 07, 2022 at 05:56:42PM +0100, Usama Arif wrote:
> > > This includes the additions to the corresponding device and driver
> > > conformances.
> > > 
> > > Signed-off-by: Usama Arif 
> > > Signed-off-by: Stefan Hajnoczi 
> > 
> > You can keep this. One thought:
> > 
> > I realized that virtio-mmio support won't be very useful until
> > virtio-mmio gets MSI-X support because drivers currently cannot
> > differentiate between device auxiliary notifications.
> > 
> > Stefan
> 
> Did you mean drivers currently cannot differentiate between "driver"
> auxiliary notifications? The driver will be able to choose the
> index it wants to notify using DeviceAuxNotificationIndex.

Yes, sorry!

> 
> I can add the following line in the Interrupt Status change of the patch if
> it makes sense?
> "As MSI-X is not available, it is not possible to determine which driver
> auxiliary notification occurred, hence only a single notification is
> supported."

Yes, that is how I understood the mechanism to work. Depending on the
nature of the device-specific event, it may be unworkable for the driver
to have no way of identifying the exact event.

Stefan




Re: [virtio-dev] Re: [PATCH v3 3/4] content: Introduce driver/device auxiliary notifications for MMIO

2022-10-31 Thread Stefan Hajnoczi
On Mon, Oct 31, 2022 at 11:45:55AM -0400, Michael S. Tsirkin wrote:
> On Thu, Oct 27, 2022 at 05:27:28PM -0400, Stefan Hajnoczi wrote:
> > On Fri, Oct 07, 2022 at 05:56:42PM +0100, Usama Arif wrote:
> > > This includes the additions to the corresponding device and driver
> > > conformances.
> > > 
> > > Signed-off-by: Usama Arif 
> > > Signed-off-by: Stefan Hajnoczi 
> > 
> > You can keep this. One thought:
> > 
> > I realized that virtio-mmio support won't be very useful until
> > virtio-mmio gets MSI-X support because drivers currently cannot
> > differentiate between device auxiliary notifications.
> > 
> > Stefan
> 
> 
> The big issue with MSI is it actually needs a ton of registers
> for interrupt rebalancing, masking etc and they need to be fast.
> Maybe a status bit like pci has makes sense here.

This patch does add an ISR bit:

diff --git a/content.tex b/content.tex
index 33362b7..8968fcd 100644
--- a/content.tex
+++ b/content.tex
@@ -2049,6 +2049,8 @@ \subsection{MMIO Device Register Layout}\label{sec:Virtio 
Transport Options / Vi
 in at least one of the active virtual queues.
   \item [Configuration Change Notification] - bit 1 - the interrupt was
 asserted because the configuration of the device has changed.
+  \item [Device-specific Driver Auxiliary Notification] - bit 2 - the 
interrupt was
+asserted because a device-specific event occurred to notify the driver.

The problem is that the driver doesn't know which device-specific event
occurred, so the mechanism is less useful than the MSI-X approach where
the driver knows exactly which (of potentially many) device-specific
events occurred.

Stefan




[virtio-dev] Re: [PATCH v3 2/4] content: Introduce driver/device aux. notification cfg type for PCI

2022-10-31 Thread Stefan Hajnoczi
On Fri, Oct 07, 2022 at 05:56:41PM +0100, Usama Arif wrote:
> This includes the PCI device conformances for these notification
> capabilities.
> 
> Signed-off-by: Usama Arif 
> Signed-off-by: Stefan Hajnoczi 

You can keep this, but please see comments below.

> Signed-off-by: Nikos Dragazis 
> ---
>  conformance.tex |   2 +
>  content.tex | 191 ++--
>  2 files changed, 171 insertions(+), 22 deletions(-)
> 
> diff --git a/conformance.tex b/conformance.tex
> index c3c1d3e..f6914b0 100644
> --- a/conformance.tex
> +++ b/conformance.tex
> @@ -370,6 +370,8 @@ \section{Conformance Targets}\label{sec:Conformance / 
> Conformance Targets}
>  \item \ref{devicenormative:Virtio Transport Options / Virtio Over PCI Bus / 
> PCI Device Layout / Notification capability}
>  \item \ref{devicenormative:Virtio Transport Options / Virtio Over PCI Bus / 
> PCI Device Layout / ISR status capability}
>  \item \ref{devicenormative:Virtio Transport Options / Virtio Over PCI Bus / 
> PCI Device Layout / Device-specific configuration}
> +\item \ref{devicenormative:Virtio Transport Options / Virtio Over PCI Bus / 
> PCI Device Layout / Device auxiliary notification capability}
> +\item \ref{devicenormative:Virtio Transport Options / Virtio Over PCI Bus / 
> PCI Device Layout / Driver auxiliary notification capability}
>  \item \ref{devicenormative:Virtio Transport Options / Virtio Over PCI Bus / 
> PCI Device Layout / Shared memory capability}
>  \item \ref{devicenormative:Virtio Transport Options / Virtio Over PCI Bus / 
> PCI Device Layout / PCI configuration access capability}
>  \item \ref{devicenormative:Virtio Transport Options / Virtio Over PCI Bus / 
> PCI-specific Initialization And Device Operation / Device Initialization / 
> Non-transitional Device With Legacy Driver}
> diff --git a/content.tex b/content.tex
> index 49f9f2a..33362b7 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -719,6 +719,8 @@ \subsection{Virtio Structure PCI 
> Capabilities}\label{sec:Virtio Transport Option
>  \item ISR Status
>  \item Device-specific configuration (optional)
>  \item PCI configuration access
> +\item Driver auxiliary notifications (optional)
> +\item Device auxiliary notifications (optional)
>  \end{itemize}
>  
>  Each structure can be mapped by a Base Address register (BAR) belonging to
> @@ -765,19 +767,23 @@ \subsection{Virtio Structure PCI 
> Capabilities}\label{sec:Virtio Transport Option
>  
>  \begin{lstlisting}
>  /* Common configuration */
> -#define VIRTIO_PCI_CAP_COMMON_CFG1
> +#define VIRTIO_PCI_CAP_COMMON_CFG 1
>  /* Notifications */
> -#define VIRTIO_PCI_CAP_NOTIFY_CFG2
> +#define VIRTIO_PCI_CAP_NOTIFY_CFG 2
>  /* ISR Status */
> -#define VIRTIO_PCI_CAP_ISR_CFG   3
> +#define VIRTIO_PCI_CAP_ISR_CFG3
>  /* Device specific configuration */
> -#define VIRTIO_PCI_CAP_DEVICE_CFG4
> +#define VIRTIO_PCI_CAP_DEVICE_CFG 4
>  /* PCI configuration access */
> -#define VIRTIO_PCI_CAP_PCI_CFG   5
> +#define VIRTIO_PCI_CAP_PCI_CFG5
> +/* Device auxiliary notification */
> +#define VIRTIO_PCI_CAP_DEVICE_AUX_NOTIFY_CFG  6
> +/* Driver auxiliary notification */
> +#define VIRTIO_PCI_CAP_DRIVER_AUX_NOTIFY_CFG  7
>  /* Shared memory region */
> -#define VIRTIO_PCI_CAP_SHARED_MEMORY_CFG 8
> +#define VIRTIO_PCI_CAP_SHARED_MEMORY_CFG  8
>  /* Vendor-specific data */
> -#define VIRTIO_PCI_CAP_VENDOR_CFG9
> +#define VIRTIO_PCI_CAP_VENDOR_CFG 9
>  \end{lstlisting}
>  
>  Any other value is reserved for future use.
> @@ -1158,11 +1164,11 @@ \subsubsection{ISR status 
> capability}\label{sec:Virtio Transport Options / Virti
>  The ISR bits allow the driver to distinguish between device-specific 
> configuration
>  change interrupts and normal virtqueue interrupts:
>  
> -\begin{tabular}{ |l||l|l|l| }
> +\begin{tabular}{ |l||l|l|l|l| }
>  \hline
> -Bits   & 0   & 1   &  2 to 31 \\
> +Bits   & 0   & 1   & 2  & 3 to 31 \\
>  \hline
> -Purpose& Queue Interrupt  & Device Configuration Interrupt & Reserved \\
> +Purpose& Queue Interrupt  & Device Configuration Interrupt & Driver Auxiliary Notification Interrupt & Reserved \\
>  \hline
>  \end{tabular}
>  
> @@ -1208,6 +1214,135 @@ \subsubsection{Device-specific 
> configuration}\label{sec:Virtio Transport Options
>  
>  The \field{offset} for the device-specific configuration MUST be 4-byte 
> aligned.
>  

[virtio-dev] Re: [PATCH v3 3/4] content: Introduce driver/device auxiliary notifications for MMIO

2022-10-31 Thread Stefan Hajnoczi
On Fri, Oct 07, 2022 at 05:56:42PM +0100, Usama Arif wrote:
> This includes the additions to the corresponding device and driver
> conformances.
> 
> Signed-off-by: Usama Arif 
> Signed-off-by: Stefan Hajnoczi 

You can keep this. One thought:

I realized that virtio-mmio support won't be very useful until
virtio-mmio gets MSI-X support because drivers currently cannot
differentiate between device auxiliary notifications.

Stefan




[virtio-dev] Re: [PATCH v3 1/4] content: Introduce driver/device auxiliary notifications

2022-10-27 Thread Stefan Hajnoczi
On Fri, Oct 07, 2022 at 05:56:40PM +0100, Usama Arif wrote:
> Driver auxiliary notifications allow the device to send notifications
> other than configuration changes and used buffer notifications to the
> driver, these are optional and their meaning is device-specific.
> 
> Device auxiliary notifications allow the driver to send notifications
> other than available buffer notifications to the device for example
> through a device register, these are optional and their meaning is
> device-specific.
> 
> These device-specific notifications are needed later when adding support
> for virtio-vhost-user device.
> 
> Signed-off-by: Usama Arif 
> Signed-off-by: Stefan Hajnoczi 
> Signed-off-by: Nikos Dragazis 
> ---
>  content.tex | 35 ++-
>  1 file changed, 22 insertions(+), 13 deletions(-)
> 
> diff --git a/content.tex b/content.tex
> index e863709..49f9f2a 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -160,29 +160,38 @@ \subsection{Legacy Interface: A Note on Feature
>  Specification text within these sections generally does not apply
>  to non-transitional devices.
>  
> -\section{Notifications}\label{sec:Basic Facilities of a Virtio Device
> -/ Notifications}
> +\section{Notifications}\label{sec:Basic Facilities of a Virtio Device / 
> Notifications}
>  
>  The notion of sending a notification (driver to device or device
>  to driver) plays an important role in this specification. The
>  modus operandi of the notifications is transport specific.
>  
> -There are three types of notifications: 
> +There are five types of notifications:
>  \begin{itemize}
>  \item configuration change notification
>  \item available buffer notification
> -\item used buffer notification. 
> +\item used buffer notification
> +\item driver auxiliary notification
> +\item device auxiliary notification
>  \end{itemize}
>  
> -Configuration change notifications and used buffer notifications are sent
> -by the device, the recipient is the driver. A configuration change
> -notification indicates that the device configuration space has changed; a
> -used buffer notification indicates that a buffer may have been made used
> -on the virtqueue designated by the notification.
> -
> -Available buffer notifications are sent by the driver, the recipient is
> -the device. This type of notification indicates that a buffer may have
> -been made available on the virtqueue designated by the notification.
> +Configuration change notifications, used buffer notifications and
> +driver auxiliary notifications are sent by the device,
> +the recipient is the driver. A configuration change notification indicates
> +that the device configuration space has changed; a used buffer notification
> +indicates that a buffer may have been made used on the virtqueue designated
> +by the notification; driver auxiliary notifications allow the
> +device to send notifications other than configuration changes and used
> +buffer notifications to the driver, these are optional and their meaning
> +is device-specific.

The meaning of "optional" is confusing and I suggest dropping it.

Auxiliary notifications are not used by all device types. In that sense
they are "optional".

But device types that do use them, like virtio-vhost-user, may require
them. You cannot implement a virtio-vhost-user device without auxiliary
notifications, so they are mandatory (not "optional") for that device
type.

Finally, it is also possible for a device type to define a feature bit
that enables auxiliary notifications. Drivers/devices that don't support
auxiliary notifications leave the feature bit cleared and still work
properly. In this case they are truly "optional".

"Optional" is confusing. Just saying "their meaning is device-specific"
is enough.

> +
> +Available buffer notifications and device auxiliary notifications
> +are sent by the driver, the recipient is the device. Available buffer
> +notifications indicate that a buffer may have been made available on the
> +virtqueue designated by the notification; device auxiliary
> +notifications allow the driver to send notifications other than available
> +buffer notifications to the device for example through a device register, 
> these
> +are optional and their meaning is device-specific.
>  
>  The semantics, the transport-specific implementations, and other
>  important aspects of the different notifications are specified in detail
> -- 
> 2.25.1
> 




[virtio-dev] Re: [PATCH v3 0/4] Introduce aux. notifications and virtio-vhost-user

2022-10-27 Thread Stefan Hajnoczi
On Fri, Oct 07, 2022 at 05:56:39PM +0100, Usama Arif wrote:
> Hi,
> 
> This patch series introduces device and driver auxiliary notifications
> as a new set of virtio device resources, as well as vhost-user device
> backend that uses these resources.
> 
> Driver auxiliary notifications allow the device to send notifications
> other than configuration changes and used buffer notifications to the
> driver, these are optional and their meaning is **device-specific**.
> 
> Device auxiliary notifications allow the driver to send notifications
> other than available buffer notifications to the device for example
> through a device register, these are optional and their meaning is
> **device-specific**.
> 
> These resources are used in the last patch by the virtio-vhost-user
> device in order to send/receive notifications to/from the driver
> regarding activity on the vhost-user virtqueues. By standardizing
> these resources as standalone virtio device resources, other future
> devices will be able to use them as well.

Hi Michael,
Do you have time to review this? I have been too involved in
virtio-vhost-user in the past to do a truly independent review and your
VIRTIO + vhost expertise would be perfect here.

Thanks,
Stefan

> 
> The last patch introduces the vhost-user device backend which facilitates
> vhost-user device emulation through vhost-user protocol exchanges and
> access to shared memory. Software-defined networking, storage, and other
> I/O appliances can provide services through this device.
> 
> This device is based on Wei Wang's vhost-pci work. The virtio-vhost-user
> device differs from vhost-pci because it is a single virtio
> device type that exposes the vhost-user protocol instead of a family of
> new virtio device types, one for each vhost-user device type.
> 
> A HTML version with the changes is available at [1].
> 
> For more information about virtio-vhost-user, see [2].
> 
> These patches are based on the work initially done by Stefan Hajnoczi [2]
> and continued by Nikos Dragazis [3], [4].
> 
> A working prototype implementing this spec can be reproduced using
> instructions in [5] utilizing QEMU [6] and DPDK [7].
> This is also based on the work initially done by Stefan
> and continued later by Nikos. The prototype code uses the terms
> "doorbell"/"device-specific notification" and "master"/"slave" instead of
> "device auxiliary notification"/"driver auxiliary notification" and
> "frontend"/"backend". This is based on older work, however, their
> functionality is the same. If these virtio-spec changes get approved,
> I will refactor the patches for QEMU/DPDK and send them for review
> according to the final patches in the respective mailing list.
> 
> Thanks and looking forward to your response!
> Usama
> 
> [1] https://uarif1.github.io/vvu/v3/virtio-v1.2-cs01
> [2] https://wiki.qemu.org/Features/VirtioVhostUser
> [3] https://ndragazis.github.io/dpdk-vhost-vvu-demo.html
> [4] https://lists.oasis-open.org/archives/virtio-dev/202005/msg00132.html
> [5] https://uarif1.github.io/vvu/dpdk-vvu-instructions
> [6] https://github.com/uarif1/qemu/tree/vvu
> [7] https://github.com/uarif1/dpdk/tree/vvu
> --
> v2->v3:
> Mostly addressing comments made by Michael S. Tsirkin:
> - Give a clearer relation between backend driver, backend device and frontend 
> at
>   beginning of the Vhost-user Device Backend section.
> - change uuid to id.
> - Add information about inflight memory.
> 
> v1->v2:
> - Shortened device aux. notification section
> - Make data width of device aux. notifications device specific and not
>   limited to 2 bytes.
> - Added information about the minimum number of MSIX vectors.
> - Split virtio-mmio implementation of driver aux. notifications into 2
>   registers, DeviceAuxNotificationIndex and DeviceAuxNotificationData.
> - Made the shared memory section in virtio vhost-user clearer and added
>   the shared memory would look like when
>   VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS has been negotiated.
> 
> Usama Arif (4):
>   content: Introduce driver/device auxiliary notifications
>   content: Introduce driver/device aux. notification cfg type for PCI
>   content: Introduce driver/device auxiliary notifications for MMIO
>   vhost-user: add vhost-user device type
> 
>  conformance.tex   |  29 +++-
>  content.tex   | 254 +-
>  introduction.tex  |   3 +
>  virtio-vhost-user.tex | 309 ++
>  4 files changed, 554 insertions(+), 41 deletions(-)
>  create mode 100644 virtio-vhost-user.tex
> 
> -- 
> 2.25.1
> 




Re: [virtio-dev] [PATCH v2 2/2] virtio-blk: add support for zoned block devices

2022-10-16 Thread Stefan Hajnoczi
On Sat, Oct 15, 2022 at 11:41:27PM -0400, Dmitry Fomichev wrote:
> This patch adds support for Zoned Block Devices (ZBDs) to the kernel
> virtio-blk driver.
> 
> The patch accompanies the virtio-blk ZBD support draft that is now
> being proposed for standardization. The latest version of the draft is
> linked at
> 
> https://github.com/oasis-tcs/virtio-spec/issues/143 .
> 
> The QEMU zoned device code that implements these protocol extensions
> has been developed by Sam Li and it is currently in review at the QEMU
> mailing list.
> 
> A number of virtblk request structure changes have been introduced to
> accommodate the functionality that is specific to zoned block devices
> and, most importantly, make room for carrying the Zoned Append sector
> value from the device back to the driver along with the request status.
> 
> The zone-specific code in the patch is heavily influenced by NVMe ZNS
> code in drivers/nvme/host/zns.c, but it is simpler because the proposed
> virtio ZBD draft only covers the zoned device features that are
> relevant to the zoned functionality provided by Linux block layer.
> 
> Co-developed-by: Stefan Hajnoczi 
> Signed-off-by: Stefan Hajnoczi 
> Signed-off-by: Dmitry Fomichev 
> ---
>  drivers/block/virtio_blk.c  | 377 ++--
>  include/uapi/linux/virtio_blk.h | 105 +
>  2 files changed, 464 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 3efe3da5f8c2..c9ad590816f8 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -80,22 +80,51 @@ struct virtio_blk {
>   int num_vqs;
>   int io_queues[HCTX_MAX_TYPES];
>   struct virtio_blk_vq *vqs;
> +
> + /* For zoned device */
> + unsigned int zone_sectors;
>  };
>  
>  struct virtblk_req {
> + /* Out header */
>   struct virtio_blk_outhdr out_hdr;
> - u8 status;
> +
> + /* In header */
> + union {
> + u8 status;
> +
> + /*
> +  * The zone append command has an extended in header.
> +  * The status field in zone_append_in_hdr must have
> +  * the same offset in virtblk_req as the non-zoned
> +  * status field above.
> +  */
> + struct {
> + u8 status;
> + u8 reserved[7];
> + u64 append_sector;
> + } zone_append_in_hdr;
> + };

This layout is not finalized yet. In the last revision of the spec patch
I suggested making the status byte the last byte so that devices that
don't know about zone append can still complete the request with
status=VIRTIO_BLK_S_UNSUPP at the correct offset.

> +
> + size_t in_hdr_len;
> +
>   struct sg_table sg_table;
>   struct scatterlist sg[];
>  };
>  
> -static inline blk_status_t virtblk_result(struct virtblk_req *vbr)
> +static inline blk_status_t virtblk_result(u8 status)
>  {
> - switch (vbr->status) {
> + switch (status) {
>   case VIRTIO_BLK_S_OK:
>   return BLK_STS_OK;
>   case VIRTIO_BLK_S_UNSUPP:
>   return BLK_STS_NOTSUPP;
> + case VIRTIO_BLK_S_ZONE_OPEN_RESOURCE:
> + return BLK_STS_ZONE_OPEN_RESOURCE;
> + case VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE:
> + return BLK_STS_ZONE_ACTIVE_RESOURCE;
> + case VIRTIO_BLK_S_IOERR:
> + case VIRTIO_BLK_S_ZONE_UNALIGNED_WP:
>   default:
>   return BLK_STS_IOERR;
>   }
> @@ -111,11 +140,11 @@ static inline struct virtio_blk_vq 
> *get_virtio_blk_vq(struct blk_mq_hw_ctx *hctx
>  
>  static int virtblk_add_req(struct virtqueue *vq, struct virtblk_req *vbr)
>  {
> - struct scatterlist hdr, status, *sgs[3];
> + struct scatterlist out_hdr, in_hdr, *sgs[3];
>   unsigned int num_out = 0, num_in = 0;
>  
> - sg_init_one(&hdr, &vbr->out_hdr, sizeof(vbr->out_hdr));
> - sgs[num_out++] = &hdr;
> + sg_init_one(&out_hdr, &vbr->out_hdr, sizeof(vbr->out_hdr));
> + sgs[num_out++] = &out_hdr;
>  
>   if (vbr->sg_table.nents) {
>   if (vbr->out_hdr.type & cpu_to_virtio32(vq->vdev, 
> VIRTIO_BLK_T_OUT))
> @@ -124,8 +153,8 @@ static int virtblk_add_req(struct virtqueue *vq, struct 
> virtblk_req *vbr)
>   sgs[num_out + num_in++] = vbr->sg_table.sgl;
>   }
>  
> - sg_init_one(&status, &vbr->status, sizeof(vbr->status));
> - sgs[num_out + num_in++] = &status;
> + sg_init_one(&in_hdr, &vbr->status, vbr->in_hdr_len);
> + sgs[num_out + num_in++] = &in_hdr;
>  
>   return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
> 

Re: [virtio-dev] [PATCH v2 1/2] virtio-blk: use a helper to handle request queuing errors

2022-10-16 Thread Stefan Hajnoczi
On Sat, Oct 15, 2022 at 11:41:26PM -0400, Dmitry Fomichev wrote:
> Define a new helper function, virtblk_fail_to_queue(), to
> clean up the error handling code in virtio_queue_rq().
> 
> Signed-off-by: Dmitry Fomichev 
> ---
>  drivers/block/virtio_blk.c | 29 -
>  1 file changed, 16 insertions(+), 13 deletions(-)

Reviewed-by: Stefan Hajnoczi 




Re: [virtio-dev] [PATCH v6] virtio-blk: add zoned block device specification

2022-10-06 Thread Stefan Hajnoczi
On Sun, Sep 04, 2022 at 12:56:01PM -0400, Dmitry Fomichev wrote:
> @@ -4746,7 +4930,15 @@ \subsection{Device Operation}\label{sec:Device Types / 
> Block Device / Device Ope
>  le32 reserved;
>  le64 sector;
>  u8 data[];
> -u8 status;
> +union {
> +u8 status;
> +
> +struct {
> +u8 status;
> +u8 reserved[7];
> +le64 append_sector;
> +} zone_append_in_hdr;
> +};

If a zone append request is sent to a device without VIRTIO_BLK_F_ZONED,
then should the device store VIRTIO_BLK_S_UNSUPP to status or to
zone_append_in_hdr.status?

(I think the answer is status.)

It seems simpler to avoid this problem and instead define:

  struct virtio_blk_zone_append_in_hdr {
  le64 append_sector;
  }

and say that zone append requests have this struct *before* struct
virtio_blk_in_hdr. So the layout is:

  le32 type;
  le32 reserved;
  le64 sector;
  u8 data[];
  le64 append_sector;
  u8 status;

That way the location of the status field does not change: it is always
the last byte.

Stefan




Re: [virtio-dev] [PATCH v6] virtio-blk: add zoned block device specification

2022-10-04 Thread Stefan Hajnoczi
On Sun, Sep 04, 2022 at 12:56:01PM -0400, Dmitry Fomichev wrote:
> Introduce support for Zoned Block Devices to virtio.
> 
> Zoned Block Devices (ZBDs) aim to achieve a better capacity, latency
> and/or cost characteristics compared to commonly available block
> devices by getting the entire LBA space of the device divided to block
> regions that are much larger than the LBA size. These regions are
> called zones and they can only be written sequentially. More details
> about ZBDs can be found at
> 
> https://zonedstorage.io/docs/introduction/zoned-storage .
> 
> In its current form, the virtio protocol for block devices (virtio-blk)
> is not aware of ZBDs but it allows the driver to successfully scan a
> host-managed drive provided by the virtio block device. As a result,
> the host-managed drive is recognized by the virtio driver as a regular,
> non-zoned drive that will operate erroneously under the most common
> write workloads. Host-aware ZBDs are currently usable, but their
> performance may not be optimal because the driver can only see them as
> non-zoned block devices.
> 
> To fix this, the virtio-blk protocol needs to be extended to add the
> capabilities to convey the zone characteristics of ZBDs at the device
> side to the driver and to provide support for ZBD-specific commands -
> Report Zones, four zone operations (Open, Close, Finish and Reset) and
> (optionally) Zone Append. The proposed standard extension aims to
> define this new functionality.
> 
> This patch extends the virtio-blk section of virtio specification with
> the minimum set of requirements that are necessary to support ZBDs.
> The resulting device model is a subset of the models defined in ZAC/ZBC
> and ZNS standards documents. The included functionality mirrors
> the existing Linux kernel block layer ZBD support and should be
> sufficient to handle the host-managed and host-aware HDDs that are on
> the market today as well as ZNS SSDs that are entering the market at
> the time of submission of this patch.
> 
> I would like to thank the following people for their useful feedback
> and suggestions while working on the initial iterations of this patch.
> 
> Damien Le Moal 
> Matias Bjørling 
> Niklas Cassel 
> Hans Holmberg 
> 
> Fixes: https://github.com/oasis-tcs/virtio-spec/issues/143
> Signed-off-by: Dmitry Fomichev 
> ---

Hi Dmitry,
I have reviewed the spec. Please see the comments below. They are minor
issues and overall I think this can be merged soon.

Thanks,
Stefan

> v5 -> v6:
> 
> Address review comments from Cornelia Huck:
> 
>  - add a clause to disallow VIRTIO_BLK_F_ZONED feature to be offered by
>legacy devices
> 
>  - clarify VIRTIO_BLK_F_DISCARD negotiation procedure for zoned devices
> 
>  - simplify definitions of constant values that are specific to zoned
>devices
> 
>  - editorial changes
> 
> v4 -> v5:
> 
> Add Fixes tag pointing to the corresponding GitHub issue.
> 
> Improve the patch changelog.
> 
> v3 -> v4:
> 
> Address additional feedback from Stefan:
> 
>  - align the append sector field to 8 bytes instead of 4
> 
>  - define "zone sector address" in the non-normative section and use
>this term in the text in a consistent way. Make sure it is clear
>that the value is in bytes.
> 
>  - move portions of VIRTIO_BLK_T_ZONE_REPORT description to the
>non-normative section
> 
>  - clarify the wording about reading of unwritten data
> 
>  - editorial changes
> 
> v2 -> v3:
> 
> A few changes made as the result of off-list discussions with Stefan,
> Damien and Hannes:
> 
>  - drop virtblk_zoned_req for zoned devices and define a union for
>virtio request in header that is specific to ZONE APPEND request
> 
>  - drop support for ALL bit in all zone operations except for RESET
>ZONE. For this zone management operation, define a new request type,
>VIRTIO_BLK_T_ZONE_RESET_ALL. This way, the zone management out
>request header is no longer necessary
> 
>  - editorial changes
> 
> v1 -> v2:
> 
> Address Stefan's review comments:
> 
>  - move normative clauses to normative sections
> 
>  - remove the "partial" bit in zone report
> 
>  - change layout of virtio_blk_zoned_req. The "all" flag becomes a bit
>in "zone" bit field struct. This leaves 31 bits for potential future
>extensions. Move the status byte to be the last one in the struct
> 
>  - set ZBD-specific error codes in the status field, not in
>"zoned_result" field. The former "zoned_result" member now becomes
>"append_sector"
> 
>  - make a few editorial changes
> ---
>  content.tex | 667 +++-
>  1 file changed, 665 insertions(+), 2 deletions(-)
> 
> diff --git a/content.tex b/content.tex
> index 7508dd1..bbc52ad 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -4557,6 +4557,13 @@ \subsection{Feature bits}\label{sec:Device Types / 
> Block Device / Feature bits}
>   maximum erase sectors count in \field{max_secure_erase_sectors} and
>   maximum 

Re: [virtio-dev] VM memory protection and zero-copy transfers.

2022-09-09 Thread Stefan Hajnoczi
On Fri, Sep 09, 2022 at 08:52:02AM +, Afsa, Baptiste wrote:
> Hello,
> 
> I ran some benchmarks to compare the performance achieved by the swiotlb
> approach and our dynamic memory granting solution while using different buffer
> sizes. Without any surprise, the swiotlb approach performs much better when 
> the
> buffers are small. Actually for small buffers, the performance is on par 
> with
> the original configuration where the entire memory is shared. Of course these
> results are specific to the platform I used and the system workload (e.g. CPU
> utilization, caches utilization).
> 
> At the moment we are not planning to add a mechanism that would take the
> decision between copying or granting the buffers dynamically based on their
> sizes, but this experience showed us that devices that uses small packets 
> would
> benefits from going through the swiotlb. So we are considering making this
> configurable on a per-device basis in our solution.
> 
> We have also experimented with the use of a virtual IOMMU on the guest side 
> and
> we have a few concerns with this option.
> 
> If we add a virtual IOMMU, we can see mapping commands being issued as virtio
> buffers are exchanged between the device and the driver. However the kernel
> controls the mappings from DMA addresses to physical addresses. In theory, we
> could remap the memory in the host address space to "implement" these mappings
> but we have some additional constraints that make this approach problematic.
> 
> Our solution runs on systems where the physical IOMMU does not support address
> translation. So we rely on having an identity mapping between the guest 
> address
> space and the physical address space to allow the guest OS to initiate DMA
> transactions. If the memory that we import for virtio buffers uses translated
> addresses, these buffers cannot be used in DMA transactions.
> 
> We also have an issue with letting the driver control the exported memory
> through an IOMMU. If we do this, we need to consider what will happen if the
> guest unmaps a virtio buffer while it is in use on the device side.
> 
> Although it looks possible to recover from such a scenario in the case of a
> device doing CPU accesses to the shared memory, things get more complicated if
> we start considering that the buffer may be involved in a DMA transaction.
> 
> In some previous projects, we have learned that the ability for the hardware
> device and/or its associated driver to recover from an aborted transaction is
> not something that we can rely upon in the general case.
> 
> For this reason, in our typical memory granting scenarios, we usually "lock" 
> the
> shared memory regions to prevent the exporter from revoking the mappings until
> the importer says it is ok to do so.
> 
> Note that locking the mappings could be applied here as well. In this case, we
> would still use this concept of shadow virtqueues and the hypervisor would be
> responsible for locking/unlocking the virtio buffers as they cycle between the
> device and the driver. This design is likely to be slower than the original
> implementation as the cost of locking the mappings is significant (i.e. an 
> extra
> page table walk to validate the memory regions).
> 
> As we discussed in this thread, there are a few options available to enable
> virtio in configurations where the VM address spaces are isolated. I think they
> all have different trade-offs. Our approach certainly has some drawbacks, but it
> also addresses some specific considerations that are relevant in our use case.
> Different configurations will probably require different solutions to this
> question.
> 
> What would be the next steps to go forward with adding a new feature bit such 
> as
> the one I discussed in my original email? Should we prepare a patch on the
> specification and post it here for further discussions?

Hi Baptiste,
Thanks for sharing your findings. As the next step, please send your
VIRTIO spec change proposal to the mailing list so it can be discussed
in detail.

Example commands for sending spec patch emails are here:
https://github.com/oasis-tcs/virtio-spec#providing-feedback

Thanks,
Stefan


signature.asc
Description: PGP signature


Re: [virtio-dev] Next VirtIO device for Project Stratos?

2022-09-07 Thread Stefan Hajnoczi
On Wed, Sep 07, 2022 at 03:09:27PM +0100, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefa...@redhat.com) wrote:
> > On Tue, Sep 06, 2022 at 06:33:36PM +0100, Dr. David Alan Gilbert wrote:
> > > * Stefan Hajnoczi (stefa...@redhat.com) wrote:
> > > > On Sat, Sep 03, 2022 at 07:43:08AM +, Alyssa Ross wrote:
> > > > > Hi Alex and everyone else, just catching up on some mail and wanted to
> > > > > clarify some things:
> > > > > 
> > > > > Alex Bennée  writes:
> > > > > 
> > > > > > This email is driven by a brain storming session at a recent sprint
> > > > > > where we considered what VirtIO devices we should look at 
> > > > > > implementing
> > > > > > next. I ended up going through all the assigned device IDs hunting 
> > > > > > for
> > > > > > missing spec discussion and existing drivers so I'd welcome feedback
> > > > > > from anybody actively using them - especially as my suppositions 
> > > > > > about
> > > > > > device types I'm not familiar with may be way off!
> > > > > >
> > > > > > [...snip...]
> > > > > >
> > > > > > GPU device / 16
> > > > > > ---
> > > > > >
> > > > > > This is now a fairly mature part of the spec and has implementations in
> > > > > > the kernel, QEMU and a vhost-user backend. However as is 
> > > > > > commensurate
> > > > > > with the complexity of GPUs there is ongoing development moving 
> > > > > > from the
> > > > > > VirGL OpenGL encapsulation to a thing called GFXSTREAM which is 
> > > > > > meant to
> > > > > > make some things easier.
> > > > > >
> > > > > > A potential area of interest here is working out what the 
> > > > > > differences
> > > > > > are in use cases between virtio-gpu and virtio-wayland. 
> > > > > > virtio-wayland
> > > > > > is currently a ChromeOS only invention so hasn't seen any 
> > > > > > upstreaming or
> > > > > > specification work but may make more sense where multiple VMs are
> > > > > > drawing only elements of a final display which is composited by a 
> > > > > > master
> > > > > > program. For further reading see Alyssa's write-up:
> > > > > >
> > > > > >   https://alyssa.is/using-virtio-wl/
> > > > > >
> > > > > > I'm not sure how widely used the existing vhost-user backend is for
> > > > > > virtio-gpu but it could present an opportunity for a more beefy 
> > > > > > rust-vmm
> > > > > > backend implementation?
> > > > > 
> > > > > As I understand it, virtio-wayland is effectively deprecated in favour
> > > > > of sending Wayland messages over cross-domain virtio-gpu contexts.  
> > > > > It's
> > > > > possible to do this now with an upstream kernel, whereas 
> > > > > virtio-wayland
> > > > > always required a custom driver in the Chromium kernel.
> > > > > 
> > > > > But crosvm is still the only implementation of a virtio-gpu device 
> > > > > that
> > > > > supports Wayland over cross-domain contexts, so it would be great to 
> > > > > see
> > > > > a more generic implementation.  Especially because, while crosvm can
> > > > > share its virtio-gpu device over vhost-user, it does so in a way 
> > > > > that's
> > > > > incompatible with the standardised vhost-user-gpu as implemented by
> > > > > QEMU.  When I asked the crosvm developers in their Matrix channel what
> > > > > it would take to use the standard vhost-user-gpu variant, they said 
> > > > > that
> > > > > the standard variant was lacking functionality they needed, like 
> > > > > mapping
> > > > > and unmapping GPU buffers into the guest.
> > > > 
> > > > That sounds somewhat similar to virtiofs and its DAX Window, which needs
> > > > vhost-user protocol extensions because of how memory is handled. David
> > > > Gilbert wrote the QEMU virtiofs DAX patches, which are under
> > > > development.
>

Re: [virtio-dev] Next VirtIO device for Project Stratos?

2022-09-06 Thread Stefan Hajnoczi
On Tue, Sep 06, 2022 at 06:33:36PM +0100, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefa...@redhat.com) wrote:
> > On Sat, Sep 03, 2022 at 07:43:08AM +, Alyssa Ross wrote:
> > > Hi Alex and everyone else, just catching up on some mail and wanted to
> > > clarify some things:
> > > 
> > > Alex Bennée  writes:
> > > 
> > > > This email is driven by a brain storming session at a recent sprint
> > > > where we considered what VirtIO devices we should look at implementing
> > > > next. I ended up going through all the assigned device IDs hunting for
> > > > missing spec discussion and existing drivers so I'd welcome feedback
> > > > from anybody actively using them - especially as my suppositions about
> > > > device types I'm not familiar with may be way off!
> > > >
> > > > [...snip...]
> > > >
> > > > GPU device / 16
> > > > ---
> > > >
> > > > This is now a fairly mature part of the spec and has implementations in
> > > > the kernel, QEMU and a vhost-user backend. However as is commensurate
> > > > with the complexity of GPUs there is ongoing development moving from the
> > > > VirGL OpenGL encapsulation to a thing called GFXSTREAM which is meant to
> > > > make some things easier.
> > > >
> > > > A potential area of interest here is working out what the differences
> > > > are in use cases between virtio-gpu and virtio-wayland. virtio-wayland
> > > > is currently a ChromeOS only invention so hasn't seen any upstreaming or
> > > > specification work but may make more sense where multiple VMs are
> > > > drawing only elements of a final display which is composited by a master
> > > > program. For further reading see Alyssa's write-up:
> > > >
> > > >   https://alyssa.is/using-virtio-wl/
> > > >
> > > > I'm not sure how widely used the existing vhost-user backend is for
> > > > virtio-gpu but it could present an opportunity for a more beefy rust-vmm
> > > > backend implementation?
> > > 
> > > As I understand it, virtio-wayland is effectively deprecated in favour
> > > of sending Wayland messages over cross-domain virtio-gpu contexts.  It's
> > > possible to do this now with an upstream kernel, whereas virtio-wayland
> > > always required a custom driver in the Chromium kernel.
> > > 
> > > But crosvm is still the only implementation of a virtio-gpu device that
> > > supports Wayland over cross-domain contexts, so it would be great to see
> > > a more generic implementation.  Especially because, while crosvm can
> > > share its virtio-gpu device over vhost-user, it does so in a way that's
> > > incompatible with the standardised vhost-user-gpu as implemented by
> > > QEMU.  When I asked the crosvm developers in their Matrix channel what
> > > it would take to use the standard vhost-user-gpu variant, they said that
> > > the standard variant was lacking functionality they needed, like mapping
> > > and unmapping GPU buffers into the guest.
> > 
> > That sounds somewhat similar to virtiofs and its DAX Window, which needs
> > vhost-user protocol extensions because of how memory is handled. David
> > Gilbert wrote the QEMU virtiofs DAX patches, which are under
> > development.
> > 
> > I took a quick look at the virtio-gpu specs. If the crosvm behavior you
> > mentioned is covered in the VIRTIO spec then I guess it's the "host
> > visible memory region"?
> > 
> > (If it's not in the VIRTIO spec then a spec change needs to be proposed
> > first and a vhost-user protocol spec change can then support that new
> > virtio-gpu feature.)
> > 
> > The VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB command maps the device's resource
> > into the host visible memory region so that the driver can see it.
> > 
> > The virtiofs DAX window uses vhost-user slave channel messages to
> > provide file descriptors and offsets for QEMU to mmap. QEMU mmaps the
> > file pages into the shared memory region seen by the guest driver.
> > 
> > Maybe an equivalent mechanism is needed for virtio-gpu so a device
> > resource file descriptor can be passed to QEMU and then mmapped so the
> > guest driver can see the pages?
> > 
> > I think it's possible to unify the virtiofs and virtio-gpu extensions to
> > the vhost-user protocol. Two new slave channel messages are needed: "map
> >  to shared memory resource 

Re: [virtio-dev] Next VirtIO device for Project Stratos?

2022-09-05 Thread Stefan Hajnoczi
On Sat, Sep 03, 2022 at 07:43:08AM +, Alyssa Ross wrote:
> Hi Alex and everyone else, just catching up on some mail and wanted to
> clarify some things:
> 
> Alex Bennée  writes:
> 
> > This email is driven by a brain storming session at a recent sprint
> > where we considered what VirtIO devices we should look at implementing
> > next. I ended up going through all the assigned device IDs hunting for
> > missing spec discussion and existing drivers so I'd welcome feedback
> > from anybody actively using them - especially as my suppositions about
> > device types I'm not familiar with may be way off!
> >
> > [...snip...]
> >
> > GPU device / 16
> > ---
> >
> > This is now a fairly mature part of the spec and has implementations in
> > the kernel, QEMU and a vhost-user backend. However as is commensurate
> > with the complexity of GPUs there is ongoing development moving from the
> > VirGL OpenGL encapsulation to a thing called GFXSTREAM which is meant to
> > make some things easier.
> >
> > A potential area of interest here is working out what the differences
> > are in use cases between virtio-gpu and virtio-wayland. virtio-wayland
> > is currently a ChromeOS only invention so hasn't seen any upstreaming or
> > specification work but may make more sense where multiple VMs are
> > drawing only elements of a final display which is composited by a master
> > program. For further reading see Alyssa's write-up:
> >
> >   https://alyssa.is/using-virtio-wl/
> >
> > I'm not sure how widely used the existing vhost-user backend is for
> > virtio-gpu but it could present an opportunity for a more beefy rust-vmm
> > backend implementation?
> 
> As I understand it, virtio-wayland is effectively deprecated in favour
> of sending Wayland messages over cross-domain virtio-gpu contexts.  It's
> possible to do this now with an upstream kernel, whereas virtio-wayland
> always required a custom driver in the Chromium kernel.
> 
> But crosvm is still the only implementation of a virtio-gpu device that
> supports Wayland over cross-domain contexts, so it would be great to see
> a more generic implementation.  Especially because, while crosvm can
> share its virtio-gpu device over vhost-user, it does so in a way that's
> incompatible with the standardised vhost-user-gpu as implemented by
> QEMU.  When I asked the crosvm developers in their Matrix channel what
> it would take to use the standard vhost-user-gpu variant, they said that
> the standard variant was lacking functionality they needed, like mapping
> and unmapping GPU buffers into the guest.

That sounds somewhat similar to virtiofs and its DAX Window, which needs
vhost-user protocol extensions because of how memory is handled. David
Gilbert wrote the QEMU virtiofs DAX patches, which are under
development.

I took a quick look at the virtio-gpu specs. If the crosvm behavior you
mentioned is covered in the VIRTIO spec then I guess it's the "host
visible memory region"?

(If it's not in the VIRTIO spec then a spec change needs to be proposed
first and a vhost-user protocol spec change can then support that new
virtio-gpu feature.)

The VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB command maps the device's resource
into the host visible memory region so that the driver can see it.

The virtiofs DAX window uses vhost-user slave channel messages to
provide file descriptors and offsets for QEMU to mmap. QEMU mmaps the
file pages into the shared memory region seen by the guest driver.

Maybe an equivalent mechanism is needed for virtio-gpu so a device
resource file descriptor can be passed to QEMU and then mmapped so the
guest driver can see the pages?

I think it's possible to unify the virtiofs and virtio-gpu extensions to
the vhost-user protocol. Two new slave channel messages are needed: "map
 to shared memory resource " and "unmap  from shared memory resource ". Both devices could use these
messages to implement their respective DAX Window and Blob Resource
functionality.
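[Editor's note: the unified map/unmap idea above can be modeled with a small
toy sketch. All names and fields here are hypothetical illustrations of the
proposal in this email, not part of the actual vhost-user protocol.]

```python
class SharedMemoryRegion:
    """Toy model of a guest-visible shared memory region managed by the VMM.

    The two methods correspond to the two proposed slave-channel messages:
    "map <fd, fd_offset, length> to shared memory resource" and
    "unmap from shared memory resource". Names are illustrative only.
    """
    def __init__(self, size):
        self.size = size
        self.mappings = {}  # region offset -> (fd, fd_offset, length)

    def map_msg(self, region_offset, fd, fd_offset, length):
        # The device backend asks the VMM to mmap fd pages into the region.
        if region_offset + length > self.size:
            raise ValueError("mapping exceeds shared memory region")
        self.mappings[region_offset] = (fd, fd_offset, length)

    def unmap_msg(self, region_offset):
        # The device backend asks the VMM to remove a previous mapping.
        del self.mappings[region_offset]

# A virtiofs DAX window and a virtio-gpu blob resource could both be
# driven with the same two operations:
dax = SharedMemoryRegion(size=1 << 30)                 # e.g. a 1 GiB DAX window
dax.map_msg(0, fd=42, fd_offset=4096, length=2 << 20)  # fd values are arbitrary
dax.unmap_msg(0)
```

The point of the sketch is that neither device needs a device-specific
message: both only ever map a (fd, offset, length) tuple into a region and
later unmap it.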

> 
> So if we wanted to push forward with making Wayland over
> virtio-gpu less crosvm-specific, I suppose the first step would be to
> figure out with the crosvm developers what functionality is missing in
> the vhost-user-gpu protocol.  That would then make it possible to use
> crosvm's device (with the Wayland support) with other VMMs like QEMU.
> 
> (CCing my colleague Puck, who has also been working with me on getting
> Wayland over virtio-gpu up and running outside of Chrome OS.)

I have CCed David Gilbert (virtiofs DAX Window) and Gurchetan Singh
(virtio-gpu shared memory region).

Stefan




Re: [virtio-dev] [RFC PATCH v3] virtio-blk: add zoned block device specification

2022-08-08 Thread Stefan Hajnoczi
On Thu, 4 Aug 2022 at 18:41, Dmitry Fomichev  wrote:

Hi Dmitry,
I think RFC can be removed from the Subject line for the next revision
of this patch.

For instructions on how to bring this to a Technical Committee vote, see:
https://github.com/oasis-tcs/virtio-spec#use-of-github-issues

>
> Introduce support for Zoned Block Devices to virtio.
>
> Zoned Block Devices (ZBDs) aim to achieve a better capacity, latency
> and/or cost characteristics compared to commonly available block
> devices by getting the entire LBA space of the device divided to block
> regions that are much larger than the LBA size. These regions are
> called zones and they can only be written sequentially. More details
> about ZBDs can be found at
>
> https://zonedstorage.io/docs/introduction/zoned-storage .
>
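[Editor's note: the sequential-write and reset behaviour of zones described
above can be illustrated with a toy write-pointer model. All names are ours,
not taken from the spec patch.]

```python
class SwrZone:
    """Toy Sequential Write Required zone: writes must land exactly at the
    write pointer, and the zone must be reset before data is rewritten."""
    def __init__(self, start, zone_sectors):
        self.start = start
        self.cap = zone_sectors
        self.wp = start                      # write pointer

    def write(self, sector, nsectors):
        if sector != self.wp:                # non-sequential write: error
            raise IOError("unaligned write")
        if self.wp + nsectors > self.start + self.cap:
            raise IOError("zone full")
        self.wp += nsectors                  # pointer advances sequentially

    def reset(self):
        self.wp = self.start                 # zone data may now be rewritten

z = SwrZone(start=0, zone_sectors=8)
z.write(0, 4)
z.write(4, 4)            # sequential writes succeed
try:
    z.write(0, 1)        # rewriting without a reset fails
except IOError:
    z.reset()
z.write(0, 8)            # after the reset the zone can be refilled
```

This is exactly why an unaware guest "will operate erroneously under the most
common write workloads": random writes violate the write-pointer rule.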
> In its current form, the virtio protocol for block devices (virtio-blk)
> is not aware of ZBDs but it allows the guest to successfully scan a
> host-managed drive provided by the host. As a result, the
> host-managed drive appears to the guest as a regular drive that will
> operate erroneously under the most common write workloads.
>
> To fix this, the virtio-blk protocol needs to be extended to add the
> capabilities to convey the zone characteristics of host ZBDs to the
> guest and to provide the support for ZBD-specific commands - Report
> Zones, four zone operations and (optionally) Zone Append. The proposed
> standard extension aims to provide this functionality.
>
> This patch extends the virtio-blk section of virtio specification with
> the minimum set of requirements that are necessary to support ZBDs.
> The resulting device model is a subset of the models defined in ZAC/ZBC
> and ZNS standards documents. The included functionality mirrors
> the existing Linux kernel block layer ZBD support and should be
> sufficient to handle the host-managed and host-aware HDDs that are on
> the market today as well as ZNS SSDs that are entering the market at
> the moment of this patch submission.
>
> I have developed a proof of concept patch series that adds ZBD support
> to virtio-blk Linux kernel driver by implementing the protocol
> extensions defined in the spec patch. I would like to receive feedback
> on this specification patch before posting that series to the block
> LKML.
>
> I would like to thank the following people for their useful feedback
> and suggestions while working on the initial iterations of this patch.
>
> Damien Le Moal 
> Matias Bjørling 
> Niklas Cassel 
> Hans Holmberg 
>
> v1 -> v2:
>
> Address Stefan's review comments:
>
>  - move normative clauses to normative sections
>
>  - remove the "partial" bit in zone report
>
>  - change layout of virtio_blk_zoned_req. The "all" flag becomes a bit
>in "zone" bit field struct. This leaves 31 bits for potential future
>extensions. Move the status byte to be the last one in the struct
>
>  - set ZBD-specific error codes in the status field, not in
>"zoned_result" field. The former "zoned_result" member now becomes
>"append_sector"
>
>  - make a few editorial changes
>
> v2 -> v3:
>
> A few changes made as the result of off-list discussions with Stefan,
> Damien and Hannes:
>
>  - drop virtblk_zoned_req for zoned devices and define a union for
>virtio request in header that is specific to ZONE APPEND request
>
>  - drop support for ALL bit in all zone operations except for RESET
>ZONE. For this zone management operation, define a new request type,
>VIRTIO_BLK_T_ZONE_RESET_ALL. This way, the zone management out
>request header is no longer necessary
>
>  - editorial changes
>
> Signed-off-by: Dmitry Fomichev 
> ---
>  content.tex | 660 +++-
>  1 file changed, 658 insertions(+), 2 deletions(-)
>
> diff --git a/content.tex b/content.tex
> index 7508dd1..cfb8727 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -4557,6 +4557,13 @@ \subsection{Feature bits}\label{sec:Device Types / 
> Block Device / Feature bits}
>   maximum erase sectors count in \field{max_secure_erase_sectors} and
>   maximum erase segment number in \field{max_secure_erase_seg}.
>
> +\item[VIRTIO_BLK_F_ZONED(17)] Device is a Zoned Block Device, that is, a 
> device
> +   that follows the zoned storage device behavior that is also supported 
> by
> +   industry standards such as the T10 Zoned Block Command standard (ZBC 
> r05) or
> +   the NVMe(TM) NVM Express Zoned Namespace Command Set Specification 
> 1.1b
> +   (ZNS). For brevity, these standard documents are referred as "ZBD 
> standards"
> +   from this point on in the text.
> +
>  \end{description}
>
>  \subsubsection{Legacy Interface: Feature bits}\label{sec:Device Types / 
> Block Device / Feature bits / Legacy Interface: Feature bits}
> @@ -4589,6 +4596,75 @@ \subsection{Device configuration 
> layout}\label{sec:Device Types / Block Device /
>  \field{max_secure_erase_sectors} \field{secure_erase_sector_alignment} are 
> expressed
> 

Re: [virtio-dev] VM memory protection and zero-copy transfers.

2022-07-15 Thread Stefan Hajnoczi
On Fri, Jul 15, 2022 at 02:18:32PM +, Afsa, Baptiste wrote:
> > One thing I didn't see in your proposal was a copying vs zero-copy
> > threshold. Maybe it helps to look at the size of requests and copy data
> > instead of granting pages when descriptors are small? On the other hand,
> > a 4 KB page size means that many descriptors won't be larger than 4 KB
> > anyway due to guest physical memory fragmentation. This is basically
> > hybrid of swiotlb and your proposal - zero-copy when it pays off,
> > copying when it's cheap.
> >
> > As I mentioned, I think IOMMUs are worth investigating, in particular
> > for the case where mappings are rarely changed. They are fast in that
> > case.
> 
> I agree there is most likely a point up to which copying is cheaper. However, we
> have not considered this in our initial investigation.
> 
> Thank you Stefan for taking the time to read this proposal and for your
> feedback.

I will be away until August so I probably won't participate in the
discussion.

One thing I forgot to say is that relying on a single benchmark may not
be representative of real workloads. Two factors we touched on are
descriptor size and page reuse. It seems worth evaluating solutions
based on multiple benchmarks that vary these factors.

Stefan




Re: [virtio-dev] VM memory protection and zero-copy transfers.

2022-07-12 Thread Stefan Hajnoczi
On Fri, Jul 08, 2022 at 01:56:31PM +, Afsa, Baptiste wrote:
> Hello everyone,
> 
> The traditional virtio model relies on the ability for the host to access the
> entire memory of the guest VM.

The VIRTIO device model (virtqueues, configuration space, feature
negotiation, etc) does not rely on shared memory access between the
device and the driver.

There is a shared memory resource in the device model that some devices
use, but that's the only thing that requires shared memory.

It's the virtio-pci, virtio-mmio, etc transports and their use of the
vring layout that requires shared memory access.

This might seem pedantic but there's a practical reason for making the
distinction. It should be possible to have a virtio-tcp or other message
passing transport for VIRTIO one day. Correctly layered drivers will
work regardless of whether the underlying transport relies on shared
memory or message passing.

> Virtio is also used in system configurations
> where the devices are not provided by the host (which may not exist as such in
> the case of a Type-1 hypervisor) but by another, unprivileged guest VM. In 
> such
> a configuration, the guest VM memory sharing requirement would raise security
> concerns.

Guest drivers can use IOMMU functionality to restrict device access to
memory, if available from the transport. For example, a virtio-pci
driver implementation can program the IOMMU to allow read/write access
only to the vring and virtqueue buffer pages.
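A toy model of that restriction (this is not a real IOMMU or DMA API, just an
illustration of the allow-list idea with an assumed 4 KiB page size):

```python
PAGE = 4096

class ToyIommu:
    """Page-granular allow list: only pages the driver has mapped are
    reachable by device DMA; everything else faults."""
    def __init__(self):
        self.allowed = set()                  # set of mapped page frame numbers

    def map(self, addr, length):
        # Driver programs mappings for a buffer spanning [addr, addr+length).
        for pfn in range(addr // PAGE, (addr + length + PAGE - 1) // PAGE):
            self.allowed.add(pfn)

    def device_access(self, addr):
        # True if a device DMA to addr would be translated, False if it faults.
        return addr // PAGE in self.allowed

iommu = ToyIommu()
iommu.map(0x10000, 3 * PAGE)       # the vring pages
iommu.map(0x80000, PAGE)           # one virtqueue buffer
assert iommu.device_access(0x10008)        # inside the vring: allowed
assert not iommu.device_access(0x200000)   # arbitrary guest memory: blocked
```

The addresses are arbitrary examples; the point is that the device can only
reach pages the driver explicitly exposed.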

> The following proposal removes that requirement by introducing an alternative
> model where the interactions between the virtio driver and the virtio device 
> are
> mediated by the hypervisor. This concept is applicable to both Type-1 and 
> Type-2
> hypervisors. In the following write-up, the "host" thus refers either to the
> host OS or to the guest VM that executes the virtio device.
> 
> The main objective is to keep the memory of the VM that runs the driver 
> isolated
> from the memory that runs the device, while still allowing zero-copy transfers
> between the two domains. The operations that control the exchange of the 
> virtio
> buffers are handled by hypervisor code that sits between the device and the
> driver.
> 
> As opposed to the regular virtio model, the virtqueues allocated by the driver
> are not shared with the device directly. Instead, the hypervisor allocates a
> separate set of virtqueues that have the same sizes as the original ones and
> shares this second set with the device. These hypervisor-allocated virtqueues
> are referred as the "shadow virtqueues".
> 
> During device operation, the hypervisor copies the descriptors between the
> driver and the shadow virtqueues as the buffers cycle between the driver and 
> the
> device.
> 
> Whenever the driver adds some buffers to the available ring, the hypervisor
> validates the descriptors and dynamically grants the I/O buffers to the host 
> or
> VM that runs the device. The hypervisor then copies these descriptors to the
> shadow virtqueue's available ring. At the other end, when the device returns
> buffers to the shadow virtqueue's used ring, the hypervisor unmaps these 
> buffers
> from the host's address space and copies the descriptor to the driver's used 
> ring.
> 
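[Editor's note: the descriptor shuttling described above can be sketched as a
toy model. All names are ours and the real hypervisor logic is far more
involved; this only shows the validate / grant / copy / revoke cycle.]

```python
PAGE = 4096

def pages(addr, length):
    """Set of page frame numbers covered by [addr, addr+length)."""
    return set(range(addr // PAGE, (addr + length + PAGE - 1) // PAGE))

class Hypervisor:
    def __init__(self, guest_ram_pages):
        self.guest_ram = guest_ram_pages      # pfns owned by the driver VM
        self.granted = set()                  # pfns shared with the device side
        self.shadow_avail = []                # shadow virtqueue available ring
        self.driver_used = []                 # driver virtqueue used ring

    def driver_kick(self, desc):              # desc = (addr, length)
        addr, length = desc
        # Validate the descriptor before granting anything.
        assert pages(addr, length) <= self.guest_ram, "bad descriptor"
        self.granted |= pages(addr, length)   # grant the I/O buffer pages
        self.shadow_avail.append(desc)        # copy to the shadow avail ring

    def device_complete(self, desc):
        addr, length = desc
        self.granted -= pages(addr, length)   # revoke the device-side mapping
        self.driver_used.append(desc)         # copy back to the driver used ring

hv = Hypervisor(guest_ram_pages=set(range(1024)))
hv.driver_kick((0x2000, 512))
hv.device_complete((0x2000, 512))
```

Note that the grant/revoke happens on the same path as the ring copy, which is
why this approach avoids the separate map/unmap vmexits an IOMMU would need.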
> Although the virtio buffers can be allocated anywhere in the guest memory and
> are not necessarily page-aligned, the memory sharing granularity is 
> constrained
> by the page size. So when a buffer is mapped to the host address space, the
> hypervisor may end up sharing more memory than is strictly needed.
> 
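[Editor's note: the over-sharing described above is easy to quantify. A small
sketch, assuming a 4 KiB page size; the addresses are arbitrary examples.]

```python
PAGE = 4096

def shared_bytes(addr, length):
    """Bytes actually exposed when granting a buffer at page granularity."""
    first = addr // PAGE
    last = (addr + length - 1) // PAGE
    return (last - first + 1) * PAGE

# A 100-byte buffer inside one page still shares a full 4 KiB page:
assert shared_bytes(0x1000, 100) == PAGE
# The same 100 bytes straddling a page boundary shares 8 KiB:
assert shared_bytes(0x1FC0, 100) == 2 * PAGE
```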
> The cost of granting the memory dynamically as virtio transfers proceed is
> significant, though. We measured up to 40% performance degradation when using
> this dynamic buffer granting mechanism.
> 
> We also compared this solution to other approaches that we have seen 
> elsewhere.
> For instance, using the swiotlb mechanism along with the
> VIRTIO_F_ACCESS_PLATFORM feature bit to force a copy of the I/O buffers to a
> statically shared memory region. In that case, the same set of benchmarks 
> shows
> an even bigger performance degradation, up to 60%, compared to the original
> virtio performance.

Did you try virtio-pci with an IOMMU? The advantage compared to both
your proposal and swiotlb is that workloads that reuse buffers have no
performance overhead because the IOMMU mappings remain in place across
virtqueue requests.

I have CCed Jean-Philippe Brucker  who
designed the virtio-iommu device.

Using an IOMMU can be slower than the approach you are proposing when
each request requires new mappings. That's because your approach
combines the virtqueue kick processing with the page granting whereas
programming an IOMMU with map/unmap commands is a separate vmexit from
the virtqueue kick. It's probably easier to make your approach faster in
the dynamic mappings case for this reason.

A page-table based IOMMU (doesn't require explicit map/unmap commands
because it reads mappings on-demand from a 

Re: [virtio-dev] [RFC PATCH v2] virtio-blk: add zoned block device specification

2022-06-11 Thread Stefan Hajnoczi
On Sun, Jun 12, 2022 at 01:58:36AM +, Dmitry Fomichev wrote:
> On Sat, 2022-06-11 at 21:09 +0100, Stefan Hajnoczi wrote:
> > On Thu, Jun 09, 2022 at 08:43:21PM -0400, Dmitry Fomichev wrote:
> > > +Host-managed zoned block devices have their LBA range divided to 
> > > Sequential
> > > +Write Required (SWR) zones that require some additional handling from the
> > > host
> > > +for sustainable operation. All write requests to SWR zones must be
> > > sequential
> > > +and the zones with some data need to be reset before that data can be
> > > rewritten.
> > > +Host-managed devices support a set of ZBD-specific I/O requests that can 
> > > be
> > > used
> > > +by the host to manage device zones. Host-managed devices report
> > > VIRTIO_BLK_Z_HM
> > > +value in the \field{model} field in \field{zoned}.
> > 
> > One of
> > "report the VIRTIO_BLK_Z_HM value in ..."
> > "report a VIRTIO_BLK_Z_HM value in ..."
> > "report VIRTIO_BLK_Z_HM in ..."
> > reads more naturally.
> 
> OK.
> 
> > 
> > > +
> > > +Host-aware zoned block devices have their LBA range divided to Sequential
> > > +Write Preferred (SWP) zones that support the random write access, 
> > > similar to
> > > +regular non-zoned devices. However, the device I/O performance might not 
> > > be
> > > +optimal if SWP zones are used in a random I/O pattern. SWP zones also
> > > support
> > > +the same set of ZBD-specific I/O requests as host-managed devices that 
> > > allow
> > > +host-aware devices to be managed by any host that supports zoned block
> > > devices
> > > +to achieve its optimum performance. Host-aware devices report
> > > VIRTIO_BLK_Z_HA
> > > +value in the \field{model} field in \field{zoned}.
> > 
> > Same as above.
> 
> Sure.
> 
> > 
> > > +
> > > +During device operation, SWR and SWP zones can be in one of the following
> > > states:
> > > +empty, implicitly-open, explicitly-open, closed and full. The state 
> > > machine
> > > that
> > > +governs the transitions between these states is described later in this
> > > document.
> > > +
> > > +SWR and SWP zones consume volatile device resources while being in 
> > > certain
> > > +states and the device may set limits on the number of zones that can be 
> > > in
> > > these
> > > +states simultaneously.
> > > +
> > > +Zoned block devices use two internal counters to account for the device
> > > +resources in use, the number of currently open zones and the number of
> > > currently
> > > +active zones.
> > > +
> > > +Any zone state transition from a state that doesn't consume a zone 
> > > resource
> > > to a
> > > +state that consumes the same resource increments the internal device 
> > > counter
> > > for
> > > +that resource. Any zone transition out of a state that consumes a zone
> > > resource
> > > +to a state that doesn't consume the same resource decrements the counter.
> > > Any
> > > +request that causes the device to exceed the reported zone resource 
> > > limits
> > > is
> > > +terminated by the device with a "zone resources exceeded" error as 
> > > defined
> > > for
> > > +specific commands later.
> > > +
> > >  \begin{lstlisting}
> > >  struct virtio_blk_config {
> > >  le64 capacity;
> > > @@ -4623,6 +4694,15 @@ \subsection{Device configuration
> > > layout}\label{sec:Device Types / Block Device /
> > >  le32 max_secure_erase_sectors;
> > >  le32 max_secure_erase_seg;
> > >  le32 secure_erase_sector_alignment;
> > > +    struct virtio_blk_zoned_characteristics {
> > > +    le32 zone_sectors;
> > > +    le32 max_open_zones;
> > > +    le32 max_active_zones;
> > > +    le32 max_append_sectors;
> > > +    le32 write_granularity;
> > > +    u8 model;
> > > +    u8 unused2[3];
> > > +    } zoned;
> > >  };
> > >  \end{lstlisting}
> > >  
> > > @@ -4686,6 +4766,10 @@ \subsection{Device Initialization}\label{sec:Device
> > > Types / Block Device / Devic
> > >  \field{secure_erase_sector_alignment} can b

Re: [virtio-dev] [RFC PATCH v2] virtio-blk: add zoned block device specification

2022-06-11 Thread Stefan Hajnoczi
On Thu, Jun 09, 2022 at 08:43:21PM -0400, Dmitry Fomichev wrote:
> +Host-managed zoned block devices have their LBA range divided to Sequential
> +Write Required (SWR) zones that require some additional handling from the 
> host
> +for sustainable operation. All write requests to SWR zones must be sequential
> +and the zones with some data need to be reset before that data can be 
> rewritten.
> +Host-managed devices support a set of ZBD-specific I/O requests that can be 
> used
> +by the host to manage device zones. Host-managed devices report 
> VIRTIO_BLK_Z_HM
> +value in the \field{model} field in \field{zoned}.

One of
"report the VIRTIO_BLK_Z_HM value in ..."
"report a VIRTIO_BLK_Z_HM value in ..."
"report VIRTIO_BLK_Z_HM in ..."
reads more naturally.

> +
> +Host-aware zoned block devices have their LBA range divided to Sequential
> +Write Preferred (SWP) zones that support the random write access, similar to
> +regular non-zoned devices. However, the device I/O performance might not be
> +optimal if SWP zones are used in a random I/O pattern. SWP zones also support
> +the same set of ZBD-specific I/O requests as host-managed devices that allow
> +host-aware devices to be managed by any host that supports zoned block 
> devices
> +to achieve its optimum performance. Host-aware devices report VIRTIO_BLK_Z_HA
> +value in the \field{model} field in \field{zoned}.

Same as above.

> +
> +During device operation, SWR and SWP zones can be in one of the following states:
> +empty, implicitly-open, explicitly-open, closed and full. The state machine that
> +governs the transitions between these states is described later in this document.
> +
> +SWR and SWP zones consume volatile device resources while being in certain
> +states and the device may set limits on the number of zones that can be in these
> +states simultaneously.
> +
> +Zoned block devices use two internal counters to account for the device
> +resources in use, the number of currently open zones and the number of currently
> +active zones.
> +
> +Any zone state transition from a state that doesn't consume a zone resource to a
> +state that consumes the same resource increments the internal device counter for
> +that resource. Any zone transition out of a state that consumes a zone resource
> +to a state that doesn't consume the same resource decrements the counter. Any
> +request that causes the device to exceed the reported zone resource limits is
> +terminated by the device with a "zone resources exceeded" error as defined for
> +specific commands later.
> +
>  \begin{lstlisting}
>  struct virtio_blk_config {
>  le64 capacity;
> @@ -4623,6 +4694,15 @@ \subsection{Device configuration layout}\label{sec:Device Types / Block Device /
>  le32 max_secure_erase_sectors;
>  le32 max_secure_erase_seg;
>  le32 secure_erase_sector_alignment;
> +struct virtio_blk_zoned_characteristics {
> +le32 zone_sectors;
> +le32 max_open_zones;
> +le32 max_active_zones;
> +le32 max_append_sectors;
> +le32 write_granularity;
> +u8 model;
> +u8 unused2[3];
> +} zoned;
>  };
>  \end{lstlisting}
>  
> @@ -4686,6 +4766,10 @@ \subsection{Device Initialization}\label{sec:Device Types / Block Device / Devic
>  \field{secure_erase_sector_alignment} can be used by OS when splitting a
>  request based on alignment.
>  
> +\item If the VIRTIO_BLK_F_ZONED feature is negotiated, the fields in
> +\field{zoned} can be read by the driver to determine the zone
> +characteristics of the device. All \field{zoned} fields are read-only.
> +
>  \end{enumerate}
>  
> \drivernormative{\subsubsection}{Device Initialization}{Device Types / Block Device / Device Initialization}
> @@ -4701,6 +4785,33 @@ \subsection{Device Initialization}\label{sec:Device Types / Block Device / Devic
>  The driver MUST NOT read \field{writeback} before setting
>  the FEATURES_OK \field{device status} bit.
>  
> +Drivers SHOULD NOT negotiate VIRTIO_BLK_F_ZONED feature if they are incapable
> +of supporting devices with the VIRTIO_BLK_Z_HM or VIRTIO_BLK_Z_HA zoned model.
> +
> +Drivers MAY operate with VIRTIO_BLK_F_ZONED feature negotiated when the device
> +reports VIRTIO_BLK_Z_NONE zoned model for testing and development.

Specifying a particular use case ("testing and development") could be
interpreted to imply that other use cases MAY NOT operate with
VIRTIO_BLK_F_ZONED.

I suggest dropping the use case from the sentence or moving that sub-point
outside the normative section to where the text describes VIRTIO_BLK_Z_NONE.
Something like "Devices that offer VIRTIO_BLK_F_ZONED with VIRTIO_BLK_Z_NONE
commonly do so for testing and development purposes".

> +Each request is of form:
> +
> +\begin{lstlisting}
> +struct virtio_blk_zoned_req {
> +le32 type;
> +le32 reserved;
> +  

Re: [virtio-dev] [RFC PATCH] virtio-blk: add zoned block device specification

2022-06-04 Thread Stefan Hajnoczi
On Sat, Jun 04, 2022 at 12:09:46AM +, Dmitry Fomichev wrote:
> On Fri, 2022-06-03 at 09:08 +0100, Stefan Hajnoczi wrote:
> > (Resent from my subscriber email.)
> > 
> > > On Fri, 3 Jun 2022 at 03:30, Dmitry Fomichev wrote:
> > > 
> > > On Thu, 2022-06-02 at 08:23 +0100, Stefan Hajnoczi wrote:
> > > > (Resending from my mailing list subscriber address so this makes it onto
> > > > the list.)
> > > > 
> > > > > On Thu, 2 Jun 2022 at 00:56, Dmitry Fomichev wrote:
> > > > Here is the non-union representation of the structs. Note that this
> > > > means existing request processing code can handle IN/OUT/FLUSH even
> > > > when VIRTIO_BLK_F_ZONED is negotiated. We don't need two different
> > > > code paths for these common commands:
> > > > 
> > > > /* IN/OUT/FLUSH/DISCARD/WRITE_ZEROES */
> > > > struct virtio_blk_req {
> > > >     le32 type;
> > > >     le32 reserved;
> > > >     le64 sector;
> > > >     u8 data[];
> > > >     u8 status;
> > > > };
> > > > 
> > > > /* ZONE_APPEND */
> > > > struct virtio_blk_req_zone_append {
> > > >     le32 type;
> > > >     le32 reserved;
> > > >     le64 sector;
> > > >     u8 data[];
> > > >     le64 zone_result;
> > > >     u8 status;
> > > > };
> > > > 
> > > > /* ZONE_OPEN, ZONE_CLOSE, ZONE_FINISH, ZONE_RESET */
> > > > struct virtio_blk_req_zone_mgmt_send {
> > > >     le32 type;
> > > >     le32 reserved;
> > > >     le64 sector;
> > > >     /* ALL zone operation flag */
> > > >     __u8 all;
> > > >     __u8 unused1[3];
> > > >     u8 data[];
> > > >     u8 status;
> > > > };
> > > > 
> > > > /* ZONE_REPORT */
> > > > struct virtio_blk_req_zone_mgmt_receive {
> > > >     le32 type;
> > > >     le32 reserved;
> > > >     le64 sector;
> > > >     /* Partial zone report flag */
> > > >     __u8 partial;
> > > >     __u8 unused2[3];
> > > >     u8 data[];
> > > >     u8 status;
> > > > };
> > > 
> > > We felt that it is less confusing to have a single ZBD request structure
> > > so there is 1:1 relationship between the VIRTIO_BLK_F_ZONED bit value
> > > and the choice of the request -
> > > 
> > > VIRTIO_BLK_F_ZONED off ->  virtio_blk_req
> > > VIRTIO_BLK_F_ZONED on ->  virtio_blk_zoned_req
> > > 
> > > We also want to make sure that zone management/send/receive and append
> > > requests are always the same size and it should be easier with a single
> > > request structure.
> > > 
> > > It is possible to use a simpler, but less structured union here -
> > > 
> > > struct virtio_blk_zoned_req {
> > >     le32 type;
> > >     le32 reserved;
> > >     le64 sector;
> > >     union zoned_params {
> > >     /* ALL zone operation flag */
> > >     __u8 all;
> > >     /* Partial zone report flag */
> > >     __u8 partial;
> > >     __u32 rsvd1;
> > >     } zone;
> > >     u8 data[];
> > >     le64 zone_result;
> > >     u8 reserved1[3];
> > >     u8 status;
> > > }
> > > 
> > > Using the single code path is very important and the PoC Linux driver code
> > > that I have sketched up achieves that. Since the request header is identical
> > > for both cases, I hope that it can be done in QEMU too.
> > 
> > There are two possibilities for optimizing the request layout:
> > 1. For drivers and devices that implement both !VIRTIO_BLK_F_ZONED and
> > VIRTIO_BLK_F_ZONED. Here it's best to share the same request layout
> > for commands available in both modes. In other words IN and OUT
> > requests to zoned devices should use struct virtio_blk_req so code can
> > be shared and only commands unique to zoned devices use struct
> > virtio_blk_zoned_req.
> > 2. For drivers and devices that implement only one of
> > !VIRTIO_BLK_F_ZONED and VIR

Re: [virtio-dev] [RFC PATCH] virtio-blk: add zoned block device specification

2022-06-03 Thread Stefan Hajnoczi
On Fri, Jun 03, 2022 at 05:16:01PM +0900, Damien Le Moal wrote:
> On 6/3/22 16:33, Stefan Hajnoczi wrote:
> [...]
> >> If partial is 0, the nr_zones will contain the number of zones that
> >> potentially can be reported starting from the request "sector" position and
> >> until the end of sector range of the device. It does not necessarily equal
> >> the total number of zones on the drive.
> > 
> > The driver knows zone_sectors and nr_zones, so it can calculate
> > nr_zones - (sector / zone_sectors) itself. Why send a ZONE_REPORT
> > partial=0 request to the device to get this value?
> 
> Good point. Indeed, the partial bit is not useful when we have all zones
> with the same size. ZBC does not enforce that. So in general, it is not
> possible to calculate the number of zones starting from a particular LBA
> even if the total capacity is known. A report zones reply thus always
> indicates the *total* number of zones that could be reported from a given
> LBA, regardless of the report buffer size. This forces the drives to go
> through all zones even if the report zones buffer is already filled up,
> leading to longer command execution times.
> 
> The partial bit prevents that and is very useful to speed up a zone report
> for a single zone (e.g. to know that zone's wp location).
> 
> For a Linux host, since the kernel enforces that all zones must have the
> same size, the partial bit is not useful. But other hosts may not enforce
> this, in which case the partial bit can be useful.
> 
> That said, I would not mind getting rid of it and assume a partial=1
> behavior, always. The kernel ioctl API does not have that parameter
> anyway, so it would have to be emulated by the driver. Not too hard to do,
> but still more code than strictly necessary.

Okay. We can start simple and introduce a feature bit later for devices
that have multiple zone sizes, if necessary.

Stefan

