On 7/14/2024 10:14 PM, Jason Wang wrote:
On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare <steven.sist...@oracle.com> wrote:

Live update is a technique wherein an application saves its state, exec's
to an updated version of itself, and restores its state.  Clients of the
application experience a brief suspension of service, on the order of
hundreds of milliseconds, but are otherwise unaffected.

Define and implement interfaces that allow vdpa devices to be preserved
across fork or exec, to support live update for applications such as QEMU.
The device must be suspended during the update, but its DMA mappings are
preserved, so the suspension is brief.

The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
accounting from one process to another.

The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
VHOST_NEW_OWNER is supported.
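
As a rough illustration (not part of this series), a process that inherited
the vdpa fd across exec might probe for the capability and take ownership as
sketched below.  The take_ownership() helper is hypothetical; VHOST_NEW_OWNER
and VHOST_BACKEND_F_NEW_OWNER are the uapi additions proposed here.

   #include <stdint.h>
   #include <sys/ioctl.h>
   #include <linux/vhost.h>

   /* Hypothetical helper: probe for the new-owner capability, then
    * transfer device ownership and pinned-memory accounting to the
    * current (post-exec) process.  VHOST_NEW_OWNER and
    * VHOST_BACKEND_F_NEW_OWNER are the constants proposed by this series. */
   static int take_ownership(int fd)
   {
           uint64_t features;

           if (ioctl(fd, VHOST_GET_BACKEND_FEATURES, &features) < 0)
                   return -1;
           if (!(features & (1ULL << VHOST_BACKEND_F_NEW_OWNER)))
                   return -1;      /* kernel does not support live update */

           return ioctl(fd, VHOST_NEW_OWNER);
   }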

The VHOST_IOTLB_REMAP message type updates a DMA mapping with its userland
address in the new process.

The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
VHOST_IOTLB_REMAP is supported and required.  Some devices do not
require it, because the userland address of each DMA mapping is discarded
after being translated to a physical address.
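
For illustration only, a sketch of how a process might re-issue one mapping's
new userland address after exec.  remap_one() is a hypothetical helper,
VHOST_IOTLB_REMAP is the message type proposed by this series, and iova/size
are assumed to match a mapping created before the exec.

   #include <stdint.h>
   #include <unistd.h>
   #include <linux/vhost_types.h>

   /* Hypothetical helper: tell the kernel the new userland address of an
    * existing DMA mapping.  Only uaddr changes; iova and size must match
    * the original VHOST_IOTLB_UPDATE. */
   static int remap_one(int fd, uint64_t iova, uint64_t size, void *new_uaddr)
   {
           struct vhost_msg_v2 msg = {
                   .type = VHOST_IOTLB_MSG_V2,
                   .iotlb = {
                           .iova  = iova,
                           .size  = size,
                           .uaddr = (uintptr_t)new_uaddr,
                           .perm  = VHOST_ACCESS_RW,
                           .type  = VHOST_IOTLB_REMAP,  /* from this series */
                   },
           };

           return write(fd, &msg, sizeof(msg)) == sizeof(msg) ? 0 : -1;
   }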

Here is a pseudo-code sequence for performing live update, based on
suspend + reset because resume is not yet widely available.  The vdpa device
descriptor, fd, remains open across the exec.

   ioctl(fd, VHOST_VDPA_SUSPEND)
   ioctl(fd, VHOST_VDPA_SET_STATUS, 0)

I don't understand why we need a reset after suspend; it looks to me
like the previous suspend became meaningless.

The suspend guarantees completion of in-progress DMA.  At least, that is
my interpretation of why that is done for live migration in QEMU, which
also does suspend + reset + re-create.  I am following the live migration
model.

   exec

   ioctl(fd, VHOST_NEW_OWNER)

   issue ioctls to re-create vrings

   if VHOST_BACKEND_F_IOTLB_REMAP

So the idea is that a device which uses virtual addresses doesn't need
VHOST_BACKEND_F_IOTLB_REMAP at all?

Actually the reverse: if the device translates virtual to physical addresses
when the mappings are created, and discards the virtual addresses, then
VHOST_IOTLB_REMAP is not needed.

       foreach dma mapping
           write(fd, {VHOST_IOTLB_REMAP, new_addr})

   ioctl(fd, VHOST_VDPA_SET_STATUS,
             ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)
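
For completeness, the post-exec portion of this sequence might look roughly
like the sketch below, reusing the hypothetical take_ownership() and
remap_one() helpers from above; the vring re-creation ioctls and the maps[]
record of existing mappings are assumed to be supplied by the application.

   #include <stdint.h>
   #include <sys/ioctl.h>
   #include <linux/vhost.h>
   #include <linux/virtio_config.h>

   struct mapping { uint64_t iova, size; void *new_uaddr; };

   /* Hypothetical post-exec restore path.  Suspend and reset were already
    * issued by the old process before exec. */
   static int restore_device(int fd, struct mapping *maps, int nmaps)
   {
           uint8_t status = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER |
                            VIRTIO_CONFIG_S_FEATURES_OK | VIRTIO_CONFIG_S_DRIVER_OK;
           int i;

           if (take_ownership(fd) < 0)             /* VHOST_NEW_OWNER */
                   return -1;

           /* ... re-create vrings with the usual VHOST_SET_VRING_* ioctls ... */

           /* Only needed if VHOST_BACKEND_F_IOTLB_REMAP is advertised. */
           for (i = 0; i < nmaps; i++)
                   if (remap_one(fd, maps[i].iova, maps[i].size,
                                 maps[i].new_uaddr) < 0)
                           return -1;

           return ioctl(fd, VHOST_VDPA_SET_STATUS, &status);
   }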

From the API level, this seems to be asymmetric, as we have suspend but
not resume?

Again, I am just following the path taken by live migration.
I will be happy to use resume when the devices and QEMU support it.
The decision to use reset vs resume should not affect the definition
and use of VHOST_NEW_OWNER and VHOST_IOTLB_REMAP.

- Steve

This is faster than VHOST_RESET_OWNER + VHOST_SET_OWNER + VHOST_IOTLB_UPDATE,
as that sequence would unpin and re-pin the physical pages, which can take
multiple seconds for large memories.

This is implemented in QEMU by the patch series "Live update: vdpa"
   https://lore.kernel.org/qemu-devel/TBD  (reference to be posted shortly)

The QEMU implementation leverages the live migration code path, but after
CPR exec's the new QEMU:
   - vhost_vdpa_set_owner() calls VHOST_NEW_OWNER instead of VHOST_SET_OWNER
   - vhost_vdpa_dma_map() sets type VHOST_IOTLB_REMAP instead of
     VHOST_IOTLB_UPDATE

Changes in V2:
   - clean up handling of set_map vs dma_map vs platform iommu in remap
   - augment and clarify commit messages and comments

Steve Sistare (7):
   vhost-vdpa: count pinned memory
   vhost-vdpa: pass mm to bind
   vhost-vdpa: VHOST_NEW_OWNER
   vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER
   vhost-vdpa: VHOST_IOTLB_REMAP
   vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP
   vdpa/mlx5: new owner capability

  drivers/vdpa/mlx5/net/mlx5_vnet.c |   3 +-
  drivers/vhost/vdpa.c              | 125 ++++++++++++++++++++++++++++--
  drivers/vhost/vhost.c             |  15 ++++
  drivers/vhost/vhost.h             |   1 +
  include/uapi/linux/vhost.h        |  10 +++
  include/uapi/linux/vhost_types.h  |  15 +++-
  6 files changed, 161 insertions(+), 8 deletions(-)

--
2.39.3


Thanks

