On Wed, Oct 04, 2023 at 02:58:59PM +0200, Hanna Czenczek wrote: > Currently, the vhost-user documentation says that rings are to be > initialized in a disabled state when VHOST_USER_F_PROTOCOL_FEATURES is > negotiated. However, by the time of feature negotiation, all rings have > already been initialized, so it is not entirely clear what this means. > > At least the vhost-user-backend Rust crate's implementation interpreted > it to mean that whenever this feature is negotiated, all rings are to > put into a disabled state, which means that every SET_FEATURES call > would disable all rings, effectively halting the device. This is > problematic because the VHOST_F_LOG_ALL feature is also set or cleared > this way, which happens during migration. Doing so should not halt the > device. > > Other implementations have interpreted this to mean that the device is > to be initialized with all rings disabled, and a subsequent SET_FEATURES > call that does not set VHOST_USER_F_PROTOCOL_FEATURES will enable all of > them. Here, SET_FEATURES will never disable any ring. > > This interpretation does not suffer the problem of unintentionally > halting the device whenever features are set or cleared, so it seems > better and more reasonable. > > We can clarify this in the documentation by making it explicit that the > enabled/disabled state is tracked even while the vring is stopped. > Every vring is initialized in a disabled state, and SET_FEATURES without > VHOST_USER_F_PROTOCOL_FEATURES simply becomes one way to enable all > vrings. > > Signed-off-by: Hanna Czenczek <hre...@redhat.com>
OK so I am expecting v5. My advice is to move patch 1 to end of patchset so we can defer it if we want to. > --- > docs/interop/vhost-user.rst | 32 +++++++++++++++++--------------- > 1 file changed, 17 insertions(+), 15 deletions(-) > > diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst > index 50f5acebe5..9f4940a036 100644 > --- a/docs/interop/vhost-user.rst > +++ b/docs/interop/vhost-user.rst > @@ -395,31 +395,33 @@ negotiation. > Ring states > ----------- > > -Rings can be in one of three states: > +Rings have two independent states: started/stopped, and enabled/disabled. > > -* stopped: the back-end must not process the ring at all. > +* While a ring is stopped, the back-end must not process the ring at > + all, regardless of whether it is enabled or disabled. The > + enabled/disabled state should still be tracked, though, so it can come > + into effect once the ring is started. > > -* started but disabled: the back-end must process the ring without > +* started and disabled: The back-end must process the ring without > causing any side effects. For example, for a networking device, > in the disabled state the back-end must not supply any new RX packets, > but must process and discard any TX packets. > > -* started and enabled. > +* started and enabled: The back-end must process the ring normally, i.e. > + process all requests and execute them. > > -Each ring is initialized in a stopped state. The back-end must start > -ring upon receiving a kick (that is, detecting that file descriptor is > -readable) on the descriptor specified by ``VHOST_USER_SET_VRING_KICK`` > -or receiving the in-band message ``VHOST_USER_VRING_KICK`` if negotiated, > -and stop ring upon receiving ``VHOST_USER_GET_VRING_BASE``. > +Each ring is initialized in a stopped and disabled state. The back-end > +must start a ring upon receiving a kick (that is, detecting that file > +descriptor is readable) on the descriptor specified by > +``VHOST_USER_SET_VRING_KICK`` or receiving the in-band message > +``VHOST_USER_VRING_KICK`` if negotiated, and stop a ring upon receiving > +``VHOST_USER_GET_VRING_BASE``. > > Rings can be enabled or disabled by ``VHOST_USER_SET_VRING_ENABLE``. > > -If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the > -ring starts directly in the enabled state. > - > -If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is > -initialized in a disabled state and is enabled by > -``VHOST_USER_SET_VRING_ENABLE`` with parameter 1. > +In addition, upon receiving a ``VHOST_USER_SET_FEATURES`` message from > +the front-end without ``VHOST_USER_F_PROTOCOL_FEATURES`` set, the > +back-end must enable all rings immediately. > > While processing the rings (whether they are enabled or not), the back-end > must support changing some configuration aspects on the fly. > -- > 2.41.0 On Wed, Oct 04, 2023 at 02:59:00PM +0200, Hanna Czenczek wrote: > In vDPA, GET_VRING_BASE does not stop the queried vring, which is why > SUSPEND was introduced so that the returned index would be stable. In > vhost-user, it does stop the vring, so under the same reasoning, it can > get away without SUSPEND. > > Still, we do want to clarify that if the device is completely stopped, > i.e. all vrings are stopped, the back-end should cease to modify any > state relating to the guest. Do this by calling it "suspended". > > Suggested-by: Stefan Hajnoczi <stefa...@redhat.com> > Signed-off-by: Hanna Czenczek <hre...@redhat.com> > --- > docs/interop/vhost-user.rst | 20 +++++++++++++++++++- > 1 file changed, 19 insertions(+), 1 deletion(-) > > diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst > index 9f4940a036..d282155562 100644 > --- a/docs/interop/vhost-user.rst > +++ b/docs/interop/vhost-user.rst > @@ -426,6 +426,19 @@ back-end must enable all rings immediately. > While processing the rings (whether they are enabled or not), the back-end > must support changing some configuration aspects on the fly. > > +.. _suspended_device_state: > + > +Suspended device state > +^^^^^^^^^^^^^^^^^^^^^^ > + > +While all vrings are stopped, the device is *suspended*. In addition to > +not processing any vring (because they are stopped), the device must: > + > +* not write to any guest memory regions, > +* not send any notifications to the guest, > +* not send any messages to the front-end, > +* still process and reply to messages from the front-end. > + > Multiple queue support > ---------------------- > > @@ -513,7 +526,8 @@ ancillary data, it may be used to inform the front-end > that the log has > been modified. > > Once the source has finished migration, rings will be stopped by the > -source. No further update must be done before rings are restarted. > +source (:ref:`Suspended device state <suspended_device_state>`). No > +further update must be done before rings are restarted. > > In postcopy migration the back-end is started before all the memory has > been received from the source host, and care must be taken to avoid > @@ -1101,6 +1115,10 @@ Front-end message types > (*a vring descriptor index for split virtqueues* vs. *vring descriptor > indices for packed virtqueues*). > > + When and as long as all of a device’s vrings are stopped, it is > + *suspended*, see :ref:`Suspended device state > + <suspended_device_state>`. > + > The request payload’s *num* field is currently reserved and must be > set to 0. > > -- > 2.41.0 On Wed, Oct 04, 2023 at 02:59:01PM +0200, Hanna Czenczek wrote: > For vhost-user devices, qemu can migrate the virtio state, but not the > back-end's internal state. To do so, we need to be able to transfer > this internal state between front-end (qemu) and back-end. > > At this point, this new feature is added for the purpose of virtio-fs > migration. Because virtiofsd's internal state will not be too large, we > believe it is best to transfer it as a single binary blob after the > streaming phase. > > These are the additions to the protocol: > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_DEVICE_STATE > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a file > descriptor over which to transfer the state. > - CHECK_DEVICE_STATE: After the state has been transferred through the > file descriptor, the front-end invokes this function to verify > success. There is no in-band way (through the file descriptor) to > indicate failure, so we need to check explicitly. > > Once the transfer FD has been established via SET_DEVICE_STATE_FD > (which includes establishing the direction of transfer and migration > phase), the sending side writes its data into it, and the reading side > reads it until it sees an EOF. Then, the front-end will check for > success via CHECK_DEVICE_STATE, which on the destination side includes > checking for integrity (i.e. errors during deserialization). > > Signed-off-by: Hanna Czenczek <hre...@redhat.com> > --- > docs/interop/vhost-user.rst | 172 ++++++++++++++++++++++++++++++++++++ > 1 file changed, 172 insertions(+) > > diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst > index d282155562..aa91e2b34e 100644 > --- a/docs/interop/vhost-user.rst > +++ b/docs/interop/vhost-user.rst > @@ -306,6 +306,32 @@ Inflight description > > :queue size: a 16-bit size of virtqueues > > +Device state transfer parameters > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > + > ++--------------------+-----------------+ > +| transfer direction | migration phase | > ++--------------------+-----------------+ > + > +:transfer direction: a 32-bit enum, describing the direction in which > + the state is transferred: > + > + - 0: Save: Transfer the state from the back-end to the front-end, > + which happens on the source side of migration > + - 1: Load: Transfer the state from the front-end to the back-end, > + which happens on the destination side of migration > + > +:migration phase: a 32-bit enum, describing the state in which the VM > + guest and devices are: > + > + - 0: Stopped (in the period after the transfer of memory-mapped > + regions before switch-over to the destination): The VM guest is > + stopped, and the vhost-user device is suspended (see > + :ref:`Suspended device state <suspended_device_state>`). > + > + In the future, additional phases might be added e.g. to allow > + iterative migration while the device is running. > + > C structure > ----------- > > @@ -365,6 +391,7 @@ in the ancillary data: > * ``VHOST_USER_SET_VRING_ERR`` > * ``VHOST_USER_SET_BACKEND_REQ_FD`` (previous name > ``VHOST_USER_SET_SLAVE_REQ_FD``) > * ``VHOST_USER_SET_INFLIGHT_FD`` (if > ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``) > +* ``VHOST_USER_SET_DEVICE_STATE_FD`` > > If *front-end* is unable to send the full message or receives a wrong > reply it will close the connection. An optional reconnection mechanism > @@ -539,6 +566,80 @@ it performs WAKE ioctl's on the userfaultfd to wake the > stalled > back-end. The front-end indicates support for this via the > ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` feature. > > +.. _migrating_backend_state: > + > +Migrating back-end state > +^^^^^^^^^^^^^^^^^^^^^^^^ > + > +Migrating device state involves transferring the state from one > +back-end, called the source, to another back-end, called the > +destination. After migration, the destination transparently resumes > +operation without requiring the driver to re-initialize the device at > +the VIRTIO level. If the migration fails, then the source can > +transparently resume operation until another migration attempt is made. > + > +Generally, the front-end is connected to a virtual machine guest (which > +contains the driver), which has its own state to transfer between source > +and destination, and therefore will have an implementation-specific > +mechanism to do so. The ``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature > +provides functionality to have the front-end include the back-end's > +state in this transfer operation so the back-end does not need to > +implement its own mechanism, and so the virtual machine may have its > +complete state, including vhost-user devices' states, contained within a > +single stream of data. > + > +To do this, the back-end state is transferred from back-end to front-end > +on the source side, and vice versa on the destination side. This > +transfer happens over a channel that is negotiated using the > +``VHOST_USER_SET_DEVICE_STATE_FD`` message. This message has two > +parameters: > + > +* Direction of transfer: On the source, the data is saved, transferring > + it from the back-end to the front-end. On the destination, the data > + is loaded, transferring it from the front-end to the back-end. > + > +* Migration phase: Currently, the only supported phase is the period > + after the transfer of memory-mapped regions before switch-over to the > + destination, when both the source and destination devices are > + suspended (:ref:`Suspended device state <suspended_device_state>`). > + In the future, additional phases might be supported to allow iterative > + migration while the device is running. > + > +The nature of the channel is implementation-defined, but it must > +generally behave like a pipe: The writing end will write all the data it > +has into it, signalling the end of data by closing its end. The reading > +end must read all of this data (until encountering the end of file) and > +process it. > + > +* When saving, the writing end is the source back-end, and the reading > + end is the source front-end. After reading the state data from the > + channel, the source front-end must transfer it to the destination > + front-end through an implementation-defined mechanism. > + > +* When loading, the writing end is the destination front-end, and the > + reading end is the destination back-end. After reading the state data > + from the channel, the destination back-end must deserialize its > + internal state from that data and set itself up to allow the driver to > + seamlessly resume operation on the VIRTIO level. > + > +Seamlessly resuming operation means that the migration must be > +transparent to the guest driver, which operates on the VIRTIO level. > +This driver will not perform any re-initialization steps, but continue > +to use the device as if no migration had occurred. The vhost-user > +front-end, however, will re-initialize the vhost state on the > +destination, following the usual protocol for establishing a connection > +to a vhost-user back-end: This includes, for example, setting up memory > +mappings and kick and call FDs as necessary, negotiating protocol > +features, or setting the initial vring base indices (to the same value > +as on the source side, so that operation can resume). > + > +Both on the source and on the destination side, after the respective > +front-end has seen all data transferred (when the transfer FD has been > +closed), it sends the ``VHOST_USER_CHECK_DEVICE_STATE`` message to > +verify that data transfer was successful in the back-end, too. The > +back-end responds once it knows whether the transfer and processing was > +successful or not. > + > Memory access > ------------- > > @@ -932,6 +1033,7 @@ Protocol features > #define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS 15 > #define VHOST_USER_PROTOCOL_F_STATUS 16 > #define VHOST_USER_PROTOCOL_F_XEN_MMAP 17 > + #define VHOST_USER_PROTOCOL_F_DEVICE_STATE 18 > > Front-end message types > ----------------------- > @@ -1532,6 +1634,76 @@ Front-end message types > back-end for its device status as defined in the Virtio specification. > Deprecated together with VHOST_USER_SET_STATUS. > > +``VHOST_USER_SET_DEVICE_STATE_FD`` > + :id: 41 > + :equivalent ioctl: N/A > + :request payload: device state transfer parameters > + :reply payload: ``u64`` > + > + Front-end and back-end negotiate a channel over which to transfer the > + back-end’s internal state during migration. Either side (front-end or > + back-end) may create the channel. The nature of this channel is not > + restricted or defined in this document, but whichever side creates it > + must create a file descriptor that is provided to the respectively > + other side, allowing access to the channel. This FD must behave as > + follows: > + > + * For the writing end, it must allow writing the whole back-end state > + sequentially. Closing the file descriptor signals the end of > + transfer. > + > + * For the reading end, it must allow reading the whole back-end state > + sequentially. The end of file signals the end of the transfer. > + > + For example, the channel may be a pipe, in which case the two ends of > + the pipe fulfill these requirements respectively. > + > + Initially, the front-end creates a channel along with such an FD. It > + passes the FD to the back-end as ancillary data of a > + ``VHOST_USER_SET_DEVICE_STATE_FD`` message. The back-end may create a > + different transfer channel, passing the respective FD back to the > + front-end as ancillary data of the reply. If so, the front-end must > + then discard its channel and use the one provided by the back-end. > + > + Whether the back-end should decide to use its own channel is decided > + based on efficiency: If the channel is a pipe, both ends will most > + likely need to copy data into and out of it. Any channel that allows > + for more efficient processing on at least one end, e.g. through > + zero-copy, is considered more efficient and thus preferred. If the > + back-end can provide such a channel, it should decide to use it. > + > + The request payload contains parameters for the subsequent data > + transfer, as described in the :ref:`Migrating back-end state > + <migrating_backend_state>` section. > + > + The value returned is both an indication for success, and whether a > + file descriptor for a back-end-provided channel is returned: Bits 0–7 > + are 0 on success, and non-zero on error. Bit 8 is the invalid FD > + flag; this flag is set when there is no file descriptor returned. > + When this flag is not set, the front-end must use the returned file > + descriptor as its end of the transfer channel. The back-end must not > + both indicate an error and return a file descriptor. > + > + Using this function requires prior negotiation of the > + ``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature. > + > +``VHOST_USER_CHECK_DEVICE_STATE`` > + :id: 42 > + :equivalent ioctl: N/A > + :request payload: N/A > + :reply payload: ``u64`` > + > + After transferring the back-end’s internal state during migration (see > + the :ref:`Migrating back-end state <migrating_backend_state>` > + section), check whether the back-end was able to successfully fully > + process the state. > + > + The value returned indicates success or error; 0 is success, any > + non-zero value is an error. > + > + Using this function requires prior negotiation of the > + ``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature. > + > > Back-end message types > ---------------------- > -- > 2.41.0 On Wed, Oct 04, 2023 at 02:59:02PM +0200, Hanna Czenczek wrote: > Add the interface for transferring the back-end's state during migration > as defined previously in vhost-user.rst. > > Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com> > Signed-off-by: Hanna Czenczek <hre...@redhat.com> > --- > include/hw/virtio/vhost-backend.h | 24 +++++ > include/hw/virtio/vhost.h | 78 ++++++++++++++++ > hw/virtio/vhost-user.c | 148 ++++++++++++++++++++++++++++++ > hw/virtio/vhost.c | 37 ++++++++ > 4 files changed, 287 insertions(+) > > diff --git a/include/hw/virtio/vhost-backend.h > b/include/hw/virtio/vhost-backend.h > index 31a251a9f5..b6eee7e9fd 100644 > --- a/include/hw/virtio/vhost-backend.h > +++ b/include/hw/virtio/vhost-backend.h > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > } VhostSetConfigType; > > +typedef enum VhostDeviceStateDirection { > + /* Transfer state from back-end (device) to front-end */ > + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > + /* Transfer state from front-end to back-end (device) */ > + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > +} VhostDeviceStateDirection; > + > +typedef enum VhostDeviceStatePhase { > + /* The device (and all its vrings) is stopped */ > + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > +} VhostDeviceStatePhase; > + > struct vhost_inflight; > struct vhost_dev; > struct vhost_log; > @@ -133,6 +145,15 @@ typedef int (*vhost_set_config_call_op)(struct vhost_dev > *dev, > > typedef void (*vhost_reset_status_op)(struct vhost_dev *dev); > > +typedef bool (*vhost_supports_device_state_op)(struct vhost_dev *dev); > +typedef int (*vhost_set_device_state_fd_op)(struct vhost_dev *dev, > + VhostDeviceStateDirection > direction, > + VhostDeviceStatePhase phase, > + int fd, > + int *reply_fd, > + Error **errp); > +typedef int (*vhost_check_device_state_op)(struct vhost_dev *dev, Error > **errp); > + > typedef struct VhostOps { > VhostBackendType backend_type; > vhost_backend_init vhost_backend_init; > @@ -181,6 +202,9 @@ typedef struct VhostOps { > vhost_force_iommu_op vhost_force_iommu; > vhost_set_config_call_op vhost_set_config_call; > vhost_reset_status_op vhost_reset_status; > + vhost_supports_device_state_op vhost_supports_device_state; > + vhost_set_device_state_fd_op vhost_set_device_state_fd; > + vhost_check_device_state_op vhost_check_device_state; > } VhostOps; > > int vhost_backend_update_device_iotlb(struct vhost_dev *dev, > diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h > index 14621f9e79..a0d03c9fdf 100644 > --- a/include/hw/virtio/vhost.h > +++ b/include/hw/virtio/vhost.h > @@ -348,4 +348,82 @@ static inline int vhost_reset_device(struct vhost_dev > *hdev) > } > #endif /* CONFIG_VHOST */ > > +/** > + * vhost_supports_device_state(): Checks whether the back-end supports > + * transferring internal device state for the purpose of migration. > + * Support for this feature is required for vhost_set_device_state_fd() > + * and vhost_check_device_state(). > + * > + * @dev: The vhost device > + * > + * Returns true if the device supports these commands, and false if it > + * does not. > + */ > +bool vhost_supports_device_state(struct vhost_dev *dev); > + > +/** > + * vhost_set_device_state_fd(): Begin transfer of internal state from/to > + * the back-end for the purpose of migration. Data is to be transferred > + * over a pipe according to @direction and @phase. The sending end must > + * only write to the pipe, and the receiving end must only read from it. > + * Once the sending end is done, it closes its FD. The receiving end > + * must take this as the end-of-transfer signal and close its FD, too. > + * > + * @fd is the back-end's end of the pipe: The write FD for SAVE, and the > + * read FD for LOAD. This function transfers ownership of @fd to the > + * back-end, i.e. closes it in the front-end. > + * > + * The back-end may optionally reply with an FD of its own, if this > + * improves efficiency on its end. In this case, the returned FD is > + * stored in *reply_fd. The back-end will discard the FD sent to it, > + * and the front-end must use *reply_fd for transferring state to/from > + * the back-end. > + * > + * @dev: The vhost device > + * @direction: The direction in which the state is to be transferred. > + * For outgoing migrations, this is SAVE, and data is read > + * from the back-end and stored by the front-end in the > + * migration stream. > + * For incoming migrations, this is LOAD, and data is read > + * by the front-end from the migration stream and sent to > + * the back-end to restore the saved state. > + * @phase: Which migration phase we are in. Currently, there is only > + * STOPPED (device and all vrings are stopped), in the future, > + * more phases such as PRE_COPY or POST_COPY may be added. > + * @fd: Back-end's end of the pipe through which to transfer state; note > + * that ownership is transferred to the back-end, so this function > + * closes @fd in the front-end. > + * @reply_fd: If the back-end wishes to use a different pipe for state > + * transfer, this will contain an FD for the front-end to > + * use. Otherwise, -1 is stored here. > + * @errp: Potential error description > + * > + * Returns 0 on success, and -errno on failure. > + */ > +int vhost_set_device_state_fd(struct vhost_dev *dev, > + VhostDeviceStateDirection direction, > + VhostDeviceStatePhase phase, > + int fd, > + int *reply_fd, > + Error **errp); > + > +/** > + * vhost_set_device_state_fd(): After transferring state from/to the > + * back-end via vhost_set_device_state_fd(), i.e. once the sending end > + * has closed the pipe, inquire the back-end to report any potential > + * errors that have occurred on its side. This allows to sense errors > + * like: > + * - During outgoing migration, when the source side had already started > + * to produce its state, something went wrong and it failed to finish > + * - During incoming migration, when the received state is somehow > + * invalid and cannot be processed by the back-end > + * > + * @dev: The vhost device > + * @errp: Potential error description > + * > + * Returns 0 when the back-end reports successful state transfer and > + * processing, and -errno when an error occurred somewhere. > + */ > +int vhost_check_device_state(struct vhost_dev *dev, Error **errp); > + > #endif > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c > index 7bed9ad7d5..7096b148a9 100644 > --- a/hw/virtio/vhost-user.c > +++ b/hw/virtio/vhost-user.c > @@ -74,6 +74,8 @@ enum VhostUserProtocolFeature { > /* Feature 14 reserved for VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS. */ > VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15, > VHOST_USER_PROTOCOL_F_STATUS = 16, > + /* Feature 17 reserved for VHOST_USER_PROTOCOL_F_XEN_MMAP. */ > + VHOST_USER_PROTOCOL_F_DEVICE_STATE = 18, > VHOST_USER_PROTOCOL_F_MAX > }; > > @@ -121,6 +123,8 @@ typedef enum VhostUserRequest { > VHOST_USER_REM_MEM_REG = 38, > VHOST_USER_SET_STATUS = 39, > VHOST_USER_GET_STATUS = 40, > + VHOST_USER_SET_DEVICE_STATE_FD = 41, > + VHOST_USER_CHECK_DEVICE_STATE = 42, > VHOST_USER_MAX > } VhostUserRequest; > > @@ -212,6 +216,12 @@ typedef struct { > uint32_t size; /* the following payload size */ > } QEMU_PACKED VhostUserHeader; > > +/* Request payload of VHOST_USER_SET_DEVICE_STATE_FD */ > +typedef struct VhostUserTransferDeviceState { > + uint32_t direction; > + uint32_t phase; > +} VhostUserTransferDeviceState; > + > typedef union { > #define VHOST_USER_VRING_IDX_MASK (0xff) > #define VHOST_USER_VRING_NOFD_MASK (0x1 << 8) > @@ -226,6 +236,7 @@ typedef union { > VhostUserCryptoSession session; > VhostUserVringArea area; > VhostUserInflight inflight; > + VhostUserTransferDeviceState transfer_state; > } VhostUserPayload; > > typedef struct VhostUserMsg { > @@ -2746,6 +2757,140 @@ static void vhost_user_reset_status(struct vhost_dev > *dev) > } > } > > +static bool vhost_user_supports_device_state(struct vhost_dev *dev) > +{ > + return virtio_has_feature(dev->protocol_features, > + VHOST_USER_PROTOCOL_F_DEVICE_STATE); > +} > + > +static int vhost_user_set_device_state_fd(struct vhost_dev *dev, > + VhostDeviceStateDirection > direction, > + VhostDeviceStatePhase phase, > + int fd, > + int *reply_fd, > + Error **errp) > +{ > + int ret; > + struct vhost_user *vu = dev->opaque; > + VhostUserMsg msg = { > + .hdr = { > + .request = VHOST_USER_SET_DEVICE_STATE_FD, > + .flags = VHOST_USER_VERSION, > + .size = sizeof(msg.payload.transfer_state), > + }, > + .payload.transfer_state = { > + .direction = direction, > + .phase = phase, > + }, > + }; > + > + *reply_fd = -1; > + > + if (!vhost_user_supports_device_state(dev)) { > + close(fd); > + error_setg(errp, "Back-end does not support migration state > transfer"); > + return -ENOTSUP; > + } > + > + ret = vhost_user_write(dev, &msg, &fd, 1); > + close(fd); > + if (ret < 0) { > + error_setg_errno(errp, -ret, > + "Failed to send SET_DEVICE_STATE_FD message"); > + return ret; > + } > + > + ret = vhost_user_read(dev, &msg); > + if (ret < 0) { > + error_setg_errno(errp, -ret, > + "Failed to receive SET_DEVICE_STATE_FD reply"); > + return ret; > + } > + > + if (msg.hdr.request != VHOST_USER_SET_DEVICE_STATE_FD) { > + error_setg(errp, > + "Received unexpected message type, expected %d, received > %d", > + VHOST_USER_SET_DEVICE_STATE_FD, msg.hdr.request); > + return -EPROTO; > + } > + > + if (msg.hdr.size != sizeof(msg.payload.u64)) { > + error_setg(errp, > + "Received bad message size, expected %zu, received %" > PRIu32, > + sizeof(msg.payload.u64), msg.hdr.size); > + return -EPROTO; > + } > + > + if ((msg.payload.u64 & 0xff) != 0) { > + error_setg(errp, "Back-end did not accept migration state transfer"); > + return -EIO; > + } > + > + if (!(msg.payload.u64 & VHOST_USER_VRING_NOFD_MASK)) { > + *reply_fd = qemu_chr_fe_get_msgfd(vu->user->chr); > + if (*reply_fd < 0) { > + error_setg(errp, > + "Failed to get back-end-provided transfer pipe FD"); > + *reply_fd = -1; > + return -EIO; > + } > + } > + > + return 0; > +} > + > +static int vhost_user_check_device_state(struct vhost_dev *dev, Error **errp) > +{ > + int ret; > + VhostUserMsg msg = { > + .hdr = { > + .request = VHOST_USER_CHECK_DEVICE_STATE, > + .flags = VHOST_USER_VERSION, > + .size = 0, > + }, > + }; > + > + if (!vhost_user_supports_device_state(dev)) { > + error_setg(errp, "Back-end does not support migration state > transfer"); > + return -ENOTSUP; > + } > + > + ret = vhost_user_write(dev, &msg, NULL, 0); > + if (ret < 0) { > + error_setg_errno(errp, -ret, > + "Failed to send CHECK_DEVICE_STATE message"); > + return ret; > + } > + > + ret = vhost_user_read(dev, &msg); > + if (ret < 0) { > + error_setg_errno(errp, -ret, > + "Failed to receive CHECK_DEVICE_STATE reply"); > + return ret; > + } > + > + if (msg.hdr.request != VHOST_USER_CHECK_DEVICE_STATE) { > + error_setg(errp, > + "Received unexpected message type, expected %d, received > %d", > + VHOST_USER_CHECK_DEVICE_STATE, msg.hdr.request); > + return -EPROTO; > + } > + > + if (msg.hdr.size != sizeof(msg.payload.u64)) { > + error_setg(errp, > + "Received bad message size, expected %zu, received %" > PRIu32, > + sizeof(msg.payload.u64), msg.hdr.size); > + return -EPROTO; > + } > + > + if (msg.payload.u64 != 0) { > + error_setg(errp, "Back-end failed to process its internal state"); > + return -EIO; > + } > + > + return 0; > +} > + > const VhostOps user_ops = { > .backend_type = VHOST_BACKEND_TYPE_USER, > .vhost_backend_init = vhost_user_backend_init, > @@ -2782,4 +2927,7 @@ const VhostOps user_ops = { > .vhost_set_inflight_fd = vhost_user_set_inflight_fd, > .vhost_dev_start = vhost_user_dev_start, > .vhost_reset_status = vhost_user_reset_status, > + .vhost_supports_device_state = vhost_user_supports_device_state, > + .vhost_set_device_state_fd = vhost_user_set_device_state_fd, > + .vhost_check_device_state = vhost_user_check_device_state, > }; > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c > index 6003e50e83..85e199f0aa 100644 > --- a/hw/virtio/vhost.c > +++ b/hw/virtio/vhost.c > @@ -2096,3 +2096,40 @@ int vhost_reset_device(struct vhost_dev *hdev) > > return -ENOSYS; > } > + > +bool vhost_supports_device_state(struct vhost_dev *dev) > +{ > + if (dev->vhost_ops->vhost_supports_device_state) { > + return dev->vhost_ops->vhost_supports_device_state(dev); > + } > + > + return false; > +} > + > +int vhost_set_device_state_fd(struct vhost_dev *dev, > + VhostDeviceStateDirection direction, > + VhostDeviceStatePhase phase, > + int fd, > + int *reply_fd, > + Error **errp) > +{ > + if (dev->vhost_ops->vhost_set_device_state_fd) { > + return dev->vhost_ops->vhost_set_device_state_fd(dev, direction, > phase, > + fd, reply_fd, errp); > + } > + > + error_setg(errp, > + "vhost transport does not support migration state transfer"); > + return -ENOSYS; > +} > + > +int vhost_check_device_state(struct vhost_dev *dev, Error **errp) > +{ > + if (dev->vhost_ops->vhost_check_device_state) { > + return dev->vhost_ops->vhost_check_device_state(dev, errp); > + } > + > + error_setg(errp, > + "vhost transport does not support migration state transfer"); > + return -ENOSYS; > +} > -- > 2.41.0 On Wed, Oct 04, 2023 at 02:59:04PM +0200, Hanna Czenczek wrote: > A virtio-fs device's VM state consists of: > - the virtio device (vring) state (VMSTATE_VIRTIO_DEVICE) > - the back-end's (virtiofsd's) internal state > > We get/set the latter via the new vhost operations to transfer migratory > state. It is its own dedicated subsection, so that for external > migration, it can be disabled. > > Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com> > Signed-off-by: Hanna Czenczek <hre...@redhat.com> > --- > hw/virtio/vhost-user-fs.c | 101 +++++++++++++++++++++++++++++++++++++- > 1 file changed, 100 insertions(+), 1 deletion(-) > > diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c > index 49d699ffc2..eb91723855 100644 > --- a/hw/virtio/vhost-user-fs.c > +++ b/hw/virtio/vhost-user-fs.c > @@ -298,9 +298,108 @@ static struct vhost_dev *vuf_get_vhost(VirtIODevice > *vdev) > return &fs->vhost_dev; > } > > +/** > + * Fetch the internal state from virtiofsd and save it to `f`. > + */ > +static int vuf_save_state(QEMUFile *f, void *pv, size_t size, > + const VMStateField *field, JSONWriter *vmdesc) > +{ > + VirtIODevice *vdev = pv; > + VHostUserFS *fs = VHOST_USER_FS(vdev); > + Error *local_error = NULL; > + int ret; > + > + ret = vhost_save_backend_state(&fs->vhost_dev, f, &local_error); > + if (ret < 0) { > + error_reportf_err(local_error, > + "Error saving back-end state of %s device %s " > + "(tag: \"%s\"): ", > + vdev->name, vdev->parent_obj.canonical_path, > + fs->conf.tag ?: "<none>"); > + return ret; > + } > + > + return 0; > +} > + > +/** > + * Load virtiofsd's internal state from `f` and send it over to virtiofsd. > + */ > +static int vuf_load_state(QEMUFile *f, void *pv, size_t size, > + const VMStateField *field) > +{ > + VirtIODevice *vdev = pv; > + VHostUserFS *fs = VHOST_USER_FS(vdev); > + Error *local_error = NULL; > + int ret; > + > + ret = vhost_load_backend_state(&fs->vhost_dev, f, &local_error); > + if (ret < 0) { > + error_reportf_err(local_error, > + "Error loading back-end state of %s device %s " > + "(tag: \"%s\"): ", > + vdev->name, vdev->parent_obj.canonical_path, > + fs->conf.tag ?: "<none>"); > + return ret; > + } > + > + return 0; > +} > + > +static bool vuf_is_internal_migration(void *opaque) > +{ > + /* TODO: Return false when an external migration is requested */ > + return true; > +} > + > +static int vuf_check_migration_support(void *opaque) > +{ > + VirtIODevice *vdev = opaque; > + VHostUserFS *fs = VHOST_USER_FS(vdev); > + > + if (!vhost_supports_device_state(&fs->vhost_dev)) { > + error_report("Back-end of %s device %s (tag: \"%s\") does not > support " > + "migration through qemu", > + vdev->name, vdev->parent_obj.canonical_path, > + fs->conf.tag ?: "<none>"); > + return -ENOTSUP; > + } > + > + return 0; > +} > + > +static const VMStateDescription vuf_backend_vmstate; > + > static const VMStateDescription vuf_vmstate = { > .name = "vhost-user-fs", > - .unmigratable = 1, > + .version_id = 0, > + .fields = (VMStateField[]) { > + VMSTATE_VIRTIO_DEVICE, > + VMSTATE_END_OF_LIST() > + }, > + .subsections = (const VMStateDescription * []) { > + &vuf_backend_vmstate, > + NULL, > + } > +}; > + > +static const VMStateDescription vuf_backend_vmstate = { > + .name = "vhost-user-fs-backend", > + .version_id = 0, > + .needed = vuf_is_internal_migration, > + .pre_load = vuf_check_migration_support, > + .pre_save = vuf_check_migration_support, > + .fields = (VMStateField[]) { > + { > + .name = "back-end", > + .info = &(const VMStateInfo) { > + .name = "virtio-fs back-end state", > + .get = vuf_load_state, > + .put = vuf_save_state, > + }, > + }, > + VMSTATE_END_OF_LIST() > + }, > }; > > static Property vuf_properties[] = { > -- > 2.41.0 On Wed, Oct 04, 2023 at 02:59:03PM +0200, Hanna Czenczek wrote: > vhost_save_backend_state() and vhost_load_backend_state() can be used by > vhost front-ends to easily save and load the back-end's state to/from > the migration stream. > > Because we do not know the full state size ahead of time, > vhost_save_backend_state() simply reads the data in 1 MB chunks, and > writes each chunk consecutively into the migration stream, prefixed by > its length. EOF is indicated by a 0-length chunk. > > Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com> > Signed-off-by: Hanna Czenczek <hre...@redhat.com> > --- > include/hw/virtio/vhost.h | 35 +++++++ > hw/virtio/vhost.c | 204 ++++++++++++++++++++++++++++++++++++++ > 2 files changed, 239 insertions(+) > > diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h > index a0d03c9fdf..100fcc874d 100644 > --- a/include/hw/virtio/vhost.h > +++ b/include/hw/virtio/vhost.h > @@ -426,4 +426,39 @@ int vhost_set_device_state_fd(struct vhost_dev *dev, > */ > int vhost_check_device_state(struct vhost_dev *dev, Error **errp); > > +/** > + * vhost_save_backend_state(): High-level function to receive a vhost > + * back-end's state, and save it in @f. Uses > + * `vhost_set_device_state_fd()` to get the data from the back-end, and > + * stores it in consecutive chunks that are each prefixed by their > + * respective length (be32). The end is marked by a 0-length chunk. > + * > + * Must only be called while the device and all its vrings are stopped > + * (`VHOST_TRANSFER_STATE_PHASE_STOPPED`). > + * > + * @dev: The vhost device from which to save the state > + * @f: Migration stream in which to save the state > + * @errp: Potential error message > + * > + * Returns 0 on success, and -errno otherwise. > + */ > +int vhost_save_backend_state(struct vhost_dev *dev, QEMUFile *f, Error > **errp); > + > +/** > + * vhost_load_backend_state(): High-level function to load a vhost > + * back-end's state from @f, and send it over to the back-end. Reads > + * the data from @f in the format used by `vhost_save_state()`, and uses > + * `vhost_set_device_state_fd()` to transfer it to the back-end. > + * > + * Must only be called while the device and all its vrings are stopped > + * (`VHOST_TRANSFER_STATE_PHASE_STOPPED`). > + * > + * @dev: The vhost device to which to send the sate > + * @f: Migration stream from which to load the state > + * @errp: Potential error message > + * > + * Returns 0 on success, and -errno otherwise. > + */ > +int vhost_load_backend_state(struct vhost_dev *dev, QEMUFile *f, Error > **errp); > + > #endif > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c > index 85e199f0aa..1465adf13a 100644 > --- a/hw/virtio/vhost.c > +++ b/hw/virtio/vhost.c > @@ -2133,3 +2133,207 @@ int vhost_check_device_state(struct vhost_dev *dev, > Error **errp) > "vhost transport does not support migration state transfer"); > return -ENOSYS; > } > + > +int vhost_save_backend_state(struct vhost_dev *dev, QEMUFile *f, Error > **errp) > +{ > + /* Maximum chunk size in which to transfer the state */ > + const size_t chunk_size = 1 * 1024 * 1024; > + g_autofree void *transfer_buf = NULL; > + g_autoptr(GError) g_err = NULL; > + int pipe_fds[2], read_fd = -1, write_fd = -1, reply_fd = -1; > + int ret; > + > + /* [0] for reading (our end), [1] for writing (back-end's end) */ > + if (!g_unix_open_pipe(pipe_fds, FD_CLOEXEC, &g_err)) { > + error_setg(errp, "Failed to set up state transfer pipe: %s", > + g_err->message); > + ret = -EINVAL; > + goto fail; > + } > + > + read_fd = pipe_fds[0]; > + write_fd = pipe_fds[1]; > + > + /* > + * VHOST_TRANSFER_STATE_PHASE_STOPPED means the device must be stopped. > + * Ideally, it is suspended, but SUSPEND/RESUME currently do not exist > for > + * vhost-user, so just check that it is stopped at all. > + */ > + assert(!dev->started); > + > + /* Transfer ownership of write_fd to the back-end */ > + ret = vhost_set_device_state_fd(dev, > + VHOST_TRANSFER_STATE_DIRECTION_SAVE, > + VHOST_TRANSFER_STATE_PHASE_STOPPED, > + write_fd, > + &reply_fd, > + errp); > + if (ret < 0) { > + error_prepend(errp, "Failed to initiate state transfer: "); > + goto fail; > + } > + > + /* If the back-end wishes to use a different pipe, switch over */ > + if (reply_fd >= 0) { > + close(read_fd); > + read_fd = reply_fd; > + } > + > + transfer_buf = g_malloc(chunk_size); > + > + while (true) { > + ssize_t read_ret; > + > + read_ret = RETRY_ON_EINTR(read(read_fd, transfer_buf, chunk_size)); > + if (read_ret < 0) { > + ret = -errno; > + error_setg_errno(errp, -ret, "Failed to receive state"); > + goto fail; > + } > + > + assert(read_ret <= chunk_size); > + qemu_put_be32(f, read_ret); > + > + if (read_ret == 0) { > + /* EOF */ > + break; > + } > + > + qemu_put_buffer(f, transfer_buf, read_ret); > + } > + > + /* > + * Back-end will not really care, but be clean and close our end of the > pipe > + * before inquiring the back-end about whether transfer was successful > + */ > + close(read_fd); > + read_fd = -1; > + > + /* Also, verify that the device is still stopped */ > + assert(!dev->started); > + > + ret = vhost_check_device_state(dev, errp); > + if (ret < 0) { > + goto fail; > + } > + > + ret = 0; > +fail: > + if (read_fd >= 0) { > + close(read_fd); > + } > + > + return ret; > +} > + > +int vhost_load_backend_state(struct vhost_dev *dev, QEMUFile *f, Error > **errp) > +{ > + size_t transfer_buf_size = 0; > + g_autofree void *transfer_buf = NULL; > + g_autoptr(GError) g_err = NULL; > + int pipe_fds[2], read_fd = -1, write_fd = -1, reply_fd = -1; > + int ret; > + > + /* [0] for reading (back-end's end), [1] for writing (our end) */ > + if (!g_unix_open_pipe(pipe_fds, FD_CLOEXEC, &g_err)) { > + error_setg(errp, "Failed to set up state transfer pipe: %s", > + g_err->message); > + ret = -EINVAL; > + goto fail; > + } > + > + read_fd = pipe_fds[0]; > + write_fd = pipe_fds[1]; > + > + /* > + * VHOST_TRANSFER_STATE_PHASE_STOPPED means the device must be stopped. > + * Ideally, it is suspended, but SUSPEND/RESUME currently do not exist > for > + * vhost-user, so just check that it is stopped at all. > + */ > + assert(!dev->started); > + > + /* Transfer ownership of read_fd to the back-end */ > + ret = vhost_set_device_state_fd(dev, > + VHOST_TRANSFER_STATE_DIRECTION_LOAD, > + VHOST_TRANSFER_STATE_PHASE_STOPPED, > + read_fd, > + &reply_fd, > + errp); > + if (ret < 0) { > + error_prepend(errp, "Failed to initiate state transfer: "); > + goto fail; > + } > + > + /* If the back-end wishes to use a different pipe, switch over */ > + if (reply_fd >= 0) { > + close(write_fd); > + write_fd = reply_fd; > + } > + > + while (true) { > + size_t this_chunk_size = qemu_get_be32(f); > + ssize_t write_ret; > + const uint8_t *transfer_pointer; > + > + if (this_chunk_size == 0) { > + /* End of state */ > + break; > + } > + > + if (transfer_buf_size < this_chunk_size) { > + transfer_buf = g_realloc(transfer_buf, this_chunk_size); > + transfer_buf_size = this_chunk_size; > + } > + > + if (qemu_get_buffer(f, transfer_buf, this_chunk_size) < > + this_chunk_size) > + { > + error_setg(errp, "Failed to read state"); > + ret = -EINVAL; > + goto fail; > + } > + > + transfer_pointer = transfer_buf; > + while (this_chunk_size > 0) { > + write_ret = RETRY_ON_EINTR( > + write(write_fd, transfer_pointer, this_chunk_size) > + ); > + if (write_ret < 0) { > + ret = -errno; > + error_setg_errno(errp, -ret, "Failed to send state"); > + goto fail; > + } else if (write_ret == 0) { > + error_setg(errp, "Failed to send state: Connection is > closed"); > + ret = -ECONNRESET; > + goto fail; > + } > + > + assert(write_ret <= this_chunk_size); > + this_chunk_size -= write_ret; > + transfer_pointer += write_ret; > + } > + } > + > + /* > + * Close our end, thus ending transfer, before inquiring the back-end > about > + * whether transfer was successful > + */ > + close(write_fd); > + write_fd = -1; > + > + /* Also, verify that the device is still stopped */ > + assert(!dev->started); > + > + ret = vhost_check_device_state(dev, errp); > + if (ret < 0) { > + goto fail; > + } > + > + ret = 0; > +fail: > + if (write_fd >= 0) { > + close(write_fd); > + } > + > + return ret; > +} > -- > 2.41.0