On Wed, 15 Mar 2023 at 11:56, Hanna Czenczek <hre...@redhat.com> wrote:
>
> On 15.03.23 14:58, Stefan Hajnoczi wrote:
> > On Mon, Mar 13, 2023 at 06:48:32PM +0100, Hanna Czenczek wrote:
> >> Add a virtio-fs-specific vhost-user interface to facilitate migrating
> >> back-end-internal state. We plan to migrate the internal state simply
> > Luckily the interface does not need to be virtiofs-specific since it
> > only transfers opaque data. Any stateful device can use this for
> > migration. Please make it generic both at the vhost-user protocol
> > message level and at the QEMU vhost API level.
>
> OK, sure.
>
> >> as a binary blob after the streaming phase, so all we need is a way to
> >> transfer such a blob from and to the back-end. We do so by using a
> >> dedicated area of shared memory through which the blob is transferred in
> >> chunks.
> > Keeping the migration data transfer separate from the vhost-user UNIX
> > domain socket is a good idea since the amount of data could be large and
> > may congest the UNIX domain socket. The shared memory interface solves
> > this.
> >
> > Where I get lost is why it needs to be shared memory instead of simply
> > an fd? On the source, the front-end could read the fd until EOF and
> > transfer the opaque data. On the destination, the front-end could write
> > to the fd and then close it. I think that would be simpler than the
> > shared memory interface and could potentially support zero-copy via
> > splice(2) (QEMU doesn't need to look at the data being transferred!).
> >
> > Here is an outline of an fd-based interface:
> >
> > - SET_DEVICE_STATE_FD: The front-end passes a file descriptor for
> >   transferring device state.
> >
> >   The @direction argument:
> >   - SAVE: the back-end transfers an outgoing device state over the fd.
> >   - LOAD: the back-end transfers an incoming device state over the fd.
> >
> >   The @phase argument:
> >   - STOPPED: the device is stopped.
> >   - PRE_COPY: reserved for future use.
> >   - POST_COPY: reserved for future use.
> >
> > The back-end transfers data over the fd according to @direction and
> > @phase upon receiving the SET_DEVICE_STATE_FD message.
> >
> > There are loose ends like how the message interacts with the virtqueue
> > enabled state, what happens if multiple SET_DEVICE_STATE_FD messages are
> > sent, etc. I have ignored them for now.
> >
> > What I wanted to mention about the fd-based interface is:
> >
> > - It's just one message. The I/O activity happens via the fd and does
> >   not involve GET_STATE/SET_STATE messages over the vhost-user domain
> >   socket.
> >
> > - Buffer management is up to the front-end and back-end implementations
> >   and a bit simpler than the shared memory interface.
> >
> > Did you choose the shared memory approach because it has certain
> > advantages?
>
> I simply chose it because I didn’t think of anything else. :)
>
> Using just an FD for a pipe-like interface sounds perfect to me. I
> expect that to make the code simpler and, as you point out, it’s just
> better in general. Thanks!
The Linux VFIO Migration v2 API could be interesting to look at too:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/vfio.h#n814

It has a state machine that puts the device into
pre-copy/saving/loading/etc states.

> > What is the rationale for waiting to receive the entire incoming state
> > before parsing it rather than parsing it in a streaming fashion? Can
> > this be left as an implementation detail of the vhost-user back-end so
> > that there's freedom in choosing either approach?
>
> The rationale was that when using the shared memory approach, you need
> to specify the offset into the state of the chunk that you’re currently
> transferring. So to allow streaming, you’d need to make the front-end
> transfer the chunks in a streaming fashion, so that these offsets are
> continuously increasing. Definitely possible, and reasonable, I just
> thought it’d be easier not to define it at this point and just state
> that decoding at the end is always safe.
>
> When using a pipe/splicing, however, that won’t be a concern anymore, so
> yes, then we can definitely allow the back-end to decode its state while
> it’s still being received.

I see.

Stefan