On Fri, Jun 9, 2023 at 12:39 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
>
> On 6/7/23 01:08, Eugenio Perez Martin wrote:
> > On Wed, Jun 7, 2023 at 12:43 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >> Sorry for reviving this old thread, I lost the best timing to follow
> >> up on this while I was on vacation. I have been working on this and
> >> found some discrepancies, please see below.
> >>
> >> On 4/5/23 04:37, Eugenio Perez Martin wrote:
> >>> Hi!
> >>>
> >>> As mentioned in the last upstream virtio-networking meeting, one of
> >>> the factors that adds more downtime to migration is the handling of
> >>> the guest memory (pin, map, etc). At this moment this handling is
> >>> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
> >>> destination device waits until all the guest memory / state is
> >>> migrated to start pinning all the memory.
> >>>
> >>> The proposal is to bind it to the char device life cycle (open vs
> >>> close),
> >> Hmmm, really? If it's the life cycle of the char device, the next
> >> guest / qemu launch on the same vhost-vdpa device node won't work.
> >>
> > Maybe my sentence was not accurate, but I think we're on the same page
> > here.
> >
> > Two qemu instances opening the same char device at the same time are
> > not allowed, and vhost_vdpa_release cleans up all the maps. So the
> > next qemu that opens the char device should see a clean device anyway.
>
> I mean the pinning can't be done at the time of char device open, when
> the user address space is not known / bound yet. The earliest point
> possible for pinning is after the vhost_attach_mm() call from SET_OWNER
> is done.
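Right, nothing can be pinned before vhost_attach_mm() binds the owner
mm. Just to make sure we mean the same ordering, from the userspace
side it would be something like this (rough sketch, untested, error
handling elided; the device node path is only an example):

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/vhost.h>

    int main(void)
    {
        /* Device node path is just an example. */
        int fd = open("/dev/vhost-vdpa-0", O_RDWR);

        /* At this point no mm is bound yet: in the kernel, dev->mm is
         * still NULL, so there is nothing to pin from. */

        ioctl(fd, VHOST_SET_OWNER);

        /* VHOST_SET_OWNER -> vhost_dev_set_owner() -> vhost_attach_mm():
         * only from here on does the device have a user address space,
         * so this is the earliest point any pinning could start. */
        return 0;
    }
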
Maybe we are deviating, let me start again.

Using QEMU code, what I'm proposing is to modify the lifecycle of the
.listener member of struct vhost_vdpa.

At this moment, the memory listener is registered at the
vhost_vdpa_dev_start(dev, started=true) call of the last vhost_dev, and
it is unregistered in both vhost_vdpa_reset_status and
vhost_vdpa_cleanup.

My original proposal was just to move the memory listener registration
to the last vhost_vdpa_init and to remove the unregistration from
vhost_vdpa_reset_status. The calls to vhost_vdpa_dma_map/unmap would
stay the same, so the device should not notice the change (rough sketch
after the list below).

One of the concerns was that this could delay VM initialization. I
haven't profiled it, but I think that may well be the case. I'm not
sure about the right fix, but the change is easy to profile. If it
turns out to be a problem, we could:
* use a flag (listener->address_space?) and only register the listener
  in _init if we're waiting for a migration, doing it in _start
  otherwise.
* do something io_uring-like, where the maps are performed in parallel
  and _start fails if any of them failed.
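For concreteness, a rough sketch of the change against
hw/virtio/vhost-vdpa.c (untested, just to illustrate the lifecycle
move; I'm assuming something like the vhost_vdpa_last_dev() check that
the _start path uses, and the exact conditions may differ):

    static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque,
                               Error **errp)
    {
        struct vhost_vdpa *v = opaque;
        /* ... existing init code ... */
        v->listener = vhost_vdpa_memory_listener;
        /* Register here instead of at DRIVER_OK time in
         * vhost_vdpa_dev_start(), so the guest memory is mapped /
         * pinned as soon as qemu knows its layout: */
        if (vhost_vdpa_last_dev(dev)) {
            memory_listener_register(&v->listener,
                                     &address_space_memory);
        }
        return 0;
    }

    static void vhost_vdpa_reset_status(struct vhost_dev *dev)
    {
        /* ... existing reset code ... */
        /* No memory_listener_unregister() here anymore: the maps
         * survive the virtio reset and are only dropped in
         * vhost_vdpa_cleanup(). */
    }
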
> Actually, I think that the counterpart vhost_detach_mm() only getting
> handled in vhost_vdpa_release() at device close time is an artifact
> and omission of today's vhost protocol - the opposite RESET_OWNER call
> is not made mandatory, hence it is only implemented in the vhost-net
> device today. One qemu instance could well exec(3) another new qemu
> instance to live upgrade itself while keeping all emulated devices and
> the guest alive. The current vhost design simply prohibits this from
> happening.
>

Ok, I was not aware of this. Thanks for explaining it!

> >
> >>> so all the guest memory can be pinned for all the guest / qemu
> >>> lifecycle.
> >> I think tying pinning to the guest / qemu process life cycle makes
> >> more sense. Essentially this pinning part needs to be decoupled from
> >> the iotlb mapping abstraction layer, and can / should work as a
> >> standalone uAPI, such that QEMU at the destination may launch and
> >> pin all of the guest's memory as needed without having to start the
> >> device, while awaiting any incoming migration request. The problem
> >> is that there's no existing vhost uAPI that could properly serve as
> >> the vehicle for that: SET_OWNER / SET_MEM_TABLE / RESET_OWNER seem a
> >> remote fit. Any objection against introducing a new but clean vhost
> >> uAPI for pinning guest pages, subject to the guest's life cycle?
> >>
> > I think that to pin or not to pin memory maps should be a kernel
> > decision, not one driven by qemu.
>
> It's a kernel decision for sure. I am with you on this part.
>
> > I'm not against it if needed, but
> > let me know if the current "clean at close" addresses your concerns.
>
> To better facilitate the QEMU exec (live update) case, I propose we
> add a new vhost uAPI pair for explicit pinning requests - which would
> live with the user mm's, or more precisely the qemu instance's,
> lifecycle.
>

Ok, I see your problem better now, but I think it should be solved at
the kernel level. Does that live update need to forcefully unpin and
pin the memory again, or is that just a consequence of how the memory
listener works right now?

Why not extend RESET_OWNER to the rest of the devices? It seems the
most natural way to me.

Thanks!

> >
> >> Another concern is the use_va stuff; originally it tags to the
> >> device level and is made static at the time of device instantiation,
> >> which is fine. But others to come just find a new home at the
> >> per-group or per-vq level struct. It is hard to tell whether or not
> >> pinning is actually needed for the latter use_va friends, as they
> >> are essentially tied to the virtio life cycle or feature
> >> negotiation, while the guest / QEMU starts way earlier than that.
> >> Perhaps just ignore those sub-device level use_va usages? Presumably
> >> !use_va at the device level is sufficient to infer the need of
> >> pinning for the device?
> >>
> > I don't follow this. But I have the feeling that the subject of my
> > original mail would have been way more accurate if I had said just
> > "memory maps".
>
> I think the iotlb layer in vhost-vdpa just provides the abstraction
> for mapping, not pinning. Although in some cases mapping implicitly
> relies on pinning for DMA purposes, it doesn't have to be tied to that
> in the uAPI semantics. We can do explicit on-demand pinning for cases
> like warming up the device at the live migration destination.
>
> >
> > I still consider the way to fix it is to actually delegate that to
> > the kernel vdpa, so it can choose if a particular ASID needs the pin
> > or not. But let me know if I missed something.
>
> You can disregard this for now. I will discuss it further with you
> guys once the bind_mm and per-group use_va stuffs have landed.
>
> Thanks!
> -Siwei
>
> >
> > Thanks!
> >
> >> Regards,
> >> -Siwei
> >>
> >>
> >>> This has two main problems:
> >>> * At this moment the reset semantics force the vdpa device to unmap
> >>> all the memory. So this change needs a vhost-vdpa feature flag.
> >>> * This may increase the initialization time. Maybe we can delay it
> >>> if qemu is not the destination of a LM. Anyway, I think this should
> >>> be done as an optimization on top.
> >>>
> >>> Any ideas or comments in this regard?
> >>>
> >>> Thanks!
> >>>
>
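PS: just so we are talking about the same thing, I read the explicit
pinning uAPI pair you propose as something along these lines (purely
hypothetical sketch, nothing like this exists in
include/uapi/linux/vhost.h today; the names and ioctl numbers are made
up):

    /* Hypothetical: pin / unpin a range of the owner mm, decoupled
     * from the iotlb mapping and from the virtio device status. */
    struct vhost_vdpa_pin_range {
            __u64 uaddr;    /* userspace VA of the range */
            __u64 size;     /* length in bytes */
    };

    #define VHOST_VDPA_PIN_PAGES \
            _IOW(VHOST_VIRTIO, 0x7c, struct vhost_vdpa_pin_range)
    #define VHOST_VDPA_UNPIN_PAGES \
            _IOW(VHOST_VIRTIO, 0x7d, struct vhost_vdpa_pin_range)

The pins would then be taken against the mm bound at SET_OWNER time, so
they would survive a virtio reset but be released at RESET_OWNER /
release time, matching the qemu instance's lifecycle as you describe.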