On Fri, Jun 9, 2023 at 12:39 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
>
> On 6/7/23 01:08, Eugenio Perez Martin wrote:
> > On Wed, Jun 7, 2023 at 12:43 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >> Sorry for reviving this old thread, I lost the best timing to follow
> >> up on this while I was on vacation. I have been working on this and
> >> found some discrepancies, please see below.
> >>
> >> On 4/5/23 04:37, Eugenio Perez Martin wrote:
> >>> Hi!
> >>>
> >>> As mentioned in the last upstream virtio-networking meeting, one of
> >>> the factors that adds more downtime to migration is the handling of
> >>> the guest memory (pin, map, etc). At this moment this handling is
> >>> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
> >>> destination device waits until all the guest memory / state is
> >>> migrated to start pinning all the memory.
> >>>
> >>> The proposal is to bind it to the char device life cycle (open vs
> >>> close),
> >> Hmmm, really? If it's the life cycle of the char device, the next
> >> guest / qemu launch on the same vhost-vdpa device node won't work.
> >>
> > Maybe my sentence was not accurate, but I think we're on the same page
> > here.
> >
> > Two qemu instances opening the same char device at the same time are
> > not allowed, and vhost_vdpa_release cleans up all the maps. So the
> > next qemu that opens the char device should see a clean device anyway.
>
> I mean the pinning can't be done at the time of char device open, when
> the user address space is not known / bound yet. The earliest point
> possible for pinning is after the vhost_attach_mm() call from SET_OWNER
> is done.
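Right, nothing can be pinned before vhost_attach_mm() binds the owner
mm. Just to make sure we mean the same ordering, from the userspace
side it would be something like this (rough sketch, untested, error
handling elided; the device node path is only an example):

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/vhost.h>

    int main(void)
    {
        /* Device node path is just an example. */
        int fd = open("/dev/vhost-vdpa-0", O_RDWR);

        /* At this point no mm is bound yet: in the kernel, dev->mm is
         * still NULL, so there is nothing to pin from. */

        ioctl(fd, VHOST_SET_OWNER);

        /* VHOST_SET_OWNER -> vhost_dev_set_owner() -> vhost_attach_mm():
         * only from here on does the device have a user address space,
         * so this is the earliest point any pinning could start. */
        return 0;
    }
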
Maybe we are deviating, let me start again.

Using QEMU code, what I'm proposing is to modify the lifecycle of the
.listener member of struct vhost_vdpa.

At this moment, the memory listener is registered at the
vhost_vdpa_dev_start(dev, started=true) call of the last vhost_dev, and
it is unregistered in both vhost_vdpa_reset_status and
vhost_vdpa_cleanup.

My original proposal was just to move the memory listener registration
to the last vhost_vdpa_init and to remove the unregistration from
vhost_vdpa_reset_status. The calls to vhost_vdpa_dma_map/unmap would
stay the same, so the device should not notice the change (rough sketch
after the list below).

One of the concerns was that this could delay VM initialization. I
haven't profiled it, but I think that may well be the case. I'm not
sure about the right fix, but the change is easy to profile. If it
turns out to be a problem, we could:
* use a flag (listener->address_space?) and only register the listener
  in _init if we're waiting for a migration, doing it in _start
  otherwise.
* do something io_uring-like, where the maps are performed in parallel
  and _start fails if any of them failed.
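For concreteness, a rough sketch of the change against
hw/virtio/vhost-vdpa.c (untested, just to illustrate the lifecycle
move; I'm assuming something like the vhost_vdpa_last_dev() check that
the _start path uses, and the exact conditions may differ):

    static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque,
                               Error **errp)
    {
        struct vhost_vdpa *v = opaque;
        /* ... existing init code ... */
        v->listener = vhost_vdpa_memory_listener;
        /* Register here instead of at DRIVER_OK time in
         * vhost_vdpa_dev_start(), so the guest memory is mapped /
         * pinned as soon as qemu knows its layout: */
        if (vhost_vdpa_last_dev(dev)) {
            memory_listener_register(&v->listener,
                                     &address_space_memory);
        }
        return 0;
    }

    static void vhost_vdpa_reset_status(struct vhost_dev *dev)
    {
        /* ... existing reset code ... */
        /* No memory_listener_unregister() here anymore: the maps
         * survive the virtio reset and are only dropped in
         * vhost_vdpa_cleanup(). */
    }
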
> Actually, I think that the counterpart vhost_detach_mm() only getting
> handled in vhost_vdpa_release() at device close time is an artifact
> and omission of today's vhost protocol - the opposite RESET_OWNER call
> is not made mandatory, hence it is only implemented in the vhost-net
> device today. One qemu instance could well exec(3) another new qemu
> instance to live upgrade itself while keeping all emulated devices and
> the guest alive. The current vhost design simply prohibits this from
> happening.
>

Ok, I was not aware of this. Thanks for explaining it!

> >
> >>> so all the guest memory can be pinned for all the guest / qemu
> >>> lifecycle.
> >> I think tying pinning to the guest / qemu process life cycle makes
> >> more sense. Essentially this pinning part needs to be decoupled from
> >> the iotlb mapping abstraction layer, and can / should work as a
> >> standalone uAPI, such that QEMU at the destination may launch and
> >> pin all of the guest's memory as needed without having to start the
> >> device, while awaiting any incoming migration request. The problem
> >> is that there's no existing vhost uAPI that could properly serve as
> >> the vehicle for that: SET_OWNER / SET_MEM_TABLE / RESET_OWNER seem a
> >> remote fit. Any objection against introducing a new but clean vhost
> >> uAPI for pinning guest pages, subject to the guest's life cycle?
> >>
> > I think that to pin or not to pin memory maps should be a kernel
> > decision, not one driven by qemu.
>
> It's a kernel decision for sure. I am with you on this part.
>
> > I'm not against it if needed, but
> > let me know if the current "clean at close" addresses your concerns.
>
> To better facilitate the QEMU exec (live update) case, I propose we
> add a new vhost uAPI pair for explicit pinning requests - which would
> live with the user mm's, or more precisely the qemu instance's,
> lifecycle.
>

Ok, I see your problem better now, but I think it should be solved at
the kernel level. Does that live update need to forcefully unpin and
pin the memory again, or is that just a consequence of how the memory
listener works right now?

Why not extend RESET_OWNER to the rest of the devices? It seems the
most natural way to me.

Thanks!

> >
> >> Another concern is the use_va stuff; originally it tags to the
> >> device level and is made static at the time of device instantiation,
> >> which is fine. But others to come just find a new home at the
> >> per-group or per-vq level struct. It is hard to tell whether or not
> >> pinning is actually needed for the latter use_va friends, as they
> >> are essentially tied to the virtio life cycle or feature
> >> negotiation, while the guest / QEMU starts way earlier than that.
> >> Perhaps just ignore those sub-device level use_va usages? Presumably
> >> !use_va at the device level is sufficient to infer the need of
> >> pinning for the device?
> >>
> > I don't follow this. But I have the feeling that the subject of my
> > original mail would have been way more accurate if I had said just
> > "memory maps".
>
> I think the iotlb layer in vhost-vdpa just provides the abstraction
> for mapping, not pinning. Although in some cases mapping implicitly
> relies on pinning for DMA purposes, it doesn't have to be tied to that
> in the uAPI semantics. We can do explicit on-demand pinning for cases
> like warming up the device at the live migration destination.
>
> >
> > I still consider the way to fix it is to actually delegate that to
> > the kernel vdpa, so it can choose if a particular ASID needs the pin
> > or not. But let me know if I missed something.
>
> You can disregard this for now. I will discuss it further with you
> guys once the bind_mm and per-group use_va stuffs have landed.
>
> Thanks!
> -Siwei
>
> >
> > Thanks!
> >
> >> Regards,
> >> -Siwei
> >>
> >>
> >>> This has two main problems:
> >>> * At this moment the reset semantics force the vdpa device to unmap
> >>> all the memory. So this change needs a vhost-vdpa feature flag.
> >>> * This may increase the initialization time. Maybe we can delay it
> >>> if qemu is not the destination of a LM. Anyway, I think this should
> >>> be done as an optimization on top.
> >>>
> >>> Any ideas or comments in this regard?
> >>>
> >>> Thanks!
> >>>
>
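PS: just so we are talking about the same thing, I read the explicit
pinning uAPI pair you propose as something along these lines (purely
hypothetical sketch, nothing like this exists in
include/uapi/linux/vhost.h today; the names and ioctl numbers are made
up):

    /* Hypothetical: pin / unpin a range of the owner mm, decoupled
     * from the iotlb mapping and from the virtio device status. */
    struct vhost_vdpa_pin_range {
            __u64 uaddr;    /* userspace VA of the range */
            __u64 size;     /* length in bytes */
    };

    #define VHOST_VDPA_PIN_PAGES \
            _IOW(VHOST_VIRTIO, 0x7c, struct vhost_vdpa_pin_range)
    #define VHOST_VDPA_UNPIN_PAGES \
            _IOW(VHOST_VIRTIO, 0x7d, struct vhost_vdpa_pin_range)

The pins would then be taken against the mm bound at SET_OWNER time, so
they would survive a virtio reset but be released at RESET_OWNER /
release time, matching the qemu instance's lifecycle as you describe.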