On Sat, Jul 8, 2023 at 11:14 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
>
>
>
> On 7/5/2023 10:46 PM, Eugenio Perez Martin wrote:
> > On Thu, Jul 6, 2023 at 2:13 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>
> >>
> >> On 7/5/2023 11:03 AM, Eugenio Perez Martin wrote:
> >>> On Tue, Jun 27, 2023 at 8:36 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>>>
> >>>> On 6/9/2023 7:32 AM, Eugenio Perez Martin wrote:
> >>>>> On Fri, Jun 9, 2023 at 12:39 AM Si-Wei Liu <si-wei....@oracle.com> 
> >>>>> wrote:
> >>>>>> On 6/7/23 01:08, Eugenio Perez Martin wrote:
> >>>>>>> On Wed, Jun 7, 2023 at 12:43 AM Si-Wei Liu <si-wei....@oracle.com> 
> >>>>>>> wrote:
> >>>>>>>> Sorry for reviving this old thread, I missed the best timing to
> >>>>>>>> follow up on this while I was on vacation. I have been working on
> >>>>>>>> this and found some discrepancies, please see below.
> >>>>>>>>
> >>>>>>>> On 4/5/23 04:37, Eugenio Perez Martin wrote:
> >>>>>>>>> Hi!
> >>>>>>>>>
> >>>>>>>>> As mentioned in the last upstream virtio-networking meeting, one of
> >>>>>>>>> the factors that adds more downtime to migration is the handling of
> >>>>>>>>> the guest memory (pin, map, etc). At this moment this handling is
> >>>>>>>>> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, 
> >>>>>>>>> the
> >>>>>>>>> destination device waits until all the guest memory / state is
> >>>>>>>>> migrated to start pinning all the memory.
> >>>>>>>>>
> >>>>>>>>> The proposal is to bind it to the char device life cycle (open vs
> >>>>>>>>> close),
> >>>>>>>> Hmmm, really? If it's tied to the char device life cycle, the next
> >>>>>>>> guest / qemu launched on the same vhost-vdpa device node won't be
> >>>>>>>> able to make use of it.
> >>>>>>>>
> >>>>>>> Maybe my sentence was not accurate, but I think we're on the same 
> >>>>>>> page here.
> >>>>>>>
> >>>>>>> Two qemu instances opening the same char device at the same time are
> >>>>>>> not allowed, and vhost_vdpa_release cleans all the maps. So the next
> >>>>>>> qemu that opens the char device should see a clean device anyway.
> >>>>>> I mean the pin can't be done at the time of char device open, where the
> >>>>>> user address space is not known/bound yet. The earliest point possible
> >>>>>> for pinning would be after the vhost_attach_mm() call from SET_OWNER is
> >>>>>> done.
> >>>>> Maybe we are deviating, let me start again.
> >>>>>
> >>>>> Using QEMU code, what I'm proposing is to modify the lifecycle of the
> >>>>> .listener member of struct vhost_vdpa.
> >>>>>
> >>>>> At this moment, the memory listener is registered at
> >>>>> vhost_vdpa_dev_start(dev, started=true) call for the last vhost_dev,
> >>>>> and is unregistered in both vhost_vdpa_reset_status and
> >>>>> vhost_vdpa_cleanup.
> >>>>>
> >>>>> My original proposal was just to move the memory listener registration
> >>>>> to the last vhost_vdpa_init, and remove the unregister from
> >>>>> vhost_vdpa_reset_status. The calls to vhost_vdpa_dma_map/unmap would
> >>>>> be the same; the device should not notice this change.
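> >>>>>
> >>>>> Something in the line of this untested sketch (against
> >>>>> hw/virtio/vhost-vdpa.c; it assumes dev->vq_index_end is already
> >>>>> valid at init time, and error handling is omitted):
> >>>>>
> >>>>>     static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque,
> >>>>>                                Error **errp)
> >>>>>     {
> >>>>>         struct vhost_vdpa *v = opaque;
> >>>>>         ...
> >>>>>         /* Register the listener for the last vhost_dev here instead
> >>>>>          * of in vhost_vdpa_dev_start(started=true), so the maps (and
> >>>>>          * the pinning they trigger) happen at init time. */
> >>>>>         if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
> >>>>>             v->listener = vhost_vdpa_memory_listener;
> >>>>>             memory_listener_register(&v->listener,
> >>>>>                                      &address_space_memory);
> >>>>>         }
> >>>>>         ...
> >>>>>     }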
> >>>> This can address LM downtime latency for sure, but it won't help
> >>>> downtime during a dynamic SVQ switch - which still needs to go through
> >>>> the full unmap/map cycle (including the slow pinning part) when moving
> >>>> from passthrough to SVQ mode. Note that not every device can work with
> >>>> a separate ASID for SVQ descriptors. The fix should be expected to work
> >>>> on normal vDPA vendor devices without a separate descriptor ASID, with
> >>>> platform IOMMU underneath or with on-chip IOMMU.
> >>>>
> >>> At this moment the SVQ switch is very inefficient mapping-wise, as it
> >>> unmaps all the GPA->HVA maps and overrides them. In particular, SVQ is
> >>> allocated in low regions of the iova space, and then the guest memory
> >>> is allocated in this new IOVA region incrementally.
> >> Yep. The key to building this fast path for SVQ switching, I think, is
> >> to maintain the identity mapping for the passthrough queues so that QEMU
> >> can reuse the old mappings for guest memory (e.g. GIOVA identity mapped
> >> to GPA) while incrementally adding new mappings for SVQ vrings.
> >>
> >>> We can optimize that if we place SVQ in a free GPA area instead.
> >> Here's a question though: it might not be hard to find a free GPA range
> >> for the non-vIOMMU case (allocate iova from beyond the 48bit or 52bit
> >> ranges), but I'm not sure it is easy to find a free GIOVA range for the
> >> vIOMMU case - particularly since this has to work within the entire
> >> 64bit IOVA address range, and (for now) QEMU has no way to "reserve" a
> >> specific IOVA range for SVQ from the vIOMMU. Do you foresee this can be
> >> done for every QEMU emulated vIOMMU (intel-iommu, amd-iommu, arm smmu
> >> and virtio-iommu) so that we can call it out as a generic means for SVQ
> >> switching optimization?
> >>
> > In case the vIOMMU allocates a new block we will use the same algorithm
> > as now:
> > * Find a new free IOVA chunk of the same size
> > * Map the block at this new SVQ IOVA, which may or may not be the same
> >   as the guest's IOVA
> >
> > Since we must go through the translation phase to sanitize the guest's
> > available descriptors anyway, it has zero added cost.
> Not sure I followed. This can work, but it doesn't seem able to reuse the
> old host kernel mappings for guest memory, hence it still requires a remap
> of the entire host IOVA range when the SVQ IOVA comes along. I think by
> maintaining a 1:1 identity map on guest memory, we don't have to bother
> tearing down existing HVA->HPA mappings in the kernel, thus saving the
> expensive pinning calls at large. I don't clearly see how, under this
> scheme, the new SVQ IOVA would cope with potential conflicts in the IOVA
> space from hotplugged memory - in that case the 1:1 IOVA->GPA identity
> mapping for guest memory can't be kept.
>

There is no need to maintain the 1:1 mapping for memory mapped after the
pinning. The bigger reason to maintain it is to reduce the downtime
caused by pinning. After that, we can reuse the method we're using at
this moment, looking for a new free hole for the new map. We only need
to pin the new map.
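As a sketch of the allocation part (using the names QEMU already has in
hw/virtio/vhost-iova-tree.c and hw/virtio/vhost-vdpa.c; new_hva /
new_size are placeholders for the hotplugged range, v is the struct
vhost_vdpa pointer, and error handling is omitted):

    /* The old maps (and their pins) stay untouched; only the new range
     * gets a free hole in the IOVA tree and a new DMA map, so only its
     * pages need to be pinned by the kernel. */
    DMAMap map = {
        .translated_addr = (hwaddr)(uintptr_t)new_hva,
        .size = new_size - 1,   /* iova-tree sizes are inclusive */
        .perm = IOMMU_RW,
    };

    if (vhost_iova_tree_map_alloc(v->iova_tree, &map) == IOVA_OK) {
        vhost_vdpa_dma_map(v, VHOST_VDPA_GUEST_PA_ASID, map.iova,
                           new_size, new_hva, false);
    }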

> > Another option would be to move the SVQ vring to a new region, but I
> > don't see any advantage on maintaining 1:1 mapping at that point.
> See above. From the pinning avoidance point of view (i.e. a QEMU software
> optimization of SVQ switching downtime), I think it's crucial to
> maintain the 1:1 mapping and move the SVQ vring to a specific region. But
> I'm not sure how complex it will be for QEMU to get it implemented in a
> clean way.
>
> Thanks,
> -Siwei
>
> >
> >> If this QEMU/vIOMMU "hack" is not universally feasible, I would rather
> >> build a fast path in the kernel via a new vhost IOTLB command, say
> >> INVALIDATE_AND_UPDATE_ALL, to atomically flush all existing
> >> (passthrough) mappings and update them to the SVQ ones in a single
> >> batch, while keeping the pages for guest memory always pinned (the
> >> kernel will make this decision). This doesn't expose pinning to
> >> userspace, and can also fix the downtime issue.
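> >>
> >> To illustrate the shape of it (the message name and number are made
> >> up here, nothing like this exists in the uAPI today):
> >>
> >>     /* include/uapi/linux/vhost_types.h, hypothetical addition */
> >>     #define VHOST_IOTLB_INVALIDATE_AND_UPDATE_ALL 7
> >>
> >>     /* Userspace would send one batch:
> >>      *   VHOST_IOTLB_BATCH_BEGIN
> >>      *   VHOST_IOTLB_INVALIDATE_AND_UPDATE_ALL  <- drop all the old
> >>      *       passthrough maps, but keep the guest pages pinned
> >>      *   VHOST_IOTLB_UPDATE ...                 <- SVQ vrings + guest
> >>      *       memory at the new IOVA, reusing the still-pinned pages
> >>      *   VHOST_IOTLB_BATCH_END
> >>      */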
> >>
> >>>    All
> >>> of the "translations" still need to be done, to ensure the guest
> >>> doesn't have access to the SVQ vring. That way, qemu will not send all
> >>> the unmaps & maps, only the new ones. And vhost/vdpa does not need to
> >>> call unpin_user_page / pin_user_pages for all the guest memory.
> >>>
> >>> More optimizations include the batching of the SVQ vrings.
> >> Nods.
> >>
> >>>>> One of the concerns was that it could delay VM initialization, and I
> >>>>> didn't profile it but I think that may be the case.
> >>>> Yes, that's the concern here - we should not introduce a regression to
> >>>> the normal VM boot process/time. In the case of a large VM it's very
> >>>> easy to see the side effect if we go this way.
> >>>>
> >>>>>     I'm not sure about
> >>>>> the right fix but I think the change is easy to profile. If that is
> >>>>> the case, we could:
> >>>>> * use a flag (listener->address_space ?) and only register the
> >>>>> listener in _init if waiting for a migration, do it in _start
> >>>>> otherwise.
> >>>> Just doing this alone won't help SVQ mode switch downtime, see the
> >>>> reason stated above.
> >>>>
> >>>>> * something like io_uring, where the map can be done in parallel and
> >>>>> we can fail _start if some of them fails.
> >>>> This can alleviate the problem somewhat, but it is still sub-optimal
> >>>> and does not scale to larger sizes. I'd like zero or minimal pinning
> >>>> to be done at SVQ switch or migration time.
> >>>>
> >>> To reduce even further the pinning at SVQ time we would need to
> >>> preallocate SVQ vrings before suspending the device.
> >> Yep. Preallocate SVQ vrings at the start of migration, but before
> >> suspending the device. This will work under the assumption that we can
> >> reserve or "steal" some ranges from the GPA or GIOVA space...
> >>
> >> Thanks,
> >> -Siwei
> >>
> >>>>>> Actually I think that the counterpart vhost_detach_mm() only getting
> >>>>>> handled in vhost_vdpa_release() at device close time is an artifact
> >>>>>> and a miss of today's vhost protocol - the opposite RESET_OWNER call
> >>>>>> is not made mandatory, hence it is only seen implemented in the
> >>>>>> vhost-net device today. One qemu instance could well exec(3) another
> >>>>>> new qemu instance to live upgrade itself while keeping all emulated
> >>>>>> devices and the guest alive. The current vhost design simply
> >>>>>> prohibits this from happening.
> >>>>>>
> >>>>> Ok, I was not aware of this. Thanks for explaining it!
> >>>>>
> >>>>>>>>>       so all the guest memory can be pinned for all the guest / qemu
> >>>>>>>>> lifecycle.
> >>>>>>>> I think tying pinning to the guest / qemu process life cycle makes
> >>>>>>>> more sense. Essentially this pinning part needs to be decoupled
> >>>>>>>> from the iotlb mapping abstraction layer, and can / should work as
> >>>>>>>> a standalone uAPI. Such that QEMU at the destination may launch and
> >>>>>>>> pin all of the guest's memory as needed without having to start the
> >>>>>>>> device, while awaiting any incoming migration request. Though the
> >>>>>>>> problem is, there's no existing vhost uAPI that could properly
> >>>>>>>> serve as the vehicle for that. SET_OWNER / SET_MEM_TABLE /
> >>>>>>>> RESET_OWNER seem a remote fit... Any objection against introducing
> >>>>>>>> a new but clean vhost uAPI for pinning guest pages, subject to the
> >>>>>>>> guest's life cycle?
> >>>>>>>>
> >>>>>>> I think that whether or not to pin memory maps should be a kernel
> >>>>>>> decision, not driven by qemu.
> >>>>>> It's the kernel's decision for sure. I'm with you on this part.
> >>>>>>
> >>>>>>> I'm not against it if needed, but
> >>>>>>> let me know if the current "clean at close" address your concerns.
> >>>>>> To better facilitate the QEMU exec (live update) case, I propose we
> >>>>>> add a new vhost uAPI pair for explicit pinning requests - which would
> >>>>>> live with the user mm's, or more precisely the qemu instance's,
> >>>>>> lifecycle.
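> >>>>>>
> >>>>>> Something along these lines (ioctl names and numbers are invented
> >>>>>> here, purely illustrative; the exact payload is to be worked out):
> >>>>>>
> >>>>>>     /* include/uapi/linux/vhost.h, hypothetical sketch: pinning
> >>>>>>      * is tied to the owner mm, not to DRIVER_OK / RESET. */
> >>>>>>     #define VHOST_PIN_GUEST_MEM    _IO(VHOST_VIRTIO, 0x80)
> >>>>>>     #define VHOST_UNPIN_GUEST_MEM  _IO(VHOST_VIRTIO, 0x81)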
> >>>>>>
> >>>>> Ok I see your problem better now, but I think it should be solved at
> >>>>> kernel level. Does that live update need to forcefully unpin and pin
> >>>>> the memory again,
> >>>> No, it should avoid the unpin & pin cycle, otherwise it'd defeat the
> >>>> downtime expectation. The exec(3)'d process should inherit the page
> >>>> pinning and/or mlock accounting from the original QEMU process, while
> >>>> keeping the original page pinning intact. Physical page mappings for
> >>>> DMA can be kept as-is to avoid the need to reprogram the device,
> >>>> though in this case the existing vhost iotlb entries should be updated
> >>>> to reflect the new HVAs in the exec(3)'d QEMU process.
> >>>>
> >>>>>     or is that just a consequence of how the memory listener works
> >>>>> right now?
> >>>>>
> >>>>> Why not extend the RESET_OWNER to the rest of devices? It seems the
> >>>>> most natural way to me.
> >>>> Not sure, I think RESET_OWNER might be too heavyweight for implementing
> >>>> live update, and people are not clear what the exact semantics of
> >>>> using it are (which part of the device state is being reset, and how
> >>>> much)... In addition, people working on iommufd intended to make this a
> >>>> "one-shot" ioctl, e.g. CHANGE_OWNER instead of RESET_OWNER+SET_OWNER:
> >>>>
> >>>> https://lore.kernel.org/kvm/y5ibvv9pnmifi...@ziepe.ca/
> >>>>
> >>>> A new uAPI to just change ownership of the mm seems a better fit to
> >>>> me...
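> >>>>
> >>>> For the shape of it (ioctl name and number invented here, purely
> >>>> illustrative):
> >>>>
> >>>>     /* Hypothetical one-shot ioctl: atomically transfer the vhost
> >>>>      * owner mm (and the pin/mlock accounting) from the old QEMU to
> >>>>      * the exec(3)'d one, keeping iotlb maps and pins intact. */
> >>>>     #define VHOST_CHANGE_OWNER _IO(VHOST_VIRTIO, 0x82)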
> >>>>
> >>> I'm not sure about the right solution here, but there are other
> >>> proposals to batch ioctls. Maybe creating a new one fits better.
> >>>
> >>> Thanks!
> >>>
> >>>> Thanks,
> >>>> -Siwei
> >>>>
> >>>>> Thanks!
> >>>>>
> >>>>>
> >>>>>>>> Another concern is the use_va stuff; originally it is tagged at the
> >>>>>>>> device level and made static at device instantiation time, which is
> >>>>>>>> fine. But the ones to come have found a new home in the per-group
> >>>>>>>> or per-vq level structs. It is hard to tell whether pinning is
> >>>>>>>> actually needed for the latter use_va friends, as they are
> >>>>>>>> essentially tied to the virtio life cycle or feature negotiation,
> >>>>>>>> while the guest / Qemu starts way earlier than that. Perhaps just
> >>>>>>>> ignore those sub-device level use_va usages? Presumably !use_va at
> >>>>>>>> the device level is sufficient to infer the need for pinning for
> >>>>>>>> the device?
> >>>>>>>>
> >>>>>>> I don't follow this. But I have the feeling that the subject of my
> >>>>>>> original mail would have been way more accurate if I had said just
> >>>>>>> "memory maps".
> >>>>>> I think the iotlb layer in vhost-vdpa just provides the abstraction
> >>>>>> for mapping, not pinning. Although in some cases mapping implicitly
> >>>>>> relies on pinning for DMA purposes, it doesn't have to be tied to
> >>>>>> that in the uAPI semantics. We can do explicit on-demand pinning for
> >>>>>> cases like, e.g., warming up the device at the live migration
> >>>>>> destination.
> >>>>>>
> >>>>>>
> >>>>>>> I still consider the way to fix it is to actually delegate that to the
> >>>>>>> kernel vdpa, so it can choose if a particular ASID needs the pin or
> >>>>>>> not. But let me know if I missed something.
> >>>>>> You can disregard this for now. I will discuss it further with you
> >>>>>> guys once the bind_mm and per-group use_va stuff has landed.
> >>>>>>
> >>>>>> Thanks!
> >>>>>> -Siwei
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> -Siwei
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> This has two main problems:
> >>>>>>>>> * At this moment the reset semantics force the vdpa device to
> >>>>>>>>> unmap all the memory, so this change needs a vhost-vdpa feature
> >>>>>>>>> flag.
> >>>>>>>>> * This may increase the initialization time. Maybe we can delay it
> >>>>>>>>> if qemu is not the destination of a LM. Anyway I think this should
> >>>>>>>>> be done as an optimization on top.
> >>>>>>>>>
> >>>>>>>>> Any ideas or comments in this regard?
> >>>>>>>>>
> >>>>>>>>> Thanks!
> >>>>>>>>>
>

