On Tue, Nov 10, 2020 at 08:20:50AM -0700, Alex Williamson wrote:
> On Tue, 10 Nov 2020 19:46:20 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> 
> > On 11/10/2020 2:40 PM, Dr. David Alan Gilbert wrote:
> > > * Alex Williamson (alex.william...@redhat.com) wrote:
> > >> On Mon, 9 Nov 2020 19:44:17 +0000
> > >> "Dr. David Alan Gilbert" <dgilb...@redhat.com> wrote:
> > >>
> > >>> * Alex Williamson (alex.william...@redhat.com) wrote:
> > >>>> Per the proposed documentation for vfio device migration:
> > >>>>
> > >>>>   Dirty pages are tracked when the device is in the stop-and-copy
> > >>>>   phase because, if pages are marked dirty during the pre-copy phase
> > >>>>   and their content is transferred from source to destination, there
> > >>>>   is no way to know which pages were dirtied again after they were
> > >>>>   copied until the device stops.  To avoid repeated copies of the
> > >>>>   same content, pinned pages are marked dirty only during the
> > >>>>   stop-and-copy phase.
> > >>>>
> > >>>> Essentially, since we don't have hardware dirty page tracking for
> > >>>> assigned devices at this point, we consider any page that is pinned
> > >>>> by an mdev vendor driver or pinned and mapped through the IOMMU to
> > >>>> be perpetually dirty.  In the worst case, this may result in all of
> > >>>> guest memory being considered dirty during every iteration of live
> > >>>> migration.  The current vfio implementation of migration has chosen
> > >>>> to mask device-dirtied pages until the final stages of migration in
> > >>>> order to avoid this worst-case scenario.
> > >>>>
> > >>>> Allowing the device to implement a policy decision to prioritize
> > >>>> reduced migration data like this jeopardizes QEMU's overall ability
> > >>>> to implement any degree of service level guarantees during migration.
> > >>>> For example, any estimates towards achieving acceptable downtime
> > >>>> margins cannot be trusted when such a device is present.  The vfio
> > >>>> device should participate in dirty page tracking to the best of its
> > >>>> ability throughout migration, even if that means the dirty footprint
> > >>>> of the device impedes migration progress, allowing both QEMU and
> > >>>> higher level management tools to decide whether to continue the
> > >>>> migration or abort due to failure to achieve the desired behavior.
> > >>>
> > >>> I don't feel particularly bad about the decision to squash it in
> > >>> during the stop-and-copy phase; for devices where the pinned memory
> > >>> is large, I don't think doing it during the main phase makes much
> > >>> sense, especially if you then have to deal with tracking changes in
> > >>> pinning.
> > >>
> > >> AFAIK the kernel support for tracking changes in page pinning already
> > >> exists; it is largely the vfio device in QEMU that decides when to
> > >> start exposing the device's dirty footprint to QEMU.  I'm a bit
> > >> surprised by this answer though, since we don't really know what the
> > >> device memory footprint is.  It might be large, it might be nothing,
> > >> but by not participating in dirty page tracking until the VM is
> > >> stopped, we can't know what the footprint is and how it will affect
> > >> downtime.  Is it really the place of a QEMU device driver to impose
> > >> this sort of policy?
> > > 
> > > If it could actually track changes then I'd agree we shouldn't impose
> > > any policy; but if it's just marking the whole area as dirty, we're
> > > going to need a bodge somewhere, and this bodge doesn't look any worse
> > > than the others to me.
> > >
> > >>> Having said that, I agree with marking it as experimental, because
> > >>> I'm dubious how useful it will be for the same reason; I worry
> > >>> about whether the downtime will be so large as to make it pointless.
> > 
> > Not all device state is large.  For example, a NIC might only report
> > its currently mapped RX buffers, which are usually not more than 1GB
> > and could be as low as tens of MB.  A GPU might or might not have a
> > large footprint; that depends on its use case.
> 
> Right, it's only if we have a vendor driver that doesn't pin any memory
> when dirty tracking is enabled, and we're running without a viommu,
> that we would expect all of guest memory to be continuously dirty.
> 
> > >> TBH I think that's the wrong reason to mark it experimental.  There's
> > >> clearly demand for vfio device migration, and even if the practical
> > >> use cases are initially small, they will expand over time and hardware
> > >> will get better.  My objection is that the current behavior masks the
> > >> hardware and device limitations, leading to unrealistic expectations.
> > >> If the user expects minimal downtime, configures convergence to
> > >> account for that, QEMU thinks it can achieve it, and then the device
> > >> marks everything dirty, that's not supportable.
> > > 
> > > Yes, agreed.
> > 
> > Yes, there is demand for vfio device migration, and many device owners
> > have started scoping and developing migration support.
> > Instead of marking the whole migration support as experimental, we
> > could have an opt-in option that decides whether to mark system memory
> > pages dirty during the iterative (pre-copy) phase of migration.
> 
> Per my previous suggestion, I'd think an opt-out would be more
> appropriate, i.e. implementing pre-copy dirty page tracking by default.
I think this would be a better approach, without marking this feature as
experimental.

Thanks,
Neo

> > >> OTOH, if the vfio device participates in dirty tracking through
> > >> pre-copy, then the practical use cases will sort themselves out:
> > >> migrations will either be aborted because downtime tolerances cannot
> > >> be achieved, or downtimes will be configured to match reality.
> > >> Thanks,
> > > 
> > > Without a way to prioritise the unpinned memory during that period,
> > > we're going to be repeatedly sending the pinned memory, which is going
> > > to lead to much larger bandwidth usage than required; so that's going
> > > in completely the wrong direction, and is also wrong from the point of
> > > view of the user.
> 
> Who decides which is the wrong direction for the user?  If the user
> wants minimal bandwidth regardless of downtime, can't they create a
> procedure to pause the VM, do the migration, then resume?  Are there
> already migration tunables to effectively achieve this?  If a user
> attempts to do a "live" migration, isn't our priority then shifted to
> managing the downtime constraints over the bandwidth?  IOW, the policy
> decision is implied by the user's actions and configuration of the
> migration; I don't think that at the device level we should be guessing
> which feature to prioritize, just like a vCPU doesn't stop marking
> dirty pages during pre-copy because it's touching too much memory.
> Higher level policies and configurations should determine inflection
> points... imo.  Thanks,
> 
> Alex
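
To make the behaviour under discussion concrete, here is a small,
self-contained C model (not the actual QEMU or kernel vfio code; all names
and numbers are illustrative) of why pinned pages end up "perpetually
dirty" without hardware tracking: each sync has to report the whole pinned
range again, so the dirty set reported during pre-copy never converges
below the pinned footprint.

/*
 * Toy model of dirty reporting for a device without hardware dirty
 * tracking.  The pinned range must be assumed dirty on every sync,
 * so each pre-copy iteration resends the same pages.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define PAGES         1024   /* toy guest RAM: 1024 pages */
#define PINNED_START   256   /* range pinned for device DMA */
#define PINNED_PAGES   512

static bool dirty[PAGES];

/* Stand-in for a log_sync-style callback: the device cannot tell which
 * pinned pages were actually written, so it reports the whole pinned
 * range as dirty every time it is asked. */
static void device_log_sync(void)
{
    memset(&dirty[PINNED_START], 1, PINNED_PAGES * sizeof(dirty[0]));
}

/* Count the pages that would be (re)sent this iteration and clear them. */
static unsigned count_and_clear_dirty(void)
{
    unsigned n = 0;
    for (unsigned i = 0; i < PAGES; i++) {
        if (dirty[i]) {
            n++;
            dirty[i] = false;
        }
    }
    return n;
}

int main(void)
{
    unsigned long total = 0;

    for (int iter = 0; iter < 5; iter++) {
        device_log_sync();
        unsigned n = count_and_clear_dirty();
        total += n;
        printf("pre-copy iteration %d: %u pages dirty\n", iter, n);
    }
    printf("pages transferred for the device footprint: %lu\n", total);
    return 0;
}

Reporting this footprint only at stop-and-copy avoids the repeated
transfers but, as argued above, also hides it from QEMU's downtime
estimate until the very end.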
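
And a minimal sketch of the opt-out knob being proposed, assuming a
per-device boolean that defaults to on (the structure, field names, and
numbers here are hypothetical, not an existing QEMU interface): with the
knob set, the pinned footprint is visible to the migration core from the
first iteration; with it cleared, the footprint only shows up once the
device enters stop-and-copy.

/*
 * Sketch of an opt-out pre-copy dirty-tracking knob.  All names are
 * hypothetical; this is a model of the policy, not QEMU's vfio code.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct VFIODeviceModel {
    bool pre_copy_dirty_page_tracking;  /* default true: opt-out knob */
    bool in_stop_copy;                  /* device reached stop-and-copy */
    unsigned long pinned_pages;         /* footprint pinned for DMA */
} VFIODeviceModel;

/* Pages the device contributes to a dirty-bitmap sync at this point in
 * the migration.  With the knob enabled, the migration core sees the
 * real footprint from the first iteration and can factor it into its
 * downtime estimate; with it disabled, the footprint stays hidden until
 * stop-and-copy, which keeps pre-copy traffic low but makes the
 * estimate optimistic. */
static unsigned long device_dirty_pages(const VFIODeviceModel *dev)
{
    if (dev->pre_copy_dirty_page_tracking || dev->in_stop_copy) {
        return dev->pinned_pages;
    }
    return 0;
}

int main(void)
{
    VFIODeviceModel dev = {
        .pre_copy_dirty_page_tracking = true,   /* opt-out default */
        .pinned_pages = 262144,                 /* e.g. 1GB of 4K pages */
    };

    printf("pre-copy estimate: %lu pages\n", device_dirty_pages(&dev));

    dev.pre_copy_dirty_page_tracking = false;   /* user opted out */
    printf("pre-copy estimate (opted out): %lu pages\n",
           device_dirty_pages(&dev));

    dev.in_stop_copy = true;
    printf("stop-and-copy: %lu pages\n", device_dirty_pages(&dev));
    return 0;
}

Defaulting the knob to on matches the opt-out suggestion above: users who
care more about pre-copy bandwidth than about accurate downtime estimates
can still turn it off explicitly.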