On Tue, 10 Nov 2020 19:46:20 +0530 Kirti Wankhede <kwankh...@nvidia.com> wrote:

> On 11/10/2020 2:40 PM, Dr. David Alan Gilbert wrote:
> > * Alex Williamson (alex.william...@redhat.com) wrote:
> >> On Mon, 9 Nov 2020 19:44:17 +0000
> >> "Dr. David Alan Gilbert" <dgilb...@redhat.com> wrote:
> >>
> >>> * Alex Williamson (alex.william...@redhat.com) wrote:
> >>>> Per the proposed documentation for vfio device migration:
> >>>>
> >>>>   Dirty pages are tracked when device is in stop-and-copy phase
> >>>>   because if pages are marked dirty during pre-copy phase and
> >>>>   content is transfered from source to destination, there is no
> >>>>   way to know newly dirtied pages from the point they were copied
> >>>>   earlier until device stops. To avoid repeated copy of same
> >>>>   content, pinned pages are marked dirty only during
> >>>>   stop-and-copy phase.
> >>>>
> >>>> Essentially, since we don't have hardware dirty page tracking for
> >>>> assigned devices at this point, we consider any page that is pinned
> >>>> by an mdev vendor driver or pinned and mapped through the IOMMU to
> >>>> be perpetually dirty. In the worst case, this may result in all of
> >>>> guest memory being considered dirty during every iteration of live
> >>>> migration. The current vfio implementation of migration has chosen
> >>>> to mask device dirtied pages until the final stages of migration in
> >>>> order to avoid this worst case scenario.
> >>>>
> >>>> Allowing the device to implement a policy decision to prioritize
> >>>> reduced migration data like this jeopardizes QEMU's overall ability
> >>>> to implement any degree of service level guarantees during migration.
> >>>> For example, any estimates towards achieving acceptable downtime
> >>>> margins cannot be trusted when such a device is present.
> >>>> The vfio device should participate in dirty page tracking to the best of its
> >>>> ability throughout migration, even if that means the dirty footprint
> >>>> of the device impedes migration progress, allowing both QEMU and
> >>>> higher level management tools to decide whether to continue the
> >>>> migration or abort due to failure to achieve the desired behavior.
> >>>
> >>> I don't feel particularly badly about the decision to squash it in
> >>> during the stop-and-copy phase; for devices where the pinned memory
> >>> is large, I don't think doing it during the main phase makes much sense;
> >>> especially if you then have to deal with tracking changes in pinning.
> >>
> >>
> >> AFAIK the kernel support for tracking changes in page pinning already
> >> exists, this is largely the vfio device in QEMU that decides when to
> >> start exposing the device dirty footprint to QEMU. I'm a bit surprised
> >> by this answer though, we don't really know what the device memory
> >> footprint is. It might be large, it might be nothing, but by not
> >> participating in dirty page tracking until the VM is stopped, we can't
> >> know what the footprint is and how it will affect downtime. Is it
> >> really the place of a QEMU device driver to impose this sort of policy?
> >
> > If it could actually track changes then I'd agree we shouldn't impose
> > any policy; but if it's just marking the whole area as dirty we're going
> > to need a bodge somewhere; this bodge doesn't look any worse than the
> > others to me.
> >
> >>
> >>> Having said that, I agree with marking it as experimental, because
> >>> I'm dubious how useful it will be for the same reason, I worry
> >>> about whether the downtime will be so large to make it pointless.
> >>
>
> Not all device state is large, for example NIC might only report
> currently mapped RX buffers which usually not more than a 1GB and could
> be as low as 10's of MB.
> GPU might or might not have large data, that
> depends on its use cases.

Right, it's only if we have a vendor driver that doesn't pin any memory
when dirty tracking is enabled and we're running without a viommu that
we would expect all of guest memory to be continuously dirty.

> >> TBH I think that's the wrong reason to mark it experimental. There's
> >> clearly demand for vfio device migration and even if the practical use
> >> cases are initially small, they will expand over time and hardware will
> >> get better. My objection is that the current behavior masks the
> >> hardware and device limitations, leading to unrealistic expectations.
> >> If the user expects minimal downtime, configures convergence to account
> >> for that, QEMU thinks it can achieve it, and then the device marks
> >> everything dirty, that's not supportable.
> >
> > Yes, agreed.
>
> Yes, there is demand for vfio device migration and many devices owners
> started scoping and development for migration support.
> Instead of making whole migration support as experimental, we can have
> opt-in option to decide to mark sys mem pages dirty during iterative
> phase (pre-copy phase) of migration.

Per my previous suggestion, I'd think an opt-out would be more
appropriate, ie. implementing pre-copy dirty page tracking by default.

> >> OTOH if the vfio device
> >> participates in dirty tracking through pre-copy, then the practical use
> >> cases will find themselves as migrations will either be aborted because
> >> downtime tolerances cannot be achieved or downtimes will be configured
> >> to match reality. Thanks,
> >
> > Without a way to prioritise the unpinned memory during that period,
> > we're going to be repeatedly sending the pinned memory which is going to
> > lead to a much larger bandwidth usage that required; so that's going in
> > completely the wrong direction and also wrong from the point of view of
> > the user.

Who decides which is the wrong direction for the user?
If the user wants minimal bandwidth regardless of downtime, can't they
create a procedure to pause the VM, do the migration, then resume? Are
there already migration tunables to effectively achieve this? If a user
attempts to do a "live" migration, isn't our priority then shifted to
managing the downtime constraints over the bandwidth? IOW the policy
decision is implied by the user actions and configuration of the
migration. I don't think that at the device level we should be guessing
which feature to prioritize, just like a vCPU doesn't get to stop marking
dirty pages during pre-copy because it's touching too much memory.
Higher level policies and configurations should determine inflection
points... imo. Thanks,

Alex
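
To put rough numbers on the downtime argument above, here is a toy model. It is only a sketch and not QEMU code: the `Guest` class, both functions, and the figures (64 MiB of vCPU-dirtied memory, 1 GiB pinned for the device, a 1 GiB/s link) are assumptions made up for the illustration. It compares the downtime that would be predicted from the dirty data visible during pre-copy with the downtime needed once the pinned pages are finally reported.

```python
from dataclasses import dataclass

MiB = 1024 * 1024


@dataclass
class Guest:
    """Hypothetical snapshot of a guest at the end of a pre-copy pass."""
    cpu_dirty_bytes: int        # memory dirtied by vCPUs since the last pass
    pinned_bytes: int           # memory pinned for the device, treated as always dirty
    bandwidth_bytes_per_s: int  # migration link throughput


def estimated_downtime_s(guest: Guest, device_reports_in_precopy: bool) -> float:
    """Downtime predicted from the dirty data visible during pre-copy."""
    visible = guest.cpu_dirty_bytes
    if device_reports_in_precopy:
        # The device's dirty footprint is part of the estimate.
        visible += guest.pinned_bytes
    return visible / guest.bandwidth_bytes_per_s


def actual_downtime_s(guest: Guest) -> float:
    """Downtime needed after the VM stops: pinned pages are sent either way."""
    total = guest.cpu_dirty_bytes + guest.pinned_bytes
    return total / guest.bandwidth_bytes_per_s


guest = Guest(
    cpu_dirty_bytes=64 * MiB,
    pinned_bytes=1024 * MiB,
    bandwidth_bytes_per_s=1024 * MiB,
)

masked = estimated_downtime_s(guest, device_reports_in_precopy=False)
reported = estimated_downtime_s(guest, device_reports_in_precopy=True)
actual = actual_downtime_s(guest)

print(f"estimate, device masked until stop-and-copy: {masked:.4f} s")
print(f"estimate, device reports during pre-copy:    {reported:.4f} s")
print(f"downtime actually required:                  {actual:.4f} s")
```

In this model the masked estimate is far below what is actually required, while the estimate with the device participating matches it, so a configured downtime limit can either be honoured or visibly missed.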