* Alex Williamson (alex.william...@redhat.com) wrote:
> On Mon, 9 Nov 2020 19:44:17 +0000
> "Dr. David Alan Gilbert" <dgilb...@redhat.com> wrote:
> 
> > * Alex Williamson (alex.william...@redhat.com) wrote:
> > > Per the proposed documentation for vfio device migration:
> > > 
> > >   Dirty pages are tracked when device is in stop-and-copy phase
> > >   because if pages are marked dirty during pre-copy phase and
> > >   content is transferred from source to destination, there is no
> > >   way to know newly dirtied pages from the point they were copied
> > >   earlier until device stops.  To avoid repeated copy of same
> > >   content, pinned pages are marked dirty only during
> > >   stop-and-copy phase.
> > > 
> > > Essentially, since we don't have hardware dirty page tracking for
> > > assigned devices at this point, we consider any page that is
> > > pinned by an mdev vendor driver or pinned and mapped through the
> > > IOMMU to be perpetually dirty.  In the worst case, this may result
> > > in all of guest memory being considered dirty during every
> > > iteration of live migration.  The current vfio implementation of
> > > migration has chosen to mask device-dirtied pages until the final
> > > stages of migration in order to avoid this worst case scenario.
> > > 
> > > Allowing the device to implement a policy decision to prioritize
> > > reduced migration data like this jeopardizes QEMU's overall
> > > ability to implement any degree of service level guarantees during
> > > migration.  For example, any estimates towards achieving
> > > acceptable downtime margins cannot be trusted when such a device
> > > is present.  The vfio device should participate in dirty page
> > > tracking to the best of its ability throughout migration, even if
> > > that means the dirty footprint of the device impedes migration
> > > progress, allowing both QEMU and higher level management tools to
> > > decide whether to continue the migration or abort due to failure
> > > to achieve the desired behavior.
> > 
> > I don't feel particularly badly about the decision to squash it in
> > during the stop-and-copy phase; for devices where the pinned memory
> > is large, I don't think doing it during the main phase makes much
> > sense, especially if you then have to deal with tracking changes in
> > pinning.
> 
> AFAIK the kernel support for tracking changes in page pinning already
> exists; it's largely the vfio device in QEMU that decides when to
> start exposing the device's dirty footprint to QEMU.  I'm a bit
> surprised by this answer though; we don't really know what the device
> memory footprint is.  It might be large, it might be nothing, but by
> not participating in dirty page tracking until the VM is stopped, we
> can't know what the footprint is and how it will affect downtime.  Is
> it really the place of a QEMU device driver to impose this sort of
> policy?

If it could actually track changes, then I'd agree we shouldn't impose
any policy; but if it's just marking the whole area as dirty, we're
going to need a bodge somewhere, and this bodge doesn't look any worse
than the others to me.
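To make that concrete: with no fine-grained tracking, every dirty
bitmap sync over a pinned range degenerates into "set every bit",
something like the sketch below (illustrative only, with made-up
names; this is not the actual QEMU or kernel code):

  #include <stdint.h>

  #define BITS_PER_LONG (sizeof(unsigned long) * 8)

  /*
   * With no hardware dirty tracking, the best that can be reported
   * for a pinned range is "all dirty": every bit in the bitmap gets
   * set, so each pre-copy iteration re-sends the entire range.
   */
  static void mark_pinned_range_dirty(unsigned long *bitmap,
                                      uint64_t npages)
  {
      for (uint64_t i = 0; i < npages; i++) {
          bitmap[i / BITS_PER_LONG] |= 1UL << (i % BITS_PER_LONG);
      }
  }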
> > Having said that, I agree with marking it as experimental, because
> > I'm dubious how useful it will be for the same reason; I worry
> > about whether the downtime will be so large as to make it
> > pointless.
> 
> TBH I think that's the wrong reason to mark it experimental.  There's
> clearly demand for vfio device migration and even if the practical
> use cases are initially small, they will expand over time and
> hardware will get better.  My objection is that the current behavior
> masks the hardware and device limitations, leading to unrealistic
> expectations.  If the user expects minimal downtime, configures
> convergence to account for that, QEMU thinks it can achieve it, and
> then the device marks everything dirty, that's not supportable.

Yes, agreed.

> OTOH if the vfio device participates in dirty tracking through
> pre-copy, then the practical use cases will find themselves as
> migrations will either be aborted because downtime tolerances cannot
> be achieved or downtimes will be configured to match reality.
> Thanks,

Without a way to prioritise the unpinned memory during that period,
we're going to be repeatedly sending the pinned memory, which is going
to lead to much larger bandwidth usage than required; so that's going
in completely the wrong direction, and it's also wrong from the point
of view of the user.
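As a rough worked example (numbers invented for illustration): a guest
with 16 GiB pinned that's treated as always-dirty re-sends that 16 GiB
on every pre-copy pass; ten passes is ~160 GiB on the wire for pages
that were never going to converge anyway, on top of the genuinely
dirty working set.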
Dave

> 
> Alex
> 
> > Reviewed-by: Dr. David Alan Gilbert <dgilb...@redhat.com>
> > 
> > > Link: https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg00807.html
> > > Cc: Kirti Wankhede <kwankh...@nvidia.com>
> > > Cc: Neo Jia <c...@nvidia.com>
> > > Cc: Dr. David Alan Gilbert <dgilb...@redhat.com>
> > > Cc: Juan Quintela <quint...@redhat.com>
> > > Cc: Philippe Mathieu-Daudé <phi...@redhat.com>
> > > Cc: Cornelia Huck <coh...@redhat.com>
> > > Signed-off-by: Alex Williamson <alex.william...@redhat.com>
> > > ---
> > > 
> > > Given that our discussion in the link above seems to be going in
> > > circles, I'm afraid it seems necessary both to have a contingency
> > > plan and to raise the visibility of the current behavior to
> > > determine whether others agree that this is a sufficiently
> > > troubling behavior to consider migration support experimental at
> > > this stage.  Please voice your opinion or contribute patches to
> > > resolve this before QEMU 5.2.  Thanks,
> > > 
> > > Alex
> > > 
> > >  hw/vfio/migration.c           |    2 +-
> > >  hw/vfio/pci.c                 |    2 ++
> > >  include/hw/vfio/vfio-common.h |    1 +
> > >  3 files changed, 4 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > > index 3ce285ea395d..cd44d465a50b 100644
> > > --- a/hw/vfio/migration.c
> > > +++ b/hw/vfio/migration.c
> > > @@ -882,7 +882,7 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
> > >      Error *local_err = NULL;
> > >      int ret = -ENOTSUP;
> > >  
> > > -    if (!container->dirty_pages_supported) {
> > > +    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
> > >          goto add_blocker;
> > >      }
> > >  
> > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > > index 58c0ce8971e3..1349b900e513 100644
> > > --- a/hw/vfio/pci.c
> > > +++ b/hw/vfio/pci.c
> > > @@ -3194,6 +3194,8 @@ static Property vfio_pci_dev_properties[] = {
> > >                      VFIO_FEATURE_ENABLE_REQ_BIT, true),
> > >      DEFINE_PROP_BIT("x-igd-opregion", VFIOPCIDevice, features,
> > >                      VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
> > > +    DEFINE_PROP_BOOL("x-enable-migration", VFIOPCIDevice,
> > > +                     vbasedev.enable_migration, false),
> > >      DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
> > >      DEFINE_PROP_BOOL("x-balloon-allowed", VFIOPCIDevice,
> > >                       vbasedev.ram_block_discard_allowed, false),
> > > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> > > index baeb4dcff102..2119872c8af1 100644
> > > --- a/include/hw/vfio/vfio-common.h
> > > +++ b/include/hw/vfio/vfio-common.h
> > > @@ -123,6 +123,7 @@ typedef struct VFIODevice {
> > >      bool needs_reset;
> > >      bool no_mmap;
> > >      bool ram_block_discard_allowed;
> > > +    bool enable_migration;
> > >      VFIODeviceOps *ops;
> > >      unsigned int num_irqs;
> > >      unsigned int num_regions;

-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK