On Thu, 1 Feb 2024 13:10:48 -0500 "Michael S. Tsirkin" <m...@redhat.com> wrote:
> On Thu, Feb 01, 2024 at 11:08:58PM +0530, Vinayak Kale wrote: > > > > On 31/01/24 11:08 pm, Alex Williamson wrote: > > > > > > On Wed, 31 Jan 2024 15:22:59 +0530 > > > Vinayak Kale <vk...@nvidia.com> wrote: > > > > > > > On 31/01/24 12:28 am, Alex Williamson wrote: > > > > > > > > > > On Tue, 30 Jan 2024 23:32:26 +0530 > > > > > Vinayak Kale <vk...@nvidia.com> wrote: > > > > > > > > > > > Missed adding Michael, Marcel, Alex and Avihai earlier, apologies. > > > > > > > > > > > > Regards, > > > > > > Vinayak > > > > > > > > > > > > On 30/01/24 3:26 pm, Vinayak Kale wrote: > > > > > > > In case of migration, during restore operation, qemu checks the > > > > > > > config space of the pci device with the config space > > > > > > > in the migration stream captured during save operation. In case > > > > > > > of config space data mismatch, restore operation is failed. > > > > > > > > > > > > > > config space check is done in function get_pci_config_device(). > > > > > > > By default VSC (vendor-specific-capability) in config space is > > > > > > > checked. > > > > > > > > > > > > > > Ideally qemu should not check VSC during restore/load. This patch > > > > > > > skips the check by not setting pdev->cmask[] for VSC offsets in > > > > > > > pci_add_capability(). > > > > > > > If cmask[] is not set for an offset, then qemu skips config space > > > > > > > check for that offset. > > > > > > > > > > > > > > Signed-off-by: Vinayak Kale <vk...@nvidia.com> > > > > > > > --- > > > > > > > hw/pci/pci.c | 7 +++++-- > > > > > > > 1 file changed, 5 insertions(+), 2 deletions(-) > > > > > > > > > > > > > > diff --git a/hw/pci/pci.c b/hw/pci/pci.c > > > > > > > index 76080af580..32429109df 100644 > > > > > > > --- a/hw/pci/pci.c > > > > > > > +++ b/hw/pci/pci.c > > > > > > > @@ -2485,8 +2485,11 @@ int pci_add_capability(PCIDevice *pdev, > > > > > > > uint8_t cap_id, > > > > > > > memset(pdev->used + offset, 0xFF, QEMU_ALIGN_UP(size, 4)); > > > > > > > /* Make capability read-only by default */ > > > > > > > memset(pdev->wmask + offset, 0, size); > > > > > > > - /* Check capability by default */ > > > > > > > - memset(pdev->cmask + offset, 0xFF, size); > > > > > > > + > > > > > > > + if (cap_id != PCI_CAP_ID_VNDR) { > > > > > > > + /* Check non-vendor specific capability by default */ > > > > > > > + memset(pdev->cmask + offset, 0xFF, size); > > > > > > > + } > > > > > > > return offset; > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > If there is a possibility that the data within the vendor specific cap > > > > > can be consumed by the driver or diagnostic tools, then it's part of > > > > > the device ABI and should be consistent across migration. A mismatch > > > > > can certainly cause a migration failure, but why shouldn't it? > > > > > > > > Sure, the device ABI should be consistent across migration. In case of > > > > VSC, it should represent same format on source and destination. But > > > > shouldn't VSC content check or its interpretation be left to vendor > > > > driver instead of qemu? > > > > > > By "vendor driver" here, are you suggesting that QEMU device models (ex. > > > hw/net/{e1000*,igb*,rtl8139*}) should perform that validation? If so, > > > where's the patch that introduces any sort of validation hooks for > > > vendors to provide? Where is this validation going to happen in the > > > case of a migratable vfio-pci variant devices? Nothing about this > > > patch suggests that it's deferring responsibility to some other code > > > entity, it only indicates "checking this breaks, let's not do it". > > > > > > It's possible that the device you care about only reports volatile > > > diagnostic information through the vendor specific capability, but > > > another device might use it to report information relative to the > > > internal hardware configuration. Without knowing what the vendor > > > specific capability contains, QEMU needs to take the most conservative > > > approach by default. Thanks, > > > > PCI/PCIe spec doesn’t define ABI for VSC/VSEC contents. Any other code > > entity except vendor driver should ignore VSC contents. QEMU’s expectation > > of VSC contents to be equal on source and destination seems incorrect given > > that QEMU has no control over ABI for VSC contents. > > > > > > > > Alex > > > > > I don't get why this matters though. This is no different from any other > device specific register. If a register is visible to guest it generally > should not change across migration. If you are migrating a VFIO device and > you are making a vsc visible to guest then your migration routine must > make sure to migrate the contents of vsc. > > Maybe there's a good reason to have a register which actually does > change. Then, please document the actual reason. When you say: > > Ideally qemu should not check VSC during restore/load. > > then that is clearly wrong in most cases. The argument as I understand it is that enforcing that the contents remain unchanged between source and target is a policy, but QEMU has no basis to create such a policy because the ABI for this capability is not defined by the spec. Furthermore since the ABI of this capability is only defined by the device, only the driver for the device should have any interaction with the capability. There's some merit there, but a potential flaw is that QEMU does in fact register several vendor specific capabilities for which it does know the ABI. See for example: vfio_add_vmd_shadow_cap() vfio_add_nv_gpudirect_cap() pci_bridge_qemu_reserve_cap_init() virtio_pci_add_mem_cap() I think only the vfio VMD one can be claimed not to support migration since afaik we could add a GPUDirect clique capability to a migratable vGPU. So since these are QEMU created capabilities and QEMU does understand the ABI, what about these specific capabilities would allow them to change between source and target VMs? The GPUDirect clique capability describes p2p capabilities between devices. That information cannot spontaneously change. The bridge vendor capability exposes reserved bridge resources, which I think is consumed by firmware so a change would result in resources shifting after migration and a guest reboot. That would be bad. The virtio vendor capability does a variety of things that I'm sure MST could explain better than me, but also seems like defined ABI to the VM. So actually is it the case that the only vendor specific capabilities exposed to the VM that QEMU doesn't understand are those provided from a device exposed through vfio-pci? If that's the case, then QEMU's default policy to verify that the vendor capability is unchanged seems to be valid and I'd think the patch instead should be creating exceptions to that default policy for vfio-pci devices (ie. nothing says vfio-pci cannot clear cmask bits set by the pci-core). I think we also see above that there are vendor specific capabilities that have a scope beyond the guest driver itself, some visible through userspace tools, some consumed by firmware. So I still find it difficult to determine whether any vfio-pci exception should be unconditional or tied to specific devices. Also, migration is only supported by vfio-pci variant drivers, so why is it that the variant driver cannot or should not make the target capability match the source? If the contents of the capability changes, does it need to be exposed at all by the variant driver, or is the vendor specific capability the only means to expose this information for the driver? Thanks, Alex