[AMD Official Use Only - AMD Internal Distribution Only] Hi Alex, before I submit a new patch according to Cedric's email, I want to make some clarifications
On Tue, 3 Feb 2026 16:41:12 +0100 Manojlo Pekovic <[email protected]> wrote: > From: Manojlo Pekovic <[email protected]> > > Extend PCIe Atomic Ops completer support to multifunction vfio-pci > devices by calculating the common subset of atomic capabilities across > all functions. > > For multifunction devices, we now: > 1. Calculate the intersection of atomic capabilities across all > functions 2. Only enable capabilities that ALL functions with atomic > support share 3. Functions without atomic support are skipped (not > penalized) > > During vfio_realize(), other functions in a multifunction device may > not be realized yet, making it impossible to enumerate all functions. > We defer atomic ops configuration to machine_done time for > multifunction devices, ensuring all functions are realized before > calculating common capabilities. Exactly what are we trying to enable with this support? Why do we need multifunction AtomicOps support versus multiple single function assignments under separate root ports? ++ While systems do allow separate assignments, we have come up on more than one occasion where the topology of the bare metal was mimicked by VM's configuration. ++ Current solutions were hacking and forcing atomics ops on the root port, or changing the topology of the VM (which no longer represents the same as bare metal) ++ Right now, I cannot confirm that doing either has created any faults, however, from UX standpoint, the correct way is that user shouldn't think about it ++ I understand your concerns, and I can either ++ - Defer this patch and investigate further ++ - Proceed with it for completeness, given its low risk ++ What's your preference? > Signed-off-by: Manojlo Pekovic <[email protected]> > --- > hw/vfio/pci.c | 138 +++++++++++++++++++++++++++++++++++++++++++++++++- > hw/vfio/pci.h | 1 + > 2 files changed, 137 insertions(+), 2 deletions(-) > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index > 6a6c8f1807..89f171f431 100644 > --- a/hw/vfio/pci.c > +++ b/hw/vfio/pci.c > @@ -1931,6 +1931,67 @@ static uint32_t vfio_get_atomic_cap(VFIOPCIDevice > *vdev) > return mask; > } > > +static void vfio_pci_machine_done(Notifier *notifier, void *data); Why aren't these ordered to avoid the forward declaration? > + > +/* > + * This function iterates through all possible functions (0-7) > + * at the same slot and returns the intersection of their capabilities. > + * > + * Returns: Bitmask of atomic capabilities supported by functions on slot > + * (PCI_EXP_DEVCAP2_ATOMIC_COMP32/64/128), or 0 if: > + * - Mixed device types found (vfio-pci + emulated) > + * - No common capabilities exist > + */ > +static uint32_t vfio_get_multifunction_atomic_cap(VFIOPCIDevice *vdev, > + PCIBus *bus) { > + PCIDevice *func_dev; > + VFIOPCIDevice *func_vdev; > + int slot = PCI_SLOT(vdev->pdev.devfn); > + int target_devfn; > + uint32_t func_cap; > + uint32_t common_mask = PCI_EXP_DEVCAP2_ATOMIC_COMP32 | > + PCI_EXP_DEVCAP2_ATOMIC_COMP64 | > + PCI_EXP_DEVCAP2_ATOMIC_COMP128; > + > + /* Iterate through all possible functions at this slot */ > + for (int fn = 0; fn < PCI_FUNC_MAX; fn++) { > + target_devfn = PCI_DEVFN(slot, fn); > + func_dev = pci_find_device(bus, pci_bus_num(bus), > + target_devfn); > + > + if (!func_dev) { > + continue; > + } > + > + /* > + * We can only calculate atomic caps for vfio-pci devices. If we > + * encounter a mixed multifunction device (e.g., vfio-pci function 0 > + * + emulated e1000 function 1), we cannot safely determine the > atomic > + * capabilities. Taking the conservative approach: disable all atomic > + * ops for mixed configurations. > + */ > + > + func_vdev = (VFIOPCIDevice *)object_dynamic_cast(OBJECT(func_dev), > + TYPE_VFIO_PCI); > + if (!func_vdev) { > + return 0; > + } > + > + /* > + * A function with no atomic completer support simply cannot > + * receive and execute atomic operations. > + * This does NOT mean other functions in the same multifunction > + * device should be prevented from supporting atomics. > + */ > + func_cap = vfio_get_atomic_cap(func_vdev); > + if (func_cap != 0) { > + common_mask &= func_cap; > + } It's not clear to me that this is a valid interpretation: PCIe 7.0, 6.15.2: If any Function in a Multi-Function Device supports AtomicOp Completer or AtomicOp routing capability, all Functions with Memory Space BARs in that device must decode properly formed AtomicOp Requests and handle any they don’t support as an Unsupported Request (UR). Note that in such devices, Functions lacking AtomicOp Completer capability are forbidden to handle properly formed AtomicOp Requests as Malformed TLPs. If we're adding an arbitrary device that may not support AtomicOps into a virtual multifunction package, how can we know that a device that doesn't have an AtomicOps cap wouldn't respond with a Malformed TLP error? ++ My mistake in wording the comment. It is not that completer cannot receive operations, but rather that the true state of the hardware should be passed through, having same functionality on both VM and Bare Metal ++ If the devices that doesn't support AtomicOps is passed through, it will have same capabilities as actual hardware, having same properties as bare metal (and same faults if hardware cannot respond with UR) ++ Because of the check that we do for VFIOPCIDevice in lines before, if non-vfio device is passed we will simply return > + } > + > + return common_mask; > +} > + > static void vfio_pci_enable_rp_atomics(VFIOPCIDevice *vdev) { > PCIBus *bus = pci_get_bus(&vdev->pdev); @@ -1948,8 +2009,7 @@ > static void vfio_pci_enable_rp_atomics(VFIOPCIDevice *vdev) > if (pci_bus_is_root(bus) || !parent || !parent->exp.exp_cap || > pcie_cap_get_type(parent) != PCI_EXP_TYPE_ROOT_PORT || > pcie_cap_get_version(parent) != PCI_EXP_FLAGS_VER2 || > - vdev->pdev.devfn || > - vdev->pdev.cap_present & QEMU_PCI_CAP_MULTIFUNCTION) { > + vdev->pdev.devfn) { > return; > } > > @@ -1962,6 +2022,25 @@ static void vfio_pci_enable_rp_atomics(VFIOPCIDevice > *vdev) > return; > } > > + /* > + * Single-function: All device info is available now, configure > immediately > + * Multifunction: Other functions may not be realized yet, defer to > + * machine_done time when all devices guaranteed to exist > + */ This also means that multifunction devices can only support cold-plug whereas single function devices support AtomicOps with cold-plug and hot-plug. The cold-plug device can be unplugged and re-added and it will have different behavior. Additionally, it seems we only enable AtomicOps if function 0 supports AtomicOps, what's the basis for that decision? Based on the previous comment, the more conservative approach might be that all functions need to support AtomicOps, but... Maybe what we really want is for each function to attempt to initialize the AtomicOps bits on the root port. The last realized function should get it correct regardless of the order the functions are realized, though we'll need to track our 'Abort if already configured' strategy differently. I tossed out this machine_done notifier as a possibility, but I'd rather avoid it. ++ Machine_done was something I hesitantly did, trying to not change as much functionality as possible of the current code. ++ If its appropriate, I can try with changing 'Abort if already configured' logic, to support this approach ++ What is your suggestion on this? > + if (vdev->pdev.cap_present & QEMU_PCI_CAP_MULTIFUNCTION) { > + /* > + * Register machine_done notifier if not already registered. > + * The notifier will be called after all devices are realized, at > which > + * point we can safely enumerate all functions and calculate the > common > + * atomic capabilities. > + */ > + if (!vdev->machine_done.notify) { > + vdev->machine_done.notify = vfio_pci_machine_done; > + qemu_add_machine_init_done_notifier(&vdev->machine_done); > + } > + return; /* Actual work deferred to vfio_pci_machine_done() */ > + } > + > mask = vfio_get_atomic_cap(vdev); > > if (!mask) { > @@ -1972,6 +2051,51 @@ static void vfio_pci_enable_rp_atomics(VFIOPCIDevice > *vdev) > vdev->clear_parent_atomics_on_exit = true; } > > +/* > + * Called at machine_done time, after ALL devices are realized > + * but before the VM starts running. > + * > + * For multifunction devices, we need to calculate atomic > +capabilities > + * across all functions. During realize of function 0, the other > +functions > + * (1-7) may not exist yet. By deferring to machine_done time, we > +ensure > + * all functions are realized and can be safely enumerated. > + */ > +static void vfio_pci_machine_done(Notifier *notifier, void *data) { > + VFIOPCIDevice *vdev = container_of(notifier, VFIOPCIDevice, > machine_done); > + PCIBus *bus = pci_get_bus(&vdev->pdev); > + PCIDevice *parent = pci_bridge_get_device(bus); > + uint32_t mask; > + uint8_t *pos; > + > + /* > + * Sanity checks: Should always pass since vfio_pci_enable_rp_atomics() > + * already validated these conditions before registering this notifier. > + */ > + if (!parent || !parent->exp.exp_cap) { > + return; > + } > + > + pos = parent->config + parent->exp.exp_cap + PCI_EXP_DEVCAP2; > + > + /* Abort if there'a already an Atomic Ops configuration on the > + root port */ > + > + if (pci_get_long(pos) & (PCI_EXP_DEVCAP2_ATOMIC_COMP32 | > + PCI_EXP_DEVCAP2_ATOMIC_COMP64 | > + PCI_EXP_DEVCAP2_ATOMIC_COMP128)) { > + return; > + } > + > + mask = vfio_get_multifunction_atomic_cap(vdev, bus); > + > + if (!mask) { > + return; > + } > + > + pci_long_test_and_set_mask(pos, mask); > + vdev->clear_parent_atomics_on_exit = true; } Duplication relative to vfio_pci_enable_rp_atomics() should be refactored if we keep it. Thanks, Alex > + > static void vfio_pci_disable_rp_atomics(VFIOPCIDevice *vdev) { > if (vdev->clear_parent_atomics_on_exit) { @@ -3281,6 +3405,16 @@ > static void vfio_exitfn(PCIDevice *pdev) > VFIOPCIDevice *vdev = VFIO_PCI(pdev); > VFIODevice *vbasedev = &vdev->vbasedev; > > + /* > + * Remove machine_done notifier if it was registered. > + * This happens for multifunction devices where function 0 registered > + * a notifier to defer atomic ops configuration to machine_done time. > + */ > + if (vdev->machine_done.notify) { > + qemu_remove_machine_init_done_notifier(&vdev->machine_done); > + vdev->machine_done.notify = NULL; > + } > + > vfio_unregister_req_notifier(vdev); > vfio_unregister_err_notifier(vdev); > pci_device_set_intx_routing_notifier(&vdev->pdev, NULL); diff > --git a/hw/vfio/pci.h b/hw/vfio/pci.h index 5ad090a229..9182ab0053 > 100644 > --- a/hw/vfio/pci.h > +++ b/hw/vfio/pci.h > @@ -180,6 +180,7 @@ struct VFIOPCIDevice { > bool skip_vsc_check; > VFIODisplay *dpy; > Notifier irqchip_change_notifier; > + Notifier machine_done; > }; > > /* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot > match hw */ Thanks, Manojlo
