On Wed, Feb 4, 2026, at 11:18 AM, Pekovic, Manojlo wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Hi Alex,
> Before I submit a new patch according to Cedric's email, I want to make 
> some clarifications.
>
> On Tue, 3 Feb 2026 16:41:12 +0100
> Manojlo Pekovic <[email protected]> wrote:
>
>> From: Manojlo Pekovic <[email protected]>
>>
>> Extend PCIe Atomic Ops completer support to multifunction vfio-pci
>> devices by calculating the common subset of atomic capabilities across
>> all functions.
>>
>> For multifunction devices, we now:
>> 1. Calculate the intersection of atomic capabilities across all
>>    functions
>> 2. Only enable capabilities that ALL functions with atomic support
>>    share
>> 3. Skip functions without atomic support (they are not penalized)
>>
>> During vfio_realize(), other functions in a multifunction device may
>> not be realized yet, making it impossible to enumerate all functions.
>> We defer atomic ops configuration to machine_done time for
>> multifunction devices, ensuring all functions are realized before
>> calculating common capabilities.
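The intersection described above could be sketched as follows; this is a hypothetical standalone helper, not QEMU's actual API, with illustrative capability bits standing in for the PCI_EXP_DEVCAP2 completer flags:

```c
#include <stdint.h>

#define CAP_ATOMIC32  (1u << 0)  /* 32-bit AtomicOp completer (illustrative) */
#define CAP_ATOMIC64  (1u << 1)  /* 64-bit AtomicOp completer (illustrative) */
#define CAP_ATOMIC128 (1u << 2)  /* 128-bit CAS completer (illustrative) */

/*
 * Intersect the capabilities of all functions that report any atomic
 * support; functions reporting none are skipped, per the patch's policy.
 */
static uint32_t common_atomic_caps(const uint32_t *func_caps, int nfuncs)
{
    uint32_t common = CAP_ATOMIC32 | CAP_ATOMIC64 | CAP_ATOMIC128;

    for (int i = 0; i < nfuncs; i++) {
        if (func_caps[i] != 0) {
            common &= func_caps[i];
        }
    }
    return common;
}
```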
>
> Exactly what are we trying to enable with this support?  Why do we need 
> multifunction AtomicOps support versus multiple single function 
> assignments under separate root ports?
>
> ++ While systems do allow separate assignments, we have come across more 
> than one occasion where a VM's configuration mimicked the topology of 
> the bare metal.
> ++ Current workarounds were hacks: forcing atomic ops on the root 
> port, or changing the topology of the VM (which then no longer matches 
> bare metal).
> ++ Right now, I cannot confirm that doing either has caused any 
> faults; however, from a UX standpoint, the correct outcome is that the 
> user shouldn't have to think about it.
> ++ I understand your concerns, and I can either
> ++ - Defer this patch and investigate further
> ++ - Proceed with it for completeness, given its low risk
> ++ What's your preference?

Atomic ops routing is complicated though.  The current automatic setup focuses 
on a specific use case.  The vfio interface *only* reports atomic ops support 
relative to the root bus, therefore we can effectively only enable host->device 
or device->host atomic ops, thus automatic configuration is limited to the 
single function downstream of a root port, without atomic ops routing enabled.
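The "trivial case" restriction described here can be sketched as a predicate; this is illustrative only, since the real check in vfio_pci_enable_rp_atomics also validates the parent's express capability and version:

```c
#include <stdbool.h>

/*
 * Illustrative predicate: automatic AtomicOps setup only applies to a
 * lone function 0, not marked multifunction, directly downstream of a
 * root port.  The actual QEMU code additionally checks the parent's
 * express capability offset and capability version.
 */
static bool atomic_autoconfig_allowed(int devfn, bool multifunction,
                                      bool parent_is_root_port)
{
    return parent_is_root_port && devfn == 0 && !multifunction;
}
```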

It's been a while since I implemented this and I should have left better 
breadcrumbs about the single function restriction, but I think the issue was 
not simply determining the common set of capabilities between the devices, 
but the implications for device->device atomic ops.

For example, imagine that we want to create a virtual topology that enables 
atomic ops between two single function devices, each downstream of a separate 
root port.  Before we can set the atomic ops routing flag on the root ports we 
need to evaluate whether the host system has atomic ops routing on every 
physical interconnect between the devices.  The current vfio interface doesn't 
tell us this.

Now, instead of a virtual topology placing those devices under separate root 
ports with explicit routing flags, what if we lump them together into a 
multifunction device?  Do these functions implicitly support device->device 
atomic ops?  Such paths are typically implementation details beyond the scope 
of the spec.

Therefore, it's not really clear what gremlins we're unleashing in this "low 
risk" extension.

Back to the point, atomic ops routing is complicated, and QEMU currently kicks 
anything beyond the trivial case back to the VM administrator.  If the VM 
administrator doesn't want to think about it, analyze the host topology, create 
a compatible VM topology, and manually set appropriate atomic ops bits, then 
the burden probably needs to go in the direction of VM builders and management 
tools rather than pushed down into QEMU.  QEMU doesn't have the visibility to 
determine host routing and is forced to work with the topology that's been 
specified.

>> +        /*
>> +         * A function with no atomic completer support simply cannot
>> +         * receive and execute atomic operations.
>> +         * This does NOT mean other functions in the same multifunction
>> +         * device should be prevented from supporting atomics.
>> +         */
>> +        func_cap = vfio_get_atomic_cap(func_vdev);
>> +        if (func_cap != 0) {
>> +            common_mask &= func_cap;
>> +        }
>
> It's not clear to me that this is a valid interpretation:
>
> PCIe 7.0, 6.15.2:
>
>         If any Function in a Multi-Function Device supports AtomicOp
>         Completer or AtomicOp routing capability, all Functions with
>         Memory Space BARs in that device must decode properly formed
>         AtomicOp Requests and handle any they don’t support as an
>         Unsupported Request (UR). Note that in such devices, Functions
>         lacking AtomicOp Completer capability are forbidden to handle
>         properly formed AtomicOp Requests as Malformed TLPs.
>
> If we're adding an arbitrary device that may not support AtomicOps into 
> a virtual multifunction package, how can we know that a device that 
> doesn't have an AtomicOps cap wouldn't respond with a Malformed TLP 
> error?
>
> ++ My mistake in wording the comment. It is not that the completer 
> cannot receive operations, but rather that the true state of the 
> hardware should be passed through, so the VM behaves the same as bare 
> metal.

But we cannot enforce that in QEMU.

> ++ If a device that doesn't support AtomicOps is passed through, it 
> will have the same capabilities as the actual hardware, and thus the 
> same properties as bare metal (and the same faults if the hardware 
> cannot respond with UR).

Only if the function is part of the same multifunction device on bare metal.  
QEMU cannot enforce that.

> ++ Because of the check that we do for VFIOPCIDevice in the lines 
> before, if a non-vfio device is present we will simply return.

We can compose multifunction devices in QEMU from separate multi- and single- 
function devices on bare metal.  The fact that they are VFIOPCIDevices tells us 
nothing about the physical association.

>> +    }
>> +
>> +    return common_mask;
>> +}
>> +
>>  static void vfio_pci_enable_rp_atomics(VFIOPCIDevice *vdev)
>>  {
>>      PCIBus *bus = pci_get_bus(&vdev->pdev);
>> @@ -1948,8 +2009,7 @@ static void vfio_pci_enable_rp_atomics(VFIOPCIDevice *vdev)
>>      if (pci_bus_is_root(bus) || !parent || !parent->exp.exp_cap ||
>>          pcie_cap_get_type(parent) != PCI_EXP_TYPE_ROOT_PORT ||
>>          pcie_cap_get_version(parent) != PCI_EXP_FLAGS_VER2 ||
>> -        vdev->pdev.devfn ||
>> -        vdev->pdev.cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
>> +        vdev->pdev.devfn) {
>>          return;
>>      }
>>
>> @@ -1962,6 +2022,25 @@ static void vfio_pci_enable_rp_atomics(VFIOPCIDevice *vdev)
>>          return;
>>      }
>>
>> +    /*
>> +     * Single-function: All device info is available now; configure
>> +     *                  immediately.
>> +     * Multifunction: Other functions may not be realized yet; defer to
>> +     *                machine_done time, when all devices are guaranteed
>> +     *                to exist.
>> +     */
>
> This also means that multifunction devices can only support cold-plug 
> whereas single function devices support AtomicOps with cold-plug and 
> hot-plug.  The cold-plug device can be unplugged and re-added and it 
> will have different behavior.
>
> Additionally, it seems we only enable AtomicOps if function 0 supports 
> AtomicOps, what's the basis for that decision?  Based on the previous 
> comment, the more conservative approach might be that all functions 
> need to support AtomicOps, but...
>
> Maybe what we really want is for each function to attempt to initialize 
> the AtomicOps bits on the root port.  The last realized function should 
> get it correct regardless of the order the functions are realized, 
> though we'll need to track our 'Abort if already configured' strategy 
> differently.
>
> I tossed out this machine_done notifier as a possibility, but I'd 
> rather avoid it.
>
> ++ The machine_done notifier was something I did hesitantly, trying to 
> change as little of the current code's functionality as possible.
> ++ If it's appropriate, I can try changing the 'Abort if already 
> configured' logic to support this approach.
> ++ What is your suggestion on this?
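The "last realized function wins" idea suggested earlier in the thread could look roughly like this sketch: each realize recomputes the root port mask from every sibling realized so far, instead of aborting when already configured. All names and types here are illustrative stand-ins, not QEMU's API:

```c
#include <stdint.h>

#define MAX_FUNCS 8

/*
 * Illustrative model of a root port whose AtomicOps egress bits are
 * recomputed rather than latched: each realize re-derives the mask from
 * all siblings realized so far, so realization order doesn't matter.
 */
struct rp_state {
    uint32_t func_caps[MAX_FUNCS];  /* 0 = no atomic support reported */
    int nfuncs;
    uint32_t enabled_mask;          /* bits currently set on the root port */
};

static void rp_realize_function(struct rp_state *rp, uint32_t caps)
{
    uint32_t common = ~0u;

    rp->func_caps[rp->nfuncs++] = caps;

    /* Recompute from scratch instead of 'abort if already configured'. */
    for (int i = 0; i < rp->nfuncs; i++) {
        if (rp->func_caps[i] != 0) {
            common &= rp->func_caps[i];
        }
    }
    rp->enabled_mask = (common == ~0u) ? 0 : common;
}
```

This would sidestep the machine_done notifier, at the cost of tracking per-function capabilities on the root port side.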

I'm not convinced it's QEMU's job, or that QEMU is even capable of serving the 
intended goal here.  Thanks,

Alex
