On 1/12/26 17:16, Shameer Kolothum wrote:
Hi Cédric,

-----Original Message-----
From: Cédric Le Goater <[email protected]>
Sent: 09 January 2026 17:05
To: Shameer Kolothum <[email protected]>; Duan, Zhenzhong
<[email protected]>; [email protected]; [email protected]
Cc: [email protected]; [email protected]; [email protected];
[email protected]; Nicolin Chen <[email protected]>; Nathan Chen
<[email protected]>; Matt Ochs <[email protected]>; Jason Gunthorpe
<[email protected]>; Krishnakant Jaju <[email protected]>
Subject: Re: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per
region

External email: Use caution opening links or attachments


On 1/8/26 12:04, Shameer Kolothum wrote:


-----Original Message-----
From: Duan, Zhenzhong <[email protected]>
Sent: 08 January 2026 09:41
To: Shameer Kolothum <[email protected]>; [email protected];
[email protected]
Cc: [email protected]; [email protected]; [email protected];
[email protected]; [email protected]; Nicolin Chen
<[email protected]>; Nathan Chen <[email protected]>; Matt Ochs
<[email protected]>; Jason Gunthorpe <[email protected]>; Krishnakant
Jaju <[email protected]>
Subject: Re: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR
per region


On 12/22/2025 9:53 PM, Shameer Kolothum wrote:
From: Nicolin Chen <[email protected]>

Linux now provides a VFIO dmabuf exporter to expose PCI BAR memory
for P2P use cases. Create a dmabuf for each mapped BAR region after
the mmap is set up, and store the returned fd in the region's RAMBlock.
This allows QEMU to pass the fd to dma_map_file(), enabling iommufd
to import the dmabuf and map the BAR correctly in the host IOMMU
page table.

If the kernel lacks support or dmabuf setup fails, QEMU skips the
setup and continues with normal mmap handling.

Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Shameer Kolothum <[email protected]>
---
    hw/vfio/region.c     | 57 +++++++++++++++++++++++++++++++++++++++++++-
    hw/vfio/trace-events |  1 +
    2 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/region.c b/hw/vfio/region.c
index b165ab0b93..6949f6779c 100644
--- a/hw/vfio/region.c
+++ b/hw/vfio/region.c
@@ -29,6 +29,7 @@
    #include "qemu/error-report.h"
    #include "qemu/units.h"
    #include "monitor/monitor.h"
+#include "system/ramblock.h"
    #include "vfio-helpers.h"

    /*
@@ -238,13 +239,52 @@ static void vfio_subregion_unmap(VFIORegion *region, int index)
        region->mmaps[index].mmap = NULL;
    }

+static int vfio_region_create_dma_buf(VFIORegion *region)
+{
+    g_autofree struct vfio_device_feature *feature = NULL;
+    VFIODevice *vbasedev = region->vbasedev;
+    struct vfio_device_feature_dma_buf *dma_buf;
+    size_t total_size;
+    int i, ret;
+
+    g_assert(region->nr_mmaps);
+
+    total_size = sizeof(*feature) + sizeof(*dma_buf) +
+                 sizeof(struct vfio_region_dma_range) * region->nr_mmaps;
+    feature = g_malloc0(total_size);
+    *feature = (struct vfio_device_feature) {
+        .argsz = total_size,
+        .flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_DMA_BUF,
+    };
+
+    dma_buf = (void *)feature->data;
+    *dma_buf = (struct vfio_device_feature_dma_buf) {
+        .region_index = region->nr,
+        .open_flags = O_RDWR,
+        .nr_ranges = region->nr_mmaps,
+    };
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        dma_buf->dma_ranges[i].offset = region->mmaps[i].offset;
+        dma_buf->dma_ranges[i].length = region->mmaps[i].size;
+    }
+
+    ret = vbasedev->io_ops->device_feature(vbasedev, feature);

vbasedev->io_ops->device_feature may be NULL for other backends like
vfio-user.

Ah..Ok. I will add a check.

Could you please add a global routine:

    int vfio_device_get_feature(VFIODevice *vbasedev,
                                struct vfio_device_feature *feature)
Ok.

+    for (i = 0; i < region->nr_mmaps; i++) {
+        trace_vfio_region_dmabuf(region->vbasedev->name, ret, region->nr,
+                                 region->mem->name, region->mmaps[i].offset,
+                                 region->mmaps[i].size);
+    }
+    return ret;
+}
+
    int vfio_region_mmap(VFIORegion *region)
    {
        int i, ret, prot = 0;
        char *name;
        int fd;

-    if (!region->mem) {
+    if (!region->mem || !region->nr_mmaps) {

Just curious, when will the above check return true?
I think `!region->mem` covers cases where no MemoryRegion was created
(e.g. zero-sized regions), and `!region->nr_mmaps` covers regions that
lack mmap support (VFIO_REGION_INFO_FLAG_MMAP/ _CAP_SPARSE_MMAP).


            return 0;
        }

@@ -305,6 +345,21 @@ int vfio_region_mmap(VFIORegion *region)
                                   region->mmaps[i].size - 1);
        }

+    ret = vfio_region_create_dma_buf(region);
+    if (ret < 0) {
+        if (ret == -ENOTTY) {
+            warn_report_once("VFIO dmabuf not supported in kernel");
+        } else {
+            error_report("%s: failed to create dmabuf: %s",
+                         memory_region_name(region->mem),
+                         strerror(errno));

Shouldn't we return 'ret' in this case ?

That would result in:

Failed to mmap 0018:06:00.0 BAR 0. Performance may be slow

Not sure that error msg is correct in this context.

Agree. It would be a step backwards from the current situation.

If we don't return 'ret' here
vfio_container_dma_map() will eventually report the warning:

qemu-system-aarch64: warning: vfio_container_dma_map(0xaaaaff67ad40, 0x58000000000, 0xb90000, 0xffff64000000) = -14 (Bad address)
0018:06:00.0: PCI peer-to-peer transactions on BARs are not supported.
qemu-system-aarch64: warning: IOMMU_IOAS_MAP failed: Bad address, PCI BAR?

Yes that's what we have today.
I think the above is good enough for this. Please let me know.

Given that, with this change, QEMU will now always request a
DMABUF fd, the question is to what extent a failure should
be considered critical.

Would there be reasons to fail to realize the vfio-pci device
and stop the machine?

Thanks,

C.
