Hi Leon,

> Subject: Re: [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO
> regions
>
> > > From: Leon Romanovsky <leo...@nvidia.com>
> > >
> > > Add support for exporting PCI device MMIO regions through dma-buf,
> > > enabling safe sharing of non-struct page memory with controlled
> > > lifetime management. This allows RDMA and other subsystems to import
> > > dma-buf FDs and build them into memory regions for PCI P2P operations.
> > >
> > > The implementation provides a revocable attachment mechanism using
> > > dma-buf move operations. MMIO regions are normally pinned as BARs
> > > don't change physical addresses, but access is revoked when the VFIO
> > > device is closed or a PCI reset is issued. This ensures kernel
> > > self-defense against potentially hostile userspace.
> > >
> > > Signed-off-by: Jason Gunthorpe <j...@nvidia.com>
> > > Signed-off-by: Vivek Kasireddy <vivek.kasire...@intel.com>
> > > Signed-off-by: Leon Romanovsky <leo...@nvidia.com>
> > > ---
> > >  drivers/vfio/pci/Kconfig           |  20 ++
> > >  drivers/vfio/pci/Makefile          |   2 +
> > >  drivers/vfio/pci/vfio_pci_config.c |  22 +-
> > >  drivers/vfio/pci/vfio_pci_core.c   |  25 ++-
> > >  drivers/vfio/pci/vfio_pci_dmabuf.c | 321 +++++++++++++++++++++++++++++
> > >  drivers/vfio/pci/vfio_pci_priv.h   |  23 +++
> > >  include/linux/dma-buf.h            |   1 +
> > >  include/linux/vfio_pci_core.h      |   3 +
> > >  include/uapi/linux/vfio.h          |  19 ++
> > >  9 files changed, 431 insertions(+), 5 deletions(-)
> > >  create mode 100644 drivers/vfio/pci/vfio_pci_dmabuf.c
>
> <...>
>
> > > +static int validate_dmabuf_input(struct vfio_pci_core_device *vdev,
> > > +                                 struct vfio_device_feature_dma_buf *dma_buf)
> > > +{
> > > +        struct pci_dev *pdev = vdev->pdev;
> > > +        u32 bar = dma_buf->region_index;
> > > +        u64 offset = dma_buf->offset;
> > > +        u64 len = dma_buf->length;
> > > +        resource_size_t bar_size;
> > > +        u64 sum;
> > > +
> > > +        /*
> > > +         * For PCI the region_index is the BAR number like everything else.
> > > +         */
> > > +        if (bar >= VFIO_PCI_ROM_REGION_INDEX)
> > > +                return -ENODEV;
>
> <...>
>
> > > +/**
> > > + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
> > > + * regions selected.
> > > + *
> > > + * open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC,
> > > + * etc. offset/length specify a slice of the region to create the dmabuf from.
> > > + * nr_ranges is the total number of (P2P DMA) ranges that comprise the dmabuf.
>
> > Any particular reason why you dropped the option (nr_ranges) of creating a
> > single dmabuf from multiple ranges of an MMIO region?
>
> I did it for two reasons. First, I wanted to simplify the code in order
> to speed-up discussion over the patchset itself. Second, I failed to
> find justification for need of multiple ranges, as the number of BARs
> are limited by VFIO_PCI_ROM_REGION_INDEX (6) and same functionality
> can be achieved by multiple calls to DMABUF import.

I don't think the same functionality can be achieved by multiple calls to
dmabuf import. AFAIU, a dmabuf (as of today) is backed by an SGL that can
have multiple entries because it represents a scattered buffer (multiple
non-contiguous entries in System RAM or an MMIO region). But in this patch
you are constraining it such that only one entry associated with a buffer
would be included, which effectively means that we cannot create a dmabuf
to represent scattered buffers (located in a single MMIO region such as
VRAM or other device memory) anymore.
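To illustrate what I mean, here is a rough sketch of the usual exporter
pattern: one sg_table with one scatterlist entry per non-contiguous MMIO
range, so a single dmabuf can describe a scattered buffer. This is not
code from this series; "struct mmio_range" and the helper name are made
up, and error handling plus the real P2P mapping details are trimmed.

#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>
#include <linux/slab.h>
#include <linux/err.h>

struct mmio_range {
        phys_addr_t addr;       /* physical address inside the BAR */
        size_t len;
};

static struct sg_table *map_mmio_ranges(struct device *dev,
                                        const struct mmio_range *ranges,
                                        unsigned int nr_ranges)
{
        struct sg_table *sgt;
        struct scatterlist *sg;
        unsigned int i;

        sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
        if (!sgt)
                return ERR_PTR(-ENOMEM);

        /* One scatterlist entry per non-contiguous MMIO range. */
        if (sg_alloc_table(sgt, nr_ranges, GFP_KERNEL)) {
                kfree(sgt);
                return ERR_PTR(-ENOMEM);
        }

        for_each_sgtable_sg(sgt, sg, i) {
                sg_dma_address(sg) = dma_map_resource(dev, ranges[i].addr,
                                                      ranges[i].len,
                                                      DMA_BIDIRECTIONAL, 0);
                sg_dma_len(sg) = ranges[i].len;
        }
        return sgt;
}

If the UAPI only ever allows one range per dmabuf, an importer instead has
to juggle several fds for what is logically one buffer.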
> >
> > Restricting the dmabuf to a single range (or having to create multiple
> > dmabufs to represent multiple regions/ranges associated with a single
> > scattered buffer) would be very limiting and may not work in all cases.
> > For instance, in my use-case, I am trying to share a large (4k mode)
> > framebuffer (FB) located in GPU's VRAM between two (p2p compatible) GPU
> > devices. And, this would probably not work given that allocating a large
> > contiguous FB (nr_ranges = 1) in VRAM may not be feasible when there is
> > memory pressure.
>
> Can you please help me and point to the place in the code where this can
> fail? I'm probably missing something basic as there are no large
> allocations in the current patchset.

Sorry, I was not very clear. What I meant is that it is not prudent to
assume that there will only be one range associated with an MMIO region
that we need to consider while creating a dmabuf. And I was pointing out
my use-case as an example where vfio-pci needs to create a dmabuf for a
large buffer (FB) that would likely be scattered (and not contiguous) in
an MMIO region (such as VRAM).

Let me further explain with my use-case. Here is a link to my Qemu-based
test:
https://gitlab.freedesktop.org/Vivek/qemu/-/commit/b2bdb16d9cfaf55384c95b1f060f175ad1c89e95#81dc845f0babf39649c4e086e173375614111b4a_29_46

While exhaustively testing this case, I noticed that the Guest VM (GPU
driver) would occasionally create the buffer (represented by
virtio_gpu_simple_resource, for which we need to create a dmabuf) in such
a way that there are multiple entries (indicated by res->iov_cnt) that
need to be included. This is the main reason why I added support for
nr_ranges > 1 to this patch/feature.

Furthermore, creating multiple dmabufs to represent each range of the same
buffer, as you suggest (IIUC), is suboptimal and does not align with how
dmabuf works currently.

> >
> > Furthermore, since you are adding a new UAPI with this patch/feature, as
> > you know, we cannot go back and tweak it (to add support for nr_ranges > 1)
> > should there be a need in the future, but you can always use nr_ranges = 1
> > anytime. Therefore, I think it makes sense to be flexible in terms of the
> > number of ranges to include while creating a dmabuf instead of restricting
> > ourselves to one range.
>
> I'm not a big fan of over-engineering. Let's first understand if this
> case is needed.

As explained above with my use-case, having support for nr_ranges > 1 is
not just nice to have but absolutely necessary. Otherwise, this feature
would be constrained to creating dmabufs for contiguous buffers
(nr_ranges = 1) only, which would limit its effectiveness, as most GPU
buffers are rarely contiguous. A rough sketch of what a multi-range layout
could look like is appended at the end of this mail.

Thanks,
Vivek

>
> Thanks
>
> >
> > Thanks,
> > Vivek
> >
> > > + *
> > > + * Return: The fd number on success, -1 and errno is set on failure.
> > > + */
> > > +#define VFIO_DEVICE_FEATURE_DMA_BUF 11
> > > +
> > > +struct vfio_device_feature_dma_buf {
> > > +        __u32 region_index;
> > > +        __u32 open_flags;
> > > +        __u64 offset;
> > > +        __u64 length;
> > > +};
> > > +
> > >  /* -------- API for Type1 VFIO IOMMU -------- */
> > >
> > >  /**
> > > --
> > > 2.50.1
> >
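For completeness, here is the rough sketch mentioned above of what a
multi-range variant of the feature struct could look like. This is
illustrative only; the field names are not taken from this patch, and the
extra flags word is just there to keep the flexible array 8-byte aligned.

struct vfio_region_dma_range {
        __u64 offset;
        __u64 length;
};

struct vfio_device_feature_dma_buf {
        __u32 region_index;
        __u32 open_flags;
        __u32 flags;
        __u32 nr_ranges;
        struct vfio_region_dma_range dma_ranges[];
};

Userspace that only needs one contiguous slice would simply pass
nr_ranges = 1, so the simple case stays as cheap as it is today.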