On Thu, May 30, 2024 at 12:21 AM Kasireddy, Vivek <vivek.kasire...@intel.com> wrote:
> Hi Gurchetan,
>
> > On Fri, May 24, 2024 at 11:33 AM Kasireddy, Vivek
> > <vivek.kasire...@intel.com> wrote:
> > > Hi,
> > >
> > > Sorry, my previous reply got messed up as a result of HTML
> > > formatting. This is a plain text version of the same reply.
> > >
> > > > > Having virtio-gpu import scanout buffers (via prime) from other
> > > > > devices means that we'd be adding a head to headless GPUs assigned
> > > > > to a Guest VM or additional heads to regular GPU devices that are
> > > > > passthrough'd to the Guest. In these cases, the Guest compositor
> > > > > can render into the scanout buffer using a primary GPU and has the
> > > > > secondary GPU (virtio-gpu) import it for display purposes.
> > > > >
> > > > > The main advantage with this is that the imported scanout buffer
> > > > > can either be displayed locally on the Host (e.g, using Qemu + GTK
> > > > > UI) or encoded and streamed to a remote client (e.g, Qemu + Spice
> > > > > UI). Note that since Qemu uses the udmabuf driver, no copies of the
> > > > > scanout buffer are made as it is displayed. This should be possible
> > > > > even when it might reside in device memory such as VRAM.
> > > > >
> > > > > The specific use-case that can be supported with this series is
> > > > > when running Weston or other guest compositors with the
> > > > > "additional-devices" feature
> > > > > (./weston --drm-device=card1 --additional-devices=card0).
> > > > > More info about this feature can be found at:
> > > > > https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/736
> > > > >
> > > > > In the above scenario, card1 could be a dGPU or an iGPU and card0
> > > > > would be virtio-gpu in KMS only mode. However, the case where this
> > > > > patch series could be particularly useful is when card1 is a GPU VF
> > > > > that needs to share its scanout buffer (in a zero-copy way) with
> > > > > the GPU PF on the Host. Or, it can also be useful when the scanout
> > > > > buffer needs to be shared between any two GPU devices (assuming one
> > > > > of them is assigned to a Guest VM) as long as they are P2P DMA
> > > > > compatible.
> > > >
> > > > Is passthrough iGPU-only or passthrough dGPU-only something you
> > > > intend to use?
> > > Our main use-case involves passing through a headless dGPU VF device
> > > and sharing the Guest compositor's scanout buffer with the dGPU PF
> > > device on the Host. The same goal applies to a headless iGPU VF and
> > > the iGPU PF device as well.
> >
> > Just to check my understanding: the same physical {i, d}GPU is
> > partitioned into the VF and PF, but the PF handles host-side display
> > integration and rendering?
> Yes, that is mostly right. In a nutshell, the same physical GPU is
> partitioned into one PF device and multiple VF devices. Only the PF
> device has access to the display hardware and can do KMS (on the Host).
> The VF devices are headless with no access to the display hardware (they
> cannot do KMS but can do render/encode/decode) and are generally assigned
> (or passthrough'd) to the Guest VMs. Some more details about this model
> can be found here:
> https://lore.kernel.org/dri-devel/20231110182231.1730-1-michal.wajdec...@intel.com/
>
> > > However, using a combination of iGPU and dGPU where either of them
> > > can be passthrough'd to the Guest is something I think can be
> > > supported with this patch series as well.
> > > > If it's a dGPU + iGPU setup, then the way other people seem to do
> > > > it is a "virtualized" iGPU (via virgl/gfxstream/take your pick) and
> > > > pass-through the dGPU.
> > > >
> > > > For example, AMD seems to use virgl to allocate and import into
> > > > the dGPU.
> > > >
> > > > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23896
> > > > https://lore.kernel.org/all/20231221100016.4022353-1-julia.zh...@amd.com/
> > > >
> > > > ChromeOS also uses that method (see crrev.com/c/3764931)
> > > > [cc: dGPU architect +Dominik Behr <db...@google.com>]
> > > >
> > > > So if iGPU + dGPU is the primary use case, you should be able to
> > > > use these methods as well. The model would be "virtualized iGPU" +
> > > > passthrough dGPU, not split SoCs.
> > > In our use-case, the goal is to have only one primary GPU
> > > (passthrough'd iGPU/dGPU) do all the rendering (using native DRI
> > > drivers) for clients/compositor and all the outputs, and share the
> > > scanout buffers with the secondary GPU (virtio-gpu). Since this is
> > > mostly how Mutter (and also Weston) work in a multi-GPU setup, I am
> > > not sure if virgl is needed.
> >
> > I think you can probably use virgl with the PF and others probably
> > will, but supporting multiple methods in Linux is not unheard of.
> In our case, we have an alternative SR-IOV based GPU
> virtualization/partitioning model (as described above) where a Guest VM
> will have access to a hardware-accelerated GPU VF device for its
> rendering/encode/decode needs. So, in this situation, using virgl would
> be redundant and unnecessary.
>
> And, in this model, we intend to use virtio-gpu for KMS in the Guest VM
> (since the GPU VF device cannot do KMS) with the addition of this
> patchset. However, note that, since not all GPU SKUs/versions have the
> SR-IOV capability, we plan on using virgl in those cases where it
> becomes necessary.
>
> > Does your patchset need the Mesa kmsro patchset to function correctly?
> >
> > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9592
> This patchset is an alternative proposal, so KMSRO would not be needed.
> AFAICS, the above MR is mainly stalled because KMSRO uses dumb buffers,
> which are not suitable for hardware-based rendering in all cases. And
> KMSRO is not really helpful performance-wise with dGPUs, as it forces
> most buffers to be allocated from system memory.

Previously, it was recommended when exploring VDMABUF: "the /important/
thing is that the driver which exports the dma-buf (and thus handles the
mappings) must be aware of the virtualization so it can properly
coordinate things with the host side."

https://patchwork.kernel.org/project/linux-media/patch/20210203073517.1908882-3-vivek.kasire...@intel.com/#23975915

So that's why the KMSRO approach was tried (virtio-gpu dumb allocations,
not i915). But as you point out, nobody uses dumb buffers for
hardware-based rendering. So, if you are going with i915 allocates +
virtio-gpu imports, it should be fine provided you have fixed all the
issues seen with i915 allocates + VDMABUF imports. It seems your fixes
add complexity in VFIO and other places, but having virtio-gpu 3d +
virgl allocate adds complexity to the Mesa-based allocation paths (i915,
amdgpu, etc. would all have to open the virtio-gpu render node, pick a
context type, and so on).
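To make that concrete, here is a rough sketch of the boilerplate each
driver would need before it could allocate through virtio-gpu 3d. It is
written against the existing virtgpu uapi (virgl capset picked purely as
an example, error handling trimmed, and the render node path obviously
varies per system):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <xf86drm.h>            /* libdrm */
#include <drm/virtgpu_drm.h>

/* Open the virtio-gpu render node and pick a context type (capset)
 * before any 3d allocation can happen.
 */
static int virtgpu_open_and_init_context(const char *node)
{
	struct drm_virtgpu_context_set_param param = {
		.param = VIRTGPU_CONTEXT_PARAM_CAPSET_ID,
		.value = VIRTGPU_DRM_CAPSET_VIRGL,  /* or VENUS, DRM, ... */
	};
	struct drm_virtgpu_context_init init = {
		.num_params = 1,
		.ctx_set_params = (uintptr_t)&param,
	};
	int fd = open(node, O_RDWR | O_CLOEXEC);

	if (fd < 0)
		return -1;
	if (drmIoctl(fd, DRM_IOCTL_VIRTGPU_CONTEXT_INIT, &init)) {
		close(fd);
		return -1;
	}
	return fd;  /* now usable for blob resource allocations */
}

e.g. virtgpu_open_and_init_context("/dev/dri/renderD129"), and only then
can the driver start allocating its scanout buffers from virtio-gpu.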
I would just do virtio-gpu allocates, since that only requires
user-space patches and no extra ioctls, but that reflects my
preferences. If the mm/VFIO/QEMU people are fine with your approach, I
see nothing wrong with merging it.

The one caveat is that if someone uses a non-GTK/EGL host path, we'll
have to pin memory for the lifetime of the import, since knowing that
RESOURCE_FLUSH is done is not sufficient. But if you're the only one
using it, it shouldn't be an issue right now.

> > If so, I would try to get that reviewed first to meet DRM
> > requirements
> > (https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements).
> > You might explicitly call out the design decision you're making:
> > ("We can probably use virgl as the virtualized iGPU via PF, but that
> > adds unnecessary complexity b/c ______").
> As I described above, what we have is an alternative GPU virtualization
> scheme where virgl is not necessary if the SR-IOV capability is
> available. And, as mentioned earlier, I have tested this series with
> Mutter/Gnome-shell (upstream master) (plus one small patch:
> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/3745)
> and no other changes to any other userspace components on Host and
> Guest.
>
> > > And, doing it this way means that no other userspace components
> > > need to be modified on both the Guest and the Host.
> > > > > As part of the import, the virtio-gpu driver shares the dma
> > > > > addresses and lengths with Qemu, which then determines whether
> > > > > the memory region they belong to is owned by a PCI device or
> > > > > whether it is part of the Guest's system ram. If it is the
> > > > > former, it identifies the devid (or bdf) and bar and provides
> > > > > this info (along with offsets and sizes) to the udmabuf driver.
> > > > > In the latter case, instead of the devid and bar it provides the
> > > > > memfd. The udmabuf driver then creates a dmabuf using this info
> > > > > that Qemu shares with Spice for encode via Gstreamer.
> > > > >
> > > > > Note that the virtio-gpu driver registers a move_notify()
> > > > > callback to track location changes associated with the scanout
> > > > > buffer and sends attach/detach backing cmds to Qemu when
> > > > > appropriate. And, synchronization (that is, ensuring that Guest
> > > > > and Host are not using the scanout buffer at the same time) is
> > > > > ensured by pinning/unpinning the dmabuf as part of plane update
> > > > > and using a fence in the resource_flush cmd.
> > > >
> > > > I'm not sure how QEMU's display paths work, but with crosvm if
> > > > you share the guest-created dmabuf with the display, and the
> > > > guest moves the backing pages, the only recourse is to destroy
> > > > the surface and show a black screen to the user: not the best
> > > > experience-wise.
> > > Since Qemu GTK UI uses EGL, there is a blit done from the guest's
> > > scanout buffer onto an EGL backed buffer on the Host. So, this
> > > problem would not happen as of now.
> >
> > The guest kernel doesn't know you're using the QEMU GTK UI + EGL
> > host-side.
> So, with blob=true, there is a dma fence in resource_flush() that gets
> associated with the Blit/Encode on the Host. This guest dma fence
> should eventually be signalled only when the Host is done using the
> guest's scanout buffer.
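That matches my reading of it. For anyone following along, the rough
shape of such a guest-side fence, using only the core dma_fence API
(names here are illustrative, the real code is the virtio-gpu fence in
the series), would be:

#include <linux/dma-fence.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(flush_fence_lock);

static const char *flush_fence_driver_name(struct dma_fence *f)
{
	return "virtio_gpu";
}

static const char *flush_fence_timeline_name(struct dma_fence *f)
{
	return "controlq";
}

static const struct dma_fence_ops flush_fence_ops = {
	.get_driver_name   = flush_fence_driver_name,
	.get_timeline_name = flush_fence_timeline_name,
};

/* Allocated and emitted together with the RESOURCE_FLUSH cmd... */
static struct dma_fence *flush_fence_create(u64 context, u64 seqno)
{
	struct dma_fence *f = kzalloc(sizeof(*f), GFP_KERNEL);

	if (f)
		dma_fence_init(f, &flush_fence_ops, &flush_fence_lock,
			       context, seqno);
	return f;
}

/* ...and dma_fence_signal() is called only once the host acks the
 * flush, i.e. after its Blit/Encode of the scanout buffer completed.
 * The next plane update that would hand the buffer back to the guest
 * compositor waits on this fence first.
 */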
> > If somebody wants to use the virtio-gpu import mechanism with
> > lower-level Wayland-based display integration, then the problem
> > would occur.
> Right, one way to address this issue is to prevent the Guest compositor
> from reusing the scanout buffer (until the Host is done) and forcing it
> to pick a new buffer (since Mesa GBM allows 4 backbuffers).
> I have tried this experiment with KMSRO and a Wayland-based Qemu UI
> previously on iGPUs (and Weston) and noticed that the Guest FPS was
> getting halved:
> https://lore.kernel.org/qemu-devel/20210913222036.3193732-1-vivek.kasire...@intel.com/
> and also discussed and proposed a solution which did not go anywhere:
> https://lore.kernel.org/dri-devel/20210913233529.3194401-1-vivek.kasire...@intel.com/
>
> > Perhaps, do that just to be safe unless you have performance
> > concerns.
> If you meant pinning the imported scanout buffer in the Guest, then
> yes, that is something I am already doing in this patchset.
>
> > > > Only amdgpu calls dma_buf_move_notify(..), and you're probably
> > > > testing on Intel only, so you may not be hitting that code path
> > > > anyways.
> > > I have tested with the Xe driver in the Guest, which also calls
> > > dma_buf_move_notify(). But note that for dGPUs, both Xe and amdgpu
> > > migrate the scanout buffer from vram to system memory as part of
> > > export, because virtio-gpu is not P2P compatible. However, I am
> > > hoping to relax this (the p2p check against virtio-gpu) in the Xe
> > > driver if it detects that it is running in VF mode, once the
> > > following patch series is merged:
> > > https://lore.kernel.org/dri-devel/20240422063602.3690124-1-vivek.kasire...@intel.com/
> > > > I forgot the exact reason, but apparently udmabuf may not work
> > > > with amdgpu displays and it seems the virtualized iGPU + dGPU is
> > > > the way to go for amdgpu anyways.
> > > I would really like to know why udmabuf would not work with amdgpu?
> >
> > It's just a rumor I heard, but the idea is udmabuf would be imported
> > into AMDGPU_GEM_DOMAIN_CPU only.
> >
> > https://cgit.freedesktop.org/drm/drm-misc/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c#n333
> >
> > "AMDGPU_GEM_DOMAIN_CPU: System memory that is not GPU accessible.
> > Memory in this pool could be swapped out to disk if there is
> > pressure."
> >
> > https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html
> >
> > Perhaps that limitation is artificial and unnecessary, and it may
> > indeed work. I don't think anybody has tried...
> Since the udmabuf driver properly pins the backing pages (from the
> memfd) for DMA, I don't see any reason why amdgpu would not be able to
> import.
>
> Thanks,
> Vivek
>
> > > > So I recommend just pinning the buffer for the lifetime of the
> > > > import for simplicity and correctness.
> > > Yeah, in this patch series, the dmabuf is indeed pinned, but only
> > > for a short duration in the Guest, just until the Host is done
> > > using it (blit or encode).
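In dma-buf terms, the pin/unpin + move_notify contract we are
discussing is the generic dynamic importer API; a minimal sketch (not a
copy of your patches, names are illustrative) looks like:

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>

/* move_notify() is only invoked while the buffer is unpinned, so
 * pinning around the plane update is what keeps the backing store
 * stable while the host reads it.
 */
static void scanout_move_notify(struct dma_buf_attachment *attach)
{
	/* Backing store moved: tell the host to drop its view of the
	 * old pages (e.g. send a DETACH_BACKING cmd) and re-map on the
	 * next plane update.
	 */
}

static const struct dma_buf_attach_ops scanout_attach_ops = {
	.allow_peer2peer = true,
	.move_notify = scanout_move_notify,
};

static struct dma_buf_attachment *
scanout_import(struct dma_buf *dmabuf, struct device *dev)
{
	return dma_buf_dynamic_attach(dmabuf, dev, &scanout_attach_ops,
				      NULL);
}

/* Pin only for the window in which the host uses the buffer: */
static int scanout_pin(struct dma_buf_attachment *attach)
{
	int ret;

	dma_resv_lock(attach->dmabuf->resv, NULL);
	ret = dma_buf_pin(attach); /* exporter may migrate the BO first */
	dma_resv_unlock(attach->dmabuf->resv);
	return ret;
}

static void scanout_unpin(struct dma_buf_attachment *attach)
{
	dma_resv_lock(attach->dmabuf->resv, NULL);
	dma_buf_unpin(attach); /* only after the host-side fence signals */
	dma_resv_unlock(attach->dmabuf->resv);
}

So "pinning for a short duration" vs. "pinning for the lifetime of the
import" is just a question of where scanout_pin()/scanout_unpin() style
calls are placed.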
> > > Thanks,
> > > Vivek
> > >
> > > > > This series is available at:
> > > > > https://gitlab.freedesktop.org/Vivek/drm-tip/-/commits/virtgpu_import_rfc
> > > > >
> > > > > along with additional patches for Qemu and Spice here:
> > > > > https://gitlab.freedesktop.org/Vivek/qemu/-/commits/virtgpu_dmabuf_pcidev
> > > > > https://gitlab.freedesktop.org/Vivek/spice/-/commits/encode_dmabuf_v4
> > > > >
> > > > > Patchset overview:
> > > > >
> > > > > Patch 1:   Implement VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
> > > > > Patch 2-3: Helpers to initialize, import, free imported object
> > > > > Patch 4-5: Import and use buffers from other devices for scanout
> > > > > Patch 6-7: Have udmabuf driver create dmabuf from PCI bars for
> > > > >            P2P DMA
> > > > >
> > > > > This series is tested using the following method:
> > > > > - Run Qemu with the following relevant options:
> > > > >   qemu-system-x86_64 -m 4096m ....
> > > > >   -device vfio-pci,host=0000:03:00.0
> > > > >   -device virtio-vga,max_outputs=1,blob=true,xres=1920,yres=1080
> > > > >   -spice port=3001,gl=on,disable-ticketing=on,preferred-codec=gstreamer:h264
> > > > >   -object memory-backend-memfd,id=mem1,size=4096M
> > > > >   -machine memory-backend=mem1 ...
> > > > > - Run upstream Weston with the following options in the Guest VM:
> > > > >   ./weston --drm-device=card1 --additional-devices=card0
> > > > >
> > > > >   where card1 is a DG2 dGPU (passthrough'd and using the xe
> > > > >   driver in the Guest VM), card0 is virtio-gpu and the Host is
> > > > >   using a RPL iGPU.
> > > > >
> > > > > Cc: Gerd Hoffmann <kra...@redhat.com>
> > > > > Cc: Dongwon Kim <dongwon....@intel.com>
> > > > > Cc: Daniel Vetter <daniel.vet...@ffwll.ch>
> > > > > Cc: Christian Koenig <christian.koe...@amd.com>
> > > > > Cc: Dmitry Osipenko <dmitry.osipe...@collabora.com>
> > > > > Cc: Rob Clark <robdcl...@chromium.org>
> > > > > Cc: Thomas Hellström <thomas.hellst...@linux.intel.com>
> > > > > Cc: Oded Gabbay <ogab...@kernel.org>
> > > > > Cc: Michal Wajdeczko <michal.wajdec...@intel.com>
> > > > > Cc: Michael Tretter <m.tret...@pengutronix.de>
> > > > >
> > > > > Vivek Kasireddy (7):
> > > > >   drm/virtio: Implement VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
> > > > >   drm/virtio: Add a helper to map and note the dma addrs and lengths
> > > > >   drm/virtio: Add helpers to initialize and free the imported object
> > > > >   drm/virtio: Import prime buffers from other devices as guest blobs
> > > > >   drm/virtio: Ensure that bo's backing store is valid while updating
> > > > >     plane
> > > > >   udmabuf/uapi: Add new ioctl to create a dmabuf from PCI bar
> > > > >     regions
> > > > >   udmabuf: Implement UDMABUF_CREATE_LIST_FOR_PCIDEV ioctl
> > > > >
> > > > >  drivers/dma-buf/udmabuf.c              | 122 ++++++++++++++++--
> > > > >  drivers/gpu/drm/virtio/virtgpu_drv.h   |   8 ++
> > > > >  drivers/gpu/drm/virtio/virtgpu_plane.c |  56 ++++++++-
> > > > >  drivers/gpu/drm/virtio/virtgpu_prime.c | 167 ++++++++++++++++++++++++-
> > > > >  drivers/gpu/drm/virtio/virtgpu_vq.c    |  15 +++
> > > > >  include/uapi/linux/udmabuf.h           |  11 +-
> > > > >  6 files changed, 368 insertions(+), 11 deletions(-)
> > > > >
> > > > > --
> > > > > 2.43.0
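P.S. For anyone skimming the thread: my mental model of the uapi added
by patches 6-7, with field names guessed from the cover letter (the
authoritative layout is the udmabuf.h change in the series, this is a
hypothetical sketch only), is roughly:

#include <linux/types.h>

/* Hypothetical sketch, not the actual header from the series: one
 * entry per scanout-buffer region that lives in a passthrough PCI
 * device's BAR.
 */
struct udmabuf_create_item_pcidev {
	__u32 bdf;	/* PCI bus/device/function of the exporting device */
	__u32 bar;	/* index of the BAR containing the region */
	__u64 offset;	/* offset of the region within that BAR */
	__u64 size;	/* length of the region in bytes */
};

i.e. QEMU sorts the guest-provided dma addresses into BAR-backed regions
(described per the above) and guest-RAM regions (which keep using the
existing memfd-based udmabuf_create_item), and udmabuf stitches them
into a single host dmabuf that Spice can hand to GStreamer without
copying the scanout buffer.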