On Thu, May 30, 2024 at 12:21 AM Kasireddy, Vivek <vivek.kasire...@intel.com> wrote:
> Hi Gurchetan,
>
> > On Fri, May 24, 2024 at 11:33 AM Kasireddy, Vivek
> > <vivek.kasire...@intel.com> wrote:
> > > Hi,
> > >
> > > Sorry, my previous reply got messed up as a result of HTML
> > > formatting. This is a plain text version of the same reply.
> > >
> > > > > Having virtio-gpu import scanout buffers (via prime) from other
> > > > > devices means that we'd be adding a head to headless GPUs assigned
> > > > > to a Guest VM or additional heads to regular GPU devices that are
> > > > > passthrough'd to the Guest. In these cases, the Guest compositor
> > > > > can render into the scanout buffer using a primary GPU and has the
> > > > > secondary GPU (virtio-gpu) import it for display purposes.
> > > > >
> > > > > The main advantage with this is that the imported scanout buffer
> > > > > can either be displayed locally on the Host (e.g, using Qemu + GTK
> > > > > UI) or encoded and streamed to a remote client (e.g, Qemu + Spice
> > > > > UI). Note that since Qemu uses the udmabuf driver, no copies of the
> > > > > scanout buffer are made as it is displayed. This should be possible
> > > > > even when it might reside in device memory such as VRAM.
> > > > >
> > > > > The specific use-case that can be supported with this series is
> > > > > when running Weston or other guest compositors with the
> > > > > "additional-devices" feature
> > > > > (./weston --drm-device=card1 --additional-devices=card0).
> > > > > More info about this feature can be found at:
> > > > > https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/736
> > > > >
> > > > > In the above scenario, card1 could be a dGPU or an iGPU and card0
> > > > > would be virtio-gpu in KMS only mode. However, the case where this
> > > > > patch series could be particularly useful is when card1 is a GPU VF
> > > > > that needs to share its scanout buffer (in a zero-copy way) with
> > > > > the GPU PF on the Host. Or, it can also be useful when the scanout
> > > > > buffer needs to be shared between any two GPU devices (assuming one
> > > > > of them is assigned to a Guest VM) as long as they are P2P DMA
> > > > > compatible.
> > > >
> > > > Is passthrough iGPU-only or passthrough dGPU-only something you
> > > > intend to use?
> > > Our main use-case involves passing through a headless dGPU VF device
> > > and sharing the Guest compositor's scanout buffer with the dGPU PF
> > > device on the Host. The same goal applies to a headless iGPU VF and
> > > the iGPU PF device as well.
> >
> > Just to check my understanding: the same physical {i, d}GPU is
> > partitioned into the VF and PF, but the PF handles host-side display
> > integration and rendering?
> Yes, that is mostly right. In a nutshell, the same physical GPU is
> partitioned into one PF device and multiple VF devices. Only the PF
> device has access to the display hardware and can do KMS (on the Host).
> The VF devices are headless with no access to the display hardware (they
> cannot do KMS but can do render/encode/decode) and are generally assigned
> (or passthrough'd) to the Guest VMs. Some more details about this model
> can be found here:
> https://lore.kernel.org/dri-devel/20231110182231.1730-1-michal.wajdec...@intel.com/
>
> > > However, using a combination of iGPU and dGPU where either of them
> > > can be passthrough'd to the Guest is something I think can be
> > > supported with this patch series as well.
> > > > If it's a dGPU + iGPU setup, then the way other people seem to do
> > > > it is a "virtualized" iGPU (via virgl/gfxstream/take your pick) and
> > > > pass-through the dGPU.
> > > >
> > > > For example, AMD seems to use virgl to allocate and import into
> > > > the dGPU.
> > > >
> > > > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23896
> > > > https://lore.kernel.org/all/20231221100016.4022353-1-julia.zh...@amd.com/
> > > >
> > > > ChromeOS also uses that method (see crrev.com/c/3764931)
> > > > [cc: dGPU architect +Dominik Behr <db...@google.com>]
> > > >
> > > > So if iGPU + dGPU is the primary use case, you should be able to
> > > > use these methods as well. The model would be "virtualized iGPU" +
> > > > passthrough dGPU, not split SoCs.
> > > In our use-case, the goal is to have only one primary GPU
> > > (passthrough'd iGPU/dGPU) do all the rendering (using native DRI
> > > drivers) for clients/compositor and all the outputs, and share the
> > > scanout buffers with the secondary GPU (virtio-gpu). Since this is
> > > mostly how Mutter (and also Weston) work in a multi-GPU setup, I am
> > > not sure if virgl is needed.
> >
> > I think you can probably use virgl with the PF and others probably
> > will, but supporting multiple methods in Linux is not unheard of.
> In our case, we have an alternative SR-IOV based GPU
> virtualization/partitioning model (as described above) where a Guest VM
> will have access to a hardware-accelerated GPU VF device for its
> rendering/encode/decode needs. So, in this situation, using virgl would
> be redundant and unnecessary.
>
> And, in this model, we intend to use virtio-gpu for KMS in the Guest VM
> (since the GPU VF device cannot do KMS) with the addition of this
> patchset. However, note that, since not all GPU SKUs/versions have the
> SR-IOV capability, we plan on using virgl in those cases where it
> becomes necessary.
>
> > Does your patchset need the Mesa kmsro patchset to function correctly?
> >
> > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9592
> This patchset is an alternative proposal, so KMSRO would not be needed.
> AFAICS, the above MR is mainly stalled because KMSRO uses dumb buffers,
> which are not suitable for hardware-based rendering in all cases. And
> KMSRO is not really helpful performance-wise with dGPUs, as it forces
> most buffers to be allocated from system memory.

Previously, it was recommended when exploring VDMABUF: "the /important/
thing is that the driver which exports the dma-buf (and thus handles the
mappings) must be aware of the virtualization so it can properly
coordinate things with the host side."

https://patchwork.kernel.org/project/linux-media/patch/20210203073517.1908882-3-vivek.kasire...@intel.com/#23975915

So that's why the KMSRO approach was tried (virtio-gpu dumb allocations,
not i915). But as you point out, nobody uses dumb buffers for
hardware-based rendering. So, if you are going with i915 allocates +
virtio-gpu imports, it should be fine provided you have fixed all the
issues seen with i915 allocates + VDMABUF imports. It seems your fixes
add complexity in VFIO and other places, but having virtio-gpu 3d +
virgl allocate adds complexity to the Mesa-based allocation paths (i915,
amdgpu, etc. would all have to open the virtio-gpu render node, pick a
context type, and so on).
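To make that concrete, here is a rough sketch of the boilerplate each
driver would need before it could allocate through virtio-gpu 3d. It is
written against the existing virtgpu uapi (virgl capset picked purely as
an example, error handling trimmed, and the render node path obviously
varies per system):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <xf86drm.h>            /* libdrm */
#include <drm/virtgpu_drm.h>

/* Open the virtio-gpu render node and pick a context type (capset)
 * before any 3d allocation can happen.
 */
static int virtgpu_open_and_init_context(const char *node)
{
	struct drm_virtgpu_context_set_param param = {
		.param = VIRTGPU_CONTEXT_PARAM_CAPSET_ID,
		.value = VIRTGPU_DRM_CAPSET_VIRGL,  /* or VENUS, DRM, ... */
	};
	struct drm_virtgpu_context_init init = {
		.num_params = 1,
		.ctx_set_params = (uintptr_t)&param,
	};
	int fd = open(node, O_RDWR | O_CLOEXEC);

	if (fd < 0)
		return -1;
	if (drmIoctl(fd, DRM_IOCTL_VIRTGPU_CONTEXT_INIT, &init)) {
		close(fd);
		return -1;
	}
	return fd;  /* now usable for blob resource allocations */
}

e.g. virtgpu_open_and_init_context("/dev/dri/renderD129"), and only then
can the driver start allocating its scanout buffers from virtio-gpu.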
I would just do virtio-gpu allocates, since that only requires
user-space patches and no extra ioctls, but that reflects my
preferences. If the mm/VFIO/QEMU people are fine with your approach, I
see nothing wrong with merging it.

The one caveat is that if someone uses a non-GTK/EGL host path, we'll
have to pin memory for the lifetime of the import, since knowing that
RESOURCE_FLUSH is done is not sufficient. But if you're the only one
using it, it shouldn't be an issue right now.

> > If so, I would try to get that reviewed first to meet DRM
> > requirements
> > (https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements).
> > You might explicitly call out the design decision you're making:
> > ("We can probably use virgl as the virtualized iGPU via PF, but that
> > adds unnecessary complexity b/c ______").
> As I described above, what we have is an alternative GPU virtualization
> scheme where virgl is not necessary if the SR-IOV capability is
> available. And, as mentioned earlier, I have tested this series with
> Mutter/Gnome-shell (upstream master) (plus one small patch:
> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/3745)
> and no other changes to any other userspace components on Host and
> Guest.
>
> > > And, doing it this way means that no other userspace components
> > > need to be modified on both the Guest and the Host.
> > > > > As part of the import, the virtio-gpu driver shares the dma
> > > > > addresses and lengths with Qemu, which then determines whether
> > > > > the memory region they belong to is owned by a PCI device or
> > > > > whether it is part of the Guest's system ram. If it is the
> > > > > former, it identifies the devid (or bdf) and bar and provides
> > > > > this info (along with offsets and sizes) to the udmabuf driver.
> > > > > In the latter case, instead of the devid and bar it provides the
> > > > > memfd. The udmabuf driver then creates a dmabuf using this info
> > > > > that Qemu shares with Spice for encode via Gstreamer.
> > > > >
> > > > > Note that the virtio-gpu driver registers a move_notify()
> > > > > callback to track location changes associated with the scanout
> > > > > buffer and sends attach/detach backing cmds to Qemu when
> > > > > appropriate. And, synchronization (that is, ensuring that Guest
> > > > > and Host are not using the scanout buffer at the same time) is
> > > > > ensured by pinning/unpinning the dmabuf as part of plane update
> > > > > and using a fence in the resource_flush cmd.
> > > >
> > > > I'm not sure how QEMU's display paths work, but with crosvm if
> > > > you share the guest-created dmabuf with the display, and the
> > > > guest moves the backing pages, the only recourse is to destroy
> > > > the surface and show a black screen to the user: not the best
> > > > experience-wise.
> > > Since Qemu GTK UI uses EGL, there is a blit done from the guest's
> > > scanout buffer onto an EGL backed buffer on the Host. So, this
> > > problem would not happen as of now.
> >
> > The guest kernel doesn't know you're using the QEMU GTK UI + EGL
> > host-side.
> So, with blob=true, there is a dma fence in resource_flush() that gets
> associated with the Blit/Encode on the Host. This guest dma fence
> should eventually be signalled only when the Host is done using the
> guest's scanout buffer.
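That matches my reading of it. For anyone following along, the rough
shape of such a guest-side fence, using only the core dma_fence API
(names here are illustrative, the real code is the virtio-gpu fence in
the series), would be:

#include <linux/dma-fence.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(flush_fence_lock);

static const char *flush_fence_driver_name(struct dma_fence *f)
{
	return "virtio_gpu";
}

static const char *flush_fence_timeline_name(struct dma_fence *f)
{
	return "controlq";
}

static const struct dma_fence_ops flush_fence_ops = {
	.get_driver_name   = flush_fence_driver_name,
	.get_timeline_name = flush_fence_timeline_name,
};

/* Allocated and emitted together with the RESOURCE_FLUSH cmd... */
static struct dma_fence *flush_fence_create(u64 context, u64 seqno)
{
	struct dma_fence *f = kzalloc(sizeof(*f), GFP_KERNEL);

	if (f)
		dma_fence_init(f, &flush_fence_ops, &flush_fence_lock,
			       context, seqno);
	return f;
}

/* ...and dma_fence_signal() is called only once the host acks the
 * flush, i.e. after its Blit/Encode of the scanout buffer completed.
 * The next plane update that would hand the buffer back to the guest
 * compositor waits on this fence first.
 */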
> > If somebody wants to use the virtio-gpu import mechanism with
> > lower-level Wayland-based display integration, then the problem
> > would occur.
> Right, one way to address this issue is to prevent the Guest compositor
> from reusing the scanout buffer (until the Host is done) and forcing it
> to pick a new buffer (since Mesa GBM allows 4 backbuffers).
> I have tried this experiment with KMSRO and a Wayland-based Qemu UI
> previously on iGPUs (and Weston) and noticed that the Guest FPS was
> getting halved:
> https://lore.kernel.org/qemu-devel/20210913222036.3193732-1-vivek.kasire...@intel.com/
> and also discussed and proposed a solution which did not go anywhere:
> https://lore.kernel.org/dri-devel/20210913233529.3194401-1-vivek.kasire...@intel.com/
>
> > Perhaps, do that just to be safe unless you have performance
> > concerns.
> If you meant pinning the imported scanout buffer in the Guest, then
> yes, that is something I am already doing in this patchset.
>
> > > > Only amdgpu calls dma_buf_move_notify(..), and you're probably
> > > > testing on Intel only, so you may not be hitting that code path
> > > > anyways.
> > > I have tested with the Xe driver in the Guest, which also calls
> > > dma_buf_move_notify(). But note that for dGPUs, both Xe and amdgpu
> > > migrate the scanout buffer from vram to system memory as part of
> > > export, because virtio-gpu is not P2P compatible. However, I am
> > > hoping to relax this (the p2p check against virtio-gpu) in the Xe
> > > driver if it detects that it is running in VF mode, once the
> > > following patch series is merged:
> > > https://lore.kernel.org/dri-devel/20240422063602.3690124-1-vivek.kasire...@intel.com/
> > > > I forgot the exact reason, but apparently udmabuf may not work
> > > > with amdgpu displays and it seems the virtualized iGPU + dGPU is
> > > > the way to go for amdgpu anyways.
> > > I would really like to know why udmabuf would not work with amdgpu?
> >
> > It's just a rumor I heard, but the idea is udmabuf would be imported
> > into AMDGPU_GEM_DOMAIN_CPU only.
> >
> > https://cgit.freedesktop.org/drm/drm-misc/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c#n333
> >
> > "AMDGPU_GEM_DOMAIN_CPU: System memory that is not GPU accessible.
> > Memory in this pool could be swapped out to disk if there is
> > pressure."
> >
> > https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html
> >
> > Perhaps that limitation is artificial and unnecessary, and it may
> > indeed work. I don't think anybody has tried...
> Since the udmabuf driver properly pins the backing pages (from the
> memfd) for DMA, I don't see any reason why amdgpu would not be able to
> import.
>
> Thanks,
> Vivek
>
> > > > So I recommend just pinning the buffer for the lifetime of the
> > > > import for simplicity and correctness.
> > > Yeah, in this patch series, the dmabuf is indeed pinned, but only
> > > for a short duration in the Guest, just until the Host is done
> > > using it (blit or encode).
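In dma-buf terms, the pin/unpin + move_notify contract we are
discussing is the generic dynamic importer API; a minimal sketch (not a
copy of your patches, names are illustrative) looks like:

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>

/* move_notify() is only invoked while the buffer is unpinned, so
 * pinning around the plane update is what keeps the backing store
 * stable while the host reads it.
 */
static void scanout_move_notify(struct dma_buf_attachment *attach)
{
	/* Backing store moved: tell the host to drop its view of the
	 * old pages (e.g. send a DETACH_BACKING cmd) and re-map on the
	 * next plane update.
	 */
}

static const struct dma_buf_attach_ops scanout_attach_ops = {
	.allow_peer2peer = true,
	.move_notify = scanout_move_notify,
};

static struct dma_buf_attachment *
scanout_import(struct dma_buf *dmabuf, struct device *dev)
{
	return dma_buf_dynamic_attach(dmabuf, dev, &scanout_attach_ops,
				      NULL);
}

/* Pin only for the window in which the host uses the buffer: */
static int scanout_pin(struct dma_buf_attachment *attach)
{
	int ret;

	dma_resv_lock(attach->dmabuf->resv, NULL);
	ret = dma_buf_pin(attach); /* exporter may migrate the BO first */
	dma_resv_unlock(attach->dmabuf->resv);
	return ret;
}

static void scanout_unpin(struct dma_buf_attachment *attach)
{
	dma_resv_lock(attach->dmabuf->resv, NULL);
	dma_buf_unpin(attach); /* only after the host-side fence signals */
	dma_resv_unlock(attach->dmabuf->resv);
}

So "pinning for a short duration" vs. "pinning for the lifetime of the
import" is just a question of where scanout_pin()/scanout_unpin() style
calls are placed.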
> > > Thanks,
> > > Vivek
> > >
> > > > > This series is available at:
> > > > > https://gitlab.freedesktop.org/Vivek/drm-tip/-/commits/virtgpu_import_rfc
> > > > >
> > > > > along with additional patches for Qemu and Spice here:
> > > > > https://gitlab.freedesktop.org/Vivek/qemu/-/commits/virtgpu_dmabuf_pcidev
> > > > > https://gitlab.freedesktop.org/Vivek/spice/-/commits/encode_dmabuf_v4
> > > > >
> > > > > Patchset overview:
> > > > >
> > > > > Patch 1:   Implement VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
> > > > > Patch 2-3: Helpers to initialize, import, free imported object
> > > > > Patch 4-5: Import and use buffers from other devices for scanout
> > > > > Patch 6-7: Have udmabuf driver create dmabuf from PCI bars for
> > > > >            P2P DMA
> > > > >
> > > > > This series is tested using the following method:
> > > > > - Run Qemu with the following relevant options:
> > > > >   qemu-system-x86_64 -m 4096m ....
> > > > >   -device vfio-pci,host=0000:03:00.0
> > > > >   -device virtio-vga,max_outputs=1,blob=true,xres=1920,yres=1080
> > > > >   -spice port=3001,gl=on,disable-ticketing=on,preferred-codec=gstreamer:h264
> > > > >   -object memory-backend-memfd,id=mem1,size=4096M
> > > > >   -machine memory-backend=mem1 ...
> > > > > - Run upstream Weston with the following options in the Guest VM:
> > > > >   ./weston --drm-device=card1 --additional-devices=card0
> > > > >
> > > > >   where card1 is a DG2 dGPU (passthrough'd and using the xe
> > > > >   driver in the Guest VM), card0 is virtio-gpu and the Host is
> > > > >   using a RPL iGPU.
> > > > >
> > > > > Cc: Gerd Hoffmann <kra...@redhat.com>
> > > > > Cc: Dongwon Kim <dongwon....@intel.com>
> > > > > Cc: Daniel Vetter <daniel.vet...@ffwll.ch>
> > > > > Cc: Christian Koenig <christian.koe...@amd.com>
> > > > > Cc: Dmitry Osipenko <dmitry.osipe...@collabora.com>
> > > > > Cc: Rob Clark <robdcl...@chromium.org>
> > > > > Cc: Thomas Hellström <thomas.hellst...@linux.intel.com>
> > > > > Cc: Oded Gabbay <ogab...@kernel.org>
> > > > > Cc: Michal Wajdeczko <michal.wajdec...@intel.com>
> > > > > Cc: Michael Tretter <m.tret...@pengutronix.de>
> > > > >
> > > > > Vivek Kasireddy (7):
> > > > >   drm/virtio: Implement VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
> > > > >   drm/virtio: Add a helper to map and note the dma addrs and lengths
> > > > >   drm/virtio: Add helpers to initialize and free the imported object
> > > > >   drm/virtio: Import prime buffers from other devices as guest blobs
> > > > >   drm/virtio: Ensure that bo's backing store is valid while updating
> > > > >     plane
> > > > >   udmabuf/uapi: Add new ioctl to create a dmabuf from PCI bar
> > > > >     regions
> > > > >   udmabuf: Implement UDMABUF_CREATE_LIST_FOR_PCIDEV ioctl
> > > > >
> > > > >  drivers/dma-buf/udmabuf.c              | 122 ++++++++++++++++--
> > > > >  drivers/gpu/drm/virtio/virtgpu_drv.h   |   8 ++
> > > > >  drivers/gpu/drm/virtio/virtgpu_plane.c |  56 ++++++++-
> > > > >  drivers/gpu/drm/virtio/virtgpu_prime.c | 167 ++++++++++++++++++++++++-
> > > > >  drivers/gpu/drm/virtio/virtgpu_vq.c    |  15 +++
> > > > >  include/uapi/linux/udmabuf.h           |  11 +-
> > > > >  6 files changed, 368 insertions(+), 11 deletions(-)
> > > > >
> > > > > --
> > > > > 2.43.0
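P.S. For anyone skimming the thread: my mental model of the uapi added
by patches 6-7, with field names guessed from the cover letter (the
authoritative layout is the udmabuf.h change in the series, this is a
hypothetical sketch only), is roughly:

#include <linux/types.h>

/* Hypothetical sketch, not the actual header from the series: one
 * entry per scanout-buffer region that lives in a passthrough PCI
 * device's BAR.
 */
struct udmabuf_create_item_pcidev {
	__u32 bdf;	/* PCI bus/device/function of the exporting device */
	__u32 bar;	/* index of the BAR containing the region */
	__u64 offset;	/* offset of the region within that BAR */
	__u64 size;	/* length of the region in bytes */
};

i.e. QEMU sorts the guest-provided dma addresses into BAR-backed regions
(described per the above) and guest-RAM regions (which keep using the
existing memfd-based udmabuf_create_item), and udmabuf stitches them
into a single host dmabuf that Spice can hand to GStreamer without
copying the scanout buffer.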