RE: [RFC 0/7] drm/virtio: Import scanout buffers from other devices

2024-06-18 Thread Kasireddy, Vivek
Hi Gurchetan,

> 
> On Thu, May 30, 2024 at 12:21 AM Kasireddy, Vivek
> <vivek.kasire...@intel.com> wrote:
>
>   Hi Gurchetan,
>
>   > On Fri, May 24, 2024 at 11:33 AM Kasireddy, Vivek
>   > <vivek.kasire...@intel.com> wrote:
>   >
>   >   Hi,
>   >
>   >   Sorry, my previous reply got messed up as a result of HTML
>   >   formatting. This is a plain text version of the same reply.
>   >
>   >   >
>   >   >
>   >   >   Having virtio-gpu import scanout buffers (via prime) from other
>   >   >   devices means that we'd be adding a head to headless GPUs assigned
>   >   >   to a Guest VM or additional heads to regular GPU devices that are
>   >   >   passthrough'd to the Guest. In these cases, the Guest compositor
>   >   >   can render into the scanout buffer using a primary GPU and has the
>   >   >   secondary GPU (virtio-gpu) import it for display purposes.
>   >   >
>   >   >   The main advantage with this is that the imported scanout buffer can
>   >   >   either be displayed locally on the Host (e.g, using Qemu + GTK UI)
>   >   >   or encoded and streamed to a remote client (e.g, Qemu + Spice UI).
>   >   >   Note that since Qemu uses udmabuf driver, there would be no copies
>   >   >   made of the scanout buffer as it is displayed. This should be
>   >   >   possible even when it might reside in device memory such as VRAM.
>   >   >
>   >   >   The specific use-case that can be supported with this series is when
>   >   >   running Weston or other guest compositors with "additional-devices"
>   >   >   feature (./weston --drm-device=card1 --additional-devices=card0).
>   >   >   More info about this feature can be found at:
>   >   >   https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/736
>   >   >
>   >   >   In the above scenario, card1 could be a dGPU or an iGPU and card0
>   >   >   would be virtio-gpu in KMS only mode. However, the case where this
>   >   >   patch series could be particularly useful is when card1 is a GPU VF
>   >   >   that needs to share its scanout buffer (in a zero-copy way) with the
>   >   >   GPU PF on the Host. Or, it can also be useful when the scanout buffer
>   >   >   needs to be shared between any two GPU devices (assuming one of them
>   >   >   is assigned to a Guest VM) as long as they are P2P DMA compatible.
>   >   >
>   >   >
>   >   >
>   >   > Is passthrough iGPU-only or passthrough dGPU-only something you
>   >   > intend to use?
>   >   Our main use-case involves passthrough’g a headless dGPU VF device
>   >   and sharing the Guest compositor’s scanout buffer with dGPU PF device
>   >   on the Host. Same goal for headless iGPU VF to iGPU PF device as well.
>   >
>   > Just to check my understanding: the same physical {i, d}GPU is
>   > partitioned into the VF and PF, but the PF handles host-side display
>   > integration and rendering?
>   Yes, that is mostly right. In a nutshell, the same physical GPU is
>   partitioned into one PF device and multiple VF devices. Only the PF
>   device has access to the display hardware and can do KMS (on the Host).
>   The VF devices are headless with no access to display hardware (cannot
>   do KMS but can do render/encode/decode) and are generally assigned (or
>   passthrough'd) to the Guest VMs.
>   Some more details about this model can be found here:
>   https://lore.kernel.org/dri-devel/20231110182231.1730-1-michal.wajdec...@intel.com/
>
>   >   However, using a combination of iGPU and dGPU where either of them
>   >   can be passthrough’d to the Guest is something I think can be
>   >   supported with this patch series as well.
>   >
>   >   > If it's a dGPU + iGPU setup, then the way other people seem to do
>   >   > it is a "virtualized" iGPU (via virgl/gfxstream/take your pick) and
>   >   > pass-through the dGPU.
>   >   >
>   >   > For example, AMD seems to use virgl to allocate and import into
>   >   > the dGPU.
>   >   >
>   >   > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23896
>   >   >
>   >   > https://lore.kernel.org/all/20231221100016.4022353-1-julia.zh...@amd.com/
> 

Re: [RFC 0/7] drm/virtio: Import scanout buffers from other devices

2024-06-14 Thread Gurchetan Singh
On Thu, May 30, 2024 at 12:21 AM Kasireddy, Vivek 
wrote:

> Hi Gurchetan,
>
> >
> > On Fri, May 24, 2024 at 11:33 AM Kasireddy, Vivek
> > <vivek.kasire...@intel.com> wrote:
> >
> >
> >   Hi,
> >
> >   Sorry, my previous reply got messed up as a result of HTML
> >   formatting. This is a plain text version of the same reply.
> >
> >   >
> >   >
> >   >   Having virtio-gpu import scanout buffers (via prime) from other
> >   >   devices means that we'd be adding a head to headless GPUs assigned
> >   >   to a Guest VM or additional heads to regular GPU devices that are
> >   >   passthrough'd to the Guest. In these cases, the Guest compositor
> >   >   can render into the scanout buffer using a primary GPU and has the
> >   >   secondary GPU (virtio-gpu) import it for display purposes.
> >   >
> >   >   The main advantage with this is that the imported scanout buffer can
> >   >   either be displayed locally on the Host (e.g, using Qemu + GTK UI)
> >   >   or encoded and streamed to a remote client (e.g, Qemu + Spice UI).
> >   >   Note that since Qemu uses udmabuf driver, there would be no copies
> >   >   made of the scanout buffer as it is displayed. This should be
> >   >   possible even when it might reside in device memory such as VRAM.
> >   >
> >   >   The specific use-case that can be supported with this series is when
> >   >   running Weston or other guest compositors with "additional-devices"
> >   >   feature (./weston --drm-device=card1 --additional-devices=card0).
> >   >   More info about this feature can be found at:
> >   >   https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/736
> >   >
> >   >   In the above scenario, card1 could be a dGPU or an iGPU and card0
> >   >   would be virtio-gpu in KMS only mode. However, the case where this
> >   >   patch series could be particularly useful is when card1 is a GPU VF
> >   >   that needs to share its scanout buffer (in a zero-copy way) with the
> >   >   GPU PF on the Host. Or, it can also be useful when the scanout buffer
> >   >   needs to be shared between any two GPU devices (assuming one of them
> >   >   is assigned to a Guest VM) as long as they are P2P DMA compatible.
> >   >
> >   >
> >   >
> >   > Is passthrough iGPU-only or passthrough dGPU-only something you
> >   > intend to use?
> >   Our main use-case involves passthrough’g a headless dGPU VF device
> >   and sharing the Guest compositor’s scanout buffer with dGPU PF device
> >   on the Host. Same goal for headless iGPU VF to iGPU PF device as well.
> >
> > Just to check my understanding: the same physical {i, d}GPU is
> > partitioned into the VF and PF, but the PF handles host-side display
> > integration and rendering?
> Yes, that is mostly right. In a nutshell, the same physical GPU is
> partitioned into one PF device and multiple VF devices. Only the PF device
> has access to the display hardware and can do KMS (on the Host). The VF
> devices are headless with no access to display hardware (cannot do KMS but
> can do render/encode/decode) and are generally assigned (or passthrough'd)
> to the Guest VMs.
> Some more details about this model can be found here:
>
> https://lore.kernel.org/dri-devel/20231110182231.1730-1-michal.wajdec...@intel.com/
>
> >   However, using a combination of iGPU and dGPU where either of them
> >   can be passthrough’d to the Guest is something I think can be
> >   supported with this patch series as well.
> >
> >   > If it's a dGPU + iGPU setup, then the way other people seem to do
> >   > it is a "virtualized" iGPU (via virgl/gfxstream/take your pick) and
> >   > pass-through the dGPU.
> >   >
> >   > For example, AMD seems to use virgl to allocate and import into
> >   > the dGPU.
> >   >
> >   > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23896
> >   >
> >   > https://lore.kernel.org/all/20231221100016.4022353-1-julia.zh...@amd.com/
> >   >
> >   >
> >   > ChromeOS also uses that method (see crrev.com/c/3764931)
> >   > [cc: dGPU architect +Dominik Behr]
> >   >
> >   > So if iGPU + dGPU is the primary use case, you should be able to
> >   > use these methods as well.  The model would be "virtualized iGPU" +
> >   > passthrough dGPU, not split SoCs.
> >   In our use-case, the goal is to have only one 

RE: [RFC 0/7] drm/virtio: Import scanout buffers from other devices

2024-05-30 Thread Kasireddy, Vivek
Hi Gurchetan,

> 
> On Fri, May 24, 2024 at 11:33 AM Kasireddy, Vivek
> <vivek.kasire...@intel.com> wrote:
>
>
>   Hi,
>
>   Sorry, my previous reply got messed up as a result of HTML
>   formatting. This is a plain text version of the same reply.
> 
>   >
>   >
>   >   Having virtio-gpu import scanout buffers (via prime) from other
>   >   devices means that we'd be adding a head to headless GPUs assigned
>   >   to a Guest VM or additional heads to regular GPU devices that are
>   >   passthrough'd to the Guest. In these cases, the Guest compositor
>   >   can render into the scanout buffer using a primary GPU and has the
>   >   secondary GPU (virtio-gpu) import it for display purposes.
>   >
>   >   The main advantage with this is that the imported scanout buffer can
>   >   either be displayed locally on the Host (e.g, using Qemu + GTK UI)
>   >   or encoded and streamed to a remote client (e.g, Qemu + Spice UI).
>   >   Note that since Qemu uses udmabuf driver, there would be no copies
>   >   made of the scanout buffer as it is displayed. This should be
>   >   possible even when it might reside in device memory such as VRAM.
>   >
>   >   The specific use-case that can be supported with this series is when
>   >   running Weston or other guest compositors with "additional-devices"
>   >   feature (./weston --drm-device=card1 --additional-devices=card0).
>   >   More info about this feature can be found at:
>   >   https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/736
>   >
>   >   In the above scenario, card1 could be a dGPU or an iGPU and card0
>   >   would be virtio-gpu in KMS only mode. However, the case where this
>   >   patch series could be particularly useful is when card1 is a GPU VF
>   >   that needs to share its scanout buffer (in a zero-copy way) with the
>   >   GPU PF on the Host. Or, it can also be useful when the scanout buffer
>   >   needs to be shared between any two GPU devices (assuming one of them
>   >   is assigned to a Guest VM) as long as they are P2P DMA compatible.
>   >
>   >
>   >
>   > Is passthrough iGPU-only or passthrough dGPU-only something you
>   > intend to use?
>   Our main use-case involves passthrough’g a headless dGPU VF device
>   and sharing the Guest compositor’s scanout buffer with dGPU PF device
>   on the Host. Same goal for headless iGPU VF to iGPU PF device as well.
> 
> 
> 
> Just to check my understanding: the same physical {i, d}GPU is partitioned
> into the VF and PF, but the PF handles host-side display integration and
> rendering?
Yes, that is mostly right. In a nutshell, the same physical GPU is partitioned
into one PF device and multiple VF devices. Only the PF device has access to
the display hardware and can do KMS (on the Host). The VF devices are
headless with no access to display hardware (cannot do KMS but can do render/
encode/decode) and are generally assigned (or passthrough'd) to the Guest VMs.
Some more details about this model can be found here:
https://lore.kernel.org/dri-devel/20231110182231.1730-1-michal.wajdec...@intel.com/

> 
> 
>   However, using a combination of iGPU and dGPU where either of them
>   can be passthrough’d to the Guest is something I think can be
>   supported with this patch series as well.
>
>   > If it's a dGPU + iGPU setup, then the way other people seem to do it
>   > is a "virtualized" iGPU (via virgl/gfxstream/take your pick) and
>   > pass-through the dGPU.
>   >
>   > For example, AMD seems to use virgl to allocate and import into
>   > the dGPU.
>   >
>   > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23896
>   >
>   > https://lore.kernel.org/all/20231221100016.4022353-1-julia.zh...@amd.com/
>   >
>   >
>   > ChromeOS also uses that method (see crrev.com/c/3764931)
>   > [cc: dGPU architect +Dominik Behr]
>   >
>   > So if iGPU + dGPU is the primary use case, you should be able to
>   > use these methods as well.  The model would be "virtualized iGPU" +
>   > passthrough dGPU, not split SoCs.
>   In our use-case, the goal is to have only one primary GPU
>   (passthrough’d iGPU/dGPU) do all the rendering (using native DRI
>   drivers) for clients/compositor and all the outputs and share the
>   scanout buffers with the secondary GPU (virtio-gpu). Since this is
>   mostly how Mutter (and also Weston) work in a multi-GPU setup, I am
>   not sure if virgl is

Re: [RFC 0/7] drm/virtio: Import scanout buffers from other devices

2024-05-29 Thread Gurchetan Singh
On Fri, May 24, 2024 at 11:33 AM Kasireddy, Vivek 
wrote:

> Hi,
>
> Sorry, my previous reply got messed up as a result of HTML formatting.
> This is a plain text version of the same reply.
>
> >
> >
> >   Having virtio-gpu import scanout buffers (via prime) from other
> >   devices means that we'd be adding a head to headless GPUs assigned
> >   to a Guest VM or additional heads to regular GPU devices that are
> >   passthrough'd to the Guest. In these cases, the Guest compositor
> >   can render into the scanout buffer using a primary GPU and has the
> >   secondary GPU (virtio-gpu) import it for display purposes.
> >
> >   The main advantage with this is that the imported scanout buffer can
> >   either be displayed locally on the Host (e.g, using Qemu + GTK UI)
> >   or encoded and streamed to a remote client (e.g, Qemu + Spice UI).
> >   Note that since Qemu uses udmabuf driver, there would be no copies
> >   made of the scanout buffer as it is displayed. This should be
> >   possible even when it might reside in device memory such as VRAM.
> >
> >   The specific use-case that can be supported with this series is when
> >   running Weston or other guest compositors with "additional-devices"
> >   feature (./weston --drm-device=card1 --additional-devices=card0).
> >   More info about this feature can be found at:
> >   https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/736
> >
> >   In the above scenario, card1 could be a dGPU or an iGPU and card0
> >   would be virtio-gpu in KMS only mode. However, the case where this
> >   patch series could be particularly useful is when card1 is a GPU VF
> >   that needs to share its scanout buffer (in a zero-copy way) with the
> >   GPU PF on the Host. Or, it can also be useful when the scanout buffer
> >   needs to be shared between any two GPU devices (assuming one of them
> >   is assigned to a Guest VM) as long as they are P2P DMA compatible.
> >
> >
> >
> > Is passthrough iGPU-only or passthrough dGPU-only something you intend to
> > use?
> Our main use-case involves passthrough’g a headless dGPU VF device and
> sharing the Guest compositor’s scanout buffer with dGPU PF device on the
> Host. Same goal for headless iGPU VF to iGPU PF device as well.
>

Just to check my understanding: the same physical {i, d}GPU is partitioned
into the VF and PF, but the PF handles host-side display integration and
rendering?


> However, using a combination of iGPU and dGPU where either of them can be
> passthrough’d to the Guest is something I think can be supported with this
> patch series as well.
>
> >
> > If it's a dGPU + iGPU setup, then the way other people seem to do it is a
> > "virtualized" iGPU (via virgl/gfxstream/take your pick) and pass-through
> > the dGPU.
> >
> > For example, AMD seems to use virgl to allocate and import into the dGPU.
> >
> > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23896
> >
> > https://lore.kernel.org/all/20231221100016.4022353-1-julia.zh...@amd.com/
> >
> >
> > ChromeOS also uses that method (see crrev.com/c/3764931)
> > [cc: dGPU architect +Dominik Behr]
> >
> > So if iGPU + dGPU is the primary use case, you should be able to use
> > these methods as well.  The model would be "virtualized iGPU" +
> > passthrough dGPU, not split SoCs.
> In our use-case, the goal is to have only one primary GPU (passthrough’d
> iGPU/dGPU) do all the rendering (using native DRI drivers) for
> clients/compositor and all the outputs and share the scanout buffers with
> the secondary GPU (virtio-gpu). Since this is mostly how Mutter (and also
> Weston) work in a multi-GPU setup, I am not sure if virgl is needed.
>

I think you can probably use virgl with the PF and others probably will,
but supporting multiple methods in Linux is not unheard of.

Does your patchset need the Mesa kmsro patchset to function correctly?

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9592

If so, I would try to get that reviewed first to meet DRM requirements (
https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements).
You might explicitly call out the design decision you're making: ("We can
probably use virgl as the virtualized iGPU via PF, but that adds
unnecessary complexity b/c __").

> And, doing it this way means that no other userspace components need to be
> modified on both the Guest and the Host.
>
> >
> >
> >
> >   As part of the import, the virtio-gpu driver shares the dma
> >   addresses and lengths with Qemu which then determines whether the
> >   memory region they belong to is owned by a PCI device or whether it
> >   is part of the Guest's system ram. If it is the former, it identifies
> >   the devid (or bdf) and bar and provides this info (along with offsets
> 

RE: [RFC 0/7] drm/virtio: Import scanout buffers from other devices

2024-05-24 Thread Kasireddy, Vivek
Hi,

Sorry, my previous reply got messed up as a result of HTML formatting. This is
a plain text version of the same reply.

> 
> 
>   Having virtio-gpu import scanout buffers (via prime) from other
>   devices means that we'd be adding a head to headless GPUs assigned
>   to a Guest VM or additional heads to regular GPU devices that are
>   passthrough'd to the Guest. In these cases, the Guest compositor
>   can render into the scanout buffer using a primary GPU and has the
>   secondary GPU (virtio-gpu) import it for display purposes.
> 
>   The main advantage with this is that the imported scanout buffer can
>   either be displayed locally on the Host (e.g, using Qemu + GTK UI)
>   or encoded and streamed to a remote client (e.g, Qemu + Spice UI).
>   Note that since Qemu uses udmabuf driver, there would be no copies
>   made of the scanout buffer as it is displayed. This should be
>   possible even when it might reside in device memory such as VRAM.
> 
>   The specific use-case that can be supported with this series is when
>   running Weston or other guest compositors with "additional-devices"
>   feature (./weston --drm-device=card1 --additional-devices=card0).
>   More info about this feature can be found at:
>   https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/736
> 
>   In the above scenario, card1 could be a dGPU or an iGPU and card0
>   would be virtio-gpu in KMS only mode. However, the case where this
>   patch series could be particularly useful is when card1 is a GPU VF
>   that needs to share its scanout buffer (in a zero-copy way) with the
>   GPU PF on the Host. Or, it can also be useful when the scanout buffer
>   needs to be shared between any two GPU devices (assuming one of them
>   is assigned to a Guest VM) as long as they are P2P DMA compatible.
> 
> 
> 
> Is passthrough iGPU-only or passthrough dGPU-only something you intend to
> use?
Our main use-case involves passthrough’g a headless dGPU VF device and sharing
the Guest compositor’s scanout buffer with dGPU PF device on the Host. Same
goal for headless iGPU VF to iGPU PF device as well.

However, using a combination of iGPU and dGPU where either of them can be
passthrough’d to the Guest is something I think can be supported with this
patch series as well.

> 
> If it's a dGPU + iGPU setup, then the way other people seem to do it is a
> "virtualized" iGPU (via virgl/gfxstream/take your pick) and pass-through the
> dGPU.
> 
> For example, AMD seems to use virgl to allocate and import into the dGPU.
> 
> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23896
> 
> https://lore.kernel.org/all/20231221100016.4022353-1-julia.zh...@amd.com/
>
>
> ChromeOS also uses that method (see crrev.com/c/3764931)
> [cc: dGPU architect +Dominik Behr]
> 
> So if iGPU + dGPU is the primary use case, you should be able to use these
> methods as well.  The model would be "virtualized iGPU" + passthrough dGPU,
> not split SoCs.
In our use-case, the goal is to have only one primary GPU (passthrough’d
iGPU/dGPU) do all the rendering (using native DRI drivers) for
clients/compositor and all the outputs and share the scanout buffers with the
secondary GPU (virtio-gpu). Since this is mostly how Mutter (and also Weston)
work in a multi-GPU setup, I am not sure if virgl is needed.

And, doing it this way means that no other userspace components need to be
modified on both the Guest and the Host.

> 
> 
> 
>   As part of the import, the virtio-gpu driver shares the dma
>   addresses and lengths with Qemu which then determines whether the
>   memory region they belong to is owned by a PCI device or whether it
>   is part of the Guest's system ram. If it is the former, it identifies
>   the devid (or bdf) and bar and provides this info (along with offsets
>   and sizes) to the udmabuf driver. In the latter case, instead of the
>   devid and bar it provides the memfd. The udmabuf driver then
>   creates a dmabuf using this info that Qemu shares with Spice for
>   encode via Gstreamer.
> 
>   Note that the virtio-gpu driver registers a move_notify() callback
>   to track location changes associated with the scanout buffer and
>   sends attach/detach backing cmds to Qemu when appropriate. And,
>   synchronization (that is, ensuring that Guest and Host are not
>   using the scanout buffer at the same time) is ensured by pinning/
>   unpinning the dmabuf as part of plane update and using a fence
>   in resource_flush cmd.
> 
> 
> I'm not sure how QEMU's display paths work, but with crosvm if you share
> the guest-created dmabuf with the display, and the guest moves the backing
> pages, the only recourse is to destroy the surface and show a black screen
> to the user: not the best experience.

RE: [RFC 0/7] drm/virtio: Import scanout buffers from other devices

2024-05-24 Thread Kasireddy, Vivek
Hi Gurchetan,

Thank you for taking a look at this patch series!



On Thu, Mar 28, 2024 at 2:01 AM Vivek Kasireddy
<vivek.kasire...@intel.com> wrote:
Having virtio-gpu import scanout buffers (via prime) from other
devices means that we'd be adding a head to headless GPUs assigned
to a Guest VM or additional heads to regular GPU devices that are
passthrough'd to the Guest. In these cases, the Guest compositor
can render into the scanout buffer using a primary GPU and has the
secondary GPU (virtio-gpu) import it for display purposes.

The main advantage with this is that the imported scanout buffer can
either be displayed locally on the Host (e.g, using Qemu + GTK UI)
or encoded and streamed to a remote client (e.g, Qemu + Spice UI).
Note that since Qemu uses udmabuf driver, there would be no copies
made of the scanout buffer as it is displayed. This should be
possible even when it might reside in device memory such as VRAM.

The specific use-case that can be supported with this series is when
running Weston or other guest compositors with "additional-devices"
feature (./weston --drm-device=card1 --additional-devices=card0).
More info about this feature can be found at:
https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/736

In the above scenario, card1 could be a dGPU or an iGPU and card0
would be virtio-gpu in KMS only mode. However, the case where this
patch series could be particularly useful is when card1 is a GPU VF
that needs to share its scanout buffer (in a zero-copy way) with the
GPU PF on the Host. Or, it can also be useful when the scanout buffer
needs to be shared between any two GPU devices (assuming one of them
is assigned to a Guest VM) as long as they are P2P DMA compatible.

> Is passthrough iGPU-only or passthrough dGPU-only something you intend to
> use?
Our main use-case involves passthrough’g a headless dGPU VF device and sharing
the Guest compositor’s scanout buffer with dGPU PF device on the Host. Same
goal for headless iGPU VF to iGPU PF device as well.

However, using a combination of iGPU and dGPU where either of them can be
passthrough’d to the Guest is something I think can be supported with this
patch series as well.

> If it's a dGPU + iGPU setup, then the way other people seem to do it is a
> "virtualized" iGPU (via virgl/gfxstream/take your pick) and pass-through the
> dGPU.
>
> For example, AMD seems to use virgl to allocate and import into the dGPU.
>
> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23896
> https://lore.kernel.org/all/20231221100016.4022353-1-julia.zh...@amd.com/
>
> ChromeOS also uses that method (see crrev.com/c/3764931)
> [cc: dGPU architect +Dominik Behr]
>
> So if iGPU + dGPU is the primary use case, you should be able to use these
> methods as well.  The model would be "virtualized iGPU" + passthrough dGPU,
> not split SoCs.
In our use-case, the goal is to have only one primary GPU (passthrough’d
iGPU/dGPU) do all the rendering (using native DRI drivers) for
clients/compositor and all the outputs and share the scanout buffers with the
secondary GPU (virtio-gpu). Since this is mostly how Mutter (and also Weston)
work in a multi-GPU setup, I am not sure if virgl is needed.

As part of the import, the virtio-gpu driver shares the dma
addresses and lengths with Qemu which then determines whether the
memory region they belong to is owned by a PCI device or whether it
is part of the Guest's system ram. If it is the former, it identifies
the devid (or bdf) and bar and provides this info (along with offsets
and sizes) to the udmabuf driver. In the latter case, instead of the
devid and bar it provides the memfd. The udmabuf driver then
creates a dmabuf using this info that Qemu shares with Spice for
encode via Gstreamer.

Note that the virtio-gpu driver registers a move_notify() callback
to track location changes associated with the scanout buffer and
sends attach/detach backing cmds to Qemu when appropriate. And,
synchronization (that is, ensuring that Guest and Host are not
using the scanout buffer at the same time) is ensured by pinning/
unpinning the dmabuf as part of plane update and using a fence
in resource_flush cmd.

> I'm not sure how QEMU's display paths work, but with crosvm if you share the
> guest-created dmabuf with the display, and the guest moves the backing pages,
> the only recourse is to destroy the surface and show a black screen to the
> user: not the best experience.
Since Qemu GTK UI uses EGL, there is a blit done from the guest’s scanout
buffer onto an EGL backed buffer on the Host. So, this problem would not
happen as of now.

> Only amdgpu calls dma_buf_move_notify(..), and you're probably testing on
> Intel only, so you may not be hitting that code path anyway.
I have tested with the Xe driver in the Guest which also calls
dma_buf_move_notify(). However, note that for dGPUs, both Xe and amdgpu
migrate the scanout buffer from vram to system memory as part

Re: [RFC 0/7] drm/virtio: Import scanout buffers from other devices

2024-05-23 Thread Gurchetan Singh
On Thu, Mar 28, 2024 at 2:01 AM Vivek Kasireddy 
wrote:

> Having virtio-gpu import scanout buffers (via prime) from other
> devices means that we'd be adding a head to headless GPUs assigned
> to a Guest VM or additional heads to regular GPU devices that are
> passthrough'd to the Guest. In these cases, the Guest compositor
> can render into the scanout buffer using a primary GPU and has the
> secondary GPU (virtio-gpu) import it for display purposes.
>
> The main advantage with this is that the imported scanout buffer can
> either be displayed locally on the Host (e.g, using Qemu + GTK UI)
> or encoded and streamed to a remote client (e.g, Qemu + Spice UI).
> Note that since Qemu uses udmabuf driver, there would be no copies
> made of the scanout buffer as it is displayed. This should be
> possible even when it might reside in device memory such as VRAM.
>
> The specific use-case that can be supported with this series is when
> running Weston or other guest compositors with "additional-devices"
> feature (./weston --drm-device=card1 --additional-devices=card0).
> More info about this feature can be found at:
> https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/736
>
> In the above scenario, card1 could be a dGPU or an iGPU and card0
> would be virtio-gpu in KMS only mode. However, the case where this
> patch series could be particularly useful is when card1 is a GPU VF
> that needs to share its scanout buffer (in a zero-copy way) with the
> GPU PF on the Host. Or, it can also be useful when the scanout buffer
> needs to be shared between any two GPU devices (assuming one of them
> is assigned to a Guest VM) as long as they are P2P DMA compatible.
>

Is passthrough iGPU-only or passthrough dGPU-only something you intend to
use?

If it's a dGPU + iGPU setup, then the way other people seem to do it is a
"virtualized" iGPU (via virgl/gfxstream/take your pick) and pass-through
the dGPU.

For example, AMD seems to use virgl to allocate and import into the dGPU.

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23896
https://lore.kernel.org/all/20231221100016.4022353-1-julia.zh...@amd.com/

ChromeOS also uses that method (see crrev.com/c/3764931) [cc: dGPU
architect +Dominik Behr]

So if iGPU + dGPU is the primary use case, you should be able to use these
methods as well.  The model would be "virtualized iGPU" + passthrough dGPU,
not split SoCs.


> As part of the import, the virtio-gpu driver shares the dma
> addresses and lengths with Qemu which then determines whether the
> memory region they belong to is owned by a PCI device or whether it
> is part of the Guest's system ram. If it is the former, it identifies
> the devid (or bdf) and bar and provides this info (along with offsets
> and sizes) to the udmabuf driver. In the latter case, instead of the
> devid and bar it provides the memfd. The udmabuf driver then
> creates a dmabuf using this info that Qemu shares with Spice for
> encode via Gstreamer.
>
> Note that the virtio-gpu driver registers a move_notify() callback
> to track location changes associated with the scanout buffer and
> sends attach/detach backing cmds to Qemu when appropriate. And,
> synchronization (that is, ensuring that Guest and Host are not
> using the scanout buffer at the same time) is ensured by pinning/
> unpinning the dmabuf as part of plane update and using a fence
> in resource_flush cmd.
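
For readers following the import path quoted above, here is a minimal
sketch in C of the classification step: given a guest address and length,
decide whether the range falls inside a PCI BAR or should be treated as
Guest system RAM. It is only an illustration under stated assumptions. The
types and the function classify_range() are hypothetical and are not taken
from the Qemu or udmabuf patches in this series; the real code may track
regions quite differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one PCI BAR mapped into guest physical memory. */
struct bar_region {
    uint16_t bdf;   /* bus << 8 | device << 3 | function */
    uint8_t bar;    /* BAR index */
    uint64_t base;  /* guest physical base address */
    uint64_t size;  /* length in bytes */
};

enum backing_kind {
    BACKING_SYSTEM_RAM, /* would be described by a memfd plus offset */
    BACKING_PCI_BAR,    /* would be described by bdf, bar and offset */
    BACKING_INVALID     /* starts inside a BAR but runs past its end */
};

struct backing_info {
    enum backing_kind kind;
    uint16_t bdf;    /* valid only for BACKING_PCI_BAR */
    uint8_t bar;     /* valid only for BACKING_PCI_BAR */
    uint64_t offset; /* offset into the BAR, valid only for BACKING_PCI_BAR */
};

/*
 * Classify one address range. Anything that does not start inside a known
 * BAR is assumed to be Guest system RAM; this sketch does not validate RAM
 * ranges and does not handle ranges that begin below a BAR.
 */
struct backing_info classify_range(const struct bar_region *bars, size_t nbars,
                                   uint64_t addr, uint64_t len)
{
    struct backing_info info = { .kind = BACKING_SYSTEM_RAM };

    assert(len > 0);
    for (size_t i = 0; i < nbars; i++) {
        const struct bar_region *b = &bars[i];
        uint64_t off;

        if (addr < b->base || addr - b->base >= b->size)
            continue;

        off = addr - b->base;
        if (len > b->size - off) {
            info.kind = BACKING_INVALID;
            return info;
        }

        info.kind = BACKING_PCI_BAR;
        info.bdf = b->bdf;
        info.bar = b->bar;
        info.offset = off;
        return info;
    }
    return info;
}
```

The overflow-safe comparison (len > size - off, rather than off + len >
size) is a deliberate choice so that a very large length cannot wrap around
and be accepted by mistake.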


I'm not sure how QEMU's display paths work, but with crosvm if you share
the guest-created dmabuf with the display, and the guest moves the backing
pages, the only recourse is to destroy the surface and show a black screen
to the user: not the best experience.

Only amdgpu calls dma_buf_move_notify(..), and you're probably testing on
Intel only, so you may not be hitting that code path anyway.  I forgot the
exact reason, but apparently udmabuf may not work with amdgpu displays and
it seems the virtualized iGPU + dGPU is the way to go for amdgpu anyway.
So I recommend just pinning the buffer for the lifetime of the import for
simplicity and correctness.
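
To make the pinning recommendation concrete, here is a toy model in C of
what "pin for the lifetime of the import" means for the exporter. It is a
sketch only: struct toy_buffer and the toy_* functions are hypothetical
stand-ins, not the kernel dma-buf API, and they only model the bookkeeping.
The point is that while an import holds a pin, an eviction attempt is
refused, so the importer never has to react to a move.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an exported buffer; not the kernel's struct dma_buf. */
struct toy_buffer {
    int pin_count; /* > 0 means the exporter must not move the buffer */
    bool in_vram;  /* current placement */
};

/* Toy stand-in for an importer's handle on the buffer. */
struct toy_import {
    struct toy_buffer *buf;
};

/* Begin an import: take a pin that is held until toy_import_end(). */
struct toy_import toy_import_begin(struct toy_buffer *buf)
{
    struct toy_import imp = { .buf = buf };

    buf->pin_count++;
    return imp;
}

/* End an import: drop the pin taken in toy_import_begin(). */
void toy_import_end(struct toy_import *imp)
{
    assert(imp->buf != NULL && imp->buf->pin_count > 0);
    imp->buf->pin_count--;
    imp->buf = NULL;
}

/*
 * Exporter-side eviction from VRAM to system memory.
 * Returns true if the buffer was moved, false if a pin prevented it.
 */
bool toy_try_evict(struct toy_buffer *buf)
{
    if (buf->pin_count > 0)
        return false;
    buf->in_vram = false;
    return true;
}
```

The trade-off is that a buffer pinned for the whole import cannot be
migrated by the exporter during that time, in exchange for the importer not
needing any move-tracking logic.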


> This series is available at:
> https://gitlab.freedesktop.org/Vivek/drm-tip/-/commits/virtgpu_import_rfc
>
> along with additional patches for Qemu and Spice here:
> https://gitlab.freedesktop.org/Vivek/qemu/-/commits/virtgpu_dmabuf_pcidev
> https://gitlab.freedesktop.org/Vivek/spice/-/commits/encode_dmabuf_v4
>
> Patchset overview:
>
> Patch 1:   Implement VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
> Patch 2-3: Helpers to initialize, import, free imported object
> Patch 4-5: Import and use buffers from other devices for scanout
> Patch 6-7: Have udmabuf driver create dmabuf from PCI bars for P2P DMA
>
> This series is tested using the following method:
> - Run Qemu with the following relevant options:
>   qemu-system-x86_64 -m 4096m 
>   -device vfio-pci,host=:03:00.0
>   -device virtio-vga,max_outputs=1,blob=true,xres=1920,yres=1080
>   -spice
>