[Development] Using DMA instead of SHM in non OpenGL apps (Linux/Wayland)

2023-08-25 Thread Eduardo Hopperdietzel
Hi Giuseppe,

I initially believed there would be no distinction in performing R/W
operations on DMA or SHM maps. However, in a previous email, David
mentioned:




> The relevant kwin expert, Xaver Hugl, stated in a chat:
> "While the overhead on the compositor side would be lower, rendering
> into a dmabuf with the CPU is pretty slow, especially on dedicated
> GPUs and especially with QPainter."

Upon testing, I discovered that individual R/W operations are not
synchronized with the backing storage or with importers of the buffer;
synchronizing each one would potentially slow down QPainter operations. It
appears that written data is held temporarily in a cache for quick reading,
and that the kernel subsystem schedules DMA transfers in a non-blocking
manner. I believe Xaver Hugl was referring to lower performance in terms of
transfers, as a fence mechanism is required to ensure the compositor
finishes importing the buffer into the GPU.
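
For reference, the Linux dma-buf uAPI offers DMA_BUF_IOCTL_SYNC to bracket
CPU access to a mapped dmabuf. The thread does not say whether the benchmark
used it, so the following is only a minimal sketch of how one paint pass
could be bracketed. The names dmabufSync and CpuWriteAccess are hypothetical;
only the ioctl, the struct and the flags come from <linux/dma-buf.h>.

```cpp
#include <cerrno>
#include <cstdint>
#include <linux/dma-buf.h>
#include <sys/ioctl.h>

// Issues DMA_BUF_IOCTL_SYNC with the given flags, restarting the call if it
// is interrupted. Returns false on any other error (for example, when fd is
// not a dmabuf).
inline bool dmabufSync(int fd, std::uint64_t flags)
{
    struct dma_buf_sync sync = {};
    sync.flags = flags;
    int ret;
    do {
        ret = ioctl(fd, DMA_BUF_IOCTL_SYNC, &sync);
    } while (ret == -1 && (errno == EINTR || errno == EAGAIN));
    return ret == 0;
}

// Hypothetical RAII helper: begins a CPU write pass on construction and ends
// it on destruction, so one whole paint is bracketed rather than each store.
class CpuWriteAccess {
public:
    explicit CpuWriteAccess(int fd)
        : fd_(fd),
          ok_(dmabufSync(fd, DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE)) {}

    ~CpuWriteAccess()
    {
        // Only end an access that was successfully started.
        if (ok_)
            dmabufSync(fd_, DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE);
    }

    CpuWriteAccess(const CpuWriteAccess &) = delete;
    CpuWriteAccess &operator=(const CpuWriteAccess &) = delete;

    bool ok() const { return ok_; }

private:
    int fd_;
    bool ok_;
};
```

A caller would construct a CpuWriteAccess on the dmabuf fd just before
painting and let it go out of scope afterwards. Note that this ioctl only
manages cache coherency for CPU access; it does not replace the fences
needed between client and compositor.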

Best regards,
Eduardo Hopperdietzel
-- 
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development


Re: [Development] Using DMA instead of SHM in non OpenGL apps (Linux/Wayland)

2023-08-25 Thread Giuseppe D'Angelo via Development

Hi,

On 25/08/2023 12:34, Vlad Zahorodnii wrote:

>> I'm really curious here, and these aren't rhetorical questions: why
>> would anyone expect there to be a difference in performance, as far as
>> QPainter is concerned? Isn't it ultimately just using a CPU-based
>> renderer onto a block of memory? Why should it make a difference where
>> that memory comes from / how it's managed / etc.? Are we talking
>> about "far memory" (NUMA-like) scenarios?
>
> It makes a difference to the compositor. The compositor will have to
> upload pixel data from RAM to VRAM so it can composite the windows using
> OpenGL or Vulkan. If the client provides dmabuf client buffers, the
> compositor can skip the uploading step and thus reduce the amount of
> time it takes to compose a frame.


Thanks, this part was OK with me. I was confused by the previous emails 
possibly implying that QPainter-based painting *alone* was making a 
difference between DMA and SHM, and couldn't understand why.


--
Giuseppe D'Angelo | giuseppe.dang...@kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - The Qt, C++ and OpenGL Experts





Re: [Development] Using DMA instead of SHM in non OpenGL apps (Linux/Wayland)

2023-08-25 Thread Vlad Zahorodnii

On 8/25/23 12:12, Giuseppe D'Angelo via Development wrote:
> On 24/08/2023 21:37, Eduardo Hopperdietzel wrote:
>> The results show that there's no significant difference in the time
>> it takes for read and write operations using QPainter in SHM and DMA
>> maps.
>
> I'm really curious here, and these aren't rhetorical questions: why
> would anyone expect there to be a difference in performance, as far as
> QPainter is concerned? Isn't it ultimately just using a CPU-based
> renderer onto a block of memory? Why should it make a difference where
> that memory comes from / how it's managed / etc.? Are we talking
> about "far memory" (NUMA-like) scenarios?

It makes a difference to the compositor. The compositor will have to
upload pixel data from RAM to VRAM so it can composite the windows using
OpenGL or Vulkan. If the client provides dmabuf client buffers, the
compositor can skip the uploading step and thus reduce the amount of
time it takes to compose a frame.


Regards,
Vlad


Re: [Development] Using DMA instead of SHM in non OpenGL apps (Linux/Wayland)

2023-08-25 Thread Giuseppe D'Angelo via Development

On 24/08/2023 21:37, Eduardo Hopperdietzel wrote:
> The results show that there's no significant difference in the time it
> takes for read and write operations using QPainter in SHM and DMA maps.

I'm really curious here, and these aren't rhetorical questions: why
would anyone expect there to be a difference in performance, as far as
QPainter is concerned? Isn't it ultimately just using a CPU-based
renderer onto a block of memory? Why should it make a difference where
that memory comes from / how it's managed / etc.? Are we talking
about "far memory" (NUMA-like) scenarios?


Thank you,
--
Giuseppe D'Angelo | giuseppe.dang...@kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - The Qt, C++ and OpenGL Experts





Re: [Development] Using DMA instead of SHM in non OpenGL apps (Linux/Wayland)

2023-08-24 Thread Ilya Fedin
On Thu, 24 Aug 2023 15:37:05 -0400
Eduardo Hopperdietzel wrote:

> Hi David,
> 
> I've made a little Wayland app that uses both SHM and DMA, and I
> tested it on Weston, Sway, and my own compositor. I also tried it on
> three different machines: two with Intel i7 CPUs and one with a
> smaller ARM CPU. These machines had Intel Iris Pro, Nvidia GT525M,
> and Mali-400 GPUs, respectively.
> 
> Here's the code and results for one of the machines:
> 
> https://github.com/ehopperdietzel/QPainter-SHM-DMA-Benchmark
> 
> The results show that there's no significant difference in the time it
> takes for read and write operations using QPainter in SHM and DMA
> maps. It seems like DMA I/O operations are handled asynchronously by
> the kernel. The most noticeable improvement is on the compositor
> side. When using DMA, the experience feels much smoother, especially
> when moving other windows while the benchmark is running on
> single-threaded compositors like Weston. There's also a slight
> increase in the number of frame callbacks returned by the compositors
> when using DMA, though it doesn't significantly boost the overall FPS.
> 
> However, there are challenges with implementing DMA:
> 
> 1. There does not seem to be a standard method to create DMA buffers in
> userspace. I tried creating a GBM bo, obtaining a PRIME fd, and
> mapping it, but this isn't supported by all GPUs/drivers. For
> instance, it didn't work with the Mali GPU using the Lima driver. I
> also experimented with DMA-BUF heaps, but driver support does not
> seem to be consistent across all distributions, and accessing
> /dev/dma_heap/* often requires superuser privileges.
> 
> 2. When using DMA, triple buffering is necessary; otherwise,
> compositors only display partial buffer updates. This could
> potentially be avoided by using DMA fencing mechanisms (like EGL does
> under the hood) and protocols like this one:
> 
> https://wayland.app/protocols/linux-explicit-synchronization-unstable-v1
> 
> But it seems that not many compositors have implemented it.
> 
> To sum it up, while DMA does offer a performance boost, it's not
> without its issues:
> 
> - DMA's effectiveness varies depending on hardware.
> - Implementing DMA can be complex.
> - The performance gains might not justify the effort.
> 
> So, as you mentioned earlier, it's probably best to stick with SHM
> and let the compositor handle uploads using DMA, preferably
> asynchronously.
> 
> Cheers,
> 
> Eduardo Hopperdietzel

I wonder whether this would help with FramelessWindowHint artifacts on
Debian 10? Currently SHM doesn't work correctly on Debian 10 and one
has to create a child QOpenGLWidget for artifacts to disappear.


[Development] Using DMA instead of SHM in non OpenGL apps (Linux/Wayland)

2023-08-24 Thread Eduardo Hopperdietzel
Hi David,

I've made a little Wayland app that uses both SHM and DMA, and I tested it
on Weston, Sway, and my own compositor. I also tried it on three different
machines: two with Intel i7 CPUs and one with a smaller ARM CPU. These
machines had Intel Iris Pro, Nvidia GT525M, and Mali-400 GPUs, respectively.

Here's the code and results for one of the machines:

https://github.com/ehopperdietzel/QPainter-SHM-DMA-Benchmark

The results show that there's no significant difference in the time it
takes for read and write operations using QPainter in SHM and DMA maps. It
seems like DMA I/O operations are handled asynchronously by the kernel. The
most noticeable improvement is on the compositor side. When using DMA, the
experience feels much smoother, especially when moving other windows while
the benchmark is running on single-threaded compositors like Weston.
There's also a slight increase in the number of frame callbacks returned by
the compositors when using DMA, though it doesn't significantly boost the
overall FPS.

However, there are challenges with implementing DMA:

1. There does not seem to be a standard method to create DMA buffers in
userspace. I tried creating a GBM bo, obtaining a PRIME fd, and mapping it,
but this isn't supported by all GPUs/drivers. For instance, it didn't work
with the Mali GPU using the Lima driver. I also experimented with DMA-BUF
heaps, but driver support does not seem to be consistent across all
distributions, and accessing /dev/dma_heap/* often requires superuser
privileges.

2. When using DMA, triple buffering is necessary; otherwise, compositors
only display partial buffer updates. This could potentially be avoided by
using DMA fencing mechanisms (like EGL does under the hood) and protocols
like this one:

https://wayland.app/protocols/linux-explicit-synchronization-unstable-v1

But it seems that not many compositors have implemented it.

To sum it up, while DMA does offer a performance boost, it's not without
its issues:

- DMA's effectiveness varies depending on hardware.
- Implementing DMA can be complex.
- The performance gains might not justify the effort.

So, as you mentioned earlier, it's probably best to stick with SHM and let
the compositor handle uploads using DMA, preferably asynchronously.

Cheers,

Eduardo Hopperdietzel


Re: [Development] Using DMA instead of SHM in non OpenGL apps (Linux/Wayland)

2023-08-21 Thread David Edmundson
On Fri, Aug 18, 2023 at 8:18 PM Eduardo Hopperdietzel wrote:
>
> Hi David,
>
> That's a very good point I hadn't thought about. I will create a testing 
> Wayland client benchmark and measure the time it takes for QPainter to perform 
> different drawing operations using both SHM and DMA. I'll also test buffer 
> resizing and measure the overall (client/compositor) performance by counting 
> the number of frame callbacks (FPS) returned by different compositors.
>
> To ensure the benchmark provides a representative performance evaluation for 
> Qt, I would appreciate it if you could clarify the following doubts:
>
> 1. Does Qt respect the wl_surface frame callbacks sent by the compositor, or 
> does it simply draw as many frames as it can?

Yes to both :)
If you use the normal loop of QWidget::update(), you'll get a paintEvent
on each frame callback.
If you use QWidget::repaint() (which the docs advise against using), you'll
blast out loads of frames and then block once you hit a hardcoded limit,
until buffers are released.

> 2. When using SHM, does Qt reuse the same buffer on a wl_surface if it 
> receives a wl_buffer release event before a wl_surface frame callback? And 
> does it use more than one otherwise?

There is a pool that is re-used.
It's double buffered at a minimum but scales to handle the repaint case above.
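
As an illustration only (this is not the Qt Wayland code), a pool with that
behaviour could look like the sketch below. BufferPool, Buffer and the
maximum of 4 are hypothetical; the real limit is whatever Qt hardcodes.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in for a wl_buffer plus its pixel storage.
struct Buffer {
    bool busy = false;  // true while the compositor may still be reading it
};

class BufferPool {
public:
    // minimum: buffers created up front (2 = double buffering).
    // maximum: hypothetical cap on how far the pool may grow.
    explicit BufferPool(std::size_t minimum = 2, std::size_t maximum = 4)
        : maximum_(maximum)
    {
        assert(minimum >= 1 && minimum <= maximum);
        for (std::size_t i = 0; i < minimum; ++i)
            buffers_.push_back(std::unique_ptr<Buffer>(new Buffer));
    }

    // Returns a free buffer and marks it busy. Grows the pool if every
    // buffer is busy, and returns nullptr once the cap is reached (the
    // point where a real client would block until a release arrives).
    Buffer *acquire()
    {
        for (auto &b : buffers_) {
            if (!b->busy) {
                b->busy = true;
                return b.get();
            }
        }
        if (buffers_.size() < maximum_) {
            buffers_.push_back(std::unique_ptr<Buffer>(new Buffer));
            buffers_.back()->busy = true;
            return buffers_.back().get();
        }
        return nullptr;
    }

    // To be called when the compositor sends wl_buffer.release.
    void release(Buffer *b) { b->busy = false; }

    std::size_t size() const { return buffers_.size(); }

private:
    std::size_t maximum_;
    std::vector<std::unique_ptr<Buffer>> buffers_;
};
```

With update()-driven painting such a pool stays at two buffers; only a
repaint()-style burst makes it grow.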

> 3. If DMA was implemented, I suppose double buffering would be mandatory, or 
> should I consider triple buffering or more?

The logic would be the same.

David


[Development] Using DMA instead of SHM in non OpenGL apps (Linux/Wayland)

2023-08-18 Thread Eduardo Hopperdietzel
Hi David,

That's a very good point I hadn't thought about. I will create a testing
Wayland client benchmark and measure the time it takes for QPainter to
perform different drawing operations using both SHM and DMA. I'll also test
buffer resizing and measure the overall (client/compositor) performance by
counting the number of frame callbacks (FPS) returned by different
compositors.

To ensure the benchmark provides a representative performance evaluation
for Qt, I would appreciate it if you could clarify the following doubts:

1. Does Qt respect the wl_surface frame callbacks sent by the compositor,
or does it simply draw as many frames as it can?
2. When using SHM, does Qt reuse the same buffer on a wl_surface if it
receives a wl_buffer release event before a wl_surface frame callback? And
does it use more than one otherwise?
3. If DMA was implemented, I suppose double buffering would be mandatory,
or should I consider triple buffering or more?

Cheers,
Eduardo Hopperdietzel


Re: [Development] Using DMA instead of SHM in non OpenGL apps (Linux/Wayland)

2023-08-17 Thread David Edmundson
>
> Do you foresee any potential issues with this approach? Please feel free to 
> share your thoughts.
>

The relevant kwin expert, Xaver Hugl said in a chat about this topic:
"While the overhead on the compositor side would be lower, rendering
into a dmabuf with the CPU is pretty slow, especially on dedicated
GPUs and especially with QPainter"

This means there's a strong chance this doesn't have the performance
boost that it looks to have on paper. Rendering into shared memory is
as fast as rendering into any application local memory.
This is potentially avoidable with a shadow buffer and then an upload
at the end, but then we're just moving work to the client rather than
saving work.
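
To make the shadow-buffer idea concrete, here is a minimal sketch in which
plain memory stands in for the mapped dmabuf. uploadShadow is a hypothetical
name, and the 32-bit pixel format and padded destination rows are
assumptions. In Qt the shadow would more likely be a QImage; a std::vector
keeps the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copies a tightly packed 32-bit shadow image into a destination whose rows
// start dstStride bytes apart (mapped buffers often pad each row).
// All painting happens in `shadow`; `dst` is only ever written, one
// sequential memcpy per row.
void uploadShadow(const std::vector<std::uint32_t> &shadow,
                  std::size_t width, std::size_t height,
                  std::uint8_t *dst, std::size_t dstStride)
{
    const std::size_t rowBytes = width * sizeof(std::uint32_t);
    assert(shadow.size() == width * height);
    assert(dstStride >= rowBytes);

    for (std::size_t y = 0; y < height; ++y)
        std::memcpy(dst + y * dstStride, shadow.data() + y * width, rowBytes);
}
```

The copy is exactly the extra client-side work mentioned above: the
compositor no longer uploads, but the client now does one full pass over the
frame instead.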

I would suggest getting some test benchmarks ahead of time; it's a lot
of code for something that might not pay off.
We'll also need to ensure we do fairly extensive real-world benchmarks
before landing a final merge request.

David Edmundson


[Development] Using DMA instead of SHM in non OpenGL apps (Linux/Wayland)

2023-08-14 Thread Eduardo Hopperdietzel
Hello Kai,

Thank you for your reply. After following the links you provided, I believe
that implementing the DMA feature should take place within the following
file:

https://code.qt.io/cgit/qt/qtwayland.git/tree/src/client/qwaylandshmbackingstore.cpp

Here's my proposed plan for this implementation:

1. Start by checking whether the compositor supports the wl_drm and
linux-dmabuf protocols.
2. Manually implement both protocols, without relying on the wayland-egl
implementation.
3. Authenticate a DRM device fd through the wl_drm protocol.
4. Verify if the compositor supports the DRM_FORMAT_MOD_LINEAR modifier
using the linux-dmabuf protocol.
5. Create a LINEAR DMA buffer using the GBM library. I propose making this
buffer relatively large to prevent the need for frequent destruction and
recreation of a new buffer when the window is resized.
6. Perform mmap on the DMA buffer.
7. Wrap the mapped buffer with a QImage(), following a similar approach as
it's done with shared memory.

If any of these steps fail, fall back to shared memory.
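
The control flow of the plan can be sketched as follows. This is only an
outline under assumptions: Step, Result and chooseBackend are hypothetical
names, and each step is an opaque callable standing in for the real
Wayland/GBM work.

```cpp
#include <functional>
#include <string>
#include <vector>

enum class Backend { Dmabuf, Shm };

// One numbered step of the plan.
struct Step {
    std::string name;           // label, useful for logging only
    std::function<bool()> run;  // returns false if the step fails
};

struct Result {
    Backend backend;
    std::string failedStep;     // empty when every step succeeded
};

// Runs the steps in order and stops at the first failure, in which case
// the caller should use the existing shared-memory path.
Result chooseBackend(const std::vector<Step> &steps)
{
    for (const Step &step : steps) {
        if (!step.run())
            return {Backend::Shm, step.name};
    }
    return {Backend::Dmabuf, std::string()};
}
```

Stopping at the first failure means later, more expensive steps (buffer
creation, mmap) never run on systems that lack the protocols, and the name
of the failed step can be logged to explain why SHM was chosen.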

Do you foresee any potential issues with this approach? Please feel free to
share your thoughts.

Best regards,
Eduardo Hopperdietzel


Re: [Development] Using DMA instead of SHM in non OpenGL apps (Linux/Wayland)

2023-08-06 Thread Kai Uwe Broulik

Hi,

Qt Wayland Client currently relies on EGL through wayland-egl 
integration to do the right thing. On Mesa for instance Qt indeed 
transparently uses zwp_linux_dmabuf_v1.


However, I agree that Qt itself should have an implementation of the 
aforementioned protocol on the *client* side, too (Qt Wayland Compositor 
supports clients talking dmabuf to it), for platforms where this is not 
done under the hood.


Code that could serve as an inspiration is the compositor dmabuf-v1 
implementation [1], a client side integration [2] (this is what you want 
to write, just using dmabuf instead), a weston example implementation [3].


Cheers
Kai Uwe

[1] 
https://code.qt.io/cgit/qt/qtwayland.git/tree/src/hardwareintegration/compositor/linux-dmabuf-unstable-v1
[2] 
https://code.qt.io/cgit/qt/qtwayland.git/tree/src/hardwareintegration/compositor/wayland-egl
[3] 
https://gitlab.freedesktop.org/wayland/weston/-/blob/main/clients/simple-dmabuf-egl.c



[Development] Using DMA instead of SHM in non OpenGL apps (Linux/Wayland)

2023-08-06 Thread Eduardo Hopperdietzel
Hello,

I've noticed that Qt currently uses shared memory for buffer sharing with
Wayland compositors in non-OpenGL rendering applications. However, a more
efficient approach would be to use the zwp_linux_dmabuf_v1 protocol. By
creating a DMA buffer, mapping it, and performing CPU rendering there, the
compositor could directly import the buffers into the GPU, eliminating the
need for extra copies. This optimization would significantly improve
performance, particularly when scaling large windows on HiDPI displays. The
implementation on the Qt side should be almost identical to using shared
memory (I think).
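
To put a rough number on the HiDPI case, the sketch below computes the size
of one full-window buffer. It assumes 32-bit pixels and tightly packed rows
(neither is stated above), and frameBytes is a hypothetical helper.

```cpp
#include <cstdint>

// Bytes in one full-window buffer, assuming 4 bytes per pixel and no row
// padding. The buffer is scale times larger in each dimension than the
// logical window size.
constexpr std::uint64_t frameBytes(std::uint32_t logicalWidth,
                                   std::uint32_t logicalHeight,
                                   std::uint32_t scale)
{
    return (std::uint64_t(logicalWidth) * scale)
         * (std::uint64_t(logicalHeight) * scale)
         * 4;
}

// A 1920x1080 window is about 8.3 MB at scale 1 and about 33.2 MB at
// scale 2.
static_assert(frameBytes(1920, 1080, 1) == 8294400, "scale 1");
static_assert(frameBytes(1920, 1080, 2) == 33177600, "scale 2");
```

With SHM, the compositor has to upload up to that many bytes whenever the
whole window is repainted, which is why the cost grows with the square of
the scale factor.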

I would be willing to implement this, and I would appreciate it if you
could guide me to the code that handles this. I assume I should look into
the code of the Wayland platform plugin?

Best regards,

Eduardo Hopperdietzel