[Development] Using DMA instead of SHM in non OpenGL apps (Linux/Wayland)
Hi Giuseppe,

I initially believed there would be no difference between performing R/W operations on DMA maps and on SHM maps. However, in a previous email, David mentioned:

> The relevant kwin expert, Xaver Hugl, stated in a chat:
> "While the overhead on the compositor side would be lower, rendering
> into a dmabuf with the CPU is pretty slow, especially on dedicated
> GPUs and especially with QPainter."

Upon testing, I found that individual R/W operations are not synchronized with the backing storage or with importers of the buffer; if they were, QPainter operations could slow down. It appears that written data is held temporarily in a cache for quick reading, and that the kernel subsystem schedules DMA transfers in a non-blocking manner.

I believe Xaver Hugl was referring to lower performance in terms of transfers, since a fence mechanism is required to ensure the compositor has finished importing the buffer into the GPU.

Best regards,
Eduardo Hopperdietzel

--
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development
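For reference, the kernel's dma-buf interface lets userspace bracket CPU access to a mapped buffer with DMA_BUF_IOCTL_SYNC. The following is a minimal sketch, not code from the benchmark: `dmaBufSync` and `DmaBufCpuAccess` are hypothetical names, and `fd` is assumed to be a dma-buf file descriptor obtained elsewhere (for example a PRIME fd). It covers only cache coherency around mapped CPU access, not the client/compositor fencing mentioned above.

```cpp
#include <cerrno>
#include <cstdint>
#include <linux/dma-buf.h>   // DMA_BUF_IOCTL_SYNC, struct dma_buf_sync
#include <sys/ioctl.h>

// Issues one DMA_BUF_IOCTL_SYNC call and returns true on success.
// The call is retried if it is interrupted.
inline bool dmaBufSync(int fd, std::uint64_t flags)
{
    dma_buf_sync sync{};
    sync.flags = flags;
    int ret;
    do {
        ret = ioctl(fd, DMA_BUF_IOCTL_SYNC, &sync);
    } while (ret == -1 && (errno == EINTR || errno == EAGAIN));
    return ret == 0;
}

// Hypothetical RAII helper: SYNC_START on construction, SYNC_END on
// destruction, so that CPU reads/writes in between are bracketed.
class DmaBufCpuAccess {
public:
    explicit DmaBufCpuAccess(int fd, std::uint64_t access = DMA_BUF_SYNC_RW)
        : m_fd(fd), m_access(access),
          m_active(dmaBufSync(fd, DMA_BUF_SYNC_START | access)) {}

    ~DmaBufCpuAccess()
    {
        if (m_active)
            dmaBufSync(m_fd, DMA_BUF_SYNC_END | m_access);
    }

    DmaBufCpuAccess(const DmaBufCpuAccess&) = delete;
    DmaBufCpuAccess& operator=(const DmaBufCpuAccess&) = delete;

    // False if the START ioctl failed (for example, fd is not a dma-buf).
    bool active() const { return m_active; }

private:
    int m_fd;
    std::uint64_t m_access;
    bool m_active;
};
```

A painter would create a `DmaBufCpuAccess` guard before touching the mapped pixels and let it go out of scope before attaching the buffer to the surface.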
Re: [Development] Using DMA instead of SHM in non OpenGL apps (Linux/Wayland)
Hi,

On 25/08/2023 12:34, Vlad Zahorodnii wrote:
>> I'm really curious here, and these aren't rhetorical questions: why
>> would anyone expect there to be a difference in performance, as far as
>> QPainter is concerned? Isn't it ultimately just using a CPU-based
>> renderer onto a block of memory? Why should it make a difference where
>> that memory comes from / how it's managed / etc.? Are we talking about
>> "far memory" (NUMA-like) scenarios?
>
> It makes a difference to the compositor. The compositor will have to
> upload pixel data from RAM to VRAM so it can composite the windows using
> OpenGL or Vulkan. If the client provides dmabuf client buffers, the
> compositor can skip the upload step and thus reduce the time it takes
> to compose a frame.

Thanks, this part was OK with me. I was confused by the previous emails possibly implying that QPainter-based painting *alone* was making a difference between DMA and SHM, and couldn't understand why.

--
Giuseppe D'Angelo | giuseppe.dang...@kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - The Qt, C++ and OpenGL Experts
Re: [Development] Using DMA instead of SHM in non OpenGL apps (Linux/Wayland)
On 8/25/23 12:12, Giuseppe D'Angelo via Development wrote:
> On 24/08/2023 21:37, Eduardo Hopperdietzel wrote:
>> The results show that there's no significant difference in the time it
>> takes for read and write operations using QPainter in SHM and DMA maps.
>
> I'm really curious here, and these aren't rhetorical questions: why
> would anyone expect there to be a difference in performance, as far as
> QPainter is concerned? Isn't it ultimately just using a CPU-based
> renderer onto a block of memory? Why should it make a difference where
> that memory comes from / how it's managed / etc.? Are we talking about
> "far memory" (NUMA-like) scenarios?

It makes a difference to the compositor. The compositor will have to upload pixel data from RAM to VRAM so it can composite the windows using OpenGL or Vulkan. If the client provides dmabuf client buffers, the compositor can skip the upload step and thus reduce the time it takes to compose a frame.

Regards,
Vlad
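To give a rough sense of the upload being skipped, here is a small worked calculation. It is a sketch under stated assumptions: 4 bytes per pixel, tightly packed rows with no stride padding, and the worst case where the whole surface is damaged every frame (compositors normally upload only damaged regions). The function names are illustrative only.

```cpp
#include <cstdint>

// Bytes in one full frame of a tightly packed buffer.
constexpr std::uint64_t frameBytes(std::uint32_t width, std::uint32_t height,
                                   std::uint32_t bytesPerPixel = 4)
{
    return std::uint64_t{width} * height * bytesPerPixel;
}

// Bytes per second copied from RAM to VRAM if every frame is fully
// damaged and uploaded at the given frame rate.
constexpr std::uint64_t uploadBytesPerSecond(std::uint32_t width,
                                             std::uint32_t height,
                                             std::uint32_t fps,
                                             std::uint32_t bytesPerPixel = 4)
{
    return frameBytes(width, height, bytesPerPixel) * fps;
}

// 3840 x 2160 x 4 bytes = 33,177,600 bytes per frame.
static_assert(frameBytes(3840, 2160) == 33177600ULL, "4K frame size");
// At 60 frames per second that is 1,990,656,000 bytes per second.
static_assert(uploadBytesPerSecond(3840, 2160, 60) == 1990656000ULL,
              "4K upload rate at 60 fps");
```

Under these assumptions a full-screen 4K window costs the compositor close to 2 GB/s of copies in the worst case, which is the work a dmabuf client buffer would let it avoid.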
Re: [Development] Using DMA instead of SHM in non OpenGL apps (Linux/Wayland)
On 24/08/2023 21:37, Eduardo Hopperdietzel wrote:
> The results show that there's no significant difference in the time it
> takes for read and write operations using QPainter in SHM and DMA maps.

I'm really curious here, and these aren't rhetorical questions: why would anyone expect there to be a difference in performance, as far as QPainter is concerned? Isn't it ultimately just using a CPU-based renderer onto a block of memory? Why should it make a difference where that memory comes from / how it's managed / etc.? Are we talking about "far memory" (NUMA-like) scenarios?

Thank you,
--
Giuseppe D'Angelo | giuseppe.dang...@kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - The Qt, C++ and OpenGL Experts
Re: [Development] Using DMA instead of SHM in non OpenGL apps (Linux/Wayland)
On Thu, 24 Aug 2023 15:37:05 -0400 Eduardo Hopperdietzel wrote:
> Hi David,
>
> I've made a little Wayland app that uses both SHM and DMA, and I
> tested it on Weston, Sway, and my own compositor. I also tried it on
> three different machines: two with Intel i7 CPUs and one with a
> smaller ARM CPU. These machines had Intel Iris Pro, Nvidia GT525M,
> and Mali-400 GPUs, respectively.
>
> Here's the code and results for one of the machines:
>
> https://github.com/ehopperdietzel/QPainter-SHM-DMA-Benchmark
>
> The results show that there's no significant difference in the time it
> takes for read and write operations using QPainter in SHM and DMA
> maps. It seems like DMA I/O operations are handled asynchronously by
> the kernel. The most noticeable improvement is on the compositor
> side. When using DMA, the experience feels much smoother, especially
> when moving other windows while the benchmark is running on
> single-threaded compositors like Weston. There's also a slight
> increase in the number of frame callbacks returned by the compositors
> when using DMA, though it doesn't significantly boost the overall FPS.
>
> However, there are challenges with implementing DMA:
>
> 1. There does not seem to be a standard method to create DMA buffers in
> userspace. I tried creating a GBM bo, obtaining a PRIME fd, and
> mapping it, but this isn't supported by all GPUs/drivers. For
> instance, it didn't work with the Mali GPU using the Lima driver. I
> also experimented with DMA-BUF heaps, but driver support does not
> seem to be consistent across all distributions, and accessing
> /dev/dma_heap/* often requires superuser privileges.
>
> 2. When using DMA, triple buffering is necessary; otherwise,
> compositors only display partial buffer updates. This could
> potentially be avoided by using DMA fencing mechanisms (like EGL does
> under the hood) and protocols like this one:
>
> https://wayland.app/protocols/linux-explicit-synchronization-unstable-v1
>
> But it seems that not many compositors have implemented it.
>
> To sum it up, while DMA does offer a performance boost, it's not
> without its issues:
>
> - DMA's effectiveness varies depending on hardware.
> - Implementing DMA can be complex.
> - The performance gains might not justify the effort.
>
> So, as you mentioned earlier, it's probably best to stick with SHM
> and let the compositor handle uploads using DMA, preferably
> asynchronously.
>
> Cheers,
>
> Eduardo Hopperdietzel

I wonder whether this would help with FramelessWindowHint artifacts on Debian 10? Currently SHM doesn't work correctly on Debian 10, and one has to create a child QOpenGLWidget for the artifacts to disappear.
[Development] Using DMA instead of SHM in non OpenGL apps (Linux/Wayland)
Hi David,

I've made a little Wayland app that uses both SHM and DMA, and I tested it on Weston, Sway, and my own compositor. I also tried it on three different machines: two with Intel i7 CPUs and one with a smaller ARM CPU. These machines had Intel Iris Pro, Nvidia GT525M, and Mali-400 GPUs, respectively.

Here's the code and results for one of the machines:

https://github.com/ehopperdietzel/QPainter-SHM-DMA-Benchmark

The results show that there's no significant difference in the time it takes for read and write operations using QPainter in SHM and DMA maps. It seems like DMA I/O operations are handled asynchronously by the kernel. The most noticeable improvement is on the compositor side. When using DMA, the experience feels much smoother, especially when moving other windows while the benchmark is running on single-threaded compositors like Weston. There's also a slight increase in the number of frame callbacks returned by the compositors when using DMA, though it doesn't significantly boost the overall FPS.

However, there are challenges with implementing DMA:

1. There does not seem to be a standard method to create DMA buffers in userspace. I tried creating a GBM bo, obtaining a PRIME fd, and mapping it, but this isn't supported by all GPUs/drivers. For instance, it didn't work with the Mali GPU using the Lima driver. I also experimented with DMA-BUF heaps, but driver support does not seem to be consistent across all distributions, and accessing /dev/dma_heap/* often requires superuser privileges.

2. When using DMA, triple buffering is necessary; otherwise, compositors only display partial buffer updates. This could potentially be avoided by using DMA fencing mechanisms (like EGL does under the hood) and protocols like this one:

https://wayland.app/protocols/linux-explicit-synchronization-unstable-v1

But it seems that not many compositors have implemented it.

To sum it up, while DMA does offer a performance boost, it's not without its issues:

- DMA's effectiveness varies depending on hardware.
- Implementing DMA can be complex.
- The performance gains might not justify the effort.

So, as you mentioned earlier, it's probably best to stick with SHM and let the compositor handle uploads using DMA, preferably asynchronously.

Cheers,
Eduardo Hopperdietzel
Re: [Development] Using DMA instead of SHM in non OpenGL apps (Linux/Wayland)
On Fri, Aug 18, 2023 at 8:18 PM Eduardo Hopperdietzel wrote:
>
> Hi David,
>
> That's a very good point I hadn't thought about. I will create a testing
> Wayland client benchmark and measure the time it takes for QPainter to perform
> different drawing operations using both SHM and DMA. I'll also test buffer
> resizing and measure the overall (client/compositor) performance by counting
> the number of frame callbacks (FPS) returned by different compositors.
>
> To ensure the benchmark provides a representative performance evaluation for
> Qt, I would appreciate it if you could clarify the following questions:
>
> 1. Does Qt respect the wl_surface frame callbacks sent by the compositor, or
> does it simply draw as many frames as it can?

Yes to both :)

If you use the normal loop of QWidget::update, you'll get a paintEvent for each frame callback.

If you use QWidget::repaint (which the docs advise against using), you'll blast out loads of frames and then block once you hit a hardcoded limit, until buffers are released.

> 2. When using SHM, does Qt reuse the same buffer on a wl_surface if it
> receives a wl_buffer release event before a wl_surface frame callback? And
> does it use more than one otherwise?

There is a pool that is re-used. It's double buffered at a minimum but scales to handle the repaint case above.

> 3. If DMA was implemented, I suppose double buffering would be mandatory, or
> should I consider triple buffering or more?

The logic would be the same.

David
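The pool behaviour described above (two buffers at a minimum, growing under load, blocking at a limit until a release arrives) can be sketched roughly as follows. This is an illustration only, not Qt's code: `Buffer`, `BufferPool`, and the limit of 4 are assumptions, and a plain byte vector stands in for the mapped wl_buffer storage.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

struct Buffer {
    std::vector<unsigned char> pixels; // stand-in for mapped buffer memory
    bool busy = false;                 // true while the compositor holds it
};

class BufferPool {
public:
    explicit BufferPool(std::size_t bytesPerBuffer,
                        std::size_t minBuffers = 2,
                        std::size_t maxBuffers = 4) // limit is an assumption
        : m_bytes(bytesPerBuffer), m_max(maxBuffers)
    {
        for (std::size_t i = 0; i < minBuffers; ++i)
            grow();
    }

    // Returns a free buffer and marks it busy. Grows the pool if every
    // buffer is in use; returns nullptr at the limit, which is where a
    // real client would wait for a release event.
    Buffer* acquire()
    {
        for (auto& b : m_buffers) {
            if (!b->busy) {
                b->busy = true;
                return b.get();
            }
        }
        if (m_buffers.size() >= m_max)
            return nullptr;
        Buffer* b = grow();
        b->busy = true;
        return b;
    }

    // Called when the compositor signals that it no longer reads the buffer.
    void release(Buffer* b)
    {
        if (b)
            b->busy = false;
    }

    std::size_t size() const { return m_buffers.size(); }

private:
    Buffer* grow()
    {
        m_buffers.push_back(std::make_unique<Buffer>());
        m_buffers.back()->pixels.resize(m_bytes);
        return m_buffers.back().get();
    }

    std::size_t m_bytes;
    std::size_t m_max;
    std::vector<std::unique_ptr<Buffer>> m_buffers;
};
```

With frame-callback-driven painting the pool would normally stay at two buffers; only the repaint-style flood pushes it toward the limit.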
[Development] Using DMA instead of SHM in non OpenGL apps (Linux/Wayland)
Hi David,

That's a very good point I hadn't thought about. I will create a testing Wayland client benchmark and measure the time it takes for QPainter to perform different drawing operations using both SHM and DMA. I'll also test buffer resizing and measure the overall (client/compositor) performance by counting the number of frame callbacks (FPS) returned by different compositors.

To ensure the benchmark provides a representative performance evaluation for Qt, I would appreciate it if you could clarify the following questions:

1. Does Qt respect the wl_surface frame callbacks sent by the compositor, or does it simply draw as many frames as it can?

2. When using SHM, does Qt reuse the same buffer on a wl_surface if it receives a wl_buffer release event before a wl_surface frame callback? And does it use more than one otherwise?

3. If DMA was implemented, I suppose double buffering would be mandatory, or should I consider triple buffering or more?

Cheers,
Eduardo Hopperdietzel
Re: [Development] Using DMA instead of SHM in non OpenGL apps (Linux/Wayland)
> Do you foresee any potential issues with this approach? Please feel free to
> share your thoughts.

The relevant kwin expert, Xaver Hugl, said in a chat about this topic:

"While the overhead on the compositor side would be lower, rendering into a dmabuf with the CPU is pretty slow, especially on dedicated GPUs and especially with QPainter"

This means there's a strong chance this doesn't have the performance boost that it appears to have on paper. Rendering into shared memory is as fast as rendering into any application-local memory.

The slow dmabuf rendering is potentially avoidable with a shadow buffer and then an upload at the end, but then we're just moving work to the client rather than saving work.

I would suggest getting some test benchmarks ahead of time; it's a lot of code for something that might not pay off. We'll also need to ensure we do fairly extensive real-world benchmarks before landing a final merge request.

David Edmundson
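The shadow-buffer idea mentioned above can be sketched as follows. This is a minimal illustration, assuming 32-bit pixels and a target pointer that would in practice be the mapped dmabuf; `ShadowBuffer` is a hypothetical name. The single copy in `flush()` is exactly the work that moves from the compositor to the client.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

class ShadowBuffer {
public:
    // `target` must point to at least `pixelCount` writable pixels and
    // must outlive this object.
    ShadowBuffer(std::uint32_t* target, std::size_t pixelCount)
        : m_target(target), m_shadow(pixelCount, 0) {}

    // All drawing goes to ordinary application-local memory.
    std::uint32_t* data() { return m_shadow.data(); }
    std::size_t pixelCount() const { return m_shadow.size(); }

    // One linear copy into the target once the frame is complete.
    void flush()
    {
        std::memcpy(m_target, m_shadow.data(),
                    m_shadow.size() * sizeof(std::uint32_t));
    }

private:
    std::uint32_t* m_target;
    std::vector<std::uint32_t> m_shadow;
};
```

A linear write like this tends to suit write-combined mappings better than the scattered reads and writes of a painter, but whether it is a net win is what the suggested benchmarks would have to show.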
[Development] Using DMA instead of SHM in non OpenGL apps (Linux/Wayland)
Hello Kai,

Thank you for your reply. After following the links you provided, I believe that implementing the DMA feature should take place within the following file:

https://code.qt.io/cgit/qt/qtwayland.git/tree/src/client/qwaylandshmbackingstore.cpp

Here's my proposed plan for this implementation:

1. Start by checking whether the compositor supports the wl_drm and linux-dmabuf protocols.
2. Manually implement both protocols, without relying on the wayland-egl implementation.
3. Authenticate a DRM device fd through the wl_drm protocol.
4. Verify that the compositor supports the DRM_FORMAT_MOD_LINEAR modifier using the linux-dmabuf protocol.
5. Create a LINEAR DMA buffer using the GBM library. I propose making this buffer relatively large, so that it does not have to be destroyed and recreated frequently when the window is resized.
6. Perform mmap on the DMA buffer.
7. Wrap the mapped buffer with a QImage, following a similar approach to the one used with shared memory.

If any of these steps fail, fall back to shared memory.

Do you foresee any potential issues with this approach? Please feel free to share your thoughts.

Best regards,
Eduardo Hopperdietzel
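The "try each step, otherwise fall back to shared memory" structure of the plan can be sketched as below. The names `SetupStep`, `BackingStore`, and `chooseBackingStore` are hypothetical, and the steps are plain callables so that no Wayland or GBM calls are invented here; releasing resources acquired by earlier steps is left out.

```cpp
#include <functional>
#include <string>
#include <vector>

enum class BackingStore { DmaBuf, Shm };

// One step of the plan, e.g. "probe protocols" or "mmap buffer".
struct SetupStep {
    std::string name;
    std::function<bool()> run; // returns false if the step failed
};

// Runs the steps in order and stops at the first failure. DmaBuf is
// chosen only if every step succeeds; otherwise the result is Shm and
// `failedStep` (if given) receives the name of the step that failed.
BackingStore chooseBackingStore(const std::vector<SetupStep>& steps,
                                std::string* failedStep = nullptr)
{
    for (const auto& step : steps) {
        if (!step.run()) {
            if (failedStep)
                *failedStep = step.name;
            return BackingStore::Shm;
        }
    }
    return BackingStore::DmaBuf;
}
```

Recording which step failed would make it easier to report why a given driver, such as one without LINEAR support, ended up on the SHM path.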
Re: [Development] Using DMA instead of SHM in non OpenGL apps (Linux/Wayland)
Hi,

Qt Wayland Client currently relies on EGL through wayland-egl integration to do the right thing. On Mesa, for instance, Qt indeed transparently uses zwp_linux_dmabuf_v1.

However, I agree that Qt itself should have an implementation of the aforementioned protocol on the *client* side, too (Qt Wayland Compositor supports clients talking dmabuf to it), for platforms where this is not done under the hood.

Code that could serve as an inspiration is the compositor dmabuf-v1 implementation [1], a client side integration [2] (this is what you want to write, just using dmabuf instead), and a weston example implementation [3].

Cheers
Kai Uwe

[1] https://code.qt.io/cgit/qt/qtwayland.git/tree/src/hardwareintegration/compositor/linux-dmabuf-unstable-v1
[2] https://code.qt.io/cgit/qt/qtwayland.git/tree/src/hardwareintegration/compositor/wayland-egl
[3] https://gitlab.freedesktop.org/wayland/weston/-/blob/main/clients/simple-dmabuf-egl.c
[Development] Using DMA instead of SHM in non OpenGL apps (Linux/Wayland)
Hello,

I've noticed that Qt currently uses shared memory for buffer sharing with Wayland compositors in non-OpenGL rendering applications. However, a more efficient approach would be to use the zwp_linux_dmabuf_v1 protocol. By creating a DMA buffer, mapping it, and performing CPU rendering there, the compositor could import the buffers directly into the GPU, eliminating the need for extra copies. This optimization would significantly improve performance, particularly when scaling large windows on HiDPI displays.

The implementation on the Qt side should be almost identical to using shared memory (I think). I would be willing to implement this, and I would appreciate it if you could guide me to the code that handles this. I assume I should look into the code of the Wayland platform plugin?

Best regards,
Eduardo Hopperdietzel
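To illustrate why the two paths could look alike on the client side, here is a minimal sketch of the shared-memory case: an anonymous file is mapped and a CPU routine writes pixels through a plain pointer. The helper names and the /tmp template are assumptions made for this example, not Qt code. In a real client the fd would then be handed to the compositor (for SHM via wl_shm_create_pool), while a dmabuf path would obtain its fd and mapping differently but could reuse the same drawing routine.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

struct MappedPixels {
    int fd = -1;
    std::uint32_t* data = nullptr;
    std::size_t bytes = 0;
};

// Creates an unlinked temporary file sized for width x height 32-bit
// pixels and maps it shared. On failure, `data` stays nullptr.
MappedPixels mapSharedPixels(std::uint32_t width, std::uint32_t height)
{
    MappedPixels m;
    char path[] = "/tmp/shm-sketch-XXXXXX"; // hypothetical template
    int fd = mkstemp(path);
    if (fd < 0)
        return m;
    unlink(path); // the file lives on only through the fd

    const std::size_t bytes = static_cast<std::size_t>(width) * height * 4;
    if (ftruncate(fd, static_cast<off_t>(bytes)) != 0) {
        close(fd);
        return m;
    }
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return m;
    }
    m.fd = fd;
    m.data = static_cast<std::uint32_t*>(p);
    m.bytes = bytes;
    return m;
}

// The "renderer": it only sees a pointer and does not know (or care)
// whether the memory is SHM, a mapped dmabuf, or a local array.
void fillArgb(std::uint32_t* pixels, std::size_t count, std::uint32_t argb)
{
    std::fill_n(pixels, count, argb);
}

void unmapPixels(MappedPixels& m)
{
    if (m.data)
        munmap(m.data, m.bytes);
    if (m.fd >= 0)
        close(m.fd);
    m = MappedPixels{};
}
```

The drawing routine is identical for both buffer types in this sketch; the differences lie in how the buffer is allocated, synchronized, and announced to the compositor.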