Does gbm_bo_map() implicitly synchronise?
I'm experimenting with DRI3 and its use of GBM to share buffers. It mostly works fine, but I'm seeing some issues that have me concerned there might be a synchronisation issue.

The documentation isn't entirely clear, so my question is: does gbm_bo_map() handle all the implicit synchronisation for me, or is there something more I need to do?

I tried doing gbm_bo_get_fd() followed by a select() and ioctl(DMA_BUF_IOCTL_SYNC), but my issue did not go away. Now I'm unsure if I'm doing it wrong, or if I'm chasing the wrong theory.

Anyone with insight on what's needed for stable synchronisation?

Regards,
--
Pierre Ossman           Software Development
Cendio AB               https://cendio.com
Teknikringen 8          https://twitter.com/ThinLinc
583 30 Linköping        https://facebook.com/ThinLinc
Phone: +46-13-214600

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
Re: Does gbm_bo_map() implicitly synchronise?
gbm_bo_map synchronizes if it needs to move memory to make the buffer readable by the CPU or if the buffer is being used/written by the GPU.

Marek

On Sat, Jun 15, 2024 at 1:12 AM Pierre Ossman wrote:
> The documentation isn't entirely clear, so my question is if
> gbm_bo_map() handles all the implicit synchronisation for me, or if
> there is something more I need to do?
Re: Does gbm_bo_map() implicitly synchronise?
On 15/06/2024 07:54, Marek Olšák wrote:
> gbm_bo_map synchronizes if it needs to move memory to make the buffer
> readable by the CPU or if the buffer is being used/written by the GPU.

Great, thanks! That means I need to look elsewhere for the source of my issue.

I was concerned that since I was accessing the data using gbm_bo_map(), rather than using OpenGL, I was missing out on some synchronisation step and getting data before the GPU had finished any queued rendering.

Regards
--
Pierre Ossman
Re: Does gbm_bo_map() implicitly synchronise?
It's probably driver-specific. Some drivers might need glFlush before you use gbm_bo_map because gbm might only wait for work that has been flushed.

Marek

On Sat, Jun 15, 2024 at 4:29 AM Pierre Ossman wrote:
> I was concerned that since I was accessing the data using gbm_bo_map(),
> rather than using OpenGL, I was missing out on some synchronisation
> step and getting data before the GPU had finished any queued rendering.
Re: Does gbm_bo_map() implicitly synchronise?
On 15/06/2024 13:35, Marek Olšák wrote:
> It's probably driver-specific. Some drivers might need glFlush before
> you use gbm_bo_map because gbm might only wait for work that has been
> flushed.

That would be needed on the "writing" side, right? So if I'm seeing issues when mapping for reading, then it would indicate a bug in the other peer? Which would be gnome-shell in my case.

Any way I could test this? Can I force extra syncs/flushes in some way and see if the issue goes away?

I tried adding a sleep of 10 ms before reading the data, but did not see any improvement. Which would make sense if the commands are still sitting in an application buffer somewhere, rather than with the GPU.

Regards
--
Pierre Ossman
Re: Does gbm_bo_map() implicitly synchronise?
Am 17.06.24 um 09:32 schrieb Pierre Ossman:
> That would be needed on the "writing" side, right? So if I'm seeing
> issues when mapping for reading, then it would indicate a bug in the
> other peer? Which would be gnome-shell in my case.
>
> Any way I could test this? Can I force extra syncs/flushes in some way
> and see if the issue goes away?

Well, the primary question here is: what do you want to wait for? As Marek wrote, GBM and the kernel can only see work which has been flushed, not work still queued up inside the OpenGL library, for example.

> I tried adding a sleep of 10 ms before reading the data, but did not
> see any improvement. Which would make sense if the commands are still
> sitting in an application buffer somewhere, rather than with the GPU.

Let me try to clarify a couple of things:

The DMA_BUF_IOCTL_SYNC function is to flush and invalidate caches so that the GPU can see values written by the CPU and the CPU can see values written by the GPU. But that IOCTL does *not* wait for any async GPU operation to finish.

If you want to wait for async GPU operations you either need to call the OpenGL functions to read pixels or do a select() (or poll, epoll etc...) call on the DMA-buf file descriptor.

So if you want to do some rendering with OpenGL and then see the result in a buffer memory mapping, the correct sequence would be the following:

1. Issue OpenGL rendering commands.
2. Call glFlush() to make sure the hw actually starts working on the rendering.
3. Call select() on the DMA-buf file descriptor to wait for the rendering to complete.
4. Use DMA_BUF_IOCTL_SYNC to make the rendering result CPU visible.

Regards,
Christian.
Re: Does gbm_bo_map() implicitly synchronise?
On 17/06/2024 10:13, Christian König wrote:
> The DMA_BUF_IOCTL_SYNC function is to flush and invalidate caches so
> that the GPU can see values written by the CPU and the CPU can see
> values written by the GPU. But that IOCTL does *not* wait for any async
> GPU operation to finish.
>
> If you want to wait for async GPU operations you either need to call
> the OpenGL functions to read pixels or do a select() (or poll, epoll
> etc...) call on the DMA-buf file descriptor.

Thanks for the clarification!

Just to avoid any uncertainty: are both of these things done implicitly by gbm_bo_map()/gbm_bo_unmap()?

I did test adding those steps just in case, but unfortunately did not see an improvement. My order was:

1. gbm_bo_import(GBM_BO_USE_RENDERING)
2. gbm_bo_get_fd()
3. Wait for client to request displaying the buffer
4. gbm_bo_map(GBM_BO_TRANSFER_READ)
5. select(fd+1, &fds, NULL, NULL, NULL)
6. ioctl(DMA_BUF_IOCTL_SYNC, &{ .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ })
7. pixman_blt()
8. gbm_bo_unmap()

> So if you want to do some rendering with OpenGL and then see the result
> in a buffer memory mapping, the correct sequence would be the
> following:
>
> 1. Issue OpenGL rendering commands.
> 2. Call glFlush() to make sure the hw actually starts working on the
>    rendering.
> 3. Call select() on the DMA-buf file descriptor to wait for the
>    rendering to complete.
> 4. Use DMA_BUF_IOCTL_SYNC to make the rendering result CPU visible.

What I want to do is implement the X server side of DRI3 in just CPU. It works for every application I've tested except gnome-shell.

I would assume that 1. and 2. are supposed to be done by the X client, i.e. gnome-shell?

What I need to be able to do is access the result of that, once the X client tries to draw using that GBM-backed pixmap (e.g. using PresentPixmap).

So far, we've only tested Intel GPUs, but we are setting up Nvidia and AMD GPUs at the moment. It will be interesting to see if the issue remains on those or not.
Regards
--
Pierre Ossman
Re: Does gbm_bo_map() implicitly synchronise?
Am 17.06.24 um 12:29 schrieb Pierre Ossman:
> Just to avoid any uncertainty: are both of these things done implicitly
> by gbm_bo_map()/gbm_bo_unmap()?

gbm_bo_map() is *not* doing any synchronization whatsoever as far as I know. It just does the steps necessary for the mmap().

> I did test adding those steps just in case, but unfortunately did not
> see an improvement. My order was:
>
> 1. gbm_bo_import(GBM_BO_USE_RENDERING)
> 2. gbm_bo_get_fd()
> 3. Wait for client to request displaying the buffer
> 4. gbm_bo_map(GBM_BO_TRANSFER_READ)
> 5. select(fd+1, &fds, NULL, NULL, NULL)
> 6. ioctl(DMA_BUF_IOCTL_SYNC, &{ .flags = DMA_BUF_SYNC_START |
>    DMA_BUF_SYNC_READ })
> 7. pixman_blt()
> 8. gbm_bo_unmap()

At least offhand that looks like it should work.

> I would assume that 1. and 2. are supposed to be done by the X client,
> i.e. gnome-shell?

Yes, exactly that.

> What I need to be able to do is access the result of that, once the X
> client tries to draw using that GBM-backed pixmap (e.g. using
> PresentPixmap).

No idea why that doesn't work.

Regards,
Christian.
Re: Does gbm_bo_map() implicitly synchronise?
Am 17.06.24 um 16:50 schrieb Michel Dänzer:
> On 2024-06-17 12:29, Pierre Ossman wrote:
>> 4. gbm_bo_map(GBM_BO_TRANSFER_READ)
>> 5. select(fd+1, &fds, NULL, NULL, NULL)
>
> *If* select() is needed, it needs to be before gbm_bo_map(), because
> the latter may perform a blit from the real BO to a staging one for CPU
> access.

But don't you then need to wait for the blit to finish?

Regards,
Christian.
Re: Does gbm_bo_map() implicitly synchronise?
On 2024-06-17 12:29, Pierre Ossman wrote:
> Just to avoid any uncertainty, are both of these things done implicitly
> by gbm_bo_map()/gbm_bo_unmap()?
>
> I did test adding those steps just in case, but unfortunately did not
> see an improvement. My order was:
>
> 1. gbm_bo_import(GBM_BO_USE_RENDERING)
> 2. gbm_bo_get_fd()
> 3. Wait for client to request displaying the buffer
> 4. gbm_bo_map(GBM_BO_TRANSFER_READ)
> 5. select(fd+1, &fds, NULL, NULL, NULL)

*If* select() is needed, it needs to be before gbm_bo_map(), because the latter may perform a blit from the real BO to a staging one for CPU access.

> 6. ioctl(DMA_BUF_IOCTL_SYNC, &{ .flags = DMA_BUF_SYNC_START |
>    DMA_BUF_SYNC_READ })

gbm_bo_map() should do this internally if needed.

> 7. pixman_blt()
> 8. gbm_bo_unmap()

--
Earthling Michel Dänzer    | https://redhat.com
Libre software enthusiast  | Mesa and Xwayland developer
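Michel's correction amounts to swapping steps 4 and 5 in the list above. A sketch of the revised read path follows; the GBM calls are shown only as comments since they need a live device, dmabuf_wait() is a made-up helper name, and poll() is used instead of select() so it also works for descriptors above FD_SETSIZE:

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>

/* Wait for pending (flushed) GPU work on a dma-buf before CPU access.
 * POLLIN waits until the buffer is safe to read; POLLOUT waits until
 * it is also safe to write.  Do this *before* gbm_bo_map(), since the
 * map itself may blit to a staging buffer.  EINTR-safe. */
static int dmabuf_wait(int fd, short events /* POLLIN or POLLOUT */)
{
    struct pollfd pfd = { .fd = fd, .events = events };
    int ret;
    do {
        ret = poll(&pfd, 1, -1); /* no timeout */
    } while (ret < 0 && errno == EINTR);
    return ret < 0 ? -1 : 0;
}

/* Revised order for the reading side:
 *
 *   fd = gbm_bo_get_fd(bo);
 *   dmabuf_wait(fd, POLLIN);               // was step 5, now first
 *   map = gbm_bo_map(bo, ..., GBM_BO_TRANSFER_READ, ...);
 *   ... read pixels (pixman_blt) ...
 *   gbm_bo_unmap(bo, map);
 */
```

Per the follow-ups in this thread, the DMA_BUF_IOCTL_SYNC bracketing (step 6) is expected to be done by gbm_bo_map()/gbm_bo_unmap() internally when the driver needs it.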
Re: Does gbm_bo_map() implicitly synchronise?
Am 17.06.24 um 16:55 schrieb Michel Dänzer:
> On 2024-06-17 16:52, Christian König wrote:
>> Am 17.06.24 um 16:50 schrieb Michel Dänzer:
>>> *If* select() is needed, it needs to be before gbm_bo_map(), because
>>> the latter may perform a blit from the real BO to a staging one for
>>> CPU access.
>>
>> But don't you then need to wait for the blit to finish?
>
> No, gbm_bo_map() must handle that internally. When it returns, the CPU
> must see the correct contents.

Ah, ok. In that case that function does more than I expected.

Thanks,
Christian.
Re: Does gbm_bo_map() implicitly synchronise?
On 2024-06-17 16:52, Christian König wrote:
> But don't you then need to wait for the blit to finish?

No, gbm_bo_map() must handle that internally. When it returns, the CPU must see the correct contents.
Re: Does gbm_bo_map() implicitly synchronise?
On 17/06/2024 16:50, Michel Dänzer wrote:
> *If* select() is needed, it needs to be before gbm_bo_map(), because
> the latter may perform a blit from the real BO to a staging one for CPU
> access.

Can I know whether it is needed or not? Or should I be cautious and always do it?

I also assumed I should do select() with readfds set when I want to read, and writefds set when I want to write?

Still, after moving it before the map, the issue unfortunately remains. :/

A recording of the issue is available here, in case the behaviour rings a bell for anyone:

http://www.cendio.com/~ossman/dri3/Screencast%20from%202024-06-17%2017-06-50.webm

(I tried to include it as an attachment, but that email was filtered out somewhere.)

Regards,
--
Pierre Ossman
Re: Does gbm_bo_map() implicitly synchronise?
On 2024-06-17 17:27, Pierre Ossman wrote:
> Can I know whether it is needed or not? Or should I be cautious and
> always do it?

Assuming GBM in the X server uses the GPU HW driver, I'd say it shouldn't be needed.

> A recording of the issue is available here, in case the behaviour rings
> a bell for anyone:
>
> http://www.cendio.com/~ossman/dri3/Screencast%20from%202024-06-17%2017-06-50.webm

Interesting. Looks like the surroundings (drop shadow region?) of the window move along with it first, then the surroundings get fixed up in the next frame.

As far as I know, mutter doesn't move window contents like that on the client side; it always redraws the damaged output region from scratch. So I wonder if the initial move together with surroundings is actually a blit on the X server side (possibly triggered by mutter moving the X window in its function as window manager). And then the surroundings fixing themselves up is the correct output from mutter via DRI3/Present.

If so, the issue isn't synchronization, it's that the first blit happens at all.
Re: Does gbm_bo_map() implicitly synchronise?
On 17/06/2024 18:09, Michel Dänzer wrote:
> Assuming GBM in the X server uses the GPU HW driver, I'd say it
> shouldn't be needed.

It does not (except the driver libgbm loads). We're trying to use this in Xvnc, so it's all CPU. We're just trying to make sure the applications can use the full power of the GPU to render their stuff before handing it over to the X server. :)

> So I wonder if the initial move together with surroundings is actually
> a blit on the X server side (possibly triggered by mutter moving the X
> window in its function as window manager). And then the surroundings
> fixing themselves up is the correct output from mutter via
> DRI3/Present.
>
> If so, the issue isn't synchronization, it's that the first blit
> happens at all.

Hmm... The source of the blit is CopyWindow being called as a result of the window moving. But I would have expected that to be inhibited by the fact that a compositor is active.

It's also surprising that this only happens if DRI3 is involved. I would also have expected something similar with software rendering, albeit with a PutImage instead of PresentPixmap for the correct data. But everything works there.

I will need to dig further.

Regards,
--
Pierre Ossman
Re: Does gbm_bo_map() implicitly synchronise?
Am 17.06.24 um 19:18 schrieb Pierre Ossman:
> It does not (except the driver libgbm loads). We're trying to use this
> in Xvnc, so it's all CPU. We're just trying to make sure the
> applications can use the full power of the GPU to render their stuff
> before handing it over to the X server. :)

That whole approach won't work. When you don't have a HW driver loaded, or at least tell the client that it should render into a linear buffer somehow, then the data in the buffer will be tiled in a hw-specific format. As far as I know you can't read that in a vendor-agnostic way with the CPU; you need the hw driver for that.

Regards,
Christian.
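As a toy illustration of the tiling point: the tiled offsets below use a made-up 4x4 tile scheme, not any real GPU's layout. The point is just that a CPU reader assuming linear addressing computes the wrong byte offset for almost every pixel of a tiled buffer:

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of pixel (x, y) in a linear buffer, 4 bytes per pixel. */
static size_t linear_offset(unsigned x, unsigned y, unsigned stride)
{
    return (size_t)y * stride + (size_t)x * 4;
}

/* Byte offset in a hypothetical 4x4-tiled layout: tiles stored
 * row-major, pixels inside each tile row-major too.  Real hardware
 * uses far more elaborate, vendor-specific schemes. */
static size_t tiled_offset(unsigned x, unsigned y, unsigned width)
{
    const unsigned tile = 4;
    unsigned tiles_per_row = width / tile;
    unsigned tile_index = (y / tile) * tiles_per_row + (x / tile);
    unsigned in_tile = (y % tile) * tile + (x % tile);
    return ((size_t)tile_index * tile * tile + in_tile) * 4;
}
```

gbm_bo_map() on a hardware driver hides this by detiling or handing back a staging copy; with nothing but the CPU, a linear buffer is the only layout that can be read portably.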
Re: Does gbm_bo_map() implicitly synchronise?
On 17/06/2024 20:18, Christian König wrote:
> That whole approach won't work. When you don't have a HW driver loaded,
> or at least tell the client that it should render into a linear buffer
> somehow, then the data in the buffer will be tiled in a hw-specific
> format. As far as I know you can't read that in a vendor-agnostic way
> with the CPU; you need the hw driver for that.

I'm confused. What's the goal of the GBM abstraction, and specifically gbm_bo_map(), if it's not a hardware-agnostic way of accessing buffers?

In practice, we are getting linear buffers. At least on Intel and AMD GPUs. Nvidia are being a bit difficult getting GBM working, so we haven't tested that yet.

I see there is the GBM_BO_USE_LINEAR flag. We have not used it yet, as we haven't seen a need for it. What is the effect of that? Would it guarantee what we are just lucky to see at the moment?

Regards,
--
Pierre Ossman
Re: Does gbm_bo_map() implicitly synchronise?
Am 18.06.24 um 07:01 schrieb Pierre Ossman:
> I'm confused. What's the goal of the GBM abstraction, and specifically
> gbm_bo_map(), if it's not a hardware-agnostic way of accessing buffers?

There is no hardware-agnostic way of accessing buffers which contain hw-specific data. You always need a hw-specific backend for that, or use the linear flag, which makes the data hw-agnostic.

> In practice, we are getting linear buffers. At least on Intel and AMD
> GPUs. Nvidia are being a bit difficult getting GBM working, so we
> haven't tested that yet.

That's either because you have a linear buffer for some reason, or the hardware-specific gbm backend has inserted a blit as Michel described.

> I see there is the GBM_BO_USE_LINEAR flag. We have not used it yet, as
> we haven't seen a need for it. What is the effect of that? Would it
> guarantee what we are just lucky to see at the moment?

Michel and/or Marek need to answer that. I'm coming from the kernel side and maintaining the DMA-buf implementation backing all this, but I'm not an expert on gbm.

Regards,
Christian.
Re: Does gbm_bo_map() implicitly synchronise?
On 2024-06-17 19:18, Pierre Ossman wrote:
> On 17/06/2024 18:09, Michel Dänzer wrote:
>> Assuming GBM in the X server uses the GPU HW driver, I'd say it
>> shouldn't be needed.

Let me revise that statement: It shouldn't be needed, period. If llvmpipe needs it, it should happen as part of gbm_bo_map. (Not sure this is implemented at this time; I'd argue it's a Mesa bug if not, though.)

> It does not (except the driver libgbm loads). We're trying to use this
> in Xvnc, so it's all CPU.

Mesa's GBM backend (built into libgbm) is essentially a frontend for Gallium drivers. It initializes a suitable driver for the DRM fd passed to gbm_create_device. This could be the GPU HW driver, which might explain why the contents from gnome-shell are displayed correctly (eventually).

> We're just trying to make sure the applications can use the full power
> of the GPU to render their stuff before handing it over to the X
> server. :)

A note on architecture: Mutter supports running as a headless Wayland compositor, and supports remote desktop (including remote login as of GNOME 46) via gnome-remote-desktop and RDP. This allows both Wayland and X (via Xwayland) clients to run with full HW acceleration.
Re: Does gbm_bo_map() implicitly synchronise?
On 17/06/2024 19:18, Pierre Ossman wrote:
> Hmm... The source of the blit is CopyWindow being called as a result of
> the window moving. But I would have expected that to be inhibited by
> the fact that a compositor is active.
>
> I will need to dig further.

Well, this is embarrassing. The issue was not in GNOME, Mesa or Xorg. They rendered everything absolutely correctly.

The issue was in the VNC code, which didn't pay attention to the fact that the window was redirected, and so sent bogus rendering instructions to the VNC client. :/

With that fixed, everything renders perfectly fine!

Still, thank you for all the insight given regarding GBM!

Regards,
--
Pierre Ossman
Re: Does gbm_bo_map() implicitly synchronise?
On 17/6/24 12:29, Pierre Ossman wrote:
> What I want to do is implement the X server side of DRI3 in just CPU.
> It works for every application I've tested except gnome-shell.

You can have a look at the open MR we created two years ago for Xserver [1], "modesetting: Add DRI3 support to modesetting driver with glamor disabled". We are using it downstream for Raspberry Pi OS to enable GPU-accelerated client applications on RPi1-3, while the Xserver is using software composition with pixman.

[1] https://gitlab.freedesktop.org/xorg/xserver/-/merge_requests/945

We recently identified that it has an issue [2] with synchronization on the server side when, after glFlush() on the client side, the command list takes too long (several seconds) to finish the rendering.

[2] https://gitlab.freedesktop.org/mesa/mesa/-/issues/11228

Regards,
Chema Casanova
Re: Does gbm_bo_map() implicitly synchronise?
On 6/20/24 11:04, Chema Casanova wrote:
> You can have a look at the open MR we created two years ago for Xserver
> [1], "modesetting: Add DRI3 support to modesetting driver with glamor
> disabled".
>
> [1] https://gitlab.freedesktop.org/xorg/xserver/-/merge_requests/945

I did actually look at that to get some idea of how things are connected. But the comments suggested that the design wasn't robust, so we ended up trying a different approach.

Our work is now available in the latest TigerVNC beta, via this PR:

https://github.com/TigerVNC/tigervnc/pull/1771

> We recently identified that it has an issue [2] with synchronization on
> the server side when, after glFlush() on the client side, the command
> list takes too long (several seconds) to finish the rendering.
>
> [2] https://gitlab.freedesktop.org/mesa/mesa/-/issues/11228

Oh. I can try to test it here. We don't seem to have any synchronisation issues now that we got that VNC bug resolved.

The two big issues we have presently are the SIGBUS crash I opened a separate thread about, and getting glvnd to choose correctly when the Nvidia driver is used.

Regards,
--
Pierre Ossman
Re: Does gbm_bo_map() implicitly synchronise?
On 6/20/24 15:59, Pierre Ossman wrote:
>> We recently identified an issue [2] with synchronization on the
>> server side when, after glFlush() on the client side, the command
>> list takes too long (several seconds) to finish the rendering.
>>
>> [2] https://gitlab.freedesktop.org/mesa/mesa/-/issues/11228
>
> Oh. I can try to test it here. We don't seem to have any
> synchronisation issues now that we got that VNC bug resolved.

I just tested here, and could not see the issue with our implementation
on either an AMD iGPU or an Nvidia dGPU. They might be too fast to
trigger the issue? I have a Pi4 here as well, but it's not set up for
this yet.

Regards
--
Pierre Ossman
Re: Does gbm_bo_map() implicitly synchronise?
FWIW, the NVIDIA binary driver's implementation of
gbm_bo_map()/gbm_bo_unmap():

1) Doesn't do any synchronization against in-flight work. The
assumption is that if the content is going to be read, the API writing
the data has established that coherence. Likewise, if it's going to be
written, the API reading it afterwards does any invalidates or whatever
are needed for coherence.

2) Doesn't blit anything or format convert, because our GBM
implementation has no DMA engine access, and I'd like to keep it that
way. Setting up a DMA-capable driver instance is much more expensive in
terms of runtime resources than setting up a simple allocator+mmap
driver, at least in our driver architecture. Our GBM map just does an
mmap(), and if it's not linear, you're not going to be able to
interpret the data unless you've read up on our tiling formats. I'm
aware this is different from Mesa, and no one has complained thus far.
If we were forced to fix it, I imagine we'd do something like ask a
shared engine in the kernel to do the blit on userspace's behalf, which
would probably be slow but save resources.

Basically, don't use gbm_bo_map() for anything non-trivial on our
implementation. It's not the right tool for, e.g., reading back or
populating OpenGL textures or X pixmaps. If you don't want to run on
the NV implementation, feel free to ignore this advice, but I'd still
suggest it's not the best tool for most jobs.

Thanks,
-James

On 6/17/24 03:29, Pierre Ossman wrote:
> On 17/06/2024 10:13, Christian König wrote:
>> Let me try to clarify a couple of things:
>>
>> The DMA_BUF_IOCTL_SYNC function is to flush and invalidate caches so
>> that the GPU can see values written by the CPU and the CPU can see
>> values written by the GPU.
>>
>> But that IOCTL does *not* wait for any async GPU operation to finish.
>>
>> If you want to wait for async GPU operations you either need to call
>> the OpenGL functions to read pixels or do a select() (or poll, epoll
>> etc...) call on the DMA-buf file descriptor.
>
> Thanks for the clarification!
>
> Just to avoid any uncertainty, are both of these things done
> implicitly by gbm_bo_map()/gbm_bo_unmap()?
>
> I did test adding those steps just in case, but unfortunately did not
> see an improvement. My order was:
>
> 1. gbm_bo_import(GBM_BO_USE_RENDERING)
> 2. gbm_bo_get_fd()
> 3. Wait for client to request displaying the buffer
> 4. gbm_bo_map(GBM_BO_TRANSFER_READ)
> 5. select(fd+1, &fds, NULL, NULL, NULL)
> 6. ioctl(DMA_BUF_IOCTL_SYNC, &{ .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ })
> 7. pixman_blt()
> 8. gbm_bo_unmap()
>
>> So if you want to do some rendering with OpenGL and then see the
>> result in a buffer memory mapping, the correct sequence would be the
>> following:
>>
>> 1. Issue OpenGL rendering commands.
>> 2. Call glFlush() to make sure the hw actually starts working on the
>>    rendering.
>> 3. Call select() on the DMA-buf file descriptor to wait for the
>>    rendering to complete.
>> 4. Use DMA_BUF_IOCTL_SYNC to make the rendering result CPU visible.
>
> What I want to do is implement the X server side of DRI3 in just CPU.
> It works for every application I've tested except gnome-shell.
>
> I would assume that 1. and 2. are supposed to be done by the X client,
> i.e. gnome-shell? What I need to be able to do is access the result of
> that, once the X client tries to draw using that GBM backed pixmap
> (e.g. using PresentPixmap).
>
> So far, we've only tested Intel GPUs, but we are setting up Nvidia and
> AMD GPUs at the moment. It will be interesting to see if the issue
> remains on those or not.
>
> Regards
Re: Does gbm_bo_map() implicitly synchronise?
On 24.06.24 21:08, James Jones wrote:
> FWIW, the NVIDIA binary driver's implementation of
> gbm_bo_map()/gbm_bo_unmap():
>
> 1) Doesn't do any synchronization against in-flight work. The
> assumption is that if the content is going to be read, the API writing
> the data has established that coherence. Likewise, if it's going to be
> written, the API reading it afterwards does any invalidates or
> whatever are needed for coherence.

That matches my assumption of what this function does, but it is just
the opposite of what Michel explained it does.

Is it documented anywhere whether gbm_bo_map() should wait for
in-flight work or not?

Regards,
Christian.

> 2) Doesn't blit anything or format convert, because our GBM
> implementation has no DMA engine access, and I'd like to keep it that
> way. Setting up a DMA-capable driver instance is much more expensive
> in terms of runtime resources than setting up a simple allocator+mmap
> driver, at least in our driver architecture. Our GBM map just does an
> mmap(), and if it's not linear, you're not going to be able to
> interpret the data unless you've read up on our tiling formats. I'm
> aware this is different from Mesa, and no one has complained thus far.
> If we were forced to fix it, I imagine we'd do something like ask a
> shared engine in the kernel to do the blit on userspace's behalf,
> which would probably be slow but save resources.
>
> Basically, don't use gbm_bo_map() for anything non-trivial on our
> implementation. It's not the right tool for, e.g., reading back or
> populating OpenGL textures or X pixmaps. If you don't want to run on
> the NV implementation, feel free to ignore this advice, but I'd still
> suggest it's not the best tool for most jobs.
>
> Thanks,
> -James
Re: Does gbm_bo_map() implicitly synchronise?
On 2024-06-24 21:08, James Jones wrote:
> FWIW, the NVIDIA binary driver's implementation of
> gbm_bo_map()/gbm_bo_unmap():
>
> 1) Doesn't do any synchronization against in-flight work. The
> assumption is that if the content is going to be read, the API writing
> the data has established that coherence. Likewise, if it's going to be
> written, the API reading it afterwards does any invalidates or
> whatever are needed for coherence.
>
> 2) Doesn't blit anything or format convert, because our GBM
> implementation has no DMA engine access, and I'd like to keep it that
> way. Setting up a DMA-capable driver instance is much more expensive
> in terms of runtime resources than setting up a simple allocator+mmap
> driver, at least in our driver architecture. Our GBM map just does an
> mmap(), and if it's not linear, you're not going to be able to
> interpret the data unless you've read up on our tiling formats. I'm
> aware this is different from Mesa, and no one has complained thus far.

I've seen at least one webkitgtk issue report about gbm_bo_map not
working as intended with nvidia.

gbm_bo_map definitely has to handle tiling, that's one of its main
purposes. It also really has to handle implicit synchronization, since
there's no GBM API for explicit synchronization.

Just doing a direct mmap for gbm_bo_map can be bad for other reasons as
well. E.g. if the BO storage is in VRAM and the application does CPU
reads, it'll fall down a performance cliff.

--
Earthling Michel Dänzer       |       https://redhat.com
Libre software enthusiast     |   Mesa and Xwayland developer
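[Editorial note: a practical consequence of the tiling/stride point for anyone writing the CPU side of this: even when gbm_bo_map() does hand back CPU-readable data, the stride it reports is generally not width times bytes-per-pixel, so the copy-out must go row by row. A generic sketch of that copy loop follows; the helper is hypothetical, not GBM API.]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a w x h region of 4-byte pixels from a mapping whose rows are
 * src_stride bytes apart (as reported via gbm_bo_map()'s stride out
 * parameter) into a tightly packed destination buffer. The stride is
 * typically larger than w * 4 due to alignment or staging buffers. */
static void copy_mapped_rows(uint8_t *dst, const uint8_t *src,
                             size_t w, size_t h, size_t src_stride)
{
    for (size_t y = 0; y < h; y++)
        memcpy(dst + y * w * 4, src + y * src_stride, w * 4);
}
```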
Re: Does gbm_bo_map() implicitly synchronise?
On Tuesday, 2024-06-25 at 09:56 +0200, Michel Dänzer wrote:
> On 2024-06-24 21:08, James Jones wrote:
>> FWIW, the NVIDIA binary driver's implementation of
>> gbm_bo_map()/gbm_bo_unmap():
>>
>> 1) Doesn't do any synchronization against in-flight work. The
>> assumption is that if the content is going to be read, the API
>> writing the data has established that coherence. Likewise, if it's
>> going to be written, the API reading it afterwards does any
>> invalidates or whatever are needed for coherence.
>>
>> 2) Doesn't blit anything or format convert, because our GBM
>> implementation has no DMA engine access, and I'd like to keep it that
>> way. Setting up a DMA-capable driver instance is much more expensive
>> in terms of runtime resources than setting up a simple allocator+mmap
>> driver, at least in our driver architecture. Our GBM map just does an
>> mmap(), and if it's not linear, you're not going to be able to
>> interpret the data unless you've read up on our tiling formats. I'm
>> aware this is different from Mesa, and no one has complained thus
>> far.
>
> I've seen at least one webkitgtk issue report about gbm_bo_map not
> working as intended with nvidia.
>
> gbm_bo_map definitely has to handle tiling, that's one of its main
> purposes.

Unfortunately gbm_bo_map is severely underspecified in that regard.
Gallium drivers have always handled tiling, as the map is implemented
as a transfer, but i965 also didn't handle tiling and just returned a
mapping of the raw tiled storage.

> It also really has to handle implicit synchronization, since there's
> no GBM API for explicit synchronization.

One could demand that the caller does something like eglClientWaitSync
on a sync object fencing the hardware operations. Implicit sync on
gbm_bo_map is already kind of a gray area, as GBM uses a different
context to implement the transfer than the rendering API. It will only
synchronize with commands flushed from the rendering context. Anything
still buffered in the rendering context is invisible to GBM.

Again, none of this is really specified anywhere. But I guess most
users at this point assume the Mesa behavior and will break if another
implementation doesn't do the same.

Regards,
Lucas