Hey,

this issue has been haunting me for days now (freezes at random times).
I tried to pin it down to a function, or at least reproduce it reliably,
but without any luck. Here is some info from my side that may help to
find it.

Sorry for the AI slop:


We have been debugging this freeze extensively with bpftrace uprobe
instrumentation, MUTTER_DEBUG=kms,render,sync logging, GBM buffer tracking,
and an automated watchdog. Here is what we found, what we ruled out, and
what remains open.

System:
- Ubuntu 26.04
- GNOME 50 on Wayland
- GDM
- AMD Threadripper 3960X
- NVIDIA RTX 3060 Ti (GA104), proprietary driver 595.58.03
- 2x EIZO EV2785 4K via DisplayPort (through a KVM switch)


## 1. The freeze is a render-freeze, not a full deadlock

D-Bus remains responsive (gnome-shell Peer.Ping succeeds). The mainloop
keeps dispatching events. Only rendering is stuck. We detect this
automatically by monitoring the KMS thread's context switches via
/proc/PID/task/TID/schedstat — they stop entirely while the main thread
blocks in eglSwapBuffers.
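
A minimal sketch of this kind of check, for illustration only (it is not
the watchdog script itself). It assumes the third field of schedstat,
which counts the timeslices the thread has run, is a good enough proxy
for scheduling activity. StallDetector, max_idle_samples and the
sampling scheme are hypothetical names and choices:

```python
from pathlib import Path
from typing import Optional, Tuple


def parse_schedstat(text: str) -> Tuple[int, int, int]:
    """Return (run_ns, wait_ns, timeslices) from one schedstat line."""
    run_ns, wait_ns, timeslices = (int(field) for field in text.split())
    return run_ns, wait_ns, timeslices


def read_schedstat(pid: int, tid: int) -> Tuple[int, int, int]:
    """Read the per-thread scheduler statistics from procfs."""
    path = Path(f"/proc/{pid}/task/{tid}/schedstat")
    return parse_schedstat(path.read_text())


class StallDetector:
    """Flags a thread whose timeslice counter stops moving.

    Hypothetical helper: reports a stall once the counter has been
    unchanged for max_idle_samples consecutive samples.
    """

    def __init__(self, max_idle_samples: int = 5) -> None:
        self.max_idle_samples = max_idle_samples
        self.last: Optional[int] = None
        self.idle_samples = 0

    def update(self, timeslices: int) -> bool:
        if self.last is not None and timeslices == self.last:
            self.idle_samples += 1
        else:
            self.idle_samples = 0
        self.last = timeslices
        return self.idle_samples >= self.max_idle_samples
```

A polling loop would call update(read_schedstat(pid, tid)[2]) once per
interval and capture a backtrace as soon as it returns True.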


## 2. Backtrace

Main thread (gnome-shell):

    #5  pthread_cond_wait (mutex=0x5e7358810330)
    #7-#14  libEGL_nvidia.so.0 (8 frames, no debug symbols)
    #15 cogl_onscreen_egl_swap_buffers_with_damage  cogl-onscreen-egl.c:290
    #16 meta_onscreen_native_swap_buffers_with_damage  meta-onscreen-native.c:1814
    #17 cogl_onscreen_swap_buffers_with_damage
    #18 swap_framebuffer  meta-stage-impl.c:301
    #19 meta_stage_impl_redraw_view_primary
    #23 handle_frame_clock_frame  clutter-stage-view.c:1052
    #24 clutter_frame_clock_dispatch  clutter-frame-clock.c:1698
    #26 g_main_dispatch

Line 1814 is the parent_class->swap_buffers_with_damage() call, i.e. the
actual eglSwapBuffers. This is BEFORE sync_fd is assigned, so the
sync_fd = -1 seen in the local variables is just the initial value that
has not been overwritten yet, not a failed fence.

KMS thread: idle in ppoll -> g_main_loop_run -> meta_thread_impl_run
(waiting for events that never arrive)


## 3. What we ruled out

a) swap_failed is NOT the deadlock path.

We placed a uprobe at meta_onscreen_native_swap_buffers_with_damage+0x190
(the goto swap_failed target). During lock/unlock stress testing with GPU
load (glxgears), we triggered swap_failed 1,093 times in a single session
without a single deadlock. The swap_failed path exits BEFORE eglSwapBuffers
(the goto at line 1796 is taken before line 1811 is reached), so no buffer
is locked and eglSwapBuffers is never called. It is a safety mechanism,
not the bug.

b) No pageflip callbacks are lost.

We probed page_flip_feedback_discarded, page_flip_feedback_flipped,
crtc_page_flip_feedback_discarded, and crtc_page_flip_feedback_flipped.
After 36,000+ swaps: flips_discarded = 0. Every submitted pageflip gets
its callback.

c) GBM buffers do not leak during normal operation.

We traced gbm_surface_lock_front_buffer and gbm_surface_release_buffer.
The held buffer count stays between -1 and 2, including during lock/unlock
cycles. Every locked buffer is released. We have not yet observed the
buffer count during an actual freeze (see section 6).
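
The held count is plain +1/-1 accounting over the two probes. A sketch
of that accounting, assuming the tracer emits one function name per
event, in order (the input format and held_counts are illustrative, not
the bpftrace script used here):

```python
from typing import Iterable, List

LOCK = "gbm_surface_lock_front_buffer"
RELEASE = "gbm_surface_release_buffer"


def held_counts(events: Iterable[str]) -> List[int]:
    """Return the running number of held buffers after each event.

    Events other than the two GBM calls are ignored.
    """
    held = 0
    trace: List[int] = []
    for name in events:
        if name == LOCK:
            held += 1
        elif name == RELEASE:
            held -= 1
        else:
            continue
        trace.append(held)
    return trace
```

With this accounting, a count of -1 can occur if tracing attaches while
a buffer is already locked, so that the first event seen is a release.
Whether that explains the -1 observed here is an assumption.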

d) Notifications and window switching are not direct triggers.

Stress-testing with rapid notify-send and xdotool window movement did not
reproduce the freeze.


## 4. What we observed

a) maybe_post_next_frame has a high early-return rate.

About 40-60% of calls return in <5us without posting, because
posted_frame != NULL (line 1892). This suggests the system is regularly
at the edge of frame backpressure with dual 4K monitors.
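
For clarity on how such a rate can be derived: given per-call durations
of maybe_post_next_frame (for example from uprobe/uretprobe timestamp
pairs), count the calls that finish below the threshold. The helper
name and the input list are hypothetical; the 5us threshold is the one
quoted above:

```python
from typing import Iterable


def early_return_rate(durations_us: Iterable[float],
                      threshold_us: float = 5.0) -> float:
    """Fraction of calls that returned faster than threshold_us.

    Returns 0.0 for an empty sample instead of dividing by zero.
    """
    durations = list(durations_us)
    if not durations:
        return 0.0
    early = sum(1 for d in durations if d < threshold_us)
    return early / len(durations)
```

Duration alone is only a proxy for "returned without posting"; probing
the return site directly would be more precise.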

b) Screen lock triggers eglSetDamageRegion errors.

Each lock triggers a "Disabling '/dev/dri/card1'" + disable-device
transaction (all CRTCs set ACTIVE=0). About 3 seconds later (monitor
DPMS recovery time), eglSetDamageRegion errors appear at ~16ms intervals.
Most lock/unlock cycles recover from these errors.

c) In one freeze captured with MUTTER_DEBUG=kms, CRTC 63 stopped
receiving commits while CRTC 82 continued.

The last CRTC 63 activity was:

    21:40:44.078438  Posting primary plane composite update for CRTC 63
    21:40:44.094170  Page flip callback for CRTC (63)  <- last ever

After this, only CRTC 82 received commits. 57 seconds later, the main
thread deadlocked in eglSwapBuffers.
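
Finding the last activity per CRTC in such a log can be automated. A
sketch, assuming each line starts with an HH:MM:SS.micro timestamp and
names the CRTC as either "CRTC 63" or "CRTC (63)", as in the excerpt
above (real journal output may carry extra prefixes that need stripping
first; last_activity is a hypothetical helper):

```python
import re
from typing import Dict, Iterable

# Timestamp at line start, then any text, then "CRTC 63" or "CRTC (63)".
_CRTC_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d+)\s+.*\bCRTC \(?(\d+)\)?")


def last_activity(lines: Iterable[str]) -> Dict[int, str]:
    """Map each CRTC id to the timestamp of its last log line.

    Lines are assumed to be in chronological order, so a later match
    simply overwrites an earlier one.
    """
    last: Dict[int, str] = {}
    for line in lines:
        match = _CRTC_RE.match(line.strip())
        if match:
            last[int(match.group(2))] = match.group(1)
    return last
```

A CRTC whose last timestamp is far older than the others is the one
that stopped receiving commits.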

d) MUTTER_DEBUG=kms logging itself may increase freeze likelihood.

The freeze we captured with KMS logging occurred during normal use. When
we later tried to reproduce with lock/unlock stress testing (without KMS
logging), we could not trigger a freeze despite 1,093 swap_failed events.
The per-frame I/O from KMS debug logging may slow the frame pipeline
enough to widen the race window.


## 5. Relationship to the open MRs

- MR !5008 (swap_failed cleanup): Would not fix our deadlock. swap_failed
  fires before eglSwapBuffers. We hit it 1,093 times without deadlocking.

- MR !4512 (don't orphan superseded updates): We initially suspected GBM
  buffer leaks via clear_superseded_frame, but buffer tracing during
  lock/unlock shows no leak. Cannot confirm or deny relevance without
  capturing a freeze with the GBM tracer running.

- MR !5003 (guard early swap bail behind render_source check): Only
  affects systems without render_source. Our system has render_source
  (NVIDIA with sync_fd support), so this MR would not change behavior
  for us.


## 6. Current status

We are running continuously with:
- MUTTER_DEBUG=kms,render,sync (to slow the pipeline and capture state)
- bpftrace tracing gbm_surface_lock/release (to catch buffer leaks)
- Automated watchdog detecting stalled KMS thread (to capture backtrace)

Waiting for the next freeze to occur during normal use. When it does, we
will have:
- GBM buffer held count at the moment of deadlock
- Full KMS/render debug log from the journal
- GDB backtrace of all threads

This should tell us whether the deadlock is caused by GBM buffer
exhaustion on the Mutter side, or by an internal state issue in
libEGL_nvidia triggered by the CRTC disable/re-enable cycle.


## 7. Available data

Full GDB backtraces (all threads + kernel stacks), KMS debug logs, and
bpftrace data from multiple sessions are available on request.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2146782

Title:
  GNOME Shell blocks indefinitely in libEGL_nvidia.so.0 during
  cogl_onscreen_swap_buffers_with_damage() on Wayland