Hi!
I think we have a serious kernel bug that is related to, or located in,
drivers/gpu/drm/ttm/ttm_bo.c
My assumption is based on my recent system freezes with kernel 6.3.4,
which are accompanied by massive kernel error logs in journalctl. An
extract from the logs:
...
May 28 14:38:41 fedora.domain kernel: WARNING: CPU: 4 PID: 5523 at
drivers/gpu/drm/ttm/ttm_bo.c:326 ttm_bo_release+0x289/0x2e0 [ttm]
...
May 28 14:38:41 fedora.domain kernel: WARNING: CPU: 4 PID: 5523 at
drivers/gpu/drm/ttm/ttm_bo.c:327 ttm_bo_release+0x296/0x2e0 [ttm]
...
May 28 14:38:41 fedora.domain kernel: kernel BUG at
drivers/gpu/drm/ttm/ttm_bo.c:193!
...
The above information is more detailed than in most occurrences, and
it is the first occurrence that did not end in a freeze immediately or
a few seconds later. However, the corrupted state of the system
became apparent again when I tried to shut down some time after the above
errors:
...
May 28 14:51:09 fedora.domain kernel: #PF: error_code(0x0000) - not-present page
May 28 14:51:09 fedora.domain kernel: #PF: supervisor read access in kernel mode
May 28 14:51:09 fedora.domain kernel: BUG: unable to handle page fault for address: 0000003000300010
...
I have had this issue for quite some time, at least since 6.2.X.
You can find my bug report and many full logs from root's journalctl
(including the full logs of the above) at:
https://bugzilla.redhat.com/show_bug.cgi?id=2193110
Ignore the title and the initial comments of the bug report; the issue
is definitely not related to Firefox. If you want to focus on the kernel
error logs of 6.3.X, the last 5 comments should be sufficient.
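In case it helps with going through the full logs: below is a minimal
Python sketch (my own helper, not part of any existing tool) that picks
out the WARNING and BUG entries pointing into ttm_bo.c. The function name
find_ttm_hits is made up for illustration, and the sketch assumes that
each journal entry sits on a single line, as journalctl prints it (the
extracts in this mail are wrapped by the mail client).

```python
import re

# Matches kernel WARNING / BUG entries that point into ttm_bo.c and
# captures the reported source line number.
# Assumption: one journal entry per input line (no mail-style wrapping).
TTM_RE = re.compile(
    r"kernel: (WARNING|kernel BUG)\b.*?drivers/gpu/drm/ttm/ttm_bo\.c:(\d+)"
)


def find_ttm_hits(lines):
    """Return (kind, source_line) tuples for entries referencing ttm_bo.c."""
    hits = []
    for line in lines:
        match = TTM_RE.search(line)
        if match:
            hits.append((match.group(1), int(match.group(2))))
    return hits
```

It takes any iterable of lines, so it could be fed an open log file or
sys.stdin when piping the output of something like
"journalctl -k -b -1 --no-pager" into a small wrapper script.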
In addition to the journalctl error logs that I already linked in the
bug report, I tested 6.3.4 once again today with amd_pstate=active (by
default I am on amd_pstate=passive, which feels most stable on my
hardware); see
https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/retry6.3.4/fullSystemFreeze.kernel6.3.4.pstate-ACTIVE.log
(I have not yet put this into the bug report since I no longer assume it
is relevant)
Some other people from Fedora have experienced related issues; see the
comments on the test result pages in our update system:
https://bodhi.fedoraproject.org/updates/FEDORA-2023-514965dd8a (6.3.3 &
6.3.4)
https://bodhi.fedoraproject.org/updates/FEDORA-2023-26325e5399 (6.2.15)
-> I am quite sure I had already seen that issue before 6.2.15.
Maybe also related (but without explicit information referring to ttm_bo.c):
https://gitlab.freedesktop.org/drm/amd/-/issues/2548
https://gitlab.freedesktop.org/drm/amd/-/issues/2447
Let me know if you need more information or if I can help with testing.
My hardware: AMD Ryzen 6850 Pro. I have no dedicated graphics, only
the integrated AMD graphics of my Ryzen. I use Fedora 38 KDE, and
"cat /proc/sys/kernel/tainted" returns 0.
I will try updating my BIOS in the next few days, when I have time, to
see if that makes a difference, but given the logs I doubt it is related.
Regards,
Chris