https://bugs.freedesktop.org/show_bug.cgi?id=59069

--- Comment #38 from Matt Whitlock <freedesk...@mattwhitlock.name> ---
Created attachment 123753
  --> https://bugs.freedesktop.org/attachment.cgi?id=123753&action=edit
kernel log with ttm_validate, RT_FAULT, ZETA_FAULT, PAGE_NOT_PRESENT

The problems also started in earnest for me around the time I upgraded to
Plasma 5. Nouveau was never *stable* before then, but I was able to ignore its
errors for the most part. Now I can't go more than a few days without X
freezing or even the kernel panicking.

I do not believe the problems are triggered solely by plasmashell. I most
frequently see the "fail ttm_validate" message for kscreenlocker_greet while I
am away from my computer. I also very frequently see graphical corruption on
the lock screen in the border around my avatar.

There are other regressions too. I used to be able to use the XVideo output
module in VLC (in fact, it was the only one that was stable). Now, neither
XVideo nor OpenGL/GLX will run more than a few frames before the video freezes
and "fail ttm_validate" messages spew into the kernel log. The only VLC output
module that still gives me any stability is VDPAU, and only if I disable
hardware decoding. Even that will freeze X hard from time to time.

The "fail ttm_validate" messages are just the harbinger of impending doom. If I
continue without rebooting, eventually I'll be hit by an onslaught of much more
ominous errors. Here's a small sampling:

May 14 02:53:20 [kernel] nouveau 0000:01:00.0: fifo: CACHE_ERROR - ch 4 [chrome[21051]] subc 0 mthd 0060 data beef0201
May 14 03:43:11 [kernel] nouveau 0000:01:00.0: fifo: CACHE_ERROR - ch 6 [kwin_x11[2304]] subc 0 mthd 0060 data beef0201
May 14 05:06:50 [kernel] nouveau 0000:01:00.0: fifo: CACHE_ERROR - ch 1 [DRM] subc 0 mthd 0060 data 80000002

May 14 13:17:52 [kernel] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 0 - 00000040 [RT_FAULT] - Address 00204c7000
May 14 13:17:52 [kernel] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 0 - e0c: 00000000, e18: 00000000, e1c: 00000000, e20: 00001100, e24: 00030000
May 14 13:17:52 [kernel] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 1 - 00000040 [RT_FAULT] - Address 00204c8000
May 14 13:17:52 [kernel] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 1 - e0c: 00000000, e18: 00000000, e1c: 00000010, e20: 00001100, e24: 00030000
May 14 13:17:52 [kernel] nouveau 0000:01:00.0: gr: 00200000 [] ch 11 [000eeda000 plasmashell[2665]] subc 3 class 8297 mthd 1904 data 01000404
May 14 13:17:52 [kernel] nouveau 0000:01:00.0: fb: trapped write at 00204c8000 on channel 11 [0eeda000 plasmashell[2665]] engine 00 [PGRAPH] client 0b [PROP] subclient 00 [RT0] reason 00000002 [PAGE_NOT_PRESENT]
May 14 13:17:52 [kernel] nouveau 0000:01:00.0: fb: trapped write at 0020563800 on channel 2 [0fb2f000 X[2086]] engine 00 [PGRAPH] client 0b [PROP] subclient 08 [ZETA] reason 00000002 [PAGE_NOT_PRESENT]

May 14 13:18:01 [kernel] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 0 - 00000020 [ZETA_FAULT] - Address 002054b100
May 14 13:18:01 [kernel] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 0 - e0c: 00000000, e18: 00000000, e1c: 00040000, e20: 00020000, e24: 08030000
May 14 13:18:01 [kernel] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 1 - 00000040 [RT_FAULT] - Address 00204f1b00
May 14 13:18:01 [kernel] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 1 - e0c: 00000000, e18: 00000000, e1c: 006c0110, e20: 00001100, e24: 00030000
May 14 13:18:01 [kernel] nouveau 0000:01:00.0: gr: 00200000 [] ch 11 [000eeda000 plasmashell[2665]] subc 3 class 8297 mthd 1344 data 00004001
May 14 13:18:01 [kernel] nouveau 0000:01:00.0: fb: trapped write at 0020555b00 on channel 11 [0eeda000 plasmashell[2665]] engine 00 [PGRAPH] client 0b [PROP] subclient 08 [ZETA] reason 00000002 [PAGE_NOT_PRESENT]

Attached is the complete error log from this session.
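
For anyone sifting through the full log, the "fb: trapped write" entries can be
tallied by process, subclient, and reason with a short script. The following is
only a sketch: the regular expression is modeled solely on the three entries
quoted above, it assumes each entry sits on a single line as in the raw kernel
log, and the names TRAP_RE, summarize_traps, and SAMPLE are made up for
illustration. Other kernel versions may format these messages differently.

```python
import re
from collections import Counter

# Pattern modeled only on the "fb: trapped write" entries quoted in this
# comment (an assumption; other nouveau versions may print them differently).
TRAP_RE = re.compile(
    r"fb: trapped write at (?P<addr>[0-9a-f]+) "
    r"on channel (?P<chan>\d+) "
    r"\[[0-9a-f]+ (?P<proc>[^\[\]]+)\[(?P<pid>\d+)\]\] "
    r".*?subclient [0-9a-f]+ \[(?P<subclient>\w+)\] "
    r"reason [0-9a-f]+ \[(?P<reason>\w+)\]"
)


def summarize_traps(lines):
    """Count trapped writes per (process, subclient, reason) tuple."""
    counts = Counter()
    for line in lines:
        match = TRAP_RE.search(line)
        if match:
            key = (match["proc"], match["subclient"], match["reason"])
            counts[key] += 1
    return counts


# The three trapped-write entries from the excerpt, one entry per line.
SAMPLE = [
    "May 14 13:17:52 [kernel] nouveau 0000:01:00.0: fb: trapped write at "
    "00204c8000 on channel 11 [0eeda000 plasmashell[2665]] engine 00 [PGRAPH] "
    "client 0b [PROP] subclient 00 [RT0] reason 00000002 [PAGE_NOT_PRESENT]",
    "May 14 13:17:52 [kernel] nouveau 0000:01:00.0: fb: trapped write at "
    "0020563800 on channel 2 [0fb2f000 X[2086]] engine 00 [PGRAPH] "
    "client 0b [PROP] subclient 08 [ZETA] reason 00000002 [PAGE_NOT_PRESENT]",
    "May 14 13:18:01 [kernel] nouveau 0000:01:00.0: fb: trapped write at "
    "0020555b00 on channel 11 [0eeda000 plasmashell[2665]] engine 00 [PGRAPH] "
    "client 0b [PROP] subclient 08 [ZETA] reason 00000002 [PAGE_NOT_PRESENT]",
]

if __name__ == "__main__":
    # Print one line per distinct (process, subclient, reason) with its count.
    for key, count in summarize_traps(SAMPLE).most_common():
        print(count, *key)
```

To run it over a whole log, pass an open file object instead of SAMPLE, since
summarize_traps accepts any iterable of lines.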

The problems aren't limited to X, though. When nouveau enters a failure state
like this, it appears to corrupt memory belonging to other processes. At least
three times I have seen bitcoind crash at the same time as this storm of
nouveau errors, logging an error message like:

2016-05-14 16:26:37 Corruption: block checksum mismatch
2016-05-14 16:26:37 *** System error while flushing: Database corrupted
2016-05-14 16:26:37 Error: Error: A fatal internal error occurred, see debug.log for details
2016-05-14 16:26:37 Shutdown: done

When I started seeing these problems, I suspected bad RAM, so I ran Memtest86+
overnight, but it found no errors. My suspicion is therefore that nouveau is
writing to pages it shouldn't.

Could someone help me modify my kernel so that, instead of merely printing
"fail ttm_validate", nouveau sends a SIGBUS to the active process when this
occurs? Then I can run plasmashell in gdb and get a clue as to what's causing
this.

-- 
You are receiving this mail because:
You are the assignee for the bug.
_______________________________________________
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau
