On Thu, May 20, 2021 at 06:29:36PM +1000, Jonathan Gray wrote:
> On Thu, May 20, 2021 at 10:10:14AM +0200, Matthieu Herrb wrote:
> > On Thu, May 20, 2021 at 09:53:02AM +0200, Peter N. M. Hansteen wrote:
> > > On Thu, May 20, 2021 at 08:53:14AM +1000, Jonathan Gray wrote:
> > > > On Wed, May 19, 2021 at 06:32:01PM +0200, Peter N. M. Hansteen wrote:
> > > > > On Wed, May 19, 2021 at 04:43:44PM +0200, Peter N. M. Hansteen wrote:
> > > > > > > outdated...)
> > > > > >
> > > > > > I tried the first, that only seemed to have the effect of having
> > > > > > the freeze come faster. So I commented out that part of the xorg.conf
> > > > > > and I'm trying the steps in the README now, but for some reason I
> > > > > > don't get any dumps in /var/crash as expected. Then again I could well
> > > > > > be missing some crucial step.
> > > > >
> > > > > Still no luck getting coredumps but when I sshed in after the last
> > > > > freeze the last two lines of dmesg were
> > > > >
> > > > > [drm] *ERROR* ring sdma0 timeout, signaled seq=110053, emitted seq=110053
> > > > > [drm] *ERROR* Process information: process pid 0 thread pid 0
> > > > >
> > > > > the [drm] part has me suspect this is related (but I don't know what
> > > > > sdma signifies in this context)
> > > >
> > > > sdma is the asynchronous System DMA engine
> > > >
> > > > Ring timeouts like this are a known problem with amdgpu which persist
> > > > across multiple major drm versions.
> > >
> > > Looking at what appears in the log (/var/log/messages) the time when X
> > > freezes corresponds very well with when those messages are recorded.
> > >
> > > The question is, how do I usefully debug this? I've gone over the
> > > README's procedure a few times now and it unfortunately does not produce
> > > any coredumps or traces.
> >
> > When the X server is locked up I can still ssh into the machine and
> > attach a debugger to the running process.
> > I've got a few backtraces from that, but without full symbols it's
> > even harder to understand what's going on.
> >
> > I suspect issues with our futex implementation; in every case I find
> > one thread stuck in a drm ioctl while others are blocked on futex
> > waits.
> >
> > Running an X server + Mesa fully built with debug symbols seems to make
> > the issue less frequent and when it happened I didn't have time to
> > launch a debugger on it so far...
> >
> > >
> > > One option is of course to trade up or sideways to something like
> > > https://www.power.no/data-og-tilbehoer/pc-og-mac/baerbar-pc/asus-zenbook-s-ux393ea-pure2-13-laptop/p-1115705/
> > > (Intel Core i7-1165G7 with Iris Xe graphics), but would that have a better
> > > chance of success (or for that matter be helpful to the project)?
> >
> > Which window manager / desktop environment are you using? I've tried
> > going back to WindowMaker (from xfwm4) and it also seems to not trigger
> > the lockups on a Ryzen Vega. But it hasn't been long enough. Sometimes
> > I can run for days without a lockup and sometimes it locks up after
> > minutes.
>
> To trigger ring timeouts on amdgpu I use graphics/piglit.
> piglit run -s quick <outdir>
> which will rapidly open and close windows and take quite a while if left
> to complete.
>
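The ring timeout messages quoted above have a regular shape, so it may help to pull them out of /var/log/messages and line their timestamps up against the observed freezes. A minimal sketch in Python, assuming the log lines look like the two dmesg lines quoted earlier in this thread; the function name, the sample syslog prefix and the hostname are made up for illustration:

```python
import re

# Pattern modelled on the "[drm] *ERROR* ring ... timeout" line quoted in
# this thread; other drm versions may word the message differently.
RING_TIMEOUT = re.compile(
    r"\[drm\] \*ERROR\* ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def find_ring_timeouts(lines):
    """Return (prefix, ring, signaled, emitted) for every ring timeout line.

    prefix is whatever precedes the [drm] tag, typically the syslog
    timestamp and host, so it can be compared with the time of a freeze.
    """
    hits = []
    for line in lines:
        m = RING_TIMEOUT.search(line)
        if m:
            prefix = line[:m.start()].strip()
            hits.append((prefix, m["ring"], int(m["signaled"]), int(m["emitted"])))
    return hits


# Hypothetical sample: the timestamp and hostname are invented, the [drm]
# text is the message reported earlier in the thread.
SAMPLE = [
    "May 20 09:41:07 examplehost /bsd: [drm] *ERROR* ring sdma0 timeout, "
    "signaled seq=110053, emitted seq=110053",
    "May 20 09:41:07 examplehost /bsd: [drm] *ERROR* Process information: "
    "process pid 0 thread pid 0",
]

for prefix, ring, signaled, emitted in find_ring_timeouts(SAMPLE):
    print(f"{prefix} ring={ring} signaled={signaled} emitted={emitted}")
```

Passing an open file instead of SAMPLE, e.g. find_ring_timeouts(open("/var/log/messages", errors="replace")), should work the same way, since the function only iterates over lines.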
I do hit the dmafence X lockup with WindowMaker when using Chrome (e.g.
playing YouTube videos is a good trigger). Firefox does not seem to cause
these dmafence issues. There are some other situations that trigger the
dmafence lockup, but far less often.

I have also noticed X crashes (again without much luck getting coredumps),
which seem to be much more random. Often X dies overnight for me while
xlock and DPMS are on.

> > BTW this also leads me to wonder if KARL could have an impact on the
> > issue in case there is some un-initialized memory access somewhere in
> > the code...

-- 
:wq Claudio