Roland Scheidegger wrote:
There's historically been code which tried to do that, but it
just never ever worked...



I always thought that the code did work; however, since the reset
usually happened in the DRM driver, it did not know how to set the mode.
There were quite a few occasions where I was able to kill X and
then restart it (though on other occasions killing X resulted in
a hard lockup).


What's different in Nikolai's patch is that the reset is initiated
by X itself.

It might be that the lockup was due to clocks being switched off,
but the DAC left active - or something similar.


I can confirm this code works for r200 (I'm still trying to figure
out how to fix the lockups for r200 with multiple GL apps, apparently
without much success...). However, it sometimes seems to fail. I've
seen three different outcomes when the watchdog is triggered:
1) One of the two running GL apps is killed; the other continues to run.
2) Both running GL apps are killed. If that happens, no further GL
apps can be started (they will always hang and get killed).
3) The GPU lockup still happens (that happened just once, with more
than two apps; I was not able to reproduce it, and I forgot to look in
the log file to see whether the watchdog was triggered).
Cases 1) and 2) are both seen frequently. But it's nice to see
"VPURecover for Linux" ;-).



Roland

I've played around with this a bit (unwillingly ;-) on my r200, and
there seem to be some problems.
First, the watchdog will not always detect that a client has caused a
GPU hang, since the lock is not necessarily held at that time. I guess
this isn't really the fault of the watchdog, as it's not intended for
that.
For instance, when running glxgears and quake3 at the same time, both
processes will get stuck in r200WaitForFrameCompletion (as far as I can
tell) sooner rather than later. When using irqs (fthrottle_mode=2), one
client will hold the lock and get killed (though in my experience both
always get killed for some reason). However, when using busy waits
(fthrottle_mode=0) the watchdog will never trigger, since the lock is
always released and regrabbed while waiting. In this case, though, you
can switch away from X to a VT (you can't do that when using irqs); X
will complain about "Idle timed out, resetting engine", and after
switching back to X both apps will continue to run (and lock up again
shortly afterwards, but that's another story).
Maybe the DRI driver itself should have some sort of watchdog too, and
call just that reset code when it has to wait too long? I'm not sure
if/how that works when using irqs. That would be nicer than just
killing the app. AFAIK the ATI VPURecover feature in the Windows driver
also works like that: if it detects a GPU lockup, it resets the chip
and tries to continue, and only if it locks up again very soon is the
application killed.
Of course it would be better just to fix the lockups ;-), but no
progress so far :-(.
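
To make the "reset first, kill only on a quick repeat" idea above a bit
more concrete, here is a minimal sketch in C. It is not taken from the
DRI or DRM sources: the names hang_policy, hang_policy_on_timeout and
min_interval_s are hypothetical, and the caller is assumed to supply a
timestamp in seconds whenever a wait for the GPU has timed out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; nothing here exists in the DRI/DRM code. */
enum hang_action {
    HANG_RESET_AND_CONTINUE,  /* reset the engine and let the client go on */
    HANG_KILL_CLIENT          /* hang came back too quickly: give up */
};

struct hang_policy {
    double min_interval_s;  /* hangs closer together than this count as "very soon" */
    double last_hang_s;     /* time of the previous hang; valid only if seen_hang */
    bool   seen_hang;       /* false until the first hang has been recorded */
};

/*
 * Called when a wait for the GPU has timed out at time now_s.
 * The first hang, or a hang long after the previous one, asks for a
 * reset; a hang shortly after the previous one asks for the client to
 * be killed.
 */
static enum hang_action hang_policy_on_timeout(struct hang_policy *p, double now_s)
{
    enum hang_action action = HANG_RESET_AND_CONTINUE;

    assert(p != NULL);

    if (p->seen_hang && (now_s - p->last_hang_s) < p->min_interval_s)
        action = HANG_KILL_CLIENT;

    /* Remember this hang so the next one can be compared against it. */
    p->seen_hang = true;
    p->last_hang_s = now_s;
    return action;
}
```

A client-side wait loop would keep one such struct per context and call
hang_policy_on_timeout() when its own timeout expires; how the reset is
actually requested, and whether this fits the irq wait path, is left
open here just as it is in the text above.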


Another problem I've seen is that sometimes no more 3D apps can be started. This seems to depend on where the applications are stuck when they are killed. The quake3 & glxgears example above always has that outcome, and I also see this when they are killed:
r200WaitIrq: drmRadeonIrqWait: -16
r200WaitIrq: drmRadeonIrqWait: -16
Interestingly, this can be fixed by switching away from X and back; I think the VTEnter/Resume code is responsible for that.



Roland




_______________________________________________
Dri-devel mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/dri-devel
