Re: Stabilize flaky GCN target/offloading testing

2024-03-06 Thread Andrew Stubbs

On 06/03/2024 12:09, Thomas Schwinge wrote:

Hi!

On 2024-02-21T17:32:13+0100, Richard Biener wrote:

On 21.02.2024 at 13:34, Thomas Schwinge wrote:

[...] per my work on 
"libgomp make check time is excessive", all execution testing in libgomp
is serialized in 'libgomp/testsuite/lib/libgomp.exp:libgomp_load'.  [...]
(... with the caveat that execution tests for
effective-targets are *not* governed by that, as I've found yesterday.
I have a WIP hack for that, too.)



What disturbs the testing a lot is that the GPU may get into a bad
state, upon which any use either fails with an
'HSA_STATUS_ERROR_OUT_OF_RESOURCES' error or just hangs, deep in
'libhsa-runtime64.so.1'...

I've now tried to debug the latter case (hang).  When the GPU gets into
this bad state (whatever exactly that is),
'hsa_executable_load_code_object' still returns 'HSA_STATUS_SUCCESS', but
then GCN target execution ('gcn-run') hangs in 'hsa_executable_freeze'
vs. GCN offloading execution ('libgomp-plugin-gcn.so.1') hangs right
before 'hsa_executable_freeze', in the GCN heap setup 'hsa_memory_copy'.
There it hangs until killed (for example, until DejaGnu's timeout
mechanism kills the process -- only for the next GPU-using execution
test to run into the same thing again...).

In this state (and also the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' state),
we're able to recover via:

$ flock /tmp/gpu.lock sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
0


At least most of the time.  I've found that -- sometimes... ;-( -- if
you run into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES', then do
'amdgpu_gpu_recover', and then immediately re-execute, you'll again run
into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.  That appears to be avoidable
by injecting some artificial "cool-down period"...  (The latter I've not
yet tested extensively.)


This is, obviously, a hack: it probably needs a serial lock so as not to
disturb other things, has a hard-coded 'dri/0', and as I said in

"GCN RDNA2+ vs. GCC SLP vectorizer":

| I've no idea what
| 'amdgpu_gpu_recover' would do if the GPU is also used for display.


It ends up terminating your X session…


Eh  ;'-|


(there’s some automatic driver recovery that’s also sometimes triggered which 
sounds like the same thing).



I need to try using the integrated graphics for X11 to see if that avoids the 
issue.


A few years ago, I tried that for an Nvidia GPU laptop, and -- if I now
remember correctly -- basically got it to work, via hand-editing
'/etc/X11/xorg.conf' and all that...  But: I couldn't get external HDMI
to work in that setup, and therefore reverted to "standard".


Guess AMD needs to improve the driver/runtime (or we - it’s open source at 
least up to the firmware).



However, it's very useful in my testing.  :-|

The question is: how to detect the "hang" state without first running
into a timeout (and disambiguating such a timeout from a user code
timeout)?  Add a watchdog: call 'alarm([a few seconds])' before device
initialization, and before the actual GPU kernel launch cancel it with
'alarm(0)'?  (..., and add a handler for 'SIGALRM' to print a distinct
error message that we can then react on, like for
'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.)  Probably 'alarm'/'SIGALRM' is a
no-go in libgomp -- instead, use a helper thread to similarly implement a
watchdog?  ('libgomp/plugin/plugin-gcn.c' is already using pthreads for
other purposes.)  Any other clever ideas?  What's a suitable value for
"a few seconds"?


I'm attaching my current "GCN: Watchdog for device image load", covering
both 'gcc/config/gcn/gcn-run.cc' and 'libgomp/plugin/plugin-gcn.c'.
(That's using 'timer_create' etc. instead of 'alarm'/'SIGALRM'.)
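
To make the idea concrete, here is a minimal stand-alone sketch of such a
'timer_create'-based watchdog.  To be clear: this is only an illustration,
not the attached patch; the 20-second period, the SIGEV_THREAD notification,
and all function names are placeholders.

/* Hypothetical watchdog sketch (illustration only, not the attached
   "GCN: Watchdog for device image load" patch).  Arm a one-shot POSIX
   timer before a call that may hang; if it fires, print a distinct
   diagnostic that the test harness can react on (analogous to
   HSA_STATUS_ERROR_OUT_OF_RESOURCES) and bail out.  May need -lrt on
   older glibc.  */

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define WATCHDOG_SECONDS 20  /* "A few seconds"; value to be tuned.  */

static void
watchdog_expired (union sigval sv)
{
  (void) sv;
  fprintf (stderr, "error: watchdog expired during device image load\n");
  exit (1);
}

static timer_t
watchdog_arm (void)
{
  struct sigevent sev;
  memset (&sev, 0, sizeof sev);
  sev.sigev_notify = SIGEV_THREAD;
  sev.sigev_notify_function = watchdog_expired;

  timer_t timer;
  if (timer_create (CLOCK_MONOTONIC, &sev, &timer) != 0)
    {
      perror ("timer_create");
      exit (1);
    }

  struct itimerspec its;
  memset (&its, 0, sizeof its);
  its.it_value.tv_sec = WATCHDOG_SECONDS;
  if (timer_settime (timer, 0, &its, NULL) != 0)
    {
      perror ("timer_settime");
      exit (1);
    }
  return timer;
}

static void
watchdog_cancel (timer_t timer)
{
  /* Deleting the timer also disarms it.  */
  timer_delete (timer);
}

The potentially hanging call would then be bracketed along the lines of
'timer_t w = watchdog_arm (); ... hsa_executable_freeze (...); ...
watchdog_cancel (w);'.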

That, plus routing *all* potential GPU usage (in particular: including
execution tests for effective-targets, see above) through a serial lock
('flock', implemented in the DejaGnu board file, outside of the
"DejaGnu timeout domain", similar to
'libgomp/testsuite/lib/libgomp.exp:libgomp_load', see above), plus
catching 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' (both the "real" ones and
the "fake" ones via "GCN: Watchdog for device image load") and in that
case running 'amdgpu_gpu_recover' and re-executing the respective
executable, does greatly stabilize flaky GCN target/offloading testing.
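
For illustration, the recover-and-retry part could be wrapped roughly as in
the following hypothetical stand-alone sketch.  It is not the actual DejaGnu
board-file implementation: the lock path and recovery command are the ones
from above, the cool-down value is made up, and detecting the specific
'HSA_STATUS_ERROR_OUT_OF_RESOURCES' (or watchdog) diagnostic, rather than
retrying on any failure, is omitted.

/* Hypothetical wrapper sketch: run a test executable under the serial
   GPU lock; on failure, trigger amdgpu_gpu_recover, wait out an
   artificial cool-down period, and re-execute once.  */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main (int argc, char **argv)
{
  if (argc != 2)
    {
      fprintf (stderr, "usage: %s <test-executable>\n", argv[0]);
      return 2;
    }

  /* Serialize all GPU usage via the same lock file as the recovery step.
     A real implementation would quote the argument properly.  */
  char cmd[4096];
  snprintf (cmd, sizeof cmd, "flock /tmp/gpu.lock %s", argv[1]);

  for (int attempt = 0; ; attempt++)
    {
      if (system (cmd) == 0)
        return 0;
      if (attempt == 1)
        return 1;  /* Give up after one retry.  */

      fprintf (stderr, "GPU possibly in a bad state; attempting recovery\n");
      system ("flock /tmp/gpu.lock sudo cat"
              " /sys/kernel/debug/dri/0/amdgpu_gpu_recover");
      sleep (5);  /* Artificial "cool-down period".  */
    }
}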

Do we have consensus to move forward with this approach, generally?


I've also observed a number of random hangs in host-side code outside
our control, but after the kernel has exited. In general this watchdog
approach might help with those. I do feel like it's "papering over the
cracks", but if we can't fix it, then at the end of the day it's just a
little extra code.


My only concern is that it might actually cause failures, perhaps on 
heavily loaded systems, or with network filesystems, or during debugging.


Andrew