> On 21.02.2024 at 13:34, Thomas Schwinge <tschwi...@baylibre.com> wrote:
> 
> Hi!
> 
>> On 2024-02-01T15:49:02+0100, Richard Biener <rguent...@suse.de> wrote:
>>> On Thu, 1 Feb 2024, Thomas Schwinge wrote:
>>> On 2024-01-26T10:45:10+0100, Richard Biener <rguent...@suse.de> wrote:
>>>> On Fri, 26 Jan 2024, Richard Biener wrote:
>>>>> On Wed, 24 Jan 2024, Andrew Stubbs wrote:
>>>>>> [...] is enough to get gfx1100 working for most purposes, on top of the
>>>>>> patch that Tobias committed a week or so ago; there are still some test
>>>>>> failures to investigate, and probably some tuning to do.
>>>>>> 
>>>>>> It might also get gfx1030 working too. @Richi, could you test it,
>>>>>> please?
>>>>> 
>>>>> I can report partial success here.  [...]
>>> 
>>>>> I'll follow up with a test summary once the (serial) run of libgomp
>>>>> testing finished.
>>> 
>>> (Why serial, by the way?)
>> 
>> Just out of caution ... (I'm using the GPU for the desktop at the
>> same time and dmesg gets spammed with some not-so reassuring
>> "errors" during the offloading)
> 
> Yeah, indeed 'dmesg' is full of "notes"...
> 
> However, note that per my work on <https://gcc.gnu.org/PR66005>
> "libgomp make check time is excessive", all execution testing in libgomp
> is serialized in 'libgomp/testsuite/lib/libgomp.exp:libgomp_load'.  So,
> no problem/difference in that regard, to run parallel
> 'check-target-libgomp'.  (... with the caveat that execution tests for
> effective-targets are *not* governed by that, as I've found yesterday.
> I have a WIP hack for that, too.)
> 
> 
>>> [...] what I
>>> got with '-march=gfx1100' for AMD Radeon RX 7900 XTX.  [...]
> 
>>> [...] execution test FAILs.  Not all FAILs appear all the time [...]
> 
> What disturbs the testing a lot is that the GPU may get into a bad
> state, upon which any use either fails with a
> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' error or just hangs, deep in
> 'libhsa-runtime64.so.1'...
> 
> I've now tried to debug the latter case (hang).  When the GPU gets into
> this bad state (whatever exactly that is),
> 'hsa_executable_load_code_object' still returns 'HSA_STATUS_SUCCESS', but
> then GCN target execution ('gcn-run') hangs in 'hsa_executable_freeze'
> vs. GCN offloading execution ('libgomp-plugin-gcn.so.1') hangs right
> before 'hsa_executable_freeze', in the GCN heap setup 'hsa_memory_copy'.
> There it hangs until killed (for example, until DejaGnu's timeout
> mechanism kills the process -- just that the next GPU-using execution
> test then runs into the same thing again...).
> 
> In this state (and also the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' state),
> we're able to recover via:
> 
>    $ flock /tmp/gpu.lock sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
>    0
> 
> This is, obviously, a hack, probably needs a serial lock to not disturb
> other things, has hard-coded 'dri/0', and as I said in
> <https://inbox.sourceware.org/87plww8qin....@euler.schwinge.ddns.net>
> "GCN RDNA2+ vs. GCC SLP vectorizer":
> 
> | I've no idea what
> | 'amdgpu_gpu_recover' would do if the GPU is also used for display.

It ends up terminating your X session… (there’s some automatic driver recovery 
that’s also sometimes triggered which sounds like the same thing).  I need to 
try using the integrated graphics for X11 to see if that avoids the issue.

Guess AMD needs to improve the driver/runtime (or we - it’s open source at 
least up to the firmware).

Richard 

> However, it's very useful in my testing.  :-|
> 
> The question is, how to detect the "hang" state without first running
> into a timeout (and disambiguating such a timeout from a user code
> timeout)?  Add a watchdog: call 'alarm([a few seconds])' before device
> initialization, and before the actual GPU kernel launch cancel it with
> 'alarm(0)'?  (..., and add a handler for 'SIGALRM' to print a distinct
> error message that we can then react to, like for
> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.)  Probably 'alarm'/'SIGALRM' is a
> no-go in libgomp -- instead, use a helper thread to similarly implement a
> watchdog?  ('libgomp/plugin/plugin-gcn.c' already is using pthreads for
> other purposes.)  Any other clever ideas?  What's a suitable value for
> "a few seconds"?
> 
> 
> Regards
> Thomas
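
For illustration, a minimal sketch of the helper-thread watchdog floated
above, using plain pthreads.  All names here ('struct watchdog',
'watchdog_arm', 'watchdog_disarm', 'watchdog_report_and_abort') and the
message text are hypothetical; nothing like this exists in
'libgomp/plugin/plugin-gcn.c' today, and the timeout value is left to
the caller, since what counts as "a few seconds" is still an open
question.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical watchdog state; not part of libgomp.  */
struct watchdog
{
  pthread_t thread;
  pthread_mutex_t lock;
  pthread_cond_t cond;
  unsigned timeout_seconds;
  bool disarmed;              /* Set by watchdog_disarm.  */
  bool fired;                 /* Set if the deadline passed first.  */
  void (*on_timeout) (void);  /* Optional; called from the helper thread.  */
};

/* Hypothetical handler: print a distinct message that a test harness
   could match, then terminate, since the main thread is assumed stuck.  */
void
watchdog_report_and_abort (void)
{
  fputs ("GCN watchdog: device initialization timed out\n", stderr);
  abort ();
}

/* Helper thread: wait until disarmed or until the deadline passes.  */
static void *
watchdog_main (void *arg)
{
  struct watchdog *w = arg;
  struct timespec deadline;
  clock_gettime (CLOCK_REALTIME, &deadline);
  deadline.tv_sec += w->timeout_seconds;

  pthread_mutex_lock (&w->lock);
  int err = 0;
  /* Loop to tolerate spurious wakeups; any error ends the wait.  */
  while (!w->disarmed && err == 0)
    err = pthread_cond_timedwait (&w->cond, &w->lock, &deadline);
  if (!w->disarmed)
    w->fired = true;
  bool fired = w->fired;
  pthread_mutex_unlock (&w->lock);

  if (fired && w->on_timeout)
    w->on_timeout ();
  return NULL;
}

/* Start the watchdog.  Returns 0 on success, like pthread_create.  */
int
watchdog_arm (struct watchdog *w, unsigned timeout_seconds,
              void (*on_timeout) (void))
{
  w->timeout_seconds = timeout_seconds;
  w->disarmed = false;
  w->fired = false;
  w->on_timeout = on_timeout;
  pthread_mutex_init (&w->lock, NULL);
  pthread_cond_init (&w->cond, NULL);
  return pthread_create (&w->thread, NULL, watchdog_main, w);
}

/* Stop the watchdog.  Returns true if it had already expired.  */
bool
watchdog_disarm (struct watchdog *w)
{
  pthread_mutex_lock (&w->lock);
  w->disarmed = true;
  pthread_cond_signal (&w->cond);
  pthread_mutex_unlock (&w->lock);
  pthread_join (w->thread, NULL);
  bool fired = w->fired;
  pthread_cond_destroy (&w->cond);
  pthread_mutex_destroy (&w->lock);
  return fired;
}
```

Following the proposal quoted above, the plugin would call
'watchdog_arm (&w, N, watchdog_report_and_abort)' before device
initialization and 'watchdog_disarm (&w)' right before the actual GPU
kernel launch, so a user-code timeout stays distinguishable from the
hang in 'hsa_executable_freeze' or 'hsa_memory_copy'.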
