Hi!

On 2024-02-01T15:49:02+0100, Richard Biener <rguent...@suse.de> wrote:
> On Thu, 1 Feb 2024, Thomas Schwinge wrote:
>> On 2024-01-26T10:45:10+0100, Richard Biener <rguent...@suse.de> wrote:
>> > On Fri, 26 Jan 2024, Richard Biener wrote:
>> >> On Wed, 24 Jan 2024, Andrew Stubbs wrote:
>> >> > [...] is enough to get gfx1100 working for most purposes, on top of the
>> >> > patch that Tobias committed a week or so ago; there are still some test
>> >> > failures to investigate, and probably some tuning to do.
>> >> >
>> >> > It might also get gfx1030 working too. @Richi, could you test it,
>> >> > please?
>> >>
>> >> I can report partial success here. [...]
>>
>> >> I'll followup with a test summary once the (serial) run of libgomp
>> >> testing finished.
>>
>> (Why serial, by the way?)
>
> Just out of caution ... (I'm using the GPU for the desktop at the
> same time and dmesg gets spammed with some not-so reassuring
> "errors" during the offloading)

Yeah, indeed 'dmesg' is full of "notes"...

However, note that per my work on <https://gcc.gnu.org/PR66005>
"libgomp make check time is excessive", all execution testing in libgomp
is serialized in 'libgomp/testsuite/lib/libgomp.exp:libgomp_load'.  So,
running 'check-target-libgomp' in parallel makes no difference in that
regard.  (... with the caveat that execution tests for effective-targets
are *not* governed by that, as I found yesterday.  I have a WIP hack for
that, too.)

>> [...] what I
>> got with '-march=gfx1100' for AMD Radeon RX 7900 XTX.  [...]

>> [...] execution test FAILs.  Not all FAILs appear all the time [...]

What disturbs the testing a lot is that the GPU may get into a bad
state, upon which any use either fails with a
'HSA_STATUS_ERROR_OUT_OF_RESOURCES' error, or just hangs, deep in
'libhsa-runtime64.so.1'...

I've now tried to debug the latter case (hang).  When the GPU gets into
this bad state (whatever exactly that is),
'hsa_executable_load_code_object' still returns 'HSA_STATUS_SUCCESS',
but then GCN target execution ('gcn-run') hangs in
'hsa_executable_freeze', whereas GCN offloading execution
('libgomp-plugin-gcn.so.1') hangs right before 'hsa_executable_freeze',
in the GCN heap setup 'hsa_memory_copy'.  There it hangs until killed
(for example, until DejaGnu's timeout mechanism kills the process; the
next GPU-using execution test then just runs into the same thing
again...).

In this state (and also the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' state),
we're able to recover via:

    $ flock /tmp/gpu.lock sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
    0

This is, obviously, a hack, probably needs a serial lock to not disturb
other things, has hard-coded 'dri/0', and as I said in
<https://inbox.sourceware.org/87plww8qin....@euler.schwinge.ddns.net>
"GCN RDNA2+ vs. GCC SLP vectorizer":

| I've no idea what
| 'amdgpu_gpu_recover' would do if the GPU is also used for display.

However, it's very useful in my testing.
:-|

The question is, how to detect the "hang" state without first running
into a timeout (and then having to disambiguate such a timeout from a
user code timeout)?  Add a watchdog: call 'alarm([a few seconds])'
before device initialization, and cancel it with 'alarm(0)' before the
actual GPU kernel launch?  (..., and add a handler for 'SIGALRM' to
print a distinct error message that we can then react to, like for
'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.)  Probably 'alarm'/'SIGALRM' is a
no-go in libgomp -- instead, use a helper thread to similarly implement
a watchdog?  ('libgomp/plugin/plugin-gcn.c' already is using pthreads
for other purposes.)  Any other clever ideas?  What's a suitable value
for "a few seconds"?


Regards
 Thomas
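
PS: To make the helper-thread idea a bit more concrete, here is a
minimal sketch of such a watchdog, assuming plain POSIX threads.  All
names ('struct watchdog', 'watchdog_arm', 'watchdog_disarm') and the
error message text are hypothetical; none of this is existing libgomp or
'plugin-gcn.c' code.  The helper thread sleeps in
'pthread_cond_timedwait' until either it is disarmed or the deadline
passes, and in the latter case prints a distinct message.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical watchdog state; not part of libgomp.  */
struct watchdog
{
  pthread_t thread;
  pthread_mutex_t lock;
  pthread_cond_t cond;
  unsigned timeout_seconds;
  bool disarmed;  /* Set by 'watchdog_disarm'.  */
  bool expired;   /* Set by the helper thread when the deadline passes.  */
};

/* Helper thread: wait until disarmed or until the deadline passes.  */
static void *
watchdog_main (void *arg)
{
  struct watchdog *w = arg;
  struct timespec deadline;
  /* 'pthread_cond_timedwait' uses CLOCK_REALTIME by default.  */
  clock_gettime (CLOCK_REALTIME, &deadline);
  deadline.tv_sec += w->timeout_seconds;

  pthread_mutex_lock (&w->lock);
  int rc = 0;
  while (!w->disarmed && rc != ETIMEDOUT)
    rc = pthread_cond_timedwait (&w->cond, &w->lock, &deadline);
  if (!w->disarmed)
    {
      w->expired = true;
      /* Assumed message format; a test harness could match on it.  A real
         plugin would presumably also terminate the process here, as the
         main thread is stuck inside the HSA runtime.  */
      fprintf (stderr, "watchdog: device initialization timed out\n");
    }
  pthread_mutex_unlock (&w->lock);
  return NULL;
}

/* Arm before device initialization.  Returns 0 on success.  */
static int
watchdog_arm (struct watchdog *w, unsigned seconds)
{
  assert (seconds > 0);
  w->timeout_seconds = seconds;
  w->disarmed = false;
  w->expired = false;
  pthread_mutex_init (&w->lock, NULL);
  pthread_cond_init (&w->cond, NULL);
  return pthread_create (&w->thread, NULL, watchdog_main, w);
}

/* Disarm before the kernel launch.  Returns true if the watchdog had
   already fired.  */
static bool
watchdog_disarm (struct watchdog *w)
{
  pthread_mutex_lock (&w->lock);
  w->disarmed = true;
  pthread_cond_signal (&w->cond);
  pthread_mutex_unlock (&w->lock);
  pthread_join (w->thread, NULL);
  bool expired = w->expired;
  pthread_cond_destroy (&w->cond);
  pthread_mutex_destroy (&w->lock);
  return expired;
}
```

The intended use would be 'watchdog_arm (&w, N)' right before device
initialization and 'watchdog_disarm (&w)' right before the kernel
launch, so that user code run time is never counted.  The sketch only
records and reports the expiry; what to do then (exit with a distinct
status, trigger the recovery hack, ...) and what N should be remain the
open questions from above.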