Hi,

that sounds like one of the endless AMD GPU stability issues that other people and I (RX 570, RX 6600 XT, ...) have been having for years.

ROCm is simply not reliable on consumer cards and never has been. ROCm 6.0 officially supports only the MI100/MI200/MI300 data center GPUs, the Radeon Pro W7900/W6800/V620/VII and the Radeon RX 7900/VII, and only up to kernel 6.2 (Ubuntu 22.04.3). Everything else is unsupported.

From what I have gathered over the years, I would say running both graphics and compute at the same time on the same AMD GPU without messing up the internal memory management/GPU state only seems to work reliably under *very* strict constraints.

The data center chips don't support any graphics at all, so no issues there (we run MI100/MI210/MI250s at work and I've never seen anything like this there). The Radeon Pro W7900/W6800/V620/VII GPUs are usually only available in a few certified configurations, so there are probably fewer issues to keep track of. But the RDNA2/RDNA3-based consumer GPUs are being used in an endless variety of system builds and configurations, and I assume AMD developers simply don't invest a lot of time in making these more stable since ROCm doesn't support them anyway.

When the drm/amdgpu errors started appearing, I would usually have at most 20 minutes until the whole machine would lock itself up. So every time I noticed OpenCL issues in darktable, I had to reboot, breaking my workflow and concentration. If I had a lot of work before me, I would simply turn OpenCL off completely and trade speed for stability.
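
To notice these errors before the machine locks up, one could filter the kernel log for them. The following is only a rough sketch: the function names and the list of "suspicious" words are my own assumptions, and the actual amdgpu messages differ between kernel and firmware versions, so the pattern will need adjusting.

```python
import re
import subprocess

# Assumption: a line counts as suspicious if it mentions amdgpu together with
# one of these words. This is a heuristic, not a list of real kernel messages.
SUSPICIOUS = re.compile(r"error|fail|timeout|fault|reset", re.IGNORECASE)


def find_amdgpu_errors(log_text):
    """Return the log lines that mention amdgpu and a suspicious word."""
    hits = []
    for line in log_text.splitlines():
        if "amdgpu" in line.lower() and SUSPICIOUS.search(line):
            hits.append(line.strip())
    return hits


def scan_kernel_log():
    """Run dmesg and filter its output (may need root on some systems)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    return find_amdgpu_errors(result.stdout)
```

Calling `scan_kernel_log()` from a small script right after darktable starts behaving oddly would tell you whether it is time to save your work and reboot.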

For me the situation became so unbearable last year that I ended up plugging in an Intel Arc A770 GPU beside my RX 6600 XT. The A770 only does OpenCL (not just for darktable), the RX 6600 XT only graphics (ROCm is not installed). I might be able to get my hands on an MI100, but they're passively cooled and don't fit into a standard ATX case.

(The sorry state of OpenCL might also be a reason to put more emphasis on darktable's CPU codepaths and other optimisations again.)

cheers,
Simon


P.S. It is not only a different/recent kernel that can break a working setup; changing the firmware files (usually installed in /lib/firmware/amdgpu/) can do so as well. An update of the firmware package might therefore also break a setup that previously worked with the same kernel.
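
One way to find out whether a firmware update changed anything is to hash the files before and after the update and compare. This is a minimal sketch, not a tested tool: the function names are made up for illustration, and it only looks at regular files directly inside the given directory.

```python
import hashlib
from pathlib import Path


def snapshot_firmware(directory):
    """Map each file name in `directory` to the SHA-256 digest of its contents."""
    result = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            result[path.name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return result


def diff_snapshots(old, new):
    """Return (added, removed, changed) file names between two snapshots."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(n for n in set(old) & set(new) if old[n] != new[n])
    return added, removed, changed
```

Take one snapshot of /lib/firmware/amdgpu/ while the setup works (for example, dumped to a JSON file), take another after the package update, and pass both to `diff_snapshots()`.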



On 21.12.23 21:01, Šarūnas wrote:
Debian unstable,
kernel 6.6 from experimental,
ROCm 6.0,
Radeon RX 7600,
darktable 4.4.2.

OpenCL works 2 out of 3 times between reboots.
~/.cache and ~/.config/darktable are purged between reboots.

When it works, all is fine, no error messages.

When it doesn't (stuck at/before CL kernel compile), attaching to darktable process with `strace` shows endless FUTEX'es. dmesg shows repeating amdgpu drm and iommu errors (they keep repeating after darktable is stopped).


