Package: librocfft0-tests
Version: 6.1.2-1
Severity: normal
X-Debbugs-Cc: c...@slerp.xyz

Dear Maintainer,

The rocfft tests are crashing on gfx1035 after updating the kernel from
bookworm to bookworm-backports (6.1 to 6.10). This can be seen by comparing
these two nearly identical runs from before [1] and after [2] installing a new
kernel.

This is the failing test:
498s [ RUN      ] pow2_1D/accuracy_test.vs_fftw/complex_forward_len_268435456_single_ip_batch_4_istride_1_CI_ostride_1_CI_idist_268435456_odist_268435456_ioffset_0_0_ooffset_0_0
558s Killed
558s dmesg: read kernel buffer failed: Operation not permitted

It seems that previously it was skipped:
162s [ RUN      ] pow2_1D/accuracy_test.vs_fftw/complex_forward_len_268435456_single_ip_batch_4_istride_1_CI_ostride_1_CI_idist_268435456_odist_268435456_ioffset_0_0_ooffset_0_0
162s clients/tests/accuracy_test.h:1260: Skipped
162s Raw problem size (9 GiB) raw data too large for device

Other tests that need much more RAM are still skipped, even with the newer
kernel:
395s [ RUN      ] pow2_1D/accuracy_test.vs_fftw/complex_forward_len_268435456_double_op_batch_4_istride_1_CI_ostride_1_CI_idist_268435456_odist_268435456_ioffset_0_0_ooffset_0_0
395s clients/tests/accuracy_test.h:1214: Skipped
395s needed_ramgb: 96, ramgb limit: 61.

The failure therefore appears to be due to changes in the GPU memory capacity
reported for the APUs. In Linux 6.10, the driver is able to dynamically
allocate more system memory for use by the GPU.

The rocfft library queries the amount of memory available on the device and it
skips tests that would use more memory than is available. I suspect that the
allocation for this test is failing but the library attempts to use the
returned value anyway.

While there may be 61GB of memory available in theory, it might be that there
are other factors that reduce that amount. For example, allocations in host
memory might reduce the available memory for the device in APU systems like
this. While the device memory limit might be 61GB, you cannot allocate 40GB of
host memory and 40GB of device memory, as they come from the same pool.
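
To illustrate the point above, here is a minimal Python sketch. It is not
rocfft's actual skip logic; the function names and the 64 GB pool size are
hypothetical, chosen only to show how a check against two independent limits
can pass while the combined request does not fit in one shared pool.

```python
def fits_independent(host_gb, device_gb, host_limit_gb, device_limit_gb):
    """Check each request against its own limit, ignoring any sharing."""
    return host_gb <= host_limit_gb and device_gb <= device_limit_gb


def fits_shared(host_gb, device_gb, pool_gb):
    """Check the combined request against a single shared pool (APU case)."""
    return host_gb + device_gb <= pool_gb


# Hypothetical numbers: both limits reported as 61 GB, backed by one 64 GB pool.
print(fits_independent(40, 40, 61, 61))  # True: each request is under its limit
print(fits_shared(40, 40, 64))           # False: 80 GB does not fit in 64 GB
```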

The rocfft-test client provides the --R <gb> and --V <gb> options to explicitly
specify the host and device memory limits, respectively. That may be a good
workaround for the moment, although it would be better if rocfft-test could
query the system hardware and choose appropriate defaults.
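
As a rough sketch of what choosing such defaults could look like, the Python
below derives a pair of limits from /proc/meminfo-style text and prints a
matching command line. The helper names and the even 50/50 split are my own
assumptions, not anything rocfft-test does today, and the sample input is
made up for the demonstration.

```python
import re


def mem_available_gib(meminfo_text):
    """Return the MemAvailable value from /proc/meminfo-style text, in GiB."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemAvailable not found")
    return int(match.group(1)) / (1024 * 1024)  # kB -> GiB


def suggest_limits(available_gib, device_fraction=0.5):
    """Split one shared pool into (host_gb, device_gb) whole-GiB limits.

    device_fraction is an assumed policy knob, not a rocfft setting.
    """
    device_gb = int(available_gib * device_fraction)
    host_gb = int(available_gib) - device_gb
    return host_gb, device_gb


# Made-up sample; on a real system, read the text from /proc/meminfo instead.
sample = "MemTotal:       67108864 kB\nMemAvailable:   33554432 kB\n"
host_gb, device_gb = suggest_limits(mem_available_gib(sample))
print(f"rocfft-test --R {host_gb} --V {device_gb}")
```

The split keeps the sum of the two limits at or below the available memory, so
the host and device allocations cannot jointly exceed the shared pool.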

Sincerely,
Cory Bloor

[1]: https://ci.rocm.debian.net/data/autopkgtest/unstable/amd64+gfx1035/r/rocfft/33925/log.gz
[2]: https://ci.rocm.debian.net/data/autopkgtest/unstable/amd64+gfx1035/r/rocfft/34278/log.gz

-- System Information:
Debian Release: 12.7
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'stable-security'), (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 6.10.6+bpo-amd64 (SMP w/16 CPU threads; PREEMPT)
Locale: LANG=en_CA.UTF-8, LC_CTYPE=en_CA.UTF-8 (charmap=UTF-8), 
LANGUAGE=en_CA:en
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled
