Package: librocfft0-tests
Version: 6.1.2-1
Severity: normal
X-Debbugs-Cc: c...@slerp.xyz

Dear Maintainer,

The rocfft tests are crashing on gfx1035 after updating the kernel from bookworm to bookworm-backports (6.1 to 6.10). This can be seen by comparing these two nearly identical runs from before [1] and after [2] installing the new kernel.

This is the failing test:

498s [ RUN ] pow2_1D/accuracy_test.vs_fftw/complex_forward_len_268435456_single_ip_batch_4_istride_1_CI_ostride_1_CI_idist_268435456_odist_268435456_ioffset_0_0_ooffset_0_0
558s Killed
558s dmesg: read kernel buffer failed: Operation not permitted

It seems that previously it was skipped:

162s [ RUN ] pow2_1D/accuracy_test.vs_fftw/complex_forward_len_268435456_single_ip_batch_4_istride_1_CI_ostride_1_CI_idist_268435456_odist_268435456_ioffset_0_0_ooffset_0_0
162s clients/tests/accuracy_test.h:1260: Skipped
162s Raw problem size (9 GiB) raw data too large for device

There are other tests that use much more RAM that are still skipped, even with the newer kernel:

395s [ RUN ] pow2_1D/accuracy_test.vs_fftw/complex_forward_len_268435456_double_op_batch_4_istride_1_CI_ostride_1_CI_idist_268435456_odist_268435456_ioffset_0_0_ooffset_0_0
395s clients/tests/accuracy_test.h:1214: Skipped
395s needed_ramgb: 96, ramgb limit: 61.

The failure therefore appears to be due to changes in the GPU memory capacity reported for the APUs. In Linux 6.10, the driver is able to dynamically allocate more system memory for use by the GPU. The rocfft library queries the amount of memory available on the device and skips tests that would use more memory than is available. I suspect that the allocation for this test is failing but the library attempts to use the returned value anyway.

While there may be 61GB of memory available in theory, other factors may reduce that amount. For example, allocations in host memory might reduce the memory available to the device on APU systems like this.
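
To illustrate the shared-pool concern, here is a minimal sketch in Python. It is not rocFFT's actual skip logic; the function names and the single-shared-pool model are assumptions made purely for illustration. It contrasts checking each allocation against its own limit with checking the combined demand against one pool.

```python
def fits_independent_limits(host_gb, device_gb, host_limit_gb, device_limit_gb):
    """Check each request against its own limit, ignoring any sharing.

    Hypothetical helper: models a check that treats host and device
    memory as two separate budgets.
    """
    return host_gb <= host_limit_gb and device_gb <= device_limit_gb


def fits_shared_pool(host_gb, device_gb, pool_gb):
    """Check the combined request against a single shared pool.

    Hypothetical helper: models an APU where host and device
    allocations are carved out of the same physical RAM.
    """
    return host_gb + device_gb <= pool_gb


# Each 40 GB request passes when checked against its own 61 GB limit...
print(fits_independent_limits(40, 40, 61, 61))  # True
# ...but together they exceed a shared 61 GB pool.
print(fits_shared_pool(40, 40, 61))  # False
```

Under this assumed model, a test can pass a per-device check and still run out of memory once the host-side buffers are also allocated.
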
While the device memory limit might be 61GB, you cannot allocate 40GB of host memory and 40GB of device memory, as they come from the same pool.

The rocfft-test client provides the --R <gb> and --V <gb> options to explicitly specify the host and device memory limits, respectively. That may be a good workaround for the moment, although it would be better if rocfft-test could query the system hardware and choose appropriate defaults.

Sincerely,
Cory Bloor

[1]: https://ci.rocm.debian.net/data/autopkgtest/unstable/amd64+gfx1035/r/rocfft/33925/log.gz
[2]: https://ci.rocm.debian.net/data/autopkgtest/unstable/amd64+gfx1035/r/rocfft/34278/log.gz

-- System Information:
Debian Release: 12.7
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'stable-security'), (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 6.10.6+bpo-amd64 (SMP w/16 CPU threads; PREEMPT)
Locale: LANG=en_CA.UTF-8, LC_CTYPE=en_CA.UTF-8 (charmap=UTF-8), LANGUAGE=en_CA:en
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled