Hi Cory,

On 2024-09-27 20:52, Cordell Bloor wrote:
> The rocfft tests are crashing on gfx1035 after updating the kernel from
> bookworm to bookworm-backports (6.1 to 6.10). This can be seen by comparing
> these two nearly identical runs from before [1] and after [2] installing a new
> kernel.
>
> This is the failing test:
> 498s [ RUN ]
> pow2_1D/accuracy_test.vs_fftw/complex_forward_len_268435456_single_ip_batch_4_istride_1_CI_ostride_1_CI_idist_268435456_odist_268435456_ioffset_0_0_ooffset_0_0
> 558s Killed
"Killed" sounds like something sent SIGKILL, and I suspect this was the OOM killer, especially since you mention memory allocation below. > 558s dmesg: read kernel buffer failed: Operation not permitted This isn't from the test, this is our test runner that tries to capture dmesg before and after [3] each test, for debugging purposes. These get exported as artifacts, and made available in our CI. This fails with rootless podman because reading dmesg is a privileged operation by default. On the host, could you try $ sudo sysctl kernel.dmesg_restrict=0 and then run the test again. This should enable dmesg capturing by regular users, and if it really is the OOM killer, it should be logged there. (In QEMU, we have root access to dmesg, so this is not a problem there.) > It seems that previously it was skipped: > 162s [ RUN ] > pow2_1D/accuracy_test.vs_fftw/complex_forward_len_268435456_single_ip_batch_4_istride_1_CI_ostride_1_CI_idist_268435456_odist_268435456_ioffset_0_0_ooffset_0_0 > 162s clients/tests/accuracy_test.h:1260: Skipped > 162s Raw problem size (9 GiB) raw data too large for device > > There are other tests that use much more RAM that are still skipped, even > with the newer kernel: > 395s [ RUN ] > pow2_1D/accuracy_test.vs_fftw/complex_forward_len_268435456_double_op_batch_4_istride_1_CI_ostride_1_CI_idist_268435456_odist_268435456_ioffset_0_0_ooffset_0_0 > 395s clients/tests/accuracy_test.h:1214: Skipped > 395s needed_ramgb: 96, ramgb limit: 61. > > The failure therefore appears to be due to changes in the GPU memory capacity > reported for the APUs. In Linux 6.10, the driver is able to dynamically > allocate more system memory for use by the GPU. > > The rocfft library queries the amount of memory available on the device and it > skips tests that would use more memory than is available. I suspect that the > allocation for this test is failing but the library attempts to use the > returned value anyway. 

Possibly another factor: the kernel overcommits memory by default. If more
memory is actually used than is physically available, the OOM killer will kill
something, which would fit the "Killed" above neatly.

You can turn off overcommitment with:

  $ sudo sysctl vm.overcommit_memory=2

Perhaps that also changes something.

> The rocfft-test client provides the --R <gb> and --V <gb> options to
> explicitly specify the host and device memory limits, respectively. That may
> be a good workaround for the moment, although it would be better if
> rocfft-test could query the system hardware and choose appropriate defaults.

Could you share the output of rocminfo with both 6.1 and 6.10? I don't think it
needs to be run in the test container. At least, I don't see why the result on
bare metal should differ.

Best,
Christian

> [1]:
> https://ci.rocm.debian.net/data/autopkgtest/unstable/amd64+gfx1035/r/rocfft/33925/log.gz
> [2]:
> https://ci.rocm.debian.net/data/autopkgtest/unstable/amd64+gfx1035/r/rocfft/34278/log.gz

[3]: https://sources.debian.org/src/rocfft/6.1.2-1/debian/tests/upstream-binaries/#L70
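
P.S. If you do try vm.overcommit_memory=2, it may help to see how close the
host is to the commit limit before and after the test. The sketch below parses
the CommitLimit and Committed_AS fields of /proc/meminfo. Note that, as far as
I know, CommitLimit is only enforced in strict mode (2). The function names
and the sample values are hypothetical and only serve as an illustration.

```python
def parse_meminfo(text):
    """Map /proc/meminfo field names to their numeric values (mostly kB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def commit_headroom_kb(meminfo_text):
    """Return CommitLimit minus Committed_AS, in kB."""
    info = parse_meminfo(meminfo_text)
    return info["CommitLimit"] - info["Committed_AS"]


# Hypothetical sample values, for illustration only.
SAMPLE_MEMINFO = """\
MemTotal:       65536000 kB
CommitLimit:    40000000 kB
Committed_AS:   12000000 kB
"""

print(commit_headroom_kb(SAMPLE_MEMINFO))
```

On a real system, pass open("/proc/meminfo").read() instead of SAMPLE_MEMINFO.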