On 2024-09-27 23:34, Cordell Bloor wrote:
>> $ sudo sysctl kernel.dmesg_restrict=0
>> $ sudo sysctl vm.overcommit_memory=2
> The log output after applying both changes:
>
> [ RUN ] pow2_1D/accuracy_test.vs_fftw/complex_forward_len_67108864_single_op_batch_1_istride_1_CI_ostride_1_CI_idist_67108864_odist_67108864_ioffset_0_0_ooffset_0_0
> [ OK ] pow2_1D/accuracy_test.vs_fftw/complex_forward_len_67108864_single_op_batch_1_istride_1_CI_ostride_1_CI_idist_67108864_odist_67108864_ioffset_0_0_ooffset_0_0 (953 ms)
> [ RUN ] pow2_1D/accuracy_test.vs_fftw/complex_forward_len_134217728_double_ip_batch_4_istride_1_CI_ostride_1_CI_idist_134217728_odist_134217728_ioffset_0_0_ooffset_0_0
> command1 FAIL non-zero exit status 1
>
> The dmesg output from test after applying both changes:

Am I interpreting this right that the "Killed" disappeared? If so, then the issue should be reproducible by re-enabling vm.overcommit_memory=0. Would be nice to be certain of this.

> [50555.651205] __vm_enough_memory: pid: 57317, comm: rocfft-test, bytes: 8592035840 not enough memory for the allocation
> [50555.651226] __vm_enough_memory: pid: 57317, comm: rocfft-test, bytes: 8592035840 not enough memory for the allocation
> [50555.651233] __vm_enough_memory: pid: 57317, comm: rocfft-test, bytes: 8572432384 not enough memory for the allocation
> [50555.651237] __vm_enough_memory: pid: 57317, comm: rocfft-test, bytes: 8592166912 not enough memory for the allocation
> [50555.651261] show_signal_msg: 11 callbacks suppressed
> [50555.651263] rocfft-test[57317]: segfault at 3c0 ip 00007fab8c38937b sp 00007faa749fe558 error 6 in libfftw3.so.3.6.10[18937b,7fab8c224000+1c5000] likely on CPU 9 (core 4, socket 0)
> [50555.651276] Code: 2d 57 15 48 8e 06 00 c4 c1 65 5c d9 c5 e5 57 1d 3b 8e 06 00 c4 43 7d 05 d2 05 c4 e3 7d 05 db 05 c4 41 4d 5c ca c4 c1 4d 58 f2 <c4> 43 7d 19 0c 0a 01 c4 41 79 29 0a c5 55 58 cb c5 d5 5c eb 4d 8b
>
> I also just noticed that [2] is segfaulting, so there's clearly another
> issue even with the older kernel. I hadn't noticed that before.
> It didn't do that when rocfft 6.1.2 was first uploaded [4].

It seems that this is non-deterministic. Some tests complete, some don't. Sadly, we don't have dmesg for the older tests, but looking at the tail of the log [5] from just two days after [4], we can see a

> 7779s Memory access fault by GPU node-1 (Agent handle: 0x55790431e060) on address 0xffb895600000. Reason: Page not present or supervisor privilege.
> [...]

which could be related. It looks non-deterministic because this only occurs occasionally, and at different locations in the test run. An easy way to spot this is to look at log sizes [6]; logs of completed tests tend to be ~640KB, and a shorter log means an early abort. This one [7] crashed almost immediately.

If it's not related, things become even more complicated...

> See attached for rocminfo logs from Debian Stable. Here's the diff:
> Pool Info:
> Pool 1
> Segment: GLOBAL; FLAGS: COARSE GRAINED
> - Size: 2097152(0x200000) KB
> + Size: 31761860(0x1e4a5c4) KB

This is the pool from the gfx1035. It increased in size from 2GiB to ~32GiB.

If overcommit was indeed the issue behind "Killed", then I suspect that the test malloc'ed so much that it eventually triggered the OOM killer once test and GPU together had consumed all physical memory, e.g. with a 32GiB large test case computed on both GPU and CPU for expected/actual comparison.

Best,
Christian

>>> [1]: https://ci.rocm.debian.net/data/autopkgtest/unstable/amd64+gfx1035/r/rocfft/33925/log.gz
>>> [2]: https://ci.rocm.debian.net/data/autopkgtest/unstable/amd64+gfx1035/r/rocfft/34278/log.gz
>> [3]: https://sources.debian.org/src/rocfft/6.1.2-1/debian/tests/upstream-binaries/#L70
>>
> [4]: https://ci.rocm.debian.net/data/autopkgtest/unstable/amd64+gfx1035/r/rocfft/18220/log.gz

[5]: https://ci.rocm.debian.net/packages/r/rocfft/unstable/amd64+gfx1035/18314/
[6]: https://ci.rocm.debian.net/packages/r/rocfft/unstable/amd64+gfx1035/
[7]: https://ci.rocm.debian.net/packages/r/rocfft/unstable/amd64+gfx1035/23638/
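
PS: As a rough sanity check on the allocation sizes in the dmesg output, here is a back-of-the-envelope sketch in Python. It is my own arithmetic, not anything taken from the rocFFT test code. It assumes that "CI" in the test name means interleaved complex data, and that one buffer holds length * batch complex elements; the helper name fft_buffer_bytes is made up for illustration.

```python
GIB = 1024 ** 3

def fft_buffer_bytes(length, batch, precision):
    """Bytes for one interleaved-complex buffer (assumed layout, see above)."""
    real_bytes = {"single": 4, "double": 8}[precision]
    # Each complex element stores a real and an imaginary part.
    return length * batch * 2 * real_bytes

# The test that passed: len 67108864, single precision, batch 1.
ok_case = fft_buffer_bytes(67108864, 1, "single")

# The test that failed: len 134217728, double precision, batch 4.
failing_case = fft_buffer_bytes(134217728, 4, "double")

# Largest-but-one size rejected by __vm_enough_memory in the dmesg output.
rejected = 8592035840

print(f"passing test buffer: {ok_case / GIB} GiB")
print(f"failing test buffer: {failing_case / GIB} GiB")
print(f"rejected allocation exceeds that buffer by {rejected - failing_case} bytes")
```

Under these assumptions, a single buffer of the failing test is 8 GiB, and the rejected 8592035840-byte allocations are only about 2 MiB larger than that. This would fit the picture of the CPU-side reference computation trying to allocate one full-size buffer at a time, but that link is an inference on my part.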