Bug#1082888: librocfft0-tests: read kernel buffer failed on Linux 6.10

Christian Kastner Fri, 27 Sep 2024 15:15:15 -0700

On 2024-09-27 23:34, Cordell Bloor wrote:
>>   $ sudo sysctl kernel.dmesg_restrict=0
>>   $ sudo sysctl vm.overcommit_memory=2


> The log output after applying both changes:
> 
> [ RUN      ]
> pow2_1D/accuracy_test.vs_fftw/complex_forward_len_67108864_single_op_batch_1_istride_1_CI_ostride_1_CI_idist_67108864_odist_67108864_ioffset_0_0_ooffset_0_0
> [       OK ]
> pow2_1D/accuracy_test.vs_fftw/complex_forward_len_67108864_single_op_batch_1_istride_1_CI_ostride_1_CI_idist_67108864_odist_67108864_ioffset_0_0_ooffset_0_0
>  (953 ms)
> [ RUN      ]
> pow2_1D/accuracy_test.vs_fftw/complex_forward_len_134217728_double_ip_batch_4_istride_1_CI_ostride_1_CI_idist_134217728_odist_134217728_ioffset_0_0_ooffset_0_0
> command1             FAIL non-zero exit status 1
> 
> The dmesg output from test after applying both changes:

Am I interpreting this right that the "Killed" disappeared? If so, then the 
issue should be reproducible by restoring vm.overcommit_memory=0.

Would be nice to be certain of this.
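
One way to get that certainty would be to log the overcommit mode and the commit accounting right before each run. A minimal sketch, assuming a Linux procfs; the helper names (parse_meminfo, commit_headroom_kib, read_state) are made up for illustration and are not part of any test harness:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: int} (kB where a unit is given)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def commit_headroom_kib(meminfo):
    """CommitLimit minus Committed_AS; the limit is only enforced in mode 2."""
    return meminfo["CommitLimit"] - meminfo["Committed_AS"]


def read_state():
    """Read the current overcommit mode and commit headroom from procfs."""
    with open("/proc/sys/vm/overcommit_memory") as f:
        mode = int(f.read())
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    return mode, commit_headroom_kib(info)
```

Calling read_state() before a run with mode 0 and again with mode 2 would show which setting was active when "Killed" or the segfault appears.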

> [50555.651205] __vm_enough_memory: pid: 57317, comm: rocfft-test, bytes:
> 8592035840 not enough memory for the allocation
> [50555.651226] __vm_enough_memory: pid: 57317, comm: rocfft-test, bytes:
> 8592035840 not enough memory for the allocation
> [50555.651233] __vm_enough_memory: pid: 57317, comm: rocfft-test, bytes:
> 8572432384 not enough memory for the allocation
> [50555.651237] __vm_enough_memory: pid: 57317, comm: rocfft-test, bytes:
> 8592166912 not enough memory for the allocation
> [50555.651261] show_signal_msg: 11 callbacks suppressed
> [50555.651263] rocfft-test[57317]: segfault at 3c0 ip 00007fab8c38937b
> sp 00007faa749fe558 error 6 in
> libfftw3.so.3.6.10[18937b,7fab8c224000+1c5000] likely on CPU 9 (core 4,
> socket 0)
> [50555.651276] Code: 2d 57 15 48 8e 06 00 c4 c1 65 5c d9 c5 e5 57 1d 3b
> 8e 06 00 c4 43 7d 05 d2 05 c4 e3 7d 05 db 05 c4 41 4d 5c ca c4 c1 4d 58
> f2 <c4> 43 7d 19 0c 0a 01 c4 41 79 29 0a c5 55 58 cb c5 d5 5c eb 4d 8b>

> I also just noticed that [2] is segfaulting, so there's clearly another
> issue even with the older kernel. I hadn't noticed that before. It
> didn't do that when rocfft 6.1.2 was first uploaded [4].

It seems that this is non-deterministic. Some tests complete, some don't. Sadly, 
we don't have dmesg for the older tests, but looking at the tail of the log [5] 
just two days after [4], we can see a

> 7779s Memory access fault by GPU node-1 (Agent handle: 0x55790431e060) on 
> address 0xffb895600000. Reason: Page not present or supervisor privilege. 
> [...]

which could be related.

It looks non-deterministic because this only occurs occasionally, and at 
different locations in the test run. An easy way to spot this is to look at log 
sizes [6]: logs of completed tests tend to be ~640KB, and anything shorter 
indicates an early abort. This one [7] crashed almost immediately.
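
The log-size heuristic can be sketched as a small filter. This assumes the sizes have already been collected by hand from [6]; the 0.9 cutoff, the function name and the run names in the usage note are arbitrary choices for illustration:

```python
TYPICAL_COMPLETE_BYTES = 640 * 1024  # the "~640KB" seen for completed runs


def likely_aborted(log_sizes, fraction=0.9):
    """Return run ids whose log is clearly shorter than a completed run's log.

    log_sizes maps a run id to its log size in bytes; fraction is a
    hypothetical cutoff, not something derived from the CI data.
    """
    threshold = TYPICAL_COMPLETE_BYTES * fraction
    return sorted(run for run, size in log_sizes.items() if size < threshold)
```

For example, likely_aborted({"run-a": 650000, "run-b": 12000}) would flag only "run-b".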

If it's not related, things become even more complicated...

> See attached for rocminfo logs from Debian Stable. Here's the diff:

>    Pool Info:               
>      Pool 1                   
>        Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
> -      Size:                    2097152(0x200000) KB               
> +      Size:                    31761860(0x1e4a5c4) KB     

This is the pool from the gfx1035. It increased in size from 2GiB to ~30GiB.
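
As a quick check of the pool sizes, here is the conversion, assuming rocminfo's "KB" means 1024 bytes (which fits 0x200000 KB being exactly 2GiB):

```python
KIB_PER_GIB = 1024 * 1024


def kib_to_gib(kib):
    """Convert a size in KiB to GiB."""
    return kib / KIB_PER_GIB


old_pool = kib_to_gib(2097152)   # 0x200000 KB, from the "-" line of the diff
new_pool = kib_to_gib(31761860)  # 0x1e4a5c4 KB, from the "+" line of the diff
print(f"{old_pool:.2f} GiB -> {new_pool:.2f} GiB")  # 2.00 GiB -> 30.29 GiB
```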

If overcommit was indeed the issue behind "Killed", then I suspect that the 
test malloc'ed so much that it eventually triggered the OOM killer once test 
and GPU together consumed all physical memory, e.g. with a 32GiB test case 
computed on both GPU and CPU for the expected/actual comparison.
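
For a rough sense of scale, the buffer size of the test that failed in the log above can be estimated from its name (len_134217728, double, batch_4). This sketch assumes interleaved complex doubles at 16 bytes per element; how many such buffers the test holds at once is not known from the logs:

```python
COMPLEX_DOUBLE_BYTES = 16  # assumption: two 8-byte doubles per element


def fft_buffer_bytes(length, batch, bytes_per_element):
    """Size of one contiguous buffer holding `batch` transforms of `length`."""
    return length * batch * bytes_per_element


one_buffer = fft_buffer_bytes(134217728, 4, COMPLEX_DOUBLE_BYTES)
print(one_buffer, one_buffer / 2**30)  # 8589934592 8.0
```

That is within about 2MiB of the 8592035840-byte allocations that dmesg reports as failing, so only a few such buffers would be needed to exhaust memory.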

Best,
Christian

>>> [1]: 
>>> https://ci.rocm.debian.net/data/autopkgtest/unstable/amd64+gfx1035/r/rocfft/33925/log.gz
>>> [2]: 
>>> https://ci.rocm.debian.net/data/autopkgtest/unstable/amd64+gfx1035/r/rocfft/34278/log.gz
>> [3]: 
>> https://sources.debian.org/src/rocfft/6.1.2-1/debian/tests/upstream-binaries/#L70
>>
> [4]:
> https://ci.rocm.debian.net/data/autopkgtest/unstable/amd64+gfx1035/r/rocfft/18220/log.gz

[5]: https://ci.rocm.debian.net/packages/r/rocfft/unstable/amd64+gfx1035/18314/
[6]: https://ci.rocm.debian.net/packages/r/rocfft/unstable/amd64+gfx1035/
[7]: https://ci.rocm.debian.net/packages/r/rocfft/unstable/amd64+gfx1035/23638/
