From: Vitaly Prosyak <[email protected]>
The ring content dump in amdgpu_coredump() uses two separate loops over
adev->rings[]: the first counts rings with unsignalled fences to size
the allocation, and the second copies ring data into the allocated
buffers.
Both loops use the same condition to skip rings:
atomic_read(&ring->fence_drv.last_seq) == ring->fence_drv.sync_seq
Because last_seq is an atomic that is updated concurrently by the fence
signalling path, additional rings may appear unsignalled in the second
loop that were signalled during the first. When this happens, idx
exceeds the allocated ring_count and the store to coredump->rings[idx]
writes past the end of the kcalloc-ed buffer.
This was found during IGT stressful test amd_queue_reset which
triggers random GPU resets. The OVERSIZE subtest
(CMD_STREAM_EXEC_INVALID_PACKET_LENGTH_OVERSIZE on GFX ring) provokes
a ring timeout and subsequent coredump, which hits the race between
the counting and copying loops. The failure is non-deterministic and
depends on fence signalling timing during the reset.
KASAN log:
BUG: KASAN: slab-out-of-bounds in amdgpu_coredump+0x1274/0x12f0 [amdgpu]
Write of size 4 at addr ffff888106154258 by task kworker/u128:5/23625
CPU: 16 UID: 0 PID: 23625 Comm: kworker/u128:5 Not tainted 6.19.0+ #35
Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
Call Trace:
<TASK>
dump_stack_lvl+0xa5/0x110
print_report+0xd1/0x660
kasan_report+0xf3/0x130
__asan_report_store4_noabort+0x17/0x30
amdgpu_coredump+0x1274/0x12f0 [amdgpu]
amdgpu_job_timedout+0xef0/0x16c0 [amdgpu]
drm_sched_job_timedout+0x194/0x5c0 [gpu_sched]
process_one_work+0x84b/0x1990
worker_thread+0x6b8/0x11b0
</TASK>
Allocated by task 23625:
kasan_save_stack+0x39/0x70
__kasan_kmalloc+0xc3/0xd0
__kmalloc_noprof+0x2ec/0x910
amdgpu_coredump+0x5c5/0x12f0 [amdgpu]
amdgpu_job_timedout+0xef0/0x16c0 [amdgpu]
The buggy address belongs to the object at ffff888106154200
which belongs to the cache kmalloc-rnd-09-96 of size 96
The buggy address is located 16 bytes to the right of
allocated 72-byte region [ffff888106154200, ffff888106154248)
72 bytes = 3 * sizeof(struct amdgpu_coredump_ring), so ring_count was 3
but idx reached 3+, writing ring_index (at struct offset 16) 16 bytes
past the allocation.
Fix by adding an idx < ring_count guard to the copy loop so it cannot
exceed the allocated count even when the fence state changes between
the two passes.
Fixes: 678236b37eee (drm/amdgpu: save ring content before resetting the device)
Cc: Pierre-Eric Pelloux-Prayer <[email protected]>
Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Signed-off-by: Vitaly Prosyak <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
index d386bc775d03..3d5a2abf27c6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
@@ -553,7 +553,7 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool
skip_vram_check,
coredump->rings_dw = kzalloc(total_ring_size, GFP_NOWAIT);
coredump->rings = kcalloc(ring_count, sizeof(struct
amdgpu_coredump_ring), GFP_NOWAIT);
if (coredump->rings && coredump->rings_dw) {
- for (i = 0, off = 0, idx = 0; i < adev->num_rings; i++) {
+ for (i = 0, off = 0, idx = 0; i < adev->num_rings && idx <
ring_count; i++) {
ring = adev->rings[i];
if (atomic_read(&ring->fence_drv.last_seq) ==
ring->fence_drv.sync_seq &&
--
2.43.0