[PATCH] drm/sched: Add error code parameter to drm_sched_start
From: Vitaly Prosyak

The current implementation of drm_sched_start uses a hardcoded -ECANCELED to dispose of a job when the parent/hw fence is NULL. This results in drm_sched_job_done being called with -ECANCELED for each job with a NULL parent in the pending list, making it difficult to tell which recovery method, a queue reset or a full GPU reset, was used.

To improve this, we first try a soft recovery for timeout jobs and use the error code -ENODATA. If soft recovery fails, we proceed with a queue reset, where the error code remains -ENODATA for the job. Finally, for a full GPU reset, we use the error codes -ECANCELED or -ETIME.

This patch adds an error code parameter to drm_sched_start, allowing us to differentiate between queue reset and GPU reset failures. This enables user mode and test applications to validate the expected correctness of the requested operation. After a successful queue reset, the only way to continue normal operation is to call drm_sched_job_done with the specific error code -ENODATA.

v1: The initial implementation by Jesse used amdgpu_device_lock_reset_domain and amdgpu_device_unlock_reset_domain to allow user mode to track the queue reset status and distinguish between queue reset and GPU reset.
v2: Christian suggested using the error codes -ENODATA for queue reset and -ECANCELED or -ETIME for GPU reset, returned to amdgpu_cs_wait_ioctl.
v3: To meet the requirements, introduce a new function drm_sched_start_ex with an additional parameter to set dma_fence_set_error, allowing us to handle the specific error codes appropriately and dispose of bad jobs with the selected error code depending on whether it was a queue reset or a GPU reset.
v4: Alex suggested using a new name, drm_sched_start_with_recovery_error, which more accurately describes the function's purpose. Additionally, it was recommended to add documentation details about the new method.
v5: Fixed the declaration of the new function drm_sched_start_with_recovery_error. (Alex)

Cc: Alex Deucher
Cc: Christian Koenig
Signed-off-by: Jesse Zhang
Signed-off-by: Vitaly Prosyak
---
 drivers/gpu/drm/scheduler/sched_main.c | 30 +++---
 include/drm/gpu_scheduler.h            |  1 +
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 7e90c9f95611..c42449358b3f 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -671,13 +671,24 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
 EXPORT_SYMBOL(drm_sched_stop);
 
 /**
- * drm_sched_start - recover jobs after a reset
+ * drm_sched_start_with_recovery_error - recover jobs after a reset with
+ * custom error
  *
  * @sched: scheduler instance
  * @full_recovery: proceed with complete sched restart
+ * @error : err code for set dma_fence_set_error
+ *
+ * Starts the scheduler and allows setting a custom dma_fence_set_error,
+ * which can be used to identify the recovery mechanism actually used.
  *
+ * For example:
+ * - If a soft or queue reset was used, dma_fence_set_error is set to -ENODATA.
+ * - If an entire GPU reset was used, the error code is set to -ECANCELED.
+ *
+ * This approach enables user mode and test applications to know which
+ * recovery method was used for a given bad job.
  */
-void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
+void drm_sched_start_with_recovery_error(struct drm_gpu_scheduler *sched, bool full_recovery, int error)
 {
 	struct drm_sched_job *s_job, *tmp;
 	int r;
@@ -704,7 +715,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
 			DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n",
 				      r);
 		} else
-			drm_sched_job_done(s_job, -ECANCELED);
+			drm_sched_job_done(s_job, error);
 	}
 
 	if (full_recovery)
@@ -712,6 +723,19 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
 
 	drm_sched_wqueue_start(sched);
 }
+EXPORT_SYMBOL(drm_sched_start_with_recovery_error);
+
+/**
+ * drm_sched_start - recover jobs after a reset
+ *
+ * @sched: scheduler instance
+ * @full_recovery: proceed with complete sched restart
+ *
+ */
+void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
+{
+	drm_sched_start_with_recovery_error(sched, full_recovery, -ECANCELED);
+}
 EXPORT_SYMBOL(drm_sched_start);
 
 /**
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 5acc64954a88..e80ea947a864 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -580,6 +580,7 @@ void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
 void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
 void drm_sched_stop(struct drm_gpu_sch
[PATCH 1/2] drm/sched: Add error code parameter to drm_sched_start
From: Vitaly Prosyak

The current implementation of drm_sched_start uses a hardcoded -ECANCELED to dispose of a job when the parent/hw fence is NULL. This results in drm_sched_job_done being called with -ECANCELED for each job with a NULL parent in the pending list, making it difficult to tell which recovery method, a queue reset or a full GPU reset, was used.

To improve this, we first try a soft recovery for timeout jobs and use the error code -ENODATA. If soft recovery fails, we proceed with a queue reset, where the error code remains -ENODATA for the job. Finally, for a full GPU reset, we use the error codes -ECANCELED or -ETIME.

This patch adds an error code parameter to drm_sched_start, allowing us to differentiate between queue reset and GPU reset failures. This enables user mode and test applications to validate the expected correctness of the requested operation. After a successful queue reset, the only way to continue normal operation is to call drm_sched_job_done with the specific error code -ENODATA.

v1: The initial implementation by Jesse used amdgpu_device_lock_reset_domain and amdgpu_device_unlock_reset_domain to allow user mode to track the queue reset status and distinguish between queue reset and GPU reset.
v2: Christian suggested using the error codes -ENODATA for queue reset and -ECANCELED or -ETIME for GPU reset, returned to amdgpu_cs_wait_ioctl.
v3: To meet the requirements, introduce a new function drm_sched_start_ex with an additional parameter to set dma_fence_set_error, allowing us to handle the specific error codes appropriately and dispose of bad jobs with the selected error code depending on whether it was a queue reset or a GPU reset.
v4: Alex suggested using a new name, drm_sched_start_with_recovery_error, which more accurately describes the function's purpose. Additionally, it was recommended to add documentation details about the new method.

Cc: Alex Deucher
Cc: Christian Koenig
Signed-off-by: Jesse Zhang
Signed-off-by: Vitaly Prosyak
---
 drivers/gpu/drm/scheduler/sched_main.c | 30 +++---
 include/drm/gpu_scheduler.h            |  1 +
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 7e90c9f95611..c42449358b3f 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -671,13 +671,24 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
 EXPORT_SYMBOL(drm_sched_stop);
 
 /**
- * drm_sched_start - recover jobs after a reset
+ * drm_sched_start_with_recovery_error - recover jobs after a reset with
+ * custom error
  *
  * @sched: scheduler instance
  * @full_recovery: proceed with complete sched restart
+ * @error : err code for set dma_fence_set_error
+ *
+ * Starts the scheduler and allows setting a custom dma_fence_set_error,
+ * which can be used to identify the recovery mechanism actually used.
  *
+ * For example:
+ * - If a soft or queue reset was used, dma_fence_set_error is set to -ENODATA.
+ * - If an entire GPU reset was used, the error code is set to -ECANCELED.
+ *
+ * This approach enables user mode and test applications to know which
+ * recovery method was used for a given bad job.
  */
-void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
+void drm_sched_start_with_recovery_error(struct drm_gpu_scheduler *sched, bool full_recovery, int error)
 {
 	struct drm_sched_job *s_job, *tmp;
 	int r;
@@ -704,7 +715,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
 			DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n",
 				      r);
 		} else
-			drm_sched_job_done(s_job, -ECANCELED);
+			drm_sched_job_done(s_job, error);
 	}
 
 	if (full_recovery)
@@ -712,6 +723,19 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
 
 	drm_sched_wqueue_start(sched);
 }
+EXPORT_SYMBOL(drm_sched_start_with_recovery_error);
+
+/**
+ * drm_sched_start - recover jobs after a reset
+ *
+ * @sched: scheduler instance
+ * @full_recovery: proceed with complete sched restart
+ *
+ */
+void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
+{
+	drm_sched_start_with_recovery_error(sched, full_recovery, -ECANCELED);
+}
 EXPORT_SYMBOL(drm_sched_start);
 
 /**
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 5acc64954a88..444fa6761590 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -580,6 +580,7 @@ void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
 void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
 void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
 void drm_sched_start(struct drm_gpu_
[PATCH] drm/sched: Add error code parameter to drm_sched_start
From: Vitaly Prosyak

The current implementation of drm_sched_start uses a hardcoded -ECANCELED to dispose of a job when the parent/hw fence is NULL. This results in drm_sched_job_done being called with -ECANCELED for each job with a NULL parent in the pending list, making it difficult to tell which recovery method, a queue reset or a full GPU reset, was used.

To improve this, we first try a soft recovery for timeout jobs and use the error code -ENODATA. If soft recovery fails, we proceed with a queue reset, where the error code would still be -ENODATA for the job. Finally, for a full GPU reset, we use the error codes -ECANCELED or -ETIME.

This patch adds an error code parameter to drm_sched_start, allowing us to differentiate between queue reset and GPU reset failures. This enables user mode and test applications to validate the expected correctness of the requested operation. After a successful queue reset, the only way to continue normal operation is to call drm_sched_job_done with the specific error code -ENODATA.

v1: The initial implementation by Jesse used amdgpu_device_lock_reset_domain and amdgpu_device_unlock_reset_domain to allow user mode to track the queue reset status and distinguish between queue reset and GPU reset.
v2: Christian suggested using the error codes -ENODATA for queue reset and -ECANCELED or -ETIME for GPU reset, returned to amdgpu_cs_wait_ioctl.
v3: To meet the requirements, introduce a new function drm_sched_start_ex with an additional parameter to set dma_fence_set_error, allowing us to handle the specific error codes appropriately and dispose of bad jobs with the selected error code depending on whether it was a queue reset or a GPU reset.

Signed-off-by: Jesse Zhang
Signed-off-by: Vitaly Prosyak
---
 drivers/gpu/drm/scheduler/sched_main.c | 19 ---
 include/drm/gpu_scheduler.h            |  1 +
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 7e90c9f95611..5a534772335a 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -671,13 +671,14 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
 EXPORT_SYMBOL(drm_sched_stop);
 
 /**
- * drm_sched_start - recover jobs after a reset
+ * drm_sched_start_ex - recover jobs after a reset
  *
  * @sched: scheduler instance
  * @full_recovery: proceed with complete sched restart
+ * @error : err code for set dma_fence_set_error
  *
  */
-void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
+void drm_sched_start_ex(struct drm_gpu_scheduler *sched, bool full_recovery, int error)
 {
 	struct drm_sched_job *s_job, *tmp;
 	int r;
@@ -704,7 +705,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
 			DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n",
 				      r);
 		} else
-			drm_sched_job_done(s_job, -ECANCELED);
+			drm_sched_job_done(s_job, error);
 	}
 
 	if (full_recovery)
@@ -712,6 +713,18 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
 
 	drm_sched_wqueue_start(sched);
 }
+EXPORT_SYMBOL(drm_sched_start_ex);
+/**
+ * drm_sched_start - recover jobs after a reset
+ *
+ * @sched: scheduler instance
+ * @full_recovery: proceed with complete sched restart
+ *
+ */
+void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
+{
+	drm_sched_start_ex(sched, full_recovery, -ECANCELED);
+}
 EXPORT_SYMBOL(drm_sched_start);
 
 /**
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 5acc64954a88..444fa6761590 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -580,6 +580,7 @@ void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
 void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
 void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
 void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
+void drm_sched_start_ex(struct drm_gpu_scheduler *sched, bool full_recovery, int error);
 void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
 void drm_sched_increase_karma(struct drm_sched_job *bad);
 void drm_sched_reset_karma(struct drm_sched_job *bad);
-- 
2.25.1
[PATCH] drm/amdgpu: Add error parameter to amdgpu_fence_driver_force_completion
From: Vitaly Prosyak

In the case of a queue reset, we need the ability to customize the error code, replacing -ECANCELED with -ENODATA for scenarios where the queue reset is successful. It was decided to use -ECANCELED for GPU reset cases and -ENODATA for queue reset cases.

This change introduces an error parameter to amdgpu_fence_driver_force_completion, allowing us to specify a custom error code for each case (queue reset or GPU reset).

Signed-off-by: Jesse Zhang
Signed-off-by: Vitaly Prosyak
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c   | 6 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h    | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c     | 2 +-
 5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 0e1a11b6b989..a2b42c079bbf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1960,7 +1960,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 val)
 	/* swap out the old fences */
 	amdgpu_ib_preempt_fences_swap(ring, fences);
 
-	amdgpu_fence_driver_force_completion(ring);
+	amdgpu_fence_driver_force_completion(ring, -ECANCELED);
 
 	/* resubmit unfinished jobs */
 	amdgpu_ib_preempt_job_recovery(&ring->sched);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 730dae77570c..441eb8757d09 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5308,7 +5308,7 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 		amdgpu_fence_driver_clear_job_fences(ring);
 
 		/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
-		amdgpu_fence_driver_force_completion(ring);
+		amdgpu_fence_driver_force_completion(ring, -ECANCELED);
 	}
 
 	amdgpu_fence_driver_isr_toggle(adev, false);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 2f24a6aa13bf..ac4942fdbae0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -610,7 +610,7 @@ void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
 			r = -ENODEV;
 		/* no need to trigger GPU reset as we are unloading */
 		if (r)
-			amdgpu_fence_driver_force_completion(ring);
+			amdgpu_fence_driver_force_completion(ring, -ECANCELED);
 
 		if (!drm_dev_is_unplugged(adev_to_drm(adev)) &&
 		    ring->fence_drv.irq_src &&
@@ -757,9 +757,9 @@ void amdgpu_fence_driver_set_error(struct amdgpu_ring *ring, int error)
  * @ring: fence of the ring to signal
  *
  */
-void amdgpu_fence_driver_force_completion(struct amdgpu_ring *ring)
+void amdgpu_fence_driver_force_completion(struct amdgpu_ring *ring, int error)
 {
-	amdgpu_fence_driver_set_error(ring, -ECANCELED);
+	amdgpu_fence_driver_set_error(ring, error);
 	amdgpu_fence_write(ring, ring->fence_drv.sync_seq);
 	amdgpu_fence_process(ring);
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 582053f1cd56..045a2a548b80 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -130,7 +130,7 @@ extern const struct drm_sched_backend_ops amdgpu_sched_ops;
 void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring);
 void amdgpu_fence_driver_set_error(struct amdgpu_ring *ring, int error);
-void amdgpu_fence_driver_force_completion(struct amdgpu_ring *ring);
+void amdgpu_fence_driver_force_completion(struct amdgpu_ring *ring, int error);
 
 int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring);
 int amdgpu_fence_driver_start_ring(struct amdgpu_ring *ring,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
index 07d930339b07..ec440bb763d9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
@@ -511,7 +511,7 @@ int amdgpu_uvd_resume(struct amdgpu_device *adev)
 			}
 			memset_io(ptr, 0, size);
 			/* to restore uvd fence seq */
-			amdgpu_fence_driver_force_completion(&adev->uvd.inst[i].ring);
+			amdgpu_fence_driver_force_completion(&adev->uvd.inst[i].ring, -ECANCELED);
 		}
 	}
 	return 0;
-- 
2.25.1
[PATCH] drm/amdkfd: fix NULL pointer dereference
From: Vitaly Prosyak

[ +0.006038] BUG: kernel NULL pointer dereference, address: 0028
[ +0.006969] #PF: supervisor read access in kernel mode
[ +0.005139] #PF: error_code(0x) - not-present page
[ +0.005139] PGD 0 P4D 0
[ +0.002530] Oops: [#1] PREEMPT SMP NOPTI
[ +0.004356] CPU: 11 PID: 12625 Comm: kworker/11:0 Tainted: GW 6.7.0+ #2
[ +0.008097] Hardware name: ASUS System Product Name/Pro WS WRX80E-SAGE SE WIFI II, BIOS 1302 12/08/2023
[ +0.009398] Workqueue: events evict_process_worker [amdgpu]
[ +0.005750] RIP: 0010:evict_process_worker+0x2f/0x460 [amdgpu]
[ +0.005991] Code: 55 48 89 e5 41 57 41 56 4c 8d b7 a8 fc ff ff 41 55 41 54 53 48 89 fb 48 83 ec 10 0f 1f 44 00 00 48 8b 43 f8 8b 93 b0 00 00 00 <48> 3b 50 28 0f 85 50 03 00 00 48 8d 7b 58 e8 ee be cb bf 48 8b 05
[ +0.018791] RSP: 0018:c90009a2be10 EFLAGS: 00010282
[ +0.005226] RAX: RBX: 888197ffc358 RCX:
[ +0.007140] RDX: 0a1b RSI: RDI: 888197ffc358
[ +0.007139] RBP: c90009a2be48 R08: R09:
[ +0.007139] R10: R11: R12: 888197ffc358
[ +0.007139] R13: 888100153a00 R14: 888197ffc000 R15: 888100153a05
[ +0.007137] FS: () GS:889facac() knlGS:
[ +0.008094] CS: 0010 DS: ES: CR0: 80050033
[ +0.005747] CR2: 0028 CR3: 00010d1fc001 CR4: 00770ef0
[ +0.007138] PKRU: 5554
[ +0.002702] Call Trace:
[ +0.002443] <TASK>
[ +0.002096] ? show_regs+0x72/0x90
[ +0.003402] ? __die+0x25/0x80
[ +0.003052] ? page_fault_oops+0x154/0x4c0
[ +0.004099] ? do_user_addr_fault+0x30e/0x6e0
[ +0.004357] ? psi_group_change+0x237/0x520
[ +0.004185] ? exc_page_fault+0x84/0x1b0
[ +0.003926] ? asm_exc_page_fault+0x27/0x30
[ +0.004187] ? evict_process_worker+0x2f/0x460 [amdgpu]
[ +0.005377] process_one_work+0x17b/0x360
[ +0.004011] ? __pfx_worker_thread+0x10/0x10
[ +0.004269] worker_thread+0x307/0x430
[ +0.003748] ? __pfx_worker_thread+0x10/0x10
[ +0.004268] kthread+0xf7/0x130
[ +0.003142] ? __pfx_kthread+0x10/0x10
[ +0.003749] ret_from_fork+0x46/0x70
[ +0.003573] ? __pfx_kthread+0x10/0x10
[ +0.003747] ret_from_fork_asm+0x1b/0x30
[ +0.003924] </TASK>

When we run stressful tests, the eviction fence can be NULL and thus cannot match last_eviction_seqno. Avoid calling dma_fence_signal and dma_fence_put with a NULL fence, rather than relying on the parameter checks in the DMA fence API.

Cc: Alex Deucher
Cc: Christian Koenig
Cc: Xiaogang Chen
Cc: Felix Kuehling
Signed-off-by: Vitaly Prosyak
---
 drivers/gpu/drm/amd/amdkfd/kfd_process.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index eb380296017d..a15fae1c398a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -2118,7 +2118,7 @@ static void evict_process_worker(struct work_struct *work)
 	 */
 	p = container_of(dwork, struct kfd_process, eviction_work);
 	trace_kfd_evict_process_worker_start(p);
-	WARN_ONCE(p->last_eviction_seqno != p->ef->seqno,
+	WARN_ONCE(p->ef && p->last_eviction_seqno != p->ef->seqno,
 		  "Eviction fence mismatch\n");
 
 	/* Narrow window of overlap between restore and evict work
@@ -2134,9 +2134,11 @@ static void evict_process_worker(struct work_struct *work)
 	pr_debug("Started evicting pasid 0x%x\n", p->pasid);
 	ret = kfd_process_evict_queues(p, false, KFD_QUEUE_EVICTION_TRIGGER_TTM);
 	if (!ret) {
-		dma_fence_signal(p->ef);
-		dma_fence_put(p->ef);
-		p->ef = NULL;
+		if (p->ef) {
+			dma_fence_signal(p->ef);
+			dma_fence_put(p->ef);
+			p->ef = NULL;
+		}
 
 		if (!kfd_process_unmap_doorbells_if_idle(p))
 			kfd_process_schedule_restore(p);
-- 
2.25.1
[PATCH] drm/sched: fix null-ptr-deref in init entity
From: Vitaly Prosyak

The bug can be triggered by sending an amdgpu_cs_wait_ioctl to the AMDGPU DRM driver on any ASIC with a valid context. The bug was reported by Joonkyo Jung. For example, the following code:

static void Syzkaller2(int fd)
{
	union drm_amdgpu_ctx arg1;
	union drm_amdgpu_wait_cs arg2;
	int ret;

	arg1.in.op = AMDGPU_CTX_OP_ALLOC_CTX;
	ret = drmIoctl(fd, 0x140106442 /* amdgpu_ctx_ioctl */, &arg1);

	arg2.in.handle = 0x0;
	arg2.in.timeout = 0x2;
	arg2.in.ip_type = AMD_IP_VPE /* 0x9 */;
	arg2.in.ip_instance = 0x0;
	arg2.in.ring = 0x0;
	arg2.in.ctx_id = arg1.out.alloc.ctx_id;

	drmIoctl(fd, 0xc0206449 /* AMDGPU_WAIT_CS */, &arg2);
}

One might assume that AMDGPU_WAIT_CS without a previously submitted job should return an error, but commit 1decbf6bb0b4dc56c9da6c5e57b994ebfc2be3aa modified the logic and allowed sched_rq to be NULL. As a result, when there is no job, the AMDGPU_WAIT_CS ioctl returns success. This change fixes the null-ptr-deref in entity init; the stack below demonstrates the error condition:

[ +0.07] BUG: kernel NULL pointer dereference, address: 0028
[ +0.007086] #PF: supervisor read access in kernel mode
[ +0.005234] #PF: error_code(0x) - not-present page
[ +0.005232] PGD 0 P4D 0
[ +0.002501] Oops: [#1] PREEMPT SMP KASAN NOPTI
[ +0.005034] CPU: 10 PID: 9229 Comm: amd_basic Tainted: GB WL 6.7.0+ #4
[ +0.007797] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1401 12/03/2020
[ +0.009798] RIP: 0010:drm_sched_entity_init+0x2d3/0x420 [gpu_sched]
[ +0.006426] Code: 80 00 00 00 00 00 00 00 e8 1a 81 82 e0 49 89 9c 24 c0 00 00 00 4c 89 ef e8 4a 80 82 e0 49 8b 5d 00 48 8d 7b 28 e8 3d 80 82 e0 <48> 83 7b 28 00 0f 84 28 01 00 00 4d 8d ac 24 98 00 00 00 49 8d 5c
[ +0.019094] RSP: 0018:c90014c1fa40 EFLAGS: 00010282
[ +0.005237] RAX: 0001 RBX: RCX: 8113f3fa
[ +0.007326] RDX: fbfff0a7889d RSI: 0008 RDI: 853c44e0
[ +0.007264] RBP: c90014c1fa80 R08: 0001 R09: fbfff0a7889c
[ +0.007266] R10: 853c44e7 R11: 0001 R12: 8881a719b010
[ +0.007263] R13: 88810d412748 R14: 0002 R15:
[ +0.007264] FS: 77045540() GS:8883cc90() knlGS:
[ +0.008236] CS: 0010 DS: ES: CR0: 80050033
[ +0.005851] CR2: 0028 CR3: 00011912e000 CR4: 00350ef0
[ +0.007175] Call Trace:
[ +0.002561] <TASK>
[ +0.002141] ? show_regs+0x6a/0x80
[ +0.003473] ? __die+0x25/0x70
[ +0.003124] ? page_fault_oops+0x214/0x720
[ +0.004179] ? preempt_count_sub+0x18/0xc0
[ +0.004093] ? __pfx_page_fault_oops+0x10/0x10
[ +0.004590] ? srso_return_thunk+0x5/0x5f
[ +0.004000] ? vprintk_default+0x1d/0x30
[ +0.004063] ? srso_return_thunk+0x5/0x5f
[ +0.004087] ? vprintk+0x5c/0x90
[ +0.003296] ? drm_sched_entity_init+0x2d3/0x420 [gpu_sched]
[ +0.005807] ? srso_return_thunk+0x5/0x5f
[ +0.004090] ? _printk+0xb3/0xe0
[ +0.003293] ? __pfx__printk+0x10/0x10
[ +0.003735] ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
[ +0.005482] ? do_user_addr_fault+0x345/0x770
[ +0.004361] ? exc_page_fault+0x64/0xf0
[ +0.003972] ? asm_exc_page_fault+0x27/0x30
[ +0.004271] ? add_taint+0x2a/0xa0
[ +0.003476] ? drm_sched_entity_init+0x2d3/0x420 [gpu_sched]
[ +0.005812] amdgpu_ctx_get_entity+0x3f9/0x770 [amdgpu]
[ +0.009530] ? finish_task_switch.isra.0+0x129/0x470
[ +0.005068] ? __pfx_amdgpu_ctx_get_entity+0x10/0x10 [amdgpu]
[ +0.010063] ? __kasan_check_write+0x14/0x20
[ +0.004356] ? srso_return_thunk+0x5/0x5f
[ +0.004001] ? mutex_unlock+0x81/0xd0
[ +0.003802] ? srso_return_thunk+0x5/0x5f
[ +0.004096] amdgpu_cs_wait_ioctl+0xf6/0x270 [amdgpu]
[ +0.009355] ? __pfx_amdgpu_cs_wait_ioctl+0x10/0x10 [amdgpu]
[ +0.009981] ? srso_return_thunk+0x5/0x5f
[ +0.004089] ? srso_return_thunk+0x5/0x5f
[ +0.004090] ? __srcu_read_lock+0x20/0x50
[ +0.004096] drm_ioctl_kernel+0x140/0x1f0 [drm]
[ +0.005080] ? __pfx_amdgpu_cs_wait_ioctl+0x10/0x10 [amdgpu]
[ +0.009974] ? __pfx_drm_ioctl_kernel+0x10/0x10 [drm]
[ +0.005618] ? srso_return_thunk+0x5/0x5f
[ +0.004088] ? __kasan_check_write+0x14/0x20
[ +0.004357] drm_ioctl+0x3da/0x730 [drm]
[ +0.004461] ? __pfx_amdgpu_cs_wait_ioctl+0x10/0x10 [amdgpu]
[ +0.009979] ? __pfx_drm_ioctl+0x10/0x10 [drm]
[ +0.004993] ? srso_return_thunk+0x5/0x5f
[ +0.004090] ? __kasan_check_write+0x14/0x20
[ +0.004356] ? srso_return_thunk+0x5/0x5f
[ +0.004090] ? _raw_spin_lock_irqsave+0x99/0x100
[ +0.004712] ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[ +0.005063] ? __pfx_arch_do_signal_or_restart+0x10/0x10
[ +0.005477] ? srso_return_thunk+0x5/0x5f
[ +0.004000] ? preempt_count_sub+0x18/0xc0
[ +0.004237] ? srso_return_thunk+0x5/0x5f
[ +0.004090] ? _raw_spin_unlock_
[PATCH] drm/amdgpu: fix use-after-free bug
From: Vitaly Prosyak

The bug can be triggered by sending a single amdgpu_gem_userptr_ioctl to the AMDGPU DRM driver on any ASIC with an invalid address and size. The bug was reported by Joonkyo Jung. For example, the following code:

static void Syzkaller1(int fd)
{
	struct drm_amdgpu_gem_userptr arg;
	int ret;

	arg.addr = 0x;
	arg.size = 0x8000; /* 2 Gb */
	arg.flags = 0x7;
	ret = drmIoctl(fd, 0xc1186451 /* amdgpu_gem_userptr_ioctl */, &arg);
}

Because the address and size are not valid, amdgpu_hmm_register fails in mmu_interval_notifier_insert->__mmu_interval_notifier_insert->check_shl_overflow; but even when amdgpu_hmm_register fails, we still call amdgpu_hmm_unregister from amdgpu_gem_object_free, which accesses a bad address. The following stack is observed when the issue is reproduced with KASAN enabled:

[ +0.14] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1401 12/03/2020 [ +0.09] RIP: 0010:mmu_interval_notifier_remove+0x327/0x340 [ +0.17] Code: ff ff 49 89 44 24 08 48 b8 00 01 00 00 00 00 ad de 4c 89 f7 49 89 47 40 48 83 c0 22 49 89 47 48 e8 ce d1 2d 01 e9 32 ff ff ff <0f> 0b e9 16 ff ff ff 4c 89 ef e8 fa 14 b3 ff e9 36 ff ff ff e8 80 [ +0.14] RSP: 0018:c90002657988 EFLAGS: 00010246 [ +0.13] RAX: RBX: 1920004caf35 RCX: 8160565b [ +0.11] RDX: dc00 RSI: 0004 RDI: 8881a9f78260 [ +0.10] RBP: c90002657a70 R08: 0001 R09: f520004caf25 [ +0.10] R10: 0003 R11: 8161d1d6 R12: 88810e988c00 [ +0.10] R13: 888126fb5a00 R14: 88810e988c0c R15: 8881a9f78260 [ +0.11] FS: 7ff9ec848540() GS:8883cc88() knlGS: [ +0.12] CS: 0010 DS: ES: CR0: 80050033 [ +0.10] CR2: 55b3f7e14328 CR3: 0001b577 CR4: 00350ef0 [ +0.10] Call Trace: [ +0.06] [ +0.07] ? show_regs+0x6a/0x80 [ +0.18] ? __warn+0xa5/0x1b0 [ +0.19] ? mmu_interval_notifier_remove+0x327/0x340 [ +0.18] ? report_bug+0x24a/0x290 [ +0.22] ? handle_bug+0x46/0x90 [ +0.15] ? exc_invalid_op+0x19/0x50 [ +0.16] ? asm_exc_invalid_op+0x1b/0x20 [ +0.17] ? kasan_save_stack+0x26/0x50 [ +0.17] ? 
mmu_interval_notifier_remove+0x23b/0x340 [ +0.19] ? mmu_interval_notifier_remove+0x327/0x340 [ +0.19] ? mmu_interval_notifier_remove+0x23b/0x340 [ +0.20] ? __pfx_mmu_interval_notifier_remove+0x10/0x10 [ +0.17] ? kasan_save_alloc_info+0x1e/0x30 [ +0.18] ? srso_return_thunk+0x5/0x5f [ +0.14] ? __kasan_kmalloc+0xb1/0xc0 [ +0.18] ? srso_return_thunk+0x5/0x5f [ +0.13] ? __kasan_check_read+0x11/0x20 [ +0.20] amdgpu_hmm_unregister+0x34/0x50 [amdgpu] [ +0.004695] amdgpu_gem_object_free+0x66/0xa0 [amdgpu] [ +0.004534] ? __pfx_amdgpu_gem_object_free+0x10/0x10 [amdgpu] [ +0.004291] ? do_syscall_64+0x5f/0xe0 [ +0.23] ? srso_return_thunk+0x5/0x5f [ +0.17] drm_gem_object_free+0x3b/0x50 [drm] [ +0.000489] amdgpu_gem_userptr_ioctl+0x306/0x500 [amdgpu] [ +0.004295] ? __pfx_amdgpu_gem_userptr_ioctl+0x10/0x10 [amdgpu] [ +0.004270] ? srso_return_thunk+0x5/0x5f [ +0.14] ? __this_cpu_preempt_check+0x13/0x20 [ +0.15] ? srso_return_thunk+0x5/0x5f [ +0.13] ? sysvec_apic_timer_interrupt+0x57/0xc0 [ +0.20] ? srso_return_thunk+0x5/0x5f [ +0.14] ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 [ +0.22] ? drm_ioctl_kernel+0x17b/0x1f0 [drm] [ +0.000496] ? __pfx_amdgpu_gem_userptr_ioctl+0x10/0x10 [amdgpu] [ +0.004272] ? drm_ioctl_kernel+0x190/0x1f0 [drm] [ +0.000492] drm_ioctl_kernel+0x140/0x1f0 [drm] [ +0.000497] ? __pfx_amdgpu_gem_userptr_ioctl+0x10/0x10 [amdgpu] [ +0.004297] ? __pfx_drm_ioctl_kernel+0x10/0x10 [drm] [ +0.000489] ? srso_return_thunk+0x5/0x5f [ +0.11] ? __kasan_check_write+0x14/0x20 [ +0.16] drm_ioctl+0x3da/0x730 [drm] [ +0.000475] ? __pfx_amdgpu_gem_userptr_ioctl+0x10/0x10 [amdgpu] [ +0.004293] ? __pfx_drm_ioctl+0x10/0x10 [drm] [ +0.000506] ? __pfx_rpm_resume+0x10/0x10 [ +0.16] ? srso_return_thunk+0x5/0x5f [ +0.11] ? __kasan_check_write+0x14/0x20 [ +0.10] ? srso_return_thunk+0x5/0x5f [ +0.11] ? _raw_spin_lock_irqsave+0x99/0x100 [ +0.15] ? __pfx__raw_spin_lock_irqsave+0x10/0x10 [ +0.14] ? srso_return_thunk+0x5/0x5f [ +0.13] ? srso_return_thunk+0x5/0x5f [ +0.11] ? 
srso_return_thunk+0x5/0x5f [ +0.11] ? preempt_count_sub+0x18/0xc0 [ +0.13] ? srso_return_thunk+0x5/0x5f [ +0.10] ? _raw_spin_unlock_irqrestore+0x27/0x50 [ +0.19] amdgpu_drm_ioctl+0x7e/0xe0 [amdgpu] [ +0.004272] __x64_sys_ioctl+0xcd/0x110 [ +0.20] do_syscall_64+0x5f/0xe0 [ +0.21] entry_SYSCALL_64_after_hwframe+0x6e/0x76 [ +0.15] RIP: 0033:0x7ff9ed31a94f [ +0.12] Code: 00 48 89 44 24 18
[PATCH] drm/amdgpu: check flag ring->no_scheduler before usage
From: Vitaly Prosyak

The issue started to appear after commit 11b3b9f461c5c4f700f6c8da202fcc2fd6418e1f (scheduler to variable number of run-queues). The scheduler ready flag (ring->sched.ready) cannot be used to validate multiple scenarios, for example checking whether a job is running, gpu_reset, PCI errors, etc. The reason is that after a GPU reset the flag is set to true unconditionally, even for rings with an uninitialized scheduler. As a result, we called drm_sched_stop and drm_sched_start for an uninitialized scheduler and a NULL pointer dereference occurred. For example, the following occurs on Navi10 during GPU reset:

[ 354.231044] Hardware name: TYAN B8021G88V2HR-2T/S8021GM2NR-2T, BIOS V1.03.B10 04/01/2019 [ 354.239152] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched] [ 354.246047] RIP: 0010:__flush_work.isra.0+0x23a/0x250 [ 354.251110] Code: 8b 04 25 40 2e 03 00 48 89 44 24 40 48 8b 73 40 8b 4b 30 e9 f9 fe ff ff 40 30 f6 4c 8b 36 e9 37 fe ff ff 0f 0b e9 3a ff ff ff <0f> 0b e9 33 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 [ 354.269876] RSP: 0018:b234c00e3c20 EFLAGS: 00010246 [ 354.275121] RAX: 0011 RBX: 9796d9796de0 RCX: [ 354.282271] RDX: 0001 RSI: RDI: 9796d9796de0 [ 354.289420] RBP: 9796d9796de0 R08: 977780401940 R09: a1a5c620 [ 354.296570] R10: R11: R12: [ 354.303720] R13: 0001 R14: 9796d97905c8 R15: 9796d9790230 [ 354.310868] FS: () GS:97865f04() knlGS: [ 354.318963] CS: 0010 DS: ES: CR0: 80050033 [ 354.324717] CR2: 7fd5341fca50 CR3: 002c27a22000 CR4: 003506f0 [ 354.331859] Call Trace: [ 354.334320] [ 354.336433] ? __flush_work.isra.0+0x23a/0x250 [ 354.340891] ? __warn+0x81/0x130 [ 354.344139] ? __flush_work.isra.0+0x23a/0x250 [ 354.348594] ? report_bug+0x171/0x1a0 [ 354.352279] ? handle_bug+0x3c/0x80 [ 354.355787] ? exc_invalid_op+0x17/0x70 [ 354.359635] ? asm_exc_invalid_op+0x1a/0x20 [ 354.363844] ? __flush_work.isra.0+0x23a/0x250 [ 354.368307] ? 
srso_return_thunk+0x5/0x5f [ 354.372331] ? srso_return_thunk+0x5/0x5f [ 354.376351] ? desc_read_finalized_seq+0x1f/0x70 [ 354.380982] ? srso_return_thunk+0x5/0x5f [ 354.385011] ? _prb_read_valid+0x20e/0x280 [ 354.389130] __cancel_work_timer+0xd3/0x160 [ 354.39] drm_sched_stop+0x46/0x1f0 [gpu_sched] [ 354.398143] amdgpu_device_gpu_recover+0x318/0xca0 [amdgpu] [ 354.403995] ? __drm_err+0x1/0x70 [drm] [ 354.407884] amdgpu_job_timedout+0x151/0x240 [amdgpu] [ 354.413279] drm_sched_job_timedout+0x76/0x100 [gpu_sched] [ 354.418787] process_one_work+0x174/0x340 [ 354.422816] worker_thread+0x27b/0x3a0 [ 354.426586] ? __pfx_worker_thread+0x10/0x10 [ 354.430874] kthread+0xe8/0x120 [ 354.434030] ? __pfx_kthread+0x10/0x10 [ 354.437790] ret_from_fork+0x34/0x50 [ 354.441377] ? __pfx_kthread+0x10/0x10 [ 354.445139] ret_from_fork_asm+0x1b/0x30 [ 354.449079] [ 354.451285] ---[ end trace ]--- [ 354.455917] BUG: kernel NULL pointer dereference, address: 0008 [ 354.462883] #PF: supervisor read access in kernel mode [ 354.468029] #PF: error_code(0x) - not-present page [ 354.473167] PGD 0 P4D 0 [ 354.475705] Oops: [#1] PREEMPT SMP NOPTI [ 354.480066] CPU: 1 PID: 11 Comm: kworker/u64:0 Tainted: GW 6.7.0-991912.1.zuul.e7596ab24dae4bb686e58b0f1e7842da #1 [ 354.491883] Hardware name: TYAN B8021G88V2HR-2T/S8021GM2NR-2T, BIOS V1.03.B10 04/01/2019 [ 354.499976] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched] [ 354.506855] RIP: 0010:drm_sched_stop+0x61/0x1f0 [gpu_sched]

The solution is to also check ring->no_scheduler in every place where we check the ready flag. The ready flag still serves its purpose for the case where initialization fails, e.g. starting the worker thread.
Cc: Alex Deucher Cc: Christian Koenig Signed-off-by: Vitaly Prosyak --- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c| 2 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c| 6 +++--- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 -- 3 files changed, 13 insertions(+), 9 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c index 899e31e3a5e8..70bbf602df34 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c @@ -292,6 +292,8 @@ static int suspend_resume_compute_scheduler(struct amdgpu_device *adev, bool sus if (!(ring && drm_sched_wqueue_ready(&ring->sched))) continue; + if (ring->no_sche
[PATCH] drm/amdgpu: fix software pci_unplug on some chips
From: Vitaly Prosyak

When a software 'pci unplug' is executed using IGT, we get a NULL sysfs directory entry for different RAS blocks such as hdp, umc, etc. Before calling 'sysfs_remove_file_from_group' and 'sysfs_remove_group', check that 'sd' is not NULL.

[ +0.01] RIP: 0010:sysfs_remove_group+0x83/0x90 [ +0.02] Code: 31 c0 31 d2 31 f6 31 ff e9 9a a8 b4 00 4c 89 e7 e8 f2 a2 ff ff eb c2 49 8b 55 00 48 8b 33 48 c7 c7 80 65 94 82 e8 cd 82 bb ff <0f> 0b eb cc 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 [ +0.01] RSP: 0018:c90002067c90 EFLAGS: 00010246 [ +0.02] RAX: RBX: 824ea180 RCX: [ +0.01] RDX: RSI: RDI: [ +0.01] RBP: c90002067ca8 R08: R09: [ +0.01] R10: R11: R12: [ +0.01] R13: 88810a395f48 R14: 888101aab0d0 R15: [ +0.01] FS: 7f5ddaa43a00() GS:88841e80() knlGS: [ +0.02] CS: 0010 DS: ES: CR0: 80050033 [ +0.01] CR2: 7f8ffa61ba50 CR3: 000106432000 CR4: 00350ef0 [ +0.01] Call Trace: [ +0.01] [ +0.01] ? show_regs+0x72/0x90 [ +0.02] ? sysfs_remove_group+0x83/0x90 [ +0.02] ? __warn+0x8d/0x160 [ +0.01] ? sysfs_remove_group+0x83/0x90 [ +0.01] ? report_bug+0x1bb/0x1d0 [ +0.03] ? handle_bug+0x46/0x90 [ +0.01] ? exc_invalid_op+0x19/0x80 [ +0.02] ? asm_exc_invalid_op+0x1b/0x20 [ +0.03] ? sysfs_remove_group+0x83/0x90 [ +0.01] dpm_sysfs_remove+0x61/0x70 [ +0.02] device_del+0xa3/0x3d0 [ +0.02] ? ktime_get_mono_fast_ns+0x46/0xb0 [ +0.02] device_unregister+0x18/0x70 [ +0.01] i2c_del_adapter+0x26d/0x330 [ +0.02] arcturus_i2c_control_fini+0x25/0x50 [amdgpu] [ +0.000236] smu_sw_fini+0x38/0x260 [amdgpu] [ +0.000241] amdgpu_device_fini_sw+0x116/0x670 [amdgpu] [ +0.000186] ? 
mutex_lock+0x13/0x50 [ +0.03] amdgpu_driver_release_kms+0x16/0x40 [amdgpu] [ +0.000192] drm_minor_release+0x4f/0x80 [drm] [ +0.25] drm_release+0xfe/0x150 [drm] [ +0.27] __fput+0x9f/0x290 [ +0.02] fput+0xe/0x20 [ +0.02] task_work_run+0x61/0xa0 [ +0.02] exit_to_user_mode_prepare+0x150/0x170 [ +0.02] syscall_exit_to_user_mode+0x2a/0x50 Cc: Hawking Zhang Cc: Luben Tuikov Cc: Alex Deucher Cc: Christian Koenig Signed-off-by: Vitaly Prosyak --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c index 5fb57419ef77..1673a10835a1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c @@ -1390,7 +1390,8 @@ static void amdgpu_ras_sysfs_remove_bad_page_node(struct amdgpu_device *adev) { struct amdgpu_ras *con = amdgpu_ras_get_context(adev); - sysfs_remove_file_from_group(&adev->dev->kobj, + if (adev->dev->kobj.sd) + sysfs_remove_file_from_group(&adev->dev->kobj, &con->badpages_attr.attr, RAS_FS_NAME); } @@ -1409,7 +1410,8 @@ static int amdgpu_ras_sysfs_remove_dev_attr_node(struct amdgpu_device *adev) .attrs = attrs, }; - sysfs_remove_group(&adev->dev->kobj, &group); + if (adev->dev->kobj.sd) + sysfs_remove_group(&adev->dev->kobj, &group); return 0; } @@ -1456,7 +1458,8 @@ int amdgpu_ras_sysfs_remove(struct amdgpu_device *adev, if (!obj || !obj->attr_inuse) return -EINVAL; - sysfs_remove_file_from_group(&adev->dev->kobj, + if (adev->dev->kobj.sd) + sysfs_remove_file_from_group(&adev->dev->kobj, &obj->sysfs_attr.attr, RAS_FS_NAME); obj->attr_inuse = 0; -- 2.25.1
[PATCH] drm/sched: Check scheduler work queue before calling timeout handling
From: Vitaly Prosyak

During an IGT GPU reset test we again see an oops despite commit 0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling timeout handling"). That commit uses the ready condition to decide whether to call drm_sched_fault, which unwinds the TDR and leads to a GPU reset. However, the ready condition is overloaded with other meanings. For example, the following GPU-reset-related stack:

0 gfx_v9_0_cp_gfx_start
1 gfx_v9_0_cp_gfx_resume
2 gfx_v9_0_cp_resume
3 gfx_v9_0_hw_init
4 gfx_v9_0_resume
5 amdgpu_device_ip_resume_phase2

does the following:

/* start the ring */
gfx_v9_0_cp_gfx_start(adev);
ring->sched.ready = true;

The same approach is used for other ASICs as well: gfx_v8_0_cp_gfx_resume, gfx_v10_0_kiq_resume, etc. As a result, our GPU reset test causes a GPU fault which unconditionally calls gfx_v9_0_fault and then drm_sched_fault. Whether an oops occurs now depends on timing: if the drm_sched_fault interrupt service routine runs after gfx_v9_0_cp_gfx_start has completed (which sets the scheduler's ready field to true even for uninitialized schedulers), we get the oops; if the ISR completes before gfx_v9_0_cp_gfx_start, there is no fault and the NULL pointer dereference does not occur.

Use the timeout_wq field to prevent the oops for uninitialized schedulers. The field could be initialized with the work queue of the reset domain.
Fixes: 0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling timeout handling")

v1: Corrections to commit message (Luben)

Signed-off-by: Vitaly Prosyak
---
 drivers/gpu/drm/scheduler/sched_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 649fac2e1ccb..670b7997f389 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -308,7 +308,7 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
  */
 void drm_sched_fault(struct drm_gpu_scheduler *sched)
 {
-	if (sched->ready)
+	if (sched->timeout_wq)
 		mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
 }
 EXPORT_SYMBOL(drm_sched_fault);
-- 
2.25.1
[PATCH 2/3] drm/amdgpu: always sending PSP messages LOAD_ASD and UNLOAD_TA
From: Vitaly Prosyak

We allow sending the PSP messages LOAD_ASD and UNLOAD_TA without acquiring a lock via drm_dev_enter during driver unload, because we must call drm_dev_unplug at the beginning of the driver unload sequence. A WARNING is added if other PSP messages are sent without the lock. After this commit, the following commands work:
- sudo modprobe -r amdgpu
- sudo modprobe amdgpu

Signed-off-by: Vitaly Prosyak
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index a8391f269cd0..40929f34447c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -606,12 +606,21 @@ psp_cmd_submit_buf(struct psp_context *psp,
 	int timeout = 2;
 	bool ras_intr = false;
 	bool skip_unsupport = false;
+	bool dev_entered;
 
 	if (psp->adev->no_hw_access)
 		return 0;
 
-	if (!drm_dev_enter(adev_to_drm(psp->adev), &idx))
-		return 0;
+	dev_entered = drm_dev_enter(adev_to_drm(psp->adev), &idx);
+	/*
+	 * We allow sending PSP messages LOAD_ASD and UNLOAD_TA without acquiring
+	 * a lock in drm_dev_enter during driver unload because we must call
+	 * drm_dev_unplug at the beginning of the unload driver sequence. It is very
+	 * crucial that userspace can't access device instances anymore.
+	 */
+	if (!dev_entered)
+		WARN_ON(psp->cmd_buf_mem->cmd_id != GFX_CMD_ID_LOAD_ASD &&
+			psp->cmd_buf_mem->cmd_id != GFX_CMD_ID_UNLOAD_TA);
 
 	memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
 
@@ -676,7 +685,8 @@ psp_cmd_submit_buf(struct psp_context *psp,
 	}
 
 exit:
-	drm_dev_exit(idx);
+	if (dev_entered)
+		drm_dev_exit(idx);
 	return ret;
 }
-- 
2.25.1
[PATCH 1/3] Revert "drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled"
From: Vitaly Prosyak

This reverts commit fac53471d0ea9693d314aa2df08d62b2e7e3a0f8.

The reverted change moved the drm_dev_unplug call after amdgpu_driver_unload_kms in amdgpu_pci_remove. The reason for the revert: amdgpu_pci_remove calls drm_dev_unregister, which must run first to ensure userspace can no longer access the device instance. If drm_dev_unplug is called after amdgpu_driver_unload_kms, the IGT PCI software unplug test fails (kernel hang) on all ASICs; this is how the regression was found.

After this revert, the following commands do not work, but this is fixed in the next commit:
- sudo modprobe -r amdgpu
- sudo modprobe amdgpu

Signed-off-by: Vitaly Prosyak
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 4 ++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 0f9a5b12c3a5..a10b627c8357 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4031,7 +4031,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
 	amdgpu_gart_dummy_page_fini(adev);
 
-	amdgpu_device_unmap_mmio(adev);
+	if (drm_dev_is_unplugged(adev_to_drm(adev)))
+		amdgpu_device_unmap_mmio(adev);
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index a75dba2caeca..7edbaa90fac9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2227,6 +2227,8 @@ amdgpu_pci_remove(struct pci_dev *pdev)
 	struct drm_device *dev = pci_get_drvdata(pdev);
 	struct amdgpu_device *adev = drm_to_adev(dev);
 
+	drm_dev_unplug(dev);
+
 	if (adev->pm.rpm_mode != AMDGPU_RUNPM_NONE) {
 		pm_runtime_get_sync(dev->dev);
 		pm_runtime_forbid(dev->dev);
@@ -2266,8 +2268,6 @@ amdgpu_pci_remove(struct pci_dev *pdev)
 
 	amdgpu_driver_unload_kms(dev);
 
-	drm_dev_unplug(dev);
-
 	/*
 	 * Flush any in flight DMA operations from device.
 	 * Clear the Bus Master Enable bit and then wait on the PCIe Device
-- 
2.25.1
[PATCH 2/3] drm/amdgpu: always send PSP messages LOAD_ASD and UNLOAD_TA
From: Vitaly Prosyak

We allow sending the PSP messages LOAD_ASD and UNLOAD_TA without acquiring a lock via drm_dev_enter during driver unload, because drm_dev_unplug must be called at the beginning of the driver unload sequence. Added a WARNING if other PSP messages are sent without the lock.

After this commit, the following commands work:
- sudo modprobe -r amdgpu
- sudo modprobe amdgpu

Signed-off-by: Vitaly Prosyak
Reviewed-by: Alex Deucher
Change-Id: I57f65fe820e2f7055f8065cd18c63fe6ff3ab694
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index a8391f269cd0..40929f34447c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -606,12 +606,21 @@ psp_cmd_submit_buf(struct psp_context *psp,
 	int timeout = 2;
 	bool ras_intr = false;
 	bool skip_unsupport = false;
+	bool dev_entered;
 
 	if (psp->adev->no_hw_access)
 		return 0;
 
-	if (!drm_dev_enter(adev_to_drm(psp->adev), &idx))
-		return 0;
+	dev_entered = drm_dev_enter(adev_to_drm(psp->adev), &idx);
+	/*
+	 * We allow sending the PSP messages LOAD_ASD and UNLOAD_TA without
+	 * acquiring a lock via drm_dev_enter during driver unload because we
+	 * must call drm_dev_unplug at the beginning of the unload sequence.
+	 * It is crucial that userspace can no longer access device instances.
+	 */
+	if (!dev_entered)
+		WARN_ON(psp->cmd_buf_mem->cmd_id != GFX_CMD_ID_LOAD_ASD &&
+			psp->cmd_buf_mem->cmd_id != GFX_CMD_ID_UNLOAD_TA);
 
 	memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
 
@@ -676,7 +685,8 @@ psp_cmd_submit_buf(struct psp_context *psp,
 	}
 
 exit:
-	drm_dev_exit(idx);
+	if (dev_entered)
+		drm_dev_exit(idx);
 	return ret;
 }
-- 
2.25.1
[PATCH 3/3] drm/amdgpu: use pci_dev_is_disconnected
From: Vitaly Prosyak

Add a condition on pci_dev_is_disconnected and keep drm_dev_is_unplugged when checking whether we should unmap MMIO. Using pci_dev_is_disconnected was suggested by Alex; keeping drm_dev_is_unplugged was suggested by Christian.

Signed-off-by: Vitaly Prosyak
Reviewed-by: Alex Deucher
Reviewed-by: Christian Koenig
Change-Id: I618c471cd398437d4ed6dec6d22be78e12683ae6
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index a10b627c8357..d3568e1ded23 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -78,6 +78,8 @@
 #include
 
+#include "../../../../pci/pci.h"
+
 MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
 MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
 MODULE_FIRMWARE("amdgpu/raven_gpu_info.bin");
@@ -4031,7 +4033,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
 	amdgpu_gart_dummy_page_fini(adev);
 
-	if (drm_dev_is_unplugged(adev_to_drm(adev)))
+	if (pci_dev_is_disconnected(adev->pdev) &&
+	    drm_dev_is_unplugged(adev_to_drm(adev)))
 		amdgpu_device_unmap_mmio(adev);
 }
-- 
2.25.1
[PATCH 1/3] Revert "drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled"
From: Vitaly Prosyak

This reverts commit fac53471d0ea9693d314aa2df08d62b2e7e3a0f8.

The reverted change moved the drm_dev_unplug call after amdgpu_driver_unload_kms in amdgpu_pci_remove. The reason for the revert: amdgpu_pci_remove calls drm_dev_unregister, which must run first to ensure userspace can no longer access the device instance. If drm_dev_unplug is called after amdgpu_driver_unload_kms, the IGT PCI software unplug test fails (kernel hang) on all ASICs; this is how the regression was found.

After this revert, the following commands do not work, but this is fixed in the next commit:
- sudo modprobe -r amdgpu
- sudo modprobe amdgpu

Signed-off-by: Vitaly Prosyak
Reviewed-by: Alex Deucher
Change-Id: Ia5c6c174dddb89a33dd93b641fd05466ffb3500c
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 4 ++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 0f9a5b12c3a5..a10b627c8357 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4031,7 +4031,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
 	amdgpu_gart_dummy_page_fini(adev);
 
-	amdgpu_device_unmap_mmio(adev);
+	if (drm_dev_is_unplugged(adev_to_drm(adev)))
+		amdgpu_device_unmap_mmio(adev);
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index a75dba2caeca..7edbaa90fac9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2227,6 +2227,8 @@ amdgpu_pci_remove(struct pci_dev *pdev)
 	struct drm_device *dev = pci_get_drvdata(pdev);
 	struct amdgpu_device *adev = drm_to_adev(dev);
 
+	drm_dev_unplug(dev);
+
 	if (adev->pm.rpm_mode != AMDGPU_RUNPM_NONE) {
 		pm_runtime_get_sync(dev->dev);
 		pm_runtime_forbid(dev->dev);
@@ -2266,8 +2268,6 @@ amdgpu_pci_remove(struct pci_dev *pdev)
 
 	amdgpu_driver_unload_kms(dev);
 
-	drm_dev_unplug(dev);
-
 	/*
 	 * Flush any in flight DMA operations from device.
 	 * Clear the Bus Master Enable bit and then wait on the PCIe Device
-- 
2.25.1
[PATCH 2/2] drm/amdgpu: always send PSP messages LOAD_ASD and UNLOAD_TA
From: Vitaly Prosyak

We allow sending the PSP messages LOAD_ASD and UNLOAD_TA without acquiring a lock via drm_dev_enter during driver unload, because drm_dev_unplug must be called at the beginning of the driver unload sequence. Added a WARNING if other PSP messages are sent without the lock.

After this commit, the following commands work:
- sudo modprobe -r amdgpu
- sudo modprobe amdgpu

Signed-off-by: Vitaly Prosyak
Reviewed-by: Alex Deucher
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index a8391f269cd0..7ef18e062c0e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -606,12 +606,21 @@ psp_cmd_submit_buf(struct psp_context *psp,
 	int timeout = 2;
 	bool ras_intr = false;
 	bool skip_unsupport = false;
+	bool dev_entered;
 
 	if (psp->adev->no_hw_access)
 		return 0;
 
-	if (!drm_dev_enter(adev_to_drm(psp->adev), &idx))
-		return 0;
+	dev_entered = drm_dev_enter(adev_to_drm(psp->adev), &idx);
+	/*
+	 * We allow sending the PSP messages LOAD_ASD and UNLOAD_TA without
+	 * acquiring a lock via drm_dev_enter during driver unload because we
+	 * must call drm_dev_unplug at the beginning of the unload sequence.
+	 * It is crucial that userspace can no longer access device instances.
+	 */
+	if (!dev_entered)
+		WARN_ON(psp->cmd_buf_mem->cmd_id != GFX_CMD_ID_LOAD_ASD &&
+			psp->cmd_buf_mem->cmd_id != GFX_CMD_ID_UNLOAD_TA);
 
 	memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
 
@@ -676,7 +685,8 @@ psp_cmd_submit_buf(struct psp_context *psp,
 	}
 
 exit:
-	drm_dev_exit(idx);
+	if (dev_entered)
+		drm_dev_exit(idx);
 	return ret;
 }
-- 
2.25.1
[PATCH 1/2] Revert "drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled"
From: Vitaly Prosyak

This reverts commit fac53471d0ea9693d314aa2df08d62b2e7e3a0f8.

The reverted change moved the drm_dev_unplug call after amdgpu_driver_unload_kms in amdgpu_pci_remove. The reason for the revert: amdgpu_pci_remove calls drm_dev_unregister, which must run first to ensure userspace can no longer access the device instance. If drm_dev_unplug is called after amdgpu_driver_unload_kms, the IGT PCI software unplug test fails (kernel hang) on all ASICs; this is how the regression was found.

After this revert, the following commands do not work, but this is fixed in the next commit:
- sudo modprobe -r amdgpu
- sudo modprobe amdgpu

Signed-off-by: Vitaly Prosyak
Reviewed-by: Alex Deucher
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 4 ++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index efd4f8226120..58d445a0590f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4019,7 +4019,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
 	amdgpu_gart_dummy_page_fini(adev);
 
-	amdgpu_device_unmap_mmio(adev);
+	if (drm_dev_is_unplugged(adev_to_drm(adev)))
+		amdgpu_device_unmap_mmio(adev);
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index a75dba2caeca..7edbaa90fac9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2227,6 +2227,8 @@ amdgpu_pci_remove(struct pci_dev *pdev)
 	struct drm_device *dev = pci_get_drvdata(pdev);
 	struct amdgpu_device *adev = drm_to_adev(dev);
 
+	drm_dev_unplug(dev);
+
 	if (adev->pm.rpm_mode != AMDGPU_RUNPM_NONE) {
 		pm_runtime_get_sync(dev->dev);
 		pm_runtime_forbid(dev->dev);
@@ -2266,8 +2268,6 @@ amdgpu_pci_remove(struct pci_dev *pdev)
 
 	amdgpu_driver_unload_kms(dev);
 
-	drm_dev_unplug(dev);
-
 	/*
 	 * Flush any in flight DMA operations from device.
 	 * Clear the Bus Master Enable bit and then wait on the PCIe Device
-- 
2.25.1
[PATCH] drm/amdgpu: revert part of a commit fac53471d0ea9
From: Vitaly Prosyak

Revert the following change: moving the drm_dev_unplug call after amdgpu_driver_unload_kms in amdgpu_pci_remove. The reason: amdgpu_pci_remove calls drm_dev_unregister, which must run first to ensure userspace can no longer access the device instance. If drm_dev_unplug is called after amdgpu_driver_unload_kms, the IGT PCI software unplug test fails (kernel hang) on all ASICs; this is how the regression was found.

Signed-off-by: Vitaly Prosyak
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 06aba201d4db..8a1047224000 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2226,6 +2226,8 @@ amdgpu_pci_remove(struct pci_dev *pdev)
 	struct drm_device *dev = pci_get_drvdata(pdev);
 	struct amdgpu_device *adev = drm_to_adev(dev);
 
+	drm_dev_unplug(dev);
+
 	if (adev->pm.rpm_mode != AMDGPU_RUNPM_NONE) {
 		pm_runtime_get_sync(dev->dev);
 		pm_runtime_forbid(dev->dev);
@@ -2265,7 +2267,7 @@ amdgpu_pci_remove(struct pci_dev *pdev)
 
 	amdgpu_driver_unload_kms(dev);
 
-	drm_dev_unplug(dev);
+
 	/*
 	 * Flush any in flight DMA operations from device.
-- 
2.25.1