Re: [PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume
On 10/22/19 4:04 PM, Yang, Philip wrote:
> On 2019-10-22 3:36 p.m., Grodzovsky, Andrey wrote:
>> On 10/22/19 3:19 PM, Yang, Philip wrote:
>>> On 2019-10-22 2:40 p.m., Grodzovsky, Andrey wrote:
>>>> On 10/22/19 2:38 PM, Grodzovsky, Andrey wrote:
>>>>> On 10/22/19 2:28 PM, Yang, Philip wrote:
>>>>>> If device reset/suspend/resume fails for some reason, the dqm lock is
>>>>>> held forever and this causes a deadlock. Below is a kernel backtrace
>>>>>> taken when an application opened kfd after suspend/resume failed.
>>>>>>
>>>>>> Instead of holding the dqm lock in pre_reset and releasing it in
>>>>>> post_reset, add a dqm->device_stopped flag which is modified in
>>>>>> dqm->ops.start and dqm->ops.stop. The flag doesn't need lock
>>>>>> protection because all writes/reads happen inside the dqm lock.
>>>>>>
>>>>>> For the HWS case, map_queues_cpsch and unmap_queues_cpsch check the
>>>>>> device_stopped flag before sending the updated runlist.
>>>>> Is there a chance of a race condition here where dqm->device_stopped
>>>>> returns true for some operation (e.g. map_queues_cpsch) but just as
>>>>> it proceeds a GPU reset starts?
>>>>>
>>>>> Andrey
>>>> Correction - dqm->device_stopped returns FALSE
>>> No race condition here: dqm->device_stopped is set to FALSE in
>>> kgd2kfd_post_reset -> dqm->ops.start(), which is the last step of
>>> amdgpu_device_gpu_recover, so it's safe to do map_queues_cpsch.
>>>
>>> Regards,
>>> Philip
>> Sorry - I was confused by the commit description vs. body - in the
>> description it's called the device_stopped flag while in the body it's
>> sched_running - probably the description needs to be fixed.
> Thanks, the commit description is updated.
>
>> So I mean the switch true->false when the GPU reset just begins - you
>> get an IOCTL to map a queue (if I understood the KFD code correctly),
>> check dqm->sched_running == true and continue; what if right then a GPU
>> reset starts due to some issue like a job timeout or a user-triggered
>> reset - dqm->sched_running becomes false but you are already past that
>> checkpoint in the IOCTL, no?
>>
>> Andrey
>
> Mapping a queue is done under the dqm lock, to send the updated runlist
> to HWS, then the dqm lock is released. GPU reset start (pre_reset) will
> unmap all queues first under the dqm lock, then set dqm->sched_running
> to false, release the dqm lock, and then continue to reset the HW. The
> dqm lock guarantees the two operations are serialized.
>
> Philip

I see now that it's ok.

Andrey

>>>>>> v2: For the no-HWS case, when the device is stopped, don't call
>>>>>> load/destroy_mqd for eviction, restore and create queue, and avoid
>>>>>> debugfs dump hdqs.
>>>>>>
>>>>>> Backtrace of dqm lock deadlock:
>>>>>>
>>>>>> [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more
>>>>>> than 120 seconds.
>>>>>> [Thu Oct 17 16:43:37 2019]       Not tainted
>>>>>> 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1
>>>>>> [Thu Oct 17 16:43:37 2019] "echo 0 >
>>>>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>> [Thu Oct 17 16:43:37 2019] rocminfo  D 0  3024  2947  0x8000
>>>>>> [Thu Oct 17 16:43:37 2019] Call Trace:
>>>>>> [Thu Oct 17 16:43:37 2019]  ? __schedule+0x3d9/0x8a0
>>>>>> [Thu Oct 17 16:43:37 2019]  schedule+0x32/0x70
>>>>>> [Thu Oct 17 16:43:37 2019]  schedule_preempt_disabled+0xa/0x10
>>>>>> [Thu Oct 17 16:43:37 2019]  __mutex_lock.isra.9+0x1e3/0x4e0
>>>>>> [Thu Oct 17 16:43:37 2019]  ? __call_srcu+0x264/0x3b0
>>>>>> [Thu Oct 17 16:43:37 2019]  ? process_termination_cpsch+0x24/0x2f0 [amdgpu]
>>>>>> [Thu Oct 17 16:43:37 2019]  process_termination_cpsch+0x24/0x2f0 [amdgpu]
>>>>>> [Thu Oct 17 16:43:37 2019]  kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu]
>>>>>> [Thu Oct 17 16:43:37 2019]  kfd_process_notifier_release+0x1be/0x220 [amdgpu]
>>>>>> [Thu Oct 17 16:43:37 2019]  __mmu_notifier_release+0x3e/0xc0
>>>>>> [Thu Oct 17 16:43:37 2019]  exit_mmap+0x160/0x1a0
>>>>>> [Thu Oct 17 16:43:37 2019]  ? __handle_mm_fault+0xba3/0x1200
>>>>>> [Thu Oct 17 16:43:37 2019]  ? exit_robust_list+0x5a/0x110
>>>>>> [Thu Oct 17 16:43:37 2019]  mmput+0x4a/0x120
>>>>>> [Thu Oct 17 16:43:37 2019]  do_exit+0x284/0xb20
>>>>>> [Thu Oct 17 16:43:37 2019]  ? handle_mm_fault+0xfa/0x200
>>>>>> [Thu Oct 17 16:43:37 2019]  do_group_exit+0x3a/0xa0
>>>>>> [Thu Oct 17 16:43:37 2019]  __x64_sys_exit_group+0x14/0x20
>>>>>> [Thu Oct 17 16:43:37 2019]  do_syscall_64+0x4f/0x100
>>>>>> [Thu Oct 17 16:43:37 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>>>>
>>>>>> Suggested-by: Felix Kuehling
>>>>>> Signed-off-by: Philip Yang
>>>>>> ---
>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      |  6 +--
>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_device.c       |  5 --
>>>>>>  .../drm/amd/amdkfd/kfd_device_queue_manager.c | 47 +-
>>>>>>  .../drm/amd/amdkfd/kfd_device_queue_manager.h |  1 +
>>>>>>  4 files changed, 46 insertions(+), 13 deletions(-)
>>>>>>
>>>>>> diff --git
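The serialization argument that settles the thread above can be sketched as a small single-threaded model. This is an illustrative sketch only: the names mirror the driver, but the lock here is a plain flag standing in for the real kernel mutex, and -1 stands in for an error code such as -EIO; it is not the actual amdkfd code.

```c
#include <stdbool.h>

/* Minimal model of the ordering Philip describes: sched_running is
 * only read or written while the dqm lock is held, so an IOCTL that
 * maps a queue and a GPU reset can never interleave their critical
 * sections.  dqm_lock/dqm_unlock stand in for the real mutex. */

struct dqm {
	bool locked;        /* stand-in for dqm->lock */
	bool sched_running; /* toggled by dqm->ops.start()/stop() */
};

static void dqm_lock(struct dqm *d)   { d->locked = true; }
static void dqm_unlock(struct dqm *d) { d->locked = false; }

/* IOCTL path: send the runlist only if the scheduler still runs. */
int map_queues_cpsch(struct dqm *d)
{
	int ret = 0;

	dqm_lock(d);
	if (!d->sched_running)
		ret = -1;	/* device stopped: don't touch the HW */
	/* else: ...send updated runlist to HWS here... */
	dqm_unlock(d);
	return ret;
}

/* Reset path: unmap queues and stop the scheduler under the same
 * lock, then release it before the HW reset proceeds. */
void pre_reset(struct dqm *d)
{
	dqm_lock(d);
	/* ...unmap_queues_cpsch() would run here... */
	d->sched_running = false;
	dqm_unlock(d);
	/* HW reset happens after the lock is dropped. */
}

/* Last step of recovery: dqm->ops.start() flips the flag back. */
void post_reset(struct dqm *d)
{
	dqm_lock(d);
	d->sched_running = true;
	dqm_unlock(d);
}
```

Because both paths take the same lock, map_queues_cpsch either completes before pre_reset flips the flag or observes sched_running == false and backs off; there is no window in between, which is the point conceded above.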
Re: [PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume
On 2019-10-22 3:36 p.m., Grodzovsky, Andrey wrote: > > On 10/22/19 3:19 PM, Yang, Philip wrote: >> >> On 2019-10-22 2:40 p.m., Grodzovsky, Andrey wrote: >>> On 10/22/19 2:38 PM, Grodzovsky, Andrey wrote: On 10/22/19 2:28 PM, Yang, Philip wrote: > If device reset/suspend/resume failed for some reason, dqm lock is > hold forever and this causes deadlock. Below is a kernel backtrace when > application open kfd after suspend/resume failed. > > Instead of holding dqm lock in pre_reset and releasing dqm lock in > post_reset, add dqm->device_stopped flag which is modified in > dqm->ops.start and dqm->ops.stop. The flag doesn't need lock protection > because write/read are all inside dqm lock. > > For HWS case, map_queues_cpsch and unmap_queues_cpsch checks > device_stopped flag before sending the updated runlist. Is there a chance of race condition here where dqm->device_stopped returns true for some operation (e.g.map_queues_cpsch) but just as it proceeds GPU reset starts ? Andrey >>> >>> Correction - >>> >>> dqm->device_stopped returns FALSE >>> >> No race condition here, dqm->device_stopped is set to FALSE in >> kgd2kfd_post_reset -> dqm->ops.start(), which the last step of >> amdgpu_device_gpu_recover, so it's safe to do map_queue_cpsch. >> >> Regards, >> Philip > > Sorry - i was confused by the commit description vs. body - in the > description it's called device_stopped flag while in the body it's > sched_running - probably the description needs to be fixed. > Thanks, commit description is updated. > So i mean the switch true->false when GPU reset just begins - you get an > IOCTL to map a queue (if i understood KFD code correctly), check > dqm->sched_running == true and continue, what if right then GPU reset > started due to some issue like job timeout or user triggered reset - > dqm->sched_running becomes false but you already past that checkpoint in > the IOCTL, no ? > map a queue is done under dqm lock, to send the updated runlist to HWS, then relase dqm lock. 
GPU reset start pre_reset will unmap all queues first under dqm lock, then set dqm->sched_running to false, release dqm lock, and then continue to reset HW. dqm lock guarantee the two operations are serialized. Philip > Andrey > >> >>> Andrey >>> > v2: For no-HWS case, when device is stopped, don't call > load/destroy_mqd for eviction, restore and create queue, and avoid > debugfs dump hdqs. > > Backtrace of dqm lock deadlock: > > [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more > than 120 seconds. > [Thu Oct 17 16:43:37 2019] Not tainted > 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1 > [Thu Oct 17 16:43:37 2019] "echo 0 > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [Thu Oct 17 16:43:37 2019] rocminfoD0 3024 2947 > 0x8000 > [Thu Oct 17 16:43:37 2019] Call Trace: > [Thu Oct 17 16:43:37 2019] ? __schedule+0x3d9/0x8a0 > [Thu Oct 17 16:43:37 2019] schedule+0x32/0x70 > [Thu Oct 17 16:43:37 2019] schedule_preempt_disabled+0xa/0x10 > [Thu Oct 17 16:43:37 2019] __mutex_lock.isra.9+0x1e3/0x4e0 > [Thu Oct 17 16:43:37 2019] ? __call_srcu+0x264/0x3b0 > [Thu Oct 17 16:43:37 2019] ? process_termination_cpsch+0x24/0x2f0 > [amdgpu] > [Thu Oct 17 16:43:37 2019] process_termination_cpsch+0x24/0x2f0 > [amdgpu] > [Thu Oct 17 16:43:37 2019] > kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu] > [Thu Oct 17 16:43:37 2019] kfd_process_notifier_release+0x1be/0x220 > [amdgpu] > [Thu Oct 17 16:43:37 2019] __mmu_notifier_release+0x3e/0xc0 > [Thu Oct 17 16:43:37 2019] exit_mmap+0x160/0x1a0 > [Thu Oct 17 16:43:37 2019] ? __handle_mm_fault+0xba3/0x1200 > [Thu Oct 17 16:43:37 2019] ? exit_robust_list+0x5a/0x110 > [Thu Oct 17 16:43:37 2019] mmput+0x4a/0x120 > [Thu Oct 17 16:43:37 2019] do_exit+0x284/0xb20 > [Thu Oct 17 16:43:37 2019] ? 
handle_mm_fault+0xfa/0x200 > [Thu Oct 17 16:43:37 2019] do_group_exit+0x3a/0xa0 > [Thu Oct 17 16:43:37 2019] __x64_sys_exit_group+0x14/0x20 > [Thu Oct 17 16:43:37 2019] do_syscall_64+0x4f/0x100 > [Thu Oct 17 16:43:37 2019] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > Suggested-by: Felix Kuehling > Signed-off-by: Philip Yang > --- > drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 6 +-- > drivers/gpu/drm/amd/amdkfd/kfd_device.c | 5 -- > .../drm/amd/amdkfd/kfd_device_queue_manager.c | 47 > +-- > .../drm/amd/amdkfd/kfd_device_queue_manager.h | 1 + > 4 files changed, 46 insertions(+), 13 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c > b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c > index d9e36dbf13d5..40d75c39f08e 100644 > --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c >
Re: [PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume
On 2019-10-22 2:44 p.m., Kuehling, Felix wrote:
> On 2019-10-22 14:28, Yang, Philip wrote:
>> If device reset/suspend/resume fails for some reason, the dqm lock is
>> held forever and this causes a deadlock. Below is a kernel backtrace
>> taken when an application opened kfd after suspend/resume failed.
>>
>> Instead of holding the dqm lock in pre_reset and releasing it in
>> post_reset, add a dqm->device_stopped flag which is modified in
>> dqm->ops.start and dqm->ops.stop. The flag doesn't need lock protection
>> because all writes/reads happen inside the dqm lock.
>>
>> For the HWS case, map_queues_cpsch and unmap_queues_cpsch check the
>> device_stopped flag before sending the updated runlist.
>>
>> v2: For the no-HWS case, when the device is stopped, don't call
>> load/destroy_mqd for eviction, restore and create queue, and avoid
>> debugfs dump hdqs.
>>
>> Backtrace of dqm lock deadlock:
>>
>> [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more
>> than 120 seconds.
>> [Thu Oct 17 16:43:37 2019]       Not tainted
>> 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1
>> [Thu Oct 17 16:43:37 2019] "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [Thu Oct 17 16:43:37 2019] rocminfo  D 0  3024  2947  0x8000
>> [Thu Oct 17 16:43:37 2019] Call Trace:
>> [Thu Oct 17 16:43:37 2019]  ? __schedule+0x3d9/0x8a0
>> [Thu Oct 17 16:43:37 2019]  schedule+0x32/0x70
>> [Thu Oct 17 16:43:37 2019]  schedule_preempt_disabled+0xa/0x10
>> [Thu Oct 17 16:43:37 2019]  __mutex_lock.isra.9+0x1e3/0x4e0
>> [Thu Oct 17 16:43:37 2019]  ? __call_srcu+0x264/0x3b0
>> [Thu Oct 17 16:43:37 2019]  ? process_termination_cpsch+0x24/0x2f0 [amdgpu]
>> [Thu Oct 17 16:43:37 2019]  process_termination_cpsch+0x24/0x2f0 [amdgpu]
>> [Thu Oct 17 16:43:37 2019]  kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu]
>> [Thu Oct 17 16:43:37 2019]  kfd_process_notifier_release+0x1be/0x220 [amdgpu]
>> [Thu Oct 17 16:43:37 2019]  __mmu_notifier_release+0x3e/0xc0
>> [Thu Oct 17 16:43:37 2019]  exit_mmap+0x160/0x1a0
>> [Thu Oct 17 16:43:37 2019]  ? __handle_mm_fault+0xba3/0x1200
>> [Thu Oct 17 16:43:37 2019]  ? exit_robust_list+0x5a/0x110
>> [Thu Oct 17 16:43:37 2019]  mmput+0x4a/0x120
>> [Thu Oct 17 16:43:37 2019]  do_exit+0x284/0xb20
>> [Thu Oct 17 16:43:37 2019]  ? handle_mm_fault+0xfa/0x200
>> [Thu Oct 17 16:43:37 2019]  do_group_exit+0x3a/0xa0
>> [Thu Oct 17 16:43:37 2019]  __x64_sys_exit_group+0x14/0x20
>> [Thu Oct 17 16:43:37 2019]  do_syscall_64+0x4f/0x100
>> [Thu Oct 17 16:43:37 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> Suggested-by: Felix Kuehling
>> Signed-off-by: Philip Yang
>
> Three more comments inline. With those comments addressed, this patch is
>
> Reviewed-by: Felix Kuehling
>
>> ---
>>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      |  6 +--
>>  drivers/gpu/drm/amd/amdkfd/kfd_device.c       |  5 --
>>  .../drm/amd/amdkfd/kfd_device_queue_manager.c | 47 +-
>>  .../drm/amd/amdkfd/kfd_device_queue_manager.h |  1 +
>>  4 files changed, 46 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> index d9e36dbf13d5..40d75c39f08e 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> @@ -120,13 +120,13 @@ static int kfd_open(struct inode *inode, struct file *filep)
>>  		return -EPERM;
>>  	}
>>
>> +	if (kfd_is_locked())
>> +		return -EAGAIN;
>> +
>>  	process = kfd_create_process(filep);
>>  	if (IS_ERR(process))
>>  		return PTR_ERR(process);
>>
>> -	if (kfd_is_locked())
>> -		return -EAGAIN;
>> -
>
> Is this part of the change still needed? I remember that this sequence
> was a bit tricky with some potential race condition when Shaoyun was
> working on it. This may have unintended side effects.
>
Revert this change to be safe; it is not needed, and it is not clear
whether there are side effects.

>>  	dev_dbg(kfd_device, "process %d opened, compat mode (32 bit) - %d\n",
>>  		process->pasid, process->is_32bit_user_mode);
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> index 8f4b24e84964..4fa8834ce7cb 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> @@ -730,9 +730,6 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>>  		return 0;
>>  	kgd2kfd_suspend(kfd);
>>
>> -	/* hold dqm->lock to prevent further execution*/
>> -	dqm_lock(kfd->dqm);
>> -
>>  	kfd_signal_reset_event(kfd);
>>  	return 0;
>>  }
>> @@ -750,8 +747,6 @@ int kgd2kfd_post_reset(struct kfd_dev *kfd)
>>  	if (!kfd->init_complete)
>>  		return 0;
>>
>> -	dqm_unlock(kfd->dqm);
>> -
>>  	ret = kfd_resume(kfd);
>>  	if (ret)
>>  		return ret;
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
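The kfd_open() ordering Felix questions can be sketched as follows. In v2 the kfd_is_locked() check was moved before kfd_create_process(); per the review it is reverted, so the process is created first and -EAGAIN is returned only afterwards. Everything here (the kfd_locked flag, the counter, the stub kfd_create_process) is a hypothetical stand-in for driver state, not the real API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch of the ordering that is kept after the review comment:
 * create the process first, then check whether KFD is locked. */

enum { EAGAIN = 11 };

static bool kfd_locked;           /* stand-in for kfd_is_locked() */
static int  processes_created;    /* counts kfd_create_process() calls */

static void *kfd_create_process(void)
{
	processes_created++;
	return &processes_created; /* any non-NULL "process" */
}

/* Original (kept) ordering: create, then check the lock. */
int kfd_open(void)
{
	void *process = kfd_create_process();

	if (process == NULL)
		return -1;
	if (kfd_locked)
		return -EAGAIN;    /* caller may retry after reset */
	return 0;
}
```

Note that with this ordering the process object is always created, even when the open ultimately fails with -EAGAIN; that is exactly the subtle interaction with reset that the review asks not to disturb.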
Re: [PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume
On 10/22/19 3:19 PM, Yang, Philip wrote: > > On 2019-10-22 2:40 p.m., Grodzovsky, Andrey wrote: >> On 10/22/19 2:38 PM, Grodzovsky, Andrey wrote: >>> On 10/22/19 2:28 PM, Yang, Philip wrote: If device reset/suspend/resume failed for some reason, dqm lock is hold forever and this causes deadlock. Below is a kernel backtrace when application open kfd after suspend/resume failed. Instead of holding dqm lock in pre_reset and releasing dqm lock in post_reset, add dqm->device_stopped flag which is modified in dqm->ops.start and dqm->ops.stop. The flag doesn't need lock protection because write/read are all inside dqm lock. For HWS case, map_queues_cpsch and unmap_queues_cpsch checks device_stopped flag before sending the updated runlist. >>> Is there a chance of race condition here where dqm->device_stopped >>> returns true for some operation (e.g.map_queues_cpsch) but just as it >>> proceeds GPU reset starts ? >>> >>> Andrey >> >> Correction - >> >> dqm->device_stopped returns FALSE >> > No race condition here, dqm->device_stopped is set to FALSE in > kgd2kfd_post_reset -> dqm->ops.start(), which the last step of > amdgpu_device_gpu_recover, so it's safe to do map_queue_cpsch. > > Regards, > Philip Sorry - i was confused by the commit description vs. body - in the description it's called device_stopped flag while in the body it's sched_running - probably the description needs to be fixed. So i mean the switch true->false when GPU reset just begins - you get an IOCTL to map a queue (if i understood KFD code correctly), check dqm->sched_running == true and continue, what if right then GPU reset started due to some issue like job timeout or user triggered reset - dqm->sched_running becomes false but you already past that checkpoint in the IOCTL, no ? Andrey > >> Andrey >> >>> v2: For no-HWS case, when device is stopped, don't call load/destroy_mqd for eviction, restore and create queue, and avoid debugfs dump hdqs. 
Backtrace of dqm lock deadlock: [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more than 120 seconds. [Thu Oct 17 16:43:37 2019] Not tainted 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1 [Thu Oct 17 16:43:37 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [Thu Oct 17 16:43:37 2019] rocminfoD0 3024 2947 0x8000 [Thu Oct 17 16:43:37 2019] Call Trace: [Thu Oct 17 16:43:37 2019] ? __schedule+0x3d9/0x8a0 [Thu Oct 17 16:43:37 2019] schedule+0x32/0x70 [Thu Oct 17 16:43:37 2019] schedule_preempt_disabled+0xa/0x10 [Thu Oct 17 16:43:37 2019] __mutex_lock.isra.9+0x1e3/0x4e0 [Thu Oct 17 16:43:37 2019] ? __call_srcu+0x264/0x3b0 [Thu Oct 17 16:43:37 2019] ? process_termination_cpsch+0x24/0x2f0 [amdgpu] [Thu Oct 17 16:43:37 2019] process_termination_cpsch+0x24/0x2f0 [amdgpu] [Thu Oct 17 16:43:37 2019] kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu] [Thu Oct 17 16:43:37 2019] kfd_process_notifier_release+0x1be/0x220 [amdgpu] [Thu Oct 17 16:43:37 2019] __mmu_notifier_release+0x3e/0xc0 [Thu Oct 17 16:43:37 2019] exit_mmap+0x160/0x1a0 [Thu Oct 17 16:43:37 2019] ? __handle_mm_fault+0xba3/0x1200 [Thu Oct 17 16:43:37 2019] ? exit_robust_list+0x5a/0x110 [Thu Oct 17 16:43:37 2019] mmput+0x4a/0x120 [Thu Oct 17 16:43:37 2019] do_exit+0x284/0xb20 [Thu Oct 17 16:43:37 2019] ? 
handle_mm_fault+0xfa/0x200 [Thu Oct 17 16:43:37 2019] do_group_exit+0x3a/0xa0 [Thu Oct 17 16:43:37 2019] __x64_sys_exit_group+0x14/0x20 [Thu Oct 17 16:43:37 2019] do_syscall_64+0x4f/0x100 [Thu Oct 17 16:43:37 2019] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Suggested-by: Felix Kuehling Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 6 +-- drivers/gpu/drm/amd/amdkfd/kfd_device.c | 5 -- .../drm/amd/amdkfd/kfd_device_queue_manager.c | 47 +-- .../drm/amd/amdkfd/kfd_device_queue_manager.h | 1 + 4 files changed, 46 insertions(+), 13 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c index d9e36dbf13d5..40d75c39f08e 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c @@ -120,13 +120,13 @@ static int kfd_open(struct inode *inode, struct file *filep) return -EPERM; } + if (kfd_is_locked()) + return -EAGAIN; + process = kfd_create_process(filep); if (IS_ERR(process)) return PTR_ERR(process); - if (kfd_is_locked()) - return -EAGAIN; - dev_dbg(kfd_device,
Re: [PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume
On 2019-10-22 2:40 p.m., Grodzovsky, Andrey wrote: > > On 10/22/19 2:38 PM, Grodzovsky, Andrey wrote: >> On 10/22/19 2:28 PM, Yang, Philip wrote: >>> If device reset/suspend/resume failed for some reason, dqm lock is >>> hold forever and this causes deadlock. Below is a kernel backtrace when >>> application open kfd after suspend/resume failed. >>> >>> Instead of holding dqm lock in pre_reset and releasing dqm lock in >>> post_reset, add dqm->device_stopped flag which is modified in >>> dqm->ops.start and dqm->ops.stop. The flag doesn't need lock protection >>> because write/read are all inside dqm lock. >>> >>> For HWS case, map_queues_cpsch and unmap_queues_cpsch checks >>> device_stopped flag before sending the updated runlist. >> >> Is there a chance of race condition here where dqm->device_stopped >> returns true for some operation (e.g.map_queues_cpsch) but just as it >> proceeds GPU reset starts ? >> >> Andrey > > > Correction - > > dqm->device_stopped returns FALSE > No race condition here, dqm->device_stopped is set to FALSE in kgd2kfd_post_reset -> dqm->ops.start(), which the last step of amdgpu_device_gpu_recover, so it's safe to do map_queue_cpsch. Regards, Philip > Andrey > >> >> >>> v2: For no-HWS case, when device is stopped, don't call >>> load/destroy_mqd for eviction, restore and create queue, and avoid >>> debugfs dump hdqs. >>> >>> Backtrace of dqm lock deadlock: >>> >>> [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more >>> than 120 seconds. >>> [Thu Oct 17 16:43:37 2019] Not tainted >>> 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1 >>> [Thu Oct 17 16:43:37 2019] "echo 0 > >>> /proc/sys/kernel/hung_task_timeout_secs" disables this message. >>> [Thu Oct 17 16:43:37 2019] rocminfoD0 3024 2947 >>> 0x8000 >>> [Thu Oct 17 16:43:37 2019] Call Trace: >>> [Thu Oct 17 16:43:37 2019] ? 
__schedule+0x3d9/0x8a0 >>> [Thu Oct 17 16:43:37 2019] schedule+0x32/0x70 >>> [Thu Oct 17 16:43:37 2019] schedule_preempt_disabled+0xa/0x10 >>> [Thu Oct 17 16:43:37 2019] __mutex_lock.isra.9+0x1e3/0x4e0 >>> [Thu Oct 17 16:43:37 2019] ? __call_srcu+0x264/0x3b0 >>> [Thu Oct 17 16:43:37 2019] ? process_termination_cpsch+0x24/0x2f0 >>> [amdgpu] >>> [Thu Oct 17 16:43:37 2019] process_termination_cpsch+0x24/0x2f0 >>> [amdgpu] >>> [Thu Oct 17 16:43:37 2019] >>> kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu] >>> [Thu Oct 17 16:43:37 2019] kfd_process_notifier_release+0x1be/0x220 >>> [amdgpu] >>> [Thu Oct 17 16:43:37 2019] __mmu_notifier_release+0x3e/0xc0 >>> [Thu Oct 17 16:43:37 2019] exit_mmap+0x160/0x1a0 >>> [Thu Oct 17 16:43:37 2019] ? __handle_mm_fault+0xba3/0x1200 >>> [Thu Oct 17 16:43:37 2019] ? exit_robust_list+0x5a/0x110 >>> [Thu Oct 17 16:43:37 2019] mmput+0x4a/0x120 >>> [Thu Oct 17 16:43:37 2019] do_exit+0x284/0xb20 >>> [Thu Oct 17 16:43:37 2019] ? handle_mm_fault+0xfa/0x200 >>> [Thu Oct 17 16:43:37 2019] do_group_exit+0x3a/0xa0 >>> [Thu Oct 17 16:43:37 2019] __x64_sys_exit_group+0x14/0x20 >>> [Thu Oct 17 16:43:37 2019] do_syscall_64+0x4f/0x100 >>> [Thu Oct 17 16:43:37 2019] entry_SYSCALL_64_after_hwframe+0x44/0xa9 >>> >>> Suggested-by: Felix Kuehling >>> Signed-off-by: Philip Yang >>> --- >>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 6 +-- >>> drivers/gpu/drm/amd/amdkfd/kfd_device.c | 5 -- >>> .../drm/amd/amdkfd/kfd_device_queue_manager.c | 47 +-- >>> .../drm/amd/amdkfd/kfd_device_queue_manager.h | 1 + >>> 4 files changed, 46 insertions(+), 13 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c >>> b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c >>> index d9e36dbf13d5..40d75c39f08e 100644 >>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c >>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c >>> @@ -120,13 +120,13 @@ static int kfd_open(struct inode *inode, struct file >>> *filep) >>> return -EPERM; >>> } >>> >>> + if (kfd_is_locked()) 
>>> + return -EAGAIN; >>> + >>> process = kfd_create_process(filep); >>> if (IS_ERR(process)) >>> return PTR_ERR(process); >>> >>> - if (kfd_is_locked()) >>> - return -EAGAIN; >>> - >>> dev_dbg(kfd_device, "process %d opened, compat mode (32 bit) - >>> %d\n", >>> process->pasid, process->is_32bit_user_mode); >>> >>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c >>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c >>> index 8f4b24e84964..4fa8834ce7cb 100644 >>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c >>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c >>> @@ -730,9 +730,6 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd) >>> return 0; >>> kgd2kfd_suspend(kfd); >>> >>> - /* hold dqm->lock to prevent further execution*/ >>> - dqm_lock(kfd->dqm); >>> - >>> kfd_signal_reset_event(kfd); >>> return 0; >>> } >>> @@ -750,8
Re: [PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume
On 2019-10-22 14:28, Yang, Philip wrote: > If device reset/suspend/resume failed for some reason, dqm lock is > hold forever and this causes deadlock. Below is a kernel backtrace when > application open kfd after suspend/resume failed. > > Instead of holding dqm lock in pre_reset and releasing dqm lock in > post_reset, add dqm->device_stopped flag which is modified in > dqm->ops.start and dqm->ops.stop. The flag doesn't need lock protection > because write/read are all inside dqm lock. > > For HWS case, map_queues_cpsch and unmap_queues_cpsch checks > device_stopped flag before sending the updated runlist. > > v2: For no-HWS case, when device is stopped, don't call > load/destroy_mqd for eviction, restore and create queue, and avoid > debugfs dump hdqs. > > Backtrace of dqm lock deadlock: > > [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more > than 120 seconds. > [Thu Oct 17 16:43:37 2019] Not tainted > 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1 > [Thu Oct 17 16:43:37 2019] "echo 0 > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [Thu Oct 17 16:43:37 2019] rocminfoD0 3024 2947 > 0x8000 > [Thu Oct 17 16:43:37 2019] Call Trace: > [Thu Oct 17 16:43:37 2019] ? __schedule+0x3d9/0x8a0 > [Thu Oct 17 16:43:37 2019] schedule+0x32/0x70 > [Thu Oct 17 16:43:37 2019] schedule_preempt_disabled+0xa/0x10 > [Thu Oct 17 16:43:37 2019] __mutex_lock.isra.9+0x1e3/0x4e0 > [Thu Oct 17 16:43:37 2019] ? __call_srcu+0x264/0x3b0 > [Thu Oct 17 16:43:37 2019] ? process_termination_cpsch+0x24/0x2f0 > [amdgpu] > [Thu Oct 17 16:43:37 2019] process_termination_cpsch+0x24/0x2f0 > [amdgpu] > [Thu Oct 17 16:43:37 2019] > kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu] > [Thu Oct 17 16:43:37 2019] kfd_process_notifier_release+0x1be/0x220 > [amdgpu] > [Thu Oct 17 16:43:37 2019] __mmu_notifier_release+0x3e/0xc0 > [Thu Oct 17 16:43:37 2019] exit_mmap+0x160/0x1a0 > [Thu Oct 17 16:43:37 2019] ? __handle_mm_fault+0xba3/0x1200 > [Thu Oct 17 16:43:37 2019] ? 
exit_robust_list+0x5a/0x110 > [Thu Oct 17 16:43:37 2019] mmput+0x4a/0x120 > [Thu Oct 17 16:43:37 2019] do_exit+0x284/0xb20 > [Thu Oct 17 16:43:37 2019] ? handle_mm_fault+0xfa/0x200 > [Thu Oct 17 16:43:37 2019] do_group_exit+0x3a/0xa0 > [Thu Oct 17 16:43:37 2019] __x64_sys_exit_group+0x14/0x20 > [Thu Oct 17 16:43:37 2019] do_syscall_64+0x4f/0x100 > [Thu Oct 17 16:43:37 2019] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > Suggested-by: Felix Kuehling > Signed-off-by: Philip Yang Three more comments inline. With those comments addressed, this patch is Reviewed-by: Felix Kuehling > --- > drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 6 +-- > drivers/gpu/drm/amd/amdkfd/kfd_device.c | 5 -- > .../drm/amd/amdkfd/kfd_device_queue_manager.c | 47 +-- > .../drm/amd/amdkfd/kfd_device_queue_manager.h | 1 + > 4 files changed, 46 insertions(+), 13 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c > b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c > index d9e36dbf13d5..40d75c39f08e 100644 > --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c > @@ -120,13 +120,13 @@ static int kfd_open(struct inode *inode, struct file > *filep) > return -EPERM; > } > > + if (kfd_is_locked()) > + return -EAGAIN; > + > process = kfd_create_process(filep); > if (IS_ERR(process)) > return PTR_ERR(process); > > - if (kfd_is_locked()) > - return -EAGAIN; > - Is this part of the change still needed? I remember that this sequence was a bit tricky with some potential race condition when Shaoyun was working on it. This may have unintended side effects. 
> dev_dbg(kfd_device, "process %d opened, compat mode (32 bit) - %d\n", > process->pasid, process->is_32bit_user_mode); > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c > b/drivers/gpu/drm/amd/amdkfd/kfd_device.c > index 8f4b24e84964..4fa8834ce7cb 100644 > --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c > @@ -730,9 +730,6 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd) > return 0; > kgd2kfd_suspend(kfd); > > - /* hold dqm->lock to prevent further execution*/ > - dqm_lock(kfd->dqm); > - > kfd_signal_reset_event(kfd); > return 0; > } > @@ -750,8 +747,6 @@ int kgd2kfd_post_reset(struct kfd_dev *kfd) > if (!kfd->init_complete) > return 0; > > - dqm_unlock(kfd->dqm); > - > ret = kfd_resume(kfd); > if (ret) > return ret; > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c > b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c > index 81fb545cf42c..82e1c6280d13 100644 > --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c > @@ -340,6 +340,10 @@ static int
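The kgd2kfd_pre_reset/kgd2kfd_post_reset hunk quoted above removes the lock-across-reset pattern that caused the hang. The failure mode can be modeled in a few lines; this is a deliberately simplified single-threaded sketch with illustrative names (the lock is a plain flag, not the real mutex), not the driver's actual API.

```c
#include <stdbool.h>

/* Model of the deadlock the patch removes: the old code took
 * dqm->lock in kgd2kfd_pre_reset and only released it in
 * kgd2kfd_post_reset.  If the reset failed in between, post_reset
 * never ran, the lock stayed held, and any later path needing it
 * (e.g. process_termination_cpsch in the backtrace) blocked forever.
 * The new scheme only toggles a flag under the lock. */

struct kfd_dev {
	bool dqm_locked;     /* stand-in for dqm->lock */
	bool sched_running;
};

/* Old scheme: the lock is left held across the whole reset. */
void old_pre_reset(struct kfd_dev *kfd)  { kfd->dqm_locked = true; }
void old_post_reset(struct kfd_dev *kfd) { kfd->dqm_locked = false; }

/* New scheme: the lock is released before the HW reset starts. */
void new_pre_reset(struct kfd_dev *kfd)
{
	kfd->dqm_locked = true;
	kfd->sched_running = false;  /* dqm->ops.stop() */
	kfd->dqm_locked = false;
}

void new_post_reset(struct kfd_dev *kfd)
{
	kfd->dqm_locked = true;
	kfd->sched_running = true;   /* dqm->ops.start() */
	kfd->dqm_locked = false;
}

/* Would a later caller that needs the dqm lock make progress? */
bool can_take_dqm_lock(const struct kfd_dev *kfd)
{
	return !kfd->dqm_locked;
}
```

In the old scheme a failed reset (pre_reset runs, post_reset never does) leaves the lock held forever; in the new scheme the lock is free regardless of whether the reset succeeds, and only the sched_running flag records the device state.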
Re: [PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume
On 10/22/19 2:38 PM, Grodzovsky, Andrey wrote: > On 10/22/19 2:28 PM, Yang, Philip wrote: >> If device reset/suspend/resume failed for some reason, dqm lock is >> hold forever and this causes deadlock. Below is a kernel backtrace when >> application open kfd after suspend/resume failed. >> >> Instead of holding dqm lock in pre_reset and releasing dqm lock in >> post_reset, add dqm->device_stopped flag which is modified in >> dqm->ops.start and dqm->ops.stop. The flag doesn't need lock protection >> because write/read are all inside dqm lock. >> >> For HWS case, map_queues_cpsch and unmap_queues_cpsch checks >> device_stopped flag before sending the updated runlist. > > Is there a chance of race condition here where dqm->device_stopped > returns true for some operation (e.g.map_queues_cpsch) but just as it > proceeds GPU reset starts ? > > Andrey Correction - dqm->device_stopped returns FALSE Andrey > > >> v2: For no-HWS case, when device is stopped, don't call >> load/destroy_mqd for eviction, restore and create queue, and avoid >> debugfs dump hdqs. >> >> Backtrace of dqm lock deadlock: >> >> [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more >> than 120 seconds. >> [Thu Oct 17 16:43:37 2019] Not tainted >> 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1 >> [Thu Oct 17 16:43:37 2019] "echo 0 > >> /proc/sys/kernel/hung_task_timeout_secs" disables this message. >> [Thu Oct 17 16:43:37 2019] rocminfoD0 3024 2947 >> 0x8000 >> [Thu Oct 17 16:43:37 2019] Call Trace: >> [Thu Oct 17 16:43:37 2019] ? __schedule+0x3d9/0x8a0 >> [Thu Oct 17 16:43:37 2019] schedule+0x32/0x70 >> [Thu Oct 17 16:43:37 2019] schedule_preempt_disabled+0xa/0x10 >> [Thu Oct 17 16:43:37 2019] __mutex_lock.isra.9+0x1e3/0x4e0 >> [Thu Oct 17 16:43:37 2019] ? __call_srcu+0x264/0x3b0 >> [Thu Oct 17 16:43:37 2019] ? 
process_termination_cpsch+0x24/0x2f0 >> [amdgpu] >> [Thu Oct 17 16:43:37 2019] process_termination_cpsch+0x24/0x2f0 >> [amdgpu] >> [Thu Oct 17 16:43:37 2019] >> kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu] >> [Thu Oct 17 16:43:37 2019] kfd_process_notifier_release+0x1be/0x220 >> [amdgpu] >> [Thu Oct 17 16:43:37 2019] __mmu_notifier_release+0x3e/0xc0 >> [Thu Oct 17 16:43:37 2019] exit_mmap+0x160/0x1a0 >> [Thu Oct 17 16:43:37 2019] ? __handle_mm_fault+0xba3/0x1200 >> [Thu Oct 17 16:43:37 2019] ? exit_robust_list+0x5a/0x110 >> [Thu Oct 17 16:43:37 2019] mmput+0x4a/0x120 >> [Thu Oct 17 16:43:37 2019] do_exit+0x284/0xb20 >> [Thu Oct 17 16:43:37 2019] ? handle_mm_fault+0xfa/0x200 >> [Thu Oct 17 16:43:37 2019] do_group_exit+0x3a/0xa0 >> [Thu Oct 17 16:43:37 2019] __x64_sys_exit_group+0x14/0x20 >> [Thu Oct 17 16:43:37 2019] do_syscall_64+0x4f/0x100 >> [Thu Oct 17 16:43:37 2019] entry_SYSCALL_64_after_hwframe+0x44/0xa9 >> >> Suggested-by: Felix Kuehling >> Signed-off-by: Philip Yang >> --- >>drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 6 +-- >>drivers/gpu/drm/amd/amdkfd/kfd_device.c | 5 -- >>.../drm/amd/amdkfd/kfd_device_queue_manager.c | 47 +-- >>.../drm/amd/amdkfd/kfd_device_queue_manager.h | 1 + >>4 files changed, 46 insertions(+), 13 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c >> b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c >> index d9e36dbf13d5..40d75c39f08e 100644 >> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c >> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c >> @@ -120,13 +120,13 @@ static int kfd_open(struct inode *inode, struct file >> *filep) >> return -EPERM; >> } >> >> +if (kfd_is_locked()) >> +return -EAGAIN; >> + >> process = kfd_create_process(filep); >> if (IS_ERR(process)) >> return PTR_ERR(process); >> >> -if (kfd_is_locked()) >> -return -EAGAIN; >> - >> dev_dbg(kfd_device, "process %d opened, compat mode (32 bit) - %d\n", >> process->pasid, process->is_32bit_user_mode); >> >> diff --git 
a/drivers/gpu/drm/amd/amdkfd/kfd_device.c >> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c >> index 8f4b24e84964..4fa8834ce7cb 100644 >> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c >> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c >> @@ -730,9 +730,6 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd) >> return 0; >> kgd2kfd_suspend(kfd); >> >> -/* hold dqm->lock to prevent further execution*/ >> -dqm_lock(kfd->dqm); >> - >> kfd_signal_reset_event(kfd); >> return 0; >>} >> @@ -750,8 +747,6 @@ int kgd2kfd_post_reset(struct kfd_dev *kfd) >> if (!kfd->init_complete) >> return 0; >> >> -dqm_unlock(kfd->dqm); >> - >> ret = kfd_resume(kfd); >> if (ret) >> return ret; >> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c >> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c >> index 81fb545cf42c..82e1c6280d13 100644 >> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c >> +++
Re: [PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume
On 10/22/19 2:28 PM, Yang, Philip wrote:
> If device reset/suspend/resume failed for some reason, dqm lock is
> held forever and this causes a deadlock. Below is a kernel backtrace
> when an application opens kfd after suspend/resume failed.
>
> Instead of holding dqm lock in pre_reset and releasing dqm lock in
> post_reset, add dqm->device_stopped flag which is modified in
> dqm->ops.start and dqm->ops.stop. The flag doesn't need lock protection
> because write/read are all inside dqm lock.
>
> For HWS case, map_queues_cpsch and unmap_queues_cpsch check the
> device_stopped flag before sending the updated runlist.

Is there a chance of a race condition here where dqm->device_stopped
returns true for some operation (e.g. map_queues_cpsch) but just as it
proceeds GPU reset starts?

Andrey

[snip]
[PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume
If device reset/suspend/resume failed for some reason, dqm lock is
held forever and this causes a deadlock. Below is a kernel backtrace
when an application opens kfd after suspend/resume failed.

Instead of holding dqm lock in pre_reset and releasing dqm lock in
post_reset, add dqm->device_stopped flag which is modified in
dqm->ops.start and dqm->ops.stop. The flag doesn't need lock protection
because write/read are all inside dqm lock.

For HWS case, map_queues_cpsch and unmap_queues_cpsch check the
device_stopped flag before sending the updated runlist.

v2: For no-HWS case, when device is stopped, don't call
load/destroy_mqd for eviction, restore and create queue, and avoid
debugfs dump hqds.

Backtrace of dqm lock deadlock:

[Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more
than 120 seconds.
[Thu Oct 17 16:43:37 2019]       Not tainted
5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1
[Thu Oct 17 16:43:37 2019] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Oct 17 16:43:37 2019] rocminfo   D    0  3024   2947  0x8000
[Thu Oct 17 16:43:37 2019] Call Trace:
[Thu Oct 17 16:43:37 2019]  ? __schedule+0x3d9/0x8a0
[Thu Oct 17 16:43:37 2019]  schedule+0x32/0x70
[Thu Oct 17 16:43:37 2019]  schedule_preempt_disabled+0xa/0x10
[Thu Oct 17 16:43:37 2019]  __mutex_lock.isra.9+0x1e3/0x4e0
[Thu Oct 17 16:43:37 2019]  ? __call_srcu+0x264/0x3b0
[Thu Oct 17 16:43:37 2019]  ? process_termination_cpsch+0x24/0x2f0 [amdgpu]
[Thu Oct 17 16:43:37 2019]  process_termination_cpsch+0x24/0x2f0 [amdgpu]
[Thu Oct 17 16:43:37 2019]  kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu]
[Thu Oct 17 16:43:37 2019]  kfd_process_notifier_release+0x1be/0x220 [amdgpu]
[Thu Oct 17 16:43:37 2019]  __mmu_notifier_release+0x3e/0xc0
[Thu Oct 17 16:43:37 2019]  exit_mmap+0x160/0x1a0
[Thu Oct 17 16:43:37 2019]  ? __handle_mm_fault+0xba3/0x1200
[Thu Oct 17 16:43:37 2019]  ? exit_robust_list+0x5a/0x110
[Thu Oct 17 16:43:37 2019]  mmput+0x4a/0x120
[Thu Oct 17 16:43:37 2019]  do_exit+0x284/0xb20
[Thu Oct 17 16:43:37 2019]  ? handle_mm_fault+0xfa/0x200
[Thu Oct 17 16:43:37 2019]  do_group_exit+0x3a/0xa0
[Thu Oct 17 16:43:37 2019]  __x64_sys_exit_group+0x14/0x20
[Thu Oct 17 16:43:37 2019]  do_syscall_64+0x4f/0x100
[Thu Oct 17 16:43:37 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Suggested-by: Felix Kuehling
Signed-off-by: Philip Yang
---
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      |  6 +--
 drivers/gpu/drm/amd/amdkfd/kfd_device.c       |  5 --
 .../drm/amd/amdkfd/kfd_device_queue_manager.c | 47 +--
 .../drm/amd/amdkfd/kfd_device_queue_manager.h |  1 +
 4 files changed, 46 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index d9e36dbf13d5..40d75c39f08e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -120,13 +120,13 @@ static int kfd_open(struct inode *inode, struct file *filep)
 		return -EPERM;
 	}
 
+	if (kfd_is_locked())
+		return -EAGAIN;
+
 	process = kfd_create_process(filep);
 	if (IS_ERR(process))
 		return PTR_ERR(process);
 
-	if (kfd_is_locked())
-		return -EAGAIN;
-
 	dev_dbg(kfd_device, "process %d opened, compat mode (32 bit) - %d\n",
 		process->pasid, process->is_32bit_user_mode);
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 8f4b24e84964..4fa8834ce7cb 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -730,9 +730,6 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
 		return 0;
 	kgd2kfd_suspend(kfd);
 
-	/* hold dqm->lock to prevent further execution*/
-	dqm_lock(kfd->dqm);
-
 	kfd_signal_reset_event(kfd);
 	return 0;
 }
@@ -750,8 +747,6 @@ int kgd2kfd_post_reset(struct kfd_dev *kfd)
 	if (!kfd->init_complete)
 		return 0;
 
-	dqm_unlock(kfd->dqm);
-
 	ret = kfd_resume(kfd);
 	if (ret)
 		return ret;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 81fb545cf42c..82e1c6280d13 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -340,6 +340,10 @@ static int create_queue_nocpsch(struct device_queue_manager *dqm,
 	mqd_mgr->init_mqd(mqd_mgr, &q->mqd, q->mqd_mem_obj,
 				&q->gart_mqd_addr, &q->properties);
 	if (q->properties.is_active) {
+		if (!dqm->sched_running) {
+			WARN_ONCE(1, "Load non-HWS mqd while stopped\n");
+			goto add_queue_to_list;
+		}
 		if (WARN(q->process->mm != current->mm,
 					"should only run in user thread"))
@@