RE: [PATCH] drm/amdgpu: Add timeout for sync wait

2023-10-19 Thread Deng, Emily
[AMD Official Use Only - General]

Hi Felix,
 Yes, will correct the description. Will add another patch to handle the 
return for sync wait.

Emily Deng
Best Wishes



>-Original Message-
>From: Kuehling, Felix 
>Sent: Friday, October 20, 2023 12:05 AM
>To: Deng, Emily ; amd-gfx@lists.freedesktop.org
>Subject: Re: [PATCH] drm/amdgpu: Add timeout for sync wait
>
>On 2023-10-19 05:31, Emily Deng wrote:
>> Issue: Dead heappen during gpu recover
>>
>> [56433.829492] amdgpu :04:00.0: amdgpu: GPU reset begin!
>> [56550.499625] INFO: task kworker/u80:0:10 blocked for more than 120
>seconds.
>> [56550.520215]   Tainted: G   OE  6.2.0-34-generic 
>> #34~22.04.1-
>Ubuntu
>> [56550.542883] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>disables this message.
>> [56550.566313] task:kworker/u80:0   state:D stack:0 pid:10ppid:2
>flags:0x4000
>> [56550.591318] Workqueue: kfd_restore_wq restore_process_worker
>> [amdgpu] [56550.611391] Call Trace:
>> [56550.618698]  
>> [56550.624968]  __schedule+0x2b7/0x5f0 [56550.635416]
>> schedule+0x68/0x110 [56550.645090]  schedule_timeout+0x151/0x160
>> [56550.657096]  ? amdgpu_vm_bo_update+0x46e/0x660 [amdgpu]
>> [56550.673245]  dma_fence_default_wait+0x1a2/0x1e0
>> [56550.686818]  ? __pfx_dma_fence_default_wait_cb+0x10/0x10
>> [56550.702728]  dma_fence_wait_timeout+0x117/0x140
>> [56550.716301]  amdgpu_sync_wait+0x62/0xa0 [amdgpu] [56550.730654]
>> amdgpu_amdkfd_gpuvm_restore_process_bos+0x59e/0x770 [amdgpu]
>> [56550.751668]  ? newidle_balance+0x298/0x490 [56550.763936]
>> restore_process_worker+0x42/0x270 [amdgpu] [56550.780183]
>> process_one_work+0x21f/0x440 [56550.792193]  worker_thread+0x50/0x3f0
>> [56550.803165]  ? __pfx_worker_thread+0x10/0x10 [56550.815934]
>> kthread+0xee/0x120 [56550.825342]  ? __pfx_kthread+0x10/0x10
>> [56550.836548]  ret_from_fork+0x2c/0x50 [56550.847262]   [
>> 1935.215502] Call Trace:
>> [ 1935.222827]  
>> [ 1935.229121]  __schedule+0x23d/0x5a0 [ 1935.239583]
>> schedule+0x4e/0xc0 [ 1935.248983]  schedule_timeout+0x103/0x140 [
>> 1935.261002]  __wait_for_common+0xae/0x150 [ 1935.273008]  ?
>> usleep_range_state+0x90/0x90 [ 1935.285546]
>> wait_for_completion+0x24/0x30 [ 1935.297813]
>> __flush_work.isra.0+0x175/0x280 [ 1935.310611]  ?
>> worker_detach_from_pool+0xc0/0xc0 [ 1935.324436]
>> flush_delayed_work+0x31/0x50 [ 1935.336455]
>> kfd_suspend_all_processes+0x96/0x150 [amdgpu] [ 1935.353429]
>> kgd2kfd_suspend+0xb8/0xe0 [amdgpu] [ 1935.367469]
>> kgd2kfd_pre_reset+0x81/0xf0 [amdgpu] [ 1935.382036]
>> amdgpu_amdkfd_pre_reset+0x1a/0x30 [amdgpu] [ 1935.398156]
>> amdgpu_device_gpu_recover.cold+0x210/0xcf2 [amdgpu] [ 1935.416722]
>> amdgpu_job_timedout+0x19f/0x1e0 [amdgpu] [ 1935.432367]
>> drm_sched_job_timedout+0x6f/0x120 [amd_sched] [ 1935.448792]
>> process_one_work+0x22b/0x3d0 [ 1935.460806]
>worker_thread+0x53/0x420
>> [ 1935.471777]  ? process_one_work+0x3d0/0x3d0 [ 1935.484307]
>> kthread+0x12a/0x150 [ 1935.493993]  ? set_kthread_struct+0x50/0x50 [
>> 1935.506513]  ret_from_fork+0x22/0x30
>
>Looking at the time stamps, this seems to be a mash-up of two different logs. I
>think you're trying to show how a restore_processes worker is stuck on a fence,
>and that's causing kgd2kfd_pre_reset to hang when it tries to flush the work.
>
>The fence it's hanging on is probably something related to a page table update
>that got caught up in the GPU hang. Adding a timeout here seems reasonable.
>There may be another problem, because
>amdgpu_amdkfd_gpuvm_restore_process_bos ignores the return value of
>amdgpu_sync_wait. We shouldi probably handle the timeout gracefully with a
>"goto validate_map_fail".
>
>Regards,
>   Felix
>
>
>>
>> It is because the amdgpu_sync_wait is waiting for the bad job's fence,
>> and never return, so the recover couldn't continue.
>>
>>
>> Signed-off-by: Emily Deng 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 16 +---
>>   1 file changed, 13 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
>> index dcd8c066bc1f..c922867c5675 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
>> @@ -406,9 +406,19 @@ int amdgpu_sync_wait(struct amdgpu_sync *sync,
>bool intr)
>>  int i, r;
>>
>>  hash_for_each_safe(sync->fences, i, tmp, e, node) {
>> -r = dma_fence_wait(e->fence, intr);
&

Re: [PATCH] drm/amdgpu: Add timeout for sync wait

2023-10-19 Thread Felix Kuehling

On 2023-10-19 05:31, Emily Deng wrote:

Issue: Dead heappen during gpu recover

[56433.829492] amdgpu :04:00.0: amdgpu: GPU reset begin!
[56550.499625] INFO: task kworker/u80:0:10 blocked for more than 120 seconds.
[56550.520215]   Tainted: G   OE  6.2.0-34-generic 
#34~22.04.1-Ubuntu
[56550.542883] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[56550.566313] task:kworker/u80:0   state:D stack:0 pid:10ppid:2  
flags:0x4000
[56550.591318] Workqueue: kfd_restore_wq restore_process_worker [amdgpu]
[56550.611391] Call Trace:
[56550.618698]  
[56550.624968]  __schedule+0x2b7/0x5f0
[56550.635416]  schedule+0x68/0x110
[56550.645090]  schedule_timeout+0x151/0x160
[56550.657096]  ? amdgpu_vm_bo_update+0x46e/0x660 [amdgpu]
[56550.673245]  dma_fence_default_wait+0x1a2/0x1e0
[56550.686818]  ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[56550.702728]  dma_fence_wait_timeout+0x117/0x140
[56550.716301]  amdgpu_sync_wait+0x62/0xa0 [amdgpu]
[56550.730654]  amdgpu_amdkfd_gpuvm_restore_process_bos+0x59e/0x770 [amdgpu]
[56550.751668]  ? newidle_balance+0x298/0x490
[56550.763936]  restore_process_worker+0x42/0x270 [amdgpu]
[56550.780183]  process_one_work+0x21f/0x440
[56550.792193]  worker_thread+0x50/0x3f0
[56550.803165]  ? __pfx_worker_thread+0x10/0x10
[56550.815934]  kthread+0xee/0x120
[56550.825342]  ? __pfx_kthread+0x10/0x10
[56550.836548]  ret_from_fork+0x2c/0x50
[56550.847262]  
[ 1935.215502] Call Trace:
[ 1935.222827]  
[ 1935.229121]  __schedule+0x23d/0x5a0
[ 1935.239583]  schedule+0x4e/0xc0
[ 1935.248983]  schedule_timeout+0x103/0x140
[ 1935.261002]  __wait_for_common+0xae/0x150
[ 1935.273008]  ? usleep_range_state+0x90/0x90
[ 1935.285546]  wait_for_completion+0x24/0x30
[ 1935.297813]  __flush_work.isra.0+0x175/0x280
[ 1935.310611]  ? worker_detach_from_pool+0xc0/0xc0
[ 1935.324436]  flush_delayed_work+0x31/0x50
[ 1935.336455]  kfd_suspend_all_processes+0x96/0x150 [amdgpu]
[ 1935.353429]  kgd2kfd_suspend+0xb8/0xe0 [amdgpu]
[ 1935.367469]  kgd2kfd_pre_reset+0x81/0xf0 [amdgpu]
[ 1935.382036]  amdgpu_amdkfd_pre_reset+0x1a/0x30 [amdgpu]
[ 1935.398156]  amdgpu_device_gpu_recover.cold+0x210/0xcf2 [amdgpu]
[ 1935.416722]  amdgpu_job_timedout+0x19f/0x1e0 [amdgpu]
[ 1935.432367]  drm_sched_job_timedout+0x6f/0x120 [amd_sched]
[ 1935.448792]  process_one_work+0x22b/0x3d0
[ 1935.460806]  worker_thread+0x53/0x420
[ 1935.471777]  ? process_one_work+0x3d0/0x3d0
[ 1935.484307]  kthread+0x12a/0x150
[ 1935.493993]  ? set_kthread_struct+0x50/0x50
[ 1935.506513]  ret_from_fork+0x22/0x30


Looking at the time stamps, this seems to be a mash-up of two different 
logs. I think you're trying to show how a restore_processes worker is 
stuck on a fence, and that's causing kgd2kfd_pre_reset to hang when it 
tries to flush the work.


The fence it's hanging on is probably something related to a page table 
update that got caught up in the GPU hang. Adding a timeout here seems 
reasonable. There may be another problem, because 
amdgpu_amdkfd_gpuvm_restore_process_bos ignores the return value of 
amdgpu_sync_wait. We shouldi probably handle the timeout gracefully with 
a "goto validate_map_fail".


Regards,
  Felix




It is because the amdgpu_sync_wait is waiting for the bad job's fence, and
never return, so the recover couldn't continue.


Signed-off-by: Emily Deng 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 16 +---
  1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
index dcd8c066bc1f..c922867c5675 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
@@ -406,9 +406,19 @@ int amdgpu_sync_wait(struct amdgpu_sync *sync, bool intr)
int i, r;
  
  	hash_for_each_safe(sync->fences, i, tmp, e, node) {

-   r = dma_fence_wait(e->fence, intr);
-   if (r)
-   return r;
+   struct drm_sched_fence *s_fence = to_drm_sched_fence(e->fence);
+
+   if (s_fence) {
+   r = dma_fence_wait_timeout(e->fence, intr, 
s_fence->sched->timeout);
+   if (r == 0)
+   r = -ETIMEDOUT;
+   if (r < 0)
+   return r;
+   } else {
+   r = dma_fence_wait(e->fence, intr);
+   if (r)
+   return r;
+   }
  
  		amdgpu_sync_entry_free(e);

}