RE: [PATCH] drm/amdgpu: Add judgement to avoid infinite loop

2022-01-29 Thread Chai, Thomas
I have a solution for this defect and am debugging the changes.

-Original Message-
From: Zhou1, Tao  
Sent: Saturday, January 29, 2022 3:54 PM
To: Chai, Thomas ; amd-gfx@lists.freedesktop.org
Cc: Zhang, Hawking ; Clements, John 

Subject: RE: [PATCH] drm/amdgpu: Add judgement to avoid infinite loop


RE: [PATCH] drm/amdgpu: Add judgement to avoid infinite loop

2022-01-28 Thread Zhou1, Tao
[AMD Official Use Only]

As a quick workaround, I agree with the solution. But regarding the root cause, 
the list is still messed up.
Can we make ras_list a global variable shared across all cards, and add a 
list-empty check (or a flag indicating the registration status of the ras block) 
before the list add to avoid redundant registration?

Regards,
Tao
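
One possible reading of the suggestion above, as a minimal user-space sketch
rather than driver code. It assumes a single list shared by all cards and
checks whether a block's node is already linked before adding it. The names
ras_block, global_ras_list and ras_register_once are hypothetical and do not
come from amdgpu; the list helpers are simplified stand-ins for the kernel's
list.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the kernel's struct list_head. */
struct list_head { struct list_head *next, *prev; };

static void init_list_head(struct list_head *h) { h->next = h; h->prev = h; }

/* A node initialized with init_list_head() points to itself until it is added. */
static bool list_empty(const struct list_head *h) { return h->next == h; }

/* Same pointer updates as the kernel's list_add_tail(). */
static void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Hypothetical ras block object; only the list node matters here. */
struct ras_block { struct list_head node; };

/* Assumption: one list shared by all cards instead of one per device. */
static struct list_head global_ras_list;

/* Add the block only if its node is not linked yet.
 * Returns true when the block was newly registered. */
static bool ras_register_once(struct ras_block *blk)
{
	if (!list_empty(&blk->node))
		return false;	/* already registered by an earlier card */
	list_add_tail(&blk->node, &global_ras_list);
	return true;
}

/* Count the entries on a well-formed list. */
static int list_count(const struct list_head *head)
{
	int n = 0;
	const struct list_head *pos;

	for (pos = head->next; pos != head; pos = pos->next)
		n++;
	return n;
}

/* Two cards initializing in turn try to register the same block. */
static int demo_register_twice(void)
{
	static struct ras_block gfx;

	init_list_head(&global_ras_list);
	init_list_head(&gfx.node);
	ras_register_once(&gfx);	/* first card: added */
	ras_register_once(&gfx);	/* second card: skipped */
	return list_count(&global_ras_list);
}
```

Calling demo_register_twice() leaves one entry on the shared list, because
the second registration is rejected. The check relies on every node being
initialized before its first registration.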

> -Original Message-
> From: Chai, Thomas 
> Sent: Saturday, January 29, 2022 11:53 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Chai, Thomas ; Zhang, Hawking
> ; Zhou1, Tao ; Clements,
> John ; Chai, Thomas 
> Subject: [PATCH] drm/amdgpu: Add judgement to avoid infinite loop
> 


RE: [PATCH] drm/amdgpu: Add judgement to avoid infinite loop

2022-01-28 Thread Chai, Thomas
OK

-Original Message-
From: Chen, Guchun  
Sent: Saturday, January 29, 2022 12:02 PM
To: Chai, Thomas ; amd-gfx@lists.freedesktop.org
Cc: Zhou1, Tao ; Zhang, Hawking ; 
Clements, John ; Chai, Thomas ; 
Chai, Thomas 
Subject: RE: [PATCH] drm/amdgpu: Add judgement to avoid infinite loop



RE: [PATCH] drm/amdgpu: Add judgement to avoid infinite loop

2022-01-28 Thread Clements, John
[AMD Official Use Only]

Reviewed-by: John Clements 

-Original Message-
From: Chai, Thomas  
Sent: Saturday, January 29, 2022 11:53 AM
To: amd-gfx@lists.freedesktop.org
Cc: Chai, Thomas ; Zhang, Hawking ; 
Zhou1, Tao ; Clements, John ; Chai, 
Thomas 
Subject: [PATCH] drm/amdgpu: Add judgement to avoid infinite loop



RE: [PATCH] drm/amdgpu: Add judgement to avoid infinite loop

2022-01-28 Thread Chen, Guchun
[Public]

Please add a Fixes tag, as this should fix a regression from a former patch.

Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of yipechai
Sent: Saturday, January 29, 2022 11:53 AM
To: amd-gfx@lists.freedesktop.org
Cc: Zhou1, Tao ; Zhang, Hawking ; 
Clements, John ; Chai, Thomas ; 
Chai, Thomas 
Subject: [PATCH] drm/amdgpu: Add judgement to avoid infinite loop



[PATCH] drm/amdgpu: Add judgement to avoid infinite loop

2022-01-28 Thread yipechai
1. The infinite loop causing a soft lockup occurs on systems with multiple
   amdgpu cards that support the ras feature.
2. This is a workaround patch. It is valid for multiple amdgpu cards of the
   same type.
3. The root cause is that each GPU card device has a separate .ras_list
   list head, but the instance and linked list node of each ras block
   are unique. When each device is initialized, each ras instance adds
   its list node to that device's list again. As a result, only
   the .ras_list of the last initialized device is completely correct.
   The .ras_list->prev and .ras_list->next of a device initialized
   earlier still point to the correct ras instances, but the prev
   pointer and next pointer of those ras instances point to
   the last initialized device's .ras_list instead of the earlier
   device's .ras_list. When list_for_each_entry_safe searches for a
   non-existent ras node on a device other than the last device, the
   next pointer of the last ras instance can never equal the
   .ras_list the walk started from, so the loop cannot terminate and
   the program enters an infinite loop.
   BTW: Since the data and initialization process of each card are the same,
   the linked list between ras instances is not destroyed each time
   a device is initialized.
4. The soft lockup logs are as follows:
[  262.165690] CPU: 93 PID: 758 Comm: kworker/93:1 Tainted: G   OE 5.13.0-27-generic #29~20.04.1-Ubuntu
[  262.165695] Hardware name: Supermicro AS -4124GS-TNR/H12DSG-O-CPU, BIOS T20200717143848 07/17/2020
[  262.165698] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
[  262.165980] RIP: 0010:amdgpu_ras_get_ras_block+0x86/0xd0 [amdgpu]
[  262.166239] Code: 68 d8 4c 8d 71 d8 48 39 c3 74 54 49 8b 45 38 48 85 c0 74 32 44 89 fa 44 89 e6 4c 89 ef e8 82 e4 9b dc 85 c0 74 3c 49 8b 46 28 <49> 8d 56 28 4d 89 f5 48 83 e8 28 48 39 d3 74 25 49 89 c6 49 8b 45
[  262.166243] RSP: 0018:ac908fa87d80 EFLAGS: 0202
[  262.166247] RAX: c1394248 RBX: 91e4ab8d6e20 RCX: c1394248
[  262.166249] RDX: 91e4aa356e20 RSI: 000e RDI: 91e4ab8c
[  262.166252] RBP: ac908fa87da8 R08: 0007 R09: 0001
[  262.166254] R10: 91e4930b64ec R11:  R12: 000e
[  262.166256] R13: 91e4aa356df8 R14: c1394320 R15: 0003
[  262.166258] FS:  () GS:92238fb4() knlGS:
[  262.166261] CS:  0010 DS:  ES:  CR0: 80050033
[  262.166264] CR2: 0001004865d0 CR3: 00406d796000 CR4: 00350ee0
[  262.166267] Call Trace:
[  262.166272]  amdgpu_ras_do_recovery+0x130/0x290 [amdgpu]
[  262.166529]  ? psi_task_switch+0xd2/0x250
[  262.166537]  ? __switch_to+0x11d/0x460
[  262.166542]  ? __switch_to_asm+0x36/0x70
[  262.166549]  process_one_work+0x220/0x3c0
[  262.166556]  worker_thread+0x4d/0x3f0
[  262.166560]  ? process_one_work+0x3c0/0x3c0
[  262.166563]  kthread+0x12b/0x150
[  262.166568]  ? set_kthread_struct+0x40/0x40
[  262.166571]  ret_from_fork+0x22/0x30

Signed-off-by: yipechai 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index d4e07d0acb66..3d533ef0783d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -884,6 +884,7 @@ static int amdgpu_ras_block_match_default(struct amdgpu_ras_block_object *block_
 static struct amdgpu_ras_block_object *amdgpu_ras_get_ras_block(struct amdgpu_device *adev,
 					enum amdgpu_ras_block block, uint32_t sub_block_index)
 {
+	int loop_cnt = 0;
 	struct amdgpu_ras_block_object *obj, *tmp;
 
 	if (block >= AMDGPU_RAS_BLOCK__LAST)
@@ -900,6 +901,9 @@ static struct amdgpu_ras_block_object *amdgpu_ras_get_ras_block(struct amdgpu_de
 			if (amdgpu_ras_block_match_default(obj, block) == 0)
 				return obj;
 		}
+
+		if (++loop_cnt >= AMDGPU_RAS_BLOCK__LAST)
+			break;
 	}
 
 	return NULL;
-- 
2.25.1
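
The root cause described in point 3 can be reproduced outside the kernel.
The following is a minimal user-space sketch, not driver code: the list
helpers are simplified stand-ins for the kernel's list.h, NBLOCKS is a
hypothetical stand-in for AMDGPU_RAS_BLOCK__LAST, and bounded_walk only
mirrors the idea of the loop_cnt guard rather than the exact condition in
the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct list_head. */
struct list_head { struct list_head *next, *prev; };

static void init_list_head(struct list_head *h) { h->next = h; h->prev = h; }

/* Same pointer updates as the kernel's list_add_tail(). */
static void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

#define NBLOCKS 3	/* hypothetical stand-in for AMDGPU_RAS_BLOCK__LAST */

/* Walk forward from head. Returns the number of nodes visited, or -1 if
 * the walk has not come back to head after more than limit nodes. */
static int bounded_walk(const struct list_head *head, int limit)
{
	const struct list_head *pos = head->next;
	int n = 0;

	while (pos != head) {
		if (++n > limit)
			return -1;	/* would loop forever without the bound */
		pos = pos->next;
	}
	return n;
}

/* Add the same shared nodes to two per-device list heads in turn, as
 * happens when two cards initialize with one set of ras block objects. */
static void demo_shared_nodes(int *first_dev, int *last_dev)
{
	struct list_head dev0, dev1, blocks[NBLOCKS];
	int i;

	init_list_head(&dev0);
	for (i = 0; i < NBLOCKS; i++)
		list_add_tail(&blocks[i], &dev0);

	/* Second device re-adds the very same nodes to its own head. */
	init_list_head(&dev1);
	for (i = 0; i < NBLOCKS; i++)
		list_add_tail(&blocks[i], &dev1);

	/* dev0.next still points at blocks[0], but the last block's next
	 * pointer now leads to dev1, so a walk from dev0 never ends. */
	*first_dev = bounded_walk(&dev0, NBLOCKS);
	*last_dev = bounded_walk(&dev1, NBLOCKS);
}
```

After demo_shared_nodes(&a, &b), a is -1 because the walk from the first
device's head never returns to it, while b is NBLOCKS because the last
initialized device's list is intact.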