Re: [PATCH] Revert "drm/sched: Use parent fence instead of finished"

2023-03-31 Thread Rob Clark
On Fri, Dec 2, 2022 at 9:24 AM Arvind Yadav  wrote:
>
> This reverts commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86.
>
> This is causing instability on Linus' desktop, and system hangs were
> observed when running MesaGL benchmarks or VK CTS runs.
>
> netconsole got me the following oops:
> [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0088
> [ 1234.778782] #PF: supervisor read access in kernel mode
> [ 1234.778787] #PF: error_code(0x) - not-present page
> [ 1234.778791] PGD 0 P4D 0
> [ 1234.778798] Oops:  [#1] PREEMPT SMP NOPTI
> [ 1234.778803] CPU: 7 PID: 805 Comm: systemd-journal Not tainted 6.0.0+ #2
> [ 1234.778809] Hardware name: System manufacturer System Product
> Name/PRIME X370-PRO, BIOS 5603 07/28/2020
> [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
> [ 1234.778828] Code: aa 0f 1d ce e9 57 ff ff ff 48 89 d7 e8 9d 8f 3f
> ce e9 4a ff ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 55 53
> 48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d f0 00 00 00 48 8b 85 80 01 00
> 00 f0
> [ 1234.778834] RSP: :abe680380de0 EFLAGS: 00010087
> [ 1234.778839] RAX: c04e9230 RBX:  RCX: 0018
> [ 1234.778897] RDX: 0ba278e8977a RSI: 953fb288b460 RDI: 
> [ 1234.778901] RBP: 953fb288b598 R08: 00e0 R09: 953fbd98b808
> [ 1234.778905] R10:  R11: abe680380ff8 R12: abe680380e00
> [ 1234.778908] R13: 0001 R14:  R15: 953fbd9ec458
> [ 1234.778912] FS:  7f35e7008580() GS:95428ebc()
> knlGS:
> [ 1234.778916] CS:  0010 DS:  ES:  CR0: 80050033
> [ 1234.778919] CR2: 0088 CR3: 00010147c000 CR4: 003506e0
> [ 1234.778924] Call Trace:
> [ 1234.778981]  <IRQ>
> [ 1234.778989]  dma_fence_signal_timestamp_locked+0x6a/0xe0
> [ 1234.778999]  dma_fence_signal+0x2c/0x50
> [ 1234.779005]  amdgpu_fence_process+0xc8/0x140 [amdgpu]
> [ 1234.779234]  sdma_v3_0_process_trap_irq+0x70/0x80 [amdgpu]
> [ 1234.779395]  amdgpu_irq_dispatch+0xa9/0x1d0 [amdgpu]
> [ 1234.779609]  amdgpu_ih_process+0x80/0x100 [amdgpu]
> [ 1234.779783]  amdgpu_irq_handler+0x1f/0x60 [amdgpu]
> [ 1234.779940]  __handle_irq_event_percpu+0x46/0x190
> [ 1234.779946]  handle_irq_event+0x34/0x70
> [ 1234.779949]  handle_edge_irq+0x9f/0x240
> [ 1234.779954]  __common_interrupt+0x66/0x100
> [ 1234.779960]  common_interrupt+0xa0/0xc0
> [ 1234.779965]  </IRQ>
> [ 1234.779968]  <TASK>
> [ 1234.779971]  asm_common_interrupt+0x22/0x40
> [ 1234.779976] RIP: 0010:finish_mkwrite_fault+0x22/0x110
> [ 1234.779981] Code: 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 55 41
> 54 55 48 89 fd 53 48 8b 07 f6 40 50 08 0f 84 eb 00 00 00 48 8b 45 30
> 48 8b 18 <48> 89 df e8 66 bd ff ff 48 85 c0 74 0d 48 89 c2 83 e2 01 48
> 83 ea
> [ 1234.779985] RSP: :abe680bcfd78 EFLAGS: 0202
>
> Revert it for now and figure it out later.

Just fwiw, the issue here is a race between sched_main, which observes
that the hw fence is signaled and does job_cleanup, and the driver,
which is still retiring the job.  I don't think there is a sane way to
use the parent fence without having this race condition, so the
"figure it out later" is "don't do that" ;-)
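
To make the interleaving concrete, here is a minimal user-space sketch
of one ordering that is consistent with the description above and with
the call trace (drm_sched_job_done running from dma_fence_signal).  It
is not kernel code: every toy_* name is a hypothetical stand-in, and it
assumes the signaling path marks the hw fence as signaled before it has
finished working with the job.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the kernel objects; all names are hypothetical. */
struct toy_fence {
	bool signaled;
};

struct toy_job {
	struct toy_fence parent;   /* hw fence, signaled from the driver IRQ path */
	struct toy_fence finished; /* scheduler fence, signaled by the callback   */
	bool freed;                /* set once cleanup has "freed" the job        */
};

/* Cleanup as sched_main might do it: free the job once the chosen fence is signaled. */
void toy_cleanup(struct toy_job *job, bool check_parent)
{
	const struct toy_fence *f = check_parent ? &job->parent : &job->finished;

	if (!job->freed && f->signaled)
		job->freed = true;
}

/* Callback hanging off the parent fence. Returns false if the job was already freed. */
bool toy_job_done_cb(struct toy_job *job)
{
	if (job->freed)
		return false; /* in the kernel this would be a use-after-free */
	job->finished.signaled = true;
	return true;
}

/*
 * Replays one fixed interleaving:
 *   1. the IRQ path marks the parent fence signaled,
 *   2. sched_main runs cleanup on another CPU,
 *   3. the IRQ path runs the job-done callback,
 *   4. sched_main runs cleanup again.
 * Returns true if the callback ran on a live job.
 */
bool toy_replay(bool check_parent)
{
	struct toy_job job = { 0 };
	bool cb_ok;

	job.parent.signaled = true;      /* step 1 */
	toy_cleanup(&job, check_parent); /* step 2 */
	cb_ok = toy_job_done_cb(&job);   /* step 3 */
	toy_cleanup(&job, check_parent); /* step 4 */

	assert(job.freed); /* either way the job is eventually cleaned up */
	return cb_ok;
}
```

In this model toy_replay(true) (cleanup keyed on the parent fence)
frees the job at step 2, so the callback at step 3 finds it gone, while
toy_replay(false) (cleanup keyed on the finished fence) cannot free the
job until the callback has run.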

BR,
-R

> Signed-off-by: Arvind Yadav 
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 820c0c5544e1..ea7bfa99d6c9 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -790,7 +790,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>  	job = list_first_entry_or_null(&sched->pending_list,
>  				       struct drm_sched_job, list);
>
> -	if (job && dma_fence_is_signaled(job->s_fence->parent)) {
> +	if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
>  		/* remove job from pending_list */
>  		list_del_init(&job->list);
>
> @@ -802,7 +802,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>
>  		if (next) {
>  			next->s_fence->scheduled.timestamp =
> -				job->s_fence->parent->timestamp;
> +				job->s_fence->finished.timestamp;
>  			/* start TO timer for next job */
>  			drm_sched_start_timeout(sched);
>  		}
> --
> 2.25.1
>


Re: [PATCH] Revert "drm/sched: Use parent fence instead of finished"

2022-12-03 Thread Christian König

On 02.12.22 at 18:23, Arvind Yadav wrote:

This reverts commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86.

 This is causing instability on Linus' desktop, and system hangs were
 observed when running MesaGL benchmarks or VK CTS runs.

 netconsole got me the following oops:
 [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0088
 [ 1234.778782] #PF: supervisor read access in kernel mode
 [ 1234.778787] #PF: error_code(0x) - not-present page
 [ 1234.778791] PGD 0 P4D 0
 [ 1234.778798] Oops:  [#1] PREEMPT SMP NOPTI
 [ 1234.778803] CPU: 7 PID: 805 Comm: systemd-journal Not tainted 6.0.0+ #2
 [ 1234.778809] Hardware name: System manufacturer System Product
 Name/PRIME X370-PRO, BIOS 5603 07/28/2020
 [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
 [ 1234.778828] Code: aa 0f 1d ce e9 57 ff ff ff 48 89 d7 e8 9d 8f 3f
 ce e9 4a ff ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 55 53
 48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d f0 00 00 00 48 8b 85 80 01 00
 00 f0
 [ 1234.778834] RSP: :abe680380de0 EFLAGS: 00010087
 [ 1234.778839] RAX: c04e9230 RBX:  RCX: 0018
 [ 1234.778897] RDX: 0ba278e8977a RSI: 953fb288b460 RDI: 
 [ 1234.778901] RBP: 953fb288b598 R08: 00e0 R09: 953fbd98b808
 [ 1234.778905] R10:  R11: abe680380ff8 R12: abe680380e00
 [ 1234.778908] R13: 0001 R14:  R15: 953fbd9ec458
 [ 1234.778912] FS:  7f35e7008580() GS:95428ebc()
 knlGS:
 [ 1234.778916] CS:  0010 DS:  ES:  CR0: 80050033
 [ 1234.778919] CR2: 0088 CR3: 00010147c000 CR4: 003506e0
 [ 1234.778924] Call Trace:
 [ 1234.778981]  <IRQ>
 [ 1234.778989]  dma_fence_signal_timestamp_locked+0x6a/0xe0
 [ 1234.778999]  dma_fence_signal+0x2c/0x50
 [ 1234.779005]  amdgpu_fence_process+0xc8/0x140 [amdgpu]
 [ 1234.779234]  sdma_v3_0_process_trap_irq+0x70/0x80 [amdgpu]
 [ 1234.779395]  amdgpu_irq_dispatch+0xa9/0x1d0 [amdgpu]
 [ 1234.779609]  amdgpu_ih_process+0x80/0x100 [amdgpu]
 [ 1234.779783]  amdgpu_irq_handler+0x1f/0x60 [amdgpu]
 [ 1234.779940]  __handle_irq_event_percpu+0x46/0x190
 [ 1234.779946]  handle_irq_event+0x34/0x70
 [ 1234.779949]  handle_edge_irq+0x9f/0x240
 [ 1234.779954]  __common_interrupt+0x66/0x100
 [ 1234.779960]  common_interrupt+0xa0/0xc0
 [ 1234.779965]  </IRQ>
 [ 1234.779968]  <TASK>
 [ 1234.779971]  asm_common_interrupt+0x22/0x40
 [ 1234.779976] RIP: 0010:finish_mkwrite_fault+0x22/0x110
 [ 1234.779981] Code: 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 55 41
 54 55 48 89 fd 53 48 8b 07 f6 40 50 08 0f 84 eb 00 00 00 48 8b 45 30
 48 8b 18 <48> 89 df e8 66 bd ff ff 48 85 c0 74 0d 48 89 c2 83 e2 01 48
 83 ea
 [ 1234.779985] RSP: :abe680bcfd78 EFLAGS: 0202

 Revert it for now and figure it out later.

Signed-off-by: Arvind Yadav 


Reviewed-by: Christian König 


---
  drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 820c0c5544e1..ea7bfa99d6c9 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -790,7 +790,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
 	job = list_first_entry_or_null(&sched->pending_list,
 				       struct drm_sched_job, list);
 
-	if (job && dma_fence_is_signaled(job->s_fence->parent)) {
+	if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
 		/* remove job from pending_list */
 		list_del_init(&job->list);
 
@@ -802,7 +802,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
 
 		if (next) {
 			next->s_fence->scheduled.timestamp =
-				job->s_fence->parent->timestamp;
+				job->s_fence->finished.timestamp;
 			/* start TO timer for next job */
 			drm_sched_start_timeout(sched);
 		}




[PATCH] Revert "drm/sched: Use parent fence instead of finished"

2022-12-02 Thread Arvind Yadav
This reverts commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86.

This is causing instability on Linus' desktop, and system hangs were
observed when running MesaGL benchmarks or VK CTS runs.

netconsole got me the following oops:
[ 1234.778760] BUG: kernel NULL pointer dereference, address: 0088
[ 1234.778782] #PF: supervisor read access in kernel mode
[ 1234.778787] #PF: error_code(0x) - not-present page
[ 1234.778791] PGD 0 P4D 0
[ 1234.778798] Oops:  [#1] PREEMPT SMP NOPTI
[ 1234.778803] CPU: 7 PID: 805 Comm: systemd-journal Not tainted 6.0.0+ #2
[ 1234.778809] Hardware name: System manufacturer System Product
Name/PRIME X370-PRO, BIOS 5603 07/28/2020
[ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
[ 1234.778828] Code: aa 0f 1d ce e9 57 ff ff ff 48 89 d7 e8 9d 8f 3f
ce e9 4a ff ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 55 53
48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d f0 00 00 00 48 8b 85 80 01 00
00 f0
[ 1234.778834] RSP: :abe680380de0 EFLAGS: 00010087
[ 1234.778839] RAX: c04e9230 RBX:  RCX: 0018
[ 1234.778897] RDX: 0ba278e8977a RSI: 953fb288b460 RDI: 
[ 1234.778901] RBP: 953fb288b598 R08: 00e0 R09: 953fbd98b808
[ 1234.778905] R10:  R11: abe680380ff8 R12: abe680380e00
[ 1234.778908] R13: 0001 R14:  R15: 953fbd9ec458
[ 1234.778912] FS:  7f35e7008580() GS:95428ebc()
knlGS:
[ 1234.778916] CS:  0010 DS:  ES:  CR0: 80050033
[ 1234.778919] CR2: 0088 CR3: 00010147c000 CR4: 003506e0
[ 1234.778924] Call Trace:
[ 1234.778981]  <IRQ>
[ 1234.778989]  dma_fence_signal_timestamp_locked+0x6a/0xe0
[ 1234.778999]  dma_fence_signal+0x2c/0x50
[ 1234.779005]  amdgpu_fence_process+0xc8/0x140 [amdgpu]
[ 1234.779234]  sdma_v3_0_process_trap_irq+0x70/0x80 [amdgpu]
[ 1234.779395]  amdgpu_irq_dispatch+0xa9/0x1d0 [amdgpu]
[ 1234.779609]  amdgpu_ih_process+0x80/0x100 [amdgpu]
[ 1234.779783]  amdgpu_irq_handler+0x1f/0x60 [amdgpu]
[ 1234.779940]  __handle_irq_event_percpu+0x46/0x190
[ 1234.779946]  handle_irq_event+0x34/0x70
[ 1234.779949]  handle_edge_irq+0x9f/0x240
[ 1234.779954]  __common_interrupt+0x66/0x100
[ 1234.779960]  common_interrupt+0xa0/0xc0
[ 1234.779965]  </IRQ>
[ 1234.779968]  <TASK>
[ 1234.779971]  asm_common_interrupt+0x22/0x40
[ 1234.779976] RIP: 0010:finish_mkwrite_fault+0x22/0x110
[ 1234.779981] Code: 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 55 41
54 55 48 89 fd 53 48 8b 07 f6 40 50 08 0f 84 eb 00 00 00 48 8b 45 30
48 8b 18 <48> 89 df e8 66 bd ff ff 48 85 c0 74 0d 48 89 c2 83 e2 01 48
83 ea
[ 1234.779985] RSP: :abe680bcfd78 EFLAGS: 0202

Revert it for now and figure it out later.

Signed-off-by: Arvind Yadav 
---
 drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 820c0c5544e1..ea7bfa99d6c9 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -790,7 +790,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
 	job = list_first_entry_or_null(&sched->pending_list,
 				       struct drm_sched_job, list);
 
-	if (job && dma_fence_is_signaled(job->s_fence->parent)) {
+	if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
 		/* remove job from pending_list */
 		list_del_init(&job->list);
 
@@ -802,7 +802,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
 
 		if (next) {
 			next->s_fence->scheduled.timestamp =
-				job->s_fence->parent->timestamp;
+				job->s_fence->finished.timestamp;
 			/* start TO timer for next job */
 			drm_sched_start_timeout(sched);
 		}
-- 
2.25.1



[PATCH] Revert "drm/sched: Use parent fence instead of finished"

2022-10-09 Thread hongchengwen
From: Dave Airlie 

This reverts commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86.

This is causing instability on Linus' desktop, and I'm seeing
oops with VK CTS runs.

netconsole got me the following oops:
[ 1234.778760] BUG: kernel NULL pointer dereference, address: 0088
[ 1234.778782] #PF: supervisor read access in kernel mode
[ 1234.778787] #PF: error_code(0x) - not-present page
[ 1234.778791] PGD 0 P4D 0
[ 1234.778798] Oops:  [#1] PREEMPT SMP NOPTI
[ 1234.778803] CPU: 7 PID: 805 Comm: systemd-journal Not tainted 6.0.0+ #2
[ 1234.778809] Hardware name: System manufacturer System Product
Name/PRIME X370-PRO, BIOS 5603 07/28/2020
[ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
[ 1234.778828] Code: aa 0f 1d ce e9 57 ff ff ff 48 89 d7 e8 9d 8f 3f
ce e9 4a ff ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 55 53
48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d f0 00 00 00 48 8b 85 80 01 00
00 f0
[ 1234.778834] RSP: :abe680380de0 EFLAGS: 00010087
[ 1234.778839] RAX: c04e9230 RBX:  RCX: 0018
[ 1234.778897] RDX: 0ba278e8977a RSI: 953fb288b460 RDI: 
[ 1234.778901] RBP: 953fb288b598 R08: 00e0 R09: 953fbd98b808
[ 1234.778905] R10:  R11: abe680380ff8 R12: abe680380e00
[ 1234.778908] R13: 0001 R14:  R15: 953fbd9ec458
[ 1234.778912] FS:  7f35e7008580() GS:95428ebc()
knlGS:
[ 1234.778916] CS:  0010 DS:  ES:  CR0: 80050033
[ 1234.778919] CR2: 0088 CR3: 00010147c000 CR4: 003506e0
[ 1234.778924] Call Trace:
[ 1234.778981]  <IRQ>
[ 1234.778989]  dma_fence_signal_timestamp_locked+0x6a/0xe0
[ 1234.778999]  dma_fence_signal+0x2c/0x50
[ 1234.779005]  amdgpu_fence_process+0xc8/0x140 [amdgpu]
[ 1234.779234]  sdma_v3_0_process_trap_irq+0x70/0x80 [amdgpu]
[ 1234.779395]  amdgpu_irq_dispatch+0xa9/0x1d0 [amdgpu]
[ 1234.779609]  amdgpu_ih_process+0x80/0x100 [amdgpu]
[ 1234.779783]  amdgpu_irq_handler+0x1f/0x60 [amdgpu]
[ 1234.779940]  __handle_irq_event_percpu+0x46/0x190
[ 1234.779946]  handle_irq_event+0x34/0x70
[ 1234.779949]  handle_edge_irq+0x9f/0x240
[ 1234.779954]  __common_interrupt+0x66/0x100
[ 1234.779960]  common_interrupt+0xa0/0xc0
[ 1234.779965]  </IRQ>
[ 1234.779968]  <TASK>
[ 1234.779971]  asm_common_interrupt+0x22/0x40
[ 1234.779976] RIP: 0010:finish_mkwrite_fault+0x22/0x110
[ 1234.779981] Code: 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 55 41
54 55 48 89 fd 53 48 8b 07 f6 40 50 08 0f 84 eb 00 00 00 48 8b 45 30
48 8b 18 <48> 89 df e8 66 bd ff ff 48 85 c0 74 0d 48 89 c2 83 e2 01 48
83 ea
[ 1234.779985] RSP: :abe680bcfd78 EFLAGS: 0202

Revert it for now and figure it out later.

Signed-off-by: Dave Airlie 
---
 drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 4f2395d..e5a4ecd 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -829,7 +829,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
 	job = list_first_entry_or_null(&sched->pending_list,
 				       struct drm_sched_job, list);
 
-	if (job && dma_fence_is_signaled(job->s_fence->parent)) {
+	if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
 		/* remove job from pending_list */
 		list_del_init(&job->list);
 
@@ -841,7 +841,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
 
 		if (next) {
 			next->s_fence->scheduled.timestamp =
-				job->s_fence->parent->timestamp;
+				job->s_fence->finished.timestamp;
 			/* start TO timer for next job */
 			drm_sched_start_timeout(sched);
 		}
--
2.7.4

