Re: [PATCH] Revert "drm/sched: Use parent fence instead of finished"
On Fri, Dec 2, 2022 at 9:24 AM Arvind Yadav wrote:
>
> This reverts commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86.
>
> This is causing instability on Linus' desktop, and Observed System
> hung when running MesaGL benchmark or VK CTS runs.
>
> netconsole got me the following oops:
> [ 1234.778760] BUG: kernel NULL pointer dereference, address:
> 0088
> [ 1234.778782] #PF: supervisor read access in kernel mode
> [ 1234.778787] #PF: error_code(0x) - not-present page
> [ 1234.778791] PGD 0 P4D 0
> [ 1234.778798] Oops: [#1] PREEMPT SMP NOPTI
> [ 1234.778803] CPU: 7 PID: 805 Comm: systemd-journal Not tainted 6.0.0+ #2
> [ 1234.778809] Hardware name: System manufacturer System Product
> Name/PRIME X370-PRO, BIOS 5603 07/28/2020
> [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
> [ 1234.778828] Code: aa 0f 1d ce e9 57 ff ff ff 48 89 d7 e8 9d 8f 3f
> ce e9 4a ff ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 55 53
> 48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d f0 00 00 00 48 8b 85 80 01 00
> 00 f0
> [ 1234.778834] RSP: :abe680380de0 EFLAGS: 00010087
> [ 1234.778839] RAX: c04e9230 RBX: RCX:
> 0018
> [ 1234.778897] RDX: 0ba278e8977a RSI: 953fb288b460 RDI:
>
> [ 1234.778901] RBP: 953fb288b598 R08: 00e0 R09:
> 953fbd98b808
> [ 1234.778905] R10: R11: abe680380ff8 R12:
> abe680380e00
> [ 1234.778908] R13: 0001 R14: R15:
> 953fbd9ec458
> [ 1234.778912] FS: 7f35e7008580() GS:95428ebc()
> knlGS:
> [ 1234.778916] CS: 0010 DS: ES: CR0: 80050033
> [ 1234.778919] CR2: 0088 CR3: 00010147c000 CR4:
> 003506e0
> [ 1234.778924] Call Trace:
> [ 1234.778981]
> [ 1234.778989] dma_fence_signal_timestamp_locked+0x6a/0xe0
> [ 1234.778999] dma_fence_signal+0x2c/0x50
> [ 1234.779005] amdgpu_fence_process+0xc8/0x140 [amdgpu]
> [ 1234.779234] sdma_v3_0_process_trap_irq+0x70/0x80 [amdgpu]
> [ 1234.779395] amdgpu_irq_dispatch+0xa9/0x1d0 [amdgpu]
> [ 1234.779609] amdgpu_ih_process+0x80/0x100 [amdgpu]
> [ 1234.779783] amdgpu_irq_handler+0x1f/0x60 [amdgpu]
> [ 1234.779940] __handle_irq_event_percpu+0x46/0x190
> [ 1234.779946] handle_irq_event+0x34/0x70
> [ 1234.779949] handle_edge_irq+0x9f/0x240
> [ 1234.779954] __common_interrupt+0x66/0x100
> [ 1234.779960] common_interrupt+0xa0/0xc0
> [ 1234.779965]
> [ 1234.779968]
> [ 1234.779971] asm_common_interrupt+0x22/0x40
> [ 1234.779976] RIP: 0010:finish_mkwrite_fault+0x22/0x110
> [ 1234.779981] Code: 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 55 41
> 54 55 48 89 fd 53 48 8b 07 f6 40 50 08 0f 84 eb 00 00 00 48 8b 45 30
> 48 8b 18 <48> 89 df e8 66 bd ff ff 48 85 c0 74 0d 48 89 c2 83 e2 01 48
> 83 ea
> [ 1234.779985] RSP: :abe680bcfd78 EFLAGS: 0202
>
> Revert it for now and figure it out later.

Just fwiw, the issue here is a race against sched_main observing that
the hw fence is signaled and doing job_cleanup and the driver retiring
the job. I don't think there is a sane way to use the parent fence
without having this race condition so the "figure it out later" is
"don't do that" ;-)

BR,
-R

> Signed-off-by: Arvind Yadav
> ---
> drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 820c0c5544e1..ea7bfa99d6c9 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -790,7 +790,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
> job = list_first_entry_or_null(&sched->pending_list,
>struct drm_sched_job, list);
>
> - if (job && dma_fence_is_signaled(job->s_fence->parent)) {
> + if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
> /* remove job from pending_list */
> list_del_init(&job->list);
>
> @@ -802,7 +802,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>
> if (next) {
> next->s_fence->scheduled.timestamp =
> - job->s_fence->parent->timestamp;
> + job->s_fence->finished.timestamp;
> /* start TO timer for next job */
> drm_sched_start_timeout(sched);
> }
> --
> 2.25.1
>
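For anyone following along, here is a rough userspace toy model of the race described in the reply above. It is only a sketch of one plausible interleaving that fits the description and the call trace, not the actual scheduler code: every name here (toy_fence, toy_job, toy_hw_signal, toy_job_done_cb, toy_try_cleanup, toy_race_hits_freed_job) is made up for illustration, and the "freed" flag stands in for the real job teardown. The assumption is that the hw fence becomes visibly signaled before its callback has run, so a cleanup path that tests the parent fence can release the job inside that window.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel objects; these are NOT the real structs. */
struct toy_fence {
	bool signaled;
};

struct toy_job {
	struct toy_fence parent;   /* hw fence, signaled from the IRQ path */
	struct toy_fence finished; /* scheduler fence, signaled by the callback */
	bool freed;                /* set once the cleanup path released the job */
	bool touched_after_free;   /* set if the callback ran on a released job */
};

/* IRQ path, step 1: the hw fence becomes visibly signaled. */
static void toy_hw_signal(struct toy_job *job)
{
	assert(job != NULL);
	job->parent.signaled = true;
}

/* IRQ path, step 2: the fence callback marks the job as finished. */
static void toy_job_done_cb(struct toy_job *job)
{
	assert(job != NULL);
	if (job->freed) {
		/* In the real kernel this would be a use-after-free. */
		job->touched_after_free = true;
		return;
	}
	job->finished.signaled = true;
}

/* Cleanup path: release the job once the chosen fence reports signaled. */
static bool toy_try_cleanup(struct toy_job *job, bool check_parent)
{
	const struct toy_fence *f;

	assert(job != NULL);
	f = check_parent ? &job->parent : &job->finished;
	if (!f->signaled || job->freed)
		return false;
	job->freed = true;
	return true;
}

/*
 * Replay one interleaving: cleanup gets to run between step 1 and step 2
 * of the IRQ path. Returns true if the callback touched a released job.
 */
static bool toy_race_hits_freed_job(bool check_parent)
{
	struct toy_job job = {0};

	toy_hw_signal(&job);
	toy_try_cleanup(&job, check_parent); /* scheduler thread runs in the window */
	toy_job_done_cb(&job);
	toy_try_cleanup(&job, check_parent); /* normal cleanup after the callback */
	return job.touched_after_free;
}
```

In this model, calling toy_race_hits_freed_job(true) (cleanup keyed on the parent fence) reports that the callback ran on a released job, while toy_race_hits_freed_job(false) (cleanup keyed on the finished fence) does not, because the finished fence can only be signaled by the callback itself.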
Re: [PATCH] Revert "drm/sched: Use parent fence instead of finished"
Am 02.12.22 um 18:23 schrieb Arvind Yadav: This reverts commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86. This is causing instability on Linus' desktop, and Observed System hung when running MesaGL benchmark or VK CTS runs. netconsole got me the following oops: [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0088 [ 1234.778782] #PF: supervisor read access in kernel mode [ 1234.778787] #PF: error_code(0x) - not-present page [ 1234.778791] PGD 0 P4D 0 [ 1234.778798] Oops: [#1] PREEMPT SMP NOPTI [ 1234.778803] CPU: 7 PID: 805 Comm: systemd-journal Not tainted 6.0.0+ #2 [ 1234.778809] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 5603 07/28/2020 [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched] [ 1234.778828] Code: aa 0f 1d ce e9 57 ff ff ff 48 89 d7 e8 9d 8f 3f ce e9 4a ff ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 55 53 48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d f0 00 00 00 48 8b 85 80 01 00 00 f0 [ 1234.778834] RSP: :abe680380de0 EFLAGS: 00010087 [ 1234.778839] RAX: c04e9230 RBX: RCX: 0018 [ 1234.778897] RDX: 0ba278e8977a RSI: 953fb288b460 RDI: [ 1234.778901] RBP: 953fb288b598 R08: 00e0 R09: 953fbd98b808 [ 1234.778905] R10: R11: abe680380ff8 R12: abe680380e00 [ 1234.778908] R13: 0001 R14: R15: 953fbd9ec458 [ 1234.778912] FS: 7f35e7008580() GS:95428ebc() knlGS: [ 1234.778916] CS: 0010 DS: ES: CR0: 80050033 [ 1234.778919] CR2: 0088 CR3: 00010147c000 CR4: 003506e0 [ 1234.778924] Call Trace: [ 1234.778981] [ 1234.778989] dma_fence_signal_timestamp_locked+0x6a/0xe0 [ 1234.778999] dma_fence_signal+0x2c/0x50 [ 1234.779005] amdgpu_fence_process+0xc8/0x140 [amdgpu] [ 1234.779234] sdma_v3_0_process_trap_irq+0x70/0x80 [amdgpu] [ 1234.779395] amdgpu_irq_dispatch+0xa9/0x1d0 [amdgpu] [ 1234.779609] amdgpu_ih_process+0x80/0x100 [amdgpu] [ 1234.779783] amdgpu_irq_handler+0x1f/0x60 [amdgpu] [ 1234.779940] __handle_irq_event_percpu+0x46/0x190 [ 1234.779946] handle_irq_event+0x34/0x70 [ 1234.779949] 
handle_edge_irq+0x9f/0x240 [ 1234.779954] __common_interrupt+0x66/0x100 [ 1234.779960] common_interrupt+0xa0/0xc0 [ 1234.779965] [ 1234.779968] [ 1234.779971] asm_common_interrupt+0x22/0x40 [ 1234.779976] RIP: 0010:finish_mkwrite_fault+0x22/0x110 [ 1234.779981] Code: 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 55 41 54 55 48 89 fd 53 48 8b 07 f6 40 50 08 0f 84 eb 00 00 00 48 8b 45 30 48 8b 18 <48> 89 df e8 66 bd ff ff 48 85 c0 74 0d 48 89 c2 83 e2 01 48 83 ea [ 1234.779985] RSP: :abe680bcfd78 EFLAGS: 0202 Revert it for now and figure it out later. Signed-off-by: Arvind Yadav Reviewed-by: Christian König --- drivers/gpu/drm/scheduler/sched_main.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index 820c0c5544e1..ea7bfa99d6c9 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -790,7 +790,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched) job = list_first_entry_or_null(&sched->pending_list, struct drm_sched_job, list); - if (job && dma_fence_is_signaled(job->s_fence->parent)) { + if (job && dma_fence_is_signaled(&job->s_fence->finished)) { /* remove job from pending_list */ list_del_init(&job->list); @@ -802,7 +802,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched) if (next) { next->s_fence->scheduled.timestamp = - job->s_fence->parent->timestamp; + job->s_fence->finished.timestamp; /* start TO timer for next job */ drm_sched_start_timeout(sched); }
[PATCH] Revert "drm/sched: Use parent fence instead of finished"
This reverts commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86. This is causing instability on Linus' desktop, and Observed System hung when running MesaGL benchmark or VK CTS runs. netconsole got me the following oops: [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0088 [ 1234.778782] #PF: supervisor read access in kernel mode [ 1234.778787] #PF: error_code(0x) - not-present page [ 1234.778791] PGD 0 P4D 0 [ 1234.778798] Oops: [#1] PREEMPT SMP NOPTI [ 1234.778803] CPU: 7 PID: 805 Comm: systemd-journal Not tainted 6.0.0+ #2 [ 1234.778809] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 5603 07/28/2020 [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched] [ 1234.778828] Code: aa 0f 1d ce e9 57 ff ff ff 48 89 d7 e8 9d 8f 3f ce e9 4a ff ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 55 53 48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d f0 00 00 00 48 8b 85 80 01 00 00 f0 [ 1234.778834] RSP: :abe680380de0 EFLAGS: 00010087 [ 1234.778839] RAX: c04e9230 RBX: RCX: 0018 [ 1234.778897] RDX: 0ba278e8977a RSI: 953fb288b460 RDI: [ 1234.778901] RBP: 953fb288b598 R08: 00e0 R09: 953fbd98b808 [ 1234.778905] R10: R11: abe680380ff8 R12: abe680380e00 [ 1234.778908] R13: 0001 R14: R15: 953fbd9ec458 [ 1234.778912] FS: 7f35e7008580() GS:95428ebc() knlGS: [ 1234.778916] CS: 0010 DS: ES: CR0: 80050033 [ 1234.778919] CR2: 0088 CR3: 00010147c000 CR4: 003506e0 [ 1234.778924] Call Trace: [ 1234.778981] [ 1234.778989] dma_fence_signal_timestamp_locked+0x6a/0xe0 [ 1234.778999] dma_fence_signal+0x2c/0x50 [ 1234.779005] amdgpu_fence_process+0xc8/0x140 [amdgpu] [ 1234.779234] sdma_v3_0_process_trap_irq+0x70/0x80 [amdgpu] [ 1234.779395] amdgpu_irq_dispatch+0xa9/0x1d0 [amdgpu] [ 1234.779609] amdgpu_ih_process+0x80/0x100 [amdgpu] [ 1234.779783] amdgpu_irq_handler+0x1f/0x60 [amdgpu] [ 1234.779940] __handle_irq_event_percpu+0x46/0x190 [ 1234.779946] handle_irq_event+0x34/0x70 [ 1234.779949] handle_edge_irq+0x9f/0x240 [ 1234.779954] 
__common_interrupt+0x66/0x100 [ 1234.779960] common_interrupt+0xa0/0xc0 [ 1234.779965] [ 1234.779968] [ 1234.779971] asm_common_interrupt+0x22/0x40 [ 1234.779976] RIP: 0010:finish_mkwrite_fault+0x22/0x110 [ 1234.779981] Code: 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 55 41 54 55 48 89 fd 53 48 8b 07 f6 40 50 08 0f 84 eb 00 00 00 48 8b 45 30 48 8b 18 <48> 89 df e8 66 bd ff ff 48 85 c0 74 0d 48 89 c2 83 e2 01 48 83 ea [ 1234.779985] RSP: :abe680bcfd78 EFLAGS: 0202 Revert it for now and figure it out later. Signed-off-by: Arvind Yadav --- drivers/gpu/drm/scheduler/sched_main.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index 820c0c5544e1..ea7bfa99d6c9 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -790,7 +790,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched) job = list_first_entry_or_null(&sched->pending_list, struct drm_sched_job, list); - if (job && dma_fence_is_signaled(job->s_fence->parent)) { + if (job && dma_fence_is_signaled(&job->s_fence->finished)) { /* remove job from pending_list */ list_del_init(&job->list); @@ -802,7 +802,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched) if (next) { next->s_fence->scheduled.timestamp = - job->s_fence->parent->timestamp; + job->s_fence->finished.timestamp; /* start TO timer for next job */ drm_sched_start_timeout(sched); } -- 2.25.1
[PATCH] Revert "drm/sched: Use parent fence instead of finished"
From: Dave Airlie This reverts commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86. This is causing instability on Linus' desktop, and I'm seeing oops with VK CTS runs. netconsole got me the following oops: [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0088 [ 1234.778782] #PF: supervisor read access in kernel mode [ 1234.778787] #PF: error_code(0x) - not-present page [ 1234.778791] PGD 0 P4D 0 [ 1234.778798] Oops: [#1] PREEMPT SMP NOPTI [ 1234.778803] CPU: 7 PID: 805 Comm: systemd-journal Not tainted 6.0.0+ #2 [ 1234.778809] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 5603 07/28/2020 [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched] [ 1234.778828] Code: aa 0f 1d ce e9 57 ff ff ff 48 89 d7 e8 9d 8f 3f ce e9 4a ff ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 55 53 48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d f0 00 00 00 48 8b 85 80 01 00 00 f0 [ 1234.778834] RSP: :abe680380de0 EFLAGS: 00010087 [ 1234.778839] RAX: c04e9230 RBX: RCX: 0018 [ 1234.778897] RDX: 0ba278e8977a RSI: 953fb288b460 RDI: [ 1234.778901] RBP: 953fb288b598 R08: 00e0 R09: 953fbd98b808 [ 1234.778905] R10: R11: abe680380ff8 R12: abe680380e00 [ 1234.778908] R13: 0001 R14: R15: 953fbd9ec458 [ 1234.778912] FS: 7f35e7008580() GS:95428ebc() knlGS: [ 1234.778916] CS: 0010 DS: ES: CR0: 80050033 [ 1234.778919] CR2: 0088 CR3: 00010147c000 CR4: 003506e0 [ 1234.778924] Call Trace: [ 1234.778981] [ 1234.778989] dma_fence_signal_timestamp_locked+0x6a/0xe0 [ 1234.778999] dma_fence_signal+0x2c/0x50 [ 1234.779005] amdgpu_fence_process+0xc8/0x140 [amdgpu] [ 1234.779234] sdma_v3_0_process_trap_irq+0x70/0x80 [amdgpu] [ 1234.779395] amdgpu_irq_dispatch+0xa9/0x1d0 [amdgpu] [ 1234.779609] amdgpu_ih_process+0x80/0x100 [amdgpu] [ 1234.779783] amdgpu_irq_handler+0x1f/0x60 [amdgpu] [ 1234.779940] __handle_irq_event_percpu+0x46/0x190 [ 1234.779946] handle_irq_event+0x34/0x70 [ 1234.779949] handle_edge_irq+0x9f/0x240 [ 1234.779954] __common_interrupt+0x66/0x100 
[ 1234.779960] common_interrupt+0xa0/0xc0 [ 1234.779965] [ 1234.779968] [ 1234.779971] asm_common_interrupt+0x22/0x40 [ 1234.779976] RIP: 0010:finish_mkwrite_fault+0x22/0x110 [ 1234.779981] Code: 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 55 41 54 55 48 89 fd 53 48 8b 07 f6 40 50 08 0f 84 eb 00 00 00 48 8b 45 30 48 8b 18 <48> 89 df e8 66 bd ff ff 48 85 c0 74 0d 48 89 c2 83 e2 01 48 83 ea [ 1234.779985] RSP: :abe680bcfd78 EFLAGS: 0202 Revert it for now and figure it out later. Signed-off-by: Dave Airlie --- drivers/gpu/drm/scheduler/sched_main.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index 4f2395d..e5a4ecd 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -829,7 +829,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched) job = list_first_entry_or_null(&sched->pending_list, struct drm_sched_job, list); - if (job && dma_fence_is_signaled(job->s_fence->parent)) { + if (job && dma_fence_is_signaled(&job->s_fence->finished)) { /* remove job from pending_list */ list_del_init(&job->list); @@ -841,7 +841,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched) if (next) { next->s_fence->scheduled.timestamp = - job->s_fence->parent->timestamp; + job->s_fence->finished.timestamp; /* start TO timer for next job */ drm_sched_start_timeout(sched); } -- 2.7.4