Re: [PATCH v4] drm/amd/amdgpu embed hw_fence into amdgpu_job

2021-08-11 Thread Jingwen Chen
Hi Guchun,

Sorry for causing this failure; I have already submitted a v5 patch to fix it.

On Wed, Aug 11, 2021 at 05:26:52PM +0800, Chen, Guchun wrote:
> [Public]
> 
> Attach the error log.
> 
> [   99.534964] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
> [   99.535531] amdgpu: SRAT table not found
> [   99.535532] amdgpu: Virtual CRAT table created for GPU
> [   99.536695] amdgpu: Topology: Add dGPU node [0x73a3:0x1002]
> [   99.536697] kfd kfd: amdgpu: added device 1002:73a3
> [   99.536717] amdgpu 0000:03:00.0: amdgpu: SE 4, SH per SE 2, CU per SH 10, active_cu_number 60
> [   99.536904] BUG: kernel NULL pointer dereference, address: 0000000000000048
> [   99.536906] #PF: supervisor read access in kernel mode
> [   99.536907] #PF: error_code(0x0000) - not-present page
> [   99.536908] PGD 0 P4D 0
> [   99.536910] Oops: 0000 [#1] SMP PTI
> [   99.536911] CPU: 8 PID: 2282 Comm: sdma0 Not tainted 5.13.0-guchchen #1
> [   99.536913] Hardware name: System manufacturer System Product Name/TUF Z370-PLUS GAMING II, BIOS 0411 09/21/2018
> [   99.536914] RIP: 0010:amdgpu_fence_enable_signaling+0x15/0x40 [amdgpu]
> [   99.537023] [drm] Unknown EDID CEA parser results
> [   99.537044] Code: 00 e9 4f 55 ab ed 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 81 7f 08 20 c7 b1 c0 74 02 31 ff 48 8b 7f 40 <48> 8b 47 48 48 85 c0 74 06 b8 01 00 00 00 c3 48 8b 35 95 9c e5 ee
> [   99.537046] RSP: 0018:ffffb50b01dcfe58 EFLAGS: 00010046
> [   99.537047] RAX: ffffffffc07adcc0 RBX: ffff9bd53c3f4d90 RCX: 0000000000000017
> [   99.537048] RDX: 0000000000000001 RSI: ffff9bd53c3f4c58 RDI: 0000000000000000
> [   99.537049] RBP: ffff9bd53c3f4c00 R08: 0000000000000000 R09: b918
> [   99.537050] R10: 0000000000000001 R11: 0000000000000074 R12: ffffffffc06e4d10
> [   99.537050] R13: 0000000000000246 R14: ffff9bd53b60b9a0 R15: ffff9bd53c3f4d90
> [   99.537051] FS:  0000000000000000(0000) GS:ffff9bd826c00000(0000) knlGS:0000000000000000
> [   99.537052] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   99.537053] CR2: 0000000000000048 CR3: 000000021360a005 CR4: 00000000003706e0
> [   99.537054] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   99.537055] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [   99.537056] Call Trace:
> [   99.537057]  __dma_fence_enable_signaling+0x3c/0xa0
> [   99.537060]  dma_fence_add_callback+0x39/0xa0
> [   99.537062]  drm_sched_main+0x1aa/0x390 [gpu_sched]
> [   99.537065]  ? wait_woken+0x80/0x80
> [   99.537068]  ? drm_sched_get_cleanup_job+0x120/0x120 [gpu_sched]
> [   99.537070]  kthread+0x117/0x130
> [   99.537071]  ? kthread_park+0x90/0x9
> 
> Regards,
> Guchun
> 
> -----Original Message-----
> From: amd-gfx  On Behalf Of Chen, 
> Guchun
> Sent: Wednesday, August 11, 2021 5:24 PM
> To: Grodzovsky, Andrey ; Chen, JingWen 
> ; amd-gfx@lists.freedesktop.org
> Cc: Liu, Monk ; Koenig, Christian 
> ; Jack Zhang ; Jack Zhang 
> 
> Subject: RE: [PATCH v4] drm/amd/amdgpu embed hw_fence into amdgpu_job
> 
> [Public]
> 
> Hi Jingwen,
> 
> Your patch has caused amdgpu driver load failure on all ASICs. Please revert 
> it first and come up with a proper fix.
> 
> Regards,
> Guchun
> 
> -----Original Message-----
> From: amd-gfx  On Behalf Of Andrey 
> Grodzovsky
> Sent: Wednesday, August 11, 2021 12:41 AM
> To: Chen, JingWen ; amd-gfx@lists.freedesktop.org
> Cc: Liu, Monk ; Koenig, Christian 
> ; Jack Zhang ; Jack Zhang 
> 
> Subject: Re: [PATCH v4] drm/amd/amdgpu embed hw_fence into amdgpu_job
> 
> Reviewed-by: Andrey Grodzovsky 
> 
> Andrey
> 
> On 2021-08-09 11:22 p.m., Jingwen Chen wrote:
> > From: Jack Zhang 
> >
> > Why: Previously the hw fence was allocated separately from the job,
> > which caused long-standing lifetime issues and corner cases.
> > The ideal situation is to let the fence manage both the job's and the
> > fence's lifetime, and to simplify the design of the GPU scheduler.
> >
> > How:
> > We propose to embed hw_fence into amdgpu_job.
> > 1. Normal job submission is covered by this method.
> > 2. For ib_test and submissions without a parent job, keep the legacy
> > way of creating a hw fence separately.
> > v2:
> > use AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT to show that the fence is
> > embedded in a job.
> > v3:
> > remove the redundant variable ring in amdgpu_job
> > v4:
> > add TDR sequence support for this feature. Add a job_run_counter to
> > indicate whether this job is a resubmission.
> >
> > Signed-off-by: Jingwen Chen 
> > Signed-off-by: Jack Zhang 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/am
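
The commit message above boils down to an ownership change: the dma_fence no
longer lives in its own allocation but inside struct amdgpu_job, so a single
refcount governs both lifetimes and container_of() recovers the job from its
fence. Below is a minimal userspace C sketch of that model; the stub types
and helpers are invented for illustration and are not amdgpu code.

/* Simplified model of "embed hw_fence into amdgpu_job". NOT driver code:
 * job_stub/fence_stub and the helpers stand in for amdgpu_job/dma_fence. */
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct fence_stub {
	unsigned refcount;   /* one refcount now covers job + fence */
	unsigned long flags; /* would carry the EMBED_IN_JOB bit */
};

struct job_stub {
	unsigned job_run_counter;   /* >0 would mean a TDR resubmission */
	struct fence_stub hw_fence; /* embedded, not separately allocated */
};

/* Recover the owning job from its embedded fence. */
static struct job_stub *job_from_fence(struct fence_stub *f)
{
	return container_of(f, struct job_stub, hw_fence);
}

/* Dropping the last fence reference frees job and fence together. */
static void fence_put(struct fence_stub *f)
{
	if (--f->refcount == 0)
		free(job_from_fence(f));
}

int main(void)
{
	struct job_stub *job = calloc(1, sizeof(*job));
	job->hw_fence.refcount = 1;

	struct fence_stub *f = &job->hw_fence;
	printf("fence %p belongs to job %p\n", (void *)f, (void *)job_from_fence(f));
	fence_put(f); /* frees the job and the fence in one step */
	return 0;
}

This also suggests why the amdgpu_amdkfd.c hunk drops its dma_fence_put(f):
with the fence embedded, presumably amdgpu_job_free() releases that reference
rather than a separate put.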

RE: [PATCH v4] drm/amd/amdgpu embed hw_fence into amdgpu_job

2021-08-11 Thread Chen, Guchun
[Public]

Attach the error log.

[   99.534964] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[   99.535531] amdgpu: SRAT table not found
[   99.535532] amdgpu: Virtual CRAT table created for GPU
[   99.536695] amdgpu: Topology: Add dGPU node [0x73a3:0x1002]
[   99.536697] kfd kfd: amdgpu: added device 1002:73a3
[   99.536717] amdgpu 0000:03:00.0: amdgpu: SE 4, SH per SE 2, CU per SH 10, active_cu_number 60
[   99.536904] BUG: kernel NULL pointer dereference, address: 0000000000000048
[   99.536906] #PF: supervisor read access in kernel mode
[   99.536907] #PF: error_code(0x0000) - not-present page
[   99.536908] PGD 0 P4D 0
[   99.536910] Oops: 0000 [#1] SMP PTI
[   99.536911] CPU: 8 PID: 2282 Comm: sdma0 Not tainted 5.13.0-guchchen #1
[   99.536913] Hardware name: System manufacturer System Product Name/TUF Z370-PLUS GAMING II, BIOS 0411 09/21/2018
[   99.536914] RIP: 0010:amdgpu_fence_enable_signaling+0x15/0x40 [amdgpu]
[   99.537023] [drm] Unknown EDID CEA parser results
[   99.537044] Code: 00 e9 4f 55 ab ed 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 81 7f 08 20 c7 b1 c0 74 02 31 ff 48 8b 7f 40 <48> 8b 47 48 48 85 c0 74 06 b8 01 00 00 00 c3 48 8b 35 95 9c e5 ee
[   99.537046] RSP: 0018:ffffb50b01dcfe58 EFLAGS: 00010046
[   99.537047] RAX: ffffffffc07adcc0 RBX: ffff9bd53c3f4d90 RCX: 0000000000000017
[   99.537048] RDX: 0000000000000001 RSI: ffff9bd53c3f4c58 RDI: 0000000000000000
[   99.537049] RBP: ffff9bd53c3f4c00 R08: 0000000000000000 R09: b918
[   99.537050] R10: 0000000000000001 R11: 0000000000000074 R12: ffffffffc06e4d10
[   99.537050] R13: 0000000000000246 R14: ffff9bd53b60b9a0 R15: ffff9bd53c3f4d90
[   99.537051] FS:  0000000000000000(0000) GS:ffff9bd826c00000(0000) knlGS:0000000000000000
[   99.537052] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   99.537053] CR2: 0000000000000048 CR3: 000000021360a005 CR4: 00000000003706e0
[   99.537054] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   99.537055] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   99.537056] Call Trace:
[   99.537057]  __dma_fence_enable_signaling+0x3c/0xa0
[   99.537060]  dma_fence_add_callback+0x39/0xa0
[   99.537062]  drm_sched_main+0x1aa/0x390 [gpu_sched]
[   99.537065]  ? wait_woken+0x80/0x80
[   99.537068]  ? drm_sched_get_cleanup_job+0x120/0x120 [gpu_sched]
[   99.537070]  kthread+0x117/0x130
[   99.537071]  ? kthread_park+0x90/0x9

Regards,
Guchun

-----Original Message-----
From: amd-gfx  On Behalf Of Chen, Guchun
Sent: Wednesday, August 11, 2021 5:24 PM
To: Grodzovsky, Andrey ; Chen, JingWen 
; amd-gfx@lists.freedesktop.org
Cc: Liu, Monk ; Koenig, Christian ; 
Jack Zhang ; Jack Zhang 
Subject: RE: [PATCH v4] drm/amd/amdgpu embed hw_fence into amdgpu_job

[Public]

Hi Jingwen,

Your patch has caused amdgpu driver load failure on all ASICs. Please revert it 
first and come up with a proper fix.

Regards,
Guchun

-----Original Message-----
From: amd-gfx  On Behalf Of Andrey 
Grodzovsky
Sent: Wednesday, August 11, 2021 12:41 AM
To: Chen, JingWen ; amd-gfx@lists.freedesktop.org
Cc: Liu, Monk ; Koenig, Christian ; 
Jack Zhang ; Jack Zhang 
Subject: Re: [PATCH v4] drm/amd/amdgpu embed hw_fence into amdgpu_job

Reviewed-by: Andrey Grodzovsky 

Andrey

On 2021-08-09 11:22 p.m., Jingwen Chen wrote:
> From: Jack Zhang 
>
> Why: Previously the hw fence was allocated separately from the job,
> which caused long-standing lifetime issues and corner cases.
> The ideal situation is to let the fence manage both the job's and the
> fence's lifetime, and to simplify the design of the GPU scheduler.
>
> How:
> We propose to embed hw_fence into amdgpu_job.
> 1. Normal job submission is covered by this method.
> 2. For ib_test and submissions without a parent job, keep the legacy
> way of creating a hw fence separately.
> v2:
> use AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT to show that the fence is
> embedded in a job.
> v3:
> remove the redundant variable ring in amdgpu_job
> v4:
> add TDR sequence support for this feature. Add a job_run_counter to
> indicate whether this job is a resubmission.
>
> Signed-off-by: Jingwen Chen 
> Signed-off-by: Jack Zhang 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c  |  1 -
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  | 12 +++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c   | 73 -
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c  |  2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 39 +++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  6 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h|  5 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  |  2 +-
>   9 files changed, 108 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index 7b46ba551cb2..3003ee1c9487 

RE: [PATCH v4] drm/amd/amdgpu embed hw_fence into amdgpu_job

2021-08-11 Thread Chen, Guchun
[Public]

Hi Jingwen,

Your patch has caused amdgpu driver load failure on all ASICs. Please revert it 
first and come up with a proper fix.

Regards,
Guchun

-----Original Message-----
From: amd-gfx  On Behalf Of Andrey 
Grodzovsky
Sent: Wednesday, August 11, 2021 12:41 AM
To: Chen, JingWen ; amd-gfx@lists.freedesktop.org
Cc: Liu, Monk ; Koenig, Christian ; 
Jack Zhang ; Jack Zhang 
Subject: Re: [PATCH v4] drm/amd/amdgpu embed hw_fence into amdgpu_job

Reviewed-by: Andrey Grodzovsky 

Andrey

On 2021-08-09 11:22 p.m., Jingwen Chen wrote:
> From: Jack Zhang 
>
> Why: Previously the hw fence was allocated separately from the job,
> which caused long-standing lifetime issues and corner cases.
> The ideal situation is to let the fence manage both the job's and the
> fence's lifetime, and to simplify the design of the GPU scheduler.
>
> How:
> We propose to embed hw_fence into amdgpu_job.
> 1. Normal job submission is covered by this method.
> 2. For ib_test and submissions without a parent job, keep the legacy
> way of creating a hw fence separately.
> v2:
> use AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT to show that the fence is
> embedded in a job.
> v3:
> remove the redundant variable ring in amdgpu_job
> v4:
> add TDR sequence support for this feature. Add a job_run_counter to
> indicate whether this job is a resubmission.
>
> Signed-off-by: Jingwen Chen 
> Signed-off-by: Jack Zhang 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c  |  1 -
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  | 12 +++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c   | 73 -
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c  |  2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 39 +++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  6 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h|  5 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  |  2 +-
>   9 files changed, 108 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index 7b46ba551cb2..3003ee1c9487 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -714,7 +714,6 @@ int amdgpu_amdkfd_submit_ib(struct kgd_dev *kgd, enum 
> kgd_engine_type engine,
>   ret = dma_fence_wait(f, false);
>   
>   err_ib_sched:
> - dma_fence_put(f);
>   amdgpu_job_free(job);
>   err:
>   return ret;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> index 536005bff24a..277128846dd1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> @@ -1414,7 +1414,7 @@ static void amdgpu_ib_preempt_mark_partial_job(struct 
> amdgpu_ring *ring)
>   continue;
>   }
>   job = to_amdgpu_job(s_job);
> - if (preempted && job->fence == fence)
> + if (preempted && (&job->hw_fence) == fence)
>   /* mark the job as preempted */
>   job->preemption_status |= AMDGPU_IB_PREEMPTED;
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 9e53ff851496..ade2fa07a50a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4447,7 +4447,7 @@ int amdgpu_device_mode1_reset(struct amdgpu_device 
> *adev)
>   int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>struct amdgpu_reset_context *reset_context)
>   {
> - int i, r = 0;
> + int i, j, r = 0;
>   struct amdgpu_job *job = NULL;
>   bool need_full_reset =
> test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
> @@ -4471,6 +4471,16 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device 
> *adev,
>   if (!ring || !ring->sched.thread)
>   continue;
>   
> + /*clear job fence from fence drv to avoid force_completion
> +  *leave NULL and vm flush fence in fence drv */
> + for (j = 0; j <= ring->fence_drv.num_fences_mask; j ++) {
> + struct dma_fence *old,**ptr;
> + ptr = &ring->fence_drv.fences[j];
> + old = rcu_dereference_protected(*ptr, 1);
> + if (old && test_bit(AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT, &old->flags)) {
> + RCU_INIT_POINTER(*ptr, NULL);
> + }
> + }
>   /* after all hw jobs are reset, hw fence is meaningless, so
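
The hunk above (quoted again in full in the next message) adds a pass over
the fence driver's slot array before the ASIC reset: every fence carrying
AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT is dropped from the array, so the
following force_completion only signals what remains (NULL slots and vm-flush
fences). A hedged sketch of that filtering pass, using stub types in place of
the RCU-protected amdgpu structures:

/* Sketch of the pre-reset filtering pass. Stub types replace the
 * RCU-protected amdgpu structures; only the flag's role is from the patch. */
#define EMBED_IN_JOB_BIT 0 /* stand-in for AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT */

struct fence_stub { unsigned long flags; };

struct fence_drv_stub {
	unsigned num_fences_mask;     /* number of slots - 1, as in amdgpu */
	struct fence_stub *fences[8]; /* fixed size keeps the sketch self-contained */
};

/* Drop job-embedded fences so a later force_completion only signals the
 * remaining ones; NULL slots and vm-flush fences are left in place. */
static void clear_embedded_job_fences(struct fence_drv_stub *drv)
{
	for (unsigned j = 0; j <= drv->num_fences_mask; j++) {
		struct fence_stub *old = drv->fences[j];
		if (old && (old->flags & (1UL << EMBED_IN_JOB_BIT)))
			drv->fences[j] = NULL; /* RCU_INIT_POINTER(*ptr, NULL) in the patch */
	}
}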

Re: [PATCH v4] drm/amd/amdgpu embed hw_fence into amdgpu_job

2021-08-10 Thread Andrey Grodzovsky

Reviewed-by: Andrey Grodzovsky 

Andrey

On 2021-08-09 11:22 p.m., Jingwen Chen wrote:

From: Jack Zhang 

Why: Previously the hw fence was allocated separately from the job,
which caused long-standing lifetime issues and corner cases.
The ideal situation is to let the fence manage both the job's and the
fence's lifetime, and to simplify the design of the GPU scheduler.

How:
We propose to embed hw_fence into amdgpu_job.
1. Normal job submission is covered by this method.
2. For ib_test and submissions without a parent job, keep the legacy
way of creating a hw fence separately.
v2:
use AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT to show that the fence is
embedded in a job.
v3:
remove the redundant variable ring in amdgpu_job
v4:
add TDR sequence support for this feature. Add a job_run_counter to
indicate whether this job is a resubmission.

Signed-off-by: Jingwen Chen 
Signed-off-by: Jack Zhang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c  |  1 -
  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  | 12 +++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c   | 73 -
  drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c  |  2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 39 +++
  drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  6 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h|  5 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  |  2 +-
  9 files changed, 108 insertions(+), 34 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 7b46ba551cb2..3003ee1c9487 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -714,7 +714,6 @@ int amdgpu_amdkfd_submit_ib(struct kgd_dev *kgd, enum 
kgd_engine_type engine,
ret = dma_fence_wait(f, false);
  
  err_ib_sched:
-   dma_fence_put(f);
amdgpu_job_free(job);
  err:
return ret;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 536005bff24a..277128846dd1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1414,7 +1414,7 @@ static void amdgpu_ib_preempt_mark_partial_job(struct 
amdgpu_ring *ring)
continue;
}
job = to_amdgpu_job(s_job);
-   if (preempted && job->fence == fence)
+   if (preempted && (&job->hw_fence) == fence)
/* mark the job as preempted */
job->preemption_status |= AMDGPU_IB_PREEMPTED;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 9e53ff851496..ade2fa07a50a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4447,7 +4447,7 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev)
  int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 struct amdgpu_reset_context *reset_context)
  {
-   int i, r = 0;
+   int i, j, r = 0;
struct amdgpu_job *job = NULL;
bool need_full_reset =
test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
@@ -4471,6 +4471,16 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device 
*adev,
if (!ring || !ring->sched.thread)
continue;
  
+		/*clear job fence from fence drv to avoid force_completion
+		 *leave NULL and vm flush fence in fence drv */
+   for (j = 0; j <= ring->fence_drv.num_fences_mask; j ++) {
+   struct dma_fence *old,**ptr;
+   ptr = &ring->fence_drv.fences[j];
+   old = rcu_dereference_protected(*ptr, 1);
+   if (old && test_bit(AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT, &old->flags)) {
+   RCU_INIT_POINTER(*ptr, NULL);
+   }
+   }
/* after all hw jobs are reset, hw fence is meaningless, so 
force_completion */
amdgpu_fence_driver_force_completion(ring);
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 7495911516c2..a8302e324110 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -129,30 +129,50 @@ static u32 amdgpu_fence_read(struct amdgpu_ring *ring)
   *
   * @ring: ring the fence is associated with
   * @f: resulting fence object
+ * @job: job the fence is embedded in
   * @flags: flags to pass into the subordinate .emit_fence() call
   *
   * Emits a fence command on the requested ring (all asics).
   * Returns 0 on success, -ENOMEM on failure.
   */
-int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f,
+int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, struct amdgpu_job *job,
  unsigned flags)
  {
struct
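
The listing is cut off here, but the updated doc comment shows the new
contract: amdgpu_fence_emit() now also takes the job the fence is embedded
in. The implication, hedged since the function body is truncated, is a
two-way split: with a job, the emitted fence is the job's embedded hw_fence
(and, per the v4 note, reused on TDR resubmissions tracked by
job_run_counter); without a job, the legacy separate allocation is kept. A
sketch of that selection logic with invented stub names:

/* Sketch of the fence-selection split implied by the new signature.
 * All stub_* names are invented; only the job/no-job split mirrors the patch. */
#include <stdlib.h>

struct stub_fence { unsigned long flags; };

struct stub_job {
	unsigned job_run_counter;   /* >0: resubmission, fence already initialized */
	struct stub_fence hw_fence; /* embedded fence storage */
};

#define STUB_EMBED_IN_JOB_BIT 0

static struct stub_fence *stub_pick_fence(struct stub_job *job)
{
	struct stub_fence *f;

	if (!job) {
		/* legacy path: ib_test and submissions without a parent job */
		f = calloc(1, sizeof(*f));
	} else {
		/* embedded path: reuse the job's fence and mark it as such */
		f = &job->hw_fence;
		f->flags |= 1UL << STUB_EMBED_IN_JOB_BIT;
	}
	return f;
}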