RE: [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost)

2024-07-23 Thread Yin, ZhenGuo (Chris)
[AMD Official Use Only - AMD Internal Distribution Only]

Hi Christian

I prepared this patch because we hit a page fault after a GPU reset in SRIOV,
triggered by quark.
After investigation, I found that the page table did not get restored after the
GPU reset.
I also tried vm_update_mode=0 to disable VM updates by the CPU, and the issue
still exists.

[Christian] When VRAM is lost any submission using the VM entity will fail and
so result in a new page table generation.
I believe you are referring to this piece of code in function
amdgpu_job_run():
	/* Skip job if VRAM is lost and never resubmit gangs */
	if (job->generation != amdgpu_vm_generation(adev, job->vm) ||
	    (job->job_run_counter && job->gang_submit))
		dma_fence_set_error(finished, -ECANCELED);

I agree that submissions from the old ctx will be set to -ECANCELED and be
skipped.
But if the application then creates a new ctx and submits a new job, both
new_ctx->generation and new_job->generation will be initialized to the updated
value:
ctx->generation = amdgpu_vm_generation(mgr->adev, &fpriv->vm);
(*job)->generation = amdgpu_vm_generation(adev, vm);
And the job will be executed, hence there will be a page fault due to the VRAM
loss.

Please correct me if I'm misunderstanding something.
I still cannot see why every submission using the VM entity would fail after
VRAM is lost.
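
To make my concern concrete, here is a minimal userspace model of the checks
above. The names (vram_lost_counter, vm_generation, delayed_has_error, ...) are
simplified stand-ins for the driver fields, not the real API; it only mirrors
the logic I described:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint32_t vram_lost_counter;     /* models adev->vram_lost_counter  */
static uint64_t vm_generation;         /* models vm->generation           */
static bool delayed_has_error;         /* models drm_sched_entity_error() */
static bool page_tables_valid = true;

/* Models amdgpu_vm_generation(): vram_lost_counter in the upper 32 bits,
 * plus one if a re-generation is already pending on the delayed entity. */
static uint64_t query_generation(void)
{
	uint64_t result = (uint64_t)vram_lost_counter << 32;

	result += vm_generation;
	if (delayed_has_error)
		++result;
	return result;
}

/* Models the re-generation path in amdgpu_vm_validate(): the state machine
 * is only reset when the delayed entity carries an error. */
static void validate(void)
{
	if (delayed_has_error) {
		++vm_generation;
		delayed_has_error = false;
		page_tables_valid = true;	/* page tables rebuilt */
	}
}

int main(void)
{
	uint64_t old_job_generation = query_generation();

	/* GPU reset with VRAM loss; no page-table SDMA job was pending,
	 * so the delayed entity carries no error. */
	++vram_lost_counter;
	page_tables_valid = false;

	printf("old job cancelled: %s\n",
	       old_job_generation != query_generation() ? "yes" : "no");

	/* A new context/job created after the reset samples the new token,
	 * so its submission is NOT skipped ... */
	validate();
	uint64_t new_job_generation = query_generation();
	printf("new job executed:  %s\n",
	       new_job_generation == query_generation() ? "yes" : "no");

	/* ... yet nothing ever rebuilt the page tables -> page fault. */
	printf("page tables valid: %s\n", page_tables_valid ? "yes" : "no");
	return 0;
}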

Best,
Zhenguo
Cloud-GPU Core team, SRDC

-Original Message-
From: Christian König 
Sent: Tuesday, July 23, 2024 3:13 PM
To: Yin, ZhenGuo (Chris) ; amd-gfx@lists.freedesktop.org
Cc: Koenig, Christian 
Subject: Re: [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram 
lost)

Am 23.07.24 um 05:05 schrieb ZhenGuo Yin:
> [Why]
> Page table of the compute VM in VRAM will be lost after a gpu reset.
> VRAM won't be restored since compute VM has no shadows.
>
> [How]
> Use the higher 32 bits of vm->generation to record a vram_lost_counter.
> Reset the VM state machine when vm->generation is not equal to the
> re-generation token.
>
> v2: Check vm->generation instead of calling drm_sched_entity_error in
> amdgpu_vm_validate.

I've just double checked the logic again and as far as I can see this patch 
here is still completely superfluous.

When VRAM is lost any submission using the VM entity will fail and so result in 
a new page table generation.

What isn't handled are CPU based page table updates, but for those we could 
potentially change the condition inside the CPU page tables code.

Regards,
Christian.

>
> Signed-off-by: ZhenGuo Yin 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 11 +++
>   1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 3abfa66d72a2..9e2f84c166e7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -434,7 +434,7 @@ uint64_t amdgpu_vm_generation(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>   	if (!vm)
>   		return result;
>
> -	result += vm->generation;
> +	result += (vm->generation & 0xffffffff);
>   	/* Add one if the page tables will be re-generated on next CS */
>   	if (drm_sched_entity_error(&vm->delayed))
>   		++result;
> @@ -467,9 +467,12 @@ int amdgpu_vm_validate(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>   	struct amdgpu_bo *shadow;
>   	struct amdgpu_bo *bo;
>   	int r;
> +	uint32_t vram_lost_counter = atomic_read(&adev->vram_lost_counter);
>
> -	if (drm_sched_entity_error(&vm->delayed)) {
> -		++vm->generation;
> +	if (vm->generation != amdgpu_vm_generation(adev, vm)) {
> +		if (drm_sched_entity_error(&vm->delayed))
> +			++vm->generation;
> +		vm->generation = (u64)vram_lost_counter << 32 | (vm->generation & 0xffffffff);
>   		amdgpu_vm_bo_reset_state_machine(vm);
>   		amdgpu_vm_fini_entities(vm);
>   		r = amdgpu_vm_init_entities(adev, vm);
> @@ -2439,7 +2442,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>   	vm->last_update = dma_fence_get_stub();
>   	vm->last_unlocked = dma_fence_get_stub();
>   	vm->last_tlb_flush = dma_fence_get_stub();
> -	vm->generation = 0;
> +	vm->generation = (u64)atomic_read(&adev->vram_lost_counter) << 32;
>
>   	mutex_init(&vm->eviction_lock);
>   	vm->evicting = false;



RE: [PATCH] drm/amdgpu: reset vm state machine after gpu reset(vram lost)

2024-07-19 Thread Yin, ZhenGuo (Chris)
[AMD Official Use Only - AMD Internal Distribution Only]

Hi, Christian

Why would losing VRAM result in the VM entity becoming invalid?

I think the entity vm->delayed is only marked with an error if a fence error
appears (for example a pending SDMA job that timed out or was cancelled).

If a GPU reset is triggered by a GFX job and there is no SDMA job in the
pending list, the entity won't be marked with an error.
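
The patch below therefore folds vram_lost_counter into the upper 32 bits of
vm->generation. For reference, a minimal plain-C sketch of that encoding (the
names are simplified stand-ins, not the driver's actual helpers):

#include <stdint.h>
#include <stdio.h>

/* Upper 32 bits: vram_lost_counter at the time the page tables were
 * (re)built; lower 32 bits: per-VM re-generation counter. */
static uint64_t pack_generation(uint32_t vram_lost_counter, uint32_t gen)
{
	return (uint64_t)vram_lost_counter << 32 | gen;
}

static uint32_t generation_vram_counter(uint64_t generation)
{
	return generation >> 32;
}

static uint32_t generation_counter(uint64_t generation)
{
	return generation & 0xffffffff;
}

int main(void)
{
	uint64_t generation = pack_generation(3, 7);

	/* The device lost VRAM again: the counters no longer match, so the
	 * VM state machine must be reset and the page tables rebuilt. */
	uint32_t device_vram_lost_counter = 4;

	if (generation_vram_counter(generation) != device_vram_lost_counter) {
		generation = pack_generation(device_vram_lost_counter,
					     generation_counter(generation));
		printf("reset VM state machine, new generation %#llx\n",
		       (unsigned long long)generation);
	}
	return 0;
}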

Best,
Zhenguo
Cloud-GPU Core team, SRDC

-Original Message-
From: Koenig, Christian 
Sent: Friday, July 19, 2024 5:22 PM
To: Yin, ZhenGuo (Chris) ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: reset vm state machine after gpu reset(vram 
lost)

Am 19.07.24 um 11:19 schrieb ZhenGuo Yin:
> [Why]
> Page table of the compute VM in VRAM will be lost after a gpu reset.
> VRAM won't be restored since compute VM has no shadows.
>
> [How]
> Use higher 32-bit of vm->generation to record a vram_lost_counter.
> Reset the VM state machine when the counter is not equal to current
> vram_lost_counter of the device.

Mhm, that was my original approach as well, but we came to the conclusion that
this shouldn't be necessary since losing VRAM would result in the entity
becoming invalid as well.

Why is that necessary?

Regards,
Christian.

>
> Signed-off-by: ZhenGuo Yin 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 10 --
>   1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 3abfa66d72a2..fd7f912816dc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -434,7 +434,7 @@ uint64_t amdgpu_vm_generation(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>   	if (!vm)
>   		return result;
>
> -	result += vm->generation;
> +	result += (vm->generation & 0xffffffff);
>   	/* Add one if the page tables will be re-generated on next CS */
>   	if (drm_sched_entity_error(&vm->delayed))
>   		++result;
> @@ -467,6 +467,12 @@ int amdgpu_vm_validate(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>   	struct amdgpu_bo *shadow;
>   	struct amdgpu_bo *bo;
>   	int r;
> +	uint32_t vram_lost_counter = atomic_read(&adev->vram_lost_counter);
> +
> +	if ((vm->generation >> 32) != vram_lost_counter) {
> +		amdgpu_vm_bo_reset_state_machine(vm);
> +		vm->generation = (u64)vram_lost_counter << 32 | (vm->generation & 0xffffffff);
> +	}
>
>   	if (drm_sched_entity_error(&vm->delayed)) {
>   		++vm->generation;
> @@ -2439,7 +2445,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>   	vm->last_update = dma_fence_get_stub();
>   	vm->last_unlocked = dma_fence_get_stub();
>   	vm->last_tlb_flush = dma_fence_get_stub();
> -	vm->generation = 0;
> +	vm->generation = (u64)atomic_read(&adev->vram_lost_counter) << 32;
>
>   	mutex_init(&vm->eviction_lock);
>   	vm->evicting = false;



RE: [PATCH] drm/amdgpu: clear set_q_mode_offs when VM changed

2024-04-02 Thread Yin, ZhenGuo (Chris)
[AMD Official Use Only - General]

Yeah, I agree.  But I have no idea which place is better.

Best,
Zhenguo
Cloud-GPU Core team, SRDC

-Original Message-
From: Koenig, Christian 
Sent: Tuesday, April 2, 2024 4:14 PM
To: Yin, ZhenGuo (Chris) ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander 
Subject: Re: [PATCH] drm/amdgpu: clear set_q_mode_offs when VM changed

Am 02.04.24 um 08:37 schrieb ZhenGuo Yin:
> [Why]
> set_q_mode_offs doesn't get cleared after a GPU reset, so the next SET_Q_MODE
> packet to init shadow memory will be skipped, hence a page fault.
>
> [How]
> A VM flush is needed after a GPU reset, so clear set_q_mode_offs when
> emitting the VM flush.
>
> Fixes: 8dad9c062355 ("drm/amdgpu: workaround to avoid SET_Q_MODE
> packets v2")
> Signed-off-by: ZhenGuo Yin 

Good catch, yeah it probably doesn't make much sense to execute this after a VM 
flush.

Reviewed-by: Christian König 

> ---
>   drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> index 7a906318e451..c11c6299711e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> @@ -5465,6 +5465,7 @@ static void gfx_v11_0_ring_emit_vm_flush(struct 
> amdgpu_ring *ring,
>   /* Make sure that we can't skip the SET_Q_MODE packets when the VM
>* changed in any way.
>*/
> + ring->set_q_mode_offs = 0;
>   ring->set_q_mode_ptr = NULL;
>   }
>



RE: [PATCH] drm/amd/amdgpu: add pipe1 hardware support

2024-03-26 Thread Yin, ZhenGuo (Chris)
[AMD Official Use Only - General]

Hi, Alex

Could you please help to review this patch?
This should fix the bug: https://gitlab.freedesktop.org/drm/amd/-/issues/2117.

Thx.

Best,
Zhenguo
Cloud-GPU Core team, SRDC

-Original Message-
From: Yin, ZhenGuo (Chris) 
Sent: Friday, March 15, 2024 2:12 PM
To: amd-gfx@lists.freedesktop.org
Cc: mdaen...@redhat.com; Deucher, Alexander ; 
Koenig, Christian ; Paneer Selvam, Arunpravin 
; Deng, Emily ; Liu, Monk 
; Yin, ZhenGuo (Chris) 
Subject: [PATCH] drm/amd/amdgpu: add pipe1 hardware support

Enable pipe1 support starting from the SIENNA_CICHLID ASIC.
The correct ref/mask needs to be used for the pipe1 HDP flush (see the short
sketch after the patch).

Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/2117
Fixes: 085292c3d780 ("Revert "drm/amd/amdgpu: add pipe1 hardware support"")
Signed-off-by: ZhenGuo Yin 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index f90905ef32c7..5eb6f189920b 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -4518,7 +4518,7 @@ static int gfx_v10_0_sw_init(void *handle)
case IP_VERSION(10, 3, 3):
case IP_VERSION(10, 3, 7):
adev->gfx.me.num_me = 1;
-   adev->gfx.me.num_pipe_per_me = 1;
+   adev->gfx.me.num_pipe_per_me = 2;
adev->gfx.me.num_queue_per_pipe = 1;
adev->gfx.mec.num_mec = 2;
adev->gfx.mec.num_pipe_per_mec = 4;
@@ -8317,7 +8317,7 @@ static void gfx_v10_0_ring_emit_hdp_flush(struct 
amdgpu_ring *ring)
}
reg_mem_engine = 0;
} else {
-   ref_and_mask = nbio_hf_reg->ref_and_mask_cp0;
+   ref_and_mask = nbio_hf_reg->ref_and_mask_cp0 << ring->pipe;
reg_mem_engine = 1; /* pfp */
}

--
2.35.1
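
As referenced above, a tiny illustration of the per-pipe ref/mask selection the
gfx_v10_0_ring_emit_hdp_flush() hunk relies on. It assumes, as the shift
implies, that the CPn HDP-flush done bits are consecutive, so pipe N uses
ref_and_mask_cp0 shifted left by N; the constant below is made up for the
example and is not the real NBIO value.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical bit position for the CP0 HDP flush done bit; the real value
 * comes from the NBIO ref_and_mask_cp0 field. */
#define REF_AND_MASK_CP0 (1u << 0)

/* Pipe N uses the CPn bit, assumed here to be CP0 shifted left by N. */
static uint32_t hdp_flush_ref_and_mask(unsigned int pipe)
{
	return REF_AND_MASK_CP0 << pipe;
}

int main(void)
{
	for (unsigned int pipe = 0; pipe < 2; pipe++)
		printf("gfx pipe %u -> ref_and_mask 0x%x\n",
		       pipe, hdp_flush_ref_and_mask(pipe));
	return 0;
}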



Re: [PATCH] drm/amdgpu: Skip access gfx11 golden registers under SRIOV

2023-11-26 Thread Yin, ZhenGuo (Chris)

Add potential reviewers.

On 11/23/2023 4:57 PM, ZhenGuo Yin wrote:

[Why]
Golden registers are PF-only registers on gfx11.
RLCG interface will return "out-of-range" under SRIOV VF.

[How]
Skip accessing gfx11 golden registers under SRIOV.

Signed-off-by: ZhenGuo Yin 
---
  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index 8ed4a6fb147a..288e926e5cd4 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -293,6 +293,9 @@ static void gfx_v11_0_set_kiq_pm4_funcs(struct 
amdgpu_device *adev)
  
  static void gfx_v11_0_init_golden_registers(struct amdgpu_device *adev)

  {
+   if (amdgpu_sriov_vf(adev))
+   return;
+
switch (amdgpu_ip_version(adev, GC_HWIP, 0)) {
case IP_VERSION(11, 0, 1):
case IP_VERSION(11, 0, 4):


Re: [PATCH v2] drm/amdgpu doorbell range should be set when gpu recovery

2023-10-30 Thread Yin, ZhenGuo (Chris)

Acked-by: ZhenGuo Yin 

On 10/30/2023 5:57 PM, Lin.Cao wrote:

The GFX doorbell range should be set after FLR, otherwise the gfx doorbell
range will overlap with MEC.

v2: remove "amdgpu_sriov_vf" and "amdgpu_in_reset" check, and add grbm
select for the case of 2 gfx rings.

Signed-off-by: Lin.Cao 
---
  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 7 +++
  1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index d9ccacd06fba..c9f4e8252070 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -6467,6 +6467,13 @@ static int gfx_v10_0_gfx_init_queue(struct amdgpu_ring 
*ring)
if (adev->gfx.me.mqd_backup[mqd_idx])
memcpy(adev->gfx.me.mqd_backup[mqd_idx], mqd, 
sizeof(*mqd));
} else {
+   mutex_lock(&adev->srbm_mutex);
+   nv_grbm_select(adev, ring->me, ring->pipe, ring->queue, 0);
+   if (ring->doorbell_index == adev->doorbell_index.gfx_ring0 << 1)
+           gfx_v10_0_cp_gfx_set_doorbell(adev, ring);
+
+   nv_grbm_select(adev, 0, 0, 0, 0);
+   mutex_unlock(&adev->srbm_mutex);
/* restore mqd with the backup copy */
if (adev->gfx.me.mqd_backup[mqd_idx])
memcpy(mqd, adev->gfx.me.mqd_backup[mqd_idx], 
sizeof(*mqd));


RE: [PATCH 1/8] drm/scheduler: properly forward fence errors

2023-08-23 Thread Yin, ZhenGuo (Chris)
[AMD Official Use Only - General]

Ping..

Actually, I prepared a patch aiming to fix this issue,
but I'm not sure whether it is appropriate for drm/scheduler.

diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index 9654e8942382..35dc0b86a18e 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -463,6 +463,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
				      &s_job->cb)) {
			dma_fence_put(s_job->s_fence->parent);
			s_job->s_fence->parent = NULL;
+			dma_fence_set_error(&s_job->s_fence->finished, -EHWPOISON);
			atomic_dec(&sched->hw_rq_count);
} else {
/*
Best,
Zhenguo
Cloud-GPU Core team, SRDC

-Original Message-----
From: Yin, ZhenGuo (Chris)
Sent: Thursday, August 17, 2023 4:17 PM
To: Christian König ; 
amd-gfx@lists.freedesktop.org
Cc: Tuikov, Luben ; Chen, JingWen (Wayne) 
; Liu, Monk ; Li, Chong(Alan) 
; cao, lin 
Subject: RE: [PATCH 1/8] drm/scheduler: properly forward fence errors

Hi, @Christian König

Any update on the fix?
Recently we found that there is a page fault after FLR, since an SDMA job in
the pending list was dropped without its fence error being forwarded.


Best,
Zhenguo
Cloud-GPU Core team, SRDC

-Original Message-
From: Christian König 
Sent: Thursday, April 27, 2023 8:05 PM
To: Yin, ZhenGuo (Chris) ; amd-gfx@lists.freedesktop.org
Cc: Tuikov, Luben ; Chen, JingWen (Wayne) 
; Liu, Monk 
Subject: Re: [PATCH 1/8] drm/scheduler: properly forward fence errors

Well good point, but as part of the effort of the Intel team to move the 
scheduler over to a work item based design those two functions are probably 
about to be removed.

Since we will probably have that in the internal package for a bit longer I'm 
going to send a fix for this.

Regards,
Christian.

Am 27.04.23 um 12:35 schrieb Yin, ZhenGuo (Chris):
> [AMD Official Use Only - General]
>
> Hi, Christian
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index fcd4bfef7415..649fac2e1ccb 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -533,12 +533,12 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, 
> bool full_recovery)
>   r = dma_fence_add_callback(fence, &s_job->cb,
>  drm_sched_job_done_cb);
>   if (r == -ENOENT)
> - drm_sched_job_done(s_job);
> + drm_sched_job_done(s_job, fence->error);
>   else if (r)
>   DRM_DEV_ERROR(sched->dev, "fence add callback 
> failed (%d)\n",
> r);
>   } else
> - drm_sched_job_done(s_job);
> + drm_sched_job_done(s_job, 0);
>   }
>
>   if (full_recovery) {
>
> I believe that the finished fence of some skipped jobs during FLR HASN'T been 
> set to -ECANCELED.
> In function drm_sched_stop, the callback has been removed from hw_fence and 
> s_fence->parent has been set to NULL, see commit 
> 45ecaea738830b9d521c93520c8f201359dcbd95(drm/sched: Partial revert of 
> 'drm/sched: Keep s_fence->parent pointer').
> In function drm_sched_start, jobs in the pending list pretend to be done
> without any errors (drm_sched_job_done(s_job, 0)).
>
>
> Best,
> Zhenguo
> Cloud-GPU Core team, SRDC
>
> -Original Message-
> From: amd-gfx  On Behalf Of
> Christian König
> Sent: Thursday, April 20, 2023 7:58 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Tuikov, Luben 
> Subject: [PATCH 1/8] drm/scheduler: properly forward fence errors
>
> When a hw fence is signaled with an error properly forward that to the 
> finished fence.
>
> Signed-off-by: Christian König 
> ---
>   drivers/gpu/drm/scheduler/sched_entity.c |  4 +---  
> drivers/gpu/drm/scheduler/sched_fence.c  |  4 +++-
>   drivers/gpu/drm/scheduler/sched_main.c   | 18 --
>   include/drm/gpu_scheduler.h  |  2 +-
>   4 files changed, 13 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> b/drivers/gpu/drm/scheduler/sched_entity.c
> index 15d04a0ec623..eaf71fe15ed3 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -144,7 +144,7 @@ static void drm_sched_entity_kill_jobs_work(struct 
> work_struct *wrk)  {
>   struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
>
> - d

RE: [PATCH 1/8] drm/scheduler: properly forward fence errors

2023-08-17 Thread Yin, ZhenGuo (Chris)
[AMD Official Use Only - General]

Hi, @Christian König

Any update on the fix?
Recently we found that there is a page fault after FLR, since an SDMA job in
the pending list was dropped without its fence error being forwarded.


Best,
Zhenguo
Cloud-GPU Core team, SRDC

-Original Message-
From: Christian König 
Sent: Thursday, April 27, 2023 8:05 PM
To: Yin, ZhenGuo (Chris) ; amd-gfx@lists.freedesktop.org
Cc: Tuikov, Luben ; Chen, JingWen (Wayne) 
; Liu, Monk 
Subject: Re: [PATCH 1/8] drm/scheduler: properly forward fence errors

Well good point, but as part of the effort of the Intel team to move the 
scheduler over to a work item based design those two functions are probably 
about to be removed.

Since we will probably have that in the internal package for a bit longer I'm 
going to send a fix for this.

Regards,
Christian.

Am 27.04.23 um 12:35 schrieb Yin, ZhenGuo (Chris):
> [AMD Official Use Only - General]
>
> Hi, Christian
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index fcd4bfef7415..649fac2e1ccb 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -533,12 +533,12 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, 
> bool full_recovery)
>   r = dma_fence_add_callback(fence, &s_job->cb,
>  drm_sched_job_done_cb);
>   if (r == -ENOENT)
> - drm_sched_job_done(s_job);
> + drm_sched_job_done(s_job, fence->error);
>   else if (r)
>   DRM_DEV_ERROR(sched->dev, "fence add callback 
> failed (%d)\n",
> r);
>   } else
> - drm_sched_job_done(s_job);
> + drm_sched_job_done(s_job, 0);
>   }
>
>   if (full_recovery) {
>
> I believe that the finished fence of some skipped jobs during FLR HASN'T been 
> set to -ECANCELED.
> In function drm_sched_stop, the callback has been removed from hw_fence and 
> s_fence->parent has been set to NULL, see commit 
> 45ecaea738830b9d521c93520c8f201359dcbd95(drm/sched: Partial revert of 
> 'drm/sched: Keep s_fence->parent pointer').
> In function drm_sched_start, jobs in the pending list pretend to be done
> without any errors (drm_sched_job_done(s_job, 0)).
>
>
> Best,
> Zhenguo
> Cloud-GPU Core team, SRDC
>
> -Original Message-
> From: amd-gfx  On Behalf Of
> Christian König
> Sent: Thursday, April 20, 2023 7:58 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Tuikov, Luben 
> Subject: [PATCH 1/8] drm/scheduler: properly forward fence errors
>
> When a hw fence is signaled with an error properly forward that to the 
> finished fence.
>
> Signed-off-by: Christian König 
> ---
>   drivers/gpu/drm/scheduler/sched_entity.c |  4 +---  
> drivers/gpu/drm/scheduler/sched_fence.c  |  4 +++-
>   drivers/gpu/drm/scheduler/sched_main.c   | 18 --
>   include/drm/gpu_scheduler.h  |  2 +-
>   4 files changed, 13 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> b/drivers/gpu/drm/scheduler/sched_entity.c
> index 15d04a0ec623..eaf71fe15ed3 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -144,7 +144,7 @@ static void drm_sched_entity_kill_jobs_work(struct 
> work_struct *wrk)  {
>   struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
>
> - drm_sched_fence_finished(job->s_fence);
> + drm_sched_fence_finished(job->s_fence, -ESRCH);
>   WARN_ON(job->s_fence->parent);
>   job->sched->ops->free_job(job);
>   }
> @@ -195,8 +195,6 @@ static void drm_sched_entity_kill(struct drm_sched_entity 
> *entity)
>   while ((job = to_drm_sched_job(spsc_queue_pop(>job_queue {
>   struct drm_sched_fence *s_fence = job->s_fence;
>
> - dma_fence_set_error(_fence->finished, -ESRCH);
> -
>   dma_fence_get(_fence->finished);
>   if (!prev || dma_fence_add_callback(prev, >finish_cb,
>  drm_sched_entity_kill_jobs_cb)) diff 
> --git
> a/drivers/gpu/drm/scheduler/sched_fence.c
> b/drivers/gpu/drm/scheduler/sched_fence.c
> index 7fd869520ef2..1a6bea98c5cc 100644
> --- a/drivers/gpu/drm/scheduler/sched_fence.c
> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
> @@ -53,8 +53,10 @@ void drm_sched_fence_scheduled(struct drm_sched_fence 
> *fence)
>   dma_fence_signal(>scheduled);
>   }
>
> -void drm_sched_fenc

RE: [PATCH v2 06/07] drm/amdgpu: add option params to enforce process isolation between graphics and compute

2023-06-07 Thread Yin, ZhenGuo (Chris)
[AMD Official Use Only - General]

Hi Felix

Yes. This patch aims to avoid concurrent execution of the kernel graphics queue
and the kernel compute queue.

Previously we might identify the wrong guilty job if a bad compute job and a
good gfx job were submitted concurrently.
For example, Xorg might crash if we submitted a bad compute job from another
process at the same time.

Advanced TDR could resolve this issue by resubmitting jobs to identify the real
bad job.
After removing advanced TDR (commit 06a2d7cc3f0476be4682ef90eb09a28fa3daed37),
we need another method to fix this issue.

As suggested by Christian, we can use the reserved VMID to prevent concurrency
of compute and gfx and thereby enforce isolation.
Here we only focus on KGQ/KCQ, not KFD compute, since KFD uses another set of
VMIDs.
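
To illustrate the intended effect, here is a toy userspace model (not how the
driver implements it; the names are made up for the example): if every kernel
gfx and kernel compute submission must hold the single reserved VMID for the
duration of its job, the two queue types can no longer run at the same time, so
a hang can be attributed to the job that actually held the VMID.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Toy model only: the single reserved VMID that both the kernel gfx queue
 * and the kernel compute queue must grab before running a job. */
static pthread_mutex_t reserved_vmid = PTHREAD_MUTEX_INITIALIZER;

static void run_job(const char *queue, int job)
{
	pthread_mutex_lock(&reserved_vmid);   /* grab the reserved VMID */
	printf("%s job %d starts\n", queue, job);
	usleep(1000);                         /* pretend to execute */
	printf("%s job %d ends\n", queue, job);
	pthread_mutex_unlock(&reserved_vmid); /* release the VMID */
}

static void *gfx_thread(void *arg)
{
	(void)arg;
	for (int i = 0; i < 3; i++)
		run_job("gfx    ", i);
	return NULL;
}

static void *compute_thread(void *arg)
{
	(void)arg;
	for (int i = 0; i < 3; i++)
		run_job("compute", i);
	return NULL;
}

int main(void)
{
	pthread_t gfx, compute;

	pthread_create(&gfx, NULL, gfx_thread, NULL);
	pthread_create(&compute, NULL, compute_thread, NULL);
	pthread_join(gfx, NULL);
	pthread_join(compute, NULL);
	/* The start/end lines of a gfx job and a compute job never
	 * interleave, i.e. the two queue types are serialized. */
	return 0;
}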
Christian, do you have any comments?


Best,
Zhenguo
-Original Message-
From: Kuehling, Felix 
Sent: Wednesday, June 7, 2023 9:35 PM
To: Li, Chong(Alan) ; amd-gfx@lists.freedesktop.org
Cc: Koenig, Christian ; Yin, ZhenGuo (Chris) 

Subject: Re: [PATCH v2 06/07] drm/amdgpu: add option params to enforce process 
isolation between graphics and compute

I can't see the other patches in this series, so I'm missing some context. I 
don't understand what "process isolation between graphics and compute" means 
here. It seems to be unrelated to KFD compute. This patch seems to be mostly 
about handling of reserved VMIDs. Maybe you're trying to avoid running Vulcan 
graphics and compute concurrently? But this does not prevent concurrency with 
KFD compute.

Can you clarify what this is for?

Thanks,
   Felix


Am 2023-06-07 um 06:57 schrieb Chong Li:
> enforce process isolation between graphics and compute via using the same 
> reserved vmid.
>
> v2: remove params "struct amdgpu_vm *vm" from
>  amdgpu_vmid_alloc_reserved and amdgpu_vmid_free_reserved.
>
> Signed-off-by: Chong Li 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h |  1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 10 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c | 17 +++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ids.h |  6 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  | 22 +-
>   5 files changed, 36 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index ce196badf42d..ef098a7287d0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -215,6 +215,7 @@ extern int amdgpu_force_asic_type;
>   extern int amdgpu_smartshift_bias;
>   extern int amdgpu_use_xgmi_p2p;
>   extern int amdgpu_mtype_local;
> +extern bool enforce_isolation;
>   #ifdef CONFIG_HSA_AMD
>   extern int sched_policy;
>   extern bool debug_evictions;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 3d91e123f9bd..fdb6fb8229ab 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>   uint amdgpu_pg_mask = 0xffffffff;
>   uint amdgpu_sdma_phase_quantum = 32;
>   char *amdgpu_disable_cu;
>   char *amdgpu_virtual_display;
> -
> +bool enforce_isolation;
>   /*
>* OverDrive(bit 14) disabled by default
>* GFX DCS(bit 19) disabled by default @@ -973,6 +973,14 @@
> MODULE_PARM_DESC(
>   4 = 
> AMDGPU_CPX_PARTITION_MODE)");
>   module_param_named(user_partt_mode, amdgpu_user_partt_mode, uint,
> 0444);
>
> +
> +/**
> + * DOC: enforce_isolation (bool)
> + * enforce process isolation between graphics and compute via using the same 
> reserved vmid.
> + */
> +module_param(enforce_isolation, bool, 0444);
> +MODULE_PARM_DESC(enforce_isolation, "enforce process isolation
> +between graphics and compute . enforce_isolation = on");
> +
>   /* These devices are not supported by amdgpu.
>* They are supported by the mach64, r128, radeon drivers
>*/
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> index c991ca0b7a1c..ff1ea99292fb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> @@ -409,7 +409,7 @@ int amdgpu_vmid_grab(struct amdgpu_vm *vm, struct 
> amdgpu_ring *ring,
>   if (r || !idle)
>   goto error;
>
> -	if (vm->reserved_vmid[vmhub]) {
> +	if (vm->reserved_vmid[vmhub] ||
> +	    (enforce_isolation && (vmhub == AMDGPU_GFXHUB(0)))) {
>  		r = amdgpu_vmid_grab_reserved(vm, ring, job, &id, fence);
>   if (r || !id)
>   goto error;
> @@ -460,14 +460,11 @@ int amdgpu_vmid_grab(struct amdgpu_vm *vm, struct 
> amdgpu_ri

RE: [PATCH 1/8] drm/scheduler: properly forward fence errors

2023-04-27 Thread Yin, ZhenGuo (Chris)
[AMD Official Use Only - General]

Hi, Christian

diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index fcd4bfef7415..649fac2e1ccb 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -533,12 +533,12 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
			r = dma_fence_add_callback(fence, &s_job->cb,
						   drm_sched_job_done_cb);
			if (r == -ENOENT)
-				drm_sched_job_done(s_job);
+				drm_sched_job_done(s_job, fence->error);
			else if (r)
				DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n",
					      r);
		} else
-			drm_sched_job_done(s_job);
+			drm_sched_job_done(s_job, 0);
}
 
if (full_recovery) {

I believe that the finished fence of some jobs skipped during FLR HASN'T been
set to -ECANCELED.
In function drm_sched_stop, the callback has been removed from the hw fence and
s_fence->parent has been set to NULL, see commit
45ecaea738830b9d521c93520c8f201359dcbd95 ("drm/sched: Partial revert of
'drm/sched: Keep s_fence->parent pointer'").
In function drm_sched_start, jobs in the pending list pretend to be done
without any errors (drm_sched_job_done(s_job, 0)).
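
To make the failure mode concrete, here is a toy userspace model of what
forwarding (or not forwarding) an error means for consumers of the finished
fence, such as the page-table re-generation check. This is only an
illustration; the struct layout and function names are made up and are not the
scheduler's actual data structures.

#include <errno.h>
#include <stdio.h>

/* Toy model; fields are illustrative, not the real dma_fence layout. */
struct toy_fence { int signaled; int error; };

struct toy_job {
	struct toy_fence hw_fence;  /* parent fence from the hardware */
	struct toy_fence finished;  /* scheduler's finished fence     */
};

/* Old behaviour: the finished fence signals without any error, even if the
 * hw fence (or the reset path) recorded one. */
static void job_done_old(struct toy_job *job)
{
	job->finished.signaled = 1;
}

/* Proposed behaviour: forward the result into the finished fence. */
static void job_done_new(struct toy_job *job, int result)
{
	if (result)
		job->finished.error = result;
	job->finished.signaled = 1;
}

int main(void)
{
	/* An SDMA page-table job that was dropped during FLR. */
	struct toy_job dropped = { .hw_fence = { 1, -ECANCELED } };

	job_done_old(&dropped);
	printf("old: finished error = %d (looks successful)\n",
	       dropped.finished.error);

	dropped.finished.error = 0;
	job_done_new(&dropped, dropped.hw_fence.error);
	printf("new: finished error = %d (consumers see the failure)\n",
	       dropped.finished.error);
	return 0;
}

Note that for jobs whose parent fence was already dropped in drm_sched_stop
there is no error left to forward, so they would still be reported with 0
unless an error is recorded for them explicitly.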


Best,
Zhenguo
Cloud-GPU Core team, SRDC

-Original Message-
From: amd-gfx  On Behalf Of Christian 
König
Sent: Thursday, April 20, 2023 7:58 PM
To: amd-gfx@lists.freedesktop.org
Cc: Tuikov, Luben 
Subject: [PATCH 1/8] drm/scheduler: properly forward fence errors

When a hw fence is signaled with an error properly forward that to the finished 
fence.

Signed-off-by: Christian König 
---
 drivers/gpu/drm/scheduler/sched_entity.c |  4 +---  
drivers/gpu/drm/scheduler/sched_fence.c  |  4 +++-
 drivers/gpu/drm/scheduler/sched_main.c   | 18 --
 include/drm/gpu_scheduler.h  |  2 +-
 4 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
b/drivers/gpu/drm/scheduler/sched_entity.c
index 15d04a0ec623..eaf71fe15ed3 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -144,7 +144,7 @@ static void drm_sched_entity_kill_jobs_work(struct 
work_struct *wrk)  {
struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
 
-   drm_sched_fence_finished(job->s_fence);
+   drm_sched_fence_finished(job->s_fence, -ESRCH);
WARN_ON(job->s_fence->parent);
job->sched->ops->free_job(job);
 }
@@ -195,8 +195,6 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
 	while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
 		struct drm_sched_fence *s_fence = job->s_fence;
 
-		dma_fence_set_error(&s_fence->finished, -ESRCH);
-
 		dma_fence_get(&s_fence->finished);
 		if (!prev || dma_fence_add_callback(prev, &job->finish_cb,
 						    drm_sched_entity_kill_jobs_cb))
diff --git a/drivers/gpu/drm/scheduler/sched_fence.c 
b/drivers/gpu/drm/scheduler/sched_fence.c
index 7fd869520ef2..1a6bea98c5cc 100644
--- a/drivers/gpu/drm/scheduler/sched_fence.c
+++ b/drivers/gpu/drm/scheduler/sched_fence.c
@@ -53,8 +53,10 @@ void drm_sched_fence_scheduled(struct drm_sched_fence *fence)
 	dma_fence_signal(&fence->scheduled);
 }
 
-void drm_sched_fence_finished(struct drm_sched_fence *fence)
+void drm_sched_fence_finished(struct drm_sched_fence *fence, int result)
 {
+	if (result)
+		dma_fence_set_error(&fence->finished, result);
 	dma_fence_signal(&fence->finished);
 }
 
diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index fcd4bfef7415..649fac2e1ccb 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -257,7 +257,7 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
  *
  * Finish the job's fence and wake up the worker thread.
  */
-static void drm_sched_job_done(struct drm_sched_job *s_job)
+static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
 {
 	struct drm_sched_fence *s_fence = s_job->s_fence;
 	struct drm_gpu_scheduler *sched = s_fence->sched;
@@ -268,7 +268,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job)
 	trace_drm_sched_process_job(s_fence);
 
 	dma_fence_get(&s_fence->finished);
-	drm_sched_fence_finished(s_fence);
+	drm_sched_fence_finished(s_fence, result);
 	dma_fence_put(&s_fence->finished);
 	wake_up_interruptible(&sched->wake_up_worker);
 }
@@ -282,7 +282,7 @@ static void drm_sched_job_done_cb(struct dma_fence *f, struct dma_fence_cb *cb)
 {
struct drm_sched_job *s_job = container_of(cb, 

Re: [PATCH] drm/amdgpu: Stop clearing kiq position during fini

2023-02-28 Thread Yin, ZhenGuo (Chris)

Acked-by: ZhenGuo Yin 

On 2/27/2023 2:45 PM, Yaoyao Lei wrote:

Do not clear kiq position in RLC_CP_SCHEDULER so that CP could perform
IDLE-SAVE after VF fini.
Otherwise it could cause GFX hang if another Win guest is rendering.

Signed-off-by: Yaoyao Lei 
---
  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 14 +++---
  1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 6983acc456b2..073f5f23bc3b 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -7285,17 +7285,9 @@ static int gfx_v10_0_hw_fini(void *handle)
  
  	if (amdgpu_sriov_vf(adev)) {

gfx_v10_0_cp_gfx_enable(adev, false);
-   /* Program KIQ position of RLC_CP_SCHEDULERS during destroy */
-   if (adev->ip_versions[GC_HWIP][0] >= IP_VERSION(10, 3, 0)) {
-   tmp = RREG32_SOC15(GC, 0, 
mmRLC_CP_SCHEDULERS_Sienna_Cichlid);
-   tmp &= 0xff00;
-   WREG32_SOC15(GC, 0, mmRLC_CP_SCHEDULERS_Sienna_Cichlid, 
tmp);
-   } else {
-   tmp = RREG32_SOC15(GC, 0, mmRLC_CP_SCHEDULERS);
-   tmp &= 0xff00;
-   WREG32_SOC15(GC, 0, mmRLC_CP_SCHEDULERS, tmp);
-   }
-
+   /* Remove the steps of clearing KIQ position.
+* It causes GFX hang when another Win guest is rendering.
+*/
return 0;
}
gfx_v10_0_cp_enable(adev, false);


Re: [PATCH 1/5] drm/amd/amdgpu revert "implement tdr advanced mode"

2023-01-28 Thread Yin, ZhenGuo (Chris)

Hi, Christian

A later bad compute job can block a good gfx job, so the original TDR
design finds the wrong guilty job (the good gfx job).


Advanced TDR re-submits jobs in order to find the real guilty job (the bad
compute job).


After reverting this commit, how does the new gang-submit guarantee
isolation between compute jobs and gfx jobs?


On 10/26/2022 11:35 PM, Christian König wrote:

This reverts commit e6c6338f393b74ac0b303d567bb918b44ae7ad75.

This feature basically re-submits one job after another to
figure out which one was the one causing a hang.

This is obviously incompatible with gang-submit which requires
that multiple jobs run at the same time. It's also absolutely
not helpful to crash the hardware multiple times if a clean
recovery is desired.

For testing and debugging environments we should rather disable
recovery altogether to be able to inspect the state with a hw
debugger.

Additional to that the sw implementation is clearly buggy and causes
reference count issues for the hardware fence.

Signed-off-by: Christian König 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 103 -
  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|   2 +-
  drivers/gpu/drm/scheduler/sched_main.c |  58 ++--
  include/drm/gpu_scheduler.h|   3 -
  4 files changed, 10 insertions(+), 156 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 6f958603c8cc..d4584e577b51 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5070,94 +5070,6 @@ static int amdgpu_device_suspend_display_audio(struct 
amdgpu_device *adev)
return 0;
  }
  
-static void amdgpu_device_recheck_guilty_jobs(

-   struct amdgpu_device *adev, struct list_head *device_list_handle,
-   struct amdgpu_reset_context *reset_context)
-{
-   int i, r = 0;
-
-   for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
-   struct amdgpu_ring *ring = adev->rings[i];
-   int ret = 0;
-   struct drm_sched_job *s_job;
-
-   if (!ring || !ring->sched.thread)
-   continue;
-
-   s_job = list_first_entry_or_null(&ring->sched.pending_list,
-   struct drm_sched_job, list);
-   if (s_job == NULL)
-   continue;
-
-   /* clear job's guilty and depend the folowing step to decide 
the real one */
-   drm_sched_reset_karma(s_job);
-   drm_sched_resubmit_jobs_ext(&ring->sched, 1);
-
-   if (!s_job->s_fence->parent) {
-   DRM_WARN("Failed to get a HW fence for job!");
-   continue;
-   }
-
-   ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, 
ring->sched.timeout);
-   if (ret == 0) { /* timeout */
-   DRM_ERROR("Found the real bad job! ring:%s, 
job_id:%llx\n",
-   ring->sched.name, s_job->id);
-
-
-   amdgpu_fence_driver_isr_toggle(adev, true);
-
-   /* Clear this failed job from fence array */
-   amdgpu_fence_driver_clear_job_fences(ring);
-
-   amdgpu_fence_driver_isr_toggle(adev, false);
-
-   /* Since the job won't signal and we go for
-* another resubmit drop this parent pointer
-*/
-   dma_fence_put(s_job->s_fence->parent);
-   s_job->s_fence->parent = NULL;
-
-   /* set guilty */
-   drm_sched_increase_karma(s_job);
-   amdgpu_reset_prepare_hwcontext(adev, reset_context);
-retry:
-   /* do hw reset */
-   if (amdgpu_sriov_vf(adev)) {
-   amdgpu_virt_fini_data_exchange(adev);
-   r = amdgpu_device_reset_sriov(adev, false);
-   if (r)
-   adev->asic_reset_res = r;
-   } else {
-   clear_bit(AMDGPU_SKIP_HW_RESET,
-             &reset_context->flags);
-   r = amdgpu_do_asic_reset(device_list_handle,
-reset_context);
-   if (r && r == -EAGAIN)
-   goto retry;
-   }
-
-   /*
-* add reset counter so that the following
-* resubmitted job could flush vmid
-*/
-   atomic_inc(&adev->gpu_reset_counter);
-   continue;
-   }
-
-   /* got the hw fence, signal finished fence */
-   

Re: [PATCH] drm/ttm: update bulk move object of ghost BO

2022-09-05 Thread Yin, ZhenGuo (Chris)
Inside the function ttm_bo_set_bulk_move, it calls
ttm_resource_del_bulk_move to remove the old resource from the bulk_move
list.


If we set bulk_move to NULL manually as suggested, the old resource
attached to the ghost BO seemingly won't be removed from the bulk_move.
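
If I understand the suggestion correctly, the end result would be something
like the sketch below in ttm_buffer_object_transfer(): keep
ttm_bo_set_bulk_move() for the case where a resource is handed over, so that
it is properly removed from the bulk move, and only clear the pointer directly
when there is no resource. This is just my reading of the proposal, not a
tested change:

	if (fbo->base.resource) {
		ttm_resource_set_bo(fbo->base.resource, &fbo->base);
		bo->resource = NULL;
		ttm_bo_set_bulk_move(&fbo->base, NULL);
	} else {
		fbo->base.bulk_move = NULL;
	}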


On 9/1/2022 7:13 PM, Christian König wrote:

Am 01.09.22 um 13:11 schrieb Christian König:

Am 01.09.22 um 11:29 schrieb ZhenGuo Yin:

[Why]
Ghost BO is released with non-empty bulk move object. There is a
warning trace:
WARNING: CPU: 19 PID: 1582 at ttm/ttm_bo.c:366 
ttm_bo_release+0x2e1/0x2f0 [amdttm]

Call Trace:
   amddma_resv_reserve_fences+0x10d/0x1f0 [amdkcl]
   amdttm_bo_put+0x28/0x30 [amdttm]
   amdttm_bo_move_accel_cleanup+0x126/0x200 [amdttm]
   amdgpu_bo_move+0x1a8/0x770 [amdgpu]
   ttm_bo_handle_move_mem+0xb0/0x140 [amdttm]
   amdttm_bo_validate+0xbf/0x100 [amdttm]

[How]
The resource of ghost BO should be moved to LRU directly, instead of
using bulk move. The bulk move object of ghost BO should set to NULL
before function ttm_bo_move_to_lru_tail_unlocked.

Fixes: 5b951e487fd6bf5f ("drm/ttm: fix bulk move handling v2")
Signed-off-by: ZhenGuo Yin 


Good catch, but the fix is not 100% correct. Please rather just NULL 
the member while initializing the BO structure.


E.g. something like this:

 
 fbo->base.pin_count = 0;
+fbo->base.bulk_move = NULL;
 if (bo->type != ttm_bo_type_sg)
 


On the other hand thinking about it that won't work either.

You need to set bulk_move to NULL manually in an else clauses or 
something like this.


Regards,
Christian.



Thanks,
Christian.


---
  drivers/gpu/drm/ttm/ttm_bo_util.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c 
b/drivers/gpu/drm/ttm/ttm_bo_util.c

index 1cbfb00c1d65..a90bbbd91910 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -238,6 +238,7 @@ static int ttm_buffer_object_transfer(struct 
ttm_buffer_object *bo,

    if (fbo->base.resource) {
    ttm_resource_set_bo(fbo->base.resource, &fbo->base);
+    ttm_bo_set_bulk_move(&fbo->base, NULL);
  bo->resource = NULL;
  }