RE: [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost)
[AMD Official Use Only - AMD Internal Distribution Only]

Hi Christian,

I prepared this patch because we met a page fault after GPU reset in SRIOV, triggered by quark. After investigation, I found that the page table did not get restored after GPU reset. I tried vm_update_mode=0 to disable VM update by CPU, and this issue still exists.

-[Christian] When VRAM is lost any submission using the VM entity will fail and so result in a new page table generation.

I believe that you are referring to this piece of code in function amdgpu_job_run():

	/* Skip job if VRAM is lost and never resubmit gangs */
	if (job->generation != amdgpu_vm_generation(adev, job->vm) ||
	    (job->job_run_counter && job->gang_submit))
		dma_fence_set_error(finished, -ECANCELED);

I agree that the submission from the old ctx will be set as -ECANCELED and be skipped. But if the application then creates a new ctx and submits a new job, both new_ctx->generation and new_job->generation will be initialized to the updated value:

	ctx->generation = amdgpu_vm_generation(mgr->adev, &fpriv->vm);
	(*job)->generation = amdgpu_vm_generation(adev, vm);

And the job will be executed, hence there will be a page fault due to VRAM lost. Please correct me if I have some misunderstanding. I still cannot see why any submission using the VM entity will fail due to VRAM lost.

Best,
Zhenguo
Cloud-GPU Core team, SRDC

-----Original Message-----
From: Christian König
Sent: Tuesday, July 23, 2024 3:13 PM
To: Yin, ZhenGuo (Chris) ; amd-gfx@lists.freedesktop.org
Cc: Koenig, Christian
Subject: Re: [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost)

Am 23.07.24 um 05:05 schrieb ZhenGuo Yin:
> [Why]
> Page table of compute VM in the VRAM will be lost after gpu reset.
> VRAM won't be restored since compute VM has no shadows.
>
> [How]
> Use higher 32-bit of vm->generation to record a vram_lost_counter.
> Reset the VM state machine when vm->generation is not equal to the
> re-generation token.
>
> v2: Check vm->generation instead of calling drm_sched_entity_error in
> amdgpu_vm_validate.

I've just double checked the logic again and as far as I can see this patch here is still completely superfluous.

When VRAM is lost any submission using the VM entity will fail and so result in a new page table generation.

What isn't handled are CPU based page table updates, but for those we could potentially change the condition inside the CPU page tables code.

Regards,
Christian.

>
> Signed-off-by: ZhenGuo Yin
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 3abfa66d72a2..9e2f84c166e7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -434,7 +434,7 @@ uint64_t amdgpu_vm_generation(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>  	if (!vm)
>  		return result;
>
> -	result += vm->generation;
> +	result += (vm->generation & 0xffffffff);
>  	/* Add one if the page tables will be re-generated on next CS */
>  	if (drm_sched_entity_error(&vm->delayed))
>  		++result;
> @@ -467,9 +467,12 @@ int amdgpu_vm_validate(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>  	struct amdgpu_bo *shadow;
>  	struct amdgpu_bo *bo;
>  	int r;
> +	uint32_t vram_lost_counter = atomic_read(&adev->vram_lost_counter);
>
> -	if (drm_sched_entity_error(&vm->delayed)) {
> -		++vm->generation;
> +	if (vm->generation != amdgpu_vm_generation(adev, vm)) {
> +		if (drm_sched_entity_error(&vm->delayed))
> +			++vm->generation;
> +		vm->generation = (u64)vram_lost_counter << 32 | (vm->generation &
> +0xffffffff);
>  		amdgpu_vm_bo_reset_state_machine(vm);
>  		amdgpu_vm_fini_entities(vm);
>  		r = amdgpu_vm_init_entities(adev, vm);
> @@ -2439,7 +2442,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>  	vm->last_update = dma_fence_get_stub();
>  	vm->last_unlocked = dma_fence_get_stub();
>  	vm->last_tlb_flush = dma_fence_get_stub();
> -	vm->generation = 0;
> +	vm->generation = (u64)atomic_read(&adev->vram_lost_counter) << 32;
>
>  	mutex_init(&vm->eviction_lock);
>  	vm->evicting = false;
RE: [PATCH] drm/amdgpu: reset vm state machine after gpu reset(vram lost)
[AMD Official Use Only - AMD Internal Distribution Only]

Hi, Christian

Why would losing VRAM result in the vm entity becoming invalid? I think the entity vm->delayed will only be set as error if a fence error appeared (like a pending SDMA job that timed out or got cancelled). If a gpu reset is triggered by a GFX job, and there is no SDMA job in the pending list, the entity won't be set as error.

Best,
Zhenguo
Cloud-GPU Core team, SRDC

-----Original Message-----
From: Koenig, Christian
Sent: Friday, July 19, 2024 5:22 PM
To: Yin, ZhenGuo (Chris) ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: reset vm state machine after gpu reset(vram lost)

Am 19.07.24 um 11:19 schrieb ZhenGuo Yin:
> [Why]
> Page table of compute VM in the VRAM will be lost after gpu reset.
> VRAM won't be restored since compute VM has no shadows.
>
> [How]
> Use higher 32-bit of vm->generation to record a vram_lost_counter.
> Reset the VM state machine when the counter is not equal to current
> vram_lost_counter of the device.

Mhm, that was my original approach as well but we came to the conclusion that this shouldn't be necessary since losing VRAM would result in the entity becoming invalid as well.

Why is that necessary?

Regards,
Christian.
>
> Signed-off-by: ZhenGuo Yin
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 3abfa66d72a2..fd7f912816dc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -434,7 +434,7 @@ uint64_t amdgpu_vm_generation(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>  	if (!vm)
>  		return result;
>
> -	result += vm->generation;
> +	result += (vm->generation & 0xffffffff);
>  	/* Add one if the page tables will be re-generated on next CS */
>  	if (drm_sched_entity_error(&vm->delayed))
>  		++result;
> @@ -467,6 +467,12 @@ int amdgpu_vm_validate(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>  	struct amdgpu_bo *shadow;
>  	struct amdgpu_bo *bo;
>  	int r;
> +	uint32_t vram_lost_counter = atomic_read(&adev->vram_lost_counter);
> +
> +	if ((vm->generation >> 32) != vram_lost_counter) {
> +		amdgpu_vm_bo_reset_state_machine(vm);
> +		vm->generation = (u64)vram_lost_counter << 32 | (vm->generation & 0xffffffff);
> +	}
>
>  	if (drm_sched_entity_error(&vm->delayed)) {
>  		++vm->generation;
> @@ -2439,7 +2445,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>  	vm->last_update = dma_fence_get_stub();
>  	vm->last_unlocked = dma_fence_get_stub();
>  	vm->last_tlb_flush = dma_fence_get_stub();
> -	vm->generation = 0;
> +	vm->generation = (u64)atomic_read(&adev->vram_lost_counter) << 32;
>
>  	mutex_init(&vm->eviction_lock);
>  	vm->evicting = false;
RE: [PATCH] drm/amdgpu: clear set_q_mode_offs when VM changed
[AMD Official Use Only - General]

Yeah, I agree. But I have no idea which place is better.

Best,
Zhenguo
Cloud-GPU Core team, SRDC

-----Original Message-----
From: Koenig, Christian
Sent: Tuesday, April 2, 2024 4:14 PM
To: Yin, ZhenGuo (Chris) ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander
Subject: Re: [PATCH] drm/amdgpu: clear set_q_mode_offs when VM changed

Am 02.04.24 um 08:37 schrieb ZhenGuo Yin:
> [Why]
> set_q_mode_offs doesn't get cleared after GPU reset, so the next SET_Q_MODE
> packet to init shadow memory will be skipped, hence there is a page fault.
>
> [How]
> A VM flush is needed after GPU reset; clear set_q_mode_offs when
> emitting the VM flush.
>
> Fixes: 8dad9c062355 ("drm/amdgpu: workaround to avoid SET_Q_MODE
> packets v2")
> Signed-off-by: ZhenGuo Yin

Good catch, yeah it probably doesn't make much sense to execute this after a VM flush.

Reviewed-by: Christian König

> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> index 7a906318e451..c11c6299711e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> @@ -5465,6 +5465,7 @@ static void gfx_v11_0_ring_emit_vm_flush(struct amdgpu_ring *ring,
>  	/* Make sure that we can't skip the SET_Q_MODE packets when the VM
>  	 * changed in any way.
>  	 */
> +	ring->set_q_mode_offs = 0;
>  	ring->set_q_mode_ptr = NULL;
>  }
>
RE: [PATCH] drm/amd/amdgpu: add pipe1 hardware support
[AMD Official Use Only - General]

Hi, Alex

Could you please help to review this patch?
This should fix the bug: https://gitlab.freedesktop.org/drm/amd/-/issues/2117. Thx.

Best,
Zhenguo
Cloud-GPU Core team, SRDC

-----Original Message-----
From: Yin, ZhenGuo (Chris)
Sent: Friday, March 15, 2024 2:12 PM
To: amd-gfx@lists.freedesktop.org
Cc: mdaen...@redhat.com; Deucher, Alexander ; Koenig, Christian ; Paneer Selvam, Arunpravin ; Deng, Emily ; Liu, Monk ; Yin, ZhenGuo (Chris)
Subject: [PATCH] drm/amd/amdgpu: add pipe1 hardware support

Enable pipe1 support starting from SIENNA CICHLID asic.
Need to use correct ref/mask for pipe1 hdp flush.

Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/2117
Fixes: 085292c3d780 ("Revert "drm/amd/amdgpu: add pipe1 hardware support"")
Signed-off-by: ZhenGuo Yin
---
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index f90905ef32c7..5eb6f189920b 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -4518,7 +4518,7 @@ static int gfx_v10_0_sw_init(void *handle)
 	case IP_VERSION(10, 3, 3):
 	case IP_VERSION(10, 3, 7):
 		adev->gfx.me.num_me = 1;
-		adev->gfx.me.num_pipe_per_me = 1;
+		adev->gfx.me.num_pipe_per_me = 2;
 		adev->gfx.me.num_queue_per_pipe = 1;
 		adev->gfx.mec.num_mec = 2;
 		adev->gfx.mec.num_pipe_per_mec = 4;
@@ -8317,7 +8317,7 @@ static void gfx_v10_0_ring_emit_hdp_flush(struct amdgpu_ring *ring)
 		}
 		reg_mem_engine = 0;
 	} else {
-		ref_and_mask = nbio_hf_reg->ref_and_mask_cp0;
+		ref_and_mask = nbio_hf_reg->ref_and_mask_cp0 << ring->pipe;
 		reg_mem_engine = 1; /* pfp */
 	}
--
2.35.1
Re: [PATCH] drm/amdgpu: Skip access gfx11 golden registers under SRIOV
Add potential reviewers.

On 11/23/2023 4:57 PM, ZhenGuo Yin wrote:
[Why]
Golden registers are PF-only registers on gfx11.
RLCG interface will return "out-of-range" under SRIOV VF.

[How]
Skip access gfx11 golden registers under SRIOV.

Signed-off-by: ZhenGuo Yin
---
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index 8ed4a6fb147a..288e926e5cd4 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -293,6 +293,9 @@ static void gfx_v11_0_set_kiq_pm4_funcs(struct amdgpu_device *adev)
 
 static void gfx_v11_0_init_golden_registers(struct amdgpu_device *adev)
 {
+	if (amdgpu_sriov_vf(adev))
+		return;
+
 	switch (amdgpu_ip_version(adev, GC_HWIP, 0)) {
 	case IP_VERSION(11, 0, 1):
 	case IP_VERSION(11, 0, 4):
Re: [PATCH v2] drm/amdgpu doorbell range should be set when gpu recovery
Acked-by: ZhenGuo Yin

On 10/30/2023 5:57 PM, Lin.Cao wrote:
GFX doorbell range should be set after FLR otherwise the gfx doorbell range will overlap with MEC.

v2: remove "amdgpu_sriov_vf" and "amdgpu_in_reset" check, and add grbm select for the case of 2 gfx rings.

Signed-off-by: Lin.Cao
---
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index d9ccacd06fba..c9f4e8252070 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -6467,6 +6467,13 @@ static int gfx_v10_0_gfx_init_queue(struct amdgpu_ring *ring)
 		if (adev->gfx.me.mqd_backup[mqd_idx])
 			memcpy(adev->gfx.me.mqd_backup[mqd_idx], mqd, sizeof(*mqd));
 	} else {
+		mutex_lock(&adev->srbm_mutex);
+		nv_grbm_select(adev, ring->me, ring->pipe, ring->queue, 0);
+		if (ring->doorbell_index == adev->doorbell_index.gfx_ring0 << 1)
+			gfx_v10_0_cp_gfx_set_doorbell(adev, ring);
+
+		nv_grbm_select(adev, 0, 0, 0, 0);
+		mutex_unlock(&adev->srbm_mutex);
 		/* restore mqd with the backup copy */
 		if (adev->gfx.me.mqd_backup[mqd_idx])
 			memcpy(mqd, adev->gfx.me.mqd_backup[mqd_idx], sizeof(*mqd));
RE: [PATCH 1/8] drm/scheduler: properly forward fence errors
[AMD Official Use Only - General]

Ping..

Actually, I prepared a patch aiming to fix this issue. But I'm not sure whether this is proper for drm/scheduler.

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 9654e8942382..35dc0b86a18e 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -463,6 +463,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
 					      &s_job->cb)) {
 			dma_fence_put(s_job->s_fence->parent);
 			s_job->s_fence->parent = NULL;
+			dma_fence_set_error(&s_job->s_fence->finished, -EHWPOISON);
 			atomic_dec(&sched->hw_rq_count);
 		} else {
 			/*

Best,
Zhenguo
Cloud-GPU Core team, SRDC

-----Original Message-----
From: Yin, ZhenGuo (Chris)
Sent: Thursday, August 17, 2023 4:17 PM
To: Christian König ; amd-gfx@lists.freedesktop.org
Cc: Tuikov, Luben ; Chen, JingWen (Wayne) ; Liu, Monk ; Li, Chong(Alan) ; cao, lin
Subject: RE: [PATCH 1/8] drm/scheduler: properly forward fence errors

Hi, @Christian König
Any updates for the fix?
Recently we found that there will be a page fault after FLR, since an SDMA job in the pending list was dropped without forwarding fence errors.

Best,
Zhenguo
Cloud-GPU Core team, SRDC

-----Original Message-----
From: Christian König
Sent: Thursday, April 27, 2023 8:05 PM
To: Yin, ZhenGuo (Chris) ; amd-gfx@lists.freedesktop.org
Cc: Tuikov, Luben ; Chen, JingWen (Wayne) ; Liu, Monk
Subject: Re: [PATCH 1/8] drm/scheduler: properly forward fence errors

Well good point, but as part of the effort of the Intel team to move the scheduler over to a work item based design those two functions are probably about to be removed.

Since we will probably have that in the internal package for a bit longer I'm going to send a fix for this.

Regards,
Christian.
Am 27.04.23 um 12:35 schrieb Yin, ZhenGuo (Chris):
> [AMD Official Use Only - General]
>
> Hi, Christian
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index fcd4bfef7415..649fac2e1ccb 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -533,12 +533,12 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>  			r = dma_fence_add_callback(fence, &s_job->cb,
>  						   drm_sched_job_done_cb);
>  			if (r == -ENOENT)
> -				drm_sched_job_done(s_job);
> +				drm_sched_job_done(s_job, fence->error);
>  			else if (r)
>  				DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n", r);
>  		} else
> -			drm_sched_job_done(s_job);
> +			drm_sched_job_done(s_job, 0);
>  	}
>
>  	if (full_recovery) {
>
> I believe that the finished fence of some skipped jobs during FLR HASN'T been set to -ECANCELED.
> In function drm_sched_stop, the callback has been removed from the hw_fence and s_fence->parent has been set to NULL, see commit 45ecaea738830b9d521c93520c8f201359dcbd95 ("drm/sched: Partial revert of 'drm/sched: Keep s_fence->parent pointer'").
> In function drm_sched_start, jobs in the pending list pretend to be done without any errors (drm_sched_job_done(s_job, 0)).
>
>
> Best,
> Zhenguo
> Cloud-GPU Core team, SRDC
>
> -----Original Message-----
> From: amd-gfx On Behalf Of Christian König
> Sent: Thursday, April 20, 2023 7:58 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Tuikov, Luben
> Subject: [PATCH 1/8] drm/scheduler: properly forward fence errors
>
> When a hw fence is signaled with an error properly forward that to the finished fence.
>
> Signed-off-by: Christian König
> ---
>  drivers/gpu/drm/scheduler/sched_entity.c |  4 +---
>  drivers/gpu/drm/scheduler/sched_fence.c  |  4 +++-
>  drivers/gpu/drm/scheduler/sched_main.c   | 18 ++++++++----------
>  include/drm/gpu_scheduler.h              |  2 +-
>  4 files changed, 13 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> b/drivers/gpu/drm/scheduler/sched_entity.c
> index 15d04a0ec623..eaf71fe15ed3 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -144,7 +144,7 @@ static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
> {
>  	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
>
> -	drm_sched_fence_finished(job->s_fence);
RE: [PATCH 1/8] drm/scheduler: properly forward fence errors
[AMD Official Use Only - General]

Hi, @Christian König
Any updates for the fix?
Recently we found that there will be a page fault after FLR, since an SDMA job in the pending list was dropped without forwarding fence errors.

Best,
Zhenguo
Cloud-GPU Core team, SRDC

-----Original Message-----
From: Christian König
Sent: Thursday, April 27, 2023 8:05 PM
To: Yin, ZhenGuo (Chris) ; amd-gfx@lists.freedesktop.org
Cc: Tuikov, Luben ; Chen, JingWen (Wayne) ; Liu, Monk
Subject: Re: [PATCH 1/8] drm/scheduler: properly forward fence errors

Well good point, but as part of the effort of the Intel team to move the scheduler over to a work item based design those two functions are probably about to be removed.

Since we will probably have that in the internal package for a bit longer I'm going to send a fix for this.

Regards,
Christian.

Am 27.04.23 um 12:35 schrieb Yin, ZhenGuo (Chris):
> [AMD Official Use Only - General]
>
> Hi, Christian
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index fcd4bfef7415..649fac2e1ccb 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -533,12 +533,12 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>  			r = dma_fence_add_callback(fence, &s_job->cb,
>  						   drm_sched_job_done_cb);
>  			if (r == -ENOENT)
> -				drm_sched_job_done(s_job);
> +				drm_sched_job_done(s_job, fence->error);
>  			else if (r)
>  				DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n", r);
>  		} else
> -			drm_sched_job_done(s_job);
> +			drm_sched_job_done(s_job, 0);
>  	}
>
>  	if (full_recovery) {
>
> I believe that the finished fence of some skipped jobs during FLR HASN'T been set to -ECANCELED.
> In function drm_sched_stop, the callback has been removed from the hw_fence and s_fence->parent has been set to NULL, see commit 45ecaea738830b9d521c93520c8f201359dcbd95 ("drm/sched: Partial revert of 'drm/sched: Keep s_fence->parent pointer'").
> In function drm_sched_start, jobs in the pending list pretend to be done without any errors (drm_sched_job_done(s_job, 0)).
>
>
> Best,
> Zhenguo
> Cloud-GPU Core team, SRDC
>
> -----Original Message-----
> From: amd-gfx On Behalf Of Christian König
> Sent: Thursday, April 20, 2023 7:58 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Tuikov, Luben
> Subject: [PATCH 1/8] drm/scheduler: properly forward fence errors
>
> When a hw fence is signaled with an error properly forward that to the finished fence.
>
> Signed-off-by: Christian König
> ---
>  drivers/gpu/drm/scheduler/sched_entity.c |  4 +---
>  drivers/gpu/drm/scheduler/sched_fence.c  |  4 +++-
>  drivers/gpu/drm/scheduler/sched_main.c   | 18 ++++++++----------
>  include/drm/gpu_scheduler.h              |  2 +-
>  4 files changed, 13 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> b/drivers/gpu/drm/scheduler/sched_entity.c
> index 15d04a0ec623..eaf71fe15ed3 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -144,7 +144,7 @@ static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
> {
>  	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
>
> -	drm_sched_fence_finished(job->s_fence);
> +	drm_sched_fence_finished(job->s_fence, -ESRCH);
>  	WARN_ON(job->s_fence->parent);
>  	job->sched->ops->free_job(job);
>  }
> @@ -195,8 +195,6 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
>  	while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
>  		struct drm_sched_fence *s_fence = job->s_fence;
>
> -		dma_fence_set_error(&s_fence->finished, -ESRCH);
> -
>  		dma_fence_get(&s_fence->finished);
>  		if (!prev || dma_fence_add_callback(prev, &job->finish_cb,
>  						    drm_sched_entity_kill_jobs_cb))
> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c
> b/drivers/gpu/drm/scheduler/sched_fence.c
> index 7fd869520ef2..1a6bea98c5cc 100644
> --- a/drivers/gpu/drm/scheduler/sched_fence.c
> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
> @@ -53,8 +53,10 @@ void drm_sched_fence_scheduled(struct drm_sched_fence *fence)
>  	dma_fence_signal(&fence->scheduled);
>  }
>
> -void drm_sched_fence_finished(struct drm_sched_fence *fence)
RE: [PATCH v2 06/07] drm/amdgpu: add option params to enforce process isolation between graphics and compute
[AMD Official Use Only - General]

Hi Felix

Yes. This patch is aiming to avoid concurrent running of the kernel graphics queue and kernel compute queue.

Previously we might get the wrong guilty job if a bad compute job and a good gfx job were submitted concurrently. For example, Xorg might crash if we submitted a bad compute job in another process at the same time. Advanced TDR can resolve this issue by the extra job resubmission to identify the real bad job. After removing advanced TDR (commit 06a2d7cc3f0476be4682ef90eb09a28fa3daed37), we need another method to fix this issue.

As suggested by Christian, we can use the reserved vmid to prevent the concurrency of compute and gfx to enforce isolation. Here we only focus on KGQ/KCQ, not considering KFD compute since it is using another set of VMIDs.

Christian, do you have any comments?

Best,
Zhenguo

-----Original Message-----
From: Kuehling, Felix
Sent: Wednesday, June 7, 2023 9:35 PM
To: Li, Chong(Alan) ; amd-gfx@lists.freedesktop.org
Cc: Koenig, Christian ; Yin, ZhenGuo (Chris)
Subject: Re: [PATCH v2 06/07] drm/amdgpu: add option params to enforce process isolation between graphics and compute

I can't see the other patches in this series, so I'm missing some context. I don't understand what "process isolation between graphics and compute" means here. It seems to be unrelated to KFD compute. This patch seems to be mostly about handling of reserved VMIDs.

Maybe you're trying to avoid running Vulkan graphics and compute concurrently? But this does not prevent concurrency with KFD compute. Can you clarify what this is for?

Thanks,
Felix

Am 2023-06-07 um 06:57 schrieb Chong Li:
> enforce process isolation between graphics and compute via using the same reserved vmid.
>
> v2: remove params "struct amdgpu_vm *vm" from
> amdgpu_vmid_alloc_reserved and amdgpu_vmid_free_reserved.
>
> Signed-off-by: Chong Li
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 10 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c | 17 +++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ids.h |  6 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  | 22 +-
>  5 files changed, 36 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index ce196badf42d..ef098a7287d0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -215,6 +215,7 @@ extern int amdgpu_force_asic_type;
>  extern int amdgpu_smartshift_bias;
>  extern int amdgpu_use_xgmi_p2p;
>  extern int amdgpu_mtype_local;
> +extern bool enforce_isolation;
>  #ifdef CONFIG_HSA_AMD
>  extern int sched_policy;
>  extern bool debug_evictions;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 3d91e123f9bd..fdb6fb8229ab 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -153,7 +153,7 @@ uint amdgpu_pg_mask = 0xffffffff;
>  uint amdgpu_sdma_phase_quantum = 32;
>  char *amdgpu_disable_cu;
>  char *amdgpu_virtual_display;
> -
> +bool enforce_isolation;
>  /*
>   * OverDrive(bit 14) disabled by default
>   * GFX DCS(bit 19) disabled by default
> @@ -973,6 +973,14 @@ MODULE_PARM_DESC(
>  		 4 = AMDGPU_CPX_PARTITION_MODE)");
>  module_param_named(user_partt_mode, amdgpu_user_partt_mode, uint, 0444);
>
> +
> +/**
> + * DOC: enforce_isolation (bool)
> + * enforce process isolation between graphics and compute via using the same reserved vmid.
> + */
> +module_param(enforce_isolation, bool, 0444);
> +MODULE_PARM_DESC(enforce_isolation, "enforce process isolation between graphics and compute . enforce_isolation = on");
> +
>  /* These devices are not supported by amdgpu.
>   * They are supported by the mach64, r128, radeon drivers
>   */
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> index c991ca0b7a1c..ff1ea99292fb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> @@ -409,7 +409,7 @@ int amdgpu_vmid_grab(struct amdgpu_vm *vm, struct amdgpu_ring *ring,
>  	if (r || !idle)
>  		goto error;
>
> -	if (vm->reserved_vmid[vmhub]) {
> +	if (vm->reserved_vmid[vmhub] || (enforce_isolation && (vmhub == AMDGPU_GFXHUB(0)))) {
>  		r = amdgpu_vmid_grab_reserved(vm, ring, job, &id, fence);
>  		if (r || !id)
>  			goto error;
> @@ -460,14 +460,11 @@ int amdgpu_vmid_grab(struct amdgpu_vm *vm, struct amdgpu_ri
RE: [PATCH 1/8] drm/scheduler: properly forward fence errors
[AMD Official Use Only - General]

Hi, Christian

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index fcd4bfef7415..649fac2e1ccb 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -533,12 +533,12 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
 			r = dma_fence_add_callback(fence, &s_job->cb,
 						   drm_sched_job_done_cb);
 			if (r == -ENOENT)
-				drm_sched_job_done(s_job);
+				drm_sched_job_done(s_job, fence->error);
 			else if (r)
 				DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n", r);
 		} else
-			drm_sched_job_done(s_job);
+			drm_sched_job_done(s_job, 0);
 	}
 
 	if (full_recovery) {

I believe that the finished fence of some skipped jobs during FLR HASN'T been set to -ECANCELED.
In function drm_sched_stop, the callback has been removed from the hw_fence and s_fence->parent has been set to NULL, see commit 45ecaea738830b9d521c93520c8f201359dcbd95 ("drm/sched: Partial revert of 'drm/sched: Keep s_fence->parent pointer'").
In function drm_sched_start, jobs in the pending list pretend to be done without any errors (drm_sched_job_done(s_job, 0)).

Best,
Zhenguo
Cloud-GPU Core team, SRDC

-----Original Message-----
From: amd-gfx On Behalf Of Christian König
Sent: Thursday, April 20, 2023 7:58 PM
To: amd-gfx@lists.freedesktop.org
Cc: Tuikov, Luben
Subject: [PATCH 1/8] drm/scheduler: properly forward fence errors

When a hw fence is signaled with an error properly forward that to the finished fence.
Signed-off-by: Christian König
---
 drivers/gpu/drm/scheduler/sched_entity.c |  4 +---
 drivers/gpu/drm/scheduler/sched_fence.c  |  4 +++-
 drivers/gpu/drm/scheduler/sched_main.c   | 18 ++++++++----------
 include/drm/gpu_scheduler.h              |  2 +-
 4 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 15d04a0ec623..eaf71fe15ed3 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -144,7 +144,7 @@ static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
 {
 	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
 
-	drm_sched_fence_finished(job->s_fence);
+	drm_sched_fence_finished(job->s_fence, -ESRCH);
 	WARN_ON(job->s_fence->parent);
 	job->sched->ops->free_job(job);
 }
@@ -195,8 +195,6 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
 	while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
 		struct drm_sched_fence *s_fence = job->s_fence;
 
-		dma_fence_set_error(&s_fence->finished, -ESRCH);
-
 		dma_fence_get(&s_fence->finished);
 		if (!prev || dma_fence_add_callback(prev, &job->finish_cb,
 						    drm_sched_entity_kill_jobs_cb))
diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
index 7fd869520ef2..1a6bea98c5cc 100644
--- a/drivers/gpu/drm/scheduler/sched_fence.c
+++ b/drivers/gpu/drm/scheduler/sched_fence.c
@@ -53,8 +53,10 @@ void drm_sched_fence_scheduled(struct drm_sched_fence *fence)
 	dma_fence_signal(&fence->scheduled);
 }
 
-void drm_sched_fence_finished(struct drm_sched_fence *fence)
+void drm_sched_fence_finished(struct drm_sched_fence *fence, int result)
 {
+	if (result)
+		dma_fence_set_error(&fence->finished, result);
 	dma_fence_signal(&fence->finished);
 }
 
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index fcd4bfef7415..649fac2e1ccb 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -257,7 +257,7 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
  *
  * Finish the job's fence and wake up the worker thread.
  */
-static void drm_sched_job_done(struct drm_sched_job *s_job)
+static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
 {
 	struct drm_sched_fence *s_fence = s_job->s_fence;
 	struct drm_gpu_scheduler *sched = s_fence->sched;
@@ -268,7 +268,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job)
 	trace_drm_sched_process_job(s_fence);
 
 	dma_fence_get(&s_fence->finished);
-	drm_sched_fence_finished(s_fence);
+	drm_sched_fence_finished(s_fence, result);
 	dma_fence_put(&s_fence->finished);
 	wake_up_interruptible(&sched->wake_up_worker);
 }
@@ -282,7 +282,7 @@ static void drm_sched_job_done_cb(struct dma_fence *f, struct dma_fence_cb *cb)
 {
 	struct drm_sched_job *s_job = container_of(cb,
Re: [PATCH] drm/amdgpu: Stop clearing kiq position during fini
Acked-by: ZhenGuo Yin

On 2/27/2023 2:45 PM, Yaoyao Lei wrote:
Do not clear kiq position in RLC_CP_SCHEDULER so that CP could perform IDLE-SAVE after VF fini.
Otherwise it could cause GFX hang if another Win guest is rendering.

Signed-off-by: Yaoyao Lei
---
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 6983acc456b2..073f5f23bc3b 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -7285,17 +7285,9 @@ static int gfx_v10_0_hw_fini(void *handle)
 
 	if (amdgpu_sriov_vf(adev)) {
 		gfx_v10_0_cp_gfx_enable(adev, false);
-		/* Program KIQ position of RLC_CP_SCHEDULERS during destroy */
-		if (adev->ip_versions[GC_HWIP][0] >= IP_VERSION(10, 3, 0)) {
-			tmp = RREG32_SOC15(GC, 0, mmRLC_CP_SCHEDULERS_Sienna_Cichlid);
-			tmp &= 0xffffff00;
-			WREG32_SOC15(GC, 0, mmRLC_CP_SCHEDULERS_Sienna_Cichlid, tmp);
-		} else {
-			tmp = RREG32_SOC15(GC, 0, mmRLC_CP_SCHEDULERS);
-			tmp &= 0xffffff00;
-			WREG32_SOC15(GC, 0, mmRLC_CP_SCHEDULERS, tmp);
-		}
-
+		/* Remove the steps of clearing KIQ position.
+		 * It causes GFX hang when another Win guest is rendering.
+		 */
 		return 0;
 	}
 	gfx_v10_0_cp_enable(adev, false);
Re: [PATCH 1/5] drm/amd/amdgpu revert "implement tdr advanced mode"
Hi, Christian

A later bad compute job can block a good gfx job, so the original TDR design finds the wrong guilty job (the good gfx job). Advanced TDR re-submits jobs in order to find the real guilty job (the bad compute job).
After reverting this commit, how does the new gang-submit promise the isolation between compute jobs and gfx jobs?

On 10/26/2022 11:35 PM, Christian König wrote:
This reverts commit e6c6338f393b74ac0b303d567bb918b44ae7ad75.

This feature basically re-submits one job after another to figure out which one was the one causing a hang.

This is obviously incompatible with gang-submit which requires that multiple jobs run at the same time. It's also absolutely not helpful to crash the hardware multiple times if a clean recovery is desired.

For testing and debugging environments we should rather disable recovery altogether to be able to inspect the state with a hw debugger.

Additional to that the sw implementation is clearly buggy and causes reference count issues for the hardware fence.
Signed-off-by: Christian König
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 103 ---------------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   2 +-
 drivers/gpu/drm/scheduler/sched_main.c     |  58 ++---------
 include/drm/gpu_scheduler.h                |   3 -
 4 files changed, 10 insertions(+), 156 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 6f958603c8cc..d4584e577b51 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5070,94 +5070,6 @@ static int amdgpu_device_suspend_display_audio(struct amdgpu_device *adev)
 	return 0;
 }
 
-static void amdgpu_device_recheck_guilty_jobs(
-	struct amdgpu_device *adev, struct list_head *device_list_handle,
-	struct amdgpu_reset_context *reset_context)
-{
-	int i, r = 0;
-
-	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
-		struct amdgpu_ring *ring = adev->rings[i];
-		int ret = 0;
-		struct drm_sched_job *s_job;
-
-		if (!ring || !ring->sched.thread)
-			continue;
-
-		s_job = list_first_entry_or_null(&ring->sched.pending_list,
-				struct drm_sched_job, list);
-		if (s_job == NULL)
-			continue;
-
-		/* clear job's guilty and depend the folowing step to decide the real one */
-		drm_sched_reset_karma(s_job);
-		drm_sched_resubmit_jobs_ext(&ring->sched, 1);
-
-		if (!s_job->s_fence->parent) {
-			DRM_WARN("Failed to get a HW fence for job!");
-			continue;
-		}
-
-		ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
-		if (ret == 0) { /* timeout */
-			DRM_ERROR("Found the real bad job! ring:%s, job_id:%llx\n",
-				ring->sched.name, s_job->id);
-
-			amdgpu_fence_driver_isr_toggle(adev, true);
-
-			/* Clear this failed job from fence array */
-			amdgpu_fence_driver_clear_job_fences(ring);
-
-			amdgpu_fence_driver_isr_toggle(adev, false);
-
-			/* Since the job won't signal and we go for
-			 * another resubmit drop this parent pointer
-			 */
-			dma_fence_put(s_job->s_fence->parent);
-			s_job->s_fence->parent = NULL;
-
-			/* set guilty */
-			drm_sched_increase_karma(s_job);
-			amdgpu_reset_prepare_hwcontext(adev, reset_context);
-retry:
-			/* do hw reset */
-			if (amdgpu_sriov_vf(adev)) {
-				amdgpu_virt_fini_data_exchange(adev);
-				r = amdgpu_device_reset_sriov(adev, false);
-				if (r)
-					adev->asic_reset_res = r;
-			} else {
-				clear_bit(AMDGPU_SKIP_HW_RESET,
-					  &reset_context->flags);
-				r = amdgpu_do_asic_reset(device_list_handle,
-							 reset_context);
-				if (r && r == -EAGAIN)
-					goto retry;
-			}
-
-			/*
-			 * add reset counter so that the following
-			 * resubmitted job could flush vmid
-			 */
-			atomic_inc(&adev->gpu_reset_counter);
-			continue;
-		}
-
-		/* got the hw fence, signal finished fence */
-
Re: [PATCH] drm/ttm: update bulk move object of ghost BO
Inside the function ttm_bo_set_bulk_move, ttm_resource_del_bulk_move is called to remove the old resource from the bulk_move list. If we only set bulk_move to NULL manually as suggested, the old resource attached to the ghost BO won't be removed from the bulk_move.

On 9/1/2022 7:13 PM, Christian König wrote:
Am 01.09.22 um 13:11 schrieb Christian König:
Am 01.09.22 um 11:29 schrieb ZhenGuo Yin:
[Why]
The ghost BO is released with a non-empty bulk move object. There is a warning trace:

WARNING: CPU: 19 PID: 1582 at ttm/ttm_bo.c:366 ttm_bo_release+0x2e1/0x2f0 [amdttm]
Call Trace:
 amddma_resv_reserve_fences+0x10d/0x1f0 [amdkcl]
 amdttm_bo_put+0x28/0x30 [amdttm]
 amdttm_bo_move_accel_cleanup+0x126/0x200 [amdttm]
 amdgpu_bo_move+0x1a8/0x770 [amdgpu]
 ttm_bo_handle_move_mem+0xb0/0x140 [amdttm]
 amdttm_bo_validate+0xbf/0x100 [amdttm]

[How]
The resource of the ghost BO should be moved to the LRU directly instead of using bulk move. The bulk move object of the ghost BO should be set to NULL before calling ttm_bo_move_to_lru_tail_unlocked.

Fixes: 5b951e487fd6bf5f ("drm/ttm: fix bulk move handling v2")
Signed-off-by: ZhenGuo Yin

Good catch, but the fix is not 100% correct. Please rather just NULL the member while initializing the BO structure, e.g. something like this:

	fbo->base.pin_count = 0;
+	fbo->base.bulk_move = NULL;
	if (bo->type != ttm_bo_type_sg)

On the other hand, thinking about it, that won't work either. You need to set bulk_move to NULL manually in an else clause or something like this.

Regards,
Christian.

Thanks,
Christian.
---
 drivers/gpu/drm/ttm/ttm_bo_util.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
index 1cbfb00c1d65..a90bbbd91910 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -238,6 +238,7 @@ static int ttm_buffer_object_transfer(struct ttm_buffer_object *bo,
 
 	if (fbo->base.resource) {
 		ttm_resource_set_bo(fbo->base.resource, &fbo->base);
+		ttm_bo_set_bulk_move(&fbo->base, NULL);
 		bo->resource = NULL;
 	}