RE: [PATCH] drm/amd/pm: Disable SMU messages in navi10 sriov
[AMD Official Use Only]

Reviewed-by: Jack Zhang

-----Original Message-----
From: amd-gfx On Behalf Of Chen, JingWen
Sent: Friday, June 11, 2021 6:52 PM
To: Zha, YiFan(Even); amd-gfx@lists.freedesktop.org
Cc: Zha, YiFan(Even); Liu, Monk
Subject: RE: [PATCH] drm/amd/pm: Disable SMU messages in navi10 sriov

[AMD Official Use Only]

Acked-by: Jingwen Chen

Best Regards,
JingWen Chen

-----Original Message-----
From: Yifan Zha
Sent: Friday, June 11, 2021 6:49 PM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Monk; Chen, JingWen; Zha, YiFan(Even)
Subject: [PATCH] drm/amd/pm: Disable SMU messages in navi10 sriov

[Why]
An SR-IOV VF sending an unsupported SMU message leads to a failure.

[How]
Disable the related messages under SR-IOV by clearing their valid-in-VF flag in the navi10 message map.

Signed-off-by: Yifan Zha
---
 drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
index 78fe13183e8b..e1b019115e92 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
@@ -80,10 +80,10 @@ static struct cmn2asic_msg_mapping navi10_message_map[SMU_MSG_MAX_COUNT] = {
 	MSG_MAP(SetAllowedFeaturesMaskHigh,	PPSMC_MSG_SetAllowedFeaturesMaskHigh,	0),
 	MSG_MAP(EnableAllSmuFeatures,		PPSMC_MSG_EnableAllSmuFeatures,		0),
 	MSG_MAP(DisableAllSmuFeatures,		PPSMC_MSG_DisableAllSmuFeatures,	0),
-	MSG_MAP(EnableSmuFeaturesLow,		PPSMC_MSG_EnableSmuFeaturesLow,		1),
-	MSG_MAP(EnableSmuFeaturesHigh,		PPSMC_MSG_EnableSmuFeaturesHigh,	1),
-	MSG_MAP(DisableSmuFeaturesLow,		PPSMC_MSG_DisableSmuFeaturesLow,	1),
-	MSG_MAP(DisableSmuFeaturesHigh,		PPSMC_MSG_DisableSmuFeaturesHigh,	1),
+	MSG_MAP(EnableSmuFeaturesLow,		PPSMC_MSG_EnableSmuFeaturesLow,		0),
+	MSG_MAP(EnableSmuFeaturesHigh,		PPSMC_MSG_EnableSmuFeaturesHigh,	0),
+	MSG_MAP(DisableSmuFeaturesLow,		PPSMC_MSG_DisableSmuFeaturesLow,	0),
+	MSG_MAP(DisableSmuFeaturesHigh,		PPSMC_MSG_DisableSmuFeaturesHigh,	0),
 	MSG_MAP(GetEnabledSmuFeaturesLow,	PPSMC_MSG_GetEnabledSmuFeaturesLow,	1),
 	MSG_MAP(GetEnabledSmuFeaturesHigh,	PPSMC_MSG_GetEnabledSmuFeaturesHigh,	1),
 	MSG_MAP(SetWorkloadMask,		PPSMC_MSG_SetWorkloadMask,		1),
--
2.25.1
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak
[AMD Official Use Only - Internal Distribution Only]

Hi Andrey,

>> how you handle non-guilty signaled jobs in drm_sched_stop - currently it looks like you don't call put for them and just explicitly free them as before

Good point, I missed that place. I will cover it in my next patch.

>> Also sched->free_guilty seems useless with the new approach.

Yes, I agree.

>> Do we even need the cleanup mechanism at drm_sched_get_cleanup_job with this approach...

I am not quite sure about that for now; let me think about this topic today.

Hi Christian, should I add a fence and get/put that fence rather than using an explicit refcount? Any other concerns?

Thanks,
Jack

-----Original Message-----
From: Grodzovsky, Andrey
Sent: Friday, March 26, 2021 12:32 AM
To: Zhang, Jack (Jian); Christian König; dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, Christian; Liu, Monk; Deng, Emily; Rob Herring; Tomeu Vizoso; Steven Price
Subject: Re: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

There are a few issues here:

- How do you handle non-guilty signaled jobs in drm_sched_stop? Currently it looks like you don't call put for them and just explicitly free them as before.
- Also sched->free_guilty seems useless with the new approach.
- Do we even need the cleanup mechanism at drm_sched_get_cleanup_job with this approach...

But first - we need Christian to express his opinion on this, since I think he opposed refcounting jobs and argued that we should concentrate on fences instead.

Christian - can you chime in here?

Andrey

On 2021-03-25 5:51 a.m., Zhang, Jack (Jian) wrote:
> [AMD Official Use Only - Internal Distribution Only]
>
> Hi Andrey,
>
> Thank you for your good opinions.
>
> I agree with you that the refcount could handle the concurrency with
> get_cleanup_job gracefully, with no need to re-insert the job back anymore.
>
> I quickly made a draft for this idea as follows:
>
> How do you like it?
> I will start implementing it once I get your acknowledgement.
>
> Thanks,
>
> Jack
>
> +void drm_job_get(struct drm_sched_job *s_job)
> +{
> +	kref_get(&s_job->refcount);
> +}
> +
> +void drm_job_do_release(struct kref *ref)
> +{
> +	struct drm_sched_job *s_job;
> +	struct drm_gpu_scheduler *sched;
> +
> +	s_job = container_of(ref, struct drm_sched_job, refcount);
> +	sched = s_job->sched;
> +	sched->ops->free_job(s_job);
> +}
> +
> +void drm_job_put(struct drm_sched_job *s_job)
> +{
> +	kref_put(&s_job->refcount, drm_job_do_release);
> +}
> +
>  static void drm_sched_job_begin(struct drm_sched_job *s_job)
>  {
>  	struct drm_gpu_scheduler *sched = s_job->sched;
>
> +	kref_init(&s_job->refcount);
> +	drm_job_get(s_job);
>  	spin_lock(&sched->job_list_lock);
>  	list_add_tail(&s_job->node, &sched->ring_mirror_list);
>  	drm_sched_start_timeout(sched);
> @@ -294,17 +316,16 @@ static void drm_sched_job_timedout(struct work_struct *work)
>  	 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>  	 * is parked at which point it's safe.
>  	 */
> -	list_del_init(&job->node);
> +	drm_job_get(job);
>  	spin_unlock(&sched->job_list_lock);
>
>  	job->sched->ops->timedout_job(job);
> -
> +	drm_job_put(job);
>  	/*
>  	 * Guilty job did complete and hence needs to be manually removed
>  	 * See drm_sched_stop doc.
>  	 */
>  	if (sched->free_guilty) {
> -		job->sched->ops->free_job(job);
>  		sched->free_guilty = false;
>  	}
>  } else {
> @@ -355,20 +376,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>
> -	/*
> -	 * Reinsert back the bad job here - now it's safe as
> -	 * drm_sched_get_cleanup_job cannot race against us and release the
> -	 * bad job at this point - we parked (waited for) any in progress
> -	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> -	 * now until the scheduler thread is unparked.
> -	 */
> -	if (bad && bad->sched == sched)
> -		/*
> -		 * Add
RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak
[AMD Official Use Only - Internal Distribution Only]

Hi Steve,

Thank you for your detailed comments. The patch is not finalized yet, though: we found a potential race condition even with this patch applied. The solution is under discussion, and hopefully we can find an ideal one. After that, I will start to consider whether it influences drm drivers other than amdgpu.

Best,
Jack

-----Original Message-----
From: Steven Price
Sent: Monday, March 22, 2021 11:29 PM
To: Zhang, Jack (Jian); dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, Christian; Grodzovsky, Andrey; Liu, Monk; Deng, Emily; Rob Herring; Tomeu Vizoso
Subject: Re: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

On 15/03/2021 05:23, Zhang, Jack (Jian) wrote:
> [AMD Public Use]
>
> Hi, Rob/Tomeu/Steven,
>
> Would you please help to review this patch for the panfrost driver?
>
> Thanks,
> Jack Zhang
>
> -----Original Message-----
> From: Jack Zhang
> Sent: Monday, March 15, 2021 1:21 PM
> To: dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, Christian; Grodzovsky, Andrey; Liu, Monk; Deng, Emily
> Cc: Zhang, Jack (Jian)
> Subject: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak
>
> Re-insert bailing jobs to avoid a memory leak.
>
> V2: move the re-insert step into the drm/scheduler logic
> V3: add panfrost's return value for bailing jobs in case it hits the
> memleak issue.

This commit message could do with some work - it's really hard to decipher what the actual problem you're solving is.

>
> Signed-off-by: Jack Zhang
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 8 ++++++--
>   drivers/gpu/drm/panfrost/panfrost_job.c    | 4 ++--
>   drivers/gpu/drm/scheduler/sched_main.c     | 8 +++++++-
>   include/drm/gpu_scheduler.h                | 1 +
>   5 files changed, 19 insertions(+), 6 deletions(-)

[...]
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> index 6003cfeb1322..e2cb4f32dae1 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -444,7 +444,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
>   	 * spurious. Bail out.
>   	 */
>   	if (dma_fence_is_signaled(job->done_fence))
> -		return DRM_GPU_SCHED_STAT_NOMINAL;
> +		return DRM_GPU_SCHED_STAT_BAILING;
>
>   	dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, status=0x%x, head=0x%x, tail=0x%x, sched_job=%p",
>   		js,
> @@ -456,7 +456,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
>
>   	/* Scheduler is already stopped, nothing to do. */
>   	if (!panfrost_scheduler_stop(&pfdev->js->queue[js], sched_job))
> -		return DRM_GPU_SCHED_STAT_NOMINAL;
> +		return DRM_GPU_SCHED_STAT_BAILING;
>
>   	/* Schedule a reset if there's no reset in progress. */
>   	if (!atomic_xchg(&pfdev->reset.pending, 1))

This looks correct to me - in these two cases drm_sched_stop() is not called on the sched_job, so it looks like currently the job will be leaked.
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 92d8de24d0a1..a44f621fb5c4 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>   {
>   	struct drm_gpu_scheduler *sched;
>   	struct drm_sched_job *job;
> +	int ret;
>
>   	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>
> @@ -331,8 +332,13 @@ static void drm_sched_job_timedout(struct work_struct *work)
>   		list_del_init(&job->list);
>   		spin_unlock(&sched->job_list_lock);
>
> -		job->sched->ops->timedout_job(job);
> +		ret = job->sched->ops->timedout_job(job);
>
> +		if (ret == DRM_GPU_SCHED_STAT_BAILING) {
> +			spin_lock(&sched->job_list_lock);
> +			list_add(&job->node, &sched->ring_mirror_list);
> +			spin_unlock(&sched->job_list_lock);
> +		}

I think we could really do with a comment somewhere explaining what "bailing" means in this context.

For the Panfrost case we have two cases:

* The GPU job actually finished while the timeout code was running (done_fence is signalled).
* The GPU is already in the process of being reset (Panfrost has multiple queues, so most likely a bad job in another queue).

I'm also not convinced that (for Panfrost) it makes sense to be adding the jobs back to the list. For the first case above, clearly the job could just be freed (it's complete). The second case is more interesting and Panfrost currently doesn't handle this well. In theory the dri
RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak
[AMD Official Use Only - Internal Distribution Only]

Hi Andrey,

Thank you for your good opinions.

I agree with you that the refcount could handle the concurrency with get_cleanup_job gracefully, with no need to re-insert the job back anymore.

I quickly made a draft for this idea as follows. How do you like it? I will start implementing it once I get your acknowledgement.

Thanks,
Jack

+void drm_job_get(struct drm_sched_job *s_job)
+{
+	kref_get(&s_job->refcount);
+}
+
+void drm_job_do_release(struct kref *ref)
+{
+	struct drm_sched_job *s_job;
+	struct drm_gpu_scheduler *sched;
+
+	s_job = container_of(ref, struct drm_sched_job, refcount);
+	sched = s_job->sched;
+	sched->ops->free_job(s_job);
+}
+
+void drm_job_put(struct drm_sched_job *s_job)
+{
+	kref_put(&s_job->refcount, drm_job_do_release);
+}
+
 static void drm_sched_job_begin(struct drm_sched_job *s_job)
 {
 	struct drm_gpu_scheduler *sched = s_job->sched;

+	kref_init(&s_job->refcount);
+	drm_job_get(s_job);
 	spin_lock(&sched->job_list_lock);
 	list_add_tail(&s_job->node, &sched->ring_mirror_list);
 	drm_sched_start_timeout(sched);
@@ -294,17 +316,16 @@ static void drm_sched_job_timedout(struct work_struct *work)
 	 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
 	 * is parked at which point it's safe.
 	 */
-	list_del_init(&job->node);
+	drm_job_get(job);
 	spin_unlock(&sched->job_list_lock);

 	job->sched->ops->timedout_job(job);
-
+	drm_job_put(job);
 	/*
 	 * Guilty job did complete and hence needs to be manually removed
 	 * See drm_sched_stop doc.
 	 */
 	if (sched->free_guilty) {
-		job->sched->ops->free_job(job);
 		sched->free_guilty = false;
 	}
 } else {
@@ -355,20 +376,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)

-	/*
-	 * Reinsert back the bad job here - now it's safe as
-	 * drm_sched_get_cleanup_job cannot race against us and release the
-	 * bad job at this point - we parked (waited for) any in progress
-	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
-	 * now until the scheduler thread is unparked.
-	 */
-	if (bad && bad->sched == sched)
-		/*
-		 * Add at the head of the queue to reflect it was the earliest
-		 * job extracted.
-		 */
-		list_add(&bad->node, &sched->ring_mirror_list);
-
 	/*
 	 * Iterate the job list from later to earlier one and either deactive
 	 * their HW callbacks or remove them from mirror list if they already
@@ -774,7 +781,7 @@ static int drm_sched_main(void *param)
 			kthread_should_stop());

 		if (cleanup_job) {
-			sched->ops->free_job(cleanup_job);
+			drm_job_put(cleanup_job);
 			/* queue timeout for next job */
 			drm_sched_start_timeout(sched);
 		}

diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 5a1f068af1c2..b80513eec90f 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -188,6 +188,7 @@ struct drm_sched_fence *to_drm_sched_fence(struct dma_fence *f);
  * to schedule the job.
  */
 struct drm_sched_job {
+	struct kref			refcount;
 	struct spsc_node		queue_node;
 	struct drm_gpu_scheduler	*sched;
 	struct drm_sched_fence		*s_fence;
@@ -198,6 +199,7 @@ struct drm_sched_job {
 	enum drm_sched_priority		s_priority;
 	struct drm_sched_entity		*entity;
 	struct dma_fence_cb		cb;
+
 };

From: Grodzovsky, Andrey
Sent: Friday, March 19, 2021 12:17 AM
To: Zhang, Jack (Jian); Christian König; dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, Christian; Liu, Monk; Deng, Emily; Rob Herring; Tomeu Vizoso; Steven Price
Subject: Re: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

On 2021-03-18 6:41 a.m., Zhang, Jack (Jian) wrote:

[AMD Official Use Only - Internal Distribution Only]

Hi Andrey,

Let me summarize the background of this patch:

In the TDR resubmit step, amdgpu_device_recheck_guilty_jobs submits the first job of each ring and re-checks whether it is guilty. At that point, we have to make sure each job is in the mirror list (or has been re-inserted back already).

But we found that the current code never re-inserts the job into the mirror list in the 2nd, 3rd, ... job_timeout threads (the bailing TDR threads). This not only leaks the bailing jobs; what's more important, the 1st TDR thread can never iterate over the bailing job and set its guilty status correctly.
RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak
[AMD Official Use Only - Internal Distribution Only]

Hi Andrey,

Let me summarize the background of this patch:

In the TDR resubmit step, amdgpu_device_recheck_guilty_jobs submits the first job of each ring and re-checks whether it is guilty. At that point, we have to make sure each job is in the mirror list (or has been re-inserted back already).

But we found that the current code never re-inserts the job into the mirror list in the 2nd, 3rd, ... job_timeout threads (the bailing TDR threads). This not only leaks the bailing jobs; what's more important, the 1st TDR thread can never iterate over the bailing job and set its guilty status correctly. Therefore, we have to re-insert the job (or not delete its node at all) for a bailing job.

For the above V3 patch, the race condition in my mind is: we cannot make sure all bailing jobs are finished before we run amdgpu_device_recheck_guilty_jobs.

Based on this insight, I think we have two options to solve this issue:

1. Skip deleting the node in TDR threads 2, 3, 4, ... (using a mutex or an atomic variable).
2. Re-insert the bailing job, and meanwhile use a semaphore in each TDR thread to keep the sequence as expected and ensure each job is in the mirror list when we reach the resubmit step.

For option 1 the logic is simpler and we need only one global atomic variable. What do you think about this plan?

Option 1 should look like the following logic:

+static atomic_t in_reset;		/* a global atomic var for synchronization */
 static void drm_sched_process_job(struct dma_fence *f, struct dma_fence_cb *cb);

 /**
@@ -295,6 +296,12 @@ static void drm_sched_job_timedout(struct work_struct *work)
 	 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
 	 * is parked at which point it's safe.
 	 */
+	if (atomic_cmpxchg(&in_reset, 0, 1) != 0) {	/* skip delete node if it's thread 2, 3, ... */
+		spin_unlock(&sched->job_list_lock);
+		drm_sched_start_timeout(sched);
+		return;
+	}
+
 	list_del_init(&job->node);
 	spin_unlock(&sched->job_list_lock);

@@ -320,6 +327,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
 	spin_lock(&sched->job_list_lock);
 	drm_sched_start_timeout(sched);
 	spin_unlock(&sched->job_list_lock);
+	atomic_set(&in_reset, 0);	/* clear in_reset when the first thread finishes TDR */
 }

Thanks,
Jack

From: amd-gfx On Behalf Of Zhang, Jack (Jian)
Sent: Wednesday, March 17, 2021 11:11 PM
To: Christian König; dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, Christian; Liu, Monk; Deng, Emily; Rob Herring; Tomeu Vizoso; Steven Price; Grodzovsky, Andrey
Subject: Re: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

[AMD Official Use Only - Internal Distribution Only]

Hi Andrey,

Good catch. I will explore this corner case and give feedback soon.

Best,
Jack

From: Grodzovsky, Andrey <andrey.grodzov...@amd.com>
Sent: Wednesday, March 17, 2021 10:50:59 PM
To: Christian König <ckoenig.leichtzumer...@gmail.com>; Zhang, Jack (Jian) <jack.zha...@amd.com>; dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, Christian <christian.koe...@amd.com>; Liu, Monk <monk@amd.com>; Deng, Emily <emily.d...@amd.com>; Rob Herring <r...@kernel.org>; Tomeu Vizoso <tomeu.viz...@collabora.com>; Steven Price <steven.pr...@arm.com>
Subject: Re: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

I actually have a race condition concern here - see below -

On 2021-03-17 3:43 a.m., Christian König wrote:
> I was hoping Andrey would take a look since I'm really busy with other
> work right now.
>
> Regards,
> Christian.
>
> On 17.03.21 07:46, Zhang, Jack (Jian) wrote:
>> Hi Andrey, Christian and Team,
>>
>> I didn't receive the reviewers' messages from the maintainers of the
>> panfrost driver for several days.
>> Since this patch is urgent for my current working project,
>> would you please help to give some review ideas?
>>
>> Many Thanks,
>> Jack
>>
>> -----Original Message-----
>> From: Zhang, Jack (Jian)
>> Sent: Tuesday, March 16, 2021 3:20 PM
>> To: dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, Christian; Grodzovsky, Andrey
Re: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak
[AMD Official Use Only - Internal Distribution Only]

Hi Andrey,

Good catch. I will explore this corner case and give feedback soon.

Best,
Jack

From: Grodzovsky, Andrey
Sent: Wednesday, March 17, 2021 10:50:59 PM
To: Christian König; Zhang, Jack (Jian); dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, Christian; Liu, Monk; Deng, Emily; Rob Herring; Tomeu Vizoso; Steven Price
Subject: Re: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

I actually have a race condition concern here - see below -

On 2021-03-17 3:43 a.m., Christian König wrote:
> I was hoping Andrey would take a look since I'm really busy with other
> work right now.
>
> Regards,
> Christian.
>
> On 17.03.21 07:46, Zhang, Jack (Jian) wrote:
>> Hi Andrey, Christian and Team,
>>
>> I didn't receive the reviewers' messages from the maintainers of the
>> panfrost driver for several days.
>> Since this patch is urgent for my current working project,
>> would you please help to give some review ideas?
>>
>> Many Thanks,
>> Jack
>>
>> -----Original Message-----
>> From: Zhang, Jack (Jian)
>> Sent: Tuesday, March 16, 2021 3:20 PM
>> To: dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, Christian; Grodzovsky, Andrey; Liu, Monk; Deng, Emily; Rob Herring; Tomeu Vizoso; Steven Price
>> Subject: RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak
>>
>> [AMD Public Use]
>>
>> Ping
>>
>> -----Original Message-----
>> From: Zhang, Jack (Jian)
>> Sent: Monday, March 15, 2021 1:24 PM
>> To: Jack Zhang; dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, Christian; Grodzovsky, Andrey; Liu, Monk; Deng, Emily; Rob Herring; Tomeu Vizoso; Steven Price
>> Subject: RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak
>>
>> [AMD Public Use]
>>
>> Hi, Rob/Tomeu/Steven,
>>
>> Would you please help to review this patch for the panfrost driver?
>>
>> Thanks,
>> Jack Zhang
>>
>> -----Original Message-----
>> From: Jack Zhang
>> Sent: Monday, March 15, 2021 1:21 PM
>> To: dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, Christian; Grodzovsky, Andrey; Liu, Monk; Deng, Emily
>> Cc: Zhang, Jack (Jian)
>> Subject: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak
>>
>> Re-insert bailing jobs to avoid a memory leak.
>>
>> V2: move the re-insert step into the drm/scheduler logic
>> V3: add panfrost's return value for bailing jobs in case it hits the
>> memleak issue.
>>
>> Signed-off-by: Jack Zhang
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 8 ++++++--
>>   drivers/gpu/drm/panfrost/panfrost_job.c    | 4 ++--
>>   drivers/gpu/drm/scheduler/sched_main.c     | 8 +++++++-
>>   include/drm/gpu_scheduler.h                | 1 +
>>   5 files changed, 19 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 79b9cc73763f..86463b0f936e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4815,8 +4815,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>   			job ?
>> job->base.id : -1);
>>
>>   		/* even we skipped this reset, still need to set the job to guilty */
>> -		if (job)
>> +		if (job) {
>>   			drm_sched_increase_karma(&job->base);
>> +			r = DRM_GPU_SCHED_STAT_BAILING;
>> +		}
>>   		goto skip_recovery;
>>   	}
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index 759b34799221..41390bdacd9e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -34,6 +34,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>   	struct amdgpu_job *job = to_amdgpu_job(s_job);
>>   	struct amdgpu_task_info ti;
>>   	struct amdgpu_device *adev = ring->adev;
>> +	int ret;
>>
>>   	memset(&ti, 0, sizeof(struct amdgpu_task_info));
>>
>> @@ -52,8 +53,11 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>   		  ti.process_name, ti.
RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak
Hi Andrey, Christian and Team,

I didn't receive the reviewers' messages from the maintainers of the panfrost driver for several days. Since this patch is urgent for my current working project, would you please help to give some review ideas?

Many Thanks,
Jack

-----Original Message-----
From: Zhang, Jack (Jian)
Sent: Tuesday, March 16, 2021 3:20 PM
To: dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, Christian; Grodzovsky, Andrey; Liu, Monk; Deng, Emily; Rob Herring; Tomeu Vizoso; Steven Price
Subject: RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

[AMD Public Use]

Ping

-----Original Message-----
From: Zhang, Jack (Jian)
Sent: Monday, March 15, 2021 1:24 PM
To: Jack Zhang; dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, Christian; Grodzovsky, Andrey; Liu, Monk; Deng, Emily; Rob Herring; Tomeu Vizoso; Steven Price
Subject: RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

[AMD Public Use]

Hi, Rob/Tomeu/Steven,

Would you please help to review this patch for the panfrost driver?

Thanks,
Jack Zhang

-----Original Message-----
From: Jack Zhang
Sent: Monday, March 15, 2021 1:21 PM
To: dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, Christian; Grodzovsky, Andrey; Liu, Monk; Deng, Emily
Cc: Zhang, Jack (Jian)
Subject: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

Re-insert bailing jobs to avoid a memory leak.

V2: move the re-insert step into the drm/scheduler logic
V3: add panfrost's return value for bailing jobs in case it hits the
memleak issue.
Signed-off-by: Jack Zhang
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 8 ++++++--
 drivers/gpu/drm/panfrost/panfrost_job.c    | 4 ++--
 drivers/gpu/drm/scheduler/sched_main.c     | 8 +++++++-
 include/drm/gpu_scheduler.h                | 1 +
 5 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 79b9cc73763f..86463b0f936e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4815,8 +4815,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 			job ? job->base.id : -1);

 		/* even we skipped this reset, still need to set the job to guilty */
-		if (job)
+		if (job) {
 			drm_sched_increase_karma(&job->base);
+			r = DRM_GPU_SCHED_STAT_BAILING;
+		}
 		goto skip_recovery;
 	}

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 759b34799221..41390bdacd9e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -34,6 +34,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 	struct amdgpu_job *job = to_amdgpu_job(s_job);
 	struct amdgpu_task_info ti;
 	struct amdgpu_device *adev = ring->adev;
+	int ret;

 	memset(&ti, 0, sizeof(struct amdgpu_task_info));

@@ -52,8 +53,11 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 		  ti.process_name, ti.tgid, ti.task_name, ti.pid);

 	if (amdgpu_device_should_recover_gpu(ring->adev)) {
-		amdgpu_device_gpu_recover(ring->adev, job);
-		return DRM_GPU_SCHED_STAT_NOMINAL;
+		ret = amdgpu_device_gpu_recover(ring->adev, job);
+		if (ret == DRM_GPU_SCHED_STAT_BAILING)
+			return DRM_GPU_SCHED_STAT_BAILING;
+		else
+			return DRM_GPU_SCHED_STAT_NOMINAL;
 	} else {
 		drm_sched_suspend_timeout(&ring->sched);
 		if (amdgpu_sriov_vf(adev))

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 6003cfeb1322..e2cb4f32dae1
100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -444,7 +444,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
 	 * spurious. Bail out.
 	 */
 	if (dma_fence_is_signaled(job->done_fence))
-		return DRM_GPU_SCHED_STAT_NOMINAL;
+		return DRM_GPU_SCHED_STAT_BAILING;

 	dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, status=0x%x, head=0x%x, tail=0x%x, sched_job=%p",
 		js,
@@ -456,7 +456,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job

 	/* Sched
RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak
[AMD Public Use]

Ping

-----Original Message-----
From: Zhang, Jack (Jian)
Sent: Monday, March 15, 2021 1:24 PM
To: Jack Zhang; dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, Christian; Grodzovsky, Andrey; Liu, Monk; Deng, Emily; Rob Herring; Tomeu Vizoso; Steven Price
Subject: RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

[AMD Public Use]

Hi, Rob/Tomeu/Steven,

Would you please help to review this patch for the panfrost driver?

Thanks,
Jack Zhang

-----Original Message-----
From: Jack Zhang
Sent: Monday, March 15, 2021 1:21 PM
To: dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, Christian; Grodzovsky, Andrey; Liu, Monk; Deng, Emily
Cc: Zhang, Jack (Jian)
Subject: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

Re-insert bailing jobs to avoid a memory leak.

V2: move the re-insert step into the drm/scheduler logic
V3: add panfrost's return value for bailing jobs in case it hits the
memleak issue.

Signed-off-by: Jack Zhang
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 8 ++++++--
 drivers/gpu/drm/panfrost/panfrost_job.c    | 4 ++--
 drivers/gpu/drm/scheduler/sched_main.c     | 8 +++++++-
 include/drm/gpu_scheduler.h                | 1 +
 5 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 79b9cc73763f..86463b0f936e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4815,8 +4815,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 			job ?
 			job->base.id : -1);

 		/* even we skipped this reset, still need to set the job to guilty */
-		if (job)
+		if (job) {
 			drm_sched_increase_karma(&job->base);
+			r = DRM_GPU_SCHED_STAT_BAILING;
+		}
 		goto skip_recovery;
 	}

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 759b34799221..41390bdacd9e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -34,6 +34,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 	struct amdgpu_job *job = to_amdgpu_job(s_job);
 	struct amdgpu_task_info ti;
 	struct amdgpu_device *adev = ring->adev;
+	int ret;

 	memset(&ti, 0, sizeof(struct amdgpu_task_info));

@@ -52,8 +53,11 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 		  ti.process_name, ti.tgid, ti.task_name, ti.pid);

 	if (amdgpu_device_should_recover_gpu(ring->adev)) {
-		amdgpu_device_gpu_recover(ring->adev, job);
-		return DRM_GPU_SCHED_STAT_NOMINAL;
+		ret = amdgpu_device_gpu_recover(ring->adev, job);
+		if (ret == DRM_GPU_SCHED_STAT_BAILING)
+			return DRM_GPU_SCHED_STAT_BAILING;
+		else
+			return DRM_GPU_SCHED_STAT_NOMINAL;
 	} else {
 		drm_sched_suspend_timeout(&ring->sched);
 		if (amdgpu_sriov_vf(adev))

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 6003cfeb1322..e2cb4f32dae1 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -444,7 +444,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
 	 * spurious. Bail out.
 	 */
 	if (dma_fence_is_signaled(job->done_fence))
-		return DRM_GPU_SCHED_STAT_NOMINAL;
+		return DRM_GPU_SCHED_STAT_BAILING;

 	dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, status=0x%x, head=0x%x, tail=0x%x, sched_job=%p",
 		js,
@@ -456,7 +456,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job

 	/* Scheduler is already stopped, nothing to do.
*/ if (!panfrost_scheduler_stop(>js->queue[js], sched_job)) - return DRM_GPU_SCHED_STAT_NOMINAL; + return DRM_GPU_SCHED_STAT_BAILING; /* Schedule a reset if there's no reset in progress. */ if (!atomic_xchg(>reset.pending, 1)) diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index 92d8de24d0a1..a44f621fb5c4 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work) { struct drm_gpu_scheduler *sched; struct drm_sched_job *job; + int ret; sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work); @@ -331,8 +332,13 @@ static void dr
RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak
[AMD Public Use] Hi, Rob/Tomeu/Steven, Would you please help to review this patch for panfrost driver? Thanks, Jack Zhang -Original Message- From: Jack Zhang Sent: Monday, March 15, 2021 1:21 PM To: dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, Christian ; Grodzovsky, Andrey ; Liu, Monk ; Deng, Emily Cc: Zhang, Jack (Jian) Subject: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak re-insert Bailing jobs to avoid memory leak. V2: move re-insert step to drm/scheduler logic V3: add panfrost's return value for bailing jobs in case it hits the memleak issue. Signed-off-by: Jack Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++- drivers/gpu/drm/amd/amdgpu/amdgpu_job.c| 8 ++-- drivers/gpu/drm/panfrost/panfrost_job.c| 4 ++-- drivers/gpu/drm/scheduler/sched_main.c | 8 +++- include/drm/gpu_scheduler.h| 1 + 5 files changed, 19 insertions(+), 6 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 79b9cc73763f..86463b0f936e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -4815,8 +4815,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, job ? 
job->base.id : -1); /* even we skipped this reset, still need to set the job to guilty */ - if (job) + if (job) { drm_sched_increase_karma(>base); + r = DRM_GPU_SCHED_STAT_BAILING; + } goto skip_recovery; } diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c index 759b34799221..41390bdacd9e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c @@ -34,6 +34,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job) struct amdgpu_job *job = to_amdgpu_job(s_job); struct amdgpu_task_info ti; struct amdgpu_device *adev = ring->adev; + int ret; memset(, 0, sizeof(struct amdgpu_task_info)); @@ -52,8 +53,11 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job) ti.process_name, ti.tgid, ti.task_name, ti.pid); if (amdgpu_device_should_recover_gpu(ring->adev)) { - amdgpu_device_gpu_recover(ring->adev, job); - return DRM_GPU_SCHED_STAT_NOMINAL; + ret = amdgpu_device_gpu_recover(ring->adev, job); + if (ret == DRM_GPU_SCHED_STAT_BAILING) + return DRM_GPU_SCHED_STAT_BAILING; + else + return DRM_GPU_SCHED_STAT_NOMINAL; } else { drm_sched_suspend_timeout(>sched); if (amdgpu_sriov_vf(adev)) diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c index 6003cfeb1322..e2cb4f32dae1 100644 --- a/drivers/gpu/drm/panfrost/panfrost_job.c +++ b/drivers/gpu/drm/panfrost/panfrost_job.c @@ -444,7 +444,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job * spurious. Bail out. */ if (dma_fence_is_signaled(job->done_fence)) - return DRM_GPU_SCHED_STAT_NOMINAL; + return DRM_GPU_SCHED_STAT_BAILING; dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, status=0x%x, head=0x%x, tail=0x%x, sched_job=%p", js, @@ -456,7 +456,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job /* Scheduler is already stopped, nothing to do. 
*/ if (!panfrost_scheduler_stop(>js->queue[js], sched_job)) - return DRM_GPU_SCHED_STAT_NOMINAL; + return DRM_GPU_SCHED_STAT_BAILING; /* Schedule a reset if there's no reset in progress. */ if (!atomic_xchg(>reset.pending, 1)) diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index 92d8de24d0a1..a44f621fb5c4 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work) { struct drm_gpu_scheduler *sched; struct drm_sched_job *job; + int ret; sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work); @@ -331,8 +332,13 @@ static void drm_sched_job_timedout(struct work_struct *work) list_del_init(>list); spin_unlock(>job_list_lock); - job->sched->ops->timedout_job(job); + ret = job->sched->ops->timedout_job(job); + if (ret == DRM_GPU_SCHED_STAT_BAILING) { + spin_lock(>job_list_lock); +
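The scheduler-side change in this V3 patch can be modeled outside the kernel. The sketch below uses toy stand-ins (`struct job`, `struct scheduler`, `SCHED_STAT_BAILING`) for the real DRM types — it is not the DRM API — only to show the flow `drm_sched_job_timedout` gains here: pop the oldest job, call the driver's `timedout_job` callback, and re-insert the job when the driver reports it bailed out early, so the normal teardown path still frees it instead of leaking it.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for drm_sched_job / drm_gpu_scheduler; illustrative only. */
enum sched_stat { SCHED_STAT_NOMINAL, SCHED_STAT_BAILING };

struct job {
    int id;
    struct job *next;
};

struct scheduler {
    struct job *pending;                           /* head of pending-job list */
    enum sched_stat (*timedout_job)(struct job *); /* driver callback */
};

/* Flow added by the patch: remove the job, run the driver handler, and put
 * the job back on the pending list if the handler bailed out early, so the
 * later teardown path can still free it (no leak). The real code does this
 * under job_list_lock; locking is omitted in this single-threaded model. */
static void handle_timeout(struct scheduler *sched)
{
    struct job *job = sched->pending;

    if (!job)
        return;
    sched->pending = job->next;      /* list_del_init(&job->list) */
    job->next = NULL;

    if (sched->timedout_job(job) == SCHED_STAT_BAILING) {
        job->next = sched->pending;  /* list_add() back onto the list */
        sched->pending = job;
    }
}

/* Demo callbacks: a driver that bails (e.g. fence already signaled, as in
 * the panfrost hunks) and one that proceeds with a normal recovery. */
static enum sched_stat always_bail(struct job *j) { (void)j; return SCHED_STAT_BAILING; }
static enum sched_stat never_bail(struct job *j)  { (void)j; return SCHED_STAT_NOMINAL; }
```

With `always_bail` the job ends up back on the pending list; with `never_bail` it stays removed, matching the pre-patch behaviour.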
RE: [PATCH] drm/scheduler re-insert Bailing job to avoid memleak
[AMD Official Use Only - Internal Distribution Only] Hi, Andrey, ok, I have changed it and uploaded V2 patch. Thanks, Jack -Original Message- From: Grodzovsky, Andrey Sent: Friday, March 12, 2021 1:04 PM To: Alex Deucher ; Zhang, Jack (Jian) ; Maling list - DRI developers Cc: amd-gfx list ; Koenig, Christian ; Liu, Monk ; Deng, Emily Subject: Re: [PATCH] drm/scheduler re-insert Bailing job to avoid memleak Check panfrost driver at panfrost_scheduler_stop, and panfrost_job_timedout - they also terminate prematurely in both places so probably worth adding this there too. Andrey On 2021-03-11 11:13 p.m., Alex Deucher wrote: > +dri-devel > > Please be sure to cc dri-devel when you send out gpu scheduler patches. > > On Thu, Mar 11, 2021 at 10:57 PM Jack Zhang wrote: >> >> re-insert Bailing jobs to avoid memory leak. >> >> Signed-off-by: Jack Zhang >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++- >> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c| 8 ++-- >> drivers/gpu/drm/scheduler/sched_main.c | 8 +++- >> include/drm/gpu_scheduler.h| 1 + >> 4 files changed, 17 insertions(+), 4 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> index 79b9cc73763f..86463b0f936e 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> @@ -4815,8 +4815,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device >> *adev, >> job ? 
job->base.id : -1); >> >> /* even we skipped this reset, still need to set the job to >> guilty */ >> - if (job) >> + if (job) { >> drm_sched_increase_karma(>base); >> + r = DRM_GPU_SCHED_STAT_BAILING; >> + } >> goto skip_recovery; >> } >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >> index 759b34799221..41390bdacd9e 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >> @@ -34,6 +34,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct >> drm_sched_job *s_job) >> struct amdgpu_job *job = to_amdgpu_job(s_job); >> struct amdgpu_task_info ti; >> struct amdgpu_device *adev = ring->adev; >> + int ret; >> >> memset(, 0, sizeof(struct amdgpu_task_info)); >> >> @@ -52,8 +53,11 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct >> drm_sched_job *s_job) >>ti.process_name, ti.tgid, ti.task_name, ti.pid); >> >> if (amdgpu_device_should_recover_gpu(ring->adev)) { >> - amdgpu_device_gpu_recover(ring->adev, job); >> - return DRM_GPU_SCHED_STAT_NOMINAL; >> + ret = amdgpu_device_gpu_recover(ring->adev, job); >> + if (ret == DRM_GPU_SCHED_STAT_BAILING) >> + return DRM_GPU_SCHED_STAT_BAILING; >> + else >> + return DRM_GPU_SCHED_STAT_NOMINAL; >> } else { >> drm_sched_suspend_timeout(>sched); >> if (amdgpu_sriov_vf(adev)) diff --git >> a/drivers/gpu/drm/scheduler/sched_main.c >> b/drivers/gpu/drm/scheduler/sched_main.c >> index 92d8de24d0a1..a44f621fb5c4 100644 >> --- a/drivers/gpu/drm/scheduler/sched_main.c >> +++ b/drivers/gpu/drm/scheduler/sched_main.c >> @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct >> *work) >> { >> struct drm_gpu_scheduler *sched; >> struct drm_sched_job *job; >> + int ret; >> >> sched = container_of(work, struct drm_gpu_scheduler, >> work_tdr.work); >> >> @@ -331,8 +332,13 @@ static void drm_sched_job_timedout(struct work_struct >> *work) >> list_del_init(>list); >> spin_unlock(>job_list_lock); >> >> - 
job->sched->ops->timedout_job(job); >> + ret = job->sched->ops->timedout_job(job); >> >> + if (ret == DRM_GPU_SCHED_STAT_BAILING) { >> + spin_lock(>job_list_lock); >> + list_add(>node, >ring_mirror_list); >> + spin_unlock(>job_list_lock); >> + } Problem here that since you alr
Re: [PATCH v7] drm/amd/amdgpu implement tdr advanced mode
[AMD Official Use Only - Internal Distribution Only] Hi, Andrey and Christian, the V8 patch is uploaded. Thanks, Jack From: amd-gfx On Behalf Of Zhang, Jack (Jian) Sent: Thursday, March 11, 2021 8:20 PM To: amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey; Liu, Monk; Deng, Emily; Koenig, Christian Subject: Re: [PATCH v7] drm/amd/amdgpu implement tdr advanced mode
Re: [PATCH v7] drm/amd/amdgpu implement tdr advanced mode
[AMD Official Use Only - Internal Distribution Only] Hi, Christian, Good idea, thank you for these efforts. I will update in next version. Jack From: Koenig, Christian Sent: Thursday, March 11, 2021 6:41:05 PM To: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org ; Grodzovsky, Andrey ; Liu, Monk ; Deng, Emily Subject: Re: [PATCH v7] drm/amd/amdgpu implement tdr advanced mode Am 11.03.21 um 06:58 schrieb Jack Zhang: > [Why] > Previous tdr design treats the first job in job_timeout as the bad job. > But sometimes a later bad compute job can block a good gfx job and > cause an unexpected gfx job timeout because gfx and compute ring share > internal GC HW mutually. > > [How] > This patch implements an advanced tdr mode.It involves an additinal > synchronous pre-resubmit step(Step0 Resubmit) before normal resubmit > step in order to find the real bad job. > > 1. At Step0 Resubmit stage, it synchronously submits and pends for the > first job being signaled. If it gets timeout, we identify it as guilty > and do hw reset. After that, we would do the normal resubmit step to > resubmit left jobs. > > 2. For whole gpu reset(vram lost), do resubmit as the old way. 
> > Signed-off-by: Jack Zhang > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 63 > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 2 +- > drivers/gpu/drm/scheduler/sched_main.c | 107 +++-- > include/drm/gpu_scheduler.h| 3 + > 4 files changed, 142 insertions(+), 33 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > index e247c3a2ec08..5b182ade26ad 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > @@ -4639,6 +4639,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device > *adev, >int i, r = 0; >bool need_emergency_restart = false; >bool audio_suspended = false; > + int tmp_vram_lost_counter; > >/* > * Special case: RAS triggered and full reset isn't supported > @@ -4788,6 +4789,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device > *adev, >} >} > > + tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter)); >/* Actual ASIC resets if needed.*/ >/* TODO Implement XGMI hive reset logic for SRIOV */ >if (amdgpu_sriov_vf(adev)) { > @@ -4805,6 +4807,67 @@ int amdgpu_device_gpu_recover(struct amdgpu_device > *adev, >/* Post ASIC reset for all devs .*/ >list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) { > > + if (amdgpu_gpu_recovery == 2 && > + !(tmp_vram_lost_counter < > atomic_read(>vram_lost_counter))) { > + > + for (i = 0; i < AMDGPU_MAX_RINGS; ++i) { Starting from here I would put this into a separate helper function in amdgpu_ring.c. You should also probably use checkpatch.pl once more since a couple of lines below might result in warnings. Christian. 
> + struct amdgpu_ring *ring = tmp_adev->rings[i]; > + int ret = 0; > + struct drm_sched_job *s_job; > + > + if (!ring || !ring->sched.thread) > + continue; > + > + /* No point to resubmit jobs if we didn't HW > reset*/ > + if (!tmp_adev->asic_reset_res && !job_signaled) > { > + > + s_job = > list_first_entry_or_null(>sched.pending_list, struct drm_sched_job, > list); > + if (s_job == NULL) > + continue; > + > + /* clear job's guilty and depend the > folowing step to decide the real one */ > + drm_sched_reset_karma(s_job); > + > drm_sched_resubmit_jobs_ext(>sched, 1); > + ret = > dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout); > + > + if (ret == 0) { /* timeout */ > + DRM_ERROR("Found the real bad > job! ring:%s, job_id:%llx\n", ring->sched.name, s_job->id); > + /* set guilty */ > + drm_sched_increase_karma(s_job); > + > + /* do hw
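The quoted loop is dense; its "Step0 Resubmit" idea — resubmit only the first pending job of each ring, pend on its fence, and treat a timeout as identifying the real guilty job — can be shown with a self-contained C model. All names below (`toy_job`, `wait_fence_timeout`, `step0_find_guilty`) are hypothetical stand-ins for the amdgpu structures, kept only to illustrate the control flow:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for a scheduler job; `hangs` models a fence that
 * never signals (the real guilty job). */
struct toy_job {
    int id;
    int hangs;  /* 1 = this job's fence never signals */
    int karma;
};

/* Stand-in for dma_fence_wait_timeout(): returns 0 on timeout. */
static int wait_fence_timeout(const struct toy_job *job)
{
    return job->hangs ? 0 : 1;
}

/* Step0: for each ring, resubmit just the first pending job and wait on its
 * fence; the first one that times out is marked guilty (karma bump, as
 * drm_sched_increase_karma() would do) and returned.  Returns -1 if every
 * first job signaled. */
static int step0_find_guilty(struct toy_job *first_jobs[], int nrings)
{
    for (int i = 0; i < nrings; i++) {
        struct toy_job *job = first_jobs[i];

        if (!job)
            continue;               /* empty ring */
        if (wait_fence_timeout(job) == 0) {
            job->karma++;           /* found the real bad job */
            return job->id;
        }
    }
    return -1;
}
```

In the real patch this happens between the ASIC reset and the normal resubmit of the remaining jobs; the model above only captures the guilty-job search.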
RE: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode
[AMD Official Use Only - Internal Distribution Only] Hi, Andrey, I just submitted the V7 patch in which I wrapped the increase_karma func and resubmit_job func. Thanks, Jack From: amd-gfx On Behalf Of Zhang, Jack (Jian) Sent: Thursday, March 11, 2021 11:28 AM To: Grodzovsky, Andrey ; amd-gfx@lists.freedesktop.org; Koenig, Christian ; Liu, Monk ; Deng, Emily Subject: RE: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode
RE: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode
[AMD Official Use Only - Internal Distribution Only] Ok, that will do the trick. Thank you! Jack From: Grodzovsky, Andrey Sent: Thursday, March 11, 2021 11:24 AM To: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org; Koenig, Christian ; Liu, Monk ; Deng, Emily Subject: Re: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode You can just rename drm_sched_resubmit_jobs to drm_sched_resubmit_jobs_imp and create a wrapper function drm_sched_resubmit_jobs which will call that function with MAX_INT, so no code change in other drivers will be needed. For amdgpu you call drm_sched_resubmit_jobs_imp directly with 1 and MAX_INT. Andrey From: Zhang, Jack (Jian) Sent: 10 March 2021 22:05 To: Grodzovsky, Andrey ; amd-gfx@lists.freedesktop.org ; Koenig, Christian ; Liu, Monk ; Deng, Emily Subject: RE: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode
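Andrey's renaming suggestion is the usual C idiom for emulating a default argument, since C supports neither overloading nor default parameter values: the extended function takes the extra `int max`, and a thin wrapper keeps the old name and signature so no other driver changes. A minimal sketch with stand-in functions (not the real drm_sched_resubmit_jobs); note the actual C constant is INT_MAX rather than MAX_INT:

```c
#include <assert.h>
#include <limits.h>

/* Stand-in for the extended drm_sched_resubmit_jobs_ext(sched, max):
 * here it just reports how many of the queued jobs would be resubmitted,
 * capped at `max`. */
static int resubmit_jobs_ext(int njobs_queued, int max)
{
    return njobs_queued < max ? njobs_queued : max;
}

/* Wrapper keeping the old entry point: resubmit everything (INT_MAX cap),
 * so existing callers in other drivers need no change. */
static int resubmit_jobs(int njobs_queued)
{
    return resubmit_jobs_ext(njobs_queued, INT_MAX);
}
```

amdgpu's Step0 path would then call the `_ext` variant directly with `max == 1` to resubmit only the first job, and later with `INT_MAX` for the normal resubmit.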
RE: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode
[AMD Official Use Only - Internal Distribution Only] Hi, Andrey, Thank you for your review. >> This is identical to drm_sched_resubmit_jobs in all but the 'int max' >> handling, can't you reuse drm_sched_resubmit_jobs by passing max argument==1 >> for the find guilty run and then, for the later normal run passing >> max==MAX_INT to avoid break the iteration to early ? Because C doesn't support function overloading, we couldn't define a function like the following (compiler error): void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched, int max = INT_MAX); We would have to rewrite every vendor's code that calls drm_sched_resubmit_jobs if we defined the function like this: void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched, int max); Meanwhile, I also explored variable arguments in C, but they still cannot meet the needs very well. Do you know some other method that could help us make this happen? Otherwise, I'm afraid we have to define a separate way, as patch V6 did. Thanks, Jack -Original Message- From: Grodzovsky, Andrey Sent: Thursday, March 11, 2021 12:19 AM To: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org; Koenig, Christian ; Liu, Monk ; Deng, Emily Subject: Re: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode On 2021-03-10 8:33 a.m., Jack Zhang wrote: > [Why] > Previous tdr design treats the first job in job_timeout as the bad job. > But sometimes a later bad compute job can block a good gfx job and > cause an unexpected gfx job timeout because gfx and compute ring share > internal GC HW mutually. > > [How] > This patch implements an advanced tdr mode.It involves an additinal > synchronous pre-resubmit step(Step0 Resubmit) before normal resubmit > step in order to find the real bad job. > > 1. At Step0 Resubmit stage, it synchronously submits and pends for the > first job being signaled. 
> > 2. Re-insert Bailing job to mirror_list, and leave it to be handled by > the main reset thread. > > 3. For whole gpu reset(vram lost), do resubmit as the old way. > > Signed-off-by: Jack Zhang > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 72 - > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 2 +- > drivers/gpu/drm/scheduler/sched_main.c | 75 ++ > include/drm/gpu_scheduler.h| 2 + > 4 files changed, 148 insertions(+), 3 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > index e247c3a2ec08..dac0a242e5f5 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > @@ -4639,6 +4639,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device > *adev, > int i, r = 0; > bool need_emergency_restart = false; > bool audio_suspended = false; > +int tmp_vram_lost_counter; > > /* >* Special case: RAS triggered and full reset isn't supported @@ > -4689,9 +4690,14 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, > dev_info(adev->dev, "Bailing on TDR for s_job:%llx, as another already in > progress", > job ? job->base.id : -1); > > -/* even we skipped this reset, still need to set the job to guilty */ > -if (job) > +if (job) { > +/* re-insert node to avoid memory leak */ > +spin_lock(>base.sched->job_list_lock); > +list_add(>base.node, >base.sched->pending_list); > +spin_unlock(>base.sched->job_list_lock); > +/* even we skipped this reset, still need to set the job to guilty > +*/ > drm_sched_increase_karma(>base); > +} I think this piece should be in a standalone patch as it comes to fix a bug of leaking jobs and not directly related to find actual guilty job. Also, this is not the only place where the problem arises - it also arises in other drivers where they check that guilty job's fence already signaled and bail out before reinsertion of bad job to mirror list. 
Seems to me better fix would be to handle this within scheduler code by adding specific return value to drm_sched_backend_ops.timedout_job marking that the code terminated early - before calling drm_sched_stop and so drm_sched_job_timedout would know this and then reinsert the job back to mirror_list itself. > goto skip_recovery; > } > > @@ -4788,6 +4794,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device > *adev, > } > } > > +tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter)); > /* Actual ASIC resets if needed.*/ > /* TODO Implement XGMI hive reset logic for SRIOV */ > if (amdgpu_sriov_vf(adev)) { > @@ -
Re: [PATCH v4] drm/amd/amdgpu implement tdr advanced mode
[AMD Official Use Only - Internal Distribution Only] As Monk previously explained in another mail session, drm_sched_resubmit_jobs submits all jobs, which does not meet our needs. We need a new API, drm_sched_resubmit_jobs2(), to submit the first job of a ring (we can also leverage the error handling for NULL or error fences inside this function). It seems cleaner to use this jobs2 function than to write the run_job + some error handling in the main logic. Thanks, /Jack From: Koenig, Christian Sent: Wednesday, March 10, 2021 9:49 PM To: Zhang, Jack (Jian); amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey; Liu, Monk; Deng, Emily Subject: Re: [PATCH v4] drm/amd/amdgpu implement tdr advanced mode Andrey needs to review the rest, but essentially I don't see the reason why you need this drm_sched_resubmit_jobs2(). Christian. Am 10.03.21 um 14:36 schrieb Zhang, Jack (Jian): [AMD Official Use Only - Internal Distribution Only] Thanks Christian, just did the checkpatch scan. Please see the V6 patch /Jack From: Koenig, Christian Sent: Wednesday, March 10, 2021 8:54:53 PM To: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org ; Grodzovsky, Andrey ; Liu, Monk ; Deng, Emily Subject: Re: [PATCH v4] drm/amd/amdgpu implement tdr advanced mode Am 10.03.21 um 12:19 schrieb Jack Zhang: > [Why] > Previous tdr design treats the first job in job_timeout as the bad job. > But sometimes a later bad compute job can block a good gfx job and > cause an unexpected gfx job timeout because gfx and compute ring share > internal GC HW mutually. > > [How] > This patch implements an advanced tdr mode.It involves an additinal > synchronous pre-resubmit step(Step0 Resubmit) before normal resubmit > step in order to find the real bad job. > > 1. 
At the Step0 Resubmit stage, it synchronously submits and waits for the > first job to be signaled. If it times out, we identify it as guilty > and do a HW reset. After that, we do the normal resubmit step to > resubmit the remaining jobs. > > 2. Re-insert the bailing job into the mirror_list, and leave it to be handled by > the main reset thread. > > 3. For whole-GPU reset (VRAM lost), do the resubmit the old way. > > Signed-off-by: Jack Zhang > Change-Id: I408357f10b9034caaa1b83610e19e514c5fbaaf2 > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 73 - > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 2 +- > drivers/gpu/drm/scheduler/sched_main.c | 75 ++ > include/drm/gpu_scheduler.h| 2 + > 4 files changed, 148 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > index e247c3a2ec08..02056355a055 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > @@ -4639,7 +4639,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device > *adev, >int i, r = 0; >bool need_emergency_restart = false; >bool audio_suspended = false; > - > + int tmp_vram_lost_counter; Please keep the empty line between declaration and code. In general, give that patch a pass with checkpatch.pl, since there are a couple of other places which need fixing at first glance. Christian. >/* > * Special case: RAS triggered and full reset isn't supported > */ > @@ -4689,9 +4689,14 @@ int amdgpu_device_gpu_recover(struct amdgpu_device > *adev, >dev_info(adev->dev, "Bailing on TDR for s_job:%llx, as another > already in progress", >job ? 
job->base.id : -1); > > - /* even we skipped this reset, still need to set the job to > guilty */ > - if (job) > + if (job) { > + /* re-insert node to avoid memory leak */ > + spin_lock(&job->base.sched->job_list_lock); > + list_add(&job->base.node, > &job->base.sched->pending_list); > + spin_unlock(&job->base.sched->job_list_lock); > + /* even we skipped this reset, still need to set the > job to guilty */ >drm_sched_increase_karma(&job->base); > + } >goto skip_recovery; >} > > @@ -4788,6 +4793,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device > *adev, >} >} >
Re: [PATCH v4] drm/amd/amdgpu implement tdr advanced mode
[AMD Official Use Only - Internal Distribution Only] Thanks Christian, just did the checkpatch scan. Please see the V6 patch /Jack From: Koenig, Christian Sent: Wednesday, March 10, 2021 8:54:53 PM To: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org ; Grodzovsky, Andrey ; Liu, Monk ; Deng, Emily Subject: Re: [PATCH v4] drm/amd/amdgpu implement tdr advanced mode On 10.03.21 at 12:19, Jack Zhang wrote: > [Why] > The previous TDR design treats the first job in job_timeout as the bad job. > But sometimes a later bad compute job can block a good gfx job and > cause an unexpected gfx job timeout, because the gfx and compute rings share > internal GC hardware. > > [How] > This patch implements an advanced TDR mode. It involves an additional > synchronous pre-resubmit step (Step0 Resubmit) before the normal resubmit > step in order to find the real bad job. > > 1. At the Step0 Resubmit stage, it synchronously submits and waits for the > first job to be signaled. If it times out, we identify it as guilty > and do a HW reset. After that, we do the normal resubmit step to > resubmit the remaining jobs. > > 2. Re-insert the bailing job into the mirror_list, and leave it to be handled by > the main reset thread. > > 3. For whole-GPU reset (VRAM lost), do the resubmit the old way. 
> > Signed-off-by: Jack Zhang > Change-Id: I408357f10b9034caaa1b83610e19e514c5fbaaf2 > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 73 - > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 2 +- > drivers/gpu/drm/scheduler/sched_main.c | 75 ++ > include/drm/gpu_scheduler.h| 2 + > 4 files changed, 148 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > index e247c3a2ec08..02056355a055 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > @@ -4639,7 +4639,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device > *adev, >int i, r = 0; >bool need_emergency_restart = false; >bool audio_suspended = false; > - > + int tmp_vram_lost_counter; Please keep the empty line between declaration and code. In general, give that patch a pass with checkpatch.pl, since there are a couple of other places which need fixing at first glance. Christian. >/* > * Special case: RAS triggered and full reset isn't supported > */ > @@ -4689,9 +4689,14 @@ int amdgpu_device_gpu_recover(struct amdgpu_device > *adev, >dev_info(adev->dev, "Bailing on TDR for s_job:%llx, as another > already in progress", >job ? 
job->base.id : -1); > > - /* even we skipped this reset, still need to set the job to > guilty */ > - if (job) > + if (job) { > + /* re-insert node to avoid memory leak */ > + spin_lock(&job->base.sched->job_list_lock); > + list_add(&job->base.node, > &job->base.sched->pending_list); > + spin_unlock(&job->base.sched->job_list_lock); > + /* even we skipped this reset, still need to set the > job to guilty */ >drm_sched_increase_karma(&job->base); > + } >goto skip_recovery; >} > > @@ -4788,6 +4793,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device > *adev, >} >} > > + tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter)); >/* Actual ASIC resets if needed.*/ >/* TODO Implement XGMI hive reset logic for SRIOV */ >if (amdgpu_sriov_vf(adev)) { > @@ -4805,6 +4811,67 @@ int amdgpu_device_gpu_recover(struct amdgpu_device > *adev, >/* Post ASIC reset for all devs .*/ >list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) { > > + if (amdgpu_gpu_recovery == 2 && > + !(tmp_vram_lost_counter < > atomic_read(&tmp_adev->vram_lost_counter))) { > + > + for (i = 0; i < AMDGPU_MAX_RINGS; ++i) { > + struct amdgpu_ring *ring = tmp_adev->rings[i]; > + int ret = 0; > + struct drm_sched_job *s_job; > + > + if (!ring || !ring->sched.thread) > + continue; > + > + /* No point to resubmit jobs if we didn't HW > reset*/ > + if (!tmp_adev->asic_reset_res && !job_signaled) >
RE: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode
[AMD Official Use Only - Internal Distribution Only] Hi, Andrey and Christian, I have uploaded the V4 patch, which incorporates the previous review comments. Could you please review it? The basic idea is to add an additional resubmit step before the normal resubmit. In this resubmit step, we submit the first job of each ring and wait for its HW fence. If the fence is signaled, signal the job's finished fence, delete the node and free the job. If the fence times out, do a HW reset and mark the job as guilty. After we have verified the first job of each ring, we can be sure that we have cleared all the finished jobs and marked every hung job as guilty. Then we can safely do the normal job resubmission. Thanks, Jack -Original Message- From: Koenig, Christian Sent: Wednesday, March 10, 2021 1:42 AM To: Grodzovsky, Andrey ; Liu, Monk ; Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org; Deng, Emily Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode Yeah, sorry. I've seen your reply only after I wrote this mail. I think we can go ahead if you don't have much other concern on this. Christian. On 09.03.21 at 17:48, Andrey Grodzovsky wrote: > If we are talking about '[PATCH v3] drm/amd/amdgpu implement tdr > advanced mode', which was sent yesterday, then I already went over it > and only had two cosmetic comments. > > Andrey > > On 2021-03-09 6:16 a.m., Christian König wrote: >> Yeah, those are some really good points. I completely agree that we >> shouldn't do any larger cleanup right now. >> >> But I think we still need some more review on this. I most likely >> won't have enough time to look into this before the weekend. >> >> Andrey, can you take a look as well? >> >> Thanks, >> Christian. 
>> >> On 09.03.21 at 08:29, Liu, Monk wrote: >>> [AMD Official Use Only - Internal Distribution Only] >>> >>> Christian, >>> >>> what is feasible and practical now is: >>> 1) we implement the advanced TDR mode upstream first (so we can >>> copy the same scheme into our LTS kernel) -- if you want, we can avoid >>> changing the drm/scheduler code, but that version was already rejected by >>> you as too complicated >>> 2) then we retire the mirror-list concept and rework the >>> drm/scheduler with a kfifo >>> 3) remove the guilty/karma handling from the scheduler >>> >>> So I basically agree with you on the spirit of the above changes: hide >>> those AMD-internal concepts or tricks in the vendor's driver part and >>> keep the scheduler simple and scalable. But that definitely needs a >>> longer design discussion, so why don't we focus on our current >>> problems now? As long as the new change doesn't regress, it is still >>> a good change based on the current TDR implementation. >>> >>> I would propose we only change the AMD code this time. Jack's >>> first version of the patch didn't touch the scheduler part, but you stated it >>> was too complicated and rejected it. >>> >>> So the allowable revision is what Jack did in v2, which needs >>> to introduce a new API in the scheduler, drm_sched_resubmit_jobs2(). >>> >>> What do you think? >>> >>> Thanks >>> >>> -- >>> Monk Liu | Cloud-GPU Core team >>> -- >>> >>> -Original Message- >>> From: Koenig, Christian >>> Sent: Monday, March 8, 2021 3:53 PM >>> To: Liu, Monk ; Zhang, Jack (Jian) >>> ; amd-gfx@lists.freedesktop.org; Grodzovsky, >>> Andrey ; Deng, Emily >>> Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode >>> >>> >>> >>> On 08.03.21 at 05:06, Liu, Monk wrote: >>>> [AMD Official Use Only - Internal Distribution Only] >>>> >>>>>> well, first of all, please completely drop the affinity group stuff >>>>>> from this patch. We should concentrate on one feature at a time. 
>>>> We need it to expedite the process; we can introduce this change in >>>> another patch >>>> >>>> >>>>>> Then the implementation is way too complicated. All you need to do >>>>>> is insert a dma_fence_wait after re-scheduling each job after a >>>>>> reset. >>>> No, that's not true. During drm_sched_resubmit_jobs it will >>>> put all jobs in the mirror list into the hw ring, but we can only allow >>>> the first job into the ring to catch the real guilty one (otherwise >>>> it is possible that
RE: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode
[AMD Official Use Only - Internal Distribution Only] Hi, Christian, Since this change is a bit critical to our project, we would be grateful to get your review. Is there anything that is not clear enough that I could help explain? Again, thanks for your huge help with our problem. Thanks, Jack -Original Message- From: Koenig, Christian Sent: Monday, March 8, 2021 8:45 PM To: Zhang, Jack (Jian) ; Liu, Monk ; amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey ; Deng, Emily Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode Hi Jack, yes, that comes pretty close. I'm going over the patch right now. Some things still look a bit complicated to me, but I need to wrap my head around how and why we are doing it this way once more. Christian. On 08.03.21 at 13:43, Zhang, Jack (Jian) wrote: > [AMD Public Use] > > Hi, Christian, > > I made a change in the V3 patch that inserts a dma_fence_wait for the first > jobs after resubmitting jobs. > It seems simpler than the V2 patch. Is this what you first had in mind? > > Thanks, > Jack > > -Original Message- > From: Koenig, Christian > Sent: Monday, March 8, 2021 3:53 PM > To: Liu, Monk ; Zhang, Jack (Jian) > ; amd-gfx@lists.freedesktop.org; Grodzovsky, > Andrey ; Deng, Emily > Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode > > > > On 08.03.21 at 05:06, Liu, Monk wrote: >> [AMD Official Use Only - Internal Distribution Only] >> >>>> well, first of all, please completely drop the affinity group stuff from >>>> this patch. We should concentrate on one feature at a time. >> We need it to expedite the process; we can introduce this change in >> another patch >> >> >>>> Then the implementation is way too complicated. All you need to do is insert >>>> a dma_fence_wait after re-scheduling each job after a reset. 
>> No, that's not true. During drm_sched_resubmit_jobs it will put >> all jobs in the mirror list into the hw ring, but we can only allow the >> first job into the ring to catch the real guilty one (otherwise it is possible >> that a later job in the ring also has a bug and affects our judgement). So we >> need to implement a new drm_sched_resubmit_jobs2(), like this: > Something like this. But since waiting for the guilty job is AMD specific we > should rather rework the stuff from the beginning. > > What I have in mind is the following: > 1. Add a reference from the scheduler fence back to the job which is cleared > only when the scheduler fence finishes. > 2. Completely drop the ring_mirror_list and replace it with a kfifo of > pointers to the active scheduler fences. > 3. Replace drm_sched_resubmit_jobs with a drm_sched_for_each_active() macro > which allows drivers to iterate over all the active jobs and > resubmit/wait/mark them as guilty, etc. > 4. Remove the guilty/karma handling from the scheduler. This is something AMD > specific and shouldn't leak into common code. > > Regards, > Christian. 
> >> drm_sched_resubmit_jobs2() >> ~ 499 void drm_sched_resubmit_jobs2(struct drm_gpu_scheduler *sched, int max) >> 500 { >> 501 struct drm_sched_job *s_job, *tmp; >> 502 uint64_t guilty_context; >> 503 bool found_guilty = false; >> 504 struct dma_fence *fence; >> + 505 int i = 0; >> 506 >> 507 list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, >> node) { >> 508 struct drm_sched_fence *s_fence = s_job->s_fence; >> 509 >> + 510 if (i >= max) >> + 511 break; >> + 512 >> 513 if (!found_guilty && atomic_read(&s_job->karma) > >> sched->hang_limit) { >> 514 found_guilty = true; >> 515 guilty_context = s_job->s_fence->scheduled.context; >> 516 } >> 517 >> 518 if (found_guilty && s_job->s_fence->scheduled.context == >> guilty_context) >> 519 dma_fence_set_error(&s_fence->finished, -ECANCELED); >> 520 >> 521 dma_fence_put(s_job->s_fence->parent); >> 522 fence = sched->ops->run_job(s_job); >> + 523 i++; >> 524 >> 525 if (IS_ERR_OR_NULL(fence)) { >> 526 if (IS_ERR(fence)) >> 527 dma_fence_set_error(&s_fence->finished, >> PTR_ERR(fence)); >> 528 >> 529 s_job->s_fence->parent = NULL; >> 530 } else { >> 531 s_job->s_fence->parent
RE: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode
[AMD Public Use] Hi, Christian, I made a change in the V3 patch that inserts a dma_fence_wait for the first jobs after resubmitting jobs. It seems simpler than the V2 patch. Is this what you first had in mind? Thanks, Jack -Original Message- From: Koenig, Christian Sent: Monday, March 8, 2021 3:53 PM To: Liu, Monk ; Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey ; Deng, Emily Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode On 08.03.21 at 05:06, Liu, Monk wrote: > [AMD Official Use Only - Internal Distribution Only] > >>> well, first of all, please completely drop the affinity group stuff from this >>> patch. We should concentrate on one feature at a time. > We need it to expedite the process; we can introduce this change in > another patch > > >>> Then the implementation is way too complicated. All you need to do is insert >>> a dma_fence_wait after re-scheduling each job after a reset. > No, that's not true. During drm_sched_resubmit_jobs it will put > all jobs in the mirror list into the hw ring, but we can only allow the > first job into the ring to catch the real guilty one (otherwise it is possible > that a later job in the ring also has a bug and affects our judgement). So we > need to implement a new drm_sched_resubmit_jobs2(), like this: Something like this. But since waiting for the guilty job is AMD specific we should rather rework the stuff from the beginning. What I have in mind is the following: 1. Add a reference from the scheduler fence back to the job which is cleared only when the scheduler fence finishes. 2. Completely drop the ring_mirror_list and replace it with a kfifo of pointers to the active scheduler fences. 3. Replace drm_sched_resubmit_jobs with a drm_sched_for_each_active() macro which allows drivers to iterate over all the active jobs and resubmit/wait/mark them as guilty, etc. 4. Remove the guilty/karma handling from the scheduler. 
This is something AMD specific and shouldn't leak into common code. Regards, Christian. > > drm_sched_resubmit_jobs2() > ~ 499 void drm_sched_resubmit_jobs2(struct drm_gpu_scheduler *sched, int max) >500 { >501 struct drm_sched_job *s_job, *tmp; >502 uint64_t guilty_context; >503 bool found_guilty = false; >504 struct dma_fence *fence; > + 505 int i = 0; >506 >507 list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, > node) { >508 struct drm_sched_fence *s_fence = s_job->s_fence; >509 > + 510 if (i >= max) > + 511 break; > + 512 >513 if (!found_guilty && atomic_read(&s_job->karma) > > sched->hang_limit) { >514 found_guilty = true; >515 guilty_context = s_job->s_fence->scheduled.context; >516 } >517 >518 if (found_guilty && s_job->s_fence->scheduled.context == > guilty_context) >519 dma_fence_set_error(&s_fence->finished, -ECANCELED); >520 >521 dma_fence_put(s_job->s_fence->parent); >522 fence = sched->ops->run_job(s_job); > + 523 i++; >524 >525 if (IS_ERR_OR_NULL(fence)) { >526 if (IS_ERR(fence)) >527 dma_fence_set_error(&s_fence->finished, > PTR_ERR(fence)); >528 >529 s_job->s_fence->parent = NULL; >530 } else { >531 s_job->s_fence->parent = fence; >532 } >533 >534 >535 } >536 } >537 EXPORT_SYMBOL(drm_sched_resubmit_jobs2); >538 > > > > Thanks > > -- > Monk Liu | Cloud-GPU Core team > -- > > -Original Message- > From: Koenig, Christian > Sent: Sunday, March 7, 2021 3:03 AM > To: Zhang, Jack (Jian) ; > amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey > ; Liu, Monk ; Deng, Emily > > Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode > > Hi Jack, > > well, first of all, please completely drop the affinity group stuff from this > patch. We should concentrate on one feature at a time. > > Then the implementation is way too complicated. All you need to do is insert a > dma_fence_wait after re-scheduling each job after a reset. > > In addition, this feature is completely AMD specific and shouldn't > affect the common scheduler in any way. 
> > Regards, > Christian. > > On 06.03.21 at 18:25, Jack Zhang wrote: >> [Why] >> The previous TDR design treats the first job in job_timeout as the bad job.
RE: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode
Hi, Christian and Monk, Thanks so much for these valuable ideas. I will follow the advice. >> Then the implementation is way too complicated. All you need to do is insert a >> dma_fence_wait after re-scheduling each job after a reset. Sure; if I re-implement amd_sched_resubmit_jobs2() and add a dma_fence_wait inside the jobs2 function, the change will be simpler. I will make the change and submit it for review later. Thanks a lot for your help. /Jack -Original Message- From: Liu, Monk Sent: Monday, March 8, 2021 12:06 PM To: Koenig, Christian ; Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey ; Deng, Emily Subject: RE: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode [AMD Official Use Only - Internal Distribution Only] >> well, first of all, please completely drop the affinity group stuff from this >> patch. We should concentrate on one feature at a time. We need it to expedite the process; we can introduce this change in another patch >> Then the implementation is way too complicated. All you need to do is insert a >> dma_fence_wait after re-scheduling each job after a reset. 
No, that's not true. During drm_sched_resubmit_jobs it will put all jobs in the mirror list into the hw ring, but we can only allow the first job into the ring to catch the real guilty one (otherwise it is possible that a later job in the ring also has a bug and affects our judgement). So we need to implement a new drm_sched_resubmit_jobs2(), like this: drm_sched_resubmit_jobs2() ~ 499 void drm_sched_resubmit_jobs2(struct drm_gpu_scheduler *sched, int max) 500 { 501 struct drm_sched_job *s_job, *tmp; 502 uint64_t guilty_context; 503 bool found_guilty = false; 504 struct dma_fence *fence; + 505 int i = 0; 506 507 list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) { 508 struct drm_sched_fence *s_fence = s_job->s_fence; 509 + 510 if (i >= max) + 511 break; + 512 513 if (!found_guilty && atomic_read(&s_job->karma) > sched->hang_limit) { 514 found_guilty = true; 515 guilty_context = s_job->s_fence->scheduled.context; 516 } 517 518 if (found_guilty && s_job->s_fence->scheduled.context == guilty_context) 519 dma_fence_set_error(&s_fence->finished, -ECANCELED); 520 521 dma_fence_put(s_job->s_fence->parent); 522 fence = sched->ops->run_job(s_job); + 523 i++; 524 525 if (IS_ERR_OR_NULL(fence)) { 526 if (IS_ERR(fence)) 527 dma_fence_set_error(&s_fence->finished, PTR_ERR(fence)); 528 529 s_job->s_fence->parent = NULL; 530 } else { 531 s_job->s_fence->parent = fence; 532 } 533 534 535 } 536 } 537 EXPORT_SYMBOL(drm_sched_resubmit_jobs2); 538 Thanks -- Monk Liu | Cloud-GPU Core team ------ -Original Message- From: Koenig, Christian Sent: Sunday, March 7, 2021 3:03 AM To: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey ; Liu, Monk ; Deng, Emily Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode Hi Jack, well, first of all, please completely drop the affinity group stuff from this patch. We should concentrate on one feature at a time. Then the implementation is way too complicated. 
All you need to do is insert a dma_fence_wait after re-scheduling each job after a reset. In addition, this feature is completely AMD specific and shouldn't affect the common scheduler in any way. Regards, Christian. On 06.03.21 at 18:25, Jack Zhang wrote: > [Why] > The previous TDR design treats the first job in job_timeout as the bad job. > But sometimes a later bad compute job can block a good gfx job and > cause an unexpected gfx job timeout, because the gfx and compute rings share > internal GC hardware. > > [How] > This patch implements an advanced TDR mode. It involves an additional > synchronous pre-resubmit step (Step0 Resubmit) before the normal resubmit > step in order to find the real bad job. > > 1. For the bailing TDR job, re-insert it into the mirror_list, don't set it to > guilty, and leave it to be handled by the main reset thread. > > 2. Don't set the job to guilty in pre_asic_reset, and leave it to be > handled by the Step0 Resubmit stage. > > 3. At the Step0 Resubmit stage, it first resubmits jobs asynchronously, > then it iterates over each ring's mirror_list and synchronously waits for each hw > fence to be signaled. If a job's hw fence times out, we identify > it as guilty and do a hw reset to recover the hw. After that, we do > the normal resubmit step to resubmit the remaining jobs. > > 4. For whole gpu re
RE: [PATCH] drm/amd/amdgpu: add fini virt data exchange to ip_suspend
[AMD Official Use Only - Internal Distribution Only] Reviewed-by: Jack Zhang -Original Message- From: Jingwen Chen Sent: Friday, March 5, 2021 2:12 PM To: amd-gfx@lists.freedesktop.org Cc: Chen, Horace ; Zhang, Jack (Jian) ; Chen, JingWen Subject: [PATCH] drm/amd/amdgpu: add fini virt data exchange to ip_suspend [Why] When trying to shut down a guest VM in SR-IOV mode, the virt data exchange is not finalized. After VRAM is lost, trying to write VRAM could hang the CPU. [How] Finalize the virt data exchange in ip_suspend. Signed-off-by: Jingwen Chen --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index a11760ec3924..bec725b50c1b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -2774,8 +2774,10 @@ int amdgpu_device_ip_suspend(struct amdgpu_device *adev) { int r; -if (amdgpu_sriov_vf(adev)) +if (amdgpu_sriov_vf(adev)) { +amdgpu_virt_fini_data_exchange(adev); amdgpu_virt_request_full_gpu(adev, false); +} r = amdgpu_device_ip_suspend_phase1(adev); if (r) -- 2.25.1 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
RE: [PATCH] drm/amd/amdgpu: fini data exchange when req_gpu_fini in SRIOV
[AMD Official Use Only - Internal Distribution Only] Please comment on what the issue is without this change. /Jack -Original Message- From: Jingwen Chen Sent: Friday, March 5, 2021 12:44 PM To: amd-gfx@lists.freedesktop.org Cc: Chen, Horace ; Zhang, Jack (Jian) ; Chen, JingWen Subject: [PATCH] drm/amd/amdgpu: fini data exchange when req_gpu_fini in SRIOV Do fini data exchange every time req_gpu_fini is requested in SRIOV. Signed-off-by: Jingwen Chen --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +--- drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 3 +++ 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index a11760ec3924..e3ed52f66414 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -3601,10 +3601,8 @@ void amdgpu_device_fini(struct amdgpu_device *adev) /* make sure IB test finished before entering exclusive mode * to avoid preemption on IB test * */ -if (amdgpu_sriov_vf(adev)) { +if (amdgpu_sriov_vf(adev)) amdgpu_virt_request_full_gpu(adev, false); -amdgpu_virt_fini_data_exchange(adev); -} /* disable all interrupts */ amdgpu_irq_disable_all(adev); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c index 3dd7eec52344..af1e5d8fc2b8 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c @@ -114,6 +114,9 @@ int amdgpu_virt_request_full_gpu(struct amdgpu_device *adev, bool init) struct amdgpu_virt *virt = &adev->virt; int r; +if (!init) +amdgpu_virt_fini_data_exchange(adev); + if (virt->ops && virt->ops->req_full_gpu) { r = virt->ops->req_full_gpu(adev, init); if (r) -- 2.25.1 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
RE: [PATCH] amdgpu/sriov Stop data exchange for wholegpu reset
[AMD Official Use Only - Internal Distribution Only] Ping... -Original Message- From: Zhang, Jack (Jian) Sent: Friday, January 8, 2021 11:07 AM To: amd-gfx@lists.freedesktop.org; Liu, Monk ; Chen, JingWen ; Deucher, Alexander ; Deng, Emily Subject: RE: [PATCH] amdgpu/sriov Stop data exchange for wholegpu reset Ping -Original Message- From: Jack Zhang Sent: Thursday, January 7, 2021 6:47 PM To: amd-gfx@lists.freedesktop.org Cc: Zhang, Jack (Jian) ; Zhang, Jack (Jian) ; Chen, JingWen Subject: [PATCH] amdgpu/sriov Stop data exchange for wholegpu reset [Why] When the host triggers a whole-GPU reset, the guest keeps waiting until the host finishes the reset. But there is a work queue in the guest exchanging data with the VF which needs to access the frame buffer. During a whole-GPU reset the frame buffer is not accessible, and this causes a call trace. [How] After the VF gets the reset notification from the PF, stop the data exchange. Signed-off-by: Jingwen Chen Signed-off-by: Jack Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 1 + drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c| 1 + drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c| 1 + 3 files changed, 3 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c index 83ca5cbffe2c..3e212862cf5d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c @@ -571,6 +571,7 @@ void amdgpu_virt_fini_data_exchange(struct amdgpu_device *adev) DRM_INFO("clean up the vf2pf work item\n"); flush_delayed_work(&adev->virt.vf2pf_work); cancel_delayed_work_sync(&adev->virt.vf2pf_work); +adev->virt.vf2pf_update_interval_ms = 0; } } diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c index 7767ccca526b..3ee481557fc9 100644 --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c @@ -255,6 +255,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work) if (!down_read_trylock(&adev->reset_sem)) return; +amdgpu_virt_fini_data_exchange(adev); 
atomic_set(&adev->in_gpu_reset, 1); do { diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c index dd5c1e6ce009..48e588d3c409 100644 --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c @@ -276,6 +276,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work) if (!down_read_trylock(&adev->reset_sem)) return; +amdgpu_virt_fini_data_exchange(adev); atomic_set(&adev->in_gpu_reset, 1); do { -- 2.25.1 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
RE: [PATCH] amdgpu/sriov Stop data exchange for wholegpu reset
Ping -Original Message- From: Jack Zhang Sent: Thursday, January 7, 2021 6:47 PM To: amd-gfx@lists.freedesktop.org Cc: Zhang, Jack (Jian) ; Zhang, Jack (Jian) ; Chen, JingWen Subject: [PATCH] amdgpu/sriov Stop data exchange for wholegpu reset [Why] When the host triggers a whole-GPU reset, the guest keeps waiting until the host finishes the reset. But there is a work queue in the guest exchanging data with the VF which needs to access the frame buffer. During a whole-GPU reset the frame buffer is not accessible, and this causes a call trace. [How] After the VF gets the reset notification from the PF, stop the data exchange. Signed-off-by: Jingwen Chen Signed-off-by: Jack Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 1 + drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c| 1 + drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c| 1 + 3 files changed, 3 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c index 83ca5cbffe2c..3e212862cf5d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c @@ -571,6 +571,7 @@ void amdgpu_virt_fini_data_exchange(struct amdgpu_device *adev) DRM_INFO("clean up the vf2pf work item\n"); flush_delayed_work(&adev->virt.vf2pf_work); cancel_delayed_work_sync(&adev->virt.vf2pf_work); + adev->virt.vf2pf_update_interval_ms = 0; } } diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c index 7767ccca526b..3ee481557fc9 100644 --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c @@ -255,6 +255,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work) if (!down_read_trylock(&adev->reset_sem)) return; + amdgpu_virt_fini_data_exchange(adev); atomic_set(&adev->in_gpu_reset, 1); do { diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c index dd5c1e6ce009..48e588d3c409 100644 --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c @@ -276,6 +276,7 @@ static void 
xgpu_nv_mailbox_flr_work(struct work_struct *work) if (!down_read_trylock(&adev->reset_sem)) return; + amdgpu_virt_fini_data_exchange(adev); atomic_set(&adev->in_gpu_reset, 1); do { -- 2.25.1 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
RE: [PATCH 3/5] drm/amd/sriov add mmsch_v3 interface
Hi, Dennis, I gave some feedback in the comments. Thank you for your review. Best Regards, Jack Zhang -Original Message- From: Li, Dennis Sent: Tuesday, July 14, 2020 12:35 PM To: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org Cc: Zhang, Jack (Jian) ; Liu, Leo ; Zhang, Hawking Subject: RE: [PATCH 3/5] drm/amd/sriov add mmsch_v3 interface [AMD Official Use Only - Internal Distribution Only] Hi, Jack, Please see the following comments. Best Regards Dennis Li -Original Message- From: amd-gfx On Behalf Of Jack Zhang Sent: Tuesday, July 14, 2020 10:47 AM To: amd-gfx@lists.freedesktop.org Cc: Zhang, Jack (Jian) ; Liu, Leo ; Zhang, Hawking Subject: [PATCH 3/5] drm/amd/sriov add mmsch_v3 interface For VCN 3.0 SRIOV, the guest driver needs to communicate with the MMSCH to set up the world switch for MM appropriately. This patch adds the interface for mmsch_v3.0. Signed-off-by: Jack Zhang --- drivers/gpu/drm/amd/amdgpu/mmsch_v3_0.h | 130 1 file changed, 130 insertions(+) create mode 100644 drivers/gpu/drm/amd/amdgpu/mmsch_v3_0.h diff --git a/drivers/gpu/drm/amd/amdgpu/mmsch_v3_0.h b/drivers/gpu/drm/amd/amdgpu/mmsch_v3_0.h new file mode 100644 index ..3e4e858a6965 --- /dev/null +++ b/drivers/gpu/drm/amd/amdgpu/mmsch_v3_0.h @@ -0,0 +1,130 @@ +/* + * Copyright 2020 Advanced Micro Devices, Inc. + * + * Permission is hereby granted, free of charge, to any person +obtaining a + * copy of this software and associated documentation files (the +"Software"), + * to deal in the Software without restriction, including without +limitation + * the rights to use, copy, modify, merge, publish, distribute, +sublicense, + * and/or sell copies of the Software, and to permit persons to whom +the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be +included in + * all copies or substantial portions of the Software. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR + * OTHER DEALINGS IN THE SOFTWARE. + * + */ + +#ifndef __MMSCH_V3_0_H__ +#define __MMSCH_V3_0_H__ + +#include "amdgpu_vcn.h" + +#define MMSCH_VERSION_MAJOR 3 +#define MMSCH_VERSION_MINOR 0 +#define MMSCH_VERSION (MMSCH_VERSION_MAJOR << 16 | MMSCH_VERSION_MINOR) + +enum mmsch_v3_0_command_type { + MMSCH_COMMAND__DIRECT_REG_WRITE = 0, + MMSCH_COMMAND__DIRECT_REG_POLLING = 2, + MMSCH_COMMAND__DIRECT_REG_READ_MODIFY_WRITE = 3, + MMSCH_COMMAND__INDIRECT_REG_WRITE = 8, + MMSCH_COMMAND__END = 0xf +}; + +struct mmsch_v3_0_table_info { + uint32_t init_status; + uint32_t table_offset; + uint32_t table_size; +}; + +struct mmsch_v3_0_init_header { + uint32_t version; + uint32_t total_size; + struct mmsch_v3_0_table_info inst[AMDGPU_MAX_VCN_INSTANCES]; }; [Dennis] You have defined total_size, so why is inst sized to AMDGPU_MAX_VCN_INSTANCES? That will cause memory waste. [Jack] In our case, AMDGPU_MAX_VCN_INSTANCES is a fixed number, which equals 2, and struct mmsch_v3_0_table_info occupies 3 dwords. Thus there's not much memory waste.
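To make the size trade-off in the review comment concrete, here is a standalone sketch (simplified local copies of the structs, not the kernel header): with AMDGPU_MAX_VCN_INSTANCES fixed at 2 and each table_info taking 3 dwords, the whole init header is only 8 dwords, so the cost of sizing inst[] statically is small.

```c
#include <assert.h>
#include <stdint.h>

#define AMDGPU_MAX_VCN_INSTANCES 2	/* fixed at 2 for this ASIC family */

/* Simplified local copies of the mmsch_v3_0.h layouts for sizing only. */
struct mmsch_v3_0_table_info {
	uint32_t init_status;
	uint32_t table_offset;
	uint32_t table_size;
};

struct mmsch_v3_0_init_header {
	uint32_t version;
	uint32_t total_size;
	struct mmsch_v3_0_table_info inst[AMDGPU_MAX_VCN_INSTANCES];
};

/* Header size in dwords: 2 fixed dwords plus 3 dwords per instance. */
static inline uint32_t mmsch_v3_0_header_size_dw(void)
{
	return sizeof(struct mmsch_v3_0_init_header) / sizeof(uint32_t);
}
```

All members are uint32_t, so there is no padding and the header is exactly 2 + 2 * 3 = 8 dwords.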
+struct mmsch_v3_0_cmd_direct_reg_header { + uint32_t reg_offset : 28; + uint32_t command_type : 4; +}; + +struct mmsch_v3_0_cmd_indirect_reg_header { + uint32_t reg_offset: 20; + uint32_t reg_idx_space : 8; + uint32_t command_type : 4; +}; + +struct mmsch_v3_0_cmd_direct_write { + struct mmsch_v3_0_cmd_direct_reg_header cmd_header; + uint32_t reg_value; +}; + +struct mmsch_v3_0_cmd_direct_read_modify_write { + struct mmsch_v3_0_cmd_direct_reg_header cmd_header; + uint32_t write_data; + uint32_t mask_value; +}; + +struct mmsch_v3_0_cmd_direct_polling { + struct mmsch_v3_0_cmd_direct_reg_header cmd_header; + uint32_t mask_value; + uint32_t wait_value; +}; + +struct mmsch_v3_0_cmd_end { + struct mmsch_v3_0_cmd_direct_reg_header cmd_header; }; + +struct mmsch_v3_0_cmd_indirect_write { + struct mmsch_v3_0_cmd_indirect_reg_header cmd_header; + uint32_t reg_value; +}; + +#define MMSCH_V3_0_INSERT_DIRECT_RD_MOD_WT(reg, mask, data) { \ + size = sizeof(struct mmsch_v3_0_cmd_direct_read_modify_write); \ + size_dw = size / 4; \ + direct_rd_mod_wt.cmd_header.reg_offset = reg; \ + direct_rd_mod_wt.mask_value = mask; \ + direct_rd_mod_wt.write_data = data; \ + memcpy((void *)table_loc, &direct_rd_mod_wt, size); \ + table_loc += size
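The direct-command header above packs a 28-bit register offset and a 4-bit command type into one dword. The kernel header expresses this with C bitfields; the sketch below builds what is assumed to be the same layout (first-declared field in the low bits, as on the little-endian targets this driver runs on) with explicit shifts, which makes the packing independent of the compiler's bitfield ordering.

```c
#include <assert.h>
#include <stdint.h>

/* Command types from mmsch_v3_0.h. */
enum mmsch_v3_0_command_type {
	MMSCH_COMMAND__DIRECT_REG_WRITE = 0,
	MMSCH_COMMAND__DIRECT_REG_POLLING = 2,
	MMSCH_COMMAND__DIRECT_REG_READ_MODIFY_WRITE = 3,
	MMSCH_COMMAND__INDIRECT_REG_WRITE = 8,
	MMSCH_COMMAND__END = 0xf
};

/*
 * Pack reg_offset into bits [27:0] and command_type into bits [31:28].
 * This mirrors the bitfield struct under the assumed little-endian
 * allocation order; it is an illustration, not the kernel's code.
 */
static inline uint32_t mmsch_v3_0_pack_direct_header(uint32_t reg_offset,
						     uint32_t command_type)
{
	return (reg_offset & 0x0fffffffu) | (command_type << 28);
}
```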
Re: [PATCH] drm/amdgpu fix incorrect sysfs remove behavior for xgmi
[AMD Official Use Only - Internal Distribution Only] ping From: Jack Zhang Sent: Monday, May 18, 2020 5:00:53 PM To: amd-gfx@lists.freedesktop.org Cc: Zhang, Jack (Jian) Subject: [PATCH] drm/amdgpu fix incorrect sysfs remove behavior for xgmi Under xgmi setup, some sysfs files fail to be created the second time the kmd driver is loaded. This is because sysfs nodes were not removed appropriately during the last unload. Changes of this patch: 1. remove sysfs for dev_attr_xgmi_error 2. remove the sysfs_link in adev->dev->kobj with the target name; it only needs to be removed once per xgmi setup 3. remove the sysfs_link in hive->kobj with the target name In amdgpu_xgmi_remove_device: 1. amdgpu_xgmi_sysfs_rem_dev_info needs to be run per device 2. amdgpu_xgmi_sysfs_destroy needs to be run only for the last device node. v2: initialize array with memset Signed-off-by: Jack Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 24 +--- 1 file changed, 17 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c index e9e59bc..3b46ea8 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c @@ -325,9 +325,19 @@ static int amdgpu_xgmi_sysfs_add_dev_info(struct amdgpu_device *adev, static void amdgpu_xgmi_sysfs_rem_dev_info(struct amdgpu_device *adev, struct amdgpu_hive_info *hive) { + char node[10]; + memset(node, 0, sizeof(node)); + device_remove_file(adev->dev, &dev_attr_xgmi_device_id); - sysfs_remove_link(&adev->dev->kobj, adev->ddev->unique); - sysfs_remove_link(hive->kobj, adev->ddev->unique); + device_remove_file(adev->dev, &dev_attr_xgmi_error); + + if (adev != hive->adev) { + sysfs_remove_link(&adev->dev->kobj, "xgmi_hive_info"); + } + + sprintf(node, "node%d", hive->number_devices); + sysfs_remove_link(hive->kobj, node); + } @@ -583,14 +593,14 @@ int amdgpu_xgmi_remove_device(struct amdgpu_device *adev) if (!hive) return -EINVAL; + task_barrier_rem_task(&hive->tb); + amdgpu_xgmi_sysfs_rem_dev_info(adev, hive); + mutex_unlock(&hive->hive_lock); + - if (!(hive->number_devices--)) { + if (!(--hive->number_devices)) { amdgpu_xgmi_sysfs_destroy(adev, hive); mutex_destroy(&hive->hive_lock); mutex_destroy(&hive->reset_lock); - } else { - task_barrier_rem_task(&hive->tb); - amdgpu_xgmi_sysfs_rem_dev_info(adev, hive); - mutex_unlock(&hive->hive_lock); } return psp_xgmi_terminate(&adev->psp); -- 2.7.4
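The core of the fix above is switching the last-device test from post-decrement to pre-decrement. With a post-decrement, the condition sees the count *before* it drops, so for the final device (count 1) the destroy path is never taken. A standalone sketch (illustrative helpers, not kernel code) of the two forms:

```c
#include <assert.h>

/* Corrected form from the patch: pre-decrement, then test the new
 * value. With one device left, this reports "last device removed". */
static int last_device_detected_pre(void)
{
	int number_devices = 1;

	return !(--number_devices);	/* tests 0 -> destroy path taken */
}

/* Original buggy form: post-decrement tests the old value, so with one
 * device left the destroy path is missed. */
static int last_device_detected_post(void)
{
	int number_devices = 1;

	return !(number_devices--);	/* tests 1 -> destroy path skipped */
}
```

This is why, before the fix, sysfs teardown for the hive never ran and the second driver load failed to recreate the nodes.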
RE: [PATCH] drm/amdgpu fix incorrect sysfs remove behavior for xgmi
-Original Message- From: Jack Zhang Sent: Monday, May 18, 2020 12:45 PM To: amd-gfx@lists.freedesktop.org Cc: Zhang, Jack (Jian) Subject: [PATCH] drm/amdgpu fix incorrect sysfs remove behavior for xgmi Under xgmi setup, some sysfs files fail to be created the second time the kmd driver is loaded. This is because sysfs nodes were not removed appropriately during the last unload. Changes of this patch: 1. remove sysfs for dev_attr_xgmi_error 2. remove the sysfs_link in adev->dev->kobj with the target name; it only needs to be removed once per xgmi setup 3. remove the sysfs_link in hive->kobj with the target name In amdgpu_xgmi_remove_device: 1. amdgpu_xgmi_sysfs_rem_dev_info needs to be run per device 2. amdgpu_xgmi_sysfs_destroy needs to be run only for the last device node. Signed-off-by: Jack Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 22 +++--- 1 file changed, 15 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c index e9e59bc..bfe2468 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c @@ -325,9 +325,17 @@ static int amdgpu_xgmi_sysfs_add_dev_info(struct amdgpu_device *adev, static void amdgpu_xgmi_sysfs_rem_dev_info(struct amdgpu_device *adev, struct amdgpu_hive_info *hive) { + char node[10] = { 0 }; device_remove_file(adev->dev, &dev_attr_xgmi_device_id); - sysfs_remove_link(&adev->dev->kobj, adev->ddev->unique); - sysfs_remove_link(hive->kobj, adev->ddev->unique); + device_remove_file(adev->dev, &dev_attr_xgmi_error); + + if (adev != hive->adev) { + sysfs_remove_link(&adev->dev->kobj, "xgmi_hive_info"); + } + + sprintf(node, "node%d", hive->number_devices); + sysfs_remove_link(hive->kobj, node); + } @@ -583,14 +591,14 @@ int amdgpu_xgmi_remove_device(struct amdgpu_device *adev) if (!hive) return -EINVAL; + task_barrier_rem_task(&hive->tb); + amdgpu_xgmi_sysfs_rem_dev_info(adev, hive); + mutex_unlock(&hive->hive_lock); + - if (!(hive->number_devices--)) { + if (!(--hive->number_devices)) { amdgpu_xgmi_sysfs_destroy(adev, hive); mutex_destroy(&hive->hive_lock); mutex_destroy(&hive->reset_lock); - } else { - task_barrier_rem_task(&hive->tb); - amdgpu_xgmi_sysfs_rem_dev_info(adev, hive); - mutex_unlock(&hive->hive_lock); } return psp_xgmi_terminate(&adev->psp); -- 2.7.4
RE: [PATCH] drm/amdgpu/sriov add amdgpu_amdkfd_pre_reset in gpu reset
Thanks Monk, I just updated the patch and it passed 1000 rounds of TDR test. Sent out a review email. Regards, Jack -Original Message- From: Liu, Monk Sent: Friday, April 3, 2020 11:38 AM To: Kuehling, Felix ; Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org Subject: RE: [PATCH] drm/amdgpu/sriov add amdgpu_amdkfd_pre_reset in gpu reset Thanks Felix Hi Jack I think the below changes can resolve your problem; we had this on our customer branch already. It fixes the memory leak and also fixes my previous bug. Can you apply this change to gfx_v10/v9? Thanks! diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c index 29749502..532258445 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c @@ -543,6 +543,8 @@ static int kgd_hqd_destroy(struct kgd_dev *kgd, void *mqd, uint32_t temp; struct v10_compute_mqd *m = get_mqd(mqd); + if (amdgpu_sriov_vf(adev) && adev->in_gpu_reset) + return 0; #if 0 unsigned long flags; int retry; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 35b32ad..f6479e1 100755 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -3653,6 +3653,8 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev, if (r) return r; + amdgpu_amdkfd_pre_reset(adev); + /* Resume IP prior to SMC */ r = amdgpu_device_ip_reinit_early_sriov(adev); if (r) _ Monk Liu | GPU Virtualization Team | AMD -Original Message- From: Kuehling, Felix Sent: Friday, April 3, 2020 1:26 AM To: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org; Liu, Monk Subject: Re: [PATCH] drm/amdgpu/sriov add amdgpu_amdkfd_pre_reset in gpu reset [+Monk] This looks reasonable to me. However, you're effectively reverting this commit by Monk: a03eb637d2a5 drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV In hindsight, Monk's commit was broken.
Removing the call to pre_reset has other consequences, such as breaking notifications about reset to user mode, and probably invalidating some assumptions in kfd_post_reset. Can you coordinate with Monk to work out why his change was needed, and whether you'll need a different solution for the problem he was trying to address? In the meantime, this patch is Acked-by: Felix Kuehling On 2020-04-02 at 3:20 a.m., Jack Zhang wrote: > kfd_pre_reset will free mem_objs allocated by kfd_gtt_sa_allocate > > Without this change, the sriov tdr code path will never free those > allocated memories, resulting in a memory leak. > > Signed-off-by: Jack Zhang > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > index 8faaa17..832daf7 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > @@ -3847,6 +3847,8 @@ static int amdgpu_device_reset_sriov(struct > amdgpu_device *adev, { > int r; > > + amdgpu_amdkfd_pre_reset(adev); > + > if (from_hypervisor) > r = amdgpu_virt_request_full_gpu(adev, true); > else
RE: [PATCH] drm/amdgpu/sriov add amdgpu_amdkfd_pre_reset in gpu reset
-Original Message- From: Jack Zhang Sent: Thursday, April 2, 2020 3:20 PM To: amd-gfx@lists.freedesktop.org Cc: Zhang, Jack (Jian) Subject: [PATCH] drm/amdgpu/sriov add amdgpu_amdkfd_pre_reset in gpu reset kfd_pre_reset will free mem_objs allocated by kfd_gtt_sa_allocate Without this change, the sriov tdr code path will never free those allocated memories, resulting in a memory leak. Signed-off-by: Jack Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 8faaa17..832daf7 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -3847,6 +3847,8 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev, { int r; + amdgpu_amdkfd_pre_reset(adev); + if (from_hypervisor) r = amdgpu_virt_request_full_gpu(adev, true); else -- 2.7.4
RE: [PATCH] drm/amdgpu/sriov refine vcn_v2_5_early_init func
Ping... -Original Message- From: amd-gfx On Behalf Of Jack Zhang Sent: Tuesday, March 10, 2020 8:49 PM To: amd-gfx@lists.freedesktop.org Cc: Zhang, Jack (Jian) ; Zhang, Jack (Jian) Subject: [PATCH] drm/amdgpu/sriov refine vcn_v2_5_early_init func refine the assignment for vcn.num_vcn_inst, vcn.harvest_config, vcn.num_enc_rings in VF Signed-off-by: Jack Zhang --- drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c | 35 ++- 1 file changed, 18 insertions(+), 17 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c index 2d64ba1..9b22e2b 100644 --- a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c @@ -74,29 +74,30 @@ static int amdgpu_ih_clientid_vcns[] = { static int vcn_v2_5_early_init(void *handle) { struct amdgpu_device *adev = (struct amdgpu_device *)handle; - if (adev->asic_type == CHIP_ARCTURUS) { - u32 harvest; - int i; - - adev->vcn.num_vcn_inst = VCN25_MAX_HW_INSTANCES_ARCTURUS; - for (i = 0; i < adev->vcn.num_vcn_inst; i++) { - harvest = RREG32_SOC15(UVD, i, mmCC_UVD_HARVESTING); - if (harvest & CC_UVD_HARVESTING__UVD_DISABLE_MASK) - adev->vcn.harvest_config |= 1 << i; - } - - if (adev->vcn.harvest_config == (AMDGPU_VCN_HARVEST_VCN0 | -AMDGPU_VCN_HARVEST_VCN1)) - /* both instances are harvested, disable the block */ - return -ENOENT; - } else - adev->vcn.num_vcn_inst = 1; if (amdgpu_sriov_vf(adev)) { adev->vcn.num_vcn_inst = 2; adev->vcn.harvest_config = 0; adev->vcn.num_enc_rings = 1; } else { + if (adev->asic_type == CHIP_ARCTURUS) { + u32 harvest; + int i; + + adev->vcn.num_vcn_inst = VCN25_MAX_HW_INSTANCES_ARCTURUS; + for (i = 0; i < adev->vcn.num_vcn_inst; i++) { + harvest = RREG32_SOC15(UVD, i, mmCC_UVD_HARVESTING); + if (harvest & CC_UVD_HARVESTING__UVD_DISABLE_MASK) + adev->vcn.harvest_config |= 1 << i; + } + + if (adev->vcn.harvest_config == (AMDGPU_VCN_HARVEST_VCN0 | + AMDGPU_VCN_HARVEST_VCN1)) + /* both instances are harvested, disable the block */ + return -ENOENT; + } else + 
adev->vcn.num_vcn_inst = 1; + adev->vcn.num_enc_rings = 2; } -- 2.7.4
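The restructured early_init keeps register-based harvest detection for bare metal only; an SR-IOV VF uses a fixed configuration instead. A standalone sketch of the bare-metal harvest logic (illustrative helpers; `disable_bits` stands in for the per-instance CC_UVD_HARVESTING register reads, with bit i meaning instance i reports UVD_DISABLE):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define AMDGPU_VCN_HARVEST_VCN0 (1 << 0)
#define AMDGPU_VCN_HARVEST_VCN1 (1 << 1)

/* Build the harvest_config bitmask the way the bare-metal branch of
 * vcn_v2_5_early_init does: set bit i when instance i is harvested. */
static uint32_t vcn_harvest_mask(uint32_t disable_bits, int num_inst)
{
	uint32_t cfg = 0;
	int i;

	for (i = 0; i < num_inst; i++)
		if (disable_bits & (1u << i))
			cfg |= 1u << i;
	return cfg;
}

/* -ENOENT when every instance is harvested: the whole VCN block is
 * disabled. A VF skips this check and assumes both instances exist. */
static int vcn_harvest_check(uint32_t disable_bits, int num_inst)
{
	if (num_inst == 2 &&
	    vcn_harvest_mask(disable_bits, num_inst) ==
	    (AMDGPU_VCN_HARVEST_VCN0 | AMDGPU_VCN_HARVEST_VCN1))
		return -ENOENT;
	return 0;
}
```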
RE: [PATCH] drm/amdgpu/sriov Don't send msg when smu suspend
Hi, Team, Would you please help to take a look at this patch? Regards, Jack -Original Message- From: amd-gfx On Behalf Of Jack Zhang Sent: Wednesday, February 5, 2020 5:18 PM To: amd-gfx@lists.freedesktop.org Cc: Zhang, Jack (Jian) Subject: [PATCH] drm/amdgpu/sriov Don't send msg when smu suspend For sriov and pp_onevf_mode, do not send messages to set smu status, because smu doesn't support these messages under VF. Besides, it should skip smu_suspend when pp_onevf_mode is disabled. Signed-off-by: Jack Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 --- drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 21 + 2 files changed, 21 insertions(+), 15 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 4ff7ce3..2d1f8d4 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -2353,15 +2353,16 @@ static int amdgpu_device_ip_suspend_phase2(struct amdgpu_device *adev) } adev->ip_blocks[i].status.hw = false; /* handle putting the SMC in the appropriate state */ - if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_SMC) { - r = amdgpu_dpm_set_mp1_state(adev, adev->mp1_state); - if (r) { - DRM_ERROR("SMC failed to set mp1 state %d, %d\n", - adev->mp1_state, r); - return r; + if (!amdgpu_sriov_vf(adev)) { + if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_SMC) { + r = amdgpu_dpm_set_mp1_state(adev, adev->mp1_state); + if (r) { + DRM_ERROR("SMC failed to set mp1 state %d, %d\n", + adev->mp1_state, r); + return r; + } } } - adev->ip_blocks[i].status.hw = false; } diff --git a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c index 99ad4dd..a6d7b5f 100644 --- a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c +++ b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c @@ -1461,21 +1461,26 @@ static int smu_suspend(void *handle) struct smu_context *smu = &adev->smu; bool baco_feature_is_enabled = false; + if (amdgpu_sriov_vf(adev) && !amdgpu_sriov_is_pp_one_vf(adev)) + return 0; + if (!smu->pm_enabled) return 0; if (!smu->is_apu) baco_feature_is_enabled = smu_feature_is_enabled(smu, SMU_FEATURE_BACO_BIT); - ret = smu_system_features_control(smu, false); - if (ret) - return ret; - - if (baco_feature_is_enabled) { - ret = smu_feature_set_enabled(smu, SMU_FEATURE_BACO_BIT, true); - if (ret) { - pr_warn("set BACO feature enabled failed, return %d\n", ret); + if (!amdgpu_sriov_vf(adev)) { + ret = smu_system_features_control(smu, false); + if (ret) return ret; + + if (baco_feature_is_enabled) { + ret = smu_feature_set_enabled(smu, SMU_FEATURE_BACO_BIT, true); + if (ret) { + pr_warn("set BACO feature enabled failed, return %d\n", ret); + return ret; + } } } -- 2.7.4
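The gating in the two hunks above reduces to two predicates: skip smu_suspend entirely for a multi-VF SR-IOV guest (or when pm is disabled), and never send the feature-control/BACO messages from any VF, since the host driver owns that state. A standalone sketch (illustrative names, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the early-return guards added to smu_suspend(). */
static bool smu_suspend_should_skip(bool sriov_vf, bool pp_one_vf,
				    bool pm_enabled)
{
	if (sriov_vf && !pp_one_vf)
		return true;	/* multi-VF: SMU messages unsupported in VF */
	if (!pm_enabled)
		return true;	/* power management disabled */
	return false;
}

/* Mirrors the "if (!amdgpu_sriov_vf(adev))" wrapper: even a one-VF
 * guest must not send system-features-control or BACO messages. */
static bool smu_may_send_feature_msgs(bool sriov_vf)
{
	return !sriov_vf;
}
```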
RE: [PATCH 1/2] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF
hi, Emily, Thank you for your review. Comment inline, updated patch attached -Original Message- From: Deng, Emily Sent: Friday, December 27, 2019 3:27 PM To: Zhang, Jack (Jian) ; Feng, Kenneth ; Deucher, Alexander ; Quan, Evan ; Wang, Kevin(Yang) ; Tao, Yintian ; Min, Frank ; Liu, Monk ; amd-gfx@lists.freedesktop.org Subject: RE: [PATCH 1/2] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF [AMD Official Use Only - Internal Distribution Only] >-Original Message- >From: Zhang, Jack (Jian) >Sent: Friday, December 27, 2019 3:00 PM >To: Feng, Kenneth ; Deucher, Alexander >; Quan, Evan ; Wang, >Kevin(Yang) ; Tao, Yintian ; >Deng, Emily ; Min, Frank ; Liu, >Monk ; amd-gfx@lists.freedesktop.org; Zhang, Jack >(Jian) >Subject: RE: [PATCH 1/2] amd/amdgpu/sriov enable onevf mode for >ARCTURUS VF > > > >-Original Message- >From: Jack Zhang >Sent: Friday, December 27, 2019 2:57 PM >To: amd-gfx@lists.freedesktop.org >Cc: Zhang, Jack (Jian) >Subject: [PATCH 1/2] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF > >Before, initialization of smu ip block would be skipped for sriov >ASICs. But if there's only one VF being used, guest driver should be >able to dump some HW info such as clks, temperature,etc. > >To solve this, now after onevf mode is enabled, host driver will notify >guest. If it's onevf mode, guest will do smu hw_init and skip some >steps in normal smu hw_init flow because host driver has already done it for >smu. > >With this fix, guest app can talk with smu and dump hw information from smu. > >v2: refine the logic for pm_enabled.Skip hw_init by not changing pm_enabled. 
> >Signed-off-by: Jack Zhang >--- > drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c| 3 +- > drivers/gpu/drm/amd/amdgpu/soc15.c | 3 +- > drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 45 >+-- >--- > 3 files changed, 29 insertions(+), 22 deletions(-) > >diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c >b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c >index 8469834..08130a6 100644 >--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c >+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c >@@ -1448,7 +1448,8 @@ static int psp_np_fw_load(struct psp_context *psp) > || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_G > || ucode->ucode_id == >AMDGPU_UCODE_ID_RLC_RESTORE_LIST_CNTL > || ucode->ucode_id == >AMDGPU_UCODE_ID_RLC_RESTORE_LIST_GPM_MEM >- || ucode->ucode_id == >AMDGPU_UCODE_ID_RLC_RESTORE_LIST_SRM_MEM)) >+ || ucode->ucode_id == >AMDGPU_UCODE_ID_RLC_RESTORE_LIST_SRM_MEM >+ || ucode->ucode_id == AMDGPU_UCODE_ID_SMC)) > /*skip ucode loading in SRIOV VF */ > continue; > >diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c >b/drivers/gpu/drm/amd/amdgpu/soc15.c >index b53d401..a271496 100644 >--- a/drivers/gpu/drm/amd/amdgpu/soc15.c >+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c >@@ -827,8 +827,7 @@ int soc15_set_ip_blocks(struct amdgpu_device *adev) > amdgpu_device_ip_block_add(adev, >_virtual_ip_block); > amdgpu_device_ip_block_add(adev, _v9_0_ip_block); > amdgpu_device_ip_block_add(adev, _v4_0_ip_block); >- if (!amdgpu_sriov_vf(adev)) >- amdgpu_device_ip_block_add(adev, >_v11_0_ip_block); >+ amdgpu_device_ip_block_add(adev, _v11_0_ip_block); > > if (amdgpu_sriov_vf(adev)) { > if (likely(adev->firmware.load_type == >AMDGPU_FW_LOAD_PSP)) diff --git >a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c >b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c >index 936c682..42c0a6d 100644 >--- a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c >+++ b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c >@@ -531,10 +531,14 @@ bool is_support_sw_smu(struct amdgpu_device >*adev) > if (adev->asic_type == CHIP_VEGA20) > return (amdgpu_dpm == 2) ? 
true : false; > else if (adev->asic_type >= CHIP_ARCTURUS) { >- if (amdgpu_sriov_vf(adev)) >- return false; >- else >+ if (amdgpu_sriov_vf(adev)) { >+ if(amdgpu_sriov_is_pp_one_vf(adev)) >+ return true; >+ else >+ return false; >+ } else { > return true; >+ } > } else > return false; > } >@@ -1062,20 +1066,19 @@ static int smu_smc_table_hw_init(struct >smu_context *smu, > } > > /* smu_dump_pptable(smu); *
RE: [PATCH 1/2] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF
Hi, all, Here is some description of sriov and pp_one_vf mode: 1) amdgpu_sriov_vf marks whether it is sriov or bare-metal, while amdgpu_sriov_is_pp_one_vf is a mode of sriov: it means there is only one VF generated by the host driver. 2) The host administrator can determine the “vf number” when the host driver is loaded. If vf_num = 1, the host driver will notify the guest driver that it is under one-vf mode, and pp_one_vf returns true; otherwise, pp_one_vf returns false. Without unloading the guest driver and host driver, vf_num cannot be changed, so it is a static setting. 3) Under pp_one_vf mode on a sriov platform, the guest driver will do smu hw_init; the goal is to let the guest driver “talk” with smu by sending authorized smu messages. This helps user mode apps dump info like clks, temperature, GPU usage, etc. Currently we don’t support the guest driver writing values to smu; we can only read smu information. Besides, as the host driver has already initialized the smu hw, such as loading the smu firmware and setting some feature control modes, some hw init steps need to be skipped in the guest driver, such as writing the pptable and loading the smc firmware. 4) At the same time, pp_one_vf mode needs some smu firmware changes to open permissions for certain messages in VF. B.R. Jack -Original Message- From: Zhang, Jack (Jian) Sent: Friday, December 27, 2019 3:00 PM To: Feng, Kenneth ; Deucher, Alexander ; Quan, Evan ; Wang, Kevin(Yang) ; Tao, Yintian ; Deng, Emily ; Min, Frank ; Liu, Monk ; amd-gfx@lists.freedesktop.org; Zhang, Jack (Jian) Subject: RE: [PATCH 1/2] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF -Original Message- From: Jack Zhang Sent: Friday, December 27, 2019 2:57 PM To: amd-gfx@lists.freedesktop.org Cc: Zhang, Jack (Jian) Subject: [PATCH 1/2] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF Before, initialization of smu ip block would be skipped for sriov ASICs. But if there's only one VF being used, the guest driver should be able to dump some HW info such as clks, temperature, etc.
To solve this, now after onevf mode is enabled, host driver will notify guest. If it's onevf mode, guest will do smu hw_init and skip some steps in normal smu hw_init flow because host driver has already done it for smu. With this fix, guest app can talk with smu and dump hw information from smu. v2: refine the logic for pm_enabled.Skip hw_init by not changing pm_enabled. Signed-off-by: Jack Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c| 3 +- drivers/gpu/drm/amd/amdgpu/soc15.c | 3 +- drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 45 +- 3 files changed, 29 insertions(+), 22 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c index 8469834..08130a6 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c @@ -1448,7 +1448,8 @@ static int psp_np_fw_load(struct psp_context *psp) || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_G || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_CNTL || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_GPM_MEM - || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_SRM_MEM)) + || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_SRM_MEM + || ucode->ucode_id == AMDGPU_UCODE_ID_SMC)) /*skip ucode loading in SRIOV VF */ continue; diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c index b53d401..a271496 100644 --- a/drivers/gpu/drm/amd/amdgpu/soc15.c +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c @@ -827,8 +827,7 @@ int soc15_set_ip_blocks(struct amdgpu_device *adev) amdgpu_device_ip_block_add(adev, _virtual_ip_block); amdgpu_device_ip_block_add(adev, _v9_0_ip_block); amdgpu_device_ip_block_add(adev, _v4_0_ip_block); - if (!amdgpu_sriov_vf(adev)) - amdgpu_device_ip_block_add(adev, _v11_0_ip_block); + amdgpu_device_ip_block_add(adev, _v11_0_ip_block); if (amdgpu_sriov_vf(adev)) { if (likely(adev->firmware.load_type == AMDGPU_FW_LOAD_PSP)) diff --git a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c 
b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c index 936c682..42c0a6d 100644 --- a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c +++ b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c @@ -531,10 +531,14 @@ bool is_support_sw_smu(struct amdgpu_device *adev) if (adev->asic_type == CHIP_VEGA20) return (amdgpu_dpm == 2) ? true : false; else if (adev->asic_type >= CHIP_ARCTURUS) { - if (amdgpu_sriov_vf(adev)) - return false; - else +
RE: [PATCH 1/2] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF
-Original Message- From: Jack Zhang Sent: Friday, December 27, 2019 2:57 PM To: amd-gfx@lists.freedesktop.org Cc: Zhang, Jack (Jian) Subject: [PATCH 1/2] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF Before, initialization of smu ip block would be skipped for sriov ASICs. But if there's only one VF being used, the guest driver should be able to dump some HW info such as clks, temperature, etc. To solve this, now after onevf mode is enabled, the host driver will notify the guest. If it's onevf mode, the guest will do smu hw_init and skip some steps in the normal smu hw_init flow because the host driver has already done them for smu. With this fix, a guest app can talk with smu and dump hw information from smu. v2: refine the logic for pm_enabled. Skip hw_init by not changing pm_enabled. Signed-off-by: Jack Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c| 3 +- drivers/gpu/drm/amd/amdgpu/soc15.c | 3 +- drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 45 +- 3 files changed, 29 insertions(+), 22 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c index 8469834..08130a6 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c @@ -1448,7 +1448,8 @@ static int psp_np_fw_load(struct psp_context *psp) || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_G || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_CNTL || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_GPM_MEM - || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_SRM_MEM)) + || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_SRM_MEM + || ucode->ucode_id == AMDGPU_UCODE_ID_SMC)) /*skip ucode loading in SRIOV VF */ continue; diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c index b53d401..a271496 100644 --- a/drivers/gpu/drm/amd/amdgpu/soc15.c +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c @@ -827,8 +827,7 @@ int soc15_set_ip_blocks(struct amdgpu_device *adev) amdgpu_device_ip_block_add(adev, &dce_virtual_ip_block);
amdgpu_device_ip_block_add(adev, &gfx_v9_0_ip_block); amdgpu_device_ip_block_add(adev, &sdma_v4_0_ip_block); - if (!amdgpu_sriov_vf(adev)) - amdgpu_device_ip_block_add(adev, &smu_v11_0_ip_block); + amdgpu_device_ip_block_add(adev, &smu_v11_0_ip_block); if (amdgpu_sriov_vf(adev)) { if (likely(adev->firmware.load_type == AMDGPU_FW_LOAD_PSP)) diff --git a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c index 936c682..42c0a6d 100644 --- a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c +++ b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c @@ -531,10 +531,14 @@ bool is_support_sw_smu(struct amdgpu_device *adev) if (adev->asic_type == CHIP_VEGA20) return (amdgpu_dpm == 2) ? true : false; else if (adev->asic_type >= CHIP_ARCTURUS) { - if (amdgpu_sriov_vf(adev)) - return false; - else + if (amdgpu_sriov_vf(adev)) { + if (amdgpu_sriov_is_pp_one_vf(adev)) + return true; + else + return false; + } else { return true; + } } else return false; } @@ -1062,20 +1066,19 @@ static int smu_smc_table_hw_init(struct smu_context *smu, } /* smu_dump_pptable(smu); */ + if (amdgpu_sriov_vf(adev) && !amdgpu_sriov_is_pp_one_vf(adev)) { + /* + * Copy pptable bo in the vram to smc with SMU MSGs such as + * SetDriverDramAddr and TransferTableDram2Smu. + */ + ret = smu_write_pptable(smu); + if (ret) + return ret; - /* - * Copy pptable bo in the vram to smc with SMU MSGs such as - * SetDriverDramAddr and TransferTableDram2Smu. - */ - ret = smu_write_pptable(smu); - if (ret) - return ret; - - /* issue Run*Btc msg */ - ret = smu_run_btc(smu); - if (ret) - return ret; - + /* issue Run*Btc msg */ + ret = smu_run_btc(smu); + if (ret) + return ret; ret = smu_feature_set_allowed_mask(smu); if (ret) return ret; @@ -1083,7 +1086,7 @@ static int smu_smc_table_hw_init(struct smu_context *smu, ret = smu_system_features_control(smu, true); if (ret) return ret; - + } if (adev->asic_type != CHIP_ARCTURUS) {
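The is_support_sw_smu() change above amounts to a small decision table: Arcturus-and-newer bare metal supports the software SMU path, and an SR-IOV VF supports it only in one-VF mode. A hedged sketch (local reimplementation for illustration; Vega20's amdgpu_dpm module-parameter special case is omitted):

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for the asic_type ordering used in the check. */
enum chip { CHIP_VEGA20 = 0, CHIP_ARCTURUS = 1 };

/* Decision table of the patched is_support_sw_smu() for the cases in
 * this diff: bare metal -> true; SR-IOV VF -> true only with pp_one_vf. */
static bool is_support_sw_smu_sketch(enum chip asic, bool sriov_vf,
				     bool pp_one_vf)
{
	if (asic >= CHIP_ARCTURUS)
		return sriov_vf ? pp_one_vf : true;
	return false;
}
```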
RE: [PATCH 2/2] amd/amdgpu/sriov use resume for smc recover under sriov
-Original Message- From: Jack Zhang Sent: Friday, December 27, 2019 2:57 PM To: amd-gfx@lists.freedesktop.org Cc: Zhang, Jack (Jian) Subject: [PATCH 2/2] amd/amdgpu/sriov use resume for smc recover under sriov Under sriov and pp_onevf mode, 1. take the resume function for smc recovery to avoid a potential memory leak. 2. add a return condition inside the smc resume function for sriov_pp_onevf_mode and the pm_enabled param. Signed-off-by: Jack Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 +- drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 6 ++ 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 6a4b142..547fa3e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -2458,7 +2458,11 @@ static int amdgpu_device_ip_reinit_late_sriov(struct amdgpu_device *adev) block->status.hw) continue; - r = block->version->funcs->hw_init(adev); + if (block->version->type == AMD_IP_BLOCK_TYPE_SMC) + r = block->version->funcs->resume(adev); + else + r = block->version->funcs->hw_init(adev); + DRM_INFO("RE-INIT-late: %s %s\n", block->version->funcs->name, r?"failed":"succeeded"); if (r) return r; diff --git a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c index 42c0a6d..7cad358 100644 --- a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c +++ b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c @@ -1416,6 +1416,12 @@ static int smu_resume(void *handle) struct amdgpu_device *adev = (struct amdgpu_device *)handle; struct smu_context *smu = &adev->smu; + if (amdgpu_sriov_vf(adev) && !amdgpu_sriov_is_pp_one_vf(adev)) + return 0; + + if (!smu->pm_enabled) + return 0; + pr_info("SMU is resuming...\n"); ret = smu_start_smc_engine(smu); -- 2.7.4
RE: [PATCH] drm/amd/powerplay free driver_pptable for smu reinit
Ping...

-Original Message-
From: Zhang, Jack (Jian)
Sent: Tuesday, December 24, 2019 1:05 PM
To: Wang, Kevin(Yang) ; Feng, Kenneth ; Quan, Evan ; Jack Zhang ; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH] drm/amd/powerplay free driver_pptable for smu reinit

-Original Message-
From: Jack Zhang
Sent: Tuesday, December 24, 2019 1:05 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian)
Subject: [PATCH] drm/amd/powerplay free driver_pptable for smu reinit

During gpu recover, smu hw reinit will fail because table_context->driver_pptable is not freed and set to NULL. Free the driver_pptable pointer if it's not NULL.

Signed-off-by: Jack Zhang
---
 drivers/gpu/drm/amd/powerplay/smu_v11_0.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/powerplay/smu_v11_0.c b/drivers/gpu/drm/amd/powerplay/smu_v11_0.c
index 7781d24..ca877bd 100644
--- a/drivers/gpu/drm/amd/powerplay/smu_v11_0.c
+++ b/drivers/gpu/drm/amd/powerplay/smu_v11_0.c
@@ -713,8 +713,10 @@ int smu_v11_0_parse_pptable(struct smu_context *smu)
 	struct smu_table_context *table_context = &smu->smu_table;
 	struct smu_table *table = &table_context->tables[SMU_TABLE_PPTABLE];
 
-	if (table_context->driver_pptable)
-		return -EINVAL;
+	if (table_context->driver_pptable) {
+		kfree(table_context->driver_pptable);
+		table_context->driver_pptable = NULL;
+	}
 
 	table_context->driver_pptable = kzalloc(table->size, GFP_KERNEL);
--
2.7.4
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
RE: [PATCH] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF
Ping…

From: Zhang, Jack (Jian)
Sent: Tuesday, December 24, 2019 11:51 AM
To: Wang, Kevin(Yang) ; Feng, Kenneth ; Tao, Yintian ; amd-gfx@lists.freedesktop.org; Deng, Emily ; Quan, Evan
Subject: RE: [PATCH] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF

Hi, Kevin,

Thank you for your code review.

1. I updated the patch to refine the pm_enable logic (attached).
2. For the comment "* Copy pptable bo in the vram…": it is the original bare-metal comment, I didn't change it.
3. "Could you describe the function of pp_one_vf and sriov_vf?"
   a. amdgpu_sriov_vf marks whether the device runs SR-IOV or bare metal, while amdgpu_sriov_is_pp_one_vf is a mode of SR-IOV: it means the host driver generated only one VF.
   b. When the host driver is loaded, the host administrator can set the "vf number". If vf_num = 1, the host driver lets the guest driver know it is in one-VF mode and pp_one_vf returns true; otherwise pp_one_vf returns false. vf_num cannot be changed without unloading both guest and host drivers, so it is a static setting.
   c. Under pp_one_vf mode, the guest driver does hw_init for the smu; the purpose is to let the guest driver "talk" with the smu by sending authorized smu messages. This helps user-mode apps dump info such as clks, temperature, GPU usage…. Currently we don't support the guest driver writing values to the smu; only read permission to dump smu info is supported. Besides, as the host driver has already initialized the smu hw, some hw init steps need to be skipped in the guest hw_init function of the smu block, such as writing the pptable and loading the smc firmware.
4. pp_one_vf mode needs some smu firmware changes to open permission for certain messages on the VF.

B.R.
Jack From: Wang, Kevin(Yang) mailto:kevin1.w...@amd.com>> Sent: Monday, December 23, 2019 6:11 PM To: Zhang, Jack (Jian) mailto:jack.zha...@amd.com>>; Feng, Kenneth mailto:kenneth.f...@amd.com>>; Tao, Yintian mailto:yintian@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>; Deng, Emily mailto:emily.d...@amd.com>> Cc: Quan, Evan mailto:evan.q...@amd.com>> Subject: Re: [PATCH] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF add @Quan, Evan<mailto:evan.q...@amd.com> to support arcturus asic. comment inline. ____ From: Zhang, Jack (Jian) mailto:jack.zha...@amd.com>> Sent: Monday, December 23, 2019 4:42 PM To: Feng, Kenneth mailto:kenneth.f...@amd.com>>; Wang, Kevin(Yang) mailto:kevin1.w...@amd.com>>; Tao, Yintian mailto:yintian@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> mailto:amd-gfx@lists.freedesktop.org>>; Deng, Emily mailto:emily.d...@amd.com>> Cc: Zhang, Jack (Jian) mailto:jack.zha...@amd.com>> Subject: RE: [PATCH] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF -Original Message- From: Jack Zhang mailto:jack.zha...@amd.com>> Sent: Monday, December 23, 2019 4:40 PM To: amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> Cc: Zhang, Jack (Jian) mailto:jack.zha...@amd.com>> Subject: [PATCH] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF Before, initialization of smu ip block would be skipped for sriov ASICs. But if there's only one VF being used, guest driver should be able to dump some HW info such as clks, temperature,etc. To solve this, now after onevf mode is enabled, host driver will notify guest. If it's onevf mode, guest will do smu hw_init and skip some steps in normal smu hw_init flow because host driver has already done it for smu. With this fix, guest app can talk with smu and dump hw information from smu. 
Signed-off-by: Jack Zhang mailto:jack.zha...@amd.com>> --- drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c| 3 +- drivers/gpu/drm/amd/amdgpu/soc15.c | 3 +- drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 49 ++ 3 files changed, 33 insertions(+), 22 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c index 8469834..08130a6 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c @@ -1448,7 +1448,8 @@ static int psp_np_fw_load(struct psp_context *psp) || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_G || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_CNTL || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_GPM_MEM - || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_SRM_MEM)) + || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_SRM_MEM + || ucode->ucode_id == AMDGPU_UCODE_ID_SMC)) /*skip ucode loading in SRIOV VF */ continue; diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c index b53d401..a2
RE: [PATCH] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF
-Original Message-
From: Jack Zhang
Sent: Monday, December 23, 2019 4:40 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian)
Subject: [PATCH] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF

Before, initialization of smu ip block would be skipped for sriov ASICs. But if there's only one VF being used, guest driver should be able to dump some HW info such as clks, temperature, etc.

To solve this, now after onevf mode is enabled, host driver will notify guest. If it's onevf mode, guest will do smu hw_init and skip some steps in normal smu hw_init flow because host driver has already done it for smu.

With this fix, guest app can talk with smu and dump hw information from smu.

Signed-off-by: Jack Zhang
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |  3 +-
 drivers/gpu/drm/amd/amdgpu/soc15.c         |  3 +-
 drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 49 ++++++++++++++++++------------
 3 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 8469834..08130a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -1448,7 +1448,8 @@ static int psp_np_fw_load(struct psp_context *psp)
 		    || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_G
 		    || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_CNTL
 		    || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_GPM_MEM
-		    || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_SRM_MEM))
+		    || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_SRM_MEM
+		    || ucode->ucode_id == AMDGPU_UCODE_ID_SMC))
 			/*skip ucode loading in SRIOV VF */
 			continue;
diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index b53d401..a271496 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -827,8 +827,7 @@ int soc15_set_ip_blocks(struct amdgpu_device *adev)
 			amdgpu_device_ip_block_add(adev, &dce_virtual_ip_block);
 		amdgpu_device_ip_block_add(adev, &gfx_v9_0_ip_block);
 		amdgpu_device_ip_block_add(adev, &sdma_v4_0_ip_block);
-		if (!amdgpu_sriov_vf(adev))
-			amdgpu_device_ip_block_add(adev, &smu_v11_0_ip_block);
+		amdgpu_device_ip_block_add(adev, &smu_v11_0_ip_block);
 
 		if (amdgpu_sriov_vf(adev)) {
 			if (likely(adev->firmware.load_type == AMDGPU_FW_LOAD_PSP))
diff --git a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
index 936c682..c07fb26 100644
--- a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
@@ -531,10 +531,14 @@ bool is_support_sw_smu(struct amdgpu_device *adev)
 	if (adev->asic_type == CHIP_VEGA20)
 		return (amdgpu_dpm == 2) ? true : false;
 	else if (adev->asic_type >= CHIP_ARCTURUS) {
-		if (amdgpu_sriov_vf(adev))
-			return false;
-		else
+		if (amdgpu_sriov_vf(adev)) {
+			if (amdgpu_sriov_is_pp_one_vf(adev))
+				return true;
+			else
+				return false;
+		} else {
 			return true;
+		}
 	} else
 		return false;
 }
@@ -1062,20 +1066,19 @@ static int smu_smc_table_hw_init(struct smu_context *smu,
 	}
 
 	/* smu_dump_pptable(smu); */
+	if (amdgpu_sriov_vf(adev) && !amdgpu_sriov_is_pp_one_vf(adev)) {
+		/*
+		 * Copy pptable bo in the vram to smc with SMU MSGs such as
+		 * SetDriverDramAddr and TransferTableDram2Smu.
+		 */
+		ret = smu_write_pptable(smu);
+		if (ret)
+			return ret;
 
-	/*
-	 * Copy pptable bo in the vram to smc with SMU MSGs such as
-	 * SetDriverDramAddr and TransferTableDram2Smu.
-	 */
-	ret = smu_write_pptable(smu);
-	if (ret)
-		return ret;
-
-	/* issue Run*Btc msg */
-	ret = smu_run_btc(smu);
-	if (ret)
-		return ret;
-
+	/* issue Run*Btc msg */
+	ret = smu_run_btc(smu);
+	if (ret)
+		return ret;
 	ret = smu_feature_set_allowed_mask(smu);
 	if (ret)
 		return ret;
@@ -1083,7 +1086,7 @@ static int smu_smc_table_hw_init(struct smu_context *smu,
 	ret = smu_system_features_control(smu, true);
 	if (ret)
 		return ret;
-
+	}
 	if (adev->asic_type != CHIP_ARCTURUS) {
 		ret = smu_notify_display_change(smu);
 		if (ret)
@@ -1136,8 +1139,9 @
RE: [PATCH] amd/amdgpu/sriov swSMU disable for sriov
Thanks Evan, I will add "{ }" before I check-in the code. Best, Jack -Original Message- From: Quan, Evan Sent: Tuesday, December 3, 2019 10:45 AM To: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org Cc: Zhang, Jack (Jian) Subject: RE: [PATCH] amd/amdgpu/sriov swSMU disable for sriov > -Original Message- > From: amd-gfx On Behalf Of > Jack Zhang > Sent: Monday, December 2, 2019 7:05 PM > To: amd-gfx@lists.freedesktop.org > Cc: Zhang, Jack (Jian) > Subject: [PATCH] amd/amdgpu/sriov swSMU disable for sriov > > For boards greater than ARCTURUS, and under sriov platform, swSMU is > not supported because smu ip block is commented at guest driver. > > Generally for sriov, initialization of smu is moved to host driver. > Thus, smu sw_init and hw_init will not be executed at guest driver. > > Without sw structure being initialized in guest driver, swSMU cannot > declare to be supported. > > Signed-off-by: Jack Zhang > --- > drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c > b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c > index 36001a4..0b8a53b 100644 > --- a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c > +++ b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c > @@ -531,7 +531,10 @@ bool is_support_sw_smu(struct amdgpu_device *adev) > if (adev->asic_type == CHIP_VEGA20) > return (amdgpu_dpm == 2) ? true : false; > else if (adev->asic_type >= CHIP_ARCTURUS) > - return true; > + if (amdgpu_sriov_vf(adev)) > + return false; > + else > + return true; [Quan, Evan] Are "{" and "}" missing around this code block? This seems a little weird. 
> else > return false; > } > -- > 2.7.4 > > ___ > amd-gfx mailing list > amd-gfx@lists.freedesktop.org > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flist > s.free > desktop.org%2Fmailman%2Flistinfo%2Famd- > gfxdata=02%7C01%7Cevan.quan%40amd.com%7Ca0119099a3db450554 > f208d777178b5f%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C6371 > 08815384772040sdata=8RJ1QyDzHEcnOnk0EBGkhfVljeiPWaZSNlO6OyAa > enc%3Dreserved=0 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
RE: [PATCH] amd/amdgpu/sriov swSMU disable for sriov
Hi, Team, Would you please help to review this patch? Best, Jack -Original Message- From: Jack Zhang Sent: Monday, December 2, 2019 7:05 PM To: amd-gfx@lists.freedesktop.org Cc: Zhang, Jack (Jian) Subject: [PATCH] amd/amdgpu/sriov swSMU disable for sriov For boards greater than ARCTURUS, and under sriov platform, swSMU is not supported because smu ip block is commented at guest driver. Generally for sriov, initialization of smu is moved to host driver. Thus, smu sw_init and hw_init will not be executed at guest driver. Without sw structure being initialized in guest driver, swSMU cannot declare to be supported. Signed-off-by: Jack Zhang --- drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c index 36001a4..0b8a53b 100644 --- a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c +++ b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c @@ -531,7 +531,10 @@ bool is_support_sw_smu(struct amdgpu_device *adev) if (adev->asic_type == CHIP_VEGA20) return (amdgpu_dpm == 2) ? true : false; else if (adev->asic_type >= CHIP_ARCTURUS) - return true; + if (amdgpu_sriov_vf(adev)) + return false; + else + return true; else return false; } -- 2.7.4 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH 2/2] drm/amd/amdgpu/sriov skip RLCG s/r list for arcturus VF.
Sure,will do that! Thanks, Jack Outlook for Android<https://aka.ms/ghei36> From: Deucher, Alexander Sent: Friday, November 22, 2019 12:00:04 AM To: Zhang, Jack (Jian) ; Alex Deucher Cc: amd-gfx list Subject: Re: [PATCH 2/2] drm/amd/amdgpu/sriov skip RLCG s/r list for arcturus VF. You can have my RB on the first patch if you fix the compiler warnings. Alex From: amd-gfx on behalf of Zhang, Jack (Jian) Sent: Thursday, November 21, 2019 10:58 AM To: Alex Deucher Cc: amd-gfx list Subject: Re: [PATCH 2/2] drm/amd/amdgpu/sriov skip RLCG s/r list for arcturus VF. Thanks Alex for this review. Both of the two patches will not influence bare-metal code path. B. R. Jack Outlook for Android<https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Faka.ms%2Fghei36=02%7C01%7Calexander.deucher%40amd.com%7Cf350a586b8b840526a2c08d76e9bb538%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637099487450051533=ghsxwJqO5AO%2B2pBhPoQz70SkyYyeNCnA5FQMAHQVVpI%3D=0> From: Alex Deucher Sent: Thursday, November 21, 2019 11:26:37 PM To: Zhang, Jack (Jian) Cc: amd-gfx list Subject: Re: [PATCH 2/2] drm/amd/amdgpu/sriov skip RLCG s/r list for arcturus VF. Nevermind. I missed the sr-iov check. Patch is: Acked-by: Alex Deucher On Thu, Nov 21, 2019 at 10:25 AM Alex Deucher wrote: > > Won't this impact bare metal hw that needs this? > > Alex > > On Thu, Nov 21, 2019 at 1:17 AM Jack Zhang wrote: > > > > After rlcg fw 2.1, kmd driver starts to load extra fw for > > LIST_CNTL,GPM_MEM,SRM_MEM. We needs to skip the three fw > > because all rlcg related fw have been loaded by host driver. > > Guest driver would load the three fw fail without this change. 
> > > > Signed-off-by: Jack Zhang > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 5 - > > 1 file changed, 4 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > > index c3a42d3..eecde80 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > > @@ -1470,7 +1470,10 @@ static int psp_np_fw_load(struct psp_context *psp) > > || ucode->ucode_id == AMDGPU_UCODE_ID_SDMA5 > > || ucode->ucode_id == AMDGPU_UCODE_ID_SDMA6 > > || ucode->ucode_id == AMDGPU_UCODE_ID_SDMA7 > > - || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_G)) > > +|| ucode->ucode_id == AMDGPU_UCODE_ID_RLC_G > > + || ucode->ucode_id == > > AMDGPU_UCODE_ID_RLC_RESTORE_LIST_CNTL > > + || ucode->ucode_id == > > AMDGPU_UCODE_ID_RLC_RESTORE_LIST_GPM_MEM > > + || ucode->ucode_id == > > AMDGPU_UCODE_ID_RLC_RESTORE_LIST_SRM_MEM)) > > /*skip ucode loading in SRIOV VF */ > > continue; > > > > -- > > 2.7.4 > > > > ___ > > amd-gfx mailing list > > amd-gfx@lists.freedesktop.org > > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfxdata=02%7C01%7CJack.Zhang1%40amd.com%7C62c30b71b3b8490862d808d76e973ae5%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637099468112639673sdata=iC932sWNGrLeexp7wIDdrZ7cjVBcFv5giOTC3uHdE%2Fk%3Dreserved=0<https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx=02%7C01%7Calexander.deucher%40amd.com%7Cf350a586b8b840526a2c08d76e9bb538%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637099487450061527=FofOoED2ixVkMukh%2FOXH8mdtu%2B%2BwilXYGxbQVNRflUY%3D=0> ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
RE: [PATCH 1/2] drm/amd/amdgpu/sriov temporarily skip ras, dtm, hdcp for arcturus VF
Hi, Team, Would you please help to take a look this patch? BR Jack -Original Message- From: amd-gfx On Behalf Of Jack Zhang Sent: Thursday, November 21, 2019 2:17 PM To: amd-gfx@lists.freedesktop.org Cc: Zhang, Jack (Jian) Subject: [PATCH 1/2] drm/amd/amdgpu/sriov temporarily skip ras, dtm, hdcp for arcturus VF Temporarily skip ras,dtm,hdcp initialize and terminate for arcturus VF Currently the three features haven't been enabled at SRIOV, it would trigger guest driver load fail with the bare-metal path of the three features. Signed-off-by: Jack Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 36 + 1 file changed, 36 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c index 2a8a08a..c3a42d3 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c @@ -756,6 +756,12 @@ int psp_ras_enable_features(struct psp_context *psp, static int psp_ras_terminate(struct psp_context *psp) { + /* +* TODO: bypass the terminate in sriov for now +*/ + if (amdgpu_sriov_vf(psp->adev)) + return 0; + int ret; if (!psp->ras.ras_initialized) @@ -777,6 +783,12 @@ static int psp_ras_terminate(struct psp_context *psp) static int psp_ras_initialize(struct psp_context *psp) { + /* +* TODO: bypass the initialize in sriov for now +*/ + if (amdgpu_sriov_vf(psp->adev)) + return 0; + int ret; if (!psp->adev->psp.ta_ras_ucode_size || @@ -872,6 +884,12 @@ static int psp_hdcp_load(struct psp_context *psp) } static int psp_hdcp_initialize(struct psp_context *psp) { + /* +* TODO: bypass the initialize in sriov for now +*/ + if (amdgpu_sriov_vf(psp->adev)) + return 0; + int ret; if (!psp->adev->psp.ta_hdcp_ucode_size || @@ -960,6 +978,12 @@ int psp_hdcp_invoke(struct psp_context *psp, uint32_t ta_cmd_id) static int psp_hdcp_terminate(struct psp_context *psp) { + /* +* TODO: bypass the terminate in sriov for now +*/ + if (amdgpu_sriov_vf(psp->adev)) + return 0; + int ret; if 
(!psp->hdcp_context.hdcp_initialized) @@ -1051,6 +1075,12 @@ static int psp_dtm_load(struct psp_context *psp) static int psp_dtm_initialize(struct psp_context *psp) { + /* +* TODO: bypass the initialize in sriov for now +*/ + if (amdgpu_sriov_vf(psp->adev)) + return 0; + int ret; if (!psp->adev->psp.ta_dtm_ucode_size || @@ -1109,6 +1139,12 @@ int psp_dtm_invoke(struct psp_context *psp, uint32_t ta_cmd_id) static int psp_dtm_terminate(struct psp_context *psp) { + /* +* TODO: bypass the terminate in sriov for now +*/ + if (amdgpu_sriov_vf(psp->adev)) + return 0; + int ret; if (!psp->dtm_context.dtm_initialized) -- 2.7.4 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
RE: [PATCH] drm/amd/amdgpu/sriov ip block setting of Arcturus
Hi, folks,

Would you please help to review the patch?

Thanks,
Jack

-Original Message-
From: amd-gfx On Behalf Of Jack Zhang
Sent: Monday, September 30, 2019 1:00 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian)
Subject: [PATCH] drm/amd/amdgpu/sriov ip block setting of Arcturus

Add ip block setting for Arcturus SRIOV
1. PSP needs to be initialized before IH.
2. SMU doesn't need to be initialized at the kmd driver.
3. Arcturus doesn't support DCE hardware, so register access to DCE needs to be skipped.

Signed-off-by: Jack Zhang
---
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 10 ++++++----
 drivers/gpu/drm/amd/amdgpu/soc15.c    | 19 +++++++++++++++----
 2 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 95a9a5f5..44023bd 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -1330,11 +1330,13 @@ static int gmc_v9_0_hw_init(void *handle)
 	gmc_v9_0_init_golden_registers(adev);
 
 	if (adev->mode_info.num_crtc) {
-		/* Lockout access through VGA aperture*/
-		WREG32_FIELD15(DCE, 0, VGA_HDP_CONTROL, VGA_MEMORY_DISABLE, 1);
+		if (adev->asic_type != CHIP_ARCTURUS) {
+			/* Lockout access through VGA aperture*/
+			WREG32_FIELD15(DCE, 0, VGA_HDP_CONTROL, VGA_MEMORY_DISABLE, 1);
 
-		/* disable VGA render */
-		WREG32_FIELD15(DCE, 0, VGA_RENDER_CONTROL, VGA_VSTATUS_CNTL, 0);
+			/* disable VGA render */
+			WREG32_FIELD15(DCE, 0, VGA_RENDER_CONTROL, VGA_VSTATUS_CNTL, 0);
+		}
 	}
 
 	r = gmc_v9_0_gart_enable(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index dbd790e..ac181e3 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -754,14 +754,25 @@ int soc15_set_ip_blocks(struct amdgpu_device *adev)
 	case CHIP_ARCTURUS:
 		amdgpu_device_ip_block_add(adev, &vega10_common_ip_block);
 		amdgpu_device_ip_block_add(adev, &gmc_v9_0_ip_block);
-		amdgpu_device_ip_block_add(adev, &vega10_ih_ip_block);
-		if (likely(adev->firmware.load_type == AMDGPU_FW_LOAD_PSP))
-			amdgpu_device_ip_block_add(adev, &psp_v11_0_ip_block);
+
+		/* For MI100 SR-IOV, PSP need to be initialized before IH */
+		if (amdgpu_sriov_vf(adev)) {
+			if (likely(adev->firmware.load_type == AMDGPU_FW_LOAD_PSP))
+				amdgpu_device_ip_block_add(adev, &psp_v11_0_ip_block);
+			amdgpu_device_ip_block_add(adev, &vega10_ih_ip_block);
+		} else {
+			amdgpu_device_ip_block_add(adev, &vega10_ih_ip_block);
+			if (likely(adev->firmware.load_type == AMDGPU_FW_LOAD_PSP))
+				amdgpu_device_ip_block_add(adev, &psp_v11_0_ip_block);
+		}
+
 		if (adev->enable_virtual_display || amdgpu_sriov_vf(adev))
 			amdgpu_device_ip_block_add(adev, &dce_virtual_ip_block);
 		amdgpu_device_ip_block_add(adev, &gfx_v9_0_ip_block);
 		amdgpu_device_ip_block_add(adev, &sdma_v4_0_ip_block);
-		amdgpu_device_ip_block_add(adev, &smu_v11_0_ip_block);
+		if (!amdgpu_sriov_vf(adev))
+			amdgpu_device_ip_block_add(adev, &smu_v11_0_ip_block);
+
 		if (unlikely(adev->firmware.load_type == AMDGPU_FW_LOAD_DIRECT))
 			amdgpu_device_ip_block_add(adev, &vcn_v2_5_ip_block);
 		break;
--
2.7.4
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
RE: [PATCH] drm/amdgpu/sriov: omit fbcon error under sriov or passthrough
Reviewed & Tested-by: Jack Zhang mailto:jack.zha...@amd.com>> BR, Jack From: Deng, Emily Sent: Thursday, September 19, 2019 10:58 AM To: Koenig, Christian Cc: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org; Teng, Rui ; Cui, Flora Subject: RE: [PATCH] drm/amdgpu/sriov: omit fbcon error under sriov or passthrough Hi Jack, Could you please give a try about this? Both for passthrough and sriov. Best wishes Emily Deng From: Koenig, Christian mailto:christian.koe...@amd.com>> Sent: Wednesday, September 18, 2019 6:47 PM To: Deng, Emily mailto:emily.d...@amd.com>> Cc: Zhang, Jack (Jian) mailto:jack.zha...@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>; Teng, Rui mailto:rui.t...@amd.com>>; Cui, Flora mailto:flora@amd.com>> Subject: Re: [PATCH] drm/amdgpu/sriov: omit fbcon error under sriov or passthrough Hi Jack & Emily, asking around a bit helped, we should be able to take a look at the module state to distinct the two use cases like this: diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c index 6b96a5738e57..0af134eb03e2 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c @@ -1074,7 +1074,10 @@ amdgpu_pci_remove(struct pci_dev *pdev) { struct drm_device *dev = pci_get_drvdata(pdev); - DRM_ERROR("Device removal is currently not supported outside of fbcon\n"); +#ifdef MODULE + if (THIS_MODULE->state != MODULE_STATE_GOING) +#endif + DRM_ERROR("Device removal is currently not supported outside of fbcon\n"); drm_dev_unplug(dev); drm_dev_put(dev); pci_disable_device(pdev); It's a bit of a hack, but I think that this should work. Regards, Christian. Am 18.09.19 um 12:29 schrieb Christian König: Hi Emily, Do you think this is because the wrong use case? Well Jack's use case is correct, but the PCIe hot plug removal use case is incorrect. 
Changing it to a warning is most likely not a good idea either, because it is indeed an error to try to use PCIe hot plug removal.

What we need to distinguish is why the function is called. If it's because of pci_unregister_driver(&amdgpu_kms_pci_driver) in amdgpu_exit(), then the use case is valid and we should not print the error. But if it's because somebody does something like "echo 1 > /sys/bus/pci/devices/\:0b\:00.1/remove", then that is invalid and we should print it.

We could do some hack and look at the stack trace, but that is probably not reliable either. Maybe we can look at the module reference count or something like that.

Regards,
Christian.

On 18.09.19 at 12:04, Deng, Emily wrote:
Hi Christian,
Do you think this is because of the wrong use case? Even when we add the error log, the issue still happens. Could we change this error to a warning? With the right method of removing the driver, it shouldn't cause issues.

Best wishes
Emily Deng

From: Koenig, Christian <christian.koe...@amd.com>
Sent: Wednesday, September 18, 2019 5:45 PM
To: Deng, Emily <emily.d...@amd.com>
Cc: Zhang, Jack (Jian) <jack.zha...@amd.com>; amd-gfx@lists.freedesktop.org; Teng, Rui <rui.t...@amd.com>; Cui, Flora <flora@amd.com>
Subject: RE: [PATCH] drm/amdgpu/sriov: omit fbcon error under sriov or passthrough

Yes, exactly. The problem is that even when the output is qxl or the Intel driver, our driver is still loaded, and forcefully removing it renders the desktop unusable.

Christian.

On 18.09.2019 at 11:24, "Deng, Emily" <emily.d...@amd.com> wrote:
Hi Christian,
Do you mean that in passthrough mode the desktop is rendered by our asic, but the end user is trying to remove the driver via hot plug?
Best wishes
Emily Deng

From: Koenig, Christian <christian.koe...@amd.com>
Sent: Wednesday, September 18, 2019 4:44 PM
To: Deng, Emily <emily.d...@amd.com>
Cc: Zhang, Jack (Jian) <jack.zha...@amd.com>; amd-gfx@lists.freedesktop.org; Teng, Rui <rui.t...@amd.com>; Cui, Flora <flora@amd.com>
Subject: RE: [PATCH] drm/amdgpu/sriov: omit fbcon error under sriov or passthrough

Hi Emily,

Yeah, exactly that's the problem: in some cases the driver can be unloaded while it is still in use!

See, we added this error message because end users tried to use PCIe hot plug to unload the driver so they could use the hardware for passthrough. But this will completely nuke your desktop, even when amdgpu is only a secondary device, as in the qxl case.

Jack is using the correct way of doing it, e.g. using "modprobe -r" or rmmod. Both commands check the use count before unloading the driver instances.

I don't see a way to distinguish the two cases and what Jack
RE: [PATCH] drm/amdgpu/sriov: omit fbcon error under sriov or passthrough
Hi Christian and folks,

In virtual machines (such as virt-manager), there is always a virtual graphics device like "qxl" present as the default gfx device. Under that condition, we believe it's reasonable to unload the amdgpu driver, as it is not the default fbcon device. Would you please help to review this patch?

Best wishes,
Jack

-----Original Message-----
From: Jack Zhang
Sent: Wednesday, September 18, 2019 3:25 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian)
Subject: [PATCH] drm/amdgpu/sriov: omit fbcon error under sriov or passthrough

In a virtual machine there is typically a qxl or cirrus graphics device acting as the default master fbcon device. So for a PF (passthrough mode) or an SRIOV VF, it is reasonable to unload the amdgpu driver; amdgpu doesn't have to be the only fbcon device under this condition.

Signed-off-by: Jack Zhang

---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 420888e..ada2b25 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1103,8 +1103,9 @@ static void amdgpu_pci_remove(struct pci_dev *pdev)
 {
 	struct drm_device *dev = pci_get_drvdata(pdev);
-
-	DRM_ERROR("Device removal is currently not supported outside of fbcon\n");
+	struct amdgpu_device *adev = dev->dev_private;
+	if (!amdgpu_sriov_vf(adev) && !amdgpu_passthrough(adev))
+		DRM_ERROR("Device removal is currently not supported outside of fbcon\n");
 	drm_dev_unplug(dev);
 	drm_dev_put(dev);
 	pci_disable_device(pdev);
-- 
2.7.4
RE: [PATCH] drm/amdgpu/sriov: add ring_stop before ring_create in psp v11 code
ping

-----Original Message-----
From: amd-gfx On Behalf Of Jack Zhang
Sent: Tuesday, September 10, 2019 4:09 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian)
Subject: [PATCH] drm/amdgpu/sriov: add ring_stop before ring_create in psp v11 code

The psp v11 code missed the ring stop in its ring create function (VMR path), while the psp v3.1 code had it. This causes VM destroy failures and psp ring create failures. For an SRIOV VF, ring_stop should not be skipped in the ring_create function.

Signed-off-by: Jack Zhang

---
 drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 61 +++++++++++++++++++++++++++++++----------------------------
 1 file changed, 34 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
index cddfa3c..8e9b34a 100644
--- a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
@@ -397,6 +397,34 @@ static bool psp_v11_0_support_vmr_ring(struct psp_context *psp)
 	return false;
 }
 
+static int psp_v11_0_ring_stop(struct psp_context *psp,
+			       enum psp_ring_type ring_type)
+{
+	int ret = 0;
+	struct amdgpu_device *adev = psp->adev;
+
+	/* Write the ring destroy command */
+	if (psp_v11_0_support_vmr_ring(psp))
+		WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_101,
+			     GFX_CTRL_CMD_ID_DESTROY_GPCOM_RING);
+	else
+		WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_64,
+			     GFX_CTRL_CMD_ID_DESTROY_RINGS);
+
+	/* there might be a handshake issue with hardware which needs delay */
+	mdelay(20);
+
+	/* Wait for response flag (bit 31) */
+	if (psp_v11_0_support_vmr_ring(psp))
+		ret = psp_wait_for(psp, SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_101),
+				   0x80000000, 0x80000000, false);
+	else
+		ret = psp_wait_for(psp, SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_64),
+				   0x80000000, 0x80000000, false);
+
+	return ret;
+}
+
 static int psp_v11_0_ring_create(struct psp_context *psp,
 				 enum psp_ring_type ring_type)
 {
@@ -406,6 +434,12 @@ static int psp_v11_0_ring_create(struct psp_context *psp,
 	struct amdgpu_device *adev = psp->adev;
 
 	if (psp_v11_0_support_vmr_ring(psp)) {
+		ret = psp_v11_0_ring_stop(psp, ring_type);
+		if (ret) {
+			DRM_ERROR("psp_v11_0_ring_stop_sriov failed!\n");
+			return ret;
+		}
+
 		/* Write low address of the ring to C2PMSG_102 */
 		psp_ring_reg = lower_32_bits(ring->ring_mem_mc_addr);
 		WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_102, psp_ring_reg);
@@ -450,33 +484,6 @@ static int psp_v11_0_ring_create(struct psp_context *psp,
 	return ret;
 }
 
-static int psp_v11_0_ring_stop(struct psp_context *psp,
-			       enum psp_ring_type ring_type)
-{
-	int ret = 0;
-	struct amdgpu_device *adev = psp->adev;
-
-	/* Write the ring destroy command */
-	if (psp_v11_0_support_vmr_ring(psp))
-		WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_101,
-			     GFX_CTRL_CMD_ID_DESTROY_GPCOM_RING);
-	else
-		WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_64,
-			     GFX_CTRL_CMD_ID_DESTROY_RINGS);
-
-	/* there might be handshake issue with hardware which needs delay */
-	mdelay(20);
-
-	/* Wait for response flag (bit 31) */
-	if (psp_v11_0_support_vmr_ring(psp))
-		ret = psp_wait_for(psp, SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_101),
-				   0x80000000, 0x80000000, false);
-	else
-		ret = psp_wait_for(psp, SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_64),
-				   0x80000000, 0x80000000, false);
-
-	return ret;
-}
 
 static int psp_v11_0_ring_destroy(struct psp_context *psp,
 				  enum psp_ring_type ring_type)
-- 
2.7.4