RE: [PATCH] drm/amd/pm: Disable SMU messages in navi10 sriov

2021-06-17 Thread Zhang, Jack (Jian)
[AMD Official Use Only]

Reviewed-by: Jack Zhang 


-Original Message-
From: amd-gfx  On Behalf Of Chen, JingWen
Sent: Friday, June 11, 2021 6:52 PM
To: Zha, YiFan(Even) ; amd-gfx@lists.freedesktop.org
Cc: Zha, YiFan(Even) ; Liu, Monk 
Subject: RE: [PATCH] drm/amd/pm: Disable SMU messages in navi10 sriov

[AMD Official Use Only]


Acked-by: Jingwen Chen 

Best Regards,
JingWen Chen

-Original Message-
From: Yifan Zha 
Sent: Friday, June 11, 2021 6:49 PM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Monk ; Chen, JingWen ; Zha, 
YiFan(Even) 
Subject: [PATCH] drm/amd/pm: Disable SMU messages in navi10 sriov

[Why]
Under SRIOV, the VF sends SMU messages that are not supported, which leads to failures.

[How]
Disable the related messages when running in SRIOV mode.

Signed-off-by: Yifan Zha 
---
 drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
index 78fe13183e8b..e1b019115e92 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
@@ -80,10 +80,10 @@ static struct cmn2asic_msg_mapping navi10_message_map[SMU_MSG_MAX_COUNT] = {
 	MSG_MAP(SetAllowedFeaturesMaskHigh,	PPSMC_MSG_SetAllowedFeaturesMaskHigh,	0),
 	MSG_MAP(EnableAllSmuFeatures,		PPSMC_MSG_EnableAllSmuFeatures,		0),
 	MSG_MAP(DisableAllSmuFeatures,		PPSMC_MSG_DisableAllSmuFeatures,	0),
-	MSG_MAP(EnableSmuFeaturesLow,		PPSMC_MSG_EnableSmuFeaturesLow,		1),
-	MSG_MAP(EnableSmuFeaturesHigh,		PPSMC_MSG_EnableSmuFeaturesHigh,	1),
-	MSG_MAP(DisableSmuFeaturesLow,		PPSMC_MSG_DisableSmuFeaturesLow,	1),
-	MSG_MAP(DisableSmuFeaturesHigh,		PPSMC_MSG_DisableSmuFeaturesHigh,	1),
+	MSG_MAP(EnableSmuFeaturesLow,		PPSMC_MSG_EnableSmuFeaturesLow,		0),
+	MSG_MAP(EnableSmuFeaturesHigh,		PPSMC_MSG_EnableSmuFeaturesHigh,	0),
+	MSG_MAP(DisableSmuFeaturesLow,		PPSMC_MSG_DisableSmuFeaturesLow,	0),
+	MSG_MAP(DisableSmuFeaturesHigh,		PPSMC_MSG_DisableSmuFeaturesHigh,	0),
 	MSG_MAP(GetEnabledSmuFeaturesLow,	PPSMC_MSG_GetEnabledSmuFeaturesLow,	1),
 	MSG_MAP(GetEnabledSmuFeaturesHigh,	PPSMC_MSG_GetEnabledSmuFeaturesHigh,	1),
 	MSG_MAP(SetWorkloadMask,		PPSMC_MSG_SetWorkloadMask,		1),
--
2.25.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

2021-03-25 Thread Zhang, Jack (Jian)
[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey,

>>how you handle non-guilty signaled jobs in drm_sched_stop; currently it looks
>>like you don't call put for them and just explicitly free them as before
Good point, I missed that place. I will cover that in my next patch.

>>Also sched->free_guilty seems useless with the new approach.
Yes, I agree.

>>Do we even need the cleanup mechanism at drm_sched_get_cleanup_job with this 
>>approach...
I am not quite sure about that for now, let me think about this topic today.

Hi, Christian,
should I add a fence and use get/put on that fence rather than an explicit
refcount?
Are there any other concerns?

Thanks,
Jack

-Original Message-
From: Grodzovsky, Andrey 
Sent: Friday, March 26, 2021 12:32 AM
To: Zhang, Jack (Jian) ; Christian König 
; dri-de...@lists.freedesktop.org; 
amd-gfx@lists.freedesktop.org; Koenig, Christian ; 
Liu, Monk ; Deng, Emily ; Rob Herring 
; Tomeu Vizoso ; Steven Price 

Subject: Re: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

There are a few issues here, like how you handle non-guilty signaled jobs in 
drm_sched_stop; currently it looks like you don't call put for them and just 
explicitly free them as before. Also sched->free_guilty seems useless with the 
new approach. Do we even need the cleanup mechanism at 
drm_sched_get_cleanup_job with this approach...

But first - We need Christian to express his opinion on this since I think he 
opposed refcounting jobs and that we should concentrate on fences instead.

Christian - can you chime in here ?

Andrey

On 2021-03-25 5:51 a.m., Zhang, Jack (Jian) wrote:
> [AMD Official Use Only - Internal Distribution Only]
>
>
> Hi, Andrey
>
> Thank you for your good opinions.
>
> I literally agree with you that the refcount could solve the concurrent
> drm_sched_get_cleanup_job race gracefully, and there is no need to re-insert
> the job back anymore.
>
> I quickly made a draft for this idea as follows:
>
> How do you like it? I will start implementing it after I get your
> acknowledgement.
>
> Thanks,
>
> Jack
>
> +void drm_job_get(struct drm_sched_job *s_job)
> +{
> +   kref_get(&s_job->refcount);
> +}
> +
> +void drm_job_do_release(struct kref *ref)
> +{
> +   struct drm_sched_job *s_job;
> +   struct drm_gpu_scheduler *sched;
> +
> +   s_job = container_of(ref, struct drm_sched_job, refcount);
> +   sched = s_job->sched;
> +   sched->ops->free_job(s_job);
> +}
> +
> +void drm_job_put(struct drm_sched_job *s_job)
> +{
> +   kref_put(&s_job->refcount, drm_job_do_release);
> +}
> +
> static void drm_sched_job_begin(struct drm_sched_job *s_job)
> {
>      struct drm_gpu_scheduler *sched = s_job->sched;
> +   kref_init(&s_job->refcount);
> +   drm_job_get(s_job);
>      spin_lock(&sched->job_list_lock);
>      list_add_tail(&s_job->node, &sched->ring_mirror_list);
>      drm_sched_start_timeout(sched);
> @@ -294,17 +316,16 @@ static void drm_sched_job_timedout(struct work_struct *work)
>       * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>       * is parked at which point it's safe.
>       */
> -   list_del_init(&job->node);
> +   drm_job_get(job);
>      spin_unlock(&sched->job_list_lock);
>      job->sched->ops->timedout_job(job);
> -
> +   drm_job_put(job);
>      /*
>       * Guilty job did complete and hence needs to be manually removed
>       * See drm_sched_stop doc.
>       */
>      if (sched->free_guilty) {
> -       job->sched->ops->free_job(job);
>          sched->free_guilty = false;
>      }
> } else {
> @@ -355,20 +376,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> -   /*
> -    * Reinsert back the bad job here - now it's safe as
> -    * drm_sched_get_cleanup_job cannot race against us and release the
> -    * bad job at this point - we parked (waited for) any in progress
> -    * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> -    * now until the scheduler thread is unparked.
> -    */
> -   if (bad && bad->sched == sched)
> -   /*
> -    * Add

RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

2021-03-25 Thread Zhang, Jack (Jian)
[AMD Official Use Only - Internal Distribution Only]

Hi, Steve,

Thank you for your detailed comments.

But currently the patch is not finalized.
We found some potential race conditions even with this patch. The solution is 
under discussion, and hopefully we can find an ideal one.
After that, I will check whether the change influences other DRM drivers 
besides amdgpu.

Best,
Jack

-Original Message-
From: Steven Price 
Sent: Monday, March 22, 2021 11:29 PM
To: Zhang, Jack (Jian) ; dri-de...@lists.freedesktop.org; 
amd-gfx@lists.freedesktop.org; Koenig, Christian ; 
Grodzovsky, Andrey ; Liu, Monk ; 
Deng, Emily ; Rob Herring ; Tomeu Vizoso 

Subject: Re: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

On 15/03/2021 05:23, Zhang, Jack (Jian) wrote:
> [AMD Public Use]
>
> Hi, Rob/Tomeu/Steven,
>
> Would you please help to review this patch for panfrost driver?
>
> Thanks,
> Jack Zhang
>
> -Original Message-
> From: Jack Zhang 
> Sent: Monday, March 15, 2021 1:21 PM
> To: dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org;
> Koenig, Christian ; Grodzovsky, Andrey
> ; Liu, Monk ; Deng, Emily
> 
> Cc: Zhang, Jack (Jian) 
> Subject: [PATCH v3] drm/scheduler re-insert Bailing job to avoid
> memleak
>
> re-insert Bailing jobs to avoid memory leak.
>
> V2: move re-insert step to drm/scheduler logic
> V3: add panfrost's return value for bailing jobs in case it hits the
> memleak issue.

This commit message could do with some work - it's really hard to decipher what 
the actual problem you're solving is.

>
> Signed-off-by: Jack Zhang 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c| 8 ++--
>   drivers/gpu/drm/panfrost/panfrost_job.c| 4 ++--
>   drivers/gpu/drm/scheduler/sched_main.c | 8 +++-
>   include/drm/gpu_scheduler.h| 1 +
>   5 files changed, 19 insertions(+), 6 deletions(-)
>
[...]
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c
> b/drivers/gpu/drm/panfrost/panfrost_job.c
> index 6003cfeb1322..e2cb4f32dae1 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -444,7 +444,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
>    * spurious. Bail out.
>    */
>   if (dma_fence_is_signaled(job->done_fence))
> -		return DRM_GPU_SCHED_STAT_NOMINAL;
> +		return DRM_GPU_SCHED_STAT_BAILING;
>
>   dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, status=0x%x, head=0x%x, tail=0x%x, sched_job=%p",
>   js,
> @@ -456,7 +456,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
>
>   /* Scheduler is already stopped, nothing to do. */
>   if (!panfrost_scheduler_stop(&pfdev->js->queue[js], sched_job))
> -		return DRM_GPU_SCHED_STAT_NOMINAL;
> +		return DRM_GPU_SCHED_STAT_BAILING;
>
>   /* Schedule a reset if there's no reset in progress. */
>   if (!atomic_xchg(&pfdev->reset.pending, 1))

This looks correct to me - in these two cases drm_sched_stop() is not called on 
the sched_job, so it looks like currently the job will be leaked.

> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 92d8de24d0a1..a44f621fb5c4 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>   {
>   struct drm_gpu_scheduler *sched;
>   struct drm_sched_job *job;
> +int ret;
>
>   sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>
> @@ -331,8 +332,13 @@ static void drm_sched_job_timedout(struct work_struct *work)
>   list_del_init(&job->list);
>   spin_unlock(&sched->job_list_lock);
>
> -job->sched->ops->timedout_job(job);
> +ret = job->sched->ops->timedout_job(job);
>
> +if (ret == DRM_GPU_SCHED_STAT_BAILING) {
> +spin_lock(&sched->job_list_lock);
> +list_add(&job->node, &sched->ring_mirror_list);
> +spin_unlock(&sched->job_list_lock);
> +}

I think we could really do with a comment somewhere explaining what "bailing" 
means in this context. For the Panfrost case we have two cases:

  * The GPU job actually finished while the timeout code was running 
(done_fence is signalled).

  * The GPU is already in the process of being reset (Panfrost has multiple 
queues, so most likely a bad job in another queue).

I'm also not convinced that (for Panfrost) it makes sense to be adding the jobs 
back to the list. For the first case above clearly the job could just be freed 
(it's complete). The second case is more interesting and Panfrost currently 
doesn't handle this well. In theory the dri

RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

2021-03-25 Thread Zhang, Jack (Jian)
[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey

Thank you for your good opinions.
I literally agree with you that the refcount could solve the concurrent 
drm_sched_get_cleanup_job race gracefully, and there is no need to re-insert the
job back anymore.

I quickly made a draft for this idea as follows:
How do you like it? I will start implementing it after I get your acknowledgement.

Thanks,
Jack

+void drm_job_get(struct drm_sched_job *s_job)
+{
+	kref_get(&s_job->refcount);
+}
+
+void drm_job_do_release(struct kref *ref)
+{
+	struct drm_sched_job *s_job;
+	struct drm_gpu_scheduler *sched;
+
+	s_job = container_of(ref, struct drm_sched_job, refcount);
+	sched = s_job->sched;
+	sched->ops->free_job(s_job);
+}
+
+void drm_job_put(struct drm_sched_job *s_job)
+{
+	kref_put(&s_job->refcount, drm_job_do_release);
+}
+
 static void drm_sched_job_begin(struct drm_sched_job *s_job)
 {
 	struct drm_gpu_scheduler *sched = s_job->sched;
+	kref_init(&s_job->refcount);
+	drm_job_get(s_job);
 	spin_lock(&sched->job_list_lock);
 	list_add_tail(&s_job->node, &sched->ring_mirror_list);
 	drm_sched_start_timeout(sched);
@@ -294,17 +316,16 @@ static void drm_sched_job_timedout(struct work_struct *work)
 	 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
 	 * is parked at which point it's safe.
 	 */
-	list_del_init(&job->node);
+	drm_job_get(job);
 	spin_unlock(&sched->job_list_lock);
 	job->sched->ops->timedout_job(job);
-
+	drm_job_put(job);
 	/*
 	 * Guilty job did complete and hence needs to be manually removed
 	 * See drm_sched_stop doc.
 	 */
 	if (sched->free_guilty) {
-		job->sched->ops->free_job(job);
 		sched->free_guilty = false;
 	}
 } else {
@@ -355,20 +376,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
-	/*
-	 * Reinsert back the bad job here - now it's safe as
-	 * drm_sched_get_cleanup_job cannot race against us and release the
-	 * bad job at this point - we parked (waited for) any in progress
-	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
-	 * now until the scheduler thread is unparked.
-	 */
-	if (bad && bad->sched == sched)
-	/*
-	 * Add at the head of the queue to reflect it was the earliest
-	 * job extracted.
-	 */
-	list_add(&bad->node, &sched->ring_mirror_list);
-
 	/*
 	 * Iterate the job list from later to earlier one and either deactive
 	 * their HW callbacks or remove them from mirror list if they already
@@ -774,7 +781,7 @@ static int drm_sched_main(void *param)
 					 kthread_should_stop());
 	if (cleanup_job) {
-		sched->ops->free_job(cleanup_job);
+		drm_job_put(cleanup_job);
 		/* queue timeout for next job */
 		drm_sched_start_timeout(sched);
 	}
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 5a1f068af1c2..b80513eec90f 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -188,6 +188,7 @@ struct drm_sched_fence *to_drm_sched_fence(struct dma_fence *f);
  * to schedule the job.
  */
 struct drm_sched_job {
+	struct kref			refcount;
 	struct spsc_node		queue_node;
 	struct drm_gpu_scheduler	*sched;
 	struct drm_sched_fence		*s_fence;
@@ -198,6 +199,7 @@ struct drm_sched_job {
 	enum drm_sched_priority		s_priority;
 	struct drm_sched_entity		*entity;
 	struct dma_fence_cb		cb;
+
 };

From: Grodzovsky, Andrey 
Sent: Friday, March 19, 2021 12:17 AM
To: Zhang, Jack (Jian) ; Christian König 
; dri-de...@lists.freedesktop.org; 
amd-gfx@lists.freedesktop.org; Koenig, Christian ; 
Liu, Monk ; Deng, Emily ; Rob Herring 
; Tomeu Vizoso ; Steven Price 

Subject: Re: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak



On 2021-03-18 6:41 a.m., Zhang, Jack (Jian) wrote:

[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey

Let me summarize the background of this patch:

In the TDR resubmit step "amdgpu_device_recheck_guilty_jobs",
it will submit the first jobs of each ring and do a guilty-job re-check.
At that point, we had to make sure each job is in the mirror list (or
re-inserted back already).

But we found the current code never re-inserts the job into the mirror list in 
the 2nd, 3rd job_timeout threads (bailing TDR threads).
This will not only cause a memleak of the bailing jobs. What's more important, 
the 1st tdr t

RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

2021-03-18 Thread Zhang, Jack (Jian)
[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey

Let me summarize the background of this patch:

In the TDR resubmit step "amdgpu_device_recheck_guilty_jobs",
it will submit the first jobs of each ring and do a guilty-job re-check.
At that point, we had to make sure each job is in the mirror list (or
re-inserted back already).

But we found the current code never re-inserts the job into the mirror list in 
the 2nd, 3rd job_timeout threads (bailing TDR threads).
This will not only cause a memleak of the bailing jobs; more importantly, 
the 1st TDR thread can never iterate the bailing job and set its guilty status 
to a correct status.

Therefore, we had to re-insert the job (or even not delete the node) for the 
bailing job.

For the above V3 patch, the race condition in my mind is:
we cannot make sure all bailing jobs are finished before we do 
amdgpu_device_recheck_guilty_jobs.

Based on this insight, I think we have two options to solve this issue:

  1.  Skip deleting the node in TDR thread 2, thread 3, 4 … (using a mutex or 
atomic variable)
  2.  Re-insert the bailing job, and meanwhile use a semaphore in each TDR 
thread to keep the sequence as expected and ensure each job is in the mirror 
list when doing the resubmit step.

For Option1, logic is simpler and we need only one global atomic variable:
What do you think about this plan?

Option1 should look like the following logic:


+static atomic_t in_reset; //a global atomic var for synchronization
 static void drm_sched_process_job(struct dma_fence *f, struct dma_fence_cb *cb);

 /**
@@ -295,6 +296,12 @@ static void drm_sched_job_timedout(struct work_struct *work)
 	 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
 	 * is parked at which point it's safe.
 	 */
+	if (atomic_cmpxchg(&in_reset, 0, 1) != 0) { //skip deleting the node for TDR thread 2, 3, 4, …
+		spin_unlock(&sched->job_list_lock);
+		drm_sched_start_timeout(sched);
+		return;
+	}
+
 	list_del_init(&job->node);
 	spin_unlock(&sched->job_list_lock);
@@ -320,6 +327,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
 	spin_lock(&sched->job_list_lock);
 	drm_sched_start_timeout(sched);
 	spin_unlock(&sched->job_list_lock);
+	atomic_set(&in_reset, 0); //reset in_reset when the first thread finished tdr
 }


Thanks,
Jack
From: amd-gfx  On Behalf Of Zhang, Jack 
(Jian)
Sent: Wednesday, March 17, 2021 11:11 PM
To: Christian König ; 
dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, 
Christian ; Liu, Monk ; Deng, Emily 
; Rob Herring ; Tomeu Vizoso 
; Steven Price ; Grodzovsky, 
Andrey 
Subject: Re: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak


[AMD Official Use Only - Internal Distribution Only]



Hi, Andrey,

Good catch. I will explore this corner case and give feedback soon~
Best,
Jack


From: Grodzovsky, Andrey 
Sent: Wednesday, March 17, 2021 10:50:59 PM
To: Christian König ; Zhang, Jack (Jian) ; 
dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, 
Christian ; Liu, Monk ; Deng, Emily ; Rob Herring 
; Tomeu Vizoso ; Steven Price 
Subject: Re: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

I actually have a race condition concern here - see below -

On 2021-03-17 3:43 a.m., Christian König wrote:
> I was hoping Andrey would take a look since I'm really busy with other
> work right now.
>
> Regards,
> Christian.
>
> Am 17.03.21 um 07:46 schrieb Zhang, Jack (Jian):
>> Hi, Andrey/Christian and Team,
>>
>> I didn't receive the reviewer's message from maintainers on panfrost
>> driver for several days.
>> Because this patch is urgent for my current working project,
>> would you please help to give some review ideas?
>>
>> Many Thanks,
>> Jack
>> -Original Message-
>> From: Zhang, Jack (Jian)
>> Sent: Tuesday, March 16, 2021 3:20 PM
>> To: dri-de...@lists.freedesktop.org<mailto:dri-de...@lists.freedesktop.org>; 
>> amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>;
>> Koenig, Christian 
>> mailto:christian.koe...@amd.com>>; Grodzovsky, 
>> Andrey
>> mailto:andrey.gro

Re: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

2021-03-17 Thread Zhang, Jack (Jian)
[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey,

Good catch. I will explore this corner case and give feedback soon~

Best,
Jack


From: Grodzovsky, Andrey 
Sent: Wednesday, March 17, 2021 10:50:59 PM
To: Christian König ; Zhang, Jack (Jian) 
; dri-de...@lists.freedesktop.org 
; amd-gfx@lists.freedesktop.org 
; Koenig, Christian ; 
Liu, Monk ; Deng, Emily ; Rob Herring 
; Tomeu Vizoso ; Steven Price 

Subject: Re: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

I actually have a race condition concern here - see below -

On 2021-03-17 3:43 a.m., Christian König wrote:
> I was hoping Andrey would take a look since I'm really busy with other
> work right now.
>
> Regards,
> Christian.
>
> Am 17.03.21 um 07:46 schrieb Zhang, Jack (Jian):
>> Hi, Andrey/Christian and Team,
>>
>> I didn't receive the reviewer's message from maintainers on panfrost
>> driver for several days.
>> Because this patch is urgent for my current working project,
>> would you please help to give some review ideas?
>>
>> Many Thanks,
>> Jack
>> -Original Message-
>> From: Zhang, Jack (Jian)
>> Sent: Tuesday, March 16, 2021 3:20 PM
>> To: dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org;
>> Koenig, Christian ; Grodzovsky, Andrey
>> ; Liu, Monk ; Deng,
>> Emily ; Rob Herring ; Tomeu
>> Vizoso ; Steven Price 
>> Subject: RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid
>> memleak
>>
>> [AMD Public Use]
>>
>> Ping
>>
>> -Original Message-
>> From: Zhang, Jack (Jian)
>> Sent: Monday, March 15, 2021 1:24 PM
>> To: Jack Zhang ;
>> dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org;
>> Koenig, Christian ; Grodzovsky, Andrey
>> ; Liu, Monk ; Deng,
>> Emily ; Rob Herring ; Tomeu
>> Vizoso ; Steven Price 
>> Subject: RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid
>> memleak
>>
>> [AMD Public Use]
>>
>> Hi, Rob/Tomeu/Steven,
>>
>> Would you please help to review this patch for panfrost driver?
>>
>> Thanks,
>> Jack Zhang
>>
>> -Original Message-
>> From: Jack Zhang 
>> Sent: Monday, March 15, 2021 1:21 PM
>> To: dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org;
>> Koenig, Christian ; Grodzovsky, Andrey
>> ; Liu, Monk ; Deng,
>> Emily 
>> Cc: Zhang, Jack (Jian) 
>> Subject: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak
>>
>> re-insert Bailing jobs to avoid memory leak.
>>
>> V2: move re-insert step to drm/scheduler logic
>> V3: add panfrost's return value for bailing jobs in case it hits the
>> memleak issue.
>>
>> Signed-off-by: Jack Zhang 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c| 8 ++--
>>   drivers/gpu/drm/panfrost/panfrost_job.c| 4 ++--
>>   drivers/gpu/drm/scheduler/sched_main.c | 8 +++-
>>   include/drm/gpu_scheduler.h| 1 +
>>   5 files changed, 19 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 79b9cc73763f..86463b0f936e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4815,8 +4815,10 @@ int amdgpu_device_gpu_recover(struct
>> amdgpu_device *adev,
>>   job ? job->base.id : -1);
>> /* even we skipped this reset, still need to set the job
>> to guilty */
>> -if (job)
>> +if (job) {
>>   drm_sched_increase_karma(&job->base);
>> +r = DRM_GPU_SCHED_STAT_BAILING;
>> +}
>>   goto skip_recovery;
>>   }
>>   diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index 759b34799221..41390bdacd9e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -34,6 +34,7 @@ static enum drm_gpu_sched_stat
>> amdgpu_job_timedout(struct drm_sched_job *s_job)
>>   struct amdgpu_job *job = to_amdgpu_job(s_job);
>>   struct amdgpu_task_info ti;
>>   struct amdgpu_device *adev = ring->adev;
>> +int ret;
>> memset(&ti, 0, sizeof(struct amdgpu_task_info));
>>   @@ -52,8 +53,11 @@ static enum drm_gpu_sched_stat
>> amdgpu_job_timedout(struct drm_sched_job *s_job)
>> ti.process_name, ti.

RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

2021-03-17 Thread Zhang, Jack (Jian)
Hi, Andrey/Christian and Team,

I didn't receive the reviewer's message from maintainers on panfrost driver for 
several days.
Because this patch is urgent for my current working project,
would you please help to give some review ideas?

Many Thanks,
Jack
-Original Message-
From: Zhang, Jack (Jian) 
Sent: Tuesday, March 16, 2021 3:20 PM
To: dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, 
Christian ; Grodzovsky, Andrey 
; Liu, Monk ; Deng, Emily 
; Rob Herring ; Tomeu Vizoso 
; Steven Price 
Subject: RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

[AMD Public Use]

Ping

-Original Message-
From: Zhang, Jack (Jian) 
Sent: Monday, March 15, 2021 1:24 PM
To: Jack Zhang ; dri-de...@lists.freedesktop.org; 
amd-gfx@lists.freedesktop.org; Koenig, Christian ; 
Grodzovsky, Andrey ; Liu, Monk ; 
Deng, Emily ; Rob Herring ; Tomeu Vizoso 
; Steven Price 
Subject: RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

[AMD Public Use]

Hi, Rob/Tomeu/Steven,

Would you please help to review this patch for panfrost driver?

Thanks,
Jack Zhang

-Original Message-
From: Jack Zhang 
Sent: Monday, March 15, 2021 1:21 PM
To: dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, 
Christian ; Grodzovsky, Andrey 
; Liu, Monk ; Deng, Emily 

Cc: Zhang, Jack (Jian) 
Subject: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

re-insert Bailing jobs to avoid memory leak.

V2: move re-insert step to drm/scheduler logic
V3: add panfrost's return value for bailing jobs in case it hits the memleak 
issue.

Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c| 8 ++--
 drivers/gpu/drm/panfrost/panfrost_job.c| 4 ++--
 drivers/gpu/drm/scheduler/sched_main.c | 8 +++-
 include/drm/gpu_scheduler.h| 1 +
 5 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 79b9cc73763f..86463b0f936e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4815,8 +4815,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 			job ? job->base.id : -1);
 
 		/* even we skipped this reset, still need to set the job to guilty */
-		if (job)
+		if (job) {
 			drm_sched_increase_karma(&job->base);
+			r = DRM_GPU_SCHED_STAT_BAILING;
+		}
 		goto skip_recovery;
 	}
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 759b34799221..41390bdacd9e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -34,6 +34,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 	struct amdgpu_job *job = to_amdgpu_job(s_job);
 	struct amdgpu_task_info ti;
 	struct amdgpu_device *adev = ring->adev;
+	int ret;
 
 	memset(&ti, 0, sizeof(struct amdgpu_task_info));
 
@@ -52,8 +53,11 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 		  ti.process_name, ti.tgid, ti.task_name, ti.pid);
 
 	if (amdgpu_device_should_recover_gpu(ring->adev)) {
-		amdgpu_device_gpu_recover(ring->adev, job);
-		return DRM_GPU_SCHED_STAT_NOMINAL;
+		ret = amdgpu_device_gpu_recover(ring->adev, job);
+		if (ret == DRM_GPU_SCHED_STAT_BAILING)
+			return DRM_GPU_SCHED_STAT_BAILING;
+		else
+			return DRM_GPU_SCHED_STAT_NOMINAL;
 	} else {
 		drm_sched_suspend_timeout(&ring->sched);
 		if (amdgpu_sriov_vf(adev))
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 6003cfeb1322..e2cb4f32dae1 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -444,7 +444,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
 	 * spurious. Bail out.
 	 */
 	if (dma_fence_is_signaled(job->done_fence))
-		return DRM_GPU_SCHED_STAT_NOMINAL;
+		return DRM_GPU_SCHED_STAT_BAILING;
 
 	dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, status=0x%x, head=0x%x, tail=0x%x, sched_job=%p",
 		js,
@@ -456,7 +456,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
 
 	/* Scheduler is already stopped, nothing to do. */
 	if (!panfrost_scheduler_stop(&pfdev->js->queue[js], sched_job))
-		return DRM_GPU_SCHED_STAT_NOMINAL;
+		return DRM_GPU_SCHED_STAT_BAILING;
 
 	/* Sched

RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

2021-03-16 Thread Zhang, Jack (Jian)
[AMD Public Use]

Ping

-Original Message-
From: Zhang, Jack (Jian) 
Sent: Monday, March 15, 2021 1:24 PM
To: Jack Zhang ; dri-de...@lists.freedesktop.org; 
amd-gfx@lists.freedesktop.org; Koenig, Christian ; 
Grodzovsky, Andrey ; Liu, Monk ; 
Deng, Emily ; Rob Herring ; Tomeu Vizoso 
; Steven Price 
Subject: RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

[AMD Public Use]

Hi, Rob/Tomeu/Steven,

Would you please help to review this patch for panfrost driver?

Thanks,
Jack Zhang

-Original Message-
From: Jack Zhang 
Sent: Monday, March 15, 2021 1:21 PM
To: dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, 
Christian ; Grodzovsky, Andrey 
; Liu, Monk ; Deng, Emily 

Cc: Zhang, Jack (Jian) 
Subject: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

re-insert Bailing jobs to avoid memory leak.

V2: move re-insert step to drm/scheduler logic
V3: add panfrost's return value for bailing jobs in case it hits the memleak 
issue.

Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c| 8 ++--
 drivers/gpu/drm/panfrost/panfrost_job.c| 4 ++--
 drivers/gpu/drm/scheduler/sched_main.c | 8 +++-
 include/drm/gpu_scheduler.h| 1 +
 5 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 79b9cc73763f..86463b0f936e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4815,8 +4815,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 			job ? job->base.id : -1);
 
 		/* even we skipped this reset, still need to set the job to guilty */
-		if (job)
+		if (job) {
 			drm_sched_increase_karma(&job->base);
+			r = DRM_GPU_SCHED_STAT_BAILING;
+		}
 		goto skip_recovery;
 	}
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 759b34799221..41390bdacd9e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -34,6 +34,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 	struct amdgpu_job *job = to_amdgpu_job(s_job);
 	struct amdgpu_task_info ti;
 	struct amdgpu_device *adev = ring->adev;
+	int ret;
 
 	memset(&ti, 0, sizeof(struct amdgpu_task_info));
 
@@ -52,8 +53,11 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 		  ti.process_name, ti.tgid, ti.task_name, ti.pid);
 
 	if (amdgpu_device_should_recover_gpu(ring->adev)) {
-		amdgpu_device_gpu_recover(ring->adev, job);
-		return DRM_GPU_SCHED_STAT_NOMINAL;
+		ret = amdgpu_device_gpu_recover(ring->adev, job);
+		if (ret == DRM_GPU_SCHED_STAT_BAILING)
+			return DRM_GPU_SCHED_STAT_BAILING;
+		else
+			return DRM_GPU_SCHED_STAT_NOMINAL;
 	} else {
 		drm_sched_suspend_timeout(&ring->sched);
 		if (amdgpu_sriov_vf(adev))
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 6003cfeb1322..e2cb4f32dae1 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -444,7 +444,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
 	 * spurious. Bail out.
 	 */
 	if (dma_fence_is_signaled(job->done_fence))
-		return DRM_GPU_SCHED_STAT_NOMINAL;
+		return DRM_GPU_SCHED_STAT_BAILING;
 
 	dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, status=0x%x, head=0x%x, tail=0x%x, sched_job=%p",
 		js,
@@ -456,7 +456,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
 
 	/* Scheduler is already stopped, nothing to do. */
 	if (!panfrost_scheduler_stop(&pfdev->js->queue[js], sched_job))
-		return DRM_GPU_SCHED_STAT_NOMINAL;
+		return DRM_GPU_SCHED_STAT_BAILING;
 
 	/* Schedule a reset if there's no reset in progress. */
 	if (!atomic_xchg(&pfdev->reset.pending, 1))
diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index 92d8de24d0a1..a44f621fb5c4 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
 {
 	struct drm_gpu_scheduler *sched;
 	struct drm_sched_job *job;
+	int ret;
 
 	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
 
@@ -331,8 +332,13 @@ static void dr

RE: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

2021-03-14 Thread Zhang, Jack (Jian)
[AMD Public Use]

Hi, Rob/Tomeu/Steven,

Would you please help to review this patch for panfrost driver?

Thanks,
Jack Zhang

-Original Message-
From: Jack Zhang  
Sent: Monday, March 15, 2021 1:21 PM
To: dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, 
Christian ; Grodzovsky, Andrey 
; Liu, Monk ; Deng, Emily 

Cc: Zhang, Jack (Jian) 
Subject: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak

Re-insert bailing jobs to avoid a memory leak.

V2: move re-insert step to drm/scheduler logic
V3: add panfrost's return value for bailing jobs
in case it hits the memleak issue.

Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c| 8 ++--
 drivers/gpu/drm/panfrost/panfrost_job.c| 4 ++--
 drivers/gpu/drm/scheduler/sched_main.c | 8 +++-
 include/drm/gpu_scheduler.h| 1 +
 5 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 79b9cc73763f..86463b0f936e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4815,8 +4815,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
job ? job->base.id : -1);
 
/* even we skipped this reset, still need to set the job to 
guilty */
-   if (job)
+   if (job) {
drm_sched_increase_karma(>base);
+   r = DRM_GPU_SCHED_STAT_BAILING;
+   }
goto skip_recovery;
}
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 759b34799221..41390bdacd9e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -34,6 +34,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct 
drm_sched_job *s_job)
struct amdgpu_job *job = to_amdgpu_job(s_job);
struct amdgpu_task_info ti;
struct amdgpu_device *adev = ring->adev;
+   int ret;
 
memset(, 0, sizeof(struct amdgpu_task_info));
 
@@ -52,8 +53,11 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct 
drm_sched_job *s_job)
  ti.process_name, ti.tgid, ti.task_name, ti.pid);
 
if (amdgpu_device_should_recover_gpu(ring->adev)) {
-   amdgpu_device_gpu_recover(ring->adev, job);
-   return DRM_GPU_SCHED_STAT_NOMINAL;
+   ret = amdgpu_device_gpu_recover(ring->adev, job);
+   if (ret == DRM_GPU_SCHED_STAT_BAILING)
+   return DRM_GPU_SCHED_STAT_BAILING;
+   else
+   return DRM_GPU_SCHED_STAT_NOMINAL;
} else {
drm_sched_suspend_timeout(>sched);
if (amdgpu_sriov_vf(adev))
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 6003cfeb1322..e2cb4f32dae1 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -444,7 +444,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct 
drm_sched_job
 * spurious. Bail out.
 */
if (dma_fence_is_signaled(job->done_fence))
-   return DRM_GPU_SCHED_STAT_NOMINAL;
+   return DRM_GPU_SCHED_STAT_BAILING;
 
dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, 
status=0x%x, head=0x%x, tail=0x%x, sched_job=%p",
js,
@@ -456,7 +456,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct 
drm_sched_job
 
/* Scheduler is already stopped, nothing to do. */
if (!panfrost_scheduler_stop(>js->queue[js], sched_job))
-   return DRM_GPU_SCHED_STAT_NOMINAL;
+   return DRM_GPU_SCHED_STAT_BAILING;
 
/* Schedule a reset if there's no reset in progress. */
if (!atomic_xchg(>reset.pending, 1))
diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index 92d8de24d0a1..a44f621fb5c4 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
 {
struct drm_gpu_scheduler *sched;
struct drm_sched_job *job;
+   int ret;
 
sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
 
@@ -331,8 +332,13 @@ static void drm_sched_job_timedout(struct work_struct 
*work)
list_del_init(>list);
spin_unlock(>job_list_lock);
 
-   job->sched->ops->timedout_job(job);
+   ret = job->sched->ops->timedout_job(job);
 
+   if (ret == DRM_GPU_SCHED_STAT_BAILING) {
+   spin_lock(>job_list_lock);
+ 

RE: [PATCH] drm/scheduler re-insert Bailing job to avoid memleak

2021-03-11 Thread Zhang, Jack (Jian)
[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey,

ok, I have changed it and uploaded V2 patch.

Thanks,
Jack
-Original Message-
From: Grodzovsky, Andrey 
Sent: Friday, March 12, 2021 1:04 PM
To: Alex Deucher ; Zhang, Jack (Jian) 
; Maling list - DRI developers 

Cc: amd-gfx list ; Koenig, Christian 
; Liu, Monk ; Deng, Emily 

Subject: Re: [PATCH] drm/scheduler re-insert Bailing job to avoid memleak

Check panfrost driver at panfrost_scheduler_stop, and panfrost_job_timedout - 
they also terminate prematurely in both places so probably worth adding this 
there too.

Andrey

On 2021-03-11 11:13 p.m., Alex Deucher wrote:
> +dri-devel
>
> Please be sure to cc dri-devel when you send out gpu scheduler patches.
>
> On Thu, Mar 11, 2021 at 10:57 PM Jack Zhang  wrote:
>>
>> Re-insert bailing jobs to avoid a memory leak.
>>
>> Signed-off-by: Jack Zhang 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c| 8 ++--
>>   drivers/gpu/drm/scheduler/sched_main.c | 8 +++-
>>   include/drm/gpu_scheduler.h| 1 +
>>   4 files changed, 17 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 79b9cc73763f..86463b0f936e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4815,8 +4815,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
>> *adev,
>>  job ? job->base.id : -1);
>>
>>  /* even we skipped this reset, still need to set the job to 
>> guilty */
>> -   if (job)
>> +   if (job) {
>>  drm_sched_increase_karma(>base);
>> +   r = DRM_GPU_SCHED_STAT_BAILING;
>> +   }
>>  goto skip_recovery;
>>  }
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index 759b34799221..41390bdacd9e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -34,6 +34,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct 
>> drm_sched_job *s_job)
>>  struct amdgpu_job *job = to_amdgpu_job(s_job);
>>  struct amdgpu_task_info ti;
>>  struct amdgpu_device *adev = ring->adev;
>> +   int ret;
>>
>>  memset(, 0, sizeof(struct amdgpu_task_info));
>>
>> @@ -52,8 +53,11 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct 
>> drm_sched_job *s_job)
>>ti.process_name, ti.tgid, ti.task_name, ti.pid);
>>
>>  if (amdgpu_device_should_recover_gpu(ring->adev)) {
>> -   amdgpu_device_gpu_recover(ring->adev, job);
>> -   return DRM_GPU_SCHED_STAT_NOMINAL;
>> +   ret = amdgpu_device_gpu_recover(ring->adev, job);
>> +   if (ret == DRM_GPU_SCHED_STAT_BAILING)
>> +   return DRM_GPU_SCHED_STAT_BAILING;
>> +   else
>> +   return DRM_GPU_SCHED_STAT_NOMINAL;
>>  } else {
>>  drm_sched_suspend_timeout(>sched);
>>  if (amdgpu_sriov_vf(adev)) diff --git
>> a/drivers/gpu/drm/scheduler/sched_main.c
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index 92d8de24d0a1..a44f621fb5c4 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct 
>> *work)
>>   {
>>  struct drm_gpu_scheduler *sched;
>>  struct drm_sched_job *job;
>> +   int ret;
>>
>>  sched = container_of(work, struct drm_gpu_scheduler,
>> work_tdr.work);
>>
>> @@ -331,8 +332,13 @@ static void drm_sched_job_timedout(struct work_struct 
>> *work)
>>  list_del_init(>list);
>>  spin_unlock(>job_list_lock);
>>
>> -   job->sched->ops->timedout_job(job);
>> +   ret = job->sched->ops->timedout_job(job);
>>
>> +   if (ret == DRM_GPU_SCHED_STAT_BAILING) {
>> +   spin_lock(>job_list_lock);
>> +   list_add(>node, >ring_mirror_list);
>> +   spin_unlock(>job_list_lock);
>> +   }

Problem here that since you alr

Re: [PATCH v7] drm/amd/amdgpu implement tdr advanced mode

2021-03-11 Thread Zhang, Jack (Jian)
[AMD Official Use Only - Internal Distribution Only]

hi, Andrey and Christian,

V8 patch is uploaded.


Thanks,
Jack


From: amd-gfx  On Behalf Of Zhang, Jack (Jian) 

Sent: Thursday, March 11, 2021 8:20 PM
To: amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey; Liu, Monk; Deng, Emily; 
Koenig, Christian
Subject: Re: [PATCH v7] drm/amd/amdgpu implement tdr advanced mode


[AMD Official Use Only - Internal Distribution Only]


[AMD Official Use Only - Internal Distribution Only]

hi, Christian,

Good idea, thank you for these efforts.
I will update in next version.

Jack


From: Koenig, Christian 
Sent: Thursday, March 11, 2021 6:41:05 PM
To: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org 
; Grodzovsky, Andrey 
; Liu, Monk ; Deng, Emily 

Subject: Re: [PATCH v7] drm/amd/amdgpu implement tdr advanced mode



Am 11.03.21 um 06:58 schrieb Jack Zhang:
> [Why]
> Previous tdr design treats the first job in job_timeout as the bad job.
> But sometimes a later bad compute job can block a good gfx job and
> cause an unexpected gfx job timeout because gfx and compute ring share
> internal GC HW mutually.
>
> [How]
> This patch implements an advanced tdr mode. It involves an additional
> synchronous pre-resubmit step (Step0 Resubmit) before the normal resubmit
> step in order to find the real bad job.
>
> 1. At the Step0 Resubmit stage, it synchronously resubmits and waits for
> the first job to be signaled. If it times out, we identify it as guilty
> and do a hw reset. After that, we do the normal resubmit step to
> resubmit the remaining jobs.
>
> 2. For whole gpu reset (vram lost), do resubmit the old way.
>
> Signed-off-by: Jack Zhang 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  63 
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|   2 +-
>   drivers/gpu/drm/scheduler/sched_main.c | 107 +++--
>   include/drm/gpu_scheduler.h|   3 +
>   4 files changed, 142 insertions(+), 33 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e247c3a2ec08..5b182ade26ad 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4639,6 +4639,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>int i, r = 0;
>bool need_emergency_restart = false;
>bool audio_suspended = false;
> + int tmp_vram_lost_counter;
>
>/*
> * Special case: RAS triggered and full reset isn't supported
> @@ -4788,6 +4789,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>}
>}
>
> + tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));
>/* Actual ASIC resets if needed.*/
>/* TODO Implement XGMI hive reset logic for SRIOV */
>if (amdgpu_sriov_vf(adev)) {
> @@ -4805,6 +4807,67 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>/* Post ASIC reset for all devs .*/
>list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
>
> + if (amdgpu_gpu_recovery == 2 &&
> + !(tmp_vram_lost_counter < 
> atomic_read(>vram_lost_counter))) {
> +
> + for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {

Starting from here I would put this into a separate helper function in
amdgpu_ring.c.

You should also probably use checkpatch.pl once more since a couple of
lines below might result in warnings.

Christian.

> + struct amdgpu_ring *ring = tmp_adev->rings[i];
> + int ret = 0;
> + struct drm_sched_job *s_job;
> +
> + if (!ring || !ring->sched.thread)
> + continue;
> +
> + /* No point to resubmit jobs if we didn't HW 
> reset*/
> + if (!tmp_adev->asic_reset_res && !job_signaled) 
> {
> +
> + s_job = 
> list_first_entry_or_null(>sched.pending_list, struct drm_sched_job, 
> list);
> + if (s_job == NULL)
> + continue;
> +
> + /* clear job's guilty and depend on the 
> following step to decide the real one */
> + drm_sched_reset_karma(s_job);
> + 
> drm_sched_resubmit_jobs_ext(>sched, 1);
> + ret = 
> dma_fence_wait_timeout(s_job->s_fence->parent, 

Re: [PATCH v7] drm/amd/amdgpu implement tdr advanced mode

2021-03-11 Thread Zhang, Jack (Jian)
[AMD Official Use Only - Internal Distribution Only]

hi, Christian,

Good idea, thank you for these efforts.
I will update in next version.

Jack


From: Koenig, Christian 
Sent: Thursday, March 11, 2021 6:41:05 PM
To: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org 
; Grodzovsky, Andrey 
; Liu, Monk ; Deng, Emily 

Subject: Re: [PATCH v7] drm/amd/amdgpu implement tdr advanced mode



Am 11.03.21 um 06:58 schrieb Jack Zhang:
> [Why]
> Previous tdr design treats the first job in job_timeout as the bad job.
> But sometimes a later bad compute job can block a good gfx job and
> cause an unexpected gfx job timeout because gfx and compute ring share
> internal GC HW mutually.
>
> [How]
> This patch implements an advanced tdr mode. It involves an additional
> synchronous pre-resubmit step (Step0 Resubmit) before the normal resubmit
> step in order to find the real bad job.
>
> 1. At the Step0 Resubmit stage, it synchronously resubmits and waits for
> the first job to be signaled. If it times out, we identify it as guilty
> and do a hw reset. After that, we do the normal resubmit step to
> resubmit the remaining jobs.
>
> 2. For whole gpu reset (vram lost), do resubmit the old way.
>
> Signed-off-by: Jack Zhang 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  63 
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|   2 +-
>   drivers/gpu/drm/scheduler/sched_main.c | 107 +++--
>   include/drm/gpu_scheduler.h|   3 +
>   4 files changed, 142 insertions(+), 33 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e247c3a2ec08..5b182ade26ad 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4639,6 +4639,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>int i, r = 0;
>bool need_emergency_restart = false;
>bool audio_suspended = false;
> + int tmp_vram_lost_counter;
>
>/*
> * Special case: RAS triggered and full reset isn't supported
> @@ -4788,6 +4789,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>}
>}
>
> + tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));
>/* Actual ASIC resets if needed.*/
>/* TODO Implement XGMI hive reset logic for SRIOV */
>if (amdgpu_sriov_vf(adev)) {
> @@ -4805,6 +4807,67 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>/* Post ASIC reset for all devs .*/
>list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
>
> + if (amdgpu_gpu_recovery == 2 &&
> + !(tmp_vram_lost_counter < 
> atomic_read(>vram_lost_counter))) {
> +
> + for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {

Starting from here I would put this into a separate helper function in
amdgpu_ring.c.

You should also probably use checkpatch.pl once more since a couple of
lines below might result in warnings.

Christian.

> + struct amdgpu_ring *ring = tmp_adev->rings[i];
> + int ret = 0;
> + struct drm_sched_job *s_job;
> +
> + if (!ring || !ring->sched.thread)
> + continue;
> +
> + /* No point to resubmit jobs if we didn't HW 
> reset*/
> + if (!tmp_adev->asic_reset_res && !job_signaled) 
> {
> +
> + s_job = 
> list_first_entry_or_null(>sched.pending_list, struct drm_sched_job, 
> list);
> + if (s_job == NULL)
> + continue;
> +
> + /* clear job's guilty and depend on the 
> following step to decide the real one */
> + drm_sched_reset_karma(s_job);
> + 
> drm_sched_resubmit_jobs_ext(>sched, 1);
> + ret = 
> dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
> +
> + if (ret == 0) { /* timeout */
> + DRM_ERROR("Found the real bad 
> job! ring:%s, job_id:%llx\n", ring->sched.name, s_job->id);
> + /* set guilty */
> + drm_sched_increase_karma(s_job);
> +
> + /* do hw

RE: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode

2021-03-10 Thread Zhang, Jack (Jian)
[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey,

I just submitted the V7 patch in which I wrapped the increase_karma func and 
resubmit_job func.

Thanks,
Jack
From: amd-gfx  On Behalf Of Zhang, Jack 
(Jian)
Sent: Thursday, March 11, 2021 11:28 AM
To: Grodzovsky, Andrey ; 
amd-gfx@lists.freedesktop.org; Koenig, Christian ; 
Liu, Monk ; Deng, Emily 
Subject: RE: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode


[AMD Official Use Only - Internal Distribution Only]



Ok, that will do the trick. Thank you!

Jack
From: Grodzovsky, Andrey 
mailto:andrey.grodzov...@amd.com>>
Sent: Thursday, March 11, 2021 11:24 AM
To: Zhang, Jack (Jian) mailto:jack.zha...@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>; Koenig, 
Christian mailto:christian.koe...@amd.com>>; Liu, 
Monk mailto:monk@amd.com>>; Deng, Emily 
mailto:emily.d...@amd.com>>
Subject: Re: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode

You can just rename drm_sched_resubmit_jobs to drm_sched_resubmit_jobs_imp and 
create a wrapper function drm_sched_resubmit_jobs which calls that function 
with MAX_INT, so no code change will be needed in other drivers. For amdgpu you 
call drm_sched_resubmit_jobs_imp directly with 1 and with MAX_INT.
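The rename-plus-wrapper idea can be sketched with stand-in types. This is only a model of the calling convention under assumed semantics, not the real drm/scheduler internals; the struct fields and counting logic here are hypothetical.

```c
/*
 * Sketch of the rename-plus-wrapper approach: the real implementation
 * takes an explicit 'max' argument, and the old entry point becomes a
 * wrapper passing INT_MAX, so existing callers need no changes.
 * The struct below is a hypothetical stand-in for the scheduler.
 */
#include <limits.h>

struct drm_gpu_scheduler {
	int pending;      /* stand-in for the length of the pending job list */
	int resubmitted;  /* how many jobs the last call resubmitted */
};

/* Renamed implementation: resubmit at most 'max' jobs. */
static void drm_sched_resubmit_jobs_imp(struct drm_gpu_scheduler *sched, int max)
{
	sched->resubmitted = sched->pending < max ? sched->pending : max;
}

/* Old entry point keeps its signature: resubmit everything. */
static void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched)
{
	drm_sched_resubmit_jobs_imp(sched, INT_MAX);
}
```

A caller wanting the Step0 find-guilty pass would use the _imp variant with max==1, while untouched drivers keep calling drm_sched_resubmit_jobs.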

Andrey

________
From: Zhang, Jack (Jian) mailto:jack.zha...@amd.com>>
Sent: 10 March 2021 22:05
To: Grodzovsky, Andrey 
mailto:andrey.grodzov...@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> 
mailto:amd-gfx@lists.freedesktop.org>>; Koenig, 
Christian mailto:christian.koe...@amd.com>>; Liu, 
Monk mailto:monk@amd.com>>; Deng, Emily 
mailto:emily.d...@amd.com>>
Subject: RE: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode

[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey,

Thank you for your review.

>> This is identical to drm_sched_resubmit_jobs in all but the 'int max' 
>> handling; can't you reuse drm_sched_resubmit_jobs by passing max argument==1 
>> for the find-guilty run and then, for the later normal run, passing 
>> max==MAX_INT to avoid breaking the iteration too early?

Because the C language doesn't support function overloading, we cannot define 
a function like the following (compiler error):
void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched, int max = 
INT_MAX);

We would have to rewrite every vendor's code that calls drm_sched_resubmit_jobs 
if we defined the function like this:
void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched, int max);

Meanwhile, I also explored variable arguments in C, but they still cannot 
meet the need very well.

Do you know some other method that could help us to make this happen?
Otherwise, I'm afraid we have to define a separate way as patch V6 did.

Thanks,
Jack

-Original Message-
From: Grodzovsky, Andrey 
mailto:andrey.grodzov...@amd.com>>
Sent: Thursday, March 11, 2021 12:19 AM
To: Zhang, Jack (Jian) mailto:jack.zha...@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>; Koenig, 
Christian mailto:christian.koe...@amd.com>>; Liu, 
Monk mailto:monk@amd.com>>; Deng, Emily 
mailto:emily.d...@amd.com>>
Subject: Re: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode



On 2021-03-10 8:33 a.m., Jack Zhang wrote:
> [Why]
> Previous tdr design treats the first job in job_timeout as the bad job.
> But sometimes a later bad compute job can block a good gfx job and
> cause an unexpected gfx job timeout because gfx and compute ring share
> internal GC HW mutually.
>
> [How]
> This patch implements an advanced tdr mode. It involves an additional
> synchronous pre-resubmit step (Step0 Resubmit) before the normal resubmit
> step in order to find the real bad job.
>
> 1. At the Step0 Resubmit stage, it synchronously resubmits and waits for
> the first job to be signaled. If it times out, we identify it as guilty
> and do a hw reset. After that, we do the normal resubmit step to
> resubmit the remaining jobs.
>
> 2. Re-insert the Bailing job into mirror_list, and leave it to be handled
> by the main reset thread.
>
> 3. For whole gpu reset (vram lost), do resubmit the old way.
>
> Signed-off-by: Jack Zhang mailto:jack.zha...@amd.com>>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 72 -
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  2 +-
>   drivers/gpu/drm/scheduler/sched_main.c | 75 ++
>   include/drm/gpu_scheduler.h|  2 +
>   4 files changed, 148 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e247c3a2ec08..dac0a242e5f5 100644
> --- a/dri

RE: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode

2021-03-10 Thread Zhang, Jack (Jian)
[AMD Official Use Only - Internal Distribution Only]

Ok, that will do the trick. Thank you!

Jack
From: Grodzovsky, Andrey 
Sent: Thursday, March 11, 2021 11:24 AM
To: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org; 
Koenig, Christian ; Liu, Monk ; 
Deng, Emily 
Subject: Re: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode

You can just rename drm_sched_resubmit_jobs to drm_sched_resubmit_jobs_imp and 
create a wrapper function drm_sched_resubmit_jobs which calls that function 
with MAX_INT, so no code change will be needed in other drivers. For amdgpu you 
call drm_sched_resubmit_jobs_imp directly with 1 and with MAX_INT.

Andrey


From: Zhang, Jack (Jian) mailto:jack.zha...@amd.com>>
Sent: 10 March 2021 22:05
To: Grodzovsky, Andrey 
mailto:andrey.grodzov...@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> 
mailto:amd-gfx@lists.freedesktop.org>>; Koenig, 
Christian mailto:christian.koe...@amd.com>>; Liu, 
Monk mailto:monk@amd.com>>; Deng, Emily 
mailto:emily.d...@amd.com>>
Subject: RE: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode

[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey,

Thank you for your review.

>> This is identical to drm_sched_resubmit_jobs in all but the 'int max' 
>> handling; can't you reuse drm_sched_resubmit_jobs by passing max argument==1 
>> for the find-guilty run and then, for the later normal run, passing 
>> max==MAX_INT to avoid breaking the iteration too early?

Because the C language doesn't support function overloading, we cannot define 
a function like the following (compiler error):
void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched, int max = 
INT_MAX);

We would have to rewrite every vendor's code that calls drm_sched_resubmit_jobs 
if we defined the function like this:
void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched, int max);

Meanwhile, I also explored variable arguments in C, but they still cannot 
meet the need very well.

Do you know some other method that could help us to make this happen?
Otherwise, I'm afraid we have to define a separate way as patch V6 did.

Thanks,
Jack

-Original Message-
From: Grodzovsky, Andrey 
mailto:andrey.grodzov...@amd.com>>
Sent: Thursday, March 11, 2021 12:19 AM
To: Zhang, Jack (Jian) mailto:jack.zha...@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>; Koenig, 
Christian mailto:christian.koe...@amd.com>>; Liu, 
Monk mailto:monk@amd.com>>; Deng, Emily 
mailto:emily.d...@amd.com>>
Subject: Re: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode



On 2021-03-10 8:33 a.m., Jack Zhang wrote:
> [Why]
> Previous tdr design treats the first job in job_timeout as the bad job.
> But sometimes a later bad compute job can block a good gfx job and
> cause an unexpected gfx job timeout because gfx and compute ring share
> internal GC HW mutually.
>
> [How]
> This patch implements an advanced tdr mode. It involves an additional
> synchronous pre-resubmit step (Step0 Resubmit) before the normal resubmit
> step in order to find the real bad job.
>
> 1. At the Step0 Resubmit stage, it synchronously resubmits and waits for
> the first job to be signaled. If it times out, we identify it as guilty
> and do a hw reset. After that, we do the normal resubmit step to
> resubmit the remaining jobs.
>
> 2. Re-insert the Bailing job into mirror_list, and leave it to be handled
> by the main reset thread.
>
> 3. For whole gpu reset (vram lost), do resubmit the old way.
>
> Signed-off-by: Jack Zhang mailto:jack.zha...@amd.com>>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 72 -
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  2 +-
>   drivers/gpu/drm/scheduler/sched_main.c | 75 ++
>   include/drm/gpu_scheduler.h|  2 +
>   4 files changed, 148 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e247c3a2ec08..dac0a242e5f5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4639,6 +4639,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>   int i, r = 0;
>   bool need_emergency_restart = false;
>   bool audio_suspended = false;
> +int tmp_vram_lost_counter;
>
>   /*
>* Special case: RAS triggered and full reset isn't supported @@
> -4689,9 +4690,14 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   dev_info(adev->dev, "Bailing on TDR for s_job:%llx, as another already in 
> progress",
>   job ? job->base.id : -1);
>
> -/* even we skipped this reset, still need to set the job to guilty 

RE: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode

2021-03-10 Thread Zhang, Jack (Jian)
[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey,

Thank you for your review.

>> This is identical to drm_sched_resubmit_jobs in all but the 'int max' 
>> handling; can't you reuse drm_sched_resubmit_jobs by passing max argument==1 
>> for the find-guilty run and then, for the later normal run, passing 
>> max==MAX_INT to avoid breaking the iteration too early?

Because the C language doesn't support function overloading, we cannot define 
a function like the following (compiler error):
void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched, int max = 
INT_MAX);

We would have to rewrite every vendor's code that calls drm_sched_resubmit_jobs 
if we defined the function like this:
void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched, int max);

Meanwhile, I also explored variable arguments in C, but they still cannot 
meet the need very well.

Do you know some other method that could help us to make this happen?
Otherwise, I'm afraid we have to define a separate way as patch V6 did.

Thanks,
Jack

-Original Message-
From: Grodzovsky, Andrey 
Sent: Thursday, March 11, 2021 12:19 AM
To: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org; 
Koenig, Christian ; Liu, Monk ; 
Deng, Emily 
Subject: Re: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode



On 2021-03-10 8:33 a.m., Jack Zhang wrote:
> [Why]
> Previous tdr design treats the first job in job_timeout as the bad job.
> But sometimes a later bad compute job can block a good gfx job and
> cause an unexpected gfx job timeout because gfx and compute ring share
> internal GC HW mutually.
>
> [How]
> This patch implements an advanced tdr mode. It involves an additional
> synchronous pre-resubmit step (Step0 Resubmit) before the normal resubmit
> step in order to find the real bad job.
>
> 1. At the Step0 Resubmit stage, it synchronously resubmits and waits for
> the first job to be signaled. If it times out, we identify it as guilty
> and do a hw reset. After that, we do the normal resubmit step to
> resubmit the remaining jobs.
>
> 2. Re-insert the Bailing job into mirror_list, and leave it to be handled
> by the main reset thread.
>
> 3. For whole gpu reset (vram lost), do resubmit the old way.
>
> Signed-off-by: Jack Zhang 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 72 -
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  2 +-
>   drivers/gpu/drm/scheduler/sched_main.c | 75 ++
>   include/drm/gpu_scheduler.h|  2 +
>   4 files changed, 148 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e247c3a2ec08..dac0a242e5f5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4639,6 +4639,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>   int i, r = 0;
>   bool need_emergency_restart = false;
>   bool audio_suspended = false;
> +int tmp_vram_lost_counter;
>
>   /*
>* Special case: RAS triggered and full reset isn't supported @@
> -4689,9 +4690,14 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   dev_info(adev->dev, "Bailing on TDR for s_job:%llx, as another already in 
> progress",
>   job ? job->base.id : -1);
>
> -/* even we skipped this reset, still need to set the job to guilty */
> -if (job)
> +if (job) {
> +/* re-insert node to avoid memory leak */
> +spin_lock(>base.sched->job_list_lock);
> +list_add(>base.node, >base.sched->pending_list);
> +spin_unlock(>base.sched->job_list_lock);
> +/* even we skipped this reset, still need to set the job to guilty
> +*/
>   drm_sched_increase_karma(>base);
> +}

I think this piece should be in a standalone patch, as it fixes a bug of 
leaking jobs and is not directly related to finding the actual guilty job. 
Also, this is not the only place where the problem arises - it also arises in 
other drivers where they check that the guilty job's fence has already signaled 
and bail out before reinsertion of the bad job into the mirror list. It seems 
to me a better fix would be to handle this within the scheduler code by adding 
a specific return value to drm_sched_backend_ops.timedout_job marking that the 
code terminated early - before calling drm_sched_stop - so 
drm_sched_job_timedout would know this and then reinsert the job back into 
mirror_list itself.
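That scheduler-side idea can be sketched with simplified stand-in types. The toy_* names and the counter are hypothetical; the real code would operate on sched->pending_list under job_list_lock.

```c
/*
 * Sketch: the driver's timedout_job callback reports via its return
 * value that it bailed out early, and the scheduler's timeout handler
 * re-inserts the job into the pending list itself instead of leaking it.
 * Types and list handling are simplified stand-ins.
 */
enum drm_gpu_sched_stat {
	DRM_GPU_SCHED_STAT_NOMINAL,
	DRM_GPU_SCHED_STAT_BAILING,  /* terminated before drm_sched_stop() */
};

struct toy_sched {
	int pending_count;  /* stand-in for sched->pending_list */
};

/* Driver callback that bails out (e.g. the fence already signaled). */
static enum drm_gpu_sched_stat toy_timedout_job(void)
{
	return DRM_GPU_SCHED_STAT_BAILING;
}

static void toy_sched_job_timedout(struct toy_sched *sched)
{
	enum drm_gpu_sched_stat ret;

	sched->pending_count--;  /* job removed from the list for handling */

	ret = toy_timedout_job();

	/* A bailing job must go back on the list, or it is leaked. */
	if (ret == DRM_GPU_SCHED_STAT_BAILING)
		sched->pending_count++;
}
```

The point of the design is that each driver only has to return the new status, while the re-insertion bookkeeping lives in one place in the scheduler.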

>   goto skip_recovery;
>   }
>
> @@ -4788,6 +4794,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>   }
>   }
>
> +tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));
>   /* Actual ASIC resets if needed.*/
>   /* TODO Implement XGMI hive reset logic for SRIOV */
>   if (amdgpu_sriov_vf(adev)) {
> @@ -

Re: [PATCH v4] drm/amd/amdgpu implement tdr advanced mode

2021-03-10 Thread Zhang, Jack (Jian)
[AMD Official Use Only - Internal Distribution Only]


As Monk previously explained in another mail session, drm_sched_resubmit_jobs 
submits all jobs, which does not meet our needs.

We need a new API, drm_sched_resubmit_jobs2(), 
to submit only the first job of a ring (we can also leverage the error handling for 
NULL or error fences inside this function).

It seems cleaner to use this jobs2 function than to write the run_job + some error 
handling in the main logic.


Thanks,
/Jack






From: Koenig, Christian 
Sent: Wednesday, March 10, 2021 9:49 PM
To: Zhang, Jack (Jian); amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey; 
Liu, Monk; Deng, Emily
Subject: Re: [PATCH v4] drm/amd/amdgpu implement tdr advanced mode

Andrey needs to review the rest, but essentially I don't see the reason why 
you need this drm_sched_resubmit_jobs2().

Christian.

Am 10.03.21 um 14:36 schrieb Zhang, Jack (Jian):

[AMD Official Use Only - Internal Distribution Only]

Thanks Christian, just did the checkpatch scan.  Please see the V6 patch

/Jack



From: Koenig, Christian 
Sent: Wednesday, March 10, 2021 8:54:53 PM
To: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org; 
Grodzovsky, Andrey ; Liu, Monk ; Deng, Emily 
Subject: Re: [PATCH v4] drm/amd/amdgpu implement tdr advanced mode

Am 10.03.21 um 12:19 schrieb Jack Zhang:
> [Why]
> Previous tdr design treats the first job in job_timeout as the bad job.
> But sometimes a later bad compute job can block a good gfx job and
> cause an unexpected gfx job timeout because gfx and compute ring share
> internal GC HW mutually.
>
> [How]
> This patch implements an advanced tdr mode. It involves an additional
> synchronous pre-resubmit step (Step0 Resubmit) before the normal resubmit
> step in order to find the real bad job.
>
> 1. At the Step0 Resubmit stage, it synchronously submits the first job
> and waits for it to be signaled. If it times out, we identify it as guilty
> and do a hw reset. After that, we do the normal resubmit step to
> resubmit the remaining jobs.
>
> 2. Re-insert Bailing job to mirror_list, and leave it to be handled by
> the main reset thread.
>
> 3. For whole gpu reset(vram lost), do resubmit as the old way.
>
> Signed-off-by: Jack Zhang <mailto:jack.zha...@amd.com>
> Change-Id: I408357f10b9034caaa1b83610e19e514c5fbaaf2
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 73 -
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  2 +-
>   drivers/gpu/drm/scheduler/sched_main.c | 75 ++
>   include/drm/gpu_scheduler.h|  2 +
>   4 files changed, 148 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e247c3a2ec08..02056355a055 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4639,7 +4639,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>int i, r = 0;
>bool need_emergency_restart = false;
>bool audio_suspended = false;
> -
> + int tmp_vram_lost_counter;

Please keep the empty line between declaration and code.

In general give that patch a pass with checkpatch.pl since there are a
couple of other places which need fixing at first glance.

Christian.


>/*
> * Special case: RAS triggered and full reset isn't supported
> */
> @@ -4689,9 +4689,14 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>dev_info(adev->dev, "Bailing on TDR for s_job:%llx, as another 
> already in progress",
>job ? job->base.id : -1);
>
> - /* even we skipped this reset, still need to set the job to 
> guilty */
> - if (job)
> + if (job) {
> + /* re-insert node to avoid memory leak */
> + spin_lock(&job->base.sched->job_list_lock);
> + list_add(&job->base.node, &job->base.sched->pending_list);
> + spin_unlock(&job->base.sched->job_list_lock);
> + /* even we skipped this reset, still need to set the job to guilty */
>drm_sched_increase_karma(&job->base);
> + }
>goto skip_recovery;
>}
>
> @@ -4788,6 +4793,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>}
>}
>

Re: [PATCH v4] drm/amd/amdgpu implement tdr advanced mode

2021-03-10 Thread Zhang, Jack (Jian)
[AMD Official Use Only - Internal Distribution Only]

Thanks Christian, just did the checkpatch scan.  Please see the V6 patch

/Jack



From: Koenig, Christian 
Sent: Wednesday, March 10, 2021 8:54:53 PM
To: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org 
; Grodzovsky, Andrey 
; Liu, Monk ; Deng, Emily 

Subject: Re: [PATCH v4] drm/amd/amdgpu implement tdr advanced mode

Am 10.03.21 um 12:19 schrieb Jack Zhang:
> [Why]
> Previous tdr design treats the first job in job_timeout as the bad job.
> But sometimes a later bad compute job can block a good gfx job and
> cause an unexpected gfx job timeout because gfx and compute ring share
> internal GC HW mutually.
>
> [How]
> This patch implements an advanced tdr mode. It involves an additional
> synchronous pre-resubmit step (Step0 Resubmit) before the normal resubmit
> step in order to find the real bad job.
>
> 1. At the Step0 Resubmit stage, it synchronously submits the first job
> and waits for it to be signaled. If it times out, we identify it as guilty
> and do a hw reset. After that, we do the normal resubmit step to
> resubmit the remaining jobs.
>
> 2. Re-insert Bailing job to mirror_list, and leave it to be handled by
> the main reset thread.
>
> 3. For whole gpu reset(vram lost), do resubmit as the old way.
>
> Signed-off-by: Jack Zhang 
> Change-Id: I408357f10b9034caaa1b83610e19e514c5fbaaf2
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 73 -
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  2 +-
>   drivers/gpu/drm/scheduler/sched_main.c | 75 ++
>   include/drm/gpu_scheduler.h|  2 +
>   4 files changed, 148 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e247c3a2ec08..02056355a055 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4639,7 +4639,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>int i, r = 0;
>bool need_emergency_restart = false;
>bool audio_suspended = false;
> -
> + int tmp_vram_lost_counter;

Please keep the empty line between declaration and code.

In general give that patch a pass with checkpatch.pl since there are a
couple of other places which need fixing at first glance.

Christian.


>/*
> * Special case: RAS triggered and full reset isn't supported
> */
> @@ -4689,9 +4689,14 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>dev_info(adev->dev, "Bailing on TDR for s_job:%llx, as another 
> already in progress",
>job ? job->base.id : -1);
>
> - /* even we skipped this reset, still need to set the job to 
> guilty */
> - if (job)
> + if (job) {
> + /* re-insert node to avoid memory leak */
> + spin_lock(&job->base.sched->job_list_lock);
> + list_add(&job->base.node, &job->base.sched->pending_list);
> + spin_unlock(&job->base.sched->job_list_lock);
> + /* even we skipped this reset, still need to set the 
> job to guilty */
>drm_sched_increase_karma(&job->base);
> + }
>goto skip_recovery;
>}
>
> @@ -4788,6 +4793,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>}
>}
>
> + tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));
>/* Actual ASIC resets if needed.*/
>/* TODO Implement XGMI hive reset logic for SRIOV */
>if (amdgpu_sriov_vf(adev)) {
> @@ -4805,6 +4811,67 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>/* Post ASIC reset for all devs .*/
>list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
>
> + if (amdgpu_gpu_recovery == 2 &&
> + !(tmp_vram_lost_counter < 
> atomic_read(&tmp_adev->vram_lost_counter))) {
> +
> + for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> + struct amdgpu_ring *ring = tmp_adev->rings[i];
> + int ret = 0;
> + struct drm_sched_job *s_job;
> +
> + if (!ring || !ring->sched.thread)
> + continue;
> +
> + /* No point to resubmit jobs if we didn't HW 
> reset*/
> + if (!tmp_adev->asic_reset_res && !job_signaled) 
> 

RE: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode

2021-03-10 Thread Zhang, Jack (Jian)
[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey and Christian,

I have uploaded the V4 patch. And involved the previous review opinions.
Would you please help to take a review of it?

The basic idea is to add an additional resubmit step before the normal resubmit.

In this resubmit step,

We submit the first job of each ring and wait for its hw fence.
If the fence is signaled, we signal the job's finished fence, delete the node and 
free the job.
If the fence times out, we do a hw reset and set the job to guilty.

After we have verified the first job of each ring, we can be sure we have 
cleared all the finished jobs and set guilty on every hanging job.

Then we can safely do the normal resubmit of the jobs.

Thanks,
Jack

-Original Message-
From: Koenig, Christian 
Sent: Wednesday, March 10, 2021 1:42 AM
To: Grodzovsky, Andrey ; Liu, Monk 
; Zhang, Jack (Jian) ; 
amd-gfx@lists.freedesktop.org; Deng, Emily 
Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode

Yeah, sorry. I've seen your reply only after I wrote this mail.

I think we can go ahead if you don't have any other concerns on this.

Christian.

Am 09.03.21 um 17:48 schrieb Andrey Grodzovsky:
> If we are talking about '[PATCH v3] drm/amd/amdgpu implement tdr
> advanced mode' which was sent yesterday, then I already went over it
> and only had 2 cosmetic comments.
>
> Andrey
>
> On 2021-03-09 6:16 a.m., Christian König wrote:
>> Yeah, that are some really good points. I completely agree that we
>> shouldn't do any larger cleanup right now.
>>
>> But I think we still need some more review on this. I most likely
>> won't have enough time to look into this before the weekend.
>>
>> Andrey can you take a look as well?
>>
>> Thanks,
>> Christian.
>>
>> Am 09.03.21 um 08:29 schrieb Liu, Monk:
>>> [AMD Official Use Only - Internal Distribution Only]
>>>
>>> Christian
>>>
>>> what is feasible and practical now is:
>>> 1) we implement the advanced TDR mode in upstream first (so we can
>>> copy the same scheme in our LTS kernel) -- if you want we can avoid
>>> changing the drm/scheduler code, but that was already rejected by
>>> you as too complicated
>>> 2) then we retire the mirror list concept and rework the
>>> drm/scheduler with KFIFO
>>> 3) remove the guilty/karma handling from scheduler
>>>
>>> So I basically agree with you on the spirit of the above changes: hide
>>> those AMD internal concepts or tricks in the vendor's driver part and
>>> make the scheduler simple and scalable. But that definitely needs a
>>> longer design discussion, so why don't we focus on our current
>>> problems now; as long as the new change doesn't regress, it is still
>>> a good change based on the current TDR implementation
>>>
>>> I would propose we only change AMD code this time. Jack's
>>> first version patch didn't touch the scheduler part, but you stated it
>>> was too complicated and rejected it
>>>
>>> So the allowable revision option is what Jack did in v2, which needs
>>> to introduce a new API in the scheduler, drm_sched_resubmit_jobs2().
>>>
>>> Hah :-( what do you think?
>>>
>>> Thanks
>>>
>>> --
>>> Monk Liu | Cloud-GPU Core team
>>> --
>>>
>>> -Original Message-
>>> From: Koenig, Christian 
>>> Sent: Monday, March 8, 2021 3:53 PM
>>> To: Liu, Monk ; Zhang, Jack (Jian)
>>> ; amd-gfx@lists.freedesktop.org; Grodzovsky,
>>> Andrey ; Deng, Emily 
>>> Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode
>>>
>>>
>>>
>>> Am 08.03.21 um 05:06 schrieb Liu, Monk:
>>>> [AMD Official Use Only - Internal Distribution Only]
>>>>
>>>>>> well first of all please completely drop the affinity group stuff
>>>>>> from this patch. We should concentrate on one feature at a time.
>>>> We need it to expedite the process, we can introduce this change in
>>>> another patch
>>>>
>>>>
>>>>>> Then the implementation is way too complicated. All you need to do
>>>>>> is insert a dma_fence_wait after re-scheduling each job after a
>>>>>> reset.
>>>> No that's not true: during "drm_sched_resubmit_jobs" it will
>>>> put all jobs in the mirror list into the hw ring, but we can only allow
>>>> the first job into the ring to catch the real guilty one (otherwise
>>>> it is possible that 

RE: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode

2021-03-08 Thread Zhang, Jack (Jian)
[AMD Official Use Only - Internal Distribution Only]

Hi, Christian,

Since this change is a bit critical to our project, we would be grateful if we 
could get your review.

Is there anything that's not clear enough that I could help to explain?

Again,
Thanks for your huge help to our problem.

Thanks,
Jack

-Original Message-
From: Koenig, Christian 
Sent: Monday, March 8, 2021 8:45 PM
To: Zhang, Jack (Jian) ; Liu, Monk ; 
amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey ; 
Deng, Emily 
Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode

Hi Jack,

yes that comes pretty close. I'm going over the patch right now.

Some things still look a bit complicated to me, but I need to wrap my head 
around how and why we are doing it this way once more.

Christian.

Am 08.03.21 um 13:43 schrieb Zhang, Jack (Jian):
> [AMD Public Use]
>
> Hi, Christian,
>
> I made some changes in the V3 patch that insert a dma_fence_wait for the first 
> jobs after resubmitting the jobs.
> It seems simpler than the V2 patch. Is this what you first had in 
> mind?
>
> Thanks,
> Jack
>
> -Original Message-
> From: Koenig, Christian 
> Sent: Monday, March 8, 2021 3:53 PM
> To: Liu, Monk ; Zhang, Jack (Jian)
> ; amd-gfx@lists.freedesktop.org; Grodzovsky,
> Andrey ; Deng, Emily 
> Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode
>
>
>
> Am 08.03.21 um 05:06 schrieb Liu, Monk:
>> [AMD Official Use Only - Internal Distribution Only]
>>
>>>> well first of all please completely drop the affinity group stuff from 
>>>> this patch. We should concentrate on one feature at a time.
>> We need it to expedite the process, we can introduce this change in
>> another patch
>>
>>
>>>> Then the implementation is way too complicated. All you need to do is insert 
>>>> a dma_fence_wait after re-scheduling each job after a reset.
>> No that's not true: during "drm_sched_resubmit_jobs" it will put
>> all jobs in the mirror list into the hw ring, but we can only allow the
>> first job into the ring to catch the real guilty one (otherwise it is possible 
>> that a later job in the ring also has a bug and affects our judgement). So we 
>> need to implement a new "drm_sched_resubmit_jobs2()", like this:
> Something like this. But since waiting for the guilty job is AMD specific we 
> should rather rework the stuff from the beginning.
>
> What I have in mind is the following:
> 1. Add a reference from the scheduler fence back to the job which is cleared 
> only when the scheduler fence finishes.
> 2. Completely drop the ring_mirror_list and replace it with a kfifo of 
> pointers to the active scheduler fences.
> 3. Replace drm_sched_resubmit_jobs with a drm_sched_for_each_active() macro 
> which allows drivers to iterate over all the active jobs and 
> resubmit/wait/mark them as guilty etc etc..
> 4. Remove the guilty/karma handling from the scheduler. This is something AMD 
> specific and shouldn't leak into common code.
>
> Regards,
> Christian.
>
>> drm_sched_resubmit_jobs2()
>> ~ 499 void drm_sched_resubmit_jobs2(struct drm_gpu_scheduler *sched, int max)
>> 500 {
>> 501 struct drm_sched_job *s_job, *tmp;
>> 502 uint64_t guilty_context;
>> 503 bool found_guilty = false;
>> 504 struct dma_fence *fence;
>> + 505 int i = 0;
>> 506
>> 507 list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, 
>> node) {
>> 508 struct drm_sched_fence *s_fence = s_job->s_fence;
>> 509
>> + 510 if (i >= max)
>> + 511 break;
>> + 512
>> 513 if (!found_guilty && atomic_read(&s_job->karma) > 
>> sched->hang_limit) {
>> 514 found_guilty = true;
>> 515 guilty_context = s_job->s_fence->scheduled.context;
>> 516 }
>> 517
>> 518 if (found_guilty && s_job->s_fence->scheduled.context == 
>> guilty_context)
>> 519 dma_fence_set_error(&s_fence->finished, -ECANCELED);
>> 520
>> 521 dma_fence_put(s_job->s_fence->parent);
>> 522 fence = sched->ops->run_job(s_job);
>> + 523 i++;
>> 524
>> 525 if (IS_ERR_OR_NULL(fence)) {
>> 526 if (IS_ERR(fence))
>> 527 dma_fence_set_error(&s_fence->finished, 
>> PTR_ERR(fence));
>> 528
>>     529 s_job->s_fence->parent = NULL;
>> 530 } else {
>> 531 s_job->s_fence->parent

RE: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode

2021-03-08 Thread Zhang, Jack (Jian)
[AMD Public Use]

Hi, Christian,

I made some change on V3 patch that insert a dma_fence_wait for the first jobs 
after resubmit jobs.
It seems simpler than the V2 patch. Is this what you first thinks of in your 
mind?

Thanks,
Jack

-Original Message-
From: Koenig, Christian  
Sent: Monday, March 8, 2021 3:53 PM
To: Liu, Monk ; Zhang, Jack (Jian) ; 
amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey ; 
Deng, Emily 
Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode



Am 08.03.21 um 05:06 schrieb Liu, Monk:
> [AMD Official Use Only - Internal Distribution Only]
>
>>> well first of all please completely drop the affinity group stuff from this 
>>> patch. We should concentrate on one feature at a time.
> We need it to expedite the process, we can introduce this change in 
> another patch
>
>
>>> Then the implementation is way too complicated. All you need to do is insert 
>>> a dma_fence_wait after re-scheduling each job after a reset.
> No that's not true: during "drm_sched_resubmit_jobs" it will put 
> all jobs in the mirror list into the hw ring, but we can only allow the 
> first job into the ring to catch the real guilty one (otherwise it is possible 
> that a later job in the ring also has a bug and affects our judgement). So we 
> need to implement a new "drm_sched_resubmit_jobs2()", like this:

Something like this. But since waiting for the guilty job is AMD specific we 
should rather rework the stuff from the beginning.

What I have in mind is the following:
1. Add a reference from the scheduler fence back to the job which is cleared 
only when the scheduler fence finishes.
2. Completely drop the ring_mirror_list and replace it with a kfifo of pointers 
to the active scheduler fences.
3. Replace drm_sched_resubmit_jobs with a drm_sched_for_each_active() macro 
which allows drivers to iterate over all the active jobs and resubmit/wait/mark 
them as guilty etc etc..
4. Remove the guilty/karma handling from the scheduler. This is something AMD 
specific and shouldn't leak into common code.

Regards,
Christian.

>
> drm_sched_resubmit_jobs2()
> ~ 499 void drm_sched_resubmit_jobs2(struct drm_gpu_scheduler *sched, int max)
>500 {
>501 struct drm_sched_job *s_job, *tmp;
>502 uint64_t guilty_context;
>503 bool found_guilty = false;
>504 struct dma_fence *fence;
> + 505 int i = 0;
>506
>507 list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, 
> node) {
>508 struct drm_sched_fence *s_fence = s_job->s_fence;
>509
> + 510 if (i >= max)
> + 511 break;
> + 512
>513 if (!found_guilty && atomic_read(&s_job->karma) > 
> sched->hang_limit) {
>514 found_guilty = true;
>515 guilty_context = s_job->s_fence->scheduled.context;
>516 }
>517
>518 if (found_guilty && s_job->s_fence->scheduled.context == 
> guilty_context)
>519 dma_fence_set_error(&s_fence->finished, -ECANCELED);
>520
>521 dma_fence_put(s_job->s_fence->parent);
>522 fence = sched->ops->run_job(s_job);
> + 523 i++;
>524
>525 if (IS_ERR_OR_NULL(fence)) {
>526 if (IS_ERR(fence))
>527 dma_fence_set_error(&s_fence->finished, 
> PTR_ERR(fence));
>528
>529 s_job->s_fence->parent = NULL;
>530 } else {
>531 s_job->s_fence->parent = fence;
>532 }
>533
>534
>535 }
>536 }
>537 EXPORT_SYMBOL(drm_sched_resubmit_jobs2);
>538
>
>
>
> Thanks
>
> --
> Monk Liu | Cloud-GPU Core team
> --
>
> -Original Message-
> From: Koenig, Christian 
> Sent: Sunday, March 7, 2021 3:03 AM
> To: Zhang, Jack (Jian) ; 
> amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey 
> ; Liu, Monk ; Deng, Emily 
> 
> Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode
>
> Hi Jack,
>
> well first of all please completely drop the affinity group stuff from this 
> patch. We should concentrate on one feature at a time.
>
> Then the implementation is way too complicated. All you need to do is insert a 
> dma_fence_wait after re-scheduling each job after a reset.
>
> Additional to that this feature is completely AMD specific and shouldn't 
> affect the common scheduler in any way.
>
> Regards,
> Christian.
>
> Am 06.03.21 um 18:25 schrieb Jack Zhang:
>> [Why]
>> Previous tdr design treats the first job in job_timeout as the bad job.

RE: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode

2021-03-07 Thread Zhang, Jack (Jian)
Hi, Christian and Monk,

Thanks so much for these valuable ideas. I will follow the advice.

>> Then the implementation is way too complicated. All you need to do is insert a 
>> dma_fence_wait after re-scheduling each job after a reset.
Sure, if I re-implement amd_sched_resubmit_jobs2() and add a dma_fence_wait 
inside it, the change will be simpler.
I will do the change and submit for review later.

Thanks a lot for your help.
/Jack

-Original Message-
From: Liu, Monk  
Sent: Monday, March 8, 2021 12:06 PM
To: Koenig, Christian ; Zhang, Jack (Jian) 
; amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey 
; Deng, Emily 
Subject: RE: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode

[AMD Official Use Only - Internal Distribution Only]

>> well first of all please completely drop the affinity group stuff from this 
>> patch. We should concentrate on one feature at a time.
We need it to expedite the process, we can introduce this change in another 
patch


>> Then the implementation is way too complicated. All you need to do is insert a 
>> dma_fence_wait after re-scheduling each job after a reset.
No that's not true: during "drm_sched_resubmit_jobs" it will put all jobs 
in the mirror list into the hw ring, but we can only allow the first job into the 
ring to catch the real guilty one (otherwise it is possible that a later job 
in the ring also has a bug and affects our judgement). So we need to implement a 
new "drm_sched_resubmit_jobs2()", like this:

drm_sched_resubmit_jobs2()
~ 499 void drm_sched_resubmit_jobs2(struct drm_gpu_scheduler *sched, int max)
  500 {
  501 struct drm_sched_job *s_job, *tmp;
  502 uint64_t guilty_context;
  503 bool found_guilty = false;
  504 struct dma_fence *fence;
+ 505 int i = 0;
  506
  507 list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
  508 struct drm_sched_fence *s_fence = s_job->s_fence;
  509
+ 510 if (i >= max)
+ 511 break;
+ 512
  513 if (!found_guilty && atomic_read(&s_job->karma) > 
sched->hang_limit) {
  514 found_guilty = true;
  515 guilty_context = s_job->s_fence->scheduled.context;
  516 }
  517
  518 if (found_guilty && s_job->s_fence->scheduled.context == 
guilty_context)
  519 dma_fence_set_error(&s_fence->finished, -ECANCELED);
  520
  521 dma_fence_put(s_job->s_fence->parent);
  522 fence = sched->ops->run_job(s_job);
+ 523 i++;
  524
  525 if (IS_ERR_OR_NULL(fence)) {
  526 if (IS_ERR(fence))
  527 dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
  528
  529 s_job->s_fence->parent = NULL;
  530 } else {
  531 s_job->s_fence->parent = fence;
  532 }
  533
  534
  535 }
  536 }
  537 EXPORT_SYMBOL(drm_sched_resubmit_jobs2);
  538



Thanks 

--
Monk Liu | Cloud-GPU Core team
------

-Original Message-
From: Koenig, Christian 
Sent: Sunday, March 7, 2021 3:03 AM
To: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org; 
Grodzovsky, Andrey ; Liu, Monk ; 
Deng, Emily 
Subject: Re: [PATCH v2] drm/amd/amdgpu implement tdr advanced mode

Hi Jack,

well first of all please completely drop the affinity group stuff from this 
patch. We should concentrate on one feature at a time.

Then the implementation is way too complicated. All you need to do is insert a 
dma_fence_wait after re-scheduling each job after a reset.

Additional to that this feature is completely AMD specific and shouldn't affect 
the common scheduler in any way.

Regards,
Christian.

Am 06.03.21 um 18:25 schrieb Jack Zhang:
> [Why]
> Previous tdr design treats the first job in job_timeout as the bad job.
> But sometimes a later bad compute job can block a good gfx job and 
> cause an unexpected gfx job timeout because gfx and compute ring share 
> internal GC HW mutually.
>
> [How]
> This patch implements an advanced tdr mode. It involves an additional 
> synchronous pre-resubmit step (Step0 Resubmit) before the normal resubmit 
> step in order to find the real bad job.
>
> 1. For a bailing TDR job, re-insert it to the mirror_list, don't set it to 
> guilty, and leave it to be handled by the main reset thread.
>
> 2. Don't set the job to guilty in pre_asic_reset, and leave it to be 
> handled by the Step0 Resubmit stage.
>
> 3. At the Step0 Resubmit stage, it first resubmits jobs asynchronously, 
> then it iterates each ring's mirror_list, synchronously waiting for each hw 
> fence to be signaled. If a job's hw fence times out, we identify 
> it as guilty and do a hw reset to recover the hw. After that, we do 
> the normal resubmit step to resubmit the remaining jobs.
>
> 4. For whole gpu re

RE: [PATCH] drm/amd/amdgpu: add fini virt data exchange to ip_suspend

2021-03-04 Thread Zhang, Jack (Jian)
[AMD Official Use Only - Internal Distribution Only]

Reviewed-by: Jack Zhang 


-Original Message-
From: Jingwen Chen 
Sent: Friday, March 5, 2021 2:12 PM
To: amd-gfx@lists.freedesktop.org
Cc: Chen, Horace ; Zhang, Jack (Jian) 
; Chen, JingWen 
Subject: [PATCH] drm/amd/amdgpu: add fini virt data exchange to ip_suspend

[Why]
when trying to shut down a guest vm in sriov mode, virt data exchange is not 
finalized. After vram is lost, trying to write vram could hang the cpu.

[How]
add fini of virt data exchange in ip_suspend

Signed-off-by: Jingwen Chen 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index a11760ec3924..bec725b50c1b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2774,8 +2774,10 @@ int amdgpu_device_ip_suspend(struct amdgpu_device *adev) 
 {
 int r;

-if (amdgpu_sriov_vf(adev))
+if (amdgpu_sriov_vf(adev)) {
+amdgpu_virt_fini_data_exchange(adev);
 amdgpu_virt_request_full_gpu(adev, false);
+}

 r = amdgpu_device_ip_suspend_phase1(adev);
 if (r)
--
2.25.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] drm/amd/amdgpu: fini data exchange when req_gpu_fini in SRIOV

2021-03-04 Thread Zhang, Jack (Jian)
[AMD Official Use Only - Internal Distribution Only]

Please comment on what the issue is without this change.

/Jack
-Original Message-
From: Jingwen Chen 
Sent: Friday, March 5, 2021 12:44 PM
To: amd-gfx@lists.freedesktop.org
Cc: Chen, Horace ; Zhang, Jack (Jian) 
; Chen, JingWen 
Subject: [PATCH] drm/amd/amdgpu: fini data exchange when req_gpu_fini in SRIOV

Do fini data exchange every time req_gpu_fini is called in SRIOV

Signed-off-by: Jingwen Chen 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +---
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   | 3 +++
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index a11760ec3924..e3ed52f66414 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3601,10 +3601,8 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
 /* make sure IB test finished before entering exclusive mode
  * to avoid preemption on IB test
  * */
-if (amdgpu_sriov_vf(adev)) {
+if (amdgpu_sriov_vf(adev))
 amdgpu_virt_request_full_gpu(adev, false);
-amdgpu_virt_fini_data_exchange(adev);
-}

 /* disable all interrupts */
 amdgpu_irq_disable_all(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 3dd7eec52344..af1e5d8fc2b8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -114,6 +114,9 @@ int amdgpu_virt_request_full_gpu(struct amdgpu_device 
*adev, bool init)
struct amdgpu_virt *virt = &adev->virt;
 int r;

+if (!init)
+amdgpu_virt_fini_data_exchange(adev);
+
 if (virt->ops && virt->ops->req_full_gpu) {
 r = virt->ops->req_full_gpu(adev, init);
 if (r)
--
2.25.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] amdgpu/sriov Stop data exchange for wholegpu reset

2021-01-11 Thread Zhang, Jack (Jian)
[AMD Official Use Only - Internal Distribution Only]

Ping...

-Original Message-
From: Zhang, Jack (Jian)
Sent: Friday, January 8, 2021 11:07 AM
To: amd-gfx@lists.freedesktop.org; Liu, Monk ; Chen, JingWen 
; Deucher, Alexander ; Deng, 
Emily 
Subject: RE: [PATCH] amdgpu/sriov Stop data exchange for wholegpu reset

Ping

-Original Message-
From: Jack Zhang 
Sent: Thursday, January 7, 2021 6:47 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian) ; Zhang, Jack (Jian) 
; Chen, JingWen 
Subject: [PATCH] amdgpu/sriov Stop data exchange for wholegpu reset

[Why]
When the host triggers a whole gpu reset, the guest keeps waiting till the host 
finishes the reset. But there's a work queue in the guest exchanging data for the vf 
which needs to access the frame buffer. During a whole gpu reset, the frame 
buffer is not accessible, and this causes the call trace.

[How]
After the vf gets the reset notification from the pf, stop the data exchange.

Signed-off-by: Jingwen Chen 
Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 1 +
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c| 1 +
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c| 1 +
 3 files changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 83ca5cbffe2c..3e212862cf5d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -571,6 +571,7 @@ void amdgpu_virt_fini_data_exchange(struct amdgpu_device 
*adev)
 DRM_INFO("clean up the vf2pf work item\n");
flush_delayed_work(&adev->virt.vf2pf_work);
cancel_delayed_work_sync(&adev->virt.vf2pf_work);
+adev->virt.vf2pf_update_interval_ms = 0;
 }
 }

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index 7767ccca526b..3ee481557fc9 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -255,6 +255,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct 
*work)
if (!down_read_trylock(&adev->reset_sem))
return;

+amdgpu_virt_fini_data_exchange(adev);
atomic_set(&adev->in_gpu_reset, 1);

 do {
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index dd5c1e6ce009..48e588d3c409 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -276,6 +276,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct 
*work)
if (!down_read_trylock(&adev->reset_sem))
return;

+amdgpu_virt_fini_data_exchange(adev);
atomic_set(&adev->in_gpu_reset, 1);

 do {
--
2.25.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] amdgpu/sriov Stop data exchange for wholegpu reset

2021-01-07 Thread Zhang, Jack (Jian)
Ping

-Original Message-
From: Jack Zhang  
Sent: Thursday, January 7, 2021 6:47 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian) ; Zhang, Jack (Jian) 
; Chen, JingWen 
Subject: [PATCH] amdgpu/sriov Stop data exchange for wholegpu reset

[Why]
When the host triggers a whole-GPU reset, the guest keeps waiting until the
host finishes the reset. But there is a work queue in the guest that exchanges
data between the VF and PF and needs to access the frame buffer. During a
whole-GPU reset the frame buffer is not accessible, and this causes the call
trace.

[How]
After the VF gets the reset notification from the PF, stop the data exchange.

Signed-off-by: Jingwen Chen 
Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 1 +
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c| 1 +
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c| 1 +
 3 files changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 83ca5cbffe2c..3e212862cf5d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -571,6 +571,7 @@ void amdgpu_virt_fini_data_exchange(struct amdgpu_device *adev)
DRM_INFO("clean up the vf2pf work item\n");
	flush_delayed_work(&adev->virt.vf2pf_work);
	cancel_delayed_work_sync(&adev->virt.vf2pf_work);
+   adev->virt.vf2pf_update_interval_ms = 0;
}
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index 7767ccca526b..3ee481557fc9 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -255,6 +255,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
	if (!down_read_trylock(&adev->reset_sem))
		return;

+	amdgpu_virt_fini_data_exchange(adev);
	atomic_set(&adev->in_gpu_reset, 1);
 
do {
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index dd5c1e6ce009..48e588d3c409 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -276,6 +276,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
	if (!down_read_trylock(&adev->reset_sem))
		return;

+	amdgpu_virt_fini_data_exchange(adev);
	atomic_set(&adev->in_gpu_reset, 1);
 
do {
--
2.25.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 3/5] drm/amd/sriov add mmsch_v3 interface

2020-07-14 Thread Zhang, Jack (Jian)
Hi, Dennis,

I gave some feedback in the comments.
Thank you for your review.

Best Regards,
Jack Zhang

-Original Message-
From: Li, Dennis 
Sent: Tuesday, July 14, 2020 12:35 PM
To: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian) ; Liu, Leo ; 
Zhang, Hawking 
Subject: RE: [PATCH 3/5] drm/amd/sriov add mmsch_v3 interface

[AMD Official Use Only - Internal Distribution Only]

Hi, Jack,
  Please see the following comments. 

Best Regards
Dennis Li
-Original Message-
From: amd-gfx  On Behalf Of Jack Zhang
Sent: Tuesday, July 14, 2020 10:47 AM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian) ; Liu, Leo ; 
Zhang, Hawking 
Subject: [PATCH 3/5] drm/amd/sriov add mmsch_v3 interface

For VCN 3.0 SRIOV, the guest driver needs to communicate with the MMSCH to set
up the World Switch for MM appropriately. This patch adds the interface for
MMSCH v3.0.

Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/amdgpu/mmsch_v3_0.h | 130 
 1 file changed, 130 insertions(+)
 create mode 100644 drivers/gpu/drm/amd/amdgpu/mmsch_v3_0.h

diff --git a/drivers/gpu/drm/amd/amdgpu/mmsch_v3_0.h 
b/drivers/gpu/drm/amd/amdgpu/mmsch_v3_0.h
new file mode 100644
index ..3e4e858a6965
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/mmsch_v3_0.h
@@ -0,0 +1,130 @@
+/*
+ * Copyright 2020 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person 
+obtaining a
+ * copy of this software and associated documentation files (the 
+"Software"),
+ * to deal in the Software without restriction, including without 
+limitation
+ * the rights to use, copy, modify, merge, publish, distribute, 
+sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom 
+the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be 
+included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 
+EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 
+MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT 
+SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, 
+DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR 
+OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE 
+OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ */
+
+#ifndef __MMSCH_V3_0_H__
+#define __MMSCH_V3_0_H__
+
+#include "amdgpu_vcn.h"
+
+#define MMSCH_VERSION_MAJOR3
+#define MMSCH_VERSION_MINOR0
+#define MMSCH_VERSION  (MMSCH_VERSION_MAJOR << 16 | MMSCH_VERSION_MINOR)
+
+enum mmsch_v3_0_command_type {
+   MMSCH_COMMAND__DIRECT_REG_WRITE = 0,
+   MMSCH_COMMAND__DIRECT_REG_POLLING = 2,
+   MMSCH_COMMAND__DIRECT_REG_READ_MODIFY_WRITE = 3,
+   MMSCH_COMMAND__INDIRECT_REG_WRITE = 8,
+   MMSCH_COMMAND__END = 0xf
+};
+
+struct mmsch_v3_0_table_info {
+   uint32_t init_status;
+   uint32_t table_offset;
+   uint32_t table_size;
+};
+
+struct mmsch_v3_0_init_header {
+   uint32_t version;
+   uint32_t total_size;
+	struct mmsch_v3_0_table_info inst[AMDGPU_MAX_VCN_INSTANCES];
+};

[Dennis] You have defined total_size; why is the inst array sized
AMDGPU_MAX_VCN_INSTANCES? That will cause memory waste.
[Jack] In our case, AMDGPU_MAX_VCN_INSTANCES is a fixed number, which equals 2,
and struct mmsch_v3_0_table_info occupies 3 dwords, so there is not much
memory waste.

+struct mmsch_v3_0_cmd_direct_reg_header {
+   uint32_t reg_offset   : 28;
+   uint32_t command_type : 4;
+};
+
+struct mmsch_v3_0_cmd_indirect_reg_header {
+   uint32_t reg_offset: 20;
+   uint32_t reg_idx_space : 8;
+   uint32_t command_type  : 4;
+};
+
+struct mmsch_v3_0_cmd_direct_write {
+   struct mmsch_v3_0_cmd_direct_reg_header cmd_header;
+   uint32_t reg_value;
+};
+
+struct mmsch_v3_0_cmd_direct_read_modify_write {
+   struct mmsch_v3_0_cmd_direct_reg_header cmd_header;
+   uint32_t write_data;
+   uint32_t mask_value;
+};
+
+struct mmsch_v3_0_cmd_direct_polling {
+   struct mmsch_v3_0_cmd_direct_reg_header cmd_header;
+   uint32_t mask_value;
+   uint32_t wait_value;
+};
+
+struct mmsch_v3_0_cmd_end {
+	struct mmsch_v3_0_cmd_direct_reg_header cmd_header;
+};
+
+struct mmsch_v3_0_cmd_indirect_write {
+   struct mmsch_v3_0_cmd_indirect_reg_header cmd_header;
+   uint32_t reg_value;
+};
+
+#define MMSCH_V3_0_INSERT_DIRECT_RD_MOD_WT(reg, mask, data) { \
+   size = sizeof(struct mmsch_v3_0_cmd_direct_read_modify_write); \
+   size_dw = size / 4; \
+   direct_rd_mod_wt.cmd_header.reg_offset = reg; \
+   direct_rd_mod_wt.mask_value = mask; \
+   direct_rd_mod_wt.write_data = data; \
+	memcpy((void *)table_loc, &direct_rd_mod_wt, size); \
+	table_loc += size_dw; \
+}

Re: [PATCH] drm/amdgpu fix incorrect sysfs remove behavior for xgmi

2020-05-19 Thread Zhang, Jack (Jian)
[AMD Official Use Only - Internal Distribution Only]

ping


From: Jack Zhang 
Sent: Monday, May 18, 2020 5:00:53 PM
To: amd-gfx@lists.freedesktop.org 
Cc: Zhang, Jack (Jian) 
Subject: [PATCH] drm/amdgpu fix incorrect sysfs remove behavior for xgmi

Under an xgmi setup, some sysfs nodes fail to be created the second time
the kmd driver is loaded. This is because the sysfs nodes were not removed
appropriately during the last unload.

Changes of this patch:
1. remove the sysfs file for dev_attr_xgmi_error
2. remove the sysfs link in adev->dev->kobj by target name.
   It only needs to be removed once per xgmi setup.
3. remove the sysfs link in hive->kobj by target name

In amdgpu_xgmi_remove_device:
1. amdgpu_xgmi_sysfs_rem_dev_info needs to run for every device
2. amdgpu_xgmi_sysfs_destroy needs to run on the last node only

v2: initialize array with memset

Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 24 +---
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
index e9e59bc..3b46ea8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
@@ -325,9 +325,19 @@ static int amdgpu_xgmi_sysfs_add_dev_info(struct amdgpu_device *adev,
 static void amdgpu_xgmi_sysfs_rem_dev_info(struct amdgpu_device *adev,
 					  struct amdgpu_hive_info *hive)
 static void amdgpu_xgmi_sysfs_rem_dev_info(struct amdgpu_device *adev,
   struct amdgpu_hive_info *hive)
 {
+   char node[10];
+   memset(node, 0, sizeof(node));
+
	device_remove_file(adev->dev, &dev_attr_xgmi_device_id);
-	sysfs_remove_link(&adev->dev->kobj, adev->ddev->unique);
-	sysfs_remove_link(hive->kobj, adev->ddev->unique);
+	device_remove_file(adev->dev, &dev_attr_xgmi_error);
+
+	if (adev != hive->adev) {
+		sysfs_remove_link(&adev->dev->kobj,"xgmi_hive_info");
+   }
+
+   sprintf(node, "node%d", hive->number_devices);
+   sysfs_remove_link(hive->kobj, node);
+
 }


@@ -583,14 +593,14 @@ int amdgpu_xgmi_remove_device(struct amdgpu_device *adev)
 if (!hive)
 return -EINVAL;

-   if (!(hive->number_devices--)) {
+	task_barrier_rem_task(&hive->tb);
+	amdgpu_xgmi_sysfs_rem_dev_info(adev, hive);
+	mutex_unlock(&hive->hive_lock);
+
+	if(!(--hive->number_devices)){
		amdgpu_xgmi_sysfs_destroy(adev, hive);
		mutex_destroy(&hive->hive_lock);
		mutex_destroy(&hive->reset_lock);
-	} else {
-		task_barrier_rem_task(&hive->tb);
-		amdgpu_xgmi_sysfs_rem_dev_info(adev, hive);
-		mutex_unlock(&hive->hive_lock);
	}

	return psp_xgmi_terminate(&adev->psp);
--
2.7.4



RE: [PATCH] drm/amdgpu fix incorrect sysfs remove behavior for xgmi

2020-05-17 Thread Zhang, Jack (Jian)



-Original Message-
From: Jack Zhang  
Sent: Monday, May 18, 2020 12:45 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian) 
Subject: [PATCH] drm/amdgpu fix incorrect sysfs remove behavior for xgmi

Under an xgmi setup, some sysfs nodes fail to be created the second time
the kmd driver is loaded. This is because the sysfs nodes were not removed
appropriately during the last unload.

Changes of this patch:
1. remove the sysfs file for dev_attr_xgmi_error
2. remove the sysfs link in adev->dev->kobj by target name.
   It only needs to be removed once per xgmi setup.
3. remove the sysfs link in hive->kobj by target name

In amdgpu_xgmi_remove_device:
1. amdgpu_xgmi_sysfs_rem_dev_info needs to run for every device
2. amdgpu_xgmi_sysfs_destroy needs to run on the last node only

Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 22 +++---
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
index e9e59bc..bfe2468 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
@@ -325,9 +325,17 @@ static int amdgpu_xgmi_sysfs_add_dev_info(struct amdgpu_device *adev,
 static void amdgpu_xgmi_sysfs_rem_dev_info(struct amdgpu_device *adev,
 					  struct amdgpu_hive_info *hive)
 {
+   char node[10] = { 0 };
	device_remove_file(adev->dev, &dev_attr_xgmi_device_id);
-	sysfs_remove_link(&adev->dev->kobj, adev->ddev->unique);
-	sysfs_remove_link(hive->kobj, adev->ddev->unique);
+	device_remove_file(adev->dev, &dev_attr_xgmi_error);
+
+	if (adev != hive->adev) {
+		sysfs_remove_link(&adev->dev->kobj,"xgmi_hive_info");
+   }
+
+   sprintf(node, "node%d", hive->number_devices);
+   sysfs_remove_link(hive->kobj, node);
+
 }
 
 
@@ -583,14 +591,14 @@ int amdgpu_xgmi_remove_device(struct amdgpu_device *adev)
if (!hive)
return -EINVAL;
 
-   if (!(hive->number_devices--)) {
+	task_barrier_rem_task(&hive->tb);
+	amdgpu_xgmi_sysfs_rem_dev_info(adev, hive);
+	mutex_unlock(&hive->hive_lock);
+
+	if(!(--hive->number_devices)){
		amdgpu_xgmi_sysfs_destroy(adev, hive);
		mutex_destroy(&hive->hive_lock);
		mutex_destroy(&hive->reset_lock);
-	} else {
-		task_barrier_rem_task(&hive->tb);
-		amdgpu_xgmi_sysfs_rem_dev_info(adev, hive);
-		mutex_unlock(&hive->hive_lock);
	}

	return psp_xgmi_terminate(&adev->psp);
--
2.7.4



RE: [PATCH] drm/amdgpu/sriov add amdgpu_amdkfd_pre_reset in gpu reset

2020-04-02 Thread Zhang, Jack (Jian)
Thanks Monk,

I just updated the patch and it passed 1000 rounds of TDR testing.

I have sent out a review email.

Regards,
Jack
-Original Message-
From: Liu, Monk  
Sent: Friday, April 3, 2020 11:38 AM
To: Kuehling, Felix ; Zhang, Jack (Jian) 
; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH] drm/amdgpu/sriov add amdgpu_amdkfd_pre_reset in gpu reset

Thanks Felix

Hi Jack

I think the changes below can resolve your problem; we already have this on our
customer branch. It fixes the memory leak and also fixes my previous bug.
Can you apply this change to gfx_v10/v9? Thanks!

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c
index 29749502..532258445 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c
@@ -543,6 +543,8 @@ static int kgd_hqd_destroy(struct kgd_dev *kgd, void *mqd,
uint32_t temp;
struct v10_compute_mqd *m = get_mqd(mqd);

+   if (amdgpu_sriov_vf(adev) && adev->in_gpu_reset)
+   return 0;
 #if 0
unsigned long flags;
int retry;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 35b32ad..f6479e1 100755
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3653,6 +3653,8 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
if (r)
return r;

+   amdgpu_amdkfd_pre_reset(adev);
+
/* Resume IP prior to SMC */
r = amdgpu_device_ip_reinit_early_sriov(adev);
if (r)

_
Monk Liu|GPU Virtualization Team |AMD


-Original Message-
From: Kuehling, Felix 
Sent: Friday, April 3, 2020 1:26 AM
To: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org; 
Liu, Monk 
Subject: Re: [PATCH] drm/amdgpu/sriov add amdgpu_amdkfd_pre_reset in gpu reset

[+Monk]

This looks reasonable to me. However, you're effectively reverting this commit 
by Monk:

a03eb637d2a5 drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

In hindsight, Monk's commit was broken. Removing the call to pre_reset has
other consequences, such as breaking notifications about reset to user mode, 
and probably invalidating some assumptions in kfd_post_reset.
Can you coordinate with Monk to work out why his change was needed, and whether 
you'll need a different solution for the problem he was trying to address?

In the meanwhile, this patch is

Acked-by: Felix Kuehling 


Am 2020-04-02 um 3:20 a.m. schrieb Jack Zhang:

> kfd_pre_reset will free mem_objs allocated by kfd_gtt_sa_allocate
>
> Without this change, the sriov tdr code path will never free those
> allocated memories, causing a memory leak.
>
> Signed-off-by: Jack Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 8faaa17..832daf7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3847,6 +3847,8 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>  {
>   int r;
>  
> + amdgpu_amdkfd_pre_reset(adev);
> +
>   if (from_hypervisor)
>   r = amdgpu_virt_request_full_gpu(adev, true);
>   else


RE: [PATCH] drm/amdgpu/sriov add amdgpu_amdkfd_pre_reset in gpu reset

2020-04-02 Thread Zhang, Jack (Jian)


-Original Message-
From: Jack Zhang  
Sent: Thursday, April 2, 2020 3:20 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian) 
Subject: [PATCH] drm/amdgpu/sriov add amdgpu_amdkfd_pre_reset in gpu reset

kfd_pre_reset will free mem_objs allocated by kfd_gtt_sa_allocate

Without this change, the sriov tdr code path will never free those allocated
memories, causing a memory leak.

Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 8faaa17..832daf7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3847,6 +3847,8 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
 {
int r;
 
+   amdgpu_amdkfd_pre_reset(adev);
+
if (from_hypervisor)
r = amdgpu_virt_request_full_gpu(adev, true);
else
--
2.7.4



RE: [PATCH] drm/amdgpu/sriov refine vcn_v2_5_early_init func

2020-03-10 Thread Zhang, Jack (Jian)
Ping...

-Original Message-
From: amd-gfx  On Behalf Of Jack Zhang
Sent: Tuesday, March 10, 2020 8:49 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian) ; Zhang, Jack (Jian) 

Subject: [PATCH] drm/amdgpu/sriov refine vcn_v2_5_early_init func

Refine the assignment of vcn.num_vcn_inst, vcn.harvest_config and
vcn.num_enc_rings for the VF case.

Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c | 35 ++-
 1 file changed, 18 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c 
b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
index 2d64ba1..9b22e2b 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
@@ -74,29 +74,30 @@ static int amdgpu_ih_clientid_vcns[] = {
 static int vcn_v2_5_early_init(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
-   if (adev->asic_type == CHIP_ARCTURUS) {
-   u32 harvest;
-   int i;
-
-   adev->vcn.num_vcn_inst = VCN25_MAX_HW_INSTANCES_ARCTURUS;
-   for (i = 0; i < adev->vcn.num_vcn_inst; i++) {
-   harvest = RREG32_SOC15(UVD, i, mmCC_UVD_HARVESTING);
-   if (harvest & CC_UVD_HARVESTING__UVD_DISABLE_MASK)
-   adev->vcn.harvest_config |= 1 << i;
-   }
-
-   if (adev->vcn.harvest_config == (AMDGPU_VCN_HARVEST_VCN0 |
-AMDGPU_VCN_HARVEST_VCN1))
-   /* both instances are harvested, disable the block */
-   return -ENOENT;
-   } else
-   adev->vcn.num_vcn_inst = 1;
 
if (amdgpu_sriov_vf(adev)) {
adev->vcn.num_vcn_inst = 2;
adev->vcn.harvest_config = 0;
adev->vcn.num_enc_rings = 1;
} else {
+   if (adev->asic_type == CHIP_ARCTURUS) {
+   u32 harvest;
+   int i;
+
+		adev->vcn.num_vcn_inst = VCN25_MAX_HW_INSTANCES_ARCTURUS;
+		for (i = 0; i < adev->vcn.num_vcn_inst; i++) {
+			harvest = RREG32_SOC15(UVD, i, mmCC_UVD_HARVESTING);
+			if (harvest & CC_UVD_HARVESTING__UVD_DISABLE_MASK)
+				adev->vcn.harvest_config |= 1 << i;
+		}
+
+		if (adev->vcn.harvest_config == (AMDGPU_VCN_HARVEST_VCN0 |
+						 AMDGPU_VCN_HARVEST_VCN1))
+			/* both instances are harvested, disable the block */
+			return -ENOENT;
+   } else
+   adev->vcn.num_vcn_inst = 1;
+
adev->vcn.num_enc_rings = 2;
}
 
--
2.7.4



RE: [PATCH] drm/amdgpu/sriov Don't send msg when smu suspend

2020-02-05 Thread Zhang, Jack (Jian)
Hi, Team,

Would you please help to take a look at this patch?

Regards,
Jack

-Original Message-
From: amd-gfx  On Behalf Of Jack Zhang
Sent: Wednesday, February 5, 2020 5:18 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian) 
Subject: [PATCH] drm/amdgpu/sriov Don't send msg when smu suspend

For sriov with pp_onevf_mode, do not send messages to set the smu status,
because the smu doesn't support these messages under a VF.

Besides, smu_suspend should be skipped when pp_onevf_mode is disabled.

Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 ---
 drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 21 +
 2 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 4ff7ce3..2d1f8d4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2353,15 +2353,16 @@ static int amdgpu_device_ip_suspend_phase2(struct amdgpu_device *adev)
}
adev->ip_blocks[i].status.hw = false;
/* handle putting the SMC in the appropriate state */
-   if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_SMC) {
-   r = amdgpu_dpm_set_mp1_state(adev, adev->mp1_state);
-   if (r) {
-			DRM_ERROR("SMC failed to set mp1 state %d, %d\n",
-				  adev->mp1_state, r);
-   return r;
+		if(!amdgpu_sriov_vf(adev)){
+			if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_SMC) {
+				r = amdgpu_dpm_set_mp1_state(adev, adev->mp1_state);
+				if (r) {
+					DRM_ERROR("SMC failed to set mp1 state %d, %d\n",
+						  adev->mp1_state, r);
+					return r;
+				}
}
}
-
adev->ip_blocks[i].status.hw = false;
}
 
diff --git a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c 
b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
index 99ad4dd..a6d7b5f 100644
--- a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
@@ -1461,21 +1461,26 @@ static int smu_suspend(void *handle)
struct smu_context *smu = >smu;
bool baco_feature_is_enabled = false;
 
+   if (amdgpu_sriov_vf(adev)&& !amdgpu_sriov_is_pp_one_vf(adev))
+   return 0;
+
if (!smu->pm_enabled)
return 0;
 
if(!smu->is_apu)
		baco_feature_is_enabled = smu_feature_is_enabled(smu, SMU_FEATURE_BACO_BIT);
 
-   ret = smu_system_features_control(smu, false);
-   if (ret)
-   return ret;
-
-   if (baco_feature_is_enabled) {
-   ret = smu_feature_set_enabled(smu, SMU_FEATURE_BACO_BIT, true);
-   if (ret) {
-			pr_warn("set BACO feature enabled failed, return %d\n", ret);
+   if(!amdgpu_sriov_vf(adev)) {
+   ret = smu_system_features_control(smu, false);
+   if (ret)
return ret;
+
+   if (baco_feature_is_enabled) {
+			ret = smu_feature_set_enabled(smu, SMU_FEATURE_BACO_BIT, true);
+			if (ret) {
+				pr_warn("set BACO feature enabled failed, return %d\n", ret);
+   return ret;
+   }
}
}
 
--
2.7.4



RE: [PATCH 1/2] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF

2019-12-27 Thread Zhang, Jack (Jian)
Hi, Emily,

Thank you for your review.
Comments are inline; the updated patch is attached.

-Original Message-
From: Deng, Emily  
Sent: Friday, December 27, 2019 3:27 PM
To: Zhang, Jack (Jian) ; Feng, Kenneth 
; Deucher, Alexander ; Quan, 
Evan ; Wang, Kevin(Yang) ; Tao, Yintian 
; Min, Frank ; Liu, Monk 
; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH 1/2] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF

[AMD Official Use Only - Internal Distribution Only]



>-Original Message-
>From: Zhang, Jack (Jian) 
>Sent: Friday, December 27, 2019 3:00 PM
>To: Feng, Kenneth ; Deucher, Alexander 
>; Quan, Evan ; Wang,
>Kevin(Yang) ; Tao, Yintian ; 
>Deng, Emily ; Min, Frank ; Liu, 
>Monk ; amd-gfx@lists.freedesktop.org; Zhang, Jack 
>(Jian) 
>Subject: RE: [PATCH 1/2] amd/amdgpu/sriov enable onevf mode for 
>ARCTURUS VF
>
>
>
>-Original Message-
>From: Jack Zhang 
>Sent: Friday, December 27, 2019 2:57 PM
>To: amd-gfx@lists.freedesktop.org
>Cc: Zhang, Jack (Jian) 
>Subject: [PATCH 1/2] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF
>
>Before, initialization of smu ip block would be skipped for sriov 
>ASICs. But if there's only one VF being used, guest driver should be 
>able to dump some HW info such as clks, temperature,etc.
>
>To solve this, now after onevf mode is enabled, host driver will notify 
>guest. If it's onevf mode, guest will do smu hw_init and skip some 
>steps in normal smu hw_init flow because host driver has already done it for 
>smu.
>
>With this fix, guest app can talk with smu and dump hw information from smu.
>
>v2: refine the logic for pm_enabled.Skip hw_init by not changing pm_enabled.
>
>Signed-off-by: Jack Zhang 
>---
> drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c|  3 +-
> drivers/gpu/drm/amd/amdgpu/soc15.c |  3 +-
> drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 45 
>+--
>---
> 3 files changed, 29 insertions(+), 22 deletions(-)
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
>b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
>index 8469834..08130a6 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
>@@ -1448,7 +1448,8 @@ static int psp_np_fw_load(struct psp_context *psp)
> || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_G
>   || ucode->ucode_id ==
>AMDGPU_UCODE_ID_RLC_RESTORE_LIST_CNTL
>   || ucode->ucode_id ==
>AMDGPU_UCODE_ID_RLC_RESTORE_LIST_GPM_MEM
>-  || ucode->ucode_id ==
>AMDGPU_UCODE_ID_RLC_RESTORE_LIST_SRM_MEM))
>+  || ucode->ucode_id ==
>AMDGPU_UCODE_ID_RLC_RESTORE_LIST_SRM_MEM
>+  || ucode->ucode_id == AMDGPU_UCODE_ID_SMC))
>   /*skip ucode loading in SRIOV VF */
>   continue;
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c
>b/drivers/gpu/drm/amd/amdgpu/soc15.c
>index b53d401..a271496 100644
>--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
>+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
>@@ -827,8 +827,7 @@ int soc15_set_ip_blocks(struct amdgpu_device *adev)
>   amdgpu_device_ip_block_add(adev,
>_virtual_ip_block);
>   amdgpu_device_ip_block_add(adev, _v9_0_ip_block);
>   amdgpu_device_ip_block_add(adev, _v4_0_ip_block);
>-  if (!amdgpu_sriov_vf(adev))
>-  amdgpu_device_ip_block_add(adev,
>_v11_0_ip_block);
>+  amdgpu_device_ip_block_add(adev, _v11_0_ip_block);
>
>   if (amdgpu_sriov_vf(adev)) {
>   if (likely(adev->firmware.load_type ==
>AMDGPU_FW_LOAD_PSP)) diff --git
>a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
>b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
>index 936c682..42c0a6d 100644
>--- a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
>+++ b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
>@@ -531,10 +531,14 @@ bool is_support_sw_smu(struct amdgpu_device
>*adev)
>   if (adev->asic_type == CHIP_VEGA20)
>   return (amdgpu_dpm == 2) ? true : false;
>   else if (adev->asic_type >= CHIP_ARCTURUS) {
>-  if (amdgpu_sriov_vf(adev))
>-  return false;
>-  else
>+  if (amdgpu_sriov_vf(adev)) {
>+  if(amdgpu_sriov_is_pp_one_vf(adev))
>+  return true;
>+  else
>+  return false;
>+  } else {
>   return true;
>+  }
>   } else
>   return false;
> }
>@@ -1062,20 +1066,19 @@ static int smu_smc_table_hw_init(struct 
>smu_context *smu,
>   }
>
>   /* smu_dump_pptable(smu); *

RE: [PATCH 1/2] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF

2019-12-26 Thread Zhang, Jack (Jian)
Hi, all,

Here is some description of sriov and pp_one_vf mode:
1)  amdgpu_sriov_vf marks whether it is sriov or bare-metal, while
amdgpu_sriov_is_pp_one_vf indicates a mode of sriov: it means there is only
one VF generated by the host driver.
2)  The host administrator determines the "vf number" when the host driver is
loaded. If vf_num = 1, the host driver will notify the guest driver that it is
under one-vf mode --- pp_one_vf returns true.
Otherwise, pp_one_vf returns false. Without unloading the guest driver and
the host driver, vf_num cannot be changed, so it is a static setting.
3)  Under pp_one_vf mode on an sriov platform, the guest driver will do smu
hw_init; the goal is to let the guest driver "talk" with the smu by sending
authorized smu messages. This helps user-mode apps dump info like clks,
temperature, GPU usage and so on.
Currently we don't support the guest driver writing values to the smu; we
can only read smu information.
Besides, as the host driver has already initialized the smu hw, such as
loading the smu firmware and setting some feature-control modes, some hw init
steps need to be skipped in the guest driver, such as writing the pptable and
loading the smc firmware.
4)  At the same time, pp_one_vf mode needs some smu firmware changes to open
permission for certain messages in the VF.


B.R.
Jack
-Original Message-
From: Zhang, Jack (Jian) 
Sent: Friday, December 27, 2019 3:00 PM
To: Feng, Kenneth ; Deucher, Alexander 
; Quan, Evan ; Wang, Kevin(Yang) 
; Tao, Yintian ; Deng, Emily 
; Min, Frank ; Liu, Monk 
; amd-gfx@lists.freedesktop.org; Zhang, Jack (Jian) 

Subject: RE: [PATCH 1/2] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF



-Original Message-
From: Jack Zhang  
Sent: Friday, December 27, 2019 2:57 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian) 
Subject: [PATCH 1/2] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF

Previously, initialization of the smu ip block was skipped for sriov ASICs. But
if there's only one VF being used, the guest driver should be able to dump some
HW info such as clks, temperature, etc.

To solve this, after onevf mode is enabled, the host driver will notify the
guest. If it's onevf mode, the guest will do smu hw_init and skip some steps in
the normal smu hw_init flow, because the host driver has already done them for
the smu.

With this fix, a guest app can talk with the smu and dump HW information from it.

v2: refine the logic for pm_enabled. Skip hw_init by not changing pm_enabled.

Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c|  3 +-
 drivers/gpu/drm/amd/amdgpu/soc15.c |  3 +-
 drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 45 +-
 3 files changed, 29 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 8469834..08130a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -1448,7 +1448,8 @@ static int psp_np_fw_load(struct psp_context *psp)
 || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_G
|| ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_CNTL
|| ucode->ucode_id == 
AMDGPU_UCODE_ID_RLC_RESTORE_LIST_GPM_MEM
-   || ucode->ucode_id == 
AMDGPU_UCODE_ID_RLC_RESTORE_LIST_SRM_MEM))
+   || ucode->ucode_id == 
AMDGPU_UCODE_ID_RLC_RESTORE_LIST_SRM_MEM
+   || ucode->ucode_id == AMDGPU_UCODE_ID_SMC))
/*skip ucode loading in SRIOV VF */
continue;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c 
b/drivers/gpu/drm/amd/amdgpu/soc15.c
index b53d401..a271496 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -827,8 +827,7 @@ int soc15_set_ip_blocks(struct amdgpu_device *adev)
amdgpu_device_ip_block_add(adev, _virtual_ip_block);
amdgpu_device_ip_block_add(adev, _v9_0_ip_block);
amdgpu_device_ip_block_add(adev, _v4_0_ip_block);
-   if (!amdgpu_sriov_vf(adev))
-   amdgpu_device_ip_block_add(adev, _v11_0_ip_block);
+   amdgpu_device_ip_block_add(adev, _v11_0_ip_block);
 
if (amdgpu_sriov_vf(adev)) {
if (likely(adev->firmware.load_type == 
AMDGPU_FW_LOAD_PSP)) diff --git a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c 
b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
index 936c682..42c0a6d 100644
--- a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
@@ -531,10 +531,14 @@ bool is_support_sw_smu(struct amdgpu_device *adev)
if (adev->asic_type == CHIP_VEGA20)
return (amdgpu_dpm == 2) ? true : false;
else if (adev->asic_type >= CHIP_ARCTURUS) {
-   if (amdgpu_sriov_vf(adev))
-   return false;
-   else
+ 

RE: [PATCH 1/2] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF

2019-12-26 Thread Zhang, Jack (Jian)



-Original Message-
From: Jack Zhang  
Sent: Friday, December 27, 2019 2:57 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian) 
Subject: [PATCH 1/2] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF

Before, initialization of the smu ip block would be skipped for sriov ASICs. But if
there's only one VF being used, the guest driver should be able to dump some HW
info such as clks, temperature, etc.

To solve this, now after onevf mode is enabled, host driver will notify guest. 
If it's onevf mode, guest will do smu hw_init and skip some steps in normal smu 
hw_init flow because host driver has already done it for smu.

With this fix, guest app can talk with smu and dump hw information from smu.

v2: refine the logic for pm_enabled. Skip hw_init by not changing pm_enabled.

Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c|  3 +-
 drivers/gpu/drm/amd/amdgpu/soc15.c |  3 +-
 drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 45 +-
 3 files changed, 29 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 8469834..08130a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -1448,7 +1448,8 @@ static int psp_np_fw_load(struct psp_context *psp)
		 || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_G
		 || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_CNTL
		 || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_GPM_MEM
-		 || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_SRM_MEM))
+		 || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_SRM_MEM
+		 || ucode->ucode_id == AMDGPU_UCODE_ID_SMC))
			/*skip ucode loading in SRIOV VF */
			continue;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index b53d401..a271496 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -827,8 +827,7 @@ int soc15_set_ip_blocks(struct amdgpu_device *adev)
		amdgpu_device_ip_block_add(adev, &dce_virtual_ip_block);
	amdgpu_device_ip_block_add(adev, &gfx_v9_0_ip_block);
	amdgpu_device_ip_block_add(adev, &sdma_v4_0_ip_block);
-	if (!amdgpu_sriov_vf(adev))
-		amdgpu_device_ip_block_add(adev, &smu_v11_0_ip_block);
+	amdgpu_device_ip_block_add(adev, &smu_v11_0_ip_block);

	if (amdgpu_sriov_vf(adev)) {
		if (likely(adev->firmware.load_type == AMDGPU_FW_LOAD_PSP))

diff --git a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
index 936c682..42c0a6d 100644
--- a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
@@ -531,10 +531,14 @@ bool is_support_sw_smu(struct amdgpu_device *adev)
if (adev->asic_type == CHIP_VEGA20)
return (amdgpu_dpm == 2) ? true : false;
else if (adev->asic_type >= CHIP_ARCTURUS) {
-   if (amdgpu_sriov_vf(adev))
-   return false;
-   else
+   if (amdgpu_sriov_vf(adev)) {
+   if(amdgpu_sriov_is_pp_one_vf(adev))
+   return true;
+   else
+   return false;
+   } else {
return true;
+   }
} else
return false;
 }
@@ -1062,20 +1066,19 @@ static int smu_smc_table_hw_init(struct smu_context *smu,
}
 
/* smu_dump_pptable(smu); */
+   if(amdgpu_sriov_vf(adev) && !amdgpu_sriov_is_pp_one_vf(adev)){
+   /*
+* Copy pptable bo in the vram to smc with SMU MSGs such as
+* SetDriverDramAddr and TransferTableDram2Smu.
+*/
+   ret = smu_write_pptable(smu);
+   if (ret)
+   return ret;
 
-   /*
-* Copy pptable bo in the vram to smc with SMU MSGs such as
-* SetDriverDramAddr and TransferTableDram2Smu.
-*/
-   ret = smu_write_pptable(smu);
-   if (ret)
-   return ret;
-
-   /* issue Run*Btc msg */
-   ret = smu_run_btc(smu);
-   if (ret)
-   return ret;
-
+   /* issue Run*Btc msg */
+   ret = smu_run_btc(smu);
+   if (ret)
+   return ret;
ret = smu_feature_set_allowed_mask(smu);
if (ret)
return ret;
@@ -1083,7 +1086,7 @@ static int smu_smc_table_hw_init(struct smu_context *smu,
ret = smu_system_features_control(smu, true);
if (ret)
return ret;
-
+   }
if (adev->asic_type != CHIP_ARCTURUS) {
  

RE: [PATCH 2/2] amd/amdgpu/sriov use resume for smc recover under sriov

2019-12-26 Thread Zhang, Jack (Jian)



-Original Message-
From: Jack Zhang  
Sent: Friday, December 27, 2019 2:57 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian) 
Subject: [PATCH 2/2] amd/amdgpu/sriov use resume for smc recover under sriov

Under sriov and pp_onevf mode:
1. Take the resume function for smc recovery to avoid a potential memory leak.

2. Add a return condition inside the smc resume function for sriov_pp_onevf_mode
and the pm_enabled param.

Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 +-
 drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 6 ++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 6a4b142..547fa3e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2458,7 +2458,11 @@ static int amdgpu_device_ip_reinit_late_sriov(struct amdgpu_device *adev)
			block->status.hw)
			continue;

-		r = block->version->funcs->hw_init(adev);
+		if (block->version->type == AMD_IP_BLOCK_TYPE_SMC)
+			r = block->version->funcs->resume(adev);
+		else
+			r = block->version->funcs->hw_init(adev);
+
		DRM_INFO("RE-INIT-late: %s %s\n", block->version->funcs->name, r?"failed":"succeeded");
		if (r)
			return r;
diff --git a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c 
b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
index 42c0a6d..7cad358 100644
--- a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
@@ -1416,6 +1416,12 @@ static int smu_resume(void *handle)
struct amdgpu_device *adev = (struct amdgpu_device *)handle;
	struct smu_context *smu = &adev->smu;

+	if (amdgpu_sriov_vf(adev) && !amdgpu_sriov_is_pp_one_vf(adev))
+   return 0;
+
+   if (!smu->pm_enabled)
+   return 0;
+
pr_info("SMU is resuming...\n");
 
ret = smu_start_smc_engine(smu);
--
2.7.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
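The SMC-specific recovery path in the patch above boils down to one dispatch decision: during SR-IOV reinit, route the SMC block through resume() instead of hw_init(). Below is a minimal standalone C sketch of that decision; the enum, struct and callback names are simplified stand-ins for amdgpu's real `amd_ip_funcs` table, not the driver's actual types.

```c
#include <stdbool.h>

/* Hypothetical miniature of amdgpu's IP-block callback table; the real
 * driver dispatches through block->version->funcs. */
enum block_type { BLOCK_GFX, BLOCK_SMC };

struct ip_block {
	enum block_type type;
	int (*hw_init)(void);
	int (*resume)(void);
};

/* Mirrors the patch: during SR-IOV reinit, the SMC block goes through
 * resume() (which can reuse existing allocations) instead of hw_init(),
 * avoiding the potential memory leak of a full re-init. */
static int reinit_block(const struct ip_block *b)
{
	if (b->type == BLOCK_SMC)
		return b->resume();
	return b->hw_init();
}
```

This is only a sketch of the control flow; the real function also walks the block list and logs the per-block result as in the diff above.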


RE: [PATCH] drm/amd/powerplay free driver_pptable for smu reinit

2019-12-25 Thread Zhang, Jack (Jian)
Ping...

-Original Message-
From: Zhang, Jack (Jian) 
Sent: Tuesday, December 24, 2019 1:05 PM
To: Wang, Kevin(Yang) ; Feng, Kenneth 
; Quan, Evan ; Jack Zhang 
; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH] drm/amd/powerplay free driver_pptable for smu reinit



-Original Message-
From: Jack Zhang  
Sent: Tuesday, December 24, 2019 1:05 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian) 
Subject: [PATCH] drm/amd/powerplay free driver_pptable for smu reinit

During gpu recovery, smu hw reinit will fail because
table_context->driver_pptable is not freed and set to NULL.

Free the driver_pptable pointer if it's not NULL.

Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/powerplay/smu_v11_0.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/powerplay/smu_v11_0.c 
b/drivers/gpu/drm/amd/powerplay/smu_v11_0.c
index 7781d24..ca877bd 100644
--- a/drivers/gpu/drm/amd/powerplay/smu_v11_0.c
+++ b/drivers/gpu/drm/amd/powerplay/smu_v11_0.c
@@ -713,8 +713,10 @@ int smu_v11_0_parse_pptable(struct smu_context *smu)
	struct smu_table_context *table_context = &smu->smu_table;
	struct smu_table *table = &table_context->tables[SMU_TABLE_PPTABLE];
 
-   if (table_context->driver_pptable)
-   return -EINVAL;
+   if (table_context->driver_pptable) {
+   kfree(table_context->driver_pptable);
+   table_context->driver_pptable = NULL;
+   }
 
table_context->driver_pptable = kzalloc(table->size, GFP_KERNEL);
 
--
2.7.4

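The fix above replaces an early `-EINVAL` with a free-and-reallocate pattern so that pptable parsing can run again after a GPU reset. A small userspace sketch of the same pattern, using `free`/`calloc` as stand-ins for `kfree`/`kzalloc` and a simplified context struct:

```c
#include <stdlib.h>

/* Simplified stand-in for smu_table_context: only the pptable pointer
 * matters for the bug this patch fixes. */
struct table_ctx {
	void  *driver_pptable;
	size_t size;
};

/* Mirrors the fixed smu_v11_0_parse_pptable(): instead of failing when a
 * stale pptable survives a GPU reset, free it and allocate a fresh
 * zeroed copy so hw reinit can proceed. */
static int parse_pptable(struct table_ctx *ctx)
{
	if (ctx->driver_pptable) {
		free(ctx->driver_pptable);	/* kfree() in the kernel */
		ctx->driver_pptable = NULL;
	}
	ctx->driver_pptable = calloc(1, ctx->size);	/* kzalloc() */
	return ctx->driver_pptable ? 0 : -12;		/* -ENOMEM */
}
```

Calling this twice in a row now succeeds, which is exactly the reinit-after-recovery case the commit message describes.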


RE: [PATCH] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF

2019-12-25 Thread Zhang, Jack (Jian)
Ping…

From: Zhang, Jack (Jian)
Sent: Tuesday, December 24, 2019 11:51 AM
To: Wang, Kevin(Yang) ; Feng, Kenneth 
; Tao, Yintian ; 
amd-gfx@lists.freedesktop.org; Deng, Emily ; Quan, Evan 

Subject: RE: [PATCH] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF

Hi, Kevin,

Thank you for your code review.

  1.  I updated the patch to refine the pm_enabled logic (attached).

  2.  For the comment on "* Copy pptable bo in the vram…": it is the original
bare-metal comment. I didn't change it.

  3.  "could you describe the function of pp_one_vf and sriov_vf ?"

      a.  amdgpu_sriov_vf marks whether it is sriov or bare-metal, while
amdgpu_sriov_is_pp_one_vf is a mode of sriov: it means there is only one VF
generated by the host driver.
      b.  When the host driver is loaded, the host administrator can determine
the "vf number". If vf_num = 1, the host driver lets the guest driver know it is
under one-vf mode and pp_one_vf returns true; otherwise pp_one_vf returns false.
vf_num cannot be changed without unloading both guest and host drivers, so it is
a static setting.
      c.  Under pp_one_vf mode, the guest driver does hw_init for smu so that it
can "talk" with smu by sending authorized smu messages. This helps user-mode
apps dump info like clks, temperature, GPU usage, etc. Currently we don't
support the guest driver writing values to smu; only read permission to dump smu
info is supported.

Besides, as the host driver has already initialized the smu hw, some hw init
steps need to be skipped in the guest hw_init function of the smu block, such as
writing the pptable and loading smc firmware.

  4.  pp_one_vf mode needs some smu firmware changes to open permission for
certain messages in the VF.
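The sriov_vf / pp_one_vf gating described above can be modeled as a small decision table. The sketch below is a standalone approximation: the enum, struct and flag names are simplified stand-ins for the real amdgpu device state, and `CHIP_OTHER_OLD` is a made-up placeholder for pre-Vega20 ASICs.

```c
#include <stdbool.h>

/* Simplified stand-ins for the real amdgpu device state; the actual
 * driver reads adev->asic_type, virtualization caps and amdgpu_dpm. */
enum asic_type { CHIP_OTHER_OLD, CHIP_VEGA20, CHIP_ARCTURUS };

struct dev_state {
	enum asic_type asic;
	bool sriov_vf;		/* running as an SR-IOV virtual function */
	bool pp_one_vf;		/* host configured exactly one VF */
	int  dpm;		/* amdgpu_dpm module parameter */
};

/* Mirrors the gating logic from the patch: swSMU stays disabled for
 * multi-VF SR-IOV, but is allowed again when the host runs one-VF mode. */
static bool is_support_sw_smu(const struct dev_state *d)
{
	if (d->asic == CHIP_VEGA20)
		return d->dpm == 2;
	if (d->asic >= CHIP_ARCTURUS) {
		if (d->sriov_vf)
			return d->pp_one_vf;
		return true;
	}
	return false;
}
```

The three interesting rows of the table are: bare metal → true, one-VF guest → true, multi-VF guest → false, which matches points (b) and (c) above.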

B.R.
Jack

From: Wang, Kevin(Yang) <kevin1.w...@amd.com>
Sent: Monday, December 23, 2019 6:11 PM
To: Zhang, Jack (Jian) <jack.zha...@amd.com>; Feng, Kenneth <kenneth.f...@amd.com>; Tao, Yintian <yintian@amd.com>; amd-gfx@lists.freedesktop.org; Deng, Emily <emily.d...@amd.com>
Cc: Quan, Evan <evan.q...@amd.com>
Subject: Re: [PATCH] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF


add @Quan, Evan to support arcturus asic.
comment inline.
____
From: Zhang, Jack (Jian) <jack.zha...@amd.com>
Sent: Monday, December 23, 2019 4:42 PM
To: Feng, Kenneth <kenneth.f...@amd.com>; Wang, Kevin(Yang) <kevin1.w...@amd.com>; Tao, Yintian <yintian@amd.com>; amd-gfx@lists.freedesktop.org; Deng, Emily <emily.d...@amd.com>
Cc: Zhang, Jack (Jian) <jack.zha...@amd.com>
Subject: RE: [PATCH] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF



-Original Message-
From: Jack Zhang <jack.zha...@amd.com>
Sent: Monday, December 23, 2019 4:40 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian) <jack.zha...@amd.com>
Subject: [PATCH] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF

Before, initialization of the smu ip block would be skipped for sriov ASICs. But if
there's only one VF being used, the guest driver should be able to dump some HW
info such as clks, temperature, etc.

To solve this, now after onevf mode is enabled, host driver will notify guest. 
If it's onevf mode, guest will do smu hw_init and skip some steps in normal smu 
hw_init flow because host driver has already done it for smu.

With this fix, guest app can talk with smu and dump hw information from smu.

Signed-off-by: Jack Zhang <jack.zha...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c|  3 +-
 drivers/gpu/drm/amd/amdgpu/soc15.c |  3 +-
 drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 49 ++
 3 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 8469834..08130a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -1448,7 +1448,8 @@ static int psp_np_fw_load(struct psp_context *psp)
		 || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_G
		 || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_CNTL
		 || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_GPM_MEM
-		 || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_SRM_MEM))
+		 || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_SRM_MEM
+		 || ucode->ucode_id == AMDGPU_UCODE_ID_SMC))
			/*skip ucode loading in SRIOV VF */
			continue;

diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index b53d401..a2

RE: [PATCH] drm/amd/powerplay free driver_pptable for smu reinit

2019-12-23 Thread Zhang, Jack (Jian)



-Original Message-
From: Jack Zhang  
Sent: Tuesday, December 24, 2019 1:05 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian) 
Subject: [PATCH] drm/amd/powerplay free driver_pptable for smu reinit

During gpu recovery, smu hw reinit will fail because
table_context->driver_pptable is not freed and set to NULL.

Free the driver_pptable pointer if it's not NULL.

Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/powerplay/smu_v11_0.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/powerplay/smu_v11_0.c 
b/drivers/gpu/drm/amd/powerplay/smu_v11_0.c
index 7781d24..ca877bd 100644
--- a/drivers/gpu/drm/amd/powerplay/smu_v11_0.c
+++ b/drivers/gpu/drm/amd/powerplay/smu_v11_0.c
@@ -713,8 +713,10 @@ int smu_v11_0_parse_pptable(struct smu_context *smu)
	struct smu_table_context *table_context = &smu->smu_table;
	struct smu_table *table = &table_context->tables[SMU_TABLE_PPTABLE];
 
-   if (table_context->driver_pptable)
-   return -EINVAL;
+   if (table_context->driver_pptable) {
+   kfree(table_context->driver_pptable);
+   table_context->driver_pptable = NULL;
+   }
 
table_context->driver_pptable = kzalloc(table->size, GFP_KERNEL);
 
--
2.7.4



RE: [PATCH] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF

2019-12-23 Thread Zhang, Jack (Jian)
Hi, Kevin,

Thank you for your code review.

  1.  I updated the patch to refine the pm_enabled logic (attached).

  2.  For the comment on "* Copy pptable bo in the vram…": it is the original
bare-metal comment. I didn't change it.

  3.  "could you describe the function of pp_one_vf and sriov_vf ?"

      a.  amdgpu_sriov_vf marks whether it is sriov or bare-metal, while
amdgpu_sriov_is_pp_one_vf is a mode of sriov: it means there is only one VF
generated by the host driver.
      b.  When the host driver is loaded, the host administrator can determine
the "vf number". If vf_num = 1, the host driver lets the guest driver know it is
under one-vf mode and pp_one_vf returns true; otherwise pp_one_vf returns false.
vf_num cannot be changed without unloading both guest and host drivers, so it is
a static setting.
      c.  Under pp_one_vf mode, the guest driver does hw_init for smu so that it
can "talk" with smu by sending authorized smu messages. This helps user-mode
apps dump info like clks, temperature, GPU usage, etc. Currently we don't
support the guest driver writing values to smu; only read permission to dump smu
info is supported.

Besides, as the host driver has already initialized the smu hw, some hw init
steps need to be skipped in the guest hw_init function of the smu block, such as
writing the pptable and loading smc firmware.

  4.  pp_one_vf mode needs some smu firmware changes to open permission for
certain messages in the VF.

B.R.
Jack

From: Wang, Kevin(Yang) 
Sent: Monday, December 23, 2019 6:11 PM
To: Zhang, Jack (Jian) ; Feng, Kenneth 
; Tao, Yintian ; 
amd-gfx@lists.freedesktop.org; Deng, Emily 
Cc: Quan, Evan 
Subject: Re: [PATCH] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF


add @Quan, Evan to support arcturus asic.
comment inline.

From: Zhang, Jack (Jian) <jack.zha...@amd.com>
Sent: Monday, December 23, 2019 4:42 PM
To: Feng, Kenneth <kenneth.f...@amd.com>; Wang, Kevin(Yang) <kevin1.w...@amd.com>; Tao, Yintian <yintian@amd.com>; amd-gfx@lists.freedesktop.org; Deng, Emily <emily.d...@amd.com>
Cc: Zhang, Jack (Jian) <jack.zha...@amd.com>
Subject: RE: [PATCH] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF



-Original Message-
From: Jack Zhang <jack.zha...@amd.com>
Sent: Monday, December 23, 2019 4:40 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian) <jack.zha...@amd.com>
Subject: [PATCH] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF

Before, initialization of the smu ip block would be skipped for sriov ASICs. But if
there's only one VF being used, the guest driver should be able to dump some HW
info such as clks, temperature, etc.

To solve this, now after onevf mode is enabled, host driver will notify guest. 
If it's onevf mode, guest will do smu hw_init and skip some steps in normal smu 
hw_init flow because host driver has already done it for smu.

With this fix, guest app can talk with smu and dump hw information from smu.

Signed-off-by: Jack Zhang <jack.zha...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c|  3 +-
 drivers/gpu/drm/amd/amdgpu/soc15.c |  3 +-
 drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 49 ++
 3 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 8469834..08130a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -1448,7 +1448,8 @@ static int psp_np_fw_load(struct psp_context *psp)
		 || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_G
		 || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_CNTL
		 || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_GPM_MEM
-		 || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_SRM_MEM))
+		 || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_SRM_MEM
+		 || ucode->ucode_id == AMDGPU_UCODE_ID_SMC))
			/*skip ucode loading in SRIOV VF */
			continue;

diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index b53d401..a271496 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -827,8 +827,7 @@ int soc15_set_ip_blocks(struct amdgpu_device *adev)
		amdgpu_device_ip_block_add(adev, &dce_virtual_ip_block);
	amdgpu_device_ip_block_add(adev, &gfx_v9_0_ip_block);
	amdgpu_device_ip_block_add(adev, &sdma_v4_0_ip_block);
-	if (!amdgpu_sriov_vf(adev))
-		amdgpu_device_ip_block_add(adev, &smu_v11_0_ip_block);
+

RE: [PATCH] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF

2019-12-23 Thread Zhang, Jack (Jian)



-Original Message-
From: Jack Zhang  
Sent: Monday, December 23, 2019 4:40 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian) 
Subject: [PATCH] amd/amdgpu/sriov enable onevf mode for ARCTURUS VF

Before, initialization of the smu ip block would be skipped for sriov ASICs. But if
there's only one VF being used, the guest driver should be able to dump some HW
info such as clks, temperature, etc.

To solve this, now after onevf mode is enabled, host driver will notify guest. 
If it's onevf mode, guest will do smu hw_init and skip some steps in normal smu 
hw_init flow because host driver has already done it for smu.

With this fix, guest app can talk with smu and dump hw information from smu.

Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c|  3 +-
 drivers/gpu/drm/amd/amdgpu/soc15.c |  3 +-
 drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 49 ++
 3 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 8469834..08130a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -1448,7 +1448,8 @@ static int psp_np_fw_load(struct psp_context *psp)
		 || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_G
		 || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_CNTL
		 || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_GPM_MEM
-		 || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_SRM_MEM))
+		 || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_RESTORE_LIST_SRM_MEM
+		 || ucode->ucode_id == AMDGPU_UCODE_ID_SMC))
			/*skip ucode loading in SRIOV VF */
			continue;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index b53d401..a271496 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -827,8 +827,7 @@ int soc15_set_ip_blocks(struct amdgpu_device *adev)
		amdgpu_device_ip_block_add(adev, &dce_virtual_ip_block);
	amdgpu_device_ip_block_add(adev, &gfx_v9_0_ip_block);
	amdgpu_device_ip_block_add(adev, &sdma_v4_0_ip_block);
-	if (!amdgpu_sriov_vf(adev))
-		amdgpu_device_ip_block_add(adev, &smu_v11_0_ip_block);
+	amdgpu_device_ip_block_add(adev, &smu_v11_0_ip_block);

	if (amdgpu_sriov_vf(adev)) {
		if (likely(adev->firmware.load_type == AMDGPU_FW_LOAD_PSP))

diff --git a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
index 936c682..c07fb26 100644
--- a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
@@ -531,10 +531,14 @@ bool is_support_sw_smu(struct amdgpu_device *adev)
if (adev->asic_type == CHIP_VEGA20)
return (amdgpu_dpm == 2) ? true : false;
else if (adev->asic_type >= CHIP_ARCTURUS) {
-   if (amdgpu_sriov_vf(adev))
-   return false;
-   else
+   if (amdgpu_sriov_vf(adev)) {
+   if(amdgpu_sriov_is_pp_one_vf(adev))
+   return true;
+   else
+   return false;
+   } else {
return true;
+   }
} else
return false;
 }
@@ -1062,20 +1066,19 @@ static int smu_smc_table_hw_init(struct smu_context *smu,
}
 
/* smu_dump_pptable(smu); */
+   if(amdgpu_sriov_vf(adev) && !amdgpu_sriov_is_pp_one_vf(adev)){
+   /*
+* Copy pptable bo in the vram to smc with SMU MSGs such as
+* SetDriverDramAddr and TransferTableDram2Smu.
+*/
+   ret = smu_write_pptable(smu);
+   if (ret)
+   return ret;
 
-   /*
-* Copy pptable bo in the vram to smc with SMU MSGs such as
-* SetDriverDramAddr and TransferTableDram2Smu.
-*/
-   ret = smu_write_pptable(smu);
-   if (ret)
-   return ret;
-
-   /* issue Run*Btc msg */
-   ret = smu_run_btc(smu);
-   if (ret)
-   return ret;
-
+   /* issue Run*Btc msg */
+   ret = smu_run_btc(smu);
+   if (ret)
+   return ret;
ret = smu_feature_set_allowed_mask(smu);
if (ret)
return ret;
@@ -1083,7 +1086,7 @@ static int smu_smc_table_hw_init(struct smu_context *smu,
ret = smu_system_features_control(smu, true);
if (ret)
return ret;
-
+   }
if (adev->asic_type != CHIP_ARCTURUS) {
ret = smu_notify_display_change(smu);
if (ret)
@@ -1136,8 +1139,9 @

RE: [PATCH] amd/amdgpu/sriov swSMU disable for sriov

2019-12-02 Thread Zhang, Jack (Jian)
Thanks Evan, I will add "{ }" before I check-in the code.

Best,
Jack

-Original Message-
From: Quan, Evan  
Sent: Tuesday, December 3, 2019 10:45 AM
To: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian) 
Subject: RE: [PATCH] amd/amdgpu/sriov swSMU disable for sriov



> -Original Message-
> From: amd-gfx  On Behalf Of 
> Jack Zhang
> Sent: Monday, December 2, 2019 7:05 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Jack (Jian) 
> Subject: [PATCH] amd/amdgpu/sriov swSMU disable for sriov
> 
> For boards greater than ARCTURUS, and under sriov platform, swSMU is 
> not supported because smu ip block is commented at guest driver.
> 
> Generally for sriov, initialization of smu is moved to host driver.
> Thus, smu sw_init and hw_init will not be executed at guest driver.
> 
> Without sw structure being initialized in guest driver, swSMU cannot 
> declare to be supported.
> 
> Signed-off-by: Jack Zhang 
> ---
>  drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
> b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
> index 36001a4..0b8a53b 100644
> --- a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
> +++ b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
> @@ -531,7 +531,10 @@ bool is_support_sw_smu(struct amdgpu_device *adev)
>   if (adev->asic_type == CHIP_VEGA20)
>   return (amdgpu_dpm == 2) ? true : false;
>   else if (adev->asic_type >= CHIP_ARCTURUS)
> - return true;
> + if (amdgpu_sriov_vf(adev))
> + return false;
> + else
> + return true;
[Quan, Evan] Are "{" and "}" missing around this code block? This seems a 
little weird.
>   else
>   return false;
>  }
> --
> 2.7.4
> 
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH] amd/amdgpu/sriov swSMU disable for sriov

2019-12-02 Thread Zhang, Jack (Jian)
Hi, Team,

Would you please help to review this patch?

Best,
Jack

-Original Message-
From: Jack Zhang  
Sent: Monday, December 2, 2019 7:05 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian) 
Subject: [PATCH] amd/amdgpu/sriov swSMU disable for sriov

For boards greater than ARCTURUS, and under sriov platform, swSMU is not 
supported because smu ip block is commented at guest driver.

Generally for sriov, initialization of smu is moved to host driver.
Thus, smu sw_init and hw_init will not be executed at guest driver.

Without the sw structure being initialized in the guest driver, swSMU cannot be
declared as supported.

Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c 
b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
index 36001a4..0b8a53b 100644
--- a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
@@ -531,7 +531,10 @@ bool is_support_sw_smu(struct amdgpu_device *adev)
if (adev->asic_type == CHIP_VEGA20)
return (amdgpu_dpm == 2) ? true : false;
else if (adev->asic_type >= CHIP_ARCTURUS)
-   return true;
+   if (amdgpu_sriov_vf(adev))
+   return false;
+   else
+   return true;
else
return false;
 }
--
2.7.4


Re: [PATCH 2/2] drm/amd/amdgpu/sriov skip RLCG s/r list for arcturus VF.

2019-11-21 Thread Zhang, Jack (Jian)
Sure,will do that!

Thanks,
Jack

 Outlook for Android<https://aka.ms/ghei36>

From: Deucher, Alexander 
Sent: Friday, November 22, 2019 12:00:04 AM
To: Zhang, Jack (Jian) ; Alex Deucher 

Cc: amd-gfx list 
Subject: Re: [PATCH 2/2] drm/amd/amdgpu/sriov skip RLCG s/r list for arcturus 
VF.

You can have my RB on the first patch if you fix the compiler warnings.

Alex

From: amd-gfx  on behalf of Zhang, Jack 
(Jian) 
Sent: Thursday, November 21, 2019 10:58 AM
To: Alex Deucher 
Cc: amd-gfx list 
Subject: Re: [PATCH 2/2] drm/amd/amdgpu/sriov skip RLCG s/r list for arcturus 
VF.

Thanks Alex for this review.

Both of the two patches will not influence bare-metal code path.

B. R.
Jack

 Outlook for Android<https://aka.ms/ghei36>

From: Alex Deucher 
Sent: Thursday, November 21, 2019 11:26:37 PM
To: Zhang, Jack (Jian) 
Cc: amd-gfx list 
Subject: Re: [PATCH 2/2] drm/amd/amdgpu/sriov skip RLCG s/r list for arcturus 
VF.

Nevermind.  I missed the sr-iov check. Patch is:
Acked-by: Alex Deucher 

On Thu, Nov 21, 2019 at 10:25 AM Alex Deucher  wrote:
>
> Won't this impact bare metal hw that needs this?
>
> Alex
>
> On Thu, Nov 21, 2019 at 1:17 AM Jack Zhang  wrote:
> >
> > After rlcg fw 2.1, the kmd driver starts to load extra fw for
> > LIST_CNTL, GPM_MEM and SRM_MEM. We need to skip these three fw
> > because all rlcg-related fw has been loaded by the host driver.
> > Without this change, the guest driver would fail to load the three fw.
> >
> > Signed-off-by: Jack Zhang 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 5 -
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > index c3a42d3..eecde80 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > @@ -1470,7 +1470,10 @@ static int psp_np_fw_load(struct psp_context *psp)
> > || ucode->ucode_id == AMDGPU_UCODE_ID_SDMA5
> > || ucode->ucode_id == AMDGPU_UCODE_ID_SDMA6
> > || ucode->ucode_id == AMDGPU_UCODE_ID_SDMA7
> > -   || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_G))
> > +|| ucode->ucode_id == AMDGPU_UCODE_ID_RLC_G
> > +   || ucode->ucode_id == 
> > AMDGPU_UCODE_ID_RLC_RESTORE_LIST_CNTL
> > +   || ucode->ucode_id == 
> > AMDGPU_UCODE_ID_RLC_RESTORE_LIST_GPM_MEM
> > +   || ucode->ucode_id == 
> > AMDGPU_UCODE_ID_RLC_RESTORE_LIST_SRM_MEM))
> > /*skip ucode loading in SRIOV VF */
> > continue;
> >
> > --
> > 2.7.4
> >
> > ___
> > amd-gfx mailing list
> > amd-gfx@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
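The growing `||` chain in `psp_np_fw_load()` discussed in this thread is easier to read as a membership test over a list of host-loaded firmwares. Below is a hedged standalone sketch of that predicate; the `UCODE_*` enum is a stand-in for the real `AMDGPU_UCODE_ID_*` values, and `UCODE_VCN` is just an example of a firmware that is still loaded by the guest.

```c
#include <stdbool.h>
#include <stddef.h>

/* A few stand-in ucode IDs; the real driver uses the AMDGPU_UCODE_ID_*
 * enum from amdgpu_ucode.h. */
enum ucode_id {
	UCODE_RLC_G,
	UCODE_RLC_RESTORE_LIST_CNTL,
	UCODE_RLC_RESTORE_LIST_GPM_MEM,
	UCODE_RLC_RESTORE_LIST_SRM_MEM,
	UCODE_VCN,
};

/* Mirrors the skip test in psp_np_fw_load(): under SR-IOV the host has
 * already loaded these firmwares, so the guest must not load them again. */
static bool skip_in_sriov_vf(bool is_sriov_vf, enum ucode_id id)
{
	static const enum ucode_id host_loaded[] = {
		UCODE_RLC_G,
		UCODE_RLC_RESTORE_LIST_CNTL,
		UCODE_RLC_RESTORE_LIST_GPM_MEM,
		UCODE_RLC_RESTORE_LIST_SRM_MEM,
	};
	size_t i;

	if (!is_sriov_vf)	/* bare metal always loads everything */
		return false;
	for (i = 0; i < sizeof(host_loaded) / sizeof(host_loaded[0]); i++)
		if (id == host_loaded[i])
			return true;
	return false;
}
```

Keeping the skip set in a table also makes later additions (such as the SMC entry added by a follow-up patch) a one-line change instead of another wrapped `||` clause.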

Re: [PATCH 2/2] drm/amd/amdgpu/sriov skip RLCG s/r list for arcturus VF.

2019-11-21 Thread Zhang, Jack (Jian)
Thanks Alex for this review.

Both of the two patches will not influence bare-metal code path.

B. R.
Jack

 Outlook for Android<https://aka.ms/ghei36>

From: Alex Deucher 
Sent: Thursday, November 21, 2019 11:26:37 PM
To: Zhang, Jack (Jian) 
Cc: amd-gfx list 
Subject: Re: [PATCH 2/2] drm/amd/amdgpu/sriov skip RLCG s/r list for arcturus 
VF.

Nevermind.  I missed the sr-iov check. Patch is:
Acked-by: Alex Deucher 

On Thu, Nov 21, 2019 at 10:25 AM Alex Deucher  wrote:
>
> Won't this impact bare metal hw that needs this?
>
> Alex
>
> On Thu, Nov 21, 2019 at 1:17 AM Jack Zhang  wrote:
> >
> > After rlcg fw 2.1, the kmd driver starts to load extra fw for
> > LIST_CNTL, GPM_MEM and SRM_MEM. We need to skip these three fw
> > because all rlcg-related fw has been loaded by the host driver.
> > Without this change, the guest driver would fail to load the three fw.
> >
> > Signed-off-by: Jack Zhang 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 5 -
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > index c3a42d3..eecde80 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > @@ -1470,7 +1470,10 @@ static int psp_np_fw_load(struct psp_context *psp)
> > || ucode->ucode_id == AMDGPU_UCODE_ID_SDMA5
> > || ucode->ucode_id == AMDGPU_UCODE_ID_SDMA6
> > || ucode->ucode_id == AMDGPU_UCODE_ID_SDMA7
> > -   || ucode->ucode_id == AMDGPU_UCODE_ID_RLC_G))
> > +|| ucode->ucode_id == AMDGPU_UCODE_ID_RLC_G
> > +   || ucode->ucode_id == 
> > AMDGPU_UCODE_ID_RLC_RESTORE_LIST_CNTL
> > +   || ucode->ucode_id == 
> > AMDGPU_UCODE_ID_RLC_RESTORE_LIST_GPM_MEM
> > +   || ucode->ucode_id == 
> > AMDGPU_UCODE_ID_RLC_RESTORE_LIST_SRM_MEM))
> > /*skip ucode loading in SRIOV VF */
> > continue;
> >
> > --
> > 2.7.4
> >
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH 1/2] drm/amd/amdgpu/sriov temporarily skip ras, dtm, hdcp for arcturus VF

2019-11-20 Thread Zhang, Jack (Jian)
Hi, Team,

Would you please help to take a look this patch?

BR
Jack

-Original Message-
From: amd-gfx  On Behalf Of Jack Zhang
Sent: Thursday, November 21, 2019 2:17 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian) 
Subject: [PATCH 1/2] drm/amd/amdgpu/sriov temporarily skip ras, dtm, hdcp for arcturus VF

Temporarily skip ras, dtm and hdcp initialize/terminate for arcturus VF.
Currently these three features are not enabled under SRIOV, so going
through their bare-metal paths would cause the guest driver load to fail.

Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 36 +
 1 file changed, 36 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 2a8a08a..c3a42d3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -756,6 +756,12 @@ int psp_ras_enable_features(struct psp_context *psp,
 
static int psp_ras_terminate(struct psp_context *psp)
{
+   /*
+* TODO: bypass the terminate in sriov for now
+*/
+   if (amdgpu_sriov_vf(psp->adev))
+   return 0;
+
int ret;
 
if (!psp->ras.ras_initialized)
@@ -777,6 +783,12 @@ static int psp_ras_terminate(struct psp_context *psp)
 
static int psp_ras_initialize(struct psp_context *psp)
{
+   /*
+* TODO: bypass the initialize in sriov for now
+*/
+   if (amdgpu_sriov_vf(psp->adev))
+   return 0;
+
int ret;
 
	if (!psp->adev->psp.ta_ras_ucode_size ||
@@ -872,6 +884,12 @@ static int psp_hdcp_load(struct psp_context *psp)
 }

static int psp_hdcp_initialize(struct psp_context *psp)
{
+   /*
+* TODO: bypass the initialize in sriov for now
+*/
+   if (amdgpu_sriov_vf(psp->adev))
+   return 0;
+
int ret;
 
	if (!psp->adev->psp.ta_hdcp_ucode_size ||
@@ -960,6 +978,12 @@ int psp_hdcp_invoke(struct psp_context *psp, uint32_t ta_cmd_id)
 
static int psp_hdcp_terminate(struct psp_context *psp)
{
+   /*
+* TODO: bypass the terminate in sriov for now
+*/
+   if (amdgpu_sriov_vf(psp->adev))
+   return 0;
+
int ret;
 
if (!psp->hdcp_context.hdcp_initialized)
@@ -1051,6 +1075,12 @@ static int psp_dtm_load(struct psp_context *psp)
 
static int psp_dtm_initialize(struct psp_context *psp)
{
+   /*
+* TODO: bypass the initialize in sriov for now
+*/
+   if (amdgpu_sriov_vf(psp->adev))
+   return 0;
+
int ret;
 
	if (!psp->adev->psp.ta_dtm_ucode_size ||
@@ -1109,6 +1139,12 @@ int psp_dtm_invoke(struct psp_context *psp, uint32_t ta_cmd_id)
 
static int psp_dtm_terminate(struct psp_context *psp)
{
+   /*
+* TODO: bypass the terminate in sriov for now
+*/
+   if (amdgpu_sriov_vf(psp->adev))
+   return 0;
+
int ret;
 
if (!psp->dtm_context.dtm_initialized)
--
2.7.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH] drm/amd/amdgpu/sriov ip block setting of Arcturus

2019-09-29 Thread Zhang, Jack (Jian)
Hi, folks,

Would you please help to review the patch?

Thanks,
Jack

-Original Message-
From: amd-gfx  On Behalf Of Jack Zhang
Sent: Monday, September 30, 2019 1:00 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian) 
Subject: [PATCH] drm/amd/amdgpu/sriov ip block setting of Arcturus

Add ip block setting for Arcturus SRIOV

1. PSP needs to be initialized before IH.
2. SMU doesn't need to be initialized by the KMD driver.
3. Arcturus doesn't support DCE hardware, so register access to DCE
   must be skipped.

Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 10 ++
 drivers/gpu/drm/amd/amdgpu/soc15.c| 19 +++
 2 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 95a9a5f5..44023bd 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -1330,11 +1330,13 @@ static int gmc_v9_0_hw_init(void *handle)
gmc_v9_0_init_golden_registers(adev);
 
if (adev->mode_info.num_crtc) {
-   /* Lockout access through VGA aperture*/
-   WREG32_FIELD15(DCE, 0, VGA_HDP_CONTROL, VGA_MEMORY_DISABLE, 1);
+   if (adev->asic_type != CHIP_ARCTURUS) {
+   /* Lockout access through VGA aperture*/
+   WREG32_FIELD15(DCE, 0, VGA_HDP_CONTROL, VGA_MEMORY_DISABLE, 1);
 
-   /* disable VGA render */
-   WREG32_FIELD15(DCE, 0, VGA_RENDER_CONTROL, VGA_VSTATUS_CNTL, 0);
+   /* disable VGA render */
+   WREG32_FIELD15(DCE, 0, VGA_RENDER_CONTROL, VGA_VSTATUS_CNTL, 0);
+   }
}
 
r = gmc_v9_0_gart_enable(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index dbd790e..ac181e3 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -754,14 +754,25 @@ int soc15_set_ip_blocks(struct amdgpu_device *adev)
case CHIP_ARCTURUS:
	amdgpu_device_ip_block_add(adev, &vega10_common_ip_block);
	amdgpu_device_ip_block_add(adev, &gmc_v9_0_ip_block);
-	amdgpu_device_ip_block_add(adev, &vega10_ih_ip_block);
-	if (likely(adev->firmware.load_type == AMDGPU_FW_LOAD_PSP))
-		amdgpu_device_ip_block_add(adev, &psp_v11_0_ip_block);
+
+	/* For MI100 SR-IOV, PSP need to be initialized before IH */
+	if (amdgpu_sriov_vf(adev)) {
+		if (likely(adev->firmware.load_type == AMDGPU_FW_LOAD_PSP))
+			amdgpu_device_ip_block_add(adev, &psp_v11_0_ip_block);
+		amdgpu_device_ip_block_add(adev, &vega10_ih_ip_block);
+	} else {
+		amdgpu_device_ip_block_add(adev, &vega10_ih_ip_block);
+		if (likely(adev->firmware.load_type == AMDGPU_FW_LOAD_PSP))
+			amdgpu_device_ip_block_add(adev, &psp_v11_0_ip_block);
+	}
+
	if (adev->enable_virtual_display || amdgpu_sriov_vf(adev))
		amdgpu_device_ip_block_add(adev, &dce_virtual_ip_block);
	amdgpu_device_ip_block_add(adev, &gfx_v9_0_ip_block);
	amdgpu_device_ip_block_add(adev, &sdma_v4_0_ip_block);
-	amdgpu_device_ip_block_add(adev, &smu_v11_0_ip_block);
+	if (!amdgpu_sriov_vf(adev))
+		amdgpu_device_ip_block_add(adev, &smu_v11_0_ip_block);
+
	if (unlikely(adev->firmware.load_type == AMDGPU_FW_LOAD_DIRECT))
		amdgpu_device_ip_block_add(adev, &vcn_v2_5_ip_block);
break;
-- 
2.7.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH] drm/amdgpu/sriov: omit fbcon error under sriov or passthrough

2019-09-18 Thread Zhang, Jack (Jian)



Reviewed & Tested-by: Jack Zhang 

BR,
Jack
From: Deng, Emily 
Sent: Thursday, September 19, 2019 10:58 AM
To: Koenig, Christian 
Cc: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org; 
Teng, Rui ; Cui, Flora 
Subject: RE: [PATCH] drm/amdgpu/sriov: omit fbcon error under sriov or passthrough

Hi Jack,
Could you please give a try about this? Both for passthrough and sriov.

Best wishes
Emily Deng
From: Koenig, Christian 
Sent: Wednesday, September 18, 2019 6:47 PM
To: Deng, Emily 
Cc: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org; Teng, Rui ; Cui, Flora 
Subject: Re: [PATCH] drm/amdgpu/sriov: omit fbcon error under sriov or passthrough

Hi Jack & Emily,

asking around a bit helped; we should be able to look at the module
state to distinguish the two use cases like this:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 6b96a5738e57..0af134eb03e2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1074,7 +1074,10 @@ amdgpu_pci_remove(struct pci_dev *pdev)
 {
struct drm_device *dev = pci_get_drvdata(pdev);

-   DRM_ERROR("Device removal is currently not supported outside of fbcon\n");
+#ifdef MODULE
+   if (THIS_MODULE->state != MODULE_STATE_GOING)
+#endif
+   DRM_ERROR("Device removal is currently not supported outside of fbcon\n");
drm_dev_unplug(dev);
drm_dev_put(dev);
pci_disable_device(pdev);

It's a bit of a hack, but I think that this should work.

Regards,
Christian.

On 18.09.19 at 12:29, Christian König wrote:
Hi Emily,

Do you think this is because of a wrong use case?
Well, Jack's use case is correct, but the PCIe hot-plug removal use case is
incorrect.

Changing it to a warning is most likely not a good idea either because it is 
indeed an error to try to use PCIe hot plug removal.

What we need to distinguish is why the function is called: if it's because of
pci_unregister_driver(&amdgpu_kms_pci_driver) in amdgpu_exit() then the use
case is valid and we should not print the error.

But if it's because somebody does something like
"echo 1 > /sys/bus/pci/devices/0000:0b:00.1/remove" then that is invalid and
we should print it.

We could do some hack and look at the stack trace, but that is probably not 
reliable either.

Maybe we can look at the module reference count or something like that.

Regards,
Christian.

On 18.09.19 at 12:04, Deng, Emily wrote:
Hi Christian,
Do you think this is because of a wrong use case? Even with the error log
added, the issue still happens. Could we change this error to a warning?
With the right method of removing the driver, no issues should occur.

Best wishes
Emily Deng
From: Koenig, Christian 
Sent: Wednesday, September 18, 2019 5:45 PM
To: Deng, Emily 
Cc: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org; Teng, Rui ; Cui, Flora 
Subject: RE: [PATCH] drm/amdgpu/sriov: omit fbcon error under sriov or passthrough

Yes, exactly.

The problem is that even when the output is qxl or the Intel driver, our
driver is still loaded, and forcefully removing it renders the desktop
unusable.

Christian.


On 18.09.2019 11:24, "Deng, Emily" wrote:

Hi Christian,

Do you mean that in passthrough mode the desktop is rendered by our ASIC,
but the end user tries to remove the driver via hot plug?



Best wishes

Emily Deng

From: Koenig, Christian 
Sent: Wednesday, September 18, 2019 4:44 PM
To: Deng, Emily 
Cc: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org; Teng, Rui ; Cui, Flora 
Subject: RE: [PATCH] drm/amdgpu/sriov: omit fbcon error under sriov or passthrough



Hi Emily,



Yeah, exactly that's the problem: In some cases the driver can be unloaded 
while it is still in use!



See, we added this error message because end users tried to use PCIe hot plug
to unload the driver in order to use the hardware for passthrough.



But this will completely nuke your desktop, even when amdgpu is only a 
secondary device like in the qxl case.



Jack is using the correct way of doing it, e.g. using "modprobe -r" or rmmod. 
Both commands check the use count before unloading the driver instances.



I don't see a way to distinguish the two cases and what Jack

RE: [PATCH] drm/amdgpu/sriov: omit fbcon error under sriov or passthrough

2019-09-18 Thread Zhang, Jack (Jian)
Hi, Christian and folks,

In virtual machines (e.g. under virt-manager), there is always a virtual
graphics device such as "qxl" present as the default gfx device.
Under that condition we believe it's reasonable to unload the amdgpu driver,
since it is not the default fbcon device.

Would you please help to review this patch?

Best wishes,
Jack

-Original Message-
From: Jack Zhang  
Sent: Wednesday, September 18, 2019 3:25 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian) 
Subject: [PATCH] drm/amdgpu/sriov: omit fbcon error under sriov or passthrough

In a virtual machine there is typically a qxl or cirrus graphics device
acting as the default master fbcon device.

So for a PF (passthrough mode) or an SRIOV VF it is reasonable to unload the
amdgpu driver; amdgpu doesn't have to be the only fbcon device under this
condition.

Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 420888e..ada2b25 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1103,8 +1103,9 @@ static void
amdgpu_pci_remove(struct pci_dev *pdev)
{
struct drm_device *dev = pci_get_drvdata(pdev);
-
-   DRM_ERROR("Device removal is currently not supported outside of fbcon\n");
+   struct amdgpu_device *adev = dev->dev_private;
+   if (!amdgpu_sriov_vf(adev) && !amdgpu_passthrough(adev))
+   DRM_ERROR("Device removal is currently not supported outside of fbcon\n");
drm_dev_unplug(dev);
drm_dev_put(dev);
pci_disable_device(pdev);
--
2.7.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH] drm/amdgpu/sriov: add ring_stop before ring_create in psp v11 code

2019-09-10 Thread Zhang, Jack (Jian)
ping

-Original Message-
From: amd-gfx  On Behalf Of Jack Zhang
Sent: Tuesday, September 10, 2019 4:09 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Jack (Jian) 
Subject: [PATCH] drm/amdgpu/sriov: add ring_stop before ring_create in psp v11 code

The psp v11 code missed the ring stop in its ring create function (VMR path),
while the psp v3.1 code had it. This causes VM destroy to fail and the psp
ring create to fail.

For an SRIOV VF, ring_stop should not be omitted from the ring_create
function.

Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 61 +++---
 1 file changed, 34 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
index cddfa3c..8e9b34a 100644
--- a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
@@ -397,6 +397,34 @@ static bool psp_v11_0_support_vmr_ring(struct psp_context *psp)
return false;
 }
 
+static int psp_v11_0_ring_stop(struct psp_context *psp,
+ enum psp_ring_type ring_type)
+{
+   int ret = 0;
+   struct amdgpu_device *adev = psp->adev;
+
+   /* Write the ring destroy command*/
+   if (psp_v11_0_support_vmr_ring(psp))
+   WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_101,
+GFX_CTRL_CMD_ID_DESTROY_GPCOM_RING);
+   else
+   WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_64,
+GFX_CTRL_CMD_ID_DESTROY_RINGS);
+
+   /* there might be handshake issue with hardware which needs delay */
+   mdelay(20);
+
+   /* Wait for response flag (bit 31) */
+   if (psp_v11_0_support_vmr_ring(psp))
+   ret = psp_wait_for(psp, SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_101),
+  0x80000000, 0x80000000, false);
+   else
+   ret = psp_wait_for(psp, SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_64),
+  0x80000000, 0x80000000, false);
+
+   return ret;
+}
+
 static int psp_v11_0_ring_create(struct psp_context *psp,
enum psp_ring_type ring_type)
 {
@@ -406,6 +434,12 @@ static int psp_v11_0_ring_create(struct psp_context *psp,
struct amdgpu_device *adev = psp->adev;
 
if (psp_v11_0_support_vmr_ring(psp)) {
+   ret = psp_v11_0_ring_stop(psp, ring_type);
+   if (ret) {
+   DRM_ERROR("psp_v11_0_ring_stop_sriov failed!\n");
+   return ret;
+   }
+
/* Write low address of the ring to C2PMSG_102 */
psp_ring_reg = lower_32_bits(ring->ring_mem_mc_addr);
WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_102, psp_ring_reg);
@@ -450,33 +484,6 @@ static int psp_v11_0_ring_create(struct psp_context *psp,
return ret;
 }
 
-static int psp_v11_0_ring_stop(struct psp_context *psp,
- enum psp_ring_type ring_type)
-{
-   int ret = 0;
-   struct amdgpu_device *adev = psp->adev;
-
-   /* Write the ring destroy command*/
-   if (psp_v11_0_support_vmr_ring(psp))
-   WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_101,
-GFX_CTRL_CMD_ID_DESTROY_GPCOM_RING);
-   else
-   WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_64,
-GFX_CTRL_CMD_ID_DESTROY_RINGS);
-
-   /* there might be handshake issue with hardware which needs delay */
-   mdelay(20);
-
-   /* Wait for response flag (bit 31) */
-   if (psp_v11_0_support_vmr_ring(psp))
-   ret = psp_wait_for(psp, SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_101),
-  0x80000000, 0x80000000, false);
-   else
-   ret = psp_wait_for(psp, SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_64),
-  0x80000000, 0x80000000, false);
-
-   return ret;
-}
 
 static int psp_v11_0_ring_destroy(struct psp_context *psp,
 enum psp_ring_type ring_type)
--
2.7.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx