Re: [PATCH] drm/amdgpu: add support to SMU debug option

2021-12-01 Thread Lazar, Lijo




On 12/2/2021 8:42 AM, Yu, Lang wrote:

[AMD Official Use Only]




-Original Message-
From: Quan, Evan 
Sent: Thursday, December 2, 2021 10:48 AM
To: Yu, Lang ; Koenig, Christian
; Christian König
; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Lazar, Lijo
; Huang, Ray 
Subject: RE: [PATCH] drm/amdgpu: add support to SMU debug option

[AMD Official Use Only]




-Original Message-
From: amd-gfx  On Behalf Of Yu,
Lang
Sent: Wednesday, December 1, 2021 7:37 PM
To: Koenig, Christian ; Christian König
; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Lazar, Lijo
; Huang, Ray 
Subject: RE: [PATCH] drm/amdgpu: add support to SMU debug option

[AMD Official Use Only]




-Original Message-
From: Koenig, Christian 
Sent: Wednesday, December 1, 2021 7:29 PM
To: Yu, Lang ; Christian König
; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Lazar, Lijo
; Huang, Ray 
Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option

On 01.12.21 12:20, Yu, Lang wrote:

[AMD Official Use Only]


-Original Message-
From: Christian König 
Sent: Wednesday, December 1, 2021 6:49 PM
To: Yu, Lang ; Koenig, Christian
; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Lazar, Lijo
; Huang, Ray 
Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option

On 01.12.21 11:44, Yu, Lang wrote:

[AMD Official Use Only]




-Original Message-
From: Koenig, Christian 
Sent: Wednesday, December 1, 2021 5:30 PM
To: Yu, Lang ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Lazar, Lijo
; Huang, Ray 
Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option

On 01.12.21 10:24, Lang Yu wrote:

To preserve the system error state when SMU errors occur, which
will aid in debugging SMU firmware issues, add SMU debug option
support.


It can be enabled or disabled via amdgpu_smu_debug debugfs file.
When enabled, it makes SMU errors fatal.
It is disabled by default.

== Command Guide ==

1, enable SMU debug option

 # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

2, disable SMU debug option

 # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

v3:
 - Use debugfs_create_bool().(Christian)
 - Put variable into smu_context struct.
 - Don't resend command when timeout.

v2:
 - Resend command when timeout.(Lijo)
 - Use debugfs file instead of module parameter.

Signed-off-by: Lang Yu 

Well the debugfs part looks really nice and clean now, but one
more comment below.


---
 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c| 3 +++
 drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h| 5 +
 drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 8 +++-
 4 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 164d6a9e9fbb..86cd888c7822 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct
amdgpu_device

*adev)

if (!debugfs_initialized())
return 0;

+   debugfs_create_bool("amdgpu_smu_debug", 0600, root,
+   &adev->smu.smu_debug_mode);
+
	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
				  &fops_ib_preempt);
	if (IS_ERR(ent)) {
diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index f738f7dc20c9..50dbf5594a9d 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -569,6 +569,11 @@ struct smu_context
struct smu_user_dpm_profile user_dpm_profile;

struct stb_context stb_context;
+   /*
+* When enabled, it makes SMU errors fatal.
+* (0 = disabled (default), 1 = enabled)
+*/
+   bool smu_debug_mode;
 };

 struct i2c_adapter;
diff --git

a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c

b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
index 6e781cee8bb6..d3797a2d6451 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
@@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct

smu_context *smu)

 out:
	mutex_unlock(&smu->message_lock);

+   BUG_ON(unlikely(smu->smu_debug_mode) && ret);
+
return ret;
 }

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
index 048ca1673863..9be005eb4241 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
@@ -349,15 +349,21 @@ int

smu_cmn_send_smc_msg_with_param(struct

smu_context *smu,

		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
		goto Out;
	}
+
__smu_cmn_send_msg(smu, (uint16_t) index, param);
reg 

Re: [PATCH V3 01/17] drm/amd/pm: do not expose implementation details to other blocks out of power

2021-12-01 Thread Lazar, Lijo




On 12/2/2021 11:48 AM, Quan, Evan wrote:

[AMD Official Use Only]




-Original Message-
From: Lazar, Lijo 
Sent: Thursday, December 2, 2021 1:12 PM
To: Quan, Evan ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Koenig, Christian
; Feng, Kenneth 
Subject: Re: [PATCH V3 01/17] drm/amd/pm: do not expose implementation
details to other blocks out of power



On 12/2/2021 10:22 AM, Quan, Evan wrote:

[AMD Official Use Only]




-Original Message-
From: Lazar, Lijo 
Sent: Thursday, December 2, 2021 12:13 PM
To: Quan, Evan ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Koenig,

Christian

; Feng, Kenneth 
Subject: Re: [PATCH V3 01/17] drm/amd/pm: do not expose
implementation details to other blocks out of power



On 12/2/2021 8:39 AM, Evan Quan wrote:

Those implementation details (whether swsmu is supported, whether some
ppt_funcs are supported, access to internal statistics, ...) should be kept
internal. It is not good practice, and even error prone, to expose
implementation details.


Signed-off-by: Evan Quan 
Change-Id: Ibca3462ceaa26a27a9145282b60c6ce5deca7752
---
drivers/gpu/drm/amd/amdgpu/aldebaran.c|  2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   | 25 +++---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c|  6 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c   | 18 +---
drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h   |  7 --
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   |  5 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c   |  5 +-
drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c   |  2 +-
.../gpu/drm/amd/include/kgd_pp_interface.h|  4 +
drivers/gpu/drm/amd/pm/amdgpu_dpm.c   | 90

+++

drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   | 25 +-
drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   | 11 +--
drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 26 +++---
13 files changed, 161 insertions(+), 65 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/aldebaran.c
b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
index bcfdb63b1d42..a545df4efce1 100644
--- a/drivers/gpu/drm/amd/amdgpu/aldebaran.c
+++ b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
@@ -260,7 +260,7 @@ static int aldebaran_mode2_restore_ip(struct

amdgpu_device *adev)

adev->gfx.rlc.funcs->resume(adev);

/* Wait for FW reset event complete */
-   r = smu_wait_for_event(adev, SMU_EVENT_RESET_COMPLETE, 0);
+   r = amdgpu_dpm_wait_for_event(adev,

SMU_EVENT_RESET_COMPLETE, 0);

Hi Evan,

As mentioned in the earlier comments, I suggest you leave these
newer APIs and take care of the rest of the APIs. These may be
covered as amdgpu_smu* in another patch set. Till that time, it's not needed to
move them to amdgpu_dpm (as mentioned before, some of them are
not even remotely related to power management).

[Quan, Evan] This patch series relies heavily on that change. That is, swsmu is
another framework alongside powerplay, and all access should come through
amdgpu_dpm.c.
More specifically, patches 13 and 17 directly rely on this.
Furthermore, without the unified lock protection from patch 17, the
changes for dropping unneeded locks (which are already in my local branch)
would be impossible.



Patch 13 is directly related to smu context. I don't see many smu context
related APIs added in amdgpu_dpm. I guess you could convert those APIs
directly to pass amdgpu_device instead of smu_context.

Ex: smu_get_ecc_info(struct amdgpu_device *adev,

As for the mutex change, we could still use pm.mutex in place of smu mutex,
right?

[Quan, Evan] I'm afraid such a partial change (some swsmu APIs called through
amdgpu_dpm while others are called via smu_* directly) will cause some chaos.
That is, some will have their lock protection (pm.mutex) in amdgpu_dpm.c while
others have it in amdgpu_smu.c.
That also means some swsmu APIs in amdgpu_smu.c need pm.mutex while others do
not.

I would prefer the current way, which converts all of them to be called through
amdgpu_dpm.
If needed, we can convert them all back to smu_* directly later (with a new
patch set).
That will be simpler.



I'm fine with the idea of naming those APIs supported only by swsmu
with the prefix amdgpu_smu*. But that has to be done after this patch series.

And I would expect those APIs to be located in amdgpu_dpm.c (instead of
amdgpu_smu.c) as well.

I don't think so. amdgpu_dpm and amdgpu_smu should be separate. I guess
we shouldn't plan to have additional APIs in amdgpu_dpm anymore and
move to component based APIs.

[Quan, Evan] Well, you could argue that. But as I said, imagine a user wants to
call some swsmu API from gfx_v9_0.c (some ASICs (ALDEBARAN/ARCTURUS) support
swsmu while others do not).
What will be used then? Maybe checking the ASIC type (knowing which ASICs
support swsmu) or checking for swsmu support before calling. Either way, we are
still leaking power implementation details.



Currently, I don't see a case like this. But as per the component
version architecture we have, it is managed with a component version. Like, we 
RE: [PATCH V3 01/17] drm/amd/pm: do not expose implementation details to other blocks out of power

2021-12-01 Thread Quan, Evan
[AMD Official Use Only]



> -Original Message-
> From: Lazar, Lijo 
> Sent: Thursday, December 2, 2021 1:12 PM
> To: Quan, Evan ; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Koenig, Christian
> ; Feng, Kenneth 
> Subject: Re: [PATCH V3 01/17] drm/amd/pm: do not expose implementation
> details to other blocks out of power
> 
> 
> 
> On 12/2/2021 10:22 AM, Quan, Evan wrote:
> > [AMD Official Use Only]
> >
> >
> >
> >> -Original Message-
> >> From: Lazar, Lijo 
> >> Sent: Thursday, December 2, 2021 12:13 PM
> >> To: Quan, Evan ; amd-gfx@lists.freedesktop.org
> >> Cc: Deucher, Alexander ; Koenig,
> Christian
> >> ; Feng, Kenneth 
> >> Subject: Re: [PATCH V3 01/17] drm/amd/pm: do not expose
> >> implementation details to other blocks out of power
> >>
> >>
> >>
> >> On 12/2/2021 8:39 AM, Evan Quan wrote:
> >>> Those implementation details (whether swsmu is supported, whether some
> >>> ppt_funcs are supported, access to internal statistics, ...) should be
> >>> kept internal. It is not good practice, and even error prone, to expose
> >>> implementation details.
> >>>
> >>> Signed-off-by: Evan Quan 
> >>> Change-Id: Ibca3462ceaa26a27a9145282b60c6ce5deca7752
> >>> ---
> >>>drivers/gpu/drm/amd/amdgpu/aldebaran.c|  2 +-
> >>>drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   | 25 +++---
> >>>drivers/gpu/drm/amd/amdgpu/amdgpu_device.c|  6 +-
> >>>drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c   | 18 +---
> >>>drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h   |  7 --
> >>>drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   |  5 +-
> >>>drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c   |  5 +-
> >>>drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c   |  2 +-
> >>>.../gpu/drm/amd/include/kgd_pp_interface.h|  4 +
> >>>drivers/gpu/drm/amd/pm/amdgpu_dpm.c   | 90
> >> +++
> >>>drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   | 25 +-
> >>>drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   | 11 +--
> >>>drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 26 +++---
> >>>13 files changed, 161 insertions(+), 65 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> >>> b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> >>> index bcfdb63b1d42..a545df4efce1 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> >>> @@ -260,7 +260,7 @@ static int aldebaran_mode2_restore_ip(struct
> >> amdgpu_device *adev)
> >>>   adev->gfx.rlc.funcs->resume(adev);
> >>>
> >>>   /* Wait for FW reset event complete */
> >>> - r = smu_wait_for_event(adev, SMU_EVENT_RESET_COMPLETE, 0);
> >>> + r = amdgpu_dpm_wait_for_event(adev,
> >> SMU_EVENT_RESET_COMPLETE, 0);
> >>
> >> Hi Evan,
> >>
> >> As mentioned in the earlier comments, I suggest you leave these
> >> newer APIs and take care of the rest of the APIs. These may be
> >> covered as amdgpu_smu* in another patch set. Till that time, it's not
> >> needed to move them to amdgpu_dpm (as mentioned before, some of them are
> >> not even remotely related to power management).
> > [Quan, Evan] This patch series relies heavily on that change. That is,
> > swsmu is another framework alongside powerplay, and all access should come
> > through amdgpu_dpm.c.
> > More specifically, patches 13 and 17 directly rely on this.
> > Furthermore, without the unified lock protection from patch 17, the changes
> > for dropping unneeded locks (which are already in my local branch) would be
> > impossible.
> >
> Patch 13 is directly related to smu context. I don't see many smu context
> related APIs added in amdgpu_dpm. I guess you could convert those APIs
> directly to pass amdgpu_device instead of smu_context.
> 
> Ex: smu_get_ecc_info(struct amdgpu_device *adev,
> 
> As for the mutex change, we could still use pm.mutex in place of smu mutex,
> right?
[Quan, Evan] I'm afraid such a partial change (some swsmu APIs called through
amdgpu_dpm while others are called via smu_* directly) will cause some chaos.
That is, some will have their lock protection (pm.mutex) in amdgpu_dpm.c while
others have it in amdgpu_smu.c.
That also means some swsmu APIs in amdgpu_smu.c need pm.mutex while others do
not.

I would prefer the current way, which converts all of them to be called through
amdgpu_dpm.
If needed, we can convert them all back to smu_* directly later (with a new
patch set).
That will be simpler.
> 
> > I'm fine with the idea of naming those APIs supported only by swsmu with
> > the prefix amdgpu_smu*. But that has to be done after this patch series.
> > And I would expect those APIs to be located in amdgpu_dpm.c (instead of
> > amdgpu_smu.c) as well.
> 
> I don't think so. amdgpu_dpm and amdgpu_smu should be separate. I guess
> we shouldn't plan to have additional APIs in amdgpu_dpm anymore and
> move to component based APIs.
[Quan, Evan] Well, you could argue that. But as I said, imagine a user wants to
call some swsmu API from gfx_v9_0.c (some ASICs (ALDEBARAN/ARCTURUS) support
swsmu while others do not).

Re: [PATCH V3 01/17] drm/amd/pm: do not expose implementation details to other blocks out of power

2021-12-01 Thread Lazar, Lijo




On 12/2/2021 10:22 AM, Quan, Evan wrote:

[AMD Official Use Only]




-Original Message-
From: Lazar, Lijo 
Sent: Thursday, December 2, 2021 12:13 PM
To: Quan, Evan ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Koenig, Christian
; Feng, Kenneth 
Subject: Re: [PATCH V3 01/17] drm/amd/pm: do not expose implementation
details to other blocks out of power



On 12/2/2021 8:39 AM, Evan Quan wrote:

Those implementation details (whether swsmu is supported, whether some ppt_funcs
are supported, access to internal statistics, ...) should be kept
internal. It is not good practice, and even error prone, to expose
implementation details.


Signed-off-by: Evan Quan 
Change-Id: Ibca3462ceaa26a27a9145282b60c6ce5deca7752
---
   drivers/gpu/drm/amd/amdgpu/aldebaran.c|  2 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   | 25 +++---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c|  6 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c   | 18 +---
   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h   |  7 --
   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   |  5 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c   |  5 +-
   drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c   |  2 +-
   .../gpu/drm/amd/include/kgd_pp_interface.h|  4 +
   drivers/gpu/drm/amd/pm/amdgpu_dpm.c   | 90

+++

   drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   | 25 +-
   drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   | 11 +--
   drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 26 +++---
   13 files changed, 161 insertions(+), 65 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/aldebaran.c
b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
index bcfdb63b1d42..a545df4efce1 100644
--- a/drivers/gpu/drm/amd/amdgpu/aldebaran.c
+++ b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
@@ -260,7 +260,7 @@ static int aldebaran_mode2_restore_ip(struct

amdgpu_device *adev)

adev->gfx.rlc.funcs->resume(adev);

/* Wait for FW reset event complete */
-   r = smu_wait_for_event(adev, SMU_EVENT_RESET_COMPLETE, 0);
+   r = amdgpu_dpm_wait_for_event(adev,

SMU_EVENT_RESET_COMPLETE, 0);

Hi Evan,

As mentioned in the earlier comments, I suggest you leave these newer
APIs and take care of the rest of the APIs. These may be covered as
amdgpu_smu* in another patch set. Till that time, it's not needed to move
them to amdgpu_dpm (as mentioned before, some of them are not
even remotely related to power management).

[Quan, Evan] This patch series relies heavily on that change. That is, swsmu is
another framework alongside powerplay, and all access should come through
amdgpu_dpm.c.
More specifically, patches 13 and 17 directly rely on this.
Furthermore, without the unified lock protection from patch 17, the changes
for dropping unneeded locks (which are already in my local branch) would be
impossible.

Patch 13 is directly related to smu context. I don't see many smu 
context related APIs added in amdgpu_dpm. I guess you could convert 
those APIs directly to pass amdgpu_device instead of smu_context.


Ex: smu_get_ecc_info(struct amdgpu_device *adev,

As for the mutex change, we could still use pm.mutex in place of smu 
mutex, right?



I'm fine with the idea of naming those APIs supported only by swsmu with the
prefix amdgpu_smu*. But that has to be done after this patch series.
And I would expect those APIs to be located in amdgpu_dpm.c (instead of
amdgpu_smu.c) as well.


I don't think so. amdgpu_dpm and amdgpu_smu should be separate. I guess 
we shouldn't plan to have additional APIs in amdgpu_dpm anymore and move 
to component based APIs.


Thanks,
Lijo



BR
Evan


Thanks,
Lijo


if (r) {
dev_err(adev->dev,
"Failed to get response from firmware after reset\n");

diff --git

a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 164d6a9e9fbb..0d1f00b24aae 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1585,22 +1585,25 @@ static int amdgpu_debugfs_sclk_set(void *data,

u64 val)

return ret;
}

-	if (is_support_sw_smu(adev)) {
-		ret = smu_get_dpm_freq_range(&adev->smu, SMU_SCLK, &min_freq, &max_freq);
-		if (ret || val > max_freq || val < min_freq)
-			return -EINVAL;
-		ret = smu_set_soft_freq_range(&adev->smu, SMU_SCLK, (uint32_t)val, (uint32_t)val);
-	} else {
-		return 0;
+	ret = amdgpu_dpm_get_dpm_freq_range(adev, PP_SCLK, &min_freq, &max_freq);
+	if (ret == -EOPNOTSUPP) {
+		ret = 0;
+		goto out;
 	}
+   if (ret || val > max_freq || val < min_freq) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+	ret = amdgpu_dpm_set_soft_freq_range(adev, PP_SCLK, (uint32_t)val, (uint32_t)val);

+   if (ret)
+   ret = -EINVAL;

+out:
pm_runtime_mark_last_busy(adev_to_drm(adev)->dev);

[PATCH] drm/amdgpu: Fix null pointer access of BO

2021-12-01 Thread xinhui pan
TTM wants bo->resource to be valid during a BO's lifetime.
But ttm_bo_mem_space might fail, leaving bo->resource pointing to NULL. A lot
of code then touches bo->resource and panics.

As the old and new mem might overlap, moving ttm_resource_free to after
ttm_bo_mem_space is not an option.
Instead, we can assign one sysmem node to the BO to keep bo->resource valid.

Signed-off-by: xinhui pan 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index c4317343967f..697fac0b82a3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -359,6 +359,7 @@ int amdgpu_bo_create_kernel_at(struct amdgpu_device *adev,
   struct amdgpu_bo **bo_ptr, void **cpu_addr)
 {
struct ttm_operation_ctx ctx = { false, false };
+   struct ttm_resource *tmp_res;
unsigned int i;
int r;
 
@@ -380,17 +381,26 @@ int amdgpu_bo_create_kernel_at(struct amdgpu_device *adev,
if (cpu_addr)
amdgpu_bo_kunmap(*bo_ptr);
 
-   ttm_resource_free(&(*bo_ptr)->tbo, &(*bo_ptr)->tbo.resource);
+   /* Assign one sysmem node to BO as we want bo->resource to be valid. */
+   amdgpu_bo_placement_from_domain(*bo_ptr, AMDGPU_GEM_DOMAIN_CPU);
+   r = ttm_bo_mem_space(&(*bo_ptr)->tbo, &(*bo_ptr)->placement,
+&tmp_res, &ctx);
+   if (r)
+   goto error;
+
+   ttm_bo_move_null(&(*bo_ptr)->tbo, tmp_res);
 
for (i = 0; i < (*bo_ptr)->placement.num_placement; ++i) {
(*bo_ptr)->placements[i].fpfn = offset >> PAGE_SHIFT;
(*bo_ptr)->placements[i].lpfn = (offset + size) >> PAGE_SHIFT;
}
 	r = ttm_bo_mem_space(&(*bo_ptr)->tbo, &(*bo_ptr)->placement,
-			     &(*bo_ptr)->tbo.resource, &ctx);
+			     &tmp_res, &ctx);
if (r)
goto error;
 
+   ttm_bo_move_null(&(*bo_ptr)->tbo, tmp_res);
+
if (cpu_addr) {
r = amdgpu_bo_kmap(*bo_ptr, cpu_addr);
if (r)
-- 
2.25.1



RE: [PATCH V3 01/17] drm/amd/pm: do not expose implementation details to other blocks out of power

2021-12-01 Thread Quan, Evan
[AMD Official Use Only]



> -Original Message-
> From: Lazar, Lijo 
> Sent: Thursday, December 2, 2021 12:13 PM
> To: Quan, Evan ; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Koenig, Christian
> ; Feng, Kenneth 
> Subject: Re: [PATCH V3 01/17] drm/amd/pm: do not expose implementation
> details to other blocks out of power
> 
> 
> 
> On 12/2/2021 8:39 AM, Evan Quan wrote:
> > Those implementation details (whether swsmu is supported, whether some
> > ppt_funcs are supported, access to internal statistics, ...) should be kept
> > internal. It is not good practice, and even error prone, to expose
> > implementation details.
> >
> > Signed-off-by: Evan Quan 
> > Change-Id: Ibca3462ceaa26a27a9145282b60c6ce5deca7752
> > ---
> >   drivers/gpu/drm/amd/amdgpu/aldebaran.c|  2 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   | 25 +++---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c|  6 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c   | 18 +---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h   |  7 --
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   |  5 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c   |  5 +-
> >   drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c   |  2 +-
> >   .../gpu/drm/amd/include/kgd_pp_interface.h|  4 +
> >   drivers/gpu/drm/amd/pm/amdgpu_dpm.c   | 90
> +++
> >   drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   | 25 +-
> >   drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   | 11 +--
> >   drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 26 +++---
> >   13 files changed, 161 insertions(+), 65 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> > b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> > index bcfdb63b1d42..a545df4efce1 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> > @@ -260,7 +260,7 @@ static int aldebaran_mode2_restore_ip(struct
> amdgpu_device *adev)
> > adev->gfx.rlc.funcs->resume(adev);
> >
> > /* Wait for FW reset event complete */
> > -   r = smu_wait_for_event(adev, SMU_EVENT_RESET_COMPLETE, 0);
> > +   r = amdgpu_dpm_wait_for_event(adev,
> SMU_EVENT_RESET_COMPLETE, 0);
> 
> Hi Evan,
> 
> As mentioned in the earlier comments, I suggest you leave these newer
> APIs and take care of the rest of the APIs. These may be covered as
> amdgpu_smu* in another patch set. Till that time, it's not needed to move
> them to amdgpu_dpm (as mentioned before, some of them are not
> even remotely related to power management).
[Quan, Evan] This patch series relies heavily on that change. That is, swsmu is
another framework alongside powerplay, and all access should come through
amdgpu_dpm.c.
More specifically, patches 13 and 17 directly rely on this.
Furthermore, without the unified lock protection from patch 17, the changes
for dropping unneeded locks (which are already in my local branch) would be
impossible.

I'm fine with the idea of naming those APIs supported only by swsmu with the
prefix amdgpu_smu*. But that has to be done after this patch series.
And I would expect those APIs to be located in amdgpu_dpm.c (instead of
amdgpu_smu.c) as well.

BR
Evan
> 
> Thanks,
> Lijo
> 
> > if (r) {
> > dev_err(adev->dev,
> > "Failed to get response from firmware after reset\n");
> diff --git
> > a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > index 164d6a9e9fbb..0d1f00b24aae 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > @@ -1585,22 +1585,25 @@ static int amdgpu_debugfs_sclk_set(void *data,
> u64 val)
> > return ret;
> > }
> >
> > -	if (is_support_sw_smu(adev)) {
> > -		ret = smu_get_dpm_freq_range(&adev->smu, SMU_SCLK, &min_freq, &max_freq);
> > -		if (ret || val > max_freq || val < min_freq)
> > -			return -EINVAL;
> > -		ret = smu_set_soft_freq_range(&adev->smu, SMU_SCLK, (uint32_t)val, (uint32_t)val);
> > -	} else {
> > -		return 0;
> > +	ret = amdgpu_dpm_get_dpm_freq_range(adev, PP_SCLK, &min_freq, &max_freq);
> > +	if (ret == -EOPNOTSUPP) {
> > +		ret = 0;
> > +		goto out;
> > 	}
> > +   if (ret || val > max_freq || val < min_freq) {
> > +   ret = -EINVAL;
> > +   goto out;
> > +   }
> > +
> > +	ret = amdgpu_dpm_set_soft_freq_range(adev, PP_SCLK, (uint32_t)val, (uint32_t)val);
> > +   if (ret)
> > +   ret = -EINVAL;
> >
> > +out:
> > pm_runtime_mark_last_busy(adev_to_drm(adev)->dev);
> > pm_runtime_put_autosuspend(adev_to_drm(adev)->dev);
> >
> > -   if (ret)
> > -   return -EINVAL;
> > -
> > -   return 0;
> > +   return ret;
> >   }
> >
> >   DEFINE_DEBUGFS_ATTRIBUTE(fops_ib_preempt, NULL, diff --git
> > a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 1989f9e9379e..41cc1ffb5809 100644
> > --- 

Re: [PATCH v4 1/6] drm: move the buddy allocator from i915 into common drm

2021-12-01 Thread kernel test robot
Hi Arunpravin,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on drm-intel/for-linux-next]
[also build test ERROR on v5.16-rc3]
[cannot apply to drm/drm-next drm-tip/drm-tip next-20211201]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:
https://github.com/0day-ci/linux/commits/Arunpravin/drm-move-the-buddy-allocator-from-i915-into-common-drm/20211202-004327
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: i386-allyesconfig (https://download.01.org/0day-ci/archive/20211202/202112021239.jptbrhi2-...@intel.com/config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/afbc900c0399e8c6220abd729932e877e81f37c8
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Arunpravin/drm-move-the-buddy-allocator-from-i915-into-common-drm/20211202-004327
        git checkout afbc900c0399e8c6220abd729932e877e81f37c8
        # save the config file to linux build tree
        mkdir build_dir
        make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All errors (new ones prefixed by >>):

   In file included from drivers/gpu/drm/i915/intel_memory_region.c:242:
>> drivers/gpu/drm/i915/selftests/intel_memory_region.c:23:10: fatal error: 
>> i915_buddy.h: No such file or directory
  23 | #include "i915_buddy.h"
 |  ^~
   compilation terminated.


vim +23 drivers/gpu/drm/i915/selftests/intel_memory_region.c

232a6ebae41919 Matthew Auld     2019-10-08  14
340be48f2c5a3c Matthew Auld     2019-10-25  15  #include "gem/i915_gem_context.h"
b908be543e4441 Matthew Auld     2019-10-25  16  #include "gem/i915_gem_lmem.h"
232a6ebae41919 Matthew Auld     2019-10-08  17  #include "gem/i915_gem_region.h"
340be48f2c5a3c Matthew Auld     2019-10-25  18  #include "gem/selftests/igt_gem_utils.h"
232a6ebae41919 Matthew Auld     2019-10-08  19  #include "gem/selftests/mock_context.h"
99919be74aa375 Thomas Hellström 2021-06-17  20  #include "gt/intel_engine_pm.h"
6804da20bb549e Chris Wilson     2019-10-27  21  #include "gt/intel_engine_user.h"
b908be543e4441 Matthew Auld     2019-10-25  22  #include "gt/intel_gt.h"
d53ec322dc7de3 Matthew Auld     2021-06-16 @23  #include "i915_buddy.h"
99919be74aa375 Thomas Hellström 2021-06-17  24  #include "gt/intel_migrate.h"
ba12993c522801 Matthew Auld     2020-01-29  25  #include "i915_memcpy.h"
d53ec322dc7de3 Matthew Auld     2021-06-16  26  #include "i915_ttm_buddy_manager.h"
01377a0d7e6648 Abdiel Janulgue  2019-10-25  27  #include "selftests/igt_flush_test.h"
2f0b97ca021186 Matthew Auld     2019-10-08  28  #include "selftests/i915_random.h"
232a6ebae41919 Matthew Auld     2019-10-08  29

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


Re: [PATCH V3 01/17] drm/amd/pm: do not expose implementation details to other blocks out of power

2021-12-01 Thread Lazar, Lijo




On 12/2/2021 8:39 AM, Evan Quan wrote:

Those implementation details (whether swsmu is supported, whether some ppt_funcs
are supported, access to internal statistics, ...) should be kept internal. It is
not good practice, and even error prone, to expose implementation details.

Signed-off-by: Evan Quan 
Change-Id: Ibca3462ceaa26a27a9145282b60c6ce5deca7752
---
  drivers/gpu/drm/amd/amdgpu/aldebaran.c|  2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   | 25 +++---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c|  6 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c   | 18 +---
  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h   |  7 --
  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   |  5 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c   |  5 +-
  drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c   |  2 +-
  .../gpu/drm/amd/include/kgd_pp_interface.h|  4 +
  drivers/gpu/drm/amd/pm/amdgpu_dpm.c   | 90 +++
  drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   | 25 +-
  drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   | 11 +--
  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 26 +++---
  13 files changed, 161 insertions(+), 65 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/aldebaran.c 
b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
index bcfdb63b1d42..a545df4efce1 100644
--- a/drivers/gpu/drm/amd/amdgpu/aldebaran.c
+++ b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
@@ -260,7 +260,7 @@ static int aldebaran_mode2_restore_ip(struct amdgpu_device 
*adev)
adev->gfx.rlc.funcs->resume(adev);
  
  	/* Wait for FW reset event complete */

-   r = smu_wait_for_event(adev, SMU_EVENT_RESET_COMPLETE, 0);
+   r = amdgpu_dpm_wait_for_event(adev, SMU_EVENT_RESET_COMPLETE, 0);


Hi Evan,

As mentioned in the earlier comments, I suggest you leave these newer
APIs and take care of the rest of the APIs. These may be covered as
amdgpu_smu* in another patch set. Till that time, it's not needed to
move them to amdgpu_dpm (as mentioned before, some of them are not
even remotely related to power management).


Thanks,
Lijo


if (r) {
dev_err(adev->dev,
"Failed to get response from firmware after reset\n");
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 164d6a9e9fbb..0d1f00b24aae 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1585,22 +1585,25 @@ static int amdgpu_debugfs_sclk_set(void *data, u64 val)
return ret;
}
  
-	if (is_support_sw_smu(adev)) {

-   ret = smu_get_dpm_freq_range(>smu, SMU_SCLK, _freq, 
_freq);
-   if (ret || val > max_freq || val < min_freq)
-   return -EINVAL;
-   ret = smu_set_soft_freq_range(>smu, SMU_SCLK, 
(uint32_t)val, (uint32_t)val);
-   } else {
-   return 0;
+   ret = amdgpu_dpm_get_dpm_freq_range(adev, PP_SCLK, &min_freq, &max_freq);
+   if (ret == -EOPNOTSUPP) {
+   ret = 0;
+   goto out;
}
+   if (ret || val > max_freq || val < min_freq) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   ret = amdgpu_dpm_set_soft_freq_range(adev, PP_SCLK, (uint32_t)val, 
(uint32_t)val);
+   if (ret)
+   ret = -EINVAL;
  
+out:

pm_runtime_mark_last_busy(adev_to_drm(adev)->dev);
pm_runtime_put_autosuspend(adev_to_drm(adev)->dev);
  
-	if (ret)

-   return -EINVAL;
-
-   return 0;
+   return ret;
  }
  
  DEFINE_DEBUGFS_ATTRIBUTE(fops_ib_preempt, NULL,

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1989f9e9379e..41cc1ffb5809 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2617,7 +2617,7 @@ static int amdgpu_device_ip_late_init(struct 
amdgpu_device *adev)
if (adev->asic_type == CHIP_ARCTURUS &&
amdgpu_passthrough(adev) &&
adev->gmc.xgmi.num_physical_nodes > 1)
-   smu_set_light_sbr(&adev->smu, true);
+   amdgpu_dpm_set_light_sbr(adev, true);
  
  	if (adev->gmc.xgmi.num_physical_nodes > 1) {

	mutex_lock(&mgpu_info.mutex);
@@ -2857,7 +2857,7 @@ static int amdgpu_device_ip_suspend_phase2(struct 
amdgpu_device *adev)
int i, r;
  
  	if (adev->in_s0ix)

-   amdgpu_gfx_state_change_set(adev, sGpuChangeState_D3Entry);
+   amdgpu_dpm_gfx_state_change(adev, sGpuChangeState_D3Entry);
  
  	for (i = adev->num_ip_blocks - 1; i >= 0; i--) {

if (!adev->ip_blocks[i].status.valid)
@@ -3982,7 +3982,7 @@ int amdgpu_device_resume(struct drm_device *dev, bool 
fbcon)
return 0;
  
  	if (adev->in_s0ix)

-   amdgpu_gfx_state_change_set(adev, sGpuChangeState_D0Entry);
+   amdgpu_dpm_gfx_state_change(adev, sGpuChangeState_D0Entry);
  
  	/* post card */


[PATCH V3 13/17] drm/amd/pm: do not expose the smu_context structure used internally in power

2021-12-01 Thread Evan Quan
This hides the power implementation details. As was done for the
powerplay framework, the smu_context is hooked to adev->powerplay.pp_handle.

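The change above is the classic opaque-handle idiom: callers of the power code hold only a `void *` (`pp_handle`), and only the SMU implementation knows the concrete `struct smu_context` layout. A minimal standalone sketch of the idiom (the names and fields below are illustrative, not the actual driver definitions):

```c
#include <assert.h>
#include <stdlib.h>

/* What callers see: an opaque handle, no struct layout exposed. */
struct amd_powerplay {
	void *pp_handle;
};

/* Private context, known only to the implementation below. */
struct smu_context {
	unsigned int feature_mask;
};

/* Create the context and hand it out as an opaque pointer. */
static void *smu_init(void)
{
	struct smu_context *smu = calloc(1, sizeof(*smu));

	if (smu)
		smu->feature_mask = 0xffu;
	return smu;
}

/* The implementation casts the handle back to its private type. */
static unsigned int smu_get_feature_mask(void *handle)
{
	struct smu_context *smu = handle;

	return smu->feature_mask;
}
```

Callers can then reach the SMU only through such wrappers, which is what later makes it possible to put uniform lock protection at that single boundary.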
Signed-off-by: Evan Quan 
Change-Id: I3969c9f62a8b63dc6e4321a488d8f15022ffeb3d
--
v1->v2:
  - drop smu_ppt_limit_type used internally from kgd_pp_interface.h(Lijo)
  - drop the smu_send_hbm_bad_pages_num() change which can be combined into
the patch ahead(Lijo)
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h   |  6 ---
 .../gpu/drm/amd/include/kgd_pp_interface.h|  3 ++
 drivers/gpu/drm/amd/pm/amdgpu_dpm.c   | 50 --
 drivers/gpu/drm/amd/pm/amdgpu_pm.c|  2 +-
 drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   |  4 --
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 52 ---
 .../gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c |  9 ++--
 .../gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c   |  9 ++--
 .../amd/pm/swsmu/smu11/sienna_cichlid_ppt.c   |  9 ++--
 .../gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c|  4 +-
 .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c|  9 ++--
 .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c|  8 +--
 12 files changed, 96 insertions(+), 69 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index c987813a4996..fefabd568483 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -99,7 +99,6 @@
 #include "amdgpu_gem.h"
 #include "amdgpu_doorbell.h"
 #include "amdgpu_amdkfd.h"
-#include "amdgpu_smu.h"
 #include "amdgpu_discovery.h"
 #include "amdgpu_mes.h"
 #include "amdgpu_umc.h"
@@ -950,11 +949,6 @@ struct amdgpu_device {
 
/* powerplay */
struct amd_powerplaypowerplay;
-
-   /* smu */
-   struct smu_context  smu;
-
-   /* dpm */
struct amdgpu_pmpm;
u32 cg_flags;
u32 pg_flags;
diff --git a/drivers/gpu/drm/amd/include/kgd_pp_interface.h 
b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
index 7919e96e772b..a8eec91c0995 100644
--- a/drivers/gpu/drm/amd/include/kgd_pp_interface.h
+++ b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
@@ -25,6 +25,9 @@
 #define __KGD_PP_INTERFACE_H__
 
 extern const struct amdgpu_ip_block_version pp_smu_ip_block;
+extern const struct amdgpu_ip_block_version smu_v11_0_ip_block;
+extern const struct amdgpu_ip_block_version smu_v12_0_ip_block;
+extern const struct amdgpu_ip_block_version smu_v13_0_ip_block;
 
 enum smu_event_type {
SMU_EVENT_RESET_COMPLETE = 0,
diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
index 45bc2486b1b4..cda7d21c1b3e 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
@@ -31,6 +31,7 @@
 #include "amdgpu_display.h"
 #include "hwmgr.h"
 #include 
+#include "amdgpu_smu.h"
 
 #define amdgpu_dpm_enable_bapm(adev, e) \

((adev)->powerplay.pp_funcs->enable_bapm((adev)->powerplay.pp_handle, (e)))
@@ -213,7 +214,7 @@ int amdgpu_dpm_baco_reset(struct amdgpu_device *adev)
 
 bool amdgpu_dpm_is_mode1_reset_supported(struct amdgpu_device *adev)
 {
-   struct smu_context *smu = &adev->smu;
+   struct smu_context *smu = adev->powerplay.pp_handle;
 
if (is_support_sw_smu(adev))
return smu_mode1_reset_is_support(smu);
@@ -223,7 +224,7 @@ bool amdgpu_dpm_is_mode1_reset_supported(struct 
amdgpu_device *adev)
 
 int amdgpu_dpm_mode1_reset(struct amdgpu_device *adev)
 {
-   struct smu_context *smu = &adev->smu;
+   struct smu_context *smu = adev->powerplay.pp_handle;
 
if (is_support_sw_smu(adev))
return smu_mode1_reset(smu);
@@ -276,7 +277,7 @@ int amdgpu_dpm_set_df_cstate(struct amdgpu_device *adev,
 
 int amdgpu_dpm_allow_xgmi_power_down(struct amdgpu_device *adev, bool en)
 {
-   struct smu_context *smu = &adev->smu;
+   struct smu_context *smu = adev->powerplay.pp_handle;
 
if (is_support_sw_smu(adev))
return smu_allow_xgmi_power_down(smu, en);
@@ -341,7 +342,7 @@ void amdgpu_pm_acpi_event_handler(struct amdgpu_device 
*adev)
	mutex_unlock(&adev->pm.mutex);
 
if (is_support_sw_smu(adev))
-   smu_set_ac_dc(&adev->smu);
+   smu_set_ac_dc(adev->powerplay.pp_handle);
}
 }
 
@@ -423,12 +424,14 @@ int amdgpu_pm_load_smu_firmware(struct amdgpu_device 
*adev, uint32_t *smu_versio
 
 int amdgpu_dpm_set_light_sbr(struct amdgpu_device *adev, bool enable)
 {
-   return smu_set_light_sbr(&adev->smu, enable);
+   return smu_set_light_sbr(adev->powerplay.pp_handle, enable);
 }
 
 int amdgpu_dpm_send_hbm_bad_pages_num(struct amdgpu_device *adev, uint32_t 
size)
 {
-   return smu_send_hbm_bad_pages_num(&adev->smu, size);
+   struct smu_context *smu = adev->powerplay.pp_handle;
+
+   return smu_send_hbm_bad_pages_num(smu, size);
 }
 
 int amdgpu_dpm_get_dpm_freq_range(struct amdgpu_device *adev,
@@ -441,7 

RE: [PATCH] drm/amdgpu: add support to SMU debug option

2021-12-01 Thread Yu, Lang
[AMD Official Use Only]



>-Original Message-
>From: Quan, Evan 
>Sent: Thursday, December 2, 2021 10:48 AM
>To: Yu, Lang ; Koenig, Christian
>; Christian König
>; amd-gfx@lists.freedesktop.org
>Cc: Deucher, Alexander ; Lazar, Lijo
>; Huang, Ray 
>Subject: RE: [PATCH] drm/amdgpu: add support to SMU debug option
>
>[AMD Official Use Only]
>
>
>
>> -Original Message-
>> From: amd-gfx  On Behalf Of Yu,
>> Lang
>> Sent: Wednesday, December 1, 2021 7:37 PM
>> To: Koenig, Christian ; Christian König
>> ; amd-gfx@lists.freedesktop.org
>> Cc: Deucher, Alexander ; Lazar, Lijo
>> ; Huang, Ray 
>> Subject: RE: [PATCH] drm/amdgpu: add support to SMU debug option
>>
>> [AMD Official Use Only]
>>
>>
>>
>> >-Original Message-
>> >From: Koenig, Christian 
>> >Sent: Wednesday, December 1, 2021 7:29 PM
>> >To: Yu, Lang ; Christian König
>> >; amd-gfx@lists.freedesktop.org
>> >Cc: Deucher, Alexander ; Lazar, Lijo
>> >; Huang, Ray 
>> >Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>> >
>> >Am 01.12.21 um 12:20 schrieb Yu, Lang:
>> >> [AMD Official Use Only]
>> >>
>> >>> -Original Message-
>> >>> From: Christian König 
>> >>> Sent: Wednesday, December 1, 2021 6:49 PM
>> >>> To: Yu, Lang ; Koenig, Christian
>> >>> ; amd-gfx@lists.freedesktop.org
>> >>> Cc: Deucher, Alexander ; Lazar, Lijo
>> >>> ; Huang, Ray 
>> >>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>> >>>
>> >>> Am 01.12.21 um 11:44 schrieb Yu, Lang:
>>  [AMD Official Use Only]
>> 
>> 
>> 
>> > -Original Message-
>> > From: Koenig, Christian 
>> > Sent: Wednesday, December 1, 2021 5:30 PM
>> > To: Yu, Lang ; amd-gfx@lists.freedesktop.org
>> > Cc: Deucher, Alexander ; Lazar, Lijo
>> > ; Huang, Ray 
>> > Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>> >
>> > Am 01.12.21 um 10:24 schrieb Lang Yu:
>> >> To maintain system error state when SMU errors occurred, which
>> >> will aid in debugging SMU firmware issues, add SMU debug option
>> support.
>> >>
>> >> It can be enabled or disabled via amdgpu_smu_debug debugfs file.
>> >> When enabled, it makes SMU errors fatal.
>> >> It is disabled by default.
>> >>
>> >> == Command Guide ==
>> >>
>> >> 1, enable SMU debug option
>> >>
>> >> # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>> >>
>> >> 2, disable SMU debug option
>> >>
>> >> # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>> >>
>> >> v3:
>> >> - Use debugfs_create_bool().(Christian)
>> >> - Put variable into smu_context struct.
>> >> - Don't resend command when timeout.
>> >>
>> >> v2:
>> >> - Resend command when timeout.(Lijo)
>> >> - Use debugfs file instead of module parameter.
>> >>
>> >> Signed-off-by: Lang Yu 
>> > Well the debugfs part looks really nice and clean now, but one
>> > more comment below.
>> >
>> >> ---
>> >> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c| 3 +++
>> >> drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h| 5 +
>> >> drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
>> >> drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 8 +++-
>> >> 4 files changed, 17 insertions(+), 1 deletion(-)
>> >>
>> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> >> index 164d6a9e9fbb..86cd888c7822 100644
>> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> >> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct
>> >> amdgpu_device
>> > *adev)
>> >>   if (!debugfs_initialized())
>> >>   return 0;
>> >>
>> >> + debugfs_create_bool("amdgpu_smu_debug", 0600, root,
>> >> +   &adev->smu.smu_debug_mode);
>> >> +
>> >>   ent = debugfs_create_file("amdgpu_preempt_ib", 0600,
>> root,
>> >adev,
>> >> &fops_ib_preempt);
>> >>   if (IS_ERR(ent)) {
>> >> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> >> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> >> index f738f7dc20c9..50dbf5594a9d 100644
>> >> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> >> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> >> @@ -569,6 +569,11 @@ struct smu_context
>> >>   struct smu_user_dpm_profile user_dpm_profile;
>> >>
>> >>   struct stb_context stb_context;
>> >> + /*
>> >> +  * When enabled, it makes SMU errors fatal.
>> >> +  * (0 = disabled (default), 1 = enabled)
>> >> +  */
>> >> + bool smu_debug_mode;
>> >> };
>> >>
>> >> struct i2c_adapter;
>> >> diff --git
>> a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> >> 

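The semantics under review in the thread above — a boolean debugfs knob that turns SMU command failures into hard errors — reduce to one check on the response path. The sketch below models that check in plain userspace C (a global flag and a counter stand in for `smu->smu_debug_mode` and the kernel's fatal-error handling; the actual patch wires the flag up with `debugfs_create_bool()`):

```c
#include <assert.h>
#include <stdbool.h>

static bool smu_debug_mode;     /* toggled via debugfs in the real driver */
static int  fatal_error_count;  /* stand-in for halting the system */

/* Deliver an SMU response code; escalate failures when debug mode is on. */
static int smu_handle_response(int status)
{
	if (status != 0 && smu_debug_mode)
		fatal_error_count++;    /* the real code would stop here */
	return status;
}
```

With the flag clear (the default), errors are reported and execution continues; with it set, the system state is frozen at the point of failure for firmware debugging.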
[PATCH V3 14/17] drm/amd/pm: relocate the power related headers

2021-12-01 Thread Evan Quan
Instead of centralizing all headers in the same folder, separate them into
different folders and place them alongside the source files that really
need them.

Signed-off-by: Evan Quan 
Change-Id: Id74cb4c7006327ca7ecd22daf17321e417c4aa71
---
 drivers/gpu/drm/amd/pm/Makefile   | 10 +++---
 drivers/gpu/drm/amd/pm/legacy-dpm/Makefile| 32 +++
 .../pm/{powerplay => legacy-dpm}/cik_dpm.h|  0
 .../amd/pm/{powerplay => legacy-dpm}/kv_dpm.c |  0
 .../amd/pm/{powerplay => legacy-dpm}/kv_dpm.h |  0
 .../amd/pm/{powerplay => legacy-dpm}/kv_smc.c |  0
 .../pm/{powerplay => legacy-dpm}/legacy_dpm.c |  0
 .../pm/{powerplay => legacy-dpm}/legacy_dpm.h |  0
 .../amd/pm/{powerplay => legacy-dpm}/ppsmc.h  |  0
 .../pm/{powerplay => legacy-dpm}/r600_dpm.h   |  0
 .../amd/pm/{powerplay => legacy-dpm}/si_dpm.c |  0
 .../amd/pm/{powerplay => legacy-dpm}/si_dpm.h |  0
 .../amd/pm/{powerplay => legacy-dpm}/si_smc.c |  0
 .../{powerplay => legacy-dpm}/sislands_smc.h  |  0
 drivers/gpu/drm/amd/pm/powerplay/Makefile |  6 +---
 .../pm/{ => powerplay}/inc/amd_powerplay.h|  0
 .../drm/amd/pm/{ => powerplay}/inc/cz_ppsmc.h |  0
 .../amd/pm/{ => powerplay}/inc/fiji_ppsmc.h   |  0
 .../pm/{ => powerplay}/inc/hardwaremanager.h  |  0
 .../drm/amd/pm/{ => powerplay}/inc/hwmgr.h|  0
 .../{ => powerplay}/inc/polaris10_pwrvirus.h  |  0
 .../amd/pm/{ => powerplay}/inc/power_state.h  |  0
 .../drm/amd/pm/{ => powerplay}/inc/pp_debug.h |  0
 .../amd/pm/{ => powerplay}/inc/pp_endian.h|  0
 .../amd/pm/{ => powerplay}/inc/pp_thermal.h   |  0
 .../amd/pm/{ => powerplay}/inc/ppinterrupt.h  |  0
 .../drm/amd/pm/{ => powerplay}/inc/rv_ppsmc.h |  0
 .../drm/amd/pm/{ => powerplay}/inc/smu10.h|  0
 .../pm/{ => powerplay}/inc/smu10_driver_if.h  |  0
 .../pm/{ => powerplay}/inc/smu11_driver_if.h  |  0
 .../gpu/drm/amd/pm/{ => powerplay}/inc/smu7.h |  0
 .../drm/amd/pm/{ => powerplay}/inc/smu71.h|  0
 .../pm/{ => powerplay}/inc/smu71_discrete.h   |  0
 .../drm/amd/pm/{ => powerplay}/inc/smu72.h|  0
 .../pm/{ => powerplay}/inc/smu72_discrete.h   |  0
 .../drm/amd/pm/{ => powerplay}/inc/smu73.h|  0
 .../pm/{ => powerplay}/inc/smu73_discrete.h   |  0
 .../drm/amd/pm/{ => powerplay}/inc/smu74.h|  0
 .../pm/{ => powerplay}/inc/smu74_discrete.h   |  0
 .../drm/amd/pm/{ => powerplay}/inc/smu75.h|  0
 .../pm/{ => powerplay}/inc/smu75_discrete.h   |  0
 .../amd/pm/{ => powerplay}/inc/smu7_common.h  |  0
 .../pm/{ => powerplay}/inc/smu7_discrete.h|  0
 .../amd/pm/{ => powerplay}/inc/smu7_fusion.h  |  0
 .../amd/pm/{ => powerplay}/inc/smu7_ppsmc.h   |  0
 .../gpu/drm/amd/pm/{ => powerplay}/inc/smu8.h |  0
 .../amd/pm/{ => powerplay}/inc/smu8_fusion.h  |  0
 .../gpu/drm/amd/pm/{ => powerplay}/inc/smu9.h |  0
 .../pm/{ => powerplay}/inc/smu9_driver_if.h   |  0
 .../{ => powerplay}/inc/smu_ucode_xfer_cz.h   |  0
 .../{ => powerplay}/inc/smu_ucode_xfer_vi.h   |  0
 .../drm/amd/pm/{ => powerplay}/inc/smumgr.h   |  0
 .../amd/pm/{ => powerplay}/inc/tonga_ppsmc.h  |  0
 .../amd/pm/{ => powerplay}/inc/vega10_ppsmc.h |  0
 .../inc/vega12/smu9_driver_if.h   |  0
 .../amd/pm/{ => powerplay}/inc/vega12_ppsmc.h |  0
 .../amd/pm/{ => powerplay}/inc/vega20_ppsmc.h |  0
 .../amd/pm/{ => swsmu}/inc/aldebaran_ppsmc.h  |  0
 .../drm/amd/pm/{ => swsmu}/inc/amdgpu_smu.h   |  0
 .../amd/pm/{ => swsmu}/inc/arcturus_ppsmc.h   |  0
 .../inc/smu11_driver_if_arcturus.h|  0
 .../inc/smu11_driver_if_cyan_skillfish.h  |  0
 .../{ => swsmu}/inc/smu11_driver_if_navi10.h  |  0
 .../inc/smu11_driver_if_sienna_cichlid.h  |  0
 .../{ => swsmu}/inc/smu11_driver_if_vangogh.h |  0
 .../amd/pm/{ => swsmu}/inc/smu12_driver_if.h  |  0
 .../inc/smu13_driver_if_aldebaran.h   |  0
 .../inc/smu13_driver_if_yellow_carp.h |  0
 .../pm/{ => swsmu}/inc/smu_11_0_cdr_table.h   |  0
 .../drm/amd/pm/{ => swsmu}/inc/smu_types.h|  0
 .../drm/amd/pm/{ => swsmu}/inc/smu_v11_0.h|  0
 .../pm/{ => swsmu}/inc/smu_v11_0_7_ppsmc.h|  0
 .../pm/{ => swsmu}/inc/smu_v11_0_7_pptable.h  |  0
 .../amd/pm/{ => swsmu}/inc/smu_v11_0_ppsmc.h  |  0
 .../pm/{ => swsmu}/inc/smu_v11_0_pptable.h|  0
 .../amd/pm/{ => swsmu}/inc/smu_v11_5_pmfw.h   |  0
 .../amd/pm/{ => swsmu}/inc/smu_v11_5_ppsmc.h  |  0
 .../amd/pm/{ => swsmu}/inc/smu_v11_8_pmfw.h   |  0
 .../amd/pm/{ => swsmu}/inc/smu_v11_8_ppsmc.h  |  0
 .../drm/amd/pm/{ => swsmu}/inc/smu_v12_0.h|  0
 .../amd/pm/{ => swsmu}/inc/smu_v12_0_ppsmc.h  |  0
 .../drm/amd/pm/{ => swsmu}/inc/smu_v13_0.h|  0
 .../amd/pm/{ => swsmu}/inc/smu_v13_0_1_pmfw.h |  0
 .../pm/{ => swsmu}/inc/smu_v13_0_1_ppsmc.h|  0
 .../pm/{ => swsmu}/inc/smu_v13_0_pptable.h|  0
 .../gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c |  1 -
 .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c|  1 -
 87 files changed, 39 insertions(+), 11 deletions(-)
 create mode 100644 drivers/gpu/drm/amd/pm/legacy-dpm/Makefile
 rename drivers/gpu/drm/amd/pm/{powerplay => legacy-dpm}/cik_dpm.h (100%)
 

[PATCH V3 16/17] drm/amd/pm: revise the performance level setting APIs

2021-12-01 Thread Evan Quan
Avoid cross calls, which make it impossible to enforce lock protection
on amdgpu_dpm_force_performance_level().

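The profile-mode handling this patch lifts into the sysfs handler is a pure bitmask transition test: enter the UMD pstate (ungate GFX clock/power gating) when moving from a non-profile level to a profile level, and leave it on the opposite move. A hedged sketch of just that decision (the enum values below are illustrative; only the mask logic mirrors the patch):

```c
#include <assert.h>

enum {
	LEVEL_AUTO       = 1 << 0,
	PROFILE_STANDARD = 1 << 1,
	PROFILE_MIN_SCLK = 1 << 2,
	PROFILE_MIN_MCLK = 1 << 3,
	PROFILE_PEAK     = 1 << 4,
};

#define PROFILE_MODE_MASK (PROFILE_STANDARD | PROFILE_MIN_SCLK | \
			   PROFILE_MIN_MCLK | PROFILE_PEAK)

/* Returns 1 when entering UMD pstate, -1 when leaving it, 0 otherwise. */
static int umd_pstate_transition(unsigned int current_level,
				 unsigned int new_level)
{
	int in_profile = !!(current_level & PROFILE_MODE_MASK);
	int to_profile = !!(new_level & PROFILE_MODE_MASK);

	if (!in_profile && to_profile)
		return 1;	/* ungate GFX CG/PG before forcing the level */
	if (in_profile && !to_profile)
		return -1;	/* re-enable gating after leaving profile mode */
	return 0;
}
```

Computing `profile_mode_mask` once and comparing both the current and requested levels against it keeps the enter/exit branches symmetric, which is what the patch's restructuring achieves.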
Signed-off-by: Evan Quan 
Change-Id: Ie658140f40ab906ce2ec47576a086062b61076a6
---
 drivers/gpu/drm/amd/pm/amdgpu_pm.c| 29 ---
 .../gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c| 17 ++-
 .../gpu/drm/amd/pm/powerplay/amd_powerplay.c  | 12 
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 12 
 4 files changed, 34 insertions(+), 36 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
index f5c0ae032954..5e5006af6b75 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
@@ -305,6 +305,10 @@ static ssize_t 
amdgpu_set_power_dpm_force_performance_level(struct device *dev,
enum amd_dpm_forced_level level;
enum amd_dpm_forced_level current_level;
int ret = 0;
+   uint32_t profile_mode_mask = AMD_DPM_FORCED_LEVEL_PROFILE_STANDARD |
+   AMD_DPM_FORCED_LEVEL_PROFILE_MIN_SCLK |
+   AMD_DPM_FORCED_LEVEL_PROFILE_MIN_MCLK |
+   AMD_DPM_FORCED_LEVEL_PROFILE_PEAK;
 
if (amdgpu_in_reset(adev))
return -EPERM;
@@ -358,10 +362,7 @@ static ssize_t 
amdgpu_set_power_dpm_force_performance_level(struct device *dev,
}
 
/* profile_exit setting is valid only when current mode is in profile 
mode */
-   if (!(current_level & (AMD_DPM_FORCED_LEVEL_PROFILE_STANDARD |
-   AMD_DPM_FORCED_LEVEL_PROFILE_MIN_SCLK |
-   AMD_DPM_FORCED_LEVEL_PROFILE_MIN_MCLK |
-   AMD_DPM_FORCED_LEVEL_PROFILE_PEAK)) &&
+   if (!(current_level & profile_mode_mask) &&
(level == AMD_DPM_FORCED_LEVEL_PROFILE_EXIT)) {
pr_err("Currently not in any profile mode!\n");
pm_runtime_mark_last_busy(ddev->dev);
@@ -369,6 +370,26 @@ static ssize_t 
amdgpu_set_power_dpm_force_performance_level(struct device *dev,
return -EINVAL;
}
 
+   if (!(current_level & profile_mode_mask) &&
+ (level & profile_mode_mask)) {
+   /* enter UMD Pstate */
+   amdgpu_device_ip_set_powergating_state(adev,
+  AMD_IP_BLOCK_TYPE_GFX,
+  AMD_PG_STATE_UNGATE);
+   amdgpu_device_ip_set_clockgating_state(adev,
+  AMD_IP_BLOCK_TYPE_GFX,
+  AMD_CG_STATE_UNGATE);
+   } else if ((current_level & profile_mode_mask) &&
+   !(level & profile_mode_mask)) {
+   /* exit UMD Pstate */
+   amdgpu_device_ip_set_clockgating_state(adev,
+  AMD_IP_BLOCK_TYPE_GFX,
+  AMD_CG_STATE_GATE);
+   amdgpu_device_ip_set_powergating_state(adev,
+  AMD_IP_BLOCK_TYPE_GFX,
+  AMD_PG_STATE_GATE);
+   }
+
if (amdgpu_dpm_force_performance_level(adev, level)) {
pm_runtime_mark_last_busy(ddev->dev);
pm_runtime_put_autosuspend(ddev->dev);
diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c 
b/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c
index 3c6ee493e410..9613c6181c17 100644
--- a/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c
+++ b/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c
@@ -953,6 +953,7 @@ static struct amdgpu_ps *amdgpu_dpm_pick_power_state(struct 
amdgpu_device *adev,
 
 static int amdgpu_dpm_change_power_state_locked(struct amdgpu_device *adev)
 {
+   const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
struct amdgpu_ps *ps;
enum amd_pm_state_type dpm_state;
int ret;
@@ -976,7 +977,7 @@ static int amdgpu_dpm_change_power_state_locked(struct 
amdgpu_device *adev)
else
return -EINVAL;
 
-   if (amdgpu_dpm == 1 && adev->powerplay.pp_funcs->print_power_state) {
+   if (amdgpu_dpm == 1 && pp_funcs->print_power_state) {
printk("switching from power state:\n");
amdgpu_dpm_print_power_state(adev, adev->pm.dpm.current_ps);
printk("switching to power state:\n");
@@ -985,14 +986,14 @@ static int amdgpu_dpm_change_power_state_locked(struct 
amdgpu_device *adev)
 
/* update whether vce is active */
ps->vce_active = adev->pm.dpm.vce_active;
-   if (adev->powerplay.pp_funcs->display_configuration_changed)
+   if (pp_funcs->display_configuration_changed)
amdgpu_dpm_display_configuration_changed(adev);
 
ret = amdgpu_dpm_pre_set_power_state(adev);
if (ret)
return ret;
 
-

RE: [PATCH] drm/amdgpu: add support to SMU debug option

2021-12-01 Thread Quan, Evan
[AMD Official Use Only]



> -Original Message-
> From: amd-gfx  On Behalf Of Yu,
> Lang
> Sent: Wednesday, December 1, 2021 7:37 PM
> To: Koenig, Christian ; Christian König
> ; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Lazar, Lijo
> ; Huang, Ray 
> Subject: RE: [PATCH] drm/amdgpu: add support to SMU debug option
> 
> [AMD Official Use Only]
> 
> 
> 
> >-Original Message-
> >From: Koenig, Christian 
> >Sent: Wednesday, December 1, 2021 7:29 PM
> >To: Yu, Lang ; Christian König
> >; amd-gfx@lists.freedesktop.org
> >Cc: Deucher, Alexander ; Lazar, Lijo
> >; Huang, Ray 
> >Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
> >
> >Am 01.12.21 um 12:20 schrieb Yu, Lang:
> >> [AMD Official Use Only]
> >>
> >>> -Original Message-
> >>> From: Christian König 
> >>> Sent: Wednesday, December 1, 2021 6:49 PM
> >>> To: Yu, Lang ; Koenig, Christian
> >>> ; amd-gfx@lists.freedesktop.org
> >>> Cc: Deucher, Alexander ; Lazar, Lijo
> >>> ; Huang, Ray 
> >>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
> >>>
> >>> Am 01.12.21 um 11:44 schrieb Yu, Lang:
>  [AMD Official Use Only]
> 
> 
> 
> > -Original Message-
> > From: Koenig, Christian 
> > Sent: Wednesday, December 1, 2021 5:30 PM
> > To: Yu, Lang ; amd-gfx@lists.freedesktop.org
> > Cc: Deucher, Alexander ; Lazar, Lijo
> > ; Huang, Ray 
> > Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
> >
> > Am 01.12.21 um 10:24 schrieb Lang Yu:
> >> To maintain system error state when SMU errors occurred, which
> >> will aid in debugging SMU firmware issues, add SMU debug option
> support.
> >>
> >> It can be enabled or disabled via amdgpu_smu_debug debugfs file.
> >> When enabled, it makes SMU errors fatal.
> >> It is disabled by default.
> >>
> >> == Command Guide ==
> >>
> >> 1, enable SMU debug option
> >>
> >> # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> >>
> >> 2, disable SMU debug option
> >>
> >> # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> >>
> >> v3:
> >> - Use debugfs_create_bool().(Christian)
> >> - Put variable into smu_context struct.
> >> - Don't resend command when timeout.
> >>
> >> v2:
> >> - Resend command when timeout.(Lijo)
> >> - Use debugfs file instead of module parameter.
> >>
> >> Signed-off-by: Lang Yu 
> > Well the debugfs part looks really nice and clean now, but one
> > more comment below.
> >
> >> ---
> >> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c| 3 +++
> >> drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h| 5 +
> >> drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
> >> drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 8 +++-
> >> 4 files changed, 17 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> >> index 164d6a9e9fbb..86cd888c7822 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> >> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct
> >> amdgpu_device
> > *adev)
> >>if (!debugfs_initialized())
> >>return 0;
> >>
> >> +  debugfs_create_bool("amdgpu_smu_debug", 0600, root,
> >> +  &adev->smu.smu_debug_mode);
> >> +
> >>ent = debugfs_create_file("amdgpu_preempt_ib", 0600,
> root,
> >adev,
> >>  &fops_ib_preempt);
> >>if (IS_ERR(ent)) {
> >> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> >> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> >> index f738f7dc20c9..50dbf5594a9d 100644
> >> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> >> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> >> @@ -569,6 +569,11 @@ struct smu_context
> >>struct smu_user_dpm_profile user_dpm_profile;
> >>
> >>struct stb_context stb_context;
> >> +  /*
> >> +   * When enabled, it makes SMU errors fatal.
> >> +   * (0 = disabled (default), 1 = enabled)
> >> +   */
> >> +  bool smu_debug_mode;
> >> };
> >>
> >> struct i2c_adapter;
> >> diff --git
> a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> >> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> >> index 6e781cee8bb6..d3797a2d6451 100644
> >> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> >> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> >> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct
> > smu_context *smu)
> >> out:
> >>mutex_unlock(&smu->message_lock);
> >>
> >> +  

[PATCH V3 17/17] drm/amd/pm: unified lock protections in amdgpu_dpm.c

2021-12-01 Thread Evan Quan
With amdgpu_dpm.c now the only entry point, it is safe and reasonable to
enforce the lock protections there. With this, the other internally used
power locks can be dropped.

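The pattern applied throughout the patch is uniform: at the single entry point, check for the optional callback, take `adev->pm.mutex`, dispatch, release. A dependency-free sketch of that wrapper shape (a depth counter models the mutex and a small ops table stands in for `pp_funcs`; these names are illustrative):

```c
#include <assert.h>
#include <stddef.h>

static int lock_depth;	/* models adev->pm.mutex being held */

static void pm_lock(void)   { lock_depth++; }
static void pm_unlock(void) { lock_depth--; }

struct pm_funcs {
	int (*get_sclk)(int low);
};

static int backend_get_sclk(int low)
{
	assert(lock_depth == 1);	/* backend must run under the lock */
	return low ? 300 : 1200;
}

/* Single entry point: optional-op check, then lock around the dispatch. */
static int dpm_get_sclk(const struct pm_funcs *funcs, int low)
{
	int ret;

	if (!funcs->get_sclk)
		return 0;

	pm_lock();
	ret = funcs->get_sclk(low);
	pm_unlock();

	return ret;
}
```

Because every backend call funnels through wrappers of this shape, the backends themselves no longer need private locks, which is why the patch can delete them.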
Signed-off-by: Evan Quan 
Change-Id: Iad228cad0b3d8c41927def08965a52525f3f51d3
---
 drivers/gpu/drm/amd/pm/amdgpu_dpm.c| 719 +++--
 drivers/gpu/drm/amd/pm/legacy-dpm/kv_dpm.c |  16 +-
 drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c |  16 +-
 3 files changed, 536 insertions(+), 215 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
index cda7d21c1b3e..73a419366355 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
@@ -39,15 +39,33 @@
 int amdgpu_dpm_get_sclk(struct amdgpu_device *adev, bool low)
 {
const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
+   int ret = 0;
+
+   if (!pp_funcs->get_sclk)
+   return 0;
 
-   return pp_funcs->get_sclk((adev)->powerplay.pp_handle, (low));
+   mutex_lock(&adev->pm.mutex);
+   ret = pp_funcs->get_sclk((adev)->powerplay.pp_handle,
+low);
+   mutex_unlock(&adev->pm.mutex);
+
+   return ret;
 }
 
 int amdgpu_dpm_get_mclk(struct amdgpu_device *adev, bool low)
 {
const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
+   int ret = 0;
+
+   if (!pp_funcs->get_mclk)
+   return 0;
+
+   mutex_lock(&adev->pm.mutex);
+   ret = pp_funcs->get_mclk((adev)->powerplay.pp_handle,
+low);
+   mutex_unlock(&adev->pm.mutex);
 
-   return pp_funcs->get_mclk((adev)->powerplay.pp_handle, (low));
+   return ret;
 }
 
 int amdgpu_dpm_set_powergating_by_smu(struct amdgpu_device *adev, uint32_t 
block_type, bool gate)
@@ -62,52 +80,20 @@ int amdgpu_dpm_set_powergating_by_smu(struct amdgpu_device 
*adev, uint32_t block
return 0;
}
 
+   mutex_lock(&adev->pm.mutex);
+
switch (block_type) {
case AMD_IP_BLOCK_TYPE_UVD:
case AMD_IP_BLOCK_TYPE_VCE:
-   if (pp_funcs && pp_funcs->set_powergating_by_smu) {
-   /*
-* TODO: need a better lock mechanism
-*
-* Here adev->pm.mutex lock protection is enforced on
-* UVD and VCE cases only. Since for other cases, there
-* may be already lock protection in amdgpu_pm.c.
-* This is a quick fix for the deadlock issue below.
-* NFO: task ocltst:2028 blocked for more than 120 
seconds.
-* Tainted: G   OE 5.0.0-37-generic 
#40~18.04.1-Ubuntu
-* echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
-* cltst  D0  2028   2026 0x
-* all Trace:
-* __schedule+0x2c0/0x870
-* schedule+0x2c/0x70
-* schedule_preempt_disabled+0xe/0x10
-* __mutex_lock.isra.9+0x26d/0x4e0
-* __mutex_lock_slowpath+0x13/0x20
-* ? __mutex_lock_slowpath+0x13/0x20
-* mutex_lock+0x2f/0x40
-* amdgpu_dpm_set_powergating_by_smu+0x64/0xe0 
[amdgpu]
-* 
gfx_v8_0_enable_gfx_static_mg_power_gating+0x3c/0x70 [amdgpu]
-* gfx_v8_0_set_powergating_state+0x66/0x260 
[amdgpu]
-* amdgpu_device_ip_set_powergating_state+0x62/0xb0 
[amdgpu]
-* pp_dpm_force_performance_level+0xe7/0x100 
[amdgpu]
-* 
amdgpu_set_dpm_forced_performance_level+0x129/0x330 [amdgpu]
-*/
-   mutex_lock(&adev->pm.mutex);
-   ret = (pp_funcs->set_powergating_by_smu(
-   (adev)->powerplay.pp_handle, block_type, gate));
-   mutex_unlock(&adev->pm.mutex);
-   }
-   break;
case AMD_IP_BLOCK_TYPE_GFX:
case AMD_IP_BLOCK_TYPE_VCN:
case AMD_IP_BLOCK_TYPE_SDMA:
case AMD_IP_BLOCK_TYPE_JPEG:
case AMD_IP_BLOCK_TYPE_GMC:
case AMD_IP_BLOCK_TYPE_ACP:
-   if (pp_funcs && pp_funcs->set_powergating_by_smu) {
+   if (pp_funcs && pp_funcs->set_powergating_by_smu)
ret = (pp_funcs->set_powergating_by_smu(
(adev)->powerplay.pp_handle, block_type, gate));
-   }
break;
default:
break;
@@ -116,6 +102,8 @@ int amdgpu_dpm_set_powergating_by_smu(struct amdgpu_device 
*adev, uint32_t block
if (!ret)
	atomic_set(&adev->pm.pwr_state[block_type], pwr_state);
 
+   

[PATCH V3 15/17] drm/amd/pm: drop unnecessary gfxoff controls

2021-12-01 Thread Evan Quan
The gfxoff controls added for some specific ASICs are unnecessary: the
functionality is not affected without them. Dropping them also aligns
these ASICs with the others.

Signed-off-by: Evan Quan 
Change-Id: Ia8475ef9e97635441aca5e0a7693e2a515498523
---
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c |  4 ---
 .../amd/pm/swsmu/smu11/sienna_cichlid_ppt.c   | 25 +--
 .../gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c|  7 --
 .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c|  7 --
 4 files changed, 1 insertion(+), 42 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c 
b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
index 70fc6bb00d1f..1edc71dde3e4 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
@@ -1541,8 +1541,6 @@ static int smu_reset(struct smu_context *smu)
struct amdgpu_device *adev = smu->adev;
int ret;
 
-   amdgpu_gfx_off_ctrl(smu->adev, false);
-
ret = smu_hw_fini(adev);
if (ret)
return ret;
@@ -1555,8 +1553,6 @@ static int smu_reset(struct smu_context *smu)
if (ret)
return ret;
 
-   amdgpu_gfx_off_ctrl(smu->adev, true);
-
return 0;
 }
 
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
index 6a5064f4ea86..9766870987db 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
@@ -1036,10 +1036,6 @@ static int sienna_cichlid_print_clk_levels(struct 
smu_context *smu,
if (ret)
goto print_clk_out;
 
-   /* no need to disable gfxoff when retrieving the current gfxclk 
*/
-   if ((clk_type == SMU_GFXCLK) || (clk_type == SMU_SCLK))
-   amdgpu_gfx_off_ctrl(adev, false);
-
	ret = smu_v11_0_get_dpm_level_count(smu, clk_type, &count);
if (ret)
goto print_clk_out;
@@ -1168,25 +1164,18 @@ static int sienna_cichlid_print_clk_levels(struct 
smu_context *smu,
}
 
 print_clk_out:
-   if ((clk_type == SMU_GFXCLK) || (clk_type == SMU_SCLK))
-   amdgpu_gfx_off_ctrl(adev, true);
-
return size;
 }
 
 static int sienna_cichlid_force_clk_levels(struct smu_context *smu,
   enum smu_clk_type clk_type, uint32_t mask)
 {
-   struct amdgpu_device *adev = smu->adev;
int ret = 0;
uint32_t soft_min_level = 0, soft_max_level = 0, min_freq = 0, max_freq 
= 0;
 
soft_min_level = mask ? (ffs(mask) - 1) : 0;
soft_max_level = mask ? (fls(mask) - 1) : 0;
 
-   if ((clk_type == SMU_GFXCLK) || (clk_type == SMU_SCLK))
-   amdgpu_gfx_off_ctrl(adev, false);
-
switch (clk_type) {
case SMU_GFXCLK:
case SMU_SCLK:
@@ -1220,9 +1209,6 @@ static int sienna_cichlid_force_clk_levels(struct 
smu_context *smu,
}
 
 forec_level_out:
-   if ((clk_type == SMU_GFXCLK) || (clk_type == SMU_SCLK))
-   amdgpu_gfx_off_ctrl(adev, true);
-
return 0;
 }
 
@@ -1865,16 +1851,7 @@ static int sienna_cichlid_get_dpm_ultimate_freq(struct 
smu_context *smu,
enum smu_clk_type clk_type,
uint32_t *min, uint32_t *max)
 {
-   struct amdgpu_device *adev = smu->adev;
-   int ret;
-
-   if (clk_type == SMU_GFXCLK)
-   amdgpu_gfx_off_ctrl(adev, false);
-   ret = smu_v11_0_get_dpm_ultimate_freq(smu, clk_type, min, max);
-   if (clk_type == SMU_GFXCLK)
-   amdgpu_gfx_off_ctrl(adev, true);
-
-   return ret;
+   return smu_v11_0_get_dpm_ultimate_freq(smu, clk_type, min, max);
 }
 
 static void sienna_cichlid_dump_od_table(struct smu_context *smu,
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
index 2a53b5b1d261..fd188ee3ab54 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
@@ -1798,7 +1798,6 @@ int smu_v11_0_set_soft_freq_limited_range(struct 
smu_context *smu,
  uint32_t min,
  uint32_t max)
 {
-   struct amdgpu_device *adev = smu->adev;
int ret = 0, clk_id = 0;
uint32_t param;
 
@@ -1811,9 +1810,6 @@ int smu_v11_0_set_soft_freq_limited_range(struct 
smu_context *smu,
if (clk_id < 0)
return clk_id;
 
-   if (clk_type == SMU_GFXCLK)
-   amdgpu_gfx_off_ctrl(adev, false);
-
if (max > 0) {
param = (uint32_t)((clk_id << 16) | (max & 0xffff));
ret = smu_cmn_send_smc_msg_with_param(smu, 
SMU_MSG_SetSoftMaxByFreq,
@@ -1831,9 +1827,6 @@ int smu_v11_0_set_soft_freq_limited_range(struct 
smu_context *smu,
}
 
 out:
-   if (clk_type == 

[PATCH V3 11/17] drm/amd/pm: correct the usage for amdgpu_dpm_dispatch_task()

2021-12-01 Thread Evan Quan
We should avoid having multi-function APIs. It should be up to the caller
to determine when or whether to call amdgpu_dpm_dispatch_task().

Signed-off-by: Evan Quan 
Change-Id: I78ec4eb8ceb6e526a4734113d213d15a5fbaa8a4
---
 drivers/gpu/drm/amd/pm/amdgpu_dpm.c | 18 ++
 drivers/gpu/drm/amd/pm/amdgpu_pm.c  | 26 --
 2 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
index 16371879cfc1..45bc2486b1b4 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
@@ -551,8 +551,6 @@ void amdgpu_dpm_set_power_state(struct amdgpu_device *adev,
enum amd_pm_state_type state)
 {
adev->pm.dpm.user_state = state;
-
-   amdgpu_dpm_dispatch_task(adev, AMD_PP_TASK_ENABLE_USER_STATE, &state);
 }
 
 enum amd_dpm_forced_level amdgpu_dpm_get_performance_level(struct 
amdgpu_device *adev)
@@ -720,13 +718,7 @@ int amdgpu_dpm_set_sclk_od(struct amdgpu_device *adev, 
uint32_t value)
if (!pp_funcs->set_sclk_od)
return -EOPNOTSUPP;
 
-   pp_funcs->set_sclk_od(adev->powerplay.pp_handle, value);
-
-   amdgpu_dpm_dispatch_task(adev,
-AMD_PP_TASK_READJUST_POWER_STATE,
-NULL);
-
-   return 0;
+   return pp_funcs->set_sclk_od(adev->powerplay.pp_handle, value);
 }
 
 int amdgpu_dpm_get_mclk_od(struct amdgpu_device *adev)
@@ -746,13 +738,7 @@ int amdgpu_dpm_set_mclk_od(struct amdgpu_device *adev, 
uint32_t value)
if (!pp_funcs->set_mclk_od)
return -EOPNOTSUPP;
 
-   pp_funcs->set_mclk_od(adev->powerplay.pp_handle, value);
-
-   amdgpu_dpm_dispatch_task(adev,
-AMD_PP_TASK_READJUST_POWER_STATE,
-NULL);
-
-   return 0;
+   return pp_funcs->set_mclk_od(adev->powerplay.pp_handle, value);
 }
 
 int amdgpu_dpm_get_power_profile_mode(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
index fa2f4e11e94e..89e1134d660f 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
@@ -187,6 +187,10 @@ static ssize_t amdgpu_set_power_dpm_state(struct device 
*dev,
 
amdgpu_dpm_set_power_state(adev, state);
 
+   amdgpu_dpm_dispatch_task(adev,
+AMD_PP_TASK_ENABLE_USER_STATE,
+&state);
+
pm_runtime_mark_last_busy(ddev->dev);
pm_runtime_put_autosuspend(ddev->dev);
 
@@ -1278,7 +1282,16 @@ static ssize_t amdgpu_set_pp_sclk_od(struct device *dev,
return ret;
}
 
-   amdgpu_dpm_set_sclk_od(adev, (uint32_t)value);
+   ret = amdgpu_dpm_set_sclk_od(adev, (uint32_t)value);
+   if (ret) {
+   pm_runtime_mark_last_busy(ddev->dev);
+   pm_runtime_put_autosuspend(ddev->dev);
+   return ret;
+   }
+
+   amdgpu_dpm_dispatch_task(adev,
+AMD_PP_TASK_READJUST_POWER_STATE,
+NULL);
 
pm_runtime_mark_last_busy(ddev->dev);
pm_runtime_put_autosuspend(ddev->dev);
@@ -1340,7 +1353,16 @@ static ssize_t amdgpu_set_pp_mclk_od(struct device *dev,
return ret;
}
 
-   amdgpu_dpm_set_mclk_od(adev, (uint32_t)value);
+   ret = amdgpu_dpm_set_mclk_od(adev, (uint32_t)value);
+   if (ret) {
+   pm_runtime_mark_last_busy(ddev->dev);
+   pm_runtime_put_autosuspend(ddev->dev);
+   return ret;
+   }
+
+   amdgpu_dpm_dispatch_task(adev,
+AMD_PP_TASK_READJUST_POWER_STATE,
+NULL);
 
pm_runtime_mark_last_busy(ddev->dev);
pm_runtime_put_autosuspend(ddev->dev);
-- 
2.29.0
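The patch above moves the AMD_PP_TASK_READJUST_POWER_STATE dispatch out of amdgpu_dpm_set_sclk_od()/set_mclk_od() and into the sysfs handlers, which now also propagate the setter's return code before dispatching. The sketch below is a simplified user-space model of that caller-side pattern, not the real driver code; names like model_set_sclk_od and the >20 rejection rule are invented for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Model of the pattern this patch establishes: the low-level setter only
 * programs the value and reports errors; the caller decides whether to
 * dispatch the readjust task afterwards. */

struct pp_funcs {
	int (*set_sclk_od)(void *handle, unsigned int value);
};

static int dispatch_count;          /* how often the task actually ran */
static unsigned int programmed_od;  /* last overdrive value accepted   */

static int model_set_sclk_od(void *handle, unsigned int value)
{
	(void)handle;
	if (value > 20)          /* pretend >20% overdrive is rejected */
		return -EINVAL;
	programmed_od = value;
	return 0;
}

static void dispatch_readjust_task(void)
{
	dispatch_count++;
}

/* Caller-side handler: dispatch only after the setter succeeded. */
static int set_sclk_od_handler(const struct pp_funcs *funcs, unsigned int value)
{
	int ret;

	if (!funcs->set_sclk_od)
		return -EOPNOTSUPP;

	ret = funcs->set_sclk_od(NULL, value);
	if (ret)
		return ret;      /* no task dispatch on failure */

	dispatch_readjust_task();
	return 0;
}
```

The key property is that a failing setter no longer triggers a power-state readjust, and unsupported callbacks surface as -EOPNOTSUPP at the call site.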



[PATCH V3 12/17] drm/amd/pm: drop redundant or unused APIs and data structures

2021-12-01 Thread Evan Quan
Drop those unused APIs and data structures.

Signed-off-by: Evan Quan 
Change-Id: I57d2a03dcda02d0b5d9c5ffbdd37bffe49945407
---
 drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h | 49 -
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h  |  4 ++
 2 files changed, 4 insertions(+), 49 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h 
b/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
index b0791e855ad3..de76636052e6 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
@@ -88,19 +88,6 @@ struct amdgpu_dpm_thermal {
struct amdgpu_irq_src   irq;
 };
 
-enum amdgpu_clk_action
-{
-   AMDGPU_SCLK_UP = 1,
-   AMDGPU_SCLK_DOWN
-};
-
-struct amdgpu_blacklist_clocks
-{
-   u32 sclk;
-   u32 mclk;
-   enum amdgpu_clk_action action;
-};
-
 struct amdgpu_clock_and_voltage_limits {
u32 sclk;
u32 mclk;
@@ -239,10 +226,6 @@ struct amdgpu_dpm_fan {
bool ucode_fan_control;
 };
 
-#define amdgpu_dpm_reset_power_profile_state(adev, request) \
-   ((adev)->powerplay.pp_funcs->reset_power_profile_state(\
-   (adev)->powerplay.pp_handle, request))
-
 struct amdgpu_dpm {
struct amdgpu_ps*ps;
/* number of valid power states */
@@ -339,35 +322,6 @@ struct amdgpu_pm {
boolpp_force_state_enabled;
 };
 
-#define R600_SSTU_DFLT   0
-#define R600_SST_DFLT 0x00C8
-
-/* XXX are these ok? */
-#define R600_TEMP_RANGE_MIN (90 * 1000)
-#define R600_TEMP_RANGE_MAX (120 * 1000)
-
-#define FDO_PWM_MODE_STATIC  1
-#define FDO_PWM_MODE_STATIC_RPM 5
-
-enum amdgpu_td {
-   AMDGPU_TD_AUTO,
-   AMDGPU_TD_UP,
-   AMDGPU_TD_DOWN,
-};
-
-enum amdgpu_display_watermark {
-   AMDGPU_DISPLAY_WATERMARK_LOW = 0,
-   AMDGPU_DISPLAY_WATERMARK_HIGH = 1,
-};
-
-enum amdgpu_display_gap
-{
-AMDGPU_PM_DISPLAY_GAP_VBLANK_OR_WM = 0,
-AMDGPU_PM_DISPLAY_GAP_VBLANK   = 1,
-AMDGPU_PM_DISPLAY_GAP_WATERMARK= 2,
-AMDGPU_PM_DISPLAY_GAP_IGNORE   = 3,
-};
-
 u32 amdgpu_dpm_get_vblank_time(struct amdgpu_device *adev);
 int amdgpu_dpm_read_sensor(struct amdgpu_device *adev, enum amd_pp_sensors 
sensor,
   void *data, uint32_t *size);
@@ -417,9 +371,6 @@ int amdgpu_dpm_smu_i2c_bus_access(struct amdgpu_device 
*adev,
 
 void amdgpu_pm_acpi_event_handler(struct amdgpu_device *adev);
 
-int amdgpu_dpm_read_sensor(struct amdgpu_device *adev, enum amd_pp_sensors 
sensor,
-  void *data, uint32_t *size);
-
 void amdgpu_dpm_compute_clocks(struct amdgpu_device *adev);
 void amdgpu_dpm_enable_uvd(struct amdgpu_device *adev, bool enable);
 void amdgpu_dpm_enable_vce(struct amdgpu_device *adev, bool enable);
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h 
b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h
index beea03810bca..67a25da79256 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h
@@ -26,6 +26,10 @@
 #include "amdgpu_smu.h"
 
 #if defined(SWSMU_CODE_LAYER_L2) || defined(SWSMU_CODE_LAYER_L3) || 
defined(SWSMU_CODE_LAYER_L4)
+
+#define FDO_PWM_MODE_STATIC  1
+#define FDO_PWM_MODE_STATIC_RPM 5
+
 int smu_cmn_send_msg_without_waiting(struct smu_context *smu,
 uint16_t msg_index,
 uint32_t param);
-- 
2.29.0
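The smu_cmn.h hunk above places the FDO_PWM_MODE_* constants under the existing SWSMU code-layer guard, so only swsmu translation units that declare a layer see them. A toy, compilable illustration of that guard pattern follows; the layer macro is defined locally here purely for demonstration.

```c
#include <assert.h>

/* Toy illustration of the code-layer guard pattern used in smu_cmn.h:
 * the constants are only visible to translation units that declare
 * themselves part of an swsmu code layer. */

#define SWSMU_CODE_LAYER_L2   /* this file claims to be layer-2 code */

#if defined(SWSMU_CODE_LAYER_L2) || defined(SWSMU_CODE_LAYER_L3) || \
    defined(SWSMU_CODE_LAYER_L4)
#define FDO_PWM_MODE_STATIC     1
#define FDO_PWM_MODE_STATIC_RPM 5
#endif
```

A file outside the guarded layers would simply get an "undeclared identifier" error on first use, which is the point: accidental dependencies on swsmu internals fail at compile time.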



[PATCH V3 10/17] drm/amd/pm: move those code piece used by Stoney only to smu8_hwmgr.c

2021-12-01 Thread Evan Quan
Instead of putting them in amdgpu_dpm.c.

Signed-off-by: Evan Quan 
Change-Id: Ieb7ed5fb6140401a7692b401c5a42dc53da92af8
---
 drivers/gpu/drm/amd/pm/amdgpu_dpm.c| 14 --
 drivers/gpu/drm/amd/pm/inc/hwmgr.h |  3 ---
 .../gpu/drm/amd/pm/powerplay/hwmgr/smu8_hwmgr.c| 10 +-
 3 files changed, 9 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
index e0ea92155627..16371879cfc1 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
@@ -32,8 +32,6 @@
 #include "hwmgr.h"
 #include 
 
-#define WIDTH_4K 3840
-
 #define amdgpu_dpm_enable_bapm(adev, e) \

((adev)->powerplay.pp_funcs->enable_bapm((adev)->powerplay.pp_handle, (e)))
 
@@ -383,18 +381,6 @@ void amdgpu_dpm_enable_uvd(struct amdgpu_device *adev, 
bool enable)
if (ret)
DRM_ERROR("Dpm %s uvd failed, ret = %d. \n",
  enable ? "enable" : "disable", ret);
-
-   /* enable/disable Low Memory PState for UVD (4k videos) */
-   if (adev->asic_type == CHIP_STONEY &&
-   adev->uvd.decode_image_width >= WIDTH_4K) {
-   struct pp_hwmgr *hwmgr = adev->powerplay.pp_handle;
-
-   if (hwmgr && hwmgr->hwmgr_func &&
-   hwmgr->hwmgr_func->update_nbdpm_pstate)
-   hwmgr->hwmgr_func->update_nbdpm_pstate(hwmgr,
-  !enable,
-  true);
-   }
 }
 
 void amdgpu_dpm_enable_vce(struct amdgpu_device *adev, bool enable)
diff --git a/drivers/gpu/drm/amd/pm/inc/hwmgr.h 
b/drivers/gpu/drm/amd/pm/inc/hwmgr.h
index 8ed01071fe5a..03226baea65e 100644
--- a/drivers/gpu/drm/amd/pm/inc/hwmgr.h
+++ b/drivers/gpu/drm/amd/pm/inc/hwmgr.h
@@ -331,9 +331,6 @@ struct pp_hwmgr_func {
uint32_t mc_addr_low,
uint32_t mc_addr_hi,
uint32_t size);
-   int (*update_nbdpm_pstate)(struct pp_hwmgr *hwmgr,
-   bool enable,
-   bool lock);
int (*get_thermal_temperature_range)(struct pp_hwmgr *hwmgr,
struct PP_TemperatureRange *range);
int (*get_power_profile_mode)(struct pp_hwmgr *hwmgr, char *buf);
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu8_hwmgr.c 
b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu8_hwmgr.c
index 03bf8f069222..b50fd4a4a3d1 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu8_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu8_hwmgr.c
@@ -1950,9 +1950,12 @@ static void smu8_dpm_powergate_acp(struct pp_hwmgr 
*hwmgr, bool bgate)
smum_send_msg_to_smc(hwmgr, PPSMC_MSG_ACPPowerON, NULL);
 }
 
+#define WIDTH_4K   3840
+
 static void smu8_dpm_powergate_uvd(struct pp_hwmgr *hwmgr, bool bgate)
 {
struct smu8_hwmgr *data = hwmgr->backend;
+   struct amdgpu_device *adev = hwmgr->adev;
 
data->uvd_power_gated = bgate;
 
@@ -1976,6 +1979,12 @@ static void smu8_dpm_powergate_uvd(struct pp_hwmgr 
*hwmgr, bool bgate)
smu8_dpm_update_uvd_dpm(hwmgr, false);
}
 
+   /* enable/disable Low Memory PState for UVD (4k videos) */
+   if (adev->asic_type == CHIP_STONEY &&
+   adev->uvd.decode_image_width >= WIDTH_4K)
+   smu8_nbdpm_pstate_enable_disable(hwmgr,
+bgate,
+true);
 }
 
 static void smu8_dpm_powergate_vce(struct pp_hwmgr *hwmgr, bool bgate)
@@ -2037,7 +2046,6 @@ static const struct pp_hwmgr_func smu8_hwmgr_funcs = {
.power_state_set = smu8_set_power_state_tasks,
.dynamic_state_management_disable = smu8_disable_dpm_tasks,
.notify_cac_buffer_info = smu8_notify_cac_buffer_info,
-   .update_nbdpm_pstate = smu8_nbdpm_pstate_enable_disable,
.get_thermal_temperature_range = smu8_get_thermal_temperature_range,
 };
 
-- 
2.29.0
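The patch above folds the Stoney-only low-memory-pstate tweak into smu8_dpm_powergate_uvd() and drops the update_nbdpm_pstate hook from the common table. The following is a simplified stand-alone model of the resulting control flow (struct device_model and the helper names are invented; bgate mirrors the driver's "block is being power gated" convention).

```c
#include <assert.h>
#include <stdbool.h>

/* Model: the ASIC-specific 4K check now lives inside the smu8 UVD
 * powergate handler instead of the common amdgpu_dpm_enable_uvd() path. */

#define WIDTH_4K 3840

enum asic_type { CHIP_STONEY, CHIP_OTHER };

struct device_model {
	enum asic_type asic_type;
	unsigned int decode_image_width;
	bool nbdpm_low_pstate;   /* state toggled by the helper */
};

static void nbdpm_pstate_enable_disable(struct device_model *dev, bool enable)
{
	dev->nbdpm_low_pstate = enable;
}

/* bgate == true means the UVD block is being power gated (going idle). */
static void smu8_powergate_uvd(struct device_model *dev, bool bgate)
{
	/* ...common UVD gating work would happen here... */

	/* enable/disable Low Memory PState for UVD (4k videos) */
	if (dev->asic_type == CHIP_STONEY &&
	    dev->decode_image_width >= WIDTH_4K)
		nbdpm_pstate_enable_disable(dev, bgate);
}
```

Non-Stoney parts flow through the same handler untouched, so the common layer no longer needs to know the hook exists.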



[PATCH V3 09/17] drm/amd/pm: optimize the amdgpu_pm_compute_clocks() implementations

2021-12-01 Thread Evan Quan
Drop cross-calls and multi-function APIs. Also avoid exposing
internal implementation details.

Signed-off-by: Evan Quan 
Change-Id: I55e5ab3da6a70482f5f5d8c256eed2f754feae20
---
 .../gpu/drm/amd/include/kgd_pp_interface.h|   2 +-
 drivers/gpu/drm/amd/pm/Makefile   |   2 +-
 drivers/gpu/drm/amd/pm/amdgpu_dpm.c   | 222 +++---
 drivers/gpu/drm/amd/pm/amdgpu_dpm_internal.c  |  94 
 drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   |   2 -
 .../gpu/drm/amd/pm/inc/amdgpu_dpm_internal.h  |  32 +++
 .../gpu/drm/amd/pm/powerplay/amd_powerplay.c  |  39 ++-
 drivers/gpu/drm/amd/pm/powerplay/kv_dpm.c |   6 +-
 drivers/gpu/drm/amd/pm/powerplay/legacy_dpm.c |  60 -
 drivers/gpu/drm/amd/pm/powerplay/legacy_dpm.h |   3 +-
 drivers/gpu/drm/amd/pm/powerplay/si_dpm.c |  41 +++-
 11 files changed, 295 insertions(+), 208 deletions(-)
 create mode 100644 drivers/gpu/drm/amd/pm/amdgpu_dpm_internal.c
 create mode 100644 drivers/gpu/drm/amd/pm/inc/amdgpu_dpm_internal.h

diff --git a/drivers/gpu/drm/amd/include/kgd_pp_interface.h 
b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
index cdf724dcf832..7919e96e772b 100644
--- a/drivers/gpu/drm/amd/include/kgd_pp_interface.h
+++ b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
@@ -404,7 +404,7 @@ struct amd_pm_funcs {
int (*get_dpm_clock_table)(void *handle,
   struct dpm_clocks *clock_table);
int (*get_smu_prv_buf_details)(void *handle, void **addr, size_t *size);
-   int (*change_power_state)(void *handle);
+   void (*pm_compute_clocks)(void *handle);
 };
 
 struct metrics_table_header {
diff --git a/drivers/gpu/drm/amd/pm/Makefile b/drivers/gpu/drm/amd/pm/Makefile
index 8cf6eff1ea93..d35ffde387f1 100644
--- a/drivers/gpu/drm/amd/pm/Makefile
+++ b/drivers/gpu/drm/amd/pm/Makefile
@@ -40,7 +40,7 @@ AMD_PM = $(addsuffix /Makefile,$(addprefix 
$(FULL_AMD_PATH)/pm/,$(PM_LIBS)))
 
 include $(AMD_PM)
 
-PM_MGR = amdgpu_dpm.o amdgpu_pm.o
+PM_MGR = amdgpu_dpm.o amdgpu_pm.o amdgpu_dpm_internal.o
 
 AMD_PM_POWER = $(addprefix $(AMD_PM_PATH)/,$(PM_MGR))
 
diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
index 9b268f300815..e0ea92155627 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
@@ -37,73 +37,6 @@
 #define amdgpu_dpm_enable_bapm(adev, e) \

((adev)->powerplay.pp_funcs->enable_bapm((adev)->powerplay.pp_handle, (e)))
 
-static void amdgpu_dpm_get_active_displays(struct amdgpu_device *adev)
-{
-   struct drm_device *ddev = adev_to_drm(adev);
-   struct drm_crtc *crtc;
-   struct amdgpu_crtc *amdgpu_crtc;
-
-   adev->pm.dpm.new_active_crtcs = 0;
-   adev->pm.dpm.new_active_crtc_count = 0;
-   if (adev->mode_info.num_crtc && 
adev->mode_info.mode_config_initialized) {
-   list_for_each_entry(crtc,
-   &ddev->mode_config.crtc_list, head) {
-   amdgpu_crtc = to_amdgpu_crtc(crtc);
-   if (amdgpu_crtc->enabled) {
-   adev->pm.dpm.new_active_crtcs |= (1 << 
amdgpu_crtc->crtc_id);
-   adev->pm.dpm.new_active_crtc_count++;
-   }
-   }
-   }
-}
-
-u32 amdgpu_dpm_get_vblank_time(struct amdgpu_device *adev)
-{
-   struct drm_device *dev = adev_to_drm(adev);
-   struct drm_crtc *crtc;
-   struct amdgpu_crtc *amdgpu_crtc;
-   u32 vblank_in_pixels;
-   u32 vblank_time_us = 0xffffffff; /* if the displays are off, vblank time is max */
-
-   if (adev->mode_info.num_crtc && 
adev->mode_info.mode_config_initialized) {
-   list_for_each_entry(crtc, &dev->mode_config.crtc_list, head) {
-   amdgpu_crtc = to_amdgpu_crtc(crtc);
-   if (crtc->enabled && amdgpu_crtc->enabled && 
amdgpu_crtc->hw_mode.clock) {
-   vblank_in_pixels =
-   amdgpu_crtc->hw_mode.crtc_htotal *
-   (amdgpu_crtc->hw_mode.crtc_vblank_end -
-   amdgpu_crtc->hw_mode.crtc_vdisplay +
-   (amdgpu_crtc->v_border * 2));
-
-   vblank_time_us = vblank_in_pixels * 1000 / 
amdgpu_crtc->hw_mode.clock;
-   break;
-   }
-   }
-   }
-
-   return vblank_time_us;
-}
-
-static u32 amdgpu_dpm_get_vrefresh(struct amdgpu_device *adev)
-{
-   struct drm_device *dev = adev_to_drm(adev);
-   struct drm_crtc *crtc;
-   struct amdgpu_crtc *amdgpu_crtc;
-   u32 vrefresh = 0;
-
-   if (adev->mode_info.num_crtc && 
adev->mode_info.mode_config_initialized) {
-   list_for_each_entry(crtc, &dev->mode_config.crtc_list, head) {
-   amdgpu_crtc = to_amdgpu_crtc(crtc);
-   if 

[PATCH V3 07/17] drm/amd/pm: create a new holder for those APIs used only by legacy ASICs(si/kv)

2021-12-01 Thread Evan Quan
Those APIs are used only by legacy ASICs (si/kv). They cannot be
shared by other ASICs. So, we create a new holder for them.

Signed-off-by: Evan Quan 
Change-Id: I555dfa37e783a267b1d3b3a7db5c87fcc3f1556f
--
v1->v2:
  - rename amdgpu_pm_compute_clocks as amdgpu_dpm_compute_clocks(Lijo)
---
 drivers/gpu/drm/amd/amdgpu/dce_v10_0.c|2 +-
 drivers/gpu/drm/amd/amdgpu/dce_v11_0.c|2 +-
 drivers/gpu/drm/amd/amdgpu/dce_v6_0.c |2 +-
 drivers/gpu/drm/amd/amdgpu/dce_v8_0.c |2 +-
 .../amd/display/amdgpu_dm/amdgpu_dm_pp_smu.c  |2 +-
 .../gpu/drm/amd/include/kgd_pp_interface.h|1 +
 drivers/gpu/drm/amd/pm/amdgpu_dpm.c   | 1022 +---
 drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   |   17 +-
 drivers/gpu/drm/amd/pm/powerplay/Makefile |2 +-
 drivers/gpu/drm/amd/pm/powerplay/kv_dpm.c |6 +-
 drivers/gpu/drm/amd/pm/powerplay/legacy_dpm.c | 1024 +
 drivers/gpu/drm/amd/pm/powerplay/legacy_dpm.h |   37 +
 drivers/gpu/drm/amd/pm/powerplay/si_dpm.c |6 +-
 13 files changed, 1089 insertions(+), 1036 deletions(-)
 create mode 100644 drivers/gpu/drm/amd/pm/powerplay/legacy_dpm.c
 create mode 100644 drivers/gpu/drm/amd/pm/powerplay/legacy_dpm.h

diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/dce_v10_0.c
index d1570a462a51..5d5205870861 100644
--- a/drivers/gpu/drm/amd/amdgpu/dce_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/dce_v10_0.c
@@ -2532,7 +2532,7 @@ static void dce_v10_0_crtc_dpms(struct drm_crtc *crtc, 
int mode)
break;
}
/* adjust pm to dpms */
-   amdgpu_pm_compute_clocks(adev);
+   amdgpu_dpm_compute_clocks(adev);
 }
 
 static void dce_v10_0_crtc_prepare(struct drm_crtc *crtc)
diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
index 18a7b3bd633b..4d812b22c54f 100644
--- a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
@@ -2608,7 +2608,7 @@ static void dce_v11_0_crtc_dpms(struct drm_crtc *crtc, 
int mode)
break;
}
/* adjust pm to dpms */
-   amdgpu_pm_compute_clocks(adev);
+   amdgpu_dpm_compute_clocks(adev);
 }
 
 static void dce_v11_0_crtc_prepare(struct drm_crtc *crtc)
diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v6_0.c 
b/drivers/gpu/drm/amd/amdgpu/dce_v6_0.c
index c7803dc2b2d5..b90bc2adf778 100644
--- a/drivers/gpu/drm/amd/amdgpu/dce_v6_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/dce_v6_0.c
@@ -2424,7 +2424,7 @@ static void dce_v6_0_crtc_dpms(struct drm_crtc *crtc, int 
mode)
break;
}
/* adjust pm to dpms */
-   amdgpu_pm_compute_clocks(adev);
+   amdgpu_dpm_compute_clocks(adev);
 }
 
 static void dce_v6_0_crtc_prepare(struct drm_crtc *crtc)
diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v8_0.c 
b/drivers/gpu/drm/amd/amdgpu/dce_v8_0.c
index 8318ee8339f1..7c1379b02f94 100644
--- a/drivers/gpu/drm/amd/amdgpu/dce_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/dce_v8_0.c
@@ -2433,7 +2433,7 @@ static void dce_v8_0_crtc_dpms(struct drm_crtc *crtc, int 
mode)
break;
}
/* adjust pm to dpms */
-   amdgpu_pm_compute_clocks(adev);
+   amdgpu_dpm_compute_clocks(adev);
 }
 
 static void dce_v8_0_crtc_prepare(struct drm_crtc *crtc)
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_pp_smu.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_pp_smu.c
index 46550811da00..75284e2cec74 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_pp_smu.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_pp_smu.c
@@ -101,7 +101,7 @@ bool dm_pp_apply_display_requirements(
 
amdgpu_dpm_display_configuration_change(adev, &adev->pm.pm_display_cfg);
 
-   amdgpu_pm_compute_clocks(adev);
+   amdgpu_dpm_compute_clocks(adev);
}
 
return true;
diff --git a/drivers/gpu/drm/amd/include/kgd_pp_interface.h 
b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
index 2e295facd086..cdf724dcf832 100644
--- a/drivers/gpu/drm/amd/include/kgd_pp_interface.h
+++ b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
@@ -404,6 +404,7 @@ struct amd_pm_funcs {
int (*get_dpm_clock_table)(void *handle,
   struct dpm_clocks *clock_table);
int (*get_smu_prv_buf_details)(void *handle, void **addr, size_t *size);
+   int (*change_power_state)(void *handle);
 };
 
 struct metrics_table_header {
diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
index efe6f8129c06..9b268f300815 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
@@ -34,113 +34,9 @@
 
 #define WIDTH_4K 3840
 
-#define amdgpu_dpm_pre_set_power_state(adev) \
-   
((adev)->powerplay.pp_funcs->pre_set_power_state((adev)->powerplay.pp_handle))
-
-#define amdgpu_dpm_post_set_power_state(adev) \
-   

[PATCH V3 08/17] drm/amd/pm: move pp_force_state_enabled member to amdgpu_pm structure

2021-12-01 Thread Evan Quan
As it labels an internal pm state, the amdgpu_pm structure is a more
proper place for it than the amdgpu_device structure.

Signed-off-by: Evan Quan 
Change-Id: I7890e8fe7af2ecd8591d30442340deb8773bacc3
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 -
 drivers/gpu/drm/amd/pm/amdgpu_pm.c  | 6 +++---
 drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h | 2 ++
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index c5cfe2926ca1..c987813a4996 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -950,7 +950,6 @@ struct amdgpu_device {
 
/* powerplay */
struct amd_powerplaypowerplay;
-   boolpp_force_state_enabled;
 
/* smu */
struct smu_context  smu;
diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
index 3382d30b5d90..fa2f4e11e94e 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
@@ -469,7 +469,7 @@ static ssize_t amdgpu_get_pp_force_state(struct device *dev,
if (adev->in_suspend && !adev->in_runpm)
return -EPERM;
 
-   if (adev->pp_force_state_enabled)
+   if (adev->pm.pp_force_state_enabled)
return amdgpu_get_pp_cur_state(dev, attr, buf);
else
return sysfs_emit(buf, "\n");
@@ -492,7 +492,7 @@ static ssize_t amdgpu_set_pp_force_state(struct device *dev,
if (adev->in_suspend && !adev->in_runpm)
return -EPERM;
 
-   adev->pp_force_state_enabled = false;
+   adev->pm.pp_force_state_enabled = false;
 
if (strlen(buf) == 1)
return count;
@@ -523,7 +523,7 @@ static ssize_t amdgpu_set_pp_force_state(struct device *dev,
if (ret)
goto err_out;
 
-   adev->pp_force_state_enabled = true;
+   adev->pm.pp_force_state_enabled = true;
}
 
pm_runtime_mark_last_busy(ddev->dev);
diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h 
b/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
index 89caece4ab3e..b7841a989d59 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
@@ -335,6 +335,8 @@ struct amdgpu_pm {
struct list_headpm_attr_list;
 
atomic_tpwr_state[AMD_IP_BLOCK_TYPE_NUM];
+
+   boolpp_force_state_enabled;
 };
 
 #define R600_SSTU_DFLT   0
-- 
2.29.0



[PATCH V3 06/17] drm/amd/pm: do not expose the API used internally only in kv_dpm.c

2021-12-01 Thread Evan Quan
Move it to kv_dpm.c instead.

Signed-off-by: Evan Quan 
Change-Id: I554332b386491a79b7913f72786f1e2cb1f8165b
--
v1->v2:
  - rename the API with "kv_" prefix(Alex)
---
 drivers/gpu/drm/amd/pm/amdgpu_dpm.c   | 23 -
 drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   |  2 --
 drivers/gpu/drm/amd/pm/powerplay/kv_dpm.c | 25 ++-
 3 files changed, 24 insertions(+), 26 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
index b31858ad9b83..efe6f8129c06 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
@@ -209,29 +209,6 @@ static u32 amdgpu_dpm_get_vrefresh(struct amdgpu_device 
*adev)
return vrefresh;
 }
 
-bool amdgpu_is_internal_thermal_sensor(enum amdgpu_int_thermal_type sensor)
-{
-   switch (sensor) {
-   case THERMAL_TYPE_RV6XX:
-   case THERMAL_TYPE_RV770:
-   case THERMAL_TYPE_EVERGREEN:
-   case THERMAL_TYPE_SUMO:
-   case THERMAL_TYPE_NI:
-   case THERMAL_TYPE_SI:
-   case THERMAL_TYPE_CI:
-   case THERMAL_TYPE_KV:
-   return true;
-   case THERMAL_TYPE_ADT7473_WITH_INTERNAL:
-   case THERMAL_TYPE_EMC2103_WITH_INTERNAL:
-   return false; /* need special handling */
-   case THERMAL_TYPE_NONE:
-   case THERMAL_TYPE_EXTERNAL:
-   case THERMAL_TYPE_EXTERNAL_GPIO:
-   default:
-   return false;
-   }
-}
-
 union power_info {
struct _ATOM_POWERPLAY_INFO info;
struct _ATOM_POWERPLAY_INFO_V2 info_2;
diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h 
b/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
index f43b96dfe9d8..01120b302590 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
@@ -374,8 +374,6 @@ u32 amdgpu_dpm_get_vblank_time(struct amdgpu_device *adev);
 int amdgpu_dpm_read_sensor(struct amdgpu_device *adev, enum amd_pp_sensors 
sensor,
   void *data, uint32_t *size);
 
-bool amdgpu_is_internal_thermal_sensor(enum amdgpu_int_thermal_type sensor);
-
 int amdgpu_get_platform_caps(struct amdgpu_device *adev);
 
 int amdgpu_parse_extended_power_table(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/pm/powerplay/kv_dpm.c 
b/drivers/gpu/drm/amd/pm/powerplay/kv_dpm.c
index bcae42cef374..380a5336c74f 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/kv_dpm.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/kv_dpm.c
@@ -1256,6 +1256,29 @@ static void kv_dpm_enable_bapm(void *handle, bool enable)
}
 }
 
+static bool kv_is_internal_thermal_sensor(enum amdgpu_int_thermal_type sensor)
+{
+   switch (sensor) {
+   case THERMAL_TYPE_RV6XX:
+   case THERMAL_TYPE_RV770:
+   case THERMAL_TYPE_EVERGREEN:
+   case THERMAL_TYPE_SUMO:
+   case THERMAL_TYPE_NI:
+   case THERMAL_TYPE_SI:
+   case THERMAL_TYPE_CI:
+   case THERMAL_TYPE_KV:
+   return true;
+   case THERMAL_TYPE_ADT7473_WITH_INTERNAL:
+   case THERMAL_TYPE_EMC2103_WITH_INTERNAL:
+   return false; /* need special handling */
+   case THERMAL_TYPE_NONE:
+   case THERMAL_TYPE_EXTERNAL:
+   case THERMAL_TYPE_EXTERNAL_GPIO:
+   default:
+   return false;
+   }
+}
+
 static int kv_dpm_enable(struct amdgpu_device *adev)
 {
struct kv_power_info *pi = kv_get_pi(adev);
@@ -1352,7 +1375,7 @@ static int kv_dpm_enable(struct amdgpu_device *adev)
}
 
if (adev->irq.installed &&
-   amdgpu_is_internal_thermal_sensor(adev->pm.int_thermal_type)) {
+   kv_is_internal_thermal_sensor(adev->pm.int_thermal_type)) {
ret = kv_set_thermal_temperature_range(adev, KV_TEMP_RANGE_MIN, 
KV_TEMP_RANGE_MAX);
if (ret) {
DRM_ERROR("kv_set_thermal_temperature_range failed\n");
-- 
2.29.0



[PATCH V3 05/17] drm/amd/pm: do not expose those APIs used internally only in si_dpm.c

2021-12-01 Thread Evan Quan
Move them to si_dpm.c instead.

Signed-off-by: Evan Quan 
Change-Id: I288205cfd7c6ba09cfb22626ff70360d61ff0c67
--
v1->v2:
  - rename the API with "si_" prefix(Alex)
v2->v3:
  - rename other data structures used only in si_dpm.c(Lijo)
---
 drivers/gpu/drm/amd/pm/amdgpu_dpm.c   |  25 -
 drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   |  25 -
 drivers/gpu/drm/amd/pm/powerplay/si_dpm.c | 106 +++---
 drivers/gpu/drm/amd/pm/powerplay/si_dpm.h |  15 ++-
 4 files changed, 83 insertions(+), 88 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
index 72a8cb70d36b..b31858ad9b83 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
@@ -894,31 +894,6 @@ void amdgpu_add_thermal_controller(struct amdgpu_device 
*adev)
}
 }
 
-enum amdgpu_pcie_gen amdgpu_get_pcie_gen_support(struct amdgpu_device *adev,
-u32 sys_mask,
-enum amdgpu_pcie_gen asic_gen,
-enum amdgpu_pcie_gen 
default_gen)
-{
-   switch (asic_gen) {
-   case AMDGPU_PCIE_GEN1:
-   return AMDGPU_PCIE_GEN1;
-   case AMDGPU_PCIE_GEN2:
-   return AMDGPU_PCIE_GEN2;
-   case AMDGPU_PCIE_GEN3:
-   return AMDGPU_PCIE_GEN3;
-   default:
-   if ((sys_mask & CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3) &&
-   (default_gen == AMDGPU_PCIE_GEN3))
-   return AMDGPU_PCIE_GEN3;
-   else if ((sys_mask & CAIL_PCIE_LINK_SPEED_SUPPORT_GEN2) &&
-(default_gen == AMDGPU_PCIE_GEN2))
-   return AMDGPU_PCIE_GEN2;
-   else
-   return AMDGPU_PCIE_GEN1;
-   }
-   return AMDGPU_PCIE_GEN1;
-}
-
 struct amd_vce_state*
 amdgpu_get_vce_clock_state(void *handle, u32 idx)
 {
diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h 
b/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
index 6681b878e75f..f43b96dfe9d8 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
@@ -45,19 +45,6 @@ enum amdgpu_int_thermal_type {
THERMAL_TYPE_KV,
 };
 
-enum amdgpu_dpm_auto_throttle_src {
-   AMDGPU_DPM_AUTO_THROTTLE_SRC_THERMAL,
-   AMDGPU_DPM_AUTO_THROTTLE_SRC_EXTERNAL
-};
-
-enum amdgpu_dpm_event_src {
-   AMDGPU_DPM_EVENT_SRC_ANALOG = 0,
-   AMDGPU_DPM_EVENT_SRC_EXTERNAL = 1,
-   AMDGPU_DPM_EVENT_SRC_DIGITAL = 2,
-   AMDGPU_DPM_EVENT_SRC_ANALOG_OR_EXTERNAL = 3,
-   AMDGPU_DPM_EVENT_SRC_DIGIAL_OR_EXTERNAL = 4
-};
-
 struct amdgpu_ps {
u32 caps; /* vbios flags */
u32 class; /* vbios flags */
@@ -252,13 +239,6 @@ struct amdgpu_dpm_fan {
bool ucode_fan_control;
 };
 
-enum amdgpu_pcie_gen {
-   AMDGPU_PCIE_GEN1 = 0,
-   AMDGPU_PCIE_GEN2 = 1,
-   AMDGPU_PCIE_GEN3 = 2,
-   AMDGPU_PCIE_GEN_INVALID = 0xffff
-};
-
 #define amdgpu_dpm_reset_power_profile_state(adev, request) \
((adev)->powerplay.pp_funcs->reset_power_profile_state(\
(adev)->powerplay.pp_handle, request))
@@ -403,11 +383,6 @@ void amdgpu_free_extended_power_table(struct amdgpu_device 
*adev);
 
 void amdgpu_add_thermal_controller(struct amdgpu_device *adev);
 
-enum amdgpu_pcie_gen amdgpu_get_pcie_gen_support(struct amdgpu_device *adev,
-u32 sys_mask,
-enum amdgpu_pcie_gen asic_gen,
-enum amdgpu_pcie_gen 
default_gen);
-
 struct amd_vce_state*
 amdgpu_get_vce_clock_state(void *handle, u32 idx);
 
diff --git a/drivers/gpu/drm/amd/pm/powerplay/si_dpm.c 
b/drivers/gpu/drm/amd/pm/powerplay/si_dpm.c
index 81f82aa05ec2..ab0fa6c79255 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/si_dpm.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/si_dpm.c
@@ -96,6 +96,19 @@ union pplib_clock_info {
struct _ATOM_PPLIB_SI_CLOCK_INFO si;
 };
 
+enum si_dpm_auto_throttle_src {
+   DPM_AUTO_THROTTLE_SRC_THERMAL,
+   DPM_AUTO_THROTTLE_SRC_EXTERNAL
+};
+
+enum si_dpm_event_src {
+   DPM_EVENT_SRC_ANALOG = 0,
+   DPM_EVENT_SRC_EXTERNAL = 1,
+   DPM_EVENT_SRC_DIGITAL = 2,
+   DPM_EVENT_SRC_ANALOG_OR_EXTERNAL = 3,
+   DPM_EVENT_SRC_DIGIAL_OR_EXTERNAL = 4
+};
+
 static const u32 r600_utc[R600_PM_NUMBER_OF_TC] =
 {
R600_UTC_DFLT_00,
@@ -3718,25 +3731,25 @@ static void si_set_dpm_event_sources(struct 
amdgpu_device *adev, u32 sources)
 {
struct rv7xx_power_info *pi = rv770_get_pi(adev);
bool want_thermal_protection;
-   enum amdgpu_dpm_event_src dpm_event_src;
+   enum si_dpm_event_src dpm_event_src;
 
switch (sources) {
case 0:
default:
want_thermal_protection = false;
break;
-   case (1 << AMDGPU_DPM_AUTO_THROTTLE_SRC_THERMAL):
+  

[PATCH V3 04/17] drm/amd/pm: do not expose those APIs used internally only in amdgpu_dpm.c

2021-12-01 Thread Evan Quan
Move them to amdgpu_dpm.c instead.

Signed-off-by: Evan Quan 
Change-Id: I59fe0efcb47c18ec7254f3624db7a2eb78d91b8c
---
 drivers/gpu/drm/amd/pm/amdgpu_dpm.c | 25 +++--
 drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h | 23 ---
 2 files changed, 23 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
index 3a84c3995f2d..72a8cb70d36b 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
@@ -34,6 +34,27 @@
 
 #define WIDTH_4K 3840
 
+#define amdgpu_dpm_pre_set_power_state(adev) \
+   
((adev)->powerplay.pp_funcs->pre_set_power_state((adev)->powerplay.pp_handle))
+
+#define amdgpu_dpm_post_set_power_state(adev) \
+   
((adev)->powerplay.pp_funcs->post_set_power_state((adev)->powerplay.pp_handle))
+
+#define amdgpu_dpm_display_configuration_changed(adev) \
+   
((adev)->powerplay.pp_funcs->display_configuration_changed((adev)->powerplay.pp_handle))
+
+#define amdgpu_dpm_print_power_state(adev, ps) \
+   
((adev)->powerplay.pp_funcs->print_power_state((adev)->powerplay.pp_handle, 
(ps)))
+
+#define amdgpu_dpm_vblank_too_short(adev) \
+   
((adev)->powerplay.pp_funcs->vblank_too_short((adev)->powerplay.pp_handle))
+
+#define amdgpu_dpm_enable_bapm(adev, e) \
+   
((adev)->powerplay.pp_funcs->enable_bapm((adev)->powerplay.pp_handle, (e)))
+
+#define amdgpu_dpm_check_state_equal(adev, cps, rps, equal) \
+   
((adev)->powerplay.pp_funcs->check_state_equal((adev)->powerplay.pp_handle, 
(cps), (rps), (equal)))
+
 void amdgpu_dpm_print_class_info(u32 class, u32 class2)
 {
const char *s;
@@ -120,7 +141,7 @@ void amdgpu_dpm_print_ps_status(struct amdgpu_device *adev,
pr_cont("\n");
 }
 
-void amdgpu_dpm_get_active_displays(struct amdgpu_device *adev)
+static void amdgpu_dpm_get_active_displays(struct amdgpu_device *adev)
 {
struct drm_device *ddev = adev_to_drm(adev);
struct drm_crtc *crtc;
@@ -168,7 +189,7 @@ u32 amdgpu_dpm_get_vblank_time(struct amdgpu_device *adev)
return vblank_time_us;
 }
 
-u32 amdgpu_dpm_get_vrefresh(struct amdgpu_device *adev)
+static u32 amdgpu_dpm_get_vrefresh(struct amdgpu_device *adev)
 {
struct drm_device *dev = adev_to_drm(adev);
struct drm_crtc *crtc;
diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h 
b/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
index fea203a79cb4..6681b878e75f 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
@@ -259,27 +259,6 @@ enum amdgpu_pcie_gen {
	AMDGPU_PCIE_GEN_INVALID = 0xffffffff
 };
 
-#define amdgpu_dpm_pre_set_power_state(adev) \
-   
((adev)->powerplay.pp_funcs->pre_set_power_state((adev)->powerplay.pp_handle))
-
-#define amdgpu_dpm_post_set_power_state(adev) \
-   
((adev)->powerplay.pp_funcs->post_set_power_state((adev)->powerplay.pp_handle))
-
-#define amdgpu_dpm_display_configuration_changed(adev) \
-   
((adev)->powerplay.pp_funcs->display_configuration_changed((adev)->powerplay.pp_handle))
-
-#define amdgpu_dpm_print_power_state(adev, ps) \
-   
((adev)->powerplay.pp_funcs->print_power_state((adev)->powerplay.pp_handle, 
(ps)))
-
-#define amdgpu_dpm_vblank_too_short(adev) \
-   
((adev)->powerplay.pp_funcs->vblank_too_short((adev)->powerplay.pp_handle))
-
-#define amdgpu_dpm_enable_bapm(adev, e) \
-   
((adev)->powerplay.pp_funcs->enable_bapm((adev)->powerplay.pp_handle, (e)))
-
-#define amdgpu_dpm_check_state_equal(adev, cps, rps, equal) \
-   
((adev)->powerplay.pp_funcs->check_state_equal((adev)->powerplay.pp_handle, 
(cps), (rps), (equal)))
-
 #define amdgpu_dpm_reset_power_profile_state(adev, request) \
((adev)->powerplay.pp_funcs->reset_power_profile_state(\
(adev)->powerplay.pp_handle, request))
@@ -412,8 +391,6 @@ void amdgpu_dpm_print_cap_info(u32 caps);
 void amdgpu_dpm_print_ps_status(struct amdgpu_device *adev,
struct amdgpu_ps *rps);
 u32 amdgpu_dpm_get_vblank_time(struct amdgpu_device *adev);
-u32 amdgpu_dpm_get_vrefresh(struct amdgpu_device *adev);
-void amdgpu_dpm_get_active_displays(struct amdgpu_device *adev);
 int amdgpu_dpm_read_sensor(struct amdgpu_device *adev, enum amd_pp_sensors 
sensor,
   void *data, uint32_t *size);
 
-- 
2.29.0



[PATCH V3 03/17] drm/amd/pm: do not expose power implementation details to display

2021-12-01 Thread Evan Quan
Display is another client of our power APIs. It's not proper to poke
into power implementation details there.

Signed-off-by: Evan Quan 
Change-Id: Ic897131e16473ed29d3d7586d822a55c64e6574a
---
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |   6 +-
 .../amd/display/amdgpu_dm/amdgpu_dm_pp_smu.c  | 246 +++---
 drivers/gpu/drm/amd/pm/amdgpu_dpm.c   | 218 
 drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   |  38 +++
 4 files changed, 344 insertions(+), 164 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 53f7fdf956eb..92480cc57623 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -2139,12 +2139,8 @@ static void s3_handle_mst(struct drm_device *dev, bool 
suspend)
 
 static int amdgpu_dm_smu_write_watermarks_table(struct amdgpu_device *adev)
 {
-   struct smu_context *smu = &adev->smu;
int ret = 0;
 
-   if (!is_support_sw_smu(adev))
-   return 0;
-
/* This interface is for dGPU Navi1x.Linux dc-pplib interface depends
 * on window driver dc implementation.
 * For Navi1x, clock settings of dcn watermarks are fixed. the settings
@@ -2183,7 +2179,7 @@ static int amdgpu_dm_smu_write_watermarks_table(struct 
amdgpu_device *adev)
return 0;
}
 
-   ret = smu_write_watermarks_table(smu);
+   ret = amdgpu_dpm_write_watermarks_table(adev);
if (ret) {
DRM_ERROR("Failed to update WMTABLE!\n");
return ret;
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_pp_smu.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_pp_smu.c
index eba270121698..46550811da00 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_pp_smu.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_pp_smu.c
@@ -99,10 +99,7 @@ bool dm_pp_apply_display_requirements(
adev->pm.pm_display_cfg.displays[i].controller_id = 
dc_cfg->pipe_idx + 1;
}
 
-   if (adev->powerplay.pp_funcs && 
adev->powerplay.pp_funcs->display_configuration_change)
-   adev->powerplay.pp_funcs->display_configuration_change(
-   adev->powerplay.pp_handle,
&adev->pm.pm_display_cfg);
+   amdgpu_dpm_display_configuration_change(adev, 
&adev->pm.pm_display_cfg);
 
amdgpu_pm_compute_clocks(adev);
}
@@ -298,31 +295,25 @@ bool dm_pp_get_clock_levels_by_type(
struct dm_pp_clock_levels *dc_clks)
 {
struct amdgpu_device *adev = ctx->driver_context;
-   void *pp_handle = adev->powerplay.pp_handle;
struct amd_pp_clocks pp_clks = { 0 };
struct amd_pp_simple_clock_info validation_clks = { 0 };
uint32_t i;
 
-   if (adev->powerplay.pp_funcs && 
adev->powerplay.pp_funcs->get_clock_by_type) {
-   if (adev->powerplay.pp_funcs->get_clock_by_type(pp_handle,
-   dc_to_pp_clock_type(clk_type), &pp_clks)) {
-   /* Error in pplib. Provide default values. */
-   get_default_clock_levels(clk_type, dc_clks);
-   return true;
-   }
+   if (amdgpu_dpm_get_clock_by_type(adev,
+   dc_to_pp_clock_type(clk_type), &pp_clks)) {
+   /* Error in pplib. Provide default values. */
+   get_default_clock_levels(clk_type, dc_clks);
+   return true;
}
 
	pp_to_dc_clock_levels(&pp_clks, dc_clks, clk_type);
 
-   if (adev->powerplay.pp_funcs && 
adev->powerplay.pp_funcs->get_display_mode_validation_clocks) {
-   if 
(adev->powerplay.pp_funcs->get_display_mode_validation_clocks(
-   pp_handle, &validation_clks)) {
-   /* Error in pplib. Provide default values. */
-   DRM_INFO("DM_PPLIB: Warning: using default validation 
clocks!\n");
-   validation_clks.engine_max_clock = 72000;
-   validation_clks.memory_max_clock = 80000;
-   validation_clks.level = 0;
-   }
+   if (amdgpu_dpm_get_display_mode_validation_clks(adev, 
&validation_clks)) {
+   /* Error in pplib. Provide default values. */
+   DRM_INFO("DM_PPLIB: Warning: using default validation 
clocks!\n");
+   validation_clks.engine_max_clock = 72000;
+   validation_clks.memory_max_clock = 80000;
+   validation_clks.level = 0;
}
 
DRM_INFO("DM_PPLIB: Validation clocks:\n");
@@ -370,18 +361,14 @@ bool dm_pp_get_clock_levels_by_type_with_latency(
struct dm_pp_clock_levels_with_latency *clk_level_info)
 {
struct amdgpu_device *adev = ctx->driver_context;
-   void *pp_handle = adev->powerplay.pp_handle;
struct pp_clock_levels_with_latency 

[PATCH V3 02/17] drm/amd/pm: do not expose power implementation details to amdgpu_pm.c

2021-12-01 Thread Evan Quan
amdgpu_pm.c holds all the user sysfs/hwmon interfaces. It's another
client of our power APIs. It's not proper to poke into power
implementation details there.

Signed-off-by: Evan Quan 
Change-Id: I397853ddb13eacfce841366de2a623535422df9a
--
v1->v2:
  - drop unneeded "return;" in amdgpu_dpm_get_current_power_state (Guchun)
---
 drivers/gpu/drm/amd/pm/amdgpu_dpm.c   | 456 ++-
 drivers/gpu/drm/amd/pm/amdgpu_pm.c| 519 --
 drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   | 160 +++
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c |   3 -
 4 files changed, 707 insertions(+), 431 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
index 54abdf7080de..2c789eb5d066 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
@@ -1453,7 +1453,9 @@ static void amdgpu_dpm_change_power_state_locked(struct 
amdgpu_device *adev)
if (equal)
return;
 
-   amdgpu_dpm_set_power_state(adev);
+   if (adev->powerplay.pp_funcs->set_power_state)
+   
adev->powerplay.pp_funcs->set_power_state(adev->powerplay.pp_handle);
+
amdgpu_dpm_post_set_power_state(adev);
 
adev->pm.dpm.current_active_crtcs = adev->pm.dpm.new_active_crtcs;
@@ -1704,3 +1706,455 @@ int amdgpu_dpm_get_ecc_info(struct amdgpu_device *adev,
 
	return smu_get_ecc_info(&adev->smu, umc_ecc);
 }
+
+struct amd_vce_state *amdgpu_dpm_get_vce_clock_state(struct amdgpu_device 
*adev,
+uint32_t idx)
+{
+   const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
+
+   if (!pp_funcs->get_vce_clock_state)
+   return NULL;
+
+   return pp_funcs->get_vce_clock_state(adev->powerplay.pp_handle,
+idx);
+}
+
+void amdgpu_dpm_get_current_power_state(struct amdgpu_device *adev,
+   enum amd_pm_state_type *state)
+{
+   const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
+
+   if (!pp_funcs->get_current_power_state) {
+   *state = adev->pm.dpm.user_state;
+   return;
+   }
+
+   *state = pp_funcs->get_current_power_state(adev->powerplay.pp_handle);
+   if (*state < POWER_STATE_TYPE_DEFAULT ||
+   *state > POWER_STATE_TYPE_INTERNAL_3DPERF)
+   *state = adev->pm.dpm.user_state;
+}
+
+void amdgpu_dpm_set_power_state(struct amdgpu_device *adev,
+   enum amd_pm_state_type state)
+{
+   adev->pm.dpm.user_state = state;
+
+   if (adev->powerplay.pp_funcs->dispatch_tasks)
+   amdgpu_dpm_dispatch_task(adev, AMD_PP_TASK_ENABLE_USER_STATE, 
&state);
+   else
+   amdgpu_pm_compute_clocks(adev);
+}
+
+enum amd_dpm_forced_level amdgpu_dpm_get_performance_level(struct 
amdgpu_device *adev)
+{
+   const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
+   enum amd_dpm_forced_level level;
+
+   if (pp_funcs->get_performance_level)
+   level = 
pp_funcs->get_performance_level(adev->powerplay.pp_handle);
+   else
+   level = adev->pm.dpm.forced_level;
+
+   return level;
+}
+
+int amdgpu_dpm_force_performance_level(struct amdgpu_device *adev,
+  enum amd_dpm_forced_level level)
+{
+   const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
+
+   if (pp_funcs->force_performance_level) {
+   if (adev->pm.dpm.thermal_active)
+   return -EINVAL;
+
+   if (pp_funcs->force_performance_level(adev->powerplay.pp_handle,
+ level))
+   return -EINVAL;
+   }
+
+   adev->pm.dpm.forced_level = level;
+
+   return 0;
+}
+
+int amdgpu_dpm_get_pp_num_states(struct amdgpu_device *adev,
+struct pp_states_info *states)
+{
+   const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
+
+   if (!pp_funcs->get_pp_num_states)
+   return -EOPNOTSUPP;
+
+   return pp_funcs->get_pp_num_states(adev->powerplay.pp_handle, states);
+}
+
+int amdgpu_dpm_dispatch_task(struct amdgpu_device *adev,
+ enum amd_pp_task task_id,
+ enum amd_pm_state_type *user_state)
+{
+   const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
+
+   if (!pp_funcs->dispatch_tasks)
+   return -EOPNOTSUPP;
+
+   return pp_funcs->dispatch_tasks(adev->powerplay.pp_handle, task_id, 
user_state);
+}
+
+int amdgpu_dpm_get_pp_table(struct amdgpu_device *adev, char **table)
+{
+   const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
+
+   if (!pp_funcs->get_pp_table)
+   return 0;
+
+   return pp_funcs->get_pp_table(adev->powerplay.pp_handle, table);
+}
+
+int 

[PATCH V3 01/17] drm/amd/pm: do not expose implementation details to other blocks out of power

2021-12-01 Thread Evan Quan
Those implementation details (whether swsmu is supported, whether some
ppt_funcs are implemented, accessing internal statistics, ...) should be
kept internal. It's not good practice, and is even error prone, to expose
implementation details.

Signed-off-by: Evan Quan 
Change-Id: Ibca3462ceaa26a27a9145282b60c6ce5deca7752
---
 drivers/gpu/drm/amd/amdgpu/aldebaran.c|  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   | 25 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c|  6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c   | 18 +---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h   |  7 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   |  5 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c   |  5 +-
 drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c   |  2 +-
 .../gpu/drm/amd/include/kgd_pp_interface.h|  4 +
 drivers/gpu/drm/amd/pm/amdgpu_dpm.c   | 90 +++
 drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   | 25 +-
 drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   | 11 +--
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 26 +++---
 13 files changed, 161 insertions(+), 65 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/aldebaran.c 
b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
index bcfdb63b1d42..a545df4efce1 100644
--- a/drivers/gpu/drm/amd/amdgpu/aldebaran.c
+++ b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
@@ -260,7 +260,7 @@ static int aldebaran_mode2_restore_ip(struct amdgpu_device 
*adev)
adev->gfx.rlc.funcs->resume(adev);
 
/* Wait for FW reset event complete */
-   r = smu_wait_for_event(adev, SMU_EVENT_RESET_COMPLETE, 0);
+   r = amdgpu_dpm_wait_for_event(adev, SMU_EVENT_RESET_COMPLETE, 0);
if (r) {
dev_err(adev->dev,
"Failed to get response from firmware after reset\n");
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 164d6a9e9fbb..0d1f00b24aae 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1585,22 +1585,25 @@ static int amdgpu_debugfs_sclk_set(void *data, u64 val)
return ret;
}
 
-   if (is_support_sw_smu(adev)) {
-   ret = smu_get_dpm_freq_range(&adev->smu, SMU_SCLK, &min_freq, 
&max_freq);
-   if (ret || val > max_freq || val < min_freq)
-   return -EINVAL;
-   ret = smu_set_soft_freq_range(&adev->smu, SMU_SCLK, 
(uint32_t)val, (uint32_t)val);
-   } else {
-   return 0;
+   ret = amdgpu_dpm_get_dpm_freq_range(adev, PP_SCLK, &min_freq, 
&max_freq);
+   if (ret == -EOPNOTSUPP) {
+   ret = 0;
+   goto out;
}
+   if (ret || val > max_freq || val < min_freq) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   ret = amdgpu_dpm_set_soft_freq_range(adev, PP_SCLK, (uint32_t)val, 
(uint32_t)val);
+   if (ret)
+   ret = -EINVAL;
 
+out:
pm_runtime_mark_last_busy(adev_to_drm(adev)->dev);
pm_runtime_put_autosuspend(adev_to_drm(adev)->dev);
 
-   if (ret)
-   return -EINVAL;
-
-   return 0;
+   return ret;
 }
 
 DEFINE_DEBUGFS_ATTRIBUTE(fops_ib_preempt, NULL,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1989f9e9379e..41cc1ffb5809 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2617,7 +2617,7 @@ static int amdgpu_device_ip_late_init(struct 
amdgpu_device *adev)
if (adev->asic_type == CHIP_ARCTURUS &&
amdgpu_passthrough(adev) &&
adev->gmc.xgmi.num_physical_nodes > 1)
-   smu_set_light_sbr(>smu, true);
+   amdgpu_dpm_set_light_sbr(adev, true);
 
if (adev->gmc.xgmi.num_physical_nodes > 1) {
mutex_lock(_info.mutex);
@@ -2857,7 +2857,7 @@ static int amdgpu_device_ip_suspend_phase2(struct 
amdgpu_device *adev)
int i, r;
 
if (adev->in_s0ix)
-   amdgpu_gfx_state_change_set(adev, sGpuChangeState_D3Entry);
+   amdgpu_dpm_gfx_state_change(adev, sGpuChangeState_D3Entry);
 
for (i = adev->num_ip_blocks - 1; i >= 0; i--) {
if (!adev->ip_blocks[i].status.valid)
@@ -3982,7 +3982,7 @@ int amdgpu_device_resume(struct drm_device *dev, bool 
fbcon)
return 0;
 
if (adev->in_s0ix)
-   amdgpu_gfx_state_change_set(adev, sGpuChangeState_D0Entry);
+   amdgpu_dpm_gfx_state_change(adev, sGpuChangeState_D0Entry);
 
/* post card */
if (amdgpu_device_need_post(adev)) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index 1916ec84dd71..3d8f82dc8c97 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -615,7 +615,7 @@ int amdgpu_get_gfx_off_status(struct amdgpu_device *adev, 
uint32_t *value)
 

[PATCH V3 00/17] Unified entry point for other blocks to interact with power

2021-12-01 Thread Evan Quan
There are several problems with the current power implementations:
1. Too many internal details are exposed to other blocks. Thus, to interact
   with power, they need to know which power framework is used (powerplay vs.
   swsmu) or even whether some API is implemented.
2. A lot of cross calls exist, which makes it hard to get a whole picture of
   the code hierarchy. That makes any code change/increment error prone.
3. Many different types of locks are used. In total, 13 different locks are
   used within power. Some of them are even designed for the same purpose.

To ease the problems above, this patch series tries to
1. provide a unified entry point for other blocks to interact with power.
2. relocate some source code pieces/headers to avoid cross calls.
3. enforce unified lock protection on those entry point APIs.
   That makes future optimization of unnecessary power locks possible.

Evan Quan (17):
  drm/amd/pm: do not expose implementation details to other blocks out
of power
  drm/amd/pm: do not expose power implementation details to amdgpu_pm.c
  drm/amd/pm: do not expose power implementation details to display
  drm/amd/pm: do not expose those APIs used internally only in
amdgpu_dpm.c
  drm/amd/pm: do not expose those APIs used internally only in si_dpm.c
  drm/amd/pm: do not expose the API used internally only in kv_dpm.c
  drm/amd/pm: create a new holder for those APIs used only by legacy
ASICs(si/kv)
  drm/amd/pm: move pp_force_state_enabled member to amdgpu_pm structure
  drm/amd/pm: optimize the amdgpu_pm_compute_clocks() implementations
  drm/amd/pm: move those code piece used by Stoney only to smu8_hwmgr.c
  drm/amd/pm: correct the usage for amdgpu_dpm_dispatch_task()
  drm/amd/pm: drop redundant or unused APIs and data structures
  drm/amd/pm: do not expose the smu_context structure used internally in
power
  drm/amd/pm: relocate the power related headers
  drm/amd/pm: drop unnecessary gfxoff controls
  drm/amd/pm: revise the performance level setting APIs
  drm/amd/pm: unified lock protections in amdgpu_dpm.c

 drivers/gpu/drm/amd/amdgpu/aldebaran.c|2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu.h   |7 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   |   25 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c|6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c   |   18 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h   |7 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   |5 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c   |5 +-
 drivers/gpu/drm/amd/amdgpu/dce_v10_0.c|2 +-
 drivers/gpu/drm/amd/amdgpu/dce_v11_0.c|2 +-
 drivers/gpu/drm/amd/amdgpu/dce_v6_0.c |2 +-
 drivers/gpu/drm/amd/amdgpu/dce_v8_0.c |2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c   |2 +-
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |6 +-
 .../amd/display/amdgpu_dm/amdgpu_dm_pp_smu.c  |  248 +-
 .../gpu/drm/amd/include/kgd_pp_interface.h|8 +
 drivers/gpu/drm/amd/pm/Makefile   |   12 +-
 drivers/gpu/drm/amd/pm/amdgpu_dpm.c   | 2434 -
 drivers/gpu/drm/amd/pm/amdgpu_dpm_internal.c  |   94 +
 drivers/gpu/drm/amd/pm/amdgpu_pm.c|  570 ++--
 drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   |  341 +--
 .../gpu/drm/amd/pm/inc/amdgpu_dpm_internal.h  |   32 +
 drivers/gpu/drm/amd/pm/legacy-dpm/Makefile|   32 +
 .../pm/{powerplay => legacy-dpm}/cik_dpm.h|0
 .../amd/pm/{powerplay => legacy-dpm}/kv_dpm.c |   47 +-
 .../amd/pm/{powerplay => legacy-dpm}/kv_dpm.h |0
 .../amd/pm/{powerplay => legacy-dpm}/kv_smc.c |0
 .../gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c| 1081 
 .../gpu/drm/amd/pm/legacy-dpm/legacy_dpm.h|   38 +
 .../amd/pm/{powerplay => legacy-dpm}/ppsmc.h  |0
 .../pm/{powerplay => legacy-dpm}/r600_dpm.h   |0
 .../amd/pm/{powerplay => legacy-dpm}/si_dpm.c |  163 +-
 .../amd/pm/{powerplay => legacy-dpm}/si_dpm.h |   15 +-
 .../amd/pm/{powerplay => legacy-dpm}/si_smc.c |0
 .../{powerplay => legacy-dpm}/sislands_smc.h  |0
 drivers/gpu/drm/amd/pm/powerplay/Makefile |4 -
 .../gpu/drm/amd/pm/powerplay/amd_powerplay.c  |   51 +-
 .../drm/amd/pm/powerplay/hwmgr/smu8_hwmgr.c   |   10 +-
 .../pm/{ => powerplay}/inc/amd_powerplay.h|0
 .../drm/amd/pm/{ => powerplay}/inc/cz_ppsmc.h |0
 .../amd/pm/{ => powerplay}/inc/fiji_ppsmc.h   |0
 .../pm/{ => powerplay}/inc/hardwaremanager.h  |0
 .../drm/amd/pm/{ => powerplay}/inc/hwmgr.h|3 -
 .../{ => powerplay}/inc/polaris10_pwrvirus.h  |0
 .../amd/pm/{ => powerplay}/inc/power_state.h  |0
 .../drm/amd/pm/{ => powerplay}/inc/pp_debug.h |0
 .../amd/pm/{ => powerplay}/inc/pp_endian.h|0
 .../amd/pm/{ => powerplay}/inc/pp_thermal.h   |0
 .../amd/pm/{ => powerplay}/inc/ppinterrupt.h  |0
 .../drm/amd/pm/{ => powerplay}/inc/rv_ppsmc.h |0
 .../drm/amd/pm/{ => powerplay}/inc/smu10.h|0
 .../pm/{ => 

RE: [PATCH] drm/amdgpu: handle SRIOV VCN revision parsing

2021-12-01 Thread Chen, Guchun
[Public]

Reviewed-by: Guchun Chen 

Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of Alex Deucher
Sent: Thursday, December 2, 2021 5:36 AM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander 
Subject: [PATCH] drm/amdgpu: handle SRIOV VCN revision parsing

For SR-IOV, the IP discovery revision number encodes additional information.  
Handle that case here.

v2: drop additional IP versions

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 17 ++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c   |  2 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h   |  1 +
 drivers/gpu/drm/amd/amdgpu/nv.c   |  2 --
 4 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
index ea00090b3fb3..552031950518 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
@@ -379,8 +379,21 @@ int amdgpu_discovery_reg_base_init(struct amdgpu_device 
*adev)
  ip->major, ip->minor,
  ip->revision);
 
-   if (le16_to_cpu(ip->hw_id) == VCN_HWID)
+   if (le16_to_cpu(ip->hw_id) == VCN_HWID) {
+   if (amdgpu_sriov_vf(adev)) {
+   /* SR-IOV modifies each VCN's revision 
(uint8)
+* Bit [5:0]: original revision value
+* Bit [7:6]: en/decode capability:
+* 0b00 : VCN function normally
+* 0b10 : encode is disabled
+* 0b01 : decode is disabled
+*/
+   
adev->vcn.sriov_config[adev->vcn.num_vcn_inst] =
+   (ip->revision & 0xc0) >> 6;
+   ip->revision &= ~0xc0;
+   }
adev->vcn.num_vcn_inst++;
+   }
if (le16_to_cpu(ip->hw_id) == SDMA0_HWID ||
le16_to_cpu(ip->hw_id) == SDMA1_HWID ||
le16_to_cpu(ip->hw_id) == SDMA2_HWID || @@ -917,10 
+930,8 @@ static int amdgpu_discovery_set_mm_ip_blocks(struct amdgpu_device 
*adev)
break;
case IP_VERSION(3, 0, 0):
case IP_VERSION(3, 0, 16):
-   case IP_VERSION(3, 0, 64):
case IP_VERSION(3, 1, 1):
case IP_VERSION(3, 0, 2):
-   case IP_VERSION(3, 0, 192):
amdgpu_device_ip_block_add(adev, _v3_0_ip_block);
if (!amdgpu_sriov_vf(adev))
amdgpu_device_ip_block_add(adev, 
_v3_0_ip_block); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
index 585961c2f5f2..2658414c503d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
@@ -134,8 +134,6 @@ int amdgpu_vcn_sw_init(struct amdgpu_device *adev)
adev->vcn.indirect_sram = true;
break;
case IP_VERSION(3, 0, 0):
-   case IP_VERSION(3, 0, 64):
-   case IP_VERSION(3, 0, 192):
if (adev->ip_versions[GC_HWIP][0] == IP_VERSION(10, 3, 0))
fw_name = FIRMWARE_SIENNA_CICHLID;
else
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
index bfa27ea94804..938a5ead3f20 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
@@ -235,6 +235,7 @@ struct amdgpu_vcn {
 
uint8_t num_vcn_inst;
struct amdgpu_vcn_inst   inst[AMDGPU_MAX_VCN_INSTANCES];
+   uint8_t  sriov_config[AMDGPU_MAX_VCN_INSTANCES];
struct amdgpu_vcn_reginternal;
struct mutex vcn_pg_lock;
struct mutexvcn1_jpeg1_workaround;
diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c 
index 2ec1ffb36b1f..7088528079c6 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -182,8 +182,6 @@ static int nv_query_video_codecs(struct amdgpu_device 
*adev, bool encode,  {
switch (adev->ip_versions[UVD_HWIP][0]) {
case IP_VERSION(3, 0, 0):
-   case IP_VERSION(3, 0, 64):
-   case IP_VERSION(3, 0, 192):
if (amdgpu_sriov_vf(adev)) {
if (encode)
*codecs = _sc_video_codecs_encode;
--
2.31.1


Re: [PATCH v4 1/6] drm: move the buddy allocator from i915 into common drm

2021-12-01 Thread kernel test robot
Hi Arunpravin,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on drm-intel/for-linux-next]
[also build test ERROR on v5.16-rc3]
[cannot apply to drm/drm-next drm-tip/drm-tip next-20211201]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:
https://github.com/0day-ci/linux/commits/Arunpravin/drm-move-the-buddy-allocator-from-i915-into-common-drm/20211202-004327
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: x86_64-randconfig-a012-20211130 
(https://download.01.org/0day-ci/archive/20211202/202112020812.si0y9psy-...@intel.com/config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 
4b553297ef3ee4dc2119d5429adf3072e90fac38)
reproduce (this is a W=1 build):
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# 
https://github.com/0day-ci/linux/commit/afbc900c0399e8c6220abd729932e877e81f37c8
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review 
Arunpravin/drm-move-the-buddy-allocator-from-i915-into-common-drm/20211202-004327
git checkout afbc900c0399e8c6220abd729932e877e81f37c8
# save the config file to linux build tree
mkdir build_dir
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 
O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/gpu/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All errors (new ones prefixed by >>):

   In file included from drivers/gpu/drm/i915/intel_memory_region.c:242:
>> drivers/gpu/drm/i915/selftests/intel_memory_region.c:23:10: fatal error: 
>> 'i915_buddy.h' file not found
   #include "i915_buddy.h"
^~
   1 error generated.


vim +23 drivers/gpu/drm/i915/selftests/intel_memory_region.c

232a6ebae419193 Matthew Auld 2019-10-08  14  
340be48f2c5a3c0 Matthew Auld 2019-10-25  15  #include 
"gem/i915_gem_context.h"
b908be543e44414 Matthew Auld 2019-10-25  16  #include "gem/i915_gem_lmem.h"
232a6ebae419193 Matthew Auld 2019-10-08  17  #include 
"gem/i915_gem_region.h"
340be48f2c5a3c0 Matthew Auld 2019-10-25  18  #include 
"gem/selftests/igt_gem_utils.h"
232a6ebae419193 Matthew Auld 2019-10-08  19  #include 
"gem/selftests/mock_context.h"
99919be74aa3753 Thomas Hellström 2021-06-17  20  #include "gt/intel_engine_pm.h"
6804da20bb549e3 Chris Wilson 2019-10-27  21  #include 
"gt/intel_engine_user.h"
b908be543e44414 Matthew Auld 2019-10-25  22  #include "gt/intel_gt.h"
d53ec322dc7de32 Matthew Auld 2021-06-16 @23  #include "i915_buddy.h"
99919be74aa3753 Thomas Hellström 2021-06-17  24  #include "gt/intel_migrate.h"
ba12993c5228015 Matthew Auld 2020-01-29  25  #include "i915_memcpy.h"
d53ec322dc7de32 Matthew Auld 2021-06-16  26  #include 
"i915_ttm_buddy_manager.h"
01377a0d7e6648b Abdiel Janulgue  2019-10-25  27  #include 
"selftests/igt_flush_test.h"
2f0b97ca0211863 Matthew Auld 2019-10-08  28  #include 
"selftests/i915_random.h"
232a6ebae419193 Matthew Auld 2019-10-08  29  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


[pull] amdgpu, amdkfd drm-fixes-5.16

2021-12-01 Thread Alex Deucher
Hi Dave, Daniel,

Fixes for 5.16.

The following changes since commit d58071a8a76d779eedab38033ae4c821c30295a5:

  Linux 5.16-rc3 (2021-11-28 14:09:19 -0800)

are available in the Git repository at:

  https://gitlab.freedesktop.org/agd5f/linux.git 
tags/amd-drm-fixes-5.16-2021-12-01

for you to fetch changes up to 3abfe30d803e62cc75dec254eefab3b04d69219b:

  drm/amdkfd: process_info lock not needed for svm (2021-12-01 17:09:58 -0500)


amd-drm-fixes-5.16-2021-12-01:

amdgpu:
- IP discovery based enumeration fixes
- vkms fixes
- DSC fixes for DP MST
- Audio fix for hotplug with tiled displays
- Misc display fixes
- DP tunneling fix
- DP fix
- Aldebaran fix

amdkfd:
- Locking fix
- Static checker fix
- Fix double free


Flora Cui (2):
  drm/amdgpu: cancel the correct hrtimer on exit
  drm/amdgpu: check atomic flag to differeniate with legacy path

Guchun Chen (1):
  drm/amdgpu: fix the missed handling for SDMA2 and SDMA3

Jane Jian (1):
  drm/amdgpu/sriov/vcn: add new vcn ip revision check case for 
SIENNA_CICHLID

Jimmy Kizito (1):
  drm/amd/display: Add work around for tunneled MST.

Lijo Lazar (1):
  drm/amdgpu: Don't halt RLC on GFX suspend

Mustapha Ghaddar (1):
  drm/amd/display: Fix for the no Audio bug with Tiled Displays

Nicholas Kazlauskas (1):
  drm/amd/display: Allow DSC on supported MST branch devices

Perry Yuan (1):
  drm/amd/display: add connector type check for CRC source set

Philip Yang (3):
  drm/amdkfd: set "r = 0" explicitly before goto
  drm/amdkfd: fix double free mem structure
  drm/amdkfd: process_info lock not needed for svm

Shen, George (1):
  drm/amd/display: Clear DPCD lane settings after repeater training

shaoyunl (1):
  drm/amdgpu: adjust the kfd reset sequence in reset sriov function

 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c   |  8 +---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c  |  3 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c|  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c   |  4 ++--
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c  |  7 ---
 drivers/gpu/drm/amd/amdgpu/nv.c|  1 +
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c   | 13 
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crc.c  |  8 
 .../amd/display/amdgpu_dm/amdgpu_dm_mst_types.c| 20 ++
 drivers/gpu/drm/amd/display/dc/core/dc_link.c  | 16 +++
 drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c   |  2 +-
 drivers/gpu/drm/amd/display/dc/core/dc_resource.c  | 24 +-
 drivers/gpu/drm/amd/display/dc/dc.h|  3 ++-
 drivers/gpu/drm/amd/display/dc/dc_link.h   |  2 ++
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  |  2 +-
 16 files changed, 90 insertions(+), 40 deletions(-)


[PATCH] drm/amdgpu: handle SRIOV VCN revision parsing

2021-12-01 Thread Alex Deucher
For SR-IOV, the IP discovery revision number encodes
additional information.  Handle that case here.

v2: drop additional IP versions

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 17 ++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c   |  2 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h   |  1 +
 drivers/gpu/drm/amd/amdgpu/nv.c   |  2 --
 4 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
index ea00090b3fb3..552031950518 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
@@ -379,8 +379,21 @@ int amdgpu_discovery_reg_base_init(struct amdgpu_device 
*adev)
  ip->major, ip->minor,
  ip->revision);
 
-   if (le16_to_cpu(ip->hw_id) == VCN_HWID)
+   if (le16_to_cpu(ip->hw_id) == VCN_HWID) {
+   if (amdgpu_sriov_vf(adev)) {
+   /* SR-IOV modifies each VCN's revision 
(uint8)
+* Bit [5:0]: original revision value
+* Bit [7:6]: en/decode capability:
+* 0b00 : VCN function normally
+* 0b10 : encode is disabled
+* 0b01 : decode is disabled
+*/
+   
adev->vcn.sriov_config[adev->vcn.num_vcn_inst] =
+   (ip->revision & 0xc0) >> 6;
+   ip->revision &= ~0xc0;
+   }
adev->vcn.num_vcn_inst++;
+   }
if (le16_to_cpu(ip->hw_id) == SDMA0_HWID ||
le16_to_cpu(ip->hw_id) == SDMA1_HWID ||
le16_to_cpu(ip->hw_id) == SDMA2_HWID ||
@@ -917,10 +930,8 @@ static int amdgpu_discovery_set_mm_ip_blocks(struct 
amdgpu_device *adev)
break;
case IP_VERSION(3, 0, 0):
case IP_VERSION(3, 0, 16):
-   case IP_VERSION(3, 0, 64):
case IP_VERSION(3, 1, 1):
case IP_VERSION(3, 0, 2):
-   case IP_VERSION(3, 0, 192):
amdgpu_device_ip_block_add(adev, _v3_0_ip_block);
if (!amdgpu_sriov_vf(adev))
amdgpu_device_ip_block_add(adev, 
_v3_0_ip_block);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
index 585961c2f5f2..2658414c503d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
@@ -134,8 +134,6 @@ int amdgpu_vcn_sw_init(struct amdgpu_device *adev)
adev->vcn.indirect_sram = true;
break;
case IP_VERSION(3, 0, 0):
-   case IP_VERSION(3, 0, 64):
-   case IP_VERSION(3, 0, 192):
if (adev->ip_versions[GC_HWIP][0] == IP_VERSION(10, 3, 0))
fw_name = FIRMWARE_SIENNA_CICHLID;
else
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
index bfa27ea94804..938a5ead3f20 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
@@ -235,6 +235,7 @@ struct amdgpu_vcn {
 
uint8_t num_vcn_inst;
struct amdgpu_vcn_inst   inst[AMDGPU_MAX_VCN_INSTANCES];
+   uint8_t  sriov_config[AMDGPU_MAX_VCN_INSTANCES];
struct amdgpu_vcn_reginternal;
struct mutex vcn_pg_lock;
struct mutexvcn1_jpeg1_workaround;
diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
index 2ec1ffb36b1f..7088528079c6 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -182,8 +182,6 @@ static int nv_query_video_codecs(struct amdgpu_device 
*adev, bool encode,
 {
switch (adev->ip_versions[UVD_HWIP][0]) {
case IP_VERSION(3, 0, 0):
-   case IP_VERSION(3, 0, 64):
-   case IP_VERSION(3, 0, 192):
if (amdgpu_sriov_vf(adev)) {
if (encode)
> *codecs = &sriov_sc_video_codecs_encode;
-- 
2.31.1



Re: [PATCH] drm/amd/display: fix mixed declaration and code

2021-12-01 Thread Harry Wentland



On 2021-12-01 15:33, Alex Deucher wrote:
> Reorder the code to fix the warning.
> 
> Fixes: 8808f3ffb14d79 ("drm/amd/display: Add DP-HDMI FRL PCON Support in DC")
> Cc: Fangzhi Zuo 
> Signed-off-by: Alex Deucher 

Reviewed-by: Harry Wentland 

Harry

> ---
>  drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c 
> b/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c
> index 66dacde7a7cc..62510b643882 100644
> --- a/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c
> +++ b/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c
> @@ -4684,12 +4684,12 @@ static void get_active_converter_info(
>  
>  #if defined(CONFIG_DRM_AMD_DC_DCN)
>   if 
> (link->dc->caps.hdmi_frl_pcon_support) {
> + union hdmi_encoded_link_bw 
> hdmi_encoded_link_bw;
> +
>   
> link->dpcd_caps.dongle_caps.dp_hdmi_frl_max_link_bw_in_kbps =
>   
> dc_link_bw_kbps_from_raw_frl_link_rate_data(
>   
> hdmi_color_caps.bits.MAX_ENCODED_LINK_BW_SUPPORT);
>  
> - union hdmi_encoded_link_bw 
> hdmi_encoded_link_bw;
> -
>   // Intersect reported max link 
> bw support with the supported link rate post FRL link training
>   if (core_link_read_dpcd(link, 
> DP_PCON_HDMI_POST_FRL_STATUS,
>   
> &hdmi_encoded_link_bw.raw, sizeof(hdmi_encoded_link_bw)) == DC_OK) {
> 



[PATCH] drm/amd/display: fix mixed declaration and code

2021-12-01 Thread Alex Deucher
Reorder the code to fix the warning.

Fixes: 8808f3ffb14d79 ("drm/amd/display: Add DP-HDMI FRL PCON Support in DC")
Cc: Fangzhi Zuo 
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c 
b/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c
index 66dacde7a7cc..62510b643882 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c
@@ -4684,12 +4684,12 @@ static void get_active_converter_info(
 
 #if defined(CONFIG_DRM_AMD_DC_DCN)
if 
(link->dc->caps.hdmi_frl_pcon_support) {
+   union hdmi_encoded_link_bw 
hdmi_encoded_link_bw;
+

link->dpcd_caps.dongle_caps.dp_hdmi_frl_max_link_bw_in_kbps =

dc_link_bw_kbps_from_raw_frl_link_rate_data(

hdmi_color_caps.bits.MAX_ENCODED_LINK_BW_SUPPORT);
 
-   union hdmi_encoded_link_bw 
hdmi_encoded_link_bw;
-
// Intersect reported max link 
bw support with the supported link rate post FRL link training
if (core_link_read_dpcd(link, 
DP_PCON_HDMI_POST_FRL_STATUS,

&hdmi_encoded_link_bw.raw, sizeof(hdmi_encoded_link_bw)) == DC_OK) {
-- 
2.31.1



Re: [PATCH] drm/radeon/radeon_connectors: Fix a NULL pointer dereference in radeon_fp_native_mode()

2021-12-01 Thread Alex Deucher
On Tue, Nov 30, 2021 at 9:49 AM Zhou Qingyang  wrote:
>
> In radeon_fp_native_mode(), the return value of drm_mode_duplicate() is
> assigned to mode and there is a dereference of it in
> radeon_fp_native_mode(), which could lead to a NULL pointer
> dereference on failure of drm_mode_duplicate().
>
> Fix this bug by adding a check of mode.
>
> This bug was found by a static analyzer. The analysis employs
> differential checking to identify inconsistent security operations
> (e.g., checks or kfrees) between two code paths and confirms that the
> inconsistent operations are not recovered in the current function or
> the callers, so they constitute bugs.
>
> Note that, as a bug found by static analysis, it can be a false
> positive or hard to trigger. Multiple researchers have cross-reviewed
> the bug.
>
> Builds with CONFIG_DRM_RADEON=m show no new warnings,
> and our static analyzer no longer warns about this code.
>
> Fixes: d2efdf6d6f42 ("drm/radeon/kms: add cvt mode if we only have lvds w/h 
> and no edid (v4)")
> Signed-off-by: Zhou Qingyang 
> ---
>  drivers/gpu/drm/radeon/radeon_connectors.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/radeon/radeon_connectors.c 
> b/drivers/gpu/drm/radeon/radeon_connectors.c
> index 607ad5620bd9..49f187614f96 100644
> --- a/drivers/gpu/drm/radeon/radeon_connectors.c
> +++ b/drivers/gpu/drm/radeon/radeon_connectors.c
> @@ -473,6 +473,9 @@ static struct drm_display_mode 
> *radeon_fp_native_mode(struct drm_encoder *encode
> native_mode->vdisplay != 0 &&
> native_mode->clock != 0) {
> mode = drm_mode_duplicate(dev, native_mode);
> +   if (!mode)
> +   return NULL;
> +

The else if clause needs a similar check.  Care to fix that up as well?

Alex

> mode->type = DRM_MODE_TYPE_PREFERRED | DRM_MODE_TYPE_DRIVER;
> drm_mode_set_name(mode);
>
> --
> 2.25.1
>


Re: [PATCH] fix a NULL pointer dereference in amdgpu_connector_lcd_native_mode()

2021-12-01 Thread Alex Deucher
On Tue, Nov 30, 2021 at 6:24 AM Zhou Qingyang  wrote:
>
> In amdgpu_connector_lcd_native_mode(), the return value of
> drm_mode_duplicate() is assigned to mode, and there is a dereference
> of it in amdgpu_connector_lcd_native_mode(), which will lead to a NULL
> pointer dereference on failure of drm_mode_duplicate().
>
> Fix this bug by adding a check of mode.
>
> This bug was found by a static analyzer. The analysis employs
> differential checking to identify inconsistent security operations
> (e.g., checks or kfrees) between two code paths and confirms that the
> inconsistent operations are not recovered in the current function or
> the callers, so they constitute bugs.
>
> Note that, as a bug found by static analysis, it can be a false
> positive or hard to trigger. Multiple researchers have cross-reviewed
> the bug.
>
> Builds with CONFIG_DRM_AMDGPU=m show no new warnings, and
> our static analyzer no longer warns about this code.
>
> Fixes: d38ceaf99ed0 ("drm/amdgpu: add core driver (v4)")
> Signed-off-by: Zhou Qingyang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c
> index 0de66f59adb8..0170aa84c5e6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c
> @@ -387,6 +387,9 @@ amdgpu_connector_lcd_native_mode(struct drm_encoder 
> *encoder)
> native_mode->vdisplay != 0 &&
> native_mode->clock != 0) {
> mode = drm_mode_duplicate(dev, native_mode);
> +   if (!mode)
> +   return NULL;
> +

The else if clause needs a similar check.  Care to fix that up as well?

Alex

> mode->type = DRM_MODE_TYPE_PREFERRED | DRM_MODE_TYPE_DRIVER;
> drm_mode_set_name(mode);
>
> --
> 2.25.1
>


Re: [PATCH v2] drm/amdgpu: update fw_load_type module parameter doc to match code

2021-12-01 Thread Alex Deucher
Applied.  Thanks!

On Mon, Nov 29, 2021 at 3:09 PM Yann Dirson  wrote:
>
> amdgpu_ucode_get_load_type() does not interpret this parameter as
> documented.  It is ignored for many ASIC types (which presumably
> only support one load_type), and when not ignored it is only used
> to force direct loading instead of PSP loading.  SMU loading is
> only available for ASICs for which the parameter is ignored.
>
> Signed-off-by: Yann Dirson 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index ecdec75fdf69..64881068b115 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -315,9 +315,12 @@ module_param_named(dpm, amdgpu_dpm, int, 0444);
>
>  /**
>   * DOC: fw_load_type (int)
> - * Set different firmware loading type for debugging (0 = direct, 1 = SMU, 2 
> = PSP). The default is -1 (auto).
> + * Set different firmware loading type for debugging, if supported.
> + * Set to 0 to force direct loading if supported by the ASIC.  Set
> + * to -1 to select the default loading mode for the ASIC, as defined
> + * by the driver.  The default is -1 (auto).
>   */
> -MODULE_PARM_DESC(fw_load_type, "firmware loading type (0 = direct, 1 = SMU, 
> 2 = PSP, -1 = auto)");
> +MODULE_PARM_DESC(fw_load_type, "firmware loading type (0 = force direct if 
> supported, -1 = auto)");
>  module_param_named(fw_load_type, amdgpu_fw_load_type, int, 0444);
>
>  /**
> --
> 2.31.1
>


Re: [v3] drm/amdgpu: reset asic after system-wide suspend aborted (v3)

2021-12-01 Thread Alex Deucher
On Wed, Dec 1, 2021 at 1:46 PM Limonciello, Mario
 wrote:
>
> On 11/24/2021 23:48, Prike Liang wrote:
> > Do an ASIC reset when an Sx suspend is aborted after amdgpu has
> > suspended, to keep the GPU in a clean reset state and avoid errors
> > from re-initializing the device improperly. Currently, we always do
> > an ASIC reset in amdgpu resume until the PM abort case is sorted out.
> >
> > v2: Remove incomplete PM abort flag and add GPU hive case check for
> > GPU reset.
> >
> > v3: Some dGPU reset methods are not supported at early resume time,
> > so temporarily skip the dGPU case.
> >
> > Signed-off-by: Prike Liang 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 
> >   1 file changed, 8 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 7d4115d..f6e1a6a 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -3983,6 +3983,14 @@ int amdgpu_device_resume(struct drm_device *dev, 
> > bool fbcon)
> >   if (adev->in_s0ix)
> >   amdgpu_gfx_state_change_set(adev, sGpuChangeState_D0Entry);
> >
> > + /* TODO: To keep the unconditional ASIC reset from affecting resume
> > +  * latency, sort out which cases really need an ASIC reset in the
> > +  * resume process. For the known issue of a system suspend abort
> > +  * after the amdgpu suspend, the case may be detected by checking
> > +  * struct suspend_stats, which would need to be exported first.
> > +  */
> > + if (adev->flags & AMD_IS_APU)
> > + amdgpu_asic_reset(adev);
>
> Ideally you only want this to happen on S3 right?  So shouldn't there be
> an extra check for `amdgpu_acpi_is_s0ix_active`?

Shouldn't matter on the resume side.  Only the suspend side.  If we
reset in suspend, we'd end up disabling gfxoff.  On the resume side,
it should be safe, but the resume paths for various IPs probably are not
adequate to deal with a reset for S0i3 since they don't re-init as
much hardware.  So it's probably better to skip this for S0i3.

Alex


>
> >   /* post card */
> >   if (amdgpu_device_need_post(adev)) {
> >   r = amdgpu_device_asic_init(adev);
> >
>


Re: [v3] drm/amdgpu: reset asic after system-wide suspend aborted (v3)

2021-12-01 Thread Limonciello, Mario

On 11/24/2021 23:48, Prike Liang wrote:

Do an ASIC reset when an Sx suspend is aborted after amdgpu has
suspended, to keep the GPU in a clean reset state and avoid errors
from re-initializing the device improperly. Currently, we always do
an ASIC reset in amdgpu resume until the PM abort case is sorted out.

v2: Remove incomplete PM abort flag and add GPU hive case check for
GPU reset.

v3: Some dGPU reset methods are not supported at early resume time,
so temporarily skip the dGPU case.

Signed-off-by: Prike Liang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 
  1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 7d4115d..f6e1a6a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3983,6 +3983,14 @@ int amdgpu_device_resume(struct drm_device *dev, bool 
fbcon)
if (adev->in_s0ix)
amdgpu_gfx_state_change_set(adev, sGpuChangeState_D0Entry);
  
+	/* TODO: To keep the unconditional ASIC reset from affecting resume
+	 * latency, sort out which cases really need an ASIC reset in the
+	 * resume process. For the known issue of a system suspend abort
+	 * after the amdgpu suspend, the case may be detected by checking
+	 * struct suspend_stats, which would need to be exported first.
+	 */
+   if (adev->flags & AMD_IS_APU)
+   amdgpu_asic_reset(adev);


Ideally you only want this to happen on S3 right?  So shouldn't there be 
an extra check for `amdgpu_acpi_is_s0ix_active`?



/* post card */
if (amdgpu_device_need_post(adev)) {
r = amdgpu_device_asic_init(adev);





[PATCH v4 5/6] drm/amdgpu: move vram inline functions into a header

2021-12-01 Thread Arunpravin
Move shared vram inline functions and structs
into a header file

Signed-off-by: Arunpravin 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h | 51 
 1 file changed, 51 insertions(+)
 create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h
new file mode 100644
index ..59983464cce5
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: MIT
+ * Copyright 2021 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ */
+
+#ifndef __AMDGPU_VRAM_MGR_H__
+#define __AMDGPU_VRAM_MGR_H__
+
+#include 
+
+struct amdgpu_vram_mgr_node {
+   struct ttm_resource base;
+   struct list_head blocks;
+   unsigned long flags;
+};
+
+static inline u64 amdgpu_node_start(struct drm_buddy_block *block)
+{
+   return drm_buddy_block_offset(block);
+}
+
+static inline u64 amdgpu_node_size(struct drm_buddy_block *block)
+{
+   return PAGE_SIZE << drm_buddy_block_order(block);
+}
+
+static inline struct amdgpu_vram_mgr_node *
+to_amdgpu_vram_mgr_node(struct ttm_resource *res)
+{
+   return container_of(res, struct amdgpu_vram_mgr_node, base);
+}
+
+#endif
-- 
2.25.1



[PATCH v4 6/6] drm/amdgpu: add drm buddy support to amdgpu

2021-12-01 Thread Arunpravin
- Remove drm_mm references and replace with drm buddy functionalities
- Add res cursor support for drm buddy

v2(Matthew Auld):
  - replace spinlock with mutex as we call kmem_cache_zalloc
(..., GFP_KERNEL) in drm_buddy_alloc() function

  - lock drm_buddy_block_trim() function as it calls
mark_free/mark_split are all globally visible

v3:
  - remove drm_buddy_block_trim() error handling and
print a warn message if it fails

Signed-off-by: Arunpravin 
---
 drivers/gpu/drm/Kconfig   |   1 +
 .../gpu/drm/amd/amdgpu/amdgpu_res_cursor.h|  97 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h   |   6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c  | 261 ++
 4 files changed, 232 insertions(+), 133 deletions(-)

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index 7a4a66d54782..dd880910282b 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -276,6 +276,7 @@ config DRM_AMDGPU
select HWMON
select BACKLIGHT_CLASS_DEVICE
select INTERVAL_TREE
+   select DRM_BUDDY
help
  Choose this option if you have a recent AMD Radeon graphics card.
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h
index acfa207cf970..da12b4ff2e45 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h
@@ -30,12 +30,15 @@
 #include 
 #include 
 
+#include "amdgpu_vram_mgr.h"
+
 /* state back for walking over vram_mgr and gtt_mgr allocations */
 struct amdgpu_res_cursor {
uint64_tstart;
uint64_tsize;
uint64_tremaining;
-   struct drm_mm_node  *node;
+   void*node;
+   uint32_tmem_type;
 };
 
 /**
@@ -52,27 +55,63 @@ static inline void amdgpu_res_first(struct ttm_resource 
*res,
uint64_t start, uint64_t size,
struct amdgpu_res_cursor *cur)
 {
+   struct drm_buddy_block *block;
+   struct list_head *head, *next;
struct drm_mm_node *node;
 
-   if (!res || res->mem_type == TTM_PL_SYSTEM) {
-   cur->start = start;
-   cur->size = size;
-   cur->remaining = size;
-   cur->node = NULL;
-   WARN_ON(res && start + size > res->num_pages << PAGE_SHIFT);
-   return;
-   }
+   if (!res)
+   goto err_out;
 
BUG_ON(start + size > res->num_pages << PAGE_SHIFT);
 
-   node = to_ttm_range_mgr_node(res)->mm_nodes;
-   while (start >= node->size << PAGE_SHIFT)
-   start -= node++->size << PAGE_SHIFT;
+   cur->mem_type = res->mem_type;
+
+   switch (cur->mem_type) {
+   case TTM_PL_VRAM:
+   head = &to_amdgpu_vram_mgr_node(res)->blocks;
+
+   block = list_first_entry_or_null(head,
+struct drm_buddy_block,
+link);
+   if (!block)
+   goto err_out;
+
+   while (start >= amdgpu_node_size(block)) {
+   start -= amdgpu_node_size(block);
+
+   next = block->link.next;
+   if (next != head)
+   block = list_entry(next, struct drm_buddy_block, link);
+   }
+
+   cur->start = amdgpu_node_start(block) + start;
+   cur->size = min(amdgpu_node_size(block) - start, size);
+   cur->remaining = size;
+   cur->node = block;
+   break;
+   case TTM_PL_TT:
+   node = to_ttm_range_mgr_node(res)->mm_nodes;
+   while (start >= node->size << PAGE_SHIFT)
+   start -= node++->size << PAGE_SHIFT;
+
+   cur->start = (node->start << PAGE_SHIFT) + start;
+   cur->size = min((node->size << PAGE_SHIFT) - start, size);
+   cur->remaining = size;
+   cur->node = node;
+   break;
+   default:
+   goto err_out;
+   }
 
-   cur->start = (node->start << PAGE_SHIFT) + start;
-   cur->size = min((node->size << PAGE_SHIFT) - start, size);
+   return;
+
+err_out:
+   cur->start = start;
+   cur->size = size;
cur->remaining = size;
-   cur->node = node;
+   cur->node = NULL;
+   WARN_ON(res && start + size > res->num_pages << PAGE_SHIFT);
+   return;
 }
 
 /**
@@ -85,7 +124,9 @@ static inline void amdgpu_res_first(struct ttm_resource *res,
  */
 static inline void amdgpu_res_next(struct amdgpu_res_cursor *cur, uint64_t 
size)
 {
-   struct drm_mm_node *node = cur->node;
+   struct drm_buddy_block *block;
+   struct drm_mm_node *node;
+   struct list_head *next;
 
BUG_ON(size > cur->remaining);
 
@@ -99,9 

[PATCH v4 4/6] drm: implement a method to free unused pages

2021-12-01 Thread Arunpravin
On contiguous allocation, we round up the size
to the *next* power of 2; implement a function
to free the unused pages beyond the newly
allocated block.

v2(Matthew Auld):
  - replace function name 'drm_buddy_free_unused_pages' with
drm_buddy_block_trim
  - replace input argument name 'actual_size' with 'new_size'
  - add more validation checks for input arguments
  - add overlaps check to avoid needless searching and splitting
  - merged the below patch to see the feature in action
- add free unused pages support to i915 driver
  - lock drm_buddy_block_trim() function as it calls mark_free/mark_split
are all globally visible

v3:
  - remove drm_buddy_block_trim() error handling and
print a warn message if it fails

Signed-off-by: Arunpravin 
---
 drivers/gpu/drm/drm_buddy.c   | 72 ++-
 drivers/gpu/drm/i915/i915_ttm_buddy_manager.c | 10 +++
 include/drm/drm_buddy.h   |  4 ++
 3 files changed, 83 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
index eddc1eeda02e..707efc82216d 100644
--- a/drivers/gpu/drm/drm_buddy.c
+++ b/drivers/gpu/drm/drm_buddy.c
@@ -434,7 +434,8 @@ alloc_from_freelist(struct drm_buddy_mm *mm,
 static int __alloc_range(struct drm_buddy_mm *mm,
 struct list_head *dfs,
 u64 start, u64 size,
-struct list_head *blocks)
+struct list_head *blocks,
+bool trim_path)
 {
struct drm_buddy_block *block;
struct drm_buddy_block *buddy;
@@ -480,8 +481,20 @@ static int __alloc_range(struct drm_buddy_mm *mm,
 
if (!drm_buddy_block_is_split(block)) {
err = split_block(mm, block);
-   if (unlikely(err))
+   if (unlikely(err)) {
+   if (trim_path)
+   /*
+* In the trim case, return instead of
+* taking the split failure path: that
+* path removes the block from the
+* original list and potentially frees
+* it. Leaving it as-is costs at worst
+* some internal fragmentation, and we
+* leave the decision to the user.
+*/
+   return err;
+
goto err_undo;
+   }
}
 
list_add(&block->right->tmp_link, dfs);
@@ -535,8 +548,61 @@ static int __drm_buddy_alloc_range(struct drm_buddy_mm *mm,
for (i = 0; i < mm->n_roots; ++i)
list_add_tail(&mm->roots[i]->tmp_link, &dfs);
 
-   return __alloc_range(mm, &dfs, start, size, blocks);
+   return __alloc_range(mm, &dfs, start, size, blocks, 0);
+}
+
+/**
+ * drm_buddy_block_trim - free unused pages
+ *
+ * @mm: DRM buddy manager
+ * @new_size: original size requested
+ * @blocks: output list head to add allocated blocks
+ *
+ * For contiguous allocation, we round up the size to the nearest
+ * power-of-two value; drivers consume the *actual* size, so the
+ * remaining portion is unused and can be freed.
+ *
+ * Returns:
+ * 0 on success, error code on failure.
+ */
+int drm_buddy_block_trim(struct drm_buddy_mm *mm,
+u64 new_size,
+struct list_head *blocks)
+{
+   struct drm_buddy_block *block;
+   u64 new_start;
+   LIST_HEAD(dfs);
+
+   if (!list_is_singular(blocks))
+   return -EINVAL;
+
+   block = list_first_entry(blocks,
+struct drm_buddy_block,
+link);
+
+   if (!drm_buddy_block_is_allocated(block))
+   return -EINVAL;
+
+   if (new_size > drm_buddy_block_size(mm, block))
+   return -EINVAL;
+
+   if (!new_size || !IS_ALIGNED(new_size, mm->chunk_size))
+   return -EINVAL;
+
+   if (new_size == drm_buddy_block_size(mm, block))
+   return 0;
+
+   list_del(&block->link);
+
+   new_start = drm_buddy_block_offset(block);
+
+   mark_free(mm, block);
+
+   list_add(&block->tmp_link, &dfs);
+
+   return __alloc_range(mm, &dfs, new_start, new_size, blocks, 1);
 }
+EXPORT_SYMBOL(drm_buddy_block_trim);
 
 /**
  * drm_buddy_alloc - allocate power-of-two blocks
diff --git a/drivers/gpu/drm/i915/i915_ttm_buddy_manager.c 
b/drivers/gpu/drm/i915/i915_ttm_buddy_manager.c
index 7c58efb60dba..c5831c27fe82 100644
--- a/drivers/gpu/drm/i915/i915_ttm_buddy_manager.c
+++ b/drivers/gpu/drm/i915/i915_ttm_buddy_manager.c
@@ -97,6 +97,16 @@ static int i915_ttm_buddy_man_alloc(struct 
ttm_resource_manager *man,
if (unlikely(err))
goto err_free_blocks;
 
+   if 

[PATCH v4 3/6] drm: implement top-down allocation method

2021-12-01 Thread Arunpravin
Implemented a function that walks through the order list,
compares the offsets and returns the maximum offset block.
This method is unpredictable in obtaining the high range
address blocks, since it depends on allocation and
deallocation history. For instance, if the driver requests
an address in a specific low range, the allocator traverses
from the root block and splits the larger blocks until it
reaches the specific block; in the process of splitting,
lower orders in the freelist become occupied with low range
address blocks, so a subsequent TOPDOWN memory request may
still return low range blocks. To overcome this issue, we
may go with the approach below.

The other approach: sort each order list's entries in
ascending order, compare the last entry of each order
list in the freelist and return the max block. This
creates sorting overhead on every drm_buddy_free()
request and splits up larger blocks for a single page
request.

v2:
  - Fix alignment issues(Matthew Auld)
  - Remove unnecessary list_empty check(Matthew Auld)
  - merged the below patch to see the feature in action
- add top-down alloc support to i915 driver

Signed-off-by: Arunpravin 
---
 drivers/gpu/drm/drm_buddy.c   | 36 ---
 drivers/gpu/drm/i915/i915_ttm_buddy_manager.c |  3 ++
 include/drm/drm_buddy.h   |  1 +
 3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
index 7f47632821f4..eddc1eeda02e 100644
--- a/drivers/gpu/drm/drm_buddy.c
+++ b/drivers/gpu/drm/drm_buddy.c
@@ -367,6 +367,26 @@ alloc_range_bias(struct drm_buddy_mm *mm,
return ERR_PTR(err);
 }
 
+static struct drm_buddy_block *
+get_maxblock(struct list_head *head)
+{
+   struct drm_buddy_block *max_block = NULL, *node;
+
+   max_block = list_first_entry_or_null(head,
+struct drm_buddy_block,
+link);
+   if (!max_block)
+   return NULL;
+
+   list_for_each_entry(node, head, link) {
+   if (drm_buddy_block_offset(node) >
+   drm_buddy_block_offset(max_block))
+   max_block = node;
+   }
+
+   return max_block;
+}
+
 static struct drm_buddy_block *
 alloc_from_freelist(struct drm_buddy_mm *mm,
unsigned int order,
@@ -377,11 +397,17 @@ alloc_from_freelist(struct drm_buddy_mm *mm,
int err;
 
for (i = order; i <= mm->max_order; ++i) {
-   block = list_first_entry_or_null(&mm->free_list[i],
-struct drm_buddy_block,
-link);
-   if (block)
-   break;
+   if (flags & DRM_BUDDY_TOPDOWN_ALLOCATION) {
+   block = get_maxblock(&mm->free_list[i]);
+   if (block)
+   break;
+   } else {
+   block = list_first_entry_or_null(&mm->free_list[i],
+struct drm_buddy_block,
+link);
+   if (block)
+   break;
+   }
}
 
if (!block)
diff --git a/drivers/gpu/drm/i915/i915_ttm_buddy_manager.c 
b/drivers/gpu/drm/i915/i915_ttm_buddy_manager.c
index 7621d42155e6..7c58efb60dba 100644
--- a/drivers/gpu/drm/i915/i915_ttm_buddy_manager.c
+++ b/drivers/gpu/drm/i915/i915_ttm_buddy_manager.c
@@ -53,6 +53,9 @@ static int i915_ttm_buddy_man_alloc(struct 
ttm_resource_manager *man,
INIT_LIST_HEAD(&bman_res->blocks);
bman_res->mm = mm;
 
+   if (place->flags & TTM_PL_FLAG_TOPDOWN)
+   bman_res->flags |= DRM_BUDDY_TOPDOWN_ALLOCATION;
+
if (place->fpfn || lpfn != man->size)
bman_res->flags |= DRM_BUDDY_RANGE_ALLOCATION;
 
diff --git a/include/drm/drm_buddy.h b/include/drm/drm_buddy.h
index 221de702e909..316ac0d25f08 100644
--- a/include/drm/drm_buddy.h
+++ b/include/drm/drm_buddy.h
@@ -28,6 +28,7 @@
 })
 
 #define DRM_BUDDY_RANGE_ALLOCATION (1 << 0)
+#define DRM_BUDDY_TOPDOWN_ALLOCATION (1 << 1)
 
 struct drm_buddy_block {
 #define DRM_BUDDY_HEADER_OFFSET GENMASK_ULL(63, 12)
-- 
2.25.1



[PATCH v4 2/6] drm: improve drm_buddy_alloc function

2021-12-01 Thread Arunpravin
- Make drm_buddy_alloc a single function to handle
  range allocation and non-range allocation demands

- Implemented a new function alloc_range() which allocates
  the requested power-of-two block comply with range limitations

- Moved order computation and memory alignment logic from
  i915 driver to drm buddy

v2:
  merged below changes to keep the build unbroken
   - drm_buddy_alloc_range() becomes obsolete and may be removed
   - enable ttm range allocation (fpfn / lpfn) support in i915 driver
   - apply enhanced drm_buddy_alloc() function to i915 driver

v3(Matthew Auld):
  - Fix alignment issues and remove unnecessary list_empty check
  - add more validation checks for input arguments
  - make alloc_range() block allocations as bottom-up
  - optimize order computation logic
  - replace uint64_t with u64, which is preferred in the kernel

v4(Matthew Auld):
  - keep drm_buddy_alloc_range() function implementation for generic
actual range allocations
  - keep alloc_range() implementation for end bias allocations

Signed-off-by: Arunpravin 
---
 drivers/gpu/drm/drm_buddy.c   | 316 +-
 drivers/gpu/drm/i915/i915_ttm_buddy_manager.c |  67 ++--
 drivers/gpu/drm/i915/i915_ttm_buddy_manager.h |   2 +
 include/drm/drm_buddy.h   |  22 +-
 4 files changed, 285 insertions(+), 122 deletions(-)

diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
index 9340a4b61c5a..7f47632821f4 100644
--- a/drivers/gpu/drm/drm_buddy.c
+++ b/drivers/gpu/drm/drm_buddy.c
@@ -280,23 +280,97 @@ void drm_buddy_free_list(struct drm_buddy_mm *mm, struct 
list_head *objects)
 }
 EXPORT_SYMBOL(drm_buddy_free_list);
 
-/**
- * drm_buddy_alloc - allocate power-of-two blocks
- *
- * @mm: DRM buddy manager to allocate from
- * @order: size of the allocation
- *
- * The order value here translates to:
- *
- * 0 = 2^0 * mm->chunk_size
- * 1 = 2^1 * mm->chunk_size
- * 2 = 2^2 * mm->chunk_size
- *
- * Returns:
- * allocated ptr to the _buddy_block on success
- */
-struct drm_buddy_block *
-drm_buddy_alloc(struct drm_buddy_mm *mm, unsigned int order)
+static inline bool overlaps(u64 s1, u64 e1, u64 s2, u64 e2)
+{
+   return s1 <= e2 && e1 >= s2;
+}
+
+static inline bool contains(u64 s1, u64 e1, u64 s2, u64 e2)
+{
+   return s1 <= s2 && e1 >= e2;
+}
+
+static struct drm_buddy_block *
+alloc_range_bias(struct drm_buddy_mm *mm,
+u64 start, u64 end,
+unsigned int order)
+{
+   struct drm_buddy_block *block;
+   struct drm_buddy_block *buddy;
+   LIST_HEAD(dfs);
+   int err;
+   int i;
+
+   end = end - 1;
+
+   for (i = 0; i < mm->n_roots; ++i)
+   list_add_tail(&mm->roots[i]->tmp_link, &dfs);
+
+   do {
+   u64 block_start;
+   u64 block_end;
+
+   block = list_first_entry_or_null(&dfs,
+struct drm_buddy_block,
+tmp_link);
+   if (!block)
+   break;
+
+   list_del(&block->tmp_link);
+
+   if (drm_buddy_block_order(block) < order)
+   continue;
+
+   block_start = drm_buddy_block_offset(block);
+   block_end = block_start + drm_buddy_block_size(mm, block) - 1;
+
+   if (!overlaps(start, end, block_start, block_end))
+   continue;
+
+   if (drm_buddy_block_is_allocated(block))
+   continue;
+
+   if (contains(start, end, block_start, block_end) &&
+   order == drm_buddy_block_order(block)) {
+   /*
+* Find the free block within the range.
+*/
+   if (drm_buddy_block_is_free(block))
+   return block;
+
+   continue;
+   }
+
+   if (!drm_buddy_block_is_split(block)) {
+   err = split_block(mm, block);
+   if (unlikely(err))
+   goto err_undo;
+   }
+
+   list_add(&block->right->tmp_link, &dfs);
+   list_add(&block->left->tmp_link, &dfs);
+   } while (1);
+
+   return ERR_PTR(-ENOSPC);
+
+err_undo:
+   /*
+* We really don't want to leave around a bunch of split blocks, since
+* bigger is better, so make sure we merge everything back before we
+* free the allocated blocks.
+*/
+   buddy = get_buddy(block);
+   if (buddy &&
+   (drm_buddy_block_is_free(block) &&
+drm_buddy_block_is_free(buddy)))
+   __drm_buddy_free(mm, block);
+   return ERR_PTR(err);
+}
+
+static struct drm_buddy_block *
+alloc_from_freelist(struct drm_buddy_mm *mm,
+   unsigned int order,
+   unsigned long flags)
 {
struct drm_buddy_block *block = NULL;
unsigned 

[PATCH v4 1/6] drm: move the buddy allocator from i915 into common drm

2021-12-01 Thread Arunpravin
Move the base i915 buddy allocator code into drm
- Move i915_buddy.h to include/drm
- Move i915_buddy.c to drm root folder
- Rename "i915" string with "drm" string wherever applicable
- Rename "I915" string with "DRM" string wherever applicable
- Fix header file dependencies
- Fix alignment issues
- add Makefile support for drm buddy
- export functions and write kerneldoc description
- Remove i915 selftest config check condition as buddy selftest
  will be moved to drm selftest folder

cleanup i915 buddy references in i915 driver module
and replace with drm buddy

v2:
  - include header file in alphabetical order(Thomas)
  - merged changes listed in the body section into a single patch
to keep the build intact(Christian, Jani)

v3:
  - make drm buddy a separate module(Thomas, Christian)

Signed-off-by: Arunpravin 
---
 drivers/gpu/drm/Kconfig   |   6 +
 drivers/gpu/drm/Makefile  |   2 +
 drivers/gpu/drm/drm_buddy.c   | 516 ++
 drivers/gpu/drm/i915/Kconfig  |   1 +
 drivers/gpu/drm/i915/Makefile |   1 -
 drivers/gpu/drm/i915/i915_buddy.c | 466 
 drivers/gpu/drm/i915/i915_buddy.h | 143 -
 drivers/gpu/drm/i915/i915_module.c|   3 -
 drivers/gpu/drm/i915/i915_scatterlist.c   |  11 +-
 drivers/gpu/drm/i915/i915_ttm_buddy_manager.c |  33 +-
 drivers/gpu/drm/i915/i915_ttm_buddy_manager.h |   4 +-
 include/drm/drm_buddy.h   | 154 ++
 12 files changed, 703 insertions(+), 637 deletions(-)
 create mode 100644 drivers/gpu/drm/drm_buddy.c
 delete mode 100644 drivers/gpu/drm/i915/i915_buddy.c
 delete mode 100644 drivers/gpu/drm/i915/i915_buddy.h
 create mode 100644 include/drm/drm_buddy.h

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index 0039df26854b..7a4a66d54782 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -197,6 +197,12 @@ config DRM_TTM
  GPU memory types. Will be enabled automatically if a device driver
  uses it.
 
+config DRM_BUDDY
+   tristate
+   depends on DRM
+   help
+ A page based buddy allocator
+
 config DRM_VRAM_HELPER
tristate
depends on DRM
diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
index 0dff40bb863c..e62e432bf1e5 100644
--- a/drivers/gpu/drm/Makefile
+++ b/drivers/gpu/drm/Makefile
@@ -35,6 +35,8 @@ drm-$(CONFIG_DRM_LOAD_EDID_FIRMWARE) += drm_edid_load.o
 
 obj-$(CONFIG_DRM_DP_AUX_BUS) += drm_dp_aux_bus.o
 
+obj-$(CONFIG_DRM_BUDDY) += drm_buddy.o
+
 drm_vram_helper-y := drm_gem_vram_helper.o
 obj-$(CONFIG_DRM_VRAM_HELPER) += drm_vram_helper.o
 
diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
new file mode 100644
index ..9340a4b61c5a
--- /dev/null
+++ b/drivers/gpu/drm/drm_buddy.c
@@ -0,0 +1,516 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2021 Intel Corporation
+ */
+
+#include 
+#include 
+#include 
+
+#include 
+
+static struct drm_buddy_block *drm_block_alloc(struct drm_buddy_mm *mm,
+  struct drm_buddy_block *parent,
+  unsigned int order,
+  u64 offset)
+{
+   struct drm_buddy_block *block;
+
+   BUG_ON(order > DRM_BUDDY_MAX_ORDER);
+
+   block = kmem_cache_zalloc(mm->slab_blocks, GFP_KERNEL);
+   if (!block)
+   return NULL;
+
+   block->header = offset;
+   block->header |= order;
+   block->parent = parent;
+
+   BUG_ON(block->header & DRM_BUDDY_HEADER_UNUSED);
+   return block;
+}
+
+static void drm_block_free(struct drm_buddy_mm *mm,
+  struct drm_buddy_block *block)
+{
+   kmem_cache_free(mm->slab_blocks, block);
+}
+
+static void mark_allocated(struct drm_buddy_block *block)
+{
+   block->header &= ~DRM_BUDDY_HEADER_STATE;
+   block->header |= DRM_BUDDY_ALLOCATED;
+
+   list_del(&block->link);
+}
+
+static void mark_free(struct drm_buddy_mm *mm,
+ struct drm_buddy_block *block)
+{
+   block->header &= ~DRM_BUDDY_HEADER_STATE;
+   block->header |= DRM_BUDDY_FREE;
+
+   list_add(&block->link,
+&mm->free_list[drm_buddy_block_order(block)]);
+}
+
+static void mark_split(struct drm_buddy_block *block)
+{
+   block->header &= ~DRM_BUDDY_HEADER_STATE;
+   block->header |= DRM_BUDDY_SPLIT;
+
+   list_del(&block->link);
+}
+
+/**
+ * drm_buddy_init - init memory manager
+ *
+ * @mm: DRM buddy manager to initialize
+ * @size: size in bytes to manage
+ * @chunk_size: minimum page size in bytes for our allocations
+ *
+ * Initializes the memory manager and its resources.
+ *
+ * Returns:
+ * 0 on success, error code on failure.
+ */
+int drm_buddy_init(struct drm_buddy_mm *mm, u64 size, u64 chunk_size)
+{
+   unsigned int i;
+   u64 offset;
+
+   if (size < chunk_size)
+   return -EINVAL;

Re: [PATCH] drm/amdgpu: add support to SMU debug option

2021-12-01 Thread Andrey Grodzovsky



On 2021-12-01 8:11 a.m., Christian König wrote:

Adding Andrey as well.

Am 01.12.21 um 12:37 schrieb Yu, Lang:

[SNIP]

+ BUG_ON(unlikely(smu->smu_debug_mode) && res);

BUG_ON() really crashes the kernel and is only allowed if we
prevent further data corruption with that.

Most of the time WARN_ON() is more appropriate, but I can't fully
judge here since I don't know the SMU code well enough.

This is what SMU FW guys want. They want "user-visible (potentially
fatal)

errors", then a hang.

They want to keep system state since the error occurred.

Well that is rather problematic.

First of all we need to really justify that, crashing the kernel is
not something easily done.

Then this isn't really effective here. What happens is that you crash
the kernel thread of the currently executing process, but it is
perfectly possible that another thread still tries to send messages
to the SMU. You need to have the BUG_ON() before dropping the lock to
make sure that this really gets the driver stuck in the current 
state.

Thanks. I got it. I just thought it is a kernel panic.
Could we use a panic() here?

Potentially, but that might reboot the system automatically which is
probably not what you want either.

How does the SMU firmware team gather the necessary information when a
problem occurs?

As far as I know, they usually use an HDT to collect information.
And in the ticket they request a hang when an error occurs:
"Suggested error responses include pop-up windows (by x86 driver, if
this is possible) or simply hanging after logging the error."


In that case I suggest to set the "don't_touch_the_hardware_any_more" 
procedure we also use in case of PCIe hotplug.


Andrey has the details but essentially it stops the driver from 
touching the hardware any more, signals all fences and unblocks 
everything.


It should then be trivial to inspect the hardware state and see what's 
going on, but the system will keep stable at least for SSH access.


Might be a good idea to have that mode for other fault cases like page 
faults and hardware crashes.


Regards,
Christian.



There is no one specific function that does all of that; what I think
can be done is to bring the device to a kind of halt state where no one
touches it, as follows:


1) Follow amdgpu_pci_remove -

    drm_dev_unplug to make the device inaccessible to user space (IOCTLs
etc.), clear MMIO mappings to the device, and disallow remappings
through page faults


    No need to call all of amdgpu_driver_unload_kms, but within it call
amdgpu_irq_disable_all and amdgpu_fence_driver_hw_fini to disable
interrupts and force-signal all HW fences.


    pci_disable_device and pci_wait_for_pending_transaction to flush
any in-flight DMA operations from the device


2) Set adev->no_hw_access so that most places where we access HW (all
subsequent register reads/writes and SMU/PSP message sending) are
skipped. Some races remain with accesses already in progress, so maybe
add some wait.


Andrey






Regards,
Lang





Re: [PATCH v5] drm/radeon/radeon_kms: Fix a NULL pointer dereference in radeon_driver_open_kms()

2021-12-01 Thread Christian König

Am 01.12.21 um 16:13 schrieb Zhou Qingyang:

In radeon_driver_open_kms(), radeon_vm_bo_add() is assigned to
vm->ib_bo_va and passes and used in radeon_vm_bo_set_addr(). In
radeon_vm_bo_set_addr(), there is a dereference of vm->ib_bo_va,
which could lead to a NULL pointer dereference on failure of
radeon_vm_bo_add().

Fix this bug by adding a check of vm->ib_bo_va.

This bug was found by a static analyzer. The analysis employs
differential checking to identify inconsistent security operations
(e.g., checks or kfrees) between two code paths and confirms that the
inconsistent operations are not recovered in the current function or
the callers, so they constitute bugs.

Note that, as a bug found by static analysis, it can be a false
positive or hard to trigger. Multiple researchers have cross-reviewed
the bug.

Builds with CONFIG_DRM_RADEON=m show no new warnings,
and our static analyzer no longer warns about this code.

Fixes: cc9e67e3d700 ("drm/radeon: fix VM IB handling")
Signed-off-by: Zhou Qingyang 
---
Changes in v5:
   -  Use conditions to avoid unnecessary initialization

Changes in v4:
   -  Initialize the variables to silence warning

Changes in v3:
   -  Fix the bug that good case will also be freed
   -  Improve code style

Changes in v2:
   -  Improve the error handling into goto style

  drivers/gpu/drm/radeon/radeon_kms.c | 36 -
  1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon_kms.c 
b/drivers/gpu/drm/radeon/radeon_kms.c
index 482fb0ae6cb5..66aee48fd09d 100644
--- a/drivers/gpu/drm/radeon/radeon_kms.c
+++ b/drivers/gpu/drm/radeon/radeon_kms.c
@@ -648,6 +648,8 @@ void radeon_driver_lastclose_kms(struct drm_device *dev)
  int radeon_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
  {
struct radeon_device *rdev = dev->dev_private;
+   struct radeon_fpriv *fpriv;
+   struct radeon_vm *vm;
int r;
  
  	file_priv->driver_priv = NULL;

@@ -660,8 +662,6 @@ int radeon_driver_open_kms(struct drm_device *dev, struct 
drm_file *file_priv)
  
  	/* new gpu have virtual address space support */

if (rdev->family >= CHIP_CAYMAN) {
-   struct radeon_fpriv *fpriv;
-   struct radeon_vm *vm;
  
  		fpriv = kzalloc(sizeof(*fpriv), GFP_KERNEL);

if (unlikely(!fpriv)) {
@@ -672,35 +672,39 @@ int radeon_driver_open_kms(struct drm_device *dev, struct 
drm_file *file_priv)
if (rdev->accel_working) {
vm = &fpriv->vm;
r = radeon_vm_init(rdev, vm);
-   if (r) {
-   kfree(fpriv);
-   goto out_suspend;
-   }
+   if (r)
+   goto out_fpriv;
  
  			r = radeon_bo_reserve(rdev->ring_tmp_bo.bo, false);

-   if (r) {
-   radeon_vm_fini(rdev, vm);
-   kfree(fpriv);
-   goto out_suspend;
-   }
+   if (r)
+   goto out_vm_fini;
  
  			/* map the ib pool buffer read only into

 * virtual address space */
vm->ib_bo_va = radeon_vm_bo_add(rdev, vm,
rdev->ring_tmp_bo.bo);
+   if (!vm->ib_bo_va) {
+   r = -ENOMEM;
+   goto out_vm_fini;
+   }
+
r = radeon_vm_bo_set_addr(rdev, vm->ib_bo_va,
  RADEON_VA_IB_OFFSET,
  RADEON_VM_PAGE_READABLE |
  RADEON_VM_PAGE_SNOOPED);
-   if (r) {
-   radeon_vm_fini(rdev, vm);
-   kfree(fpriv);
-   goto out_suspend;
-   }
+   if (r)
+   goto out_vm_fini;
}
file_priv->driver_priv = fpriv;
}
  
+	if (!r)


I think that test is unnecessary now, maybe double check.

Either way patch Reviewed-by: Christian König 
. Alex will probably pick it up now.


Thanks for the help,
Christian.


+   goto out_suspend;
+
+out_vm_fini:
+   radeon_vm_fini(rdev, vm);
+out_fpriv:
+   kfree(fpriv);
  out_suspend:
pm_runtime_mark_last_busy(dev->dev);
pm_runtime_put_autosuspend(dev->dev);




Re: [PATCH] drm/amdgpu: add support to SMU debug option

2021-12-01 Thread Christian König

Adding Andrey as well.

Am 01.12.21 um 12:37 schrieb Yu, Lang:

[SNIP]

+   BUG_ON(unlikely(smu->smu_debug_mode) && res);

BUG_ON() really crashes the kernel and is only allowed if we
prevent further data corruption with that.

Most of the time WARN_ON() is more appropriate, but I can't fully
judge here since I don't know the SMU code well enough.

This is what SMU FW guys want. They want "user-visible (potentially
fatal)

errors", then a hang.

They want to keep system state since the error occurred.

Well that is rather problematic.

First of all we need to really justify that, crashing the kernel is
not something easily done.

Then this isn't really effective here. What happens is that you crash
the kernel thread of the currently executing process, but it is
perfectly possible that another thread still tries to send messages
to the SMU. You need to have the BUG_ON() before dropping the lock to
make sure that this really gets the driver stuck in the current state.

Thanks. I got it. I just thought it is a kernel panic.
Could we use a panic() here?

Potentially, but that might reboot the system automatically which is
probably not what you want either.

How does the SMU firmware team gather the necessary information when a
problem occurs?

As far as I know, they usually use an HDT to collect information.
And in the ticket they request a hang when an error occurs:
"Suggested error responses include pop-up windows (by x86 driver, if
this is possible) or simply hanging after logging the error."


In that case I suggest to set the "don't_touch_the_hardware_any_more" 
procedure we also use in case of PCIe hotplug.


Andrey has the details but essentially it stops the driver from touching 
the hardware any more, signals all fences and unblocks everything.


It should then be trivial to inspect the hardware state and see what's 
going on, but the system will keep stable at least for SSH access.


Might be a good idea to have that mode for other fault cases like page 
faults and hardware crashes.


Regards,
Christian.



Regards,
Lang





RE: [PATCH] drm/amdgpu: add support to SMU debug option

2021-12-01 Thread Yu, Lang
[AMD Official Use Only]



>-Original Message-
>From: Koenig, Christian 
>Sent: Wednesday, December 1, 2021 7:29 PM
>To: Yu, Lang ; Christian König
>; amd-gfx@lists.freedesktop.org
>Cc: Deucher, Alexander ; Lazar, Lijo
>; Huang, Ray 
>Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>
>Am 01.12.21 um 12:20 schrieb Yu, Lang:
>> [AMD Official Use Only]
>>
>>> -Original Message-
>>> From: Christian König 
>>> Sent: Wednesday, December 1, 2021 6:49 PM
>>> To: Yu, Lang ; Koenig, Christian
>>> ; amd-gfx@lists.freedesktop.org
>>> Cc: Deucher, Alexander ; Lazar, Lijo
>>> ; Huang, Ray 
>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>>>
>>> Am 01.12.21 um 11:44 schrieb Yu, Lang:
 [AMD Official Use Only]



> -Original Message-
> From: Koenig, Christian 
> Sent: Wednesday, December 1, 2021 5:30 PM
> To: Yu, Lang ; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Lazar, Lijo
> ; Huang, Ray 
> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>
> Am 01.12.21 um 10:24 schrieb Lang Yu:
>> To maintain system error state when SMU errors occurred, which
>> will aid in debugging SMU firmware issues, add SMU debug option support.
>>
>> It can be enabled or disabled via amdgpu_smu_debug debugfs file.
>> When enabled, it makes SMU errors fatal.
>> It is disabled by default.
>>
>> == Command Guide ==
>>
>> 1, enable SMU debug option
>>
>> # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>
>> 2, disable SMU debug option
>>
>> # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>
>> v3:
>> - Use debugfs_create_bool().(Christian)
>> - Put variable into smu_context struct.
>> - Don't resend command when timeout.
>>
>> v2:
>> - Resend command when timeout.(Lijo)
>> - Use debugfs file instead of module parameter.
>>
>> Signed-off-by: Lang Yu 
> Well the debugfs part looks really nice and clean now, but one more
> comment below.
>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c| 3 +++
>> drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h| 5 +
>> drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
>> drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 8 +++-
>> 4 files changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> index 164d6a9e9fbb..86cd888c7822 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device
> *adev)
>>  if (!debugfs_initialized())
>>  return 0;
>>
>> +debugfs_create_bool("amdgpu_smu_debug", 0600, root,
>> +  &adev->smu.smu_debug_mode);
>> +
>>  ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root,
>adev,
>>_ib_preempt);
>>  if (IS_ERR(ent)) {
>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> index f738f7dc20c9..50dbf5594a9d 100644
>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> @@ -569,6 +569,11 @@ struct smu_context
>>  struct smu_user_dpm_profile user_dpm_profile;
>>
>>  struct stb_context stb_context;
>> +/*
>> + * When enabled, it makes SMU errors fatal.
>> + * (0 = disabled (default), 1 = enabled)
>> + */
>> +bool smu_debug_mode;
>> };
>>
>> struct i2c_adapter;
>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> index 6e781cee8bb6..d3797a2d6451 100644
>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct
> smu_context *smu)
>> out:
>>  mutex_unlock(&smu->message_lock);
>>
>> +BUG_ON(unlikely(smu->smu_debug_mode) && ret);
>> +
>>  return ret;
>> }
>>
>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> index 048ca1673863..9be005eb4241 100644
>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> @@ -349,15 +349,21 @@ int
>smu_cmn_send_smc_msg_with_param(struct
> smu_context *smu,
>>  __smu_cmn_reg_print_error(smu, reg, index, param,
>msg);
>>  goto Out;
>>  }

Re: [PATCH] drm/amdgpu: add support to SMU debug option

2021-12-01 Thread Christian König

Am 01.12.21 um 12:20 schrieb Yu, Lang:

[AMD Official Use Only]


-Original Message-
From: Christian König 
Sent: Wednesday, December 1, 2021 6:49 PM
To: Yu, Lang ; Koenig, Christian
; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Lazar, Lijo
; Huang, Ray 
Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option

Am 01.12.21 um 11:44 schrieb Yu, Lang:

[AMD Official Use Only]




-Original Message-
From: Koenig, Christian 
Sent: Wednesday, December 1, 2021 5:30 PM
To: Yu, Lang ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Lazar, Lijo
; Huang, Ray 
Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option

Am 01.12.21 um 10:24 schrieb Lang Yu:

To maintain system error state when SMU errors occurred, which will
aid in debugging SMU firmware issues, add SMU debug option support.

It can be enabled or disabled via amdgpu_smu_debug debugfs file.
When enabled, it makes SMU errors fatal.
It is disabled by default.

== Command Guide ==

1, enable SMU debug option

# echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

2, disable SMU debug option

# echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

v3:
- Use debugfs_create_bool().(Christian)
- Put variable into smu_context struct.
- Don't resend command when timeout.

v2:
- Resend command when timeout.(Lijo)
- Use debugfs file instead of module parameter.

Signed-off-by: Lang Yu 

Well the debugfs part looks really nice and clean now, but one more
comment below.


---
drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c| 3 +++
drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h| 5 +
drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 8 +++-
4 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 164d6a9e9fbb..86cd888c7822 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device

*adev)

if (!debugfs_initialized())
return 0;

+   debugfs_create_bool("amdgpu_smu_debug", 0600, root,
+ &adev->smu.smu_debug_mode);
+
ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
  _ib_preempt);
if (IS_ERR(ent)) {
diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index f738f7dc20c9..50dbf5594a9d 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -569,6 +569,11 @@ struct smu_context
struct smu_user_dpm_profile user_dpm_profile;

struct stb_context stb_context;
+   /*
+* When enabled, it makes SMU errors fatal.
+* (0 = disabled (default), 1 = enabled)
+*/
+   bool smu_debug_mode;
};

struct i2c_adapter;
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
index 6e781cee8bb6..d3797a2d6451 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
@@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct

smu_context *smu)

out:
mutex_unlock(&smu->message_lock);

+   BUG_ON(unlikely(smu->smu_debug_mode) && ret);
+
return ret;
}

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
index 048ca1673863..9be005eb4241 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
@@ -349,15 +349,21 @@ int smu_cmn_send_smc_msg_with_param(struct

smu_context *smu,

__smu_cmn_reg_print_error(smu, reg, index, param, msg);
goto Out;
}
+
__smu_cmn_send_msg(smu, (uint16_t) index, param);
reg = __smu_cmn_poll_stat(smu);
res = __smu_cmn_reg2errno(smu, reg);
-   if (res != 0)
+   if (res != 0) {
__smu_cmn_reg_print_error(smu, reg, index, param, msg);
+   goto Out;
+   }
if (read_arg)
smu_cmn_read_arg(smu, read_arg);
Out:
mutex_unlock(&smu->message_lock);
+
+   BUG_ON(unlikely(smu->smu_debug_mode) && res);

BUG_ON() really crashes the kernel and is only allowed if we prevent
further data corruption with that.

Most of the time WARN_ON() is more appropriate, but I can't fully
judge here since I don't know the SMU code well enough.

This is what SMU FW guys want. They want "user-visible (potentially fatal)

errors", then a hang.

They want to keep system state since the error occurred.

Well that is rather problematic.

First of all we need to really justify that, crashing the kernel is not 
something
easily done.

Then this isn't really effective here. What happens is that you crash the kernel
thread of 

RE: [PATCH] drm/amdgpu: add support to SMU debug option

2021-12-01 Thread Yu, Lang
[AMD Official Use Only]



>-Original Message-
>From: Christian König 
>Sent: Wednesday, December 1, 2021 6:49 PM
>To: Yu, Lang ; Koenig, Christian
>; amd-gfx@lists.freedesktop.org
>Cc: Deucher, Alexander ; Lazar, Lijo
>; Huang, Ray 
>Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>
>Am 01.12.21 um 11:44 schrieb Yu, Lang:
>> [AMD Official Use Only]
>>
>>
>>
>>> -Original Message-
>>> From: Koenig, Christian 
>>> Sent: Wednesday, December 1, 2021 5:30 PM
>>> To: Yu, Lang ; amd-gfx@lists.freedesktop.org
>>> Cc: Deucher, Alexander ; Lazar, Lijo
>>> ; Huang, Ray 
>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>>>
>>> Am 01.12.21 um 10:24 schrieb Lang Yu:
 To maintain system error state when SMU errors occurred, which will
 aid in debugging SMU firmware issues, add SMU debug option support.

 It can be enabled or disabled via amdgpu_smu_debug debugfs file.
 When enabled, it makes SMU errors fatal.
 It is disabled by default.

 == Command Guide ==

 1, enable SMU debug option

# echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

 2, disable SMU debug option

# echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

 v3:
- Use debugfs_create_bool().(Christian)
- Put variable into smu_context struct.
- Don't resend command when timeout.

 v2:
- Resend command when timeout.(Lijo)
- Use debugfs file instead of module parameter.

 Signed-off-by: Lang Yu 
>>> Well the debugfs part looks really nice and clean now, but one more
>>> comment below.
>>>
 ---
drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c| 3 +++
drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h| 5 +
drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 8 +++-
4 files changed, 17 insertions(+), 1 deletion(-)

 diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
 b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
 index 164d6a9e9fbb..86cd888c7822 100644
 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
 +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
 @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device
>>> *adev)
if (!debugfs_initialized())
return 0;

 +  debugfs_create_bool("amdgpu_smu_debug", 0600, root,
+&adev->smu.smu_debug_mode);
 +
ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
  _ib_preempt);
if (IS_ERR(ent)) {
 diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
 b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
 index f738f7dc20c9..50dbf5594a9d 100644
 --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
 +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
 @@ -569,6 +569,11 @@ struct smu_context
struct smu_user_dpm_profile user_dpm_profile;

struct stb_context stb_context;
 +  /*
 +   * When enabled, it makes SMU errors fatal.
 +   * (0 = disabled (default), 1 = enabled)
 +   */
 +  bool smu_debug_mode;
};

struct i2c_adapter;
 diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
 b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
 index 6e781cee8bb6..d3797a2d6451 100644
 --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
 +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
 @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct
>>> smu_context *smu)
out:
mutex_unlock(&smu->message_lock);

 +  BUG_ON(unlikely(smu->smu_debug_mode) && ret);
 +
return ret;
}

 diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
 b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
 index 048ca1673863..9be005eb4241 100644
 --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
 +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
 @@ -349,15 +349,21 @@ int smu_cmn_send_smc_msg_with_param(struct
>>> smu_context *smu,
__smu_cmn_reg_print_error(smu, reg, index, param, msg);
goto Out;
}
 +
__smu_cmn_send_msg(smu, (uint16_t) index, param);
reg = __smu_cmn_poll_stat(smu);
res = __smu_cmn_reg2errno(smu, reg);
 -  if (res != 0)
 +  if (res != 0) {
__smu_cmn_reg_print_error(smu, reg, index, param, msg);
 +  goto Out;
 +  }
if (read_arg)
smu_cmn_read_arg(smu, read_arg);
Out:
mutex_unlock(&smu->message_lock);
 +
 +  BUG_ON(unlikely(smu->smu_debug_mode) && res);
>>> BUG_ON() really crashes the kernel and is only allowed if we prevent
>>> further data corruption 

RE: [PATCH] drm/amdgpu: add support to SMU debug option

2021-12-01 Thread Yu, Lang
[AMD Official Use Only]



>-Original Message-
>From: Lazar, Lijo 
>Sent: Wednesday, December 1, 2021 6:46 PM
>To: Yu, Lang ; amd-gfx@lists.freedesktop.org
>Cc: Deucher, Alexander ; Huang, Ray
>; Koenig, Christian 
>Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>
>
>
>On 12/1/2021 4:08 PM, Yu, Lang wrote:
>> [AMD Official Use Only]
>>
>>
>>
>>> -Original Message-
>>> From: Lazar, Lijo 
>>> Sent: Wednesday, December 1, 2021 5:47 PM
>>> To: Yu, Lang ; amd-gfx@lists.freedesktop.org
>>> Cc: Deucher, Alexander ; Huang, Ray
>>> ; Koenig, Christian 
>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>>>
>>>
>>>
>>> On 12/1/2021 2:54 PM, Lang Yu wrote:
 To maintain system error state when SMU errors occurred, which will
 aid in debugging SMU firmware issues, add SMU debug option support.

 It can be enabled or disabled via amdgpu_smu_debug debugfs file.
 When enabled, it makes SMU errors fatal.
 It is disabled by default.

 == Command Guide ==

 1, enable SMU debug option

# echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

 2, disable SMU debug option

# echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

 v3:
- Use debugfs_create_bool().(Christian)
- Put variable into smu_context struct.
- Don't resend command when timeout.

 v2:
- Resend command when timeout.(Lijo)
- Use debugfs file instead of module parameter.

 Signed-off-by: Lang Yu 
 ---
drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c| 3 +++
drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h| 5 +
drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 8 +++-
4 files changed, 17 insertions(+), 1 deletion(-)

 diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
 b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
 index 164d6a9e9fbb..86cd888c7822 100644
 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
 +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
 @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device
>>> *adev)
if (!debugfs_initialized())
return 0;

 +  debugfs_create_bool("amdgpu_smu_debug", 0600, root,
+&adev->smu.smu_debug_mode);
 +
ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
  _ib_preempt);
if (IS_ERR(ent)) {
 diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
 b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
 index f738f7dc20c9..50dbf5594a9d 100644
 --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
 +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
 @@ -569,6 +569,11 @@ struct smu_context
struct smu_user_dpm_profile user_dpm_profile;

struct stb_context stb_context;
 +  /*
 +   * When enabled, it makes SMU errors fatal.
 +   * (0 = disabled (default), 1 = enabled)
 +   */
 +  bool smu_debug_mode;
};

struct i2c_adapter;
 diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
 b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
 index 6e781cee8bb6..d3797a2d6451 100644
 --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
 +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
 @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct
>>> smu_context *smu)
out:
mutex_unlock(&smu->message_lock);

 +  BUG_ON(unlikely(smu->smu_debug_mode) && ret);
 +
>>> This hunk can be skipped while submitting. If this fails, GPU reset
>>> will fail and amdgpu won't continue.
>>
>> Ok, we don't handle such cases.
>>
>>>
return ret;
}

 diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
 b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
 index 048ca1673863..9be005eb4241 100644
 --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
 +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
 @@ -349,15 +349,21 @@ int smu_cmn_send_smc_msg_with_param(struct
>>> smu_context *smu,
__smu_cmn_reg_print_error(smu, reg, index, param, msg);
goto Out;
}
 +
__smu_cmn_send_msg(smu, (uint16_t) index, param);
reg = __smu_cmn_poll_stat(smu);
res = __smu_cmn_reg2errno(smu, reg);
 -  if (res != 0)
 +  if (res != 0) {
__smu_cmn_reg_print_error(smu, reg, index, param, msg);
 +  goto Out;
>>>
>>> Next step is reading smu parameter register which is harmless as
>>> reading response register and it's not clear on read. This goto may also be
>>> skipped.
>>
>> I just think that does some extra work. We don’t want to read response 
>> register.
>> This 

[PATCH V2 11/11] drm/amdgpu: Move error inject function from amdgpu_ras.c to each block

2021-12-01 Thread yipechai
Move each block error inject function from amdgpu_ras.c to each block.

Signed-off-by: yipechai 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c  | 62 +---
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 28 +++
 drivers/gpu/drm/amd/amdgpu/mca_v3_0.c| 18 +++
 drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c  | 16 ++
 drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c  | 16 ++
 drivers/gpu/drm/amd/amdgpu/mmhub_v9_4.c  | 16 ++
 drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c   | 16 ++
 drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c   | 16 ++
 drivers/gpu/drm/amd/amdgpu/sdma_v4_4.c   | 16 ++
 drivers/gpu/drm/amd/amdgpu/umc_v6_1.c| 16 ++
 drivers/gpu/drm/amd/amdgpu/umc_v6_7.c| 16 ++
 drivers/gpu/drm/amd/amdgpu/umc_v8_7.c| 16 ++
 12 files changed, 201 insertions(+), 51 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 2e38bd3d3d45..87b625d305c9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1032,31 +1032,7 @@ int amdgpu_ras_reset_error_status(struct amdgpu_device 
*adev,
return 0;
 }
 
-/* Trigger XGMI/WAFL error */
-static int amdgpu_ras_error_inject_xgmi(struct amdgpu_device *adev,
-struct ta_ras_trigger_error_input *block_info)
-{
-   int ret;
-
-   if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_DISALLOW))
-   dev_warn(adev->dev, "Failed to disallow df cstate");
 
-   if (amdgpu_dpm_allow_xgmi_power_down(adev, false))
-   dev_warn(adev->dev, "Failed to disallow XGMI power down");
-
-   ret = psp_ras_trigger_error(&adev->psp, block_info);
-
-   if (amdgpu_ras_intr_triggered())
-   return ret;
-
-   if (amdgpu_dpm_allow_xgmi_power_down(adev, true))
-   dev_warn(adev->dev, "Failed to allow XGMI power down");
-
-   if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_ALLOW))
-   dev_warn(adev->dev, "Failed to allow df cstate");
-
-   return ret;
-}
 
 /* wrapper of psp_ras_trigger_error */
 int amdgpu_ras_error_inject(struct amdgpu_device *adev,
@@ -1076,41 +1052,25 @@ int amdgpu_ras_error_inject(struct amdgpu_device *adev,
if (!obj)
return -EINVAL;
 
+   if (!block_obj || !block_obj->ops)  {
+   dev_info(adev->dev, "%s don't config ras function \n", 
get_ras_block_str(&info->head));
+   return -EINVAL;
+   }
+
/* Calculate XGMI relative offset */
if (adev->gmc.xgmi.num_physical_nodes > 1) {
-   block_info.address =
-   amdgpu_xgmi_get_relative_phy_addr(adev,
- block_info.address);
+   block_info.address =  amdgpu_xgmi_get_relative_phy_addr(adev, 
block_info.address);
}
 
-   switch (info->head.block) {
-   case AMDGPU_RAS_BLOCK__GFX:
-   if (!block_obj || !block_obj->ops)  {
-   dev_info(adev->dev, "%s don't config ras function \n", 
get_ras_block_str(&info->head));
-   return -EINVAL;
-   }
-   if (block_obj->ops->ras_error_inject)
+   if (block_obj->ops->ras_error_inject) {
+   if(info->head.block == AMDGPU_RAS_BLOCK__GFX)
ret = block_obj->ops->ras_error_inject(adev, info);
-   break;
-   case AMDGPU_RAS_BLOCK__UMC:
-   case AMDGPU_RAS_BLOCK__SDMA:
-   case AMDGPU_RAS_BLOCK__MMHUB:
-   case AMDGPU_RAS_BLOCK__PCIE_BIF:
-   case AMDGPU_RAS_BLOCK__MCA:
-   ret = psp_ras_trigger_error(&adev->psp, &block_info);
-   break;
-   case AMDGPU_RAS_BLOCK__XGMI_WAFL:
-   ret = amdgpu_ras_error_inject_xgmi(adev, &block_info);
-   break;
-   default:
-   dev_info(adev->dev, "%s error injection is not supported yet\n",
-get_ras_block_str(&info->head));
-   ret = -EINVAL;
+   else
+   ret = block_obj->ops->ras_error_inject(adev, 
&block_info);
}
 
if (ret)
-   dev_err(adev->dev, "ras inject %s failed %d\n",
-   get_ras_block_str(&info->head), ret);
+   dev_err(adev->dev, "ras inject %s failed %d\n", 
get_ras_block_str(&info->head), ret);
 
return ret;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
index da541c7b1ec2..298742afba99 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
@@ -940,6 +940,33 @@ static void amdgpu_xgmi_query_ras_error_count(struct 
amdgpu_device *adev,
err_data->ce_count += ce_cnt;
 }
 
+/* Trigger XGMI/WAFL error */
+static int amdgpu_ras_error_inject_xgmi(struct amdgpu_device *adev,
+void *inject_if)
+{
+   int ret = 0;
+   struct ta_ras_trigger_error_input *block_info = (struct ta_ras_trigger_error_input *)inject_if;

[PATCH V2 10/11] drm/amdgpu: Modify mca block to fit for the unified ras block data and ops

2021-12-01 Thread yipechai
1.Modify mca block to fit for the unified ras block data and ops.
2.Implement .ras_block_match function pointer for mca block to identify itself.
3.Change amdgpu_mca_ras_funcs to amdgpu_mca_ras_block (amdgpu_mca_ras had been 
used), and drop the _funcs suffix from the corresponding variable name.
4.Remove the const flag of the mca ras variable so that the mca ras block can 
be inserted into the amdgpu device ras block link list.
5.Invoke the amdgpu_ras_register_ras_block function to register the mca ras 
block into the amdgpu device ras block link list.
6.Remove the redundant code about mca in amdgpu_ras.c after using the unified 
ras block.

Signed-off-by: yipechai 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 18 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c |  6 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_mca.h | 14 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 29 +--
 drivers/gpu/drm/amd/amdgpu/mca_v3_0.c   | 67 +++--
 5 files changed, 68 insertions(+), 66 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index ead143214448..065d98cc028f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -467,23 +467,23 @@ int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev)
return r;
}
 
-   if (adev->mca.mp0.ras_funcs &&
-   adev->mca.mp0.ras_funcs->ras_late_init) {
-   r = adev->mca.mp0.ras_funcs->ras_late_init(adev);
+   if (adev->mca.mp0.ras && adev->mca.mp0.ras->ras_block.ops &&
+   adev->mca.mp0.ras->ras_block.ops->ras_late_init) {
+   r = adev->mca.mp0.ras->ras_block.ops->ras_late_init(adev);
if (r)
return r;
}
 
-   if (adev->mca.mp1.ras_funcs &&
-   adev->mca.mp1.ras_funcs->ras_late_init) {
-   r = adev->mca.mp1.ras_funcs->ras_late_init(adev);
+   if (adev->mca.mp1.ras && adev->mca.mp1.ras->ras_block.ops &&
+   adev->mca.mp1.ras->ras_block.ops->ras_late_init) {
+   r = adev->mca.mp1.ras->ras_block.ops->ras_late_init(adev);
if (r)
return r;
}
 
-   if (adev->mca.mpio.ras_funcs &&
-   adev->mca.mpio.ras_funcs->ras_late_init) {
-   r = adev->mca.mpio.ras_funcs->ras_late_init(adev);
+   if (adev->mca.mpio.ras && adev->mca.mpio.ras->ras_block.ops &&
+   adev->mca.mpio.ras->ras_block.ops->ras_late_init) {
+   r = adev->mca.mpio.ras->ras_block.ops->ras_late_init(adev);
if (r)
return r;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
index ce538f4819f9..86dbe485a644 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
@@ -79,15 +79,15 @@ int amdgpu_mca_ras_late_init(struct amdgpu_device *adev,
.cb = NULL,
};
struct ras_fs_if fs_info = {
-   .sysfs_name = mca_dev->ras_funcs->sysfs_name,
+   .sysfs_name = mca_dev->ras->ras_block.name,
};
 
if (!mca_dev->ras_if) {
mca_dev->ras_if = kmalloc(sizeof(struct ras_common_if), 
GFP_KERNEL);
if (!mca_dev->ras_if)
return -ENOMEM;
-   mca_dev->ras_if->block = mca_dev->ras_funcs->ras_block;
-   mca_dev->ras_if->sub_block_index = 
mca_dev->ras_funcs->ras_sub_block;
+   mca_dev->ras_if->block = mca_dev->ras->ras_block.block;
+   mca_dev->ras_if->sub_block_index = 
mca_dev->ras->ras_block.sub_block_index;
mca_dev->ras_if->type = AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE;
}
ih_info.head = fs_info.head = *mca_dev->ras_if;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.h
index c74bc7177066..be030c4031d2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.h
@@ -21,21 +21,13 @@
 #ifndef __AMDGPU_MCA_H__
 #define __AMDGPU_MCA_H__
 
-struct amdgpu_mca_ras_funcs {
-   int (*ras_late_init)(struct amdgpu_device *adev);
-   void (*ras_fini)(struct amdgpu_device *adev);
-   void (*query_ras_error_count)(struct amdgpu_device *adev,
- void *ras_error_status);
-   void (*query_ras_error_address)(struct amdgpu_device *adev,
-   void *ras_error_status);
-   uint32_t ras_block;
-   uint32_t ras_sub_block;
-   const char* sysfs_name;
+struct amdgpu_mca_ras_block {
+   struct amdgpu_ras_block_object ras_block;
 };
 
 struct amdgpu_mca_ras {
struct ras_common_if *ras_if;
-   const struct amdgpu_mca_ras_funcs *ras_funcs;
+   struct amdgpu_mca_ras_block *ras;
 };
 
 struct amdgpu_mca_funcs {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 

[PATCH V2 09/11] drm/amdgpu: Modify sdma block to fit for the unified ras block data and ops

2021-12-01 Thread yipechai
1.Modify sdma block to fit for the unified ras block data and ops.
2.Implement .ras_block_match function pointer for sdma block to identify itself.
3.Change amdgpu_sdma_ras_funcs to amdgpu_sdma_ras, and drop the _funcs suffix 
from the corresponding variable name.
4.Remove the const flag of the sdma ras variable so that the sdma ras block 
can be inserted into the amdgpu device ras block link list.
5.Invoke amdgpu_ras_register_ras_block function to register sdma ras block into 
amdgpu device ras block link list.
6.Remove the redundant code about sdma in amdgpu_ras.c after using the unified 
ras block.

Signed-off-by: yipechai 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c  |  9 
 drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h | 13 ++---
 drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c   | 61 +++-
 drivers/gpu/drm/amd/amdgpu/sdma_v4_4.c   | 40 ++--
 drivers/gpu/drm/amd/amdgpu/sdma_v4_4.h   |  2 +-
 5 files changed, 92 insertions(+), 33 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 7d050afd7e2e..6a145d0e0032 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -939,12 +939,6 @@ int amdgpu_ras_query_error_status(struct amdgpu_device 
*adev,
block_obj->ops->query_ras_error_address(adev, 
&err_data);
break;
case AMDGPU_RAS_BLOCK__SDMA:
-   if (adev->sdma.funcs->query_ras_error_count) {
-   for (i = 0; i < adev->sdma.num_instances; i++)
-   adev->sdma.funcs->query_ras_error_count(adev, i,
-   
&err_data);
-   }
-   break;
case AMDGPU_RAS_BLOCK__GFX:
case AMDGPU_RAS_BLOCK__MMHUB:
if (!block_obj || !block_obj->ops)  {
@@ -1049,9 +1043,6 @@ int amdgpu_ras_reset_error_status(struct amdgpu_device 
*adev,
block_obj->ops->reset_ras_error_status(adev);
break;
case AMDGPU_RAS_BLOCK__SDMA:
-   if (adev->sdma.funcs->reset_ras_error_count)
-   adev->sdma.funcs->reset_ras_error_count(adev);
-   break;
case AMDGPU_RAS_BLOCK__HDP:
if (!block_obj || !block_obj->ops)  {
dev_info(adev->dev, "%s don't config ras function \n", 
ras_block_str(block));
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
index f8fb755e3aa6..a0761cf50ae0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
@@ -23,6 +23,7 @@
 
 #ifndef __AMDGPU_SDMA_H__
 #define __AMDGPU_SDMA_H__
+#include "amdgpu_ras.h"
 
 /* max number of IP instances */
 #define AMDGPU_MAX_SDMA_INSTANCES  8
@@ -50,13 +51,9 @@ struct amdgpu_sdma_instance {
boolburst_nop;
 };
 
-struct amdgpu_sdma_ras_funcs {
-   int (*ras_late_init)(struct amdgpu_device *adev,
-   void *ras_ih_info);
-   void (*ras_fini)(struct amdgpu_device *adev);
-   int (*query_ras_error_count)(struct amdgpu_device *adev,
-   uint32_t instance, void *ras_error_status);
-   void (*reset_ras_error_count)(struct amdgpu_device *adev);
+struct amdgpu_sdma_ras {
+   struct amdgpu_ras_block_object ras_block;
+   int (*sdma_ras_late_init)(struct amdgpu_device *adev, void 
*ras_ih_info);
 };
 
 struct amdgpu_sdma {
@@ -73,7 +70,7 @@ struct amdgpu_sdma {
uint32_tsrbm_soft_reset;
boolhas_page_queue;
struct ras_common_if*ras_if;
-   const struct amdgpu_sdma_ras_funcs  *funcs;
+   struct amdgpu_sdma_ras  *ras;
 };
 
 /*
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c 
b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
index 69c9e460c1eb..30a651613776 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
@@ -1898,13 +1898,13 @@ static int sdma_v4_0_late_init(void *handle)
sdma_v4_0_setup_ulv(adev);
 
if (!amdgpu_persistent_edc_harvesting_supported(adev)) {
-   if (adev->sdma.funcs &&
-   adev->sdma.funcs->reset_ras_error_count)
-   adev->sdma.funcs->reset_ras_error_count(adev);
+   if (adev->sdma.ras && adev->sdma.ras->ras_block.ops &&
+   adev->sdma.ras->ras_block.ops->reset_ras_error_count)
+   
adev->sdma.ras->ras_block.ops->reset_ras_error_count(adev);
}
 
-   if (adev->sdma.funcs && adev->sdma.funcs->ras_late_init)
-   return adev->sdma.funcs->ras_late_init(adev, &ih_info);
+   if (adev->sdma.ras && adev->sdma.ras->sdma_ras_late_init)
+   return adev->sdma.ras->sdma_ras_late_init(adev, &ih_info);
else
return 0;
 }
@@ -2007,8 +2007,9 @@ static int 

[PATCH V2 08/11] drm/amdgpu: Modify umc block to fit for the unified ras block data and ops

2021-12-01 Thread yipechai
1.Modify umc block to fit for the unified ras block data and ops.
2.Implement .ras_block_match function pointer for umc block to identify itself.
3.Change amdgpu_umc_ras_funcs to amdgpu_umc_ras, and drop the _funcs suffix 
from the corresponding variable name.
4.Remove the const flag of the umc ras variable so that the umc ras block can 
be inserted into the amdgpu device ras block link list.
5.Invoke amdgpu_ras_register_ras_block function to register umc ras block into 
amdgpu device ras block link list.
6.Remove the redundant code about umc in amdgpu_ras.c after using the unified 
ras block.

Signed-off-by: yipechai 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 12 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 21 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 18 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h | 13 -
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c  |  4 +++-
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c   |  9 ++---
 drivers/gpu/drm/amd/amdgpu/umc_v6_1.c   | 25 +++--
 drivers/gpu/drm/amd/amdgpu/umc_v6_1.h   |  2 +-
 drivers/gpu/drm/amd/amdgpu/umc_v6_7.c   | 23 ++-
 drivers/gpu/drm/amd/amdgpu/umc_v6_7.h   |  2 +-
 drivers/gpu/drm/amd/amdgpu/umc_v8_7.c   | 25 +++--
 drivers/gpu/drm/amd/amdgpu/umc_v8_7.h   |  2 +-
 12 files changed, 111 insertions(+), 45 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 317b5e93a1f0..ead143214448 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -434,9 +434,9 @@ int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev)
 {
int r;
 
-   if (adev->umc.ras_funcs &&
-   adev->umc.ras_funcs->ras_late_init) {
-   r = adev->umc.ras_funcs->ras_late_init(adev);
+   if (adev->umc.ras && adev->umc.ras->ras_block.ops &&
+   adev->umc.ras->ras_block.ops->ras_late_init) {
+   r = adev->umc.ras->ras_block.ops->ras_late_init(adev);
if (r)
return r;
}
@@ -493,9 +493,9 @@ int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev)
 
 void amdgpu_gmc_ras_fini(struct amdgpu_device *adev)
 {
-   if (adev->umc.ras_funcs &&
-   adev->umc.ras_funcs->ras_fini)
-   adev->umc.ras_funcs->ras_fini(adev);
+   if (adev->umc.ras && adev->umc.ras->ras_block.ops &&
+   adev->umc.ras->ras_block.ops->ras_fini)
+   adev->umc.ras->ras_block.ops->ras_fini(adev);
 
if (adev->mmhub.ras && adev->mmhub.ras->ras_block.ops &&
adev->mmhub.ras->ras_block.ops->ras_fini)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 273a550741e4..7d050afd7e2e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -925,15 +925,18 @@ int amdgpu_ras_query_error_status(struct amdgpu_device 
*adev,
 
switch (info->head.block) {
case AMDGPU_RAS_BLOCK__UMC:
-   if (adev->umc.ras_funcs &&
-   adev->umc.ras_funcs->query_ras_error_count)
-   adev->umc.ras_funcs->query_ras_error_count(adev, 
&err_data);
+   if (!block_obj || !block_obj->ops)  {
+   dev_info(adev->dev, "%s don't config ras function \n",
+   get_ras_block_str(&info->head));
+   return -EINVAL;
+   }
+   if (block_obj->ops->query_ras_error_count)
+   block_obj->ops->query_ras_error_count(adev, &err_data);
/* umc query_ras_error_address is also responsible for clearing
 * error status
 */
-   if (adev->umc.ras_funcs &&
-   adev->umc.ras_funcs->query_ras_error_address)
-   adev->umc.ras_funcs->query_ras_error_address(adev, 
&err_data);
+   if (block_obj->ops->query_ras_error_address)
+   block_obj->ops->query_ras_error_address(adev, 
&err_data);
break;
case AMDGPU_RAS_BLOCK__SDMA:
if (adev->sdma.funcs->query_ras_error_count) {
@@ -2359,12 +2362,12 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
/* Init poison supported flag, the default value is false */
if (adev->df.funcs &&
adev->df.funcs->query_ras_poison_mode &&
-   adev->umc.ras_funcs &&
-   adev->umc.ras_funcs->query_ras_poison_mode) {
+   adev->umc.ras && adev->umc.ras->ras_block.ops &&
+   adev->umc.ras->ras_block.ops->query_ras_poison_mode) {
df_poison =
adev->df.funcs->query_ras_poison_mode(adev);
umc_poison =
-   adev->umc.ras_funcs->query_ras_poison_mode(adev);
+   
adev->umc.ras->ras_block.ops->query_ras_poison_mode(adev);
/* Only 

[PATCH V2 07/11] drm/amdgpu: Modify nbio block to fit for the unified ras block data and ops

2021-12-01 Thread yipechai
1.Modify nbio block to fit for the unified ras block data and ops.
2.Implement .ras_block_match function pointer for nbio block to identify itself.
3.Change amdgpu_nbio_ras_funcs to amdgpu_nbio_ras, and drop the _funcs suffix 
from the corresponding variable name.
4.Remove the const flag of the nbio ras variable so that the nbio ras block 
can be inserted into the amdgpu device ras block link list.
5.Invoke amdgpu_ras_register_ras_block function to register nbio ras block into 
amdgpu device ras block link list.
6.Remove the redundant code about nbio in amdgpu_ras.c after using the unified 
ras block.

Signed-off-by: yipechai 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c  | 12 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.h |  9 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c  | 22 -
 drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c   | 30 
 drivers/gpu/drm/amd/amdgpu/nbio_v7_4.h   |  2 +-
 drivers/gpu/drm/amd/amdgpu/soc15.c   | 20 
 6 files changed, 56 insertions(+), 39 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
index 5208b2dd176a..24feceb51289 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
@@ -208,13 +208,13 @@ irqreturn_t amdgpu_irq_handler(int irq, void *arg)
 * ack the interrupt if it is there
 */
if (amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__PCIE_BIF)) {
-   if (adev->nbio.ras_funcs &&
-   adev->nbio.ras_funcs->handle_ras_controller_intr_no_bifring)
-   
adev->nbio.ras_funcs->handle_ras_controller_intr_no_bifring(adev);
+   if (adev->nbio.ras &&
+   adev->nbio.ras->handle_ras_controller_intr_no_bifring)
+   
adev->nbio.ras->handle_ras_controller_intr_no_bifring(adev);
 
-   if (adev->nbio.ras_funcs &&
-   
adev->nbio.ras_funcs->handle_ras_err_event_athub_intr_no_bifring)
-   
adev->nbio.ras_funcs->handle_ras_err_event_athub_intr_no_bifring(adev);
+   if (adev->nbio.ras &&
+   adev->nbio.ras->handle_ras_err_event_athub_intr_no_bifring)
+   
adev->nbio.ras->handle_ras_err_event_athub_intr_no_bifring(adev);
}
 
return ret;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.h
index 843052205bd5..4a1fb85939d6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.h
@@ -47,15 +47,12 @@ struct nbio_hdp_flush_reg {
u32 ref_and_mask_sdma7;
 };
 
-struct amdgpu_nbio_ras_funcs {
+struct amdgpu_nbio_ras {
+   struct amdgpu_ras_block_object ras_block;
void (*handle_ras_controller_intr_no_bifring)(struct amdgpu_device 
*adev);
void (*handle_ras_err_event_athub_intr_no_bifring)(struct amdgpu_device 
*adev);
int (*init_ras_controller_interrupt)(struct amdgpu_device *adev);
int (*init_ras_err_event_athub_interrupt)(struct amdgpu_device *adev);
-   void (*query_ras_error_count)(struct amdgpu_device *adev,
- void *ras_error_status);
-   int (*ras_late_init)(struct amdgpu_device *adev);
-   void (*ras_fini)(struct amdgpu_device *adev);
 };
 
 struct amdgpu_nbio_funcs {
@@ -104,7 +101,7 @@ struct amdgpu_nbio {
struct amdgpu_irq_src ras_err_event_athub_irq;
struct ras_common_if *ras_if;
const struct amdgpu_nbio_funcs *funcs;
-   const struct amdgpu_nbio_ras_funcs *ras_funcs;
+   struct amdgpu_nbio_ras  *ras;
 };
 
 int amdgpu_nbio_ras_late_init(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index d705d8b1daf6..273a550741e4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -957,10 +957,6 @@ int amdgpu_ras_query_error_status(struct amdgpu_device 
*adev,
block_obj->ops->query_ras_error_status(adev);
break;
case AMDGPU_RAS_BLOCK__PCIE_BIF:
-   if (adev->nbio.ras_funcs &&
-   adev->nbio.ras_funcs->query_ras_error_count)
-   adev->nbio.ras_funcs->query_ras_error_count(adev, 
&err_data);
-   break;
case AMDGPU_RAS_BLOCK__XGMI_WAFL:
case AMDGPU_RAS_BLOCK__HDP:
if (!block_obj || !block_obj->ops)  {
@@ -2336,24 +2332,26 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
case CHIP_VEGA20:
case CHIP_ARCTURUS:
case CHIP_ALDEBARAN:
-   if (!adev->gmc.xgmi.connected_to_cpu)
-   adev->nbio.ras_funcs = &nbio_v7_4_ras_funcs;
+   if (!adev->gmc.xgmi.connected_to_cpu) {
+   adev->nbio.ras = &nbio_v7_4_ras;
+   amdgpu_ras_register_ras_block(adev, 
&adev->nbio.ras->ras_block);
+  

[PATCH V2 06/11] drm/amdgpu: Modify mmhub block to fit for the unified ras block data and ops

2021-12-01 Thread yipechai
1.Modify mmhub block to fit for the unified ras block data and ops.
2.Implement .ras_block_match function pointer for mmhub block to identify 
itself.
3.Change amdgpu_mmhub_ras_funcs to amdgpu_mmhub_ras, and drop the _funcs 
suffix from the corresponding variable name.
4.Remove the const flag of the mmhub ras variable so that the mmhub ras block 
can be inserted into the amdgpu device ras block link list.
5.Invoke the amdgpu_ras_register_ras_block function to register the mmhub ras 
block into the amdgpu device ras block link list.
6.Remove the redundant code about mmhub in amdgpu_ras.c after using the 
unified ras block.

Signed-off-by: yipechai 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c| 12 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mmhub.h  | 12 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c| 49 +++---
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 16 ---
 drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c| 23 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.h|  2 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c| 23 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.h|  2 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v9_4.c| 23 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v9_4.h|  2 +-
 11 files changed, 108 insertions(+), 68 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 0980396ee709..c7d5592f0cf6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3377,9 +3377,9 @@ static void amdgpu_device_xgmi_reset_func(struct 
work_struct *__work)
if (adev->asic_reset_res)
goto fail;
 
-   if (adev->mmhub.ras_funcs &&
-   adev->mmhub.ras_funcs->reset_ras_error_count)
-   adev->mmhub.ras_funcs->reset_ras_error_count(adev);
+   if (adev->mmhub.ras && adev->mmhub.ras->ras_block.ops &&
+   adev->mmhub.ras->ras_block.ops->reset_ras_error_count)
+   
adev->mmhub.ras->ras_block.ops->reset_ras_error_count(adev);
} else {
 
task_barrier_full(&hive->tb);
@@ -4705,9 +4705,9 @@ int amdgpu_do_asic_reset(struct list_head 
*device_list_handle,
 
if (!r && amdgpu_ras_intr_triggered()) {
list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
-   if (tmp_adev->mmhub.ras_funcs &&
-   tmp_adev->mmhub.ras_funcs->reset_ras_error_count)
-   
tmp_adev->mmhub.ras_funcs->reset_ras_error_count(tmp_adev);
+   if (tmp_adev->mmhub.ras && 
tmp_adev->mmhub.ras->ras_block.ops &&
+   
tmp_adev->mmhub.ras->ras_block.ops->reset_ras_error_count)
+   
tmp_adev->mmhub.ras->ras_block.ops->reset_ras_error_count(tmp_adev);
}
 
amdgpu_ras_intr_cleared();
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 0d06e7a2b951..317b5e93a1f0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -441,9 +441,9 @@ int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev)
return r;
}
 
-   if (adev->mmhub.ras_funcs &&
-   adev->mmhub.ras_funcs->ras_late_init) {
-   r = adev->mmhub.ras_funcs->ras_late_init(adev);
+   if (adev->mmhub.ras && adev->mmhub.ras->ras_block.ops &&
+   adev->mmhub.ras->ras_block.ops->ras_late_init) {
+   r = adev->mmhub.ras->ras_block.ops->ras_late_init(adev);
if (r)
return r;
}
@@ -497,9 +497,9 @@ void amdgpu_gmc_ras_fini(struct amdgpu_device *adev)
adev->umc.ras_funcs->ras_fini)
adev->umc.ras_funcs->ras_fini(adev);
 
-   if (adev->mmhub.ras_funcs &&
-   adev->mmhub.ras_funcs->ras_fini)
-   adev->mmhub.ras_funcs->ras_fini(adev);
+   if (adev->mmhub.ras && adev->mmhub.ras->ras_block.ops &&
+   adev->mmhub.ras->ras_block.ops->ras_fini)
+   adev->mmhub.ras->ras_block.ops->ras_fini(adev);
 
if (adev->gmc.xgmi.ras && adev->gmc.xgmi.ras->ras_block.ops &&
adev->gmc.xgmi.ras->ras_block.ops->ras_fini)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mmhub.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mmhub.h
index b27fcbccce2b..6d10b3f248db 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mmhub.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mmhub.h
@@ -21,14 +21,8 @@
 #ifndef __AMDGPU_MMHUB_H__
 #define __AMDGPU_MMHUB_H__
 
-struct amdgpu_mmhub_ras_funcs {
-   int (*ras_late_init)(struct amdgpu_device *adev);
-   void (*ras_fini)(struct amdgpu_device *adev);
-   void (*query_ras_error_count)(struct 

[PATCH V2 04/11] drm/amdgpu: Modify gmc block to fit for the unified ras block data and ops

2021-12-01 Thread yipechai
1.Modify gmc block to fit for the unified ras block data and ops
2.Implement .ras_block_match function pointer for gmc block to identify itself.
3.Change amdgpu_xgmi_ras_funcs to amdgpu_xgmi_ras, and drop the _funcs suffix 
from the corresponding variable name.
4.Remove the const flag of the gmc ras variable so that the gmc ras block can 
be inserted into the amdgpu device ras block link list.
5.Invoke amdgpu_ras_register_ras_block function to register gmc ras block into 
amdgpu device ras block link list.
6.Remove the redundant code about gmc in amdgpu_ras.c after using the unified 
ras block.

Signed-off-by: yipechai 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c  | 18 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h  | 11 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c  | 10 +---
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 31 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h |  4 +--
 5 files changed, 48 insertions(+), 26 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 83f26bca7dac..3ba2f0f1f1b4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -448,12 +448,14 @@ int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev)
return r;
}
 
-   if (!adev->gmc.xgmi.connected_to_cpu)
-   adev->gmc.xgmi.ras_funcs = &xgmi_ras_funcs;
+   if (!adev->gmc.xgmi.connected_to_cpu) {
+   adev->gmc.xgmi.ras = &xgmi_ras;
+   amdgpu_ras_register_ras_block(adev, 
&adev->gmc.xgmi.ras->ras_block);
+   }
 
-   if (adev->gmc.xgmi.ras_funcs &&
-   adev->gmc.xgmi.ras_funcs->ras_late_init) {
-   r = adev->gmc.xgmi.ras_funcs->ras_late_init(adev);
+   if (adev->gmc.xgmi.ras && adev->gmc.xgmi.ras->ras_block.ops &&
+   adev->gmc.xgmi.ras->ras_block.ops->ras_late_init) {
+   r = adev->gmc.xgmi.ras->ras_block.ops->ras_late_init(adev);
if (r)
return r;
}
@@ -499,9 +501,9 @@ void amdgpu_gmc_ras_fini(struct amdgpu_device *adev)
adev->mmhub.ras_funcs->ras_fini)
adev->mmhub.ras_funcs->ras_fini(adev);
 
-   if (adev->gmc.xgmi.ras_funcs &&
-   adev->gmc.xgmi.ras_funcs->ras_fini)
-   adev->gmc.xgmi.ras_funcs->ras_fini(adev);
+   if (adev->gmc.xgmi.ras && adev->gmc.xgmi.ras->ras_block.ops &&
+   adev->gmc.xgmi.ras->ras_block.ops->ras_fini)
+   adev->gmc.xgmi.ras->ras_block.ops->ras_fini(adev);
 
if (adev->hdp.ras_funcs &&
adev->hdp.ras_funcs->ras_fini)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
index e55201134a01..923db5ff5859 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
@@ -29,6 +29,7 @@
 #include 
 
 #include "amdgpu_irq.h"
+#include "amdgpu_ras.h"
 
 /* VA hole for 48bit addresses on Vega10 */
 #define AMDGPU_GMC_HOLE_START  0x8000ULL
@@ -135,12 +136,8 @@ struct amdgpu_gmc_funcs {
unsigned int (*get_vbios_fb_size)(struct amdgpu_device *adev);
 };
 
-struct amdgpu_xgmi_ras_funcs {
-   int (*ras_late_init)(struct amdgpu_device *adev);
-   void (*ras_fini)(struct amdgpu_device *adev);
-   int (*query_ras_error_count)(struct amdgpu_device *adev,
-void *ras_error_status);
-   void (*reset_ras_error_count)(struct amdgpu_device *adev);
+struct amdgpu_xgmi_ras {
+   struct amdgpu_ras_block_object ras_block;
 };
 
 struct amdgpu_xgmi {
@@ -159,7 +156,7 @@ struct amdgpu_xgmi {
struct ras_common_if *ras_if;
bool connected_to_cpu;
bool pending_reset;
-   const struct amdgpu_xgmi_ras_funcs *ras_funcs;
+   struct amdgpu_xgmi_ras *ras;
 };
 
 struct amdgpu_gmc {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 190a4a4e9d7a..a6a2f928c6ca 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -970,9 +970,13 @@ int amdgpu_ras_query_error_status(struct amdgpu_device 
*adev,
adev->nbio.ras_funcs->query_ras_error_count(adev, 
&err_data);
break;
case AMDGPU_RAS_BLOCK__XGMI_WAFL:
-   if (adev->gmc.xgmi.ras_funcs &&
-   adev->gmc.xgmi.ras_funcs->query_ras_error_count)
-   adev->gmc.xgmi.ras_funcs->query_ras_error_count(adev, 
&err_data);
+   if (!block_obj || !block_obj->ops)  {
+   dev_info(adev->dev, "%s don't config ras function \n",
+   get_ras_block_str(&info->head));
+   return -EINVAL;
+   }
+   if (block_obj->ops->query_ras_error_count)
+   block_obj->ops->query_ras_error_count(adev, &err_data);
break;
case AMDGPU_RAS_BLOCK__HDP:
 

[PATCH V2 05/11] drm/amdgpu: Modify hdp block to fit for the unified ras block data and ops

2021-12-01 Thread yipechai
1.Modify hdp block to fit for the unified ras block data and ops.
2.Implement .ras_block_match function pointer for hdp block to identify itself.
3.Change amdgpu_hdp_ras_funcs to amdgpu_hdp_ras, and drop the _funcs suffix 
from the corresponding variable name.
4.Remove the const flag of the hdp ras variable so that the hdp ras block can 
be inserted into the amdgpu device ras block link list.
5.Invoke amdgpu_ras_register_ras_block function to register hdp ras block into 
amdgpu device ras block link list.
6.Remove the redundant code about hdp in amdgpu_ras.c after using the unified 
ras block.

Signed-off-by: yipechai 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 12 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_hdp.h | 11 ---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 16 
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c   |  9 +
 drivers/gpu/drm/amd/amdgpu/hdp_v4_0.c   | 22 +-
 drivers/gpu/drm/amd/amdgpu/hdp_v4_0.h   |  2 +-
 6 files changed, 45 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 3ba2f0f1f1b4..0d06e7a2b951 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -460,9 +460,9 @@ int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev)
return r;
}
 
-   if (adev->hdp.ras_funcs &&
-   adev->hdp.ras_funcs->ras_late_init) {
-   r = adev->hdp.ras_funcs->ras_late_init(adev);
+   if (adev->hdp.ras && adev->hdp.ras->ras_block.ops &&
+   adev->hdp.ras->ras_block.ops->ras_late_init) {
+   r = adev->hdp.ras->ras_block.ops->ras_late_init(adev);
if (r)
return r;
}
@@ -505,9 +505,9 @@ void amdgpu_gmc_ras_fini(struct amdgpu_device *adev)
adev->gmc.xgmi.ras->ras_block.ops->ras_fini)
adev->gmc.xgmi.ras->ras_block.ops->ras_fini(adev);
 
-   if (adev->hdp.ras_funcs &&
-   adev->hdp.ras_funcs->ras_fini)
-   adev->hdp.ras_funcs->ras_fini(adev);
+   if (adev->hdp.ras && adev->hdp.ras->ras_block.ops &&
+   adev->hdp.ras->ras_block.ops->ras_fini)
+   adev->hdp.ras->ras_block.ops->ras_fini(adev);
 }
 
/*
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_hdp.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_hdp.h
index 7ec99d591584..6e53898fb283 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_hdp.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_hdp.h
@@ -22,13 +22,10 @@
  */
 #ifndef __AMDGPU_HDP_H__
 #define __AMDGPU_HDP_H__
+#include "amdgpu_ras.h"
 
-struct amdgpu_hdp_ras_funcs {
-   int (*ras_late_init)(struct amdgpu_device *adev);
-   void (*ras_fini)(struct amdgpu_device *adev);
-   void (*query_ras_error_count)(struct amdgpu_device *adev,
- void *ras_error_status);
-   void (*reset_ras_error_count)(struct amdgpu_device *adev);
+struct amdgpu_hdp_ras {
+   struct amdgpu_ras_block_object ras_block;
 };
 
 struct amdgpu_hdp_funcs {
@@ -43,7 +40,7 @@ struct amdgpu_hdp_funcs {
 struct amdgpu_hdp {
	struct ras_common_if		*ras_if;
const struct amdgpu_hdp_funcs   *funcs;
-   const struct amdgpu_hdp_ras_funcs   *ras_funcs;
+   struct amdgpu_hdp_ras   *ras;
 };
 
 int amdgpu_hdp_ras_late_init(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index a6a2f928c6ca..bed414404c6f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -970,6 +970,7 @@ int amdgpu_ras_query_error_status(struct amdgpu_device 
*adev,
		adev->nbio.ras_funcs->query_ras_error_count(adev, &err_data);
break;
case AMDGPU_RAS_BLOCK__XGMI_WAFL:
+   case AMDGPU_RAS_BLOCK__HDP:
		if (!block_obj || !block_obj->ops) {
			dev_info(adev->dev, "%s don't config ras function\n",
				get_ras_block_str(&info->head));
@@ -978,11 +979,6 @@ int amdgpu_ras_query_error_status(struct amdgpu_device 
*adev,
		if (block_obj->ops->query_ras_error_count)
			block_obj->ops->query_ras_error_count(adev, &err_data);
break;
-   case AMDGPU_RAS_BLOCK__HDP:
-   if (adev->hdp.ras_funcs &&
-   adev->hdp.ras_funcs->query_ras_error_count)
-		adev->hdp.ras_funcs->query_ras_error_count(adev, &err_data);
-   break;
case AMDGPU_RAS_BLOCK__MCA:
		amdgpu_ras_mca_query_error_status(adev, &info->head, &err_data);
break;
@@ -1074,9 +1070,13 @@ int amdgpu_ras_reset_error_status(struct amdgpu_device 
*adev,
adev->sdma.funcs->reset_ras_error_count(adev);
break;
case AMDGPU_RAS_BLOCK__HDP:
-   if (adev->hdp.ras_funcs &&
- 

[PATCH V2 03/11] drm/amdgpu: Modify gfx block to fit for the unified ras block data and ops

2021-12-01 Thread yipechai
1. Modify the gfx block to fit the unified ras block data and ops.
2. Implement the .ras_block_match function pointer so the gfx block can identify itself.
3. Rename amdgpu_gfx_ras_funcs to amdgpu_gfx_ras and drop the _funcs suffix from the
corresponding variable name.
4. Remove the const qualifier from the gfx ras variable so that the gfx ras block
can be inserted into the amdgpu device ras block link list.
5. Invoke amdgpu_ras_register_ras_block to register the gfx ras block in the
amdgpu device ras block link list.
6. Remove the redundant gfx code in amdgpu_ras.c after switching to the unified
ras block.

Signed-off-by: yipechai 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c |  6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h | 15 ++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 80 ++---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c   | 73 +++---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4.c   | 39 
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4.h   |  2 +-
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c | 42 +
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.h |  2 +-
 8 files changed, 178 insertions(+), 81 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index 1795d448c700..da8691259ac1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -696,9 +696,9 @@ int amdgpu_gfx_process_ras_data_cb(struct amdgpu_device 
*adev,
 */
if (!amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__GFX)) {
kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
-   if (adev->gfx.ras_funcs &&
-   adev->gfx.ras_funcs->query_ras_error_count)
-		adev->gfx.ras_funcs->query_ras_error_count(adev, err_data);
+   if (adev->gfx.ras && adev->gfx.ras->ras_block.ops &&
+   adev->gfx.ras->ras_block.ops->query_ras_error_count)
+			adev->gfx.ras->ras_block.ops->query_ras_error_count(adev, err_data);
amdgpu_ras_reset_gpu(adev);
}
return AMDGPU_RAS_SUCCESS;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
index 6b78b4a0e182..ff4a8428a84b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
@@ -31,6 +31,7 @@
 #include "amdgpu_ring.h"
 #include "amdgpu_rlc.h"
 #include "soc15.h"
+#include "amdgpu_ras.h"
 
 /* GFX current status */
 #define AMDGPU_GFX_NORMAL_MODE 0xL
@@ -213,16 +214,8 @@ struct amdgpu_cu_info {
uint32_t bitmap[4][4];
 };
 
-struct amdgpu_gfx_ras_funcs {
-   int (*ras_late_init)(struct amdgpu_device *adev);
-   void (*ras_fini)(struct amdgpu_device *adev);
-   int (*ras_error_inject)(struct amdgpu_device *adev,
-   void *inject_if);
-   int (*query_ras_error_count)(struct amdgpu_device *adev,
-void *ras_error_status);
-   void (*reset_ras_error_count)(struct amdgpu_device *adev);
-   void (*query_ras_error_status)(struct amdgpu_device *adev);
-   void (*reset_ras_error_status)(struct amdgpu_device *adev);
+struct amdgpu_gfx_ras {
+   struct amdgpu_ras_block_object  ras_block;
void (*enable_watchdog_timer)(struct amdgpu_device *adev);
 };
 
@@ -348,7 +341,7 @@ struct amdgpu_gfx {
 
/*ras */
	struct ras_common_if		*ras_if;
-   const struct amdgpu_gfx_ras_funcs   *ras_funcs;
+   struct amdgpu_gfx_ras   *ras;
 };
 
 #define amdgpu_gfx_get_gpu_clock_counter(adev) (adev)->gfx.funcs->get_gpu_clock_counter((adev))
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 1cf1f6331db1..190a4a4e9d7a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -862,6 +862,27 @@ static int amdgpu_ras_enable_all_features(struct 
amdgpu_device *adev,
 }
 /* feature ctl end */
 
+static struct amdgpu_ras_block_object *amdgpu_ras_get_ras_block(struct amdgpu_device *adev,
+		enum amdgpu_ras_block block, uint32_t sub_block_index)
+{
+   struct amdgpu_ras_block_object *obj, *tmp;
+
+   if (block >= AMDGPU_RAS_BLOCK__LAST) {
+   return NULL;
+   }
+
+	list_for_each_entry_safe(obj, tmp, &adev->ras_list, node) {
+		if (!obj->ops || !obj->ops->ras_block_match) {
+			dev_info(adev->dev, "%s don't config ops or ras_block_match\n",
+				obj->name);
+			continue;
+		}
+   if (!obj->ops->ras_block_match(obj, block, sub_block_index)) {
+   return obj;
+   }
+   }
+
+   return NULL;
+}
 
 void amdgpu_ras_mca_query_error_status(struct amdgpu_device *adev,
   struct ras_common_if *ras_block,
@@ -892,6 +913,7 @@ void 

[PATCH V2 02/11] drm/amdgpu: Modify the compilation failed problem when other ras blocks' .h include amdgpu_ras.h

2021-12-01 Thread yipechai
Fix the compilation failure that occurs when other ras blocks' .h files include
amdgpu_ras.h.

Signed-off-by: yipechai 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 22 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 23 ---
 2 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 8713575c7cf1..1cf1f6331db1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2739,6 +2739,28 @@ static void amdgpu_register_bad_pages_mca_notifier(void)
 }
 }
 #endif
+
+/* check if ras is supported on block, say, sdma, gfx */
+int amdgpu_ras_is_supported(struct amdgpu_device *adev,
+   unsigned int block)
+{
+   struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
+
+   if (block >= AMDGPU_RAS_BLOCK_COUNT)
+   return 0;
+   return ras && (adev->ras_enabled & (1 << block));
+}
+
+int amdgpu_ras_reset_gpu(struct amdgpu_device *adev)
+{
+   struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
+
+	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
+		schedule_work(&ras->recovery_work);
+   return 0;
+}
+
+
 /* Register each IP ras block into amdgpu ras */
 int amdgpu_ras_register_ras_block(struct amdgpu_device *adev,
struct amdgpu_ras_block_object* ras_block_obj)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index d6e5e3c862bd..41623a649fa1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -514,16 +514,6 @@ struct amdgpu_ras_block_ops {
 #define amdgpu_ras_get_context(adev)   ((adev)->psp.ras_context.ras)
 #define amdgpu_ras_set_context(adev, ras_con)  ((adev)->psp.ras_context.ras = (ras_con))
 
-/* check if ras is supported on block, say, sdma, gfx */
-static inline int amdgpu_ras_is_supported(struct amdgpu_device *adev,
-   unsigned int block)
-{
-   struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
-
-   if (block >= AMDGPU_RAS_BLOCK_COUNT)
-   return 0;
-   return ras && (adev->ras_enabled & (1 << block));
-}
 
 int amdgpu_ras_recovery_init(struct amdgpu_device *adev);
 
@@ -540,15 +530,6 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
 
 int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev);
 
-static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev)
-{
-   struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
-
-	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
-		schedule_work(&ras->recovery_work);
-   return 0;
-}
-
 static inline enum ta_ras_block
 amdgpu_ras_block_to_ta(enum amdgpu_ras_block block) {
switch (block) {
@@ -680,5 +661,9 @@ const char *get_ras_block_str(struct ras_common_if 
*ras_block);
 
 bool amdgpu_ras_is_poison_mode_supported(struct amdgpu_device *adev);
 
+int amdgpu_ras_is_supported(struct amdgpu_device *adev, unsigned int block);
+
+int amdgpu_ras_reset_gpu(struct amdgpu_device *adev);
+
 int amdgpu_ras_register_ras_block(struct amdgpu_device *adev, struct amdgpu_ras_block_object *ras_block_obj);
 #endif
-- 
2.25.1



[PATCH V2 01/11] drm/amdgpu: Unify ras block interface for each ras block

2021-12-01 Thread yipechai
1. Define a unified ops interface for each ras block.
2. Add a ras_block_match function pointer to the ops interface so each ras block
can identify itself.
3. Define unified basic ras block data for each ras block.
4. Create a dedicated amdgpu device ras block link list to manage all of the ras
blocks.
5. Add the new amdgpu_ras_register_ras_block interface so each ras block can
register itself with the ras controlling block.

Signed-off-by: yipechai 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h|  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c| 12 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h| 29 ++
 4 files changed, 45 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index db1505455761..eddf230856e2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1151,6 +1151,8 @@ struct amdgpu_device {
	bool				barrier_has_auto_waitcnt;
 
struct amdgpu_reset_control *reset_cntl;
+
+	struct list_head		ras_list;
 };
 
 static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 73ec46140d68..0980396ee709 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3578,6 +3578,8 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 
	INIT_LIST_HEAD(&adev->reset_list);

+	INIT_LIST_HEAD(&adev->ras_list);
+
	INIT_DELAYED_WORK(&adev->delayed_init_work,
			  amdgpu_device_delayed_init_work_handler);
	INIT_DELAYED_WORK(&adev->gfx.gfx_off_delay_work,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 90f0db3b4f65..8713575c7cf1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2739,3 +2739,15 @@ static void amdgpu_register_bad_pages_mca_notifier(void)
 }
 }
 #endif
+/* Register each IP ras block into amdgpu ras */
+int amdgpu_ras_register_ras_block(struct amdgpu_device *adev,
+   struct amdgpu_ras_block_object* ras_block_obj)
+{
+   if (!adev || !ras_block_obj)
+   return -EINVAL;
+
+	INIT_LIST_HEAD(&ras_block_obj->node);
+	list_add_tail(&ras_block_obj->node, &adev->ras_list);
+
+   return 0;
+}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index cdd0010a5389..d6e5e3c862bd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -469,6 +469,34 @@ struct ras_debug_if {
};
int op;
 };
+
+struct amdgpu_ras_block_object {
+   /* block name */
+   char name[32];
+
+   enum amdgpu_ras_block block;
+
+   uint32_t sub_block_index;
+
+   /* ras block link */
+   struct list_head node;
+
+   const struct amdgpu_ras_block_ops *ops;
+};
+
+struct amdgpu_ras_block_ops {
+	int (*ras_block_match)(struct amdgpu_ras_block_object *block_obj,
+			enum amdgpu_ras_block block, uint32_t sub_block_index);
+	int (*ras_late_init)(struct amdgpu_device *adev);
+	void (*ras_fini)(struct amdgpu_device *adev);
+	int (*ras_error_inject)(struct amdgpu_device *adev, void *inject_if);
+	void (*query_ras_error_count)(struct amdgpu_device *adev, void *ras_error_status);
+	void (*query_ras_error_status)(struct amdgpu_device *adev);
+	bool (*query_ras_poison_mode)(struct amdgpu_device *adev);
+	void (*query_ras_error_address)(struct amdgpu_device *adev, void *ras_error_status);
+	void (*reset_ras_error_count)(struct amdgpu_device *adev);
+	void (*reset_ras_error_status)(struct amdgpu_device *adev);
+};
+
 /* work flow
  * vbios
  * 1: ras feature enable (enabled by default)
@@ -652,4 +680,5 @@ const char *get_ras_block_str(struct ras_common_if 
*ras_block);
 
 bool amdgpu_ras_is_poison_mode_supported(struct amdgpu_device *adev);
 
+int amdgpu_ras_register_ras_block(struct amdgpu_device *adev, struct amdgpu_ras_block_object *ras_block_obj);
 #endif
-- 
2.25.1



Re: [PATCH] drm/amdgpu: add support to SMU debug option

2021-12-01 Thread Christian König

Am 01.12.21 um 11:44 schrieb Yu, Lang:

[AMD Official Use Only]




-Original Message-
From: Koenig, Christian 
Sent: Wednesday, December 1, 2021 5:30 PM
To: Yu, Lang ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Lazar, Lijo
; Huang, Ray 
Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option

Am 01.12.21 um 10:24 schrieb Lang Yu:

To maintain system error state when SMU errors occurred, which will
aid in debugging SMU firmware issues, add SMU debug option support.

It can be enabled or disabled via amdgpu_smu_debug debugfs file. When
enabled, it makes SMU errors fatal.
It is disabled by default.

== Command Guide ==

1, enable SMU debug option

   # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

2, disable SMU debug option

   # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

v3:
   - Use debugfs_create_bool().(Christian)
   - Put variable into smu_context struct.
   - Don't resend command when timeout.

v2:
   - Resend command when timeout.(Lijo)
   - Use debugfs file instead of module parameter.

Signed-off-by: Lang Yu 

Well the debugfs part looks really nice and clean now, but one more comment
below.


---
   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c| 3 +++
   drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h| 5 +
   drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
   drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 8 +++-
   4 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 164d6a9e9fbb..86cd888c7822 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device

*adev)

if (!debugfs_initialized())
return 0;

+   debugfs_create_bool("amdgpu_smu_debug", 0600, root,
+			  &adev->smu.smu_debug_mode);
+
ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
				  &fops_ib_preempt);
if (IS_ERR(ent)) {
diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index f738f7dc20c9..50dbf5594a9d 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -569,6 +569,11 @@ struct smu_context
struct smu_user_dpm_profile user_dpm_profile;

struct stb_context stb_context;
+   /*
+* When enabled, it makes SMU errors fatal.
+* (0 = disabled (default), 1 = enabled)
+*/
+   bool smu_debug_mode;
   };

   struct i2c_adapter;
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
index 6e781cee8bb6..d3797a2d6451 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
@@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct

smu_context *smu)

   out:
	mutex_unlock(&smu->message_lock);

+   BUG_ON(unlikely(smu->smu_debug_mode) && ret);
+
return ret;
   }

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
index 048ca1673863..9be005eb4241 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
@@ -349,15 +349,21 @@ int smu_cmn_send_smc_msg_with_param(struct

smu_context *smu,

__smu_cmn_reg_print_error(smu, reg, index, param, msg);
goto Out;
}
+
__smu_cmn_send_msg(smu, (uint16_t) index, param);
reg = __smu_cmn_poll_stat(smu);
res = __smu_cmn_reg2errno(smu, reg);
-   if (res != 0)
+   if (res != 0) {
__smu_cmn_reg_print_error(smu, reg, index, param, msg);
+   goto Out;
+   }
if (read_arg)
smu_cmn_read_arg(smu, read_arg);
   Out:
	mutex_unlock(&smu->message_lock);
+
+   BUG_ON(unlikely(smu->smu_debug_mode) && res);

BUG_ON() really crashes the kernel and is only allowed if we prevent further 
data
corruption with that.

Most of the time WARN_ON() is more appropriate, but I can't fully judge here
since I don't know the SMU code well enough.

This is what SMU FW guys want. They want "user-visible (potentially fatal) 
errors", then a hang.
They want to keep system state since the error occurred.


Well that is rather problematic.

First of all we need to really justify that, crashing the kernel is not 
something easily done.


Then this isn't really effective here. What happens is that you crash 
the kernel thread of the currently executing process, but it is 
perfectly possible that another thread still tries to send messages to 
the SMU. You need to have the BUG_ON() before dropping the lock to make 
sure that this really gets the driver stuck in the current state.


Regards,
Christian.



Regards,
Lang


Christian.


+
return res;
   }





Re: [PATCH] drm/amdgpu: add support to SMU debug option

2021-12-01 Thread Lazar, Lijo




On 12/1/2021 4:08 PM, Yu, Lang wrote:

[AMD Official Use Only]




-Original Message-
From: Lazar, Lijo 
Sent: Wednesday, December 1, 2021 5:47 PM
To: Yu, Lang ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Huang, Ray
; Koenig, Christian 
Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option



On 12/1/2021 2:54 PM, Lang Yu wrote:

To maintain system error state when SMU errors occurred, which will
aid in debugging SMU firmware issues, add SMU debug option support.

It can be enabled or disabled via amdgpu_smu_debug debugfs file. When
enabled, it makes SMU errors fatal.
It is disabled by default.

== Command Guide ==

1, enable SMU debug option

   # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

2, disable SMU debug option

   # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

v3:
   - Use debugfs_create_bool().(Christian)
   - Put variable into smu_context struct.
   - Don't resend command when timeout.

v2:
   - Resend command when timeout.(Lijo)
   - Use debugfs file instead of module parameter.

Signed-off-by: Lang Yu 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c| 3 +++
   drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h| 5 +
   drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
   drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 8 +++-
   4 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 164d6a9e9fbb..86cd888c7822 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device

*adev)

if (!debugfs_initialized())
return 0;

+   debugfs_create_bool("amdgpu_smu_debug", 0600, root,
+			  &adev->smu.smu_debug_mode);
+
ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
				  &fops_ib_preempt);
if (IS_ERR(ent)) {
diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index f738f7dc20c9..50dbf5594a9d 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -569,6 +569,11 @@ struct smu_context
struct smu_user_dpm_profile user_dpm_profile;

struct stb_context stb_context;
+   /*
+* When enabled, it makes SMU errors fatal.
+* (0 = disabled (default), 1 = enabled)
+*/
+   bool smu_debug_mode;
   };

   struct i2c_adapter;
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
index 6e781cee8bb6..d3797a2d6451 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
@@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct

smu_context *smu)

   out:
	mutex_unlock(&smu->message_lock);

+   BUG_ON(unlikely(smu->smu_debug_mode) && ret);
+

This hunk can be skipped while submitting. If this fails, GPU reset will fail 
and
amdgpu won't continue.


Ok, we don't handle such cases.




return ret;
   }

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
index 048ca1673863..9be005eb4241 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
@@ -349,15 +349,21 @@ int smu_cmn_send_smc_msg_with_param(struct

smu_context *smu,

__smu_cmn_reg_print_error(smu, reg, index, param, msg);
goto Out;
}
+
__smu_cmn_send_msg(smu, (uint16_t) index, param);
reg = __smu_cmn_poll_stat(smu);
res = __smu_cmn_reg2errno(smu, reg);
-   if (res != 0)
+   if (res != 0) {
__smu_cmn_reg_print_error(smu, reg, index, param, msg);
+   goto Out;


Next step is reading smu parameter register which is harmless as reading
response register and it's not clear on read. This goto also may be skipped.


I just think that does some extra work. We don’t want to read response register.
This goto makes error handling more clear.



This change affects non-debug mode also. If things are normal, error 
handling is supposed to be done by the caller based on the FW response 
and/or return parameter value, if there is any. smu_debug_mode shouldn't 
change that.


Thanks,
Lijo


Regards,
Lang


Thanks,
Lijo


+   }
if (read_arg)
smu_cmn_read_arg(smu, read_arg);
   Out:
	mutex_unlock(&smu->message_lock);
+
+   BUG_ON(unlikely(smu->smu_debug_mode) && res);
+
return res;
   }




RE: [PATCH] drm/amdgpu: add support to SMU debug option

2021-12-01 Thread Yu, Lang
[AMD Official Use Only]



>-Original Message-
>From: Koenig, Christian 
>Sent: Wednesday, December 1, 2021 5:30 PM
>To: Yu, Lang ; amd-gfx@lists.freedesktop.org
>Cc: Deucher, Alexander ; Lazar, Lijo
>; Huang, Ray 
>Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>
>Am 01.12.21 um 10:24 schrieb Lang Yu:
>> To maintain system error state when SMU errors occurred, which will
>> aid in debugging SMU firmware issues, add SMU debug option support.
>>
>> It can be enabled or disabled via amdgpu_smu_debug debugfs file. When
>> enabled, it makes SMU errors fatal.
>> It is disabled by default.
>>
>> == Command Guide ==
>>
>> 1, enable SMU debug option
>>
>>   # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>
>> 2, disable SMU debug option
>>
>>   # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>
>> v3:
>>   - Use debugfs_create_bool().(Christian)
>>   - Put variable into smu_context struct.
>>   - Don't resend command when timeout.
>>
>> v2:
>>   - Resend command when timeout.(Lijo)
>>   - Use debugfs file instead of module parameter.
>>
>> Signed-off-by: Lang Yu 
>
>Well the debugfs part looks really nice and clean now, but one more comment
>below.
>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c| 3 +++
>>   drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h| 5 +
>>   drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
>>   drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 8 +++-
>>   4 files changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> index 164d6a9e9fbb..86cd888c7822 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device
>*adev)
>>  if (!debugfs_initialized())
>>  return 0;
>>
>> +debugfs_create_bool("amdgpu_smu_debug", 0600, root,
>> +	  &adev->smu.smu_debug_mode);
>> +
>>  ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
>>	  &fops_ib_preempt);
>>  if (IS_ERR(ent)) {
>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> index f738f7dc20c9..50dbf5594a9d 100644
>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> @@ -569,6 +569,11 @@ struct smu_context
>>  struct smu_user_dpm_profile user_dpm_profile;
>>
>>  struct stb_context stb_context;
>> +/*
>> + * When enabled, it makes SMU errors fatal.
>> + * (0 = disabled (default), 1 = enabled)
>> + */
>> +bool smu_debug_mode;
>>   };
>>
>>   struct i2c_adapter;
>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> index 6e781cee8bb6..d3797a2d6451 100644
>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct
>smu_context *smu)
>>   out:
>>  mutex_unlock(&smu->message_lock);
>>
>> +BUG_ON(unlikely(smu->smu_debug_mode) && ret);
>> +
>>  return ret;
>>   }
>>
>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> index 048ca1673863..9be005eb4241 100644
>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> @@ -349,15 +349,21 @@ int smu_cmn_send_smc_msg_with_param(struct
>smu_context *smu,
>>  __smu_cmn_reg_print_error(smu, reg, index, param, msg);
>>  goto Out;
>>  }
>> +
>>  __smu_cmn_send_msg(smu, (uint16_t) index, param);
>>  reg = __smu_cmn_poll_stat(smu);
>>  res = __smu_cmn_reg2errno(smu, reg);
>> -if (res != 0)
>> +if (res != 0) {
>>  __smu_cmn_reg_print_error(smu, reg, index, param, msg);
>> +goto Out;
>> +}
>>  if (read_arg)
>>  smu_cmn_read_arg(smu, read_arg);
>>   Out:
>>  mutex_unlock(&smu->message_lock);
>> +
>> +BUG_ON(unlikely(smu->smu_debug_mode) && res);
>
>BUG_ON() really crashes the kernel and is only allowed if we prevent further 
>data
>corruption with that.
>
>Most of the time WARN_ON() is more appropriate, but I can't fully judge here
>since I don't know the SMU code well enough.

This is what SMU FW guys want. They want "user-visible (potentially fatal) 
errors", then a hang.
They want to keep system state since the error occurred.

Regards,
Lang

>Christian.
>
>> +
>>  return res;
>>   }
>>


RE: [PATCH] drm/amdgpu: add support to SMU debug option

2021-12-01 Thread Yu, Lang
[AMD Official Use Only]



>-Original Message-
>From: Lazar, Lijo 
>Sent: Wednesday, December 1, 2021 5:47 PM
>To: Yu, Lang ; amd-gfx@lists.freedesktop.org
>Cc: Deucher, Alexander ; Huang, Ray
>; Koenig, Christian 
>Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>
>
>
>On 12/1/2021 2:54 PM, Lang Yu wrote:
>> To maintain system error state when SMU errors occurred, which will
>> aid in debugging SMU firmware issues, add SMU debug option support.
>>
>> It can be enabled or disabled via amdgpu_smu_debug debugfs file. When
>> enabled, it makes SMU errors fatal.
>> It is disabled by default.
>>
>> == Command Guide ==
>>
>> 1, enable SMU debug option
>>
>>   # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>
>> 2, disable SMU debug option
>>
>>   # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>
>> v3:
>>   - Use debugfs_create_bool().(Christian)
>>   - Put variable into smu_context struct.
>>   - Don't resend command when timeout.
>>
>> v2:
>>   - Resend command when timeout.(Lijo)
>>   - Use debugfs file instead of module parameter.
>>
>> Signed-off-by: Lang Yu 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c| 3 +++
>>   drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h| 5 +
>>   drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
>>   drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 8 +++-
>>   4 files changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> index 164d6a9e9fbb..86cd888c7822 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device
>*adev)
>>  if (!debugfs_initialized())
>>  return 0;
>>
>> +debugfs_create_bool("amdgpu_smu_debug", 0600, root,
>> +	  &adev->smu.smu_debug_mode);
>> +
>>  ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
>>	  &fops_ib_preempt);
>>  if (IS_ERR(ent)) {
>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> index f738f7dc20c9..50dbf5594a9d 100644
>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> @@ -569,6 +569,11 @@ struct smu_context
>>  struct smu_user_dpm_profile user_dpm_profile;
>>
>>  struct stb_context stb_context;
>> +/*
>> + * When enabled, it makes SMU errors fatal.
>> + * (0 = disabled (default), 1 = enabled)
>> + */
>> +bool smu_debug_mode;
>>   };
>>
>>   struct i2c_adapter;
>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> index 6e781cee8bb6..d3797a2d6451 100644
>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct
>smu_context *smu)
>>   out:
>>  mutex_unlock(&smu->message_lock);
>>
>> +BUG_ON(unlikely(smu->smu_debug_mode) && ret);
>> +
>This hunk can be skipped while submitting. If this fails, GPU reset will fail 
>and
>amdgpu won't continue.

Ok, we don't handle such cases.

>
>>  return ret;
>>   }
>>
>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> index 048ca1673863..9be005eb4241 100644
>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> @@ -349,15 +349,21 @@ int smu_cmn_send_smc_msg_with_param(struct
>smu_context *smu,
>>  __smu_cmn_reg_print_error(smu, reg, index, param, msg);
>>  goto Out;
>>  }
>> +
>>  __smu_cmn_send_msg(smu, (uint16_t) index, param);
>>  reg = __smu_cmn_poll_stat(smu);
>>  res = __smu_cmn_reg2errno(smu, reg);
>> -if (res != 0)
>> +if (res != 0) {
>>  __smu_cmn_reg_print_error(smu, reg, index, param, msg);
>> +goto Out;
>
>Next step is reading smu parameter register which is harmless as reading
>response register and it's not clear on read. This goto also may be skipped.

I just think that does some extra work. We don’t want to read response register.
This goto makes error handling more clear.

Regards,
Lang

>Thanks,
>Lijo
>
>> +}
>>  if (read_arg)
>>  smu_cmn_read_arg(smu, read_arg);
>>   Out:
>>  mutex_unlock(&smu->message_lock);
>> +
>> +BUG_ON(unlikely(smu->smu_debug_mode) && res);
>> +
>>  return res;
>>   }
>>
>>


Re: [PATCH] drm/amdgpu: add support to SMU debug option

2021-12-01 Thread Lazar, Lijo




On 12/1/2021 2:54 PM, Lang Yu wrote:

To maintain system error state when SMU errors occurred,
which will aid in debugging SMU firmware issues, add SMU
debug option support.

It can be enabled or disabled via amdgpu_smu_debug
debugfs file. When enabled, it makes SMU errors fatal.
It is disabled by default.

== Command Guide ==

1, enable SMU debug option

  # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

2, disable SMU debug option

  # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

v3:
  - Use debugfs_create_bool().(Christian)
  - Put variable into smu_context struct.
  - Don't resend command when timeout.

v2:
  - Resend command when timeout.(Lijo)
  - Use debugfs file instead of module parameter.

Signed-off-by: Lang Yu 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c| 3 +++
  drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h| 5 +
  drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 8 +++-
  4 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 164d6a9e9fbb..86cd888c7822 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device *adev)
if (!debugfs_initialized())
return 0;
  
+	debugfs_create_bool("amdgpu_smu_debug", 0600, root,

+ &adev->smu.smu_debug_mode);
+
ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
  &fops_ib_preempt);
if (IS_ERR(ent)) {
diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h 
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index f738f7dc20c9..50dbf5594a9d 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -569,6 +569,11 @@ struct smu_context
struct smu_user_dpm_profile user_dpm_profile;
  
  	struct stb_context stb_context;

+   /*
+* When enabled, it makes SMU errors fatal.
+* (0 = disabled (default), 1 = enabled)
+*/
+   bool smu_debug_mode;
  };
  
  struct i2c_adapter;

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
index 6e781cee8bb6..d3797a2d6451 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
@@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct smu_context *smu)
  out:
mutex_unlock(&smu->message_lock);
  
+	BUG_ON(unlikely(smu->smu_debug_mode) && ret);

+
This hunk can be skipped while submitting. If this fails, GPU reset will 
fail and amdgpu won't continue.



return ret;
  }
  
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c

index 048ca1673863..9be005eb4241 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
@@ -349,15 +349,21 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context 
*smu,
__smu_cmn_reg_print_error(smu, reg, index, param, msg);
goto Out;
}
+
__smu_cmn_send_msg(smu, (uint16_t) index, param);
reg = __smu_cmn_poll_stat(smu);
res = __smu_cmn_reg2errno(smu, reg);
-   if (res != 0)
+   if (res != 0) {
__smu_cmn_reg_print_error(smu, reg, index, param, msg);
+   goto Out;


Next step is reading the smu parameter register, which is as harmless as 
reading the response register, and it is not cleared on read. This goto may 
also be skipped.


Thanks,
Lijo


+   }
if (read_arg)
smu_cmn_read_arg(smu, read_arg);
  Out:
mutex_unlock(&smu->message_lock);
+
+   BUG_ON(unlikely(smu->smu_debug_mode) && res);
+
return res;
  }
  



Re: [PATCH] drm/amdgpu: add support to SMU debug option

2021-12-01 Thread Huang Rui
On Wed, Dec 01, 2021 at 05:24:58PM +0800, Yu, Lang wrote:
> To maintain system error state when SMU errors occurred,
> which will aid in debugging SMU firmware issues, add SMU
> debug option support.
> 
> It can be enabled or disabled via amdgpu_smu_debug
> debugfs file. When enabled, it makes SMU errors fatal.
> It is disabled by default.
> 
> == Command Guide ==
> 
> 1, enable SMU debug option
> 
>  # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> 
> 2, disable SMU debug option
> 
>  # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> 
> v3:
>  - Use debugfs_create_bool().(Christian)
>  - Put variable into smu_context struct.
>  - Don't resend command when timeout.
> 
> v2:
>  - Resend command when timeout.(Lijo)
>  - Use debugfs file instead of module parameter.
> 
> Signed-off-by: Lang Yu 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c| 3 +++
>  drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h| 5 +
>  drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 8 +++-
>  4 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> index 164d6a9e9fbb..86cd888c7822 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device *adev)
>   if (!debugfs_initialized())
>   return 0;
>  
> + debugfs_create_bool("amdgpu_smu_debug", 0600, root,
> +   &adev->smu.smu_debug_mode);
> +
>   ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
> &fops_ib_preempt);
>   if (IS_ERR(ent)) {
> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h 
> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> index f738f7dc20c9..50dbf5594a9d 100644
> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> @@ -569,6 +569,11 @@ struct smu_context
>   struct smu_user_dpm_profile user_dpm_profile;
>  
>   struct stb_context stb_context;
> + /*
> +  * When enabled, it makes SMU errors fatal.
> +  * (0 = disabled (default), 1 = enabled)
> +  */
> + bool smu_debug_mode;
>  };
>  
>  struct i2c_adapter;
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c 
> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> index 6e781cee8bb6..d3797a2d6451 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct smu_context 
> *smu)
>  out:
>   mutex_unlock(&smu->message_lock);
>  
> + BUG_ON(unlikely(smu->smu_debug_mode) && ret);
> +
>   return ret;
>  }
>  
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c 
> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> index 048ca1673863..9be005eb4241 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> @@ -349,15 +349,21 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context 
> *smu,
>   __smu_cmn_reg_print_error(smu, reg, index, param, msg);
>   goto Out;
>   }
> +
>   __smu_cmn_send_msg(smu, (uint16_t) index, param);
>   reg = __smu_cmn_poll_stat(smu);
>   res = __smu_cmn_reg2errno(smu, reg);
> - if (res != 0)
> + if (res != 0) {
>   __smu_cmn_reg_print_error(smu, reg, index, param, msg);
> + goto Out;
> + }
>   if (read_arg)
>   smu_cmn_read_arg(smu, read_arg);
>  Out:
>   mutex_unlock(&smu->message_lock);
> +
> + BUG_ON(unlikely(smu->smu_debug_mode) && res);
> +

Do we need to add BUG_ON on smu_cmn_send_msg_without_waiting() as well?

Thanks,
Ray

>   return res;
>  }
>  
> -- 
> 2.25.1
> 


Re: [PATCH] drm/amdgpu: add support to SMU debug option

2021-12-01 Thread Christian König

Am 01.12.21 um 10:24 schrieb Lang Yu:

To maintain system error state when SMU errors occurred,
which will aid in debugging SMU firmware issues, add SMU
debug option support.

It can be enabled or disabled via amdgpu_smu_debug
debugfs file. When enabled, it makes SMU errors fatal.
It is disabled by default.

== Command Guide ==

1, enable SMU debug option

  # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

2, disable SMU debug option

  # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

v3:
  - Use debugfs_create_bool().(Christian)
  - Put variable into smu_context struct.
  - Don't resend command when timeout.

v2:
  - Resend command when timeout.(Lijo)
  - Use debugfs file instead of module parameter.

Signed-off-by: Lang Yu 


Well the debugfs part looks really nice and clean now, but one more 
comment below.



---
  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c| 3 +++
  drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h| 5 +
  drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 8 +++-
  4 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 164d6a9e9fbb..86cd888c7822 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device *adev)
if (!debugfs_initialized())
return 0;
  
+	debugfs_create_bool("amdgpu_smu_debug", 0600, root,

+ &adev->smu.smu_debug_mode);
+
ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
  &fops_ib_preempt);
if (IS_ERR(ent)) {
diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h 
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index f738f7dc20c9..50dbf5594a9d 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -569,6 +569,11 @@ struct smu_context
struct smu_user_dpm_profile user_dpm_profile;
  
  	struct stb_context stb_context;

+   /*
+* When enabled, it makes SMU errors fatal.
+* (0 = disabled (default), 1 = enabled)
+*/
+   bool smu_debug_mode;
  };
  
  struct i2c_adapter;

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
index 6e781cee8bb6..d3797a2d6451 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
@@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct smu_context *smu)
  out:
mutex_unlock(&smu->message_lock);
  
+	BUG_ON(unlikely(smu->smu_debug_mode) && ret);

+
return ret;
  }
  
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c

index 048ca1673863..9be005eb4241 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
@@ -349,15 +349,21 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context 
*smu,
__smu_cmn_reg_print_error(smu, reg, index, param, msg);
goto Out;
}
+
__smu_cmn_send_msg(smu, (uint16_t) index, param);
reg = __smu_cmn_poll_stat(smu);
res = __smu_cmn_reg2errno(smu, reg);
-   if (res != 0)
+   if (res != 0) {
__smu_cmn_reg_print_error(smu, reg, index, param, msg);
+   goto Out;
+   }
if (read_arg)
smu_cmn_read_arg(smu, read_arg);
  Out:
mutex_unlock(&smu->message_lock);
+
+   BUG_ON(unlikely(smu->smu_debug_mode) && res);


BUG_ON() really crashes the kernel and is only allowed if we prevent 
further data corruption with it.


Most of the time WARN_ON() is more appropriate, but I can't fully judge 
here since I don't know the SMU code well enough.


Christian.


+
return res;
  }
  




[PATCH] drm/amdgpu: add support to SMU debug option

2021-12-01 Thread Lang Yu
To maintain system error state when SMU errors occurred,
which will aid in debugging SMU firmware issues, add SMU
debug option support.

It can be enabled or disabled via amdgpu_smu_debug
debugfs file. When enabled, it makes SMU errors fatal.
It is disabled by default.

== Command Guide ==

1, enable SMU debug option

 # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

2, disable SMU debug option

 # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

v3:
 - Use debugfs_create_bool().(Christian)
 - Put variable into smu_context struct.
 - Don't resend command when timeout.

v2:
 - Resend command when timeout.(Lijo)
 - Use debugfs file instead of module parameter.

Signed-off-by: Lang Yu 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c| 3 +++
 drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h| 5 +
 drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 8 +++-
 4 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 164d6a9e9fbb..86cd888c7822 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device *adev)
if (!debugfs_initialized())
return 0;
 
+   debugfs_create_bool("amdgpu_smu_debug", 0600, root,
+ &adev->smu.smu_debug_mode);
+
ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
  &fops_ib_preempt);
if (IS_ERR(ent)) {
diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h 
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index f738f7dc20c9..50dbf5594a9d 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -569,6 +569,11 @@ struct smu_context
struct smu_user_dpm_profile user_dpm_profile;
 
struct stb_context stb_context;
+   /*
+* When enabled, it makes SMU errors fatal.
+* (0 = disabled (default), 1 = enabled)
+*/
+   bool smu_debug_mode;
 };
 
 struct i2c_adapter;
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
index 6e781cee8bb6..d3797a2d6451 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
@@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct smu_context *smu)
 out:
mutex_unlock(&smu->message_lock);
 
+   BUG_ON(unlikely(smu->smu_debug_mode) && ret);
+
return ret;
 }
 
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
index 048ca1673863..9be005eb4241 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
@@ -349,15 +349,21 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context 
*smu,
__smu_cmn_reg_print_error(smu, reg, index, param, msg);
goto Out;
}
+
__smu_cmn_send_msg(smu, (uint16_t) index, param);
reg = __smu_cmn_poll_stat(smu);
res = __smu_cmn_reg2errno(smu, reg);
-   if (res != 0)
+   if (res != 0) {
__smu_cmn_reg_print_error(smu, reg, index, param, msg);
+   goto Out;
+   }
if (read_arg)
smu_cmn_read_arg(smu, read_arg);
 Out:
mutex_unlock(&smu->message_lock);
+
+   BUG_ON(unlikely(smu->smu_debug_mode) && res);
+
return res;
 }
 
-- 
2.25.1



Re: [PATCH] drm/amdgpu: add SMU debug option support

2021-12-01 Thread Lazar, Lijo




On 12/1/2021 1:48 PM, Yu, Lang wrote:

[AMD Official Use Only]




-Original Message-
From: amd-gfx  On Behalf Of Yu, Lang
Sent: Wednesday, December 1, 2021 3:58 PM
To: Lazar, Lijo ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Huang, Ray
; Koenig, Christian 
Subject: RE: [PATCH] drm/amdgpu: add SMU debug option support

[AMD Official Use Only]




-Original Message-
From: Lazar, Lijo 
Sent: Wednesday, December 1, 2021 3:28 PM
To: Yu, Lang ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Huang, Ray
; Koenig, Christian 
Subject: Re: [PATCH] drm/amdgpu: add SMU debug option support



On 12/1/2021 12:37 PM, Yu, Lang wrote:

[AMD Official Use Only]




-Original Message-
From: Lazar, Lijo 
Sent: Wednesday, December 1, 2021 2:56 PM
To: Yu, Lang ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Huang, Ray
; Koenig, Christian 
Subject: Re: [PATCH] drm/amdgpu: add SMU debug option support



On 12/1/2021 11:57 AM, Yu, Lang wrote:

[AMD Official Use Only]

Hi Lijo,

Thanks for your comments.

   From my understanding, that just increases the timeout threshold
and could hide some potential issues which should be exposed and solved.

If current timeout threshold is not enough for some corner cases,
(1) Do we consider to increase the threshold to cover these cases?
(2) Or do we just expose them and request SMU FW to optimize them?

I think it doesn't make much sense to increase the threshold in debug mode.
What do you think? Thanks!


In normal cases, 2 secs would be more than enough. If we hang
immediately, then check the FW registers later, the response would
have come. I thought we just need to note those cases and not fail
every time. Just mark a red flag in the log to tell us that the FW
is unexpectedly busy processing something else when the message is
sent.


There are some issues related to S0ix where we see the FW comes back
with a response with an increased timeout under certain conditions.


If these issues still exist, could we just blacklist the tests that
triggered them until they are solved? Or do we just increase the threshold
to cover all the cases?




Actually, the timeout is message specific - an i2c transfer from
EEPROM, for example, could take a longer time.

I am not sure if we should have more than 2s as the timeout. Whenever this
kind of issue happens, the FW team checks the registers (by then they will
have a proper value) and says they don't see anything abnormal :) Usually,
those are just signs of a crack and it eventually breaks.

The options are to just fail immediately (then again, not sure how useful
that will be if the issue is this sort of thing) or to wait and see how far
it goes with an added timeout before it fails eventually.


Are smu_cmn_wait_for_response()/smu_cmn_send_msg_without_waiting()
designed for long timeout cases? Is it fine that we don't fail here in the 
event of
timeout?


Or we can add a timeout parameter into smu_cmn_send_smc_msg_with_param()
to specify the timeout you want for a specific message.
I think this may be another story. Thanks!



Yes, that will be a different patch. For now, skip the extended timeout. 
Every timeout will trigger a debug alarm; let it be that way for 
debug mode. I think you can skip the retry as well (originally that 
comment meant retrying the response register check).


Thanks,
Lijo


Thanks,
Lang




Thanks,
Lijo


Regards,
Lang



Thanks,
Lijo



Regards,
Lang


-Original Message-
From: Lazar, Lijo 
Sent: Wednesday, December 1, 2021 1:44 PM
To: Lazar, Lijo ; Yu, Lang ;
amd- g...@lists.freedesktop.org
Cc: Deucher, Alexander ; Huang, Ray
; Koenig, Christian 
Subject: RE: [PATCH] drm/amdgpu: add SMU debug option support

Just realized that the patch I pasted won't work. Outer loop exit
needs to be like this.
(reg & MP1_C2PMSG_90__CONTENT_MASK) != 0 && extended_wait-- >=
0

Anyway, that patch is only there to communicate what I really
meant in the earlier comment.

Thanks,
Lijo

-Original Message-
From: amd-gfx  On Behalf Of
Lazar, Lijo
Sent: Wednesday, December 1, 2021 10:44 AM
To: Yu, Lang ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Huang, Ray
; Koenig, Christian 
Subject: Re: [PATCH] drm/amdgpu: add SMU debug option support



On 11/30/2021 10:47 AM, Lang Yu wrote:

To maintain system error state when SMU errors occurred, which
will aid in debugging SMU firmware issues, add SMU debug option

support.


It can be enabled or disabled via amdgpu_smu_debug debugfs file.
When enabled, it makes SMU errors fatal.
It is disabled by default.

== Command Guide ==

1, enable SMU debug option

 # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

2, disable SMU debug option

 # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

v2:
 - Resend command when timeout.(Lijo)
 - Use debugfs file instead of module parameter.

Signed-off-by: Lang Yu 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 32

+

 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c  | 39


RE: [PATCH] drm/amdgpu: add SMU debug option support

2021-12-01 Thread Yu, Lang
[AMD Official Use Only]



>-Original Message-
>From: amd-gfx  On Behalf Of Yu, Lang
>Sent: Wednesday, December 1, 2021 3:58 PM
>To: Lazar, Lijo ; amd-gfx@lists.freedesktop.org
>Cc: Deucher, Alexander ; Huang, Ray
>; Koenig, Christian 
>Subject: RE: [PATCH] drm/amdgpu: add SMU debug option support
>
>[AMD Official Use Only]
>
>
>
>>-Original Message-
>>From: Lazar, Lijo 
>>Sent: Wednesday, December 1, 2021 3:28 PM
>>To: Yu, Lang ; amd-gfx@lists.freedesktop.org
>>Cc: Deucher, Alexander ; Huang, Ray
>>; Koenig, Christian 
>>Subject: Re: [PATCH] drm/amdgpu: add SMU debug option support
>>
>>
>>
>>On 12/1/2021 12:37 PM, Yu, Lang wrote:
>>> [AMD Official Use Only]
>>>
>>>
>>>
 -Original Message-
 From: Lazar, Lijo 
 Sent: Wednesday, December 1, 2021 2:56 PM
 To: Yu, Lang ; amd-gfx@lists.freedesktop.org
 Cc: Deucher, Alexander ; Huang, Ray
 ; Koenig, Christian 
 Subject: Re: [PATCH] drm/amdgpu: add SMU debug option support



 On 12/1/2021 11:57 AM, Yu, Lang wrote:
> [AMD Official Use Only]
>
> Hi Lijo,
>
> Thanks for your comments.
>
>   From my understanding, that just increases the timeout threshold
> and could hide some potential issues which should be exposed and solved.
>
> If current timeout threshold is not enough for some corner cases,
> (1) Do we consider to increase the threshold to cover these cases?
> (2) Or do we just expose them and request SMU FW to optimize them?
>
> I think it doesn't make much sense to increase the threshold in debug 
> mode.
> What do you think? Thanks!

 In normal cases, 2 secs would be more than enough. If we hang
 immediately, then check the FW registers later, the response would
 have come. I thought we just need to note those cases and not fail
 every time. Just mark a red flag in the log to tell us that the FW
 is unexpectedly busy processing something else when the message is
 sent.

 There are some issues related to S0ix where we see the FW comes back
 with a response with an increased timeout under certain conditions.
>>>
>>> If these issues still exist, could we just blacklist the tests that
>>> triggered them until they are solved? Or do we just increase the threshold
>>> to cover all the cases?
>>>
>>
>>Actually, the timeout is message specific - like i2c transfer from
>>EEPROM could take longer time.
>>
>>I am not sure if we should have more than 2s as the timeout. Whenever this
>>kind of issue happens, the FW team checks the registers (by then they will
>>have a proper value) and says they don't see anything abnormal :) Usually,
>>those are just signs of a crack and it eventually breaks.
>>
>>The options are to just fail immediately (then again, not sure how useful
>>that will be if the issue is this sort of thing) or to wait and see how far
>>it goes with an added timeout before it fails eventually.
>
>Are smu_cmn_wait_for_response()/smu_cmn_send_msg_without_waiting()
>designed for long timeout cases? Is it fine that we don't fail here in the 
>event of
>timeout?

Or we can add a timeout parameter into smu_cmn_send_smc_msg_with_param() 
to specify the timeout you want for a specific message.
I think this may be another story. Thanks!
 
Thanks,
Lang
>
>>
>>Thanks,
>>Lijo
>>
>>> Regards,
>>> Lang
>>>

 Thanks,
 Lijo

>
> Regards,
> Lang
>
>> -Original Message-
>> From: Lazar, Lijo 
>> Sent: Wednesday, December 1, 2021 1:44 PM
>> To: Lazar, Lijo ; Yu, Lang ;
>> amd- g...@lists.freedesktop.org
>> Cc: Deucher, Alexander ; Huang, Ray
>> ; Koenig, Christian 
>> Subject: RE: [PATCH] drm/amdgpu: add SMU debug option support
>>
>> Just realized that the patch I pasted won't work. Outer loop exit
>> needs to be like this.
>>  (reg & MP1_C2PMSG_90__CONTENT_MASK) != 0 && extended_wait-- >=
>> 0
>>
>> Anyway, that patch is only there to communicate what I really
>> meant in the earlier comment.
>>
>> Thanks,
>> Lijo
>>
>> -Original Message-
>> From: amd-gfx  On Behalf Of
>> Lazar, Lijo
>> Sent: Wednesday, December 1, 2021 10:44 AM
>> To: Yu, Lang ; amd-gfx@lists.freedesktop.org
>> Cc: Deucher, Alexander ; Huang, Ray
>> ; Koenig, Christian 
>> Subject: Re: [PATCH] drm/amdgpu: add SMU debug option support
>>
>>
>>
>> On 11/30/2021 10:47 AM, Lang Yu wrote:
>>> To maintain system error state when SMU errors occurred, which
>>> will aid in debugging SMU firmware issues, add SMU debug option
>support.
>>>
>>> It can be enabled or disabled via amdgpu_smu_debug debugfs file.
>>> When enabled, it makes SMU errors fatal.
>>> It is disabled by default.
>>>
>>> == Command Guide ==
>>>
>>> 1, enable SMU debug option
>>>
>>> # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>
>>> 2, disable SMU debug option
>>>
>>> #