RE: [PATCH] drm/amd/pm: Vangogh: Add new gpu_metrics_v2_4 to acquire gpu_metrics

2023-06-10 Thread Yang, WenYou
[AMD Official Use Only - General]

Any comments?

> -Original Message-
> From: Wenyou Yang 
> Sent: Thursday, June 1, 2023 9:38 AM
> To: Deucher, Alexander ; Limonciello, Mario
> ; Koenig, Christian ;
> Pan, Xinhui ; Quan, Evan 
> Cc: Yuan, Perry ; Liang, Richard qi
> ; amd-gfx@lists.freedesktop.org; dri-
> de...@lists.freedesktop.org; linux-ker...@vger.kernel.org; Yang, WenYou
> 
> Subject: [PATCH] drm/amd/pm: Vangogh: Add new gpu_metrics_v2_4 to acquire
> gpu_metrics
>
> To acquire the voltage and current info from gpu_metrics interface, but
> gpu_metrics_v2_3 doesn't contain them, and to be backward compatible, add
> new gpu_metrics_v2_4 structure.
>
> Acked-by: Evan Quan 
> Signed-off-by: Wenyou Yang 
> ---
>  .../gpu/drm/amd/include/kgd_pp_interface.h|  69 +++
>  .../gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c  | 109 -
> -
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c|   3 +
>  3 files changed, 172 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/include/kgd_pp_interface.h
> b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
> index 9f542f6e19ed..0f37dafafcf9 100644
> --- a/drivers/gpu/drm/amd/include/kgd_pp_interface.h
> +++ b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
> @@ -892,4 +892,73 @@ struct gpu_metrics_v2_3 {
>   uint16_taverage_temperature_core[8]; //
> average CPU core temperature on APUs
>   uint16_taverage_temperature_l3[2];
>  };
> +
> +struct gpu_metrics_v2_4 {
> + struct metrics_table_header common_header;
> +
> + /* Temperature */
> + uint16_ttemperature_gfx;
> + uint16_ttemperature_soc;
> + uint16_ttemperature_core[8];
> + uint16_ttemperature_l3[2];
> +
> + /* Utilization */
> + uint16_taverage_gfx_activity;
> + uint16_taverage_mm_activity;
> +
> + /* Driver attached timestamp (in ns) */
> + uint64_tsystem_clock_counter;
> +
> + /* Power/Energy */
> + uint16_taverage_socket_power;
> + uint16_taverage_cpu_power;
> + uint16_taverage_soc_power;
> + uint16_taverage_gfx_power;
> + uint16_taverage_core_power[8];
> +
> + /* Average clocks */
> + uint16_taverage_gfxclk_frequency;
> + uint16_taverage_socclk_frequency;
> + uint16_taverage_uclk_frequency;
> + uint16_taverage_fclk_frequency;
> + uint16_taverage_vclk_frequency;
> + uint16_taverage_dclk_frequency;
> +
> + /* Current clocks */
> + uint16_tcurrent_gfxclk;
> + uint16_tcurrent_socclk;
> + uint16_tcurrent_uclk;
> + uint16_tcurrent_fclk;
> + uint16_tcurrent_vclk;
> + uint16_tcurrent_dclk;
> + uint16_tcurrent_coreclk[8];
> + uint16_tcurrent_l3clk[2];
> +
> + /* Throttle status (ASIC dependent) */
> + uint32_tthrottle_status;
> +
> + /* Fans */
> + uint16_tfan_pwm;
> +
> + uint16_tpadding[3];
> +
> + /* Throttle status (ASIC independent) */
> + uint64_tindep_throttle_status;
> +
> + /* Average Temperature */
> + uint16_taverage_temperature_gfx;
> + uint16_taverage_temperature_soc;
> + uint16_taverage_temperature_core[8];
> + uint16_taverage_temperature_l3[2];
> +
> + /* Power/Voltage */
> + uint16_taverage_cpu_voltage;
> + uint16_taverage_soc_voltage;
> + uint16_taverage_gfx_voltage;
> +
> + /* Power/Current */
> + uint16_taverage_cpu_current;
> + uint16_taverage_soc_current;
> + uint16_taverage_gfx_current;
> +};
>  #endif
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
> b/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
> index 067b4e0b026c..185d0b50ee8e 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
> @@ -1854,6 +1854,86 @@ static ssize_t vangogh_get_gpu_metrics_v2_3(struct
> smu_context *smu,
>   return sizeof(struct gpu_metrics_v2_3);  }
>
> +static ssize_t vangogh_get_gpu_metrics_v2_4(struct smu_context *smu,
> + void **table)
> +{
> + 

RE: [PATCH] drm/amdgpu: Fix usage of UMC fill record in RAS

2023-06-10 Thread Zhang, Hawking
[AMD Official Use Only - General]

Reviewed-by: Hawking Zhang 

Regards,
Hawking
-Original Message-
From: Tuikov, Luben 
Sent: Saturday, June 10, 2023 19:36
To: AMD Graphics 
Cc: Tuikov, Luben ; Zhou1, Tao ; 
Zhang, Hawking ; Deucher, Alexander 

Subject: [PATCH] drm/amdgpu: Fix usage of UMC fill record in RAS

The fixed commit listed in the Fixes tag below, introduced a bug in 
amdgpu_ras.c::amdgpu_reserve_page_direct(), in that when introducing the new
amdgpu_umc_fill_error_record() and internally in that new function the physical 
address (argument "uint64_t retired_page"--wrong name) is right-shifted by 
AMDGPU_GPU_PAGE_SHIFT. Thus, in amdgpu_reserve_page_direct() when we pass 
"address" to that new function, we should NOT right-shift it, since this 
results, erroneously, in the page address to be 0 for first
2^(2*AMDGPU_GPU_PAGE_SHIFT) memory addresses.

This commit fixes this bug.

Cc: Tao Zhou 
Cc: Hawking Zhang 
Cc: Alex Deucher 
Fixes: 400013b268cb66 ("drm/amdgpu: add umc_fill_error_record to make code more 
simple")
Signed-off-by: Luben Tuikov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 27a32933cbee3b..700eb180ea60fa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -171,8 +171,7 @@ static int amdgpu_reserve_page_direct(struct amdgpu_device 
*adev, uint64_t addre

memset(_rec, 0x0, sizeof(struct eeprom_table_record));
err_data.err_addr = _rec;
-   amdgpu_umc_fill_error_record(_data, address,
-   (address >> AMDGPU_GPU_PAGE_SHIFT), 0, 0);
+   amdgpu_umc_fill_error_record(_data, address, address, 0, 0);

if (amdgpu_bad_page_threshold != 0) {
amdgpu_ras_add_bad_pages(adev, err_data.err_addr,

base-commit: 7eda4451177abbc8d2fab24f816a3243dd1808d8
prerequisite-patch-id: f2a3eadc5d7074225109701f1bb43b38bd6024fd
--
2.41.0



[PATCH] drm/amdgpu: Fix usage of UMC fill record in RAS

2023-06-10 Thread Luben Tuikov
The fixed commit listed in the Fixes tag below, introduced a bug in
amdgpu_ras.c::amdgpu_reserve_page_direct(), in that when introducing the new
amdgpu_umc_fill_error_record() and internally in that new function the physical
address (argument "uint64_t retired_page"--wrong name) is right-shifted by
AMDGPU_GPU_PAGE_SHIFT. Thus, in amdgpu_reserve_page_direct() when we pass
"address" to that new function, we should NOT right-shift it, since this
results, erroneously, in the page address to be 0 for first
2^(2*AMDGPU_GPU_PAGE_SHIFT) memory addresses.

This commit fixes this bug.

Cc: Tao Zhou 
Cc: Hawking Zhang 
Cc: Alex Deucher 
Fixes: 400013b268cb66 ("drm/amdgpu: add umc_fill_error_record to make code more 
simple")
Signed-off-by: Luben Tuikov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 27a32933cbee3b..700eb180ea60fa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -171,8 +171,7 @@ static int amdgpu_reserve_page_direct(struct amdgpu_device 
*adev, uint64_t addre
 
memset(_rec, 0x0, sizeof(struct eeprom_table_record));
err_data.err_addr = _rec;
-   amdgpu_umc_fill_error_record(_data, address,
-   (address >> AMDGPU_GPU_PAGE_SHIFT), 0, 0);
+   amdgpu_umc_fill_error_record(_data, address, address, 0, 0);
 
if (amdgpu_bad_page_threshold != 0) {
amdgpu_ras_add_bad_pages(adev, err_data.err_addr,

base-commit: 7eda4451177abbc8d2fab24f816a3243dd1808d8
prerequisite-patch-id: f2a3eadc5d7074225109701f1bb43b38bd6024fd
-- 
2.41.0



Re: [PATCH] drm/amdgpu: Wrap -Wunused-but-set-variable in cc-option

2023-06-10 Thread Jani Nikula
On Thu, 08 Jun 2023, Nathan Chancellor  wrote:
> -Wunused-but-set-variable was only supported in clang starting with
> 13.0.0, so earlier versions will emit a warning, which is turned into a
> hard error for the kernel to mirror GCC:
>
>   error: unknown warning option '-Wunused-but-set-variable'; did you mean 
> '-Wunused-const-variable'? [-Werror,-Wunknown-warning-option]
>
> The minimum supported version of clang for building the kernel is
> 11.0.0, so match the rest of the kernel and wrap
> -Wunused-but-set-variable in a cc-option call, so that it is only used
> when supported by the compiler.

I wonder if there's a table somewhere listing all the warning options,
which GCC and Clang versions support them, and which versions have them
in -Wall and -Wextra. Would be really useful.

If there isn't one, it would be really helpful. *wink*.

BR,
Jani.


>
> Closes: https://github.com/ClangBuiltLinux/linux/issues/1869
> Fixes: a0fd5a5f676c ("drm/amd/amdgpu: introduce DRM_AMDGPU_WERROR")
> Signed-off-by: Nathan Chancellor 
> ---
>  drivers/gpu/drm/amd/amdgpu/Makefile | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile 
> b/drivers/gpu/drm/amd/amdgpu/Makefile
> index 7ee68b1bbfed..86b833085f19 100644
> --- a/drivers/gpu/drm/amd/amdgpu/Makefile
> +++ b/drivers/gpu/drm/amd/amdgpu/Makefile
> @@ -40,7 +40,7 @@ ccflags-y := -I$(FULL_AMD_PATH)/include/asic_reg \
>   -I$(FULL_AMD_PATH)/amdkfd
>  
>  subdir-ccflags-y := -Wextra
> -subdir-ccflags-y += -Wunused-but-set-variable
> +subdir-ccflags-y += $(call cc-option, -Wunused-but-set-variable)
>  subdir-ccflags-y += -Wno-unused-parameter
>  subdir-ccflags-y += -Wno-type-limits
>  subdir-ccflags-y += -Wno-sign-compare
>
> ---
> base-commit: 6bd4b01e8938779b0d959bdf33949a9aa258a363
> change-id: 
> 20230608-amdgpu-wrap-wunused-but-set-variable-in-cc-option-0be9528ac5c8
>
> Best regards,

-- 
Jani Nikula, Intel Open Source Graphics Center


Re: [PATCH] drm/amd/amdgpu: enable W=1 for amdgpu

2023-06-10 Thread Jani Nikula
On Sat, 10 Jun 2023, Masahiro Yamada  wrote:
> On Sat, Jun 10, 2023 at 5:17 AM Nathan Chancellor  wrote:
>>
>> + Masahiro and linux-kbuild
>>
>> On Fri, Jun 09, 2023 at 12:42:06PM -0400, Hamza Mahfooz wrote:
>> > We have a clean build with W=1 as of
>> > commit 12a15dd589ac ("drm/amd/display/amdgpu_dm/amdgpu_dm_helpers: Move
>> > SYNAPTICS_DEVICE_ID into CONFIG_DRM_AMD_DC_DCN ifdef"). So, let's enable
>> > these checks unconditionally for the entire module to catch these errors
>> > during development.
>> >
>> > Cc: Alex Deucher 
>> > Cc: Nathan Chancellor 
>> > Signed-off-by: Hamza Mahfooz 
>>
>> I think this is fine, especially since it will help catch issues in
>> amdgpu quickly and hopefully encourage developers to fix their problems
>> before they make it to a tree with wider impact lika -next.
>>
>> However, this is now the third place that W=1 has been effectively
>> enabled (i915 and btrfs are the other two I know of) and it would be
>> nice if this was a little more unified, especially since it is not
>> uncommon for the warnings under W=1 to shift around and keeping them
>> unified will make maintainence over the longer term a little easier. I
>> am not sure if this has been brought up in the past and I don't want to
>> hold up this change but I suspect this sentiment of wanting to enable
>> W=1 on a per-subsystem basis is going to continue to grow.
>
>
>
> I believe this patch is the right way because
> we will be able to add a new warning option to
> scripts/Makefile.extrawarn without fixing any code.
>
> I remember somebody argued that drivers should be
> able to do
>   subdir-ccflags-y += $(W1_FLAGS)

Personally, I think this would be the viable way to make the kernel free
of W=1 warnings. Make it clean driver and subsystem at a time, with
constant progress. Currently, there's haphazard fixing of issues, with
new ones creeping back in, because kernel-wide W=1 is too verbose for
most developers. It's whac-a-mole.

> However, if a new flag, -Wfoo, emits warnings
> for drivers/gpu/drm/{i915,amd},
> you cannot add it to W=1 until fixing the code.

Or adding -Wno-foo where it breaks, until fixed.

> If many drivers start to do likewise,
> W=1 warning will not be W=1 any more.

I don't know, is the goal to fix the warnings, or keep adding stuff to
W=1 so that it'll always emit warnings? :p


BR,
Jani.

> Another good thing for hard-coding warning options
> is you can lift up a warning flag one by one.
>
>
> Let's say you fixed the entire DRM subsystem so
> it is -Wunused free now.
>
> Then, you can move -Wunused to drivers/gpu/drm/Makefile,
> while other warning options stay in drivers Makefiles.
>
>
>
>
>
>
>
>>
>> Regardless, for clang 11.1.0 to 16.0.5, I see no warnings when building
>> drivers/gpu/drm/amd/amdgpu/ with Arch Linux's configuration or
>> allmodconfig.
>>
>> Reviewed-by: Nathan Chancellor 
>> Tested-by: Nathan Chancellor 
>>
>> > ---
>> >  drivers/gpu/drm/amd/amdgpu/Makefile | 13 -
>> >  1 file changed, 12 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile 
>> > b/drivers/gpu/drm/amd/amdgpu/Makefile
>> > index 86b833085f19..8d16f280b695 100644
>> > --- a/drivers/gpu/drm/amd/amdgpu/Makefile
>> > +++ b/drivers/gpu/drm/amd/amdgpu/Makefile
>> > @@ -40,7 +40,18 @@ ccflags-y := -I$(FULL_AMD_PATH)/include/asic_reg \
>> >   -I$(FULL_AMD_PATH)/amdkfd
>> >
>> >  subdir-ccflags-y := -Wextra
>> > -subdir-ccflags-y += $(call cc-option, -Wunused-but-set-variable)
>> > +subdir-ccflags-y += -Wunused
>> > +subdir-ccflags-y += -Wmissing-prototypes
>> > +subdir-ccflags-y += -Wmissing-declarations
>> > +subdir-ccflags-y += -Wmissing-include-dirs
>> > +subdir-ccflags-y += -Wold-style-definition
>> > +subdir-ccflags-y += -Wmissing-format-attribute
>> > +# Need this to avoid recursive variable evaluation issues
>> > +cond-flags := $(call cc-option, -Wunused-but-set-variable) \
>> > + $(call cc-option, -Wunused-const-variable) \
>> > + $(call cc-option, -Wstringop-truncation) \
>> > + $(call cc-option, -Wpacked-not-aligned)
>> > +subdir-ccflags-y += $(cond-flags)
>> >  subdir-ccflags-y += -Wno-unused-parameter
>> >  subdir-ccflags-y += -Wno-type-limits
>> >  subdir-ccflags-y += -Wno-sign-compare
>> > --
>> > 2.40.1
>> >

-- 
Jani Nikula, Intel Open Source Graphics Center