Re: [PATCH 3/5] drm/amdgpu: Fix null pointer dereference to aca_handle

2024-05-13 Thread Lazar, Lijo



On 5/14/2024 6:30 AM, Ma, Jun wrote:
> Hi Lijo & Kevin, thanks for review, will drop this patch
> 

In the original function, the below check is already there:

if (!handle || !info || type >= ACA_ERROR_TYPE_COUNT)
	return -EINVAL;

So moving the assignment below to a later stage is still valid:
	struct aca_error_cache *error_cache = &handle->error_cache;

A further NULL check of error_cache is not required.

Thanks,
Lijo

> Regards,
> Ma Jun
> 
> On 5/14/2024 7:13 AM, Wang, Yang(Kevin) wrote:
>> [AMD Official Use Only - AMD Internal Distribution Only]
>>
>> -Original Message-
>> From: Ma, Jun 
>> Sent: Monday, May 13, 2024 4:56 PM
>> To: amd-gfx@lists.freedesktop.org
>> Cc: Feng, Kenneth ; Deucher, Alexander 
>> ; Wang, Yang(Kevin) ; 
>> Koenig, Christian ; Ma, Jun 
>> Subject: [PATCH 3/5] drm/amdgpu: Fix null pointer dereference to aca_handle
>>
>> Check handle pointer before using it
>>
>> Signed-off-by: Ma Jun 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 6 +-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
>> index 28febf33fb1b..e969a7d77b4d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
>> @@ -279,7 +279,7 @@ static struct aca_bank_error *get_bank_error(struct 
>> aca_error *aerr, struct aca_  int aca_error_cache_log_bank_error(struct 
>> aca_handle *handle, struct aca_bank_info *info,
>>enum aca_error_type type, u64 count)  {
>> -   struct aca_error_cache *error_cache = &handle->error_cache;
>> +   struct aca_error_cache *error_cache;
>> struct aca_bank_error *bank_error;
>> struct aca_error *aerr;
>>
>> @@ -289,6 +289,10 @@ int aca_error_cache_log_bank_error(struct aca_handle 
>> *handle, struct aca_bank_in
>> if (!count)
>> return 0;
>>
>> +   error_cache = &handle->error_cache;
>> [Kevin]:
>> The above code is always return non-0 value, right?
>>
>> Best Regards,
>> Kevin
>> +   if (!error_cache)
>> +   return -EINVAL;
>> +
>> aerr = &error_cache->errors[type];
>> bank_error = get_bank_error(aerr, info);
>> if (!bank_error)
>> --
>> 2.34.1
>>


[PATCH] drm/amdgpu/pm: Drop hard-coded value of usTMax

2024-05-13 Thread Ma Jun
Drop the hard-coded value of usTMax because we read this
value from the fan table below.

Signed-off-by: Ma Jun 
---
 drivers/gpu/drm/amd/pm/powerplay/hwmgr/process_pptables_v1_0.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/process_pptables_v1_0.c 
b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/process_pptables_v1_0.c
index 17882f8dfdd3..6cfef1b295ab 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/process_pptables_v1_0.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/process_pptables_v1_0.c
@@ -977,8 +977,6 @@ static int init_thermal_controller(
= le16_to_cpu(tonga_fan_table->usPWMMed);
hwmgr->thermal_controller.advanceFanControlParameters.usPWMHigh
= le16_to_cpu(tonga_fan_table->usPWMHigh);
-   hwmgr->thermal_controller.advanceFanControlParameters.usTMax
-   = 10900;  /* hard coded */
hwmgr->thermal_controller.advanceFanControlParameters.usTMax
= le16_to_cpu(tonga_fan_table->usTMax);

hwmgr->thermal_controller.advanceFanControlParameters.ucFanControlMode
-- 
2.34.1



[PATCH v2] drm/amdgpu: Fix the null pointer dereference to ras_manager

2024-05-13 Thread Ma Jun
Check ras_manager before using it

Signed-off-by: Ma Jun 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 925ec65ac5ed..2bcf5c3b5d70 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2172,12 +2172,15 @@ static void amdgpu_ras_interrupt_process_handler(struct 
work_struct *work)
 int amdgpu_ras_interrupt_dispatch(struct amdgpu_device *adev,
struct ras_dispatch_if *info)
 {
-   struct ras_manager *obj = amdgpu_ras_find_obj(adev, &info->head);
-   struct ras_ih_data *data = &obj->ih_data;
+   struct ras_manager *obj;
+   struct ras_ih_data *data;
 
+   obj = amdgpu_ras_find_obj(adev, &info->head);
if (!obj)
return -EINVAL;
 
+   data = &obj->ih_data;
+
if (data->inuse == 0)
return 0;
 
-- 
2.34.1



RE: [PATCH 01/22] drm/amdgpu: fix dereference after null check

2024-05-13 Thread Zhang, Jesse(Jie)
[AMD Official Use Only - AMD Internal Distribution Only]

Hi Christian,

-Original Message-
From: Christian König 
Sent: Monday, May 13, 2024 7:41 PM
To: Zhang, Jesse(Jie) ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Koenig, Christian 
; Huang, Tim 
Subject: Re: [PATCH 01/22] drm/amdgpu: fix dereference after null check

Am 10.05.24 um 04:50 schrieb Jesse Zhang:
> check the pointer hive before use.
>
> Signed-off-by: Jesse Zhang 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 77f6fd50002a..00fe3c2d5431 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5725,7 +5725,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>* to put adev in the 1st position.
>*/
>   INIT_LIST_HEAD(&device_list);
> - if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1)) {
> + if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes >
> +1) && hive) {

That solution looks not optimal to me.

Checking adev->gmc.xgmi.num_physical_nodes > 1 already makes sure that hive 
shouldn't be NULL.

If automated checkers complain about that we should probably drop the
adev->gmc.xgmi.num_physical_nodes > 1 check and check for hive instead.

[Zhang, Jesse(Jie)]
gmc.xgmi.num_physical_nodes is obtained by reading the register 
GCMC_VM_XGMI_LFB_CNTL.
But getting the hive may fail because of an out-of-memory condition 
(-ENOMEM), or because the kobject of the xgmi hive cannot be initialized 
in the function amdgpu_get_xgmi_hive.
Is (adev->gmc.xgmi.num_physical_nodes > 1) equivalent to (!hive) here?

Regards
Jesse

Regards,
Christian.

>   list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {
>   list_add_tail(&tmp_adev->reset_list, &device_list);
>   if (adev->shutdown)



RE: [PATCH 1/5] drm/amdgpu/pm: Fix the null pointer dereference for smu7

2024-05-13 Thread Wang, Yang(Kevin)
[AMD Official Use Only - AMD Internal Distribution Only]

Reviewed-by: Yang Wang 

Best Regards,
Kevin

-Original Message-
From: Ma, Jun 
Sent: Monday, May 13, 2024 4:56 PM
To: amd-gfx@lists.freedesktop.org
Cc: Feng, Kenneth ; Deucher, Alexander 
; Wang, Yang(Kevin) ; 
Koenig, Christian ; Ma, Jun 
Subject: [PATCH 1/5] drm/amdgpu/pm: Fix the null pointer dereference for smu7

Optimize the code to avoid passing a null pointer (hwmgr->backend) to the 
function smu7_update_edc_leakage_table().

Signed-off-by: Ma Jun 
---
 .../drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c   | 50 +--
 1 file changed, 24 insertions(+), 26 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c 
b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c
index 123af237878f..632a25957477 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c
@@ -2957,6 +2957,7 @@ static int smu7_update_edc_leakage_table(struct pp_hwmgr 
*hwmgr)

 static int smu7_hwmgr_backend_init(struct pp_hwmgr *hwmgr)  {
+   struct amdgpu_device *adev = hwmgr->adev;
struct smu7_hwmgr *data;
int result = 0;

@@ -2993,40 +2994,37 @@ static int smu7_hwmgr_backend_init(struct pp_hwmgr 
*hwmgr)
/* Initalize Dynamic State Adjustment Rule Settings */
result = phm_initializa_dynamic_state_adjustment_rule_settings(hwmgr);

-   if (0 == result) {
-   struct amdgpu_device *adev = hwmgr->adev;
+   if (result)
+   goto fail;

-   data->is_tlu_enabled = false;
+   data->is_tlu_enabled = false;

-   hwmgr->platform_descriptor.hardwareActivityPerformanceLevels =
+   hwmgr->platform_descriptor.hardwareActivityPerformanceLevels =

SMU7_MAX_HARDWARE_POWERLEVELS;
-   hwmgr->platform_descriptor.hardwarePerformanceLevels = 2;
-   hwmgr->platform_descriptor.minimumClocksReductionPercentage = 
50;
+   hwmgr->platform_descriptor.hardwarePerformanceLevels = 2;
+   hwmgr->platform_descriptor.minimumClocksReductionPercentage = 50;

-   data->pcie_gen_cap = adev->pm.pcie_gen_mask;
-   if (data->pcie_gen_cap & CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3)
-   data->pcie_spc_cap = 20;
-   else
-   data->pcie_spc_cap = 16;
-   data->pcie_lane_cap = adev->pm.pcie_mlw_mask;
-
-   hwmgr->platform_descriptor.vbiosInterruptId = 0x2400; /* 
IRQ_SOURCE1_SW_INT */
-/* The true clock step depends on the frequency, typically 4.5 or 9 MHz. Here 
we use 5. */
-   hwmgr->platform_descriptor.clockStep.engineClock = 500;
-   hwmgr->platform_descriptor.clockStep.memoryClock = 500;
-   smu7_thermal_parameter_init(hwmgr);
-   } else {
-   /* Ignore return value in here, we are cleaning up a mess. */
-   smu7_hwmgr_backend_fini(hwmgr);
-   }
+   data->pcie_gen_cap = adev->pm.pcie_gen_mask;
+   if (data->pcie_gen_cap & CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3)
+   data->pcie_spc_cap = 20;
+   else
+   data->pcie_spc_cap = 16;
+   data->pcie_lane_cap = adev->pm.pcie_mlw_mask;
+
+   hwmgr->platform_descriptor.vbiosInterruptId = 0x2400; /* 
IRQ_SOURCE1_SW_INT */
+   /* The true clock step depends on the frequency, typically 4.5 or 9 
MHz. Here we use 5. */
+   hwmgr->platform_descriptor.clockStep.engineClock = 500;
+   hwmgr->platform_descriptor.clockStep.memoryClock = 500;
+   smu7_thermal_parameter_init(hwmgr);

result = smu7_update_edc_leakage_table(hwmgr);
-   if (result) {
-   smu7_hwmgr_backend_fini(hwmgr);
-   return result;
-   }
+   if (result)
+   goto fail;

return 0;
+fail:
+   smu7_hwmgr_backend_fini(hwmgr);
+   return result;
 }

 static int smu7_force_dpm_highest(struct pp_hwmgr *hwmgr)
--
2.34.1



RE: [PATCH 5/5] drm/amdgpu: Remove dead code in amdgpu_ras_add_mca_err_addr

2024-05-13 Thread Chai, Thomas
[AMD Official Use Only - AMD Internal Distribution Only]

Reviewed-by: YiPeng Chai 

-
Best Regards,
Thomas

-Original Message-
From: amd-gfx  On Behalf Of Ma Jun
Sent: Monday, May 13, 2024 4:56 PM
To: amd-gfx@lists.freedesktop.org
Cc: Feng, Kenneth ; Deucher, Alexander 
; Wang, Yang(Kevin) ; 
Koenig, Christian ; Ma, Jun 
Subject: [PATCH 5/5] drm/amdgpu: Remove dead code in amdgpu_ras_add_mca_err_addr

Remove dead code in amdgpu_ras_add_mca_err_addr

Signed-off-by: Ma Jun 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 13 -
 1 file changed, 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 6da02a209890..0cf67923c0fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -4292,21 +4292,8 @@ static struct ras_err_info 
*amdgpu_ras_error_get_info(struct ras_err_data *err_d

 void amdgpu_ras_add_mca_err_addr(struct ras_err_info *err_info, struct 
ras_err_addr *err_addr)  {
-   struct ras_err_addr *mca_err_addr;
-
/* This function will be retired. */
return;
-   mca_err_addr = kzalloc(sizeof(*mca_err_addr), GFP_KERNEL);
-   if (!mca_err_addr)
-   return;
-
-   INIT_LIST_HEAD(&mca_err_addr->node);
-
-   mca_err_addr->err_status = err_addr->err_status;
-   mca_err_addr->err_ipid = err_addr->err_ipid;
-   mca_err_addr->err_addr = err_addr->err_addr;
-
-   list_add_tail(&mca_err_addr->node, &err_info->err_addr_list);
 }

 void amdgpu_ras_del_mca_err_addr(struct ras_err_info *err_info, struct 
ras_err_addr *mca_err_addr)
--
2.34.1



Re: [PATCH 3/5] drm/amdgpu: Fix null pointer dereference to aca_handle

2024-05-13 Thread Ma, Jun
Hi Lijo & Kevin, thanks for review, will drop this patch

Regards,
Ma Jun

On 5/14/2024 7:13 AM, Wang, Yang(Kevin) wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> -Original Message-
> From: Ma, Jun 
> Sent: Monday, May 13, 2024 4:56 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Feng, Kenneth ; Deucher, Alexander 
> ; Wang, Yang(Kevin) ; 
> Koenig, Christian ; Ma, Jun 
> Subject: [PATCH 3/5] drm/amdgpu: Fix null pointer dereference to aca_handle
> 
> Check handle pointer before using it
> 
> Signed-off-by: Ma Jun 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> index 28febf33fb1b..e969a7d77b4d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> @@ -279,7 +279,7 @@ static struct aca_bank_error *get_bank_error(struct 
> aca_error *aerr, struct aca_  int aca_error_cache_log_bank_error(struct 
> aca_handle *handle, struct aca_bank_info *info,
>enum aca_error_type type, u64 count)  {
> -   struct aca_error_cache *error_cache = &handle->error_cache;
> +   struct aca_error_cache *error_cache;
> struct aca_bank_error *bank_error;
> struct aca_error *aerr;
> 
> @@ -289,6 +289,10 @@ int aca_error_cache_log_bank_error(struct aca_handle 
> *handle, struct aca_bank_in
> if (!count)
> return 0;
> 
> +   error_cache = &handle->error_cache;
> [Kevin]:
> The above code is always return non-0 value, right?
> 
> Best Regards,
> Kevin
> +   if (!error_cache)
> +   return -EINVAL;
> +
> aerr = &error_cache->errors[type];
> bank_error = get_bank_error(aerr, info);
> if (!bank_error)
> --
> 2.34.1
> 


[PATCH] drm/amdgpu: fix compiler 'side-effect' check issue for RAS_EVENT_LOG()

2024-05-13 Thread Yang Wang
Create a new helper function to avoid the compiler 'side-effect'
warning about the RAS_EVENT_LOG() macro.

Signed-off-by: Yang Wang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 18 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 13 ++---
 2 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 1dd13ed3b7b5..c04e6ced1af3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -4504,3 +4504,21 @@ int amdgpu_ras_reserve_page(struct amdgpu_device *adev, 
uint64_t pfn)
 
return ret;
 }
+
+void amdgpu_ras_event_log_print(struct amdgpu_device *adev, u64 event_id,
+   const char *fmt, ...)
+{
+   struct va_format vaf;
+   va_list args;
+
+   va_start(args, fmt);
+   vaf.fmt = fmt;
+   vaf.va = &args;
+
+   if (amdgpu_ras_event_id_is_valid(adev, event_id))
+   dev_printk(KERN_INFO, adev->dev, "{%llu}%pV", event_id, );
+   else
+   dev_printk(KERN_INFO, adev->dev, "%pV", );
+
+   va_end(args);
+}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index c8980d5f6540..6a8c7b1609df 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -67,13 +67,8 @@ struct amdgpu_iv_entry;
 /* The high three bits indicates socketid */
 #define AMDGPU_RAS_GET_FEATURES(val)  ((val) & 
~AMDGPU_RAS_FEATURES_SOCKETID_MASK)
 
-#define RAS_EVENT_LOG(_adev, _id, _fmt, ...)   \
-do {   \
-   if (amdgpu_ras_event_id_is_valid((_adev), (_id)))   
\
-   dev_info((_adev)->dev, "{%llu}" _fmt, (_id), ##__VA_ARGS__);
\
-   else\
-   dev_info((_adev)->dev, _fmt, ##__VA_ARGS__);
\
-} while (0)
+#define RAS_EVENT_LOG(adev, id, fmt, ...)  \
+   amdgpu_ras_event_log_print((adev), (id), (fmt), ##__VA_ARGS__);
 
 enum amdgpu_ras_block {
AMDGPU_RAS_BLOCK__UMC = 0,
@@ -956,4 +951,8 @@ int amdgpu_ras_put_poison_req(struct amdgpu_device *adev,
enum amdgpu_ras_block block, uint16_t pasid,
pasid_notify pasid_fn, void *data, uint32_t reset);
 
+__printf(3, 4)
+void amdgpu_ras_event_log_print(struct amdgpu_device *adev, u64 event_id,
+   const char *fmt, ...);
+
 #endif
-- 
2.34.1



Re: [PATCH] drm/kfd: Correct pined buffer handling at kfd restore and validate process

2024-05-13 Thread Felix Kuehling



On 2024-05-13 11:18, Xiaogang.Chen wrote:
> From: Xiaogang Chen 
> 
> This reverts 8a774fe912ff09e39c2d3a3589c729330113f388 "drm/amdgpu: avoid 
> restore
> process run into dead loop" since buffer got pined is not related whether it

Spelling: pined -> pinned

Same in the commit headline.


> needs mapping. And skip buffer validation at kfd driver if the buffer has been
> pinned.
> 
> Signed-off-by: Xiaogang Chen 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index 3314821e4cf3..80018738bd1c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -415,6 +415,10 @@ static int amdgpu_amdkfd_bo_validate(struct amdgpu_bo 
> *bo, uint32_t domain,
>"Called with userptr BO"))
>   return -EINVAL;
>  
> + /* bo has been pined, not need validate it */

pined -> pinned

With those typos fixed, the patch is

Reviewed-by: Felix Kuehling 


> + if (bo->tbo.pin_count)
> + return 0;
> +
>   amdgpu_bo_placement_from_domain(bo, domain);
>  
>   ret = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
> @@ -2736,7 +2740,7 @@ static int confirm_valid_user_pages_locked(struct 
> amdkfd_process_info *process_i
>  
>   /* keep mem without hmm range at userptr_inval_list */
>   if (!mem->range)
> -  continue;
> + continue;
>  
>   /* Only check mem with hmm range associated */
>   valid = amdgpu_ttm_tt_get_user_pages_done(
> @@ -2981,9 +2985,6 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, 
> struct dma_fence __rcu *
>   if (!attachment->is_mapped)
>   continue;
>  
> - if (attachment->bo_va->base.bo->tbo.pin_count)
> - continue;
> -
>   kfd_mem_dmaunmap_attachment(mem, attachment);
>   ret = update_gpuvm_pte(mem, attachment, &sync_obj);
>   if (ret) {


RE: [PATCH 3/5] drm/amdgpu: Fix null pointer dereference to aca_handle

2024-05-13 Thread Wang, Yang(Kevin)
[AMD Official Use Only - AMD Internal Distribution Only]

-Original Message-
From: Ma, Jun 
Sent: Monday, May 13, 2024 4:56 PM
To: amd-gfx@lists.freedesktop.org
Cc: Feng, Kenneth ; Deucher, Alexander 
; Wang, Yang(Kevin) ; 
Koenig, Christian ; Ma, Jun 
Subject: [PATCH 3/5] drm/amdgpu: Fix null pointer dereference to aca_handle

Check handle pointer before using it

Signed-off-by: Ma Jun 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
index 28febf33fb1b..e969a7d77b4d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
@@ -279,7 +279,7 @@ static struct aca_bank_error *get_bank_error(struct 
aca_error *aerr, struct aca_  int aca_error_cache_log_bank_error(struct 
aca_handle *handle, struct aca_bank_info *info,
   enum aca_error_type type, u64 count)  {
-   struct aca_error_cache *error_cache = &handle->error_cache;
+   struct aca_error_cache *error_cache;
struct aca_bank_error *bank_error;
struct aca_error *aerr;

@@ -289,6 +289,10 @@ int aca_error_cache_log_bank_error(struct aca_handle 
*handle, struct aca_bank_in
if (!count)
return 0;

+   error_cache = &handle->error_cache;
[Kevin]:
The above code is always return non-0 value, right?

Best Regards,
Kevin
+   if (!error_cache)
+   return -EINVAL;
+
aerr = &error_cache->errors[type];
bank_error = get_bank_error(aerr, info);
if (!bank_error)
--
2.34.1



RE: [PATCH v3 10/10] Documentation/amdgpu: Add PM policy documentation

2024-05-13 Thread Deucher, Alexander
[Public]

> -Original Message-
> From: Lazar, Lijo 
> Sent: Monday, May 13, 2024 5:21 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Deucher, Alexander
> ; Kamal, Asad ; Ma,
> Le 
> Subject: [PATCH v3 10/10] Documentation/amdgpu: Add PM policy
> documentation
>
> Add documentation about the newly added pm_policy node in sysfs.
>
> Signed-off-by: Lijo Lazar 
> ---
>  Documentation/gpu/amdgpu/thermal.rst |  6 
>  drivers/gpu/drm/amd/pm/amdgpu_pm.c   | 48
> 
>  2 files changed, 54 insertions(+)
>
> diff --git a/Documentation/gpu/amdgpu/thermal.rst
> b/Documentation/gpu/amdgpu/thermal.rst
> index 2f6166f81e6a..6d942b5c58f0 100644
> --- a/Documentation/gpu/amdgpu/thermal.rst
> +++ b/Documentation/gpu/amdgpu/thermal.rst
> @@ -49,6 +49,12 @@ pp_power_profile_mode  .. kernel-doc::
> drivers/gpu/drm/amd/pm/amdgpu_pm.c
> :doc: pp_power_profile_mode
>
> +pm_policy
> +-
> +
> +.. kernel-doc:: drivers/gpu/drm/amd/pm/amdgpu_pm.c
> +   :doc: pm_policy
> +
>  \*_busy_percent
>  ---
>
> diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> index 9cca4716ec43..45766d49f1f2 100644
> --- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> +++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> @@ -2214,6 +2214,54 @@ static int pp_dpm_clk_default_attr_update(struct
> amdgpu_device *adev, struct amd
>   return 0;
>  }
>
> +/**
> + * DOC: pm_policy
> + *
> + * Certain SOCs can support different power policies to optimize
> +application
> + * performance. However, this policy is provided only at SOC level and
> +not at a
> + * per-process level. This is useful especially when entire SOC is
> +utilized for
> + * dedicated workload.
> + *
> + * The amdgpu driver provides a sysfs API for selecting the policy.
> +Presently,
> + * only two types of policies are supported through this interface.
> + *
> + *  Pstate Policy Selection - This is to select different Pstate
> +profiles which
> + *  decides clock/throttling preferences.
> + *
> + *  XGMI PLPD Policy Selection - When multiple devices are connected
> +over XGMI,
> + *  this helps to select policy to be applied for per link power down.
> + *
> + * The list of available policies and policy levels vary between SOCs.
> +They can
> + * be viewed by reading the file. The policy level which is applied
> +presently is
> + * denoted by * (asterisk). E.g.,
> + *
> + * .. code-block:: console
> + *
> + *   cat /sys/bus/pci/devices/.../pm_policy
> + *   soc pstate
> + *   0 : soc_pstate_default
> + *   1 : soc_pstate_0
> + *   2 : soc_pstate_1*
> + *   3 : soc_pstate_2
> + *   xgmi plpd
> + *   0 : plpd_disallow
> + *   1 : plpd_default
> + *   2 : plpd_optimized*
> + *
> + * To apply a specific policy
> + *
> + * "echo   > /sys/bus/pci/devices/.../pm_policy"
> + *
> + * For the levels listed in the example above, to select
> +"plpd_optimized" for
> + * XGMI and "soc_pstate_2" for soc pstate policy -
> + *
> + * .. code-block:: console
> + *
> + *   echo "xgmi 2" > /sys/bus/pci/devices/.../pm_policy
> + *   echo "soc_pstate 3" > /sys/bus/pci/devices/.../pm_policy

The naming should be consistent between what is printed when you read the file 
and what you write to it.  E.g., policy_type should be soc_pstate and xgmi_plpd 
in both cases.

Alex

> + *
> + */
> +
>  static ssize_t amdgpu_get_pm_policy(struct device *dev,
>   struct device_attribute *attr, char *buf)  {
> --
> 2.25.1
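Given the pm_policy format documented above (the active level of each policy type is marked with a trailing asterisk), a small script can pull out the currently applied level. This is a sketch that parses the exact sample layout from the documentation; the sysfs path and field positions are assumptions to adjust for real devices.

```shell
#!/bin/sh
# Extract the active level (trailing '*') for a given policy section
# from pm_policy-style output read on stdin.
active_level() {  # usage: active_level "<section header>" < input
	awk -v want="$1" '
		/^[a-z]/ && $0 !~ /:/ { section = $0; next }
		section == want && /\*$/ { sub(/\*$/, ""); print $3 }
	'
}

sample='soc pstate
0 : soc_pstate_default
1 : soc_pstate_0
2 : soc_pstate_1*
3 : soc_pstate_2
xgmi plpd
0 : plpd_disallow
1 : plpd_default
2 : plpd_optimized*'

printf '%s\n' "$sample" | active_level "soc pstate"   # -> soc_pstate_1
printf '%s\n' "$sample" | active_level "xgmi plpd"    # -> plpd_optimized
```

On a live system the input would come from `cat /sys/bus/pci/devices/.../pm_policy` instead of the embedded sample.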



[PATCH 2/2] drm/amdgpu: apply state adjust rules to some additional HAINAN variants

2024-05-13 Thread Alex Deucher
They need a similar workaround.

Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1839
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c 
b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
index f245fc0bc6d36..c836bb03547f6 100644
--- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
+++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
@@ -3435,9 +3435,11 @@ static void si_apply_state_adjust_rules(struct 
amdgpu_device *adev,
if (adev->asic_type == CHIP_HAINAN) {
if ((adev->pdev->revision == 0x81) ||
(adev->pdev->revision == 0xC3) ||
+   (adev->pdev->device == 0x6660) ||
(adev->pdev->device == 0x6664) ||
(adev->pdev->device == 0x6665) ||
-   (adev->pdev->device == 0x6667)) {
+   (adev->pdev->device == 0x6667) ||
+   (adev->pdev->device == 0x666F)) {
max_sclk = 75000;
}
if ((adev->pdev->revision == 0xC3) ||
-- 
2.45.0



[PATCH 1/2] drm/radeon: apply state adjust rules to some additional HAINAN variants

2024-05-13 Thread Alex Deucher
They need a similar workaround.

Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1839
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/radeon/si_dpm.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/radeon/si_dpm.c b/drivers/gpu/drm/radeon/si_dpm.c
index 9deb91970d4df..5db16c20bd173 100644
--- a/drivers/gpu/drm/radeon/si_dpm.c
+++ b/drivers/gpu/drm/radeon/si_dpm.c
@@ -2915,9 +2915,11 @@ static void si_apply_state_adjust_rules(struct 
radeon_device *rdev,
if (rdev->family == CHIP_HAINAN) {
if ((rdev->pdev->revision == 0x81) ||
(rdev->pdev->revision == 0xC3) ||
+   (rdev->pdev->device == 0x6660) ||
(rdev->pdev->device == 0x6664) ||
(rdev->pdev->device == 0x6665) ||
-   (rdev->pdev->device == 0x6667)) {
+   (rdev->pdev->device == 0x6667) ||
+   (rdev->pdev->device == 0x666F)) {
max_sclk = 75000;
}
if ((rdev->pdev->revision == 0xC3) ||
-- 
2.45.0



Re: [PATCH] drm/amdgpu/pm: update documentation on memory clock

2024-05-13 Thread Alex Deucher
Ping?

On Thu, May 2, 2024 at 5:07 PM Alex Deucher  wrote:
>
> Update documentation for RDNA3 dGPUs.
>
> Signed-off-by: Alex Deucher 
> ---
>  drivers/gpu/drm/amd/pm/amdgpu_pm.c | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c 
> b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> index ec9058c80647a..9ad114e695e5d 100644
> --- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> +++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> @@ -610,12 +610,18 @@ static ssize_t amdgpu_set_pp_table(struct device *dev,
>   *
>   * Clock conversion (Mhz):
>   *
> + * Pre-RDNA3 GPUs:
> + *
>   * HBM: effective_memory_clock = memory_controller_clock * 1
>   *
>   * G5: effective_memory_clock = memory_controller_clock * 1
>   *
>   * G6: effective_memory_clock = memory_controller_clock * 2
>   *
> + * RDNA3 GPUs:
> + *
> + * G6: effective_memory_clock = memory_controller_clock * 1
> + *
>   * DRAM data rate (MT/s):
>   *
>   * HBM: effective_memory_clock * 2 = data_rate
> --
> 2.44.0
>

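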

[PATCH 3/3] drm/amdgpu/gfx11: enable gfx pipe1 hardware support

2024-05-13 Thread Alex Deucher
Enable gfx pipe1 hardware support.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index d750ab86e4b27..96f441917b719 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -51,7 +51,7 @@
 #include "nbio_v4_3.h"
 #include "mes_v11_0.h"
 
-#define GFX11_NUM_GFX_RINGS1
+#define GFX11_NUM_GFX_RINGS2
 #define GFX11_MEC_HPD_SIZE 2048
 
 #define RLCG_UCODE_LOADING_START_ADDRESS   0x2000L
@@ -1342,7 +1342,7 @@ static int gfx_v11_0_sw_init(void *handle)
case IP_VERSION(11, 0, 2):
case IP_VERSION(11, 0, 3):
adev->gfx.me.num_me = 1;
-   adev->gfx.me.num_pipe_per_me = 1;
+   adev->gfx.me.num_pipe_per_me = 2;
adev->gfx.me.num_queue_per_pipe = 1;
adev->gfx.mec.num_mec = 2;
adev->gfx.mec.num_pipe_per_mec = 4;
@@ -1353,7 +1353,7 @@ static int gfx_v11_0_sw_init(void *handle)
case IP_VERSION(11, 5, 0):
case IP_VERSION(11, 5, 1):
adev->gfx.me.num_me = 1;
-   adev->gfx.me.num_pipe_per_me = 1;
+   adev->gfx.me.num_pipe_per_me = 2;
adev->gfx.me.num_queue_per_pipe = 1;
adev->gfx.mec.num_mec = 1;
adev->gfx.mec.num_pipe_per_mec = 4;
-- 
2.45.0



[PATCH 1/3] drm/amdgpu/gfx11: select HDP ref/mask according to gfx ring pipe

2024-05-13 Thread Alex Deucher
Use the correct ref/mask for different gfx ring pipes. Ported from
ZhenGuo's patch for gfx10.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index c87943f6c4436..c8c055ef2f3c2 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -5294,7 +5294,7 @@ static void gfx_v11_0_ring_emit_hdp_flush(struct 
amdgpu_ring *ring)
}
reg_mem_engine = 0;
} else {
-   ref_and_mask = nbio_hf_reg->ref_and_mask_cp0;
+   ref_and_mask = nbio_hf_reg->ref_and_mask_cp0 << ring->pipe;
reg_mem_engine = 1; /* pfp */
}
 
-- 
2.45.0



[PATCH 2/3] drm/amdgpu/gfx11: handle priority setup for gfx pipe1

2024-05-13 Thread Alex Deucher
Set up pipe1 as a high priority queue.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 36 ++
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index c8c055ef2f3c2..d750ab86e4b27 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -930,9 +930,9 @@ static int gfx_v11_0_gpu_early_init(struct amdgpu_device 
*adev)
 static int gfx_v11_0_gfx_ring_init(struct amdgpu_device *adev, int ring_id,
   int me, int pipe, int queue)
 {
-   int r;
struct amdgpu_ring *ring;
unsigned int irq_type;
+   unsigned int hw_prio;
 
ring = &adev->gfx.gfx_ring[ring_id];
 
@@ -951,11 +951,10 @@ static int gfx_v11_0_gfx_ring_init(struct amdgpu_device 
*adev, int ring_id,
sprintf(ring->name, "gfx_%d.%d.%d", ring->me, ring->pipe, ring->queue);
 
irq_type = AMDGPU_CP_IRQ_GFX_ME0_PIPE0_EOP + ring->pipe;
-   r = amdgpu_ring_init(adev, ring, 1024, &adev->gfx.eop_irq, irq_type,
-AMDGPU_RING_PRIO_DEFAULT, NULL);
-   if (r)
-   return r;
-   return 0;
+   hw_prio = amdgpu_gfx_is_high_priority_graphics_queue(adev, ring) ?
+   AMDGPU_GFX_PIPE_PRIO_HIGH : AMDGPU_GFX_PIPE_PRIO_NORMAL;
+   return amdgpu_ring_init(adev, ring, 1024, &adev->gfx.eop_irq, irq_type,
+   hw_prio, NULL);
 }
 
 static int gfx_v11_0_compute_ring_init(struct amdgpu_device *adev, int ring_id,
@@ -3616,6 +3615,24 @@ static void gfx_v11_0_cp_set_doorbell_range(struct 
amdgpu_device *adev)
 (adev->doorbell_index.userqueue_end * 2) << 2);
 }
 
+static void gfx_v11_0_gfx_mqd_set_priority(struct amdgpu_device *adev,
+  struct v11_gfx_mqd *mqd,
+  struct amdgpu_mqd_prop *prop)
+{
+   bool priority = 0;
+   u32 tmp;
+
+   /* set up default queue priority level
+* 0x0 = low priority, 0x1 = high priority
+*/
+   if (prop->hqd_pipe_priority == AMDGPU_GFX_PIPE_PRIO_HIGH)
+   priority = 1;
+
+   tmp = RREG32_SOC15(GC, 0, regCP_GFX_HQD_QUEUE_PRIORITY);
+   tmp = REG_SET_FIELD(tmp, CP_GFX_HQD_QUEUE_PRIORITY, PRIORITY_LEVEL, 
priority);
+   mqd->cp_gfx_hqd_queue_priority = tmp;
+}
+
 static int gfx_v11_0_gfx_mqd_init(struct amdgpu_device *adev, void *m,
  struct amdgpu_mqd_prop *prop)
 {
@@ -3644,11 +3661,8 @@ static int gfx_v11_0_gfx_mqd_init(struct amdgpu_device 
*adev, void *m,
tmp = REG_SET_FIELD(tmp, CP_GFX_HQD_VMID, VMID, 0);
mqd->cp_gfx_hqd_vmid = 0;
 
-   /* set up default queue priority level
-* 0x0 = low priority, 0x1 = high priority */
-   tmp = RREG32_SOC15(GC, 0, regCP_GFX_HQD_QUEUE_PRIORITY);
-   tmp = REG_SET_FIELD(tmp, CP_GFX_HQD_QUEUE_PRIORITY, PRIORITY_LEVEL, 0);
-   mqd->cp_gfx_hqd_queue_priority = tmp;
+   /* set up gfx queue priority */
+   gfx_v11_0_gfx_mqd_set_priority(adev, mqd, prop);
 
/* set up time quantum */
tmp = RREG32_SOC15(GC, 0, regCP_GFX_HQD_QUANTUM);
-- 
2.45.0



Re: [PATCH] drm/amdgpu/soc24: use common nbio callback to set remap offset

2024-05-13 Thread Alex Deucher
Ping.

On Wed, May 8, 2024 at 3:42 PM Alex Deucher  wrote:
>
> This fixes HDP flushes on systems with non-4K pages.
>
> Signed-off-by: Alex Deucher 
> ---
>  drivers/gpu/drm/amd/amdgpu/soc24.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/soc24.c 
> b/drivers/gpu/drm/amd/amdgpu/soc24.c
> index 12900488dd618..66c7138fc6aa4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/soc24.c
> +++ b/drivers/gpu/drm/amd/amdgpu/soc24.c
> @@ -372,11 +372,9 @@ static const struct amdgpu_asic_funcs soc24_asic_funcs = 
> {
>
>  static int soc24_common_early_init(void *handle)
>  {
> -#define MMIO_REG_HOLE_OFFSET (0x80000 - PAGE_SIZE)
> struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>
> -   adev->rmmio_remap.reg_offset = MMIO_REG_HOLE_OFFSET;
> -   adev->rmmio_remap.bus_addr = adev->rmmio_base + MMIO_REG_HOLE_OFFSET;
> +   adev->nbio.funcs->set_reg_remap(adev);
> adev->smc_rreg = NULL;
> adev->smc_wreg = NULL;
> adev->pcie_rreg = _device_indirect_rreg;
> --
> 2.45.0
>


Re: [PATCH v2] drm/amd/display: fix documentation warnings for mpc.h

2024-05-13 Thread Alex Deucher
On Sat, May 11, 2024 at 12:22 AM Stephen Rothwell  wrote:
>
> Hi Marcelo,
>
> On Sat, 11 May 2024 13:37:17 +1000 Stephen Rothwell  
> wrote:
> >
> > Thanks for doing this.
> >
> > I haven't tested it, but just a couple of little things:
> >
> > On Fri, 10 May 2024 21:02:02 -0300 Marcelo Mendes Spessoto Junior 
> >  wrote:
> > >
> > > Fix most of the display documentation compile warnings by
> > > documenting struct mpc_funcs functions in dc/inc/hw/mpc.h file.
> > >
> >   .
> >   .
> >   .
> > >
> > > Fixes:
> > > b8c1c3a82e75 ("Documentation/gpu: Add kernel doc entry for MPC")
> >
> > Please don't split the Fixes tag line and also don't add blank lines
> > between tags.
> >
> > I also appreciate being credited for reporting (unless you got a report
> > from somewhere else as well, then maybe credit both).
> >
> > Reported-by: Stephen Rothwell 
>
> Now tested.
>
> Tested-by: Stephen Rothwell 
> Closes: 
> https://lore.kernel.org/linux-next/20240130134954.04fcf...@canb.auug.org.au/

Applied and added/fixed up tags.

Thanks!

Alex

>
> --
> Cheers,
> Stephen Rothwell


Re: [PATCH v2] drm/amdgpu: Add Ring Hang Events

2024-05-13 Thread Christian König

Am 09.05.24 um 22:41 schrieb Ori Messinger:

This patch adds 'ring hang' events to the driver.
This is done by adding a 'reset_ring_hang' bool variable to the
struct 'amdgpu_reset_context' in the amdgpu_reset.h file.
The purpose of this 'reset_ring_hang' variable is that whenever a GPU
reset is initiated due to a ring hang, reset_ring_hang should
be set to 'true'.

This 'amdgpu_reset_context' struct is now also passed
through across all relevant functions, and another event type
"KFD_SMI_EVENT_RING_HANG" is added to the kfd_smi_event enum.


To repeat myself on the newest version of the patch, this whole approach 
is a clear NAK.


Driver hangs, including their cause, are exposed through udev.

And in general exposing driver telemetry through the SMI interface in 
KFD is a pretty clear no-go.


Regards,
Christian.



Signed-off-by: Ori Messinger 
Change-Id: I6af3022eb1b4514201c9430d635ff87f167ad6f7
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c  |  6 --
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h  |  9 ++---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  3 +++
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h   |  2 ++
  drivers/gpu/drm/amd/amdkfd/kfd_device.c |  7 ---
  drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c |  7 ++-
  drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h |  5 -
  include/uapi/linux/kfd_ioctl.h  | 15 ---
  9 files changed, 38 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index e3738d417245..ac0ee4322555 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -133,6 +133,8 @@ static void amdgpu_amdkfd_reset_work(struct work_struct 
*work)
  
  	reset_context.method = AMD_RESET_METHOD_NONE;

reset_context.reset_req_dev = adev;
+   reset_context.reset_ring_hang = true;
+   strscpy(reset_context.reset_cause, "scheduler_hang", 
sizeof(reset_context.reset_cause));
	clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
  
	amdgpu_device_gpu_recover(adev, NULL, &reset_context);

@@ -261,12 +263,12 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool 
run_pm)
return r;
  }
  
-int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)

+int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev, struct 
amdgpu_reset_context *reset_context)
  {
int r = 0;
  
  	if (adev->kfd.dev)

-   r = kgd2kfd_pre_reset(adev->kfd.dev);
+   r = kgd2kfd_pre_reset(adev->kfd.dev, reset_context);
  
  	return r;

  }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 1de021ebdd46..c9030d8b8308 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -47,6 +47,7 @@ enum TLB_FLUSH_TYPE {
  };
  
  struct amdgpu_device;

+struct amdgpu_reset_context;
  
  enum kfd_mem_attachment_type {

KFD_MEM_ATT_SHARED, /* Share kgd_mem->bo or another attachment's */
@@ -170,7 +171,8 @@ bool amdgpu_amdkfd_have_atomics_support(struct 
amdgpu_device *adev);
  
  bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev, u32 vmid);
  
-int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev);

+int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev,
+   struct amdgpu_reset_context *reset_context);
  
  int amdgpu_amdkfd_post_reset(struct amdgpu_device *adev);
  
@@ -416,7 +418,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,

  void kgd2kfd_device_exit(struct kfd_dev *kfd);
  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
  int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
-int kgd2kfd_pre_reset(struct kfd_dev *kfd);
+int kgd2kfd_pre_reset(struct kfd_dev *kfd,
+ struct amdgpu_reset_context *reset_context);
  int kgd2kfd_post_reset(struct kfd_dev *kfd);
  void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
  void kgd2kfd_set_sram_ecc_flag(struct kfd_dev *kfd);
@@ -459,7 +462,7 @@ static inline int kgd2kfd_resume(struct kfd_dev *kfd, bool 
run_pm)
return 0;
  }
  
-static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)

+static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd, struct 
amdgpu_reset_context *reset_context)
  {
return 0;
  }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 77f6fd50002a..f9fa784f36f4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5772,7 +5772,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
  
		cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
  
-		amdgpu_amdkfd_pre_reset(tmp_adev);

+   amdgpu_amdkfd_pre_reset(tmp_adev, reset_context);
  
  		/*

 * Mark these ASICs to be reseted as untracked first
diff --git 

Re: [PATCH] drm/amdgpu/vcn: remove irq disabling in vcn 5 suspend

2024-05-13 Thread David Wu



On 2024-05-13 13:43, Christian König wrote:

Am 13.05.24 um 19:41 schrieb David Wu:


On 2024-05-13 13:11, Christian König wrote:



Am 09.05.24 um 20:40 schrieb David (Ming Qiang) Wu:

We do not directly enable/disable VCN IRQ in vcn 5.0.0.
And we do not handle the IRQ state either. So the calls to
disable IRQ and set state are removed. This effectively gets
rid of the warning of
   "WARN_ON(!amdgpu_irq_enabled(adev, src, type))"
in amdgpu_irq_put().

Signed-off-by: David (Ming Qiang) Wu 
---
  drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c | 19 ---
  1 file changed, 19 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c 
b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c

index 851975b5ce29..9b87d6a49b39 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
@@ -229,8 +229,6 @@ static int vcn_v5_0_0_hw_fini(void *handle)
  for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
  if (adev->vcn.harvest_config & (1 << i))
  continue;
-
-    amdgpu_irq_put(adev, &adev->vcn.inst[i].irq, 0);
  }


Looks like you can now remove the whole for loop.
I realized that but there is a new patch added inside this loop to 
cover the suspend/resume issue.


Is that added in a later patch, or did you rebase your patch on top of it?

If it's added in a later patch then it's better to remove and re-add 
the lines. Otherwise you can get a mail from automated scripts that 
you have dead code.


Bit annoying but the documented way of doing things in the kernel.
understood - it is added in the next patch, right after this one. So far 
I have not received an email about the dead code; if it raises an issue I 
will work on it.


Regards,
Christian.



Apart from that looks good to me,
Christian.


    return 0;
@@ -1226,22 +1224,6 @@ static int 
vcn_v5_0_0_set_powergating_state(void *handle, enum amd_powergating_s

  return ret;
  }
  -/**
- * vcn_v5_0_0_set_interrupt_state - set VCN block interrupt state
- *
- * @adev: amdgpu_device pointer
- * @source: interrupt sources
- * @type: interrupt types
- * @state: interrupt states
- *
- * Set VCN block interrupt state
- */
-static int vcn_v5_0_0_set_interrupt_state(struct amdgpu_device 
*adev, struct amdgpu_irq_src *source,

-    unsigned type, enum amdgpu_interrupt_state state)
-{
-    return 0;
-}
-
  /**
   * vcn_v5_0_0_process_interrupt - process VCN block interrupt
   *
@@ -1287,7 +1269,6 @@ static int 
vcn_v5_0_0_process_interrupt(struct amdgpu_device *adev, struct amdgp

  }
    static const struct amdgpu_irq_src_funcs vcn_v5_0_0_irq_funcs = {
-    .set = vcn_v5_0_0_set_interrupt_state,
  .process = vcn_v5_0_0_process_interrupt,
  };






Re: [PATCH] drm/amdgpu/vcn: remove irq disabling in vcn 5 suspend

2024-05-13 Thread Christian König

Am 13.05.24 um 19:41 schrieb David Wu:


On 2024-05-13 13:11, Christian König wrote:



Am 09.05.24 um 20:40 schrieb David (Ming Qiang) Wu:

We do not directly enable/disable VCN IRQ in vcn 5.0.0.
And we do not handle the IRQ state either. So the calls to
disable IRQ and set state are removed. This effectively gets
rid of the warning of
   "WARN_ON(!amdgpu_irq_enabled(adev, src, type))"
in amdgpu_irq_put().

Signed-off-by: David (Ming Qiang) Wu 
---
  drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c | 19 ---
  1 file changed, 19 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c 
b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c

index 851975b5ce29..9b87d6a49b39 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
@@ -229,8 +229,6 @@ static int vcn_v5_0_0_hw_fini(void *handle)
  for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
  if (adev->vcn.harvest_config & (1 << i))
  continue;
-
-    amdgpu_irq_put(adev, &adev->vcn.inst[i].irq, 0);
  }


Looks like you can now remove the whole for loop.
I realized that but there is a new patch added inside this loop to 
cover the suspend/resume issue.


Is that added in a later patch, or did you rebase your patch on top of it?

If it's added in a later patch then it's better to remove and re-add the 
lines. Otherwise you can get a mail from automated scripts that you have 
dead code.


Bit annoying but the documented way of doing things in the kernel.

Regards,
Christian.



Apart from that looks good to me,
Christian.


    return 0;
@@ -1226,22 +1224,6 @@ static int 
vcn_v5_0_0_set_powergating_state(void *handle, enum amd_powergating_s

  return ret;
  }
  -/**
- * vcn_v5_0_0_set_interrupt_state - set VCN block interrupt state
- *
- * @adev: amdgpu_device pointer
- * @source: interrupt sources
- * @type: interrupt types
- * @state: interrupt states
- *
- * Set VCN block interrupt state
- */
-static int vcn_v5_0_0_set_interrupt_state(struct amdgpu_device 
*adev, struct amdgpu_irq_src *source,

-    unsigned type, enum amdgpu_interrupt_state state)
-{
-    return 0;
-}
-
  /**
   * vcn_v5_0_0_process_interrupt - process VCN block interrupt
   *
@@ -1287,7 +1269,6 @@ static int vcn_v5_0_0_process_interrupt(struct 
amdgpu_device *adev, struct amdgp

  }
    static const struct amdgpu_irq_src_funcs vcn_v5_0_0_irq_funcs = {
-    .set = vcn_v5_0_0_set_interrupt_state,
  .process = vcn_v5_0_0_process_interrupt,
  };






Re: [PATCH] drm/amdgpu/vcn: remove irq disabling in vcn 5 suspend

2024-05-13 Thread David Wu



On 2024-05-13 13:11, Christian König wrote:



Am 09.05.24 um 20:40 schrieb David (Ming Qiang) Wu:

We do not directly enable/disable VCN IRQ in vcn 5.0.0.
And we do not handle the IRQ state either. So the calls to
disable IRQ and set state are removed. This effectively gets
rid of the warning of
   "WARN_ON(!amdgpu_irq_enabled(adev, src, type))"
in amdgpu_irq_put().

Signed-off-by: David (Ming Qiang) Wu 
---
  drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c | 19 ---
  1 file changed, 19 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c 
b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c

index 851975b5ce29..9b87d6a49b39 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
@@ -229,8 +229,6 @@ static int vcn_v5_0_0_hw_fini(void *handle)
  for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
  if (adev->vcn.harvest_config & (1 << i))
  continue;
-
-    amdgpu_irq_put(adev, &adev->vcn.inst[i].irq, 0);
  }


Looks like you can now remove the whole for loop.
I realized that but there is a new patch added inside this loop to cover 
the suspend/resume issue.


Apart from that looks good to me,
Christian.


    return 0;
@@ -1226,22 +1224,6 @@ static int 
vcn_v5_0_0_set_powergating_state(void *handle, enum amd_powergating_s

  return ret;
  }
  -/**
- * vcn_v5_0_0_set_interrupt_state - set VCN block interrupt state
- *
- * @adev: amdgpu_device pointer
- * @source: interrupt sources
- * @type: interrupt types
- * @state: interrupt states
- *
- * Set VCN block interrupt state
- */
-static int vcn_v5_0_0_set_interrupt_state(struct amdgpu_device 
*adev, struct amdgpu_irq_src *source,

-    unsigned type, enum amdgpu_interrupt_state state)
-{
-    return 0;
-}
-
  /**
   * vcn_v5_0_0_process_interrupt - process VCN block interrupt
   *
@@ -1287,7 +1269,6 @@ static int vcn_v5_0_0_process_interrupt(struct 
amdgpu_device *adev, struct amdgp

  }
    static const struct amdgpu_irq_src_funcs vcn_v5_0_0_irq_funcs = {
-    .set = vcn_v5_0_0_set_interrupt_state,
  .process = vcn_v5_0_0_process_interrupt,
  };




Re: [PATCH] drm/amdgpu/vcn: remove irq disabling in vcn 5 suspend

2024-05-13 Thread Christian König




Am 09.05.24 um 20:40 schrieb David (Ming Qiang) Wu:

We do not directly enable/disable VCN IRQ in vcn 5.0.0.
And we do not handle the IRQ state either. So the calls to
disable IRQ and set state are removed. This effectively gets
rid of the warning of
   "WARN_ON(!amdgpu_irq_enabled(adev, src, type))"
in amdgpu_irq_put().

Signed-off-by: David (Ming Qiang) Wu 
---
  drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c | 19 ---
  1 file changed, 19 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c 
b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
index 851975b5ce29..9b87d6a49b39 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
@@ -229,8 +229,6 @@ static int vcn_v5_0_0_hw_fini(void *handle)
for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
if (adev->vcn.harvest_config & (1 << i))
continue;
-
-   amdgpu_irq_put(adev, &adev->vcn.inst[i].irq, 0);
}


Looks like you can now remove the whole for loop.

Apart from that looks good to me,
Christian.

  
  	return 0;

@@ -1226,22 +1224,6 @@ static int vcn_v5_0_0_set_powergating_state(void 
*handle, enum amd_powergating_s
return ret;
  }
  
-/**

- * vcn_v5_0_0_set_interrupt_state - set VCN block interrupt state
- *
- * @adev: amdgpu_device pointer
- * @source: interrupt sources
- * @type: interrupt types
- * @state: interrupt states
- *
- * Set VCN block interrupt state
- */
-static int vcn_v5_0_0_set_interrupt_state(struct amdgpu_device *adev, struct 
amdgpu_irq_src *source,
-   unsigned type, enum amdgpu_interrupt_state state)
-{
-   return 0;
-}
-
  /**
   * vcn_v5_0_0_process_interrupt - process VCN block interrupt
   *
@@ -1287,7 +1269,6 @@ static int vcn_v5_0_0_process_interrupt(struct 
amdgpu_device *adev, struct amdgp
  }
  
  static const struct amdgpu_irq_src_funcs vcn_v5_0_0_irq_funcs = {

-   .set = vcn_v5_0_0_set_interrupt_state,
.process = vcn_v5_0_0_process_interrupt,
  };
  




Re: [RESEND 4/6] drm/amdgpu: remove amdgpu_connector_edid() and stop using edid_blob_ptr

2024-05-13 Thread Robert Foss
On Fri, May 10, 2024 at 5:09 PM Jani Nikula  wrote:
>
> amdgpu_connector_edid() copies the EDID from edid_blob_ptr as a side
> effect if amdgpu_connector->edid isn't initialized. However, everywhere
> that the returned EDID is used, the EDID should have been set
> beforehand.
>
> Only the drm EDID code should look at the EDID property, anyway, so stop
> using it.
>
> Cc: Alex Deucher 
> Cc: Christian König 
> Cc: Pan, Xinhui 
> Cc: amd-gfx@lists.freedesktop.org
> Signed-off-by: Jani Nikula 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c | 16 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.h |  1 -
>  drivers/gpu/drm/amd/amdgpu/dce_v10_0.c |  4 ++--
>  drivers/gpu/drm/amd/amdgpu/dce_v11_0.c |  4 ++--
>  drivers/gpu/drm/amd/amdgpu/dce_v6_0.c  |  4 ++--
>  drivers/gpu/drm/amd/amdgpu/dce_v8_0.c  |  4 ++--
>  6 files changed, 8 insertions(+), 25 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c
> index 9caba10315a8..cae7479c3ecf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c
> @@ -246,22 +246,6 @@ amdgpu_connector_find_encoder(struct drm_connector 
> *connector,
> return NULL;
>  }
>
> -struct edid *amdgpu_connector_edid(struct drm_connector *connector)
> -{
> -   struct amdgpu_connector *amdgpu_connector = 
> to_amdgpu_connector(connector);
> -   struct drm_property_blob *edid_blob = connector->edid_blob_ptr;
> -
> -   if (amdgpu_connector->edid) {
> -   return amdgpu_connector->edid;
> -   } else if (edid_blob) {
> -   struct edid *edid = kmemdup(edid_blob->data, 
> edid_blob->length, GFP_KERNEL);
> -
> -   if (edid)
> -   amdgpu_connector->edid = edid;
> -   }
> -   return amdgpu_connector->edid;
> -}
> -
>  static struct edid *
>  amdgpu_connector_get_hardcoded_edid(struct amdgpu_device *adev)
>  {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.h
> index 61fcef15ad72..eff833b6ed31 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.h
> @@ -24,7 +24,6 @@
>  #ifndef __AMDGPU_CONNECTORS_H__
>  #define __AMDGPU_CONNECTORS_H__
>
> -struct edid *amdgpu_connector_edid(struct drm_connector *connector);
>  void amdgpu_connector_hotplug(struct drm_connector *connector);
>  int amdgpu_connector_get_monitor_bpc(struct drm_connector *connector);
>  u16 amdgpu_connector_encoder_get_dp_bridge_encoder_id(struct drm_connector 
> *connector);
> diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v10_0.c 
> b/drivers/gpu/drm/amd/amdgpu/dce_v10_0.c
> index b44fce44c066..dddb5fe16f2c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/dce_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/dce_v10_0.c
> @@ -1299,7 +1299,7 @@ static void 
> dce_v10_0_audio_write_speaker_allocation(struct drm_encoder *encoder
> return;
> }
>
> -   sad_count = 
> drm_edid_to_speaker_allocation(amdgpu_connector_edid(connector), &sadb);
> +   sad_count = drm_edid_to_speaker_allocation(amdgpu_connector->edid, 
> &sadb);
> if (sad_count < 0) {
> DRM_ERROR("Couldn't read Speaker Allocation Data Block: 
> %d\n", sad_count);
> sad_count = 0;
> @@ -1369,7 +1369,7 @@ static void dce_v10_0_audio_write_sad_regs(struct 
> drm_encoder *encoder)
> return;
> }
>
> -   sad_count = drm_edid_to_sad(amdgpu_connector_edid(connector), &sads);
> +   sad_count = drm_edid_to_sad(amdgpu_connector->edid, &sads);
> if (sad_count < 0)
> DRM_ERROR("Couldn't read SADs: %d\n", sad_count);
> if (sad_count <= 0)
> diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c 
> b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
> index 80b2e7f79acf..11780e4d7e9f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
> @@ -1331,7 +1331,7 @@ static void 
> dce_v11_0_audio_write_speaker_allocation(struct drm_encoder *encoder
> return;
> }
>
> -   sad_count = 
> drm_edid_to_speaker_allocation(amdgpu_connector_edid(connector), &sadb);
> +   sad_count = drm_edid_to_speaker_allocation(amdgpu_connector->edid, 
> &sadb);
> if (sad_count < 0) {
> DRM_ERROR("Couldn't read Speaker Allocation Data Block: 
> %d\n", sad_count);
> sad_count = 0;
> @@ -1401,7 +1401,7 @@ static void dce_v11_0_audio_write_sad_regs(struct 
> drm_encoder *encoder)
> return;
> }
>
> -   sad_count = drm_edid_to_sad(amdgpu_connector_edid(connector), &sads);
> +   sad_count = drm_edid_to_sad(amdgpu_connector->edid, &sads);
> if (sad_count < 0)
> DRM_ERROR("Couldn't read SADs: %d\n", sad_count);
> if (sad_count <= 0)
> diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v6_0.c 
> 

Re: [RESEND 3/6] drm/radeon: remove radeon_connector_edid() and stop using edid_blob_ptr

2024-05-13 Thread Robert Foss
On Fri, May 10, 2024 at 5:08 PM Jani Nikula  wrote:
>
> radeon_connector_edid() copies the EDID from edid_blob_ptr as a side
> effect if radeon_connector->edid isn't initialized. However, everywhere
> that the returned EDID is used, the EDID should have been set
> beforehand.
>
> Only the drm EDID code should look at the EDID property, anyway, so stop
> using it.
>
> Cc: Alex Deucher 
> Cc: Christian König 
> Cc: Pan, Xinhui 
> Cc: amd-gfx@lists.freedesktop.org
> Signed-off-by: Jani Nikula 
> ---
>  drivers/gpu/drm/radeon/radeon_audio.c  |  7 ---
>  drivers/gpu/drm/radeon/radeon_connectors.c | 15 ---
>  drivers/gpu/drm/radeon/radeon_mode.h   |  2 --
>  3 files changed, 4 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/gpu/drm/radeon/radeon_audio.c 
> b/drivers/gpu/drm/radeon/radeon_audio.c
> index 16c10db3ce6f..0bcd767b9f47 100644
> --- a/drivers/gpu/drm/radeon/radeon_audio.c
> +++ b/drivers/gpu/drm/radeon/radeon_audio.c
> @@ -303,6 +303,7 @@ void radeon_audio_endpoint_wreg(struct radeon_device 
> *rdev, u32 offset,
>  static void radeon_audio_write_sad_regs(struct drm_encoder *encoder)
>  {
> struct drm_connector *connector = 
> radeon_get_connector_for_encoder(encoder);
> +   struct radeon_connector *radeon_connector = 
> to_radeon_connector(connector);
> struct radeon_encoder *radeon_encoder = to_radeon_encoder(encoder);
> struct cea_sad *sads;
> int sad_count;
> @@ -310,7 +311,7 @@ static void radeon_audio_write_sad_regs(struct 
> drm_encoder *encoder)
> if (!connector)
> return;
>
> -   sad_count = drm_edid_to_sad(radeon_connector_edid(connector), &sads);
> +   sad_count = drm_edid_to_sad(radeon_connector->edid, &sads);
> if (sad_count < 0)
> DRM_ERROR("Couldn't read SADs: %d\n", sad_count);
> if (sad_count <= 0)
> @@ -326,6 +327,7 @@ static void radeon_audio_write_sad_regs(struct 
> drm_encoder *encoder)
>  static void radeon_audio_write_speaker_allocation(struct drm_encoder 
> *encoder)
>  {
> struct drm_connector *connector = 
> radeon_get_connector_for_encoder(encoder);
> +   struct radeon_connector *radeon_connector = 
> to_radeon_connector(connector);
> struct radeon_encoder *radeon_encoder = to_radeon_encoder(encoder);
> u8 *sadb = NULL;
> int sad_count;
> @@ -333,8 +335,7 @@ static void radeon_audio_write_speaker_allocation(struct 
> drm_encoder *encoder)
> if (!connector)
> return;
>
> -   sad_count = 
> drm_edid_to_speaker_allocation(radeon_connector_edid(connector),
> -  &sadb);
> +   sad_count = drm_edid_to_speaker_allocation(radeon_connector->edid, 
> &sadb);
> if (sad_count < 0) {
> DRM_DEBUG("Couldn't read Speaker Allocation Data Block: %d\n",
>   sad_count);
> diff --git a/drivers/gpu/drm/radeon/radeon_connectors.c 
> b/drivers/gpu/drm/radeon/radeon_connectors.c
> index 81b5c3c8f658..80879e946342 100644
> --- a/drivers/gpu/drm/radeon/radeon_connectors.c
> +++ b/drivers/gpu/drm/radeon/radeon_connectors.c
> @@ -255,21 +255,6 @@ static struct drm_encoder *radeon_find_encoder(struct 
> drm_connector *connector,
> return NULL;
>  }
>
> -struct edid *radeon_connector_edid(struct drm_connector *connector)
> -{
> -   struct radeon_connector *radeon_connector = 
> to_radeon_connector(connector);
> -   struct drm_property_blob *edid_blob = connector->edid_blob_ptr;
> -
> -   if (radeon_connector->edid) {
> -   return radeon_connector->edid;
> -   } else if (edid_blob) {
> -   struct edid *edid = kmemdup(edid_blob->data, 
> edid_blob->length, GFP_KERNEL);
> -   if (edid)
> -   radeon_connector->edid = edid;
> -   }
> -   return radeon_connector->edid;
> -}
> -
>  static void radeon_connector_get_edid(struct drm_connector *connector)
>  {
> struct drm_device *dev = connector->dev;
> diff --git a/drivers/gpu/drm/radeon/radeon_mode.h 
> b/drivers/gpu/drm/radeon/radeon_mode.h
> index 546381a5c918..e0a5af180801 100644
> --- a/drivers/gpu/drm/radeon/radeon_mode.h
> +++ b/drivers/gpu/drm/radeon/radeon_mode.h
> @@ -701,8 +701,6 @@ extern u16 
> radeon_connector_encoder_get_dp_bridge_encoder_id(struct drm_connecto
>  extern bool radeon_connector_is_dp12_capable(struct drm_connector 
> *connector);
>  extern int radeon_get_monitor_bpc(struct drm_connector *connector);
>
> -extern struct edid *radeon_connector_edid(struct drm_connector *connector);
> -
>  extern void radeon_connector_hotplug(struct drm_connector *connector);
>  extern int radeon_dp_mode_valid_helper(struct drm_connector *connector,
>struct drm_display_mode *mode);
> --
> 2.39.2
>

Reviewed-by: Robert Foss 


Re: [PATCH] drm/amdgpu: fix getting vram info for gfx12

2024-05-13 Thread Alex Deucher
On Mon, May 13, 2024 at 11:32 AM Min, Frank  wrote:
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> From: Frank Min 
>
> gfx12 queries video memory channel/type/width from the umc_info entry of the 
> atom data table list, so fix it accordingly.
>
> Signed-off-by: Frank Min 

Acked-by: Alex Deucher 

> ---
>  .../gpu/drm/amd/amdgpu/amdgpu_atomfirmware.c  | 263 ++
>  1 file changed, 148 insertions(+), 115 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_atomfirmware.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_atomfirmware.c
> index a6d64bdbbb14..6fe84151332e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_atomfirmware.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_atomfirmware.c
> @@ -289,7 +289,6 @@ static int convert_atom_mem_type_to_vram_type(struct 
> amdgpu_device *adev,
> return vram_type;
>  }
>
> -
>  int
>  amdgpu_atomfirmware_get_vram_info(struct amdgpu_device *adev,
>   int *vram_width, int *vram_type,
> @@ -300,6 +299,7 @@ amdgpu_atomfirmware_get_vram_info(struct amdgpu_device 
> *adev,
> u16 data_offset, size;
> union igp_info *igp_info;
> union vram_info *vram_info;
> +   union umc_info *umc_info;
> union vram_module *vram_module;
> u8 frev, crev;
> u8 mem_type;
> @@ -311,10 +311,16 @@ amdgpu_atomfirmware_get_vram_info(struct amdgpu_device 
> *adev,
> if (adev->flags & AMD_IS_APU)
> index = 
> get_index_into_master_table(atom_master_list_of_data_tables_v2_1,
> integratedsysteminfo);
> -   else
> -   index = 
> get_index_into_master_table(atom_master_list_of_data_tables_v2_1,
> -   vram_info);
> -
> +   else {
> +   switch (amdgpu_ip_version(adev, GC_HWIP, 0)) {
> +   case IP_VERSION(12, 0, 0):
> +   case IP_VERSION(12, 0, 1):
> +   index = 
> get_index_into_master_table(atom_master_list_of_data_tables_v2_1, umc_info);
> +   break;
> +   default:
> +   index = 
> get_index_into_master_table(atom_master_list_of_data_tables_v2_1, vram_info);
> +   }
> +   }
> if (amdgpu_atom_parse_data_header(mode_info->atom_context,
>   index, ,
>   , , _offset)) {
> @@ -368,123 +374,150 @@ amdgpu_atomfirmware_get_vram_info(struct 
> amdgpu_device *adev,
> return -EINVAL;
> }
> } else {
> -   vram_info = (union vram_info *)
> -   (mode_info->atom_context->bios + data_offset);
> -   module_id = (RREG32(adev->bios_scratch_reg_offset + 
> 4) & 0x00ff0000) >> 16;
> -   if (frev == 3) {
> -   switch (crev) {
> -   /* v30 */
> -   case 0:
> -   vram_module = (union vram_module 
> *)vram_info->v30.vram_module;
> -   mem_vendor = 
> (vram_module->v30.dram_vendor_id) & 0xF;
> -   if (vram_vendor)
> -   *vram_vendor = mem_vendor;
> -   mem_type = vram_info->v30.memory_type;
> -   if (vram_type)
> -   *vram_type = 
> convert_atom_mem_type_to_vram_type(adev, mem_type);
> -   mem_channel_number = 
> vram_info->v30.channel_num;
> -   mem_channel_width = 
> vram_info->v30.channel_width;
> -   if (vram_width)
> -   *vram_width = 
> mem_channel_number * (1 << mem_channel_width);
> -   break;
> -   default:
> -   return -EINVAL;
> -   }
> -   } else if (frev == 2) {
> -   switch (crev) {
> -   /* v23 */
> -   case 3:
> -   if (module_id > 
> vram_info->v23.vram_module_num)
> -   module_id = 0;
> -   vram_module = (union vram_module 
> *)vram_info->v23.vram_module;
> -   while (i < module_id) {
> -   vram_module = (union 
> vram_module *)
> -   ((u8 *)vram_module + 
> vram_module->v9.vram_module_size);
> -   i++;

Re: [RESEND 2/6] drm/radeon: convert to using is_hdmi and has_audio from display info

2024-05-13 Thread Robert Foss
On Fri, May 10, 2024 at 5:08 PM Jani Nikula  wrote:
>
> Prefer the parsed results for is_hdmi and has_audio in display info over
> calling drm_detect_hdmi_monitor() and drm_detect_monitor_audio(),
> respectively.
>
> Cc: Alex Deucher 
> Cc: Christian König 
> Cc: Pan, Xinhui 
> Cc: amd-gfx@lists.freedesktop.org
> Signed-off-by: Jani Nikula 
> ---
>  drivers/gpu/drm/radeon/atombios_encoders.c | 10 +-
>  drivers/gpu/drm/radeon/evergreen_hdmi.c|  5 ++---
>  drivers/gpu/drm/radeon/radeon_audio.c  |  6 +++---
>  drivers/gpu/drm/radeon/radeon_connectors.c | 12 ++--
>  drivers/gpu/drm/radeon/radeon_display.c|  2 +-
>  drivers/gpu/drm/radeon/radeon_encoders.c   |  4 ++--
>  6 files changed, 19 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/gpu/drm/radeon/atombios_encoders.c 
> b/drivers/gpu/drm/radeon/atombios_encoders.c
> index 2bff0d9e20f5..0aa395fac36f 100644
> --- a/drivers/gpu/drm/radeon/atombios_encoders.c
> +++ b/drivers/gpu/drm/radeon/atombios_encoders.c
> @@ -701,7 +701,7 @@ atombios_get_encoder_mode(struct drm_encoder *encoder)
> if (radeon_connector->use_digital &&
> (radeon_connector->audio == RADEON_AUDIO_ENABLE))
> return ATOM_ENCODER_MODE_HDMI;
> -   else if 
> (drm_detect_hdmi_monitor(radeon_connector_edid(connector)) &&
> +   else if (connector->display_info.is_hdmi &&
>  (radeon_connector->audio == 
> RADEON_AUDIO_AUTO))
> return ATOM_ENCODER_MODE_HDMI;
> else if (radeon_connector->use_digital)
> @@ -720,7 +720,7 @@ atombios_get_encoder_mode(struct drm_encoder *encoder)
> if (radeon_audio != 0) {
> if (radeon_connector->audio == RADEON_AUDIO_ENABLE)
> return ATOM_ENCODER_MODE_HDMI;
> -   else if (drm_detect_hdmi_monitor(radeon_connector_edid(connector)) &&
> +   else if (connector->display_info.is_hdmi &&
>  (radeon_connector->audio == RADEON_AUDIO_AUTO))
> return ATOM_ENCODER_MODE_HDMI;
> else
> @@ -737,14 +737,14 @@ atombios_get_encoder_mode(struct drm_encoder *encoder)
> if ((dig_connector->dp_sink_type == CONNECTOR_OBJECT_ID_DISPLAYPORT) ||
> (dig_connector->dp_sink_type == CONNECTOR_OBJECT_ID_eDP)) {
> if (radeon_audio != 0 &&
> -   drm_detect_monitor_audio(radeon_connector_edid(connector)) &&
> +   connector->display_info.has_audio &&
> ASIC_IS_DCE4(rdev) && !ASIC_IS_DCE5(rdev))
> return ATOM_ENCODER_MODE_DP_AUDIO;
> return ATOM_ENCODER_MODE_DP;
> } else if (radeon_audio != 0) {
> if (radeon_connector->audio == RADEON_AUDIO_ENABLE)
> return ATOM_ENCODER_MODE_HDMI;
> -   else if (drm_detect_hdmi_monitor(radeon_connector_edid(connector)) &&
> +   else if (connector->display_info.is_hdmi &&
>  (radeon_connector->audio == RADEON_AUDIO_AUTO))
> return ATOM_ENCODER_MODE_HDMI;
> else
> @@ -755,7 +755,7 @@ atombios_get_encoder_mode(struct drm_encoder *encoder)
> break;
> case DRM_MODE_CONNECTOR_eDP:
> if (radeon_audio != 0 &&
> -   drm_detect_monitor_audio(radeon_connector_edid(connector)) &&
> +   connector->display_info.has_audio &&
> ASIC_IS_DCE4(rdev) && !ASIC_IS_DCE5(rdev))
> return ATOM_ENCODER_MODE_DP_AUDIO;
> return ATOM_ENCODER_MODE_DP;
> diff --git a/drivers/gpu/drm/radeon/evergreen_hdmi.c b/drivers/gpu/drm/radeon/evergreen_hdmi.c
> index 681119c91d94..09dda114e218 100644
> --- a/drivers/gpu/drm/radeon/evergreen_hdmi.c
> +++ b/drivers/gpu/drm/radeon/evergreen_hdmi.c
> @@ -412,7 +412,7 @@ void evergreen_hdmi_enable(struct drm_encoder *encoder, bool enable)
> if (enable) {
> struct drm_connector *connector = radeon_get_connector_for_encoder(encoder);
>
> -   if (connector && drm_detect_monitor_audio(radeon_connector_edid(connector))) {
> +   if (connector && connector->display_info.has_audio) {
> WREG32(HDMI_INFOFRAME_CONTROL0 + dig->afmt->offset,
>HDMI_AVI_INFO_SEND | /* enable AVI info frames */
>HDMI_AVI_INFO_CONT | /* required for audio info values to be updated */
> @@ -450,8 +450,7 @@ void evergreen_dp_enable(struct drm_encoder *encoder, bool enable)
> 

Re: [PATCH v10 6/6] drm/amdgpu: Enable userq fence interrupt support

2024-05-13 Thread Christian König

Am 10.05.24 um 10:50 schrieb Arunpravin Paneer Selvam:

Add support to handle the userqueue protected fence signal hardware
interrupt.

Create an xarray which maps the doorbell index to the fence driver address.
This helps retrieve the fence driver information when a userq fence
interrupt is triggered. Firmware sends the doorbell offset value, and
this info is compared with the queue's mqd doorbell offset value.
If they are the same, we process the userq fence interrupt.

v1:(Christian)
   - use xa_load() instead of going over all entries
   - keep the xa_lock until the fence driver process completes
   - create a separate patch to remove the MES self test function
 call.
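
The mapping described above can be sketched in a small userspace model. A plain table stands in for the xarray, and all of the names below (register_fence_drv, lookup_fence_drv, MAX_DOORBELLS) are illustrative stand-ins, not the actual driver API:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DOORBELLS 64 /* illustrative bound, not a hardware limit */

struct fence_drv {
	unsigned int doorbell_index;  /* key used at registration time */
	unsigned long long last_seq;  /* last signalled fence value */
};

/* Stand-in for the xarray keyed by the doorbell index. */
static struct fence_drv *doorbell_map[MAX_DOORBELLS];

/* Mirrors the __xa_store() call in the patch: register under the key. */
static void register_fence_drv(struct fence_drv *drv)
{
	if (drv->doorbell_index < MAX_DOORBELLS)
		doorbell_map[drv->doorbell_index] = drv;
}

/*
 * Interrupt path: firmware reports a doorbell offset; a direct lookup
 * (mirrors xa_load()) replaces scanning every queue's mqd.
 */
static struct fence_drv *lookup_fence_drv(unsigned int doorbell_offset)
{
	if (doorbell_offset >= MAX_DOORBELLS)
		return NULL;
	return doorbell_map[doorbell_offset];
}
```

The point of the design is visible in the lookup: the interrupt handler does a single keyed access rather than iterating all queues and comparing doorbell offsets.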

Signed-off-by: Arunpravin Paneer Selvam 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h   |  2 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c|  2 ++
  .../gpu/drm/amd/amdgpu/amdgpu_userq_fence.c   | 15 
  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c| 23 +--
  4 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 4ca14b02668b..2d5ef2e74c71 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1043,6 +1043,8 @@ struct amdgpu_device {
struct amdgpu_mqd   mqds[AMDGPU_HW_IP_NUM];
const struct amdgpu_userq_funcs *userq_funcs[AMDGPU_HW_IP_NUM];
  
+	struct xarray			userq_xa;

+
/* df */
struct amdgpu_dfdf;
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index 2d9fa3d0d4a4..fd919105a181 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3982,6 +3982,8 @@ int amdgpu_device_init(struct amdgpu_device *adev,
	spin_lock_init(&adev->audio_endpt_idx_lock);
	spin_lock_init(&adev->mm_stats.lock);
  
+	xa_init_flags(&adev->userq_xa, XA_FLAGS_LOCK_IRQ);
+
	INIT_LIST_HEAD(&adev->shadow_list);
	mutex_init(&adev->shadow_list_lock);
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c

index 339d82d5808f..4cbc25595226 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c
@@ -70,6 +70,7 @@ int amdgpu_userq_fence_driver_alloc(struct amdgpu_device 
*adev,
struct amdgpu_usermode_queue *userq)
  {
struct amdgpu_userq_fence_driver *fence_drv;
+   unsigned long flags;
int r;
  
  	fence_drv = kzalloc(sizeof(*fence_drv), GFP_KERNEL);

@@ -97,6 +98,11 @@ int amdgpu_userq_fence_driver_alloc(struct amdgpu_device 
*adev,
fence_drv->context = dma_fence_context_alloc(1);
get_task_comm(fence_drv->timeline_name, current);
  
+	xa_lock_irqsave(&adev->userq_xa, flags);
+	__xa_store(&adev->userq_xa, userq->doorbell_index,
+		   fence_drv, GFP_KERNEL);
+	xa_unlock_irqrestore(&adev->userq_xa, flags);
+
userq->fence_drv = fence_drv;
  
  	return 0;

@@ -147,8 +153,11 @@ void amdgpu_userq_fence_driver_destroy(struct kref *ref)
struct amdgpu_userq_fence_driver *fence_drv = container_of(ref,
 struct amdgpu_userq_fence_driver,
 refcount);
+   struct amdgpu_userq_fence_driver *xa_fence_drv;
struct amdgpu_device *adev = fence_drv->adev;
struct amdgpu_userq_fence *fence, *tmp;
	struct xarray *xa = &adev->userq_xa;
+   unsigned long index;
struct dma_fence *f;
  
  	spin_lock(&fence_drv->fence_list_lock);

@@ -165,6 +174,12 @@ void amdgpu_userq_fence_driver_destroy(struct kref *ref)
}
	spin_unlock(&fence_drv->fence_list_lock);
  
+	xa_lock(xa);

+   xa_for_each(xa, index, xa_fence_drv)
+   if (xa_fence_drv == fence_drv)
+   __xa_erase(xa, index);
+   xa_unlock(xa);


That is rather inefficient. We should probably move registering a 
fence_drv for a certain doorbell into the userq code instead.
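
A rough userspace sketch of the alternative Christian suggests: if teardown remembers the doorbell index the driver registered under, the erase becomes a direct keyed removal instead of a full walk over the map. All names here are illustrative, not driver API:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DOORBELLS 64 /* illustrative */

struct fence_drv {
	unsigned int doorbell_index; /* remembered at registration time */
};

/* Stand-in for the xarray keyed by doorbell index. */
static struct fence_drv *doorbell_map[MAX_DOORBELLS];

static void register_fence_drv(struct fence_drv *drv)
{
	doorbell_map[drv->doorbell_index % MAX_DOORBELLS] = drv;
}

/*
 * O(1) teardown: erase by the stored key, rather than iterating the
 * whole map looking for a matching pointer as the posted patch does.
 */
static void unregister_fence_drv(struct fence_drv *drv)
{
	unsigned int idx = drv->doorbell_index % MAX_DOORBELLS;

	if (doorbell_map[idx] == drv)
		doorbell_map[idx] = NULL;
}
```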



+
/* Free seq64 memory */
amdgpu_seq64_free(adev, fence_drv->gpu_addr);
kfree(fence_drv);
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index a786e25432ae..0a206f484240 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -49,6 +49,7 @@
  #include "gfx_v11_0_3.h"
  #include "nbio_v4_3.h"
  #include "mes_v11_0.h"
+#include "amdgpu_userq_fence.h"
  
  #define GFX11_NUM_GFX_RINGS		1

  #define GFX11_MEC_HPD_SIZE2048
@@ -5939,25 +5940,23 @@ static int gfx_v11_0_eop_irq(struct amdgpu_device *adev,
 struct amdgpu_irq_src *source,
 struct amdgpu_iv_entry *entry)
  {
-   int i;
+   u32 doorbell_offset = entry->src_data[0];
u8 me_id, pipe_id, queue_id;
struct amdgpu_ring *ring;
-   

Re: [PATCH v10 5/6] drm/amdgpu: Remove the MES self test

2024-05-13 Thread Christian König

Am 10.05.24 um 10:50 schrieb Arunpravin Paneer Selvam:

Remove the MES self test as this conflicts with the userqueue fence
interrupts.


Please also completely remove the amdgpu_mes_self_test() function and 
any now unused code.


Regards,
Christian.



Signed-off-by: Arunpravin Paneer Selvam 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  3 ---
  drivers/gpu/drm/amd/amdgpu/mes_v10_1.c | 12 +---
  drivers/gpu/drm/amd/amdgpu/mes_v11_0.c | 14 +-
  3 files changed, 2 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 7753a2e64d41..2d9fa3d0d4a4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4719,9 +4719,6 @@ int amdgpu_device_resume(struct drm_device *dev, bool 
fbcon)
}
adev->in_suspend = false;
  
-	if (adev->enable_mes)

-   amdgpu_mes_self_test(adev);
-
if (amdgpu_acpi_smart_shift_update(dev, AMDGPU_SS_DEV_D0))
DRM_WARN("smart shift update failed\n");
  
diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v10_1.c b/drivers/gpu/drm/amd/amdgpu/mes_v10_1.c

index 4d1121d1a1e7..b7bfb3185a30 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v10_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v10_1.c
@@ -1161,20 +1161,10 @@ static int mes_v10_0_early_init(void *handle)
return 0;
  }
  
-static int mes_v10_0_late_init(void *handle)

-{
-   struct amdgpu_device *adev = (struct amdgpu_device *)handle;
-
-   if (!amdgpu_in_reset(adev))
-   amdgpu_mes_self_test(adev);
-
-   return 0;
-}
-
  static const struct amd_ip_funcs mes_v10_1_ip_funcs = {
.name = "mes_v10_1",
.early_init = mes_v10_0_early_init,
-   .late_init = mes_v10_0_late_init,
+   .late_init = NULL,
.sw_init = mes_v10_1_sw_init,
.sw_fini = mes_v10_1_sw_fini,
.hw_init = mes_v10_1_hw_init,
diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
index feb7fa2c304c..5923b7b0bd95 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
@@ -1274,22 +1274,10 @@ static int mes_v11_0_early_init(void *handle)
return 0;
  }
  
-static int mes_v11_0_late_init(void *handle)

-{
-   struct amdgpu_device *adev = (struct amdgpu_device *)handle;
-
-   /* it's only intended for use in mes_self_test case, not for s0ix and 
reset */
-   if (!amdgpu_in_reset(adev) && !adev->in_s0ix && !adev->in_suspend &&
-   (amdgpu_ip_version(adev, GC_HWIP, 0) != IP_VERSION(11, 0, 3)))
-   amdgpu_mes_self_test(adev);
-
-   return 0;
-}
-
  static const struct amd_ip_funcs mes_v11_0_ip_funcs = {
.name = "mes_v11_0",
.early_init = mes_v11_0_early_init,
-   .late_init = mes_v11_0_late_init,
+   .late_init = NULL,
.sw_init = mes_v11_0_sw_init,
.sw_fini = mes_v11_0_sw_fini,
.hw_init = mes_v11_0_hw_init,




Re: [PATCH v3] drm/amdgpu: Add Ring Hang Events

2024-05-13 Thread Christian König

Am 13.05.24 um 06:14 schrieb Ori Messinger:

This patch adds 'ring hang' events to the driver.
This is done by adding a 'reset_ring_hang' bool variable to the
struct 'amdgpu_reset_context' in the amdgpu_reset.h file.
The purpose of this 'reset_ring_hang' variable is that whenever a GPU
reset is initiated due to a ring hang, reset_ring_hang should
be set to 'true'.

Additionally, the reset cause is passed into the kfd smi event as
a string, and is displayed in dmesg as an error.

This 'amdgpu_reset_context' struct is now also passed
through across all relevant functions, and another event type
"KFD_SMI_EVENT_RING_HANG" is added to the kfd_smi_event enum.


Well general NAK to that design.

Why in the world would we want to expose the ring hang through the KFD 
SMI interface?


Regards,
Christian.



Signed-off-by: Ori Messinger 
Change-Id: I6af3022eb1b4514201c9430d635ff87f167ad6f7
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c  |  7 +--
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h  |  9 ++---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 16 
  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c |  4 
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h   |  2 ++
  drivers/gpu/drm/amd/amdkfd/kfd_device.c |  7 ---
  drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c |  7 ++-
  drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h |  5 -
  include/uapi/linux/kfd_ioctl.h  | 15 ---
  10 files changed, 56 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index e3738d417245..f1c6dc939cc3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -133,6 +133,9 @@ static void amdgpu_amdkfd_reset_work(struct work_struct 
*work)
  
  	reset_context.method = AMD_RESET_METHOD_NONE;

reset_context.reset_req_dev = adev;
+   reset_context.reset_ring_hang = true;
+   strscpy(reset_context.reset_cause, "hws_hang", sizeof(reset_context.reset_cause));
+   DRM_ERROR("Reset cause: %s\n", reset_context.reset_cause);
clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
  
  	amdgpu_device_gpu_recover(adev, NULL, &reset_context);

@@ -261,12 +264,12 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool 
run_pm)
return r;
  }
  
-int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)

+int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev, struct 
amdgpu_reset_context *reset_context)
  {
int r = 0;
  
  	if (adev->kfd.dev)

-   r = kgd2kfd_pre_reset(adev->kfd.dev);
+   r = kgd2kfd_pre_reset(adev->kfd.dev, reset_context);
  
  	return r;

  }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 1de021ebdd46..c9030d8b8308 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -47,6 +47,7 @@ enum TLB_FLUSH_TYPE {
  };
  
  struct amdgpu_device;

+struct amdgpu_reset_context;
  
  enum kfd_mem_attachment_type {

KFD_MEM_ATT_SHARED, /* Share kgd_mem->bo or another attachment's */
@@ -170,7 +171,8 @@ bool amdgpu_amdkfd_have_atomics_support(struct 
amdgpu_device *adev);
  
  bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev, u32 vmid);
  
-int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev);

+int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev,
+   struct amdgpu_reset_context *reset_context);
  
  int amdgpu_amdkfd_post_reset(struct amdgpu_device *adev);
  
@@ -416,7 +418,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,

  void kgd2kfd_device_exit(struct kfd_dev *kfd);
  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
  int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
-int kgd2kfd_pre_reset(struct kfd_dev *kfd);
+int kgd2kfd_pre_reset(struct kfd_dev *kfd,
+ struct amdgpu_reset_context *reset_context);
  int kgd2kfd_post_reset(struct kfd_dev *kfd);
  void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
  void kgd2kfd_set_sram_ecc_flag(struct kfd_dev *kfd);
@@ -459,7 +462,7 @@ static inline int kgd2kfd_resume(struct kfd_dev *kfd, bool 
run_pm)
return 0;
  }
  
-static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)

+static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd, struct 
amdgpu_reset_context *reset_context)
  {
return 0;
  }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 00fe3c2d5431..b18f37426b5e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5772,7 +5772,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
  
  		cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
  
-		amdgpu_amdkfd_pre_reset(tmp_adev);

+   amdgpu_amdkfd_pre_reset(tmp_adev, reset_context);
  
  		/*


Re: [PATCH v10 4/6] drm/amdgpu: Implement userqueue signal/wait IOCTL

2024-05-13 Thread Christian König

Am 10.05.24 um 10:50 schrieb Arunpravin Paneer Selvam:

This patch introduces new IOCTL for userqueue secure semaphore.

The signal IOCTL, called from a userspace application, creates a drm
syncobj and an array of bo GEM handles, which are passed as parameters to
the driver so it can install the fence into them.

The wait IOCTL gets an array of drm syncobjs, finds the fences
attached to the drm syncobjs and obtains the array of
memory_address/fence_value combinations which are returned to
userspace.

v2: (Christian)
 - Install fence into GEM BO object.
 - Lock all BO's using the dma resv subsystem
 - Reorder the sequence in signal IOCTL function.
 - Get write pointer from the shadow wptr
 - use userq_fence to fetch the va/value in wait IOCTL.

v3: (Christian)
 - Use drm_exec helper for the proper BO drm reserve and avoid BO
   lock/unlock issues.
 - fence/fence driver reference count logic for signal/wait IOCTLs.

v4: (Christian)
 - Fixed the drm_exec calling sequence
 - use dma_resv_for_each_fence_unlock if BO's are not locked
 - Modified the fence_info array storing logic.

v5: (Christian)
 - Keep fence_drv until wait queue execution.
 - Add dma_fence_wait for other fences.
 - Lock BO's using drm_exec as the number of fences in them could
   change.
 - Install signaled fences as well into BO/Syncobj.
 - Move Syncobj fence installation code after the drm_exec_prepare_array.
 - Directly add dma_resv_usage_rw(args->bo_flags
 - remove unnecessary dma_fence_put.

v6: (Christian)
 - Add xarray stuff to store the fence_drv
 - Implement a function to iterate over the xarray and drop
   the fence_drv references.
 - Add drm_exec_until_all_locked() wrapper
 - Add a check that if we haven't exceeded the user allocated num_fences
   before adding dma_fence to the fences array.

v7: (Christian)
 - Use memdup_user() for kmalloc_array + copy_from_user
 - Move the fence_drv references from the xarray into the newly created 
fence
   and drop the fence_drv references when we signal this fence.
 - Move this locking of BOs before the "if (!wait_info->num_fences)",
   this way you need this code block only once.
 - Merge the error handling code and the cleanup + return 0 code.
 - Initializing the xa should probably be done in the userq code.
 - Remove the userq back pointer stored in fence_drv.
 - Pass xarray as parameter in amdgpu_userq_walk_and_drop_fence_drv()

v8: (Christian)
 - Move fence_drv references must come before adding the fence to the list.
 - Use xa_lock_irqsave_nested for nested spinlock operations.
 - userq_mgr should be per fpriv and not one per device.
 - Restructure the interrupt process code for the early exit of the loop.
 - The reference acquired in the syncobj fence replace code needs to be
   kept around.
 - Modify the dma_fence acquire placement in wait IOCTL.
 - Move USERQ_BO_WRITE flag to UAPI header file.
 - drop the fence drv reference after telling the hw to stop accessing it.
 - Add multi sync object support to userq signal IOCTL.

v9: (Christian)
 - Store all the fence_drv ref to other drivers and not ourself.
 - Iterate over uq_fence_drv_xa without holding a lock.

Signed-off-by: Arunpravin Paneer Selvam 
Suggested-by: Christian König 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c   |   2 +
  .../gpu/drm/amd/amdgpu/amdgpu_userq_fence.c   | 431 +-
  .../gpu/drm/amd/amdgpu/amdgpu_userq_fence.h   |   6 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c |  29 +-
  .../gpu/drm/amd/include/amdgpu_userqueue.h|   1 +
  5 files changed, 462 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 844f7b5f90db..5892a4c1a92e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2918,6 +2918,8 @@ const struct drm_ioctl_desc amdgpu_ioctls_kms[] = {
DRM_IOCTL_DEF_DRV(AMDGPU_GEM_OP, amdgpu_gem_op_ioctl, 
DRM_AUTH|DRM_RENDER_ALLOW),
DRM_IOCTL_DEF_DRV(AMDGPU_GEM_USERPTR, amdgpu_gem_userptr_ioctl, 
DRM_AUTH|DRM_RENDER_ALLOW),
DRM_IOCTL_DEF_DRV(AMDGPU_USERQ, amdgpu_userq_ioctl, 
DRM_AUTH|DRM_RENDER_ALLOW),
+   DRM_IOCTL_DEF_DRV(AMDGPU_USERQ_SIGNAL, amdgpu_userq_signal_ioctl, 
DRM_AUTH|DRM_RENDER_ALLOW),
+   DRM_IOCTL_DEF_DRV(AMDGPU_USERQ_WAIT, amdgpu_userq_wait_ioctl, 
DRM_AUTH|DRM_RENDER_ALLOW),
  };
  
  static const struct drm_driver amdgpu_kms_driver = {

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c
index f7baea2c67ab..339d82d5808f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c
@@ -25,6 +25,7 @@
  #include 
  #include 
  
+#include 

  #include 
  
  #include "amdgpu.h"

@@ -92,6 +93,7 @@ int amdgpu_userq_fence_driver_alloc(struct 

Re: [PATCH v2 2/2] drm/tests: Add a unit test for range bias allocation

2024-05-13 Thread Matthew Auld

On 13/05/2024 16:11, Paneer Selvam, Arunpravin wrote:

Hi Matthew,

On 5/13/2024 1:49 PM, Matthew Auld wrote:

On 12/05/2024 08:59, Arunpravin Paneer Selvam wrote:

Allocate cleared blocks in the bias range when the DRM
buddy's clear avail is zero. This will validate the bias
range allocation in scenarios like system boot when no
cleared blocks are available and exercise the fallback
path too. The resulting blocks should always be dirty.

Signed-off-by: Arunpravin Paneer Selvam 


---
  drivers/gpu/drm/tests/drm_buddy_test.c | 35 ++
  1 file changed, 35 insertions(+)

diff --git a/drivers/gpu/drm/tests/drm_buddy_test.c 
b/drivers/gpu/drm/tests/drm_buddy_test.c

index e3b50e240d36..a194f271bc55 100644
--- a/drivers/gpu/drm/tests/drm_buddy_test.c
+++ b/drivers/gpu/drm/tests/drm_buddy_test.c
@@ -26,6 +26,8 @@ static void drm_test_buddy_alloc_range_bias(struct 
kunit *test)

  u32 mm_size, ps, bias_size, bias_start, bias_end, bias_rem;
  DRM_RND_STATE(prng, random_seed);
  unsigned int i, count, *order;
+    struct drm_buddy_block *block;
+    unsigned long flags;
  struct drm_buddy mm;
  LIST_HEAD(allocated);
  @@ -222,6 +224,39 @@ static void 
drm_test_buddy_alloc_range_bias(struct kunit *test)

    drm_buddy_free_list(&mm, &allocated, 0);
  drm_buddy_fini(&mm);
+
+    /*
+ * Allocate cleared blocks in the bias range when the DRM buddy's clear avail is
+ * zero. This will validate the bias range allocation in scenarios like system boot
+ * when no cleared blocks are available and exercise the fallback path too. The resulting
+ * blocks should always be dirty.
+ */
+
+    KUNIT_ASSERT_FALSE_MSG(test, drm_buddy_init(&mm, mm_size, ps),
+   "buddy_init failed\n");
+    mm.clear_avail = 0;


Should already be zero, right? Maybe make this an assert instead?
No, since mm is declared as a local variable in the test case, 
mm.clear_avail is not zero.


That sounds like a bug IMO. The init() should initialize it, like it 
does for mm.avail and everything else.





+
+    bias_start = round_up(prandom_u32_state(&prng) % (mm_size - ps), ps);
+    bias_end = round_up(bias_start + prandom_u32_state(&prng) % (mm_size - bias_start), ps);
+    bias_end = max(bias_end, bias_start + ps);
+    bias_rem = bias_end - bias_start;
+
+    flags = DRM_BUDDY_CLEAR_ALLOCATION | DRM_BUDDY_RANGE_ALLOCATION;
+    u32 size = max(round_up(prandom_u32_state(&prng) % bias_rem, ps), ps);


u32 declaration should be moved to above?

Sure.

Thanks,
Arun.


Otherwise,
Reviewed-by: Matthew Auld 


+
+    KUNIT_ASSERT_FALSE_MSG(test,
+   drm_buddy_alloc_blocks(&mm, bias_start,
+  bias_end, size, ps,
+  &allocated,
+  flags),
+   "buddy_alloc failed with bias(%x-%x), size=%u, ps=%u\n",
+   bias_start, bias_end, size, ps);
+
+    list_for_each_entry(block, &allocated, link)
+    KUNIT_EXPECT_EQ(test, drm_buddy_block_is_clear(block), false);
+
+    drm_buddy_free_list(&mm, &allocated, 0);
+    drm_buddy_fini(&mm);
  }
    static void drm_test_buddy_alloc_clear(struct kunit *test)




[PATCH] drm/kfd: Correct pinned buffer handling at kfd restore and validate process

2024-05-13 Thread Xiaogang . Chen
From: Xiaogang Chen 

This reverts 8a774fe912ff09e39c2d3a3589c729330113f388 "drm/amdgpu: avoid restore
process run into dead loop", since whether a buffer got pinned is not related to
whether it needs mapping. Also skip buffer validation in the kfd driver if the
buffer has already been pinned.

Signed-off-by: Xiaogang Chen 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 3314821e4cf3..80018738bd1c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -415,6 +415,10 @@ static int amdgpu_amdkfd_bo_validate(struct amdgpu_bo *bo, 
uint32_t domain,
 "Called with userptr BO"))
return -EINVAL;
 
+   /* bo has been pinned, no need to validate it */
+   if (bo->tbo.pin_count)
+   return 0;
+
amdgpu_bo_placement_from_domain(bo, domain);
 
	ret = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
@@ -2736,7 +2740,7 @@ static int confirm_valid_user_pages_locked(struct 
amdkfd_process_info *process_i
 
/* keep mem without hmm range at userptr_inval_list */
if (!mem->range)
-continue;
+   continue;
 
/* Only check mem with hmm range associated */
valid = amdgpu_ttm_tt_get_user_pages_done(
@@ -2981,9 +2985,6 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, 
struct dma_fence __rcu *
if (!attachment->is_mapped)
continue;
 
-   if (attachment->bo_va->base.bo->tbo.pin_count)
-   continue;
-
kfd_mem_dmaunmap_attachment(mem, attachment);
ret = update_gpuvm_pte(mem, attachment, _obj);
if (ret) {
-- 
2.25.1



Re: [PATCH v2 2/2] drm/tests: Add a unit test for range bias allocation

2024-05-13 Thread Paneer Selvam, Arunpravin

Hi Matthew,

On 5/13/2024 1:49 PM, Matthew Auld wrote:

On 12/05/2024 08:59, Arunpravin Paneer Selvam wrote:

Allocate cleared blocks in the bias range when the DRM
buddy's clear avail is zero. This will validate the bias
range allocation in scenarios like system boot when no
cleared blocks are available and exercise the fallback
path too. The resulting blocks should always be dirty.

Signed-off-by: Arunpravin Paneer Selvam 


---
  drivers/gpu/drm/tests/drm_buddy_test.c | 35 ++
  1 file changed, 35 insertions(+)

diff --git a/drivers/gpu/drm/tests/drm_buddy_test.c 
b/drivers/gpu/drm/tests/drm_buddy_test.c

index e3b50e240d36..a194f271bc55 100644
--- a/drivers/gpu/drm/tests/drm_buddy_test.c
+++ b/drivers/gpu/drm/tests/drm_buddy_test.c
@@ -26,6 +26,8 @@ static void drm_test_buddy_alloc_range_bias(struct 
kunit *test)

  u32 mm_size, ps, bias_size, bias_start, bias_end, bias_rem;
  DRM_RND_STATE(prng, random_seed);
  unsigned int i, count, *order;
+    struct drm_buddy_block *block;
+    unsigned long flags;
  struct drm_buddy mm;
  LIST_HEAD(allocated);
  @@ -222,6 +224,39 @@ static void 
drm_test_buddy_alloc_range_bias(struct kunit *test)

    drm_buddy_free_list(&mm, &allocated, 0);
  drm_buddy_fini(&mm);
+
+    /*
+ * Allocate cleared blocks in the bias range when the DRM 
buddy's clear avail is
+ * zero. This will validate the bias range allocation in 
scenarios like system boot
+ * when no cleared blocks are available and exercise the 
fallback path too. The resulting

+ * blocks should always be dirty.
+ */
+
+    KUNIT_ASSERT_FALSE_MSG(test, drm_buddy_init(&mm, mm_size, ps),
+   "buddy_init failed\n");
+    mm.clear_avail = 0;


Should already be zero, right? Maybe make this an assert instead?
No, since mm is declared as a local variable in the test case, 
mm.clear_avail is not zero.



+
+    bias_start = round_up(prandom_u32_state(&prng) % (mm_size - ps), ps);
+    bias_end = round_up(bias_start + prandom_u32_state(&prng) % (mm_size - bias_start), ps);
+    bias_end = max(bias_end, bias_start + ps);
+    bias_rem = bias_end - bias_start;
+
+    flags = DRM_BUDDY_CLEAR_ALLOCATION | DRM_BUDDY_RANGE_ALLOCATION;
+    u32 size = max(round_up(prandom_u32_state(&prng) % bias_rem, ps), ps);


u32 declaration should be moved to above?

Sure.

Thanks,
Arun.


Otherwise,
Reviewed-by: Matthew Auld 


+
+    KUNIT_ASSERT_FALSE_MSG(test,
+   drm_buddy_alloc_blocks(&mm, bias_start,
+  bias_end, size, ps,
+  &allocated,
+  flags),
+   "buddy_alloc failed with bias(%x-%x), size=%u, ps=%u\n",
+   bias_start, bias_end, size, ps);
+
+    list_for_each_entry(block, &allocated, link)
+    KUNIT_EXPECT_EQ(test, drm_buddy_block_is_clear(block), false);
+
+    drm_buddy_free_list(&mm, &allocated, 0);
+    drm_buddy_fini(&mm);
  }
    static void drm_test_buddy_alloc_clear(struct kunit *test)




[PATCH] drm/amdgpu: fix getting vram info for gfx12

2024-05-13 Thread Min, Frank
[AMD Official Use Only - AMD Internal Distribution Only]

From: Frank Min 

gfx12 queries the video memory channel/type/width from the umc_info in the atom
table list, so fix it accordingly.

Signed-off-by: Frank Min 
---
 .../gpu/drm/amd/amdgpu/amdgpu_atomfirmware.c  | 263 ++
 1 file changed, 148 insertions(+), 115 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_atomfirmware.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_atomfirmware.c
index a6d64bdbbb14..6fe84151332e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_atomfirmware.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_atomfirmware.c
@@ -289,7 +289,6 @@ static int convert_atom_mem_type_to_vram_type(struct 
amdgpu_device *adev,
return vram_type;
 }

-
 int
 amdgpu_atomfirmware_get_vram_info(struct amdgpu_device *adev,
  int *vram_width, int *vram_type,
@@ -300,6 +299,7 @@ amdgpu_atomfirmware_get_vram_info(struct amdgpu_device 
*adev,
u16 data_offset, size;
union igp_info *igp_info;
union vram_info *vram_info;
+   union umc_info *umc_info;
union vram_module *vram_module;
u8 frev, crev;
u8 mem_type;
@@ -311,10 +311,16 @@ amdgpu_atomfirmware_get_vram_info(struct amdgpu_device *adev,
	if (adev->flags & AMD_IS_APU)
		index = get_index_into_master_table(atom_master_list_of_data_tables_v2_1,
						    integratedsysteminfo);
-	else
-		index = get_index_into_master_table(atom_master_list_of_data_tables_v2_1,
-						    vram_info);
-
+	else {
+		switch (amdgpu_ip_version(adev, GC_HWIP, 0)) {
+		case IP_VERSION(12, 0, 0):
+		case IP_VERSION(12, 0, 1):
+			index = get_index_into_master_table(atom_master_list_of_data_tables_v2_1, umc_info);
+			break;
+		default:
+			index = get_index_into_master_table(atom_master_list_of_data_tables_v2_1, vram_info);
+		}
+	}
	if (amdgpu_atom_parse_data_header(mode_info->atom_context,
					  index, &size,
					  &frev, &crev, &data_offset)) {
@@ -368,123 +374,150 @@ amdgpu_atomfirmware_get_vram_info(struct amdgpu_device *adev,
			return -EINVAL;
		}
	} else {
-		vram_info = (union vram_info *)
-			(mode_info->atom_context->bios + data_offset);
-		module_id = (RREG32(adev->bios_scratch_reg_offset + 4) & 0x00ff0000) >> 16;
-		if (frev == 3) {
-			switch (crev) {
-			/* v30 */
-			case 0:
-				vram_module = (union vram_module *)vram_info->v30.vram_module;
-				mem_vendor = (vram_module->v30.dram_vendor_id) & 0xF;
-				if (vram_vendor)
-					*vram_vendor = mem_vendor;
-				mem_type = vram_info->v30.memory_type;
-				if (vram_type)
-					*vram_type = convert_atom_mem_type_to_vram_type(adev, mem_type);
-				mem_channel_number = vram_info->v30.channel_num;
-				mem_channel_width = vram_info->v30.channel_width;
-				if (vram_width)
-					*vram_width = mem_channel_number * (1 << mem_channel_width);
-				break;
-			default:
-				return -EINVAL;
-			}
-		} else if (frev == 2) {
-			switch (crev) {
-			/* v23 */
-			case 3:
-				if (module_id > vram_info->v23.vram_module_num)
-					module_id = 0;
-				vram_module = (union vram_module *)vram_info->v23.vram_module;
-				while (i < module_id) {
-					vram_module = (union vram_module *)
-						((u8 *)vram_module + vram_module->v9.vram_module_size);
-					i++;
-				}
-				mem_type = vram_module->v9.memory_type;
-				if (vram_type)
-					*vram_type = convert_atom_mem_type_to_vram_type(adev, mem_type);
-   

Re: [RFC 1/5] drm/amdgpu: Fix migration rate limiting accounting

2024-05-13 Thread Friedrich Vock

On 09.05.24 11:19, Tvrtko Ursulin wrote:


On 08/05/2024 20:08, Friedrich Vock wrote:

On 08.05.24 20:09, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

The logic assumed any migration attempt worked and therefore would
over-account the amount of data migrated during buffer re-validation. As a
consequence client can be unfairly penalised by incorrectly considering
its migration budget spent.


If the migration failed but data was still moved (which I think could be
the case when we try evicting everything but it still doesn't work?),
shouldn't the eviction movements count towards the ratelimit too?


Possibly, which path would that be?


Thinking about it more, the only case where allocation still won't
succeed after evicting everything from a place is the edge case when the
buffer is larger than the place's size.

The most likely condition for this to happen (without the submission
failing entirely because the buffer just doesn't fit anywhere) would be
if the app tries creating a 256MB+ visible-VRAM buffer if resizeable BAR
is disabled.
This case could potentially trigger an allocation failure when trying
with preferred_domains, but retrying with allowed_domains, which
includes GTT, could subsequently work.
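
The preferred/allowed domain retry described above can be modelled in a few lines. The domain flags, sizes, and function names below are made up for illustration only:

```c
#include <assert.h>
#include <stddef.h>

enum { DOMAIN_VRAM = 1, DOMAIN_GTT = 2 };

/* Toy allocator: VRAM succeeds only if the buffer fits the visible window. */
static int try_alloc(unsigned int domains, size_t size, size_t visible_vram)
{
	if ((domains & DOMAIN_VRAM) && size <= visible_vram)
		return DOMAIN_VRAM;
	if (domains & DOMAIN_GTT)
		return DOMAIN_GTT;
	return -1; /* nothing fits */
}

/* First try preferred_domains, then retry with the wider allowed_domains. */
static int alloc_with_fallback(size_t size, size_t visible_vram)
{
	int placed = try_alloc(DOMAIN_VRAM, size, visible_vram);

	if (placed < 0)
		placed = try_alloc(DOMAIN_VRAM | DOMAIN_GTT, size, visible_vram);
	return placed;
}
```

With a 256 MiB visible window, a larger buffer fails the preferred (VRAM-only) attempt and lands in GTT on the retry, which is the case discussed in the paragraph above.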


I mean there are definitely more migrations which *should not* be
counted, which I think your mini-series approaches more accurately.
What this patch achieves, in its current RFC form, is reduces the
"false-positive" migration budget depletions.

So larger improvements aside, point of the series was to illustrate
that even the things which were said to be working do not seem to. See
cover letter to see what I thought does not work either well or at all.

Fair point. If this patchset does "wrong"/inaccurate accounting in a
different way that improves the experience, then it's still an improvement.

Fix it by looking at the before and after buffer object backing
store and
only account if there was a change.

FIXME:
I think this needs a better solution to account for migrations between
VRAM visible and non-visible portions.


FWIW, I have some WIP patches (not posted on any MLs yet though) that
attempt to solve this issue (+actually enforcing ratelimits) by moving
the ratelimit accounting/enforcement to TTM entirely.

By moving the accounting to TTM we can count moved bytes when we move
them, and don't have to rely on comparing resources to determine whether
moving actually happened. This should address your FIXME as well.
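
The idea of accounting at the point of the move, instead of diffing resources after validation, can be sketched like this; the structs are toy stand-ins for TTM objects, not the real API:

```c
#include <assert.h>
#include <stddef.h>

struct toy_bo {
	int placement;  /* current backing store */
	size_t size;    /* bytes a migration would copy */
};

struct toy_ctx {
	size_t bytes_moved; /* migration-rate-limit budget accounting */
};

/*
 * Account exactly when data moves: a validation that leaves the buffer
 * where it is (or fails before copying anything) costs nothing against
 * the budget.
 */
static void toy_move(struct toy_ctx *ctx, struct toy_bo *bo, int new_placement)
{
	if (bo->placement == new_placement)
		return; /* no copy happened, no budget spent */
	ctx->bytes_moved += bo->size;
	bo->placement = new_placement;
}
```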


Yep, I've seen them. They are not necessarily conflicting with this
series, potentialy TTM placement flag aside. *If* something like this
can be kept small and still manage to fix up a few simple things which
do not appear to work at all at the moment.

For the larger re-work it is quite, well, large and it is not easy to
be certain the end result would work as expected. IMO it would be best
to sketch out a larger series which brings some practical and
measurable change in behaviour before committing to merge things piecemeal.


Yeah, fully agree. Getting something working and iterating on that based
on the results you get seems like the best way forward, that's what I'll
be focusing on for now.

Thanks,
Friedrich


For instance I have a niggling feeling that the runtime games the
driver plays with placements and domains are not great, and wonder if
things could be cleaner if simplified by letting TTM manage things
more, more explicitly, and having the list of placements be more
static. Thinking about it, that seems a step too far for now though.

Regards,

Tvrtko



Regards,
Friedrich


Signed-off-by: Tvrtko Ursulin 
Cc: Christian König 
Cc: Friedrich Vock 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 26
+-
  1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index ec888fc6ead8..22708954ae68 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -784,12 +784,15 @@ static int amdgpu_cs_bo_validate(void *param,
struct amdgpu_bo *bo)
  .no_wait_gpu = false,
  .resv = bo->tbo.base.resv
  };
+    struct ttm_resource *old_res;
  uint32_t domain;
  int r;

  if (bo->tbo.pin_count)
  return 0;

+    old_res = bo->tbo.resource;
+
  /* Don't move this buffer if we have depleted our allowance
   * to move it. Don't move anything if the threshold is zero.
   */
@@ -817,16 +820,29 @@ static int amdgpu_cs_bo_validate(void *param,
struct amdgpu_bo *bo)
  amdgpu_bo_placement_from_domain(bo, domain);
  r = ttm_bo_validate(>tbo, >placement, );

-    p->bytes_moved += ctx.bytes_moved;
-    if (!amdgpu_gmc_vram_full_visible(>gmc) &&
-    amdgpu_res_cpu_visible(adev, bo->tbo.resource))
-    p->bytes_moved_vis += ctx.bytes_moved;
-
  if (unlikely(r == -ENOMEM) && domain != bo->allowed_domains) {
  domain = bo->allowed_domains;
  goto retry;
  }

+    if (!r) {
+    

Re: [PATCH] drm/amdgpu: add initial value for gfx12 AGP aperture

2024-05-13 Thread Alex Deucher
Reviewed-by: Alex Deucher 

On Mon, May 13, 2024 at 5:37 AM Gao, Likun  wrote:
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> This patch was
> Reviewed-by: Likun Gao .
>
> Regards,
> Likun
>
> -Original Message-
> From: Min, Frank 
> Sent: Monday, May 13, 2024 4:56 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Koenig, Christian 
> ; Zhang, Hawking ; Gao, 
> Likun 
> Subject: [PATCH] drm/amdgpu: add initial value for gfx12 AGP aperture
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> From: Frank Min 
>
> add initial value for gfx12 AGP aperture
>
> Signed-off-by: Frank Min 
> ---
>  drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
> index 34264a33dcdf..b876300bb9f6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
> @@ -622,6 +622,7 @@ static void gmc_v12_0_vram_gtt_location(struct 
> amdgpu_device *adev,
>
> base = adev->mmhub.funcs->get_fb_location(adev);
>
> +   amdgpu_gmc_set_agp_default(adev, mc);
> amdgpu_gmc_vram_location(adev, >gmc, base);
> amdgpu_gmc_gart_location(adev, mc, AMDGPU_GART_PLACEMENT_LOW);
> if (!amdgpu_sriov_vf(adev) && (amdgpu_agp == 1))
> --
> 2.34.1
>
>


[PATCH v1 0/4] Add CP and GFX Queue register in ip Dump for

2024-05-13 Thread Sunil Khatri

Sunil Khatri (4):
  drm/amdgpu: update the ip_dump to ipdump_core
  drm/amdgpu: Add support to dump gfx10 cp registers
  drm/amdgpu: add support to dump gfx10 queue registers
  drm/amdgpu: add prints during ip register dump

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h|   4 +-
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 214 +++--
 3 files changed, 208 insertions(+), 12 deletions(-)

-- 
2.34.1



[PATCH v2 3/4] drm/amdgpu: add support to dump gfx10 queue registers

2024-05-13 Thread Sunil Khatri
Add gfx queue registers for all instances in the IP dump
for gfx10.

Signed-off-by: Sunil Khatri 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h |  1 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c  | 86 +
 2 files changed, 87 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
index d96873c154ed..54232066cd3b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
@@ -437,6 +437,7 @@ struct amdgpu_gfx {
/* IP reg dump */
uint32_t*ipdump_core;
uint32_t*ipdump_cp;
+   uint32_t*ipdump_gfx_queue;
 };
 
 struct amdgpu_gfx_ras_reg_entry {
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index daf9a3571183..221fbde297e9 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -424,6 +424,33 @@ static const struct amdgpu_hwip_reg_entry 
gc_cp_reg_list_10[] = {
SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_DEQUEUE_STATUS)
 };
 
+static const struct amdgpu_hwip_reg_entry gc_gfx_queue_reg_list_10[] = {
+   /* gfx queue registers */
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_ACTIVE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_QUEUE_PRIORITY),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_BASE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_BASE_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_OFFSET),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_CSMD_RPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_WPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_WPTR_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_DEQUEUE_REQUEST),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_MAPPED),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_QUE_MGR_CONTROL),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_HQ_CONTROL0),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_HQ_STATUS0),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_CE_WPTR_POLL_ADDR_LO),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_CE_WPTR_POLL_ADDR_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_CE_OFFSET),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_CE_CSMD_RPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_CE_WPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_CE_WPTR_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_MQD_BASE_ADDR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_MQD_BASE_ADDR_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_WPTR_POLL_ADDR_LO),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_WPTR_POLL_ADDR_HI)
+};
+
 static const struct soc15_reg_golden golden_settings_gc_10_1[] = {
SOC15_REG_GOLDEN_VALUE(GC, 0, mmCB_HW_CONTROL_4, 0x, 
0x00400014),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmCGTT_CPF_CLK_CTRL, 0xfcff8fff, 
0xf8000100),
@@ -4664,6 +4691,19 @@ static void gfx_v10_0_alloc_ip_dump(struct amdgpu_device 
*adev)
} else {
adev->gfx.ipdump_cp = ptr;
}
+
+   /* Allocate memory for gfx queue registers for all the instances */
+   reg_count = ARRAY_SIZE(gc_gfx_queue_reg_list_10);
+   inst = adev->gfx.me.num_me * adev->gfx.me.num_pipe_per_me *
+   adev->gfx.me.num_queue_per_pipe;
+
+   ptr = kcalloc(reg_count * inst, sizeof(uint32_t), GFP_KERNEL);
+   if (ptr == NULL) {
+   DRM_ERROR("Failed to allocate memory for GFX Queue IP Dump\n");
+   adev->gfx.ipdump_gfx_queue = NULL;
+   } else {
+   adev->gfx.ipdump_gfx_queue = ptr;
+   }
 }
 
 static int gfx_v10_0_sw_init(void *handle)
@@ -4874,6 +4914,7 @@ static int gfx_v10_0_sw_fini(void *handle)
 
kfree(adev->gfx.ipdump_core);
kfree(adev->gfx.ipdump_cp);
+   kfree(adev->gfx.ipdump_gfx_queue);
 
return 0;
 }
@@ -9368,6 +9409,26 @@ static void gfx_v10_ip_print(void *handle, struct 
drm_printer *p)
}
}
}
+
+   /* print gfx queue registers for all instances */
+   if (!adev->gfx.ipdump_gfx_queue)
+   return;
+
+   reg_count = ARRAY_SIZE(gc_gfx_queue_reg_list_10);
+
+   for (i = 0; i < adev->gfx.me.num_me; i++) {
+   for (j = 0; j < adev->gfx.me.num_pipe_per_me; j++) {
+   for (k = 0; k < adev->gfx.me.num_queue_per_pipe; k++) {
+   drm_printf(p, "me %d, pipe %d, queue %d\n", i, 
j, k);
+   for (reg = 0; reg < reg_count; reg++) {
+   drm_printf(p, "%-50s \t 0x%08x\n",
+  
gc_gfx_queue_reg_list_10[reg].reg_name,
+  
adev->gfx.ipdump_gfx_queue[index + reg]);
+   }
+   index += reg_count;
+   }
+   }
+   }
 }
 
 static void gfx_v10_ip_dump(void *handle)

[PATCH v2 4/4] drm/amdgpu: add prints during ip register dump

2024-05-13 Thread Sunil Khatri
Add prints before and after the IP register dump.
This avoids the user thinking the system is
stuck/hung, since the register dump takes time
after a GPU hang.

Signed-off-by: Sunil Khatri 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 00fe3c2d5431..b0186c61d24a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5373,11 +5373,13 @@ int amdgpu_do_asic_reset(struct list_head 
*device_list_handle,
if (!test_bit(AMDGPU_SKIP_COREDUMP, _context->flags)) {
amdgpu_reset_reg_dumps(tmp_adev);
 
+   dev_info(tmp_adev->dev, "Dumping IP Registers\n");
/* Trigger ip dump before we reset the asic */
for (i = 0; i < tmp_adev->num_ip_blocks; i++)
if 
(tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
tmp_adev->ip_blocks[i].version->funcs
->dump_ip_state((void *)tmp_adev);
+   dev_info(tmp_adev->dev, "Dumping IP Registers Completed\n");
}
 
reset_context->reset_device_list = device_list_handle;
-- 
2.34.1



[PATCH v2 2/4] drm/amdgpu: Add support to dump gfx10 cp registers

2024-05-13 Thread Sunil Khatri
Add support to dump all instances of the CP
registers in gfx10.

Signed-off-by: Sunil Khatri 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h |   1 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c  | 117 +++-
 2 files changed, 114 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
index 30d7f9c29478..d96873c154ed 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
@@ -436,6 +436,7 @@ struct amdgpu_gfx {
 
/* IP reg dump */
uint32_t*ipdump_core;
+   uint32_t*ipdump_cp;
 };
 
 struct amdgpu_gfx_ras_reg_entry {
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index f6d6a4b9802d..daf9a3571183 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -381,6 +381,49 @@ static const struct amdgpu_hwip_reg_entry 
gc_reg_list_10_1[] = {
SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE3)
 };
 
+static const struct amdgpu_hwip_reg_entry gc_cp_reg_list_10[] = {
+   /* compute registers */
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_VMID),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PERSISTENT_STATE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PIPE_PRIORITY),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_QUEUE_PRIORITY),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_QUANTUM),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_BASE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_BASE_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_RPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_POLL_ADDR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_POLL_ADDR_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_DOORBELL_CONTROL),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_CONTROL),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_BASE_ADDR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_BASE_ADDR_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_RPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_CONTROL),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_DEQUEUE_REQUEST),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_BASE_ADDR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_BASE_ADDR_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_CONTROL),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_RPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_WPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_EVENTS),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_BASE_ADDR_LO),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_BASE_ADDR_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_CONTROL),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CNTL_STACK_OFFSET),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CNTL_STACK_SIZE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_WG_STATE_OFFSET),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_SIZE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_GDS_RESOURCE_STATE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_ERROR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_WPTR_MEM),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_LO),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_SUSPEND_CNTL_STACK_OFFSET),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_SUSPEND_CNTL_STACK_DW_CNT),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_SUSPEND_WG_STATE_OFFSET),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_DEQUEUE_STATUS)
+};
+
 static const struct soc15_reg_golden golden_settings_gc_10_1[] = {
SOC15_REG_GOLDEN_VALUE(GC, 0, mmCB_HW_CONTROL_4, 0x, 
0x00400014),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmCGTT_CPF_CLK_CTRL, 0xfcff8fff, 
0xf8000100),
@@ -4595,10 +4638,11 @@ static int gfx_v10_0_compute_ring_init(struct 
amdgpu_device *adev, int ring_id,
 hw_prio, NULL);
 }
 
-static void gfx_v10_0_alloc_dump_mem(struct amdgpu_device *adev)
+static void gfx_v10_0_alloc_ip_dump(struct amdgpu_device *adev)
 {
uint32_t reg_count = ARRAY_SIZE(gc_reg_list_10_1);
uint32_t *ptr;
+   uint32_t inst;
 
ptr = kcalloc(reg_count, sizeof(uint32_t), GFP_KERNEL);
if (ptr == NULL) {
@@ -4607,6 +4651,19 @@ static void gfx_v10_0_alloc_dump_mem(struct 
amdgpu_device *adev)
} else {
adev->gfx.ipdump_core = ptr;
}
+
+   /* Allocate memory for gfx cp registers for all the instances */
+   reg_count = ARRAY_SIZE(gc_cp_reg_list_10);
+   inst = adev->gfx.mec.num_mec * adev->gfx.mec.num_pipe_per_mec *
+   adev->gfx.mec.num_queue_per_pipe;
+
+   ptr = kcalloc(reg_count * inst, sizeof(uint32_t), GFP_KERNEL);
+   if (ptr == NULL) {
+   DRM_ERROR("Failed to allocate memory for GFX CP IP Dump\n");
+   adev->gfx.ipdump_cp = NULL;
+   } else {
+   adev->gfx.ipdump_cp = ptr;
+   }
 }
 
 static 

[PATCH v2 1/4] drm/amdgpu: update the ip_dump to ipdump_core

2024-05-13 Thread Sunil Khatri
Rename the memory pointer from ip_dump to ipdump_core
to make it specific to core registers, so that the other
registers can be dumped into their own respective buffers.

Signed-off-by: Sunil Khatri 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h |  2 +-
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c  | 14 +++---
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
index 109f471ff315..30d7f9c29478 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
@@ -435,7 +435,7 @@ struct amdgpu_gfx {
boolmcbp; /* mid command buffer preemption 
*/
 
/* IP reg dump */
-   uint32_t*ip_dump;
+   uint32_t*ipdump_core;
 };
 
 struct amdgpu_gfx_ras_reg_entry {
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 953df202953a..f6d6a4b9802d 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -4603,9 +4603,9 @@ static void gfx_v10_0_alloc_dump_mem(struct amdgpu_device 
*adev)
ptr = kcalloc(reg_count, sizeof(uint32_t), GFP_KERNEL);
if (ptr == NULL) {
DRM_ERROR("Failed to allocate memory for IP Dump\n");
-   adev->gfx.ip_dump = NULL;
+   adev->gfx.ipdump_core = NULL;
} else {
-   adev->gfx.ip_dump = ptr;
+   adev->gfx.ipdump_core = ptr;
}
 }
 
@@ -4815,7 +4815,7 @@ static int gfx_v10_0_sw_fini(void *handle)
 
gfx_v10_0_free_microcode(adev);
 
-   kfree(adev->gfx.ip_dump);
+   kfree(adev->gfx.ipdump_core);
 
return 0;
 }
@@ -9283,13 +9283,13 @@ static void gfx_v10_ip_print(void *handle, struct 
drm_printer *p)
uint32_t i;
uint32_t reg_count = ARRAY_SIZE(gc_reg_list_10_1);
 
-   if (!adev->gfx.ip_dump)
+   if (!adev->gfx.ipdump_core)
return;
 
for (i = 0; i < reg_count; i++)
drm_printf(p, "%-50s \t 0x%08x\n",
   gc_reg_list_10_1[i].reg_name,
-  adev->gfx.ip_dump[i]);
+  adev->gfx.ipdump_core[i]);
 }
 
 static void gfx_v10_ip_dump(void *handle)
@@ -9298,12 +9298,12 @@ static void gfx_v10_ip_dump(void *handle)
uint32_t i;
uint32_t reg_count = ARRAY_SIZE(gc_reg_list_10_1);
 
-   if (!adev->gfx.ip_dump)
+   if (!adev->gfx.ipdump_core)
return;
 
amdgpu_gfx_off_ctrl(adev, false);
for (i = 0; i < reg_count; i++)
-   adev->gfx.ip_dump[i] = 
RREG32(SOC15_REG_ENTRY_OFFSET(gc_reg_list_10_1[i]));
+   adev->gfx.ipdump_core[i] = 
RREG32(SOC15_REG_ENTRY_OFFSET(gc_reg_list_10_1[i]));
amdgpu_gfx_off_ctrl(adev, true);
 }
 
-- 
2.34.1



Re: [RESEND 0/6] drm, nouveau/radeon/amdpgu: edid_blob_ptr cleanups

2024-05-13 Thread Alex Deucher
On Mon, May 13, 2024 at 8:20 AM Jani Nikula  wrote:
>
> On Fri, 10 May 2024, Alex Deucher  wrote:
> > On Fri, May 10, 2024 at 11:17 AM Jani Nikula  wrote:
> >>
> >> I've sent this some months ago, let's try again...
> >>
> >> BR,
> >> Jani.
> >>
> >> Jani Nikula (6):
> >>   drm/nouveau: convert to using is_hdmi and has_audio from display info
> >>   drm/radeon: convert to using is_hdmi and has_audio from display info
> >>   drm/radeon: remove radeon_connector_edid() and stop using
> >> edid_blob_ptr
> >>   drm/amdgpu: remove amdgpu_connector_edid() and stop using
> >> edid_blob_ptr
> >>   drm/edid: add a helper for EDID sysfs property show
> >>   drm/connector: update edid_blob_ptr documentation
> >
> > Series is:
> > Acked-by: Alex Deucher 
>
> Thanks, do you want to pick these up via your tree? And do you expect a
> proper R-b before merging?

Feel free to take them via drm-misc if you'd prefer to land the whole
set together, otherwise, I can pick up the radeon/amdgpu patches.

Alex


>
> BR,
> Jani.
>
>
> >
> >>
> >>  .../gpu/drm/amd/amdgpu/amdgpu_connectors.c| 16 -
> >>  .../gpu/drm/amd/amdgpu/amdgpu_connectors.h|  1 -
> >>  drivers/gpu/drm/amd/amdgpu/dce_v10_0.c|  4 +--
> >>  drivers/gpu/drm/amd/amdgpu/dce_v11_0.c|  4 +--
> >>  drivers/gpu/drm/amd/amdgpu/dce_v6_0.c |  4 +--
> >>  drivers/gpu/drm/amd/amdgpu/dce_v8_0.c |  4 +--
> >>  drivers/gpu/drm/drm_crtc_internal.h   |  2 ++
> >>  drivers/gpu/drm/drm_edid.c| 33 +++
> >>  drivers/gpu/drm/drm_sysfs.c   | 24 ++
> >>  drivers/gpu/drm/nouveau/dispnv50/disp.c   |  8 ++---
> >>  drivers/gpu/drm/nouveau/dispnv50/head.c   |  8 +
> >>  drivers/gpu/drm/nouveau/nouveau_connector.c   |  2 +-
> >>  drivers/gpu/drm/radeon/atombios_encoders.c| 10 +++---
> >>  drivers/gpu/drm/radeon/evergreen_hdmi.c   |  5 ++-
> >>  drivers/gpu/drm/radeon/radeon_audio.c | 13 
> >>  drivers/gpu/drm/radeon/radeon_connectors.c| 27 ---
> >>  drivers/gpu/drm/radeon/radeon_display.c   |  2 +-
> >>  drivers/gpu/drm/radeon/radeon_encoders.c  |  4 +--
> >>  drivers/gpu/drm/radeon/radeon_mode.h  |  2 --
> >>  include/drm/drm_connector.h   |  6 +++-
> >>  20 files changed, 79 insertions(+), 100 deletions(-)
> >>
> >> --
> >> 2.39.2
> >>
>
> --
> Jani Nikula, Intel


Re: [RFC 0/5] Discussion around eviction improvements

2024-05-13 Thread Tvrtko Ursulin



On 09/05/2024 13:40, Tvrtko Ursulin wrote:


On 08/05/2024 19:09, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

Last few days I was looking at the situation with VRAM over
subscription, what happens versus what perhaps should happen. Browsing
through the driver and running some simple experiments.

I ended up with this patch series which, as a disclaimer, may be
completely wrong but as I found some suspicious things, to me at
least, I thought it was a good point to stop and request some comments.

To perhaps summarise what are the main issues I think I found:

  * Migration rate limiting does not bother knowing if actual
    migration happened and so can over-account and unfairly penalise.

  * Migration rate limiting does not even work, at least not for the
    common case where userspace configures VRAM+GTT. It thinks it can
    stop migration attempts by playing with bo->allowed_domains vs
    bo->preferred_domains but, both from the code, and from empirical
    experiments, I see that not working at all. Both masks are
    identical so fiddling with them achieves nothing.

  * Idea of the fallback placement only works when VRAM has free
    space. As soon as it does not, ttm_resource_compatible is happy to
    leave the buffers in the secondary placement forever.

  * Driver thinks it will be re-validating evicted buffers on the next
    submission but it does not for the very common case of VRAM+GTT
    because it only checks if current placement is *none* of the
    preferred placements.

All those problems are addressed in individual patches.

End result of this series appears to be a driver which will try harder
to move buffers back into VRAM, but will be (more) correctly throttled
in doing so by the existing rate limiting logic.

I have run a quick benchmark of Cyberpunk 2077 and cannot say that I
saw a change, but that could be a good thing too. At least I did not
break anything, perhaps.. On one occasion I did see the rate limiting
logic get confused while for a period of a few minutes it went into a
mode where it was constantly giving a high migration budget. But that
recovered itself when I switched clients and did not come back, so I
don't know. If there is something wrong there I don't think it would
be caused by any patches in this series.


Since yesterday I also briefly tested with Far Cry New Dawn. One run
each, so it possibly doesn't mean anything apart from that there isn't
a regression, aka migration throttling is keeping things at bay even
with increased requests to migrate things back to VRAM:

                       before      after
  min/avg/max fps      36/44/54    37/45/55

Cyberpunk 2077 from yesterday was similarly close:

                       26.96/29.59/30.40    29.70/30.00/30.32

I guess the real story is a proper dGPU where misplaced buffers have a
real cost.


I found one game which regresses spectacularly badly with this series -
Assassin's Creed Valhalla. The built-in benchmark at least. The game
appears to have a working set much larger than the other games I tested,
around 5GiB total during the benchmark. And for some reason migration
throttling totally fails to put it in check. I will be investigating
this shortly.


Regards,

Tvrtko

Series is probably rough but should be good enough for discussion. I
am curious to hear if I identified at least something correctly as a
real problem.

It would also be good to hear what are the suggested games to check
and see whether there is any improvement.

Cc: Christian König 
Cc: Friedrich Vock 

Tvrtko Ursulin (5):
   drm/amdgpu: Fix migration rate limiting accounting
   drm/amdgpu: Actually respect buffer migration budget
   drm/ttm: Add preferred placement flag
   drm/amdgpu: Use preferred placement for VRAM+GTT
   drm/amdgpu: Re-validate evicted buffers

  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 38 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  8 +++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 21 ++--
  drivers/gpu/drm/ttm/ttm_resource.c | 13 +---
  include/drm/ttm/ttm_placement.h    |  3 ++
  5 files changed, 65 insertions(+), 18 deletions(-)



Re: [RFC PATCH 00/18] TTM interface for managing VRAM oversubscription

2024-05-13 Thread Friedrich Vock

Hi,

On 02.05.24 16:23, Maarten Lankhorst wrote:

Hey,

[snip]

For Xe, I've been looking at using cgroups. A small prototype is
available at

https://cgit.freedesktop.org/~mlankhorst/linux/log/?h=dumpcg

To stimulate discussion, I've added amdgpu support as well.
This should make it possible to isolate the compositor allocations
from the target program.

This support is still incomplete and covers vram only, but I need help
from userspace and consensus from other drivers on how to move forward.

I'm thinking of making 3 cgroup limits:
1. Physical memory, each time a buffer is allocated, it counts towards
it, regardless where it resides.
2. Mappable memory, all buffers allocated in sysmem or vram count
towards this limit.
3. VRAM, only buffers residing in VRAM count here.

This ensures that VRAM can always be evicted to sysmem, by having a
mappable memory quota, and having a sysmem reservation.

The main trouble is that when evicting, you want to charge the
original process the changes in allocation limits, but it should be
solvable.

I've been looking for someone else needing the usecase in a different
context, so let me know what you think of the idea.


Sorry for the late reply. The idea sounds really good! I think cgroups
are a great fit for what we'd need to prioritize game+compositor over
other potential non-foreground apps.

From what I can tell looking through the code, the current cgroup
properties are absolute memory sizes that userspace asks the kernel to
restrict the cgroup usage to?
While that sounds useful for some usecases too, I'm not sure just these
limits are a good solution for making sure that your compositor's and
foreground app's resources stay in memory (in favor of background apps)
when there is pressure.


This can be generalized towards all uses of the GPU, but the
compositor vs game thrashing is a good example of why it is useful to
have.


IIRC Tvrtko's original proposal was about per-cgroup DRM scheduling
priorities providing lower submission latency for prioritized cgroups,
right?

I think what we need here would pretty much exactly such a priority
system, but for memory: The cgroup containing the foreground app/game
and the compositor should have some hint telling TTM to try its hardest
to avoid evicting its buffers (i.e. a high memory priority).
Your existing drm_cgroup work looks like a great base for this, and I'd
be happy to help/participate with the implementation for amdgpu.

Thanks,
Friedrich


I should still have my cgroup testcase somewhere, this is only a
rebase of my previous proposal, but I think it fits the usecase.

Cheers,
Maarten


Re: [RESEND 0/6] drm, nouveau/radeon/amdpgu: edid_blob_ptr cleanups

2024-05-13 Thread Jani Nikula
On Fri, 10 May 2024, Alex Deucher  wrote:
> On Fri, May 10, 2024 at 11:17 AM Jani Nikula  wrote:
>>
>> I've sent this some months ago, let's try again...
>>
>> BR,
>> Jani.
>>
>> Jani Nikula (6):
>>   drm/nouveau: convert to using is_hdmi and has_audio from display info
>>   drm/radeon: convert to using is_hdmi and has_audio from display info
>>   drm/radeon: remove radeon_connector_edid() and stop using
>> edid_blob_ptr
>>   drm/amdgpu: remove amdgpu_connector_edid() and stop using
>> edid_blob_ptr
>>   drm/edid: add a helper for EDID sysfs property show
>>   drm/connector: update edid_blob_ptr documentation
>
> Series is:
> Acked-by: Alex Deucher 

Thanks, do you want to pick these up via your tree? And do you expect a
proper R-b before merging?

BR,
Jani.


>
>>
>>  .../gpu/drm/amd/amdgpu/amdgpu_connectors.c| 16 -
>>  .../gpu/drm/amd/amdgpu/amdgpu_connectors.h|  1 -
>>  drivers/gpu/drm/amd/amdgpu/dce_v10_0.c|  4 +--
>>  drivers/gpu/drm/amd/amdgpu/dce_v11_0.c|  4 +--
>>  drivers/gpu/drm/amd/amdgpu/dce_v6_0.c |  4 +--
>>  drivers/gpu/drm/amd/amdgpu/dce_v8_0.c |  4 +--
>>  drivers/gpu/drm/drm_crtc_internal.h   |  2 ++
>>  drivers/gpu/drm/drm_edid.c| 33 +++
>>  drivers/gpu/drm/drm_sysfs.c   | 24 ++
>>  drivers/gpu/drm/nouveau/dispnv50/disp.c   |  8 ++---
>>  drivers/gpu/drm/nouveau/dispnv50/head.c   |  8 +
>>  drivers/gpu/drm/nouveau/nouveau_connector.c   |  2 +-
>>  drivers/gpu/drm/radeon/atombios_encoders.c| 10 +++---
>>  drivers/gpu/drm/radeon/evergreen_hdmi.c   |  5 ++-
>>  drivers/gpu/drm/radeon/radeon_audio.c | 13 
>>  drivers/gpu/drm/radeon/radeon_connectors.c| 27 ---
>>  drivers/gpu/drm/radeon/radeon_display.c   |  2 +-
>>  drivers/gpu/drm/radeon/radeon_encoders.c  |  4 +--
>>  drivers/gpu/drm/radeon/radeon_mode.h  |  2 --
>>  include/drm/drm_connector.h   |  6 +++-
>>  20 files changed, 79 insertions(+), 100 deletions(-)
>>
>> --
>> 2.39.2
>>

-- 
Jani Nikula, Intel


Re: [RESEND 1/6] drm/nouveau: convert to using is_hdmi and has_audio from display info

2024-05-13 Thread Jani Nikula
On Fri, 10 May 2024, Lyude Paul  wrote:
> Reviewed-by: Lyude Paul 

Thanks, how do you want to handle merging this?

BR,
Jani.


>
> On Fri, 2024-05-10 at 18:08 +0300, Jani Nikula wrote:
>> Prefer the parsed results for is_hdmi and has_audio in display info
>> over
>> calling drm_detect_hdmi_monitor() and drm_detect_monitor_audio(),
>> respectively.
>> 
>> Conveniently, this also removes the need to use edid_blob_ptr.
>> 
>> v2: Reverse a backwards if condition (Ilia)
>> 
>> Cc: Karol Herbst 
>> Cc: Lyude Paul 
>> Cc: Danilo Krummrich 
>> Cc: nouv...@lists.freedesktop.org
>> Signed-off-by: Jani Nikula 
>> ---
>>  drivers/gpu/drm/nouveau/dispnv50/disp.c | 8 
>>  drivers/gpu/drm/nouveau/dispnv50/head.c | 8 +---
>>  drivers/gpu/drm/nouveau/nouveau_connector.c | 2 +-
>>  3 files changed, 6 insertions(+), 12 deletions(-)
>> 
>> diff --git a/drivers/gpu/drm/nouveau/dispnv50/disp.c
>> b/drivers/gpu/drm/nouveau/dispnv50/disp.c
>> index 0c3d88ad0b0e..168c27213287 100644
>> --- a/drivers/gpu/drm/nouveau/dispnv50/disp.c
>> +++ b/drivers/gpu/drm/nouveau/dispnv50/disp.c
>> @@ -751,7 +751,7 @@ nv50_audio_enable(struct drm_encoder *encoder,
>> struct nouveau_crtc *nv_crtc,
>>  struct nouveau_encoder *nv_encoder =
>> nouveau_encoder(encoder);
>>  struct nvif_outp *outp = _encoder->outp;
>>  
>> -if (!nv50_audio_supported(encoder) ||
>> !drm_detect_monitor_audio(nv_connector->edid))
>> +if (!nv50_audio_supported(encoder) || !nv_connector-
>> >base.display_info.has_audio)
>>  return;
>>  
>>  mutex_lock(>audio.lock);
>> @@ -1765,7 +1765,7 @@ nv50_sor_atomic_enable(struct drm_encoder
>> *encoder, struct drm_atomic_state *sta
>>  if ((disp->disp->object.oclass == GT214_DISP ||
>>   disp->disp->object.oclass >= GF110_DISP) &&
>>      nv_encoder->dcb->type != DCB_OUTPUT_LVDS &&
>> -    drm_detect_monitor_audio(nv_connector->edid))
>> +    nv_connector->base.display_info.has_audio)
>>  hda = true;
>>  
>>  if (!nvif_outp_acquired(outp))
>> @@ -1774,7 +1774,7 @@ nv50_sor_atomic_enable(struct drm_encoder
>> *encoder, struct drm_atomic_state *sta
>>  switch (nv_encoder->dcb->type) {
>>  case DCB_OUTPUT_TMDS:
>>  if (disp->disp->object.oclass != NV50_DISP &&
>> -    drm_detect_hdmi_monitor(nv_connector->edid))
>> +    nv_connector->base.display_info.is_hdmi)
>>  nv50_hdmi_enable(encoder, nv_crtc,
>> nv_connector, state, mode, hda);
>>  
>>  if (nv_encoder->outp.or.link & 1) {
>> @@ -1787,7 +1787,7 @@ nv50_sor_atomic_enable(struct drm_encoder
>> *encoder, struct drm_atomic_state *sta
>>   */
>>  if (mode->clock >= 165000 &&
>>      nv_encoder->dcb->duallink_possible &&
>> -    !drm_detect_hdmi_monitor(nv_connector-
>> >edid))
>> +    !nv_connector-
>> >base.display_info.is_hdmi)
>>  proto =
>> NV507D_SOR_SET_CONTROL_PROTOCOL_DUAL_TMDS;
>>  } else {
>>  proto =
>> NV507D_SOR_SET_CONTROL_PROTOCOL_SINGLE_TMDS_B;
>> diff --git a/drivers/gpu/drm/nouveau/dispnv50/head.c
>> b/drivers/gpu/drm/nouveau/dispnv50/head.c
>> index 83355dbc15ee..d7c74cc43ba5 100644
>> --- a/drivers/gpu/drm/nouveau/dispnv50/head.c
>> +++ b/drivers/gpu/drm/nouveau/dispnv50/head.c
>> @@ -127,14 +127,8 @@ nv50_head_atomic_check_view(struct
>> nv50_head_atom *armh,
>>  struct drm_display_mode *omode = >state.adjusted_mode;
>>  struct drm_display_mode *umode = >state.mode;
>>  int mode = asyc->scaler.mode;
>> -struct edid *edid;
>>  int umode_vdisplay, omode_hdisplay, omode_vdisplay;
>>  
>> -if (connector->edid_blob_ptr)
>> -edid = (struct edid *)connector->edid_blob_ptr-
>> >data;
>> -else
>> -edid = NULL;
>> -
>>  if (!asyc->scaler.full) {
>>  if (mode == DRM_MODE_SCALE_NONE)
>>  omode = umode;
>> @@ -162,7 +156,7 @@ nv50_head_atomic_check_view(struct nv50_head_atom
>> *armh,
>>   */
>>  if ((asyc->scaler.underscan.mode == UNDERSCAN_ON ||
>>      (asyc->scaler.underscan.mode == UNDERSCAN_AUTO &&
>> - drm_detect_hdmi_monitor(edid {
>> + connector->display_info.is_hdmi))) {
>>  u32 bX = asyc->scaler.underscan.hborder;
>>  u32 bY = asyc->scaler.underscan.vborder;
>>  u32 r = (asyh->view.oH << 19) / asyh->view.oW;
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_connector.c
>> b/drivers/gpu/drm/nouveau/nouveau_connector.c
>> index 856b3ef5edb8..938832a6af15 100644
>> --- a/drivers/gpu/drm/nouveau/nouveau_connector.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_connector.c
>> @@ -1034,7 +1034,7 @@ get_tmds_link_bandwidth(struct drm_connector
>> *connector)
>>  unsigned duallink_scale =
>>  nouveau_duallink && nv_encoder->dcb-
>> >duallink_possible ? 2 : 1;
>>  
>> -if 

Re: [PATCH 4/5] drm/amdgpu: Fix null pointer dereference to bo

2024-05-13 Thread Christian König

Am 13.05.24 um 10:56 schrieb Ma Jun:

Check bo before using it

Signed-off-by: Ma Jun 


Reviewed-by: Christian König 


---
  drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c | 16 +++-
  1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
index 2b7b67916c1d..8269fd6828bf 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
@@ -497,9 +497,8 @@ static void gmc_v12_0_get_vm_pte(struct amdgpu_device *adev,
 uint64_t *flags)
  {
struct amdgpu_bo *bo = mapping->bo_va->base.bo;
-   struct amdgpu_device *bo_adev = amdgpu_ttm_adev(bo->tbo.bdev);
-   bool coherent = bo->flags & AMDGPU_GEM_CREATE_COHERENT;
-   bool is_system = bo->tbo.resource->mem_type == TTM_PL_SYSTEM;
+   struct amdgpu_device *bo_adev;
+   bool coherent, is_system;
  
  
  	*flags &= ~AMDGPU_PTE_EXECUTABLE;

@@ -515,13 +514,20 @@ static void gmc_v12_0_get_vm_pte(struct amdgpu_device 
*adev,
*flags &= ~AMDGPU_PTE_VALID;
}
  
-	if (bo && bo->flags & (AMDGPU_GEM_CREATE_COHERENT |

+   if (!bo)
+   return;
+
+   if (bo->flags & (AMDGPU_GEM_CREATE_COHERENT |
   AMDGPU_GEM_CREATE_UNCACHED))
*flags = (*flags & ~AMDGPU_PTE_MTYPE_GFX12_MASK) |
 AMDGPU_PTE_MTYPE_GFX12(MTYPE_UC);
  
+	bo_adev = amdgpu_ttm_adev(bo->tbo.bdev);

+   coherent = bo->flags & AMDGPU_GEM_CREATE_COHERENT;
+   is_system = bo->tbo.resource->mem_type == TTM_PL_SYSTEM;
+
/* WA for HW bug */
-   if ((bo && is_system) || ((bo_adev != adev) && coherent))
+   if (is_system || ((bo_adev != adev) && coherent))
*flags |= AMDGPU_PTE_MTYPE_GFX12(MTYPE_NC);
  
  }




RE: [PATCH] drm/amd/amdgpu: update jpeg 5 capability

2024-05-13 Thread Liu, Leo
[AMD Official Use Only - AMD Internal Distribution Only]

Reviewed-by: Leo Liu 

> -Original Message-
> From: Wu, David 
> Sent: Thursday, May 9, 2024 5:37 PM
> To: amd-gfx@lists.freedesktop.org; Deucher, Alexander
> 
> Cc: Koenig, Christian ; Liu, Leo
> ; Jiang, Sonny 
> Subject: [PATCH] drm/amd/amdgpu: update jpeg 5 capability
>
> Based on the documentation, the maximum resolution should be 16384x16384.
>
> Signed-off-by: David (Ming Qiang) Wu 
> ---
>  drivers/gpu/drm/amd/amdgpu/soc24.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/soc24.c
> b/drivers/gpu/drm/amd/amdgpu/soc24.c
> index 12900488dd61..285d6af10f62 100644
> --- a/drivers/gpu/drm/amd/amdgpu/soc24.c
> +++ b/drivers/gpu/drm/amd/amdgpu/soc24.c
> @@ -61,7 +61,7 @@ static const struct amdgpu_video_codecs
> vcn_5_0_0_video_codecs_encode_vcn0 = {  static const struct
> amdgpu_video_codec_info vcn_5_0_0_video_codecs_decode_array_vcn0[] = {
>
>   {codec_info_build(AMDGPU_INFO_VIDEO_CAPS_CODEC_IDX_MPEG4_
> AVC, 4096, 2160, 52)},
>   {codec_info_build(AMDGPU_INFO_VIDEO_CAPS_CODEC_IDX_HEVC,
> 8192, 4320, 183)},
> - {codec_info_build(AMDGPU_INFO_VIDEO_CAPS_CODEC_IDX_JPEG,
> 4096, 4096, 0)},
> + {codec_info_build(AMDGPU_INFO_VIDEO_CAPS_CODEC_IDX_JPEG,
> 16384, 16384,
> +0)},
>   {codec_info_build(AMDGPU_INFO_VIDEO_CAPS_CODEC_IDX_VP9,
> 8192, 4352, 0)},
>   {codec_info_build(AMDGPU_INFO_VIDEO_CAPS_CODEC_IDX_AV1,
> 8192, 4352, 0)},  };
> --
> 2.34.1



Re: [PATCH 02/20] drm/amdgpu: don't trample pdev drvdata

2024-05-13 Thread Christian König

Am 10.05.24 um 20:12 schrieb Matthew Auld:

The driver release callback is called when a particular drm_device goes
away, just like with drmm, so here we should never nuke the pdev drvdata
pointer, since that could already be pointing to a new drvdata.
For example something hotunplugs the device, for which we have an open
driver fd, keeping the drm_device alive, and in the meantime the same
physical device is re-attached to a new drm_device therefore setting
drvdata again. Once the original drm_device goes away, we might then
call the release which then incorrectly tramples the drvdata.

The driver core will already nuke the pointer for us when the pci device
is removed, so should be safe to simply drop. Alternative would be to
move to the driver pci remove callback.

Signed-off-by: Matthew Auld 
Cc: Christian König 
Cc: Daniel Vetter 
Cc: amd-gfx@lists.freedesktop.org


Oh! Very good catch! That might become important for a feature we 
current discuss internally.


Reviewed-by: Christian König 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 1 -
  1 file changed, 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index a0ea6fe8d060..d5fed007c698 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -1450,7 +1450,6 @@ void amdgpu_driver_release_kms(struct drm_device *dev)
struct amdgpu_device *adev = drm_to_adev(dev);
  
  	amdgpu_device_fini_sw(adev);

-   pci_set_drvdata(adev->pdev, NULL);
  }
  
  /*




Re: [PATCH 01/22] drm/amdgpu: fix dereference after null check

2024-05-13 Thread Christian König

Am 10.05.24 um 04:50 schrieb Jesse Zhang:

check the pointer hive before use.

Signed-off-by: Jesse Zhang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 77f6fd50002a..00fe3c2d5431 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5725,7 +5725,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 * to put adev in the 1st position.
 */
	INIT_LIST_HEAD(&device_list);
-   if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1)) {
+   if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1) 
&& hive) {


That solution doesn't look optimal to me.

Checking adev->gmc.xgmi.num_physical_nodes > 1 already ensures that 
hive isn't NULL.


If automated checkers complain about that we should probably drop the 
adev->gmc.xgmi.num_physical_nodes > 1 check and check for hive instead.


Regards,
Christian.


list_for_each_entry(tmp_adev, &hive->device_list, 
gmc.xgmi.head) {
list_add_tail(&tmp_adev->reset_list, &device_list);
if (adev->shutdown)




Re: [PATCH 05/22] drm/amd/pm: check specific index for aldebaran

2024-05-13 Thread Lazar, Lijo



On 5/10/2024 8:20 AM, Jesse Zhang wrote:
> Check for specific indexes that may be invalid values.
> 
> Signed-off-by: Jesse Zhang 
> ---
>  drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c 
> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> index ce941fbb9cfb..a22eb6bbb05e 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> @@ -1886,7 +1886,8 @@ static int aldebaran_mode2_reset(struct smu_context 
> *smu)
>  
>   index = smu_cmn_to_asic_specific_index(smu, CMN2ASIC_MAPPING_MSG,
>   SMU_MSG_GfxDeviceDriverReset);
> -
> + if (index < 0 )
> + return -EINVAL;

To avoid warning problems, drop the index variable and use
PPSMC_MSG_GfxDriverReset directly.

Thanks,
Lijo

>   mutex_lock(&smu->message_lock);
>   if (smu->smc_fw_version >= 0x00441400) {
>   ret = smu_cmn_send_msg_without_waiting(smu, (uint16_t)index, 
> SMU_RESET_MODE_2);


Re: [PATCH] drm/amd/display: Don't register panel_power_savings on OLED panels

2024-05-13 Thread Harry Wentland

On 2024-05-09 13:05, Mario Limonciello wrote:

OLED panels don't support the ABM, they shouldn't offer the
panel_power_savings attribute to the user. Check whether aux BL
control support was enabled to decide whether to offer it.

Reported-by: Gergo Koteles 
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3359
Signed-off-by: Mario Limonciello 


Reviewed-by: Harry Wentland 

Harry


---
  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 29 ---
  1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 3054bf79fc99..ce2ec857b8a6 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -6532,12 +6532,34 @@ static const struct attribute_group amdgpu_group = {
.attrs = amdgpu_attrs
  };
  
+static bool

+amdgpu_dm_should_create_sysfs(struct amdgpu_dm_connector *amdgpu_dm_connector)
+{
+   if (amdgpu_dm_abm_level >= 0)
+   return false;
+
+   if (amdgpu_dm_connector->base.connector_type != DRM_MODE_CONNECTOR_eDP)
+   return false;
+
+   /* check for OLED panels */
+   if (amdgpu_dm_connector->bl_idx >= 0) {
+   struct drm_device *drm = amdgpu_dm_connector->base.dev;
+   struct amdgpu_display_manager *dm = &drm_to_adev(drm)->dm;
+   struct amdgpu_dm_backlight_caps *caps;
+
+   caps = &dm->backlight_caps[amdgpu_dm_connector->bl_idx];
+   if (caps->aux_support)
+   return false;
+   }
+
+   return true;
+}
+
  static void amdgpu_dm_connector_unregister(struct drm_connector *connector)
  {
struct amdgpu_dm_connector *amdgpu_dm_connector = 
to_amdgpu_dm_connector(connector);
  
-	if (connector->connector_type == DRM_MODE_CONNECTOR_eDP &&

-   amdgpu_dm_abm_level < 0)
+   if (amdgpu_dm_should_create_sysfs(amdgpu_dm_connector))
sysfs_remove_group(&connector->kdev->kobj, &amdgpu_group);
  
  	drm_dp_aux_unregister(_dm_connector->dm_dp_aux.aux);

@@ -6644,8 +,7 @@ amdgpu_dm_connector_late_register(struct drm_connector 
*connector)
to_amdgpu_dm_connector(connector);
int r;
  
-	if (connector->connector_type == DRM_MODE_CONNECTOR_eDP &&

-   amdgpu_dm_abm_level < 0) {
+   if (amdgpu_dm_should_create_sysfs(amdgpu_dm_connector)) {
r = sysfs_create_group(&connector->kdev->kobj,
   &amdgpu_group);
if (r)


Re: [PATCH 09/22] drm/amd/pm: check specific index for smu13

2024-05-13 Thread Lazar, Lijo



On 5/13/2024 4:27 PM, Lazar, Lijo wrote:
> 
> 
> On 5/10/2024 8:20 AM, Jesse Zhang wrote:
>> Check for specific indexes that may be invalid values.
>>
>> Signed-off-by: Jesse Zhang 
>> ---
>>  drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c 
>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
>> index 051092f1b1b4..7c343dd12a7f 100644
>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
>> @@ -2336,6 +2336,8 @@ static int smu_v13_0_6_mode2_reset(struct smu_context 
>> *smu)
>>  
>>  index = smu_cmn_to_asic_specific_index(smu, CMN2ASIC_MAPPING_MSG,
>> SMU_MSG_GfxDeviceDriverReset);
>> +if (index < 0)
>> +ret = -EINVAL;
>>  
> 
> This check is unnecessary. This is IP specific logic and the index is
> expected to exist.
> 

If you are seeing a warning problem because of this, drop index and use
PPSMC_MSG_GfxDriverReset directly.

Thanks,
Lijo

> See this entry in smu_v13_0_6_message_map
> 
> MSG_MAP(GfxDeviceDriverReset,PPSMC_MSG_GfxDriverReset,
>SMU_MSG_RAS_PRI),
> >
> Thanks,
> Lijo
> 
>>  mutex_lock(&smu->message_lock);
>>  


Re: [PATCH 09/22] drm/amd/pm: check specific index for smu13

2024-05-13 Thread Lazar, Lijo



On 5/10/2024 8:20 AM, Jesse Zhang wrote:
> Check for specific indexes that may be invalid values.
> 
> Signed-off-by: Jesse Zhang 
> ---
>  drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c 
> b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> index 051092f1b1b4..7c343dd12a7f 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> @@ -2336,6 +2336,8 @@ static int smu_v13_0_6_mode2_reset(struct smu_context 
> *smu)
>  
>   index = smu_cmn_to_asic_specific_index(smu, CMN2ASIC_MAPPING_MSG,
>  SMU_MSG_GfxDeviceDriverReset);
> + if (index < 0)
> + ret = -EINVAL;
>  

This check is unnecessary. This is IP specific logic and the index is
expected to exist.

See this entry in smu_v13_0_6_message_map

MSG_MAP(GfxDeviceDriverReset,PPSMC_MSG_GfxDriverReset,
   SMU_MSG_RAS_PRI),


Thanks,
Lijo

>   mutex_lock(&smu->message_lock);
>  


Re: [PATCH 3/5] drm/amdgpu: Fix null pointer dereference to aca_handle

2024-05-13 Thread Lazar, Lijo



On 5/13/2024 2:26 PM, Ma Jun wrote:
> Check handle pointer before using it
> 
> Signed-off-by: Ma Jun 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> index 28febf33fb1b..e969a7d77b4d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> @@ -279,7 +279,7 @@ static struct aca_bank_error *get_bank_error(struct 
> aca_error *aerr, struct aca_
>  int aca_error_cache_log_bank_error(struct aca_handle *handle, struct 
> aca_bank_info *info,
>  enum aca_error_type type, u64 count)
>  {
> - struct aca_error_cache *error_cache = &handle->error_cache;
> + struct aca_error_cache *error_cache;
>   struct aca_bank_error *bank_error;
>   struct aca_error *aerr;
>  
> @@ -289,6 +289,10 @@ int aca_error_cache_log_bank_error(struct aca_handle 
> *handle, struct aca_bank_in
>   if (!count)
>   return 0;
>  
> + error_cache = &handle->error_cache;
> + if (!error_cache)
> + return -EINVAL;

Similar as in patch 2. error_cache is not a pointer variable.

struct aca_error_cache error_cache;


Thanks,
Lijo

> +
>   aerr = &error_cache->errors[type];
>   bank_error = get_bank_error(aerr, info);
>   if (!bank_error)


Re: [PATCH 2/5] drm/amdgpu: Fix the null pointer dereference to ras_manager

2024-05-13 Thread Lazar, Lijo



On 5/13/2024 2:26 PM, Ma Jun wrote:
> Check ras_manager before using it
> 
> Signed-off-by: Ma Jun 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 1dd13ed3b7b5..6da02a209890 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2172,12 +2172,17 @@ static void 
> amdgpu_ras_interrupt_process_handler(struct work_struct *work)
>  int amdgpu_ras_interrupt_dispatch(struct amdgpu_device *adev,
>   struct ras_dispatch_if *info)
>  {
> - struct ras_manager *obj = amdgpu_ras_find_obj(adev, &info->head);
> - struct ras_ih_data *data = &obj->ih_data;
> + struct ras_manager *obj;
> + struct ras_ih_data *data;
>  
> + obj = amdgpu_ras_find_obj(adev, &info->head);
>   if (!obj)
>   return -EINVAL;
>  
> + data = &obj->ih_data;
> + if (!data)
> + return -EINVAL;

This check is not needed. ih_data is declared as below in ras_manager.

struct ras_ih_data ih_data;

Thanks,
Lijo

> +
>   if (data->inuse == 0)
>   return 0;
>  


[PATCH v3 10/10] Documentation/amdgpu: Add PM policy documentation

2024-05-13 Thread Lijo Lazar
Add documentation about the newly added pm_policy node in sysfs.

Signed-off-by: Lijo Lazar 
---
 Documentation/gpu/amdgpu/thermal.rst |  6 
 drivers/gpu/drm/amd/pm/amdgpu_pm.c   | 48 
 2 files changed, 54 insertions(+)

diff --git a/Documentation/gpu/amdgpu/thermal.rst 
b/Documentation/gpu/amdgpu/thermal.rst
index 2f6166f81e6a..6d942b5c58f0 100644
--- a/Documentation/gpu/amdgpu/thermal.rst
+++ b/Documentation/gpu/amdgpu/thermal.rst
@@ -49,6 +49,12 @@ pp_power_profile_mode
 .. kernel-doc:: drivers/gpu/drm/amd/pm/amdgpu_pm.c
:doc: pp_power_profile_mode
 
+pm_policy
+-
+
+.. kernel-doc:: drivers/gpu/drm/amd/pm/amdgpu_pm.c
+   :doc: pm_policy
+
 \*_busy_percent
 ---
 
diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
index 9cca4716ec43..45766d49f1f2 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
@@ -2214,6 +2214,54 @@ static int pp_dpm_clk_default_attr_update(struct 
amdgpu_device *adev, struct amd
return 0;
 }
 
+/**
+ * DOC: pm_policy
+ *
+ * Certain SOCs can support different power policies to optimize application
+ * performance. However, this policy is provided only at SOC level and not at a
+ * per-process level. This is useful especially when entire SOC is utilized for
+ * dedicated workload.
+ *
+ * The amdgpu driver provides a sysfs API for selecting the policy. Presently,
+ * only two types of policies are supported through this interface.
+ *
+ *  Pstate Policy Selection - This is to select different Pstate profiles which
+ *  decides clock/throttling preferences.
+ *
+ *  XGMI PLPD Policy Selection - When multiple devices are connected over XGMI,
+ *  this helps to select policy to be applied for per link power down.
+ *
+ * The list of available policies and policy levels vary between SOCs. They can
+ * be viewed by reading the file. The policy level which is applied presently 
is
+ * denoted by * (asterisk). E.g.,
+ *
+ * .. code-block:: console
+ *
+ * cat /sys/bus/pci/devices/.../pm_policy
+ * soc pstate
+ * 0 : soc_pstate_default
+ * 1 : soc_pstate_0
+ * 2 : soc_pstate_1*
+ * 3 : soc_pstate_2
+ * xgmi plpd
+ * 0 : plpd_disallow
+ * 1 : plpd_default
+ * 2 : plpd_optimized*
+ *
+ * To apply a specific policy
+ *
+ * "echo <policy_type> <policy_level> > /sys/bus/pci/devices/.../pm_policy"
+ *
+ * For the levels listed in the example above, to select "plpd_optimized" for
+ * XGMI and "soc_pstate_2" for soc pstate policy -
+ *
+ * .. code-block:: console
+ *
+ * echo "xgmi 2" > /sys/bus/pci/devices/.../pm_policy
+ * echo "soc_pstate 3" > /sys/bus/pci/devices/.../pm_policy
+ *
+ */
+
 static ssize_t amdgpu_get_pm_policy(struct device *dev,
struct device_attribute *attr, char *buf)
 {
-- 
2.25.1



[PATCH v3 06/10] drm/amd/pm: Add xgmi plpd to aldebaran pm_policy

2024-05-13 Thread Lijo Lazar
On aldebaran, allow changing xgmi plpd policy through pm_policy sysfs
interface.

Signed-off-by: Lijo Lazar 
Reviewed-by: Hawking Zhang 
---
 .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 35 +++
 1 file changed, 35 insertions(+)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
index a22eb6bbb05e..66d386ef1da9 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
@@ -266,9 +266,30 @@ static int aldebaran_tables_init(struct smu_context *smu)
return 0;
 }
 
+static int aldebaran_select_plpd_policy(struct smu_context *smu, int level)
+{
+   struct amdgpu_device *adev = smu->adev;
+
+   /* The message only works on master die and NACK will be sent
+  back for other dies, only send it on master die */
+   if (adev->smuio.funcs->get_socket_id(adev) ||
+   adev->smuio.funcs->get_die_id(adev))
+   return 0;
+
+   if (level == XGMI_PLPD_DEFAULT)
+   return smu_cmn_send_smc_msg_with_param(
+   smu, SMU_MSG_GmiPwrDnControl, 0, NULL);
+   else if (level == XGMI_PLPD_DISALLOW)
+   return smu_cmn_send_smc_msg_with_param(
+   smu, SMU_MSG_GmiPwrDnControl, 1, NULL);
+   else
+   return -EINVAL;
+}
+
 static int aldebaran_allocate_dpm_context(struct smu_context *smu)
 {
	struct smu_dpm_context *smu_dpm = &smu->smu_dpm;
+   struct smu_dpm_policy *policy;
 
smu_dpm->dpm_context = kzalloc(sizeof(struct smu_13_0_dpm_context),
   GFP_KERNEL);
@@ -276,6 +297,20 @@ static int aldebaran_allocate_dpm_context(struct 
smu_context *smu)
return -ENOMEM;
smu_dpm->dpm_context_size = sizeof(struct smu_13_0_dpm_context);
 
+   smu_dpm->dpm_policies =
+   kzalloc(sizeof(struct smu_dpm_policy_ctxt), GFP_KERNEL);
+
+   if (!smu_dpm->dpm_policies)
+   return -ENOMEM;
+
+   policy = &(smu_dpm->dpm_policies->policies[0]);
+   policy->policy_type = PP_PM_POLICY_XGMI_PLPD;
+   policy->level_mask = BIT(XGMI_PLPD_DISALLOW) | BIT(XGMI_PLPD_DEFAULT);
+   policy->current_level = XGMI_PLPD_DEFAULT;
+   policy->set_policy = aldebaran_select_plpd_policy;
+   smu_cmn_generic_plpd_policy_desc(policy);
+   smu_dpm->dpm_policies->policy_mask |= BIT(PP_PM_POLICY_XGMI_PLPD);
+
return 0;
 }
 
-- 
2.25.1



[PATCH v3 09/10] drm/amd/pm: Remove unused interface to set plpd

2024-05-13 Thread Lijo Lazar
Remove unused callback to set PLPD policy and its implementation from
arcturus, aldebaran and SMUv13.0.6 SOCs.

Signed-off-by: Lijo Lazar 
Reviewed-by: Hawking Zhang 
---
 drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h |  6 ---
 .../gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c | 22 ---
 .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 24 
 .../drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c  | 39 ---
 4 files changed, 91 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h 
b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
index e460fa9aaf23..dd9356c7428e 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
@@ -876,12 +876,6 @@ struct pptable_funcs {
 */
int (*set_df_cstate)(struct smu_context *smu, enum pp_df_cstate state);
 
-   /**
-* @select_xgmi_plpd_policy: Select xgmi per-link power down policy.
-*/
-   int (*select_xgmi_plpd_policy)(struct smu_context *smu,
-  enum pp_xgmi_plpd_mode mode);
-
/**
 * @update_pcie_parameters: Update and upload the system's PCIe
 *  capabilites to the SMU.
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c
index 84f7d4139bda..c0f6b59369b7 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c
@@ -,27 +,6 @@ static int arcturus_set_df_cstate(struct smu_context 
*smu,
return smu_cmn_send_smc_msg_with_param(smu, SMU_MSG_DFCstateControl, 
state, NULL);
 }
 
-static int arcturus_select_xgmi_plpd_policy(struct smu_context *smu,
-   enum pp_xgmi_plpd_mode mode)
-{
-   /* PPSMC_MSG_GmiPwrDnControl is supported by 54.23.0 and onwards */
-   if (smu->smc_fw_version < 0x00361700) {
-   dev_err(smu->adev->dev, "XGMI power down control is only 
supported by PMFW 54.23.0 and onwards\n");
-   return -EINVAL;
-   }
-
-   if (mode == XGMI_PLPD_DEFAULT)
-   return smu_cmn_send_smc_msg_with_param(smu,
-  SMU_MSG_GmiPwrDnControl,
-  1, NULL);
-   else if (mode == XGMI_PLPD_DISALLOW)
-   return smu_cmn_send_smc_msg_with_param(smu,
-  SMU_MSG_GmiPwrDnControl,
-  0, NULL);
-   else
-   return -EINVAL;
-}
-
 static const struct throttling_logging_label {
uint32_t feature_mask;
const char *label;
@@ -2440,7 +2419,6 @@ static const struct pptable_funcs arcturus_ppt_funcs = {
.get_dpm_ultimate_freq = smu_v11_0_get_dpm_ultimate_freq,
.set_soft_freq_limited_range = smu_v11_0_set_soft_freq_limited_range,
.set_df_cstate = arcturus_set_df_cstate,
-   .select_xgmi_plpd_policy = arcturus_select_xgmi_plpd_policy,
.log_thermal_throttling_event = arcturus_log_thermal_throttling_event,
.get_pp_feature_mask = smu_cmn_get_pp_feature_mask,
.set_pp_feature_mask = smu_cmn_set_pp_feature_mask,
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
index 66d386ef1da9..e584e53e3760 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
@@ -1642,29 +1642,6 @@ static int aldebaran_set_df_cstate(struct smu_context 
*smu,
return smu_cmn_send_smc_msg_with_param(smu, SMU_MSG_DFCstateControl, 
state, NULL);
 }
 
-static int aldebaran_select_xgmi_plpd_policy(struct smu_context *smu,
-enum pp_xgmi_plpd_mode mode)
-{
-   struct amdgpu_device *adev = smu->adev;
-
-   /* The message only works on master die and NACK will be sent
-  back for other dies, only send it on master die */
-   if (adev->smuio.funcs->get_socket_id(adev) ||
-   adev->smuio.funcs->get_die_id(adev))
-   return 0;
-
-   if (mode == XGMI_PLPD_DEFAULT)
-   return smu_cmn_send_smc_msg_with_param(smu,
-  SMU_MSG_GmiPwrDnControl,
-  0, NULL);
-   else if (mode == XGMI_PLPD_DISALLOW)
-   return smu_cmn_send_smc_msg_with_param(smu,
-  SMU_MSG_GmiPwrDnControl,
-  1, NULL);
-   else
-   return -EINVAL;
-}
-
 static const struct throttling_logging_label {
uint32_t feature_mask;
const char *label;
@@ -2104,7 +2081,6 @@ static const struct pptable_funcs aldebaran_ppt_funcs = {
.set_soft_freq_limited_range = 

[PATCH v3 08/10] drm/amd/pm: Remove legacy interface for xgmi plpd

2024-05-13 Thread Lijo Lazar
Replace the legacy interface with amdgpu_dpm_set_pm_policy to set XGMI
PLPD mode. Also, xgmi_plpd sysfs node is not used by any client. Remove
that as well.

Signed-off-by: Lijo Lazar 
Reviewed-by: Hawking Zhang 
---
v2: No change
v3: Rebase to remove device_attr_id__xgmi_plpd_policy

 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c  |  4 +-
 drivers/gpu/drm/amd/pm/amdgpu_dpm.c   | 43 
 drivers/gpu/drm/amd/pm/amdgpu_pm.c| 68 ---
 drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   |  5 --
 drivers/gpu/drm/amd/pm/inc/amdgpu_pm.h|  1 -
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 27 
 drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h |  2 -
 7 files changed, 2 insertions(+), 148 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
index 4a14f9c1bfe8..821ba2309dec 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
@@ -1446,7 +1446,7 @@ static int amdgpu_ras_error_inject_xgmi(struct 
amdgpu_device *adev,
if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_DISALLOW))
dev_warn(adev->dev, "Failed to disallow df cstate");
 
-   ret1 = amdgpu_dpm_set_xgmi_plpd_mode(adev, XGMI_PLPD_DISALLOW);
+   ret1 = amdgpu_dpm_set_pm_policy(adev, PP_PM_POLICY_XGMI_PLPD, 
XGMI_PLPD_DISALLOW);
if (ret1 && ret1 != -EOPNOTSUPP)
dev_warn(adev->dev, "Failed to disallow XGMI power down");
 
@@ -1455,7 +1455,7 @@ static int amdgpu_ras_error_inject_xgmi(struct 
amdgpu_device *adev,
if (amdgpu_ras_intr_triggered())
return ret2;
 
-   ret1 = amdgpu_dpm_set_xgmi_plpd_mode(adev, XGMI_PLPD_DEFAULT);
+   ret1 = amdgpu_dpm_set_pm_policy(adev, PP_PM_POLICY_XGMI_PLPD, 
XGMI_PLPD_DEFAULT);
if (ret1 && ret1 != -EOPNOTSUPP)
dev_warn(adev->dev, "Failed to allow XGMI power down");
 
diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
index b443906484e7..cd169af35399 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
@@ -368,49 +368,6 @@ int amdgpu_dpm_set_df_cstate(struct amdgpu_device *adev,
return ret;
 }
 
-int amdgpu_dpm_get_xgmi_plpd_mode(struct amdgpu_device *adev, char **mode_desc)
-{
-   struct smu_context *smu = adev->powerplay.pp_handle;
-   int mode = XGMI_PLPD_NONE;
-
-   if (is_support_sw_smu(adev)) {
-   mode = smu->plpd_mode;
-   if (mode_desc == NULL)
-   return mode;
-   switch (smu->plpd_mode) {
-   case XGMI_PLPD_DISALLOW:
-   *mode_desc = "disallow";
-   break;
-   case XGMI_PLPD_DEFAULT:
-   *mode_desc = "default";
-   break;
-   case XGMI_PLPD_OPTIMIZED:
-   *mode_desc = "optimized";
-   break;
-   case XGMI_PLPD_NONE:
-   default:
-   *mode_desc = "none";
-   break;
-   }
-   }
-
-   return mode;
-}
-
-int amdgpu_dpm_set_xgmi_plpd_mode(struct amdgpu_device *adev, int mode)
-{
-   struct smu_context *smu = adev->powerplay.pp_handle;
-   int ret = -EOPNOTSUPP;
-
-   if (is_support_sw_smu(adev)) {
-   mutex_lock(&adev->pm.mutex);
-   ret = smu_set_xgmi_plpd_mode(smu, mode);
-   mutex_unlock(&adev->pm.mutex);
-   }
-
-   return ret;
-}
-
 ssize_t amdgpu_dpm_get_pm_policy_info(struct amdgpu_device *adev, char *buf)
 {
struct smu_context *smu = adev->powerplay.pp_handle;
diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
index 48da662da005..9cca4716ec43 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
@@ -2214,70 +2214,6 @@ static int pp_dpm_clk_default_attr_update(struct 
amdgpu_device *adev, struct amd
return 0;
 }
 
-/* Following items will be read out to indicate current plpd policy:
- *  - -1: none
- *  - 0: disallow
- *  - 1: default
- *  - 2: optimized
- */
-static ssize_t amdgpu_get_xgmi_plpd_policy(struct device *dev,
-  struct device_attribute *attr,
-  char *buf)
-{
-   struct drm_device *ddev = dev_get_drvdata(dev);
-   struct amdgpu_device *adev = drm_to_adev(ddev);
-   char *mode_desc = "none";
-   int mode;
-
-   if (amdgpu_in_reset(adev))
-   return -EPERM;
-   if (adev->in_suspend && !adev->in_runpm)
-   return -EPERM;
-
-   mode = amdgpu_dpm_get_xgmi_plpd_mode(adev, &mode_desc);
-
-   return sysfs_emit(buf, "%d: %s\n", mode, mode_desc);
-}
-
-/* Following argument value is expected from user to change plpd policy
- *  - arg 0: disallow plpd
- *  - arg 1: default policy
- *  - arg 2: optimized policy
- */

[PATCH v3 05/10] drm/amd/pm: Add xgmi plpd to SMU v13.0.6 pm_policy

2024-05-13 Thread Lijo Lazar
On SOCs with SMU v13.0.6, allow changing xgmi plpd policy through
pm_policy sysfs interface.

Signed-off-by: Lijo Lazar 
Reviewed-by: Hawking Zhang 
---
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 19 +--
 .../drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c  | 51 +--
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c| 27 ++
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h|  1 +
 4 files changed, 91 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c 
b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
index df9ff377ebfd..a8116ae4472a 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
@@ -1197,6 +1197,9 @@ static void smu_swctf_delayed_work_handler(struct 
work_struct *work)
 
 static void smu_init_xgmi_plpd_mode(struct smu_context *smu)
 {
+   struct smu_dpm_context *dpm_ctxt = &(smu->smu_dpm);
+   struct smu_dpm_policy_ctxt *policy_ctxt;
+
if (amdgpu_ip_version(smu->adev, MP1_HWIP, 0) == IP_VERSION(11, 0, 2)) {
smu->plpd_mode = XGMI_PLPD_DEFAULT;
return;
@@ -1204,10 +1207,20 @@ static void smu_init_xgmi_plpd_mode(struct smu_context 
*smu)
 
/* PMFW put PLPD into default policy after enabling the feature */
if (smu_feature_is_enabled(smu,
-  SMU_FEATURE_XGMI_PER_LINK_PWR_DWN_BIT))
+  SMU_FEATURE_XGMI_PER_LINK_PWR_DWN_BIT)) {
+   struct smu_dpm_policy *policy;
+
smu->plpd_mode = XGMI_PLPD_DEFAULT;
-   else
+   policy = smu_get_pm_policy(smu, PP_PM_POLICY_XGMI_PLPD);
+   if (policy)
+   policy->current_level = XGMI_PLPD_DEFAULT;
+   } else {
smu->plpd_mode = XGMI_PLPD_NONE;
+   policy_ctxt = dpm_ctxt->dpm_policies;
+   if (policy_ctxt)
+   policy_ctxt->policy_mask &=
+   ~BIT(PP_PM_POLICY_XGMI_PLPD);
+   }
 }
 
 static int smu_sw_init(void *handle)
@@ -3555,7 +3568,7 @@ struct smu_dpm_policy *smu_get_pm_policy(struct 
smu_context *smu,
policy_ctxt = dpm_ctxt->dpm_policies;
if (!policy_ctxt)
return NULL;
-
+   
for_each_set_bit(i, &policy_ctxt->policy_mask, PP_PM_POLICY_NUM) {
if (policy_ctxt->policies[i].policy_type == p_type)
return &policy_ctxt->policies[i];
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
index 0ed0b5326d35..173c5599279b 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
@@ -403,9 +403,45 @@ static int smu_v13_0_6_select_policy_soc_pstate(struct 
smu_context *smu,
return ret;
 }
 
+static int smu_v13_0_6_select_plpd_policy(struct smu_context *smu, int level)
+{
+   struct amdgpu_device *adev = smu->adev;
+   int ret, param;
+
+   switch (level) {
+   case XGMI_PLPD_DEFAULT:
+   param = PPSMC_PLPD_MODE_DEFAULT;
+   break;
+   case XGMI_PLPD_OPTIMIZED:
+   param = PPSMC_PLPD_MODE_OPTIMIZED;
+   break;
+   case XGMI_PLPD_DISALLOW:
+   param = 0;
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   if (level == XGMI_PLPD_DISALLOW)
+   ret = smu_cmn_send_smc_msg_with_param(
+   smu, SMU_MSG_GmiPwrDnControl, param, NULL);
+   else
+   /* change xgmi per-link power down policy */
+   ret = smu_cmn_send_smc_msg_with_param(
+   smu, SMU_MSG_SelectPLPDMode, param, NULL);
+
+   if (ret)
+   dev_err(adev->dev,
+   "select xgmi per-link power down policy %d failed\n",
+   level);
+
+   return ret;
+}
+
 static int smu_v13_0_6_allocate_dpm_context(struct smu_context *smu)
 {
	struct smu_dpm_context *smu_dpm = &smu->smu_dpm;
+   struct smu_dpm_policy *policy;
 
smu_dpm->dpm_context =
kzalloc(sizeof(struct smu_13_0_dpm_context), GFP_KERNEL);
@@ -413,11 +449,9 @@ static int smu_v13_0_6_allocate_dpm_context(struct 
smu_context *smu)
return -ENOMEM;
smu_dpm->dpm_context_size = sizeof(struct smu_13_0_dpm_context);
 
+   smu_dpm->dpm_policies =
+   kzalloc(sizeof(struct smu_dpm_policy_ctxt), GFP_KERNEL);
if (!(smu->adev->flags & AMD_IS_APU)) {
-   struct smu_dpm_policy *policy;
-
-   smu_dpm->dpm_policies =
-   kzalloc(sizeof(struct smu_dpm_policy_ctxt), GFP_KERNEL);
policy = &(smu_dpm->dpm_policies->policies[0]);
 
policy->policy_type = PP_PM_POLICY_SOC_PSTATE;
@@ -430,6 +464,15 @@ static int smu_v13_0_6_allocate_dpm_context(struct 
smu_context *smu)

[PATCH v3 04/10] drm/amd/pm: Add xgmi plpd policy to pm_policy

2024-05-13 Thread Lijo Lazar
Add support to set XGMI PLPD policy levels through pm_policy sysfs node.

Signed-off-by: Lijo Lazar 
Reviewed-by: Hawking Zhang 
---
 drivers/gpu/drm/amd/include/kgd_pp_interface.h | 1 +
 drivers/gpu/drm/amd/pm/amdgpu_pm.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/include/kgd_pp_interface.h 
b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
index 8ed9aa9a990d..4b20e2274313 100644
--- a/drivers/gpu/drm/amd/include/kgd_pp_interface.h
+++ b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
@@ -276,6 +276,7 @@ enum pp_xgmi_plpd_mode {
 enum pp_pm_policy {
PP_PM_POLICY_NONE = -1,
PP_PM_POLICY_SOC_PSTATE = 0,
+   PP_PM_POLICY_XGMI_PLPD,
PP_PM_POLICY_NUM,
 };
 
diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
index d62f1d1d6c84..48da662da005 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
@@ -2318,6 +2318,9 @@ static ssize_t amdgpu_set_pm_policy(struct device *dev,
if (strncmp(tmp, "soc_pstate", strlen("soc_pstate")) == 0) {
policy_type = PP_PM_POLICY_SOC_PSTATE;
tmp += strlen("soc_pstate");
+   } else if (strncmp(tmp, "xgmi", strlen("xgmi")) == 0) {
+   policy_type = PP_PM_POLICY_XGMI_PLPD;
+   tmp += strlen("xgmi");
} else {
return -EINVAL;
}
-- 
2.25.1



[PATCH v3 07/10] drm/amd/pm: Add xgmi plpd to arcturus pm_policy

2024-05-13 Thread Lijo Lazar
On arcturus, allow changing xgmi plpd policy through pm_policy sysfs
interface.

Signed-off-by: Lijo Lazar 
Reviewed-by: Hawking Zhang 
---
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c |  7 ++--
 .../gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c | 42 +++
 2 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c 
b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
index a8116ae4472a..a395967905d2 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
@@ -1199,19 +1199,20 @@ static void smu_init_xgmi_plpd_mode(struct smu_context 
*smu)
 {
struct smu_dpm_context *dpm_ctxt = &(smu->smu_dpm);
struct smu_dpm_policy_ctxt *policy_ctxt;
+   struct smu_dpm_policy *policy;
 
+   policy = smu_get_pm_policy(smu, PP_PM_POLICY_XGMI_PLPD);
if (amdgpu_ip_version(smu->adev, MP1_HWIP, 0) == IP_VERSION(11, 0, 2)) {
smu->plpd_mode = XGMI_PLPD_DEFAULT;
+   if (policy)
+   policy->current_level = XGMI_PLPD_DEFAULT;
return;
}
 
/* PMFW put PLPD into default policy after enabling the feature */
if (smu_feature_is_enabled(smu,
   SMU_FEATURE_XGMI_PER_LINK_PWR_DWN_BIT)) {
-   struct smu_dpm_policy *policy;
-
smu->plpd_mode = XGMI_PLPD_DEFAULT;
-   policy = smu_get_pm_policy(smu, PP_PM_POLICY_XGMI_PLPD);
if (policy)
policy->current_level = XGMI_PLPD_DEFAULT;
} else {
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c
index 623f6052f97e..84f7d4139bda 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c
@@ -283,9 +283,29 @@ static int arcturus_tables_init(struct smu_context *smu)
return 0;
 }
 
+static int arcturus_select_plpd_policy(struct smu_context *smu, int level)
+{
+   /* PPSMC_MSG_GmiPwrDnControl is supported by 54.23.0 and onwards */
+   if (smu->smc_fw_version < 0x00361700) {
+   dev_err(smu->adev->dev,
+   "XGMI power down control is only supported by PMFW 
54.23.0 and onwards\n");
+   return -EINVAL;
+   }
+
+   if (level == XGMI_PLPD_DEFAULT)
+   return smu_cmn_send_smc_msg_with_param(
+   smu, SMU_MSG_GmiPwrDnControl, 1, NULL);
+   else if (level == XGMI_PLPD_DISALLOW)
+   return smu_cmn_send_smc_msg_with_param(
+   smu, SMU_MSG_GmiPwrDnControl, 0, NULL);
+   else
+   return -EINVAL;
+}
+
 static int arcturus_allocate_dpm_context(struct smu_context *smu)
 {
	struct smu_dpm_context *smu_dpm = &smu->smu_dpm;
+   struct smu_dpm_policy *policy;
 
smu_dpm->dpm_context = kzalloc(sizeof(struct smu_11_0_dpm_context),
   GFP_KERNEL);
@@ -293,6 +313,20 @@ static int arcturus_allocate_dpm_context(struct 
smu_context *smu)
return -ENOMEM;
smu_dpm->dpm_context_size = sizeof(struct smu_11_0_dpm_context);
 
+   smu_dpm->dpm_policies =
+   kzalloc(sizeof(struct smu_dpm_policy_ctxt), GFP_KERNEL);
+
+   if (!smu_dpm->dpm_policies)
+   return -ENOMEM;
+
+   policy = &(smu_dpm->dpm_policies->policies[0]);
+   policy->policy_type = PP_PM_POLICY_XGMI_PLPD;
+   policy->level_mask = BIT(XGMI_PLPD_DISALLOW) | BIT(XGMI_PLPD_DEFAULT);
+   policy->current_level = XGMI_PLPD_DEFAULT;
+   policy->set_policy = arcturus_select_plpd_policy;
+   smu_cmn_generic_plpd_policy_desc(policy);
+   smu_dpm->dpm_policies->policy_mask |= BIT(PP_PM_POLICY_XGMI_PLPD);
+
return 0;
 }
 
@@ -403,6 +437,14 @@ static int arcturus_set_default_dpm_table(struct 
smu_context *smu)
dpm_table->max = dpm_table->dpm_levels[0].value;
}
 
+   /* XGMI PLPD is supported by 54.23.0 and onwards */
+   if (smu->smc_fw_version < 0x00361700) {
+   struct smu_dpm_context *smu_dpm = &smu->smu_dpm;
+
+   smu_dpm->dpm_policies->policy_mask &=
+   ~BIT(PP_PM_POLICY_XGMI_PLPD);
+   }
+
return 0;
 }
 
-- 
2.25.1
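A side note on the firmware gate in arcturus_select_plpd_policy() above: PMFW versions are packed one byte per component, so 54.23.0 corresponds to 0x00361700. A minimal standalone sketch of that encoding (helper names are hypothetical, not driver code):

```c
#include <assert.h>

/* Pack a PMFW version as 0x00MMmmpp (major, minor, patch), one byte each;
 * 54.23.0 therefore packs to 0x00361700. Hypothetical helpers. */
static unsigned int pack_fw_version(unsigned int major, unsigned int minor,
				    unsigned int patch)
{
	return (major << 16) | (minor << 8) | patch;
}

static int plpd_msg_supported(unsigned int smc_fw_version)
{
	/* Mirrors the driver's check: versions below 54.23.0 lack the message */
	return smc_fw_version >= pack_fw_version(54, 23, 0);
}
```

This makes the magic constant in the diff readable at a glance.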



[PATCH v3 03/10] drm/amd/pm: Add support to select pstate policy

2024-05-13 Thread Lijo Lazar
Add support to select pstate policy in SOCs with SMUv13.0.6

Signed-off-by: Lijo Lazar 
Reviewed-by: Hawking Zhang 
---
 .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c|  2 +
 .../drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c  | 71 +++
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c| 30 
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h|  1 +
 4 files changed, 104 insertions(+)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
index 9c0445fa9f9b..3a50076e44f0 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
@@ -531,10 +531,12 @@ int smu_v13_0_fini_smc_tables(struct smu_context *smu)
smu_table->watermarks_table = NULL;
smu_table->metrics_time = 0;
 
+   kfree(smu_dpm->dpm_policies);
kfree(smu_dpm->dpm_context);
kfree(smu_dpm->golden_dpm_context);
kfree(smu_dpm->dpm_current_power_state);
kfree(smu_dpm->dpm_request_power_state);
+   smu_dpm->dpm_policies = NULL;
smu_dpm->dpm_context = NULL;
smu_dpm->golden_dpm_context = NULL;
smu_dpm->dpm_context_size = 0;
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
index ed9c4866b6e4..0ed0b5326d35 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
@@ -174,6 +174,7 @@ static const struct cmn2asic_msg_mapping 
smu_v13_0_6_message_map[SMU_MSG_MAX_COU
MSG_MAP(McaBankCeDumpDW, PPSMC_MSG_McaBankCeDumpDW, 
SMU_MSG_RAS_PRI),
MSG_MAP(SelectPLPDMode,  PPSMC_MSG_SelectPLPDMode,  
0),
MSG_MAP(RmaDueToBadPageThreshold,
PPSMC_MSG_RmaDueToBadPageThreshold,0),
+   MSG_MAP(SelectPstatePolicy,  
PPSMC_MSG_SelectPstatePolicy,  0),
 };
 
 // clang-format on
@@ -369,6 +370,39 @@ static int smu_v13_0_6_tables_init(struct smu_context *smu)
return 0;
 }
 
+static int smu_v13_0_6_select_policy_soc_pstate(struct smu_context *smu,
+   int policy)
+{
+   struct amdgpu_device *adev = smu->adev;
+   int ret, param;
+
+   switch (policy) {
+   case SOC_PSTATE_DEFAULT:
+   param = 0;
+   break;
+   case SOC_PSTATE_0:
+   param = 1;
+   break;
+   case SOC_PSTATE_1:
+   param = 2;
+   break;
+   case SOC_PSTATE_2:
+   param = 3;
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   ret = smu_cmn_send_smc_msg_with_param(smu, SMU_MSG_SelectPstatePolicy,
+ param, NULL);
+
+   if (ret)
+   dev_err(adev->dev, "select soc pstate policy %d failed",
+   policy);
+
+   return ret;
+}
+
 static int smu_v13_0_6_allocate_dpm_context(struct smu_context *smu)
 {
	struct smu_dpm_context *smu_dpm = &smu->smu_dpm;
@@ -379,6 +413,24 @@ static int smu_v13_0_6_allocate_dpm_context(struct 
smu_context *smu)
return -ENOMEM;
smu_dpm->dpm_context_size = sizeof(struct smu_13_0_dpm_context);
 
+   if (!(smu->adev->flags & AMD_IS_APU)) {
+   struct smu_dpm_policy *policy;
+
+   smu_dpm->dpm_policies =
+   kzalloc(sizeof(struct smu_dpm_policy_ctxt), GFP_KERNEL);
+   policy = &(smu_dpm->dpm_policies->policies[0]);
+
+   policy->policy_type = PP_PM_POLICY_SOC_PSTATE;
+   policy->level_mask = BIT(SOC_PSTATE_DEFAULT) |
+BIT(SOC_PSTATE_0) | BIT(SOC_PSTATE_1) |
+BIT(SOC_PSTATE_2);
+   policy->current_level = SOC_PSTATE_DEFAULT;
+   policy->set_policy = smu_v13_0_6_select_policy_soc_pstate;
+   smu_cmn_generic_soc_policy_desc(policy);
+   smu_dpm->dpm_policies->policy_mask |=
+   BIT(PP_PM_POLICY_SOC_PSTATE);
+   }
+
return 0;
 }
 
@@ -639,6 +691,15 @@ static int smu_v13_0_6_get_dpm_level_count(struct 
smu_context *smu,
return ret;
 }
 
+static void smu_v13_0_6_pm_policy_init(struct smu_context *smu)
+{
+   struct smu_dpm_policy *policy;
+
+   policy = smu_get_pm_policy(smu, PP_PM_POLICY_SOC_PSTATE);
+   if (policy)
+   policy->current_level = SOC_PSTATE_DEFAULT;
+}
+
 static int smu_v13_0_6_set_default_dpm_table(struct smu_context *smu)
 {
struct smu_13_0_dpm_context *dpm_context = smu->smu_dpm.dpm_context;
@@ -668,6 +729,16 @@ static int smu_v13_0_6_set_default_dpm_table(struct 
smu_context *smu)
 
smu_v13_0_6_setup_driver_pptable(smu);
 
+   /* DPM policy not supported in older firmwares */
+   if (!(smu->adev->flags & AMD_IS_APU) &&

[PATCH v3 02/10] drm/amd/pm: Update PMFW messages for SMUv13.0.6

2024-05-13 Thread Lijo Lazar
Add PMF message to select a Pstate policy in SOCs with SMU v13.0.6.

Signed-off-by: Lijo Lazar 
Reviewed-by: Hawking Zhang 
---
 drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu_v13_0_6_ppsmc.h | 3 ++-
 drivers/gpu/drm/amd/pm/swsmu/inc/smu_types.h | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu_v13_0_6_ppsmc.h 
b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu_v13_0_6_ppsmc.h
index 86758051cb93..41cb681927e2 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu_v13_0_6_ppsmc.h
+++ b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu_v13_0_6_ppsmc.h
@@ -92,7 +92,8 @@
 #define PPSMC_MSG_McaBankCeDumpDW   0x3B
 #define PPSMC_MSG_SelectPLPDMode0x40
 #define PPSMC_MSG_RmaDueToBadPageThreshold  0x43
-#define PPSMC_Message_Count 0x44
+#define PPSMC_MSG_SelectPstatePolicy0x44
+#define PPSMC_Message_Count 0x45
 
 //PPSMC Reset Types for driver msg argument
 #define PPSMC_RESET_TYPE_DRIVER_MODE_1_RESET0x1
diff --git a/drivers/gpu/drm/amd/pm/swsmu/inc/smu_types.h 
b/drivers/gpu/drm/amd/pm/swsmu/inc/smu_types.h
index c48214e3dc8e..dff36bd7a17c 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/inc/smu_types.h
+++ b/drivers/gpu/drm/amd/pm/swsmu/inc/smu_types.h
@@ -272,7 +272,8 @@
__SMU_DUMMY_MAP(SetSoftMinVpe), \
__SMU_DUMMY_MAP(GetMetricsVersion), \
__SMU_DUMMY_MAP(EnableUCLKShadow), \
-   __SMU_DUMMY_MAP(RmaDueToBadPageThreshold),
+   __SMU_DUMMY_MAP(RmaDueToBadPageThreshold),\
+   __SMU_DUMMY_MAP(SelectPstatePolicy),
 
 #undef __SMU_DUMMY_MAP
 #define __SMU_DUMMY_MAP(type)  SMU_MSG_##type
-- 
2.25.1



[PATCH v3 01/10] drm/amd/pm: Add support for DPM policies

2024-05-13 Thread Lijo Lazar
Add support to set/get information about different DPM policies. The
support is only available on SOCs which use swsmu architecture.

A DPM policy type may be defined with different levels. For example, a
policy may be defined to select Pstate preference and then later a
pstate preference may be chosen.

Signed-off-by: Lijo Lazar 
Reviewed-by: Hawking Zhang 
---
v2: Add NULL checks before accessing smu_dpm_policy_ctxt
v3: Rebase to add device_attr_id__pm_policy
 
 .../gpu/drm/amd/include/kgd_pp_interface.h| 16 +++
 drivers/gpu/drm/amd/pm/amdgpu_dpm.c   | 29 ++
 drivers/gpu/drm/amd/pm/amdgpu_pm.c| 92 +
 drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   |  4 +
 drivers/gpu/drm/amd/pm/inc/amdgpu_pm.h|  1 +
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 98 +++
 drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h | 29 ++
 7 files changed, 269 insertions(+)

diff --git a/drivers/gpu/drm/amd/include/kgd_pp_interface.h 
b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
index 805c9d37a2b4..8ed9aa9a990d 100644
--- a/drivers/gpu/drm/amd/include/kgd_pp_interface.h
+++ b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
@@ -273,6 +273,22 @@ enum pp_xgmi_plpd_mode {
XGMI_PLPD_COUNT,
 };
 
+enum pp_pm_policy {
+   PP_PM_POLICY_NONE = -1,
+   PP_PM_POLICY_SOC_PSTATE = 0,
+   PP_PM_POLICY_NUM,
+};
+
+enum pp_policy_soc_pstate {
+   SOC_PSTATE_DEFAULT = 0,
+   SOC_PSTATE_0,
+   SOC_PSTATE_1,
+   SOC_PSTATE_2,
+   SOC_PSTAT_COUNT,
+};
+
+#define PP_POLICY_MAX_LEVELS 5
+
 #define PP_GROUP_MASK0xF000
 #define PP_GROUP_SHIFT   28
 
diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
index eee919577b44..b443906484e7 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
@@ -411,6 +411,35 @@ int amdgpu_dpm_set_xgmi_plpd_mode(struct amdgpu_device 
*adev, int mode)
return ret;
 }
 
+ssize_t amdgpu_dpm_get_pm_policy_info(struct amdgpu_device *adev, char *buf)
+{
+   struct smu_context *smu = adev->powerplay.pp_handle;
+   int ret = -EOPNOTSUPP;
+
+   if (is_support_sw_smu(adev)) {
+   mutex_lock(&adev->pm.mutex);
+   ret = smu_get_pm_policy_info(smu, buf);
+   mutex_unlock(&adev->pm.mutex);
+   }
+
+   return ret;
+}
+
+int amdgpu_dpm_set_pm_policy(struct amdgpu_device *adev, int policy_type,
+int policy_level)
+{
+   struct smu_context *smu = adev->powerplay.pp_handle;
+   int ret = -EOPNOTSUPP;
+
+   if (is_support_sw_smu(adev)) {
+   mutex_lock(&adev->pm.mutex);
+   ret = smu_set_pm_policy(smu, policy_type, policy_level);
+   mutex_unlock(&adev->pm.mutex);
+   }
+
+   return ret;
+}
+
 int amdgpu_dpm_enable_mgpu_fan_boost(struct amdgpu_device *adev)
 {
void *pp_handle = adev->powerplay.pp_handle;
diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
index 110f2fc31754..d62f1d1d6c84 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
@@ -2278,6 +2278,96 @@ static ssize_t amdgpu_set_xgmi_plpd_policy(struct device 
*dev,
return count;
 }
 
+static ssize_t amdgpu_get_pm_policy(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct drm_device *ddev = dev_get_drvdata(dev);
+   struct amdgpu_device *adev = drm_to_adev(ddev);
+
+   if (amdgpu_in_reset(adev))
+   return -EPERM;
+   if (adev->in_suspend && !adev->in_runpm)
+   return -EPERM;
+
+   return amdgpu_dpm_get_pm_policy_info(adev, buf);
+}
+
+static ssize_t amdgpu_set_pm_policy(struct device *dev,
+   struct device_attribute *attr,
+   const char *buf, size_t count)
+{
+   struct drm_device *ddev = dev_get_drvdata(dev);
+   struct amdgpu_device *adev = drm_to_adev(ddev);
+   int policy_type, ret, num_params = 0;
+   char delimiter[] = " \n\t";
+   char tmp_buf[128];
+   char *tmp, *param;
+   long val;
+
+   if (amdgpu_in_reset(adev))
+   return -EPERM;
+   if (adev->in_suspend && !adev->in_runpm)
+   return -EPERM;
+
+   count = min(count, sizeof(tmp_buf));
+   memcpy(tmp_buf, buf, count);
+   tmp_buf[count - 1] = '\0';
+   tmp = tmp_buf;
+
+   tmp = skip_spaces(tmp);
+   if (strncmp(tmp, "soc_pstate", strlen("soc_pstate")) == 0) {
+   policy_type = PP_PM_POLICY_SOC_PSTATE;
+   tmp += strlen("soc_pstate");
+   } else {
+   return -EINVAL;
+   }
+
+   tmp = skip_spaces(tmp);
+   while ((param = strsep(&tmp, delimiter))) {
+   if (!strlen(param)) {
+   tmp = skip_spaces(tmp);
+   continue;
+   }
+   ret = kstrtol(param, 0, &val);

[PATCH v3 00/10] Add PM policy interfaces

2024-05-13 Thread Lijo Lazar
This series adds APIs to get the supported PM policies and also set them. A PM
policy type is a predefined policy type supported by an SOC and each policy may
define two or more levels to choose from. A user can select the appropriate
level through amdgpu_dpm_set_pm_policy() or through sysfs node pm_policy. Based
on the specific PM functional area, multiple PM policies may be defined for an
SOC For ex: a policy may be defined to set the right setting for XGMI per link
power down feature and another may be defined to select the SOC Pstate
preferences.
 
Presently, XGMI PLPD and SOC Pstate policy types are supported. It also removes
the legacy sysfs interface to set XGMI PLPD, as it is not used by any client such as
the SMI tool.

v2:
 Add NULL checks to avoid access on SOCs which don't support any policy.

v3:
 Rebase and add documentation patch
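To illustrate the sysfs string format the series settles on — "<policy_type> <level>", with the leading type token matched by strncmp() prefix comparison in amdgpu_set_pm_policy() — here is a minimal userspace-style sketch. The enum values mirror kgd_pp_interface.h; the helper name is hypothetical:

```c
#include <assert.h>
#include <string.h>

/* Mirrors the pm_policy enum added in patch 01 (kgd_pp_interface.h). */
enum pp_pm_policy {
	PP_PM_POLICY_NONE = -1,
	PP_PM_POLICY_SOC_PSTATE = 0,
	PP_PM_POLICY_XGMI_PLPD,
	PP_PM_POLICY_NUM,
};

/* Hypothetical helper: match the leading policy-type token of a pm_policy
 * write the same way the driver does, via strncmp() against known prefixes. */
static int parse_policy_type(const char *buf)
{
	if (strncmp(buf, "soc_pstate", strlen("soc_pstate")) == 0)
		return PP_PM_POLICY_SOC_PSTATE;
	if (strncmp(buf, "xgmi", strlen("xgmi")) == 0)
		return PP_PM_POLICY_XGMI_PLPD;
	return PP_PM_POLICY_NONE;	/* driver returns -EINVAL here */
}
```

The remainder of the string is then parsed as the level argument for the selected policy.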

Lijo Lazar (10):
  drm/amd/pm: Add support for DPM policies
  drm/amd/pm: Update PMFW messages for SMUv13.0.6
  drm/amd/pm: Add support to select pstate policy
  drm/amd/pm: Add xgmi plpd policy to pm_policy
  drm/amd/pm: Add xgmi plpd to SMU v13.0.6 pm_policy
  drm/amd/pm: Add xgmi plpd to aldebaran pm_policy
  drm/amd/pm: Add xgmi plpd to arcturus pm_policy
  drm/amd/pm: Remove legacy interface for xgmi plpd
  drm/amd/pm: Remove unused interface to set plpd
  Documentation/amdgpu: Add PM policy documentation

 Documentation/gpu/amdgpu/thermal.rst  |   6 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c  |   4 +-
 .../gpu/drm/amd/include/kgd_pp_interface.h|  17 ++
 drivers/gpu/drm/amd/pm/amdgpu_dpm.c   |  32 ++--
 drivers/gpu/drm/amd/pm/amdgpu_pm.c| 133 +++
 drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   |   9 +-
 drivers/gpu/drm/amd/pm/inc/amdgpu_pm.h|   2 +-
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 113 +++--
 drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h |  37 -
 .../pm/swsmu/inc/pmfw_if/smu_v13_0_6_ppsmc.h  |   3 +-
 drivers/gpu/drm/amd/pm/swsmu/inc/smu_types.h  |   3 +-
 .../gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c |  64 +---
 .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c|  59 ---
 .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c|   2 +
 .../drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c  | 153 +-
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c|  57 +++
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h|   2 +
 17 files changed, 527 insertions(+), 169 deletions(-)

-- 
2.25.1



RE: [PATCH] drm/amdgpu: add initial value for gfx12 AGP aperture

2024-05-13 Thread Gao, Likun
[AMD Official Use Only - AMD Internal Distribution Only]

This patch was
Reviewed-by: Likun Gao .

Regards,
Likun

-Original Message-
From: Min, Frank 
Sent: Monday, May 13, 2024 4:56 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Koenig, Christian 
; Zhang, Hawking ; Gao, Likun 

Subject: [PATCH] drm/amdgpu: add initial value for gfx12 AGP aperture

[AMD Official Use Only - AMD Internal Distribution Only]

From: Frank Min 

add initial value for gfx12 AGP aperture

Signed-off-by: Frank Min 
---
 drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
index 34264a33dcdf..b876300bb9f6 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
@@ -622,6 +622,7 @@ static void gmc_v12_0_vram_gtt_location(struct 
amdgpu_device *adev,

base = adev->mmhub.funcs->get_fb_location(adev);

+   amdgpu_gmc_set_agp_default(adev, mc);
amdgpu_gmc_vram_location(adev, &adev->gmc, base);
amdgpu_gmc_gart_location(adev, mc, AMDGPU_GART_PLACEMENT_LOW);
if (!amdgpu_sriov_vf(adev) && (amdgpu_agp == 1))
--
2.34.1




[PATCH 4/5] drm/amdgpu: Fix null pointer dereference to bo

2024-05-13 Thread Ma Jun
Check bo before using it

Signed-off-by: Ma Jun 
---
 drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
index 2b7b67916c1d..8269fd6828bf 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
@@ -497,9 +497,8 @@ static void gmc_v12_0_get_vm_pte(struct amdgpu_device *adev,
 uint64_t *flags)
 {
struct amdgpu_bo *bo = mapping->bo_va->base.bo;
-   struct amdgpu_device *bo_adev = amdgpu_ttm_adev(bo->tbo.bdev);
-   bool coherent = bo->flags & AMDGPU_GEM_CREATE_COHERENT;
-   bool is_system = bo->tbo.resource->mem_type == TTM_PL_SYSTEM;
+   struct amdgpu_device *bo_adev;
+   bool coherent, is_system;
 
 
*flags &= ~AMDGPU_PTE_EXECUTABLE;
@@ -515,13 +514,20 @@ static void gmc_v12_0_get_vm_pte(struct amdgpu_device 
*adev,
*flags &= ~AMDGPU_PTE_VALID;
}
 
-   if (bo && bo->flags & (AMDGPU_GEM_CREATE_COHERENT |
+   if (!bo)
+   return;
+
+   if (bo->flags & (AMDGPU_GEM_CREATE_COHERENT |
   AMDGPU_GEM_CREATE_UNCACHED))
*flags = (*flags & ~AMDGPU_PTE_MTYPE_GFX12_MASK) |
 AMDGPU_PTE_MTYPE_GFX12(MTYPE_UC);
 
+   bo_adev = amdgpu_ttm_adev(bo->tbo.bdev);
+   coherent = bo->flags & AMDGPU_GEM_CREATE_COHERENT;
+   is_system = bo->tbo.resource->mem_type == TTM_PL_SYSTEM;
+
/* WA for HW bug */
-   if ((bo && is_system) || ((bo_adev != adev) && coherent))
+   if (is_system || ((bo_adev != adev) && coherent))
*flags |= AMDGPU_PTE_MTYPE_GFX12(MTYPE_NC);
 
 }
-- 
2.34.1
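The NULL-dereference fixes in this series share one shape: values derived from a possibly-NULL pointer were computed in the declarations, before the NULL check ran. A minimal standalone illustration of the corrected ordering (hypothetical types, not driver code):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver structures. */
struct bo { int flags; };
struct mapping { struct bo *bo; };

static int get_flags(const struct mapping *m)
{
	/* Buggy form: int flags = m->bo->flags; at declaration time,
	 * before m->bo was checked. Corrected form: check first ... */
	if (!m->bo)
		return -1;
	/* ... then derive values from the pointer. */
	return m->bo->flags;
}
```

Deferring the derivation until after the check, as these patches do, is the whole fix.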



[PATCH 5/5] drm/amdgpu: Remove dead code in amdgpu_ras_add_mca_err_addr

2024-05-13 Thread Ma Jun
Remove dead code in amdgpu_ras_add_mca_err_addr

Signed-off-by: Ma Jun 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 13 -
 1 file changed, 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 6da02a209890..0cf67923c0fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -4292,21 +4292,8 @@ static struct ras_err_info 
*amdgpu_ras_error_get_info(struct ras_err_data *err_d
 
 void amdgpu_ras_add_mca_err_addr(struct ras_err_info *err_info, struct 
ras_err_addr *err_addr)
 {
-   struct ras_err_addr *mca_err_addr;
-
/* This function will be retired. */
return;
-   mca_err_addr = kzalloc(sizeof(*mca_err_addr), GFP_KERNEL);
-   if (!mca_err_addr)
-   return;
-
-   INIT_LIST_HEAD(&mca_err_addr->node);
-
-   mca_err_addr->err_status = err_addr->err_status;
-   mca_err_addr->err_ipid = err_addr->err_ipid;
-   mca_err_addr->err_addr = err_addr->err_addr;
-
-   list_add_tail(&mca_err_addr->node, &err_info->err_addr_list);
 }
 
 void amdgpu_ras_del_mca_err_addr(struct ras_err_info *err_info, struct 
ras_err_addr *mca_err_addr)
-- 
2.34.1



[PATCH 3/5] drm/amdgpu: Fix null pointer dereference to aca_handle

2024-05-13 Thread Ma Jun
Check handle pointer before using it

Signed-off-by: Ma Jun 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
index 28febf33fb1b..e969a7d77b4d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
@@ -279,7 +279,7 @@ static struct aca_bank_error *get_bank_error(struct 
aca_error *aerr, struct aca_
 int aca_error_cache_log_bank_error(struct aca_handle *handle, struct 
aca_bank_info *info,
   enum aca_error_type type, u64 count)
 {
-   struct aca_error_cache *error_cache = &handle->error_cache;
+   struct aca_error_cache *error_cache;
struct aca_bank_error *bank_error;
struct aca_error *aerr;
 
@@ -289,6 +289,10 @@ int aca_error_cache_log_bank_error(struct aca_handle 
*handle, struct aca_bank_in
if (!count)
return 0;
 
+   error_cache = &handle->error_cache;
+   if (!error_cache)
+   return -EINVAL;
+
	aerr = &error_cache->errors[type];
bank_error = get_bank_error(aerr, info);
if (!bank_error)
-- 
2.34.1



[PATCH 2/5] drm/amdgpu: Fix the null pointer dereference to ras_manager

2024-05-13 Thread Ma Jun
Check ras_manager before using it

Signed-off-by: Ma Jun 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 1dd13ed3b7b5..6da02a209890 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2172,12 +2172,17 @@ static void amdgpu_ras_interrupt_process_handler(struct 
work_struct *work)
 int amdgpu_ras_interrupt_dispatch(struct amdgpu_device *adev,
struct ras_dispatch_if *info)
 {
-   struct ras_manager *obj = amdgpu_ras_find_obj(adev, &info->head);
-   struct ras_ih_data *data = &obj->ih_data;
+   struct ras_manager *obj;
+   struct ras_ih_data *data;
 
+   obj = amdgpu_ras_find_obj(adev, &info->head);
if (!obj)
return -EINVAL;
 
+   data = &obj->ih_data;
+   if (!data)
+   return -EINVAL;
+
if (data->inuse == 0)
return 0;
 
-- 
2.34.1



[PATCH 1/5] drm/amdgpu/pm: Fix the null pointer dereference for smu7

2024-05-13 Thread Ma Jun
Optimize the code to avoid passing a null pointer (hwmgr->backend)
to the function smu7_update_edc_leakage_table.

Signed-off-by: Ma Jun 
---
 .../drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c   | 50 +--
 1 file changed, 24 insertions(+), 26 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c 
b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c
index 123af237878f..632a25957477 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c
@@ -2957,6 +2957,7 @@ static int smu7_update_edc_leakage_table(struct pp_hwmgr 
*hwmgr)
 
 static int smu7_hwmgr_backend_init(struct pp_hwmgr *hwmgr)
 {
+   struct amdgpu_device *adev = hwmgr->adev;
struct smu7_hwmgr *data;
int result = 0;
 
@@ -2993,40 +2994,37 @@ static int smu7_hwmgr_backend_init(struct pp_hwmgr 
*hwmgr)
/* Initalize Dynamic State Adjustment Rule Settings */
result = phm_initializa_dynamic_state_adjustment_rule_settings(hwmgr);
 
-   if (0 == result) {
-   struct amdgpu_device *adev = hwmgr->adev;
+   if (result)
+   goto fail;
 
-   data->is_tlu_enabled = false;
+   data->is_tlu_enabled = false;
 
-   hwmgr->platform_descriptor.hardwareActivityPerformanceLevels =
+   hwmgr->platform_descriptor.hardwareActivityPerformanceLevels =

SMU7_MAX_HARDWARE_POWERLEVELS;
-   hwmgr->platform_descriptor.hardwarePerformanceLevels = 2;
-   hwmgr->platform_descriptor.minimumClocksReductionPercentage = 
50;
+   hwmgr->platform_descriptor.hardwarePerformanceLevels = 2;
+   hwmgr->platform_descriptor.minimumClocksReductionPercentage = 50;
 
-   data->pcie_gen_cap = adev->pm.pcie_gen_mask;
-   if (data->pcie_gen_cap & CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3)
-   data->pcie_spc_cap = 20;
-   else
-   data->pcie_spc_cap = 16;
-   data->pcie_lane_cap = adev->pm.pcie_mlw_mask;
-
-   hwmgr->platform_descriptor.vbiosInterruptId = 0x2400; /* 
IRQ_SOURCE1_SW_INT */
-/* The true clock step depends on the frequency, typically 4.5 or 9 MHz. Here 
we use 5. */
-   hwmgr->platform_descriptor.clockStep.engineClock = 500;
-   hwmgr->platform_descriptor.clockStep.memoryClock = 500;
-   smu7_thermal_parameter_init(hwmgr);
-   } else {
-   /* Ignore return value in here, we are cleaning up a mess. */
-   smu7_hwmgr_backend_fini(hwmgr);
-   }
+   data->pcie_gen_cap = adev->pm.pcie_gen_mask;
+   if (data->pcie_gen_cap & CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3)
+   data->pcie_spc_cap = 20;
+   else
+   data->pcie_spc_cap = 16;
+   data->pcie_lane_cap = adev->pm.pcie_mlw_mask;
+
+   hwmgr->platform_descriptor.vbiosInterruptId = 0x2400; /* 
IRQ_SOURCE1_SW_INT */
+   /* The true clock step depends on the frequency, typically 4.5 or 9 
MHz. Here we use 5. */
+   hwmgr->platform_descriptor.clockStep.engineClock = 500;
+   hwmgr->platform_descriptor.clockStep.memoryClock = 500;
+   smu7_thermal_parameter_init(hwmgr);
 
result = smu7_update_edc_leakage_table(hwmgr);
-   if (result) {
-   smu7_hwmgr_backend_fini(hwmgr);
-   return result;
-   }
+   if (result)
+   goto fail;
 
return 0;
+fail:
+   smu7_hwmgr_backend_fini(hwmgr);
+   return result;
 }
 
 static int smu7_force_dpm_highest(struct pp_hwmgr *hwmgr)
-- 
2.34.1
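The restructuring in this smu7 patch converts nested success branches into the kernel's goto-based cleanup idiom: every failing step jumps to a single label that undoes the initialization, instead of duplicating the cleanup call at each exit. A standalone sketch (hypothetical functions, not driver code):

```c
#include <assert.h>

/* Hypothetical init/cleanup stand-ins for the hwmgr backend. */
static int init_step(int ok)   { return ok ? 0 : -1; }
static int cleanup_calls;
static void backend_fini(void) { cleanup_calls++; }

static int backend_init(int step1_ok, int step2_ok)
{
	int result;

	result = init_step(step1_ok);
	if (result)
		goto fail;

	result = init_step(step2_ok);
	if (result)
		goto fail;

	return 0;
fail:
	backend_fini();		/* single cleanup path for all failures */
	return result;
}
```

This keeps the success path unindented and guarantees the cleanup runs exactly once per failed init.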



[PATCH] drm/amdgpu: add initial value for gfx12 AGP aperture

2024-05-13 Thread Min, Frank
[AMD Official Use Only - AMD Internal Distribution Only]

From: Frank Min 

add initial value for gfx12 AGP aperture

Signed-off-by: Frank Min 
---
 drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
index 34264a33dcdf..b876300bb9f6 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
@@ -622,6 +622,7 @@ static void gmc_v12_0_vram_gtt_location(struct 
amdgpu_device *adev,

base = adev->mmhub.funcs->get_fb_location(adev);

+   amdgpu_gmc_set_agp_default(adev, mc);
amdgpu_gmc_vram_location(adev, &adev->gmc, base);
amdgpu_gmc_gart_location(adev, mc, AMDGPU_GART_PLACEMENT_LOW);
if (!amdgpu_sriov_vf(adev) && (amdgpu_agp == 1))
--
2.34.1



Re: [PATCH v2 2/2] drm/tests: Add a unit test for range bias allocation

2024-05-13 Thread Matthew Auld

On 12/05/2024 08:59, Arunpravin Paneer Selvam wrote:

Allocate cleared blocks in the bias range when the DRM
buddy's clear avail is zero. This will validate the bias
range allocation in scenarios like system boot when no
cleared blocks are available and exercise the fallback
path too. The resulting blocks should always be dirty.

Signed-off-by: Arunpravin Paneer Selvam 
---
  drivers/gpu/drm/tests/drm_buddy_test.c | 35 ++
  1 file changed, 35 insertions(+)

diff --git a/drivers/gpu/drm/tests/drm_buddy_test.c 
b/drivers/gpu/drm/tests/drm_buddy_test.c
index e3b50e240d36..a194f271bc55 100644
--- a/drivers/gpu/drm/tests/drm_buddy_test.c
+++ b/drivers/gpu/drm/tests/drm_buddy_test.c
@@ -26,6 +26,8 @@ static void drm_test_buddy_alloc_range_bias(struct kunit 
*test)
u32 mm_size, ps, bias_size, bias_start, bias_end, bias_rem;
DRM_RND_STATE(prng, random_seed);
unsigned int i, count, *order;
+   struct drm_buddy_block *block;
+   unsigned long flags;
struct drm_buddy mm;
LIST_HEAD(allocated);
  
@@ -222,6 +224,39 @@ static void drm_test_buddy_alloc_range_bias(struct kunit *test)
  
	drm_buddy_free_list(&mm, &allocated, 0);

	drm_buddy_fini(&mm);
+
+   /*
+* Allocate cleared blocks in the bias range when the DRM buddy's clear 
avail is
+* zero. This will validate the bias range allocation in scenarios like 
system boot
+* when no cleared blocks are available and exercise the fallback path 
too. The resulting
+* blocks should always be dirty.
+*/
+
+   KUNIT_ASSERT_FALSE_MSG(test, drm_buddy_init(&mm, mm_size, ps),
+  "buddy_init failed\n");
+   mm.clear_avail = 0;


Should already be zero, right? Maybe make this an assert instead?


+
+   bias_start = round_up(prandom_u32_state(&prng) % (mm_size - ps), ps);
+   bias_end = round_up(bias_start + prandom_u32_state(&prng) % (mm_size - 
bias_start), ps);
+   bias_end = max(bias_end, bias_start + ps);
+   bias_rem = bias_end - bias_start;
+
+   flags = DRM_BUDDY_CLEAR_ALLOCATION | DRM_BUDDY_RANGE_ALLOCATION;
+   u32 size = max(round_up(prandom_u32_state(&prng) % bias_rem, ps), ps);


u32 declaration should be moved to above?

Otherwise,
Reviewed-by: Matthew Auld 


+
+   KUNIT_ASSERT_FALSE_MSG(test,
+  drm_buddy_alloc_blocks(&mm, bias_start,
+ bias_end, size, ps,
+ &allocated,
+ flags),
+  "buddy_alloc failed with bias(%x-%x), size=%u, 
ps=%u\n",
+  bias_start, bias_end, size, ps);
+
+   list_for_each_entry(block, &allocated, link)
+   KUNIT_EXPECT_EQ(test, drm_buddy_block_is_clear(block), false);
+
+   drm_buddy_free_list(&mm, &allocated, 0);
+   drm_buddy_fini(&mm);
  }
  
  static void drm_test_buddy_alloc_clear(struct kunit *test)


RE: [PATCH 18/22 V3] drm/amd/pm: check negative return for table entries

2024-05-13 Thread Huang, Tim
[AMD Official Use Only - AMD Internal Distribution Only]

This patch is,

Reviewed-by: Tim Huang 



> -Original Message-
> From: Jesse Zhang 
> Sent: Monday, May 13, 2024 4:06 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Koenig, Christian
> ; Huang, Tim ; Zhang,
> Jesse(Jie) ; Zhang, Jesse(Jie)
> 
> Subject: [PATCH 18/22 V3] drm/amd/pm: check negative return for table
> entries
>
> Function hwmgr->hwmgr_func->get_num_of_pp_table_entries(hwmgr)
> returns a negative number
>
> Signed-off-by: Jesse Zhang 
> Suggested-by: Tim Huang 
> ---
>  drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c | 13 -
>  1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c
> b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c
> index f4bd8e9357e2..18f00038d844 100644
> --- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c
> +++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c
> @@ -30,9 +30,8 @@ int psm_init_power_state_table(struct pp_hwmgr
> *hwmgr)  {
>   int result;
>   unsigned int i;
> - unsigned int table_entries;
>   struct pp_power_state *state;
> - int size;
> + int size, table_entries;
>
>   if (hwmgr->hwmgr_func->get_num_of_pp_table_entries == NULL)
>   return 0;
> @@ -40,15 +39,19 @@ int psm_init_power_state_table(struct pp_hwmgr
> *hwmgr)
>   if (hwmgr->hwmgr_func->get_power_state_size == NULL)
>   return 0;
>
> - hwmgr->num_ps = table_entries =
> hwmgr->hwmgr_func->get_num_of_pp_table_entries(hwmgr);
> + table_entries =
> hwmgr->hwmgr_func->get_num_of_pp_table_entries(hwmgr);
>
> - hwmgr->ps_size = size =
> hwmgr->hwmgr_func->get_power_state_size(hwmgr) +
> + size = hwmgr->hwmgr_func->get_power_state_size(hwmgr) +
> sizeof(struct pp_power_state);
>
> - if (table_entries == 0 || size == 0) {
> + if (table_entries <= 0 || size == 0) {
>   pr_warn("Please check whether power state management is
> supported on this asic\n");
> + hwmgr->num_ps = 0;
> + hwmgr->ps_size = 0;
>   return 0;
>   }
> + hwmgr->num_ps = table_entries;
> + hwmgr->ps_size = size;
>
>   hwmgr->ps = kcalloc(table_entries, size, GFP_KERNEL);
>   if (hwmgr->ps == NULL)
> --
> 2.25.1



[PATCH 18/22 V3] drm/amd/pm: check negative return for table entries

2024-05-13 Thread Jesse Zhang
Function hwmgr->hwmgr_func->get_num_of_pp_table_entries(hwmgr) returns a 
negative number

Signed-off-by: Jesse Zhang 
Suggested-by: Tim Huang 
---
 drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c 
b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c
index f4bd8e9357e2..18f00038d844 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c
@@ -30,9 +30,8 @@ int psm_init_power_state_table(struct pp_hwmgr *hwmgr)
 {
int result;
unsigned int i;
-   unsigned int table_entries;
struct pp_power_state *state;
-   int size;
+   int size, table_entries;
 
if (hwmgr->hwmgr_func->get_num_of_pp_table_entries == NULL)
return 0;
@@ -40,15 +39,19 @@ int psm_init_power_state_table(struct pp_hwmgr *hwmgr)
if (hwmgr->hwmgr_func->get_power_state_size == NULL)
return 0;
 
-   hwmgr->num_ps = table_entries = 
hwmgr->hwmgr_func->get_num_of_pp_table_entries(hwmgr);
+   table_entries = hwmgr->hwmgr_func->get_num_of_pp_table_entries(hwmgr);
 
-   hwmgr->ps_size = size = hwmgr->hwmgr_func->get_power_state_size(hwmgr) +
+   size = hwmgr->hwmgr_func->get_power_state_size(hwmgr) +
  sizeof(struct pp_power_state);
 
-   if (table_entries == 0 || size == 0) {
+   if (table_entries <= 0 || size == 0) {
pr_warn("Please check whether power state management is 
supported on this asic\n");
+   hwmgr->num_ps = 0;
+   hwmgr->ps_size = 0;
return 0;
}
+   hwmgr->num_ps = table_entries;
+   hwmgr->ps_size = size;
 
hwmgr->ps = kcalloc(table_entries, size, GFP_KERNEL);
if (hwmgr->ps == NULL)
-- 
2.25.1



RE: [PATCH 18/22] drm/amd/pm: check negative return for table entries

2024-05-13 Thread Zhang, Jesse(Jie)
[AMD Official Use Only - AMD Internal Distribution Only]

Hi Tim,

-Original Message-
From: Huang, Tim 
Sent: Monday, May 13, 2024 3:40 PM
To: Zhang, Jesse(Jie) ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Koenig, Christian 
; Zhang, Jesse(Jie) ; Zhang, 
Jesse(Jie) 
Subject: RE: [PATCH 18/22] drm/amd/pm: check negative return for table entries

[AMD Official Use Only - AMD Internal Distribution Only]

Hi Jesse,

> -Original Message-
> From: Jesse Zhang 
> Sent: Monday, May 13, 2024 3:34 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Koenig, Christian
> ; Huang, Tim ; Zhang,
> Jesse(Jie) ; Zhang, Jesse(Jie)
> 
> Subject: [PATCH 18/22] drm/amd/pm: check negative return for table
> entries
>
> Function hwmgr->hwmgr_func->get_num_of_pp_table_entries(hwmgr)
> returns a negative number
>
> Signed-off-by: Jesse Zhang 
> Suggested-by: Tim Huang 
> ---
>  drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c
> b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c
> index f4bd8e9357e2..1276a95acc90 100644
> --- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c
> +++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c
> @@ -30,9 +30,8 @@ int psm_init_power_state_table(struct pp_hwmgr
> *hwmgr)  {
>   int result;
>   unsigned int i;
> - unsigned int table_entries;
>   struct pp_power_state *state;
> - int size;
> + int size, table_entries;
>
>   if (hwmgr->hwmgr_func->get_num_of_pp_table_entries == NULL)
>   return 0;
> @@ -40,15 +39,17 @@ int psm_init_power_state_table(struct pp_hwmgr
> *hwmgr)
>   if (hwmgr->hwmgr_func->get_power_state_size == NULL)
>   return 0;
>
> - hwmgr->num_ps = table_entries =
> hwmgr->hwmgr_func->get_num_of_pp_table_entries(hwmgr);
> + table_entries =
> hwmgr->hwmgr_func->get_num_of_pp_table_entries(hwmgr);
>
> - hwmgr->ps_size = size =
> hwmgr->hwmgr_func->get_power_state_size(hwmgr) +
> + size = hwmgr->hwmgr_func->get_power_state_size(hwmgr) +
> sizeof(struct pp_power_state);
>
> - if (table_entries == 0 || size == 0) {
> + if (table_entries <= 0 || size == 0) {
>   pr_warn("Please check whether power state management is
> supported on this asic\n");
As we return 0 here, we still need to set the hwmgr->num_ps and hwmgr->ps_size 
to 0 here.
[Zhang, Jesse(Jie)]  yes, right. Thanks Tim.
Tim Huang
>   return 0;

>   }
> + hwmgr->num_ps = table_entries;
> + hwmgr->ps_size = size;
>
>   hwmgr->ps = kcalloc(table_entries, size, GFP_KERNEL);
>   if (hwmgr->ps == NULL)
> --
> 2.25.1




[RESEND 2/6] drm/radeon: convert to using is_hdmi and has_audio from display info

2024-05-13 Thread Jani Nikula
Prefer the parsed results for is_hdmi and has_audio in display info over
calling drm_detect_hdmi_monitor() and drm_detect_monitor_audio(),
respectively.

Cc: Alex Deucher 
Cc: Christian König 
Cc: Pan, Xinhui 
Cc: amd-gfx@lists.freedesktop.org
Signed-off-by: Jani Nikula 
---
 drivers/gpu/drm/radeon/atombios_encoders.c | 10 +-
 drivers/gpu/drm/radeon/evergreen_hdmi.c|  5 ++---
 drivers/gpu/drm/radeon/radeon_audio.c  |  6 +++---
 drivers/gpu/drm/radeon/radeon_connectors.c | 12 ++--
 drivers/gpu/drm/radeon/radeon_display.c|  2 +-
 drivers/gpu/drm/radeon/radeon_encoders.c   |  4 ++--
 6 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/radeon/atombios_encoders.c 
b/drivers/gpu/drm/radeon/atombios_encoders.c
index 2bff0d9e20f5..0aa395fac36f 100644
--- a/drivers/gpu/drm/radeon/atombios_encoders.c
+++ b/drivers/gpu/drm/radeon/atombios_encoders.c
@@ -701,7 +701,7 @@ atombios_get_encoder_mode(struct drm_encoder *encoder)
if (radeon_connector->use_digital &&
(radeon_connector->audio == RADEON_AUDIO_ENABLE))
return ATOM_ENCODER_MODE_HDMI;
-   else if 
(drm_detect_hdmi_monitor(radeon_connector_edid(connector)) &&
+   else if (connector->display_info.is_hdmi &&
 (radeon_connector->audio == RADEON_AUDIO_AUTO))
return ATOM_ENCODER_MODE_HDMI;
else if (radeon_connector->use_digital)
@@ -720,7 +720,7 @@ atombios_get_encoder_mode(struct drm_encoder *encoder)
if (radeon_audio != 0) {
if (radeon_connector->audio == RADEON_AUDIO_ENABLE)
return ATOM_ENCODER_MODE_HDMI;
-   else if 
(drm_detect_hdmi_monitor(radeon_connector_edid(connector)) &&
+   else if (connector->display_info.is_hdmi &&
 (radeon_connector->audio == RADEON_AUDIO_AUTO))
return ATOM_ENCODER_MODE_HDMI;
else
@@ -737,14 +737,14 @@ atombios_get_encoder_mode(struct drm_encoder *encoder)
if ((dig_connector->dp_sink_type == 
CONNECTOR_OBJECT_ID_DISPLAYPORT) ||
(dig_connector->dp_sink_type == CONNECTOR_OBJECT_ID_eDP)) {
if (radeon_audio != 0 &&
-   
drm_detect_monitor_audio(radeon_connector_edid(connector)) &&
+   connector->display_info.has_audio &&
ASIC_IS_DCE4(rdev) && !ASIC_IS_DCE5(rdev))
return ATOM_ENCODER_MODE_DP_AUDIO;
return ATOM_ENCODER_MODE_DP;
} else if (radeon_audio != 0) {
if (radeon_connector->audio == RADEON_AUDIO_ENABLE)
return ATOM_ENCODER_MODE_HDMI;
-   else if 
(drm_detect_hdmi_monitor(radeon_connector_edid(connector)) &&
+   else if (connector->display_info.is_hdmi &&
 (radeon_connector->audio == RADEON_AUDIO_AUTO))
return ATOM_ENCODER_MODE_HDMI;
else
@@ -755,7 +755,7 @@ atombios_get_encoder_mode(struct drm_encoder *encoder)
break;
case DRM_MODE_CONNECTOR_eDP:
if (radeon_audio != 0 &&
-   drm_detect_monitor_audio(radeon_connector_edid(connector)) 
&&
+   connector->display_info.has_audio &&
ASIC_IS_DCE4(rdev) && !ASIC_IS_DCE5(rdev))
return ATOM_ENCODER_MODE_DP_AUDIO;
return ATOM_ENCODER_MODE_DP;
diff --git a/drivers/gpu/drm/radeon/evergreen_hdmi.c 
b/drivers/gpu/drm/radeon/evergreen_hdmi.c
index 681119c91d94..09dda114e218 100644
--- a/drivers/gpu/drm/radeon/evergreen_hdmi.c
+++ b/drivers/gpu/drm/radeon/evergreen_hdmi.c
@@ -412,7 +412,7 @@ void evergreen_hdmi_enable(struct drm_encoder *encoder, 
bool enable)
if (enable) {
struct drm_connector *connector = 
radeon_get_connector_for_encoder(encoder);
 
-   if (connector && 
drm_detect_monitor_audio(radeon_connector_edid(connector))) {
+   if (connector && connector->display_info.has_audio) {
WREG32(HDMI_INFOFRAME_CONTROL0 + dig->afmt->offset,
   HDMI_AVI_INFO_SEND | /* enable AVI info frames */
   HDMI_AVI_INFO_CONT | /* required for audio info 
values to be updated */
@@ -450,8 +450,7 @@ void evergreen_dp_enable(struct drm_encoder *encoder, bool 
enable)
if (!dig || !dig->afmt)
return;
 
-   if (enable && connector &&
-   drm_detect_monitor_audio(radeon_connector_edid(connector))) {
+   if (enable && connector && connector->display_info.has_audio) {
  

[RESEND 5/6] drm/edid: add a helper for EDID sysfs property show

2024-05-13 Thread Jani Nikula
Add a helper to get the EDID property for sysfs property show. This
hides all the edid_blob_ptr usage within drm_edid.c.

Signed-off-by: Jani Nikula 
---
 drivers/gpu/drm/drm_crtc_internal.h |  2 ++
 drivers/gpu/drm/drm_edid.c  | 33 +
 drivers/gpu/drm/drm_sysfs.c | 24 ++---
 3 files changed, 37 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/drm_crtc_internal.h 
b/drivers/gpu/drm/drm_crtc_internal.h
index 25aaae937ceb..20e9d7b206a2 100644
--- a/drivers/gpu/drm/drm_crtc_internal.h
+++ b/drivers/gpu/drm/drm_crtc_internal.h
@@ -303,6 +303,8 @@ const u8 *drm_edid_find_extension(const struct drm_edid 
*drm_edid,
  int ext_id, int *ext_index);
 void drm_edid_cta_sad_get(const struct cea_sad *cta_sad, u8 *sad);
 void drm_edid_cta_sad_set(struct cea_sad *cta_sad, const u8 *sad);
+ssize_t drm_edid_connector_property_show(struct drm_connector *connector,
+char *buf, loff_t off, size_t count);
 
 /* drm_edid_load.c */
 #ifdef CONFIG_DRM_LOAD_EDID_FIRMWARE
diff --git a/drivers/gpu/drm/drm_edid.c b/drivers/gpu/drm/drm_edid.c
index 4f54c91b31b2..97362dd2330b 100644
--- a/drivers/gpu/drm/drm_edid.c
+++ b/drivers/gpu/drm/drm_edid.c
@@ -6969,6 +6969,39 @@ static int _drm_edid_connector_property_update(struct 
drm_connector *connector,
return ret;
 }
 
+/* For sysfs edid show implementation */
+ssize_t drm_edid_connector_property_show(struct drm_connector *connector,
+char *buf, loff_t off, size_t count)
+{
+   const void *edid;
+   size_t size;
+   ssize_t ret = 0;
+
+   mutex_lock(&connector->dev->mode_config.mutex);
+
+   if (!connector->edid_blob_ptr)
+   goto unlock;
+
+   edid = connector->edid_blob_ptr->data;
+   size = connector->edid_blob_ptr->length;
+   if (!edid)
+   goto unlock;
+
+   if (off >= size)
+   goto unlock;
+
+   if (off + count > size)
+   count = size - off;
+
+   memcpy(buf, edid + off, count);
+
+   ret = count;
+unlock:
+   mutex_unlock(&connector->dev->mode_config.mutex);
+
+   return ret;
+}
+
 /**
  * drm_edid_connector_update - Update connector information from EDID
  * @connector: Connector
diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
index bd9b8ab4f82b..fb3bbb6adcd1 100644
--- a/drivers/gpu/drm/drm_sysfs.c
+++ b/drivers/gpu/drm/drm_sysfs.c
@@ -266,29 +266,9 @@ static ssize_t edid_show(struct file *filp, struct kobject 
*kobj,
 {
struct device *connector_dev = kobj_to_dev(kobj);
struct drm_connector *connector = to_drm_connector(connector_dev);
-   unsigned char *edid;
-   size_t size;
-   ssize_t ret = 0;
+   ssize_t ret;
 
-   mutex_lock(&connector->dev->mode_config.mutex);
-   if (!connector->edid_blob_ptr)
-   goto unlock;
-
-   edid = connector->edid_blob_ptr->data;
-   size = connector->edid_blob_ptr->length;
-   if (!edid)
-   goto unlock;
-
-   if (off >= size)
-   goto unlock;
-
-   if (off + count > size)
-   count = size - off;
-   memcpy(buf, edid + off, count);
-
-   ret = count;
-unlock:
-   mutex_unlock(&connector->dev->mode_config.mutex);
+   ret = drm_edid_connector_property_show(connector, buf, off, count);
 
return ret;
 }
-- 
2.39.2



Re: [PATCH v2] drm/amd/display: fix documentation warnings for mpc.h

2024-05-13 Thread Stephen Rothwell
Hi Marcelo,

On Sat, 11 May 2024 13:37:17 +1000 Stephen Rothwell  
wrote:
> 
> Thanks for doing this.
> 
> I haven't tested it, but just a couple of little things:
> 
> On Fri, 10 May 2024 21:02:02 -0300 Marcelo Mendes Spessoto Junior 
>  wrote:
> >
> > Fix most of the display documentation compile warnings by
> > documenting struct mpc_funcs functions in dc/inc/hw/mpc.h file.
> >   
>   .
>   .
>   .
> > 
> > Fixes:
> > b8c1c3a82e75 ("Documentation/gpu: Add kernel doc entry for MPC")  
> 
> Please don't split the Fixes tag line and also don't add blank lines
> between tags.
> 
> I also appreciate being credited for reporting (unless you got a report
> from somewhere else as well, then maybe credit both).
> 
> Reported-by: Stephen Rothwell 

Now tested.

Tested-by: Stephen Rothwell 
Closes: 
https://lore.kernel.org/linux-next/20240130134954.04fcf...@canb.auug.org.au/

-- 
Cheers,
Stephen Rothwell




Re: [PATCH v2] drm/amd/display: fix documentation warnings for mpc.h

2024-05-13 Thread Stephen Rothwell
Hi Marcelo,

Thanks for doing this.

I haven't tested it, but just a couple of little things:

On Fri, 10 May 2024 21:02:02 -0300 Marcelo Mendes Spessoto Junior 
 wrote:
>
> Fix most of the display documentation compile warnings by
> documenting struct mpc_funcs functions in dc/inc/hw/mpc.h file.
> 
.
.
.
> 
> Fixes:
> b8c1c3a82e75 ("Documentation/gpu: Add kernel doc entry for MPC")

Please don't split the Fixes tag line and also don't add blank lines
between tags.

I also appreciate being credited for reporting (unless you got a report
from somewhere else as well, then maybe credit both).

Reported-by: Stephen Rothwell 

> 
> Signed-off-by: Marcelo Mendes Spessoto Junior 
> ---
-- 
Cheers,
Stephen Rothwell




[PATCH] drm/amd/display: clean up some inconsistent indenting

2024-05-13 Thread Jiapeng Chong
No functional modification involved.

drivers/gpu/drm/amd/amdgpu/../display/dc/resource/dcn401/dcn401_resource.c:792 
dcn401_i2c_hw_create() warn: inconsistent indenting.
drivers/gpu/drm/amd/amdgpu/../display/dc/resource/dcn401/dcn401_resource.c:894 
dcn401_hubp_create() warn: inconsistent indenting.
drivers/gpu/drm/amd/amdgpu/../display/dc/resource/dcn401/dcn401_resource.c:1738 
dcn401_resource_construct() warn: inconsistent indenting.

Reported-by: Abaci Robot 
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9002
Signed-off-by: Jiapeng Chong 
---
 .../dc/resource/dcn401/dcn401_resource.c  | 29 +--
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/resource/dcn401/dcn401_resource.c 
b/drivers/gpu/drm/amd/display/dc/resource/dcn401/dcn401_resource.c
index 75e2c62ae792..3e1bfddc6e43 100644
--- a/drivers/gpu/drm/amd/display/dc/resource/dcn401/dcn401_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/resource/dcn401/dcn401_resource.c
@@ -784,14 +784,13 @@ static struct dce_i2c_hw *dcn401_i2c_hw_create(
 
 #undef REG_STRUCT
 #define REG_STRUCT i2c_hw_regs
-   i2c_inst_regs_init(1),
-   i2c_inst_regs_init(2),
-   i2c_inst_regs_init(3),
-   i2c_inst_regs_init(4);
+   i2c_inst_regs_init(1),
+   i2c_inst_regs_init(2),
+   i2c_inst_regs_init(3),
+   i2c_inst_regs_init(4);
 
dcn2_i2c_hw_construct(dce_i2c_hw, ctx, inst,
-   &i2c_hw_regs[inst], &i2c_shifts, &i2c_masks);
-
+ &i2c_hw_regs[inst], &i2c_shifts, &i2c_masks);
return dce_i2c_hw;
 }
 
@@ -886,13 +885,13 @@ static struct hubp *dcn401_hubp_create(
 
 #undef REG_STRUCT
 #define REG_STRUCT hubp_regs
-   hubp_regs_init(0),
-   hubp_regs_init(1),
-   hubp_regs_init(2),
-   hubp_regs_init(3);
+   hubp_regs_init(0),
+   hubp_regs_init(1),
+   hubp_regs_init(2),
+   hubp_regs_init(3);
 
if (hubp401_construct(hubp2, ctx, inst,
-   &hubp_regs[inst], &hubp_shift, &hubp_mask))
+ &hubp_regs[inst], &hubp_shift, &hubp_mask))
return >base;
 
BREAK_TO_DEBUGGER();
@@ -1735,10 +1734,10 @@ static bool dcn401_resource_construct(
 
 #undef REG_STRUCT
 #define REG_STRUCT abm_regs
-   abm_regs_init(0),
-   abm_regs_init(1),
-   abm_regs_init(2),
-   abm_regs_init(3);
+   abm_regs_init(0),
+   abm_regs_init(1),
+   abm_regs_init(2),
+   abm_regs_init(3);
 
 #undef REG_STRUCT
 #define REG_STRUCT dccg_regs
-- 
2.20.1.7.g153144c



[RESEND 0/6] drm, nouveau/radeon/amdpgu: edid_blob_ptr cleanups

2024-05-13 Thread Jani Nikula
I've sent this some months ago, let's try again...

BR,
Jani.

Jani Nikula (6):
  drm/nouveau: convert to using is_hdmi and has_audio from display info
  drm/radeon: convert to using is_hdmi and has_audio from display info
  drm/radeon: remove radeon_connector_edid() and stop using
edid_blob_ptr
  drm/amdgpu: remove amdgpu_connector_edid() and stop using
edid_blob_ptr
  drm/edid: add a helper for EDID sysfs property show
  drm/connector: update edid_blob_ptr documentation

 .../gpu/drm/amd/amdgpu/amdgpu_connectors.c| 16 -
 .../gpu/drm/amd/amdgpu/amdgpu_connectors.h|  1 -
 drivers/gpu/drm/amd/amdgpu/dce_v10_0.c|  4 +--
 drivers/gpu/drm/amd/amdgpu/dce_v11_0.c|  4 +--
 drivers/gpu/drm/amd/amdgpu/dce_v6_0.c |  4 +--
 drivers/gpu/drm/amd/amdgpu/dce_v8_0.c |  4 +--
 drivers/gpu/drm/drm_crtc_internal.h   |  2 ++
 drivers/gpu/drm/drm_edid.c| 33 +++
 drivers/gpu/drm/drm_sysfs.c   | 24 ++
 drivers/gpu/drm/nouveau/dispnv50/disp.c   |  8 ++---
 drivers/gpu/drm/nouveau/dispnv50/head.c   |  8 +
 drivers/gpu/drm/nouveau/nouveau_connector.c   |  2 +-
 drivers/gpu/drm/radeon/atombios_encoders.c| 10 +++---
 drivers/gpu/drm/radeon/evergreen_hdmi.c   |  5 ++-
 drivers/gpu/drm/radeon/radeon_audio.c | 13 
 drivers/gpu/drm/radeon/radeon_connectors.c| 27 ---
 drivers/gpu/drm/radeon/radeon_display.c   |  2 +-
 drivers/gpu/drm/radeon/radeon_encoders.c  |  4 +--
 drivers/gpu/drm/radeon/radeon_mode.h  |  2 --
 include/drm/drm_connector.h   |  6 +++-
 20 files changed, 79 insertions(+), 100 deletions(-)

-- 
2.39.2



[RESEND 6/6] drm/connector: update edid_blob_ptr documentation

2024-05-13 Thread Jani Nikula
Accessing the EDID via edid_blob_ptr causes chicken-and-egg
problems. Keep edid_blob_ptr as the userspace interface that should be
accessed via dedicated functions.

Signed-off-by: Jani Nikula 
---
 include/drm/drm_connector.h | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/drm/drm_connector.h b/include/drm/drm_connector.h
index fe88d7fc6b8f..58ee9adf9091 100644
--- a/include/drm/drm_connector.h
+++ b/include/drm/drm_connector.h
@@ -1636,8 +1636,12 @@ struct drm_connector {
 
/**
 * @edid_blob_ptr: DRM property containing EDID if present. Protected by
-* &drm_mode_config.mutex. This should be updated only by calling
+* &drm_mode_config.mutex.
+*
+* This must be updated only by calling drm_edid_connector_update() or
 * drm_connector_update_edid_property().
+*
+* This must not be used by drivers directly.
 */
struct drm_property_blob *edid_blob_ptr;
 
-- 
2.39.2



[RESEND 1/6] drm/nouveau: convert to using is_hdmi and has_audio from display info

2024-05-13 Thread Jani Nikula
Prefer the parsed results for is_hdmi and has_audio in display info over
calling drm_detect_hdmi_monitor() and drm_detect_monitor_audio(),
respectively.

Conveniently, this also removes the need to use edid_blob_ptr.

v2: Reverse a backwards if condition (Ilia)

Cc: Karol Herbst 
Cc: Lyude Paul 
Cc: Danilo Krummrich 
Cc: nouv...@lists.freedesktop.org
Signed-off-by: Jani Nikula 
---
 drivers/gpu/drm/nouveau/dispnv50/disp.c | 8 
 drivers/gpu/drm/nouveau/dispnv50/head.c | 8 +---
 drivers/gpu/drm/nouveau/nouveau_connector.c | 2 +-
 3 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/dispnv50/disp.c 
b/drivers/gpu/drm/nouveau/dispnv50/disp.c
index 0c3d88ad0b0e..168c27213287 100644
--- a/drivers/gpu/drm/nouveau/dispnv50/disp.c
+++ b/drivers/gpu/drm/nouveau/dispnv50/disp.c
@@ -751,7 +751,7 @@ nv50_audio_enable(struct drm_encoder *encoder, struct 
nouveau_crtc *nv_crtc,
struct nouveau_encoder *nv_encoder = nouveau_encoder(encoder);
struct nvif_outp *outp = &nv_encoder->outp;
 
-   if (!nv50_audio_supported(encoder) || 
!drm_detect_monitor_audio(nv_connector->edid))
+   if (!nv50_audio_supported(encoder) || 
!nv_connector->base.display_info.has_audio)
return;
 
mutex_lock(&drm->audio.lock);
@@ -1765,7 +1765,7 @@ nv50_sor_atomic_enable(struct drm_encoder *encoder, 
struct drm_atomic_state *sta
if ((disp->disp->object.oclass == GT214_DISP ||
 disp->disp->object.oclass >= GF110_DISP) &&
nv_encoder->dcb->type != DCB_OUTPUT_LVDS &&
-   drm_detect_monitor_audio(nv_connector->edid))
+   nv_connector->base.display_info.has_audio)
hda = true;
 
if (!nvif_outp_acquired(outp))
@@ -1774,7 +1774,7 @@ nv50_sor_atomic_enable(struct drm_encoder *encoder, 
struct drm_atomic_state *sta
switch (nv_encoder->dcb->type) {
case DCB_OUTPUT_TMDS:
if (disp->disp->object.oclass != NV50_DISP &&
-   drm_detect_hdmi_monitor(nv_connector->edid))
+   nv_connector->base.display_info.is_hdmi)
nv50_hdmi_enable(encoder, nv_crtc, nv_connector, state, 
mode, hda);
 
if (nv_encoder->outp.or.link & 1) {
@@ -1787,7 +1787,7 @@ nv50_sor_atomic_enable(struct drm_encoder *encoder, 
struct drm_atomic_state *sta
 */
if (mode->clock >= 165000 &&
nv_encoder->dcb->duallink_possible &&
-   !drm_detect_hdmi_monitor(nv_connector->edid))
+   !nv_connector->base.display_info.is_hdmi)
proto = 
NV507D_SOR_SET_CONTROL_PROTOCOL_DUAL_TMDS;
} else {
proto = NV507D_SOR_SET_CONTROL_PROTOCOL_SINGLE_TMDS_B;
diff --git a/drivers/gpu/drm/nouveau/dispnv50/head.c 
b/drivers/gpu/drm/nouveau/dispnv50/head.c
index 83355dbc15ee..d7c74cc43ba5 100644
--- a/drivers/gpu/drm/nouveau/dispnv50/head.c
+++ b/drivers/gpu/drm/nouveau/dispnv50/head.c
@@ -127,14 +127,8 @@ nv50_head_atomic_check_view(struct nv50_head_atom *armh,
struct drm_display_mode *omode = &asyh->state.adjusted_mode;
struct drm_display_mode *umode = &asyh->state.mode;
int mode = asyc->scaler.mode;
-   struct edid *edid;
int umode_vdisplay, omode_hdisplay, omode_vdisplay;
 
-   if (connector->edid_blob_ptr)
-   edid = (struct edid *)connector->edid_blob_ptr->data;
-   else
-   edid = NULL;
-
if (!asyc->scaler.full) {
if (mode == DRM_MODE_SCALE_NONE)
omode = umode;
@@ -162,7 +156,7 @@ nv50_head_atomic_check_view(struct nv50_head_atom *armh,
 */
if ((asyc->scaler.underscan.mode == UNDERSCAN_ON ||
(asyc->scaler.underscan.mode == UNDERSCAN_AUTO &&
-drm_detect_hdmi_monitor(edid {
+connector->display_info.is_hdmi))) {
u32 bX = asyc->scaler.underscan.hborder;
u32 bY = asyc->scaler.underscan.vborder;
u32 r = (asyh->view.oH << 19) / asyh->view.oW;
diff --git a/drivers/gpu/drm/nouveau/nouveau_connector.c 
b/drivers/gpu/drm/nouveau/nouveau_connector.c
index 856b3ef5edb8..938832a6af15 100644
--- a/drivers/gpu/drm/nouveau/nouveau_connector.c
+++ b/drivers/gpu/drm/nouveau/nouveau_connector.c
@@ -1034,7 +1034,7 @@ get_tmds_link_bandwidth(struct drm_connector *connector)
unsigned duallink_scale =
nouveau_duallink && nv_encoder->dcb->duallink_possible ? 2 : 1;
 
-   if (drm_detect_hdmi_monitor(nv_connector->edid)) {
+   if (nv_connector->base.display_info.is_hdmi) {
info = &nv_connector->base.display_info;
duallink_scale = 1;
}
-- 
2.39.2



[RESEND 3/6] drm/radeon: remove radeon_connector_edid() and stop using edid_blob_ptr

2024-05-13 Thread Jani Nikula
radeon_connector_edid() copies the EDID from edid_blob_ptr as a side
effect if radeon_connector->edid isn't initialized. However, everywhere
that the returned EDID is used, the EDID should have been set
beforehand.

Only the drm EDID code should look at the EDID property, anyway, so stop
using it.

Cc: Alex Deucher 
Cc: Christian König 
Cc: Pan, Xinhui 
Cc: amd-gfx@lists.freedesktop.org
Signed-off-by: Jani Nikula 
---
 drivers/gpu/drm/radeon/radeon_audio.c  |  7 ---
 drivers/gpu/drm/radeon/radeon_connectors.c | 15 ---
 drivers/gpu/drm/radeon/radeon_mode.h   |  2 --
 3 files changed, 4 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon_audio.c 
b/drivers/gpu/drm/radeon/radeon_audio.c
index 16c10db3ce6f..0bcd767b9f47 100644
--- a/drivers/gpu/drm/radeon/radeon_audio.c
+++ b/drivers/gpu/drm/radeon/radeon_audio.c
@@ -303,6 +303,7 @@ void radeon_audio_endpoint_wreg(struct radeon_device *rdev, 
u32 offset,
 static void radeon_audio_write_sad_regs(struct drm_encoder *encoder)
 {
struct drm_connector *connector = 
radeon_get_connector_for_encoder(encoder);
+   struct radeon_connector *radeon_connector = 
to_radeon_connector(connector);
struct radeon_encoder *radeon_encoder = to_radeon_encoder(encoder);
struct cea_sad *sads;
int sad_count;
@@ -310,7 +311,7 @@ static void radeon_audio_write_sad_regs(struct drm_encoder 
*encoder)
if (!connector)
return;
 
-   sad_count = drm_edid_to_sad(radeon_connector_edid(connector), &sads);
+   sad_count = drm_edid_to_sad(radeon_connector->edid, &sads);
if (sad_count < 0)
DRM_ERROR("Couldn't read SADs: %d\n", sad_count);
if (sad_count <= 0)
@@ -326,6 +327,7 @@ static void radeon_audio_write_sad_regs(struct drm_encoder 
*encoder)
 static void radeon_audio_write_speaker_allocation(struct drm_encoder *encoder)
 {
struct drm_connector *connector = 
radeon_get_connector_for_encoder(encoder);
+   struct radeon_connector *radeon_connector = 
to_radeon_connector(connector);
struct radeon_encoder *radeon_encoder = to_radeon_encoder(encoder);
u8 *sadb = NULL;
int sad_count;
@@ -333,8 +335,7 @@ static void radeon_audio_write_speaker_allocation(struct 
drm_encoder *encoder)
if (!connector)
return;
 
-   sad_count = drm_edid_to_speaker_allocation(radeon_connector_edid(connector),
-  &sadb);
+   sad_count = drm_edid_to_speaker_allocation(radeon_connector->edid, &sadb);
if (sad_count < 0) {
DRM_DEBUG("Couldn't read Speaker Allocation Data Block: %d\n",
  sad_count);
diff --git a/drivers/gpu/drm/radeon/radeon_connectors.c 
b/drivers/gpu/drm/radeon/radeon_connectors.c
index 81b5c3c8f658..80879e946342 100644
--- a/drivers/gpu/drm/radeon/radeon_connectors.c
+++ b/drivers/gpu/drm/radeon/radeon_connectors.c
@@ -255,21 +255,6 @@ static struct drm_encoder *radeon_find_encoder(struct 
drm_connector *connector,
return NULL;
 }
 
-struct edid *radeon_connector_edid(struct drm_connector *connector)
-{
-   struct radeon_connector *radeon_connector = 
to_radeon_connector(connector);
-   struct drm_property_blob *edid_blob = connector->edid_blob_ptr;
-
-   if (radeon_connector->edid) {
-   return radeon_connector->edid;
-   } else if (edid_blob) {
-   struct edid *edid = kmemdup(edid_blob->data, edid_blob->length, 
GFP_KERNEL);
-   if (edid)
-   radeon_connector->edid = edid;
-   }
-   return radeon_connector->edid;
-}
-
 static void radeon_connector_get_edid(struct drm_connector *connector)
 {
struct drm_device *dev = connector->dev;
diff --git a/drivers/gpu/drm/radeon/radeon_mode.h 
b/drivers/gpu/drm/radeon/radeon_mode.h
index 546381a5c918..e0a5af180801 100644
--- a/drivers/gpu/drm/radeon/radeon_mode.h
+++ b/drivers/gpu/drm/radeon/radeon_mode.h
@@ -701,8 +701,6 @@ extern u16 
radeon_connector_encoder_get_dp_bridge_encoder_id(struct drm_connecto
 extern bool radeon_connector_is_dp12_capable(struct drm_connector *connector);
 extern int radeon_get_monitor_bpc(struct drm_connector *connector);
 
-extern struct edid *radeon_connector_edid(struct drm_connector *connector);
-
 extern void radeon_connector_hotplug(struct drm_connector *connector);
 extern int radeon_dp_mode_valid_helper(struct drm_connector *connector,
   struct drm_display_mode *mode);
-- 
2.39.2



[RESEND 4/6] drm/amdgpu: remove amdgpu_connector_edid() and stop using edid_blob_ptr

2024-05-13 Thread Jani Nikula
amdgpu_connector_edid() copies the EDID from edid_blob_ptr as a side
effect if amdgpu_connector->edid isn't initialized. However, everywhere
that the returned EDID is used, the EDID should have been set
beforehand.

Only the drm EDID code should look at the EDID property, anyway, so stop
using it.

Cc: Alex Deucher 
Cc: Christian König 
Cc: Pan, Xinhui 
Cc: amd-gfx@lists.freedesktop.org
Signed-off-by: Jani Nikula 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c | 16 
 drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.h |  1 -
 drivers/gpu/drm/amd/amdgpu/dce_v10_0.c |  4 ++--
 drivers/gpu/drm/amd/amdgpu/dce_v11_0.c |  4 ++--
 drivers/gpu/drm/amd/amdgpu/dce_v6_0.c  |  4 ++--
 drivers/gpu/drm/amd/amdgpu/dce_v8_0.c  |  4 ++--
 6 files changed, 8 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c
index 9caba10315a8..cae7479c3ecf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c
@@ -246,22 +246,6 @@ amdgpu_connector_find_encoder(struct drm_connector 
*connector,
return NULL;
 }
 
-struct edid *amdgpu_connector_edid(struct drm_connector *connector)
-{
-   struct amdgpu_connector *amdgpu_connector = 
to_amdgpu_connector(connector);
-   struct drm_property_blob *edid_blob = connector->edid_blob_ptr;
-
-   if (amdgpu_connector->edid) {
-   return amdgpu_connector->edid;
-   } else if (edid_blob) {
-   struct edid *edid = kmemdup(edid_blob->data, edid_blob->length, 
GFP_KERNEL);
-
-   if (edid)
-   amdgpu_connector->edid = edid;
-   }
-   return amdgpu_connector->edid;
-}
-
 static struct edid *
 amdgpu_connector_get_hardcoded_edid(struct amdgpu_device *adev)
 {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.h
index 61fcef15ad72..eff833b6ed31 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.h
@@ -24,7 +24,6 @@
 #ifndef __AMDGPU_CONNECTORS_H__
 #define __AMDGPU_CONNECTORS_H__
 
-struct edid *amdgpu_connector_edid(struct drm_connector *connector);
 void amdgpu_connector_hotplug(struct drm_connector *connector);
 int amdgpu_connector_get_monitor_bpc(struct drm_connector *connector);
 u16 amdgpu_connector_encoder_get_dp_bridge_encoder_id(struct drm_connector 
*connector);
diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/dce_v10_0.c
index b44fce44c066..dddb5fe16f2c 100644
--- a/drivers/gpu/drm/amd/amdgpu/dce_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/dce_v10_0.c
@@ -1299,7 +1299,7 @@ static void 
dce_v10_0_audio_write_speaker_allocation(struct drm_encoder *encoder
return;
}
 
-   sad_count = drm_edid_to_speaker_allocation(amdgpu_connector_edid(connector), &sadb);
+   sad_count = drm_edid_to_speaker_allocation(amdgpu_connector->edid, &sadb);
if (sad_count < 0) {
DRM_ERROR("Couldn't read Speaker Allocation Data Block: %d\n", 
sad_count);
sad_count = 0;
@@ -1369,7 +1369,7 @@ static void dce_v10_0_audio_write_sad_regs(struct 
drm_encoder *encoder)
return;
}
 
-   sad_count = drm_edid_to_sad(amdgpu_connector_edid(connector), &sads);
+   sad_count = drm_edid_to_sad(amdgpu_connector->edid, &sads);
if (sad_count < 0)
DRM_ERROR("Couldn't read SADs: %d\n", sad_count);
if (sad_count <= 0)
diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
index 80b2e7f79acf..11780e4d7e9f 100644
--- a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
@@ -1331,7 +1331,7 @@ static void 
dce_v11_0_audio_write_speaker_allocation(struct drm_encoder *encoder
return;
}
 
-   sad_count = 
drm_edid_to_speaker_allocation(amdgpu_connector_edid(connector), );
+   sad_count = drm_edid_to_speaker_allocation(amdgpu_connector->edid, 
);
if (sad_count < 0) {
DRM_ERROR("Couldn't read Speaker Allocation Data Block: %d\n", 
sad_count);
sad_count = 0;
@@ -1401,7 +1401,7 @@ static void dce_v11_0_audio_write_sad_regs(struct 
drm_encoder *encoder)
return;
}
 
-   sad_count = drm_edid_to_sad(amdgpu_connector_edid(connector), );
+   sad_count = drm_edid_to_sad(amdgpu_connector->edid, );
if (sad_count < 0)
DRM_ERROR("Couldn't read SADs: %d\n", sad_count);
if (sad_count <= 0)
diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v6_0.c 
b/drivers/gpu/drm/amd/amdgpu/dce_v6_0.c
index db20012600f5..05c0df97f01d 100644
--- a/drivers/gpu/drm/amd/amdgpu/dce_v6_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/dce_v6_0.c
@@ -1217,7 +1217,7 @@ static void 
dce_v6_0_audio_write_speaker_allocation(struct drm_encoder *encoder)
return;

RE: [PATCH 18/22] drm/amd/pm: check negative return for table entries

2024-05-13 Thread Huang, Tim
[AMD Official Use Only - AMD Internal Distribution Only]

Hi Jesse,

> -Original Message-
> From: Jesse Zhang 
> Sent: Monday, May 13, 2024 3:34 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Koenig, Christian
> ; Huang, Tim ; Zhang,
> Jesse(Jie) ; Zhang, Jesse(Jie)
> 
> Subject: [PATCH 18/22] drm/amd/pm: check negative return for table entries
>
> The function hwmgr->hwmgr_func->get_num_of_pp_table_entries(hwmgr)
> can return a negative number on failure, so check for it
>
> Signed-off-by: Jesse Zhang 
> Suggested-by: Tim Huang 
> ---
>  drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c
> b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c
> index f4bd8e9357e2..1276a95acc90 100644
> --- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c
> +++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c
> @@ -30,9 +30,8 @@ int psm_init_power_state_table(struct pp_hwmgr
> *hwmgr)  {
>   int result;
>   unsigned int i;
> - unsigned int table_entries;
>   struct pp_power_state *state;
> - int size;
> + int size, table_entries;
>
>   if (hwmgr->hwmgr_func->get_num_of_pp_table_entries == NULL)
>   return 0;
> @@ -40,15 +39,17 @@ int psm_init_power_state_table(struct pp_hwmgr
> *hwmgr)
>   if (hwmgr->hwmgr_func->get_power_state_size == NULL)
>   return 0;
>
> - hwmgr->num_ps = table_entries =
> hwmgr->hwmgr_func->get_num_of_pp_table_entries(hwmgr);
> + table_entries =
> hwmgr->hwmgr_func->get_num_of_pp_table_entries(hwmgr);
>
> - hwmgr->ps_size = size =
> hwmgr->hwmgr_func->get_power_state_size(hwmgr) +
> + size = hwmgr->hwmgr_func->get_power_state_size(hwmgr) +
> sizeof(struct pp_power_state);
>
> - if (table_entries == 0 || size == 0) {
> + if (table_entries <= 0 || size == 0) {
>   pr_warn("Please check whether power state management is
> supported on this asic\n");
As we return 0 here, we still need to set the hwmgr->num_ps and hwmgr->ps_size 
to 0 here.

Tim Huang
>   return 0;

>   }
> + hwmgr->num_ps = table_entries;
> + hwmgr->ps_size = size;
>
>   hwmgr->ps = kcalloc(table_entries, size, GFP_KERNEL);
>   if (hwmgr->ps == NULL)
> --
> 2.25.1



RE: [PATCH 2/22 V2] drm/amdgpu: the warning dereferencing obj for nbio_v7_4

2024-05-13 Thread Huang, Tim
[AMD Official Use Only - AMD Internal Distribution Only]

This patch is,

Reviewed-by: Tim Huang 



> -Original Message-
> From: Jesse Zhang 
> Sent: Monday, May 13, 2024 3:33 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Koenig, Christian
> ; Huang, Tim ; Zhang,
> Jesse(Jie) ; Zhang, Jesse(Jie)
> 
> Subject: [PATCH 2/22 V2] drm/amdgpu: the warning dereferencing obj for
> nbio_v7_4
>
> If the ras_manager obj is NULL, don't print NBIO error data
>
> Signed-off-by: Jesse Zhang 
> Suggested-by: Tim Huang 
> ---
>  drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> index fe18df10daaa..32cc60ce5521 100644
> --- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> +++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> @@ -383,7 +383,7 @@ static void
> nbio_v7_4_handle_ras_controller_intr_no_bifring(struct amdgpu_device
>   else
>   WREG32_SOC15(NBIO, 0, mmBIF_DOORBELL_INT_CNTL,
> bif_doorbell_intr_cntl);
>
> - if (!ras->disable_ras_err_cnt_harvest) {
> + if (ras && !ras->disable_ras_err_cnt_harvest && obj) {
>   /*
>* clear error status after ras_controller_intr
>* according to hw team and count ue number
> --
> 2.25.1



[PATCH 18/22] drm/amd/pm: check negative return for table entries

2024-05-13 Thread Jesse Zhang
The function hwmgr->hwmgr_func->get_num_of_pp_table_entries(hwmgr) can return a
negative number on failure, so check for it

Signed-off-by: Jesse Zhang 
Suggested-by: Tim Huang 
---
 drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c 
b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c
index f4bd8e9357e2..1276a95acc90 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c
@@ -30,9 +30,8 @@ int psm_init_power_state_table(struct pp_hwmgr *hwmgr)
 {
int result;
unsigned int i;
-   unsigned int table_entries;
struct pp_power_state *state;
-   int size;
+   int size, table_entries;
 
if (hwmgr->hwmgr_func->get_num_of_pp_table_entries == NULL)
return 0;
@@ -40,15 +39,17 @@ int psm_init_power_state_table(struct pp_hwmgr *hwmgr)
if (hwmgr->hwmgr_func->get_power_state_size == NULL)
return 0;
 
-   hwmgr->num_ps = table_entries = 
hwmgr->hwmgr_func->get_num_of_pp_table_entries(hwmgr);
+   table_entries = hwmgr->hwmgr_func->get_num_of_pp_table_entries(hwmgr);
 
-   hwmgr->ps_size = size = hwmgr->hwmgr_func->get_power_state_size(hwmgr) +
+   size = hwmgr->hwmgr_func->get_power_state_size(hwmgr) +
  sizeof(struct pp_power_state);
 
-   if (table_entries == 0 || size == 0) {
+   if (table_entries <= 0 || size == 0) {
pr_warn("Please check whether power state management is 
supported on this asic\n");
return 0;
}
+   hwmgr->num_ps = table_entries;
+   hwmgr->ps_size = size;
 
hwmgr->ps = kcalloc(table_entries, size, GFP_KERNEL);
if (hwmgr->ps == NULL)
-- 
2.25.1



[PATCH 2/22 V2] drm/amdgpu: the warning dereferencing obj for nbio_v7_4

2024-05-13 Thread Jesse Zhang
If the ras_manager obj is NULL, don't print NBIO error data

Signed-off-by: Jesse Zhang 
Suggested-by: Tim Huang 
---
 drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c 
b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
index fe18df10daaa..32cc60ce5521 100644
--- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
+++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
@@ -383,7 +383,7 @@ static void 
nbio_v7_4_handle_ras_controller_intr_no_bifring(struct amdgpu_device
else
WREG32_SOC15(NBIO, 0, mmBIF_DOORBELL_INT_CNTL, 
bif_doorbell_intr_cntl);
 
-   if (!ras->disable_ras_err_cnt_harvest) {
+   if (ras && !ras->disable_ras_err_cnt_harvest && obj) {
/*
 * clear error status after ras_controller_intr
 * according to hw team and count ue number
-- 
2.25.1



Re: [RFC 0/5] Discussion around eviction improvements

2024-05-13 Thread Christian König

Just FYI, I've been on sick leave for a while and now trying to catch up.

It will probably be at least a week until I can look into this again.

Sorry,
Christian.

Am 08.05.24 um 20:09 schrieb Tvrtko Ursulin:

From: Tvrtko Ursulin 

Over the last few days I was looking at the situation with VRAM oversubscription:
what happens versus what perhaps should happen. I was browsing through the driver
and running some simple experiments.

I ended up with this patch series which, as a disclaimer, may be completely
wrong but as I found some suspicious things, to me at least, I thought it was a
good point to stop and request some comments.

To perhaps summarise what are the main issues I think I found:

  * Migration rate limiting does not check whether an actual migration happened
and so can over-account and unfairly penalise.

  * Migration rate limiting does not even work, at least not for the common case
where userspace configures VRAM+GTT. It thinks it can stop migration 
attempts
by playing with bo->allowed_domains vs bo->preferred_domains but, both from
the code, and from empirical experiments, I see that not working at all. 
Both
masks are identical so fiddling with them achieves nothing.

  * The idea of the fallback placement only works when VRAM has free space. As soon
as it does not, ttm_resource_compatible is happy to leave the buffers in the
secondary placement forever.

  * The driver thinks it will be re-validating evicted buffers on the next 
submission
but it does not for the very common case of VRAM+GTT because it only checks
if current placement is *none* of the preferred placements.

All those problems are addressed in individual patches.

The end result of this series appears to be a driver which will try harder to move
buffers back into VRAM, but will be (more) correctly throttled in doing so by
the existing rate limiting logic.

I have run a quick benchmark of Cyberpunk 2077 and cannot say that I saw a
change but that could be a good thing too. At least I did not break anything,
perhaps.. On one occasion I did see the rate limiting logic get confused while
for a period of a few minutes it went to a mode where it was constantly giving a
high migration budget. But that recovered itself when I switched clients and did
not come back so I don't know. If there is something wrong there I don't think
it would be caused by any patches in this series.

The series is probably rough but should be good enough for discussion. I am curious
to hear if I identified at least something correctly as a real problem.

It would also be good to hear what are the suggested games to check and see
whether there is any improvement.

Cc: Christian König 
Cc: Friedrich Vock 

Tvrtko Ursulin (5):
   drm/amdgpu: Fix migration rate limiting accounting
   drm/amdgpu: Actually respect buffer migration budget
   drm/ttm: Add preferred placement flag
   drm/amdgpu: Use preferred placement for VRAM+GTT
   drm/amdgpu: Re-validate evicted buffers

  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 38 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  8 +++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 21 ++--
  drivers/gpu/drm/ttm/ttm_resource.c | 13 +---
  include/drm/ttm/ttm_placement.h|  3 ++
  5 files changed, 65 insertions(+), 18 deletions(-)





RE: [PATCH 02/22] drm/amdgpu: the warning dereferencing obj for nbio_v7_4

2024-05-13 Thread Zhang, Jesse(Jie)
[AMD Official Use Only - AMD Internal Distribution Only]

Hi Tim

-Original Message-
From: Huang, Tim 
Sent: Monday, May 13, 2024 12:23 PM
To: Zhang, Jesse(Jie) ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Koenig, Christian 

Subject: RE: [PATCH 02/22] drm/amdgpu: the warning dereferencing obj for 
nbio_v7_4

[AMD Official Use Only - AMD Internal Distribution Only]

Hi Jesse,

> -Original Message-
> From: Zhang, Jesse(Jie) 
> Sent: Monday, May 13, 2024 10:18 AM
> To: Zhang, Jesse(Jie) ;
> amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Koenig, Christian
> ; Huang, Tim 
> Subject: RE: [PATCH 02/22] drm/amdgpu: the warning dereferencing obj
> for
> nbio_v7_4
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Ping ...
>
> -Original Message-
> From: Jesse Zhang 
> Sent: Friday, May 10, 2024 10:50 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Koenig, Christian
> ; Huang, Tim ; Zhang,
> Jesse(Jie) ; Zhang, Jesse(Jie)
> 
> Subject: [PATCH 02/22] drm/amdgpu: the warning dereferencing obj for
> nbio_v7_4
>
> if ras_manager obj null, don't print NBIO err data
>
> Signed-off-by: Jesse Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> index fe18df10daaa..26e5885db9b7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> +++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> @@ -383,7 +383,7 @@ static void
> nbio_v7_4_handle_ras_controller_intr_no_bifring(struct amdgpu_device
> else
> WREG32_SOC15(NBIO, 0, mmBIF_DOORBELL_INT_CNTL,
> bif_doorbell_intr_cntl);
>
> -   if (!ras->disable_ras_err_cnt_harvest) {
> +   if (!ras->disable_ras_err_cnt_harvest && obj) {
We may need to check the ras pointer as well, such as changing it to "if (ras && 
!ras->disable_ras_err_cnt_harvest && obj) {"

[Zhang, Jesse(Jie)]  Thanks, will update the patch .
Tim Huang

> /*
>  * clear error status after ras_controller_intr
>  * according to hw team and count ue number
> --
> 2.25.1
>




Re: [PATCH v3] drm/amdgpu: Add Ring Hang Events

2024-05-13 Thread Lazar, Lijo



On 5/13/2024 9:44 AM, Ori Messinger wrote:
> This patch adds 'ring hang' events to the driver.
> This is done by adding a 'reset_ring_hang' bool variable to the
> struct 'amdgpu_reset_context' in the amdgpu_reset.h file.
> The purpose of this 'reset_ring_hang' variable is that whenever a GPU
> reset is initiated due to a ring hang, reset_ring_hang should
> be set to 'true'.
> 
> Additionally, the reset cause is passed into the kfd smi event as
> a string, and is displayed in dmesg as an error.
> 
> This 'amdgpu_reset_context' struct is now also passed
> through across all relevant functions, and another event type
> "KFD_SMI_EVENT_RING_HANG" is added to the kfd_smi_event enum.
> 
> Signed-off-by: Ori Messinger 
> Change-Id: I6af3022eb1b4514201c9430d635ff87f167ad6f7
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c  |  7 +--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h  |  9 ++---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 16 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c |  4 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h   |  2 ++
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c |  7 ---
>  drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c |  7 ++-
>  drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h |  5 -
>  include/uapi/linux/kfd_ioctl.h  | 15 ---
>  10 files changed, 56 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index e3738d417245..f1c6dc939cc3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -133,6 +133,9 @@ static void amdgpu_amdkfd_reset_work(struct work_struct 
> *work)
>  
>   reset_context.method = AMD_RESET_METHOD_NONE;
>   reset_context.reset_req_dev = adev;
> + reset_context.reset_ring_hang = true;
> + strscpy(reset_context.reset_cause, "hws_hang", 
> sizeof(reset_context.reset_cause));

Please add only reset cause as an enum to reset context. There is no
need for a separate variable like ring hang.

For a user ring that induces an HWS hang, that may be identified
separately or generically with "HWS hang" as the reason. An HWS
hang could also be caused by a RAS error.

A possible list is -

DRIVER_TRIGGERED (suspend/reset on init etc)
JOB HANG,
HWS HANG,
USER TRIGGERED,
RAS ERROR,

The description string for the reset cause may be obtained separately with
something like the call below, which returns the details - there is no need
to add them to the reset context and pass them around.

amdgpu_reset_get_reset_desc(reset_context);

If the reset is caused by a specific job, the reset context already has a job
pointer. From there you may get more details like device/partition id,
job id, ring name etc. and provide the description. For RAS errors,
there is already a detailed dmesg log of the IP which caused the issue, so
the device alone could be sufficient.

> + DRM_ERROR("Reset cause: %s\n", reset_context.reset_cause);

Please don't use DRM_*. They are deprecated. Use either drm_err() or
dev_err() - they help to identify the device also.

Thanks,
Lijo

>   clear_bit(AMDGPU_NEED_FULL_RESET, _context.flags);
>  
>   amdgpu_device_gpu_recover(adev, NULL, _context);
> @@ -261,12 +264,12 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, 
> bool run_pm)
>   return r;
>  }
>  
> -int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
> +int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev, struct 
> amdgpu_reset_context *reset_context)
>  {
>   int r = 0;
>  
>   if (adev->kfd.dev)
> - r = kgd2kfd_pre_reset(adev->kfd.dev);
> + r = kgd2kfd_pre_reset(adev->kfd.dev, reset_context);
>  
>   return r;
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index 1de021ebdd46..c9030d8b8308 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -47,6 +47,7 @@ enum TLB_FLUSH_TYPE {
>  };
>  
>  struct amdgpu_device;
> +struct amdgpu_reset_context;
>  
>  enum kfd_mem_attachment_type {
>   KFD_MEM_ATT_SHARED, /* Share kgd_mem->bo or another attachment's */
> @@ -170,7 +171,8 @@ bool amdgpu_amdkfd_have_atomics_support(struct 
> amdgpu_device *adev);
>  
>  bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev, u32 vmid);
>  
> -int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev);
> +int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev,
> + struct amdgpu_reset_context *reset_context);
>  
>  int amdgpu_amdkfd_post_reset(struct amdgpu_device *adev);
>  
> @@ -416,7 +418,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>  void kgd2kfd_device_exit(struct kfd_dev *kfd);
>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
>  int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
> -int kgd2kfd_pre_reset(struct kfd_dev *kfd);
> +int