Re: [RFC 0/5] Discussion around eviction improvements

2024-05-12 Thread Christian König

Just FYI, I've been on sick leave for a while and now trying to catch up.

It will probably be at least week until I can look into this again.

Sorry,
Christian.

Am 08.05.24 um 20:09 schrieb Tvrtko Ursulin:

From: Tvrtko Ursulin 

Last few days I was looking at the situation with VRAM over subscription, what
happens versus what perhaps should happen. Browsing through the driver and
running some simple experiments.

I ended up with this patch series which, as a disclaimer, may be completely
wrong but as I found some suspicious things, to me at least, I thought it was a
good point to stop and request some comments.

To perhaps summarise what are the main issues I think I found:

  * Migration rate limiting does not bother knowing if actual migration happened
and so can over-account and unfairly penalise.

  * Migration rate limiting does not even work, at least not for the common case
where userspace configures VRAM+GTT. It thinks it can stop migration 
attempts
by playing with bo->allowed_domains vs bo->preferred domains but, both from
the code, and from empirical experiments, I see that not working at all. 
Both
masks are identical so fiddling with them achieves nothing.

  * Idea of the fallback placement only works when VRAM has free space. As soon
as it does not, ttm_resource_compatible is happy to leave the buffers in the
secondary placement forever.

  * Driver thinks it will be re-validating evicted buffers on the next 
submission
but it does not for the very common case of VRAM+GTT because it only checks
if current placement is *none* of the preferred placements.

All those problems are addressed in individual patches.

End result of this series appears to be driver which will try harder to move
buffers back into VRAM, but will be (more) correctly throttled in doing so by
the existing rate limiting logic.

I have run a quick benchmark of Cyberpunk 2077 and cannot say that I saw a
change but that could be a good thing too. At least I did not break anything,
perhaps.. On one occassion I did see the rate limiting logic get confused while
for a period of few minutes it went to a mode where it was constantly giving a
high migration budget. But that recovered itself when I switched clients and did
not come back so I don't know. If there is something wrong there I don't think
it would be caused by any patches in this series.

Series is probably rough but should be good enough for dicsussion. I am curious
to hear if I identified at least something correctly as a real problem.

It would also be good to hear what are the suggested games to check and see
whether there is any improvement.

Cc: Christian König 
Cc: Friedrich Vock 

Tvrtko Ursulin (5):
   drm/amdgpu: Fix migration rate limiting accounting
   drm/amdgpu: Actually respect buffer migration budget
   drm/ttm: Add preferred placement flag
   drm/amdgpu: Use preferred placement for VRAM+GTT
   drm/amdgpu: Re-validate evicted buffers

  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 38 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  8 +++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 21 ++--
  drivers/gpu/drm/ttm/ttm_resource.c | 13 +---
  include/drm/ttm/ttm_placement.h|  3 ++
  5 files changed, 65 insertions(+), 18 deletions(-)





RE: [PATCH 02/22] drm/amdgpu: the warning dereferencing obj for nbio_v7_4

2024-05-12 Thread Zhang, Jesse(Jie)
[AMD Official Use Only - AMD Internal Distribution Only]

Hi Tim

-Original Message-
From: Huang, Tim 
Sent: Monday, May 13, 2024 12:23 PM
To: Zhang, Jesse(Jie) ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Koenig, Christian 

Subject: RE: [PATCH 02/22] drm/amdgpu: the warning dereferencing obj for 
nbio_v7_4

[AMD Official Use Only - AMD Internal Distribution Only]

Hi Jesse,

> -Original Message-
> From: Zhang, Jesse(Jie) 
> Sent: Monday, May 13, 2024 10:18 AM
> To: Zhang, Jesse(Jie) ;
> amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Koenig, Christian
> ; Huang, Tim 
> Subject: RE: [PATCH 02/22] drm/amdgpu: the warning dereferencing obj
> for
> nbio_v7_4
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Ping ...
>
> -Original Message-
> From: Jesse Zhang 
> Sent: Friday, May 10, 2024 10:50 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Koenig, Christian
> ; Huang, Tim ; Zhang,
> Jesse(Jie) ; Zhang, Jesse(Jie)
> 
> Subject: [PATCH 02/22] drm/amdgpu: the warning dereferencing obj for
> nbio_v7_4
>
> if ras_manager obj null, don't print NBIO err data
>
> Signed-off-by: Jesse Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> index fe18df10daaa..26e5885db9b7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> +++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> @@ -383,7 +383,7 @@ static void
> nbio_v7_4_handle_ras_controller_intr_no_bifring(struct amdgpu_device
> else
> WREG32_SOC15(NBIO, 0, mmBIF_DOORBELL_INT_CNTL,
> bif_doorbell_intr_cntl);
>
> -   if (!ras->disable_ras_err_cnt_harvest) {
> +   if (!ras->disable_ras_err_cnt_harvest && obj) {
We may need to check the ras pointer as well?  Such as change to " if (ras && 
!ras->disable_ras_err_cnt_harvest && obj) {"

[Zhang, Jesse(Jie)]  Thanks, will update the patch .
Tim Huang

> /*
>  * clear error status after ras_controller_intr
>  * according to hw team and count ue number
> --
> 2.25.1
>




Re: [PATCH v3] drm/amdgpu: Add Ring Hang Events

2024-05-12 Thread Lazar, Lijo



On 5/13/2024 9:44 AM, Ori Messinger wrote:
> This patch adds 'ring hang' events to the driver.
> This is done by adding a 'reset_ring_hang' bool variable to the
> struct 'amdgpu_reset_context' in the amdgpu_reset.h file.
> The purpose for this 'reset_ring_hang' variable is whenever a GPU
> reset is initiated due to a ring hang, the reset_ring_hang should
> be set to 'true'.
> 
> Additionally, the reset cause is passed into the kfd smi event as
> a string, and is displayed in dmesg as an error.
> 
> This 'amdgpu_reset_context' struct is now also passed
> through across all relevant functions, and another event type
> "KFD_SMI_EVENT_RING_HANG" is added to the kfd_smi_event enum.
> 
> Signed-off-by: Ori Messinger 
> Change-Id: I6af3022eb1b4514201c9430d635ff87f167ad6f7
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c  |  7 +--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h  |  9 ++---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 16 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c |  4 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h   |  2 ++
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c |  7 ---
>  drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c |  7 ++-
>  drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h |  5 -
>  include/uapi/linux/kfd_ioctl.h  | 15 ---
>  10 files changed, 56 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index e3738d417245..f1c6dc939cc3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -133,6 +133,9 @@ static void amdgpu_amdkfd_reset_work(struct work_struct 
> *work)
>  
>   reset_context.method = AMD_RESET_METHOD_NONE;
>   reset_context.reset_req_dev = adev;
> + reset_context.reset_ring_hang = true;
> + strscpy(reset_context.reset_cause, "hws_hang", 
> sizeof(reset_context.reset_cause));

Please add only reset cause as an enum to reset context. There is no
need for a separate variable like ring hang.

For a user ring that induces a HWS hang, that may be identified
separately or identified generically with "HWS hang" as the reason. HWS
hang also could be caused by RAS error.

A possible list is -

DRIVER_TRIGGERED (suspend/reset on init etc)
JOB HANG,
HWS HANG,
USER TRIGGERED,
RAS ERROR,

The description string for reset cause may be obtained separately with
something like below which returns the details - no need to add them to
reset context and pass around.

amdgpu_reset_get_reset_desc(reset_context);

If reset is caused by a specific job, reset context already has a job
pointer. From there you may get more details like device/partition id,
job id, ring name etc. and provide the description. For RAS errors,
there is already detailed dmesg log of IP which caused issue. So only
device could be sufficient.

> + DRM_ERROR("Reset cause: %s\n", reset_context.reset_cause);

Please don't use DRM_*. They are deprecated. Use either drm_err() or
dev_err() - they help to identify the device also.

Thanks,
Lijo

>   clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
>  
>   amdgpu_device_gpu_recover(adev, NULL, &reset_context);
> @@ -261,12 +264,12 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, 
> bool run_pm)
>   return r;
>  }
>  
> -int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
> +int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev, struct 
> amdgpu_reset_context *reset_context)
>  {
>   int r = 0;
>  
>   if (adev->kfd.dev)
> - r = kgd2kfd_pre_reset(adev->kfd.dev);
> + r = kgd2kfd_pre_reset(adev->kfd.dev, reset_context);
>  
>   return r;
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index 1de021ebdd46..c9030d8b8308 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -47,6 +47,7 @@ enum TLB_FLUSH_TYPE {
>  };
>  
>  struct amdgpu_device;
> +struct amdgpu_reset_context;
>  
>  enum kfd_mem_attachment_type {
>   KFD_MEM_ATT_SHARED, /* Share kgd_mem->bo or another attachment's */
> @@ -170,7 +171,8 @@ bool amdgpu_amdkfd_have_atomics_support(struct 
> amdgpu_device *adev);
>  
>  bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev, u32 vmid);
>  
> -int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev);
> +int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev,
> + struct amdgpu_reset_context *reset_context);
>  
>  int amdgpu_amdkfd_post_reset(struct amdgpu_device *adev);
>  
> @@ -416,7 +418,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>  void kgd2kfd_device_exit(struct kfd_dev *kfd);
>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
>  int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
> -int kgd2kfd_pre_reset(struct kfd_dev *kfd);
> +int kgd2kfd_pr

RE: [PATCH 02/22] drm/amdgpu: the warning dereferencing obj for nbio_v7_4

2024-05-12 Thread Huang, Tim
[AMD Official Use Only - AMD Internal Distribution Only]

Hi Jesse,

> -Original Message-
> From: Zhang, Jesse(Jie) 
> Sent: Monday, May 13, 2024 10:18 AM
> To: Zhang, Jesse(Jie) ;
> amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Koenig, Christian
> ; Huang, Tim 
> Subject: RE: [PATCH 02/22] drm/amdgpu: the warning dereferencing obj for
> nbio_v7_4
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Ping ...
>
> -Original Message-
> From: Jesse Zhang 
> Sent: Friday, May 10, 2024 10:50 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Koenig, Christian
> ; Huang, Tim ; Zhang,
> Jesse(Jie) ; Zhang, Jesse(Jie)
> 
> Subject: [PATCH 02/22] drm/amdgpu: the warning dereferencing obj for
> nbio_v7_4
>
> if ras_manager obj null, don't print NBIO err data
>
> Signed-off-by: Jesse Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> index fe18df10daaa..26e5885db9b7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> +++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> @@ -383,7 +383,7 @@ static void
> nbio_v7_4_handle_ras_controller_intr_no_bifring(struct amdgpu_device
> else
> WREG32_SOC15(NBIO, 0,
> mmBIF_DOORBELL_INT_CNTL, bif_doorbell_intr_cntl);
>
> -   if (!ras->disable_ras_err_cnt_harvest) {
> +   if (!ras->disable_ras_err_cnt_harvest && obj) {
We may need to check the ras pointer as well?  Such as change to " if (ras && 
!ras->disable_ras_err_cnt_harvest && obj) {"


Tim Huang

> /*
>  * clear error status after ras_controller_intr
>  * according to hw team and count ue number
> --
> 2.25.1
>



[PATCH v3] drm/amdgpu: Add Ring Hang Events

2024-05-12 Thread Ori Messinger
This patch adds 'ring hang' events to the driver.
This is done by adding a 'reset_ring_hang' bool variable to the
struct 'amdgpu_reset_context' in the amdgpu_reset.h file.
The purpose for this 'reset_ring_hang' variable is whenever a GPU
reset is initiated due to a ring hang, the reset_ring_hang should
be set to 'true'.

Additionally, the reset cause is passed into the kfd smi event as
a string, and is displayed in dmesg as an error.

This 'amdgpu_reset_context' struct is now also passed
through across all relevant functions, and another event type
"KFD_SMI_EVENT_RING_HANG" is added to the kfd_smi_event enum.

Signed-off-by: Ori Messinger 
Change-Id: I6af3022eb1b4514201c9430d635ff87f167ad6f7
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c  |  7 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h  |  9 ++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 16 
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c |  4 
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h   |  2 ++
 drivers/gpu/drm/amd/amdkfd/kfd_device.c |  7 ---
 drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c |  7 ++-
 drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h |  5 -
 include/uapi/linux/kfd_ioctl.h  | 15 ---
 10 files changed, 56 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index e3738d417245..f1c6dc939cc3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -133,6 +133,9 @@ static void amdgpu_amdkfd_reset_work(struct work_struct 
*work)
 
reset_context.method = AMD_RESET_METHOD_NONE;
reset_context.reset_req_dev = adev;
+   reset_context.reset_ring_hang = true;
+   strscpy(reset_context.reset_cause, "hws_hang", 
sizeof(reset_context.reset_cause));
+   DRM_ERROR("Reset cause: %s\n", reset_context.reset_cause);
clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
 
amdgpu_device_gpu_recover(adev, NULL, &reset_context);
@@ -261,12 +264,12 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool 
run_pm)
return r;
 }
 
-int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
+int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev, struct 
amdgpu_reset_context *reset_context)
 {
int r = 0;
 
if (adev->kfd.dev)
-   r = kgd2kfd_pre_reset(adev->kfd.dev);
+   r = kgd2kfd_pre_reset(adev->kfd.dev, reset_context);
 
return r;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 1de021ebdd46..c9030d8b8308 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -47,6 +47,7 @@ enum TLB_FLUSH_TYPE {
 };
 
 struct amdgpu_device;
+struct amdgpu_reset_context;
 
 enum kfd_mem_attachment_type {
KFD_MEM_ATT_SHARED, /* Share kgd_mem->bo or another attachment's */
@@ -170,7 +171,8 @@ bool amdgpu_amdkfd_have_atomics_support(struct 
amdgpu_device *adev);
 
 bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev, u32 vmid);
 
-int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev);
+int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev,
+   struct amdgpu_reset_context *reset_context);
 
 int amdgpu_amdkfd_post_reset(struct amdgpu_device *adev);
 
@@ -416,7 +418,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
 void kgd2kfd_device_exit(struct kfd_dev *kfd);
 void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
 int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
-int kgd2kfd_pre_reset(struct kfd_dev *kfd);
+int kgd2kfd_pre_reset(struct kfd_dev *kfd,
+ struct amdgpu_reset_context *reset_context);
 int kgd2kfd_post_reset(struct kfd_dev *kfd);
 void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
 void kgd2kfd_set_sram_ecc_flag(struct kfd_dev *kfd);
@@ -459,7 +462,7 @@ static inline int kgd2kfd_resume(struct kfd_dev *kfd, bool 
run_pm)
return 0;
 }
 
-static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)
+static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd, struct 
amdgpu_reset_context *reset_context)
 {
return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 00fe3c2d5431..b18f37426b5e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5772,7 +5772,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 
cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
 
-   amdgpu_amdkfd_pre_reset(tmp_adev);
+   amdgpu_amdkfd_pre_reset(tmp_adev, reset_context);
 
/*
 * Mark these ASICs to be reseted as untracked first
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index e474

RE: [PATCH 18/22] drm/amd/pm: check negtive return for table entries

2024-05-12 Thread Huang, Tim
[AMD Official Use Only - AMD Internal Distribution Only]

Hi Jesse,

> -Original Message-
> From: Zhang, Jesse(Jie) 
> Sent: Monday, May 13, 2024 10:19 AM
> To: Zhang, Jesse(Jie) ;
> amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Koenig, Christian
> ; Huang, Tim 
> Subject: RE: [PATCH 18/22] drm/amd/pm: check negtive return for table
> entries
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Ping ...
>
> -Original Message-
> From: Jesse Zhang 
> Sent: Friday, May 10, 2024 10:51 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Koenig, Christian
> ; Huang, Tim ; Zhang,
> Jesse(Jie) ; Zhang, Jesse(Jie)
> 
> Subject: [PATCH 18/22] drm/amd/pm: check negtive return for table entries
>
> Function hwmgr->hwmgr_func->get_num_of_pp_table_entries(hwmgr)
> returns a negative number
>
> Signed-off-by: Jesse Zhang 
> ---
>  drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c
> b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c
> index f4bd8e9357e2..4433ec4e9cf2 100644
> --- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c
> +++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c
> @@ -30,9 +30,8 @@ int psm_init_power_state_table(struct pp_hwmgr
> *hwmgr)  {
> int result;
> unsigned int i;
> -   unsigned int table_entries;
> struct pp_power_state *state;
> -   int size;
> +   int size, table_entries;
>
> if (hwmgr->hwmgr_func->get_num_of_pp_table_entries == NULL)
> return 0;
> @@ -45,7 +44,7 @@ int psm_init_power_state_table(struct pp_hwmgr
> *hwmgr)
> hwmgr->ps_size = size =
> hwmgr->hwmgr_func->get_power_state_size(hwmgr) +
>   sizeof(struct
> pp_power_state);
>
> -   if (table_entries == 0 || size == 0) {
> +   if (table_entries <= 0 || size == 0) {
> pr_warn("Please check whether power state
> management is supported on this asic\n");
> return 0;
> }

We should need to check this before set the hwmgr->num_ps and hwmgr->ps_size, 
otherwise we may have incorrect hwmgr->num_ps if meet the condition 
table_entries < 0.

Tim Huang

> --
> 2.25.1
>



RE: [PATCH 12/22] drm/amd/pm: remove logically dead code

2024-05-12 Thread Wang, Yang(Kevin)
[AMD Official Use Only - AMD Internal Distribution Only]

Acked-by: Yang Wang 

Best Regards,
Kevin

-Original Message-
From: amd-gfx  On Behalf Of Zhang, 
Jesse(Jie)
Sent: Monday, May 13, 2024 10:19 AM
To: Zhang, Jesse(Jie) ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Koenig, Christian 
; Huang, Tim 
Subject: RE: [PATCH 12/22] drm/amd/pm: remove logically dead code

[AMD Official Use Only - AMD Internal Distribution Only]

[AMD Official Use Only - AMD Internal Distribution Only]

Ping ...

-Original Message-
From: amd-gfx  On Behalf Of Jesse Zhang
Sent: Friday, May 10, 2024 10:50 AM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Koenig, Christian 
; Huang, Tim ; Zhang, Jesse(Jie) 
; Zhang, Jesse(Jie) 
Subject: [PATCH 12/22] drm/amd/pm: remove logically dead code

Execution cannot reach this statement: case POWER_STATE_TYPE_BALAN.

Signed-off-by: Jesse Zhang 
---
 drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c 
b/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c
index 60377747bab4..e861355ebd75 100644
--- a/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c
+++ b/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c
@@ -831,15 +831,6 @@ static struct amdgpu_ps 
*amdgpu_dpm_pick_power_state(struct amdgpu_device *adev,
return ps;
}
break;
-   case POWER_STATE_TYPE_BALANCED:
-   if (ui_class == ATOM_PPLIB_CLASSIFICATION_UI_BALANCED) {
-   if (ps->caps & ATOM_PPLIB_SINGLE_DISPLAY_ONLY) {
-   if (single_display)
-   return ps;
-   } else
-   return ps;
-   }
-   break;
case POWER_STATE_TYPE_PERFORMANCE:
if (ui_class == 
ATOM_PPLIB_CLASSIFICATION_UI_PERFORMANCE) {
if (ps->caps & ATOM_PPLIB_SINGLE_DISPLAY_ONLY) {
--
2.25.1

<>

RE: [PATCH 17/22] drm/amdgpu: fix the warning bad bit shift operation for aca_error_type type

2024-05-12 Thread Wang, Yang(Kevin)
[AMD Official Use Only - AMD Internal Distribution Only]

Reviewed-by: Yang Wang 

Best Regards,
Kevin

-Original Message-
From: amd-gfx  On Behalf Of Zhang, 
Jesse(Jie)
Sent: Monday, May 13, 2024 10:19 AM
To: Zhang, Jesse(Jie) ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Koenig, Christian 
; Huang, Tim 
Subject: RE: [PATCH 17/22] drm/amdgpu: fix the warning bad bit shift operation 
for aca_error_type type

[AMD Official Use Only - AMD Internal Distribution Only]

[AMD Official Use Only - AMD Internal Distribution Only]

Ping ...

-Original Message-
From: amd-gfx  On Behalf Of Jesse Zhang
Sent: Friday, May 10, 2024 10:51 AM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Koenig, Christian 
; Huang, Tim ; Zhang, Jesse(Jie) 
; Zhang, Jesse(Jie) 
Subject: [PATCH 17/22] drm/amdgpu: fix the warning bad bit shift operation for 
aca_error_type type

Filter invalid aca error types before performing a shift operation.

Signed-off-by: Jesse Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
index 28febf33fb1b..9e3560c190e3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
@@ -534,7 +534,7 @@ int amdgpu_aca_get_error_data(struct amdgpu_device *adev, 
struct aca_handle *han
if (aca_handle_is_valid(handle))
return -EOPNOTSUPP;

-   if (!(BIT(type) & handle->mask))
+   if ((type < 0) || (!(BIT(type) & handle->mask)))
return  0;

return __aca_get_error_data(adev, handle, type, err_data, qctx);
--
2.25.1

<>

RE: [PATCH 12/22] drm/amd/pm: remove logically dead code

2024-05-12 Thread Zhang, Jesse(Jie)
[AMD Official Use Only - AMD Internal Distribution Only]

Ping ...

-Original Message-
From: amd-gfx  On Behalf Of Jesse Zhang
Sent: Friday, May 10, 2024 10:50 AM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Koenig, Christian 
; Huang, Tim ; Zhang, Jesse(Jie) 
; Zhang, Jesse(Jie) 
Subject: [PATCH 12/22] drm/amd/pm: remove logically dead code

Execution cannot reach this statement: case POWER_STATE_TYPE_BALAN.

Signed-off-by: Jesse Zhang 
---
 drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c 
b/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c
index 60377747bab4..e861355ebd75 100644
--- a/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c
+++ b/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c
@@ -831,15 +831,6 @@ static struct amdgpu_ps 
*amdgpu_dpm_pick_power_state(struct amdgpu_device *adev,
return ps;
}
break;
-   case POWER_STATE_TYPE_BALANCED:
-   if (ui_class == ATOM_PPLIB_CLASSIFICATION_UI_BALANCED) {
-   if (ps->caps & ATOM_PPLIB_SINGLE_DISPLAY_ONLY) {
-   if (single_display)
-   return ps;
-   } else
-   return ps;
-   }
-   break;
case POWER_STATE_TYPE_PERFORMANCE:
if (ui_class == 
ATOM_PPLIB_CLASSIFICATION_UI_PERFORMANCE) {
if (ps->caps & ATOM_PPLIB_SINGLE_DISPLAY_ONLY) {
--
2.25.1



RE: [PATCH 18/22] drm/amd/pm: check negtive return for table entries

2024-05-12 Thread Zhang, Jesse(Jie)
[AMD Official Use Only - AMD Internal Distribution Only]

Ping ...

-Original Message-
From: Jesse Zhang 
Sent: Friday, May 10, 2024 10:51 AM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Koenig, Christian 
; Huang, Tim ; Zhang, Jesse(Jie) 
; Zhang, Jesse(Jie) 
Subject: [PATCH 18/22] drm/amd/pm: check negtive return for table entries

Function hwmgr->hwmgr_func->get_num_of_pp_table_entries(hwmgr) returns a 
negative number

Signed-off-by: Jesse Zhang 
---
 drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c 
b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c
index f4bd8e9357e2..4433ec4e9cf2 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c
@@ -30,9 +30,8 @@ int psm_init_power_state_table(struct pp_hwmgr *hwmgr)  {
int result;
unsigned int i;
-   unsigned int table_entries;
struct pp_power_state *state;
-   int size;
+   int size, table_entries;

if (hwmgr->hwmgr_func->get_num_of_pp_table_entries == NULL)
return 0;
@@ -45,7 +44,7 @@ int psm_init_power_state_table(struct pp_hwmgr *hwmgr)
hwmgr->ps_size = size = hwmgr->hwmgr_func->get_power_state_size(hwmgr) +
  sizeof(struct pp_power_state);

-   if (table_entries == 0 || size == 0) {
+   if (table_entries <= 0 || size == 0) {
pr_warn("Please check whether power state management is 
supported on this asic\n");
return 0;
}
--
2.25.1



RE: [PATCH 17/22] drm/amdgpu: fix the warning bad bit shift operation for aca_error_type type

2024-05-12 Thread Zhang, Jesse(Jie)
[AMD Official Use Only - AMD Internal Distribution Only]

Ping ...

-Original Message-
From: amd-gfx  On Behalf Of Jesse Zhang
Sent: Friday, May 10, 2024 10:51 AM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Koenig, Christian 
; Huang, Tim ; Zhang, Jesse(Jie) 
; Zhang, Jesse(Jie) 
Subject: [PATCH 17/22] drm/amdgpu: fix the warning bad bit shift operation for 
aca_error_type type

Filter invalid aca error types before performing a shift operation.

Signed-off-by: Jesse Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
index 28febf33fb1b..9e3560c190e3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
@@ -534,7 +534,7 @@ int amdgpu_aca_get_error_data(struct amdgpu_device *adev, 
struct aca_handle *han
if (aca_handle_is_valid(handle))
return -EOPNOTSUPP;

-   if (!(BIT(type) & handle->mask))
+   if ((type < 0) || (!(BIT(type) & handle->mask)))
return  0;

return __aca_get_error_data(adev, handle, type, err_data, qctx);
--
2.25.1



RE: [PATCH 02/22] drm/amdgpu: the warning dereferencing obj for nbio_v7_4

2024-05-12 Thread Zhang, Jesse(Jie)
[AMD Official Use Only - AMD Internal Distribution Only]

Ping ...

-Original Message-
From: Jesse Zhang 
Sent: Friday, May 10, 2024 10:50 AM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Koenig, Christian 
; Huang, Tim ; Zhang, Jesse(Jie) 
; Zhang, Jesse(Jie) 
Subject: [PATCH 02/22] drm/amdgpu: the warning dereferencing obj for nbio_v7_4

if ras_manager obj null, don't print NBIO err data

Signed-off-by: Jesse Zhang 
---
 drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c 
b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
index fe18df10daaa..26e5885db9b7 100644
--- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
+++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
@@ -383,7 +383,7 @@ static void 
nbio_v7_4_handle_ras_controller_intr_no_bifring(struct amdgpu_device
else
WREG32_SOC15(NBIO, 0, mmBIF_DOORBELL_INT_CNTL, 
bif_doorbell_intr_cntl);

-   if (!ras->disable_ras_err_cnt_harvest) {
+   if (!ras->disable_ras_err_cnt_harvest && obj) {
/*
 * clear error status after ras_controller_intr
 * according to hw team and count ue number
--
2.25.1



Re: [PATCH] drm/mst: Fix NULL pointer dereference at drm_dp_add_payload_part2

2024-05-12 Thread Limonciello, Mario




On 5/10/2024 4:24 AM, Jani Nikula wrote:

On Fri, 10 May 2024, "Lin, Wayne"  wrote:

[Public]


-Original Message-
From: Limonciello, Mario 
Sent: Friday, May 10, 2024 3:18 AM
To: Linux regressions mailing list ; Wentland, 
Harry
; Lin, Wayne 
Cc: ly...@redhat.com; imre.d...@intel.com; Leon Weiß ; sta...@vger.kernel.org; dri-de...@lists.freedesktop.org; amd-
g...@lists.freedesktop.org; intel-...@lists.freedesktop.org
Subject: Re: [PATCH] drm/mst: Fix NULL pointer dereference at
drm_dp_add_payload_part2

On 5/9/2024 07:43, Linux regression tracking (Thorsten Leemhuis) wrote:

On 18.04.24 21:43, Harry Wentland wrote:

On 2024-03-07 01:29, Wayne Lin wrote:

[Why]
Commit:
- commit 5aa1dfcdf0a4 ("drm/mst: Refactor the flow for payload
allocation/removement") accidently overwrite the commit
- commit 54d217406afe ("drm: use mgr->dev in drm_dbg_kms in
drm_dp_add_payload_part2") which cause regression.

[How]
Recover the original NULL fix and remove the unnecessary input
parameter 'state' for drm_dp_add_payload_part2().

Fixes: 5aa1dfcdf0a4 ("drm/mst: Refactor the flow for payload
allocation/removement")
Reported-by: Leon Weiß 
Link:
https://lore.kernel.org/r/38c253ea42072cc825dc969ac4e6b9b600371cc8.c
a...@ruhr-uni-bochum.de/
Cc: ly...@redhat.com
Cc: imre.d...@intel.com
Cc: sta...@vger.kernel.org
Cc: regressi...@lists.linux.dev
Signed-off-by: Wayne Lin 


I haven't been deep in MST code in a while but this all looks pretty
straightforward and good.

Reviewed-by: Harry Wentland 


Hmmm, that was three weeks ago, but it seems since then nothing
happened to fix the linked regression through this or some other
patch. Is there a reason? The build failure report from the CI maybe?


It touches files outside of amd but only has an ack from AMD.  I think we
/probably/ want an ack from i915 and nouveau to take it through.


Thanks, Mario!

Hi Thorsten,
Yeah, like what Mario said. Would also like to have ack from i915 and nouveau.


It usually works better if you Cc the folks you want an ack from! ;)

Acked-by: Jani Nikula 



Thanks! Can someone with commit permissions take this to drm-misc?



Re: [PATCH] drm/buddy: Fix the range bias clear memory allocation issue

2024-05-12 Thread Paneer Selvam, Arunpravin

Hi Daniel,

On 5/8/2024 2:11 PM, Daniel Vetter wrote:

On Wed, May 08, 2024 at 12:27:20PM +0530, Arunpravin Paneer Selvam wrote:

Problem statement: During the system boot time, an application request
for the bulk volume of cleared range bias memory when the clear_avail
is zero, we dont fallback into normal allocation method as we had an
unnecessary clear_avail check which prevents the fallback method leads
to fb allocation failure following system goes into unresponsive state.

Solution: Remove the unnecessary clear_avail check in the range bias
allocation function.

Signed-off-by: Arunpravin Paneer Selvam 
Fixes: 96950929eb23 ("drm/buddy: Implement tracking clear page feature")
Reviewed-by: Matthew Auld 
---
  drivers/gpu/drm/drm_buddy.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

Can you please also add a kunit test case to exercise this corner case and
make sure it stays fixed?

I have sent the v2 along with a kunit test case for this corner case.

Thanks,
Arun.


Thanks, Sima

diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
index 284ebae71cc4..831929ac95eb 100644
--- a/drivers/gpu/drm/drm_buddy.c
+++ b/drivers/gpu/drm/drm_buddy.c
@@ -574,7 +574,7 @@ __drm_buddy_alloc_range_bias(struct drm_buddy *mm,
  
  	block = __alloc_range_bias(mm, start, end, order,

   flags, fallback);
-   if (IS_ERR(block) && mm->clear_avail)
+   if (IS_ERR(block))
return __alloc_range_bias(mm, start, end, order,
  flags, !fallback);
  
--

2.25.1





[PATCH v2 1/2] drm/buddy: Fix the range bias clear memory allocation issue

2024-05-12 Thread Arunpravin Paneer Selvam
Problem statement: During the system boot time, an application request
for the bulk volume of cleared range bias memory when the clear_avail
is zero, we dont fallback into normal allocation method as we had an
unnecessary clear_avail check which prevents the fallback method leads
to fb allocation failure following system goes into unresponsive state.

Solution: Remove the unnecessary clear_avail check in the range bias
allocation function.

v2: add a kunit for this corner case (Daniel Vetter)

Signed-off-by: Arunpravin Paneer Selvam 
Fixes: 96950929eb23 ("drm/buddy: Implement tracking clear page feature")
Reviewed-by: Matthew Auld 
---
 drivers/gpu/drm/drm_buddy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
index 284ebae71cc4..831929ac95eb 100644
--- a/drivers/gpu/drm/drm_buddy.c
+++ b/drivers/gpu/drm/drm_buddy.c
@@ -574,7 +574,7 @@ __drm_buddy_alloc_range_bias(struct drm_buddy *mm,
 
block = __alloc_range_bias(mm, start, end, order,
   flags, fallback);
-   if (IS_ERR(block) && mm->clear_avail)
+   if (IS_ERR(block))
return __alloc_range_bias(mm, start, end, order,
  flags, !fallback);
 
-- 
2.25.1



[PATCH v2 2/2] drm/tests: Add a unit test for range bias allocation

2024-05-12 Thread Arunpravin Paneer Selvam
Allocate cleared blocks in the bias range when the DRM
buddy's clear avail is zero. This will validate the bias
range allocation in scenarios like system boot when no
cleared blocks are available and exercise the fallback
path too. The resulting blocks should always be dirty.

Signed-off-by: Arunpravin Paneer Selvam 
---
 drivers/gpu/drm/tests/drm_buddy_test.c | 35 ++
 1 file changed, 35 insertions(+)

diff --git a/drivers/gpu/drm/tests/drm_buddy_test.c 
b/drivers/gpu/drm/tests/drm_buddy_test.c
index e3b50e240d36..a194f271bc55 100644
--- a/drivers/gpu/drm/tests/drm_buddy_test.c
+++ b/drivers/gpu/drm/tests/drm_buddy_test.c
@@ -26,6 +26,8 @@ static void drm_test_buddy_alloc_range_bias(struct kunit 
*test)
u32 mm_size, ps, bias_size, bias_start, bias_end, bias_rem;
DRM_RND_STATE(prng, random_seed);
unsigned int i, count, *order;
+   struct drm_buddy_block *block;
+   unsigned long flags;
struct drm_buddy mm;
LIST_HEAD(allocated);
 
@@ -222,6 +224,39 @@ static void drm_test_buddy_alloc_range_bias(struct kunit 
*test)
 
drm_buddy_free_list(&mm, &allocated, 0);
drm_buddy_fini(&mm);
+
+   /*
+* Allocate cleared blocks in the bias range when the DRM buddy's clear 
avail is
+* zero. This will validate the bias range allocation in scenarios like 
system boot
+* when no cleared blocks are available and exercise the fallback path 
too. The resulting
+* blocks should always be dirty.
+*/
+
+   KUNIT_ASSERT_FALSE_MSG(test, drm_buddy_init(&mm, mm_size, ps),
+  "buddy_init failed\n");
+   mm.clear_avail = 0;
+
+   bias_start = round_up(prandom_u32_state(&prng) % (mm_size - ps), ps);
+   bias_end = round_up(bias_start + prandom_u32_state(&prng) % (mm_size - 
bias_start), ps);
+   bias_end = max(bias_end, bias_start + ps);
+   bias_rem = bias_end - bias_start;
+
+   flags = DRM_BUDDY_CLEAR_ALLOCATION | DRM_BUDDY_RANGE_ALLOCATION;
+   u32 size = max(round_up(prandom_u32_state(&prng) % bias_rem, ps), ps);
+
+   KUNIT_ASSERT_FALSE_MSG(test,
+  drm_buddy_alloc_blocks(&mm, bias_start,
+ bias_end, size, ps,
+ &allocated,
+ flags),
+  "buddy_alloc failed with bias(%x-%x), size=%u, 
ps=%u\n",
+  bias_start, bias_end, size, ps);
+
+   list_for_each_entry(block, &allocated, link)
+   KUNIT_EXPECT_EQ(test, drm_buddy_block_is_clear(block), false);
+
+   drm_buddy_free_list(&mm, &allocated, 0);
+   drm_buddy_fini(&mm);
 }
 
 static void drm_test_buddy_alloc_clear(struct kunit *test)
-- 
2.25.1