Re: [RFC 0/5] Discussion around eviction improvements
Just FYI, I've been on sick leave for a while and now trying to catch up. It will probably be at least week until I can look into this again. Sorry, Christian. Am 08.05.24 um 20:09 schrieb Tvrtko Ursulin: From: Tvrtko Ursulin Last few days I was looking at the situation with VRAM over subscription, what happens versus what perhaps should happen. Browsing through the driver and running some simple experiments. I ended up with this patch series which, as a disclaimer, may be completely wrong but as I found some suspicious things, to me at least, I thought it was a good point to stop and request some comments. To perhaps summarise what are the main issues I think I found: * Migration rate limiting does not bother knowing if actual migration happened and so can over-account and unfairly penalise. * Migration rate limiting does not even work, at least not for the common case where userspace configures VRAM+GTT. It thinks it can stop migration attempts by playing with bo->allowed_domains vs bo->preferred domains but, both from the code, and from empirical experiments, I see that not working at all. Both masks are identical so fiddling with them achieves nothing. * Idea of the fallback placement only works when VRAM has free space. As soon as it does not, ttm_resource_compatible is happy to leave the buffers in the secondary placement forever. * Driver thinks it will be re-validating evicted buffers on the next submission but it does not for the very common case of VRAM+GTT because it only checks if current placement is *none* of the preferred placements. All those problems are addressed in individual patches. End result of this series appears to be driver which will try harder to move buffers back into VRAM, but will be (more) correctly throttled in doing so by the existing rate limiting logic. I have run a quick benchmark of Cyberpunk 2077 and cannot say that I saw a change but that could be a good thing too. At least I did not break anything, perhaps.. On one occassion I did see the rate limiting logic get confused while for a period of few minutes it went to a mode where it was constantly giving a high migration budget. But that recovered itself when I switched clients and did not come back so I don't know. If there is something wrong there I don't think it would be caused by any patches in this series. Series is probably rough but should be good enough for dicsussion. I am curious to hear if I identified at least something correctly as a real problem. It would also be good to hear what are the suggested games to check and see whether there is any improvement. Cc: Christian König Cc: Friedrich Vock Tvrtko Ursulin (5): drm/amdgpu: Fix migration rate limiting accounting drm/amdgpu: Actually respect buffer migration budget drm/ttm: Add preferred placement flag drm/amdgpu: Use preferred placement for VRAM+GTT drm/amdgpu: Re-validate evicted buffers drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 38 +- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 +++-- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 21 ++-- drivers/gpu/drm/ttm/ttm_resource.c | 13 +--- include/drm/ttm/ttm_placement.h| 3 ++ 5 files changed, 65 insertions(+), 18 deletions(-)
RE: [PATCH 02/22] drm/amdgpu: the warning dereferencing obj for nbio_v7_4
[AMD Official Use Only - AMD Internal Distribution Only] Hi Tim -Original Message- From: Huang, Tim Sent: Monday, May 13, 2024 12:23 PM To: Zhang, Jesse(Jie) ; amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander ; Koenig, Christian Subject: RE: [PATCH 02/22] drm/amdgpu: the warning dereferencing obj for nbio_v7_4 [AMD Official Use Only - AMD Internal Distribution Only] Hi Jesse, > -Original Message- > From: Zhang, Jesse(Jie) > Sent: Monday, May 13, 2024 10:18 AM > To: Zhang, Jesse(Jie) ; > amd-gfx@lists.freedesktop.org > Cc: Deucher, Alexander ; Koenig, Christian > ; Huang, Tim > Subject: RE: [PATCH 02/22] drm/amdgpu: the warning dereferencing obj > for > nbio_v7_4 > > [AMD Official Use Only - AMD Internal Distribution Only] > > Ping ... > > -Original Message- > From: Jesse Zhang > Sent: Friday, May 10, 2024 10:50 AM > To: amd-gfx@lists.freedesktop.org > Cc: Deucher, Alexander ; Koenig, Christian > ; Huang, Tim ; Zhang, > Jesse(Jie) ; Zhang, Jesse(Jie) > > Subject: [PATCH 02/22] drm/amdgpu: the warning dereferencing obj for > nbio_v7_4 > > if ras_manager obj null, don't print NBIO err data > > Signed-off-by: Jesse Zhang > --- > drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c > b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c > index fe18df10daaa..26e5885db9b7 100644 > --- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c > +++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c > @@ -383,7 +383,7 @@ static void > nbio_v7_4_handle_ras_controller_intr_no_bifring(struct amdgpu_device > else > WREG32_SOC15(NBIO, 0, mmBIF_DOORBELL_INT_CNTL, > bif_doorbell_intr_cntl); > > - if (!ras->disable_ras_err_cnt_harvest) { > + if (!ras->disable_ras_err_cnt_harvest && obj) { We may need to check the ras pointer as well? Such as change to " if (ras && !ras->disable_ras_err_cnt_harvest && obj) {" [Zhang, Jesse(Jie)] Thanks, will update the patch . Tim Huang > /* > * clear error status after ras_controller_intr > * according to hw team and count ue number > -- > 2.25.1 >
Re: [PATCH v3] drm/amdgpu: Add Ring Hang Events
On 5/13/2024 9:44 AM, Ori Messinger wrote: > This patch adds 'ring hang' events to the driver. > This is done by adding a 'reset_ring_hang' bool variable to the > struct 'amdgpu_reset_context' in the amdgpu_reset.h file. > The purpose for this 'reset_ring_hang' variable is whenever a GPU > reset is initiated due to a ring hang, the reset_ring_hang should > be set to 'true'. > > Additionally, the reset cause is passed into the kfd smi event as > a string, and is displayed in dmesg as an error. > > This 'amdgpu_reset_context' struct is now also passed > through across all relevant functions, and another event type > "KFD_SMI_EVENT_RING_HANG" is added to the kfd_smi_event enum. > > Signed-off-by: Ori Messinger > Change-Id: I6af3022eb1b4514201c9430d635ff87f167ad6f7 > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 7 +-- > drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 9 ++--- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 16 > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 > drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 2 ++ > drivers/gpu/drm/amd/amdkfd/kfd_device.c | 7 --- > drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 7 ++- > drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h | 5 - > include/uapi/linux/kfd_ioctl.h | 15 --- > 10 files changed, 56 insertions(+), 18 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c > index e3738d417245..f1c6dc939cc3 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c > @@ -133,6 +133,9 @@ static void amdgpu_amdkfd_reset_work(struct work_struct > *work) > > reset_context.method = AMD_RESET_METHOD_NONE; > reset_context.reset_req_dev = adev; > + reset_context.reset_ring_hang = true; > + strscpy(reset_context.reset_cause, "hws_hang", > sizeof(reset_context.reset_cause)); Please add only reset cause as an enum to reset context. There is no need for a separate variable like ring hang. For a user ring that induces a HWS hang, that may be identified separately or identified generically with "HWS hang" as the reason. HWS hang also could be caused by RAS error. A possible list is - DRIVER_TRIGGERED (suspend/reset on init etc) JOB HANG, HWS HANG, USER TRIGGERED, RAS ERROR, The description string for reset cause may be obtained separately with something like below which returns the details - no need to add them to reset context and pass around. amdgpu_reset_get_reset_desc(reset_context); If reset is caused by a specific job, reset context already has a job pointer. From there you may get more details like device/partition id, job id, ring name etc. and provide the description. For RAS errors, there is already detailed dmesg log of IP which caused issue. So only device could be sufficient. > + DRM_ERROR("Reset cause: %s\n", reset_context.reset_cause); Please don't use DRM_*. They are deprecated. Use either drm_err() or dev_err() - they help to identify the device also. Thanks, Lijo > clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags); > > amdgpu_device_gpu_recover(adev, NULL, &reset_context); > @@ -261,12 +264,12 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, > bool run_pm) > return r; > } > > -int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev) > +int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev, struct > amdgpu_reset_context *reset_context) > { > int r = 0; > > if (adev->kfd.dev) > - r = kgd2kfd_pre_reset(adev->kfd.dev); > + r = kgd2kfd_pre_reset(adev->kfd.dev, reset_context); > > return r; > } > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h > b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h > index 1de021ebdd46..c9030d8b8308 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h > @@ -47,6 +47,7 @@ enum TLB_FLUSH_TYPE { > }; > > struct amdgpu_device; > +struct amdgpu_reset_context; > > enum kfd_mem_attachment_type { > KFD_MEM_ATT_SHARED, /* Share kgd_mem->bo or another attachment's */ > @@ -170,7 +171,8 @@ bool amdgpu_amdkfd_have_atomics_support(struct > amdgpu_device *adev); > > bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev, u32 vmid); > > -int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev); > +int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev, > + struct amdgpu_reset_context *reset_context); > > int amdgpu_amdkfd_post_reset(struct amdgpu_device *adev); > > @@ -416,7 +418,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd, > void kgd2kfd_device_exit(struct kfd_dev *kfd); > void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm); > int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm); > -int kgd2kfd_pre_reset(struct kfd_dev *kfd); > +int kgd2kfd_pr
RE: [PATCH 02/22] drm/amdgpu: the warning dereferencing obj for nbio_v7_4
[AMD Official Use Only - AMD Internal Distribution Only] Hi Jesse, > -Original Message- > From: Zhang, Jesse(Jie) > Sent: Monday, May 13, 2024 10:18 AM > To: Zhang, Jesse(Jie) ; > amd-gfx@lists.freedesktop.org > Cc: Deucher, Alexander ; Koenig, Christian > ; Huang, Tim > Subject: RE: [PATCH 02/22] drm/amdgpu: the warning dereferencing obj for > nbio_v7_4 > > [AMD Official Use Only - AMD Internal Distribution Only] > > Ping ... > > -Original Message- > From: Jesse Zhang > Sent: Friday, May 10, 2024 10:50 AM > To: amd-gfx@lists.freedesktop.org > Cc: Deucher, Alexander ; Koenig, Christian > ; Huang, Tim ; Zhang, > Jesse(Jie) ; Zhang, Jesse(Jie) > > Subject: [PATCH 02/22] drm/amdgpu: the warning dereferencing obj for > nbio_v7_4 > > if ras_manager obj null, don't print NBIO err data > > Signed-off-by: Jesse Zhang > --- > drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c > b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c > index fe18df10daaa..26e5885db9b7 100644 > --- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c > +++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c > @@ -383,7 +383,7 @@ static void > nbio_v7_4_handle_ras_controller_intr_no_bifring(struct amdgpu_device > else > WREG32_SOC15(NBIO, 0, > mmBIF_DOORBELL_INT_CNTL, bif_doorbell_intr_cntl); > > - if (!ras->disable_ras_err_cnt_harvest) { > + if (!ras->disable_ras_err_cnt_harvest && obj) { We may need to check the ras pointer as well? Such as change to " if (ras && !ras->disable_ras_err_cnt_harvest && obj) {" Tim Huang > /* > * clear error status after ras_controller_intr > * according to hw team and count ue number > -- > 2.25.1 >
[PATCH v3] drm/amdgpu: Add Ring Hang Events
This patch adds 'ring hang' events to the driver. This is done by adding a 'reset_ring_hang' bool variable to the struct 'amdgpu_reset_context' in the amdgpu_reset.h file. The purpose for this 'reset_ring_hang' variable is whenever a GPU reset is initiated due to a ring hang, the reset_ring_hang should be set to 'true'. Additionally, the reset cause is passed into the kfd smi event as a string, and is displayed in dmesg as an error. This 'amdgpu_reset_context' struct is now also passed through across all relevant functions, and another event type "KFD_SMI_EVENT_RING_HANG" is added to the kfd_smi_event enum. Signed-off-by: Ori Messinger Change-Id: I6af3022eb1b4514201c9430d635ff87f167ad6f7 --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 7 +-- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 9 ++--- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 16 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 2 ++ drivers/gpu/drm/amd/amdkfd/kfd_device.c | 7 --- drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 7 ++- drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h | 5 - include/uapi/linux/kfd_ioctl.h | 15 --- 10 files changed, 56 insertions(+), 18 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c index e3738d417245..f1c6dc939cc3 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c @@ -133,6 +133,9 @@ static void amdgpu_amdkfd_reset_work(struct work_struct *work) reset_context.method = AMD_RESET_METHOD_NONE; reset_context.reset_req_dev = adev; + reset_context.reset_ring_hang = true; + strscpy(reset_context.reset_cause, "hws_hang", sizeof(reset_context.reset_cause)); + DRM_ERROR("Reset cause: %s\n", reset_context.reset_cause); clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags); amdgpu_device_gpu_recover(adev, NULL, &reset_context); @@ -261,12 +264,12 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm) return r; } -int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev) +int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev, struct amdgpu_reset_context *reset_context) { int r = 0; if (adev->kfd.dev) - r = kgd2kfd_pre_reset(adev->kfd.dev); + r = kgd2kfd_pre_reset(adev->kfd.dev, reset_context); return r; } diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h index 1de021ebdd46..c9030d8b8308 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h @@ -47,6 +47,7 @@ enum TLB_FLUSH_TYPE { }; struct amdgpu_device; +struct amdgpu_reset_context; enum kfd_mem_attachment_type { KFD_MEM_ATT_SHARED, /* Share kgd_mem->bo or another attachment's */ @@ -170,7 +171,8 @@ bool amdgpu_amdkfd_have_atomics_support(struct amdgpu_device *adev); bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev, u32 vmid); -int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev); +int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev, + struct amdgpu_reset_context *reset_context); int amdgpu_amdkfd_post_reset(struct amdgpu_device *adev); @@ -416,7 +418,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd, void kgd2kfd_device_exit(struct kfd_dev *kfd); void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm); int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm); -int kgd2kfd_pre_reset(struct kfd_dev *kfd); +int kgd2kfd_pre_reset(struct kfd_dev *kfd, + struct amdgpu_reset_context *reset_context); int kgd2kfd_post_reset(struct kfd_dev *kfd); void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry); void kgd2kfd_set_sram_ecc_flag(struct kfd_dev *kfd); @@ -459,7 +462,7 @@ static inline int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm) return 0; } -static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd) +static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd, struct amdgpu_reset_context *reset_context) { return 0; } diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 00fe3c2d5431..b18f37426b5e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -5772,7 +5772,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, cancel_delayed_work_sync(&tmp_adev->delayed_init_work); - amdgpu_amdkfd_pre_reset(tmp_adev); + amdgpu_amdkfd_pre_reset(tmp_adev, reset_context); /* * Mark these ASICs to be reseted as untracked first diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c index e474
RE: [PATCH 18/22] drm/amd/pm: check negtive return for table entries
[AMD Official Use Only - AMD Internal Distribution Only] Hi Jesse, > -Original Message- > From: Zhang, Jesse(Jie) > Sent: Monday, May 13, 2024 10:19 AM > To: Zhang, Jesse(Jie) ; > amd-gfx@lists.freedesktop.org > Cc: Deucher, Alexander ; Koenig, Christian > ; Huang, Tim > Subject: RE: [PATCH 18/22] drm/amd/pm: check negtive return for table > entries > > [AMD Official Use Only - AMD Internal Distribution Only] > > Ping ... > > -Original Message- > From: Jesse Zhang > Sent: Friday, May 10, 2024 10:51 AM > To: amd-gfx@lists.freedesktop.org > Cc: Deucher, Alexander ; Koenig, Christian > ; Huang, Tim ; Zhang, > Jesse(Jie) ; Zhang, Jesse(Jie) > > Subject: [PATCH 18/22] drm/amd/pm: check negtive return for table entries > > Function hwmgr->hwmgr_func->get_num_of_pp_table_entries(hwmgr) > returns a negative number > > Signed-off-by: Jesse Zhang > --- > drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c | 5 ++--- > 1 file changed, 2 insertions(+), 3 deletions(-) > > diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c > b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c > index f4bd8e9357e2..4433ec4e9cf2 100644 > --- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c > +++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c > @@ -30,9 +30,8 @@ int psm_init_power_state_table(struct pp_hwmgr > *hwmgr) { > int result; > unsigned int i; > - unsigned int table_entries; > struct pp_power_state *state; > - int size; > + int size, table_entries; > > if (hwmgr->hwmgr_func->get_num_of_pp_table_entries == NULL) > return 0; > @@ -45,7 +44,7 @@ int psm_init_power_state_table(struct pp_hwmgr > *hwmgr) > hwmgr->ps_size = size = > hwmgr->hwmgr_func->get_power_state_size(hwmgr) + > sizeof(struct > pp_power_state); > > - if (table_entries == 0 || size == 0) { > + if (table_entries <= 0 || size == 0) { > pr_warn("Please check whether power state > management is supported on this asic\n"); > return 0; > } We should need to check this before set the hwmgr->num_ps and hwmgr->ps_size, otherwise we may have incorrect hwmgr->num_ps if meet the condition table_entries < 0. Tim Huang > -- > 2.25.1 >
RE: [PATCH 12/22] drm/amd/pm: remove logically dead code
[AMD Official Use Only - AMD Internal Distribution Only] Acked-by: Yang Wang Best Regards, Kevin -Original Message- From: amd-gfx On Behalf Of Zhang, Jesse(Jie) Sent: Monday, May 13, 2024 10:19 AM To: Zhang, Jesse(Jie) ; amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander ; Koenig, Christian ; Huang, Tim Subject: RE: [PATCH 12/22] drm/amd/pm: remove logically dead code [AMD Official Use Only - AMD Internal Distribution Only] [AMD Official Use Only - AMD Internal Distribution Only] Ping ... -Original Message- From: amd-gfx On Behalf Of Jesse Zhang Sent: Friday, May 10, 2024 10:50 AM To: amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander ; Koenig, Christian ; Huang, Tim ; Zhang, Jesse(Jie) ; Zhang, Jesse(Jie) Subject: [PATCH 12/22] drm/amd/pm: remove logically dead code Execution cannot reach this statement: case POWER_STATE_TYPE_BALAN. Signed-off-by: Jesse Zhang --- drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c | 9 - 1 file changed, 9 deletions(-) diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c b/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c index 60377747bab4..e861355ebd75 100644 --- a/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c @@ -831,15 +831,6 @@ static struct amdgpu_ps *amdgpu_dpm_pick_power_state(struct amdgpu_device *adev, return ps; } break; - case POWER_STATE_TYPE_BALANCED: - if (ui_class == ATOM_PPLIB_CLASSIFICATION_UI_BALANCED) { - if (ps->caps & ATOM_PPLIB_SINGLE_DISPLAY_ONLY) { - if (single_display) - return ps; - } else - return ps; - } - break; case POWER_STATE_TYPE_PERFORMANCE: if (ui_class == ATOM_PPLIB_CLASSIFICATION_UI_PERFORMANCE) { if (ps->caps & ATOM_PPLIB_SINGLE_DISPLAY_ONLY) { -- 2.25.1 <>
RE: [PATCH 17/22] drm/amdgpu: fix the warning bad bit shift operation for aca_error_type type
[AMD Official Use Only - AMD Internal Distribution Only] Reviewed-by: Yang Wang Best Regards, Kevin -Original Message- From: amd-gfx On Behalf Of Zhang, Jesse(Jie) Sent: Monday, May 13, 2024 10:19 AM To: Zhang, Jesse(Jie) ; amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander ; Koenig, Christian ; Huang, Tim Subject: RE: [PATCH 17/22] drm/amdgpu: fix the warning bad bit shift operation for aca_error_type type [AMD Official Use Only - AMD Internal Distribution Only] [AMD Official Use Only - AMD Internal Distribution Only] Ping ... -Original Message- From: amd-gfx On Behalf Of Jesse Zhang Sent: Friday, May 10, 2024 10:51 AM To: amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander ; Koenig, Christian ; Huang, Tim ; Zhang, Jesse(Jie) ; Zhang, Jesse(Jie) Subject: [PATCH 17/22] drm/amdgpu: fix the warning bad bit shift operation for aca_error_type type Filter invalid aca error types before performing a shift operation. Signed-off-by: Jesse Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c index 28febf33fb1b..9e3560c190e3 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c @@ -534,7 +534,7 @@ int amdgpu_aca_get_error_data(struct amdgpu_device *adev, struct aca_handle *han if (aca_handle_is_valid(handle)) return -EOPNOTSUPP; - if (!(BIT(type) & handle->mask)) + if ((type < 0) || (!(BIT(type) & handle->mask))) return 0; return __aca_get_error_data(adev, handle, type, err_data, qctx); -- 2.25.1 <>
RE: [PATCH 12/22] drm/amd/pm: remove logically dead code
[AMD Official Use Only - AMD Internal Distribution Only] Ping ... -Original Message- From: amd-gfx On Behalf Of Jesse Zhang Sent: Friday, May 10, 2024 10:50 AM To: amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander ; Koenig, Christian ; Huang, Tim ; Zhang, Jesse(Jie) ; Zhang, Jesse(Jie) Subject: [PATCH 12/22] drm/amd/pm: remove logically dead code Execution cannot reach this statement: case POWER_STATE_TYPE_BALAN. Signed-off-by: Jesse Zhang --- drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c | 9 - 1 file changed, 9 deletions(-) diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c b/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c index 60377747bab4..e861355ebd75 100644 --- a/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c @@ -831,15 +831,6 @@ static struct amdgpu_ps *amdgpu_dpm_pick_power_state(struct amdgpu_device *adev, return ps; } break; - case POWER_STATE_TYPE_BALANCED: - if (ui_class == ATOM_PPLIB_CLASSIFICATION_UI_BALANCED) { - if (ps->caps & ATOM_PPLIB_SINGLE_DISPLAY_ONLY) { - if (single_display) - return ps; - } else - return ps; - } - break; case POWER_STATE_TYPE_PERFORMANCE: if (ui_class == ATOM_PPLIB_CLASSIFICATION_UI_PERFORMANCE) { if (ps->caps & ATOM_PPLIB_SINGLE_DISPLAY_ONLY) { -- 2.25.1
RE: [PATCH 18/22] drm/amd/pm: check negtive return for table entries
[AMD Official Use Only - AMD Internal Distribution Only] Ping ... -Original Message- From: Jesse Zhang Sent: Friday, May 10, 2024 10:51 AM To: amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander ; Koenig, Christian ; Huang, Tim ; Zhang, Jesse(Jie) ; Zhang, Jesse(Jie) Subject: [PATCH 18/22] drm/amd/pm: check negtive return for table entries Function hwmgr->hwmgr_func->get_num_of_pp_table_entries(hwmgr) returns a negative number Signed-off-by: Jesse Zhang --- drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c index f4bd8e9357e2..4433ec4e9cf2 100644 --- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c +++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/pp_psm.c @@ -30,9 +30,8 @@ int psm_init_power_state_table(struct pp_hwmgr *hwmgr) { int result; unsigned int i; - unsigned int table_entries; struct pp_power_state *state; - int size; + int size, table_entries; if (hwmgr->hwmgr_func->get_num_of_pp_table_entries == NULL) return 0; @@ -45,7 +44,7 @@ int psm_init_power_state_table(struct pp_hwmgr *hwmgr) hwmgr->ps_size = size = hwmgr->hwmgr_func->get_power_state_size(hwmgr) + sizeof(struct pp_power_state); - if (table_entries == 0 || size == 0) { + if (table_entries <= 0 || size == 0) { pr_warn("Please check whether power state management is supported on this asic\n"); return 0; } -- 2.25.1
RE: [PATCH 17/22] drm/amdgpu: fix the warning bad bit shift operation for aca_error_type type
[AMD Official Use Only - AMD Internal Distribution Only] Ping ... -Original Message- From: amd-gfx On Behalf Of Jesse Zhang Sent: Friday, May 10, 2024 10:51 AM To: amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander ; Koenig, Christian ; Huang, Tim ; Zhang, Jesse(Jie) ; Zhang, Jesse(Jie) Subject: [PATCH 17/22] drm/amdgpu: fix the warning bad bit shift operation for aca_error_type type Filter invalid aca error types before performing a shift operation. Signed-off-by: Jesse Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c index 28febf33fb1b..9e3560c190e3 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c @@ -534,7 +534,7 @@ int amdgpu_aca_get_error_data(struct amdgpu_device *adev, struct aca_handle *han if (aca_handle_is_valid(handle)) return -EOPNOTSUPP; - if (!(BIT(type) & handle->mask)) + if ((type < 0) || (!(BIT(type) & handle->mask))) return 0; return __aca_get_error_data(adev, handle, type, err_data, qctx); -- 2.25.1
RE: [PATCH 02/22] drm/amdgpu: the warning dereferencing obj for nbio_v7_4
[AMD Official Use Only - AMD Internal Distribution Only] Ping ... -Original Message- From: Jesse Zhang Sent: Friday, May 10, 2024 10:50 AM To: amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander ; Koenig, Christian ; Huang, Tim ; Zhang, Jesse(Jie) ; Zhang, Jesse(Jie) Subject: [PATCH 02/22] drm/amdgpu: the warning dereferencing obj for nbio_v7_4 if ras_manager obj null, don't print NBIO err data Signed-off-by: Jesse Zhang --- drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c index fe18df10daaa..26e5885db9b7 100644 --- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c +++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c @@ -383,7 +383,7 @@ static void nbio_v7_4_handle_ras_controller_intr_no_bifring(struct amdgpu_device else WREG32_SOC15(NBIO, 0, mmBIF_DOORBELL_INT_CNTL, bif_doorbell_intr_cntl); - if (!ras->disable_ras_err_cnt_harvest) { + if (!ras->disable_ras_err_cnt_harvest && obj) { /* * clear error status after ras_controller_intr * according to hw team and count ue number -- 2.25.1
Re: [PATCH] drm/mst: Fix NULL pointer dereference at drm_dp_add_payload_part2
On 5/10/2024 4:24 AM, Jani Nikula wrote: On Fri, 10 May 2024, "Lin, Wayne" wrote: [Public] -Original Message- From: Limonciello, Mario Sent: Friday, May 10, 2024 3:18 AM To: Linux regressions mailing list ; Wentland, Harry ; Lin, Wayne Cc: ly...@redhat.com; imre.d...@intel.com; Leon Weiß ; sta...@vger.kernel.org; dri-de...@lists.freedesktop.org; amd- g...@lists.freedesktop.org; intel-...@lists.freedesktop.org Subject: Re: [PATCH] drm/mst: Fix NULL pointer dereference at drm_dp_add_payload_part2 On 5/9/2024 07:43, Linux regression tracking (Thorsten Leemhuis) wrote: On 18.04.24 21:43, Harry Wentland wrote: On 2024-03-07 01:29, Wayne Lin wrote: [Why] Commit: - commit 5aa1dfcdf0a4 ("drm/mst: Refactor the flow for payload allocation/removement") accidently overwrite the commit - commit 54d217406afe ("drm: use mgr->dev in drm_dbg_kms in drm_dp_add_payload_part2") which cause regression. [How] Recover the original NULL fix and remove the unnecessary input parameter 'state' for drm_dp_add_payload_part2(). Fixes: 5aa1dfcdf0a4 ("drm/mst: Refactor the flow for payload allocation/removement") Reported-by: Leon Weiß Link: https://lore.kernel.org/r/38c253ea42072cc825dc969ac4e6b9b600371cc8.c a...@ruhr-uni-bochum.de/ Cc: ly...@redhat.com Cc: imre.d...@intel.com Cc: sta...@vger.kernel.org Cc: regressi...@lists.linux.dev Signed-off-by: Wayne Lin I haven't been deep in MST code in a while but this all looks pretty straightforward and good. Reviewed-by: Harry Wentland Hmmm, that was three weeks ago, but it seems since then nothing happened to fix the linked regression through this or some other patch. Is there a reason? The build failure report from the CI maybe? It touches files outside of amd but only has an ack from AMD. I think we /probably/ want an ack from i915 and nouveau to take it through. Thanks, Mario! Hi Thorsten, Yeah, like what Mario said. Would also like to have ack from i915 and nouveau. It usually works better if you Cc the folks you want an ack from! ;) Acked-by: Jani Nikula Thanks! Can someone with commit permissions take this to drm-misc?
Re: [PATCH] drm/buddy: Fix the range bias clear memory allocation issue
Hi Daniel, On 5/8/2024 2:11 PM, Daniel Vetter wrote: On Wed, May 08, 2024 at 12:27:20PM +0530, Arunpravin Paneer Selvam wrote: Problem statement: During the system boot time, an application request for the bulk volume of cleared range bias memory when the clear_avail is zero, we dont fallback into normal allocation method as we had an unnecessary clear_avail check which prevents the fallback method leads to fb allocation failure following system goes into unresponsive state. Solution: Remove the unnecessary clear_avail check in the range bias allocation function. Signed-off-by: Arunpravin Paneer Selvam Fixes: 96950929eb23 ("drm/buddy: Implement tracking clear page feature") Reviewed-by: Matthew Auld --- drivers/gpu/drm/drm_buddy.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Can you please also add a kunit test case to exercise this corner case and make sure it stays fixed? I have sent the v2 along with a kunit test case for this corner case. Thanks, Arun. Thanks, Sima diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c index 284ebae71cc4..831929ac95eb 100644 --- a/drivers/gpu/drm/drm_buddy.c +++ b/drivers/gpu/drm/drm_buddy.c @@ -574,7 +574,7 @@ __drm_buddy_alloc_range_bias(struct drm_buddy *mm, block = __alloc_range_bias(mm, start, end, order, flags, fallback); - if (IS_ERR(block) && mm->clear_avail) + if (IS_ERR(block)) return __alloc_range_bias(mm, start, end, order, flags, !fallback); -- 2.25.1
[PATCH v2 1/2] drm/buddy: Fix the range bias clear memory allocation issue
Problem statement: During the system boot time, an application request for the bulk volume of cleared range bias memory when the clear_avail is zero, we dont fallback into normal allocation method as we had an unnecessary clear_avail check which prevents the fallback method leads to fb allocation failure following system goes into unresponsive state. Solution: Remove the unnecessary clear_avail check in the range bias allocation function. v2: add a kunit for this corner case (Daniel Vetter) Signed-off-by: Arunpravin Paneer Selvam Fixes: 96950929eb23 ("drm/buddy: Implement tracking clear page feature") Reviewed-by: Matthew Auld --- drivers/gpu/drm/drm_buddy.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c index 284ebae71cc4..831929ac95eb 100644 --- a/drivers/gpu/drm/drm_buddy.c +++ b/drivers/gpu/drm/drm_buddy.c @@ -574,7 +574,7 @@ __drm_buddy_alloc_range_bias(struct drm_buddy *mm, block = __alloc_range_bias(mm, start, end, order, flags, fallback); - if (IS_ERR(block) && mm->clear_avail) + if (IS_ERR(block)) return __alloc_range_bias(mm, start, end, order, flags, !fallback); -- 2.25.1
[PATCH v2 2/2] drm/tests: Add a unit test for range bias allocation
Allocate cleared blocks in the bias range when the DRM buddy's clear avail is zero. This will validate the bias range allocation in scenarios like system boot when no cleared blocks are available and exercise the fallback path too. The resulting blocks should always be dirty. Signed-off-by: Arunpravin Paneer Selvam --- drivers/gpu/drm/tests/drm_buddy_test.c | 35 ++ 1 file changed, 35 insertions(+) diff --git a/drivers/gpu/drm/tests/drm_buddy_test.c b/drivers/gpu/drm/tests/drm_buddy_test.c index e3b50e240d36..a194f271bc55 100644 --- a/drivers/gpu/drm/tests/drm_buddy_test.c +++ b/drivers/gpu/drm/tests/drm_buddy_test.c @@ -26,6 +26,8 @@ static void drm_test_buddy_alloc_range_bias(struct kunit *test) u32 mm_size, ps, bias_size, bias_start, bias_end, bias_rem; DRM_RND_STATE(prng, random_seed); unsigned int i, count, *order; + struct drm_buddy_block *block; + unsigned long flags; struct drm_buddy mm; LIST_HEAD(allocated); @@ -222,6 +224,39 @@ static void drm_test_buddy_alloc_range_bias(struct kunit *test) drm_buddy_free_list(&mm, &allocated, 0); drm_buddy_fini(&mm); + + /* +* Allocate cleared blocks in the bias range when the DRM buddy's clear avail is +* zero. This will validate the bias range allocation in scenarios like system boot +* when no cleared blocks are available and exercise the fallback path too. The resulting +* blocks should always be dirty. +*/ + + KUNIT_ASSERT_FALSE_MSG(test, drm_buddy_init(&mm, mm_size, ps), + "buddy_init failed\n"); + mm.clear_avail = 0; + + bias_start = round_up(prandom_u32_state(&prng) % (mm_size - ps), ps); + bias_end = round_up(bias_start + prandom_u32_state(&prng) % (mm_size - bias_start), ps); + bias_end = max(bias_end, bias_start + ps); + bias_rem = bias_end - bias_start; + + flags = DRM_BUDDY_CLEAR_ALLOCATION | DRM_BUDDY_RANGE_ALLOCATION; + u32 size = max(round_up(prandom_u32_state(&prng) % bias_rem, ps), ps); + + KUNIT_ASSERT_FALSE_MSG(test, + drm_buddy_alloc_blocks(&mm, bias_start, + bias_end, size, ps, + &allocated, + flags), + "buddy_alloc failed with bias(%x-%x), size=%u, ps=%u\n", + bias_start, bias_end, size, ps); + + list_for_each_entry(block, &allocated, link) + KUNIT_EXPECT_EQ(test, drm_buddy_block_is_clear(block), false); + + drm_buddy_free_list(&mm, &allocated, 0); + drm_buddy_fini(&mm); } static void drm_test_buddy_alloc_clear(struct kunit *test) -- 2.25.1