[AMD Official Use Only - General]

The in_gpu_reset flag is set after the reset error count and reset error status functions are called, so we can't use amdgpu_in_reset() here. Please check the ras->in_recovery flag instead.

Regards,
Stanley

From: Zhou1, Tao <tao.zh...@amd.com>
Sent: Friday, October 13, 2023 5:06 PM
To: Zhang, Hawking <hawking.zh...@amd.com>; amd-gfx@lists.freedesktop.org; Yang, Stanley <stanley.y...@amd.com>; Li, Candice <candice...@amd.com>; Chai, Thomas <yipeng.c...@amd.com>; Wang, Yang(Kevin) <kevinyang.w...@amd.com>
Subject: Re: [PATCH 4/5] drm/amdgpu: bypass RAS error reset in some conditions

[AMD Official Use Only - General]

How about this condition:

if ((amdgpu_in_reset(adev) || amdgpu_ras_intr_triggered()) &&
    mca_funcs && mca_funcs->mca_set_debug_mode)

I use amdgpu_in_reset to skip touching it in all GPU resets, not only in the resets triggered by a RAS fatal error.

Regards,
Tao

________________________________
From: Zhang, Hawking <hawking.zh...@amd.com>
Sent: Thursday, October 12, 2023 9:14 PM
To: Zhou1, Tao <tao.zh...@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; Yang, Stanley <stanley.y...@amd.com>; Li, Candice <candice...@amd.com>; Chai, Thomas <yipeng.c...@amd.com>; Wang, Yang(Kevin) <kevinyang.w...@amd.com>
Subject: RE: [PATCH 4/5] drm/amdgpu: bypass RAS error reset in some conditions

[AMD Official Use Only - General]

-	if (!amdgpu_ras_is_supported(adev, block))
+	/* skip ras error reset in gpu reset */
+	if (amdgpu_in_reset(adev) &&
+	    mca_funcs && mca_funcs->mca_set_debug_mode)
+		return 0;

We should check the RAS in_recovery flag in this case. The reset domain is locked in a relatively late phase, at least *after* the error counter harvest. Please double check.

Regards,
Hawking

-----Original Message-----
From: Zhou1, Tao <tao.zh...@amd.com>
Sent: Thursday, October 12, 2023 17:01
To: amd-gfx@lists.freedesktop.org; Yang, Stanley <stanley.y...@amd.com>; Zhang, Hawking <hawking.zh...@amd.com>; Li, Candice <candice...@amd.com>; Chai, Thomas <yipeng.c...@amd.com>; Wang, Yang(Kevin) <kevinyang.w...@amd.com>
Cc: Zhou1, Tao <tao.zh...@amd.com>
Subject: [PATCH 4/5] drm/amdgpu: bypass RAS error reset in some conditions

PMFW is responsible for RAS error reset in some conditions, driver can skip the operation.

Signed-off-by: Tao Zhou <tao.zh...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 91ed4fd96ee1..6dddb0423411 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1105,11 +1105,18 @@ int amdgpu_ras_reset_error_count(struct amdgpu_device *adev,
 		enum amdgpu_ras_block block)
 {
 	struct amdgpu_ras_block_object *block_obj = amdgpu_ras_get_ras_block(adev, block, 0);
+	const struct amdgpu_mca_smu_funcs *mca_funcs = adev->mca.mca_funcs;
 
 	if (!block_obj || !block_obj->hw_ops)
 		return 0;
 
-	if (!amdgpu_ras_is_supported(adev, block))
+	/* skip ras error reset in gpu reset */
+	if (amdgpu_in_reset(adev) &&
+	    mca_funcs && mca_funcs->mca_set_debug_mode)
+		return 0;
+
+	if (!amdgpu_ras_is_supported(adev, block) ||
+	    !amdgpu_ras_get_mca_debug_mode(adev))
 		return 0;
 
 	if (block_obj->hw_ops->reset_ras_error_count)
@@ -1122,6 +1129,7 @@ int amdgpu_ras_reset_error_status(struct amdgpu_device *adev,
 		enum amdgpu_ras_block block)
 {
 	struct amdgpu_ras_block_object *block_obj = amdgpu_ras_get_ras_block(adev, block, 0);
+	const struct amdgpu_mca_smu_funcs *mca_funcs = adev->mca.mca_funcs;
 
 	if (!block_obj || !block_obj->hw_ops) {
 		dev_dbg_once(adev->dev, "%s doesn't config RAS function\n",
@@ -1129,7 +1137,13 @@ int amdgpu_ras_reset_error_status(struct amdgpu_device *adev,
 		return 0;
 	}
 
-	if (!amdgpu_ras_is_supported(adev, block))
+	/* skip ras error reset in gpu reset */
+	if (amdgpu_in_reset(adev) &&
+	    mca_funcs && mca_funcs->mca_set_debug_mode)
+		return 0;
+
+	if (!amdgpu_ras_is_supported(adev, block) ||
+	    !amdgpu_ras_get_mca_debug_mode(adev))
 		return 0;
 
 	if (block_obj->hw_ops->reset_ras_error_count)
-- 
2.35.1
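
The ordering concern raised in the review replies can be illustrated with a small userspace model. This is only a sketch, not driver code: struct mock_dev, enum skip_policy, mock_reset_error_count() and mock_recovery_skips_reset() are hypothetical names invented for the illustration, and the sequence (recovery flag set first, counters harvested and reset next, reset flag set last) is an assumption based on the comments in this thread, not a reading of the actual amdgpu reset path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for driver state; NOT the real amdgpu structures. */
struct mock_dev {
	bool in_recovery;      /* models the RAS in_recovery flag */
	bool in_gpu_reset;     /* models what amdgpu_in_reset() would report */
	bool pmfw_owns_reset;  /* models mca_funcs && mca_funcs->mca_set_debug_mode */
	unsigned int err_count;
};

/* Which flag the skip condition consults (hypothetical, for comparison only). */
enum skip_policy { SKIP_ON_IN_RESET, SKIP_ON_IN_RECOVERY };

/* Returns true when the driver-side counter reset was skipped. */
static bool mock_reset_error_count(struct mock_dev *dev, enum skip_policy policy)
{
	bool busy = (policy == SKIP_ON_IN_RESET) ? dev->in_gpu_reset
						 : dev->in_recovery;

	if (busy && dev->pmfw_owns_reset)
		return true;   /* leave the counters to PMFW */

	dev->err_count = 0;    /* driver clears the counters itself */
	return false;
}

/*
 * Models the assumed ordering: recovery begins, the error counters are
 * harvested and reset, and only afterwards is the reset flag raised.
 */
static bool mock_recovery_skips_reset(enum skip_policy policy)
{
	struct mock_dev dev = { .pmfw_owns_reset = true, .err_count = 3 };
	bool skipped;

	dev.in_recovery = true;                          /* 1. recovery begins */
	skipped = mock_reset_error_count(&dev, policy);  /* 2. harvest and reset */
	dev.in_gpu_reset = true;                         /* 3. reset flag set late */

	/* The counters survive exactly when the reset was skipped. */
	assert(skipped == (dev.err_count == 3));
	return skipped;
}
```

In this model, mock_recovery_skips_reset(SKIP_ON_IN_RESET) returns false because the reset flag is not yet set at step 2, so the driver still clears the counters. mock_recovery_skips_reset(SKIP_ON_IN_RECOVERY) returns true, which is the behavior the reviewers ask for.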