RE: [PATCH] drm/amdgpu: refine ras error kernel log print

2023-10-19 Thread Wang, Yang(Kevin)
[AMD Official Use Only - General]

dev_info(adev->dev, "socket: %d, die: %d "
-"%lld correctable hardware errors detected in %s block\n",
+"new %lld correctable hardware errors detected in %s block, "
+"no user action is needed.\n",

Hi Hawking,

The socket/die ID information is already included for newly detected errors.
For the accumulated errors of the current block, socket/die information is not
recorded at present.
Do you mean we need to add socket/die ID information for the accumulated errors?
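
To make the distinction concrete, here is a minimal userspace sketch (not the driver code) of the two message shapes after the patch. The struct `demo_err_node` and the helpers `demo_format_total` and `demo_format_new` are hypothetical names introduced only for illustration, and the unsigned format specifiers are an assumption of this sketch; only the message wording is taken from the patch.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a per-node error record; not the amdgpu struct. */
struct demo_err_node {
    int socket_id;
    int die_id;
    unsigned long long ce_count;
};

/* Accumulated total for the block: no socket/die id is recorded for it. */
static int demo_format_total(char *buf, size_t len,
                             unsigned long total, const char *blk_name)
{
    return snprintf(buf, len,
                    "%lu correctable hardware errors detected in total in %s block, "
                    "no user action is needed.",
                    total, blk_name);
}

/* Newly detected errors: the per-node record carries socket/die ids. */
static int demo_format_new(char *buf, size_t len,
                           const struct demo_err_node *node,
                           const char *blk_name)
{
    return snprintf(buf, len,
                    "socket: %d, die: %d "
                    "new %llu correctable hardware errors detected in %s block, "
                    "no user action is needed.",
                    node->socket_id, node->die_id, node->ce_count, blk_name);
}
```

Calling `demo_format_new()` with a node holding socket 0, die 1 and a count of 3 yields a line prefixed with `socket: 0, die: 1`, whereas `demo_format_total()` has no such prefix because the total is not tied to one socket or die.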

Best Regards,
Kevin
-----Original Message-----
From: Zhang, Hawking 
Sent: Thursday, October 19, 2023 9:23 PM
To: Wang, Yang(Kevin) ; amd-gfx@lists.freedesktop.org
Cc: Zhou1, Tao ; Chai, Thomas 
Subject: RE: [PATCH] drm/amdgpu: refine ras error kernel log print

[AMD Official Use Only - General]

As discussed, please add socket id and die id in the output message.

Regards,
Hawking

-----Original Message-----
From: Wang, Yang(Kevin) 
Sent: Thursday, October 19, 2023 20:51
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Hawking ; Zhou1, Tao ; 
Chai, Thomas ; Wang, Yang(Kevin) 
Subject: [PATCH] drm/amdgpu: refine ras error kernel log print

Refine the RAS error kernel log messages to avoid ambiguity for users.

Signed-off-by: Yang Wang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 5b831ba0ebb3..cebc19d810e9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1034,10 +1034,11 @@ static void amdgpu_ras_error_print_error_data(struct amdgpu_device *adev,
struct ras_err_info *err_info;

if (is_ue)
-   dev_info(adev->dev, "%ld uncorrectable hardware errors detected in %s block\n",
+   dev_info(adev->dev, "%ld uncorrectable hardware errors detected in total in %s block\n",
 ras_mgr->err_data.ue_count, blk_name);
else
-   dev_info(adev->dev, "%ld correctable hardware errors detected in %s block\n",
+   dev_info(adev->dev, "%ld correctable hardware errors detected in total in %s block, "
+"no user action is needed.\n",
 ras_mgr->err_data.ce_count, blk_name);

for_each_ras_error(err_node, err_data) {
@@ -1045,14 +1046,15 @@ static void amdgpu_ras_error_print_error_data(struct amdgpu_device *adev,
mcm_info = &err_info->mcm_info;
if (is_ue && err_info->ue_count) {
dev_info(adev->dev, "socket: %d, die: %d "
-"%lld uncorrectable hardware errors detected in %s block\n",
+"new %lld uncorrectable hardware errors detected in %s block\n",
 mcm_info->socket_id,
 mcm_info->die_id,
 err_info->ue_count,
 blk_name);
} else if (!is_ue && err_info->ce_count) {
dev_info(adev->dev, "socket: %d, die: %d "
-"%lld correctable hardware errors detected in %s block\n",
+"new %lld correctable hardware errors detected in %s block, "
+"no user action is needed.\n",
 mcm_info->socket_id,
 mcm_info->die_id,
 err_info->ce_count,
--
2.34.1




RE: [PATCH] drm/amdgpu: refine ras error kernel log print

2023-10-19 Thread Zhang, Hawking
[AMD Official Use Only - General]

As discussed, please add socket id and die id in the output message.

Regards,
Hawking

-----Original Message-----
From: Wang, Yang(Kevin) 
Sent: Thursday, October 19, 2023 20:51
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Hawking ; Zhou1, Tao ; 
Chai, Thomas ; Wang, Yang(Kevin) 
Subject: [PATCH] drm/amdgpu: refine ras error kernel log print

Refine the RAS error kernel log messages to avoid ambiguity for users.

Signed-off-by: Yang Wang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 5b831ba0ebb3..cebc19d810e9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1034,10 +1034,11 @@ static void amdgpu_ras_error_print_error_data(struct amdgpu_device *adev,
struct ras_err_info *err_info;

if (is_ue)
-   dev_info(adev->dev, "%ld uncorrectable hardware errors detected in %s block\n",
+   dev_info(adev->dev, "%ld uncorrectable hardware errors detected in total in %s block\n",
 ras_mgr->err_data.ue_count, blk_name);
else
-   dev_info(adev->dev, "%ld correctable hardware errors detected in %s block\n",
+   dev_info(adev->dev, "%ld correctable hardware errors detected in total in %s block, "
+"no user action is needed.\n",
 ras_mgr->err_data.ce_count, blk_name);

for_each_ras_error(err_node, err_data) {
@@ -1045,14 +1046,15 @@ static void amdgpu_ras_error_print_error_data(struct amdgpu_device *adev,
mcm_info = &err_info->mcm_info;
if (is_ue && err_info->ue_count) {
dev_info(adev->dev, "socket: %d, die: %d "
-"%lld uncorrectable hardware errors detected in %s block\n",
+"new %lld uncorrectable hardware errors detected in %s block\n",
 mcm_info->socket_id,
 mcm_info->die_id,
 err_info->ue_count,
 blk_name);
} else if (!is_ue && err_info->ce_count) {
dev_info(adev->dev, "socket: %d, die: %d "
-"%lld correctable hardware errors detected in %s block\n",
+"new %lld correctable hardware errors detected in %s block, "
+"no user action is needed.\n",
 mcm_info->socket_id,
 mcm_info->die_id,
 err_info->ce_count,
--
2.34.1