[Public]

Thanks Tao. This was really just to get feedback on how to do this, and on whether there were any dependencies. Ideally we want to send out a CPER for the situation described in the commit message. I can definitely add that as a comment.
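As a rough illustration of what the commented check could look like, here is a standalone sketch. It is not the driver code: `struct eeprom_state`, `send_rma_reason_stub()` and `check_rma_on_load()` are hypothetical stand-ins for the amdgpu structures and for amdgpu_dpm_send_rma_reason(), and the comment wording is only a suggestion.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the driver state; not the real amdgpu structs. */
struct eeprom_state {
	unsigned int num_bad_pages;      /* mirrors control->ras_num_bad_pages */
	unsigned int bad_page_threshold; /* mirrors ras->bad_page_cnt_threshold */
};

static int rma_cper_sent; /* counts calls to the stubbed notifier */

/* Stub standing in for amdgpu_dpm_send_rma_reason(adev). */
static void send_rma_reason_stub(void)
{
	rma_cper_sent++;
}

/*
 * Special case: older builds may have left the EEPROM table at or above
 * the bad page threshold without ever sending an RMA CPER. Report it when
 * the table is loaded instead of waiting for the next UE.
 *
 * threshold_param mirrors the amdgpu_bad_page_threshold module parameter;
 * as in the patch, a value of 0 skips the check.
 */
static bool check_rma_on_load(const struct eeprom_state *st, int threshold_param)
{
	if (threshold_param != 0 &&
	    st->num_bad_pages >= st->bad_page_threshold) {
		send_rma_reason_stub();
		return true;
	}
	return false;
}
```

The comparison is `>=`, matching the patch, so a table sitting exactly at the threshold also triggers the notification.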
For the 2nd part, I am not sure. The big issue is that systems that rely on CPERs to know when a GPU is bad will not have a CPER for this type of situation until they take a new UE. So we want to alert them every time we load more than the threshold. Would in-band also benefit from that? Is there a drawback to having both? I figure more alerts is always better when it comes to unhealthy HW.

 Kent

> -----Original Message-----
> From: Zhou1, Tao <[email protected]>
> Sent: Friday, January 23, 2026 1:14 AM
> To: Russell, Kent <[email protected]>; [email protected]
> Cc: Zhang, Hawking <[email protected]>; Russell, Kent <[email protected]>
> Subject: RE: [PATCH] drm/amdgpu: Send RMA CPER at bad page loading
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> > -----Original Message-----
> > From: amd-gfx <[email protected]> On Behalf Of Kent Russell
> > Sent: Thursday, January 22, 2026 11:25 PM
> > To: [email protected]
> > Cc: Zhang, Hawking <[email protected]>; Russell, Kent <[email protected]>
> > Subject: [PATCH] drm/amdgpu: Send RMA CPER at bad page loading
> >
> > Some older builds weren't sending RMA CPERs when the bad page threshold was
> > exceeded. Newer builds have resolved this, but there could be systems out there
> > with bad page numbers higher than the threshold, that haven't sent out an RMA
> > CPER. To be thorough and safe, send an RMA CPER when we load the table, if the
> > threshold is met or exceeded, instead of waiting for the next UE to trigger the
> > CPER.
> >
> > Signed-off-by: Kent Russell <[email protected]>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > index 64dd7a81bff5..469d04a39d7d 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > @@ -1712,6 +1712,10 @@ int amdgpu_ras_eeprom_check(struct amdgpu_ras_eeprom_control *control)
> >                       dev_warn(adev->dev, "RAS records:%u exceeds 90%% of threshold:%d",
> >                                       control->ras_num_bad_pages,
> >                                       ras->bad_page_cnt_threshold);
> > +             if (amdgpu_bad_page_threshold != 0 &&
> > +                 control->ras_num_bad_pages >= ras->bad_page_cnt_threshold)
> > +                     amdgpu_dpm_send_rma_reason(adev);
> > +
>
> [Tao]: 1. Better to add comment to describe this special case;
>
> 2. Do we need to trigger in-band cper as well? Like:
>
> if (adev->cper.enabled && !amdgpu_uniras_enabled(adev) &&
>     amdgpu_cper_generate_bp_threshold_record(adev))
>         dev_warn(adev->dev, "fail to generate bad page threshold cper records\n");
>
> >       } else if (hdr->header == RAS_TABLE_HDR_BAD &&
> >                  amdgpu_bad_page_threshold != 0) {
> >               if (hdr->version >= RAS_TABLE_VER_V2_1) {
> > --
> > 2.43.0
>
