[Public]

> -----Original Message-----
> From: Russell, Kent <[email protected]>
> Sent: Saturday, January 24, 2026 1:02 AM
> To: Zhou1, Tao <[email protected]>; [email protected]
> Cc: Zhang, Hawking <[email protected]>
> Subject: RE: [PATCH] drm/amdgpu: Send RMA CPER at bad page loading
>
> [Public]
>
> Thanks Tao. This was really just to get some feedback on how to do this. And
> if there were any dependencies. Ideally we want to send out a CPER for the
> situation in the commit message. I can definitely add that as a comment.
>
> For the 2nd part, I am not sure. The big issue is that systems that rely on
> CPERs to know when a GPU is bad will not have a CPER for this type of
> situation until they take a new UE. So we want to alert them every time we
> load more than the threshold. Would in-band also benefit from that? Is there
> a drawback to having both? I figure more alerts is always better when it
> comes to unhealthy HW.
>
> Kent
>
[Tao] Hi Kent, can the system boot up successfully in this case? If the answer is no, I think the in-band CPER is unnecessary; otherwise the user may be confused by the inconsistency between the in-band and out-of-band CPERs.

> > -----Original Message-----
> > From: Zhou1, Tao <[email protected]>
> > Sent: Friday, January 23, 2026 1:14 AM
> > To: Russell, Kent <[email protected]>; [email protected]
> > Cc: Zhang, Hawking <[email protected]>; Russell, Kent <[email protected]>
> > Subject: RE: [PATCH] drm/amdgpu: Send RMA CPER at bad page loading
> >
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> > > -----Original Message-----
> > > From: amd-gfx <[email protected]> On Behalf Of Kent Russell
> > > Sent: Thursday, January 22, 2026 11:25 PM
> > > To: [email protected]
> > > Cc: Zhang, Hawking <[email protected]>; Russell, Kent <[email protected]>
> > > Subject: [PATCH] drm/amdgpu: Send RMA CPER at bad page loading
> > >
> > > Some older builds weren't sending RMA CPERs when the bad page
> > > threshold was exceeded. Newer builds have resolved this, but there
> > > could be systems out there with bad page numbers higher than the
> > > threshold, that haven't sent out an RMA CPER. To be thorough and
> > > safe, send an RMA CPER when we load the table, if the
> > > threshold is met or exceeded, instead of waiting for the next UE to
> > > trigger the CPER.
> > >
> > > Signed-off-by: Kent Russell <[email protected]>
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > > index 64dd7a81bff5..469d04a39d7d 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > > @@ -1712,6 +1712,10 @@ int amdgpu_ras_eeprom_check(struct amdgpu_ras_eeprom_control *control)
> > >              dev_warn(adev->dev, "RAS records:%u exceeds 90%% of threshold:%d",
> > >                       control->ras_num_bad_pages,
> > >                       ras->bad_page_cnt_threshold);
> > > +        if (amdgpu_bad_page_threshold != 0 &&
> > > +            control->ras_num_bad_pages >= ras->bad_page_cnt_threshold)
> > > +            amdgpu_dpm_send_rma_reason(adev);
> > > +
> >
> > [Tao]: 1. Better to add comment to describe this special case;
> >
> > 2. Do we need to trigger in-band cper as well? Like:
> >
> > if (adev->cper.enabled && !amdgpu_uniras_enabled(adev) &&
> >     amdgpu_cper_generate_bp_threshold_record(adev))
> >         dev_warn(adev->dev, "fail to generate bad page threshold cper records\n");
> >
> > >      } else if (hdr->header == RAS_TABLE_HDR_BAD &&
> > >                 amdgpu_bad_page_threshold != 0) {
> > >          if (hdr->version >= RAS_TABLE_VER_V2_1) {
> > > --
> > > 2.43.0
> > >
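
To make the two options in this thread easier to compare, the load-time decision can be modelled outside the driver roughly as below. This is only a standalone sketch, not kernel code: `struct bad_page_state`, `struct rma_actions` and `rma_actions_at_load()` are hypothetical names, and the fields merely stand in for the driver values quoted in the patch and in the review comment. Whether the in-band record should be generated at all is still the open question above; the sketch just shows what "both" would mean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the state available when the table is loaded. */
struct bad_page_state {
    uint32_t num_bad_pages;       /* models control->ras_num_bad_pages */
    uint32_t threshold;           /* models ras->bad_page_cnt_threshold */
    int      threshold_param;     /* models amdgpu_bad_page_threshold; 0 skips the check */
    bool     inband_cper_enabled; /* models adev->cper.enabled */
    bool     uniras_enabled;      /* models amdgpu_uniras_enabled(adev) */
};

/* Which notifications the load path would raise (hypothetical). */
struct rma_actions {
    bool send_oob_rma;   /* models the amdgpu_dpm_send_rma_reason() call */
    bool send_inband_bp; /* models amdgpu_cper_generate_bp_threshold_record() */
};

static struct rma_actions rma_actions_at_load(const struct bad_page_state *s)
{
    struct rma_actions a = { false, false };

    assert(s != NULL);

    /* Same condition as the patch: check enabled and threshold met or exceeded. */
    if (s->threshold_param != 0 && s->num_bad_pages >= s->threshold) {
        a.send_oob_rma = true;
        /* Review suggestion: also raise the in-band record when it is available. */
        a.send_inband_bp = s->inband_cper_enabled && !s->uniras_enabled;
    }

    return a;
}
```

With this shape, dropping the in-band path is a one-line change, so the decision can follow the answer to the boot-up question without touching the out-of-band condition.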
