RE: [PATCH] drm/amdgpu: Send RMA CPER at bad page loading

Russell, Kent Mon, 26 Jan 2026 06:46:42 -0800

[Public]

> -----Original Message-----
> From: Zhou1, Tao <[email protected]>
> Sent: Sunday, January 25, 2026 9:40 PM
> To: Russell, Kent <[email protected]>; [email protected]
> Cc: Zhang, Hawking <[email protected]>
> Subject: RE: [PATCH] drm/amdgpu: Send RMA CPER at bad page loading
>
> [Public]
>
> > -----Original Message-----
> > From: Russell, Kent <[email protected]>
> > Sent: Saturday, January 24, 2026 1:02 AM
> > To: Zhou1, Tao <[email protected]>; [email protected]
> > Cc: Zhang, Hawking <[email protected]>
> > Subject: RE: [PATCH] drm/amdgpu: Send RMA CPER at bad page loading
> >
> > [Public]
> >
> > Thanks Tao. This was really just to get some feedback on how to do this.
> > And if there were any dependencies. Ideally we want to send out a CPER for
> > the situation in the commit message. I can definitely add that as a comment.
> >
> > For the 2nd part, I am not sure. The big issue is that systems that rely on
> > CPERs to know when a GPU is bad will not have a CPER for this type of
> > situation until they take a new UE. So we want to alert them every time we
> > load more than the threshold. Would in-band also benefit from that? Is there
> > a drawback to having both? I figure more alerts is always better when it
> > comes to unhealthy HW.
> >
> >  Kent
> >
>
> [Tao] Hi Kent, can the system boot up successfully in this case? If the
> answer is no, I think an in-band CPER is unnecessary; otherwise (if it does
> boot), the user may be confused by an inconsistency between the in-band and
> out-of-band CPERs.


What we were seeing was that after the patches from you and Gangliang, the 
system could initialize amdgpu with a massive number of retirements (previously 
it couldn't even init). But it took roughly 10 minutes from modprobe to 
completion on the node I was testing on. For health checks, a 10min init would 
get the node flagged for service, then it would need to be triaged by the 
service team to figure out why it took so long. The CPER would help here since 
that way the node would immediately be flagged as RMA, instead of needing to be 
triaged as to the reason for the slow init.

Again, I am not a CPER/RAS expert. The hope here is that we can just signal to 
OOB that we need to RMA the node, instead of waiting for the next UE to trigger 
that. If in-band would benefit from that too, then great. Because right now, 
all we see is a slow initialization and then we need to dig into dmesg to 
figure out why.
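
To make the proposal concrete, here is a rough, self-contained sketch of the
load-time check being discussed, covering both the OOB path from the patch and
the in-band path Tao suggested. It is only an illustration: struct
mock_ras_state and both functions are hypothetical stand-ins, the counters
replace the real calls to amdgpu_dpm_send_rma_reason() and
amdgpu_cper_generate_bp_threshold_record(), and nothing here is actual driver
code.

```c
#include <stdbool.h>

/* Hypothetical stand-in for driver state; NOT the real amdgpu structs. */
struct mock_ras_state {
	unsigned int num_bad_pages;      /* mirrors control->ras_num_bad_pages */
	unsigned int bad_page_threshold; /* mirrors ras->bad_page_cnt_threshold */
	int threshold_param;             /* mirrors amdgpu_bad_page_threshold; the patch skips the check when 0 */
	bool inband_cper_enabled;        /* mirrors adev->cper.enabled */
	bool uniras_enabled;             /* mirrors amdgpu_uniras_enabled(adev) */
	unsigned int oob_rma_sent;       /* counts simulated OOB RMA notifications */
	unsigned int inband_cper_sent;   /* counts simulated in-band CPER records */
};

/* Same condition as the patch: threshold checking on, count met or exceeded. */
static bool rma_needed_at_load(const struct mock_ras_state *s)
{
	return s->threshold_param != 0 &&
	       s->num_bad_pages >= s->bad_page_threshold;
}

/* Called once when the bad page table is loaded. */
static void check_bad_pages_at_load(struct mock_ras_state *s)
{
	if (!rma_needed_at_load(s))
		return;

	/* Stands in for amdgpu_dpm_send_rma_reason(adev). */
	s->oob_rma_sent++;

	/* Stands in for Tao's suggested in-band record generation. */
	if (s->inband_cper_enabled && !s->uniras_enabled)
		s->inband_cper_sent++;
}
```

With this shape, a node that loads a table already at or over the threshold
gets flagged during init rather than at the next UE, and the in-band record
goes out only when that path is enabled.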

Hopefully that helps explain the unique situation a bit more clearly.

 Kent

>
> > > -----Original Message-----
> > > From: Zhou1, Tao <[email protected]>
> > > Sent: Friday, January 23, 2026 1:14 AM
> > > To: Russell, Kent <[email protected]>;
> > > [email protected]
> > > Cc: Zhang, Hawking <[email protected]>; Russell, Kent
> > > <[email protected]>
> > > Subject: RE: [PATCH] drm/amdgpu: Send RMA CPER at bad page loading
> > >
> > > [AMD Official Use Only - AMD Internal Distribution Only]
> > >
> > > > -----Original Message-----
> > > > From: amd-gfx <[email protected]> On Behalf Of Kent Russell
> > > > Sent: Thursday, January 22, 2026 11:25 PM
> > > > To: [email protected]
> > > > Cc: Zhang, Hawking <[email protected]>; Russell, Kent
> > > > <[email protected]>
> > > > Subject: [PATCH] drm/amdgpu: Send RMA CPER at bad page loading
> > > >
> > > > Some older builds weren't sending RMA CPERs when the bad page
> > > > threshold was exceeded. Newer builds have resolved this, but there
> > > > could be systems out there with bad page numbers higher than the
> > > > threshold, that haven't sent out an RMA CPER. To be thorough and
> > > > safe, send an RMA CPER when we load the table, if the threshold is
> > > > met or exceeded, instead of waiting for the next UE to trigger the
> > > > CPER.
> > > >
> > > > Signed-off-by: Kent Russell <[email protected]>
> > > > ---
> > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 4 ++++
> > > >  1 file changed, 4 insertions(+)
> > > >
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > > > index 64dd7a81bff5..469d04a39d7d 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > > > @@ -1712,6 +1712,10 @@ int amdgpu_ras_eeprom_check(struct amdgpu_ras_eeprom_control *control)
> > > >                       dev_warn(adev->dev, "RAS records:%u exceeds 90%% of threshold:%d",
> > > >                                       control->ras_num_bad_pages,
> > > >                                       ras->bad_page_cnt_threshold);
> > > > +             if (amdgpu_bad_page_threshold != 0 &&
> > > > +                     control->ras_num_bad_pages >= ras->bad_page_cnt_threshold)
> > > > +                     amdgpu_dpm_send_rma_reason(adev);
> > > > +
> > >
> > > [Tao]: 1. Better to add comment to describe this special case;
> > >
> > > 2. Do we need to trigger in-band cper as well? Like:
> > >
> > > if (adev->cper.enabled && !amdgpu_uniras_enabled(adev) &&
> > >     amdgpu_cper_generate_bp_threshold_record(adev))
> > >         dev_warn(adev->dev, "fail to generate bad page threshold cper records\n");
> > >
> > > >       } else if (hdr->header == RAS_TABLE_HDR_BAD &&
> > > >                  amdgpu_bad_page_threshold != 0) {
> > > >               if (hdr->version >= RAS_TABLE_VER_V2_1) {
> > > > --
> > > > 2.43.0
> > >
> >
>
