RE: [PATCH] drm/amdgpu: Send RMA CPER at bad page loading

Zhou1, Tao Mon, 26 Jan 2026 18:44:44 -0800

[Public]

> -----Original Message-----
> From: Russell, Kent <[email protected]>
> Sent: Monday, January 26, 2026 10:47 PM
> To: Zhou1, Tao <[email protected]>; [email protected]
> Cc: Zhang, Hawking <[email protected]>
> Subject: RE: [PATCH] drm/amdgpu: Send RMA CPER at bad page loading
>
> [Public]
>
> > -----Original Message-----
> > From: Zhou1, Tao <[email protected]>
> > Sent: Sunday, January 25, 2026 9:40 PM
> > To: Russell, Kent <[email protected]>;
> > [email protected]
> > Cc: Zhang, Hawking <[email protected]>
> > Subject: RE: [PATCH] drm/amdgpu: Send RMA CPER at bad page loading
> >
> > [Public]
> >
> > > -----Original Message-----
> > > From: Russell, Kent <[email protected]>
> > > Sent: Saturday, January 24, 2026 1:02 AM
> > > To: Zhou1, Tao <[email protected]>; [email protected]
> > > Cc: Zhang, Hawking <[email protected]>
> > > Subject: RE: [PATCH] drm/amdgpu: Send RMA CPER at bad page loading
> > >
> > > [Public]
> > >
> > > > Thanks Tao. This was really just to get some feedback on how to do
> > > > this, and on whether there were any dependencies. Ideally we want to
> > > > send out a CPER for the situation described in the commit message. I
> > > > can definitely add that as a comment.
> > > >
> > > > For the 2nd part, I am not sure. The big issue is that systems that
> > > > rely on CPERs to know when a GPU is bad will not have a CPER for
> > > > this type of situation until they take a new UE. So we want to alert
> > > > them every time we load more than the threshold. Would in-band also
> > > > benefit from that? Is there a drawback to having both?
> > > > I figure more alerts are always better when it comes to unhealthy HW.
> > >
> > >  Kent
> > >
> >
> > > [Tao] Hi Kent, can the system boot up successfully in this case? If the
> > > answer is no, I think an in-band CPER is unnecessary. Otherwise, the user
> > > may be confused by the inconsistency between in-band and out-of-band CPERs.
>
> > What we were seeing was that after the patches from you and Gangliang, the
> > system could initialize amdgpu with a massive number of retirements
> > (previously it couldn't even init). But it took roughly 10 minutes from
> > modprobe to completion on the node I was testing on. For health checks, a
> > 10min init would get the node flagged for service, then it would need to be
> > triaged by the service team to figure out why it took so long. The CPER
> > would help here since that way the node would immediately be flagged as RMA,
> > instead of needing to be triaged as to the reason for the slow init.
> >
> > Again, I am not a CPER/RAS expert. The hope here is that we can just signal
> > to OOB that we need to RMA the node, instead of waiting for the next UE to
> > trigger that. If in-band would benefit from that too, then great. Because
> > right now, all we see is a slow initialization and then we need to dig into
> > dmesg to figure out why.
>
> Hopefully that helps explain the unique situation a bit more clearly.
>
>  Kent


[Tao] Hi Kent, thanks for your explanation. I'm fine with or without the
in-band CPER. With the comment added, the patch is:

Reviewed-by: Tao Zhou <[email protected]>
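
For readers following the thread, the load-time decision under discussion can
be sketched as a small standalone C model. This is only an illustration under
stated assumptions, not the driver code: the struct, enum, and function names
below are hypothetical, the three fields merely mirror the values the patch
reads (amdgpu_bad_page_threshold, control->ras_num_bad_pages, and
ras->bad_page_cnt_threshold), and the "at or above 90%" comparison is an
assumed form of the existing warning. The model also reports only the most
severe outcome, whereas the patch adds the RMA check after the existing
dev_warn().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the values the patch consults at table load. */
struct bad_page_state {
	int      module_threshold; /* models amdgpu_bad_page_threshold */
	uint32_t num_bad_pages;    /* models control->ras_num_bad_pages */
	uint32_t cnt_threshold;    /* models ras->bad_page_cnt_threshold */
};

/* Hypothetical outcome of loading the bad page table. */
enum load_action {
	LOAD_OK,                  /* nothing to report */
	LOAD_WARN_NEAR_THRESHOLD, /* assumed: count at or above 90% of threshold */
	LOAD_SEND_RMA,            /* count meets or exceeds the threshold */
};

static enum load_action classify_bad_page_load(const struct bad_page_state *s)
{
	/* The patch skips the RMA send when the module parameter is 0. */
	if (s->module_threshold == 0)
		return LOAD_OK;

	/* Same comparison as the patch: met or exceeded, not only exceeded. */
	if (s->num_bad_pages >= s->cnt_threshold)
		return LOAD_SEND_RMA;

	/* Assumed 90% check, done in 64-bit integers to avoid overflow. */
	if ((uint64_t)s->num_bad_pages * 10 >= (uint64_t)s->cnt_threshold * 9)
		return LOAD_WARN_NEAR_THRESHOLD;

	return LOAD_OK;
}
```

In this sketch, a table that already holds 100 records against a threshold of
100 is classified as LOAD_SEND_RMA at load time, which is the case the commit
message wants reported without waiting for the next UE.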

>
> >
> > > > -----Original Message-----
> > > > From: Zhou1, Tao <[email protected]>
> > > > Sent: Friday, January 23, 2026 1:14 AM
> > > > To: Russell, Kent <[email protected]>;
> > > > [email protected]
> > > > Cc: Zhang, Hawking <[email protected]>; Russell, Kent
> > > > <[email protected]>
> > > > Subject: RE: [PATCH] drm/amdgpu: Send RMA CPER at bad page loading
> > > >
> > > > [AMD Official Use Only - AMD Internal Distribution Only]
> > > >
> > > > > -----Original Message-----
> > > > > > From: amd-gfx <[email protected]> On Behalf Of
> > > > > > Kent Russell
> > > > > Sent: Thursday, January 22, 2026 11:25 PM
> > > > > To: [email protected]
> > > > > Cc: Zhang, Hawking <[email protected]>; Russell, Kent
> > > > > <[email protected]>
> > > > > Subject: [PATCH] drm/amdgpu: Send RMA CPER at bad page loading
> > > > >
> > > > > > Some older builds weren't sending RMA CPERs when the bad page
> > > > > > threshold was exceeded. Newer builds have resolved this, but
> > > > > > there could be systems out there with bad page numbers higher
> > > > > > than the threshold, that haven't sent out an RMA CPER. To be
> > > > > > thorough and safe, send an RMA CPER when we load the table, if
> > > > > > the threshold is met or exceeded, instead of waiting for the next
> > > > > > UE to trigger the CPER.
> > > > >
> > > > > Signed-off-by: Kent Russell <[email protected]>
> > > > > ---
> > > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 4 ++++
> > > > >  1 file changed, 4 insertions(+)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > > > > index 64dd7a81bff5..469d04a39d7d 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > > > > @@ -1712,6 +1712,10 @@ int amdgpu_ras_eeprom_check(struct
> > > > > amdgpu_ras_eeprom_control *control)
> > > > >                       dev_warn(adev->dev, "RAS records:%u
> > > > > exceeds 90%% of threshold:%d",
> > > > >                                       control->ras_num_bad_pages,
> > > > >
> > > > > ras->bad_page_cnt_threshold);
> > > > > +             if (amdgpu_bad_page_threshold != 0 &&
> > > > > +                     control->ras_num_bad_pages >= ras-
> > > > > >bad_page_cnt_threshold)
> > > > > +                     amdgpu_dpm_send_rma_reason(adev);
> > > > > +
> > > >
> > > > [Tao]: 1. Better to add comment to describe this special case;
> > > >
> > > > 2. Do we need to trigger in-band cper as well? Like:
> > > >
> > > > if (adev->cper.enabled && !amdgpu_uniras_enabled(adev) &&
> > > >     amdgpu_cper_generate_bp_threshold_record(adev))
> > > > >         dev_warn(adev->dev, "fail to generate bad page threshold cper records\n");
> > > >
> > > > >       } else if (hdr->header == RAS_TABLE_HDR_BAD &&
> > > > >                  amdgpu_bad_page_threshold != 0) {
> > > > >               if (hdr->version >= RAS_TABLE_VER_V2_1) {
> > > > > --
> > > > > 2.43.0
> > > >
> > >
> >
>
