[Public]

> -----Original Message-----
> From: Russell, Kent <[email protected]>
> Sent: Saturday, January 24, 2026 1:02 AM
> To: Zhou1, Tao <[email protected]>; [email protected]
> Cc: Zhang, Hawking <[email protected]>
> Subject: RE: [PATCH] drm/amdgpu: Send RMA CPER at bad page loading
>
> [Public]
>
> Thanks Tao. This was really just to get some feedback on how to do this, and
> whether there were any dependencies. Ideally we want to send out a CPER for
> the situation in the commit message. I can definitely add that as a comment.
>
> For the 2nd part, I am not sure. The big issue is that systems that rely on
> CPERs to know when a GPU is bad will not have a CPER for this type of
> situation until they take a new UE. So we want to alert them every time we
> load more than the threshold. Would in-band also benefit from that? Is there
> a drawback to having both? I figure more alerts is always better when it
> comes to unhealthy HW.
>
>  Kent
>

[Tao] Hi Kent, can the system boot up successfully in this case? If the answer
is no, I think an in-band CPER is unnecessary; otherwise users may be confused
by the inconsistency between the in-band and out-of-band CPERs.
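
To make the question concrete, here is a rough, standalone sketch of the load-time decision being discussed. It is not driver code: struct ras_load_state, the RMA_ALERT_* flags and rma_alerts_at_load() are hypothetical names that only model the fields in the patch (amdgpu_bad_page_threshold, control->ras_num_bad_pages, ras->bad_page_cnt_threshold) and the in-band condition from the earlier suggestion. It assumes the out-of-band RMA CPER is sent whenever the threshold is met at table load, and that the in-band record is optional.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the state consulted when the bad page table loads. */
struct ras_load_state {
	int32_t  threshold_param;     /* models amdgpu_bad_page_threshold; 0 disables the check */
	uint32_t num_bad_pages;       /* models control->ras_num_bad_pages */
	uint32_t threshold;           /* models ras->bad_page_cnt_threshold */
	bool     inband_cper_enabled; /* models cper.enabled && !uniras (assumption) */
};

/* Hypothetical flags describing which alerts would be raised. */
enum {
	RMA_ALERT_NONE   = 0,
	RMA_ALERT_OOB    = 1 << 0, /* out-of-band RMA CPER */
	RMA_ALERT_INBAND = 1 << 1, /* in-band bad page threshold CPER */
};

/* Decide which alerts to raise at table load, without waiting for a new UE. */
static unsigned int rma_alerts_at_load(const struct ras_load_state *s)
{
	unsigned int alerts = RMA_ALERT_NONE;

	assert(s != NULL);

	/* Threshold checking disabled, or the threshold has not been reached. */
	if (s->threshold_param == 0 || s->num_bad_pages < s->threshold)
		return alerts;

	/* Threshold met or exceeded: always raise the out-of-band alert. */
	alerts |= RMA_ALERT_OOB;

	/* The in-band record is the open question in this thread. */
	if (s->inband_cper_enabled)
		alerts |= RMA_ALERT_INBAND;

	return alerts;
}
```

With this model, a table holding exactly the threshold number of bad pages raises the out-of-band alert on every load, and the in-band one only when that path is enabled, which is where the two could diverge.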

> > -----Original Message-----
> > From: Zhou1, Tao <[email protected]>
> > Sent: Friday, January 23, 2026 1:14 AM
> > To: Russell, Kent <[email protected]>;
> > [email protected]
> > Cc: Zhang, Hawking <[email protected]>; Russell, Kent
> > <[email protected]>
> > Subject: RE: [PATCH] drm/amdgpu: Send RMA CPER at bad page loading
> >
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> > > -----Original Message-----
> > > From: amd-gfx <[email protected]> On Behalf Of Kent Russell
> > > Sent: Thursday, January 22, 2026 11:25 PM
> > > To: [email protected]
> > > Cc: Zhang, Hawking <[email protected]>; Russell, Kent
> > > <[email protected]>
> > > Subject: [PATCH] drm/amdgpu: Send RMA CPER at bad page loading
> > >
> > > Some older builds weren't sending RMA CPERs when the bad page
> > > threshold was exceeded. Newer builds have resolved this, but there
> > > could be systems out there with bad page numbers higher than the
> > > threshold, that haven't sent out an RMA CPER. To be thorough and
> > > safe, send an RMA CPER when we load the table, if the threshold is
> > > met or exceeded, instead of waiting for the next UE to trigger the
> > > CPER.
> > >
> > > Signed-off-by: Kent Russell <[email protected]>
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > > index 64dd7a81bff5..469d04a39d7d 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > > @@ -1712,6 +1712,10 @@ int amdgpu_ras_eeprom_check(struct amdgpu_ras_eeprom_control *control)
> > >                       dev_warn(adev->dev, "RAS records:%u exceeds 90%% of threshold:%d",
> > >                                       control->ras_num_bad_pages,
> > >                                       ras->bad_page_cnt_threshold);
> > > +             if (amdgpu_bad_page_threshold != 0 &&
> > > +                     control->ras_num_bad_pages >= ras->bad_page_cnt_threshold)
> > > +                     amdgpu_dpm_send_rma_reason(adev);
> > > +
> >
> > [Tao]: 1. Better to add comment to describe this special case;
> >
> > 2. Do we need to trigger in-band cper as well? Like:
> >
> > if (adev->cper.enabled && !amdgpu_uniras_enabled(adev) &&
> >     amdgpu_cper_generate_bp_threshold_record(adev))
> >         dev_warn(adev->dev, "fail to generate bad page threshold cper records\n");
> >
> > >       } else if (hdr->header == RAS_TABLE_HDR_BAD &&
> > >                  amdgpu_bad_page_threshold != 0) {
> > >               if (hdr->version >= RAS_TABLE_VER_V2_1) {
> > > --
> > > 2.43.0
> >
>
