[Public]

Thanks Tao. This was really just to get some feedback on how to do this, and
on whether there were any dependencies. Ideally we want to send out a CPER for
the situation described in the commit message. I can definitely add that as a
comment.

For the 2nd part, I am not sure. The big issue is that systems that rely on
CPERs to know when a GPU is bad will not have a CPER for this type of situation
until they take a new UE. So we want to alert them every time we load a table
whose bad page count meets or exceeds the threshold. Would in-band also benefit
from that? Is there a drawback to having both? I figure more alerts are always
better when it comes to unhealthy HW.
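
To make the question concrete, here is a rough user-space sketch of the
load-time check with both notifications. It is only a model: the struct and
function names are made up for illustration and are not the real amdgpu code,
the counters stand in for the actual CPER calls, and it leaves out the
amdgpu_uniras_enabled() check from your snippet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for driver state; NOT the real amdgpu structures. */
struct ras_load_state {
	unsigned int num_bad_pages;  /* records read from the EEPROM table */
	unsigned int threshold;      /* bad page count threshold */
	int threshold_param;         /* models amdgpu_bad_page_threshold, 0 = off */
	bool inband_enabled;         /* models adev->cper.enabled */
	unsigned int oob_alerts;     /* out-of-band RMA CPERs "sent" */
	unsigned int inband_alerts;  /* in-band CPER records "generated" */
};

/* Same condition as the patch: feature on, count meets or exceeds threshold. */
static bool rma_threshold_reached(const struct ras_load_state *s)
{
	return s->threshold_param != 0 && s->num_bad_pages >= s->threshold;
}

/* Runs once per table load, so an over-threshold GPU alerts on every load. */
static void alert_on_table_load(struct ras_load_state *s)
{
	assert(s != NULL);
	if (!rma_threshold_reached(s))
		return;
	s->oob_alerts++;            /* patch: amdgpu_dpm_send_rma_reason() */
	if (s->inband_enabled)
		s->inband_alerts++; /* suggestion: in-band record as well */
}
```

In this model the two alerts are independent, so having both only adds a
second record per load. Whether that holds in the real driver is exactly what
I am asking.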

 Kent

> -----Original Message-----
> From: Zhou1, Tao <[email protected]>
> Sent: Friday, January 23, 2026 1:14 AM
> To: Russell, Kent <[email protected]>; [email protected]
> Cc: Zhang, Hawking <[email protected]>; Russell, Kent
> <[email protected]>
> Subject: RE: [PATCH] drm/amdgpu: Send RMA CPER at bad page loading
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> > -----Original Message-----
> > From: amd-gfx <[email protected]> On Behalf Of Kent Russell
> > Sent: Thursday, January 22, 2026 11:25 PM
> > To: [email protected]
> > Cc: Zhang, Hawking <[email protected]>; Russell, Kent
> > <[email protected]>
> > Subject: [PATCH] drm/amdgpu: Send RMA CPER at bad page loading
> >
> > Some older builds weren't sending RMA CPERs when the bad page threshold was
> > exceeded. Newer builds have resolved this, but there could be systems out
> > there with bad page numbers higher than the threshold, that haven't sent out
> > an RMA CPER. To be thorough and safe, send an RMA CPER when we load the
> > table, if the threshold is met or exceeded, instead of waiting for the next
> > UE to trigger the CPER.
> >
> > Signed-off-by: Kent Russell <[email protected]>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > index 64dd7a81bff5..469d04a39d7d 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > @@ -1712,6 +1712,10 @@ int amdgpu_ras_eeprom_check(struct amdgpu_ras_eeprom_control *control)
> >                       dev_warn(adev->dev, "RAS records:%u exceeds 90%% of threshold:%d",
> >                                       control->ras_num_bad_pages,
> >                                       ras->bad_page_cnt_threshold);
> > +             if (amdgpu_bad_page_threshold != 0 &&
> > +                     control->ras_num_bad_pages >= ras->bad_page_cnt_threshold)
> > +                     amdgpu_dpm_send_rma_reason(adev);
> > +
>
> [Tao]: 1. Better to add comment to describe this special case;
>
> 2. Do we need to trigger in-band cper as well? Like:
>
> if (adev->cper.enabled && !amdgpu_uniras_enabled(adev) &&
>     amdgpu_cper_generate_bp_threshold_record(adev))
>         dev_warn(adev->dev, "fail to generate bad page threshold cper records\n");
>
> >       } else if (hdr->header == RAS_TABLE_HDR_BAD &&
> >                  amdgpu_bad_page_threshold != 0) {
> >               if (hdr->version >= RAS_TABLE_VER_V2_1) {
> > --
> > 2.43.0
>
