RE: [PATCH] drm/amdgpu: Send applicable RMA CPERs at end of RAS init

Russell, Kent Thu, 05 Feb 2026 06:05:14 -0800

[AMD Official Use Only - AMD Internal Distribution Only]

Thanks for the clarification. I have seen both terms used differently by 
different people, so I changed my original patch to reflect that. I'll fix 
those 3 mistakes and push it out.


 Kent

> -----Original Message-----
> From: Zhou1, Tao <[email protected]>
> Sent: Wednesday, February 4, 2026 10:34 PM
> To: Russell, Kent <[email protected]>; [email protected]
> Cc: Liu, Xiang(Dean) <[email protected]>
> Subject: RE: [PATCH] drm/amdgpu: Send applicable RMA CPERs at end of RAS init
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> > -----Original Message-----
> > From: Russell, Kent <[email protected]>
> > Sent: Thursday, February 5, 2026 1:52 AM
> > To: [email protected]
> > Cc: Liu, Xiang(Dean) <[email protected]>; Zhou1, Tao <[email protected]>;
> > Russell, Kent <[email protected]>
> > Subject: [PATCH] drm/amdgpu: Send applicable RMA CPERs at end of RAS init
> >
> > Firmware and monitoring tools may not be ready to receive a CPER when we read
> > the bad pages, so send the CPERs at the end of RAs initialization to ensure
> > that the
>
> [Tao] RAs -> RAS
>
> > FW is ready to receive and process the CPER. This removes the previous CPER
> > submission that was added during bad page load, and sends both in-band and
> > out-of-band at the same time.
> >
> > Signed-off-by: Kent Russell <[email protected]>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c       |  2 ++
> >  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c    | 27 ++++++++++++++++---
> >  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h    |  2 ++
> >  3 files changed, 27 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index b28fcf932f7e..856b1bf83533 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -4650,6 +4650,8 @@ int amdgpu_ras_late_init(struct amdgpu_device *adev)
> >                       amdgpu_ras_block_late_init_default(adev, &obj->ras_comm);
> >       }
> >
> > +     amdgpu_ras_check_bad_page_status(adev);
> > +
> >       return 0;
> >  }
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > index 469d04a39d7d..91de4085a66d 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > @@ -1712,10 +1712,6 @@ int amdgpu_ras_eeprom_check(struct amdgpu_ras_eeprom_control *control)
> >                       dev_warn(adev->dev, "RAS records:%u exceeds 90%% of threshold:%d",
> >                                       control->ras_num_bad_pages,
> >                                       ras->bad_page_cnt_threshold);
> > -             if (amdgpu_bad_page_threshold != 0 &&
> > -                     control->ras_num_bad_pages >= ras->bad_page_cnt_threshold)
> > -                     amdgpu_dpm_send_rma_reason(adev);
> > -
> >       } else if (hdr->header == RAS_TABLE_HDR_BAD &&
> >                  amdgpu_bad_page_threshold != 0) {
> >               if (hdr->version >= RAS_TABLE_VER_V2_1) {
> > @@ -1932,3 +1928,26 @@ int amdgpu_ras_smu_erase_ras_table(struct amdgpu_device *adev,
> >                                                                          result);
> >       return -EOPNOTSUPP;
> >  }
> > +
> > +void amdgpu_ras_check_bad_page_status(struct amdgpu_device *adev)
> > +{
> > +     struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
> > +     struct amdgpu_ras_eeprom_control *control = ras ? &ras->eeprom_control : NULL;
> > +
> > +     if (!control || amdgpu_bad_page_threshold == 0)
> > +             return;
> > +
> > +     if (control->ras_num_bad_pages >= ras->bad_page_cnt_threshold) {
> > +             if (amdgpu_dpm_send_rma_reason(adev))
> > +                     dev_warn(adev->dev, "Unable to send in-band RMA CPER");
>
> [Tao] This is the OOB (out-of-band) CPER, and the following one is the IB (in-band) CPER.
>
> With my concerns fixed, the patch is: Reviewed-by: Tao Zhou <[email protected]>
>
> > +             else
> > +                     dev_dbg(adev->dev, "Sent in-band RMA CPER");
> > +
> > +             if (adev->cper.enabled && !amdgpu_uniras_enabled(adev)) {
> > +                     if (amdgpu_cper_generate_bp_threshold_record(adev))
> > +                             dev_warn(adev->dev, "Unable to send out-of-band RMA CPER");
> > +                     else
> > +                             dev_dbg(adev->dev, "Sent out-of-band RMA CPER");
> > +             }
> > +     }
> > +}
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> > index 2e5d63957e71..a62114800a92 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> > @@ -193,6 +193,8 @@ int amdgpu_ras_eeprom_read_idx(struct amdgpu_ras_eeprom_control *control,
> >
> >  int amdgpu_ras_eeprom_update_record_num(struct amdgpu_ras_eeprom_control *control);
> >
> > +void amdgpu_ras_check_bad_page_status(struct amdgpu_device *adev);
> > +
> >  extern const struct file_operations amdgpu_ras_debugfs_eeprom_size_ops;
> >  extern const struct file_operations amdgpu_ras_debugfs_eeprom_table_ops;
> >
> > --
> > 2.43.0
>
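
For anyone following the thread, here is a minimal standalone sketch of the control flow in amdgpu_ras_check_bad_page_status(), with the OOB/IB labels arranged per Tao's comment. It is not driver code: struct mock_dev and the mock_send_* helpers are hypothetical stand-ins for adev, amdgpu_dpm_send_rma_reason() (OOB) and amdgpu_cper_generate_bp_threshold_record() (IB). It also assumes both senders return 0 on success, which is what the if/else usage in the patch implies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the driver state; not the real amdgpu structs. */
struct mock_dev {
	unsigned int num_bad_pages;      /* plays the role of control->ras_num_bad_pages */
	unsigned int bad_page_threshold; /* plays the role of ras->bad_page_cnt_threshold */
	bool threshold_check_enabled;    /* plays the role of amdgpu_bad_page_threshold != 0 */
	bool cper_enabled;               /* plays the role of cper.enabled && !uniras */
	int oob_sent;                    /* number of out-of-band CPERs sent */
	int ib_sent;                     /* number of in-band CPERs sent */
};

/* Mock senders. Assumption: 0 means success, nonzero means failure. */
static int mock_send_oob_rma(struct mock_dev *dev)
{
	dev->oob_sent++;
	return 0;
}

static int mock_send_ib_rma(struct mock_dev *dev)
{
	dev->ib_sent++;
	return 0;
}

/*
 * Called once at the end of init, so the receivers are expected to be
 * ready by then. Sends both CPERs only when the bad page count has
 * reached the threshold.
 */
static void mock_check_bad_page_status(struct mock_dev *dev)
{
	assert(dev != NULL);

	if (!dev->threshold_check_enabled)
		return;
	if (dev->num_bad_pages < dev->bad_page_threshold)
		return;

	if (mock_send_oob_rma(dev))
		fprintf(stderr, "Unable to send out-of-band RMA CPER\n");

	if (dev->cper_enabled && mock_send_ib_rma(dev))
		fprintf(stderr, "Unable to send in-band RMA CPER\n");
}
```

Calling mock_check_bad_page_status() on a device at or above the threshold sends one OOB and, if CPER is enabled, one IB record. Below the threshold it sends nothing.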
