On Tue, Mar 04, 2025 at 07:07:05AM +0000, 孙利斌_Dio wrote:
> [EXTERNAL EMAIL]
>
> From 5fc7b1a9e0f0bcfa14068c6358019ed1e3ffc6c6 Mon Sep 17 00:00:00 2001
> From: "dio.sun" <dio....@enflame-tech.com>
> Date: Wed, 26 Feb 2025 08:54:49 +0000
> Subject: [PATCH] AER: PCIE CTO recovery handle fix
>

Looks like you forwarded this patch instead of submitting it directly. Please
fix that.

> - Non-fatal PCIe CTO is reported to PCIE RC and it will be converted to
> AdvNonFatalErr automatically
> - according to PCIE SPEC 6.2.3.2.4.4 Requester with Completion Timeout(
> If the severity of the CTO is non-fatal, and the Requester elects to
> attempt recovery by issuing a new request, the Requester must
> first handle the current error case as an Advisory Non-Fatal Error.).
> - Current Kernel code does nothing when receiving an AdvNonFatalErr(
> Correctable Error) and the device driver has no chance to handle this
> error.
> - Under this situation, sometimes system will hang when more
> AdvNonFatalErr coming.
>
> Signed-off-by: dio.sun <dio....@enflame-tech.com>
> ---
>  drivers/pci/pcie/aer.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 508474e17183..5ddc990c6f42 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1154,7 +1154,21 @@ static void aer_recover_work_func(struct work_struct *work)
>                 ghes_estatus_pool_region_free((unsigned long)entry.regs,
>                                               sizeof(struct aer_capability_regs));
>
> -               if (entry.severity == AER_NONFATAL)
> +               if (entry.severity == AER_CORRECTABLE) {
> +                       if (entry.regs->cor_status & PCI_ERR_COR_ADV_NFAT) {
> +                               pci_err(pdev, "%04x:%02x:%02x:%x advisory non-fatal error\n",
> +                                       entry.domain, entry.bus, PCI_SLOT(entry.devfn),
> +                                       PCI_FUNC(entry.devfn));
> +                               if (entry.regs->uncor_status & PCI_ERR_UNC_COMP_TIME) {
> +                                       pci_err(pdev, "%04x:%02x:%02x:%x completion timeout\n",
> +                                               entry.domain, entry.bus,
> +                                               PCI_SLOT(entry.devfn),
> +                                               PCI_FUNC(entry.devfn));
> +                                       pcie_do_recovery(pdev, pci_channel_io_frozen,
> +                                                        aer_root_reset);
> +                               }
> +                       }

Why is the error handled in aer_recover_work_func()? As far as I can see, that
function only gets triggered from ghes_handle_aer() in
drivers/acpi/apei/ghes.c.

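To make the condition under discussion concrete, here is a minimal standalone
sketch (not kernel code) of the check the patch performs: a correctable error
with the Advisory Non-Fatal bit set, plus Completion Timeout set in the
uncorrectable status. The two bit values mirror PCI_ERR_COR_ADV_NFAT and
PCI_ERR_UNC_COMP_TIME from include/uapi/linux/pci_regs.h; the struct and the
helper name are hypothetical and exist only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values as defined in include/uapi/linux/pci_regs.h */
#define PCI_ERR_COR_ADV_NFAT   0x00002000u /* Advisory Non-Fatal Error */
#define PCI_ERR_UNC_COMP_TIME  0x00004000u /* Completion Timeout */

/* Hypothetical stand-in for the two AER status registers the patch reads. */
struct aer_status_sketch {
	uint32_t cor_status;   /* Correctable Error Status */
	uint32_t uncor_status; /* Uncorrectable Error Status */
};

/*
 * Hypothetical helper: returns true only when the Advisory Non-Fatal bit is
 * set in the correctable status AND Completion Timeout is set in the
 * uncorrectable status, i.e. the case the patch wants to recover from.
 */
static bool is_advisory_cto(const struct aer_status_sketch *s)
{
	return (s->cor_status & PCI_ERR_COR_ADV_NFAT) &&
	       (s->uncor_status & PCI_ERR_UNC_COMP_TIME);
}
```

A predicate of this shape does not depend on where the error record came from,
so it could be evaluated on the native AER path just as well as on the GHES
one.
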
I think it should be handled in pci_aer_handle_error() instead. Also, the error
prints should be folded into aer_print_error().

- Mani

-- 
மணிவண்ணன் சதாசிவம்