On 2/3/26 5:09 AM, Lukas Wunner wrote:
On Mon, Feb 02, 2026 at 03:02:54PM +0100, Lukas Wunner wrote:
You're assuming that the parent of the Requester is always identical
to the containing Downstream Port.  But that's not necessarily the case:

E.g., imagine a DPC-capable Root Port with a PCIe switch below
whose Downstream Ports are not DPC-capable.  Let's say an Endpoint
beneath the PCIe switch sends ERR_FATAL upstream.  AFAICS, your patch
will cause pcie_do_recovery() to invoke dpc_reset_link() with the
Downstream Port of the PCIe switch as argument.  That function will
then happily use pdev->dpc_cap even though it's 0.

Thinking about this some more, I realized there's another problem:

In a scenario like the one I've outlined above, after your change,
pcie_do_recovery() will only broadcast error_detected (and other
callbacks) below the downstream port of the PCIe switch -- and not
to any other devices below the containing Root Port.

However, the DPC-induced Link Down event at the Root Port results
in a Hot Reset being propagated down the hierarchy to any device
below the Root Port.  So with your change, the siblings of the
downstream port on the PCIe switch will no longer be informed of
the reset and thus are no longer given an opportunity to recover
after reset.

The premise on which this patch is built is false -- that the bridge
upstream of the error-reporting device is always equal to the
containing Downstream Port.

Thanks again for the very detailed analysis and for the pointers to
your earlier mail.

You are right, thanks for pointing it out.


It seems the only reason why you want to pass the reporting device
to pcie_do_recovery() is that you want to call pcie_clear_device_status()
and pci_aer_clear_nonfatal_status() with that device.

In the AER path, pcie_do_recovery() is indeed invoked with the Error
Source device found by find_source_device(), and internally it treats
that dev as the function that detected the error and derives the
containing Downstream Port (bridge) from it.  For DPC, however, the
error-detecting function is the DPC-capable Downstream Port itself, not
the Endpoint identified as Error Source, so passing the Endpoint to
pcie_do_recovery() breaks that assumption.

However as I've said before, those calls are AER-specific and should
be moved out of pcie_do_recovery() so that it becomes generic and can
be used by EEH and s390:

https://lore.kernel.org/all/[email protected]/

Sure, I'd like to move it out.  I will remove the AER-specific calls
(pcie_clear_device_status() and pci_aer_raw_clear_status()) from
pcie_do_recovery() itself, and instead handle them in the AER and DPC
code paths where we already know which device(s) are the actual error
sources.  That way, pcie_do_recovery() becomes a generic recovery
framework that can be reused by EEH and s390.


There's another problem:  When a device experiences an error while DPC
is ongoing (i.e. while the link is down), its ERR_FATAL or ERR_NONFATAL
Message may not come through.  Still the error bits are set in the
device's Uncorrectable Error Status register.  So I think what we need to
do is walk the hierarchy below the containing Downstream Port after the
link is back up and search for devices with any error bits set,
then report and clear those errors.  We may do this after first
examining the device in the DPC Error Source ID register.
Any additional errors found while walking the hierarchy can then
be identified as "occurred during DPC recovery".

I agree this is similar in spirit to find_source_device() -- both walk
the bus and check AER Status registers.  For the DPC case, I'll perform
this walk after the link is back up (i.e., after dpc_reset_link()
succeeds).

Regarding pci_restore_state() in slot_reset(): I see now that it does
call pci_aer_clear_status(dev) (at line 1844 in pci.c), which will
clear the AER Status registers.  So if we walk the hierarchy after
the slot_reset callbacks, the error bits accumulated during DPC will
already be cleared.

To avoid losing those errors, I think the walk should happen after
dpc_reset_link() succeeds but *before* pcie_do_recovery() invokes the
slot_reset callbacks.  That way, we can capture the AER Status bits
before pci_restore_state() clears them.

Does that sound like the right approach, or would you prefer a
different placement?

Thanks a lot for your guidance.

Best Regards,
Shuai


Reply via email to