Re: [PATCH v6 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling

2020-10-15 Thread Kuppuswamy, Sathyanarayanan




On 10/15/20 6:55 AM, Ethan Zhao wrote:

On Wed, Oct 14, 2020 at 5:00 PM Kuppuswamy Sathyanarayanan
 wrote:


Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
merged fatal and non-fatal error recovery paths, and also made
recovery code depend on hotplug handler for "remove affected
device + rescan" support. But this change also complicated the
error recovery path and which in turn led to the following
issues.

1. We depend on hotplug handler for removing the affected
devices/drivers on DLLSC LINK down event (on DPC event
trigger) and DPC handler for handling the error recovery. Since
both handlers operate on same set of affected devices, it leads
to race condition, which in turn leads to  NULL pointer
exceptions or error recovery failures.You can find more details
about this issue in following link.

https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/T/#t

2. For non-hotplug capable devices fatal (DPC) error recovery
is currently broken. Current fatal error recovery implementation
relies on PCIe hotplug (pciehp) handler for detaching and
re-enumerating the affected devices/drivers. So when dealing with
non-hotplug capable devices, recovery code does not restore the state
of the affected devices correctly. You can find more details about
this issue in the following links.

https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/
https://lore.kernel.org/linux-pci/12115.1588207324@famine/
https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/

In order to fix the above two issues, we should stop relying on hotplug
handler for cleaning the affected devices/drivers and let error recovery
handler own this functionality. So this patch reverts Commit bdb5ac85777d
("PCI/ERR: Handle fatal error recovery") and re-introduce the  "remove
affected device + rescan"  functionality in fatal error recovery handler.

Also holding pci_lock_rescan_remove() will prevent the race between hotplug
and DPC handler.

Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
Signed-off-by: Kuppuswamy Sathyanarayanan 

Reviewed-by: Sinan Kaya 
---
  Changes since v4:
   * Added new interfaces for error recovery (pcie_do_fatal_recovery()
 and pcie_do_nonfatal_recovery()).

  Changes since v5:
   * Fixed static/non-static declartion issue.

  Documentation/PCI/pci-error-recovery.rst | 47 ++--
  Documentation/PCI/pcieaer-howto.rst  |  2 +-
  drivers/pci/pci.h|  5 +-
  drivers/pci/pcie/aer.c   | 10 ++--
  drivers/pci/pcie/dpc.c   |  2 +-
  drivers/pci/pcie/edr.c   |  2 +-
  drivers/pci/pcie/err.c   | 70 ++--
  7 files changed, 94 insertions(+), 44 deletions(-)

diff --git a/Documentation/PCI/pci-error-recovery.rst 
b/Documentation/PCI/pci-error-recovery.rst
index 84ceebb08cac..830c8af5838b 100644
--- a/Documentation/PCI/pci-error-recovery.rst
+++ b/Documentation/PCI/pci-error-recovery.rst
@@ -115,7 +115,7 @@ The actual steps taken by a platform to recover from a PCI 
error
  event will be platform-dependent, but will follow the general
  sequence described below.

-STEP 0: Error Event
+STEP 0: Error Event: ERR_NONFATAL
  ---
  A PCI bus error is detected by the PCI hardware.  On powerpc, the slot
  is isolated, in that all I/O is blocked: all reads return 0x,
@@ -160,10 +160,10 @@ particular, if the platform doesn't isolate slots), and 
recovery
  proceeds to STEP 2 (MMIO Enable).

  If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET),
-then recovery proceeds to STEP 4 (Slot Reset).
+then recovery proceeds to STEP 3 (Slot Reset).

  If the platform is unable to recover the slot, the next step
-is STEP 6 (Permanent Failure).
+is STEP 5 (Permanent Failure).

  .. note::

@@ -198,7 +198,7 @@ reset or some such, but not restart operations. This 
callback is made if
  all drivers on a segment agree that they can try to recover and if no 
automatic
  link reset was performed by the HW. If the platform can't just re-enable IOs
  without a slot reset or a link reset, it will not call this callback, and
-instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
+instead will have gone directly to STEP 3 (Slot Reset)

  .. note::

@@ -233,18 +233,12 @@ The driver should return one of the following result 
codes:

  The next step taken depends on the results returned by the drivers.
  If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
-proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
+proceeds to STEP 4 (Resume Operations).

  If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
-proceeds to STEP 4 (Slot Reset)
+proceeds to STEP 3 (Slot Reset)

-STEP 3: Link Reset
---
-The platform resets the link.  This is a PCI-Express specific step
-and 

Re: [PATCH v6 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling

2020-10-15 Thread Ethan Zhao
On Thu, Oct 15, 2020 at 1:53 PM Kuppuswamy, Sathyanarayanan
 wrote:
>
>
>
> On 10/14/20 10:05 PM, Ethan Zhao wrote:
> > On Thu, Oct 15, 2020 at 11:04 AM Kuppuswamy, Sathyanarayanan
> >  wrote:
> >>
> >>
> >>
> >> On 10/14/20 6:58 PM, Ethan Zhao wrote:
> >>> On Thu, Oct 15, 2020 at 1:06 AM Kuppuswamy, Sathyanarayanan
> >>>  wrote:
> 
> 
> 
>  On 10/14/20 8:07 AM, Ethan Zhao wrote:
> > On Wed, Oct 14, 2020 at 5:00 PM Kuppuswamy Sathyanarayanan
> >  wrote:
> >>
> >> Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> >> merged fatal and non-fatal error recovery paths, and also made
> >> recovery code depend on hotplug handler for "remove affected
> >> device + rescan" support. But this change also complicated the
> >> error recovery path and which in turn led to the following
> >> issues.
> >>
> >> 1. We depend on hotplug handler for removing the affected
> >> devices/drivers on DLLSC LINK down event (on DPC event
> >> trigger) and DPC handler for handling the error recovery. Since
> >> both handlers operate on same set of affected devices, it leads
> >> to race condition, which in turn leads to  NULL pointer
> >> exceptions or error recovery failures.You can find more details
> >> about this issue in following link.
> >>
> >> https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/T/#t
> >>
> >> 2. For non-hotplug capable devices fatal (DPC) error recovery
> >> is currently broken. Current fatal error recovery implementation
> >> relies on PCIe hotplug (pciehp) handler for detaching and
> >> re-enumerating the affected devices/drivers. So when dealing with
> >> non-hotplug capable devices, recovery code does not restore the state
> >> of the affected devices correctly. You can find more details about
> >> this issue in the following links.
> >>
> >> https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/
> >> https://lore.kernel.org/linux-pci/12115.1588207324@famine/
> >> https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/
> >>
> >> In order to fix the above two issues, we should stop relying on hotplug
> >  Yes, it doesn't rely on hotplug handler to remove and rescan the 
> > device,
> > but it couldn't prevent hotplug drivers from doing another replicated
> > removal/rescanning.
> > it doesn't make sense to leave another useless removal/rescanning there.
> > Maybe that's why these two paths were merged to one and made it rely on
> > hotplug.
>  No, as per PCIe spec, hotplug and DPC has no functional dependency. Hence
>  depending on it to handle some of its recovery function is in-correct and
>  would lead to issues in non-hotplug capable platforms (which is true
>  currently).
> >
> >>
>
> >
> >>>Though pciehp is not so hot/scalable and performance critical, but 
> >>> there
> >>>is per cpu thread to handle hot-plug operation. synchronize all threads
> >>>make them walk backwards for scalability.
> >> DPC events does not happen in high frequency. So I don't think we should
> >   It's holding global lock, once malfunction happens to one device and
> > it's driver,
> > the whole system, everyone holds it, would be blocked to work.
> >> worry about the performance here. Even hotplug handler will hold this lock
> >> when adding/removing the devices. So adding/removing devices is a 
> >> serialized
> > You don't worry about performance, but if there is a requirement needs
> > more scalable
> > and reliable hotplug, the effect will be much harder. what to do then ? 
> > choose
> > another OS ?
> As I have mentioned, all device creation/removal in PCI core code is already
> protected by this lock (including hotplug code).  So the multidomain 
> performance
> impact you mentioned should exist even now. All I am doing is, using the
> same lock for protecting device removal/rescan in error recovery code.
>
> drivers/pci/xen-pcifront.c:477: pci_lock_rescan_remove();
> drivers/pci/xen-pcifront.c:567: pci_lock_rescan_remove();
> drivers/pci/xen-pcifront.c:1064:pci_lock_rescan_remove();
> drivers/pci/hotplug/rpaphp_core.c:498:  pci_lock_rescan_remove();
> drivers/pci/hotplug/rpaphp_core.c:520:  pci_lock_rescan_remove();
> drivers/pci/hotplug/s390_pci_hpc.c:70:  pci_lock_rescan_remove();
> drivers/pci/hotplug/shpchp_pci.c:31:pci_lock_rescan_remove();
> drivers/pci/hotplug/shpchp_pci.c:73:pci_lock_rescan_remove();
> drivers/pci/hotplug/pciehp_pci.c:39:pci_lock_rescan_remove();
> drivers/pci/hotplug/pciehp_pci.c:96:pci_lock_rescan_remove();
> drivers/pci/hotplug/acpiphp_glue.c:762: pci_lock_rescan_remove();
> drivers/pci/hotplug/acpiphp_glue.c:787: pci_lock_rescan_remove();
> drivers/pci/hotplug/acpiphp_glue.c:975: 

Re: [PATCH v6 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling

2020-10-15 Thread Ethan Zhao
On Wed, Oct 14, 2020 at 5:00 PM Kuppuswamy Sathyanarayanan
 wrote:
>
> Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> merged fatal and non-fatal error recovery paths, and also made
> recovery code depend on hotplug handler for "remove affected
> device + rescan" support. But this change also complicated the
> error recovery path and which in turn led to the following
> issues.
>
> 1. We depend on hotplug handler for removing the affected
> devices/drivers on DLLSC LINK down event (on DPC event
> trigger) and DPC handler for handling the error recovery. Since
> both handlers operate on same set of affected devices, it leads
> to race condition, which in turn leads to  NULL pointer
> exceptions or error recovery failures.You can find more details
> about this issue in following link.
>
> https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/T/#t
>
> 2. For non-hotplug capable devices fatal (DPC) error recovery
> is currently broken. Current fatal error recovery implementation
> relies on PCIe hotplug (pciehp) handler for detaching and
> re-enumerating the affected devices/drivers. So when dealing with
> non-hotplug capable devices, recovery code does not restore the state
> of the affected devices correctly. You can find more details about
> this issue in the following links.
>
> https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/
> https://lore.kernel.org/linux-pci/12115.1588207324@famine/
> https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/
>
> In order to fix the above two issues, we should stop relying on hotplug
> handler for cleaning the affected devices/drivers and let error recovery
> handler own this functionality. So this patch reverts Commit bdb5ac85777d
> ("PCI/ERR: Handle fatal error recovery") and re-introduce the  "remove
> affected device + rescan"  functionality in fatal error recovery handler.
>
> Also holding pci_lock_rescan_remove() will prevent the race between hotplug
> and DPC handler.
>
> Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> Signed-off-by: Kuppuswamy Sathyanarayanan 
> 
> Reviewed-by: Sinan Kaya 
> ---
>  Changes since v4:
>   * Added new interfaces for error recovery (pcie_do_fatal_recovery()
> and pcie_do_nonfatal_recovery()).
>
>  Changes since v5:
>   * Fixed static/non-static declartion issue.
>
>  Documentation/PCI/pci-error-recovery.rst | 47 ++--
>  Documentation/PCI/pcieaer-howto.rst  |  2 +-
>  drivers/pci/pci.h|  5 +-
>  drivers/pci/pcie/aer.c   | 10 ++--
>  drivers/pci/pcie/dpc.c   |  2 +-
>  drivers/pci/pcie/edr.c   |  2 +-
>  drivers/pci/pcie/err.c   | 70 ++--
>  7 files changed, 94 insertions(+), 44 deletions(-)
>
> diff --git a/Documentation/PCI/pci-error-recovery.rst 
> b/Documentation/PCI/pci-error-recovery.rst
> index 84ceebb08cac..830c8af5838b 100644
> --- a/Documentation/PCI/pci-error-recovery.rst
> +++ b/Documentation/PCI/pci-error-recovery.rst
> @@ -115,7 +115,7 @@ The actual steps taken by a platform to recover from a 
> PCI error
>  event will be platform-dependent, but will follow the general
>  sequence described below.
>
> -STEP 0: Error Event
> +STEP 0: Error Event: ERR_NONFATAL
>  ---
>  A PCI bus error is detected by the PCI hardware.  On powerpc, the slot
>  is isolated, in that all I/O is blocked: all reads return 0x,
> @@ -160,10 +160,10 @@ particular, if the platform doesn't isolate slots), and 
> recovery
>  proceeds to STEP 2 (MMIO Enable).
>
>  If any driver requested a slot reset (by returning 
> PCI_ERS_RESULT_NEED_RESET),
> -then recovery proceeds to STEP 4 (Slot Reset).
> +then recovery proceeds to STEP 3 (Slot Reset).
>
>  If the platform is unable to recover the slot, the next step
> -is STEP 6 (Permanent Failure).
> +is STEP 5 (Permanent Failure).
>
>  .. note::
>
> @@ -198,7 +198,7 @@ reset or some such, but not restart operations. This 
> callback is made if
>  all drivers on a segment agree that they can try to recover and if no 
> automatic
>  link reset was performed by the HW. If the platform can't just re-enable IOs
>  without a slot reset or a link reset, it will not call this callback, and
> -instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
> +instead will have gone directly to STEP 3 (Slot Reset)
>
>  .. note::
>
> @@ -233,18 +233,12 @@ The driver should return one of the following result 
> codes:
>
>  The next step taken depends on the results returned by the drivers.
>  If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
> -proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
> +proceeds to STEP 4 (Resume Operations).
>
>  If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
> -proceeds to STEP 4 (Slot Reset)
> +proceeds to 

Re: [PATCH v6 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling

2020-10-15 Thread Christoph Hellwig
>  /* PCI error reporting and recovery */
> -pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> - pci_channel_state_t state,
> +pci_ers_result_t pcie_do_nonfatal_recovery(struct pci_dev *dev);
> +
> +pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev,
>   pci_ers_result_t (*reset_link)(struct pci_dev *pdev));

Now that both functions have descriptive names (thanks for that!),
the "_do" component of the names can be dropped.

> +pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev,
>   pci_ers_result_t (*reset_link)(struct pci_dev *pdev))
> +{
> + struct pci_dev *udev;
> + struct pci_bus *parent;
> + struct pci_dev *pdev, *temp;
> + pci_ers_result_t result;
> +
> + if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)
> + udev = dev;
> + else
> + udev = dev->bus->self;
> +
> + parent = udev->subordinate;
> + pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
> +
> +pci_lock_rescan_remove();
> +pci_dev_get(dev);
> +list_for_each_entry_safe_reverse(pdev, temp, >devices,
> +  bus_list) {
> + pci_stop_and_remove_bus_device(pdev);
> + }
> +
> + result = reset_link(udev);

Some of the indentation seems strange here, please use tab based
indents everywhere.


Re: [PATCH v6 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling

2020-10-14 Thread Kuppuswamy, Sathyanarayanan




On 10/14/20 10:05 PM, Ethan Zhao wrote:

On Thu, Oct 15, 2020 at 11:04 AM Kuppuswamy, Sathyanarayanan
 wrote:




On 10/14/20 6:58 PM, Ethan Zhao wrote:

On Thu, Oct 15, 2020 at 1:06 AM Kuppuswamy, Sathyanarayanan
 wrote:




On 10/14/20 8:07 AM, Ethan Zhao wrote:

On Wed, Oct 14, 2020 at 5:00 PM Kuppuswamy Sathyanarayanan
 wrote:


Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
merged fatal and non-fatal error recovery paths, and also made
recovery code depend on hotplug handler for "remove affected
device + rescan" support. But this change also complicated the
error recovery path and which in turn led to the following
issues.

1. We depend on hotplug handler for removing the affected
devices/drivers on DLLSC LINK down event (on DPC event
trigger) and DPC handler for handling the error recovery. Since
both handlers operate on same set of affected devices, it leads
to race condition, which in turn leads to  NULL pointer
exceptions or error recovery failures.You can find more details
about this issue in following link.

https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/T/#t

2. For non-hotplug capable devices fatal (DPC) error recovery
is currently broken. Current fatal error recovery implementation
relies on PCIe hotplug (pciehp) handler for detaching and
re-enumerating the affected devices/drivers. So when dealing with
non-hotplug capable devices, recovery code does not restore the state
of the affected devices correctly. You can find more details about
this issue in the following links.

https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/
https://lore.kernel.org/linux-pci/12115.1588207324@famine/
https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/

In order to fix the above two issues, we should stop relying on hotplug

 Yes, it doesn't rely on hotplug handler to remove and rescan the device,
but it couldn't prevent hotplug drivers from doing another replicated
removal/rescanning.
it doesn't make sense to leave another useless removal/rescanning there.
Maybe that's why these two paths were merged to one and made it rely on
hotplug.

No, as per PCIe spec, hotplug and DPC has no functional dependency. Hence
depending on it to handle some of its recovery function is in-correct and
would lead to issues in non-hotplug capable platforms (which is true
currently).









   Though pciehp is not so hot/scalable and performance critical, but there
   is per cpu thread to handle hot-plug operation. synchronize all threads
   make them walk backwards for scalability.

DPC events does not happen in high frequency. So I don't think we should

  It's holding global lock, once malfunction happens to one device and
it's driver,
the whole system, everyone holds it, would be blocked to work.

worry about the performance here. Even hotplug handler will hold this lock
when adding/removing the devices. So adding/removing devices is a serialized

You don't worry about performance, but if there is a requirement needs
more scalable
and reliable hotplug, the effect will be much harder. what to do then ? choose
another OS ?

As I have mentioned, all device creation/removal in PCI core code is already
protected by this lock (including hotplug code).  So the multidomain performance
impact you mentioned should exist even now. All I am doing is, using the
same lock for protecting device removal/rescan in error recovery code.

drivers/pci/xen-pcifront.c:477: pci_lock_rescan_remove();
drivers/pci/xen-pcifront.c:567: pci_lock_rescan_remove();
drivers/pci/xen-pcifront.c:1064:pci_lock_rescan_remove();
drivers/pci/hotplug/rpaphp_core.c:498:  pci_lock_rescan_remove();
drivers/pci/hotplug/rpaphp_core.c:520:  pci_lock_rescan_remove();
drivers/pci/hotplug/s390_pci_hpc.c:70:  pci_lock_rescan_remove();
drivers/pci/hotplug/shpchp_pci.c:31:pci_lock_rescan_remove();
drivers/pci/hotplug/shpchp_pci.c:73:pci_lock_rescan_remove();
drivers/pci/hotplug/pciehp_pci.c:39:pci_lock_rescan_remove();
drivers/pci/hotplug/pciehp_pci.c:96:pci_lock_rescan_remove();
drivers/pci/hotplug/acpiphp_glue.c:762: pci_lock_rescan_remove();
drivers/pci/hotplug/acpiphp_glue.c:787: pci_lock_rescan_remove();
drivers/pci/hotplug/acpiphp_glue.c:975: pci_lock_rescan_remove();
drivers/pci/hotplug/acpiphp_glue.c:1026:pci_lock_rescan_remove();
drivers/pci/hotplug/cpqphp_pci.c:75:pci_lock_rescan_remove();
drivers/pci/hotplug/cpqphp_pci.c:120:   pci_lock_rescan_remove();
drivers/pci/hotplug/rpadlpar_core.c:361:pci_lock_rescan_remove();
drivers/pci/hotplug/pnv_php.c:513:  
pci_lock_rescan_remove();
drivers/pci/hotplug/pnv_php.c:582:  pci_lock_rescan_remove();
drivers/pci/hotplug/ibmphp_core.c:668:  pci_lock_rescan_remove();
drivers/pci/hotplug/ibmphp_core.c:738:  pci_lock_rescan_remove();
drivers/pci/hotplug/cpci_hotplug_pci.c:245: 

Re: [PATCH v6 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling

2020-10-14 Thread Ethan Zhao
On Thu, Oct 15, 2020 at 11:04 AM Kuppuswamy, Sathyanarayanan
 wrote:
>
>
>
> On 10/14/20 6:58 PM, Ethan Zhao wrote:
> > On Thu, Oct 15, 2020 at 1:06 AM Kuppuswamy, Sathyanarayanan
> >  wrote:
> >>
> >>
> >>
> >> On 10/14/20 8:07 AM, Ethan Zhao wrote:
> >>> On Wed, Oct 14, 2020 at 5:00 PM Kuppuswamy Sathyanarayanan
> >>>  wrote:
> 
>  Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
>  merged fatal and non-fatal error recovery paths, and also made
>  recovery code depend on hotplug handler for "remove affected
>  device + rescan" support. But this change also complicated the
>  error recovery path and which in turn led to the following
>  issues.
> 
>  1. We depend on hotplug handler for removing the affected
>  devices/drivers on DLLSC LINK down event (on DPC event
>  trigger) and DPC handler for handling the error recovery. Since
>  both handlers operate on same set of affected devices, it leads
>  to race condition, which in turn leads to  NULL pointer
>  exceptions or error recovery failures.You can find more details
>  about this issue in following link.
> 
>  https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/T/#t
> 
>  2. For non-hotplug capable devices fatal (DPC) error recovery
>  is currently broken. Current fatal error recovery implementation
>  relies on PCIe hotplug (pciehp) handler for detaching and
>  re-enumerating the affected devices/drivers. So when dealing with
>  non-hotplug capable devices, recovery code does not restore the state
>  of the affected devices correctly. You can find more details about
>  this issue in the following links.
> 
>  https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/
>  https://lore.kernel.org/linux-pci/12115.1588207324@famine/
>  https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/
> 
>  In order to fix the above two issues, we should stop relying on hotplug
> >>> Yes, it doesn't rely on hotplug handler to remove and rescan the 
> >>> device,
> >>> but it couldn't prevent hotplug drivers from doing another replicated
> >>> removal/rescanning.
> >>> it doesn't make sense to leave another useless removal/rescanning there.
> >>> Maybe that's why these two paths were merged to one and made it rely on
> >>> hotplug.
> >> No, as per PCIe spec, hotplug and DPC has no functional dependency. Hence
> >> depending on it to handle some of its recovery function is in-correct and
> >> would lead to issues in non-hotplug capable platforms (which is true
> >> currently).
> >>>
>
> >   pci_lock_rescan_remove() is global lock for PCIe, the mal-functional
> >   device's port holds this lock, it prevents the whole system from doing
> >   hot-plug operation.
> It does not prevent the hotplug operation, but it might delay it. Since both
> DPC and hotplug operates on same set of devices, it must be synchronized.
 Think about a large system with some PCI domains, every domain has some
 ports and devices attached. why DPC and hotplug *must* operate on the
 same set of devices from different domains ? if it must be synchronized, why
 make the hotplug handlers threadable ?

> >   Though pciehp is not so hot/scalable and performance critical, but there
> >   is per cpu thread to handle hot-plug operation. synchronize all threads
> >   make them walk backwards for scalability.
> DPC events does not happen in high frequency. So I don't think we should
 It's holding global lock, once malfunction happens to one device and
it's driver,
the whole system, everyone holds it, would be blocked to work.
> worry about the performance here. Even hotplug handler will hold this lock
> when adding/removing the devices. So adding/removing devices is a serialized
You don't worry about performance, but if there is a requirement needs
more scalable
and reliable hotplug, the effect will be much harder. what to do then ? choose
another OS ?
To be honest, I don't like the global lock/ pci_lock_rescan_remove().

BTW, I didn't try the FATAL errors brute force injection on your
patch, duplicated
removal will work naturally because it was removed ?

Thanks,
Ethan
> operation.
> >
>
> >>
>  --
>  2.17.1
> 
> >>
> >> --
> >> Sathyanarayanan Kuppuswamy
> >> Linux Kernel Developer
>
> --
> Sathyanarayanan Kuppuswamy
> Linux Kernel Developer


Re: [PATCH v6 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling

2020-10-14 Thread Kuppuswamy, Sathyanarayanan




On 10/14/20 6:58 PM, Ethan Zhao wrote:

On Thu, Oct 15, 2020 at 1:06 AM Kuppuswamy, Sathyanarayanan
 wrote:




On 10/14/20 8:07 AM, Ethan Zhao wrote:

On Wed, Oct 14, 2020 at 5:00 PM Kuppuswamy Sathyanarayanan
 wrote:


Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
merged fatal and non-fatal error recovery paths, and also made
recovery code depend on hotplug handler for "remove affected
device + rescan" support. But this change also complicated the
error recovery path and which in turn led to the following
issues.

1. We depend on hotplug handler for removing the affected
devices/drivers on DLLSC LINK down event (on DPC event
trigger) and DPC handler for handling the error recovery. Since
both handlers operate on same set of affected devices, it leads
to race condition, which in turn leads to  NULL pointer
exceptions or error recovery failures.You can find more details
about this issue in following link.

https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/T/#t

2. For non-hotplug capable devices fatal (DPC) error recovery
is currently broken. Current fatal error recovery implementation
relies on PCIe hotplug (pciehp) handler for detaching and
re-enumerating the affected devices/drivers. So when dealing with
non-hotplug capable devices, recovery code does not restore the state
of the affected devices correctly. You can find more details about
this issue in the following links.

https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/
https://lore.kernel.org/linux-pci/12115.1588207324@famine/
https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/

In order to fix the above two issues, we should stop relying on hotplug

Yes, it doesn't rely on hotplug handler to remove and rescan the device,
but it couldn't prevent hotplug drivers from doing another replicated
removal/rescanning.
it doesn't make sense to leave another useless removal/rescanning there.
Maybe that's why these two paths were merged to one and made it rely on
hotplug.

No, as per PCIe spec, hotplug and DPC has no functional dependency. Hence
depending on it to handle some of its recovery function is in-correct and
would lead to issues in non-hotplug capable platforms (which is true
currently).





  pci_lock_rescan_remove() is global lock for PCIe, the mal-functional
  device's port holds this lock, it prevents the whole system from doing
  hot-plug operation.

It does not prevent the hotplug operation, but it might delay it. Since both
DPC and hotplug operates on same set of devices, it must be synchronized.

  Though pciehp is not so hot/scalable and performance critical, but there
  is per cpu thread to handle hot-plug operation. synchronize all threads
  make them walk backwards for scalability.

DPC events does not happen in high frequency. So I don't think we should
worry about the performance here. Even hotplug handler will hold this lock
when adding/removing the devices. So adding/removing devices is a serialized
operation.







--
2.17.1



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v6 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling

2020-10-14 Thread Ethan Zhao
On Thu, Oct 15, 2020 at 1:06 AM Kuppuswamy, Sathyanarayanan
 wrote:
>
>
>
> On 10/14/20 8:07 AM, Ethan Zhao wrote:
> > On Wed, Oct 14, 2020 at 5:00 PM Kuppuswamy Sathyanarayanan
> >  wrote:
> >>
> >> Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> >> merged fatal and non-fatal error recovery paths, and also made
> >> recovery code depend on hotplug handler for "remove affected
> >> device + rescan" support. But this change also complicated the
> >> error recovery path and which in turn led to the following
> >> issues.
> >>
> >> 1. We depend on hotplug handler for removing the affected
> >> devices/drivers on DLLSC LINK down event (on DPC event
> >> trigger) and DPC handler for handling the error recovery. Since
> >> both handlers operate on same set of affected devices, it leads
> >> to race condition, which in turn leads to  NULL pointer
> >> exceptions or error recovery failures.You can find more details
> >> about this issue in following link.
> >>
> >> https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/T/#t
> >>
> >> 2. For non-hotplug capable devices fatal (DPC) error recovery
> >> is currently broken. Current fatal error recovery implementation
> >> relies on PCIe hotplug (pciehp) handler for detaching and
> >> re-enumerating the affected devices/drivers. So when dealing with
> >> non-hotplug capable devices, recovery code does not restore the state
> >> of the affected devices correctly. You can find more details about
> >> this issue in the following links.
> >>
> >> https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/
> >> https://lore.kernel.org/linux-pci/12115.1588207324@famine/
> >> https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/
> >>
> >> In order to fix the above two issues, we should stop relying on hotplug
> >Yes, it doesn't rely on hotplug handler to remove and rescan the device,
> > but it couldn't prevent hotplug drivers from doing another replicated
> > removal/rescanning.
> > it doesn't make sense to leave another useless removal/rescanning there.
> > Maybe that's why these two paths were merged to one and made it rely on
> > hotplug.
> No, as per PCIe spec, hotplug and DPC has no functional dependency. Hence
> depending on it to handle some of its recovery function is in-correct and
> would lead to issues in non-hotplug capable platforms (which is true
> currently).
> >
>
> >> +   else
> >> +   udev = dev->bus->self;
> >> +
> >> +   parent = udev->subordinate;
> >> +   pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
> >> +
> >> +pci_lock_rescan_remove();
> > Though here you have lock, but hotplug will do another
> > 'pci_stop_and_remove_bus_device()'
> > without merging it with the hotplug driver, you have no way to
> > remove the replicated actions in
> >hotplug handler.
> No, the core operation (remove/add device) is syncronzied and done in
> only one thread. Please check the following flow. Even in hotplug
 pci_lock_rescan_remove() is global lock for PCIe, the mal-functional
 device's port holds this lock, it prevents the whole system from doing
 hot-plug operation.
 Though pciehp is not so hot/scalable and performance critical, but there
 is per cpu thread to handle hot-plug operation. synchronize all threads
 make them walk backwards for scalability.

> handler, before removing the device, it attempts to hold 
> pci_lock_rescan_remove()
> lock. So holding the same lock in DPC handler will syncronize the DPC/hotplug
> handlers. Also if one of the thread (DPC or hotplug) removes/adds the 
> affected devices,
> other thread will not repeat the same action (since the device is already 
> removed/added).
>
> ->pciehp_ist()
>->pciehp_handle_presence_or_link_change()
>  ->pciehp_disable_slot()
>->__pciehp_disable_slot()
>  ->remove_board()
>->pciehp_unconfigure_device()
>  ->pci_lock_rescan_remove()
> >
> >
> >Thanks,
> >Ethan
> >> +pci_dev_get(dev);
> >> +list_for_each_entry_safe_reverse(pdev, temp, >devices,
> >> +bus_list) {
> >> +   pci_stop_and_remove_bus_device(pdev);
> >> +   }
> >> +
> >> +   result = reset_link(udev);
> >> +
> >> +   if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
> >> +   /*
> >> +* If the error is reported by a bridge, we think this 
> >> error
> >> +* is related to the downstream link of the bridge, so we
> >> +* do error recovery on all subordinates of the bridge 
> >> instead
> >> +* of the bridge and clear the error status of the bridge.
> >> +*/
> >> +   pci_aer_clear_fatal_status(dev);
> >> +   if (pcie_aer_is_native(dev))
> >> +   pcie_clear_device_status(dev);
> >> +   }
> 

Re: [PATCH v6 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling

2020-10-14 Thread Kuppuswamy, Sathyanarayanan




On 10/14/20 8:07 AM, Ethan Zhao wrote:

On Wed, Oct 14, 2020 at 5:00 PM Kuppuswamy Sathyanarayanan
 wrote:


Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
merged fatal and non-fatal error recovery paths, and also made
recovery code depend on hotplug handler for "remove affected
device + rescan" support. But this change also complicated the
error recovery path and which in turn led to the following
issues.

1. We depend on hotplug handler for removing the affected
devices/drivers on DLLSC LINK down event (on DPC event
trigger) and DPC handler for handling the error recovery. Since
both handlers operate on same set of affected devices, it leads
to race condition, which in turn leads to  NULL pointer
exceptions or error recovery failures.You can find more details
about this issue in following link.

https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/T/#t

2. For non-hotplug capable devices fatal (DPC) error recovery
is currently broken. Current fatal error recovery implementation
relies on PCIe hotplug (pciehp) handler for detaching and
re-enumerating the affected devices/drivers. So when dealing with
non-hotplug capable devices, recovery code does not restore the state
of the affected devices correctly. You can find more details about
this issue in the following links.

https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/
https://lore.kernel.org/linux-pci/12115.1588207324@famine/
https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/

In order to fix the above two issues, we should stop relying on hotplug

   Yes, it doesn't rely on hotplug handler to remove and rescan the device,
but it couldn't prevent hotplug drivers from doing another replicated
removal/rescanning.
it doesn't make sense to leave another useless removal/rescanning there.
Maybe that's why these two paths were merged to one and made it rely on
hotplug.

No, as per PCIe spec, hotplug and DPC has no functional dependency. Hence
depending on it to handle some of its recovery function is in-correct and
would lead to issues in non-hotplug capable platforms (which is true
currently).





+   else
+   udev = dev->bus->self;
+
+   parent = udev->subordinate;
+   pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
+
+pci_lock_rescan_remove();

Though here you have lock, but hotplug will do another
'pci_stop_and_remove_bus_device()'
without merging it with the hotplug driver, you have no way to
remove the replicated actions in
   hotplug handler.

No, the core operation (remove/add device) is syncronzied and done in
only one thread. Please check the following flow. Even in hotplug
handler, before removing the device, it attempts to hold 
pci_lock_rescan_remove()
lock. So holding the same lock in DPC handler will syncronize the DPC/hotplug
handlers. Also if one of the thread (DPC or hotplug) removes/adds the affected 
devices,
other thread will not repeat the same action (since the device is already 
removed/added).

->pciehp_ist()
  ->pciehp_handle_presence_or_link_change()
->pciehp_disable_slot()
  ->__pciehp_disable_slot()
->remove_board()
  ->pciehp_unconfigure_device()
->pci_lock_rescan_remove()



   Thanks,
   Ethan

+pci_dev_get(dev);
+list_for_each_entry_safe_reverse(pdev, temp, >devices,
+bus_list) {
+   pci_stop_and_remove_bus_device(pdev);
+   }
+
+   result = reset_link(udev);
+
+   if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+   /*
+* If the error is reported by a bridge, we think this error
+* is related to the downstream link of the bridge, so we
+* do error recovery on all subordinates of the bridge instead
+* of the bridge and clear the error status of the bridge.
+*/
+   pci_aer_clear_fatal_status(dev);
+   if (pcie_aer_is_native(dev))
+   pcie_clear_device_status(dev);
+   }
+
+   if (result == PCI_ERS_RESULT_RECOVERED) {
+   if (pcie_wait_for_link(udev, true))

And another  pci_rescan_bus() like in the hotplug handler.

As I have mentioned before, holding the same lock should make them synchronized
and not repeat the underlying functionality of pci_rescan_bus() in both threads
at the same time.

+   pci_rescan_bus(udev->bus);
+   pci_info(dev, "Device recovery from fatal error successful\n");
+} else {
+   pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
+   pci_info(dev, "Device recovery from fatal error failed\n");



--
2.17.1



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v6 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling

2020-10-14 Thread Ethan Zhao
On Wed, Oct 14, 2020 at 5:00 PM Kuppuswamy Sathyanarayanan
 wrote:
>
> Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> merged fatal and non-fatal error recovery paths, and also made
> recovery code depend on hotplug handler for "remove affected
> device + rescan" support. But this change also complicated the
> error recovery path and which in turn led to the following
> issues.
>
> 1. We depend on hotplug handler for removing the affected
> devices/drivers on DLLSC LINK down event (on DPC event
> trigger) and DPC handler for handling the error recovery. Since
> both handlers operate on same set of affected devices, it leads
> to race condition, which in turn leads to  NULL pointer
> exceptions or error recovery failures.You can find more details
> about this issue in following link.
>
> https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/T/#t
>
> 2. For non-hotplug capable devices fatal (DPC) error recovery
> is currently broken. Current fatal error recovery implementation
> relies on PCIe hotplug (pciehp) handler for detaching and
> re-enumerating the affected devices/drivers. So when dealing with
> non-hotplug capable devices, recovery code does not restore the state
> of the affected devices correctly. You can find more details about
> this issue in the following links.
>
> https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/
> https://lore.kernel.org/linux-pci/12115.1588207324@famine/
> https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/
>
> In order to fix the above two issues, we should stop relying on hotplug
  Yes, it doesn't rely on hotplug handler to remove and rescan the device,
but it couldn't prevent hotplug drivers from doing another replicated
removal/rescanning.
it doesn't make sense to leave another useless removal/rescanning there.
Maybe that's why these two paths were merged to one and made it rely on
hotplug.

> handler for cleaning the affected devices/drivers and let error recovery
> handler own this functionality. So this patch reverts Commit bdb5ac85777d
> ("PCI/ERR: Handle fatal error recovery") and re-introduce the  "remove
> affected device + rescan"  functionality in fatal error recovery handler.
>
> Also holding pci_lock_rescan_remove() will prevent the race between hotplug
> and DPC handler.
>
> Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> Signed-off-by: Kuppuswamy Sathyanarayanan 
> 
> Reviewed-by: Sinan Kaya 
> ---
>  Changes since v4:
>   * Added new interfaces for error recovery (pcie_do_fatal_recovery()
> and pcie_do_nonfatal_recovery()).
>
>  Changes since v5:
>   * Fixed static/non-static declartion issue.
>
>  Documentation/PCI/pci-error-recovery.rst | 47 ++--
>  Documentation/PCI/pcieaer-howto.rst  |  2 +-
>  drivers/pci/pci.h|  5 +-
>  drivers/pci/pcie/aer.c   | 10 ++--
>  drivers/pci/pcie/dpc.c   |  2 +-
>  drivers/pci/pcie/edr.c   |  2 +-
>  drivers/pci/pcie/err.c   | 70 ++--
>  7 files changed, 94 insertions(+), 44 deletions(-)
>
> diff --git a/Documentation/PCI/pci-error-recovery.rst 
> b/Documentation/PCI/pci-error-recovery.rst
> index 84ceebb08cac..830c8af5838b 100644
> --- a/Documentation/PCI/pci-error-recovery.rst
> +++ b/Documentation/PCI/pci-error-recovery.rst
> @@ -115,7 +115,7 @@ The actual steps taken by a platform to recover from a 
> PCI error
>  event will be platform-dependent, but will follow the general
>  sequence described below.
>
> -STEP 0: Error Event
> +STEP 0: Error Event: ERR_NONFATAL
>  ---
>  A PCI bus error is detected by the PCI hardware.  On powerpc, the slot
>  is isolated, in that all I/O is blocked: all reads return 0x,
> @@ -160,10 +160,10 @@ particular, if the platform doesn't isolate slots), and 
> recovery
>  proceeds to STEP 2 (MMIO Enable).
>
>  If any driver requested a slot reset (by returning 
> PCI_ERS_RESULT_NEED_RESET),
> -then recovery proceeds to STEP 4 (Slot Reset).
> +then recovery proceeds to STEP 3 (Slot Reset).
>
>  If the platform is unable to recover the slot, the next step
> -is STEP 6 (Permanent Failure).
> +is STEP 5 (Permanent Failure).
>
>  .. note::
>
> @@ -198,7 +198,7 @@ reset or some such, but not restart operations. This 
> callback is made if
>  all drivers on a segment agree that they can try to recover and if no 
> automatic
>  link reset was performed by the HW. If the platform can't just re-enable IOs
>  without a slot reset or a link reset, it will not call this callback, and
> -instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
> +instead will have gone directly to STEP 3 (Slot Reset)
>
>  .. note::
>
> @@ -233,18 +233,12 @@ The driver should return one of the following result 
> codes:
>
>  The next step taken depends on the results returned by the 

[PATCH v6 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling

2020-10-14 Thread Kuppuswamy Sathyanarayanan
Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
merged fatal and non-fatal error recovery paths, and also made
recovery code depend on hotplug handler for "remove affected
device + rescan" support. But this change also complicated the
error recovery path and which in turn led to the following
issues.

1. We depend on hotplug handler for removing the affected
devices/drivers on DLLSC LINK down event (on DPC event
trigger) and DPC handler for handling the error recovery. Since
both handlers operate on same set of affected devices, it leads
to race condition, which in turn leads to  NULL pointer
exceptions or error recovery failures.You can find more details
about this issue in following link.

https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/T/#t

2. For non-hotplug capable devices fatal (DPC) error recovery
is currently broken. Current fatal error recovery implementation
relies on PCIe hotplug (pciehp) handler for detaching and
re-enumerating the affected devices/drivers. So when dealing with
non-hotplug capable devices, recovery code does not restore the state
of the affected devices correctly. You can find more details about
this issue in the following links.

https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/
https://lore.kernel.org/linux-pci/12115.1588207324@famine/
https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/

In order to fix the above two issues, we should stop relying on hotplug
handler for cleaning the affected devices/drivers and let error recovery
handler own this functionality. So this patch reverts Commit bdb5ac85777d
("PCI/ERR: Handle fatal error recovery") and re-introduce the  "remove
affected device + rescan"  functionality in fatal error recovery handler.

Also holding pci_lock_rescan_remove() will prevent the race between hotplug
and DPC handler.

Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
Signed-off-by: Kuppuswamy Sathyanarayanan 

Reviewed-by: Sinan Kaya 
---
 Changes since v4:
  * Added new interfaces for error recovery (pcie_do_fatal_recovery()
and pcie_do_nonfatal_recovery()).

 Changes since v5:
  * Fixed static/non-static declartion issue.

 Documentation/PCI/pci-error-recovery.rst | 47 ++--
 Documentation/PCI/pcieaer-howto.rst  |  2 +-
 drivers/pci/pci.h|  5 +-
 drivers/pci/pcie/aer.c   | 10 ++--
 drivers/pci/pcie/dpc.c   |  2 +-
 drivers/pci/pcie/edr.c   |  2 +-
 drivers/pci/pcie/err.c   | 70 ++--
 7 files changed, 94 insertions(+), 44 deletions(-)

diff --git a/Documentation/PCI/pci-error-recovery.rst 
b/Documentation/PCI/pci-error-recovery.rst
index 84ceebb08cac..830c8af5838b 100644
--- a/Documentation/PCI/pci-error-recovery.rst
+++ b/Documentation/PCI/pci-error-recovery.rst
@@ -115,7 +115,7 @@ The actual steps taken by a platform to recover from a PCI 
error
 event will be platform-dependent, but will follow the general
 sequence described below.
 
-STEP 0: Error Event
+STEP 0: Error Event: ERR_NONFATAL
 ---
 A PCI bus error is detected by the PCI hardware.  On powerpc, the slot
 is isolated, in that all I/O is blocked: all reads return 0x,
@@ -160,10 +160,10 @@ particular, if the platform doesn't isolate slots), and 
recovery
 proceeds to STEP 2 (MMIO Enable).
 
 If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET),
-then recovery proceeds to STEP 4 (Slot Reset).
+then recovery proceeds to STEP 3 (Slot Reset).
 
 If the platform is unable to recover the slot, the next step
-is STEP 6 (Permanent Failure).
+is STEP 5 (Permanent Failure).
 
 .. note::
 
@@ -198,7 +198,7 @@ reset or some such, but not restart operations. This 
callback is made if
 all drivers on a segment agree that they can try to recover and if no automatic
 link reset was performed by the HW. If the platform can't just re-enable IOs
 without a slot reset or a link reset, it will not call this callback, and
-instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
+instead will have gone directly to STEP 3 (Slot Reset)
 
 .. note::
 
@@ -233,18 +233,12 @@ The driver should return one of the following result 
codes:
 
 The next step taken depends on the results returned by the drivers.
 If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
-proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
+proceeds to STEP 4 (Resume Operations).
 
 If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
-proceeds to STEP 4 (Slot Reset)
+proceeds to STEP 3 (Slot Reset)
 
-STEP 3: Link Reset
---
-The platform resets the link.  This is a PCI-Express specific step
-and is done whenever a fatal error has been detected that can be
-"solved" by resetting the link.
-
-STEP 4: Slot Reset
+STEP 3: Slot