from:"Naveen Naidu"

[PATCH v4 19/25] PCI/DPC: Use PCI_POSSIBLE_ERROR() to check read from hardware

2021-11-18 Thread Naveen Naidu

An MMIO read from a PCI device that doesn't exist or doesn't respond
causes a PCI error.  There's no real data to return to satisfy the
CPU read, so most hardware fabricates ~0 data.

Use PCI_POSSIBLE_ERROR() to check the response we get when we read
data from hardware.

This helps unify PCI error response checking and make error checks
consistent and easier to find.

Compile tested only.

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pcie/dpc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index c556e7beafe3..3e9afee02e8d 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -79,7 +79,7 @@ static bool dpc_completed(struct pci_dev *pdev)
u16 status;
 
pci_read_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_STATUS, );
-   if ((status != 0x) && (status & PCI_EXP_DPC_STATUS_TRIGGER))
+   if ((!PCI_POSSIBLE_ERROR(status)) && (status & 
PCI_EXP_DPC_STATUS_TRIGGER))
return false;
 
if (test_bit(PCI_DPC_RECOVERING, >priv_flags))
@@ -312,7 +312,7 @@ static irqreturn_t dpc_irq(int irq, void *context)
 
pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
 
-   if (!(status & PCI_EXP_DPC_STATUS_INTERRUPT) || status == (u16)(~0))
+   if (!(status & PCI_EXP_DPC_STATUS_INTERRUPT) || 
PCI_POSSIBLE_ERROR(status))
return IRQ_NONE;
 
pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
-- 
2.25.1

[PATCH v4 00/25] Unify PCI error response checking

2021-11-18 Thread Naveen Naidu

An MMIO read from a PCI device that doesn't exist or doesn't respond
causes a PCI error.  There's no real data to return to satisfy the 
CPU read, so most hardware fabricates ~0 data.

This patch series adds PCI_ERROR_RESPONSE definition and other helper
definition PCI_SET_ERROR_RESPONSE and PCI_POSSIBLE_ERROR and uses it
where appropriate to make these checks consistent and easier to find.

This helps unify PCI error response checking and make error check
consistent and easier to find.

This series also ensures that the error response fabrication now happens
in the PCI_OP_READ and PCI_USER_READ_CONFIG. This removes the
responsibility from controller drivers to do the error response setting. 

Patch 1:
- Adds the PCI_ERROR_RESPONSE and other related defintions
- All other patches are dependent on this patch. This patch needs to
  be applied first, before the others

Patch 2:
- Error fabrication happens in PCI_OP_READ and PCI_USER_READ_CONFIG
  whenever the data read via the controller driver fails.
- This patch needs to be applied before, Patch 4/24 to Patch 15/24 are
  applied.

Patch 3:
- Uses PCI_SET_ERROR_RESPONSE() when device is not found 

Patch 4 - 15:
- Removes redundant error fabrication that happens in controller 
  drivers when the read from a PCI device fails.
- These patches are dependent on Patch 2/24 of the series.
- These can be applied in any order.

Patch 16 - 22:
- Uses PCI_POSSIBLE_ERROR() to check the reads from hardware
- Patches can be applied in any order.

Patch 23 - 25:
- Edits the comments to include PCI_ERROR_RESPONSE alsong with
  0x, so that it becomes easier to grep for faulty 
  hardware reads.

Changelog
=
v4:
   - Rename SET_PCI_ERROR_RESPONSE to PCI_SET_ERROR_RESPONSE
   - Rename RESPONSE_IS_PCI_ERROR to PCI_POSSIBLE_ERROR

v3:
   - Change RESPONSE_IS_PCI_ERROR macro definition
   - Fix the macros, Add () around macro parameters
   - Fix alignment issue in Patch 2/24
   - Add proper receipients for all the patches

v2:
- Instead of using SET_PCI_ERROR_RESPONSE in all controller drivers
  to fabricate error response, only use them in PCI_OP_READ and
  PCI_USER_READ_CONFIG

Naveen Naidu (25):
 [PATCH v4 1/25] PCI: Add PCI_ERROR_RESPONSE and it's related definitions
 [PATCH v4 2/25] PCI: Set error response in config access defines when 
ops->read() fails
 [PATCH v4 3/25] PCI: Use PCI_SET_ERROR_RESPONSE() when device not found
 [PATCH v4 4/25] PCI: Remove redundant error fabrication when device read fails
 [PATCH v4 5/25] PCI: thunder: Remove redundant error fabrication when device 
read fails
 [PATCH v4 6/25] PCI: iproc: Remove redundant error fabrication when device 
read fails
 [PATCH v4 7/25] PCI: mediatek: Remove redundant error fabrication when device 
read fails
 [PATCH v4 8/25] PCI: exynos: Remove redundant error fabrication when device 
read fails
 [PATCH v4 9/25] PCI: histb: Remove redundant error fabrication when device 
read fails
 [PATCH v4 10/25] PCI: kirin: Remove redundant error fabrication when device 
read fails
 [PATCH v4 11/25] PCI: aardvark: Remove redundant error fabrication when device 
read fails
 [PATCH v4 12/25] PCI: mvebu: Remove redundant error fabrication when device 
read fails
 [PATCH v4 13/25] PCI: altera: Remove redundant error fabrication when device 
read fails
 [PATCH v4 14/25] PCI: rcar: Remove redundant error fabrication when device 
read fails
 [PATCH v4 15/25] PCI: rockchip: Remove redundant error fabrication when device 
read fails
 [PATCH v4 16/25] PCI/ERR: Use PCI_POSSIBLE_ERROR() to check read from hardware
 [PATCH v4 17/25] PCI: vmd: Use PCI_POSSIBLE_ERROR() to check read from hardware
 [PATCH v4 18/25] PCI: pciehp: Use PCI_POSSIBLE_ERROR() to check read from 
hardware
 [PATCH v4 19/25] PCI/DPC: Use PCI_POSSIBLE_ERROR() to check read from hardware
 [PATCH v4 20/25] PCI/PME: Use PCI_POSSIBLE_ERROR() to check read from hardware
 [PATCH v4 21/25] PCI: cpqphp: Use PCI_POSSIBLE_ERROR() to check read from 
hardware
 [PATCH v4 22/25] PCI: Use PCI_ERROR_RESPONSE to specify hardware error
 [PATCH v4 23/25] PCI: keystone: Use PCI_ERROR_RESPONSE to specify hardware 
error
 [PATCH v4 24/25] PCI: hv: Use PCI_ERROR_RESPONSE to specify hardware read error
 [PATCH v4 25/25] PCI: xgene: Use PCI_ERROR_RESPONSE to specify hardware error

 drivers/pci/access.c| 32 +++---
 drivers/pci/controller/dwc/pci-exynos.c |  4 +-
 drivers/pci/controller/dwc/pci-keystone.c   |  4 +-
 drivers/pci/controller/dwc/pcie-histb.c |  4 +-
 drivers/pci/controller/dwc/pcie-kirin.c |  4 +-
 drivers/pci/controller/pci-aardvark.c   |  4 +-
 drivers/pci/controller/pci-hyperv.c |  2 +-
 drivers/pci/controller/pci-mvebu.c  |  8 +---
 drivers/pci/controller/pci-thunder-ecam.c   | 46 +++--
 drivers/pci/controller/pci-thunder-pem.c|  4 +-
 drivers/pci/controller/pci-xgene.c  |  8 ++--
 driv

Re: [PATCH v3 01/25] PCI: Add PCI_ERROR_RESPONSE and it's related definitions

2021-11-18 Thread Naveen Naidu

On 17/11, Bjorn Helgaas wrote:
> On Thu, Oct 21, 2021 at 08:37:26PM +0530, Naveen Naidu wrote:
> > An MMIO read from a PCI device that doesn't exist or doesn't respond
> > causes a PCI error.  There's no real data to return to satisfy the
> > CPU read, so most hardware fabricates ~0 data.
> > 
> > Add a PCI_ERROR_RESPONSE definition for that and use it where
> > appropriate to make these checks consistent and easier to find.
> > 
> > Also add helper definitions SET_PCI_ERROR_RESPONSE and
> > RESPONSE_IS_PCI_ERROR to make the code more readable.
> > 
> > Suggested-by: Bjorn Helgaas 
> > Signed-off-by: Naveen Naidu 
> > ---
> >  include/linux/pci.h | 9 +
> >  1 file changed, 9 insertions(+)
> > 
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index cd8aa6fce204..689c8277c584 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -154,6 +154,15 @@ enum pci_interrupt_pin {
> >  /* The number of legacy PCI INTx interrupts */
> >  #define PCI_NUM_INTX   4
> >  
> > +/*
> > + * Reading from a device that doesn't respond typically returns ~0.  A
> > + * successful read from a device may also return ~0, so you need additional
> > + * information to reliably identify errors.
> > + */
> > +#define PCI_ERROR_RESPONSE (~0ULL)
> > +#define SET_PCI_ERROR_RESPONSE(val)(*(val) = ((typeof(*(val))) 
> > PCI_ERROR_RESPONSE))
> > +#define RESPONSE_IS_PCI_ERROR(val) ((val) == ((typeof(val)) 
> > PCI_ERROR_RESPONSE))
> 
> Beautiful!  I really like this.
>

Thank you very much for the review ^^

> I would prefer the macros to start with "PCI_", e.g.,
> PCI_SET_ERROR_RESPONSE().
> 

ACK

> I think "RESPONSE_IS_PCI_ERROR()" is too strong because (as the
> comment says), ~0 *may* indicate an error.  Or it may be a successful
> read of a register that happens to contain ~0.
> 
> Possibilities to convey the idea that this isn't definitive:
> 
>   PCI_POSSIBLE_ERROR_RESPONSE(val)  # a little long
>   PCI_LIKELY_ERROR(val) # we really have no idea whether
>   PCI_PROBABLE_ERROR(val)   #   likely or probable
>   PCI_POSSIBLE_ERROR(val)   # promising?
>

ACK. Will use PCI_POSSIBLE_ERROR()

> Can you rebase to my "main" branch (v5.16-rc1), tweak the above, and
> collect up the acks/reviews?
> 

ACK

> We should also browse drivers outside drivers/pci for places we could
> use these.  Not necessarily as part of this series, although if
> authors there object, it would be good to learn that earlier than
> later.
> 
> Drivers that implement pci_error_handlers might be a fruitful place to
> start.  But you've done a great job finding users of ~0 and 0x...
> in drivers/pci/, too.
> 

A quick grep showed that there are around 80 drivers which have
pci_error_handlers. I was thinking that it would be better if we handle
these drivers in another patch series since the current patch series is
itself 25 patches long. And in my short tenure reading LKML, I gathered
that folks generally are not so kind to a long list of patches in a
single patch series ^^' (I might be wrong though, Apologies)

The consensus on the patch series does seem slightly positive so
ideally, I was hoping that we would not have the case where a author
does not like the way we are handling this patch. Then again, I'm
pretty sure that I might be wrong ^^'

I hope it would be okay that I send in a new patch series with the
suggested changes and handle the other changes in another patch series
^^

Thanks,
Naveen
> > +
> >  /*
> >   * pci_power_t values must match the bits in the Capabilities PME_Support
> >   * and Control/Status PowerState fields in the Power Management capability.
> > -- 
> > 2.25.1
> > 
> > ___
> > Linux-kernel-mentees mailing list
> > linux-kernel-ment...@lists.linuxfoundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/linux-kernel-mentees

[PATCH v5 5/5] PCI/AER: Include DEVCTL in aer_print_error()

2021-10-25 Thread Naveen Naidu

Print the contents of Device Control Register of the device which
detected the error. This might help in faster error diagnosis.

It is easy to test this by using aer-inject:

  $ aer-inject -s 00:03:0 corr-err-file

The content of the corr-err-file is as below:

  AER
  COR_STATUS BAD_TLP
  HEADER_LOG 0 1 2 3

Sample output from dummy error injected by aer-inject:

  pcieport :00:03.0: AER: Corrected error received: :00:03.0
  pcieport :00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link 
Layer, (Receiver)
  pcieport :00:03.0:   device [1b36:000c] error 
status/mask=0040/e000, devctl=0x000f <-- devctl added to the error log
  pcieport :00:03.0:[ 6] BadTLP

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pci.h  |  2 ++
 drivers/pci/pcie/aer.c | 10 --
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index eb88d8bfeaf7..48ed7f91113b 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -437,6 +437,8 @@ struct aer_err_info {
u32 status; /* COR/UNCOR Error Status */
u32 mask;   /* COR/UNCOR Error Mask */
struct aer_header_log_regs tlp; /* TLP Header */
+
+   u16 devctl;
 };
 
 /* Preliminary AER error information processed from Root port */
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index d3937f5384e4..fdeef9deb016 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -729,8 +729,8 @@ void aer_print_error(struct pci_dev *dev, struct 
aer_err_info *info)
   aer_error_severity_string[info->severity],
   aer_error_layer[layer], aer_agent_string[agent]);
 
-   pci_printk(level, dev, "  device [%04x:%04x] error 
status/mask=%08x/%08x\n",
-  dev->vendor, dev->device, info->status, info->mask);
+   pci_printk(level, dev, "  device [%04x:%04x] error 
status/mask=%08x/%08x, devctl=%#06x\n",
+  dev->vendor, dev->device, info->status, info->mask, 
info->devctl);
 
__aer_print_error(dev, info);
 
@@ -1083,6 +1083,12 @@ int aer_get_device_error_info(struct pci_dev *dev, 
struct aer_err_info *info)
if (!aer)
return 0;
 
+   /*
+* Cache the value of Device Control Register now, because later the
+* device might not be available
+*/
+   pcie_capability_read_word(dev, PCI_EXP_DEVCTL, >devctl);
+
if (info->severity == AER_CORRECTABLE) {
pci_read_config_dword(dev, aer + PCI_ERR_COR_STATUS,
>status);
-- 
2.25.1

[PATCH v5 4/5] PCI/AER: Clear error device AER registers in aer_irq()

2021-10-25 Thread Naveen Naidu

Converge the APEI path and native AER path of clearing the AER registers
of the error device.

In APEI path, the system firmware clears the AER registers before
handing off the record to OS. But in "native AER" path, the execution
path of clearing the AER register is as follows:

  aer_isr_one_error
aer_print_port_info
  if (find_source_device())
aer_process_err_devices
  handle_error_source
pci_write_config_dword(dev, PCI_ERR_COR_STATUS, ...)

The above path has a bug, if the find_source_device() fails, AER
registers are not cleared from the error device. This means, the error
device will keep reporting the error again and again and would lead
to message spew.

Related Bug Report:
  https://lore.kernel.org/linux-pci/20151229155822.GA17321@localhost/
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173

The above bug could be avoided, if the AER registers are cleared during
the AER IRQ handler aer_irq(), which would provide guarantee that the AER
error registers are always cleared. This is similar to how APEI handles
these errors.

The main aim is that:

  When an interrupt handler deals with a interrupt, it must *always*
  clear the source of the interrupt.

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pci.h  |  13 ++-
 drivers/pci/pcie/aer.c | 249 -
 2 files changed, 184 insertions(+), 78 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 9be7a966fda7..eb88d8bfeaf7 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -424,7 +424,6 @@ static inline bool pci_dev_is_added(const struct pci_dev 
*dev)
 #define AER_MAX_MULTI_ERR_DEVICES  5   /* Not likely to have more */
 
 struct aer_err_info {
-   struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
int error_dev_num;
 
u16 id;
@@ -440,6 +439,18 @@ struct aer_err_info {
struct aer_header_log_regs tlp; /* TLP Header */
 };
 
+/* Preliminary AER error information processed from Root port */
+struct aer_devices_err_info {
+   struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
+   struct aer_err_info err_info;
+};
+
+/* AER information associated with each error device */
+struct aer_dev_err_info {
+   struct pci_dev *dev;
+   struct aer_err_info err_info;
+};
+
 int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info);
 void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
 #endif /* CONFIG_PCIEAER */
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 241ff361b43c..d3937f5384e4 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -36,6 +36,18 @@
 
 #define AER_ERROR_SOURCES_MAX  128
 
+/*
+ * There can be 128 maximum error sources (AER_ERROR_SOURCES_MAX) and each
+ * error source can have maximum of 5 error devices (AER_MAX_MULTI_ERR_DEVICES)
+ * so the maximum error devices we can report is:
+ *
+ * AER_ERROR_DEVICES_MAX = AER_ERROR_SOURCES_MAX * AER_MAX_MULTI_ERR_DEVICES 
== (128 * 5) == 640
+ *
+ * But since, the size in KFIFO should be a power of two, the closest value
+ * to 640 is 1024
+ */
+# define AER_ERROR_DEVICES_MAX 1024
+
 #define AER_MAX_TYPEOF_COR_ERRS16  /* as per 
PCI_ERR_COR_STATUS */
 #define AER_MAX_TYPEOF_UNCOR_ERRS  27  /* as per PCI_ERR_UNCOR_STATUS*/
 
@@ -46,7 +58,7 @@ struct aer_err_source {
 
 struct aer_rpc {
struct pci_dev *rpd;/* Root Port device */
-   DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX);
+   DECLARE_KFIFO(aer_fifo, struct aer_dev_err_info, AER_ERROR_DEVICES_MAX);
 };
 
 /* AER stats for the device */
@@ -803,14 +815,14 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity,
 
 /**
  * add_error_device - list device to be handled
- * @e_info: pointer to error info
+ * @e_dev: pointer to error info
  * @dev: pointer to pci_dev to be added
  */
-static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
+static int add_error_device(struct aer_devices_err_info *e_dev, struct pci_dev 
*dev)
 {
-   if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
-   e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);
-   e_info->error_dev_num++;
+   if (e_dev->err_info.error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
+   e_dev->dev[e_dev->err_info.error_dev_num] = pci_dev_get(dev);
+   e_dev->err_info.error_dev_num++;
return 0;
}
return -ENOSPC;
@@ -877,18 +889,18 @@ static bool is_error_source(struct pci_dev *dev, struct 
aer_err_info *e_info)
 
 static int find_device_iter(struct pci_dev *dev, void *data)
 {
-   struct aer_err_info *e_info = (struct aer_err_info *)data;
+   struct aer_devices_err_info *e_dev = (struct aer_devices_err_info 
*)data;
 
-   if (is_error_source(dev, e_info)) {
+   if (is_error_source(dev, _dev->err_i

[PATCH v5 3/5] PCI/DPC: Initialize info.id in dpc_process_error()

2021-10-25 Thread Naveen Naidu

In the dpc_process_error() path, info.id isn't initialized before being
passed to aer_print_error(). In the corresponding AER path, it is
initialized in aer_isr_one_error().

The error message shown during Coverity Scan is:

  Coverity #1461602
  CID 1461602 (#1 of 1): Uninitialized scalar variable (UNINIT)
  8. uninit_use_in_call: Using uninitialized value info.id when calling 
aer_print_error.

Also Per PCIe r5.0, sec 7.9.15.5, the Source ID is defined only when the
Trigger Reason indicates ERR_NONFATAL or ERR_FATAL. Initialize the
"info.id" based on the trigger reason before passing it to
aer_print_error()

Fixes: 8aefa9b0d910 ("PCI/DPC: Print AER status in DPC event handling")
Signed-off-by: Naveen Naidu 
---
 drivers/pci/pcie/dpc.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index c556e7beafe3..6fa1b1eb4671 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -262,16 +262,24 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev 
*dev,
 
 void dpc_process_error(struct pci_dev *pdev)
 {
-   u16 cap = pdev->dpc_cap, status, source, reason, ext_reason;
+   u16 cap = pdev->dpc_cap, status, reason, ext_reason;
struct aer_err_info info;
 
pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
-   pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, );
+   reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN) >> 1;
+
+   /*
+* Per PCIe r5.0, sec 7.9.15.5, the Source ID is defined only when the
+* Trigger Reason indicates ERR_NONFATAL or ERR_FATAL.
+*/
+   if (reason == 1 || reason == 2)
+   pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, 
);
+   else
+   info.id = 0;
 
pci_info(pdev, "containment event, status:%#06x source:%#06x\n",
-status, source);
+status, info.id);
 
-   reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN) >> 1;
ext_reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT) >> 5;
pci_warn(pdev, "%s detected\n",
 (reason == 0) ? "unmasked uncorrectable error" :
-- 
2.25.1

[PATCH v5 2/5] PCI: Cleanup struct aer_err_info

2021-10-25 Thread Naveen Naidu

The id, status and the mask fields of the struct aer_err_info comes
directly from the registers, hence their sizes should be explicit.

The length of these registers are:
  - id: 16 bits - Represents the Error Source Requester ID
  - status: 32 bits - COR/UNCOR Error Status
  - mask: 32 bits - COR/UNCOR Error Mask

Since the length of the above registers are even, use u16 and u32
to represent their values.

Also remove the __pad fields.

"pahole" was run on the modified struct aer_err_info and the size
remains unchanged.

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pci.h | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 1cce56c2aea0..9be7a966fda7 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -427,18 +427,16 @@ struct aer_err_info {
struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
int error_dev_num;
 
-   unsigned int id:16;
+   u16 id;
 
unsigned int severity:2;/* 0:NONFATAL | 1:FATAL | 2:COR */
-   unsigned int __pad1:5;
unsigned int multi_error_valid:1;
 
unsigned int first_error:5;
-   unsigned int __pad2:2;
unsigned int tlp_header_valid:1;
 
-   unsigned int status;/* COR/UNCOR Error Status */
-   unsigned int mask;  /* COR/UNCOR Error Mask */
+   u32 status; /* COR/UNCOR Error Status */
+   u32 mask;   /* COR/UNCOR Error Mask */
struct aer_header_log_regs tlp; /* TLP Header */
 };
 
-- 
2.25.1

[PATCH v5 1/5] PCI/AER: Remove ID from aer_agent_string[]

2021-10-25 Thread Naveen Naidu

Currently, we do not print the "id" field in the AER error logs. Yet the
aer_agent_string[] has the word "id" in it. The AER error log looks
like:

  pcieport :00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link 
Layer, (Receiver ID)

Without the "id" field in the error log, The aer_agent_string[]
(eg: "Receiver ID") does not make sense. A user reading the
aer_agent_string[] in the log, might inadvertently look for an "id"
field and not finding it might lead to confusion.

Remove the "ID" from the aer_agent_string[].

It is easy to reproduce this by using aer-inject:

  $ aer-inject -s 00:03:0 corr-err-file

The content of the corr-err-file file is as below:

  AER
  COR_STATUS BAD_TLP
  HEADER_LOG 0 1 2 3

The following are sample dummy errors inject via aer-inject.

Before
===

In 010caed4ccb6 ("PCI/AER: Decode Error Source Requester ID"),
the "id" field was removed from the AER error logs, so currently AER
logs look like:

  pcieport :00:03.0: AER: Corrected error received: :00:03:0
  pcieport :00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link 
Layer, (Receiver ID) <--- no id field
  pcieport :00:03.0:   device [1b36:000c] error 
status/mask=0040/e000
  pcieport :00:03.0:[ 6] BadTLP

After
==

  pcieport :00:03.0: AER: Corrected error received: :00:03.0
  pcieport :00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link 
Layer, (Receiver)
  pcieport :00:03.0:   device [1b36:000c] error 
status/mask=0040/e000
  pcieport :00:03.0:[ 6] BadTLP

Link: 
https://lore.kernel.org/linux-pci/20211021170317.GA2700910@bhelgaas/T/#m618bda4e54042d95a1a83fccc01cdb423f7590dc
Signed-off-by: Naveen Naidu 
---
 drivers/pci/pcie/aer.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9784fdcf3006..241ff361b43c 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -516,10 +516,10 @@ static const char *aer_uncorrectable_error_string[] = {
 };
 
 static const char *aer_agent_string[] = {
-   "Receiver ID",
-   "Requester ID",
-   "Completer ID",
-   "Transmitter ID"
+   "Receiver",
+   "Requester",
+   "Completer",
+   "Transmitter"
 };
 
 #define aer_stats_dev_attr(name, stats_array, strings_array,   \
@@ -703,7 +703,7 @@ void aer_print_error(struct pci_dev *dev, struct 
aer_err_info *info)
const char *level;
 
if (!info->status) {
-   pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, 
(Unregistered Agent ID)\n",
+   pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, 
(Unregistered Agent)\n",
aer_error_severity_string[info->severity]);
goto out;
}
-- 
2.25.1

[PATCH v5 0/5] Fix long standing AER Error Handling Issues

2021-10-25 Thread Naveen Naidu

This patch series aims at fixing some of the AER error handling issues
we have.

Currently we have the following issues: 
  
  1. Confusing message in aer_print_error()
  2. aer_err_info not being initialized completely in DPC path before 
 we print the AER logs
  3. A bug [1] in clearing of AER registers in the native AER path

[1] https://lore.kernel.org/linux-pci/20151229155822.GA17321@localhost/

The patch series fixes the above things.

PATCH 1: 
  - Fixes the first issue
  - This patch is independent of other patches and can be applied
seperately

PATCH 2 - 3:
  - Fixes the second issue
  - Patch 3 is depended on Patch 2 in the series

PATCH 4
  - Fixes the bug in clearing of AER registers which leades to
AER message spew [1]

PATCH 5:
  - Adds extra information (devctl register) in AER error logs.
  - Patch 5 depends on Patch 4 of the series

Thanks,
Naveen Naidu

Changelog
=
v5:
- Edit the commit message of Patch 1 and Patch 5 to include how to
  test the AER messages using aer-inject.
- Edit Patch 3 to initialize info.id depending on the trigger
  reason.
- Drop few patches (v4 4/8, 5/8 7/8) since they were wrong.

v4:
- Fix logical error in 6/8, in the previous version of the patch set
  there was a bug, in how I added the devices to the queue.

v3:
- Edit the commit messages to be in imperative style and split the
  commits to be more atomic.

v2:
- Add [PATCH 7] which includes the device control register 
  information in AER error logs.

Naveen Naidu (5):
  [PATCH v5 1/5] PCI/AER: Remove ID from aer_agent_string[]
  [PATCH v5 2/5] PCI: Cleanup struct aer_err_info
  [PATCH v5 3/5] PCI/DPC: Initialize info.id in dpc_process_error()
  [PATCH v5 4/5] PCI/AER: Clear error device AER registers in aer_irq()
  [PATCH v5 5/5] PCI/AER: Include DEVCTL in aer_print_error()

 drivers/pci/pci.h  |  23 +++-
 drivers/pci/pcie/aer.c | 269 -
 drivers/pci/pcie/dpc.c |  16 ++-
 3 files changed, 214 insertions(+), 94 deletions(-)

-- 
2.25.1

Re: [PATCH v4 5/8] PCI/DPC: Converge EDR and DPC Path of clearing AER registers

2021-10-21 Thread Naveen Naidu

On 20/10, Bjorn Helgaas wrote:
> [+cc Keith, Sinan, Oza]
> 
> On Tue, Oct 05, 2021 at 10:48:12PM +0530, Naveen Naidu wrote:
> > In the EDR path, AER registers are cleared *after* DPC error event is
> > processed. The process stack in EDR is:
> > 
> >   edr_handle_event()
> > dpc_process_error()
> > pci_aer_raw_clear_status()
> > pcie_do_recovery()
> > 
> > But in DPC path, AER status registers are cleared *while* processing
> > the error. The process stack in DPC is:
> > 
> >   dpc_handler()
> > dpc_process_error()
> >   pci_aer_clear_status()
> > pcie_do_recovery()
> 
> These are accurate but they both include dpc_process_error(), so we
> need a hint to show why the one here is different from the one in the
> EDR path, e.g.,
> 
>   dpc_handler
> dpc_process_error
>   if (reason == 0)
> pci_aer_clear_status# uncorrectable errors only
> pcie_do_recovery
> 
> > In EDR path, AER status registers are cleared irrespective of whether
> > the error was an RP PIO or unmasked uncorrectable error. But in DPC, the
> > AER status registers are cleared only when it's an unmasked uncorrectable
> > error.
> > 
> > This leads to two different behaviours for the same task (handling of
> > DPC errors) in FFS systems and when native OS has control.
> 
> FFS?
>

Firmware First Systems

> I'd really like to have a specific example of how a user would observe
> this difference.  I know you probably don't have two systems to
> compare like that, but maybe we can work it out manually.
> 

Apologies again! Reading through the code again and the specification, I
realize that my understanding was very incorrect at the time of making
this patch. I grossly oversimplified EDR and DPC when I was learning
about it.

I'll drop this patch when I send the v5 for the series.

Apologies again ^^'

> I guess you're saying the problem is in the native DPC handling, and
> we don't clear the AER status registers for ERR_NONFATAL,
> ERR_NONFATAL, etc., right?
> 

But yes, I did have this question though (I wasn't able to find the
answers to it when reading the spec). Why do we not clear the entire
ERR_NONFATAL and ERR_FATAL registers in the DPC path just like EDR does
using the pci_aer_raw_clear_status() before going to pcie_do_recovery()

I am sure I might have missed something in the spec. I guess I'll
look/re-read these bits again.

Thanks for the review :)

> I think the current behavior is from 8aefa9b0d910 ("PCI/DPC: Print AER
> status in DPC event handling"), where Keith explicitly mentions those
> cases.  The commit log here should connect back to that and explain
> whether something has changed.
> 
> I cc'd Keith and the reviewers of that change in case any of them have
> time to dig into this again.
> 
> > Bring the same semantics for clearing the AER status register in EDR
> > path and DPC path.
> > 
> > Signed-off-by: Naveen Naidu 
> > ---
> >  drivers/pci/pcie/dpc.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> > index faf4a1e77fab..68899a3db126 100644
> > --- a/drivers/pci/pcie/dpc.c
> > +++ b/drivers/pci/pcie/dpc.c
> > @@ -288,7 +288,6 @@ void dpc_process_error(struct pci_dev *pdev)
> >  dpc_get_aer_uncorrect_severity(pdev, ) &&
> >  aer_get_device_error_info(pdev, )) {
> > aer_print_error(pdev, );
> > -   pci_aer_clear_status(pdev);
> > }
> >  }
> >  
> > @@ -297,6 +296,7 @@ static irqreturn_t dpc_handler(int irq, void *context)
> > struct pci_dev *pdev = context;
> >  
> > dpc_process_error(pdev);
> > +   pci_aer_clear_status(pdev);
> >  
> > /* We configure DPC so it only triggers on ERR_FATAL */
> > pcie_do_recovery(pdev, pci_channel_io_frozen, dpc_reset_link);
> > -- 
> > 2.25.1
> > 
> > ___
> > Linux-kernel-mentees mailing list
> > linux-kernel-ment...@lists.linuxfoundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/linux-kernel-mentees

Re: [PATCH v4 4/8] PCI/DPC: Use pci_aer_clear_status() in dpc_process_error()

2021-10-21 Thread Naveen Naidu

On 20/10, Bjorn Helgaas wrote:
> On Tue, Oct 05, 2021 at 10:48:11PM +0530, Naveen Naidu wrote:
> > dpc_process_error() clears both AER fatal and non fatal status
> > registers. Instead of clearing each status registers via a different
> > function call use pci_aer_clear_status().
> > 
> > This helps clean up the code a bit.
> > 
> > Signed-off-by: Naveen Naidu 
> > ---
> >  drivers/pci/pcie/dpc.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> > 
> > diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> > index df3f3a10f8bc..faf4a1e77fab 100644
> > --- a/drivers/pci/pcie/dpc.c
> > +++ b/drivers/pci/pcie/dpc.c
> > @@ -288,8 +288,7 @@ void dpc_process_error(struct pci_dev *pdev)
> >  dpc_get_aer_uncorrect_severity(pdev, ) &&
> >  aer_get_device_error_info(pdev, )) {
> > aer_print_error(pdev, );
> > -   pci_aer_clear_nonfatal_status(pdev);
> > -   pci_aer_clear_fatal_status(pdev);
> > +   pci_aer_clear_status(pdev);
> 
> The commit log suggests that this is a simple cleanup that doesn't
> change any behavior, but that's not quite true:
> 
>   - The new code would clear PCI_ERR_ROOT_STATUS, but the old code
> does not.
> 
>   - The old code masks the status bits with the severity bits before
> clearing, but the new code does not.
> 
> The commit log needs to show why these changes are what we want.
>

Reading through the code again, I realize how wrong(stupid) I was when
making this patch. I was thinking that:

  pci_aer_clear_status() = pci_aer_clear_fatal_status() + 
pci_aer_clear_nonfatal_status()

Now I understand, that it is not at all the case. I apologize for the
mistake. I'll make sure to be meticulous while reading functions and not
just assume their behaviour just from their function names.

I'll drop this patch in the next version of the patch series I make.

Apologies again ^^'

> > }
> >  }
> >  
> > -- 
> > 2.25.1
> > 
> > ___
> > Linux-kernel-mentees mailing list
> > linux-kernel-ment...@lists.linuxfoundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/linux-kernel-mentees

Re: [PATCH v4 3/8] PCI/DPC: Initialize info->id in dpc_process_error()

2021-10-21 Thread Naveen Naidu

On 20/10, Bjorn Helgaas wrote:
> On Tue, Oct 05, 2021 at 10:48:10PM +0530, Naveen Naidu wrote:
> > In the dpc_process_error() path, info->id isn't initialized before being
> > passed to aer_print_error(). In the corresponding AER path, it is
> > initialized in aer_isr_one_error().
> > 
> > The error message shown during Coverity Scan is:
> > 
> >   Coverity #1461602
> >   CID 1461602 (#1 of 1): Uninitialized scalar variable (UNINIT)
> >   8. uninit_use_in_call: Using uninitialized value info.id when calling 
> > aer_print_error.
> > 
> > Initialize the "info->id" before passing it to aer_print_error()
> > 
> > Fixes: 8aefa9b0d910 ("PCI/DPC: Print AER status in DPC event handling")
> > Signed-off-by: Naveen Naidu 
> > ---
> >  drivers/pci/pcie/dpc.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> > index c556e7beafe3..df3f3a10f8bc 100644
> > --- a/drivers/pci/pcie/dpc.c
> > +++ b/drivers/pci/pcie/dpc.c
> > @@ -262,14 +262,14 @@ static int dpc_get_aer_uncorrect_severity(struct 
> > pci_dev *dev,
> >  
> >  void dpc_process_error(struct pci_dev *pdev)
> >  {
> > -   u16 cap = pdev->dpc_cap, status, source, reason, ext_reason;
> > +   u16 cap = pdev->dpc_cap, status, reason, ext_reason;
> > struct aer_err_info info;
> >  
> > pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
> > -   pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, );
> > +   pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, );
> >  
> > pci_info(pdev, "containment event, status:%#06x source:%#06x\n",
> > -status, source);
> > +status, info.id);
> >  
> > reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN) >> 1;
> 
> Per PCIe r5.0, sec 7.9.15.5, the Source ID is defined only when the
> Trigger Reason indicates ERR_NONFATAL or ERR_FATAL.  So I think we
> need to extract this reason before reading PCI_EXP_DPC_SOURCE_ID,
> e.g.,
> 
>   reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN) >> 1;
>   if (reason == 1 || reason == 2)
> pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, );
>   else
> info.id = 0;
>

Thank you for the review, I'll make this change when I send a v5 for the
patch series.

> > ext_reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT) >> 5;
> > -- 
> > 2.25.1
> > 
> > ___
> > Linux-kernel-mentees mailing list
> > linux-kernel-ment...@lists.linuxfoundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/linux-kernel-mentees

Re: [PATCH v4 1/8] PCI/AER: Remove ID from aer_agent_string[]

2021-10-21 Thread Naveen Naidu

On 20/10, Bjorn Helgaas wrote:
> On Tue, Oct 05, 2021 at 10:48:08PM +0530, Naveen Naidu wrote:
> > Currently, we do not print the "id" field in the AER error logs. Yet the
> > aer_agent_string[] has the word "id" in it. The AER error log looks
> > like:
> > 
> >   pcieport :00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link 
> > Layer, (Receiver ID)
> > 
> > Without the "id" field in the error log, The aer_agent_string[]
> > (eg: "Receiver ID") does not make sense. A user reading the
> > aer_agent_string[] in the log, might inadvertently look for an "id"
> > field and not finding it might lead to confusion.
> > 
> > Remove the "ID" from the aer_agent_string[].
> > 
> > The following are sample dummy errors inject via aer-inject.
> 
> I like this, and the problem it fixes was my fault because
> these "ID" strings should have been removed by 010caed4ccb6.
> 
> If it's straightforward enough, it would be nice to have the
> aer-inject command line here in the commit log to make it easier
> for people to play with this.
>

Thank you for the review. Do you mean something like:

The following sample dummy errors are injected via aer-inject via the
following steps:

  1. The steps to compile the aer-inject tool is mentioned in (Section
 4. Software error inject) of the document [1]

 [1]: https://www.kernel.org/doc/Documentation/PCI/pcieaer-howto.txt

 Make sure to place the aer-inject executable at the home directory
 of the qemu system or at any other place.

  2. Emulate a PCIE architecture using qemu, A sample looks like
 following:
 
qemu-system-x86_64 -kernel ../linux/arch/x86_64/boot/bzImage \
-initrd  buildroot-build/images/rootfs.cpio.gz \
-append "console=ttyS0"  \
-enable-kvm -nographic \
-M q35 \
-device pcie-root-port,bus=pcie.0,id=rp1,slot=1 \
-device pcie-pci-bridge,id=br1,bus=rp1 \
-device e1000,bus=br1,addr=8
   
Note that the PCIe features are available only when using the 
'q35' Machine [2]
[2]: https://github.com/qemu/qemu/blob/master/docs/pcie.txt

  3. Once the qemu system starts up, create a sample aer-file or use any
 example aer file from [3]

 [3]:
 
https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git/tree/examples

  4. Inject any aer-error using
  
  ./aer-inject aer-file

This does look a tad bit longer for a commit log so I am unsure if you
would like to have it there. If you are okay with it, I would be happy
to add it to that :)

> > Before
> > ===
> > 
> > In 010caed4ccb6 ("PCI/AER: Decode Error Source Requester ID"),
> > the "id" field was removed from the AER error logs, so currently AER
> > logs look like:
> > 
> >   pcieport :00:03.0: AER: Corrected error received: :00:03:0
> >   pcieport :00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link 
> > Layer, (Receiver ID) <--- no id field
> >   pcieport :00:03.0:   device [1b36:000c] error 
> > status/mask=0040/e000
> >   pcieport :00:03.0:[ 6] BadTLP
> > 
> > After
> > ==
> > 
> >   pcieport :00:03.0: AER: Corrected error received: :00:03.0
> >   pcieport :00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link 
> > Layer, (Receiver)
> >   pcieport :00:03.0:   device [1b36:000c] error 
> > status/mask=0040/e000
> >   pcieport :00:03.0:[ 6] BadTLP
> > 
> > Signed-off-by: Naveen Naidu 
> > ---
> >  drivers/pci/pcie/aer.c | 10 +-
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > index 9784fdcf3006..241ff361b43c 100644
> > --- a/drivers/pci/pcie/aer.c
> > +++ b/drivers/pci/pcie/aer.c
> > @@ -516,10 +516,10 @@ static const char *aer_uncorrectable_error_string[] = 
> > {
> >  };
> >  
> >  static const char *aer_agent_string[] = {
> > -   "Receiver ID",
> > -   "Requester ID",
> > -   "Completer ID",
> > -   "Transmitter ID"
> > +   "Receiver",
> > +   "Requester",
> > +   "Completer",
> > +   "Transmitter"
> >  };
> >  
> >  #define aer_stats_dev_attr(name, stats_array, strings_array,   
> > \
> > @@ -703,7 +703,7 @@ void aer_print_error(struct pci_dev *dev, struct 
> > aer_err_info *info)
> > const char *level;
> >  
> > if (!info->status) {
> > -   pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, 
> > (Unregistered Agent ID)\n",
> > +   pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, 
> > (Unregistered Agent)\n",
> > aer_error_severity_string[info->severity]);
> > goto out;
> > }
> > -- 
> > 2.25.1
> > 
> > ___
> > Linux-kernel-mentees mailing list
> > linux-kernel-ment...@lists.linuxfoundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/linux-kernel-mentees

[PATCH v3 19/25] PCI/DPC: Use RESPONSE_IS_PCI_ERROR() to check read from hardware

2021-10-21 Thread Naveen Naidu

An MMIO read from a PCI device that doesn't exist or doesn't respond
causes a PCI error.  There's no real data to return to satisfy the
CPU read, so most hardware fabricates ~0 data.

Use RESPONSE_IS_PCI_ERROR() to check the response we get when we read
data from hardware.

This helps unify PCI error response checking and make error checks
consistent and easier to find.

Compile tested only.

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pcie/dpc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index c556e7beafe3..4a051a096075 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -79,7 +79,7 @@ static bool dpc_completed(struct pci_dev *pdev)
u16 status;
 
pci_read_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_STATUS, );
-   if ((status != 0x) && (status & PCI_EXP_DPC_STATUS_TRIGGER))
+   if ((!RESPONSE_IS_PCI_ERROR(status)) && (status & 
PCI_EXP_DPC_STATUS_TRIGGER))
return false;
 
if (test_bit(PCI_DPC_RECOVERING, >priv_flags))
@@ -312,7 +312,7 @@ static irqreturn_t dpc_irq(int irq, void *context)
 
pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
 
-   if (!(status & PCI_EXP_DPC_STATUS_INTERRUPT) || status == (u16)(~0))
+   if (!(status & PCI_EXP_DPC_STATUS_INTERRUPT) || 
RESPONSE_IS_PCI_ERROR(status))
return IRQ_NONE;
 
pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
-- 
2.25.1

[PATCH v3 02/25] PCI: Set error response in config access defines when ops->read() fails

2021-10-21 Thread Naveen Naidu

Make PCI_OP_READ and PCI_USER_READ_CONFIG set the data value with error
response (~0), when the PCI device read by a host controller fails.

This ensures that the controller drivers no longer need to fabricate
(~0) value when they detect error. It also  gurantees that the error
response (~0) is always set when the controller drivers fails to read a
config register from a device.

This makes error response fabrication consistent and helps in removal of
a lot of repeated code.

Suggested-by: Rob Herring 
Reviewed-by: Rob Herring 
Signed-off-by: Naveen Naidu 
---
 drivers/pci/access.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index 46935695cfb9..0f732ba2f71a 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -42,7 +42,10 @@ int noinline pci_bus_read_config_##size \
if (PCI_##size##_BAD) return PCIBIOS_BAD_REGISTER_NUMBER;   \
pci_lock_config(flags); \
res = bus->ops->read(bus, devfn, pos, len, );  \
-   *value = (type)data;\
+   if (res)\
+   SET_PCI_ERROR_RESPONSE(value);  \
+   else\
+   *value = (type)data;\
pci_unlock_config(flags);   \
return res; \
 }
@@ -228,7 +231,10 @@ int pci_user_read_config_##size
\
ret = dev->bus->ops->read(dev->bus, dev->devfn, \
pos, sizeof(type), );  \
raw_spin_unlock_irq(_lock); \
-   *val = (type)data;  \
+   if (ret)\
+   SET_PCI_ERROR_RESPONSE(val);\
+   else\
+   *val = (type)data;  \
return pcibios_err_to_errno(ret);   \
 }  \
 EXPORT_SYMBOL_GPL(pci_user_read_config_##size);
-- 
2.25.1

[PATCH v3 01/25] PCI: Add PCI_ERROR_RESPONSE and it's related definitions

2021-10-21 Thread Naveen Naidu

An MMIO read from a PCI device that doesn't exist or doesn't respond
causes a PCI error.  There's no real data to return to satisfy the
CPU read, so most hardware fabricates ~0 data.

Add a PCI_ERROR_RESPONSE definition for that and use it where
appropriate to make these checks consistent and easier to find.

Also add helper definitions SET_PCI_ERROR_RESPONSE and
RESPONSE_IS_PCI_ERROR to make the code more readable.

Suggested-by: Bjorn Helgaas 
Signed-off-by: Naveen Naidu 
---
 include/linux/pci.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/include/linux/pci.h b/include/linux/pci.h
index cd8aa6fce204..689c8277c584 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -154,6 +154,15 @@ enum pci_interrupt_pin {
 /* The number of legacy PCI INTx interrupts */
 #define PCI_NUM_INTX   4
 
+/*
+ * Reading from a device that doesn't respond typically returns ~0.  A
+ * successful read from a device may also return ~0, so you need additional
+ * information to reliably identify errors.
+ */
+#define PCI_ERROR_RESPONSE (~0ULL)
+#define SET_PCI_ERROR_RESPONSE(val)(*(val) = ((typeof(*(val))) 
PCI_ERROR_RESPONSE))
+#define RESPONSE_IS_PCI_ERROR(val) ((val) == ((typeof(val)) 
PCI_ERROR_RESPONSE))
+
 /*
  * pci_power_t values must match the bits in the Capabilities PME_Support
  * and Control/Status PowerState fields in the Power Management capability.
-- 
2.25.1

[PATCH v3 00/25] Unify PCI error response checking

2021-10-21 Thread Naveen Naidu

An MMIO read from a PCI device that doesn't exist or doesn't respond
causes a PCI error.  There's no real data to return to satisfy the 
CPU read, so most hardware fabricates ~0 data.

This patch series adds PCI_ERROR_RESPONSE definition and other helper
definition SET_PCI_ERROR_RESPONSE and RESPONSE_IS_PCI_ERROR and uses it
where appropriate to make these checks consistent and easier to find.

This helps unify PCI error response checking and make error check
consistent and easier to find.

This series also ensures that the error response fabrication now happens
in the PCI_OP_READ and PCI_USER_READ_CONFIG. This removes the
responsibility from controller drivers to do the error response setting. 

Patch 1:
- Adds the PCI_ERROR_RESPONSE and other related defintions
- All other patches are dependent on this patch. This patch needs to
  be applied first, before the others

Patch 2:
- Error fabrication happens in PCI_OP_READ and PCI_USER_READ_CONFIG
  whenever the data read via the controller driver fails.
- This patch needs to be applied before, Patch 4/24 to Patch 15/24 are
  applied.

Patch 3:
- Uses SET_PCI_ERROR_RESPONSE() when device is not found 

Patch 4 - 15:
- Removes redundant error fabrication that happens in controller 
  drivers when the read from a PCI device fails.
- These patches are dependent on Patch 2/24 of the series.
- These can be applied in any order.

Patch 16 - 22:
- Uses RESPONSE_IS_PCI_ERROR() to check the reads from hardware
- Patches can be applied in any order.

Patch 23 - 25:
- Edits the comments to include PCI_ERROR_RESPONSE alsong with
  0x, so that it becomes easier to grep for faulty 
  hardware reads.

Changelog
=
v3:
   - Change RESPONSE_IS_PCI_ERROR macro definition
   - Fix the macros, Add () around macro parameters
   - Fix alignment issue in Patch 2/24
   - Add proper receipients for all the patches

v2:
- Instead of using SET_PCI_ERROR_RESPONSE in all controller drivers
  to fabricate error response, only use them in PCI_OP_READ and
  PCI_USER_READ_CONFIG

Naveen Naidu (25):
  [Patch 1/25] PCI: Add PCI_ERROR_RESPONSE and it's related definitions
  [Patch 2/25] PCI: Set error response in config access defines when 
ops->read() fails
  [Patch 3/25] PCI: Use SET_PCI_ERROR_RESPONSE() when device not found
  [Patch 4/25] PCI: Remove redundant error fabrication when device read fails
  [Patch 5/25] PCI: thunder: Remove redundant error fabrication when device 
read fails
  [Patch 6/25] PCI: iproc: Remove redundant error fabrication when device read 
fails
  [Patch 7/25] PCI: mediatek: Remove redundant error fabrication when device 
read fails
  [Patch 8/25] PCI: exynos: Remove redundant error fabrication when device read 
fails
  [Patch 9/25] PCI: histb: Remove redundant error fabrication when device read 
fails
  [Patch 10/25] PCI: kirin: Remove redundant error fabrication when device read 
fails
  [Patch 11/25] PCI: aardvark: Remove redundant error fabrication when device 
read fails
  [Patch 12/25] PCI: mvebu: Remove redundant error fabrication when device read 
fails
  [Patch 13/25] PCI: altera: Remove redundant error fabrication when device 
read fails
  [Patch 14/25] PCI: rcar: Remove redundant error fabrication when device read 
fails
  [Patch 15/25] PCI: rockchip: Remove redundant error fabrication when device 
read fails
  [Patch 16/25] PCI/ERR: Use RESPONSE_IS_PCI_ERROR() to check read from hardware
  [Patch 17/25] PCI: vmd: Use RESPONSE_IS_PCI_ERROR() to check read from 
hardware
  [Patch 18/25] PCI: pciehp: Use RESPONSE_IS_PCI_ERROR() to check read from 
hardware
  [Patch 19/25] PCI/DPC: Use RESPONSE_IS_PCI_ERROR() to check read from hardware
  [Patch 20/25] PCI/PME: Use RESPONSE_IS_PCI_ERROR() to check read from hardware
  [Patch 21/25] PCI: cpqphp: Use RESPONSE_IS_PCI_ERROR() to check read from 
hardware
  [Patch 22/25] PCI: Use PCI_ERROR_RESPONSE to specify hardware error
  [Patch 23/25] PCI: keystone: Use PCI_ERROR_RESPONSE to specify hardware error
  [Patch 24/25] PCI: hv: Use PCI_ERROR_RESPONSE to specify hardware read error
  [Patch 25/25] PCI: xgene: Use PCI_ERROR_RESPONSE to specify hardware error

 drivers/pci/access.c| 32 +++---
 drivers/pci/controller/dwc/pci-exynos.c |  4 +-
 drivers/pci/controller/dwc/pci-keystone.c   |  4 +-
 drivers/pci/controller/dwc/pcie-histb.c |  4 +-
 drivers/pci/controller/dwc/pcie-kirin.c |  4 +-
 drivers/pci/controller/pci-aardvark.c   | 10 +
 drivers/pci/controller/pci-hyperv.c |  2 +-
 drivers/pci/controller/pci-mvebu.c  |  8 +---
 drivers/pci/controller/pci-thunder-ecam.c   | 46 +++--
 drivers/pci/controller/pci-thunder-pem.c|  4 +-
 drivers/pci/controller/pci-xgene.c  |  8 ++--
 drivers/pci/controller/pcie-altera.c|  4 +-
 drivers/pci/controller/pcie-iproc.c |  4 +-
 drivers/pci/controller/pcie-mediatek.c  |

[PATCH v2 19/24] PCI/DPC: Use RESPONSE_IS_PCI_ERROR() to check read from hardware

2021-10-15 Thread Naveen Naidu

An MMIO read from a PCI device that doesn't exist or doesn't respond
causes a PCI error.  There's no real data to return to satisfy the
CPU read, so most hardware fabricates ~0 data.

Use RESPONSE_IS_PCI_ERROR() to check the response we get when we read
data from hardware.

This helps unify PCI error response checking and make error checks
consistent and easier to find.

Compile tested only.

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pcie/dpc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index c556e7beafe3..561c44d9429c 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -79,7 +79,7 @@ static bool dpc_completed(struct pci_dev *pdev)
u16 status;
 
pci_read_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_STATUS, );
-   if ((status != 0x) && (status & PCI_EXP_DPC_STATUS_TRIGGER))
+   if ((!RESPONSE_IS_PCI_ERROR()) && (status & 
PCI_EXP_DPC_STATUS_TRIGGER))
return false;
 
if (test_bit(PCI_DPC_RECOVERING, >priv_flags))
@@ -312,7 +312,7 @@ static irqreturn_t dpc_irq(int irq, void *context)
 
pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
 
-   if (!(status & PCI_EXP_DPC_STATUS_INTERRUPT) || status == (u16)(~0))
+   if (!(status & PCI_EXP_DPC_STATUS_INTERRUPT) || 
RESPONSE_IS_PCI_ERROR())
return IRQ_NONE;
 
pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
-- 
2.25.1

Re: [PATCH v2 00/24] Unify PCI error response checking

2021-10-15 Thread Naveen Naidu

On 15/10, Naveen Naidu wrote:
> An MMIO read from a PCI device that doesn't exist or doesn't respond
> causes a PCI error.  There's no real data to return to satisfy the 
> CPU read, so most hardware fabricates ~0 data.
> 
> This patch series adds PCI_ERROR_RESPONSE definition and other helper
> definition SET_PCI_ERROR_RESPONSE and RESPONSE_IS_PCI_ERROR and uses it
> where appropriate to make these checks consistent and easier to find.
> 
> This helps unify PCI error response checking and make error check
> consistent and easier to find.
> 
> This series also ensures that the error response fabrication now happens
> in the PCI_OP_READ and PCI_USER_READ_CONFIG. This removes the
> responsibility from controller drivers to do the error response setting. 
> 
> Patch 1:
> - Adds the PCI_ERROR_RESPONSE and other related defintions
> - All other patches are dependent on this patch. This patch needs to
>   be applied first, before the others
> 
> Patch 2:
> - Error fabrication happens in PCI_OP_READ and PCI_USER_READ_CONFIG
>   whenever the data read via the controller driver fails.
> - This patch needs to be applied before, Patch 4/24 to Patch 15/24 are
>   applied.
> 
> Patch 3:
> - Uses SET_PCI_ERROR_RESPONSE() when device is not found and
>   RESPONSE_IS_PCI_ERROR() to check hardware read from the hardware.
> 
> Patch 4 - 15:
> - Removes redundant error fabrication that happens in controller 
>   drivers when the read from a PCI device fails.
> - These patches are dependent on Patch 2/24 of the series.
> - These can be applied in any order.
> 
> Patch 16 - 22:
> - Uses RESPONSE_IS_PCI_ERROR() to check the reads from hardware
> - Patches can be applied in any order.
> 
> Patch 23 - 24:
> - Edits the comments to include PCI_ERROR_RESPONSE alsong with
>   0x, so that it becomes easier to grep for faulty 
>   hardware reads.
> 
> Changelog
> =
> 
> v2:
> - Instead of using SET_PCI_ERROR_RESPONSE in all controller drivers
>   to fabricate error response, only use them in PCI_OP_READ and
>   PCI_USER_READ_CONFIG
> 
> Naveen Naidu (24):
>  [PATCH 1/24] PCI: Add PCI_ERROR_RESPONSE and it's related definitions
>  [PATCH 2/24] PCI: Set error response in config access defines when 
> ops->read() fails
>  [PATCH 3/24] PCI: Unify PCI error response checking
>  [PATCH 4/24] PCI: Remove redundant error fabrication when device read fails
>  [PATCH 5/24] PCI: thunder: Remove redundant error fabrication when device 
> read fails
>  [PATCH 6/24] PCI: iproc: Remove redundant error fabrication when device read 
> fails
>  [PATCH 7/24] PCI: mediatek: Remove redundant error fabrication when device 
> read fails
>  [PATCH 8/24] PCI: exynos: Remove redundant error fabrication when device 
> read fails
>  [PATCH 9/24] PCI: histb: Remove redundant error fabrication when device read 
> fails
>  [PATCH 10/24] PCI: kirin: Remove redundant error fabrication when device 
> read fails
>  [PATCH 11/24] PCI: aardvark: Remove redundant error fabrication when device 
> read fails
>  [PATCH 12/24] PCI: mvebu: Remove redundant error fabrication when device 
> read fails
>  [PATCH 13/24] PCI: altera: Remove redundant error fabrication when device 
> read fails
>  [PATCH 14/24] PCI: rcar: Remove redundant error fabrication when device read 
> fails
>  [PATCH 15/24] PCI: rockchip: Remove redundant error fabrication when device 
> read fails
>  [PATCH 16/24] PCI/ERR: Use RESPONSE_IS_PCI_ERROR() to check read from 
> hardware
>  [PATCH 17/24] PCI: vmd: Use RESPONSE_IS_PCI_ERROR() to check read from 
> hardware
>  [PATCH 18/24] PCI: pciehp: Use RESPONSE_IS_PCI_ERROR() to check read from 
> hardware
>  [PATCH 19/24] PCI/DPC: Use RESPONSE_IS_PCI_ERROR() to check read from 
> hardware
>  [PATCH 20/24] PCI/PME: Use RESPONSE_IS_PCI_ERROR() to check read from 
> hardware
>  [PATCH 21/24] PCI: cpqphp: Use RESPONSE_IS_PCI_ERROR() to check read from 
> hardware
>  [PATCH 22/24] PCI: keystone: Use PCI_ERROR_RESPONSE to specify hardware error
>  [PATCH 23/24] PCI: hv: Use PCI_ERROR_RESPONSE to specify hardware read error
>  [PATCH 24/24] PCI: xgene: Use PCI_ERROR_RESPONSE to specify hardware error
> 
>  drivers/pci/access.c| 36 
>  drivers/pci/controller/dwc/pci-exynos.c |  4 +-
>  drivers/pci/controller/dwc/pci-keystone.c   |  4 +-
>  drivers/pci/controller/dwc/pcie-histb.c |  4 +-
>  drivers/pci/controller/dwc/pcie-kirin.c |  4 +-
>  drivers/pci/controller/pci-aardvark.c   | 10 +
>  drivers/pci/controller/pci-hyperv.c |  2 +-
>  drivers/pci/controller/pci-mvebu.c  |

[PATCH v2 00/24] Unify PCI error response checking

2021-10-15 Thread Naveen Naidu

An MMIO read from a PCI device that doesn't exist or doesn't respond
causes a PCI error.  There's no real data to return to satisfy the 
CPU read, so most hardware fabricates ~0 data.

This patch series adds PCI_ERROR_RESPONSE definition and other helper
definition SET_PCI_ERROR_RESPONSE and RESPONSE_IS_PCI_ERROR and uses it
where appropriate to make these checks consistent and easier to find.

This helps unify PCI error response checking and make error check
consistent and easier to find.

This series also ensures that the error response fabrication now happens
in the PCI_OP_READ and PCI_USER_READ_CONFIG. This removes the
responsibility from controller drivers to do the error response setting. 

Patch 1:
- Adds the PCI_ERROR_RESPONSE and other related defintions
- All other patches are dependent on this patch. This patch needs to
  be applied first, before the others

Patch 2:
- Error fabrication happens in PCI_OP_READ and PCI_USER_READ_CONFIG
  whenever the data read via the controller driver fails.
- This patch needs to be applied before, Patch 4/24 to Patch 15/24 are
  applied.

Patch 3:
- Uses SET_PCI_ERROR_RESPONSE() when device is not found and
  RESPONSE_IS_PCI_ERROR() to check hardware read from the hardware.

Patch 4 - 15:
- Removes redundant error fabrication that happens in controller 
  drivers when the read from a PCI device fails.
- These patches are dependent on Patch 2/24 of the series.
- These can be applied in any order.

Patch 16 - 22:
- Uses RESPONSE_IS_PCI_ERROR() to check the reads from hardware
- Patches can be applied in any order.

Patch 23 - 24:
- Edits the comments to include PCI_ERROR_RESPONSE alsong with
  0x, so that it becomes easier to grep for faulty 
  hardware reads.

Changelog
=

v2:
- Instead of using SET_PCI_ERROR_RESPONSE in all controller drivers
  to fabricate error response, only use them in PCI_OP_READ and
  PCI_USER_READ_CONFIG

Naveen Naidu (24):
 [PATCH 1/24] PCI: Add PCI_ERROR_RESPONSE and it's related definitions
 [PATCH 2/24] PCI: Set error response in config access defines when ops->read() 
fails
 [PATCH 3/24] PCI: Unify PCI error response checking
 [PATCH 4/24] PCI: Remove redundant error fabrication when device read fails
 [PATCH 5/24] PCI: thunder: Remove redundant error fabrication when device read 
fails
 [PATCH 6/24] PCI: iproc: Remove redundant error fabrication when device read 
fails
 [PATCH 7/24] PCI: mediatek: Remove redundant error fabrication when device 
read fails
 [PATCH 8/24] PCI: exynos: Remove redundant error fabrication when device read 
fails
 [PATCH 9/24] PCI: histb: Remove redundant error fabrication when device read 
fails
 [PATCH 10/24] PCI: kirin: Remove redundant error fabrication when device read 
fails
 [PATCH 11/24] PCI: aardvark: Remove redundant error fabrication when device 
read fails
 [PATCH 12/24] PCI: mvebu: Remove redundant error fabrication when device read 
fails
 [PATCH 13/24] PCI: altera: Remove redundant error fabrication when device read 
fails
 [PATCH 14/24] PCI: rcar: Remove redundant error fabrication when device read 
fails
 [PATCH 15/24] PCI: rockchip: Remove redundant error fabrication when device 
read fails
 [PATCH 16/24] PCI/ERR: Use RESPONSE_IS_PCI_ERROR() to check read from hardware
 [PATCH 17/24] PCI: vmd: Use RESPONSE_IS_PCI_ERROR() to check read from hardware
 [PATCH 18/24] PCI: pciehp: Use RESPONSE_IS_PCI_ERROR() to check read from 
hardware
 [PATCH 19/24] PCI/DPC: Use RESPONSE_IS_PCI_ERROR() to check read from hardware
 [PATCH 20/24] PCI/PME: Use RESPONSE_IS_PCI_ERROR() to check read from hardware
 [PATCH 21/24] PCI: cpqphp: Use RESPONSE_IS_PCI_ERROR() to check read from 
hardware
 [PATCH 22/24] PCI: keystone: Use PCI_ERROR_RESPONSE to specify hardware error
 [PATCH 23/24] PCI: hv: Use PCI_ERROR_RESPONSE to specify hardware read error
 [PATCH 24/24] PCI: xgene: Use PCI_ERROR_RESPONSE to specify hardware error

 drivers/pci/access.c| 36 
 drivers/pci/controller/dwc/pci-exynos.c |  4 +-
 drivers/pci/controller/dwc/pci-keystone.c   |  4 +-
 drivers/pci/controller/dwc/pcie-histb.c |  4 +-
 drivers/pci/controller/dwc/pcie-kirin.c |  4 +-
 drivers/pci/controller/pci-aardvark.c   | 10 +
 drivers/pci/controller/pci-hyperv.c |  2 +-
 drivers/pci/controller/pci-mvebu.c  |  8 +---
 drivers/pci/controller/pci-thunder-ecam.c   | 46 +++--
 drivers/pci/controller/pci-thunder-pem.c|  4 +-
 drivers/pci/controller/pci-xgene.c  |  8 ++--
 drivers/pci/controller/pcie-altera.c|  4 +-
 drivers/pci/controller/pcie-iproc.c |  4 +-
 drivers/pci/controller/pcie-mediatek.c  | 11 +
 drivers/pci/controller/pcie-rcar-host.c |  4 +-
 drivers/pci/controller/pcie-rockchip-host.c |  4 +-
 drivers/pci/controller/vmd.c|  2 +-
 drivers/pci/hotplug/cpqphp_ctrl.c   |  4 +-
 driv

[PATCH v2 00/24] Unify PCI error response checking

2021-10-15 Thread Naveen Naidu

An MMIO read from a PCI device that doesn't exist or doesn't respond
causes a PCI error.  There's no real data to return to satisfy the 
CPU read, so most hardware fabricates ~0 data.

This patch series adds PCI_ERROR_RESPONSE definition and other helper
definition SET_PCI_ERROR_RESPONSE and RESPONSE_IS_PCI_ERROR and uses it
where appropriate to make these checks consistent and easier to find.

This helps unify PCI error response checking and make error check
consistent and easier to find.

This series also ensures that the error response fabrication now happens
in the PCI_OP_READ and PCI_USER_READ_CONFIG. This removes the
responsibility from controller drivers to do the error response setting. 

Patch 1:
- Adds the PCI_ERROR_RESPONSE and other related defintions
- All other patches are dependent on this patch. This patch needs to
  be applied first, before the others

Patch 2:
- Error fabrication happens in PCI_OP_READ and PCI_USER_READ_CONFIG
  whenever the data read via the controller driver fails.
- This patch needs to be applied before, Patch 4/24 to Patch 15/24 are
  applied.

Patch 3:
- Uses SET_PCI_ERROR_RESPONSE() when device is not found and
  RESPONSE_IS_PCI_ERROR() to check hardware read from the hardware.

Patch 4 - 15:
- Removes redundant error fabrication that happens in controller 
  drivers when the read from a PCI device fails.
- These patches are dependent on Patch 2/24 of the series.
- These can be applied in any order.

Patch 16 - 22:
- Uses RESPONSE_IS_PCI_ERROR() to check the reads from hardware
- Patches can be applied in any order.

Patch 23 - 24:
- Edits the comments to include PCI_ERROR_RESPONSE alsong with
  0x, so that it becomes easier to grep for faulty 
  hardware reads.

Changelog
=

v2:
- Instead of using SET_PCI_ERROR_RESPONSE in all controller drivers
  to fabricate error response, only use them in PCI_OP_READ and
  PCI_USER_READ_CONFIG

Naveen Naidu (24):
 [PATCH 1/24] PCI: Add PCI_ERROR_RESPONSE and it's related definitions
 [PATCH 2/24] PCI: Set error response in config access defines when ops->read() 
fails
 [PATCH 3/24] PCI: Unify PCI error response checking
 [PATCH 4/24] PCI: Remove redundant error fabrication when device read fails
 [PATCH 5/24] PCI: thunder: Remove redundant error fabrication when device read 
fails
 [PATCH 6/24] PCI: iproc: Remove redundant error fabrication when device read 
fails
 [PATCH 7/24] PCI: mediatek: Remove redundant error fabrication when device 
read fails
 [PATCH 8/24] PCI: exynos: Remove redundant error fabrication when device read 
fails
 [PATCH 9/24] PCI: histb: Remove redundant error fabrication when device read 
fails
 [PATCH 10/24] PCI: kirin: Remove redundant error fabrication when device read 
fails
 [PATCH 11/24] PCI: aardvark: Remove redundant error fabrication when device 
read fails
 [PATCH 12/24] PCI: mvebu: Remove redundant error fabrication when device read 
fails
 [PATCH 13/24] PCI: altera: Remove redundant error fabrication when device read 
fails
 [PATCH 14/24] PCI: rcar: Remove redundant error fabrication when device read 
fails
 [PATCH 15/24] PCI: rockchip: Remove redundant error fabrication when device 
read fails
 [PATCH 16/24] PCI/ERR: Use RESPONSE_IS_PCI_ERROR() to check read from hardware
 [PATCH 17/24] PCI: vmd: Use RESPONSE_IS_PCI_ERROR() to check read from hardware
 [PATCH 18/24] PCI: pciehp: Use RESPONSE_IS_PCI_ERROR() to check read from 
hardware
 [PATCH 19/24] PCI/DPC: Use RESPONSE_IS_PCI_ERROR() to check read from hardware
 [PATCH 20/24] PCI/PME: Use RESPONSE_IS_PCI_ERROR() to check read from hardware
 [PATCH 21/24] PCI: cpqphp: Use RESPONSE_IS_PCI_ERROR() to check read from 
hardware
 [PATCH 22/24] PCI: keystone: Use PCI_ERROR_RESPONSE to specify hardware error
 [PATCH 23/24] PCI: hv: Use PCI_ERROR_RESPONSE to specify hardware read error
 [PATCH 24/24] PCI: xgene: Use PCI_ERROR_RESPONSE to specify hardware error

 drivers/pci/access.c| 36 
 drivers/pci/controller/dwc/pci-exynos.c |  4 +-
 drivers/pci/controller/dwc/pci-keystone.c   |  4 +-
 drivers/pci/controller/dwc/pcie-histb.c |  4 +-
 drivers/pci/controller/dwc/pcie-kirin.c |  4 +-
 drivers/pci/controller/pci-aardvark.c   | 10 +
 drivers/pci/controller/pci-hyperv.c |  2 +-
 drivers/pci/controller/pci-mvebu.c  |  8 +---
 drivers/pci/controller/pci-thunder-ecam.c   | 46 +++--
 drivers/pci/controller/pci-thunder-pem.c|  4 +-
 drivers/pci/controller/pci-xgene.c  |  8 ++--
 drivers/pci/controller/pcie-altera.c|  4 +-
 drivers/pci/controller/pcie-iproc.c |  4 +-
 drivers/pci/controller/pcie-mediatek.c  | 11 +
 drivers/pci/controller/pcie-rcar-host.c |  4 +-
 drivers/pci/controller/pcie-rockchip-host.c |  4 +-
 drivers/pci/controller/vmd.c|  2 +-
 drivers/pci/hotplug/cpqphp_ctrl.c   |  4 +-
 driv

[PATCH 17/22] PCI/DPC: Use RESPONSE_IS_PCI_ERROR() to check read from hardware

2021-10-11 Thread Naveen Naidu

An MMIO read from a PCI device that doesn't exist or doesn't respond
causes a PCI error.  There's no real data to return to satisfy the
CPU read, so most hardware fabricates ~0 data.

Use RESPONSE_IS_PCI_ERROR() to check the response we get when we read
data from hardware.

This helps unify PCI error response checking and make error checks
consistent and easier to find.

Compile tested only.

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pcie/dpc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index c556e7beafe3..561c44d9429c 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -79,7 +79,7 @@ static bool dpc_completed(struct pci_dev *pdev)
u16 status;
 
pci_read_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_STATUS, );
-   if ((status != 0x) && (status & PCI_EXP_DPC_STATUS_TRIGGER))
+   if ((!RESPONSE_IS_PCI_ERROR()) && (status & 
PCI_EXP_DPC_STATUS_TRIGGER))
return false;
 
if (test_bit(PCI_DPC_RECOVERING, >priv_flags))
@@ -312,7 +312,7 @@ static irqreturn_t dpc_irq(int irq, void *context)
 
pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
 
-   if (!(status & PCI_EXP_DPC_STATUS_INTERRUPT) || status == (u16)(~0))
+   if (!(status & PCI_EXP_DPC_STATUS_INTERRUPT) || 
RESPONSE_IS_PCI_ERROR())
return IRQ_NONE;
 
pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
-- 
2.25.1

[PATCH 01/22] PCI: Add PCI_ERROR_RESPONSE and it's related defintions

2021-10-11 Thread Naveen Naidu

An MMIO read from a PCI device that doesn't exist or doesn't respond
causes a PCI error.  There's no real data to return to satisfy the
CPU read, so most hardware fabricates ~0 data.

Add a PCI_ERROR_RESPONSE definition for that and use it where
appropriate to make these checks consistent and easier to find.

Also add helper definitions SET_PCI_ERROR_RESPONSE and
RESPONSE_IS_PCI_ERROR to make the code more readable.

Signed-off-by: Naveen Naidu 
---
 include/linux/pci.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/include/linux/pci.h b/include/linux/pci.h
index cd8aa6fce204..928c589bb5c4 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -154,6 +154,15 @@ enum pci_interrupt_pin {
 /* The number of legacy PCI INTx interrupts */
 #define PCI_NUM_INTX   4
 
+/*
+ * Reading from a device that doesn't respond typically returns ~0.  A
+ * successful read from a device may also return ~0, so you need additional
+ * information to reliably identify errors.
+ */
+#define PCI_ERROR_RESPONSE (~0ULL)
+#define SET_PCI_ERROR_RESPONSE(val)(*val = ((typeof(*val)) 
PCI_ERROR_RESPONSE))
+#define RESPONSE_IS_PCI_ERROR(val) (*val == ((typeof(*val)) 
PCI_ERROR_RESPONSE))
+
 /*
  * pci_power_t values must match the bits in the Capabilities PME_Support
  * and Control/Status PowerState fields in the Power Management capability.
-- 
2.25.1

[PATCH 00/22] PCI: Unify PCI error response checking

2021-10-11 Thread Naveen Naidu

An MMIO read from a PCI device that doesn't exist or doesn't respond
causes a PCI error.  There's no real data to return to satisfy the 
CPU read, so most hardware fabricates ~0 data.

This patch series adds PCI_ERROR_RESPONSE definition and other helper
defintion SET_PCI_ERROR_RESPONSE and RESPONSE_IS_PCI_ERROR and uses it
where appropriate to make these checks consistent and easier to find.

This helps unify PCI error response checking and make error check
consistent and easier to find.

Patch 1:
  - Adds the PCI_ERROR_RESPONSE and other related defintions
  - All other patches are dependent on this patch. This patch needs to
be applied first, before the others

Patch 2 - 13
  - Uses SET_PCI_ERROR_RESPONSE() when device is not found

Patch 14 - 19
  - Uses RESPONSE_IS_PCI_ERROR() to check the reads from hardware

Patch 20 - 22
  - Edits the comments to include PCI_ERROR_RESPONSE alsong with
0x, so that it becomes easier to grep for faulty hardware
reads.

Thanks,
Naveen

Naveen Naidu (22):
  [PATCH 1/22] PCI: Add PCI_ERROR_RESPONSE and it's related defintions
  [PATCH 2/22] PCI: Unify PCI error response checking
  [PATCH 3/22] PCI: thunder: Use SET_PCI_ERROR_RESPONSE() when device not found
  [PATCH 4/22] PCI: iproc: Use SET_PCI_ERROR_RESPONSE() when device not found
  [PATCH 5/22] PCI: mediatek: Use SET_PCI_ERROR_RESPONSE() when device not found
  [PATCH 6/22] PCI: exynos: Use SET_PCI_ERROR_RESPONSE() when device not found
  [PATCH 7/22] PCI: histb: Use SET_PCI_ERROR_RESPONSE() when device not found
  [PATCH 8/22] PCI: kirin: Use SET_PCI_ERROR_RESPONSE() when device not found
  [PATCH 9/22] PCI: aardvark: Use SET_PCI_ERROR_RESPONSE() when device not found
  [PATCH 10/22] PCI: mvebu: Use SET_PCI_ERROR_RESPONSE() when device not found
  [PATCH 11/22] PCI: altera: Use SET_PCI_ERROR_RESPONSE() when device not found
  [PATCH 12/22] PCI: rcar: Use SET_PCI_ERROR_RESPONSE() when device not found
  [PATCH 13/22] PCI: rockchip: Use SET_PCI_ERROR_RESPONSE() when device not 
found
  [PATCH 14/22] PCI/ERR: Use RESPONSE_IS_PCI_ERROR() to check read from hardware
  [PATCH 15/22] PCI: vmd: Use RESPONSE_IS_PCI_ERROR() to check read from 
hardware
  [PATCH 16/22] PCI: pciehp: Use RESPONSE_IS_PCI_ERROR() to check read from 
hardware
  [PATCH 17/22] PCI/DPC: Use RESPONSE_IS_PCI_ERROR() to check read from hardware
  [PATCH 18/22] PCI/PME: Use RESPONSE_IS_PCI_ERROR() to check read from hardware
  [PATCH 19/22] PCI: cpqphp: Use RESPONSE_IS_PCI_ERROR() to check read from 
hardware
  [PATCH 20/22] PCI: keystone: Use PCI_ERROR_RESPONSE to specify hardware error
  [PATCH 21/22] PCI: hv: Use PCI_ERROR_RESPONSE to specify hardware read error
  [PATCH 22/22] PCI: xgene: Use PCI_ERROR_RESPONSE to specify hardware error

 drivers/pci/access.c| 22 ++---
 drivers/pci/controller/dwc/pci-exynos.c |  2 +-
 drivers/pci/controller/dwc/pci-keystone.c   |  4 ++--
 drivers/pci/controller/dwc/pcie-histb.c |  2 +-
 drivers/pci/controller/dwc/pcie-kirin.c |  2 +-
 drivers/pci/controller/pci-aardvark.c   |  8 
 drivers/pci/controller/pci-hyperv.c |  2 +-
 drivers/pci/controller/pci-mvebu.c  |  4 ++--
 drivers/pci/controller/pci-thunder-ecam.c   | 20 +--
 drivers/pci/controller/pci-thunder-pem.c|  2 +-
 drivers/pci/controller/pci-xgene.c  |  8 
 drivers/pci/controller/pcie-altera.c|  2 +-
 drivers/pci/controller/pcie-iproc.c |  2 +-
 drivers/pci/controller/pcie-mediatek.c  |  4 ++--
 drivers/pci/controller/pcie-rcar-host.c |  2 +-
 drivers/pci/controller/pcie-rockchip-host.c |  2 +-
 drivers/pci/controller/vmd.c|  2 +-
 drivers/pci/hotplug/cpqphp_ctrl.c   |  4 ++--
 drivers/pci/hotplug/pciehp_hpc.c| 10 +-
 drivers/pci/pci.c   | 10 +-
 drivers/pci/pcie/dpc.c  |  4 ++--
 drivers/pci/pcie/pme.c  |  4 ++--
 drivers/pci/probe.c | 10 +-
 include/linux/pci.h |  9 +
 24 files changed, 75 insertions(+), 66 deletions(-)

-- 
2.25.1

[PATCH v4 8/8] PCI/AER: Include DEVCTL in aer_print_error()

2021-10-05 Thread Naveen Naidu

Print the contents of Device Control Register of the device which
detected the error. This might help in faster error diagnosis.

Sample output from dummy error injected by aer-inject:

  pcieport :00:03.0: AER: Corrected error received: :00:03.0
  pcieport :00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link 
Layer, (Receiver)
  pcieport :00:03.0:   device [1b36:000c] error 
status/mask=0040/e000, devctl=0x000f <-- devctl added to the error log
  pcieport :00:03.0:[ 6] BadTLP

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pci.h  |  2 ++
 drivers/pci/pcie/aer.c | 10 --
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index eb88d8bfeaf7..48ed7f91113b 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -437,6 +437,8 @@ struct aer_err_info {
u32 status; /* COR/UNCOR Error Status */
u32 mask;   /* COR/UNCOR Error Mask */
struct aer_header_log_regs tlp; /* TLP Header */
+
+   u16 devctl;
 };
 
 /* Preliminary AER error information processed from Root port */
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index d3937f5384e4..fdeef9deb016 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -729,8 +729,8 @@ void aer_print_error(struct pci_dev *dev, struct 
aer_err_info *info)
   aer_error_severity_string[info->severity],
   aer_error_layer[layer], aer_agent_string[agent]);
 
-   pci_printk(level, dev, "  device [%04x:%04x] error 
status/mask=%08x/%08x\n",
-  dev->vendor, dev->device, info->status, info->mask);
+   pci_printk(level, dev, "  device [%04x:%04x] error 
status/mask=%08x/%08x, devctl=%#06x\n",
+  dev->vendor, dev->device, info->status, info->mask, 
info->devctl);
 
__aer_print_error(dev, info);
 
@@ -1083,6 +1083,12 @@ int aer_get_device_error_info(struct pci_dev *dev, 
struct aer_err_info *info)
if (!aer)
return 0;
 
+   /*
+* Cache the value of Device Control Register now, because later the
+* device might not be available
+*/
+   pcie_capability_read_word(dev, PCI_EXP_DEVCTL, >devctl);
+
if (info->severity == AER_CORRECTABLE) {
pci_read_config_dword(dev, aer + PCI_ERR_COR_STATUS,
>status);
-- 
2.25.1

[PATCH v4 7/8] PCI/ERR: Remove redundant clearing of AER register in pcie_do_recovery()

2021-10-05 Thread Naveen Naidu

pcie_do_recovery() is shared across the following paths:
 - ACPI APEI
 - Native AER path
 - EDR
 - DPC

ACPI APEI
==

  ghes_handle_aer()
aer_recover_queue()
  kfifo_in_spinlocked(aer_recover_ring)

  aer_recover_work_func()
while (kfifo_get(aer_recover_ring))
  pcie_do_recovery()

In this path the system firmware clears the AER registers before
handing off the record to the OS in ghes_handle_aer()

Native AER
==

 aer_irq()
   aer_add_err_devices_to_queue()
 kfifo_put(>aer_fifo, *e_dev)
 clear_error_source_aer_registers()   < AER registers are cleard

 aer_isr()
   aer_isr_one_error()
handle_error_source()
  pcie_do_recovery()

The AER registers are cleared during the handling of IRQ, i.e before we
the recovery starts.

DPC
=

  dpc_handler()
dpc_process_error()
pci_aer_clear_status()   < AER registers are cleared
pcie_do_recovery()

EDR


  edr_handle_event()
dpc_process_error()
pci_aer_raw_clear_status()  < AER registers are cleared
pcie_do_recovery()

In all the above paths, the AER registers are cleared before
pcie_do_recovery(). The non fatal status AER registers are again cleared
in pcie_do_recovery(). This is redundant.

Remove redundant clearing of AER register in pcie_do_recovery()

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pcie/err.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index b576aa890c76..fe04b0ae22f4 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -231,14 +231,11 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 
/*
 * If we have native control of AER, clear error status in the device
-* that detected the error.  If the platform retained control of AER,
-* it is responsible for clearing this status.  In that case, the
-* signaling device may not even be visible to the OS.
+* that detected the error.
 */
-   if (host->native_aer || pcie_ports_native) {
+   if (host->native_aer || pcie_ports_native)
pcie_clear_device_status(dev);
-   pci_aer_clear_nonfatal_status(dev);
-   }
+
pci_info(bridge, "device recovery successful\n");
return status;
 
-- 
2.25.1

[PATCH v4 6/8] PCI/AER: Clear error device AER registers in aer_irq()

2021-10-05 Thread Naveen Naidu

Converge the APEI path and native AER path of clearing the AER registers
of the error device.

In APEI path, the system firmware clears the AER registers before
handing off the record to OS. But in "native AER" path, the execution
path of clearing the AER register is as follows:

  aer_isr_one_error
aer_print_port_info
  if (find_source_device())
aer_process_err_devices
  handle_error_source
pci_write_config_dword(dev, PCI_ERR_COR_STATUS, ...)

The above path has a bug, if the find_source_device() fails, AER
registers are not cleared from the error device. This means, the error
device will keep reporting the error again and again and would lead
to message spew.

Related Bug Report:
  https://lore.kernel.org/linux-pci/20151229155822.GA17321@localhost/
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173

The above bug could be avoided, if the AER registers are cleared during
the AER IRQ handler aer_irq(), which would provide guarantee that the AER
error registers are always cleared. This is similar to how APEI handles
these errors.

The main aim is that:

  When an interrupt handler deals with a interrupt, it must *always*
  clear the source of the interrupt.

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pci.h  |  13 ++-
 drivers/pci/pcie/aer.c | 249 -
 2 files changed, 184 insertions(+), 78 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 9be7a966fda7..eb88d8bfeaf7 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -424,7 +424,6 @@ static inline bool pci_dev_is_added(const struct pci_dev 
*dev)
 #define AER_MAX_MULTI_ERR_DEVICES  5   /* Not likely to have more */
 
 struct aer_err_info {
-   struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
int error_dev_num;
 
u16 id;
@@ -440,6 +439,18 @@ struct aer_err_info {
struct aer_header_log_regs tlp; /* TLP Header */
 };
 
+/* Preliminary AER error information processed from Root port */
+struct aer_devices_err_info {
+   struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
+   struct aer_err_info err_info;
+};
+
+/* AER information associated with each error device */
+struct aer_dev_err_info {
+   struct pci_dev *dev;
+   struct aer_err_info err_info;
+};
+
 int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info);
 void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
 #endif /* CONFIG_PCIEAER */
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 241ff361b43c..d3937f5384e4 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -36,6 +36,18 @@
 
 #define AER_ERROR_SOURCES_MAX  128
 
+/*
+ * There can be 128 maximum error sources (AER_ERROR_SOURCES_MAX) and each
+ * error source can have maximum of 5 error devices (AER_MAX_MULTI_ERR_DEVICES)
+ * so the maximum error devices we can report is:
+ *
+ * AER_ERROR_DEVICES_MAX = AER_ERROR_SOURCES_MAX * AER_MAX_MULTI_ERR_DEVICES 
== (128 * 5) == 640
+ *
+ * But since, the size in KFIFO should be a power of two, the closest value
+ * to 640 is 1024
+ */
+# define AER_ERROR_DEVICES_MAX 1024
+
 #define AER_MAX_TYPEOF_COR_ERRS16  /* as per 
PCI_ERR_COR_STATUS */
 #define AER_MAX_TYPEOF_UNCOR_ERRS  27  /* as per PCI_ERR_UNCOR_STATUS*/
 
@@ -46,7 +58,7 @@ struct aer_err_source {
 
 struct aer_rpc {
struct pci_dev *rpd;/* Root Port device */
-   DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX);
+   DECLARE_KFIFO(aer_fifo, struct aer_dev_err_info, AER_ERROR_DEVICES_MAX);
 };
 
 /* AER stats for the device */
@@ -803,14 +815,14 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity,
 
 /**
  * add_error_device - list device to be handled
- * @e_info: pointer to error info
+ * @e_dev: pointer to error info
  * @dev: pointer to pci_dev to be added
  */
-static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
+static int add_error_device(struct aer_devices_err_info *e_dev, struct pci_dev 
*dev)
 {
-   if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
-   e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);
-   e_info->error_dev_num++;
+   if (e_dev->err_info.error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
+   e_dev->dev[e_dev->err_info.error_dev_num] = pci_dev_get(dev);
+   e_dev->err_info.error_dev_num++;
return 0;
}
return -ENOSPC;
@@ -877,18 +889,18 @@ static bool is_error_source(struct pci_dev *dev, struct 
aer_err_info *e_info)
 
 static int find_device_iter(struct pci_dev *dev, void *data)
 {
-   struct aer_err_info *e_info = (struct aer_err_info *)data;
+   struct aer_devices_err_info *e_dev = (struct aer_devices_err_info 
*)data;
 
-   if (is_error_source(dev, e_info)) {
+   if (is_error_source(dev, _dev->err_i

[PATCH v4 5/8] PCI/DPC: Converge EDR and DPC Path of clearing AER registers

2021-10-05 Thread Naveen Naidu

In the EDR path, AER registers are cleared *after* DPC error event is
processed. The process stack in EDR is:

  edr_handle_event()
dpc_process_error()
pci_aer_raw_clear_status()
pcie_do_recovery()

But in DPC path, AER status registers are cleared *while* processing
the error. The process stack in DPC is:

  dpc_handler()
dpc_process_error()
  pci_aer_clear_status()
pcie_do_recovery()

In EDR path, AER status registers are cleared irrespective of whether
the error was an RP PIO or unmasked uncorrectable error. But in DPC, the
AER status registers are cleared only when it's an unmasked uncorrectable
error.

This leads to two different behaviours for the same task (handling of
DPC errors) in FFS systems and when native OS has control.

Bring the same semantics for clearing the AER status register in EDR
path and DPC path.

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pcie/dpc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index faf4a1e77fab..68899a3db126 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -288,7 +288,6 @@ void dpc_process_error(struct pci_dev *pdev)
 dpc_get_aer_uncorrect_severity(pdev, ) &&
 aer_get_device_error_info(pdev, )) {
aer_print_error(pdev, );
-   pci_aer_clear_status(pdev);
}
 }
 
@@ -297,6 +296,7 @@ static irqreturn_t dpc_handler(int irq, void *context)
struct pci_dev *pdev = context;
 
dpc_process_error(pdev);
+   pci_aer_clear_status(pdev);
 
/* We configure DPC so it only triggers on ERR_FATAL */
pcie_do_recovery(pdev, pci_channel_io_frozen, dpc_reset_link);
-- 
2.25.1

[PATCH v4 4/8] PCI/DPC: Use pci_aer_clear_status() in dpc_process_error()

2021-10-05 Thread Naveen Naidu

dpc_process_error() clears both AER fatal and non fatal status
registers. Instead of clearing each status registers via a different
function call use pci_aer_clear_status().

This helps clean up the code a bit.

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pcie/dpc.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index df3f3a10f8bc..faf4a1e77fab 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -288,8 +288,7 @@ void dpc_process_error(struct pci_dev *pdev)
 dpc_get_aer_uncorrect_severity(pdev, ) &&
 aer_get_device_error_info(pdev, )) {
aer_print_error(pdev, );
-   pci_aer_clear_nonfatal_status(pdev);
-   pci_aer_clear_fatal_status(pdev);
+   pci_aer_clear_status(pdev);
}
 }
 
-- 
2.25.1

[PATCH v4 3/8] PCI/DPC: Initialize info->id in dpc_process_error()

2021-10-05 Thread Naveen Naidu

In the dpc_process_error() path, info->id isn't initialized before being
passed to aer_print_error(). In the corresponding AER path, it is
initialized in aer_isr_one_error().

The error message shown during Coverity Scan is:

  Coverity #1461602
  CID 1461602 (#1 of 1): Uninitialized scalar variable (UNINIT)
  8. uninit_use_in_call: Using uninitialized value info.id when calling 
aer_print_error.

Initialize the "info->id" before passing it to aer_print_error()

Fixes: 8aefa9b0d910 ("PCI/DPC: Print AER status in DPC event handling")
Signed-off-by: Naveen Naidu 
---
 drivers/pci/pcie/dpc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index c556e7beafe3..df3f3a10f8bc 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -262,14 +262,14 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev 
*dev,
 
 void dpc_process_error(struct pci_dev *pdev)
 {
-   u16 cap = pdev->dpc_cap, status, source, reason, ext_reason;
+   u16 cap = pdev->dpc_cap, status, reason, ext_reason;
struct aer_err_info info;
 
pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
-   pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, );
+   pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, );
 
pci_info(pdev, "containment event, status:%#06x source:%#06x\n",
-status, source);
+status, info.id);
 
reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN) >> 1;
ext_reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT) >> 5;
-- 
2.25.1

[PATCH v4 2/8] PCI: Cleanup struct aer_err_info

2021-10-05 Thread Naveen Naidu

The id, status and the mask fields of the struct aer_err_info comes
directly from the registers, hence their sizes should be explicit.

The length of these registers are:
  - id: 16 bits - Represents the Error Source Requester ID
  - status: 32 bits - COR/UNCOR Error Status
  - mask: 32 bits - COR/UNCOR Error Mask

Since the length of the above registers are even, use u16 and u32
to represent their values.

Also remove the __pad fields.

"pahole" was run on the modified struct aer_err_info and the size
remains unchanged.

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pci.h | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 1cce56c2aea0..9be7a966fda7 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -427,18 +427,16 @@ struct aer_err_info {
struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
int error_dev_num;
 
-   unsigned int id:16;
+   u16 id;
 
unsigned int severity:2;/* 0:NONFATAL | 1:FATAL | 2:COR */
-   unsigned int __pad1:5;
unsigned int multi_error_valid:1;
 
unsigned int first_error:5;
-   unsigned int __pad2:2;
unsigned int tlp_header_valid:1;
 
-   unsigned int status;/* COR/UNCOR Error Status */
-   unsigned int mask;  /* COR/UNCOR Error Mask */
+   u32 status; /* COR/UNCOR Error Status */
+   u32 mask;   /* COR/UNCOR Error Mask */
struct aer_header_log_regs tlp; /* TLP Header */
 };
 
-- 
2.25.1

[PATCH v4 1/8] PCI/AER: Remove ID from aer_agent_string[]

2021-10-05 Thread Naveen Naidu

Currently, we do not print the "id" field in the AER error logs. Yet the
aer_agent_string[] has the word "id" in it. The AER error log looks
like:

  pcieport :00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link 
Layer, (Receiver ID)

Without the "id" field in the error log, The aer_agent_string[]
(eg: "Receiver ID") does not make sense. A user reading the
aer_agent_string[] in the log, might inadvertently look for an "id"
field and not finding it might lead to confusion.

Remove the "ID" from the aer_agent_string[].

The following are sample dummy errors inject via aer-inject.

Before
===

In 010caed4ccb6 ("PCI/AER: Decode Error Source Requester ID"),
the "id" field was removed from the AER error logs, so currently AER
logs look like:

  pcieport :00:03.0: AER: Corrected error received: :00:03:0
  pcieport :00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link 
Layer, (Receiver ID) <--- no id field
  pcieport :00:03.0:   device [1b36:000c] error 
status/mask=0040/e000
  pcieport :00:03.0:[ 6] BadTLP

After
==

  pcieport :00:03.0: AER: Corrected error received: :00:03.0
  pcieport :00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link 
Layer, (Receiver)
  pcieport :00:03.0:   device [1b36:000c] error 
status/mask=00000040/e000
  pcieport :00:03.0:[ 6] BadTLP

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pcie/aer.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9784fdcf3006..241ff361b43c 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -516,10 +516,10 @@ static const char *aer_uncorrectable_error_string[] = {
 };
 
 static const char *aer_agent_string[] = {
-   "Receiver ID",
-   "Requester ID",
-   "Completer ID",
-   "Transmitter ID"
+   "Receiver",
+   "Requester",
+   "Completer",
+   "Transmitter"
 };
 
 #define aer_stats_dev_attr(name, stats_array, strings_array,   \
@@ -703,7 +703,7 @@ void aer_print_error(struct pci_dev *dev, struct 
aer_err_info *info)
const char *level;
 
if (!info->status) {
-   pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, 
(Unregistered Agent ID)\n",
+   pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, 
(Unregistered Agent)\n",
aer_error_severity_string[info->severity]);
goto out;
}
-- 
2.25.1

[PATCH v4 0/8] Fix long standing AER Error Handling Issues

2021-10-05 Thread Naveen Naidu

This patch series aims at fixing some of the AER error handling issues
we have.

Currently we have the following issues:
 - Confusing message in aer_print_error()
 - aer_err_info not being initialized completely in DPC path before
   we print the AER logs
 - A bug [1] in clearing of AER registers in the native AER path

[1] https://lore.kernel.org/linux-pci/20151229155822.GA17321@localhost/

The primary aim of this patch series is to converge the APEI path and the
native AER error handling paths. In our current code, we find that we
have two different behaviours (especially when it comes to clearing of
the AER registers) for the same functionality.

This patch series, tries to bring the same semantics and hence more
commonanlity between the APEI part of code and the native OS
handling of AER errors.

PATCH 1:
  - Fixes the first issue

PATCH 2 - 4:
  - Fixes the second issue
  - "Patch 3/8" is dependent on "Patch 2/8" in the series

PATCH 5 - 7
  - Deals with converging the various paths and brings more
commonality between them
  - "Patch 6/8" depends on "Patch 1/8"

PATCH 8:
  -  Adds extra information in AER error logs.

Thanks,
Naveen Naidu

Changelog
=

v4:
  - Implement review comments
  - Make "Patch 1/8" commit message more meaningful
  - Fix the code comment error detected by kernel test robot 
in "Patch 6/8"

v2 and v3:
  - Fix up mail formatting and include the appropriate receipients for
the patch.

Naveen Naidu (8):
 [PATCH v4 1/8] PCI/AER: Remove ID from aer_agent_string[]
 [PATCH v4 2/8] PCI: Cleanup struct aer_err_info
 [PATCH v4 3/8] PCI/DPC: Initialize info->id in dpc_process_error()
 [PATCH v4 4/8] PCI/DPC: Use pci_aer_clear_status() in dpc_process_error()
 [PATCH v4 5/8] PCI/DPC: Converge EDR and DPC Path of clearing AER registers
 [PATCH v4 6/8] PCI/AER: Clear error device AER registers in aer_irq()
 [PATCH v4 7/8] PCI/ERR: Remove redundant clearing of AER register in 
pcie_do_recovery()
 [PATCH v4 8/8] PCI/AER: Include DEVCTL in aer_print_error()

 drivers/pci/pci.h  |  23 +++-
 drivers/pci/pcie/aer.c | 269 -
 drivers/pci/pcie/dpc.c |   9 +-
 drivers/pci/pcie/err.c |   9 +-
 4 files changed, 209 insertions(+), 101 deletions(-)

-- 
2.25.1

[PATCH 1/6] PCI/AER: Enable COR/UNCOR error reporting in set_device_error_reporting()

2021-10-04 Thread Naveen Naidu

The (PCIe r5.0, sec 7.6.4.3, Table 7-101) and  (PCIe r5.0, sec 7.8.4.6,
Table 7-104) states that the default values for the Uncorrectable Error
Mask and Correctable Error Mask should be 0b. But the current code does
not set the default value of these registers when the PCIe bus loads the
AER service driver.

Enable reporting of all correctable and uncorrectable errors during
aer_probe()

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pcie/aer.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9784fdcf3006..88c4ca6098fb 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1212,6 +1212,7 @@ static int set_device_error_reporting(struct pci_dev 
*dev, void *data)
 {
bool enable = *((bool *)data);
int type = pci_pcie_type(dev);
+   int aer = dev->aer_cap;
 
if ((type == PCI_EXP_TYPE_ROOT_PORT) ||
(type == PCI_EXP_TYPE_RC_EC) ||
@@ -1223,8 +1224,18 @@ static int set_device_error_reporting(struct pci_dev 
*dev, void *data)
pci_disable_pcie_error_reporting(dev);
}
 
-   if (enable)
+   if (enable) {
+
+   /* Enable reporting of all uncorrectable errors */
+   /* Uncorrectable Error Mask - turned on bits disable errors */
+   pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, 0);
+
+   /* Enable reporting of all correctable errors */
+   /* Correctable Error Mask - turned on bits disable errors */
+   pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, 0);
+
pcie_set_ecrc_checking(dev);
+   }
 
return 0;
 }
-- 
2.25.1

[PATCH 0/6] MIPS: OCTEON: Remove redundant AER code

2021-10-04 Thread Naveen Naidu

e8635b484f64 ("MIPS: Add Cavium OCTEON PCI support.") added MIPS
specific code to enable PCIe and AER error reporting (*irrespective
of CONFIG_PCIEAER value*) because PCI core didn't do that at the time.

But currently, the PCI core clears and enables the AER status registers.
So it's redundant for octeon code to do so. This patch series removes
the redundant code from the pci-octeon.c

Currently, the correctable and uncorrectable AER mask registers are not
set to their default value when AER service driver is loaded. This
defect is also fixed in the "[PATCH 1/6]" in the series.

Please note that "Patch 4/6" is dependent on "Patch 1/6".

Thanks,
Naveen Naidu

Naveen Naidu (6):
 [PATCH 1/6] PCI/AER: Enable COR/UNCOR error reporting in 
set_device_error_reporting()
 [PATCH 2/6] MIPS: OCTEON: Remove redundant clearing of AER status registers
 [PATCH 3/6] MIPS: OCTEON: Remove redundant enable of PCIe normal error 
reporting
 [PATCH 4/6] MIPS: OCTEON: Remove redundant enable of COR/UNCOR error
 [PATCH 5/6] MIPS: OCTEON: Remove redundant ECRC Generation Enable
 [PATCH 6/6] MIPS: OCTEON: Remove redundant enable of RP error reporting

 arch/mips/pci/pci-octeon.c | 50 --
 drivers/pci/pcie/aer.c | 13 +-
 2 files changed, 12 insertions(+), 51 deletions(-)

-- 
2.25.1

[PATCH v3 8/8] PCI/AER: Include DEVCTL in aer_print_error()

2021-10-04 Thread Naveen Naidu

Print the contents of Device Control Register of the device which
detected the error. This might help in faster error diagnosis.

Sample output from dummy error injected by aer-inject:

  pcieport :00:03.0: AER: Corrected error received: :00:03.0
  pcieport :00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link 
Layer, (Receiver)
  pcieport :00:03.0:   device [1b36:000c] error 
status/mask=0040/e000, devctl=0x000f
  pcieport :00:03.0:[ 6] BadTLP

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pci.h  |  2 ++
 drivers/pci/pcie/aer.c | 10 --
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index eb88d8bfeaf7..48ed7f91113b 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -437,6 +437,8 @@ struct aer_err_info {
u32 status; /* COR/UNCOR Error Status */
u32 mask;   /* COR/UNCOR Error Mask */
struct aer_header_log_regs tlp; /* TLP Header */
+
+   u16 devctl;
 };
 
 /* Preliminary AER error information processed from Root port */
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 91f91d6ab052..42cae01b6887 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -729,8 +729,8 @@ void aer_print_error(struct pci_dev *dev, struct 
aer_err_info *info)
   aer_error_severity_string[info->severity],
   aer_error_layer[layer], aer_agent_string[agent]);
 
-   pci_printk(level, dev, "  device [%04x:%04x] error 
status/mask=%08x/%08x\n",
-  dev->vendor, dev->device, info->status, info->mask);
+   pci_printk(level, dev, "  device [%04x:%04x] error 
status/mask=%08x/%08x, devctl=%#06x\n",
+  dev->vendor, dev->device, info->status, info->mask, 
info->devctl);
 
__aer_print_error(dev, info);
 
@@ -1083,6 +1083,12 @@ int aer_get_device_error_info(struct pci_dev *dev, 
struct aer_err_info *info)
if (!aer)
return 0;
 
+   /*
+* Cache the value of Device Control Register now, because later the
+* device might not be available
+*/
+   pcie_capability_read_word(dev, PCI_EXP_DEVCTL, >devctl);
+
if (info->severity == AER_CORRECTABLE) {
pci_read_config_dword(dev, aer + PCI_ERR_COR_STATUS,
>status);
-- 
2.25.1

[PATCH v3 7/8] PCI/ERR: Remove redundant clearing of AER register in pcie_do_recovery()

2021-10-04 Thread Naveen Naidu

pcie_do_recovery() is shared across the following paths:
 - ACPI APEI
 - Native AER path
 - EDR
 - DPC

ACPI APEI
==

  ghes_handle_aer()
aer_recover_queue()
  kfifo_in_spinlocked(aer_recover_ring)

  aer_recover_work_func()
while (kfifo_get(aer_recover_ring))
  pcie_do_recovery()

In this path the system firmware clears the AER registers before
handing off the record to the OS in ghes_handle_aer()

Native AER
==

 aer_irq()
   aer_add_err_devices_to_queue()
 kfifo_put(>aer_fifo, *e_dev)
 clear_error_source_aer_registers()   < AER registers are cleard

 aer_isr()
   aer_isr_one_error()
handle_error_source()
  pcie_do_recovery()

The AER registers are cleared during the handling of IRQ, i.e before we
the recovery starts.

DPC
=

  dpc_handler()
dpc_process_error()
pci_aer_clear_status()   < AER registers are cleared
pcie_do_recovery()

EDR


  edr_handle_event()
dpc_process_error()
pci_aer_raw_clear_status()  < AER registers are cleared
pcie_do_recovery()

In all the above paths, the AER registers are cleared before
pcie_do_recovery(). The non fatal status AER registers are again cleared
in pcie_do_recovery(). This is redundant.

Remove redundant clearing of AER register in pcie_do_recovery()

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pcie/err.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index b576aa890c76..fe04b0ae22f4 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -231,14 +231,11 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 
/*
 * If we have native control of AER, clear error status in the device
-* that detected the error.  If the platform retained control of AER,
-* it is responsible for clearing this status.  In that case, the
-* signaling device may not even be visible to the OS.
+* that detected the error.
 */
-   if (host->native_aer || pcie_ports_native) {
+   if (host->native_aer || pcie_ports_native)
pcie_clear_device_status(dev);
-   pci_aer_clear_nonfatal_status(dev);
-   }
+
pci_info(bridge, "device recovery successful\n");
return status;
 
-- 
2.25.1

[PATCH v3 6/8] PCI/AER: Clear error device AER registers in aer_irq()

2021-10-04 Thread Naveen Naidu

Converge the APEI path and native AER path of clearing the AER registers
of the error device.

In APEI path, the system firmware clears the AER registers before
handing off the record to OS. But in "native AER" path, the execution
path of clearing the AER register is as follows:

  aer_isr_one_error
aer_print_port_info
  if (find_source_device())
aer_process_err_devices
  handle_error_source
pci_write_config_dword(dev, PCI_ERR_COR_STATUS, ...)

The above path has a bug, if the find_source_device() fails, AER
registers are not cleared from the error device. This means, the error
device will keep reporting the error again and again and would lead
to message spew.

Related Bug Report:
  https://lore.kernel.org/linux-pci/20151229155822.GA17321@localhost/
  https://bugzilla.kernel.org/show_bug.cgi?id=109691
  https://bugzilla.kernel.org/show_bug.cgi?id=109691
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173

The above bug could be avoided, if the AER registers are cleared during
the AER IRQ handler aer_irq(), which would provide guarantee that the AER
error registers are always cleared. This is similar to how APEI handles
these errors.

The main aim is that:

  When an interrupt handler deals with a interrupt, it must *always*
  clear the source of the interrupt.

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pci.h  |  13 ++-
 drivers/pci/pcie/aer.c | 245 -
 2 files changed, 182 insertions(+), 76 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 9be7a966fda7..eb88d8bfeaf7 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -424,7 +424,6 @@ static inline bool pci_dev_is_added(const struct pci_dev 
*dev)
 #define AER_MAX_MULTI_ERR_DEVICES  5   /* Not likely to have more */
 
 struct aer_err_info {
-   struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
int error_dev_num;
 
u16 id;
@@ -440,6 +439,18 @@ struct aer_err_info {
struct aer_header_log_regs tlp; /* TLP Header */
 };
 
+/* Preliminary AER error information processed from Root port */
+struct aer_devices_err_info {
+   struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
+   struct aer_err_info err_info;
+};
+
+/* AER information associated with each error device */
+struct aer_dev_err_info {
+   struct pci_dev *dev;
+   struct aer_err_info err_info;
+};
+
 int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info);
 void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
 #endif /* CONFIG_PCIEAER */
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 241ff361b43c..91f91d6ab052 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -36,6 +36,18 @@
 
 #define AER_ERROR_SOURCES_MAX  128
 
+/*
+ * There can be 128 maximum error sources (AER_ERROR_SOURCES_MAX) and each
+ * error source can have maximum of 5 error devices (AER_MAX_MULTI_ERR_DEVICES)
+ * so the maximum error devices we can report is:
+ *
+ * AER_ERROR_DEVICES_MAX = AER_ERROR_SOURCES_MAX * AER_MAX_MULTI_ERR_DEVICES 
== (128 * 5) == 640
+ *
+ * But since, the size in KFIFO should be a power of two, the closest value
+ * to 640 is 1024
+ */
+# define AER_ERROR_DEVICES_MAX 1024
+
 #define AER_MAX_TYPEOF_COR_ERRS16  /* as per 
PCI_ERR_COR_STATUS */
 #define AER_MAX_TYPEOF_UNCOR_ERRS  27  /* as per PCI_ERR_UNCOR_STATUS*/
 
@@ -46,7 +58,7 @@ struct aer_err_source {
 
 struct aer_rpc {
struct pci_dev *rpd;/* Root Port device */
-   DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX);
+   DECLARE_KFIFO(aer_fifo, struct aer_dev_err_info, AER_ERROR_DEVICES_MAX);
 };
 
 /* AER stats for the device */
@@ -806,11 +818,11 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity,
  * @e_info: pointer to error info
  * @dev: pointer to pci_dev to be added
  */
-static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
+static int add_error_device(struct aer_devices_err_info *e_dev, struct pci_dev 
*dev)
 {
-   if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
-   e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);
-   e_info->error_dev_num++;
+   if (e_dev->err_info.error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
+   e_dev->dev[e_dev->err_info.error_dev_num] = pci_dev_get(dev);
+   e_dev->err_info.error_dev_num++;
return 0;
}
return -ENOSPC;
@@ -877,18 +889,18 @@ static bool is_error_source(struct pci_dev *dev, struct 
aer_err_info *e_info)
 
 static int find_device_iter(struct pci_dev *dev, void *data)
 {
-   struct aer_err_info *e_info = (struct aer_err_info *)data;
+   struct aer_devices_err_info *e_dev = (struct aer_devices_err_info 
*)data;
 
-   if (is_error_source(dev, e_info)) {
+   if (is_error_source(dev, _d

[PATCH v3 5/8] PCI/DPC: Converge EDR and DPC Path of clearing AER registers

2021-10-04 Thread Naveen Naidu

In the EDR path, AER registers are cleared *after* DPC error event is
processed. The process stack in EDR is:

  edr_handle_event()
dpc_process_error()
pci_aer_raw_clear_status()
pcie_do_recovery()

But in DPC path, AER status registers are cleared *while* processing
the error. The process stack in DPC is:

  dpc_handler()
dpc_process_error()
  pci_aer_clear_status()
pcie_do_recovery()

In EDR path, AER status registers are cleared irrespective of whether
the error was an RP PIO or unmasked uncorrectable error. But in DPC, the
AER status registers are cleared only when it's an unmasked uncorrectable
error.

This leads to two different behaviours for the same task (handling of
DPC errors) in FFS systems and when native OS has control.

Bring the same semantics for clearing the AER status register in EDR
path and DPC path.

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pcie/dpc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index faf4a1e77fab..68899a3db126 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -288,7 +288,6 @@ void dpc_process_error(struct pci_dev *pdev)
 dpc_get_aer_uncorrect_severity(pdev, ) &&
 aer_get_device_error_info(pdev, )) {
aer_print_error(pdev, );
-   pci_aer_clear_status(pdev);
}
 }
 
@@ -297,6 +296,7 @@ static irqreturn_t dpc_handler(int irq, void *context)
struct pci_dev *pdev = context;
 
dpc_process_error(pdev);
+   pci_aer_clear_status(pdev);
 
/* We configure DPC so it only triggers on ERR_FATAL */
pcie_do_recovery(pdev, pci_channel_io_frozen, dpc_reset_link);
-- 
2.25.1

[PATCH v3 4/8] PCI/DPC: Use pci_aer_clear_status() in dpc_process_error()

2021-10-04 Thread Naveen Naidu

dpc_process_error() clears both AER fatal and non fatal status
registers. Instead of clearing each status registers via a different
function call use pci_aer_clear_status().

This helps clean up the code a bit.

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pcie/dpc.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index df3f3a10f8bc..faf4a1e77fab 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -288,8 +288,7 @@ void dpc_process_error(struct pci_dev *pdev)
 dpc_get_aer_uncorrect_severity(pdev, ) &&
 aer_get_device_error_info(pdev, )) {
aer_print_error(pdev, );
-   pci_aer_clear_nonfatal_status(pdev);
-   pci_aer_clear_fatal_status(pdev);
+   pci_aer_clear_status(pdev);
}
 }
 
-- 
2.25.1

[PATCH v3 3/8] PCI/DPC: Initialize info->id in dpc_process_error()

2021-10-04 Thread Naveen Naidu

In the dpc_process_error() path, info->id isn't initialized before being
passed to aer_print_error(). In the corresponding AER path, it is
initialized in aer_isr_one_error().

The error message shown during Coverity Scan is:

  Coverity #1461602
  CID 1461602 (#1 of 1): Uninitialized scalar variable (UNINIT)
  8. uninit_use_in_call: Using uninitialized value info.id when calling 
aer_print_error.

Initialize the "info->id" before passing it to aer_print_error()

Fixes: 8aefa9b0d910 ("PCI/DPC: Print AER status in DPC event handling")
Signed-off-by: Naveen Naidu 
---
 drivers/pci/pcie/dpc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index c556e7beafe3..df3f3a10f8bc 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -262,14 +262,14 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev 
*dev,
 
 void dpc_process_error(struct pci_dev *pdev)
 {
-   u16 cap = pdev->dpc_cap, status, source, reason, ext_reason;
+   u16 cap = pdev->dpc_cap, status, reason, ext_reason;
struct aer_err_info info;
 
pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
-   pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, );
+   pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, );
 
pci_info(pdev, "containment event, status:%#06x source:%#06x\n",
-status, source);
+status, info.id);
 
reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN) >> 1;
ext_reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT) >> 5;
-- 
2.25.1

[PATCH v3 2/8] PCI: Cleanup struct aer_err_info

2021-10-04 Thread Naveen Naidu

The id, status and the mask fields of the struct aer_err_info comes
directly from the registers, hence their sizes should be explicit.

The length of these registers are:
  - id: 16 bits - Represents the Error Source Requester ID
  - status: 32 bits - COR/UNCOR Error Status
  - mask: 32 bits - COR/UNCOR Error Mask

Since the length of the above registers are even, use u16 and u32
to represent their values.

Also remove the __pad fields.

"pahole" was run on the modified struct aer_err_info and the size
remains unchanged.

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pci.h | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 1cce56c2aea0..9be7a966fda7 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -427,18 +427,16 @@ struct aer_err_info {
struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
int error_dev_num;
 
-   unsigned int id:16;
+   u16 id;
 
unsigned int severity:2;/* 0:NONFATAL | 1:FATAL | 2:COR */
-   unsigned int __pad1:5;
unsigned int multi_error_valid:1;
 
unsigned int first_error:5;
-   unsigned int __pad2:2;
unsigned int tlp_header_valid:1;
 
-   unsigned int status;/* COR/UNCOR Error Status */
-   unsigned int mask;  /* COR/UNCOR Error Mask */
+   u32 status; /* COR/UNCOR Error Status */
+   u32 mask;   /* COR/UNCOR Error Mask */
struct aer_header_log_regs tlp; /* TLP Header */
 };
 
-- 
2.25.1

[PATCH v3 1/8] PCI/AER: Remove ID from aer_agent_string[]

2021-10-04 Thread Naveen Naidu

Before 010caed4ccb6 ("PCI/AER: Decode Error Source RequesterID")
the AER error logs looked like:

  pcieport :00:03.0: AER: Corrected error received: id=0018
  pcieport :00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link 
Layer, id=0018 (Receiver ID)
  pcieport :00:03.0:   device [1b36:000c] error 
status/mask=0040/e000
  pcieport :00:03.0:[ 6] BadTLP

In 010caed4ccb6 ("PCI/AER: Decode Error Source Requester ID"),
the "id" field was removed from the AER error logs, so currently AER
logs look like:

  pcieport :00:03.0: AER: Corrected error received: :00:03:0
  pcieport :00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link 
Layer, (Receiver ID)
  pcieport :00:03.0:   device [1b36:000c] error 
status/mask=0040/e000
  pcieport :00:03.0:[ 6] BadTLP

The second line in the above logs prints "(Receiver ID)", even when
there is no "id" in the log line. This is confusing.

Remove the "ID" from the aer_agent_string[]. The error logs will
look as follows (Sample from dummy error injected by aer-inject):

  pcieport :00:03.0: AER: Corrected error received: :00:03.0
  pcieport :00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link 
Layer, (Receiver)
  pcieport :00:03.0:   device [1b36:000c] error 
status/mask=0040/e000
  pcieport :00:03.0:[ 6] BadTLP

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pcie/aer.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9784fdcf3006..241ff361b43c 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -516,10 +516,10 @@ static const char *aer_uncorrectable_error_string[] = {
 };
 
 static const char *aer_agent_string[] = {
-   "Receiver ID",
-   "Requester ID",
-   "Completer ID",
-   "Transmitter ID"
+   "Receiver",
+   "Requester",
+   "Completer",
+   "Transmitter"
 };
 
 #define aer_stats_dev_attr(name, stats_array, strings_array,   \
@@ -703,7 +703,7 @@ void aer_print_error(struct pci_dev *dev, struct 
aer_err_info *info)
const char *level;
 
if (!info->status) {
-   pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, 
(Unregistered Agent ID)\n",
+   pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, 
(Unregistered Agent)\n",
aer_error_severity_string[info->severity]);
goto out;
}
-- 
2.25.1

[PATCH v3 0/8] Fix long standing AER Error Handling Issues

2021-10-04 Thread Naveen Naidu

This patch series aims at fixing some of the AER error handling issues
we have.

Currently we have the following issues:
 - Confusing message in aer_print_error()
 - aer_err_info not being initialized completely in DPC path before
   we print the AER logs
 - A bug [1] in clearing of AER registers in the native AER path

[1] https://lore.kernel.org/linux-pci/20151229155822.GA17321@localhost/

The primary aim of this patch series is to converge the APEI path and the
native AER error handling paths. In our current code, we find that we
have two different behaviours (especially when it comes to clearing of
the AER registers) for the same functionality.

This patch series, tries to bring the same semantics and hence more
commonanlity between the APEI part of code and the native OS
handling of AER errors.

PATCH 1:
  - Fixes the first issue

PATCH 2 - 4:
  - Fixes the second issue
  - "Patch 3/8" is dependent on "Patch 2/3" in the series

PATCH 5 - 7
  - Deals with converging the various paths and to bring more
commonality between them
  - "Patch 6/8" depends on "Patch 1/8"

PATCH 8:
  -  Adds extra information in AER error logs.

Thanks,
Naveen Naidu

Changelog
=

v3:
 - Fix up mail formatting and resend the patches again.
   Really sorry for all the spam. I messed up in the first try and
   instead of fixing it well in v2, I messed up again. I have fixed
   everything now. Apologies for the inconvenience caused. I'll make
   sure to not repeat it again.

v2:
  - Apologies for the mistake, I forgot to cc the linux-pci mailing 
list.Resent the email with cc to linux-pci

Naveen Naidu (8):
 [PATCH v3 1/8] PCI/AER: Remove ID from aer_agent_string[]
 [PATCH v3 2/8] PCI: Cleanup struct aer_err_info
 [PATCH v3 3/8] PCI/DPC: Initialize info->id in dpc_process_error()
 [PATCH v3 4/8] PCI/DPC: Use pci_aer_clear_status() in dpc_process_error()
 [PATCH v3 5/8] PCI/DPC: Converge EDR and DPC Path of clearing AER registers
 [PATCH v3 6/8] PCI/AER: Clear error device AER registers in aer_irq()
 [PATCH v3 7/8] PCI/ERR: Remove redundant clearing of AER register in 
pcie_do_recovery()
 [PATCH v3 8/8] PCI/AER: Include DEVCTL in aer_print_error()

 drivers/pci/pci.h  |  23 +++-
 drivers/pci/pcie/aer.c | 265 -
 drivers/pci/pcie/dpc.c |   9 +-
 drivers/pci/pcie/err.c |   9 +-
 4 files changed, 207 insertions(+), 99 deletions(-)

-- 
2.25.1

[PATCH 8/8] PCI/AER: Include DEVCTL in aer_print_error()

2021-10-04 Thread Naveen Naidu

Print the contents of Device Control Register of the device which
detected the error. This might help in faster error diagnosis.

Sample output from dummy error injected by aer-inject:

  pcieport :00:03.0: AER: Corrected error received: :00:03.0
  pcieport :00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link 
Layer, (Receiver)
  pcieport :00:03.0:   device [1b36:000c] error 
status/mask=0040/e000, devctl=0x000f
  pcieport :00:03.0:[ 6] BadTLP

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pci.h  |  2 ++
 drivers/pci/pcie/aer.c | 10 --
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index eb88d8bfeaf7..48ed7f91113b 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -437,6 +437,8 @@ struct aer_err_info {
u32 status; /* COR/UNCOR Error Status */
u32 mask;   /* COR/UNCOR Error Mask */
struct aer_header_log_regs tlp; /* TLP Header */
+
+   u16 devctl;
 };
 
 /* Preliminary AER error information processed from Root port */
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 91f91d6ab052..42cae01b6887 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -729,8 +729,8 @@ void aer_print_error(struct pci_dev *dev, struct 
aer_err_info *info)
   aer_error_severity_string[info->severity],
   aer_error_layer[layer], aer_agent_string[agent]);
 
-   pci_printk(level, dev, "  device [%04x:%04x] error 
status/mask=%08x/%08x\n",
-  dev->vendor, dev->device, info->status, info->mask);
+   pci_printk(level, dev, "  device [%04x:%04x] error 
status/mask=%08x/%08x, devctl=%#06x\n",
+  dev->vendor, dev->device, info->status, info->mask, 
info->devctl);
 
__aer_print_error(dev, info);
 
@@ -1083,6 +1083,12 @@ int aer_get_device_error_info(struct pci_dev *dev, 
struct aer_err_info *info)
if (!aer)
return 0;
 
+   /*
+* Cache the value of Device Control Register now, because later the
+* device might not be available
+*/
+   pcie_capability_read_word(dev, PCI_EXP_DEVCTL, >devctl);
+
if (info->severity == AER_CORRECTABLE) {
pci_read_config_dword(dev, aer + PCI_ERR_COR_STATUS,
>status);
-- 
2.25.1

[PATCH 7/8] PCI/ERR: Remove redundant clearing of AER register in pcie_do_recovery()

2021-10-04 Thread Naveen Naidu

pcie_do_recovery() is shared across the following paths:
 - ACPI APEI
 - Native AER path
 - EDR
 - DPC

ACPI APEI
==

  ghes_handle_aer()
aer_recover_queue()
  kfifo_in_spinlocked(aer_recover_ring)

  aer_recover_work_func()
while (kfifo_get(aer_recover_ring))
  pcie_do_recovery()

In this path the system firmware clears the AER registers before
handing off the record to the OS in ghes_handle_aer()

Native AER
==

 aer_irq()
   aer_add_err_devices_to_queue()
 kfifo_put(>aer_fifo, *e_dev)
 clear_error_source_aer_registers()   < AER registers are cleard

 aer_isr()
   aer_isr_one_error()
handle_error_source()
  pcie_do_recovery()

The AER registers are cleared during the handling of IRQ, i.e before we
the recovery starts.

DPC
=

  dpc_handler()
dpc_process_error()
pci_aer_clear_status()   < AER registers are cleared
pcie_do_recovery()

EDR


  edr_handle_event()
dpc_process_error()
pci_aer_raw_clear_status()  < AER registers are cleared
pcie_do_recovery()

In all the above paths, the AER registers are cleared before
pcie_do_recovery(). The non fatal status AER registers are again cleared
in pcie_do_recovery(). This is redundant.

Remove redundant clearing of AER register in pcie_do_recovery()

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pcie/err.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index b576aa890c76..fe04b0ae22f4 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -231,14 +231,11 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 
/*
 * If we have native control of AER, clear error status in the device
-* that detected the error.  If the platform retained control of AER,
-* it is responsible for clearing this status.  In that case, the
-* signaling device may not even be visible to the OS.
+* that detected the error.
 */
-   if (host->native_aer || pcie_ports_native) {
+   if (host->native_aer || pcie_ports_native)
pcie_clear_device_status(dev);
-   pci_aer_clear_nonfatal_status(dev);
-   }
+
pci_info(bridge, "device recovery successful\n");
return status;
 
-- 
2.25.1

[PATCH 6/8] PCI/AER: Clear error device AER registers in aer_irq()

2021-10-04 Thread Naveen Naidu

Converge the APEI path and native AER path of clearing the AER registers
of the error device.

In APEI path, the system firmware clears the AER registers before
handing off the record to OS. But in "native AER" path, the execution
path of clearing the AER register is as follows:

  aer_isr_one_error
aer_print_port_info
  if (find_source_device())
aer_process_err_devices
  handle_error_source
pci_write_config_dword(dev, PCI_ERR_COR_STATUS, ...)

The above path has a bug, if the find_source_device() fails, AER
registers are not cleared from the error device. This means, the error
device will keep reporting the error again and again and would lead
to message spew.

Related Bug Report:
  https://lore.kernel.org/linux-pci/20151229155822.GA17321@localhost/
  https://bugzilla.kernel.org/show_bug.cgi?id=109691
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173

The above bug could be avoided, if the AER registers are cleared during
the AER IRQ handler aer_irq(), which would provide guarantee that the AER
error registers are always cleared. This is similar to how APEI handles
these errors.

The main aim is that:

  When an interrupt handler deals with a interrupt, it must *always*
  clear the source of the interrupt.

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pci.h  |  13 ++-
 drivers/pci/pcie/aer.c | 245 -
 2 files changed, 182 insertions(+), 76 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 9be7a966fda7..eb88d8bfeaf7 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -424,7 +424,6 @@ static inline bool pci_dev_is_added(const struct pci_dev 
*dev)
 #define AER_MAX_MULTI_ERR_DEVICES  5   /* Not likely to have more */
 
 struct aer_err_info {
-   struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
int error_dev_num;
 
u16 id;
@@ -440,6 +439,18 @@ struct aer_err_info {
struct aer_header_log_regs tlp; /* TLP Header */
 };
 
+/* Preliminary AER error information processed from Root port */
+struct aer_devices_err_info {
+   struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
+   struct aer_err_info err_info;
+};
+
+/* AER information associated with each error device */
+struct aer_dev_err_info {
+   struct pci_dev *dev;
+   struct aer_err_info err_info;
+};
+
 int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info);
 void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
 #endif /* CONFIG_PCIEAER */
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 241ff361b43c..91f91d6ab052 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -36,6 +36,18 @@
 
 #define AER_ERROR_SOURCES_MAX  128
 
+/*
+ * There can be 128 maximum error sources (AER_ERROR_SOURCES_MAX) and each
+ * error source can have maximum of 5 error devices (AER_MAX_MULTI_ERR_DEVICES)
+ * so the maximum error devices we can report is:
+ *
+ * AER_ERROR_DEVICES_MAX = AER_ERROR_SOURCES_MAX * AER_MAX_MULTI_ERR_DEVICES 
== (128 * 5) == 640
+ *
+ * But since, the size in KFIFO should be a power of two, the closest value
+ * to 640 is 1024
+ */
+# define AER_ERROR_DEVICES_MAX 1024
+
 #define AER_MAX_TYPEOF_COR_ERRS16  /* as per 
PCI_ERR_COR_STATUS */
 #define AER_MAX_TYPEOF_UNCOR_ERRS  27  /* as per PCI_ERR_UNCOR_STATUS*/
 
@@ -46,7 +58,7 @@ struct aer_err_source {
 
 struct aer_rpc {
struct pci_dev *rpd;/* Root Port device */
-   DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX);
+   DECLARE_KFIFO(aer_fifo, struct aer_dev_err_info, AER_ERROR_DEVICES_MAX);
 };
 
 /* AER stats for the device */
@@ -806,11 +818,11 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity,
  * @e_info: pointer to error info
  * @dev: pointer to pci_dev to be added
  */
-static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
+static int add_error_device(struct aer_devices_err_info *e_dev, struct pci_dev 
*dev)
 {
-   if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
-   e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);
-   e_info->error_dev_num++;
+   if (e_dev->err_info.error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
+   e_dev->dev[e_dev->err_info.error_dev_num] = pci_dev_get(dev);
+   e_dev->err_info.error_dev_num++;
return 0;
}
return -ENOSPC;
@@ -877,18 +889,18 @@ static bool is_error_source(struct pci_dev *dev, struct 
aer_err_info *e_info)
 
 static int find_device_iter(struct pci_dev *dev, void *data)
 {
-   struct aer_err_info *e_info = (struct aer_err_info *)data;
+   struct aer_devices_err_info *e_dev = (struct aer_devices_err_info 
*)data;
 
-   if (is_error_source(dev, e_info)) {
+   if (is_error_source(dev, _dev->err_info)) {
/* List this device */

[PATCH 5/8] PCI/DPC: Converge EDR and DPC Path of clearing AER registers

2021-10-04 Thread Naveen Naidu

In the EDR path, AER registers are cleared *after* DPC error event is
processed. The process stack in EDR is:

  edr_handle_event()
dpc_process_error()
pci_aer_raw_clear_status()
pcie_do_recovery()

But in DPC path, AER status registers are cleared *while* processing
the error. The process stack in DPC is:

  dpc_handler()
dpc_process_error()
  pci_aer_clear_status()
pcie_do_recovery()

In EDR path, AER status registers are cleared irrespective of whether
the error was an RP PIO or unmasked uncorrectable error. But in DPC, the
AER status registers are cleared only when it's an unmasked uncorrectable
error.

This leads to two different behaviours for the same task (handling of
DPC errors) in FFS systems and when native OS has control.

Bring the same semantics for clearing the AER status register in EDR
path and DPC path.

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pcie/dpc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index faf4a1e77fab..68899a3db126 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -288,7 +288,6 @@ void dpc_process_error(struct pci_dev *pdev)
 dpc_get_aer_uncorrect_severity(pdev, ) &&
 aer_get_device_error_info(pdev, )) {
aer_print_error(pdev, );
-   pci_aer_clear_status(pdev);
}
 }
 
@@ -297,6 +296,7 @@ static irqreturn_t dpc_handler(int irq, void *context)
struct pci_dev *pdev = context;
 
dpc_process_error(pdev);
+   pci_aer_clear_status(pdev);
 
/* We configure DPC so it only triggers on ERR_FATAL */
pcie_do_recovery(pdev, pci_channel_io_frozen, dpc_reset_link);
-- 
2.25.1

[PATCH 4/8] PCI/DPC: Use pci_aer_clear_status() in dpc_process_error()

2021-10-04 Thread Naveen Naidu

dpc_process_error() clears both AER fatal and non fatal status
registers. Instead of clearing each status registers via a different
function call use pci_aer_clear_status().

This helps clean up the code a bit.

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pcie/dpc.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index df3f3a10f8bc..faf4a1e77fab 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -288,8 +288,7 @@ void dpc_process_error(struct pci_dev *pdev)
 dpc_get_aer_uncorrect_severity(pdev, ) &&
 aer_get_device_error_info(pdev, )) {
aer_print_error(pdev, );
-   pci_aer_clear_nonfatal_status(pdev);
-   pci_aer_clear_fatal_status(pdev);
+   pci_aer_clear_status(pdev);
}
 }
 
-- 
2.25.1

[PATCH 2/8] PCI: Cleanup struct aer_err_info

2021-10-04 Thread Naveen Naidu

The id, status and the mask fields of the struct aer_err_info comes
directly from the registers, hence their sizes should be explicit.

The length of these registers are:
  - id: 16 bits - Represents the Error Source Requester ID
  - status: 32 bits - COR/UNCOR Error Status
  - mask: 32 bits - COR/UNCOR Error Mask

Since the length of the above registers are even, use u16 and u32
to represent their values.

Also remove the __pad fields.

"pahole" was run on the modified struct aer_err_info and the size
remains unchanged.

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pci.h | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 1cce56c2aea0..9be7a966fda7 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -427,18 +427,16 @@ struct aer_err_info {
struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
int error_dev_num;
 
-   unsigned int id:16;
+   u16 id;
 
unsigned int severity:2;/* 0:NONFATAL | 1:FATAL | 2:COR */
-   unsigned int __pad1:5;
unsigned int multi_error_valid:1;
 
unsigned int first_error:5;
-   unsigned int __pad2:2;
unsigned int tlp_header_valid:1;
 
-   unsigned int status;/* COR/UNCOR Error Status */
-   unsigned int mask;  /* COR/UNCOR Error Mask */
+   u32 status; /* COR/UNCOR Error Status */
+   u32 mask;   /* COR/UNCOR Error Mask */
struct aer_header_log_regs tlp; /* TLP Header */
 };
 
-- 
2.25.1

[PATCH 1/8] PCI/AER: Remove ID from aer_agent_string[]

2021-10-04 Thread Naveen Naidu

Before 010caed4ccb6 ("PCI/AER: Decode Error Source RequesterID")
the AER error logs looked like:

  pcieport :00:03.0: AER: Corrected error received: id=0018
  pcieport :00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link 
Layer, id=0018 (Receiver ID)
  pcieport :00:03.0:   device [1b36:000c] error 
status/mask=0040/e000
  pcieport :00:03.0:[ 6] BadTLP

In 010caed4ccb6 ("PCI/AER: Decode Error Source Requester ID"),
the "id" field was removed from the AER error logs, so currently AER
logs look like:

  pcieport :00:03.0: AER: Corrected error received: :00:03:0
  pcieport :00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link 
Layer, (Receiver ID)
  pcieport :00:03.0:   device [1b36:000c] error 
status/mask=0040/e000
  pcieport :00:03.0:[ 6] BadTLP

The second line in the above logs prints "(Receiver ID)", even when
there is no "id" in the log line. This is confusing.

Remove the "ID" from the aer_agent_string[]. The error logs will
look as follows (Sample from dummy error injected by aer-inject):

  pcieport :00:03.0: AER: Corrected error received: :00:03.0
  pcieport :00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link 
Layer, (Receiver)
  pcieport :00:03.0:   device [1b36:000c] error 
status/mask=0040/e000
  pcieport :00:03.0:[ 6] BadTLP

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pcie/aer.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9784fdcf3006..241ff361b43c 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -516,10 +516,10 @@ static const char *aer_uncorrectable_error_string[] = {
 };
 
 static const char *aer_agent_string[] = {
-   "Receiver ID",
-   "Requester ID",
-   "Completer ID",
-   "Transmitter ID"
+   "Receiver",
+   "Requester",
+   "Completer",
+   "Transmitter"
 };
 
 #define aer_stats_dev_attr(name, stats_array, strings_array,   \
@@ -703,7 +703,7 @@ void aer_print_error(struct pci_dev *dev, struct 
aer_err_info *info)
const char *level;
 
if (!info->status) {
-   pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, 
(Unregistered Agent ID)\n",
+   pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, 
(Unregistered Agent)\n",
aer_error_severity_string[info->severity]);
goto out;
}
-- 
2.25.1

[PATCH 0/8] Fix long standing AER Error Handling Issues

2021-10-04 Thread Naveen Naidu

This patch series aims at fixing some of the AER error handling issues
we have.

Currently we have the following issues:
 - Confusing message in aer_print_error()
 - aer_err_info not being initialized completely in DPC path before 
   we print the AER logs
 - A bug [1] in clearing of AER registers in the native AER path

[1] https://lore.kernel.org/linux-pci/20151229155822.GA17321@localhost/

The primary aim of this patch series is to converge the APEI path and the
native AER error handling paths. In our current code, we find that we
have two different behaviours (especially when it comes to clearing of
the AER registers) for the same functionality.

This patch series, tries to bring the same semantics and hence more 
commonanlity between the APEI part of code and the native OS 
handling of AER errors.

PATCH 1: 
  - Fixes the first issue

PATCH 2 - 4:
  - Fixes the second issue
  - "Patch 3/8" is dependent on "Patch 2/3" in the series

PATCH 5 - 7
  - Deals with converging the various paths and to bring more
commonality between them
  - "Patch 6/8" depends on "Patch 1/8"

PATCH 8:
  -  Adds extra information in AER error logs.

Thanks,
Naveen Naidu

Naveen Naidu (8):
 [PATCH 1/8] PCI/AER: Remove ID from aer_agent_string[]
 [PATCH 2/8] PCI: Cleanup struct aer_err_info
 [PATCH 3/8] PCI/DPC: Initialize info->id in dpc_process_error()
 [PATCH 4/8] PCI/DPC: Use pci_aer_clear_status() in dpc_process_error()
 [PATCH 5/8] PCI/DPC: Converge EDR and DPC Path of clearing AER registers
 [PATCH 6/8] PCI/AER: Clear error device AER registers in aer_irq()
 [PATCH 7/8] PCI/ERR: Remove redundant clearing of AER register in 
pcie_do_recovery()
 [PATCH 8/8] PCI/AER: Include DEVCTL in aer_print_error()

 drivers/pci/pci.h  |  23 +++-
 drivers/pci/pcie/aer.c | 265 -
 drivers/pci/pcie/dpc.c |   9 +-
 drivers/pci/pcie/err.c |   9 +-
 4 files changed, 207 insertions(+), 99 deletions(-)

-- 
2.25.1

52 matches

Mail list logo