Re: [PATCH v2 0/4] Address error and recovery for AER and DPC

2018-01-02 Thread poza

On 2018-01-03 00:32, Bjorn Helgaas wrote:
> On Fri, Dec 29, 2017 at 12:54:15PM +0530, Oza Pawandeep wrote:
> > This patch set brings in support for DPC and AER to co-exist and not to
> > race for recovery.
> >
> > The current implementation of AER and error message broadcasting to the
> > EP driver is tightly coupled and limited to AER service driver.
> > It is important to factor out broadcasting and other link handling
> > callbacks. So that not only when AER gets triggered, but also when DPC get
> > triggered, or both get triggered simultaneously (for e.g. ERR_FATAL),
> > callbacks are handled appropriately.
> > having modularized the code, the race between AER and DPC is handled
> > gracefully.
> > for e.g. when DPC is active and kicked in, AER should not attempt to do
> > recovery, because DPC takes care of it.
>
> High-level question:
>
> We have some convoluted code in negotiate_os_control() and
> aer_service_init() that (I think) essentially disables AER unless the
> platform firmware grants us permission to use it.
>
> The last implementation note in PCIe r3.1, sec 6.2.10 says
>
>   DPC may be controlled in some configurations by platform firmware
>   and in other configurations by the operating system. DPC
>   functionality is strongly linked with the functionality in Advanced
>   Error Reporting. To avoid conflicts over whether platform firmware
>   or the operating system have control of DPC, it is recommended that
>   platform firmware and operating systems always link the control of
>   DPC to the control of Advanced Error Reporting.
>
> I read that as suggesting that we should enable DPC support in Linux
> if and only if we also enable AER.  But I don't see anything in DPC
> that looks like that.  Should there be something there?  Should DPC be
> restructured so it's enabled and handled inside the AER driver instead
> of being a separate driver?
>
> Bjorn


The whole idea of factoring out error handling and plugging it back into
DPC is to let DPC participate synchronously in the
pcie_port_service_driver hooks.

AER and DPC are both port service drivers, so it makes sense for DPC to
be able to do as much with those callbacks as AER currently can. But
today those callbacks are tightly coupled to the AER driver.

With the callbacks factored out, DPC and AER can each act independently
in their own space and gain more control, and if needed both can
synchronize through the callbacks.

Regards,
Oza.

Re: [PATCH v2 0/4] Address error and recovery for AER and DPC

2018-01-02 Thread Sinan Kaya
On 1/2/2018 2:02 PM, Bjorn Helgaas wrote:
> I read that as suggesting that we should enable DPC support in Linux
> if and only if we also enable AER.  But I don't see anything in DPC
> that looks like that.  Should there be something there?  Should DPC be
> restructured so it's enabled and handled inside the AER driver instead
> of being a separate driver?

I think Keith posted a patch to do this. If firmware first is enabled, DPC
init is skipped after his patch.

Oza was able to plumb the DPC handling into the error recovery callbacks of
the portdrv, since the portdrv layer already provides facilities such
as reset_link and resume.

The way DPC and AER work is almost identical from the AER portdrv perspective.

I really like his plumbing. Putting DPC code into AER makes it more
convoluted in my opinion.
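
To make the plumbing idea concrete, here is a minimal userspace sketch, not
the kernel code. It assumes each port service supplies a reset_link and a
resume-style hook and that a shared recovery path dispatches to whichever
service reported the error. Every name in it (struct port_service,
register_service, recover_port, the svc_id enum, the counting callbacks) is
hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real portdrv structures differ. */
enum svc_id { SVC_AER, SVC_DPC, SVC_COUNT };

struct port_service {
	const char *name;
	int (*reset_link)(void *port);    /* bring the link back up */
	void (*error_resume)(void *port); /* tell drivers traffic may resume */
};

/* One slot per service; NULL means "not registered". */
static const struct port_service *services[SVC_COUNT];

static int register_service(enum svc_id id, const struct port_service *svc)
{
	assert(svc != NULL);
	if ((unsigned)id >= (unsigned)SVC_COUNT || services[id] != NULL)
		return -1;
	services[id] = svc;
	return 0;
}

/* Shared recovery path: run the hooks of the service that saw the error. */
static int recover_port(enum svc_id id, void *port)
{
	const struct port_service *svc =
		(unsigned)id < (unsigned)SVC_COUNT ? services[id] : NULL;
	int rc;

	if (svc == NULL || svc->reset_link == NULL)
		return -1;
	rc = svc->reset_link(port);
	if (rc == 0 && svc->error_resume != NULL)
		svc->error_resume(port);
	return rc;
}

/* Example DPC callbacks that only count calls, so the flow is observable. */
static int dpc_resets, dpc_resumes;

static int dpc_reset_link(void *port)
{
	(void)port;
	dpc_resets++;
	return 0;
}

static void dpc_error_resume(void *port)
{
	(void)port;
	dpc_resumes++;
}

static const struct port_service dpc_service = {
	"dpc", dpc_reset_link, dpc_error_resume
};
```

With this shape, the recovery path does not need to know whether AER or DPC
is behind the callbacks, which is the property the factoring-out is after.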

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [PATCH v2 0/4] Address error and recovery for AER and DPC

2018-01-02 Thread Keith Busch
On Tue, Jan 02, 2018 at 01:02:15PM -0600, Bjorn Helgaas wrote:
> On Fri, Dec 29, 2017 at 12:54:15PM +0530, Oza Pawandeep wrote:
> > This patch set brings in support for DPC and AER to co-exist and not to
> > race for recovery.
> > 
> > The current implementation of AER and error message broadcasting to the
> > EP driver is tightly coupled and limited to AER service driver.
> > It is important to factor out broadcasting and other link handling
> > callbacks. So that not only when AER gets triggered, but also when DPC get
> > triggered, or both get triggered simultaneously (for e.g. ERR_FATAL),
> > callbacks are handled appropriately.
> > having modularized the code, the race between AER and DPC is handled
> > gracefully.
> > for e.g. when DPC is active and kicked in, AER should not attempt to do
> > recovery, because DPC takes care of it.
> 
> High-level question:
> 
> We have some convoluted code in negotiate_os_control() and
> aer_service_init() that (I think) essentially disables AER unless the
> platform firmware grants us permission to use it.
> 
> The last implementation note in PCIe r3.1, sec 6.2.10 says
> 
>   DPC may be controlled in some configurations by platform firmware
>   and in other configurations by the operating system. DPC
>   functionality is strongly linked with the functionality in Advanced
>   Error Reporting. To avoid conflicts over whether platform firmware
>   or the operating system have control of DPC, it is recommended that
>   platform firmware and operating systems always link the control of
>   DPC to the control of Advanced Error Reporting.
> 
> I read that as suggesting that we should enable DPC support in Linux
> if and only if we also enable AER.  But I don't see anything in DPC
> that looks like that.  Should there be something there?  Should DPC be
> restructured so it's enabled and handled inside the AER driver instead
> of being a separate driver?

Yes, I agree the two should be linked. I submitted a patch for that here,
though driver responsibilities are still separate in this series:

  https://marc.info/?l=linux-pci&m=151371742225111&w=2




Re: [PATCH v2 0/4] Address error and recovery for AER and DPC

2018-01-02 Thread Bjorn Helgaas
On Fri, Dec 29, 2017 at 12:54:15PM +0530, Oza Pawandeep wrote:
> This patch set brings in support for DPC and AER to co-exist and not to
> race for recovery.
> 
> The current implementation of AER and error message broadcasting to the
> EP driver is tightly coupled and limited to AER service driver.
> It is important to factor out broadcasting and other link handling
> callbacks. So that not only when AER gets triggered, but also when DPC get
> triggered, or both get triggered simultaneously (for e.g. ERR_FATAL),
> callbacks are handled appropriately.
> having modularized the code, the race between AER and DPC is handled
> gracefully.
> for e.g. when DPC is active and kicked in, AER should not attempt to do
> recovery, because DPC takes care of it.

High-level question:

We have some convoluted code in negotiate_os_control() and
aer_service_init() that (I think) essentially disables AER unless the
platform firmware grants us permission to use it.

The last implementation note in PCIe r3.1, sec 6.2.10 says

  DPC may be controlled in some configurations by platform firmware
  and in other configurations by the operating system. DPC
  functionality is strongly linked with the functionality in Advanced
  Error Reporting. To avoid conflicts over whether platform firmware
  or the operating system have control of DPC, it is recommended that
  platform firmware and operating systems always link the control of
  DPC to the control of Advanced Error Reporting.

I read that as suggesting that we should enable DPC support in Linux
if and only if we also enable AER.  But I don't see anything in DPC
that looks like that.  Should there be something there?  Should DPC be
restructured so it's enabled and handled inside the AER driver instead
of being a separate driver?
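
The "if and only if" reading can be written down as a small predicate. This
is only a sketch under stated assumptions: struct err_control and
dpc_may_enable() are hypothetical names, not existing kernel interfaces, and
the dpc_capable field is an extra precondition added here so the example is
complete.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of what the platform granted; not a kernel type. */
struct err_control {
	bool os_owns_aer; /* firmware granted AER control to the OS */
	bool dpc_capable; /* the port advertises the DPC capability */
};

/* DPC follows AER: enable it only when the OS also controls AER. */
static bool dpc_may_enable(const struct err_control *ctl)
{
	assert(ctl != NULL);
	return ctl->dpc_capable && ctl->os_owns_aer;
}
```

A check of this kind would leave DPC to the firmware whenever the firmware
keeps AER.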

Bjorn



[PATCH v2 0/4] Address error and recovery for AER and DPC

2017-12-28 Thread Oza Pawandeep
This patch set brings in support for DPC and AER to co-exist and not to
race for recovery.

The current implementation of AER and error message broadcasting to the
EP driver is tightly coupled and limited to the AER service driver.
It is important to factor out broadcasting and the other link handling
callbacks, so that callbacks are handled appropriately not only when AER
is triggered, but also when DPC is triggered or when both are triggered
simultaneously (e.g. for ERR_FATAL).
With the code modularized, the race between AER and DPC is handled
gracefully.
For example, when DPC is active and has kicked in, AER should not attempt
recovery, because DPC takes care of it.

DPC should enumerate the devices after recovering the link, which is
achieved by implementing error_resume callback.
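
The intended division of labour can be sketched as a small userspace model.
It is not code from the patches: struct port_state, aer_handle_fatal(),
dpc_trigger() and dpc_resume_and_rescan() are hypothetical names, and the
device count simply stands in for re-enumeration after the link recovers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-port state; field names are illustrative only. */
struct port_state {
	bool dpc_active;    /* DPC has triggered and owns recovery */
	bool link_up;
	int devices_below;  /* endpoints currently enumerated */
	int aer_recoveries; /* times AER ran its own recovery */
};

enum owner { OWNER_AER, OWNER_DPC };

/* AER path: step aside when DPC has already contained the port. */
static enum owner aer_handle_fatal(struct port_state *p)
{
	assert(p != NULL);
	if (p->dpc_active)
		return OWNER_DPC;
	p->aer_recoveries++;
	return OWNER_AER;
}

/* DPC path: containment takes the link down and drops the devices below. */
static void dpc_trigger(struct port_state *p)
{
	assert(p != NULL);
	p->dpc_active = true;
	p->link_up = false;
	p->devices_below = 0;
}

/* Modeled on the error_resume idea: once the link is back, re-enumerate. */
static void dpc_resume_and_rescan(struct port_state *p, int found)
{
	assert(p != NULL);
	p->link_up = true;
	p->dpc_active = false;
	p->devices_below = found;
}
```

In this model an ERR_FATAL that reaches both services results in exactly one
recovery, and the devices reappear only after the resume step has run.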

Changes since v1:
Kbuild errors fixed:
> pci_find_dpc_dev made static
> ras_event.h updated
> pci_find_aer_service call with CONFIG check
> pci_find_dpc_service call with CONFIG check

Oza Pawandeep (4):
  PCI/AER: factor out error reporting from AER
  PCI/DPC/AER: Address Concurrency between AER and DPC
  PCI/ERR: Do not do recovery if DPC service is active
  PCI/DPC: Enumerate the devices after DPC trigger event

 drivers/acpi/apei/ghes.c               |   2 +-
 drivers/pci/pcie/Makefile              |   2 +-
 drivers/pci/pcie/aer/aerdrv.h          |  30 ---
 drivers/pci/pcie/aer/aerdrv_core.c     | 306 +
 drivers/pci/pcie/aer/aerdrv_errprint.c |  27 ++-
 drivers/pci/pcie/pcie-dpc.c            | 127 ++-
 drivers/pci/pcie/pcie-err.c            | 399 +
 drivers/pci/pcie/portdrv.h             |   2 +
 include/linux/aer.h                    |   4 -
 include/linux/pci.h                    |  23 ++
 include/ras/ras_event.h                |   6 +-
 11 files changed, 579 insertions(+), 349 deletions(-)
 create mode 100644 drivers/pci/pcie/pcie-err.c

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.,
a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


