Re: [Linux-stm32] [PATCH v2 00/14] Introduce STM32MP1 RCC in secured mode

2021-03-11 Thread Alex G.




On 3/11/21 12:10 PM, Alexandre TORGUE wrote:

Hi Guys

On 3/11/21 5:11 PM, Marek Vasut wrote:

On 3/11/21 3:41 PM, Ahmad Fatoum wrote:

Hello,


Hi,


On 11.03.21 15:02, Alexandre TORGUE wrote:

On 3/11/21 12:43 PM, Marek Vasut wrote:

On 3/11/21 9:08 AM, Alexandre TORGUE wrote:
1- Break the current ABI: as soon as those patches are merged,
stm32mp157c-dk2.dtb will require a TF-A providing SCMI clocks. People
using U-Boot SPL will have to create their own "no-secure" devicetree.


NAK, this breaks existing boards and existing setups, e.g. DK2 that 
does not use ATF.


2- As you suggest, create a new "secure" dtb per board (not my wish,
from a maintenance perspective).


I agree with Alex (G) that the "secure" option should be opt-in.
That way existing setups remain working and no extra requirements 
are imposed on MP1 users. Esp. since as far as I understand this, 
the "secure" part isn't really about security, but rather about 
moving clock configuration from Linux to some firmware blob.


3- Keep kernel device trees as they are and apply this secure layer
(SCMI clock phandles) via a dtbo in U-Boot.


Is this really better than
#include "stm32mp15xx-enable-secure-stuff.dtsi"
in a board DT ? Because that is how I imagine the opt-in "secure" 
option could work.




Discussing with Patrick about U-Boot, we could apply a dtbo thanks to
extlinux.conf. BUT it will not prevent other cases (i.e. a TF-A which
jumps directly into the kernel). So the "least worst" solution is to
create a new "stm32mp157c-scmi-dk2" board which will overload clock
entries with an SCMI phandle (as proposed by Alex).


I raised this issue before with your colleagues. I still believe the
correct way would be for the TF-A to pass down either a device tree or
an overlay with the actual settings in use, e.g.:

   - Clocks/Resets done via SCMI
   - Reserved memory regions

If TF-A directly boots Linux, it can apply the overlay itself, otherwise
it's passed down to the SSBL that applies it before booting Linux.


That sounds good and it is something e.g. R-Car already does: it merges
DT fragments from prior stages at the U-Boot level and then passes the
result to Linux.


So on ST hardware, the same could very well happen, and it would work
for all of the non-ATF / ATF / ATF+TEE options.


Even though this solution sounds good, we are currently not able to do it
in our TF-A/U-Boot, so it is not feasible for the moment. We have to find
a solution for now, and creating a new dtb can be that solution. Our
internal strategy is to use SCMI on our official ST boards. Would it
really be a drawback to include a "no-scmi.dtsi" in DH boards (for
example) and to create a stm32mp157c-noscmi-dk2.dts?


It could work, as long as all users are reminded to change their build
scripts to pick up a "-noscmi.dtb". I suspect that if this were the case
we'd see quite a few bug reports saying "stm32mp1 no longer boots with
kernel v5.13".


I didn't think of this originally, though U-Boot already does DTB
patching for the OP-TEE reserved memory regions. It's not too hard to
also patch in the SCMI clocks at boot. In U-Boot's case, runtime
detection might even be feasible.
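
To sketch the idea (purely illustrative: detect_scmi_firmware() is a
hypothetical helper, and switching the RCC compatible stands in for the
full set of clock rewrites the series needs -- this is not tested code):

/* Hypothetical U-Boot board fixup, e.g. somewhere under board/st/stm32mp1/ */
#include <common.h>
#include <fdt_support.h>
#include <linux/libfdt.h>

int ft_board_setup(void *blob, struct bd_info *bd)
{
	int off;

	/* Hypothetical check: only patch when secure firmware owns the RCC */
	if (!detect_scmi_firmware())
		return 0;

	off = fdt_node_offset_by_compatible(blob, -1, "st,stm32mp1-rcc");
	if (off < 0)
		return 0;	/* nothing to patch */

	/* Hand the RCC node over to the secure/SCMI-backed compatible */
	return fdt_setprop_string(blob, off, "compatible",
				  "st,stm32mp1-rcc-secure");
}

The real fixup would also have to rewrite the clocks/resets phandles of
the affected consumers, but the hook point (ft_board_setup(), which runs
before the DTB is handed to Linux) is the same.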


Alex


Re: [PATCH v2 00/14] Introduce STM32MP1 RCC in secured mode

2021-03-09 Thread Alex G.

On 1/26/21 3:01 AM, gabriel.fernan...@foss.st.com wrote:

From: Gabriel Fernandez 

Platform STM32MP1 can be used in configurations where some clocks and
IP resets are handled as secure resources.
These resources are moved from an RCC clock/reset handle to an SCMI
clock/reset_domain handle.

The RCC clock driver is now dependent on the SCMI driver, so we now have
to manage probe deferral.

v1 -> v2:
   - fix yamllint warnings.


Hi Gabriel,

I don't have much clout with the maintainers, but I have to NAK this 
series after finding major breakage.


The problem with this series is that it breaks pretty much every board it
touches. I have a DK2 here that I'm using for development, which no
longer boots with this series applied.


The crux of the matter is that this series assumes all boards will boot 
with an FSBL that implements a very specific SCMI clock tree. This is 
major ABI breakage for anyone not using TF-A as the first stage 
bootloader. Anyone using u-boot SPL is screwed.


This series imposes an SoC-wide change via the dtsi files, so even boards
that you don't intend to convert to SCMI get broken this way.
Adding a -no-scmi file that isn't used anywhere doesn't help things.


Here's what I suggest:

Generate new dtb files for those boards that you want to convert. So you 
would get:

  - stm32mp157c-dk2.dtb # Good old hardware clocks
  - stm32mp157c-dk2-secure-rcc.dtb # Clocks accessible by scmi.

A lot of users use a larger build system where they extract the relevant 
files. With the scheme I'm proposing you don't break their builds, and 
you allow SCMI users to have upstream support.


This means that you'll have to rethink the DTS and DTSI changes to
accommodate both use cases.


Thanks,
Alex






Gabriel Fernandez (14):
   clk: stm32mp1: merge 'clk-hsi-div' and 'ck_hsi' into one clock
   clk: stm32mp1: merge 'ck_hse_rtc' and 'ck_rtc' into one clock
   clk: stm32mp1: remove intermediate pll clocks
   clk: stm32mp1: convert to module driver
   clk: stm32mp1: move RCC reset controller into RCC clock driver
   reset: stm32mp1: remove stm32mp1 reset
   dt-bindings: clock: add IDs for SCMI clocks on stm32mp15
   dt-bindings: reset: add IDs for SCMI reset domains on stm32mp15
   dt-bindings: reset: add MCU HOLD BOOT ID for SCMI reset domains on
 stm32mp15
   clk: stm32mp1: new compatible for secure RCC support
   ARM: dts: stm32: define SCMI resources on stm32mp15
   ARM: dts: stm32: move clocks/resets to SCMI resources for stm32mp15
   dt-bindings: clock: stm32mp1 new compatible for secure rcc
   ARM: dts: stm32: introduce basic boot include on stm32mp15x board

  .../bindings/clock/st,stm32mp1-rcc.yaml   |   6 +-
  arch/arm/boot/dts/stm32mp15-no-scmi.dtsi  | 158 ++
  arch/arm/boot/dts/stm32mp151.dtsi | 127 +++--
  arch/arm/boot/dts/stm32mp153.dtsi |   4 +-
  arch/arm/boot/dts/stm32mp157.dtsi |   2 +-
  arch/arm/boot/dts/stm32mp15xc.dtsi|   4 +-
  drivers/clk/Kconfig   |  10 +
  drivers/clk/clk-stm32mp1.c| 495 +++---
  drivers/reset/Kconfig |   6 -
  drivers/reset/Makefile|   1 -
  drivers/reset/reset-stm32mp1.c| 115 
  include/dt-bindings/clock/stm32mp1-clks.h |  27 +
  include/dt-bindings/reset/stm32mp1-resets.h   |  15 +
  13 files changed, 704 insertions(+), 266 deletions(-)
  create mode 100644 arch/arm/boot/dts/stm32mp15-no-scmi.dtsi
  delete mode 100644 drivers/reset/reset-stm32mp1.c



Re: Issues with "PCI/LINK: Report degraded links via link bandwidth notification"

2021-02-02 Thread Alex G.

On 2/2/21 2:16 PM, Bjorn Helgaas wrote:

On Tue, Feb 02, 2021 at 01:50:20PM -0600, Alex G. wrote:

On 1/29/21 3:56 PM, Bjorn Helgaas wrote:

On Thu, Jan 28, 2021 at 06:07:36PM -0600, Alex G. wrote:

On 1/28/21 5:51 PM, Sinan Kaya wrote:

On 1/28/2021 6:39 PM, Bjorn Helgaas wrote:

AFAICT, this thread petered out with no resolution.

If the bandwidth change notifications are important to somebody,
please speak up, preferably with a patch that makes the notifications
disabled by default and adds a parameter to enable them (or some other
strategy that makes sense).

I think these are potentially useful, so I don't really want to just
revert them, but if nobody thinks these are important enough to fix,
that's a possibility.


Hide behind debug or expert option by default? or even mark it as BROKEN
until someone fixes it?


Instead of making it a config option, wouldn't it be better as a kernel
parameter? People encountering this seem quite competent in passing kernel
arguments, so having a "pcie_bw_notification=off" would solve their
problems.


I don't want people to have to discover a parameter to solve issues.
If there's a parameter, notification should default to off, and people
who want notification should supply a parameter to enable it.  Same
thing for the sysfs idea.


I can imagine cases where a per-port flag would be useful. For example, a
machine with a NIC and a couple of PCIe storage drives. In this example, the
PCIe drives downtrain willy-nilly, so it's useful to turn off their
notifications, but the NIC absolutely must not downtrain. It's debatable
whether it should be default on or default off.


I think we really just need to figure out what's going on.  Then it
should be clearer how to handle it.  I'm not really in a position to
debug the root cause since I don't have the hardware or the time.


I wonder
(a) if some PCIe devices are downtraining willy-nilly to save power
(b) if this willy-nilly downtraining somehow violates the PCIe spec
(c) what is the official behavior when downtraining is intentional

My theory is: YES, YES, ASPM. But I don't know how to figure this out
without having the problem hardware in hand.


If nobody can figure out what's going on, I think we'll have to make it
disabled by default.


I think most distros do "CONFIG_PCIE_BW is not set". Is that not true?


I think it *is* true that distros do not enable CONFIG_PCIE_BW.

But it's perfectly reasonable for people building their own kernels to
enable it.  It should be safe to enable all config options.  If they
do enable CONFIG_PCIE_BW, I don't want them to waste time debugging
messages they don't expect.

If we understood why these happen and could filter out the expected
ones, that would be great.  But we don't.  We've already wasted quite
a bit of Jan's and Atanas' time, and no doubt others who haven't
bothered to file bug reports.

So I think I'll queue up a patch to remove the functionality for now.
It's easily restored if somebody debugs the problem or adds a
command-line switch or something.


I think it's best we make it a module (or kernel) parameter, default=off 
for the time being.
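
For concreteness, a minimal sketch of what that could look like inside
the bandwidth notification service (the parameter name and default below
are only illustrative, not an existing interface):

/* sketch, on top of the existing drivers/pci/pcie/bw_notification.c */
#include <linux/moduleparam.h>

static bool enable_bw_notification;	/* default off for now */
module_param(enable_bw_notification, bool, 0444);
MODULE_PARM_DESC(enable_bw_notification,
		 "Report PCIe link bandwidth degradation");

static int pcie_bandwidth_notification_probe(struct pcie_device *srv)
{
	/* Stay dormant unless the user explicitly asked for reports */
	if (!enable_bw_notification)
		return -ENODEV;

	if (!pcie_link_bandwidth_notification_supported(srv->port))
		return -ENODEV;

	/* ... existing IRQ request and notification enable ... */
	return 0;
}

(The exact command-line spelling would depend on which object the
parameter ends up in, since the port services are linked into
pcieportdrv.)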


Alex


Re: Issues with "PCI/LINK: Report degraded links via link bandwidth notification"

2021-02-02 Thread Alex G.

On 1/29/21 3:56 PM, Bjorn Helgaas wrote:

On Thu, Jan 28, 2021 at 06:07:36PM -0600, Alex G. wrote:

On 1/28/21 5:51 PM, Sinan Kaya wrote:

On 1/28/2021 6:39 PM, Bjorn Helgaas wrote:

AFAICT, this thread petered out with no resolution.

If the bandwidth change notifications are important to somebody,
please speak up, preferably with a patch that makes the notifications
disabled by default and adds a parameter to enable them (or some other
strategy that makes sense).

I think these are potentially useful, so I don't really want to just
revert them, but if nobody thinks these are important enough to fix,
that's a possibility.


Hide behind debug or expert option by default? or even mark it as BROKEN
until someone fixes it?


Instead of making it a config option, wouldn't it be better as a kernel
parameter? People encountering this seem quite competent in passing kernel
arguments, so having a "pcie_bw_notification=off" would solve their
problems.


I don't want people to have to discover a parameter to solve issues.
If there's a parameter, notification should default to off, and people
who want notification should supply a parameter to enable it.  Same
thing for the sysfs idea.


I can imagine cases where a per-port flag would be useful. For example,
a machine with a NIC and a couple of PCIe storage drives. In this
example, the PCIe drives downtrain willy-nilly, so it's useful to turn
off their notifications, but the NIC absolutely must not downtrain. It's
debatable whether it should be default on or default off.



I think we really just need to figure out what's going on.  Then it
should be clearer how to handle it.  I'm not really in a position to
debug the root cause since I don't have the hardware or the time.


I wonder
(a) if some PCIe devices are downtraining willy-nilly to save power
(b) if this willy-nilly downtraining somehow violates the PCIe spec
(c) what is the official behavior when downtraining is intentional

My theory is: YES, YES, ASPM. But I don't know how to figure this out 
without having the problem hardware in hand.




If nobody can figure out what's going on, I think we'll have to make it
disabled by default.


I think most distros do "CONFIG_PCIE_BW is not set". Is that not true?

Alex


Re: Issues with "PCI/LINK: Report degraded links via link bandwidth notification"

2021-01-28 Thread Alex G.

On 1/28/21 5:51 PM, Sinan Kaya wrote:

On 1/28/2021 6:39 PM, Bjorn Helgaas wrote:

AFAICT, this thread petered out with no resolution.

If the bandwidth change notifications are important to somebody,
please speak up, preferably with a patch that makes the notifications
disabled by default and adds a parameter to enable them (or some other
strategy that makes sense).

I think these are potentially useful, so I don't really want to just
revert them, but if nobody thinks these are important enough to fix,
that's a possibility.


Hide behind debug or expert option by default? or even mark it as BROKEN
until someone fixes it?

Instead of making it a config option, wouldn't it be better as a kernel 
parameter? People encountering this seem quite competent in passing 
kernel arguments, so having a "pcie_bw_notification=off" would solve 
their problems.


As far as marking this as broken, I've seen no conclusive evidence to
tell whether it's a SW bug or an actual hardware problem. Could we have a
sysfs attribute to disable this on a per-downstream-port basis?


e.g.
echo 0 > /sys/bus/pci/devices/:00:04.0/bw_notification_enabled

This probably won't be ideal if there are many devices downtraining 
their links ad-hoc. At worst we'd have a way to silence those messages 
if we do encounter such devices.
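
To make that concrete, a rough sketch of such a per-port knob (the
attribute name and data layout are just for illustration; nothing like
this exists today):

/* sketch: per-port on/off switch for bandwidth notification messages */
#include <linux/device.h>
#include <linux/kernel.h>

struct bw_notif_data {
	bool enabled;
};

static ssize_t bw_notification_enabled_show(struct device *dev,
					    struct device_attribute *attr,
					    char *buf)
{
	struct bw_notif_data *data = dev_get_drvdata(dev);

	return sprintf(buf, "%u\n", data->enabled);
}

static ssize_t bw_notification_enabled_store(struct device *dev,
					     struct device_attribute *attr,
					     const char *buf, size_t count)
{
	struct bw_notif_data *data = dev_get_drvdata(dev);
	bool val;

	if (kstrtobool(buf, &val))
		return -EINVAL;

	data->enabled = val;
	return count;
}
static DEVICE_ATTR_RW(bw_notification_enabled);

The threaded handler would still acknowledge the interrupt, but skip the
log message when the flag is clear.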


Alex


Re: [PATCH v2 1/2] drm/bridge: sii902x: Enable I/O and core VCC supplies if present

2020-10-20 Thread Alex G.




On 10/20/20 2:16 AM, Sam Ravnborg wrote:

Hi Alex.


[snip]





diff --git a/drivers/gpu/drm/bridge/sii902x.c b/drivers/gpu/drm/bridge/sii902x.c
index 33fd33f953ec..d15e9f2c0d8a 100644
--- a/drivers/gpu/drm/bridge/sii902x.c
+++ b/drivers/gpu/drm/bridge/sii902x.c
@@ -17,6 +17,7 @@
   #include 
   #include 
   #include 
+#include 
   #include 
   #include 
@@ -168,6 +169,8 @@ struct sii902x {
struct drm_connector connector;
struct gpio_desc *reset_gpio;
struct i2c_mux_core *i2cmux;
+   struct regulator *iovcc;
+   struct regulator *cvcc12;
/*
 * Mutex protects audio and video functions from interfering
 * each other, by keeping their i2c command sequences atomic.
@@ -954,13 +957,13 @@ static const struct drm_bridge_timings 
default_sii902x_timings = {
 | DRM_BUS_FLAG_DE_HIGH,
   };
+static int sii902x_init(struct sii902x *sii902x);

Please re-arrange the code so this prototype is not needed.


I'd be happy to re-arrange things. It will make the diff look a lot 
bigger than what it is. Is that okay?


Alex


+
   static int sii902x_probe(struct i2c_client *client,
 const struct i2c_device_id *id)
   {
	struct device *dev = &client->dev;
-   unsigned int status = 0;
struct sii902x *sii902x;
-   u8 chipid[4];
int ret;
ret = i2c_check_functionality(client->adapter,
@@ -989,6 +992,43 @@ static int sii902x_probe(struct i2c_client *client,
	mutex_init(&sii902x->mutex);
+   sii902x->iovcc = devm_regulator_get(dev, "iovcc");
+   if (IS_ERR(sii902x->iovcc))
+   return PTR_ERR(sii902x->iovcc);
+
+   sii902x->cvcc12 = devm_regulator_get(dev, "cvcc12");
+   if (IS_ERR(sii902x->cvcc12))
+   return PTR_ERR(sii902x->cvcc12);
+
+   ret = regulator_enable(sii902x->iovcc);
+   if (ret < 0) {
+   dev_err_probe(dev, ret, "Failed to enable iovcc supply");
+   return ret;
+   }
+
+   ret = regulator_enable(sii902x->cvcc12);
+   if (ret < 0) {
+   dev_err_probe(dev, ret, "Failed to enable cvcc12 supply");
+   regulator_disable(sii902x->iovcc);
+   return ret;
+   }
+
+   ret = sii902x_init(sii902x);
+   if (ret < 0) {
+   regulator_disable(sii902x->cvcc12);
+   regulator_disable(sii902x->iovcc);
+   }
+
+   return ret;
+}
+
+static int sii902x_init(struct sii902x *sii902x)
+{
+   struct device *dev = &sii902x->i2c->dev;
+   unsigned int status = 0;
+   u8 chipid[4];
+   int ret;
+
sii902x_reset(sii902x);
ret = regmap_write(sii902x->regmap, SII902X_REG_TPI_RQB, 0x0);
@@ -1012,11 +1052,11 @@ static int sii902x_probe(struct i2c_client *client,
	regmap_read(sii902x->regmap, SII902X_INT_STATUS, &status);
regmap_write(sii902x->regmap, SII902X_INT_STATUS, status);
-   if (client->irq > 0) {
+   if (sii902x->i2c->irq > 0) {
regmap_write(sii902x->regmap, SII902X_INT_ENABLE,
 SII902X_HOTPLUG_EVENT);
-   ret = devm_request_threaded_irq(dev, client->irq, NULL,
+   ret = devm_request_threaded_irq(dev, sii902x->i2c->irq, NULL,
sii902x_interrupt,
IRQF_ONESHOT, dev_name(dev),
sii902x);
@@ -1031,9 +1071,9 @@ static int sii902x_probe(struct i2c_client *client,
sii902x_audio_codec_init(sii902x, dev);
-   i2c_set_clientdata(client, sii902x);
+   i2c_set_clientdata(sii902x->i2c, sii902x);
-   sii902x->i2cmux = i2c_mux_alloc(client->adapter, dev,
+   sii902x->i2cmux = i2c_mux_alloc(sii902x->i2c->adapter, dev,
1, 0, I2C_MUX_GATE,
sii902x_i2c_bypass_select,
sii902x_i2c_bypass_deselect);
@@ -1051,6 +1091,8 @@ static int sii902x_remove(struct i2c_client *client)
i2c_mux_del_adapters(sii902x->i2cmux);
	drm_bridge_remove(&sii902x->bridge);
+   regulator_disable(sii902x->cvcc12);
+   regulator_disable(sii902x->iovcc);
return 0;
   }




Re: [PATCH v2 1/2] drm/bridge: sii902x: Enable I/O and core VCC supplies if present

2020-10-19 Thread Alex G.

On 9/28/20 12:30 PM, Alexandru Gagniuc wrote:

On the SII9022, the IOVCC and CVCC12 supplies must reach the correct
voltage before the reset sequence is initiated. On most boards, this
assumption is true at boot-up, so initialization succeeds.

However, when we try to initialize the chip with incorrect supply
voltages, it will not respond to I2C requests. sii902x_probe() fails
with -ENXIO.

To resolve this, look for the "iovcc" and "cvcc12" regulators, and
make sure they are enabled before starting the reset sequence. If
these supplies are not available in devicetree, then they will default
to dummy-regulator. In that case everything will work like before.

This was observed on a STM32MP157C-DK2 booting in u-boot falcon mode.
On this board, the supplies would be set by the second stage
bootloader, which does not run in falcon mode.

Signed-off-by: Alexandru Gagniuc 
---
Changes since v1:
   * Fix return code after regulator_enable(sii902x->iovcc) fails (Fabio 
Estevam)
   * Use dev_err_probe() instead of dev_err() where appropriate (Sam Ravnborg)

  drivers/gpu/drm/bridge/sii902x.c | 54 
  1 file changed, 48 insertions(+), 6 deletions(-)


This patch seems to have entered fall dormancy. Did I miss somebody in 
the CC field?


Alex



diff --git a/drivers/gpu/drm/bridge/sii902x.c b/drivers/gpu/drm/bridge/sii902x.c
index 33fd33f953ec..d15e9f2c0d8a 100644
--- a/drivers/gpu/drm/bridge/sii902x.c
+++ b/drivers/gpu/drm/bridge/sii902x.c
@@ -17,6 +17,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  
  #include 

@@ -168,6 +169,8 @@ struct sii902x {
struct drm_connector connector;
struct gpio_desc *reset_gpio;
struct i2c_mux_core *i2cmux;
+   struct regulator *iovcc;
+   struct regulator *cvcc12;
/*
 * Mutex protects audio and video functions from interfering
 * each other, by keeping their i2c command sequences atomic.
@@ -954,13 +957,13 @@ static const struct drm_bridge_timings 
default_sii902x_timings = {
 | DRM_BUS_FLAG_DE_HIGH,
  };
  
+static int sii902x_init(struct sii902x *sii902x);

+
  static int sii902x_probe(struct i2c_client *client,
 const struct i2c_device_id *id)
  {
	struct device *dev = &client->dev;
-   unsigned int status = 0;
struct sii902x *sii902x;
-   u8 chipid[4];
int ret;
  
  	ret = i2c_check_functionality(client->adapter,

@@ -989,6 +992,43 @@ static int sii902x_probe(struct i2c_client *client,
  
	mutex_init(&sii902x->mutex);
  
+	sii902x->iovcc = devm_regulator_get(dev, "iovcc");

+   if (IS_ERR(sii902x->iovcc))
+   return PTR_ERR(sii902x->iovcc);
+
+   sii902x->cvcc12 = devm_regulator_get(dev, "cvcc12");
+   if (IS_ERR(sii902x->cvcc12))
+   return PTR_ERR(sii902x->cvcc12);
+
+   ret = regulator_enable(sii902x->iovcc);
+   if (ret < 0) {
+   dev_err_probe(dev, ret, "Failed to enable iovcc supply");
+   return ret;
+   }
+
+   ret = regulator_enable(sii902x->cvcc12);
+   if (ret < 0) {
+   dev_err_probe(dev, ret, "Failed to enable cvcc12 supply");
+   regulator_disable(sii902x->iovcc);
+   return ret;
+   }
+
+   ret = sii902x_init(sii902x);
+   if (ret < 0) {
+   regulator_disable(sii902x->cvcc12);
+   regulator_disable(sii902x->iovcc);
+   }
+
+   return ret;
+}
+
+static int sii902x_init(struct sii902x *sii902x)
+{
+   struct device *dev = &sii902x->i2c->dev;
+   unsigned int status = 0;
+   u8 chipid[4];
+   int ret;
+
sii902x_reset(sii902x);
  
  	ret = regmap_write(sii902x->regmap, SII902X_REG_TPI_RQB, 0x0);

@@ -1012,11 +1052,11 @@ static int sii902x_probe(struct i2c_client *client,
	regmap_read(sii902x->regmap, SII902X_INT_STATUS, &status);
regmap_write(sii902x->regmap, SII902X_INT_STATUS, status);
  
-	if (client->irq > 0) {

+   if (sii902x->i2c->irq > 0) {
regmap_write(sii902x->regmap, SII902X_INT_ENABLE,
 SII902X_HOTPLUG_EVENT);
  
-		ret = devm_request_threaded_irq(dev, client->irq, NULL,

+   ret = devm_request_threaded_irq(dev, sii902x->i2c->irq, NULL,
sii902x_interrupt,
IRQF_ONESHOT, dev_name(dev),
sii902x);
@@ -1031,9 +1071,9 @@ static int sii902x_probe(struct i2c_client *client,
  
  	sii902x_audio_codec_init(sii902x, dev);
  
-	i2c_set_clientdata(client, sii902x);

+   i2c_set_clientdata(sii902x->i2c, sii902x);
  
-	sii902x->i2cmux = i2c_mux_alloc(client->adapter, dev,

+   sii902x->i2cmux = i2c_mux_alloc(sii902x->i2c->adapter, dev,
1, 0, I2C_MUX_GATE,
sii902x_i2c_bypass_select,

Re: [PATCH 1/2] drm/bridge: sii902x: Enable I/O and core VCC supplies if present

2020-09-28 Thread Alex G.

On 9/26/20 1:49 PM, Sam Ravnborg wrote:

Hi Alexandru

On Thu, Sep 24, 2020 at 03:05:05PM -0500, Alexandru Gagniuc wrote:

On the SII9022, the IOVCC and CVCC12 supplies must reach the correct
voltage before the reset sequence is initiated. On most boards, this
assumption is true at boot-up, so initialization succeeds.

However, when we try to initialize the chip with incorrect supply
voltages, it will not respond to I2C requests. sii902x_probe() fails
with -ENXIO.

To resolve this, look for the "iovcc" and "cvcc12" regulators, and
make sure they are enabled before starting the reset sequence. If
these supplies are not available in devicetree, then they will default
to dummy-regulator. In that case everything will work like before.

This was observed on a STM32MP157C-DK2 booting in u-boot falcon mode.
On this board, the supplies would be set by the second stage
bootloader, which does not run in falcon mode.

Signed-off-by: Alexandru Gagniuc 


One nitpick here. The binding should be present in the tree before
you start using it. So this patch should be applied after the binding.


It is :)
  * arch/arm/boot/dts/stm32mp15xx-dkx.dtsi

Alex


One detail below - I think others have already commented on the rest.

[snip]


Re: [PATCH 4/5] PCI: only return true when dev io state is really changed

2020-09-25 Thread Alex G.

Hi Ethan,

On 9/24/20 9:34 PM, Ethan Zhao wrote:

When an uncorrectable error happens, the AER and DPC driver interrupt
handlers likely call
pcie_do_recovery()->pci_walk_bus()->report_frozen_detected() with
pci_channel_io_frozen at the same time.
If pci_dev_set_io_state() returns true even when the original state is
already pci_channel_io_frozen, the AER and DPC handlers re-enter
the error detection and recovery procedure one after another.
The result is a recovery flow mixed between AER and DPC.
So simplify pci_dev_set_io_state() to only return true
when dev->error_state is actually changed.

Signed-off-by: Ethan Zhao 
Tested-by: Wen jin 
Tested-by: Shanshan Zhang 
---
  drivers/pci/pci.h | 31 +++
  1 file changed, 3 insertions(+), 28 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index fa12f7cbc1a0..d420bb977f3b 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -362,35 +362,10 @@ static inline bool pci_dev_set_io_state(struct pci_dev 
*dev,
bool changed = false;
  
	device_lock_assert(&dev->dev);

-   switch (new) {
-   case pci_channel_io_perm_failure:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   case pci_channel_io_perm_failure:
-   changed = true;
-   break;
-   }
-   break;
-   case pci_channel_io_frozen:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   changed = true;
-   break;
-   }
-   break;
-   case pci_channel_io_normal:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   changed = true;
-   break;
-   }
-   break;
-   }
-   if (changed)
+   if (dev->error_state != new) {
dev->error_state = new;
+   changed = true;
+   }
return changed;
  }


The flow is a lot easier to follow now. Thank you.

Reviewed-by: Alexandru Gagniuc 


Re: [PATCH 1/2] drm/bridge: sii902x: Enable I/O and core VCC supplies if present

2020-09-24 Thread Alex G.

On 9/24/20 3:22 PM, Fabio Estevam wrote:
Hi Fabio,


On Thu, Sep 24, 2020 at 5:16 PM Alexandru Gagniuc  wrote:


+   ret = regulator_enable(sii902x->cvcc12);
+   if (ret < 0) {
+   dev_err(dev, "Failed to enable cvcc12 supply: %d\n", ret);
+   regulator_disable(sii902x->iovcc);
+   return PTR_ERR(sii902x->cvcc12);


return ret;


Thank you for catching that. I will fix it in v2.



 ret = regmap_write(sii902x->regmap, SII902X_REG_TPI_RQB, 0x0);
@@ -1012,11 +1052,11 @@ static int sii902x_probe(struct i2c_client *client,
	regmap_read(sii902x->regmap, SII902X_INT_STATUS, &status);
 regmap_write(sii902x->regmap, SII902X_INT_STATUS, status);

-   if (client->irq > 0) {
+   if (sii902x->i2c->irq > 0) {


Unrelated change.

[snip]

  Unrelated change.

[snip]

Unrelated change.

[snip]

Unrelated change.


The i2c initialization is moved into a separate function. Hence 'client' 
is no longer available. Instead, it can be accessed via 'sii902x->i2c'.


Alex


Re: [PATCH v3 3/3] PCI: pciehp: Add dmi table for in-band presence disabled

2019-10-21 Thread Alex G.

On 10/21/19 1:19 PM, Stuart Hayes wrote:



On 10/21/19 8:47 AM, Mika Westerberg wrote:

On Thu, Oct 17, 2019 at 03:32:56PM -0400, Stuart Hayes wrote:

Some systems have in-band presence detection disabled for hot-plug PCI
slots, but do not report this in the slot capabilities 2 (SLTCAP2) register.
On these systems, presence detect can become active well after the link is
reported to be active, which can cause the slots to be disabled after a
device is connected.

Add a dmi table to flag these systems as having in-band presence disabled.

Signed-off-by: Stuart Hayes 
---
  drivers/pci/hotplug/pciehp_hpc.c | 14 ++
  1 file changed, 14 insertions(+)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 02eb811a014f..4d377a2a62ce 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -14,6 +14,7 @@
  
  #define dev_fmt(fmt) "pciehp: " fmt
  
+#include 

  #include 
  #include 
  #include 
@@ -26,6 +27,16 @@
  #include "../pci.h"
  #include "pciehp.h"
  
+static const struct dmi_system_id inband_presence_disabled_dmi_table[] = {

+   {
+   .ident = "Dell System",
+   .matches = {
+   DMI_MATCH(DMI_OEM_STRING, "Dell System"),


Sorry if this has been discussed previously already but isn't this going
to apply on all Dell systems, not just the affected ones? Is this the
intention?



Yes, that is the intention. Applying this just makes the hotplug code wait for
the presence detect bit to be set before proceeding, which ideally wouldn't hurt
anything--for devices that don't have inband presence detect disabled, presence
detect should already be up when the code in patch 2/3 starts to wait for it.

The only issue should be with broken hotplug implementations that don't ever
bring presence detect active (these apparently exist)--but even those would
still work, they would just take an extra second to come up.

On the other hand, a number of Dell systems have (and will have) NVMe
implementations that have inband presence detect disabled (but they won't have
the new bit implemented to report that), and they don't work correctly without
this.


I think it would be clearer if this were explained in a comment: that it
doesn't break anything, and that we're okay with it applying to all hotplug
ports, even those that are not in front of an NVMe backplane.
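
Something along these lines (the wording is only a suggestion):

static const struct dmi_system_id inband_presence_disabled_dmi_table[] = {
	/*
	 * Match all Dell systems, as some Dell systems have in-band
	 * presence disabled on NVMe slots but do not report it in SLTCAP2.
	 * Setting the quirk on ports that don't need it should be
	 * harmless: presence detect is normally already up when we start
	 * waiting for it, and slots that never assert it merely take an
	 * extra second to come up.
	 */
	{
		.ident = "Dell System",
		.matches = {
			DMI_MATCH(DMI_OEM_STRING, "Dell System"),
		},
	},
	{}
};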


Alex


Re: [PATCH 0/3] PCI: pciehp: Do not turn off slot if presence comes up after link

2019-10-02 Thread Alex G.

On 10/1/19 11:13 PM, Lukas Wunner wrote:

On Tue, Oct 01, 2019 at 05:14:16PM -0400, Stuart Hayes wrote:

This patch set is based on a patch set [1] submitted many months ago by
Alexandru Gagniuc, who is no longer working on it.

[1] https://patchwork.kernel.org/cover/10909167/
 [v3,0/4] PCI: pciehp: Do not turn off slot if presence comes up after link


If I'm not mistaken, these two are identical to Alex' patches, right?

   PCI: pciehp: Add support for disabling in-band presence
   PCI: pciehp: Wait for PDS when in-band presence is disabled

I'm not sure if it's appropriate to change the author and
omit Alex' Signed-off-by.


Legally Dell owns the patches. I have no objection on my end.

Alex


Otherwise I have no objections against this series.

Thanks,

Lukas



Re: [PATCH 3/3] PCI: pciehp: Add dmi table for in-band presence disabled

2019-10-01 Thread Alex G.




On 10/1/19 4:14 PM, Stuart Hayes wrote:

Some systems have in-band presence detection disabled for hot-plug PCI slots,
but do not report this in the slot capabilities 2 (SLTCAP2) register.  On
these systems, presence detect can become active well after the link is
reported to be active, which can cause the slots to be disabled after a
device is connected.

Add a dmi table to flag these systems as having in-band presence disabled.

Signed-off-by: Stuart Hayes 
---
  drivers/pci/hotplug/pciehp_hpc.c | 14 ++
  1 file changed, 14 insertions(+)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 1282641c6458..1dd01e752587 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -14,6 +14,7 @@
  
  #define dev_fmt(fmt) "pciehp: " fmt
  
+#include 

  #include 
  #include 
  #include 
@@ -26,6 +27,16 @@
  #include "../pci.h"
  #include "pciehp.h"
  
+static const struct dmi_system_id inband_presence_disabled_dmi_table[] = {

+   {
+   .ident = "Dell System",
+   .matches = {
+   DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc"),
+   },
+   },
+   {}
+};
+


I'm not sure that all Dell systems that were ever made or will be made 
have in-band presence disabled on all their hotplug slots.


This was a problem with the NVMe hot-swap bays on 14G servers. I can't 
guarantee that any other slot or machine will need this workaround. The 
best way I found to implement this is to check for the PCI-ID of the 
switches behind the port.
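
For the record, a rough sketch of what such a check could look like (the
vendor/device IDs are placeholders, not the actual parts), in case anyone
wants to narrow the quirk further:

/* sketch: apply the quirk only when a known backplane switch sits below */
#include <linux/pci.h>	/* assumes drivers/pci context for pci_bus_sem */

static bool slot_has_known_nvme_switch(struct pci_dev *bridge)
{
	struct pci_bus *bus = bridge->subordinate;
	struct pci_dev *child;
	bool found = false;

	if (!bus)
		return false;

	down_read(&pci_bus_sem);
	list_for_each_entry(child, &bus->devices, bus_list) {
		/* 0xabcd:0x1234 is a placeholder for the real switch ID */
		if (child->vendor == 0xabcd && child->device == 0x1234) {
			found = true;
			break;
		}
	}
	up_read(&pci_bus_sem);

	return found;
}

The catch is that the devices below the port may not be enumerated yet
when pciehp needs the answer, which is part of why the DMI match ends up
being the simpler compromise.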


Alex


Re: [PATCH] Revert "PCI/LINK: Report degraded links via link bandwidth notification"

2019-04-29 Thread Alex G

On 4/29/19 1:56 PM, Bjorn Helgaas wrote:

From: Bjorn Helgaas 

This reverts commit e8303bb7a75c113388badcc49b2a84b4121c1b3e.

e8303bb7a75c added logging whenever a link changed speed or width to a
state that is considered degraded.  Unfortunately, it cannot differentiate
signal integrity-related link changes from those intentionally initiated by
an endpoint driver, including drivers that may live in userspace or VMs
when making use of vfio-pci.  Some GPU drivers actively manage the link
state to save power, which generates a stream of messages like this:

   vfio-pci :07:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 2.5 
GT/s x16 link at :00:02.0 (capable of 64.000 Gb/s with 5 GT/s x16 link)

We really *do* want to be alerted when the link bandwidth is reduced
because of hardware failures, but degradation by intentional link state
management is probably far more common, so the signal-to-noise ratio is
currently low.

Until we figure out a way to identify the real problems or silence the
intentional situations, revert the following commits, which include the
initial implementation (e8303bb7a75c) and subsequent fixes:


I think we're overreacting to a bit of perceived verbosity in the system 
log. Intentional degradation does not seem to me to be as common as 
advertised. I have not observed this with either radeon, nouveau, or 
amdgpu, and the proper mechanism to save power at the link level is 
ASPM. I stand to be corrected and we have on CC some very knowledgeable 
fellows that I am certain will jump at the opportunity to do so.


What it seems like to me is that a proprietary driver running in a VM is 
initiating these changes. And if that is the case then it seems this is 
a virtualization problem. A quick glance over GPU drivers in linux did 
not reveal any obvious places where we intentionally downgrade a link.


I'm not convinced a revert is the best call.

Alex


 e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth 
notification")
 3e82a7f9031f ("PCI/LINK: Supply IRQ handler so level-triggered IRQs are 
acked")
 55397ce8df48 ("PCI/LINK: Clear bandwidth notification interrupt before enabling 
it")
 0fa635aec9ab ("PCI/LINK: Deduplicate bandwidth reports for multi-function 
devices")

Link: 
https://lore.kernel.org/lkml/155597243666.19387.1205950870601742062.st...@gimli.home
Link: 
https://lore.kernel.org/lkml/155605909349.3575.13433421148215616375.st...@gimli.home
Signed-off-by: Bjorn Helgaas 
CC: Alexandru Gagniuc 
CC: Lukas Wunner 
CC: Alex Williamson 
---
  drivers/pci/pci.h  |   1 -
  drivers/pci/pcie/Makefile  |   1 -
  drivers/pci/pcie/bw_notification.c | 121 -
  drivers/pci/pcie/portdrv.h |   6 +-
  drivers/pci/pcie/portdrv_core.c|  17 ++--
  drivers/pci/pcie/portdrv_pci.c |   1 -
  drivers/pci/probe.c|   2 +-
  7 files changed, 7 insertions(+), 142 deletions(-)
  delete mode 100644 drivers/pci/pcie/bw_notification.c

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index d994839a3e24..224d88634115 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -273,7 +273,6 @@ enum pcie_link_width pcie_get_width_cap(struct pci_dev 
*dev);
  u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed,
   enum pcie_link_width *width);
  void __pcie_print_link_status(struct pci_dev *dev, bool verbose);
-void pcie_report_downtraining(struct pci_dev *dev);
  
  /* Single Root I/O Virtualization */

  struct pci_sriov {
diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index f1d7bc1e5efa..ab514083d5d4 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -3,7 +3,6 @@
  # Makefile for PCI Express features and port driver
  
  pcieportdrv-y			:= portdrv_core.o portdrv_pci.o err.o

-pcieportdrv-y  += bw_notification.o
  
  obj-$(CONFIG_PCIEPORTBUS)	+= pcieportdrv.o
  
diff --git a/drivers/pci/pcie/bw_notification.c b/drivers/pci/pcie/bw_notification.c

deleted file mode 100644
index 4fa9e3523ee1..
--- a/drivers/pci/pcie/bw_notification.c
+++ /dev/null
@@ -1,121 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0+
-/*
- * PCI Express Link Bandwidth Notification services driver
- * Author: Alexandru Gagniuc 
- *
- * Copyright (C) 2019, Dell Inc
- *
- * The PCIe Link Bandwidth Notification provides a way to notify the
- * operating system when the link width or data rate changes.  This
- * capability is required for all root ports and downstream ports
- * supporting links wider than x1 and/or multiple link speeds.
- *
- * This service port driver hooks into the bandwidth notification interrupt
- * and warns when links become degraded in operation.
- */
-
-#include "../pci.h"
-#include "portdrv.h"
-
-static bool pcie_link_bandwidth_notification_supported(struct pci_dev *dev)
-{
-   int ret;
-   u32 lnk_cap;
-
-   ret = pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnk_cap);

Re: [PATCH] PCI: Add link_change error handler and vfio-pci user

2019-04-24 Thread Alex G

On 4/24/19 12:19 PM, Alex Williamson wrote:

On Wed, 24 Apr 2019 16:45:45 +
 wrote:

On 4/23/2019 5:42 PM, Alex Williamson wrote:

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 7e12d0163863..233cd4b5b6e8 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2403,6 +2403,19 @@ void pcie_report_downtraining(struct pci_dev *dev)


I don't think you want to change pcie_report_downtraining(). You're
advertising to "report" something, by nomenclature, but then go around
and also call a notification callback. This is also used during probe,
and you've now just killed your chance to notice you've booted with a
degraded link.
If what you want to do is silence the bandwidth notification, you want
to modify the threaded interrupt that calls this.


During probe, ie. discovery, a device wouldn't have a driver attached,
so we'd fall through to simply printing the link status.  Nothing
lost afaict.  The "report" verb doesn't have a subject here, report to
whom?  Therefore I thought it reasonable that a driver ask that it be
reported to them via a callback.  I don't see that as such a stretch of
the interface.


That's just stretching the logic, and IMO makes the intent harder to 
understand. The argument relies on the state of the PCI device and 
logic, which is not obvious to the casual observer. If you want to 
bypass the bandwidth notification, then bypass the notification.



if (PCI_FUNC(dev->devfn) != 0 || dev->is_virtfn)
return;
   
+	/*
+	 * If driver handles link_change event, defer to driver.  PCIe drivers
+	 * can call pcie_print_link_status() to print current link info.
+	 */
+	device_lock(&dev->dev);
+	if (dev->driver && dev->driver->err_handler &&
+	    dev->driver->err_handler->link_change) {
+		dev->driver->err_handler->link_change(dev);
+		device_unlock(&dev->dev);
+		return;
+	}
+	device_unlock(&dev->dev);


Can we write this such that there is a single lock()/unlock() pair?


Not without introducing a tracking variable, ex.

[snip bad code]

That's not markedly better imo, but if it's preferred I can send a v2.


How about:

if (!invoke_link_changed_handler(pdev))
very_useful_downtraining_message(pdev);
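
i.e. a helper along these lines (names are placeholders, and it relies on
the link_change err_handler callback proposed in this patch), which keeps
a single lock()/unlock() pair and keeps the decision out of
pcie_report_downtraining() itself:

/* sketch: returns true if a driver consumed the link-change event */
static bool invoke_link_changed_handler(struct pci_dev *dev)
{
	bool handled = false;

	device_lock(&dev->dev);
	if (dev->driver && dev->driver->err_handler &&
	    dev->driver->err_handler->link_change) {
		dev->driver->err_handler->link_change(dev);
		handled = true;
	}
	device_unlock(&dev->dev);

	return handled;
}

The bandwidth notification path would then only print the downtraining
message when this returns false, leaving the probe-time report untouched.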


Alex

Alex


Re: [PATCH] PCI/LINK: Account for BW notification in vector calculation

2019-04-23 Thread Alex G




On 4/22/19 5:43 PM, Alex Williamson wrote:

On systems that don't support any PCIe services other than bandwidth
notification, pcie_message_numbers() can return zero vectors, causing
the vector reallocation in pcie_port_enable_irq_vec() to retry with
zero, which fails, resulting in fallback to INTx (which might be
broken) for the bandwidth notification service.  This can resolve
spurious interrupt faults due to this service on some systems.

Fixes: e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth 
notification")
Signed-off-by: Alex Williamson 
---


+1
Tested on some Dell servers. Everything works as expected. I don't have 
a system with a device that only supports bandwidth notification.


Alex


Re: [PATCH] PCI/LINK: Account for BW notification in vector calculation

2019-04-23 Thread Alex G

On 4/23/19 12:10 PM, Bjorn Helgaas wrote:

On Tue, Apr 23, 2019 at 09:33:53AM -0500, Alex G wrote:

On 4/22/19 7:33 PM, Alex Williamson wrote:

There is nothing wrong happening here that needs to fill logs.  I
thought maybe if I enabled notification of autonomous bandwidth
changes that it might categorize these as something we could
ignore, but it doesn't.  How can we identify only cases where this
is an erroneous/noteworthy situation?  Thanks,


You don't. Ethernet doesn't. USB doesn't. This logging behavior is
consistent with every other subsystem that deals with multi-speed links.


Can you point me to the logging in these other subsystems so I can
learn more about how they deal with this?


I don't have any in-depth articles about the logging in these systems, 
but I can extract some logs from my machines.


Ethernet:

[Sun Apr 21 11:14:06 2019] e1000e: eno1 NIC Link is Down
[Sun Apr 21 11:14:17 2019] e1000e: eno1 NIC Link is Up 1000 Mbps Full 
Duplex, Flow Control: Rx/Tx
[Sun Apr 21 11:14:23 2019] e1000e: eno1 NIC Link is Up 1000 Mbps Full 
Duplex, Flow Control: Rx/Tx

[Sun Apr 21 23:33:31 2019] e1000e: eno1 NIC Link is Down
[Sun Apr 21 23:33:43 2019] e1000e: eno1 NIC Link is Up 1000 Mbps Full 
Duplex, Flow Control: Rx/Tx
[Sun Apr 21 23:33:48 2019] e1000e: eno1 NIC Link is Up 1000 Mbps Full 
Duplex, Flow Control: Rx/Tx


I used to have one of these "green" ethernet switches that went down to 
100mbps automatically. You can imagine how "clogged" the logs were with 
link up messages. Thank goodness that switch was killed in a thunderstorm.


USB will log every device insertion and removal, very verbosely (see 
appendix A).




I agree that emitting log messages for normal and expected events will
lead to user confusion and we need to do something.

e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth
notification") was merged in v5.1-rc1, so we still have (a little)
time to figure this out before v5.1.


I always viewed the system log as a system log, instead of a database of 
system errors. I may have extremist views, but going back to Alex's 
example, I prefer to see that the power saving mechanism is doing 
something to save power on my laptop (I'll just ignore it on a desktop).


If you think it's worth increasing code complexity because people don't want
certain things logged in the system log, then I'm certain we can work out
some sane solution. It's the same problem we see with GCC, where people want
warning messages here, but don't want the same messages there.


Alex


P.S. The pedantic in me points out that one of the examples I gave is a 
terrible example. ASPM "allows hardware-autonomous, dynamic Link power 
reduction beyond what is achievable by software-only control" [1].


[1] PCI-Express 3.0 -- 5.4.1. Active State Power Management (ASPM)


Appendix A:

[1618067.987084] usb 1-3.5: new high-speed USB device number 79 using 
xhci_hcd
[1618068.179914] usb 1-3.5: New USB device found, idVendor=0bda, 
idProduct=4014, bcdDevice= 0.05
[1618068.179924] usb 1-3.5: New USB device strings: Mfr=3, Product=1, 
SerialNumber=2

[1618068.179930] usb 1-3.5: Product: USB Audio
[1618068.179936] usb 1-3.5: Manufacturer: Generic
[1618068.179941] usb 1-3.5: SerialNumber: 200901010001
[1618068.280100] usb 1-3.6: new low-speed USB device number 80 using 
xhci_hcd

[1618068.342541] Bluetooth: hci0: Waiting for firmware download to complete
[1618068.342795] Bluetooth: hci0: Firmware loaded in 1509081 usecs
[1618068.342887] Bluetooth: hci0: Waiting for device to boot
[1618068.354919] Bluetooth: hci0: Device booted in 11797 usecs
[1618068.356006] Bluetooth: hci0: Found Intel DDC parameters: 
intel/ibt-12-16.ddc

[1618068.358958] Bluetooth: hci0: Applying Intel DDC parameters completed
[1618068.378624] usb 1-3.6: New USB device found, idVendor=04d9, 
idProduct=1400, bcdDevice= 1.43
[1618068.378626] usb 1-3.6: New USB device strings: Mfr=0, Product=0, 
SerialNumber=0
[1618068.390686] input: HID 04d9:1400 as 
/devices/pci:00/:00:14.0/usb1/1-3/1-3.6/1-3.6:1.0/0003:04D9:1400.0139/input/input921
[1618068.444282] hid-generic 0003:04D9:1400.0139: input,hidraw1: USB HID 
v1.10 Keyboard [HID 04d9:1400] on usb-:00:14.0-3.6/input0
[1618068.456373] input: HID 04d9:1400 Mouse as 
/devices/pci:00/:00:14.0/usb1/1-3/1-3.6/1-3.6:1.1/0003:04D9:1400.013A/input/input922
[1618068.457929] input: HID 04d9:1400 Consumer Control as 
/devices/pci:00/:00:14.0/usb1/1-3/1-3.6/1-3.6:1.1/0003:04D9:1400.013A/input/input923
[1618068.509294] input: HID 04d9:1400 System Control as 
/devices/pci:00/:00:14.0/usb1/1-3/1-3.6/1-3.6:1.1/0003:04D9:1400.013A/input/input924
[1618068.509518] hid-generic 0003:04D9:1400.013A: input,hidraw2: USB HID 
v1.10 Mouse [HID 04d9:1400] on usb-:00:14.0-3.6/input1
[1618068.588078] usb 1-3.7: new full-speed USB device number 81 using 
xhci_hcd
[1618068.679132] usb 1-3.7: New USB device found, idVendor=046d, 
idProduct=c52b, bcdDevice=12.03
[1

Re: [PATCH] PCI/LINK: Account for BW notification in vector calculation

2019-04-23 Thread Alex G

On 4/23/19 11:22 AM, Alex Williamson wrote:

Nor should pci-core decide what link speed changes are intended or
errors.  Minimally we should be enabling drivers to receive this
feedback.  Thanks,


Not errors. pci core reports that a link speed change event has occured. 
Period.


Alex


Re: [PATCH] PCI/LINK: Account for BW notification in vector calculation

2019-04-23 Thread Alex G




On 4/23/19 10:34 AM, Alex Williamson wrote:

On Tue, 23 Apr 2019 09:33:53 -0500
Alex G  wrote:


On 4/22/19 7:33 PM, Alex Williamson wrote:

On Mon, 22 Apr 2019 19:05:57 -0500
Alex G  wrote:

echo :07:00.0:pcie010 |
sudo tee /sys/bus/pci_express/drivers/pcie_bw_notification/unbind


That's a bad solution for users, this is meaningless tracking of a
device whose driver is actively managing the link bandwidth for power
purposes.


0.5W savings on a 100+W GPU? I agree it's meaningless.


Evidence?  Regardless, I don't have control of the driver that's making
these changes, but the claim seems unfounded and irrelevant.


The number of 5mW/Gb/lane doesn't ring a bell? [1] [2]. Your GPU 
supports 5Gb/s, so likely using an older, more power hungry process. I 
suspect it's still within the same order of magnitude.
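
As a rough back-of-the-envelope, using that ~5 mW/Gb/lane figure and
ignoring encoding overhead: a x16 link dropping from 5 GT/s to 2.5 GT/s
sheds roughly 16 lanes * 2.5 Gb/s * 5 mW/Gb, or about 0.2 W per direction,
so on the order of the 0.5 W I mentioned earlier for both directions
combined.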




I'm assigning a device to a VM [snip]
I can see why we might want to be notified of degraded links due to signal 
issues,
but what I'm reporting is that there are also entirely normal reasons
[snip] we can't seem to tell the difference


Unfortunately, there is no way in PCI-Express to distinguish between an 
expected link bandwidth change and one due to error.


If you're using virt-manager to configure the VM, then virt-manager 
could have a checkbox to disable link bandwidth management messages. I'd 
rather we avoid kernel-side heuristics (like Lukas suggested). If you're 
confident that your link will operate as intended, and don't want 
messages about it, that's your call as a user -- we shouldn't decide 
this in the kernel.


Alex

[1] 
https://www.synopsys.com/designware-ip/technical-bulletin/reduce-power-consumption.html


Re: [PATCH] PCI/LINK: Account for BW notification in vector calculation

2019-04-23 Thread Alex G

On 4/22/19 7:33 PM, Alex Williamson wrote:

On Mon, 22 Apr 2019 19:05:57 -0500
Alex G  wrote:

echo :07:00.0:pcie010 |
sudo tee /sys/bus/pci_express/drivers/pcie_bw_notification/unbind


That's a bad solution for users, this is meaningless tracking of a
device whose driver is actively managing the link bandwidth for power
purposes. 


0.5W savings on a 100+W GPU? I agree it's meaningless.


There is nothing wrong happening here that needs to fill
logs.  I thought maybe if I enabled notification of autonomous
bandwidth changes that it might categorize these as something we could
ignore, but it doesn't.
How can we identify only cases where this is
an erroneous/noteworthy situation?  Thanks,


You don't. Ethernet doesn't. USB doesn't. This logging behavior is 
consistent with every other subsystem that deals with multi-speed links. 
I realize some people are very resistant to change (and use very ancient 
kernels). I do not, however, agree that this is a sufficient argument to 
dis-unify behavior.


Alex


Re: [PATCH] PCI/LINK: Account for BW notification in vector calculation

2019-04-22 Thread Alex G

On 4/22/19 5:43 PM, Alex Williamson wrote:

[  329.725607] vfio-pci :07:00.0: 32.000 Gb/s available PCIe bandwidth,
limited by 2.5 GT/s x16 link at :00:02.0 (capable of 64.000 Gb/s with 5
GT/s x16 link)
[  708.151488] vfio-pci :07:00.0: 32.000 Gb/s available PCIe bandwidth,
limited by 2.5 GT/s x16 link at :00:02.0 (capable of 64.000 Gb/s with 5
GT/s x16 link)
[  718.262959] vfio-pci :07:00.0: 32.000 Gb/s available PCIe bandwidth,
limited by 2.5 GT/s x16 link at :00:02.0 (capable of 64.000 Gb/s with 5
GT/s x16 link)
[ 1138.124932] vfio-pci :07:00.0: 32.000 Gb/s available PCIe bandwidth,
limited by 2.5 GT/s x16 link at :00:02.0 (capable of 64.000 Gb/s with 5
GT/s x16 link)

What is the value of this nagging?


Good! The bandwidth notification service is working as intended. If this 
bothers you, you can unbind the device from the bandwidth notification 
driver:


echo :07:00.0:pcie010 |
sudo tee /sys/bus/pci_express/drivers/pcie_bw_notification/unbind




diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index 7d04f9d087a6..1b330129089f 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -55,7 +55,8 @@ static int pcie_message_numbers(struct pci_dev *dev, int mask,
 * 7.8.2, 7.10.10, 7.31.2.
 */
  
-	if (mask & (PCIE_PORT_SERVICE_PME | PCIE_PORT_SERVICE_HP)) {

+   if (mask & (PCIE_PORT_SERVICE_PME | PCIE_PORT_SERVICE_HP |
+   PCIE_PORT_SERVICE_BWNOTIF)) {
		pcie_capability_read_word(dev, PCI_EXP_FLAGS, &reg16);
*pme = (reg16 & PCI_EXP_FLAGS_IRQ) >> 9;
nvec = *pme + 1;


Good catch!


Re: [PATCH v1 1/3] PCI / ACPI: Do not export pci_get_hp_params()

2019-04-22 Thread Alex G

On 4/22/19 3:58 PM, Bjorn Helgaas wrote:

On Fri, Feb 08, 2019 at 10:24:11AM -0600, Alexandru Gagniuc wrote:

This is only used within drivers/pci, and there is no reason to make
it available outside of the PCI core.

Signed-off-by: Alexandru Gagniuc 


Applied the whole series to pci/hotplug for v5.2, thanks!

I dropped the "list" member from struct hpx_type3 because it didn't
seem to be used.


That's a good call. That was a vestigial appendage from when I first 
intended to store a list of registers in memory. I'm glad we didn't end 
up needing a list.


Alex


Fixing the GHES driver vs not causing issues in the first place

2019-03-29 Thread Alex G.

The issue of dying inside the GHES driver has popped up a few times before.
I've looked into fixing this before, but we didn't quite come to agreement
because the verbiage in the ACPI spec is vague:
    " When a fatal uncorrected error occurs, the system is
      restarted to prevent propagation of the error. "

This popped up again a couple of times recently [1]. Linus suggested that
fixing the GHES driver might pay higher dividends. I considered reviving an
old fix, but put it aside after hearing concerns about the unintended
consequences, which I'll get to soon.

A few days back, I lost an entire MD RAID1. It was during hot-removal testing
that I do on an irregular basis, and a kernel panic from the GHES driver had
caused the system to go down. I have seen some occasional filesystem
corruption in other crashes, but nothing fsck couldn't fix. The interesting
part is that the array that was lost was not part of the device set that was
being hot-removed. The state of things doesn't seem very joyful.

The machine in question is a Dell r740xd. It uses firmware-first (FFS)
handling for PCIe errors, and is generally good at identifying a hot-removal
and not bothering the OS about it. The events that I am testing for are
situations where, due to a slow operator, a tilted drive, or worn connectors,
errors make it past FFS to the OS -- with "fatal" severity.
    In that case, the GHES driver sees a fatal hardware error and panics. On
this machine, FFS reported it as fatal in an attempt to cause the system to
reboot, and prevent propagation of the error.

    The "prevent propagation of the error" flow was designed around OSes that
can't do recovery. Firmware assumes an instantaneous reboot, so it does not
set itself up to report future errors. The EFI reference code does not re-arm
errors, so we suspect most OEMs' firmware will work this way. Despite the
apparently enormous stupidity of crashing an otherwise perfectly good system,
there are good and well thought out reasons behind it.
    An example is reading a block of data from an NVMe drive, and
encountering a CRC error on the PCIe bus. If we didn't do an "instantaneous
reboot" after a previous fatal error, we will not get the CRC error reported.
Thus, we risk silent data corruption.

    On the Linux side, we can ignore the "fatal" nature of the error, and
even recover the PCIe devices behind the error. However, this is ill advised
for the reason listed above.

    The counter argument is that a panic() is more likely to cause filesystem
corruption. In over one year of testing, I have not seen a single incident of
data corruption, yet I have seen the boot ask me to run fsck on multiple
occasions. This seems to me like a tradeoff problem rather than anything else.

    In the case of this Dell machine, there are ways to hot-swap PCIe devices
without needing to take either of the risks above:
    1. Turn off the downstream port manually.
    2. Insert/replace the drive.
    3. Turn on the downstream port manually.

    What bothers me is the firmware's assumption that a fatal error must
crash the machine. I've looked at the verbiage in the spec, and I don't fully
agree that either side respects the contract. Our _OSC contract with the
firmware says that firmware will report errors to us, while we are not
allowed to touch certain parts of the device. Following the verbiage in the
spec, we do need to reboot on a fatal error, but that reboot is not
guaranteed to be instantaneous. Firmware still has to honor the contract and
report errors to us. This may seem like a spec non-compliance issue.

    It would be great if FFS would mark the errors as recoverable, but the
level of validation required makes this unrealistic on existing machines. New
machines will likely move to the Containment Error Recovery model, which is
essentially FFS for DPC (PCIe Downstream Port Containment). This leaves the
current generation of servers in limbo -- I'm led to believe other server
OEMs have very similar issues.

    Given the dilemma above, I'm really not sure what the right solution is.
How do we produce a fix that addresses complaints from both sides? When
firmware gives us a false positive, should we panic? Should we instead print
a message informing the user/admin that a reboot of the machine is required?
Should we initiate a reboot automatically?

    From Linux's ability to recover, PCIe fatal errors are false positives --
every time, all the time. However, their fatality is not there without any
reason. I do want to avoid opinions that FFS wants us to crash, gives us the
finger, and we should give the finger back.

    Another option that was discussed before, but was very controversial, is
the fix that involves not causing the problem in the first place [1] [2]. On
the machines that I have access to, FFS is handled in SMM, which means that
all CPU cores are held up until the processing of the error is complete. On
one particular machine

Re: [PATCH v2] PCI/LINK: bw_notification: Do not leave interrupt handler NULL

2019-03-25 Thread Alex G.

On 3/25/19 5:25 PM, Bjorn Helgaas wrote:

On Fri, Mar 22, 2019 at 07:36:51PM -0500, Alexandru Gagniuc wrote:

A threaded IRQ with a NULL handler does not work with level-triggered
interrupts. request_threaded_irq() will return an error:

   genirq: Threaded irq requested with handler=NULL and !ONESHOT for irq 16
   pcie_bw_notification: probe of :00:1b.0:pcie010 failed with error -22

For level interrupts we need to silence the interrupt before exiting
the IRQ handler, so just clear the PCI_EXP_LNKSTA_LBMS bit there.

Fixes: e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth 
notification")
Reported-by: Linus Torvalds 
Signed-off-by: Alexandru Gagniuc 


Applied with the following subject line to for-linus for v5.1, thanks!

   PCI/LINK: Supply IRQ handler so level-triggered IRQs are acked


You're so much better at formulating commit messages. That sounds a lot 
smoother. Thanks!



---
Changes since v1:
  - move pcie_update_link_speed() to irq to prevent duplicate read of 
link_status
  - Add Fixes: to commit message
  
  drivers/pci/pcie/bw_notification.c | 19 ++-

  1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pcie/bw_notification.c 
b/drivers/pci/pcie/bw_notification.c
index d2eae3b7cc0f..c48746f1cf3c 100644
--- a/drivers/pci/pcie/bw_notification.c
+++ b/drivers/pci/pcie/bw_notification.c
@@ -44,11 +44,10 @@ static void pcie_disable_link_bandwidth_notification(struct 
pci_dev *dev)
pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnk_ctl);
  }
  
-static irqreturn_t pcie_bw_notification_handler(int irq, void *context)

+static irqreturn_t pcie_bw_notification_irq(int irq, void *context)
  {
struct pcie_device *srv = context;
struct pci_dev *port = srv->port;
-   struct pci_dev *dev;
u16 link_status, events;
int ret;
  
@@ -58,6 +57,17 @@ static irqreturn_t pcie_bw_notification_handler(int irq, void *context)

if (ret != PCIBIOS_SUCCESSFUL || !events)
return IRQ_NONE;
  
+	pcie_capability_write_word(port, PCI_EXP_LNKSTA, events);

+   pcie_update_link_speed(port->subordinate, link_status);
+   return IRQ_WAKE_THREAD;
+}
+
+static irqreturn_t pcie_bw_notification_handler(int irq, void *context)
+{
+   struct pcie_device *srv = context;
+   struct pci_dev *port = srv->port;
+   struct pci_dev *dev;
+
/*
 * Print status from downstream devices, not this root port or
 * downstream switch port.
@@ -67,8 +77,6 @@ static irqreturn_t pcie_bw_notification_handler(int irq, void 
*context)
__pcie_print_link_status(dev, false);
	up_read(&pci_bus_sem);
  
-	pcie_update_link_speed(port->subordinate, link_status);

-   pcie_capability_write_word(port, PCI_EXP_LNKSTA, events);
return IRQ_HANDLED;
  }
  
@@ -80,7 +88,8 @@ static int pcie_bandwidth_notification_probe(struct pcie_device *srv)

if (!pcie_link_bandwidth_notification_supported(srv->port))
return -ENODEV;
  
-	ret = request_threaded_irq(srv->irq, NULL, pcie_bw_notification_handler,

+   ret = request_threaded_irq(srv->irq, pcie_bw_notification_irq,
+  pcie_bw_notification_handler,
   IRQF_SHARED, "PCIe BW notif", srv);
if (ret)
return ret;
--
2.19.2
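
The general pattern the patch applies -- ack the level-triggered source in the
hard handler, then defer the slow work to the thread -- looks roughly like the
sketch below. The device, register, and function names are made up; only the
genirq calls are real.

#include <linux/interrupt.h>
#include <linux/io.h>

#define EXAMPLE_EVENT_BIT 0x1

/* Hypothetical device with a write-1-to-clear status register. */
struct example_dev {
	void __iomem *status_reg;
};

static irqreturn_t example_hard_irq(int irq, void *context)
{
	struct example_dev *ed = context;
	u32 status = readl(ed->status_reg);

	if (!(status & EXAMPLE_EVENT_BIT))
		return IRQ_NONE;		/* not ours (IRQF_SHARED) */

	/* Ack in hard-IRQ context so the level-triggered line deasserts. */
	writel(EXAMPLE_EVENT_BIT, ed->status_reg);
	return IRQ_WAKE_THREAD;			/* hand off the slow work */
}

static irqreturn_t example_thread_fn(int irq, void *context)
{
	/* Process context: safe to sleep, take rwsems, print verbosely. */
	return IRQ_HANDLED;
}

/*
 * Registration, mirroring the patch above:
 *	request_threaded_irq(irq, example_hard_irq, example_thread_fn,
 *			     IRQF_SHARED, "example", ed);
 */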



Re: [PATCH] PCI/LINK: Request a one-shot IRQ with NULL handler

2019-03-25 Thread Alex G.

Hi Borislav,

Thanks for the update. We've settled on a different fix [1], since Lukas 
was not happy with IRQF_ONESHOT [2].


Alex

[1] 
https://lore.kernel.org/linux-pci/20190323003700.7294-1-mr.nuke...@gmail.com/
[2] 
https://lore.kernel.org/linux-pci/20190318043314.noyj6t4sh26sp...@wunner.de/


On 3/25/19 1:23 PM, Borislav Petkov wrote:

From: Borislav Petkov 

Booting v5.1-rc1 causes these new messages in dmesg here:

   genirq: Threaded irq requested with handler=NULL and !ONESHOT for irq 24
   pcie_bw_notification: probe of :00:02.0:pcie010 failed with error -22

because requesting a threaded IRQ with NULL handler needs to be oneshot.
Otherwise a level-triggered interrupt endless loop can happen, see the
detailed explanation above the "Threaded irq... " error message in
kernel/irq/manage.c.

And PCI-MSI type interrupts are one-shot safe but not in the IOAPIC
case:

   24: ... IO-APIC   28-fasteoi   PCIe BW notif, PCIe BW notif, PCIe BW notif
   25: ... IO-APIC   29-fasteoi   PCIe BW notif, PCIe BW notif

Make the BW notification handler oneshot too.

Fixes: e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth 
notification")
Suggested-by: Thomas Gleixner 
Signed-off-by: Borislav Petkov 
Cc: Bjorn Helgaas 
Cc: Lukas Wunner 
Cc: Alexandru Gagniuc 
Cc: linux-...@vger.kernel.org
---
  drivers/pci/pcie/bw_notification.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/bw_notification.c 
b/drivers/pci/pcie/bw_notification.c
index d2eae3b7cc0f..16e11c09f6bd 100644
--- a/drivers/pci/pcie/bw_notification.c
+++ b/drivers/pci/pcie/bw_notification.c
@@ -81,7 +81,7 @@ static int pcie_bandwidth_notification_probe(struct 
pcie_device *srv)
return -ENODEV;
  
  	ret = request_threaded_irq(srv->irq, NULL, pcie_bw_notification_handler,

-  IRQF_SHARED, "PCIe BW notif", srv);
+  IRQF_SHARED | IRQF_ONESHOT, "PCIe BW notif", 
srv);
if (ret)
return ret;
  



Re: [PATCH v3] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2019-03-20 Thread Alex G

On 3/20/19 4:44 PM, Linus Torvalds wrote:

On Wed, Mar 20, 2019 at 1:52 PM Bjorn Helgaas  wrote:


AFAICT, the consensus there was that it would be better to find some
sort of platform solution instead of dealing with it in individual
drivers.  The PCI core isn't really a driver, but I think the same
argument applies to it: if we had a better way to recover from readl()
errors, that way would work equally well in nvme-pci and the PCI core.


I think that patches with the pattern "if (disconnected) don't do IO"
are fundamentally broken and we should look for alternatives in all
cases.

They are fundamentally broken because they are racy: if it's an actual
sudden disconnect in the middle of IO, there's no guarantee that we'll
even be notified in time.

They are fundamentally broken because they add new magic special cases
that very few people will ever test, and the people who do test them
tend to do so with old irrelevant kernels.

Finally, they are fundamentally broken because they always end up
being just special cases. One or two special case accesses in a
driver, or perhaps all accesses of a particular type in just _one_
special driver.

Yes, yes, I realize that people want error reporting, and that
hot-removal can cause various error conditions (perhaps just parity
errors for the IO, but also perhaps other random errors caused by
firmware perhaps doing special HW setup).

But the "you get a fatal interrupt, so avoid the IO" kind of model is
completely broken, and needs to just be fixed differently. See above
why it's so completely broken.

So if the hw is set up to send some kind of synchronous interrupt or
machine check that cannot sanely be handled (perhaps because it will
just repeat forever), we should try to just disable said thing.

PCIe allows for just polling for errors on the bridges, afaik. It's
been years since I looked at it, and maybe I'm wrong. And I bet there
are various "platform-specific value add" registers etc that may need
tweaking outside of any standard spec for PCIe error reporting. But
let's do that in a platform driver, to set up the platform to not do
the silly "I'm just going to die if I see an error" thing.

It's way better to have a model where you poll each bridge once a
minute (or one an hour) and let people know "guys, your hardware
reports errors", than make random crappy changes to random drivers
because the hardware was set up to die on said errors.

And if some MIS person wants the "hardware will die" setting, then
they can damn well have that, and then it's not our problem, but it
also means that we don't start changing random drivers for that insane
setting. It's dead, Jim, and it was the user's choice.

Notice how in neither case does it make sense to try to do some "if
(disconnected) dont_do_io()" model for the drivers.


I disagree with the idea of doing something you know can cause an error 
to propagate. That being said, in this particular case we have come to 
rely a little too much on the if (disconnected) model.


You mentioned in the other thread that fixing the GHES driver will pay 
much higher dividends. I'm working on reviving a couple of changes to do 
just that. Some industry folk were very concerned about the "don't 
panic()" approach, and I want to make sure I fairly present their 
arguments in the cover letter.


I'm hoping one day we'll have the ability to use page tables to prevent 
the situations that we're trying to fix today in less than ideal ways. 
Although there are other places in msi.c that use if (disconnected), I'm 
okay with dropping this change -- provided we come up with an equivalent 
fix.


But even if FFS doesn't crash, do we really want to lose hundreds of 
milliseconds to SMM --on all cores-- when all it takes is a couple of 
cycles to check a flag?


Alex
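
As an aside, the polling model Linus describes above (check bridges for latched
errors periodically instead of relying on a screaming interrupt) could be
prototyped with a delayed work item. This is only a sketch, and it assumes the
platform actually lets the OS read the AER registers:

#include <linux/jiffies.h>
#include <linux/pci.h>
#include <linux/workqueue.h>

struct aer_poll_ctx {
	struct pci_dev *bridge;
	struct delayed_work work;	/* INIT_DELAYED_WORK(&work, aer_poll_fn) at setup */
};

static void aer_poll_fn(struct work_struct *work)
{
	struct aer_poll_ctx *ctx = container_of(to_delayed_work(work),
						struct aer_poll_ctx, work);
	struct pci_dev *dev = ctx->bridge;
	int pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
	u32 uncor, cor;

	if (pos) {
		pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &uncor);
		pci_read_config_dword(dev, pos + PCI_ERR_COR_STATUS, &cor);
		if (uncor || cor)
			pci_warn(dev, "latched AER status: uncor %#010x cor %#010x\n",
				 uncor, cor);
	}

	schedule_delayed_work(&ctx->work, 60 * HZ);	/* "once a minute" */
}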


Re: [PATCH] PCI/LINK: bw_notification: Do not leave interrupt handler NULL

2019-03-20 Thread Alex G.

On 3/20/19 8:46 AM, Bjorn Helgaas wrote:

Hi Alexandru,

On Mon, Mar 18, 2019 at 08:12:04PM -0500, Alexandru Gagniuc wrote:

A threaded IRQ with a NULL handler does not work with level-triggered
interrupts. request_threaded_irq() will return an error:

   genirq: Threaded irq requested with handler=NULL and !ONESHOT for irq 16
   pcie_bw_notification: probe of :00:1b.0:pcie010 failed with error -22

For level interrupts we need to silence the interrupt before exiting
the IRQ handler, so just clear the PCI_EXP_LNKSTA_LBMS bit there.

Reported-by: Linus Torvalds 
Signed-off-by: Alexandru Gagniuc 


What's your thought regarding Lukas' comment?  If you do repost this,
please add a Fixes: tag to help connect this with the initial commit.


I like Lukas's idea. I should have this re-posted by end of week, unless 
there's an urgency to get it out earlier.


Alex


Re: [GIT PULL] PCI changes for v5.1

2019-03-17 Thread Alex G

On 3/17/19 4:18 PM, Linus Torvalds wrote:

On Fri, Mar 8, 2019 at 9:31 AM Bjorn Helgaas  wrote:


   - Report PCIe links that become degraded at run-time (Alexandru Gagniuc)


Gaah. Only now as I'm about to do the rc1 release am I looking at new
runtime warnings, and noticing that this causes

   genirq: Threaded irq requested with handler=NULL and !ONESHOT for irq 16
   pcie_bw_notification: probe of :00:1b.0:pcie010 failed with error -22

because you can't have a NULL handler for a level-triggered interrupt
(you need something to shut the interrupt off so that it doesn't keep
screaming!).


Thanks for the catch. I did not see the error on my test machines. I'll 
take a look tomorrow, and update through Bjorn. Seems like an easy fix.


Alex


Re: [PATCH v2] PCI: pciehp: Report degraded links via link bandwidth notification

2018-12-27 Thread Alex G.

On 12/7/18 12:20 PM, Alexandru Gagniuc wrote:

A warning is generated when a PCIe device is probed with a degraded
link, but there was no similar mechanism to warn when the link becomes
degraded after probing. The Link Bandwidth Notification provides this
mechanism.

Use the link bandwidth notification interrupt to detect bandwidth
changes, and rescan the bandwidth, looking for the weakest point. This
is the same logic used in probe().

Signed-off-by: Alexandru Gagniuc 
---


ping.


Changes since v1:
  * Layer on top of the pcie port service drivers, instead of hotplug service.

This patch needs to be applied on top of:
PCI: Add missing include to drivers/pci.h

Anticipated FAQ:

Q: Why is this unconditionally compiled in?
A: The symmetrical check in pci probe() is also always compiled in.

Q: Why call module_init() instead of adding a call in pcie_init_services() ?
A: A call in pcie_init_services() also requires a prototype in portdrv.h, a
non-static implementation in bw_notification.c. Using module_init() is
functionally equivalent, and takes less code.

Q: Why print only on degraded links and not when a link is back to full speed?
A: For symmetry with PCI probe(). Although I see a benefit in conveying that a
link is back to full speed, I expect this to be extremely rare. Secondary bus
reset is usually needed to retrain at full bandwidth.


  drivers/pci/pcie/Makefile  |   1 +
  drivers/pci/pcie/bw_notification.c | 107 +
  drivers/pci/pcie/portdrv.h |   4 +-
  drivers/pci/pcie/portdrv_core.c|  14 ++--
  4 files changed, 121 insertions(+), 5 deletions(-)
  create mode 100644 drivers/pci/pcie/bw_notification.c

diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index ab514083d5d4..f1d7bc1e5efa 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -3,6 +3,7 @@
  # Makefile for PCI Express features and port driver
  
  pcieportdrv-y			:= portdrv_core.o portdrv_pci.o err.o

+pcieportdrv-y  += bw_notification.o
  
  obj-$(CONFIG_PCIEPORTBUS)	+= pcieportdrv.o
  
diff --git a/drivers/pci/pcie/bw_notification.c b/drivers/pci/pcie/bw_notification.c

new file mode 100644
index ..64391ea9a172
--- /dev/null
+++ b/drivers/pci/pcie/bw_notification.c
@@ -0,0 +1,107 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * PCI Express Bandwidth notification services driver
+ * Author: Alexandru Gagniuc 
+ *
+ * Copyright (C) 2018, Dell Inc
+ *
+ * The PCIe bandwidth notification provides a way to notify the operating 
system
+ * when the link width or data rate changes. This capability is required for 
all
+ * root ports and downstream ports supporting links wider than x1 and/or
+ * multiple link speeds.
+ *
+ * This service port driver hooks into the bandwidth notification interrupt and
+ * warns when links become degraded in operation.
+ */
+
+#include 
+
+#include "../pci.h"
+#include "portdrv.h"
+
+static bool pcie_link_bandwidth_notification_supported(struct pci_dev *dev)
+{
+   int ret;
+   u32 lnk_cap;
+
+   ret = pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnk_cap);
+   return (ret == PCIBIOS_SUCCESSFUL) && (lnk_cap & PCI_EXP_LNKCAP_LBNC);
+}
+
+static void pcie_enable_link_bandwidth_notification(struct pci_dev *dev)
+{
+   u16 lnk_ctl;
+
+   pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnk_ctl);
+   lnk_ctl |= PCI_EXP_LNKCTL_LBMIE;
+   pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnk_ctl);
+}
+
+static void pcie_disable_link_bandwidth_notification(struct pci_dev *dev)
+{
+   u16 lnk_ctl;
+
+   pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnk_ctl);
+   lnk_ctl &= ~PCI_EXP_LNKCTL_LBMIE;
+   pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnk_ctl);
+}
+
+static irqreturn_t pcie_bw_notification_irq(int irq, void *context)
+{
+   struct pcie_device *srv = context;
+   struct pci_dev *port = srv->port;
+   struct pci_dev *dev;
+   u16 link_status, events;
+   int ret;
+
+   ret = pcie_capability_read_word(port, PCI_EXP_LNKSTA, &link_status);
+   events = link_status & PCI_EXP_LNKSTA_LBMS;
+
+   if (!events || ret != PCIBIOS_SUCCESSFUL)
+   return IRQ_NONE;
+
+   /* Print status from upstream link partner, not this downstream port. */
+   list_for_each_entry(dev, &port->subordinate->devices, bus_list)
+   __pcie_print_link_status(dev, false);
+
+   pcie_capability_write_word(port, PCI_EXP_LNKSTA, events);
+   return IRQ_HANDLED;
+}
+
+static int pcie_bandwidth_notification_probe(struct pcie_device *srv)
+{
+   int ret;
+
+   /* Single-width or single-speed ports do not have to support this. */
+   if (!pcie_link_bandwidth_notification_supported(srv->port))
+   return -ENODEV;
+
+   ret = devm_request_irq(&srv->device, srv->irq, pcie_bw_notification_irq,
+  IRQF_SHARED, "PCIe BW notif", srv);
+   if (ret)
+   return ret;
+
+   

Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2018-11-05 Thread Alex G.

ping

On 09/18/2018 05:15 PM, Alexandru Gagniuc wrote:

When a PCI device is gone, we don't want to send IO to it if we can
avoid it. We expose functionality via the irq_chip structure. As
users of that structure may not know about the underlying PCI device,
it's our responsibility to guard against removed devices.

.irq_write_msi_msg() is already guarded inside __pci_write_msi_msg().
.irq_mask/unmask() are not. Guard them for completeness.

For example, surprise removal of a PCIe device triggers teardown. This
touches the irq_chip ops at some point to disable the interrupts. I/O
generated here can crash the system on firmware-first machines.
Not triggering the IO in the first place greatly reduces the
possibility of the problem occurring.

Signed-off-by: Alexandru Gagniuc 
---
  drivers/pci/msi.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index f2ef896464b3..f31058fd2260 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -227,6 +227,9 @@ static void msi_set_mask_bit(struct irq_data *data, u32 
flag)
  {
struct msi_desc *desc = irq_data_get_msi_desc(data);
  
+	if (pci_dev_is_disconnected(msi_desc_to_pci_dev(desc)))

+   return;
+
if (desc->msi_attrib.is_msix) {
msix_mask_irq(desc, flag);
readl(desc->mask_base);  /* Flush write to device */



Re: [PATCH] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2018-08-29 Thread Alex G.

Should I resubmit this rebased on 4.19-rc*, or just leave this patch as is?

Alex

On 07/30/2018 04:21 PM, Alexandru Gagniuc wrote:

When a PCI device is gone, we don't want to send IO to it if we can
avoid it. We expose functionality via the irq_chip structure. As
users of that structure may not know about the underlying PCI device,
it's our responsibility to guard against removed devices.

irq_write_msi_msg is already guarded. pci_msi_(un)mask_irq are not.
Guard them for completeness.

For example, surprise removal of a PCIe device triggers teardown. This
touches the irq_chip ops at some point to disable the interrupts. I/O
generated here can crash the system on machines with buggy firmware.
Not triggering the IO in the first place eliminates the problem.

Signed-off-by: Alexandru Gagniuc 
---

There's another patch by Lukas Wunner that is needed (not yet published)
in order to fully block IO on SURPRISE!!! removal. The existing code only
sets the PCI_DEV_DISCONNECTED bit in an unreasonably narrow set of
circumstances. Lukas' patch fixes that.

However, this change is otherwise fully independent, and enjoy!

  drivers/pci/msi.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 4d88afdfc843..5f47b5cb0401 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -227,6 +227,9 @@ static void msi_set_mask_bit(struct irq_data *data, u32 
flag)
  {
struct msi_desc *desc = irq_data_get_msi_desc(data);
  
+	if (pci_dev_is_disconnected(msi_desc_to_pci_dev(desc)))

+   return;
+
if (desc->msi_attrib.is_msix) {
msix_mask_irq(desc, flag);
readl(desc->mask_base);  /* Flush write to device */



Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER

2018-08-09 Thread Alex G.




On 08/09/2018 02:18 PM, Bjorn Helgaas wrote:

On Thu, Aug 09, 2018 at 02:00:23PM -0500, Alex G. wrote:

On 08/09/2018 01:29 PM, Bjorn Helgaas wrote:

On Thu, Aug 09, 2018 at 04:46:32PM +, alex_gagn...@dellteam.com wrote:

On 08/09/2018 09:16 AM, Bjorn Helgaas wrote:

(snip)

 enable_ecrc_checking()
 disable_ecrc_checking()


I don't immediately see how this would affect FFS, but the bits are part
of the AER capability structure. According to the FFS model, those would
be owned by FW, and we'd have to avoid touching them.


Per ACPI v6.2, sec 18.3.2.4, the HEST may contain entries for Root
Ports that contain the FIRMWARE_FIRST flag as well as values the OS is
supposed to write to several AER capability registers.  It looks like
we currently ignore everything except the FIRMWARE_FIRST and GLOBAL
flags (ACPI_HEST_FIRMWARE_FIRST and ACPI_HEST_GLOBAL in Linux).

That seems like a pretty major screwup and more than I want to fix
right now.


The logic is not very clear, but I think it goes like this:
For GLOBAL and FFS, disable native AER everywhere.
When !GLOBAL and FFS, then only disable native AER for the root port
described by the HEST entry.


I agree the code is convoluted, but that sounds right to me.

What I meant is that we ignore the values the HEST entry tells us
we're supposed to write to Device Control and the AER Uncorrectable
Error Mask, Uncorrectable Error Severity, Correctable Error Mask, and
AER Capabilities and Control.


Wait, what? _HPX has the same information. This is madness!
Since root ports are not hot-swappable, the BIOS normally programs those 
registers. Even if Linux doesn't apply said masks, the programming the BIOS
did should be sufficient to have *cough* correct *cough* behavior.



For practical considerations this is not an issue today. The ACPI error
handling code currently crashes when it encounters any fatal error, so
we wouldn't hit this in the FFS case.


I wasn't aware the firmware-first path was *that* broken.  Are there
problem reports for this?  Is this a regression?


It's been like this since, I believe, 3.10, and probably much earlier. All
reports that I have seen of linux crashing on surprise hot-plug have been
caused by the panic() call in the apei code. Dell BIOSes do an extreme
amount of work to determine when it's safe to _not_ report errors to the OS,
since all known OSes crash on this path.


Oh, is this the __ghes_panic() path?  If so, I'm going to turn away
and plead ignorance unless the PCI core is doing something wrong that
eventually results in that panic.


I agree, and I'll quote you on that!

Alex
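
Restating the GLOBAL/!GLOBAL decision above in code form, purely as an
illustration -- this is not the actual aer_hest_parse() implementation, and the
entry fields are simplified stand-ins for the ACPI structures:

#include <linux/pci.h>

static bool hest_entry_forces_ffs(bool entry_ffs, bool entry_global,
				  u16 entry_segment, u16 entry_bdf,
				  struct pci_dev *dev)
{
	if (!entry_ffs)
		return false;
	if (entry_global)
		return true;	/* FIRMWARE_FIRST + GLOBAL: applies to every port */

	/* Otherwise it applies only to the one root port the entry names. */
	return entry_segment == pci_domain_nr(dev->bus) &&
	       entry_bdf == PCI_DEVID(dev->bus->number, dev->devfn);
}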



Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER

2018-08-09 Thread Alex G.




On 08/09/2018 01:29 PM, Bjorn Helgaas wrote:

On Thu, Aug 09, 2018 at 04:46:32PM +, alex_gagn...@dellteam.com wrote:

On 08/09/2018 09:16 AM, Bjorn Helgaas wrote:

(snip)

enable_ecrc_checking()
disable_ecrc_checking()


I don't immediately see how this would affect FFS, but the bits are part
of the AER capability structure. According to the FFS model, those would
be owned by FW, and we'd have to avoid touching them.


Per ACPI v6.2, sec 18.3.2.4, the HEST may contain entries for Root
Ports that contain the FIRMWARE_FIRST flag as well as values the OS is
supposed to write to several AER capability registers.  It looks like
we currently ignore everything except the FIRMWARE_FIRST and GLOBAL
flags (ACPI_HEST_FIRMWARE_FIRST and ACPI_HEST_GLOBAL in Linux).

That seems like a pretty major screwup and more than I want to fix
right now.


The logic is not very clear, but I think it goes like this:
For GLOBAL and FFS, disable native AER everywhere.
When !GLOBAL and FFS, then only disable native AER for the root port 
described by the HEST entry.


aer_acpi_firmware_first() doesn't care about context. Though check 
aer_set_firmware_first(), where we pass a pci_device as a context.


The FFS implementations I've seen have one HEST entry, with GLOBAL and 
FFS. I have yet to see more fine-grained control of root ports. I 
suspect that if we had this finer control from HEST, we'd honor it.


I do eventually want to get a test BIOS with different HEST entries, and 
make sure things work as intended. Though BIOS team is overloaded with 
other requests.



pci_cleanup_aer_uncorrect_error_status()


This probably should be guarded. It's only called from a few specific
drivers, so the impact is not as high as being called from the core.


pci_aer_clear_fatal_status()


This is only called when doing fatal_recovery, right?


True.  It takes a lot of analysis to convince oneself that this is not
used in the firmware-first path, so I think we should add a guard
there.


I agree. GHES has a header severity and a severity field for each 
section. All BIOSes I've seen do fatal/fatal, though a BIOS that would 
report correctable/fatal would bypass the APEI code and take us here.



For practical considerations this is not an issue today. The ACPI error
handling code currently crashes when it encounters any fatal error, so
we wouldn't hit this in the FFS case.


I wasn't aware the firmware-first path was *that* broken.  Are there
problem reports for this?  Is this a regression?


It's been like this since, I believe, 3.10, and probably much earlier. 
All reports that I have seen of linux crashing on surprise hot-plug have 
been caused by the panic() call in the apei code. Dell BIOSes do an 
extreme amount of work to determine when it's safe to _not_ report 
errors to the OS, since all known OSes crash on this path.


Fun fact: there's even dedicated hardware to accomplish the above.


The PCIe standards contact I usually talk to about these PCIe subtleties
is currently on vacation. The number one issue was a FFS corner case
with OS clearing bits on probe. The other functions you mention are a
corner case of a corner case. The big fish is
pci_cleanup_aer_error_status_regs() on probe(), and it would be nice to
have that resolved.

I'll sync up with Austin when he gets back to see about the other
functions though I suspect we'll end up fixing them as well.


I'd like to fix all the obvious cases at once (excluding the ECRC
stuff).  What do you think about the following patch?


That looks very sensible to me. Thank you for updating it :).
I'm pulling in the changes right now to run some quick tests, and I 
don't expect any trouble.


Alex


commit 15ed68dcc26864c849a12a36db4d4771bad7991f
Author: Alexandru Gagniuc 
Date:   Tue Jul 17 10:31:23 2018 -0500

 PCI/AER: Don't clear AER bits if error handling is Firmware-First
 
 If the platform requests Firmware-First error handling, firmware is

 responsible for reading and clearing AER status bits.  If OSPM also clears
 them, we may miss errors.  See ACPI v6.2, sec 18.3.2.5 and 18.4.
 
 This race is mostly of theoretical significance, as it is not easy to

 reasonably demonstrate it in testing.
 
 Signed-off-by: Alexandru Gagniuc 

 [bhelgaas: add similar guards to pci_cleanup_aer_uncorrect_error_status()
 and pci_aer_clear_fatal_status()]
 Signed-off-by: Bjorn Helgaas 

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index c6cc855bfa22..4e823ae051a7 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -397,6 +397,9 @@ int pci_cleanup_aer_uncorrect_error_status(struct pci_dev 
*dev)
if (!pos)
return -EIO;
  
+	if (pcie_aer_get_firmware_first(dev))

+   return -EIO;
+
/* Clear status bits for ERR_NONFATAL errors only */
	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status);
pci_read_config_dword(dev, pos + 
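
For context on pci_cleanup_aer_uncorrect_error_status() being "only called from
a few specific drivers": the usual call site is a driver's AER .slot_reset hook,
roughly as in this sketch (the driver name is hypothetical):

#include <linux/aer.h>
#include <linux/pci.h>

static pci_ers_result_t exdrv_slot_reset(struct pci_dev *pdev)
{
	/*
	 * After the device/link reset, clear any latched non-fatal AER
	 * status.  With the guard added above, this becomes a no-op (-EIO)
	 * when firmware owns AER, which is exactly the intent.
	 */
	pci_cleanup_aer_uncorrect_error_status(pdev);
	return PCI_ERS_RESULT_RECOVERED;
}

/* Wired up via the driver's struct pci_error_handlers .slot_reset member. */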

Re: [PATCH v3] PCI/AER: Do not clear AER bits if we don't own AER

2018-08-07 Thread Alex G.




On 08/07/2018 08:14 PM, Bjorn Helgaas wrote:

On Mon, Jul 30, 2018 at 06:35:31PM -0500, Alexandru Gagniuc wrote:

When we don't own AER, we shouldn't touch the AER error bits. Clearing
error bits willy-nilly might cause firmware to miss some errors. In
theory, these bits get cleared by FFS, or via ACPI _HPX method. These
mechanisms are not subject to the problem.


What's FFS?


Firmware-first. Nobody likes spelling it out, and all other proposed 
acronyms are insanely tongue-twisting. So, FFS.



I guess you mean FFS and _HPX are not subject to the problem because
they're supplied by firmware, so firmware would be responsible for
looking at the bits before clearing them?


Exactly.


This race is mostly of theoretical significance, since I can't
reasonably demonstrate this race in the lab.

On a side-note, pcie_aer_is_kernel_first() is created to alleviate the
need for two checks: aer_cap and get_firmware_first().

Signed-off-by: Alexandru Gagniuc 
---

Changes since v2:
   - Added missing negation in pci_cleanup_aer_error_status_regs()

  drivers/pci/pcie/aer.c | 17 ++---
  1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a2e88386af28..40e5c86271d1 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -307,6 +307,12 @@ int pcie_aer_get_firmware_first(struct pci_dev *dev)
aer_set_firmware_first(dev);
return dev->__aer_firmware_first;
  }
+
+static bool pcie_aer_is_kernel_first(struct pci_dev *dev)
+{
+   return !!dev->aer_cap && !pcie_aer_get_firmware_first(dev);
+}


I think it complicates things to have both "firmware_first" and
"kernel_first" interfaces, so I would prefer to stick with the
existing "firmware_first" style.


  #define   PCI_EXP_AER_FLAGS   (PCI_EXP_DEVCTL_CERE | 
PCI_EXP_DEVCTL_NFERE | \
 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
  
@@ -337,10 +343,7 @@ bool aer_acpi_firmware_first(void)
  
  int pci_enable_pcie_error_reporting(struct pci_dev *dev)

  {
-   if (pcie_aer_get_firmware_first(dev))
-   return -EIO;
-
-   if (!dev->aer_cap)
+   if (!pcie_aer_is_kernel_first(dev))
return -EIO;
  
  	return pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);


This change doesn't actually fix anything, does it?  It looks like a
cleanup that doesn't change the behavior.


Initially (v1), this was a one-liner, but someone had a complaint about 
having pcie_aer_get_firmware_first() boilerplate all over the place. 
That's why I added the "kernel_first" function (previous comment), and 
then updated this here for completeness. I'm also fine with v1.



@@ -349,7 +352,7 @@ EXPORT_SYMBOL_GPL(pci_enable_pcie_error_reporting);
  
  int pci_disable_pcie_error_reporting(struct pci_dev *dev)

  {
-   if (pcie_aer_get_firmware_first(dev))
+   if (!pcie_aer_is_kernel_first(dev))
return -EIO;


This change does effectively add a test for dev->aer_cap.  That makes
sense in terms of symmetry with pci_enable_pcie_error_reporting(),
but I think it should be a separate patch because it's conceptually
separate from the change below.

We should keep the existing behavior (but add the symmetry) here for
now, but it's not clear to me that these paths should care about AER
or firmware-first at all.  PCI_EXP_DEVCTL is not an AER register and
we have the _HPX mechanism for firmware to influence it (which these
paths currently ignore).  I suspect we should program these reporting
enable bits in the core enumeration path instead of having drivers
call these interfaces.


The headache is that FFS needs the reporting bit to stay enabled in 
order to get AER notifications. Disabling things here could really break 
firmware. Of course, that's a cyclical argument, since FW is broken by 
definition.



If/when we make changes along these lines, the history will be easier
to follow if *this* change is not connected with the change below to
pci_cleanup_aer_error_status_regs().


I agree. I think it might be preferred then to go with v1, and leave the 
refactoring to a later time, since the extra changes are cosmetical and 
social.



return pcie_capability_clear_word(dev, PCI_EXP_DEVCTL,
@@ -383,10 +386,10 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
if (!pci_is_pcie(dev))
return -ENODEV;
  
-	pos = dev->aer_cap;

-   if (!pos)
+   if (!pcie_aer_is_kernel_first(dev))
return -EIO;


This part makes sense to me, but I think I would rather have it match
the existing style in pci_enable_pcie_error_reporting(), i.e., keep
the test for dev->aer_cap and add a test for
pcie_aer_get_firmware_first().


Had it that way in v1.

Alex


+   pos = dev->aer_cap;
port_type = pci_pcie_type(dev);
if (port_type == PCI_EXP_TYPE_ROOT_PORT) {
pci_read_config_dword(dev, pos + PCI_ERR_ROOT_STATUS, &status);
--


Re: [PATCH v5] PCI: Check for PCIe downtraining conditions

2018-07-31 Thread Alex G.

On 07/31/2018 01:40 AM, Tal Gilboa wrote:
[snip]
@@ -2240,6 +2258,9 @@ static void pci_init_capabilities(struct 
pci_dev *dev)

  /* Advanced Error Reporting */
  pci_aer_init(dev);
+    /* Check link and detect downtrain errors */
+    pcie_check_upstream_link(dev);


This is called for every PCIe device right? Won't there be a 
duplicated print in case a device loads with lower PCIe bandwidth than 
needed?


Alex, can you comment on this please?


Of course I can.

There's one print at probe() time, which happens if bandwidth < max. I 
would think that's fine. There is a way to duplicate it, and that is if 
the driver also calls print_link_status(). A few driver maintainers who 
call it have indicated they'd be fine with removing it from the driver, 
and leaving it in the core PCI.


Should the device come back at low speed, go away, then come back at
full speed, we'd still only see one print from the first probe -- again,
provided the driver doesn't also call the function.
There's the corner case where the device comes up at < max, goes away,
then comes back faster, but still < max. There will be two prints, but
they won't be true duplicates -- each one will indicate a different BW.


Alex


+
  if (pci_probe_reset_function(dev) == 0)
  dev->reset_fn = 1;
  }
diff --git a/include/linux/pci.h b/include/linux/pci.h
index abd5d5e17aee..15bfab8f7a1b 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1088,6 +1088,7 @@ int pcie_set_mps(struct pci_dev *dev, int mps);
  u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev 
**limiting_dev,

   enum pci_bus_speed *speed,
   enum pcie_link_width *width);
+void __pcie_print_link_status(struct pci_dev *dev, bool verbose);
  void pcie_print_link_status(struct pci_dev *dev);
  int pcie_flr(struct pci_dev *dev);
  int __pci_reset_function_locked(struct pci_dev *dev);




Re: [PATCH v2] PCI/AER: Do not clear AER bits if we don't own AER

2018-07-24 Thread Alex G.




On 07/23/2018 11:52 AM, Alexandru Gagniuc wrote:

When we don't own AER, we shouldn't touch the AER error bits. Clearing
error bits willy-nilly might cause firmware to miss some errors. In
theory, these bits get cleared by FFS, or via ACPI _HPX method. These
mechanisms are not subject to the problem.

This race is mostly of theoretical significance, since I can't
reasonably demonstrate this race in the lab.

On a side-note, pcie_aer_is_kernel_first() is created to alleviate the
need for two checks: aer_cap and get_firmware_first().

Signed-off-by: Alexandru Gagniuc 
---
  drivers/pci/pcie/aer.c | 17 ++---
  1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a2e88386af28..85c3e173c025 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -307,6 +307,12 @@ int pcie_aer_get_firmware_first(struct pci_dev *dev)
aer_set_firmware_first(dev);
return dev->__aer_firmware_first;
  }
+
+static bool pcie_aer_is_kernel_first(struct pci_dev *dev)
+{
+   return !!dev->aer_cap && !pcie_aer_get_firmware_first(dev);
+}
+
  #define   PCI_EXP_AER_FLAGS   (PCI_EXP_DEVCTL_CERE | 
PCI_EXP_DEVCTL_NFERE | \
 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
  
@@ -337,10 +343,7 @@ bool aer_acpi_firmware_first(void)
  
  int pci_enable_pcie_error_reporting(struct pci_dev *dev)

  {
-   if (pcie_aer_get_firmware_first(dev))
-   return -EIO;
-
-   if (!dev->aer_cap)
+   if (!pcie_aer_is_kernel_first(dev))
return -EIO;
  
  	return pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);

@@ -349,7 +352,7 @@ EXPORT_SYMBOL_GPL(pci_enable_pcie_error_reporting);
  
  int pci_disable_pcie_error_reporting(struct pci_dev *dev)

  {
-   if (pcie_aer_get_firmware_first(dev))
+   if (!pcie_aer_is_kernel_first(dev))
return -EIO;
  
  	return pcie_capability_clear_word(dev, PCI_EXP_DEVCTL,

@@ -383,10 +386,10 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
if (!pci_is_pcie(dev))
return -ENODEV;
  
-	pos = dev->aer_cap;

-   if (!pos)
+   if (pcie_aer_is_kernel_first(dev))


This here is missing a '!'. It's in my local branch, so I must have 
exported the patch before I fixed that. I'll get that fixed next rev.



return -EIO;
  
+	pos = dev->aer_cap;

port_type = pci_pcie_type(dev);
if (port_type == PCI_EXP_TYPE_ROOT_PORT) {
pci_read_config_dword(dev, pos + PCI_ERR_ROOT_STATUS, &status);



Re: [PATCH v5] PCI: Check for PCIe downtraining conditions

2018-07-23 Thread Alex G.




On 07/23/2018 05:14 PM, Jakub Kicinski wrote:

On Tue, 24 Jul 2018 00:52:22 +0300, Tal Gilboa wrote:

On 7/24/2018 12:01 AM, Jakub Kicinski wrote:

On Mon, 23 Jul 2018 15:03:38 -0500, Alexandru Gagniuc wrote:

PCIe downtraining happens when both the device and PCIe port are
capable of a larger bus width or higher speed than negotiated.
Downtraining might be indicative of other problems in the system, and
identifying this from userspace is neither intuitive, nor
straightforward.

The easiest way to detect this is with pcie_print_link_status(),
since the bottleneck is usually the link that is downtrained. It's not
a perfect solution, but it works extremely well in most cases.

Signed-off-by: Alexandru Gagniuc 
---

For the sake of review, I've created a __pcie_print_link_status() which
takes a 'verbose' argument. If we agree want to go this route, and update
the users of pcie_print_link_status(), I can split this up in two patches.
I prefer just printing this information in the core functions, and letting
drivers not have to worry about this. Though there seems to be strong for
not going that route, so here it goes:


FWIW the networking drivers print PCIe BW because sometimes the network
bandwidth is simply over-provisioned on multi port cards, e.g. 80Gbps
card on a x8 link.

Sorry to bike shed, but currently the networking cards print the info
during probe.  Would it make sense to move your message closer to probe
time?  Rather than when device is added.  If driver structure is
available, we could also consider adding a boolean to struct pci_driver
to indicate if driver wants the verbose message?  This way we avoid
duplicated prints.

I have no objection to current patch, it LGTM.  Just a thought.


I don't see the reason for having two functions. What's the problem with
adding the verbose argument to the original function?


IMHO it's reasonable to keep the default parameter to what 90% of users
want by means of a wrapper.  The non-verbose output is provided by
the core already for all devices.

What do you think of my proposal above Tal?  That would make the extra
wrapper unnecessary since the verbose parameter would be part of the
driver structure, and it would avoid the duplicated output.


I see how it might make sense to add another member to the driver 
struct, but is it worth the extra learning curve? It seems to be 
something with the potential to confuse new driver developers, while
having only a very marginal benefit.

Although, if that's what people want...

Alex



Re: [PATCH v3] PCI: Check for PCIe downtraining conditions

2018-07-23 Thread Alex G.

On 07/23/2018 12:21 AM, Tal Gilboa wrote:

On 7/19/2018 6:49 PM, Alex G. wrote:



On 07/18/2018 08:38 AM, Tal Gilboa wrote:

On 7/16/2018 5:17 PM, Bjorn Helgaas wrote:

[+cc maintainers of drivers that already use pcie_print_link_status()
and GPU folks]

[snip]



+    /* Multi-function PCIe share the same link/status. */
+    if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn)
+    return;
+
+    pcie_print_link_status(dev);
+}


Is this function called by default for every PCIe device? What about 
VFs? We make an exception for them on our driver since a VF doesn't 
have access to the needed information in order to provide a 
meaningful message.


I'm assuming VF means virtual function. pcie_print_link_status() 
doesn't care if it's passed a virtual function. It will try to do its 
job. That's why I bail out three lines above, with 'dev->is_virtfn' 
check.


Alex


That's the point - we don't want to call pcie_print_link_status() for 
virtual functions. We make the distinction in our driver. If you want to 
change the code to call this function by default it shouldn't affect the 
current usage.


I'm not understanding very well what you're asking. I understand you 
want to avoid printing this message on virtual functions, and that's 
already taken care of. I'm also not changing current behavior.  Let's 
get v2 out and start the discussion again based on that.


Alex


Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER

2018-07-19 Thread Alex G.




On 07/19/2018 11:58 AM, Sinan Kaya wrote:


On 7/19/2018 8:55 AM, Alex G. wrote:
I find the intent clearer if we check it here rather than having to do 
the mental parsing of the state of aer_cap.


I don't feel too strong about my comment to be honest. This was a
style/maintenance comment.

It feels like we are putting pcie_aer_get_firmware_first() into core
functions unnecessarily after your change. I understand the need for
your change. I'm asking if it is the right place or not.

pcie_aer_get_firmware_first() should be called from either the init or
probe function so that the rest of the AER functions do not get called
from any other context.

If someone adds another AER function, we might need to add another
pcie_aer_get_firmware_first() check there. So, we have unnecessary code
duplication.


We could move the aer_cap and get_ffs() check into one function that we 
end up calling all over the place. I understand your concern about code 
duplication, and I agree with it. I don't think that at this point it's 
that big of a deal, although we might need to guard every aer_() call.


So moving all the checks in a pcie_aer_is_kernel_first() makes sense.

Alex
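
A minimal sketch of the consolidated check discussed above, assuming it
lives inside drivers/pci/pcie/aer.c where dev->aer_cap and
pcie_aer_get_firmware_first() are already visible; the helper itself is
illustrative, not a posted patch:

    /*
     * One predicate answering "does the kernel own AER for this device?",
     * so callers stop open-coding the aer_cap and firmware-first checks.
     */
    static bool pcie_aer_is_kernel_first(struct pci_dev *dev)
    {
            if (!pci_is_pcie(dev))
                    return false;
            if (!dev->aer_cap)      /* no AER capability at all */
                    return false;
            return !pcie_aer_get_firmware_first(dev);
    }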


Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER

2018-07-19 Thread Alex G.




On 07/17/2018 10:41 AM, Sinan Kaya wrote:


On 7/17/2018 8:31 AM, Alexandru Gagniuc wrote:

+    if (pcie_aer_get_firmware_first(dev))
+    return -EIO;


Can you move this to closer to the caller pci_aer_init()?


I could move it there, although pci_cleanup_aer_error_status_regs() is 
called directly from pci_restore_state.  Of course, aer_cap should be 
zero in this case, and we'd still bail out.
I find the intent clearer if we check it here rather than having to do 
the mental parsing of the state of aer_cap.


Alex


Re: [PATCH v3] PCI: Check for PCIe downtraining conditions

2018-07-19 Thread Alex G.




On 07/18/2018 08:38 AM, Tal Gilboa wrote:

On 7/16/2018 5:17 PM, Bjorn Helgaas wrote:

[+cc maintainers of drivers that already use pcie_print_link_status()
and GPU folks]

[snip]



+    /* Multi-function PCIe share the same link/status. */
+    if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn)
+    return;
+
+    pcie_print_link_status(dev);
+}


Is this function called by default for every PCIe device? What about 
VFs? We make an exception for them on our driver since a VF doesn't have 
access to the needed information in order to provide a meaningful message.


I'm assuming VF means virtual function. pcie_print_link_status() doesn't 
care if it's passed a virtual function. It will try to do its job. 
That's why I bail out three lines above, with 'dev->is_virtfn' check.


Alex


Re: [PATCH v3] PCI: Check for PCIe downtraining conditions

2018-07-19 Thread Alex G.




On 07/18/2018 04:53 PM, Bjorn Helgaas wrote:

[+cc Mike (hfi1)]

On Mon, Jul 16, 2018 at 10:28:35PM +, alex_gagn...@dellteam.com wrote:

On 7/16/2018 4:17 PM, Bjorn Helgaas wrote:

...
The easiest way to detect this is with pcie_print_link_status(),
since the bottleneck is usually the link that is downtrained. It's not
a perfect solution, but it works extremely well in most cases.


This is an interesting idea.  I have two concerns:

Some drivers already do this on their own, and we probably don't want
duplicate output for those devices.  In most cases (ixgbe and mlx* are
exceptions), the drivers do this unconditionally so we *could* remove
it from the driver if we add it to the core.  The dmesg order would
change, and the message wouldn't be associated with the driver as it
now is.


Oh, there are only 8 users of that. Even I could patch up the drivers to
remove the call, assuming we reach agreement about this change.


Also, I think some of the GPU devices might come up at a lower speed,
then download firmware, then reset the device so it comes up at a
higher speed.  I think this patch will make us complain about
the low initial speed, which might confuse users.


I spoke to one of the PCIe spec writers. It's allowable for a device to
downtrain speed or width. It would also be extremely dumb to downtrain
with the intent to re-train at a higher speed later, but it's possible
devices do dumb stuff like that. That's why it's an informational
message, instead of a warning.


FWIW, here's some of the discussion related to hfi1 from [1]:

   > Btw, why is the driver configuring the PCIe link speed?  Isn't
   > this something we should be handling in the PCI core?

   The device comes out of reset at the 5GT/s speed. The driver
   downloads device firmware, programs PCIe registers, and co-ordinates
   the transition to 8GT/s.

   This recipe is device specific and is therefore implemented in the
   hfi1 driver built on top of PCI core functions and macros.

Also several DRM drivers seem to do this (see, e.g.,
si_pcie_gen3_enable()); from [2]:

   My understanding was that some platforms only bring up the link in gen 1
   mode for compatibility reasons.

[1] 
https://lkml.kernel.org/r/32e1700b9017364d9b60aed9960492bc627ff...@fmsmsx120.amr.corp.intel.com
[2] 
https://lkml.kernel.org/r/bn6pr12mb1809bd30aa5b890c054f9832f7...@bn6pr12mb1809.namprd12.prod.outlook.com


Downtraining a link "for compatibility reasons" is one of those dumb 
things that devices do. I'm SURPRISED AMD HW does it, although it is 
perfectly permissible by PCIe spec.



Another case: Some devices (lower-end GPUs) use silicon (and marketing)
that advertises x16, but they're only routed for x8. I'm okay with
seeing an informational message in this case. In fact, I didn't know
that my Quadro card for three years is only wired for x8 until I was
testing this patch.


Yeah, it's probably OK.  I don't want bug reports from people who
think something's broken when it's really just a hardware limitation
of their system.  But hopefully the message is not alarming.


It looks fairly innocent:

[0.749415] pci :18:00.0: 4.000 Gb/s available PCIe bandwidth, 
limited by 5 GT/s x1 link at :17:03.0 (capable of 15.752 Gb/s with 8 
GT/s x2 link)



So I'm not sure whether it's better to do this in the core for all
devices, or if we should just add it to the high-performance drivers
that really care.


You're thinking "do I really need that bandwidth" because I'm using a
function called "_bandwidth_". The point of the change is very far from
that: it is to help in system troubleshooting by detecting downtraining
conditions.


I'm not sure what you think I'm thinking :)  My question is whether
it's worthwhile to print this extra information for *every* PCIe
device, given that your use case is the tiny percentage of broken
systems.


I think this information is a lot more useful than a bunch of other info 
that's printed. Is "type 00 class 0x088000" more valuable? What about 
"reg 0x20: [mem 0x9d95-0x9d95 64bit pref]", which is also 
available under /proc/iomem for those curious?



If we only printed the info in the "bw_avail < bw_cap" case, i.e.,
when the device is capable of more than it's getting, that would make
a lot of sense to me.  The normal case line is more questionable.  I
think the reason that's there is because the network drivers are very
performance sensitive and like to see that info all the time.


I agree that can be an acceptable compromise.


Maybe we need something like this:

   pcie_print_link_status(struct pci_dev *dev, int verbose)
   {
 ...
 if (bw_avail >= bw_cap) {
   if (verbose)
 pci_info(dev, "... available PCIe bandwidth ...");
 } else
   pci_info(dev, "... available PCIe bandwidth, limited by ...");
   }

So the core could print only the potential problems with:

   pcie_print_link_status(dev, 0);

and drivers that really care even if there's no problem could do:

   pcie_print_link_status(dev, 1);
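
Fleshing the pseudo-code above out with the real types, the split could be
shaped roughly as follows. The format strings and the wrapper are a sketch
only, not necessarily what gets merged; pcie_bandwidth_capable() is
internal to drivers/pci, so this assumes the code lives there:

    void __pcie_print_link_status(struct pci_dev *dev, bool verbose)
    {
            enum pcie_link_width width, width_cap;
            enum pci_bus_speed speed, speed_cap;
            struct pci_dev *limiting_dev = NULL;
            u32 bw_avail, bw_cap;

            bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
            bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width);

            if (bw_avail >= bw_cap) {
                    if (verbose)
                            pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth\n",
                                     bw_cap / 1000, bw_cap % 1000);
            } else {
                    pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth, limited by link at %s\n",
                             bw_avail / 1000, bw_avail % 1000,
                             limiting_dev ? pci_name(limiting_dev) : "<unknown>");
            }
    }

    /* The existing entry point keeps its behaviour as a thin wrapper. */
    void pcie_print_link_status(struct pci_dev *dev)
    {
            __pcie_print_link_status(dev, true);
    }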

Re: [PATCH v3] PCI/AER: Fix aerdrv loading with "pcie_ports=native" parameter

2018-07-03 Thread Alex G.



On 07/03/2018 11:38 AM, Bjorn Helgaas wrote:
> On Mon, Jul 02, 2018 at 11:16:01AM -0500, Alexandru Gagniuc wrote:
>> According to the documentation, "pcie_ports=native", linux should use
>> native AER and DPC services. While that is true for the _OSC method
>> parsing, this is not the only place that is checked. Should the HEST
>> table list PCIe ports as firmware-first, linux will not use native
>> services.
>>
>> This happens because aer_acpi_firmware_first() doesn't take
>> 'pcie_ports' into account. This is wrong. DPC uses the same logic when
>> it decides whether to load or not, so fixing this also fixes DPC not
>> loading.
>>
>> Signed-off-by: Alexandru Gagniuc 
>> ---
>>  drivers/pci/pcie/aer.c | 9 ++---
>>  1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index a2e88386af28..db2c01056dc7 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -283,13 +283,14 @@ static int aer_hest_parse(struct acpi_hest_header 
>> *hest_hdr, void *data)
>>  
>>  static void aer_set_firmware_first(struct pci_dev *pci_dev)
>>  {
>> -int rc;
>> +int rc = 0;
>>  struct aer_hest_parse_info info = {
>>  .pci_dev= pci_dev,
>>  .firmware_first = 0,
>>  };
>>  
>> -rc = apei_hest_parse(aer_hest_parse, &info);
>> +if (!pcie_ports_native)
>> +rc = apei_hest_parse(aer_hest_parse, &info);
>>  
>>  if (rc)
>>  pci_dev->__aer_firmware_first = 0;
>> @@ -324,7 +325,9 @@ bool aer_acpi_firmware_first(void)
>>  };
>>  
>>  if (!parsed) {
>> -apei_hest_parse(aer_hest_parse, &info);
>> +if (!pcie_ports_native)
>> +apei_hest_parse(aer_hest_parse, &info);
>> +
>>  aer_firmware_first = info.firmware_first;
>>  parsed = true;
>>  }
> 
> I was thinking something along the lines of the patch below, so we
> don't have to work through the settings of "rc" and "info".  But maybe
> I'm missing something subtle?
> 
> One subtle thing that I didn't look into is the
> pcie_aer_get_firmware_first() stub for the non-CONFIG_ACPI_APEI case.
> 
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index b24f2d252180..5ccbd7635f33 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -342,6 +342,9 @@ int pcie_aer_get_firmware_first(struct pci_dev *dev)
>   if (!pci_is_pcie(dev))
>   return 0;
>  
> + if (pcie_ports_native)
> + return 0;
> +

Here we're not setting dev->__aer_firmware_first_valid. Assuming we only
check for it via pcie_aer_get_firmware_first(), we should be fine. This
field is used in portdrv.h, but its use seems safe. I think this
approach can work.

>   if (!dev->__aer_firmware_first_valid)
>   aer_set_firmware_first(dev);
>   return dev->__aer_firmware_first;
> @@ -362,6 +365,9 @@ bool aer_acpi_firmware_first(void)
>   .firmware_first = 0,
>   };
>  
> + if (pcie_ports_native)
> + return false;
> +

This makes better sense.

>   if (!parsed) {
>   apei_hest_parse(aer_hest_parse, &info);
>   aer_firmware_first = info.firmware_first;
> 


Re: [PATCH v2] PCI/AER: Fix aerdrv loading with "pcie_ports=native" parameter

2018-07-02 Thread Alex G.



On 07/02/2018 08:16 AM, Bjorn Helgaas wrote:
> On Sat, Jun 30, 2018 at 11:39:00PM -0500, Alex G wrote:
>> On 06/30/2018 04:31 PM, Bjorn Helgaas wrote:
>>> [+cc Borislav, linux-acpi, since this involves APEI/HEST]
>>
>> Borislav is not the relevant maintainer here, since we're not contingent on
>> APEI handling. I think Keith has a lot more experience with this part of the
>> kernel.
> 
> Thanks for adding Keith.
> 
>>> On Tue, Jun 19, 2018 at 02:58:20PM -0500, Alexandru Gagniuc wrote:
>>>> According to the documentation, "pcie_ports=native", linux should use
>>>> native AER and DPC services. While that is true for the _OSC method
>>>> parsing, this is not the only place that is checked. Should the HEST
>>>> table list PCIe ports as firmware-first, linux will not use native
>>>> services.
>>>
>>> Nothing in ACPI-land looks at pcie_ports_native.  How should ACPI
>>> things work in the "pcie_ports=native" case?  I guess we still have to
>>> expect to receive error records from the firmware, because it may
>>> certainly send us non-PCI errors (machine checks, etc) and maybe even
>>> some PCI errors (even if the Linux AER driver claims AER interrupts,
>>> we don't know what notification mechanisms the firmware may be using).
>>
>> I think ACPI land shouldn't care about this. We care about it from the PCIe
>> stand point at the interface with ACPI. FW might see a delta in the sense
>> that we request control of some features via _OSC, which we otherwise would
>> not do without pcie_ports=native.
>>
>>> I guess best-case, we'll get ACPI error records for all non-PCI
>>> things, and the Linux AER driver will see all the AER errors.
>>
>> It might affect FW's ability to catch errors, but that's dependent on the
>> root port implementation.
>>
>>> Worst-case, I don't really know what to expect.  Duplicate reporting
>>> of AER errors via firmware and Linux AER driver?  Some kind of
>>> confusion about who acknowledges and clears them?
>>
>> Once user enters pcie_ports=native, all bets are off: you broke the contract
>> you have with the FW -- whether or not you have this patch.
>>
>>> Out of curiosity, what is your use case for "pcie_ports=native"?
>>> Presumably there's something that works better when using it, and
>>> things work even *better* with this patch?
>>
>> Correctness. It bothers me that actual behavior does not match the
>> documentation:
>>
>>  native  Use native PCIe services associated with PCIe ports
>> unconditionally.
>>
>>
>>> I know people do use it, because I often see it mentioned in forums
>>> and bug reports, but I really don't expect it to work very well
>>> because we're ignoring the usage model the firmware is designed
>>> around.  My unproven suspicion is that most uses are in the black
>>> magic category of "there's a bug here, and we don't know how to fix
>>> it, but pcie_ports=native makes it work better".
>>
>> There exist cases that firmware didn't consider. I would not call them
>> "firmware bugs", but there are cases where the user understands the platform
>> better than firmware.
>> Example: on certain PCIe switches, a hardware PCIe error may bring the
>> switch downstream ports into a state where they stop notifying hotplug
>> events. Depending on the platform, firmware may or may not fix this
>> condition, but "pcie_ports=native" enables DPC. DPC contains the error
>> without the switch downstream port entering the weird error state in the
>> first place.
>>
>> All bets are off at this point.
> 
> If a user needs "pcie_ports=native", I claim that's a user experience
> problem, and the underlying cause is a hardware, firmware, or OS
> defect.
> 
> I have no doubt the situation you describe is real, but this doesn't
> make any progress toward resolving the user experience problem.  In
> fact, it propagates the folklore that using "pcie_ports=native" is an
> appropriate final solution.  It's fine as a temporary workaround while
> we figure out a better solution, but we need some mechanism for
> analyzing the problem and eventually removing the need to use
> "pcie_ports=native".

Speaking of user experience, I'd argue that it's a horrible experience
for the kernel to _not_ do what it is asked.

I'm going to go fix the little comment about the patch. I had the same
dilemma when I wrote it, but didn't find it too noteworthy. It makes
more sense now that you mentioned it.

Alex

> I have a minor comment on the patch, but I think it makes sense.  This
> might be a good time to resurrect Prarit's "taint-on-pci-parameters"
> patch.  If somebody uses "pcie_ports=native", I think it makes sense
> to taint the kernel both because (1) we broke the contract with the
> firmware and we don't really know what to expect, and (2) it's an
> opportunity to encourage the user to raise a bug report.
> 
> Bjorn
> 


Re: [PATCH v2] PCI/AER: Fix aerdrv loading with "pcie_ports=native" parameter

2018-06-30 Thread Alex G

On 06/30/2018 04:31 PM, Bjorn Helgaas wrote:

[+cc Borislav, linux-acpi, since this involves APEI/HEST]


Borislav is not the relevant maintainer here, since we're not contingent 
on APEI handling. I think Keith has a lot more experience with this part 
of the kernel.



On Tue, Jun 19, 2018 at 02:58:20PM -0500, Alexandru Gagniuc wrote:

According to the documentation, "pcie_ports=native", linux should use
native AER and DPC services. While that is true for the _OSC method
parsing, this is not the only place that is checked. Should the HEST
table list PCIe ports as firmware-first, linux will not use native
services.


Nothing in ACPI-land looks at pcie_ports_native.  How should ACPI
things work in the "pcie_ports=native" case?  I guess we still have to
expect to receive error records from the firmware, because it may
certainly send us non-PCI errors (machine checks, etc) and maybe even
some PCI errors (even if the Linux AER driver claims AER interrupts,
we don't know what notification mechanisms the firmware may be using).


I think ACPI land shouldn't care about this. We care about it from the 
PCIe stand point at the interface with ACPI. FW might see a delta in the 
sense that we request control of some features via _OSC, which we 
otherwise would not do without pcie_ports=native.



I guess best-case, we'll get ACPI error records for all non-PCI
things, and the Linux AER driver will see all the AER errors.


It might affect FW's ability to catch errors, but that's dependent on 
the root port implementation.



Worst-case, I don't really know what to expect.  Duplicate reporting
of AER errors via firmware and Linux AER driver?  Some kind of
confusion about who acknowledges and clears them?


Once user enters pcie_ports=native, all bets are off: you broke the 
contract you have with the FW -- whether or not you have this patch.



Out of curiosity, what is your use case for "pcie_ports=native"?
Presumably there's something that works better when using it, and
things work even *better* with this patch?


Correctness. It bothers me that actual behavior does not match the 
documentation:


 native  Use native PCIe services associated with PCIe ports
unconditionally.



I know people do use it, because I often see it mentioned in forums
and bug reports, but I really don't expect it to work very well
because we're ignoring the usage model the firmware is designed
around.  My unproven suspicion is that most uses are in the black
magic category of "there's a bug here, and we don't know how to fix
it, but pcie_ports=native makes it work better".


There exist cases that firmware didn't consider. I would not call them 
"firmware bugs", but there are cases where the user understands the 
platform better than firmware.
Example: on certain PCIe switches, a hardware PCIe error may bring the 
switch downstream ports into a state where they stop notifying hotplug 
events. Depending on the platform, firmware may or may not fix this 
condition, but "pcie_ports=native" enables DPC. DPC contains the error 
without the switch downstream port entering the weird error state in the 
first place.


All bets are off at this point.


Obviously I would much rather find and fix those bugs so people
wouldn't have to stumble over the problem in the first place.


Using native services when firmware asks us not to is a crapshoot every 
time. I don't condone the use of this feature, but if we do get a 
pcie_ports=native request, we should at least honor it.




This happens because aer_acpi_firmware_first() doesn't take
'pcie_ports' into account. This is wrong. DPC uses the same logic when
it decides whether to load or not, so fixing this also fixes DPC not
loading.

Signed-off-by: Alexandru Gagniuc 
---
  drivers/pci/pcie/aer.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

Changes since v1:
  - Re-tested with latest and greatest (v4.18-rc1) -- works great

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a2e88386af28..98ced0f7c850 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -291,7 +291,7 @@ static void aer_set_firmware_first(struct pci_dev *pci_dev)
  
  	rc = apei_hest_parse(aer_hest_parse, &info);
  
-	if (rc)
+	if (rc || pcie_ports_native)
pci_dev->__aer_firmware_first = 0;
else
pci_dev->__aer_firmware_first = info.firmware_first;
@@ -327,6 +327,9 @@ bool aer_acpi_firmware_first(void)
	apei_hest_parse(aer_hest_parse, &info);
aer_firmware_first = info.firmware_first;
parsed = true;
+   if (pcie_ports_native)
+   aer_firmware_first = 0;
+
}
return aer_firmware_first;
  }
--
2.14.3



Re: [PATCH] PCI: DPC: Clear AER status bits before disabling port containment

2018-06-26 Thread Alex G.



On 06/19/2018 04:57 PM, Bjorn Helgaas wrote:
> On Wed, May 16, 2018 at 05:12:21PM -0600, Keith Busch wrote:
>> On Wed, May 16, 2018 at 06:44:22PM -0400, Sinan Kaya wrote:
>>> On 5/16/2018 5:33 PM, Alexandru Gagniuc wrote:
 AER status bits are sticky, and they survive system resets. Downstream
 devices are usually taken care of after re-enumerating the downstream
 busses, as the AER bits are cleared during probe().

 However, nothing clears the bits of the port which contained the
 error. These sticky bits may leave some BIOSes to think that something
 bad happened, and print ominous messages on next boot. To prevent this,
 tidy up the AER status bits before releasing containment.

 Signed-off-by: Alexandru Gagniuc 
 ---
  drivers/pci/pcie/dpc.c | 4 
  1 file changed, 4 insertions(+)

 diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
 index 8c57d607e603..bf82d6936556 100644
 --- a/drivers/pci/pcie/dpc.c
 +++ b/drivers/pci/pcie/dpc.c
 @@ -112,6 +112,10 @@ static void dpc_work(struct work_struct *work)
dpc->rp_pio_status = 0;
}
  
 +  /* DPC event made a mess of our AER status bits. Clean them up. */
 +  pci_cleanup_aer_error_status_regs(pdev);
 +  /* TODO: Should we also use aer_print_error to log the event? */
 +
pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
PCI_EXP_DPC_STATUS_TRIGGER | PCI_EXP_DPC_STATUS_INTERRUPT);
  

>>>
>>> I think Keith has a patch to fix this. It was under review at some point.
>>
>> Right, I do intend to follow up on this, but I've had some trouble
>> finding time the last few weeks. Sorry about that, things will clear up
>> for me shortly.
> 
> I'll drop this (Alexandru's) patch for now, waiting for your update, Keith.

I wonder if clearing AER status bits is mutually exclusive with
refactoring other parts of DPC handling?

Alex


Re: [PATCH v2] PCI: Check for PCIe downtraining conditions

2018-06-01 Thread Alex G.



On 06/01/2018 10:10 AM, Sinan Kaya wrote:
> On 6/1/2018 11:06 AM, Alex G. wrote:
>> On 06/01/2018 10:03 AM, Sinan Kaya wrote:
>>> On 6/1/2018 11:01 AM, Alexandru Gagniuc wrote:
>>>> +  /* Multi-function PCIe share the same link/status. */
>>>> +  if (PCI_FUNC(dev->devfn) != 0)
>>>> +  return;
>>>
>>> How about virtual functions?
>>
>> I have almost no clue about those. Is your concern that we might end up
>> printing more than our fair share of link statuses?
> 
> Yes, struct pci_dev also has a flag called is_virtfn that you should bail out
> here too. 

Will be fixed in v3.

Thanks,
Alex

>>
>> Alex
>>
>>
> 
> 


Re: [PATCH v2] PCI: Check for PCIe downtraining conditions

2018-06-01 Thread Alex G.
On 06/01/2018 10:12 AM, Andy Shevchenko wrote:
> On Fri, Jun 1, 2018 at 6:01 PM, Alexandru Gagniuc  
> wrote:
>> PCIe downtraining happens when both the device and PCIe port are
>> capable of a larger bus width or higher speed than negotiated.
>> Downtraining might be indicative of other problems in the system, and
>> identifying this from userspace is neither intuitive, nor
>> straightforward.
>>
>> The easiest way to detect this is with pcie_print_link_status(),
>> since the bottleneck is usually the link that is downtrained. It's not
>> a perfect solution, but it works extremely well in most cases.
> 
>> +static void pcie_check_upstream_link(struct pci_dev *dev)
>> +{
> 
>> +
> 
> This is redundant, but...

Hmm. I thought it's not safe to call pci_pcie_type() on non-pcie devices.

I see the pci_is_pcie() check followed by pci_pcie_type() is not
uncommon. I didn't think it would be an issue, as long as it's
consistent with the rest of the code.

>> +   if (!pci_is_pcie(dev))
>> +   return;
>> +
>> +   /* Look from the device up to avoid downstream ports with no 
>> devices. */
>> +   if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
>> +   (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
>> +   (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
>> +   return;
> 
> ...wouldn't be better
> 
> int type = pci_pcie_type(dev);
> 
> ?

An extra local variable when the compiler knows how to optimize it out?
To me, it doesn't seem like it would improve readability, but it would
make the code longer.

> But also possible, looking at existing code,
> 
> static inline bool pci_is_pcie_type(dev, type)
> {
>   return pci_is_pcie(dev) ? pci_pcie_type(dev) == type : false;
> }

return pci_is_pcie(dev) && (pci_pcie_type(dev) == type);

seems cleaner. Although this sort of cleanup is beyond the scope of this
change.

Thanks,
Alex
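
Written out with full types, the helper being discussed could look like
this; it is a sketch of the idea only, not part of the posted patch:

    static inline bool pci_is_pcie_type(struct pci_dev *dev, int type)
    {
            return pci_is_pcie(dev) && pci_pcie_type(dev) == type;
    }

    /*
     * With it, the type check in pcie_check_upstream_link() would read:
     *
     *      if (!pci_is_pcie_type(dev, PCI_EXP_TYPE_ENDPOINT) &&
     *          !pci_is_pcie_type(dev, PCI_EXP_TYPE_LEG_END) &&
     *          !pci_is_pcie_type(dev, PCI_EXP_TYPE_UPSTREAM))
     *              return;
     */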


Re: [PATCH v2] PCI: Check for PCIe downtraining conditions

2018-06-01 Thread Alex G.
On 06/01/2018 10:03 AM, Sinan Kaya wrote:
> On 6/1/2018 11:01 AM, Alexandru Gagniuc wrote:
>> +/* Multi-function PCIe share the same link/status. */
>> +if (PCI_FUNC(dev->devfn) != 0)
>> +return;
> 
> How about virtual functions?

I have almost no clue about those. Is your concern that we might end up
printing more than our fair share of link statuses?

Alex



Re: [PATCH] PCI: Check for PCIe downtraining conditions

2018-05-31 Thread Alex G.



On 05/31/2018 12:27 PM, Alex G. wrote:
> On 05/31/2018 12:11 PM, Sinan Kaya wrote:
>> On 5/31/2018 12:49 PM, Alex G. wrote:
>>>>bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
>>>>bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width, 
>>>> *parent*);
>>> That's confusing. I'd expect _capable() and _available() to be
>>> symmetrical. They either both look at one link only, or both go down to
>>> the root port. Though it seems _capable() is link-local, and
>>> _available() is down to root port.
>>>
>>
>> As you know, link speed is a qualification of two devices speed capability.
>> Both speed and width parameters get negotiated by two devices during TS1 and 
>> TS2
>> ordered set exchange. 
>>
>> You need to see what your link partner can support in available function() 
>> vs.
>> what this device can do in bandwidth() function.
> 
> I see. I'm not sure I can use pcie_print_link_status() without some
> major refactoring. I need to look at the capability of the device and its
> downstream port. There's no point complaining that an x16 device is
> running at x4 when the port is only x4 capable.
> 
> Let me think some more on this.

I did some thinking, and I don't like using the same message for both
point-to-point links and full-tree links. From a userspace point of view
having an arbitrary terminator in the tree parsing is confusing. So
we're better off not modifying the behavior here.

On the other hand, just printing the link status on probe might be good
enough. If we're downtraining, but there's a slower link somewhere else,
it won't matter much that we're downtrained. Just calling the unmodified
pcie_print_link_status() will most of the time find the exact segment
that's also downtrained. So let's do this in v2.
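
Roughly, the v2 direction would be along these lines (an outline only, not
the actual diff):

static void pcie_check_upstream_link(struct pci_dev *dev)
{
        if (!pci_is_pcie(dev))
                return;

        /* Look from the device up, to avoid ports with nothing below them. */
        if (pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT &&
            pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END)
                return;

        /* Let the existing helper log speed/width and what limits them. */
        pcie_print_link_status(dev);
}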

Alex

> Alex
> 


Re: [PATCH] PCI: Check for PCIe downtraining conditions

2018-05-31 Thread Alex G.
On 05/31/2018 10:30 AM, Sinan Kaya wrote:
> On 5/31/2018 11:05 AM, Alexandru Gagniuc wrote:
>> +if (dev_cur_speed < max_link_speed)
>> +pci_warn(dev, "PCIe downtrain: link speed is %s (%s capable)",
>> + pcie_bus_speed_name(dev_cur_speed),
>> + pcie_bus_speed_name(max_link_speed));
>> +
> 
> Also this isn't quite correct. Target link speed is what the device tries to
> train. A device can try to train to much lower speed than the maximum on 
> purpose.
> 
> It makes sense to print this if the speed that platform wants via target link
> speed is different from what is actually established though.

After seeing Gen 3 devices that train above the speed in the target link
speed field, I talked to one of the spec writers today. There is some
ambiguity with the target link speed field. In PCIe 4.0 they are
clarifying that to state that this field is "permitted to have no effect".
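
For reference, the field in question is the low four bits of Link Control 2;
a minimal way to read it looks something like this (illustrative only, the
helper name is made up):

static u16 pcie_target_link_speed(struct pci_dev *dev)
{
        u16 lnkctl2 = 0;

        pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
        return lnkctl2 & 0x000f;        /* Target Link Speed; may have no effect */
}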

Alex



Re: [PATCH] PCI: Check for PCIe downtraining conditions

2018-05-31 Thread Alex G.
On 05/31/2018 12:11 PM, Sinan Kaya wrote:
> On 5/31/2018 12:49 PM, Alex G. wrote:
>>> bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
>>> bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width,
>>> *parent*);
>> That's confusing. I'd expect _capable() and _available() to be
>> symmetrical. They either both look at one link only, or both go down to
>> the root port. Though it seems _capable() is link-local, and
>> _available() is down to root port.
>>
> 
> As you know, link speed is a qualification of two devices speed capability.
> Both speed and width parameters get negotiated by two devices during TS1 and 
> TS2
> ordered set exchange. 
> 
> You need to see what your link partner can support in available function() vs.
> what this device can do in bandwidth() function.

I see. I'm not sure I can use pcie_print_link_status() without some
major refactoring. I need to look at the capability of the device and its
downstream port. There's no point complaining that an x16 device is
running at x4 when the port is only x4 capable.

Let me think some more on this.
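
For the capability side, the rough idea would be to clamp against the
upstream port, along these lines (a sketch only, using the helpers from
Tal's series; the wrapper name is made up):

static enum pcie_link_width pcie_link_width_cap(struct pci_dev *dev)
{
        struct pci_dev *up = pci_upstream_bridge(dev);
        enum pcie_link_width w_dev = pcie_get_width_cap(dev);

        /* Only complain up to the narrower of the two capabilities. */
        return up ? min(w_dev, pcie_get_width_cap(up)) : w_dev;
}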

Alex


Re: [PATCH] PCI: Check for PCIe downtraining conditions

2018-05-31 Thread Alex G.



On 05/31/2018 11:49 AM, Alex G. wrote:
> 
> 
> On 05/31/2018 11:13 AM, Sinan Kaya wrote:
>> On 5/31/2018 12:01 PM, Alex G. wrote:
>>>>   PCI: Add pcie_print_link_status() to log link speed and whether it's 
>>>> limited
>>> This one, I have, but it's not what I need. This looks at the available
>>> bandwidth from root port to endpoint, whereas I'm only interested in
>>> downtraining between endpoint and upstream port.
>>
>> I see what you are saying. 
>>
>> With a little bit of effort, you can reuse the same code.
>>
>> Here is an attempt.
>>
>> You can probably extend pcie_bandwidth_available() to put an optional parent 
>> bridge
>> device for your own use case and terminate the loop around here.
>>
>> https://elixir.bootlin.com/linux/v4.17-rc7/source/drivers/pci/pci.c#L5182
>>
>> Then, you can use the existing code to achieve what you are looking for via
>> pcie_print_link_status() by adding an optional parent parameter.
>>
>>  bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
>>  bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width,
>> *parent*);
> 
> That's confusing.

"confusing" refers to the way the code currently works. It doesn't refer
to your proposal.


Alex

 I'd expect _capable() and _available() to be
> symmetrical. They either both look at one link only, or both go down to
> the root port. Though it seems _capable() is link-local, and
> _available() is down to root port.
> 
>>
>> If parent parameter is NULL, code can walk all the way to root as it is 
>> doing today.
>> If it is not, then will terminate the loop on the first iteration.
>>


Re: [PATCH] PCI: Check for PCIe downtraining conditions

2018-05-31 Thread Alex G.



On 05/31/2018 11:13 AM, Sinan Kaya wrote:
> On 5/31/2018 12:01 PM, Alex G. wrote:
>>>   PCI: Add pcie_print_link_status() to log link speed and whether it's 
>>> limited
>> This one, I have, but it's not what I need. This looks at the available
>> bandwidth from root port to endpoint, whereas I'm only interested in
>> downtraining between endpoint and upstream port.
> 
> I see what you are saying. 
> 
> With a little bit of effort, you can reuse the same code.
> 
> Here is an attempt.
> 
> You can probably extend pcie_bandwidth_available() to put an optional parent 
> bridge
> device for your own use case and terminate the loop around here.
> 
> https://elixir.bootlin.com/linux/v4.17-rc7/source/drivers/pci/pci.c#L5182
> 
> Then, you can use the existing code to achieve what you are looking for via
> pcie_print_link_status() by adding an optional parent parameter.
> 
>   bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
>   bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width,
> *parent*);

That's confusing. I'd expect _capable() and _available() to be
symmetrical. They either both look at one link only, or both go down to
the root port. Though it seems _capable() is link-local, and
_available() is down to root port.

> 
> If parent parameter is NULL, code can walk all the way to root as it is doing 
> today.
> If it is not, then will terminate the loop on the first iteration.
> 
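
For reference, the terminate-at-parent variant suggested above might look
roughly like this (hypothetical shape, not the in-tree signature):

static struct pci_dev *pcie_walk_up_links(struct pci_dev *dev,
                                          struct pci_dev *parent)
{
        while (dev) {
                /* ... inspect this link's LNKSTA/LNKCAP here ... */
                if (parent && dev == parent)
                        break;          /* stop at the supplied bridge */
                dev = pci_upstream_bridge(dev);
        }
        return dev;
}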


Re: [PATCH] PCI: Check for PCIe downtraining conditions

2018-05-31 Thread Alex G.



On 05/31/2018 10:54 AM, Sinan Kaya wrote:
> On 5/31/2018 11:46 AM, Alex G. wrote:
>>> https://lkml.org/lkml/2018/3/30/553
>> Oh, pcie_get_speed_cap()/pcie_get_width_cap() seems to handle the
>> capability. Not seeing one for status and speed name.
>>
>>> are you working on linux-next?
>> v4.17-rc7
>>
> 
> I think everything you need is in the series. I don't know which linux
> tree this landed or if it landed. Probably, it is shipping for 4.18. 
> Need some help from Bjorn where to locate these.
> 
> Fri, 30 Mar 2018 16:04:40 -0500
> 
> Bjorn Helgaas (6):
>   bnx2x: Report PCIe link properties with pcie_print_link_status()
>   bnxt_en: Report PCIe link properties with pcie_print_link_status()
>   cxgb4: Report PCIe link properties with pcie_print_link_status()
>   fm10k: Report PCIe link properties with pcie_print_link_status()
>   ixgbe: Report PCIe link properties with pcie_print_link_status()
>   PCI: Remove unused pcie_get_minimum_link()

I remember seeing some of these in my tree.

> Tal Gilboa (8):
>   PCI: Add pcie_get_speed_cap() to find max supported link speed
>   PCI: Add pcie_get_width_cap() to find max supported link width
>   PCI: Add pcie_bandwidth_capable() to compute max supported link 
> bandwidth
>   PCI: Add pcie_bandwidth_available() to compute bandwidth available to 
> device

I definitely have these.

>   PCI: Add pcie_print_link_status() to log link speed and whether it's 
> limited

This one, I have, but it's not what I need. This looks at the available
bandwidth from root port to endpoint, whereas I'm only interested in
downtraining between endpoint and upstream port.

Alex



Re: [PATCH] PCI: Check for PCIe downtraining conditions

2018-05-31 Thread Alex G.



On 05/31/2018 10:38 AM, Sinan Kaya wrote:
> On 5/31/2018 11:29 AM, alex_gagn...@dellteam.com wrote:
>> On 5/31/2018 10:28 AM, Sinan Kaya wrote:
>>> On 5/31/2018 11:05 AM, Alexandru Gagniuc wrote:
 +static void pcie_max_link_cap(struct pci_dev *dev, enum pci_bus_speed 
 *speed,
 +  enum pcie_link_width *width)
 +{
 +  uint32_t lnkcap;
 +
 +  pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
 +
 +  *speed = pcie_link_speed[lnkcap & PCI_EXP_LNKCAP_SLS];
 +  *width = (lnkcap & PCI_EXP_LNKCAP_MLW) >> PCI_EXP_LNKCAP_MLW_SHIFT;
 +}
 +
 +static void pcie_cur_link_sta(struct pci_dev *dev, enum pci_bus_speed 
 *speed,
 +  enum pcie_link_width *width)
 +{
 +  uint16_t lnksta;
 +
 +  pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
 +  *speed = pcie_link_speed[lnksta & PCI_EXP_LNKSTA_CLS];
 +  *width = (lnksta & PCI_EXP_LNKSTA_NLW) >> PCI_EXP_LNKSTA_NLW_SHIFT;
 +}
 +
 +static const char *pcie_bus_speed_name(enum pci_bus_speed speed)
 +{
 +  switch (speed) {
 +  case PCIE_SPEED_2_5GT:
 +  return "2.5 GT/s";
 +  case PCIE_SPEED_5_0GT:
 +  return "5.0 GT/s";
 +  case PCIE_SPEED_8_0GT:
 +  return "8.0 GT/s";
 +  default:
 +  return "unknown";
 +  }
 +}
>>>
>>> I thought Bjorn added some functions to retrieve this now.
>>
>> Hmm. I couldn't find them.
> 
> https://lkml.org/lkml/2018/3/30/553

Oh, pcie_get_speed_cap()/pcie_get_width_cap() seems to handle the
capability. Not seeing one for status and speed name.

> are you working on linux-next?

v4.17-rc7

>>
>> Alex
>>
>>
> 
> 


Re: [PATCH 1/5] PCI/AER: Define and allocate aer_stats structure for AER capable devices

2018-05-23 Thread Alex G.
On 05/23/2018 09:32 AM, Jes Sorensen wrote:
> On 05/23/2018 10:26 AM, Matthew Wilcox wrote:
>> On Wed, May 23, 2018 at 10:20:10AM -0400, Jes Sorensen wrote:
 +++ b/drivers/pci/pcie/aer/aerdrv_stats.c
 @@ -0,0 +1,64 @@
 +// SPDX-License-Identifier: GPL-2.0
>>>
>>> Fix the formatting please - that gross // gibberish doesn't belong there.
>>
>> Sorry, Jes.  The Chief Penguin has Spoken, and that's the preferred
>> syntax:
>>
>> 2. Style:
>>
>>The SPDX license identifier is added in form of a comment.  The comment
>>style depends on the file type::
>>
>>   C source: // SPDX-License-Identifier: 
>>
>> (you can dig up the discussion around this on the mailing list if you
>> like.  Linus actually thinks that C++ single-line comments are one of
>> the few things that language got right)
> 
> Well I'll agree to disagree with Linus on this one. It's ugly as fsck
> and allows for ambiguous statements in the code.

You misspelled "fuck".

Alex


Re: [PATCH 1/5] PCI/AER: Define and allocate aer_stats structure for AER capable devices

2018-05-23 Thread Alex G.
On 05/23/2018 09:20 AM, Jes Sorensen wrote:
> On 05/22/2018 06:28 PM, Rajat Jain wrote:
>> new file mode 100644
>> index ..b9f251992209
>> --- /dev/null
>> +++ b/drivers/pci/pcie/aer/aerdrv_stats.c
>> @@ -0,0 +1,64 @@
>> +// SPDX-License-Identifier: GPL-2.0
> 
> Fix the formatting please - that gross // gibberish doesn't belong there.

Deep breath in. Deep breath out.

git grep SPDX

Although I don't like it, this format is already too common.

Cheers,
Alex


Re: [PATCH 5/5] Documentation/PCI: Add details of PCI AER statistics

2018-05-22 Thread Alex G.
On 05/22/2018 05:28 PM, Rajat Jain wrote:
> Add the PCI AER statistics details to
> Documentation/PCI/pcieaer-howto.txt
> 
> Signed-off-by: Rajat Jain 
> ---
>  Documentation/PCI/pcieaer-howto.txt | 35 +
>  1 file changed, 35 insertions(+)
> 
> diff --git a/Documentation/PCI/pcieaer-howto.txt 
> b/Documentation/PCI/pcieaer-howto.txt
> index acd06bb8..86ee9f9ff5e1 100644
> --- a/Documentation/PCI/pcieaer-howto.txt
> +++ b/Documentation/PCI/pcieaer-howto.txt
> @@ -73,6 +73,41 @@ In the example, 'Requester ID' means the ID of the device 
> who sends
>  the error message to root port. Pls. refer to pci express specs for
>  other fields.
>  
> +2.4 AER statistics
> +
> +When AER messages are captured, the statistics are exposed via the following
> +sysfs attributes under the "aer_stats" folder for the device:
> +
> +2.4.1 Device sysfs Attributes
> +
> +These attributes show up under all the devices that are AER capable. These
> +indicate the errors "as seen by the device". Note that this may mean that if
> +an end point is causing problems, the AER counters may increment at its link
> +partner (e.g. root port) because the errors will be "seen" by the link 
> partner
> +and not the the problematic end point itself (which may report all counters
> +as 0 as it never saw any problems).

I was afraid of that. Is there a way to look at the requester ID to log
AER errors to the correct device?
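
If the requester ID (plus segment) were available to the handler, something
along these lines could map it back to a pci_dev (a sketch on my part, not
part of this series; the caller would need to pci_dev_put() the result):

static struct pci_dev *aer_rid_to_dev(int domain, u16 rid)
{
        return pci_get_domain_bus_and_slot(domain, PCI_BUS_NUM(rid),
                                           rid & 0xff);
}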

Alex


Re: [PATCH 2/5] PCI/AER: Add sysfs stats for AER capable devices

2018-05-22 Thread Alex G.


On 05/22/2018 05:28 PM, Rajat Jain wrote:
> Add the following AER sysfs stats to represent the counters for each
> kind of error as seen by the device:
> 
> dev_total_cor_errs
> dev_total_fatal_errs
> dev_total_nonfatal_errs
> 
> Signed-off-by: Rajat Jain 
> ---
>  drivers/pci/pci-sysfs.c|  3 ++
>  drivers/pci/pci.h  |  4 +-
>  drivers/pci/pcie/aer/aerdrv.h  |  1 +
>  drivers/pci/pcie/aer/aerdrv_errprint.c |  1 +
>  drivers/pci/pcie/aer/aerdrv_stats.c| 72 ++
>  5 files changed, 80 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index 366d93af051d..730f985a3dc9 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -1743,6 +1743,9 @@ static const struct attribute_group 
> *pci_dev_attr_groups[] = {
>  #endif
>   &pci_bridge_attr_group,
>   &pcie_dev_attr_group,
> +#ifdef CONFIG_PCIEAER
> + &aer_stats_attr_group,
> +#endif
>   NULL,
>  };

So if the device is removed as part of recovery, then these get reset,
right? So if the device fails intermittently, these counters would keep
getting reset. Is this the intent?
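
My understanding (an assumption based on patch 1/5, sketched below) is that
the counters live in a per-device structure, so they go away together with
the struct pci_dev:

struct aer_stats {
        u64 dev_total_cor_errs;
        u64 dev_total_fatal_errs;
        u64 dev_total_nonfatal_errs;
};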

(snip)

>  /**
>   * pci_match_one_device - Tell if a PCI device structure has a matching
> diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
> index d8b9fba536ed..b5d5ad6f2c03 100644
> --- a/drivers/pci/pcie/aer/aerdrv.h
> +++ b/drivers/pci/pcie/aer/aerdrv.h
> @@ -87,6 +87,7 @@ void aer_print_port_info(struct pci_dev *dev, struct 
> aer_err_info *info);
>  irqreturn_t aer_irq(int irq, void *context);
>  int pci_aer_stats_init(struct pci_dev *pdev);
>  void pci_aer_stats_exit(struct pci_dev *pdev);
> +void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info);
>  
>  #ifdef CONFIG_ACPI_APEI
>  int pcie_aer_get_firmware_first(struct pci_dev *pci_dev);
> diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c 
> b/drivers/pci/pcie/aer/aerdrv_errprint.c
> index 21ca5e1b0ded..5e8b98deda08 100644
> --- a/drivers/pci/pcie/aer/aerdrv_errprint.c
> +++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
> @@ -155,6 +155,7 @@ static void __aer_print_error(struct pci_dev *dev,
>   pci_err(dev, "   [%2d] Unknown Error Bit%s\n",
>   i, info->first_error == i ? " (First)" : "");
>   }
> + pci_dev_aer_stats_incr(dev, info);

What about AER errors that are contained by DPC?

Alex


Re: [PATCH v6 1/2] acpi: apei: Rename ghes_severity() to ghes_cper_severity()

2018-05-22 Thread Alex G.
On 05/22/2018 01:45 PM, Luck, Tony wrote:
> On Tue, May 22, 2018 at 01:19:34PM -0500, Alex G. wrote:
>> Firmware started passing "fatal" GHES headers with the explicit intent of
>> crashing an OS. At the same time, we've learnt how to handle these errors in
>> a number of cases. With DPC (coming soon to firmware-first) the error is
>> contained, and a non-issue.
> 
> Perhaps DPC is the change that you need to emphasize as to
> why things are different now, so we can change the default
> Linux behavior.
> 
> With the h/w guaranteeing that corrupt data is contained, we
> should be safe to disregard BIOS indications of "fatal" problems
> that could be anything and might show up in unknown ways some
> time later if we keep running.

Sure. DPC is much harder to contest as a reason. However, the AER path
benefits as well from this change in behavior. I'm certain there are
other classes of errors that benefit as well from the change, though I
haven't had the time or the inclination to look for them.

Alex


Re: [PATCH v6 1/2] acpi: apei: Rename ghes_severity() to ghes_cper_severity()

2018-05-22 Thread Alex G.

On 05/22/2018 01:13 PM, Rafael J. Wysocki wrote:
(snip)

Of course, you are free to have a differing opinion and I don't have
to convince you about my point.  You need to convince me about your
point to get the patch in through my tree, which you haven't done so
far.


My point is that crossing your arms and saying "impress me" doesn't give 
me a good idea of how you'd like to see the problem resolved.


Alex


Re: [PATCH v6 1/2] acpi: apei: Rename ghes_severity() to ghes_cper_severity()

2018-05-22 Thread Alex G.

On 05/22/2018 01:10 PM, Rafael J. Wysocki wrote:

On Tue, May 22, 2018 at 7:57 PM, Luck, Tony  wrote:

On Tue, May 22, 2018 at 04:54:26PM +0200, Borislav Petkov wrote:

I especially don't want to have the case where a PCIe error is *really*
fatal and then we noodle in some handlers debating about the severity
because it got marked as recoverable intermittently and end up causing
data corruption on the storage device. Here's a real no-no for ya.


All that we have is a message from the BIOS that this is a "fatal"
error.  When did we start trusting the BIOS to give us accurate
information?


Some time ago, actually.

This is about changing the existing behavior which has been to treat
"fatal" errors reported by the BIOS as good enough reasons for a panic
for quite a while AFAICS.


Yes, you are correct. I'd actually like to go deeper, and remove the 
policy to panic() on fatal errors. Now whether we blacklist or whitelist 
which errors can go through is up for debate, but the current policy 
seems broken.



PCIe fatal means that the link or the device is broken.


And that may really mean that the component in question is on fire.
We just don't know.


Should there be a physical fire, we have much bigger issues. At the same 
time, we could retrain the link, call the driver, and release freon gas 
to put out the fire. That's something we don't currently have the option 
to do.



But that seems a poor reason to take down a large server that may have
dozens of devices (some of them set up specifically to handle
errors ... e.g. mirrored disks on separate controllers, or NIC
devices that have been "bonded" together).

So, as long as the action for a "fatal" error is to mark a link
down and offline the device, that seems a pretty reasonable course
of action.

The argument gets a lot more marginal if you simply reset the
link and re-enable the device to "fix" it. That might be enough,
but I don't think the OS has enough data to make the call.


Again, that's about changing the existing behavior or the existing policy even.

What exactly has changed to make us consider this now?


Firmware started passing "fatal" GHES headers with the explicit intent 
of crashing an OS. At the same time, we've learnt how to handle these 
errors in a number of cases. With DPC (coming soon to firmware-first) 
the error is contained, and a non-issue.


As specs and hardware implementations evolve, we have to adapt. I'm here 
until November, and one of my goals is to involve linux upstream in the 
development of these features so that when the hardware hits the market, 
we're ready. That does mean we have to drop some of the silly things 
we're doing.


Alex


Thanks,
Rafael



Re: [PATCH v6 1/2] acpi: apei: Rename ghes_severity() to ghes_cper_severity()

2018-05-22 Thread Alex G.

On 05/22/2018 12:57 PM, Luck, Tony wrote:

On Tue, May 22, 2018 at 04:54:26PM +0200, Borislav Petkov wrote:

I especially don't want to have the case where a PCIe error is *really*
fatal and then we noodle in some handlers debating about the severity
because it got marked as recoverable intermittently and end up causing
data corruption on the storage device. Here's a real no-no for ya.


All that we have is a message from the BIOS that this is a "fatal"
error.  When did we start trusting the BIOS to give us accurate
information?


When we merged ACPI handling.


PCIe fatal means that the link or the device is broken. But that
seems a poor reason to take down a large server that may have
dozens of devices (some of them set up specifically to handle
errors ... e.g. mirrored disks on separate controllers, or NIC
devices that have been "bonded" together).

So, as long as the action for a "fatal" error is to mark a link
down and offline the device, that seems a pretty reasonable course
of action.

The argument gets a lot more marginal if you simply reset the
link and re-enable the device to "fix" it. That might be enough,
but I don't think the OS has enough data to make the call.


I'm not 100% satisfied with how the AER handler works, and how certain 
drivers (nvme!!!) interface with AER handling. But this is an argument 
that belongs in PCI code, and a fight I will fight with Bjorn and Keith. 
The issue we're having with Borislav and Rafael's estate is that we 
can't make it to PCI land.


I'm seeing here the same fight that I saw with firmware vs OS, where fw 
wants to have control, and OS wants to have control. I saw the same with 
ME/CSE/CSME team vs ChromeOS team, where ME team did everything possible 
to make sure only they can access the boot vector and boot the 
processor, and ChromeOS team couldn't use this approach because they 
wanted their own root of trust. I've seen this in other places as well, 
though confidentiality agreements prevent me from talking about it.


It's the issue of control, and it's a fact of life. Borislav and Rafael 
don't want to relinquish control until they can be 100% certain that 
going further will result in 100% recovery. That is a goal I aspire to 
as well, but an unachievable ideal nonetheless.


I thought the best compromise would be to be as close as possible to 
native handling. That is, if AER can't recover, we don't recover the 
device, but the machine keeps running. I think there's some deeper 
history to GHES handling, which I didn't take into consideration. The 
fight is to convince appropriate parties to share the responsibility in 
a way which doesn't kill the machine. We still have a ways to go until 
we get there.


Alex


-Tony

P.S. I deliberately put "fatal" in quotes above because to
quote "The Princess Bride" -- "that word, I do not think it
means what you think it means". :-)



Re: [PATCH v6 1/2] acpi: apei: Rename ghes_severity() to ghes_cper_severity()

2018-05-22 Thread Alex G.


On 05/22/2018 09:54 AM, Borislav Petkov wrote:
> On Tue, May 22, 2018 at 09:39:15AM -0500, Alex G. wrote:
>> No, the problem is with the current approach, not with mine. The problem
>> is trying to handle the error outside of the existing handler. That's a
>> no-no, IMO.
> 
> Let me save you some time: until you come up with a proper solution for
> *all* PCIe errors so that the kernel can correctly decide what to do for
> each error based on its actual severity, consider this NAKed.

I do have a proper solution for _all_ PCIe errors. In fact, we discussed
several valid approaches already.

> I don't care about outside or inside of the handler 

I do. I have a handler that can handle (no pun intended) errors. I want
to use the same code path in native and GHES cases. If I allow ghes.c to
take different decisions than what aer_do_recovery() would, I've failed.

>- this thing needs to be done properly 

Exactly!

> and not just to serve your particular use case of
> abrupt removal of devices causing PCIe errors, and punish the rest.

I think you're confused about what I'm actually trying to do. Or maybe
you're confused about how PCIe errors work. That's understandable. PCIe
uses the term "fatal" for errors that may make the link unusable, and
which may require a link reset, and in most other specs "fatal" means
"on fire". I understand your confusion, and I hope I cleared it up.

You're trying to make the case that surprise removal is my only concern
and use case, because that's the example that I gave. It makes your
argument stronger, but it's wrong. You don't know our test setup, and
all the things I'm testing for, and whenever I try to tell you, you fall
back to the 'surprise removal' example.

I don't know why you'd think Dell would pay me to work on this if I were
to allow things like silent data corruption to creep in. This isn't a
traditional company from Redmond, Washington.

> I especially don't want to have the case where a PCIe error is *really*
> fatal and then we noodle in some handlers debating about the severity
> because it got marked as recoverable intermittently and end up causing
> data corruption on the storage device. Here's a real no-no for ya.

I especially don't want a kernel maintainer who hasn't even read the
recovery handler (let alone the spec around which the handler was
written) tell me how the recovery handler works and what it's supposed
to do (see, I can be an ass).
PCIe "fatal" errors really are fatal to the link: they might require
unloading the driver and removing the device. But somebody set the
questionable policy that
"fatal"=="panic", and that is completely inappropriate for a larger
class of errors -- PCIe happens to be the easiest example to pick on.

And even you realize that the argument that a panic() will somehow
prevent data corruption is complete noodle sauce.

Alex


