Re: [Linux-stm32] [PATCH v2 00/14] Introduce STM32MP1 RCC in secured mode

2021-03-11 Thread Alex G.




On 3/11/21 12:10 PM, Alexandre TORGUE wrote:

Hi Guys

On 3/11/21 5:11 PM, Marek Vasut wrote:

On 3/11/21 3:41 PM, Ahmad Fatoum wrote:

Hello,


Hi,


On 11.03.21 15:02, Alexandre TORGUE wrote:

On 3/11/21 12:43 PM, Marek Vasut wrote:

On 3/11/21 9:08 AM, Alexandre TORGUE wrote:
1- Break the current ABI: as soon as those patches are merged,
stm32mp157c-dk2.dtb will require a TF-A providing SCMI clocks. People
using U-Boot SPL will have to create their own "no-secure" devicetree.


NAK, this breaks existing boards and existing setups, e.g. DK2 that 
does not use ATF.


2- As you suggest, create a new "secure" dtb per board (not my wish,
from a maintenance perspective).


I agree with Alex (G) that the "secure" option should be opt-in.
That way existing setups remain working and no extra requirements 
are imposed on MP1 users. Esp. since as far as I understand this, 
the "secure" part isn't really about security, but rather about 
moving clock configuration from Linux to some firmware blob.


3- Keep kernel device trees as they are and apply this secure layer
(SCMI clock phandles) via a dtbo in U-Boot.


Is this really better than
#include "stm32mp15xx-enable-secure-stuff.dtsi"
in a board DT ? Because that is how I imagine the opt-in "secure" 
option could work.




Discussing with Patrick about U-Boot, we could apply a dtbo thanks to
extlinux.conf. BUT it will not prevent other cases (i.e. a TF-A which
jumps directly into the kernel). So the "least worst" solution is to
create a new "stm32mp157c-scmi-dk2" board which will overload clock
entries with an SCMI phandle (as proposed by Alex).


I raised this issue before with your colleagues. I still believe the
correct way would be for the TF-A to pass down either a device tree or
an overlay with the actual settings in use, e.g.:

   - Clocks/Resets done via SCMI
   - Reserved memory regions

If TF-A directly boots Linux, it can apply the overlay itself, otherwise
it's passed down to the SSBL that applies it before booting Linux.


That sounds good and it is something e.g. R-Car already does: it merges
DT fragments from prior stages at the U-Boot level and then passes the
result to Linux.


So on ST hardware, the same could very well happen, and it would work
for all of the non-ATF / ATF / ATF+TEE options.


Even though this solution sounds good, we are currently not able to do it
in our TF-A/U-Boot, so it is not feasible for the moment. We have to find
a solution for now, and creating a new dtb can be that solution. Our
internal strategy is to use SCMI on our official ST boards. Would it
really be a drawback to include a "no-scmi.dtsi" in DH boards (for
example) and to create a stm32mp157c-noscmi-dk2.dts?


It could work, as long as all users are reminded to change their build
scripts to pick up a "-noscmi.dtb". I suspect that if this were the case
we'd see quite a few bug reports saying "stm32mp1 no longer boots with
kernel v5.13".


I didn't think of this originally, though U-Boot already does DTB
patching for the OP-TEE reserved memory regions. It's not too hard to
also patch in the SCMI clocks at boot. In U-Boot's case, runtime
detection might even be feasible.
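
To sketch the idea (purely illustrative: detect_scmi_firmware() is a
hypothetical helper, and switching the RCC compatible stands in for the
full set of clock rewrites the series needs -- this is not tested code):

/* Hypothetical U-Boot board fixup, e.g. somewhere under board/st/stm32mp1/ */
#include <common.h>
#include <fdt_support.h>
#include <linux/libfdt.h>

int ft_board_setup(void *blob, struct bd_info *bd)
{
	int off;

	/* Hypothetical check: only patch when secure firmware owns the RCC */
	if (!detect_scmi_firmware())
		return 0;

	off = fdt_node_offset_by_compatible(blob, -1, "st,stm32mp1-rcc");
	if (off < 0)
		return 0;	/* nothing to patch */

	/* Hand the RCC node over to the secure/SCMI-backed compatible */
	return fdt_setprop_string(blob, off, "compatible",
				  "st,stm32mp1-rcc-secure");
}

The real fixup would also have to rewrite the clocks/resets phandles of
the affected consumers, but the hook point (ft_board_setup(), which runs
before the DTB is handed to Linux) is the same.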


Alex


Re: [PATCH v2 00/14] Introduce STM32MP1 RCC in secured mode

2021-03-09 Thread Alex G.

On 1/26/21 3:01 AM, gabriel.fernan...@foss.st.com wrote:

From: Gabriel Fernandez 

Platform STM32MP1 can be used in configurations where some clocks and
IP resets are handled as secure resources.
These resources are moved from an RCC clock/reset handle to an SCMI
clock/reset_domain handle.

The RCC clock driver is now dependent on the SCMI driver, so we now have
to manage probe deferral.

v1 -> v2:
   - fix yamllint warnings.


Hi Gabriel,

I don't have much clout with the maintainers, but I have to NAK this 
series after finding major breakage.


The problem with this series is that it breaks pretty much every board it
touches. I have a DK2 here that I'm using for development, which no
longer boots with this series applied.


The crux of the matter is that this series assumes all boards will boot 
with an FSBL that implements a very specific SCMI clock tree. This is 
major ABI breakage for anyone not using TF-A as the first stage 
bootloader. Anyone using u-boot SPL is screwed.


This series imposes an SoC-wide change via the dtsi files, so even boards
that you don't intend to convert to SCMI get broken this way.
Adding a -no-scmi file that isn't used anywhere doesn't help things.


Here's what I suggest:

Generate new dtb files for those boards that you want to convert. So you 
would get:

  - stm32mp157c-dk2.dtb # Good old hardware clocks
  - stm32mp157c-dk2-secure-rcc.dtb # Clocks accessible by scmi.

A lot of users use a larger build system where they extract the relevant 
files. With the scheme I'm proposing you don't break their builds, and 
you allow SCMI users to have upstream support.


This means that you'll have to rethink the DTS and DTSI changes to
accommodate both use cases.


Thanks,
Alex






Gabriel Fernandez (14):
   clk: stm32mp1: merge 'clk-hsi-div' and 'ck_hsi' into one clock
   clk: stm32mp1: merge 'ck_hse_rtc' and 'ck_rtc' into one clock
   clk: stm32mp1: remove intermediate pll clocks
   clk: stm32mp1: convert to module driver
   clk: stm32mp1: move RCC reset controller into RCC clock driver
   reset: stm32mp1: remove stm32mp1 reset
   dt-bindings: clock: add IDs for SCMI clocks on stm32mp15
   dt-bindings: reset: add IDs for SCMI reset domains on stm32mp15
   dt-bindings: reset: add MCU HOLD BOOT ID for SCMI reset domains on
 stm32mp15
   clk: stm32mp1: new compatible for secure RCC support
   ARM: dts: stm32: define SCMI resources on stm32mp15
   ARM: dts: stm32: move clocks/resets to SCMI resources for stm32mp15
   dt-bindings: clock: stm32mp1 new compatible for secure rcc
   ARM: dts: stm32: introduce basic boot include on stm32mp15x board

  .../bindings/clock/st,stm32mp1-rcc.yaml   |   6 +-
  arch/arm/boot/dts/stm32mp15-no-scmi.dtsi  | 158 ++
  arch/arm/boot/dts/stm32mp151.dtsi | 127 +++--
  arch/arm/boot/dts/stm32mp153.dtsi |   4 +-
  arch/arm/boot/dts/stm32mp157.dtsi |   2 +-
  arch/arm/boot/dts/stm32mp15xc.dtsi|   4 +-
  drivers/clk/Kconfig   |  10 +
  drivers/clk/clk-stm32mp1.c| 495 +++---
  drivers/reset/Kconfig |   6 -
  drivers/reset/Makefile|   1 -
  drivers/reset/reset-stm32mp1.c| 115 
  include/dt-bindings/clock/stm32mp1-clks.h |  27 +
  include/dt-bindings/reset/stm32mp1-resets.h   |  15 +
  13 files changed, 704 insertions(+), 266 deletions(-)
  create mode 100644 arch/arm/boot/dts/stm32mp15-no-scmi.dtsi
  delete mode 100644 drivers/reset/reset-stm32mp1.c



Re: Issues with "PCI/LINK: Report degraded links via link bandwidth notification"

2021-02-02 Thread Alex G.

On 2/2/21 2:16 PM, Bjorn Helgaas wrote:

On Tue, Feb 02, 2021 at 01:50:20PM -0600, Alex G. wrote:

On 1/29/21 3:56 PM, Bjorn Helgaas wrote:

On Thu, Jan 28, 2021 at 06:07:36PM -0600, Alex G. wrote:

On 1/28/21 5:51 PM, Sinan Kaya wrote:

On 1/28/2021 6:39 PM, Bjorn Helgaas wrote:

AFAICT, this thread petered out with no resolution.

If the bandwidth change notifications are important to somebody,
please speak up, preferably with a patch that makes the notifications
disabled by default and adds a parameter to enable them (or some other
strategy that makes sense).

I think these are potentially useful, so I don't really want to just
revert them, but if nobody thinks these are important enough to fix,
that's a possibility.


Hide behind debug or expert option by default? or even mark it as BROKEN
until someone fixes it?


Instead of making it a config option, wouldn't it be better as a kernel
parameter? People encountering this seem quite competent in passing kernel
arguments, so having a "pcie_bw_notification=off" would solve their
problems.


I don't want people to have to discover a parameter to solve issues.
If there's a parameter, notification should default to off, and people
who want notification should supply a parameter to enable it.  Same
thing for the sysfs idea.


I can imagine cases where a per-port flag would be useful. For example, a
machine with a NIC and a couple of PCIe storage drives. In this example, the
PCIe drives downtrain willy-nilly, so it's useful to turn off their
notifications, but the NIC absolutely must not downtrain. It's debatable
whether it should be default on or default off.


I think we really just need to figure out what's going on.  Then it
should be clearer how to handle it.  I'm not really in a position to
debug the root cause since I don't have the hardware or the time.


I wonder
(a) if some PCIe devices are downtraining willy-nilly to save power
(b) if this willy-nilly downtraining somehow violates the PCIe spec
(c) what is the official behavior when downtraining is intentional

My theory is: YES, YES, ASPM. But I don't know how to figure this out
without having the problem hardware in hand.


If nobody can figure out what's going on, I think we'll have to make it
disabled by default.


I think most distros do "CONFIG_PCIE_BW is not set". Is that not true?


I think it *is* true that distros do not enable CONFIG_PCIE_BW.

But it's perfectly reasonable for people building their own kernels to
enable it.  It should be safe to enable all config options.  If they
do enable CONFIG_PCIE_BW, I don't want them to waste time debugging
messages they don't expect.

If we understood why these happen and could filter out the expected
ones, that would be great.  But we don't.  We've already wasted quite
a bit of Jan's and Atanas' time, and no doubt others who haven't
bothered to file bug reports.

So I think I'll queue up a patch to remove the functionality for now.
It's easily restored if somebody debugs the problem or adds a
command-line switch or something.


I think it's best we make it a module (or kernel) parameter, default=off 
for the time being.
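
For concreteness, a minimal sketch of what that could look like inside
the bandwidth notification service (the parameter name and default below
are only illustrative, not an existing interface):

/* sketch, on top of the existing drivers/pci/pcie/bw_notification.c */
#include <linux/moduleparam.h>

static bool enable_bw_notification;	/* default off for now */
module_param(enable_bw_notification, bool, 0444);
MODULE_PARM_DESC(enable_bw_notification,
		 "Report PCIe link bandwidth degradation");

static int pcie_bandwidth_notification_probe(struct pcie_device *srv)
{
	/* Stay dormant unless the user explicitly asked for reports */
	if (!enable_bw_notification)
		return -ENODEV;

	if (!pcie_link_bandwidth_notification_supported(srv->port))
		return -ENODEV;

	/* ... existing IRQ request and notification enable ... */
	return 0;
}

(The exact command-line spelling would depend on which object the
parameter ends up in, since the port services are linked into
pcieportdrv.)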


Alex


Re: Issues with "PCI/LINK: Report degraded links via link bandwidth notification"

2021-02-02 Thread Alex G.

On 1/29/21 3:56 PM, Bjorn Helgaas wrote:

On Thu, Jan 28, 2021 at 06:07:36PM -0600, Alex G. wrote:

On 1/28/21 5:51 PM, Sinan Kaya wrote:

On 1/28/2021 6:39 PM, Bjorn Helgaas wrote:

AFAICT, this thread petered out with no resolution.

If the bandwidth change notifications are important to somebody,
please speak up, preferably with a patch that makes the notifications
disabled by default and adds a parameter to enable them (or some other
strategy that makes sense).

I think these are potentially useful, so I don't really want to just
revert them, but if nobody thinks these are important enough to fix,
that's a possibility.


Hide behind debug or expert option by default? or even mark it as BROKEN
until someone fixes it?


Instead of making it a config option, wouldn't it be better as a kernel
parameter? People encountering this seem quite competent in passing kernel
arguments, so having a "pcie_bw_notification=off" would solve their
problems.


I don't want people to have to discover a parameter to solve issues.
If there's a parameter, notification should default to off, and people
who want notification should supply a parameter to enable it.  Same
thing for the sysfs idea.


I can imagine cases where a per-port flag would be useful. For example,
a machine with a NIC and a couple of PCIe storage drives. In this
example, the PCIe drives downtrain willy-nilly, so it's useful to turn
off their notifications, but the NIC absolutely must not downtrain. It's
debatable whether it should be default on or default off.



I think we really just need to figure out what's going on.  Then it
should be clearer how to handle it.  I'm not really in a position to
debug the root cause since I don't have the hardware or the time.


I wonder
(a) if some PCIe devices are downtraining willy-nilly to save power
(b) if this willy-nilly downtraining somehow violates the PCIe spec
(c) what is the official behavior when downtraining is intentional

My theory is: YES, YES, ASPM. But I don't know how to figure this out 
without having the problem hardware in hand.




If nobody can figure out what's going on, I think we'll have to make it
disabled by default.


I think most distros do "CONFIG_PCIE_BW is not set". Is that not true?

Alex


Re: Issues with "PCI/LINK: Report degraded links via link bandwidth notification"

2021-01-28 Thread Alex G.

On 1/28/21 5:51 PM, Sinan Kaya wrote:

On 1/28/2021 6:39 PM, Bjorn Helgaas wrote:

AFAICT, this thread petered out with no resolution.

If the bandwidth change notifications are important to somebody,
please speak up, preferably with a patch that makes the notifications
disabled by default and adds a parameter to enable them (or some other
strategy that makes sense).

I think these are potentially useful, so I don't really want to just
revert them, but if nobody thinks these are important enough to fix,
that's a possibility.


Hide behind debug or expert option by default? or even mark it as BROKEN
until someone fixes it?

Instead of making it a config option, wouldn't it be better as a kernel 
parameter? People encountering this seem quite competent in passing 
kernel arguments, so having a "pcie_bw_notification=off" would solve 
their problems.


As far as marking this as broken, I've seen no conclusive evidence to
tell whether it's a SW bug or an actual hardware problem. Could we have a
sysfs attribute to disable this on a per-downstream-port basis?


e.g.
echo 0 > /sys/bus/pci/devices/:00:04.0/bw_notification_enabled

This probably won't be ideal if there are many devices downtraining 
their links ad-hoc. At worst we'd have a way to silence those messages 
if we do encounter such devices.
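
To make that concrete, a rough sketch of such a per-port knob (the
attribute name and data layout are just for illustration; nothing like
this exists today):

/* sketch: per-port on/off switch for bandwidth notification messages */
#include <linux/device.h>
#include <linux/kernel.h>

struct bw_notif_data {
	bool enabled;
};

static ssize_t bw_notification_enabled_show(struct device *dev,
					    struct device_attribute *attr,
					    char *buf)
{
	struct bw_notif_data *data = dev_get_drvdata(dev);

	return sprintf(buf, "%u\n", data->enabled);
}

static ssize_t bw_notification_enabled_store(struct device *dev,
					     struct device_attribute *attr,
					     const char *buf, size_t count)
{
	struct bw_notif_data *data = dev_get_drvdata(dev);
	bool val;

	if (kstrtobool(buf, &val))
		return -EINVAL;

	data->enabled = val;
	return count;
}
static DEVICE_ATTR_RW(bw_notification_enabled);

The threaded handler would still acknowledge the interrupt, but skip the
log message when the flag is clear.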


Alex


Re: [PATCH v2 1/2] drm/bridge: sii902x: Enable I/O and core VCC supplies if present

2020-10-20 Thread Alex G.




On 10/20/20 2:16 AM, Sam Ravnborg wrote:

Hi Alex.


[snip]





diff --git a/drivers/gpu/drm/bridge/sii902x.c b/drivers/gpu/drm/bridge/sii902x.c
index 33fd33f953ec..d15e9f2c0d8a 100644
--- a/drivers/gpu/drm/bridge/sii902x.c
+++ b/drivers/gpu/drm/bridge/sii902x.c
@@ -17,6 +17,7 @@
   #include 
   #include 
   #include 
+#include 
   #include 
   #include 
@@ -168,6 +169,8 @@ struct sii902x {
struct drm_connector connector;
struct gpio_desc *reset_gpio;
struct i2c_mux_core *i2cmux;
+   struct regulator *iovcc;
+   struct regulator *cvcc12;
/*
 * Mutex protects audio and video functions from interfering
 * each other, by keeping their i2c command sequences atomic.
@@ -954,13 +957,13 @@ static const struct drm_bridge_timings 
default_sii902x_timings = {
 | DRM_BUS_FLAG_DE_HIGH,
   };
+static int sii902x_init(struct sii902x *sii902x);

Please re-arrange the code so this prototype is not needed.


I'd be happy to re-arrange things. It will make the diff look a lot 
bigger than what it is. Is that okay?


Alex


+
   static int sii902x_probe(struct i2c_client *client,
 const struct i2c_device_id *id)
   {
	struct device *dev = &client->dev;
-   unsigned int status = 0;
struct sii902x *sii902x;
-   u8 chipid[4];
int ret;
ret = i2c_check_functionality(client->adapter,
@@ -989,6 +992,43 @@ static int sii902x_probe(struct i2c_client *client,
	mutex_init(&sii902x->mutex);
+   sii902x->iovcc = devm_regulator_get(dev, "iovcc");
+   if (IS_ERR(sii902x->iovcc))
+   return PTR_ERR(sii902x->iovcc);
+
+   sii902x->cvcc12 = devm_regulator_get(dev, "cvcc12");
+   if (IS_ERR(sii902x->cvcc12))
+   return PTR_ERR(sii902x->cvcc12);
+
+   ret = regulator_enable(sii902x->iovcc);
+   if (ret < 0) {
+   dev_err_probe(dev, ret, "Failed to enable iovcc supply");
+   return ret;
+   }
+
+   ret = regulator_enable(sii902x->cvcc12);
+   if (ret < 0) {
+   dev_err_probe(dev, ret, "Failed to enable cvcc12 supply");
+   regulator_disable(sii902x->iovcc);
+   return ret;
+   }
+
+   ret = sii902x_init(sii902x);
+   if (ret < 0) {
+   regulator_disable(sii902x->cvcc12);
+   regulator_disable(sii902x->iovcc);
+   }
+
+   return ret;
+}
+
+static int sii902x_init(struct sii902x *sii902x)
+{
+   struct device *dev = &sii902x->i2c->dev;
+   unsigned int status = 0;
+   u8 chipid[4];
+   int ret;
+
sii902x_reset(sii902x);
ret = regmap_write(sii902x->regmap, SII902X_REG_TPI_RQB, 0x0);
@@ -1012,11 +1052,11 @@ static int sii902x_probe(struct i2c_client *client,
	regmap_read(sii902x->regmap, SII902X_INT_STATUS, &status);
regmap_write(sii902x->regmap, SII902X_INT_STATUS, status);
-   if (client->irq > 0) {
+   if (sii902x->i2c->irq > 0) {
regmap_write(sii902x->regmap, SII902X_INT_ENABLE,
 SII902X_HOTPLUG_EVENT);
-   ret = devm_request_threaded_irq(dev, client->irq, NULL,
+   ret = devm_request_threaded_irq(dev, sii902x->i2c->irq, NULL,
sii902x_interrupt,
IRQF_ONESHOT, dev_name(dev),
sii902x);
@@ -1031,9 +1071,9 @@ static int sii902x_probe(struct i2c_client *client,
sii902x_audio_codec_init(sii902x, dev);
-   i2c_set_clientdata(client, sii902x);
+   i2c_set_clientdata(sii902x->i2c, sii902x);
-   sii902x->i2cmux = i2c_mux_alloc(client->adapter, dev,
+   sii902x->i2cmux = i2c_mux_alloc(sii902x->i2c->adapter, dev,
1, 0, I2C_MUX_GATE,
sii902x_i2c_bypass_select,
sii902x_i2c_bypass_deselect);
@@ -1051,6 +1091,8 @@ static int sii902x_remove(struct i2c_client *client)
i2c_mux_del_adapters(sii902x->i2cmux);
	drm_bridge_remove(&sii902x->bridge);
+   regulator_disable(sii902x->cvcc12);
+   regulator_disable(sii902x->iovcc);
return 0;
   }




Re: [PATCH v2 1/2] drm/bridge: sii902x: Enable I/O and core VCC supplies if present

2020-10-19 Thread Alex G.

On 9/28/20 12:30 PM, Alexandru Gagniuc wrote:

On the SII9022, the IOVCC and CVCC12 supplies must reach the correct
voltage before the reset sequence is initiated. On most boards, this
assumption is true at boot-up, so initialization succeeds.

However, when we try to initialize the chip with incorrect supply
voltages, it will not respond to I2C requests. sii902x_probe() fails
with -ENXIO.

To resolve this, look for the "iovcc" and "cvcc12" regulators, and
make sure they are enabled before starting the reset sequence. If
these supplies are not available in devicetree, then they will default
to dummy-regulator. In that case everything will work like before.

This was observed on a STM32MP157C-DK2 booting in u-boot falcon mode.
On this board, the supplies would be set by the second stage
bootloader, which does not run in falcon mode.

Signed-off-by: Alexandru Gagniuc 
---
Changes since v1:
   * Fix return code after regulator_enable(sii902x->iovcc) fails (Fabio 
Estevam)
   * Use dev_err_probe() instead of dev_err() where appropriate (Sam Ravnborg)

  drivers/gpu/drm/bridge/sii902x.c | 54 
  1 file changed, 48 insertions(+), 6 deletions(-)


This patch seems to have entered fall dormancy. Did I miss somebody in 
the CC field?


Alex



diff --git a/drivers/gpu/drm/bridge/sii902x.c b/drivers/gpu/drm/bridge/sii902x.c
index 33fd33f953ec..d15e9f2c0d8a 100644
--- a/drivers/gpu/drm/bridge/sii902x.c
+++ b/drivers/gpu/drm/bridge/sii902x.c
@@ -17,6 +17,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  
  #include 

@@ -168,6 +169,8 @@ struct sii902x {
struct drm_connector connector;
struct gpio_desc *reset_gpio;
struct i2c_mux_core *i2cmux;
+   struct regulator *iovcc;
+   struct regulator *cvcc12;
/*
 * Mutex protects audio and video functions from interfering
 * each other, by keeping their i2c command sequences atomic.
@@ -954,13 +957,13 @@ static const struct drm_bridge_timings 
default_sii902x_timings = {
 | DRM_BUS_FLAG_DE_HIGH,
  };
  
+static int sii902x_init(struct sii902x *sii902x);

+
  static int sii902x_probe(struct i2c_client *client,
 const struct i2c_device_id *id)
  {
	struct device *dev = &client->dev;
-   unsigned int status = 0;
struct sii902x *sii902x;
-   u8 chipid[4];
int ret;
  
  	ret = i2c_check_functionality(client->adapter,

@@ -989,6 +992,43 @@ static int sii902x_probe(struct i2c_client *client,
  
	mutex_init(&sii902x->mutex);
  
+	sii902x->iovcc = devm_regulator_get(dev, "iovcc");

+   if (IS_ERR(sii902x->iovcc))
+   return PTR_ERR(sii902x->iovcc);
+
+   sii902x->cvcc12 = devm_regulator_get(dev, "cvcc12");
+   if (IS_ERR(sii902x->cvcc12))
+   return PTR_ERR(sii902x->cvcc12);
+
+   ret = regulator_enable(sii902x->iovcc);
+   if (ret < 0) {
+   dev_err_probe(dev, ret, "Failed to enable iovcc supply");
+   return ret;
+   }
+
+   ret = regulator_enable(sii902x->cvcc12);
+   if (ret < 0) {
+   dev_err_probe(dev, ret, "Failed to enable cvcc12 supply");
+   regulator_disable(sii902x->iovcc);
+   return ret;
+   }
+
+   ret = sii902x_init(sii902x);
+   if (ret < 0) {
+   regulator_disable(sii902x->cvcc12);
+   regulator_disable(sii902x->iovcc);
+   }
+
+   return ret;
+}
+
+static int sii902x_init(struct sii902x *sii902x)
+{
+   struct device *dev = &sii902x->i2c->dev;
+   unsigned int status = 0;
+   u8 chipid[4];
+   int ret;
+
sii902x_reset(sii902x);
  
  	ret = regmap_write(sii902x->regmap, SII902X_REG_TPI_RQB, 0x0);

@@ -1012,11 +1052,11 @@ static int sii902x_probe(struct i2c_client *client,
	regmap_read(sii902x->regmap, SII902X_INT_STATUS, &status);
regmap_write(sii902x->regmap, SII902X_INT_STATUS, status);
  
-	if (client->irq > 0) {

+   if (sii902x->i2c->irq > 0) {
regmap_write(sii902x->regmap, SII902X_INT_ENABLE,
 SII902X_HOTPLUG_EVENT);
  
-		ret = devm_request_threaded_irq(dev, client->irq, NULL,

+   ret = devm_request_threaded_irq(dev, sii902x->i2c->irq, NULL,
sii902x_interrupt,
IRQF_ONESHOT, dev_name(dev),
sii902x);
@@ -1031,9 +1071,9 @@ static int sii902x_probe(struct i2c_client *client,
  
  	sii902x_audio_codec_init(sii902x, dev);
  
-	i2c_set_clientdata(client, sii902x);

+   i2c_set_clientdata(sii902x->i2c, sii902x);
  
-	sii902x->i2cmux = i2c_mux_alloc(client->adapter, dev,

+   sii902x->i2cmux = i2c_mux_alloc(sii902x->i2c->adapter, dev,
1, 0, I2C_MUX_GATE,
sii902x_i2c_bypass_select,

Re: [PATCH 1/2] drm/bridge: sii902x: Enable I/O and core VCC supplies if present

2020-09-28 Thread Alex G.

On 9/26/20 1:49 PM, Sam Ravnborg wrote:

Hi Alexandru

On Thu, Sep 24, 2020 at 03:05:05PM -0500, Alexandru Gagniuc wrote:

On the SII9022, the IOVCC and CVCC12 supplies must reach the correct
voltage before the reset sequence is initiated. On most boards, this
assumption is true at boot-up, so initialization succeeds.

However, when we try to initialize the chip with incorrect supply
voltages, it will not respond to I2C requests. sii902x_probe() fails
with -ENXIO.

To resolve this, look for the "iovcc" and "cvcc12" regulators, and
make sure they are enabled before starting the reset sequence. If
these supplies are not available in devicetree, then they will default
to dummy-regulator. In that case everything will work like before.

This was observed on a STM32MP157C-DK2 booting in u-boot falcon mode.
On this board, the supplies would be set by the second stage
bootloader, which does not run in falcon mode.

Signed-off-by: Alexandru Gagniuc 


One nitpick here. The binding should be present in the tree before
you start using it. So this patch should be applied after the binding.


It is :)
  * arch/arm/boot/dts/stm32mp15xx-dkx.dtsi

Alex


One detail below - I think others have already commented on the rest.

[snip]


Re: [PATCH 4/5] PCI: only return true when dev io state is really changed

2020-09-25 Thread Alex G.

Hi Ethan,

On 9/24/20 9:34 PM, Ethan Zhao wrote:

When an uncorrectable error happens, the AER and DPC driver interrupt
handlers likely call
pcie_do_recovery()->pci_walk_bus()->report_frozen_detected() with
pci_channel_io_frozen at the same time.
If pci_dev_set_io_state() returns true even when the original state is
already pci_channel_io_frozen, the AER and DPC handlers re-enter
the error detection and recovery procedure one after another.
The result is a recovery flow mixed between AER and DPC.
So simplify pci_dev_set_io_state() to only return true
when dev->error_state is actually changed.

Signed-off-by: Ethan Zhao 
Tested-by: Wen jin 
Tested-by: Shanshan Zhang 
---
  drivers/pci/pci.h | 31 +++
  1 file changed, 3 insertions(+), 28 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index fa12f7cbc1a0..d420bb977f3b 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -362,35 +362,10 @@ static inline bool pci_dev_set_io_state(struct pci_dev 
*dev,
bool changed = false;
  
	device_lock_assert(&dev->dev);

-   switch (new) {
-   case pci_channel_io_perm_failure:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   case pci_channel_io_perm_failure:
-   changed = true;
-   break;
-   }
-   break;
-   case pci_channel_io_frozen:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   changed = true;
-   break;
-   }
-   break;
-   case pci_channel_io_normal:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   changed = true;
-   break;
-   }
-   break;
-   }
-   if (changed)
+   if (dev->error_state != new) {
dev->error_state = new;
+   changed = true;
+   }
return changed;
  }


The flow is a lot easier to follow now. Thank you.

Reviewed-by: Alexandru Gagniuc 


Re: [PATCH 1/2] drm/bridge: sii902x: Enable I/O and core VCC supplies if present

2020-09-24 Thread Alex G.

On 9/24/20 3:22 PM, Fabio Estevam wrote:
Hi Fabio,


On Thu, Sep 24, 2020 at 5:16 PM Alexandru Gagniuc  wrote:


+   ret = regulator_enable(sii902x->cvcc12);
+   if (ret < 0) {
+   dev_err(dev, "Failed to enable cvcc12 supply: %d\n", ret);
+   regulator_disable(sii902x->iovcc);
+   return PTR_ERR(sii902x->cvcc12);


return ret;


Thank you for catching that. I will fix it in v2.



 ret = regmap_write(sii902x->regmap, SII902X_REG_TPI_RQB, 0x0);
@@ -1012,11 +1052,11 @@ static int sii902x_probe(struct i2c_client *client,
	regmap_read(sii902x->regmap, SII902X_INT_STATUS, &status);
 regmap_write(sii902x->regmap, SII902X_INT_STATUS, status);

-   if (client->irq > 0) {
+   if (sii902x->i2c->irq > 0) {


Unrelated change.

[snip]

  Unrelated change.

[snip]

Unrelated change.

[snip]

Unrelated change.


The i2c initialization is moved into a separate function. Hence 'client' 
is no longer available. Instead, it can be accessed via 'sii902x->i2c'.


Alex


Re: [PATCH v3 3/3] PCI: pciehp: Add dmi table for in-band presence disabled

2019-10-21 Thread Alex G.

On 10/21/19 1:19 PM, Stuart Hayes wrote:



On 10/21/19 8:47 AM, Mika Westerberg wrote:

On Thu, Oct 17, 2019 at 03:32:56PM -0400, Stuart Hayes wrote:

Some systems have in-band presence detection disabled for hot-plug PCI
slots, but do not report this in the slot capabilities 2 (SLTCAP2) register.
On these systems, presence detect can become active well after the link is
reported to be active, which can cause the slots to be disabled after a
device is connected.

Add a dmi table to flag these systems as having in-band presence disabled.

Signed-off-by: Stuart Hayes 
---
  drivers/pci/hotplug/pciehp_hpc.c | 14 ++
  1 file changed, 14 insertions(+)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 02eb811a014f..4d377a2a62ce 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -14,6 +14,7 @@
  
  #define dev_fmt(fmt) "pciehp: " fmt
  
+#include 

  #include 
  #include 
  #include 
@@ -26,6 +27,16 @@
  #include "../pci.h"
  #include "pciehp.h"
  
+static const struct dmi_system_id inband_presence_disabled_dmi_table[] = {

+   {
+   .ident = "Dell System",
+   .matches = {
+   DMI_MATCH(DMI_OEM_STRING, "Dell System"),


Sorry if this has been discussed previously already but isn't this going
to apply on all Dell systems, not just the affected ones? Is this the
intention?



Yes, that is the intention. Applying this just makes the hotplug code wait for
the presence detect bit to be set before proceeding, which ideally wouldn't hurt
anything--for devices that don't have inband presence detect disabled, presence
detect should already be up when the code in patch 2/3 starts to wait for it.

The only issue should be with broken hotplug implementations that don't ever
bring presence detect active (these apparently exist)--but even those would
still work, they would just take an extra second to come up.

On the other hand, a number of Dell systems have (and will have) NVMe
implementations that have inband presence detect disabled (but they won't have
the new bit implemented to report that), and they don't work correctly without
this.


I think it would be clearer if this were explained in a comment: that it
doesn't break anything, and that we're okay with it applying to all hotplug
ports, even those that are not in front of an NVMe backplane.
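
Something along these lines (the wording is only a suggestion):

static const struct dmi_system_id inband_presence_disabled_dmi_table[] = {
	/*
	 * Match all Dell systems, as some Dell systems have in-band
	 * presence disabled on NVMe slots but do not report it in SLTCAP2.
	 * Setting the quirk on ports that don't need it should be
	 * harmless: presence detect is normally already up when we start
	 * waiting for it, and slots that never assert it merely take an
	 * extra second to come up.
	 */
	{
		.ident = "Dell System",
		.matches = {
			DMI_MATCH(DMI_OEM_STRING, "Dell System"),
		},
	},
	{}
};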


Alex


Re: [PATCH 0/3] PCI: pciehp: Do not turn off slot if presence comes up after link

2019-10-02 Thread Alex G.

On 10/1/19 11:13 PM, Lukas Wunner wrote:

On Tue, Oct 01, 2019 at 05:14:16PM -0400, Stuart Hayes wrote:

This patch set is based on a patch set [1] submitted many months ago by
Alexandru Gagniuc, who is no longer working on it.

[1] https://patchwork.kernel.org/cover/10909167/
 [v3,0/4] PCI: pciehp: Do not turn off slot if presence comes up after link


If I'm not mistaken, these two are identical to Alex' patches, right?

   PCI: pciehp: Add support for disabling in-band presence
   PCI: pciehp: Wait for PDS when in-band presence is disabled

I'm not sure if it's appropriate to change the author and
omit Alex' Signed-off-by.


Legally Dell owns the patches. I have no objection on my end.

Alex


Otherwise I have no objections against this series.

Thanks,

Lukas



Re: [PATCH 3/3] PCI: pciehp: Add dmi table for in-band presence disabled

2019-10-01 Thread Alex G.




On 10/1/19 4:14 PM, Stuart Hayes wrote:

Some systems have in-band presence detection disabled for hot-plug PCI slots,
but do not report this in the slot capabilities 2 (SLTCAP2) register.  On
these systems, presence detect can become active well after the link is
reported to be active, which can cause the slots to be disabled after a
device is connected.

Add a dmi table to flag these systems as having in-band presence disabled.

Signed-off-by: Stuart Hayes 
---
  drivers/pci/hotplug/pciehp_hpc.c | 14 ++
  1 file changed, 14 insertions(+)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 1282641c6458..1dd01e752587 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -14,6 +14,7 @@
  
  #define dev_fmt(fmt) "pciehp: " fmt
  
+#include 

  #include 
  #include 
  #include 
@@ -26,6 +27,16 @@
  #include "../pci.h"
  #include "pciehp.h"
  
+static const struct dmi_system_id inband_presence_disabled_dmi_table[] = {

+   {
+   .ident = "Dell System",
+   .matches = {
+   DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc"),
+   },
+   },
+   {}
+};
+


I'm not sure that all Dell systems that were ever made or will be made 
have in-band presence disabled on all their hotplug slots.


This was a problem with the NVMe hot-swap bays on 14G servers. I can't 
guarantee that any other slot or machine will need this workaround. The 
best way I found to implement this is to check for the PCI-ID of the 
switches behind the port.
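
For the record, a rough sketch of what such a check could look like (the
vendor/device IDs are placeholders, not the actual parts), in case anyone
wants to narrow the quirk further:

/* sketch: apply the quirk only when a known backplane switch sits below */
#include <linux/pci.h>	/* assumes drivers/pci context for pci_bus_sem */

static bool slot_has_known_nvme_switch(struct pci_dev *bridge)
{
	struct pci_bus *bus = bridge->subordinate;
	struct pci_dev *child;
	bool found = false;

	if (!bus)
		return false;

	down_read(&pci_bus_sem);
	list_for_each_entry(child, &bus->devices, bus_list) {
		/* 0xabcd:0x1234 is a placeholder for the real switch ID */
		if (child->vendor == 0xabcd && child->device == 0x1234) {
			found = true;
			break;
		}
	}
	up_read(&pci_bus_sem);

	return found;
}

The catch is that the devices below the port may not be enumerated yet
when pciehp needs the answer, which is part of why the DMI match ends up
being the simpler compromise.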


Alex


Re: [PATCH] Revert "PCI/LINK: Report degraded links via link bandwidth notification"

2019-04-29 Thread Alex G

On 4/29/19 1:56 PM, Bjorn Helgaas wrote:

From: Bjorn Helgaas 

This reverts commit e8303bb7a75c113388badcc49b2a84b4121c1b3e.

e8303bb7a75c added logging whenever a link changed speed or width to a
state that is considered degraded.  Unfortunately, it cannot differentiate
signal integrity-related link changes from those intentionally initiated by
an endpoint driver, including drivers that may live in userspace or VMs
when making use of vfio-pci.  Some GPU drivers actively manage the link
state to save power, which generates a stream of messages like this:

   vfio-pci :07:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 2.5 
GT/s x16 link at :00:02.0 (capable of 64.000 Gb/s with 5 GT/s x16 link)

We really *do* want to be alerted when the link bandwidth is reduced
because of hardware failures, but degradation by intentional link state
management is probably far more common, so the signal-to-noise ratio is
currently low.

Until we figure out a way to identify the real problems or silence the
intentional situations, revert the following commits, which include the
initial implementation (e8303bb7a75c) and subsequent fixes:


I think we're overreacting to a bit of perceived verbosity in the system 
log. Intentional degradation does not seem to me to be as common as 
advertised. I have not observed this with either radeon, nouveau, or 
amdgpu, and the proper mechanism to save power at the link level is 
ASPM. I stand to be corrected and we have on CC some very knowledgeable 
fellows that I am certain will jump at the opportunity to do so.


What it seems like to me is that a proprietary driver running in a VM is 
initiating these changes. And if that is the case then it seems this is 
a virtualization problem. A quick glance over GPU drivers in linux did 
not reveal any obvious places where we intentionally downgrade a link.


I'm not convinced a revert is the best call.

Alex


 e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth 
notification")
 3e82a7f9031f ("PCI/LINK: Supply IRQ handler so level-triggered IRQs are 
acked")
 55397ce8df48 ("PCI/LINK: Clear bandwidth notification interrupt before enabling 
it")
 0fa635aec9ab ("PCI/LINK: Deduplicate bandwidth reports for multi-function 
devices")

Link: 
https://lore.kernel.org/lkml/155597243666.19387.1205950870601742062.st...@gimli.home
Link: 
https://lore.kernel.org/lkml/155605909349.3575.13433421148215616375.st...@gimli.home
Signed-off-by: Bjorn Helgaas 
CC: Alexandru Gagniuc 
CC: Lukas Wunner 
CC: Alex Williamson 
---
  drivers/pci/pci.h  |   1 -
  drivers/pci/pcie/Makefile  |   1 -
  drivers/pci/pcie/bw_notification.c | 121 -
  drivers/pci/pcie/portdrv.h |   6 +-
  drivers/pci/pcie/portdrv_core.c|  17 ++--
  drivers/pci/pcie/portdrv_pci.c |   1 -
  drivers/pci/probe.c|   2 +-
  7 files changed, 7 insertions(+), 142 deletions(-)
  delete mode 100644 drivers/pci/pcie/bw_notification.c

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index d994839a3e24..224d88634115 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -273,7 +273,6 @@ enum pcie_link_width pcie_get_width_cap(struct pci_dev 
*dev);
  u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed,
   enum pcie_link_width *width);
  void __pcie_print_link_status(struct pci_dev *dev, bool verbose);
-void pcie_report_downtraining(struct pci_dev *dev);
  
  /* Single Root I/O Virtualization */

  struct pci_sriov {
diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index f1d7bc1e5efa..ab514083d5d4 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -3,7 +3,6 @@
  # Makefile for PCI Express features and port driver
  
  pcieportdrv-y			:= portdrv_core.o portdrv_pci.o err.o

-pcieportdrv-y  += bw_notification.o
  
  obj-$(CONFIG_PCIEPORTBUS)	+= pcieportdrv.o
  
diff --git a/drivers/pci/pcie/bw_notification.c b/drivers/pci/pcie/bw_notification.c

deleted file mode 100644
index 4fa9e3523ee1..
--- a/drivers/pci/pcie/bw_notification.c
+++ /dev/null
@@ -1,121 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0+
-/*
- * PCI Express Link Bandwidth Notification services driver
- * Author: Alexandru Gagniuc 
- *
- * Copyright (C) 2019, Dell Inc
- *
- * The PCIe Link Bandwidth Notification provides a way to notify the
- * operating system when the link width or data rate changes.  This
- * capability is required for all root ports and downstream ports
- * supporting links wider than x1 and/or multiple link speeds.
- *
- * This service port driver hooks into the bandwidth notification interrupt
- * and warns when links become degraded in operation.
- */
-
-#include "../pci.h"
-#include "portdrv.h"
-
-static bool pcie_link_bandwidth_notification_supported(struct pci_dev *dev)
-{
-   int ret;
-   u32 lnk_cap;
-
-   ret = pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnk_cap);

Re: [PATCH] PCI: Add link_change error handler and vfio-pci user

2019-04-24 Thread Alex G

On 4/24/19 12:19 PM, Alex Williamson wrote:

On Wed, 24 Apr 2019 16:45:45 +
 wrote:

On 4/23/2019 5:42 PM, Alex Williamson wrote:

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 7e12d0163863..233cd4b5b6e8 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2403,6 +2403,19 @@ void pcie_report_downtraining(struct pci_dev *dev)


I don't think you want to change pcie_report_downtraining(). You're
advertising to "report" something, by nomenclature, but then go around
and also call a notification callback. This is also used during probe,
and you've now just killed your chance to notice you've booted with a
degraded link.
If what you want to do is silence the bandwidth notification, you want
to modify the threaded interrupt that calls this.


During probe, ie. discovery, a device wouldn't have a driver attached,
so we'd fall through to simply printing the link status.  Nothing
lost afaict.  The "report" verb doesn't have a subject here, report to
whom?  Therefore I thought it reasonable that a driver ask that it be
reported to them via a callback.  I don't see that as such a stretch of
the interface.


That's just stretching the logic, and IMO makes the intent harder to 
understand. The argument relies on the state of the PCI device and 
logic, which is not obvious to the casual observer. If you want to 
bypass the bandwidth notification, then bypass the notification.



if (PCI_FUNC(dev->devfn) != 0 || dev->is_virtfn)
return;
   
+	/*
+	 * If driver handles link_change event, defer to driver.  PCIe drivers
+	 * can call pcie_print_link_status() to print current link info.
+	 */
+	device_lock(&dev->dev);
+	if (dev->driver && dev->driver->err_handler &&
+	    dev->driver->err_handler->link_change) {
+		dev->driver->err_handler->link_change(dev);
+		device_unlock(&dev->dev);
+		return;
+	}
+	device_unlock(&dev->dev);


Can we write this such that there is a single lock()/unlock() pair?


Not without introducing a tracking variable, ex.

[snip bad code]

That's not markedly better imo, but if it's preferred I can send a v2.


How about:

if (!invoke_link_changed_handler(pdev))
very_useful_downtraining_message(pdev);
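
i.e. a helper along these lines (names are placeholders, and it relies on
the link_change err_handler callback proposed in this patch), which keeps
a single lock()/unlock() pair and keeps the decision out of
pcie_report_downtraining() itself:

/* sketch: returns true if a driver consumed the link-change event */
static bool invoke_link_changed_handler(struct pci_dev *dev)
{
	bool handled = false;

	device_lock(&dev->dev);
	if (dev->driver && dev->driver->err_handler &&
	    dev->driver->err_handler->link_change) {
		dev->driver->err_handler->link_change(dev);
		handled = true;
	}
	device_unlock(&dev->dev);

	return handled;
}

The bandwidth notification path would then only print the downtraining
message when this returns false, leaving the probe-time report untouched.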


Alex

Alex


Re: [PATCH] PCI/LINK: Account for BW notification in vector calculation

2019-04-23 Thread Alex G




On 4/22/19 5:43 PM, Alex Williamson wrote:

On systems that don't support any PCIe services other than bandwidth
notification, pcie_message_numbers() can return zero vectors, causing
the vector reallocation in pcie_port_enable_irq_vec() to retry with
zero, which fails, resulting in fallback to INTx (which might be
broken) for the bandwidth notification service.  This can resolve
spurious interrupt faults due to this service on some systems.

Fixes: e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth 
notification")
Signed-off-by: Alex Williamson 
---


+1
Tested on some Dell servers. Everything works as expected. I don't have 
a system with a device that only supports bandwidth notification.


Alex


Re: [PATCH] PCI/LINK: Account for BW notification in vector calculation

2019-04-23 Thread Alex G

On 4/23/19 12:10 PM, Bjorn Helgaas wrote:

On Tue, Apr 23, 2019 at 09:33:53AM -0500, Alex G wrote:

On 4/22/19 7:33 PM, Alex Williamson wrote:

There is nothing wrong happening here that needs to fill logs.  I
thought maybe if I enabled notification of autonomous bandwidth
changes that it might categorize these as something we could
ignore, but it doesn't.  How can we identify only cases where this
is an erroneous/noteworthy situation?  Thanks,


You don't. Ethernet doesn't. USB doesn't. This logging behavior is
consistent with every other subsystem that deals with multi-speed links.


Can you point me to the logging in these other subsystems so I can
learn more about how they deal with this?


I don't have any in-depth articles about the logging in these systems, 
but I can extract some logs from my machines.


Ethernet:

[Sun Apr 21 11:14:06 2019] e1000e: eno1 NIC Link is Down
[Sun Apr 21 11:14:17 2019] e1000e: eno1 NIC Link is Up 1000 Mbps Full 
Duplex, Flow Control: Rx/Tx
[Sun Apr 21 11:14:23 2019] e1000e: eno1 NIC Link is Up 1000 Mbps Full 
Duplex, Flow Control: Rx/Tx

[Sun Apr 21 23:33:31 2019] e1000e: eno1 NIC Link is Down
[Sun Apr 21 23:33:43 2019] e1000e: eno1 NIC Link is Up 1000 Mbps Full 
Duplex, Flow Control: Rx/Tx
[Sun Apr 21 23:33:48 2019] e1000e: eno1 NIC Link is Up 1000 Mbps Full 
Duplex, Flow Control: Rx/Tx


I used to have one of these "green" ethernet switches that went down to 
100mbps automatically. You can imagine how "clogged" the logs were with 
link up messages. Thank goodness that switch was killed in a thunderstorm.


USB will log every device insertion and removal, very verbosely (see 
appendix A).




I agree that emitting log messages for normal and expected events will
lead to user confusion and we need to do something.

e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth
notification") was merged in v5.1-rc1, so we still have (a little)
time to figure this out before v5.1.


I always viewed the system log as a system log, instead of a database of 
system errors. I may have extremist views, but going back to Alex's 
example, I prefer to see that the power saving mechanism is doing 
something to save power on my laptop (I'll just ignore it on a desktop).


If you think it's worth increasing code complexity because people don't want
certain things logged in the system log, then I'm certain we can work out
some sane solution. It's the same problem we see with GCC, where people want
warning messages here, but don't want the same messages there.


Alex


P.S. The pedantic in me points out that one of the examples I gave is a 
terrible example. ASPM "allows hardware-autonomous, dynamic Link power 
reduction beyond what is achievable by software-only control" [1].


[1] PCI-Express 3.0 -- 5.4.1. Active State Power Management (ASPM)


Appendix A:

[1618067.987084] usb 1-3.5: new high-speed USB device number 79 using 
xhci_hcd
[1618068.179914] usb 1-3.5: New USB device found, idVendor=0bda, 
idProduct=4014, bcdDevice= 0.05
[1618068.179924] usb 1-3.5: New USB device strings: Mfr=3, Product=1, 
SerialNumber=2

[1618068.179930] usb 1-3.5: Product: USB Audio
[1618068.179936] usb 1-3.5: Manufacturer: Generic
[1618068.179941] usb 1-3.5: SerialNumber: 200901010001
[1618068.280100] usb 1-3.6: new low-speed USB device number 80 using 
xhci_hcd

[1618068.342541] Bluetooth: hci0: Waiting for firmware download to complete
[1618068.342795] Bluetooth: hci0: Firmware loaded in 1509081 usecs
[1618068.342887] Bluetooth: hci0: Waiting for device to boot
[1618068.354919] Bluetooth: hci0: Device booted in 11797 usecs
[1618068.356006] Bluetooth: hci0: Found Intel DDC parameters: 
intel/ibt-12-16.ddc

[1618068.358958] Bluetooth: hci0: Applying Intel DDC parameters completed
[1618068.378624] usb 1-3.6: New USB device found, idVendor=04d9, 
idProduct=1400, bcdDevice= 1.43
[1618068.378626] usb 1-3.6: New USB device strings: Mfr=0, Product=0, 
SerialNumber=0
[1618068.390686] input: HID 04d9:1400 as 
/devices/pci:00/:00:14.0/usb1/1-3/1-3.6/1-3.6:1.0/0003:04D9:1400.0139/input/input921
[1618068.444282] hid-generic 0003:04D9:1400.0139: input,hidraw1: USB HID 
v1.10 Keyboard [HID 04d9:1400] on usb-:00:14.0-3.6/input0
[1618068.456373] input: HID 04d9:1400 Mouse as 
/devices/pci:00/:00:14.0/usb1/1-3/1-3.6/1-3.6:1.1/0003:04D9:1400.013A/input/input922
[1618068.457929] input: HID 04d9:1400 Consumer Control as 
/devices/pci:00/:00:14.0/usb1/1-3/1-3.6/1-3.6:1.1/0003:04D9:1400.013A/input/input923
[1618068.509294] input: HID 04d9:1400 System Control as 
/devices/pci:00/:00:14.0/usb1/1-3/1-3.6/1-3.6:1.1/0003:04D9:1400.013A/input/input924
[1618068.509518] hid-generic 0003:04D9:1400.013A: input,hidraw2: USB HID 
v1.10 Mouse [HID 04d9:1400] on usb-:00:14.0-3.6/input1
[1618068.588078] usb 1-3.7: new full-speed USB device number 81 using 
xhci_hcd
[1618068.679132] usb 1-3.7: New USB device found, idVendor=046d, 
idProduct=c52b, bcdDevice=12.03
[1

Re: [PATCH] PCI/LINK: Account for BW notification in vector calculation

2019-04-23 Thread Alex G

On 4/23/19 11:22 AM, Alex Williamson wrote:

Nor should pci-core decide what link speed changes are intended or
errors.  Minimally we should be enabling drivers to receive this
feedback.  Thanks,


Not errors. pci core reports that a link speed change event has occured. 
Period.


Alex


Re: [PATCH] PCI/LINK: Account for BW notification in vector calculation

2019-04-23 Thread Alex G




On 4/23/19 10:34 AM, Alex Williamson wrote:

On Tue, 23 Apr 2019 09:33:53 -0500
Alex G  wrote:


On 4/22/19 7:33 PM, Alex Williamson wrote:

On Mon, 22 Apr 2019 19:05:57 -0500
Alex G  wrote:

echo :07:00.0:pcie010 |
sudo tee /sys/bus/pci_express/drivers/pcie_bw_notification/unbind


That's a bad solution for users, this is meaningless tracking of a
device whose driver is actively managing the link bandwidth for power
purposes.


0.5W savings on a 100+W GPU? I agree it's meaningless.


Evidence?  Regardless, I don't have control of the driver that's making
these changes, but the claim seems unfounded and irrelevant.


The number of 5mW/Gb/lane doesn't ring a bell? [1] [2]. Your GPU 
supports 5Gb/s, so likely using an older, more power hungry process. I 
suspect it's still within the same order of magnitude.
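
As a rough back-of-the-envelope, using that ~5 mW/Gb/lane figure and
ignoring encoding overhead: a x16 link dropping from 5 GT/s to 2.5 GT/s
sheds roughly 16 lanes * 2.5 Gb/s * 5 mW/Gb, or about 0.2 W per direction,
so on the order of the 0.5 W I mentioned earlier for both directions
combined.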




I'm assigning a device to a VM [snip]
I can see why we might want to be notified of degraded links due to signal 
issues,
but what I'm reporting is that there are also entirely normal reasons
[snip] we can't seem to tell the difference


Unfortunately, there is no way in PCI-Express to distinguish between an 
expected link bandwidth change and one due to error.


If you're using virt-manager to configure the VM, then virt-manager 
could have a checkbox to disable link bandwidth management messages. I'd 
rather we avoid kernel-side heuristics (like Lukas suggested). If you're 
confident that your link will operate as intended, and don't want 
messages about it, that's your call as a user -- we shouldn't decide 
this in the kernel.


Alex

[1] 
https://www.synopsys.com/designware-ip/technical-bulletin/reduce-power-consumption.html


Re: [PATCH] PCI/LINK: Account for BW notification in vector calculation

2019-04-23 Thread Alex G

On 4/22/19 7:33 PM, Alex Williamson wrote:

On Mon, 22 Apr 2019 19:05:57 -0500
Alex G  wrote:

echo :07:00.0:pcie010 |
sudo tee /sys/bus/pci_express/drivers/pcie_bw_notification/unbind


That's a bad solution for users, this is meaningless tracking of a
device whose driver is actively managing the link bandwidth for power
purposes. 


0.5W savings on a 100+W GPU? I agree it's meaningless.


There is nothing wrong happening here that needs to fill
logs.  I thought maybe if I enabled notification of autonomous
bandwidth changes that it might categorize these as something we could
ignore, but it doesn't.
How can we identify only cases where this is
an erroneous/noteworthy situation?  Thanks,


You don't. Ethernet doesn't. USB doesn't. This logging behavior is 
consistent with every other subsystem that deals with multi-speed links. 
I realize some people are very resistant to change (and use very ancient 
kernels). I do not, however, agree that this is a sufficient argument to 
dis-unify behavior.


Alex


Re: [PATCH] PCI/LINK: Account for BW notification in vector calculation

2019-04-22 Thread Alex G

On 4/22/19 5:43 PM, Alex Williamson wrote:

[  329.725607] vfio-pci :07:00.0: 32.000 Gb/s available PCIe bandwidth,
limited by 2.5 GT/s x16 link at :00:02.0 (capable of 64.000 Gb/s with 5
GT/s x16 link)
[  708.151488] vfio-pci :07:00.0: 32.000 Gb/s available PCIe bandwidth,
limited by 2.5 GT/s x16 link at :00:02.0 (capable of 64.000 Gb/s with 5
GT/s x16 link)
[  718.262959] vfio-pci :07:00.0: 32.000 Gb/s available PCIe bandwidth,
limited by 2.5 GT/s x16 link at :00:02.0 (capable of 64.000 Gb/s with 5
GT/s x16 link)
[ 1138.124932] vfio-pci :07:00.0: 32.000 Gb/s available PCIe bandwidth,
limited by 2.5 GT/s x16 link at :00:02.0 (capable of 64.000 Gb/s with 5
GT/s x16 link)

What is the value of this nagging?


Good! The bandwidth notification service is working as intended. If this 
bothers you, you can unbind the device from the bandwidth notification 
driver:


echo :07:00.0:pcie010 |
sudo tee /sys/bus/pci_express/drivers/pcie_bw_notification/unbind




diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index 7d04f9d087a6..1b330129089f 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -55,7 +55,8 @@ static int pcie_message_numbers(struct pci_dev *dev, int mask,
 * 7.8.2, 7.10.10, 7.31.2.
 */
  
-	if (mask & (PCIE_PORT_SERVICE_PME | PCIE_PORT_SERVICE_HP)) {

+   if (mask & (PCIE_PORT_SERVICE_PME | PCIE_PORT_SERVICE_HP |
+   PCIE_PORT_SERVICE_BWNOTIF)) {
		pcie_capability_read_word(dev, PCI_EXP_FLAGS, &reg16);
*pme = (reg16 & PCI_EXP_FLAGS_IRQ) >> 9;
nvec = *pme + 1;


Good catch!


Re: [PATCH v1 1/3] PCI / ACPI: Do not export pci_get_hp_params()

2019-04-22 Thread Alex G

On 4/22/19 3:58 PM, Bjorn Helgaas wrote:

On Fri, Feb 08, 2019 at 10:24:11AM -0600, Alexandru Gagniuc wrote:

This is only used within drivers/pci, and there is no reason to make
it available outside of the PCI core.

Signed-off-by: Alexandru Gagniuc 


Applied the whole series to pci/hotplug for v5.2, thanks!

I dropped the "list" member from struct hpx_type3 because it didn't
seem to be used.


That's a good call. That was a vestigial appendage from when I first 
intended to store a list of registers in memory. I'm glad we didn't end 
up needing a list.


Alex


Fixing the GHES driver vs not causing issues in the first place

2019-03-29 Thread Alex G.

The issue of dying inside the GHES driver has popped up a few times before.
I've looked into fixing this before, but we didn't quite come to agreement
because the verbiage in the ACPI spec is vague:
    " When a fatal uncorrected error occurs, the system is
      restarted to prevent propagation of the error. "

This popped up again a couple of times recently [1]. Linus suggested that
fixing the GHES driver might pay higher dividends. I considered reviving an
old fix, but put it aside after hearing concerns about the unintended
consequences, which I'll get to soon.

A few days back, I lost an entire MD RAID1. It was during hot-removal testing
that I do on an irregular basis, and a kernel panic from the GHES driver had
caused the system to go down. I have seen some occasional filesystem
corruption in other crashes, but nothing fsck couldn't fix. The interesting
part is that the array that was lost was not part of the device set that was
being hot-removed. The state of things doesn't seem very joyful.

The machine in question is a Dell r740xd. It uses firmware-first (FFS)
handling for PCIe errors, and is generally good at identifying a hot-removal
and not bothering the OS about it. The events that I am testing for are
situations where, due to a slow operator, a tilted drive, or worn connectors,
errors make it past FFS to the OS -- with "fatal" severity.
    In that case, the GHES driver sees a fatal hardware error and panics. On
this machine, FFS reported it as fatal in an attempt to cause the system to
reboot, and prevent propagation of the error.

    The "prevent propagation of the error" flow was designed around OSes that
can't do recovery. Firmware assumes an instantaneous reboot, so it does not
set itself up to report future errors. The EFI reference code does not re-arm
errors, so we suspect most OEMs' firmware will work this way. Despite the
apparently enormous stupidity of crashing an otherwise perfectly good system,
there are good and well thought out reasons behind it.
    An example is reading a block of data from an NVMe drive, and
encountering a CRC error on the PCIe bus. If we didn't do an "instantaneous
reboot" after a previous fatal error, we will not get the CRC error reported.
Thus, we risk silent data corruption.

    On the Linux side, we can ignore the "fatal" nature of the error, and
even recover the PCIe devices behind the error. However, this is ill advised
for the reason listed above.

    The counter argument is that a panic() is more likely to cause filesystem
corruption. In over one year of testing, I have not seen a single incident of
data corruption, yet I have seen the boot ask me to run fsck on multiple
occasions. This seems to me like a tradeoff problem rather than anything else.

    In the case of this Dell machine, there are ways to hot-swap PCIe devices
without needing to take either of the risks above:
    1. Turn off the downstream port manually.
    2. Insert/replace the drive.
    3. Turn on the downstream port manually.

    What bothers me is the firmware's assumption that a fatal error must
crash the machine. I've looked at the verbiage in the spec, and I don't fully
agree that either side respects the contract. Our _OSC contract with the
firmware says that firmware will report errors to us, while we are not
allowed to touch certain parts of the device. Following the verbiage in the
spec, we do need to reboot on a fatal error, but that reboot is not
guaranteed to be instantaneous. Firmware still has to honor the contract and
report errors to us. This may seem like a spec non-compliance issue.

    It would be great if FFS would mark the errors as recoverable, but the
level of validation required makes this unrealistic on existing machines. New
machines will likely move to the Containment Error Recovery model, which is
essentially FFS for DPC (PCIe Downstream Port Containment). This leaves the
current generation of servers in limbo -- I'm led to believe other server
OEMs have very similar issues.

    Given the dilemma above, I'm really not sure what the right solution is.
How do we produce a fix that addresses complaints from both sides? When
firmware gives us a false positive, should we panic? Should we instead print
a message informing the user/admin that a reboot of the machine is required?
Should we initiate a reboot automatically?

    From Linux's ability to recover, PCIe fatal errors are false positives --
every time, all the time. However, their fatality is not there without any
reason. I do want to avoid opinions that FFS wants us to crash, gives us the
finger, and we should give the finger back.

    Another option that was discussed before, but was very controversial, is
the fix that involves not causing the problem in the first place [1] [2]. On
the machines that I have access to, FFS is handled in SMM, which means that
all CPU cores are held up until the processing of the error is complete. On
one particular machine

Re: [PATCH v2] PCI/LINK: bw_notification: Do not leave interrupt handler NULL

2019-03-25 Thread Alex G.

On 3/25/19 5:25 PM, Bjorn Helgaas wrote:

On Fri, Mar 22, 2019 at 07:36:51PM -0500, Alexandru Gagniuc wrote:

A threaded IRQ with a NULL handler does not work with level-triggered
interrupts. request_threaded_irq() will return an error:

   genirq: Threaded irq requested with handler=NULL and !ONESHOT for irq 16
   pcie_bw_notification: probe of :00:1b.0:pcie010 failed with error -22

For level interrupts we need to silence the interrupt before exiting
the IRQ handler, so just clear the PCI_EXP_LNKSTA_LBMS bit there.

Fixes: e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth 
notification")
Reported-by: Linus Torvalds 
Signed-off-by: Alexandru Gagniuc 


Applied with the following subject line to for-linus for v5.1, thanks!

   PCI/LINK: Supply IRQ handler so level-triggered IRQs are acked


You're so much better at formulating commit messages. That sounds a lot 
smoother. Thanks!



---
Changes since v1:
  - move pcie_update_link_speed() to irq to prevent duplicate read of 
link_status
  - Add Fixes: to commit message
  
  drivers/pci/pcie/bw_notification.c | 19 ++-

  1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pcie/bw_notification.c 
b/drivers/pci/pcie/bw_notification.c
index d2eae3b7cc0f..c48746f1cf3c 100644
--- a/drivers/pci/pcie/bw_notification.c
+++ b/drivers/pci/pcie/bw_notification.c
@@ -44,11 +44,10 @@ static void pcie_disable_link_bandwidth_notification(struct 
pci_dev *dev)
pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnk_ctl);
  }
  
-static irqreturn_t pcie_bw_notification_handler(int irq, void *context)

+static irqreturn_t pcie_bw_notification_irq(int irq, void *context)
  {
struct pcie_device *srv = context;
struct pci_dev *port = srv->port;
-   struct pci_dev *dev;
u16 link_status, events;
int ret;
  
@@ -58,6 +57,17 @@ static irqreturn_t pcie_bw_notification_handler(int irq, void *context)

if (ret != PCIBIOS_SUCCESSFUL || !events)
return IRQ_NONE;
  
+	pcie_capability_write_word(port, PCI_EXP_LNKSTA, events);

+   pcie_update_link_speed(port->subordinate, link_status);
+   return IRQ_WAKE_THREAD;
+}
+
+static irqreturn_t pcie_bw_notification_handler(int irq, void *context)
+{
+   struct pcie_device *srv = context;
+   struct pci_dev *port = srv->port;
+   struct pci_dev *dev;
+
/*
 * Print status from downstream devices, not this root port or
 * downstream switch port.
@@ -67,8 +77,6 @@ static irqreturn_t pcie_bw_notification_handler(int irq, void 
*context)
__pcie_print_link_status(dev, false);
	up_read(&pci_bus_sem);
  
-	pcie_update_link_speed(port->subordinate, link_status);

-   pcie_capability_write_word(port, PCI_EXP_LNKSTA, events);
return IRQ_HANDLED;
  }
  
@@ -80,7 +88,8 @@ static int pcie_bandwidth_notification_probe(struct pcie_device *srv)

if (!pcie_link_bandwidth_notification_supported(srv->port))
return -ENODEV;
  
-	ret = request_threaded_irq(srv->irq, NULL, pcie_bw_notification_handler,

+   ret = request_threaded_irq(srv->irq, pcie_bw_notification_irq,
+  pcie_bw_notification_handler,
   IRQF_SHARED, "PCIe BW notif", srv);
if (ret)
return ret;
--
2.19.2
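
The general pattern the patch applies -- ack the level-triggered source in the
hard handler, then defer the slow work to the thread -- looks roughly like the
sketch below. The device, register, and function names are made up; only the
genirq calls are real.

#include <linux/interrupt.h>
#include <linux/io.h>

#define EXAMPLE_EVENT_BIT 0x1

/* Hypothetical device with a write-1-to-clear status register. */
struct example_dev {
	void __iomem *status_reg;
};

static irqreturn_t example_hard_irq(int irq, void *context)
{
	struct example_dev *ed = context;
	u32 status = readl(ed->status_reg);

	if (!(status & EXAMPLE_EVENT_BIT))
		return IRQ_NONE;		/* not ours (IRQF_SHARED) */

	/* Ack in hard-IRQ context so the level-triggered line deasserts. */
	writel(EXAMPLE_EVENT_BIT, ed->status_reg);
	return IRQ_WAKE_THREAD;			/* hand off the slow work */
}

static irqreturn_t example_thread_fn(int irq, void *context)
{
	/* Process context: safe to sleep, take rwsems, print verbosely. */
	return IRQ_HANDLED;
}

/*
 * Registration, mirroring the patch above:
 *	request_threaded_irq(irq, example_hard_irq, example_thread_fn,
 *			     IRQF_SHARED, "example", ed);
 */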



Re: [PATCH] PCI/LINK: Request a one-shot IRQ with NULL handler

2019-03-25 Thread Alex G.

Hi Borislav,

Thanks for the update. We've settled on a different fix [1], since Lukas 
was not happy with IRQF_ONESHOT [2].


Alex

[1] 
https://lore.kernel.org/linux-pci/20190323003700.7294-1-mr.nuke...@gmail.com/
[2] 
https://lore.kernel.org/linux-pci/20190318043314.noyj6t4sh26sp...@wunner.de/


On 3/25/19 1:23 PM, Borislav Petkov wrote:

From: Borislav Petkov 

Booting v5.1-rc1 causes these new messages in dmesg here:

   genirq: Threaded irq requested with handler=NULL and !ONESHOT for irq 24
   pcie_bw_notification: probe of :00:02.0:pcie010 failed with error -22

because requesting a threaded IRQ with NULL handler needs to be oneshot.
Otherwise a level-triggered interrupt endless loop can happen, see the
detailed explanation above the "Threaded irq... " error message in
kernel/irq/manage.c.

And PCI-MSI type interrupts are one-shot safe but not in the IOAPIC
case:

   24: ... IO-APIC   28-fasteoi   PCIe BW notif, PCIe BW notif, PCIe BW notif
   25: ... IO-APIC   29-fasteoi   PCIe BW notif, PCIe BW notif

Make the BW notification handler oneshot too.

Fixes: e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth 
notification")
Suggested-by: Thomas Gleixner 
Signed-off-by: Borislav Petkov 
Cc: Bjorn Helgaas 
Cc: Lukas Wunner 
Cc: Alexandru Gagniuc 
Cc: linux-...@vger.kernel.org
---
  drivers/pci/pcie/bw_notification.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/bw_notification.c 
b/drivers/pci/pcie/bw_notification.c
index d2eae3b7cc0f..16e11c09f6bd 100644
--- a/drivers/pci/pcie/bw_notification.c
+++ b/drivers/pci/pcie/bw_notification.c
@@ -81,7 +81,7 @@ static int pcie_bandwidth_notification_probe(struct 
pcie_device *srv)
return -ENODEV;
  
  	ret = request_threaded_irq(srv->irq, NULL, pcie_bw_notification_handler,

-  IRQF_SHARED, "PCIe BW notif", srv);
+  IRQF_SHARED | IRQF_ONESHOT, "PCIe BW notif", 
srv);
if (ret)
return ret;
  



Re: [PATCH v3] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2019-03-20 Thread Alex G

On 3/20/19 4:44 PM, Linus Torvalds wrote:

On Wed, Mar 20, 2019 at 1:52 PM Bjorn Helgaas  wrote:


AFAICT, the consensus there was that it would be better to find some
sort of platform solution instead of dealing with it in individual
drivers.  The PCI core isn't really a driver, but I think the same
argument applies to it: if we had a better way to recover from readl()
errors, that way would work equally well in nvme-pci and the PCI core.


I think that patches with the pattern "if (disconnected) don't do IO"
are fundamentally broken and we should look for alternatives in all
cases.

They are fundamentally broken because they are racy: if it's an actual
sudden disconnect in the middle of IO, there's no guarantee that we'll
even be notified in time.

They are fundamentally broken because they add new magic special cases
that very few people will ever test, and the people who do test them
tend to do so with old irrelevant kernels.

Finally, they are fundamentally broken because they always end up
being just special cases. One or two special case accesses in a
driver, or perhaps all accesses of a particular type in just _one_
special driver.

Yes, yes, I realize that people want error reporting, and that
hot-removal can cause various error conditions (perhaps just parity
errors for the IO, but also perhaps other random errors caused by
firmware perhaps doing special HW setup).

But the "you get a fatal interrupt, so avoid the IO" kind of model is
completely broken, and needs to just be fixed differently. See above
why it's so completely broken.

So if the hw is set up to send some kind of synchronous interrupt or
machine check that cannot sanely be handled (perhaps because it will
just repeat forever), we should try to just disable said thing.

PCIe allows for just polling for errors on the bridges, afaik. It's
been years since I looked at it, and maybe I'm wrong. And I bet there
are various "platform-specific value add" registers etc that may need
tweaking outside of any standard spec for PCIe error reporting. But
let's do that in a platform driver, to set up the platform to not do
the silly "I'm just going to die if I see an error" thing.

It's way better to have a model where you poll each bridge once a
minute (or one an hour) and let people know "guys, your hardware
reports errors", than make random crappy changes to random drivers
because the hardware was set up to die on said errors.

And if some MIS person wants the "hardware will die" setting, then
they can damn well have that, and then it's not our problem, but it
also means that we don't start changing random drivers for that insane
setting. It's dead, Jim, and it was the user's choice.

Notice how in neither case does it make sense to try to do some "if
(disconnected) dont_do_io()" model for the drivers.


I disagree with the idea of doing something you know can cause an error 
to propagate. That being said, in this particular case we have come to 
rely a little too much on the if (disconnected) model.


You mentioned in the other thread that fixing the GHES driver will pay 
much higher dividends. I'm working on reviving a couple of changes to do 
just that. Some industry folk were very concerned about the "don't 
panic()" approach, and I want to make sure I fairly present their 
arguments in the cover letter.


I'm hoping one day we'll have the ability to use page tables to prevent 
the situations that we're trying to fix today in less than ideal ways. 
Although there are other places in msi.c that use if (disconnected), I'm 
okay with dropping this change -- provided we come up with an equivalent 
fix.


But even if FFS doesn't crash, do we really want to lose hundreds of 
milliseconds to SMM --on all cores-- when all it takes is a couple of 
cycles to check a flag?


Alex
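
As an aside, the polling model Linus describes above (check bridges for latched
errors periodically instead of relying on a screaming interrupt) could be
prototyped with a delayed work item. This is only a sketch, and it assumes the
platform actually lets the OS read the AER registers:

#include <linux/jiffies.h>
#include <linux/pci.h>
#include <linux/workqueue.h>

struct aer_poll_ctx {
	struct pci_dev *bridge;
	struct delayed_work work;	/* INIT_DELAYED_WORK(&work, aer_poll_fn) at setup */
};

static void aer_poll_fn(struct work_struct *work)
{
	struct aer_poll_ctx *ctx = container_of(to_delayed_work(work),
						struct aer_poll_ctx, work);
	struct pci_dev *dev = ctx->bridge;
	int pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
	u32 uncor, cor;

	if (pos) {
		pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &uncor);
		pci_read_config_dword(dev, pos + PCI_ERR_COR_STATUS, &cor);
		if (uncor || cor)
			pci_warn(dev, "latched AER status: uncor %#010x cor %#010x\n",
				 uncor, cor);
	}

	schedule_delayed_work(&ctx->work, 60 * HZ);	/* "once a minute" */
}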


Re: [PATCH] PCI/LINK: bw_notification: Do not leave interrupt handler NULL

2019-03-20 Thread Alex G.

On 3/20/19 8:46 AM, Bjorn Helgaas wrote:

Hi Alexandru,

On Mon, Mar 18, 2019 at 08:12:04PM -0500, Alexandru Gagniuc wrote:

A threaded IRQ with a NULL handler does not work with level-triggered
interrupts. request_threaded_irq() will return an error:

   genirq: Threaded irq requested with handler=NULL and !ONESHOT for irq 16
   pcie_bw_notification: probe of :00:1b.0:pcie010 failed with error -22

For level interrupts we need to silence the interrupt before exiting
the IRQ handler, so just clear the PCI_EXP_LNKSTA_LBMS bit there.

Reported-by: Linus Torvalds 
Signed-off-by: Alexandru Gagniuc 


What's your thought regarding Lukas' comment?  If you do repost this,
please add a Fixes: tag to help connect this with the initial commit.


I like Lukas's idea. I should have this re-posted by end of week, unless 
there's an urgency to get it out earlier.


Alex


Re: [GIT PULL] PCI changes for v5.1

2019-03-17 Thread Alex G

On 3/17/19 4:18 PM, Linus Torvalds wrote:

On Fri, Mar 8, 2019 at 9:31 AM Bjorn Helgaas  wrote:


   - Report PCIe links that become degraded at run-time (Alexandru Gagniuc)


Gaah. Only now as I'm about to do the rc1 release am I looking at new
runtime warnings, and noticing that this causes

   genirq: Threaded irq requested with handler=NULL and !ONESHOT for irq 16
   pcie_bw_notification: probe of :00:1b.0:pcie010 failed with error -22

because you can't have a NULL handler for a level-triggered interrupt
(you need something to shut the interrupt off so that it doesn't keep
screaming!).


Thanks for the catch. I did not see the error on my test machines. I'll 
take a look tomorrow, and update through Bjorn. Seems like an easy fix.


Alex


Re: [PATCH v2] PCI: pciehp: Report degraded links via link bandwidth notification

2018-12-27 Thread Alex G.

On 12/7/18 12:20 PM, Alexandru Gagniuc wrote:

A warning is generated when a PCIe device is probed with a degraded
link, but there was no similar mechanism to warn when the link becomes
degraded after probing. The Link Bandwidth Notification provides this
mechanism.

Use the link bandwidth notification interrupt to detect bandwidth
changes, and rescan the bandwidth, looking for the weakest point. This
is the same logic used in probe().

Signed-off-by: Alexandru Gagniuc 
---


ping.


Changes since v1:
  * Layer on top of the pcie port service drivers, instead of hotplug service.

This patch needs to be applied on top of:
PCI: Add missing include to drivers/pci.h

Anticipated FAQ:

Q: Why is this unconditionally compiled in?
A: The symmetrical check in pci probe() is also always compiled in.

Q: Why call module_init() instead of adding a call in pcie_init_services() ?
A: A call in pcie_init_services() also requires a prototype in portdrv.h, a
non-static implementation in bw_notification.c. Using module_init() is
functionally equivalent, and takes less code.

Q: Why print only on degraded links and not when a link is back to full speed?
A: For symmetry with PCI probe(). Although I see a benefit in conveying that a
link is back to full speed, I expect this to be extremely rare. Secondary bus
reset is usually needed to retrain at full bandwidth.


  drivers/pci/pcie/Makefile  |   1 +
  drivers/pci/pcie/bw_notification.c | 107 +
  drivers/pci/pcie/portdrv.h |   4 +-
  drivers/pci/pcie/portdrv_core.c|  14 ++--
  4 files changed, 121 insertions(+), 5 deletions(-)
  create mode 100644 drivers/pci/pcie/bw_notification.c

diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index ab514083d5d4..f1d7bc1e5efa 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -3,6 +3,7 @@
  # Makefile for PCI Express features and port driver
  
  pcieportdrv-y			:= portdrv_core.o portdrv_pci.o err.o

+pcieportdrv-y  += bw_notification.o
  
  obj-$(CONFIG_PCIEPORTBUS)	+= pcieportdrv.o
  
diff --git a/drivers/pci/pcie/bw_notification.c b/drivers/pci/pcie/bw_notification.c

new file mode 100644
index ..64391ea9a172
--- /dev/null
+++ b/drivers/pci/pcie/bw_notification.c
@@ -0,0 +1,107 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * PCI Express Bandwidth notification services driver
+ * Author: Alexandru Gagniuc 
+ *
+ * Copyright (C) 2018, Dell Inc
+ *
+ * The PCIe bandwidth notification provides a way to notify the operating 
system
+ * when the link width or data rate changes. This capability is required for 
all
+ * root ports and downstream ports supporting links wider than x1 and/or
+ * multiple link speeds.
+ *
+ * This service port driver hooks into the bandwidth notification interrupt and
+ * warns when links become degraded in operation.
+ */
+
+#include 
+
+#include "../pci.h"
+#include "portdrv.h"
+
+static bool pcie_link_bandwidth_notification_supported(struct pci_dev *dev)
+{
+   int ret;
+   u32 lnk_cap;
+
+   ret = pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnk_cap);
+   return (ret == PCIBIOS_SUCCESSFUL) && (lnk_cap & PCI_EXP_LNKCAP_LBNC);
+}
+
+static void pcie_enable_link_bandwidth_notification(struct pci_dev *dev)
+{
+   u16 lnk_ctl;
+
+   pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnk_ctl);
+   lnk_ctl |= PCI_EXP_LNKCTL_LBMIE;
+   pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnk_ctl);
+}
+
+static void pcie_disable_link_bandwidth_notification(struct pci_dev *dev)
+{
+   u16 lnk_ctl;
+
+   pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnk_ctl);
+   lnk_ctl &= ~PCI_EXP_LNKCTL_LBMIE;
+   pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnk_ctl);
+}
+
+static irqreturn_t pcie_bw_notification_irq(int irq, void *context)
+{
+   struct pcie_device *srv = context;
+   struct pci_dev *port = srv->port;
+   struct pci_dev *dev;
+   u16 link_status, events;
+   int ret;
+
+   ret = pcie_capability_read_word(port, PCI_EXP_LNKSTA, &link_status);
+   events = link_status & PCI_EXP_LNKSTA_LBMS;
+
+   if (!events || ret != PCIBIOS_SUCCESSFUL)
+   return IRQ_NONE;
+
+   /* Print status from upstream link partner, not this downstream port. */
+   list_for_each_entry(dev, &port->subordinate->devices, bus_list)
+   __pcie_print_link_status(dev, false);
+
+   pcie_capability_write_word(port, PCI_EXP_LNKSTA, events);
+   return IRQ_HANDLED;
+}
+
+static int pcie_bandwidth_notification_probe(struct pcie_device *srv)
+{
+   int ret;
+
+   /* Single-width or single-speed ports do not have to support this. */
+   if (!pcie_link_bandwidth_notification_supported(srv->port))
+   return -ENODEV;
+
+   ret = devm_request_irq(&srv->device, srv->irq, pcie_bw_notification_irq,
+  IRQF_SHARED, "PCIe BW notif", srv);
+   if (ret)
+   return ret;
+
+   

Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2018-11-05 Thread Alex G.

ping

On 09/18/2018 05:15 PM, Alexandru Gagniuc wrote:

When a PCI device is gone, we don't want to send IO to it if we can
avoid it. We expose functionality via the irq_chip structure. As
users of that structure may not know about the underlying PCI device,
it's our responsibility to guard against removed devices.

.irq_write_msi_msg() is already guarded inside __pci_write_msi_msg().
.irq_mask/unmask() are not. Guard them for completeness.

For example, surprise removal of a PCIe device triggers teardown. This
touches the irq_chip ops at some point to disable the interrupts. I/O
generated here can crash the system on firmware-first machines.
Not triggering the IO in the first place greatly reduces the
possibility of the problem occurring.

Signed-off-by: Alexandru Gagniuc 
---
  drivers/pci/msi.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index f2ef896464b3..f31058fd2260 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -227,6 +227,9 @@ static void msi_set_mask_bit(struct irq_data *data, u32 
flag)
  {
struct msi_desc *desc = irq_data_get_msi_desc(data);
  
+	if (pci_dev_is_disconnected(msi_desc_to_pci_dev(desc)))

+   return;
+
if (desc->msi_attrib.is_msix) {
msix_mask_irq(desc, flag);
readl(desc->mask_base);  /* Flush write to device */



Re: [PATCH] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2018-08-29 Thread Alex G.

Should I resubmit this rebased on 4.19-rc*, or just leave this patch as is?

Alex

On 07/30/2018 04:21 PM, Alexandru Gagniuc wrote:

When a PCI device is gone, we don't want to send IO to it if we can
avoid it. We expose functionality via the irq_chip structure. As
users of that structure may not know about the underlying PCI device,
it's our responsibility to guard against removed devices.

irq_write_msi_msg is already guarded. pci_msi_(un)mask_irq are not.
Guard them for completeness.

For example, surprise removal of a PCIe device triggers teardown. This
touches the irq_chip ops at some point to disable the interrupts. I/O
generated here can crash the system on machines with buggy firmware.
Not triggering the IO in the first place eliminates the problem.

Signed-off-by: Alexandru Gagniuc 
---

There's another patch by Lukas Wunner that is needed (not yet published)
in order to fully block IO on SURPRISE!!! removal. The existing code only
sets the PCI_DEV_DISCONNECTED bit in an unreasonably narrow set of
circumstances. Lukas' patch fixes that.

However, this change is otherwise fully independent, and enjoy!

  drivers/pci/msi.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 4d88afdfc843..5f47b5cb0401 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -227,6 +227,9 @@ static void msi_set_mask_bit(struct irq_data *data, u32 
flag)
  {
struct msi_desc *desc = irq_data_get_msi_desc(data);
  
+	if (pci_dev_is_disconnected(msi_desc_to_pci_dev(desc)))

+   return;
+
if (desc->msi_attrib.is_msix) {
msix_mask_irq(desc, flag);
readl(desc->mask_base);  /* Flush write to device */



Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER

2018-08-09 Thread Alex G.




On 08/09/2018 02:18 PM, Bjorn Helgaas wrote:

On Thu, Aug 09, 2018 at 02:00:23PM -0500, Alex G. wrote:

On 08/09/2018 01:29 PM, Bjorn Helgaas wrote:

On Thu, Aug 09, 2018 at 04:46:32PM +, alex_gagn...@dellteam.com wrote:

On 08/09/2018 09:16 AM, Bjorn Helgaas wrote:

(snip)

 enable_ecrc_checking()
 disable_ecrc_checking()


I don't immediately see how this would affect FFS, but the bits are part
of the AER capability structure. According to the FFS model, those would
be owned by FW, and we'd have to avoid touching them.


Per ACPI v6.2, sec 18.3.2.4, the HEST may contain entries for Root
Ports that contain the FIRMWARE_FIRST flag as well as values the OS is
supposed to write to several AER capability registers.  It looks like
we currently ignore everything except the FIRMWARE_FIRST and GLOBAL
flags (ACPI_HEST_FIRMWARE_FIRST and ACPI_HEST_GLOBAL in Linux).

That seems like a pretty major screwup and more than I want to fix
right now.


The logic is not very clear, but I think it goes like this:
For GLOBAL and FFS, disable native AER everywhere.
When !GLOBAL and FFS, then only disable native AER for the root port
described by the HEST entry.


I agree the code is convoluted, but that sounds right to me.

What I meant is that we ignore the values the HEST entry tells us
we're supposed to write to Device Control and the AER Uncorrectable
Error Mask, Uncorrectable Error Severity, Correctable Error Mask, and
AER Capabilities and Control.


Wait, what? _HPX has the same information. This is madness!
Since root ports are not hot-swappable, the BIOS normally programs those 
registers. Even if Linux doesn't apply said masks, the programming the BIOS
did should be sufficient to have *cough* correct *cough* behavior.



For practical considerations this is not an issue today. The ACPI error
handling code currently crashes when it encounters any fatal error, so
we wouldn't hit this in the FFS case.


I wasn't aware the firmware-first path was *that* broken.  Are there
problem reports for this?  Is this a regression?


It's been like this since, I believe, 3.10, and probably much earlier. All
reports that I have seen of linux crashing on surprise hot-plug have been
caused by the panic() call in the apei code. Dell BIOSes do an extreme
amount of work to determine when it's safe to _not_ report errors to the OS,
since all known OSes crash on this path.


Oh, is this the __ghes_panic() path?  If so, I'm going to turn away
and plead ignorance unless the PCI core is doing something wrong that
eventually results in that panic.


I agree, and I'll quote you on that!

Alex
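
Restating the GLOBAL/!GLOBAL decision above in code form, purely as an
illustration -- this is not the actual aer_hest_parse() implementation, and the
entry fields are simplified stand-ins for the ACPI structures:

#include <linux/pci.h>

static bool hest_entry_forces_ffs(bool entry_ffs, bool entry_global,
				  u16 entry_segment, u16 entry_bdf,
				  struct pci_dev *dev)
{
	if (!entry_ffs)
		return false;
	if (entry_global)
		return true;	/* FIRMWARE_FIRST + GLOBAL: applies to every port */

	/* Otherwise it applies only to the one root port the entry names. */
	return entry_segment == pci_domain_nr(dev->bus) &&
	       entry_bdf == PCI_DEVID(dev->bus->number, dev->devfn);
}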



Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER

2018-08-09 Thread Alex G.




On 08/09/2018 01:29 PM, Bjorn Helgaas wrote:

On Thu, Aug 09, 2018 at 04:46:32PM +, alex_gagn...@dellteam.com wrote:

On 08/09/2018 09:16 AM, Bjorn Helgaas wrote:

(snip)

enable_ecrc_checking()
disable_ecrc_checking()


I don't immediately see how this would affect FFS, but the bits are part
of the AER capability structure. According to the FFS model, those would
be owned by FW, and we'd have to avoid touching them.


Per ACPI v6.2, sec 18.3.2.4, the HEST may contain entries for Root
Ports that contain the FIRMWARE_FIRST flag as well as values the OS is
supposed to write to several AER capability registers.  It looks like
we currently ignore everything except the FIRMWARE_FIRST and GLOBAL
flags (ACPI_HEST_FIRMWARE_FIRST and ACPI_HEST_GLOBAL in Linux).

That seems like a pretty major screwup and more than I want to fix
right now.


The logic is not very clear, but I think it goes like this:
For GLOBAL and FFS, disable native AER everywhere.
When !GLOBAL and FFS, then only disable native AER for the root port 
described by the HEST entry.


aer_acpi_firmware_first() doesn't care about context. Though check 
aer_set_firmware_first(), where we pass a pci_device as a context.


The FFS implementations I've seen have one HEST entry, with GLOBAL and 
FFS. I have yet to see more fine-grained control of root ports. I 
suspect that if we had this finer control from HEST, we'd honor it.


I do eventually want to get a test BIOS with different HEST entries, and 
make sure things work as intended. Though BIOS team is overloaded with 
other requests.



pci_cleanup_aer_uncorrect_error_status()


This probably should be guarded. It's only called from a few specific
drivers, so the impact is not as high as being called from the core.


pci_aer_clear_fatal_status()


This is only called when doing fatal_recovery, right?


True.  It takes a lot of analysis to convince oneself that this is not
used in the firmware-first path, so I think we should add a guard
there.


I agree. GHES has a header severity and a severity field for each 
section. All BIOSes I've seen do fatal/fatal, though a BIOS that would 
report correctable/fatal would bypass the APEI code and take us here.



For practical considerations this is not an issue today. The ACPI error
handling code currently crashes when it encounters any fatal error, so
we wouldn't hit this in the FFS case.


I wasn't aware the firmware-first path was *that* broken.  Are there
problem reports for this?  Is this a regression?


It's been like this since, I believe, 3.10, and probably much earlier. 
All reports that I have seen of linux crashing on surprise hot-plug have 
been caused by the panic() call in the apei code. Dell BIOSes do an 
extreme amount of work to determine when it's safe to _not_ report 
errors to the OS, since all known OSes crash on this path.


Fun fact: there's even dedicated hardware to accomplish the above.


The PCIe standards contact I usually talk to about these PCIe subtleties
is currently on vacation. The number one issue was a FFS corner case
with OS clearing bits on probe. The other functions you mention are a
corner case of a corner case. The big fish is
pci_cleanup_aer_error_status_regs() on probe(), and it would be nice to
have that resolved.

I'll sync up with Austin when he gets back to see about the other
functions though I suspect we'll end up fixing them as well.


I'd like to fix all the obvious cases at once (excluding the ECRC
stuff).  What do you think about the following patch?


That looks very sensible to me. Thank you for updating it :).
I'm pulling in the changes right now to run some quick tests, and I 
don't expect any trouble.


Alex


commit 15ed68dcc26864c849a12a36db4d4771bad7991f
Author: Alexandru Gagniuc 
Date:   Tue Jul 17 10:31:23 2018 -0500

 PCI/AER: Don't clear AER bits if error handling is Firmware-First
 
 If the platform requests Firmware-First error handling, firmware is

 responsible for reading and clearing AER status bits.  If OSPM also clears
 them, we may miss errors.  See ACPI v6.2, sec 18.3.2.5 and 18.4.
 
 This race is mostly of theoretical significance, as it is not easy to

 reasonably demonstrate it in testing.
 
 Signed-off-by: Alexandru Gagniuc 

 [bhelgaas: add similar guards to pci_cleanup_aer_uncorrect_error_status()
 and pci_aer_clear_fatal_status()]
 Signed-off-by: Bjorn Helgaas 

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index c6cc855bfa22..4e823ae051a7 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -397,6 +397,9 @@ int pci_cleanup_aer_uncorrect_error_status(struct pci_dev 
*dev)
if (!pos)
return -EIO;
  
+	if (pcie_aer_get_firmware_first(dev))

+   return -EIO;
+
/* Clear status bits for ERR_NONFATAL errors only */
	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status);
pci_read_config_dword(dev, pos + 
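
For context on pci_cleanup_aer_uncorrect_error_status() being "only called from
a few specific drivers": the usual call site is a driver's AER .slot_reset hook,
roughly as in this sketch (the driver name is hypothetical):

#include <linux/aer.h>
#include <linux/pci.h>

static pci_ers_result_t exdrv_slot_reset(struct pci_dev *pdev)
{
	/*
	 * After the device/link reset, clear any latched non-fatal AER
	 * status.  With the guard added above, this becomes a no-op (-EIO)
	 * when firmware owns AER, which is exactly the intent.
	 */
	pci_cleanup_aer_uncorrect_error_status(pdev);
	return PCI_ERS_RESULT_RECOVERED;
}

/* Wired up via the driver's struct pci_error_handlers .slot_reset member. */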

Re: [PATCH v3] PCI/AER: Do not clear AER bits if we don't own AER

2018-08-07 Thread Alex G.




On 08/07/2018 08:14 PM, Bjorn Helgaas wrote:

On Mon, Jul 30, 2018 at 06:35:31PM -0500, Alexandru Gagniuc wrote:

When we don't own AER, we shouldn't touch the AER error bits. Clearing
error bits willy-nilly might cause firmware to miss some errors. In
theory, these bits get cleared by FFS, or via ACPI _HPX method. These
mechanisms are not subject to the problem.


What's FFS?


Firmware-first. Nobody likes spelling it out, and all other proposed 
acronyms are insanely tongue-twisting. So, FFS.



I guess you mean FFS and _HPX are not subject to the problem because
they're supplied by firmware, so firmware would be responsible for
looking at the bits before clearing them?


Exactly.


This race is mostly of theoretical significance, since I can't
reasonably demonstrate this race in the lab.

On a side-note, pcie_aer_is_kernel_first() is created to alleviate the
need for two checks: aer_cap and get_firmware_first().

Signed-off-by: Alexandru Gagniuc 
---

Changes since v2:
   - Added missing negation in pci_cleanup_aer_error_status_regs()

  drivers/pci/pcie/aer.c | 17 ++---
  1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a2e88386af28..40e5c86271d1 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -307,6 +307,12 @@ int pcie_aer_get_firmware_first(struct pci_dev *dev)
aer_set_firmware_first(dev);
return dev->__aer_firmware_first;
  }
+
+static bool pcie_aer_is_kernel_first(struct pci_dev *dev)
+{
+   return !!dev->aer_cap && !pcie_aer_get_firmware_first(dev);
+}


I think it complicates things to have both "firmware_first" and
"kernel_first" interfaces, so I would prefer to stick with the
existing "firmware_first" style.


  #define   PCI_EXP_AER_FLAGS   (PCI_EXP_DEVCTL_CERE | 
PCI_EXP_DEVCTL_NFERE | \
 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
  
@@ -337,10 +343,7 @@ bool aer_acpi_firmware_first(void)
  
  int pci_enable_pcie_error_reporting(struct pci_dev *dev)

  {
-   if (pcie_aer_get_firmware_first(dev))
-   return -EIO;
-
-   if (!dev->aer_cap)
+   if (!pcie_aer_is_kernel_first(dev))
return -EIO;
  
  	return pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);


This change doesn't actually fix anything, does it?  It looks like a
cleanup that doesn't change the behavior.


Initially (v1), this was a one-liner, but someone had a complaint about 
having pcie_aer_get_firmware_first() boilerplate all over the place. 
That's why I added the "kernel_first" function (previous comment), and 
then updated this here for completeness. I'm also fine with v1.



@@ -349,7 +352,7 @@ EXPORT_SYMBOL_GPL(pci_enable_pcie_error_reporting);
  
  int pci_disable_pcie_error_reporting(struct pci_dev *dev)

  {
-   if (pcie_aer_get_firmware_first(dev))
+   if (!pcie_aer_is_kernel_first(dev))
return -EIO;


This change does effectively add a test for dev->aer_cap.  That makes
sense in terms of symmetry with pci_enable_pcie_error_reporting(),
but I think it should be a separate patch because it's conceptually
separate from the change below.

We should keep the existing behavior (but add the symmetry) here for
now, but it's not clear to me that these paths should care about AER
or firmware-first at all.  PCI_EXP_DEVCTL is not an AER register and
we have the _HPX mechanism for firmware to influence it (which these
paths currently ignore).  I suspect we should program these reporting
enable bits in the core enumeration path instead of having drivers
call these interfaces.


The headache is that FFS needs the reporting bit to stay enabled in 
order to get AER notifications. Disabling things here could really break 
firmware. Of course, that's a cyclical argument, since FW is broken by 
definition.



If/when we make changes along these lines, the history will be easier
to follow if *this* change is not connected with the change below to
pci_cleanup_aer_error_status_regs().


I agree. I think it might be preferred then to go with v1, and leave the 
refactoring to a later time, since the extra changes are cosmetical and 
social.



return pcie_capability_clear_word(dev, PCI_EXP_DEVCTL,
@@ -383,10 +386,10 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
if (!pci_is_pcie(dev))
return -ENODEV;
  
-	pos = dev->aer_cap;

-   if (!pos)
+   if (!pcie_aer_is_kernel_first(dev))
return -EIO;


This part makes sense to me, but I think I would rather have it match
the existing style in pci_enable_pcie_error_reporting(), i.e., keep
the test for dev->aer_cap and add a test for
pcie_aer_get_firmware_first().


Had it that way in v1.

Alex


+   pos = dev->aer_cap;
port_type = pci_pcie_type(dev);
if (port_type == PCI_EXP_TYPE_ROOT_PORT) {
pci_read_config_dword(dev, pos + PCI_ERR_ROOT_STATUS, &status);
--


Re: [PATCH v5] PCI: Check for PCIe downtraining conditions

2018-07-31 Thread Alex G.

On 07/31/2018 01:40 AM, Tal Gilboa wrote:
[snip]
@@ -2240,6 +2258,9 @@ static void pci_init_capabilities(struct 
pci_dev *dev)

  /* Advanced Error Reporting */
  pci_aer_init(dev);
+    /* Check link and detect downtrain errors */
+    pcie_check_upstream_link(dev);


This is called for every PCIe device right? Won't there be a 
duplicated print in case a device loads with lower PCIe bandwidth than 
needed?


Alex, can you comment on this please?


Of course I can.

There's one print at probe() time, which happens if bandwidth < max. I 
would think that's fine. There is a way to duplicate it, and that is if 
the driver also calls print_link_status(). A few driver maintainers who 
call it have indicated they'd be fine with removing it from the driver, 
and leaving it in the core PCI.


Should the device come back at low speed, go away, then come back at
full speed, we'd still only see one print from the first probe -- again,
provided the driver doesn't also call the function.
There's the corner case where the device comes up at < max, goes away,
then comes back faster, but still < max. There will be two prints, but
they won't be true duplicates -- each one will indicate a different BW.


Alex


+
  if (pci_probe_reset_function(dev) == 0)
  dev->reset_fn = 1;
  }
diff --git a/include/linux/pci.h b/include/linux/pci.h
index abd5d5e17aee..15bfab8f7a1b 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1088,6 +1088,7 @@ int pcie_set_mps(struct pci_dev *dev, int mps);
  u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev 
**limiting_dev,

   enum pci_bus_speed *speed,
   enum pcie_link_width *width);
+void __pcie_print_link_status(struct pci_dev *dev, bool verbose);
  void pcie_print_link_status(struct pci_dev *dev);
  int pcie_flr(struct pci_dev *dev);
  int __pci_reset_function_locked(struct pci_dev *dev);




Re: [PATCH v2] PCI/AER: Do not clear AER bits if we don't own AER

2018-07-24 Thread Alex G.




On 07/23/2018 11:52 AM, Alexandru Gagniuc wrote:

When we don't own AER, we shouldn't touch the AER error bits. Clearing
error bits willy-nilly might cause firmware to miss some errors. In
theory, these bits get cleared by FFS, or via ACPI _HPX method. These
mechanisms are not subject to the problem.

This race is mostly of theoretical significance, since I can't
reasonably demonstrate this race in the lab.

On a side-note, pcie_aer_is_kernel_first() is created to alleviate the
need for two checks: aer_cap and get_firmware_first().

Signed-off-by: Alexandru Gagniuc 
---
  drivers/pci/pcie/aer.c | 17 ++---
  1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a2e88386af28..85c3e173c025 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -307,6 +307,12 @@ int pcie_aer_get_firmware_first(struct pci_dev *dev)
aer_set_firmware_first(dev);
return dev->__aer_firmware_first;
  }
+
+static bool pcie_aer_is_kernel_first(struct pci_dev *dev)
+{
+   return !!dev->aer_cap && !pcie_aer_get_firmware_first(dev);
+}
+
  #define   PCI_EXP_AER_FLAGS   (PCI_EXP_DEVCTL_CERE | 
PCI_EXP_DEVCTL_NFERE | \
 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
  
@@ -337,10 +343,7 @@ bool aer_acpi_firmware_first(void)
  
  int pci_enable_pcie_error_reporting(struct pci_dev *dev)

  {
-   if (pcie_aer_get_firmware_first(dev))
-   return -EIO;
-
-   if (!dev->aer_cap)
+   if (!pcie_aer_is_kernel_first(dev))
return -EIO;
  
  	return pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);

@@ -349,7 +352,7 @@ EXPORT_SYMBOL_GPL(pci_enable_pcie_error_reporting);
  
  int pci_disable_pcie_error_reporting(struct pci_dev *dev)

  {
-   if (pcie_aer_get_firmware_first(dev))
+   if (!pcie_aer_is_kernel_first(dev))
return -EIO;
  
  	return pcie_capability_clear_word(dev, PCI_EXP_DEVCTL,

@@ -383,10 +386,10 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
if (!pci_is_pcie(dev))
return -ENODEV;
  
-	pos = dev->aer_cap;

-   if (!pos)
+   if (pcie_aer_is_kernel_first(dev))


This here is missing a '!'. It's in my local branch, so I must have 
exported the patch before I fixed that. I'll get that fixed next rev.



return -EIO;
  
+	pos = dev->aer_cap;

port_type = pci_pcie_type(dev);
if (port_type == PCI_EXP_TYPE_ROOT_PORT) {
pci_read_config_dword(dev, pos + PCI_ERR_ROOT_STATUS, &status);



Re: [PATCH v5] PCI: Check for PCIe downtraining conditions

2018-07-23 Thread Alex G.




On 07/23/2018 05:14 PM, Jakub Kicinski wrote:

On Tue, 24 Jul 2018 00:52:22 +0300, Tal Gilboa wrote:

On 7/24/2018 12:01 AM, Jakub Kicinski wrote:

On Mon, 23 Jul 2018 15:03:38 -0500, Alexandru Gagniuc wrote:

PCIe downtraining happens when both the device and PCIe port are
capable of a larger bus width or higher speed than negotiated.
Downtraining might be indicative of other problems in the system, and
identifying this from userspace is neither intuitive, nor
straightforward.

The easiest way to detect this is with pcie_print_link_status(),
since the bottleneck is usually the link that is downtrained. It's not
a perfect solution, but it works extremely well in most cases.

Signed-off-by: Alexandru Gagniuc 
---

For the sake of review, I've created a __pcie_print_link_status() which
takes a 'verbose' argument. If we agree want to go this route, and update
the users of pcie_print_link_status(), I can split this up in two patches.
I prefer just printing this information in the core functions, and letting
drivers not have to worry about this. Though there seems to be strong for
not going that route, so here it goes:


FWIW the networking drivers print PCIe BW because sometimes the network
bandwidth is simply over-provisioned on multi port cards, e.g. 80Gbps
card on a x8 link.

Sorry to bike shed, but currently the networking cards print the info
during probe.  Would it make sense to move your message closer to probe
time?  Rather than when device is added.  If driver structure is
available, we could also consider adding a boolean to struct pci_driver
to indicate if driver wants the verbose message?  This way we avoid
duplicated prints.

I have no objection to current patch, it LGTM.  Just a thought.


I don't see the reason for having two functions. What's the problem with
adding the verbose argument to the original function?


IMHO it's reasonable to keep the default parameter to what 90% of users
want by means of a wrapper.  The non-verbose output is provided by
the core already for all devices.

What do you think of my proposal above Tal?  That would make the extra
wrapper unnecessary since the verbose parameter would be part of the
driver structure, and it would avoid the duplicated output.


I see how it might make sense to add another member to the driver 
struct, but is it worth the extra learning curve? It seems to be 
something with the potential to confuse new driver developers, while
having only a very marginal benefit.

Although, if that's what people want...

Alex



Re: [PATCH v3] PCI: Check for PCIe downtraining conditions

2018-07-23 Thread Alex G.

On 07/23/2018 12:21 AM, Tal Gilboa wrote:

On 7/19/2018 6:49 PM, Alex G. wrote:



On 07/18/2018 08:38 AM, Tal Gilboa wrote:

On 7/16/2018 5:17 PM, Bjorn Helgaas wrote:

[+cc maintainers of drivers that already use pcie_print_link_status()
and GPU folks]

[snip]



+    /* Multi-function PCIe share the same link/status. */
+    if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn)
+    return;
+
+    pcie_print_link_status(dev);
+}


Is this function called by default for every PCIe device? What about 
VFs? We make an exception for them on our driver since a VF doesn't 
have access to the needed information in order to provide a 
meaningful message.


I'm assuming VF means virtual function. pcie_print_link_status() 
doesn't care if it's passed a virtual function. It will try to do its 
job. That's why I bail out three lines above, with 'dev->is_virtfn' 
check.


Alex


That's the point - we don't want to call pcie_print_link_status() for 
virtual functions. We make the distinction in our driver. If you want to 
change the code to call this function by default it shouldn't affect the 
current usage.


I'm not understanding very well what you're asking. I understand you 
want to avoid printing this message on virtual functions, and that's 
already taken care of. I'm also not changing current behavior.  Let's 
get v2 out and start the discussion again based on that.


Alex


Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER

2018-07-19 Thread Alex G.




On 07/19/2018 11:58 AM, Sinan Kaya wrote:


On 7/19/2018 8:55 AM, Alex G. wrote:
I find the intent clearer if we check it here rather than having to do 
the mental parsing of the state of aer_cap.


I don't feel too strong about my comment to be honest. This was a
style/maintenance comment.

It feels like we are putting pcie_aer_get_firmware_first() into core
functions unnecessarily after your change. I understand the need for
your change. I'm asking if it is the right place or not.

pcie_aer_get_firmware_first() should be called from either the init or
probe function so that the rest of the AER functions do not get called
from any other context.

If someone adds another AER function, we might need to add another
pcie_aer_get_firmware_first() check there. So, we have unnecessary code
duplication.


We could move the aer_cap and get_ffs() check into one function that we 
end up calling all over the place. I understand your concern about code 
duplication, and I agree with it. I don't think that at this point it's 
that big of a deal, although we might need to guard every aer_() call.


So moving all the checks in a pcie_aer_is_kernel_first() makes sense.

Alex
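
A minimal sketch of the consolidated check discussed above, assuming it
lives inside drivers/pci/pcie/aer.c where dev->aer_cap and
pcie_aer_get_firmware_first() are already visible; the helper itself is
illustrative, not a posted patch:

    /*
     * One predicate answering "does the kernel own AER for this device?",
     * so callers stop open-coding the aer_cap and firmware-first checks.
     */
    static bool pcie_aer_is_kernel_first(struct pci_dev *dev)
    {
            if (!pci_is_pcie(dev))
                    return false;
            if (!dev->aer_cap)      /* no AER capability at all */
                    return false;
            return !pcie_aer_get_firmware_first(dev);
    }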


Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER

2018-07-19 Thread Alex G.




On 07/17/2018 10:41 AM, Sinan Kaya wrote:


On 7/17/2018 8:31 AM, Alexandru Gagniuc wrote:

+    if (pcie_aer_get_firmware_first(dev))
+    return -EIO;


Can you move this to closer to the caller pci_aer_init()?


I could move it there, although pci_cleanup_aer_error_status_regs() is 
called directly from pci_restore_state.  Of course, aer_cap should be 
zero in this case, and we'd still bail out.
I find the intent clearer if we check it here rather than having to do 
the mental parsing of the state of aer_cap.


Alex


Re: [PATCH v3] PCI: Check for PCIe downtraining conditions

2018-07-19 Thread Alex G.




On 07/18/2018 08:38 AM, Tal Gilboa wrote:

On 7/16/2018 5:17 PM, Bjorn Helgaas wrote:

[+cc maintainers of drivers that already use pcie_print_link_status()
and GPU folks]

[snip]



+    /* Multi-function PCIe share the same link/status. */
+    if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn)
+    return;
+
+    pcie_print_link_status(dev);
+}


Is this function called by default for every PCIe device? What about 
VFs? We make an exception for them on our driver since a VF doesn't have 
access to the needed information in order to provide a meaningful message.


I'm assuming VF means virtual function. pcie_print_link_status() doesn't 
care if it's passed a virtual function. It will try to do its job. 
That's why I bail out three lines above, with 'dev->is_virtfn' check.


Alex


Re: [PATCH v3] PCI: Check for PCIe downtraining conditions

2018-07-19 Thread Alex G.




On 07/18/2018 04:53 PM, Bjorn Helgaas wrote:

[+cc Mike (hfi1)]

On Mon, Jul 16, 2018 at 10:28:35PM +, alex_gagn...@dellteam.com wrote:

On 7/16/2018 4:17 PM, Bjorn Helgaas wrote:

...
The easiest way to detect this is with pcie_print_link_status(),
since the bottleneck is usually the link that is downtrained. It's not
a perfect solution, but it works extremely well in most cases.


This is an interesting idea.  I have two concerns:

Some drivers already do this on their own, and we probably don't want
duplicate output for those devices.  In most cases (ixgbe and mlx* are
exceptions), the drivers do this unconditionally so we *could* remove
it from the driver if we add it to the core.  The dmesg order would
change, and the message wouldn't be associated with the driver as it
now is.


Oh, there are only 8 users of that. Even I could patch up the drivers to
remove the call, assuming we reach agreement about this change.


Also, I think some of the GPU devices might come up at a lower speed,
then download firmware, then reset the device so it comes up at a
higher speed.  I think this patch will make us complain about
the low initial speed, which might confuse users.


I spoke to one of the PCIe spec writers. It's allowable for a device to
downtrain speed or width. It would also be extremely dumb to downtrain
with the intent to re-train at a higher speed later, but it's possible
devices do dumb stuff like that. That's why it's an informational
message, instead of a warning.


FWIW, here's some of the discussion related to hfi1 from [1]:

   > Btw, why is the driver configuring the PCIe link speed?  Isn't
   > this something we should be handling in the PCI core?

   The device comes out of reset at the 5GT/s speed. The driver
   downloads device firmware, programs PCIe registers, and co-ordinates
   the transition to 8GT/s.

   This recipe is device specific and is therefore implemented in the
   hfi1 driver built on top of PCI core functions and macros.

Also several DRM drivers seem to do this (see, e.g.,
si_pcie_gen3_enable()); from [2]:

   My understanding was that some platforms only bring up the link in gen 1
   mode for compatibility reasons.

[1] 
https://lkml.kernel.org/r/32e1700b9017364d9b60aed9960492bc627ff...@fmsmsx120.amr.corp.intel.com
[2] 
https://lkml.kernel.org/r/bn6pr12mb1809bd30aa5b890c054f9832f7...@bn6pr12mb1809.namprd12.prod.outlook.com


Downtraining a link "for compatibility reasons" is one of those dumb 
things that devices do. I'm SURPRISED AMD HW does it, although it is 
perfectly permissible by PCIe spec.



Another case: Some devices (lower-end GPUs) use silicon (and marketing)
that advertises x16, but they're only routed for x8. I'm okay with
seeing an informational message in this case. In fact, I didn't know
that my Quadro card for three years is only wired for x8 until I was
testing this patch.


Yeah, it's probably OK.  I don't want bug reports from people who
think something's broken when it's really just a hardware limitation
of their system.  But hopefully the message is not alarming.


It looks fairly innocent:

[0.749415] pci :18:00.0: 4.000 Gb/s available PCIe bandwidth, 
limited by 5 GT/s x1 link at :17:03.0 (capable of 15.752 Gb/s with 8 
GT/s x2 link)



So I'm not sure whether it's better to do this in the core for all
devices, or if we should just add it to the high-performance drivers
that really care.


You're thinking "do I really need that bandwidth" because I'm using a
function called "_bandwidth_". The point of the change is very far from
that: it is to help in system troubleshooting by detecting downtraining
conditions.


I'm not sure what you think I'm thinking :)  My question is whether
it's worthwhile to print this extra information for *every* PCIe
device, given that your use case is the tiny percentage of broken
systems.


I think this information is a lot more useful than a bunch of other info 
that's printed. Is "type 00 class 0x088000" more valuable? What about 
"reg 0x20: [mem 0x9d95-0x9d95 64bit pref]", which is also 
available under /proc/iomem for those curious?



If we only printed the info in the "bw_avail < bw_cap" case, i.e.,
when the device is capable of more than it's getting, that would make
a lot of sense to me.  The normal case line is more questionable.  I
think the reason that's there is because the network drivers are very
performance sensitive and like to see that info all the time.


I agree that can be an acceptable compromise.


Maybe we need something like this:

   pcie_print_link_status(struct pci_dev *dev, int verbose)
   {
 ...
 if (bw_avail >= bw_cap) {
   if (verbose)
 pci_info(dev, "... available PCIe bandwidth ...");
 } else
   pci_info(dev, "... available PCIe bandwidth, limited by ...");
   }

So the core could print only the potential problems with:

   pcie_print_link_status(dev, 0);

and drivers that really care even if there's no problem could do:

   pcie_print_link_status(dev, 1);
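
Fleshing the pseudo-code above out with the real types, the split could be
shaped roughly as follows. The format strings and the wrapper are a sketch
only, not necessarily what gets merged; pcie_bandwidth_capable() is
internal to drivers/pci, so this assumes the code lives there:

    void __pcie_print_link_status(struct pci_dev *dev, bool verbose)
    {
            enum pcie_link_width width, width_cap;
            enum pci_bus_speed speed, speed_cap;
            struct pci_dev *limiting_dev = NULL;
            u32 bw_avail, bw_cap;

            bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
            bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width);

            if (bw_avail >= bw_cap) {
                    if (verbose)
                            pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth\n",
                                     bw_cap / 1000, bw_cap % 1000);
            } else {
                    pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth, limited by link at %s\n",
                             bw_avail / 1000, bw_avail % 1000,
                             limiting_dev ? pci_name(limiting_dev) : "<unknown>");
            }
    }

    /* The existing entry point keeps its behaviour as a thin wrapper. */
    void pcie_print_link_status(struct pci_dev *dev)
    {
            __pcie_print_link_status(dev, true);
    }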

Re: [PATCH v3] PCI/AER: Fix aerdrv loading with "pcie_ports=native" parameter

2018-07-03 Thread Alex G.



On 07/03/2018 11:38 AM, Bjorn Helgaas wrote:
> On Mon, Jul 02, 2018 at 11:16:01AM -0500, Alexandru Gagniuc wrote:
>> According to the documentation, "pcie_ports=native", linux should use
>> native AER and DPC services. While that is true for the _OSC method
>> parsing, this is not the only place that is checked. Should the HEST
>> table list PCIe ports as firmware-first, linux will not use native
>> services.
>>
>> This happens because aer_acpi_firmware_first() doesn't take
>> 'pcie_ports' into account. This is wrong. DPC uses the same logic when
>> it decides whether to load or not, so fixing this also fixes DPC not
>> loading.
>>
>> Signed-off-by: Alexandru Gagniuc 
>> ---
>>  drivers/pci/pcie/aer.c | 9 ++---
>>  1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index a2e88386af28..db2c01056dc7 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -283,13 +283,14 @@ static int aer_hest_parse(struct acpi_hest_header 
>> *hest_hdr, void *data)
>>  
>>  static void aer_set_firmware_first(struct pci_dev *pci_dev)
>>  {
>> -int rc;
>> +int rc = 0;
>>  struct aer_hest_parse_info info = {
>>  .pci_dev= pci_dev,
>>  .firmware_first = 0,
>>  };
>>  
>> -rc = apei_hest_parse(aer_hest_parse, &info);
>> +if (!pcie_ports_native)
>> +rc = apei_hest_parse(aer_hest_parse, &info);
>>  
>>  if (rc)
>>  pci_dev->__aer_firmware_first = 0;
>> @@ -324,7 +325,9 @@ bool aer_acpi_firmware_first(void)
>>  };
>>  
>>  if (!parsed) {
>> -apei_hest_parse(aer_hest_parse, &info);
>> +if (!pcie_ports_native)
>> +apei_hest_parse(aer_hest_parse, &info);
>> +
>>  aer_firmware_first = info.firmware_first;
>>  parsed = true;
>>  }
> 
> I was thinking something along the lines of the patch below, so we
> don't have to work through the settings of "rc" and "info".  But maybe
> I'm missing something subtle?
> 
> One subtle thing that I didn't look into is the
> pcie_aer_get_firmware_first() stub for the non-CONFIG_ACPI_APEI case.
> 
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index b24f2d252180..5ccbd7635f33 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -342,6 +342,9 @@ int pcie_aer_get_firmware_first(struct pci_dev *dev)
>   if (!pci_is_pcie(dev))
>   return 0;
>  
> + if (pcie_ports_native)
> + return 0;
> +

Here we're not setting dev->__aer_firmware_first_valid. Assuming we only
check for it via pcie_aer_get_firmware_first(), we should be fine. This
field is used in portdrv.h, but its use seems safe. I think this
approach can work.

>   if (!dev->__aer_firmware_first_valid)
>   aer_set_firmware_first(dev);
>   return dev->__aer_firmware_first;
> @@ -362,6 +365,9 @@ bool aer_acpi_firmware_first(void)
>   .firmware_first = 0,
>   };
>  
> + if (pcie_ports_native)
> + return false;
> +

This makes better sense.

>   if (!parsed) {
>   apei_hest_parse(aer_hest_parse, &info);
>   aer_firmware_first = info.firmware_first;
> 


Re: [PATCH v2] PCI/AER: Fix aerdrv loading with "pcie_ports=native" parameter

2018-07-02 Thread Alex G.



On 07/02/2018 08:16 AM, Bjorn Helgaas wrote:
> On Sat, Jun 30, 2018 at 11:39:00PM -0500, Alex G wrote:
>> On 06/30/2018 04:31 PM, Bjorn Helgaas wrote:
>>> [+cc Borislav, linux-acpi, since this involves APEI/HEST]
>>
>> Borislav is not the relevant maintainer here, since we're not contingent on
>> APEI handling. I think Keith has a lot more experience with this part of the
>> kernel.
> 
> Thanks for adding Keith.
> 
>>> On Tue, Jun 19, 2018 at 02:58:20PM -0500, Alexandru Gagniuc wrote:
>>>> According to the documentation, "pcie_ports=native", linux should use
>>>> native AER and DPC services. While that is true for the _OSC method
>>>> parsing, this is not the only place that is checked. Should the HEST
>>>> table list PCIe ports as firmware-first, linux will not use native
>>>> services.
>>>
>>> Nothing in ACPI-land looks at pcie_ports_native.  How should ACPI
>>> things work in the "pcie_ports=native" case?  I guess we still have to
>>> expect to receive error records from the firmware, because it may
>>> certainly send us non-PCI errors (machine checks, etc) and maybe even
>>> some PCI errors (even if the Linux AER driver claims AER interrupts,
>>> we don't know what notification mechanisms the firmware may be using).
>>
>> I think ACPI land shouldn't care about this. We care about it from the PCIe
>> stand point at the interface with ACPI. FW might see a delta in the sense
>> that we request control of some features via _OSC, which we otherwise would
>> not do without pcie_ports=native.
>>
>>> I guess best-case, we'll get ACPI error records for all non-PCI
>>> things, and the Linux AER driver will see all the AER errors.
>>
>> It might affect FW's ability to catch errors, but that's dependent on the
>> root port implementation.
>>
>>> Worst-case, I don't really know what to expect.  Duplicate reporting
>>> of AER errors via firmware and Linux AER driver?  Some kind of
>>> confusion about who acknowledges and clears them?
>>
>> Once user enters pcie_ports=native, all bets are off: you broke the contract
>> you have with the FW -- whether or not you have this patch.
>>
>>> Out of curiosity, what is your use case for "pcie_ports=native"?
>>> Presumably there's something that works better when using it, and
>>> things work even *better* with this patch?
>>
>> Correctness. It bothers me that actual behavior does not match the
>> documentation:
>>
>>  native  Use native PCIe services associated with PCIe ports
>> unconditionally.
>>
>>
>>> I know people do use it, because I often see it mentioned in forums
>>> and bug reports, but I really don't expect it to work very well
>>> because we're ignoring the usage model the firmware is designed
>>> around.  My unproven suspicion is that most uses are in the black
>>> magic category of "there's a bug here, and we don't know how to fix
>>> it, but pcie_ports=native makes it work better".
>>
>> There exist cases that firmware didn't consider. I would not call them
>> "firmware bugs", but there are cases where the user understands the platform
>> better than firmware.
>> Example: on certain PCIe switches, a hardware PCIe error may bring the
>> switch downstream ports into a state where they stop notifying hotplug
>> events. Depending on the platform, firmware may or may not fix this
>> condition, but "pcie_ports=native" enables DPC. DPC contains the error
>> without the switch downstream port entering the weird error state in the
>> first place.
>>
>> All bets are off at this point.
> 
> If a user needs "pcie_ports=native", I claim that's a user experience
> problem, and the underlying cause is a hardware, firmware, or OS
> defect.
> 
> I have no doubt the situation you describe is real, but this doesn't
> make any progress toward resolving the user experience problem.  In
> fact, it propagates the folklore that using "pcie_ports=native" is an
> appropriate final solution.  It's fine as a temporary workaround while
> we figure out a better solution, but we need some mechanism for
> analyzing the problem and eventually removing the need to use
> "pcie_ports=native".

Speaking of user experience, I'd argue that it's a horrible experience
for the kernel to _not_ do what it is asked.

I'm going to go fix the little comment about the patch. I had the same
dilemma when I wrote it, but didn't find it too noteworthy. It makes
more sense now that you mentioned it.

Alex

> I have a minor comment on the patch, but I think it makes sense.  This
> might be a good time to resurrect Prarit's "taint-on-pci-parameters"
> patch.  If somebody uses "pcie_ports=native", I think it makes sense
> to taint the kernel both because (1) we broke the contract with the
> firmware and we don't really know what to expect, and (2) it's an
> opportunity to encourage the user to raise a bug report.
> 
> Bjorn
> 


Re: [PATCH v2] PCI/AER: Fix aerdrv loading with "pcie_ports=native" parameter

2018-06-30 Thread Alex G

On 06/30/2018 04:31 PM, Bjorn Helgaas wrote:

[+cc Borislav, linux-acpi, since this involves APEI/HEST]


Borislav is not the relevant maintainer here, since we're not contingent 
on APEI handling. I think Keith has a lot more experience with this part 
of the kernel.



On Tue, Jun 19, 2018 at 02:58:20PM -0500, Alexandru Gagniuc wrote:

According to the documentation, "pcie_ports=native", linux should use
native AER and DPC services. While that is true for the _OSC method
parsing, this is not the only place that is checked. Should the HEST
table list PCIe ports as firmware-first, linux will not use native
services.


Nothing in ACPI-land looks at pcie_ports_native.  How should ACPI
things work in the "pcie_ports=native" case?  I guess we still have to
expect to receive error records from the firmware, because it may
certainly send us non-PCI errors (machine checks, etc) and maybe even
some PCI errors (even if the Linux AER driver claims AER interrupts,
we don't know what notification mechanisms the firmware may be using).


I think ACPI land shouldn't care about this. We care about it from the 
PCIe stand point at the interface with ACPI. FW might see a delta in the 
sense that we request control of some features via _OSC, which we 
otherwise would not do without pcie_ports=native.



I guess best-case, we'll get ACPI error records for all non-PCI
things, and the Linux AER driver will see all the AER errors.


It might affect FW's ability to catch errors, but that's dependent on 
the root port implementation.



Worst-case, I don't really know what to expect.  Duplicate reporting
of AER errors via firmware and Linux AER driver?  Some kind of
confusion about who acknowledges and clears them?


Once user enters pcie_ports=native, all bets are off: you broke the 
contract you have with the FW -- whether or not you have this patch.



Out of curiosity, what is your use case for "pcie_ports=native"?
Presumably there's something that works better when using it, and
things work even *better* with this patch?


Correctness. It bothers me that actual behavior does not match the 
documentation:


 native  Use native PCIe services associated with PCIe ports
unconditionally.



I know people do use it, because I often see it mentioned in forums
and bug reports, but I really don't expect it to work very well
because we're ignoring the usage model the firmware is designed
around.  My unproven suspicion is that most uses are in the black
magic category of "there's a bug here, and we don't know how to fix
it, but pcie_ports=native makes it work better".


There exist cases that firmware didn't consider. I would not call them 
"firmware bugs", but there are cases where the user understands the 
platform better than firmware.
Example: on certain PCIe switches, a hardware PCIe error may bring the 
switch downstream ports into a state where they stop notifying hotplug 
events. Depending on the platform, firmware may or may not fix this 
condition, but "pcie_ports=native" enables DPC. DPC contains the error 
without the switch downstream port entering the weird error state in the 
first place.


All bets are off at this point.


Obviously I would much rather find and fix those bugs so people
wouldn't have to stumble over the problem in the first place.


Using native services when firmware asks us not to is a crapshoot every 
time. I don't condone the use of this feature, but if we do get a 
pcie_ports=native request, we should at least honor it.




This happens because aer_acpi_firmware_first() doesn't take
'pcie_ports' into account. This is wrong. DPC uses the same logic when
it decides whether to load or not, so fixing this also fixes DPC not
loading.

Signed-off-by: Alexandru Gagniuc 
---
  drivers/pci/pcie/aer.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

Changes since v1:
  - Re-tested with latest and greatest (v4.18-rc1) -- works great

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a2e88386af28..98ced0f7c850 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -291,7 +291,7 @@ static void aer_set_firmware_first(struct pci_dev *pci_dev)
  
  	rc = apei_hest_parse(aer_hest_parse, &info);
  
-	if (rc)
+	if (rc || pcie_ports_native)
pci_dev->__aer_firmware_first = 0;
else
pci_dev->__aer_firmware_first = info.firmware_first;
@@ -327,6 +327,9 @@ bool aer_acpi_firmware_first(void)
	apei_hest_parse(aer_hest_parse, &info);
aer_firmware_first = info.firmware_first;
parsed = true;
+   if (pcie_ports_native)
+   aer_firmware_first = 0;
+
}
return aer_firmware_first;
  }
--
2.14.3



Re: [PATCH] PCI: DPC: Clear AER status bits before disabling port containment

2018-06-26 Thread Alex G.



On 06/19/2018 04:57 PM, Bjorn Helgaas wrote:
> On Wed, May 16, 2018 at 05:12:21PM -0600, Keith Busch wrote:
>> On Wed, May 16, 2018 at 06:44:22PM -0400, Sinan Kaya wrote:
>>> On 5/16/2018 5:33 PM, Alexandru Gagniuc wrote:
 AER status bits are sticky, and they survive system resets. Downstream
 devices are usually taken care of after re-enumerating the downstream
 busses, as the AER bits are cleared during probe().

 However, nothing clears the bits of the port which contained the
 error. These sticky bits may leave some BIOSes to think that something
 bad happened, and print ominous messages on next boot. To prevent this,
 tidy up the AER status bits before releasing containment.

 Signed-off-by: Alexandru Gagniuc 
 ---
  drivers/pci/pcie/dpc.c | 4 
  1 file changed, 4 insertions(+)

 diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
 index 8c57d607e603..bf82d6936556 100644
 --- a/drivers/pci/pcie/dpc.c
 +++ b/drivers/pci/pcie/dpc.c
 @@ -112,6 +112,10 @@ static void dpc_work(struct work_struct *work)
dpc->rp_pio_status = 0;
}
  
 +  /* DPC event made a mess of our AER status bits. Clean them up. */
 +  pci_cleanup_aer_error_status_regs(pdev);
 +  /* TODO: Should we also use aer_print_error to log the event? */
 +
pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
PCI_EXP_DPC_STATUS_TRIGGER | PCI_EXP_DPC_STATUS_INTERRUPT);
  

>>>
>>> I think Keith has a patch to fix this. It was under review at some point.
>>
>> Right, I do intend to follow up on this, but I've had some trouble
>> finding time the last few weeks. Sorry about that, things will clear up
>> for me shortly.
> 
> I'll drop this (Alexandru's) patch for now, waiting for your update, Keith.

I wonder if clearing AER status bits is mutually exclusive with
refactoring other parts of DPC handling?

Alex


Re: [PATCH v2] PCI: Check for PCIe downtraining conditions

2018-06-01 Thread Alex G.



On 06/01/2018 10:10 AM, Sinan Kaya wrote:
> On 6/1/2018 11:06 AM, Alex G. wrote:
>> On 06/01/2018 10:03 AM, Sinan Kaya wrote:
>>> On 6/1/2018 11:01 AM, Alexandru Gagniuc wrote:
>>>> +  /* Multi-function PCIe share the same link/status. */
>>>> +  if (PCI_FUNC(dev->devfn) != 0)
>>>> +  return;
>>>
>>> How about virtual functions?
>>
>> I have almost no clue about those. Is your concern that we might end up
>> printing more than our fair share of link statuses?
> 
> Yes, struct pci_dev also has a flag called is_virtfn that you should bail out
> here too. 

Will be fixed in v3.

Thanks,
Alex

>>
>> Alex
>>
>>
> 
> 


Re: [PATCH v2] PCI: Check for PCIe downtraining conditions

2018-06-01 Thread Alex G.
On 06/01/2018 10:12 AM, Andy Shevchenko wrote:
> On Fri, Jun 1, 2018 at 6:01 PM, Alexandru Gagniuc  
> wrote:
>> PCIe downtraining happens when both the device and PCIe port are
>> capable of a larger bus width or higher speed than negotiated.
>> Downtraining might be indicative of other problems in the system, and
>> identifying this from userspace is neither intuitive, nor
>> straightforward.
>>
>> The easiest way to detect this is with pcie_print_link_status(),
>> since the bottleneck is usually the link that is downtrained. It's not
>> a perfect solution, but it works extremely well in most cases.
> 
>> +static void pcie_check_upstream_link(struct pci_dev *dev)
>> +{
> 
>> +
> 
> This is redundant, but...

Hmm. I thought it's not safe to call pci_pcie_type() on non-pcie devices.

I see the pci_is_pcie() check followed by pci_pcie_type() is not
uncommon. I didn't think it would be an issue, as long as it's
consistent with the rest of the code.

>> +   if (!pci_is_pcie(dev))
>> +   return;
>> +
>> +   /* Look from the device up to avoid downstream ports with no 
>> devices. */
>> +   if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
>> +   (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
>> +   (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
>> +   return;
> 
> ...wouldn't be better
> 
> int type = pci_pcie_type(dev);
> 
> ?

An extra local variable when the compiler knows how to optimize it out?
To me, it doesn't seem like it would improve readability, but it would
make the code longer.

> But also possible, looking at existing code,
> 
> static inline bool pci_is_pcie_type(dev, type)
> {
>   return pci_is_pcie(dev) ? pci_pcie_type(dev) == type : false;
> }

return pci_is_pcie(dev) && (pci_pcie_type(dev) == type);

seems cleaner. Although this sort of cleanup is beyond the scope of this
change.

Thanks,
Alex
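
Written out with full types, the helper being discussed could look like
this; it is a sketch of the idea only, not part of the posted patch:

    static inline bool pci_is_pcie_type(struct pci_dev *dev, int type)
    {
            return pci_is_pcie(dev) && pci_pcie_type(dev) == type;
    }

    /*
     * With it, the type check in pcie_check_upstream_link() would read:
     *
     *      if (!pci_is_pcie_type(dev, PCI_EXP_TYPE_ENDPOINT) &&
     *          !pci_is_pcie_type(dev, PCI_EXP_TYPE_LEG_END) &&
     *          !pci_is_pcie_type(dev, PCI_EXP_TYPE_UPSTREAM))
     *              return;
     */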


Re: [PATCH v2] PCI: Check for PCIe downtraining conditions

2018-06-01 Thread Alex G.
On 06/01/2018 10:03 AM, Sinan Kaya wrote:
> On 6/1/2018 11:01 AM, Alexandru Gagniuc wrote:
>> +/* Multi-function PCIe share the same link/status. */
>> +if (PCI_FUNC(dev->devfn) != 0)
>> +return;
> 
> How about virtual functions?

I have almost no clue about those. Is your concern that we might end up
printing more than our fair share of link statuses?

Alex



Re: [PATCH] PCI: Check for PCIe downtraining conditions

2018-05-31 Thread Alex G.



On 05/31/2018 12:27 PM, Alex G. wrote:
> On 05/31/2018 12:11 PM, Sinan Kaya wrote:
>> On 5/31/2018 12:49 PM, Alex G. wrote:
>>>>bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
>>>>bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width, 
>>>> *parent*);
>>> That's confusing. I'd expect _capable() and _available() to be
>>> symmetrical. They either both look at one link only, or both go down to
>>> the root port. Though it seems _capable() is link-local, and
>>> _available() is down to root port.
>>>
>>
>> As you know, link speed is a qualification of two devices speed capability.
>> Both speed and width parameters get negotiated by two devices during TS1 and 
>> TS2
>> ordered set exchange. 
>>
>> You need to see what your link partner can support in available function() 
>> vs.
>> what this device can do in bandwidth() function.
> 
> I see. I'm not sure I can use pcie_print_link_status() without some
> major refactoring. I need to look at the capability of the device and its
> downstream port. There's no point complaining that an x16 device is
> running at x4 when the port is only x4 capable.
> 
> Let me think some more on this.

I did some thinking, and I don't like using the same message for both
point-to-point links and full-tree links. From a userspace point of view
having an arbitrary terminator in the tree parsing is confusing. So
we're better off not modifying the behavior here.

On the other hand, just printing the link status on probe might be good
enough. If we're downtraining, but there's a slower link somewhere else,
it won't matter much that we're downtrained. Just calling the unmodified
pcie_print_link_status() will most of the time find the exact segment
that's also downtrained. So let's do this in v2.
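
Roughly, the v2 direction would be along these lines (an outline only, not
the actual diff):

static void pcie_check_upstream_link(struct pci_dev *dev)
{
        if (!pci_is_pcie(dev))
                return;

        /* Look from the device up, to avoid ports with nothing below them. */
        if (pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT &&
            pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END)
                return;

        /* Let the existing helper log speed/width and what limits them. */
        pcie_print_link_status(dev);
}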

Alex

> Alex
> 


Re: [PATCH] PCI: Check for PCIe downtraining conditions

2018-05-31 Thread Alex G.
On 05/31/2018 10:30 AM, Sinan Kaya wrote:
> On 5/31/2018 11:05 AM, Alexandru Gagniuc wrote:
>> +if (dev_cur_speed < max_link_speed)
>> +pci_warn(dev, "PCIe downtrain: link speed is %s (%s capable)",
>> + pcie_bus_speed_name(dev_cur_speed),
>> + pcie_bus_speed_name(max_link_speed));
>> +
> 
> Also this isn't quite correct. Target link speed is what the device tries to
> train. A device can try to train to much lower speed than the maximum on 
> purpose.
> 
> It makes sense to print this if the speed that platform wants via target link
> speed is different from what is actually established though.

After seeing Gen 3 devices that train above the speed in the target link
speed field, I talked to one of the spec writers today. There is some
ambiguity with the target link speed field. In PCIe 4.0 they are
clarifying that to state that this field is "permitted to have no effect".
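
For reference, the field in question is the low four bits of Link Control 2;
a minimal way to read it looks something like this (illustrative only, the
helper name is made up):

static u16 pcie_target_link_speed(struct pci_dev *dev)
{
        u16 lnkctl2 = 0;

        pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
        return lnkctl2 & 0x000f;        /* Target Link Speed; may have no effect */
}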

Alex



Re: [PATCH] PCI: Check for PCIe downtraining conditions

2018-05-31 Thread Alex G.
On 05/31/2018 12:11 PM, Sinan Kaya wrote:
> On 5/31/2018 12:49 PM, Alex G. wrote:
>>> bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
>>> bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width,
>>> *parent*);
>> That's confusing. I'd expect _capable() and _available() to be
>> symmetrical. They either both look at one link only, or both go down to
>> the root port. Though it seems _capable() is link-local, and
>> _available() is down to root port.
>>
> 
> As you know, link speed is a qualification of two devices speed capability.
> Both speed and width parameters get negotiated by two devices during TS1 and 
> TS2
> ordered set exchange. 
> 
> You need to see what your link partner can support in available function() vs.
> what this device can do in bandwidth() function.

I see. I'm not sure I can use pcie_print_link_status() without some
major refactoring. I need to look at the capability of the device and its
downstream port. There's no point complaining that an x16 device is
running at x4 when the port is only x4 capable.

Let me think some more on this.
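
For the capability side, the rough idea would be to clamp against the
upstream port, along these lines (a sketch only, using the helpers from
Tal's series; the wrapper name is made up):

static enum pcie_link_width pcie_link_width_cap(struct pci_dev *dev)
{
        struct pci_dev *up = pci_upstream_bridge(dev);
        enum pcie_link_width w_dev = pcie_get_width_cap(dev);

        /* Only complain up to the narrower of the two capabilities. */
        return up ? min(w_dev, pcie_get_width_cap(up)) : w_dev;
}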

Alex


Re: [PATCH] PCI: Check for PCIe downtraining conditions

2018-05-31 Thread Alex G.



On 05/31/2018 11:49 AM, Alex G. wrote:
> 
> 
> On 05/31/2018 11:13 AM, Sinan Kaya wrote:
>> On 5/31/2018 12:01 PM, Alex G. wrote:
>>>>   PCI: Add pcie_print_link_status() to log link speed and whether it's 
>>>> limited
>>> This one, I have, but it's not what I need. This looks at the available
>>> bandwidth from root port to endpoint, whereas I'm only interested in
>>> downtraining between endpoint and upstream port.
>>
>> I see what you are saying. 
>>
>> With a little bit of effort, you can reuse the same code.
>>
>> Here is an attempt.
>>
>> You can probably extend pcie_bandwidth_available() to put an optional parent 
>> bridge
>> device for your own use case and terminate the loop around here.
>>
>> https://elixir.bootlin.com/linux/v4.17-rc7/source/drivers/pci/pci.c#L5182
>>
>> Then, you can use the existing code to achieve what you are looking for via
>> pcie_print_link_status() by adding an optional parent parameter.
>>
>>  bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
>>  bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width,
>> *parent*);
> 
> That's confusing.

"confusing" refers to the way the code currently works. It doesn't refer
to your proposal.


Alex

 I'd expect _capable() and _available() to be
> symmetrical. They either both look at one link only, or both go down to
> the root port. Though it seems _capable() is link-local, and
> _available() is down to root port.
> 
>>
>> If parent parameter is NULL, code can walk all the way to root as it is 
>> doing today.
>> If it is not, then will terminate the loop on the first iteration.
>>


Re: [PATCH] PCI: Check for PCIe downtraining conditions

2018-05-31 Thread Alex G.



On 05/31/2018 11:13 AM, Sinan Kaya wrote:
> On 5/31/2018 12:01 PM, Alex G. wrote:
>>>   PCI: Add pcie_print_link_status() to log link speed and whether it's 
>>> limited
>> This one, I have, but it's not what I need. This looks at the available
>> bandwidth from root port to endpoint, whereas I'm only interested in
>> downtraining between endpoint and upstream port.
> 
> I see what you are saying. 
> 
> With a little bit of effort, you can reuse the same code.
> 
> Here is an attempt.
> 
> You can probably extend pcie_bandwidth_available() to put an optional parent 
> bridge
> device for your own use case and terminate the loop around here.
> 
> https://elixir.bootlin.com/linux/v4.17-rc7/source/drivers/pci/pci.c#L5182
> 
> Then, you can use the existing code to achieve what you are looking for via
> pcie_print_link_status() by adding an optional parent parameter.
> 
>   bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
>   bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width,
> *parent*);

That's confusing. I'd expect _capable() and _available() to be
symmetrical. They either both look at one link only, or both go down to
the root port. Though it seems _capable() is link-local, and
_available() is down to root port.

> 
> If parent parameter is NULL, code can walk all the way to root as it is doing 
> today.
> If it is not, then will terminate the loop on the first iteration.
> 
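
For reference, the terminate-at-parent variant suggested above might look
roughly like this (hypothetical shape, not the in-tree signature):

static struct pci_dev *pcie_walk_up_links(struct pci_dev *dev,
                                          struct pci_dev *parent)
{
        while (dev) {
                /* ... inspect this link's LNKSTA/LNKCAP here ... */
                if (parent && dev == parent)
                        break;          /* stop at the supplied bridge */
                dev = pci_upstream_bridge(dev);
        }
        return dev;
}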


Re: [PATCH] PCI: Check for PCIe downtraining conditions

2018-05-31 Thread Alex G.



On 05/31/2018 10:54 AM, Sinan Kaya wrote:
> On 5/31/2018 11:46 AM, Alex G. wrote:
>>> https://lkml.org/lkml/2018/3/30/553
>> Oh, pcie_get_speed_cap()/pcie_get_width_cap() seems to handle the
>> capability. Not seeing one for status and speed name.
>>
>>> are you working on linux-next?
>> v4.17-rc7
>>
> 
> I think everything you need is in the series. I don't know which linux
> tree this landed or if it landed. Probably, it is shipping for 4.18. 
> Need some help from Bjorn where to locate these.
> 
> Fri, 30 Mar 2018 16:04:40 -0500
> 
> Bjorn Helgaas (6):
>   bnx2x: Report PCIe link properties with pcie_print_link_status()
>   bnxt_en: Report PCIe link properties with pcie_print_link_status()
>   cxgb4: Report PCIe link properties with pcie_print_link_status()
>   fm10k: Report PCIe link properties with pcie_print_link_status()
>   ixgbe: Report PCIe link properties with pcie_print_link_status()
>   PCI: Remove unused pcie_get_minimum_link()

I remember seeing some of these in my tree.

> Tal Gilboa (8):
>   PCI: Add pcie_get_speed_cap() to find max supported link speed
>   PCI: Add pcie_get_width_cap() to find max supported link width
>   PCI: Add pcie_bandwidth_capable() to compute max supported link 
> bandwidth
>   PCI: Add pcie_bandwidth_available() to compute bandwidth available to 
> device

I definitely have these.

>   PCI: Add pcie_print_link_status() to log link speed and whether it's 
> limited

This one, I have, but it's not what I need. This looks at the available
bandwidth from root port to endpoint, whereas I'm only interested in
downtraining between endpoint and upstream port.

Alex



Re: [PATCH] PCI: Check for PCIe downtraining conditions

2018-05-31 Thread Alex G.



On 05/31/2018 10:38 AM, Sinan Kaya wrote:
> On 5/31/2018 11:29 AM, alex_gagn...@dellteam.com wrote:
>> On 5/31/2018 10:28 AM, Sinan Kaya wrote:
>>> On 5/31/2018 11:05 AM, Alexandru Gagniuc wrote:
 +static void pcie_max_link_cap(struct pci_dev *dev, enum pci_bus_speed 
 *speed,
 +  enum pcie_link_width *width)
 +{
 +  uint32_t lnkcap;
 +
 +  pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
 +
 +  *speed = pcie_link_speed[lnkcap & PCI_EXP_LNKCAP_SLS];
 +  *width = (lnkcap & PCI_EXP_LNKCAP_MLW) >> PCI_EXP_LNKCAP_MLW_SHIFT;
 +}
 +
 +static void pcie_cur_link_sta(struct pci_dev *dev, enum pci_bus_speed 
 *speed,
 +  enum pcie_link_width *width)
 +{
 +  uint16_t lnksta;
 +
 +  pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
 +  *speed = pcie_link_speed[lnksta & PCI_EXP_LNKSTA_CLS];
 +  *width = (lnksta & PCI_EXP_LNKSTA_NLW) >> PCI_EXP_LNKSTA_NLW_SHIFT;
 +}
 +
 +static const char *pcie_bus_speed_name(enum pci_bus_speed speed)
 +{
 +  switch (speed) {
 +  case PCIE_SPEED_2_5GT:
 +  return "2.5 GT/s";
 +  case PCIE_SPEED_5_0GT:
 +  return "5.0 GT/s";
 +  case PCIE_SPEED_8_0GT:
 +  return "8.0 GT/s";
 +  default:
 +  return "unknown";
 +  }
 +}
>>>
>>> I thought Bjorn added some functions to retrieve this now.
>>
>> Hmm. I couldn't find them.
> 
> https://lkml.org/lkml/2018/3/30/553

Oh, pcie_get_speed_cap()/pcie_get_width_cap() seems to handle the
capability. Not seeing one for status and speed name.

> are you working on linux-next?

v4.17-rc7

>>
>> Alex
>>
>>
> 
> 


Re: [PATCH 1/5] PCI/AER: Define and allocate aer_stats structure for AER capable devices

2018-05-23 Thread Alex G.
On 05/23/2018 09:32 AM, Jes Sorensen wrote:
> On 05/23/2018 10:26 AM, Matthew Wilcox wrote:
>> On Wed, May 23, 2018 at 10:20:10AM -0400, Jes Sorensen wrote:
 +++ b/drivers/pci/pcie/aer/aerdrv_stats.c
 @@ -0,0 +1,64 @@
 +// SPDX-License-Identifier: GPL-2.0
>>>
>>> Fix the formatting please - that gross // gibberish doesn't belong there.
>>
>> Sorry, Jes.  The Chief Penguin has Spoken, and that's the preferred
>> syntax:
>>
>> 2. Style:
>>
>>The SPDX license identifier is added in form of a comment.  The comment
>>style depends on the file type::
>>
>>   C source: // SPDX-License-Identifier: 
>>
>> (you can dig up the discussion around this on the mailing list if you
>> like.  Linus actually thinks that C++ single-line comments are one of
>> the few things that language got right)
> 
> Well I'll agree to disagree with Linus on this one. It's ugly as fsck
> and allows for ambiguous statements in the code.

You misspelled "fuck".

Alex


Re: [PATCH 1/5] PCI/AER: Define and allocate aer_stats structure for AER capable devices

2018-05-23 Thread Alex G.
On 05/23/2018 09:20 AM, Jes Sorensen wrote:
> On 05/22/2018 06:28 PM, Rajat Jain wrote:
>> new file mode 100644
>> index ..b9f251992209
>> --- /dev/null
>> +++ b/drivers/pci/pcie/aer/aerdrv_stats.c
>> @@ -0,0 +1,64 @@
>> +// SPDX-License-Identifier: GPL-2.0
> 
> Fix the formatting please - that gross // gibberish doesn't belong there.

Deep breath in. Deep breath out.

git grep SPDX

Although I don't like it, this format is already too common.

Cheers,
Alex


Re: [PATCH 5/5] Documentation/PCI: Add details of PCI AER statistics

2018-05-22 Thread Alex G.
On 05/22/2018 05:28 PM, Rajat Jain wrote:
> Add the PCI AER statistics details to
> Documentation/PCI/pcieaer-howto.txt
> 
> Signed-off-by: Rajat Jain 
> ---
>  Documentation/PCI/pcieaer-howto.txt | 35 +
>  1 file changed, 35 insertions(+)
> 
> diff --git a/Documentation/PCI/pcieaer-howto.txt 
> b/Documentation/PCI/pcieaer-howto.txt
> index acd06bb8..86ee9f9ff5e1 100644
> --- a/Documentation/PCI/pcieaer-howto.txt
> +++ b/Documentation/PCI/pcieaer-howto.txt
> @@ -73,6 +73,41 @@ In the example, 'Requester ID' means the ID of the device 
> who sends
>  the error message to root port. Pls. refer to pci express specs for
>  other fields.
>  
> +2.4 AER statistics
> +
> +When AER messages are captured, the statistics are exposed via the following
> +sysfs attributes under the "aer_stats" folder for the device:
> +
> +2.4.1 Device sysfs Attributes
> +
> +These attributes show up under all the devices that are AER capable. These
> +indicate the errors "as seen by the device". Note that this may mean that if
> +an end point is causing problems, the AER counters may increment at its link
> +partner (e.g. root port) because the errors will be "seen" by the link 
> partner
> +and not the the problematic end point itself (which may report all counters
> +as 0 as it never saw any problems).

I was afraid of that. Is there a way to look at the requester ID to log
AER errors to the correct device?
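
If the requester ID (plus segment) were available to the handler, something
along these lines could map it back to a pci_dev (a sketch on my part, not
part of this series; the caller would need to pci_dev_put() the result):

static struct pci_dev *aer_rid_to_dev(int domain, u16 rid)
{
        return pci_get_domain_bus_and_slot(domain, PCI_BUS_NUM(rid),
                                           rid & 0xff);
}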

Alex


Re: [PATCH 2/5] PCI/AER: Add sysfs stats for AER capable devices

2018-05-22 Thread Alex G.


On 05/22/2018 05:28 PM, Rajat Jain wrote:
> Add the following AER sysfs stats to represent the counters for each
> kind of error as seen by the device:
> 
> dev_total_cor_errs
> dev_total_fatal_errs
> dev_total_nonfatal_errs
> 
> Signed-off-by: Rajat Jain 
> ---
>  drivers/pci/pci-sysfs.c|  3 ++
>  drivers/pci/pci.h  |  4 +-
>  drivers/pci/pcie/aer/aerdrv.h  |  1 +
>  drivers/pci/pcie/aer/aerdrv_errprint.c |  1 +
>  drivers/pci/pcie/aer/aerdrv_stats.c| 72 ++
>  5 files changed, 80 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index 366d93af051d..730f985a3dc9 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -1743,6 +1743,9 @@ static const struct attribute_group 
> *pci_dev_attr_groups[] = {
>  #endif
>   &pci_bridge_attr_group,
>   &pcie_dev_attr_group,
> +#ifdef CONFIG_PCIEAER
> + &aer_stats_attr_group,
> +#endif
>   NULL,
>  };

So if the device is removed as part of recovery, then these get reset,
right? So if the device fails intermittently, these counters would keep
getting reset. Is this the intent?
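
My understanding (an assumption based on patch 1/5, sketched below) is that
the counters live in a per-device structure, so they go away together with
the struct pci_dev:

struct aer_stats {
        u64 dev_total_cor_errs;
        u64 dev_total_fatal_errs;
        u64 dev_total_nonfatal_errs;
};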

(snip)

>  /**
>   * pci_match_one_device - Tell if a PCI device structure has a matching
> diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
> index d8b9fba536ed..b5d5ad6f2c03 100644
> --- a/drivers/pci/pcie/aer/aerdrv.h
> +++ b/drivers/pci/pcie/aer/aerdrv.h
> @@ -87,6 +87,7 @@ void aer_print_port_info(struct pci_dev *dev, struct 
> aer_err_info *info);
>  irqreturn_t aer_irq(int irq, void *context);
>  int pci_aer_stats_init(struct pci_dev *pdev);
>  void pci_aer_stats_exit(struct pci_dev *pdev);
> +void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info);
>  
>  #ifdef CONFIG_ACPI_APEI
>  int pcie_aer_get_firmware_first(struct pci_dev *pci_dev);
> diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c 
> b/drivers/pci/pcie/aer/aerdrv_errprint.c
> index 21ca5e1b0ded..5e8b98deda08 100644
> --- a/drivers/pci/pcie/aer/aerdrv_errprint.c
> +++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
> @@ -155,6 +155,7 @@ static void __aer_print_error(struct pci_dev *dev,
>   pci_err(dev, "   [%2d] Unknown Error Bit%s\n",
>   i, info->first_error == i ? " (First)" : "");
>   }
> + pci_dev_aer_stats_incr(dev, info);

What about AER errors that are contained by DPC?

Alex


Re: [PATCH v6 1/2] acpi: apei: Rename ghes_severity() to ghes_cper_severity()

2018-05-22 Thread Alex G.
On 05/22/2018 01:45 PM, Luck, Tony wrote:
> On Tue, May 22, 2018 at 01:19:34PM -0500, Alex G. wrote:
>> Firmware started passing "fatal" GHES headers with the explicit intent of
>> crashing an OS. At the same time, we've learnt how to handle these errors in
>> a number of cases. With DPC (coming soon to firmware-first) the error is
>> contained, and a non-issue.
> 
> Perhaps DPC is the change that you need to emphasize as to
> why things are different now, so we can change the default
> Linux behavior.
> 
> With the h/w guaranteeing that corrupt data is contained, we
> should be safe to disregard BIOS indications of "fatal" problems
> that could be anything and might show up in unknown ways some
> time later if we keep running.

Sure. DPC is much harder to contest as a reason. However, the AER path
benefits as well from this change in behavior. I'm certain there are
other classes of errors that benefit as well from the change, though I
haven't had the time or the inclination to look for them.

Alex


Re: [PATCH v6 1/2] acpi: apei: Rename ghes_severity() to ghes_cper_severity()

2018-05-22 Thread Alex G.

On 05/22/2018 01:13 PM, Rafael J. Wysocki wrote:
(snip)

Of course, you are free to have a differing opinion and I don't have
to convince you about my point.  You need to convince me about your
point to get the patch in through my tree, which you haven't done so
far.


My point is that crossing your arms and saying "impress me" doesn't give 
me a good idea of how you'd like to see the problem resolved.


Alex


Re: [PATCH v6 1/2] acpi: apei: Rename ghes_severity() to ghes_cper_severity()

2018-05-22 Thread Alex G.

On 05/22/2018 01:10 PM, Rafael J. Wysocki wrote:

On Tue, May 22, 2018 at 7:57 PM, Luck, Tony  wrote:

On Tue, May 22, 2018 at 04:54:26PM +0200, Borislav Petkov wrote:

I especially don't want to have the case where a PCIe error is *really*
fatal and then we noodle in some handlers debating about the severity
because it got marked as recoverable intermittently and end up causing
data corruption on the storage device. Here's a real no-no for ya.


All that we have is a message from the BIOS that this is a "fatal"
error.  When did we start trusting the BIOS to give us accurate
information?


Some time ago, actually.

This is about changing the existing behavior which has been to treat
"fatal" errors reported by the BIOS as good enough reasons for a panic
for quite a while AFAICS.


Yes, you are correct. I'd actually like to go deeper, and remove the 
policy to panic() on fatal errors. Now whether we blacklist or whitelist 
which errors can go through is up for debate, but the current policy 
seems broken.



PCIe fatal means that the link or the device is broken.


And that may really mean that the component in question is on fire.
We just don't know.


Should there be a physical fire, we have much bigger issues. At the same 
time, we could retrain the link, call the driver, and release freon gas 
to put out the fire. That's something we don't currently have the option 
to do.



But that seems a poor reason to take down a large server that may have
dozens of devices (some of them set up specifically to handle
errors ... e.g. mirrored disks on separate controllers, or NIC
devices that have been "bonded" together).

So, as long as the action for a "fatal" error is to mark a link
down and offline the device, that seems a pretty reasonable course
of action.

The argument gets a lot more marginal if you simply reset the
link and re-enable the device to "fix" it. That might be enough,
but I don't think the OS has enough data to make the call.


Again, that's about changing the existing behavior or the existing policy even.

What exactly has changed to make us consider this now?


Firmware started passing "fatal" GHES headers with the explicit intent 
of crashing an OS. At the same time, we've learnt how to handle these 
errors in a number of cases. With DPC (coming soon to firmware-first) 
the error is contained, and a non-issue.


As specs and hardware implementations evolve, we have to adapt. I'm here 
until November, and one of my goals is to involve linux upstream in the 
development of these features so that when the hardware hits the market, 
we're ready. That does mean we have to drop some of the silly things 
we're doing.


Alex


Thanks,
Rafael



Re: [PATCH v6 1/2] acpi: apei: Rename ghes_severity() to ghes_cper_severity()

2018-05-22 Thread Alex G.

On 05/22/2018 12:57 PM, Luck, Tony wrote:

On Tue, May 22, 2018 at 04:54:26PM +0200, Borislav Petkov wrote:

I especially don't want to have the case where a PCIe error is *really*
fatal and then we noodle in some handlers debating about the severity
because it got marked as recoverable intermittently and end up causing
data corruption on the storage device. Here's a real no-no for ya.


All that we have is a message from the BIOS that this is a "fatal"
error.  When did we start trusting the BIOS to give us accurate
information?


When we merged ACPI handling.


PCIe fatal means that the link or the device is broken. But that
seems a poor reason to take down a large server that may have
dozens of devices (some of them set up specifically to handle
errors ... e.g. mirrored disks on separate controllers, or NIC
devices that have been "bonded" together).

So, as long as the action for a "fatal" error is to mark a link
down and offline the device, that seems a pretty reasonable course
of action.

The argument gets a lot more marginal if you simply reset the
link and re-enable the device to "fix" it. That might be enough,
but I don't think the OS has enough data to make the call.


I'm not 100% satisfied with how the AER handler works, and how certain 
drivers (nvme!!!) interface with AER handling. But this is an argument 
that belongs in PCI code, and a fight I will fight with Bjorn and Keith. 
The issue we're having with Borislav and Rafael's estate is that we 
can't make it to PCI land.


I'm seeing here the same fight that I saw with firmware vs OS, where fw 
wants to have control, and OS wants to have control. I saw the same with 
ME/CSE/CSME team vs ChromeOS team, where ME team did everything possible 
to make sure only they can access the boot vector and boot the 
processor, and ChromeOS team couldn't use this approach because they 
wanted their own root of trust. I've seen this in other places as well, 
though confidentiality agreements prevent me from talking about it.


It's the issue of control, and it's a fact of life. Borislav and Rafael 
don't want to relinquish control until they can be 100% certain that 
going further will result in 100% recovery. That is a goal I aspire to 
as well, but an unachievable ideal nonetheless.


I thought the best compromise would be to be as close as possible to 
native handling. That is, if AER can't recover, we don't recover the 
device, but the machine keeps running. I think there's some deeper 
history to GHES handling, which I didn't take into consideration. The 
fight is to convince appropriate parties to share the responsibility in 
a way which doesn't kill the machine. We still have a ways to go until 
we get there.


Alex


-Tony

P.S. I deliberately put "fatal" in quotes above because to
quote "The Princess Bride" -- "that word, I do not think it
means what you think it means". :-)



Re: [PATCH v6 1/2] acpi: apei: Rename ghes_severity() to ghes_cper_severity()

2018-05-22 Thread Alex G.


On 05/22/2018 09:54 AM, Borislav Petkov wrote:
> On Tue, May 22, 2018 at 09:39:15AM -0500, Alex G. wrote:
>> No, the problem is with the current approach, not with mine. The problem
>> is trying to handle the error outside of the existing handler. That's a
>> no-no, IMO.
> 
> Let me save you some time: until you come up with a proper solution for
> *all* PCIe errors so that the kernel can correctly decide what to do for
> each error based on its actual severity, consider this NAKed.

I do have a proper solution for _all_ PCIe errors. In fact, we discussed
several valid approaches already.

> I don't care about outside or inside of the handler 

I do. I have a handler that can handle (no pun intended) errors. I want
to use the same code path in native and GHES cases. If I allow ghes.c to
take different decisions than what aer_do_recovery() would, I've failed.

>- this thing needs to be done properly 

Exactly!

> and not just to serve your particular use case of
> abrupt removal of devices causing PCIe errors, and punish the rest.

I think you're confused about what I'm actually trying to do. Or maybe
you're confused about how PCIe errors work. That's understandable. PCIe
uses the term "fatal" for errors that may make the link unusable, and
which may require a link reset, and in most other specs "fatal" means
"on fire". I understand your confusion, and I hope I cleared it up.

You're trying to make the case that surprise removal is my only concern
and use case, because that's the example that I gave. It makes your
argument stronger, but it's wrong. You don't know our test setup, and
all the things I'm testing for, and whenever I try to tell you, you fall
back to the 'surprise removal' example.

I don't know why you'd think Dell would pay me to work on this if I were
to allow things like silent data corruption to creep in. This isn't a
traditional company from Redmond, Washington.

> I especially don't want to have the case where a PCIe error is *really*
> fatal and then we noodle in some handlers debating about the severity
> because it got marked as recoverable intermittently and end up causing
> data corruption on the storage device. Here's a real no-no for ya.

I especially don't want a kernel maintainer who hasn't even read the
recovery handler (let alone the spec around which the handler was
written) tell me how the recovery handler works and what it's supposed
to do (see, I can be an ass).
PCIe "fatal" errors really are fatal to the link: they might require
unloading the driver and removing the device. But somebody set the
questionable policy that
"fatal"=="panic", and that is completely inappropriate for a larger
class of errors -- PCIe happens to be the easiest example to pick on.

And even you realize that the argument that a panic() will somehow
prevent data corruption is complete noodle sauce.

Alex


