Re: [Linux-stm32] [PATCH v2 00/14] Introduce STM32MP1 RCC in secured mode
On 3/11/21 12:10 PM, Alexandre TORGUE wrote:
Hi Guys
On 3/11/21 5:11 PM, Marek Vasut wrote:
On 3/11/21 3:41 PM, Ahmad Fatoum wrote:
Hello,
Hi,
On 11.03.21 15:02, Alexandre TORGUE wrote:
On 3/11/21 12:43 PM, Marek Vasut wrote:
On 3/11/21 9:08 AM, Alexandre TORGUE wrote:

1- Break the current ABI: as soon as those patches are merged, stm32mp157c-dk2.dtb will impose using a TF-A that provides SCMI clocks. People using U-Boot SPL will have to create their own "no-secure" devicetree.

NAK, this breaks existing boards and existing setups, e.g. DK2 that does not use ATF.

2- As you suggest, create a new "secure" dtb per board (not my wish, from a maintenance perspective).

I agree with Alex (G) that the "secure" option should be opt-in. That way existing setups remain working and no extra requirements are imposed on MP1 users. Esp. since, as far as I understand this, the "secure" part isn't really about security, but rather about moving clock configuration from Linux to some firmware blob.

3- Keep kernel device trees as they are and apply this secure layer (SCMI clock phandles) via a dtbo in U-Boot.

Is this really better than #include "stm32mp15xx-enable-secure-stuff.dtsi" in a board DT? Because that is how I imagine the opt-in "secure" option could work.

Discussing with Patrick about U-Boot, we could apply a dtbo thanks to extlinux.conf. BUT it will not cover other cases (i.e. TF-A which jumps directly into the kernel). So the "least worst" solution is to create a new "stm32mp157c-scmi-dk2" board which will overload clock entries with an SCMI phandle (as proposed by Alex).

I raised this issue before with your colleagues. I still believe the correct way would be for the TF-A to pass down either a device tree or an overlay with the actual settings in use, e.g.:
- Clocks/Resets done via SCMI
- Reserved memory regions
If TF-A directly boots Linux, it can apply the overlay itself; otherwise it's passed down to the SSBL, which applies it before booting Linux.

That sounds good, and it is something e.g. R-Car already does: it merges the DT fragment from prior stages at the U-Boot level and then passes the result to Linux. So on ST hardware the same could very well happen, and it would work for the non-ATF / ATF / ATF+TEE options alike.

Even if this solution sounds good, we are currently not able to do it in our TF-A/U-Boot, so it is not feasible for the moment. We have to find a solution for now. Creating a new dtb can be this solution. Our internal strategy is to use SCMI on our official ST boards. Would it really be a drawback to include a "no-scmi.dtsi" in DH boards (for example) and to create a stm32mp157c-noscmi-dk2.dts?

It could work, as long as all users are reminded to change their build scripts to pick up a "-noscmi.dtb". I suspect that if this were the case, we'd see quite a few bug reports saying "stm32mp1 no longer boots with kernel v5.13". I didn't think of this originally, though U-Boot already does the DTB patching for OP-TEE reserved memory regions. It's not too hard to also patch in the SCMI clocks at boot. In U-Boot's case, runtime detection might even be feasible.

Alex
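For the record, the merge-at-SSBL approach discussed above is mechanically simple. A minimal sketch in C, assuming libfdt (which U-Boot bundles); the function and buffer names are illustrative, not actual U-Boot code:

    #include <libfdt.h>

    /*
     * Merge a firmware-provided SCMI overlay into the base kernel DT
     * before boot. fdt_buf must be large enough to absorb the overlay.
     */
    static int apply_scmi_overlay(void *fdt_buf, int bufsize, void *overlay)
    {
        int err;

        /* Re-open the base tree in place, with room for new nodes. */
        err = fdt_open_into(fdt_buf, fdt_buf, bufsize);
        if (err)
            return err;

        /* Note: on failure, libfdt may leave the base tree unusable. */
        return fdt_overlay_apply(fdt_buf, overlay);
    }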
Re: [PATCH v2 00/14] Introduce STM32MP1 RCC in secured mode
On 1/26/21 3:01 AM, gabriel.fernan...@foss.st.com wrote:

From: Gabriel Fernandez

Platform STM32MP1 can be used in a configuration where some clocks and IP resets are treated as secure resources. These resources are moved from an RCC clock/reset handle to an SCMI clock/reset-domain handle. The RCC clock driver now depends on the SCMI driver, so we now have to manage probe deferral.

v1 -> v2:
- fix yamllint warnings.

Hi Gabriel,

I don't have much clout with the maintainers, but I have to NAK this series after finding major breakage.

The problem with this series is that it breaks pretty much every board it touches. I have a DK2 here that I'm using for development, which no longer boots with this series applied. The crux of the matter is that this series assumes all boards will boot with an FSBL that implements a very specific SCMI clock tree. This is major ABI breakage for anyone not using TF-A as the first-stage bootloader. Anyone using U-Boot SPL is screwed.

This series imposes an SoC-wide change via the dtsi files. So even boards that you don't intend to convert to SCMI will get broken this way. Adding a -no-scmi file that isn't used anywhere doesn't help things.

Here's what I suggest: generate new dtb files for those boards that you want to convert. So you would get:
- stm32mp157c-dk2.dtb            # Good old hardware clocks
- stm32mp157c-dk2-secure-rcc.dtb # Clocks accessible by SCMI

A lot of users use a larger build system where they extract the relevant files. With the scheme I'm proposing you don't break their builds, and you allow SCMI users to have upstream support. This means that you'll have to rethink the DTS and DTSI changes to accommodate both use cases.

Thanks,
Alex

Gabriel Fernandez (14):
  clk: stm32mp1: merge 'clk-hsi-div' and 'ck_hsi' into one clock
  clk: stm32mp1: merge 'ck_hse_rtc' and 'ck_rtc' into one clock
  clk: stm32mp1: remove intermediate pll clocks
  clk: stm32mp1: convert to module driver
  clk: stm32mp1: move RCC reset controller into RCC clock driver
  reset: stm32mp1: remove stm32mp1 reset
  dt-bindings: clock: add IDs for SCMI clocks on stm32mp15
  dt-bindings: reset: add IDs for SCMI reset domains on stm32mp15
  dt-bindings: reset: add MCU HOLD BOOT ID for SCMI reset domains on stm32mp15
  clk: stm32mp1: new compatible for secure RCC support
  ARM: dts: stm32: define SCMI resources on stm32mp15
  ARM: dts: stm32: move clocks/resets to SCMI resources for stm32mp15
  dt-bindings: clock: stm32mp1 new compatible for secure rcc
  ARM: dts: stm32: introduce basic boot include on stm32mp15x board

 .../bindings/clock/st,stm32mp1-rcc.yaml     |   6 +-
 arch/arm/boot/dts/stm32mp15-no-scmi.dtsi    | 158 ++
 arch/arm/boot/dts/stm32mp151.dtsi           | 127 +++--
 arch/arm/boot/dts/stm32mp153.dtsi           |   4 +-
 arch/arm/boot/dts/stm32mp157.dtsi           |   2 +-
 arch/arm/boot/dts/stm32mp15xc.dtsi          |   4 +-
 drivers/clk/Kconfig                         |  10 +
 drivers/clk/clk-stm32mp1.c                  | 495 +++---
 drivers/reset/Kconfig                       |   6 -
 drivers/reset/Makefile                      |   1 -
 drivers/reset/reset-stm32mp1.c              | 115
 include/dt-bindings/clock/stm32mp1-clks.h   |  27 +
 include/dt-bindings/reset/stm32mp1-resets.h |  15 +
 13 files changed, 704 insertions(+), 266 deletions(-)
 create mode 100644 arch/arm/boot/dts/stm32mp15-no-scmi.dtsi
 delete mode 100644 drivers/reset/reset-stm32mp1.c
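On the probe-deferral point from the cover letter: a minimal sketch of the pattern in question, assuming the RCC driver consumes an SCMI-provided clock; the names here are illustrative, not the actual clk-stm32mp1.c code:

    #include <linux/clk.h>
    #include <linux/err.h>
    #include <linux/platform_device.h>

    static int stm32mp1_rcc_probe_sketch(struct platform_device *pdev)
    {
        struct clk *hse;    /* "hse" is a hypothetical consumer name */

        hse = devm_clk_get(&pdev->dev, "hse");
        if (IS_ERR(hse))
            /*
             * Returns -EPROBE_DEFER until the SCMI clock driver is
             * bound; the driver core then retries this probe later.
             */
            return PTR_ERR(hse);

        return 0;
    }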
Re: Issues with "PCI/LINK: Report degraded links via link bandwidth notification"
On 2/2/21 2:16 PM, Bjorn Helgaas wrote:
On Tue, Feb 02, 2021 at 01:50:20PM -0600, Alex G. wrote:
On 1/29/21 3:56 PM, Bjorn Helgaas wrote:
On Thu, Jan 28, 2021 at 06:07:36PM -0600, Alex G. wrote:
On 1/28/21 5:51 PM, Sinan Kaya wrote:
On 1/28/2021 6:39 PM, Bjorn Helgaas wrote:

AFAICT, this thread petered out with no resolution. If the bandwidth change notifications are important to somebody, please speak up, preferably with a patch that makes the notifications disabled by default and adds a parameter to enable them (or some other strategy that makes sense). I think these are potentially useful, so I don't really want to just revert them, but if nobody thinks they are important enough to fix, that's a possibility.

Hide it behind a debug or expert option by default? Or even mark it as BROKEN until someone fixes it?

Instead of making it a config option, wouldn't it be better as a kernel parameter? People encountering this seem quite competent at passing kernel arguments, so having a "pcie_bw_notification=off" would solve their problems.

I don't want people to have to discover a parameter to solve issues. If there's a parameter, notification should default to off, and people who want notification should supply a parameter to enable it. The same goes for the sysfs idea.

I can imagine cases where a per-port flag would be useful. For example, a machine with a NIC and a couple of PCIe storage drives. In this example, the PCIe drives downtrain willy-nilly, so it's useful to turn off their notifications, but the NIC absolutely must not downtrain. It's debatable whether it should default to on or off.

I think we really just need to figure out what's going on. Then it should be clearer how to handle it.

I'm not really in a position to debug the root cause, since I don't have the hardware or the time.

I wonder (a) if some PCIe devices are downtraining willy-nilly to save power, (b) if this willy-nilly downtraining somehow violates the PCIe spec, and (c) what the official behavior is when downtraining is intentional. My theory is: YES, YES, ASPM. But I don't know how to figure this out without having the problem hardware in hand.

If nobody can figure out what's going on, I think we'll have to make it disabled by default. I think most distros do "CONFIG_PCIE_BW is not set". Is that not true?

I think it *is* true that distros do not enable CONFIG_PCIE_BW. But it's perfectly reasonable for people building their own kernels to enable it. It should be safe to enable all config options. If they do enable CONFIG_PCIE_BW, I don't want them to waste time debugging messages they don't expect. If we understood why these happen and could filter out the expected ones, that would be great. But we don't. We've already wasted quite a bit of Jan's and Atanas' time, and no doubt that of others who haven't bothered to file bug reports. So I think I'll queue up a patch to remove the functionality for now. It's easily restored if somebody debugs the problem or adds a command-line switch or something.

I think it's best we make it a module (or kernel) parameter, default=off, for the time being.

Alex
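As a concrete illustration of that conclusion, a minimal sketch of a default-off parameter, assuming it lives in drivers/pci/pcie/bw_notification.c where struct pcie_device is visible; the parameter name is only what this thread floated, not an existing mainline option:

    #include <linux/module.h>
    #include <linux/pci.h>

    static bool notify;    /* default off, per the discussion above */
    module_param(notify, bool, 0444);
    MODULE_PARM_DESC(notify, "Report PCIe link bandwidth degradation");

    static int pcie_bandwidth_notification_probe(struct pcie_device *srv)
    {
        if (!notify)
            return -ENODEV;    /* leave the service unbound unless opted in */

        /* ... the existing probe sequence would continue here ... */
        return 0;
    }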
Re: Issues with "PCI/LINK: Report degraded links via link bandwidth notification"
On 1/29/21 3:56 PM, Bjorn Helgaas wrote:
On Thu, Jan 28, 2021 at 06:07:36PM -0600, Alex G. wrote:
On 1/28/21 5:51 PM, Sinan Kaya wrote:
On 1/28/2021 6:39 PM, Bjorn Helgaas wrote:

AFAICT, this thread petered out with no resolution. If the bandwidth change notifications are important to somebody, please speak up, preferably with a patch that makes the notifications disabled by default and adds a parameter to enable them (or some other strategy that makes sense). I think these are potentially useful, so I don't really want to just revert them, but if nobody thinks they are important enough to fix, that's a possibility.

Hide it behind a debug or expert option by default? Or even mark it as BROKEN until someone fixes it?

Instead of making it a config option, wouldn't it be better as a kernel parameter? People encountering this seem quite competent at passing kernel arguments, so having a "pcie_bw_notification=off" would solve their problems.

I don't want people to have to discover a parameter to solve issues. If there's a parameter, notification should default to off, and people who want notification should supply a parameter to enable it. The same goes for the sysfs idea.

I can imagine cases where a per-port flag would be useful. For example, a machine with a NIC and a couple of PCIe storage drives. In this example, the PCIe drives downtrain willy-nilly, so it's useful to turn off their notifications, but the NIC absolutely must not downtrain. It's debatable whether it should default to on or off.

I think we really just need to figure out what's going on. Then it should be clearer how to handle it.

I'm not really in a position to debug the root cause, since I don't have the hardware or the time.

I wonder (a) if some PCIe devices are downtraining willy-nilly to save power, (b) if this willy-nilly downtraining somehow violates the PCIe spec, and (c) what the official behavior is when downtraining is intentional. My theory is: YES, YES, ASPM. But I don't know how to figure this out without having the problem hardware in hand.

If nobody can figure out what's going on, I think we'll have to make it disabled by default. I think most distros do "CONFIG_PCIE_BW is not set". Is that not true?

Alex
Re: Issues with "PCI/LINK: Report degraded links via link bandwidth notification"
On 1/28/21 5:51 PM, Sinan Kaya wrote:
On 1/28/2021 6:39 PM, Bjorn Helgaas wrote:

AFAICT, this thread petered out with no resolution. If the bandwidth change notifications are important to somebody, please speak up, preferably with a patch that makes the notifications disabled by default and adds a parameter to enable them (or some other strategy that makes sense). I think these are potentially useful, so I don't really want to just revert them, but if nobody thinks they are important enough to fix, that's a possibility.

Hide it behind a debug or expert option by default? Or even mark it as BROKEN until someone fixes it?

Instead of making it a config option, wouldn't it be better as a kernel parameter? People encountering this seem quite competent at passing kernel arguments, so having a "pcie_bw_notification=off" would solve their problems.

As far as marking this as BROKEN, I've seen no conclusive evidence to tell whether it's a software bug or an actual hardware problem.

Could we have a sysfs attribute to disable this on a per-downstream-port basis? e.g.

  echo 0 > /sys/bus/pci/devices/0000:00:04.0/bw_notification_enabled

This probably won't be ideal if there are many devices downtraining their links ad hoc. At worst we'd have a way to silence those messages if we do encounter such devices.

Alex
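A rough sketch of what that per-port attribute could look like, assuming it lives in bw_notification.c next to the enable/disable helpers that driver already has; the attribute name matches the echo example above and is hypothetical:

    static ssize_t bw_notification_enabled_store(struct device *dev,
                                                 struct device_attribute *attr,
                                                 const char *buf, size_t count)
    {
        struct pci_dev *port = to_pci_dev(dev);
        bool enable;

        if (kstrtobool(buf, &enable))
            return -EINVAL;

        /* Arm or silence the Link Bandwidth Management interrupt. */
        if (enable)
            pcie_enable_link_bandwidth_notification(port);
        else
            pcie_disable_link_bandwidth_notification(port);

        return count;
    }
    static DEVICE_ATTR_WO(bw_notification_enabled);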
Re: [PATCH v2 1/2] drm/bridge: sii902x: Enable I/O and core VCC supplies if present
On 10/20/20 2:16 AM, Sam Ravnborg wrote:

Hi Alex.

[snip]

diff --git a/drivers/gpu/drm/bridge/sii902x.c b/drivers/gpu/drm/bridge/sii902x.c
index 33fd33f953ec..d15e9f2c0d8a 100644
--- a/drivers/gpu/drm/bridge/sii902x.c
+++ b/drivers/gpu/drm/bridge/sii902x.c
@@ -17,6 +17,7 @@
 #include
 #include
 #include
+#include <linux/regulator/consumer.h>
 #include
 #include
@@ -168,6 +169,8 @@ struct sii902x {
     struct drm_connector connector;
     struct gpio_desc *reset_gpio;
     struct i2c_mux_core *i2cmux;
+    struct regulator *iovcc;
+    struct regulator *cvcc12;
     /*
      * Mutex protects audio and video functions from interfering
      * each other, by keeping their i2c command sequences atomic.
@@ -954,13 +957,13 @@ static const struct drm_bridge_timings default_sii902x_timings = {
          | DRM_BUS_FLAG_DE_HIGH,
 };
+static int sii902x_init(struct sii902x *sii902x);

Please re-arrange the code so this prototype is not needed.

I'd be happy to re-arrange things. It will make the diff look a lot bigger than what it is. Is that okay?

Alex

+
 static int sii902x_probe(struct i2c_client *client,
              const struct i2c_device_id *id)
 {
     struct device *dev = &client->dev;
-    unsigned int status = 0;
     struct sii902x *sii902x;
-    u8 chipid[4];
     int ret;
     ret = i2c_check_functionality(client->adapter,
@@ -989,6 +992,43 @@ static int sii902x_probe(struct i2c_client *client,
     mutex_init(&sii902x->mutex);
+    sii902x->iovcc = devm_regulator_get(dev, "iovcc");
+    if (IS_ERR(sii902x->iovcc))
+        return PTR_ERR(sii902x->iovcc);
+
+    sii902x->cvcc12 = devm_regulator_get(dev, "cvcc12");
+    if (IS_ERR(sii902x->cvcc12))
+        return PTR_ERR(sii902x->cvcc12);
+
+    ret = regulator_enable(sii902x->iovcc);
+    if (ret < 0) {
+        dev_err_probe(dev, ret, "Failed to enable iovcc supply");
+        return ret;
+    }
+
+    ret = regulator_enable(sii902x->cvcc12);
+    if (ret < 0) {
+        dev_err_probe(dev, ret, "Failed to enable cvcc12 supply");
+        regulator_disable(sii902x->iovcc);
+        return ret;
+    }
+
+    ret = sii902x_init(sii902x);
+    if (ret < 0) {
+        regulator_disable(sii902x->cvcc12);
+        regulator_disable(sii902x->iovcc);
+    }
+
+    return ret;
+}
+
+static int sii902x_init(struct sii902x *sii902x)
+{
+    struct device *dev = &sii902x->i2c->dev;
+    unsigned int status = 0;
+    u8 chipid[4];
+    int ret;
+
     sii902x_reset(sii902x);
     ret = regmap_write(sii902x->regmap, SII902X_REG_TPI_RQB, 0x0);
@@ -1012,11 +1052,11 @@ static int sii902x_probe(struct i2c_client *client,
     regmap_read(sii902x->regmap, SII902X_INT_STATUS, &status);
     regmap_write(sii902x->regmap, SII902X_INT_STATUS, status);
-    if (client->irq > 0) {
+    if (sii902x->i2c->irq > 0) {
         regmap_write(sii902x->regmap, SII902X_INT_ENABLE,
                  SII902X_HOTPLUG_EVENT);
-        ret = devm_request_threaded_irq(dev, client->irq, NULL,
+        ret = devm_request_threaded_irq(dev, sii902x->i2c->irq, NULL,
                         sii902x_interrupt,
                         IRQF_ONESHOT, dev_name(dev),
                         sii902x);
@@ -1031,9 +1071,9 @@ static int sii902x_probe(struct i2c_client *client,
     sii902x_audio_codec_init(sii902x, dev);
-    i2c_set_clientdata(client, sii902x);
+    i2c_set_clientdata(sii902x->i2c, sii902x);
-    sii902x->i2cmux = i2c_mux_alloc(client->adapter, dev,
+    sii902x->i2cmux = i2c_mux_alloc(sii902x->i2c->adapter, dev,
                     1, 0, I2C_MUX_GATE,
                     sii902x_i2c_bypass_select,
                     sii902x_i2c_bypass_deselect);
@@ -1051,6 +1091,8 @@ static int sii902x_remove(struct i2c_client *client)
     i2c_mux_del_adapters(sii902x->i2cmux);
     drm_bridge_remove(&sii902x->bridge);
+    regulator_disable(sii902x->cvcc12);
+    regulator_disable(sii902x->iovcc);
     return 0;
 }
Re: [PATCH v2 1/2] drm/bridge: sii902x: Enable I/O and core VCC supplies if present
On 9/28/20 12:30 PM, Alexandru Gagniuc wrote:

On the SII9022, the IOVCC and CVCC12 supplies must reach the correct voltage before the reset sequence is initiated. On most boards, this assumption is true at boot-up, so initialization succeeds. However, when we try to initialize the chip with incorrect supply voltages, it will not respond to I2C requests. sii902x_probe() fails with -ENXIO.

To resolve this, look for the "iovcc" and "cvcc12" regulators, and make sure they are enabled before starting the reset sequence. If these supplies are not available in devicetree, then they will default to dummy-regulator. In that case everything will work like before.

This was observed on a STM32MP157C-DK2 booting in u-boot falcon mode. On this board, the supplies would be set by the second-stage bootloader, which does not run in falcon mode.

Signed-off-by: Alexandru Gagniuc
---
Changes since v1:
 * Fix return code after regulator_enable(sii902x->iovcc) fails (Fabio Estevam)
 * Use dev_err_probe() instead of dev_err() where appropriate (Sam Ravnborg)

 drivers/gpu/drm/bridge/sii902x.c | 54
 1 file changed, 48 insertions(+), 6 deletions(-)

This patch seems to have entered fall dormancy. Did I miss somebody in the CC field?

Alex

diff --git a/drivers/gpu/drm/bridge/sii902x.c b/drivers/gpu/drm/bridge/sii902x.c
index 33fd33f953ec..d15e9f2c0d8a 100644
--- a/drivers/gpu/drm/bridge/sii902x.c
+++ b/drivers/gpu/drm/bridge/sii902x.c
@@ -17,6 +17,7 @@
 #include
 #include
 #include
+#include <linux/regulator/consumer.h>
 #include
 #include
@@ -168,6 +169,8 @@ struct sii902x {
     struct drm_connector connector;
     struct gpio_desc *reset_gpio;
     struct i2c_mux_core *i2cmux;
+    struct regulator *iovcc;
+    struct regulator *cvcc12;
     /*
      * Mutex protects audio and video functions from interfering
      * each other, by keeping their i2c command sequences atomic.
@@ -954,13 +957,13 @@ static const struct drm_bridge_timings default_sii902x_timings = {
          | DRM_BUS_FLAG_DE_HIGH,
 };
+static int sii902x_init(struct sii902x *sii902x);
+
 static int sii902x_probe(struct i2c_client *client,
              const struct i2c_device_id *id)
 {
     struct device *dev = &client->dev;
-    unsigned int status = 0;
     struct sii902x *sii902x;
-    u8 chipid[4];
     int ret;
     ret = i2c_check_functionality(client->adapter,
@@ -989,6 +992,43 @@ static int sii902x_probe(struct i2c_client *client,
     mutex_init(&sii902x->mutex);
+    sii902x->iovcc = devm_regulator_get(dev, "iovcc");
+    if (IS_ERR(sii902x->iovcc))
+        return PTR_ERR(sii902x->iovcc);
+
+    sii902x->cvcc12 = devm_regulator_get(dev, "cvcc12");
+    if (IS_ERR(sii902x->cvcc12))
+        return PTR_ERR(sii902x->cvcc12);
+
+    ret = regulator_enable(sii902x->iovcc);
+    if (ret < 0) {
+        dev_err_probe(dev, ret, "Failed to enable iovcc supply");
+        return ret;
+    }
+
+    ret = regulator_enable(sii902x->cvcc12);
+    if (ret < 0) {
+        dev_err_probe(dev, ret, "Failed to enable cvcc12 supply");
+        regulator_disable(sii902x->iovcc);
+        return ret;
+    }
+
+    ret = sii902x_init(sii902x);
+    if (ret < 0) {
+        regulator_disable(sii902x->cvcc12);
+        regulator_disable(sii902x->iovcc);
+    }
+
+    return ret;
+}
+
+static int sii902x_init(struct sii902x *sii902x)
+{
+    struct device *dev = &sii902x->i2c->dev;
+    unsigned int status = 0;
+    u8 chipid[4];
+    int ret;
+
     sii902x_reset(sii902x);
     ret = regmap_write(sii902x->regmap, SII902X_REG_TPI_RQB, 0x0);
@@ -1012,11 +1052,11 @@ static int sii902x_probe(struct i2c_client *client,
     regmap_read(sii902x->regmap, SII902X_INT_STATUS, &status);
     regmap_write(sii902x->regmap, SII902X_INT_STATUS, status);
-    if (client->irq > 0) {
+    if (sii902x->i2c->irq > 0) {
         regmap_write(sii902x->regmap, SII902X_INT_ENABLE,
                  SII902X_HOTPLUG_EVENT);
-        ret = devm_request_threaded_irq(dev, client->irq, NULL,
+        ret = devm_request_threaded_irq(dev, sii902x->i2c->irq, NULL,
                         sii902x_interrupt,
                         IRQF_ONESHOT, dev_name(dev),
                         sii902x);
@@ -1031,9 +1071,9 @@ static int sii902x_probe(struct i2c_client *client,
     sii902x_audio_codec_init(sii902x, dev);
-    i2c_set_clientdata(client, sii902x);
+    i2c_set_clientdata(sii902x->i2c, sii902x);
-    sii902x->i2cmux = i2c_mux_alloc(client->adapter, dev,
+    sii902x->i2cmux = i2c_mux_alloc(sii902x->i2c->adapter, dev,
                     1, 0, I2C_MUX_GATE,
                     sii902x_i2c_bypass_select,
Re: [PATCH 1/2] drm/bridge: sii902x: Enable I/O and core VCC supplies if present
On 9/26/20 1:49 PM, Sam Ravnborg wrote:

Hi Alexandru,

On Thu, Sep 24, 2020 at 03:05:05PM -0500, Alexandru Gagniuc wrote:

On the SII9022, the IOVCC and CVCC12 supplies must reach the correct voltage before the reset sequence is initiated. On most boards, this assumption is true at boot-up, so initialization succeeds. However, when we try to initialize the chip with incorrect supply voltages, it will not respond to I2C requests. sii902x_probe() fails with -ENXIO.

To resolve this, look for the "iovcc" and "cvcc12" regulators, and make sure they are enabled before starting the reset sequence. If these supplies are not available in devicetree, then they will default to dummy-regulator. In that case everything will work like before.

This was observed on a STM32MP157C-DK2 booting in u-boot falcon mode. On this board, the supplies would be set by the second stage bootloader, which does not run in falcon mode.

Signed-off-by: Alexandru Gagniuc

One nitpick here. The binding should be present in the tree before you start using it. So this patch should be applied after the binding.

It is :)
 * arch/arm/boot/dts/stm32mp15xx-dkx.dtsi

Alex

One detail below - I think others have already commented on the rest.

[snip]
Re: [PATCH 4/5] PCI: only return true when dev io state is really changed
Hi Ethan,

On 9/24/20 9:34 PM, Ethan Zhao wrote:

When an uncorrectable error happens, the AER driver and DPC driver interrupt handlers likely call pcie_do_recovery()->pci_walk_bus()->report_frozen_detected() with pci_channel_io_frozen at the same time. If pci_dev_set_io_state() returns true even when the original state is pci_channel_io_frozen, that will cause the AER or DPC handler to re-enter the error detection and recovery procedure one after another. The result is a recovery flow mixed between AER and DPC. So simplify the pci_dev_set_io_state() function to only return true when dev->error_state is changed.

Signed-off-by: Ethan Zhao
Tested-by: Wen jin
Tested-by: Shanshan Zhang
---
 drivers/pci/pci.h | 31 +++
 1 file changed, 3 insertions(+), 28 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index fa12f7cbc1a0..d420bb977f3b 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -362,35 +362,10 @@ static inline bool pci_dev_set_io_state(struct pci_dev *dev,
     bool changed = false;
     device_lock_assert(&dev->dev);
-    switch (new) {
-    case pci_channel_io_perm_failure:
-        switch (dev->error_state) {
-        case pci_channel_io_frozen:
-        case pci_channel_io_normal:
-        case pci_channel_io_perm_failure:
-            changed = true;
-            break;
-        }
-        break;
-    case pci_channel_io_frozen:
-        switch (dev->error_state) {
-        case pci_channel_io_frozen:
-        case pci_channel_io_normal:
-            changed = true;
-            break;
-        }
-        break;
-    case pci_channel_io_normal:
-        switch (dev->error_state) {
-        case pci_channel_io_frozen:
-        case pci_channel_io_normal:
-            changed = true;
-            break;
-        }
-        break;
-    }
-    if (changed)
+    if (dev->error_state != new) {
         dev->error_state = new;
+        changed = true;
+    }
     return changed;
 }

The flow is a lot easier to follow now. Thank you.

Reviewed-by: Alexandru Gagniuc
Re: [PATCH 1/2] drm/bridge: sii902x: Enable I/O and core VCC supplies if present
On 9/24/20 3:22 PM, Fabio Estevam wrote:

Hi Fabio,

On Thu, Sep 24, 2020 at 5:16 PM Alexandru Gagniuc wrote:

+    ret = regulator_enable(sii902x->cvcc12);
+    if (ret < 0) {
+        dev_err(dev, "Failed to enable cvcc12 supply: %d\n", ret);
+        regulator_disable(sii902x->iovcc);
+        return PTR_ERR(sii902x->cvcc12);

return ret;

Thank you for catching that. I will fix it in v2.

     ret = regmap_write(sii902x->regmap, SII902X_REG_TPI_RQB, 0x0);
@@ -1012,11 +1052,11 @@ static int sii902x_probe(struct i2c_client *client,
     regmap_read(sii902x->regmap, SII902X_INT_STATUS, &status);
     regmap_write(sii902x->regmap, SII902X_INT_STATUS, status);
-    if (client->irq > 0) {
+    if (sii902x->i2c->irq > 0) {

Unrelated change.

[snip]

Unrelated change.

[snip]

Unrelated change.

[snip]

Unrelated change.

The i2c initialization is moved into a separate function, hence 'client' is no longer available. Instead, it can be accessed via 'sii902x->i2c'.

Alex
Re: [PATCH v3 3/3] PCI: pciehp: Add dmi table for in-band presence disabled
On 10/21/19 1:19 PM, Stuart Hayes wrote:
On 10/21/19 8:47 AM, Mika Westerberg wrote:
On Thu, Oct 17, 2019 at 03:32:56PM -0400, Stuart Hayes wrote:

Some systems have in-band presence detection disabled for hot-plug PCIe slots, but do not report this in the Slot Capabilities 2 (SLTCAP2) register. On these systems, presence detect can become active well after the link is reported to be active, which can cause the slots to be disabled after a device is connected.

Add a DMI table to flag these systems as having in-band presence disabled.

Signed-off-by: Stuart Hayes
---
 drivers/pci/hotplug/pciehp_hpc.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 02eb811a014f..4d377a2a62ce 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -14,6 +14,7 @@
 #define dev_fmt(fmt) "pciehp: " fmt
+#include <linux/dmi.h>
 #include
 #include
 #include
@@ -26,6 +27,16 @@
 #include "../pci.h"
 #include "pciehp.h"
+static const struct dmi_system_id inband_presence_disabled_dmi_table[] = {
+    {
+        .ident = "Dell System",
+        .matches = {
+            DMI_MATCH(DMI_OEM_STRING, "Dell System"),

Sorry if this has been discussed previously already, but isn't this going to apply to all Dell systems, not just the affected ones? Is this the intention?

Yes, that is the intention. Applying this just makes the hotplug code wait for the presence detect bit to be set before proceeding, which ideally wouldn't hurt anything: for devices that don't have in-band presence detect disabled, presence detect should already be up when the code in patch 2/3 starts to wait for it. The only issue would be with broken hotplug implementations that never bring presence detect active (these apparently exist), but even those would still work; they would just take an extra second to come up.

On the other hand, a number of Dell systems have (and will have) NVMe implementations that have in-band presence detect disabled (but they won't have the new bit implemented to report that), and they don't work correctly without this.

I think it would be clearer if this were explained in a comment: that it doesn't break anything, and that we're okay with it applying to all hotplug ports, even those that are not in front of an NVMe backplane.

Alex
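For context, what the flag effectively buys is a wait like the sketch below before the rest of the bring-up proceeds. This is an illustration of the Slot Status polling described in patch 2/3, not the exact merged code, and the timeout budget is arbitrary:

    static bool pcie_wait_for_presence(struct pci_dev *pdev)
    {
        int timeout_ms = 1250;    /* illustrative budget */
        u16 slot_status;

        do {
            pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &slot_status);
            if (slot_status & PCI_EXP_SLTSTA_PDS)
                return true;    /* card is (finally) seen as present */
            msleep(10);
            timeout_ms -= 10;
        } while (timeout_ms > 0);

        return false;
    }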
Re: [PATCH 0/3] PCI: pciehp: Do not turn off slot if presence comes up after link
On 10/1/19 11:13 PM, Lukas Wunner wrote:
On Tue, Oct 01, 2019 at 05:14:16PM -0400, Stuart Hayes wrote:

This patch set is based on a patch set [1] submitted many months ago by Alexandru Gagniuc, who is no longer working on it.

[1] https://patchwork.kernel.org/cover/10909167/ [v3,0/4] PCI: pciehp: Do not turn off slot if presence comes up after link

If I'm not mistaken, these two are identical to Alex's patches, right?

PCI: pciehp: Add support for disabling in-band presence
PCI: pciehp: Wait for PDS when in-band presence is disabled

I'm not sure if it's appropriate to change the author and omit Alex's Signed-off-by.

Legally Dell owns the patches. I have no objection on my end.

Alex

Otherwise I have no objections against this series.

Thanks,
Lukas
Re: [PATCH 3/3] PCI: pciehp: Add dmi table for in-band presence disabled
On 10/1/19 4:14 PM, Stuart Hayes wrote:

Some systems have in-band presence detection disabled for hot-plug PCIe slots, but do not report this in the Slot Capabilities 2 (SLTCAP2) register. On these systems, presence detect can become active well after the link is reported to be active, which can cause the slots to be disabled after a device is connected.

Add a DMI table to flag these systems as having in-band presence disabled.

Signed-off-by: Stuart Hayes
---
 drivers/pci/hotplug/pciehp_hpc.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 1282641c6458..1dd01e752587 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -14,6 +14,7 @@
 #define dev_fmt(fmt) "pciehp: " fmt
+#include <linux/dmi.h>
 #include
 #include
 #include
@@ -26,6 +27,16 @@
 #include "../pci.h"
 #include "pciehp.h"
+static const struct dmi_system_id inband_presence_disabled_dmi_table[] = {
+    {
+        .ident = "Dell System",
+        .matches = {
+            DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc"),
+        },
+    },
+    {}
+};
+

I'm not sure that all Dell systems that were ever made or will be made have in-band presence disabled on all their hotplug slots. This was a problem with the NVMe hot-swap bays on 14G servers. I can't guarantee that any other slot or machine will need this workaround.

The best way I found to implement this is to check for the PCI-ID of the switches behind the port.

Alex
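A sketch of that PCI-ID alternative: quirk only the switch ports known to sit in front of the affected NVMe backplanes. The ID below is a placeholder, not a real backplane switch:

    static bool port_has_inband_presence_quirk(struct pci_dev *pdev)
    {
        static const struct pci_device_id affected_ports[] = {
            /* hypothetical backplane switch downstream port */
            { PCI_DEVICE(PCI_VENDOR_ID_PLX, 0x9733) },
            { 0, }
        };

        return pci_match_id(affected_ports, pdev) != NULL;
    }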
Re: [PATCH] Revert "PCI/LINK: Report degraded links via link bandwidth notification"
On 4/29/19 1:56 PM, Bjorn Helgaas wrote:

From: Bjorn Helgaas

This reverts commit e8303bb7a75c113388badcc49b2a84b4121c1b3e.

e8303bb7a75c added logging whenever a link changed speed or width to a state that is considered degraded. Unfortunately, it cannot differentiate signal-integrity-related link changes from those intentionally initiated by an endpoint driver, including drivers that may live in userspace or VMs when making use of vfio-pci. Some GPU drivers actively manage the link state to save power, which generates a stream of messages like this:

vfio-pci 0000:07:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s x16 link at 0000:00:02.0 (capable of 64.000 Gb/s with 5 GT/s x16 link)

We really *do* want to be alerted when the link bandwidth is reduced because of hardware failures, but degradation by intentional link state management is probably far more common, so the signal-to-noise ratio is currently low.

Until we figure out a way to identify the real problems or silence the intentional situations, revert the following commits, which include the initial implementation (e8303bb7a75c) and subsequent fixes:

I think we're overreacting to a bit of perceived verbosity in the system log. Intentional degradation does not seem to me to be as common as advertised. I have not observed this with radeon, nouveau, or amdgpu, and the proper mechanism to save power at the link level is ASPM. I stand to be corrected, and we have on CC some very knowledgeable fellows who I am certain will jump at the opportunity to do so.

What it seems like to me is that a proprietary driver running in a VM is initiating these changes. And if that is the case, then it seems this is a virtualization problem. A quick glance over the GPU drivers in Linux did not reveal any obvious places where we intentionally downgrade a link. I'm not convinced a revert is the best call.
Alex

e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth notification")
3e82a7f9031f ("PCI/LINK: Supply IRQ handler so level-triggered IRQs are acked")
55397ce8df48 ("PCI/LINK: Clear bandwidth notification interrupt before enabling it")
0fa635aec9ab ("PCI/LINK: Deduplicate bandwidth reports for multi-function devices")

Link: https://lore.kernel.org/lkml/155597243666.19387.1205950870601742062.st...@gimli.home
Link: https://lore.kernel.org/lkml/155605909349.3575.13433421148215616375.st...@gimli.home
Signed-off-by: Bjorn Helgaas
CC: Alexandru Gagniuc
CC: Lukas Wunner
CC: Alex Williamson
---
 drivers/pci/pci.h                  |   1 -
 drivers/pci/pcie/Makefile          |   1 -
 drivers/pci/pcie/bw_notification.c | 121 -
 drivers/pci/pcie/portdrv.h         |   6 +-
 drivers/pci/pcie/portdrv_core.c    |  17 ++--
 drivers/pci/pcie/portdrv_pci.c     |   1 -
 drivers/pci/probe.c                |   2 +-
 7 files changed, 7 insertions(+), 142 deletions(-)
 delete mode 100644 drivers/pci/pcie/bw_notification.c

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index d994839a3e24..224d88634115 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -273,7 +273,6 @@ enum pcie_link_width pcie_get_width_cap(struct pci_dev *dev);
 u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed,
                enum pcie_link_width *width);
 void __pcie_print_link_status(struct pci_dev *dev, bool verbose);
-void pcie_report_downtraining(struct pci_dev *dev);
 /* Single Root I/O Virtualization */
 struct pci_sriov {

diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index f1d7bc1e5efa..ab514083d5d4 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -3,7 +3,6 @@
 # Makefile for PCI Express features and port driver
 pcieportdrv-y := portdrv_core.o portdrv_pci.o err.o
-pcieportdrv-y += bw_notification.o
 obj-$(CONFIG_PCIEPORTBUS) += pcieportdrv.o

diff --git a/drivers/pci/pcie/bw_notification.c b/drivers/pci/pcie/bw_notification.c
deleted file mode 100644
index 4fa9e3523ee1..
--- a/drivers/pci/pcie/bw_notification.c
+++ /dev/null
@@ -1,121 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0+
-/*
- * PCI Express Link Bandwidth Notification services driver
- * Author: Alexandru Gagniuc
- *
- * Copyright (C) 2019, Dell Inc
- *
- * The PCIe Link Bandwidth Notification provides a way to notify the
- * operating system when the link width or data rate changes. This
- * capability is required for all root ports and downstream ports
- * supporting links wider than x1 and/or multiple link speeds.
- *
- * This service port driver hooks into the bandwidth notification interrupt
- * and warns when links become degraded in operation.
- */
-
-#include "../pci.h"
-#include "portdrv.h"
-
-static bool pcie_link_bandwidth_notification_supported(struct pci_dev *dev)
-{
-    int ret;
-    u32 lnk_cap;
-
-    ret = pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnk_cap);
Re: [PATCH] PCI: Add link_change error handler and vfio-pci user
On 4/24/19 12:19 PM, Alex Williamson wrote:
On Wed, 24 Apr 2019 16:45:45 +0000 wrote:
On 4/23/2019 5:42 PM, Alex Williamson wrote:

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 7e12d0163863..233cd4b5b6e8 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2403,6 +2403,19 @@ void pcie_report_downtraining(struct pci_dev *dev)

I don't think you want to change pcie_report_downtraining(). You're advertising to "report" something, by nomenclature, but then you go and also call a notification callback. This is also used during probe, and you've now just killed your chance to notice that you've booted with a degraded link. If what you want to do is silence the bandwidth notification, you want to modify the threaded interrupt that calls this.

During probe, i.e. discovery, a device wouldn't have a driver attached, so we'd fall through to simply printing the link status. Nothing lost, afaict. The "report" verb doesn't have a subject here; report to whom? Therefore I thought it reasonable that a driver ask that it be reported to them via a callback. I don't see that as such a stretch of the interface.

That's just stretching the logic, and IMO it makes the intent harder to understand. The argument relies on the state of the PCI device and logic, which is not obvious to the casual observer. If you want to bypass the bandwidth notification, then bypass the notification.

     if (PCI_FUNC(dev->devfn) != 0 || dev->is_virtfn)
         return;
+    /*
+     * If driver handles link_change event, defer to driver. PCIe drivers
+     * can call pcie_print_link_status() to print current link info.
+     */
+    device_lock(&dev->dev);
+    if (dev->driver && dev->driver->err_handler &&
+        dev->driver->err_handler->link_change) {
+        dev->driver->err_handler->link_change(dev);
+        device_unlock(&dev->dev);
+        return;
+    }
+    device_unlock(&dev->dev);

Can we write this such that there is a single lock()/unlock() pair?

Not without introducing a tracking variable, ex. [snip bad code] That's not markedly better imo, but if it's preferred I can send a v2.

How about:

    if (!invoke_link_changed_handler(pdev))
        very_useful_downtraining_message(pdev);

Alex

Alex
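For completeness, here is one way the suggested helper could keep a single lock/unlock pair. The err_handler->link_change callback is the one proposed in this very patch, not an existing mainline hook, and the helper name follows the suggestion above:

    static bool invoke_link_changed_handler(struct pci_dev *dev)
    {
        bool handled = false;

        device_lock(&dev->dev);
        if (dev->driver && dev->driver->err_handler &&
            dev->driver->err_handler->link_change) {
            dev->driver->err_handler->link_change(dev);
            handled = true;
        }
        device_unlock(&dev->dev);

        return handled;
    }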
Re: [PATCH] PCI/LINK: Account for BW notification in vector calculation
On 4/22/19 5:43 PM, Alex Williamson wrote:

On systems that don't support any PCIe services other than bandwidth notification, pcie_message_numbers() can return zero vectors, causing the vector reallocation in pcie_port_enable_irq_vec() to retry with zero, which fails, resulting in fallback to INTx (which might be broken) for the bandwidth notification service. This can resolve spurious interrupt faults due to this service on some systems.

Fixes: e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth notification")
Signed-off-by: Alex Williamson
---

+1. Tested on some Dell servers. Everything works as expected. I don't have a system with a device that only supports bandwidth notification.

Alex
Re: [PATCH] PCI/LINK: Account for BW notification in vector calculation
On 4/23/19 12:10 PM, Bjorn Helgaas wrote:
On Tue, Apr 23, 2019 at 09:33:53AM -0500, Alex G wrote:
On 4/22/19 7:33 PM, Alex Williamson wrote:

There is nothing wrong happening here that needs to fill logs. I thought maybe if I enabled notification of autonomous bandwidth changes that it might categorize these as something we could ignore, but it doesn't. How can we identify only the cases where this is an erroneous/noteworthy situation? Thanks,

You don't. Ethernet doesn't. USB doesn't. This logging behavior is consistent with every other subsystem that deals with multi-speed links.

Can you point me to the logging in these other subsystems so I can learn more about how they deal with this?

I don't have any in-depth articles about the logging in these subsystems, but I can extract some logs from my machines.

Ethernet:
[Sun Apr 21 11:14:06 2019] e1000e: eno1 NIC Link is Down
[Sun Apr 21 11:14:17 2019] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[Sun Apr 21 11:14:23 2019] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[Sun Apr 21 23:33:31 2019] e1000e: eno1 NIC Link is Down
[Sun Apr 21 23:33:43 2019] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[Sun Apr 21 23:33:48 2019] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

I used to have one of these "green" Ethernet switches that went down to 100 Mbps automatically. You can imagine how "clogged" the logs were with link-up messages. Thank goodness that switch was killed in a thunderstorm.

USB will log every device insertion and removal, very verbosely (see Appendix A).

I agree that emitting log messages for normal and expected events will lead to user confusion, and we need to do something. e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth notification") was merged in v5.1-rc1, so we still have (a little) time to figure this out before v5.1.

I always viewed the system log as a system log, rather than as a database of system errors. I may have extremist views, but going back to Alex's example, I prefer to see that the power-saving mechanism is doing something to save power on my laptop (I'll just ignore it on a desktop). If you think the added code complexity is worth it for people who don't want these things logged in the system log, then I'm certain we can work out some sane solution. It's the same problem we see with GCC, where people want warning messages here, but don't want the same messages there.

Alex

P.S. The pedant in me points out that one of the examples I gave is a terrible example. ASPM "allows hardware-autonomous, dynamic Link power reduction beyond what is achievable by software-only control" [1].

[1] PCI-Express 3.0 -- 5.4.1.
Active State Power Management (ASPM)

Appendix A:

[1618067.987084] usb 1-3.5: new high-speed USB device number 79 using xhci_hcd
[1618068.179914] usb 1-3.5: New USB device found, idVendor=0bda, idProduct=4014, bcdDevice= 0.05
[1618068.179924] usb 1-3.5: New USB device strings: Mfr=3, Product=1, SerialNumber=2
[1618068.179930] usb 1-3.5: Product: USB Audio
[1618068.179936] usb 1-3.5: Manufacturer: Generic
[1618068.179941] usb 1-3.5: SerialNumber: 200901010001
[1618068.280100] usb 1-3.6: new low-speed USB device number 80 using xhci_hcd
[1618068.342541] Bluetooth: hci0: Waiting for firmware download to complete
[1618068.342795] Bluetooth: hci0: Firmware loaded in 1509081 usecs
[1618068.342887] Bluetooth: hci0: Waiting for device to boot
[1618068.354919] Bluetooth: hci0: Device booted in 11797 usecs
[1618068.356006] Bluetooth: hci0: Found Intel DDC parameters: intel/ibt-12-16.ddc
[1618068.358958] Bluetooth: hci0: Applying Intel DDC parameters completed
[1618068.378624] usb 1-3.6: New USB device found, idVendor=04d9, idProduct=1400, bcdDevice= 1.43
[1618068.378626] usb 1-3.6: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[1618068.390686] input: HID 04d9:1400 as /devices/pci0000:00/0000:00:14.0/usb1/1-3/1-3.6/1-3.6:1.0/0003:04D9:1400.0139/input/input921
[1618068.444282] hid-generic 0003:04D9:1400.0139: input,hidraw1: USB HID v1.10 Keyboard [HID 04d9:1400] on usb-0000:00:14.0-3.6/input0
[1618068.456373] input: HID 04d9:1400 Mouse as /devices/pci0000:00/0000:00:14.0/usb1/1-3/1-3.6/1-3.6:1.1/0003:04D9:1400.013A/input/input922
[1618068.457929] input: HID 04d9:1400 Consumer Control as /devices/pci0000:00/0000:00:14.0/usb1/1-3/1-3.6/1-3.6:1.1/0003:04D9:1400.013A/input/input923
[1618068.509294] input: HID 04d9:1400 System Control as /devices/pci0000:00/0000:00:14.0/usb1/1-3/1-3.6/1-3.6:1.1/0003:04D9:1400.013A/input/input924
[1618068.509518] hid-generic 0003:04D9:1400.013A: input,hidraw2: USB HID v1.10 Mouse [HID 04d9:1400] on usb-0000:00:14.0-3.6/input1
[1618068.588078] usb 1-3.7: new full-speed USB device number 81 using xhci_hcd
[1618068.679132] usb 1-3.7: New USB device found, idVendor=046d, idProduct=c52b, bcdDevice=12.03
[1
Re: [PATCH] PCI/LINK: Account for BW notification in vector calculation
On 4/23/19 11:22 AM, Alex Williamson wrote:

Nor should pci-core decide which link speed changes are intended and which are errors. Minimally, we should be enabling drivers to receive this feedback. Thanks,

Not errors. The PCI core reports that a link speed change event has occurred. Period.

Alex
Re: [PATCH] PCI/LINK: Account for BW notification in vector calculation
On 4/23/19 10:34 AM, Alex Williamson wrote:
On Tue, 23 Apr 2019 09:33:53 -0500 Alex G wrote:
On 4/22/19 7:33 PM, Alex Williamson wrote:
On Mon, 22 Apr 2019 19:05:57 -0500 Alex G wrote:

echo 0000:07:00.0:pcie010 | sudo tee /sys/bus/pci_express/drivers/pcie_bw_notification/unbind

That's a bad solution for users; this is meaningless tracking of a device whose driver is actively managing the link bandwidth for power purposes.

0.5 W savings on a 100+ W GPU? I agree it's meaningless.

Evidence? Regardless, I don't have control of the driver that's making these changes, but the claim seems unfounded and irrelevant.

The number of 5 mW/Gb/lane doesn't ring a bell? [1] [2]. Your GPU supports 5 Gb/s, so it likely uses an older, more power-hungry process. I suspect it's still within the same order of magnitude.

I'm assigning a device to a VM [snip] I can see why we might want to be notified of degraded links due to signal issues, but what I'm reporting is that there are also entirely normal reasons [snip] we can't seem to tell the difference.

Unfortunately, there is no way in PCI Express to distinguish between an expected link bandwidth change and one due to error. If you're using virt-manager to configure the VM, then virt-manager could have a checkbox to disable link bandwidth management messages. I'd rather we avoid kernel-side heuristics (like Lukas suggested). If you're confident that your link will operate as intended, and don't want messages about it, that's your call as a user -- we shouldn't decide this in the kernel.

Alex

[1] https://www.synopsys.com/designware-ip/technical-bulletin/reduce-power-consumption.html
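As a rough sanity check of the power numbers being argued over, taking the ~5 mW per Gb/s per lane figure from [1] at face value (actual PHY power varies by process and implementation, so this is only an order-of-magnitude estimate):

    5 mW/(Gb/s)/lane x 5 Gb/s x 16 lanes = 400 mW ≈ 0.4 W

which is consistent with the ~0.5 W figure quoted earlier in the thread.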
Re: [PATCH] PCI/LINK: Account for BW notification in vector calculation
On 4/22/19 7:33 PM, Alex Williamson wrote:
On Mon, 22 Apr 2019 19:05:57 -0500 Alex G wrote:

echo 0000:07:00.0:pcie010 | sudo tee /sys/bus/pci_express/drivers/pcie_bw_notification/unbind

That's a bad solution for users; this is meaningless tracking of a device whose driver is actively managing the link bandwidth for power purposes.

0.5 W savings on a 100+ W GPU? I agree it's meaningless.

There is nothing wrong happening here that needs to fill logs. I thought maybe if I enabled notification of autonomous bandwidth changes that it might categorize these as something we could ignore, but it doesn't. How can we identify only the cases where this is an erroneous/noteworthy situation? Thanks,

You don't. Ethernet doesn't. USB doesn't. This logging behavior is consistent with every other subsystem that deals with multi-speed links. I realize some people are very resistant to change (and use very ancient kernels). I do not, however, agree that this is a sufficient argument to dis-unify behavior.

Alex
Re: [PATCH] PCI/LINK: Account for BW notification in vector calculation
On 4/22/19 5:43 PM, Alex Williamson wrote:

[  329.725607] vfio-pci 0000:07:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s x16 link at 0000:00:02.0 (capable of 64.000 Gb/s with 5 GT/s x16 link)
[  708.151488] vfio-pci 0000:07:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s x16 link at 0000:00:02.0 (capable of 64.000 Gb/s with 5 GT/s x16 link)
[  718.262959] vfio-pci 0000:07:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s x16 link at 0000:00:02.0 (capable of 64.000 Gb/s with 5 GT/s x16 link)
[ 1138.124932] vfio-pci 0000:07:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s x16 link at 0000:00:02.0 (capable of 64.000 Gb/s with 5 GT/s x16 link)

What is the value of this nagging?

Good! The bandwidth notification service is working as intended. If this bothers you, you can unbind the device from the bandwidth notification driver:

echo 0000:07:00.0:pcie010 | sudo tee /sys/bus/pci_express/drivers/pcie_bw_notification/unbind

diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index 7d04f9d087a6..1b330129089f 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -55,7 +55,8 @@ static int pcie_message_numbers(struct pci_dev *dev, int mask,
      * 7.8.2, 7.10.10, 7.31.2.
      */
-    if (mask & (PCIE_PORT_SERVICE_PME | PCIE_PORT_SERVICE_HP)) {
+    if (mask & (PCIE_PORT_SERVICE_PME | PCIE_PORT_SERVICE_HP |
+            PCIE_PORT_SERVICE_BWNOTIF)) {
         pcie_capability_read_word(dev, PCI_EXP_FLAGS, &reg16);
         *pme = (reg16 & PCI_EXP_FLAGS_IRQ) >> 9;
         nvec = *pme + 1;

Good catch!
Re: [PATCH v1 1/3] PCI / ACPI: Do not export pci_get_hp_params()
On 4/22/19 3:58 PM, Bjorn Helgaas wrote: On Fri, Feb 08, 2019 at 10:24:11AM -0600, Alexandru Gagniuc wrote: This is only used within drivers/pci, and there is no reason to make it available outside of the PCI core. Signed-off-by: Alexandru Gagniuc Applied the whole series to pci/hotplug for v5.2, thanks! I dropped the "list" member from struct hpx_type3 because it didn't seem to be used. That's a good call. That was a vestigial appendage from when I first intended to store a list of registers in memory. I'm glad we didn't end up needing a list. Alex
Fixing the GHES driver vs not causing issues in the first place
The issue of dying inside the GHES driver has popped up a few times before. I've looked into fixing this before, but we didn't quite come to agreement, because the verbiage in the ACPI spec is vague:

"When a fatal uncorrected error occurs, the system is restarted to prevent propagation of the error."

This popped up again a couple of times recently [1]. Linus suggested that fixing the GHES driver might pay higher dividends. I considered reviving an old fix, but put it aside after hearing concerns about the unintended consequences, which I'll get to soon.

A few days back, I lost an entire MD RAID1. It was during hot-removal testing that I do on an irregular basis, and a kernel panic from the GHES driver had caused the system to go down. I have seen some occasional filesystem corruption in other crashes, but nothing fsck couldn't fix. The interesting part is that the array that was lost was not part of the device set that was being hot-removed. The state of things doesn't seem very joyful.

The machine in question is a Dell R740xd. It uses firmware-first (FFS) handling for PCIe errors, and is generally good at identifying a hot-removal and not bothering the OS about it. The events that I am testing for are situations where, due to a slow operator, a tilted drive, or worn connectors, errors make it past FFS to the OS -- with "fatal" severity. In that case, the GHES driver sees a fatal hardware error and panics. On this machine, FFS reported it as fatal in an attempt to cause the system to reboot, and prevent propagation of the error.

The "prevent propagation of the error" flow was designed around OSes that can't do recovery. Firmware assumes an instantaneous reboot, so it does not set itself up to report future errors. The EFI reference code does not re-arm errors, so we suspect most OEMs' firmware will work this way.

Despite the apparently enormous stupidity of crashing an otherwise perfectly good system, there are good and well-thought-out reasons behind it. An example is reading a block of data from an NVMe drive and encountering a CRC error on the PCIe bus. If we don't do an "instantaneous reboot" after a previous fatal error, we will not get the CRC error reported. Thus, we risk silent data corruption.

On the Linux side, we can ignore the "fatal" nature of the error, and even recover the PCIe devices behind the error. However, this is ill-advised for the reason listed above. The counter-argument is that a panic() is more likely to cause filesystem corruption. In over one year of testing, I have not seen a single incident of data corruption, yet I have seen the boot ask me to run fsck on multiple occasions. This seems to me like a tradeoff problem rather than anything else.

In the case of this Dell machine, there are ways to hot-swap PCIe devices without needing to take either of the risks above:

1. Turn off the downstream port manually.
2. Insert/replace the drive.
3. Turn on the downstream port manually.

What bothers me is the firmware's assumption that a fatal error must crash the machine. I've looked at the verbiage in the spec, and I don't fully agree that either side respects the contract. Our _OSC contract with the firmware says that firmware will report errors to us, while we are not allowed to touch certain parts of the device. Following the verbiage in the spec, we do need to reboot on a fatal error, but that reboot is not guaranteed to be instantaneous. Firmware still has to honor the contract and report errors to us. This may seem like a spec non-compliance issue.
It would be great if FFS marked the errors as recoverable, but the level of validation required makes this unrealistic on existing machines. New machines will likely move to the Containment Error Recovery model, which is essentially FFS for DPC (PCIe Downstream Port Containment). This leaves the current generation of servers in limbo; I'm led to believe other server OEMs have very similar issues.

Given the dilemma above, I'm really not sure what the right solution is. How do we produce a fix that addresses complaints from both sides? When firmware gives us a false positive, should we panic? Should we instead print a message informing the user/admin that a reboot of the machine is required? Should we initiate a reboot automatically?

Judging by Linux's ability to recover, PCIe fatal errors are false positives -- every time, all the time. However, their fatal severity is not there without reason. I do want to avoid opinions that FFS wants us to crash, gives us the finger, and we should give the finger back.

Another option that was discussed before, but was very controversial, is the fix that involves not causing the problem in the first place [1] [2]. On the machines that I have access to, FFS is handled in SMM, which means that all CPU cores are held up until the processing of the error is complete. On one particular machine
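For reference, the controversial fix in [1] [2] boils down to a severity downgrade at the GHES level: treat a firmware-"fatal" CPER record as recoverable when it describes a PCIe error, so the AER/DPC path can attempt recovery instead of panicking. The sketch below is a heavily hedged illustration of that idea, not the actual ghes.c code:

    static int ghes_adjust_severity(int sev, struct acpi_hest_generic_data *gdata)
    {
        guid_t *sec_type = (guid_t *)gdata->section_type;

        /* PCIe "fatal" errors are routinely survivable; let AER/DPC try. */
        if (sev == GHES_SEV_PANIC && guid_equal(sec_type, &CPER_SEC_PCIE))
            return GHES_SEV_RECOVERABLE;

        return sev;
    }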
Re: [PATCH v2] PCI/LINK: bw_notification: Do not leave interrupt handler NULL
On 3/25/19 5:25 PM, Bjorn Helgaas wrote:
On Fri, Mar 22, 2019 at 07:36:51PM -0500, Alexandru Gagniuc wrote:

A threaded IRQ with a NULL handler does not work with level-triggered interrupts. request_threaded_irq() will return an error:

genirq: Threaded irq requested with handler=NULL and !ONESHOT for irq 16
pcie_bw_notification: probe of 0000:00:1b.0:pcie010 failed with error -22

For level interrupts we need to silence the interrupt before exiting the IRQ handler, so just clear the PCI_EXP_LNKSTA_LBMS bit there.

Fixes: e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth notification")
Reported-by: Linus Torvalds
Signed-off-by: Alexandru Gagniuc

Applied with the following subject line to for-linus for v5.1, thanks!

PCI/LINK: Supply IRQ handler so level-triggered IRQs are acked

You're so much better at formulating commit messages. That sounds a lot smoother. Thanks!

---
Changes since v1:
 - Move pcie_update_link_speed() to the IRQ handler to prevent a duplicate read of link_status
 - Add Fixes: to commit message

 drivers/pci/pcie/bw_notification.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pcie/bw_notification.c b/drivers/pci/pcie/bw_notification.c
index d2eae3b7cc0f..c48746f1cf3c 100644
--- a/drivers/pci/pcie/bw_notification.c
+++ b/drivers/pci/pcie/bw_notification.c
@@ -44,11 +44,10 @@ static void pcie_disable_link_bandwidth_notification(struct pci_dev *dev)
     pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnk_ctl);
 }
-static irqreturn_t pcie_bw_notification_handler(int irq, void *context)
+static irqreturn_t pcie_bw_notification_irq(int irq, void *context)
 {
     struct pcie_device *srv = context;
     struct pci_dev *port = srv->port;
-    struct pci_dev *dev;
     u16 link_status, events;
     int ret;
@@ -58,6 +57,17 @@ static irqreturn_t pcie_bw_notification_handler(int irq, void *context)
     if (ret != PCIBIOS_SUCCESSFUL || !events)
         return IRQ_NONE;
+    pcie_capability_write_word(port, PCI_EXP_LNKSTA, events);
+    pcie_update_link_speed(port->subordinate, link_status);
+    return IRQ_WAKE_THREAD;
+}
+
+static irqreturn_t pcie_bw_notification_handler(int irq, void *context)
+{
+    struct pcie_device *srv = context;
+    struct pci_dev *port = srv->port;
+    struct pci_dev *dev;
+
     /*
      * Print status from downstream devices, not this root port or
      * downstream switch port.
@@ -67,8 +77,6 @@ static irqreturn_t pcie_bw_notification_handler(int irq, void *context)
         __pcie_print_link_status(dev, false);
     up_read(&pci_bus_sem);
-    pcie_update_link_speed(port->subordinate, link_status);
-    pcie_capability_write_word(port, PCI_EXP_LNKSTA, events);
     return IRQ_HANDLED;
 }
@@ -80,7 +88,8 @@ static int pcie_bandwidth_notification_probe(struct pcie_device *srv)
     if (!pcie_link_bandwidth_notification_supported(srv->port))
         return -ENODEV;
-    ret = request_threaded_irq(srv->irq, NULL, pcie_bw_notification_handler,
+    ret = request_threaded_irq(srv->irq, pcie_bw_notification_irq,
+                   pcie_bw_notification_handler,
                    IRQF_SHARED, "PCIe BW notif", srv);
     if (ret)
         return ret;
-- 
2.19.2
Re: [PATCH] PCI/LINK: Request a one-shot IRQ with NULL handler
Hi Borislav,

Thanks for the update. We've settled on a different fix [1], since Lukas was not happy with IRQF_ONESHOT [2].

Alex

[1] https://lore.kernel.org/linux-pci/20190323003700.7294-1-mr.nuke...@gmail.com/
[2] https://lore.kernel.org/linux-pci/20190318043314.noyj6t4sh26sp...@wunner.de/

On 3/25/19 1:23 PM, Borislav Petkov wrote:

From: Borislav Petkov

Booting v5.1-rc1 causes these new messages in dmesg here:

genirq: Threaded irq requested with handler=NULL and !ONESHOT for irq 24
pcie_bw_notification: probe of 0000:00:02.0:pcie010 failed with error -22

because a threaded IRQ requested with a NULL handler needs to be oneshot. Otherwise a level-triggered interrupt endless loop can happen; see the detailed explanation above the "Threaded irq..." error message in kernel/irq/manage.c.

And PCI-MSI type interrupts are one-shot safe, but not in the IOAPIC case:

 24: ... IO-APIC 28-fasteoi PCIe BW notif, PCIe BW notif, PCIe BW notif
 25: ... IO-APIC 29-fasteoi PCIe BW notif, PCIe BW notif

Make the BW notification handler oneshot too.

Fixes: e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth notification")
Suggested-by: Thomas Gleixner
Signed-off-by: Borislav Petkov
Cc: Bjorn Helgaas
Cc: Lukas Wunner
Cc: Alexandru Gagniuc
Cc: linux-...@vger.kernel.org
---
 drivers/pci/pcie/bw_notification.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/bw_notification.c b/drivers/pci/pcie/bw_notification.c
index d2eae3b7cc0f..16e11c09f6bd 100644
--- a/drivers/pci/pcie/bw_notification.c
+++ b/drivers/pci/pcie/bw_notification.c
@@ -81,7 +81,7 @@ static int pcie_bandwidth_notification_probe(struct pcie_device *srv)
         return -ENODEV;
     ret = request_threaded_irq(srv->irq, NULL, pcie_bw_notification_handler,
-                   IRQF_SHARED, "PCIe BW notif", srv);
+                   IRQF_SHARED | IRQF_ONESHOT, "PCIe BW notif", srv);
     if (ret)
         return ret;
Re: [PATCH v3] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected
On 3/20/19 4:44 PM, Linus Torvalds wrote: On Wed, Mar 20, 2019 at 1:52 PM Bjorn Helgaas wrote: AFAICT, the consensus there was that it would be better to find some sort of platform solution instead of dealing with it in individual drivers. The PCI core isn't really a driver, but I think the same argument applies to it: if we had a better way to recover from readl() errors, that way would work equally well in nvme-pci and the PCI core. I think that patches with the pattern "if (disconnected) don't do IO" are fundamentally broken and we should look for alternatives in all cases. They are fundamentally broken because they are racy: if it's an actual sudden disconnect in the middle of IO, there's no guarantee that we'll even be notified in time. They are fundamentally broken because they add new magic special cases that very few people will ever test, and the people who do test them tend to do so with old irrelevant kernels. Finally, they are fundamentally broken because they always end up being just special cases. One or two special case accesses in a driver, or perhaps all accesses of a particular type in just _one_ special driver. Yes, yes, I realize that people want error reporting, and that hot-removal can cause various error conditions (perhaps just parity errors for the IO, but also perhaps other random errors caused by firmware perhaps doing special HW setup). But the "you get a fatal interrupt, so avoid the IO" kind of model is completely broken, and needs to just be fixed differently. See above why it's so completely broken. So if the hw is set up to send some kind of synchronous interrupt or machine check that cannot sanely be handled (perhaps because it will just repeat forever), we should try to just disable said thing. PCIe allows for just polling for errors on the bridges, afaik. It's been years since I looked at it, and maybe I'm wrong. And I bet there are various "platform-specific value add" registers etc that may need tweaking outside of any standard spec for PCIe error reporting. But let's do that in a platform driver, to set up the platform to not do the silly "I'm just going to die if I see an error" thing. It's way better to have a model where you poll each bridge once a minute (or once an hour) and let people know "guys, your hardware reports errors", than make random crappy changes to random drivers because the hardware was set up to die on said errors. And if some MIS person wants the "hardware will die" setting, then they can damn well have that, and then it's not our problem, but it also means that we don't start changing random drivers for that insane setting. It's dead, Jim, and it was the user's choice. Notice how in neither case does it make sense to try to do some "if (disconnected) dont_do_io()" model for the drivers. I disagree with the idea of doing something you know can cause an error to propagate. That being said, in this particular case we have come to rely a little too much on the if (disconnected) model. You mentioned in the other thread that fixing the GHES driver will pay much higher dividends. I'm working on reviving a couple of changes to do just that. Some industry folk were very concerned about the "don't panic()" approach, and I want to make sure I fairly present their arguments in the cover letter. I'm hoping one day we'll have the ability to use page tables to prevent the situations that we're trying to fix today in less than ideal ways. 
Although there are other places in msi.c that use if (disconnected), I'm okay with dropping this change -- provided we come up with an equivalent fix. But even if FFS doesn't crash, do we really want to lose hundreds of milliseconds to SMM --on all cores-- when all it takes is a couple of cycles to check a flag? Alex
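The guard being debated looks roughly like this (a sketch with a placeholder register name, not a quote from any driver):

/* The "if (disconnected) don't do IO" pattern under discussion.
 * It is inherently racy: the device can vanish between the check and
 * the access that follows. SOME_REG is an illustrative placeholder.
 */
if (pci_dev_is_disconnected(pdev))
	return;
writel(val, base + SOME_REG);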
Re: [PATCH] PCI/LINK: bw_notification: Do not leave interrupt handler NULL
On 3/20/19 8:46 AM, Bjorn Helgaas wrote: Hi Alexandru, On Mon, Mar 18, 2019 at 08:12:04PM -0500, Alexandru Gagniuc wrote: A threaded IRQ with a NULL handler does not work with level-triggered interrupts. request_threaded_irq() will return an error:

genirq: Threaded irq requested with handler=NULL and !ONESHOT for irq 16
pcie_bw_notification: probe of 0000:00:1b.0:pcie010 failed with error -22

For level interrupts we need to silence the interrupt before exiting the IRQ handler, so just clear the PCI_EXP_LNKSTA_LBMS bit there. Reported-by: Linus Torvalds Signed-off-by: Alexandru Gagniuc What's your thought regarding Lukas' comment? If you do repost this, please add a Fixes: tag to help connect this with the initial commit. I like Lukas's idea. I should have this re-posted by end of week, unless there's an urgency to get it out earlier. Alex
Re: [GIT PULL] PCI changes for v5.1
On 3/17/19 4:18 PM, Linus Torvalds wrote: On Fri, Mar 8, 2019 at 9:31 AM Bjorn Helgaas wrote: - Report PCIe links that become degraded at run-time (Alexandru Gagniuc) Gaah. Only now as I'm about to do the rc1 release am I looking at new runtime warnings, and noticing that this causes

genirq: Threaded irq requested with handler=NULL and !ONESHOT for irq 16
pcie_bw_notification: probe of 0000:00:1b.0:pcie010 failed with error -22

because you can't have a NULL handler for a level-triggered interrupt (you need something to shut the interrupt off so that it doesn't keep screaming!). Thanks for the catch. I did not see the error on my test machines. I'll take a look tomorrow, and update through Bjorn. Seems like an easy fix. Alex
Re: [PATCH v2] PCI: pciehp: Report degraded links via link bandwidth notification
On 12/7/18 12:20 PM, Alexandru Gagniuc wrote: A warning is generated when a PCIe device is probed with a degraded link, but there was no similar mechanism to warn when the link becomes degraded after probing. The Link Bandwidth Notification provides this mechanism.

Use the link bandwidth notification interrupt to detect bandwidth changes, and rescan the bandwidth, looking for the weakest point. This is the same logic used in probe().

Signed-off-by: Alexandru Gagniuc
---
ping.

Changes since v1:
 * Layer on top of the pcie port service drivers, instead of hotplug service.

This patch needs to be applied on top of:
 PCI: Add missing include to drivers/pci.h

Anticipated FAQ:

Q: Why is this unconditionally compiled in?
A: The symmetrical check in pci probe() is also always compiled in.

Q: Why call module_init() instead of adding a call in pcie_init_services()?
A: A call in pcie_init_services() also requires a prototype in portdrv.h and a non-static implementation in bw_notification.c. Using module_init() is functionally equivalent, and takes less code.

Q: Why print only on degraded links and not when a link is back to full speed?
A: For symmetry with PCI probe(). Although I see a benefit in conveying that a link is back to full speed, I expect this to be extremely rare. Secondary bus reset is usually needed to retrain at full bandwidth.

 drivers/pci/pcie/Makefile          |   1 +
 drivers/pci/pcie/bw_notification.c | 107 +++++++++++++++++++++++++++++
 drivers/pci/pcie/portdrv.h         |   4 +-
 drivers/pci/pcie/portdrv_core.c    |  14 ++--
 4 files changed, 121 insertions(+), 5 deletions(-)
 create mode 100644 drivers/pci/pcie/bw_notification.c

diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index ab514083d5d4..f1d7bc1e5efa 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -3,6 +3,7 @@
 # Makefile for PCI Express features and port driver
 
 pcieportdrv-y := portdrv_core.o portdrv_pci.o err.o
+pcieportdrv-y += bw_notification.o
 
 obj-$(CONFIG_PCIEPORTBUS) += pcieportdrv.o

diff --git a/drivers/pci/pcie/bw_notification.c b/drivers/pci/pcie/bw_notification.c
new file mode 100644
index 000000000000..64391ea9a172
--- /dev/null
+++ b/drivers/pci/pcie/bw_notification.c
@@ -0,0 +1,107 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * PCI Express Bandwidth notification services driver
+ * Author: Alexandru Gagniuc
+ *
+ * Copyright (C) 2018, Dell Inc
+ *
+ * The PCIe bandwidth notification provides a way to notify the operating system
+ * when the link width or data rate changes. This capability is required for all
+ * root ports and downstream ports supporting links wider than x1 and/or
+ * multiple link speeds.
+ *
+ * This service port driver hooks into the bandwidth notification interrupt and
+ * warns when links become degraded in operation.
+ */
+
+#include
+
+#include "../pci.h"
+#include "portdrv.h"
+
+static bool pcie_link_bandwidth_notification_supported(struct pci_dev *dev)
+{
+	int ret;
+	u32 lnk_cap;
+
+	ret = pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnk_cap);
+	return (ret == PCIBIOS_SUCCESSFUL) && (lnk_cap & PCI_EXP_LNKCAP_LBNC);
+}
+
+static void pcie_enable_link_bandwidth_notification(struct pci_dev *dev)
+{
+	u16 lnk_ctl;
+
+	pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnk_ctl);
+	lnk_ctl |= PCI_EXP_LNKCTL_LBMIE;
+	pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnk_ctl);
+}
+
+static void pcie_disable_link_bandwidth_notification(struct pci_dev *dev)
+{
+	u16 lnk_ctl;
+
+	pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnk_ctl);
+	lnk_ctl &= ~PCI_EXP_LNKCTL_LBMIE;
+	pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnk_ctl);
+}
+
+static irqreturn_t pcie_bw_notification_irq(int irq, void *context)
+{
+	struct pcie_device *srv = context;
+	struct pci_dev *port = srv->port;
+	struct pci_dev *dev;
+	u16 link_status, events;
+	int ret;
+
+	ret = pcie_capability_read_word(port, PCI_EXP_LNKSTA, &link_status);
+	events = link_status & PCI_EXP_LNKSTA_LBMS;
+
+	if (!events || ret != PCIBIOS_SUCCESSFUL)
+		return IRQ_NONE;
+
+	/* Print status from upstream link partner, not this downstream port. */
+	list_for_each_entry(dev, &port->subordinate->devices, bus_list)
+		__pcie_print_link_status(dev, false);
+
+	pcie_capability_write_word(port, PCI_EXP_LNKSTA, events);
+	return IRQ_HANDLED;
+}
+
+static int pcie_bandwidth_notification_probe(struct pcie_device *srv)
+{
+	int ret;
+
+	/* Single-width or single-speed ports do not have to support this. */
+	if (!pcie_link_bandwidth_notification_supported(srv->port))
+		return -ENODEV;
+
+	ret = devm_request_irq(&srv->device, srv->irq, pcie_bw_notification_irq,
+			       IRQF_SHARED, "PCIe BW notif", srv);
+	if (ret)
+		return ret;
+
Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected
ping

On 09/18/2018 05:15 PM, Alexandru Gagniuc wrote: When a PCI device is gone, we don't want to send IO to it if we can avoid it. We expose functionality via the irq_chip structure. As users of that structure may not know about the underlying PCI device, it's our responsibility to guard against removed devices.

.irq_write_msi_msg() is already guarded inside __pci_write_msi_msg(). .irq_mask/unmask() are not. Guard them for completeness.

For example, surprise removal of a PCIe device triggers teardown. This touches the irq_chip ops at some point to disable the interrupts. I/O generated here can crash the system on firmware-first machines. Not triggering the IO in the first place greatly reduces the possibility of the problem occurring.

Signed-off-by: Alexandru Gagniuc
---
 drivers/pci/msi.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index f2ef896464b3..f31058fd2260 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -227,6 +227,9 @@ static void msi_set_mask_bit(struct irq_data *data, u32 flag)
 {
 	struct msi_desc *desc = irq_data_get_msi_desc(data);
 
+	if (pci_dev_is_disconnected(msi_desc_to_pci_dev(desc)))
+		return;
+
 	if (desc->msi_attrib.is_msix) {
 		msix_mask_irq(desc, flag);
 		readl(desc->mask_base);	/* Flush write to device */
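For reference, the guard added above is a plain flag check; at the time of this thread the helper in drivers/pci/pci.h is essentially:

static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
{
	return dev->error_state == pci_channel_io_perm_failure;
}

which is why it costs only a couple of cycles, as argued elsewhere in these threads.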
Re: [PATCH] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected
Should I resubmit this rebased on 4.19-rc*, or just leave this patch as is? Alex

On 07/30/2018 04:21 PM, Alexandru Gagniuc wrote: When a PCI device is gone, we don't want to send IO to it if we can avoid it. We expose functionality via the irq_chip structure. As users of that structure may not know about the underlying PCI device, it's our responsibility to guard against removed devices.

irq_write_msi_msg is already guarded. pci_msi_(un)mask_irq are not. Guard them for completeness.

For example, surprise removal of a PCIe device triggers teardown. This touches the irq_chip ops at some point to disable the interrupts. I/O generated here can crash the system on machines with buggy firmware. Not triggering the IO in the first place eliminates the problem.

Signed-off-by: Alexandru Gagniuc
---
There's another patch by Lukas Wunner that is needed (not yet published) in order to fully block IO on SURPRISE!!! removal. The existing code only sets the PCI_DEV_DISCONNECTED bit in an unreasonably narrow set of circumstances. Lukas' patch fixes that. However, this change is otherwise fully independent, and enjoy!

 drivers/pci/msi.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 4d88afdfc843..5f47b5cb0401 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -227,6 +227,9 @@ static void msi_set_mask_bit(struct irq_data *data, u32 flag)
 {
 	struct msi_desc *desc = irq_data_get_msi_desc(data);
 
+	if (pci_dev_is_disconnected(msi_desc_to_pci_dev(desc)))
+		return;
+
 	if (desc->msi_attrib.is_msix) {
 		msix_mask_irq(desc, flag);
 		readl(desc->mask_base);	/* Flush write to device */
Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER
On 08/09/2018 02:18 PM, Bjorn Helgaas wrote: On Thu, Aug 09, 2018 at 02:00:23PM -0500, Alex G. wrote: On 08/09/2018 01:29 PM, Bjorn Helgaas wrote: On Thu, Aug 09, 2018 at 04:46:32PM +0000, alex_gagn...@dellteam.com wrote: On 08/09/2018 09:16 AM, Bjorn Helgaas wrote: (snip) enable_ecrc_checking() disable_ecrc_checking() I don't immediately see how this would affect FFS, but the bits are part of the AER capability structure. According to the FFS model, those would be owned by FW, and we'd have to avoid touching them. Per ACPI v6.2, sec 18.3.2.4, the HEST may contain entries for Root Ports that contain the FIRMWARE_FIRST flag as well as values the OS is supposed to write to several AER capability registers. It looks like we currently ignore everything except the FIRMWARE_FIRST and GLOBAL flags (ACPI_HEST_FIRMWARE_FIRST and ACPI_HEST_GLOBAL in Linux). That seems like a pretty major screwup and more than I want to fix right now. The logic is not very clear, but I think it goes like this: For GLOBAL and FFS, disable native AER everywhere. When !GLOBAL and FFS, then only disable native AER for the root port described by the HEST entry. I agree the code is convoluted, but that sounds right to me. What I meant is that we ignore the values the HEST entry tells us we're supposed to write to Device Control and the AER Uncorrectable Error Mask, Uncorrectable Error Severity, Correctable Error Mask, and AER Capabilities and Control. Wait, what? _HPX has the same information. This is madness! Since root ports are not hot-swappable, the BIOS normally programs those registers. Even if Linux doesn't apply said masks, the programming the BIOS did should be sufficient to have *cough* correct *cough* behavior. For practical considerations this is not an issue today. The ACPI error handling code currently crashes when it encounters any fatal error, so we wouldn't hit this in the FFS case. I wasn't aware the firmware-first path was *that* broken. Are there problem reports for this? Is this a regression? It's been like this since, I believe, 3.10, and probably much earlier. All reports that I have seen of Linux crashing on surprise hot-plug have been caused by the panic() call in the apei code. Dell BIOSes do an extreme amount of work to determine when it's safe to _not_ report errors to the OS, since all known OSes crash on this path. Oh, is this the __ghes_panic() path? If so, I'm going to turn away and plead ignorance unless the PCI core is doing something wrong that eventually results in that panic. I agree, and I'll quote you on that! Alex
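A hedged sketch of the GLOBAL/!GLOBAL logic described above (the helper hest_matches_device() is illustrative; the real code lives in aer_hest_parse() and differs in detail):

static void hest_mark_firmware_first(struct acpi_hest_aer_common *p,
				     struct pci_dev *dev)
{
	if (!(p->flags & ACPI_HEST_FIRMWARE_FIRST))
		return;

	if (p->flags & ACPI_HEST_GLOBAL)
		dev->__aer_firmware_first = 1;	/* applies to all root ports */
	else if (hest_matches_device(p, dev))	/* bus/devfn match, illustrative */
		dev->__aer_firmware_first = 1;	/* applies to this port only */
}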
Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER
On 08/09/2018 01:29 PM, Bjorn Helgaas wrote: On Thu, Aug 09, 2018 at 04:46:32PM +0000, alex_gagn...@dellteam.com wrote: On 08/09/2018 09:16 AM, Bjorn Helgaas wrote: (snip) enable_ecrc_checking() disable_ecrc_checking() I don't immediately see how this would affect FFS, but the bits are part of the AER capability structure. According to the FFS model, those would be owned by FW, and we'd have to avoid touching them. Per ACPI v6.2, sec 18.3.2.4, the HEST may contain entries for Root Ports that contain the FIRMWARE_FIRST flag as well as values the OS is supposed to write to several AER capability registers. It looks like we currently ignore everything except the FIRMWARE_FIRST and GLOBAL flags (ACPI_HEST_FIRMWARE_FIRST and ACPI_HEST_GLOBAL in Linux). That seems like a pretty major screwup and more than I want to fix right now. The logic is not very clear, but I think it goes like this: For GLOBAL and FFS, disable native AER everywhere. When !GLOBAL and FFS, then only disable native AER for the root port described by the HEST entry. aer_acpi_firmware_first() doesn't care about context. Though check aer_set_firmware_first(), where we pass a pci_device as a context. The FFS implementations I've seen have one HEST entry, with GLOBAL and FFS. I have yet to see more fine-grained control of root ports. I suspect that if we had this finer control from HEST, we'd honor it. I do eventually want to get a test BIOS with different HEST entries, and make sure things work as intended. Though the BIOS team is overloaded with other requests. pci_cleanup_aer_uncorrect_error_status() This probably should be guarded. It's only called from a few specific drivers, so the impact is not as high as being called from the core. pci_aer_clear_fatal_status() This is only called when doing fatal_recovery, right? True. It takes a lot of analysis to convince oneself that this is not used in the firmware-first path, so I think we should add a guard there. I agree. GHES has a header severity and a severity field for each section. All BIOSes I've seen do fatal/fatal, though a BIOS that would report correctable/fatal would bypass the apei code and take us here. For practical considerations this is not an issue today. The ACPI error handling code currently crashes when it encounters any fatal error, so we wouldn't hit this in the FFS case. I wasn't aware the firmware-first path was *that* broken. Are there problem reports for this? Is this a regression? It's been like this since, I believe, 3.10, and probably much earlier. All reports that I have seen of Linux crashing on surprise hot-plug have been caused by the panic() call in the apei code. Dell BIOSes do an extreme amount of work to determine when it's safe to _not_ report errors to the OS, since all known OSes crash on this path. Fun fact: there's even dedicated hardware to accomplish the above. The PCIe standards contact I usually talk to about these PCIe subtleties is currently on vacation. The number one issue was an FFS corner case with the OS clearing bits on probe. The other functions you mention are a corner case of a corner case. The big fish is pci_cleanup_aer_error_status_regs() on probe(), and it would be nice to have that resolved. I'll sync up with Austin when he gets back to see about the other functions, though I suspect we'll end up fixing them as well. I'd like to fix all the obvious cases at once (excluding the ECRC stuff). What do you think about the following patch? That looks very sensible to me. Thank you for updating it :).

I'm pulling in the changes right now to run some quick tests, and I don't expect any trouble. Alex

commit 15ed68dcc26864c849a12a36db4d4771bad7991f
Author: Alexandru Gagniuc
Date: Tue Jul 17 10:31:23 2018 -0500

    PCI/AER: Don't clear AER bits if error handling is Firmware-First

    If the platform requests Firmware-First error handling, firmware is
    responsible for reading and clearing AER status bits. If OSPM also
    clears them, we may miss errors. See ACPI v6.2, sec 18.3.2.5 and 18.4.

    This race is mostly of theoretical significance, as it is not easy to
    reasonably demonstrate it in testing.

    Signed-off-by: Alexandru Gagniuc
    [bhelgaas: add similar guards to pci_cleanup_aer_uncorrect_error_status()
    and pci_aer_clear_fatal_status()]
    Signed-off-by: Bjorn Helgaas

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index c6cc855bfa22..4e823ae051a7 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -397,6 +397,9 @@ int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
 	if (!pos)
 		return -EIO;
 
+	if (pcie_aer_get_firmware_first(dev))
+		return -EIO;
+
 	/* Clear status bits for ERR_NONFATAL errors only */
 	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status);
 	pci_read_config_dword(dev, pos +
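To illustrate the shape of the guard Bjorn's version adds, here is the same check applied to pci_aer_clear_fatal_status(), one of the helpers named in the bracketed note (a sketch under the assumption that pcie_aer_get_firmware_first() behaves as in the hunk above, not the exact committed code):

void pci_aer_clear_fatal_status(struct pci_dev *dev)
{
	int pos = dev->aer_cap;
	u32 status, sev;

	if (!pos || pcie_aer_get_firmware_first(dev))
		return;

	/* Clear status bits for ERR_FATAL errors only. */
	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status);
	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_SEVER, &sev);
	status &= sev;
	if (status)
		pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, status);
}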
Re: [PATCH v3] PCI/AER: Do not clear AER bits if we don't own AER
On 08/07/2018 08:14 PM, Bjorn Helgaas wrote: On Mon, Jul 30, 2018 at 06:35:31PM -0500, Alexandru Gagniuc wrote: When we don't own AER, we shouldn't touch the AER error bits. Clearing error bits willy-nilly might cause firmware to miss some errors. In theory, these bits get cleared by FFS, or via ACPI _HPX method. These mechanisms are not subject to the problem. What's FFS? Firmware-first. Nobody likes spelling it out, and all other proposed acronyms are insanely tongue-twisting. So, FFS. I guess you mean FFS and _HPX are not subject to the problem because they're supplied by firmware, so firmware would be responsible for looking at the bits before clearing them? Exactly. This race is mostly of theoretical significance, since I can't reasonably demonstrate this race in the lab. On a side-note, pcie_aer_is_kernel_first() is created to alleviate the need for two checks: aer_cap and get_firmware_first(). Signed-off-by: Alexandru Gagniuc
---
Changes since v2:
 - Added missing negation in pci_cleanup_aer_error_status_regs()

 drivers/pci/pcie/aer.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a2e88386af28..40e5c86271d1 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -307,6 +307,12 @@ int pcie_aer_get_firmware_first(struct pci_dev *dev)
 		aer_set_firmware_first(dev);
 	return dev->__aer_firmware_first;
 }
+
+static bool pcie_aer_is_kernel_first(struct pci_dev *dev)
+{
+	return !!dev->aer_cap && !pcie_aer_get_firmware_first(dev);
+}

I think it complicates things to have both "firmware_first" and "kernel_first" interfaces, so I would prefer to stick with the existing "firmware_first" style.

 #define PCI_EXP_AER_FLAGS	(PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
 				 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
@@ -337,10 +343,7 @@ bool aer_acpi_firmware_first(void)
 int pci_enable_pcie_error_reporting(struct pci_dev *dev)
 {
-	if (pcie_aer_get_firmware_first(dev))
-		return -EIO;
-
-	if (!dev->aer_cap)
+	if (!pcie_aer_is_kernel_first(dev))
 		return -EIO;
 
 	return pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);

This change doesn't actually fix anything, does it? It looks like a cleanup that doesn't change the behavior.

Initially (v1), this was a one-liner, but someone had a complaint about having pcie_aer_get_firmware_first() boilerplate all over the place. That's why I added the "kernel_first" function (previous comment), and then updated this here for completeness. I'm also fine with v1.

@@ -349,7 +352,7 @@ EXPORT_SYMBOL_GPL(pci_enable_pcie_error_reporting);
 int pci_disable_pcie_error_reporting(struct pci_dev *dev)
 {
-	if (pcie_aer_get_firmware_first(dev))
+	if (!pcie_aer_is_kernel_first(dev))
 		return -EIO;

This change does effectively add a test for dev->aer_cap. That makes sense in terms of symmetry with pci_enable_pcie_error_reporting(), but I think it should be a separate patch because it's conceptually separate from the change below. We should keep the existing behavior (but add the symmetry) here for now, but it's not clear to me that these paths should care about AER or firmware-first at all. PCI_EXP_DEVCTL is not an AER register and we have the _HPX mechanism for firmware to influence it (which these paths currently ignore). I suspect we should program these reporting enable bits in the core enumeration path instead of having drivers call these interfaces. The headache is that FFS needs the reporting bit to stay enabled in order to get AER notifications. 
Disabling things here could really break firmware. Of course, that's a circular argument, since FW is broken by definition. If/when we make changes along these lines, the history will be easier to follow if *this* change is not connected with the change below to pci_cleanup_aer_error_status_regs(). I agree. I think it might be preferred then to go with v1, and leave the refactoring to a later time, since the extra changes are cosmetic and social.

 	return pcie_capability_clear_word(dev, PCI_EXP_DEVCTL,
@@ -383,10 +386,10 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
 	if (!pci_is_pcie(dev))
 		return -ENODEV;
 
-	pos = dev->aer_cap;
-	if (!pos)
+	if (!pcie_aer_is_kernel_first(dev))
 		return -EIO;

This part makes sense to me, but I think I would rather have it match the existing style in pci_enable_pcie_error_reporting(), i.e., keep the test for dev->aer_cap and add a test for pcie_aer_get_firmware_first().

Had it that way in v1.

Alex

+	pos = dev->aer_cap;
 	port_type = pci_pcie_type(dev);
 	if (port_type == PCI_EXP_TYPE_ROOT_PORT) {
 		pci_read_config_dword(dev, pos + PCI_ERR_ROOT_STATUS, &status);
--
Re: [PATCH v5] PCI: Check for PCIe downtraining conditions
On 07/31/2018 01:40 AM, Tal Gilboa wrote: [snip]

@@ -2240,6 +2258,9 @@ static void pci_init_capabilities(struct pci_dev *dev)
 	/* Advanced Error Reporting */
 	pci_aer_init(dev);
 
+	/* Check link and detect downtrain errors */
+	pcie_check_upstream_link(dev);

This is called for every PCIe device right? Won't there be a duplicated print in case a device loads with lower PCIe bandwidth than needed? Alex, can you comment on this please?

Of course I can. There's one print at probe() time, which happens if bandwidth < max. I would think that's fine. There is a way to duplicate it, and that is if the driver also calls print_link_status(). A few driver maintainers who call it have indicated they'd be fine with removing it from the driver, and leaving it in the core PCI. Should the device come back at low speed, go away, then come back at full speed, we'd still only see one print from the first probe. Again, if the driver doesn't also call the function. There's the corner case where the device comes up at < max, goes away, then comes back faster, but still < max. There will be two prints, but they won't be true duplicates -- each one will indicate different BW.

Alex

+
 	if (pci_probe_reset_function(dev) == 0)
 		dev->reset_fn = 1;
 }
diff --git a/include/linux/pci.h b/include/linux/pci.h
index abd5d5e17aee..15bfab8f7a1b 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1088,6 +1088,7 @@ int pcie_set_mps(struct pci_dev *dev, int mps);
 u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev **limiting_dev,
 			     enum pci_bus_speed *speed,
 			     enum pcie_link_width *width);
+void __pcie_print_link_status(struct pci_dev *dev, bool verbose);
 void pcie_print_link_status(struct pci_dev *dev);
 int pcie_flr(struct pci_dev *dev);
 int __pci_reset_function_locked(struct pci_dev *dev);
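For driver authors weighing whether to keep their own call: a typical driver-side use looks like the sketch below, and with the core printing at device-add time it becomes the duplicate being discussed (my_probe and its surrounding driver are hypothetical):

static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	int err = pci_enable_device(pdev);

	if (err)
		return err;

	/* Warns if the negotiated link is slower/narrower than both ends
	 * support; redundant once the core prints this from probe.
	 */
	pcie_print_link_status(pdev);
	return 0;
}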
Re: [PATCH v2] PCI/AER: Do not clear AER bits if we don't own AER
On 07/23/2018 11:52 AM, Alexandru Gagniuc wrote: When we don't own AER, we shouldn't touch the AER error bits. Clearing error bits willy-nilly might cause firmware to miss some errors. In theory, these bits get cleared by FFS, or via ACPI _HPX method. These mechanisms are not subject to the problem. This race is mostly of theoretical significance, since I can't reasonably demonstrate this race in the lab. On a side-note, pcie_aer_is_kernel_first() is created to alleviate the need for two checks: aer_cap and get_firmware_first(). Signed-off-by: Alexandru Gagniuc
---
 drivers/pci/pcie/aer.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a2e88386af28..85c3e173c025 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -307,6 +307,12 @@ int pcie_aer_get_firmware_first(struct pci_dev *dev)
 		aer_set_firmware_first(dev);
 	return dev->__aer_firmware_first;
 }
+
+static bool pcie_aer_is_kernel_first(struct pci_dev *dev)
+{
+	return !!dev->aer_cap && !pcie_aer_get_firmware_first(dev);
+}
+
 #define PCI_EXP_AER_FLAGS	(PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
 				 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
@@ -337,10 +343,7 @@ bool aer_acpi_firmware_first(void)
 int pci_enable_pcie_error_reporting(struct pci_dev *dev)
 {
-	if (pcie_aer_get_firmware_first(dev))
-		return -EIO;
-
-	if (!dev->aer_cap)
+	if (!pcie_aer_is_kernel_first(dev))
 		return -EIO;
 
 	return pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);
@@ -349,7 +352,7 @@ EXPORT_SYMBOL_GPL(pci_enable_pcie_error_reporting);
 int pci_disable_pcie_error_reporting(struct pci_dev *dev)
 {
-	if (pcie_aer_get_firmware_first(dev))
+	if (!pcie_aer_is_kernel_first(dev))
 		return -EIO;
 
 	return pcie_capability_clear_word(dev, PCI_EXP_DEVCTL,
@@ -383,10 +386,10 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
 	if (!pci_is_pcie(dev))
 		return -ENODEV;
 
-	pos = dev->aer_cap;
-	if (!pos)
+	if (pcie_aer_is_kernel_first(dev))

This here is missing a '!'. It's in my local branch, so I must have exported the patch before I fixed that. I'll get that fixed next rev.

 		return -EIO;
 
+	pos = dev->aer_cap;
 	port_type = pci_pcie_type(dev);
 	if (port_type == PCI_EXP_TYPE_ROOT_PORT) {
 		pci_read_config_dword(dev, pos + PCI_ERR_ROOT_STATUS, &status);
Re: [PATCH v5] PCI: Check for PCIe downtraining conditions
On 07/23/2018 05:14 PM, Jakub Kicinski wrote: On Tue, 24 Jul 2018 00:52:22 +0300, Tal Gilboa wrote: On 7/24/2018 12:01 AM, Jakub Kicinski wrote: On Mon, 23 Jul 2018 15:03:38 -0500, Alexandru Gagniuc wrote: PCIe downtraining happens when both the device and PCIe port are capable of a larger bus width or higher speed than negotiated. Downtraining might be indicative of other problems in the system, and identifying this from userspace is neither intuitive, nor straightforward. The easiest way to detect this is with pcie_print_link_status(), since the bottleneck is usually the link that is downtrained. It's not a perfect solution, but it works extremely well in most cases. Signed-off-by: Alexandru Gagniuc --- For the sake of review, I've created a __pcie_print_link_status() which takes a 'verbose' argument. If we agree we want to go this route, and update the users of pcie_print_link_status(), I can split this up in two patches. I prefer just printing this information in the core functions, and letting drivers not have to worry about this. Though there seems to be strong opposition to going that route, so here it goes: FWIW the networking drivers print PCIe BW because sometimes the network bandwidth is simply over-provisioned on multi port cards, e.g. 80Gbps card on a x8 link. Sorry to bike shed, but currently the networking cards print the info during probe. Would it make sense to move your message closer to probe time? Rather than when the device is added. If the driver structure is available, we could also consider adding a boolean to struct pci_driver to indicate if the driver wants the verbose message? This way we avoid duplicated prints. I have no objection to the current patch, it LGTM. Just a thought. I don't see the reason for having two functions. What's the problem with adding the verbose argument to the original function? IMHO it's reasonable to keep the default parameter to what 90% of users want by means of a wrapper. The non-verbose output is provided by the core already for all devices. What do you think of my proposal above, Tal? That would make the extra wrapper unnecessary since the verbose parameter would be part of the driver structure, and it would avoid the duplicated output. I see how it might make sense to add another member to the driver struct, but is it worth the extra learning curve? It seems to be something with the potential to confuse new driver developers, and having a very marginal benefit. Although, if that's what people want... Alex
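As a concrete reading of Jakub's proposal (nothing like this was merged; the field name is hypothetical):

struct pci_driver {
	/* ... existing fields ... */
	bool verbose_link_status;	/* hypothetical opt-in flag */
};

/* In the core, at device-add time (illustrative): */
if (drv && drv->verbose_link_status)
	__pcie_print_link_status(dev, true);	/* driver opted in */
else
	__pcie_print_link_status(dev, false);	/* terse default */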
Re: [PATCH v3] PCI: Check for PCIe downtraining conditions
On 07/23/2018 12:21 AM, Tal Gilboa wrote: On 7/19/2018 6:49 PM, Alex G. wrote: On 07/18/2018 08:38 AM, Tal Gilboa wrote: On 7/16/2018 5:17 PM, Bjorn Helgaas wrote: [+cc maintainers of drivers that already use pcie_print_link_status() and GPU folks] [snip]

+	/* Multi-function PCIe share the same link/status. */
+	if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn)
+		return;
+
+	pcie_print_link_status(dev);
+}

Is this function called by default for every PCIe device? What about VFs? We make an exception for them on our driver since a VF doesn't have access to the needed information in order to provide a meaningful message. I'm assuming VF means virtual function. pcie_print_link_status() doesn't care if it's passed a virtual function. It will try to do its job. That's why I bail out three lines above, with the 'dev->is_virtfn' check. Alex That's the point - we don't want to call pcie_print_link_status() for virtual functions. We make the distinction in our driver. If you want to change the code to call this function by default it shouldn't affect the current usage. I'm not understanding very well what you're asking. I understand you want to avoid printing this message on virtual functions, and that's already taken care of. I'm also not changing current behavior. Let's get v2 out and start the discussion again based on that. Alex
Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER
On 07/19/2018 11:58 AM, Sinan Kaya wrote: On 7/19/2018 8:55 AM, Alex G. wrote: I find the intent clearer if we check it here rather than having to do the mental parsing of the state of aer_cap. I don't feel too strongly about my comment, to be honest. This was a style/maintenance comment. It feels like we are putting pcie_aer_get_firmware_first() into core functions unnecessarily after your change. I understand the need for your change. I'm asking whether it is the right place or not. pcie_aer_get_firmware_first() should be called from either the init or probe function so that the rest of the AER functions do not get called from any other context. If someone adds another AER function, we might need to add another pcie_aer_get_firmware_first() check there. So, we have unnecessary code duplication. We could move the aer_cap and get_ffs() check into one function that we end up calling all over the place. I understand your concern about code duplication, and I agree with it. I don't think that at this point it's that big of a deal, although we might need to guard every aer_*() call. So moving all the checks into a pcie_aer_is_kernel_first() makes sense. Alex
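A sketch of the consolidation Sinan is suggesting, using the helper name from the thread; the body is a guess at the obvious implementation, not mainline code:

        /* True when the kernel, not firmware, owns AER for this device. */
        static bool pcie_aer_is_kernel_first(struct pci_dev *dev)
        {
                return !!dev->aer_cap && !pcie_aer_get_firmware_first(dev);
        }

Each aer_*() entry point could then open with a single 'if (!pcie_aer_is_kernel_first(dev)) return -EIO;' instead of duplicating the two checks.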
Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER
On 07/17/2018 10:41 AM, Sinan Kaya wrote: On 7/17/2018 8:31 AM, Alexandru Gagniuc wrote: + if (pcie_aer_get_firmware_first(dev)) + return -EIO; Can you move this closer to the caller, pci_aer_init()? I could move it there, although pci_cleanup_aer_error_status_regs() is called directly from pci_restore_state(). Of course, aer_cap should be zero in this case, and we'd still bail out. I find the intent clearer if we check it here rather than having to do the mental parsing of the state of aer_cap. Alex
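For context, roughly where the guard from the patch sits (an abridged sketch of pci_cleanup_aer_error_status_regs(); the real function also handles the root-port status registers):

        int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
        {
                int pos;
                u32 status;

                if (!pci_is_pcie(dev))
                        return -ENODEV;

                /* The new check: don't touch the bits if firmware owns AER. */
                if (pcie_aer_get_firmware_first(dev))
                        return -EIO;

                pos = dev->aer_cap;
                if (!pos)
                        return -EIO;

                /* Status bits are RW1C: write back what we read to clear them. */
                pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status);
                pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, status);

                pci_read_config_dword(dev, pos + PCI_ERR_COR_STATUS, &status);
                pci_write_config_dword(dev, pos + PCI_ERR_COR_STATUS, status);

                return 0;
        }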
Re: [PATCH v3] PCI: Check for PCIe downtraining conditions
On 07/18/2018 08:38 AM, Tal Gilboa wrote: On 7/16/2018 5:17 PM, Bjorn Helgaas wrote: [+cc maintainers of drivers that already use pcie_print_link_status() and GPU folks] [snip] + /* Multi-function PCIe share the same link/status. */ + if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn) + return; + + pcie_print_link_status(dev); +} Is this function called by default for every PCIe device? What about VFs? We make an exception for them in our driver since a VF doesn't have access to the needed information in order to provide a meaningful message. I'm assuming VF means virtual function. pcie_print_link_status() doesn't care if it's passed a virtual function. It will try to do its job. That's why I bail out three lines above, with the 'dev->is_virtfn' check. Alex
Re: [PATCH v3] PCI: Check for PCIe downtraining conditions
On 07/18/2018 04:53 PM, Bjorn Helgaas wrote: [+cc Mike (hfi1)] On Mon, Jul 16, 2018 at 10:28:35PM +0000, alex_gagn...@dellteam.com wrote: On 7/16/2018 4:17 PM, Bjorn Helgaas wrote: ... The easiest way to detect this is with pcie_print_link_status(), since the bottleneck is usually the link that is downtrained. It's not a perfect solution, but it works extremely well in most cases. This is an interesting idea. I have two concerns: Some drivers already do this on their own, and we probably don't want duplicate output for those devices. In most cases (ixgbe and mlx* are exceptions), the drivers do this unconditionally so we *could* remove it from the driver if we add it to the core. The dmesg order would change, and the message wouldn't be associated with the driver as it now is. Oh, there are only 8 users of that. Even I could patch up the drivers to remove the call, assuming we reach agreement about this change. Also, I think some of the GPU devices might come up at a lower speed, then download firmware, then reset the device so it comes up at a higher speed. I think this patch will make us complain about the low initial speed, which might confuse users. I spoke to one of the PCIe spec writers. It's allowable for a device to downtrain speed or width. It would also be extremely dumb to downtrain with the intent to re-train at a higher speed later, but it's possible devices do dumb stuff like that. That's why it's an informational message, instead of a warning. FWIW, here's some of the discussion related to hfi1 from [1]: > Btw, why is the driver configuring the PCIe link speed? Isn't > this something we should be handling in the PCI core? The device comes out of reset at the 5GT/s speed. The driver downloads device firmware, programs PCIe registers, and co-ordinates the transition to 8GT/s. This recipe is device specific and is therefore implemented in the hfi1 driver built on top of PCI core functions and macros. Also several DRM drivers seem to do this (see e.g. si_pcie_gen3_enable()); from [2]: My understanding was that some platforms only bring up the link in gen 1 mode for compatibility reasons. [1] https://lkml.kernel.org/r/32e1700b9017364d9b60aed9960492bc627ff...@fmsmsx120.amr.corp.intel.com [2] https://lkml.kernel.org/r/bn6pr12mb1809bd30aa5b890c054f9832f7...@bn6pr12mb1809.namprd12.prod.outlook.com Downtraining a link "for compatibility reasons" is one of those dumb things that devices do. I'm SURPRISED AMD HW does it, although it is perfectly permissible by the PCIe spec. Another case: Some devices (lower-end GPUs) use silicon (and marketing) that advertises x16, but they're only routed for x8. I'm okay with seeing an informational message in this case. In fact, I didn't know that the Quadro card I've had for three years is only wired for x8 until I was testing this patch. Yeah, it's probably OK. I don't want bug reports from people who think something's broken when it's really just a hardware limitation of their system. But hopefully the message is not alarming. It looks fairly innocent: [0.749415] pci 0000:18:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x1 link at 0000:17:03.0 (capable of 15.752 Gb/s with 8 GT/s x2 link) So I'm not sure whether it's better to do this in the core for all devices, or if we should just add it to the high-performance drivers that really care. You're thinking "do I really need that bandwidth" because I'm using a function called "_bandwidth_". 
The point of the change is very far from that: it is to help in system troubleshooting by detecting downtraining conditions. I'm not sure what you think I'm thinking :) My question is whether it's worthwhile to print this extra information for *every* PCIe device, given that your use case is the tiny percentage of broken systems. I think this information is a lot more useful than a bunch of other info that's printed. Is "type 00 class 0x088000" more valuable? What about "reg 0x20: [mem 0x9d95-0x9d95 64bit pref]", which is also available under /proc/iomem for those curious? If we only printed the info in the "bw_avail < bw_cap" case, i.e., when the device is capable of more than it's getting, that would make a lot of sense to me. The normal case line is more questionable. I think the reason that's there is because the network drivers are very performance-sensitive and like to see that info all the time. I agree that can be an acceptable compromise. Maybe we need something like this: pcie_print_link_status(struct pci_dev *dev, int verbose) { ... if (bw_avail >= bw_cap) { if (verbose) pci_info(dev, "... available PCIe bandwidth ..."); } else pci_info(dev, "... available PCIe bandwidth, limited by ..."); } So the core could print only the potential problems with: pcie_print_link_status(dev, 0); and drivers that really care even if there's no problem could do: pcie_print_link_status(dev, 1);
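Filling out that sketch into something closer to compilable form (assumptions: the bandwidth helpers and the Mb/s units of the existing pcie_print_link_status(), and an abridged message format):

        void pcie_print_link_status(struct pci_dev *dev, int verbose)
        {
                enum pcie_link_width width, width_cap;
                enum pci_bus_speed speed, speed_cap;
                struct pci_dev *limiting_dev = NULL;
                u32 bw_avail, bw_cap;

                bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
                bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width);

                if (bw_avail >= bw_cap) {
                        if (verbose)    /* nothing wrong; print only on request */
                                pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth (%s x%d link)\n",
                                         bw_cap / 1000, bw_cap % 1000,
                                         PCIE_SPEED2STR(speed_cap), width_cap);
                } else {
                        pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth, limited by %s x%d link at %s\n",
                                 bw_avail / 1000, bw_avail % 1000,
                                 PCIE_SPEED2STR(speed), width,
                                 limiting_dev ? pci_name(limiting_dev) : "<unknown>");
                }
        }

The core would then call pcie_print_link_status(dev, 0) for every device, and a driver that wants the line unconditionally would pass 1.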
Re: [PATCH v3] PCI/AER: Fix aerdrv loading with "pcie_ports=native" parameter
On 07/03/2018 11:38 AM, Bjorn Helgaas wrote: > On Mon, Jul 02, 2018 at 11:16:01AM -0500, Alexandru Gagniuc wrote: >> According to the documentation, with "pcie_ports=native", linux should use >> native AER and DPC services. While that is true for the _OSC method >> parsing, this is not the only place that is checked. Should the HEST >> table list PCIe ports as firmware-first, linux will not use native >> services. >> >> This happens because aer_acpi_firmware_first() doesn't take >> 'pcie_ports' into account. This is wrong. DPC uses the same logic when >> it decides whether to load or not, so fixing this also fixes DPC not >> loading. >> >> Signed-off-by: Alexandru Gagniuc >> --- >> drivers/pci/pcie/aer.c | 9 ++--- >> 1 file changed, 6 insertions(+), 3 deletions(-) >> >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c >> index a2e88386af28..db2c01056dc7 100644 >> --- a/drivers/pci/pcie/aer.c >> +++ b/drivers/pci/pcie/aer.c >> @@ -283,13 +283,14 @@ static int aer_hest_parse(struct acpi_hest_header >> *hest_hdr, void *data) >> >> static void aer_set_firmware_first(struct pci_dev *pci_dev) >> { >> -int rc; >> +int rc = 0; >> struct aer_hest_parse_info info = { >> .pci_dev = pci_dev, >> .firmware_first = 0, >> }; >> >> -rc = apei_hest_parse(aer_hest_parse, &info); >> +if (!pcie_ports_native) >> +rc = apei_hest_parse(aer_hest_parse, &info); >> >> if (rc) >> pci_dev->__aer_firmware_first = 0; >> @@ -324,7 +325,9 @@ bool aer_acpi_firmware_first(void) >> }; >> >> if (!parsed) { >> -apei_hest_parse(aer_hest_parse, &info); >> +if (!pcie_ports_native) >> +apei_hest_parse(aer_hest_parse, &info); >> + >> aer_firmware_first = info.firmware_first; >> parsed = true; >> } > > I was thinking something along the lines of the patch below, so we > don't have to work through the settings of "rc" and "info". But maybe > I'm missing something subtle? > > One subtle thing that I didn't look into is the > pcie_aer_get_firmware_first() stub for the non-CONFIG_ACPI_APEI case. > > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > index b24f2d252180..5ccbd7635f33 100644 > --- a/drivers/pci/pcie/aer.c > +++ b/drivers/pci/pcie/aer.c > @@ -342,6 +342,9 @@ int pcie_aer_get_firmware_first(struct pci_dev *dev) > if (!pci_is_pcie(dev)) > return 0; > > + if (pcie_ports_native) > + return 0; > + Here we're not setting dev->__aer_firmware_first_valid. Assuming we only check for it via pcie_aer_get_firmware_first(), we should be fine. This field is used in portdrv.h, but its use seems safe. I think this approach can work. > if (!dev->__aer_firmware_first_valid) > aer_set_firmware_first(dev); > return dev->__aer_firmware_first; > @@ -362,6 +365,9 @@ bool aer_acpi_firmware_first(void) > .firmware_first = 0, > }; > > + if (pcie_ports_native) > + return false; > + This makes better sense. > if (!parsed) { > apei_hest_parse(aer_hest_parse, &info); > aer_firmware_first = info.firmware_first; >
Re: [PATCH v2] PCI/AER: Fix aerdrv loading with "pcie_ports=native" parameter
On 07/02/2018 08:16 AM, Bjorn Helgaas wrote: > On Sat, Jun 30, 2018 at 11:39:00PM -0500, Alex G wrote: >> On 06/30/2018 04:31 PM, Bjorn Helgaas wrote: >>> [+cc Borislav, linux-acpi, since this involves APEI/HEST] >> >> Borislav is not the relevant maintainer here, since we're not contingent on >> APEI handling. I think Keith has a lot more experience with this part of the >> kernel. > > Thanks for adding Keith. > >>> On Tue, Jun 19, 2018 at 02:58:20PM -0500, Alexandru Gagniuc wrote: >>>> According to the documentation, with "pcie_ports=native", linux should use >>>> native AER and DPC services. While that is true for the _OSC method >>>> parsing, this is not the only place that is checked. Should the HEST >>>> table list PCIe ports as firmware-first, linux will not use native >>>> services. >>> >>> Nothing in ACPI-land looks at pcie_ports_native. How should ACPI >>> things work in the "pcie_ports=native" case? I guess we still have to >>> expect to receive error records from the firmware, because it may >>> certainly send us non-PCI errors (machine checks, etc) and maybe even >>> some PCI errors (even if the Linux AER driver claims AER interrupts, >>> we don't know what notification mechanisms the firmware may be using). >> >> I think ACPI land shouldn't care about this. We care about it from the PCIe >> stand point at the interface with ACPI. FW might see a delta in the sense >> that we request control of some features via _OSC, which we otherwise would >> not do without pcie_ports=native. >> >>> I guess best-case, we'll get ACPI error records for all non-PCI >>> things, and the Linux AER driver will see all the AER errors. >> >> It might affect FW's ability to catch errors, but that's dependent on the >> root port implementation. >> >>> Worst-case, I don't really know what to expect. Duplicate reporting >>> of AER errors via firmware and Linux AER driver? Some kind of >>> confusion about who acknowledges and clears them? >> >> Once the user enters pcie_ports=native, all bets are off: you broke the contract >> you have with the FW -- whether or not you have this patch. >> >>> Out of curiosity, what is your use case for "pcie_ports=native"? >>> Presumably there's something that works better when using it, and >>> things work even *better* with this patch? >> >> Correctness. It bothers me that actual behavior does not match the >> documentation: >> >> native Use native PCIe services associated with PCIe ports >> unconditionally. >> >> >>> I know people do use it, because I often see it mentioned in forums >>> and bug reports, but I really don't expect it to work very well >>> because we're ignoring the usage model the firmware is designed >>> around. My unproven suspicion is that most uses are in the black >>> magic category of "there's a bug here, and we don't know how to fix >>> it, but pcie_ports=native makes it work better". >> >> There exist cases that firmware didn't consider. I would not call them >> "firmware bugs", but there are cases where the user understands the platform >> better than firmware. >> Example: on certain PCIe switches, a hardware PCIe error may bring the >> switch downstream ports into a state where they stop notifying hotplug >> events. Depending on the platform, firmware may or may not fix this >> condition, but "pcie_ports=native" enables DPC. DPC contains the error >> without the switch downstream port entering the weird error state in the >> first place. >> >> All bets are off at this point. 
> > If a user needs "pcie_ports=native", I claim that's a user experience > problem, and the underlying cause is a hardware, firmware, or OS > defect. > > I have no doubt the situation you describe is real, but this doesn't > make any progress toward resolving the user experience problem. In > fact, it propagates the folklore that using "pcie_ports=native" is an > appropriate final solution. It's fine as a temporary workaround while > we figure out a better solution, but we need some mechanism for > analyzing the problem and eventually removing the need to use > "pcie_ports=native". Speaking of user experience, I'd argue that it's a horrible experience for the kernel to _not_ do what it is asked. I'm going to go fix the little comment about the patch. I had the same dilemma when I wrote it, but didn't find it too noteworthy. It makes more sense now that you mentioned it. Alex > I have a minor comment on the patch, but I think it makes sense. This > might be a good time to resurrect Prarit's "taint-on-pci-parameters" > patch. If somebody uses "pcie_ports=native", I think it makes sense > to taint the kernel both because (1) we broke the contract with the > firmware and we don't really know what to expect, and (2) it's an > opportunity to encourage the user to raise a bug report. > > Bjorn >
Re: [PATCH v2] PCI/AER: Fix aerdrv loading with "pcie_ports=native" parameter
On 06/30/2018 04:31 PM, Bjorn Helgaas wrote: [+cc Borislav, linux-acpi, since this involves APEI/HEST] Borislav is not the relevant maintainer here, since we're not contingent on APEI handling. I think Keith has a lot more experience with this part of the kernel. On Tue, Jun 19, 2018 at 02:58:20PM -0500, Alexandru Gagniuc wrote: According to the documentation, with "pcie_ports=native", linux should use native AER and DPC services. While that is true for the _OSC method parsing, this is not the only place that is checked. Should the HEST table list PCIe ports as firmware-first, linux will not use native services. Nothing in ACPI-land looks at pcie_ports_native. How should ACPI things work in the "pcie_ports=native" case? I guess we still have to expect to receive error records from the firmware, because it may certainly send us non-PCI errors (machine checks, etc) and maybe even some PCI errors (even if the Linux AER driver claims AER interrupts, we don't know what notification mechanisms the firmware may be using). I think ACPI land shouldn't care about this. We care about it from the PCIe stand point at the interface with ACPI. FW might see a delta in the sense that we request control of some features via _OSC, which we otherwise would not do without pcie_ports=native. I guess best-case, we'll get ACPI error records for all non-PCI things, and the Linux AER driver will see all the AER errors. It might affect FW's ability to catch errors, but that's dependent on the root port implementation. Worst-case, I don't really know what to expect. Duplicate reporting of AER errors via firmware and Linux AER driver? Some kind of confusion about who acknowledges and clears them? Once the user enters pcie_ports=native, all bets are off: you broke the contract you have with the FW -- whether or not you have this patch. Out of curiosity, what is your use case for "pcie_ports=native"? Presumably there's something that works better when using it, and things work even *better* with this patch? Correctness. It bothers me that actual behavior does not match the documentation: native Use native PCIe services associated with PCIe ports unconditionally. I know people do use it, because I often see it mentioned in forums and bug reports, but I really don't expect it to work very well because we're ignoring the usage model the firmware is designed around. My unproven suspicion is that most uses are in the black magic category of "there's a bug here, and we don't know how to fix it, but pcie_ports=native makes it work better". There exist cases that firmware didn't consider. I would not call them "firmware bugs", but there are cases where the user understands the platform better than firmware. Example: on certain PCIe switches, a hardware PCIe error may bring the switch downstream ports into a state where they stop notifying hotplug events. Depending on the platform, firmware may or may not fix this condition, but "pcie_ports=native" enables DPC. DPC contains the error without the switch downstream port entering the weird error state in the first place. All bets are off at this point. Obviously I would much rather find and fix those bugs so people wouldn't have to stumble over the problem in the first place. Using native services when firmware asks us not to is a crapshoot every time. I don't condone the use of this feature, but if we do get a pcie_ports=native request, we should at least honor it. This happens because aer_acpi_firmware_first() doesn't take 'pcie_ports' into account. This is wrong. 
DPC uses the same logic when it decides whether to load or not, so fixing this also fixes DPC not loading. Signed-off-by: Alexandru Gagniuc --- drivers/pci/pcie/aer.c | 5 +++- 1 file changed, 4 insertions(+), 1 deletion(-) Changes since v1: - Re-tested with latest and greatest (v4.18-rc1) -- works great diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index a2e88386af28..98ced0f7c850 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -291,7 +291,7 @@ static void aer_set_firmware_first(struct pci_dev *pci_dev) rc = apei_hest_parse(aer_hest_parse, &info); - if (rc) + if (rc || pcie_ports_native) pci_dev->__aer_firmware_first = 0; else pci_dev->__aer_firmware_first = info.firmware_first; @@ -327,6 +327,9 @@ bool aer_acpi_firmware_first(void) apei_hest_parse(aer_hest_parse, &info); aer_firmware_first = info.firmware_first; parsed = true; + if (pcie_ports_native) + aer_firmware_first = 0; + } return aer_firmware_first; } -- 2.14.3
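For reference, roughly how the flag in question gets set from the command line (paraphrased from the port services driver; treat the exact strings and placement as an approximation, not a quote of the patch):

        static int __init pcie_port_setup(char *str)
        {
                if (!strncmp(str, "compat", 6))
                        pcie_ports_disabled = true;     /* no native services */
                else if (!strncmp(str, "native", 6))
                        pcie_ports_native = true;       /* the case at issue here */

                return 1;
        }
        __setup("pcie_ports=", pcie_port_setup);

The patch makes the AER/DPC firmware-first decision consult pcie_ports_native, so the flag finally means what the documentation says it means.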
Re: [PATCH] PCI: DPC: Clear AER status bits before disabling port containment
On 06/19/2018 04:57 PM, Bjorn Helgaas wrote: > On Wed, May 16, 2018 at 05:12:21PM -0600, Keith Busch wrote: >> On Wed, May 16, 2018 at 06:44:22PM -0400, Sinan Kaya wrote: >>> On 5/16/2018 5:33 PM, Alexandru Gagniuc wrote: AER status bits are sticky, and they survive system resets. Downstream devices are usually taken care of after re-enumerating the downstream busses, as the AER bits are cleared during probe(). However, nothing clears the bits of the port which contained the error. These sticky bits may lead some BIOSes to think that something bad happened, and print ominous messages on next boot. To prevent this, tidy up the AER status bits before releasing containment. Signed-off-by: Alexandru Gagniuc --- drivers/pci/pcie/dpc.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c index 8c57d607e603..bf82d6936556 100644 --- a/drivers/pci/pcie/dpc.c +++ b/drivers/pci/pcie/dpc.c @@ -112,6 +112,10 @@ static void dpc_work(struct work_struct *work) dpc->rp_pio_status = 0; } + /* DPC event made a mess of our AER status bits. Clean them up. */ + pci_cleanup_aer_error_status_regs(pdev); + /* TODO: Should we also use aer_print_error to log the event? */ + pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS, PCI_EXP_DPC_STATUS_TRIGGER | PCI_EXP_DPC_STATUS_INTERRUPT); >>> >>> I think Keith has a patch to fix this. It was under review at some point. >> >> Right, I do intend to follow up on this, but I've had some trouble >> finding time the last few weeks. Sorry about that, things will clear up >> for me shortly. > > I'll drop this (Alexandru's) patch for now, waiting for your update, Keith. I wonder if clearing AER status bits is mutually exclusive with refactoring other parts of DPC handling? Alex
Re: [PATCH v2] PCI: Check for PCIe downtraining conditions
On 06/01/2018 10:10 AM, Sinan Kaya wrote: > On 6/1/2018 11:06 AM, Alex G. wrote: >> On 06/01/2018 10:03 AM, Sinan Kaya wrote: >>> On 6/1/2018 11:01 AM, Alexandru Gagniuc wrote: >>>> + /* Multi-function PCIe share the same link/status. */ >>>> + if (PCI_FUNC(dev->devfn) != 0) >>>> + return; >>> >>> How about virtual functions? >> >> I have almost no clue about those. Is your concern that we might end up >> printing more than our fair share of link statuses? > > Yes, struct pci_dev also has a flag called is_virtfn, on which you should bail out > here too. Will be fixed in v3. Thanks, Alex >> Alex
Re: [PATCH v2] PCI: Check for PCIe downtraining conditions
On 06/01/2018 10:12 AM, Andy Shevchenko wrote: > On Fri, Jun 1, 2018 at 6:01 PM, Alexandru Gagniuc > wrote: >> PCIe downtraining happens when both the device and PCIe port are >> capable of a larger bus width or higher speed than negotiated. >> Downtraining might be indicative of other problems in the system, and >> identifying this from userspace is neither intuitive, nor straightforward. >> >> The easiest way to detect this is with pcie_print_link_status(), >> since the bottleneck is usually the link that is downtrained. It's not >> a perfect solution, but it works extremely well in most cases. > >> +static void pcie_check_upstream_link(struct pci_dev *dev) >> +{ > >> + > > This is redundant, but... Hmm. I thought it's not safe to call pci_pcie_type() on non-pcie devices. I see the pci_is_pcie() check followed by pci_pcie_type() is not uncommon. I didn't think it would be an issue, as long as it's consistent with the rest of the code. >> + if (!pci_is_pcie(dev)) >> + return; >> + >> + /* Look from the device up to avoid downstream ports with no >> devices. */ >> + if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) && >> + (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) && >> + (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM)) >> + return; > > ...wouldn't it be better > > int type = pci_pcie_type(dev); > > ? An extra local variable when the compiler knows how to optimize it out? To me, it doesn't seem like it would improve readability, but it would make the code longer. > But also possible, looking at existing code, > > static inline bool pci_is_pcie_type(dev, type) > { > return pci_is_pcie(dev) ? pci_pcie_type(dev) == type : false; > } return pci_is_pcie(dev) && (pci_pcie_type(dev) == type); seems cleaner. Although this sort of cleanup is beyond the scope of this change. Thanks, Alex
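For completeness, the helper shape being discussed, using the && form from the reply (a hypothetical cleanup; the thread agrees it is out of scope for this patch):

        static inline bool pci_is_pcie_type(struct pci_dev *dev, int type)
        {
                return pci_is_pcie(dev) && (pci_pcie_type(dev) == type);
        }

        /* The check from the patch would then read: */
        if (!pci_is_pcie_type(dev, PCI_EXP_TYPE_ENDPOINT) &&
            !pci_is_pcie_type(dev, PCI_EXP_TYPE_LEG_END) &&
            !pci_is_pcie_type(dev, PCI_EXP_TYPE_UPSTREAM))
                return;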
Re: [PATCH v2] PCI: Check for PCIe downtraining conditions
On 06/01/2018 10:03 AM, Sinan Kaya wrote: > On 6/1/2018 11:01 AM, Alexandru Gagniuc wrote: >> +/* Multi-function PCIe share the same link/status. */ >> +if (PCI_FUNC(dev->devfn) != 0) >> +return; > > How about virtual functions? I have almost no clue about those. Is your concern that we might end up printing more than our fair share of link statuses? Alex
Re: [PATCH] PCI: Check for PCIe downtraining conditions
On 05/31/2018 12:27 PM, Alex G. wrote: > On 05/31/2018 12:11 PM, Sinan Kaya wrote: >> On 5/31/2018 12:49 PM, Alex G. wrote: >>>> bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap); >>>> bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width, >>>> *parent*); >>> That's confusing. I'd expect _capable() and _available() to be >>> symmetrical. They either both look at one link only, or both go down to >>> the root port. Though it seems _capable() is link-local, and >>> _available() is down to root port. >>> >> >> As you know, link speed is a qualification of two devices' speed capability. >> Both speed and width parameters get negotiated by two devices during TS1 and >> TS2 >> ordered set exchange. >> >> You need to see what your link partner can support in available function() >> vs. >> what this device can do in bandwidth() function. > > I see. I'm not sure I can use pcie_print_link_status() without some > major refactoring. I need to look at the capability of the device and its > downstream port. There's no point complaining that an x16 device is > running at x4 when the port is only x4 capable. > > Let me think some more on this. I did some thinking, and I don't like using the same message for both point-to-point links and full-tree links. From a userspace point of view, having an arbitrary terminator in the tree parsing is confusing. So we're better off not modifying the behavior here. On the other hand, just printing the link status on probe might be good enough. If we're downtraining, but there's a slower link somewhere else, it won't matter much that we're downtrained. Just calling the unmodified pcie_print_link_status() will most of the time find the exact segment that's also downtrained. So let's do this in v2. Alex > Alex >
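Putting the v2 plan together from the snippets quoted in this thread (the type filter, the function 0/VF bail-out added later in review, and the unmodified print):

        static void pcie_check_upstream_link(struct pci_dev *dev)
        {
                if (!pci_is_pcie(dev))
                        return;

                /* Look from the device up to avoid downstream ports with no devices. */
                if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
                    (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
                    (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
                        return;

                /* Multi-function PCIe devices share the same link/status. */
                if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn)
                        return;

                pcie_print_link_status(dev);
        }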
Re: [PATCH] PCI: Check for PCIe downtraining conditions
On 05/31/2018 10:30 AM, Sinan Kaya wrote: > On 5/31/2018 11:05 AM, Alexandru Gagniuc wrote: >> +if (dev_cur_speed < max_link_speed) >> +pci_warn(dev, "PCIe downtrain: link speed is %s (%s capable)", >> + pcie_bus_speed_name(dev_cur_speed), >> + pcie_bus_speed_name(max_link_speed)); >> + > > Also this isn't quite correct. Target link speed is what the device tries to > train. A device can try to train to a much lower speed than the maximum on > purpose. > > It makes sense to print this if the speed that the platform wants via target link > speed is different from what is actually established, though. After seeing Gen 3 devices that train above the speed in the target link speed field, I talked to one of the spec writers today. There is some ambiguity with the target link speed field. In PCIe 4.0 they are clarifying it to state that this field is "permitted to have no effect". Alex
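A sketch of how one would compare the negotiated speed against the Target Link Speed field (assuming the PCI_EXP_LNKCTL2_TLS mask for the low four bits of Link Control 2; per the discussion the field is "permitted to have no effect", so treat any mismatch as informational):

        static void pcie_check_target_speed(struct pci_dev *dev)
        {
                u16 lnksta, lnkctl2;

                pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
                pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);

                /* Both fields use the same 4-bit link speed encoding. */
                if ((lnksta & PCI_EXP_LNKSTA_CLS) != (lnkctl2 & PCI_EXP_LNKCTL2_TLS))
                        pci_info(dev, "current link speed differs from target link speed\n");
        }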
Re: [PATCH] PCI: Check for PCIe downtraining conditions
On 05/31/2018 12:11 PM, Sinan Kaya wrote: > On 5/31/2018 12:49 PM, Alex G. wrote: >>> bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap); >>> bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width, >>> *parent*); >> That's confusing. I'd expect _capable() and _available() to be >> symmetrical. They either both look at one link only, or both go down to >> the root port. Though it seems _capable() is link-local, and >> _available() is down to root port. >> > > As you know, link speed is a qualification of two devices' speed capability. > Both speed and width parameters get negotiated by two devices during TS1 and > TS2 > ordered set exchange. > > You need to see what your link partner can support in available function() vs. > what this device can do in bandwidth() function. I see. I'm not sure I can use pcie_print_link_status() without some major refactoring. I need to look at the capability of the device and its downstream port. There's no point complaining that an x16 device is running at x4 when the port is only x4 capable. Let me think some more on this. Alex
Re: [PATCH] PCI: Check for PCIe downtraining conditions
On 05/31/2018 11:49 AM, Alex G. wrote:
>
>
> On 05/31/2018 11:13 AM, Sinan Kaya wrote:
>> On 5/31/2018 12:01 PM, Alex G. wrote:
>>>> PCI: Add pcie_print_link_status() to log link speed and whether it's
>>>> limited
>>> This one, I have, but it's not what I need. This looks at the available
>>> bandwidth from root port to endpoint, whereas I'm only interested in
>>> downtraining between endpoint and upstream port.
>>
>> I see what you are saying.
>>
>> With a little bit of effort, you can reuse the same code.
>>
>> Here is an attempt.
>>
>> You can probably extend pcie_bandwidth_available() to take an optional
>> parent bridge device for your own use case and terminate the loop around
>> here.
>>
>> https://elixir.bootlin.com/linux/v4.17-rc7/source/drivers/pci/pci.c#L5182
>>
>> Then, you can use the existing code to achieve what you are looking for
>> via pcie_print_link_status() by adding an optional parent parameter.
>>
>> bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
>> bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width,
>> *parent*);
>
> That's confusing.

"confusing" refers to the way the code currently works. It doesn't refer to
your proposal.

Alex

> I'd expect _capable() and _available() to be
> symmetrical. They either both look at one link only, or both go down to
> the root port. Though it seems _capable() is link-local, and
> _available() is down to root port.
>
>> If the parent parameter is NULL, the code can walk all the way to root
>> as it is doing today.
>> If it is not, then it will terminate the loop on the first iteration.
Re: [PATCH] PCI: Check for PCIe downtraining conditions
On 05/31/2018 11:13 AM, Sinan Kaya wrote:
> On 5/31/2018 12:01 PM, Alex G. wrote:
>>> PCI: Add pcie_print_link_status() to log link speed and whether it's
>>> limited
>> This one, I have, but it's not what I need. This looks at the available
>> bandwidth from root port to endpoint, whereas I'm only interested in
>> downtraining between endpoint and upstream port.
>
> I see what you are saying.
>
> With a little bit of effort, you can reuse the same code.
>
> Here is an attempt.
>
> You can probably extend pcie_bandwidth_available() to take an optional
> parent bridge device for your own use case and terminate the loop around
> here.
>
> https://elixir.bootlin.com/linux/v4.17-rc7/source/drivers/pci/pci.c#L5182
>
> Then, you can use the existing code to achieve what you are looking for via
> pcie_print_link_status() by adding an optional parent parameter.
>
> bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
> bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width,
> *parent*);

That's confusing. I'd expect _capable() and _available() to be
symmetrical. They either both look at one link only, or both go down to
the root port. Though it seems _capable() is link-local, and
_available() is down to root port.

> If the parent parameter is NULL, the code can walk all the way to root as
> it is doing today.
> If it is not, then it will terminate the loop on the first iteration.
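For concreteness, a rough, untested sketch of the optional parent parameter
described above, modeled on the v4.17 implementation of
pcie_bandwidth_available(); only the loop termination changes:

u32 pcie_bandwidth_available(struct pci_dev *dev,
                 struct pci_dev **limiting_dev,
                 enum pci_bus_speed *speed,
                 enum pcie_link_width *width,
                 struct pci_dev *parent)
{
    enum pci_bus_speed next_speed;
    enum pcie_link_width next_width;
    u32 bw = 0, next_bw;
    u16 lnksta;

    if (speed)
        *speed = PCI_SPEED_UNKNOWN;
    if (width)
        *width = PCIE_LNK_WIDTH_UNKNOWN;

    while (dev) {
        pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);

        next_speed = pcie_link_speed[lnksta & PCI_EXP_LNKSTA_CLS];
        next_width = (lnksta & PCI_EXP_LNKSTA_NLW) >>
                 PCI_EXP_LNKSTA_NLW_SHIFT;
        next_bw = next_width * PCIE_SPEED2MBS_ENC(next_speed);

        /* This link is the new bottleneck; remember it. */
        if (!bw || next_bw <= bw) {
            bw = next_bw;
            if (limiting_dev)
                *limiting_dev = dev;
            if (speed)
                *speed = next_speed;
            if (width)
                *width = next_width;
        }

        dev = pci_upstream_bridge(dev);

        /* New: stop once the given bridge is reached, so only the
         * link(s) below it count. NULL keeps today's walk to the
         * root port. */
        if (parent && dev == parent)
            break;
    }

    return bw;
}

With parent set to the device's upstream bridge, only the endpoint's own
link is evaluated, which is the single-iteration case mentioned above.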
Re: [PATCH] PCI: Check for PCIe downtraining conditions
On 05/31/2018 10:54 AM, Sinan Kaya wrote:
> On 5/31/2018 11:46 AM, Alex G. wrote:
>>> https://lkml.org/lkml/2018/3/30/553
>> Oh, pcie_get_speed_cap()/pcie_get_width_cap() seems to handle the
>> capability. Not seeing one for status and speed name.
>>
>>> are you working on linux-next?
>> v4.17-rc7
>
> I think everything you need is in the series. I don't know which linux
> tree this landed in, or if it landed at all. Probably it is shipping in
> 4.18. Need some help from Bjorn on where to locate these.
>
> Fri, 30 Mar 2018 16:04:40 -0500
>
> Bjorn Helgaas (6):
>   bnx2x: Report PCIe link properties with pcie_print_link_status()
>   bnxt_en: Report PCIe link properties with pcie_print_link_status()
>   cxgb4: Report PCIe link properties with pcie_print_link_status()
>   fm10k: Report PCIe link properties with pcie_print_link_status()
>   ixgbe: Report PCIe link properties with pcie_print_link_status()
>   PCI: Remove unused pcie_get_minimum_link()

I remember seeing some of these in my tree.

> Tal Gilboa (8):
>   PCI: Add pcie_get_speed_cap() to find max supported link speed
>   PCI: Add pcie_get_width_cap() to find max supported link width
>   PCI: Add pcie_bandwidth_capable() to compute max supported link
>     bandwidth
>   PCI: Add pcie_bandwidth_available() to compute bandwidth available to
>     device

I definitely have these.

>   PCI: Add pcie_print_link_status() to log link speed and whether it's
>     limited

This one, I have, but it's not what I need. This looks at the available
bandwidth from root port to endpoint, whereas I'm only interested in
downtraining between endpoint and upstream port.

Alex
Re: [PATCH] PCI: Check for PCIe downtraining conditions
On 05/31/2018 10:38 AM, Sinan Kaya wrote:
> On 5/31/2018 11:29 AM, alex_gagn...@dellteam.com wrote:
>> On 5/31/2018 10:28 AM, Sinan Kaya wrote:
>>> On 5/31/2018 11:05 AM, Alexandru Gagniuc wrote:
>>>> +static void pcie_max_link_cap(struct pci_dev *dev, enum pci_bus_speed *speed,
>>>> +                  enum pcie_link_width *width)
>>>> +{
>>>> +    uint32_t lnkcap;
>>>> +
>>>> +    pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
>>>> +
>>>> +    *speed = pcie_link_speed[lnkcap & PCI_EXP_LNKCAP_SLS];
>>>> +    *width = (lnkcap & PCI_EXP_LNKCAP_MLW) >> PCI_EXP_LNKCAP_MLW_SHIFT;
>>>> +}
>>>> +
>>>> +static void pcie_cur_link_sta(struct pci_dev *dev, enum pci_bus_speed *speed,
>>>> +                  enum pcie_link_width *width)
>>>> +{
>>>> +    uint16_t lnksta;
>>>> +
>>>> +    pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
>>>> +    *speed = pcie_link_speed[lnksta & PCI_EXP_LNKSTA_CLS];
>>>> +    *width = (lnksta & PCI_EXP_LNKSTA_NLW) >> PCI_EXP_LNKSTA_NLW_SHIFT;
>>>> +}
>>>> +
>>>> +static const char *pcie_bus_speed_name(enum pci_bus_speed speed)
>>>> +{
>>>> +    switch (speed) {
>>>> +    case PCIE_SPEED_2_5GT:
>>>> +        return "2.5 GT/s";
>>>> +    case PCIE_SPEED_5_0GT:
>>>> +        return "5.0 GT/s";
>>>> +    case PCIE_SPEED_8_0GT:
>>>> +        return "8.0 GT/s";
>>>> +    default:
>>>> +        return "unknown";
>>>> +    }
>>>> +}
>>>
>>> I thought Bjorn added some functions to retrieve this now.
>>
>> Hmm. I couldn't find them.
>
> https://lkml.org/lkml/2018/3/30/553

Oh, pcie_get_speed_cap()/pcie_get_width_cap() seems to handle the
capability. Not seeing one for status and speed name.

> are you working on linux-next?

v4.17-rc7
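For context, a rough and untested sketch of how the helpers quoted above fit
together in the check being discussed (the speed warning matches the hunk
quoted at the top of the thread; the width check is its counterpart):

/* Sketch only, untested: combine the quoted helpers into the actual
 * downtraining check. Relies on the PCIE_SPEED_* enum values being
 * ordered so that a numeric comparison is meaningful. */
static void pcie_check_downtrain(struct pci_dev *dev)
{
    enum pci_bus_speed cur_speed, max_speed;
    enum pcie_link_width cur_width, max_width;

    pcie_max_link_cap(dev, &max_speed, &max_width);
    pcie_cur_link_sta(dev, &cur_speed, &cur_width);

    if (cur_speed < max_speed)
        pci_warn(dev, "PCIe downtrain: link speed is %s (%s capable)\n",
             pcie_bus_speed_name(cur_speed),
             pcie_bus_speed_name(max_speed));

    if (cur_width < max_width)
        pci_warn(dev, "PCIe downtrain: link width is x%d (x%d capable)\n",
             cur_width, max_width);
}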
Re: [PATCH 1/5] PCI/AER: Define and allocate aer_stats structure for AER capable devices
On 05/23/2018 09:32 AM, Jes Sorensen wrote:
> On 05/23/2018 10:26 AM, Matthew Wilcox wrote:
>> On Wed, May 23, 2018 at 10:20:10AM -0400, Jes Sorensen wrote:
>>>> +++ b/drivers/pci/pcie/aer/aerdrv_stats.c
>>>> @@ -0,0 +1,64 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>
>>> Fix the formatting please - that gross // gibberish doesn't belong there.
>>
>> Sorry, Jes. The Chief Penguin has Spoken, and that's the preferred
>> syntax:
>>
>> 2. Style:
>>
>>    The SPDX license identifier is added in form of a comment. The comment
>>    style depends on the file type::
>>
>>       C source:  // SPDX-License-Identifier: <SPDX License Expression>
>>
>> (you can dig up the discussion around this on the mailing list if you
>> like. Linus actually thinks that C++ single-line comments are one of
>> the few things that language got right)
>
> Well I'll agree to disagree with Linus on this one. It's ugly as fsck
> and allows for ambiguous statements in the code.

You misspelled "fuck".

Alex
Re: [PATCH 1/5] PCI/AER: Define and allocate aer_stats structure for AER capable devices
On 05/23/2018 09:20 AM, Jes Sorensen wrote:
> On 05/22/2018 06:28 PM, Rajat Jain wrote:
>> new file mode 100644
>> index 000000000000..b9f251992209
>> --- /dev/null
>> +++ b/drivers/pci/pcie/aer/aerdrv_stats.c
>> @@ -0,0 +1,64 @@
>> +// SPDX-License-Identifier: GPL-2.0
>
> Fix the formatting please - that gross // gibberish doesn't belong there.

Deep breath in. Deep breath out.

git grep SPDX

Although I don't like it, this format is already too common.

Cheers,
Alex
Re: [PATCH 5/5] Documentation/PCI: Add details of PCI AER statistics
On 05/22/2018 05:28 PM, Rajat Jain wrote:
> Add the PCI AER statistics details to
> Documentation/PCI/pcieaer-howto.txt
>
> Signed-off-by: Rajat Jain
> ---
>  Documentation/PCI/pcieaer-howto.txt | 35 +
>  1 file changed, 35 insertions(+)
>
> diff --git a/Documentation/PCI/pcieaer-howto.txt
> b/Documentation/PCI/pcieaer-howto.txt
> index acd06bb8..86ee9f9ff5e1 100644
> --- a/Documentation/PCI/pcieaer-howto.txt
> +++ b/Documentation/PCI/pcieaer-howto.txt
> @@ -73,6 +73,41 @@ In the example, 'Requester ID' means the ID of the device
> who sends
>  the error message to root port. Pls. refer to pci express specs for
>  other fields.
>
> +2.4 AER statistics
> +
> +When AER messages are captured, the statistics are exposed via the following
> +sysfs attributes under the "aer_stats" folder for the device:
> +
> +2.4.1 Device sysfs Attributes
> +
> +These attributes show up under all the devices that are AER capable. These
> +indicate the errors "as seen by the device". Note that this may mean that if
> +an end point is causing problems, the AER counters may increment at its link
> +partner (e.g. root port) because the errors will be "seen" by the link partner
> +and not the problematic end point itself (which may report all counters
> +as 0 as it never saw any problems).

I was afraid of that. Is there a way to look at the requester ID to log AER
errors to the correct device?

Alex
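P.S. The Root Error Source register does carry the Requester ID of the agent
that sent the error message, so something like this rough, untested sketch
could map an error back to the device that reported it (the helper name is
made up):

/* Sketch only, untested: resolve an AER Requester ID to a pci_dev.
 * The ID encodes bus[15:8] and devfn[7:0]. Note that
 * pci_get_domain_bus_and_slot() returns a referenced device which the
 * caller must release with pci_dev_put(). */
static struct pci_dev *aer_source_device(struct pci_dev *root, u16 req_id)
{
    int domain = pci_domain_nr(root->bus);

    return pci_get_domain_bus_and_slot(domain, PCI_BUS_NUM(req_id),
                       req_id & 0xff);
}

That still only attributes the error to the reporter, of course, not
necessarily to the device that caused it.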
Re: [PATCH 2/5] PCI/AER: Add sysfs stats for AER capable devices
On 05/22/2018 05:28 PM, Rajat Jain wrote:
> Add the following AER sysfs stats to represent the counters for each
> kind of error as seen by the device:
>
> dev_total_cor_errs
> dev_total_fatal_errs
> dev_total_nonfatal_errs
>
> Signed-off-by: Rajat Jain
> ---
>  drivers/pci/pci-sysfs.c                | 3 ++
>  drivers/pci/pci.h                      | 4 +-
>  drivers/pci/pcie/aer/aerdrv.h          | 1 +
>  drivers/pci/pcie/aer/aerdrv_errprint.c | 1 +
>  drivers/pci/pcie/aer/aerdrv_stats.c    | 72 ++
>  5 files changed, 80 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index 366d93af051d..730f985a3dc9 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -1743,6 +1743,9 @@ static const struct attribute_group
> *pci_dev_attr_groups[] = {
>  #endif
>      &pci_bridge_attr_group,
>      &pcie_dev_attr_group,
> +#ifdef CONFIG_PCIEAER
> +    &aer_stats_attr_group,
> +#endif
>      NULL,
>  };

So if the device is removed as part of recovery, then these get reset,
right? So if the device fails intermittently, these counters would keep
getting reset. Is this the intent?

(snip)

>  /**
>   * pci_match_one_device - Tell if a PCI device structure has a matching
> diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
> index d8b9fba536ed..b5d5ad6f2c03 100644
> --- a/drivers/pci/pcie/aer/aerdrv.h
> +++ b/drivers/pci/pcie/aer/aerdrv.h
> @@ -87,6 +87,7 @@ void aer_print_port_info(struct pci_dev *dev, struct
> aer_err_info *info);
>  irqreturn_t aer_irq(int irq, void *context);
>  int pci_aer_stats_init(struct pci_dev *pdev);
>  void pci_aer_stats_exit(struct pci_dev *pdev);
> +void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info);
>
>  #ifdef CONFIG_ACPI_APEI
>  int pcie_aer_get_firmware_first(struct pci_dev *pci_dev);
> diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c
> b/drivers/pci/pcie/aer/aerdrv_errprint.c
> index 21ca5e1b0ded..5e8b98deda08 100644
> --- a/drivers/pci/pcie/aer/aerdrv_errprint.c
> +++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
> @@ -155,6 +155,7 @@ static void __aer_print_error(struct pci_dev *dev,
>          pci_err(dev, "   [%2d] Unknown Error Bit%s\n",
>              i, info->first_error == i ? " (First)" : "");
>      }
> +    pci_dev_aer_stats_incr(dev, info);

What about AER errors that are contained by DPC?

Alex
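As an aside, the increment helper itself isn't shown in these hunks. A
rough, untested guess at its shape, with the struct field names inferred
from the sysfs attribute names (all of them assumptions):

/* Sketch only, untested; aer_stats and its fields are guesses based on
 * the attribute names in this patch. */
void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info)
{
    struct aer_stats *stats = pdev->aer_stats;    /* hypothetical field */

    if (!stats)
        return;

    switch (info->severity) {
    case AER_CORRECTABLE:
        stats->dev_total_cor_errs++;
        break;
    case AER_NONFATAL:
        stats->dev_total_nonfatal_errs++;
        break;
    case AER_FATAL:
        stats->dev_total_fatal_errs++;
        break;
    default:
        break;
    }
}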
Re: [PATCH v6 1/2] acpi: apei: Rename ghes_severity() to ghes_cper_severity()
On 05/22/2018 01:45 PM, Luck, Tony wrote:
> On Tue, May 22, 2018 at 01:19:34PM -0500, Alex G. wrote:
>> Firmware started passing "fatal" GHES headers with the explicit intent of
>> crashing an OS. At the same time, we've learnt how to handle these errors
>> in a number of cases. With DPC (coming soon to firmware-first) the error
>> is contained, and a non-issue.
>
> Perhaps DPC is the change that you need to emphasize as to
> why things are different now, so we can change the default
> Linux behavior.
>
> With the h/w guaranteeing that corrupt data is contained, we
> should be safe to disregard BIOS indications of "fatal" problems
> that could be anything and might show up in unknown ways some
> time later if we keep running.

Sure. DPC is much harder to contest as a reason. However, the AER path
benefits as well from this change in behavior. I'm certain there are other
classes of errors that benefit as well from the change, though I haven't
had the time or the inclination to look for them.

Alex
Re: [PATCH v6 1/2] acpi: apei: Rename ghes_severity() to ghes_cper_severity()
On 05/22/2018 01:13 PM, Rafael J. Wysocki wrote:

(snip)

>> Of course, you are free to have a differing opinion and I don't have to
>> convince you about my point.
>
> You need to convince me about your point to get the patch in through my
> tree, which you haven't done so far.

My point is that crossing your arms and saying "impress me" doesn't give me
a good idea of how you'd like to see the problem resolved.

Alex
Re: [PATCH v6 1/2] acpi: apei: Rename ghes_severity() to ghes_cper_severity()
On 05/22/2018 01:10 PM, Rafael J. Wysocki wrote:
> On Tue, May 22, 2018 at 7:57 PM, Luck, Tony wrote:
>> On Tue, May 22, 2018 at 04:54:26PM +0200, Borislav Petkov wrote:
>>> I especially don't want to have the case where a PCIe error is *really*
>>> fatal and then we noodle in some handlers debating about the severity
>>> because it got marked as recoverable intermittently and end up causing
>>> data corruption on the storage device. Here's a real no-no for ya.
>>
>> All that we have is a message from the BIOS that this is a "fatal" error.
>> When did we start trusting the BIOS to give us accurate information?
>
> Some time ago, actually.
>
> This is about changing the existing behavior which has been to treat
> "fatal" errors reported by the BIOS as good enough reasons for a panic
> for quite a while AFAICS.

Yes, you are correct. I'd actually like to go deeper, and remove the policy
to panic() on fatal errors. Now whether we blacklist or whitelist which
errors can go through is up for debate, but the current policy seems broken.

>> PCIe fatal means that the link or the device is broken.
>
> And that may really mean that the component in question is on fire. We
> just don't know.

Should there be a physical fire, we have much bigger issues. At the same
time, we could retrain the link, call the driver, and release freon gas to
put out the fire. That's something we don't currently have the option to do.

>> But that seems a poor reason to take down a large server that may have
>> dozens of devices (some of them set up specifically to handle errors ...
>> e.g. mirrored disks on separate controllers, or NIC devices that have
>> been "bonded" together).
>>
>> So, as long as the action for a "fatal" error is to mark a link down and
>> offline the device, that seems a pretty reasonable course of action. The
>> argument gets a lot more marginal if you simply reset the link and
>> re-enable the device to "fix" it. That might be enough, but I don't
>> think the OS has enough data to make the call.
>
> Again, that's about changing the existing behavior, or the existing policy
> even. What exactly has changed to make us consider this now?

Firmware started passing "fatal" GHES headers with the explicit intent of
crashing an OS. At the same time, we've learnt how to handle these errors in
a number of cases. With DPC (coming soon to firmware-first) the error is
contained, and a non-issue.

As specs and hardware implementations evolve, we have to adapt. I'm here
until November, and one of my goals is to involve linux upstream in the
development of these features so that when the hardware hits the market,
we're ready. That does mean we have to drop some of the silly things we're
doing.

Alex

> Thanks,
> Rafael
Re: [PATCH v6 1/2] acpi: apei: Rename ghes_severity() to ghes_cper_severity()
On 05/22/2018 12:57 PM, Luck, Tony wrote:
> On Tue, May 22, 2018 at 04:54:26PM +0200, Borislav Petkov wrote:
>> I especially don't want to have the case where a PCIe error is *really*
>> fatal and then we noodle in some handlers debating about the severity
>> because it got marked as recoverable intermittently and end up causing
>> data corruption on the storage device. Here's a real no-no for ya.
>
> All that we have is a message from the BIOS that this is a "fatal" error.
> When did we start trusting the BIOS to give us accurate information?

When we merged ACPI handling.

> PCIe fatal means that the link or the device is broken. But that seems a
> poor reason to take down a large server that may have dozens of devices
> (some of them set up specifically to handle errors ... e.g. mirrored
> disks on separate controllers, or NIC devices that have been "bonded"
> together).
>
> So, as long as the action for a "fatal" error is to mark a link down and
> offline the device, that seems a pretty reasonable course of action. The
> argument gets a lot more marginal if you simply reset the link and
> re-enable the device to "fix" it. That might be enough, but I don't
> think the OS has enough data to make the call.

I'm not 100% satisfied with how the AER handler works, and how certain
drivers (nvme!!!) interface with AER handling. But this is an argument that
belongs in PCI code, and a fight I will fight with Bjorn and Keith. The
issue we're having with Borislav and Rafael's estate is that we can't make
it to PCI land.

I'm seeing here the same fight that I saw with firmware vs OS, where fw
wants to have control, and the OS wants to have control. I saw the same
with the ME/CSE/CSME team vs the ChromeOS team, where the ME team did
everything possible to make sure only they can access the boot vector and
boot the processor, and the ChromeOS team couldn't use this approach
because they wanted their own root of trust. I've seen this in other places
as well, though confidentiality agreements prevent me from talking about
it. It's the issue of control, and it's a fact of life.

Borislav and Rafael don't want to relinquish control until they can be 100%
certain that going further will result in 100% recovery. That is a goal I
aspire to as well, but an unachievable ideal nonetheless. I thought the
best compromise would be to be as close as possible to native handling.
That is, if AER can't recover, we don't recover the device, but the machine
keeps running. I think there's some deeper history to GHES handling, which
I didn't take into consideration.

The fight is to convince the appropriate parties to share the
responsibility in a way which doesn't kill the machine. We still have a
ways to go until we get there.

Alex

> -Tony
>
> P.S. I deliberately put "fatal" in quotes above because, to quote "The
> Princess Bride" -- "that word, I do not think it means what you think it
> means". :-)
Re: [PATCH v6 1/2] acpi: apei: Rename ghes_severity() to ghes_cper_severity()
On 05/22/2018 09:54 AM, Borislav Petkov wrote:
> On Tue, May 22, 2018 at 09:39:15AM -0500, Alex G. wrote:
>> No, the problem is with the current approach, not with mine. The problem
>> is trying to handle the error outside of the existing handler. That's a
>> no-no, IMO.
>
> Let me save you some time: until you come up with a proper solution for
> *all* PCIe errors so that the kernel can correctly decide what to do for
> each error based on its actual severity, consider this NAKed.

I do have a proper solution for _all_ PCIe errors. In fact, we discussed
several valid approaches already.

> I don't care about outside or inside of the handler

I do. I have a handler that can handle (no pun intended) errors. I want to
use the same code path in native and GHES cases. If I allow ghes.c to take
different decisions than what aer_do_recovery() would, I've failed.

> - this thing needs to be done properly

Exactly!

> and not just to serve your particular use case of abrupt removal of
> devices causing PCIe errors, and punish the rest.

I think you're confused about what I'm actually trying to do. Or maybe
you're confused about how PCIe errors work. That's understandable. PCIe
uses the term "fatal" for errors that may make the link unusable, and which
may require a link reset, whereas in most other specs "fatal" means "on
fire". I understand your confusion, and I hope I cleared it up.

You're trying to make the case that surprise removal is my only concern and
use case, because that's the example that I gave. It makes your argument
stronger, but it's wrong. You don't know our test setup, and all the things
I'm testing for, and whenever I try to tell you, you fall back to the
'surprise removal' example. I don't know why you'd think Dell would pay me
to work on this if I were to allow things like silent data corruption to
creep in. This isn't a traditional company from Redmond, Washington.

> I especially don't want to have the case where a PCIe error is *really*
> fatal and then we noodle in some handlers debating about the severity
> because it got marked as recoverable intermittently and end up causing
> data corruption on the storage device. Here's a real no-no for ya.

I especially don't want a kernel maintainer who hasn't even read the
recovery handler (let alone the spec around which the handler was written)
tell me how the recovery handler works and what it's supposed to do (see, I
can be an ass).

PCIe errors really are fatal. They might need to unload the driver and
remove the device. But somebody set the questionable policy that
"fatal"=="panic", and that is completely inappropriate for a larger class
of errors -- PCIe happens to be the easiest example to pick on. And even
you realize that the argument that a panic() will somehow prevent data
corruption is complete noodle sauce.

Alex