Re: [PATCH] PCI: Add link_change error handler and vfio-pci user

2019-04-24 Thread Alex_Gagniuc
On 4/23/2019 5:42 PM, Alex Williamson wrote:
> The PCIe bandwidth notification service generates logging any time a
> link changes speed or width to a state that is considered downgraded.
> Unfortunately, it cannot differentiate signal integrity related link
> changes from those intentionally initiated by an endpoint driver,
> including drivers that may live in userspace or VMs when making use
> of vfio-pci.  Therefore, allow the driver to have a say in whether
> the link is indeed downgraded and worth noting in the log, or if the
> change is perhaps intentional.
> 
> For vfio-pci, we don't know the intentions of the user/guest driver
> either, but we do know that GPU drivers in guests actively manage
> the link state and therefore trigger the bandwidth notification for
> what appear to be entirely intentional link changes.
> 
> Fixes: e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth notification")
> Link: 
> https://lore.kernel.org/linux-pci/155597243666.19387.1205950870601742062.st...@gimli.home/T/#u
> Signed-off-by: Alex Williamson 
> ---
> 
> Changing to pci_dbg() logging is not super usable, so let's try the
> previous idea of letting the driver handle link change events as they
> see fit.  Ideally this might be two patches, but for easier handling,
> folding the pci and vfio-pci bits together.  Comments?  Thanks,

I think this callback opens up a can of worms where drivers can ad-hoc 
silence what would otherwise be indicators of problems. But I don't 
have to like it to review it :).

>   drivers/pci/probe.c |   13 +
>   drivers/vfio/pci/vfio_pci.c |   10 ++
>   include/linux/pci.h |3 +++
>   3 files changed, 26 insertions(+)
> 
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 7e12d0163863..233cd4b5b6e8 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2403,6 +2403,19 @@ void pcie_report_downtraining(struct pci_dev *dev)

I don't think you want to change pcie_report_downtraining(). The name 
advertises that it "reports" something, but then it goes around and also 
calls a notification callback. This is also used during probe, and you've 
now killed your chance to notice that you've booted with a degraded link.
If what you want to do is silence the bandwidth notification, you want 
to modify the threaded interrupt that calls this.
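
For illustration, roughly what I have in mind, going from memory on how 
the threaded handler in bw_notification.c is shaped (so treat the handler 
name and the pcie_device context as approximate, not the actual code):

#include <linux/interrupt.h>
#include <linux/pci.h>

/* pcie_report_downtraining() is the PCI-core internal from drivers/pci/pci.h */
static irqreturn_t pcie_bw_notification_handler(int irq, void *context)
{
	struct pcie_device *srv = context;
	struct pci_dev *port = srv->port;
	struct pci_dev *dev;

	down_read(&pci_bus_sem);
	list_for_each_entry(dev, &port->subordinate->devices, bus_list) {
		device_lock(&dev->dev);
		if (dev->driver && dev->driver->err_handler &&
		    dev->driver->err_handler->link_change)
			/* Driver gets a say; skip the generic log message */
			dev->driver->err_handler->link_change(dev);
		else
			pcie_report_downtraining(dev);
		device_unlock(&dev->dev);
	}
	up_read(&pci_bus_sem);

	return IRQ_HANDLED;
}

That way pcie_report_downtraining() keeps its boot-time meaning, and the 
driver override only applies to the notification path.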

>   if (PCI_FUNC(dev->devfn) != 0 || dev->is_virtfn)
>   return;
>   
> + /*
> +  * If driver handles link_change event, defer to driver.  PCIe drivers
> +  * can call pcie_print_link_status() to print current link info.
> +  */
> +	device_lock(&dev->dev);
> +	if (dev->driver && dev->driver->err_handler &&
> +	    dev->driver->err_handler->link_change) {
> +		dev->driver->err_handler->link_change(dev);
> +		device_unlock(&dev->dev);
> +		return;
> +	}
> +	device_unlock(&dev->dev);

Can we write this such that there is a single lock()/unlock() pair?
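
For example, using a local flag so the function has exactly one pair 
(just a sketch of the shape, not a tested patch):

	bool handled = false;

	device_lock(&dev->dev);
	if (dev->driver && dev->driver->err_handler &&
	    dev->driver->err_handler->link_change) {
		dev->driver->err_handler->link_change(dev);
		handled = true;
	}
	device_unlock(&dev->dev);

	if (handled)
		return;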

> +
>   /* Print link status only if the device is constrained by the fabric */
>   __pcie_print_link_status(dev, false);
>   }
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index cab71da46f4a..c9ffc0ccabb3 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1418,8 +1418,18 @@ static pci_ers_result_t 
> vfio_pci_aer_err_detected(struct pci_dev *pdev,
>   return PCI_ERS_RESULT_CAN_RECOVER;
>   }
>   
> +/*
> + * Ignore link change notification, we can't differentiate signal related
> + * link changes from user driver power management type operations, so do
> + * nothing.  Potentially this could be routed out to the user.
> + */
> +static void vfio_pci_link_change(struct pci_dev *pdev)
> +{
> +}
> +
>   static const struct pci_error_handlers vfio_err_handlers = {
>   .error_detected = vfio_pci_aer_err_detected,
> + .link_change = vfio_pci_link_change,
>   };
>   
>   static struct pci_driver vfio_pci_driver = {
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 27854731afc4..e9194bc03f9e 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -763,6 +763,9 @@ struct pci_error_handlers {
>   
>   /* Device driver may resume normal operations */
>   void (*resume)(struct pci_dev *dev);
> +
> + /* PCIe link change notification */
> + void (*link_change)(struct pci_dev *dev);
>   };
>   
>   
> 
> 




Re: [PATCH v2] PCI: pciehp: Report degraded links via link bandwidth notification

2019-02-27 Thread Alex_Gagniuc
On 2/24/19 8:29 PM, Lukas Wunner wrote:
> On Fri, Dec 07, 2018 at 12:20:00PM -0600, Alexandru Gagniuc wrote:
> 
> 
>> Q: Why is this unconditionally compiled in?
>> A: The symmetrical check in pci probe() is also always compiled in.
> 
> Hm, it looks like the convention is to provide a separate Kconfig entry
> for each port service.

Does the convention still make sense in light of the symmetry reason?

>> Q: Why call module_init() instead of adding a call in pcie_init_services() ?
>> A: A call in pcie_init_services() also requires a prototype in portdrv.h, a
>> non-static implementation in bw_notification.c. Using module_init() is
>> functionally equivalent, and takes less code.
> 
> Commit c29de84149ab ("PCI: portdrv: Initialize service drivers directly")
> moved away from module_init() on purpose, apparently to fix a race
> condition.

*GROWL*


> What if the link is retrained at the same speed/width?  Intuitively
> I'd compare the speed in the Link Status Register to what is cached
> in the cur_bus_speed member of struct pci_bus and print a message
> only if the speed has changed.  (Don't we need to cache the width as
> well?)

There are two mechanisms to bring a degraded link back to full BW.
   1. Secondary bus reset, which results in the device being torn down 
along with our cached speed value.
   2. Set the PCI_EXP_LNKCTL_LD (Link Disable) bit and clear it. We do 
that as part of the pciehp teardown path. We'd lose our cached value 
just like in the first case.
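
For reference, the second mechanism boils down to something like this 
(sketch only -- the helper name is made up and the settle delay is 
arbitrary):

#include <linux/delay.h>
#include <linux/pci.h>

/* Bounce the link on the port above the device to force a retrain */
static void pcie_bounce_link(struct pci_dev *port)
{
	pcie_capability_set_word(port, PCI_EXP_LNKCTL, PCI_EXP_LNKCTL_LD);
	msleep(10);		/* arbitrary settle time */
	pcie_capability_clear_word(port, PCI_EXP_LNKCTL, PCI_EXP_LNKCTL_LD);
}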

>> +static irqreturn_t pcie_bw_notification_irq(int irq, void *context)
>> [...]
> 
> You need to hold pci_bus_sem [...]
> This may sleep, so request the IRQ with request_threaded_irq() [...]

Good catch! Thanks!
All other issues you pointed out should be resolved in the next version.

Alex


Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

2019-02-27 Thread Alex_Gagniuc
On 2/27/19 11:51 AM, Keith Busch wrote:
> I can't tell where you're going with this. It doesn't sound like you're
> talking about hotplug anymore, at least.

We're trying to fix an issue related to hotplug. However, the proposed 
fixes may have unintended consequences and side-effects. I want to make 
sure that we're aware of those potential problems.

Alex




Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

2019-02-27 Thread Alex_Gagniuc
On 2/26/19 7:02 PM, Linus Torvalds wrote:
> On Tue, Feb 26, 2019 at 2:37 PM  wrote:
>>
>> Then nobody gets the (error) message. You can go a bit further and try
>> 'pcie_ports=native'. Again, nobody gets the memo. ):
> 
> So? The error was bogus to begin with. Why would we care?

Of course nobody cares about that. We care about actual errors that we 
now know we won't be notified of. Imagine if we didn't get the memo that 
a piece of data is corrupt, and imagine the reaction of RAS folk.

And I know the counter to that is a panic() is much more likely to cause 
data corruption, and we're trading one piece of crap for an even 
stinkier one. Whatever we end up doing, we have to do better than 
silence errors and pretend nothing happened.


> Yes, yes, PCI bridges have the ability to return errors in accesses to
> non-existent devices. But that was always bogus, and is never useful.
> The whole "you get an interrupt or NMI on a bad access" is simply a
> horribly broken model. It's not useful.
> 
> We already have long depended on hotplug drivers noticing the "oh, I'm
> getting all-ff returns, the device may be gone". It's usually trivial,
> and works a whole lot better.

And that's been working great, hasn't it? I think you're thinking 
strictly about hotplug. There are other situations where things are all 
F'd, but the hardware isn't sending all F's. (example: ECRC errors)


> It's not an error. Trying to force it to be an NMI or SCI or machine
> check is bogus. It causes horrendous pain, because asynchronous
> reporting doesn't work reliably anyway, and *synchronous* reporting is
> impossible to sanely handle without crazy problems.
> 
> So the only sane model for hotplug devices is "IO still works, and
> returns all ones". Maybe with an async one-time and *recoverable*
> machine check or other reporting the access after the fact.

Exactly!!! A notification (not calling it an 'error') that something 
unusual has happened is good. Treating these things as errors is such an 
obvious mistake that even a caveman wouldn't make it.
In a world with FFS, we don't always get to have that model. Oh, FFS!


> Anything else is simply broken. It would be broken even if firmware
> wasn't involved, but obviously firmware people tend to often make a
> bad situation even worse.

Linus, be nice to firmware people. I've met a few, and I can vouch that 
they're very kind and nice. They're also very scared, especially when OS 
people want to ask them a few questions.

I think FFS should get out of the way when the OS advertises that it's 
capable of handling XYZ. There are some good arguments why this hasn't 
happened, but I won't get into details. I do think it's unlikely that 
machines will be moving back to an OS-controlled model.

And Linus, keep in mind that when these machines were developed, OSes 
couldn't handle recovery properly. None of this was ever an issue. It's 
our fault that we've changed the OS after the machines were already on 
the market.

Alex


Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

2019-02-26 Thread Alex_Gagniuc
On 2/25/19 9:55 AM, Keith Busch wrote:
> On Sun, Feb 24, 2019 at 03:27:09PM -0800, alex_gagn...@dellteam.com wrote:
>> [   57.680494] {1}[Hardware Error]: Hardware error from APEI Generic  
>> Hardware Error Source: 1
>> [   57.680495] {1}[Hardware Error]: event severity: fatal
>> [   57.680496] {1}[Hardware Error]:  Error 0, type: fatal
>> [   57.680496] {1}[Hardware Error]:   section_type: PCIe error
>> [   57.680497] {1}[Hardware Error]:   port_type: 6, downstream switch port
>> [   57.680498] {1}[Hardware Error]:   version: 3.0
>> [   57.680498] {1}[Hardware Error]:   command: 0x0407, status: 0x0010
>> [   57.680499] {1}[Hardware Error]:   device_id: :3c:07.0
>> [   57.680499] {1}[Hardware Error]:   slot: 1
>> [   57.680500] {1}[Hardware Error]:   secondary_bus: 0x40
>> [   57.680500] {1}[Hardware Error]:   vendor_id: 0x10b5, device_id: 0x9733
>> [   57.680501] {1}[Hardware Error]:   class_code: 000406
>> [   57.680502] {1}[Hardware Error]:   bridge: secondary_status: 0x,
>> control: 0x0003
> 
> This is a reaction to a ERR_FATAL message, right? What happens if we
> ignore firmware first and override control of the AER masking with a
> set to the Unsupported Request Error Mask in the root and downstream
> ports? You can do a quick test like this for the above's hardware:
> 
># setpci -s 3c:07.0 ECAP_AER+8.l=10:10
> 
> You'd probably have to do the same command to the root port BDF, and any
> other switches you have them in the hierarchy.

Then nobody gets the (error) message. You can go a bit further and try 
'pcie_ports=native'. Again, nobody gets the memo. ):

Alex



Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

2019-02-24 Thread Alex_Gagniuc
On 2/24/19 4:42 PM, Linus Torvalds wrote:
> On Sun, Feb 24, 2019 at 12:37 PM  wrote:
>>
>> Dell r740xd to name one. r640 is even worse -- they probably didn't give
>> me one because I'd have too much stuff to complain about.
>>
>> On the above machines, firmware-first (FFS) tries to guess when there's
>> a SURPRISE!!! removal of a PCIe card and supress any errors reported to
>> the OS. When the OS keeps firing IO over the dead link, FFS doesn't know
>> if it can safely supress the error. It reports is via NMI, and
>> drivers/acpi/apei/ghes.c panics whenever that happens.
> 
> Can we just fix that ghes driver?
> 
> It's not useful to panic just for random reasons. I realize that some
> of the RAS people have the mindset that "hey, I don't know what's
> wrong, so I'd better kill the machine than continue", but that's
> bogus.

That's the first thing I tried, but Borislav didn't like it. And he's 
right in the strictest sense of the ACPI spec: a fatal GHES error must 
result in a machine reboot [1].

> What happens if we just fix that part?

On the r740xd, on an NVMe hotplug bay, the upstream port stops sending 
hotplug interrupts. We could fix that with a quirk by clearing a 
proprietary bit in the switch. However, FFS won't re-arm itself to 
receive any further errors, so we'd never get notified in case there is 
a genuine error.

>> As I see it, there's a more fundamental problem. As long as we accept
>> platforms where firmware does some things first (FFS), we have much less
>> control over what happens. The best we can do is wishy-washy fixes like
>> this one.
> 
> Oh, I agree that platforms with random firmware things are horrid. But
> we've been able to handle them just fine before, without making every
> single possible hotplug pci driver have nasty problems and
> workarounds.
> 
> I suspect we'd be much better off having the ghes driver just not panic.

Keith Busch of Intel at some point suggested remapping all MMIO 
resources of a dead PCIe device to a read-only page that returns all 
F's. Neither of us was too sure how to do that, or how to handle the 
problem of in-flight DMA, which wouldn't hit the page tables.

> What is the actual ghes error? Is it the "unknown, just panic" case,
> or something else?

More like "fatal error, just panic". It looks like this (from a serial 
console):

[   57.680494] {1}[Hardware Error]: Hardware error from APEI Generic 
Hardware Error Source: 1
[   57.680495] {1}[Hardware Error]: event severity: fatal
[   57.680496] {1}[Hardware Error]:  Error 0, type: fatal
[   57.680496] {1}[Hardware Error]:   section_type: PCIe error
[   57.680497] {1}[Hardware Error]:   port_type: 6, downstream switch port
[   57.680498] {1}[Hardware Error]:   version: 3.0
[   57.680498] {1}[Hardware Error]:   command: 0x0407, status: 0x0010
[   57.680499] {1}[Hardware Error]:   device_id: :3c:07.0
[   57.680499] {1}[Hardware Error]:   slot: 1
[   57.680500] {1}[Hardware Error]:   secondary_bus: 0x40
[   57.680500] {1}[Hardware Error]:   vendor_id: 0x10b5, device_id: 0x9733
[   57.680501] {1}[Hardware Error]:   class_code: 000406
[   57.680502] {1}[Hardware Error]:   bridge: secondary_status: 0x, 
control: 0x0003
[   57.680503] Kernel panic - not syncing: Fatal hardware error!
[   57.680572] Kernel Offset: 0x2a00 from 0x8100 
(relocation range: 0x8000-0xbfff)


Alex



[1] ACPI 6.3 - 18.1 Hardware Errors and Error Sources

"A fatal hardware error is an uncorrected or uncontained error condition 
that is determined to be unrecoverable by the hardware. When a fatal 
uncorrected error occurs, the system is restarted to prevent propagation 
of the error."




Re: [PATCH RFC v2 2/4] PCI: pciehp: Do not turn off slot if presence comes up after link

2019-02-24 Thread Alex_Gagniuc
On 2/23/19 12:50 AM, Lukas Wunner wrote:
> On Fri, Feb 22, 2019 at 07:56:28PM +, alex_gagn...@dellteam.com wrote:
>> On 2/21/19 1:36 AM, Lukas Wunner wrote:
>>> On Tue, Feb 19, 2019 at 07:20:28PM -0600, Alexandru Gagniuc wrote:
>>>>  	mutex_lock(&ctrl->state_lock);
>>>> +	present = pciehp_card_present(ctrl);
>>>> +	link_active = pciehp_check_link_active(ctrl);
>>>>  	switch (ctrl->state) {
>>>
>>> These two assignments appear to be superfluous as you're also performing
>>> them in pciehp_check_link_active().
>>
>> Not sure. Between the first check, and this check, you can have several
>> seconds elapse depending on whether the driver's .probe()/remove() is
>> invoked. Whatever you got at the beginning would be stale. If you had a
>> picture dictionary and looked up 'bad idea', it would have a picture of
>> the above code with the second check removed.
> 
> I don't quite follow.  You're no longer using the "present" and
> "link_active" variables in pciehp_handle_presence_or_link_change(),
> the variables are set again further down in the function and you're
> *also* reading PDS and DLLLA in is_delayed_presence_up_event().
> So the above-quoted assignments are superfluous.  Am I missing something?
> 
> (Sorry, I had copy-pasted the wrong function name, I meant
> is_delayed_presence_up_event() above, not pciehp_check_link_active().)


I see what I did. You're right. I should remove the following lines from 
the patch. I'll have that fixed when I re-submit this.

+   present = pciehp_card_present(ctrl);
+   link_active = pciehp_check_link_active(ctrl);

> 
>> I've got all the other review comments addressed in my local branch. I'm
>> waiting on Lord Helgass' decision on which solution is better.
>   
> 
> Can we keep this discussion in a neutral tone please?

I'm sorry. I thought comparing Linux to feudalism would be hilarious, 
but I now see not everyone agrees. Sorry, Bjorn.

Alex


Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

2019-02-24 Thread Alex_Gagniuc
On 2/22/19 3:29 PM, Linus Torvalds wrote:
> On Thu, Feb 21, 2019 at 5:07 PM Jon Derrick  
> wrote:
>>
>> Some platforms don't seem to easily tolerate non-posted mmio reads on
>> lost (hot removed) devices. This has been noted in previous
>> modifications to other layers where an mmio read to a lost device could
>> cause an undesired firmware intervention [1][2].
> 
> This is broken, and whatever platform that requires this is broken.
> 
> This has absolutely nothing to do with nvme, and should not be handled
> by a driver.
> 
> The platform code should be fixed.
> 
> What broken platform is this, and why is it causing problems?
> 
> None of this wishy-washy "some platforms". Name them, and let's get them 
> fixed.

Dell r740xd to name one. r640 is even worse -- they probably didn't give 
me one because I'd have too much stuff to complain about.

On the above machines, firmware-first (FFS) tries to guess when there's 
a SURPRISE!!! removal of a PCIe card and suppress any errors reported to 
the OS. When the OS keeps firing IO over the dead link, FFS doesn't know 
if it can safely suppress the error. It reports it via NMI, and
drivers/acpi/apei/ghes.c panics whenever that happens.

Does this sound like horrible code?

As I see it, there's a more fundamental problem. As long as we accept 
platforms where firmware does some things first (FFS), we have much less 
control over what happens. The best we can do is wishy-washy fixes like 
this one.

Did I mention that FFS's explicit intent on the above machines is to 
make the OS crash?

Alex






Re: [PATCH RFC v2 2/4] PCI: pciehp: Do not turn off slot if presence comes up after link

2019-02-22 Thread Alex_Gagniuc
On 2/21/19 1:36 AM, Lukas Wunner wrote:
> On Tue, Feb 19, 2019 at 07:20:28PM -0600, Alexandru Gagniuc wrote:
>>  	mutex_lock(&ctrl->state_lock);
>> +	present = pciehp_card_present(ctrl);
>> +	link_active = pciehp_check_link_active(ctrl);
>>  switch (ctrl->state) {
> 
> These two assignments appear to be superfluous as you're also performing
> them in pciehp_check_link_active().

Not sure. Between the first check, and this check, you can have several 
seconds elapse depending on whether the driver's .probe()/remove() is 
invoked. Whatever you got at the beginning would be stale. If you had a 
picture dictionary and looked up 'bad idea', it would have a picture of 
the above code with the second check removed.

I've got all the other review comments addressed in my local branch. I'm 
waiting on Lord Helgass' decision on which solution is better.

Alex




Re: [PATCH RFC v2 4/4] PCI: hotplug: Add quirk For Dell nvme pcie switches

2019-02-22 Thread Alex_Gagniuc
On 2/21/19 8:05 PM, Oliver wrote:
> On Fri, Feb 22, 2019 at 5:38 AM  wrote:
>> On 2/21/19 1:57 AM, Lukas Wunner wrote:
[snip]
>>> If the quirk is x86-specific, please enclose it in "#ifdef CONFIG_X86"
>>> to reduce kernel footprint on other arches.
>>
>> That's a tricky one. If you look at p. 185 of [1], items 9, 11, and 12
>> are standard x16 cards that would fit in any x16 slot. Those cards have
>> the offending switches.
>>
>> On the one hand, you could take the cards and backplane and put them in
>> a non-hax86 system. On the other hand, I don't see why someone would
>> want to do this.
> 
> I have a couple of POWER boxes with Dell branded switch cards in them.
> I have no idea why either, but it does happen.

The hardware debouncer, I think, is on the backplane, so you'd really 
need both the switch and backplane combo. I've seen other marketing 
departments call the switches "NVMe HBA".

Alex


Re: [PATCH RFC v2 4/4] PCI: hotplug: Add quirk For Dell nvme pcie switches

2019-02-21 Thread Alex_Gagniuc
On 2/21/19 1:57 AM, Lukas Wunner wrote:
> On Tue, Feb 19, 2019 at 07:20:30PM -0600, Alexandru Gagniuc wrote:
>> --- a/drivers/pci/hotplug/pciehp_hpc.c
>> +++ b/drivers/pci/hotplug/pciehp_hpc.c
>> @@ -952,3 +952,23 @@ DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_QCOM, 
>> 0x0400,
>>PCI_CLASS_BRIDGE_PCI, 8, quirk_cmd_compl);
>>   DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_QCOM, 0x0401,
>>PCI_CLASS_BRIDGE_PCI, 8, quirk_cmd_compl);
>> +
>> +
> 
> Duplicate newline.
> 
> 
>> +static void fixup_dell_nvme_backplane_switches(struct pci_dev *pdev)
> 
> Can we have a little code comment above the function such as:
> 
> +/*
> + * Dell  NVMe storage backplanes disable in-band presence
> + * (PCIe r5.0 sec X.Y.Z) but neglect to set the corresponding flag in the
> + * Slot Capabilities 2 register.
> + */
> 
> 
>> +if (pdev->subsystem_vendor != PCI_VENDOR_ID_DELL
>> +|| pdev->subsystem_device != 0x1fc7)
> 
> This looks a little unpolished, how about:
> 
> + if (pdev->subsystem_vendor != PCI_VENDOR_ID_DELL ||
> + pdev->subsystem_device != 0x1fc7)
> 
> 
>> +return;
>> +
>> +pdev->no_in_band_presence = 1;
>> +}
>> +
>> +DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_PLX, 0x9733,
> 
> By convention there's no blank line between the closing curly brace
> and the DECLARE_PCI_FIXUP_CLASS_FINAL().

I'm sorry for all the style issues. I realize it's noise and should just 
be done right from the beginning. Is there a way to make checkpatch.pl 
catch these before they go out?


> If the quirk is x86-specific, please enclose it in "#ifdef CONFIG_X86"
> to reduce kernel footprint on other arches.

That's a tricky one. If you look at p. 185 of [1], items 9, 11, and 12 
are standard x16 cards that would fit in any x16 slot. Those cards have 
the offending switches.

On the one hand, you could take the cards and backplane and put them in 
a non-x86 system. On the other hand, I don't see why someone would 
want to do this.

Alex

[1] https://topics-cdn.dell.com/pdf/poweredge-r740xd_owners-manual_en-us.pdf




Re: [PATCH RFC v2 1/4] PCI: hotplug: Add support for disabling in-band presence

2019-02-21 Thread Alex_Gagniuc
On 2/21/19 1:20 AM, Lukas Wunner wrote:
> On Tue, Feb 19, 2019 at 07:20:27PM -0600, Alexandru Gagniuc wrote:
>> @@ -846,6 +846,9 @@ struct controller *pcie_init(struct pcie_device *dev)
>>  if (pdev->is_thunderbolt)
>>  slot_cap |= PCI_EXP_SLTCAP_NCCS;
>>   
>> +if (pdev->no_in_band_presence)
>> +ctrl->inband_presence_disabled = 1;
>> +
>>  ctrl->slot_cap = slot_cap;
>>  	mutex_init(&ctrl->ctrl_lock);
>>  	mutex_init(&ctrl->state_lock);
> 
> The above hunk belongs in patch 4.
> 
> 
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -413,6 +413,7 @@ struct pci_dev {
>>  	unsigned int	non_compliant_bars:1;	/* Broken BARs; ignore them */
>>  	unsigned int	is_probed:1;		/* Device probing in progress */
>>  	unsigned int	link_active_reporting:1;/* Device capable of reporting link active */
>> +	unsigned int	no_in_band_presence:1;	/* Device does not report in-band presence */
>>  	unsigned int	no_vf_scan:1;		/* Don't scan for VFs after IOV enablement */
>>  	pci_dev_flags_t	dev_flags;
>>  	atomic_t	enable_cnt;	/* pci_enable_device has been called */
> 
> Same here.

:)

> Thanks,
> 
> Lukas
> 



Re: [PATCH] PCI: pciehp: Do not turn off slot if presence comes up after link

2019-02-14 Thread Alex_Gagniuc
On 2/14/19 1:01 AM, Lukas Wunner wrote:
> On Wed, Feb 13, 2019 at 06:55:46PM +, alex_gagn...@dellteam.com wrote:
>> On 2/13/19 2:36 AM, Lukas Wunner wrote:
 (*) A bit hypothetical: There is no hardware yet implementing the ECN.
>>>
>>> Hm, this contradicts Austin Bolen's e-mail of Jan 25 that "Yes, this
>>> platform disables in-band presence" (when asked whether your host
>>> controller already adheres to the ECN).
>>
>> Both statements are true. The hardware does indeed disable in-band
>> presence, in a rudimentary way that is not compliant with the ECN -- it
>> doesn't implement the bits required by the ECN.
> 
> Ugh, can a BIOS update make those machines compliant to the ECN

No, that is not possible, I'm told.

> or do we need a quirk specifically for them?

We might have to.

> It should match the spec, which I have no access to.

I am as irritated as you that the spec is not readily available for 
public viewing.

Cheers,
Alex


Re: [PATCH] PCI: pciehp: Do not turn off slot if presence comes up after link

2019-02-13 Thread Alex_Gagniuc
On 2/13/19 2:36 AM, Lukas Wunner wrote:
>> (*) A bit hypothetical: There is no hardware yet implementing the ECN.
> 
> Hm, this contradicts Austin Bolen's e-mail of Jan 25 that "Yes, this
> platform disables in-band presence" (when asked whether your host
> controller already adheres to the ECN).

Both statements are true. The hardware does indeed disable in-band 
presence, in a rudimentary way that is not compliant with the ECN -- it 
doesn't implement the bits required by the ECN.


>> I'm
>> not sure that there is a spec-justifiable reason to not access a device
>> whose DLL is up, but PD isn't.
> 
> Austin asked in an e-mail of Jan 24 whether "the hot-inserted device's
> config space [is] accessed immediately after waiting this 20 + 100 ms
> delay", which sounded to me like you'd prefer the device not to be
> accessed until PDS is 1.

"Unless Readiness Notifications mechanisms are used (see Section 6.23), 
the Root Complex and/or system software must allow at least 1.0 s after 
a Conventional Reset of a device, before it may determine that a device 
which fails to return a Successful Completion status for a valid 
Configuration Request is a broken device. This period is independent of 
how quickly Link training completes."
(Section 6.6)

The concern was the one second delay, which is addressed by the use of 
pci_bus_wait_crs().

>>> Be mindful however that pcie_wait_for_link() is also called from the
>>> DPC driver.  Keith should be able to judge whether a change to that
>>> function breaks DPC.
>>
>> That's why I went for amending pciehp_handle_presence_or_link_change().
>> Smaller bug surface ;). What I'm thinking at this point is, keep the
>> patch as is, but, also check for in-band PD disable before aborting the
>> shutdown. Old behavior stays the same.
> 
> I'm worried that amending pciehp_handle_presence_or_link_change() makes
> the event handling logic more difficult to understand.

Don't worry. It's already difficult to understand ;)

>  Polling PDS in
> pcie_wait_for_link() or disabling either PDC or DLLSC if in-band presence
> is disabled seems simpler to reason about.

pcie_wait_for_link() is generic PCIe-layer code. Mixing hotplug concepts 
into it would be a layering violation, which I don't think is a good idea.

Disabling PDC or DLLSC can work. I've sometimes wondered why we even 
care about PDC. I can imagine situations where a platform might want to 
signal imminent removal by yanking PD.
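
As a sketch of the 'disable PDC' variant (inband_presence_disabled is 
the flag being proposed in this thread, not something that exists yet), 
pcie_enable_notification() could simply not request PDC events for such 
slots and rely on DLLSC alone:

	/* In pcie_enable_notification(), roughly: */
	cmd = PCI_EXP_SLTCTL_DLLSCE;
	if (!ctrl->inband_presence_disabled)
		cmd |= PCI_EXP_SLTCTL_PDCE;
	/* ... plus HPIE/CCIE and the attention-button bits as today ... */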

>> in-band PD disable (what's a good acronym for that, BTW?)
> 
> I don't know, maybe inband_presence_disabled?

PCI_EXP_SLTCAP2_IBPD ?


Re: [PATCH] PCI: pciehp: Do not turn off slot if presence comes up after link

2019-02-12 Thread Alex_Gagniuc
On 2/12/19 2:30 AM, Lukas Wunner wrote:
> The PCI SIG
> should probably consider granting access to specs to open source
> developers to ease adoption of new features in Linux et al.

Don't get me started on just how ridiculous I think PCI-SIG's policy is 
with respect to public availability of standards.

>>> Table 4-14 in sec 4.2.6 is also important because it shows the difference
>>> between Link Up and in-band presence.
>>
>> So, we'd expect PD to come up at the same time or before DLL?
> 
> Per table 4-14 and figure 4-23 (immediately below the table) in r4.0,
> PDC ought to be signaled before DLLSC.  As said, Linux can handle PDC
> after DLLSC if they're not too far apart (100 ms, see pcie_wait_for_link()).
> 
> 
>> I'd like a generic solution. I suspect supporting 'in-band PD disabled'
>> and quirking for that would be a saner way to do things. If we support
>> it, then we need to handle PDSC after DLLSC scenarios anyway.
> 
> I agree, having support for this new ECN would be great.
> 
> If you look at struct controller in drivers/pci/hotplug/pciehp.h,
> there's a section labelled /* capabilities and quirks */.  An idea
> would be to add an unsigned int there to indicate whether in-band
> presence is disabled.  An alternative would be to add a flag to
> struct pci_dev.

> Instead of modifying the logic in pciehp_handle_presence_or_link_change(),
> you could amend pcie_wait_for_link() to poll PDS until it's set, in
> addition to DLLLA.  The rationale would be that although the link is up,
> the hot-added device cannot really be considered accessible until PDS
> is also set.  Unfortunately we cannot do this for all devices because
> (as I've said before), some broken devices hardwire PDS to zero.  But
> it should be safe to constrain it to those which can disable in-band
> presence.


The recommendation is to set the "in-band PD disable" bit, and it's 
possible that platform firmware may have set it at boot time (*). I'm 
not sure that there is a spec-justifiable reason to not access a device 
whose DLL is up, but PD isn't.

(*) A bit hypothetical: There is no hardware yet implementing the ECN.

> Be mindful however that pcie_wait_for_link() is also called from the
> DPC driver.  Keith should be able to judge whether a change to that
> function breaks DPC.

That's why I went for amending pciehp_handle_presence_or_link_change(). 
Smaller bug surface ;). What I'm thinking at this point is, keep the 
patch as is, but, also check for in-band PD disable before aborting the 
shutdown. Old behavior stays the same.
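
Concretely, the check I have in mind would sit near the top of 
pciehp_handle_presence_or_link_change(), along these lines (the 
inband_presence_disabled flag is the one proposed in this thread; this 
is a sketch, not the final patch):

	present = pciehp_card_present(ctrl);
	link_active = pciehp_check_link_active(ctrl);

	/*
	 * With in-band presence disabled, the link coming up before PDS
	 * is expected; don't treat a not-yet-present card as a removal.
	 */
	if (!present && link_active && ctrl->inband_presence_disabled) {
		ctrl_dbg(ctrl, "Slot(%s): Link Up, waiting for presence\n",
			 slot_name(ctrl));
		return;
	}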

I'll rework this a little bit for in-band PD disable (what's a good 
acronym for that, BTW?), and then quirk those machines that are known to 
'disable' this in hardware.

Alex


Re: [PATCH] PCI: pciehp: Do not turn off slot if presence comes up after link

2019-02-11 Thread Alex_Gagniuc
On 2/9/19 5:58 AM, Lukas Wunner wrote:
> On Tue, Feb 05, 2019 at 03:06:56PM -0600, Alexandru Gagniuc wrote:
>> According to PCIe 3.0, the presence detect state is a logical OR of
>> in-band and out-of-band presence.
> 
> Since Bjorn asked for a spec reference:
> 
> PCIe r4.0 sec 7.8.11: "Presence Detect State - This bit indicates the
> presence of an adapter in the slot, reflected by the logical OR of the
> Physical Layer in-band presence detect mechanism and, if present, any
> out-of-band presence detect mechanism defined for the slot's
> corresponding form factor."

ECN [1] was approved last November, so it's normative spec text. Sorry 
if the Ukrainians didn't get ahold of it yet. I'll try to explain the delta.
There's an in-band PD supported/disable pair of bits. If 'supported' is 
set, then you can set 'disable', which removes the 'logical OR', and PD 
then reflects only out-of-band presence.
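
In code, using the proposed names (none of these defines exist in the 
tree yet, so this is only a sketch of a fragment that would sit in 
pcie_init()):

	u32 slot_cap2;

	pcie_capability_read_dword(pdev, PCI_EXP_SLTCAP2, &slot_cap2);
	if (slot_cap2 & PCI_EXP_SLTCAP2_IBPD) {
		pcie_capability_set_word(pdev, PCI_EXP_SLTCTL2,
					 PCI_EXP_SLTCTL2_IBPD_DISABLE);
		ctrl->inband_presence_disabled = 1;
	}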

> Table 4-14 in sec 4.2.6 is also important because it shows the difference
> between Link Up and in-band presence.

So, we'd expect PD to come up at the same time or before DLL?

>> With this, we'd expect the presence
>> state to always be asserted when the link comes up.
>>
>> Not all hardware follows this, and it is possible for the presence to
>> come up after the link. In this case, the PCIe device would be
>> erroneously disabled and re-probed. It is possible to distinguish
>> between a delayed presence and a card swap by looking at the DLL state
>> changed bit -- The link has to come down if the card is removed.
> 
> So in reality, PDC and DLLSC events rarely come in simultaneously and
> we do allow some leeway in pcie_wait_for_link(), it's just not enough
> in your particular case due to the way presence is detected by the
> platform.
> 
> I think in this case instead of changing the behavior for everyone,
> it's more appropriate to add a quirk for your hardware.  Two ideas:
> 
> * Amend pcie_init() to set slot_cap |= PCI_EXP_SLTCAP_ABP.  Insert
>this as a quirk right below the existing Thunderbolt quirk and
>use PCI vendor/device IDs and/or DMI checks to identify your
>particular hardware.  If the ABP bit is set, PDC events are not
>enabled by pcie_enable_notification(), so you will solely rely on
>DLLSC events.  Problem solved.  (Hopefully.)
> 
> * Alternatively, add a DECLARE_PCI_FIXUP_FINAL() quirk which sets
>pdev->link_active_reporting = false.  This causes pcie_wait_for_link()
>to wait 1100 msec before the hot-added device's config space is
>accessed for the first time.

These are both hacks, right? We're declaring something untrue about the 
hardware because we want to obtain a specific side effect. But those 
side effects aren't guaranteed to stay the same.

> Would this work for you?

I'm certain it's doable. In theory one could use the PCI ID of the DP, 
and the vendor and machine names to filter the quirk. But what happens 
in situations where the same PCI ID switch is used in different parts of 
the system? You'd want the quirk here, but not there. It quickly becomes 
a shartstorm.

I'd like a generic solution. I suspect supporting 'in-band PD disabled' 
and quirking for that would be a saner way to do things. If we support 
it, then we need to handle PDSC after DLLSC scenarios anyway.

Alex

[1] PCI-SIG ENGINEERING CHANGE NOTICE
Async Hot-Plug Updates
Nov 29, 2018


Re: [RFC] PCI / ACPI: Implementing Type 3 _HPX records

2019-01-17 Thread Alex_Gagniuc
Hi Bjorn

On 1/14/19 2:01 PM, Bjorn Helgaas wrote:
> On Thu, Jan 10, 2019 at 05:11:27PM -0600, Alexandru Gagniuc wrote:
>> _HPX Type 3 is intended to be more generic and allow configuration of
>> settings not possible with Type 2 tables. For example, FW could ensure
>> that the completion timeout value is set accordingly throughout the PCI
>> tree; some switches require very special handling.
>>
>> Type 3 can come in an arbitrary number of ACPI packages. With the older
>> types we only ever had one descriptor. We could get lazy and store it
>> on the stack. The current flow is to parse the HPX table
>> --pci_get_hp_params()-- and then program the PCIe device registers.
>>
>> In the current flow, we'll need to malloc(). Here's what I tried:
>>   1) devm_kmalloc() on the pci_dev
>>This ended up giving me NULL dereference at boot time.
> 
> Yeah, I don't think devm_*() is going to work because that's part of the
> driver binding model; this is in the PCI core and this use is not related
> to the lifetime of a driver's binding to a device.
> 
>>   2) Add a cleanup function to be called after writing the PCIe registers
>>
>> This RFC implements method 2. The HPX3 stuff is still NDA'd, but the
>> housekeeping parts are in here. Is this a good way to do things? I'm not
>> too sure about the need to call cleanup() on a stack variable. But if
>> not that, then what else?
> 
> pci_get_hp_params() is currently exported, but only used inside
> drivers/pci, so I think it could be unexported.

It can.

> I don't remember if there's a reason why things are split between
> pci_get_hp_params(), which runs _HPX or _HPP, and the program_hpp_type*()
> functions.  If we could get rid of that split, we could get rid of struct
> hotplug_params and do the configuration directly in the pci_get_hp_params()
> path.  I think that would simplify the memory management.

Thanks for the in-depth analysis. I'm working on a proof-of-concept 
patch to do what you suggested. One downside, though, is that we may have 
to parse some things twice.

Another thing I've been thinking about: why not store the hotplug 
parameters from ACPI in the pci_bus struct? That way, ACPI only needs to 
be parsed once per bus, instead of each time a device is probed. There 
would be some memory overhead, but the hotplug structures are 
(thankfully) still relatively small. I could try a proof of concept for 
this idea as well... let me know.
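
A rough sketch of the per-bus caching idea (the hpp member of struct 
pci_bus and the helper name are hypothetical):

#include <linux/pci.h>
#include <linux/pci_hotplug.h>
#include <linux/slab.h>

static struct hotplug_params *pci_bus_hp_params(struct pci_dev *dev)
{
	struct pci_bus *bus = dev->bus;

	if (!bus->hpp) {	/* assumed new member of struct pci_bus */
		bus->hpp = kzalloc(sizeof(*bus->hpp), GFP_KERNEL);
		if (bus->hpp && pci_get_hp_params(dev, bus->hpp)) {
			/* No _HPP/_HPX here; fall back to per-device parsing */
			kfree(bus->hpp);
			bus->hpp = NULL;
		}
	}
	return bus->hpp;
}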

Alex


Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2018-12-27 Thread Alex_Gagniuc
On 11/8/18 2:09 PM, Bjorn Helgaas wrote:
> 
> [+cc Jonathan, Greg, Lukas, Russell, Sam, Oliver for discussion about
> PCI error recovery in general]
> 
> On Wed, Nov 07, 2018 at 05:42:57PM -0600, Bjorn Helgaas wrote:
>> On Tue, Sep 18, 2018 at 05:15:00PM -0500, Alexandru Gagniuc wrote:
>>> When a PCI device is gone, we don't want to send IO to it if we can
>>> avoid it. We expose functionality via the irq_chip structure. As
>>> users of that structure may not know about the underlying PCI device,
>>> it's our responsibility to guard against removed devices.
>>>
>>> .irq_write_msi_msg() is already guarded inside __pci_write_msi_msg().
>>> .irq_mask/unmask() are not. Guard them for completeness.
>>>
>>> For example, surprise removal of a PCIe device triggers teardown. This
>>> touches the irq_chips ops some point to disable the interrupts. I/O
>>> generated here can crash the system on firmware-first machines.
>>> Not triggering the IO in the first place greatly reduces the
>>> possibility of the problem occurring.
>>>
>>> Signed-off-by: Alexandru Gagniuc 
>>
>> Applied to pci/misc for v4.21, thanks!
> 
> I'm having second thoughts about this. 

Do we have a verdict on this? If you don't like this approach, then I'll 
have to fix the problem in some other way, but the problem still needs 
to be fixed.

Alex


Re: [PATCH] PCI: pciehp: Report degraded links via link bandwidth notification

2018-11-29 Thread Alex_Gagniuc
On 11/29/2018 5:05 PM, Bjorn Helgaas wrote:
> On Thu, Nov 29, 2018 at 08:13:12PM +0100, Lukas Wunner wrote:
>> I guess the interrupt is shared with hotplug and PME?  In that case write
>> a separate pcie_port_service_driver and request the interrupt with
>> IRQF_SHARED.  Define a new service type in drivers/pci/pcie/portdrv.h.
>> Amend get_port_device_capability() to check for PCI_EXP_LNKCAP_LBNC.
> 
> I really don't like the port driver design.  I'd rather integrate
> those services more tightly into the PCI core.  But realistically
> that's wishful thinking and may never happen, so this might be the
> most expedient approach.

So, how would it get integrated? I don't like the port service driver 
either. It's too dicey in how it creates new devices that other 
drivers bind to. If we could have a 1:1 mapping between service drivers 
and PCI capabilities, then it might make better sense.

So, do I go the new service driver route?

Alex


Re: [PATCH] PCI: pciehp: Report degraded links via link bandwidth notification

2018-11-29 Thread Alex_Gagniuc
On 11/29/2018 10:06 AM, Mika Westerberg wrote:
>> @@ -515,7 +515,8 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id)
>>  struct controller *ctrl = (struct controller *)dev_id;
>>  struct pci_dev *pdev = ctrl_dev(ctrl);
>>  struct device *parent = pdev->dev.parent;
>> -u16 status, events;
>> +struct pci_dev *endpoint;
>> +u16 status, events, link_status;
> 
> Looks better if you write them in opposite order (reverse xmas-tree):
> 
>   u16 status, events, link_status;
>   struct pci_dev *endpoint;
> 

I don't decorate in November :p

>> +	pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &link_status);
>> +
> 
> Unnecessary empty line.

However Bjorn wants it, though I don't like the crowded look with this 
line removed.

>> +if (link_status & PCI_EXP_LNKSTA_LBMS) {
>> +if (pdev->subordinate && pdev->subordinate->self)
>> +endpoint = pdev->subordinate->self;
> 
> Hmm, I thought pdev->subordinate->self == pdev, no?

That makes no sense, but I think you're right. I'm trying to get to the 
other end of the PCIe link. Is there a simple way to do that? (other 
than convoluted logic that all leads to the same mistake)
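
For what it's worth, my understanding is that whatever sits on the other 
end of the link hangs off the bridge's secondary bus, so something like 
this (sketch; strictly speaking the devices list walk wants pci_bus_sem 
held):

	struct pci_dev *endpoint = NULL;

	if (pdev->subordinate)
		endpoint = list_first_entry_or_null(&pdev->subordinate->devices,
						    struct pci_dev, bus_list);
	if (endpoint)
		pcie_print_link_status(endpoint);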

>>   static void pcie_enable_notification(struct controller *ctrl)
>>   {
>>  u16 cmd, mask;
>> @@ -713,6 +743,9 @@ static void pcie_enable_notification(struct controller 
>> *ctrl)
>>  pcie_write_cmd_nowait(ctrl, cmd, mask);
>>  ctrl_dbg(ctrl, "%s: SLOTCTRL %x write cmd %x\n", __func__,
>>   pci_pcie_cap(ctrl->pcie->port) + PCI_EXP_SLTCTL, cmd);
>> +
>> +if (pcie_link_bandwidth_notification_supported(ctrl))
>> +pcie_enable_link_bandwidth_notification(ctrl);
> 
> Do we ever need to disable it?

I can't think of a case where that would be needed.
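
If we ever did need to, it should just be the mirror image of the enable 
(sketch, assuming the enable side sets PCI_EXP_LNKCTL_LBMIE):

static void pcie_disable_link_bandwidth_notification(struct controller *ctrl)
{
	pcie_capability_clear_word(ctrl_dev(ctrl), PCI_EXP_LNKCTL,
				   PCI_EXP_LNKCTL_LBMIE);
}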

Alex


Re: [PATCH] PCI: pciehp: Report degraded links via link bandwidth notification

2018-11-29 Thread Alex_Gagniuc
On 11/29/2018 11:36 AM, Bjorn Helgaas wrote:
> On Wed, Nov 28, 2018 at 06:08:24PM -0600, Alexandru Gagniuc wrote:
>> A warning is generated when a PCIe device is probed with a degraded
>> link, but there was no similar mechanism to warn when the link becomes
>> degraded after probing. The Link Bandwidth Notification provides this
>> mechanism.
>>
>> Use the link bandwidth notification interrupt to detect bandwidth
>> changes, and rescan the bandwidth, looking for the weakest point. This
>> is the same logic used in probe().
> 
> I like the concept of this.  What I don't like is the fact that it's
> tied to pciehp, since I don't think the concept of Link Bandwidth
> Notification is related to hotplug.  So I think we'll only notice this
> for ports that support hotplug.  Maybe it's worth doing it this way
> anyway, even if it could be generalized in the future?

That makes sense. At first, I thought that BW notification was tied to 
hotplug, but our PCIe spec writer disagreed with that assertion. I'm 
just not sure where to handle the interrupt otherwise.

Alex



Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2018-11-08 Thread Alex_Gagniuc
On 11/08/2018 02:09 PM, Bjorn Helgaas wrote:
> 
> [+cc Jonathan, Greg, Lukas, Russell, Sam, Oliver for discussion about
> PCI error recovery in general]

Has anyone seen the ECRs in the PCIe base spec and ACPI that have 
been floating around the past few months? -- HPX, SFI, CER. Without 
divulging too much before publication, I'm curious on opinions on how 
well (or not well) those flows would work in general, and in Linux.

> On Wed, Nov 07, 2018 at 05:42:57PM -0600, Bjorn Helgaas wrote:
> I'm having second thoughts about this.  One thing I'm uncomfortable
> with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> instead of systematic, in the sense that I don't know how we convince
> ourselves that this (and only this) is the correct place to put it.
>
> Another is that the only place we call pci_dev_set_disconnected() is
> in pciehp and acpiphp, so the only "disconnected" case we catch is if
> hotplug happens to be involved.  Every MMIO read from the device is an
> opportunity to learn whether it is reachable (a read from an
> unreachable device typically returns ~0 data), but we don't do
> anything at all with those.
>
> The config accessors already check pci_dev_is_disconnected(), so this
> patch is really aimed at MMIO accesses.  I think it would be more
> robust if we added wrappers for readl() and writel() so we could
> notice read errors and avoid future reads and writes.

I wouldn't expect anything less than complete scrutiny and quality 
control of unquestionable moral integrity :). In theory ~0 can be a 
great indicator that something may be wrong. Though I think it's about 
as ad-hoc as pci_dev_is_disconnected().

I slightly like the idea of wrapping the MMIO accessors. There's still 
memcpy and DMA that cause the same MemRead/Wr PCIe transactions, and the 
same sort of errors in PCIe land, and it would be good to have more 
testing on this. Since this patch is tested and confirmed to fix a known 
failure case, I would keep it, and then look at fixing the problem in a 
more generic way.
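
To make the wrapper idea concrete, I'd picture something like this 
(pci_readl is a hypothetical name, not an existing API, and 
pci_dev_is_disconnected() is the private helper from drivers/pci/pci.h):

#include <linux/io.h>
#include <linux/pci.h>

static inline u32 pci_readl(struct pci_dev *pdev, const void __iomem *addr)
{
	if (pci_dev_is_disconnected(pdev))
		return ~0U;

	/* An all-ones result here is where we could also latch "device gone" */
	return readl(addr);
}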

BTW, a lot of the problems we're fixing here come courtesy of 
firmware-first error handling. Do we reach a point where we draw a line 
in handling new problems introduced by FFS? So, if something is a 
problem with FFS, but not native handling, do we commit to supporting it?

Alex



Re: [PATCH] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2018-09-18 Thread Alex_Gagniuc
On 9/12/2018 4:28 PM, Bjorn Helgaas wrote:
> On Mon, Jul 30, 2018 at 04:21:44PM -0500, Alexandru Gagniuc wrote:
>> When a PCI device is gone, we don't want to send IO to it if we can
>> avoid it. We expose functionality via the irq_chip structure. As
>> users of that structure may not know about the underlying PCI device,
>> it's our responsibility to guard against removed devices.
> 
> I'm pretty ambivalent about pci_dev_is_disconnected() in general, but
> I think I'll take this, given a couple minor changelog clarifications:
> 
>> irq_write_msi_msg is already guarded. pci_msi_(un)mask_irq are not.
>> Guard them for completeness.
> 
> By the irq_write_msi_msg() guard, I guess you mean this path:
> 
>pci_msi_domain_write_msg   # irq_chip.irq_write_msi_msg
>  __pci_write_msi_msg
>if (dev->current_state != PCI_D0 || pci_dev_is_disconnected(dev))
>  /* don't touch */

Yes!

> pci_msi_(un)mask_irq() may be irq_chip.irq_mask, .irq_unmask, etc
> pointers.  So these are parallel because they're all irq_chip function
> pointers, but the changelog isn't (yet) parallel because it uses the
> irq_chip pointer name for .irq_write_msi_msg but not for mask/unmask

Good catch! I'll get this corrected.


>> For example, surprise removal of a PCIe device triggers teardown. This
>> touches the irq_chips ops some point to disable the interrupts. I/O
>> generated here can crash the system on machines with buggy firmware.
>> Not triggering the IO in the first place eliminates the problem.
> 
> It doesn't eliminate the problem completely because .irq_mask() and
> .irq_unmask() may be called for reasons other than surprise removal,
> and if a surprise removal happens after the pci_dev_is_disconnected()
> check but before the readl(), we will still generate I/O to a device
> that's gone.  I'd be OK if you said it "reduces" the problem.

That sounds reasonable.

> One reason I'm ambivalent about pci_dev_is_disconnected() is that in
> cases like this, it turns a reproducible problem into a very
> hard-to-reproduce problem, which reduces the likelihood that the buggy
> firmware will be fixed.

If it manages to turn this into 99.999% territory, I'll be much happier. 
I'd love to give you an academically correct solution, but I just don't 
see how, given how firmware-first philosophy is written.

> Do you have information about known platforms with this buggy firmware
> and the signature of the crash?  If you do, it's always nice to be
> able to connect a patch with the user-visible problem it fixes.

From what I've heard, it won't be fixed. The number of changes needed 
would require re-qualifying the firmware. I'm told that's very hard to 
do on platforms that are shipping. I can reword this to say 
"firmware-first" instead of "buggy" since they are mostly synonymous.

Alex

>> Signed-off-by: Alexandru Gagniuc 
>> ---
>>
>> There's another patch by Lukas Wunner that is needed (not yet published)
>> in order to fully block IO on SURPRISE!!! removal. The existing code only
>> sets the PCI_DEV_DISCONNECTED bit in an unreasonably narrow set of
>> circumstances. Lukas' patch fixes that.
>>
>> However, this change is otherwise fully independent, and enjoy!
>>
>>   drivers/pci/msi.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
>> index 4d88afdfc843..5f47b5cb0401 100644
>> --- a/drivers/pci/msi.c
>> +++ b/drivers/pci/msi.c
>> @@ -227,6 +227,9 @@ static void msi_set_mask_bit(struct irq_data *data, u32 
>> flag)
>>   {
>>  struct msi_desc *desc = irq_data_get_msi_desc(data);
>>   
>> +if (pci_dev_is_disconnected(msi_desc_to_pci_dev(desc)))
>> +return;
>> +
>>  if (desc->msi_attrib.is_msix) {
>>  msix_mask_irq(desc, flag);
>>  readl(desc->mask_base); /* Flush write to device */
>> -- 
>> 2.17.1
>>
> 



Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER

2018-08-09 Thread Alex_Gagniuc
On 08/09/2018 09:16 AM, Bjorn Helgaas wrote:
> On Tue, Jul 17, 2018 at 10:31:23AM -0500, Alexandru Gagniuc wrote:
>> When we don't own AER, we shouldn't touch the AER error bits. This
>> happens unconditionally on device probe(). Clearing AER bits
>> willy-nilly might cause firmware to miss errors. Instead
>> these bits should get cleared by FFS, or via ACPI _HPX method.
>>
>> This race is mostly of theoretical significance, as it is not easy to
>> reasonably demonstrate it in testing.
>>
>> Signed-off-by: Alexandru Gagniuc 
>> ---
>>   drivers/pci/pcie/aer.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index a2e88386af28..18037a2a8231 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -383,6 +383,9 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev 
>> *dev)
>>  if (!pci_is_pcie(dev))
>>  return -ENODEV;
>>   
>> +if (pcie_aer_get_firmware_first(dev))
>> +return -EIO;
> 
> I like this patch.
> 
> Do we need the same thing in the following places that also clear AER
> status bits or write AER control bits?

In theory, every exported function should guard against this. I think the 
idea, a long time ago, was that the check happens during initialization, 
so the other paths never get hit.

>enable_ecrc_checking()
>disable_ecrc_checking()

I don't immediately see how this would affect FFS, but the bits are part 
of the AER capability structure. According to the FFS model, those would 
be owned by FW, and we'd have to avoid touching them.

>pci_cleanup_aer_uncorrect_error_status()

This probably should be guarded. It's only called from a few specific 
drivers, so the impact is not as high as being called from the core.
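
As a hedged sketch (not part of this patch), the guard there could mirror
what the patch does for pci_cleanup_aer_error_status_regs(); the body below
follows the existing function as I recall it:

int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
{
        int pos;
        u32 status;

        if (pcie_aer_get_firmware_first(dev))
                return -EIO;            /* AER registers belong to firmware */

        pos = dev->aer_cap;
        if (!pos)
                return -EIO;

        /* Uncorrectable status bits are write-1-to-clear */
        pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status);
        pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, status);

        return 0;
}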

>pci_aer_clear_fatal_status()

This is only called when doing fatal_recovery, right?
For practical considerations this is not an issue today. The ACPI error 
handling code currently crashes when it encounters any fatal error, so 
we wouldn't hit this in the FFS case.

If the ACPI error handling code ever stops crashing on fatal errors, 
then we could start hitting this path in the future. For correctness, 
guarding makes sense.

The PCIe standards contact I usually talk to about these subtleties 
is currently on vacation. The number one issue was an FFS corner case 
with the OS clearing bits on probe. The other functions you mention are a 
corner case of a corner case. The big fish is 
pci_cleanup_aer_error_status_regs() on probe(), and it would be nice to 
have that resolved.

I'll sync up with Austin when he gets back to look at the other 
functions, though I suspect we'll end up fixing them as well.

Alex

>>  pos = dev->aer_cap;
>>  if (!pos)
>>  return -EIO;
>> -- 
>> 2.14.3
>>
> 




Re: [PATCH v5] PCI: Check for PCIe downtraining conditions

2018-08-06 Thread Alex_Gagniuc
On 08/05/2018 02:06 AM, Tal Gilboa wrote:
> On 7/31/2018 6:10 PM, Alex G. wrote:
>> On 07/31/2018 01:40 AM, Tal Gilboa wrote:
>> [snip]
>> @@ -2240,6 +2258,9 @@ static void pci_init_capabilities(struct
>> pci_dev *dev)
>>    /* Advanced Error Reporting */
>>    pci_aer_init(dev);
>> +    /* Check link and detect downtrain errors */
>> +    pcie_check_upstream_link(dev);

 This is called for every PCIe device right? Won't there be a
 duplicated print in case a device loads with lower PCIe bandwidth
 than needed?
>>>
>>> Alex, can you comment on this please?
>>
>> Of course I can.
>>
>> There's one print at probe() time, which happens if bandwidth < max. I
>> would think that's fine. There is a way to duplicate it, and that is if
>> the driver also calls print_link_status(). A few driver maintainers who
>> call it have indicated they'd be fine with removing it from the driver,
>> and leaving it in the core PCI.
> 
> We would be fine with that as well. Please include the removal in your
> patches.

What's the proper procedure? Do I wait for confirmation from Bjorn 
before knocking on maintainers' doors, or do I William Wallace into 
their trees and demand they merge the removal (pending Bjorn's approval 
on the other side)?

Alex

>>
>> Should the device come back at low speed, go away, then come back at
>> full speed we'd still only see one print from the first probe. Again, if
>> driver doesn't also call the function.
>> There's the corner case where the device comes up at < max, goes away,
>> then comes back faster, but still < max. There will be two prints, but
>> they won't be true duplicates -- each one will indicate different BW.
> 
> This is fine.
> 
>>
>> Alex
>>
>> +
>>    if (pci_probe_reset_function(dev) == 0)
>>    dev->reset_fn = 1;
>>    }
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index abd5d5e17aee..15bfab8f7a1b 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -1088,6 +1088,7 @@ int pcie_set_mps(struct pci_dev *dev, int mps);
>>    u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev
>> **limiting_dev,
>>     enum pci_bus_speed *speed,
>>     enum pcie_link_width *width);
>> +void __pcie_print_link_status(struct pci_dev *dev, bool verbose);
>>    void pcie_print_link_status(struct pci_dev *dev);
>>    int pcie_flr(struct pci_dev *dev);
>>    int __pci_reset_function_locked(struct pci_dev *dev);
>
> 



Re: [PATCH v5] PCI: Check for PCIe downtraining conditions

2018-07-30 Thread Alex_Gagniuc
On 07/24/2018 08:40 AM, Tal Gilboa wrote:
> On 7/24/2018 2:59 AM, Alex G. wrote:
>>
>>
>> On 07/23/2018 05:14 PM, Jakub Kicinski wrote:
>>> On Tue, 24 Jul 2018 00:52:22 +0300, Tal Gilboa wrote:
 On 7/24/2018 12:01 AM, Jakub Kicinski wrote:
> On Mon, 23 Jul 2018 15:03:38 -0500, Alexandru Gagniuc wrote:
>> PCIe downtraining happens when both the device and PCIe port are
>> capable of a larger bus width or higher speed than negotiated.
>> Downtraining might be indicative of other problems in the system, and
>> identifying this from userspace is neither intuitive, nor
>> straightforward.
>>
>> The easiest way to detect this is with pcie_print_link_status(),
>> since the bottleneck is usually the link that is downtrained. It's not
>> a perfect solution, but it works extremely well in most cases.
>>
>> Signed-off-by: Alexandru Gagniuc 
>> ---
>>
>> For the sake of review, I've created a __pcie_print_link_status()
>> which
>> takes a 'verbose' argument. If we agree we want to go this route, and
>> update
>> the users of pcie_print_link_status(), I can split this up in two
>> patches.
>> I prefer just printing this information in the core functions, and
>> letting
>> drivers not have to worry about this. Though there seems to be strong
>> support for not going that route, so here it goes:
>
> FWIW the networking drivers print PCIe BW because sometimes the network
> bandwidth is simply over-provisioned on multi port cards, e.g. 80Gbps
> card on a x8 link.
>
> Sorry to bike shed, but currently the networking cards print the info
> during probe.  Would it make sense to move your message closer to probe
> time?  Rather than when device is added.  If driver structure is
> available, we could also consider adding a boolean to struct pci_driver
> to indicate if driver wants the verbose message?  This way we avoid
> duplicated prints.
>
> I have no objection to current patch, it LGTM.  Just a thought.

 I don't see the reason for having two functions. What's the problem with
 adding the verbose argument to the original function?
>>>
>>> IMHO it's reasonable to keep the default parameter to what 90% of users
>>> want by a means on a wrapper.  The non-verbose output is provided by
>>> the core already for all devices.
>>>
>>> What do you think of my proposal above Tal?  That would make the extra
>>> wrapper unnecessary since the verbose parameter would be part of the
>>> driver structure, and it would avoid the duplicated output.
>>
>> I see how it might make sense to add another member to the driver
>> struct, but is it worth the extra learning curve? It seems to be
>> something with the potential to confuse new driver developers, and
>> having a very marginal benefit.
>> Although, if that's what people want...
> 
> I prefer the wrapper function. Looking at struct pci_driver it would
> seem strange for it to hold a field for controlling verbosity (IMO).
> This is a very (very) specific field in a very general struct.

If people are okay with the wrapper, then I'm not going to update the patch.
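
For reference, a minimal sketch of the wrapper option under discussion,
assuming the __pcie_print_link_status(dev, verbose) split from the patch:

void pcie_print_link_status(struct pci_dev *dev)
{
        /* Drivers that call this always get the message, even at full bandwidth */
        __pcie_print_link_status(dev, true);
}

The core probe path could then pass verbose=false, so the message only
appears when the device's bandwidth is actually constrained.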

Alex


Re: [PATCH v3] PCI: Check for PCIe downtraining conditions

2018-07-16 Thread Alex_Gagniuc
On 7/16/2018 4:17 PM, Bjorn Helgaas wrote:
> [+cc maintainers of drivers that already use pcie_print_link_status()
> and GPU folks]

Thanks for finding them!

[snip]
>> identifying this from userspace is neither intuitive, nor straigh
>> forward.
> 
> s/straigh/straight/
> In this context, I think "straightforward" should be closed up
> (without the space).

That's a straightforward edit. Thanks for the feedback!

>> The easiest way to detect this is with pcie_print_link_status(),
>> since the bottleneck is usually the link that is downtrained. It's not
>> a perfect solution, but it works extremely well in most cases.
> 
> This is an interesting idea.  I have two concerns:
> 
> Some drivers already do this on their own, and we probably don't want
> duplicate output for those devices.  In most cases (ixgbe and mlx* are
> exceptions), the drivers do this unconditionally so we *could* remove
> it from the driver if we add it to the core.  The dmesg order would
> change, and the message wouldn't be associated with the driver as it
> now is.

Oh, there are only 8 users of that. Even I could patch up the drivers to 
remove the call, assuming we reach agreement about this change.

> Also, I think some of the GPU devices might come up at a lower speed,
> then download firmware, then reset the device so it comes up at a
> higher speed.  I think this patch will make us complain about about
> the low initial speed, which might confuse users.

I spoke to one of the PCIe spec writers. It's allowable for a device to 
downtrain speed or width. It would also be extremely dumb to downtrain 
with the intent to re-train at a higher speed later, but it's possible 
devices do dumb stuff like that. That's why it's an informational 
message, instead of a warning.

Another case: Some devices (lower-end GPUs) use silicon (and marketing) 
that advertises x16, but they're only routed for x8. I'm okay with 
seeing an informational message in this case. In fact, I didn't know 
that the Quadro card I've had for three years is only wired for x8 until 
I tested this patch.

> So I'm not sure whether it's better to do this in the core for all
> devices, or if we should just add it to the high-performance drivers
> that really care.

You're thinking "do I really need that bandwidth" because I'm using a 
function called "_bandwidth_". The point of the change is very far from 
that: it is to help in system troubleshooting by detecting downtraining 
conditions.

>> Signed-off-by: Alexandru Gagniuc 
[snip]
>> +/* Look from the device up to avoid downstream ports with no devices. */
>> +if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
>> +(pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
>> +(pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
>> +return;
> 
> Do we care about Upstream Ports here?  

YES! Switches. For example, an x16 switch with x4 downstream ports could 
downtrain to x8 or x4, and we'd never catch it.
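
A hedged sketch of that check with the port type cached in a local variable
(the helper name is made up; the quoted patch open-codes this in
pcie_check_upstream_link()):

static bool pcie_downtrain_check_applies(struct pci_dev *dev)
{
        int type = pci_pcie_type(dev);

        /*
         * Endpoints, legacy endpoints and switch Upstream Ports carry the
         * link we care about; keeping Upstream Ports is what catches a
         * downtrained switch uplink.
         */
        return type == PCI_EXP_TYPE_ENDPOINT ||
               type == PCI_EXP_TYPE_LEG_END ||
               type == PCI_EXP_TYPE_UPSTREAM;
}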

> I suspect that ultimately we
> only care about the bandwidth to Endpoints, and if an Endpoint is
> constrained by a slow link farther up the tree,
> pcie_print_link_status() is supposed to identify that slow link.

See above.

> I would find this test easier to read as
> 
>if (!(type == PCI_EXP_TYPE_ENDPOINT || type == PCI_EXP_TYPE_LEG_END))
>  return;
> 
> But maybe I'm the only one that finds the conjunction of inequalities
> hard to read.  No big deal either way.
> 
>> +/* Multi-function PCIe share the same link/status. */
>> +if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn)
>> +return;
>> +
>> +pcie_print_link_status(dev);
>> +}
>> +
>>   static void pci_init_capabilities(struct pci_dev *dev)
>>   {
>>  /* Enhanced Allocation */
>> @@ -2181,6 +2200,9 @@ static void pci_init_capabilities(struct pci_dev *dev)
>>  /* Advanced Error Reporting */
>>  pci_aer_init(dev);
>>   
>> +/* Check link and detect downtrain errors */
>> +pcie_check_upstream_link(dev);
>> +
>>  if (pci_probe_reset_function(dev) == 0)
>>  dev->reset_fn = 1;
>>   }
>> -- 
>> 2.14.4
>>
> 



Re: [PATCH] PCI: access.c: Piggyback user config access on pci_read/write_*()

2018-06-04 Thread Alex_Gagniuc
On 6/4/2018 11:09 AM, Keith Busch wrote:
> On Mon, Jun 04, 2018 at 10:48:02AM -0500, Alexandru Gagniuc wrote:
>> +++ b/drivers/pci/access.c
>> @@ -223,16 +223,9 @@ int pci_user_read_config_##size 
>> \
>>  (struct pci_dev *dev, int pos, type *val)   \
>>   {  \
>>  int ret = PCIBIOS_SUCCESSFUL;   \
>> -u32 data = -1;  \
>>  if (PCI_##size##_BAD)   \
>>  return -EINVAL; \
>> -raw_spin_lock_irq(&pci_lock);   \
>> -if (unlikely(dev->block_cfg_access))\
>> -pci_wait_cfg(dev);  \
>> -ret = dev->bus->ops->read(dev->bus, dev->devfn, \
>> -pos, sizeof(type), &data);  \
>> -raw_spin_unlock_irq(&pci_lock); \
>> -*val = (type)data;  \
>> +ret = pci_read_config_##size(dev, pos, val);\
>>  return pcibios_err_to_errno(ret);   \
>>   }  \
> 
> If it wasn't for the block_cfg_access check for user access, this would
> have been a nice code reuse cleanup.

Great fresh catch! I'll get that fixed in v2.
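
For context on the block_cfg_access check, this is the pattern that sets it;
only the pci_user_*_config_*() accessors are expected to honor the flag
(sketch only; the function name is made up):

static void quiesce_user_config_access(struct pci_dev *pdev)
{
        pci_cfg_access_lock(pdev);      /* sets block_cfg_access under pci_lock */

        /*
         * ... reset or other operation that must not race with user-space
         * config reads/writes ...
         */

        pci_cfg_access_unlock(pdev);    /* clears the flag, wakes pci_cfg_wait */
}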

Alex



Re: [PATCH] PCI: Check for PCIe downtraining conditions

2018-05-31 Thread Alex_Gagniuc
On 5/31/2018 10:28 AM, Sinan Kaya wrote:
> On 5/31/2018 11:05 AM, Alexandru Gagniuc wrote:
>> +static void pcie_max_link_cap(struct pci_dev *dev, enum pci_bus_speed 
>> *speed,
>> +enum pcie_link_width *width)
>> +{
>> +uint32_t lnkcap;
>> +
>> +pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
>> +
>> +*speed = pcie_link_speed[lnkcap & PCI_EXP_LNKCAP_SLS];
>> +*width = (lnkcap & PCI_EXP_LNKCAP_MLW) >> PCI_EXP_LNKCAP_MLW_SHIFT;
>> +}
>> +
>> +static void pcie_cur_link_sta(struct pci_dev *dev, enum pci_bus_speed 
>> *speed,
>> +enum pcie_link_width *width)
>> +{
>> +uint16_t lnksta;
>> +
>> +pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
>> +*speed = pcie_link_speed[lnksta & PCI_EXP_LNKSTA_CLS];
>> +*width = (lnksta & PCI_EXP_LNKSTA_NLW) >> PCI_EXP_LNKSTA_NLW_SHIFT;
>> +}
>> +
>> +static const char *pcie_bus_speed_name(enum pci_bus_speed speed)
>> +{
>> +switch (speed) {
>> +case PCIE_SPEED_2_5GT:
>> +return "2.5 GT/s";
>> +case PCIE_SPEED_5_0GT:
>> +return "5.0 GT/s";
>> +case PCIE_SPEED_8_0GT:
>> +return "8.0 GT/s";
>> +default:
>> +return "unknown";
>> +}
>> +}
> 
> I thought Bjorn added some functions to retrieve this now.

Hmm. I couldn't find them.
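
If the helpers in question are pcie_get_speed_cap() and pcie_get_width_cap()
(which I believe recently landed in drivers/pci/pci.h along with the
PCIE_SPEED2STR() macro), a hedged sketch of using them instead of the
open-coded LNKCAP parsing above might look like:

static void pcie_report_link_caps(struct pci_dev *dev)
{
        enum pci_bus_speed speed = pcie_get_speed_cap(dev);
        enum pcie_link_width width = pcie_get_width_cap(dev);

        pci_info(dev, "max link speed %s, max link width x%d\n",
                 PCIE_SPEED2STR(speed), width);
}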

Alex


