Re: [PATCH mlx5-next v8 0/4] Dynamically assign MSI-X vectors count
> modprobe -q -r mlx5_ib mlx5_core
> 2. Ensure that driver doesn't run and it is safe to change MSI-X
> echo 0 > /sys/bus/pci/devices/\:08\:00.0/sriov_drivers_autoprobe
> 3. Load driver for the PF
> modprobe mlx5_core
> 4. Configure one of the VFs with new number
> echo 2 > /sys/bus/pci/devices/\:08\:00.0/sriov_numvfs
> echo 21 > /sys/bus/pci/devices/\:08\:00.2/sriov_vf_msix_count
>
> After this series:
> [root@server ~]# lspci -vs :08:00.2
> 08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5
> Virtual Function]
>
> Capabilities: [9c] MSI-X: Enable- Count=21 Masked-
>
> Thanks
>
> Leon Romanovsky (4):
>   PCI: Add a sysfs file to change the MSI-X table size of SR-IOV VFs
>   net/mlx5: Add dynamic MSI-X capabilities bits
>   net/mlx5: Dynamically assign MSI-X vectors count
>   net/mlx5: Implement sriov_get_vf_total_msix/count() callbacks
>
>  Documentation/ABI/testing/sysfs-bus-pci       |  29 +
>  .../net/ethernet/mellanox/mlx5/core/main.c    |   6 ++
>  .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  12 +++
>  .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  73 +
>  .../net/ethernet/mellanox/mlx5/core/sriov.c   |  48 -
>  drivers/pci/iov.c                             | 102 --
>  drivers/pci/pci-sysfs.c                       |   3 +-
>  drivers/pci/pci.h                             |   3 +-
>  include/linux/mlx5/mlx5_ifc.h                 |  11 +-
>  include/linux/pci.h                           |   8 ++
>  10 files changed, 284 insertions(+), 11 deletions(-)

Looks good to me, thanks for persevering with this.

Acked-by: Bjorn Helgaas

Minor comments on 1/4, not critical.
Re: [PATCH mlx5-next v8 1/4] PCI: Add a sysfs file to change the MSI-X table size of SR-IOV VFs
Possible subject, since this adds *two* files, not just "a file": PCI/IOV: Add sysfs MSI-X vector assignment interface On Sun, Mar 14, 2021 at 02:42:53PM +0200, Leon Romanovsky wrote: > A typical cloud provider SR-IOV use case is to create many VFs for use by > guest VMs. The VFs may not be assigned to a VM until a customer requests a > VM of a certain size, e.g., number of CPUs. A VF may need MSI-X vectors > proportional to the number of CPUs in the VM, but there is no standard way > to change the number of MSI-X vectors supported by a VF. > ... > +#ifdef CONFIG_PCI_MSI > +static ssize_t sriov_vf_msix_count_store(struct device *dev, > + struct device_attribute *attr, > + const char *buf, size_t count) > +{ > + struct pci_dev *vf_dev = to_pci_dev(dev); > + struct pci_dev *pdev = pci_physfn(vf_dev); > + int val, ret; > + > + ret = kstrtoint(buf, 0, &val); > + if (ret) > + return ret; > + > + if (val < 0) > + return -EINVAL; > + > + device_lock(&pdev->dev); > + if (!pdev->driver || !pdev->driver->sriov_set_msix_vec_count) { > + ret = -EOPNOTSUPP; > + goto err_pdev; > + } > + > + device_lock(&vf_dev->dev); > + if (vf_dev->driver) { > + /* > + * A driver is already attached to this VF and has configured > + * itself based on the current MSI-X vector count. Changing > + * the vector size could mess up the driver, so block it. > + */ > + ret = -EBUSY; > + goto err_dev; > + } > + > + ret = pdev->driver->sriov_set_msix_vec_count(vf_dev, val); > + > +err_dev: > + device_unlock(&vf_dev->dev); > +err_pdev: > + device_unlock(&pdev->dev); > + return ret ? : count; > +} > +static DEVICE_ATTR_WO(sriov_vf_msix_count); > + > +static ssize_t sriov_vf_total_msix_show(struct device *dev, > + struct device_attribute *attr, > + char *buf) > +{ > + struct pci_dev *pdev = to_pci_dev(dev); > + u32 vf_total_msix = 0; > + > + device_lock(dev); > + if (!pdev->driver || !pdev->driver->sriov_get_vf_total_msix) > + goto unlock; > + > + vf_total_msix = pdev->driver->sriov_get_vf_total_msix(pdev); > +unlock: > + device_unlock(dev); > + return sysfs_emit(buf, "%u\n", vf_total_msix); > +} > +static DEVICE_ATTR_RO(sriov_vf_total_msix); Can you reverse the order of sriov_vf_total_msix_show() and sriov_vf_msix_count_store()? Currently we have: VF stuff (msix_count_store) PF stuff (total_msix) more VF stuff related to the above (vf_dev_attrs, are_visible) so the total_msix bit is mixed in the middle. > +#endif > + > +static struct attribute *sriov_vf_dev_attrs[] = { > +#ifdef CONFIG_PCI_MSI > + &dev_attr_sriov_vf_msix_count.attr, > +#endif > + NULL, > +}; > + > +static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj, > + struct attribute *a, int n) > +{ > + struct device *dev = kobj_to_dev(kobj); > + struct pci_dev *pdev = to_pci_dev(dev); > + > + if (!pdev->is_virtfn) > + return 0; > + > + return a->mode; > +} > + > +const struct attribute_group sriov_vf_dev_attr_group = { > + .attrs = sriov_vf_dev_attrs, > + .is_visible = sriov_vf_attrs_are_visible, > +}; > + > int pci_iov_add_virtfn(struct pci_dev *dev, int id) > { > int i; > @@ -400,18 +487,21 @@ static DEVICE_ATTR_RO(sriov_stride); > static DEVICE_ATTR_RO(sriov_vf_device); > static DEVICE_ATTR_RW(sriov_drivers_autoprobe); > > -static struct attribute *sriov_dev_attrs[] = { > +static struct attribute *sriov_pf_dev_attrs[] = { This and the related sriov_pf_attrs_are_visible change below are nice. Would you mind splitting them to a preliminary patch, since they really aren't related to the concept of *this* patch? 
> &dev_attr_sriov_totalvfs.attr, > &dev_attr_sriov_numvfs.attr, > &dev_attr_sriov_offset.attr, > &dev_attr_sriov_stride.attr, > &dev_attr_sriov_vf_device.attr, > &dev_attr_sriov_drivers_autoprobe.attr, > +#ifdef CONFIG_PCI_MSI > + &dev_attr_sriov_vf_total_msix.attr, > +#endif > NULL, > }; > > -static umode_t sriov_attrs_are_visible(struct kobject *kobj, > -struct attribute *a, int n) > +static umode_t sriov_pf_attrs_are_visible(struct kobject *kobj, > + struct attribute *a, int n) > { > struct device *dev = kobj_to_dev(kobj); > > @@ -421,9 +511,9 @@ static umode_t sriov_attrs_are_visible(struct kobject > *kobj, > return a->mode; > } > > -const struct attribute_group sriov_dev_attr_group = { > - .attrs = sriov_dev_attrs, > - .is_visible = sriov_attrs_are_visible, > +const struct attribute_group sriov_pf_dev_attr_group = { > + .attrs = sriov
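For context, the PF-driver side that these two sysfs files call into is just a pair of struct pci_driver callbacks with the signatures used above. A minimal sketch of a PF driver wiring them up ("foo" names, struct foo_dev, and foo_fw_set_vf_msix_count() are invented placeholders, not the mlx5 implementation):

  static int foo_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count)
  {
          struct pci_dev *pf = pci_physfn(vf);
          struct foo_dev *fdev = pci_get_drvdata(pf);

          /* ask the device/firmware to resize this VF's MSI-X table */
          return foo_fw_set_vf_msix_count(fdev, vf, msix_vec_count);
  }

  static u32 foo_sriov_get_vf_total_msix(struct pci_dev *pf)
  {
          struct foo_dev *fdev = pci_get_drvdata(pf);

          /* total pool of vectors the PF can distribute among its VFs */
          return fdev->vf_total_msix;
  }

  static struct pci_driver foo_driver = {
          /* ... usual fields ... */
          .sriov_set_msix_vec_count = foo_sriov_set_msix_vec_count,
          .sriov_get_vf_total_msix  = foo_sriov_get_vf_total_msix,
  };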
Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
[+cc Rafael, in case you're interested in the driver core issue here] On Wed, Mar 31, 2021 at 07:08:07AM +0300, Leon Romanovsky wrote: > On Tue, Mar 30, 2021 at 03:41:41PM -0500, Bjorn Helgaas wrote: > > On Tue, Mar 30, 2021 at 04:47:16PM -0300, Jason Gunthorpe wrote: > > > On Tue, Mar 30, 2021 at 10:00:19AM -0500, Bjorn Helgaas wrote: > > > > On Tue, Mar 30, 2021 at 10:57:38AM -0300, Jason Gunthorpe wrote: > > > > > On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote: > > > > > > > > > > > I think I misunderstood Greg's subdirectory comment. We already > > > > > > have > > > > > > directories like this: > > > > > > > > > > Yes, IIRC, Greg's remark applies if you have to start creating > > > > > directories with manual kobjects. > > > > > > > > > > > and aspm_ctrl_attr_group (for "link") is nicely done with static > > > > > > attributes. So I think we could do something like this: > > > > > > > > > > > > /sys/bus/pci/devices/:01:00.0/ # PF directory > > > > > > sriov/ # SR-IOV related stuff > > > > > > vf_total_msix > > > > > > vf_msix_count_BB:DD.F# includes bus/dev/fn of first VF > > > > > > ... > > > > > > vf_msix_count_BB:DD.F# includes bus/dev/fn of last VF > > > > > > > > > > It looks a bit odd that it isn't a subdirectory, but this seems > > > > > reasonable. > > > > > > > > Sorry, I missed your point; you'll have to lay it out more explicitly. > > > > I did intend that "sriov" *is* a subdirectory of the :01:00.0 > > > > directory. The full paths would be: > > > > > > > > /sys/bus/pci/devices/:01:00.0/sriov/vf_total_msix > > > > /sys/bus/pci/devices/:01:00.0/sriov/vf_msix_count_BB:DD.F > > > > ... > > > > > > Sorry, I was meaning what you first proposed: > > > > > >/sys/bus/pci/devices/:01:00.0/sriov/BB:DD.F/vf_msix_count > > > > > > Which has the extra sub directory to organize the child VFs. > > > > > > Keep in mind there is going to be alot of VFs here, > 1k - so this > > > will be a huge directory. > > > > With :01:00.0/sriov/vf_msix_count_BB:DD.F, sriov/ will contain > > 1 + 1K files ("vf_total_msix" + 1 per VF). > > > > With :01:00.0/sriov/BB:DD.F/vf_msix_count, sriov/ will contain > > 1 file and 1K subdirectories. > > This is racy by design, in order to add new file and create BB:DD.F > directory, the VF will need to do it after or during it's creation. > During PF creation it is unknown to PF those BB:DD.F values. > > The race here is due to the events of PF,VF directory already sent > but new directory structure is not ready yet. > > From code perspective, we will need to add something similar to > pci_iov_sysfs_link() with the code that you didn't like in previous > variants (the one that messes with sysfs_create_file API). > > It looks not good for large SR-IOV systems with >1K VFs with > gazillion subdirectories inside PF, while the more natural is to see > them in VF. > > So I'm completely puzzled why you want to do these files on PF and > not on VF as v0, v7 and v8 proposed. On both mlx5 and NVMe, the "assign vectors to VF" functionality is implemented by the PF, so I think it's reasonable to explore the idea of "associate the vector assignment sysfs file with the PF." Assume 1K VFs. Either way we have >1K subdirectories of /sys/devices/pci:00/. I think we should avoid an extra subdirectory level, so I think the choices on the table are: Associate "vf_msix_count" with the PF: - /sys/...//sriov/vf_total_msix# all on PF - /sys/...//sriov/vf_msix_count_BB:DD.F (1K of these). Greg says the number of these is not a problem. 
- The "vf_total_msix" and "vf_msix_count_*" files are all related and are grouped together in PF/sriov/. - The "vf_msix_count_*" files operate directly on the PF. Lock the PF for serialization, lookup and lock the VF to ensure no VF driver, call PF driver callback to assign vectors. - Requires special sysfs code to create/remove "vf_msix_count_*" files when setting/clearing VF Enable. This code could create them only when the PF driver actually supports vector assignment. Una
Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
On Tue, Mar 30, 2021 at 04:47:16PM -0300, Jason Gunthorpe wrote: > On Tue, Mar 30, 2021 at 10:00:19AM -0500, Bjorn Helgaas wrote: > > On Tue, Mar 30, 2021 at 10:57:38AM -0300, Jason Gunthorpe wrote: > > > On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote: > > > > > > > I think I misunderstood Greg's subdirectory comment. We already have > > > > directories like this: > > > > > > Yes, IIRC, Greg's remark applies if you have to start creating > > > directories with manual kobjects. > > > > > > > and aspm_ctrl_attr_group (for "link") is nicely done with static > > > > attributes. So I think we could do something like this: > > > > > > > > /sys/bus/pci/devices/:01:00.0/ # PF directory > > > > sriov/ # SR-IOV related stuff > > > > vf_total_msix > > > > vf_msix_count_BB:DD.F# includes bus/dev/fn of first VF > > > > ... > > > > vf_msix_count_BB:DD.F# includes bus/dev/fn of last VF > > > > > > It looks a bit odd that it isn't a subdirectory, but this seems > > > reasonable. > > > > Sorry, I missed your point; you'll have to lay it out more explicitly. > > I did intend that "sriov" *is* a subdirectory of the :01:00.0 > > directory. The full paths would be: > > > > /sys/bus/pci/devices/:01:00.0/sriov/vf_total_msix > > /sys/bus/pci/devices/:01:00.0/sriov/vf_msix_count_BB:DD.F > > ... > > Sorry, I was meaning what you first proposed: > >/sys/bus/pci/devices/:01:00.0/sriov/BB:DD.F/vf_msix_count > > Which has the extra sub directory to organize the child VFs. > > Keep in mind there is going to be alot of VFs here, > 1k - so this > will be a huge directory. With :01:00.0/sriov/vf_msix_count_BB:DD.F, sriov/ will contain 1 + 1K files ("vf_total_msix" + 1 per VF). With :01:00.0/sriov/BB:DD.F/vf_msix_count, sriov/ will contain 1 file and 1K subdirectories. No real difference now, but if we add more files per VF, a BB:DD.F/ subdirectory would certainly be nicer. I'm dense and don't fully understand Greg's subdirectory comment. The VF will have its own "pci/devices/:BB:DD.F/" directory, so adding sriov/BB:DD.F/ under the PF shouldn't affect any udev events or rules for the VF. I see "ATTR{power/control}" in lots of udev rules, so I guess udev could manage a single subdirectory like "ATTR{sriov/vf_total_msix}". I doubt it could do "ATTR{sriov/adm/vf_total_msix}" (with another level) or "ATTR{sriov/BBB:DD.F/vf_msix_count}" (with variable VF text in the path). But it doesn't seem like that level of control would be in a udev rule anyway. A PF udev rule might *start* a program to manage MSI-X vectors, but such a program should be able to deal with whatever directory structure we want. If my uninformed udev speculation makes sense *and* we think there will be more per-VF files later, I think I'm OK either way. Bjorn
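For what it's worth, the kind of PF udev rule speculated about above would look roughly like this (the program path is hypothetical; assumes the sriov/vf_total_msix layout discussed in this thread):

  ACTION=="add", SUBSYSTEM=="pci", ATTR{sriov/vf_total_msix}=="?*", \
          RUN+="/usr/local/bin/assign-vf-msix %k"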
[PATCH v4 3/3] ARM: iop32x: disable N2100 PCI parity reporting
From: Heiner Kallweit On the N2100, instead of just marking the r8169 chips as having broken_parity_status, disable parity error reporting for them entirely. This was the only relevant place that set broken_parity_status, so we no longer need to check for it in the r8169 error interrupt handler. [bhelgaas: squash into one patch, commit log] Link: https://lore.kernel.org/r/0c0dcbf2-5f1e-954c-ebd7-e6ccfae5c...@gmail.com Link: https://lore.kernel.org/r/9e312679-a684-e9c7-2656-420723706...@gmail.com --- arch/arm/mach-iop32x/n2100.c | 8 drivers/net/ethernet/realtek/r8169_main.c | 14 -- 2 files changed, 4 insertions(+), 18 deletions(-) diff --git a/arch/arm/mach-iop32x/n2100.c b/arch/arm/mach-iop32x/n2100.c index 78b9a5ee41c9..bf99e718f8b8 100644 --- a/arch/arm/mach-iop32x/n2100.c +++ b/arch/arm/mach-iop32x/n2100.c @@ -116,16 +116,16 @@ static struct hw_pci n2100_pci __initdata = { }; /* - * Both r8169 chips on the n2100 exhibit PCI parity problems. Set - * the ->broken_parity_status flag for both ports so that the r8169 - * driver knows it should ignore error interrupts. + * Both r8169 chips on the n2100 exhibit PCI parity problems. Turn + * off parity reporting for both ports so we don't get error interrupts + * for them. */ static void n2100_fixup_r8169(struct pci_dev *dev) { if (dev->bus->number == 0 && (dev->devfn == PCI_DEVFN(1, 0) || dev->devfn == PCI_DEVFN(2, 0))) - dev->broken_parity_status = 1; + pci_disable_parity(dev); } DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_REALTEK, PCI_ANY_ID, n2100_fixup_r8169); diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c index f704da3f214c..a6aff0d993eb 100644 --- a/drivers/net/ethernet/realtek/r8169_main.c +++ b/drivers/net/ethernet/realtek/r8169_main.c @@ -4358,20 +4358,6 @@ static void rtl8169_pcierr_interrupt(struct net_device *dev) if (net_ratelimit()) netdev_err(dev, "PCI error (cmd = 0x%04x, status_errs = 0x%04x)\n", pci_cmd, pci_status_errs); - /* -* The recovery sequence below admits a very elaborated explanation: -* - it seems to work; -* - I did not see what else could be done; -* - it makes iop3xx happy. -* -* Feel free to adjust to your needs. -*/ - if (pdev->broken_parity_status) - pci_cmd &= ~PCI_COMMAND_PARITY; - else - pci_cmd |= PCI_COMMAND_SERR | PCI_COMMAND_PARITY; - - pci_write_config_word(pdev, PCI_COMMAND, pci_cmd); rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING); } -- 2.25.1
[PATCH v4 2/3] IB/mthca: Disable parity reporting
From: Heiner Kallweit For Mellanox Tavor devices, we previously set dev->broken_parity_status, which does not change the device's behavior; it merely prevents the EDAC PCI error reporting from warning about Master Data Parity Error, Signaled System Error, or Detected Parity Error for this device. Instead, disable Parity Error Response so the device doesn't report parity errors in the first place. [bhelgaas: split out pci_disable_parity(), commit log, keep quirk static] Link: https://lore.kernel.org/r/d375987c-ea4f-dd98-4ef8-99b2fbfe7...@gmail.com --- drivers/pci/quirks.c | 13 - 1 file changed, 4 insertions(+), 9 deletions(-) diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index 653660e3ba9e..6aa9df411604 100644 --- a/drivers/pci/quirks.c +++ b/drivers/pci/quirks.c @@ -206,16 +206,11 @@ DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_ANY_ID, PCI_ANY_ID, PCI_CLASS_BRIDGE_HOST, 8, quirk_mmio_always_on); /* - * The Mellanox Tavor device gives false positive parity errors. Mark this - * device with a broken_parity_status to allow PCI scanning code to "skip" - * this now blacklisted device. + * The Mellanox Tavor device gives false positive parity errors. Disable + * parity error reporting. */ -static void quirk_mellanox_tavor(struct pci_dev *dev) -{ - dev->broken_parity_status = 1; /* This device gives false positives */ -} -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_TAVOR, quirk_mellanox_tavor); -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_TAVOR_BRIDGE, quirk_mellanox_tavor); +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_TAVOR, pci_disable_parity); +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_TAVOR_BRIDGE, pci_disable_parity); /* * Deal with broken BIOSes that neglect to enable passive release, -- 2.25.1
[PATCH v4 1/3] PCI: Add pci_disable_parity()
From: Bjorn Helgaas

Add pci_disable_parity() to disable reporting of parity errors for a
device by clearing PCI_COMMAND_PARITY.  The device will still set
PCI_STATUS_DETECTED_PARITY when it detects a parity error or receives a
Poisoned TLP, but it will not set PCI_STATUS_PARITY, which means it will
not assert PERR# (conventional PCI) or report Poisoned TLPs (PCIe).

Based-on: https://lore.kernel.org/linux-arm-kernel/d375987c-ea4f-dd98-4ef8-99b2fbfe7...@gmail.com/
Based-on-patch-by: Heiner Kallweit
Signed-off-by: Bjorn Helgaas
---
 drivers/pci/pci.c   | 17 +
 include/linux/pci.h |  1 +
 2 files changed, 18 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 16a17215f633..b1845e5e5c8f 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4453,6 +4453,23 @@ void pci_clear_mwi(struct pci_dev *dev)
 }
 EXPORT_SYMBOL(pci_clear_mwi);
 
+/**
+ * pci_disable_parity - disable parity checking for device
+ * @dev: the PCI device to operate on
+ *
+ * Disable parity checking for device @dev
+ */
+void pci_disable_parity(struct pci_dev *dev)
+{
+	u16 cmd;
+
+	pci_read_config_word(dev, PCI_COMMAND, &cmd);
+	if (cmd & PCI_COMMAND_PARITY) {
+		cmd &= ~PCI_COMMAND_PARITY;
+		pci_write_config_word(dev, PCI_COMMAND, cmd);
+	}
+}
+
 /**
  * pci_intx - enables/disables PCI INTx for device dev
  * @pdev: the PCI device to operate on
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 86c799c97b77..4eaa773115da 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1201,6 +1201,7 @@ int __must_check pci_set_mwi(struct pci_dev *dev);
 int __must_check pcim_set_mwi(struct pci_dev *dev);
 int pci_try_set_mwi(struct pci_dev *dev);
 void pci_clear_mwi(struct pci_dev *dev);
+void pci_disable_parity(struct pci_dev *dev);
 void pci_intx(struct pci_dev *dev, int enable);
 bool pci_check_and_mask_intx(struct pci_dev *dev);
 bool pci_check_and_unmask_intx(struct pci_dev *dev);
-- 
2.25.1
[PATCH v4 0/3] PCI: Disable parity checking
From: Bjorn Helgaas

I think this is essentially the same as Heiner's v3 posting, with these
changes:

  - Added a pci_disable_parity() interface in pci.c instead of making a
    public pci_quirk_broken_parity() because quirks.c is only compiled
    when CONFIG_PCI_QUIRKS=y.

  - Removed the setting of dev->broken_parity_status because it's really
    only used by EDAC error reporting, and if we disable parity error
    reporting, we shouldn't get there.  This change will be visible in
    the sysfs "broken_parity_status" file, but I doubt that's important.

I dropped Leon's reviewed-by because I fiddled with the code.  Similarly
I haven't added your signed-off-by, Heiner, because I don't want you
blamed for my errors.  But if this looks OK to you I'll add it.

v1: https://lore.kernel.org/r/a6f09e1b-4076-59d1-a4e3-05c5955bf...@gmail.com
v2: https://lore.kernel.org/r/bbc33d9b-af7c-8910-cdb3-fa3e3b2e3...@gmail.com
    - reduce scope of N2100 change to using the new PCI core quirk
v3: https://lore.kernel.org/r/992c800e-2e12-16b0-4845-6311b295d...@gmail.com/
    - improve commit message of patch 2
v4:
    - add pci_disable_parity() (not conditional on CONFIG_PCI_QUIRKS)
    - remove setting of dev->broken_parity_status

Bjorn Helgaas (1):
  PCI: Add pci_disable_parity()

Heiner Kallweit (2):
  IB/mthca: Disable parity reporting
  ARM: iop32x: disable N2100 PCI parity reporting

 arch/arm/mach-iop32x/n2100.c              |  8
 drivers/net/ethernet/realtek/r8169_main.c | 14 --
 drivers/pci/pci.c                         | 17 +
 drivers/pci/quirks.c                      | 13 -
 include/linux/pci.h                       |  1 +
 5 files changed, 26 insertions(+), 27 deletions(-)

-- 
2.25.1
Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
On Tue, Mar 30, 2021 at 10:57:38AM -0300, Jason Gunthorpe wrote: > On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote: > > > I think I misunderstood Greg's subdirectory comment. We already have > > directories like this: > > Yes, IIRC, Greg's remark applies if you have to start creating > directories with manual kobjects. > > > and aspm_ctrl_attr_group (for "link") is nicely done with static > > attributes. So I think we could do something like this: > > > > /sys/bus/pci/devices/:01:00.0/ # PF directory > > sriov/ # SR-IOV related stuff > > vf_total_msix > > vf_msix_count_BB:DD.F# includes bus/dev/fn of first VF > > ... > > vf_msix_count_BB:DD.F# includes bus/dev/fn of last VF > > It looks a bit odd that it isn't a subdirectory, but this seems > reasonable. Sorry, I missed your point; you'll have to lay it out more explicitly. I did intend that "sriov" *is* a subdirectory of the :01:00.0 directory. The full paths would be: /sys/bus/pci/devices/:01:00.0/sriov/vf_total_msix /sys/bus/pci/devices/:01:00.0/sriov/vf_msix_count_BB:DD.F ... > > For NVMe, a write to vf_msix_count_* would have to auto-offline the VF > > before asking the PF to assign the vectors, as Jason suggests above. > > It is also not awful if it returns EBUSY if the admin hasn't done > some device-specific offline sequence. Agreed. The concept of "offline" is not visible in this interface. > I'm just worried adding the idea of offline here is going to open a > huge can of worms in terms of defining what it means, and the very > next ask will be to start all VFs in offline mode. This would be some > weird overlap with the no-driver-autoprobing sysfs. We've been > thinking about this alot here and there are not easy answers. We haven't added any idea of offline in the sysfs interface. I'm only trying to figure out whether it would be possible to use this interface on top of devices with an offline concept, e.g., NVMe. > mlx5 sort of has an offline concept too, but we have been modeling it > in devlink, which is kind of like nvme-cli for networking. > > Jason
Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
On Fri, Mar 26, 2021 at 04:01:48PM -0300, Jason Gunthorpe wrote: > On Fri, Mar 26, 2021 at 11:50:44AM -0700, Alexander Duyck wrote: > > > My concern would be that we are defining the user space interface. > > Once we have this working as a single operation I could see us having > > to support it that way going forward as somebody will script something > > not expecting an "offline" sysfs file, and the complaint would be that > > we are breaking userspace if we require the use of an "offline" > > file. > > Well, we wouldn't do that. The semantic we define here is that the > msix_count interface 'auto-offlines' if that is what is required. If > we add some formal offline someday then 'auto-offline' would be a NOP > when the device is offline and do the same online/offline sequence as > today if it isn't. Alexander, Keith, any more thoughts on this? I think I misunderstood Greg's subdirectory comment. We already have directories like this: /sys/bus/pci/devices/:01:00.0/link/ /sys/bus/pci/devices/:01:00.0/msi_irqs/ /sys/bus/pci/devices/:01:00.0/power/ and aspm_ctrl_attr_group (for "link") is nicely done with static attributes. So I think we could do something like this: /sys/bus/pci/devices/:01:00.0/ # PF directory sriov/ # SR-IOV related stuff vf_total_msix vf_msix_count_BB:DD.F# includes bus/dev/fn of first VF ... vf_msix_count_BB:DD.F# includes bus/dev/fn of last VF And I think this could support the mlx5 model as well as the NVMe model. For NVMe, a write to vf_msix_count_* would have to auto-offline the VF before asking the PF to assign the vectors, as Jason suggests above. Before VF Enable is set, the vf_msix_count_* files wouldn't exist and we wouldn't be able to assign vectors to VFs; IIUC that's a difference from the NVMe interface, but maybe not a terrible one? I'm not proposing changing nvme-cli to use this, but if the interface is general enough to support both, that would be a good clue that it might be able to support future devices with similar functionality. Bjorn
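To make the proposed layout concrete, usage would look roughly like this (a sketch only; the PF/VF addresses and vector count are made up):

  # total vectors the PF driver can distribute among its VFs
  cat /sys/bus/pci/devices/0000:01:00.0/sriov/vf_total_msix

  # assign 8 vectors to the VF at 01:00.2; for an NVMe-style device the
  # PF driver would auto-offline the VF internally before assigning
  echo 8 > /sys/bus/pci/devices/0000:01:00.0/sriov/vf_msix_count_01:00.2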
Re: [PATCH] PCI: Remove pci_try_set_mwi
On Sun, Mar 28, 2021 at 12:04:35AM +0100, Heiner Kallweit wrote:
> On 26.03.2021 22:26, Bjorn Helgaas wrote:
> > [+cc Randy, Andrew (though I'm sure you have zero interest in this
> > ancient question :))]
> >
> > On Wed, Dec 09, 2020 at 09:31:21AM +0100, Heiner Kallweit wrote:
> >> pci_set_mwi() and pci_try_set_mwi() do exactly the same, just that the
> >> former one is declared as __must_check. However also some callers of
> >> pci_set_mwi() have a comment that it's an optional feature. I don't
> >> think there's much sense in this separation and the use of
> >> __must_check. Therefore remove pci_try_set_mwi() and remove the
> >> __must_check attribute from pci_set_mwi().
> >> I don't expect either function to be used in new code anyway.
> >
> > There's not much I like better than removing things.  But some
> > significant thought went into adding pci_try_set_mwi() in the first
> > place, so I need a little more convincing about why it's safe to
> > remove it.
>
> Thanks for the link to the 13 yrs old discussion. Unfortunately it
> doesn't mention any real argument for the __must_check, just:
>
> "And one of the reasons for adding the __must_check annotation is to
> weed out design errors."
> And the very next response in the discussion calls this a "non-argument".
> Plus not mentioning what the other reasons could be.

I think you're referring to Alan's response [1]:

  akpm> And we *need* to be excessively anal in the PCI setup code.
  akpm> We have metric shitloads of bugs due to problems in that area,
  akpm> and the more formality and error handling and error reporting
  akpm> we can get in there the better off we will be.

  ac> No argument there

So Alan is actually *agreeing* that "we need to be excessively anal in
the PCI setup code," not saying that "weeding out design errors is not
an argument for __must_check."

> Currently we have three ancient drivers that bail out if the call fails.
> Most callers of pci_set_mwi() use the return code only to emit an
> error message, but they proceed normally. Majority of users calls
> pci_try_set_mwi(). And as stated in the commit message I don't expect
> any new usage of pci_set_mwi().

I would love to merge this patch.  We just need to clarify the commit
log.  Right now the only justification is "I don't think there's much
sense in the __must_check annotation," which may well be true but
could use some support.

If MWI is purely an optimization and there's never a functional
problem if pci_set_mwi() fails, we should say that (and maybe update
any drivers that bail out on failure).  Andrew and Alan both seem to
agree that MWI *is* purely advisory:

  akpm> pci_set_mwi() is an advisory thing, and on certain platforms
  akpm> it might fail to set the cacheline size to the desired number.
  akpm> This is not a fatal error and the driver can successfully run
  akpm> at a lesser performance level.

  ac> Correct

But even after that, Andrew proposed adding pci_try_set_mwi().  So it
makes sense to really understand what was going on there so we don't
break something in the name of cleaning it up.

[1] https://lore.kernel.org/linux-ide/20070405211609.5263d...@the-village.bc.nu/

> > The argument should cite the discussion about adding it.  I think one
> > of the earliest conversations is here:
> > https://lore.kernel.org/linux-ide/20070404213704.224128ec.randy.dun...@oracle.com/
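For reference, the two calling patterns being discussed look like this in driver probe paths (illustrative fragment only, not taken from a specific driver):

  /* "advisory" style: MWI is a performance hint, failure is ignored */
  pci_try_set_mwi(pdev);

  /* "__must_check" style used by the few old drivers that bail out */
  rc = pci_set_mwi(pdev);
  if (rc)
          goto err_disable_device;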
Re: [PATCH] PCI: Remove pci_try_set_mwi
On Fri, Mar 26, 2021 at 11:42:46PM +0200, Andy Shevchenko wrote: > On Fri, Mar 26, 2021 at 04:26:55PM -0500, Bjorn Helgaas wrote: > > [+cc Randy, Andrew (though I'm sure you have zero interest in this > > ancient question :))] > > > > On Wed, Dec 09, 2020 at 09:31:21AM +0100, Heiner Kallweit wrote: > > > pci_set_mwi() and pci_try_set_mwi() do exactly the same, just that the > > > former one is declared as __must_check. However also some callers of > > > pci_set_mwi() have a comment that it's an optional feature. I don't > > > think there's much sense in this separation and the use of > > > __must_check. Therefore remove pci_try_set_mwi() and remove the > > > __must_check attribute from pci_set_mwi(). > > > I don't expect either function to be used in new code anyway. > > > > There's not much I like better than removing things. But some > > significant thought went into adding pci_try_set_mwi() in the first > > place, so I need a little more convincing about why it's safe to > > remove it. > > > > The argument should cite the discussion about adding it. I think one > > of the earliest conversations is here: > > https://lore.kernel.org/linux-ide/20070404213704.224128ec.randy.dun...@oracle.com/ > > It's solely PCI feature which is absent on PCIe. > > So, if there is a guarantee that the driver never services a device connected > to old PCI bus, it's okay to remove the call (it's no-op on PCIe anyway). Yes, I'm aware that MWI is a no-op on PCIe. If we want to argue that we don't need to support Conventional PCI devices, that should be explicit, and we could remove pci_set_mwi() completely. But I don't think we're ready to drop Conventional PCI support. > OTOH, PCI core may try MWI itself for every device (but this is an opposite, > what should we do on broken devices that do change their state based on that > bit while violating specification). > > In any case > > Acked-by: Andy Shevchenko Thanks! Bjorn
Re: [PATCH] PCI: Remove pci_try_set_mwi
[+cc Randy, Andrew (though I'm sure you have zero interest in this ancient question :))] On Wed, Dec 09, 2020 at 09:31:21AM +0100, Heiner Kallweit wrote: > pci_set_mwi() and pci_try_set_mwi() do exactly the same, just that the > former one is declared as __must_check. However also some callers of > pci_set_mwi() have a comment that it's an optional feature. I don't > think there's much sense in this separation and the use of > __must_check. Therefore remove pci_try_set_mwi() and remove the > __must_check attribute from pci_set_mwi(). > I don't expect either function to be used in new code anyway. There's not much I like better than removing things. But some significant thought went into adding pci_try_set_mwi() in the first place, so I need a little more convincing about why it's safe to remove it. The argument should cite the discussion about adding it. I think one of the earliest conversations is here: https://lore.kernel.org/linux-ide/20070404213704.224128ec.randy.dun...@oracle.com/ > Signed-off-by: Heiner Kallweit > --- > patch applies on top of pci/misc for v5.11 > --- > Documentation/PCI/pci.rst | 5 + > drivers/ata/pata_cs5530.c | 2 +- > drivers/ata/sata_mv.c | 2 +- > drivers/dma/dw/pci.c | 2 +- > drivers/dma/hsu/pci.c | 2 +- > drivers/ide/cs5530.c | 2 +- > drivers/mfd/intel-lpss-pci.c | 2 +- > drivers/net/ethernet/adaptec/starfire.c | 2 +- > drivers/net/ethernet/alacritech/slicoss.c | 2 +- > drivers/net/ethernet/dec/tulip/tulip_core.c | 5 + > drivers/net/ethernet/sun/cassini.c| 4 ++-- > drivers/net/wireless/intersil/p54/p54pci.c| 2 +- > .../intersil/prism54/islpci_hotplug.c | 3 +-- > .../wireless/realtek/rtl818x/rtl8180/dev.c| 2 +- > drivers/pci/pci.c | 19 --- > drivers/scsi/3w-9xxx.c| 4 ++-- > drivers/scsi/3w-sas.c | 4 ++-- > drivers/scsi/csiostor/csio_init.c | 2 +- > drivers/scsi/lpfc/lpfc_init.c | 2 +- > drivers/scsi/qla2xxx/qla_init.c | 8 > drivers/scsi/qla2xxx/qla_mr.c | 2 +- > drivers/tty/serial/8250/8250_lpss.c | 2 +- > drivers/usb/chipidea/ci_hdrc_pci.c| 2 +- > drivers/usb/gadget/udc/amd5536udc_pci.c | 2 +- > drivers/usb/gadget/udc/net2280.c | 2 +- > drivers/usb/gadget/udc/pch_udc.c | 2 +- > include/linux/pci.h | 5 ++--- > 27 files changed, 33 insertions(+), 60 deletions(-) > > diff --git a/Documentation/PCI/pci.rst b/Documentation/PCI/pci.rst > index 814b40f83..120362cc9 100644 > --- a/Documentation/PCI/pci.rst > +++ b/Documentation/PCI/pci.rst > @@ -226,10 +226,7 @@ If the PCI device can use the PCI > Memory-Write-Invalidate transaction, > call pci_set_mwi(). This enables the PCI_COMMAND bit for Mem-Wr-Inval > and also ensures that the cache line size register is set correctly. > Check the return value of pci_set_mwi() as not all architectures > -or chip-sets may support Memory-Write-Invalidate. Alternatively, > -if Mem-Wr-Inval would be nice to have but is not required, call > -pci_try_set_mwi() to have the system do its best effort at enabling > -Mem-Wr-Inval. > +or chip-sets may support Memory-Write-Invalidate. 
> > > Request MMIO/IOP resources > diff --git a/drivers/ata/pata_cs5530.c b/drivers/ata/pata_cs5530.c > index ad75d02b6..8654b3ae1 100644 > --- a/drivers/ata/pata_cs5530.c > +++ b/drivers/ata/pata_cs5530.c > @@ -214,7 +214,7 @@ static int cs5530_init_chip(void) > } > > pci_set_master(cs5530_0); > - pci_try_set_mwi(cs5530_0); > + pci_set_mwi(cs5530_0); > > /* >* Set PCI CacheLineSize to 16-bytes: > diff --git a/drivers/ata/sata_mv.c b/drivers/ata/sata_mv.c > index 664ef658a..ee37755ea 100644 > --- a/drivers/ata/sata_mv.c > +++ b/drivers/ata/sata_mv.c > @@ -4432,7 +4432,7 @@ static int mv_pci_init_one(struct pci_dev *pdev, > mv_print_info(host); > > pci_set_master(pdev); > - pci_try_set_mwi(pdev); > + pci_set_mwi(pdev); > return ata_host_activate(host, pdev->irq, mv_interrupt, IRQF_SHARED, >IS_GEN_I(hpriv) ? &mv5_sht : &mv6_sht); > } > diff --git a/drivers/dma/dw/pci.c b/drivers/dma/dw/pci.c > index 1142aa6f8..1c20b7485 100644 > --- a/drivers/dma/dw/pci.c > +++ b/drivers/dma/dw/pci.c > @@ -30,7 +30,7 @@ static int dw_pci_probe(struct pci_dev *pdev, const struct > pci_device_id *pid) > } > > pci_set_master(pdev); > - pci_try_set_mwi(pdev); > + pci_set_mwi(pdev); > > ret = pci_set_dma_mask(pdev, DMA_BIT_MASK(32)); > if (ret) > diff --git a/drivers/dma/hsu/pci.c b/drivers/dma/hsu/pci.c > index 07cc7320a..420dd3706 100644 > --- a/drivers/dma/hsu/p
Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
On Fri, Mar 26, 2021 at 11:50:44AM -0700, Alexander Duyck wrote: > I almost wonder if it wouldn't make sense to just partition this up to > handle flexible resources in the future. Maybe something like having > the directory setup such that you have "sriov_resources/msix/" and > then you could have individual files with one for the total and the > rest with the VF BDF naming scheme. Then if we have to, we could add > other subdirectories in the future to handle things like queues in the > future. Subdirectories would be nice, but Greg KH said earlier in a different context that there's an issue with them [1]. He went on to say tools like udev would miss uevents for the subdirs [2]. I don't know whether that's really a problem in this case -- it doesn't seem like we would care about uevents for files that do MSI-X vector assignment. [1] https://lore.kernel.org/linux-pci/20191121211017.ga854...@kroah.com/ [2] https://lore.kernel.org/linux-pci/20191124170207.ga2267...@kroah.com/
Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
On Fri, Mar 26, 2021 at 09:00:50AM -0700, Alexander Duyck wrote: > On Thu, Mar 25, 2021 at 11:44 PM Leon Romanovsky wrote: > > On Thu, Mar 25, 2021 at 03:28:36PM -0300, Jason Gunthorpe wrote: > > > On Thu, Mar 25, 2021 at 01:20:21PM -0500, Bjorn Helgaas wrote: > > > > On Thu, Mar 25, 2021 at 02:36:46PM -0300, Jason Gunthorpe wrote: > > > > > On Thu, Mar 25, 2021 at 12:21:44PM -0500, Bjorn Helgaas wrote: > > > > > > > > > > > NVMe and mlx5 have basically identical functionality in this > > > > > > respect. > > > > > > Other devices and vendors will likely implement similar > > > > > > functionality. > > > > > > It would be ideal if we had an interface generic enough to support > > > > > > them all. > > > > > > > > > > > > Is the mlx5 interface proposed here sufficient to support the NVMe > > > > > > model? I think it's close, but not quite, because the the NVMe > > > > > > "offline" state isn't explicitly visible in the mlx5 model. > > > > > > > > > > I thought Keith basically said "offline" wasn't really useful as a > > > > > distinct idea. It is an artifact of nvme being a standards body > > > > > divorced from the operating system. > > > > > > > > > > In linux offline and no driver attached are the same thing, you'd > > > > > never want an API to make a nvme device with a driver attached offline > > > > > because it would break the driver. > > > > > > > > I think the sticky part is that Linux driver attach is not visible to > > > > the hardware device, while the NVMe "offline" state *is*. An NVMe PF > > > > can only assign resources to a VF when the VF is offline, and the VF > > > > is only usable when it is online. > > > > > > > > For NVMe, software must ask the PF to make those online/offline > > > > transitions via Secondary Controller Offline and Secondary Controller > > > > Online commands [1]. How would this be integrated into this sysfs > > > > interface? > > > > > > Either the NVMe PF driver tracks the driver attach state using a bus > > > notifier and mirrors it to the offline state, or it simply > > > offline/onlines as part of the sequence to program the MSI change. > > > > > > I don't see why we need any additional modeling of this behavior. > > > > > > What would be the point of onlining a device without a driver? > > > > Agree, we should remember that we are talking about Linux kernel model > > and implementation, where _no_driver_ means _offline_. > > The only means you have of guaranteeing the driver is "offline" is by > holding on the device lock and checking it. So it is only really > useful for one operation and then you have to release the lock. The > idea behind having an "offline" state would be to allow you to > aggregate multiple potential operations into a single change. > > For example you would place the device offline, then change > interrupts, and then queues, and then you could online it again. The > kernel code could have something in place to prevent driver load on > "offline" devices. What it gives you is more of a transactional model > versus what you have right now which is more of a concurrent model. Thanks, Alex. Leon currently does enforce the "offline" situation by holding the VF device lock while checking that it has no driver and asking the PF to do the assignment. I agree this is only useful for a single operation. Would the current series *prevent* a transactional model from being added later if it turns out to be useful? 
I think I can imagine keeping the same sysfs files but changing the
implementation to check for the VF being offline, while adding something
new to control online/offline.

I also want to resurrect your idea of associating "sriov_vf_msix_count"
with the PF instead of the VF.  I really like that idea, and it better
reflects the way both mlx5 and NVMe work.  I don't think there was a
major objection to it, but the discussion seems to have petered out
after your suggestion of putting the PCI bus/device/function in the
filename, which I also like [1].

Leon has implemented a ton of variations, but I don't think having all
the files in the PF directory was one of them.

Bjorn

[1] https://lore.kernel.org/r/cakgt0ue363fzewqgua1uaayotuyh8qpeadw1u5yfns7xkol...@mail.gmail.com
Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
On Thu, Mar 25, 2021 at 02:36:46PM -0300, Jason Gunthorpe wrote: > On Thu, Mar 25, 2021 at 12:21:44PM -0500, Bjorn Helgaas wrote: > > > NVMe and mlx5 have basically identical functionality in this respect. > > Other devices and vendors will likely implement similar functionality. > > It would be ideal if we had an interface generic enough to support > > them all. > > > > Is the mlx5 interface proposed here sufficient to support the NVMe > > model? I think it's close, but not quite, because the the NVMe > > "offline" state isn't explicitly visible in the mlx5 model. > > I thought Keith basically said "offline" wasn't really useful as a > distinct idea. It is an artifact of nvme being a standards body > divorced from the operating system. > > In linux offline and no driver attached are the same thing, you'd > never want an API to make a nvme device with a driver attached offline > because it would break the driver. I think the sticky part is that Linux driver attach is not visible to the hardware device, while the NVMe "offline" state *is*. An NVMe PF can only assign resources to a VF when the VF is offline, and the VF is only usable when it is online. For NVMe, software must ask the PF to make those online/offline transitions via Secondary Controller Offline and Secondary Controller Online commands [1]. How would this be integrated into this sysfs interface? > So I think it is good as is (well one of the 8 versions anyhow). > > Keith didn't go into detail why the queue allocations in nvme were any > different than the queue allocations in mlx5. I expect they can > probably work the same where the # of interrupts is an upper bound on > the # of CPUs that can get queues and the device, once instantiated, > could be configured for the number of queues to actually operate, if > it wants. I don't really care about the queue allocations. I don't think we need to solve those here; we just need to make sure that what we do here doesn't preclude NVMe queue allocations. Bjorn [1] NVMe 1.4a, sec 5.22
Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
On Thu, Mar 11, 2021 at 05:44:24PM -0400, Jason Gunthorpe wrote: > On Fri, Mar 12, 2021 at 05:50:34AM +0900, Keith Busch wrote: > > On Thu, Mar 11, 2021 at 04:22:34PM -0400, Jason Gunthorpe wrote: > > > On Thu, Mar 11, 2021 at 12:16:02PM -0700, Keith Busch wrote: > > > > On Thu, Mar 11, 2021 at 12:17:29PM -0600, Bjorn Helgaas wrote: > > > > > On Wed, Mar 10, 2021 at 03:34:01PM -0800, Alexander Duyck wrote: > > > > > > > > > > > > I'm not so much worried about management software as the > > > > > > fact that this is a vendor specific implementation detail > > > > > > that is shaping how the kernel interfaces are meant to > > > > > > work. Other than the mlx5 I don't know if there are any > > > > > > other vendors really onboard with this sort of solution. > > > > > > > > > > I know this is currently vendor-specific, but I thought the > > > > > value proposition of dynamic configuration of VFs for > > > > > different clients sounded compelling enough that other > > > > > vendors would do something similar. But I'm not an SR-IOV > > > > > guy and have no vendor insight, so maybe that's not the > > > > > case? > > > > > > > > NVMe has a similar feature defined by the standard where a PF > > > > controller can dynamically assign MSIx vectors to VFs. The > > > > whole thing is managed in user space with an ioctl, though. I > > > > guess we could wire up the driver to handle it through this > > > > sysfs interface too, but I think the protocol specific tooling > > > > is more appropriate for nvme. > > > > > > Really? Why not share a common uAPI? > > > > We associate interrupt vectors with other dynamically assigned > > nvme specific resources (IO queues), and these are not always > > allocated 1:1. > > mlx5 is doing that too, the end driver gets to assign the MSI vector > to a CPU and then dynamically attach queues to it. > > I'm not sure I get why nvme would want to link those two things as > the CPU assignment and queue attach could happen in a VM while the > MSIX should be in the host? > > > A common uAPI for MSIx only gets us half way to configuring the > > VFs for that particular driver. > > > > > Do you have a standards reference for this? > > > > Yes, sections 5.22 and 8.5 from this spec: > > > > > > https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4a-2020.03.09-Ratified.pdf > > > > An example of open source tooling implementing this is nvme-cli's > > "nvme virt-mgmt" command. > > Oh it is fascinating! 8.5.2 looks like exactly the same thing being > implemented here for mlx5, including changing the "Read only" config > space value > > Still confused why this shouldn't be the same API?? NVMe and mlx5 have basically identical functionality in this respect. Other devices and vendors will likely implement similar functionality. It would be ideal if we had an interface generic enough to support them all. Is the mlx5 interface proposed here sufficient to support the NVMe model? I think it's close, but not quite, because the the NVMe "offline" state isn't explicitly visible in the mlx5 model. I'd like to see an argument that nvme-cli *could* be implemented on top of this. nvme-cli uses an ioctl and we may not want to reimplement it with a new interface, but if Leon's interface is *capable* of supporting the NVMe model, it's a good clue that it may also work for future devices. If this isn't quite enough to support the NVMe model, what would it take to get there? Bjorn
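For reference, the nvme-cli flow Keith points to looks roughly like this (a sketch only; the option spellings are assumptions to be checked against nvme-cli, and the action/resource codes are the NVMe 1.4 values: 7 = secondary controller offline, 8 = assign, 9 = online, resource type 1 = VI/interrupt resources):

  nvme virt-mgmt /dev/nvme0 --cntlid=32 --rt=1 --act=7          # offline the secondary controller
  nvme virt-mgmt /dev/nvme0 --cntlid=32 --rt=1 --act=8 --nr=8   # assign 8 interrupt resources
  nvme virt-mgmt /dev/nvme0 --cntlid=32 --rt=1 --act=9          # bring it online again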
Re: [PATCH next-queue v3 1/3] Revert "PCI: Make pci_enable_ptm() private"
On Mon, Mar 22, 2021 at 09:18:20AM -0700, Vinicius Costa Gomes wrote: > Make pci_enable_ptm() accessible from the drivers. > > Even if PTM still works on the platform I am using without calling > this function, it might be possible that it's not always the case. I don't understand the value of this paragraph. The rest of it makes good sense (although I think we might want to add a wrapper as I mentioned elsewhere). > Exposing this to the driver enables the driver to use the > 'ptm_enabled' field of 'pci_dev' to check if PTM is enabled or not. > > This reverts commit ac6c26da29c12fa511c877c273ed5c939dc9e96c. > > Signed-off-by: Vinicius Costa Gomes > Acked-by: Bjorn Helgaas > --- > drivers/pci/pci.h | 3 --- > include/linux/pci.h | 7 +++ > 2 files changed, 7 insertions(+), 3 deletions(-) > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h > index ef7c4661314f..2c61557e1cc1 100644 > --- a/drivers/pci/pci.h > +++ b/drivers/pci/pci.h > @@ -599,11 +599,8 @@ static inline void pcie_ecrc_get_policy(char *str) { } > > #ifdef CONFIG_PCIE_PTM > void pci_ptm_init(struct pci_dev *dev); > -int pci_enable_ptm(struct pci_dev *dev, u8 *granularity); > #else > static inline void pci_ptm_init(struct pci_dev *dev) { } > -static inline int pci_enable_ptm(struct pci_dev *dev, u8 *granularity) > -{ return -EINVAL; } > #endif > > struct pci_dev_reset_methods { > diff --git a/include/linux/pci.h b/include/linux/pci.h > index 86c799c97b77..3d3dc07eac3b 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -1610,6 +1610,13 @@ static inline bool pci_aer_available(void) { return > false; } > > bool pci_ats_disabled(void); > > +#ifdef CONFIG_PCIE_PTM > +int pci_enable_ptm(struct pci_dev *dev, u8 *granularity); > +#else > +static inline int pci_enable_ptm(struct pci_dev *dev, u8 *granularity) > +{ return -EINVAL; } > +#endif > + > void pci_cfg_access_lock(struct pci_dev *dev); > bool pci_cfg_access_trylock(struct pci_dev *dev); > void pci_cfg_access_unlock(struct pci_dev *dev); > -- > 2.31.0 >
Re: [PATCH next-queue v3 3/3] igc: Add support for PTP getcrosststamp()
On Mon, Mar 22, 2021 at 09:18:22AM -0700, Vinicius Costa Gomes wrote: > i225 has support for PCIe PTM, which allows us to implement support > for the PTP_SYS_OFFSET_PRECISE ioctl(), implemented in the driver via > the getcrosststamp() function. > +static bool igc_is_ptm_supported(struct igc_adapter *adapter) > +{ > +#if IS_ENABLED(CONFIG_X86_TSC) && IS_ENABLED(CONFIG_PCIE_PTM) > + return adapter->pdev->ptm_enabled; > +#endif It's not obvious why you make this x86-specific. Maybe a comment? You shouldn't have to test for CONFIG_PCIE_PTM, either. We probably should have a pdev->ptm_enabled() predicate with a stub that returns false when CONFIG_PCIE_PTM is not set. > + return false; > +} > +/* PCIe Registers */ > +#define IGC_PTM_CTRL 0x12540 /* PTM Control */ > +#define IGC_PTM_STAT 0x12544 /* PTM Status */ > +#define IGC_PTM_CYCLE_CTRL 0x1254C /* PTM Cycle Control */ > + > +/* PTM Time registers */ > +#define IGC_PTM_T1_TIM0_L0x12558 /* T1 on Timer 0 Low */ > +#define IGC_PTM_T1_TIM0_H0x1255C /* T1 on Timer 0 High */ > + > +#define IGC_PTM_CURR_T2_L0x1258C /* Current T2 Low */ > +#define IGC_PTM_CURR_T2_H0x12590 /* Current T2 High */ > +#define IGC_PTM_PREV_T2_L0x12584 /* Previous T2 Low */ > +#define IGC_PTM_PREV_T2_H0x12588 /* Previous T2 High */ > +#define IGC_PTM_PREV_T4M10x12578 /* T4 Minus T1 on previous PTM Cycle */ > +#define IGC_PTM_CURR_T4M10x1257C /* T4 Minus T1 on this PTM Cycle */ > +#define IGC_PTM_PREV_T3M20x12580 /* T3 Minus T2 on previous PTM Cycle */ > +#define IGC_PTM_TDELAY 0x12594 /* PTM PCIe Link Delay */ > + > +#define IGC_PCIE_DIG_DELAY 0x12550 /* PCIe Digital Delay */ > +#define IGC_PCIE_PHY_DELAY 0x12554 /* PCIe PHY Delay */ I assume the above are device-specific registers, right? Nothing that would be found in the PCIe base spec? Bjorn
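Something like this is what the "predicate with a stub" suggestion amounts to (a sketch; the helper name here is an assumption and mainline may spell it differently):

  /* in include/linux/pci.h */
  #ifdef CONFIG_PCIE_PTM
  static inline bool pcie_ptm_enabled(struct pci_dev *pdev)
  {
          return pdev->ptm_enabled;
  }
  #else
  static inline bool pcie_ptm_enabled(struct pci_dev *pdev)
  {
          return false;
  }
  #endif

With that, the driver check above becomes "return pcie_ptm_enabled(adapter->pdev);" with no #if in the driver (the CONFIG_X86_TSC dependency would still need its own explanation).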
Re: [PATCH next-queue v3 2/3] igc: Enable PCIe PTM
On Mon, Mar 22, 2021 at 09:18:21AM -0700, Vinicius Costa Gomes wrote: > In practice, enabling PTM also sets the enabled_ptm flag in the PCI > device, the flag will be used for detecting if PTM is enabled before > adding support for the SYSOFFSET_PRECISE ioctl() (which is added by > implementing the getcrosststamp() PTP function). I think you're referring to the "pci_dev.ptm_enabled" flag. I'm not sure what the connection to this patch is. The SYSOFFSET_PRECISE stuff also seems to belong with some other patch. This patch merely enables PTM if it's supported (might be worth expanding Precision Time Measurement for context). > Signed-off-by: Vinicius Costa Gomes > --- > drivers/net/ethernet/intel/igc/igc_main.c | 6 ++ > 1 file changed, 6 insertions(+) > > diff --git a/drivers/net/ethernet/intel/igc/igc_main.c > b/drivers/net/ethernet/intel/igc/igc_main.c > index f77feadde8d2..04319ffae288 100644 > --- a/drivers/net/ethernet/intel/igc/igc_main.c > +++ b/drivers/net/ethernet/intel/igc/igc_main.c > @@ -12,6 +12,8 @@ > #include > #include > #include > +#include > + > #include > > #include "igc.h" > @@ -5792,6 +5794,10 @@ static int igc_probe(struct pci_dev *pdev, > > pci_enable_pcie_error_reporting(pdev); > > + err = pci_enable_ptm(pdev, NULL); > + if (err < 0) > + dev_err(&pdev->dev, "PTM not supported\n"); > + > pci_set_master(pdev); > > err = -ENOMEM; > -- > 2.31.0 >
Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
On Wed, Mar 10, 2021 at 03:34:01PM -0800, Alexander Duyck wrote: > On Wed, Mar 10, 2021 at 11:09 AM Bjorn Helgaas wrote: > > On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote: > > > On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky wrote: > > > > From: Leon Romanovsky > > > > > > > > @Alexander Duyck, please update me if I can add your ROB tag again > > > > to the series, because you liked v6 more. > > > > > > > > Thanks > > > > > > > > - > > > > Changelog > > > > v7: > > > > * Rebase on top v5.12-rc1 > > > > * More english fixes > > > > * Returned to static sysfs creation model as was implemented in v0/v1. > > > > > > Yeah, so I am not a fan of the series. The problem is there is only > > > one driver that supports this, all VFs are going to expose this sysfs, > > > and I don't know how likely it is that any others are going to > > > implement this functionality. I feel like you threw out all the > > > progress from v2-v6. > > > > pci_enable_vfs_overlay() turned up in v4, so I think v0-v3 had static > > sysfs files regardless of whether the PF driver was bound. > > > > > I really feel like the big issue is that this model is broken as you > > > have the VFs exposing sysfs interfaces that make use of the PFs to > > > actually implement. Greg's complaint was the PF pushing sysfs onto the > > > VFs. My complaint is VFs sysfs files operating on the PF. The trick is > > > to find a way to address both issues. > > > > > > Maybe the compromise is to reach down into the IOV code and have it > > > register the sysfs interface at device creation time in something like > > > pci_iov_sysfs_link if the PF has the functionality present to support > > > it. > > > > IIUC there are two questions on the table: > > > > 1) Should the sysfs files be visible only when a PF driver that > > supports MSI-X vector assignment is bound? > > > > I think this is a cosmetic issue. The presence of the file is > > not a reliable signal to management software; it must always > > tolerate files that don't exist (e.g., on old kernels) or files > > that are visible but don't work (e.g., vectors may be exhausted). > > > > If we start with the files always being visible, we should be > > able to add smarts later to expose them only when the PF driver > > is bound. > > > > My concerns with pci_enable_vf_overlay() are that it uses a > > little more sysfs internals than I'd like (although there are > > many callers of sysfs_create_files()) and it uses > > pci_get_domain_bus_and_slot(), which is generally a hack and > > creates refcounting hassles. Speaking of which, isn't v6 missing > > a pci_dev_put() to match the pci_get_domain_bus_and_slot()? > > I'm not so much worried about management software as the fact that > this is a vendor specific implementation detail that is shaping how > the kernel interfaces are meant to work. Other than the mlx5 I don't > know if there are any other vendors really onboard with this sort of > solution. I know this is currently vendor-specific, but I thought the value proposition of dynamic configuration of VFs for different clients sounded compelling enough that other vendors would do something similar. But I'm not an SR-IOV guy and have no vendor insight, so maybe that's not the case? > In addition it still feels rather hacky to be modifying read-only PCIe > configuration space on the fly via a backdoor provided by the PF. It > almost feels like this should be some sort of quirk rather than a > standard feature for an SR-IOV VF. 
I agree, I'm not 100% comfortable with modifying the read-only Table Size register. Maybe there's another approach that would be better? It *is* nice that the current approach doesn't require changes in the VF driver. > > 2) Should a VF sysfs file use the PF to implement this? > > > > Can you elaborate on your idea here? I guess > > pci_iov_sysfs_link() makes a "virtfnX" link from the PF to the > > VF, and you're thinking we could also make a "virtfnX_msix_count" > > in the PF directory? That's a really interesting idea. > > I would honestly be more comfortable if the PF owned these files > instead of the VFs. One of the things I didn't like about
Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote: > On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky wrote: > > From: Leon Romanovsky > > > > @Alexander Duyck, please update me if I can add your ROB tag again > > to the series, because you liked v6 more. > > > > Thanks > > > > - > > Changelog > > v7: > > * Rebase on top v5.12-rc1 > > * More english fixes > > * Returned to static sysfs creation model as was implemented in v0/v1. > > Yeah, so I am not a fan of the series. The problem is there is only > one driver that supports this, all VFs are going to expose this sysfs, > and I don't know how likely it is that any others are going to > implement this functionality. I feel like you threw out all the > progress from v2-v6. pci_enable_vfs_overlay() turned up in v4, so I think v0-v3 had static sysfs files regardless of whether the PF driver was bound. > I really feel like the big issue is that this model is broken as you > have the VFs exposing sysfs interfaces that make use of the PFs to > actually implement. Greg's complaint was the PF pushing sysfs onto the > VFs. My complaint is VFs sysfs files operating on the PF. The trick is > to find a way to address both issues. > > Maybe the compromise is to reach down into the IOV code and have it > register the sysfs interface at device creation time in something like > pci_iov_sysfs_link if the PF has the functionality present to support > it. IIUC there are two questions on the table: 1) Should the sysfs files be visible only when a PF driver that supports MSI-X vector assignment is bound? I think this is a cosmetic issue. The presence of the file is not a reliable signal to management software; it must always tolerate files that don't exist (e.g., on old kernels) or files that are visible but don't work (e.g., vectors may be exhausted). If we start with the files always being visible, we should be able to add smarts later to expose them only when the PF driver is bound. My concerns with pci_enable_vf_overlay() are that it uses a little more sysfs internals than I'd like (although there are many callers of sysfs_create_files()) and it uses pci_get_domain_bus_and_slot(), which is generally a hack and creates refcounting hassles. Speaking of which, isn't v6 missing a pci_dev_put() to match the pci_get_domain_bus_and_slot()? 2) Should a VF sysfs file use the PF to implement this? Can you elaborate on your idea here? I guess pci_iov_sysfs_link() makes a "virtfnX" link from the PF to the VF, and you're thinking we could also make a "virtfnX_msix_count" in the PF directory? That's a really interesting idea. > Also we might want to double check that the PF cannot be unbound while > the VF is present. I know for a while there it was possible to remove > the PF driver while the VF was present. The Mellanox drivers may not > allow it but it might not hurt to look at taking a reference against > the PF driver if you are allocating the VF MSI-X configuration sysfs > file. Unbinding the PF driver will either remove the *_msix_count files or make them stop working. Is that a problem? I'm not sure we should add a magic link that prevents driver unbinding. Seems like it would be hard for users to figure out why the driver can't be removed. Bjorn
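For reference, the refcounting pairing in question is the standard one (not mlx5-specific code):

  struct pci_dev *pf;

  pf = pci_get_domain_bus_and_slot(domain, bus, devfn);
  if (!pf)
          return -ENODEV;

  /* ... use pf ... */

  pci_dev_put(pf);        /* must balance the reference taken above */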
Re: [patch 12/14] PCI: hv: Use tasklet_disable_in_atomic()
On Tue, Mar 09, 2021 at 09:42:15AM +0100, Thomas Gleixner wrote: > From: Sebastian Andrzej Siewior > > The hv_compose_msi_msg() callback in irq_chip::irq_compose_msi_msg is > invoked via irq_chip_compose_msi_msg(), which itself is always invoked from > atomic contexts from the guts of the interrupt core code. > > There is no way to change this w/o rewriting the whole driver, so use > tasklet_disable_in_atomic() which allows to make tasklet_disable() > sleepable once the remaining atomic users are addressed. > > Signed-off-by: Sebastian Andrzej Siewior > Signed-off-by: Thomas Gleixner > Cc: "K. Y. Srinivasan" > Cc: Haiyang Zhang > Cc: Stephen Hemminger > Cc: Wei Liu > Cc: Lorenzo Pieralisi > Cc: Rob Herring > Cc: Bjorn Helgaas > Cc: linux-hyp...@vger.kernel.org > Cc: linux-...@vger.kernel.org Acked-by: Bjorn Helgaas It'd be ideal if you could merge this as a group. Let me know if you want me to do anything else. > --- > drivers/pci/controller/pci-hyperv.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > --- a/drivers/pci/controller/pci-hyperv.c > +++ b/drivers/pci/controller/pci-hyperv.c > @@ -1458,7 +1458,7 @@ static void hv_compose_msi_msg(struct ir >* Prevents hv_pci_onchannelcallback() from running concurrently >* in the tasklet. >*/ > - tasklet_disable(&channel->callback_event); > + tasklet_disable_in_atomic(&channel->callback_event); > > /* >* Since this function is called with IRQ locks held, can't >
Re: [PATCH 3/3] PCI: Convert rtw88 power cycle quirk to shutdown quirk
[+cc Rafael, linux-pm] On Thu, Mar 04, 2021 at 02:07:18PM +0800, Kai-Heng Feng wrote: > On Sat, Feb 27, 2021 at 2:17 AM Bjorn Helgaas wrote: > > On Fri, Feb 26, 2021 at 02:31:31PM +0100, Heiner Kallweit wrote: > > > On 26.02.2021 13:18, Kai-Heng Feng wrote: > > > > On Fri, Feb 26, 2021 at 8:10 PM Heiner Kallweit > > > > wrote: > > > >> > > > >> On 26.02.2021 08:12, Kalle Valo wrote: > > > >>> Kai-Heng Feng writes: > > > >>> > > > >>>> Now we have a generic D3 shutdown quirk, so convert the original > > > >>>> approach to a PCI quirk. > > > >>>> > > > >>>> Signed-off-by: Kai-Heng Feng > > > >>>> --- > > > >>>> drivers/net/wireless/realtek/rtw88/pci.c | 2 -- > > > >>>> drivers/pci/quirks.c | 6 ++ > > > >>>> 2 files changed, 6 insertions(+), 2 deletions(-) > > > >>> > > > >>> It would have been nice to CC linux-wireless also on patches 1-2. I > > > >>> only > > > >>> saw patch 3 and had to search the rest of patches from lkml. > > > >>> > > > >>> I assume this goes via the PCI tree so: > > > >>> > > > >>> Acked-by: Kalle Valo > > > >> > > > >> To me it looks odd to (mis-)use the quirk mechanism to set a device > > > >> to D3cold on shutdown. As I see it the quirk mechanism is used to work > > > >> around certain device misbehavior. And setting a device to a D3 > > > >> state on shutdown is a normal activity, and the shutdown() callback > > > >> seems to be a good place for it. > > > >> I miss an explanation what the actual benefit of the change is. > > > > > > > > To make putting device to D3 more generic, as there are more than one > > > > device need the quirk. > > > > > > > > Here's the discussion: > > > > https://lore.kernel.org/linux-usb/00de6927-3fa6-a9a3-2d65-2b4d4e8f0...@linux.intel.com/ > > > > > > > > > > Thanks for the link. For the AMD USB use case I don't have a strong > > > opinion, > > > what's considered the better option may be a question of personal taste. > > > For rtw88 however I'd still consider it over-engineering to replace a > > > simple > > > call to pci_set_power_state() with a PCI quirk. > > > I may be biased here because I find it sometimes bothering if I want to > > > look up how a device is handled and in addition to checking the respective > > > driver I also have to grep through quirks.c whether there's any special > > > handling. > > > > I haven't looked at these patches carefully, but in general, I agree > > that quirks should be used to work around hardware defects in the > > device. If the device behaves correctly per spec, we should use a > > different mechanism so the code remains generic and all devices get > > the benefit. > > > > If we do add quirks, the commit log should explain what the device > > defect is. > > So maybe it's reasonable to put all PCI devices to D3 at shutdown? I don't know off-hand. I added Rafael and linux-pm in case they do. If not, I suggest working up a patch to do that and a commit log that explains why that's a good idea and then we can have a discussion about it. This thread really doesn't have that justification. It says "putting device X in D3cold at shutdown saves 0.03w while in S5", but doesn't explain why that's safe or desirable for all devices. Bjorn
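For context, the "simple call to pci_set_power_state()" alternative Heiner mentions is just a driver shutdown hook. A hypothetical sketch, not the rtw88 code and not the quirk from this series (the quiesce step is device-specific and omitted):

#include <linux/pci.h>

static void example_shutdown(struct pci_dev *pdev)
{
	/* Quiesce the hardware first (device-specific, omitted here), then
	 * request the deepest D-state; the core falls back to D3hot if the
	 * platform cannot provide D3cold. */
	pci_set_power_state(pdev, PCI_D3cold);
}

static struct pci_driver example_pci_driver = {
	.name     = "example",
	.shutdown = example_shutdown,
	/* .id_table, .probe, .remove omitted in this sketch */
};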
Re: [PATCH 3/3] PCI: Convert rtw88 power cycle quirk to shutdown quirk
On Fri, Feb 26, 2021 at 02:31:31PM +0100, Heiner Kallweit wrote: > On 26.02.2021 13:18, Kai-Heng Feng wrote: > > On Fri, Feb 26, 2021 at 8:10 PM Heiner Kallweit > > wrote: > >> > >> On 26.02.2021 08:12, Kalle Valo wrote: > >>> Kai-Heng Feng writes: > >>> > Now we have a generic D3 shutdown quirk, so convert the original > approach to a PCI quirk. > > Signed-off-by: Kai-Heng Feng > --- > drivers/net/wireless/realtek/rtw88/pci.c | 2 -- > drivers/pci/quirks.c | 6 ++ > 2 files changed, 6 insertions(+), 2 deletions(-) > >>> > >>> It would have been nice to CC linux-wireless also on patches 1-2. I only > >>> saw patch 3 and had to search the rest of patches from lkml. > >>> > >>> I assume this goes via the PCI tree so: > >>> > >>> Acked-by: Kalle Valo > >> > >> To me it looks odd to (mis-)use the quirk mechanism to set a device > >> to D3cold on shutdown. As I see it the quirk mechanism is used to work > >> around certain device misbehavior. And setting a device to a D3 > >> state on shutdown is a normal activity, and the shutdown() callback > >> seems to be a good place for it. > >> I miss an explanation what the actual benefit of the change is. > > > > To make putting device to D3 more generic, as there are more than one > > device need the quirk. > > > > Here's the discussion: > > https://lore.kernel.org/linux-usb/00de6927-3fa6-a9a3-2d65-2b4d4e8f0...@linux.intel.com/ > > > > Thanks for the link. For the AMD USB use case I don't have a strong opinion, > what's considered the better option may be a question of personal taste. > For rtw88 however I'd still consider it over-engineering to replace a simple > call to pci_set_power_state() with a PCI quirk. > I may be biased here because I find it sometimes bothering if I want to > look up how a device is handled and in addition to checking the respective > driver I also have to grep through quirks.c whether there's any special > handling. I haven't looked at these patches carefully, but in general, I agree that quirks should be used to work around hardware defects in the device. If the device behaves correctly per spec, we should use a different mechanism so the code remains generic and all devices get the benefit. If we do add quirks, the commit log should explain what the device defect is. Bjorn
Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
On Wed, Feb 24, 2021 at 11:53:30AM +0200, Leon Romanovsky wrote: > On Tue, Feb 23, 2021 at 03:07:43PM -0600, Bjorn Helgaas wrote: > > On Sun, Feb 21, 2021 at 08:59:18AM +0200, Leon Romanovsky wrote: > > > On Sat, Feb 20, 2021 at 01:06:00PM -0600, Bjorn Helgaas wrote: > > > > On Fri, Feb 19, 2021 at 09:20:18AM +0100, Greg Kroah-Hartman wrote: > > > > > > > > > Ok, can you step back and try to explain what problem you are trying > > > > > to > > > > > solve first, before getting bogged down in odd details? I find it > > > > > highly unlikely that this is something "unique", but I could be wrong > > > > > as > > > > > I do not understand what you are wanting to do here at all. > > > > > > > > We want to add two new sysfs files: > > > > > > > > sriov_vf_total_msix, for PF devices > > > > sriov_vf_msix_count, for VF devices associated with the PF > > > > > > > > AFAICT it is *acceptable* if they are both present always. But it > > > > would be *ideal* if they were only present when a driver that > > > > implements the ->sriov_get_vf_total_msix() callback is bound to the > > > > PF. > > > > > > BTW, we already have all possible combinations: static, static with > > > folder, with and without "sriov_" prefix, dynamic with and without > > > folders on VFs. > > > > > > I need to know on which version I'll get Acked-by and that version I > > > will resubmit. > > > > I propose that you make static attributes for both files, so > > "sriov_vf_total_msix" is visible for *every* PF in the system and > > "sriov_vf_msix_count" is visible for *every* VF in the system. > > No problem, this is close to v0/v1. > > > The PF "sriov_vf_total_msix" show function can return zero if there's > > no PF driver or it doesn't support ->sriov_get_vf_total_msix(). > > (Incidentally, I think the documentation should mention that when it > > *is* supported, the contents of this file are *constant*, i.e., it > > does not decrease as vectors are assigned to VFs.) > > > > The "sriov_vf_msix_count" set function can ignore writes if there's no > > PF driver or it doesn't support ->sriov_get_vf_total_msix(), or if a > > VF driver is bound. > > Just to be clear, why don't we return EINVAL/EOPNOTSUPP instead of > silently ignore? Returning some error is fine. I just meant that the reads/writes would have no effect on the PCI core or the device driver.
Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
On Sun, Feb 21, 2021 at 08:59:18AM +0200, Leon Romanovsky wrote: > On Sat, Feb 20, 2021 at 01:06:00PM -0600, Bjorn Helgaas wrote: > > On Fri, Feb 19, 2021 at 09:20:18AM +0100, Greg Kroah-Hartman wrote: > > > > > Ok, can you step back and try to explain what problem you are trying to > > > solve first, before getting bogged down in odd details? I find it > > > highly unlikely that this is something "unique", but I could be wrong as > > > I do not understand what you are wanting to do here at all. > > > > We want to add two new sysfs files: > > > > sriov_vf_total_msix, for PF devices > > sriov_vf_msix_count, for VF devices associated with the PF > > > > AFAICT it is *acceptable* if they are both present always. But it > > would be *ideal* if they were only present when a driver that > > implements the ->sriov_get_vf_total_msix() callback is bound to the > > PF. > > BTW, we already have all possible combinations: static, static with > folder, with and without "sriov_" prefix, dynamic with and without > folders on VFs. > > I need to know on which version I'll get Acked-by and that version I > will resubmit. I propose that you make static attributes for both files, so "sriov_vf_total_msix" is visible for *every* PF in the system and "sriov_vf_msix_count" is visible for *every* VF in the system. The PF "sriov_vf_total_msix" show function can return zero if there's no PF driver or it doesn't support ->sriov_get_vf_total_msix(). (Incidentally, I think the documentation should mention that when it *is* supported, the contents of this file are *constant*, i.e., it does not decrease as vectors are assigned to VFs.) The "sriov_vf_msix_count" set function can ignore writes if there's no PF driver or it doesn't support ->sriov_get_vf_total_msix(), or if a VF driver is bound. Any userspace software must be able to deal with those scenarios anyway, so I don't think the mere presence or absence of the files is a meaningful signal to that software. If we figure out a way to make the files visible only when the appropriate driver is bound, that might be nice and could always be done later. But I don't think it's essential. Bjorn
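To illustrate the point that management software must already cope with both a missing file (old kernel) and a zero value (feature unsupported or nothing available), a rough userspace sketch; the helper name and policy are illustrative only:

#include <limits.h>
#include <stdio.h>

/* Returns the advertised pool size, 0 if the file reads as zero, or -1 if
 * the file is absent; a missing file and a zero value both mean the MSI-X
 * counts must be left alone. */
static long read_vf_total_msix(const char *pf_sysfs_dir)
{
	char path[PATH_MAX];
	long val = 0;
	FILE *f;

	snprintf(path, sizeof(path), "%s/sriov_vf_total_msix", pf_sysfs_dir);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = 0;
	fclose(f);
	return val;
}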
Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
On Fri, Feb 19, 2021 at 09:20:18AM +0100, Greg Kroah-Hartman wrote: > Ok, can you step back and try to explain what problem you are trying to > solve first, before getting bogged down in odd details? I find it > highly unlikely that this is something "unique", but I could be wrong as > I do not understand what you are wanting to do here at all. We want to add two new sysfs files: sriov_vf_total_msix, for PF devices sriov_vf_msix_count, for VF devices associated with the PF AFAICT it is *acceptable* if they are both present always. But it would be *ideal* if they were only present when a driver that implements the ->sriov_get_vf_total_msix() callback is bound to the PF.
Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
On Thu, Feb 18, 2021 at 12:15:51PM +0200, Leon Romanovsky wrote: > On Wed, Feb 17, 2021 at 12:02:39PM -0600, Bjorn Helgaas wrote: > > [+cc Greg in case he wants to chime in on the sysfs discussion. > > TL;DR: we're trying to add/remove sysfs files when a PCI driver that > > supports certain callbacks binds or unbinds; series at > > https://lore.kernel.org/r/20210209133445.700225-1-l...@kernel.org] > > > > On Tue, Feb 16, 2021 at 09:58:25PM +0200, Leon Romanovsky wrote: > > > On Tue, Feb 16, 2021 at 10:12:12AM -0600, Bjorn Helgaas wrote: > > > > On Tue, Feb 16, 2021 at 09:33:44AM +0200, Leon Romanovsky wrote: > > > > > On Mon, Feb 15, 2021 at 03:01:06PM -0600, Bjorn Helgaas wrote: > > > > > > On Tue, Feb 09, 2021 at 03:34:42PM +0200, Leon Romanovsky wrote: > > > > > > > From: Leon Romanovsky > > > > > > > > > +int pci_enable_vf_overlay(struct pci_dev *dev) > > > > > > > +{ > > > > > > > + struct pci_dev *virtfn; > > > > > > > + int id, ret; > > > > > > > + > > > > > > > + if (!dev->is_physfn || !dev->sriov->num_VFs) > > > > > > > + return 0; > > > > > > > + > > > > > > > + ret = sysfs_create_files(&dev->dev.kobj, sriov_pf_dev_attrs); > > > > > > > > > > > > But I still don't like the fact that we're calling > > > > > > sysfs_create_files() and sysfs_remove_files() directly. It makes > > > > > > complication and opportunities for errors. > > > > > > > > > > It is not different from any other code that we have in the kernel. > > > > > > > > It *is* different. There is a general rule that drivers should not > > > > call sysfs_* [1]. The PCI core is arguably not a "driver," but it is > > > > still true that callers of sysfs_create_files() are very special, and > > > > I'd prefer not to add another one. > > > > > > PCI for me is a bus, and bus is the right place to manage sysfs. > > > But it doesn't matter, we understand each other positions. > > > > > > > > Let's be concrete, can you point to the errors in this code that I > > > > > should fix? > > > > > > > > I'm not saying there are current errors; I'm saying the additional > > > > code makes errors possible in future code. For example, we hope that > > > > other drivers can use these sysfs interfaces, and it's possible they > > > > may not call pci_enable_vf_overlay() or pci_disable_vfs_overlay() > > > > correctly. > > > > > > If not, we will fix, we just need is to ensure that sysfs name won't > > > change, everything else is easy to change. > > > > > > > Or there may be races in device addition/removal. We have current > > > > issues in this area, e.g., [2], and they're fairly subtle. I'm not > > > > saying your patches have these issues; only that extra code makes more > > > > chances for mistakes and it's more work to validate it. > > > > > > > > > > I don't see the advantage of creating these files only when > > > > > > the PF driver supports this. The management tools have to > > > > > > deal with sriov_vf_total_msix == 0 and sriov_vf_msix_count == > > > > > > 0 anyway. Having the sysfs files not be present at all might > > > > > > be slightly prettier to the person running "ls", but I'm not > > > > > > sure the code complication is worth that. > > > > > > > > > > It is more than "ls", right now sriov_numvfs is visible without > > > > > relation to the driver, even if driver doesn't implement > > > > > ".sriov_configure", which IMHO bad. We didn't want to repeat. > > > > > > > > > > Right now, we have many devices that supports SR-IOV, but small > > > > > amount of them are capable to rewrite their VF MSI-X table siz. 
> > > > > We don't want "to punish" and clatter their sysfs. > > > > > > > > I agree, it's clutter, but at least it's just cosmetic clutter > > > > (but I'm willing to hear discussion about why it's more than > > > > cosmetic; see below). > > > > > > It is more
Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
On Wed, Feb 17, 2021 at 03:25:22PM -0400, Jason Gunthorpe wrote: > On Wed, Feb 17, 2021 at 12:02:39PM -0600, Bjorn Helgaas wrote: > > > > BTW, I asked more than once how these sysfs knobs should be handled > > > in the PCI/core. > > > > Thanks for the pointers. This is the first instance I can think of > > where we want to create PCI core sysfs files based on a driver > > binding, so there really isn't a precedent. > > The MSI stuff does it today, doesn't it? eg: > > virtblk_probe (this is a driver bind) > init_vq >virtio_find_vqs > vp_modern_find_vqs > vp_find_vqs > vp_find_vqs_msix >vp_request_msix_vectors > pci_alloc_irq_vectors_affinity > __pci_enable_msi_range > msi_capability_init > populate_msi_sysfs > ret = sysfs_create_groups(&pdev->dev.kobj, msi_irq_groups); > > And the sysfs is removed during pci_disable_msi(), also called by the > driver Yes, you're right, I didn't notice that one. I'm not quite convinced that we clean up correctly in all cases -- pci_disable_msix(), pci_disable_msi(), pci_free_irq_vectors(), pcim_release(), etc are called by several drivers, but in my quick look I didn't see a guaranteed-to-be-called path to the cleanup during driver unbind. I probably just missed it.
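For reference, the driver-owned lifetime being discussed looks roughly like this in a hypothetical driver; whether every unbind path is guaranteed to reach the cleanup is exactly the open question above:

#include <linux/pci.h>

static int example_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	int nvec;

	/* populate_msi_sysfs() runs under here once MSI/MSI-X is enabled */
	nvec = pci_alloc_irq_vectors(pdev, 1, 32, PCI_IRQ_MSIX | PCI_IRQ_MSI);
	if (nvec < 0)
		return nvec;

	return 0;
}

static void example_remove(struct pci_dev *pdev)
{
	/* Tears the MSI/MSI-X sysfs entries back down via
	 * pci_disable_msix()/pci_disable_msi() */
	pci_free_irq_vectors(pdev);
}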
Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
[+cc Greg in case he wants to chime in on the sysfs discussion. TL;DR: we're trying to add/remove sysfs files when a PCI driver that supports certain callbacks binds or unbinds; series at https://lore.kernel.org/r/20210209133445.700225-1-l...@kernel.org] On Tue, Feb 16, 2021 at 09:58:25PM +0200, Leon Romanovsky wrote: > On Tue, Feb 16, 2021 at 10:12:12AM -0600, Bjorn Helgaas wrote: > > On Tue, Feb 16, 2021 at 09:33:44AM +0200, Leon Romanovsky wrote: > > > On Mon, Feb 15, 2021 at 03:01:06PM -0600, Bjorn Helgaas wrote: > > > > On Tue, Feb 09, 2021 at 03:34:42PM +0200, Leon Romanovsky wrote: > > > > > From: Leon Romanovsky > > > > > +int pci_enable_vf_overlay(struct pci_dev *dev) > > > > > +{ > > > > > + struct pci_dev *virtfn; > > > > > + int id, ret; > > > > > + > > > > > + if (!dev->is_physfn || !dev->sriov->num_VFs) > > > > > + return 0; > > > > > + > > > > > + ret = sysfs_create_files(&dev->dev.kobj, sriov_pf_dev_attrs); > > > > > > > > But I still don't like the fact that we're calling > > > > sysfs_create_files() and sysfs_remove_files() directly. It makes > > > > complication and opportunities for errors. > > > > > > It is not different from any other code that we have in the kernel. > > > > It *is* different. There is a general rule that drivers should not > > call sysfs_* [1]. The PCI core is arguably not a "driver," but it is > > still true that callers of sysfs_create_files() are very special, and > > I'd prefer not to add another one. > > PCI for me is a bus, and bus is the right place to manage sysfs. > But it doesn't matter, we understand each other positions. > > > > Let's be concrete, can you point to the errors in this code that I > > > should fix? > > > > I'm not saying there are current errors; I'm saying the additional > > code makes errors possible in future code. For example, we hope that > > other drivers can use these sysfs interfaces, and it's possible they > > may not call pci_enable_vf_overlay() or pci_disable_vfs_overlay() > > correctly. > > If not, we will fix, we just need is to ensure that sysfs name won't > change, everything else is easy to change. > > > Or there may be races in device addition/removal. We have current > > issues in this area, e.g., [2], and they're fairly subtle. I'm not > > saying your patches have these issues; only that extra code makes more > > chances for mistakes and it's more work to validate it. > > > > > > I don't see the advantage of creating these files only when > > > > the PF driver supports this. The management tools have to > > > > deal with sriov_vf_total_msix == 0 and sriov_vf_msix_count == > > > > 0 anyway. Having the sysfs files not be present at all might > > > > be slightly prettier to the person running "ls", but I'm not > > > > sure the code complication is worth that. > > > > > > It is more than "ls", right now sriov_numvfs is visible without > > > relation to the driver, even if driver doesn't implement > > > ".sriov_configure", which IMHO bad. We didn't want to repeat. > > > > > > Right now, we have many devices that supports SR-IOV, but small > > > amount of them are capable to rewrite their VF MSI-X table siz. > > > We don't want "to punish" and clatter their sysfs. > > > > I agree, it's clutter, but at least it's just cosmetic clutter > > (but I'm willing to hear discussion about why it's more than > > cosmetic; see below). > > It is more than cosmetic and IMHO it is related to the driver role. > This feature is advertised, managed and configured by PF. 
It is very > natural request that the PF will view/hide those sysfs files. Agreed, it's natural if the PF driver adds/removes those files. But I don't think it's *essential*, and they *could* be static because of this: > > From the management software point of view, I don't think it matters. > > That software already needs to deal with files that don't exist (on > > old kernels) and files that contain zero (feature not supported or no > > vectors are available). I wonder if sysfs_update_group() would let us have our cake and eat it, too? Maybe we could define these files as static attributes and call sysfs_update_group() when the PF driver binds or unbind
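A rough sketch of that sysfs_update_group() idea, purely illustrative; the group and callback names are hypothetical, not the code from this series, and it assumes a static DEVICE_ATTR_RO(sriov_vf_total_msix) registered unconditionally:

#include <linux/pci.h>
#include <linux/sysfs.h>

static struct attribute *example_pf_attrs[] = {
	&dev_attr_sriov_vf_total_msix.attr,	/* assumed static attribute */
	NULL,
};

static umode_t example_pf_attrs_visible(struct kobject *kobj,
					struct attribute *a, int n)
{
	struct pci_dev *pdev = to_pci_dev(kobj_to_dev(kobj));

	/* Show the file only while a PF driver providing the callback is bound */
	if (!pdev->driver || !pdev->driver->sriov_get_vf_total_msix)
		return 0;
	return a->mode;
}

static const struct attribute_group example_pf_group = {
	.attrs      = example_pf_attrs,
	.is_visible = example_pf_attrs_visible,
};

/* Called from the PF bind/unbind paths so visibility is re-evaluated */
static void example_pf_driver_changed(struct pci_dev *pdev)
{
	sysfs_update_group(&pdev->dev.kobj, &example_pf_group);
}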
Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
Proposed subject: PCI/IOV: Add dynamic MSI-X vector assignment sysfs interface On Tue, Feb 16, 2021 at 09:33:44AM +0200, Leon Romanovsky wrote: > On Mon, Feb 15, 2021 at 03:01:06PM -0600, Bjorn Helgaas wrote: > > On Tue, Feb 09, 2021 at 03:34:42PM +0200, Leon Romanovsky wrote: > > > From: Leon Romanovsky Here's a draft of the sort of thing I'm looking for here: A typical cloud provider SR-IOV use case is to create many VFs for use by guest VMs. The VFs may not be assigned to a VM until a customer requests a VM of a certain size, e.g., number of CPUs. A VF may need MSI-X vectors proportional to the number of CPUs in the VM, but there is no standard way to change the number of MSI-X vectors supported by a VF. Some Mellanox ConnectX devices support dynamic assignment of MSI-X vectors to SR-IOV VFs. This can be done by the PF driver after VFs are enabled, and it can be done without affecting VFs that are already in use. The hardware supports a limited pool of MSI-X vectors that can be assigned to the PF or to individual VFs. This is device-specific behavior that requires support in the PF driver. Add a read-only "sriov_vf_total_msix" sysfs file for the PF and a writable "sriov_vf_msix_count" file for each VF. Management software may use these to learn how many MSI-X vectors are available and to dynamically assign them to VFs before the VFs are passed through to a VM. If the PF driver implements the ->sriov_get_vf_total_msix() callback, "sriov_vf_total_msix" contains the total number of MSI-X vectors available for distribution among VFs. If no driver is bound to the VF, writing "N" to "sriov_vf_msix_count" uses the PF driver ->sriov_set_msix_vec_count() callback to assign "N" MSI-X vectors to the VF. When a VF driver subsequently reads the MSI-X Message Control register, it will see the new Table Size "N". > > > Extend PCI sysfs interface with a new callback that allows configuration > > > of the number of MSI-X vectors for specific SR-IOV VF. This is needed > > > to optimize the performance of VFs devices by allocating the number of > > > vectors based on the administrator knowledge of the intended use of the > > > VF. > > > > > > This function is applicable for SR-IOV VF because such devices allocate > > > their MSI-X table before they will run on the VMs and HW can't guess the > > > right number of vectors, so some devices allocate them statically and > > > equally. > > > > This commit log should be clear that this functionality is motivated > > by *mlx5* behavior. The description above makes it sound like this is > > generic PCI spec behavior, and it is not. > > > > It may be a reasonable design that conforms to the spec, and we hope > > the model will be usable by other designs, but it is not required by > > the spec and AFAIK there is nothing in the spec you can point to as > > background for this. > > > > So don't *remove* the text you have above, but please *add* some > > preceding background information about how mlx5 works. > > > > > 1) The newly added /sys/bus/pci/devices/.../sriov_vf_msix_count > > > file will be seen for the VFs and it is writable as long as a driver is > > > not > > > bound to the VF. > > > > This adds /sys/bus/pci/devices/.../sriov_vf_msix_count for VF > > devices and is writable ... 
> > > > > The values accepted are: > > > * > 0 - this will be number reported by the Table Size in the VF's MSI-X > > > Message > > > Control register > > > * < 0 - not valid > > > * = 0 - will reset to the device default value > > > > = 0 - will reset to a device-specific default value > > > > > 2) In order to make management easy, provide new read-only sysfs file that > > > returns a total number of possible to configure MSI-X vectors. > > > > For PF devices, this adds a read-only > > /sys/bus/pci/devices/.../sriov_vf_total_msix file that contains the > > total number of MSI-X vectors available for distribution among VFs. > > > > Just as in sysfs-bus-pci, this file should be listed first, because > > you must read it before you can use vf_msix_count. > > No problem, I'll change, just remember that we are talking about commit > message because in Documentation file, the order is exactly as you request. Yes, I noticed that, thank you! It will be good to have them in the same order in both the commit log and the Documentation file. I think it will make more sense to readers. > > > cat /sys/bus/pci/d
Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
On Tue, Feb 09, 2021 at 03:34:42PM +0200, Leon Romanovsky wrote: > From: Leon Romanovsky > > Extend PCI sysfs interface with a new callback that allows configuration > of the number of MSI-X vectors for specific SR-IOV VF. This is needed > to optimize the performance of VFs devices by allocating the number of > vectors based on the administrator knowledge of the intended use of the VF. > > This function is applicable for SR-IOV VF because such devices allocate > their MSI-X table before they will run on the VMs and HW can't guess the > right number of vectors, so some devices allocate them statically and equally. This commit log should be clear that this functionality is motivated by *mlx5* behavior. The description above makes it sound like this is generic PCI spec behavior, and it is not. It may be a reasonable design that conforms to the spec, and we hope the model will be usable by other designs, but it is not required by the spec and AFAIK there is nothing in the spec you can point to as background for this. So don't *remove* the text you have above, but please *add* some preceding background information about how mlx5 works. > 1) The newly added /sys/bus/pci/devices/.../sriov_vf_msix_count > file will be seen for the VFs and it is writable as long as a driver is not > bound to the VF. This adds /sys/bus/pci/devices/.../sriov_vf_msix_count for VF devices and is writable ... > The values accepted are: > * > 0 - this will be number reported by the Table Size in the VF's MSI-X > Message > Control register > * < 0 - not valid > * = 0 - will reset to the device default value = 0 - will reset to a device-specific default value > 2) In order to make management easy, provide new read-only sysfs file that > returns a total number of possible to configure MSI-X vectors. For PF devices, this adds a read-only /sys/bus/pci/devices/.../sriov_vf_total_msix file that contains the total number of MSI-X vectors available for distribution among VFs. Just as in sysfs-bus-pci, this file should be listed first, because you must read it before you can use vf_msix_count. > cat /sys/bus/pci/devices/.../sriov_vf_total_msix > = 0 - feature is not supported > > 0 - total number of MSI-X vectors available for distribution among the VFs > > Signed-off-by: Leon Romanovsky > --- > Documentation/ABI/testing/sysfs-bus-pci | 28 + > drivers/pci/iov.c | 153 > include/linux/pci.h | 12 ++ > 3 files changed, 193 insertions(+) > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci > b/Documentation/ABI/testing/sysfs-bus-pci > index 25c9c39770c6..7dadc3610959 100644 > --- a/Documentation/ABI/testing/sysfs-bus-pci > +++ b/Documentation/ABI/testing/sysfs-bus-pci > @@ -375,3 +375,31 @@ Description: > The value comes from the PCI kernel device state and can be one > of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold". > The file is read only. > + > +What:/sys/bus/pci/devices/.../sriov_vf_total_msix > +Date:January 2021 > +Contact: Leon Romanovsky > +Description: > + This file is associated with the SR-IOV PFs. > + It contains the total number of MSI-X vectors available for > + assignment to all VFs associated with this PF. It may be zero > + if the device doesn't support this functionality. s/associated with the/associated with/ > +What:/sys/bus/pci/devices/.../sriov_vf_msix_count > +Date:January 2021 > +Contact: Leon Romanovsky > +Description: > + This file is associated with the SR-IOV VFs. > + It allows configuration of the number of MSI-X vectors for > + the VF. 
This is needed to optimize performance of newly bound > + devices by allocating the number of vectors based on the > + administrator knowledge of targeted VM. s/associated with the/associated with/ s/knowledge of targeted VM/knowledge of how the VF will be used/ > + The values accepted are: > + * > 0 - this will be number reported by the VF's MSI-X > + capability this number will be reported as the Table Size in the VF's MSI-X capability > + * < 0 - not valid > + * = 0 - will reset to the device default value > + > + The file is writable if the PF is bound to a driver that > + implements ->sriov_set_msix_vec_count(). > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > index 4afd4ee4f7f0..c0554aa6b90a 100644 > --- a/drivers/pci/iov.c > +++ b/drivers/pci/iov.c > @@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id) > return (dev->devfn + dev->sriov->offset + > dev->sriov->stride * vf_id) & 0xff; > } > +EXPORT_SYMBOL_GPL(pci_iov_virtfn_devfn); > > /* > * Per SR-IOV spec sec 3.3.10 and
Re: [PATCH mlx5-next v5 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
On Fri, Feb 05, 2021 at 07:35:47PM +0200, Leon Romanovsky wrote: > On Thu, Feb 04, 2021 at 03:12:12PM -0600, Bjorn Helgaas wrote: > > On Thu, Feb 04, 2021 at 05:50:48PM +0200, Leon Romanovsky wrote: > > > On Wed, Feb 03, 2021 at 06:10:39PM -0600, Bjorn Helgaas wrote: > > > > On Tue, Feb 02, 2021 at 09:44:29PM +0200, Leon Romanovsky wrote: > > > > > On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote: > > > > > > On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote: > > > > > > > From: Leon Romanovsky > > > > > > > > > > > > > > Extend PCI sysfs interface with a new callback that allows > > > > > > > configure the number of MSI-X vectors for specific SR-IO VF. > > > > > > > This is needed to optimize the performance of newly bound > > > > > > > devices by allocating the number of vectors based on the > > > > > > > administrator knowledge of targeted VM. > > > > > > > > > > > > I'm reading between the lines here, but IIUC the point is that you > > > > > > have a PF that supports a finite number of MSI-X vectors for use > > > > > > by all the VFs, and this interface is to control the distribution > > > > > > of those MSI-X vectors among the VFs. > > > > > > This commit log should describe how *this* device manages this > > > > allocation and how the PF Table Size and the VF Table Sizes are > > > > related. Per PCIe, there is no necessary connection between them. > > > > > > There is no connection in mlx5 devices either. PF is used as a vehicle > > > to access VF that doesn't have driver yet. From "table size" perspective > > > they completely independent, because PF already probed by driver and > > > it is already too late to change it. > > > > > > So PF table size is static and can be changed through FW utility only. > > > > This is where description of the device would be useful. > > > > The fact that you need "sriov_vf_total_msix" to advertise how many > > vectors are available and "sriov_vf_msix_count" to influence how they > > are distributed across the VFs suggests that these Table Sizes are not > > completely independent. > > > > Can a VF have a bigger Table Size than the PF does? Can all the VF > > Table Sizes added together be bigger than the PF Table Size? If VF A > > has a larger Table Size, does that mean VF B must have a smaller Table > > Size? > > VFs are completely independent devices and their table size can be > bigger than PF. FW has two pools, one for PF and another for all VFs. > In real world scenarios, every VF will have more MSI-X vectors than PF, > which will be distributed by orchestration software. Well, if the sum of all the VF Table Sizes cannot exceed the size of the FW pool for VFs, I would say the VFs are not completely independent. Increasing the Table Size of one VF reduces it for other VFs. This is an essential detail because it's the whole reason behind this interface, so sketching this out in the commit log will make this much easier to understand. > > Here's the sequence as I understand it: > > > > 1) PF driver binds to PF > > 2) PF driver enables VFs > > 3) PF driver creates /sys/...//sriov_vf_total_msix > > 4) PF driver creates /sys/...//sriov_vf_msix_count for each VF > > 5) Management app reads sriov_vf_total_msix, writes sriov_vf_msix_count > > 6) VF driver binds to VF > > 7) VF reads MSI-X Message Control (Table Size) > > > > Is it true that "lspci VF" at 4.1 and "lspci VF" at 5.1 may read > > different Table Sizes? That would be a little weird. > > Yes, this is the flow. 
I think differently from you and think this > is actual good thing that user writes new msix count and it is shown > immediately. Only weird because per spec Table Size is read-only and in this scenario it changes, so it may be surprising, but probably not a huge deal. > > I'm also a little concerned about doing 2 before 3 & 4. That works > > for mlx5 because implements the Table Size adjustment in a way that > > works *after* the VFs have been enabled. > > It is not related to mlx5, but to the PCI spec that requires us to > create all VFs at the same time. Before enabling VFs, they don't > exist. Yes. I can imagine a PF driver that collects characteristics for the desired VFs before enabling them, sort of like we already collect the *number* of VFs. But I think your
Re: [PATCH resend net-next v2 2/3] PCI/VPD: Change Chelsio T4 quirk to provide access to full virtual address space
[+cc Casey, Rahul] On Fri, Feb 05, 2021 at 08:29:45PM +0100, Heiner Kallweit wrote: > cxgb4 uses the full VPD address space for accessing its EEPROM (with some > mapping, see t4_eeprom_ptov()). In cudbg_collect_vpd_data() it sets the > VPD len to 32K (PCI_VPD_MAX_SIZE), and then back to 2K (CUDBG_VPD_PF_SIZE). > Having official (structured) and inofficial (unstructured) VPD data > violates the PCI spec, let's set VPD len according to all data that can be > accessed via PCI VPD access, no matter of its structure. s/inofficial/unofficial/ > Signed-off-by: Heiner Kallweit > --- > drivers/pci/vpd.c | 7 +++ > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/drivers/pci/vpd.c b/drivers/pci/vpd.c > index 7915d10f9..06a7954d0 100644 > --- a/drivers/pci/vpd.c > +++ b/drivers/pci/vpd.c > @@ -633,9 +633,8 @@ static void quirk_chelsio_extend_vpd(struct pci_dev *dev) > /* >* If this is a T3-based adapter, there's a 1KB VPD area at offset >* 0xc00 which contains the preferred VPD values. If this is a T4 or > - * later based adapter, the special VPD is at offset 0x400 for the > - * Physical Functions (the SR-IOV Virtual Functions have no VPD > - * Capabilities). The PCI VPD Access core routines will normally > + * later based adapter, provide access to the full virtual EEPROM > + * address space. The PCI VPD Access core routines will normally >* compute the size of the VPD by parsing the VPD Data Structure at >* offset 0x000. This will result in silent failures when attempting >* to accesses these other VPD areas which are beyond those computed > @@ -644,7 +643,7 @@ static void quirk_chelsio_extend_vpd(struct pci_dev *dev) > if (chip == 0x0 && prod >= 0x20) > pci_set_vpd_size(dev, 8192); > else if (chip >= 0x4 && func < 0x8) > - pci_set_vpd_size(dev, 2048); > + pci_set_vpd_size(dev, PCI_VPD_MAX_SIZE); This code was added by 7dcf688d4c78 ("PCI/cxgb4: Extend T3 PCI quirk to T4+ devices") [1]. Unfortunately that commit doesn't really have the details about what it fixes, other than the silent failures it mentions in the comment. Some devices hang if we try to read at the wrong VPD address, and this can be done via the sysfs "vpd" file. Can you expand the commit log with an argument for why it is always safe to set the size to PCI_VPD_MAX_SIZE for these devices? The fact that cudbg_collect_vpd_data() fiddles around with pci_set_vpd_size() suggests to me that there is *some* problem with reading parts of the VPD. Otherwise, why would they bother? 940c9c458866 ("cxgb4: collect vpd info directly from hardware") [2] added the pci_set_vpd_size() usage, but doesn't say why it's needed. Maybe Rahul will remember? Bjorn [1] https://git.kernel.org/linus/7dcf688d4c78 [2] https://git.kernel.org/linus/940c9c458866 > } > > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, PCI_ANY_ID, > -- > 2.30.0 > > >
Re: [PATCH mlx5-next v5 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
On Thu, Feb 04, 2021 at 05:50:48PM +0200, Leon Romanovsky wrote: > On Wed, Feb 03, 2021 at 06:10:39PM -0600, Bjorn Helgaas wrote: > > On Tue, Feb 02, 2021 at 09:44:29PM +0200, Leon Romanovsky wrote: > > > On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote: > > > > On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote: > > > > > From: Leon Romanovsky > > > > > > > > > > Extend PCI sysfs interface with a new callback that allows > > > > > configure the number of MSI-X vectors for specific SR-IO VF. > > > > > This is needed to optimize the performance of newly bound > > > > > devices by allocating the number of vectors based on the > > > > > administrator knowledge of targeted VM. > > > > > > > > I'm reading between the lines here, but IIUC the point is that you > > > > have a PF that supports a finite number of MSI-X vectors for use > > > > by all the VFs, and this interface is to control the distribution > > > > of those MSI-X vectors among the VFs. > > This commit log should describe how *this* device manages this > > allocation and how the PF Table Size and the VF Table Sizes are > > related. Per PCIe, there is no necessary connection between them. > > There is no connection in mlx5 devices either. PF is used as a vehicle > to access VF that doesn't have driver yet. From "table size" perspective > they completely independent, because PF already probed by driver and > it is already too late to change it. > > So PF table size is static and can be changed through FW utility only. This is where description of the device would be useful. The fact that you need "sriov_vf_total_msix" to advertise how many vectors are available and "sriov_vf_msix_count" to influence how they are distributed across the VFs suggests that these Table Sizes are not completely independent. Can a VF have a bigger Table Size than the PF does? Can all the VF Table Sizes added together be bigger than the PF Table Size? If VF A has a larger Table Size, does that mean VF B must have a smaller Table Size? Obviously I do not understand the details about how this device works. It would be helpful to have those details here. Here's the sequence as I understand it: 1) PF driver binds to PF 2) PF driver enables VFs 3) PF driver creates /sys/...//sriov_vf_total_msix 4) PF driver creates /sys/...//sriov_vf_msix_count for each VF 5) Management app reads sriov_vf_total_msix, writes sriov_vf_msix_count 6) VF driver binds to VF 7) VF reads MSI-X Message Control (Table Size) Is it true that "lspci VF" at 4.1 and "lspci VF" at 5.1 may read different Table Sizes? That would be a little weird. I'm also a little concerned about doing 2 before 3 & 4. That works for mlx5 because implements the Table Size adjustment in a way that works *after* the VFs have been enabled. But it seems conceivable that a device could implement vector distribution in a way that would require the VF Table Sizes to be fixed *before* enabling VFs. That would be nice in the sense that the VFs would be created "fully formed" and the VF Table Size would be completely read-only as documented. The other knob idea you mentioned at [2]: echo ":01:00.2 123" > sriov_vf_msix_count would have the advantage of working for both cases. That's definitely more complicated, but at the same time, I would hate to carve a sysfs interface into stone if it might not work for other devices. 
> > > > > +What: > > > > > /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix > > > > > +Date:January 2021 > > > > > +Contact: Leon Romanovsky > > > > > +Description: > > > > > + This file is associated with the SR-IOV PFs. > > > > > + It returns a total number of possible to configure MSI-X > > > > > + vectors on the enabled VFs. > > > > > + > > > > > + The values returned are: > > > > > + * > 0 - this will be total number possible to consume > > > > > by VFs, > > > > > + * = 0 - feature is not supported > > > Not sure the "= 0" description is necessary here. If the value > > returned is the number of MSI-X vectors available for assignment to > > VFs, "0" is a perfectly legitimate value. It just means there are > > none. It doesn't need to be described separately. > > I wanted to help users and remove ambiguity. For example, mlx5 drivers > wil
Re: [PATCH mlx5-next v5 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
On Tue, Feb 02, 2021 at 09:44:29PM +0200, Leon Romanovsky wrote: > On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote: > > On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote: > > > From: Leon Romanovsky > > > > > > Extend PCI sysfs interface with a new callback that allows > > > configure the number of MSI-X vectors for specific SR-IO VF. > > > This is needed to optimize the performance of newly bound > > > devices by allocating the number of vectors based on the > > > administrator knowledge of targeted VM. > > > > I'm reading between the lines here, but IIUC the point is that you > > have a PF that supports a finite number of MSI-X vectors for use > > by all the VFs, and this interface is to control the distribution > > of those MSI-X vectors among the VFs. > > The MSI-X is HW resource, all devices in the world have limitation > here. > > > > This function is applicable for SR-IOV VF because such devices > > > allocate their MSI-X table before they will run on the VMs and > > > HW can't guess the right number of vectors, so the HW allocates > > > them statically and equally. > > > > This is written in a way that suggests this is behavior required > > by the PCIe spec. If it is indeed related to something in the > > spec, please cite it. > > Spec doesn't say it directly, but you will need to really hurt brain > of your users if you decide to do it differently. You have one > enable bit to create all VFs at the same time without any option to > configure them in advance. > > Of course, you can create some partition map, upload it to FW and > create from there. Of course all devices have limitations. But let's add some details about the way *this* device works. That will help avoid giving the impression that this is the *only* way spec-conforming devices can work. > > "such devices allocate their MSI-X table before they will run on > > the VMs": Let's be specific here. This MSI-X Table allocation > > apparently doesn't happen when we set VF Enable in the PF, because > > these sysfs files are attached to the VFs, which don't exist yet. > > It's not the VF driver binding, because that's a software > > construct. What is the hardware event that triggers the > > allocation? > > Write of MSI-X vector count to the FW through PF. This is an example of something that is obviously specific to this mlx5 device. The Table Size field in Message Control is RO per spec, and obviously firmware on the device is completely outside the scope of the PCIe spec. This commit log should describe how *this* device manages this allocation and how the PF Table Size and the VF Table Sizes are related. Per PCIe, there is no necessary connection between them. > > > cat /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix > > > = 0 - feature is not supported > > > > 0 - total number of MSI-X vectors to consume by the VFs > > > > "total number of MSI-X vectors available for distribution among the > > VFs"? > > Users need to be aware of how much vectors exist in the system. Understood -- if there's an interface to influence the distribution of vectors among VFs, one needs to know how many vectors there are to work with. My point was that "number of vectors to consume by VFs" is awkward wording, so I suggested an alternative. > > > +What:/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count > > > +Date:January 2021 > > > +Contact: Leon Romanovsky > > > +Description: > > > + This file is associated with the SR-IOV VFs. 
> > > + It allows configuration of the number of MSI-X vectors for > > > + the VF. This is needed to optimize performance of newly bound > > > + devices by allocating the number of vectors based on the > > > + administrator knowledge of targeted VM. > > > + > > > + The values accepted are: > > > + * > 0 - this will be number reported by the VF's MSI-X > > > + capability > > > + * < 0 - not valid > > > + * = 0 - will reset to the device default value > > > + > > > + The file is writable if the PF is bound to a driver that > > > + set sriov_vf_total_msix > 0 and there is no driver bound > > > + to the VF. Drivers don't actually set "sriov_vf_total_msix". This should probably say something like "the PF is bound to a driver that implements ->sriov_set_msix
Re: [PATCH mlx5-next v5 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote: > From: Leon Romanovsky > > Extend PCI sysfs interface with a new callback that allows configure > the number of MSI-X vectors for specific SR-IO VF. This is needed > to optimize the performance of newly bound devices by allocating > the number of vectors based on the administrator knowledge of targeted VM. s/configure/configuration of/ s/SR-IO/SR-IOV/ s/newly bound/VFs/ ? s/VF/VFs/ s/knowledge of targeted VM/knowledge of the intended use of the VF/ (I'm not a VF expert, but I think they can be used even without VMs) I'm reading between the lines here, but IIUC the point is that you have a PF that supports a finite number of MSI-X vectors for use by all the VFs, and this interface is to control the distribution of those MSI-X vectors among the VFs. > This function is applicable for SR-IOV VF because such devices allocate > their MSI-X table before they will run on the VMs and HW can't guess the > right number of vectors, so the HW allocates them statically and equally. This is written in a way that suggests this is behavior required by the PCIe spec. If it is indeed related to something in the spec, please cite it. But I think this is actually something device-specific, not something we can derive directly from the spec. If that's the case, be clear that we're addressing a device-specific need, and we're hoping that this will be useful for other devices as well. "such devices allocate their MSI-X table before they will run on the VMs": Let's be specific here. This MSI-X Table allocation apparently doesn't happen when we set VF Enable in the PF, because these sysfs files are attached to the VFs, which don't exist yet. It's not the VF driver binding, because that's a software construct. What is the hardware event that triggers the allocation? Obviously the distribution among VFs can be changed after VF Enable is set. Maybe the distribution is dynamic, and the important point is that it must be changed before the VF driver reads the Message Control register for Table Size? But that isn't the same as "devices allocating their MSI-X table before being passed through to a VM," so it's confusing. The language about allocating the MSI-X table needs to be made precise here and in the code comments below. "before they will run on the VMs": Devices don't "run on VMs". I think the usual terminology is that a device may be "passed through to a VM". "HW allocates them statically and equally" sounds like a description of some device-specific behavior (unless there's something in the spec that requires this, in which case you should cite it). It's OK if this is device-specific; just don't pretend that it's generic if it's not. > 1) The newly added /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count > file will be seen for the VFs and it is writable as long as a driver is not > bounded to the VF. "bound to the VF" > The values accepted are: > * > 0 - this will be number reported by the VF's MSI-X capability Specifically, I guess by Table Size in the VF's MSI-X Message Control register? > * < 0 - not valid > * = 0 - will reset to the device default value > > 2) In order to make management easy, provide new read-only sysfs file that > returns a total number of possible to configure MSI-X vectors. > > cat /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix > = 0 - feature is not supported > > 0 - total number of MSI-X vectors to consume by the VFs "total number of MSI-X vectors available for distribution among the VFs"? 
> Signed-off-by: Leon Romanovsky > --- > Documentation/ABI/testing/sysfs-bus-pci | 32 + > drivers/pci/iov.c | 180 > drivers/pci/msi.c | 47 +++ > drivers/pci/pci.h | 4 + > include/linux/pci.h | 10 ++ > 5 files changed, 273 insertions(+) > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci > b/Documentation/ABI/testing/sysfs-bus-pci > index 25c9c39770c6..4d206ade5331 100644 > --- a/Documentation/ABI/testing/sysfs-bus-pci > +++ b/Documentation/ABI/testing/sysfs-bus-pci > @@ -375,3 +375,35 @@ Description: > The value comes from the PCI kernel device state and can be one > of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold". > The file is read only. > + > +What:/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count > +Date:January 2021 > +Contact: Leon Romanovsky > +Description: > + This file is associated with the SR-IOV VFs. > + It allows configuration of the number of MSI-X vectors for > + the VF. This is needed to optimize performance of newly bound > + devices by allocating the number of vectors based on the > + administrator knowledge of targeted VM. > + > + The values accepted are: > + * > 0 -
Re: [PATCH mlx5-next v5 3/4] net/mlx5: Dynamically assign MSI-X vectors count
On Tue, Jan 26, 2021 at 10:57:29AM +0200, Leon Romanovsky wrote: > From: Leon Romanovsky > > The number of MSI-X vectors is PCI property visible through lspci, that > field is read-only and configured by the device. The static assignment > of an amount of MSI-X vectors doesn't allow utilize the newly created > VF because it is not known to the device the future load and configuration > where that VF will be used. > > To overcome the inefficiency in the spread of such MSI-X vectors, we > allow the kernel to instruct the device with the needed number of such > vectors. > > Such change immediately increases the amount of MSI-X vectors for the > system with @ VFs from 12 vectors per-VF, to be 32 vectors per-VF. Not knowing anything about mlx5, it looks like maybe this gets some parameters from firmware on the device, then changes the way MSI-X vectors are distributed among VFs? I don't understand the implications above about "static assignment" and "inefficiency in the spread." I guess maybe this takes advantage of the fact that you know how many VFs are enabled, so if NumVFs is less that TotalVFs, you can assign more vectors to each VF? If that's the case, spell it out a little bit. The current text makes it sound like you discovered brand new MSI-X vectors somewhere, regardless of how many VFs are enabled, which doesn't sound right. > Before this patch: > [root@server ~]# lspci -vs :08:00.2 > 08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 > Virtual Function] > > Capabilities: [9c] MSI-X: Enable- Count=12 Masked- > > After this patch: > [root@server ~]# lspci -vs :08:00.2 > 08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 > Virtual Function] > > Capabilities: [9c] MSI-X: Enable- Count=32 Masked- > > Signed-off-by: Leon Romanovsky > --- > .../net/ethernet/mellanox/mlx5/core/main.c| 4 ++ > .../ethernet/mellanox/mlx5/core/mlx5_core.h | 5 ++ > .../net/ethernet/mellanox/mlx5/core/pci_irq.c | 72 +++ > .../net/ethernet/mellanox/mlx5/core/sriov.c | 13 +++- > 4 files changed, 92 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c > b/drivers/net/ethernet/mellanox/mlx5/core/main.c > index ca6f2fc39ea0..79cfcc844156 100644 > --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c > +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c > @@ -567,6 +567,10 @@ static int handle_hca_cap(struct mlx5_core_dev *dev, > void *set_ctx) > if (MLX5_CAP_GEN_MAX(dev, mkey_by_name)) > MLX5_SET(cmd_hca_cap, set_hca_cap, mkey_by_name, 1); > > + if (MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix)) > + MLX5_SET(cmd_hca_cap, set_hca_cap, num_total_dynamic_vf_msix, > + MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix)); > + > return set_caps(dev, set_ctx, MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE); > } > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h > b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h > index 0a0302ce7144..5babb4434a87 100644 > --- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h > +++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h > @@ -172,6 +172,11 @@ int mlx5_irq_attach_nb(struct mlx5_irq_table *irq_table, > int vecidx, > struct notifier_block *nb); > int mlx5_irq_detach_nb(struct mlx5_irq_table *irq_table, int vecidx, > struct notifier_block *nb); > + > +int mlx5_set_msix_vec_count(struct mlx5_core_dev *dev, int devfn, > + int msix_vec_count); > +int mlx5_get_default_msix_vec_count(struct mlx5_core_dev *dev, int num_vfs); > + > struct cpumask * > 
mlx5_irq_get_affinity_mask(struct mlx5_irq_table *irq_table, int vecidx); > struct cpu_rmap *mlx5_irq_get_rmap(struct mlx5_irq_table *table); > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c > b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c > index 6fd974920394..2a35888fcff0 100644 > --- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c > +++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c > @@ -55,6 +55,78 @@ static struct mlx5_irq *mlx5_irq_get(struct mlx5_core_dev > *dev, int vecidx) > return &irq_table->irq[vecidx]; > } > > +/** > + * mlx5_get_default_msix_vec_count() - Get defaults of number of MSI-X > vectors > + * to be set s/defaults of number of/default number of/ s/to be set/to be assigned to each VF/ ? > + * @dev: PF to work on > + * @num_vfs: Number of VFs was asked when SR-IOV was enabled s/Number of VFs was asked when SR-IOV was enabled/Number of enabled VFs/ ? > + **/ Documentation/doc-guide/kernel-doc.rst says kernel-doc comments end with just "*/" (not "**/"). > +int mlx5_get_default_msix_vec_count(struct mlx5_core_dev *dev, int num_vfs) > +{ > + int num_vf_msix, min_msix, max_msix; > + > + num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix); > + if (
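For reference, a kernel-doc block folding in the wording fixes suggested above would look something like this sketch; note the closing "*/" rather than "**/":

/**
 * mlx5_get_default_msix_vec_count() - Get the default number of MSI-X
 * vectors to be assigned to each VF
 * @dev: PF to work on
 * @num_vfs: Number of enabled VFs
 */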
[PATCH v2] octeontx2-af: Fix 'physical' typos
From: Bjorn Helgaas Fix misspellings of "physical". Signed-off-by: Bjorn Helgaas --- Thanks, Willem! drivers/net/ethernet/marvell/octeontx2/af/rvu.c | 2 +- drivers/net/ethernet/marvell/octeontx2/nic/otx2_flows.c | 2 +- drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c| 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu.c index e8fd712860a1..565d9373bfe4 100644 --- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.c +++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.c @@ -646,7 +646,7 @@ static int rvu_setup_msix_resources(struct rvu *rvu) } /* HW interprets RVU_AF_MSIXTR_BASE address as an IOVA, hence -* create a IOMMU mapping for the physcial address configured by +* create an IOMMU mapping for the physical address configured by * firmware and reconfig RVU_AF_MSIXTR_BASE with IOVA. */ cfg = rvu_read64(rvu, BLKADDR_RVUM, RVU_PRIV_CONST); diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_flows.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_flows.c index be8ccfce1848..b4d6a6bb3070 100644 --- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_flows.c +++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_flows.c @@ -1,5 +1,5 @@ // SPDX-License-Identifier: GPL-2.0 -/* Marvell OcteonTx2 RVU Physcial Function ethernet driver +/* Marvell OcteonTx2 RVU Physical Function ethernet driver * * Copyright (C) 2020 Marvell. */ diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c index 634d60655a74..07ec85aebcca 100644 --- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c +++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c @@ -1,5 +1,5 @@ // SPDX-License-Identifier: GPL-2.0 -/* Marvell OcteonTx2 RVU Physcial Function ethernet driver +/* Marvell OcteonTx2 RVU Physical Function ethernet driver * * Copyright (C) 2020 Marvell International Ltd. * -- 2.25.1
[PATCH] octeontx2-af: Fix 'physical' typos
From: Bjorn Helgaas Fix misspellings of "physical". Signed-off-by: Bjorn Helgaas --- drivers/net/ethernet/marvell/octeontx2/af/rvu.c | 2 +- drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu.c index e8fd712860a1..565d9373bfe4 100644 --- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.c +++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.c @@ -646,7 +646,7 @@ static int rvu_setup_msix_resources(struct rvu *rvu) } /* HW interprets RVU_AF_MSIXTR_BASE address as an IOVA, hence -* create a IOMMU mapping for the physcial address configured by +* create an IOMMU mapping for the physical address configured by * firmware and reconfig RVU_AF_MSIXTR_BASE with IOVA. */ cfg = rvu_read64(rvu, BLKADDR_RVUM, RVU_PRIV_CONST); diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c index 634d60655a74..07ec85aebcca 100644 --- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c +++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c @@ -1,5 +1,5 @@ // SPDX-License-Identifier: GPL-2.0 -/* Marvell OcteonTx2 RVU Physcial Function ethernet driver +/* Marvell OcteonTx2 RVU Physical Function ethernet driver * * Copyright (C) 2020 Marvell International Ltd. * -- 2.25.1
Re: [PATCH v3 1/3] PCI: Disable parity checking if broken_parity is set
On Wed, Jan 13, 2021 at 09:52:23PM +0100, Heiner Kallweit wrote: > On 06.01.2021 20:34, Heiner Kallweit wrote: > > On 06.01.2021 20:22, Bjorn Helgaas wrote: > >> On Wed, Jan 06, 2021 at 06:50:22PM +0100, Heiner Kallweit wrote: > >>> If we know that a device has broken parity checking, then disable it. > >>> This avoids quirks like in r8169 where on the first parity error > >>> interrupt parity checking will be disabled if broken_parity_status > >>> is set. Make pci_quirk_broken_parity() public so that it can be used > >>> by platform code, e.g. for Thecus N2100. > >>> > >>> Signed-off-by: Heiner Kallweit > >>> Reviewed-by: Leon Romanovsky > >> > >> Acked-by: Bjorn Helgaas > >> > >> This series should all go together. Let me know if you want me to do > >> anything more (would require acks for arm and r8169, of course). > >> > > Right. For r8169 I'm the maintainer myself and agreed with Jakub that > > the r8169 patch will go through the PCI tree. > > > > Regarding the arm/iop32x part: > > MAINTAINERS file lists Lennert as maintainer, let me add him. > > Strange thing is that the MAINTAINERS entry for arm/iop32x has no > > F entry, therefore the get_maintainers scripts will never list him > > as addressee. The script lists Russell as "odd fixer". > > @Lennert: Please provide a patch to add the missing F entry. > > > > ARM/INTEL IOP32X ARM ARCHITECTURE > > M: Lennert Buytenhek > > L: linux-arm-ker...@lists.infradead.org (moderated for non-subscribers) > > S: Maintained > > Bjorn, I saw that you set the series to "not applicable". Is this because > of the missing ack for the arm part? No, it's because I screwed up. I use "not applicable" when I expect patches to go via another tree. I just missed your note about merging via the PCI tree. I'll take a look soon. > I checked and Lennert's last kernel contribution is from 2015. Having said > that the maintainer's entry may be outdated. Not sure who else would be > entitled to ack this patch. The change is simple enough, could you take > it w/o an ack? > Alternatively, IIRC Russell has got such a device. Russell, would it > be possible that you test that there's still no false-positive parity > errors with this series? > > > > > >>> --- > >>> drivers/pci/quirks.c | 17 +++-- > >>> include/linux/pci.h | 2 ++ > >>> 2 files changed, 13 insertions(+), 6 deletions(-) > >>> > >>> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > >>> index 653660e3b..ab54e26b8 100644 > >>> --- a/drivers/pci/quirks.c > >>> +++ b/drivers/pci/quirks.c > >>> @@ -205,17 +205,22 @@ static void quirk_mmio_always_on(struct pci_dev > >>> *dev) > >>> DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_ANY_ID, PCI_ANY_ID, > >>> PCI_CLASS_BRIDGE_HOST, 8, quirk_mmio_always_on); > >>> > >>> +void pci_quirk_broken_parity(struct pci_dev *dev) > >>> +{ > >>> + u16 cmd; > >>> + > >>> + dev->broken_parity_status = 1; /* This device gives false positives */ > >>> + pci_read_config_word(dev, PCI_COMMAND, &cmd); > >>> + pci_write_config_word(dev, PCI_COMMAND, cmd & ~PCI_COMMAND_PARITY); > >>> +} > >>> + > >>> /* > >>> * The Mellanox Tavor device gives false positive parity errors. Mark > >>> this > >>> * device with a broken_parity_status to allow PCI scanning code to > >>> "skip" > >>> * this now blacklisted device. 
> >>> */ > >>> -static void quirk_mellanox_tavor(struct pci_dev *dev) > >>> -{ > >>> - dev->broken_parity_status = 1; /* This device gives false positives */ > >>> -} > >>> -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, > >>> PCI_DEVICE_ID_MELLANOX_TAVOR, quirk_mellanox_tavor); > >>> -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, > >>> PCI_DEVICE_ID_MELLANOX_TAVOR_BRIDGE, quirk_mellanox_tavor); > >>> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, > >>> PCI_DEVICE_ID_MELLANOX_TAVOR, pci_quirk_broken_parity); > >>> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, > >>> PCI_DEVICE_ID_MELLANOX_TAVOR_BRIDGE, pci_quirk_broken_parity); > >>> > >>> /* > >>> * Deal with broken BIOSes that neglect to enable passive release, > >>> diff --git a/include/linux/pci.h b/include/linux/pci.h > >>> index b32126d26..161dcc474 100644 > >>> --- a/include/linux/pci.h > >>> +++ b/include/linux/pci.h > >>> @@ -1916,6 +1916,8 @@ enum pci_fixup_pass { > >>> pci_fixup_suspend_late, /* pci_device_suspend_late() */ > >>> }; > >>> > >>> +void pci_quirk_broken_parity(struct pci_dev *dev); > >>> + > >>> #ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS > >>> #define __DECLARE_PCI_FIXUP_SECTION(sec, name, vendor, device, class, > >>> \ > >>> class_shift, hook) \ > >>> -- > >>> 2.30.0 > >>> > >>> > >>> > > >
Re: [PATCH mlx5-next 1/4] PCI: Configure number of MSI-X vectors for SR-IOV VFs
On Thu, Jan 07, 2021 at 10:54:38PM -0500, Don Dutile wrote: > On 1/7/21 7:57 PM, Bjorn Helgaas wrote: > > On Sun, Jan 03, 2021 at 10:24:37AM +0200, Leon Romanovsky wrote: > > > + **/ > > > +int pci_set_msix_vec_count(struct pci_dev *dev, int numb) > > > +{ > > > + struct pci_dev *pdev = pci_physfn(dev); > > > + > > > + if (!dev->msix_cap || !pdev->msix_cap) > > > + return -EINVAL; > > > + > > > + if (dev->driver || !pdev->driver || > > > + !pdev->driver->sriov_set_msix_vec_count) > > > + return -EOPNOTSUPP; > > > + > > > + if (numb < 0) > > > + /* > > > + * We don't support negative numbers for now, > > > + * but maybe in the future it will make sense. > > > + */ > > > + return -EINVAL; > > > + > > > + return pdev->driver->sriov_set_msix_vec_count(dev, numb); > > > > So we write to a VF sysfs file, get here and look up the PF, call a PF > > driver callback with the VF as an argument, the callback (at least for > > mlx5) looks up the PF from the VF, then does some mlx5-specific magic > > to the PF that influences the VF somehow? > > There's no PF lookup above it's just checking if a pdev has a > driver with the desired msix-cap setting(reduction) feature. We started with the VF (the sysfs file is attached to the VF). "pdev" is the corresponding PF; that's what I meant by "looking up the PF". Then we call the PF driver sriov_set_msix_vec_count() method. I asked because this raises questions of whether we need mutual exclusion or some other coordination between setting this for multiple VFs. Obviously it's great to answer all these in email, but at the end of the day, the rationale needs to be in the commit, either in code comments or the commit log.
Re: [PATCH mlx5-next 1/4] PCI: Configure number of MSI-X vectors for SR-IOV VFs
[+cc Alex, Don] This patch does not actually *configure* the number of vectors, so the subject is not quite accurate. IIUC, this patch adds a sysfs file that can be used to configure the number of vectors. The subject should mention the sysfs connection. On Sun, Jan 03, 2021 at 10:24:37AM +0200, Leon Romanovsky wrote: > From: Leon Romanovsky > > This function is applicable for SR-IOV VFs because such devices allocate > their MSI-X table before they will run on the targeted hardware and they > can't guess the right amount of vectors. This sentence doesn't quite have enough context to make sense to me. Per PCIe r5.0, sec 9.5.1.2, I think PFs and VFs have independent MSI-X Capabilities. What is the connection between the PF MSI-X and the VF MSI-X? The MSI-X table sizes should be determined by the Table Size in the Message Control register. Apparently we write a VF's Table Size before a driver is bound to the VF? Where does that happen? "Before they run on the targeted hardware" -- do you mean before the VF is passed through to a guest virtual machine? You mention "target VM" below, which makes more sense to me. VFs don't "run"; they're not software. I apologize for not being an expert in the use of VFs. Please mention the sysfs path in the commit log. > Signed-off-by: Leon Romanovsky > --- > Documentation/ABI/testing/sysfs-bus-pci | 16 +++ > drivers/pci/iov.c | 57 + > drivers/pci/msi.c | 30 + > drivers/pci/pci-sysfs.c | 1 + > drivers/pci/pci.h | 1 + > include/linux/pci.h | 8 > 6 files changed, 113 insertions(+) > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci > b/Documentation/ABI/testing/sysfs-bus-pci > index 25c9c39770c6..30720a9e1386 100644 > --- a/Documentation/ABI/testing/sysfs-bus-pci > +++ b/Documentation/ABI/testing/sysfs-bus-pci > @@ -375,3 +375,19 @@ Description: > The value comes from the PCI kernel device state and can be one > of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold". > The file is read only. > + > +What:/sys/bus/pci/devices/.../vf_msix_vec > +Date:December 2020 > +Contact: Leon Romanovsky > +Description: > + This file is associated with the SR-IOV VFs. It allows overwrite > + the amount of MSI-X vectors for that VF. This is needed to > optimize > + performance of newly bounded devices by allocating the number of > + vectors based on the internal knowledge of targeted VM. s/allows overwrite/allows configuration of/ s/for that/for the/ s/amount of/number of/ s/bounded/bound/ What "internal knowledge" is this? AFAICT this would have to be some user-space administration knowledge, not anything internal to the kernel. > + The values accepted are: > + * > 0 - this will be number reported by the PCI VF's PCIe > MSI-X capability. s/PCI// (it's obvious we're talking about PCI here) s/PCIe// (MSI-X is not PCIe-specific, and there's no need to mention it at all) > + * < 0 - not valid > + * = 0 - will reset to the device default value > + > + The file is writable if no driver is bounded. >From the code, it looks more like this: The file is writable if the PF is bound to a driver that supports the ->sriov_set_msix_vec_count() callback and there is no driver bound to the VF. Please wrap all of this to fit in 80 columns like the rest of the file. 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > index 4afd4ee4f7f0..0f8c570361fc 100644 > --- a/drivers/pci/iov.c > +++ b/drivers/pci/iov.c > @@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id) > return (dev->devfn + dev->sriov->offset + > dev->sriov->stride * vf_id) & 0xff; > } > +EXPORT_SYMBOL(pci_iov_virtfn_devfn); > > /* > * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may > @@ -426,6 +427,62 @@ const struct attribute_group sriov_dev_attr_group = { > .is_visible = sriov_attrs_are_visible, > }; > > +#ifdef CONFIG_PCI_MSI > +static ssize_t vf_msix_vec_show(struct device *dev, > + struct device_attribute *attr, char *buf) > +{ > + struct pci_dev *pdev = to_pci_dev(dev); > + int numb = pci_msix_vec_count(pdev); > + > + if (numb < 0) > + return numb; > + > + return sprintf(buf, "%d\n", numb); > +} > + > +static ssize_t vf_msix_vec_store(struct device *dev, > + struct device_attribute *attr, const char *buf, > + size_t count) > +{ > + struct pci_dev *vf_dev = to_pci_dev(dev); > + int val, ret; > + > + ret = kstrtoint(buf, 0, &val); > + if (ret) > + return ret; > + > + ret = pci_set_msix_vec_count(vf_dev, val); > + if (ret) > +
Re: [PATCH v3 1/3] PCI: Disable parity checking if broken_parity is set
On Wed, Jan 06, 2021 at 06:50:22PM +0100, Heiner Kallweit wrote: > If we know that a device has broken parity checking, then disable it. > This avoids quirks like in r8169 where on the first parity error > interrupt parity checking will be disabled if broken_parity_status > is set. Make pci_quirk_broken_parity() public so that it can be used > by platform code, e.g. for Thecus N2100. > > Signed-off-by: Heiner Kallweit > Reviewed-by: Leon Romanovsky Acked-by: Bjorn Helgaas This series should all go together. Let me know if you want me to do anything more (would require acks for arm and r8169, of course). > --- > drivers/pci/quirks.c | 17 +++-- > include/linux/pci.h | 2 ++ > 2 files changed, 13 insertions(+), 6 deletions(-) > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > index 653660e3b..ab54e26b8 100644 > --- a/drivers/pci/quirks.c > +++ b/drivers/pci/quirks.c > @@ -205,17 +205,22 @@ static void quirk_mmio_always_on(struct pci_dev *dev) > DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_ANY_ID, PCI_ANY_ID, > PCI_CLASS_BRIDGE_HOST, 8, quirk_mmio_always_on); > > +void pci_quirk_broken_parity(struct pci_dev *dev) > +{ > + u16 cmd; > + > + dev->broken_parity_status = 1; /* This device gives false positives */ > + pci_read_config_word(dev, PCI_COMMAND, &cmd); > + pci_write_config_word(dev, PCI_COMMAND, cmd & ~PCI_COMMAND_PARITY); > +} > + > /* > * The Mellanox Tavor device gives false positive parity errors. Mark this > * device with a broken_parity_status to allow PCI scanning code to "skip" > * this now blacklisted device. > */ > -static void quirk_mellanox_tavor(struct pci_dev *dev) > -{ > - dev->broken_parity_status = 1; /* This device gives false positives */ > -} > -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, > PCI_DEVICE_ID_MELLANOX_TAVOR, quirk_mellanox_tavor); > -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, > PCI_DEVICE_ID_MELLANOX_TAVOR_BRIDGE, quirk_mellanox_tavor); > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, > PCI_DEVICE_ID_MELLANOX_TAVOR, pci_quirk_broken_parity); > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, > PCI_DEVICE_ID_MELLANOX_TAVOR_BRIDGE, pci_quirk_broken_parity); > > /* > * Deal with broken BIOSes that neglect to enable passive release, > diff --git a/include/linux/pci.h b/include/linux/pci.h > index b32126d26..161dcc474 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -1916,6 +1916,8 @@ enum pci_fixup_pass { > pci_fixup_suspend_late, /* pci_device_suspend_late() */ > }; > > +void pci_quirk_broken_parity(struct pci_dev *dev); > + > #ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS > #define __DECLARE_PCI_FIXUP_SECTION(sec, name, vendor, device, class, > \ > class_shift, hook) \ > -- > 2.30.0 > > >
Re: [PATCH v2 2/3] ARM: iop32x: improve N2100 PCI broken parity quirk'
On Wed, Jan 06, 2021 at 12:05:41PM +0100, Heiner Kallweit wrote: > Use new PCI core function pci_quirk_broken_parity(), in addition to > setting broken_parity_status is disables parity checking. That sentence has a typo or something so it doesn't read quite right. Maybe: Use new PCI core function pci_quirk_broken_parity() to disable parity checking. "broken_parity_status" is basically internal to the PCI core and doesn't really seem relevant here. The only uses are the sysfs store/show functions and edac. > This allows us to remove a quirk in r8169 driver. > > Signed-off-by: Heiner Kallweit > --- > v2: > - remove additional changes from this patch > --- > arch/arm/mach-iop32x/n2100.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/arm/mach-iop32x/n2100.c b/arch/arm/mach-iop32x/n2100.c > index 78b9a5ee4..9f2aae3cd 100644 > --- a/arch/arm/mach-iop32x/n2100.c > +++ b/arch/arm/mach-iop32x/n2100.c > @@ -125,7 +125,7 @@ static void n2100_fixup_r8169(struct pci_dev *dev) > if (dev->bus->number == 0 && > (dev->devfn == PCI_DEVFN(1, 0) || >dev->devfn == PCI_DEVFN(2, 0))) > - dev->broken_parity_status = 1; > + pci_quirk_broken_parity(dev); > } > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_REALTEK, PCI_ANY_ID, > n2100_fixup_r8169); > > -- > 2.30.0 > >
Re: [PATCH 2/3] ARM: iop32x: improve N2100 PCI broken parity quirk
On Tue, Jan 05, 2021 at 10:42:31AM +0100, Heiner Kallweit wrote: > Simplify the quirk by using new PCI core function > pci_quirk_broken_parity(). In addition make the quirk > more specific, use device id 0x8169 instead of PCI_ANY_ID. > > Signed-off-by: Heiner Kallweit > --- > arch/arm/mach-iop32x/n2100.c | 8 +++- > 1 file changed, 3 insertions(+), 5 deletions(-) > > diff --git a/arch/arm/mach-iop32x/n2100.c b/arch/arm/mach-iop32x/n2100.c > index 78b9a5ee4..24c3eec46 100644 > --- a/arch/arm/mach-iop32x/n2100.c > +++ b/arch/arm/mach-iop32x/n2100.c > @@ -122,12 +122,10 @@ static struct hw_pci n2100_pci __initdata = { > */ > static void n2100_fixup_r8169(struct pci_dev *dev) > { > - if (dev->bus->number == 0 && > - (dev->devfn == PCI_DEVFN(1, 0) || > - dev->devfn == PCI_DEVFN(2, 0))) > - dev->broken_parity_status = 1; > + if (machine_is_n2100()) > + pci_quirk_broken_parity(dev); Whatever "machine_is_n2100()" is (I can't find the definition), it is surely not equivalent to "00:01.0 || 00:02.0". That change probably should be a separate patch with some explanation. If this makes the quirk safe to use in a generic kernel, that sounds like a good thing. I guess a parity problem could be the result of a defect in either the device (e.g., 0x8169), which would be an issue in *all* platforms, or a platform-specific issue in the way it's wired up. I assume it's the latter because the quirk is not in drivers/pci/quirks.c. Why is it safe to restrict this to device ID 0x8169? If this is platform issue, it might affect any device in the slot. > } > -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_REALTEK, PCI_ANY_ID, > n2100_fixup_r8169); > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_REALTEK, 0x8169, n2100_fixup_r8169); > > static int __init n2100_pci_init(void) > { > -- > 2.30.0 > >
Re: [PATCH 1/3] PCI/ASPM: Use the path max in L1 ASPM latency check
On Wed, Dec 16, 2020 at 12:20:53PM +0100, Ian Kumlien wrote: > On Wed, Dec 16, 2020 at 1:08 AM Bjorn Helgaas wrote: > > On Tue, Dec 15, 2020 at 02:09:12PM +0100, Ian Kumlien wrote: > > > On Tue, Dec 15, 2020 at 1:40 AM Bjorn Helgaas wrote: > > > > On Mon, Dec 14, 2020 at 11:56:31PM +0100, Ian Kumlien wrote: > > > > > On Mon, Dec 14, 2020 at 8:19 PM Bjorn Helgaas > > > > > wrote: > > > > > > > > > > If you're interested, you could probably unload the Realtek drivers, > > > > > > remove the devices, and set the PCI_EXP_LNKCTL_LD (Link Disable) bit > > > > > > in 02:04.0, e.g., > > > > > > > > > > > > # > > > > > > RT=/sys/devices/pci:00/:00:01.2/:01:00.0/:02:04.0 > > > > > > # echo 1 > $RT/:04:00.0/remove > > > > > > # echo 1 > $RT/:04:00.1/remove > > > > > > # echo 1 > $RT/:04:00.2/remove > > > > > > # echo 1 > $RT/:04:00.4/remove > > > > > > # echo 1 > $RT/:04:00.7/remove > > > > > > # setpci -s02:04.0 CAP_EXP+0x10.w=0x0010 > > > > > > > > > > > > That should take 04:00.x out of the picture. > > > > > > > > > > Didn't actually change the behaviour, I'm suspecting an errata for > > > > > AMD pcie... > > > > > > > > > > So did this, with unpatched kernel: > > > > > [ ID] Interval Transfer Bitrate Retr Cwnd > > > > > [ 5] 0.00-1.00 sec 4.56 MBytes 38.2 Mbits/sec0 67.9 > > > > > KBytes > > > > > [ 5] 1.00-2.00 sec 4.47 MBytes 37.5 Mbits/sec0 96.2 > > > > > KBytes > > > > > [ 5] 2.00-3.00 sec 4.85 MBytes 40.7 Mbits/sec0 50.9 > > > > > KBytes > > > > > [ 5] 3.00-4.00 sec 4.23 MBytes 35.4 Mbits/sec0 70.7 > > > > > KBytes > > > > > [ 5] 4.00-5.00 sec 4.23 MBytes 35.4 Mbits/sec0 48.1 > > > > > KBytes > > > > > [ 5] 5.00-6.00 sec 4.23 MBytes 35.4 Mbits/sec0 45.2 > > > > > KBytes > > > > > [ 5] 6.00-7.00 sec 4.23 MBytes 35.4 Mbits/sec0 36.8 > > > > > KBytes > > > > > [ 5] 7.00-8.00 sec 3.98 MBytes 33.4 Mbits/sec0 36.8 > > > > > KBytes > > > > > [ 5] 8.00-9.00 sec 4.23 MBytes 35.4 Mbits/sec0 36.8 > > > > > KBytes > > > > > [ 5] 9.00-10.00 sec 4.23 MBytes 35.4 Mbits/sec0 48.1 > > > > > KBytes > > > > > - - - - - - - - - - - - - - - - - - - - - - - - - > > > > > [ ID] Interval Transfer Bitrate Retr > > > > > [ 5] 0.00-10.00 sec 43.2 MBytes 36.2 Mbits/sec0 > > > > > sender > > > > > [ 5] 0.00-10.00 sec 42.7 MBytes 35.8 Mbits/sec > > > > > receiver > > > > > > > > > > and: > > > > > echo 0 > > > > > > /sys/devices/pci:00/:00:01.2/:01:00.0/link/l1_aspm > > > > > > > > BTW, thanks a lot for testing out the "l1_aspm" sysfs file. I'm very > > > > pleased that it seems to be working as intended. > > > > > > It was nice to find it for easy disabling :) > > > > > > > > and: > > > > > [ ID] Interval Transfer Bitrate Retr Cwnd > > > > > [ 5] 0.00-1.00 sec 113 MBytes 951 Mbits/sec 153772 > > > > > KBytes > > > > > [ 5] 1.00-2.00 sec 109 MBytes 912 Mbits/sec 276550 > > > > > KBytes > > > > > [ 5] 2.00-3.00 sec 111 MBytes 933 Mbits/sec 123625 > > > > > KBytes > > > > > [ 5] 3.00-4.00 sec 111 MBytes 933 Mbits/sec 31687 > > > > > KBytes > > > > > [ 5] 4.00-5.00 sec 110 MBytes 923 Mbits/sec0679 > > > > > KBytes > > > > > [ 5] 5.00-6.00 sec 110 MBytes 923 Mbits/sec 136577 > > > > > KBytes > > > > > [ 5] 6.00-7.00 sec 110 MBytes 923 Mbits/sec 214645 > > > > > KBytes > > > > > [ 5] 7.00-8.00 sec 110 MBytes 923 Mbits/sec 32628 > > > > > KBytes > > > > > [ 5] 8.
Re: [PATCH 1/3] PCI/ASPM: Use the path max in L1 ASPM latency check
On Tue, Dec 15, 2020 at 02:09:12PM +0100, Ian Kumlien wrote: > On Tue, Dec 15, 2020 at 1:40 AM Bjorn Helgaas wrote: > > > > On Mon, Dec 14, 2020 at 11:56:31PM +0100, Ian Kumlien wrote: > > > On Mon, Dec 14, 2020 at 8:19 PM Bjorn Helgaas wrote: > > > > > > If you're interested, you could probably unload the Realtek drivers, > > > > remove the devices, and set the PCI_EXP_LNKCTL_LD (Link Disable) bit > > > > in 02:04.0, e.g., > > > > > > > > # RT=/sys/devices/pci:00/:00:01.2/:01:00.0/:02:04.0 > > > > # echo 1 > $RT/:04:00.0/remove > > > > # echo 1 > $RT/:04:00.1/remove > > > > # echo 1 > $RT/:04:00.2/remove > > > > # echo 1 > $RT/:04:00.4/remove > > > > # echo 1 > $RT/:04:00.7/remove > > > > # setpci -s02:04.0 CAP_EXP+0x10.w=0x0010 > > > > > > > > That should take 04:00.x out of the picture. > > > > > > Didn't actually change the behaviour, I'm suspecting an errata for AMD > > > pcie... > > > > > > So did this, with unpatched kernel: > > > [ ID] Interval Transfer Bitrate Retr Cwnd > > > [ 5] 0.00-1.00 sec 4.56 MBytes 38.2 Mbits/sec0 67.9 KBytes > > > [ 5] 1.00-2.00 sec 4.47 MBytes 37.5 Mbits/sec0 96.2 KBytes > > > [ 5] 2.00-3.00 sec 4.85 MBytes 40.7 Mbits/sec0 50.9 KBytes > > > [ 5] 3.00-4.00 sec 4.23 MBytes 35.4 Mbits/sec0 70.7 KBytes > > > [ 5] 4.00-5.00 sec 4.23 MBytes 35.4 Mbits/sec0 48.1 KBytes > > > [ 5] 5.00-6.00 sec 4.23 MBytes 35.4 Mbits/sec0 45.2 KBytes > > > [ 5] 6.00-7.00 sec 4.23 MBytes 35.4 Mbits/sec0 36.8 KBytes > > > [ 5] 7.00-8.00 sec 3.98 MBytes 33.4 Mbits/sec0 36.8 KBytes > > > [ 5] 8.00-9.00 sec 4.23 MBytes 35.4 Mbits/sec0 36.8 KBytes > > > [ 5] 9.00-10.00 sec 4.23 MBytes 35.4 Mbits/sec0 48.1 KBytes > > > - - - - - - - - - - - - - - - - - - - - - - - - - > > > [ ID] Interval Transfer Bitrate Retr > > > [ 5] 0.00-10.00 sec 43.2 MBytes 36.2 Mbits/sec0 > > > sender > > > [ 5] 0.00-10.00 sec 42.7 MBytes 35.8 Mbits/sec > > > receiver > > > > > > and: > > > echo 0 > /sys/devices/pci:00/:00:01.2/:01:00.0/link/l1_aspm > > > > BTW, thanks a lot for testing out the "l1_aspm" sysfs file. I'm very > > pleased that it seems to be working as intended. > > It was nice to find it for easy disabling :) > > > > and: > > > [ ID] Interval Transfer Bitrate Retr Cwnd > > > [ 5] 0.00-1.00 sec 113 MBytes 951 Mbits/sec 153772 KBytes > > > [ 5] 1.00-2.00 sec 109 MBytes 912 Mbits/sec 276550 KBytes > > > [ 5] 2.00-3.00 sec 111 MBytes 933 Mbits/sec 123625 KBytes > > > [ 5] 3.00-4.00 sec 111 MBytes 933 Mbits/sec 31687 KBytes > > > [ 5] 4.00-5.00 sec 110 MBytes 923 Mbits/sec0679 KBytes > > > [ 5] 5.00-6.00 sec 110 MBytes 923 Mbits/sec 136577 KBytes > > > [ 5] 6.00-7.00 sec 110 MBytes 923 Mbits/sec 214645 KBytes > > > [ 5] 7.00-8.00 sec 110 MBytes 923 Mbits/sec 32628 KBytes > > > [ 5] 8.00-9.00 sec 110 MBytes 923 Mbits/sec 81537 KBytes > > > [ 5] 9.00-10.00 sec 110 MBytes 923 Mbits/sec 10577 KBytes > > > - - - - - - - - - - - - - - - - - - - - - - - - - > > > [ ID] Interval Transfer Bitrate Retr > > > [ 5] 0.00-10.00 sec 1.08 GBytes 927 Mbits/sec 1056 > > > sender > > > [ 5] 0.00-10.00 sec 1.07 GBytes 923 Mbits/sec > > > receiver > > > > > > But this only confirms that the fix i experience is a side effect. > > > > > > The original code is still wrong :) > > > > What exactly is this machine? Brand, model, config? Maybe you could > > add this and a dmesg log to the buzilla? It seems like other people > > should be seeing the same problem, so I'm hoping to grub around on the > > web to see if there are similar reports involving these devices. 
> > ASUS Pro WS X570-ACE with AMD Ryzen 9 3900X Possible similar issues: https://forums.unraid.net/topic/94274-hardware-upgrade-woes/ https://forums.servethehome.com/index.php?threads/upgraded-my-home-server-from-intel-to-amd-virtual-disk-stu
Re: [PATCH 1/3] PCI/ASPM: Use the path max in L1 ASPM latency check
On Mon, Dec 14, 2020 at 11:56:31PM +0100, Ian Kumlien wrote: > On Mon, Dec 14, 2020 at 8:19 PM Bjorn Helgaas wrote: > > If you're interested, you could probably unload the Realtek drivers, > > remove the devices, and set the PCI_EXP_LNKCTL_LD (Link Disable) bit > > in 02:04.0, e.g., > > > > # RT=/sys/devices/pci:00/:00:01.2/:01:00.0/:02:04.0 > > # echo 1 > $RT/:04:00.0/remove > > # echo 1 > $RT/:04:00.1/remove > > # echo 1 > $RT/:04:00.2/remove > > # echo 1 > $RT/:04:00.4/remove > > # echo 1 > $RT/:04:00.7/remove > > # setpci -s02:04.0 CAP_EXP+0x10.w=0x0010 > > > > That should take 04:00.x out of the picture. > > Didn't actually change the behaviour, I'm suspecting an errata for AMD pcie... > > So did this, with unpatched kernel: > [ ID] Interval Transfer Bitrate Retr Cwnd > [ 5] 0.00-1.00 sec 4.56 MBytes 38.2 Mbits/sec0 67.9 KBytes > [ 5] 1.00-2.00 sec 4.47 MBytes 37.5 Mbits/sec0 96.2 KBytes > [ 5] 2.00-3.00 sec 4.85 MBytes 40.7 Mbits/sec0 50.9 KBytes > [ 5] 3.00-4.00 sec 4.23 MBytes 35.4 Mbits/sec0 70.7 KBytes > [ 5] 4.00-5.00 sec 4.23 MBytes 35.4 Mbits/sec0 48.1 KBytes > [ 5] 5.00-6.00 sec 4.23 MBytes 35.4 Mbits/sec0 45.2 KBytes > [ 5] 6.00-7.00 sec 4.23 MBytes 35.4 Mbits/sec0 36.8 KBytes > [ 5] 7.00-8.00 sec 3.98 MBytes 33.4 Mbits/sec0 36.8 KBytes > [ 5] 8.00-9.00 sec 4.23 MBytes 35.4 Mbits/sec0 36.8 KBytes > [ 5] 9.00-10.00 sec 4.23 MBytes 35.4 Mbits/sec0 48.1 KBytes > - - - - - - - - - - - - - - - - - - - - - - - - - > [ ID] Interval Transfer Bitrate Retr > [ 5] 0.00-10.00 sec 43.2 MBytes 36.2 Mbits/sec0 sender > [ 5] 0.00-10.00 sec 42.7 MBytes 35.8 Mbits/sec receiver > > and: > echo 0 > /sys/devices/pci:00/:00:01.2/:01:00.0/link/l1_aspm BTW, thanks a lot for testing out the "l1_aspm" sysfs file. I'm very pleased that it seems to be working as intended. > and: > [ ID] Interval Transfer Bitrate Retr Cwnd > [ 5] 0.00-1.00 sec 113 MBytes 951 Mbits/sec 153772 KBytes > [ 5] 1.00-2.00 sec 109 MBytes 912 Mbits/sec 276550 KBytes > [ 5] 2.00-3.00 sec 111 MBytes 933 Mbits/sec 123625 KBytes > [ 5] 3.00-4.00 sec 111 MBytes 933 Mbits/sec 31687 KBytes > [ 5] 4.00-5.00 sec 110 MBytes 923 Mbits/sec0679 KBytes > [ 5] 5.00-6.00 sec 110 MBytes 923 Mbits/sec 136577 KBytes > [ 5] 6.00-7.00 sec 110 MBytes 923 Mbits/sec 214645 KBytes > [ 5] 7.00-8.00 sec 110 MBytes 923 Mbits/sec 32628 KBytes > [ 5] 8.00-9.00 sec 110 MBytes 923 Mbits/sec 81537 KBytes > [ 5] 9.00-10.00 sec 110 MBytes 923 Mbits/sec 10577 KBytes > - - - - - - - - - - - - - - - - - - - - - - - - - > [ ID] Interval Transfer Bitrate Retr > [ 5] 0.00-10.00 sec 1.08 GBytes 927 Mbits/sec 1056 sender > [ 5] 0.00-10.00 sec 1.07 GBytes 923 Mbits/sec receiver > > But this only confirms that the fix i experience is a side effect. > > The original code is still wrong :) What exactly is this machine? Brand, model, config? Maybe you could add this and a dmesg log to the buzilla? It seems like other people should be seeing the same problem, so I'm hoping to grub around on the web to see if there are similar reports involving these devices. https://bugzilla.kernel.org/show_bug.cgi?id=209725 Here's one that is superficially similar: https://linux-hardware.org/index.php?probe=e5f24075e5&log=lspci_all in that it has a RP -- switch -- I211 path. Interestingly, the switch here advertises <64us L1 exit latency instead of the <32us latency your switch advertises. Of course, I can't tell if it's exactly the same switch. Bjorn
Re: [PATCH 1/3] PCI/ASPM: Use the path max in L1 ASPM latency check
On Mon, Dec 14, 2020 at 04:47:32PM +0100, Ian Kumlien wrote: > On Mon, Dec 14, 2020 at 3:02 PM Bjorn Helgaas wrote: > > On Mon, Dec 14, 2020 at 10:14:18AM +0100, Ian Kumlien wrote: > > > On Mon, Dec 14, 2020 at 6:44 AM Bjorn Helgaas wrote: > > > > > > > > [+cc Jesse, Tony, David, Jakub, Heiner, lists in case there's an ASPM > > > > issue with I211 or Realtek NICs. Beginning of thread: > > > > https://lore.kernel.org/r/20201024205548.1837770-1-ian.kuml...@gmail.com > > > > > > > > Short story: Ian has: > > > > > > > > Root Port --- Switch --- I211 NIC > > > >\-- multifunction Realtek NIC, etc > > > > > > > > and the I211 performance is poor with ASPM L1 enabled on both links > > > > in the path to it. The patch here disables ASPM on the upstream link > > > > and fixes the performance, but AFAICT the devices in that path give us > > > > no reason to disable L1. If I understand the spec correctly, the > > > > Realtek device should not be relevant to the I211 path.] > > > > > > > > On Sun, Dec 13, 2020 at 10:39:53PM +0100, Ian Kumlien wrote: > > > > > On Sun, Dec 13, 2020 at 12:47 AM Bjorn Helgaas > > > > > wrote: > > > > > > On Sat, Oct 24, 2020 at 10:55:46PM +0200, Ian Kumlien wrote: > > > > > > > Make pcie_aspm_check_latency comply with the PCIe spec, > > > > > > > specifically: > > > > > > > "5.4.1.2.2. Exit from the L1 State" > > > > > > > > > > > > > > Which makes it clear that each switch is required to > > > > > > > initiate a transition within 1μs from receiving it, > > > > > > > accumulating this latency and then we have to wait for the > > > > > > > slowest link along the path before entering L0 state from > > > > > > > L1. > > > > > > > ... > > > > > > > > > > > > > On my specific system: > > > > > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit > > > > > > > Network Connection (rev 03) > > > > > > > 04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. > > > > > > > Device 816e (rev 1a) > > > > > > > > > > > > > > Exit latency Acceptable latency > > > > > > > Tree: L1 L0s L1 L0s > > > > > > > -- --- - --- -- > > > > > > > 00:01.2 <32 us - > > > > > > > | 01:00.0 <32 us - > > > > > > > |- 02:03.0 <32 us - > > > > > > > | \03:00.0 <16 us <2us <64 us <512ns > > > > > > > | > > > > > > > \- 02:04.0 <32 us - > > > > > > > \04:00.0 <64 us unlimited <64 us <512ns > > > > > > > > > > > > > > 04:00.0's latency is the same as the maximum it allows so as > > > > > > > we walk the path the first switchs startup latency will pass > > > > > > > the acceptable latency limit for the link, and as a > > > > > > > side-effect it fixes my issues with 03:00.0. > > > > > > > > > > > > > > Without this patch, 03:00.0 misbehaves and only gives me ~40 > > > > > > > mbit/s over links with 6 or more hops. With this patch I'm > > > > > > > back to a maximum of ~933 mbit/s. > > > > > > > > > > > > There are two paths here that share a Link: > > > > > > > > > > > > 00:01.2 --- 01:00.0 -- 02:03.0 --- 03:00.0 I211 NIC > > > > > > 00:01.2 --- 01:00.0 -- 02:04.0 --- 04:00.x multifunction Realtek > > > > > > > > > > > > 1) The path to the I211 NIC includes four Ports and two Links (the > > > > > >connection between 01:00.0 and 02:03.0 is internal Switch > > > > > > routing, > > > > > >not a Link). > > > > > > > > > > >The Ports advertise L1 exit latencies of <32us, <32us, <32us, > > > > > ><16us. If both Links are in L1 and 03:00.0 initiates L1 exit at > > > > > > T, > > > > > >01:00.0 init
Re: [PATCH 1/3] PCI/ASPM: Use the path max in L1 ASPM latency check
On Mon, Dec 14, 2020 at 10:14:18AM +0100, Ian Kumlien wrote: > On Mon, Dec 14, 2020 at 6:44 AM Bjorn Helgaas wrote: > > > > [+cc Jesse, Tony, David, Jakub, Heiner, lists in case there's an ASPM > > issue with I211 or Realtek NICs. Beginning of thread: > > https://lore.kernel.org/r/20201024205548.1837770-1-ian.kuml...@gmail.com > > > > Short story: Ian has: > > > > Root Port --- Switch --- I211 NIC > >\-- multifunction Realtek NIC, etc > > > > and the I211 performance is poor with ASPM L1 enabled on both links > > in the path to it. The patch here disables ASPM on the upstream link > > and fixes the performance, but AFAICT the devices in that path give us > > no reason to disable L1. If I understand the spec correctly, the > > Realtek device should not be relevant to the I211 path.] > > > > On Sun, Dec 13, 2020 at 10:39:53PM +0100, Ian Kumlien wrote: > > > On Sun, Dec 13, 2020 at 12:47 AM Bjorn Helgaas wrote: > > > > On Sat, Oct 24, 2020 at 10:55:46PM +0200, Ian Kumlien wrote: > > > > > Make pcie_aspm_check_latency comply with the PCIe spec, specifically: > > > > > "5.4.1.2.2. Exit from the L1 State" > > > > > > > > > > Which makes it clear that each switch is required to > > > > > initiate a transition within 1μs from receiving it, > > > > > accumulating this latency and then we have to wait for the > > > > > slowest link along the path before entering L0 state from > > > > > L1. > > > > > ... > > > > > > > > > On my specific system: > > > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network > > > > > Connection (rev 03) > > > > > 04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. > > > > > Device 816e (rev 1a) > > > > > > > > > > Exit latency Acceptable latency > > > > > Tree: L1 L0s L1 L0s > > > > > -- --- - --- -- > > > > > 00:01.2 <32 us - > > > > > | 01:00.0 <32 us - > > > > > |- 02:03.0 <32 us - > > > > > | \03:00.0 <16 us <2us <64 us <512ns > > > > > | > > > > > \- 02:04.0 <32 us - > > > > > \04:00.0 <64 us unlimited <64 us <512ns > > > > > > > > > > 04:00.0's latency is the same as the maximum it allows so as > > > > > we walk the path the first switchs startup latency will pass > > > > > the acceptable latency limit for the link, and as a > > > > > side-effect it fixes my issues with 03:00.0. > > > > > > > > > > Without this patch, 03:00.0 misbehaves and only gives me ~40 > > > > > mbit/s over links with 6 or more hops. With this patch I'm > > > > > back to a maximum of ~933 mbit/s. > > > > > > > > There are two paths here that share a Link: > > > > > > > > 00:01.2 --- 01:00.0 -- 02:03.0 --- 03:00.0 I211 NIC > > > > 00:01.2 --- 01:00.0 -- 02:04.0 --- 04:00.x multifunction Realtek > > > > > > > > 1) The path to the I211 NIC includes four Ports and two Links (the > > > >connection between 01:00.0 and 02:03.0 is internal Switch routing, > > > >not a Link). > > > > > > >The Ports advertise L1 exit latencies of <32us, <32us, <32us, > > > ><16us. If both Links are in L1 and 03:00.0 initiates L1 exit at T, > > > >01:00.0 initiates L1 exit at T + 1. A TLP from 03:00.0 may see up > > > >to 1 + 32 = 33us of L1 exit latency. > > > > > > > >The NIC can tolerate up to 64us of L1 exit latency, so it is safe > > > >to enable L1 for both Links. > > > > > > > > 2) The path to the Realtek device is similar except that the Realtek > > > >L1 exit latency is <64us. If both Links are in L1 and 04:00.x > > > >initiates L1 exit at T, 01:00.0 again initiates L1 exit at T + 1, > > > >but a TLP from 04:00.x may see up to 1 + 64 = 65us of L1 exit > > > >latency. 
> > > > > > > >The Realtek device can only tolerate 64us of latency, so it is not > > > >safe to enable L1 for both Links. It should be safe to enable L1 > > > >on the shared link because the exit latency
Re: [PATCH 1/3] PCI/ASPM: Use the path max in L1 ASPM latency check
[+cc Jesse, Tony, David, Jakub, Heiner, lists in case there's an ASPM issue with I211 or Realtek NICs. Beginning of thread: https://lore.kernel.org/r/20201024205548.1837770-1-ian.kuml...@gmail.com Short story: Ian has: Root Port --- Switch --- I211 NIC \-- multifunction Realtek NIC, etc and the I211 performance is poor with ASPM L1 enabled on both links in the path to it. The patch here disables ASPM on the upstream link and fixes the performance, but AFAICT the devices in that path give us no reason to disable L1. If I understand the spec correctly, the Realtek device should not be relevant to the I211 path.] On Sun, Dec 13, 2020 at 10:39:53PM +0100, Ian Kumlien wrote: > On Sun, Dec 13, 2020 at 12:47 AM Bjorn Helgaas wrote: > > On Sat, Oct 24, 2020 at 10:55:46PM +0200, Ian Kumlien wrote: > > > Make pcie_aspm_check_latency comply with the PCIe spec, specifically: > > > "5.4.1.2.2. Exit from the L1 State" > > > > > > Which makes it clear that each switch is required to initiate a > > > transition within 1μs from receiving it, accumulating this latency and > > > then we have to wait for the slowest link along the path before > > > entering L0 state from L1. > > > ... > > > > > On my specific system: > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network > > > Connection (rev 03) > > > 04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. Device > > > 816e (rev 1a) > > > > > > Exit latency Acceptable latency > > > Tree: L1 L0s L1 L0s > > > -- --- - --- -- > > > 00:01.2 <32 us - > > > | 01:00.0 <32 us - > > > |- 02:03.0 <32 us - > > > | \03:00.0 <16 us <2us <64 us <512ns > > > | > > > \- 02:04.0 <32 us - > > > \04:00.0 <64 us unlimited <64 us <512ns > > > > > > 04:00.0's latency is the same as the maximum it allows so as we walk the > > > path > > > the first switchs startup latency will pass the acceptable latency limit > > > for the link, and as a side-effect it fixes my issues with 03:00.0. > > > > > > Without this patch, 03:00.0 misbehaves and only gives me ~40 mbit/s over > > > links with 6 or more hops. With this patch I'm back to a maximum of ~933 > > > mbit/s. > > > > There are two paths here that share a Link: > > > > 00:01.2 --- 01:00.0 -- 02:03.0 --- 03:00.0 I211 NIC > > 00:01.2 --- 01:00.0 -- 02:04.0 --- 04:00.x multifunction Realtek > > > > 1) The path to the I211 NIC includes four Ports and two Links (the > >connection between 01:00.0 and 02:03.0 is internal Switch routing, > >not a Link). > > >The Ports advertise L1 exit latencies of <32us, <32us, <32us, > ><16us. If both Links are in L1 and 03:00.0 initiates L1 exit at T, > >01:00.0 initiates L1 exit at T + 1. A TLP from 03:00.0 may see up > >to 1 + 32 = 33us of L1 exit latency. > > > >The NIC can tolerate up to 64us of L1 exit latency, so it is safe > >to enable L1 for both Links. > > > > 2) The path to the Realtek device is similar except that the Realtek > >L1 exit latency is <64us. If both Links are in L1 and 04:00.x > >initiates L1 exit at T, 01:00.0 again initiates L1 exit at T + 1, > >but a TLP from 04:00.x may see up to 1 + 64 = 65us of L1 exit > >latency. > > > >The Realtek device can only tolerate 64us of latency, so it is not > >safe to enable L1 for both Links. It should be safe to enable L1 > >on the shared link because the exit latency for that link would be > ><32us. 
> > 04:00.0: > DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us > LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s > unlimited, L1 <64us > > So maximum latency for the entire link has to be <64 us > For the device to leave L1 ASPM takes <64us > > So the device itself is the slowest entry along the link, which > means that nothing else along that path can have ASPM enabled Yes. That's what I said above: "it is not safe to enable L1 for both Links." Unless I'm missing something, we agree on that. I also said that it should be safe to enable L1 on the shared Link (from 00:01.2 to 01:00.0) because if the downstream Link is always in L0, the exit latency of the shared Link should be <32us, and 04:00.x can tolerate 64us. > > > The original code path did:
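To make the arithmetic in this thread concrete, here is a kernel-style sketch (not the actual pcie_aspm_check_latency()) of the check the patch argues for: take the worst single L1 exit latency on the path, add 1 us of propagation for each additional Link that has to be woken, and compare against what the endpoint can tolerate.

        /*
         * Sketch only; exit_lat_us[0] is the endpoint's Link and the array
         * walks upstream toward the Root Port.
         */
        static bool l1_path_latency_ok(const unsigned int *exit_lat_us,
                                       int nr_links, unsigned int acceptable_us)
        {
                unsigned int worst = 0;
                int i;

                for (i = 0; i < nr_links; i++)
                        if (exit_lat_us[i] > worst)
                                worst = exit_lat_us[i];

                /* each Link above the one initiating the exit adds up to 1 us */
                return worst + (nr_links - 1) <= acceptable_us;
        }

For the I211 path this gives 32 + 1 = 33 us against a 64 us budget (OK); for the Realtek path it gives 64 + 1 = 65 us against 64 us (not OK), i.e. the 1 + 32 = 33 and 1 + 64 = 65 figures worked out earlier in the thread.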
Re: [PATCH] drivers: broadcom: save return value of pci_find_capability() in u8
On Mon, Dec 07, 2020 at 01:40:33AM +0530, Puranjay Mohan wrote: > Callers of pci_find_capability() should save the return value in u8. > change the type of pcix_cap from int to u8, to match the specification. > > Signed-off-by: Puranjay Mohan > --- > drivers/net/ethernet/broadcom/tg3.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/net/ethernet/broadcom/tg3.h > b/drivers/net/ethernet/broadcom/tg3.h > index 1000c894064f..f1781d2dce0b 100644 > --- a/drivers/net/ethernet/broadcom/tg3.h > +++ b/drivers/net/ethernet/broadcom/tg3.h > @@ -3268,7 +3268,7 @@ struct tg3 { > > int pci_fn; > int msi_cap; > - int pcix_cap; > + u8 pcix_cap; msi_cap is also a u8. But I don't think it's worth changing either of these unless we take a broader look and see whether they're needed at all. msi_cap is used to restore the MSI enable bit after a highly device-specific reset. pcix_cap is used for some PCI-X configuration that really should be done via pcix_set_mmrbc() and possibly some sort of quirk for PCI_X_CMD_MAX_SPLIT. But that's all pretty messy and I doubt it's worth doing it at this point, since PCI-X is pretty much ancient history. > int pcie_readrq; > > struct mii_bus *mdio_bus; > -- > 2.27.0 >
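For reference, a minimal hypothetical usage sketch showing why a u8 is enough: a capability position is an offset into the first 256 bytes of config space, so it always fits in 8 bits.

        #include <linux/pci.h>

        /* Illustrative only; not taken from the tg3 driver. */
        static u16 example_read_pcix_cmd(struct pci_dev *pdev)
        {
                u8 pos = pci_find_capability(pdev, PCI_CAP_ID_PCIX);
                u16 cmd = 0;

                if (pos)
                        pci_read_config_word(pdev, pos + PCI_X_CMD, &cmd);
                return cmd;
        }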
Re: [PATCH v2 0/5] Improve s0ix flows for systems i219LM
On Wed, Dec 02, 2020 at 07:24:28PM +, Limonciello, Mario wrote: > > -Original Message- > > From: Jakub Kicinski > > Sent: Wednesday, December 2, 2020 13:07 > > To: Limonciello, Mario > > Cc: Tony Nguyen; intel-wired-...@lists.osuosl.org; Linux PM; Netdev; > > Alexander > > Duyck; Sasha Netfin; Aaron Brown; Stefan Assmann; David Miller; > > darc...@redhat.com; Shen, Yijun; Yuan, Perry > > Subject: Re: [PATCH v2 0/5] Improve s0ix flows for systems i219LM > > > > > > [EXTERNAL EMAIL] > > > > On Wed, 2 Dec 2020 10:17:43 -0600 Mario Limonciello wrote: > > > commit e086ba2fccda ("e1000e: disable s0ix entry and exit flows for ME > > systems") > > > disabled s0ix flows for systems that have various incarnations of the > > > i219-LM ethernet controller. This was done because of some regressions > > > caused by an earlier > > > commit 632fbd5eb5b0e ("e1000e: fix S0ix flows for cable connected case") > > > with i219-LM controller. > > > > > > Performing suspend to idle with these ethernet controllers requires a > > properly > > > configured system. To make enabling such systems easier, this patch > > > series allows turning on using ethtool. > > > > > > The flows have also been confirmed to be configured correctly on Dell's > > Latitude > > > and Precision CML systems containing the i219-LM controller, when the > > > kernel > > also > > > contains the fix for s0i3.2 entry previously submitted here: > > > https://marc.info/?l=linux-netdev&m=160677194809564&w=2 > > > > > > Patches 3 and 4 will turn the behavior on by default for Dell's CML > > > systems. > > > Patch 5 allows accessing the value of the flags via ethtool to tell if the > > > heuristics have turned on s0ix flows, as well as for development purposes > > > to determine if a system should be added to the heuristics list. > > > > I don't see PCI or Bjorn Helgaas CCed. > > > > You can drop linux-kernel tho. > > Correct, that was intentional that PCI (and Bjorn) weren't added. Since I > came > up with a way to detect platforms without DMI as suggested and this is > entirely > controlling a driver behavior within e1000e only on systems with i219-LM I > didn't think that PCI ML was actually needed. > > Since you disagree, I'll add Bjorn into this thread. > > @Bjorn Helgaas, > > Apologies that you're looped in this way rather than directly to the > submission, > but the cover letter is above and the patch series can be viewed at this > patchwork > if you would like to fetch the mbox and respond to provide any comments. > > https://patchwork.ozlabs.org/project/netdev/list/?series=218121 > > I'll include you directly if any future v3 is necessary. No need, I don't think. AFAICT there's nothing there related to the PCI core. Thanks! Bjorn
Re: [PATCH v2 4/5] e1000e: Add more Dell CML systems into s0ix heuristics
s/s0ix/S0ix/ in subject. On Wed, Dec 02, 2020 at 10:17:47AM -0600, Mario Limonciello wrote: > These comet lake systems are not yet released, but have been validated > on pre-release hardware. s/comet lake/Comet Lake/ to match previous usage in patch 3/5. > This is being submitted separately from released hardware in case of > a regression between pre-release and release hardware so this commit > can be reverted alone. > > Tested-by: Yijun Shen > Signed-off-by: Mario Limonciello > --- > drivers/net/ethernet/intel/e1000e/s0ix.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/drivers/net/ethernet/intel/e1000e/s0ix.c > b/drivers/net/ethernet/intel/e1000e/s0ix.c > index 74043e80c32f..0dd2e2702ebb 100644 > --- a/drivers/net/ethernet/intel/e1000e/s0ix.c > +++ b/drivers/net/ethernet/intel/e1000e/s0ix.c > @@ -60,6 +60,9 @@ static bool e1000e_check_subsystem_allowlist(struct pci_dev > *dev) > case 0x09c2: /* Precision 3551 */ > case 0x09c3: /* Precision 7550 */ > case 0x09c4: /* Precision 7750 */ > + case 0x0a40: /* Notebook 0x0a40 */ > + case 0x0a41: /* Notebook 0x0a41 */ > + case 0x0a42: /* Notebook 0x0a42 */ > return true; > } > } > -- > 2.25.1 >
Re: [PATCH v2 2/5] e1000e: Move all s0ix related code into it's own source file
s/it's/its/ (in subject as well as below). Previous patches used "S0ix", not "s0ix" (in subject as well as below, as well as subject and commit log of 3/5 and 5/5). On Wed, Dec 02, 2020 at 10:17:45AM -0600, Mario Limonciello wrote: > Introduce a flag to indicate the device should be using the s0ix > flows and use this flag to run those functions. Would be nicer to have a move that does nothing else + a separate patch that adds a flag so it's more obvious, but again, not my circus. > Splitting the code to it's own file will make future heuristics > more self containted. s/containted/contained/ Bjorn
Re: [PATCH v2 1/5] e1000e: fix S0ix flow to allow S0i3.2 subset entry
On Wed, Dec 02, 2020 at 10:17:44AM -0600, Mario Limonciello wrote: > From: Vitaly Lifshits > > Changed a configuration in the flows to align with > architecture requirements to achieve S0i3.2 substate. I guess this is really talking about requirements of a specific CPU/SOC before it will enter S0i3.2? > Also fixed a typo in the previous commit 632fbd5eb5b0 > ("e1000e: fix S0ix flows for cable connected case"). Not clear what the typo was, maybe these? > - ew32(FEXTNVM12, mac_data); > + ew32(FEXTNVM6, mac_data); > - ew32(FEXTNVM12, mac_data); > + ew32(FEXTNVM6, mac_data); I would probably have put typo fixes in a separate patch, especially since the cover letter mentions regressions related to 632fbd5eb5b0. Maybe the commit log for the fix should mention that it's fixing a regression, what the regression was, and include a Fixes: tag? But not my circus. > Signed-off-by: Vitaly Lifshits > Tested-by: Aaron Brown > Signed-off-by: Tony Nguyen > --- > drivers/net/ethernet/intel/e1000e/netdev.c | 8 > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c > b/drivers/net/ethernet/intel/e1000e/netdev.c > index b30f00891c03..128ab6898070 100644 > --- a/drivers/net/ethernet/intel/e1000e/netdev.c > +++ b/drivers/net/ethernet/intel/e1000e/netdev.c > @@ -6475,13 +6475,13 @@ static void e1000e_s0ix_entry_flow(struct > e1000_adapter *adapter) > > /* Ungate PGCB clock */ > mac_data = er32(FEXTNVM9); > - mac_data |= BIT(28); > + mac_data &= ~BIT(28); > ew32(FEXTNVM9, mac_data); > > /* Enable K1 off to enable mPHY Power Gating */ > mac_data = er32(FEXTNVM6); > mac_data |= BIT(31); > - ew32(FEXTNVM12, mac_data); > + ew32(FEXTNVM6, mac_data); > > /* Enable mPHY power gating for any link and speed */ > mac_data = er32(FEXTNVM8); > @@ -6525,11 +6525,11 @@ static void e1000e_s0ix_exit_flow(struct > e1000_adapter *adapter) > /* Disable K1 off */ > mac_data = er32(FEXTNVM6); > mac_data &= ~BIT(31); > - ew32(FEXTNVM12, mac_data); > + ew32(FEXTNVM6, mac_data); > > /* Disable Ungate PGCB clock */ > mac_data = er32(FEXTNVM9); > - mac_data &= ~BIT(28); > + mac_data |= BIT(28); > ew32(FEXTNVM9, mac_data); > > /* Cancel not waking from dynamic > -- > 2.25.1 >
Re: [PATCH] PCI: Rename d3_delay in the pci_dev struct to align with PCI specification
On Thu, Jul 30, 2020 at 09:08:48PM +, Krzysztof Wilczyński wrote: > Rename PCI-related variable "d3_delay" to "d3hot_delay" in the pci_dev > struct to better align with the PCI Firmware specification (see PCI > Firmware Specification, Revision 3.2, Section 4.6.9, p. 73). > > The pci_dev struct already contains variable "d3cold_delay", thus > renaming "d3_delay" to "d3hot_delay" reduces ambiguity as PCI devices > support two variants of the D3 power state: D3hot and D3cold. > > Also, rename other constants and variables, and updates code comments > and documentation to ensure alignment with the PCI specification. > > There is no change to the functionality. > > Signed-off-by: Krzysztof Wilczyński Applied to pci/pm for v5.10, thanks! > --- > Documentation/power/pci.rst | 2 +- > arch/x86/pci/fixup.c | 2 +- > arch/x86/pci/intel_mid_pci.c | 2 +- > drivers/hid/intel-ish-hid/ipc/ipc.c | 2 +- > drivers/net/ethernet/marvell/sky2.c | 2 +- > drivers/pci/pci-acpi.c| 6 +- > drivers/pci/pci.c | 14 ++-- > drivers/pci/pci.h | 4 +- > drivers/pci/quirks.c | 68 +-- > .../staging/media/atomisp/pci/atomisp_v4l2.c | 2 +- > include/linux/pci.h | 2 +- > include/uapi/linux/pci_regs.h | 2 +- > 12 files changed, 54 insertions(+), 54 deletions(-) > > diff --git a/Documentation/power/pci.rst b/Documentation/power/pci.rst > index 1831e431f725..b04fb18cc4e2 100644 > --- a/Documentation/power/pci.rst > +++ b/Documentation/power/pci.rst > @@ -320,7 +320,7 @@ that these callbacks operate on:: > unsigned intd2_support:1; /* Low power state D2 is supported */ > unsigned intno_d1d2:1; /* D1 and D2 are forbidden */ > unsigned intwakeup_prepared:1; /* Device prepared for wake up */ > - unsigned intd3_delay; /* D3->D0 transition time in ms */ > + unsigned intd3hot_delay;/* D3hot->D0 transition time in ms */ > ... 
>}; > > diff --git a/arch/x86/pci/fixup.c b/arch/x86/pci/fixup.c > index 0c67a5a94de3..9e3d9cc6afc4 100644 > --- a/arch/x86/pci/fixup.c > +++ b/arch/x86/pci/fixup.c > @@ -587,7 +587,7 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0xa26d, > pci_invalid_bar); > static void pci_fixup_amd_ehci_pme(struct pci_dev *dev) > { > dev_info(&dev->dev, "PME# does not work under D3, disabling it\n"); > - dev->pme_support &= ~((PCI_PM_CAP_PME_D3 | PCI_PM_CAP_PME_D3cold) > + dev->pme_support &= ~((PCI_PM_CAP_PME_D3hot | PCI_PM_CAP_PME_D3cold) > >> PCI_PM_CAP_PME_SHIFT); > } > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x7808, pci_fixup_amd_ehci_pme); > diff --git a/arch/x86/pci/intel_mid_pci.c b/arch/x86/pci/intel_mid_pci.c > index 00c62115f39c..979f310b67d4 100644 > --- a/arch/x86/pci/intel_mid_pci.c > +++ b/arch/x86/pci/intel_mid_pci.c > @@ -322,7 +322,7 @@ static void pci_d3delay_fixup(struct pci_dev *dev) >*/ > if (type1_access_ok(dev->bus->number, dev->devfn, PCI_DEVICE_ID)) > return; > - dev->d3_delay = 0; > + dev->d3hot_delay = 0; > } > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_ANY_ID, pci_d3delay_fixup); > > diff --git a/drivers/hid/intel-ish-hid/ipc/ipc.c > b/drivers/hid/intel-ish-hid/ipc/ipc.c > index 8f8dfdf64833..a45ac7fa417b 100644 > --- a/drivers/hid/intel-ish-hid/ipc/ipc.c > +++ b/drivers/hid/intel-ish-hid/ipc/ipc.c > @@ -755,7 +755,7 @@ static int _ish_hw_reset(struct ishtp_device *dev) > csr |= PCI_D3hot; > pci_write_config_word(pdev, pdev->pm_cap + PCI_PM_CTRL, csr); > > - mdelay(pdev->d3_delay); > + mdelay(pdev->d3hot_delay); > > csr &= ~PCI_PM_CTRL_STATE_MASK; > csr |= PCI_D0; > diff --git a/drivers/net/ethernet/marvell/sky2.c > b/drivers/net/ethernet/marvell/sky2.c > index fe54764caea9..ce7a94060a96 100644 > --- a/drivers/net/ethernet/marvell/sky2.c > +++ b/drivers/net/ethernet/marvell/sky2.c > @@ -5104,7 +5104,7 @@ static int sky2_probe(struct pci_dev *pdev, const > struct pci_device_id *ent) > INIT_WORK(&hw->restart_work, sky2_restart); > > pci_set_drvdata(pdev, hw); > - pdev->d3_delay = 300; > + pdev->d3hot_delay = 300; > > return 0; > > diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c > index 7224b1e5f2a8..c54588ad2d9c 100644 > --- a/drivers/pci/pci-acpi.c > +++ b/drivers/pci/pci-acpi.c > @@ -1167,7 +1167,7 @@ static struct acpi_device > *acpi_pci_find_companion(struct device *dev) > * @pdev: the PCI device whose delay is to be updated > * @handle: ACPI handle of this device > * > - * Update the d3_delay and d3cold_delay of a PCI device from the ACPI _DSM > + * Update the d3hot_delay and d3cold_delay of a PCI device from the ACPI _DSM > * control method of either the device itself or the PCI host bridge. >
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
[to: Christoph in case he has comments, since I think he wrote this code] On Mon, Sep 28, 2020 at 02:35:29PM -0400, Nitesh Narayan Lal wrote: > If we have isolated CPUs dedicated for use by real-time tasks, we try to > move IRQs to housekeeping CPUs from the userspace to reduce latency > overhead on the isolated CPUs. > > If we allocate too many IRQ vectors, moving them all to housekeeping CPUs > may exceed per-CPU vector limits. > > When we have isolated CPUs, limit the number of vectors allocated by > pci_alloc_irq_vectors() to the minimum number required by the driver, or > to one per housekeeping CPU if that is larger. > > Signed-off-by: Nitesh Narayan Lal Acked-by: Bjorn Helgaas > --- > drivers/pci/msi.c | 18 ++ > 1 file changed, 18 insertions(+) > > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c > index 30ae4ffda5c1..8c156867803c 100644 > --- a/drivers/pci/msi.c > +++ b/drivers/pci/msi.c > @@ -23,6 +23,7 @@ > #include > #include > #include > +#include > > #include "pci.h" > > @@ -1191,8 +1192,25 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev > *dev, unsigned int min_vecs, > struct irq_affinity *affd) > { > struct irq_affinity msi_default_affd = {0}; > + unsigned int hk_cpus; > int nvecs = -ENOSPC; > > + hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ); > + > + /* > + * If we have isolated CPUs for use by real-time tasks, to keep the > + * latency overhead to a minimum, device-specific IRQ vectors are moved > + * to the housekeeping CPUs from the userspace by changing their > + * affinity mask. Limit the vector usage to keep housekeeping CPUs from > + * running out of IRQ vectors. > + */ > + if (hk_cpus < num_online_cpus()) { > + if (hk_cpus < min_vecs) > + max_vecs = min_vecs; > + else if (hk_cpus < max_vecs) > + max_vecs = hk_cpus; > + } > + > if (flags & PCI_IRQ_AFFINITY) { > if (!affd) > affd = &msi_default_affd; > -- > 2.18.2 >
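A hypothetical driver-side view of what the cap means in practice (nothing below is from a real driver): the common pattern of asking for up to one vector per online CPU, which the code above silently limits when there are fewer housekeeping CPUs than online CPUs.

        #include <linux/pci.h>

        /* Hypothetical probe fragment, not from any in-tree driver. */
        static int example_setup_irqs(struct pci_dev *pdev)
        {
                int nvecs;

                nvecs = pci_alloc_irq_vectors(pdev, 1, num_online_cpus(),
                                              PCI_IRQ_MSIX | PCI_IRQ_MSI);
                if (nvecs < 0)
                        return nvecs;

                /* request_irq(pci_irq_vector(pdev, i), ...) for each of nvecs */
                return 0;
        }

With the patch applied, a request like this on a system where most CPUs are isolated comes back with at most one vector per housekeeping CPU (but never fewer than the driver's stated minimum), so moving all of the IRQs off the isolated CPUs cannot exhaust the per-CPU vector space.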
Re: [PATCH next-queue v1 1/3] Revert "PCI: Make pci_enable_ptm() private"
On Fri, Sep 25, 2020 at 04:28:32PM -0700, Vinicius Costa Gomes wrote: > Make pci_enable_ptm() accessible from the drivers. > > Even if PTM still works on the platform I am using without calling > this this function, it might be possible that it's not always the > case. *Does* PTM work on your system without calling pci_enable_ptm()? If so, I think that would mean the BIOS enabled PTM, and that seems slightly surprising. > Exposing this to the driver enables the driver to use the > 'ptm_enabled' field of 'pci_dev' to check if PTM is enabled or not. > > This reverts commit ac6c26da29c12fa511c877c273ed5c939dc9e96c. > > Signed-off-by: Vinicius Costa Gomes AFAICT we just never had any callers at all for pci_enable_ptm(). I probably shouldn't have merged it in the first place. Acked-by: Bjorn Helgaas > --- > drivers/pci/pci.h | 3 --- > include/linux/pci.h | 7 +++ > 2 files changed, 7 insertions(+), 3 deletions(-) > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h > index fa12f7cbc1a0..8871109fe390 100644 > --- a/drivers/pci/pci.h > +++ b/drivers/pci/pci.h > @@ -582,11 +582,8 @@ static inline void pcie_ecrc_get_policy(char *str) { } > > #ifdef CONFIG_PCIE_PTM > void pci_ptm_init(struct pci_dev *dev); > -int pci_enable_ptm(struct pci_dev *dev, u8 *granularity); > #else > static inline void pci_ptm_init(struct pci_dev *dev) { } > -static inline int pci_enable_ptm(struct pci_dev *dev, u8 *granularity) > -{ return -EINVAL; } > #endif > > struct pci_dev_reset_methods { > diff --git a/include/linux/pci.h b/include/linux/pci.h > index 835530605c0d..ec4b28153cc4 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -1593,6 +1593,13 @@ static inline bool pci_aer_available(void) { return > false; } > > bool pci_ats_disabled(void); > > +#ifdef CONFIG_PCIE_PTM > +int pci_enable_ptm(struct pci_dev *dev, u8 *granularity); > +#else > +static inline int pci_enable_ptm(struct pci_dev *dev, u8 *granularity) > +{ return -EINVAL; } > +#endif > + > void pci_cfg_access_lock(struct pci_dev *dev); > bool pci_cfg_access_trylock(struct pci_dev *dev); > void pci_cfg_access_unlock(struct pci_dev *dev); > -- > 2.28.0 >
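For context, a hypothetical caller once the symbol is visible to drivers again; pci_enable_ptm() and the ptm_enabled field are the ones named above, everything else is made up for illustration.

        #include <linux/pci.h>

        /* Hypothetical probe-time helper, not from the series under review. */
        static void example_try_enable_ptm(struct pci_dev *pdev)
        {
                u8 granularity;

                if (pci_enable_ptm(pdev, &granularity)) {
                        pci_info(pdev, "PTM not available on this path\n");
                        return;
                }
                pci_info(pdev, "PTM enabled, granularity %u ns\n", granularity);
                /* later code can also key off pdev->ptm_enabled, as noted above */
        }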
Re: [PATCH v3 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Fri, Sep 25, 2020 at 02:26:54PM -0400, Nitesh Narayan Lal wrote: > If we have isolated CPUs dedicated for use by real-time tasks, we try to > move IRQs to housekeeping CPUs from the userspace to reduce latency > overhead on the isolated CPUs. > > If we allocate too many IRQ vectors, moving them all to housekeeping CPUs > may exceed per-CPU vector limits. > > When we have isolated CPUs, limit the number of vectors allocated by > pci_alloc_irq_vectors() to the minimum number required by the driver, or > to one per housekeeping CPU if that is larger. > > Signed-off-by: Nitesh Narayan Lal > --- > include/linux/pci.h | 17 + > 1 file changed, 17 insertions(+) > > diff --git a/include/linux/pci.h b/include/linux/pci.h > index 835530605c0d..a7b10240b778 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -38,6 +38,7 @@ > #include > #include > #include > +#include > #include > > #include > @@ -1797,6 +1798,22 @@ static inline int > pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs, > unsigned int max_vecs, unsigned int flags) > { > + unsigned int hk_cpus; > + > + hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ); Add blank line here before the block comment. > + /* > + * If we have isolated CPUs for use by real-time tasks, to keep the > + * latency overhead to a minimum, device-specific IRQ vectors are moved > + * to the housekeeping CPUs from the userspace by changing their > + * affinity mask. Limit the vector usage to keep housekeeping CPUs from > + * running out of IRQ vectors. > + */ > + if (hk_cpus < num_online_cpus()) { > + if (hk_cpus < min_vecs) > + max_vecs = min_vecs; > + else if (hk_cpus < max_vecs) > + max_vecs = hk_cpus; > + } It seems like you'd want to do this inside pci_alloc_irq_vectors_affinity() since that's an exported interface, and drivers that use it will bypass the limiting you're doing here. > return pci_alloc_irq_vectors_affinity(dev, min_vecs, max_vecs, flags, > NULL); > } > -- > 2.18.2 >
Re: [PATCH v2 4/4] PCI: Limit pci_alloc_irq_vectors as per housekeeping CPUs
On Thu, Sep 24, 2020 at 05:39:07PM -0400, Nitesh Narayan Lal wrote: > > On 9/24/20 4:45 PM, Bjorn Helgaas wrote: > > Possible subject: > > > > PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs > > Will switch to this. > > > On Wed, Sep 23, 2020 at 02:11:26PM -0400, Nitesh Narayan Lal wrote: > >> This patch limits the pci_alloc_irq_vectors, max_vecs argument that is > >> passed on by the caller based on the housekeeping online CPUs (that are > >> meant to perform managed IRQ jobs). > >> > >> A minimum of the max_vecs passed and housekeeping online CPUs is derived > >> to ensure that we don't create excess vectors as that can be problematic > >> specifically in an RT environment. In cases where the min_vecs exceeds the > >> housekeeping online CPUs, max vecs is restricted based on the min_vecs > >> instead. The proposed change is required because for an RT environment > >> unwanted IRQs are moved to the housekeeping CPUs from isolated CPUs to > >> keep the latency overhead to a minimum. If the number of housekeeping CPUs > >> is significantly lower than that of the isolated CPUs we can run into > >> failures while moving these IRQs to housekeeping CPUs due to per CPU > >> vector limit. > > Does this capture enough of the log? > > > > If we have isolated CPUs dedicated for use by real-time tasks, we > > try to move IRQs to housekeeping CPUs to reduce overhead on the > > isolated CPUs. > > How about: > " > If we have isolated CPUs or CPUs running in nohz_full mode for the purpose > of real-time, we try to move IRQs to housekeeping CPUs to reduce latency > overhead on these real-time CPUs. > " > > What do you think? It's OK, but from the PCI core perspective, "nohz_full mode" doesn't really mean anything. I think it's a detail that should be inside the "housekeeping CPU" abstraction. > > If we allocate too many IRQ vectors, moving them all to housekeeping > > CPUs may exceed per-CPU vector limits. > > > > When we have isolated CPUs, limit the number of vectors allocated by > > pci_alloc_irq_vectors() to the minimum number required by the > > driver, or to one per housekeeping CPU if that is larger > > I think this is good, I can adopt this. > > > . > > > >> Signed-off-by: Nitesh Narayan Lal > >> --- > >> include/linux/pci.h | 15 +++ > >> 1 file changed, 15 insertions(+) > >> > >> diff --git a/include/linux/pci.h b/include/linux/pci.h > >> index 835530605c0d..cf9ca9410213 100644 > >> --- a/include/linux/pci.h > >> +++ b/include/linux/pci.h > >> @@ -38,6 +38,7 @@ > >> #include > >> #include > >> #include > >> +#include > >> #include > >> > >> #include > >> @@ -1797,6 +1798,20 @@ static inline int > >> pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs, > >> unsigned int max_vecs, unsigned int flags) > >> { > >> + unsigned int hk_cpus = hk_num_online_cpus(); > >> + > >> + /* > >> + * For a real-time environment, try to be conservative and at max only > >> + * ask for the same number of vectors as there are housekeeping online > >> + * CPUs. In case, the min_vecs requested exceeds the housekeeping > >> + * online CPUs, restrict the max_vecs based on the min_vecs instead. > >> + */ > >> + if (hk_cpus != num_online_cpus()) { > >> + if (min_vecs > hk_cpus) > >> + max_vecs = min_vecs; > >> + else > >> + max_vecs = min_t(int, max_vecs, hk_cpus); > >> + } > > Is the below basically the same? > > > > /* > > * If we have isolated CPUs for use by real-time tasks, > > * minimize overhead on those CPUs by moving IRQs to the > > * remaining "housekeeping" CPUs. 
Limit vector usage to keep > > * housekeeping CPUs from running out of IRQ vectors. > > */ > > How about the following as a comment: > > " > If we have isolated CPUs or CPUs running in nohz_full mode for real-time, > latency overhead is minimized on those CPUs by moving the IRQ vectors to > the housekeeping CPUs. Limit the number of vectors to keep housekeeping > CPUs from running out of IRQ vectors. > > " > > > if (housekeeping_cpus < num_online_cpus()) { > > if (housekeeping_cpus < min_vecs) > >
Re: [PATCH v2 1/4] sched/isolation: API to get housekeeping online CPUs
On Wed, Sep 23, 2020 at 02:11:23PM -0400, Nitesh Narayan Lal wrote: > Introduce a new API hk_num_online_cpus(), that can be used to > retrieve the number of online housekeeping CPUs that are meant to handle > managed IRQ jobs. > > This API is introduced for the drivers that were previously relying only > on num_online_cpus() to determine the number of MSIX vectors to create. > In an RT environment with large isolated but fewer housekeeping CPUs this > was leading to a situation where an attempt to move all of the vectors > corresponding to isolated CPUs to housekeeping CPUs were failing due to > per CPU vector limit. > > Signed-off-by: Nitesh Narayan Lal > --- > include/linux/sched/isolation.h | 13 + > 1 file changed, 13 insertions(+) > > diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h > index cc9f393e2a70..2e96b626e02e 100644 > --- a/include/linux/sched/isolation.h > +++ b/include/linux/sched/isolation.h > @@ -57,4 +57,17 @@ static inline bool housekeeping_cpu(int cpu, enum hk_flags > flags) > return true; > } > > +static inline unsigned int hk_num_online_cpus(void) > +{ > +#ifdef CONFIG_CPU_ISOLATION > + const struct cpumask *hk_mask; > + > + if (static_branch_unlikely(&housekeeping_overridden)) { > + hk_mask = housekeeping_cpumask(HK_FLAG_MANAGED_IRQ); > + return cpumask_weight(hk_mask); > + } > +#endif > + return cpumask_weight(cpu_online_mask); Just curious: why is this not #ifdef CONFIG_CPU_ISOLATION ... #endif return num_online_cpus(); > +} > + > #endif /* _LINUX_SCHED_ISOLATION_H */ > -- > 2.18.2 >
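For clarity, the restructuring Bjorn is hinting at would look roughly like the sketch below (not a tested patch): keep only the override check inside the #ifdef and fall back to num_online_cpus() instead of open-coding cpumask_weight(cpu_online_mask).

static inline unsigned int hk_num_online_cpus(void)
{
#ifdef CONFIG_CPU_ISOLATION
	if (static_branch_unlikely(&housekeeping_overridden))
		return cpumask_weight(housekeeping_cpumask(HK_FLAG_MANAGED_IRQ));
#endif
	return num_online_cpus();
}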
Re: [PATCH v2 4/4] PCI: Limit pci_alloc_irq_vectors as per housekeeping CPUs
Possible subject: PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs On Wed, Sep 23, 2020 at 02:11:26PM -0400, Nitesh Narayan Lal wrote: > This patch limits the pci_alloc_irq_vectors, max_vecs argument that is > passed on by the caller based on the housekeeping online CPUs (that are > meant to perform managed IRQ jobs). > > A minimum of the max_vecs passed and housekeeping online CPUs is derived > to ensure that we don't create excess vectors as that can be problematic > specifically in an RT environment. In cases where the min_vecs exceeds the > housekeeping online CPUs, max vecs is restricted based on the min_vecs > instead. The proposed change is required because for an RT environment > unwanted IRQs are moved to the housekeeping CPUs from isolated CPUs to > keep the latency overhead to a minimum. If the number of housekeeping CPUs > is significantly lower than that of the isolated CPUs we can run into > failures while moving these IRQs to housekeeping CPUs due to per CPU > vector limit. Does this capture enough of the log? If we have isolated CPUs dedicated for use by real-time tasks, we try to move IRQs to housekeeping CPUs to reduce overhead on the isolated CPUs. If we allocate too many IRQ vectors, moving them all to housekeeping CPUs may exceed per-CPU vector limits. When we have isolated CPUs, limit the number of vectors allocated by pci_alloc_irq_vectors() to the minimum number required by the driver, or to one per housekeeping CPU if that is larger. > Signed-off-by: Nitesh Narayan Lal > --- > include/linux/pci.h | 15 +++ > 1 file changed, 15 insertions(+) > > diff --git a/include/linux/pci.h b/include/linux/pci.h > index 835530605c0d..cf9ca9410213 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -38,6 +38,7 @@ > #include > #include > #include > +#include > #include > > #include > @@ -1797,6 +1798,20 @@ static inline int > pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs, > unsigned int max_vecs, unsigned int flags) > { > + unsigned int hk_cpus = hk_num_online_cpus(); > + > + /* > + * For a real-time environment, try to be conservative and at max only > + * ask for the same number of vectors as there are housekeeping online > + * CPUs. In case, the min_vecs requested exceeds the housekeeping > + * online CPUs, restrict the max_vecs based on the min_vecs instead. > + */ > + if (hk_cpus != num_online_cpus()) { > + if (min_vecs > hk_cpus) > + max_vecs = min_vecs; > + else > + max_vecs = min_t(int, max_vecs, hk_cpus); > + } Is the below basically the same? /* * If we have isolated CPUs for use by real-time tasks, * minimize overhead on those CPUs by moving IRQs to the * remaining "housekeeping" CPUs. Limit vector usage to keep * housekeeping CPUs from running out of IRQ vectors. */ if (housekeeping_cpus < num_online_cpus()) { if (housekeeping_cpus < min_vecs) max_vecs = min_vecs; else if (housekeeping_cpus < max_vecs) max_vecs = housekeeping_cpus; } My comment isn't quite right because this patch only limits the number of vectors; it doesn't actually *move* IRQs to the housekeeping CPUs. I don't know where the move happens (or maybe you just avoid assigning IRQs to isolated CPUs, and I don't know how that happens either). > return pci_alloc_irq_vectors_affinity(dev, min_vecs, max_vecs, flags, > NULL); > } > -- > 2.18.2 >
Re: [RFC][Patch v1 1/3] sched/isolation: API to get num of hosekeeping CPUs
[+cc Ingo, Peter, Juri, Vincent (scheduler maintainers)] s/hosekeeping/housekeeping/ (in subject) On Wed, Sep 09, 2020 at 11:08:16AM -0400, Nitesh Narayan Lal wrote: > Introduce a new API num_housekeeping_cpus(), that can be used to retrieve > the number of housekeeping CPUs by reading an atomic variable > __num_housekeeping_cpus. This variable is set from housekeeping_setup(). > > This API is introduced for the purpose of drivers that were previously > relying only on num_online_cpus() to determine the number of MSIX vectors > to create. In an RT environment with large isolated but a fewer > housekeeping CPUs this was leading to a situation where an attempt to > move all of the vectors corresponding to isolated CPUs to housekeeping > CPUs was failing due to per CPU vector limit. Totally kibitzing here, but AFAICT the concepts of "isolated CPU" and "housekeeping CPU" are not currently exposed to drivers, and it's not completely clear to me that they should be. We have carefully constructed notions of possible, present, online, active CPUs, and it seems like whatever we do here should be somehow integrated with those. > If there are no isolated CPUs specified then the API returns the number > of all online CPUs. > > Signed-off-by: Nitesh Narayan Lal > --- > include/linux/sched/isolation.h | 7 +++ > kernel/sched/isolation.c| 23 +++ > 2 files changed, 30 insertions(+) > > diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h > index cc9f393e2a70..94c25d956d8a 100644 > --- a/include/linux/sched/isolation.h > +++ b/include/linux/sched/isolation.h > @@ -25,6 +25,7 @@ extern bool housekeeping_enabled(enum hk_flags flags); > extern void housekeeping_affine(struct task_struct *t, enum hk_flags flags); > extern bool housekeeping_test_cpu(int cpu, enum hk_flags flags); > extern void __init housekeeping_init(void); > +extern unsigned int num_housekeeping_cpus(void); > > #else > > @@ -46,6 +47,12 @@ static inline bool housekeeping_enabled(enum hk_flags > flags) > static inline void housekeeping_affine(struct task_struct *t, > enum hk_flags flags) { } > static inline void housekeeping_init(void) { } > + > +static unsigned int num_housekeeping_cpus(void) > +{ > + return num_online_cpus(); > +} > + > #endif /* CONFIG_CPU_ISOLATION */ > > static inline bool housekeeping_cpu(int cpu, enum hk_flags flags) > diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c > index 5a6ea03f9882..7024298390b7 100644 > --- a/kernel/sched/isolation.c > +++ b/kernel/sched/isolation.c > @@ -13,6 +13,7 @@ DEFINE_STATIC_KEY_FALSE(housekeeping_overridden); > EXPORT_SYMBOL_GPL(housekeeping_overridden); > static cpumask_var_t housekeeping_mask; > static unsigned int housekeeping_flags; > +static atomic_t __num_housekeeping_cpus __read_mostly; > > bool housekeeping_enabled(enum hk_flags flags) > { > @@ -20,6 +21,27 @@ bool housekeeping_enabled(enum hk_flags flags) > } > EXPORT_SYMBOL_GPL(housekeeping_enabled); > > +/* > + * num_housekeeping_cpus() - Read the number of housekeeping CPUs. > + * > + * This function returns the number of available housekeeping CPUs > + * based on __num_housekeeping_cpus which is of type atomic_t > + * and is initialized at the time of the housekeeping setup. 
> + */ > +unsigned int num_housekeeping_cpus(void) > +{ > + unsigned int cpus; > + > + if (static_branch_unlikely(&housekeeping_overridden)) { > + cpus = atomic_read(&__num_housekeeping_cpus); > + /* We should always have at least one housekeeping CPU */ > + BUG_ON(!cpus); > + return cpus; > + } > + return num_online_cpus(); > +} > +EXPORT_SYMBOL_GPL(num_housekeeping_cpus); > + > int housekeeping_any_cpu(enum hk_flags flags) > { > int cpu; > @@ -131,6 +153,7 @@ static int __init housekeeping_setup(char *str, enum > hk_flags flags) > > housekeeping_flags |= flags; > > + atomic_set(&__num_housekeeping_cpus, cpumask_weight(housekeeping_mask)); > free_bootmem_cpumask_var(non_housekeeping_mask); > > return 1; > -- > 2.27.0 >
Re: [PATCH] Convert enum pci_dev_flags to bit fields in struct pci_dev
[+cc Kai-Heng, Colin, Myron] On Mon, Sep 14, 2020 at 03:57:56AM +, Krzysztof Wilczyński wrote: > All the flags defined in the enum pci_dev_flags are used to determine > whether a particular feature of an underlying PCI device should be used > or not - features are also often disabled via a device-specific quirk. > > These flags are tightly coupled with a PCI device and primarily used in > simple binary on/off manner to check whether something is enabled or > disabled, and have almost no other users (aside of two network drivers) > outside of the PCI device drivers space. > > Therefore, convert enum pci_dev_flags into a set of bit fields in the > struct pci_dev, and then drop said enum and the typedef pci_dev_flags_t. > > This will keep PCI device-specific features as part of the struct > pci_dev and make the code that used to use flags simpler. > > Suggested-by: Bjorn Helgaas > Signed-off-by: Krzysztof Wilczyński I like this because we currently have two styles for setting per-PCI dev flags: pdev->dev_flags |= PCI_DEV_FLAGS_NO_D3 pdev->no_d3cold = true; and there's no obvious reason to choose one way over the other. This patch converts everything to the second style. We might look at doing pci_bus_flags_t at the same time. How much heartburn does this cause distro folks? You can let me know off-list if you want :) Generally we don't worry too much in the upstream world about breaking out-of-tree modules, and these are trivial changes that affect very few drivers anyway, but I don't want to gratuitously make things hard for distros. > --- > drivers/net/ethernet/atheros/alx/main.c | 2 +- > drivers/net/ethernet/sfc/ef10_sriov.c | 3 +- > drivers/pci/msi.c | 2 +- > drivers/pci/pci.c | 22 ++-- > drivers/pci/probe.c | 2 +- > drivers/pci/quirks.c| 24 ++--- > drivers/pci/search.c| 4 +-- > drivers/pci/vpd.c | 4 +-- > include/linux/pci.h | 47 + > 9 files changed, 47 insertions(+), 63 deletions(-) > > diff --git a/drivers/net/ethernet/atheros/alx/main.c > b/drivers/net/ethernet/atheros/alx/main.c > index 9b7f1af5f574..c52669f8ec26 100644 > --- a/drivers/net/ethernet/atheros/alx/main.c > +++ b/drivers/net/ethernet/atheros/alx/main.c > @@ -1763,7 +1763,7 @@ static int alx_probe(struct pci_dev *pdev, const struct > pci_device_id *ent) > netdev->watchdog_timeo = ALX_WATCHDOG_TIME; > > if (ent->driver_data & ALX_DEV_QUIRK_MSI_INTX_DISABLE_BUG) > - pdev->dev_flags |= PCI_DEV_FLAGS_MSI_INTX_DISABLE_BUG; > + pdev->msi_intx_disabled = 1; > > err = alx_init_sw(alx); > if (err) { > diff --git a/drivers/net/ethernet/sfc/ef10_sriov.c > b/drivers/net/ethernet/sfc/ef10_sriov.c > index 21fa6c0e8873..9af7e11ea113 100644 > --- a/drivers/net/ethernet/sfc/ef10_sriov.c > +++ b/drivers/net/ethernet/sfc/ef10_sriov.c > @@ -122,8 +122,7 @@ static void efx_ef10_sriov_free_vf_vports(struct efx_nic > *efx) > struct ef10_vf *vf = nic_data->vf + i; > > /* If VF is assigned, do not free the vport */ > - if (vf->pci_dev && > - vf->pci_dev->dev_flags & PCI_DEV_FLAGS_ASSIGNED) > + if (vf->pci_dev && vf->pci_dev->flags_assigned) > continue; > > if (vf->vport_assigned) { > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c > index 30ae4ffda5c1..719ae72d9028 100644 > --- a/drivers/pci/msi.c > +++ b/drivers/pci/msi.c > @@ -405,7 +405,7 @@ static void free_msi_irqs(struct pci_dev *dev) > > static void pci_intx_for_msi(struct pci_dev *dev, int enable) > { > - if (!(dev->dev_flags & PCI_DEV_FLAGS_MSI_INTX_DISABLE_BUG)) > + if (!dev->msi_intx_disabled) > pci_intx(dev, enable); > } > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index 
e39c5499770f..08ffe872c34c 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -1320,7 +1320,7 @@ int pci_set_power_state(struct pci_dev *dev, > pci_power_t state) >* This device is quirked not to be put into D3, so don't put it in >* D3 >*/ > - if (state >= PCI_D3hot && (dev->dev_flags & PCI_DEV_FLAGS_NO_D3)) > + if (state >= PCI_D3hot && dev->no_d3) > return 0; > > /* > @@ -4528,7 +4528,7 @@ bool pcie_has_flr(struct pci_dev *dev) > { > u32 cap; > > - if (dev->dev_flags & PCI_DEV_FLAGS_NO_FLR_RESET) > + if (dev->
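The rest of the diff is routine; the net effect on struct pci_dev is easiest to see as a sketch. The member names below are taken from the hunks quoted above, everything else is elided.

struct pci_dev {
	/* ... */
	unsigned int	msi_intx_disabled:1;	/* was PCI_DEV_FLAGS_MSI_INTX_DISABLE_BUG */
	unsigned int	flags_assigned:1;	/* was PCI_DEV_FLAGS_ASSIGNED */
	unsigned int	no_d3:1;		/* was PCI_DEV_FLAGS_NO_D3 */
	/* ... remaining dev_flags converted the same way ... */
};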
Re: [PATCH net-next 03/11] mlxsw: spectrum_policer: Add policer core
On Tue, Aug 18, 2020 at 11:37:45AM +0200, Petr Machata wrote: > Ido Schimmel writes: > > On Mon, Aug 17, 2020 at 10:38:24AM -0500, Bjorn Helgaas wrote: > >> You've likely seen this already, but Coverity found this problem: > >> > >> *** CID 1466147: Control flow issues (DEADCODE) > >> /drivers/net/ethernet/mellanox/mlxsw/spectrum_policer.c: 380 in > >> mlxsw_sp_policers_init() > >> 374 } > >> 375 > >> 376 return 0; > >> 377 > >> 378 err_family_register: > >> 379 for (i--; i >= 0; i--) { > >> >>> CID 1466147: Control flow issues (DEADCODE) > >> >>> Execution cannot reach this statement: "struct > >> mlxsw_sp_policer_fam...". > >> 380 struct mlxsw_sp_policer_family *family; > >> 381 > >> 382 family = mlxsw_sp->policer_core->family_arr[i]; > >> 383 mlxsw_sp_policer_family_unregister(mlxsw_sp, family); > >> 384 } > >> 385 err_init: > >> > >> I think the problem is that MLXSW_SP_POLICER_TYPE_MAX is 0 because > >> > >> > +enum mlxsw_sp_policer_type { > >> > +MLXSW_SP_POLICER_TYPE_SINGLE_RATE, > >> > + > >> > +__MLXSW_SP_POLICER_TYPE_MAX, > >> > +MLXSW_SP_POLICER_TYPE_MAX = __MLXSW_SP_POLICER_TYPE_MAX - 1, > >> > +}; > >> > >> so we can only execute the family_register loop once, with i == 0, > >> and if we get to err_family_register via the error exit: > >> > >> > +for (i = 0; i < MLXSW_SP_POLICER_TYPE_MAX + 1; i++) { > >> > +err = mlxsw_sp_policer_family_register(mlxsw_sp, > >> > mlxsw_sp_policer_family_arr[i]); > >> > +if (err) > >> > +goto err_family_register; > >> > >> i will be 0, so i-- sets i to -1, so we don't enter the > >> family_unregister loop body since -1 is not >= 0. > > > > Thanks for the report, but isn't the code doing the right thing here? I > > mean, it's dead code now, but as soon as we add another family it will > > be executed. It seems error prone to remove it only to please Coverity > > and then add it back when it's actually needed. > > Agreed. You're right, I missed the forest for the trees. Sorry for the noise. Bjorn
Re: [PATCH net-next 03/11] mlxsw: spectrum_policer: Add policer core
You've likely seen this already, but Coverity found this problem: *** CID 1466147: Control flow issues (DEADCODE) /drivers/net/ethernet/mellanox/mlxsw/spectrum_policer.c: 380 in mlxsw_sp_policers_init() 374 } 375 376 return 0; 377 378 err_family_register: 379 for (i--; i >= 0; i--) { >>> CID 1466147: Control flow issues (DEADCODE) >>> Execution cannot reach this statement: "struct mlxsw_sp_policer_fam...". 380 struct mlxsw_sp_policer_family *family; 381 382 family = mlxsw_sp->policer_core->family_arr[i]; 383 mlxsw_sp_policer_family_unregister(mlxsw_sp, family); 384 } 385 err_init: I think the problem is that MLXSW_SP_POLICER_TYPE_MAX is 0 because > +enum mlxsw_sp_policer_type { > + MLXSW_SP_POLICER_TYPE_SINGLE_RATE, > + > + __MLXSW_SP_POLICER_TYPE_MAX, > + MLXSW_SP_POLICER_TYPE_MAX = __MLXSW_SP_POLICER_TYPE_MAX - 1, > +}; so we can only execute the family_register loop once, with i == 0, and if we get to err_family_register via the error exit: > + for (i = 0; i < MLXSW_SP_POLICER_TYPE_MAX + 1; i++) { > + err = mlxsw_sp_policer_family_register(mlxsw_sp, > mlxsw_sp_policer_family_arr[i]); > + if (err) > + goto err_family_register; i will be 0, so i-- sets i to -1, so we don't enter the family_unregister loop body since -1 is not >= 0. This code is now upstream as 8d3fbae70d8d ("mlxsw: spectrum_policer: Add policer core"). Bjorn On Wed, Jul 15, 2020 at 11:27:25AM +0300, Ido Schimmel wrote: > From: Ido Schimmel > > Add common code to handle all policer-related functionality in mlxsw. > Currently, only policer for policy engines are supported, but it in the > future more policer families will be added such as CPU (trap) policers > and storm control policers. > > The API allows different modules to add / delete policers and read their > drop counter. > > Signed-off-by: Ido Schimmel > Reviewed-by: Jiri Pirko > Reviewed-by: Petr Machata > --- > drivers/net/ethernet/mellanox/mlxsw/Makefile | 2 +- > .../net/ethernet/mellanox/mlxsw/spectrum.c| 12 + > .../net/ethernet/mellanox/mlxsw/spectrum.h| 32 ++ > .../mellanox/mlxsw/spectrum_policer.c | 403 ++ > 4 files changed, 448 insertions(+), 1 deletion(-) > create mode 100644 drivers/net/ethernet/mellanox/mlxsw/spectrum_policer.c > > diff --git a/drivers/net/ethernet/mellanox/mlxsw/Makefile > b/drivers/net/ethernet/mellanox/mlxsw/Makefile > index 3709983fbd77..892724380ea2 100644 > --- a/drivers/net/ethernet/mellanox/mlxsw/Makefile > +++ b/drivers/net/ethernet/mellanox/mlxsw/Makefile > @@ -31,7 +31,7 @@ mlxsw_spectrum-objs := spectrum.o > spectrum_buffers.o \ > spectrum_qdisc.o spectrum_span.o \ > spectrum_nve.o spectrum_nve_vxlan.o \ > spectrum_dpipe.o spectrum_trap.o \ > -spectrum_ethtool.o > +spectrum_ethtool.o spectrum_policer.o > mlxsw_spectrum-$(CONFIG_MLXSW_SPECTRUM_DCB) += spectrum_dcb.o > mlxsw_spectrum-$(CONFIG_PTP_1588_CLOCK) += spectrum_ptp.o > obj-$(CONFIG_MLXSW_MINIMAL) += mlxsw_minimal.o > diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c > b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c > index 4ac634bd3571..c6ab61818800 100644 > --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c > +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c > @@ -2860,6 +2860,12 @@ static int mlxsw_sp_init(struct mlxsw_core *mlxsw_core, > goto err_fids_init; > } > > + err = mlxsw_sp_policers_init(mlxsw_sp); > + if (err) { > + dev_err(mlxsw_sp->bus_info->dev, "Failed to initialize > policers\n"); > + goto err_policers_init; > + } > + > err = mlxsw_sp_traps_init(mlxsw_sp); > if (err) { > dev_err(mlxsw_sp->bus_info->dev, "Failed to set traps\n"); > @@ 
-3019,6 +3025,8 @@ static int mlxsw_sp_init(struct mlxsw_core *mlxsw_core, > err_devlink_traps_init: > mlxsw_sp_traps_fini(mlxsw_sp); > err_traps_init: > + mlxsw_sp_policers_fini(mlxsw_sp); > +err_policers_init: > mlxsw_sp_fids_fini(mlxsw_sp); > err_fids_init: > mlxsw_sp_kvdl_fini(mlxsw_sp); > @@ -3046,6 +3054,7 @@ static int mlxsw_sp1_init(struct mlxsw_core *mlxsw_core, > mlxsw_sp->port_type_speed_ops = &mlxsw_sp1_port_type_speed_ops; > mlxsw_sp->ptp_ops = &mlxsw_sp1_ptp_ops; > mlxsw_sp->span_ops = &mlxsw_sp1_span_ops; > + mlxsw_sp->policer_core_ops = &mlxsw_sp1_policer_core_ops; > mlxsw_sp->listeners = mlxsw_sp1_listener; > mlxsw_sp->listeners_count = ARRAY_SIZE(mlxsw_sp1_listener); > mlxsw_sp->lowest_shaper_bs = MLXSW_REG_QEEC_LOWEST_SHAPER_BS_SP1; > @@ -3074,6 +3083,7 @@ static int mlxsw_sp2_init(struct mlx
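To make the Coverity finding concrete: the unwind loop is dead only because the enum currently has a single entry. A sketch with a hypothetical second policer type (the name below is made up) shows why the existing error path is still the right shape.

enum mlxsw_sp_policer_type {
	MLXSW_SP_POLICER_TYPE_SINGLE_RATE,
	MLXSW_SP_POLICER_TYPE_STORM_CONTROL,	/* hypothetical second type */

	__MLXSW_SP_POLICER_TYPE_MAX,
	MLXSW_SP_POLICER_TYPE_MAX = __MLXSW_SP_POLICER_TYPE_MAX - 1,
};

/*
 * With two types the register loop can now fail at i == 1, and
 * "for (i--; i >= 0; i--)" then unregisters the family that was
 * registered at i == 0, so the currently "dead" unwind code becomes live.
 */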
Re: [RFC PATCH 00/17] Drop uses of pci_read_config_*() return value
On Sun, Aug 02, 2020 at 08:46:48PM +0200, Borislav Petkov wrote: > On Sun, Aug 02, 2020 at 07:28:00PM +0200, Saheed Bolarinwa wrote: > > Because the value ~0 has a meaning to some drivers and only > > No, ~0 means that the PCI read failed. For *every* PCI device I know. Wait, I'm not convinced yet. I know that if a PCI read fails, you normally get ~0 data because the host bridge fabricates it to complete the CPU load. But what guarantees that a PCI config register cannot contain ~0? If there's something about that in the spec I'd love to know where it is because it would simplify a lot of things. I don't think we should merge any of these patches as-is. If we *do* want to go this direction, we at least need some kind of macro or function that tests for ~0 so we have a clue about what's happening and can grep for it. Bjorn
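A sketch of the kind of greppable helper being asked for; the names are placeholders (the kernel later gained a PCI_POSSIBLE_ERROR() macro along these lines).

#define PCI_ERROR_RESPONSE	(~0ULL)
#define PCI_POSSIBLE_ERROR(val)	((val) == ((typeof(val))PCI_ERROR_RESPONSE))

/* usage sketch: presence check after a config read */
u32 id;

pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
if (PCI_POSSIBLE_ERROR(id))
	return -ENODEV;		/* read likely failed; device may be gone */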
Re: [PATCH v1] farsync: use generic power management
On Wed, Jul 29, 2020 at 03:47:30PM +0530, Vaibhav Gupta wrote: > On Tue, Jul 28, 2020 at 03:04:13PM -0500, Bjorn Helgaas wrote: > > On Tue, Jul 28, 2020 at 09:58:10AM +0530, Vaibhav Gupta wrote: > > > The .suspend() and .resume() callbacks are not defined for this driver. > > > Still, their power management structure follows the legacy framework. To > > > bring it under the generic framework, simply remove the binding of > > > callbacks from "struct pci_driver". > > > > FWIW, this commit log is slightly misleading because .suspend and > > .resume are NULL by default, so this patch actually is a complete > > no-op as far as code generation is concerned. > > > > This change is worthwhile because it simplifies the code a little, but > > it doesn't convert the driver from legacy to generic power management. > > This driver doesn't supply a .pm structure, so it doesn't seem to do > > *any* power management. > > Agreed. Actually, as their presence only causes PCI core to call > pci_legacy_suspend/resume() for them, I thought that after removing > the binding from "struct pci_driver", this driver qualifies to be > grouped under genric framework, so used "use generic power > management" for the heading. > > I should have written "remove legacy bindning". This removed the *mention* of fst_driver.suspend and fst_driver.resume, which is important because we want to eventually remove those members completely from struct pci_driver. But fst_driver.suspend and fst_driver.resume *exist* before and after this patch, and they're initialized to zero before and after this patch. Since they were zero before, and they're still zero after this patch, the PCI core doesn't call pci_legacy_suspend/resume(). This patch doesn't change that at all. > But David has applied the patch, should I send a v2 or fix to update > message? No, I don't think David updates patches after he's applied them. But if the situation comes up again, you'll know how to describe it :) Bjorn
Re: [PATCH v1] farsync: use generic power management
On Tue, Jul 28, 2020 at 09:58:10AM +0530, Vaibhav Gupta wrote: > The .suspend() and .resume() callbacks are not defined for this driver. > Still, their power management structure follows the legacy framework. To > bring it under the generic framework, simply remove the binding of > callbacks from "struct pci_driver". FWIW, this commit log is slightly misleading because .suspend and .resume are NULL by default, so this patch actually is a complete no-op as far as code generation is concerned. This change is worthwhile because it simplifies the code a little, but it doesn't convert the driver from legacy to generic power management. This driver doesn't supply a .pm structure, so it doesn't seem to do *any* power management. > Change code indentation from space to tab in "struct pci_driver". > > Signed-off-by: Vaibhav Gupta > --- > drivers/net/wan/farsync.c | 10 -- > 1 file changed, 4 insertions(+), 6 deletions(-) > > diff --git a/drivers/net/wan/farsync.c b/drivers/net/wan/farsync.c > index 7916efce7188..15dacfde6b83 100644 > --- a/drivers/net/wan/farsync.c > +++ b/drivers/net/wan/farsync.c > @@ -2636,12 +2636,10 @@ fst_remove_one(struct pci_dev *pdev) > } > > static struct pci_driver fst_driver = { > -.name= FST_NAME, > -.id_table= fst_pci_dev_id, > -.probe = fst_add_one, > -.remove = fst_remove_one, > -.suspend = NULL, > -.resume = NULL, > + .name = FST_NAME, > + .id_table = fst_pci_dev_id, > + .probe = fst_add_one, > + .remove = fst_remove_one, > }; > > static int __init > -- > 2.27.0 >
Re: [net-next 10/10] net/mlx5e: Add support for PCI relaxed ordering
On Sun, Jul 08, 2040 at 11:22:12AM +0300, Aya Levin wrote: > On 7/6/2020 10:49 PM, David Miller wrote: > > From: Aya Levin > > Date: Mon, 6 Jul 2020 16:00:59 +0300 > > > > > Assuming the discussions with Bjorn will conclude in a well-trusted > > > API that ensures relaxed ordering in enabled, I'd still like a method > > > to turn off relaxed ordering for performance debugging sake. > > > Bjorn highlighted the fact that the PCIe sub system can only offer a > > > query method. Even if theoretically a set API will be provided, this > > > will not fit a netdev debugging - I wonder if CPU vendors even support > > > relaxed ordering set/unset... > > > On the driver's side relaxed ordering is an attribute of the mkey and > > > should be available for configuration (similar to number of CPU > > > vs. number of channels). > > > Based on the above, and binding the driver's default relaxed ordering > > > to the return value from pcie_relaxed_ordering_enabled(), may I > > > continue with previous direction of a private-flag to control the > > > client side (my driver) ? > > > > I don't like this situation at all. > > > > If RO is so dodgy that it potentially needs to be disabled, that is > > going to be an issue not just with networking devices but also with > > storage and other device types as well. > > > > Will every device type have a custom way to disable RO, thus > > inconsistently, in order to accomodate this? > > > > That makes no sense and is a terrible user experience. > > > > That's why the knob belongs generically in PCI or similar. > > > Hi Bjorn, > > Mellanox NIC supports relaxed ordering operation over DMA buffers. > However for debug prepossess we must have a chicken bit to disable > relaxed ordering on a specific system without effecting others in > run-time. In order to meet this requirement, I added a netdev > private-flag to ethtool for set RO API. > > Dave raised a concern regarding embedding relaxed ordering set API > per system (networking, storage and others). We need the ability to > manage relaxed ordering in a unify manner. Could you please define a > PCI sub-system solution to meet this requirement? I agree, this is definitely a mess. Let me just outline what I think we have today and what we're missing. - On the hardware side, device use of Relaxed Ordering is controlled by the Enable Relaxed Ordering bit in the PCIe Device Control register (or the PCI-X Command register). If set, the device is allowed but not required to set the Relaxed Ordering bit in transactions it initiates (PCIe r5.0, sec 7.5.3.4; PCI-X 2.0, sec 7.2.3). I suspect there may be device-specific controls, too, because [1] claims to enable/disable Relaxed Ordering but doesn't touch the PCIe Device Control register. Device-specific controls are certainly allowed, but of course it would be up to the driver, and the device cannot generate TLPs with Relaxed Ordering unless the architected PCIe Enable Relaxed Ordering bit is *also* set. - Platform firmware can enable Relaxed Ordering for a device either before handoff to the OS or via the _HPX ACPI method. - The PCI core never enables Relaxed Ordering itself except when applying _HPX. - At enumeration-time, the PCI core disables Relaxed Ordering in pci_configure_relaxed_ordering() if the device is below a Root Port that has a quirk indicating an erratum. This quirk currently includes many Intel Root Ports, but not all, and is an ongoing maintenance problem. 
- The PCI core provides pcie_relaxed_ordering_enabled() which tells you whether Relaxed Ordering is enabled. Only used by cxgb4 and csio, which use that information to fill in Ingress Queue Commands. - The PCI core does not provide a driver interface to enable or disable Relaxed Ordering. - Some drivers disable Relaxed Ordering themselves: mtip32xx, netup_unidvb, tg3, myri10ge (oddly, only if CONFIG_MYRI10GE_DCA), tsi721, kp2000_pcie. - Some drivers enable Relaxed Ordering themselves: niu, tegra. What are we missing and what should the PCI core do? - Currently the Enable Relaxed Ordering bit depends on what firmware did. Maybe the PCI core should always clear it during enumeration? - The PCI core should probably have a driver interface like pci_set_relaxed_ordering(dev, enable) that can set or clear the architected PCI-X or PCIe Enable Relaxed Ordering bit. - Maybe there should be a kernel command-line parameter like "pci=norelax" that disables Relaxed Ordering for every device and prevents pci_set_relaxed_ordering() from enabling it. I'm mixed on this because these tend to become folklore about how to "fix" problems and we end up with systems that don't work unless you happen to find the option on the web. For debugging issues, it might be enough to disable Relaxed Ordering using setpci, e.g., "setpci -s02:00.0 CAP_EXP+8.w=0" [1] https
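For the PCIe case, the driver interface suggested above could be as small as the following sketch. The function does not exist at the time of this thread; the name and semantics are the proposal, not an in-tree API, and a real version would also need to handle the PCI-X Command register and respect the Root Port quirk.

int pci_set_relaxed_ordering(struct pci_dev *dev, bool enable)
{
	if (enable)
		return pcie_capability_set_word(dev, PCI_EXP_DEVCTL,
						PCI_EXP_DEVCTL_RELAX_EN);

	return pcie_capability_clear_word(dev, PCI_EXP_DEVCTL,
					  PCI_EXP_DEVCTL_RELAX_EN);
}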
Re: [PATCH v1 1/4] qlge/qlge_main.c: use genric power management
Vaibhav: s/genric/generic/ in the subject On Tue, Jun 30, 2020 at 12:09:36AM +0800, kernel test robot wrote: > Hi Vaibhav, > > Thank you for the patch! Yet something to improve: > > [auto build test ERROR on staging/staging-testing] > [also build test ERROR on v5.8-rc3 next-20200629] > [If your patch is applied to the wrong git tree, kindly drop us a note. > And when submitting patch, we suggest to use as documented in > https://git-scm.com/docs/git-format-patch] > > url: > https://github.com/0day-ci/linux/commits/Vaibhav-Gupta/drivers-staging-use-generic-power-management/20200629-163141 > base: https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging.git > 347fa58ff5558075eec98725029c443c80ffbf4a > config: x86_64-rhel-7.6 (attached as .config) > compiler: gcc-9 (Debian 9.3.0-13) 9.3.0 > reproduce (this is a W=1 build): > # save the attached .config to linux build tree > make W=1 ARCH=x86_64 > > If you fix the issue, kindly add following tag as appropriate > Reported-by: kernel test robot If the patch has already been merged and we need an incremental patch that fixes *only* the build issue, I think it's fine to add a "Reported-by" tag. But if this patch hasn't been merged anywhere, I think adding the "Reported-by" tag would be pointless and distracting. This report should result in a v2 posting of the patch with the build issue fixed. There will be no evidence of the problem in the v2 patch. The patch itself contains other changes unrelated to the build issue, so "Reported-by" makes so sense for them. I would treat this as just another review comment, and we don't usually credit those in the commit log (though it's nice if they're mentioned in the v2 cover letter so reviewers know what changed and why). Is there any chance kbuild could be made smart enough to suggest the tag only when it finds an issue in some list of published trees? > All errors (new ones prefixed by >>): > >drivers/staging/qlge/qlge_main.c: In function 'qlge_resume': > >> drivers/staging/qlge/qlge_main.c:4793:17: error: 'pdev' undeclared (first > >> use in this function); did you mean 'qdev'? > 4793 | pci_set_master(pdev); > | ^~~~ > | qdev > ...
Re: [net-next 10/10] net/mlx5e: Add support for PCI relaxed ordering
[+cc Ashok, Ding, Casey] On Mon, Jun 29, 2020 at 12:32:44PM +0300, Aya Levin wrote: > I wanted to turn on RO on the ETH driver based on > pcie_relaxed_ordering_enabled(). > From my experiments I see that pcie_relaxed_ordering_enabled() return true > on Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz. This CPU is from Haswell > series which is known to have bug in RO implementation. In this case, I > expected pcie_relaxed_ordering_enabled() to return false, shouldn't it? Is there an erratum for this? How do we know this device has a bug in relaxed ordering? > In addition, we are worried about future bugs in new CPUs which may result > in performance degradation while using RO, as long as the function > pcie_relaxed_ordering_enabled() will return true for these CPUs. I'm worried about this too. I do not want to add a Device ID to the quirk_relaxedordering_disable() list for every new Intel CPU. That's a huge hassle and creates a real problem for old kernels running on those new CPUs, because things might work "most of the time" but not always. Maybe we need to prevent the use of relaxed ordering for *all* Intel CPUs. > That's why > we thought of adding the feature on our card with default off and enable the > user to set it. Bjorn
Re: [net-next 10/10] net/mlx5e: Add support for PCI relaxed ordering
On Wed, Jun 24, 2020 at 10:22:58AM -0700, Jakub Kicinski wrote: > On Wed, 24 Jun 2020 10:34:40 +0300 Aya Levin wrote: > > >> I think Michal will rightly complain that this does not belong in > > >> private flags any more. As (/if?) ARM deployments take a foothold > > >> in DC this will become a common setting for most NICs. > > > > > > Initially we used pcie_relaxed_ordering_enabled() to > > > programmatically enable this on/off on boot but this seems to > > > introduce some degradation on some Intel CPUs since the Intel Faulty > > > CPUs list is not up to date. Aya is discussing this with Bjorn. > > Adding Bjorn Helgaas > > I see. Simply using pcie_relaxed_ordering_enabled() and blacklisting > bad CPUs seems far nicer from operational perspective. Perhaps Bjorn > will chime in. Pushing the validation out to the user is not a great > solution IMHO. I'm totally lost, but maybe it doesn't matter because it looks like David has pulled this series already. There probably *should* be a PCI core interface to enable RO, but there isn't one today. pcie_relaxed_ordering_enabled() doesn't *enable* anything. All it does is tell you whether RO is already enabled. This patch ([net-next 10/10] net/mlx5e: Add support for PCI relaxed ordering) apparently adds a knob to control RO, but I can't connect the dots. It doesn't touch PCI_EXP_DEVCTL_RELAX_EN, and that symbol doesn't occur anywhere in drivers/net except tg3, myri10ge, and niu. And this whole series doesn't contain PCI_EXP_DEVCTL_RELAX_EN or pcie_relaxed_ordering_enabled(). I do have a couple emails from Aya, but they didn't include a patch and I haven't quite figured out what the question was. > > > So until we figure this out, will keep this off by default. > > > > > > for the private flags we want to keep them for performance analysis as > > > we do with all other mlx5 special performance features and flags.
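To underline the point that pcie_relaxed_ordering_enabled() is a pure query: it is essentially just a read of the Device Control register, roughly as sketched below (paraphrased, not a verbatim copy of drivers/pci/pci.c).

bool pcie_relaxed_ordering_enabled(struct pci_dev *dev)
{
	u16 v;

	pcie_capability_read_word(dev, PCI_EXP_DEVCTL, &v);
	return !!(v & PCI_EXP_DEVCTL_RELAX_EN);
}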
Re: [PATCH v1 1/4] ide: use generic power management
On Thu, Jun 25, 2020 at 06:14:09AM +0800, kernel test robot wrote: > Hi Vaibhav, > > Thank you for the patch! Yet something to improve: > > [auto build test ERROR on ide/master] > [If your patch is applied to the wrong git tree, kindly drop us a note. > And when submitting patch, we suggest to use as documented in > https://git-scm.com/docs/git-format-patch] > > url: > https://github.com/0day-ci/linux/commits/Vaibhav-Gupta/drivers-ide-use-generic-power-management/20200625-013242 > base: https://git.kernel.org/pub/scm/linux/kernel/git/davem/ide.git master > config: x86_64-randconfig-a004-20200624 (attached as .config) This auto build testing is a great service, but is there any way to tweak the info above to make it easier to reproduce the problem? I tried to checkout the source that caused these errors, but failed. This is probably because I'm not a git expert, but maybe others are in the same boat. For example, I tried: $ git remote add kbuild https://github.com/0day-ci/linux/commits/Vaibhav-Gupta/drivers-ide-use-generic-power-management/20200625-013242 $ git fetch kbuild fatal: repository 'https://github.com/0day-ci/linux/commits/Vaibhav-Gupta/drivers-ide-use-generic-power-management/20200625-013242/' not found I also visited the github URL in a browser, and I'm sure there must be information there that would let me fetch the source, but I don't know enough about github to find it. The report doesn't include a SHA1, so even if I *did* manage to fetch the sources, I wouldn't be able to validate they were the *correct* ones. > compiler: gcc-9 (Debian 9.3.0-13) 9.3.0 > reproduce (this is a W=1 build): > # save the attached .config to linux build tree > make W=1 ARCH=x86_64 > > If you fix the issue, kindly add following tag as appropriate > Reported-by: kernel test robot > > All errors (new ones prefixed by >>, old ones prefixed by <<): > > >> ERROR: modpost: "ide_pci_pm_ops" [drivers/ide/ide-pci-generic.ko] > >> undefined! > >> ERROR: modpost: "ide_pci_pm_ops" [drivers/ide/serverworks.ko] undefined! > >> ERROR: modpost: "ide_pci_pm_ops" [drivers/ide/piix.ko] undefined! > >> ERROR: modpost: "ide_pci_pm_ops" [drivers/ide/pdc202xx_old.ko] undefined! > >> ERROR: modpost: "ide_pci_pm_ops" [drivers/ide/ns87415.ko] undefined! > >> ERROR: modpost: "ide_pci_pm_ops" [drivers/ide/hpt366.ko] undefined! > > --- > 0-DAY CI Kernel Test Service, Intel Corporation > https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
Re: [Linux-kernel-mentees] [PATCH v2 2/2] realtek/8139cp: Remove Legacy Power Management
On Tue, Apr 28, 2020 at 08:13:14PM +0530, Vaibhav Gupta wrote: > Upgrade power management from legacy to generic using dev_pm_ops. > > Add "__maybe_unused" attribute to resume() and susend() callbacks > definition to suppress compiler warnings. > > Generic callback requires an argument of type "struct device*". Hence, > convert it to "struct net_device*" using "dev_get_drv_data()" to use > it in the callback. > > Most of the cleaning part is to remove pci_save_state(), > pci_set_power_state(), etc power management function calls. > > Signed-off-by: Vaibhav Gupta > --- > drivers/net/ethernet/realtek/8139cp.c | 25 +++-- > 1 file changed, 7 insertions(+), 18 deletions(-) > > diff --git a/drivers/net/ethernet/realtek/8139cp.c > b/drivers/net/ethernet/realtek/8139cp.c > index 60d342f82fb3..4f2fb1393966 100644 > --- a/drivers/net/ethernet/realtek/8139cp.c > +++ b/drivers/net/ethernet/realtek/8139cp.c > @@ -2054,10 +2054,9 @@ static void cp_remove_one (struct pci_dev *pdev) > free_netdev(dev); > } > > -#ifdef CONFIG_PM > -static int cp_suspend (struct pci_dev *pdev, pm_message_t state) > +static int __maybe_unused cp_suspend(struct device *device) > { > - struct net_device *dev = pci_get_drvdata(pdev); > + struct net_device *dev = dev_get_drvdata(device); > struct cp_private *cp = netdev_priv(dev); > unsigned long flags; > > @@ -2075,16 +2074,12 @@ static int cp_suspend (struct pci_dev *pdev, > pm_message_t state) > > spin_unlock_irqrestore (&cp->lock, flags); > > - pci_save_state(pdev); > - pci_enable_wake(pdev, pci_choose_state(pdev, state), cp->wol_enabled); This one is a little more interesting because it relies on the driver state (cp->wol_enabled). IIUC, the corresponding pci_enable_wake() in the generic path is in pci_prepare_to_sleep() (called from pci_pm_suspend_noirq()). But of course the generic path doesn't look at cp->wol_enabled. It looks at device_may_wakeup(), but I don't know whether there's a connection between that and cp->wol_enabled. > - pci_set_power_state(pdev, pci_choose_state(pdev, state)); > - > return 0; > } > > -static int cp_resume (struct pci_dev *pdev) > +static int __maybe_unused cp_resume(struct device *device) > { > - struct net_device *dev = pci_get_drvdata (pdev); > + struct net_device *dev = dev_get_drvdata(device); > struct cp_private *cp = netdev_priv(dev); > unsigned long flags; > > @@ -2093,10 +2088,6 @@ static int cp_resume (struct pci_dev *pdev) > > netif_device_attach (dev); > > - pci_set_power_state(pdev, PCI_D0); > - pci_restore_state(pdev); > - pci_enable_wake(pdev, PCI_D0, 0); > - > /* FIXME: sh*t may happen if the Rx ring buffer is depleted */ > cp_init_rings_index (cp); > cp_init_hw (cp); > @@ -2111,7 +2102,6 @@ static int cp_resume (struct pci_dev *pdev) > > return 0; > } > -#endif /* CONFIG_PM */ > > static const struct pci_device_id cp_pci_tbl[] = { > { PCI_DEVICE(PCI_VENDOR_ID_REALTEK, PCI_DEVICE_ID_REALTEK_8139), > }, > @@ -2120,15 +2110,14 @@ static const struct pci_device_id cp_pci_tbl[] = { > }; > MODULE_DEVICE_TABLE(pci, cp_pci_tbl); > > +static SIMPLE_DEV_PM_OPS(cp_pm_ops, cp_suspend, cp_resume); > + > static struct pci_driver cp_driver = { > .name = DRV_NAME, > .id_table = cp_pci_tbl, > .probe= cp_init_one, > .remove = cp_remove_one, > -#ifdef CONFIG_PM > - .resume = cp_resume, > - .suspend = cp_suspend, > -#endif > + .driver.pm= &cp_pm_ops, > }; > > module_pci_driver(cp_driver); > -- > 2.26.2 >
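The usual bridge between a driver WoL flag like cp->wol_enabled and device_may_wakeup() is a device_set_wakeup_enable() call from the ethtool set_wol handler. A sketch, assuming cp_private keeps its pci_dev pointer in ->pdev and with the hardware programming omitted:

static int example_set_wol(struct net_device *dev,
			   struct ethtool_wolinfo *wol)
{
	struct cp_private *cp = netdev_priv(dev);

	/* ... program WoL registers here ... */

	cp->wol_enabled = (wol->wolopts != 0);
	device_set_wakeup_enable(&cp->pdev->dev, cp->wol_enabled);
	return 0;
}

Without something along these lines, pci_prepare_to_sleep() in the generic path consults device_may_wakeup() and may disagree with the driver's own notion of whether WoL is armed.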
Re: [Linux-kernel-mentees] [PATCH v2 1/2] realtek/8139too: Remove Legacy Power Management
Uncapitalize "legacy power management" in subject. I'd say "convert", not "remove", to make it clear that the driver will still do power management afterwards. I think your to: and cc: list came from the get_maintainer.pl script, but you can trim it a bit by omitting people who have just made occasional random fixups. These drivers are really unmaintained, so Dave M, netdev, Rafael, linux-pm, linux-pci, and maybe LKML are probably enough. On Tue, Apr 28, 2020 at 08:13:13PM +0530, Vaibhav Gupta wrote: > Upgrade power management from legacy to generic using dev_pm_ops. Instead of the paragraphs below, which cover the stuff that's fairly obvious, I think it would be more useful to include hints about where the things you removed will be done now. That helps reviewers verify that this doesn't break anything. E.g., In the legacy PM model, drivers save and restore PCI state and set the device power state directly. In the generic model, this is all done by the PCI core in .suspend_noirq() (pci_pm_suspend_noirq()) and .resume_noirq() (pci_pm_resume_noirq()). This sort of thing could go in each commit log. The cover letter doesn't normally go in the commit log, so you have to assume it will be lost. > Remove "struct pci_driver.suspend" and "struct pci_driver.resume" > bindings, and add "struct pci_driver.driver.pm" . > > Add "__maybe_unused" attribute to resume() and susend() callbacks > definition to suppress compiler warnings. > > Generic callback requires an argument of type "struct device*". Hence, > convert it to "struct net_device*" using "dev_get_drvdata()" to use > it in the callback. > > Signed-off-by: Vaibhav Gupta Acked-by: Bjorn Helgaas Thanks a lot for doing this! > --- > drivers/net/ethernet/realtek/8139too.c | 26 +++--- > 1 file changed, 7 insertions(+), 19 deletions(-) > > diff --git a/drivers/net/ethernet/realtek/8139too.c > b/drivers/net/ethernet/realtek/8139too.c > index 5caeb8368eab..227139d42227 100644 > --- a/drivers/net/ethernet/realtek/8139too.c > +++ b/drivers/net/ethernet/realtek/8139too.c > @@ -2603,17 +2603,13 @@ static void rtl8139_set_rx_mode (struct net_device > *dev) > spin_unlock_irqrestore (&tp->lock, flags); > } > > -#ifdef CONFIG_PM > - > -static int rtl8139_suspend (struct pci_dev *pdev, pm_message_t state) > +static int __maybe_unused rtl8139_suspend(struct device *device) > { > - struct net_device *dev = pci_get_drvdata (pdev); > + struct net_device *dev = dev_get_drvdata(device); > struct rtl8139_private *tp = netdev_priv(dev); > void __iomem *ioaddr = tp->mmio_addr; > unsigned long flags; > > - pci_save_state (pdev); > - > if (!netif_running (dev)) > return 0; > > @@ -2631,38 +2627,30 @@ static int rtl8139_suspend (struct pci_dev *pdev, > pm_message_t state) > > spin_unlock_irqrestore (&tp->lock, flags); > > - pci_set_power_state (pdev, PCI_D3hot); > - > return 0; > } > > - > -static int rtl8139_resume (struct pci_dev *pdev) > +static int __maybe_unused rtl8139_resume(struct device *device) > { > - struct net_device *dev = pci_get_drvdata (pdev); > + struct net_device *dev = dev_get_drvdata(device); > > - pci_restore_state (pdev); > if (!netif_running (dev)) > return 0; > - pci_set_power_state (pdev, PCI_D0); > + > rtl8139_init_ring (dev); > rtl8139_hw_start (dev); > netif_device_attach (dev); > return 0; > } > > -#endif /* CONFIG_PM */ > - > +static SIMPLE_DEV_PM_OPS(rtl8139_pm_ops, rtl8139_suspend, rtl8139_resume); > > static struct pci_driver rtl8139_pci_driver = { > .name = DRV_NAME, > .id_table = rtl8139_pci_tbl, > .probe = rtl8139_init_one, > .remove = 
rtl8139_remove_one, > -#ifdef CONFIG_PM > - .suspend= rtl8139_suspend, > - .resume = rtl8139_resume, > -#endif /* CONFIG_PM */ > + .driver.pm = &rtl8139_pm_ops, > }; > > > -- > 2.26.2 >
Re: [PATCH] liquidio: Use pcie_flr() instead of reimplementing it
On Thu, Aug 08, 2019 at 07:57:53AM +0300, Denis Efremov wrote: > octeon_mbox_process_cmd() directly writes the PCI_EXP_DEVCTL_BCR_FLR > bit, which bypasses timing requirements imposed by the PCIe spec. > This patch fixes the function to use the pcie_flr() interface instead. > > Signed-off-by: Denis Efremov Reviewed-by: Bjorn Helgaas Thanks for doing this, Denis. When possible it's better to use a PCI core interface than to fiddle with PCI config space directly from a driver. > --- > drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c | 4 +--- > 1 file changed, 1 insertion(+), 3 deletions(-) > > diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c > b/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c > index 021d99cd1665..614d07be7181 100644 > --- a/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c > +++ b/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c > @@ -260,9 +260,7 @@ static int octeon_mbox_process_cmd(struct octeon_mbox > *mbox, > dev_info(&oct->pci_dev->dev, >"got a request for FLR from VF that owns DPI ring > %u\n", >mbox->q_no); > - pcie_capability_set_word( > - oct->sriov_info.dpiring_to_vfpcidev_lut[mbox->q_no], > - PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_BCR_FLR); > + pcie_flr(oct->sriov_info.dpiring_to_vfpcidev_lut[mbox->q_no]); > break; > > case OCTEON_PF_CHANGED_VF_MACADDR: > -- > 2.21.0 >
Re: [PATCH net-next 1/2] PCI: let pci_disable_link_state propagate errors
On Tue, Jun 18, 2019 at 11:13:48PM +0200, Heiner Kallweit wrote: > Drivers may rely on pci_disable_link_state() having disabled certain > ASPM link states. If OS can't control ASPM then pci_disable_link_state() > turns into a no-op w/o informing the caller. The driver therefore may > falsely assume the respective ASPM link states are disabled. > Let pci_disable_link_state() propagate errors to the caller, enabling > the caller to react accordingly. > > Signed-off-by: Heiner Kallweit Acked-by: Bjorn Helgaas Thanks, I think this makes good sense. > --- > drivers/pci/pcie/aspm.c | 20 +++- > include/linux/pci-aspm.h | 7 --- > 2 files changed, 15 insertions(+), 12 deletions(-) > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c > index fd4cb7508..e44af7f4d 100644 > --- a/drivers/pci/pcie/aspm.c > +++ b/drivers/pci/pcie/aspm.c > @@ -1062,18 +1062,18 @@ void pcie_aspm_powersave_config_link(struct pci_dev > *pdev) > up_read(&pci_bus_sem); > } > > -static void __pci_disable_link_state(struct pci_dev *pdev, int state, bool > sem) > +static int __pci_disable_link_state(struct pci_dev *pdev, int state, bool > sem) > { > struct pci_dev *parent = pdev->bus->self; > struct pcie_link_state *link; > > if (!pci_is_pcie(pdev)) > - return; > + return 0; > > if (pdev->has_secondary_link) > parent = pdev; > if (!parent || !parent->link_state) > - return; > + return -EINVAL; > > /* >* A driver requested that ASPM be disabled on this device, but > @@ -1085,7 +1085,7 @@ static void __pci_disable_link_state(struct pci_dev > *pdev, int state, bool sem) >*/ > if (aspm_disabled) { > pci_warn(pdev, "can't disable ASPM; OS doesn't have ASPM > control\n"); > - return; > + return -EPERM; > } > > if (sem) > @@ -1105,11 +1105,13 @@ static void __pci_disable_link_state(struct pci_dev > *pdev, int state, bool sem) > mutex_unlock(&aspm_lock); > if (sem) > up_read(&pci_bus_sem); > + > + return 0; > } > > -void pci_disable_link_state_locked(struct pci_dev *pdev, int state) > +int pci_disable_link_state_locked(struct pci_dev *pdev, int state) > { > - __pci_disable_link_state(pdev, state, false); > + return __pci_disable_link_state(pdev, state, false); > } > EXPORT_SYMBOL(pci_disable_link_state_locked); > > @@ -1117,14 +1119,14 @@ EXPORT_SYMBOL(pci_disable_link_state_locked); > * pci_disable_link_state - Disable device's link state, so the link will > * never enter specific states. Note that if the BIOS didn't grant ASPM > * control to the OS, this does nothing because we can't touch the LNKCTL > - * register. > + * register. Returns 0 or a negative errno. 
> * > * @pdev: PCI device > * @state: ASPM link state to disable > */ > -void pci_disable_link_state(struct pci_dev *pdev, int state) > +int pci_disable_link_state(struct pci_dev *pdev, int state) > { > - __pci_disable_link_state(pdev, state, true); > + return __pci_disable_link_state(pdev, state, true); > } > EXPORT_SYMBOL(pci_disable_link_state); > > diff --git a/include/linux/pci-aspm.h b/include/linux/pci-aspm.h > index df28af5ce..67064145d 100644 > --- a/include/linux/pci-aspm.h > +++ b/include/linux/pci-aspm.h > @@ -24,11 +24,12 @@ > #define PCIE_LINK_STATE_CLKPM4 > > #ifdef CONFIG_PCIEASPM > -void pci_disable_link_state(struct pci_dev *pdev, int state); > -void pci_disable_link_state_locked(struct pci_dev *pdev, int state); > +int pci_disable_link_state(struct pci_dev *pdev, int state); > +int pci_disable_link_state_locked(struct pci_dev *pdev, int state); > void pcie_no_aspm(void); > #else > -static inline void pci_disable_link_state(struct pci_dev *pdev, int state) { > } > +static inline int pci_disable_link_state(struct pci_dev *pdev, int state) > +{ return 0; } > static inline void pcie_no_aspm(void) { } > #endif > > -- > 2.22.0 > >
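On the caller side, the point of propagating the error is roughly this (a sketch in the spirit of the r8169 user of this series; the struct and field names are placeholders):

static void example_init_aspm(struct pci_dev *pdev, struct example_priv *priv)
{
	int rc;

	rc = pci_disable_link_state(pdev, PCIE_LINK_STATE_L0S |
					  PCIE_LINK_STATE_L1);
	/* only trust that ASPM is off if the call actually succeeded */
	priv->aspm_disabled = !rc;
}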
Re: [PATCH 0/3] PCI: add help pci_dev_id
On Fri, Apr 19, 2019 at 03:13:38PM -0700, David Miller wrote: > From: Heiner Kallweit > Date: Fri, 19 Apr 2019 20:27:45 +0200 > > > In several places in the kernel we find PCI_DEVID used like this: > > PCI_DEVID(dev->bus->number, dev->devfn) Therefore create a helper > > for it. > > I'll wait for an ACK from the PCI folks on patch #1. #1 and #2 touch PCI, #3 is a trivial r8169 patch. There are a few other places where this helper could be used (powernv/npu-dma.c, amdkfd/kfd_topology.c, amd_iommu.c, intel-iommu.c, intel_irq_remapping.c, stmmac_pci.c, chromeos_laptop.c). I think it would be easier for me to collect acks for the trivial changes to those places and merge the whole shebang via PCI. Heiner, do you want to update those other places, too? Bjorn
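For reference, the helper itself is a one-liner over the existing macro, sketched here from the cover-letter description:

static inline u32 pci_dev_id(struct pci_dev *dev)
{
	return PCI_DEVID(dev->bus->number, dev->devfn);
}

Callers then replace PCI_DEVID(pdev->bus->number, pdev->devfn) with pci_dev_id(pdev).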
Re: [PATCH net] r8169: switch off ASPM by default and add sysfs attribute to control ASPM
On Tue, Apr 09, 2019 at 07:32:15PM +0200, Heiner Kallweit wrote: > On 05.04.2019 21:28, Heiner Kallweit wrote: > > On 05.04.2019 21:10, Bjorn Helgaas wrote: > >> On Wed, Apr 03, 2019 at 07:45:29PM +0200, Heiner Kallweit wrote: > >>> On 03.04.2019 15:14, Bjorn Helgaas wrote: > >>>> On Wed, Apr 03, 2019 at 07:53:40AM +0200, Heiner Kallweit wrote: > >>>>> On 02.04.2019 23:57, Bjorn Helgaas wrote: > >>>>>> On Tue, Apr 02, 2019 at 10:41:20PM +0200, Heiner Kallweit wrote: > >>>>>>> On 02.04.2019 22:16, Florian Fainelli wrote: > >>>>>>>> On 4/2/19 12:55 PM, Heiner Kallweit wrote: > >>>>>>>>> There are numerous reports about different problems caused by > >>>>>>>>> ASPM incompatibilities between certain network chip versions > >>>>>>>>> and board chipsets. On the other hand on (especially mobile) > >>>>>>>>> systems where ASPM works properly it can significantly > >>>>>>>>> contribute to power-saving and increased battery runtime. > >>>>>>>>> One problem so far to make ASPM configurable was to find an > >>>>>>>>> acceptable way of configuration (e.g. module parameters are > >>>>>>>>> discouraged for that purpose). > >> > >>>>>>>>> +Certain combinations of network chip versions and board > >>>>>>>>> +chipsets result in increased packet latency, PCIe errors, or > >>>>>>>>> +significantly reduced network performance. Therefore ASPM is > >>>>>>>>> +off by default. On the other hand ASPM can significantly > >>>>>>>>> +contribute to power-saving and thus increased battery runtime > >>>>>>>>> +on notebooks. > >> > >>>> That said, I think Frederick has already started working on a plan > >>>> for the PCI core to expose sysfs files to manage ASPM. This is > >>>> similar to the link_state files enabled by CONFIG_PCIEASPM_DEBUG, > >>>> but it will be always enabled and probably structured slightly > >>>> differently. The idea is that this would be generic and would not > >>>> require any driver support. > >> > >>> Frederick, is there anything you could share already? Or any timeline? > >>> Based on Bjorns info what seems to be best to me: > >>> 1. Disable ASPM for r8169 on stable (back to 4.19). > >>> 2. Once the generic ASPM sysfs attributes are available, reenable ASPM > >>>for r8169 in net-next. > >> > >> This is out of my wheelhouse, but even with a generic sysfs knob, it > >> doesn't sound like a good idea to me to enable ASPM by default for > >> r8169 if we think it's unreliable on any significant fraction of > >> machines. > >> > > I was a little bit imprecise. With the second statement I wanted to say: > > Keep ASPM disabled per default, but make it possible that setting the > > new sysfs attribute enables ASPM. After digging deeper in the ASPM core > > code it seems however that we don't even have to touch the driver later. > > ASPM has been disabled again for r8169: b75bb8a5b755 ("r8169: disable ASPM > again"). So, coming back to controlling ASPM via sysfs: > My first thought would be to extend pci_disable_link_state with support > for disabling L1.1/L1.2, and then basically expose pci_disable_link_state > via sysfs (attribute reading being handled with a direct read from > pcie_link_state->aspm_disable). > > Is this what you were planning or do you have some other approach in mind? I can't remember the details of what Frederick and I talked about, but I think that's the general approach. Bjorn
Re: [PATCH net] r8169: switch off ASPM by default and add sysfs attribute to control ASPM
On Wed, Apr 03, 2019 at 07:45:29PM +0200, Heiner Kallweit wrote: > On 03.04.2019 15:14, Bjorn Helgaas wrote: > > On Wed, Apr 03, 2019 at 07:53:40AM +0200, Heiner Kallweit wrote: > >> On 02.04.2019 23:57, Bjorn Helgaas wrote: > >>> On Tue, Apr 02, 2019 at 10:41:20PM +0200, Heiner Kallweit wrote: > >>>> On 02.04.2019 22:16, Florian Fainelli wrote: > >>>>> On 4/2/19 12:55 PM, Heiner Kallweit wrote: > >>>>>> There are numerous reports about different problems caused by > >>>>>> ASPM incompatibilities between certain network chip versions > >>>>>> and board chipsets. On the other hand on (especially mobile) > >>>>>> systems where ASPM works properly it can significantly > >>>>>> contribute to power-saving and increased battery runtime. > >>>>>> One problem so far to make ASPM configurable was to find an > >>>>>> acceptable way of configuration (e.g. module parameters are > >>>>>> discouraged for that purpose). > >>>>>> +Certain combinations of network chip versions and board > >>>>>> +chipsets result in increased packet latency, PCIe errors, or > >>>>>> +significantly reduced network performance. Therefore ASPM is > >>>>>> +off by default. On the other hand ASPM can significantly > >>>>>> +contribute to power-saving and thus increased battery runtime > >>>>>> +on notebooks. > > That said, I think Frederick has already started working on a plan > > for the PCI core to expose sysfs files to manage ASPM. This is > > similar to the link_state files enabled by CONFIG_PCIEASPM_DEBUG, > > but it will be always enabled and probably structured slightly > > differently. The idea is that this would be generic and would not > > require any driver support. > Frederick, is there anything you could share already? Or any timeline? > Based on Bjorns info what seems to be best to me: > 1. Disable ASPM for r8169 on stable (back to 4.19). > 2. Once the generic ASPM sysfs attributes are available, reenable ASPM >for r8169 in net-next. This is out of my wheelhouse, but even with a generic sysfs knob, it doesn't sound like a good idea to me to enable ASPM by default for r8169 if we think it's unreliable on any significant fraction of machines. Users of those unreliable machines will see poor performance, PCIe errors, etc, and they won't have a clue about how to fix them. To me it sounds better to leave ASPM disabled by default for r8169, then incrementally whitelist systems that are known to work. Users will have poor battery life, but at least things will work reliably. Bjorn
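"Incrementally whitelist systems that are known to work" would most likely be DMI-based inside the driver. A sketch under the assumption that a match simply permits ASPM to remain enabled (the table entries are placeholders):

#include <linux/dmi.h>

static const struct dmi_system_id rtl_aspm_whitelist[] = {
	{
		/* placeholder: a system verified to run ASPM reliably */
		.matches = {
			DMI_MATCH(DMI_SYS_VENDOR, "ExampleVendor"),
			DMI_MATCH(DMI_PRODUCT_NAME, "ExampleNotebook"),
		},
	},
	{ }
};

static bool rtl_aspm_known_good(void)
{
	return dmi_check_system(rtl_aspm_whitelist);
}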
Re: [PATCH net] r8169: switch off ASPM by default and add sysfs attribute to control ASPM
[+cc Frederick] On Wed, Apr 03, 2019 at 07:53:40AM +0200, Heiner Kallweit wrote: > On 02.04.2019 23:57, Bjorn Helgaas wrote: > > On Tue, Apr 02, 2019 at 10:41:20PM +0200, Heiner Kallweit wrote: > >> On 02.04.2019 22:16, Florian Fainelli wrote: > >>> On 4/2/19 12:55 PM, Heiner Kallweit wrote: > >>>> There are numerous reports about different problems caused by ASPM > >>>> incompatibilities between certain network chip versions and board > >>>> chipsets. On the other hand on (especially mobile) systems where ASPM > >>>> works properly it can significantly contribute to power-saving and > >>>> increased battery runtime. > >>>> One problem so far to make ASPM configurable was to find an acceptable > >>>> way of configuration (e.g. module parameters are discouraged for that > >>>> purpose). > >>>> > >>>> As a new attempt let's switch off ASPM per default and make it > >>>> configurable by a sysfs attribute. The attribute is documented in > >>>> new file Documentation/networking/device_drivers/realtek/r8169.txt. > > > > Both module parameters and sysfs attributes are a poor user > > experience. It's very difficult for users to figure out that > > a tweak is needed. > > > >>> I am not sure this is where it should be solved, there is > >>> definitively a device specific aspect to properly supporting the > >>> enabling of ASPM L0s, L1s etc, but the actual sysfs knobs should > >>> belong to the pci_device itself, since this is something that > >>> likely other drivers would want to be able to expose. You would > >>> probably want to work with the PCI maintainers to come up with a > >>> standard solution that applies beyond just r8169 since presumably > >>> there must be a gazillion of devices with the same issues. > > > > The Linux PCI core support for ASPM is poor. Without more details, > > it's impossible to tell whether these issues are hardware or firmware > > defects on the device itself, or something that Linux is doing wrong. > > There are several known defects, especially related to L1 substates > > and hotplug. > > > The vendor refuses to release datasheets or errata. Only certain > combinations of board chipsets (and maybe BIOS versions) and network > chip versions (from the ~ 50 supported by the driver) seem to be > affected. One typical symptom is missed RX packets, maybe the RX FIFO > isn't big enough to buffer all packets until PCIe has woken up. > The Windows vendor driver uses a hack, they dynamically disable ASPM > under load. I'm not super sympathetic to vendors like that or to OEMs that work with them. If we can make the NIC work reliably by disabling ASPM, that's step one. If we can figure out how to extend battery life by enabling ASPM in some cases, great, but we have to be careful to do it in a way that is supportable and doesn't generate lots of user complaints that require debugging. That said, I think Frederick has already started working on a plan for the PCI core to expose sysfs files to manage ASPM. This is similar to the link_state files enabled by CONFIG_PCIEASPM_DEBUG, but it will be always enabled and probably structured slightly differently. The idea is that this would be generic and would not require any driver support. Bjorn
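For context, the quirk_disable_aspm_l0s() style of workaround mentioned above is quite small; a sketch of the pattern (the device ID below is a placeholder, not a documented erratum):

#include <linux/pci.h>

static void quirk_disable_aspm_l0s(struct pci_dev *dev)
{
	pci_info(dev, "Disabling L0s\n");
	pci_disable_link_state(dev, PCIE_LINK_STATE_L0S);
}
/* one DECLARE_PCI_FIXUP_FINAL() line per affected device ID */
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0xabcd, quirk_disable_aspm_l0s);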
Re: [PATCH net] r8169: switch off ASPM by default and add sysfs attribute to control ASPM
[+cc Rajat] On Tue, Apr 02, 2019 at 10:41:20PM +0200, Heiner Kallweit wrote: > On 02.04.2019 22:16, Florian Fainelli wrote: > > On 4/2/19 12:55 PM, Heiner Kallweit wrote: > >> There are numerous reports about different problems caused by ASPM > >> incompatibilities between certain network chip versions and board > >> chipsets. On the other hand on (especially mobile) systems where ASPM > >> works properly it can significantly contribute to power-saving and > >> increased battery runtime. > >> One problem so far to make ASPM configurable was to find an acceptable > >> way of configuration (e.g. module parameters are discouraged for that > >> purpose). > >> > >> As a new attempt let's switch off ASPM per default and make it > >> configurable by a sysfs attribute. The attribute is documented in > >> new file Documentation/networking/device_drivers/realtek/r8169.txt. Both module parameters and sysfs attributes are a poor user experience. It's very difficult for users to figure out that a tweak is needed. > > I am not sure this is where it should be solved, there is > > definitively a device specific aspect to properly supporting the > > enabling of ASPM L0s, L1s etc, but the actual sysfs knobs should > > belong to the pci_device itself, since this is something that > > likely other drivers would want to be able to expose. You would > > probably want to work with the PCI maintainers to come up with a > > standard solution that applies beyond just r8169 since presumably > > there must be a gazillion of devices with the same issues. The Linux PCI core support for ASPM is poor. Without more details, it's impossible to tell whether these issues are hardware or firmware defects on the device itself, or something that Linux is doing wrong. There are several known defects, especially related to L1 substates and hotplug. We already have the "pcie_aspm=off" kernel parameter. That's system-wide, but maybe that's sufficient for debugging? If we can determine that a device is broken, quirks along the lines of quirk_disable_aspm_l0s() are one possibility. > Here this attribute only controls whether the device actively enters > ASPM states. I doesn't change anything on the other side of the > link. > > Maybe it would be an option to add an attribute on device level in > the PCI core that reflects "ASPM allowed", or even in more detail > which ASPM states / sub-states are acceptable. Certain support for > disabling ASPM states is provided by pci_disable_link_state(). Not > sure whether the device would have to be notified by the core that > certain states have been disabled. Maybe not because this is auto- > negotiated between the PCIe link partners. Drivers should not be involved in ASPM except to the extent that there are hardware/firmware defects that mean some ASPM states should be disabled even though the device advertises support for them. Some ASPM configuration requires knowledge of latencies along the entire path to the root complex, so it involves more than just the device, and it really should be done by the PCI core before the driver is attached. > Based on the usage of pci_disable_link_state() not that many devices > seem to be effected. One good example may be Intel e1000e > (__e1000e_disable_aspm()). 
> > >> > >> Fixes: a99790bf5c7f ("r8169: Reinstate ASPM Support") > >> Signed-off-by: Heiner Kallweit > >> --- > >> .../device_drivers/realtek/r8169.txt | 19 + > >> drivers/net/ethernet/realtek/r8169.c | 75 +-- > >> 2 files changed, 86 insertions(+), 8 deletions(-) > >> create mode 100644 > >> Documentation/networking/device_drivers/realtek/r8169.txt > >> > >> diff --git a/Documentation/networking/device_drivers/realtek/r8169.txt > >> b/Documentation/networking/device_drivers/realtek/r8169.txt > >> new file mode 100644 > >> index 0..669995d0c > >> --- /dev/null > >> +++ b/Documentation/networking/device_drivers/realtek/r8169.txt > >> @@ -0,0 +1,19 @@ > >> +Written by Heiner Kallweit > >> + > >> +Version 04/02/2019 > >> + > >> +Driver-specific sysfs attributes > >> + > >> + > >> +rtl_aspm (bool) > >> +--- > >> + > >> +Certain combinations of network chip versions and board chipsets result in > >> +increased packet latency, PCIe errors, or significantly reduced network > >> +performance. Therefore ASPM is off by default. On the other hand ASPM can > >> +significantly contribute to power-saving and thus increased battery > >> +runtime on notebooks. Therefore this sysfs attribute allows to switch on > >> +ASPM on systems where ASPM works properly. The attribute accepts any form > >> +of bool value, e.g. 1/y/on. See also kerneldoc of kstrtobool(). > >> +Note that the attribute is accessible only if interface is up. > >> +Else network chip and PCIe link may be powered-down and not reachable. > >> diff --git a/drivers/net/ethernet/realtek/r8169.c > >> b/drivers/net/ethernet/realtek/r8169.c > >> index 19efa88f
[PATCH] net: Don't default Cavium PTP driver to 'y'
From: Bjorn Helgaas 8c56df372bc1 ("net: add support for Cavium PTP coprocessor") added the Cavium PTP coprocessor driver and enabled it by default. Remove the "default y" because the driver only applies to Cavium ThunderX processors. Fixes: 8c56df372bc1 ("net: add support for Cavium PTP coprocessor") Signed-off-by: Bjorn Helgaas --- drivers/net/ethernet/cavium/Kconfig |1 - 1 file changed, 1 deletion(-) diff --git a/drivers/net/ethernet/cavium/Kconfig b/drivers/net/ethernet/cavium/Kconfig index 5f03199a3acf..05f4a3b21e29 100644 --- a/drivers/net/ethernet/cavium/Kconfig +++ b/drivers/net/ethernet/cavium/Kconfig @@ -54,7 +54,6 @@ config CAVIUM_PTP tristate "Cavium PTP coprocessor as PTP clock" depends on 64BIT && PCI imply PTP_1588_CLOCK - default y ---help--- This driver adds support for the Precision Time Protocol Clocks and Timestamping coprocessor (PTP) found on Cavium processors.
Re: [PATCH] PCI / ACPI: Don't clear pme_poll on device that has unreliable ACPI wake
On Sun, Feb 03, 2019 at 01:46:50AM +0800, Kai Heng Feng wrote: > > On Jan 28, 2019, at 3:51 PM, Kai Heng Feng > > wrote: > > >> If I understand correctly, the bugzilla lspci > >> (https://bugzilla.kernel.org/attachment.cgi?id=280691) was collected > >> at point 8, and it shows PME_Status=1 when it should be 0. > >> > >> If we write a 1 to PME_Status to clear it, and it remains set, that's > >> obviously a hardware defect, and Intel should document that in an > >> erratum, and a quirk would be the appropriate way to work around it. > >> But I doubt that's what's happening. > > > > I’ll ask them if they can provide an erratum. > > Got confirmed with e1000e folks, I219 (the device in question) doesn’t > really support runtime D3. Did you get a reference, e.g., an intel.com URL for that? Intel usually publishes errata for hardware defects, which is nice because it means every customer doesn't have to experimentally rediscover them. > I also checked the behavior of the device under Windows, and it > stays at D0 all the time even when it’s not in use. I think there are two possible explanations for this: 1) This device requires a Windows or a driver update with a device-specific quirk similar to what you're proposing for Linux. 2) Windows correctly detects that this device doesn't support D3, and Linux has a bug and does not detect that. Obviously nobody wants to require OS or driver updates just for minor device changes, and the PCI and ACPI specs are designed to allow generic, non device-specific code to detect D3 support, so the first case should be a result of a hardware defect. > So I sent a patch [1] to disable it. > > [1] https://lkml.org/lkml/2019/2/2/200 OK. Since that's in drivers/net/..., I have no objection and the e1000e maintainers would deal with that. Bjorn
Re: [PATCH net-next v6 1/2] net: add support for Cavium PTP coprocessor
On Mon, Jan 15, 2018 at 06:44:56PM +0600, Aleksey Makarov wrote: > +++ b/drivers/net/ethernet/cavium/common/cavium_ptp.c > @@ -0,0 +1,353 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* cavium_ptp.c - PTP 1588 clock on Cavium hardware > + * Copyright (c) 2003-2015, 2017 Cavium, Inc. > + */ > + > +#include > +#include > +#include > +#include > + > +#include "cavium_ptp.h" > + > +#define DRV_NAME "Cavium PTP Driver" This is also unconventional and looks funny, e.g., here: $ ls /sys/bus/pci/drivers/ 8250_mid/ exar_serial/ mei_me/ snd_hda_intel/ agpgart-intel/ i801_smbus/ parport_pc/ snd_soc_skl/ agpgart-sis/ i915/ pcieport/ xen-platform-pci/ agpgart-via/ intel_ish_ipc/ rtsx_pci/ xhci_hcd/ ahci/ intel_pch_thermal/ serial/ 'Cavium PTP Driver'/ iosf_mbi_pci/ shpchp/ e1000e/ iwlwifi/ skl_uncore/
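The reason it looks funny there is that the struct pci_driver .name string is used verbatim as the directory name under /sys/bus/pci/drivers/. The conventional fix is a short lowercase identifier; a self-contained sketch (the device ID and the probe body are placeholders, not taken from the submitted driver):

#include <linux/module.h>
#include <linux/pci.h>

#define DRV_NAME "cavium_ptp"	/* becomes /sys/bus/pci/drivers/cavium_ptp/ */

static const struct pci_device_id cavium_ptp_id_table[] = {
	{ PCI_DEVICE(PCI_VENDOR_ID_CAVIUM, 0xa00c) },	/* placeholder device ID */
	{ 0, }
};
MODULE_DEVICE_TABLE(pci, cavium_ptp_id_table);

static int cavium_ptp_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	return pcim_enable_device(pdev);	/* the real probe does much more */
}

static struct pci_driver cavium_ptp_driver = {
	.name		= DRV_NAME,
	.id_table	= cavium_ptp_id_table,
	.probe		= cavium_ptp_probe,
};
module_pci_driver(cavium_ptp_driver);

MODULE_LICENSE("GPL");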
Re: [PATCH net-next v6 1/2] net: add support for Cavium PTP coprocessor
On Mon, Jan 15, 2018 at 06:44:56PM +0600, Aleksey Makarov wrote: > From: Radoslaw Biernacki > > This patch adds support for the Precision Time Protocol > Clocks and Timestamping hardware found on Cavium ThunderX > processors. > > Signed-off-by: Radoslaw Biernacki > Signed-off-by: Aleksey Makarov > Acked-by: Philippe Ombredanne > --- > drivers/net/ethernet/cavium/Kconfig | 12 + > drivers/net/ethernet/cavium/Makefile| 1 + > drivers/net/ethernet/cavium/common/Makefile | 1 + > drivers/net/ethernet/cavium/common/cavium_ptp.c | 353 > > drivers/net/ethernet/cavium/common/cavium_ptp.h | 70 + > 5 files changed, 437 insertions(+) > create mode 100644 drivers/net/ethernet/cavium/common/Makefile > create mode 100644 drivers/net/ethernet/cavium/common/cavium_ptp.c > create mode 100644 drivers/net/ethernet/cavium/common/cavium_ptp.h > > diff --git a/drivers/net/ethernet/cavium/Kconfig > b/drivers/net/ethernet/cavium/Kconfig > index 63be75eb34d2..96586c0b4490 100644 > --- a/drivers/net/ethernet/cavium/Kconfig > +++ b/drivers/net/ethernet/cavium/Kconfig > @@ -50,6 +50,18 @@ config THUNDER_NIC_RGX > This driver supports configuring XCV block of RGX interface > present on CN81XX chip. > > +config CAVIUM_PTP > + tristate "Cavium PTP coprocessor as PTP clock" > + depends on 64BIT > + imply PTP_1588_CLOCK > + default y Why is this "default y"? It looks like this is a PCI driver and probably should be loaded only when the PCI device is present. > + ---help--- > + This driver adds support for the Precision Time Protocol Clocks and > + Timestamping coprocessor (PTP) found on Cavium processors. > + PTP provides timestamping mechanism that is suitable for use in IEEE > 1588 > + Precision Time Protocol or other purposes. Timestamps can be used in > + BGX, TNS, GTI, and NIC blocks.
Re: [PATCH] PCI / ACPI: Don't clear pme_poll on device that has unreliable ACPI wake
On Thu, Jan 24, 2019 at 11:29:37PM +0800, Kai Heng Feng wrote: > > On Jan 24, 2019, at 11:15 PM, Bjorn Helgaas wrote: > > On Wed, Jan 23, 2019 at 03:17:37PM +0800, Kai Heng Feng wrote: > >>> On Jan 23, 2019, at 7:51 AM, Bjorn Helgaas wrote: > >>> On Tue, Jan 22, 2019 at 02:45:44PM +0800, Kai-Heng Feng wrote: > >>>> There are some e1000e devices can only be woken up from D3 one time, by > >>>> plugging ethernet cable. Subsequent cable plugging does set PME bit > >>>> correctly, but it still doesn't get woken up. > >>>> > >>>> Since e1000e connects to the root complex directly, we rely on ACPI to > >>>> wake it up. In this case, the GPE from _PRW only works once and stops > >>>> working after that. > >>>> > >>>> So introduce a new PCI quirk, to avoid clearing pme_poll flag for buggy > >>>> platform firmwares that have unreliable GPE wake. > >>> > >>> This quirk applies to all 0x15bb (E1000_DEV_ID_PCH_CNP_I219_LM7) and > >>> 0x15bd (E1000_DEV_ID_PCH_CNP_I219_LM6) devices. The e1000e driver > >>> claims about a zillion different device IDs. > >>> > >>> I would be surprised if these two devices are defective but all the > >>> others work correctly. Could it be that there is a problem with the > >>> wiring on this particular motherboard or with the ACPI _PRW methods > >>> (or the way Linux interprets them) in this firmware? > >> > >> If this is a motherboard issue or platform specific, do you prefer to use > >> DMI matches here? > > > > I'm not sure what the problem is yet, so let's hold off on the exact > > structure of the fix. > > I think DMI table can put in e1000e driver instead of PCI quirk. I don't think we should add a quirk or DMI table yet because we haven't gotten to the root cause of this problem. If the root cause is a problem in the Linux code, adding a quirk will mask the problem for this specific system, but will leave other systems with similar problems. > > If I understand correctly, e1000e wakeup works once, but doesn't work > > after that. Your lspci (from after that first wakeup, from > > https://bugzilla.kernel.org/attachment.cgi?id=280691) shows this: > > > > 00:14.0 XHC XHCI USB > >Flags: PMEClk- DSI- D1- D2- ... PME(D0-,D1-,D2-,D3hot+,D3cold+) > >Status: D3 NoSoftRst+ PME-Enable+ DSel=0 DScale=0 PME- > > 00:1f.3 HDAS audio > >Flags: PMEClk- DSI- D1- D2- ... PME(D0-,D1-,D2-,D3hot+,D3cold+) > >Status: D3 NoSoftRst+ PME-Enable+ DSel=0 DScale=0 PME- > > 00:1f.6 GLAN e1000e > >Flags: PMEClk- DSI+ D1- D2- ... PME(D0+,D1-,D2-,D3hot+,D3cold+) > >Status: D3 NoSoftRst+ PME-Enable+ DSel=0 DScale=1 PME+ > > > > So the e1000e PME_Status bit is still set, which means it probably > > won't generate another PME interrupt, which would explain why wakeup > > doesn't work. To test this theory, can you try this: > > > > - sleep > > - wakeup via e1000e > > # DEV=00:1f.6 > > # lspci -vvs $DEV > > # setpci -s $DEV CAP_PM+4.W > > # setpci -s $DEV CAP_PM+4.W=0x8100 > > - sleep > > - attempt another wakeup via e1000e > > > > If this second wakeup works, it would suggest that PME_Status isn't > > being cleared correctly. I see code, e.g., in > > acpi_setup_gpe_for_wake(), that *looks* like it would arrange to clear > > it, but I'm not very familiar with it. Maybe there's some issue with > > multiple devices sharing an "implicit notification" situation like > > this. > > The PME status is being cleared correctly. I was hoping to understand this better via the experiment above, but I'm still confused. 
Here's the scenario as I understand it:

0) fresh boot
1) e1000e PME_Status should be 0
2) sleep
3) wakeup via e1000e succeeds
4) e1000e PME_Status should be 0
5) sleep
6) wakeup via e1000e fails
7) wakeup via USB succeeds
8) e1000e PME_Status should be 0, but is actually 1

If I understand correctly, the bugzilla lspci (https://bugzilla.kernel.org/attachment.cgi?id=280691) was collected at point 8, and it shows PME_Status=1 when it should be 0. If we write a 1 to PME_Status to clear it, and it remains set, that's obviously a hardware defect, and Intel should document that in an erratum, and a quirk would be the appropriate way to work around it. But I doubt that's what's happening. If e1000e changes PME_Status from 0 to 1 and we don't get an interrupt (in this case, an SCI triggering GPE 0x6d), the proble
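To spell out what the setpci command in the experiment above does: CAP_PM+4 is the Power Management Control/Status register (PMCSR), where bit 8 is PME_En and bit 15 is PME_Status (write-1-to-clear), so writing 0x8100 clears any latched PME and leaves PME generation enabled. The same operation with the kernel's own accessors, as a sketch (error handling omitted):

#include <linux/pci.h>

static void clear_and_enable_pme(struct pci_dev *dev)
{
	u16 pmcsr;

	if (!dev->pm_cap)	/* no PM capability */
		return;

	pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
	/* PME_Status is RW1C: writing the bit back clears it */
	pmcsr |= PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS;
	pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, pmcsr);
}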
Re: [PATCH] PCI / ACPI: Don't clear pme_poll on device that has unreliable ACPI wake
On Wed, Jan 23, 2019 at 03:17:37PM +0800, Kai Heng Feng wrote: > > On Jan 23, 2019, at 7:51 AM, Bjorn Helgaas wrote: > > On Tue, Jan 22, 2019 at 02:45:44PM +0800, Kai-Heng Feng wrote: > >> There are some e1000e devices can only be woken up from D3 one time, by > >> plugging ethernet cable. Subsequent cable plugging does set PME bit > >> correctly, but it still doesn't get woken up. > >> > >> Since e1000e connects to the root complex directly, we rely on ACPI to > >> wake it up. In this case, the GPE from _PRW only works once and stops > >> working after that. > >> > >> So introduce a new PCI quirk, to avoid clearing pme_poll flag for buggy > >> platform firmwares that have unreliable GPE wake. > > > > This quirk applies to all 0x15bb (E1000_DEV_ID_PCH_CNP_I219_LM7) and > > 0x15bd (E1000_DEV_ID_PCH_CNP_I219_LM6) devices. The e1000e driver > > claims about a zillion different device IDs. > > > > I would be surprised if these two devices are defective but all the > > others work correctly. Could it be that there is a problem with the > > wiring on this particular motherboard or with the ACPI _PRW methods > > (or the way Linux interprets them) in this firmware? > > If this is a motherboard issue or platform specific, do you prefer to use > DMI matches here? I'm not sure what the problem is yet, so let's hold off on the exact structure of the fix. If I understand correctly, e1000e wakeup works once, but doesn't work after that. Your lspci (from after that first wakeup, from https://bugzilla.kernel.org/attachment.cgi?id=280691) shows this: 00:14.0 XHC XHCI USB Flags: PMEClk- DSI- D1- D2- ... PME(D0-,D1-,D2-,D3hot+,D3cold+) Status: D3 NoSoftRst+ PME-Enable+ DSel=0 DScale=0 PME- 00:1f.3 HDAS audio Flags: PMEClk- DSI- D1- D2- ... PME(D0-,D1-,D2-,D3hot+,D3cold+) Status: D3 NoSoftRst+ PME-Enable+ DSel=0 DScale=0 PME- 00:1f.6 GLAN e1000e Flags: PMEClk- DSI+ D1- D2- ... PME(D0+,D1-,D2-,D3hot+,D3cold+) Status: D3 NoSoftRst+ PME-Enable+ DSel=0 DScale=1 PME+ So the e1000e PME_Status bit is still set, which means it probably won't generate another PME interrupt, which would explain why wakeup doesn't work. To test this theory, can you try this: - sleep - wakeup via e1000e # DEV=00:1f.6 # lspci -vvs $DEV # setpci -s $DEV CAP_PM+4.W # setpci -s $DEV CAP_PM+4.W=0x8100 - sleep - attempt another wakeup via e1000e If this second wakeup works, it would suggest that PME_Status isn't being cleared correctly. I see code, e.g., in acpi_setup_gpe_for_wake(), that *looks* like it would arrange to clear it, but I'm not very familiar with it. Maybe there's some issue with multiple devices sharing an "implicit notification" situation like this. > As for _PRW, it’s shared by USB controller, Audio controller and ethernet. > Only the ethernet (e1000e) has this issue. > > When this issue happens, the e1000e doesn’t get woken up by ethernet cable > plugging, but inserting a USB device or plugging audio jack can wake up all > three devices. So I think Linux interprets ACPI correctly here. > > Their _PRW here: > USB controller: > Scope (_SB.PCI0) > { > Device (XDCI) > { > Method (_PRW, 0, NotSerialized) // _PRW: Power Resources for Wake > { > Return (GPRW (0x6D, 0x04)) > } > > Audio controller: > Scope (_SB.PCI0) > > { > > Device (HDAS) > { > > … > Method (_PRW, 0, NotSerialized) // _PRW: Power Resources for > Wake > { > Return (GPRW (0x6D, 0x04)) > } > > Ethernet controller: > Scope (_SB.PCI0) > > { > Device (GLAN) > { > >… > Method (_PRW, 0, NotSerialized) // _PRW: Power Resources for > Wake >
Re: [PATCH] PCI / ACPI: Don't clear pme_poll on device that has unreliable ACPI wake
On Tue, Jan 22, 2019 at 02:45:44PM +0800, Kai-Heng Feng wrote: > There are some e1000e devices can only be woken up from D3 one time, by > plugging ethernet cable. Subsequent cable plugging does set PME bit > correctly, but it still doesn't get woken up. > > Since e1000e connects to the root complex directly, we rely on ACPI to > wake it up. In this case, the GPE from _PRW only works once and stops > working after that. > > So introduce a new PCI quirk, to avoid clearing pme_poll flag for buggy > platform firmwares that have unreliable GPE wake. This quirk applies to all 0x15bb (E1000_DEV_ID_PCH_CNP_I219_LM7) and 0x15bd (E1000_DEV_ID_PCH_CNP_I219_LM6) devices. The e1000e driver claims about a zillion different device IDs. I would be surprised if these two devices are defective but all the others work correctly. Could it be that there is a problem with the wiring on this particular motherboard or with the ACPI _PRW methods (or the way Linux interprets them) in this firmware? Would you mind attaching a complete dmesg log and "sudo lspci -vvv" output to the bugzilla, please? > Signed-off-by: Kai-Heng Feng > --- > drivers/pci/pci-acpi.c | 2 +- > drivers/pci/quirks.c | 8 > include/linux/pci.h| 1 + > 3 files changed, 10 insertions(+), 1 deletion(-) > > diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c > index e1949f7efd9c..184e2fc8a294 100644 > --- a/drivers/pci/pci-acpi.c > +++ b/drivers/pci/pci-acpi.c > @@ -430,7 +430,7 @@ static void pci_acpi_wake_dev(struct > acpi_device_wakeup_context *context) > > pci_dev = to_pci_dev(context->dev); > > - if (pci_dev->pme_poll) > + if (pci_dev->pme_poll && !pci_dev->unreliable_acpi_wake) > pci_dev->pme_poll = false; > > if (pci_dev->current_state == PCI_D3cold) { > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > index b0a413f3f7ca..ed4863496fa8 100644 > --- a/drivers/pci/quirks.c > +++ b/drivers/pci/quirks.c > @@ -4948,6 +4948,14 @@ DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_AMD, > PCI_ANY_ID, > DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, > PCI_CLASS_MULTIMEDIA_HD_AUDIO, 8, quirk_gpu_hda); > > +static void quirk_unreliable_acpi_wake(struct pci_dev *pdev) > +{ > + pci_info(pdev, "ACPI Wake unreliable, always poll PME\n"); > + pdev->unreliable_acpi_wake = 1; > +} > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x15bb, > quirk_unreliable_acpi_wake); > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x15bd, > quirk_unreliable_acpi_wake); > + > /* > * Some IDT switches incorrectly flag an ACS Source Validation error on > * completions for config read requests even though PCIe r4.0, sec > diff --git a/include/linux/pci.h b/include/linux/pci.h > index 65f1d8c2f082..d22065c1576f 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -331,6 +331,7 @@ struct pci_dev { > unsigned intpme_support:5; /* Bitmask of states from which PME# > can be generated */ > unsigned intpme_poll:1; /* Poll device's PME status bit */ > + unsigned intunreliable_acpi_wake:1; /* ACPI Wake doesn't always > work */ > unsigned intd1_support:1; /* Low power state D1 is supported */ > unsigned intd2_support:1; /* Low power state D2 is supported */ > unsigned intno_d1d2:1; /* D1 and D2 are forbidden */ > -- > 2.17.1 >
Re: [PATCH] PCI: Add no-D3 quirk for Mellanox ConnectX-[45]
On Tue, Dec 11, 2018 at 6:38 PM David Gibson wrote: > > On Tue, Dec 11, 2018 at 08:01:43AM -0600, Bjorn Helgaas wrote: > > Hi David, > > > > I see you're still working on this, but if you do end up going this > > direction eventually, would you mind splitting this into two patches: > > 1) rename the quirk to make it more generic (but not changing any > > behavior), and 2) add the ConnectX devices to the quirk. That way > > the ConnectX change is smaller and more easily > > understood/reverted/etc. > > Sure. Would it make sense to send (1) as an independent cleanup, > while I'm still working out exactly what (if anything) we need for > (2)? You could, but I don't think there's really much benefit in doing the first without the second, and I think there is some value in handling both patches at the same time.
Re: [PATCH] PCI: Add no-D3 quirk for Mellanox ConnectX-[45]
Hi David, I see you're still working on this, but if you do end up going this direction eventually, would you mind splitting this into two patches: 1) rename the quirk to make it more generic (but not changing any behavior), and 2) add the ConnectX devices to the quirk. That way the ConnectX change is smaller and more easily understood/reverted/etc. On Thu, Dec 06, 2018 at 03:19:51PM +1100, David Gibson wrote: > Mellanox ConnectX-5 IB cards (MT27800) seem to cause a call trace when > unbound from their regular driver and attached to vfio-pci in order to pass > them through to a guest. > > This goes away if the disable_idle_d3 option is used, so it looks like a > problem with the hardware handling D3 state. To fix that more permanently, > use a device quirk to disable D3 state for these devices. > > We do this by renaming the existing quirk_no_ata_d3() more generally and > attaching it to the ConnectX-[45] devices (0x15b3:0x1013). > > Signed-off-by: David Gibson > --- > drivers/pci/quirks.c | 17 +++-- > 1 file changed, 11 insertions(+), 6 deletions(-) > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > index 4700d24e5d55..add3f516ca12 100644 > --- a/drivers/pci/quirks.c > +++ b/drivers/pci/quirks.c > @@ -1315,23 +1315,24 @@ static void quirk_ide_samemode(struct pci_dev *pdev) > } > DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801CA_10, > quirk_ide_samemode); > > -/* Some ATA devices break if put into D3 */ > -static void quirk_no_ata_d3(struct pci_dev *pdev) > +/* Some devices (including a number of ATA cards) break if put into D3 */ > +static void quirk_no_d3(struct pci_dev *pdev) > { > pdev->dev_flags |= PCI_DEV_FLAGS_NO_D3; > } > + > /* Quirk the legacy ATA devices only. The AHCI ones are ok */ > DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_SERVERWORKS, PCI_ANY_ID, > - PCI_CLASS_STORAGE_IDE, 8, quirk_no_ata_d3); > + PCI_CLASS_STORAGE_IDE, 8, quirk_no_d3); > DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_ATI, PCI_ANY_ID, > - PCI_CLASS_STORAGE_IDE, 8, quirk_no_ata_d3); > + PCI_CLASS_STORAGE_IDE, 8, quirk_no_d3); > /* ALi loses some register settings that we cannot then restore */ > DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_AL, PCI_ANY_ID, > - PCI_CLASS_STORAGE_IDE, 8, quirk_no_ata_d3); > + PCI_CLASS_STORAGE_IDE, 8, quirk_no_d3); > /* VIA comes back fine but we need to keep it alive or ACPI GTM failures > occur when mode detecting */ > DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_VIA, PCI_ANY_ID, > - PCI_CLASS_STORAGE_IDE, 8, quirk_no_ata_d3); > + PCI_CLASS_STORAGE_IDE, 8, quirk_no_d3); > > /* > * This was originally an Alpha-specific thing, but it really fits here. > @@ -3367,6 +3368,10 @@ static void mellanox_check_broken_intx_masking(struct > pci_dev *pdev) > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, PCI_ANY_ID, > mellanox_check_broken_intx_masking); > > +/* Mellanox MT27800 (ConnectX-5) IB card seems to break with D3 > + * In particular this shows up when the device is bound to the vfio-pci > driver */ Follow usual multiline comment style, i.e., /* * text ... * more text ... */ > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_MELLANOX, > PCI_DEVICE_ID_MELLANOX_CONNECTX4, quirk_no_d3) > + > static void quirk_no_bus_reset(struct pci_dev *dev) > { > dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET; > -- > 2.19.2 >
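Concretely, the requested comment style for that hunk would be (content unchanged, just reflowed):

/*
 * Mellanox MT27800 (ConnectX-5) IB card seems to break with D3.
 * In particular this shows up when the device is bound to the
 * vfio-pci driver.
 */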
Re: [PATCH v3] PCI: Reprogram bridge prefetch registers on resume
[+cc LKML] On Tue, Sep 18, 2018 at 04:32:44PM -0500, Bjorn Helgaas wrote: > On Thu, Sep 13, 2018 at 11:37:45AM +0800, Daniel Drake wrote: > > On 38+ Intel-based Asus products, the nvidia GPU becomes unusable > > after S3 suspend/resume. The affected products include multiple > > generations of nvidia GPUs and Intel SoCs. After resume, nouveau logs > > many errors such as: > > > > fifo: fault 00 [READ] at 00555000 engine 00 [GR] client 04 > > [HUB/FE] reason 4a [] on channel -1 [007fa91000 unknown] > > DRM: failed to idle channel 0 [DRM] > > > > Similarly, the nvidia proprietary driver also fails after resume > > (black screen, 100% CPU usage in Xorg process). We shipped a sample > > to Nvidia for diagnosis, and their response indicated that it's a > > problem with the parent PCI bridge (on the Intel SoC), not the GPU. > > > > Runtime suspend/resume works fine, only S3 suspend is affected. > > > > We found a workaround: on resume, rewrite the Intel PCI bridge > > 'Prefetchable Base Upper 32 Bits' register (PCI_PREF_BASE_UPPER32). In > > the cases that I checked, this register has value 0 and we just have to > > rewrite that value. > > > > Linux already saves and restores PCI config space during suspend/resume, > > but this register was being skipped because upon resume, it already > > has value 0 (the correct, pre-suspend value). > > > > Intel appear to have previously acknowledged this behaviour and the > > requirement to rewrite this register. > > https://bugzilla.kernel.org/show_bug.cgi?id=116851#c23 > > > > Based on that, rewrite the prefetch register values even when that > > appears unnecessary. > > > > We have confirmed this solution on all the affected models we have > > in-hands (X542UQ, UX533FD, X530UN, V272UN). > > > > Additionally, this solves an issue where r8169 MSI-X interrupts were > > broken after S3 suspend/resume on Asus X441UAR. This issue was recently > > worked around in commit 7bb05b85bc2d ("r8169: don't use MSI-X on > > RTL8106e"). It also fixes the same issue on RTL6186evl/8111evl on an > > Aimfor-tech laptop that we had not yet patched. I suspect it will also > > fix the issue that was worked around in commit 7c53a722459c ("r8169: > > don't use MSI-X on RTL8168g"). > > > > Thomas Martitz reports that this change also solves an issue where > > the AMD Radeon Polaris 10 GPU on the HP Zbook 14u G5 is unresponsive > > after S3 suspend/resume. > > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=201069 > > Signed-off-by: Daniel Drake > > Applied with Rafael's and Peter's reviewed-by to pci/enumeration for v4.20. > Thanks for the the huge investigative effort! Since this looks low-risk and fixes several painful issues, I think this merits a stable tag and being included in v4.19 (instead of waiting for v4.20). I moved it to for-linus for v4.19. Let me know if you object. 
> > --- > > drivers/pci/pci.c | 25 + > > 1 file changed, 17 insertions(+), 8 deletions(-) > > > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > > index 29ff9619b5fa..5d58220b6997 100644 > > --- a/drivers/pci/pci.c > > +++ b/drivers/pci/pci.c > > @@ -1289,12 +1289,12 @@ int pci_save_state(struct pci_dev *dev) > > EXPORT_SYMBOL(pci_save_state); > > > > static void pci_restore_config_dword(struct pci_dev *pdev, int offset, > > -u32 saved_val, int retry) > > +u32 saved_val, int retry, bool force) > > { > > u32 val; > > > > pci_read_config_dword(pdev, offset, &val); > > - if (val == saved_val) > > + if (!force && val == saved_val) > > return; > > > > for (;;) { > > @@ -1313,25 +1313,34 @@ static void pci_restore_config_dword(struct pci_dev > > *pdev, int offset, > > } > > > > static void pci_restore_config_space_range(struct pci_dev *pdev, > > - int start, int end, int retry) > > + int start, int end, int retry, > > + bool force) > > { > > int index; > > > > for (index = end; index >= start; index--) > > pci_restore_config_dword(pdev, 4 * index, > > pdev->saved_config_space[index], > > -retry); > > +
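The hunk above is cut off, but the gist of the caller side is to force the restore of the bridge prefetch window (config dwords 9-11, which include Prefetchable Base Upper 32 Bits) even when the values read back as already correct. A sketch of that idea, not a verbatim continuation of the diff:

/* inside pci_restore_config_space(), for type 1 (bridge) headers */
if (pdev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
	pci_restore_config_space_range(pdev, 12, 15, 0, false);

	/*
	 * Force rewriting of the prefetch registers (dwords 9-11) to work
	 * around the S3 resume issue described in the changelog, even if
	 * they read back with the saved values.
	 */
	pci_restore_config_space_range(pdev, 9, 11, 0, true);
	pci_restore_config_space_range(pdev, 0, 8, 0, false);
}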
Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink
On Mon, Jul 30, 2018 at 08:19:50PM -0700, Alexander Duyck wrote: > On Mon, Jul 30, 2018 at 7:33 PM, Bjorn Helgaas wrote: > > On Mon, Jul 30, 2018 at 08:02:48AM -0700, Alexander Duyck wrote: > >> On Mon, Jul 30, 2018 at 7:07 AM, Bjorn Helgaas wrote: > >> > On Sun, Jul 29, 2018 at 03:00:28PM -0700, Alexander Duyck wrote: > >> >> On Sun, Jul 29, 2018 at 2:23 AM, Moshe Shemesh > >> >> wrote: > >> >> > On Sat, Jul 28, 2018 at 7:06 PM, Bjorn Helgaas > >> >> > wrote: > >> >> >> On Thu, Jul 26, 2018 at 07:00:20AM -0700, Alexander Duyck wrote: > >> >> >> > On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko > >> >> >> > wrote: > >> >> >> > > Thu, Jul 26, 2018 at 02:43:59AM CEST, > >> >> >> > > jakub.kicin...@netronome.com > >> >> >> > > wrote: > >> >> >> > >>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote: > >> >> >> > >>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote: > >> >> >> > >>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote: > >> >> >> > >>> >>>> The devlink params haven't been upstream even for a full > >> >> >> > >>> >>>> cycle > >> >> >> > >>> >>>> and > >> >> >> > >>> >>>> already you guys are starting to use them to configure > >> >> >> > >>> >>>> standard > >> >> >> > >>> >>>> features like queuing. > >> >> >> > >>> >>> > >> >> >> > >>> >>> We developed the devlink params in order to support > >> >> >> > >>> >>> non-standard > >> >> >> > >>> >>> configuration only. And for non-standard, there are > >> >> >> > >>> >>> generic and > >> >> >> > >>> >>> vendor > >> >> >> > >>> >>> specific options. > >> >> >> > >>> >> > >> >> >> > >>> >> I thought it was developed for performing non-standard and > >> >> >> > >>> >> possibly > >> >> >> > >>> >> vendor specific configuration. Look at > >> >> >> > >>> >> DEVLINK_PARAM_GENERIC_* > >> >> >> > >>> >> for > >> >> >> > >>> >> examples of well justified generic options for which we > >> >> >> > >>> >> have no > >> >> >> > >>> >> other API. The vendor mlx4 options look fairly vendor > >> >> >> > >>> >> specific > >> >> >> > >>> >> if you > >> >> >> > >>> >> ask me, too. > >> >> >> > >>> >> > >> >> >> > >>> >> Configuring queuing has an API. The question is it > >> >> >> > >>> >> acceptable to > >> >> >> > >>> >> enter > >> >> >> > >>> >> into the risky territory of controlling offloads via devlink > >> >> >> > >>> >> parameters > >> >> >> > >>> >> or would we rather make vendors take the time and effort to > >> >> >> > >>> >> model > >> >> >> > >>> >> things to (a subset) of existing APIs. The HW never fits > >> >> >> > >>> >> the > >> >> >> > >>> >> APIs > >> >> >> > >>> >> perfectly. > >> >> >> > >>> > > >> >> >> > >>> > I understand what you meant here, I would like to highlight > >> >> >> > >>> > that > >> >> >> > >>> > this > >> >> >> > >>> > mechanism was not meant to handle SRIOV, Representors, etc. > >> >> >> > >>> > The vendor specific configuration suggested here is to > >&g
Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink
On Mon, Jul 30, 2018 at 08:02:48AM -0700, Alexander Duyck wrote: > On Mon, Jul 30, 2018 at 7:07 AM, Bjorn Helgaas wrote: > > On Sun, Jul 29, 2018 at 03:00:28PM -0700, Alexander Duyck wrote: > >> On Sun, Jul 29, 2018 at 2:23 AM, Moshe Shemesh > >> wrote: > >> > On Sat, Jul 28, 2018 at 7:06 PM, Bjorn Helgaas > >> > wrote: > >> >> On Thu, Jul 26, 2018 at 07:00:20AM -0700, Alexander Duyck wrote: > >> >> > On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko wrote: > >> >> > > Thu, Jul 26, 2018 at 02:43:59AM CEST, jakub.kicin...@netronome.com > >> >> > > wrote: > >> >> > >>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote: > >> >> > >>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote: > >> >> > >>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote: > >> >> > >>> >>>> The devlink params haven't been upstream even for a full > >> >> > >>> >>>> cycle > >> >> > >>> >>>> and > >> >> > >>> >>>> already you guys are starting to use them to configure > >> >> > >>> >>>> standard > >> >> > >>> >>>> features like queuing. > >> >> > >>> >>> > >> >> > >>> >>> We developed the devlink params in order to support > >> >> > >>> >>> non-standard > >> >> > >>> >>> configuration only. And for non-standard, there are generic > >> >> > >>> >>> and > >> >> > >>> >>> vendor > >> >> > >>> >>> specific options. > >> >> > >>> >> > >> >> > >>> >> I thought it was developed for performing non-standard and > >> >> > >>> >> possibly > >> >> > >>> >> vendor specific configuration. Look at DEVLINK_PARAM_GENERIC_* > >> >> > >>> >> for > >> >> > >>> >> examples of well justified generic options for which we have no > >> >> > >>> >> other API. The vendor mlx4 options look fairly vendor specific > >> >> > >>> >> if you > >> >> > >>> >> ask me, too. > >> >> > >>> >> > >> >> > >>> >> Configuring queuing has an API. The question is it acceptable > >> >> > >>> >> to > >> >> > >>> >> enter > >> >> > >>> >> into the risky territory of controlling offloads via devlink > >> >> > >>> >> parameters > >> >> > >>> >> or would we rather make vendors take the time and effort to > >> >> > >>> >> model > >> >> > >>> >> things to (a subset) of existing APIs. The HW never fits the > >> >> > >>> >> APIs > >> >> > >>> >> perfectly. > >> >> > >>> > > >> >> > >>> > I understand what you meant here, I would like to highlight that > >> >> > >>> > this > >> >> > >>> > mechanism was not meant to handle SRIOV, Representors, etc. > >> >> > >>> > The vendor specific configuration suggested here is to handle a > >> >> > >>> > congestion > >> >> > >>> > state in Multi Host environment (which includes PF and multiple > >> >> > >>> > VFs per > >> >> > >>> > host), where one host is not aware to the other hosts, and each > >> >> > >>> > is > >> >> > >>> > running > >> >> > >>> > on its own pci/driver. It is a device working mode > >> >> > >>> > configuration. > >> >> > >>> > > >> >> > >>> > This couldn't fit into any existing API, thus creating this > >> >> > >>> > vendor specific > >> >> > >>> > unique API is needed. > >> >> > >>> > >> >> > >>> If we are just g
Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink
On Sun, Jul 29, 2018 at 03:00:28PM -0700, Alexander Duyck wrote: > On Sun, Jul 29, 2018 at 2:23 AM, Moshe Shemesh wrote: > > On Sat, Jul 28, 2018 at 7:06 PM, Bjorn Helgaas wrote: > >> On Thu, Jul 26, 2018 at 07:00:20AM -0700, Alexander Duyck wrote: > >> > On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko wrote: > >> > > Thu, Jul 26, 2018 at 02:43:59AM CEST, jakub.kicin...@netronome.com > >> > > wrote: > >> > >>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote: > >> > >>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote: > >> > >>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote: > >> > >>> >>>> The devlink params haven't been upstream even for a full cycle > >> > >>> >>>> and > >> > >>> >>>> already you guys are starting to use them to configure standard > >> > >>> >>>> features like queuing. > >> > >>> >>> > >> > >>> >>> We developed the devlink params in order to support non-standard > >> > >>> >>> configuration only. And for non-standard, there are generic and > >> > >>> >>> vendor > >> > >>> >>> specific options. > >> > >>> >> > >> > >>> >> I thought it was developed for performing non-standard and > >> > >>> >> possibly > >> > >>> >> vendor specific configuration. Look at DEVLINK_PARAM_GENERIC_* > >> > >>> >> for > >> > >>> >> examples of well justified generic options for which we have no > >> > >>> >> other API. The vendor mlx4 options look fairly vendor specific > >> > >>> >> if you > >> > >>> >> ask me, too. > >> > >>> >> > >> > >>> >> Configuring queuing has an API. The question is it acceptable to > >> > >>> >> enter > >> > >>> >> into the risky territory of controlling offloads via devlink > >> > >>> >> parameters > >> > >>> >> or would we rather make vendors take the time and effort to model > >> > >>> >> things to (a subset) of existing APIs. The HW never fits the > >> > >>> >> APIs > >> > >>> >> perfectly. > >> > >>> > > >> > >>> > I understand what you meant here, I would like to highlight that > >> > >>> > this > >> > >>> > mechanism was not meant to handle SRIOV, Representors, etc. > >> > >>> > The vendor specific configuration suggested here is to handle a > >> > >>> > congestion > >> > >>> > state in Multi Host environment (which includes PF and multiple > >> > >>> > VFs per > >> > >>> > host), where one host is not aware to the other hosts, and each is > >> > >>> > running > >> > >>> > on its own pci/driver. It is a device working mode configuration. > >> > >>> > > >> > >>> > This couldn't fit into any existing API, thus creating this > >> > >>> > vendor specific > >> > >>> > unique API is needed. > >> > >>> > >> > >>> If we are just going to start creating devlink interfaces in for > >> > >>> every > >> > >>> one-off option a device wants to add why did we even bother with > >> > >>> trying to prevent drivers from using sysfs? This just feels like we > >> > >>> are back to the same arguments we had back in the day with it. > >> > >>> > >> > >>> I feel like the bigger question here is if devlink is how we are > >> > >>> going > >> > >>> to deal with all PCIe related features going forward, or should we > >> > >>> start looking at creating a new interface/tool for PCI/PCIe related > >> > >>> features? My concern is that we have already had features such as > >> > >>> DMA > >> > >>> Coalescing that didn't really fit into anything and now we are > >> > >>> starting to see other things related to DMA and PCIe bus credits.
Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink
On Thu, Jul 26, 2018 at 07:00:20AM -0700, Alexander Duyck wrote: > On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko wrote: > > Thu, Jul 26, 2018 at 02:43:59AM CEST, jakub.kicin...@netronome.com wrote: > >>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote: > >>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote: > >>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote: > >>> The devlink params haven't been upstream even for a full cycle and > >>> already you guys are starting to use them to configure standard > >>> features like queuing. > >>> >>> > >>> >>> We developed the devlink params in order to support non-standard > >>> >>> configuration only. And for non-standard, there are generic and vendor > >>> >>> specific options. > >>> >> > >>> >> I thought it was developed for performing non-standard and possibly > >>> >> vendor specific configuration. Look at DEVLINK_PARAM_GENERIC_* for > >>> >> examples of well justified generic options for which we have no > >>> >> other API. The vendor mlx4 options look fairly vendor specific if you > >>> >> ask me, too. > >>> >> > >>> >> Configuring queuing has an API. The question is it acceptable to enter > >>> >> into the risky territory of controlling offloads via devlink parameters > >>> >> or would we rather make vendors take the time and effort to model > >>> >> things to (a subset) of existing APIs. The HW never fits the APIs > >>> >> perfectly. > >>> > > >>> > I understand what you meant here, I would like to highlight that this > >>> > mechanism was not meant to handle SRIOV, Representors, etc. > >>> > The vendor specific configuration suggested here is to handle a > >>> > congestion > >>> > state in Multi Host environment (which includes PF and multiple VFs per > >>> > host), where one host is not aware to the other hosts, and each is > >>> > running > >>> > on its own pci/driver. It is a device working mode configuration. > >>> > > >>> > This couldn't fit into any existing API, thus creating this vendor > >>> > specific > >>> > unique API is needed. > >>> > >>> If we are just going to start creating devlink interfaces in for every > >>> one-off option a device wants to add why did we even bother with > >>> trying to prevent drivers from using sysfs? This just feels like we > >>> are back to the same arguments we had back in the day with it. > >>> > >>> I feel like the bigger question here is if devlink is how we are going > >>> to deal with all PCIe related features going forward, or should we > >>> start looking at creating a new interface/tool for PCI/PCIe related > >>> features? My concern is that we have already had features such as DMA > >>> Coalescing that didn't really fit into anything and now we are > >>> starting to see other things related to DMA and PCIe bus credits. I'm > >>> wondering if we shouldn't start looking at a tool/interface to > >>> configure all the PCIe related features such as interrupts, error > >>> reporting, DMA configuration, power management, etc. Maybe we could > >>> even look at sharing it across subsystems and include things like > >>> storage, graphics, and other subsystems in the conversation. > >> > >>Agreed, for actual PCIe configuration (i.e. not ECN marking) we do need > >>to build up an API. Sharing it across subsystems would be very cool! 
I read the thread (starting at [1], for anybody else coming in late) and I see this has something to do with "configuring outbound PCIe buffers", but I haven't seen the connection to PCIe protocol or features, i.e., I can't connect this to anything in the PCIe spec. Can somebody help me understand how the PCI core is relevant? If there's some connection with a feature defined by PCIe, or if it affects the PCIe transaction protocol somehow, I'm definitely interested in this. But if this only affects the data transferred over PCIe, i.e., the data payloads of PCIe TLP packets, then I'm not sure why the PCI core should care. > > I wonder howcome there isn't such API in place already. Or is it? > > If it is not, do you have any idea how should it look like? Should it be > > an extension of the existing PCI uapi or something completely new? > > It would be probably good to loop some PCI people in... > > The closest thing I can think of in terms of answering your questions > as to why we haven't seen anything like that would be setpci. > Basically with that tool you can go through the PCI configuration > space and update any piece you want. The problem is it can have > effects on the driver and I don't recall there ever being any sort of > notification mechanism added to make a driver aware of configuration > updates. setpci is a development and debugging tool, not something we should use as the standard way of configuring things. Use of setpci should probably taint the kernel because the PCI core configures features like MPS, ASPM, AER, etc., based on the assumption that nobody else is changing things in PCI config space. > As far as the interface I