Re: [PATCH v1 1/1] treewide: Align match_string() with sysfs_match_string()

2024-06-03 Thread Bjorn Helgaas
On Sun, Jun 02, 2024 at 06:57:12PM +0300, Andy Shevchenko wrote:
> Make two APIs look similar. Hence convert match_string() to be
> a 2-argument macro. In order to avoid unneeded churn, convert
> all users as well. There is no functional change intended.

Looks nice, thanks for doing this.
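For context, sysfs_match_string() is a 2-argument macro over a 3-argument
helper, so the aligned match_string() presumably takes the same shape.  A
sketch (the __match_string() name is assumed from that pattern, not quoted
from the patch):

	/* 3-argument worker, as before, just renamed (assumed name). */
	int __match_string(const char * const *array, size_t n, const char *string);

	/*
	 * 2-argument form: derive the element count from the array itself.
	 * ARRAY_SIZE() refuses to compile for plain pointers, so callers
	 * must pass a true fixed-size array -- which the converted users do.
	 */
	#define match_string(_a, _s)	__match_string(_a, ARRAY_SIZE(_a), _s)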

> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index ac6293c24976..2d317c7e1cea 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -210,7 +210,7 @@ void pcie_ecrc_get_policy(char *str)
>  {
>   int i;
>  
> - i = match_string(ecrc_policy_str, ARRAY_SIZE(ecrc_policy_str), str);
> + i = match_string(ecrc_policy_str, str);
>   if (i < 0)
>   return;
>  

Acked-by: Bjorn Helgaas   # drivers/pci/

> +++ b/mm/vmpressure.c
> @@ -388,7 +388,7 @@ int vmpressure_register_event(struct mem_cgroup *memcg,
>  
>   /* Find required level */
> token = strsep(&spec, ",");
> - ret = match_string(vmpressure_str_levels, VMPRESSURE_NUM_LEVELS, token);
> + ret = match_string(vmpressure_str_levels, token);

VMPRESSURE_NUM_LEVELS looks like it's no longer used?

>   if (ret < 0)
>   goto out;
>   level = ret;
> @@ -396,7 +396,7 @@ int vmpressure_register_event(struct mem_cgroup *memcg,
>   /* Find optional mode */
> token = strsep(&spec, ",");
>   if (token) {
> - ret = match_string(vmpressure_str_modes, VMPRESSURE_NUM_MODES, 
> token);
> + ret = match_string(vmpressure_str_modes, token);

Ditto.

>   if (ret < 0)
>   goto out;
>   mode = ret;



Re: [Nouveau] [PATCH v2 0/9] Improvements to pcie_bandwidth_available() for eGPUs

2023-11-03 Thread Bjorn Helgaas
On Fri, Nov 03, 2023 at 02:07:49PM -0500, Mario Limonciello wrote:
> Downstream drivers are getting the wrong values from
> pcie_bandwidth_available() which is causing problems for performance
> of eGPUs.
> 
> This series overhauls Thunderbolt related device detection and uses
> the changes to change the behavior of pcie_bandwidth_available().
> 
> NOTE: This series is currently based on top of v6.6 + this change that
>   will be merged for 6.7:
> Link: https://patchwork.freedesktop.org/patch/564738/

Thanks, Mario, I'll look at this soon after v6.7-rc1 (probably Nov
12), so the amdgpu patch should be in mainline by then.

> v1->v2:
>  * Rename is_thunderbolt
>  * Look for _DSD instead of link
>  * Drop pci_is_thunderbolt_attached() from all drivers
>  * Adjust links
>  * Adjust commit messages
>  * Add quirk for Tiger Lake
> 
> Mario Limonciello (9):
>   drm/nouveau: Switch from pci_is_thunderbolt_attached() to
> dev_is_removable()
>   drm/radeon: Switch from pci_is_thunderbolt_attached() to
> dev_is_removable()
>   PCI: Drop pci_is_thunderbolt_attached()
>   PCI: Move the `PCI_CLASS_SERIAL_USB_USB4` definition to common header
>   PCI: pciehp: Move check for is_thunderbolt into a quirk
>   PCI: Rename is_thunderbolt to is_tunneled
>   PCI: ACPI: Detect PCIe root ports that are used for tunneling
>   PCI: Exclude PCIe ports used for tunneling in
> pcie_bandwidth_available()
>   PCI: Add a quirk to mark 0x8086 : 0x9a23 as supporting PCIe tunneling
> 
>  drivers/gpu/drm/nouveau/nouveau_vga.c  |  6 +-
>  drivers/gpu/drm/radeon/radeon_device.c |  4 +-
>  drivers/gpu/drm/radeon/radeon_kms.c    |  2 +-
>  drivers/pci/hotplug/pciehp_hpc.c   |  6 +-
>  drivers/pci/pci-acpi.c | 16 ++
>  drivers/pci/pci.c  | 76 +-
>  drivers/pci/probe.c|  2 +-
>  drivers/pci/quirks.c   | 31 +++
>  drivers/platform/x86/apple-gmux.c  |  2 +-
>  drivers/thunderbolt/nhi.h  |  2 -
>  include/linux/pci.h| 25 +
>  include/linux/pci_ids.h|  1 +
>  12 files changed, 109 insertions(+), 64 deletions(-)
> 
> -- 
> 2.34.1
> 


Re: [Nouveau] [PATCH 0/5] Add the pci_get_base_class() helper and use it

2023-09-28 Thread Bjorn Helgaas
On Fri, Aug 25, 2023 at 02:27:09PM +0800, Sui Jingfeng wrote:
> From: Sui Jingfeng 
> 
> There is no function that can be used to get all PCI(e) devices in a
> system by matching against the PCI base class code only, while keeping
> the sub-class code and the programming interface ignored. Therefore, add
> the pci_get_base_class() function to suit the need.
> 
> For example, if an application wants to process all PCI(e) display devices
> in a system, it can achieve that goal with code like the following:
> 
> pdev = NULL;
> do {
> pdev = pci_get_base_class(PCI_BASE_CLASS_DISPLAY, pdev);
> if (!pdev)
> break;
> 
> do_something_for_pci_display_device(pdev);
> } while (1);
> 
> Sui Jingfeng (5):
>   PCI: Add the pci_get_base_class() helper
>   ALSA: hda/intel: Use pci_get_base_class() to reduce duplicated code
>   drm/nouveau: Use pci_get_base_class() to reduce duplicated code
>   drm/amdgpu: Use pci_get_base_class() to reduce duplicated code
>   drm/radeon: Use pci_get_base_class() to reduce duplicated code
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c | 11 +++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c | 20 ---
>  drivers/gpu/drm/nouveau/nouveau_acpi.c   | 11 +++--
>  drivers/gpu/drm/radeon/radeon_bios.c | 20 ---
>  drivers/pci/search.c | 31 
>  include/linux/pci.h  |  5 
>  sound/pci/hda/hda_intel.c| 16 
>  7 files changed, 59 insertions(+), 55 deletions(-)

Applied to pci/enumeration for v6.7, thanks.
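For reference, such a helper can be a thin wrapper around the ID-based
search, matching only the base class byte.  A sketch (assuming an
implementation in drivers/pci/search.c next to pci_get_dev_by_id(); not
quoted from the series):

	struct pci_dev *pci_get_base_class(unsigned int class, struct pci_dev *from)
	{
		struct pci_device_id id = {
			.vendor = PCI_ANY_ID,
			.device = PCI_ANY_ID,
			.subvendor = PCI_ANY_ID,
			.subdevice = PCI_ANY_ID,
			.class_mask = 0xff0000,	/* compare the base class byte only */
			.class = class << 16,
		};

		return pci_get_dev_by_id(&id, from);
	}

Each call drops the reference held on "from" and takes one on the returned
device, so the iteration loop shown in the cover letter is reference-safe
as written.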


Re: [Nouveau] [PATCH v7 6/8] PCI/VGA: Introduce is_boot_device function callback to vga_client_register

2023-06-30 Thread Bjorn Helgaas
On Fri, Jun 30, 2023 at 10:14:11AM +0800, suijingfeng wrote:
> On 2023/6/30 01:44, Limonciello, Mario wrote:
> > > On 2023/6/29 23:54, Bjorn Helgaas wrote:
> > > > On Thu, Jun 22, 2023 at 01:08:15PM +0800, Sui Jingfeng wrote:

> > > > 4) Right now we're in the middle of the v6.5 merge window, so new
> > > >  content, e.g., this series, is too late for v6.5.  Most
> > > >  maintainers, including me, wait to merge new content until the
> > > >  merge window closes and a new -rc1 is tagged.  This merge window
> > > >  should close on July 9, and people will start merging content for
> > > >  v6.6, typically based on v6.5-rc1.
> > > 
> > > Would you will merge all of the patches in this series (e.g. including
> > > the patch for drm/amdgpu(7/8) and drm/radeon(8/8)) ?
> > > 
> > > Or just part of them?

The bulk of this series is drivers/pci changes, so typically I would
merge all the patches after getting Acked-by tags from the other
subsystems (DRM and VFIO).

> Is it possible to merge the PCI/VGA part as fast as possible,
> especially the PATCH-0006 PCI/VGA: Introduce is_boot_device function
> callback to vga_client_register

We're in the middle of the v6.5 merge window, so it's too late to add
things to v6.5-rc1.  The most likely path for new material like this
would be to queue it for v6.6, which means I would merge it after
v6.5-rc1 is tagged (that tag will probably happen on July 9).

It would then be in -next until the v6.6 merge window opens (likely in
September), when it would be merged into Linus' tree.

If the series fixes a regression or other major defect, it's
*possible* to merge things earlier, so they appear in v6.5.  But this
series doesn't seem to fall in that category, so I think v6.6 is a
more realistic target.

Merging for v6.6 would include both the PCI parts and the DRM parts at
the same time, so hopefully that addresses your dependency concerns.

I suggest that you wait until v6.5-rc1, rebase your patches so they
apply cleanly on that tag, collect all the Reviewed-by and Acked-by
tags, include them in your commit logs, and then repost them.

Bjorn


Re: [Nouveau] [PATCH v7 6/8] PCI/VGA: Introduce is_boot_device function callback to vga_client_register

2023-06-29 Thread Bjorn Helgaas
On Thu, Jun 22, 2023 at 01:08:15PM +0800, Sui Jingfeng wrote:
> Hi,
> 
> 
> A nouveau developer (Lyude) from Red Hat sent me an R-B;
> thanks to the developers of the nouveau project.
> 
> Please allow me to add a link [1] here.
> 
> 
> [1] 
> https://lore.kernel.org/all/0afadc69f99a36bc9d03ecf54ff25859dbc10e28.ca...@redhat.com/

1) Thanks for this.  If you post another version of this series,
   please pick up Lyude's Reviewed-by and include it in the relevant
   patches (as long as you haven't made significant changes to the
   code Lyude reviewed).  Whoever applies this should automatically
   pick up Reviewed-by/Ack/etc that are replies to the version being
   applied, but they won't go through previous revisions to find them.

2) Please mention the commit to which the series applies.  I tried to
   apply this on v6.4-rc1, but it doesn't apply cleanly.

3) Thanks for including cover letters in your postings.  Please
   include a little changelog in the cover letter so we know what
   changed between v6 and v7, etc.

4) Right now we're in the middle of the v6.5 merge window, so new
   content, e.g., this series, is too late for v6.5.  Most
   maintainers, including me, wait to merge new content until the
   merge window closes and a new -rc1 is tagged.  This merge window
   should close on July 9, and people will start merging content for
   v6.6, typically based on v6.5-rc1.

Bjorn


Re: [Nouveau] [Intel-gfx] [PATCH v3 4/4] PCI/VGA: introduce is_boot_device function callback to vga_client_register

2023-06-09 Thread Bjorn Helgaas
On Fri, Jun 09, 2023 at 10:27:39AM +0800, Sui Jingfeng wrote:
> On 2023/6/9 03:19, Bjorn Helgaas wrote:
> > On Thu, Jun 08, 2023 at 07:43:22PM +0800, Sui Jingfeng wrote:
> > > From: Sui Jingfeng 
> > > 
> > > The vga_is_firmware_default() function is arch-dependent, which doesn't
> > > sound right. At least, it also works on the Mips and LoongArch platforms.
> > > Tested with the drm/amdgpu and drm/radeon drivers. However, it's difficult
> > > to enumerate all arch-driver combinations. I'm wrong if there is only one
> > > exception.
> > > 
> > > With the observation that device drivers typically have better knowledge
> > > about which PCI bar contains the firmware framebuffer, which could avoid
> > > the need to iterate all of the PCI BARs.
> > > 
> > > But as a PCI function at pci/vgaarb.c, vga_is_firmware_default() is
> > > probably not suitable to make such an optimization for a specific device.
> > > 
> > > There are PCI display controllers that don't have a dedicated VRAM bar,
> > > this function will lose its effectiveness in such a case. Luckily, the
> > > device driver can provide an accurate workaround.
> > > 
> > > Therefore, this patch introduces a callback that allows the device driver
> > > to tell the VGAARB if the device is the default boot device. This patch
> > > only intends to introduce the mechanism, while the implementation is left
> > > to the device driver authors. Also honor the comment: "Clients have two
> > > callback mechanisms they can use"
> > s/bar/BAR/ (several)
> > 
> > Nothing here uses the callback.  I don't want to merge this until we
> > have a user.
> 
> This is a chicken-and-egg question.
> 
> If you could help get this merged first, I will show you the first user.
> 
> > I'm not sure why the device driver should know whether its device is
> > the default boot device.
> 
> It's not that the device driver should know;
> 
> it's that the device driver has the right to override.
> 
> The device driver may have a better approach to identifying the default
> boot device.

The way we usually handle this is to include the new callback in the
same series as the first user of it.  That has two benefits:
(1) everybody can review the whole picture and possibly suggest
different approaches, and (2) when we merge the infrastructure,
we also merge a user of it at the same time, so the whole thing can be
tested and we don't end up with unused code.

Bjorn


Re: [Nouveau] [Intel-gfx] [PATCH v3 4/4] PCI/VGA: introduce is_boot_device function callback to vga_client_register

2023-06-08 Thread Bjorn Helgaas
On Thu, Jun 08, 2023 at 07:43:22PM +0800, Sui Jingfeng wrote:
> From: Sui Jingfeng 
> 
> The vga_is_firmware_default() function is arch-dependent, which doesn't
> sound right. At least, it also works on the Mips and LoongArch platforms.
> Tested with the drm/amdgpu and drm/radeon drivers. However, it's difficult
> to enumerate all arch-driver combinations. I'm wrong if there is only one
> exception.
> 
> With the observation that device drivers typically have better knowledge
> about which PCI bar contains the firmware framebuffer, which could avoid
> the need to iterate all of the PCI BARs.
> 
> But as a PCI function at pci/vgaarb.c, vga_is_firmware_default() is
> probably not suitable to make such an optimization for a specific device.
> 
> There are PCI display controllers that don't have a dedicated VRAM bar,
> this function will lose its effectiveness in such a case. Luckily, the
> device driver can provide an accurate workaround.
> 
> Therefore, this patch introduces a callback that allows the device driver
> to tell the VGAARB if the device is the default boot device. This patch
> only intends to introduce the mechanism, while the implementation is left
> to the device driver authors. Also honor the comment: "Clients have two
> callback mechanisms they can use"

s/bar/BAR/ (several)

Nothing here uses the callback.  I don't want to merge this until we
have a user.

I'm not sure why the device driver should know whether its device is
the default boot device.

> Signed-off-by: Sui Jingfeng 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>  drivers/gpu/drm/i915/display/intel_vga.c   |  3 +--
>  drivers/gpu/drm/nouveau/nouveau_vga.c  |  2 +-
>  drivers/gpu/drm/radeon/radeon_device.c |  2 +-
>  drivers/pci/vgaarb.c   | 22 ++
>  drivers/vfio/pci/vfio_pci_core.c   |  2 +-
>  include/linux/vgaarb.h |  8 +---
>  7 files changed, 28 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 5c7d40873ee2..7a096f2d5c16 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3960,7 +3960,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>   /* this will fail for cards that aren't VGA class devices, just
>* ignore it */
>   if ((adev->pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
> - vga_client_register(adev->pdev, amdgpu_device_vga_set_decode);
> + vga_client_register(adev->pdev, amdgpu_device_vga_set_decode, 
> NULL);
>  
>   px = amdgpu_device_supports_px(ddev);
>  
> diff --git a/drivers/gpu/drm/i915/display/intel_vga.c 
> b/drivers/gpu/drm/i915/display/intel_vga.c
> index 286a0bdd28c6..98d7d4dffe9f 100644
> --- a/drivers/gpu/drm/i915/display/intel_vga.c
> +++ b/drivers/gpu/drm/i915/display/intel_vga.c
> @@ -115,7 +115,6 @@ intel_vga_set_decode(struct pci_dev *pdev, bool 
> enable_decode)
>  
>  int intel_vga_register(struct drm_i915_private *i915)
>  {
> -
>   struct pci_dev *pdev = to_pci_dev(i915->drm.dev);
>   int ret;
>  
> @@ -127,7 +126,7 @@ int intel_vga_register(struct drm_i915_private *i915)
>* then we do not take part in VGA arbitration and the
>* vga_client_register() fails with -ENODEV.
>*/
> - ret = vga_client_register(pdev, intel_vga_set_decode);
> + ret = vga_client_register(pdev, intel_vga_set_decode, NULL);
>   if (ret && ret != -ENODEV)
>   return ret;
>  
> diff --git a/drivers/gpu/drm/nouveau/nouveau_vga.c 
> b/drivers/gpu/drm/nouveau/nouveau_vga.c
> index f8bf0ec26844..162b4f4676c7 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_vga.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_vga.c
> @@ -92,7 +92,7 @@ nouveau_vga_init(struct nouveau_drm *drm)
>   return;
>   pdev = to_pci_dev(dev->dev);
>  
> - vga_client_register(pdev, nouveau_vga_set_decode);
> + vga_client_register(pdev, nouveau_vga_set_decode, NULL);
>  
>   /* don't register Thunderbolt eGPU with vga_switcheroo */
>   if (pci_is_thunderbolt_attached(pdev))
> diff --git a/drivers/gpu/drm/radeon/radeon_device.c 
> b/drivers/gpu/drm/radeon/radeon_device.c
> index afbb3a80c0c6..71f2ff39d6a1 100644
> --- a/drivers/gpu/drm/radeon/radeon_device.c
> +++ b/drivers/gpu/drm/radeon/radeon_device.c
> @@ -1425,7 +1425,7 @@ int radeon_device_init(struct radeon_device *rdev,
>   /* if we have > 1 VGA cards, then disable the radeon VGA resources */
>   /* this will fail for cards that aren't VGA class devices, just
>* ignore it */
> - vga_client_register(rdev->pdev, radeon_vga_set_decode);
> + vga_client_register(rdev->pdev, radeon_vga_set_decode, NULL);
>  
>   if (rdev->flags & RADEON_IS_PX)
>   runtime = true;
> diff --git a/drivers/pci/vgaarb.c b/drivers/pci/vgaarb.c
> index b0bf4952a95d..d3dab61e0ef2 100644
> --- a/drivers/pci/vgaarb.c
> 

Re: [Nouveau] [Intel-gfx] [PATCH v3 3/4] PCI/VGA: only deal with VGA class devices

2023-06-08 Thread Bjorn Helgaas
Start with verb and capitalize to match ("Deal only with ...")

On Thu, Jun 08, 2023 at 07:43:21PM +0800, Sui Jingfeng wrote:
> From: Sui Jingfeng 
> 
> vgaarb only deal with the VGA devcie(pdev->class == 0x0300), so replace the
> pci_get_subsys() function with pci_get_class(). Filter the non pci display
> device out. There no need to process the non display PCI device.

s/non pci/non-PCI/
s/non display/non-display/

This is fine, and I'll merge this, but someday I would like to get rid
of the bus_register_notifier() and call vga_arbiter_add_pci_device()
and vga_arbiter_del_pci_device() directly from the PCI core.

Or if you wanted to do that now, that would be even better :)

Bjorn
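For illustration, the direct-call idea could look roughly like this; the
hook locations are assumptions for the sketch, not something decided in
this thread:

	/*
	 * Called by the PCI core at device addition/removal instead of
	 * vgaarb registering its own bus notifier.
	 */
	static void pci_notify_vga_arbiter(struct pci_dev *pdev, bool added)
	{
		if ((pdev->class >> 8) != PCI_CLASS_DISPLAY_VGA)
			return;

		if (added)
			vga_arbiter_add_pci_device(pdev);	/* e.g. from pci_bus_add_device() */
		else
			vga_arbiter_del_pci_device(pdev);	/* e.g. from pci_stop_dev() */
	}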

> Signed-off-by: Sui Jingfeng 
> ---
>  drivers/pci/vgaarb.c | 22 --
>  1 file changed, 12 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/pci/vgaarb.c b/drivers/pci/vgaarb.c
> index 7f0347f4f6fd..b0bf4952a95d 100644
> --- a/drivers/pci/vgaarb.c
> +++ b/drivers/pci/vgaarb.c
> @@ -756,10 +756,6 @@ static bool vga_arbiter_add_pci_device(struct pci_dev 
> *pdev)
>   struct pci_dev *bridge;
>   u16 cmd;
>  
> - /* Only deal with VGA class devices */
> - if ((pdev->class >> 8) != PCI_CLASS_DISPLAY_VGA)
> - return false;
> -
>   /* Allocate structure */
>   vgadev = kzalloc(sizeof(struct vga_device), GFP_KERNEL);
>   if (vgadev == NULL) {
> @@ -1499,7 +1495,9 @@ static int pci_notify(struct notifier_block *nb, 
> unsigned long action,
>   struct pci_dev *pdev = to_pci_dev(dev);
>   bool notify = false;
>  
> - vgaarb_dbg(dev, "%s\n", __func__);
> + /* Only deal with VGA class devices */
> + if ((pdev->class >> 8) != PCI_CLASS_DISPLAY_VGA)
> + return 0;
>  
>   /* For now we're only intereted in devices added and removed. I didn't
>* test this thing here, so someone needs to double check for the
> @@ -1509,6 +1507,8 @@ static int pci_notify(struct notifier_block *nb, 
> unsigned long action,
>   else if (action == BUS_NOTIFY_DEL_DEVICE)
>   notify = vga_arbiter_del_pci_device(pdev);
>  
> + vgaarb_dbg(dev, "%s: action = %lu\n", __func__, action);
> +
>   if (notify)
>   vga_arbiter_notify_clients();
>   return 0;
> @@ -1533,8 +1533,8 @@ static struct miscdevice vga_arb_device = {
>  
>  static int __init vga_arb_device_init(void)
>  {
> + struct pci_dev *pdev = NULL;
>   int rc;
> - struct pci_dev *pdev;
>  
> rc = misc_register(&vga_arb_device);
>   if (rc < 0)
> @@ -1545,11 +1545,13 @@ static int __init vga_arb_device_init(void)
>   /* We add all PCI devices satisfying VGA class in the arbiter by
>* default
>*/
> - pdev = NULL;
> - while ((pdev =
> - pci_get_subsys(PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID,
> -PCI_ANY_ID, pdev)) != NULL)
> + while (1) {
> + pdev = pci_get_class(PCI_CLASS_DISPLAY_VGA << 8, pdev);
> + if (!pdev)
> + break;
> +
>   vga_arbiter_add_pci_device(pdev);
> + };
>  
>   pr_info("loaded\n");
>   return rc;
> -- 
> 2.25.1
> 


Re: [Nouveau] [Intel-gfx] [PATCH v3 1/4] PCI/VGA: tidy up the code and comment format

2023-06-08 Thread Bjorn Helgaas
Capitalize subject to match ("Tidy ...")

On Thu, Jun 08, 2023 at 07:43:19PM +0800, Sui Jingfeng wrote:
> From: Sui Jingfeng 
> 
> This patch replaces the leading space with a tab and removes the double
> blank line, no functional change.

Can you move this to the end of the series?  The functional changes
are more likely to be backported, and I think the backport may be a
little easier without the cleanup in the middle.

>   /* we could in theory hand out locks on IO and mem
> -  * separately to userspace but it can cause deadlocks */
> +  * separately to userspace but it can cause deadlocks
> +  */

Since you're touching this anyway, can you update it to the
conventional multi-line comment style:

  /*
   * We could in theory ...
   */

And capitalize "We", add a period at end, and rewrap to fill 78
columns or so?  Same for other comments below.

> +++ b/include/linux/vgaarb.h
> @@ -23,9 +23,7 @@
>   * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>   * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
>   * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
> - * DEALINGS
> - * IN THE SOFTWARE.
> - *
> + * DEALINGS IN THE SOFTWARE.
>   */

Can you make a separate patch to replace this entire copyright notice
with the appropriate SPDX-License-Identifier header?
Documentation/process/license-rules.rst has details.
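For example, the end state for a C header like vgaarb.h would be a single
first line (assuming the notice really is plain MIT text; headers use the
/* */ comment form per license-rules.rst):

	/* SPDX-License-Identifier: MIT */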

Bjorn


Re: [Nouveau] [Intel-gfx] [PATCH v2 1/2] vgaarb: various coding style and comments fix

2023-06-06 Thread Bjorn Helgaas
Match the subject line style:

  $ git log --oneline drivers/pci/vgaarb.c
  f321c35feaee PCI/VGA: Replace full MIT license text with SPDX identifier
  d5109fe4d1ec PCI/VGA: Use unsigned format string to print lock counts
  4e6c91847a7f PCI/VGA: Log bridge control messages when adding devices
  dc593fd48abb PCI/VGA: Remove empty vga_arb_device_card_gone()
  ...

Subject line should be a summary of the commit log, not just "various
style fixes".  This one needs to say something about
vga_str_to_iostate().

On Mon, Jun 05, 2023 at 04:58:30AM +0800, Sui Jingfeng wrote:
> From: Sui Jingfeng 
> 
> To keep consistent with vga_iostate_to_str() function, the third argument
> of vga_str_to_iostate() function should be 'unsigned int *'.
> 
> Signed-off-by: Sui Jingfeng 
> ---
>  drivers/pci/vgaarb.c   | 29 +++--
>  include/linux/vgaarb.h |  8 +++-
>  2 files changed, 18 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/pci/vgaarb.c b/drivers/pci/vgaarb.c
> index 5a696078b382..e40e6e5e5f03 100644
> --- a/drivers/pci/vgaarb.c
> +++ b/drivers/pci/vgaarb.c
> @@ -61,7 +61,6 @@ static bool vga_arbiter_used;
>  static DEFINE_SPINLOCK(vga_lock);
>  static DECLARE_WAIT_QUEUE_HEAD(vga_wait_queue);
>  
> -
>  static const char *vga_iostate_to_str(unsigned int iostate)
>  {
>   /* Ignore VGA_RSRC_IO and VGA_RSRC_MEM */
> @@ -77,10 +76,12 @@ static const char *vga_iostate_to_str(unsigned int 
> iostate)
>   return "none";
>  }
>  
> -static int vga_str_to_iostate(char *buf, int str_size, int *io_state)
> +static int vga_str_to_iostate(char *buf, int str_size, unsigned int 
> *io_state)
>  {
> - /* we could in theory hand out locks on IO and mem
> -  * separately to userspace but it can cause deadlocks */
> + /*
> +  * we could in theory hand out locks on IO and mem
> +  * separately to userspace but it can cause deadlocks
> +  */

Omit all the comment formatting changes.  They are distractions from the
vga_str_to_iostate() parameter change.

I think this patch should be the single line change to the
vga_str_to_iostate() prototype so it matches the callers.

If you want to do the other comment formatting changes, they're fine,
but they should be all together in a separate patch that clearly
doesn't change the generated code.

Bjorn


Re: [Nouveau] [PATCH] PCI: stop spamming info in quirk_nvidia_hda

2023-03-16 Thread Bjorn Helgaas
On Thu, Mar 16, 2023 at 03:31:22PM +0100, Karol Herbst wrote:
> Users kept complaining about those messages and it's a little spammy on
> prime systems so turn it into a debug print.

What is a "prime system"?

I'm a little surprised that users would really care about the message.
But I do see comments like these:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1836308/comments/15
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2002206
that suggest the message happens frequently, maybe if we're resuming
the controller after runtime suspend?

Maybe this should be a pci_info_once() sort of thing?  I think there's
some value in knowing that we're changing the BIOS configuration
outside the purview of a driver, since I assume BIOS had some reason
for hiding the HDA controller.
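A sketch of that alternative inside quirk_nvidia_hda(): dev_info_once()
(used here since I'm not sure a pci_info_once() wrapper exists) prints on
the first enable only, so repeated runtime resumes stay quiet while the
one-time BIOS override remains visible:

	pci_read_config_dword(gpu, 0x488, &val);
	if (val & BIT(25))
		return;

	/* Log once per boot rather than on every runtime resume. */
	dev_info_once(&gpu->dev, "Enabling HDA controller\n");
	pci_write_config_dword(gpu, 0x488, val | BIT(25));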

> Cc: Bjorn Helgaas 
> Cc: Lukas Wunner 
> Cc: linux-...@vger.kernel.org
> Cc: nouveau@lists.freedesktop.org
> Fixes: b516ea586d71 ("PCI: Enable NVIDIA HDA controllers")
> Signed-off-by: Karol Herbst 
> ---
>  drivers/pci/quirks.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 44cab813bf951..b10c77bbe4716 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -5549,7 +5549,7 @@ static void quirk_nvidia_hda(struct pci_dev *gpu)
>   if (val & BIT(25))
>   return;
>  
> - pci_info(gpu, "Enabling HDA controller\n");
> + pci_dbg(gpu, "Enabling HDA controller\n");
>   pci_write_config_dword(gpu, 0x488, val | BIT(25));
>  
>   /* The GPU becomes a multi-function device when the HDA is enabled */
> -- 
> 2.39.2
> 


Re: [Nouveau] [PATCH v5 3/7] PCI: Drop the `is_thunderbolt` attribute from PCI core

2022-02-28 Thread Bjorn Helgaas
On Mon, Feb 28, 2022 at 03:33:13PM +, Limonciello, Mario wrote:
> [AMD Official Use Only]
> 
> > 
> > On Fri, Feb 25, 2022 at 11:42:24AM -0600, Bjorn Helgaas wrote:
> > > That would just leave the "PCI_VSEC_ID_INTEL_TBT implies external-
> > facing"
> > > assumption above.  Not having a Thunderbolt spec, I have no idea how
> > > you deal with that.
> > 
> > You can download the spec here:
> > 
> > https://www.usb.org/sites/default/files/USB4%20Specification%202026.zip
> > 
> > Inside the archive there is also the DVSEC spec with name "USB4 DVSEC
> > Version 1.0.pdf".
> 
> The spec has Host_Router_indication (bits 18-19) as meaning external facing.
> I'll respin the patch 3 for using that.

Thanks, please include the spec citation when you do.  And probably
the URL, because it's not at all obvious how the casual reader would
get from "is_thunderbolt" to a recent add-on to the USB4 spec.

Bjorn


Re: [Nouveau] [PATCH v5 3/7] PCI: Drop the `is_thunderbolt` attribute from PCI core

2022-02-25 Thread Bjorn Helgaas
On Thu, Feb 24, 2022 at 03:51:12PM -0600, Mario Limonciello wrote:
> The `is_thunderbolt` attribute originally had a well defined list of
> quirks that it existed for, but it has been overloaded with more
> meaning.
> 
> Instead use the driver core removable attribute to indicate the
> detail a device is attached to a thunderbolt or USB4 chain.
> 
> Signed-off-by: Mario Limonciello 
> ---
>  drivers/pci/probe.c   | 2 +-
>  drivers/platform/x86/apple-gmux.c | 2 +-
>  include/linux/pci.h   | 5 ++---
>  3 files changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 17a969942d37..1b752d425c47 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1584,7 +1584,7 @@ static void set_pcie_thunderbolt(struct pci_dev *dev)
>   /* Is the device part of a Thunderbolt controller? */
>   vsec = pci_find_vsec_capability(dev, PCI_VENDOR_ID_INTEL, 
> PCI_VSEC_ID_INTEL_TBT);
>   if (vsec)
> - dev->is_thunderbolt = 1;
> + dev->external_facing = true;
>  }
>  
>  static void set_pcie_untrusted(struct pci_dev *dev)
> diff --git a/drivers/platform/x86/apple-gmux.c 
> b/drivers/platform/x86/apple-gmux.c
> index 57553f9b4d1d..da0c39b0 100644
> --- a/drivers/platform/x86/apple-gmux.c
> +++ b/drivers/platform/x86/apple-gmux.c
> @@ -596,7 +596,7 @@ static int gmux_resume(struct device *dev)
>  
>  static int is_thunderbolt(struct device *dev, void *data)
>  {
> - return to_pci_dev(dev)->is_thunderbolt;
> + return to_pci_dev(dev)->external_facing;
>  }
>  
>  static int gmux_probe(struct pnp_dev *pnp, const struct pnp_device_id *id)
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 1e5b769e42fc..d9719eb14654 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -442,7 +442,6 @@ struct pci_dev {
>   unsigned int	is_virtfn:1;
>   unsigned int	is_hotplug_bridge:1;
>   unsigned int	shpc_managed:1;		/* SHPC owned by shpchp */
> - unsigned int	is_thunderbolt:1;	/* Thunderbolt controller */
>   unsigned int	no_cmd_complete:1;	/* Lies about command completed 
> events */
>  
>   /*
> @@ -2447,11 +2446,11 @@ static inline bool pci_is_thunderbolt_attached(struct 
> pci_dev *pdev)
>  {
>   struct pci_dev *parent = pdev;
>  
> - if (pdev->is_thunderbolt)
> + if (dev_is_removable(&pdev->dev))
>   return true;
>  
>   while ((parent = pci_upstream_bridge(parent)))
> - if (parent->is_thunderbolt)
> + if (dev_is_removable(&parent->dev))
>   return true;
>  
>   return false;

Since you remove this function entirely later, it seems like you might
as well push this to the end of the series, so you won't have to
change it before removing it.

That would just leave the "PCI_VSEC_ID_INTEL_TBT implies external-facing"
assumption above.  Not having a Thunderbolt spec, I have no idea how
you deal with that.

But it is definitely not the case that "dev_is_removable() implies
device is Thunderbolt", so I don't think this last hunk can work.

Bjorn


Re: [Nouveau] [PATCH v5 3/7] PCI: Drop the `is_thunderbolt` attribute from PCI core

2022-02-24 Thread Bjorn Helgaas
On Thu, Feb 24, 2022 at 07:23:49PM -0600, Bjorn Helgaas wrote:
> On Thu, Feb 24, 2022 at 03:51:12PM -0600, Mario Limonciello wrote:
> > The `is_thunderbolt` attribute originally had a well defined list of
> > quirks that it existed for, but it has been overloaded with more
> > meaning.
> > 
> > Instead use the driver core removable attribute to indicate the
> > detail a device is attached to a thunderbolt or USB4 chain.
> ...

> If these things are not specifically related to Thunderbolt, I'd
> prefer to get rid of pci_is_thunderbolt_attached() and see if we can
> help the GPU folks figure out what they really need.

Ah.  Guess I should read the whole series before commenting :)  I see
that you *did* remove pci_is_thunderbolt_attached() in the last patch.
I'll look more at the rest tomorrow.

Bjorn


Re: [Nouveau] [PATCH v5 3/7] PCI: Drop the `is_thunderbolt` attribute from PCI core

2022-02-24 Thread Bjorn Helgaas
On Thu, Feb 24, 2022 at 03:51:12PM -0600, Mario Limonciello wrote:
> The `is_thunderbolt` attribute originally had a well defined list of
> quirks that it existed for, but it has been overloaded with more
> meaning.
> 
> Instead use the driver core removable attribute to indicate the
> detail a device is attached to a thunderbolt or USB4 chain.
> 
> Signed-off-by: Mario Limonciello 
> ---
>  drivers/pci/probe.c   | 2 +-
>  drivers/platform/x86/apple-gmux.c | 2 +-
>  include/linux/pci.h   | 5 ++---
>  3 files changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 17a969942d37..1b752d425c47 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1584,7 +1584,7 @@ static void set_pcie_thunderbolt(struct pci_dev *dev)
>   /* Is the device part of a Thunderbolt controller? */
>   vsec = pci_find_vsec_capability(dev, PCI_VENDOR_ID_INTEL, 
> PCI_VSEC_ID_INTEL_TBT);
>   if (vsec)
> - dev->is_thunderbolt = 1;
> + dev->external_facing = true;

I assume there's a spec for the PCI_VSEC_ID_INTEL_TBT Capability.  Is
that public?  Does the spec say that a device with that capability
must be external-facing?

Even if it's not public, I think a citation (name, revision, section)
would be useful.

>  }
>  
>  static void set_pcie_untrusted(struct pci_dev *dev)
> diff --git a/drivers/platform/x86/apple-gmux.c 
> b/drivers/platform/x86/apple-gmux.c
> index 57553f9b4d1d..da0c39b0 100644
> --- a/drivers/platform/x86/apple-gmux.c
> +++ b/drivers/platform/x86/apple-gmux.c
> @@ -596,7 +596,7 @@ static int gmux_resume(struct device *dev)
>  
>  static int is_thunderbolt(struct device *dev, void *data)
>  {
> - return to_pci_dev(dev)->is_thunderbolt;
> + return to_pci_dev(dev)->external_facing;

This looks ... sort of weird.  I don't know anything about
apple-gmux.c, so I guess I don't care, but assuming any
external-facing device must be a Thunderbolt device seems like a
stretch.

Ugh.  This is used via "bus_for_each_dev(_bus_type)", which means
it's not hotplug-safe.  I'm sure we "know" implicitly that hotplug
isn't an issue in apple-gmux, but it's better not to have examples
that get copied to places where it *is* an issue.
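A sketch of a reference-safe walk, for comparison (the function name is
made up for illustration): pci_get_device() holds a reference on each
device it returns and drops the previous one, so a device cannot go away
mid-iteration:

	static bool gmux_any_external_facing(void)
	{
		struct pci_dev *pdev = NULL;

		while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev))) {
			if (pdev->external_facing) {
				pci_dev_put(pdev);	/* drop the reference before returning */
				return true;
			}
		}

		return false;
	}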

>  }
>  
>  static int gmux_probe(struct pnp_dev *pnp, const struct pnp_device_id *id)
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 1e5b769e42fc..d9719eb14654 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -442,7 +442,6 @@ struct pci_dev {
>   unsigned int	is_virtfn:1;
>   unsigned int	is_hotplug_bridge:1;
>   unsigned int	shpc_managed:1;		/* SHPC owned by shpchp */
> - unsigned int	is_thunderbolt:1;	/* Thunderbolt controller */
>   unsigned int	no_cmd_complete:1;	/* Lies about command completed 
> events */
>  
>   /*
> @@ -2447,11 +2446,11 @@ static inline bool pci_is_thunderbolt_attached(struct 
> pci_dev *pdev)
>  {
>   struct pci_dev *parent = pdev;
>  
> - if (pdev->is_thunderbolt)
> + if (dev_is_removable(&pdev->dev))
>   return true;
>  
>   while ((parent = pci_upstream_bridge(parent)))
> - if (parent->is_thunderbolt)
> + if (dev_is_removable(&parent->dev))
>   return true;

I don't get this.  Plain old PCI devices can be removable, too.

pci_is_thunderbolt_attached() is only used by GPU drivers.  What
property of Thunderbolt do they care about?

nouveau_vga_init() and radeon_device_init() use it to decide to
register with vga_switcheroo.  So maybe that's something to do with
removability?  Of course, that's not specific to Thunderbolt, because
garden-variety PCIe devices are removable.

amdgpu_driver_load_kms() and radeon_driver_load_kms() apparently use
it for something related to power control.  I don't know what the
Thunderbolt connection is.

nbio_v2_3_enable_aspm() looks like it uses it to change some ASPM
parameters.  Seems like potentially a device erratum or quirk
material?

If these things are not specifically related to Thunderbolt, I'd
prefer to get rid of pci_is_thunderbolt_attached() and see if we can
help the GPU folks figure out what they really need.

>   return false;
> -- 
> 2.34.1
> 


Re: [Nouveau] [PATCH v3 05/12] PCI: Detect root port of internal USB4 devices by `usb4-host-interface`

2022-02-17 Thread Bjorn Helgaas
On Mon, Feb 14, 2022 at 12:56:58PM +0200, Mika Westerberg wrote:
> On Mon, Feb 14, 2022 at 09:52:02AM +0100, Lukas Wunner wrote:
> > On Mon, Feb 14, 2022 at 09:34:26AM +0200, Mika Westerberg wrote:
> > > On Fri, Feb 11, 2022 at 03:45:46PM -0600, Bjorn Helgaas wrote:
> > > > My expectation is that "USB" (like "PCI" and "PCIe") tells me
> > > > something about how a device is electrically connected and how
> > > > software can operate it.  It doesn't really tell me anything about
> > > > whether those electrical connections are permanent, made through an
> > > > internal slot, or made through an external connector and cable.
> > > 
> > > It is used to identify "tunneled" ports (whether PCIe, USB 3.x or
> > > DisplayPort). Tunnels are created by software (in Linux it is the
> > > Thunderbolt driver) and are dynamic in nature. The USB4 links go over
> > > USB Type-C cable which also is something user can plug/unplug freely.
> > > 
> > > I would say it is reasonable expectation that anything behind these
> > > ports can be assumed as "removable".
> > 
> > USB gadgets may be soldered to the mainboard.  Those cannot be
> > unplugged freely.  It is common practice to solder USB Ethernet
> > or USB FTDI serial ports and nothing's preventing a vendor to solder
> > USB4/Thunderbolt gadgets.
> 
> Right, that's why I say it is "reasonable expectation" that anything
> behind these ports can be assumed "removable" :) Of course they don't
> have to be but if we assume that in the driver where this actually
> matters we should be on the safe side, no?

Spec citations help maintain things over the long term.  Reasonable
expectations that are part of today's folklore but are not written
down and shared leads to things that work today but not tomorrow.

Bjorn


Re: [Nouveau] [PATCH v3 07/12] PCI: Set ports for discrete USB4 controllers appropriately

2022-02-11 Thread Bjorn Helgaas
Make the subject specific, not just "appropriately."  I think you're
marking something *removable*, so include that.

On Fri, Feb 11, 2022 at 01:32:45PM -0600, Mario Limonciello wrote:
> Discrete USB4 controllers won't have ACPI nodes specifying which
> PCIe or XHCI port they are linked with.
> 
> In order to set the removable attribute appropriately, use the
> USB4 DVSEC extended capabability set on these root ports to determine
> if they are located on a discrete USB4 controller.

s/capabability/capability/

> Suggested-by: Mika Westerberg 
> Link: https://usb.org/sites/default/files/USB4%20Specification%202026.zip
> Signed-off-by: Mario Limonciello 
> ---
>  drivers/pci/probe.c | 33 +
>  include/linux/pci_ids.h |  2 ++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 67ca33188cba..1ed3e24db11e 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -25,6 +25,8 @@
>  #define CARDBUS_LATENCY_TIMER	176	/* secondary latency timer */
>  #define CARDBUS_RESERVE_BUSNR	3
>  
> +#define PCI_DVSEC_ID_USB4	0x23
> +
>  static struct resource busn_resource = {
>   .name   = "PCI busn",
>   .start  = 0,
> @@ -1590,6 +1592,36 @@ static void set_pcie_untrusted(struct pci_dev *dev)
>   dev->untrusted = true;
>  }
>  
> +static bool pci_is_discrete_usb4(struct pci_dev *dev)
> +{
> + int dvsec_val = 0, pos;
> + u32 hdr;
> +
> + /* USB4 spec says vendors can use either */
> + pos = pci_find_dvsec_capability(dev,
> + PCI_VENDOR_ID_INTEL,
> + PCI_DVSEC_ID_USB4);
> + if (pos) {
> + dvsec_val = 0x06;
> + } else {
> + pos = pci_find_dvsec_capability(dev,
> + PCI_VENDOR_ID_USB_IF,
> + PCI_DVSEC_ID_USB4);
> + if (pos)
> + dvsec_val = 0x01;
> + }
> + if (!dvsec_val)
> + return false;
> +
> + pci_read_config_dword(dev, pos + PCI_DVSEC_HEADER2, &hdr);
> + if ((hdr & GENMASK(15, 0)) != dvsec_val)
> + return false;
> + /* this port is used for either NHI/PCIe tunnel/USB tunnel */

Capitalize comment like others in this file.

Spec reference would be helpful, too.  I don't know how to verify
any of these values.  The Link: above is great, but name, revision,
section number would be even better.

> + if (hdr & GENMASK(18, 16))
> + return true;
> + return false;
> +}
> +
>  static void pci_set_removable(struct pci_dev *dev)
>  {
>   struct pci_dev *parent = pci_upstream_bridge(dev);
> @@ -1612,6 +1644,7 @@ static void pci_set_removable(struct pci_dev *dev)
>   if (vsec ||
>   dev->class == PCI_CLASS_SERIAL_USB_USB4 ||
>   pci_acpi_is_usb4(dev) ||
> + pci_is_discrete_usb4(dev) ||
>   (parent &&
> (parent->external_facing || dev_is_removable(&parent->dev))))
> dev_set_removable(&dev->dev, DEVICE_REMOVABLE);
> diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
> index 61b161d914f0..271326e058b9 100644
> --- a/include/linux/pci_ids.h
> +++ b/include/linux/pci_ids.h
> @@ -3097,4 +3097,6 @@
>  
>  #define PCI_VENDOR_ID_NCUBE  0x10ff
>  
> +#define PCI_VENDOR_ID_USB_IF 0x1EC0

This file is supposed to be sorted by Vendor ID.  PCI_VENDOR_ID_XEN,
PCI_VENDOR_ID_OCZ, and PCI_VENDOR_ID_NCUBE screwed up, but you can put
USB_IF in the correct spot.

It's not 100%, but most entries use lower-case hex.


Re: [Nouveau] [PATCH v3 06/12] PCI: Explicitly mark USB4 NHI devices as removable

2022-02-11 Thread Bjorn Helgaas
On Fri, Feb 11, 2022 at 01:32:44PM -0600, Mario Limonciello wrote:
> USB4 class devices are also removable like Intel Thunderbolt devices.

Spec reference, please.

> Drivers of downstream devices use this information to declare functional
> differences in how the drivers perform by knowing that they are connected
> to an upstream TBT/USB4 port.
> 
> Reviewed-by: Macpaul Lin 
> Signed-off-by: Mario Limonciello 
> ---
>  drivers/pci/probe.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 2693211d31cf..67ca33188cba 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1610,6 +1610,7 @@ static void pci_set_removable(struct pci_dev *dev)
>* exposed as "removable" to userspace.
>*/
>   if (vsec ||
> + dev->class == PCI_CLASS_SERIAL_USB_USB4 ||
>   pci_acpi_is_usb4(dev) ||
>   (parent &&
> (parent->external_facing || dev_is_removable(&parent->dev))))
> -- 
> 2.34.1
> 


Re: [Nouveau] [PATCH v3 05/12] PCI: Detect root port of internal USB4 devices by `usb4-host-interface`

2022-02-11 Thread Bjorn Helgaas
On Fri, Feb 11, 2022 at 01:32:43PM -0600, Mario Limonciello wrote:
> The root port used for PCIe tunneling should be marked as removable to
> ensure that the entire chain is marked removable.
> 
> This can be done by looking for the device property specified in
> the ACPI tables `usb4-host-interface`.
> 
> Suggested-by: Mika Westerberg 
> Link: 
> https://docs.microsoft.com/en-us/windows-hardware/drivers/pci/dsd-for-pcie-root-ports#mapping-native-protocols-pcie-displayport-tunneled-through-usb4-to-usb4-host-routers
> Signed-off-by: Mario Limonciello 
> ---
>  drivers/pci/pci-acpi.c | 10 ++
>  drivers/pci/pci.h  |  5 +
>  drivers/pci/probe.c|  1 +
>  3 files changed, 16 insertions(+)
> 
> diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
> index a42dbf448860..6368e5633b1b 100644
> --- a/drivers/pci/pci-acpi.c
> +++ b/drivers/pci/pci-acpi.c
> @@ -1391,6 +1391,16 @@ void pci_acpi_cleanup(struct device *dev, struct 
> acpi_device *adev)
>   }
>  }
>  
> +bool pci_acpi_is_usb4(struct pci_dev *dev)
> +{
> + struct acpi_device *adev = ACPI_COMPANION(&dev->dev);
> +
> + if (!adev)
> + return false;
> + return fwnode_property_present(acpi_fwnode_handle(adev),
> +"usb4-host-interface");

Maybe it's obvious to everybody but me that "USB4" means this device
is removable.  The Microsoft reference above doesn't say anything
about removability.

My expectation is that "USB" (like "PCI" and "PCIe") tells me
something about how a device is electrically connected and how
software can operate it.  It doesn't really tell me anything about
whether those electrical connections are permanent, made through an
internal slot, or made through an external connector and cable.

> +}
> +
>  static struct fwnode_handle *(*pci_msi_get_fwnode_cb)(struct device *dev);
>  
>  /**
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 3d60cabde1a1..359607c0542d 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -695,6 +695,7 @@ void acpi_pci_refresh_power_state(struct pci_dev *dev);
>  int acpi_pci_wakeup(struct pci_dev *dev, bool enable);
>  bool acpi_pci_need_resume(struct pci_dev *dev);
>  pci_power_t acpi_pci_choose_state(struct pci_dev *pdev);
> +bool pci_acpi_is_usb4(struct pci_dev *dev);
>  #else
>  static inline int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
>  {
> @@ -734,6 +735,10 @@ static inline pci_power_t acpi_pci_choose_state(struct 
> pci_dev *pdev)
>  {
>   return PCI_POWER_ERROR;
>  }
> +static inline bool pci_acpi_is_usb4(struct pci_dev *dev)
> +{
> + return false;
> +}
>  #endif
>  
>  #ifdef CONFIG_PCIEASPM
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index e41656cdd8f0..2693211d31cf 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1610,6 +1610,7 @@ static void pci_set_removable(struct pci_dev *dev)
>* exposed as "removable" to userspace.
>*/
>   if (vsec ||
> + pci_acpi_is_usb4(dev) ||
>   (parent &&
> (parent->external_facing || dev_is_removable(&parent->dev))))
> dev_set_removable(&dev->dev, DEVICE_REMOVABLE);
> -- 
> 2.34.1
> 


Re: [Nouveau] [PATCH v3 03/12] PCI: Move check for old Apple Thunderbolt controllers into a quirk

2022-02-11 Thread Bjorn Helgaas
On Fri, Feb 11, 2022 at 01:32:41PM -0600, Mario Limonciello wrote:
> `pci_bridge_d3_possible` currently checks explicitly for a Thunderbolt
> controller to indicate that D3 is possible.  As this is used solely
> for older Apple systems, move it into a quirk that enumerates across
> all Intel TBT controllers.
> 
> Suggested-by: Mika Westerberg 
> Signed-off-by: Mario Limonciello 
> ---
>  drivers/pci/pci.c| 12 +-
>  drivers/pci/quirks.c | 53 
>  2 files changed, 60 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 9ecce435fb3f..5002e214c9a6 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1064,7 +1064,13 @@ static inline bool platform_pci_bridge_d3(struct 
> pci_dev *dev)
>   if (pci_use_mid_pm())
>   return false;
>  
> - return acpi_pci_bridge_d3(dev);
> + if (acpi_pci_bridge_d3(dev))
> + return true;
> +
> + if (device_property_read_bool(&dev->dev, "HotPlugSupportInD3"))
> + return true;

Why do we need this?  acpi_pci_bridge_d3() already looks for
"HotPlugSupportInD3".

> + return false;
>  }
>  
>  /**
> @@ -2954,10 +2960,6 @@ bool pci_bridge_d3_possible(struct pci_dev *bridge)
>   if (pci_bridge_d3_force)
>   return true;
>  
> - /* Even the oldest 2010 Thunderbolt controller supports D3. */
> - if (bridge->is_thunderbolt)
> - return true;
> -
>   /* Platform might know better if the bridge supports D3 */
>   if (platform_pci_bridge_d3(bridge))
>   return true;
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 6d3c88edde00..aaf098ca7d54 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3756,6 +3756,59 @@ DECLARE_PCI_FIXUP_SUSPEND_LATE(PCI_VENDOR_ID_INTEL,
>  quirk_apple_poweroff_thunderbolt);
>  #endif
>  
> +/* Apple machines as old as 2010 can do D3 with Thunderbolt controllers, but 
> don't specify
> + * it in the ACPI tables

Wrap to fit in 80 columns like the rest of the file.  Also use the:

  /*
   * comment ...
   */

style if it's more than one line.

I don't think "as old as 2010" is helpful here -- I assume 2010 is
there because there *were* no Thunderbolt controllers before 2010, but
the code doesn't check any dates, so we basically assume all Apple
machines of any age with the listed controllers can do this.

> + */
> +static void quirk_apple_d3_thunderbolt(struct pci_dev *dev)
> +{
> + struct property_entry properties[] = {
> + PROPERTY_ENTRY_BOOL("HotPlugSupportInD3"),
> + {},
> + };
> +
> + if (!x86_apple_machine)
> + return;

The current code doesn't check x86_apple_machine, so this needs some
justification.  How do I know this works the same as before?

> +
> + if (device_create_managed_software_node(&dev->dev, properties, NULL))
> + pci_warn(dev, "could not add HotPlugSupportInD3 property");
> +}
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_LIGHT_RIDGE,
> + quirk_apple_d3_thunderbolt);

The current code assumes *all* Thunderbolt controllers support D3, so
it would assume a controller released next year would support D3, but
this code would assume the opposite.  Are we supposed to add
everything to this list, or do newer machines supply
HotPlugSupportInD3, or ...?

How did you derive this list?  (Question for the commit log and/or
comments here.)

> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_EAGLE_RIDGE,
> + quirk_apple_d3_thunderbolt);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_LIGHT_PEAK,
> + quirk_apple_d3_thunderbolt);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 
> PCI_DEVICE_ID_INTEL_CACTUS_RIDGE_4C,
> + quirk_apple_d3_thunderbolt);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 
> PCI_DEVICE_ID_INTEL_CACTUS_RIDGE_2C,
> + quirk_apple_d3_thunderbolt);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_PORT_RIDGE,
> + quirk_apple_d3_thunderbolt);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 
> PCI_DEVICE_ID_INTEL_REDWOOD_RIDGE_2C_NHI,
> + quirk_apple_d3_thunderbolt);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 
> PCI_DEVICE_ID_INTEL_REDWOOD_RIDGE_2C_BRIDGE,
> + quirk_apple_d3_thunderbolt);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 
> PCI_DEVICE_ID_INTEL_REDWOOD_RIDGE_4C_NHI,
> + quirk_apple_d3_thunderbolt);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 
> PCI_DEVICE_ID_INTEL_REDWOOD_RIDGE_4C_BRIDGE,
> + quirk_apple_d3_thunderbolt);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 
> PCI_DEVICE_ID_INTEL_FALCON_RIDGE_2C_NHI,
> + quirk_apple_d3_thunderbolt);

Re: [Nouveau] [PATCH v3 02/12] PCI: Move `is_thunderbolt` check for lack of command completed to a quirk

2022-02-11 Thread Bjorn Helgaas
Update subject to something like:

  PCI: pciehp: Quirk broken Command Completed support on Intel Thunderbolt controllers

On Fri, Feb 11, 2022 at 01:32:40PM -0600, Mario Limonciello wrote:
> The `is_thunderbolt` check is currently used to indicate the lack of
> command completed support for a number of older Thunderbolt devices.
> 
> This however is heavy handed and should have been done via a quirk.  Move
> the affected devices outlined in commit 493fb50e958c ("PCI: pciehp: Assume
> NoCompl+ for Thunderbolt ports") into pci quirks.
> 
> Suggested-by: Lukas Wunner 
> Signed-off-by: Mario Limonciello 
> ---
>  drivers/pci/hotplug/pciehp_hpc.c |  6 +-
>  drivers/pci/quirks.c | 17 +
>  include/linux/pci.h  |  2 ++
>  3 files changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/pci/hotplug/pciehp_hpc.c 
> b/drivers/pci/hotplug/pciehp_hpc.c
> index 1c1ebf3dad43..e4c42b24aba8 100644
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -996,11 +996,7 @@ struct controller *pcie_init(struct pcie_device *dev)
>   if (pdev->hotplug_user_indicators)
>   slot_cap &= ~(PCI_EXP_SLTCAP_AIP | PCI_EXP_SLTCAP_PIP);
>  
> - /*
> -  * We assume no Thunderbolt controllers support Command Complete events,
> -  * but some controllers falsely claim they do.
> -  */
> - if (pdev->is_thunderbolt)
> + if (pdev->no_cmd_complete)
>   slot_cap |= PCI_EXP_SLTCAP_NCCS;
>  
>   ctrl->slot_cap = slot_cap;
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index d2dd6a6cda60..6d3c88edde00 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3675,6 +3675,23 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 
> PCI_DEVICE_ID_INTEL_CACTUS_RIDGE_4C
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_PORT_RIDGE,
>   quirk_thunderbolt_hotplug_msi);

Please add a comment above to the effect that PCI_EXP_SLTCAP_NCCS
being clear means the controller generates a Command Completed software
notification when it completes a command, and these controllers don't
generate those notifications even though PCI_EXP_SLTCAP_NCCS is clear
(PCIe r6.0, sec 7.5.3.9).
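One possible wording for that comment, ahead of the quirk list (phrasing
assumed; the citation comes from the paragraph above):

	/*
	 * PCI_EXP_SLTCAP_NCCS clear means the port generates a Command
	 * Completed software notification when it completes a command
	 * (PCIe r6.0, sec 7.5.3.9).  These Thunderbolt controllers leave
	 * the bit clear yet never send the notification, so flag them and
	 * let pciehp behave as if NCCS were set.
	 */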

> +static void quirk_thunderbolt_command_completed(struct pci_dev *pdev)

Rename to quirk_no_command_completed().  This doesn't have anything to
do with Thunderbolt; it's just that the affected devices happen to be
Thunderbolt controllers.

> +{
> + pdev->no_cmd_complete = 1;
> +}
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_LIGHT_RIDGE,
> + quirk_thunderbolt_command_completed);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_EAGLE_RIDGE,
> + quirk_thunderbolt_command_completed);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_LIGHT_PEAK,
> + quirk_thunderbolt_command_completed);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 
> PCI_DEVICE_ID_INTEL_CACTUS_RIDGE_4C,
> + quirk_thunderbolt_command_completed);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 
> PCI_DEVICE_ID_INTEL_CACTUS_RIDGE_2C,
> + quirk_thunderbolt_command_completed);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_PORT_RIDGE,
> + quirk_thunderbolt_command_completed);

Can we put these in drivers/pci/hotplug/pciehp_hpc.c?  We already
have a few similar quirks there.

>  #ifdef CONFIG_ACPI
>  /*
>   * Apple: Shutdown Cactus Ridge Thunderbolt controller.
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 8253a5413d7c..1e5b769e42fc 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -443,6 +443,8 @@ struct pci_dev {
>   unsigned int	is_hotplug_bridge:1;
>   unsigned int	shpc_managed:1;		/* SHPC owned by shpchp */
>   unsigned int	is_thunderbolt:1;	/* Thunderbolt controller */
> + unsigned int	no_cmd_complete:1;	/* Lies about command completed 
> events */
> +
>   /*
>* Devices marked being untrusted are the ones that can potentially
>* execute DMA attacks and similar. They are typically connected
> -- 
> 2.34.1
> 


Re: [Nouveau] [PATCH v3 01/12] thunderbolt: move definition of PCI_CLASS_SERIAL_USB_USB4

2022-02-11 Thread Bjorn Helgaas
On Fri, Feb 11, 2022 at 01:32:39PM -0600, Mario Limonciello wrote:
> This PCI class definition of the USB4 device is currently located only in
> the thunderbolt driver.
> 
> It will be needed by a few other drivers for upcoming changes. Move it into
> the common include file.
> 
> Acked-by: Alex Deucher 
> Acked-by: Mika Westerberg 
> Signed-off-by: Mario Limonciello 

I would change the subject to:

  PCI: Add USB4 class definition

because this seems like more of a PCI thing than a Thunderbolt thing,
but either way:

Acked-by: Bjorn Helgaas 

> ---
>  drivers/thunderbolt/nhi.h | 2 --
>  include/linux/pci_ids.h   | 1 +
>  2 files changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/thunderbolt/nhi.h b/drivers/thunderbolt/nhi.h
> index 69083aab2736..79e980b51f94 100644
> --- a/drivers/thunderbolt/nhi.h
> +++ b/drivers/thunderbolt/nhi.h
> @@ -81,6 +81,4 @@ extern const struct tb_nhi_ops icl_nhi_ops;
>  #define PCI_DEVICE_ID_INTEL_TGL_H_NHI0   0x9a1f
>  #define PCI_DEVICE_ID_INTEL_TGL_H_NHI1   0x9a21
>  
> -#define PCI_CLASS_SERIAL_USB_USB4	0x0c0340
> -
>  #endif
> diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
> index aad54c666407..61b161d914f0 100644
> --- a/include/linux/pci_ids.h
> +++ b/include/linux/pci_ids.h
> @@ -116,6 +116,7 @@
>  #define PCI_CLASS_SERIAL_USB_OHCI	0x0c0310
>  #define PCI_CLASS_SERIAL_USB_EHCI	0x0c0320
>  #define PCI_CLASS_SERIAL_USB_XHCI	0x0c0330
> +#define PCI_CLASS_SERIAL_USB_USB4	0x0c0340
>  #define PCI_CLASS_SERIAL_USB_DEVICE	0x0c03fe
>  #define PCI_CLASS_SERIAL_FIBER   0x0c04
>  #define PCI_CLASS_SERIAL_SMBUS   0x0c05
> -- 
> 2.34.1
> 


Re: [Nouveau] 5.12.1 0010:nvkm_falcon_v1_wait_for_halt+0x8f/0xb9 [nouveau]

2021-05-06 Thread Bjorn Helgaas
[+cc Ben]

Hi Marc,

Thanks for paying attention to these things.  I added Ben (who
probably would see this via nouveau@lists.freedesktop.org anyway).
I don't see a PCI issue here, but the nouveau timeout, which I know
nothing about, does look like it could be interesting.

On Wed, May 05, 2021 at 02:42:27PM -0700, Marc MERLIN wrote:
> Howdy,
> I upgraded my thinkpad P73 from 5.9 to 5.12, and I now get this new
> bug at boot (although the system does continue booting and display works
> since I use i915 for display and only use nouveau for PM)
> 
> Short:
> [   18.561181] WARNING: CPU: 15 PID: 220 at 
> drivers/gpu/drm/nouveau/nvkm/falcon/v1.c:247 
> nvkm_falcon_v1_wait_for_halt+0x8f/0xb9 [nouveau]
> [   18.561300] Modules linked in: dm_crypt trusted tpm rng_core dm_mod 
> raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx multipath 
> sata_sil24 r8169 realtek mdio_devres libphy mii hid_generic usbhid hid 
> crct10dif_pclmul crc32_pclmul crc32c_intel xhci_pci rtsx_pci_sdmmc nouveau 
> ghash_clmulni_intel xhci_hcd mmc_core e1000e i2c_designware_platform mxm_wmi 
> i2c_designware_core hwmon ptp aesni_intel intel_lpss_pci drm_ttm_helper 
> i2c_i801 crypto_simd intel_lpss i2c_smbus psmouse i915 cryptd pps_core 
> thunderbolt rtsx_pci idma64 usbcore ttm i2c_nvidia_gpu thermal wmi battery
> [   18.561636] CPU: 15 PID: 220 Comm: kworker/15:2 Tainted: G U   
>  5.12.1-amd64-preempt-sysrq-20190817 #1
> [   18.561707] Hardware name: LENOVO 20QRS00200/20QRS00200, BIOS N2NET40W 
> (1.25 ) 08/26/2020
> [   18.561765] Workqueue: pm pm_runtime_work
> [   18.561799] RIP: 0010:nvkm_falcon_v1_wait_for_halt+0x8f/0xb9 [nouveau]
> 
> Despite the warning, the chip seems to go to sleep on batteries; powertop
> shows an encouraging low battery use (my lowest one yet of any kernel):
> The battery reports a discharge rate of 10.7 W
> The power consumed was 230 J
> 
> So it seems that what I need from nouveau is working (power management)
> 
> Full warning below with logs
> 
> 
> Long:
> [0.000000] Linux version 5.12.1-amd64-preempt-sysrq-20190817 
> (r...@sauron.svh.merlins.org) (gcc (Debian 10.2.1-3) 10.2.1 20201224, GNU ld 
> (GNU Binutils for Debian) 2.35.1) #1 SMP PREEMPT Wed May 5 13:05:02 PDT 2021
> [0.00] Command line: 
> BOOT_IMAGE=/vmlinuz-5.12.1-amd64-preempt-sysrq-20190817 
> root=/dev/mapper/cryptroot ro rootflags=subvol=root 
> cryptopts=source=/dev/nvme0n1p7,keyscript=/sbin/cryptgetpw 
> usbcore.autosuspend=1 pcie_aspm=force resume=/dev/dm-1 acpi_backlight=vendor 
> nouveau.debug=disp=trace
> [8.672663] nouveau 0000:01:00.0: runtime IRQ mapping not provided by arch
> [8.677434] nouveau 0000:01:00.0: enabling device (0000 -> 0003)
> [8.691872] nouveau 0000:01:00.0: NVIDIA TU104 (164000a1)
> [8.789240] nouveau 0000:01:00.0: bios: version 90.04.4d.00.2c
> [8.789605] nouveau 0000:01:00.0: pmu: firmware unavailable
> [8.789897] nouveau 0000:01:00.0: enabling bus mastering
> [8.789978] nouveau 0000:01:00.0: disp: preinit running...
> [8.789981] nouveau 0000:01:00.0: disp: preinit completed in 0us
> [8.789997] nouveau 0000:01:00.0: disp: fini running...
> [8.78] nouveau 0000:01:00.0: disp: fini completed in 0us
> [8.790189] nouveau 0000:01:00.0: fb: 8192 MiB GDDR6
> [8.800113] nouveau 0000:01:00.0: disp: init running...
> [8.800116] nouveau 0000:01:00.0: disp: init skipped, engine has no users
> [8.800118] nouveau 0000:01:00.0: disp: init completed in 2us
> [8.801512] nouveau 0000:01:00.0: DRM: VRAM: 8192 MiB
> [8.801515] nouveau 0000:01:00.0: DRM: GART: 536870912 MiB
> [8.801517] nouveau 0000:01:00.0: DRM: BIT table 'A' not found
> [8.801520] nouveau 0000:01:00.0: DRM: BIT table 'L' not found
> [8.801521] nouveau 0000:01:00.0: DRM: TMDS table version 2.0
> [8.801525] nouveau 0000:01:00.0: DRM: DCB version 4.1
> [8.801527] nouveau 0000:01:00.0: DRM: DCB outp 00: 02800f66 04600020
> [8.801529] nouveau 0000:01:00.0: DRM: DCB outp 01: 02011f52 00020010
> [8.801531] nouveau 0000:01:00.0: DRM: DCB outp 02: 01022f36 04600010
> [8.801533] nouveau 0000:01:00.0: DRM: DCB outp 03: 04033f76 04600010
> [8.801535] nouveau 0000:01:00.0: DRM: DCB outp 04: 04044f86 04600020
> [8.801537] nouveau 0000:01:00.0: DRM: DCB conn 00: 00020047
> [8.801539] nouveau 0000:01:00.0: DRM: DCB conn 01: 00010161
> [8.801541] nouveau 0000:01:00.0: DRM: DCB conn 02: 00001248
> [8.801543] nouveau 0000:01:00.0: DRM: DCB conn 03: 01000348
> [8.801543] nouveau 0000:01:00.0: DRM: DCB conn 04: 02000471
> [8.802234] nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
> [8.802255] nouveau 0000:01:00.0: disp: init running...
> [8.802257] nouveau 0000:01:00.0: disp: one-time init running...
> [8.802259] nouveau 0000:01:00.0: disp: outp 00:0006:0f82: type 06 loc 0 
> or 2 link 2 con 0 edid 6 bus 0 head f
> [8.802265] nouveau 0000:01:00.0: disp: outp 00:0006:0f82: bios dp 42 13 
> 00 

Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

2021-01-29 Thread Bjorn Helgaas
On Thu, Jan 28, 2021 at 04:56:26PM -0800, Marc MERLIN wrote:
> On Wed, Jan 27, 2021 at 03:33:00PM -0600, Bjorn Helgaas wrote:
> > Hi Marc, I appreciate your persistence on this.  I am frankly
> > surprised that you've put up with this so long.
>  
> Well, been using linux for 27 years, but also it's not like I have much
> of a choice outside of switching to windows, as tempting as it's getting
> sometimes ;)
> 
> > > after boot, when it gets the right trigger (not sure which ones), it
> > > > loops on this every 2 seconds, mostly forever.
> > > 
> > > I'm not sure if it's nouveau's fault or the kernel's PCI PME's fault, or 
> > > something else.
> > 
> > IIUC there are basically two problems:
> > 
> >   1) A 2 minute delay during boot
> > Another random thought: is there any chance the boot delay could be
> > related to crypto waiting for entropy?
> 
> So, the 2mn hang went away after I added the nouveau firmware in initrd.
> The only problem is that the nouveau driver does not give a very good
> clue as to what's going on and what to do.
>
> For comparison the intel iwlwifi driver is very clear about firmware
> it's trying to load, if it can't and what exact firmware you need to
> find on the internet (filename)

I guess you're referring to this in iwl_request_firmware()?

  IWL_ERR(drv, "check 
git://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git\n"); 

How can we fix this in nouveau so we don't have to debug this again?
I don't really know how firmware loading works, but "git grep -A5
request_firmware drivers/gpu/drm/nouveau/" shows that we generally
print something when request_firmware() fails.
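For illustration, the kind of hint being asked for is only a couple of
lines at the failure site (a sketch using the generic request_firmware()
API; the message text and the "fwname"/"dev" names are illustrative, not
actual nouveau code):

	ret = request_firmware(&fw, fwname, dev);
	if (ret) {
		dev_err(dev, "failed to load firmware \"%s\" (%d)\n",
			fwname, ret);
		dev_err(dev, "check git://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git\n");
	}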

But I didn't notice those messages in your logs, so I'm probably
barking up the wrong tree.

> >   2) Some sort of event every 2 seconds that kills your battery life
> > Your machine doesn't sound unusual, and I haven't seen a flood of
> > similar reports, so maybe there's something unusual about your config.
> > But I really don't have any guesses for either one.
> 
> Honestly, there are not too many thinpad P73 running linux out there. I
> wouldn't be surprised if it's only a handful or two.
> 
> > It sounds like v5.5 worked fine and you first noticed the slow boot
> > problem in v5.8.  We *could* try to bisect it, but I know that's a lot
> > of work on your part.
> 
> I've done that in the past, to be honest now that it works after I added
> the firmware that nouveau started needing, and didn't need before, the
> hang at boot is gone for sure.
> The PCI PM wakeup issues on batteries happen sometimes still, but they
> are much more rare now.

So maybe the wakeups are related to having vs not having the nouveau
firmware?  I'm still curious about that, and it smells like a bug to
me, but probably something to do with nouveau where I have no hope of
debugging it.

> > Grasping for any ideas for the boot delay; could you boot with
> > "initcall_debug" and collect your "lsmod" output?  I notice async_tx
> > in some of your logs, but I have no idea what it is.  It's from
> > crypto, so possibly somewhat unusual?
> 
> Is this still needed? I think if nouveau does a better job of helping
> the user correct the issue if firmware is missing (I think intel even
> gives a URL in printk), that would probably be what's needed for the
> most part.

Nope, don't bother with this, thanks.

Bjorn


Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

2021-01-28 Thread Bjorn Helgaas
On Wed, Jan 27, 2021 at 03:33:02PM -0600, Bjorn Helgaas wrote:
> On Sat, Dec 26, 2020 at 03:12:09AM -0800, Marc MERLIN wrote:
> > This started with 5.5 and hasn't gotten better since then, despite
> > some reports I tried to send.
> > 
> > As per my previous message:
> > I have a Thinkpad P70 with hybrid graphics.
> > 01:00.0 VGA compatible controller: NVIDIA Corporation GM107GLM [Quadro 
> > M600M] (rev a2)
> > that one works fine, I can use i915 for the main screen, and nouveau to
> > display on the external ports (external ports are only wired to nvidia
> > chip, so it's impossible to use them without turning the nvidia chip
> > on).
> >  
> > I now got a newer P73 also with the same hybrid graphics (setup as such
> > in the bios). It runs fine with i915, and I don't need to use external
> > display with nouveau for now (it almost works, but I only see the mouse
> > cursor on the external screen, no window or anything else can get
> > displayed, very weird).
> > 01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 
> > 4000 Mobile / Max-Q] (rev a1)
> >  
> > 
> > after boot, when it gets the right trigger (not sure which ones), it
> > > loops on this every 2 seconds, mostly forever.
> > 
> > I'm not sure if it's nouveau's fault or the kernel's PCI PME's fault, or 
> > something else.
> 
> IIUC there are basically two problems:
> 
>   1) A 2 minute delay during boot
>   2) Some sort of event every 2 seconds that kills your battery life
> 
> Your machine doesn't sound unusual, and I haven't seen a flood of
> similar reports, so maybe there's something unusual about your config.
> But I really don't have any guesses for either one.
> 
> It sounds like v5.5 worked fine and you first noticed the slow boot
> problem in v5.8.  We *could* try to bisect it, but I know that's a lot
> of work on your part.
> 
> Grasping for any ideas for the boot delay; could you boot with
> "initcall_debug" and collect your "lsmod" output?  I notice async_tx
> in some of your logs, but I have no idea what it is.  It's from
> crypto, so possibly somewhat unusual?

Another random thought: is there any chance the boot delay could be
related to crypto waiting for entropy?


Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

2021-01-27 Thread Bjorn Helgaas
Hi Marc, I appreciate your persistence on this.  I am frankly
surprised that you've put up with this so long.

On Sat, Dec 26, 2020 at 03:12:09AM -0800, Marc MERLIN wrote:
> This started with 5.5 and hasn't gotten better since then, despite
> some reports I tried to send.
> 
> As per my previous message:
> I have a Thinkpad P70 with hybrid graphics.
> 01:00.0 VGA compatible controller: NVIDIA Corporation GM107GLM [Quadro M600M] 
> (rev a2)
> that one works fine, I can use i915 for the main screen, and nouveau to
> display on the external ports (external ports are only wired to nvidia
> chip, so it's impossible to use them without turning the nvidia chip
> on).
>  
> I now got a newer P73 also with the same hybrid graphics (setup as such
> in the bios). It runs fine with i915, and I don't need to use external
> display with nouveau for now (it almost works, but I only see the mouse
> cursor on the external screen, no window or anything else can get
> displayed, very weird).
> 01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 
> 4000 Mobile / Max-Q] (rev a1)
>  
> 
> after boot, when it gets the right trigger (not sure which ones), it
> loops on this every 2 seconds, mostly forever.
> 
> I'm not sure if it's nouveau's fault or the kernel's PCI PME's fault, or 
> something else.

IIUC there are basically two problems:

  1) A 2 minute delay during boot
  2) Some sort of event every 2 seconds that kills your battery life

Your machine doesn't sound unusual, and I haven't seen a flood of
similar reports, so maybe there's something unusual about your config.
But I really don't have any guesses for either one.

It sounds like v5.5 worked fine and you first noticed the slow boot
problem in v5.8.  We *could* try to bisect it, but I know that's a lot
of work on your part.

Grasping for any ideas for the boot delay; could you boot with
"initcall_debug" and collect your "lsmod" output?  I notice async_tx
in some of your logs, but I have no idea what it is.  It's from
crypto, so possibly somewhat unusual?

Bjorn


[Nouveau] add nouveau to lore?

2021-01-27 Thread Bjorn Helgaas
Is there any chance the nouveau mailing list
(https://lists.freedesktop.org/archives/nouveau/) could be added to
lore?

Threads like this conversation:
https://lore.kernel.org/r/20200908002935.gd20...@merlins.org get
harder to browse when mailing lists that *are* on lore get removed
from cc.

Thanks,
  Bjorn


Re: [Nouveau] nouveau regression with 5.7 caused by "PCI/PM: Assume ports without DLL Link Active train links in 100 ms"

2020-07-24 Thread Bjorn Helgaas
On Fri, Jul 24, 2020 at 12:57:51PM +0300, Mika Westerberg wrote:
> On Thu, Jul 23, 2020 at 10:30:58PM +0200, Karol Herbst wrote:
> > On Wed, Jul 22, 2020 at 11:25 AM Mika Westerberg
> >  wrote:
> > >
> > > On Tue, Jul 21, 2020 at 01:37:12PM -0500, Patrick Volkerding wrote:
> > > > On 7/21/20 10:27 AM, Mika Westerberg wrote:
> > > > > On Tue, Jul 21, 2020 at 11:01:55AM -0400, Lyude Paul wrote:
> > > > >> Sure thing. Also, feel free to let me know if you'd like access to 
> > > > >> one of the
> > > > >> systems we saw breaking with this patch - I'm fairly sure I've got 
> > > > >> one of them
> > > > >> locally at my apartment and don't mind setting up AMT/KVM/SSH
> > > > > Probably no need for remote access (thanks for the offer, though). I
> > > > > attached a test patch to the bug report:
> > > > >
> > > > >   https://bugzilla.kernel.org/show_bug.cgi?id=208597
> > > > >
> > > > > that tries to work it around (based on the ->pm_cap == 0). I wonder if
> > > > > anyone would have time to try it out.
> > > >
> > > >
> > > > Hi Mika,
> > > >
> > > > I can confirm that this patch applied to 5.4.52 fixes the issue with
> > > > hybrid graphics on the Thinkpad X1 Extreme gen2.
> > >
> > > Great, thanks for testing!
> > 
> > yeah, works on the P1G2 as well.
> 
> Thanks for testing!
> 
> Since we have the revert queued for this release cycle, I think I will
> send an updated version of "PCI/PM: Assume ports without DLL Link Active
> train links in 100 ms" after v5.9-rc1 is released that has this
> workaround in place.
> 
> (I'm continuing my vacation so will be offline next week).

Sounds fine, sorry for interrupting your vacation!


Re: [Nouveau] nouveau regression with 5.7 caused by "PCI/PM: Assume ports without DLL Link Active train links in 100 ms"

2020-07-17 Thread Bjorn Helgaas
On Fri, Jul 17, 2020 at 10:43:18AM -0400, Sasha Levin wrote:
> On Fri, Jul 17, 2020 at 02:43:52AM +0200, Karol Herbst wrote:
> > On Fri, Jul 17, 2020 at 1:54 AM Bjorn Helgaas  wrote:
> > > On Fri, Jul 17, 2020 at 12:10:39AM +0200, Karol Herbst wrote:
> > > > On Tue, Jul 7, 2020 at 9:30 PM Karol Herbst  wrote:
> > > > >
> > > > > Hi everybody,
> > > > >
> > > > > with the mentioned commit Nouveau isn't able to load firmware onto the
> > > > > GPU on one of my systems here. Even though the issue doesn't always
> > > > > happen I am quite confident this is the commit breaking it.
> > > > >
> > > > > I am still digging into the issue and trying to figure out what
> > > > > exactly breaks, but it shows up in different ways. Either we are not
> > > > > able to boot the engines on the GPU or the GPU becomes unresponsive.
> > > > > Btw, this is also a system where our runtime power management issue
> > > > > shows up, so maybe there is indeed something funky with the bridge
> > > > > controller.
> > > > >
> > > > > Just pinging you in case you have an idea on how this could break 
> > > > > Nouveau
> > > > >
> > > > > most of the times it shows up like this:
> > > > > nouveau :01:00.0: acr: AHESASC binary failed
> > > > >
> > > > > Sometimes it works at boot and fails at runtime resuming with random
> > > > > faults. So I will be investigating a bit more, but yeah... I am super
> > > > > sure the commit triggered this issue, no idea if it actually causes
> > > > > it.
> > > >
> > > > so yeah.. I reverted that locally and never ran into issues again.
> > > > Still valid on latest 5.7. So can we get this reverted or properly
> > > > fixed? This breaks runtime pm for us on at least some hardware.
> > > 
> > > Yeah, that stinks.  We had another similar report from Patrick:
> > > 
> > >   
> > > https://lore.kernel.org/r/caerspo5stek_my1dehwp7ahd0xop87+ohywktjbl7algdbx...@mail.gmail.com
> > > 
> > > Apparently the problem is ec411e02b7a2 ("PCI/PM: Assume ports without
> > > DLL Link Active train links in 100 ms"), which Patrick found was
> > > backported to v5.4.49 as 828b192c57e8, and you found was backported to
> > > v5.7.6 as afaff825e3a4.
> > > 
> > > Oddly, Patrick reported that v5.7.7 worked correctly, even though it
> > > still contains afaff825e3a4.
> > > 
> > > I guess in the absence of any other clues we'll have to revert it.
> > > I hate to do that because that means we'll have slow resume of
> > > Thunderbolt-connected devices again, but that's better than having
> > > GPUs completely broken.
> > > 
> > > Could you and Patrick open bugzilla.kernel.org reports, attach dmesg
> > > logs and "sudo lspci -vv" output, and add the URLs to Kai-Heng's
> > > original report at https://bugzilla.kernel.org/show_bug.cgi?id=206837
> > > and to this thread?
> > > 
> > > There must be a way to fix the slow resume problem without breaking
> > > the GPUs.
> > > 
> > 
> > I wouldn't be surprised if this is related to the Intel bridge we
> > check against for Nouveau.. I still have to check on another laptop
> > with the same bridge our workaround was required as well but wouldn't
> > be surprised if it shows the same problem. Will get you the
> > information from both systems tomorrow then.
> 
> I take it that ec411e02b7a2 will be reverted upstream?

Yes, unless we have a better fix soon.  I applied the revert to my
for-linus branch, so it will appear in -next soon.  I think it's a
little late to get it in -rc5, so I'll probably ask Linus to pull it
next week for -rc6.

Bjorn


Re: [Nouveau] nouveau regression with 5.7 caused by "PCI/PM: Assume ports without DLL Link Active train links in 100 ms"

2020-07-16 Thread Bjorn Helgaas
[+cc Sasha -- stable kernel regression]
[+cc Patrick, Kai-Heng, LKML]

On Fri, Jul 17, 2020 at 12:10:39AM +0200, Karol Herbst wrote:
> On Tue, Jul 7, 2020 at 9:30 PM Karol Herbst  wrote:
> >
> > Hi everybody,
> >
> > with the mentioned commit Nouveau isn't able to load firmware onto the
> > GPU on one of my systems here. Even though the issue doesn't always
> > happen I am quite confident this is the commit breaking it.
> >
> > I am still digging into the issue and trying to figure out what
> > exactly breaks, but it shows up in different ways. Either we are not
> > able to boot the engines on the GPU or the GPU becomes unresponsive.
> > Btw, this is also a system where our runtime power management issue
> > shows up, so maybe there is indeed something funky with the bridge
> > controller.
> >
> > Just pinging you in case you have an idea on how this could break Nouveau
> >
> > most of the times it shows up like this:
> > nouveau :01:00.0: acr: AHESASC binary failed
> >
> > Sometimes it works at boot and fails at runtime resuming with random
> > faults. So I will be investigating a bit more, but yeah... I am super
> > sure the commit triggered this issue, no idea if it actually causes
> > it.
> 
> so yeah.. I reverted that locally and never ran into issues again.
> Still valid on latest 5.7. So can we get this reverted or properly
> fixed? This breaks runtime pm for us on at least some hardware.

Yeah, that stinks.  We had another similar report from Patrick:

  
https://lore.kernel.org/r/caerspo5stek_my1dehwp7ahd0xop87+ohywktjbl7algdbx...@mail.gmail.com

Apparently the problem is ec411e02b7a2 ("PCI/PM: Assume ports without
DLL Link Active train links in 100 ms"), which Patrick found was
backported to v5.4.49 as 828b192c57e8, and you found was backported to
v5.7.6 as afaff825e3a4.

Oddly, Patrick reported that v5.7.7 worked correctly, even though it
still contains afaff825e3a4.

I guess in the absence of any other clues we'll have to revert it.
I hate to do that because that means we'll have slow resume of
Thunderbolt-connected devices again, but that's better than having
GPUs completely broken.

Could you and Patrick open bugzilla.kernel.org reports, attach dmesg
logs and "sudo lspci -vv" output, and add the URLs to Kai-Heng's
original report at https://bugzilla.kernel.org/show_bug.cgi?id=206837
and to this thread?

There must be a way to fix the slow resume problem without breaking
the GPUs.

Bjorn


Re: [Nouveau] [PATCH v3] PCI: Use ioremap(), not phys_to_virt() for platform ROM

2020-03-30 Thread Bjorn Helgaas
On Mon, Mar 30, 2020 at 01:54:33PM +0000, Deucher, Alexander wrote:
> > -Original Message-
> > From: Bjorn Helgaas 
> > Sent: Saturday, March 28, 2020 4:19 PM
> > To: Mikel Rychliski 
> > Cc: amd-...@lists.freedesktop.org; linux-...@vger.kernel.org;
> > nouveau@lists.freedesktop.org; Deucher, Alexander
> > ; Koenig, Christian
> > ; Zhou, David(ChunMing)
> > ; Matthew Garrett
> > ; Ben Skeggs ;
> > Christoph Hellwig 
> > Subject: Re: [PATCH v3] PCI: Use ioremap(), not phys_to_virt() for platform
> > ROM
> > 
> > On Wed, Mar 18, 2020 at 10:16:23PM -0400, Mikel Rychliski wrote:
> > > On some EFI systems, the video BIOS is provided by the EFI firmware.
> > > The boot stub code stores the physical address of the ROM image in pdev-
> > >rom.
> > > Currently we attempt to access this pointer using phys_to_virt(),
> > > which doesn't work with CONFIG_HIGHMEM.
> > >
> > > On these systems, attempting to load the radeon module on a x86_32
> > > kernel can result in the following:
> > >
> > > BUG: unable to handle page fault for address: 3e8ed03c
> > > #PF: supervisor read access in kernel mode
> > > #PF: error_code(0x0000) - not-present page
> > > *pde = 00000000
> > > Oops: 0000 [#1] PREEMPT SMP
> > > CPU: 0 PID: 317 Comm: systemd-udevd Not tainted 5.6.0-rc3-next-
> > 20200228 #2
> > > Hardware name: Apple Computer, Inc. MacPro1,1/Mac-F4208DC8, BIOS
> > MP11.88Z.005C.B08.0707021221 07/02/07
> > > EIP: radeon_get_bios+0x5ed/0xe50 [radeon]
> > > Code: 00 00 84 c0 0f 85 12 fd ff ff c7 87 64 01 00 00 00 00 00 00 8b 
> > > 47 08 8b
> > 55 b0 e8 1e 83 e1 d6 85 c0 74 1a 8b 55 c0 85 d2 74 13 <80> 38 55 75 0e 80 
> > 78 01
> > aa 0f 84 a4 03 00 00 8d 74 26 00 68 dc 06
> > > EAX: 3e8ed03c EBX: 00000000 ECX: 3e8ed03c EDX: 00000001
> > > ESI: 00000004 EDI: eec04000 EBP: eef3fc60 ESP: eef3fbe0
> > > DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010206
> > > CR0: 80050033 CR2: 3e8ed03c CR3: 2ec77000 CR4: 000006d0
> > > Call Trace:
> > >  ? register_client+0x34/0xe0
> > >  ? register_client+0xab/0xe0
> > >  r520_init+0x26/0x240 [radeon]
> > >  radeon_device_init+0x533/0xa50 [radeon]
> > >  radeon_driver_load_kms+0x80/0x220 [radeon]
> > >  drm_dev_register+0xa7/0x180 [drm]
> > >  radeon_pci_probe+0x10f/0x1a0 [radeon]
> > >  pci_device_probe+0xd4/0x140
> > >  really_probe+0x13d/0x3b0
> > >  driver_probe_device+0x56/0xd0
> > >  device_driver_attach+0x49/0x50
> > >  __driver_attach+0x79/0x130
> > >  ? device_driver_attach+0x50/0x50
> > >  bus_for_each_dev+0x5b/0xa0
> > >  driver_attach+0x19/0x20
> > >  ? device_driver_attach+0x50/0x50
> > >  bus_add_driver+0x117/0x1d0
> > >  ? pci_bus_num_vf+0x20/0x20
> > >  driver_register+0x66/0xb0
> > >  ? 0xf80f4000
> > >  __pci_register_driver+0x3d/0x40
> > >  radeon_init+0x82/0x1000 [radeon]
> > >  do_one_initcall+0x42/0x200
> > >  ? kvfree+0x25/0x30
> > >  ? __vunmap+0x206/0x230
> > >  ? kmem_cache_alloc_trace+0x16f/0x220
> > >  ? do_init_module+0x21/0x220
> > >  do_init_module+0x50/0x220
> > >  load_module+0x1f26/0x2200
> > >  sys_init_module+0x12d/0x160
> > >  do_fast_syscall_32+0x82/0x250
> > >  entry_SYSENTER_32+0xa5/0xf8
> > >
> > > Fix the issue by updating all drivers which can access a platform
> > > provided ROM. Instead of calling the helper function
> > > pci_platform_rom() which uses phys_to_virt(), call ioremap() directly on
> > the pdev->rom.
> > >
> > > radeon_read_platform_bios() previously directly accessed an __iomem
> > > pointer. Avoid this by calling memcpy_fromio() instead of kmemdup().
> > >
> > > pci_platform_rom() now has no remaining callers, so remove it.
> > >
> > > Signed-off-by: Mikel Rychliski 
> > 
> > I applied this to pci/resource for v5.7.  I would feel better if some of the
> > graphics guys chimed in, or even applied it via the DRM tree since most of 
> > the
> > changes are actually in drivers/gpu.
> 
> Feel free to take it through the PCI tree.  These areas of radeon and amdgpu 
> don't really change much at all, so I'm not too concerned about a conflict.
> 
> Acked-by: Alex Deucher 

Re: [Nouveau] [PATCH v3] PCI: Use ioremap(), not phys_to_virt() for platform ROM

2020-03-28 Thread Bjorn Helgaas
On Wed, Mar 18, 2020 at 10:16:23PM -0400, Mikel Rychliski wrote:
> On some EFI systems, the video BIOS is provided by the EFI firmware.  The
> boot stub code stores the physical address of the ROM image in pdev->rom.
> Currently we attempt to access this pointer using phys_to_virt(), which
> doesn't work with CONFIG_HIGHMEM.
> 
> On these systems, attempting to load the radeon module on a x86_32 kernel
> can result in the following:
> 
> BUG: unable to handle page fault for address: 3e8ed03c
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> *pde = 00000000
> Oops: 0000 [#1] PREEMPT SMP
> CPU: 0 PID: 317 Comm: systemd-udevd Not tainted 5.6.0-rc3-next-20200228 #2
> Hardware name: Apple Computer, Inc. MacPro1,1/Mac-F4208DC8, BIOS 
> MP11.88Z.005C.B08.0707021221 07/02/07
> EIP: radeon_get_bios+0x5ed/0xe50 [radeon]
> Code: 00 00 84 c0 0f 85 12 fd ff ff c7 87 64 01 00 00 00 00 00 00 8b 47 
> 08 8b 55 b0 e8 1e 83 e1 d6 85 c0 74 1a 8b 55 c0 85 d2 74 13 <80> 38 55 75 0e 
> 80 78 01 aa 0f 84 a4 03 00 00 8d 74 26 00 68 dc 06
> EAX: 3e8ed03c EBX: 00000000 ECX: 3e8ed03c EDX: 00000001
> ESI: 00000004 EDI: eec04000 EBP: eef3fc60 ESP: eef3fbe0
> DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010206
> CR0: 80050033 CR2: 3e8ed03c CR3: 2ec77000 CR4: 000006d0
> Call Trace:
>  ? register_client+0x34/0xe0
>  ? register_client+0xab/0xe0
>  r520_init+0x26/0x240 [radeon]
>  radeon_device_init+0x533/0xa50 [radeon]
>  radeon_driver_load_kms+0x80/0x220 [radeon]
>  drm_dev_register+0xa7/0x180 [drm]
>  radeon_pci_probe+0x10f/0x1a0 [radeon]
>  pci_device_probe+0xd4/0x140
>  really_probe+0x13d/0x3b0
>  driver_probe_device+0x56/0xd0
>  device_driver_attach+0x49/0x50
>  __driver_attach+0x79/0x130
>  ? device_driver_attach+0x50/0x50
>  bus_for_each_dev+0x5b/0xa0
>  driver_attach+0x19/0x20
>  ? device_driver_attach+0x50/0x50
>  bus_add_driver+0x117/0x1d0
>  ? pci_bus_num_vf+0x20/0x20
>  driver_register+0x66/0xb0
>  ? 0xf80f4000
>  __pci_register_driver+0x3d/0x40
>  radeon_init+0x82/0x1000 [radeon]
>  do_one_initcall+0x42/0x200
>  ? kvfree+0x25/0x30
>  ? __vunmap+0x206/0x230
>  ? kmem_cache_alloc_trace+0x16f/0x220
>  ? do_init_module+0x21/0x220
>  do_init_module+0x50/0x220
>  load_module+0x1f26/0x2200
>  sys_init_module+0x12d/0x160
>  do_fast_syscall_32+0x82/0x250
>  entry_SYSENTER_32+0xa5/0xf8
> 
> Fix the issue by updating all drivers which can access a platform
> provided ROM. Instead of calling the helper function pci_platform_rom()
> which uses phys_to_virt(), call ioremap() directly on the pdev->rom.
> 
> radeon_read_platform_bios() previously directly accessed an __iomem
> pointer. Avoid this by calling memcpy_fromio() instead of kmemdup().
> 
> pci_platform_rom() now has no remaining callers, so remove it.
> 
> Signed-off-by: Mikel Rychliski 

I applied this to pci/resource for v5.7.  I would feel better if some
of the graphics guys chimed in, or even applied it via the DRM tree
since most of the changes are actually in drivers/gpu.

Feel free to add my

  Acked-by: Bjorn Helgaas 

and let me know if you do that.
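For context, the pattern all three drivers move to looks roughly like
this (a sketch assuming pdev->rom and pdev->romlen were populated by the
EFI boot stub; allocation and ROM parsing elided):

	void __iomem *bios;

	if (!pdev->rom || pdev->romlen == 0)
		return false;
	bios = ioremap(pdev->rom, pdev->romlen);  /* was phys_to_virt() */
	if (!bios)
		return false;
	memcpy_fromio(buf, bios, pdev->romlen);  /* never deref __iomem directly */
	iounmap(bios);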

> ---
> 
> Tested on a MacPro 1,1 with a Radeon X1900 XT card and 32-bit kernel.
> 
> Changes in v3:
>  - Inline pci_platform_rom()
> 
> Changes in v2:
>  - Add iounmap() call in nouveau
>  - Update function comment for pci_platform_rom()
>  - Minor changes to commit messages
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c   | 31 
> +-
>  .../gpu/drm/nouveau/nvkm/subdev/bios/shadowpci.c   | 17 ++--
>  drivers/gpu/drm/radeon/radeon_bios.c   | 30 +
>  drivers/pci/rom.c  | 17 
>  include/linux/pci.h|  1 -
>  5 files changed, 52 insertions(+), 44 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
> index 50dff69a0f6e..b1172d93c99c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
> @@ -192,30 +192,35 @@ static bool amdgpu_read_bios_from_rom(struct 
> amdgpu_device *adev)
>  
>  static bool amdgpu_read_platform_bios(struct amdgpu_device *adev)
>  {
> - uint8_t __iomem *bios;
> - size_t size;
> + phys_addr_t rom = adev->pdev->rom;
> + size_t romlen = adev->pdev->romlen;
> + void __iomem *bios;
>  
>   adev->bios = NULL;
>  
> - bios = pci_platform_rom(adev->pdev

Re: [Nouveau] [PATCH v7] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2020-03-24 Thread Bjorn Helgaas
On Tue, Mar 24, 2020 at 06:31:08PM +0100, Karol Herbst wrote:
> On Sat, Mar 21, 2020 at 2:02 AM Karol Herbst  wrote:
> >
> > On Fri, Mar 20, 2020 at 11:19 PM Bjorn Helgaas  wrote:
> > >
> > > On Tue, Mar 10, 2020 at 08:26:27PM +0100, Karol Herbst wrote:
> > > > Fixes the infamous 'runtime PM' bug many users are facing on Laptops 
> > > > with
> > > > Nvidia Pascal GPUs by skipping said PCI power state changes on the GPU.
> > > >
> > > > Depending on the used kernel there might be messages like those in 
> > > > demsg:
> > > >
> > > > "nouveau :01:00.0: Refused to change power state, currently in D3"
> > > > "nouveau :01:00.0: can't change power state from D3cold to D0 
> > > > (config
> > > > space inaccessible)"
> > > > followed by backtraces of kernel crashes or timeouts within nouveau.
> > > >
> > > > It's still unknown why this issue exists, but this is a reliable 
> > > > workaround
> > > > and solves a very annoying issue for users having to choose between a
> > > > crashing kernel or higher power consumption of their laptops.
> > >
> > > Thanks for the bugzilla link.  The bugzilla mentions lots of mailing
> > > list discussion.  Can you include links to some of that?
> > >
> > > IIUC this basically just turns off PCI power management for the GPU.
> > > Can you do that with something like the following?  I don't know
> > > anything about DRM, so I don't know where you could save the pm_cap,
> > > but I'm sure the driver could keep it somewhere.
> > >
> >
> > Are you sure this would work? From a quick look over the PCI code, it looks
> > like a lot of code we really need would be skipped, like the platform code
> > to turn off the GPU via ACPI. But I could also remember incorrectly on
> > how all of that worked again. I can of course try and see what the
> > effect of this patch would be. And would the parent bus even go into
> > D3hot if it knows one of its children is still at D0? Because that's
> > what the result of that would be as well, no? And I know that if the
> > bus stays in D0, that it has a negative impact on power consumption.
> >
> > Anyway, I will try that out, I am just not seeing how that would help.
> 
> so it seems like that has worked unless I screwed up locally. Will do
> some proper testing and then I think we won't need to go through the
> pci tree anymore as no changes are required there with that.

Hehe, looks like our responses crossed in the mail :)  I hope further
testing is still positive; let me know if not.

Bjorn


Re: [Nouveau] [PATCH v7] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2020-03-24 Thread Bjorn Helgaas
On Sat, Mar 21, 2020 at 02:02:22AM +0100, Karol Herbst wrote:
> On Fri, Mar 20, 2020 at 11:19 PM Bjorn Helgaas  wrote:
> >
> > On Tue, Mar 10, 2020 at 08:26:27PM +0100, Karol Herbst wrote:
> > > Fixes the infamous 'runtime PM' bug many users are facing on Laptops with
> > > Nvidia Pascal GPUs by skipping said PCI power state changes on the GPU.
> > >
> > > Depending on the used kernel there might be messages like those in dmesg:
> > >
> > > "nouveau :01:00.0: Refused to change power state, currently in D3"
> > > "nouveau :01:00.0: can't change power state from D3cold to D0 (config
> > > space inaccessible)"
> > > followed by backtraces of kernel crashes or timeouts within nouveau.
> > >
> > > It's still unknown why this issue exists, but this is a reliable workaround
> > > and solves a very annoying issue for users having to choose between a
> > > crashing kernel or higher power consumption of their laptops.
> >
> > Thanks for the bugzilla link.  The bugzilla mentions lots of mailing
> > list discussion.  Can you include links to some of that?
> >
> > IIUC this basically just turns off PCI power management for the GPU.
> > Can you do that with something like the following?  I don't know
> > anything about DRM, so I don't know where you could save the pm_cap,
> > but I'm sure the driver could keep it somewhere.
> 
> Are you sure this would work? From a quick look over the PCI code, it looks
> like a lot of code we really need would be skipped, like the platform
> code to turn off the GPU via ACPI. But I could also remember
> incorrectly on how all of that worked again. I can of course try and
> see what the effect of this patch would be. 

I'm not in a position to test this myself.  I would expect that if a
device lacks a PCI power management capability, we could still use
ACPI power management.  My idea with this patch was to simulate that
situation by clearing pdev->pm_cap so we treat the GPU as though it
had no PCI PM capability.
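(For reference, the short-circuit this leans on sits near the top of
pci_raw_set_power_state(), roughly:

	if (!dev->pm_cap)
		return -EIO;

i.e. with pm_cap cleared the PCI_PM_CTRL config write is never attempted;
whether the ACPI transition still runs on that error path is exactly the
question raised below.)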

> And would the parent bus even go into D3hot if it knows one of its
> children is still at D0? Because that's what the result of that
> would be as well, no? And I know that if the bus stays in D0, that
> it has a negative impact on power consumption.

I don't understand this part.  Are you saying you want the GPU in D0
and the upstream component (root port or switch) in D3hot?

I think the rule for the upstream component (the root port or switch
leading to the GPU) is in PCIe spec 5.0, sec 5.3.2.  Basically it says
the upstream component cannot be in a lower power state than the GPU,
i.e.,

  - if the GPU is in D0, the upstream component must be in D0;
  - if the GPU is in D2, the upstream component can be in D0-D2;
  - if the GPU is in D3hot, the upstream component can be in D0-D3hot

So I don't understand how we *could* have the GPU in D0 and the
upstream component in D3hot.
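Expressed as a predicate (a sketch only; pci_power_t already orders
PCI_D0 = 0 < PCI_D1 < PCI_D2 < PCI_D3hot, so the comparison is direct):

	/* Hypothetical check for the sec 5.3.2 rule above: the upstream
	 * component must be in the same or a higher-power state. */
	static bool hierarchy_state_ok(pci_power_t gpu, pci_power_t upstream)
	{
		return upstream <= gpu;
	}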

> Anyway, I will try that out, I am just not seeing how that would help.
> 
> > diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c 
> > b/drivers/gpu/drm/nouveau/nouveau_drm.c
> > index b65ae817eabf..2ad825e8891c 100644
> > --- a/drivers/gpu/drm/nouveau/nouveau_drm.c
> > +++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
> > @@ -618,6 +618,23 @@ nouveau_drm_device_fini(struct drm_device *dev)
> > kfree(drm);
> >  }
> >
> > +static void quirk_broken_nv_runpm(struct drm_device *drm_dev)
> > +{
> > +   struct pci_dev *pdev = drm_dev->pdev;
> > +   struct pci_dev *bridge = pci_upstream_bridge(pdev);
> > +
> > +   if (!bridge || bridge->vendor != PCI_VENDOR_ID_INTEL)
> > +   return;
> > +
> > +   switch (bridge->device) {
> > +   case 0x1901:
> > +   STASH->pm_cap = pdev->pm_cap;
> > +   pdev->pm_cap = 0;
> > +   NV_INFO(drm_dev, "Disabling PCI power management to avoid 
> > bug\n");
> > +   break;
> > +   }
> > +}
> > +
> >  static int nouveau_drm_probe(struct pci_dev *pdev,
> >  const struct pci_device_id *pent)
> >  {
> > @@ -699,6 +716,7 @@ static int nouveau_drm_probe(struct pci_dev *pdev,
> > if (ret)
> > goto fail_drm_dev_init;
> >
> > +   quirk_broken_nv_runpm(drm_dev);
> > return 0;
> >
> >  fail_drm_dev_init:
> > @@ -735,6 +753,9 @@ nouveau_drm_remove(struct pci_dev *pdev)
> >  {
> > struct drm_device *dev = pci_get_drvdata(pdev);
> >
> > +   /* If we disabled PCI power management, restore it */
> > +   if (STASH->pm_cap)
> > +   pdev->pm_cap = STASH->pm_cap;
> > nouveau_drm_device_remove(dev);
> > pci_disable_device(pdev);
> >  }
> >
> 


Re: [Nouveau] [PATCH v7] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2020-03-20 Thread Bjorn Helgaas
On Tue, Mar 10, 2020 at 08:26:27PM +0100, Karol Herbst wrote:
> Fixes the infamous 'runtime PM' bug many users are facing on Laptops with
> Nvidia Pascal GPUs by skipping said PCI power state changes on the GPU.
> 
> Depending on the used kernel there might be messages like those in dmesg:
> 
> "nouveau :01:00.0: Refused to change power state, currently in D3"
> "nouveau :01:00.0: can't change power state from D3cold to D0 (config
> space inaccessible)"
> followed by backtraces of kernel crashes or timeouts within nouveau.
> 
> It's still unknown why this issue exists, but this is a reliable workaround
> and solves a very annoying issue for users having to choose between a
> crashing kernel or higher power consumption of their laptops.

Thanks for the bugzilla link.  The bugzilla mentions lots of mailing
list discussion.  Can you include links to some of that?

IIUC this basically just turns off PCI power management for the GPU.
Can you do that with something like the following?  I don't know
anything about DRM, so I don't know where you could save the pm_cap,
but I'm sure the driver could keep it somewhere.


diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c 
b/drivers/gpu/drm/nouveau/nouveau_drm.c
index b65ae817eabf..2ad825e8891c 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -618,6 +618,23 @@ nouveau_drm_device_fini(struct drm_device *dev)
kfree(drm);
 }
 
+static void quirk_broken_nv_runpm(struct drm_device *drm_dev)
+{
+   struct pci_dev *pdev = drm_dev->pdev;
+   struct pci_dev *bridge = pci_upstream_bridge(pdev);
+
+   if (!bridge || bridge->vendor != PCI_VENDOR_ID_INTEL)
+   return;
+
+   switch (bridge->device) {
+   case 0x1901:
+   STASH->pm_cap = pdev->pm_cap;
+   pdev->pm_cap = 0;
+   NV_INFO(drm_dev, "Disabling PCI power management to avoid 
bug\n");
+   break;
+   }
+}
+
 static int nouveau_drm_probe(struct pci_dev *pdev,
 const struct pci_device_id *pent)
 {
@@ -699,6 +716,7 @@ static int nouveau_drm_probe(struct pci_dev *pdev,
if (ret)
goto fail_drm_dev_init;
 
+   quirk_broken_nv_runpm(drm_dev);
return 0;
 
 fail_drm_dev_init:
@@ -735,6 +753,9 @@ nouveau_drm_remove(struct pci_dev *pdev)
 {
struct drm_device *dev = pci_get_drvdata(pdev);
 
+   /* If we disabled PCI power management, restore it */
+   if (STASH->pm_cap)
+   pdev->pm_cap = STASH->pm_cap;
nouveau_drm_device_remove(dev);
pci_disable_device(pdev);
 }
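(STASH above is a placeholder, as noted.  One illustration-only way to
make the sketch compile, assuming a single GPU:

	struct nv_runpm_stash {
		int pm_cap;	/* saved pdev->pm_cap */
	};
	static struct nv_runpm_stash nv_stash;
	#define STASH (&nv_stash)

In a real patch the saved value would live in the driver's per-device
structure, e.g. struct nouveau_drm.)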


Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-11-19 Thread Bjorn Helgaas
[+cc Daniel, Vidya, Thierry; thread starts at
https://lore.kernel.org/r/20191017121901.13699-1-kher...@redhat.com]

On Tue, Nov 19, 2019 at 11:26:45PM +0100, Karol Herbst wrote:
> On Tue, Nov 19, 2019 at 10:50 PM Bjorn Helgaas  wrote:
> > On Thu, Oct 17, 2019 at 02:19:01PM +0200, Karol Herbst wrote:
> > > Fixes state transitions of Nvidia Pascal GPUs from D3cold into
> > > higher device states.
> > > ...

> > > + *  - kernel crashes, as all pci reads return -1, which most code
> > > + *isn't able to handle well enough.
> >
> > More details about these crashes would be useful as we look at
> > places that *should* be able to handle errors like this.
> 
> makes sense, I could, orthogonal to this, make the code more robust
> if we hit issues like this in the future. What I am mostly wondering
> about is why the PCI core doesn't give up if the device doesn't come
> back from D3cold? It sounds like, that the most sane thing to do
> here is to just give up and fail runtime_resume and report errors
> back to userspace trying to make use of the devices.

It's possible there's something the core could do here.  It's
interesting that we have at least three issues in this area in this
release:

  1) This NVIDIA GPU issue
  2) Daniel's issue with AMD Ryzen5/7 XHCI power-on
 (https://lore.kernel.org/r/20191014061355.29072-1-dr...@endlessm.com)
  3) Vidya's issue with Intel 750 NVMe power-on
 (https://lore.kernel.org/r/20191118172310.21373-1-vid...@nvidia.com)

Vidya's current patch is the most generic (calling pci_dev_wait() from
pci_power_up()).  That will print a warning if the device doesn't
respond, but we still don't go as far as returning errors to
runtime_resume.

This is definitely something we should consider improving in the
future.
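A sketch of the direction that points (hypothetical helper; the in-tree
pci_dev_wait() polls similarly, but today we only warn instead of failing
the resume):

	/* Poll the vendor ID until the device answers; config reads
	 * return all-ones while the device is inaccessible. */
	static int wait_for_device_ready(struct pci_dev *dev, int timeout_ms)
	{
		u32 id;

		for (; timeout_ms > 0; timeout_ms -= 10) {
			pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
			if (id != 0xffffffff)
				return 0;
			msleep(10);
		}
		return -ENODEV;	/* would let runtime_resume report failure */
	}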

> > > + *  - sudden shutdowns, as the kernel identified an unrecoverable error 
> > > after
> > > + *userspace tries to access the GPU.
> >
> > This doesn't fit with the others and more details might be
> > informative here as well.
> 
> yeah.. I try to get more infos on that. But at least for me (and it
> might be a distribution thing) if I execute lspci, the system shuts
> down, or at least tries to and might fail.

The lspci doesn't need to be run after the failure occurs.  You can do it
immediately after boot.

If you can capture any part of the dmesg or console log of these
sudden shutdowns, that's all I'm interested in at this point.

Bjorn

Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-11-19 Thread Bjorn Helgaas
[+cc Dave]

On Thu, Oct 17, 2019 at 02:19:01PM +0200, Karol Herbst wrote:
> Fixes state transitions of Nvidia Pascal GPUs from D3cold into higher device
> states.
> 
> v2: convert to pci_dev quirk
> put a proper technical explanation of the issue as a in-code comment
> v3: disable it only for certain combinations of intel and nvidia hardware
> v4: simplify quirk by setting flag on the GPU itself

I have zero confidence that we understand the real problem, but we do
need to do something with this.  I'll merge it for v5.5 if we get the
minor procedural stuff below straightened out.

> Signed-off-by: Karol Herbst 
> Cc: Bjorn Helgaas 
> Cc: Lyude Paul 
> Cc: Rafael J. Wysocki 
> Cc: Mika Westerberg 
> Cc: linux-...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Cc: dri-de...@lists.freedesktop.org
> Cc: nouveau@lists.freedesktop.org
> ---
>  drivers/pci/pci.c|  7 ++
>  drivers/pci/quirks.c | 53 
>  include/linux/pci.h  |  1 +
>  3 files changed, 61 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index b97d9e10c9cc..02e71e0bcdd7 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -850,6 +850,13 @@ static int pci_raw_set_power_state(struct pci_dev *dev, 
> pci_power_t state)
>  || (state == PCI_D2 && !dev->d2_support))
>   return -EIO;
>  
> + /*
> +  * check if we have a bad combination of bridge controller and nvidia
> + * GPU, see quirk_broken_nv_runpm for more info

Whitespace damage.  Capitalized incorrectly (see other comments
nearby).

> +  */
> + if (state != PCI_D0 && dev->broken_nv_runpm)
> + return 0;
> +
>   pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
>  
>   /*
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 44c4ae1abd00..0006c9e37b6f 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -5268,3 +5268,56 @@ static void 
> quirk_reset_lenovo_thinkpad_p50_nvgpu(struct pci_dev *pdev)
>  DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, 0x13b1,
> PCI_CLASS_DISPLAY_VGA, 8,
> quirk_reset_lenovo_thinkpad_p50_nvgpu);
> +
> +/*
> + * Some Intel PCIe bridges cause devices to disappear from the PCIe bus after
> + * those were put into D3cold state if they were put into a non D0 PCI PM
> + * device state before doing so.

A device in D3cold is off the bus by definition.

IIUC the problem is that the sequence D0 -> D3hot -> D3cold -> D0 for
the GPU fails in the transition back to D0, while D0 -> D3cold -> D0
works fine.

So I guess the problem is that we can put the device in D3cold with no
problem, but if we put in D3hot before going to D3cold, the device
never comes back to D0.  Right?

> + * This leads to various issue different issues which all manifest 
> differently,

s/issue different//

Actually, I think there's only one underlying issue with several
manifestations.

> + * but have the same root cause:
> + *  - AIML code execution hits an infinite loop (as the coe waits on device
> + *memory to change).

s/AIML/AML/
s/coe/code/

> + *  - kernel crashes, as all pci reads return -1, which most code isn't able
> + *to handle well enough.

s/pci/PCI/

More details about these crashes would be useful as we look at places
that *should* be able to handle errors like this.

> + *  - sudden shutdowns, as the kernel identified an unrecoverable error after
> + *userspace tries to access the GPU.

This doesn't fit with the others and more details might be
informative here as well.

> + * In all cases dmesg will contain at least one line like this:
> + * 'nouveau 0000:01:00.0: Refused to change power state, currently in D3'
> + * followed by a lot of nouveau timeouts.
> + *
> + * ACPI code writes bit 0x80 to the not documented PCI register 0x248 of the
> + * PCIe bridge controller in order to power down the GPU.
> + * Nonetheless, there are other code paths inside the ACPI firmware which use
> + * other registers, which seem to work fine:
> + *  - 0xbc bit 0x20 (publicly available documentation claims 'reserved')
> + *  - 0xb0 bit 0x10 (link disable)

All these register addresses are device-specific, so they're useless
without identifying the device.  "lspci -vvnn" output would let us at
least connect this with something.  It would be nice to have that info
archived along with your acpidump and python repro scripts in a
bugzilla with the URL in the commit log.

These are likely in PCI capabilities.  If I make the leap of assuming
the "link disable" bit is PCI_EXP_LNKCTL_LD, that would mean the Link
Control register is at 0xb0 and the register at 0xbc would be the Root

Re: [Nouveau] [PATCH v3] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-10-16 Thread Bjorn Helgaas
On Wed, Oct 16, 2019 at 11:48:22PM +0200, Karol Herbst wrote:
> On Wed, Oct 16, 2019 at 11:37 PM Bjorn Helgaas  wrote:
> > On Wed, Oct 16, 2019 at 09:18:32PM +0200, Karol Herbst wrote:
> > > but setting the PCI_DEV_FLAGS_NO_D3 flag does prevent using the
> > > platform means of putting the device into D3cold, right? That's
> > > actually what should still happen, just the D3hot step should be
> > > skipped.
> >
> > If I understand correctly, when we put a device in D3cold on an ACPI
> > system, we do something like this:
> >
> >   pci_set_power_state(D3cold)
> > if (PCI_DEV_FLAGS_NO_D3)
> >   return 0   <-- nothing at all if 
> > quirked
> > pci_raw_set_power_state
> >   pci_write_config_word(PCI_PM_CTRL, D3hot)  <-- set to D3hot
> > __pci_complete_power_transition(D3cold)
> >   pci_platform_power_transition(D3cold)
> > platform_pci_set_power_state(D3cold)
> >   acpi_pci_set_power_state(D3cold)
> > acpi_device_set_power(ACPI_STATE_D3_COLD)
> >   ...
> > acpi_evaluate_object("_OFF") <-- set to D3cold
> >
> > I did not understand the connection with platform (ACPI) power
> > management from your patch.  It sounds like you want this entire path
> > except that you want to skip the PCI_PM_CTRL write?
> >
> 
> exactly. I am running with this workaround for a while now and never
> had any fails with it anymore. The GPU gets turned off correctly and I
> see the same power savings, just that the GPU can be powered on again.
> 
> > That seems like something Rafael should weigh in on.  I don't know
> > why we set the device to D3hot with PCI_PM_CTRL before using the ACPI
> > methods, and I don't know what the effect of skipping that is.  It
> > seems a little messy to slice out this tiny piece from the middle, but
> > maybe it makes sense.
> >
> 
> afaik when I was talking with others in the past about it, Windows is
> doing that before using ACPI calls, but maybe they have some similar
> workarounds for certain intel bridges as well? I am sure it affects
> more than the one I am blacklisting here, but I'd rather check
> each device before blacklisting all Kaby Lake and Sky Lake bridges (as
> those are the ones where this issue can be observed).

From a quick look at the ACPI spec, I didn't see conditions like "OSPM
must put PCI devices in D3hot before executing _OFF".  But obviously
there's *some* reason and I probably just missed it.

> Sadly we had no luck getting any information about such workaround out
> of Nvidia or Intel.

I'm not surprised; it doesn't seem like we really have the details
needed to get to a root cause yet.  I think what we really need is a
PCIe analyzer trace to see what happens when the device "falls off the
bus".

Bjorn

Re: [Nouveau] [PATCH v3] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-10-16 Thread Bjorn Helgaas
[+cc linux-acpi]

On Wed, Oct 16, 2019 at 09:18:32PM +0200, Karol Herbst wrote:
> but setting the PCI_DEV_FLAGS_NO_D3 flag does prevent using the
> platform means of putting the device into D3cold, right? That's
> actually what should still happen, just the D3hot step should be
> skipped.

If I understand correctly, when we put a device in D3cold on an ACPI
system, we do something like this:

  pci_set_power_state(D3cold)
if (PCI_DEV_FLAGS_NO_D3)
  return 0   <-- nothing at all if quirked
pci_raw_set_power_state
  pci_write_config_word(PCI_PM_CTRL, D3hot)  <-- set to D3hot
__pci_complete_power_transition(D3cold)
  pci_platform_power_transition(D3cold)
platform_pci_set_power_state(D3cold)
  acpi_pci_set_power_state(D3cold)
acpi_device_set_power(ACPI_STATE_D3_COLD)
  ...
acpi_evaluate_object("_OFF") <-- set to D3cold

I did not understand the connection with platform (ACPI) power
management from your patch.  It sounds like you want this entire path
except that you want to skip the PCI_PM_CTRL write?

That seems like something Rafael should weigh in on.  I don't know
why we set the device to D3hot with PCI_PM_CTRL before using the ACPI
methods, and I don't know what the effect of skipping that is.  It
seems a little messy to slice out this tiny piece from the middle, but
maybe it makes sense.

> On Wed, Oct 16, 2019 at 9:14 PM Bjorn Helgaas  wrote:
> >
> > On Wed, Oct 16, 2019 at 04:44:49PM +0200, Karol Herbst wrote:
> > > Fixes state transitions of Nvidia Pascal GPUs from D3cold into higher 
> > > device
> > > states.
> > >
> > > v2: convert to pci_dev quirk
> > > put a proper technical explanation of the issue as a in-code comment
> > > v3: disable it only for certain combinations of intel and nvidia hardware
> > >
> > > Signed-off-by: Karol Herbst 
> > > Cc: Bjorn Helgaas 
> > > Cc: Lyude Paul 
> > > Cc: Rafael J. Wysocki 
> > > Cc: Mika Westerberg 
> > > Cc: linux-...@vger.kernel.org
> > > Cc: linux...@vger.kernel.org
> > > Cc: dri-de...@lists.freedesktop.org
> > > Cc: nouveau@lists.freedesktop.org
> > > ---
> > >  drivers/pci/pci.c| 11 ++
> > >  drivers/pci/quirks.c | 52 
> > >  include/linux/pci.h  |  1 +
> > >  3 files changed, 64 insertions(+)
> > >
> > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > index b97d9e10c9cc..8e056eb7e6ff 100644
> > > --- a/drivers/pci/pci.c
> > > +++ b/drivers/pci/pci.c
> > > @@ -805,6 +805,13 @@ static inline bool platform_pci_bridge_d3(struct 
> > > pci_dev *dev)
> > >   return pci_platform_pm ? pci_platform_pm->bridge_d3(dev) : false;
> > >  }
> > >
> > > +static inline bool parent_broken_child_pm(struct pci_dev *dev)
> > > +{
> > > + if (!dev->bus || !dev->bus->self)
> > > + return false;
> > > + return dev->bus->self->broken_nv_runpm && dev->broken_nv_runpm;
> > > +}
> > > +
> > >  /**
> > >   * pci_raw_set_power_state - Use PCI PM registers to set the power state 
> > > of
> > >   *given PCI device
> > > @@ -850,6 +857,10 @@ static int pci_raw_set_power_state(struct pci_dev 
> > > *dev, pci_power_t state)
> > >  || (state == PCI_D2 && !dev->d2_support))
> > >   return -EIO;
> > >
> > > + /* check if the bus controller causes issues */
> > > + if (state != PCI_D0 && parent_broken_child_pm(dev))
> > > + return 0;
> > > +
> > >   pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> > >
> > >   /*
> > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > > index 44c4ae1abd00..c2f20b745dd4 100644
> > > --- a/drivers/pci/quirks.c
> > > +++ b/drivers/pci/quirks.c
> > > @@ -5268,3 +5268,55 @@ static void 
> > > quirk_reset_lenovo_thinkpad_p50_nvgpu(struct pci_dev *pdev)
> > >  DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, 0x13b1,
> > > PCI_CLASS_DISPLAY_VGA, 8,
> > > quirk_reset_lenovo_thinkpad_p50_nvgpu);
> > > +
> > > +/*
> > > + * Some Intel PCIe bridges cause devices to disappear from the PCIe bus 
> > > after
> > > + * those were put into D3cold state if they were put into a non D0 PCI PM

Re: [Nouveau] [PATCH v3] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-10-16 Thread Bjorn Helgaas
On Wed, Oct 16, 2019 at 04:44:49PM +0200, Karol Herbst wrote:
> Fixes state transitions of Nvidia Pascal GPUs from D3cold into higher device
> states.
> 
> v2: convert to pci_dev quirk
> put a proper technical explanation of the issue as a in-code comment
> v3: disable it only for certain combinations of intel and nvidia hardware
> 
> Signed-off-by: Karol Herbst 
> Cc: Bjorn Helgaas 
> Cc: Lyude Paul 
> Cc: Rafael J. Wysocki 
> Cc: Mika Westerberg 
> Cc: linux-...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Cc: dri-de...@lists.freedesktop.org
> Cc: nouveau@lists.freedesktop.org
> ---
>  drivers/pci/pci.c| 11 ++
>  drivers/pci/quirks.c | 52 
>  include/linux/pci.h  |  1 +
>  3 files changed, 64 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index b97d9e10c9cc..8e056eb7e6ff 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -805,6 +805,13 @@ static inline bool platform_pci_bridge_d3(struct pci_dev 
> *dev)
>   return pci_platform_pm ? pci_platform_pm->bridge_d3(dev) : false;
>  }
>  
> +static inline bool parent_broken_child_pm(struct pci_dev *dev)
> +{
> + if (!dev->bus || !dev->bus->self)
> + return false;
> + return dev->bus->self->broken_nv_runpm && dev->broken_nv_runpm;
> +}
> +
>  /**
>   * pci_raw_set_power_state - Use PCI PM registers to set the power state of
>   *given PCI device
> @@ -850,6 +857,10 @@ static int pci_raw_set_power_state(struct pci_dev *dev, 
> pci_power_t state)
>  || (state == PCI_D2 && !dev->d2_support))
>   return -EIO;
>  
> + /* check if the bus controller causes issues */
> + if (state != PCI_D0 && parent_broken_child_pm(dev))
> + return 0;
> +
>   pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
>  
>   /*
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 44c4ae1abd00..c2f20b745dd4 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -5268,3 +5268,55 @@ static void 
> quirk_reset_lenovo_thinkpad_p50_nvgpu(struct pci_dev *pdev)
>  DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, 0x13b1,
> PCI_CLASS_DISPLAY_VGA, 8,
> quirk_reset_lenovo_thinkpad_p50_nvgpu);
> +
> +/*
> + * Some Intel PCIe bridges cause devices to disappear from the PCIe bus after
> + * those were put into D3cold state if they were put into a non D0 PCI PM
> + * device state before doing so.
> + *
> + * This leads to various issue different issues which all manifest 
> differently,
> + * but have the same root cause:
> + *  - AIML code execution hits an infinite loop (as the coe waits on device
> + *memory to change).
> + *  - kernel crashes, as all pci reads return -1, which most code isn't able
> + *to handle well enough.
> + *  - sudden shutdowns, as the kernel identified an unrecoverable error after
> + *userspace tries to access the GPU.
> + *
> + * In all cases dmesg will contain at least one line like this:
> + * 'nouveau 0000:01:00.0: Refused to change power state, currently in D3'
> + * followed by a lot of nouveau timeouts.
> + *
> + * ACPI code writes bit 0x80 to the not documented PCI register 0x248 of the
> + * PCIe bridge controller in order to power down the GPU.
> + * Nonetheless, there are other code paths inside the ACPI firmware which use
> + * other registers, which seem to work fine:
> + *  - 0xbc bit 0x20 (publicly available documentation claims 'reserved')
> + *  - 0xb0 bit 0x10 (link disable)
> + * Changing the conditions inside the firmware by poking into the relevant
> + * addresses does resolve the issue, but it seemed to be ACPI private memory
> + * and not any device accessible memory at all, so there is no portable way 
> of
> + * changing the conditions.
> + *
> + * The only systems where this behavior can be seen are hybrid graphics
> + * laptops with a secondary Nvidia Pascal GPU. It cannot be ruled out that
> + * this issue only occurs in combination with the listed Intel PCIe bridge
> + * controllers and the mentioned GPUs, or if it's only a hw bug in the
> + * bridge controller.
> + *
> + * But because this issue was NOT seen on laptops with an Nvidia Pascal GPU
> + * and an Intel Coffee Lake SoC, there is a higher chance of the bug being
> + * in the bridge controller rather than in the GPU.
> + *
> + * This issue could not be reproduced on non-laptop systems.
> + */
> +
> +static void quirk_broken_nv_runpm(struct pci_dev *dev)
> +{
> + dev->broken_nv_runpm = true;
> +}
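
(The patch is cut off in the archive at this point. Since parent_broken_child_pm() above requires the flag on both the bridge and the client device, the quirk is presumably attached to both ends of the link. A sketch of the likely remainder, with the Kaby Lake bridge ID taken from the RFC version of this patch further down in this archive and the NVIDIA hook assumed:)

DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x1901, quirk_broken_nv_runpm);
DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
			      PCI_CLASS_DISPLAY_VGA, 8, quirk_broken_nv_runpm);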

Re: [Nouveau] [RFC PATCH] pci: prevent putting pcie devices into lower device states on certain intel bridges

2019-10-01 Thread Bjorn Helgaas
On Tue, Oct 01, 2019 at 06:21:28PM +0200, Karol Herbst wrote:
> On Tue, Oct 1, 2019 at 3:27 PM Bjorn Helgaas  wrote:
> > On Mon, Sep 30, 2019 at 06:36:12PM +0200, Karol Herbst wrote:
> > > On Mon, Sep 30, 2019 at 6:30 PM Mika Westerberg
> > >  wrote:
> > > >
> > > > On Mon, Sep 30, 2019 at 06:05:14PM +0200, Karol Herbst wrote:
> > > > > still happens with your patch applied. The machine simply gets shut 
> > > > > down.
> > > > >
> > > > > dmesg can be found here:
> > > > > https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt
> > > >
> > > > Looking your dmesg:
> > > >
> > > > Sep 30 17:24:27 kernel: nouveau :01:00.0: DRM: DCB version 4.1
> > > > Sep 30 17:24:27 kernel: nouveau :01:00.0: DRM: MM: using COPY for 
> > > > buffer copies
> > > > Sep 30 17:24:27 kernel: [drm] Initialized nouveau 1.3.1 20120801 for 
> > > > :01:00.0 on minor 1
> > > >
> > > > I would assume it runtime suspends here. Then it wakes up because of PCI
> > > > access from userspace:
> > > >
> > > > Sep 30 17:24:42 kernel: pci_raw_set_power_state: 56 callbacks suppressed
> > > >
> > > > and for some reason it does not get resumed properly. There are also few
> > > > warnings from ACPI that might be relevant:
> > > >
> > > > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 
> > > > type mismatch - Found [Buffer], ACPI requires [Package] 
> > > > (20190509/nsarguments-59)
> > > > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: 
> > > > Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] 
> > > > (20190509/nsarguments-59)
> > >
> > > afaik this is the case for essentially every laptop out there.
> >
> > I think we should look into this a little bit.
> > acpi_ns_check_argument_types() checks the argument type and prints
> > this message, but AFAICT it doesn't actually fix anything or prevent
> > execution of the method, so I have no idea what happens when we
> > actually execute the _DSM.
> 
> I can assure you that this warning happens on every single laptop out
> there with dual Nvidia graphics and it's essentially just a firmware
> bug. And it never caused any issues on any of the older laptops (or
> newest one for that matter).

Rafael, do you know anything about this?  If ACPI has some sort of
workaround so it can execute the method correctly anyway, maybe we
should remove or reword the warning?

Or if this does prevent execution of the method, maybe we need to add
a workaround since the problem is so prevalent in the field?

> > If we execute this _DSM as part of power management, and the _DSM
> > doesn't work right, it would be no surprise that we have problems.
> >
> > Maybe we could learn something by turning on ACPI_DB_PARSE output (see
> > Documentation/firmware-guide/acpi/debug.rst).
> >
> > You must have an acpidump already from all your investigation.  Can
> > you put it somewhere, e.g., bugzilla.kernel.org, and include a URL?
> 
> Will do so later, right now I am traveling to XDC and will have more
> time for that then.

Re: [Nouveau] [RFC PATCH] pci: prevent putting pcie devices into lower device states on certain intel bridges

2019-10-01 Thread Bjorn Helgaas
On Mon, Sep 30, 2019 at 06:36:12PM +0200, Karol Herbst wrote:
> On Mon, Sep 30, 2019 at 6:30 PM Mika Westerberg
>  wrote:
> >
> > On Mon, Sep 30, 2019 at 06:05:14PM +0200, Karol Herbst wrote:
> > > still happens with your patch applied. The machine simply gets shut down.
> > >
> > > dmesg can be found here:
> > > https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt
> >
> > Looking your dmesg:
> >
> > Sep 30 17:24:27 kernel: nouveau :01:00.0: DRM: DCB version 4.1
> > Sep 30 17:24:27 kernel: nouveau :01:00.0: DRM: MM: using COPY for 
> > buffer copies
> > Sep 30 17:24:27 kernel: [drm] Initialized nouveau 1.3.1 20120801 for 
> > :01:00.0 on minor 1
> >
> > I would assume it runtime suspends here. Then it wakes up because of PCI
> > access from userspace:
> >
> > Sep 30 17:24:42 kernel: pci_raw_set_power_state: 56 callbacks suppressed
> >
> > and for some reason it does not get resumed properly. There are also few
> > warnings from ACPI that might be relevant:
> >
> > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type 
> > mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 
> > type mismatch - Found [Buffer], ACPI requires [Package] 
> > (20190509/nsarguments-59)
> 
> afaik this is the case for essentially every laptop out there.

I think we should look into this a little bit.
acpi_ns_check_argument_types() checks the argument type and prints
this message, but AFAICT it doesn't actually fix anything or prevent
execution of the method, so I have no idea what happens when we
actually execute the _DSM.

If we execute this _DSM as part of power management, and the _DSM
doesn't work right, it would be no surprise that we have problems.

Maybe we could learn something by turning on ACPI_DB_PARSE output (see
Documentation/firmware-guide/acpi/debug.rst).

You must have an acpidump already from all your investigation.  Can
you put it somewhere, e.g., bugzilla.kernel.org, and include a URL?

Re: [Nouveau] [RFC PATCH] pci: prevent putting pcie devices into lower device states on certain intel bridges

2019-09-27 Thread Bjorn Helgaas
[+cc Rafael, Mika, linux-pm]

On Fri, Sep 27, 2019 at 04:44:21PM +0200, Karol Herbst wrote:
> Fixes runpm breakage mainly on Nvidia GPUs as they are not able to resume.

I don't know what runpm is.  Some userspace utility?  Module
parameter?

> Works perfectly with this workaround applied.
> 
> RFC comment:
> We are quite sure that a higher number of bridges are affected by this,
> but I was only testing it on my own machine for now.
> 
> I've stress-tested runpm by doing 5000 runpm cycles with that patch applied
> and never saw it fail.
> 
> I mainly wanted to get a discussion going on whether that's a feasible
> workaround indeed or if we need something better.
> 
> I am also sure that the nouveau driver itself isn't at fault, as I am able
> to reproduce the same issue by poking into some PCI registers on the PCIe
> bridge to put the GPU into D3cold as it's done in ACPI code.
> 
> I've written a little python script to reproduce this issue without the need
> of loading nouveau:
> https://raw.githubusercontent.com/karolherbst/pci-stub-runpm/master/nv_runpm_bug_test.py

Nice script, thanks for sharing it :)  I could learn a lot of useful
python by studying it.

> Signed-off-by: Karol Herbst 
> Cc: Bjorn Helgaas 
> Cc: Lyude Paul 
> Cc: linux-...@vger.kernel.org
> Cc: dri-de...@lists.freedesktop.org
> Cc: nouveau@lists.freedesktop.org
> ---
>  drivers/pci/pci.c | 39 +++
>  1 file changed, 39 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 088fcdc8d2b4..9dbd29ced1ac 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -799,6 +799,42 @@ static inline bool platform_pci_bridge_d3(struct pci_dev *dev)
>   return pci_platform_pm ? pci_platform_pm->bridge_d3(dev) : false;
>  }
>  
> +/*
> + * some intel bridges cause serious issues with runpm if the client device
> + * is put into D1/D2/D3hot before putting the client into D3cold via
> + * platform means (generally ACPI).

You mention Nvidia GPUs above, but I guess the same issue may affect
other devices?  I would really like to chase this down to a more
specific issue, e.g., a hardware defect with erratum, an ACPI defect,
or a Linux defect.  Without the specifics, this is just a band-aid.

I don't see any relevant requirements in the _OFF description, but I
don't know much about ACPI power control.

Your script allows several scenarios; I *guess* the one that causes
the problem is:

  - write 3 (D3hot) to GPU PowerState (PCIE_PM_REG == 0x64, I assume
PM Capability Control Register)
  - write 3 (D3hot) to bridge PowerState (0x84, I assume PM Capability
Control Register)
  - run _OFF on the power resource for the bridge

From your script I assume you do:

  - run _ON on the power resource for the bridge
  - write 0 (D0) to the bridge PowerState

You do *not* write the GPU PowerState (which we can't do if the GPU is
in D3cold).  Is there some assumption that it comes out of D3cold via
some other mechanism, e.g., is the _ON supposed to wake up the GPU?
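
(For illustration, a minimal userspace sketch of that write sequence. This is not Karol's script; the 0x64/0x84 PMCSR offsets and the 0000:01:00.0 / 0000:00:01.0 addresses are assumptions taken from the discussion above, and it needs root:)

#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Set the PowerState field (bits 1:0) of a PM Control/Status register by
 * reading and writing the device's sysfs config-space file. */
static int set_power_state(const char *cfg, off_t pmcsr, uint16_t state)
{
	uint16_t val;
	int fd = open(cfg, O_RDWR);

	if (fd < 0)
		return -1;
	if (pread(fd, &val, sizeof(val), pmcsr) != sizeof(val)) {
		close(fd);
		return -1;
	}
	val = (val & ~0x3) | (state & 0x3);
	if (pwrite(fd, &val, sizeof(val), pmcsr) != sizeof(val)) {
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}

int main(void)
{
	/* GPU first, then the bridge; the power resource _OFF would then be
	 * run by the platform (e.g. via runtime PM entering D3cold). */
	set_power_state("/sys/bus/pci/devices/0000:01:00.0/config", 0x64, 3);
	set_power_state("/sys/bus/pci/devices/0000:00:01.0/config", 0x84, 3);
	return 0;
}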

What exactly is the serious issue?  I guess it's that the rescan
doesn't detect the GPU, which means it's not responding to config
accesses?  Is there any timing component here, e.g., maybe we're
missing some delay like the ones Mika is adding to the reset paths?

> + *
> + * skipping this makes runpm work perfectly fine on such devices.
> + *
> + * As far as we know, only Skylake and Kaby Lake SoCs are affected.
> + */
> +static unsigned short intel_broken_d3_bridges[] = {
> + /* kbl */
> + 0x1901,
> +};
> +
> +static inline bool intel_broken_pci_pm(struct pci_bus *bus)
> +{
> + struct pci_dev *bridge;
> + int i;
> +
> + if (!bus || !bus->self)
> + return false;
> +
> + bridge = bus->self;
> + if (bridge->vendor != PCI_VENDOR_ID_INTEL)
> + return false;
> +
> + for (i = 0; i < ARRAY_SIZE(intel_broken_d3_bridges); i++) {
> + if (bridge->device == intel_broken_d3_bridges[i]) {
> + pci_err(bridge, "found broken intel bridge\n");

If this ends up being a hardware defect, we should use a quirk to set
a bit in the pci_dev once, as we do for broken_intx_masking and
similar bits.

> + return true;
> + }
> + }
> +
> + return false;
> +}
> +
>  /**
>   * pci_raw_set_power_state - Use PCI PM registers to set the power state of
>   *given PCI device
> @@ -827,6 +863,9 @@ static int pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state)
>   if (state < PCI_D0 || state > PCI_D3hot)
>   return -EINVAL;
>  
> + if (state != PCI_D0 && intel_broken_pci_pm(dev->bus))
> + return 0;

Re: [PATCH] PCI: Use pci_reset_bus() in quirk_reset_lenovo_thinkpad_50_nvgpu()

2019-08-15 Thread Bjorn Helgaas
On Thu, Aug 01, 2019 at 06:01:17PM -0400, Lyude Paul wrote:
> Since quirk_nvidia_hda() was added there are now two Nvidia device
> functions on any laptop with an Nvidia GPU: the HDA controller, and the
> GPU itself. Unfortunately this has the side effect of breaking
> quirk_reset_lenovo_thinkpad_p50_nvgpu() since pci_reset_function() was
> using pci_parent_bus_reset() to reset the GPU's respective PCI bus, and
> pci_parent_bus_reset() does not work on buses which have more than a
> single device function present.
> 
> So, fix this by simply calling pci_reset_bus() instead which properly
> resets the GPU bus and all device functions under it, including both the
> GPU and the HDA controller.
> 
> Fixes: b516ea586d71 ("PCI: Enable NVIDIA HDA controllers")
> Cc: Lukas Wunner 
> Cc: Daniel Drake 
> Cc: Bjorn Helgaas 
> Cc: Aaron Plattner 
> Cc: Peter Wu 
> Cc: Ilia Mirkin 
> Cc: Karol Herbst 
> Cc: Maik Freudenberg 
> Cc: linux-...@vger.kernel.org
> Signed-off-by: Lyude Paul 

We merged b516ea586d71 for v5.3, so I applied this with Ben's ack to
for-linus for v5.3, thanks!

> ---
>  drivers/pci/quirks.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 208aacf39329..44c4ae1abd00 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -5256,7 +5256,7 @@ static void quirk_reset_lenovo_thinkpad_p50_nvgpu(struct pci_dev *pdev)
>*/
>   if (ioread32(map + 0x2240c) & 0x2) {
>   pci_info(pdev, FW_BUG "GPU left initialized by EFI, resetting\n");
> - ret = pci_reset_function(pdev);
> + ret = pci_reset_bus(pdev);
>   if (ret < 0)
>   pci_err(pdev, "Failed to reset GPU: %d\n", ret);
>   }
> -- 
> 2.21.0
> 


Re: [Nouveau] [PATCH v2] PCI: Expose hidden NVIDIA HDA controllers

2019-07-10 Thread Bjorn Helgaas
On Mon, Jul 08, 2019 at 01:17:44PM +0800, Daniel Drake wrote:
> From: Lukas Wunner 
> 
> The integrated HDA controller on Nvidia GPUs can be hidden with a bit in
> the GPU's config space. Information about this scheme was provided by
> NVIDIA on their forums.
> 
> Many laptops now ship with this device hidden, meaning that Linux users
> of affected platforms (where the HDMI connector comes off the NVIDIA GPU)
> cannot use HDMI audio functionality.
> 
> Some platforms have ACPI DSDT code that will make the device visible if
> the HDMI cable was connected at boot time, but this does not handle the
> hotplug case, and this limitation has also been confirmed under Windows.
> 
> Avoid this issue by exposing the HDMI audio device on device enumeration
> and resume.
> 
> The GPU and HDA controller are two functions of the same PCI device
> (VGA class device on function 0 and audio device on function 1).
> The multifunction flag in the GPU's Header Type register is cleared when
> the HDA controller is hidden and set if it's exposed, so reread the flag
> after exposing the HDA.
> 
> According to Ilia Mirkin, the HDA controller is only present on GPUs with
> PCI ID values from MCP89 onwards, so do not touch config space on older
> GPUs.
> 
> This quirk is limited to NVIDIA PCI devices with the VGA Controller
> device class. This is expected to correspond to product configurations
> where the NVIDIA GPU has connectors attached. Other products where the
> device class is 3D Controller are expected to correspond to configurations
> where the NVIDIA GPU is dedicated (dGPU) and has no connectors.
> 
> It's sensible to avoid exposing the HDA controller on dGPU setups,
> especially because we've seen cases where the PCI BARs are not set
> up correctly by the platform in this case, causing Linux to log
> errors if the device is visible. This assumption of device class
> accurately corresponding to product configuration is true for 6 of 6
> laptops recently checked at the Endless lab, and there are also signs of
> agreement checking the data from 74 previously tested products; however,
> Ilia Mirkin comments that he's seen cases where it is not true. Anyway, it
> looks like this quirk should fix audio support for the majority of
> affected users.
> 
> This commit takes inspiration from an earlier patch by Daniel Drake.
> 
> Link: https://devtalk.nvidia.com/default/topic/1024022
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=75985
> Cc: Aaron Plattner 
> Cc: Peter Wu 
> Cc: Ilia Mirkin 
> Cc: Karol Herbst 
> Cc: Maik Freudenberg 
> Signed-off-by: Lukas Wunner 
> Signed-off-by: Daniel Drake 

I applied this (slightly revised as below) to pci/misc and I think we
can still squeeze it in for v5.3.

My revisions:

  - Don't write the enable bit if it's already set.

  - Log a note when enabling the HDA.  I don't like writing
undocumented config bits in *every* current and future NVIDIA GPU,
so the note is just a hint that we're doing something slightly
risky.

  - Use "hdr_type & 0x80" to match the other places we set
pdev->multifunction.

  - Remove the commit log parts that don't seem relevant for future
maintenance and add the URL to the original posting.

Let me know if I broke anything.

commit b678f90a1a6f
Author: Lukas Wunner 
Date:   Mon Jul 8 13:17:44 2019 +0800

PCI: Enable NVIDIA HDA controllers

Many NVIDIA GPUs can be configured as either a single-function video device
or a multi-function device with video at function 0 and an HDA audio
controller at function 1.  The HDA controller can be enabled or disabled by
a bit in the function 0 config space.

Some BIOSes leave the HDA disabled, which means the HDMI connector from the
NVIDIA GPU may not work.  Sometimes the BIOS enables the HDA if an HDMI
cable is connected at boot time, but that doesn't handle hotplug cases.

Enable the HDA controller on device enumeration and resume and re-read the
header type, which tells us whether the GPU is a multi-function device.

This quirk is limited to NVIDIA PCI devices with the VGA Controller device
class.  This is expected to correspond to product configurations where the
NVIDIA GPU has connectors attached.  Other products where the device class
is 3D Controller are expected to correspond to configurations where the
NVIDIA GPU is dedicated (dGPU) and has no connectors.  See original post
(URL below) for more details.

This commit takes inspiration from an earlier patch by Daniel Drake.

Link: https://lore.kernel.org/r/20190708051744.24039-1-dr...@endlessm.com
Link: https://devtalk.nvidia.com/default/topic/1024022
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=75985
Signed-off-by: Lukas Wunner 
Signed-o

Re: [Nouveau] [PATCH] PCI: Expose hidden NVIDIA HDA controllers

2019-06-13 Thread Bjorn Helgaas
[+cc Rafael, Martin, zigarrre]

On Thu, Jun 13, 2019 at 02:35:14PM +0800, Daniel Drake wrote:
> From: Lukas Wunner 
> 
> The integrated HDA controller on Nvidia GPUs can be hidden with a bit in
> the GPU's config space. Information about this scheme was provided by
> NVIDIA on their forums.
> 
> Many laptops now ship with this device hidden, meaning that Linux users
> of affected platforms (where the HDMI connector comes off the NVIDIA GPU)
> cannot use HDMI audio functionality.
> 
> Avoid this issue by exposing the HDMI audio device on device enumeration
> and resume.
> 
> The GPU and HDA controller are two functions of the same PCI device
> (VGA class device on function 0 and audio device on function 1).
> The multifunction flag in the GPU's Header Type register is cleared when
> the HDA controller is hidden and set if it's exposed, so reread the flag
> after exposing the HDA.

Is it possible that this works on Windows but not Linux because they
handle ACPI hotplug slightly differently?

Martin did some nice debug [1] and found that _DSM, _PS0, and _PS3
functions write the config bit at 0x488.  The dmesg log [2] from
zigarrre seems to show that _OSC failed (which I *think* means we
won't use pciehp) and that there's a slot registered by acpiphp.

Maybe this works on Windows via an ACPI hotplug event that runs AML to
flip the 0x488 bit and rescans the bus?  That would make more sense to
me than the idea that the Nvidia driver does it.  I could imagine the
driver flipping the bit, but the bus rescan seems like it would be out
of the driver's purview.

The dmesg log also shows some _DSM warnings; are these correlated with
cable plug/unplug?  Is there some acpiphp debug we can turn on to get
more visibility into hotplug events and how we handle them?

[1] https://bugs.freedesktop.org/show_bug.cgi?id=75985#c32
[2] https://bugs.freedesktop.org/attachment.cgi?id=128640
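
(On the acpiphp debug question: the hotplug code logs via pr_debug()-style calls, so booting with dynamic debug enabled for those files should surface the hotplug events. The exact glob below is an assumption; see Documentation/admin-guide/dynamic-debug-howto.rst:)

	dyndbg="file drivers/pci/hotplug/* +p"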

> According to Ilia Mirkin, the HDA controller is only present from MCP89
> onward, so do not touch config space on older GPUs.
> 
> This quirk is limited to NVIDIA PCI devices with the VGA Controller
> device class. This is expected to correspond to product configurations
> where the NVIDIA GPU has connectors attached. Other products where the
> device class is 3D Controller are expected to correspond to configurations
> where the NVIDIA GPU is dedicated (dGPU) and has no connectors.
> 
> It's sensible to avoid exposing the HDA controller on dGPU setups,
> especially because we've seen cases where the PCI BARs are not set
> up correctly by the platform in this case, causing Linux to log
> errors if the device is visible. This assumption of device class
> accurately corresponding to product configuration is true for 6 of 6
> laptops recently checked at the Endless lab, and there are also signs of
> agreement checking the data from 74 previously tested products; however,
> Ilia Mirkin comments that he's seen cases where it is not true. Anyway, it
> looks like this quirk should fix audio support for the majority of
> affected users.
> 
> This commit takes inspiration from an earlier patch by Daniel Drake.
> 
> Link: https://devtalk.nvidia.com/default/topic/1024022
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=75985
> Cc: Aaron Plattner 
> Cc: Peter Wu 
> Cc: Ilia Mirkin 
> Cc: Karol Herbst 
> Cc: Maik Freudenberg 
> Signed-off-by: Lukas Wunner 
> Tested-by: Daniel Drake 
> ---
>  drivers/pci/quirks.c| 28 
>  include/linux/pci_ids.h |  1 +
>  2 files changed, 29 insertions(+)
> 
> Submitting Lukas's patch from over a year ago.
> 
> It got held up before on patch dependencies (which are now all merged)
> and also the concern around the device class assumption not being true
> in all cases. However, there hasn't been any progress towards finding a
> better approach, and we don't have any logs to hand from the cases where
> this isn't true, so I think we should go with this approach which will
> work in the (vast?) majority of cases.
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 0f16acc323c6..52046b517e2e 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -4971,6 +4971,34 @@ DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_AMD, 
> PCI_ANY_ID,
>  DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
> PCI_CLASS_MULTIMEDIA_HD_AUDIO, 8, quirk_gpu_hda);
>  
> +/*
> + * Many laptop BIOSes hide the integrated HDA controller on NVIDIA GPUs
> + * via a special bit. This prevents Linux from seeing and using it.
> + * Unhide it here.
> + * https://devtalk.nvidia.com/default/topic/1024022
> + */
> +static void quirk_nvidia_hda(struct pci_dev *gpu)
> +{
> + u8 hdr_type;
> + u32 val;
> +
> + /* there was no integrated HDA controller before MCP89 */
> + if (gpu->device < PCI_DEVICE_ID_NVIDIA_GEFORCE_320M)
> + return;
> +
> + /* bit 25 at offset 0x488 hides or exposes the HDA controller */
> + 
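
(The archive cuts the patch off here. Judging by the description above and Bjorn's revision notes in the 2019-07-10 thread earlier in this archive, the body presumably continues roughly like this; a sketch, not the verbatim patch, and the fixup hooks in particular are assumptions:)

	pci_read_config_dword(gpu, 0x488, &val);
	pci_write_config_dword(gpu, 0x488, val | BIT(25));

	/* The GPU becomes a multi-function device once the HDA is exposed,
	 * so re-read the multifunction flag from the Header Type register. */
	pci_read_config_byte(gpu, PCI_HEADER_TYPE, &hdr_type);
	gpu->multifunction = !!(hdr_type & 0x80);
}
DECLARE_PCI_FIXUP_CLASS_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
			       PCI_CLASS_DISPLAY_VGA, 8, quirk_nvidia_hda);
DECLARE_PCI_FIXUP_CLASS_RESUME_EARLY(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
				     PCI_CLASS_DISPLAY_VGA, 8, quirk_nvidia_hda);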

Re: [Nouveau] [PATCH v2 4/4] pci: save the boot pcie link speed and restore it on fini

2019-06-03 Thread Bjorn Helgaas
256 bytes, MaxReadReq 128 bytes
> > >DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq-
> > > AuxPwr- TransPend-
> > >LnkCap: Port #2, Speed 8GT/s, Width x16, ASPM L0s L1,
> > > Exit Latency L0s <256ns, L1 <8us
> > >ClockPM- Surprise- LLActRep- BwNot+ ASPMOptComp+
> > >LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
> > >ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > >LnkSta: Speed 2.5GT/s (downgraded), Width x16 (ok)
> > >TrErr- Train- SlotClk+ DLActive- BWMgmt+ ABWMgmt+
> > >SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd-
> > > HotPlug- Surprise-
> > >Slot #1, PowerLimit 75.000W; Interlock- NoCompl+
> > >SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt-
> > > HPIrq- LinkChg-
> > >Control: AttnInd Unknown, PwrInd Unknown,
> > > Power- Interlock-
> > >SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt-
> > > PresDet- Interlock-
> > >Changed: MRL- PresDet+ LinkState-
> > >RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal-
> > > PMEIntEna- CRSVisible-
> > >RootCap: CRSVisible-
> > >RootSta: PME ReqID , PMEStatus- PMEPending-
> > >DevCap2: Completion Timeout: Not Supported,
> > > TimeoutDis-, LTR+, OBFF Via WAKE# ARIFwd-
> > > AtomicOpsCap: Routing- 32bit+ 64bit+ 128bitCAS+
> > >DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-,
> > > LTR+, OBFF Via WAKE# ARIFwd-
> > > AtomicOpsCtl: ReqEn- EgressBlck-
> > >LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- 
> > > SpeedDis-
> > > Transmit Margin: Normal Operating Range,
> > > EnterModifiedCompliance- ComplianceSOS-
> > > Compliance De-emphasis: -6dB
> > >LnkSta2: Current De-emphasis Level: -6dB,
> > > EqualizationComplete-, EqualizationPhase1-
> > > EqualizationPhase2-, EqualizationPhase3-,
> > > LinkEqualizationRequest-
> > >Capabilities: [100 v1] Virtual Channel
> > >Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
> > >Arb:Fixed- WRR32- WRR64- WRR128-
> > >Ctrl:   ArbSelect=Fixed
> > >Status: InProgress-
> > >VC0:Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
> > >Arb:Fixed+ WRR32- WRR64- WRR128- TWRR128- 
> > > WRR256-
> > >Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
> > >Status: NegoPending+ InProgress-
> > >Capabilities: [140 v1] Root Complex Link
> > >Desc:   PortNumber=02 ComponentID=01 EltType=Config
> > >Link0:  Desc:   TargetPort=00 TargetComponent=01
> > > AssocRCRB- LinkType=MemMapped LinkValid+
> > >Addr:   fed19000
> > >Capabilities: [d94 v1] Secondary PCI Express 
> > >Kernel driver in use: pcieport
> > > 00: 86 80 01 19 07 00 10 00 05 00 04 06 00 00 81 00
> > > 10: 00 00 00 00 00 00 00 00 00 01 01 00 e0 e0 00 20
> > > 20: 00 ec 00 ed 01 c0 f1 d1 00 00 00 00 00 00 00 00
> > > 30: 00 00 00 00 88 00 00 00 00 00 00 00 ff 01 10 00
> > > 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > 70: 00 00 00 00 00 00 00 00 00 62 17 00 00 00 00 0a
> > > 80: 01 90 03 c8 08 00 00 00 0d 80 00 00 28 10 be 07
> > > 90: 05 a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > a0: 10 00 42 01 01 80 00 00 20 00 00 00 03 ad 61 02
> > > b0: 40 00 01 d1 80 25 0c 00 00 00 08 00 00 00 00 00
> > > c0: 00 00 00 00 80 0b 08 00 00 64 00 00 0e 00 00 00
> > > d0: 43 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > f0: 00 40 01 00 4e 01 01 22 00 00 00 00 e0 00 10 00
> > >
> > > lspci -vvxxx -s 1:00.00
> > > 01:00.0 3D controller: NVIDIA Corporation GP107M [GeForce GTX 1050
> > > Mobile] (rev ff) (prog-if ff)
> > >!!! Unknown header type 7f
> > >Kernel driver

Re: [Nouveau] [PATCH v2 4/4] pci: save the boot pcie link speed and restore it on fini

2019-05-21 Thread Bjorn Helgaas
On Tue, May 21, 2019 at 03:28:48PM +0200, Karol Herbst wrote:
> On Tue, May 21, 2019 at 3:11 PM Bjorn Helgaas  wrote:
> > On Tue, May 21, 2019 at 12:30:38AM +0200, Karol Herbst wrote:
> > > On Mon, May 20, 2019 at 11:20 PM Bjorn Helgaas  wrote:
> > > > On Tue, May 07, 2019 at 10:12:45PM +0200, Karol Herbst wrote:
> > > > > Apparently things go south if we suspend the device with a different
> > > > > PCIe link speed set than the one it booted with. Fixes runtime
> > > > > suspend on my gp107.
> > > > >
> > > > > This all looks like some bug inside the pci subsystem and I would 
> > > > > prefer a
> > > > > fix there instead of nouveau, but maybe there is no real nice way of 
> > > > > doing
> > > > > that outside of drivers?
> > > >
> > > > I agree it would be nice to fix this in the PCI core if that's
> > > > feasible.
> > > >
> > > > It looks like this driver changes the PCIe link speed using some
> > > > device-specific mechanism.  When we suspend, we put the device in
> > > > D3cold, so it loses all its state.  When we resume, the link probably
> > > > comes up at the boot speed because nothing did that device-specific
> > > > magic to change it, so you probably end up with the link being slow
> > > > but the driver thinking it's configured to be fast, and maybe that
> > > > combination doesn't work.
> > > >
> > > > If it requires something device-specific to change that link speed, I
> > > > don't know how to put that in the PCI core.  But maybe I'm missing
> > > > something?
> > > >
> > > > Per the PCIe spec (r4.0, sec 1.2):
> > > >
> > > >   Initialization – During hardware initialization, each PCI Express
> > > >   Link is set up following a negotiation of Lane widths and frequency
> > > >   of operation by the two agents at each end of the Link. No firmware
> > > >   or operating system software is involved.
> > > >
> > > > I have been assuming that this means device-specific link speed
> > > > management is out of spec, but it seems pretty common that devices
> > > > don't come up by themselves at the fastest possible link speed.  So
> > > > maybe the spec just intends that devices can operate at *some* valid
> > > > speed.
> > >
> > > I would expect that devices kind of have to figure out what they can
> > > operate on and the operating system kind of just checks what the
> > > current state is and doesn't try to "restore" the old state or
> > > something?
> >
> > The devices at each end of the link negotiate the width and speed of
> > the link.  This is done directly by the hardware without any help from
> > the OS.
> >
> > The OS can read the current link state (Current Link Speed and
> > Negotiated Link Width, both in the Link Status register).  The OS has
> > very little control over that state.  It can't directly restore the
> > state because the hardware has to negotiate a width & speed that
> > result in reliable operation.
> >
> > > We don't do anything in the driver after the device was suspended. And
> > > the 0x88000 is a mirror of the PCI config space, but we also got some
> > > PCIe stuff at 0x8c000 which is used by newer GPUs for gen3 stuff
> > > essentially. I have no idea how much of this is part of the actual pci
> > > standard and how much is driver specific. But the driver also wants to
> > > have some control over the link speed as it's tied to performance
> > > states on the GPU.
> >
> > As far as I'm aware, there is no generic PCIe way for the OS to
> > influence the link width or speed.  If the GPU driver needs to do
> > that, it would be via some device-specific mechanism.
> >
> > > The big issue here is just that the GPU boots with 8.0, some on-gpu
> > > init mechanism decreases it to 2.5. If we suspend, the GPU or at least
> > > the communication with the controller is broken. But if we set it to
> > > the boot speed, resuming the GPU just works. So my assumption was,
> > > that _something_ (might it be the controller or the pci subsystem)
> > > tries to force to operate on an invalid link speed and because the
> > > bridge controller is actually powered down as well (as all children
> > > are in D3cold) I could imagine that something in the pci subsystem
> > > actually restores the state which lets the controller fail to
> > > establish communication again?

Re: [Nouveau] [PATCH v2 4/4] pci: save the boot pcie link speed and restore it on fini

2019-05-21 Thread Bjorn Helgaas
On Tue, May 21, 2019 at 12:30:38AM +0200, Karol Herbst wrote:
> On Mon, May 20, 2019 at 11:20 PM Bjorn Helgaas  wrote:
> > On Tue, May 07, 2019 at 10:12:45PM +0200, Karol Herbst wrote:
> > > Apparently things go south if we suspend the device with a different PCIe
> > > link speed set than the one it booted with. Fixes runtime suspend on my gp107.
> > >
> > > This all looks like some bug inside the pci subsystem and I would prefer a
> > > fix there instead of nouveau, but maybe there is no real nice way of doing
> > > that outside of drivers?
> >
> > I agree it would be nice to fix this in the PCI core if that's
> > feasible.
> >
> > It looks like this driver changes the PCIe link speed using some
> > device-specific mechanism.  When we suspend, we put the device in
> > D3cold, so it loses all its state.  When we resume, the link probably
> > comes up at the boot speed because nothing did that device-specific
> > magic to change it, so you probably end up with the link being slow
> > but the driver thinking it's configured to be fast, and maybe that
> > combination doesn't work.
> >
> > If it requires something device-specific to change that link speed, I
> > don't know how to put that in the PCI core.  But maybe I'm missing
> > something?
> >
> > Per the PCIe spec (r4.0, sec 1.2):
> >
> >   Initialization – During hardware initialization, each PCI Express
> >   Link is set up following a negotiation of Lane widths and frequency
> >   of operation by the two agents at each end of the Link. No firmware
> >   or operating system software is involved.
> >
> > I have been assuming that this means device-specific link speed
> > management is out of spec, but it seems pretty common that devices
> > don't come up by themselves at the fastest possible link speed.  So
> > maybe the spec just intends that devices can operate at *some* valid
> > speed.
> 
> I would expect that devices kind of have to figure out what they can
> operate on and the operating system kind of just checks what the
> current state is and doesn't try to "restore" the old state or
> something?

The devices at each end of the link negotiate the width and speed of
the link.  This is done directly by the hardware without any help from
the OS.

The OS can read the current link state (Current Link Speed and
Negotiated Link Width, both in the Link Status register).  The OS has
very little control over that state.  It can't directly restore the
state because the hardware has to negotiate a width & speed that
result in reliable operation.
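
(For reference, a minimal in-kernel sketch of reading that state; pcie_capability_read_word() and the PCI_EXP_LNKSTA_* masks are existing core APIs from <linux/pci.h> and pci_regs.h, while the wrapper function itself is only illustrative:)

	static u16 cur_link_speed(struct pci_dev *pdev, u8 *width)
	{
		u16 lnksta = 0;

		pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnksta);
		/* Negotiated Link Width, in lanes */
		*width = (lnksta & PCI_EXP_LNKSTA_NLW) >> PCI_EXP_LNKSTA_NLW_SHIFT;
		/* Current Link Speed: 1 = 2.5, 2 = 5.0, 3 = 8.0 GT/s */
		return lnksta & PCI_EXP_LNKSTA_CLS;
	}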

> We don't do anything in the driver after the device was suspended. And
> the 0x88000 is a mirror of the PCI config space, but we also got some
> PCIe stuff at 0x8c000 which is used by newer GPUs for gen3 stuff
> essentially. I have no idea how much of this is part of the actual pci
> standard and how much is driver specific. But the driver also wants to
> have some control over the link speed as it's tied to performance
> states on the GPU.

As far as I'm aware, there is no generic PCIe way for the OS to
influence the link width or speed.  If the GPU driver needs to do
that, it would be via some device-specific mechanism.

> The big issue here is just that the GPU boots with 8.0, some on-gpu
> init mechanism decreases it to 2.5. If we suspend, the GPU or at least
> the communication with the controller is broken. But if we set it to
> the boot speed, resuming the GPU just works. So my assumption was,
> that _something_ (might it be the controller or the pci subsystem)
> tries to force to operate on an invalid link speed and because the
> bridge controller is actually powered down as well (as all children
> are in D3cold) I could imagine that something in the pci subsystem
> actually restores the state which lets the controller fail to
> establish communication again?

  1) At boot-time, the Port and the GPU hardware negotiate 8.0 GT/s
 without OS/driver intervention.

  2) Some mechanism reduces link speed to 2.5 GT/s.  This probably
 requires driver intervention or at least some ACPI method.

  3) Suspend puts GPU into D3cold (powered off).

  4) Resume restores GPU to D0, and the Port and GPU hardware again
 negotiate 8.0 GT/s without OS/driver intervention, just like at
 initial boot.

  5) Now the driver thinks the GPU is at 2.5 GT/s but it's actually at
 8.0 GT/s.

Without knowing more about the transition to 2.5 GT/s, I can't guess
why the GPU wouldn't work after resume.  From a PCIe point of view,
the link is supposed to work and the device should be reachable
independent of the link speed.  But maybe there's some weird
dependency between the GPU and the driver here.

It sounds like t

Re: [Nouveau] [PATCH v2 2/4] pci: enable pcie link changes for pascal

2019-05-20 Thread Bjorn Helgaas
On Tue, May 07, 2019 at 10:12:43PM +0200, Karol Herbst wrote:
> Signed-off-by: Karol Herbst 
> Reviewed-by: Lyude Paul 
> ---
>  drm/nouveau/nvkm/subdev/pci/gk104.c |  8 
>  drm/nouveau/nvkm/subdev/pci/gp100.c | 10 ++
>  drm/nouveau/nvkm/subdev/pci/priv.h  |  5 +
>  3 files changed, 19 insertions(+), 4 deletions(-)
> 
> diff --git a/drm/nouveau/nvkm/subdev/pci/gk104.c b/drm/nouveau/nvkm/subdev/pci/gk104.c
> index e6803050..66489018 100644
> --- a/drm/nouveau/nvkm/subdev/pci/gk104.c
> +++ b/drm/nouveau/nvkm/subdev/pci/gk104.c
> @@ -23,7 +23,7 @@
>   */
>  #include "priv.h"
>  
> -static int
> +int
>  gk104_pcie_version_supported(struct nvkm_pci *pci)
>  {
>   return (nvkm_rd32(pci->subdev.device, 0x8c1c0) & 0x4) == 0x4 ? 2 : 1;
> @@ -108,7 +108,7 @@ gk104_pcie_lnkctl_speed(struct nvkm_pci *pci)
>   return -1;
>  }
>  
> -static enum nvkm_pcie_speed
> +enum nvkm_pcie_speed
>  gk104_pcie_max_speed(struct nvkm_pci *pci)
>  {
>   u32 max_speed = nvkm_rd32(pci->subdev.device, 0x8c1c0) & 0x30;

Some of this looks pretty similar to the PCI core code that reads
PCI_EXP_LNKSTA to find the link speed (but admittedly there's not
really a good interface to do that unless bus->cur_bus_speed already
has what you need).

It doesn't look like this is directly reading the PCI_EXP_LNKSTA from
PCI config space; maybe it's reading a mirror of that via a
device-specific MMIO address, or maybe it's reading something else
completely.

But it makes me wonder if there's a way to make generic PCI core
interfaces for some of this stuff.

> @@ -146,7 +146,7 @@ gk104_pcie_set_link_speed(struct nvkm_pci *pci, enum nvkm_pcie_speed speed)
>   nvkm_mask(device, 0x8c040, 0x1, 0x1);
>  }
>  
> -static int
> +int
>  gk104_pcie_init(struct nvkm_pci * pci)
>  {
>   enum nvkm_pcie_speed lnkctl_speed, max_speed, cap_speed;
> @@ -178,7 +178,7 @@ gk104_pcie_init(struct nvkm_pci * pci)
>   return 0;
>  }
>  
> -static int
> +int
>  gk104_pcie_set_link(struct nvkm_pci *pci, enum nvkm_pcie_speed speed, u8 width)
>  {
>   struct nvkm_subdev *subdev = >subdev;
> diff --git a/drm/nouveau/nvkm/subdev/pci/gp100.c b/drm/nouveau/nvkm/subdev/pci/gp100.c
> index 82c5234a..eb19c7a4 100644
> --- a/drm/nouveau/nvkm/subdev/pci/gp100.c
> +++ b/drm/nouveau/nvkm/subdev/pci/gp100.c
> @@ -35,6 +35,16 @@ gp100_pci_func = {
>   .wr08 = nv40_pci_wr08,
>   .wr32 = nv40_pci_wr32,
>   .msi_rearm = gp100_pci_msi_rearm,
> +
> + .pcie.init = gk104_pcie_init,
> + .pcie.set_link = gk104_pcie_set_link,
> +
> + .pcie.max_speed = gk104_pcie_max_speed,
> + .pcie.cur_speed = g84_pcie_cur_speed,
> +
> + .pcie.set_version = gf100_pcie_set_version,
> + .pcie.version = gf100_pcie_version,
> + .pcie.version_supported = gk104_pcie_version_supported,
>  };
>  
>  int
> diff --git a/drm/nouveau/nvkm/subdev/pci/priv.h b/drm/nouveau/nvkm/subdev/pci/priv.h
> index c17f6063..a0d4c007 100644
> --- a/drm/nouveau/nvkm/subdev/pci/priv.h
> +++ b/drm/nouveau/nvkm/subdev/pci/priv.h
> @@ -54,6 +54,11 @@ int gf100_pcie_cap_speed(struct nvkm_pci *);
>  int gf100_pcie_init(struct nvkm_pci *);
>  int gf100_pcie_set_link(struct nvkm_pci *, enum nvkm_pcie_speed, u8);
>  
> +int gk104_pcie_init(struct nvkm_pci *);
> +int gk104_pcie_set_link(struct nvkm_pci *, enum nvkm_pcie_speed, u8 width);
> +enum nvkm_pcie_speed gk104_pcie_max_speed(struct nvkm_pci *);
> +int gk104_pcie_version_supported(struct nvkm_pci *);
> +
>  int nvkm_pcie_oneinit(struct nvkm_pci *);
>  int nvkm_pcie_init(struct nvkm_pci *);
>  #endif
> -- 
> 2.21.0
> 

Re: [Nouveau] [PATCH v2 4/4] pci: save the boot pcie link speed and restore it on fini

2019-05-20 Thread Bjorn Helgaas
On Tue, May 07, 2019 at 10:12:45PM +0200, Karol Herbst wrote:
> Apparently things go south if we suspend the device with a different PCIe
> link speed set than the one it booted with. Fixes runtime suspend on my gp107.
> 
> This all looks like some bug inside the pci subsystem and I would prefer a
> fix there instead of nouveau, but maybe there is no real nice way of doing
> that outside of drivers?

I agree it would be nice to fix this in the PCI core if that's
feasible.

It looks like this driver changes the PCIe link speed using some
device-specific mechanism.  When we suspend, we put the device in
D3cold, so it loses all its state.  When we resume, the link probably
comes up at the boot speed because nothing did that device-specific
magic to change it, so you probably end up with the link being slow
but the driver thinking it's configured to be fast, and maybe that
combination doesn't work.

If it requires something device-specific to change that link speed, I
don't know how to put that in the PCI core.  But maybe I'm missing
something?

Per the PCIe spec (r4.0, sec 1.2):

  Initialization – During hardware initialization, each PCI Express
  Link is set up following a negotiation of Lane widths and frequency
  of operation by the two agents at each end of the Link. No firmware
  or operating system software is involved.

I have been assuming that this means device-specific link speed
management is out of spec, but it seems pretty common that devices
don't come up by themselves at the fastest possible link speed.  So
maybe the spec just intends that devices can operate at *some* valid
speed.

> v2: squashed together patch 4 and 5
> 
> Signed-off-by: Karol Herbst 
> Reviewed-by: Lyude Paul 
> ---
>  drm/nouveau/include/nvkm/subdev/pci.h |  5 +++--
>  drm/nouveau/nvkm/subdev/pci/base.c|  9 +++--
>  drm/nouveau/nvkm/subdev/pci/pcie.c| 24 
>  drm/nouveau/nvkm/subdev/pci/priv.h|  2 ++
>  4 files changed, 32 insertions(+), 8 deletions(-)
> 
> diff --git a/drm/nouveau/include/nvkm/subdev/pci.h b/drm/nouveau/include/nvkm/subdev/pci.h
> index 1fdf3098..b23793a2 100644
> --- a/drm/nouveau/include/nvkm/subdev/pci.h
> +++ b/drm/nouveau/include/nvkm/subdev/pci.h
> @@ -26,8 +26,9 @@ struct nvkm_pci {
>   } agp;
>  
>   struct {
> - enum nvkm_pcie_speed speed;
> - u8 width;
> + enum nvkm_pcie_speed cur_speed;
> + enum nvkm_pcie_speed def_speed;
> + u8 cur_width;
>   } pcie;
>  
>   bool msi;
> diff --git a/drm/nouveau/nvkm/subdev/pci/base.c b/drm/nouveau/nvkm/subdev/pci/base.c
> index ee2431a7..d9fb5a83 100644
> --- a/drm/nouveau/nvkm/subdev/pci/base.c
> +++ b/drm/nouveau/nvkm/subdev/pci/base.c
> @@ -90,6 +90,8 @@ nvkm_pci_fini(struct nvkm_subdev *subdev, bool suspend)
>  
>   if (pci->agp.bridge)
>   nvkm_agp_fini(pci);
> + else if (pci_is_pcie(pci->pdev))
> + nvkm_pcie_fini(pci);
>  
>   return 0;
>  }
> @@ -100,6 +102,8 @@ nvkm_pci_preinit(struct nvkm_subdev *subdev)
>   struct nvkm_pci *pci = nvkm_pci(subdev);
>   if (pci->agp.bridge)
>   nvkm_agp_preinit(pci);
> + else if (pci_is_pcie(pci->pdev))
> + nvkm_pcie_preinit(pci);
>   return 0;
>  }
>  
> @@ -193,8 +197,9 @@ nvkm_pci_new_(const struct nvkm_pci_func *func, struct nvkm_device *device,
>   pci->func = func;
>   pci->pdev = device->func->pci(device)->pdev;
>   pci->irq = -1;
> - pci->pcie.speed = -1;
> - pci->pcie.width = -1;
> + pci->pcie.cur_speed = -1;
> + pci->pcie.def_speed = -1;
> + pci->pcie.cur_width = -1;
>  
>   if (device->type == NVKM_DEVICE_AGP)
>   nvkm_agp_ctor(pci);
> diff --git a/drm/nouveau/nvkm/subdev/pci/pcie.c b/drm/nouveau/nvkm/subdev/pci/pcie.c
> index 70ccbe0d..731dd30e 100644
> --- a/drm/nouveau/nvkm/subdev/pci/pcie.c
> +++ b/drm/nouveau/nvkm/subdev/pci/pcie.c
> @@ -85,6 +85,13 @@ nvkm_pcie_oneinit(struct nvkm_pci *pci)
>   return 0;
>  }
>  
> +int
> +nvkm_pcie_preinit(struct nvkm_pci *pci)
> +{
> + pci->pcie.def_speed = nvkm_pcie_get_speed(pci);
> + return 0;
> +}
> +
>  int
>  nvkm_pcie_init(struct nvkm_pci *pci)
>  {
> @@ -105,12 +112,21 @@ nvkm_pcie_init(struct nvkm_pci *pci)
>   if (pci->func->pcie.init)
>   pci->func->pcie.init(pci);
>  
> - if (pci->pcie.speed != -1)
> - nvkm_pcie_set_link(pci, pci->pcie.speed, pci->pcie.width);
> + if (pci->pcie.cur_speed != -1)
> + nvkm_pcie_set_link(pci, pci->pcie.cur_speed,
> +pci->pcie.cur_width);
>  
>   return 0;
>  }
>  
> +int
> +nvkm_pcie_fini(struct nvkm_pci *pci)
> +{
> + if (!IS_ERR_VALUE(pci->pcie.def_speed))
> + return nvkm_pcie_set_link(pci, pci->pcie.def_speed, 16);
> + return 0;
> +}
> +
>  int
>  nvkm_pcie_set_link(struct nvkm_pci *pci, enum nvkm_pcie_speed speed, u8 width)
>  {
> @@ -146,8 +162,8 @@ 

Re: [Nouveau] [PATCH] pci/quirks: Add quirk to reset nvgpu at boot for the Lenovo ThinkPad P50

2019-04-25 Thread Bjorn Helgaas
On Tue, Feb 12, 2019 at 05:02:30PM -0500, Lyude Paul wrote:
> On a very specific subset of ThinkPad P50 SKUs, particularly ones that
> come with a Quadro M1000M chip instead of the M2000M variant, the BIOS
> seems to have a very nasty habit of not always resetting the secondary
> Nvidia GPU between full reboots if the laptop is configured in Hybrid
> Graphics mode. The reason for this happening is unknown, but the
> following steps and possibly a good bit of patience will reproduce the
> issue:
> 
> 1. Boot up the laptop normally in Hybrid graphics mode
> 2. Make sure nouveau is loaded and that the GPU is awake
> 3. Allow the nvidia GPU to runtime suspend itself after being idle
> 4. Reboot the machine, the more sudden the better (e.g. sysrq-b may help)
> 5. If nouveau loads up properly, reboot the machine again and go back to
> step 3 until you reproduce the issue
> 
> This results in some very strange behavior: the GPU will
> quite literally be left in exactly the same state it was in when the
> previously booted kernel started the reboot. This has all sorts of bad
> side effects: for starters, this completely breaks nouveau starting with a
> mysterious EVO channel failure that happens well before we've actually
> used the EVO channel for anything:
> 
> nouveau :01:00.0: disp: chid 0 mthd  data 0400 1000
> 0002
> ...

> So to do this, we add a new pci quirk using
> DECLARE_PCI_FIXUP_CLASS_FINAL that will be invoked before the PCI probe
> at boot finishes. From there, we check to make sure that this is indeed
> the specific P50 variant of this GPU. We also make sure that the GPU PCI
> device is advertising NoReset- in order to prevent us from trying to
> reset the GPU when the machine is in Dedicated graphics mode (where the
> GPU being initialized by the BIOS is normal and expected). Finally, we
> try mapping the MMIO space for the GPU which should only work if the GPU
> is actually active in D0 mode. We can then read the magic 0x2240c
> register on the GPU, which will have bit 1 set if the GPU's firmware has
> already been posted during a previous boot. Once we've confirmed all of
> this, we reset the PCI device and re-disable it - bringing the GPU back
> into a healthy state.
> 
> Signed-off-by: Lyude Paul 
> Cc: nouveau@lists.freedesktop.org
> Cc: dri-de...@lists.freedesktop.org
> Cc: Karol Herbst 
> Cc: Ben Skeggs 
> Cc: sta...@vger.kernel.org

Applied to pci/misc for v5.2, thanks!

> ---
>  drivers/pci/quirks.c | 65 
>  1 file changed, 65 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index b0a413f3f7ca..948492fda8bf 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -5117,3 +5117,68 @@ SWITCHTEC_QUIRK(0x8573);  /* PFXI 48XG3 */
>  SWITCHTEC_QUIRK(0x8574);  /* PFXI 64XG3 */
>  SWITCHTEC_QUIRK(0x8575);  /* PFXI 80XG3 */
>  SWITCHTEC_QUIRK(0x8576);  /* PFXI 96XG3 */
> +
> +/*
> + * On certain Lenovo Thinkpad P50 SKUs, specifically those with a Nvidia
> + * Quadro M1000M, the BIOS will occasionally make the mistake of not resetting
> + * the nvidia GPU between reboots if the system is configured to use hybrid
> + * graphics mode. This results in the GPU being left in whatever state it was
> + * in during the previous boot which causes spurious interrupts from the GPU,
> + * which in turn cause us to disable the wrong IRQs and end up breaking the
> + * touchpad. Unsurprisingly, this also completely breaks nouveau.
> + *
> + * Luckily, it seems a simple reset of the PCI device for the nvidia GPU
> + * manages to bring the GPU back into a clean state and fix all of these
> + * issues. Additionally since the GPU will report NoReset+ when the machine is
> + * configured in Dedicated display mode, we don't need to worry about
> + * accidentally resetting the GPU when it's supposed to already be
> + * initialized.
> + */
> +static void
> +quirk_lenovo_thinkpad_p50_nvgpu_survives_reboot(struct pci_dev *pdev)
> +{
> + void __iomem *map;
> + int ret;
> +
> + if (pdev->subsystem_vendor != PCI_VENDOR_ID_LENOVO ||
> + pdev->subsystem_device != 0x222e ||
> + !pdev->reset_fn)
> + return;
> +
> + /*
> +  * If we can't enable the device's mmio space, it's probably not even
> +  * initialized. This is fine, and means we can just skip the quirk
> +  * entirely.
> +  */
> + if (pci_enable_device_mem(pdev)) {
> + pci_dbg(pdev, "Can't enable device mem, no reset needed\n");
> + return;
> + }
> +
> + /* Taken from drivers/gpu/drm/nouveau/engine/device/base.c */
> + map = ioremap(pci_resource_start(pdev, 0), 0x102000);
> + if (!map) {
> + pci_err(pdev, "Can't map MMIO space, this is probably very bad\n");
> + goto out_disable;
> + }
> +
> + /*
> +  * Be extra careful, and make sure that the GPU firmware is posted
> +  * before trying a reset
> +  */
> + if (ioread32(map + 0x2240c) & 0x2) {
> + pci_info(pdev, FW_BUG "GPU left initialized by EFI, resetting\n");
> + ret = pci_reset_function(pdev);
> + if (ret < 0)
> + pci_err(pdev, "Failed to reset GPU: %d\n", ret);
> + }
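
(The archive truncates the patch here; the reset block above matches the code quoted in the 2019-08-15 thread earlier in this archive. The remaining teardown and the fixup declaration are a sketch inferred from the goto out_disable above and the later rename of the function, not the verbatim patch:)

	iounmap(map);
out_disable:
	pci_disable_device(pdev);
}
DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, 0x13b1,
			      PCI_CLASS_DISPLAY_VGA, 8,
			      quirk_lenovo_thinkpad_p50_nvgpu_survives_reboot);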

Re: [Nouveau] [PATCH] pci/quirks: Add quirk to reset nvgpu at boot for the Lenovo ThinkPad P50

2019-04-24 Thread Bjorn Helgaas
On Wed, Apr 24, 2019 at 03:16:37PM -0400, Lyude Paul wrote:
> On Wed, 2019-04-24 at 13:59 -0500, Bjorn Helgaas wrote:
> > Not being a scheduled work expert, I was unsure if this experiment was
> > equivalent to what I proposed.
> > 
> > I'm always suspicious of singleton solutions like this (using
> > schedule_work() in runtime_resume()) because usually they seem to be
> > solving a generic problem that should happen on many kinds of
> > hardware.  The 0b2fe6594fa2 ("drm/nouveau: Queue hpd_work on (runtime)
> > resume") commit log says:
> > 
> >   We need to call drm_helper_hpd_irq_event() on resume to properly
> >   detect monitor connection / disconnection on some laptops, use
> >   hpd_work for this to avoid deadlocks.
> > 
> > The situation of a monitor being connected or disconnected during
> > suspend can happen to *any* GPU, but the commit only changes nouveau,
> > which of course raises the question of how we deal with that in other
> > drivers.  If the Nvidia GPU has some unique behavior related to
> > monitor connection, that would explain special-case code there, but
> > the commit doesn't mention anything like that.
> > 
> > It should be simple to revert 0b2fe6594fa2 and see whether it changes
> > the behavior at all (well, simple except for the fact that this
> > problem isn't 100% reproducible in the first place).
> 
> It's not 100% reproducible, but it's at least 90% so it's not
> difficult for me to test at all.
> 
> Also, reverting this commit makes no difference either. 

OK, great, that makes it crystal clear.  I didn't know you had
specifically tested that revert.  Thanks for doing that.

> Note that while that commit only changed nouveau, scheduled_work()
> is exactly how a number of other drivers (i915 for instance) handle
> reprobing like this as well.

OK.  The GPU code would be a lot more approachable if similar things
were done in similar ways.  I spent an hour or so looking for this
similar code in i915, but gave up.

> The reason being that we can't do full connector reprobing in our
> runtime resume thread because we could deadlock if someone else is
> holding a modesetting lock we need and waiting on us to resume at
> the same time (there's a number of other bug fixes in nouveau for
> other issues caused by the same deadlock scenario). 

You mention nouveau specifically here, but I assume this is a generic
deadlock scenario that applies to any GPU, and they all avoid the
deadlock in the same way.  Right?

> I'm confused here though, it sounds like you're running under the
> assumption that PCI devices like this aren't reset into a clean
> state during a system reboot, is that correct?

No, I wasn't trying to say anything about that.  My point here is
that:

  - you're reporting a problem that only happens with nouveau and
only happens during shutdown/reboot
  - the behavior is similar to a race (not 100% reproducible, seems
to happen more if shutdown is faster)
  - shutdown involves resuming the device (see pci_device_shutdown())
  - nouveau_pmops_runtime_resume() schedules asynchronous work, which
(to my untrained eye) looks unusual
  - asynchronous work is inherently subject to races

So I think that's all somewhat suspicious.  But if the same problem
happens without the asynchronous work, obviously the issue is
elsewhere.

But you *are* right that if the device were actually reset, none of
this should matter.  It certainly seems that the BIOS neglects to
reset it in some cases.

I can sort of imagine a BIOS engineer thinking that if the device
looks like it's in use, we shouldn't reset it, and it's still
conceivable that some sort of Linux shutdown race could leave it
looking like it's in use.  But you've been working with Lenovo on
this, and it seems like that would be pretty obvious to somebody with
the BIOS source (though I just demonstrated above that even with the
source it's easy to miss things).

I'm out of ideas, so I think your quirk is the best way forward.  I
trimmed out some of the commit log backtraces and such, added the
bugzilla, and tweaked the patch to use pci_iomap() instead of
ioremap().  Would the patch below work for you?


commit 18dc5b3c7ddc
Author: Lyude Paul 
Date:   Tue Feb 12 17:02:30 2019 -0500

PCI: Reset Lenovo ThinkPad P50 nvgpu at boot if necessary

On ThinkPad P50 SKUs with an Nvidia Quadro M1000M instead of the M2000M
variant, the BIOS does not always reset the secondary Nvidia GPU during
reboot if the laptop is configured in Hybrid Graphics mode.  The reason is
unknown, but the following steps and possibly a good bit of patience will
reproduce the issue:

  1. Boot up the laptop normally in Hybrid Graphics mode
  2. Make sure nouveau is loaded and that the GPU is awake
  3. Allow the Nvid

Re: [Nouveau] [PATCH] pci/quirks: Add quirk to reset nvgpu at boot for the Lenovo ThinkPad P50

2019-04-24 Thread Bjorn Helgaas
On Mon, Apr 15, 2019 at 02:07:18PM -0400, Lyude Paul wrote:
> On Thu, 2019-04-04 at 09:17 -0500, Bjorn Helgaas wrote:
> > [+cc Hans, author of 0b2fe6594fa2 ("drm/nouveau: Queue hpd_work on (runtime)
> > resume")]
> > 
> > On Fri, Mar 22, 2019 at 06:30:15AM -0500, Bjorn Helgaas wrote:
> > > On Thu, Mar 21, 2019 at 05:48:19PM -0500, Bjorn Helgaas wrote:
> > > > On Wed, Mar 13, 2019 at 06:25:02PM -0400, Lyude Paul wrote:
> > > > > On Fri, 2019-02-15 at 16:17 -0500, Lyude Paul wrote:
> > > > > > On Thu, 2019-02-14 at 18:43 -0600, Bjorn Helgaas wrote:
> > > > > > > On Tue, Feb 12, 2019 at 05:02:30PM -0500, Lyude Paul wrote:
> > > > > > > > On a very specific subset of ThinkPad P50 SKUs, particularly
> > > > > > > > ones that come with a Quadro M1000M chip instead of the M2000M
> > > > > > > > variant, the BIOS seems to have a very nasty habit of not
> > > > > > > > always resetting the secondary Nvidia GPU between full reboots
> > > > > > > > if the laptop is configured in Hybrid Graphics mode. The
> > > > > > > > reason for this happening is unknown, but the following steps
> > > > > > > > and possibly a good bit of patience will reproduce the issue:
> > > > > > > > 
> > > > > > > > 1. Boot up the laptop normally in Hybrid graphics mode
> > > > > > > > 2. Make sure nouveau is loaded and that the GPU is awake
> > > > > > > > 3. Allow the nvidia GPU to runtime suspend itself after being
> > > > > > > > idle
> > > > > > > > 4. Reboot the machine, the more sudden the better (e.g. sysrq-b
> > > > > > > > may help)
> > > > > > > > 5. If nouveau loads up properly, reboot the machine again and go
> > > > > > > > back to
> > > > > > > > step 3 until you reproduce the issue
> > > > > > > > 
> > > > > > > > This results in some very strange behavior: the GPU will quite
> > > > > > > > literally be left in exactly the same state it was in when the
> > > > > > > > previously booted kernel started the reboot. This has all
> > > > > > > > sorts of bad side effects: for starters, this completely breaks
> > > > > > > > nouveau starting with a mysterious EVO channel failure that
> > > > > > > > happens well before we've actually used the EVO channel for
> > > > > > > > anything:
> > > > 
> > > > Thanks for the hybrid tutorial (snipped from this response).  IIUC,
> > > > what you said was that in hybrid mode, the Intel GPU drives the
> > > > built-in display and the Nvidia GPU drives any external displays and
> > > > may be used for DRI PRIME rendering (whatever that is).  But since you
> > > > say the Nvidia device gets runtime suspended, I assume there's no
> > > > external display here and you're not using DRI PRIME.
> > > > 
> > > > I wonder if it's related to the fact that the Nvidia GPU has been
> > > > runtime suspended before you do the reboot.  Can you try turning off
> > > > runtime power management for the GPU by setting the runpm module
> > > > parameter to 0?  I *think* this would be booting with
> > > > "nouveau.runpm=0".
> > > 
> > > Sorry, I wasn't really thinking here.  You already *said* this is
> > > related to runtime suspend.  It only happens when the Nvidia GPU has
> > > been suspended.
> > > 
> > > I don't know that much about suspend, but ISTR seeing comments about
> > > resuming devices before we shutdown.  If we do that, maybe there's
> > > some kind of race between that resume and the reboot?
> > 
> > I think we do in fact resume PCI devices before shutdown.  Here's the
> > path I'm looking at:
> > 
> >   device_shutdown
> > pm_runtime_get_noresume
> > pm_runtime_barrier
> > dev->bus->shutdown
> >   pci_device_shutdown
> > pm_runtime_resume
> >   __pm_runtime_resume(dev, 0)
> > rpm_resume(dev, 0)
> >   __update_runtime_status(dev, RPM_RESUMING)
> >   callback = RPM_GET_CALLBACK(dev, runtime_resume)
> >   rpm_callback(callback, dev)
> > __rpm_callback
> >

Re: [Nouveau] [PATCH] pci/quirks: Add quirk to reset nvgpu at boot for the Lenovo ThinkPad P50

2019-04-24 Thread Bjorn Helgaas
On Wed, Apr 24, 2019 at 01:31:09PM -0400, Lyude Paul wrote:
> Any update on this? This has been waiting for a while now

Oh, sorry, I guess we were both waiting for the other.  Let me respond to
the email with context.

> On Thu, 2019-04-04 at 09:17 -0500, Bjorn Helgaas wrote:
> > [+cc Hans, author of 0b2fe6594fa2 ("drm/nouveau: Queue hpd_work on (runtime)
> > resume")]
> -- 
> Cheers,
>   Lyude Paul
> 

Re: [Nouveau] [PATCH] pci/quirks: Add quirk to reset nvgpu at boot for the Lenovo ThinkPad P50

2019-04-04 Thread Bjorn Helgaas
[+cc Hans, author of 0b2fe6594fa2 ("drm/nouveau: Queue hpd_work on (runtime) 
resume")]

On Fri, Mar 22, 2019 at 06:30:15AM -0500, Bjorn Helgaas wrote:
> On Thu, Mar 21, 2019 at 05:48:19PM -0500, Bjorn Helgaas wrote:
> > On Wed, Mar 13, 2019 at 06:25:02PM -0400, Lyude Paul wrote:
> > > On Fri, 2019-02-15 at 16:17 -0500, Lyude Paul wrote:
> > > > On Thu, 2019-02-14 at 18:43 -0600, Bjorn Helgaas wrote:
> > > > > On Tue, Feb 12, 2019 at 05:02:30PM -0500, Lyude Paul wrote:
> > > > > > On a very specific subset of ThinkPad P50 SKUs, particularly
> > > > > > ones that come with a Quadro M1000M chip instead of the M2000M
> > > > > > variant, the BIOS seems to have a very nasty habit of not
> > > > > > always resetting the secondary Nvidia GPU between full reboots
> > > > > > if the laptop is configured in Hybrid Graphics mode. The
> > > > > > reason for this happening is unknown, but the following steps
> > > > > > and possibly a good bit of patience will reproduce the issue:
> > > > > > 
> > > > > > 1. Boot up the laptop normally in Hybrid graphics mode
> > > > > > 2. Make sure nouveau is loaded and that the GPU is awake
> > > > > > 3. Allow the nvidia GPU to runtime suspend itself after being idle
> > > > > > 4. Reboot the machine, the more sudden the better (e.g. sysrq-b may
> > > > > > help)
> > > > > > 5. If nouveau loads up properly, reboot the machine again and go
> > > > > > back to step 2 until you reproduce the issue
> > > > > > 
> > > > > > This results in some very strange behavior: the GPU will quite
> > > > > > literally be left in exactly the same state it was in when the
> > > > > > previously booted kernel started the reboot. This has all
> > > > > > sorts of bad side effects: for starters, this completely breaks
> > > > > > nouveau starting with a mysterious EVO channel failure that
> > > > > > happens well before we've actually used the EVO channel for
> > > > > > anything:
> > 
> > Thanks for the hybrid tutorial (snipped from this response).  IIUC,
> > what you said was that in hybrid mode, the Intel GPU drives the
> > built-in display and the Nvidia GPU drives any external displays and
> > may be used for DRI PRIME rendering (whatever that is).  But since you
> > say the Nvidia device gets runtime suspended, I assume there's no
> > external display here and you're not using DRI PRIME.
> > 
> > I wonder if it's related to the fact that the Nvidia GPU has been
> > runtime suspended before you do the reboot.  Can you try turning off
> > runtime power management for the GPU by setting the runpm module
> > parameter to 0?  I *think* this would be booting with
> > "nouveau.runpm=0".
> 
> Sorry, I wasn't really thinking here.  You already *said* this is
> related to runtime suspend.  It only happens when the Nvidia GPU has
> been suspended.
> 
> I don't know that much about suspend, but ISTR seeing comments about
> resuming devices before we shutdown.  If we do that, maybe there's
> some kind of race between that resume and the reboot?

I think we do in fact resume PCI devices before shutdown.  Here's the
path I'm looking at:

  device_shutdown
pm_runtime_get_noresume
pm_runtime_barrier
dev->bus->shutdown
  pci_device_shutdown
pm_runtime_resume
  __pm_runtime_resume(dev, 0)
rpm_resume(dev, 0)
  __update_runtime_status(dev, RPM_RESUMING)
  callback = RPM_GET_CALLBACK(dev, runtime_resume)
  rpm_callback(callback, dev)
__rpm_callback
  pci_pm_runtime_resume
drv->pm->runtime_resume
  nouveau_pmops_runtime_resume
nouveau_do_resume
schedule_work(hpd_work)   # <---
...
nouveau_display_hpd_work
  pm_runtime_get_sync
  drm_helper_hpd_irq_event
  pm_runtime_mark_last_busy
  pm_runtime_put_sync

I'm curious about that "schedule_work(hpd_work)" near the end because
no other drivers seem to use schedule_work() in the runtime_resume
path, and I don't know how that synchronizes with the shutdown
process.  I don't see anything that waits for
nouveau_display_hpd_work() to complete, so it seems like something
that could be a race.
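
To make the suspected window concrete, here is a minimal sketch of the kind
of synchronization that would close it.  This is not nouveau's actual code;
the private struct and field names are assumptions for illustration only.

  #include <linux/pci.h>
  #include <linux/workqueue.h>

  struct my_drm {                 /* stand-in for the driver's private data */
          struct work_struct hpd_work;
  };

  static void my_shutdown(struct pci_dev *pdev)
  {
          struct my_drm *drm = pci_get_drvdata(pdev);

          /*
           * pci_device_shutdown()'s pm_runtime_resume() may have just
           * scheduled hpd_work; wait for it here so it cannot run
           * concurrently with, or after, the rest of shutdown.
           */
          flush_work(&drm->hpd_work);

          /* ... normal shutdown sequence ... */
  }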

I wonder this problem

[Nouveau] [PATCH] drm/nouveau: Remove duplicate ACPI_VIDEO_NOTIFY_PROBE definition

2019-04-04 Thread Bjorn Helgaas
From: Bjorn Helgaas 

Commit 3a6536c51d5d ("drm/nouveau: Intercept ACPI_VIDEO_NOTIFY_PROBE")
added a definition of ACPI_VIDEO_NOTIFY_PROBE because  didn't
supply one.  Later, commit eff4a751cce5 ("ACPI / video: Move
ACPI_VIDEO_NOTIFY_* defines to acpi/video.h") moved ACPI_VIDEO_NOTIFY_PROBE
and other definitions to , so the copy in nouveau_display.c
is now unnecessary.

Remove the unnecessary definition from nouveau_display.c.

Signed-off-by: Bjorn Helgaas 
CC: Hans de Goede 
---
 drivers/gpu/drm/nouveau/nouveau_display.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_display.c 
b/drivers/gpu/drm/nouveau/nouveau_display.c
index 55c0fa451163..832da8e0020d 100644
--- a/drivers/gpu/drm/nouveau/nouveau_display.c
+++ b/drivers/gpu/drm/nouveau/nouveau_display.c
@@ -358,15 +358,6 @@ nouveau_display_hpd_work(struct work_struct *work)
 
 #ifdef CONFIG_ACPI
 
-/*
- * Hans de Goede: This define belongs in acpi/video.h, I've submitted a patch
- * to the acpi subsys to move it there from drivers/acpi/acpi_video.c .
- * This should be dropped once that is merged.
- */
-#ifndef ACPI_VIDEO_NOTIFY_PROBE
-#define ACPI_VIDEO_NOTIFY_PROBE	0x81
-#endif
-
 static int
 nouveau_display_acpi_ntfy(struct notifier_block *nb, unsigned long val,
  void *data)
-- 
2.21.0.392.gf8f6787159e-goog


Re: [Nouveau] [PATCH] pci/quirks: Add quirk to reset nvgpu at boot for the Lenovo ThinkPad P50

2019-03-22 Thread Bjorn Helgaas
On Thu, Mar 21, 2019 at 05:48:19PM -0500, Bjorn Helgaas wrote:
> On Wed, Mar 13, 2019 at 06:25:02PM -0400, Lyude Paul wrote:
> > On Fri, 2019-02-15 at 16:17 -0500, Lyude Paul wrote:
> > > On Thu, 2019-02-14 at 18:43 -0600, Bjorn Helgaas wrote:
> > > > On Tue, Feb 12, 2019 at 05:02:30PM -0500, Lyude Paul wrote:
> > > > > On a very specific subset of ThinkPad P50 SKUs, particularly
> > > > > ones that come with a Quadro M1000M chip instead of the M2000M
> > > > > variant, the BIOS seems to have a very nasty habit of not
> > > > > always resetting the secondary Nvidia GPU between full reboots
> > > > > if the laptop is configured in Hybrid Graphics mode. The
> > > > > reason for this happening is unknown, but the following steps
> > > > > and possibly a good bit of patience will reproduce the issue:
> > > > > 
> > > > > 1. Boot up the laptop normally in Hybrid graphics mode
> > > > > 2. Make sure nouveau is loaded and that the GPU is awake
> > > > > 3. Allow the nvidia GPU to runtime suspend itself after being idle
> > > > > 4. Reboot the machine, the more sudden the better (e.g. sysrq-b may
> > > > > help)
> > > > > 5. If nouveau loads up properly, reboot the machine again and go back
> > > > > to step 2 until you reproduce the issue
> > > > > 
> > > > > This results in some very strange behavior: the GPU will quite
> > > > > literally be left in exactly the same state it was in when the
> > > > > previously booted kernel started the reboot. This has all
> > > > > sorts of bad side effects: for starters, this completely breaks
> > > > > nouveau starting with a mysterious EVO channel failure that
> > > > > happens well before we've actually used the EVO channel for
> > > > > anything:
> 
> Thanks for the hybrid tutorial (snipped from this response).  IIUC,
> what you said was that in hybrid mode, the Intel GPU drives the
> built-in display and the Nvidia GPU drives any external displays and
> may be used for DRI PRIME rendering (whatever that is).  But since you
> say the Nvidia device gets runtime suspended, I assume there's no
> external display here and you're not using DRI PRIME.
> 
> I wonder if it's related to the fact that the Nvidia GPU has been
> runtime suspended before you do the reboot.  Can you try turning off
> runtime power management for the GPU by setting the runpm module
> parameter to 0?  I *think* this would be booting with
> "nouveau.runpm=0".

Sorry, I wasn't really thinking here.  You already *said* this is
related to runtime suspend.  It only happens when the Nvidia GPU has
been suspended.

I don't know that much about suspend, but ISTR seeing comments about
resuming devices before we shutdown.  If we do that, maybe there's
some kind of race between that resume and the reboot?

Bjorn

Re: [Nouveau] [PATCH] pci/quirks: Add quirk to reset nvgpu at boot for the Lenovo ThinkPad P50

2019-03-21 Thread Bjorn Helgaas
[+cc Rafael]

On Wed, Mar 13, 2019 at 06:25:02PM -0400, Lyude Paul wrote:
> On Fri, 2019-02-15 at 16:17 -0500, Lyude Paul wrote:
> > On Thu, 2019-02-14 at 18:43 -0600, Bjorn Helgaas wrote:
> > > On Tue, Feb 12, 2019 at 05:02:30PM -0500, Lyude Paul wrote:
> > > > On a very specific subset of ThinkPad P50 SKUs, particularly
> > > > ones that come with a Quadro M1000M chip instead of the M2000M
> > > > variant, the BIOS seems to have a very nasty habit of not
> > > > always resetting the secondary Nvidia GPU between full reboots
> > > > if the laptop is configured in Hybrid Graphics mode. The
> > > > reason for this happening is unknown, but the following steps
> > > > and possibly a good bit of patience will reproduce the issue:
> > > > 
> > > > 1. Boot up the laptop normally in Hybrid graphics mode
> > > > 2. Make sure nouveau is loaded and that the GPU is awake
> > > > 3. Allow the nvidia GPU to runtime suspend itself after being idle
> > > > 4. Reboot the machine, the more sudden the better (e.g. sysrq-b may help)
> > > > 5. If nouveau loads up properly, reboot the machine again and go back to
> > > > step 2 until you reproduce the issue
> > > > 
> > > > This results in some very strange behavior: the GPU will quite
> > > > literally be left in exactly the same state it was in when the
> > > > previously booted kernel started the reboot. This has all
> > > > sorts of bad side effects: for starters, this completely breaks
> > > > nouveau starting with a mysterious EVO channel failure that
> > > > happens well before we've actually used the EVO channel for
> > > > anything:

Thanks for the hybrid tutorial (snipped from this response).  IIUC,
what you said was that in hybrid mode, the Intel GPU drives the
built-in display and the Nvidia GPU drives any external displays and
may be used for DRI PRIME rendering (whatever that is).  But since you
say the Nvidia device gets runtime suspended, I assume there's no
external display here and you're not using DRI PRIME.

I wonder if it's related to the fact that the Nvidia GPU has been
runtime suspended before you do the reboot.  Can you try turning off
runtime power management for the GPU by setting the runpm module
parameter to 0?  I *think* this would be booting with
"nouveau.runpm=0".

> > > Is there a bug report for this?  Bugzilla.kernel.org would be ideal,
> > > including "lspci -vvxxx" and dmidecode for the system.
> > > 
> > Not yet, but there has been discussion about this between nouveau
> > developers on our IRC channel.
>
> I lied: yes there actually is a bug report for this, but it's
> currently on the Red Hat bugzilla. I can get more information from
> it if you need (with lenovo's approval of course).

Can you please make a bugzilla.kernel.org entry with as much
information (dmesg, "lspci -vvxxx", dmidecode, etc) as you can get
approval for?  You can include the Red Hat bugzilla URL in the commit
log, too, but that's not quite as good because we have no control over
whether it's public.

> And additionally: I've been working with Lenovo on this issue for a
> couple of months now, and we've gone through dozens of different
> trial BIOSes with no success thus far. However, Lenovo is currently
> working on trying to add this workaround into their BIOS but I've
> been told that this change is going to take a decent amount of time
> since they need to test it across multiple operating systems. I'd be
> happy to come back and add a conditional later to turn this
> workaround off for later BIOS versions once Lenovo has released a
> proper fix.

Sounds like Lenovo is going to a lot of trouble for this.  The ideal
thing from my point of view would be if they could figure out why this
works on Windows but not on Linux.  I doubt Windows has a quirk like
this, so if we could figure out why it works on Windows, we could
likely do something similar in Linux.

> > > > So to do this, we add a new pci quirk using
> > > > DECLARE_PCI_FIXUP_CLASS_FINAL that will be invoked before the PCI probe
> > > > at boot finishes. From there, we check to make sure that this is indeed
> > > > the specific P50 variant of this GPU. We also make sure that the GPU PCI
> > > > device is advertising NoReset- in order to prevent us from trying to
> > > > reset the GPU when the machine is in Dedicated graphics mode (where the
> > > > GPU being initialized by the BIOS is normal and expected). Finally, we
> > > > try mapping the MMIO space for the GPU which should only work if the GPU
> > > > is actuall
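
For readers unfamiliar with final-class PCI fixups, the shape being described
is roughly the following.  This is an illustrative sketch, not the actual
patch: the device/subsystem IDs, the MMIO window size, and the reset step are
placeholders.

  #include <linux/pci.h>

  static void quirk_reset_p50_nvgpu(struct pci_dev *pdev)
  {
          void __iomem *map;

          /* Illustrative check for the affected P50 SKU (assumed IDs) */
          if (pdev->subsystem_vendor != PCI_VENDOR_ID_LENOVO ||
              pdev->subsystem_device != 0x222e)
                  return;

          /*
           * In Dedicated graphics mode the BIOS owns the GPU and the
           * device advertises NoReset+; the real patch bails out in
           * that case (check elided here).
           */

          /* MMIO should only be usable if the BIOS left the GPU on */
          map = pci_iomap(pdev, 0, 0x23000);
          if (!map)
                  return;

          /* ... inspect engine state, reset the GPU if it was left on ... */

          pci_iounmap(pdev, map);
  }
  DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, 0x13b1 /* assumed */,
                                PCI_CLASS_DISPLAY_3D, 8,
                                quirk_reset_p50_nvgpu);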

Re: [Nouveau] [PATCH] pci/quirks: Add quirk to reset nvgpu at boot for the Lenovo ThinkPad P50

2019-02-18 Thread Bjorn Helgaas
On Mon, Feb 18, 2019 at 3:14 PM Sasha Levin  wrote:
>
> Hi,
>
> [This is an automated email]
>
> This commit has been processed because it contains a -stable tag.
> The stable tag indicates that it's relevant for the following trees: all
>
> The bot has tested the following trees: v4.20.8, v4.19.21, v4.14.99, 
> v4.9.156, v4.4.174, v3.18.134.
>
> v4.20.8: Build OK!
> v4.19.21: Failed to apply! Possible dependencies:
> 01d5d7fa8376 ("PCI: Add macro for Switchtec quirk declarations")
>
> v4.14.99: Failed to apply! Possible dependencies:
> 01d5d7fa8376 ("PCI: Add macro for Switchtec quirk declarations")
> 06dc4ee54e30 ("PCI: Disable MSI for Freescale Layerscape PCIe RC mode")
> 07f4f97d7b4b ("vga_switcheroo: Use device link for HDA controller")
> 8948ca1a12c9 ("vga_switcheroo: Deduplicate power state tracking")
> ad281ecf1c7d ("PCI: Add DMA alias quirk for Microsemi Switchtec NTB")
>
> v4.9.156: Failed to apply! Possible dependencies:
> 01d5d7fa8376 ("PCI: Add macro for Switchtec quirk declarations")
> 07f4f97d7b4b ("vga_switcheroo: Use device link for HDA controller")
> 3e13676862f9 ("thunderbolt: Add support for DMA configuration based 
> mailbox")
> 46cd4b75cd0e ("efi: Add device path parser")
> 58c5475aba67 ("x86/efi: Retrieve and assign Apple device properties")
> 630b3aff8a51 ("treewide: Consolidate Apple DMI checks")
> 81a54b5e1986 ("thunderbolt: Let the connection manager handle all 
> notifications")
> 8948ca1a12c9 ("vga_switcheroo: Deduplicate power state tracking")
> 9d3cce0b6136 ("thunderbolt: Introduce thunderbolt bus and connection 
> manager")
> ac6c44de503e ("thunderbolt: Expose get_route() to other files")
> ad281ecf1c7d ("PCI: Add DMA alias quirk for Microsemi Switchtec NTB")
> bfe778ac4982 ("thunderbolt: Convert switch to a device")
> c9cc3aaa0281 ("thunderbolt: Use Device ROM retrieved from EFI")
> da2da04b8d44 ("thunderbolt: Rework capability handling")
> f67cf491175a ("thunderbolt: Add support for Internal Connection Manager 
> (ICM)")
> fa6d513aefe4 ("drivers:gpu: vga :vga_switcheroo.c : Fixed some coding 
> style issues")
>
> v4.4.174: Failed to apply! Possible dependencies:
> 01d5d7fa8376 ("PCI: Add macro for Switchtec quirk declarations")
> 07f4f97d7b4b ("vga_switcheroo: Use device link for HDA controller")
> 14d2000182ed ("drm/radeon: Defer probe if gmux is present but its driver 
> isn't")
> 156d7d4120e1 ("vga_switcheroo: Add handler flags infrastructure")
> 3a848662c751 ("vga_switcheroo: Prettify documentation")
> 412c8f7de011 ("drm/radeon: Return -EPROBE_DEFER when amdkfd not loaded")
> 704ab614ec12 ("drm/i915: Defer probe if gmux is present but its driver 
> isn't")
> 8948ca1a12c9 ("vga_switcheroo: Deduplicate power state tracking")
> 989561de9b51 ("PM / Domains: add setter for dev.pm_domain")
> 98b3a3402eb6 ("drm/nouveau: Defer probe if gmux is present but its driver 
> isn't")
> a345918d6ee6 ("vga_switcheroo: Support deferred probing of audio clients")
> ad281ecf1c7d ("PCI: Add DMA alias quirk for Microsemi Switchtec NTB")
> b00e5334ab1b ("vga_switcheroo: Add helper for deferred probing")
> b5f88dd1d6ef ("Revert "ACPI / LPSS: allow to use specific PM domain 
> during ->probe()"")
> c1e1655bb892 ("apple-gmux: Assign apple_gmux_data before registering")
> c68f4528a2e9 ("drm/amdkfd: Track when module's init is complete")
> fa6d513aefe4 ("drivers:gpu: vga :vga_switcheroo.c : Fixed some coding 
> style issues")
>
> v3.18.134: Failed to apply! Possible dependencies:
> 01d5d7fa8376 ("PCI: Add macro for Switchtec quirk declarations")
> 07f4f97d7b4b ("vga_switcheroo: Use device link for HDA controller")
> 083dba02947d ("drm/nouveau/device: recognise GM204")
> 2f4a58e852d1 ("drm/nouveau/subdev: always upcast through 
> nouveau_subdev()/nouveau_engine()")
> 4766ec53945f ("drm/nouveau/bios: add parsing of BIT M(v2) +0x03 table")
> 50e216d6e7c3 ("drm/nouveau/bios: add parsing of pmu image tables")
> 5444204036b2 ("drm/nouveau: switch to new-style timer macros")
> 7bb6d4428d3d ("drm/nouveau: move the (far too many...) different s/r 
> paths to the same place")
> 8d5e3af15c79 ("drm/nouveau/device: Add support for GK208B, resolves bug 
> 86935")
> 8d85e06b5e04 ("drm/nouveau/bios: add pci data structure parsing")
> 989aa5b76ad2 ("drm/nouveau/nvif: namespace of nvkm accessors (no binary 
> change)")
> a01ca78c8f11 ("drm/nouveau/nvif: simplify and tidy library interfaces")
> ad281ecf1c7d ("PCI: Add DMA alias quirk for Microsemi Switchtec NTB")
> ad4a36263535 ("drm/nouveau/bios: split out shadow methods")
> b71a1344ec20 ("drm/nouveau/bios: add NPDE parsing")
> ba6e34e61271 ("drm/gm204/devinit: initial implementation")
> be83cd4ef9a2 ("drm/nouveau: finalise nvkm namespace switch (no binary 
> change)")
> c39f472e9f14 ("drm/nouveau: remove symlinks, move core/ to nvkm/ (no 

Re: [Nouveau] [PATCH] pci/quirks: Add quirk to reset nvgpu at boot for the Lenovo ThinkPad P50

2019-02-15 Thread Bjorn Helgaas
Hi Lyude,

On Tue, Feb 12, 2019 at 05:02:30PM -0500, Lyude Paul wrote:
> On a very specific subset of ThinkPad P50 SKUs, particularly ones that
> come with a Quadro M1000M chip instead of the M2000M variant, the BIOS
> seems to have a very nasty habit of not always resetting the secondary
> Nvidia GPU between full reboots if the laptop is configured in Hybrid
> Graphics mode. The reason for this happening is unknown, but the
> following steps and possibly a good bit of patience will reproduce the
> issue:
> 
> 1. Boot up the laptop normally in Hybrid graphics mode
> 2. Make sure nouveau is loaded and that the GPU is awake
> 3. Allow the nvidia GPU to runtime suspend itself after being idle
> 4. Reboot the machine, the more sudden the better (e.g. sysrq-b may help)
> 5. If nouveau loads up properly, reboot the machine again and go back to
> step 2 until you reproduce the issue
> 
> This results in some very strange behavior: the GPU will
> quite literally be left in exactly the same state it was in when the
> previously booted kernel started the reboot. This has all sorts of bad
> sideaffects: for starters, this completely breaks nouveau starting with a
> mysterious EVO channel failure that happens well before we've actually
> used the EVO channel for anything:

There are a lot of moving parts here that are probably obvious to you
but not to me.  I need help untangling this a bit so I'm comfortable
that we got to the root cause and that we're doing something logical
as opposed to something that just happens to make things work.  I
really don't know enough to even ask the right questions...

Is there a bug report for this?  Bugzilla.kernel.org would be ideal,
including "lspci -vvxxx" and dmidecode for the system.

Is this running a current BIOS?  The date in your log below looks
pretty recent, so I assume it is current.

I assume "hybrid graphics" means you have two GPUs.  Do you select
hybrid graphics mode in the BIOS?

I assume when you say the Nvidia GPU doesn't get reset on a full
reboot, you're talking about a "warm reboot", and that if you actually
remove the power and do a cold reboot, there's no problem?

I assume Nvidia GPU being active means you are using the performance
GPU.  Does that mean the integrated GPU is completely unused and Linux
does nothing at all with it?  Is Linux doing any switching between
them?  If so, how?  I am not 100% confident in the code I've seen that
does the switching.

> nouveau 0000:01:00.0: disp: chid 0 mthd 0000 data 00000400 00001000
> 00000002
> 
> Later on, this causes us to timeout trying to bring up the GR ctx:
> 
> ------------[ cut here ]------------
> nouveau 0000:01:00.0: timeout
> WARNING: CPU: 0 PID: 12 at
> drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxgf100.c:1547
> gf100_grctx_generate+0x7b2/0x850 [nouveau]
> Modules linked in: nouveau mxm_wmi i915 crc32c_intel ttm i2c_algo_bit
> serio_raw drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
> xhci_pci drm xhci_hcd i2c_core wmi video
> CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.0.0-rc5Lyude-Test+ #29
> Hardware name: LENOVO 20EQS64N0B/20EQS64N0B, BIOS N1EET82W (1.55 )
> 12/18/2018
> Workqueue: events_long drm_dp_mst_link_probe_work [drm_kms_helper]
> RIP: 0010:gf100_grctx_generate+0x7b2/0x850 [nouveau]
> Code: 85 d2 75 04 48 8b 57 10 48 89 95 28 ff ff ff e8 b4 37 0e e1 48 8b
> 95 28 ff ff ff 48 c7 c7 b1 97 57 a0 48 89 c6 e8 5a 38 c0 e0 <0f> 0b e9
> b9 fd ff ff 48 8b 85 60 ff ff ff 48 8b 40 10 48 8b 78 10
> RSP: 0018:c90b77f0 EFLAGS: 00010286
> RAX:  RBX: 71af8000 RCX: 
> RDX: 7f41dfe0 RSI: 7f415698 RDI: 7f415698
> RBP: c90b78c8 R08:  R09: 
> R10:  R11:  R12: 72118000
> R13:  R14: a0551420 R15: c90b7818
> FS:  () GS:7f40()
> knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 5644d0556ca8 CR3: 02214006 CR4: 003606f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  gf100_gr_init_ctxctl+0x27b/0x2d0 [nouveau]
>  gf100_gr_init+0x5bd/0x5e0 [nouveau]
>  gf100_gr_init_+0x61/0x70 [nouveau]
>  nvkm_gr_init+0x1d/0x20 [nouveau]
>  nvkm_engine_init+0xcb/0x210 [nouveau]
>  nvkm_subdev_init+0xd6/0x230 [nouveau]
>  nvkm_engine_ref.part.0+0x52/0x70 [nouveau]
>  nvkm_engine_ref+0x13/0x20 [nouveau]
>  nvkm_ioctl_new+0x12c/0x260 [nouveau]
>  ? nvkm_fifo_chan_child_del+0xa0/0xa0 [nouveau]
>  ? gf100_gr_dtor+0xe0/0xe0 [nouveau]
>  nvkm_ioctl+0xe2/0x180 [nouveau]
>  nvkm_client_ioctl+0x12/0x20 [nouveau]
>  nvif_object_ioctl+0x47/0x50 [nouveau]
>  nvif_object_init+0xc8/0x120 [nouveau]
>  nvc0_fbcon_accel_init+0x5c/0x960 [nouveau]
>  nouveau_fbcon_create+0x5a5/0x5d0 [nouveau]
>  ? drm_setup_crtcs+0x27b/0xcb0 [drm_kms_helper]
>  ? __lock_is_held+0x5e/0xa0

Re: [Nouveau] [PATCH v3] PCI: Reprogram bridge prefetch registers on resume

2018-10-02 Thread Bjorn Helgaas
On Tue, Oct 2, 2018 at 4:27 PM Thomas Martitz  wrote:
>
> On 02.10.18 at 22:03, Bjorn Helgaas wrote:
> > Hi Thomas,
> >
> > On Mon, Oct 01, 2018 at 04:25:06PM +0200, Thomas Martitz wrote:
> >> Am 01.10.18 um 06:57 schrieb Daniel Drake:
> >>> On Sun, Sep 30, 2018 at 5:07 AM Thomas Martitz  wrote:
> >>>> The latest iteration does not work on my HP system. The GPU fails to
> >>>> power up just like the unpatched kernel.
> >>>
> >>> That's weird, I would not expect a behaviour change in the latest
> >>> patch. pci_restore_config_dword() has some debug messages, could you
> >>> please make them visible and show logs again?
> >>> Also remind us of the PCI device address of the parent bridge (lspci -vt)
> >>
> >> I'll follow up with more the requested information on bugzilla
> >> (Link: https://bugzilla.kernel.org/show_bug.cgi?id=201069).
> >>
> >> On a quick re-check, it seems to depend on if I used the eGPU before
> >> the initial suspend. If I run glxgears (with DRI_PRIME=1) before suspend it
> >> seems fine.
> >
> > Does the patch ([1]) make things *worse* compared to v4.19-rc5?
> >
>
> No, certainly not. It does look like a different issue since resuming now
> works at least if I used the eGPU in some way before suspend
> (DRI_PRIME=1 glxgears seems to be enough, I assume glxinfo would work as
> well).
>
> Without the patch resuming the eGPU does not work whatsoever.
>
> Please ship the patch. I'll hopefully sort this other issue out.

Great, thanks!  That's what I thought, but just wanted to double check.


Re: [Nouveau] [PATCH v3] PCI: Reprogram bridge prefetch registers on resume

2018-10-02 Thread Bjorn Helgaas
Hi Thomas,

On Mon, Oct 01, 2018 at 04:25:06PM +0200, Thomas Martitz wrote:
> On 01.10.18 at 06:57, Daniel Drake wrote:
> > On Sun, Sep 30, 2018 at 5:07 AM Thomas Martitz  wrote:
> > > The latest iteration does not work on my HP system. The GPU fails to
> > > power up just like the unpatched kernel.
> > 
> > That's weird, I would not expect a behaviour change in the latest
> > patch. pci_restore_config_dword() has some debug messages, could you
> > please make them visible and show logs again?
> > Also remind us of the PCI device address of the parent bridge (lspci -vt)
> 
> I'll follow up with more the requested information on bugzilla
> (Link: https://bugzilla.kernel.org/show_bug.cgi?id=201069).
> 
> On a quick re-check, it seems to depend on if I used the eGPU before
> the initial suspend. If I run glxgears (with DRI_PRIME=1) before suspend it
> seems fine.

Does the patch ([1]) make things *worse* compared to v4.19-rc5?

If so, I'll drop the patch until we figure this out.  But if the GPU
power issue also occurs in v4.19-rc5, I think we can assume it's a
different problem and we can go ahead and merge [1].

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/commit/?h=for-linus&id=083874549fdfefa629dfa752785e20427dde1511


Re: [Nouveau] [PATCH v3] PCI: Reprogram bridge prefetch registers on resume

2018-09-27 Thread Bjorn Helgaas
[+cc LKML]

On Tue, Sep 18, 2018 at 04:32:44PM -0500, Bjorn Helgaas wrote:
> On Thu, Sep 13, 2018 at 11:37:45AM +0800, Daniel Drake wrote:
> > On 38+ Intel-based Asus products, the nvidia GPU becomes unusable
> > after S3 suspend/resume. The affected products include multiple
> > generations of nvidia GPUs and Intel SoCs. After resume, nouveau logs
> > many errors such as:
> > 
> > fifo: fault 00 [READ] at 00555000 engine 00 [GR] client 04
> >   [HUB/FE] reason 4a [] on channel -1 [007fa91000 unknown]
> > DRM: failed to idle channel 0 [DRM]
> > 
> > Similarly, the nvidia proprietary driver also fails after resume
> > (black screen, 100% CPU usage in Xorg process). We shipped a sample
> > to Nvidia for diagnosis, and their response indicated that it's a
> > problem with the parent PCI bridge (on the Intel SoC), not the GPU.
> > 
> > Runtime suspend/resume works fine, only S3 suspend is affected.
> > 
> > We found a workaround: on resume, rewrite the Intel PCI bridge
> > 'Prefetchable Base Upper 32 Bits' register (PCI_PREF_BASE_UPPER32). In
> > the cases that I checked, this register has value 0 and we just have to
> > rewrite that value.
> > 
> > Linux already saves and restores PCI config space during suspend/resume,
> > but this register was being skipped because upon resume, it already
> > has value 0 (the correct, pre-suspend value).
> > 
> > Intel appear to have previously acknowledged this behaviour and the
> > requirement to rewrite this register.
> > https://bugzilla.kernel.org/show_bug.cgi?id=116851#c23
> > 
> > Based on that, rewrite the prefetch register values even when that
> > appears unnecessary.
> > 
> > We have confirmed this solution on all the affected models we have
> > in-hands (X542UQ, UX533FD, X530UN, V272UN).
> > 
> > Additionally, this solves an issue where r8169 MSI-X interrupts were
> > broken after S3 suspend/resume on Asus X441UAR. This issue was recently
> > worked around in commit 7bb05b85bc2d ("r8169: don't use MSI-X on
> > RTL8106e"). It also fixes the same issue on RTL6186evl/8111evl on an
> > Aimfor-tech laptop that we had not yet patched. I suspect it will also
> > fix the issue that was worked around in commit 7c53a722459c ("r8169:
> > don't use MSI-X on RTL8168g").
> > 
> > Thomas Martitz reports that this change also solves an issue where
> > the AMD Radeon Polaris 10 GPU on the HP Zbook 14u G5 is unresponsive
> > after S3 suspend/resume.
> > 
> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=201069
> > Signed-off-by: Daniel Drake 
> 
> Applied with Rafael's and Peter's reviewed-by to pci/enumeration for v4.20.
> Thanks for the huge investigative effort!

Since this looks low-risk and fixes several painful issues, I think
this merits a stable tag and being included in v4.19 (instead of
waiting for v4.20).  

I moved it to for-linus for v4.19.  Let me know if you object.

> > ---
> >  drivers/pci/pci.c | 25 +
> >  1 file changed, 17 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index 29ff9619b5fa..5d58220b6997 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -1289,12 +1289,12 @@ int pci_save_state(struct pci_dev *dev)
> >  EXPORT_SYMBOL(pci_save_state);
> >  
> >  static void pci_restore_config_dword(struct pci_dev *pdev, int offset,
> > -u32 saved_val, int retry)
> > +u32 saved_val, int retry, bool force)
> >  {
> > u32 val;
> >  
> > pci_read_config_dword(pdev, offset, &val);
> > -   if (val == saved_val)
> > +   if (!force && val == saved_val)
> > return;
> >  
> > for (;;) {
> > @@ -1313,25 +1313,34 @@ static void pci_restore_config_dword(struct pci_dev 
> > *pdev, int offset,
> >  }
> >  
> >  static void pci_restore_config_space_range(struct pci_dev *pdev,
> > -  int start, int end, int retry)
> > +  int start, int end, int retry,
> > +  bool force)
> >  {
> > int index;
> >  
> > for (index = end; index >= start; index--)
> > pci_restore_config_dword(pdev, 4 * index,
> >  pdev->saved_config_space[index],
> > -retry);
> > +retry, force);
> >  }
> >  

Re: [Nouveau] [PATCH v3] PCI: Reprogram bridge prefetch registers on resume

2018-09-18 Thread Bjorn Helgaas
On Thu, Sep 13, 2018 at 11:37:45AM +0800, Daniel Drake wrote:
> On 38+ Intel-based Asus products, the nvidia GPU becomes unusable
> after S3 suspend/resume. The affected products include multiple
> generations of nvidia GPUs and Intel SoCs. After resume, nouveau logs
> many errors such as:
> 
> fifo: fault 00 [READ] at 00555000 engine 00 [GR] client 04
>   [HUB/FE] reason 4a [] on channel -1 [007fa91000 unknown]
> DRM: failed to idle channel 0 [DRM]
> 
> Similarly, the nvidia proprietary driver also fails after resume
> (black screen, 100% CPU usage in Xorg process). We shipped a sample
> to Nvidia for diagnosis, and their response indicated that it's a
> problem with the parent PCI bridge (on the Intel SoC), not the GPU.
> 
> Runtime suspend/resume works fine, only S3 suspend is affected.
> 
> We found a workaround: on resume, rewrite the Intel PCI bridge
> 'Prefetchable Base Upper 32 Bits' register (PCI_PREF_BASE_UPPER32). In
> the cases that I checked, this register has value 0 and we just have to
> rewrite that value.
> 
> Linux already saves and restores PCI config space during suspend/resume,
> but this register was being skipped because upon resume, it already
> has value 0 (the correct, pre-suspend value).
> 
> Intel appear to have previously acknowledged this behaviour and the
> requirement to rewrite this register.
> https://bugzilla.kernel.org/show_bug.cgi?id=116851#c23
> 
> Based on that, rewrite the prefetch register values even when that
> appears unnecessary.
> 
> We have confirmed this solution on all the affected models we have
> in-hands (X542UQ, UX533FD, X530UN, V272UN).
> 
> Additionally, this solves an issue where r8169 MSI-X interrupts were
> broken after S3 suspend/resume on Asus X441UAR. This issue was recently
> worked around in commit 7bb05b85bc2d ("r8169: don't use MSI-X on
> RTL8106e"). It also fixes the same issue on RTL6186evl/8111evl on an
> Aimfor-tech laptop that we had not yet patched. I suspect it will also
> fix the issue that was worked around in commit 7c53a722459c ("r8169:
> don't use MSI-X on RTL8168g").
> 
> Thomas Martitz reports that this change also solves an issue where
> the AMD Radeon Polaris 10 GPU on the HP Zbook 14u G5 is unresponsive
> after S3 suspend/resume.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=201069
> Signed-off-by: Daniel Drake 

Applied with Rafael's and Peter's reviewed-by to pci/enumeration for v4.20.
Thanks for the huge investigative effort!

> ---
>  drivers/pci/pci.c | 25 +
>  1 file changed, 17 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 29ff9619b5fa..5d58220b6997 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1289,12 +1289,12 @@ int pci_save_state(struct pci_dev *dev)
>  EXPORT_SYMBOL(pci_save_state);
>  
>  static void pci_restore_config_dword(struct pci_dev *pdev, int offset,
> -  u32 saved_val, int retry)
> +  u32 saved_val, int retry, bool force)
>  {
>   u32 val;
>  
>   pci_read_config_dword(pdev, offset, &val);
> - if (val == saved_val)
> + if (!force && val == saved_val)
>   return;
>  
>   for (;;) {
> @@ -1313,25 +1313,34 @@ static void pci_restore_config_dword(struct pci_dev 
> *pdev, int offset,
>  }
>  
>  static void pci_restore_config_space_range(struct pci_dev *pdev,
> -int start, int end, int retry)
> +int start, int end, int retry,
> +bool force)
>  {
>   int index;
>  
>   for (index = end; index >= start; index--)
>   pci_restore_config_dword(pdev, 4 * index,
>pdev->saved_config_space[index],
> -  retry);
> +  retry, force);
>  }
>  
>  static void pci_restore_config_space(struct pci_dev *pdev)
>  {
>   if (pdev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
> - pci_restore_config_space_range(pdev, 10, 15, 0);
> + pci_restore_config_space_range(pdev, 10, 15, 0, false);
>   /* Restore BARs before the command register. */
> - pci_restore_config_space_range(pdev, 4, 9, 10);
> - pci_restore_config_space_range(pdev, 0, 3, 0);
> + pci_restore_config_space_range(pdev, 4, 9, 10, false);
> + pci_restore_config_space_range(pdev, 0, 3, 0, false);
> + } else if (pdev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
> + pci_restore_config_space_range(pdev, 12, 15, 0, false);
> + /* Force rewriting of prefetch registers to avoid
> +  * S3 resume issues on Intel PCI bridges that occur when
> +  * these registers are not explicitly written.
> +  */
> + pci_restore_config_space_range(pdev, 9, 11, 0, true);
> + 

Re: [Nouveau] [PATCH] PCI: Reprogram bridge prefetch registers on resume

2018-09-11 Thread Bjorn Helgaas
[+cc LKML, Dave, Luming]

On Fri, Sep 07, 2018 at 05:05:15PM +0200, Peter Wu wrote:
> On Fri, Sep 07, 2018 at 01:36:14PM +0800, Daniel Drake wrote:
> <..>
> > Thomas Martitz reports that this workaround also solves an issue where
> > the AMD Radeon Polaris 10 GPU on the HP Zbook 14u G5 is unresponsive
> > after S3 suspend/resume.
> 
> Where was this claimed? It is not stated in the linked bug:
> (https://bugs.freedesktop.org/show_bug.cgi?id=105760)
> 
> > On resume, reprogram the PCI bridge prefetch registers, including the
> > magic register mentioned above.
> > 
> > This matches Win10 behaviour, which also rewrites these registers
> > during S3 resume (checked with qemu tracing).
> 
> Windows 10 unconditionally rewrites these registers (BAR, I/O Base +
> Limit, Memory Base + Limit, etc. from top to bottom), see annotations:
> https://www.spinics.net/lists/linux-pci/msg75856.html
> 
> Linux has a generic "restore" operation that works backwards from the
> end of the PCI config space to the beginning, see
> pci_restore_config_space. Do you have a dmesg where you see the
> "restoring config space at offset" messages?
> 
> Would it be reasonable to unconditionally write these registers in
> pci_restore_config_dword, like Windows does?

That sounds reasonable to me.

We did write them unconditionally, prior to 04d9c1a1100b ("[PATCH]
PCI: Improve PCI config space writeback") [1].  That commit apparently
fixed suspend on some laptop.

But at that time, we restored the config space in order of dword 0, 1,
2, ... 15, which means we restored the command register before the
BARs and windows, and it's conceivable that the problem was the
ordering, not the rewriting of the same value.

Only a week later, we reversed the order with 8b8c8d280ab2 ("[PATCH]
PCI: reverse pci config space restore order") [2], which seems like a
good idea and even includes a spec reference.  I found similar
language in the Intel ICH 10 datasheet, sec 14.1.3 [3].

So it seems reasonable to me to try writing them unconditionally in
reverse order (the same order we use today).

[1] 
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=04d9c1a1100b
[2] 
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8b8c8d280ab2
[3] Intel ICH 10 IBL 373635
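
For illustration, the strategy discussed above boils down to something like
the sketch below (based on the quoted discussion, not the eventual mainline
patch): write back all 16 saved dwords unconditionally, walking from the end
of the header toward dword 0, so the BARs and windows are in place before the
command register (in dword 1) is restored.

  #include <linux/pci.h>

  static void restore_config_space_unconditionally(struct pci_dev *pdev)
  {
          int index;

          /* Reverse order: restore BARs/windows before PCI_COMMAND */
          for (index = 15; index >= 0; index--)
                  pci_write_config_dword(pdev, 4 * index,
                                         pdev->saved_config_space[index]);
  }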


Re: [Nouveau] [PATCH] PCI: Reprogram bridge prefetch registers on resume

2018-09-07 Thread Bjorn Helgaas
[+cc LKML]

On Fri, Sep 07, 2018 at 01:36:14PM +0800, Daniel Drake wrote:
> On 38+ Intel-based Asus products, the nvidia GPU becomes unusable
> after S3 suspend/resume. The affected products include multiple
> generations of nvidia GPUs and Intel SoCs. After resume, nouveau logs
> many errors such as:
> 
> fifo: fault 00 [READ] at 00555000 engine 00 [GR] client 04 
> [HUB/FE] reason 4a [] on channel -1 [007fa91000 unknown]
> DRM: failed to idle channel 0 [DRM]
> 
> Similarly, the nvidia proprietary driver also fails after resume
> (black screen, 100% CPU usage in Xorg process). We shipped a sample
> to Nvidia for diagnosis, and their response indicated that it's a
> problem with the parent PCI bridge (on the Intel SoC), not the GPU.
> 
> Runtime suspend/resume works fine, only S3 suspend is affected.
> 
> We found a workaround: on resume, rewrite the Intel PCI bridge
> 'Prefetchable Base Upper 32 Bits' register (PCI_PREF_BASE_UPPER32). In
> the cases that I checked, this register has value 0 and we just have to
> rewrite that value.
> 
> It's very strange that rewriting the exact same register value
> makes a difference, but it definitely makes the issue go away.
> It's not just acting as some kind of memory barrier, because rewriting
> other bridge registers does not work around the issue. There's something
> magic in this particular register. We have confirmed this on all
> the affected models we have in-hands (X542UQ, UX533FD, X530UN, V272UN).
> 
> Additionally, this workaround solves an issue where r8169 MSI-X
> interrupts were broken after S3 suspend/resume on Asus X441UAR. This
> issue was recently worked around in commit 7bb05b85bc2d ("r8169:
> don't use MSI-X on RTL8106e"). It also fixes the same issue on
> RTL6186evl/8111evl on an Aimfor-tech laptop that we had not yet
> patched. I suspect it will also fix the issue that was worked around in
> commit 7c53a722459c ("r8169: don't use MSI-X on RTL8168g").

This is crazy.  I would think *lots* of devices, like anything that
uses prefetchable memory, would be affected by this.

> Thomas Martitz reports that this workaround also solves an issue where
> the AMD Radeon Polaris 10 GPU on the HP Zbook 14u G5 is unresponsive
> after S3 suspend/resume.
> 
> From our testing, the affected Intel PCI bridges are:
> Intel Skylake PCIe Controller (x16) [8086:1901] (rev 05)
> Intel Skylake PCIe Controller (x16) [8086:1901] (rev 07)
> Intel Device [8086:31d8] (rev f3)
> Intel Celeron N3350/Pentium N4200/Atom E3900 Series PCI Express Port B #1 
> [8086:5ad6] (rev fb)
> Intel Celeron N3350/Pentium N4200/Atom E3900 Series PCI Express Port A #1 
> [8086:5ad8] (rev fb)
> Intel Sunrise Point-LP PCI Express Root Port [8086:9d10] (rev f1)
> Intel Sunrise Point-LP PCI Express Root Port #5 [8086:9d14] (rev f1)
> Intel Device [8086:9dbc] (rev f0)
> 
> On resume, reprogram the PCI bridge prefetch registers, including the
> magic register mentioned above.
> 
> This matches Win10 behaviour, which also rewrites these registers
> during S3 resume (checked with qemu tracing).
> 
> Link: 
> https://marc.info/?i=CAD8Lp46Y2eOR7WE28xToUL8s-aYiqPa0nS=1gsd0axkddxq...@mail.gmail.com

Can you use

  
https://lkml.kernel.org/r/CAD8Lp46Y2eOR7WE28xToUL8s-aYiqPa0nS=1gsd0axkddxq...@mail.gmail.com

instead, so we don't depend on marc.info?  lkml.kernel.org doesn't have
linux-pci archives, but it might someday, and it does redirect to other
archives already.

> Link: https://bugs.freedesktop.org/show_bug.cgi?id=105760

I would really like to have a bugzilla.kernel.org issue with the
excellent debugging you and Peter did attached.  Then we don't also
have to depend on github.com, etc., sticking around.

The list of platforms below could also be attached there, since you
went to a lot of trouble to collect it, but it's probably more than is
necessary for the changelog.

> Signed-off-by: Daniel Drake 
> ---
> 
> Notes:
> Replaces patch:
>PCI: add prefetch quirk to work around Asus/Nvidia suspend issues
> 
> Below is the list of Asus products with Intel/Nvidia that we
> believe are affected by the GPU resume issue.
> 
> I revised my counting method from my last patch to eliminate duplicate
> platforms that had multiple SKUs with the same DMI/GPU/bridge, that's why
> the product count reduced from 43 to 38.
> 
> sys_vendor: ASUSTeK COMPUTER INC.
> board_name: FX502VD
> product_name: FX502VD
> 01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev 
> ff) (prog-if ff)
>   !!! Unknown header type 7f
> 00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 05) 
> (prog-if 00 [Normal decode])
> ...


Re: [Nouveau] [PATCH] PCI: add prefetch quirk to work around Asus/Nvidia suspend issues

2018-08-31 Thread Bjorn Helgaas
[+cc Intel folks]

On Fri, Aug 31, 2018 at 03:30:57PM +0800, Daniel Drake wrote:
> On over 40 Intel-based Asus products, the nvidia GPU becomes unusable
> after S3 suspend/resume. The affected products include multiple
> generations of nvidia GPUs and Intel SoCs. After resume, nouveau logs
> many errors such as:
> 
> fifo: fault 00 [READ] at 00555000 engine 00 [GR] client 04 
> [HUB/FE] reason 4a [] on channel -1 [007fa91000 unknown]
> DRM: failed to idle channel 0 [DRM]
> 
> Similarly, the nvidia proprietary driver also fails after resume
> (black screen, 100% CPU usage in Xorg process). We shipped a sample
> to Nvidia for diagnosis, and their response indicated that it's a
> problem with the parent PCI bridge (on the Intel SoC), not the GPU.
> 
> We found a workaround: on resume, rewrite the Intel PCI bridge
> 'Prefetchable Base Upper 32 Bits' register. In the cases that I checked,
> this register has value 0 and we just have to rewrite that value.
> 
> It's very strange that rewriting the exact same register value
> makes a difference, but it definitely makes the issue go away.
> It's not just acting as some kind of memory barrier, because rewriting
> other bridge registers does not work around the issue. There's something
> magic in this particular register.

If true, this sounds like some sort of erratum, so it would be good to
get some input from Intel, and I cc'd a few Intel folks.

It's interesting that all the systems below are from Asus.  That makes
me think there's some BIOS or SMM connection, e.g., SMM traps the
register write and does something magic.

Does this problem happen after a full system suspend/resume, or does
it happen after runtime suspend of only the GPU?  Or runtime suspend
of only the GPU and the upstream bridge?

Can we tell whether Windows rewrites this register unconditionally at
resume-time?  If so, it may be more robust for Linux to do the same.
The whole thing is black magic, which I hate, but if it's our only
choice, it may be better to have this applied everywhere so we don't
keep stubbing our toes on new systems that require the quirk.

> We examined our database of Asus hardware and identified 43 products
> that we believe are affected. Checking the nvidia GPU parent PCI bridge
> on each one, in total 5 Intel PCI bridges need quirking as below.
> The quirk will run on bridges even where no nvidia GPU is connected,
> but it should be harmless, and we at least limit it to only running
> on Asus products.
> 
> This fix was tested on all the affected models that we have in hands
> (X542UQ, UX533FD, X530UN, V272UN).
> 
> Signed-off-by: Daniel Drake 
> ---
> 
> Notes:
> If anyone has ideas for why writing this register makes a difference, or
> suggestions for other approaches then I'm all ears...
> 
> Here is some basic info of the 43 products believed to be affected:
> basic DMI data, nvidia GPU PCI info, parent PCI bridge info.

Can you attach the list below to a kernel.org bugzilla and include the
URL in your changelog?

> sys_vendor: ASUSTeK COMPUTER INC.
> board_name: FX502VD
> product_name: FX502VD
> 01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev 
> ff) (prog-if ff)
>   !!! Unknown header type 7f
> 00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 05) 
> (prog-if 00 [Normal decode])
> 
> sys_vendor: ASUSTeK COMPUTER INC.
> board_name: FX570UD
> product_name: ASUS Gaming FX570UD
> 01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev 
> a1)
>   Subsystem: ASUSTeK Computer Inc. Device [1043:1f40]
> 00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) 
> (prog-if 00 [Normal decode])
> 
> sys_vendor: ASUSTeK COMPUTER INC.
> board_name: GL553VD
> product_name: GL553VD
> 01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev 
> a1)
>   Subsystem: ASUSTeK Computer Inc. Device [1043:15e0]
> 00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 05) 
> (prog-if 00 [Normal decode])
> 
> sys_vendor: ASUSTeK COMPUTER INC.
> board_name: GL553VD
> product_name: GL553VD
> 01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev 
> a1)
>   Subsystem: ASUSTeK Computer Inc. Device [1043:15e0]
> 00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 05) 
> (prog-if 00 [Normal decode])
> 
> sys_vendor: ASUSTeK COMPUTER INC.
> board_name: GL753VD
> product_name: GL753VD
> 01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev 
> a1)
>   Subsystem: ASUSTeK Computer Inc. Device [1043:1590]
> 00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 05) 
> (prog-if 00 [Normal decode])
> 
> sys_vendor: ASUSTeK COMPUTER INC.
> board_name: GL753VD
> product_name: GL753VD
> 01:00.0 3D controller [0302]: NVIDIA Corporation Device 

Re: [Nouveau] [PATCH v2 0/7] Modernize vga_switcheroo by using device link for HDA

2018-03-23 Thread Bjorn Helgaas
On Sun, Mar 11, 2018 at 04:55:49PM +0100, Lukas Wunner wrote:
> On Tue, Mar 06, 2018 at 11:29:40AM +0100, Daniel Vetter wrote:
> > On Sat, Mar 03, 2018 at 10:53:24AM +0100, Lukas Wunner wrote:
> > > Modernize vga_switcheroo by using a device link to enforce a runtime PM
> > > dependency from an HDA controller to the GPU it's integrated into, v2.
> > > 
> > > https://github.com/l1k/linux/commits/switcheroo_devlink_v2
> > 
> > This all looks really reasonable and like a good cleanup, but it's a bit
> > too much detail so I'll punt review to someone else with more clue.
> 
> Patches [3/7] to [7/7] were reviewed by Peter Wu, the HDA bits in
> patch [5/7] additionally by Takashi.
> 
> Patch [2/7] was acked by Bjorn.  There was no ack for patch [1/7]
> (authored by Rafael), but it adressed the objection Bjorn raised
> against my original patch, so I'm assuming Bjorn is okay with it.
> (Bjorn, please let me know if that isn't the case.)

I am OK with it.  I sent an ack and a possible minor changelog tweak.

I expect that you'll merge the whole series via drm-misc.

> The series has been tested on 5 systems, which raises the confidence:
> 2x AMD PowerXpress (Mike Lothian, Kai Heng Feng)
> 2x Nvidia Optimus (Denis Lisov, Peter Wu)
> 1x MacBook Pro
> 
> The issues found during Peter Wu's thorough testing appear to all
> be unrelated to this series, as per my e-mail yesterday.
> 
> If there are no objections, I plan to push the series to
> drm-misc-next by the middle of the coming week so that it
> would still catch the last train to 4.17.
> 
> Thanks,
> 
> Lukas


Re: [Nouveau] [PATCH v2 1/7] PCI: Restore config space on runtime resume despite being unbound

2018-03-13 Thread Bjorn Helgaas
On Sat, Mar 03, 2018 at 10:53:24AM +0100, Lukas Wunner wrote:
> From: Rafael J. Wysocki <r...@rjwysocki.net>
> 
> We leave PCI devices not bound to a driver in D0 during runtime suspend.
> But they may have a parent which is bound and can be transitioned to
> D3cold at runtime.  Once the parent goes to D3cold, the unbound child
> may go to D3cold as well.  When the child comes out of D3cold, its BARs
> are uninitialized and thus inaccessible when a driver tries to probe.

There's no clear way to tell whether a BAR is uninitialized.  At
power-up, the writable bits will be zero, which is a valid BAR value.
If enabled in PCI_COMMAND, the BAR is accessible and may conflict with
other devices.

Possible alternate wording:

  When the child goes to D3cold, its internal state, including
  configuration of BARs, MSI, ASPM, MPS, etc., is lost.

> Moreover configuration done during enumeration, e.g. ASPM and MPS, will
> be lost.
> 
> One example are recent hybrid graphics laptops which cut power to the
> discrete GPU when the root port above it goes to ACPI power state D3.
> Users may provoke this by unbinding the GPU driver and allowing runtime
> PM on the GPU via sysfs:  The PM core will then treat the GPU as
> "suspended", which in turn allows the root port to runtime suspend,
> causing the power resources listed in its _PR3 object to be powered off.
> The GPU's BARs will be uninitialized when a driver later probes it.
> 
> Another example are hybrid graphics laptops where the GPU itself (rather
> than the root port) is capable of runtime suspending to D3cold.  If the
> GPU's integrated HDA controller is not bound and the GPU's driver
> decides to runtime suspend to D3cold, the HDA controller's BARs will be
> uninitialized when a driver later probes it.
> 
> Fix by saving and restoring config space over a runtime suspend cycle
> even if the device is not bound.
> 
> Cc: Bjorn Helgaas <bhelg...@google.com>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
> [lukas: add commit message, bikeshed code comments for clarity]
> Signed-off-by: Lukas Wunner <lu...@wunner.de>

Acked-by: Bjorn Helgaas <bhelg...@google.com>

> ---
> Changes since v1:
> - Replace patch to use pci_save_state() / pci_restore_state()
>   for consistency between runtime PM code path of bound and unbound
>   devices. (Rafael, Bjorn)
> 
>  drivers/pci/pci-driver.c | 17 +++--
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 3bed6beda051..6a67cdbd0e6a 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -1224,11 +1224,14 @@ static int pci_pm_runtime_suspend(struct device *dev)
>   int error;
>  
>   /*
> -  * If pci_dev->driver is not set (unbound), the device should
> -  * always remain in D0 regardless of the runtime PM status
> +  * If pci_dev->driver is not set (unbound), we leave the device in D0,
> +  * but it may go to D3cold when the bridge above it runtime suspends.
> +  * Save its config space in case that happens.

Thanks for this clarification.

>*/
> - if (!pci_dev->driver)
> + if (!pci_dev->driver) {
> + pci_save_state(pci_dev);
>   return 0;
> + }
>  
>   if (!pm || !pm->runtime_suspend)
>   return -ENOSYS;
> @@ -1276,16 +1279,18 @@ static int pci_pm_runtime_resume(struct device *dev)
>   const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
>  
>   /*
> -  * If pci_dev->driver is not set (unbound), the device should
> -  * always remain in D0 regardless of the runtime PM status
> +  * Restoring config space is necessary even if the device is not bound
> +  * to a driver because although we left it in D0, it may have gone to
> +  * D3cold when the bridge above it runtime suspended.
>*/
> + pci_restore_standard_config(pci_dev);
> +
>   if (!pci_dev->driver)
>   return 0;
>  
>   if (!pm || !pm->runtime_resume)
>   return -ENOSYS;
>  
> - pci_restore_standard_config(pci_dev);
>   pci_fixup_device(pci_fixup_resume_early, pci_dev);
>   pci_enable_wake(pci_dev, PCI_D0, false);
>   pci_fixup_device(pci_fixup_resume, pci_dev);
> -- 
> 2.15.1
> 


Re: [Nouveau] [PATCH 5/7] vga_switcheroo: Use device link for HDA controller

2018-02-23 Thread Bjorn Helgaas
[+cc George]

On Fri, Feb 23, 2018 at 10:51:47AM +0100, Lukas Wunner wrote:
> On Tue, Feb 20, 2018 at 04:20:59PM -0600, Bjorn Helgaas wrote:
> > On Sun, Feb 18, 2018 at 09:38:32AM +0100, Lukas Wunner wrote:
> > > The device link is added in a PCI quirk rather than in hda_intel.c.
> > > It is therefore legal for the GPU to runtime suspend to D3cold even if
> > > the HDA controller is not bound to a driver or if CONFIG_SND_HDA_INTEL
> > > is not enabled, for accesses to the HDA controller will cause the GPU to
> > > wake up regardless if they're occurring outside of hda_intel.c (think
> > > config space readout via sysfs).
> > 
> > I guess this GPU wakeup happens *if* the path accessing the HDA
> > controller calls pm_runtime_get_sync() or similar, right?
> 
> Right.
> 
> > We do have
> > that in the sysfs config accessors via pci_config_pm_runtime_get(),
> > but not in the sysfs mmap paths.  I think that's a latent PCI core
> > defect independent of this series.
> 
> Yes, there may be a few places where the parent of the device is
> erroneously not runtime resumed when sysfs files are accessed.
> I've never used the sysfs mmap feature, so never witnessed issues
> with it.
> 
> > We also don't have that in PCI core config accessors.  That maybe
> > doesn't matter because the core probably shouldn't be touching devices
> > after enumeration (although there might be holes there for things like
> > ASPM where a sysfs file can cause reconfiguration of several devices).
> 
> I assume with PCI core config accessors you're referring to
> pci_read_config_word() and friends?  Those are arguably hotpaths
> where we wouldn't want the overhead to check runtime PM status
> every time.  They might also be called from atomic context, I'm
> not sure, and the runtime PM callbacks may sleep by default
> (unless pm_runtime_irq_safe() was called).

Yes, I was thinking of pci_read_config_word(), etc.  They're used by
the core during enumeration and occasionally by drivers, and both
cases already pay attention to PM.  I think we have a few holes in the
runtime sysfs area, but I think it's reasonable to deal with those in
the callers.
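
For reference, the pattern in the sysfs config accessors that this refers to
looks roughly like the simplified sketch below; pci_config_pm_runtime_get()
and pci_config_pm_runtime_put() are internal PCI core helpers that
runtime-resume the parent bridge and, if the device is in D3cold, the device
itself.

  static u32 sysfs_style_config_read(struct pci_dev *pdev, int where)
  {
          u32 val;

          pci_config_pm_runtime_get(pdev);  /* wake parent (and dev if D3cold) */
          pci_read_config_dword(pdev, where, &val);
          pci_config_pm_runtime_put(pdev);

          return val;
  }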

So I don't think those need to actually *do* anything with PM status,
although it might be interesting to add some (optional) checking
there.  George Cherian is turning up some issues in this area because
he has a root port that reports errors with an exception instead of
silently synthesizing all ones data like most hardware does [1].

Bjorn

[1] 
https://lkml.kernel.org/r/1517554846-16703-1-git-send-email-george.cher...@cavium.com
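
Coming back to the device link set up in the quirk at the start of this
thread, the arrangement Lukas describes reduces to roughly the following
sketch.  The topology is an assumption for illustration (HDA controller at
function 1 of the same slot as the GPU at function 0), and the real quirk
filters for GPU-integrated HDA controllers, which is elided here.

  #include <linux/pci.h>

  static void quirk_gpu_hda_devlink(struct pci_dev *hda)
  {
          struct pci_dev *gpu;

          /* Find function 0 (the GPU) of the HDA controller's device */
          gpu = pci_get_slot(hda->bus, hda->devfn & ~0x7);
          if (!gpu)
                  return;

          /*
           * Consumer = HDA, supplier = GPU: any runtime resume of the
           * HDA (e.g. a sysfs config space read) first wakes the GPU.
           */
          device_link_add(&hda->dev, &gpu->dev,
                          DL_FLAG_STATELESS | DL_FLAG_PM_RUNTIME);
          pci_dev_put(gpu);
  }
  DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_ANY_ID, PCI_ANY_ID,
                                PCI_CLASS_MULTIMEDIA_HD_AUDIO, 8,
                                quirk_gpu_hda_devlink);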


Re: [Nouveau] [PATCH 2/7] PCI: Make pci_wakeup_bus() & pci_bus_set_current_state() public

2018-02-22 Thread Bjorn Helgaas
On Sun, Feb 18, 2018 at 09:38:32AM +0100, Lukas Wunner wrote:
> There are PCI devices which are power-manageable by a nonstandard means,
> such as a custom ACPI method.  One example are discrete GPUs in hybrid
> graphics laptops, another are Thunderbolt controllers in Macs.
> 
> Such devices can't be put into D3cold with pci_set_power_state() because
> pci_platform_power_transition() fails with -ENODEV.  Instead they're put
> into D3hot by pci_set_power_state() and subsequently into D3cold by
> invoking the nonstandard means.  However as a consequence the cached
> current_state is incorrectly left at D3hot.
> 
> What we need to do is walk the hierarchy below such a PCI device on
> powerdown and update the current_state to D3cold.  On powerup the PCI
> device itself and the hierarchy below it is in D0uninitialized, so we
> need to walk the hierarchy again and wake all devices, causing them to
> be put into D0active and then letting them autosuspend as they see fit.
> 
> To this end make pci_wakeup_bus() & pci_bus_set_current_state() public
> so PCI drivers don't have to reinvent the wheel.
> 
> Cc: Bjorn Helgaas <bhelg...@google.com>
> Cc: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
> Signed-off-by: Lukas Wunner <lu...@wunner.de>

Acked-by: Bjorn Helgaas <bhelg...@google.com>
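
(As a sketch of the intended use -- hypothetical driver code; the
my_platform_power_off/on() helpers stand in for the nonstandard means,
e.g. a custom ACPI method:)

  static int my_gpu_runtime_suspend(struct device *dev)
  {
          struct pci_dev *pdev = to_pci_dev(dev);

          pci_save_state(pdev);
          pci_set_power_state(pdev, PCI_D3hot);   /* standard part */
          my_platform_power_off(pdev);            /* nonstandard part */

          /* everything that lost power along with us is now in D3cold */
          pci_bus_set_current_state(pdev->bus, PCI_D3cold);
          return 0;
  }

  static int my_gpu_runtime_resume(struct device *dev)
  {
          struct pci_dev *pdev = to_pci_dev(dev);

          my_platform_power_on(pdev);

          /* devices come back in D0uninitialized; wake them so they can
           * transition to D0active and then autosuspend as they see fit */
          pci_wakeup_bus(pdev->bus);
          pci_restore_state(pdev);
          return 0;
  }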

> ---
>  drivers/pci/pci.c   | 8 
>  include/linux/pci.h | 2 ++
>  2 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index f694650235f2..6e6e322a5a7d 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -800,7 +800,7 @@ static int pci_wakeup(struct pci_dev *pci_dev, void *ign)
>   * pci_wakeup_bus - Walk given bus and wake up devices on it
>   * @bus: Top bus of the subtree to walk.
>   */
> -static void pci_wakeup_bus(struct pci_bus *bus)
> +void pci_wakeup_bus(struct pci_bus *bus)
>  {
>   if (bus)
>   pci_walk_bus(bus, pci_wakeup, NULL);
> @@ -850,11 +850,11 @@ static int __pci_dev_set_current_state(struct pci_dev 
> *dev, void *data)
>  }
>  
>  /**
> - * __pci_bus_set_current_state - Walk given bus and set current state of 
> devices
> + * pci_bus_set_current_state - Walk given bus and set current state of 
> devices
>   * @bus: Top bus of the subtree to walk.
>   * @state: state to be set
>   */
> -static void __pci_bus_set_current_state(struct pci_bus *bus, pci_power_t 
> state)
> +void pci_bus_set_current_state(struct pci_bus *bus, pci_power_t state)
>  {
>   if (bus)
>   pci_walk_bus(bus, __pci_dev_set_current_state, &state);
> @@ -876,7 +876,7 @@ int __pci_complete_power_transition(struct pci_dev *dev, 
> pci_power_t state)
>   ret = pci_platform_power_transition(dev, state);
>   /* Power off the bridge may power off the whole hierarchy */
>   if (!ret && state == PCI_D3cold)
> - __pci_bus_set_current_state(dev->subordinate, PCI_D3cold);
> + pci_bus_set_current_state(dev->subordinate, PCI_D3cold);
>   return ret;
>  }
>  EXPORT_SYMBOL_GPL(__pci_complete_power_transition);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 024a1beda008..ae42289662df 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1147,6 +1147,8 @@ void pci_pme_wakeup_bus(struct pci_bus *bus);
>  void pci_d3cold_enable(struct pci_dev *dev);
>  void pci_d3cold_disable(struct pci_dev *dev);
>  bool pcie_relaxed_ordering_enabled(struct pci_dev *dev);
> +void pci_wakeup_bus(struct pci_bus *bus);
> +void pci_bus_set_current_state(struct pci_bus *bus, pci_power_t state);
>  
>  /* PCI Virtual Channel */
>  int pci_save_vc_state(struct pci_dev *dev);
> -- 
> 2.15.1
> 


Re: [Nouveau] [PATCH 5/7] vga_switcheroo: Use device link for HDA controller

2018-02-20 Thread Bjorn Helgaas
ors located on the GPU,
> so the GPU needs to be on for the HDA controller to do anything useful.
> 
> This commit implicitly fixes an unbalanced runtime PM ref upon unbind of
> hda_intel.c:  On ->probe, a runtime PM ref was previously released under
> the condition "azx_has_pm_runtime(chip) || hda->use_vga_switcheroo", but
> on ->remove a runtime PM ref was only acquired under the first of those
> conditions.  Thus, binding and unbinding the driver twice on a
> vga_switcheroo capable system caused the runtime PM refcount to drop
> below zero.  The issue is resolved because the AZX_DCAPS_PM_RUNTIME flag
> is now always set if use_vga_switcheroo is true.
> 
> For more information on device links please refer to:
> https://www.kernel.org/doc/html/latest/driver-api/device_link.html
> Documentation/driver-api/device_link.rst
> 
> Cc: Dave Airlie <airl...@redhat.com>
> Cc: Ben Skeggs <bske...@redhat.com>
> Cc: Takashi Iwai <ti...@suse.de>
> Cc: Peter Wu <pe...@lekensteyn.nl>
> Cc: Alex Deucher <alexander.deuc...@amd.com>
> Cc: Bjorn Helgaas <bhelg...@google.com>
> Cc: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
> Signed-off-by: Lukas Wunner <lu...@wunner.de>

Acked-by: Bjorn Helgaas <bhelg...@google.com>

(Trivial nit below.)
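
(For readers following along, the shape of such a quirk is roughly the
following -- a sketch, not the literal hunk from the patch:)

  static void quirk_gpu_hda(struct pci_dev *hda)
  {
          struct pci_dev *gpu;

          if (PCI_FUNC(hda->devfn) != 1)
                  return;

          /* the GPU is function 0 of the same device */
          gpu = pci_get_domain_bus_and_slot(pci_domain_nr(hda->bus),
                                            hda->bus->number,
                                            hda->devfn & ~0x7);
          if (!gpu)
                  return;

          /* HDA (consumer) follows the GPU (supplier) through runtime PM */
          device_link_add(&hda->dev, &gpu->dev,
                          DL_FLAG_STATELESS | DL_FLAG_PM_RUNTIME);
          pci_dev_put(gpu);
  }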

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |   2 -
>  drivers/gpu/drm/nouveau/nouveau_drm.c   |   2 -
>  drivers/gpu/drm/radeon/radeon_drv.c |   2 -
>  drivers/gpu/vga/vga_switcheroo.c| 115 
> +++-
>  drivers/pci/quirks.c|  39 +++
>  include/linux/pci_ids.h |   1 +
>  include/linux/vga_switcheroo.h  |   6 --
>  include/sound/hdaudio.h |   3 -
>  sound/pci/hda/hda_intel.c   |  36 +++---
>  sound/pci/hda/hda_intel.h   |   3 -
>  10 files changed, 73 insertions(+), 136 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 50afcf65181a..ba4335fd4f65 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -720,7 +720,6 @@ static int amdgpu_pmops_runtime_suspend(struct device 
> *dev)
>  
>   drm_dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
>   drm_kms_helper_poll_disable(drm_dev);
> - vga_switcheroo_set_dynamic_switch(pdev, VGA_SWITCHEROO_OFF);
>  
>   ret = amdgpu_device_suspend(drm_dev, false, false);
>   pci_save_state(pdev);
> @@ -757,7 +756,6 @@ static int amdgpu_pmops_runtime_resume(struct device *dev)
>  
>   ret = amdgpu_device_resume(drm_dev, false, false);
>   drm_kms_helper_poll_enable(drm_dev);
> - vga_switcheroo_set_dynamic_switch(pdev, VGA_SWITCHEROO_ON);
>   drm_dev->switch_power_state = DRM_SWITCH_POWER_ON;
>   return 0;
>  }
> diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c 
> b/drivers/gpu/drm/nouveau/nouveau_drm.c
> index 3e293029e3a6..6959951d45d6 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_drm.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
> @@ -856,7 +856,6 @@ nouveau_pmops_runtime_suspend(struct device *dev)
>   }
>  
>   drm_kms_helper_poll_disable(drm_dev);
> - vga_switcheroo_set_dynamic_switch(pdev, VGA_SWITCHEROO_OFF);
>   nouveau_switcheroo_optimus_dsm();
>   ret = nouveau_do_suspend(drm_dev, true);
>   pci_save_state(pdev);
> @@ -891,7 +890,6 @@ nouveau_pmops_runtime_resume(struct device *dev)
>  
>   /* do magic */
>  nvif_mask(&device->object, 0x088488, (1 << 25), (1 << 25));
> - vga_switcheroo_set_dynamic_switch(pdev, VGA_SWITCHEROO_ON);
>   drm_dev->switch_power_state = DRM_SWITCH_POWER_ON;
>  
>   /* Monitors may have been connected / disconnected during suspend */
> diff --git a/drivers/gpu/drm/radeon/radeon_drv.c 
> b/drivers/gpu/drm/radeon/radeon_drv.c
> index 31dd04f6baa1..b28288a781ef 100644
> --- a/drivers/gpu/drm/radeon/radeon_drv.c
> +++ b/drivers/gpu/drm/radeon/radeon_drv.c
> @@ -415,7 +415,6 @@ static int radeon_pmops_runtime_suspend(struct device 
> *dev)
>  
>   drm_dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
>   drm_kms_helper_poll_disable(drm_dev);
> - vga_switcheroo_set_dynamic_switch(pdev, VGA_SWITCHEROO_OFF);
>  
>   ret = radeon_suspend_kms(drm_dev, false, false, false);
>   pci_save_state(pdev);
> @@ -452,7 +451,6 @@ static int radeon_pmops_runtime_resume(struct device *dev)
>  
>   ret = radeon_resume_kms(drm_dev, false, false);
>   drm_kms_helper_poll_enable(drm_dev);
> - vga_switcheroo_set_dynamic_switch(pdev, VGA_SWITCHEROO_ON);
>   drm_dev->switch_power_state

Re: [Nouveau] [PATCH 1/7] PCI: Restore BARs on runtime resume despite being unbound

2018-02-20 Thread Bjorn Helgaas
On Sun, Feb 18, 2018 at 09:38:32AM +0100, Lukas Wunner wrote:
> PCI devices not bound to a driver are supposed to stay in D0 during
> runtime suspend.

Doesn't "runtime suspend" mean an individual device is suspended while
the rest of the system remains active?

If so, maybe it would be more direct to say "PCI devices not bound to
a driver cannot be runtime suspended"?

(It's a separate question not relevant to this patch, but I'm not
convinced that "PCI devices without a driver cannot be suspended"
should be accepted as a rule.  If it is a rule, we should be able to
deduce it from the specs.)

> But they may have a parent which is bound and can be
> transitioned to D3cold at runtime.  Once the parent goes to D3cold, the
> unbound child may go to D3cold as well.  When the child comes out of
> D3cold, its BARs are uninitialized and thus inaccessible when a driver
> tries to probe.
> 
> One example are recent hybrid graphics laptops which cut power to the
> discrete GPU when the root port above it goes to ACPI power state D3.
> Users may provoke this by unbinding the GPU driver and allowing runtime
> PM on the GPU via sysfs:  The PM core will then treat the GPU as
> "suspended", which in turn allows the root port to runtime suspend,
> causing the power resources listed in its _PR3 object to be powered off.
> The GPU's BARs will be uninitialized when a driver later probes it.
> 
> Another example are hybrid graphics laptops where the GPU itself (rather
> than the root port) is capable of runtime suspending to D3cold.  If the
> GPU's integrated HDA controller is not bound and the GPU's driver
> decides to runtime suspend to D3cold, the HDA controller's BARs will be
> uninitialized when a driver later probes it.
> 
> Fix by restoring the BARs on runtime resume if the device is not bound.
> This is sufficient to fix the above-mentioned use cases.  Other use
> cases might require a full-blown pci_save_state() / pci_restore_state()
> or execution of fixups.  We can add that once use cases materialize,
> let's not inflate the code unnecessarily.

Why would we not do a full-blown restore?  With this patch, I think
some configuration done during enumeration, e.g., ASPM and MPS, will
be lost.  "lspci -vv" of the HDA before and after the suspend may show
different things, which seems counter-intuitive.

I wouldn't think of a full-blown restore as "inflating the code"; I
would think of it as "having fewer special cases", i.e., we always use
the same restore path instead of having the main one plus a special
one for unbound devices.

> Cc: Bjorn Helgaas <bhelg...@google.com>
> Cc: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
> Signed-off-by: Lukas Wunner <lu...@wunner.de>
> ---
>  drivers/pci/pci-driver.c | 8 ++--
>  drivers/pci/pci.c| 2 +-
>  drivers/pci/pci.h| 1 +
>  3 files changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 3bed6beda051..51b11cbd48f6 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -1277,10 +1277,14 @@ static int pci_pm_runtime_resume(struct device *dev)
>  
>   /*
>* If pci_dev->driver is not set (unbound), the device should
> -  * always remain in D0 regardless of the runtime PM status
> +  * always remain in D0 regardless of the runtime PM status.
> +  * But if its parent can go to D3cold, this device may have
> +  * been in D3cold as well and require restoration of its BARs.
>*/
> - if (!pci_dev->driver)
> + if (!pci_dev->driver) {
> + pci_restore_bars(pci_dev);

If we do decide not to do a full-blown restore, how do we decide
whether to use pci_restore_bars() or pci_restore_config_space()?

I'm not sure why we have both.  The pci_restore_bars() path looks a
little smarter in that it is more careful when updating 64-bit BARs
that can't be updated atomically.
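
(If memory serves, pci_restore_bars() is essentially just

  void pci_restore_bars(struct pci_dev *dev)
  {
          int i;

          for (i = 0; i < PCI_BRIDGE_RESOURCES; i++)
                  pci_update_resource(dev, i);
  }

modulo details like skipping VF BARs, and pci_update_resource() is where
the careful non-atomic 64-bit BAR handling lives, while
pci_restore_config_space() just writes back the saved config words.)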

>   return 0;
> + }
>  
>   if (!pm || !pm->runtime_resume)
>   return -ENOSYS;
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index f6a4dd10d9b0..f694650235f2 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -563,7 +563,7 @@ int pci_wait_for_pending(struct pci_dev *dev, int pos, 
> u16 mask)
>   * Restore the BAR values for a given device, so as to make it
>   * accessible by its driver.
>   */
> -static void pci_restore_bars(struct pci_dev *dev)
> +void pci_restore_bars(struct pci_dev *dev)
>  {
>   int i;
>  
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index fcd81911b127..29dc15bbe3bf 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -83,6 +83,7 @

Re: [Nouveau] [PATCH v1] ACPI: Switch to use generic UUID API

2017-05-06 Thread Bjorn Helgaas
On Thu, May 4, 2017 at 4:21 AM, Andy Shevchenko
<andriy.shevche...@linux.intel.com> wrote:
> acpi_evaluate_dsm() and friends take a pointer to a raw buffer of 16
> bytes. Instead we convert them to use uuid_le type. At the same time we
> convert current users.
>
> acpi_str_to_uuid() becomes useless after the conversion and it's safe to
> get rid of it.
>
> The conversion fixes a potential bug in int340x_thermal as well since
> we have to use memcmp() on binary data.
>
> Cc: Rafael J. Wysocki <r...@rjwysocki.net>
> Cc: Mika Westerberg <mika.westerb...@linux.intel.com>
> Cc: Borislav Petkov <b...@suse.de>
> Cc: Dan Williams <dan.j.willi...@intel.com>
> Cc: Amir Goldstein <amir7...@gmail.com>
> Cc: Jarkko Sakkinen <jarkko.sakki...@linux.intel.com>
> Cc: Jani Nikula <jani.nik...@linux.intel.com>
> Cc: Ben Skeggs <bske...@redhat.com>
> Cc: Benjamin Tissoires <benjamin.tissoi...@redhat.com>
> Cc: Joerg Roedel <j...@8bytes.org>
> Cc: Adrian Hunter <adrian.hun...@intel.com>
> Cc: Yisen Zhuang <yisen.zhu...@huawei.com>
> Cc: Bjorn Helgaas <bhelg...@google.com>
> Cc: Zhang Rui <rui.zh...@intel.com>
> Cc: Felipe Balbi <ba...@kernel.org>
> Cc: Mathias Nyman <mathias.ny...@intel.com>
> Cc: Heikki Krogerus <heikki.kroge...@linux.intel.com>
> Cc: Liam Girdwood <lgirdw...@gmail.com>
> Cc: Mark Brown <broo...@kernel.org>
> Signed-off-by: Andy Shevchenko <andriy.shevche...@linux.intel.com>

For the drivers/pci parts:

Acked-by: Bjorn Helgaas <bhelg...@google.com>
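
(For anyone converting other call sites, the new calling convention in
one place -- a sketch mirroring the extlog hunk below; MY_DSM_UUID,
MY_REV and MY_FN are made-up placeholders:)

  static bool my_check_dsm(acpi_handle handle)
  {
          union acpi_object *obj;
          uuid_le uuid;

          if (uuid_le_to_bin(MY_DSM_UUID, &uuid)) /* parses and validates */
                  return false;
          if (!acpi_check_dsm(handle, &uuid, MY_REV, 1 << MY_FN))
                  return false;
          obj = acpi_evaluate_dsm_typed(handle, &uuid, MY_REV, MY_FN,
                                        NULL, ACPI_TYPE_INTEGER);
          if (!obj)
                  return false;
          ACPI_FREE(obj);
          return true;
  }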

> ---
>  drivers/acpi/acpi_extlog.c | 10 +++---
>  drivers/acpi/bus.c | 29 ++--
>  drivers/acpi/nfit/core.c   | 40 
> +++---
>  drivers/acpi/nfit/nfit.h   |  3 +-
>  drivers/acpi/utils.c   |  4 +--
>  drivers/char/tpm/tpm_crb.c |  9 +++--
>  drivers/char/tpm/tpm_ppi.c | 20 +--
>  drivers/gpu/drm/i915/intel_acpi.c  | 14 +++-
>  drivers/gpu/drm/nouveau/nouveau_acpi.c | 20 +--
>  drivers/gpu/drm/nouveau/nvkm/subdev/mxm/base.c |  9 +++--
>  drivers/hid/i2c-hid/i2c-hid.c  |  9 +++--
>  drivers/iommu/dmar.c   | 11 +++---
>  drivers/mmc/host/sdhci-pci-core.c  |  9 +++--
>  drivers/net/ethernet/hisilicon/hns/hns_dsaf_misc.c | 15 
>  drivers/pci/pci-acpi.c | 11 +++---
>  drivers/pci/pci-label.c|  4 +--
>  drivers/thermal/int340x_thermal/int3400_thermal.c  |  8 ++---
>  drivers/usb/dwc3/dwc3-pci.c|  6 ++--
>  drivers/usb/host/xhci-pci.c|  9 +++--
>  drivers/usb/misc/ucsi.c|  2 +-
>  drivers/usb/typec/typec_wcove.c|  4 +--
>  include/acpi/acpi_bus.h|  9 ++---
>  include/linux/acpi.h   |  4 +--
>  include/linux/pci-acpi.h   |  2 +-
>  sound/soc/intel/skylake/skl-nhlt.c |  7 ++--
>  tools/testing/nvdimm/test/iomap.c  |  2 +-
>  tools/testing/nvdimm/test/nfit.c   |  2 +-
>  27 files changed, 116 insertions(+), 156 deletions(-)
>
> diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c
> index 502ea4dc2080..69d6140b6afa 100644
> --- a/drivers/acpi/acpi_extlog.c
> +++ b/drivers/acpi/acpi_extlog.c
> @@ -182,17 +182,17 @@ static int extlog_print(struct notifier_block *nb, 
> unsigned long val,
>
>  static bool __init extlog_get_l1addr(void)
>  {
> -   u8 uuid[16];
> +   uuid_le uuid;
> acpi_handle handle;
> union acpi_object *obj;
>
> -   acpi_str_to_uuid(extlog_dsm_uuid, uuid);
> -
> +   if (uuid_le_to_bin(extlog_dsm_uuid, &uuid))
> +   return false;
> if (ACPI_FAILURE(acpi_get_handle(NULL, "\\_SB", &handle)))
> return false;
> -   if (!acpi_check_dsm(handle, uuid, EXTLOG_DSM_REV, 1 << 
> EXTLOG_FN_ADDR))
> +   if (!acpi_check_dsm(handle, &uuid, EXTLOG_DSM_REV, 1 << 
> EXTLOG_FN_ADDR))
> return false;
> -   obj = acpi_evaluate_dsm_typed(handle, uuid, EXTLOG_DSM_REV,
> +   obj = acpi_evaluate_dsm_typed(handle, &uuid, EXTLOG_DSM_REV,
>   EXTLOG_FN_ADDR, NULL, 
> ACPI_TYPE_INTEGER);
> if (!obj) {
> return false;
> diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
> index 784bda663d16..e8130a

Re: [Nouveau] [PATCH 1/5] PCI: Recognize Thunderbolt devices

2017-03-03 Thread Bjorn Helgaas
On Sat, Feb 25, 2017 at 08:40:03AM +0100, Lukas Wunner wrote:
> On Fri, Feb 24, 2017 at 04:17:24PM -0600, Bjorn Helgaas wrote:
> > On Fri, Feb 24, 2017 at 08:19:45PM +0100, Lukas Wunner wrote:
> > > --- a/include/linux/pci.h
> > > +++ b/include/linux/pci.h
> > > @@ -358,6 +358,7 @@ struct pci_dev {
> > >   unsigned int    is_virtfn:1;
> > >   unsigned int    reset_fn:1;
> > >   unsigned int    is_hotplug_bridge:1;
> > > + unsigned int    is_thunderbolt:1; /* Thunderbolt controller */
> > 
> > I'm not really keen on having this in the PCI core because the core
> > doesn't need this or even know what it means.
> > 
> > pci_find_next_ext_capability() is available to drivers, and if
> > Thunderbolt-connectedness is useful information to apple-gmux or GPU
> > drivers, it's fine with me if you want to use it there.  I just don't
> > see the benefit to having it in the core.
> 
> The above contradicts your statement 3 days earlier:
> 
>   "Assuming we need it, having it in struct pci_dev is fine.
>There's no point in looking up the VSEC capability more than once."
>   (http://www.spinics.net/lists/linux-pci/msg58532.html)

It's clear that none of the proposed uses is related to Thunderbolt as
a technology, which is why I would suggest we don't need the flag.
But in the interest of moving on, if you remove the changelog part
about whitelisting Thunderbolt for runtime PM (since that's not part
of this series), you can add my:

Acked-by: Bjorn Helgaas <bhelg...@google.com>


Re: [Nouveau] [PATCH 1/5] PCI: Recognize Thunderbolt devices

2017-03-03 Thread Bjorn Helgaas
On Fri, Feb 24, 2017 at 08:19:45PM +0100, Lukas Wunner wrote:
> Detect on probe whether a PCI device is part of a Thunderbolt controller.
> 
> Intel uses a Vendor-Specific Extended Capability (VSEC) with ID 0x1234
> on such devices.  Detect presence of this VSEC and cache it in a newly
> added is_thunderbolt bit in struct pci_dev.  Add a helper to check
> whether a given PCI device is situated on a Thunderbolt daisy chain.
> 
> The necessity arises from the following:
> 
> * To power off Thunderbolt controllers on Macs even if their BIOS is
>   older than 2015, their PCIe ports need to be whitelisted for runtime
>   PM.  For this we need a way to recognize them.
> 
> * Dual GPU MacBook Pros introduced 2011+ can no longer switch external
>   DisplayPort ports between GPUs.  (They're no longer just used for DP
>   but have become combined DP/Thunderbolt ports.)  The driver to switch
>   the ports, drivers/platform/x86/apple-gmux.c, needs to detect presence
>   of a Thunderbolt controller and, if found, keep external ports
>   permanently switched to the discrete GPU.
> 
> * If an external Thunderbolt GPU is connected to a dual GPU laptop (Mac
>   or not), that GPU is currently registered with vga_switcheroo even
>   though it can neither drive the laptop's panel nor be powered off by
>   the platform.  To vga_switcheroo it will appear as if two discrete
>   GPUs are present.  As a result, when the external GPU is runtime
>   suspended, vga_switcheroo will cut power to the internal discrete GPU
>   which may not be runtime suspended at all at this moment.  The
>   solution is to not register external GPUs with vga_switcheroo, which
>   necessitates a way to recognize if they're on a Thunderbolt daisy
>   chain.

If I understand correctly, vga_switcheroo manages two GPUs that have a
single output: either there's a mux that connects one GPU or the other
to the output, or one GPU is permanently connected to the output and
the other does offline rendering.

To this non-GPU person, it sounds like the important question is
whether two GPUs are related in this way (either they feed the same
mux, or there's some special offline rendering connection between
them).  That sounds unrelated to the question of how the GPUs
themselves are connected to the PCI hierarchy.

From a pure PCI perspective, I assume it would be conceivable to have
two Thunderbolt-connected GPUs feeding into a mux.  Or to have a GPU
(unrelated to the mux) in a non-Thunderbolt plugin slot or connected
externally via a non-Thunderbolt switch and an iPass cable.

If that's the case, and we're using Thunderbolt-connectedness as a
hint that happens to work for the hardware we know about now, that's
fine -- I'm just trying to understand what's a hint and what's
intrinsic to being connected via Thunderbolt.
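
(For reference, the daisy-chain helper the changelog mentions presumably
walks upstream checking the new flag -- a sketch of what that could look
like, assuming pci_upstream_bridge():)

  static inline bool pci_is_thunderbolt_attached(struct pci_dev *pdev)
  {
          struct pci_dev *parent = pdev;

          if (pdev->is_thunderbolt)
                  return true;

          /* any flagged upstream bridge means we're on the daisy chain */
          while ((parent = pci_upstream_bridge(parent)))
                  if (parent->is_thunderbolt)
                          return true;

          return false;
  }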

> Cc: Andreas Noever 
> Cc: Michael Jamet 
> Cc: Tomas Winkler 
> Cc: Amir Levy 
> Signed-off-by: Lukas Wunner 
> ---
>  drivers/pci/pci.h   |  2 ++
>  drivers/pci/probe.c | 21 +
>  include/linux/pci.h | 23 +++
>  3 files changed, 46 insertions(+)
> 
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index cb17db242f30..45c2b8144911 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -3,6 +3,8 @@
>  
>  #define PCI_FIND_CAP_TTL 48
>  
> +#define PCI_VSEC_ID_INTEL_TBT    0x1234  /* Thunderbolt */
> +
>  extern const unsigned char pcie_link_speed[];
>  
>  bool pcie_cap_has_lnkctl(const struct pci_dev *dev);
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 204960e70333..7963ecc6d85f 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1208,6 +1208,24 @@ void set_pcie_hotplug_bridge(struct pci_dev *pdev)
>   pdev->is_hotplug_bridge = 1;
>  }
>  
> +static void set_pcie_thunderbolt(struct pci_dev *dev)
> +{
> + int vsec = 0;
> + u32 header;
> +
> + while ((vsec = pci_find_next_ext_capability(dev, vsec,
> + PCI_EXT_CAP_ID_VNDR))) {
> + pci_read_config_dword(dev, vsec + PCI_VNDR_HEADER, &header);
> +
> + /* Is the device part of a Thunderbolt controller? */
> + if (dev->vendor == PCI_VENDOR_ID_INTEL &&
> + PCI_VNDR_HEADER_ID(header) == PCI_VSEC_ID_INTEL_TBT) {
> + dev->is_thunderbolt = 1;
> + return;
> + }
> + }
> +}
> +
>  /**
>   * pci_ext_cfg_is_aliased - is ext config space just an alias of std config?
>   * @dev: PCI device
> @@ -1360,6 +1378,9 @@ int pci_setup_device(struct pci_dev *dev)
>   /* need to have dev->class ready */
>   dev->cfg_size = pci_cfg_space_size(dev);
>  
> + /* need to have dev->cfg_size ready */
> + set_pcie_thunderbolt(dev);
> +
>   /* "Unknown power state" */
>   dev->current_state = PCI_UNKNOWN;
>  
> diff --git 

Re: [Nouveau] [PATCH 3/5] drm/nouveau: Don't register Thunderbolt eGPU with vga_switcheroo

2017-02-24 Thread Bjorn Helgaas
On Fri, Feb 24, 2017 at 1:19 PM, Lukas Wunner  wrote:
> An external Thunderbolt GPU can neither drive the laptop's panel nor be
> powered off by the platform, so there's no point in registering it with
> vga_switcheroo.

> In fact, when the external GPU is runtime suspended,
> vga_switcheroo will cut power to the internal discrete GPU, resulting in
> a lockup.

Why does suspending the external GPU cause vga_switcheroo to cut power
to the internal GPU?  No doubt this would be obvious to a GPU person,
which I definitely am not.

> Cc: Ben Skeggs 
> Signed-off-by: Lukas Wunner 
> ---
>  drivers/gpu/drm/nouveau/nouveau_vga.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/nouveau/nouveau_vga.c 
> b/drivers/gpu/drm/nouveau/nouveau_vga.c
> index eef22c6b9665..c2a7fd606c2e 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_vga.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_vga.c
> @@ -95,6 +95,10 @@ nouveau_vga_init(struct nouveau_drm *drm)
>
> vga_client_register(dev->pdev, dev, NULL, nouveau_vga_set_decode);
>
> +   /* don't register Thunderbolt eGPU with vga_switcheroo */
> +   if (pci_is_thunderbolt_attached(dev->pdev))
> +   return;

I guess there's no way to move this inside
vga_switcheroo_register_client() instead of putting the test in all
the drivers?
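
(Something like this, if it were feasible -- a sketch of the suggestion,
not actual vga_switcheroo code:)

  int vga_switcheroo_register_client(struct pci_dev *pdev,
                                     const struct vga_switcheroo_client_ops *ops,
                                     bool driver_power_control)
  {
          /* eGPU on a Thunderbolt chain: can't drive the panel and can't
           * be powered off by the platform, so skip registration */
          if (pci_is_thunderbolt_attached(pdev))
                  return 0;

          /* ... existing registration logic ... */
          return 0;
  }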

> +
> if (nouveau_runtime_pm == 1)
> runtime = true;
> if ((nouveau_runtime_pm == -1) && (nouveau_is_optimus() || 
> nouveau_is_v1_dsm()))
> @@ -111,6 +115,11 @@ nouveau_vga_fini(struct nouveau_drm *drm)
> struct drm_device *dev = drm->dev;
> bool runtime = false;
>
> +   vga_client_register(dev->pdev, NULL, NULL, NULL);
> +
> +   if (pci_is_thunderbolt_attached(dev->pdev))
> +   return;
> +
> if (nouveau_runtime_pm == 1)
> runtime = true;
> if ((nouveau_runtime_pm == -1) && (nouveau_is_optimus() || 
> nouveau_is_v1_dsm()))
> @@ -119,7 +128,6 @@ nouveau_vga_fini(struct nouveau_drm *drm)
> vga_switcheroo_unregister_client(dev->pdev);
> if (runtime && nouveau_is_v1_dsm() && !nouveau_is_optimus())
> vga_switcheroo_fini_domain_pm_ops(drm->dev->dev);
> -   vga_client_register(dev->pdev, NULL, NULL, NULL);

The amd & radeon patches look like this:

-   vga_switcheroo_unregister_client(rdev->pdev);
+   if (!pci_is_thunderbolt_attached(adev->pdev))
+   vga_switcheroo_unregister_client(adev->pdev);

This nouveau patch looks suspiciously different.  But again, I'm not a
GPU guy so all I can really say is "hmm, I wonder why it's different?"

>  }
>
>
> --
> 2.11.0
>


Re: [Nouveau] How to check for proper MSI support?

2014-07-08 Thread Bjorn Helgaas
On Thu, Jul 3, 2014 at 10:32 PM, Ilia Mirkin <imir...@alum.mit.edu> wrote:
> So let me get this straight -- you're suggesting I add a quirk for
> every PCI chipset that doesn't support MSI? There are probably
> hundreds of these... anything made before 1999 or so, and probably a
> bunch since then too. There _has_ to be a way to do this generically.
> Is the PCI spec version anywhere in the root hub?
> 
> Perhaps we can check if every bridge on the way to the CPU has the MSI
> capability (including the root hub)? (And naturally _that_ won't
> work... on my sandybridge laptop, the host bridge doesn't have the MSI
> cap but the system most definitely supports MSI.)
> 
> Adding Bjorn... perhaps you know? Some of the info has been stripped
> out by now, you can see the full lspci -vvvxxx at
> http://marc.info/?l=linux-pci&m=140443441730503&w=2

Huh, this stinks.  We don't really have a good way of figuring out
whether the system chipset supports MSI.  The ACPI FADT MSI Not
Supported bit (ACPI_FADT_NO_MSI) was added to the ACPI v3.0b spec in
October 2006, so that won't help the systems that predate that or
don't have ACPI.

We have quirks for some Serverworks, ATI, and VIA chipsets that
basically do the same as booting with pci=nomsi.  But as you say,
it's unreasonable to add quirks for all old systems.
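
(Those quirks are one-liners on the offending host bridge, for reference
-- a sketch in the style of drivers/pci/quirks.c; the 0x122d device ID
for the 430FX's 82437FX is from memory, so double-check it:)

  static void quirk_disable_all_msi(struct pci_dev *dev)
  {
          pci_no_msi();
          dev_warn(&dev->dev, "MSI quirk detected; MSI disabled\n");
  }
  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x122d,
                          quirk_disable_all_msi);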

Brian, can you open a report at http://bugzilla.kernel.org and attach
a complete dmesg log, /proc/cpuinfo contents, and lspci -vvv output?

If I understand correctly, you have a P200MMX with a 430FX chipset.
I'm not a hardware guy, but sounds like that might be a 200MHz Pentium
with MMX (P54CS), which does have an integrated LAPIC, according to
Wikipedia.

From the PCI host bridge's perspective, an incoming MSI just looks
like a normal DMA write.  As long as that write reaches the CPU LAPIC,
it should work fine.  There's not really any specific MSI support
required, except to route the incoming write to the CPU LAPIC.  So
it's possible that a bridge designed before MSI was added to the PCI
spec might be able to support MSI.

But I don't know how much value there is in MSI on such old systems.
Maybe we could default to disabling MSI on BIOS dates before 1998 or
something.
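
(Which could be as simple as this untested sketch keyed off the DMI BIOS
date:)

  #include <linux/dmi.h>
  #include <linux/pci.h>

  static void __init pci_msi_check_bios_date(void)
  {
          int year;

          /* MSI entered the PCI spec with v2.2 (1998/1999) */
          if (dmi_get_date(DMI_BIOS_DATE, &year, NULL, NULL) &&
              year && year < 1998)
                  pci_no_msi();   /* same effect as booting with pci=nomsi */
  }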

Bjorn