Re: [Nouveau] [PATCH v2] PCI: Expose hidden NVIDIA HDA controllers

2019-07-10 Thread Daniel Drake
On Thu, Jul 11, 2019 at 6:47 AM Bjorn Helgaas  wrote:
> I applied this (slightly revised as below) to pci/misc and I think we
> can still squeeze it in for v5.3.

Thanks. Tested briefly and it seems to be working fine!

[Nouveau] [PATCH v2] PCI: Expose hidden NVIDIA HDA controllers

2019-07-07 Thread Daniel Drake
From: Lukas Wunner 

The integrated HDA controller on Nvidia GPUs can be hidden with a bit in
the GPU's config space. Information about this scheme was provided by
NVIDIA on their forums.

Many laptops now ship with this device hidden, meaning that Linux users
of affected platforms (where the HDMI connector comes off the NVIDIA GPU)
cannot use HDMI audio functionality.

Some platforms have ACPI DSDT code that will make the device visible if
the HDMI cable was connected at boot time, but this does not handle the
hotplug case, and this limitation has also been confirmed under Windows.

Avoid this issue by exposing the HDMI audio device on device enumeration
and resume.

The GPU and HDA controller are two functions of the same PCI device
(VGA class device on function 0 and audio device on function 1).
The multifunction flag in the GPU's Header Type register is cleared when
the HDA controller is hidden and set if it's exposed, so reread the flag
after exposing the HDA.

According to Ilia Mirkin, the HDA controller is only present on GPUs
with PCI device IDs from the MCP89 onwards, so do not touch config space
on older GPUs.

This quirk is limited to NVIDIA PCI devices with the VGA Controller
device class. This is expected to correspond to product configurations
where the NVIDIA GPU has connectors attached. Other products where the
device class is 3D Controller are expected to correspond to configurations
where the NVIDIA GPU is dedicated (dGPU) and has no connectors.

It's sensible to avoid exposing the HDA controller on dGPU setups,
especially because we've seen cases where the PCI BARs are not set
up correctly by the platform in this case, causing Linux to log
errors if the device is visible. This assumption that device class
accurately corresponds to product configuration holds for 6 of 6 laptops
recently checked at the Endless lab, and the data from 74 previously
tested products also suggests agreement. However, Ilia Mirkin comments
that he has seen cases where it does not hold. Even so, this quirk
should fix audio support for the majority of affected users.

This commit takes inspiration from an earlier patch by Daniel Drake.

Link: https://devtalk.nvidia.com/default/topic/1024022
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=75985
Cc: Aaron Plattner 
Cc: Peter Wu 
Cc: Ilia Mirkin 
Cc: Karol Herbst 
Cc: Maik Freudenberg 
Signed-off-by: Lukas Wunner 
Signed-off-by: Daniel Drake 
---

Notes:
v2:
 - Mention in commit message that the ACPI code that controls this bit
   is insufficient (also confirmed on Windows on the buglink)
 - Tweak commit message to clarify the MCP89 comparison, thanks to Ilia

 drivers/pci/quirks.c    | 28 ++++++++++++++++++++++++++++
 include/linux/pci_ids.h |  1 +
 2 files changed, 29 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 0f16acc323c6..52046b517e2e 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4971,6 +4971,34 @@ DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_AMD, PCI_ANY_ID,
 DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
  PCI_CLASS_MULTIMEDIA_HD_AUDIO, 8, quirk_gpu_hda);
 
+/*
+ * Many laptop BIOSes hide the integrated HDA controller on NVIDIA GPUs
+ * via a special bit. This prevents Linux from seeing and using it.
+ * Unhide it here.
+ * https://devtalk.nvidia.com/default/topic/1024022
+ */
+static void quirk_nvidia_hda(struct pci_dev *gpu)
+{
+   u8 hdr_type;
+   u32 val;
+
+   /* there was no integrated HDA controller before MCP89 */
+   if (gpu->device < PCI_DEVICE_ID_NVIDIA_GEFORCE_320M)
+   return;
+
+   /* bit 25 at offset 0x488 hides or exposes the HDA controller */
+   pci_read_config_dword(gpu, 0x488, &val);
+   pci_write_config_dword(gpu, 0x488, val | BIT(25));
+
+   /* the GPU becomes a multifunction device when the HDA is exposed */
+   pci_read_config_byte(gpu, PCI_HEADER_TYPE, &hdr_type);
+   gpu->multifunction = !!(hdr_type & BIT(7));
+}
+DECLARE_PCI_FIXUP_CLASS_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
+  PCI_BASE_CLASS_DISPLAY, 16, quirk_nvidia_hda);
+DECLARE_PCI_FIXUP_CLASS_RESUME_EARLY(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
+  PCI_BASE_CLASS_DISPLAY, 16, quirk_nvidia_hda);
+
 /*
  * Some IDT switches incorrectly flag an ACS Source Validation error on
  * completions for config read requests even though PCIe r4.0, sec
diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
index 70e86148cb1e..66898463b81f 100644
--- a/include/linux/pci_ids.h
+++ b/include/linux/pci_ids.h
@@ -1336,6 +1336,7 @@
 #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP78S_SMBUS    0x0752
 #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP77_IDE   0x0759
 #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP73_SMBUS 0x07D8
+#define PCI_DEVICE_ID_NVIDIA_GEFORCE_320M   0x08A0
 #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP79_SMBUS 0x0AA2

Re: [Nouveau] [PATCH] PCI: Expose hidden NVIDIA HDA controllers

2019-06-24 Thread Daniel Drake
On Fri, Jun 14, 2019 at 4:03 AM Bjorn Helgaas  wrote:
> Is it possible that this works on Windows but not Linux because they
> handle ACPI hotplug slightly differently?
>
> Martin did some nice debug [1] and found that _DSM, _PS0, and _PS3
> functions write the config bit at 0x488.  The dmesg log [2] from
> zigarrre seems to show that _OSC failed (which I *think* means we
> won't use pciehp) and that there's a slot registered by acpiphp.
>
> Maybe this works on Windows via an ACPI hotplug event that runs AML to
> flip the 0x488 bit and rescans the bus?  That would make more sense to
> me than the idea that the Nvidia driver does it.  I could imagine the
> driver flipping the bit, but the bus rescan seems like it would be out
> of the driver's purview.

Oh, I had missed that part of the conversation. I checked on the Acer
Predator G3-572 and was able to find ACPI code that manipulates the
magic bit; however, I can't see any linkage to _DSM methods, and the
code looks like it would only be run on power-up.

I modified the DSDT to avoid the codepaths that tweak the bit, booted
Windows and confirmed that change had taken effect. Then I installed
the nvidia driver and observed that the magic bit was being
manipulated according to HDMI cable status.

So I believe the nvidia Windows driver does not rely on ACPI for this
to work. It presumably does it directly and causes rescans, as evil as
that sounds. I added more details to the bug report.
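
For reference, here is a minimal userspace sketch to check whether the
HDA function is currently hidden (assuming the GPU lives at 0000:01:00.0,
so adjust the sysfs path to your system, and using the 0x488/bit-25
location from the NVIDIA forum post linked above; run as root):

/* check-nvidia-hda-bit.c: hypothetical debugging helper, not part of
 * any submitted patch. Reads the GPU's config dword at 0x488 via sysfs
 * and reports the state of bit 25 (HDA hidden/exposed). */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/sys/bus/pci/devices/0000:01:00.0/config";
    uint32_t val;
    int fd = open(path, O_RDONLY);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (pread(fd, &val, sizeof(val), 0x488) != sizeof(val)) {
        perror("pread");
        close(fd);
        return 1;
    }
    close(fd);
    printf("0x488 = 0x%08x, HDA controller %s\n", (unsigned int)val,
           (val & (1u << 25)) ? "exposed" : "hidden");
    return 0;
}

If bit 25 is set, the audio function should show up as function 1 of
the GPU (e.g. 01:00.1) after a bus rescan.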

> The dmesg log also shows some _DSM warnings; are these correlated with
> cable plug/unplug?  Is there some acpiphp debug we can turn on to get
> more visibility into hotplug events and how we handle them?

I scattered a load of debug messages around the ACPI & PCIe hotplug
code but nothing gets hit. I don't think this is architected to be
handled by existing PCI hotplug mechanisms.

I don't have any _DSM warnings on this product, even after connecting
and disconnecting the HDMI cable.

Thanks
Daniel

[Nouveau] [PATCH] PCI: Expose hidden NVIDIA HDA controllers

2019-06-13 Thread Daniel Drake
From: Lukas Wunner 

The integrated HDA controller on Nvidia GPUs can be hidden with a bit in
the GPU's config space. Information about this scheme was provided by
NVIDIA on their forums.

Many laptops now ship with this device hidden, meaning that Linux users
of affected platforms (where the HDMI connector comes off the NVIDIA GPU)
cannot use HDMI audio functionality.

Avoid this issue by exposing the HDMI audio device on device enumeration
and resume.

The GPU and HDA controller are two functions of the same PCI device
(VGA class device on function 0 and audio device on function 1).
The multifunction flag in the GPU's Header Type register is cleared when
the HDA controller is hidden and set if it's exposed, so reread the flag
after exposing the HDA.

According to Ilia Mirkin, the HDA controller is only present from MCP89
onward, so do not touch config space on older GPUs.

This quirk is limited to NVIDIA PCI devices with the VGA Controller
device class. This is expected to correspond to product configurations
where the NVIDIA GPU has connectors attached. Other products where the
device class is 3D Controller are expected to correspond to configurations
where the NVIDIA GPU is dedicated (dGPU) and has no connectors.

It's sensible to avoid exposing the HDA controller on dGPU setups,
especially because we've seen cases where the PCI BARs are not set
up correctly by the platform in this case, causing Linux to log
errors if the device is visible. This assumption that device class
accurately corresponds to product configuration holds for 6 of 6 laptops
recently checked at the Endless lab, and the data from 74 previously
tested products also suggests agreement. However, Ilia Mirkin comments
that he has seen cases where it does not hold. Even so, this quirk
should fix audio support for the majority of affected users.

This commit takes inspiration from an earlier patch by Daniel Drake.

Link: https://devtalk.nvidia.com/default/topic/1024022
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=75985
Cc: Aaron Plattner 
Cc: Peter Wu 
Cc: Ilia Mirkin 
Cc: Karol Herbst 
Cc: Maik Freudenberg 
Signed-off-by: Lukas Wunner 
Tested-by: Daniel Drake 
---
 drivers/pci/quirks.c    | 28 ++++++++++++++++++++++++++++
 include/linux/pci_ids.h |  1 +
 2 files changed, 29 insertions(+)

Submitting Lukas's patch from over a year ago.

It got held up before on patch dependencies (which are now all merged)
and also the concern around the device class assumption not being true
in all cases. However, there hasn't been any progress towards finding a
better approach, and we don't have any logs to hand from the cases where
this isn't true, so I think we should go with this approach which will
work in the (vast?) majority of cases.

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 0f16acc323c6..52046b517e2e 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4971,6 +4971,34 @@ DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_AMD, PCI_ANY_ID,
 DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
  PCI_CLASS_MULTIMEDIA_HD_AUDIO, 8, quirk_gpu_hda);
 
+/*
+ * Many laptop BIOSes hide the integrated HDA controller on NVIDIA GPUs
+ * via a special bit. This prevents Linux from seeing and using it.
+ * Unhide it here.
+ * https://devtalk.nvidia.com/default/topic/1024022
+ */
+static void quirk_nvidia_hda(struct pci_dev *gpu)
+{
+   u8 hdr_type;
+   u32 val;
+
+   /* there was no integrated HDA controller before MCP89 */
+   if (gpu->device < PCI_DEVICE_ID_NVIDIA_GEFORCE_320M)
+   return;
+
+   /* bit 25 at offset 0x488 hides or exposes the HDA controller */
+   pci_read_config_dword(gpu, 0x488, &val);
+   pci_write_config_dword(gpu, 0x488, val | BIT(25));
+
+   /* the GPU becomes a multifunction device when the HDA is exposed */
+   pci_read_config_byte(gpu, PCI_HEADER_TYPE, &hdr_type);
+   gpu->multifunction = !!(hdr_type & BIT(7));
+}
+DECLARE_PCI_FIXUP_CLASS_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
+  PCI_BASE_CLASS_DISPLAY, 16, quirk_nvidia_hda);
+DECLARE_PCI_FIXUP_CLASS_RESUME_EARLY(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
+  PCI_BASE_CLASS_DISPLAY, 16, quirk_nvidia_hda);
+
 /*
  * Some IDT switches incorrectly flag an ACS Source Validation error on
  * completions for config read requests even though PCIe r4.0, sec
diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
index 70e86148cb1e..66898463b81f 100644
--- a/include/linux/pci_ids.h
+++ b/include/linux/pci_ids.h
@@ -1336,6 +1336,7 @@
 #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP78S_SMBUS    0x0752
 #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP77_IDE   0x0759
 #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP73_SMBUS 0x07D8
+#define PCI_DEVICE_ID_NVIDIA_GEFORCE_320M   0x08A0
 #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP79_SMBUS 0x0AA2
 #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP89_SA

Re: [Nouveau] [PATCH v3] PCI: Reprogram bridge prefetch registers on resume

2018-09-30 Thread Daniel Drake
On Sun, Sep 30, 2018 at 5:07 AM Thomas Martitz  wrote:
> The latest iteration does not work on my HP system. The GPU fails to
> power up just like the unpatched kernel.

That's weird; I would not expect a behaviour change in the latest
patch. pci_restore_config_dword() has some debug messages; could you
please make them visible and show the logs again?
Also, please remind us of the PCI device address of the parent bridge (lspci -vt).

Thanks
Daniel


[Nouveau] [PATCH v3] PCI: Reprogram bridge prefetch registers on resume

2018-09-12 Thread Daniel Drake
On 38+ Intel-based Asus products, the nvidia GPU becomes unusable
after S3 suspend/resume. The affected products include multiple
generations of nvidia GPUs and Intel SoCs. After resume, nouveau logs
many errors such as:

fifo: fault 00 [READ] at 00555000 engine 00 [GR] client 04
  [HUB/FE] reason 4a [] on channel -1 [007fa91000 unknown]
DRM: failed to idle channel 0 [DRM]

Similarly, the nvidia proprietary driver also fails after resume
(black screen, 100% CPU usage in Xorg process). We shipped a sample
to Nvidia for diagnosis, and their response indicated that it's a
problem with the parent PCI bridge (on the Intel SoC), not the GPU.

Runtime suspend/resume works fine, only S3 suspend is affected.

We found a workaround: on resume, rewrite the Intel PCI bridge
'Prefetchable Base Upper 32 Bits' register (PCI_PREF_BASE_UPPER32). In
the cases that I checked, this register has value 0 and we just have to
rewrite that value.

Linux already saves and restores PCI config space during suspend/resume,
but this register was being skipped because upon resume, it already
has value 0 (the correct, pre-suspend value).

Intel appear to have previously acknowledged this behaviour and the
requirement to rewrite this register.
https://bugzilla.kernel.org/show_bug.cgi?id=116851#c23

Based on that, rewrite the prefetch register values even when that
appears unnecessary.

We have confirmed this solution on all the affected models we have
in-hands (X542UQ, UX533FD, X530UN, V272UN).

Additionally, this solves an issue where r8169 MSI-X interrupts were
broken after S3 suspend/resume on Asus X441UAR. This issue was recently
worked around in commit 7bb05b85bc2d ("r8169: don't use MSI-X on
RTL8106e"). It also fixes the same issue on RTL6186evl/8111evl on an
Aimfor-tech laptop that we had not yet patched. I suspect it will also
fix the issue that was worked around in commit 7c53a722459c ("r8169:
don't use MSI-X on RTL8168g").

Thomas Martitz reports that this change also solves an issue where
the AMD Radeon Polaris 10 GPU on the HP Zbook 14u G5 is unresponsive
after S3 suspend/resume.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=201069
Signed-off-by: Daniel Drake 
---
 drivers/pci/pci.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 29ff9619b5fa..5d58220b6997 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1289,12 +1289,12 @@ int pci_save_state(struct pci_dev *dev)
 EXPORT_SYMBOL(pci_save_state);
 
 static void pci_restore_config_dword(struct pci_dev *pdev, int offset,
-u32 saved_val, int retry)
+u32 saved_val, int retry, bool force)
 {
u32 val;
 
 pci_read_config_dword(pdev, offset, &val);
-   if (val == saved_val)
+   if (!force && val == saved_val)
return;
 
for (;;) {
@@ -1313,25 +1313,34 @@ static void pci_restore_config_dword(struct pci_dev *pdev, int offset,
 }
 
 static void pci_restore_config_space_range(struct pci_dev *pdev,
-  int start, int end, int retry)
+  int start, int end, int retry,
+  bool force)
 {
int index;
 
for (index = end; index >= start; index--)
pci_restore_config_dword(pdev, 4 * index,
 pdev->saved_config_space[index],
-retry);
+retry, force);
 }
 
 static void pci_restore_config_space(struct pci_dev *pdev)
 {
if (pdev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
-   pci_restore_config_space_range(pdev, 10, 15, 0);
+   pci_restore_config_space_range(pdev, 10, 15, 0, false);
/* Restore BARs before the command register. */
-   pci_restore_config_space_range(pdev, 4, 9, 10);
-   pci_restore_config_space_range(pdev, 0, 3, 0);
+   pci_restore_config_space_range(pdev, 4, 9, 10, false);
+   pci_restore_config_space_range(pdev, 0, 3, 0, false);
+   } else if (pdev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+   pci_restore_config_space_range(pdev, 12, 15, 0, false);
+   /* Force rewriting of prefetch registers to avoid
+* S3 resume issues on Intel PCI bridges that occur when
+* these registers are not explicitly written.
+*/
+   pci_restore_config_space_range(pdev, 9, 11, 0, true);
+   pci_restore_config_space_range(pdev, 0, 8, 0, false);
} else {
-   pci_restore_config_space_range(pdev, 0, 15, 0);
+   pci_restore_config_space_range(pdev, 0, 15, 0, false);
}
 }
 
-- 
2.17.1


Re: [Nouveau] [PATCH v2] PCI: Reprogram bridge prefetch registers on resume

2018-09-12 Thread Daniel Drake
On Wed, Sep 12, 2018 at 5:05 PM, Rafael J. Wysocki  wrote:
> Passing the extra bool around is somewhat clumsy, but I haven't found
> a cleaner way to do it, so

Thanks, according to your suggestion (which I plan to follow after
this) we will remove this parameter for v4.20+ and have all the
register writes be forced regardless of current value.
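
As a rough illustration only (not the actual follow-up change, just a
sketch of what dropping the read-and-compare might look like), the
forced restore could end up something like:

static void pci_restore_config_dword(struct pci_dev *pdev, int offset,
                                     u32 saved_val, int retry)
{
        u32 val;

        for (;;) {
                /* Always write the saved value, even if the device
                 * already holds it, so that registers like the bridge
                 * prefetch base are rewritten on resume. */
                pci_write_config_dword(pdev, offset, saved_val);
                pci_read_config_dword(pdev, offset, &val);
                if (val == saved_val)
                        return;

                if (retry-- <= 0)
                        return;

                pci_dbg(pdev, "restore of offset %#x failed, retrying\n",
                        offset);
                mdelay(10);
        }
}

The exact retry/delay handling would of course stay as in the existing
code.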

Daniel


[Nouveau] [PATCH v2] PCI: Reprogram bridge prefetch registers on resume

2018-09-12 Thread Daniel Drake
On 38+ Intel-based Asus products, the nvidia GPU becomes unusable
after S3 suspend/resume. The affected products include multiple
generations of nvidia GPUs and Intel SoCs. After resume, nouveau logs
many errors such as:

fifo: fault 00 [READ] at 00555000 engine 00 [GR] client 04 [HUB/FE] 
reason 4a [] on channel -1 [007fa91000 unknown]
DRM: failed to idle channel 0 [DRM]

Similarly, the nvidia proprietary driver also fails after resume
(black screen, 100% CPU usage in Xorg process). We shipped a sample
to Nvidia for diagnosis, and their response indicated that it's a
problem with the parent PCI bridge (on the Intel SoC), not the GPU.

Runtime suspend/resume works fine, only S3 suspend is affected.

We found a workaround: on resume, rewrite the Intel PCI bridge
'Prefetchable Base Upper 32 Bits' register (PCI_PREF_BASE_UPPER32). In
the cases that I checked, this register has value 0 and we just have to
rewrite that value.

Linux already saves and restores PCI config space during suspend/resume,
but this register was being skipped because upon resume, it already
has value 0 (the correct, pre-suspend value).

Intel appear to have previously acknowledged this behaviour and the
requirement to rewrite this register.
https://bugzilla.kernel.org/show_bug.cgi?id=116851#c23

Based on that, rewrite the prefetch register values even when that
appears unnecessary.

We have confirmed this solution on all the affected models we have
in-hands (X542UQ, UX533FD, X530UN, V272UN).

Additionally, this solves an issue where r8169 MSI-X interrupts were
broken after S3 suspend/resume on Asus X441UAR. This issue was recently
worked around in commit 7bb05b85bc2d ("r8169: don't use MSI-X on
RTL8106e"). It also fixes the same issue on RTL6186evl/8111evl on an
Aimfor-tech laptop that we had not yet patched. I suspect it will also
fix the issue that was worked around in commit 7c53a722459c ("r8169:
don't use MSI-X on RTL8168g").

Thomas Martitz reports that this change also solves an issue where
the AMD Radeon Polaris 10 GPU on the HP Zbook 14u G5 is unresponsive
after S3 suspend/resume.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=201069
Signed-off-by: Daniel Drake 
---

Notes:
Replaces patch:
   PCI: add prefetch quirk to work around Asus/Nvidia suspend issues

Some of the more verbose info was moved to bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=201069

This patch is aimed at v4.19 (and maybe v4.18-stable); we may follow
up with more intrusive improvements for v4.20+.

v2: reimplement the register restore within the existing
pci_restore_config_space() code.

 drivers/pci/pci.c | 31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 29ff9619b5fa..e1704100e72d 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1289,13 +1289,15 @@ int pci_save_state(struct pci_dev *dev)
 EXPORT_SYMBOL(pci_save_state);
 
 static void pci_restore_config_dword(struct pci_dev *pdev, int offset,
-u32 saved_val, int retry)
+u32 saved_val, int retry, bool force)
 {
u32 val;
 
-   pci_read_config_dword(pdev, offset, &val);
-   if (val == saved_val)
-   return;
+   if (!force) {
+   pci_read_config_dword(pdev, offset, &val);
+   if (val == saved_val)
+   return;
+   }
 
for (;;) {
pci_dbg(pdev, "restoring config space at offset %#x (was %#x, 
writing %#x)\n",
@@ -1313,25 +1313,34 @@ static void pci_restore_config_dword(struct pci_dev *pdev, int offset,
 }
 
 static void pci_restore_config_space_range(struct pci_dev *pdev,
-  int start, int end, int retry)
+  int start, int end, int retry,
+  bool force)
 {
int index;
 
for (index = end; index >= start; index--)
pci_restore_config_dword(pdev, 4 * index,
 pdev->saved_config_space[index],
-retry);
+retry, force);
 }
 
 static void pci_restore_config_space(struct pci_dev *pdev)
 {
if (pdev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
-   pci_restore_config_space_range(pdev, 10, 15, 0);
+   pci_restore_config_space_range(pdev, 10, 15, 0, false);
/* Restore BARs before the command register. */
-   pci_restore_config_space_range(pdev, 4, 9, 10);
-   pci_restore_config_space_range(pdev, 0, 3, 0);
+   pci_restore_config_space_range(pdev, 4, 9, 10, false);
+   pci_restore_config_space_range(pdev, 0, 3, 0, false);
+   } else if (pdev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+   

Re: [Nouveau] [PATCH] PCI: Reprogram bridge prefetch registers on resume

2018-09-10 Thread Daniel Drake
I have created https://bugzilla.kernel.org/show_bug.cgi?id=201069 to
archive the research done so far.

On Fri, Sep 7, 2018 at 11:05 PM, Peter Wu  wrote:
> Windows 10 unconditionally rewrites these registers (BAR, I/O Base +
> Limit, Memory Base + Limit, etc. from top to bottom), see annotations:
> https://www.spinics.net/lists/linux-pci/msg75856.html
>
> Linux has a generic "restore" operation that works backwards from the
> end of the PCI config space to the beginning, see
> pci_restore_config_space. Do you have a dmesg where you see the
> "restoring config space at offset" messages?

Interesting, I had not spotted this code. The logs for the affected
bridge on Asus X542UQ:

 0000:00:1c.0: restoring config space at offset 0x3c (was 0x100, writing 0x1001ff)
 0000:00:1c.0: restoring config space at offset 0x24 (was 0x10001, writing 0xe1f1d001)
 0000:00:1c.0: restoring config space at offset 0x20 (was 0x0, writing 0xef00ee00)
 0000:00:1c.0: restoring config space at offset 0xc (was 0x81, writing 0x810010)
 0000:00:1c.0: restoring config space at offset 0x4 (was 0x17, writing 0x100407)

So it didn't restore 0x28 (the magic register) and also register 0x2c
(prefetch limit upper 32 bits) because their values were already 0 on
resume.

https://bugzilla.kernel.org/show_bug.cgi?id=116851#c23 has an
assertion that Intel recognises this issue and calls for all those
registers to be restored on resume.

I propose for Linux 4.19 we apply a minimally intrusive change that
"forcibly" restores dword registers 0x24, 0x28 and 0x2c for PCI
bridges (regardless of their current value), while retaining the
existing restore order (high to low) over all registers and doing the
read-then-maybe-write thing for all of the other regs (per current
behaviour).


Then we also consider swiftly applying a follow-up patch implementing a
more thorough approach, giving it some time in linux-next before
releasing it in Linux 4.20. I think perhaps more discussion is needed
there, or at least some more input from Bjorn. It seems we have 3
approaches to consider:

1. Peter Wu suggests following what windows does. Windows appears to
start with low registers and works its way upwards, which means it
writes BAR addresses, I/O base, etc, before writing prefetch
registers. It skips over read-only and write-to-clear registers and
also only writes some of the registers at the very end - like the
command register.

To be thorough here we would probably also have to study and copy what
Windows does for non-bridge devices (not sure how many device classes
we would want to study here?). Also it is a bit risky in that Bjorn
has pointed out that Linux's earlier approach with a high level of
similarity here (restore registers in ascending order, regardless of
their current value) caused a laptop to hang on resume.


2. Bjorn suggested tweaking the existing code to always write the
saved value even when the hardware has that same value. The
read-maybe-write logic was introduced to avoid a laptop hang, but at
that time we were restoring registers in ascending order, now we are
descending it might be worth giving this another try.


3. Do nothing else beyond the minimal change that I propose for v4.19?
Looking a bit more into git history, this seems to be a sensitive and
problematic area; more changes might mean more trouble. For example
right now pci_restore_config_space() does:
pci_restore_config_space_range(pdev, 10, 15, 0, 0);
/* Restore BARs before the command register. */
pci_restore_config_space_range(pdev, 4, 9, 10, 0);
pci_restore_config_space_range(pdev, 0, 3, 0, 0);
but pci_restore_config_space_range() already goes in descending order,
so the above is already equivalent to the code in the "else" path that
follows:
pci_restore_config_space_range(pdev, 0, 15, 0, 0);

and if you look at the history behind this "mixup" there's stuff that
goes back to 2012 like commit a6cb9ee7cabe ("PCI: Retry BARs
restoration for Type 0 headers only") which was fixing up another
problematic commit in this area, etc.

Daniel


Re: [Nouveau] [PATCH] PCI: Reprogram bridge prefetch registers on resume

2018-09-07 Thread Daniel Drake
On Fri, Sep 7, 2018 at 2:40 PM, Sinan Kaya  wrote:
> On 9/6/2018 10:36 PM, Daniel Drake wrote:
>>
>> +   if (pci_dev->class == PCI_CLASS_BRIDGE_PCI << 8)
>> +   pci_setup_bridge_mmio_pref(pci_dev);
>
>
> This should probably some kind of a quirk rather than default
> for the listed card as it sounds like you are dealing with
> broken hardware.

With that approach there's a sizeable risk that your quirk list is
incomplete or out of date.

And when the bug bites, it's extremely cryptic. We've spent months
working on this issue and only found this magic register write mostly
through a stroke of good luck. Separately there's been a flurry of
mails around the r8169 MSI-X problem but as far as I can see nobody
suggested even looking at the values of the parent bridge prefetch
registers. And even if they did, they'd probably have said "values are
fine, nothing to see here" (exactly as we did 4 months ago when Nvidia
mentioned these registers as a possible cause - oops!).

So here I'm instead following a suggestion from Bjorn, after also
having confirmed the windows behaviour:

https://marc.info/?l=linux-pci&m=153574276126484&w=2
> Can we tell whether Windows rewrites this register unconditionally at
> resume-time?  If so, it may be more robust for Linux to do the same.
> The whole thing is black magic, which I hate, but if it's our only
> choice, it may be better to have this applied everywhere so we don't
> keep stubbing our toes on new systems that require the quirk.

Also, we just spoke to Asus BIOS engineers who told us that the BIOS
does already restore the PCI bridge prefetch registers on resume. I
guess this is why the other registers like the (non-zero) prefetch
base address lower 32 bits do have the right value on resume even
before my patch. It sounds like a more subtle bug related to register
write timing or sequence; in that case it will be harder to define who
is responsible for the breakage and hence under which conditions the
quirk should apply.

Daniel


[Nouveau] [PATCH] PCI: Reprogram bridge prefetch registers on resume

2018-09-06 Thread Daniel Drake
On 38+ Intel-based Asus products, the nvidia GPU becomes unusable
after S3 suspend/resume. The affected products include multiple
generations of nvidia GPUs and Intel SoCs. After resume, nouveau logs
many errors such as:

fifo: fault 00 [READ] at 00555000 engine 00 [GR] client 04 [HUB/FE] 
reason 4a [] on channel -1 [007fa91000 unknown]
DRM: failed to idle channel 0 [DRM]

Similarly, the nvidia proprietary driver also fails after resume
(black screen, 100% CPU usage in Xorg process). We shipped a sample
to Nvidia for diagnosis, and their response indicated that it's a
problem with the parent PCI bridge (on the Intel SoC), not the GPU.

Runtime suspend/resume works fine, only S3 suspend is affected.

We found a workaround: on resume, rewrite the Intel PCI bridge
'Prefetchable Base Upper 32 Bits' register (PCI_PREF_BASE_UPPER32). In
the cases that I checked, this register has value 0 and we just have to
rewrite that value.

It's very strange that rewriting the exact same register value
makes a difference, but it definitely makes the issue go away.
It's not just acting as some kind of memory barrier, because rewriting
other bridge registers does not work around the issue. There's something
magic in this particular register. We have confirmed this on all
the affected models we have in-hands (X542UQ, UX533FD, X530UN, V272UN).

Additionally, this workaround solves an issue where r8169 MSI-X
interrupts were broken after S3 suspend/resume on Asus X441UAR. This
issue was recently worked around in commit 7bb05b85bc2d ("r8169:
don't use MSI-X on RTL8106e"). It also fixes the same issue on
RTL8168evl/8111evl on an Aimfor-tech laptop that we had not yet
patched. I suspect it will also fix the issue that was worked around in
commit 7c53a722459c ("r8169: don't use MSI-X on RTL8168g").

Thomas Martitz reports that this workaround also solves an issue where
the AMD Radeon Polaris 10 GPU on the HP Zbook 14u G5 is unresponsive
after S3 suspend/resume.

From our testing, the affected Intel PCI bridges are:
Intel Skylake PCIe Controller (x16) [8086:1901] (rev 05)
Intel Skylake PCIe Controller (x16) [8086:1901] (rev 07)
Intel Device [8086:31d8] (rev f3)
Intel Celeron N3350/Pentium N4200/Atom E3900 Series PCI Express Port B #1 
[8086:5ad6] (rev fb)
Intel Celeron N3350/Pentium N4200/Atom E3900 Series PCI Express Port A #1 
[8086:5ad8] (rev fb)
Intel Sunrise Point-LP PCI Express Root Port [8086:9d10] (rev f1)
Intel Sunrise Point-LP PCI Express Root Port #5 [8086:9d14] (rev f1)
Intel Device [8086:9dbc] (rev f0)

On resume, reprogram the PCI bridge prefetch registers, including the
magic register mentioned above.

This matches Win10 behaviour, which also rewrites these registers
during S3 resume (checked with qemu tracing).

Link: 
https://marc.info/?i=CAD8Lp46Y2eOR7WE28xToUL8s-aYiqPa0nS=1gsd0axkddxq...@mail.gmail.com
Link: https://bugs.freedesktop.org/show_bug.cgi?id=105760
Signed-off-by: Daniel Drake 
---

Notes:
Replaces patch:
   PCI: add prefetch quirk to work around Asus/Nvidia suspend issues

Below is the list of Asus products with Intel/Nvidia that we
believe are affected by the GPU resume issue.

I revised my counting method from my last patch to eliminate duplicate
platforms that had multiple SKUs with the same DMI/GPU/bridge, that's why
the product count reduced from 43 to 38.

sys_vendor: ASUSTeK COMPUTER INC.
board_name: FX502VD
product_name: FX502VD
01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev 
ff) (prog-if ff)
!!! Unknown header type 7f
00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 05) 
(prog-if 00 [Normal decode])

sys_vendor: ASUSTeK COMPUTER INC.
board_name: FX570UD
product_name: ASUS Gaming FX570UD
01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev a1)
Subsystem: ASUSTeK Computer Inc. Device [1043:1f40]
00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) 
(prog-if 00 [Normal decode])

sys_vendor: ASUSTeK COMPUTER INC.
board_name: GL553VD
product_name: GL553VD
01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev a1)
Subsystem: ASUSTeK Computer Inc. Device [1043:15e0]
00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 05) 
(prog-if 00 [Normal decode])

sys_vendor: ASUSTeK COMPUTER INC.
board_name: GL753VD
product_name: GL753VD
01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev a1)
Subsystem: ASUSTeK Computer Inc. Device [1043:1590]
00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 05) 
(prog-if 00 [Normal decode])

sys_vendor: ASUSTeK COMPUTER INC.
board_name: K401UQK
product_name: K401UQK
01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:134d] (rev a2)
Subsystem: ASUSTeK Computer Inc. Device [1043:14b0]
  

Re: [Nouveau] [PATCH] PCI: add prefetch quirk to work around Asus/Nvidia suspend issues

2018-09-06 Thread Daniel Drake
On Sat, Sep 1, 2018 at 3:12 AM, Bjorn Helgaas  wrote:
> Can we tell whether Windows rewrites this register unconditionally at
> resume-time?  If so, it may be more robust for Linux to do the same.
> The whole thing is black magic, which I hate, but if it's our only
> choice, it may be better to have this applied everywhere so we don't
> keep stubbing our toes on new systems that require the quirk.

Checked this with qemu adding a PCI-to-PCI bridge (ioh3420).

$ qemu-system-x86_64 -enable-kvm -M q35,accel=kvm -m 2G -vga qxl -cpu
host -hda testimg.img -device
ioh3420,id=rp1,bus=pcie.0,addr=1c.0,port=1 -trace events=events.txt

events.txt has:
pci_cfg_read
pci_cfg_write

Logged cfg space accesses during boot:
https://gist.github.com/dsd/135fb255cb2b237567d8ea2d6bfc6917#file-boot-txt

Suspend:
https://gist.github.com/dsd/135fb255cb2b237567d8ea2d6bfc6917#file-suspend-txt

Resume:
https://gist.github.com/dsd/135fb255cb2b237567d8ea2d6bfc6917#file-resume-txt

Notably during resume, the prefetch-related registers get rewritten:
  pci_cfg_write ioh3420 28:0 @0x24 <- 0xfeb0fea0
  pci_cfg_write ioh3420 28:0 @0x28 <- 0x0
  pci_cfg_write ioh3420 28:0 @0x2c <- 0x0

This happened even though there was nothing behind the bridge.
Windows failed to resume in this test (black screen) but the traced
register writes seem indicative enough.

Peter Wu confirms the same results in a similar experiment:
https://marc.info/?l=linux-pci&m=153616336225386&w=2

I'll look into creating a new patch that unconditionally reprograms
the PCI bridge prefetch stuff on resume.

Thanks
Daniel


Re: [Nouveau] Rewriting Intel PCI bridge prefetch base address bits solves nvidia graphics issues

2018-09-05 Thread Daniel Drake
On Tue, Aug 28, 2018 at 5:57 PM, Peter Wu  wrote:
> Only non-bridge devices can be passed to a guest, but perhaps logging
> access to the emulated bridge is already sufficient. The Prefetchable
> Base Upper 32 Bits register is at offset 0x28.
>
> In a trace where the Nvidia device is disabled/enabled via Device
> Manager, I see writes on the enable path:
>
> 2571@1535108904.593107:rp_write_config (ioh3420, @0x28, 0x0, len=0x4)

Did you do anything special to get an emulated bridge included in this setup?

Following the instructions at
https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVMF I can
successfully pass through devices to Windows running under
virt-manager. In the nvidia GPU case I haven't got past the driver
installation failure, but I can pass through other devices OK and
install their drivers.

However I do not end up with any PCI-to-PCI bridges in this setup. The
passed through device sits at address 00:08.0, parent is the PCI host
bridge 00:00.0.

(I'm trying to spy if Windows appears to restore or reset the PCI
bridge prefetch registers upon resume)

Thanks
Daniel


Re: [Nouveau] [PATCH] PCI: add prefetch quirk to work around Asus/Nvidia suspend issues

2018-09-04 Thread Daniel Drake
On Tue, Sep 4, 2018 at 2:43 PM, Mika Westerberg
 wrote:
> Yes, can you check if the failing device BAR is included in any of the
> above entries? If not then it is probably not related.

mtrr again for reference:
reg00: base=0x0c0000000 ( 3072MB), size= 1024MB, count=1: uncachable
reg01: base=0x0a0000000 ( 2560MB), size=  512MB, count=1: uncachable
reg02: base=0x090000000 ( 2304MB), size=  256MB, count=1: uncachable
reg03: base=0x08c000000 ( 2240MB), size=   64MB, count=1: uncachable
reg04: base=0x08b800000 ( 2232MB), size=    8MB, count=1: uncachable


The PCI bridge is:
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express
Root Port (rev f1) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
[remainder of the lspci output was garbled in the archive]


Re: [Nouveau] [PATCH] PCI: add prefetch quirk to work around Asus/Nvidia suspend issues

2018-09-03 Thread Daniel Drake
On Mon, Sep 3, 2018 at 8:12 PM, Mika Westerberg
 wrote:
> We have seen one similar issue with LPSS devices when BIOS assigns
> device BARs above 4G (which is not the case here) and it turned out to
> be misconfigured MTRR register or something like that. It may not be
> related at all but it could be worth a try to dump out MTRR registers of
> one of the affected systems and see if the memory areas are listed there
> (and if the attributes are somehow wrong if found).

From Asus X542UQ:

# cat /proc/mtrr
reg00: base=0x0c0000000 ( 3072MB), size= 1024MB, count=1: uncachable
reg01: base=0x0a0000000 ( 2560MB), size=  512MB, count=1: uncachable
reg02: base=0x090000000 ( 2304MB), size=  256MB, count=1: uncachable
reg03: base=0x08c000000 ( 2240MB), size=   64MB, count=1: uncachable
reg04: base=0x08b800000 ( 2232MB), size=    8MB, count=1: uncachable

# cat /sys/kernel/debug/x86/pat_memtype_list
PAT memtype list:
write-back @ 0x84a23000-0x84a24000
write-back @ 0x8ad34000-0x8ad6
write-back @ 0x8ad5f000-0x8ad66000
write-back @ 0x8ad5f000-0x8ad6
write-back @ 0x8ad65000-0x8ad6a000
write-back @ 0x8ad69000-0x8ad6b000
write-back @ 0x8ad6a000-0x8ad6c000
write-back @ 0x8ad6b000-0x8ad6e000
write-back @ 0x8ad9c000-0x8ad9d000
write-back @ 0x8adce000-0x8adcf000
write-back @ 0x8adcf000-0x8add
write-back @ 0x8adcf000-0x8add2000
write-back @ 0x8add3000-0x8add4000
write-back @ 0x8ae04000-0x8ae05000
write-back @ 0x8b208000-0x8b209000
write-combining @ 0xc000-0xd000
write-combining @ 0xd000-0xe000
write-combining @ 0xe000-0xe004
write-combining @ 0xe004-0xe005
write-combining @ 0xe005-0xe0051000
write-combining @ 0xe0051000-0xe0151000
write-combining @ 0xe0151000-0xe0191000
write-combining @ 0xe0191000-0xe01a1000
write-combining @ 0xe01a1000-0xe01b1000
write-combining @ 0xe01b1000-0xe01c1000
write-combining @ 0xe01c1000-0xe01c3000
write-combining @ 0xe01c3000-0xe01c5000
write-combining @ 0xe01c5000-0xe01cd000
write-combining @ 0xe01cd000-0xe01d5000
write-combining @ 0xe01d5000-0xe01dd000
write-combining @ 0xe01dd000-0xe01e5000
write-combining @ 0xe01e5000-0xe01ed000
write-combining @ 0xe01ed000-0xe01f5000
write-combining @ 0xe01f5000-0xe01fd000
write-combining @ 0xe01fd000-0xe0205000
write-combining @ 0xe0205000-0xe020d000
write-combining @ 0xe020d000-0xe0215000
uncached-minus @ 0xed00-0xed20
write-combining @ 0xed80-0xee00
uncached-minus @ 0xee00-0xef00
uncached-minus @ 0xef20-0xef40
uncached-minus @ 0xef40-0xef401000
uncached-minus @ 0xef404000-0xef405000
uncached-minus @ 0xef51-0xef52
uncached-minus @ 0xef528000-0xef52c000
uncached-minus @ 0xef533000-0xef534000
uncached-minus @ 0xef533000-0xef534000
uncached-minus @ 0xef533000-0xef534000
uncached-minus @ 0xef534000-0xef535000
uncached-minus @ 0xef534000-0xef535000
uncached-minus @ 0xef534000-0xef535000
uncached-minus @ 0xef535000-0xef536000
uncached-minus @ 0xef537000-0xef538000
uncached-minus @ 0xef538000-0xef539000
uncached-minus @ 0xef538000-0xef539000
uncached-minus @ 0xef538000-0xef539000
uncached-minus @ 0xef539000-0xef53a000
uncached-minus @ 0xef539000-0xef53a000
uncached-minus @ 0xef539000-0xef53a000
uncached-minus @ 0xef53a000-0xef53b000
uncached-minus @ 0xf000-0xf800
uncached-minus @ 0xf00e-0xf00e1000
uncached-minus @ 0xf010-0xf0101000
uncached-minus @ 0xf0101000-0xf0102000
uncached-minus @ 0xfdac-0xfdad
uncached-minus @ 0xfdae-0xfdaf
uncached-minus @ 0xfdaf-0xfdb0
uncached-minus @ 0xfdc43000-0xfdc44000
uncached-minus @ 0xfe00-0xfe001000
uncached-minus @ 0xfe00-0xfe001000
uncached-minus @ 0xfed0-0xfed01000
uncached-minus @ 0xfed15000-0xfed16000
uncached-minus @ 0xfed4-0xfed41000
uncached-minus @ 0xfed9-0xfed91000
uncached-minus @ 0xfed91000-0xfed92000

Is that the info you were looking for?

Thanks
Daniel


Re: [Nouveau] [PATCH] PCI: add prefetch quirk to work around Asus/Nvidia suspend issues

2018-09-03 Thread Daniel Drake
On Sat, Sep 1, 2018 at 3:12 AM, Bjorn Helgaas  wrote:
> If true, this sounds like some sort of erratum, so it would be good to
> get some input from Intel, and I cc'd a few Intel folks.

Yes, it would be great to get their input.

> It's interesting that all the systems below are from Asus.  That makes
> me think there's some BIOS or SMM connection, e.g., SMM traps the
> register write and does something magic.

Is there a way I can check if there is a SMM trap active for this address?

> Does this problem happen after a full system suspend/resume, or does
> it happen after runtime suspend of only the GPU?  Or runtime suspend
> of only the GPU and the upstream bridge?

runtime suspend/resume works fine. It only happens after S3 suspend.

> Can we tell whether Windows rewrites this register unconditionally at
> resume-time?  If so, it may be more robust for Linux to do the same.
> The whole thing is black magic, which I hate, but if it's our only
> choice, it may be better to have this applied everywhere so we don't
> keep stubbing our toes on new systems that require the quirk.

Any suggestions for how to make this happen? Booting windows in
virt-manager (hoping that I could then spy on PCI config space reg
accesses), I don't see an option for S3 suspend, but I'll keep looking
into this.

Thanks
Daniel


[Nouveau] [PATCH] PCI: add prefetch quirk to work around Asus/Nvidia suspend issues

2018-08-31 Thread Daniel Drake
On over 40 Intel-based Asus products, the nvidia GPU becomes unusable
after S3 suspend/resume. The affected products include multiple
generations of nvidia GPUs and Intel SoCs. After resume, nouveau logs
many errors such as:

fifo: fault 00 [READ] at 00555000 engine 00 [GR] client 04 [HUB/FE] 
reason 4a [] on channel -1 [007fa91000 unknown]
DRM: failed to idle channel 0 [DRM]

Similarly, the nvidia proprietary driver also fails after resume
(black screen, 100% CPU usage in Xorg process). We shipped a sample
to Nvidia for diagnosis, and their response indicated that it's a
problem with the parent PCI bridge (on the Intel SoC), not the GPU.

We found a workaround: on resume, rewrite the Intel PCI bridge
'Prefetchable Base Upper 32 Bits' register. In the cases that I checked,
this register has value 0 and we just have to rewrite that value.

It's very strange that rewriting the exact same register value
makes a difference, but it definitely makes the issue go away.
It's not just acting as some kind of memory barrier, because rewriting
other bridge registers does not work around the issue. There's something
magic in this particular register.

We examined our database of Asus hardware and identified 43 products
that we believe are affected. Checking the nvidia GPU parent PCI bridge
on each one, in total 5 Intel PCI bridges need quirking as below.
The quirk will run on bridges even where no nvidia GPU is connected,
but it should be harmless, and we at least limit it to only running
on Asus products.

This fix was tested on all the affected models that we have in hands
(X542UQ, UX533FD, X530UN, V272UN).

Signed-off-by: Daniel Drake 
---

Notes:
If anyone has ideas for why writing this register makes a difference, or
suggestions for other approaches then I'm all ears...

Here is some basic info of the 43 products believed to be affected:
basic DMI data, nvidia GPU PCI info, parent PCI bridge info.

sys_vendor: ASUSTeK COMPUTER INC.
board_name: FX502VD
product_name: FX502VD
01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev 
ff) (prog-if ff)
!!! Unknown header type 7f
00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 05) 
(prog-if 00 [Normal decode])

sys_vendor: ASUSTeK COMPUTER INC.
board_name: FX570UD
product_name: ASUS Gaming FX570UD
01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev a1)
Subsystem: ASUSTeK Computer Inc. Device [1043:1f40]
00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) 
(prog-if 00 [Normal decode])

sys_vendor: ASUSTeK COMPUTER INC.
board_name: GL553VD
product_name: GL553VD
01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev a1)
Subsystem: ASUSTeK Computer Inc. Device [1043:15e0]
00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 05) 
(prog-if 00 [Normal decode])

sys_vendor: ASUSTeK COMPUTER INC.
board_name: GL553VD
product_name: GL553VD
01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev a1)
Subsystem: ASUSTeK Computer Inc. Device [1043:15e0]
00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 05) 
(prog-if 00 [Normal decode])

sys_vendor: ASUSTeK COMPUTER INC.
board_name: GL753VD
product_name: GL753VD
01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev a1)
Subsystem: ASUSTeK Computer Inc. Device [1043:1590]
00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 05) 
(prog-if 00 [Normal decode])

sys_vendor: ASUSTeK COMPUTER INC.
board_name: GL753VD
product_name: GL753VD
01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev a1)
Subsystem: ASUSTeK Computer Inc. Device [1043:1590]
00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 05) 
(prog-if 00 [Normal decode])

sys_vendor: ASUSTeK COMPUTER INC.
board_name: K401UQK
product_name: K401UQK
01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:134d] (rev a2)
Subsystem: ASUSTeK Computer Inc. Device [1043:14b0]
00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) 
(prog-if 00 [Normal decode])

sys_vendor: ASUSTeK COMPUTER INC.
board_name: P1440UF
product_name: ASUSPRO P1440UF
01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:174d] (rev a2)
Subsystem: ASUSTeK Computer Inc. Device [1043:1f10]
00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) 
(prog-if 00 [Normal decode])

sys_vendor: ASUSTeK COMPUTER INC.
board_name: P2440UQ
product_name: P2440UQ
01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:134d] (rev a2)
Subsystem: ASUSTeK Computer Inc. Device [1043:13ce]
00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10

Re: [Nouveau] Rewriting Intel PCI bridge prefetch base address bits solves nvidia graphics issues

2018-08-31 Thread Daniel Drake
On Thu, Aug 30, 2018 at 5:40 PM, Peter Wu  wrote:
> As the BIOS date is not visible, can you also confirm that this message
> is visible in dmesg?
>
>nouveau: detected PR support, will not use DSM

Yes, that gets logged.

> For laptops, it appears that you have to do at least two things:
> - Ensure that the Subsystem Vendor/Product ID are set.
> - Expose a _ROM ACPI method that provides VBIOS.
>
> Perhaps you also need to provide a "_DSM" method that emulates at least
> the "Optimus" interface for GUID a486d8f8-0bda-471b-a72b-6042a6b5bee0.
>
> You probably lost interest here, but if you want to continue anyway this
> is what allowed me to install the driver on the XPS 9560:
> https://github.com/Lekensteyn/acpi-stuff/blob/master/d3test/fakedev.asl

Indeed. I'm going to submit the workaround and I'll look to come back
to this qemu/vfio analysis later.

Thanks
Daniel


Re: [Nouveau] Rewriting Intel PCI bridge prefetch base address bits solves nvidia graphics issues

2018-08-30 Thread Daniel Drake
On Tue, Aug 28, 2018 at 5:57 PM, Peter Wu  wrote:
> Just to be sure, after "sleep", do both devices report "suspended" in
> /sys/bus/pci/devices/0000:00:1c.0/power/runtime_status
> /sys/bus/pci/devices/0000:01:00.0/power/runtime_status
>
> and was this reproduced with a recent mainline kernel with no special
> cmdline options? The endlessm kernel on Github seems to have quite some
> patches, one of them explicitly disable runtime PM:
> https://github.com/endlessm/linux/commit/8b128b50cd6725eee2ae9025a1510a221d9b42f2

Yes, I checked for this issue in the past and I'm certain that nouveau
runtime pm works fine.

I also checked again now on X542UQ and the results are the same.
nouveau can do runtime suspend/resume (confirmed by reading
runtime_status) and then render 3D graphics OK. lspci is fine too. It
is just S3 suspend that is affected. This was testing on Linux 4.18
unmodified. I had to set nouveau runpm parameter to 1 for it to use
runtime pm.

Also checked with Karol's patch, the S3 issue is still there. Seems
like 2 different issues.

> Could you share some details:
> - acpidump
> - lspci -nnvvv
> - BIOS version (from /sys/class/dmi/id/)
> - kernel version (mainline?)

Linux 4.18 mainline
BIOS version: X542UQ.202
acpidump: 
https://gist.githubusercontent.com/dsd/79352284d4adce14f30d70e94fad89f2/raw/ed9480e924be413fff567da2edd5a2a7a86619d0/gistfile1.txt
pci: 
https://gist.githubusercontent.com/dsd/79352284d4adce14f30d70e94fad89f2/raw/ed9480e924be413fff567da2edd5a2a7a86619d0/pci

> Only non-bridge devices can be passed to a guest, but perhaps logging
> access to the emulated bridge is already sufficient. The Prefetchable
> Base Upper 32 Bits register is at offset 0x28.
>
> In a trace where the Nvidia device is disabled/enabled via Device
> Manager, I see writes on the enable path:
>
> 2571@1535108904.593107:rp_write_config (ioh3420, @0x28, 0x0, len=0x4)
>
> For Linux, I only see one write at startup, none on runtime resume.
> I did not test system sleep/resume. (disable/enable is arguably a bit
> different from system s/r, you may want to do additional testing here.)

I managed to install Win10 Home under virt-manager with the nvidia
device passed through.
However the nvidia windows driver installer refuses to install, says:
The NVIDIA graphics driver is not compatible with this version of Windows.
This graphics driver could not find compatible graphics hardware.

One trick for similar sounding problems is to change hypervisor vendor
ID but no luck here.

I was going to check if I can monitor PCI bridge config space access
even without the nvidia driver installed, but I can't find a way to
make the windows VM suspend and resume - the option is not available
in the VM.

Daniel


Re: [Nouveau] Rewriting Intel PCI bridge prefetch base address bits solves nvidia graphics issues

2018-08-27 Thread Daniel Drake
On Fri, Aug 24, 2018 at 11:42 PM, Peter Wu  wrote:
> Are these systems also affected through runtime power management? For
> example:
>
> modprobe nouveau# should enable runtime PM
> sleep 6 # wait for runtime suspend to kick in
> lspci -s1:  # runtime resume by reading PCI config space
>
> On laptops from about 2015-2016 with a GTX 9xxM this sequence results in
> hangs on various laptops
> (https://bugzilla.kernel.org/show_bug.cgi?id=156341).

This works fine here. I'm facing a different issue.

>> After a lot of experimentation I found a workaround: during resume,
>> set the value of PCI_PREF_BASE_UPPER32 to 0 on the parent PCI bridge.
>> Easily done in drivers/pci/quirks.c. Now all nvidia stuff works fine.
>
> I am curious, how did you discover this? While this could work, perhaps
> there are alternative workarounds/fixes?

Based on the observation that the following procedure works fine (note
the addition of step 3):

1. Boot
2. Suspend/resume
3. echo rescan > /sys/bus/pci/devices/0000:00:1c.0/rescan
4. Load nouveau driver
5. Start X

I worked through the rescan codepath until I had isolated the specific
code which magically makes things work (in pci_bridge_check_ranges).

Having found that, step 3 in the above test procedure can be replaced
with a simple:
   setpci -s 00:1c.0 0x28.l=0

> When you say "parent PCI" bridge, is that actually the device you see in
> "lspci -tv"? On a Dell XPS 9560, the GPU is under a different device:
>
>   -[:00]-+-00.0  Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor 
> Host Bridge/DRAM Registers
>  +-01.0-[01]00.0  NVIDIA Corporation GP107M [GeForce GTX 1050 
> Mobile]
>
>  00:01.0 PCI bridge [0604]: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th 
> Gen Core Processor PCIe Controller (x16) [8086:1901] (rev 05)

Yes, it's the parent bridge shown by lspci. The address of this varies
from system to system.

>> 1. Is the Intel PCI bridge misbehaving here? Why does writing the same
>> value of PCI_PREF_BASE_UPPER32 make any difference at all?
>
> At what point in the suspend code path did you insert this write? It is
> possible that the write somehow acted as a fence/memory barrier?

static void quirk_pref_base_upper32(struct pci_dev *dev)
{
   u32 pref_base_upper32;
   pci_read_config_dword(dev, PCI_PREF_BASE_UPPER32, &pref_base_upper32);
   pci_write_config_dword(dev, PCI_PREF_BASE_UPPER32, pref_base_upper32);
}
DECLARE_PCI_FIXUP_RESUME(PCI_VENDOR_ID_INTEL,  0x9d10, quirk_pref_base_upper32);

I don't think it's acting as a barrier. I tried changing this code to
rewrite other registers such as PCI_PREF_MEMORY_BASE and that makes
the bug come back.

>> 2. Who is responsible for saving and restoring PCI bridge
>> configuration during suspend and resume? Linux? ACPI? BIOS?
>
> Not sure about PCI bridges, but at least for the PCI Express Capability
> registers, it is in control of the OS when control is granted via the
> ACPI _OSC method.

I guess you are referring to pci_save_pcie_state(). I can't see
anything equivalent for the bridge registers.

> As Windows is probably not affected by this issue, a change must be
> possible to make Linux more compatible with Windows. Though I am not
> sure what change is needed.

I agree. There's a definite difference with Windows here and it would
be great to find a fix along those lines.

> I recently compared PCI configuration space access and ACPI method
> invocation using QEMU + VFIO with Linux 4.18, Windows 7 and Windows 10
> (1803). There were differences like disabling MSI/interrupts before
> suspend, setting the Enable Clock Power Management bit in PCI Express
> Link Control and more, but applying these changes were so far not really
> successful.

Interesting. Do you know any way that I could spy on Windows' accesses
to the PCI bridge registers?
Looking at https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVMF
I suspect VFIO would not help me here.
It says:
Note: If they are grouped with other devices in this manner, pci
root ports and bridges should neither be bound to vfio at boot, nor be
added to the VM.

Thanks
Daniel


[Nouveau] Rewriting Intel PCI bridge prefetch base address bits solves nvidia graphics issues

2018-08-23 Thread Daniel Drake
Hi,

We are facing a suspend/resume problem with many different Asus laptop
models (30+ products) with Intel chipsets (multiple generations) and
nvidia GPUs (several different ones). Reproducers include:

1. Boot
2. Suspend/resume
3. Load nouveau driver
4. Start X
5. Observe slow X startup and many errors in the logs (primarily
nouveau fifo faults)

or

1. Boot
2. Load nouveau driver
3. Start X
4. Run glxgears - observe spinning gears
5. Suspend/resume
6. Run glxgears - observe that output is all black

or

1. Boot
2. Load proprietary nvidia driver
3. Start X
4. Suspend/resume
5. Observe screen all black, Xorg using 100% CPU

So, suspend/resume basically kills the nvidia card in some way.

After a lot of experimentation I found a workaround: during resume,
set the value of PCI_PREF_BASE_UPPER32 to 0 on the parent PCI bridge.
Easily done in drivers/pci/quirks.c. Now all nvidia stuff works fine.
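
A minimal sketch of that quirk (hard-coding this machine's root port,
8086:9d10):

static void quirk_clear_pref_base_upper32(struct pci_dev *dev)
{
	/* Workaround: rewrite the bridge's PCI_PREF_BASE_UPPER32 (it
	 * already reads back as 0) during resume. */
	pci_write_config_dword(dev, PCI_PREF_BASE_UPPER32, 0);
}
DECLARE_PCI_FIXUP_RESUME(PCI_VENDOR_ID_INTEL, 0x9d10,
			 quirk_clear_pref_base_upper32);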

As an example of an affected product, take the Asus X542UQ (Intel
KabyLake i7-7500U with Nvidia GeForce 940MX). The PCI bridge is:

00:1c.0 PCI bridge [0604]: Intel Corporation Sunrise Point-LP PCI
Express Root Port [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0, IRQ 120
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
I/O behind bridge: 0000e000-0000efff
Memory behind bridge: ee000000-ef0fffff
Prefetchable memory behind bridge: 00000000d0000000-00000000e1ffffff
Capabilities: [40] Express Root Port (Slot+), MSI 00
Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
Capabilities: [90] Subsystem: ASUSTeK Computer Inc. Sunrise
Point-LP PCI Express Root Port [1043:1a00]
Capabilities: [a0] Power Management version 3
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Access Control Services
Capabilities: [200] L1 PM Substates
Capabilities: [220] #19
Kernel driver in use: pcieport

The really weird thing here is that the workaround register
PCI_PREF_BASE_UPPER32 already appears to have value 0, as shown above
and also verified during resume. But simply writing value 0 again
definitely results in all the problems going away.


1. Is the Intel PCI bridge misbehaving here? Why does writing the same
value of PCI_PREF_BASE_UPPER32 make any difference at all?


2. Who is responsible for saving and restoring PCI bridge
configuration during suspend and resume? Linux? ACPI? BIOS?

I could not see any Linux code to save and restore these registers.
Likewise I didn't find anything in the ACPI DSDT/SSDT - neither on the
affected products, nor on a similar product that does not suffer this
nvidia issue. Linux does put the PCI bridge into D3 power state during
suspend, and upon resume the lower 32 bits of the prefetch address are
still set to the same value, so through some means this info is not
being lost.


3. Any other suggestions, hints or experiments I could do to help move
forward on this issue?

My goal is to add a workaround to Linux (perhaps as a PCI quirk) for
existing devices. We are also in conversation with Asus engineers, and
if we can come up with a concrete diagnosis, we should be able to have
them fix this at the BIOS level in future products.


Thanks
Daniel


Re: [Nouveau] Kernel panic on nouveau during boot on NVIDIA NV118 (GM108)

2017-06-02 Thread Daniel Drake
On Fri, Jun 2, 2017 at 4:01 AM, Chris Chiu wrote:
> We are working with a new desktop that has the NVIDIA NV118
> chipset.
>
> During boot, the display becomes unusable at the point where the
> nouveau driver loads. We have reproduced this on 4.8, 4.11 and Linux
> master (4.12-rc3).

To save digging into the attachment, here is the crash:

nouveau 0000:01:00.0: pci: failed to adjust cap speed
nouveau 0000:01:00.0: pci: failed to adjust lnkctl speed
nouveau 0000:01:00.0: fb: 0 MiB DDR3
divide error: 0000 [#1] SMP
Modules linked in: hid_generic usbmouse usbkbd usbhid i915 nouveau(+)
mxm_wmi i2c_algo_bit drm_kms_helper sdhci_pci syscopyarea sysfillrect
sysimgblt fb_sys_fops ttm sdhci drm ahci libahci wmi i2c_hid hid video
CPU: 3 PID: 200 Comm: systemd-udevd Not tainted 4.11.0-2-generic
#7+dev143.9f9ecd2beos3.2.2-Endless
Hardware name: Acer Aspire Z20-730/IPMAL-BR3, BIOS D01 07/07/2016
task: 8a3070288000 task.stack: a3eb4103c000
RIP: 0010:gf100_ltc_oneinit_tag_ram+0xba/0x100 [nouveau]
RSP: 0018:a3eb4103f6b8 EFLAGS: 00010206
RAX: 1000ffefdfff RBX: 8a306f915000 RCX: 8a3075570030
RDX:  RSI: dead0200 RDI: 8a307fd9b700
RBP: a3eb4103f6d0 R08: 0001e980 R09: 8a3077003900
R10: a3eb40cdbda0 R11: 8a307fd986a4 R12: 
R13: 00015fff R14: 8a306fa2e400 R15: 8a306f914e00
FS:  7f456d052900() GS:8a307fd8() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7fefceb1c020 CR3: 00026fbc5000 CR4: 003406e0
Call Trace:
 gm107_ltc_oneinit+0x7c/0x90 [nouveau]
 nvkm_ltc_oneinit+0x13/0x20 [nouveau]
 nvkm_subdev_init+0x50/0x210 [nouveau]
 nvkm_device_init+0x151/0x270 [nouveau]
 nvkm_udevice_init+0x48/0x60 [nouveau]
 nvkm_object_init+0x40/0x190 [nouveau]
 nvkm_ioctl_new+0x179/0x290 [nouveau]
 ? nvkm_client_notify+0x30/0x30 [nouveau]
 ? nvkm_udevice_rd08+0x30/0x30 [nouveau]
 nvkm_ioctl+0x168/0x240 [nouveau]
 ? nvif_client_init+0x42/0x110 [nouveau]
 nvkm_client_ioctl+0x12/0x20 [nouveau]
 nvif_object_ioctl+0x42/0x50 [nouveau]
 nvif_object_init+0xc2/0x130 [nouveau]
 nvif_device_init+0x12/0x30 [nouveau]
 nouveau_cli_init+0x15e/0x1d0 [nouveau]
 nouveau_drm_load+0x67/0x8b0 [nouveau]
 ? sysfs_do_create_link_sd.isra.2+0x70/0xb0
 drm_dev_register+0x148/0x1e0 [drm]
 drm_get_pci_dev+0xa0/0x160 [drm]
 nouveau_drm_probe+0x1d9/0x260 [nouveau]

Daniel


[Nouveau] [PATCH] drm/nouveau/core: recognise GP107 chipset

2017-02-17 Thread Daniel Drake
From: Chris Chiu <c...@endlessm.com>

This new graphics card was failing to initialize with nouveau due to
an "unknown chipset" error.

Copy the GP106 configuration and rename for GP107/NV137. We don't
know for certain that this is fully correct, but brief desktop testing
suggests this is working fine.

Signed-off-by: Chris Chiu <c...@endlessm.com>
Signed-off-by: Daniel Drake <dr...@endlessm.com>
---
 drivers/gpu/drm/nouveau/nvkm/engine/device/base.c | 29 +++
 1 file changed, 29 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c b/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c
index fea30d6..d242431 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c
@@ -2237,6 +2237,34 @@ nv136_chipset = {
.fifo = gp100_fifo_new,
 };
 
+static const struct nvkm_device_chip
+nv137_chipset = {
+   .name = "GP107",
+   .bar = gf100_bar_new,
+   .bios = nvkm_bios_new,
+   .bus = gf100_bus_new,
+   .devinit = gm200_devinit_new,
+   .fb = gp104_fb_new,
+   .fuse = gm107_fuse_new,
+   .gpio = gk104_gpio_new,
+   .i2c = gm200_i2c_new,
+   .ibus = gm200_ibus_new,
+   .imem = nv50_instmem_new,
+   .ltc = gp100_ltc_new,
+   .mc = gp100_mc_new,
+   .mmu = gf100_mmu_new,
+   .pci = gp100_pci_new,
+   .timer = gk20a_timer_new,
+   .top = gk104_top_new,
+   .ce[0] = gp104_ce_new,
+   .ce[1] = gp104_ce_new,
+   .ce[2] = gp104_ce_new,
+   .ce[3] = gp104_ce_new,
+   .disp = gp104_disp_new,
+   .dma = gf119_dma_new,
+   .fifo = gp100_fifo_new,
+};
+
 static int
 nvkm_device_event_ctor(struct nvkm_object *object, void *data, u32 size,
   struct nvkm_notify *notify)
@@ -2673,6 +2701,7 @@ nvkm_device_ctor(const struct nvkm_device_func *func,
case 0x130: device->chip = &nv130_chipset; break;
case 0x134: device->chip = &nv134_chipset; break;
case 0x136: device->chip = &nv136_chipset; break;
+   case 0x137: device->chip = &nv137_chipset; break;
default:
nvdev_error(device, "unknown chipset (%08x)\n", boot0);
goto done;
-- 
2.9.3
