Re: [Nouveau] [PATCH v2 0/9] drm/nouveau: Various fixes for GP10B
On Mon, Nov 18, 2019 at 11:45 PM Thierry Reding wrote: > > On Sat, Nov 02, 2019 at 06:56:28PM +0100, Thierry Reding wrote: > > From: Thierry Reding > > > > Hi Ben, > > > > here's a revised subset of the patches I had sent out a couple of weeks > > ago. I've reworked the BAR2 accesses in the way that you had suggested, > > which at least for GP10B turned out to be fairly trivial to do. I have > > not looked in detail at this for GV11B yet, but a cursory look showed > > that BAR2 is accessed in more places, so the equivalent for GV11B might > > be a bit more involved. > > > > Other than that, not a lot has changed since then. I've added a couple > > of precursory patches to add IOMMU helper dummies for the case where > > IOMMU is disabled (as suggested by Ben Dooks). > > > > Joerg, it'd be great if you could give an Acked-by on those two patches > > so that Ben can pick them both up into the Nouveau tree. Alternatively I > > can put them both into a stable branch and send a pull request to both > > of you. Or yet another alternative would be for Joerg to apply them now > > and Ben to wait for v5.5-rc1 until he picks up the rest. All of those > > work for me. > > Hi Joerg, Ben, > > do you guys have any further comments on this series? I've got an > updated patch to silence the warning that the kbuild robot flagged, so > if there are no other comments I can send a final v3 of the series. I'm on leave at the moment. But the nouveau fixes look fine to me and I'm happy to pick them up once you send a v3, and allow Jeorg to apply the rest through his tree. Thanks, Ben. > > Thierry ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges
[+cc Daniel, Vidya, Thierry; thread starts at https://lore.kernel.org/r/20191017121901.13699-1-kher...@redhat.com] On Tue, Nov 19, 2019 at 11:26:45PM +0100, Karol Herbst wrote: > On Tue, Nov 19, 2019 at 10:50 PM Bjorn Helgaas wrote: > > On Thu, Oct 17, 2019 at 02:19:01PM +0200, Karol Herbst wrote: > > > Fixes state transitions of Nvidia Pascal GPUs from D3cold into > > > higher device states. > > > ... > > > + * - kernel crashes, as all pci reads return -1, which most code > > > + *isn't able to handle well enough. > > > > More details about these crashes would be useful as we look at > > places that *should* be able to handle errors like this. > > makes sense, I could ,orthogonal to this, make the code more robust > if we hit issues like this in the future. What I am mostly wondering > about is, why pci core doesn't give up if the device doesn't come > back from D3cold? It sounds like, that the most sane thing to do > here is to just give up and fail runtime_resume and report errors > back to userspace trying to make use of the devices. It's possible there's something the core could do here. It's interesting that we have at least three issues in this area in this release: 1) This NVIDIA GPU issue 2) Daniel's issue with AMD Ryzen5/7 XHCI power-on (https://lore.kernel.org/r/20191014061355.29072-1-dr...@endlessm.com) 3) Vidya's issue with Intel 750 NVMe power-on (https://lore.kernel.org/r/20191118172310.21373-1-vid...@nvidia.com) Vidya's current patch is the most generic (calling pci_dev_wait() from pci_power_up()). That will print a warning if the device doesn't respond, but we still don't go as far as returning errors to runtime_resume. This is definitely something we should consider improving in the future. > > > + * - sudden shutdowns, as the kernel identified an unrecoverable error > > > after > > > + *userspace tries to access the GPU. > > > > This doesn't fit with the others and more details might be > > informative here as well. > > yeah.. I try to get more infos on that. But at least for me (and it > might be a distribution thing) if I execute lspci, the system shuts > down, or at least tries to and might fail. The lspci doesn't need to be after the failure occurs. You can do it immediately after boot. If you can capture any part of the dmesg or console log of these sudden shutdowns, that's all I'm interested in at this point. Bjorn ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges
On Tue, Nov 19, 2019 at 10:50 PM Bjorn Helgaas wrote: > > [+cc Dave] > > On Thu, Oct 17, 2019 at 02:19:01PM +0200, Karol Herbst wrote: > > Fixes state transitions of Nvidia Pascal GPUs from D3cold into higher device > > states. > > > > v2: convert to pci_dev quirk > > put a proper technical explanation of the issue as a in-code comment > > v3: disable it only for certain combinations of intel and nvidia hardware > > v4: simplify quirk by setting flag on the GPU itself > > I have zero confidence that we understand the real problem, but we do > need to do something with this. I'll merge it for v5.5 if we get the > minor procedural stuff below straightened out. > Thanks, and I agree with your statement, but at this point I think only Intel can help out digging deeper as I see no way to debug this further. > > Signed-off-by: Karol Herbst > > Cc: Bjorn Helgaas > > Cc: Lyude Paul > > Cc: Rafael J. Wysocki > > Cc: Mika Westerberg > > Cc: linux-...@vger.kernel.org > > Cc: linux...@vger.kernel.org > > Cc: dri-de...@lists.freedesktop.org > > Cc: nouveau@lists.freedesktop.org > > --- > > drivers/pci/pci.c| 7 ++ > > drivers/pci/quirks.c | 53 > > include/linux/pci.h | 1 + > > 3 files changed, 61 insertions(+) > > > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > > index b97d9e10c9cc..02e71e0bcdd7 100644 > > --- a/drivers/pci/pci.c > > +++ b/drivers/pci/pci.c > > @@ -850,6 +850,13 @@ static int pci_raw_set_power_state(struct pci_dev > > *dev, pci_power_t state) > > || (state == PCI_D2 && !dev->d2_support)) > > return -EIO; > > > > + /* > > + * check if we have a bad combination of bridge controller and nvidia > > + * GPU, see quirk_broken_nv_runpm for more info > > Whitespace damage. Capitalized incorrectly (see other comments > nearby). > > > + */ > > + if (state != PCI_D0 && dev->broken_nv_runpm) > > + return 0; > > + > > pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, ); > > > > /* > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > > index 44c4ae1abd00..0006c9e37b6f 100644 > > --- a/drivers/pci/quirks.c > > +++ b/drivers/pci/quirks.c > > @@ -5268,3 +5268,56 @@ static void > > quirk_reset_lenovo_thinkpad_p50_nvgpu(struct pci_dev *pdev) > > DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, 0x13b1, > > PCI_CLASS_DISPLAY_VGA, 8, > > quirk_reset_lenovo_thinkpad_p50_nvgpu); > > + > > +/* > > + * Some Intel PCIe bridges cause devices to disappear from the PCIe bus > > after > > + * those were put into D3cold state if they were put into a non D0 PCI PM > > + * device state before doing so. > > A device in D3cold is off the bus by definition. > > IIUC the problem is that the sequence D0 -> D3hot -> D3cold -> D0 for > the GPU fails in the transition back to D0, while D0 -> D3cold -> D0 > works fine. > > So I guess the problem is that we can put the device in D3cold with no > problem, but if we put in D3hot before going to D3cold, the device > never comes back to D0. Right? > correct. It By the way, it doesn't matter if I put the device into D1 instead, as long as the device is not in D0 state before putting it into D3cold, it fails. > > + * This leads to various issue different issues which all manifest > > differently, > > s/issue different// > > Actually, I think there's only one underlying issue with several > manifestations. > > > + * but have the same root cause: > > + * - AIML code execution hits an infinite loop (as the coe waits on device > > + *memory to change). > > s/AIML/AML/ > s/coe/code/ > > > + * - kernel crashes, as all pci reads return -1, which most code isn't > > able > > + *to handle well enough. > > s/pci/PCI/ > > More details about these crashes would be useful as we look at places > that *should* be able to handle errors like this. > makes sense, I could ,orthogonal to this, make the code more robust if we hit issues like this in the future. What I am mostly wondering about is, why pci core doesn't give up if the device doesn't come back from D3cold? It sounds like, that the most sane thing to do here is to just give up and fail runtime_resume and report errors back to userspace trying to make use of the devices. > > + * - sudden shutdowns, as the kernel identified an unrecoverable error > > after > > + *userspace tries to access the GPU. > > This doesn't fit with the others and more details might be > informative here as well. > yeah.. I try to get more infos on that. But at least for me (and it might be a distribution thing) if I execute lspci, the system shuts down, or at least tries to and might fail. > > + * In all cases dmesg will contain at least one line like this: > > + * 'nouveau :01:00.0: Refused to change power state, currently in D3' > > + * followed by a lot of nouveau timeouts. > > + * > > + * ACPI code writes bit 0x80 to the not
Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges
[+cc Dave] On Thu, Oct 17, 2019 at 02:19:01PM +0200, Karol Herbst wrote: > Fixes state transitions of Nvidia Pascal GPUs from D3cold into higher device > states. > > v2: convert to pci_dev quirk > put a proper technical explanation of the issue as a in-code comment > v3: disable it only for certain combinations of intel and nvidia hardware > v4: simplify quirk by setting flag on the GPU itself I have zero confidence that we understand the real problem, but we do need to do something with this. I'll merge it for v5.5 if we get the minor procedural stuff below straightened out. > Signed-off-by: Karol Herbst > Cc: Bjorn Helgaas > Cc: Lyude Paul > Cc: Rafael J. Wysocki > Cc: Mika Westerberg > Cc: linux-...@vger.kernel.org > Cc: linux...@vger.kernel.org > Cc: dri-de...@lists.freedesktop.org > Cc: nouveau@lists.freedesktop.org > --- > drivers/pci/pci.c| 7 ++ > drivers/pci/quirks.c | 53 > include/linux/pci.h | 1 + > 3 files changed, 61 insertions(+) > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index b97d9e10c9cc..02e71e0bcdd7 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -850,6 +850,13 @@ static int pci_raw_set_power_state(struct pci_dev *dev, > pci_power_t state) > || (state == PCI_D2 && !dev->d2_support)) > return -EIO; > > + /* > + * check if we have a bad combination of bridge controller and nvidia > + * GPU, see quirk_broken_nv_runpm for more info Whitespace damage. Capitalized incorrectly (see other comments nearby). > + */ > + if (state != PCI_D0 && dev->broken_nv_runpm) > + return 0; > + > pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, ); > > /* > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > index 44c4ae1abd00..0006c9e37b6f 100644 > --- a/drivers/pci/quirks.c > +++ b/drivers/pci/quirks.c > @@ -5268,3 +5268,56 @@ static void > quirk_reset_lenovo_thinkpad_p50_nvgpu(struct pci_dev *pdev) > DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, 0x13b1, > PCI_CLASS_DISPLAY_VGA, 8, > quirk_reset_lenovo_thinkpad_p50_nvgpu); > + > +/* > + * Some Intel PCIe bridges cause devices to disappear from the PCIe bus after > + * those were put into D3cold state if they were put into a non D0 PCI PM > + * device state before doing so. A device in D3cold is off the bus by definition. IIUC the problem is that the sequence D0 -> D3hot -> D3cold -> D0 for the GPU fails in the transition back to D0, while D0 -> D3cold -> D0 works fine. So I guess the problem is that we can put the device in D3cold with no problem, but if we put in D3hot before going to D3cold, the device never comes back to D0. Right? > + * This leads to various issue different issues which all manifest > differently, s/issue different// Actually, I think there's only one underlying issue with several manifestations. > + * but have the same root cause: > + * - AIML code execution hits an infinite loop (as the coe waits on device > + *memory to change). s/AIML/AML/ s/coe/code/ > + * - kernel crashes, as all pci reads return -1, which most code isn't able > + *to handle well enough. s/pci/PCI/ More details about these crashes would be useful as we look at places that *should* be able to handle errors like this. > + * - sudden shutdowns, as the kernel identified an unrecoverable error after > + *userspace tries to access the GPU. This doesn't fit with the others and more details might be informative here as well. > + * In all cases dmesg will contain at least one line like this: > + * 'nouveau :01:00.0: Refused to change power state, currently in D3' > + * followed by a lot of nouveau timeouts. > + * > + * ACPI code writes bit 0x80 to the not documented PCI register 0x248 of the > + * PCIe bridge controller in order to power down the GPU. > + * Nonetheless, there are other code paths inside the ACPI firmware which use > + * other registers, which seem to work fine: > + * - 0xbc bit 0x20 (publicly available documentation claims 'reserved') > + * - 0xb0 bit 0x10 (link disable) All these register addresses are device-specific, so they're useless without identifying the device. "lspci -vvnn" output would let us at least connect this with something. It would be nice to have that info archived along with your acpidump and python repro scripts in a bugzilla with the URL in the commit log. These are likely in PCI capabilities. If I make the leap of assuming the "link disable" bit is PCI_EXP_LNKCTL_LD, that would mean the Link Control register is at 0xb0 and the register at 0xbc would be the Root Control register, and indeed 0x20 in Root Control is reserved. I don't know what the relevance of all this is, though. It's not remarkable that accesses to these registers work. Unless you mean you can access these registers *after* trying to put the device back in D0, but other
Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges
On Thu, 17 Oct 2019 at 22:19, Karol Herbst wrote: > > Fixes state transitions of Nvidia Pascal GPUs from D3cold into higher device > states. Can we get this acked/committed? At this stage I think we've done all we can unless Intel actually escalate this internally and work out how the hw is broken. Dave. > > v2: convert to pci_dev quirk > put a proper technical explanation of the issue as a in-code comment > v3: disable it only for certain combinations of intel and nvidia hardware > v4: simplify quirk by setting flag on the GPU itself > > Signed-off-by: Karol Herbst > Cc: Bjorn Helgaas > Cc: Lyude Paul > Cc: Rafael J. Wysocki > Cc: Mika Westerberg > Cc: linux-...@vger.kernel.org > Cc: linux...@vger.kernel.org > Cc: dri-de...@lists.freedesktop.org > Cc: nouveau@lists.freedesktop.org > --- > drivers/pci/pci.c| 7 ++ > drivers/pci/quirks.c | 53 > include/linux/pci.h | 1 + > 3 files changed, 61 insertions(+) > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index b97d9e10c9cc..02e71e0bcdd7 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -850,6 +850,13 @@ static int pci_raw_set_power_state(struct pci_dev *dev, > pci_power_t state) >|| (state == PCI_D2 && !dev->d2_support)) > return -EIO; > > + /* > +* check if we have a bad combination of bridge controller and nvidia > + * GPU, see quirk_broken_nv_runpm for more info > +*/ > + if (state != PCI_D0 && dev->broken_nv_runpm) > + return 0; > + > pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, ); > > /* > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > index 44c4ae1abd00..0006c9e37b6f 100644 > --- a/drivers/pci/quirks.c > +++ b/drivers/pci/quirks.c > @@ -5268,3 +5268,56 @@ static void > quirk_reset_lenovo_thinkpad_p50_nvgpu(struct pci_dev *pdev) > DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, 0x13b1, > PCI_CLASS_DISPLAY_VGA, 8, > quirk_reset_lenovo_thinkpad_p50_nvgpu); > + > +/* > + * Some Intel PCIe bridges cause devices to disappear from the PCIe bus after > + * those were put into D3cold state if they were put into a non D0 PCI PM > + * device state before doing so. > + * > + * This leads to various issue different issues which all manifest > differently, > + * but have the same root cause: > + * - AIML code execution hits an infinite loop (as the coe waits on device > + *memory to change). > + * - kernel crashes, as all pci reads return -1, which most code isn't able > + *to handle well enough. > + * - sudden shutdowns, as the kernel identified an unrecoverable error after > + *userspace tries to access the GPU. > + * > + * In all cases dmesg will contain at least one line like this: > + * 'nouveau :01:00.0: Refused to change power state, currently in D3' > + * followed by a lot of nouveau timeouts. > + * > + * ACPI code writes bit 0x80 to the not documented PCI register 0x248 of the > + * PCIe bridge controller in order to power down the GPU. > + * Nonetheless, there are other code paths inside the ACPI firmware which use > + * other registers, which seem to work fine: > + * - 0xbc bit 0x20 (publicly available documentation claims 'reserved') > + * - 0xb0 bit 0x10 (link disable) > + * Changing the conditions inside the firmware by poking into the relevant > + * addresses does resolve the issue, but it seemed to be ACPI private memory > + * and not any device accessible memory at all, so there is no portable way > of > + * changing the conditions. > + * > + * The only systems where this behavior can be seen are hybrid graphics > laptops > + * with a secondary Nvidia Pascal GPU. It cannot be ruled out that this issue > + * only occurs in combination with listed Intel PCIe bridge controllers and > + * the mentioned GPUs or if it's only a hw bug in the bridge controller. > + * > + * But because this issue was NOT seen on laptops with an Nvidia Pascal GPU > + * and an Intel Coffee Lake SoC, there is a higher chance of there being a > bug > + * in the bridge controller rather than in the GPU. > + * > + * This issue was not able to be reproduced on non laptop systems. > + */ > + > +static void quirk_broken_nv_runpm(struct pci_dev *dev) > +{ > + struct pci_dev *bridge = pci_upstream_bridge(dev); > + > + if (bridge->vendor == PCI_VENDOR_ID_INTEL && > + bridge->device == 0x1901) > + dev->broken_nv_runpm = 1; > +} > +DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, > + PCI_BASE_CLASS_DISPLAY, 16, > + quirk_broken_nv_runpm); > diff --git a/include/linux/pci.h b/include/linux/pci.h > index ac8a6c4e1792..903a0b3a39ec 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -416,6 +416,7 @@ struct pci_dev { > unsigned int__aer_firmware_first_valid:1; >