Re: [Nouveau] [PATCH v3] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges
On Wed, Oct 16, 2019 at 11:48:22PM +0200, Karol Herbst wrote:
> On Wed, Oct 16, 2019 at 11:37 PM Bjorn Helgaas wrote:
> > On Wed, Oct 16, 2019 at 09:18:32PM +0200, Karol Herbst wrote:
> > > but setting the PCI_DEV_FLAGS_NO_D3 flag does prevent using the
> > > platform means of putting the device into D3cold, right? That's
> > > actually what should still happen, just the D3hot step should be
> > > skipped.
> >
> > If I understand correctly, when we put a device in D3cold on an ACPI
> > system, we do something like this:
> >
> >   pci_set_power_state(D3cold)
> >     if (PCI_DEV_FLAGS_NO_D3)
> >       return 0                                   <-- nothing at all if quirked
> >     pci_raw_set_power_state
> >       pci_write_config_word(PCI_PM_CTRL, D3hot)  <-- set to D3hot
> >     __pci_complete_power_transition(D3cold)
> >       pci_platform_power_transition(D3cold)
> >         platform_pci_set_power_state(D3cold)
> >           acpi_pci_set_power_state(D3cold)
> >             acpi_device_set_power(ACPI_STATE_D3_COLD)
> >               ...
> >                 acpi_evaluate_object("_OFF")     <-- set to D3cold
> >
> > I did not understand the connection with platform (ACPI) power
> > management from your patch. It sounds like you want this entire path
> > except that you want to skip the PCI_PM_CTRL write?
> >
>
> exactly. I have been running with this workaround for a while now and
> have not had any failures with it since. The GPU gets turned off correctly
> and I see the same power savings, just that the GPU can be powered on again.
>
> > That seems like something Rafael should weigh in on. I don't know
> > why we set the device to D3hot with PCI_PM_CTRL before using the ACPI
> > methods, and I don't know what the effect of skipping that is. It
> > seems a little messy to slice out this tiny piece from the middle, but
> > maybe it makes sense.
> >
>
> afaik, when I talked with others about it in the past, Windows is
> doing that before using ACPI calls, but maybe they have some similar
> workarounds for certain Intel bridges as well? I am sure it affects
> more than the one I am blacklisting here, but I would rather check
> each device before blacklisting all Kaby Lake and Skylake bridges (as
> those are the ones where this issue can be observed).

From a quick look at the ACPI spec, I didn't see conditions like "OSPM
must put PCI devices in D3hot before executing _OFF". But obviously
there's *some* reason and I probably just missed it.

> Sadly we had no luck getting any information about such a workaround out
> of Nvidia or Intel.

I'm not surprised; it doesn't seem like we really have the details
needed to get to a root cause yet. I think what we really need is a
PCIe analyzer trace to see what happens when the device "falls off the
bus".

Bjorn
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau
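The three outcomes discussed in this message (the existing NO_D3 quirk, the normal path, and the requested "skip only the PCI_PM_CTRL write" behaviour) can be sketched as a small user-space model. This is only an illustration of the call trace quoted above, not kernel code: struct toy_dev, its fields, and toy_set_d3cold() are hypothetical names made up for this sketch.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct pci_dev; not the kernel type. */
struct toy_dev {
	bool no_d3;        /* models PCI_DEV_FLAGS_NO_D3 being set */
	bool skip_pm_ctrl; /* hypothetical: models "skip only the D3hot step" */
	bool wrote_d3hot;  /* set where the PCI_PM_CTRL write would happen */
	bool ran_acpi_off; /* set where ACPI _OFF would be evaluated */
};

/* Walk the D3cold path from the call trace; returns 0 on every path shown. */
static int toy_set_d3cold(struct toy_dev *dev)
{
	if (dev->no_d3)
		return 0;                /* quirked: nothing at all happens */

	if (!dev->skip_pm_ctrl)
		dev->wrote_d3hot = true; /* pci_raw_set_power_state(): set to D3hot */

	dev->ran_acpi_off = true;        /* platform transition: _OFF, set to D3cold */
	return 0;
}
```

With no_d3 set, neither step runs, which is why the existing flag does not give the behaviour asked for here; with only skip_pm_ctrl set, the platform step still runs but the D3hot write does not.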
Re: [Nouveau] [PATCH v3] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges
On Wed, Oct 16, 2019 at 11:37 PM Bjorn Helgaas wrote:
>
> [+cc linux-acpi]
>
> On Wed, Oct 16, 2019 at 09:18:32PM +0200, Karol Herbst wrote:
> > but setting the PCI_DEV_FLAGS_NO_D3 flag does prevent using the
> > platform means of putting the device into D3cold, right? That's
> > actually what should still happen, just the D3hot step should be
> > skipped.
>
> If I understand correctly, when we put a device in D3cold on an ACPI
> system, we do something like this:
>
>   pci_set_power_state(D3cold)
>     if (PCI_DEV_FLAGS_NO_D3)
>       return 0                                   <-- nothing at all if quirked
>     pci_raw_set_power_state
>       pci_write_config_word(PCI_PM_CTRL, D3hot)  <-- set to D3hot
>     __pci_complete_power_transition(D3cold)
>       pci_platform_power_transition(D3cold)
>         platform_pci_set_power_state(D3cold)
>           acpi_pci_set_power_state(D3cold)
>             acpi_device_set_power(ACPI_STATE_D3_COLD)
>               ...
>                 acpi_evaluate_object("_OFF")     <-- set to D3cold
>
> I did not understand the connection with platform (ACPI) power
> management from your patch. It sounds like you want this entire path
> except that you want to skip the PCI_PM_CTRL write?
>

exactly. I have been running with this workaround for a while now and
have not had any failures with it since. The GPU gets turned off correctly
and I see the same power savings, just that the GPU can be powered on again.

> That seems like something Rafael should weigh in on. I don't know
> why we set the device to D3hot with PCI_PM_CTRL before using the ACPI
> methods, and I don't know what the effect of skipping that is. It
> seems a little messy to slice out this tiny piece from the middle, but
> maybe it makes sense.
>

afaik, when I talked with others about it in the past, Windows is
doing that before using ACPI calls, but maybe they have some similar
workarounds for certain Intel bridges as well? I am sure it affects
more than the one I am blacklisting here, but I would rather check
each device before blacklisting all Kaby Lake and Skylake bridges (as
those are the ones where this issue can be observed).

Sadly we had no luck getting any information about such a workaround out
of Nvidia or Intel.

> > On Wed, Oct 16, 2019 at 9:14 PM Bjorn Helgaas wrote:
> > >
> > > On Wed, Oct 16, 2019 at 04:44:49PM +0200, Karol Herbst wrote:
> > > > Fixes state transitions of Nvidia Pascal GPUs from D3cold into higher
> > > > device states.
> > > >
> > > > v2: convert to pci_dev quirk
> > > >     put a proper technical explanation of the issue as an in-code comment
> > > > v3: disable it only for certain combinations of Intel and Nvidia
> > > >     hardware
> > > >
> > > > Signed-off-by: Karol Herbst
> > > > Cc: Bjorn Helgaas
> > > > Cc: Lyude Paul
> > > > Cc: Rafael J. Wysocki
> > > > Cc: Mika Westerberg
> > > > Cc: linux-...@vger.kernel.org
> > > > Cc: linux...@vger.kernel.org
> > > > Cc: dri-de...@lists.freedesktop.org
> > > > Cc: nouveau@lists.freedesktop.org
> > > > ---
> > > >  drivers/pci/pci.c    | 11 ++
> > > >  drivers/pci/quirks.c | 52
> > > >  include/linux/pci.h  |  1 +
> > > >  3 files changed, 64 insertions(+)
> > > >
> > > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > > index b97d9e10c9cc..8e056eb7e6ff 100644
> > > > --- a/drivers/pci/pci.c
> > > > +++ b/drivers/pci/pci.c
> > > > @@ -805,6 +805,13 @@ static inline bool platform_pci_bridge_d3(struct pci_dev *dev)
> > > >  	return pci_platform_pm ? pci_platform_pm->bridge_d3(dev) : false;
> > > >  }
> > > >
> > > > +static inline bool parent_broken_child_pm(struct pci_dev *dev)
> > > > +{
> > > > +	if (!dev->bus || !dev->bus->self)
> > > > +		return false;
> > > > +	return dev->bus->self->broken_nv_runpm && dev->broken_nv_runpm;
> > > > +}
> > > > +
> > > >  /**
> > > >   * pci_raw_set_power_state - Use PCI PM registers to set the power state of
> > > >   *			     given PCI device
> > > > @@ -850,6 +857,10 @@ static int pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state)
> > > >  	    || (state == PCI_D2 && !dev->d2_support))
> > > >  		return -EIO;
> > > >
> > > > +	/* check if the bus controller causes issues */
> > > > +	if (state != PCI_D0 && parent_broken_child_pm(dev))
> > > > +		return 0;
> > > > +
> > > >  	pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, );
> > > >
> > > >  	/*
> > > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > > > index 44c4ae1abd00..c2f20b745dd4 100644
> > > > --- a/drivers/pci/quirks.c
> > > > +++ b/drivers/pci/quirks.c
> > > > @@ -5268,3 +5268,55 @@ static void quirk_reset_lenovo_thinkpad_p50_nvgpu(struct pci_dev *pdev)
> > > >  DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, 0x13b1,
> > > >  			      PCI_CLASS_DISPLAY_VGA, 8,
Re: [Nouveau] [PATCH v3] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges
[+cc linux-acpi]

On Wed, Oct 16, 2019 at 09:18:32PM +0200, Karol Herbst wrote:
> but setting the PCI_DEV_FLAGS_NO_D3 flag does prevent using the
> platform means of putting the device into D3cold, right? That's
> actually what should still happen, just the D3hot step should be
> skipped.

If I understand correctly, when we put a device in D3cold on an ACPI
system, we do something like this:

  pci_set_power_state(D3cold)
    if (PCI_DEV_FLAGS_NO_D3)
      return 0                                   <-- nothing at all if quirked
    pci_raw_set_power_state
      pci_write_config_word(PCI_PM_CTRL, D3hot)  <-- set to D3hot
    __pci_complete_power_transition(D3cold)
      pci_platform_power_transition(D3cold)
        platform_pci_set_power_state(D3cold)
          acpi_pci_set_power_state(D3cold)
            acpi_device_set_power(ACPI_STATE_D3_COLD)
              ...
                acpi_evaluate_object("_OFF")     <-- set to D3cold

I did not understand the connection with platform (ACPI) power
management from your patch. It sounds like you want this entire path
except that you want to skip the PCI_PM_CTRL write?

That seems like something Rafael should weigh in on. I don't know
why we set the device to D3hot with PCI_PM_CTRL before using the ACPI
methods, and I don't know what the effect of skipping that is. It
seems a little messy to slice out this tiny piece from the middle, but
maybe it makes sense.

> On Wed, Oct 16, 2019 at 9:14 PM Bjorn Helgaas wrote:
> >
> > On Wed, Oct 16, 2019 at 04:44:49PM +0200, Karol Herbst wrote:
> > > Fixes state transitions of Nvidia Pascal GPUs from D3cold into higher
> > > device states.
> > >
> > > v2: convert to pci_dev quirk
> > >     put a proper technical explanation of the issue as an in-code comment
> > > v3: disable it only for certain combinations of Intel and Nvidia hardware
> > >
> > > Signed-off-by: Karol Herbst
> > > Cc: Bjorn Helgaas
> > > Cc: Lyude Paul
> > > Cc: Rafael J. Wysocki
> > > Cc: Mika Westerberg
> > > Cc: linux-...@vger.kernel.org
> > > Cc: linux...@vger.kernel.org
> > > Cc: dri-de...@lists.freedesktop.org
> > > Cc: nouveau@lists.freedesktop.org
> > > ---
> > >  drivers/pci/pci.c    | 11 ++
> > >  drivers/pci/quirks.c | 52
> > >  include/linux/pci.h  |  1 +
> > >  3 files changed, 64 insertions(+)
> > >
> > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > index b97d9e10c9cc..8e056eb7e6ff 100644
> > > --- a/drivers/pci/pci.c
> > > +++ b/drivers/pci/pci.c
> > > @@ -805,6 +805,13 @@ static inline bool platform_pci_bridge_d3(struct pci_dev *dev)
> > >  	return pci_platform_pm ? pci_platform_pm->bridge_d3(dev) : false;
> > >  }
> > >
> > > +static inline bool parent_broken_child_pm(struct pci_dev *dev)
> > > +{
> > > +	if (!dev->bus || !dev->bus->self)
> > > +		return false;
> > > +	return dev->bus->self->broken_nv_runpm && dev->broken_nv_runpm;
> > > +}
> > > +
> > >  /**
> > >   * pci_raw_set_power_state - Use PCI PM registers to set the power state of
> > >   *			     given PCI device
> > > @@ -850,6 +857,10 @@ static int pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state)
> > >  	    || (state == PCI_D2 && !dev->d2_support))
> > >  		return -EIO;
> > >
> > > +	/* check if the bus controller causes issues */
> > > +	if (state != PCI_D0 && parent_broken_child_pm(dev))
> > > +		return 0;
> > > +
> > >  	pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, );
> > >
> > >  	/*
> > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > > index 44c4ae1abd00..c2f20b745dd4 100644
> > > --- a/drivers/pci/quirks.c
> > > +++ b/drivers/pci/quirks.c
> > > @@ -5268,3 +5268,55 @@ static void quirk_reset_lenovo_thinkpad_p50_nvgpu(struct pci_dev *pdev)
> > >  DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, 0x13b1,
> > >  			      PCI_CLASS_DISPLAY_VGA, 8,
> > >  			      quirk_reset_lenovo_thinkpad_p50_nvgpu);
> > > +
> > > +/*
> > > + * Some Intel PCIe bridges cause devices to disappear from the PCIe bus after
> > > + * those were put into D3cold state if they were put into a non-D0 PCI PM
> > > + * device state before doing so.
> > > + *
> > > + * This leads to various issues which all manifest differently,
> > > + * but have the same root cause:
> > > + *  - AML code execution hits an infinite loop (as the code waits on device
> > > + *    memory to change).
> > > + *  - kernel crashes, as all pci reads return -1, which most code isn't able
> > > + *    to handle well enough.
> > > + *  - sudden shutdowns, as the kernel identified an unrecoverable error after
> > > + *    userspace tries to access the GPU.
> > > + *
> > > + * In all cases dmesg will contain at least one line like this:
> > > + * 'nouveau :01:00.0: Refused to change power state, currently in D3'
[Nouveau] [Bug 75985] [NVC1] HDMI audio device only visible after rescan
https://bugs.freedesktop.org/show_bug.cgi?id=75985

--- Comment #114 from Lukas Wunner ---
(In reply to Przemysław Kopa from comment #113)
> (In reply to Lukas Wunner from comment #112)
> > Glad to hear. You don't seem to have any commits in the kernel so far. Would
> > you like to try and bake these changes into a proper patch? If not I'll
> > gladly create and submit the patch myself, but mentoring someone else to make
> > their first contribution is more beneficial to the community, hence my
> > question.
>
> Lukas, could you please handle it this time? Sorry for not posting sooner.

Sure thing. Just one question: you wrote that you had to add
"HDA_CODEC_ENTRY(0x10de0403, "GPU 0403 HDMI/DP", patch_nvhdmi)"
to snd_hda_id_hdmi[] with the rationale that the "PCI ID of my
Nvidia HDA wasn't there".

This confuses me because the PCI device ID of the HDA controller
is "0bea", and "0403" are the 16 most significant bits of the
PCI class ID. HDA_CODEC_ENTRY() needs to match the 32-bit
HD audio vendor ID.

Just to double-check, could you execute
"cat /sys/bus/pci/devices/:01:00.1/hdaudioC1D0/vendor_id"
and post the result here? Is it really 0x10de0403?

Thanks!
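The distinction drawn above between the 32-bit HD audio vendor ID and the PCI class code can be sketched with plain bit arithmetic. The helper names below are hypothetical and exist only for this illustration; whether the codec in question really reports 0x10de0403 is exactly the open question in the comment, so the values only show how the two numbers are composed.

```c
#include <stdint.h>

/*
 * 32-bit HD audio codec vendor ID: 16-bit vendor ID in the upper half,
 * 16-bit codec device ID in the lower half (hypothetical helper).
 */
static uint32_t toy_hda_vendor_id(uint16_t vendor, uint16_t device)
{
	return ((uint32_t)vendor << 16) | device;
}

/*
 * The PCI class code is 24 bits wide (base class, subclass, programming
 * interface); its 16 most significant bits are base class plus subclass.
 */
static uint16_t toy_pci_class_upper16(uint32_t class_code)
{
	return (uint16_t)(class_code >> 8);
}
```

For example, toy_hda_vendor_id(0x10de, 0x0403) yields 0x10de0403, while toy_pci_class_upper16(0x040300) yields 0x0403, which shows how a class value and a codec ID could be confused with one another.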
Re: [Nouveau] [PATCH v3] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges
but setting the PCI_DEV_FLAGS_NO_D3 flag does prevent using the
platform means of putting the device into D3cold, right? That's
actually what should still happen, just the D3hot step should be
skipped.

On Wed, Oct 16, 2019 at 9:14 PM Bjorn Helgaas wrote:
>
> On Wed, Oct 16, 2019 at 04:44:49PM +0200, Karol Herbst wrote:
> > Fixes state transitions of Nvidia Pascal GPUs from D3cold into higher device
> > states.
> >
> > v2: convert to pci_dev quirk
> >     put a proper technical explanation of the issue as an in-code comment
> > v3: disable it only for certain combinations of Intel and Nvidia hardware
> >
> > Signed-off-by: Karol Herbst
> > Cc: Bjorn Helgaas
> > Cc: Lyude Paul
> > Cc: Rafael J. Wysocki
> > Cc: Mika Westerberg
> > Cc: linux-...@vger.kernel.org
> > Cc: linux...@vger.kernel.org
> > Cc: dri-de...@lists.freedesktop.org
> > Cc: nouveau@lists.freedesktop.org
> > ---
> >  drivers/pci/pci.c    | 11 ++
> >  drivers/pci/quirks.c | 52
> >  include/linux/pci.h  |  1 +
> >  3 files changed, 64 insertions(+)
> >
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index b97d9e10c9cc..8e056eb7e6ff 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -805,6 +805,13 @@ static inline bool platform_pci_bridge_d3(struct pci_dev *dev)
> >  	return pci_platform_pm ? pci_platform_pm->bridge_d3(dev) : false;
> >  }
> >
> > +static inline bool parent_broken_child_pm(struct pci_dev *dev)
> > +{
> > +	if (!dev->bus || !dev->bus->self)
> > +		return false;
> > +	return dev->bus->self->broken_nv_runpm && dev->broken_nv_runpm;
> > +}
> > +
> >  /**
> >   * pci_raw_set_power_state - Use PCI PM registers to set the power state of
> >   *			     given PCI device
> > @@ -850,6 +857,10 @@ static int pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state)
> >  	    || (state == PCI_D2 && !dev->d2_support))
> >  		return -EIO;
> >
> > +	/* check if the bus controller causes issues */
> > +	if (state != PCI_D0 && parent_broken_child_pm(dev))
> > +		return 0;
> > +
> >  	pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, );
> >
> >  	/*
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 44c4ae1abd00..c2f20b745dd4 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -5268,3 +5268,55 @@ static void quirk_reset_lenovo_thinkpad_p50_nvgpu(struct pci_dev *pdev)
> >  DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, 0x13b1,
> >  			      PCI_CLASS_DISPLAY_VGA, 8,
> >  			      quirk_reset_lenovo_thinkpad_p50_nvgpu);
> > +
> > +/*
> > + * Some Intel PCIe bridges cause devices to disappear from the PCIe bus after
> > + * those were put into D3cold state if they were put into a non-D0 PCI PM
> > + * device state before doing so.
> > + *
> > + * This leads to various issues which all manifest differently,
> > + * but have the same root cause:
> > + *  - AML code execution hits an infinite loop (as the code waits on device
> > + *    memory to change).
> > + *  - kernel crashes, as all pci reads return -1, which most code isn't able
> > + *    to handle well enough.
> > + *  - sudden shutdowns, as the kernel identified an unrecoverable error after
> > + *    userspace tries to access the GPU.
> > + *
> > + * In all cases dmesg will contain at least one line like this:
> > + * 'nouveau :01:00.0: Refused to change power state, currently in D3'
> > + * followed by a lot of nouveau timeouts.
> > + *
> > + * ACPI code writes bit 0x80 to the undocumented PCI register 0x248 of the
> > + * PCIe bridge controller in order to power down the GPU.
> > + * Nonetheless, there are other code paths inside the ACPI firmware which use
> > + * other registers, which seem to work fine:
> > + *  - 0xbc bit 0x20 (publicly available documentation claims 'reserved')
> > + *  - 0xb0 bit 0x10 (link disable)
> > + * Changing the conditions inside the firmware by poking into the relevant
> > + * addresses does resolve the issue, but it seemed to be ACPI private memory
> > + * and not any device accessible memory at all, so there is no portable way of
> > + * changing the conditions.
> > + *
> > + * The only systems where this behavior can be seen are hybrid graphics laptops
> > + * with a secondary Nvidia Pascal GPU. It is unclear whether this issue
> > + * only occurs in combination with the listed Intel PCIe bridge controllers and
> > + * the mentioned GPUs, or whether it is only a hw bug in the bridge controller.
> > + *
> > + * But because this issue was NOT seen on laptops with an Nvidia Pascal GPU
> > + * and an Intel Coffee Lake SoC, there is a higher chance of there being a bug
> > + * in the bridge controller rather than in the GPU.
> > + *
> > + * This issue could not be reproduced on non-laptop systems.
> > + */
> > +
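The first two failure modes listed in the quoted comment (a loop waiting forever on device memory, and reads that return -1) can be illustrated with a bounded poll that treats an all-ones value as "device gone". This is a user-space sketch only: every toy_* name is hypothetical, and it does not claim to show how the kernel, nouveau, or the AML interpreter actually handle this.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register accessor type used only by this sketch. */
typedef uint32_t (*toy_read_fn)(void *ctx);

/* A 32-bit read of all ones (-1) usually means no device answered. */
static bool toy_read_looks_dead(uint32_t value)
{
	return value == UINT32_MAX;
}

/*
 * Poll until the value differs from 'old'. Gives up after max_tries or as
 * soon as the read looks dead. Returns 0 on change, -1 otherwise.
 */
static int toy_wait_for_change(toy_read_fn read, void *ctx,
			       uint32_t old, unsigned int max_tries)
{
	for (unsigned int i = 0; i < max_tries; i++) {
		uint32_t v = read(ctx);

		if (toy_read_looks_dead(v))
			return -1; /* device fell off the bus */
		if (v != old)
			return 0;
	}
	return -1; /* bounded: no infinite loop */
}

/* Example readers: one models a vanished device, one a live counter. */
static uint32_t toy_dead_reader(void *ctx)
{
	(void)ctx;
	return UINT32_MAX;
}

static uint32_t toy_counting_reader(void *ctx)
{
	unsigned int *n = ctx;

	return ++(*n);
}
```

Calling toy_wait_for_change(toy_dead_reader, NULL, 0, 1000) returns -1 on the first read instead of spinning, which is the behaviour the unbounded wait described above lacks.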
Re: [Nouveau] [PATCH v3] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges
On Wed, Oct 16, 2019 at 04:44:49PM +0200, Karol Herbst wrote:
> Fixes state transitions of Nvidia Pascal GPUs from D3cold into higher device
> states.
>
> v2: convert to pci_dev quirk
>     put a proper technical explanation of the issue as an in-code comment
> v3: disable it only for certain combinations of Intel and Nvidia hardware
>
> Signed-off-by: Karol Herbst
> Cc: Bjorn Helgaas
> Cc: Lyude Paul
> Cc: Rafael J. Wysocki
> Cc: Mika Westerberg
> Cc: linux-...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Cc: dri-de...@lists.freedesktop.org
> Cc: nouveau@lists.freedesktop.org
> ---
>  drivers/pci/pci.c    | 11 ++
>  drivers/pci/quirks.c | 52
>  include/linux/pci.h  |  1 +
>  3 files changed, 64 insertions(+)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index b97d9e10c9cc..8e056eb7e6ff 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -805,6 +805,13 @@ static inline bool platform_pci_bridge_d3(struct pci_dev *dev)
>  	return pci_platform_pm ? pci_platform_pm->bridge_d3(dev) : false;
>  }
>
> +static inline bool parent_broken_child_pm(struct pci_dev *dev)
> +{
> +	if (!dev->bus || !dev->bus->self)
> +		return false;
> +	return dev->bus->self->broken_nv_runpm && dev->broken_nv_runpm;
> +}
> +
>  /**
>   * pci_raw_set_power_state - Use PCI PM registers to set the power state of
>   *			     given PCI device
> @@ -850,6 +857,10 @@ static int pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state)
>  	    || (state == PCI_D2 && !dev->d2_support))
>  		return -EIO;
>
> +	/* check if the bus controller causes issues */
> +	if (state != PCI_D0 && parent_broken_child_pm(dev))
> +		return 0;
> +
>  	pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, );
>
>  	/*
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 44c4ae1abd00..c2f20b745dd4 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -5268,3 +5268,55 @@ static void quirk_reset_lenovo_thinkpad_p50_nvgpu(struct pci_dev *pdev)
>  DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, 0x13b1,
>  			      PCI_CLASS_DISPLAY_VGA, 8,
>  			      quirk_reset_lenovo_thinkpad_p50_nvgpu);
> +
> +/*
> + * Some Intel PCIe bridges cause devices to disappear from the PCIe bus after
> + * those were put into D3cold state if they were put into a non-D0 PCI PM
> + * device state before doing so.
> + *
> + * This leads to various issues which all manifest differently,
> + * but have the same root cause:
> + *  - AML code execution hits an infinite loop (as the code waits on device
> + *    memory to change).
> + *  - kernel crashes, as all pci reads return -1, which most code isn't able
> + *    to handle well enough.
> + *  - sudden shutdowns, as the kernel identified an unrecoverable error after
> + *    userspace tries to access the GPU.
> + *
> + * In all cases dmesg will contain at least one line like this:
> + * 'nouveau :01:00.0: Refused to change power state, currently in D3'
> + * followed by a lot of nouveau timeouts.
> + *
> + * ACPI code writes bit 0x80 to the undocumented PCI register 0x248 of the
> + * PCIe bridge controller in order to power down the GPU.
> + * Nonetheless, there are other code paths inside the ACPI firmware which use
> + * other registers, which seem to work fine:
> + *  - 0xbc bit 0x20 (publicly available documentation claims 'reserved')
> + *  - 0xb0 bit 0x10 (link disable)
> + * Changing the conditions inside the firmware by poking into the relevant
> + * addresses does resolve the issue, but it seemed to be ACPI private memory
> + * and not any device accessible memory at all, so there is no portable way of
> + * changing the conditions.
> + *
> + * The only systems where this behavior can be seen are hybrid graphics laptops
> + * with a secondary Nvidia Pascal GPU. It is unclear whether this issue
> + * only occurs in combination with the listed Intel PCIe bridge controllers and
> + * the mentioned GPUs, or whether it is only a hw bug in the bridge controller.
> + *
> + * But because this issue was NOT seen on laptops with an Nvidia Pascal GPU
> + * and an Intel Coffee Lake SoC, there is a higher chance of there being a bug
> + * in the bridge controller rather than in the GPU.
> + *
> + * This issue could not be reproduced on non-laptop systems.
> + */
> +
> +static void quirk_broken_nv_runpm(struct pci_dev *dev)
> +{
> +	dev->broken_nv_runpm = 1;

Can you use the existing PCI_DEV_FLAGS_NO_D3 flag for this instead of
adding a new flag? I would put the parent_broken_child_pm() logic
here, if possible, e.g., something like:

  struct pci_dev *bridge = pci_upstream_bridge(dev);

  if (bridge &&
      bridge->vendor == PCI_VENDOR_ID_INTEL &&
      bridge->device == 0x1901)
    dev->dev_flags |= PCI_DEV_FLAGS_NO_D3;

> +}
>
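The suggestion above (decide in the quirk itself by looking at the upstream bridge, then set an existing flag on the device) can be tried out in user space with mock types. Everything prefixed toy_/TOY_ is a stand-in invented for this sketch: the 'upstream' pointer plays the role of pci_upstream_bridge(dev), and the constants are local to the example rather than taken from kernel headers. Only the bridge device ID 0x1901 comes from the message itself.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_VENDOR_ID_INTEL 0x8086u    /* Intel's PCI vendor ID */
#define TOY_DEV_FLAGS_NO_D3 (1u << 1)  /* stand-in for PCI_DEV_FLAGS_NO_D3 */

/* Hypothetical, minimal mock of the fields the suggested quirk touches. */
struct toy_pci_dev {
	uint16_t vendor;
	uint16_t device;
	unsigned int dev_flags;
	struct toy_pci_dev *upstream; /* stands in for pci_upstream_bridge(dev) */
};

/* Flag the device only when it sits below the affected Intel bridge. */
static void toy_quirk_broken_nv_runpm(struct toy_pci_dev *dev)
{
	struct toy_pci_dev *bridge = dev->upstream;

	if (bridge &&
	    bridge->vendor == TOY_VENDOR_ID_INTEL &&
	    bridge->device == 0x1901)
		dev->dev_flags |= TOY_DEV_FLAGS_NO_D3;
}
```

Compared with the patch's parent_broken_child_pm(), this keeps the bridge check at fixup time and needs no extra per-device bit; as discussed elsewhere in the thread, though, a flag with NO_D3 semantics would also suppress the platform D3cold step.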
Re: [Nouveau] [PATCH] drm: Generalized NV Block Linear DRM format mod
On 10/15/19 8:42 AM, Daniel Vetter wrote:
On Tue, Oct 15, 2019 at 5:14 PM James Jones wrote:
On 10/15/19 7:19 AM, Daniel Vetter wrote:
On Mon, Oct 14, 2019 at 03:13:21PM -0700, James Jones wrote:

Builds upon the existing NVIDIA 16Bx2 block linear format modifiers by
adding more "fields" to the existing parameterized
DRM_FORMAT_MOD_NVIDIA_16BX2_BLOCK format modifier macro that allow
fully defining a unique-across-all-NVIDIA-hardware bit layout using a
minimal set of fields and values. The new modifier macro
DRM_FORMAT_MOD_NVIDIA_BLOCK_LINEAR_2D is effectively backwards
compatible with the existing macro, introducing a superset of the
previously definable format modifiers.

Backwards compatibility has two quirks. First, the zero value for the
"kind" field, which is implied by the DRM_FORMAT_MOD_NVIDIA_16BX2_BLOCK
macro, must be special cased in drivers and assumed to map to the
pre-Turing generic kind of 0xfe, since a kind of "zero" is reserved for
linear buffer layouts on all GPUs.

Second, it is assumed backwards compatibility is only needed when
running on Tegra GPUs, and specifically Tegra GPUs prior to Xavier.
This is based on two assertions:

-Tegra GPUs prior to Xavier used a slightly different raw bit layout
 than desktop GPUs, making it impossible to directly share block linear
 buffers between the two.

-Support for the existing block linear modifiers was incomplete, making
 them useful only for exporting buffers created by nouveau and importing
 them to Tegra DRM as framebuffers for scan out. There was no support
 for adding framebuffers using format modifiers in nouveau, nor
 importing dma-buf/PRIME GEM objects into nouveau userspace drivers with
 modifiers in Mesa.

Hence it is assumed the prior modifiers were not intended for use on
desktop GPUs, and as a corollary, were not intended to support sharing
block linear buffers across two different NVIDIA GPUs.

Signed-off-by: James Jones
---
 include/uapi/drm/drm_fourcc.h | 108 +++---
 1 file changed, 100 insertions(+), 8 deletions(-)

diff --git a/include/uapi/drm/drm_fourcc.h b/include/uapi/drm/drm_fourcc.h
index 3feeaa3f987a..cc9853d42a24 100644
--- a/include/uapi/drm/drm_fourcc.h
+++ b/include/uapi/drm/drm_fourcc.h
@@ -497,7 +497,99 @@ extern "C" {
 #define DRM_FORMAT_MOD_NVIDIA_TEGRA_TILED fourcc_mod_code(NVIDIA, 1)

 /*
- * 16Bx2 Block Linear layout, used by desktop GPUs, and Tegra K1 and later
+ * Generalized Block Linear layout, used by desktop GPUs starting with NV50/G80,
+ * and Tegra GPUs starting with Tegra K1.
+ *
+ * Pixels are arranged in Groups of Bytes (GOBs). GOB size and layout vary
+ * based on the architecture generation. GOBs themselves are then arranged in
+ * 3D blocks, with the block dimensions (in terms of GOBs) always being a power
+ * of two, and hence expressible as their log2 equivalent (e.g., "2" represents
+ * a block depth or height of "4").
+ *
+ * Chapter 20 "Pixel Memory Formats" of the Tegra X1 TRM describes this format
+ * in full detail.
+ *
+ *       Macro
+ * Bits  Param Description
+ * ----  ----- -----------------------------------------------------------------
+ *
+ *  3:0  h     log2(height) of each block, in GOBs. Placed here for
+ *             compatibility with the existing
+ *             DRM_FORMAT_MOD_NVIDIA_16BX2_BLOCK()-based modifiers.
+ *
+ *  4:4  -     Must be 1, to indicate block-linear layout. Necessary for
+ *             compatibility with the existing
+ *             DRM_FORMAT_MOD_NVIDIA_16BX2_BLOCK()-based modifiers.
+ *
+ *  8:5  -     Reserved (To support 3D-surfaces with variable log2(depth) block
+ *             size). Must be zero.
+ *
+ *             Note there is no log2(width) parameter. Some portions of the
+ *             hardware support a block width of two GOBs, but it is impractical
+ *             to use due to lack of support elsewhere, and has no known
+ *             benefits.
+ *
+ * 11:9  -     Reserved (To support 2D-array textures with variable array stride
+ *             in blocks, specified via log2(tile width in blocks)). Must be
+ *             zero.
+ *
+ * 19:12 k     Page Kind. This value directly maps to a field in the page
+ *             tables of all GPUs >= NV50. It affects the exact layout of bits
+ *             in memory and can be derived from the tuple
+ *
+ *               (format, GPU model, compression type, samples per pixel)
+ *
+ *             Where compression type is defined below. If GPU model were
+ *             implied by the format modifier, format, or memory buffer, page
+ *             kind would not need to be included in the modifier itself, but
+ *             since the modifier should define the layout of the associated
+ *             memory buffer independent from any device or other context, it
+ *             must be included here.
+ *
+ *             To grandfather in prior block linear format modifiers to this
+ *             layout,
[Nouveau] [PATCH v2] drm: Generalized NV Block Linear DRM format mod
Builds upon the existing NVIDIA 16Bx2 block linear format modifiers by
adding more "fields" to the existing parameterized
DRM_FORMAT_MOD_NVIDIA_16BX2_BLOCK format modifier macro that allow
fully defining a unique-across-all-NVIDIA-hardware bit layout using a
minimal set of fields and values. The new modifier macro
DRM_FORMAT_MOD_NVIDIA_BLOCK_LINEAR_2D is effectively backwards
compatible with the existing macro, introducing a superset of the
previously definable format modifiers.

Backwards compatibility has two quirks. First, the zero value for the
"kind" field, which is implied by the DRM_FORMAT_MOD_NVIDIA_16BX2_BLOCK
macro, must be special cased in drivers and assumed to map to the
pre-Turing generic kind of 0xfe, since a kind of "zero" is reserved for
linear buffer layouts on all GPUs.

Second, it is assumed backwards compatibility is only needed when
running on Tegra GPUs, and specifically Tegra GPUs prior to Xavier.
This is based on two assertions:

-Tegra GPUs prior to Xavier used a slightly different raw bit layout
 than desktop GPUs, making it impossible to directly share block linear
 buffers between the two.

-Support for the existing block linear modifiers was incomplete, making
 them useful only for exporting buffers created by nouveau and importing
 them to Tegra DRM as framebuffers for scan out. There was no support
 for adding framebuffers using format modifiers in nouveau, nor
 importing dma-buf/PRIME GEM objects into nouveau userspace drivers with
 modifiers in Mesa.

Hence it is assumed the prior modifiers were not intended for use on
desktop GPUs, and as a corollary, were not intended to support sharing
block linear buffers across two different NVIDIA GPUs.

v2:
  - Added canonicalize helper function

Signed-off-by: James Jones
---
 include/uapi/drm/drm_fourcc.h | 116 +++---
 1 file changed, 108 insertions(+), 8 deletions(-)

diff --git a/include/uapi/drm/drm_fourcc.h b/include/uapi/drm/drm_fourcc.h
index 3feeaa3f987a..56c8fe30caab 100644
--- a/include/uapi/drm/drm_fourcc.h
+++ b/include/uapi/drm/drm_fourcc.h
@@ -497,7 +497,107 @@ extern "C" {
 #define DRM_FORMAT_MOD_NVIDIA_TEGRA_TILED fourcc_mod_code(NVIDIA, 1)

 /*
- * 16Bx2 Block Linear layout, used by desktop GPUs, and Tegra K1 and later
+ * Generalized Block Linear layout, used by desktop GPUs starting with NV50/G80,
+ * and Tegra GPUs starting with Tegra K1.
+ *
+ * Pixels are arranged in Groups of Bytes (GOBs). GOB size and layout vary
+ * based on the architecture generation. GOBs themselves are then arranged in
+ * 3D blocks, with the block dimensions (in terms of GOBs) always being a power
+ * of two, and hence expressible as their log2 equivalent (e.g., "2" represents
+ * a block depth or height of "4").
+ *
+ * Chapter 20 "Pixel Memory Formats" of the Tegra X1 TRM describes this format
+ * in full detail.
+ *
+ *       Macro
+ * Bits  Param Description
+ * ----  ----- -----------------------------------------------------------------
+ *
+ *  3:0  h     log2(height) of each block, in GOBs. Placed here for
+ *             compatibility with the existing
+ *             DRM_FORMAT_MOD_NVIDIA_16BX2_BLOCK()-based modifiers.
+ *
+ *  4:4  -     Must be 1, to indicate block-linear layout. Necessary for
+ *             compatibility with the existing
+ *             DRM_FORMAT_MOD_NVIDIA_16BX2_BLOCK()-based modifiers.
+ *
+ *  8:5  -     Reserved (To support 3D-surfaces with variable log2(depth) block
+ *             size). Must be zero.
+ *
+ *             Note there is no log2(width) parameter. Some portions of the
+ *             hardware support a block width of two GOBs, but it is impractical
+ *             to use due to lack of support elsewhere, and has no known
+ *             benefits.
+ *
+ * 11:9  -     Reserved (To support 2D-array textures with variable array stride
+ *             in blocks, specified via log2(tile width in blocks)). Must be
+ *             zero.
+ *
+ * 19:12 k     Page Kind. This value directly maps to a field in the page
+ *             tables of all GPUs >= NV50. It affects the exact layout of bits
+ *             in memory and can be derived from the tuple
+ *
+ *               (format, GPU model, compression type, samples per pixel)
+ *
+ *             Where compression type is defined below. If GPU model were
+ *             implied by the format modifier, format, or memory buffer, page
+ *             kind would not need to be included in the modifier itself, but
+ *             since the modifier should define the layout of the associated
+ *             memory buffer independent from any device or other context, it
+ *             must be included here.
+ *
+ * 21:20 g     GOB Height and Page Kind Generation. The height of a GOB changed
+ *             starting with Fermi GPUs. Additionally, the mapping between page
+ *             kind and bit layout has changed at various points.
+ *
+ *               0 = Gob Height 8, Fermi -
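As a reading aid for the bit table in this patch excerpt, the listed ranges can be packed with a few shifts. This is a sketch under stated assumptions, not the macro from the patch: toy_nv_block_linear_fields() is a hypothetical helper, it covers only the fields whose bit ranges appear in the excerpt (h, the block-linear bit, k, g), it leaves the reserved bits 11:5 at zero, and it omits the vendor prefix that fourcc_mod_code() would add.

```c
#include <stdint.h>

/*
 * Hypothetical helper: pack only the fields listed in the table above.
 *   h    -> bits 3:0   log2(block height in GOBs)
 *   1    -> bit  4     must be 1: block-linear layout
 *   kind -> bits 19:12 page kind
 *   gen  -> bits 21:20 GOB height / page kind generation
 * Bits 11:5 are reserved and stay zero.
 */
static uint64_t toy_nv_block_linear_fields(unsigned int h, unsigned int kind,
					   unsigned int gen)
{
	return ((uint64_t)(h & 0xf)) |
	       ((uint64_t)1 << 4) |
	       ((uint64_t)(kind & 0xff) << 12) |
	       ((uint64_t)(gen & 0x3) << 20);
}
```

For instance, h = 4 with the pre-Turing generic kind 0xfe mentioned in the commit message and g = 0 packs to 0xfe014: 0x4 in bits 3:0, 0x10 for bit 4, and 0xfe000 for the kind field.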
Re: [Nouveau] nouveau kernel module will not load on old Sony Vaio laptop with 8400M GT
Karol Herbst composed on 2019-10-16 15:25 (UTC+0200):

> Felix Miata wrote:

>> is there anyone here who can help with:

>> https://bugs.freedesktop.org/show_bug.cgi?id=111853
>> nouveau kernel module won't load (not available) on Sony laptop with NVIDIA G86M
>> [GeForce 8400M GT] ID: 10de:0426

>> ???

> do you know if it used to work with older kernels? If yes, maybe a git
> bisect on the kernel could help

I've updated the bug to indicate 3.16.7 and 4.2.6 kernels will load the kernel
nouveau module without producing expected X results, along with Xorg.0.logs.
-- 
Evolution as taught in public schools is religion, not science.

 Team OS/2 ** Reg. Linux User #211409 ** a11y rocks!

Felix Miata *** http://fm.no-ip.com/
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau
[Nouveau] [Bug 111853] nouveau kernel module won't load (not available) on Sony laptop with NVIDIA G86M [GeForce 8400M GT] ID: 10de:0426
https://bugs.freedesktop.org/show_bug.cgi?id=111853

--- Comment #11 from Felix Miata ---
Created attachment 145758
  --> https://bugs.freedesktop.org/attachment.cgi?id=145758=edit
Xorg.0.log from live Knoppix 7.6.1

# lsmod | sort | grep veau
drm_kms_helper 70712 1 nouveau
mxm_wmi 1635 1 nouveau
nouveau 1033769 0
ttm 60685 1 nouveau
wmi 7363 2 mxm_wmi,nouveau
# inxi -c0 -GxxSM
System:    Host: Microknoppix Kernel: 4.2.6-64 x86_64 (64 bit gcc: 5.2.1) Console: tty 3 dm: kdm
           Distro: Debian GNU/Linux stretch/sid
Machine:   System: Sony (portable) product: VGN-AR730E v: C3LR1E11 serial: 28272434-3101919
           Mobo: Sony model: VAIO Bios: Phoenix v: R2090J8 date: 02/26/2008
           Chassis: type: 10
Graphics:  Card: NVIDIA G86M [GeForce 8400M GT] bus-ID: 01:00.0 chip-ID: 10de:0426
           Display Server: X.org 1.17.3 drivers: vesa,nouveau (unloaded: fbdev)
           tty size: 80x25 Advanced Data: N/A for root out of X
# dmesg | grep ailed
[0.770863] acpi PNP0A08:00: _OSC failed (AE_SUPPORT); disabling ASPM
[0.849357] pci :01:00.0: BAR 6: failed to assign [mem size 0x0002 pref]
[1.607734] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SATA.PRT0._SDD] (Node 8800bf08f338), AE_NOT_FOUND (20150619/psparse-536)
[1.609103] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SATA.PRT0._SDD] (Node 8800bf08f338), AE_NOT_FOUND (20150619/psparse-536)
[ 37.465961] systemd-udevd[2004]: Process '/usr/sbin/alsactl -E HOME=/var/run/alsa restore 0' failed with exit code 99.
[ 38.226541] systemd-udevd[2008]: Process '/sbin/crda' failed with exit code 249.
[ 38.229603] systemd-udevd[2008]: Process '/sbin/crda' failed with exit code 249.
[ 42.469065] systemd-udevd[2086]: Process '/usr/sbin/alsactl -E HOME=/var/run/alsa restore 0' failed with exit code 99.
[ 69.564168] systemd-logind[2229]: Failed to start user service, ignoring: Unknown unit: user@1000.service
[ 100.435441] uvcvideo: Failed to query (129) UVC probe control : -32 (exp. 26).
[ 100.435443] uvcvideo: Failed to initialize the device (-5).
Again, nouveau loads, but X video is scrambled VESA, again with no /dev/dri/card0 or /dev/fb0.

-- 
You are receiving this mail because:
You are the assignee for the bug.
[Nouveau] [Bug 111853] nouveau kernel module won't load (not available) on Sony laptop with NVIDIA G86M [GeForce 8400M GT] ID: 10de:0426
https://bugs.freedesktop.org/show_bug.cgi?id=111853

--- Comment #10 from Felix Miata ---
Created attachment 145757
  --> https://bugs.freedesktop.org/attachment.cgi?id=145757=edit
Xorg.0.log from live LMDE 2 Betsy boot

# uname -a
Linux stresslinux 2.6.37.6-0.5-default #1 SMP 2011-04-25 21:48:33 +0200 x86_64 x86_64 x86_64 GNU/Linux
# lsmod | sort | grep veau
button 6797 1 nouveau
drm 229676 3 nouveau,ttm,drm_kms_helper
drm_kms_helper 36630 1 nouveau
i2c_algo_bit 6342 1 nouveau
nouveau 678496 1
ttm 72581 1 nouveau
video 15865 1 nouveau

But Stresslinux 0.7.106 has no X :-(

# uname -a
Linux mint 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt7-1 (2015-03-01) x86_64 GNU/Linux
# lsmod | sort | grep veau
button 12944 1 nouveau
drm 249955 3 ttm,drm_kms_helper,nouveau
drm_kms_helper 49210 1 nouveau
i2c_algo_bit 12751 1 nouveau
i2c_core 46012 7 drm,i2c_i801,drm_kms_helper,i2c_algo_bit,v4l2_common,nouveau,videodev
mxm_wmi 12515 1 nouveau
nouveau 1122419 0
ttm 77862 1 nouveau
video 18096 1 nouveau
wmi 17339 2 mxm_wmi,nouveau
# inxi -GxxS
System:    Host: mint Kernel: 3.16.0-4-amd64 x86_64 (64 bit gcc: 4.8.4)
           Desktop: MATE 1.8.1 (Gtk 3.14.5+4) dm: mdm Distro: LinuxMint 2 betsy
Graphics:  Card: NVIDIA G86M [GeForce 8400M GT] bus-ID: 01:00.0 chip-ID: 10de:0426
           Display Server: X.Org 1.16.4 drivers: fbdev,vesa,nouveau Resolution: 1024x768@61.00hz
           GLX Renderer: Gallium 0.4 on llvmpipe (LLVM 3.5, 128 bits)
           GLX Version: 3.0 Mesa 10.3.2 Direct Rendering: Yes

With this old live distro, nouveau loads, but there's no /dev/dri/card0 or
/dev/fb0, so it's stuck in VESA 1024x768.
[Nouveau] [Bug 75985] [NVC1] HDMI audio device only visible after rescan
https://bugs.freedesktop.org/show_bug.cgi?id=75985

--- Comment #113 from Przemysław Kopa ---
(In reply to Lukas Wunner from comment #112)
>
> Glad to hear. You don't seem to have any commits in the kernel so far. Would
> you like to try and bake these changes into a proper patch? If not I'll
> gladly create and submit the patch myself but mentoring someone else make
> their first contribution is more beneficial to the community, hence my
> question.

Lukas, could you please handle it this time? Sorry for not posting sooner.
[Nouveau] [PATCH v3] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges
Fixes state transitions of Nvidia Pascal GPUs from D3cold into higher device
states.

v2: convert to pci_dev quirk
    put a proper technical explanation of the issue as an in-code comment
v3: disable it only for certain combinations of intel and nvidia hardware

Signed-off-by: Karol Herbst
Cc: Bjorn Helgaas
Cc: Lyude Paul
Cc: Rafael J. Wysocki
Cc: Mika Westerberg
Cc: linux-...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: dri-de...@lists.freedesktop.org
Cc: nouveau@lists.freedesktop.org
---
 drivers/pci/pci.c    | 11 ++
 drivers/pci/quirks.c | 52
 include/linux/pci.h  |  1 +
 3 files changed, 64 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b97d9e10c9cc..8e056eb7e6ff 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -805,6 +805,13 @@ static inline bool platform_pci_bridge_d3(struct pci_dev *dev)
 	return pci_platform_pm ? pci_platform_pm->bridge_d3(dev) : false;
 }
 
+static inline bool parent_broken_child_pm(struct pci_dev *dev)
+{
+	if (!dev->bus || !dev->bus->self)
+		return false;
+	return dev->bus->self->broken_nv_runpm && dev->broken_nv_runpm;
+}
+
 /**
  * pci_raw_set_power_state - Use PCI PM registers to set the power state of
  *			     given PCI device
@@ -850,6 +857,10 @@ static int pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state)
 	    || (state == PCI_D2 && !dev->d2_support))
 		return -EIO;
 
+	/* check if the bus controller causes issues */
+	if (state != PCI_D0 && parent_broken_child_pm(dev))
+		return 0;
+
 	pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
 
 	/*
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 44c4ae1abd00..c2f20b745dd4 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -5268,3 +5268,55 @@ static void quirk_reset_lenovo_thinkpad_p50_nvgpu(struct pci_dev *pdev)
 DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, 0x13b1,
 			      PCI_CLASS_DISPLAY_VGA, 8,
 			      quirk_reset_lenovo_thinkpad_p50_nvgpu);
+
+/*
+ * Some Intel PCIe bridges cause devices to disappear from the PCIe bus after
+ * those were put into D3cold state if they were put into a non D0 PCI PM
+ * device state before doing so.
+ *
+ * This leads to various issues which all manifest differently,
+ * but have the same root cause:
+ *  - AML code execution hits an infinite loop (as the code waits on device
+ *    memory to change).
+ *  - kernel crashes, as all pci reads return -1, which most code isn't able
+ *    to handle well enough.
+ *  - sudden shutdowns, as the kernel identified an unrecoverable error after
+ *    userspace tries to access the GPU.
+ *
+ * In all cases dmesg will contain at least one line like this:
+ * 'nouveau :01:00.0: Refused to change power state, currently in D3'
+ * followed by a lot of nouveau timeouts.
+ *
+ * ACPI code writes bit 0x80 to the undocumented PCI register 0x248 of the
+ * PCIe bridge controller in order to power down the GPU.
+ * Nonetheless, there are other code paths inside the ACPI firmware which use
+ * other registers, which seem to work fine:
+ *  - 0xbc bit 0x20 (publicly available documentation claims 'reserved')
+ *  - 0xb0 bit 0x10 (link disable)
+ * Changing the conditions inside the firmware by poking into the relevant
+ * addresses does resolve the issue, but it seemed to be ACPI private memory
+ * and not any device accessible memory at all, so there is no portable way of
+ * changing the conditions.
+ *
+ * The only systems where this behavior can be seen are hybrid graphics laptops
+ * with a secondary Nvidia Pascal GPU. It is unclear whether this issue only
+ * occurs in combination with the listed Intel PCIe bridge controllers and the
+ * mentioned GPUs, or whether it is only a hw bug in the bridge controller.
+ *
+ * But because this issue was NOT seen on laptops with an Nvidia Pascal GPU
+ * and an Intel Coffee Lake SoC, there is a higher chance of there being a bug
+ * in the bridge controller rather than in the GPU.
+ *
+ * This issue could not be reproduced on non-laptop systems.
+ */
+
+static void quirk_broken_nv_runpm(struct pci_dev *dev)
+{
+	dev->broken_nv_runpm = 1;
+}
+DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
+			      PCI_BASE_CLASS_DISPLAY, 16,
+			      quirk_broken_nv_runpm);
+/* kaby lake */
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x1901,
+			quirk_broken_nv_runpm);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index ac8a6c4e1792..903a0b3a39ec 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -416,6 +416,7 @@ struct pci_dev {
 	unsigned int	__aer_firmware_first_valid:1;
 	unsigned int	__aer_firmware_first:1;
 	unsigned int	broken_intx_masking:1;	/* INTx masking can't be used
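For readers who want to see the effect of the two-sided flag check in isolation, the following is a rough userspace model of the decision made in the patch above. It is only a sketch: the mock_* types, the enum, and mock_skips_pm_ctrl_write() are hypothetical stand-ins invented for illustration, not kernel structures or APIs. It assumes, as the patch does, that the PCI_PM_CTRL write is skipped only when both the device and the bridge above it carry broken_nv_runpm and the target state is not D0.

```c
#include <stdbool.h>
#include <stddef.h>

/* Mock types for illustration only; these are NOT the kernel's structures. */
struct mock_pci_dev;

struct mock_pci_bus {
	struct mock_pci_dev *self;	/* bridge leading to this bus, if any */
};

struct mock_pci_dev {
	struct mock_pci_bus *bus;	/* bus the device sits on */
	bool broken_nv_runpm;		/* models the flag set by the quirk */
};

enum mock_state { MOCK_D0, MOCK_D1, MOCK_D2, MOCK_D3HOT };

/* Models parent_broken_child_pm(): device and upstream bridge must both be flagged. */
static bool mock_parent_broken_child_pm(const struct mock_pci_dev *dev)
{
	if (!dev->bus || !dev->bus->self)
		return false;
	return dev->bus->self->broken_nv_runpm && dev->broken_nv_runpm;
}

/* Returns true when the modeled PCI_PM_CTRL write would be skipped. */
static bool mock_skips_pm_ctrl_write(const struct mock_pci_dev *dev,
				     enum mock_state state)
{
	return state != MOCK_D0 && mock_parent_broken_child_pm(dev);
}
```

In this model, a flagged GPU behind a flagged bridge skips the write for D3hot but not for D0, so waking the device still goes through the normal path. A flagged GPU behind an unflagged bridge, or the bridge itself when it has no flagged parent, is not affected.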
Re: [Nouveau] nouveau kernel module will not load on old Sony Vaio laptop with 8400M GT
do you know if it used to work with older kernels? If yes, maybe a git
bisect on the kernel could help

On Wed, Oct 16, 2019 at 12:48 AM Felix Miata wrote:
>
> is there anyone here who can help with:
>
> https://bugs.freedesktop.org/show_bug.cgi?id=111853
> nouveau kernel module won't load (not available) on Sony laptop with NVIDIA G86M
> [GeForce 8400M GT] ID: 10de:0426
>
> ???
> --
> Evolution as taught in public schools is religion, not science.
>
> Team OS/2 ** Reg. Linux User #211409 ** a11y rocks!
>
> Felix Miata *** http://fm.no-ip.com/