Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
On 5/29/23 18:01, Nick Hastings wrote:
> Hi,
>
> * Nick Hastings [230529 12:51]:
> > * Mario Limonciello [230529 10:14]:
> > > On 5/28/23 19:56, Nick Hastings wrote:
> > > > Hi,
> > > >
> > > > * Mario Limonciello [230528 21:44]:
> > > > > On 5/28/23 01:49, Salvatore Bonaccorso wrote:
> > > > > > Hi Mario
> > > > > >
> > > > > > Nick Hastings reported in Debian in https://bugs.debian.org/1036530
> > > > > > lockups from his system after updating from a 6.0 based version to
> > > > > > 6.1.y.
> > > > > > > #regzbot ^introduced 24867516f06d
> > > > > >
> > > > > > he bisected the issue and tracked it down to:
> > > > > >
> > > > > > On Sun, May 28, 2023 at 10:14:51AM +0900, Nick Hastings wrote:
> > > > > > > Control: tags -1 - moreinfo
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I repeated the git bisect, and the bad commit seems to be:
> > > > > > >
> > > > > > > (git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad
> > > > > > > 24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit
> > > > > > > commit 24867516f06dabedef3be7eea0ef0846b91538bc
> > > > > > > Author: Mario Limonciello
> > > > > > > Date: Tue Aug 23 13:51:31 2022 -0500
> > > > > > >
> > > > > > > ACPI: OSI: Remove Linux-Dell-Video _OSI string
> > > > > > > This string was introduced because drivers for NVIDIA hardware
> > > > > > > had bugs supporting RTD3 in the past.
> > > > > > > Before proprietary NVIDIA driver started to support RTD3, Ubuntu had
> > > > > > > had a mechanism for switching PRIME on and off, though it had required
> > > > > > > to logout/login to make the library switch happen.
> > > > > > > When the PRIME had been off, the mechanism had unloaded the NVIDIA
> > > > > > > driver and put the device into D3cold, but the GPU had never come back
> > > > > > > to D0 again which is why ODMs used the _OSI to expose an old _DSM
> > > > > > > method to switch the power on/off.
> > > > > > > That has been fixed by commit 5775b843a619 ("PCI: Restore config space
> > > > > > > on runtime resume despite being unbound"). so vendors shouldn't be
> > > > > > > using this string to modify ASL any more.
> > > > > > > Reviewed-by: Lyude Paul
> > > > > > > Signed-off-by: Mario Limonciello
> > > > > > > Signed-off-by: Rafael J. Wysocki
> > > > > > >
> > > > > > > drivers/acpi/osi.c | 9 -
> > > > > > > 1 file changed, 9 deletions(-)
> > > > > > >
> > > > > > > This machine is a Dell with an nvidia chip so it looks like this really
> > > > > > > could be the commit that that is causing the problems.
> > > > > > > The description of the commit also seems (to my untrained eye) to be
> > > > > > > consistent with the error reported on the console when the lockup occurs:
> > > > > > >
> > > > > > > [ 58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > > > > [ 58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > > > > [ 60.083261] vfio-pci :01:00.0 Unable to change power state from D3cold to D0, device inaccessible
> > > > > > >
> > > > > > > Hopefully this is enough information for experts to resolve this.
> > > > > >
> > > > > > Does this ring some bell for you? Do you need any further information
> > > > > > from Nick?
> > > > > >
> > > > > > Regards,
> > > > > > Salvatore
> > > > >
> > > > > Have Nick try using "pcie_port_pm=off" and see if it helps the issue.
> > > >
> > > > I booted into a 6.1 kernel with this option. It has been running without
> > > > problems for 1.5 hours. Usually I would expect the lockup to have
> > > > occurred by now.
> >
> > I let this run for 3 hours without issue.
> >
> > > > > Does this happen in the latest 6.4 RC as well?
> > > >
> > > > I have compiled that kernel and will boot into it after running this one
> > > > with the pcie_port_pm=off for another hour or so.
> >
> > I'm now running 6.4.0-rc4 without seeing the problem after 1 hour.
>
> I did eventually see a lockup of this kernel. On the console I saw:
>
> [ 151.035036] vfio-pci :01:00.0 Unable to change power state from D3cold to D0, device inaccessible
>
> I did not see the other two lines that were present in earlier lock ups
>
> > I did however see two unrelated problems that I include here for
> > completeness:
> > 1. iwlwifi module did not automatically load
> > 2. Xwayland used huge amount of CPU even though was not running any X
> > programs. Recompiling my wayland compositor without XWayland support
> > "fixed" this.
> >
> > > > > I think we need to see a full dmesg and acpidump to better
> > > > > characterize it.
> > > >
> > > > Please find attached. Let me know if there is anything else I can provide.
> > > >
> > > > Regards,
> > > >
> > > > Nick.
> > >
> > > I don't see nouveau loading, are you explicitly preventing it from
> > > loading?
> >
> > Yes nouveau is blacklisted.
> > > Can I see the journal from a boot when it reproduced?
> >
> > Hmm not sure which n for "journalctl -b n" maps to which kernel (is that
> > what you are requesting?). The commit hash doesn't not seem to be listed.
> > I may have to boot into a bad kernel again.
>
> Please find attached the output from a "journalctl --system -bN" for a
> kernel that has this issue.
>
> Regards,
>
> Nick.

In this log I see nouveau loaded, but I also don't see the failure occurring.

As you're actually loading nouveau, can you please try nouveau.runpm=0 on the kernel command line?

If that helps the issue; I strongly suggest you cross reference the latest kernel to see if this bug still exists.
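Editorial sketch, not part of the original message: when checking whether a saved journal comes from a boot that reproduced the problem, it can help to pull out the kernel release and search for the two error signatures quoted earlier in this thread. The function name summarize_boot_log and the signature labels below are made up for illustration. The only assumptions about the log are that the kernel prints a "Linux version ..." line early in each boot and that the error text matches the console output quoted above.

```python
import re

# Failure signatures as quoted in this thread (labels are illustrative).
SIGNATURES = {
    "acpi_loop_timeout": re.compile(
        r"ACPI Error: Aborting method \S+ due to previous error "
        r"\(AE_AML_LOOP_TIMEOUT\)"
    ),
    "d3cold_stuck": re.compile(
        r"Unable to change power state from D3cold to D0"
    ),
}

# The kernel logs "Linux version <release> ..." near the start of a boot.
KERNEL_VERSION = re.compile(r"Linux version (\S+)")


def summarize_boot_log(text):
    """Return (kernel_release_or_None, {signature_label: [matching lines]})."""
    release = None
    hits = {label: [] for label in SIGNATURES}
    for line in text.splitlines():
        if release is None:
            match = KERNEL_VERSION.search(line)
            if match:
                release = match.group(1)
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits[label].append(line.strip())
    return release, hits
```

One way to use it is to save the kernel messages of a given boot to a file, for example with "journalctl -k -b -1 > boot.log" for the previous boot, and pass the file contents to summarize_boot_log. The reported release then shows which kernel that boot index corresponds to.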
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
Hi,

* Nick Hastings [230529 12:51]:
> * Mario Limonciello [230529 10:14]:
> > On 5/28/23 19:56, Nick Hastings wrote:
> > > Hi,
> > >
> > > * Mario Limonciello [230528 21:44]:
> > > > On 5/28/23 01:49, Salvatore Bonaccorso wrote:
> > > > > Hi Mario
> > > > >
> > > > > Nick Hastings reported in Debian in https://bugs.debian.org/1036530
> > > > > lockups from his system after updating from a 6.0 based version to
> > > > > 6.1.y.
> > > > > > #regzbot ^introduced 24867516f06d
> > > > >
> > > > > he bisected the issue and tracked it down to:
> > > > >
> > > > > On Sun, May 28, 2023 at 10:14:51AM +0900, Nick Hastings wrote:
> > > > > > Control: tags -1 - moreinfo
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I repeated the git bisect, and the bad commit seems to be:
> > > > > >
> > > > > > (git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad
> > > > > > 24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit
> > > > > > commit 24867516f06dabedef3be7eea0ef0846b91538bc
> > > > > > Author: Mario Limonciello
> > > > > > Date: Tue Aug 23 13:51:31 2022 -0500
> > > > > >
> > > > > > ACPI: OSI: Remove Linux-Dell-Video _OSI string
> > > > > > This string was introduced because drivers for NVIDIA hardware
> > > > > > had bugs supporting RTD3 in the past.
> > > > > > Before proprietary NVIDIA driver started to support RTD3, Ubuntu had
> > > > > > had a mechanism for switching PRIME on and off, though it had required
> > > > > > to logout/login to make the library switch happen.
> > > > > > When the PRIME had been off, the mechanism had unloaded the NVIDIA
> > > > > > driver and put the device into D3cold, but the GPU had never come back
> > > > > > to D0 again which is why ODMs used the _OSI to expose an old _DSM
> > > > > > method to switch the power on/off.
> > > > > > That has been fixed by commit 5775b843a619 ("PCI: Restore config space
> > > > > > on runtime resume despite being unbound").
> > > > > > so vendors shouldn't be using this string to modify ASL any more.
> > > > > > Reviewed-by: Lyude Paul
> > > > > > Signed-off-by: Mario Limonciello
> > > > > > Signed-off-by: Rafael J. Wysocki
> > > > > >
> > > > > > drivers/acpi/osi.c | 9 -
> > > > > > 1 file changed, 9 deletions(-)
> > > > > >
> > > > > > This machine is a Dell with an nvidia chip so it looks like this really
> > > > > > could be the commit that that is causing the problems. The description
> > > > > > of the commit also seems (to my untrained eye) to be consistent with the
> > > > > > error reported on the console when the lockup occurs:
> > > > > >
> > > > > > [ 58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > > > [ 58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > > > [ 60.083261] vfio-pci :01:00.0 Unable to change power state from D3cold to D0, device inaccessible
> > > > > >
> > > > > > Hopefully this is enough information for experts to resolve this.
> > > > >
> > > > > Does this ring some bell for you? Do you need any further information
> > > > > from Nick?
> > > > >
> > > > > Regards,
> > > > > Salvatore
> > > >
> > > > Have Nick try using "pcie_port_pm=off" and see if it helps the issue.
> > >
> > > I booted into a 6.1 kernel with this option. It has been running without
> > > problems for 1.5 hours. Usually I would expect the lockup to have
> > > occurred by now.
>
> I let this run for 3 hours without issue.
>
> > > > Does this happen in the latest 6.4 RC as well?
> > >
> > > I have compiled that kernel and will boot into it after running this one
> > > with the pcie_port_pm=off for another hour or so.
>
> I'm now running 6.4.0-rc4 without seeing the problem after 1 hour.

I did eventually see a lockup of this kernel.
On the console I saw:

[ 151.035036] vfio-pci :01:00.0 Unable to change power state from D3cold to D0, device inaccessible

I did not see the other two lines that were present in earlier lock ups

> I did however see two unrelated problems that I include here for
> completeness:
> 1. iwlwifi module did not automatically load
> 2. Xwayland used huge amount of CPU even though was not running any X
> programs. Recompiling my wayland compositor without XWayland support
> "fixed" this.
>
> > > > I think we need to see a full dmesg and acpidump to better
> > > > characterize it.
> > >
> > > Please find attached. Let me know if there is anything else I can provide.
> > >
> > > Regards,
> > >
> > > Nick.
> >
> > I don't see nouveau loading, are you explicitly preventing it from
> > loading?
>
> Yes nouveau is blacklisted.
Bug#1036900: Things I tried
After googling, I tried a few things:

Memory has correct timing, frequencies and voltage (no improvement)

Kernel parameters => no improvement
- idle=nomwait
- processor.max_cstate=5
- rcu_nocbs=0-11

Undervolting / Overclocking => seems to make the system a bit more stable
- Reducing PPT to 45W
- PBS Curve all cores: -10
- Boost limit: -300 (ending around 4 GHz)

Deactivate SMT => no improvement

Deactivate selective CPUs (Error always showed on CPU5) => no improvement

Deactivating tx, sg, tso offloading => no improvement

Overall it seems the system crashes during load changes, e.g. when compiling. It then takes SATA, network, etc. down, leading to an unusable system.
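Editorial sketch, not part of the original report: when a kernel parameter shows "no improvement", one thing worth ruling out is that it never reached the running kernel. The snippet below compares the three parameters listed above against the contents of /proc/cmdline. The helper names (parse_cmdline, missing_params, check_running_kernel) are made up for illustration, and the parser is deliberately simple: it splits on whitespace and does not handle quoted values that contain spaces.

```python
# Parameters taken from the list in this report.
EXPECTED = {
    "idle": "nomwait",
    "processor.max_cstate": "5",
    "rcu_nocbs": "0-11",
}


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_params(cmdline, expected):
    """Return the expected name/value pairs that are absent or set differently."""
    active = parse_cmdline(cmdline)
    return {
        name: value
        for name, value in expected.items()
        if name not in active or active[name] != value
    }


def check_running_kernel(path="/proc/cmdline"):
    """Check the command line of the currently running kernel."""
    with open(path) as handle:
        return missing_params(handle.read(), EXPECTED)
```

An empty result from check_running_kernel() means all three parameters are present with the listed values. Anything returned was either not passed to the kernel or passed with a different value.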