Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
On Thu, 2023-06-01 at 11:18 -0500, Limonciello, Mario wrote:
> +Lyude, Lukas, Karol
>
> On 5/31/2023 6:40 PM, Nick Hastings wrote:
> > Hi,
> >
> > * Nick Hastings [230530 16:01]:
> > > * Mario Limonciello [230530 13:00]:
> > > > As you're actually loading nouveau, can you please try
> > > > nouveau.runpm=0 on the kernel command line?
> > > I'm not intentionally loading it. This machine also has intel graphics
> > > which is what I prefer. Checking my
> > > /etc/modprobe.d/blacklist-nvidia-nouveau.conf
> > > I see:
> > >
> > > blacklist nvidia
> > > blacklist nvidia-drm
> > > blacklist nvidia-modeset
> > > blacklist nvidia-uvm
> > > blacklist ipmi_msghandler
> > > blacklist ipmi_devintf
> > >
> > > So I thought I had blacklisted it but it seems I did not. Since I do not
> > > want to use it maybe it is better to check if the lock up occurs with
> > > nouveau blacklisted. I will try that now.
> > I blacklisted nouveau and booted into a 6.1 kernel:
> > % uname -a
> > Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08)
> > x86_64 GNU/Linux
> >
> > It has been running without problems for nearly two days now:
> > % uptime
> > 08:34:48 up 1 day, 16:22, 2 users, load average: 1.33, 1.26, 1.27
> >
> > Regards,
> >
> > Nick.
>
> Thanks, that makes a lot more sense now.
>
> Nick, Can you please test if nouveau works with runtime PM in the
> latest 6.4-rc?
>
> If it works in 6.4-rc, there are probably nouveau commits that need
> to be backported to 6.1 LTS.
>
> If it's still broken in 6.4-rc, I believe you should file a bug:
>
> https://gitlab.freedesktop.org/drm/nouveau/
>
>
> Lyude, Lukas, Karol
>
> This thread is in relation to this commit:
>
> 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
>
> Nick has found that runtime PM is *not* working for nouveau.
>
> If you recall we did 24867516f06d because 5775b843a619 was
> supposed to have fixed it.

Gotcha, I guess keep me updated since it seems like things -might- be working
from what I gathered here? Happy to look further if they find that 6.4-rc is
broken though

--
Cheers,
Lyude Paul (she/her)
Software Engineer at Red Hat
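
The blacklist file quoted above lists the nvidia modules but not nouveau, which is why nouveau was still being loaded. A minimal sketch of that check follows, assuming the file uses plain "blacklist <module>" lines; the helper names blacklisted_modules and is_blacklisted are hypothetical, and the sample text is the list Nick posted.

```python
def blacklisted_modules(conf_text):
    """Return the module names that appear on 'blacklist' lines."""
    names = set()
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Blank lines and lines starting with '#' are ignored.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) == 2 and parts[0] == "blacklist":
            names.add(parts[1])
    return names


def is_blacklisted(module, conf_text):
    """Check one module name, treating '-' and '_' as equivalent."""
    def norm(name):
        return name.replace("-", "_")
    return norm(module) in {norm(m) for m in blacklisted_modules(conf_text)}


# Contents of the file as quoted in the thread (nouveau is absent).
CONF = """\
blacklist nvidia
blacklist nvidia-drm
blacklist nvidia-modeset
blacklist nvidia-uvm
blacklist ipmi_msghandler
blacklist ipmi_devintf
"""

print(is_blacklisted("nouveau", CONF))
print(is_blacklisted("nvidia_drm", CONF))
```

With the quoted file, the first check reports that nouveau is not blacklisted, matching what Nick found. The sketch only inspects the text it is given; it does not read /etc/modprobe.d or query the running kernel.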
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
Hi,

* Limonciello, Mario [230701 06:40]:
> > > Nevertheless: thx for your report your help through this thread.
> >
> > No problem. I am willing to try to do more, but right now I don't know
> > how to do what has been suggested.
>
> Here is where to report Nouveau bugs:
>
> https://gitlab.freedesktop.org/drm/nouveau/-/issues/

Thanks. Done:
https://gitlab.freedesktop.org/drm/nouveau/-/issues/241

Cheers,

Nick.
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
> > Nevertheless: thx for your report and your help through this thread.
>
> No problem. I am willing to try to do more, but right now I don't know
> how to do what has been suggested.

Here is where to report Nouveau bugs:

https://gitlab.freedesktop.org/drm/nouveau/-/issues/
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
Hi,

* Thorsten Leemhuis [230630 22:02]:
> On 27.06.23 00:34, Nick Hastings wrote:
> > * Linux regression tracking (Thorsten Leemhuis)
> > [230626 21:09]:
> >> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
> >> for once, to make this easily accessible to everyone.
> >>
> >> Nick, what's the status/was there any progress? Did you do what Mario
> >> suggested and file a nouveau bug?
> >
> > It was not apparent that the suggestion to open "a Nouveau drm bug" was
> > addressed to me.
>
> I wish things were easier for reporters, but from what I can see this
> is the only way forward if you or some silent bystander cares.

In principle I can open another bug report, but I don't know how or where to
report "a Nouveau drm bug".

Please keep in mind that I'm just an end user. I learnt to use git bisect
specifically because of this bug. Prior to that, I hadn't compiled a kernel in
about 15 years.

> >> I ask, as I still have this on my list of regressions and it seems there
> >> was no progress in three+ weeks now.
> >
> > I have not pursued this further since as far as I could tell I already
> > provided all requested information and I don't actually use nouveau, so
> > I blacklisted it.
>
> I doubt any developer cares enough to take a closer look[1] without a
> proper nouveau bug and some help & prodding from someone affected. And
> looks to me like reverting the culprit now might create even bigger
> problems for users.

If someone can point me to some docs about reporting nouveau bugs I can
look into it.

> Hence I guess then this won't be fixed in the end. In an ideal world this
> would not happen, but we don't live in one and all have just 24 hours in
> a day. :-/

This is a very common Dell XPS 15 7590 so I expect many people could
experience this issue. Or maybe like me they only use the intel GPU.

> Nevertheless: thx for your report and your help through this thread.

No problem. I am willing to try to do more, but right now I don't know how to
do what has been suggested.

Cheers,

Nick.

> [1] some points on the following page kinda explain this
> https://linux-regtracking.leemhuis.info/post/frequent-reasons-why-linux-kernel-bug-reports-are-ignored/
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>
> #regzbot inconclusive: reporting deadlock (see thread for details)
>
> >> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> >> --
> >> Everything you wanna know about Linux kernel regression tracking:
> >> https://linux-regtracking.leemhuis.info/about/#tldr
> >> If I did something stupid, please tell me, as explained on that page.
> >>
> >> #regzbot backburner: slow progress, likely just affects one machine
> >> #regzbot poke
> >>
> >>
> >> On 02.06.23 02:57, Limonciello, Mario wrote:
> >>> [AMD Official Use Only - General]
> >>>
> >>>> -Original Message-
> >>>> From: Nick Hastings
> >>>> Sent: Thursday, June 1, 2023 7:02 PM
> >>>> To: Karol Herbst
> >>>> Cc: Limonciello, Mario; Lyude Paul; Lukas Wunner; Salvatore Bonaccorso;
> >>>> 1036...@bugs.debian.org; Rafael J. Wysocki; Len Brown;
> >>>> linux-a...@vger.kernel.org; linux-ker...@vger.kernel.org;
> >>>> regressi...@lists.linux.dev
> >>>> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> >>>> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
> >>>>
> >>>> Hi,
> >>>>
> >>>> * Karol Herbst [230602 03:10]:
> >>>>> On Thu, Jun 1, 2023 at 7:21 PM Limonciello, Mario
> >>>>> wrote:
> >>>>>>> -Original Message-
> >>>>>>> From: Karol Herbst
> >>>>>>> Sent: Thursday, June 1, 2023 12:19 PM
> >>>>>>> To: Limonciello, Mario
> >>>>>>> Cc: Nick Hastings; Lyude Paul; Lukas Wunner; Salvatore Bonaccorso;
> >>>>>>> 1036...@bugs.debian.org; Rafael J. Wysocki; Len Brown;
> >>>>>>> linux-a...@vger.kernel.org; linux-ker..
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
On Fri, Jun 30, 2023 at 3:02 PM Thorsten Leemhuis wrote:
>
> On 27.06.23 00:34, Nick Hastings wrote:
> > * Linux regression tracking (Thorsten Leemhuis)
> > [230626 21:09]:
> >> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
> >> for once, to make this easily accessible to everyone.
> >>
> >> Nick, what's the status/was there any progress? Did you do what Mario
> >> suggested and file a nouveau bug?
> >
> > It was not apparent that the suggestion to open "a Nouveau drm bug" was
> > addressed to me.
>
> I wish things were easier for reporters, but from what I can see this
> is the only way forward if you or some silent bystander cares.
>
> >> I ask, as I still have this on my list of regressions and it seems there
> >> was no progress in three+ weeks now.
> >
> > I have not pursued this further since as far as I could tell I already
> > provided all requested information and I don't actually use nouveau, so
> > I blacklisted it.
>
> I doubt any developer cares enough to take a closer look[1] without a
> proper nouveau bug and some help & prodding from someone affected. And
> looks to me like reverting the culprit now might create even bigger
> problems for users.
>
> Hence I guess then this won't be fixed in the end. In an ideal world this
> would not happen, but we don't live in one and all have just 24 hours in
> a day. :-/
>

We recently merged this commit:
https://gitlab.freedesktop.org/drm/nouveau/-/commit/11d24327c2d7ad7f24fcc44fb00e1fa91ebf6525

It might resolve the problem. Worth testing at least, though I can't
remember whether this was a hybrid AMD/Nvidia system; I think it was?

> Nevertheless: thx for your report and your help through this thread.
>
> [1] some points on the following page kinda explain this
> https://linux-regtracking.leemhuis.info/post/frequent-reasons-why-linux-kernel-bug-reports-are-ignored/
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>
> #regzbot inconclusive: reporting deadlock (see thread for details)
>
> >> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> >> --
> >> Everything you wanna know about Linux kernel regression tracking:
> >> https://linux-regtracking.leemhuis.info/about/#tldr
> >> If I did something stupid, please tell me, as explained on that page.
> >>
> >> #regzbot backburner: slow progress, likely just affects one machine
> >> #regzbot poke
> >>
> >>
> >> On 02.06.23 02:57, Limonciello, Mario wrote:
> >>> [AMD Official Use Only - General]
> >>>
> >>>> -Original Message-
> >>>> From: Nick Hastings
> >>>> Sent: Thursday, June 1, 2023 7:02 PM
> >>>> To: Karol Herbst
> >>>> Cc: Limonciello, Mario; Lyude Paul; Lukas Wunner; Salvatore Bonaccorso;
> >>>> 1036...@bugs.debian.org; Rafael J. Wysocki; Len Brown;
> >>>> linux-a...@vger.kernel.org; linux-ker...@vger.kernel.org;
> >>>> regressi...@lists.linux.dev
> >>>> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> >>>> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
> >>>>
> >>>> Hi,
> >>>>
> >>>> * Karol Herbst [230602 03:10]:
> >>>>> On Thu, Jun 1, 2023 at 7:21 PM Limonciello, Mario
> >>>>> wrote:
> >>>>>>> -Original Message-
> >>>>>>> From: Karol Herbst
> >>>>>>> Sent: Thursday, June 1, 2023 12:19 PM
> >>>>>>> To: Limonciello, Mario
> >>>>>>> Cc: Nick Hastings; Lyude Paul; Lukas Wunner; Salvatore Bonaccorso;
> >>>>>>> 1036...@bugs.debian.org; Rafael J. Wysocki; Len Brown;
> >>>>>>> linux-a...@vger.kernel.org; linux-ker...@vger.kernel.org;
> >>>>>>> regressi...@lists.linux.dev
> >>>>>>> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> >>>>>>> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
On 27.06.23 00:34, Nick Hastings wrote:
> * Linux regression tracking (Thorsten Leemhuis)
> [230626 21:09]:
>> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
>> for once, to make this easily accessible to everyone.
>>
>> Nick, what's the status/was there any progress? Did you do what Mario
>> suggested and file a nouveau bug?
>
> It was not apparent that the suggestion to open "a Nouveau drm bug" was
> addressed to me.

I wish things were easier for reporters, but from what I can see this
is the only way forward if you or some silent bystander cares.

>> I ask, as I still have this on my list of regressions and it seems there
>> was no progress in three+ weeks now.
>
> I have not pursued this further since as far as I could tell I already
> provided all requested information and I don't actually use nouveau, so
> I blacklisted it.

I doubt any developer cares enough to take a closer look[1] without a
proper nouveau bug and some help & prodding from someone affected. And
looks to me like reverting the culprit now might create even bigger
problems for users.

Hence I guess then this won't be fixed in the end. In an ideal world this
would not happen, but we don't live in one and all have just 24 hours in
a day. :-/

Nevertheless: thx for your report and your help through this thread.

[1] some points on the following page kinda explain this
https://linux-regtracking.leemhuis.info/post/frequent-reasons-why-linux-kernel-bug-reports-are-ignored/

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot inconclusive: reporting deadlock (see thread for details)

>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>> --
>> Everything you wanna know about Linux kernel regression tracking:
>> https://linux-regtracking.leemhuis.info/about/#tldr
>> If I did something stupid, please tell me, as explained on that page.
>>
>> #regzbot backburner: slow progress, likely just affects one machine
>> #regzbot poke
>>
>>
>> On 02.06.23 02:57, Limonciello, Mario wrote:
>>> [AMD Official Use Only - General]
>>>
>>>> -Original Message-
>>>> From: Nick Hastings
>>>> Sent: Thursday, June 1, 2023 7:02 PM
>>>> To: Karol Herbst
>>>> Cc: Limonciello, Mario; Lyude Paul; Lukas Wunner; Salvatore Bonaccorso;
>>>> 1036...@bugs.debian.org; Rafael J. Wysocki; Len Brown;
>>>> linux-a...@vger.kernel.org; linux-ker...@vger.kernel.org;
>>>> regressi...@lists.linux.dev
>>>> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
>>>> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
>>>>
>>>> Hi,
>>>>
>>>> * Karol Herbst [230602 03:10]:
>>>>> On Thu, Jun 1, 2023 at 7:21 PM Limonciello, Mario
>>>>> wrote:
>>>>>>> -Original Message-
>>>>>>> From: Karol Herbst
>>>>>>> Sent: Thursday, June 1, 2023 12:19 PM
>>>>>>> To: Limonciello, Mario
>>>>>>> Cc: Nick Hastings; Lyude Paul; Lukas Wunner; Salvatore Bonaccorso;
>>>>>>> 1036...@bugs.debian.org; Rafael J. Wysocki; Len Brown;
>>>>>>> linux-a...@vger.kernel.org; linux-ker...@vger.kernel.org;
>>>>>>> regressi...@lists.linux.dev
>>>>>>> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
>>>>>>> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
>>>>>>>
>>>>>>> On Thu, Jun 1, 2023 at 6:54 PM Limonciello, Mario
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> [AMD Official Use Only - General]
>>>>>>>>
>>>>>>>>> -Original Message-
>>>>>>>>> From: Karol Herbst
>>>>>>>>> Sent: Thursday, June 1, 2023 11:33 AM
>>>>>>>>> To: Limonciello, Mario
>>>>>>>>> Cc: Nick Hastings; Lyude Paul; Lukas Wunner; Salvatore Bonaccorso;
>>>>>>>>> 1036...@bugs.debian.org; Rafael J. Wy
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
Hi Thorsten,

* Linux regression tracking (Thorsten Leemhuis) [230626 21:09]:
> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
> for once, to make this easily accessible to everyone.
>
> Nick, what's the status/was there any progress? Did you do what Mario
> suggested and file a nouveau bug?

It was not apparent that the suggestion to open "a Nouveau drm bug" was
addressed to me.

> I ask, as I still have this on my list of regressions and it seems there
> was no progress in three+ weeks now.

I have not pursued this further since as far as I could tell I already
provided all requested information and I don't actually use nouveau, so
I blacklisted it.

Regards,

Nick.

> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>
> #regzbot backburner: slow progress, likely just affects one machine
> #regzbot poke
>
>
> On 02.06.23 02:57, Limonciello, Mario wrote:
> > [AMD Official Use Only - General]
> >
> >> -Original Message-
> >> From: Nick Hastings
> >> Sent: Thursday, June 1, 2023 7:02 PM
> >> To: Karol Herbst
> >> Cc: Limonciello, Mario; Lyude Paul; Lukas Wunner; Salvatore Bonaccorso;
> >> 1036...@bugs.debian.org; Rafael J. Wysocki; Len Brown;
> >> linux-a...@vger.kernel.org; linux-ker...@vger.kernel.org;
> >> regressi...@lists.linux.dev
> >> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> >> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
> >>
> >> Hi,
> >>
> >> * Karol Herbst [230602 03:10]:
> >>> On Thu, Jun 1, 2023 at 7:21 PM Limonciello, Mario
> >>> wrote:
> >>>>> -Original Message-
> >>>>> From: Karol Herbst
> >>>>> Sent: Thursday, June 1, 2023 12:19 PM
> >>>>> To: Limonciello, Mario
> >>>>> Cc: Nick Hastings; Lyude Paul; Lukas Wunner; Salvatore Bonaccorso;
> >>>>> 1036...@bugs.debian.org; Rafael J. Wysocki; Len Brown;
> >>>>> linux-a...@vger.kernel.org; linux-ker...@vger.kernel.org;
> >>>>> regressi...@lists.linux.dev
> >>>>> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> >>>>> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
> >>>>>
> >>>>> On Thu, Jun 1, 2023 at 6:54 PM Limonciello, Mario
> >>>>> wrote:
> >>>>>>
> >>>>>> [AMD Official Use Only - General]
> >>>>>>
> >>>>>>> -Original Message-
> >>>>>>> From: Karol Herbst
> >>>>>>> Sent: Thursday, June 1, 2023 11:33 AM
> >>>>>>> To: Limonciello, Mario
> >>>>>>> Cc: Nick Hastings; Lyude Paul; Lukas Wunner; Salvatore Bonaccorso;
> >>>>>>> 1036...@bugs.debian.org; Rafael J. Wysocki; Len Brown;
> >>>>>>> linux-a...@vger.kernel.org; linux-ker...@vger.kernel.org;
> >>>>>>> regressi...@lists.linux.dev
> >>>>>>> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> >>>>>>> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
> >>>>>>>
> >>>>>>> On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario
> >>>>>>>>
> >>>>>>>> Lyude, Lukas, Karol
> >>>>>>>>
> >>>>>>>> This thread is in relation to this commit:
> >>>>>>>>
> >>>>>>>> 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
> >>>>>>>>
> >>>>>>>> Nick has found that runtime PM is *not* working for nouveau.
> >>>>>>>>
> >>>>>>>
> >>>>>>> keep in mind we have a list of PCIe controllers where we apply a
> >>>>>>> workaround:
> >>>>>>>
> >>>>>>> https:
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
for once, to make this easily accessible to everyone.

Nick, what's the status/was there any progress? Did you do what Mario
suggested and file a nouveau bug?

I ask, as I still have this on my list of regressions and it seems there
was no progress in three+ weeks now.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot backburner: slow progress, likely just affects one machine
#regzbot poke

On 02.06.23 02:57, Limonciello, Mario wrote:
> [AMD Official Use Only - General]
>
>> -Original Message-
>> From: Nick Hastings
>> Sent: Thursday, June 1, 2023 7:02 PM
>> To: Karol Herbst
>> Cc: Limonciello, Mario; Lyude Paul; Lukas Wunner; Salvatore Bonaccorso;
>> 1036...@bugs.debian.org; Rafael J. Wysocki; Len Brown;
>> linux-a...@vger.kernel.org; linux-ker...@vger.kernel.org;
>> regressi...@lists.linux.dev
>> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
>> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
>>
>> Hi,
>>
>> * Karol Herbst [230602 03:10]:
>>> On Thu, Jun 1, 2023 at 7:21 PM Limonciello, Mario
>>> wrote:
>>>>> -Original Message-
>>>>> From: Karol Herbst
>>>>> Sent: Thursday, June 1, 2023 12:19 PM
>>>>> To: Limonciello, Mario
>>>>> Cc: Nick Hastings; Lyude Paul; Lukas Wunner; Salvatore Bonaccorso;
>>>>> 1036...@bugs.debian.org; Rafael J. Wysocki; Len Brown;
>>>>> linux-a...@vger.kernel.org; linux-ker...@vger.kernel.org;
>>>>> regressi...@lists.linux.dev
>>>>> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
>>>>> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
>>>>>
>>>>> On Thu, Jun 1, 2023 at 6:54 PM Limonciello, Mario
>>>>> wrote:
>>>>>>
>>>>>> [AMD Official Use Only - General]
>>>>>>
>>>>>>> -Original Message-
>>>>>>> From: Karol Herbst
>>>>>>> Sent: Thursday, June 1, 2023 11:33 AM
>>>>>>> To: Limonciello, Mario
>>>>>>> Cc: Nick Hastings; Lyude Paul; Lukas Wunner; Salvatore Bonaccorso;
>>>>>>> 1036...@bugs.debian.org; Rafael J. Wysocki; Len Brown;
>>>>>>> linux-a...@vger.kernel.org; linux-ker...@vger.kernel.org;
>>>>>>> regressi...@lists.linux.dev
>>>>>>> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
>>>>>>> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
>>>>>>>
>>>>>>> On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario
>>>>>>>>
>>>>>>>> Lyude, Lukas, Karol
>>>>>>>>
>>>>>>>> This thread is in relation to this commit:
>>>>>>>>
>>>>>>>> 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
>>>>>>>>
>>>>>>>> Nick has found that runtime PM is *not* working for nouveau.
>>>>>>>>
>>>>>>>
>>>>>>> keep in mind we have a list of PCIe controllers where we apply a
>>>>>>> workaround:
>>>>>>>
>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682
>>>>>>>
>>>>>>> And I suspect there might be one or two more IDs we'll have to add
>>>>>>> there. Do we have any logs?
>>>>>>
>>>>>> There's some archived onto the distro bug. Search this page for
>>>>>> "journalctl.log.gz"
>>>>>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036530
>>>>>>
>>>>>
>>>>> interesting.. It seems to be the same controller used here. I wonder
>>>>> if the pci topology is different or if the workaround is applie
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
Lines #276 - #298 of dmesg, attached to Message #62:

[ 0.066966] smpboot: CPU0: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz (family: 0x6, model: 0x9e, stepping: 0xd)
[ 0.066966] cblist_init_generic: Setting adjustable number of callback queues.
[ 0.066966] cblist_init_generic: Setting shift to 4 and lim to 1.
[ 0.066966] cblist_init_generic: Setting shift to 4 and lim to 1.
[ 0.066966] cblist_init_generic: Setting shift to 4 and lim to 1.
[ 0.066966] Performance Events: PEBS fmt3+, Skylake events, 32-deep LBR, full-width counters, Intel PMU driver.
[ 0.066966] ... version: 4
[ 0.066966] ... bit width: 48
[ 0.066966] ... generic registers: 4
[ 0.066966] ... value mask:
[ 0.066966] ... max period: 7fff
[ 0.066966] ... fixed-purpose events: 3
[ 0.066966] ... event mask: 0007000f
[ 0.066966] Estimated ratio of average max frequency by base frequency (times 1024): 2005
[ 0.066966] rcu: Hierarchical SRCU implementation.
[ 0.066966] rcu: Max phase no-delay instances is 1000.
[ 0.066966] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[ 0.066966] smp: Bringing up secondary CPUs ...
[ 0.066966] x86: Booting SMP configuration:
[ 0.066966] node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8
[ 0.077241] MMIO Stale Data CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/processor_mmio_stale_data.html for more details.
[ 0.077241] #9 #10 #11 #12 #13 #14 #15
[ 0.088667] smp: Brought up 1 node, 16 CPUs

Compare to the lspci -tvnn output posted in Message #133:

lspci -tvnn
-[:00]-+-00.0 Intel Corporation Device [8086:3e20]
       +-01.0-[01]00.0 NVIDIA Corporation TU117M [GeForce GTX 1650 Mobile / Max-Q] [10de:1f91]
       +-02.0 Intel Corporation CoffeeLake-H GT2 [UHD Graphics 630] [8086:3e9b]
       +-04.0 Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem [8086:1903]
       +-08.0 Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model [8086:1911]
       +-12.0 Intel Corporation Cannon Lake PCH Thermal Controller [8086:a379]
       +-14.0 Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller [8086:a36d]
       +-14.2 Intel Corporation Cannon Lake PCH Shared SRAM [8086:a36f]
       +-15.0 Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 [8086:a368]
       +-15.1 Intel Corporation Cannon Lake PCH Serial IO I2C Controller #1 [8086:a369]
       +-16.0 Intel Corporation Cannon Lake PCH HECI Controller [8086:a360]
       +-17.0 Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller [8086:a353]
       +-1b.0-[02-3a]00.0-[03-3a]--+-00.0-[04]00.0 Intel Corporation JHL6340 Thunderbolt 3 NHI (C step) [Alpine Ridge 2C 2016] [8086:15d9]
       |                           +-01.0-[05-39]--
       |                           \-02.0-[3a]00.0 Intel Corporation JHL6340 Thunderbolt 3 USB 3.1 Controller (C step) [Alpine Ridge 2C 2016] [8086:15db]
       +-1c.0-[3b]00.0 Intel Corporation Wi-Fi 6 AX200 [8086:2723]
       +-1c.4-[3c]00.0 Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader [10ec:525a]
       +-1d.0-[3d]00.0 Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
       +-1f.0 Intel Corporation Cannon Lake LPC Controller [8086:a30e]
       +-1f.3 Intel Corporation Cannon Lake PCH cAVS [8086:a348]
       +-1f.4 Intel Corporation Cannon Lake PCH SMBus Controller [8086:a323]
       \-1f.5 Intel Corporation Cannon Lake PCH SPI Controller [8086:a324]

I hope this is not noise!

much gratitude
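
To read the bridge-to-GPU relationship out of an `lspci -tvnn` tree like the one above, a small parser can help. The following is a rough sketch only: the names BRIDGE_LINE and parent_slot_of_vendor are hypothetical, the pattern handles only tree lines where a bridge leads directly to a single device, and the sample lines are copied from the listing in this message.

```python
import re

# Matches a tree line where a bridge leads directly to one device, e.g.
#   +-01.0-[01]00.0 NAME [vvvv:dddd]
# The number of dashes after the bus bracket varies, so "-*" accepts any.
BRIDGE_LINE = re.compile(
    r"[+\\]-(?P<slot>[0-9a-f]{2}\.[0-7])"              # bridge slot.function
    r"-\[(?P<bus>[0-9a-f]{2})(?:-[0-9a-f]{2})?\]-*"    # secondary bus (or range)
    r"(?P<child>[0-9a-f]{2}\.[0-7])\s+(?P<name>.*?)\s*"
    r"\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]\s*$"
)


def parent_slot_of_vendor(tree_text, vendor_id):
    """Return (bridge slot, device id) of the first device from vendor_id
    that sits directly behind a bridge, or None if there is none."""
    for line in tree_text.splitlines():
        match = BRIDGE_LINE.search(line)
        if match and match.group("vendor") == vendor_id:
            return match.group("slot"), match.group("device")
    return None


# A few lines taken from the listing in this message.
SAMPLE = """\
-[:00]-+-00.0 Intel Corporation Device [8086:3e20]
       +-01.0-[01]00.0 NVIDIA Corporation TU117M [GeForce GTX 1650 Mobile / Max-Q] [10de:1f91]
       +-02.0 Intel Corporation CoffeeLake-H GT2 [UHD Graphics 630] [8086:3e9b]
       +-1c.0-[3b]00.0 Intel Corporation Wi-Fi 6 AX200 [8086:2723]
"""

# 10de is the NVIDIA vendor ID shown in the listing.
print(parent_slot_of_vendor(SAMPLE, "10de"))
```

For the sample, the NVIDIA device is reported behind the bridge at slot 01.0, which is the root port the thread is discussing. Nested topologies such as the Thunderbolt subtree would need a real tree parser rather than a per-line pattern.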
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
[AMD Official Use Only - General]

> -Original Message-
> From: Nick Hastings
> Sent: Thursday, June 1, 2023 7:02 PM
> To: Karol Herbst
> Cc: Limonciello, Mario; Lyude Paul; Lukas Wunner; Salvatore Bonaccorso;
> 1036...@bugs.debian.org; Rafael J. Wysocki; Len Brown;
> linux-a...@vger.kernel.org; linux-ker...@vger.kernel.org;
> regressi...@lists.linux.dev
> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
>
> Hi,
>
> * Karol Herbst [230602 03:10]:
> > On Thu, Jun 1, 2023 at 7:21 PM Limonciello, Mario
> > wrote:
> > > > -Original Message-
> > > > From: Karol Herbst
> > > > Sent: Thursday, June 1, 2023 12:19 PM
> > > > To: Limonciello, Mario
> > > > Cc: Nick Hastings; Lyude Paul; Lukas Wunner; Salvatore Bonaccorso;
> > > > 1036...@bugs.debian.org; Rafael J. Wysocki; Len Brown;
> > > > linux-a...@vger.kernel.org; linux-ker...@vger.kernel.org;
> > > > regressi...@lists.linux.dev
> > > > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> > > > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
> > > >
> > > > On Thu, Jun 1, 2023 at 6:54 PM Limonciello, Mario
> > > > wrote:
> > > > >
> > > > > [AMD Official Use Only - General]
> > > > >
> > > > > > -Original Message-
> > > > > > From: Karol Herbst
> > > > > > Sent: Thursday, June 1, 2023 11:33 AM
> > > > > > To: Limonciello, Mario
> > > > > > Cc: Nick Hastings; Lyude Paul; Lukas Wunner; Salvatore Bonaccorso;
> > > > > > 1036...@bugs.debian.org; Rafael J. Wysocki; Len Brown;
> > > > > > linux-a...@vger.kernel.org; linux-ker...@vger.kernel.org;
> > > > > > regressi...@lists.linux.dev
> > > > > > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> > > > > > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
> > > > > >
> > > > > > On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario
> > > > > > >
> > > > > > > Lyude, Lukas, Karol
> > > > > > >
> > > > > > > This thread is in relation to this commit:
> > > > > > >
> > > > > > > 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
> > > > > > >
> > > > > > > Nick has found that runtime PM is *not* working for nouveau.
> > > > > > >
> > > > > >
> > > > > > keep in mind we have a list of PCIe controllers where we apply a
> > > > > > workaround:
> > > > > >
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682
> > > > > >
> > > > > > And I suspect there might be one or two more IDs we'll have to add
> > > > > > there. Do we have any logs?
> > > > >
> > > > > There's some archived onto the distro bug. Search this page for
> > > > > "journalctl.log.gz"
> > > > > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036530
> > > > >
> > > >
> > > > interesting.. It seems to be the same controller used here. I wonder
> > > > if the pci topology is different or if the workaround is applied at
> > > > all.
> > >
> > > I didn't see the message in the log about the workaround being applied
> > > in that log, so I guess PCI topology difference is a likely suspect.
> > >
> >
> > yeah, but I also couldn't see a log with the usual nouveau messages,
> > so it's kinda weird.
> >
> > Anyway, the output of `lspci -tvnn` would help
>
> % lspci -tvnn
> -[:00]-+-00.0 Intel Corporation Device [8086:3e20]
>        +-01.0-[01]00.0 NVIDIA Corporation TU117M [GeForce GTX 1650 Mobile / Max-Q] [10de:1f91]

So the bridge it's connected to is the same that the quirk *should have been*
triggering.

May 29 15:02:42 xps kernel: pci :00:01.0: [8086:1901] type 01 class 0x060400

Since the quirk isn't wor
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
Hi, * Karol Herbst [230602 03:10]: > On Thu, Jun 1, 2023 at 7:21 PM Limonciello, Mario > wrote: > > > -Original Message- > > > From: Karol Herbst > > > Sent: Thursday, June 1, 2023 12:19 PM > > > To: Limonciello, Mario > > > Cc: Nick Hastings ; Lyude Paul > > > ; Lukas Wunner ; Salvatore > > > Bonaccorso ; 1036...@bugs.debian.org; Rafael J. > > > Wysocki ; Len Brown ; linux- > > > a...@vger.kernel.org; linux-ker...@vger.kernel.org; > > > regressi...@lists.linux.dev > > > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI > > > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of > > > system) > > > > > > On Thu, Jun 1, 2023 at 6:54 PM Limonciello, Mario > > > wrote: > > > > > > > > [AMD Official Use Only - General] > > > > > > > > > -Original Message- > > > > > From: Karol Herbst > > > > > Sent: Thursday, June 1, 2023 11:33 AM > > > > > To: Limonciello, Mario > > > > > Cc: Nick Hastings ; Lyude Paul > > > > > ; Lukas Wunner ; Salvatore > > > > > Bonaccorso ; 1036...@bugs.debian.org; Rafael J. > > > > > Wysocki ; Len Brown ; linux- > > > > > a...@vger.kernel.org; linux-ker...@vger.kernel.org; > > > > > regressi...@lists.linux.dev > > > > > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI > > > > > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of > > > system) > > > > > > > > > > On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario > > > > > > > > > > > > Lyude, Lukas, Karol > > > > > > > > > > > > This thread is in relation to this commit: > > > > > > > > > > > > 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string") > > > > > > > > > > > > Nick has found that runtime PM is *not* working for nouveau. 
> > > > > >
> > > > >
> > > > > keep in mind we have a list of PCIe controllers where we apply a
> > > > > workaround:
> > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers
> > > > > /gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682
> > > > >
> > > > > And I suspect there might be one or two more IDs we'll have to add
> > > > > there. Do we have any logs?
> > > >
> > > > There's some archived onto the distro bug. Search this page for
> > > "journalctl.log.gz"
> > > > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036530
> > > >
> > >
> > > interesting.. It seems to be the same controller used here. I wonder
> > > if the pci topology is different or if the workaround is applied at
> > > all.
> >
> > I didn't see the message in the log about the workaround being applied
> > in that log, so I guess PCI topology difference is a likely suspect.
> >
>
> yeah, but I also couldn't see a log with the usual nouveau messages,
> so it's kinda weird.
>
> Anyway, the output of `lspci -tvnn` would help

% lspci -tvnn
-[:00]-+-00.0 Intel Corporation Device [8086:3e20]
       +-01.0-[01]00.0 NVIDIA Corporation TU117M [GeForce GTX 1650 Mobile / Max-Q] [10de:1f91]
       +-02.0 Intel Corporation CoffeeLake-H GT2 [UHD Graphics 630] [8086:3e9b]
       +-04.0 Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem [8086:1903]
       +-08.0 Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model [8086:1911]
       +-12.0 Intel Corporation Cannon Lake PCH Thermal Controller [8086:a379]
       +-14.0 Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller [8086:a36d]
       +-14.2 Intel Corporation Cannon Lake PCH Shared SRAM [8086:a36f]
       +-15.0 Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 [8086:a368]
       +-15.1 Intel Corporation Cannon Lake PCH Serial IO I2C Controller #1 [8086:a369]
       +-16.0 Intel Corporation Cannon Lake PCH HECI Controller [8086:a360]
       +-17.0 Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller [8086:a353]
       +-1b.0-[02-3a]00.0-[03-3a]--+-00.0-[04]00.0 Intel Corporation JHL6340 Thunderbolt 3 NHI (C step) [Alpine Ridge 2C 2016] [8086:15d9]
       |                           +-01.0-[05-39]--
       |                           \-02.0-[3a]00.0 Intel Corporation JHL6340 Thunderbolt 3 USB 3.1 Controller (C step) [Alpine Ridge 2C 2016] [8086:15db]
       +-1c.0-[3b]00.0 Intel Corporation Wi-Fi 6 AX200 [8086:2723]
       +-1c.4-[3c]00.0 Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader [10ec:525a]
       +-1d.0-[3d]00.0 Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
       +-1f.0 Intel Corporation Cannon Lake LPC Controller [8086:a30e]
       +-1f.3 Intel Corporation Cannon Lake PCH cAVS [8086:a348]
       +-1f.4 Intel Corporation Cannon Lake PCH SMBus Controller [8086:a323]
       \-1f.5 Intel Corporation Cannon Lake PCH SPI Controller [8086:a324]

Regards,

Nick.
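For readers who want to answer the topology question mechanically, the following is a minimal sketch (not something posted in the thread) that scans `lspci -tvnn` text for a given vendor:device pair and reports the slot of the bridge it hangs off, plus the secondary bus. The names `LSPCI_TREE`, `BRIDGE_LINE` and `find_upstream_slot` are hypothetical, the sample is an abridged copy of the output above, and the regular expression assumes only the line layout seen in this one report.

```python
import re

# Abridged excerpt of the `lspci -tvnn` output quoted in this thread.
LSPCI_TREE = """\
-[:00]-+-00.0 Intel Corporation Device [8086:3e20]
       +-01.0-[01]00.0 NVIDIA Corporation TU117M [GeForce GTX 1650 Mobile / Max-Q] [10de:1f91]
       +-02.0 Intel Corporation CoffeeLake-H GT2 [UHD Graphics 630] [8086:3e9b]
       +-1c.0-[3b]00.0 Intel Corporation Wi-Fi 6 AX200 [8086:2723]
"""

# Assumed layout: "+-SLOT-[BUS]child ... [vendor:device]" for a device that
# sits directly behind a bridge on the root bus.
BRIDGE_LINE = re.compile(
    r"[+\\]-(?P<slot>[0-9a-f]{2}\.[0-7])"
    r"-\[(?P<bus>[0-9a-f]{2}(?:-[0-9a-f]{2})?)\]"
    r".*\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]\s*$"
)


def find_upstream_slot(tree, vendor, device):
    """Return (bridge slot, bus) for the device, or None if it is not behind a bridge."""
    for line in tree.splitlines():
        match = BRIDGE_LINE.search(line)
        if match and (match["vendor"], match["device"]) == (vendor, device):
            return match["slot"], match["bus"]
    return None


if __name__ == "__main__":
    # The NVIDIA GPU from this report.
    print(find_upstream_slot(LSPCI_TREE, "10de", "1f91"))
```

The sketch only handles devices one level below the root bus, which is enough for the GPU in this report; nested bridges such as the Thunderbolt subtree would need a real tree parser.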
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
Hi,

* Limonciello, Mario [230602 01:18]:
> +Lyude, Lukas, Karol
>
> On 5/31/2023 6:40 PM, Nick Hastings wrote:
> >
> > * Nick Hastings [230530 16:01]:
> > > * Mario Limonciello [230530 13:00]:
> >
> > > > As you're actually loading nouveau, can you please try nouveau.runpm=0
> > > > on
> > > > the kernel command line?
> > > I'm not intentionally loading it. This machine also has intel graphics
> > > which is what I prefer. Checking my
> > > /etc/modprobe.d/blacklist-nvidia-nouveau.conf
> > > I see:
> > >
> > > blacklist nvidia
> > > blacklist nvidia-drm
> > > blacklist nvidia-modeset
> > > blacklist nvidia-uvm
> > > blacklist ipmi_msghandler
> > > blacklist ipmi_devintf
> > >
> > > So I thought I had blacklisted it but it seems I did not. Since I do not
> > > want to use it maybe it is better to check if the lock up occurs with
> > > nouveau blacklisted. I will try that now.
> > I blacklisted nouveau and booted into a 6.1 kernel:
> > % uname -a
> > Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08)
> > x86_64 GNU/Linux
> >
> > It has been running without problems for nearly two days now:
> > % uptime
> > 08:34:48 up 1 day, 16:22, 2 users, load average: 1.33, 1.26, 1.27
> >
> > Regards,
> >
> > Nick.
>
> Thanks, that makes a lot more sense now.
>
> Nick, Can you please test if nouveau works with runtime PM in the
> latest 6.4-rc?

I reported this twice already. I guess it was lost since for some reason
emails in this thread are not being trimmed. I'll repeat here:

I did eventually see a lockup of this kernel. On the console I saw:

[ 151.035036] vfio-pci :01:00.0 Unable to change power state from D3cold to D0, device inaccessible

I did not see the other two lines that were present in earlier lock ups.

Regards,

Nick.
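The difference between the 6.1 lockups (three console lines) and the 6.4-rc4 lockup (one line) can be checked against saved logs with a small script. This is only a sketch: the signature names and the `signatures_seen` helper are hypothetical, and the patterns are taken verbatim from the messages quoted in this thread rather than from any kernel documentation.

```python
import re

# Console messages reported in this thread, keyed by a made-up label.
SIGNATURES = {
    "acpi_loop_timeout": re.compile(
        r"ACPI Error: Aborting method \S+ due to previous error "
        r"\(AE_AML_LOOP_TIMEOUT\)"
    ),
    "d3cold_resume_failed": re.compile(
        r"Unable to change power state from D3cold to D0, device inaccessible"
    ),
}

# Lines as reported for the earlier lockups on 6.1.
EARLIER_LOCKUP = r"""
[ 58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
[ 58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
[ 60.083261] vfio-pci :01:00.0 Unable to change power state from D3cold to D0, device inaccessible
"""

# The single line reported for the lockup on 6.4.0-rc4.
RC4_LOCKUP = r"""
[ 151.035036] vfio-pci :01:00.0 Unable to change power state from D3cold to D0, device inaccessible
"""


def signatures_seen(log_text):
    """Return the labels of every known signature present in the log text."""
    return {name for name, pattern in SIGNATURES.items() if pattern.search(log_text)}


if __name__ == "__main__":
    print(sorted(signatures_seen(EARLIER_LOCKUP)))
    print(sorted(signatures_seen(RC4_LOCKUP)))
```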
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
On Thu, Jun 1, 2023 at 7:21 PM Limonciello, Mario wrote: > > [AMD Official Use Only - General] > > > -Original Message- > > From: Karol Herbst > > Sent: Thursday, June 1, 2023 12:19 PM > > To: Limonciello, Mario > > Cc: Nick Hastings ; Lyude Paul > > ; Lukas Wunner ; Salvatore > > Bonaccorso ; 1036...@bugs.debian.org; Rafael J. > > Wysocki ; Len Brown ; linux- > > a...@vger.kernel.org; linux-ker...@vger.kernel.org; > > regressi...@lists.linux.dev > > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI > > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system) > > > > On Thu, Jun 1, 2023 at 6:54 PM Limonciello, Mario > > wrote: > > > > > > [AMD Official Use Only - General] > > > > > > > -Original Message- > > > > From: Karol Herbst > > > > Sent: Thursday, June 1, 2023 11:33 AM > > > > To: Limonciello, Mario > > > > Cc: Nick Hastings ; Lyude Paul > > > > ; Lukas Wunner ; Salvatore > > > > Bonaccorso ; 1036...@bugs.debian.org; Rafael J. > > > > Wysocki ; Len Brown ; linux- > > > > a...@vger.kernel.org; linux-ker...@vger.kernel.org; > > > > regressi...@lists.linux.dev > > > > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI > > > > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of > > system) > > > > > > > > On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario > > > > wrote: > > > > > > > > > > +Lyude, Lukas, Karol > > > > > > > > > > On 5/31/2023 6:40 PM, Nick Hastings wrote: > > > > > > Hi, > > > > > > > > > > > > * Nick Hastings [230530 16:01]: > > > > > >> * Mario Limonciello [230530 13:00]: > > > > > > > > > > > >>> As you're actually loading nouveau, can you please try > > > > nouveau.runpm=0 on > > > > > >>> the kernel command line? > > > > > >> I'm not intentionally loading it. This machine also has intel > > > > > >> graphics > > > > > >> which is what I prefer. 
Checking my > > > > > >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf > > > > > >> I see: > > > > > >> > > > > > >> blacklist nvidia > > > > > >> blacklist nvidia-drm > > > > > >> blacklist nvidia-modeset > > > > > >> blacklist nvidia-uvm > > > > > >> blacklist ipmi_msghandler > > > > > >> blacklist ipmi_devintf > > > > > >> > > > > > >> So I thought I had blacklisted it but it seems I did not. Since I > > > > > >> do not > > > > > >> want to use it maybe it is better to check if the lock up occurs > > > > > >> with > > > > > >> nouveau blacklisted. I will try that now. > > > > > > I blacklisted nouveau and booted into a 6.1 kernel: > > > > > > % uname -a > > > > > > Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 > > > > (2023-05-08) x86_64 GNU/Linux > > > > > > > > > > > > It has been running without problems for nearly two days now: > > > > > > % uptime > > > > > > 08:34:48 up 1 day, 16:22, 2 users, load average: 1.33, 1.26, > > > > > > 1.27 > > > > > > > > > > > > Regards, > > > > > > > > > > > > Nick. > > > > > > > > > > Thanks, that makes a lot more sense now. > > > > > > > > > > Nick, Can you please test if nouveau works with runtime PM in the > > > > > latest 6.4-rc? > > > > > > > > > > If it works in 6.4-rc, there are probably nouveau commits that need > > > > > to be backported to 6.1 LTS. > > > > > > > > > > If it's still broken in 6.4-rc, I believe you should file a bug: > > > > > > > > > > https://gitlab.freedesktop.org/drm/nouveau/ > > > > > > > > > > > > > > > Lyude, Lukas, Karol > > > > > > > > > > This thread is in relation to this commit: > > > > > > > > > > 24867516f06d ("ACPI: OSI: Remo
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
[AMD Official Use Only - General] > -Original Message- > From: Karol Herbst > Sent: Thursday, June 1, 2023 12:19 PM > To: Limonciello, Mario > Cc: Nick Hastings ; Lyude Paul > ; Lukas Wunner ; Salvatore > Bonaccorso ; 1036...@bugs.debian.org; Rafael J. > Wysocki ; Len Brown ; linux- > a...@vger.kernel.org; linux-ker...@vger.kernel.org; > regressi...@lists.linux.dev > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system) > > On Thu, Jun 1, 2023 at 6:54 PM Limonciello, Mario > wrote: > > > > [AMD Official Use Only - General] > > > > > -Original Message- > > > From: Karol Herbst > > > Sent: Thursday, June 1, 2023 11:33 AM > > > To: Limonciello, Mario > > > Cc: Nick Hastings ; Lyude Paul > > > ; Lukas Wunner ; Salvatore > > > Bonaccorso ; 1036...@bugs.debian.org; Rafael J. > > > Wysocki ; Len Brown ; linux- > > > a...@vger.kernel.org; linux-ker...@vger.kernel.org; > > > regressi...@lists.linux.dev > > > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI > > > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of > system) > > > > > > On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario > > > wrote: > > > > > > > > +Lyude, Lukas, Karol > > > > > > > > On 5/31/2023 6:40 PM, Nick Hastings wrote: > > > > > Hi, > > > > > > > > > > * Nick Hastings [230530 16:01]: > > > > >> * Mario Limonciello [230530 13:00]: > > > > > > > > > >>> As you're actually loading nouveau, can you please try > > > nouveau.runpm=0 on > > > > >>> the kernel command line? > > > > >> I'm not intentionally loading it. This machine also has intel > > > > >> graphics > > > > >> which is what I prefer. 
Checking my > > > > >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf > > > > >> I see: > > > > >> > > > > >> blacklist nvidia > > > > >> blacklist nvidia-drm > > > > >> blacklist nvidia-modeset > > > > >> blacklist nvidia-uvm > > > > >> blacklist ipmi_msghandler > > > > >> blacklist ipmi_devintf > > > > >> > > > > >> So I thought I had blacklisted it but it seems I did not. Since I do > > > > >> not > > > > >> want to use it maybe it is better to check if the lock up occurs with > > > > >> nouveau blacklisted. I will try that now. > > > > > I blacklisted nouveau and booted into a 6.1 kernel: > > > > > % uname -a > > > > > Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 > > > (2023-05-08) x86_64 GNU/Linux > > > > > > > > > > It has been running without problems for nearly two days now: > > > > > % uptime > > > > > 08:34:48 up 1 day, 16:22, 2 users, load average: 1.33, 1.26, 1.27 > > > > > > > > > > Regards, > > > > > > > > > > Nick. > > > > > > > > Thanks, that makes a lot more sense now. > > > > > > > > Nick, Can you please test if nouveau works with runtime PM in the > > > > latest 6.4-rc? > > > > > > > > If it works in 6.4-rc, there are probably nouveau commits that need > > > > to be backported to 6.1 LTS. > > > > > > > > If it's still broken in 6.4-rc, I believe you should file a bug: > > > > > > > > https://gitlab.freedesktop.org/drm/nouveau/ > > > > > > > > > > > > Lyude, Lukas, Karol > > > > > > > > This thread is in relation to this commit: > > > > > > > > 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string") > > > > > > > > Nick has found that runtime PM is *not* working for nouveau. > > > > > > > > > > keep in mind we have a list of PCIe controllers where we apply a > > > workaround: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers > > > /gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682 > > > > > > And I suspect there might be one or two more IDs we'll have to add > > > there. 
Do we have any logs? > > >
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
On Thu, Jun 1, 2023 at 6:54 PM Limonciello, Mario wrote: > > [AMD Official Use Only - General] > > > -Original Message- > > From: Karol Herbst > > Sent: Thursday, June 1, 2023 11:33 AM > > To: Limonciello, Mario > > Cc: Nick Hastings ; Lyude Paul > > ; Lukas Wunner ; Salvatore > > Bonaccorso ; 1036...@bugs.debian.org; Rafael J. > > Wysocki ; Len Brown ; linux- > > a...@vger.kernel.org; linux-ker...@vger.kernel.org; > > regressi...@lists.linux.dev > > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI > > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system) > > > > On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario > > wrote: > > > > > > +Lyude, Lukas, Karol > > > > > > On 5/31/2023 6:40 PM, Nick Hastings wrote: > > > > Hi, > > > > > > > > * Nick Hastings [230530 16:01]: > > > >> * Mario Limonciello [230530 13:00]: > > > > > > > >>> As you're actually loading nouveau, can you please try > > nouveau.runpm=0 on > > > >>> the kernel command line? > > > >> I'm not intentionally loading it. This machine also has intel graphics > > > >> which is what I prefer. Checking my > > > >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf > > > >> I see: > > > >> > > > >> blacklist nvidia > > > >> blacklist nvidia-drm > > > >> blacklist nvidia-modeset > > > >> blacklist nvidia-uvm > > > >> blacklist ipmi_msghandler > > > >> blacklist ipmi_devintf > > > >> > > > >> So I thought I had blacklisted it but it seems I did not. Since I do > > > >> not > > > >> want to use it maybe it is better to check if the lock up occurs with > > > >> nouveau blacklisted. I will try that now. 
> > > > I blacklisted nouveau and booted into a 6.1 kernel:
> > > > % uname -a
> > > > Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1
> > (2023-05-08) x86_64 GNU/Linux
> > > >
> > > > It has been running without problems for nearly two days now:
> > > > % uptime
> > > > 08:34:48 up 1 day, 16:22, 2 users, load average: 1.33, 1.26, 1.27
> > > >
> > > > Regards,
> > > >
> > > > Nick.
> > >
> > > Thanks, that makes a lot more sense now.
> > >
> > > Nick, Can you please test if nouveau works with runtime PM in the
> > > latest 6.4-rc?
> > >
> > > If it works in 6.4-rc, there are probably nouveau commits that need
> > > to be backported to 6.1 LTS.
> > >
> > > If it's still broken in 6.4-rc, I believe you should file a bug:
> > >
> > > https://gitlab.freedesktop.org/drm/nouveau/
> > >
> > >
> > > Lyude, Lukas, Karol
> > >
> > > This thread is in relation to this commit:
> > >
> > > 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
> > >
> > > Nick has found that runtime PM is *not* working for nouveau.
> > >
> >
> > keep in mind we have a list of PCIe controllers where we apply a
> > workaround:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers
> > /gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682
> >
> > And I suspect there might be one or two more IDs we'll have to add
> > there. Do we have any logs?
>
> There's some archived onto the distro bug. Search this page for
> "journalctl.log.gz"
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036530
>

interesting.. It seems to be the same controller used here. I wonder
if the pci topology is different or if the workaround is applied at
all.

But yeah, I'd kinda love for somebody with better knowledge on all of
this to figure out what exactly is going wrong, but everytime this
gets investigated Intel says "our hardware has no bugs", the ACPI
folks dig for months and find nothing and I end up figuring out some
weirdo workaround I don't understand.

And apparently also nobody is able to hand out docs explaining in
detail how that runtime suspend/resume stuff is supposed to work.

I have a Dell XPS 9560 where the added workaround in nouveau fixed the
problem and I know it's fixed on a bunch of other systems. So if
anybody is willing to publish docs and/or actually debug it with
domain knowledge, please go ahead.

> > And could anybody test if adding the
> > controller in play here does resolve the problem?
> >
> > > If you recall we did 24867516f06d because 5775b843a619 was
> > > supposed to have fixed it.
> > >
>
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
[AMD Official Use Only - General] > -Original Message- > From: Karol Herbst > Sent: Thursday, June 1, 2023 11:33 AM > To: Limonciello, Mario > Cc: Nick Hastings ; Lyude Paul > ; Lukas Wunner ; Salvatore > Bonaccorso ; 1036...@bugs.debian.org; Rafael J. > Wysocki ; Len Brown ; linux- > a...@vger.kernel.org; linux-ker...@vger.kernel.org; > regressi...@lists.linux.dev > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system) > > On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario > wrote: > > > > +Lyude, Lukas, Karol > > > > On 5/31/2023 6:40 PM, Nick Hastings wrote: > > > Hi, > > > > > > * Nick Hastings [230530 16:01]: > > >> * Mario Limonciello [230530 13:00]: > > > > > >>> As you're actually loading nouveau, can you please try > nouveau.runpm=0 on > > >>> the kernel command line? > > >> I'm not intentionally loading it. This machine also has intel graphics > > >> which is what I prefer. Checking my > > >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf > > >> I see: > > >> > > >> blacklist nvidia > > >> blacklist nvidia-drm > > >> blacklist nvidia-modeset > > >> blacklist nvidia-uvm > > >> blacklist ipmi_msghandler > > >> blacklist ipmi_devintf > > >> > > >> So I thought I had blacklisted it but it seems I did not. Since I do not > > >> want to use it maybe it is better to check if the lock up occurs with > > >> nouveau blacklisted. I will try that now. > > > I blacklisted nouveau and booted into a 6.1 kernel: > > > % uname -a > > > Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 > (2023-05-08) x86_64 GNU/Linux > > > > > > It has been running without problems for nearly two days now: > > > % uptime > > > 08:34:48 up 1 day, 16:22, 2 users, load average: 1.33, 1.26, 1.27 > > > > > > Regards, > > > > > > Nick. > > > > Thanks, that makes a lot more sense now. > > > > Nick, Can you please test if nouveau works with runtime PM in the > > latest 6.4-rc? 
> >
> > If it works in 6.4-rc, there are probably nouveau commits that need
> > to be backported to 6.1 LTS.
> >
> > If it's still broken in 6.4-rc, I believe you should file a bug:
> >
> > https://gitlab.freedesktop.org/drm/nouveau/
> >
> >
> > Lyude, Lukas, Karol
> >
> > This thread is in relation to this commit:
> >
> > 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
> >
> > Nick has found that runtime PM is *not* working for nouveau.
> >
>
> keep in mind we have a list of PCIe controllers where we apply a
> workaround:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers
> /gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682
>
> And I suspect there might be one or two more IDs we'll have to add
> there. Do we have any logs?

There's some archived onto the distro bug. Search this page for
"journalctl.log.gz"
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036530

> And could anybody test if adding the
> controller in play here does resolve the problem?
>
> > If you recall we did 24867516f06d because 5775b843a619 was
> > supposed to have fixed it.
> >
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario wrote: > > +Lyude, Lukas, Karol > > On 5/31/2023 6:40 PM, Nick Hastings wrote: > > Hi, > > > > * Nick Hastings [230530 16:01]: > >> * Mario Limonciello [230530 13:00]: > > > >>> As you're actually loading nouveau, can you please try nouveau.runpm=0 on > >>> the kernel command line? > >> I'm not intentionally loading it. This machine also has intel graphics > >> which is what I prefer. Checking my > >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf > >> I see: > >> > >> blacklist nvidia > >> blacklist nvidia-drm > >> blacklist nvidia-modeset > >> blacklist nvidia-uvm > >> blacklist ipmi_msghandler > >> blacklist ipmi_devintf > >> > >> So I thought I had blacklisted it but it seems I did not. Since I do not > >> want to use it maybe it is better to check if the lock up occurs with > >> nouveau blacklisted. I will try that now. > > I blacklisted nouveau and booted into a 6.1 kernel: > > % uname -a > > Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) > > x86_64 GNU/Linux > > > > It has been running without problems for nearly two days now: > > % uptime > > 08:34:48 up 1 day, 16:22, 2 users, load average: 1.33, 1.26, 1.27 > > > > Regards, > > > > Nick. > > Thanks, that makes a lot more sense now. > > Nick, Can you please test if nouveau works with runtime PM in the > latest 6.4-rc? > > If it works in 6.4-rc, there are probably nouveau commits that need > to be backported to 6.1 LTS. > > If it's still broken in 6.4-rc, I believe you should file a bug: > > https://gitlab.freedesktop.org/drm/nouveau/ > > > Lyude, Lukas, Karol > > This thread is in relation to this commit: > > 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string") > > Nick has found that runtime PM is *not* working for nouveau. 
>

keep in mind we have a list of PCIe controllers where we apply a workaround:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682

And I suspect there might be one or two more IDs we'll have to add
there. Do we have any logs? And could anybody test if adding the
controller in play here does resolve the problem?

> If you recall we did 24867516f06d because 5775b843a619 was
> supposed to have fixed it.
>
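As a rough illustration of the "is this controller on the list" check, the sketch below pulls the vendor:device pair out of a kernel PCI enumeration line and compares it with a set of bridge IDs. It is not the nouveau code: `QUIRKED_BRIDGES`, `PCI_ENUM_LINE` and `bridge_is_quirked` are hypothetical names, and the single entry 8086:1901 is an assumption based on the bridge ID reported for this machine elsewhere in the thread; the authoritative list is in the nouveau_drm.c source linked above.

```python
import re

# Assumption: upstream-bridge IDs that get the workaround. Only the ID seen
# in this thread is listed; consult nouveau_drm.c for the real list.
QUIRKED_BRIDGES = {("8086", "1901")}

# Matches kernel lines such as:
#   pci :00:01.0: [8086:1901] type 01 class 0x060400
PCI_ENUM_LINE = re.compile(
    r"pci (?P<addr>[0-9a-f:.]+): "
    r"\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\] "
    r"type (?P<type>[0-9a-f]{2})"
)


def bridge_is_quirked(log_line):
    """True/False if the line names a PCI device; None if it does not parse."""
    match = PCI_ENUM_LINE.search(log_line)
    if match is None:
        return None
    return (match["vendor"], match["device"]) in QUIRKED_BRIDGES


if __name__ == "__main__":
    line = "May 29 15:02:42 xps kernel: pci :00:01.0: [8086:1901] type 01 class 0x060400"
    print(bridge_is_quirked(line))
```

A match here only says the ID is on the assumed list; whether the workaround actually ran still has to be confirmed from the nouveau messages in the log.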
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
+Lyude, Lukas, Karol On 5/31/2023 6:40 PM, Nick Hastings wrote: Hi, * Nick Hastings [230530 16:01]: * Mario Limonciello [230530 13:00]: As you're actually loading nouveau, can you please try nouveau.runpm=0 on the kernel command line? I'm not intentionally loading it. This machine also has intel graphics which is what I prefer. Checking my /etc/modprobe.d/blacklist-nvidia-nouveau.conf I see: blacklist nvidia blacklist nvidia-drm blacklist nvidia-modeset blacklist nvidia-uvm blacklist ipmi_msghandler blacklist ipmi_devintf So I thought I had blacklisted it but it seems I did not. Since I do not want to use it maybe it is better to check if the lock up occurs with nouveau blacklisted. I will try that now. I blacklisted nouveau and booted into a 6.1 kernel: % uname -a Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux It has been running without problems for nearly two days now: % uptime 08:34:48 up 1 day, 16:22, 2 users, load average: 1.33, 1.26, 1.27 Regards, Nick. Thanks, that makes a lot more sense now. Nick, Can you please test if nouveau works with runtime PM in the latest 6.4-rc? If it works in 6.4-rc, there are probably nouveau commits that need to be backported to 6.1 LTS. If it's still broken in 6.4-rc, I believe you should file a bug: https://gitlab.freedesktop.org/drm/nouveau/ Lyude, Lukas, Karol This thread is in relation to this commit: 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string") Nick has found that runtime PM is *not* working for nouveau. If you recall we did 24867516f06d because 5775b843a619 was supposed to have fixed it.
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
Hi,

* Nick Hastings [230530 16:01]:
>
> * Mario Limonciello [230530 13:00]:
> > As you're actually loading nouveau, can you please try nouveau.runpm=0 on
> > the kernel command line?
>
> I'm not intentionally loading it. This machine also has intel graphics
> which is what I prefer. Checking my
> /etc/modprobe.d/blacklist-nvidia-nouveau.conf
> I see:
>
> blacklist nvidia
> blacklist nvidia-drm
> blacklist nvidia-modeset
> blacklist nvidia-uvm
> blacklist ipmi_msghandler
> blacklist ipmi_devintf
>
> So I thought I had blacklisted it but it seems I did not. Since I do not
> want to use it maybe it is better to check if the lock up occurs with
> nouveau blacklisted. I will try that now.

I blacklisted nouveau and booted into a 6.1 kernel:

% uname -a
Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux

It has been running without problems for nearly two days now:

% uptime
08:34:48 up 1 day, 16:22, 2 users, load average: 1.33, 1.26, 1.27

Regards,

Nick.
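The slip described above (a file named for nouveau that never blacklists it) is easy to check for. The sketch below is an illustration only: `blacklisted_modules` is a hypothetical helper that reads `blacklist <module>` lines from text in modprobe.d syntax, and `CONF` reproduces the file contents quoted in this message.

```python
# Contents of /etc/modprobe.d/blacklist-nvidia-nouveau.conf as quoted above.
CONF = """\
blacklist nvidia
blacklist nvidia-drm
blacklist nvidia-modeset
blacklist nvidia-uvm
blacklist ipmi_msghandler
blacklist ipmi_devintf
"""


def blacklisted_modules(conf_text):
    """Collect module names from 'blacklist <name>' lines, skipping comments."""
    modules = set()
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) == 2 and fields[0] == "blacklist":
            # modprobe treats '-' and '_' alike in module names.
            modules.add(fields[1].replace("-", "_"))
    return modules


if __name__ == "__main__":
    names = blacklisted_modules(CONF)
    print("nouveau blacklisted:", "nouveau" in names)
```

On a live system the same function could be fed the text of every file under /etc/modprobe.d; that directory walk is left out to keep the sketch self-contained.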
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
Hi Nick, Thanks to you both for triaging the issue! On Tue, May 30, 2023 at 04:01:04PM +0900, Nick Hastings wrote: > Hi, > > * Mario Limonciello [230530 13:00]: > > On 5/29/23 18:01, Nick Hastings wrote: > > > Hi, > > > > > > * Nick Hastings [230529 12:51]: > > > > * Mario Limonciello [230529 10:14]: > > > > > On 5/28/23 19:56, Nick Hastings wrote: > > > > > > Hi, > > > > > > > > > > > > * Mario Limonciello [230528 21:44]: > > > > > > > On 5/28/23 01:49, Salvatore Bonaccorso wrote: > > > > > > > > Hi Mario > > > > > > > > > > > > > > > > Nick Hastings reported in Debian in > > > > > > > > https://bugs.debian.org/1036530 > > > > > > > > lockups from his system after updating from a 6.0 based version > > > > > > > > to > > > > > > > > 6.1.y. > > > > > > > > > #regzbot ^introduced 24867516f06d > > > > > > > > > > > > > > > > he bisected the issue and tracked it down to: > > > > > > > > > > > > > > > > On Sun, May 28, 2023 at 10:14:51AM +0900, Nick Hastings wrote: > > > > > > > > > Control: tags -1 - moreinfo > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > I repeated the git bisect, and the bad commit seems to be: > > > > > > > > > > > > > > > > > > (git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad > > > > > > > > > 24867516f06dabedef3be7eea0ef0846b91538bc is the first bad > > > > > > > > > commit > > > > > > > > > commit 24867516f06dabedef3be7eea0ef0846b91538bc > > > > > > > > > Author: Mario Limonciello > > > > > > > > > Date: Tue Aug 23 13:51:31 2022 -0500 > > > > > > > > > > > > > > > > > >ACPI: OSI: Remove Linux-Dell-Video _OSI string > > > > > > > > >This string was introduced because drivers for NVIDIA > > > > > > > > > hardware > > > > > > > > >had bugs supporting RTD3 in the past. 
> > > > > > > > >Before proprietary NVIDIA driver started to support > > > > > > > > > RTD3, Ubuntu had > > > > > > > > >had a mechanism for switching PRIME on and off, though > > > > > > > > > it had required > > > > > > > > >to logout/login to make the library switch happen. > > > > > > > > >When the PRIME had been off, the mechanism had > > > > > > > > > unloaded the NVIDIA > > > > > > > > >driver and put the device into D3cold, but the GPU had > > > > > > > > > never come back > > > > > > > > >to D0 again which is why ODMs used the _OSI to expose > > > > > > > > > an old _DSM > > > > > > > > >method to switch the power on/off. > > > > > > > > >That has been fixed by commit 5775b843a619 ("PCI: > > > > > > > > > Restore config space > > > > > > > > >on runtime resume despite being unbound"). so vendors > > > > > > > > > shouldn't be > > > > > > > > >using this string to modify ASL any more. > > > > > > > > >Reviewed-by: Lyude Paul > > > > > > > > >Signed-off-by: Mario Limonciello > > > > > > > > > > > > > > > > > >Signed-off-by: Rafael J. Wysocki > > > > > > > > > > > > > > > > > > > > > > > > > > > drivers/acpi/osi.c | 9 - > > > > > > > > > 1 file changed, 9 deletions(-) > > > > > > > > > > > > > > > > > > This machine is a Dell with an nvidia chip so it looks like > > > > > > > > > this really > > > > > > > > > could be the commit that that is causing the problems. 
The > > > > > > > > > description > > > > > > > > > of the commit also seems (to my untrained eye) to be > > > > > > > > > consistent with the > > > > > > > > > error reported on the console when the lockup occurs: > > > > > > > > > > > > > > > > > > [ 58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due > > > > > > > > > to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529) > > > > > > > > > [ 58.729904] ACPI Error: Aborting method > > > > > > > > > \_SB.PCI0.PEG0.PG00._ON due to previous error > > > > > > > > > (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529) > > > > > > > > > [ 60.083261] vfio-pci :01:00.0 Unable to change power > > > > > > > > > state from D3cold to D0, device inaccessible > > > > > > > > > > > > > > > > > > Hopefully this is enough information for experts to resolve > > > > > > > > > this. > > > > > > > > > > > > > > > > Does this ring some bell for you? Do you need any further > > > > > > > > information > > > > > > > > from Nick? > > > > > > > > > > > > > > > > Regards, > > > > > > > > Salvatore > > > > > > > > > > > > > > > > > > > > Have Nick try using "pcie_port_pm=off" and see if it helps the > > > > > > > issue. > > > > > > > > > > > > I booted into a 6.1 kernel with this option. It has been running > > > > > > without > > > > > > problems for 1.5 hours. Usually I would expect the lockup to have > > > > > > occurred by now. > > > > > > > > I let this run for 3 hours without issue. > > > > > > > > > > > Does this happen in the latest 6.4 RC as well? > > > > > > > > > > > > I have compiled that kernel and will boot into it after running > > > > > > this one > > > > > > with the
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
Hi, * Mario Limonciello [230530 13:00]: > On 5/29/23 18:01, Nick Hastings wrote: > > Hi, > > > > * Nick Hastings [230529 12:51]: > > > * Mario Limonciello [230529 10:14]: > > > > On 5/28/23 19:56, Nick Hastings wrote: > > > > > Hi, > > > > > > > > > > * Mario Limonciello [230528 21:44]: > > > > > > On 5/28/23 01:49, Salvatore Bonaccorso wrote: > > > > > > > Hi Mario > > > > > > > > > > > > > > Nick Hastings reported in Debian in > > > > > > > https://bugs.debian.org/1036530 > > > > > > > lockups from his system after updating from a 6.0 based version to > > > > > > > 6.1.y. > > > > > > > > #regzbot ^introduced 24867516f06d > > > > > > > > > > > > > > he bisected the issue and tracked it down to: > > > > > > > > > > > > > > On Sun, May 28, 2023 at 10:14:51AM +0900, Nick Hastings wrote: > > > > > > > > Control: tags -1 - moreinfo > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > I repeated the git bisect, and the bad commit seems to be: > > > > > > > > > > > > > > > > (git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad > > > > > > > > 24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit > > > > > > > > commit 24867516f06dabedef3be7eea0ef0846b91538bc > > > > > > > > Author: Mario Limonciello > > > > > > > > Date: Tue Aug 23 13:51:31 2022 -0500 > > > > > > > > > > > > > > > >ACPI: OSI: Remove Linux-Dell-Video _OSI string > > > > > > > >This string was introduced because drivers for NVIDIA > > > > > > > > hardware > > > > > > > >had bugs supporting RTD3 in the past. > > > > > > > >Before proprietary NVIDIA driver started to support > > > > > > > > RTD3, Ubuntu had > > > > > > > >had a mechanism for switching PRIME on and off, though > > > > > > > > it had required > > > > > > > >to logout/login to make the library switch happen. 
> > > > > > > >When the PRIME had been off, the mechanism had unloaded > > > > > > > > the NVIDIA > > > > > > > >driver and put the device into D3cold, but the GPU had > > > > > > > > never come back > > > > > > > >to D0 again which is why ODMs used the _OSI to expose an > > > > > > > > old _DSM > > > > > > > >method to switch the power on/off. > > > > > > > >That has been fixed by commit 5775b843a619 ("PCI: > > > > > > > > Restore config space > > > > > > > >on runtime resume despite being unbound"). so vendors > > > > > > > > shouldn't be > > > > > > > >using this string to modify ASL any more. > > > > > > > >Reviewed-by: Lyude Paul > > > > > > > >Signed-off-by: Mario Limonciello > > > > > > > > > > > > > > > >Signed-off-by: Rafael J. Wysocki > > > > > > > > > > > > > > > > > > > > > > > > drivers/acpi/osi.c | 9 - > > > > > > > > 1 file changed, 9 deletions(-) > > > > > > > > > > > > > > > > This machine is a Dell with an nvidia chip so it looks like > > > > > > > > this really > > > > > > > > could be the commit that that is causing the problems. The > > > > > > > > description > > > > > > > > of the commit also seems (to my untrained eye) to be consistent > > > > > > > > with the > > > > > > > > error reported on the console when the lockup occurs: > > > > > > > > > > > > > > > > [ 58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due > > > > > > > > to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529) > > > > > > > > [ 58.729904] ACPI Error: Aborting method > > > > > > > > \_SB.PCI0.PEG0.PG00._ON due to previous error > > > > > > > > (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529) > > > > > > > > [ 60.083261] vfio-pci :01:00.0 Unable to change power > > > > > > > > state from D3cold to D0, device inaccessible > > > > > > > > > > > > > > > > Hopefully this is enough information for experts to resolve > > > > > > > > this. > > > > > > > > > > > > > > Does this ring some bell for you? 
Do you need any further > > > > > > > information > > > > > > > from Nick? > > > > > > > > > > > > > > Regards, > > > > > > > Salvatore > > > > > > > > > > > > > > > > > Have Nick try using "pcie_port_pm=off" and see if it helps the > > > > > > issue. > > > > > > > > > > I booted into a 6.1 kernel with this option. It has been running > > > > > without > > > > > problems for 1.5 hours. Usually I would expect the lockup to have > > > > > occurred by now. > > > > > > I let this run for 3 hours without issue. > > > > > > > > > Does this happen in the latest 6.4 RC as well? > > > > > > > > > > I have compiled that kernel and will boot into it after running this > > > > > one > > > > > with the pcie_port_pm=off for another hour or so. > > > > > > I'm now running 6.4.0-rc4 without seeing the problem after 1 hour. > > > > I did eventually see a lockup of this kernel. On the console I saw: > > > > [ 151.035036] vfio-pci :01:00.0 Unable to change power state from > > D3cold to D0, device inaccessible > > > > I did not see the other two lines that were
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
On 5/29/23 18:01, Nick Hastings wrote: Hi, * Nick Hastings [230529 12:51]: * Mario Limonciello [230529 10:14]: On 5/28/23 19:56, Nick Hastings wrote: Hi, * Mario Limonciello [230528 21:44]: On 5/28/23 01:49, Salvatore Bonaccorso wrote: Hi Mario Nick Hastings reported in Debian in https://bugs.debian.org/1036530 lockups from his system after updating from a 6.0 based version to 6.1.y. > #regzbot ^introduced 24867516f06d he bisected the issue and tracked it down to: On Sun, May 28, 2023 at 10:14:51AM +0900, Nick Hastings wrote: Control: tags -1 - moreinfo Hi, I repeated the git bisect, and the bad commit seems to be: (git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad 24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit commit 24867516f06dabedef3be7eea0ef0846b91538bc Author: Mario Limonciello Date: Tue Aug 23 13:51:31 2022 -0500 ACPI: OSI: Remove Linux-Dell-Video _OSI string This string was introduced because drivers for NVIDIA hardware had bugs supporting RTD3 in the past. Before proprietary NVIDIA driver started to support RTD3, Ubuntu had had a mechanism for switching PRIME on and off, though it had required to logout/login to make the library switch happen. When the PRIME had been off, the mechanism had unloaded the NVIDIA driver and put the device into D3cold, but the GPU had never come back to D0 again which is why ODMs used the _OSI to expose an old _DSM method to switch the power on/off. That has been fixed by commit 5775b843a619 ("PCI: Restore config space on runtime resume despite being unbound"). so vendors shouldn't be using this string to modify ASL any more. Reviewed-by: Lyude Paul Signed-off-by: Mario Limonciello Signed-off-by: Rafael J. Wysocki drivers/acpi/osi.c | 9 - 1 file changed, 9 deletions(-) This machine is a Dell with an nvidia chip so it looks like this really could be the commit that that is causing the problems. 
The description of the commit also seems (to my untrained eye) to be consistent with the error reported on the console when the lockup occurs: [ 58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529) [ 58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529) [ 60.083261] vfio-pci :01:00.0 Unable to change power state from D3cold to D0, device inaccessible Hopefully this is enough information for experts to resolve this. Does this ring some bell for you? Do you need any further information from Nick? Regards, Salvatore Have Nick try using "pcie_port_pm=off" and see if it helps the issue. I booted into a 6.1 kernel with this option. It has been running without problems for 1.5 hours. Usually I would expect the lockup to have occurred by now. I let this run for 3 hours without issue. Does this happen in the latest 6.4 RC as well? I have compiled that kernel and will boot into it after running this one with the pcie_port_pm=off for another hour or so. I'm now running 6.4.0-rc4 without seeing the problem after 1 hour. I did eventually see a lockup of this kernel. On the console I saw: [ 151.035036] vfio-pci :01:00.0 Unable to change power state from D3cold to D0, device inaccessible I did not see the other two lines that were present in earlier lock ups > I did however see two unrelated problems that I include here for completeness: 1. iwlwifi module did not automatically load 2. Xwayland used huge amount of CPU even though was not running any X programs. Recompiling my wayland compositor without XWayland support "fixed" this. I think we need to see a full dmesg and acpidump to better characterize it. Please find attached. Let me know if there is anything else I can provide. Regards, Nick. I don't see nouveau loading, are you explicitly preventing it from loading? Yes nouveau is blacklisted. 
Can I see the journal from a boot when it reproduced? Hmm not sure which n for "journalctl -b n" maps to which kernel (is that what you are requesting?). The commit hash does not seem to be listed. I may have to boot into a bad kernel again. Please find attached the output from a "journalctl --system -bN" for a kernel that has this issue. Regards, Nick. In this log I see nouveau loaded, but I also don't see the failure occurring. As you're actually loading nouveau, can you please try nouveau.runpm=0 on the kernel command line? If that helps the issue, I strongly suggest you cross-reference the latest kernel to see if this bug still exists.
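The question above of which "journalctl -b n" belongs to which kernel can be approached from the boot banner: early in each boot the kernel logs a line beginning "Linux version <release>". The following is only a minimal sketch of that idea. The function names are made up for illustration, and it assumes the journal for the boot in question is still available.

```python
import re
import subprocess

# The kernel logs "Linux version <release> (...)" early in every boot.
_BANNER = re.compile(r"Linux version (\S+)")


def kernel_release_from_log(log_text):
    """Return the release named in the first boot banner found, or None."""
    match = _BANNER.search(log_text)
    return match.group(1) if match else None


def kernel_release_for_boot(offset):
    """Read the kernel messages of one boot (0 = current, -1 = previous, ...)
    via journalctl and extract the release from the banner.

    Hypothetical helper: returns None when the boot is not in the journal.
    """
    result = subprocess.run(
        ["journalctl", "-k", f"--boot={offset}", "--no-pager", "-o", "cat"],
        capture_output=True,
        text=True,
        check=False,
    )
    return kernel_release_from_log(result.stdout)
```

Calling kernel_release_for_boot(-1), kernel_release_for_boot(-2) and so on lists the release of each earlier boot. Whether a self-built bisect kernel carries the commit hash in its release string depends on how it was configured, so the release alone may not distinguish two bisect steps.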
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
Hi, * Nick Hastings [230529 12:51]: > * Mario Limonciello [230529 10:14]: > > On 5/28/23 19:56, Nick Hastings wrote: > > > Hi, > > > > > > * Mario Limonciello [230528 21:44]: > > > > On 5/28/23 01:49, Salvatore Bonaccorso wrote: > > > > > Hi Mario > > > > > > > > > > Nick Hastings reported in Debian in https://bugs.debian.org/1036530 > > > > > lockups from his system after updating from a 6.0 based version to > > > > > 6.1.y. > > > > > > #regzbot ^introduced 24867516f06d > > > > > > > > > > he bisected the issue and tracked it down to: > > > > > > > > > > On Sun, May 28, 2023 at 10:14:51AM +0900, Nick Hastings wrote: > > > > > > Control: tags -1 - moreinfo > > > > > > > > > > > > Hi, > > > > > > > > > > > > I repeated the git bisect, and the bad commit seems to be: > > > > > > > > > > > > (git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad > > > > > > 24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit > > > > > > commit 24867516f06dabedef3be7eea0ef0846b91538bc > > > > > > Author: Mario Limonciello > > > > > > Date: Tue Aug 23 13:51:31 2022 -0500 > > > > > > > > > > > > ACPI: OSI: Remove Linux-Dell-Video _OSI string > > > > > > This string was introduced because drivers for NVIDIA hardware > > > > > > had bugs supporting RTD3 in the past. > > > > > > Before proprietary NVIDIA driver started to support RTD3, > > > > > > Ubuntu had > > > > > > had a mechanism for switching PRIME on and off, though it had > > > > > > required > > > > > > to logout/login to make the library switch happen. > > > > > > When the PRIME had been off, the mechanism had unloaded the > > > > > > NVIDIA > > > > > > driver and put the device into D3cold, but the GPU had never > > > > > > come back > > > > > > to D0 again which is why ODMs used the _OSI to expose an old > > > > > > _DSM > > > > > > method to switch the power on/off. > > > > > > That has been fixed by commit 5775b843a619 ("PCI: Restore > > > > > > config space > > > > > > on runtime resume despite being unbound"). 
so vendors > > > > > > shouldn't be > > > > > > using this string to modify ASL any more. > > > > > > Reviewed-by: Lyude Paul > > > > > > Signed-off-by: Mario Limonciello > > > > > > Signed-off-by: Rafael J. Wysocki > > > > > > > > > > > >drivers/acpi/osi.c | 9 - > > > > > >1 file changed, 9 deletions(-) > > > > > > > > > > > > This machine is a Dell with an nvidia chip so it looks like this > > > > > > really > > > > > > could be the commit that that is causing the problems. The > > > > > > description > > > > > > of the commit also seems (to my untrained eye) to be consistent > > > > > > with the > > > > > > error reported on the console when the lockup occurs: > > > > > > > > > > > > [ 58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to > > > > > > previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529) > > > > > > [ 58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON > > > > > > due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529) > > > > > > [ 60.083261] vfio-pci :01:00.0 Unable to change power state > > > > > > from D3cold to D0, device inaccessible > > > > > > > > > > > > Hopefully this is enough information for experts to resolve this. > > > > > > > > > > Does this ring some bell for you? Do you need any further information > > > > > from Nick? > > > > > > > > > > Regards, > > > > > Salvatore > > > > > > > > > > > Have Nick try using "pcie_port_pm=off" and see if it helps the issue. > > > > > > I booted into a 6.1 kernel with this option. It has been running without > > > problems for 1.5 hours. Usually I would expect the lockup to have > > > occurred by now. > > I let this run for 3 hours without issue. > > > > > Does this happen in the latest 6.4 RC as well? > > > > > > I have compiled that kernel and will boot into it after running this one > > > with the pcie_port_pm=off for another hour or so. > > I'm now running 6.4.0-rc4 without seeing the problem after 1 hour. I did eventually see a lockup of this kernel. 
On the console I saw: [ 151.035036] vfio-pci :01:00.0 Unable to change power state from D3cold to D0, device inaccessible I did not see the other two lines that were present in earlier lock ups > I did however see two unrelated problems that I include here for > completeness: > 1. iwlwifi module did not automatically load > 2. Xwayland used huge amount of CPU even though was not running any X > programs. Recompiling my wayland compositor without XWayland support > "fixed" this. > > > > > I think we need to see a full dmesg and acpidump to better > > > > characterize it. > > > > > > Please find attached. Let me know if there is anything else I can provide. > > > > > > Regards, > > > > > > Nick. > > > > I don't see nouveau loading, are you explicitly preventing it from > > loading? > > Yes nouveau is blacklisted. >
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
* Mario Limonciello [230529 10:14]: > On 5/28/23 19:56, Nick Hastings wrote: > > Hi, > > > > * Mario Limonciello [230528 21:44]: > > > On 5/28/23 01:49, Salvatore Bonaccorso wrote: > > > > Hi Mario > > > > > > > > Nick Hastings reported in Debian in https://bugs.debian.org/1036530 > > > > lockups from his system after updating from a 6.0 based version to > > > > 6.1.y. > > > > > #regzbot ^introduced 24867516f06d > > > > > > > > he bisected the issue and tracked it down to: > > > > > > > > On Sun, May 28, 2023 at 10:14:51AM +0900, Nick Hastings wrote: > > > > > Control: tags -1 - moreinfo > > > > > > > > > > Hi, > > > > > > > > > > I repeated the git bisect, and the bad commit seems to be: > > > > > > > > > > (git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad > > > > > 24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit > > > > > commit 24867516f06dabedef3be7eea0ef0846b91538bc > > > > > Author: Mario Limonciello > > > > > Date: Tue Aug 23 13:51:31 2022 -0500 > > > > > > > > > > ACPI: OSI: Remove Linux-Dell-Video _OSI string > > > > > This string was introduced because drivers for NVIDIA hardware > > > > > had bugs supporting RTD3 in the past. > > > > > Before proprietary NVIDIA driver started to support RTD3, > > > > > Ubuntu had > > > > > had a mechanism for switching PRIME on and off, though it had > > > > > required > > > > > to logout/login to make the library switch happen. > > > > > When the PRIME had been off, the mechanism had unloaded the > > > > > NVIDIA > > > > > driver and put the device into D3cold, but the GPU had never > > > > > come back > > > > > to D0 again which is why ODMs used the _OSI to expose an old > > > > > _DSM > > > > > method to switch the power on/off. > > > > > That has been fixed by commit 5775b843a619 ("PCI: Restore > > > > > config space > > > > > on runtime resume despite being unbound"). so vendors shouldn't > > > > > be > > > > > using this string to modify ASL any more. 
> > > > > Reviewed-by: Lyude Paul > > > > > Signed-off-by: Mario Limonciello > > > > > Signed-off-by: Rafael J. Wysocki > > > > > > > > > >drivers/acpi/osi.c | 9 - > > > > >1 file changed, 9 deletions(-) > > > > > > > > > > This machine is a Dell with an nvidia chip so it looks like this > > > > > really > > > > > could be the commit that that is causing the problems. The description > > > > > of the commit also seems (to my untrained eye) to be consistent with > > > > > the > > > > > error reported on the console when the lockup occurs: > > > > > > > > > > [ 58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to > > > > > previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529) > > > > > [ 58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON > > > > > due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529) > > > > > [ 60.083261] vfio-pci :01:00.0 Unable to change power state > > > > > from D3cold to D0, device inaccessible > > > > > > > > > > Hopefully this is enough information for experts to resolve this. > > > > > > > > Does this ring some bell for you? Do you need any further information > > > > from Nick? > > > > > > > > Regards, > > > > Salvatore > > > > > > > > Have Nick try using "pcie_port_pm=off" and see if it helps the issue. > > > > I booted into a 6.1 kernel with this option. It has been running without > > problems for 1.5 hours. Usually I would expect the lockup to have > > occurred by now. I let this run for 3 hours without issue. > > > Does this happen in the latest 6.4 RC as well? > > > > I have compiled that kernel and will boot into it after running this one > > with the pcie_port_pm=off for another hour or so. I'm now running 6.4.0-rc4 without seeing the problem after 1 hour. I did however see two unrelated problems that I include here for completeness: 1. iwlwifi module did not automatically load 2. Xwayland used huge amount of CPU even though was not running any X programs. 
Recompiling my wayland compositor without XWayland support "fixed" this. > > > I think we need to see a full dmesg and acpidump to better > > > characterize it. > > > > Please find attached. Let me know if there is anything else I can provide. > > > > Regards, > > > > Nick. > > I don't see nouveau loading, are you explicitly preventing it from > loading? Yes nouveau is blacklisted. > Can I see the journal from a boot when it reproduced? Hmm not sure which n for "journalctl -b n" maps to which kernel (is that what you are requesting?). The commit hash does not seem to be listed. I may have to boot into a bad kernel again. Regards, Nick.
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
On 5/28/23 19:56, Nick Hastings wrote: Hi, * Mario Limonciello [230528 21:44]: On 5/28/23 01:49, Salvatore Bonaccorso wrote: Hi Mario Nick Hastings reported in Debian in https://bugs.debian.org/1036530 lockups from his system after updating from a 6.0 based version to 6.1.y. > #regzbot ^introduced 24867516f06d he bisected the issue and tracked it down to: On Sun, May 28, 2023 at 10:14:51AM +0900, Nick Hastings wrote: Control: tags -1 - moreinfo Hi, I repeated the git bisect, and the bad commit seems to be: (git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad 24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit commit 24867516f06dabedef3be7eea0ef0846b91538bc Author: Mario Limonciello Date: Tue Aug 23 13:51:31 2022 -0500 ACPI: OSI: Remove Linux-Dell-Video _OSI string This string was introduced because drivers for NVIDIA hardware had bugs supporting RTD3 in the past. Before proprietary NVIDIA driver started to support RTD3, Ubuntu had had a mechanism for switching PRIME on and off, though it had required to logout/login to make the library switch happen. When the PRIME had been off, the mechanism had unloaded the NVIDIA driver and put the device into D3cold, but the GPU had never come back to D0 again which is why ODMs used the _OSI to expose an old _DSM method to switch the power on/off. That has been fixed by commit 5775b843a619 ("PCI: Restore config space on runtime resume despite being unbound"). so vendors shouldn't be using this string to modify ASL any more. Reviewed-by: Lyude Paul Signed-off-by: Mario Limonciello Signed-off-by: Rafael J. Wysocki drivers/acpi/osi.c | 9 - 1 file changed, 9 deletions(-) This machine is a Dell with an nvidia chip so it looks like this really could be the commit that that is causing the problems. 
The description of the commit also seems (to my untrained eye) to be consistent with the error reported on the console when the lockup occurs: [ 58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529) [ 58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529) [ 60.083261] vfio-pci :01:00.0 Unable to change power state from D3cold to D0, device inaccessible Hopefully this is enough information for experts to resolve this. Does this ring some bell for you? Do you need any further information from Nick? Regards, Salvatore Have Nick try using "pcie_port_pm=off" and see if it helps the issue. I booted into a 6.1 kernel with this option. It has been running without problems for 1.5 hours. Usually I would expect the lockup to have occurred by now. Does this happen in the latest 6.4 RC as well? I have compiled that kernel and will boot into it after running this one with the pcie_port_pm=off for another hour or so. I think we need to see a full dmesg and acpidump to better characterize it. Please find attached. Let me know if there is anything else I can provide. Regards, Nick. I don't see nouveau loading, are you explicitly preventing it from loading? Can I see the journal from a boot when it reproduced?
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
On 5/28/23 01:49, Salvatore Bonaccorso wrote: Hi Mario Nick Hastings reported in Debian in https://bugs.debian.org/1036530 lockups from his system after updating from a 6.0 based version to 6.1.y. > #regzbot ^introduced 24867516f06d he bisected the issue and tracked it down to: On Sun, May 28, 2023 at 10:14:51AM +0900, Nick Hastings wrote: Control: tags -1 - moreinfo Hi, I repeated the git bisect, and the bad commit seems to be: (git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad 24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit commit 24867516f06dabedef3be7eea0ef0846b91538bc Author: Mario Limonciello Date: Tue Aug 23 13:51:31 2022 -0500 ACPI: OSI: Remove Linux-Dell-Video _OSI string This string was introduced because drivers for NVIDIA hardware had bugs supporting RTD3 in the past. Before proprietary NVIDIA driver started to support RTD3, Ubuntu had had a mechanism for switching PRIME on and off, though it had required to logout/login to make the library switch happen. When the PRIME had been off, the mechanism had unloaded the NVIDIA driver and put the device into D3cold, but the GPU had never come back to D0 again which is why ODMs used the _OSI to expose an old _DSM method to switch the power on/off. That has been fixed by commit 5775b843a619 ("PCI: Restore config space on runtime resume despite being unbound"). so vendors shouldn't be using this string to modify ASL any more. Reviewed-by: Lyude Paul Signed-off-by: Mario Limonciello Signed-off-by: Rafael J. Wysocki drivers/acpi/osi.c | 9 - 1 file changed, 9 deletions(-) This machine is a Dell with an nvidia chip so it looks like this really could be the commit that that is causing the problems. 
The description of the commit also seems (to my untrained eye) to be consistent with the error reported on the console when the lockup occurs: [ 58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529) [ 58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529) [ 60.083261] vfio-pci :01:00.0 Unable to change power state from D3cold to D0, device inaccessible Hopefully this is enough information for experts to resolve this. Does this ring some bell for you? Do you need any further information from Nick? Regards, Salvatore Hi Salvatore, Have Nick try using "pcie_port_pm=off" and see if it helps the issue. Does this happen in the latest 6.4 RC as well? I think we need to see a full dmesg and acpidump to better characterize it.
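When testing a boot option such as the "pcie_port_pm=off" suggested above, it helps to confirm that the running kernel was actually started with it. A minimal sketch, assuming the usual /proc/cmdline interface; the helper names are illustrative only, and quoted values containing spaces are not handled:

```python
def cmdline_params(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Parameters without '=' (for example 'quiet') map to None.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_cmdline(path="/proc/cmdline"):
    """Return the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read().strip()
```

For example, cmdline_params(read_cmdline()).get("pcie_port_pm") should give "off" on a boot where the option took effect, and None otherwise.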
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
Hi Mario Nick Hastings reported in Debian in https://bugs.debian.org/1036530 lockups from his system after updating from a 6.0 based version to 6.1.y. #regzbot ^introduced 24867516f06d he bisected the issue and tracked it down to: On Sun, May 28, 2023 at 10:14:51AM +0900, Nick Hastings wrote: > Control: tags -1 - moreinfo > > Hi, > > I repeated the git bisect, and the bad commit seems to be: > > (git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad > 24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit > commit 24867516f06dabedef3be7eea0ef0846b91538bc > Author: Mario Limonciello > Date: Tue Aug 23 13:51:31 2022 -0500 > > ACPI: OSI: Remove Linux-Dell-Video _OSI string > > This string was introduced because drivers for NVIDIA hardware > had bugs supporting RTD3 in the past. > > Before proprietary NVIDIA driver started to support RTD3, Ubuntu had > had a mechanism for switching PRIME on and off, though it had required > to logout/login to make the library switch happen. > > When the PRIME had been off, the mechanism had unloaded the NVIDIA > driver and put the device into D3cold, but the GPU had never come back > to D0 again which is why ODMs used the _OSI to expose an old _DSM > method to switch the power on/off. > > That has been fixed by commit 5775b843a619 ("PCI: Restore config space > on runtime resume despite being unbound"). so vendors shouldn't be > using this string to modify ASL any more. > > Reviewed-by: Lyude Paul > Signed-off-by: Mario Limonciello > Signed-off-by: Rafael J. Wysocki > > drivers/acpi/osi.c | 9 - > 1 file changed, 9 deletions(-) > > This machine is a Dell with an nvidia chip so it looks like this really > could be the commit that that is causing the problems. 
The description > of the commit also seems (to my untrained eye) to be consistent with the > error reported on the console when the lockup occurs: > > [ 58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous > error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529) > [ 58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to > previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529) > [ 60.083261] vfio-pci :01:00.0 Unable to change power state from D3cold > to D0, device inaccessible > > Hopefully this is enough information for experts to resolve this. Does this ring some bell for you? Do you need any further information from Nick? Regards, Salvatore
Bug#1036530: linux-signed-amd64: Hard lock up of system
Control: tags -1 - moreinfo

Hi,

I repeated the git bisect, and the bad commit seems to be:

(git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad
24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit
commit 24867516f06dabedef3be7eea0ef0846b91538bc
Author: Mario Limonciello
Date: Tue Aug 23 13:51:31 2022 -0500

    ACPI: OSI: Remove Linux-Dell-Video _OSI string

    This string was introduced because drivers for NVIDIA hardware
    had bugs supporting RTD3 in the past.

    Before proprietary NVIDIA driver started to support RTD3, Ubuntu had
    had a mechanism for switching PRIME on and off, though it had required
    to logout/login to make the library switch happen.

    When the PRIME had been off, the mechanism had unloaded the NVIDIA
    driver and put the device into D3cold, but the GPU had never come back
    to D0 again which is why ODMs used the _OSI to expose an old _DSM
    method to switch the power on/off.

    That has been fixed by commit 5775b843a619 ("PCI: Restore config space
    on runtime resume despite being unbound"). so vendors shouldn't be
    using this string to modify ASL any more.

    Reviewed-by: Lyude Paul
    Signed-off-by: Mario Limonciello
    Signed-off-by: Rafael J. Wysocki

 drivers/acpi/osi.c | 9 -
 1 file changed, 9 deletions(-)

This machine is a Dell with an nvidia chip so it looks like this really
could be the commit that is causing the problems. The description
of the commit also seems (to my untrained eye) to be consistent with the
error reported on the console when the lockup occurs:

[ 58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
[ 58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
[ 60.083261] vfio-pci :01:00.0 Unable to change power state from D3cold to D0, device inaccessible

Hopefully this is enough information for experts to resolve this.

Regards,

Nick.

* Salvatore Bonaccorso [230526 20:30]:
> Control: tags -1 + moreinfo
>
> Hi Nick,
>
> On Fri, May 26, 2023 at 09:25:23AM +0900, Nick Hastings wrote:
> > Hi Salvatore,
> >
> > thanks for your help. However, I'm now not sure if I really have
> > identified the commit that causes my problems. I fear I may have made
> > one or more mistakes when setting "git bisect good". I had been under
> > the impression that the lock up would happen no more than a few tens of
> > minutes after booting, however it seems that sometimes it can take a few
> > hours to occur.
> >
> > So, I'm running the git bisect again and will be more careful before
> > marking "git bisect good". It could take a few days.
> >
> > Should this particular bug be closed?
>
> Thanks a lot for reporting back; the time you put into the bisect is very
> much appreciated and valued! No, no need to close this one, as the bug
> still persists. Just follow up please once you have identified the
> culprit with the fresh bisect.
>
> Please also remove the moreinfo tag again by then (you can write
> a control message with tag -1 - moreinfo, so it won't appear as a bug
> needing information from the reporter).
>
> Thank you!
>
> Regards,
> Salvatore
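Because the lockup can take hours to appear, it may be easier to search saved console or journal output for the messages reported above than to watch the screen. The sketch below only looks for the two message shapes quoted in this report; the pattern names are invented for illustration and other failure messages would go unnoticed.

```python
import re

# Message shapes taken from the console output quoted in this report.
SIGNATURES = {
    "acpi_loop_timeout": re.compile(
        r"ACPI Error: Aborting method \S+ due to previous error "
        r"\(AE_AML_LOOP_TIMEOUT\)"
    ),
    "d3cold_resume_failed": re.compile(
        r"Unable to change power state from D3cold to D0, device inaccessible"
    ),
}


def find_lockup_signatures(log_text):
    """Return {signature name: [matching log lines]} for the given log text."""
    hits = {name: [] for name in SIGNATURES}
    for line in log_text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits[name].append(line.strip())
    return hits
```

An empty result does not prove a kernel is good; it only means these particular lines were not logged before the log was captured.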
Bug#1036530: linux-signed-amd64: Hard lock up of system
Control: tags -1 + moreinfo

Hi Nick,

On Fri, May 26, 2023 at 09:25:23AM +0900, Nick Hastings wrote:
> Hi Salvatore,
>
> thanks for your help. However, I'm now not sure if I really have
> identified the commit that causes my problems. I fear I may have made
> one or more mistakes when setting "git bisect good". I had been under
> the impression that the lock up would happen no more than a few tens of
> minutes after booting, however it seems that sometimes it can take a few
> hours to occur.
>
> So, I'm running the git bisect again and will be more careful before
> marking "git bisect good". It could take a few days.
>
> Should this particular bug be closed?

Thanks a lot for reporting back; the time you put into the bisect is very
much appreciated and valued! No, no need to close this one, as the bug
still persists. Just follow up please once you have identified the
culprit with the fresh bisect.

Please also remove the moreinfo tag again by then (you can write
a control message with tag -1 - moreinfo, so it won't appear as a bug
needing information from the reporter).

Thank you!

Regards,
Salvatore
Bug#1036530: linux-signed-amd64: Hard lock up of system
Hi Salvatore,

thanks for your help. However, I'm now not sure if I really have
identified the commit that causes my problems. I fear I may have made
one or more mistakes when setting "git bisect good". I had been under
the impression that the lock up would happen no more than a few tens of
minutes after booting, however it seems that sometimes it can take a few
hours to occur.

So, I'm running the git bisect again and will be more careful before
marking "git bisect good". It could take a few days. Should this
particular bug be closed?

Thanks,

Nick.

* Salvatore Bonaccorso [230526 00:19]:
> Hi Nick,
>
> On Thu, May 25, 2023 at 08:23:15AM +0900, Nick Hastings wrote:
> > Hi,
> >
> > * Salvatore Bonaccorso [230524 19:26]:
> > >
> > > Given you were able to bisect it so far, can you try to isolate the
> > > commit from the merge commit causing it?
> >
> > I guess I can try. The commit message states:
> >
> > Merge: c77f54a9bcec a1cf1fd62ae7 562163595a91 018d6711c26e 6cc401be1648
> >
> > Is there a way to extract each of those?
>
> The way I usually get all commits from a merge request is
>
> git log --oneline $mergecommit^..$mergecommit^2
>
> though here we have three merge commits, merged with one merge commit
> on top, so you would go down the merges of the acpi-properties,
> acpi-tables, acpi-x86 and acpi-soc branches. Those are:
>
> * acpi-properties:
>   ACPI: property: Silence missing-declarations warning in apple.c
>
> * acpi-tables:
>   ACPI: HMAT: Drop unused dev_fmt() and redundant 'HMAT' prefix
>   ACPI: tables: FPDT: Don't call acpi_os_map_memory() on invalid phys address
>
> * acpi-x86:
>   ACPI: x86: Add a quirk for Dell Inspiron 14 2-in-1 for StorageD3Enable
>
> * acpi-soc:
>   ACPI: LPSS: Deduplicate skipping device in acpi_lpss_create_device()
>   ACPI: LPSS: Replace loop with first entry retrieval
>
> > > One remotely related might be "ACPI: x86: Add a quirk for Dell
> > > Inspiron 14 2-in-1 for StorageD3Enable".
> >
> > Manually looking at the diff with
> > git diff e996c7e01892ac18ec0db447294d4f591c325efe~ e996c7e01892ac18ec0db447294d4f591c325efe
> > I guess that means the following:
> >
> > --- a/drivers/acpi/x86/utils.c
> > +++ b/drivers/acpi/x86/utils.c
> > @@ -207,9 +207,26 @@ static const struct x86_cpu_id storage_d3_cpu_ids[] = {
> >  	{}
> >  };
> >
> > +static const struct dmi_system_id force_storage_d3_dmi[] = {
> > +	{
> > +		/*
> > +		 * _ADR is ambiguous between GPP1.DEV0 and GPP1.NVME
> > +		 * but .NVME is needed to get StorageD3Enable node
> > +		 * https://bugzilla.kernel.org/show_bug.cgi?id=216440
> > +		 */
> > +		.matches = {
> > +			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
> > +			DMI_MATCH(DMI_PRODUCT_NAME, "Inspiron 14 7425 2-in-1"),
> > +		}
> > +	},
> > +	{}
> > +};
> > +
> >  bool force_storage_d3(void)
> >  {
> > -	return x86_match_cpu(storage_d3_cpu_ids);
> > +	const struct dmi_system_id *dmi_id = dmi_first_match(force_storage_d3_dmi);
> > +
> > +	return dmi_id || x86_match_cpu(storage_d3_cpu_ids);
> >  }
>
> That probably won't work actually as the code has been refactored
> substantially after the commit.
>
> In the ideal case we could confirm the quirk change is the responsible
> commit, so we can make upstream aware.
>
> Regards,
> Salvatore
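One way to avoid marking a commit good too early, given that the lock up can take hours to appear, is to check the uptime before running "git bisect good". The following is only a minimal sketch: the helper names `soaked_long_enough` and `current_uptime_s` are hypothetical, and the six-hour threshold in the usage comment is an assumption, not a figure from this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only when the given uptime (in seconds)
# has reached the required soak time (in seconds).
soaked_long_enough() {
    uptime_s=$1
    required_s=$2
    [ "$uptime_s" -ge "$required_s" ]
}

# Whole seconds of uptime, read from /proc/uptime (Linux-specific).
current_uptime_s() {
    cut -d. -f1 /proc/uptime
}

# Usage sketch (the 6-hour threshold is an assumption):
#   if soaked_long_enough "$(current_uptime_s)" $((6 * 3600)); then
#       git bisect good
#   else
#       echo "not yet: keep the machine running" >&2
#   fi
```

A kernel that locks up never reaches the check, so it is still marked bad by hand after the reboot; the helper only guards the "good" verdict.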
Bug#1036530: linux-signed-amd64: Hard lock up of system
Hi Nick,

On Thu, May 25, 2023 at 08:23:15AM +0900, Nick Hastings wrote:
> Hi,
>
> * Salvatore Bonaccorso [230524 19:26]:
> >
> > Given you were able to bisect it so far, can you try to isolate the
> > commit from the merge commit causing it?
>
> I guess I can try. The commit message states:
>
> Merge: c77f54a9bcec a1cf1fd62ae7 562163595a91 018d6711c26e 6cc401be1648
>
> Is there a way to extract each of those?

The way I usually get all commits from a merge request is

git log --oneline $mergecommit^..$mergecommit^2

though here we have three merge commits, merged with one merge commit
on top, so you would go down the merges of the acpi-properties,
acpi-tables, acpi-x86 and acpi-soc branches. Those are:

* acpi-properties:
  ACPI: property: Silence missing-declarations warning in apple.c

* acpi-tables:
  ACPI: HMAT: Drop unused dev_fmt() and redundant 'HMAT' prefix
  ACPI: tables: FPDT: Don't call acpi_os_map_memory() on invalid phys address

* acpi-x86:
  ACPI: x86: Add a quirk for Dell Inspiron 14 2-in-1 for StorageD3Enable

* acpi-soc:
  ACPI: LPSS: Deduplicate skipping device in acpi_lpss_create_device()
  ACPI: LPSS: Replace loop with first entry retrieval

> > One remotely related might be "ACPI: x86: Add a quirk for Dell
> > Inspiron 14 2-in-1 for StorageD3Enable".
>
> Manually looking at the diff with
> git diff e996c7e01892ac18ec0db447294d4f591c325efe~ e996c7e01892ac18ec0db447294d4f591c325efe
> I guess that means the following:
>
> --- a/drivers/acpi/x86/utils.c
> +++ b/drivers/acpi/x86/utils.c
> @@ -207,9 +207,26 @@ static const struct x86_cpu_id storage_d3_cpu_ids[] = {
>  	{}
>  };
>
> +static const struct dmi_system_id force_storage_d3_dmi[] = {
> +	{
> +		/*
> +		 * _ADR is ambiguous between GPP1.DEV0 and GPP1.NVME
> +		 * but .NVME is needed to get StorageD3Enable node
> +		 * https://bugzilla.kernel.org/show_bug.cgi?id=216440
> +		 */
> +		.matches = {
> +			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
> +			DMI_MATCH(DMI_PRODUCT_NAME, "Inspiron 14 7425 2-in-1"),
> +		}
> +	},
> +	{}
> +};
> +
>  bool force_storage_d3(void)
>  {
> -	return x86_match_cpu(storage_d3_cpu_ids);
> +	const struct dmi_system_id *dmi_id = dmi_first_match(force_storage_d3_dmi);
> +
> +	return dmi_id || x86_match_cpu(storage_d3_cpu_ids);
>  }

That probably won't work actually as the code has been refactored
substantially after the commit.

In the ideal case we could confirm the quirk change is the responsible
commit, so we can make upstream aware.

Regards,
Salvatore
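The per-branch walk described in this message can be sketched as a small shell function. This is an illustration only: `list_merge_branches` is a hypothetical name, and the sketch assumes the first parent of the merge is the mainline and every further parent is the tip of one merged branch.

```shell
#!/bin/sh
# Hypothetical helper: for a merge commit with any number of parents,
# list the commits that each merged branch contributes.
list_merge_branches() {
    merge=$1
    # "git rev-list --parents -n 1" prints: <merge> <parent1> <parent2> ...
    set -- $(git rev-list --parents -n 1 "$merge")
    shift               # drop the merge commit itself
    mainline=$1         # first parent (assumed to be the mainline)
    shift               # remaining arguments are the merged branch tips
    for tip in "$@"; do
        echo "== branch tip $tip"
        # Commits reachable from this tip but not from the mainline.
        git log --oneline "$mainline..$tip"
    done
}
```

Inside a kernel checkout this would be called as `list_merge_branches e996c7e01892ac18ec0db447294d4f591c325efe`, printing one group of commits per merged branch.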
Bug#1036530: linux-signed-amd64: Hard lock up of system
Hi,

* Salvatore Bonaccorso [230524 19:26]:
>
> Given you were able to bisect it so far, can you try to isolate the
> commit from the merge commit causing it?

I guess I can try. The commit message states:

Merge: c77f54a9bcec a1cf1fd62ae7 562163595a91 018d6711c26e 6cc401be1648

Is there a way to extract each of those?

> One remotely related might be "ACPI: x86: Add a quirk for Dell
> Inspiron 14 2-in-1 for StorageD3Enable".

Manually looking at the diff with
git diff e996c7e01892ac18ec0db447294d4f591c325efe~ e996c7e01892ac18ec0db447294d4f591c325efe
I guess that means the following:

--- a/drivers/acpi/x86/utils.c
+++ b/drivers/acpi/x86/utils.c
@@ -207,9 +207,26 @@ static const struct x86_cpu_id storage_d3_cpu_ids[] = {
 	{}
 };

+static const struct dmi_system_id force_storage_d3_dmi[] = {
+	{
+		/*
+		 * _ADR is ambiguous between GPP1.DEV0 and GPP1.NVME
+		 * but .NVME is needed to get StorageD3Enable node
+		 * https://bugzilla.kernel.org/show_bug.cgi?id=216440
+		 */
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+			DMI_MATCH(DMI_PRODUCT_NAME, "Inspiron 14 7425 2-in-1"),
+		}
+	},
+	{}
+};
+
 bool force_storage_d3(void)
 {
-	return x86_match_cpu(storage_d3_cpu_ids);
+	const struct dmi_system_id *dmi_id = dmi_first_match(force_storage_d3_dmi);
+
+	return dmi_id || x86_match_cpu(storage_d3_cpu_ids);
 }

Thanks,

Nick.
Bug#1036530: linux-signed-amd64: Hard lock up of system
Control: tags -1 + moreinfo

Hi Nick,

On Mon, May 22, 2023 at 08:56:12AM +0900, Nick Hastings wrote:
> Source: linux-signed-amd64
> Severity: important
> Tags: upstream
> X-Debbugs-Cc: nicholaschasti...@gmail.com
>
> Dear Maintainer,
>
> after upgrading from a 6.0.0 kernel to a 6.1.0 kernel I experienced
> hard lockups on my Dell XPS 15 7590 a few minutes after each boot. On
> more than one occasion I was on the console and was able to see the
> error message. It was the same error on each occasion, and I reproduce
> it here:
>
> [ 58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> [ 58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> [ 60.083261] vfio-pci :01:00.0 Unable to change power state from D3cold to D0, device inaccessible
>
> N.B. the message on the console was recorded with a photograph and
> then manually typed in, so it is possible that it may contain one or
> more errors.
>
> I ran git bisect as described at
> https://wiki.debian.org/DebianKernel/GitBisect which seems to have
> found the bad commit. It is a merge commit that deals with ACPI code.
> However I don't see what may actually be causing this issue.
> The commit is e996c7e01892ac18ec0db447294d4f591c325efe
>
> Please find the report from git bisect below.

Given you were able to bisect it so far, can you try to isolate the
commit from the merge commit causing it?

One remotely related might be "ACPI: x86: Add a quirk for Dell
Inspiron 14 2-in-1 for StorageD3Enable".

Regards,
Salvatore
Bug#1036530: linux-signed-amd64: Hard lock up of system
Source: linux-signed-amd64
Severity: important
Tags: upstream
X-Debbugs-Cc: nicholaschasti...@gmail.com

Dear Maintainer,

after upgrading from a 6.0.0 kernel to a 6.1.0 kernel I experienced
hard lockups on my Dell XPS 15 7590 a few minutes after each boot. On
more than one occasion I was on the console and was able to see the
error message. It was the same error on each occasion, and I reproduce
it here:

[ 58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
[ 58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
[ 60.083261] vfio-pci :01:00.0 Unable to change power state from D3cold to D0, device inaccessible

N.B. the message on the console was recorded with a photograph and
then manually typed in, so it is possible that it may contain one or
more errors.

I ran git bisect as described at
https://wiki.debian.org/DebianKernel/GitBisect which seems to have
found the bad commit. It is a merge commit that deals with ACPI code.
However I don't see what may actually be causing this issue.
The commit is e996c7e01892ac18ec0db447294d4f591c325efe

Please find the report from git bisect below.

Regards,

Nick.

-- System Information:
Debian Release: 12.0
  APT prefers testing
  APT policy: (990, 'testing'), (500, 'unstable')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 6.0.0-rc6-1-g018d6711c26e (SMP w/16 CPU threads; PREEMPT)
Locale: LANG=en_AU.UTF-8, LC_CTYPE=en_AU.UTF-8 (charmap=UTF-8), LANGUAGE=en_AU:en
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

% git bisect good
e996c7e01892ac18ec0db447294d4f591c325efe is the first bad commit
commit e996c7e01892ac18ec0db447294d4f591c325efe
Merge: c77f54a9bcec a1cf1fd62ae7 562163595a91 018d6711c26e 6cc401be1648
Author: Rafael J. Wysocki
Date: Fri Sep 30 20:52:39 2022 +0200

    Merge branches 'acpi-properties', 'acpi-tables', 'acpi-x86' and 'acpi-soc'

    Merge changes related to ACPI data-only tables handling and ACPI
    device properties management, x86-specific ACPI code changes and
    ACPI SoC driver changes for 6.1-rc1:

     - Clean up the ACPI LPSS (Intel SoC) driver (Andy Shevchenko).

     - Add a quirk for Dell Inspiron 14 2-in-1 for StorageD3Enable
       (Mario Limonciello).

     - Drop unused dev_fmt() and redundant 'HMAT' prefix from the HMAT
       parsing code (Liu Shixin).

     - Make ACPI FPDT parsing code avoid calling acpi_os_map_memory() on
       invalid physical addresses (Hans de Goede).

     - Silence missing-declarations warning related to Apple device
       properties management (Lukas Wunner).

    * acpi-properties:
      ACPI: property: Silence missing-declarations warning in apple.c

    * acpi-tables:
      ACPI: HMAT: Drop unused dev_fmt() and redundant 'HMAT' prefix
      ACPI: tables: FPDT: Don't call acpi_os_map_memory() on invalid phys address

    * acpi-x86:
      ACPI: x86: Add a quirk for Dell Inspiron 14 2-in-1 for StorageD3Enable

    * acpi-soc:
      ACPI: LPSS: Deduplicate skipping device in acpi_lpss_create_device()
      ACPI: LPSS: Replace loop with first entry retrieval

 drivers/acpi/acpi_fpdt.c | 22 ++
 drivers/acpi/acpi_lpss.c | 45 +
 drivers/acpi/numa/hmat.c | 25 -
 drivers/acpi/x86/apple.c | 1 +
 drivers/acpi/x86/utils.c | 19 ++-
 5 files changed, 74 insertions(+), 38 deletions(-)
[0 running job(s)] {history#6810} 2023-05-20 20:54:16