Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-07-07 Thread Lyude Paul
On Thu, 2023-06-01 at 11:18 -0500, Limonciello, Mario wrote:
> +Lyude, Lukas, Karol
> 
> On 5/31/2023 6:40 PM, Nick Hastings wrote:
> > Hi,
> > 
> > * Nick Hastings  [230530 16:01]:
> > > * Mario Limonciello  [230530 13:00]:
> > 
> > > > As you're actually loading nouveau, can you please try nouveau.runpm=0
> > > > on the kernel command line?
> > > I'm not intentionally loading it. This machine also has intel graphics
> > > which is what I prefer. Checking my
> > > /etc/modprobe.d/blacklist-nvidia-nouveau.conf
> > > I see:
> > > 
> > > blacklist nvidia
> > > blacklist nvidia-drm
> > > blacklist nvidia-modeset
> > > blacklist nvidia-uvm
> > > blacklist ipmi_msghandler
> > > blacklist ipmi_devintf
> > > 
> > > So I thought I had blacklisted it but it seems I did not. Since I do not
> > > want to use it maybe it is better to check if the lock up occurs with
> > > nouveau blacklisted. I will try that now.
> > I blacklisted nouveau and booted into a 6.1 kernel:
> > % uname -a
> > Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux
> > 
> > It has been running without problems for nearly two days now:
> > % uptime
> >   08:34:48 up 1 day, 16:22,  2 users,  load average: 1.33, 1.26, 1.27
> > 
> > Regards,
> > 
> > Nick.
> 
> Thanks, that makes a lot more sense now.
> 
> Nick, Can you please test if nouveau works with runtime PM in the
> latest 6.4-rc?
> 
> If it works in 6.4-rc, there are probably nouveau commits that need
> to be backported to 6.1 LTS.
> 
> If it's still broken in 6.4-rc, I believe you should file a bug:
> 
> https://gitlab.freedesktop.org/drm/nouveau/
> 
> 
> Lyude, Lukas, Karol
> 
> This thread is in relation to this commit:
> 
> 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
> 
> Nick has found that runtime PM is *not* working for nouveau.
> 
> If you recall we did 24867516f06d because 5775b843a619 was
> supposed to have fixed it.

Gotcha. I guess keep me updated, since from what I gathered here it seems like
things -might- be working? Happy to look further if they find that 6.4-rc is
broken, though.

> 

-- 
Cheers,
 Lyude Paul (she/her)
 Software Engineer at Red Hat



Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-06-30 Thread Nick Hastings
Hi,

* Limonciello, Mario  [230701 06:40]:
> 
> > > Nevertheless: thx for your report and your help through this thread.
> > 
> > No problem. I am willing to try to do more, but right now I don't know
> > how to do what has been suggested.
> > 
> 
> Here is where to report Nouveau bugs:
> 
> https://gitlab.freedesktop.org/drm/nouveau/-/issues/

Thanks.

Done: https://gitlab.freedesktop.org/drm/nouveau/-/issues/241

Cheers,

Nick.



Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-06-30 Thread Limonciello, Mario




> > Nevertheless: thx for your report and your help through this thread.
>
> No problem. I am willing to try to do more, but right now I don't know
> how to do what has been suggested.

Here is where to report Nouveau bugs:

https://gitlab.freedesktop.org/drm/nouveau/-/issues/



Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-06-30 Thread Nick Hastings
Hi,

* Thorsten Leemhuis  [230630 22:02]:
> On 27.06.23 00:34, Nick Hastings wrote:
> > * Linux regression tracking (Thorsten Leemhuis) [230626 21:09]:
> >> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
> >> for once, to make this easily accessible to everyone.
> >>
> >> Nick, what's the status/was there any progress? Did you do what Mario
> >> suggested and file a nouveau bug?
> > 
> > It was not apparent that the suggestion to open "a Nouveau drm bug" was
> > addressed to me.
> 
> I wish things were easier for reporters, but from what I can see this
> is the only way forward if you or some silent bystander cares.

In principle I can open another bug report, but I don't know how or
where to report "a Nouveau drm bug". Please keep in mind that I'm just
an end user. I learnt to use git bisect specifically because of this
bug. Prior to that, I hadn't compiled a kernel in about 15 years.

> >> I ask, as I still have this on my list of regressions and it seems there
> >> was no progress in three+ weeks now.
> > 
> > I have not pursued this further since as far as I could tell I already
> > provided all requested information and I don't actually use nouveau, so
> > I blacklisted it.
> 
> I doubt any developer cares enough to take a closer look[1] without a
> proper nouveau bug and some help & prodding from someone affected. And
> looks to me like reverting the culprit now might create even bigger
> problems for users.

If someone can point me to some docs about reporting nouveau bugs, I
can look into it.

> Hence I guess then this won't be fixed in the end. In an ideal world this
> would not happen, but we don't live in one and all have just 24 hours in
> a day. :-/

This is a very common Dell XPS 15 7590 so I expect many people could
experience this issue. Or maybe like me they only use the intel GPU.

> Nevertheless: thx for your report and your help through this thread.

No problem. I am willing to try to do more, but right now I don't know
how to do what has been suggested.

Cheers,

Nick.

> [1] some points on the following page kinda explain this
> https://linux-regtracking.leemhuis.info/post/frequent-reasons-why-linux-kernel-bug-reports-are-ignored/
> 
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
> 
> #regzbot inconclusive: reporting deadlock (see thread for details)

Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-06-30 Thread Karol Herbst
On Fri, Jun 30, 2023 at 3:02 PM Thorsten Leemhuis
 wrote:
>
> On 27.06.23 00:34, Nick Hastings wrote:
> > * Linux regression tracking (Thorsten Leemhuis) [230626 21:09]:
> >> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
> >> for once, to make this easily accessible to everyone.
> >>
> >> Nick, what's the status/was there any progress? Did you do what Mario
> >> suggested and file a nouveau bug?
> >
> > It was not apparent that the suggestion to open "a Nouveau drm bug" was
> > addressed to me.
>
> I wish things were easier for reporters, but from what I can see this
> is the only way forward if you or some silent bystander cares.
>
> >> I ask, as I still have this on my list of regressions and it seems there
> >> was no progress in three+ weeks now.
> >
> > I have not pursued this further since as far as I could tell I already
> > provided all requested information and I don't actually use nouveau, so
> > I blacklisted it.
>
> I doubt any developer cares enough to take a closer look[1] without a
> proper nouveau bug and some help & prodding from someone affected. And
> looks to me like reverting the culprit now might create even bigger
> problems for users.
>
> Hence I guess then this won't be fixed in the end. In an ideal world this
> would not happen, but we don't live in one and all have just 24 hours in
> a day. :-/
>

We recently merged this commit:
https://gitlab.freedesktop.org/drm/nouveau/-/commit/11d24327c2d7ad7f24fcc44fb00e1fa91ebf6525

It might resolve the problem. Worth testing at least, but I can't
remember if this was a hybrid AMD/Nvidia system, but I think it was?

> Nevertheless: thx for your report and your help through this thread.
>
> [1] some points on the following page kinda explain this
> https://linux-regtracking.leemhuis.info/post/frequent-reasons-why-linux-kernel-bug-reports-are-ignored/
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>
> #regzbot inconclusive: reporting deadlock (see thread for details)

Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-06-30 Thread Thorsten Leemhuis
On 27.06.23 00:34, Nick Hastings wrote:
> * Linux regression tracking (Thorsten Leemhuis) [230626 21:09]:
>> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
>> for once, to make this easily accessible to everyone.
>>
>> Nick, what's the status/was there any progress? Did you do what Mario
>> suggested and file a nouveau bug?
> 
> It was not apparent that the suggestion to open "a Nouveau drm bug" was
> addressed to me.

I wish things were easier for reporters, but from what I can see this
is the only way forward if you or some silent bystander cares.

>> I ask, as I still have this on my list of regressions and it seems there
>> was no progress in three+ weeks now.
> 
> I have not pursued this further since as far as I could tell I already
> provided all requested information and I don't actually use nouveau, so
> I blacklisted it.

I doubt any developer cares enough to take a closer look[1] without a
proper nouveau bug and some help & prodding from someone affected. And
looks to me like reverting the culprit now might create even bigger
problems for users.

Hence I guess then this won't be fixed in the end. In an ideal world this
would not happen, but we don't live in one and all have just 24 hours in
a day. :-/

Nevertheless: thx for your report and your help through this thread.

[1] some points on the following page kinda explain this
https://linux-regtracking.leemhuis.info/post/frequent-reasons-why-linux-kernel-bug-reports-are-ignored/

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot inconclusive: reporting deadlock (see thread for details)




Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-06-26 Thread Nick Hastings
Hi Thorsten,

* Linux regression tracking (Thorsten Leemhuis) [230626 21:09]:
> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
> for once, to make this easily accessible to everyone.
> 
> Nick, what's the status/was there any progress? Did you do what Mario
> suggested and file a nouveau bug?

It was not apparent that the suggestion to open "a Nouveau drm bug" was
addressed to me.

> I ask, as I still have this on my list of regressions and it seems there
> was no progress in three+ weeks now.

I have not pursued this further since as far as I could tell I already
provided all requested information and I don't actually use nouveau, so
I blacklisted it.

Regards,

Nick.

> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
> 
> #regzbot backburner: slow progress, likely just affects one machine
> #regzbot poke

Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-06-26 Thread Linux regression tracking (Thorsten Leemhuis)
Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
for once, to make this easily accessible to everyone.

Nick, what's the status/was there any progress? Did you do what Mario
suggested and file a nouveau bug?

I ask, as I still have this on my list of regressions and it seems there
was no progress in three+ weeks now.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot backburner: slow progress, likely just affects one machine
#regzbot poke


On 02.06.23 02:57, Limonciello, Mario wrote:
> [AMD Official Use Only - General]
> 
>> -Original Message-
>> From: Nick Hastings 
>> Sent: Thursday, June 1, 2023 7:02 PM
>> To: Karol Herbst 
>> Cc: Limonciello, Mario ; Lyude Paul
>> ; Lukas Wunner ; Salvatore
>> Bonaccorso ; 1036...@bugs.debian.org; Rafael J.
>> Wysocki ; Len Brown ; linux-
>> a...@vger.kernel.org; linux-ker...@vger.kernel.org;
>> regressi...@lists.linux.dev
>> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
>> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
>>
>> Hi,
>>
>> * Karol Herbst  [230602 03:10]:
>>> On Thu, Jun 1, 2023 at 7:21 PM Limonciello, Mario
>>>  wrote:
>>>>> -Original Message-
>>>>> From: Karol Herbst 
>>>>> Sent: Thursday, June 1, 2023 12:19 PM
>>>>> To: Limonciello, Mario 
>>>>> Cc: Nick Hastings ; Lyude Paul
>>>>> ; Lukas Wunner ; Salvatore
>>>>> Bonaccorso ; 1036...@bugs.debian.org; Rafael J.
>>>>> Wysocki ; Len Brown ; linux-
>>>>> a...@vger.kernel.org; linux-ker...@vger.kernel.org;
>>>>> regressi...@lists.linux.dev
>>>>> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
>>>>> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of
>> system)
>>>>>
>>>>> On Thu, Jun 1, 2023 at 6:54 PM Limonciello, Mario
>>>>>  wrote:
>>>>>>
>>>>>> [AMD Official Use Only - General]
>>>>>>
>>>>>>> -Original Message-
>>>>>>> From: Karol Herbst 
>>>>>>> Sent: Thursday, June 1, 2023 11:33 AM
>>>>>>> To: Limonciello, Mario 
>>>>>>> Cc: Nick Hastings ; Lyude Paul
>>>>>>> ; Lukas Wunner ; Salvatore
>>>>>>> Bonaccorso ; 1036...@bugs.debian.org; Rafael
>> J.
>>>>>>> Wysocki ; Len Brown ; linux-
>>>>>>> a...@vger.kernel.org; linux-ker...@vger.kernel.org;
>>>>>>> regressi...@lists.linux.dev
>>>>>>> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video
>> _OSI
>>>>>>> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of
>>>>> system)
>>>>>>>
>>>>>>> On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario
>>>>>>>>
>>>>>>>> Lyude, Lukas, Karol
>>>>>>>>
>>>>>>>> This thread is in relation to this commit:
>>>>>>>>
>>>>>>>> 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
>>>>>>>>
>>>>>>>> Nick has found that runtime PM is *not* working for nouveau.
>>>>>>>>
>>>>>>>
>>>>>>> keep in mind we have a list of PCIe controllers where we apply a
>>>>>>> workaround:
>>>>>>>
>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682
>>>>>>>
>>>>>>> And I suspect there might be one or two more IDs we'll have to add
>>>>>>> there. Do we have any logs?
>>>>>>
>>>>>> There's some archived onto the distro bug.  Search this page for
>>>>>> "journalctl.log.gz"
>>>>>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036530
>>>>>>
>>>>>
>>>>> interesting.. It seems to be the same controller used here. I wonder
>>>>> if the pci topology is different or if the workaround is applie

Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-06-02 Thread Sandy Shores
 line #276 - # 298 of dmesg... attached to Message # 62

[    0.066966] smpboot: CPU0: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz (family: 0x6, model: 0x9e, stepping: 0xd)
[    0.066966] cblist_init_generic: Setting adjustable number of callback queues.
[    0.066966] cblist_init_generic: Setting shift to 4 and lim to 1.
[    0.066966] cblist_init_generic: Setting shift to 4 and lim to 1.
[    0.066966] cblist_init_generic: Setting shift to 4 and lim to 1.
[    0.066966] Performance Events: PEBS fmt3+, Skylake events, 32-deep LBR, full-width counters, Intel PMU driver.
[    0.066966] ... version:                4
[    0.066966] ... bit width:              48
[    0.066966] ... generic registers:      4
[    0.066966] ... value mask:
[    0.066966] ... max period:             7fff
[    0.066966] ... fixed-purpose events:   3
[    0.066966] ... event mask:             0007000f
[    0.066966] Estimated ratio of average max frequency by base frequency (times 1024): 2005
[    0.066966] rcu: Hierarchical SRCU implementation.
[    0.066966] rcu:     Max phase no-delay instances is 1000.
[    0.066966] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[    0.066966] smp: Bringing up secondary CPUs ...
[    0.066966] x86: Booting SMP configuration:
[    0.066966]  node  #0, CPUs:        #1  #2  #3  #4  #5  #6  #7  #8
[    0.077241] MMIO Stale Data CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/processor_mmio_stale_data.html for more details.
[    0.077241]   #9 #10 #11 #12 #13 #14 #15
[    0.088667] smp: Brought up 1 node, 16 CPUs

 compare to lspci -tvnn posted in Message # 133

lspci -tvnn
-[:00]-+-00.0  Intel Corporation Device [8086:3e20]
       +-01.0-[01]00.0  NVIDIA Corporation TU117M [GeForce GTX 1650 Mobile / Max-Q] [10de:1f91]
       +-02.0  Intel Corporation CoffeeLake-H GT2 [UHD Graphics 630] [8086:3e9b]
       +-04.0  Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem [8086:1903]
       +-08.0  Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model [8086:1911]
       +-12.0  Intel Corporation Cannon Lake PCH Thermal Controller [8086:a379]
       +-14.0  Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller [8086:a36d]
       +-14.2  Intel Corporation Cannon Lake PCH Shared SRAM [8086:a36f]
       +-15.0  Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 [8086:a368]
       +-15.1  Intel Corporation Cannon Lake PCH Serial IO I2C Controller #1 [8086:a369]
       +-16.0  Intel Corporation Cannon Lake PCH HECI Controller [8086:a360]
       +-17.0  Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller [8086:a353]
       +-1b.0-[02-3a]00.0-[03-3a]--+-00.0-[04]00.0  Intel Corporation JHL6340 Thunderbolt 3 NHI (C step) [Alpine Ridge 2C 2016] [8086:15d9]
       |                           +-01.0-[05-39]--
       |                           \-02.0-[3a]00.0  Intel Corporation JHL6340 Thunderbolt 3 USB 3.1 Controller (C step) [Alpine Ridge 2C 2016] [8086:15db]
       +-1c.0-[3b]00.0  Intel Corporation Wi-Fi 6 AX200 [8086:2723]
       +-1c.4-[3c]00.0  Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader [10ec:525a]
       +-1d.0-[3d]00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
       +-1f.0  Intel Corporation Cannon Lake LPC Controller [8086:a30e]
       +-1f.3  Intel Corporation Cannon Lake PCH cAVS [8086:a348]
       +-1f.4  Intel Corporation Cannon Lake PCH SMBus Controller [8086:a323]
       \-1f.5  Intel Corporation Cannon Lake PCH SPI Controller [8086:a324]

I hope this is not noise! much gratitude


Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-06-01 Thread Limonciello, Mario
[AMD Official Use Only - General]

> -Original Message-
> From: Nick Hastings 
> Sent: Thursday, June 1, 2023 7:02 PM
> To: Karol Herbst 
> Cc: Limonciello, Mario ; Lyude Paul
> ; Lukas Wunner ; Salvatore
> Bonaccorso ; 1036...@bugs.debian.org; Rafael J.
> Wysocki ; Len Brown ; linux-
> a...@vger.kernel.org; linux-ker...@vger.kernel.org;
> regressi...@lists.linux.dev
> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
>
> Hi,
>
> * Karol Herbst  [230602 03:10]:
> > On Thu, Jun 1, 2023 at 7:21 PM Limonciello, Mario
> >  wrote:
> > > > -Original Message-
> > > > From: Karol Herbst 
> > > > Sent: Thursday, June 1, 2023 12:19 PM
> > > > To: Limonciello, Mario 
> > > > Cc: Nick Hastings ; Lyude Paul
> > > > ; Lukas Wunner ; Salvatore
> > > > Bonaccorso ; 1036...@bugs.debian.org; Rafael J.
> > > > Wysocki ; Len Brown ; linux-
> > > > a...@vger.kernel.org; linux-ker...@vger.kernel.org;
> > > > regressi...@lists.linux.dev
> > > > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> > > > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of
> system)
> > > >
> > > > On Thu, Jun 1, 2023 at 6:54 PM Limonciello, Mario
> > > >  wrote:
> > > > >
> > > > > [AMD Official Use Only - General]
> > > > >
> > > > > > -Original Message-
> > > > > > From: Karol Herbst 
> > > > > > Sent: Thursday, June 1, 2023 11:33 AM
> > > > > > To: Limonciello, Mario 
> > > > > > Cc: Nick Hastings ; Lyude Paul
> > > > > > ; Lukas Wunner ; Salvatore
> > > > > > Bonaccorso ; 1036...@bugs.debian.org; Rafael
> J.
> > > > > > Wysocki ; Len Brown ; linux-
> > > > > > a...@vger.kernel.org; linux-ker...@vger.kernel.org;
> > > > > > regressi...@lists.linux.dev
> > > > > > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video
> _OSI
> > > > > > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of
> > > > system)
> > > > > >
> > > > > > On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario
> > > > > > >
> > > > > > > Lyude, Lukas, Karol
> > > > > > >
> > > > > > > This thread is in relation to this commit:
> > > > > > >
> > > > > > > 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
> > > > > > >
> > > > > > > Nick has found that runtime PM is *not* working for nouveau.
> > > > > > >
> > > > > >
> > > > > > keep in mind we have a list of PCIe controllers where we apply a
> > > > > > workaround:
> > > > > >
> > > >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers
> > > > > > /gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682
> > > > > >
> > > > > > And I suspect there might be one or two more IDs we'll have to add
> > > > > > there. Do we have any logs?
> > > > >
> > > > > > There's some archived onto the distro bug.  Search this page for
> > > > > > "journalctl.log.gz"
> > > > > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036530
> > > > >
> > > >
> > > > interesting.. It seems to be the same controller used here. I wonder
> > > > if the pci topology is different or if the workaround is applied at
> > > > all.
> > >
> > > I didn't see the message in the log about the workaround being applied
> > > in that log, so I guess PCI topology difference is a likely suspect.
> > >
> >
> > yeah, but I also couldn't see a log with the usual nouveau messages,
> > so it's kinda weird.
> >
> > Anyway, the output of `lspci -tvnn` would help
>
> % lspci -tvnn
> -[:00]-+-00.0  Intel Corporation Device [8086:3e20]
>        +-01.0-[01]00.0  NVIDIA Corporation TU117M [GeForce GTX 1650 Mobile / Max-Q] [10de:1f91]

So the bridge it's connected to is the same one that the quirk *should have been*
triggering on.

May 29 15:02:42 xps kernel: pci :00:01.0: [8086:1901] type 01 class 0x060400

Since the quirk isn't wor

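Archive note: the comparison made in this message (which bridge the GPU sits
behind, and whether the nouveau workaround list should match it) can be
sketched as below. This is an illustration under assumptions, not the kernel's
code: QUIRK_BRIDGE_IDS, pci_id_from_log_line and bridge_is_quirked are
hypothetical names, and the only ID in the list is the 8086:1901 bridge taken
from the log line quoted above, not copied from nouveau_drm.c.

```shell
#!/bin/sh
# Hypothetical list of bridge IDs for illustration; only 8086:1901 comes from
# the message above.
QUIRK_BRIDGE_IDS="8086:1901"

# Print the last [vendor:device] pair found in a log or lspci line, without
# the brackets. Prints nothing if the line contains no such pair.
pci_id_from_log_line() {
    printf '%s\n' "$1" |
        sed -n 's/.*\[\([0-9a-fA-F]\{4\}:[0-9a-fA-F]\{4\}\)\].*/\1/p'
}

# Succeed if the given vendor:device ID is in QUIRK_BRIDGE_IDS.
bridge_is_quirked() {
    id=$1
    for known in $QUIRK_BRIDGE_IDS; do
        if [ "$id" = "$known" ]; then
            return 0
        fi
    done
    return 1
}
```

Feeding the "pci :00:01.0: [8086:1901] type 01 class 0x060400" line to
pci_id_from_log_line yields 8086:1901, which bridge_is_quirked accepts. That
is the sense in which the quirk "should have been" triggering.
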
Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-06-01 Thread Nick Hastings
Hi,

* Karol Herbst  [230602 03:10]:
> On Thu, Jun 1, 2023 at 7:21 PM Limonciello, Mario
>  wrote:
> > > -Original Message-
> > > From: Karol Herbst 
> > > Sent: Thursday, June 1, 2023 12:19 PM
> > > To: Limonciello, Mario 
> > > Cc: Nick Hastings ; Lyude Paul
> > > ; Lukas Wunner ; Salvatore
> > > Bonaccorso ; 1036...@bugs.debian.org; Rafael J.
> > > Wysocki ; Len Brown ; linux-
> > > a...@vger.kernel.org; linux-ker...@vger.kernel.org;
> > > regressi...@lists.linux.dev
> > > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> > > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of 
> > > system)
> > >
> > > On Thu, Jun 1, 2023 at 6:54 PM Limonciello, Mario
> > >  wrote:
> > > >
> > > > [AMD Official Use Only - General]
> > > >
> > > > > -Original Message-
> > > > > From: Karol Herbst 
> > > > > Sent: Thursday, June 1, 2023 11:33 AM
> > > > > To: Limonciello, Mario 
> > > > > Cc: Nick Hastings ; Lyude Paul
> > > > > ; Lukas Wunner ; Salvatore
> > > > > Bonaccorso ; 1036...@bugs.debian.org; Rafael J.
> > > > > Wysocki ; Len Brown ; linux-
> > > > > a...@vger.kernel.org; linux-ker...@vger.kernel.org;
> > > > > regressi...@lists.linux.dev
> > > > > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> > > > > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of
> > > system)
> > > > >
> > > > > On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario
> > > > > >
> > > > > > Lyude, Lukas, Karol
> > > > > >
> > > > > > This thread is in relation to this commit:
> > > > > >
> > > > > > 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
> > > > > >
> > > > > > Nick has found that runtime PM is *not* working for nouveau.
> > > > > >
> > > > >
> > > > > keep in mind we have a list of PCIe controllers where we apply a
> > > > > workaround:
> > > > >
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682
> > > > >
> > > > > And I suspect there might be one or two more IDs we'll have to add
> > > > > there. Do we have any logs?
> > > >
> > > > There's some archived onto the distro bug.  Search this page for
> > > > "journalctl.log.gz"
> > > > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036530
> > > >
> > >
> > > interesting.. It seems to be the same controller used here. I wonder
> > > if the pci topology is different or if the workaround is applied at
> > > all.
> >
> > I didn't see the message in the log about the workaround being applied
> > in that log, so I guess PCI topology difference is a likely suspect.
> >
>
> yeah, but I also couldn't see a log with the usual nouveau messages,
> so it's kinda weird.
>
> Anyway, the output of `lspci -tvnn` would help

% lspci -tvnn
-[:00]-+-00.0  Intel Corporation Device [8086:3e20]
   +-01.0-[01]00.0  NVIDIA Corporation TU117M [GeForce GTX 1650 Mobile / Max-Q] [10de:1f91]
   +-02.0  Intel Corporation CoffeeLake-H GT2 [UHD Graphics 630] [8086:3e9b]
   +-04.0  Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem [8086:1903]
   +-08.0  Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model [8086:1911]
   +-12.0  Intel Corporation Cannon Lake PCH Thermal Controller [8086:a379]
   +-14.0  Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller [8086:a36d]
   +-14.2  Intel Corporation Cannon Lake PCH Shared SRAM [8086:a36f]
   +-15.0  Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 [8086:a368]
   +-15.1  Intel Corporation Cannon Lake PCH Serial IO I2C Controller #1 [8086:a369]
   +-16.0  Intel Corporation Cannon Lake PCH HECI Controller [8086:a360]
   +-17.0  Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller [8086:a353]
   +-1b.0-[02-3a]00.0-[03-3a]--+-00.0-[04]00.0  Intel Corporation JHL6340 Thunderbolt 3 NHI (C step) [Alpine Ridge 2C 2016] [8086:15d9]
   |   +-01.0-[05-39]--
   |   \-02.0-[3a]00.0  Intel Corporation JHL6340 Thunderbolt 3 USB 3.1 Controller (C step) [Alpine Ridge 2C 2016] [8086:15db]
   +-1c.0-[3b]00.0  Intel Corporation Wi-Fi 6 AX200 [8086:2723]
   +-1c.4-[3c]00.0  Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader [10ec:525a]
   +-1d.0-[3d]00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
   +-1f.0  Intel Corporation Cannon Lake LPC Controller [8086:a30e]
   +-1f.3  Intel Corporation Cannon Lake PCH cAVS [8086:a348]
   +-1f.4  Intel Corporation Cannon Lake PCH SMBus Controller [8086:a323]
   \-1f.5  Intel Corporation Cannon Lake PCH SPI Controller [8086:a324]


Regards,

Nick.



Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-06-01 Thread Nick Hastings
Hi,

* Limonciello, Mario  [230602 01:18]:
> +Lyude, Lukas, Karol
> 
> On 5/31/2023 6:40 PM, Nick Hastings wrote:
> > 
> > * Nick Hastings  [230530 16:01]:
> > > * Mario Limonciello  [230530 13:00]:
> > 
> > > > As you're actually loading nouveau, can you please try nouveau.runpm=0 on
> > > > the kernel command line?
> > > I'm not intentionally loading it. This machine also has intel graphics
> > > which is what I prefer. Checking my
> > > /etc/modprobe.d/blacklist-nvidia-nouveau.conf
> > > I see:
> > > 
> > > blacklist nvidia
> > > blacklist nvidia-drm
> > > blacklist nvidia-modeset
> > > blacklist nvidia-uvm
> > > blacklist ipmi_msghandler
> > > blacklist ipmi_devintf
> > > 
> > > So I thought I had blacklisted it but it seems I did not. Since I do not
> > > want to use it maybe it is better to check if the lock up occurs with
> > > nouveau blacklisted. I will try that now.
> > I blacklisted nouveau and booted into a 6.1 kernel:
> > % uname -a
> > Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux
> > 
> > It has been running without problems for nearly two days now:
> > % uptime
> >   08:34:48 up 1 day, 16:22,  2 users,  load average: 1.33, 1.26, 1.27
> > 
> > Regards,
> > 
> > Nick.
> 
> Thanks, that makes a lot more sense now.
> 
> Nick, Can you please test if nouveau works with runtime PM in the
> latest 6.4-rc?

I reported this twice already. I guess it was lost since for some
reason emails in this thread are not being trimmed. I'll repeat here:

I did eventually see a lockup of this kernel. On the console I saw:

[  151.035036] vfio-pci :01:00.0 Unable to change power state from D3cold to D0, device inaccessible

I did not see the other two lines that were present in earlier lock ups.

Regards,

Nick.
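
The console messages reported in this thread contain recognizable substrings, so a saved kernel log can be checked for them mechanically. Below is a minimal Python sketch, assuming the log has already been exported to plain text (for example from journalctl). The names SIGNATURES and find_lockup_signatures are made up for illustration; only the substrings come from the messages quoted in this thread.

```python
# Substrings taken from the console messages quoted in this thread.
# The dictionary keys are hypothetical labels chosen for this sketch.
SIGNATURES = {
    "acpi_loop_timeout": "AE_AML_LOOP_TIMEOUT",
    "d3cold_resume_failure": "Unable to change power state from D3cold to D0",
}


def find_lockup_signatures(log_text):
    """Map each signature label to the list of log lines that contain it."""
    hits = {name: [] for name in SIGNATURES}
    for line in log_text.splitlines():
        for name, needle in SIGNATURES.items():
            if needle in line:
                hits[name].append(line.strip())
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python3 this_script.py exported-kernel-log.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for name, lines in find_lockup_signatures(handle.read()).items():
            print(f"{name}: {len(lines)} matching line(s)")
```

A log where only the D3cold line appears, as in the lockup described above, would show matches for one label and none for the other.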



Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-06-01 Thread Karol Herbst
On Thu, Jun 1, 2023 at 7:21 PM Limonciello, Mario
 wrote:
>
> [AMD Official Use Only - General]
>
> > -Original Message-
> > From: Karol Herbst 
> > Sent: Thursday, June 1, 2023 12:19 PM
> > To: Limonciello, Mario 
> > Cc: Nick Hastings ; Lyude Paul
> > ; Lukas Wunner ; Salvatore
> > Bonaccorso ; 1036...@bugs.debian.org; Rafael J.
> > Wysocki ; Len Brown ; linux-
> > a...@vger.kernel.org; linux-ker...@vger.kernel.org;
> > regressi...@lists.linux.dev
> > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
> >
> > On Thu, Jun 1, 2023 at 6:54 PM Limonciello, Mario
> >  wrote:
> > >
> > > [AMD Official Use Only - General]
> > >
> > > > -Original Message-
> > > > From: Karol Herbst 
> > > > Sent: Thursday, June 1, 2023 11:33 AM
> > > > To: Limonciello, Mario 
> > > > Cc: Nick Hastings ; Lyude Paul
> > > > ; Lukas Wunner ; Salvatore
> > > > Bonaccorso ; 1036...@bugs.debian.org; Rafael J.
> > > > Wysocki ; Len Brown ; linux-
> > > > a...@vger.kernel.org; linux-ker...@vger.kernel.org;
> > > > regressi...@lists.linux.dev
> > > > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> > > > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of
> > system)
> > > >
> > > > On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario
> > > >  wrote:
> > > > >
> > > > > +Lyude, Lukas, Karol
> > > > >
> > > > > On 5/31/2023 6:40 PM, Nick Hastings wrote:
> > > > > > Hi,
> > > > > >
> > > > > > * Nick Hastings  [230530 16:01]:
> > > > > >> * Mario Limonciello  [230530 13:00]:
> > > > > > 
> > > > > > >>> As you're actually loading nouveau, can you please try nouveau.runpm=0 on
> > > > > > >>> the kernel command line?
> > > > > > >> I'm not intentionally loading it. This machine also has intel graphics
> > > > > >> which is what I prefer. Checking my
> > > > > >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf
> > > > > >> I see:
> > > > > >>
> > > > > >> blacklist nvidia
> > > > > >> blacklist nvidia-drm
> > > > > >> blacklist nvidia-modeset
> > > > > >> blacklist nvidia-uvm
> > > > > >> blacklist ipmi_msghandler
> > > > > >> blacklist ipmi_devintf
> > > > > >>
> > > > > > >> So I thought I had blacklisted it but it seems I did not. Since I do not
> > > > > > >> want to use it maybe it is better to check if the lock up occurs with
> > > > > > >> nouveau blacklisted. I will try that now.
> > > > > > I blacklisted nouveau and booted into a 6.1 kernel:
> > > > > > % uname -a
> > > > > > > Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux
> > > > > >
> > > > > > It has been running without problems for nearly two days now:
> > > > > > % uptime
> > > > > >   08:34:48 up 1 day, 16:22,  2 users,  load average: 1.33, 1.26, 
> > > > > > 1.27
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Nick.
> > > > >
> > > > > Thanks, that makes a lot more sense now.
> > > > >
> > > > > Nick, Can you please test if nouveau works with runtime PM in the
> > > > > latest 6.4-rc?
> > > > >
> > > > > If it works in 6.4-rc, there are probably nouveau commits that need
> > > > > to be backported to 6.1 LTS.
> > > > >
> > > > > If it's still broken in 6.4-rc, I believe you should file a bug:
> > > > >
> > > > > https://gitlab.freedesktop.org/drm/nouveau/
> > > > >
> > > > >
> > > > > Lyude, Lukas, Karol
> > > > >
> > > > > This thread is in relation to this commit:
> > > > >
> > > > > 24867516f06d ("ACPI: OSI: Remo

Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-06-01 Thread Limonciello, Mario
> -Original Message-
> From: Karol Herbst 
> Sent: Thursday, June 1, 2023 12:19 PM
> To: Limonciello, Mario 
> Cc: Nick Hastings ; Lyude Paul
> ; Lukas Wunner ; Salvatore
> Bonaccorso ; 1036...@bugs.debian.org; Rafael J.
> Wysocki ; Len Brown ; linux-
> a...@vger.kernel.org; linux-ker...@vger.kernel.org;
> regressi...@lists.linux.dev
> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
>
> On Thu, Jun 1, 2023 at 6:54 PM Limonciello, Mario
>  wrote:
> >
> > [AMD Official Use Only - General]
> >
> > > -Original Message-
> > > From: Karol Herbst 
> > > Sent: Thursday, June 1, 2023 11:33 AM
> > > To: Limonciello, Mario 
> > > Cc: Nick Hastings ; Lyude Paul
> > > ; Lukas Wunner ; Salvatore
> > > Bonaccorso ; 1036...@bugs.debian.org; Rafael J.
> > > Wysocki ; Len Brown ; linux-
> > > a...@vger.kernel.org; linux-ker...@vger.kernel.org;
> > > regressi...@lists.linux.dev
> > > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> > > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of
> system)
> > >
> > > On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario
> > >  wrote:
> > > >
> > > > +Lyude, Lukas, Karol
> > > >
> > > > On 5/31/2023 6:40 PM, Nick Hastings wrote:
> > > > > Hi,
> > > > >
> > > > > * Nick Hastings  [230530 16:01]:
> > > > >> * Mario Limonciello  [230530 13:00]:
> > > > > 
> > > > > >>> As you're actually loading nouveau, can you please try nouveau.runpm=0 on
> > > > > >>> the kernel command line?
> > > > > >> I'm not intentionally loading it. This machine also has intel graphics
> > > > >> which is what I prefer. Checking my
> > > > >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf
> > > > >> I see:
> > > > >>
> > > > >> blacklist nvidia
> > > > >> blacklist nvidia-drm
> > > > >> blacklist nvidia-modeset
> > > > >> blacklist nvidia-uvm
> > > > >> blacklist ipmi_msghandler
> > > > >> blacklist ipmi_devintf
> > > > >>
> > > > > >> So I thought I had blacklisted it but it seems I did not. Since I do not
> > > > >> want to use it maybe it is better to check if the lock up occurs with
> > > > >> nouveau blacklisted. I will try that now.
> > > > > I blacklisted nouveau and booted into a 6.1 kernel:
> > > > > % uname -a
> > > > > > Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux
> > > > >
> > > > > It has been running without problems for nearly two days now:
> > > > > % uptime
> > > > >   08:34:48 up 1 day, 16:22,  2 users,  load average: 1.33, 1.26, 1.27
> > > > >
> > > > > Regards,
> > > > >
> > > > > Nick.
> > > >
> > > > Thanks, that makes a lot more sense now.
> > > >
> > > > Nick, Can you please test if nouveau works with runtime PM in the
> > > > latest 6.4-rc?
> > > >
> > > > If it works in 6.4-rc, there are probably nouveau commits that need
> > > > to be backported to 6.1 LTS.
> > > >
> > > > If it's still broken in 6.4-rc, I believe you should file a bug:
> > > >
> > > > https://gitlab.freedesktop.org/drm/nouveau/
> > > >
> > > >
> > > > Lyude, Lukas, Karol
> > > >
> > > > This thread is in relation to this commit:
> > > >
> > > > 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
> > > >
> > > > Nick has found that runtime PM is *not* working for nouveau.
> > > >
> > >
> > > keep in mind we have a list of PCIe controllers where we apply a
> > > workaround:
> > >
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682
> > >
> > > And I suspect there might be one or two more IDs we'll have to add
> > > there. Do we have any logs?
> >
> >

Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-06-01 Thread Karol Herbst
On Thu, Jun 1, 2023 at 6:54 PM Limonciello, Mario
 wrote:
>
> [AMD Official Use Only - General]
>
> > -Original Message-
> > From: Karol Herbst 
> > Sent: Thursday, June 1, 2023 11:33 AM
> > To: Limonciello, Mario 
> > Cc: Nick Hastings ; Lyude Paul
> > ; Lukas Wunner ; Salvatore
> > Bonaccorso ; 1036...@bugs.debian.org; Rafael J.
> > Wysocki ; Len Brown ; linux-
> > a...@vger.kernel.org; linux-ker...@vger.kernel.org;
> > regressi...@lists.linux.dev
> > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
> >
> > On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario
> >  wrote:
> > >
> > > +Lyude, Lukas, Karol
> > >
> > > On 5/31/2023 6:40 PM, Nick Hastings wrote:
> > > > Hi,
> > > >
> > > > * Nick Hastings  [230530 16:01]:
> > > >> * Mario Limonciello  [230530 13:00]:
> > > > 
> > > > >>> As you're actually loading nouveau, can you please try nouveau.runpm=0 on
> > > > >>> the kernel command line?
> > > >> I'm not intentionally loading it. This machine also has intel graphics
> > > >> which is what I prefer. Checking my
> > > >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf
> > > >> I see:
> > > >>
> > > >> blacklist nvidia
> > > >> blacklist nvidia-drm
> > > >> blacklist nvidia-modeset
> > > >> blacklist nvidia-uvm
> > > >> blacklist ipmi_msghandler
> > > >> blacklist ipmi_devintf
> > > >>
> > > > >> So I thought I had blacklisted it but it seems I did not. Since I do not
> > > >> want to use it maybe it is better to check if the lock up occurs with
> > > >> nouveau blacklisted. I will try that now.
> > > > I blacklisted nouveau and booted into a 6.1 kernel:
> > > > % uname -a
> > > > > Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux
> > > >
> > > > It has been running without problems for nearly two days now:
> > > > % uptime
> > > >   08:34:48 up 1 day, 16:22,  2 users,  load average: 1.33, 1.26, 1.27
> > > >
> > > > Regards,
> > > >
> > > > Nick.
> > >
> > > Thanks, that makes a lot more sense now.
> > >
> > > Nick, Can you please test if nouveau works with runtime PM in the
> > > latest 6.4-rc?
> > >
> > > If it works in 6.4-rc, there are probably nouveau commits that need
> > > to be backported to 6.1 LTS.
> > >
> > > If it's still broken in 6.4-rc, I believe you should file a bug:
> > >
> > > https://gitlab.freedesktop.org/drm/nouveau/
> > >
> > >
> > > Lyude, Lukas, Karol
> > >
> > > This thread is in relation to this commit:
> > >
> > > 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
> > >
> > > Nick has found that runtime PM is *not* working for nouveau.
> > >
> >
> > keep in mind we have a list of PCIe controllers where we apply a
> > workaround:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682
> >
> > And I suspect there might be one or two more IDs we'll have to add
> > there. Do we have any logs?
>
> There's some archived onto the distro bug.  Search this page for 
> "journalctl.log.gz"
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036530
>

Interesting. It seems to be the same controller used here. I wonder
if the PCI topology is different or if the workaround is applied at
all.

But yeah, I'd kinda love for somebody with better knowledge of all of
this to figure out what exactly is going wrong, but every time this
gets investigated, Intel says "our hardware has no bugs", the ACPI
folks dig for months and find nothing, and I end up figuring out some
weirdo workaround I don't understand. And apparently nobody is able
to hand out docs explaining in detail how that runtime
suspend/resume stuff is supposed to work, either.

I have a Dell XPS 9560 where the added workaround in nouveau fixed the
problem and I know it's fixed on a bunch of other systems. So if
anybody is willing to publish docs and/or actually debug it with
domain knowledge, please go ahead.

> > And could anybody test if adding the
> > controller in play here does resolve the problem?
> >
> > > If you recall we did 24867516f06d because 5775b843a619 was
> > > supposed to have fixed it.
> > >
>



Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-06-01 Thread Limonciello, Mario
> -Original Message-
> From: Karol Herbst 
> Sent: Thursday, June 1, 2023 11:33 AM
> To: Limonciello, Mario 
> Cc: Nick Hastings ; Lyude Paul
> ; Lukas Wunner ; Salvatore
> Bonaccorso ; 1036...@bugs.debian.org; Rafael J.
> Wysocki ; Len Brown ; linux-
> a...@vger.kernel.org; linux-ker...@vger.kernel.org;
> regressi...@lists.linux.dev
> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
>
> On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario
>  wrote:
> >
> > +Lyude, Lukas, Karol
> >
> > On 5/31/2023 6:40 PM, Nick Hastings wrote:
> > > Hi,
> > >
> > > * Nick Hastings  [230530 16:01]:
> > >> * Mario Limonciello  [230530 13:00]:
> > > 
> > >>> As you're actually loading nouveau, can you please try nouveau.runpm=0 on
> > >>> the kernel command line?
> > >> I'm not intentionally loading it. This machine also has intel graphics
> > >> which is what I prefer. Checking my
> > >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf
> > >> I see:
> > >>
> > >> blacklist nvidia
> > >> blacklist nvidia-drm
> > >> blacklist nvidia-modeset
> > >> blacklist nvidia-uvm
> > >> blacklist ipmi_msghandler
> > >> blacklist ipmi_devintf
> > >>
> > >> So I thought I had blacklisted it but it seems I did not. Since I do not
> > >> want to use it maybe it is better to check if the lock up occurs with
> > >> nouveau blacklisted. I will try that now.
> > > I blacklisted nouveau and booted into a 6.1 kernel:
> > > % uname -a
> > > Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux
> > >
> > > It has been running without problems for nearly two days now:
> > > % uptime
> > >   08:34:48 up 1 day, 16:22,  2 users,  load average: 1.33, 1.26, 1.27
> > >
> > > Regards,
> > >
> > > Nick.
> >
> > Thanks, that makes a lot more sense now.
> >
> > Nick, Can you please test if nouveau works with runtime PM in the
> > latest 6.4-rc?
> >
> > If it works in 6.4-rc, there are probably nouveau commits that need
> > to be backported to 6.1 LTS.
> >
> > If it's still broken in 6.4-rc, I believe you should file a bug:
> >
> > https://gitlab.freedesktop.org/drm/nouveau/
> >
> >
> > Lyude, Lukas, Karol
> >
> > This thread is in relation to this commit:
> >
> > 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
> >
> > Nick has found that runtime PM is *not* working for nouveau.
> >
>
> keep in mind we have a list of PCIe controllers where we apply a
> workaround:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682
>
> And I suspect there might be one or two more IDs we'll have to add
> there. Do we have any logs?

There's some archived onto the distro bug.  Search this page for 
"journalctl.log.gz"
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036530

> And could anybody test if adding the
> controller in play here does resolve the problem?
>
> > If you recall we did 24867516f06d because 5775b843a619 was
> > supposed to have fixed it.
> >



Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-06-01 Thread Karol Herbst
On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario
 wrote:
>
> +Lyude, Lukas, Karol
>
> On 5/31/2023 6:40 PM, Nick Hastings wrote:
> > Hi,
> >
> > * Nick Hastings  [230530 16:01]:
> >> * Mario Limonciello  [230530 13:00]:
> > 
> >>> As you're actually loading nouveau, can you please try nouveau.runpm=0 on
> >>> the kernel command line?
> >> I'm not intentionally loading it. This machine also has intel graphics
> >> which is what I prefer. Checking my
> >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf
> >> I see:
> >>
> >> blacklist nvidia
> >> blacklist nvidia-drm
> >> blacklist nvidia-modeset
> >> blacklist nvidia-uvm
> >> blacklist ipmi_msghandler
> >> blacklist ipmi_devintf
> >>
> >> So I thought I had blacklisted it but it seems I did not. Since I do not
> >> want to use it maybe it is better to check if the lock up occurs with
> >> nouveau blacklisted. I will try that now.
> > I blacklisted nouveau and booted into a 6.1 kernel:
> > % uname -a
> > Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux
> >
> > It has been running without problems for nearly two days now:
> > % uptime
> >   08:34:48 up 1 day, 16:22,  2 users,  load average: 1.33, 1.26, 1.27
> >
> > Regards,
> >
> > Nick.
>
> Thanks, that makes a lot more sense now.
>
> Nick, Can you please test if nouveau works with runtime PM in the
> latest 6.4-rc?
>
> If it works in 6.4-rc, there are probably nouveau commits that need
> to be backported to 6.1 LTS.
>
> If it's still broken in 6.4-rc, I believe you should file a bug:
>
> https://gitlab.freedesktop.org/drm/nouveau/
>
>
> Lyude, Lukas, Karol
>
> This thread is in relation to this commit:
>
> 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
>
> Nick has found that runtime PM is *not* working for nouveau.
>

keep in mind we have a list of PCIe controllers where we apply a
workaround: 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682

And I suspect there might be one or two more IDs we'll have to add
there. Do we have any logs? And could anybody test if adding the
controller in play here does resolve the problem?
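
To make the kind of check described above concrete, here is a minimal Python sketch; it is not the nouveau code. It pulls the vendor:device pair out of a kernel PCI enumeration line and tests it against a list of bridge IDs. QUIRK_BRIDGES, parse_pci_ids and bridge_needs_workaround are hypothetical names, and the single entry 8086:1901 is the bridge ID quoted elsewhere in this thread, not something copied from nouveau_drm.c.

```python
import re

# Hypothetical list of (vendor, device) IDs for upstream bridges that get
# the workaround. The one entry is the bridge ID quoted in this thread; it
# is an assumption that the real list in nouveau_drm.c is shaped like this.
QUIRK_BRIDGES = {(0x8086, 0x1901)}

# Matches the "[vvvv:dddd]" ID pair in a kernel PCI enumeration line.
_ID_RE = re.compile(r"\[([0-9a-fA-F]{4}):([0-9a-fA-F]{4})\]")


def parse_pci_ids(log_line):
    """Return (vendor, device) as ints, or None if the line has no ID pair."""
    match = _ID_RE.search(log_line)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)


def bridge_needs_workaround(log_line):
    """True if the device on this log line is in the hypothetical list."""
    return parse_pci_ids(log_line) in QUIRK_BRIDGES


if __name__ == "__main__":
    line = "pci 0000:00:01.0: [8086:1901] type 01 class 0x060400"
    print(bridge_needs_workaround(line))
```

Testing another controller would then amount to adding its ID pair to the set, which mirrors the "add one or two more IDs" idea without claiming anything about how the driver stores them.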

> If you recall we did 24867516f06d because 5775b843a619 was
> supposed to have fixed it.
>



Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-06-01 Thread Limonciello, Mario

+Lyude, Lukas, Karol

On 5/31/2023 6:40 PM, Nick Hastings wrote:
> Hi,
>
> * Nick Hastings  [230530 16:01]:
> > * Mario Limonciello  [230530 13:00]:
>
> > > As you're actually loading nouveau, can you please try nouveau.runpm=0 on
> > > the kernel command line?
> > I'm not intentionally loading it. This machine also has intel graphics
> > which is what I prefer. Checking my
> > /etc/modprobe.d/blacklist-nvidia-nouveau.conf
> > I see:
> >
> > blacklist nvidia
> > blacklist nvidia-drm
> > blacklist nvidia-modeset
> > blacklist nvidia-uvm
> > blacklist ipmi_msghandler
> > blacklist ipmi_devintf
> >
> > So I thought I had blacklisted it but it seems I did not. Since I do not
> > want to use it maybe it is better to check if the lock up occurs with
> > nouveau blacklisted. I will try that now.
> I blacklisted nouveau and booted into a 6.1 kernel:
> % uname -a
> Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux
>
> It has been running without problems for nearly two days now:
> % uptime
>   08:34:48 up 1 day, 16:22,  2 users,  load average: 1.33, 1.26, 1.27
>
> Regards,
>
> Nick.


Thanks, that makes a lot more sense now.

Nick, Can you please test if nouveau works with runtime PM in the
latest 6.4-rc?

If it works in 6.4-rc, there are probably nouveau commits that need
to be backported to 6.1 LTS.

If it's still broken in 6.4-rc, I believe you should file a bug:

https://gitlab.freedesktop.org/drm/nouveau/


Lyude, Lukas, Karol

This thread is in relation to this commit:

24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")

Nick has found that runtime PM is *not* working for nouveau.

If you recall we did 24867516f06d because 5775b843a619 was
supposed to have fixed it.



Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-05-31 Thread Nick Hastings
Hi,

* Nick Hastings  [230530 16:01]:
> 
> * Mario Limonciello  [230530 13:00]:

> > As you're actually loading nouveau, can you please try nouveau.runpm=0 on
> > the kernel command line?
> 
> I'm not intentionally loading it. This machine also has intel graphics
> which is what I prefer. Checking my
> /etc/modprobe.d/blacklist-nvidia-nouveau.conf
> I see:
> 
> blacklist nvidia
> blacklist nvidia-drm
> blacklist nvidia-modeset
> blacklist nvidia-uvm
> blacklist ipmi_msghandler
> blacklist ipmi_devintf
> 
> So I thought I had blacklisted it but it seems I did not. Since I do not
> want to use it maybe it is better to check if the lock up occurs with
> nouveau blacklisted. I will try that now.

I blacklisted nouveau and booted into a 6.1 kernel:
% uname -a
Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux

It has been running without problems for nearly two days now:
% uptime
 08:34:48 up 1 day, 16:22,  2 users,  load average: 1.33, 1.26, 1.27

Regards,

Nick.
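
Since the blacklist file quoted above turned out not to cover nouveau, one quick way to see which modules a modprobe.d file actually names is to parse its "blacklist" lines. Below is a minimal Python sketch; blacklisted_modules is a made-up helper, and it only inspects the text it is given, so it says nothing about which modules the running kernel has loaded.

```python
def blacklisted_modules(conf_text):
    """Return the set of module names that appear on 'blacklist' lines."""
    modules = set()
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) == 2 and fields[0] == "blacklist":
            modules.add(fields[1])
    return modules


if __name__ == "__main__":
    # Path taken from the message above; adjust for the file being checked.
    path = "/etc/modprobe.d/blacklist-nvidia-nouveau.conf"
    with open(path, encoding="utf-8") as handle:
        names = blacklisted_modules(handle.read())
    print("nouveau blacklisted:", "nouveau" in names)
```

Run against the six lines quoted above, the returned set would contain the nvidia and ipmi entries but not nouveau, which matches what was observed.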



Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-05-30 Thread Salvatore Bonaccorso
Hi Nick,

Thanks to you both for triaging the issue!

On Tue, May 30, 2023 at 04:01:04PM +0900, Nick Hastings wrote:
> Hi,
> 
> * Mario Limonciello  [230530 13:00]:
> > On 5/29/23 18:01, Nick Hastings wrote:
> > > Hi,
> > > 
> > > * Nick Hastings  [230529 12:51]:
> > > > * Mario Limonciello  [230529 10:14]:
> > > > > On 5/28/23 19:56, Nick Hastings wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > * Mario Limonciello  [230528 21:44]:
> > > > > > > On 5/28/23 01:49, Salvatore Bonaccorso wrote:
> > > > > > > > Hi Mario
> > > > > > > > 
> > > > > > > > > Nick Hastings reported in Debian in https://bugs.debian.org/1036530
> > > > > > > > > lockups from his system after updating from a 6.0 based version to
> > > > > > > > > 6.1.y.
> > > > > > > > >
> > > > > > > > #regzbot ^introduced 24867516f06d
> > > > > > > > 
> > > > > > > > he bisected the issue and tracked it down to:
> > > > > > > > 
> > > > > > > > On Sun, May 28, 2023 at 10:14:51AM +0900, Nick Hastings wrote:
> > > > > > > > > Control: tags -1 - moreinfo
> > > > > > > > > 
> > > > > > > > > Hi,
> > > > > > > > > 
> > > > > > > > > I repeated the git bisect, and the bad commit seems to be:
> > > > > > > > > 
> > > > > > > > > (git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad
> > > > > > > > > > 24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit
> > > > > > > > > commit 24867516f06dabedef3be7eea0ef0846b91538bc
> > > > > > > > > Author: Mario Limonciello 
> > > > > > > > > Date:   Tue Aug 23 13:51:31 2022 -0500
> > > > > > > > > 
> > > > > > > > > >     ACPI: OSI: Remove Linux-Dell-Video _OSI string
> > > > > > > > > >     This string was introduced because drivers for NVIDIA hardware
> > > > > > > > > >     had bugs supporting RTD3 in the past.
> > > > > > > > > >     Before proprietary NVIDIA driver started to support RTD3, Ubuntu had
> > > > > > > > > >     had a mechanism for switching PRIME on and off, though it had required
> > > > > > > > > >     to logout/login to make the library switch happen.
> > > > > > > > > >     When the PRIME had been off, the mechanism had unloaded the NVIDIA
> > > > > > > > > >     driver and put the device into D3cold, but the GPU had never come back
> > > > > > > > > >     to D0 again which is why ODMs used the _OSI to expose an old _DSM
> > > > > > > > > >     method to switch the power on/off.
> > > > > > > > > >     That has been fixed by commit 5775b843a619 ("PCI: Restore config space
> > > > > > > > > >     on runtime resume despite being unbound"). so vendors shouldn't be
> > > > > > > > > >     using this string to modify ASL any more.
> > > > > > > > > >     Reviewed-by: Lyude Paul 
> > > > > > > > > >     Signed-off-by: Mario Limonciello 
> > > > > > > > > >     Signed-off-by: Rafael J. Wysocki 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > drivers/acpi/osi.c | 9 -
> > > > > > > > > 1 file changed, 9 deletions(-)
> > > > > > > > > 
> > > > > > > > > > This machine is a Dell with an nvidia chip so it looks like this really
> > > > > > > > > > could be the commit that is causing the problems. The description
> > > > > > > > > > of the commit also seems (to my untrained eye) to be consistent with the
> > > > > > > > > > error reported on the console when the lockup occurs:
> > > > > > > > > > 
> > > > > > > > > > [   58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > > > > > > > [   58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > > > > > > > [   60.083261] vfio-pci :01:00.0 Unable to change power state from D3cold to D0, device inaccessible
> > > > > > > > > > 
> > > > > > > > > > Hopefully this is enough information for experts to resolve this.
> > > > > > > > 
> > > > > > > > > Does this ring some bell for you? Do you need any further information
> > > > > > > > > from Nick?
> > > > > > > > > 
> > > > > > > > > Regards,
> > > > > > > > > Salvatore
> > > > > > > > 
> > > > > > > 
> > > > > > > > Have Nick try using "pcie_port_pm=off" and see if it helps the issue.
> > > > > > > 
> > > > > > > I booted into a 6.1 kernel with this option. It has been running without
> > > > > > > problems for 1.5 hours. Usually I would expect the lockup to have
> > > > > > > occurred by now.
> > > > 
> > > > I let this run for 3 hours without issue.
> > > > 
> > > > > > > Does this happen in the latest 6.4 RC as well?
> > > > > > 
> > > > > > I have compiled that kernel and will boot into it after running 
> > > > > > this one
> > > > > > with the 

Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-05-30 Thread Nick Hastings
Hi,

* Mario Limonciello  [230530 13:00]:
> On 5/29/23 18:01, Nick Hastings wrote:
> > Hi,
> > 
> > * Nick Hastings  [230529 12:51]:
> > > * Mario Limonciello  [230529 10:14]:
> > > > On 5/28/23 19:56, Nick Hastings wrote:
> > > > > Hi,
> > > > > 
> > > > > * Mario Limonciello  [230528 21:44]:
> > > > > > On 5/28/23 01:49, Salvatore Bonaccorso wrote:
> > > > > > > Hi Mario
> > > > > > > 
> > > > > > > Nick Hastings reported in Debian in 
> > > > > > > https://bugs.debian.org/1036530
> > > > > > > lockups from his system after updating from a 6.0 based version to
> > > > > > > 6.1.y.
> > > > > > >
> > > > > > > #regzbot ^introduced 24867516f06d
> > > > > > > 
> > > > > > > he bisected the issue and tracked it down to:
> > > > > > > 
> > > > > > > On Sun, May 28, 2023 at 10:14:51AM +0900, Nick Hastings wrote:
> > > > > > > > Control: tags -1 - moreinfo
> > > > > > > > 
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > I repeated the git bisect, and the bad commit seems to be:
> > > > > > > > 
> > > > > > > > (git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad
> > > > > > > > 24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit
> > > > > > > > commit 24867516f06dabedef3be7eea0ef0846b91538bc
> > > > > > > > Author: Mario Limonciello 
> > > > > > > > Date:   Tue Aug 23 13:51:31 2022 -0500
> > > > > > > > 
> > > > > > > >ACPI: OSI: Remove Linux-Dell-Video _OSI string
> > > > > > > >This string was introduced because drivers for NVIDIA 
> > > > > > > > hardware
> > > > > > > >had bugs supporting RTD3 in the past.
> > > > > > > >Before proprietary NVIDIA driver started to support 
> > > > > > > > RTD3, Ubuntu had
> > > > > > > >had a mechanism for switching PRIME on and off, though 
> > > > > > > > it had required
> > > > > > > >to logout/login to make the library switch happen.
> > > > > > > >When the PRIME had been off, the mechanism had unloaded 
> > > > > > > > the NVIDIA
> > > > > > > >driver and put the device into D3cold, but the GPU had 
> > > > > > > > never come back
> > > > > > > >to D0 again which is why ODMs used the _OSI to expose an 
> > > > > > > > old _DSM
> > > > > > > >method to switch the power on/off.
> > > > > > > >That has been fixed by commit 5775b843a619 ("PCI: 
> > > > > > > > Restore config space
> > > > > > > >on runtime resume despite being unbound"). so vendors 
> > > > > > > > shouldn't be
> > > > > > > >using this string to modify ASL any more.
> > > > > > > >Reviewed-by: Lyude Paul 
> > > > > > > >Signed-off-by: Mario Limonciello 
> > > > > > > > 
> > > > > > > >Signed-off-by: Rafael J. Wysocki 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > drivers/acpi/osi.c | 9 -
> > > > > > > > 1 file changed, 9 deletions(-)
> > > > > > > > 
> > > > > > > > This machine is a Dell with an nvidia chip so it looks like 
> > > > > > > > this really
> > > > > > > > could be the commit that is causing the problems. The 
> > > > > > > > description
> > > > > > > > of the commit also seems (to my untrained eye) to be consistent 
> > > > > > > > with the
> > > > > > > > error reported on the console when the lockup occurs:
> > > > > > > > 
> > > > > > > > [   58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due 
> > > > > > > > to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > > > > > [   58.729904] ACPI Error: Aborting method 
> > > > > > > > \_SB.PCI0.PEG0.PG00._ON due to previous error 
> > > > > > > > (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > > > > > [   60.083261] vfio-pci :01:00.0 Unable to change power 
> > > > > > > > state from D3cold to D0, device inaccessible
> > > > > > > > 
> > > > > > > > Hopefully this is enough information for experts to resolve 
> > > > > > > > this.
> > > > > > > 
> > > > > > > Does this ring some bell for you? Do you need any further 
> > > > > > > information
> > > > > > > from Nick?
> > > > > > > 
> > > > > > > Regards,
> > > > > > > Salvatore
> > > > > > 
> > > > > 
> > > > > > Have Nick try using "pcie_port_pm=off" and see if it helps the 
> > > > > > issue.
> > > > > 
> > > > > I booted into a 6.1 kernel with this option. It has been running 
> > > > > without
> > > > > problems for 1.5 hours. Usually I would expect the lockup to have
> > > > > occurred by now.
> > > 
> > > I let this run for 3 hours without issue.
> > > 
> > > > > > Does this happen in the latest 6.4 RC as well?
> > > > > 
> > > > > I have compiled that kernel and will boot into it after running this 
> > > > > one
> > > > > with the pcie_port_pm=off for another hour or so.
> > > 
> > > I'm now running 6.4.0-rc4 without seeing the problem after 1 hour.
> > 
> > I did eventually see a lockup of this kernel. On the console I saw:
> > 
> > [  151.035036] vfio-pci :01:00.0 Unable to change power state from 
> > D3cold to D0, device inaccessible
> > 
> > I did not see the other two lines that were 

Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-05-29 Thread Mario Limonciello

On 5/29/23 18:01, Nick Hastings wrote:

> Hi,
>
> * Nick Hastings  [230529 12:51]:
> > * Mario Limonciello  [230529 10:14]:
> > > On 5/28/23 19:56, Nick Hastings wrote:
> > > > Hi,
> > > >
> > > > * Mario Limonciello  [230528 21:44]:
> > > > > On 5/28/23 01:49, Salvatore Bonaccorso wrote:
> > > > > > Hi Mario
> > > > > >
> > > > > > Nick Hastings reported in Debian in https://bugs.debian.org/1036530
> > > > > > lockups from his system after updating from a 6.0 based version to
> > > > > > 6.1.y.
> > > > > >
> > > > > > #regzbot ^introduced 24867516f06d
> > > > > >
> > > > > > he bisected the issue and tracked it down to:
> > > > > >
> > > > > > On Sun, May 28, 2023 at 10:14:51AM +0900, Nick Hastings wrote:
> > > > > > > Control: tags -1 - moreinfo
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I repeated the git bisect, and the bad commit seems to be:
> > > > > > >
> > > > > > > (git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad
> > > > > > > 24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit
> > > > > > > commit 24867516f06dabedef3be7eea0ef0846b91538bc
> > > > > > > Author: Mario Limonciello 
> > > > > > > Date:   Tue Aug 23 13:51:31 2022 -0500
> > > > > > >
> > > > > > >    ACPI: OSI: Remove Linux-Dell-Video _OSI string
> > > > > > >    This string was introduced because drivers for NVIDIA hardware
> > > > > > >    had bugs supporting RTD3 in the past.
> > > > > > >    Before proprietary NVIDIA driver started to support RTD3, Ubuntu had
> > > > > > >    had a mechanism for switching PRIME on and off, though it had required
> > > > > > >    to logout/login to make the library switch happen.
> > > > > > >    When the PRIME had been off, the mechanism had unloaded the NVIDIA
> > > > > > >    driver and put the device into D3cold, but the GPU had never come back
> > > > > > >    to D0 again which is why ODMs used the _OSI to expose an old _DSM
> > > > > > >    method to switch the power on/off.
> > > > > > >    That has been fixed by commit 5775b843a619 ("PCI: Restore config space
> > > > > > >    on runtime resume despite being unbound"). so vendors shouldn't be
> > > > > > >    using this string to modify ASL any more.
> > > > > > >    Reviewed-by: Lyude Paul 
> > > > > > >    Signed-off-by: Mario Limonciello 
> > > > > > >    Signed-off-by: Rafael J. Wysocki 
> > > > > > >
> > > > > > > drivers/acpi/osi.c | 9 -
> > > > > > > 1 file changed, 9 deletions(-)
> > > > > > >
> > > > > > > This machine is a Dell with an nvidia chip so it looks like this really
> > > > > > > could be the commit that is causing the problems. The description
> > > > > > > of the commit also seems (to my untrained eye) to be consistent with the
> > > > > > > error reported on the console when the lockup occurs:
> > > > > > >
> > > > > > > [   58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error 
> > > > > > > (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > > > > [   58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to 
> > > > > > > previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > > > > [   60.083261] vfio-pci :01:00.0 Unable to change power state from D3cold 
> > > > > > > to D0, device inaccessible
> > > > > > >
> > > > > > > Hopefully this is enough information for experts to resolve this.
> > > > > >
> > > > > > Does this ring some bell for you? Do you need any further information
> > > > > > from Nick?
> > > > > >
> > > > > > Regards,
> > > > > > Salvatore
> > > > >
> > > >
> > > > > Have Nick try using "pcie_port_pm=off" and see if it helps the issue.
> > > >
> > > > I booted into a 6.1 kernel with this option. It has been running without
> > > > problems for 1.5 hours. Usually I would expect the lockup to have
> > > > occurred by now.
> >
> > I let this run for 3 hours without issue.
> >
> > > > > Does this happen in the latest 6.4 RC as well?
> > > >
> > > > I have compiled that kernel and will boot into it after running this one
> > > > with the pcie_port_pm=off for another hour or so.
> >
> > I'm now running 6.4.0-rc4 without seeing the problem after 1 hour.
>
> I did eventually see a lockup of this kernel. On the console I saw:
>
> [  151.035036] vfio-pci :01:00.0 Unable to change power state from D3cold 
> to D0, device inaccessible
>
> I did not see the other two lines that were present in earlier lock ups
>
> > I did however see two unrelated problems that I include here for
> > completeness:
> > 1. iwlwifi module did not automatically load
> > 2. Xwayland used a huge amount of CPU even though I was not running any X
> > programs. Recompiling my wayland compositor without XWayland support
> > "fixed" this.
> >
> > > > > I think we need to see a full dmesg and acpidump to better
> > > > > characterize it.
> > > >
> > > > Please find attached. Let me know if there is anything else I can provide.
> > > >
> > > > Regards,
> > > >
> > > > Nick.
> > >
> > > I don't see nouveau loading, are you explicitly preventing it from
> > > loading?
> >
> > Yes nouveau is blacklisted.
> >
> > > Can I see the journal from a boot when it reproduced?
> >
> > Hmm not sure which n for "journalctl -b n" maps to which kernel (is that
> > what you are requesting?). The commit hash does not seem to be
> > listed. I may have to boot into a bad kernel again.
>
> Please find attached the output from a "journalctl --system -bN" for a
> kernel that has this issue.
>
> Regards,
>
> Nick.


In this log I see nouveau loaded, but I also don't see the failure 
occurring.


As you're actually loading nouveau, can you please try nouveau.runpm=0 
on the kernel command line?


If that helps the issue, I strongly suggest you cross-check against the 
latest kernel to see whether this bug still exists.
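
When testing with and without nouveau.runpm=0, it can help to look at the GPU's runtime PM state directly. A minimal sketch, assuming the GPU sits at PCI address 0000:01:00.0 (the logs in this thread show only ":01:00.0", so the 0000 domain is an assumption) and that sysfs is mounted at /sys. The helper name gpu_runtime_pm is hypothetical; power/control and power/runtime_status are the standard per-device sysfs attributes.

```shell
# Hypothetical helper: print the runtime PM policy and current state of a PCI device.
# $1: PCI address (assumed 0000:01:00.0 here); $2: sysfs root (default /sys).
gpu_runtime_pm() {
    dev=${1:-0000:01:00.0}
    root=${2:-/sys}
    dir="$root/bus/pci/devices/$dev/power"
    if [ ! -d "$dir" ]; then
        echo "no such device: $dev" >&2
        return 1
    fi
    # control: "auto" lets the kernel runtime-suspend the device, "on" keeps it powered.
    # runtime_status: for example "active" or "suspended".
    printf 'control=%s status=%s\n' "$(cat "$dir/control")" "$(cat "$dir/runtime_status")"
}

# Usage:
#   gpu_runtime_pm 0000:01:00.0
```

A device that reports control=auto and status=suspended is one the kernel is allowed to power down, which is the situation in which the D3cold to D0 transition in the logs would be attempted.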




Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-05-29 Thread Nick Hastings
Hi,

* Nick Hastings  [230529 12:51]:
> * Mario Limonciello  [230529 10:14]:
> > On 5/28/23 19:56, Nick Hastings wrote:
> > > Hi,
> > > 
> > > * Mario Limonciello  [230528 21:44]:
> > > > On 5/28/23 01:49, Salvatore Bonaccorso wrote:
> > > > > Hi Mario
> > > > > 
> > > > > Nick Hastings reported in Debian in https://bugs.debian.org/1036530
> > > > > lockups from his system after updating from a 6.0 based version to
> > > > > 6.1.y.
> > > > >
> > > > > #regzbot ^introduced 24867516f06d
> > > > > 
> > > > > he bisected the issue and tracked it down to:
> > > > > 
> > > > > On Sun, May 28, 2023 at 10:14:51AM +0900, Nick Hastings wrote:
> > > > > > Control: tags -1 - moreinfo
> > > > > > 
> > > > > > Hi,
> > > > > > 
> > > > > > I repeated the git bisect, and the bad commit seems to be:
> > > > > > 
> > > > > > (git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad
> > > > > > 24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit
> > > > > > commit 24867516f06dabedef3be7eea0ef0846b91538bc
> > > > > > Author: Mario Limonciello 
> > > > > > Date:   Tue Aug 23 13:51:31 2022 -0500
> > > > > > 
> > > > > >   ACPI: OSI: Remove Linux-Dell-Video _OSI string
> > > > > >   This string was introduced because drivers for NVIDIA hardware
> > > > > >   had bugs supporting RTD3 in the past.
> > > > > >   Before proprietary NVIDIA driver started to support RTD3, 
> > > > > > Ubuntu had
> > > > > >   had a mechanism for switching PRIME on and off, though it had 
> > > > > > required
> > > > > >   to logout/login to make the library switch happen.
> > > > > >   When the PRIME had been off, the mechanism had unloaded the 
> > > > > > NVIDIA
> > > > > >   driver and put the device into D3cold, but the GPU had never 
> > > > > > come back
> > > > > >   to D0 again which is why ODMs used the _OSI to expose an old 
> > > > > > _DSM
> > > > > >   method to switch the power on/off.
> > > > > >   That has been fixed by commit 5775b843a619 ("PCI: Restore 
> > > > > > config space
> > > > > >   on runtime resume despite being unbound"). so vendors 
> > > > > > shouldn't be
> > > > > >   using this string to modify ASL any more.
> > > > > >   Reviewed-by: Lyude Paul 
> > > > > >   Signed-off-by: Mario Limonciello 
> > > > > >   Signed-off-by: Rafael J. Wysocki 
> > > > > > 
> > > > > >drivers/acpi/osi.c | 9 -
> > > > > >1 file changed, 9 deletions(-)
> > > > > > 
> > > > > > This machine is a Dell with an nvidia chip so it looks like this 
> > > > > > really
> > > > > > could be the commit that is causing the problems. The 
> > > > > > description
> > > > > > of the commit also seems (to my untrained eye) to be consistent 
> > > > > > with the
> > > > > > error reported on the console when the lockup occurs:
> > > > > > 
> > > > > > [   58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to 
> > > > > > previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > > > [   58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON 
> > > > > > due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > > > [   60.083261] vfio-pci :01:00.0 Unable to change power state 
> > > > > > from D3cold to D0, device inaccessible
> > > > > > 
> > > > > > Hopefully this is enough information for experts to resolve this.
> > > > > 
> > > > > Does this ring some bell for you? Do you need any further information
> > > > > from Nick?
> > > > > 
> > > > > Regards,
> > > > > Salvatore
> > > > 
> > > 
> > > > Have Nick try using "pcie_port_pm=off" and see if it helps the issue.
> > > 
> > > I booted into a 6.1 kernel with this option. It has been running without
> > > problems for 1.5 hours. Usually I would expect the lockup to have
> > > occurred by now.
> 
> I let this run for 3 hours without issue.
> 
> > > > Does this happen in the latest 6.4 RC as well?
> > > 
> > > I have compiled that kernel and will boot into it after running this one
> > > with the pcie_port_pm=off for another hour or so.
> 
> I'm now running 6.4.0-rc4 without seeing the problem after 1 hour.

I did eventually see a lockup of this kernel. On the console I saw:

[  151.035036] vfio-pci :01:00.0 Unable to change power state from D3cold 
to D0, device inaccessible

I did not see the other two lines that were present in earlier lock ups

> I did however see two unrelated problems that I include here for
> completeness:
> 1. iwlwifi module did not automatically load
> 2. Xwayland used a huge amount of CPU even though I was not running any X
> programs. Recompiling my wayland compositor without XWayland support
> "fixed" this.
> 
> > > > I think we need to see a full dmesg and acpidump to better
> > > > characterize it.
> > > 
> > > Please find attached. Let me know if there is anything else I can provide.
> > > 
> > > Regards,
> > > 
> > > Nick.
> > 
> > I don't see nouveau loading, are you explicitly preventing it from
> > loading?
> 
> Yes nouveau is blacklisted.
> 

Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-05-28 Thread Nick Hastings
* Mario Limonciello  [230529 10:14]:
> On 5/28/23 19:56, Nick Hastings wrote:
> > Hi,
> > 
> > * Mario Limonciello  [230528 21:44]:
> > > On 5/28/23 01:49, Salvatore Bonaccorso wrote:
> > > > Hi Mario
> > > > 
> > > > Nick Hastings reported in Debian in https://bugs.debian.org/1036530
> > > > lockups from his system after updating from a 6.0 based version to
> > > > 6.1.y.
> > > >
> > > > #regzbot ^introduced 24867516f06d
> > > > 
> > > > he bisected the issue and tracked it down to:
> > > > 
> > > > On Sun, May 28, 2023 at 10:14:51AM +0900, Nick Hastings wrote:
> > > > > Control: tags -1 - moreinfo
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > I repeated the git bisect, and the bad commit seems to be:
> > > > > 
> > > > > (git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad
> > > > > 24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit
> > > > > commit 24867516f06dabedef3be7eea0ef0846b91538bc
> > > > > Author: Mario Limonciello 
> > > > > Date:   Tue Aug 23 13:51:31 2022 -0500
> > > > > 
> > > > >   ACPI: OSI: Remove Linux-Dell-Video _OSI string
> > > > >   This string was introduced because drivers for NVIDIA hardware
> > > > >   had bugs supporting RTD3 in the past.
> > > > >   Before proprietary NVIDIA driver started to support RTD3, 
> > > > > Ubuntu had
> > > > >   had a mechanism for switching PRIME on and off, though it had 
> > > > > required
> > > > >   to logout/login to make the library switch happen.
> > > > >   When the PRIME had been off, the mechanism had unloaded the 
> > > > > NVIDIA
> > > > >   driver and put the device into D3cold, but the GPU had never 
> > > > > come back
> > > > >   to D0 again which is why ODMs used the _OSI to expose an old 
> > > > > _DSM
> > > > >   method to switch the power on/off.
> > > > >   That has been fixed by commit 5775b843a619 ("PCI: Restore 
> > > > > config space
> > > > >   on runtime resume despite being unbound"). so vendors shouldn't 
> > > > > be
> > > > >   using this string to modify ASL any more.
> > > > >   Reviewed-by: Lyude Paul 
> > > > >   Signed-off-by: Mario Limonciello 
> > > > >   Signed-off-by: Rafael J. Wysocki 
> > > > > 
> > > > >drivers/acpi/osi.c | 9 -
> > > > >1 file changed, 9 deletions(-)
> > > > > 
> > > > > This machine is a Dell with an nvidia chip so it looks like this 
> > > > > really
> > > > > could be the commit that is causing the problems. The description
> > > > > of the commit also seems (to my untrained eye) to be consistent with 
> > > > > the
> > > > > error reported on the console when the lockup occurs:
> > > > > 
> > > > > [   58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to 
> > > > > previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > > [   58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON 
> > > > > due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > > [   60.083261] vfio-pci :01:00.0 Unable to change power state 
> > > > > from D3cold to D0, device inaccessible
> > > > > 
> > > > > Hopefully this is enough information for experts to resolve this.
> > > > 
> > > > Does this ring some bell for you? Do you need any further information
> > > > from Nick?
> > > > 
> > > > Regards,
> > > > Salvatore
> > > 
> > 
> > > Have Nick try using "pcie_port_pm=off" and see if it helps the issue.
> > 
> > I booted into a 6.1 kernel with this option. It has been running without
> > problems for 1.5 hours. Usually I would expect the lockup to have
> > occurred by now.

I let this run for 3 hours without issue.

> > > Does this happen in the latest 6.4 RC as well?
> > 
> > I have compiled that kernel and will boot into it after running this one
> > with the pcie_port_pm=off for another hour or so.

I'm now running 6.4.0-rc4 without seeing the problem after 1 hour.

I did however see two unrelated problems that I include here for
completeness:
1. iwlwifi module did not automatically load
2. Xwayland used a huge amount of CPU even though I was not running any X
programs. Recompiling my wayland compositor without XWayland support
"fixed" this.

> > > I think we need to see a full dmesg and acpidump to better
> > > characterize it.
> > 
> > Please find attached. Let me know if there is anything else I can provide.
> > 
> > Regards,
> > 
> > Nick.
> 
> I don't see nouveau loading, are you explicitly preventing it from
> loading?

Yes nouveau is blacklisted.

> Can I see the journal from a boot when it reproduced?

Hmm not sure which n for "journalctl -b n" maps to which kernel (is that
what you are requesting?). The commit hash does not seem to be
listed. I may have to boot into a bad kernel again.

Regards,

Nick.
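
One way to work out which n for "journalctl -b n" belongs to which kernel, sketched under the assumption that journald keeps logs from earlier boots (persistent storage): the kernel log of each boot starts with a "Linux version ..." banner, so printing that banner per boot index gives the mapping. The helper names kernel_banner and list_boot_kernels are hypothetical.

```shell
# Hypothetical helper: extract the kernel version banner from kernel log text on stdin.
kernel_banner() {
    grep -m1 -o 'Linux version [^ ]*'
}

# Print the kernel banner for the current boot (0) and the three boots before it.
list_boot_kernels() {
    for n in 0 -1 -2 -3; do
        printf 'boot %s: ' "$n"
        journalctl -k -b "$n" --no-pager 2>/dev/null | kernel_banner \
            || echo '(not recorded)'
    done
}

# Usage:
#   journalctl --list-boots    # shows which boot indices exist
#   list_boot_kernels
```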



Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-05-28 Thread Mario Limonciello

On 5/28/23 19:56, Nick Hastings wrote:

> Hi,
>
> * Mario Limonciello  [230528 21:44]:
> > On 5/28/23 01:49, Salvatore Bonaccorso wrote:
> > > Hi Mario
> > >
> > > Nick Hastings reported in Debian in https://bugs.debian.org/1036530
> > > lockups from his system after updating from a 6.0 based version to
> > > 6.1.y.
> > >
> > > #regzbot ^introduced 24867516f06d
> > >
> > > he bisected the issue and tracked it down to:
> > >
> > > On Sun, May 28, 2023 at 10:14:51AM +0900, Nick Hastings wrote:
> > > > Control: tags -1 - moreinfo
> > > >
> > > > Hi,
> > > >
> > > > I repeated the git bisect, and the bad commit seems to be:
> > > >
> > > > (git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad
> > > > 24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit
> > > > commit 24867516f06dabedef3be7eea0ef0846b91538bc
> > > > Author: Mario Limonciello 
> > > > Date:   Tue Aug 23 13:51:31 2022 -0500
> > > >
> > > >   ACPI: OSI: Remove Linux-Dell-Video _OSI string
> > > >   This string was introduced because drivers for NVIDIA hardware
> > > >   had bugs supporting RTD3 in the past.
> > > >   Before proprietary NVIDIA driver started to support RTD3, Ubuntu had
> > > >   had a mechanism for switching PRIME on and off, though it had required
> > > >   to logout/login to make the library switch happen.
> > > >   When the PRIME had been off, the mechanism had unloaded the NVIDIA
> > > >   driver and put the device into D3cold, but the GPU had never come back
> > > >   to D0 again which is why ODMs used the _OSI to expose an old _DSM
> > > >   method to switch the power on/off.
> > > >   That has been fixed by commit 5775b843a619 ("PCI: Restore config space
> > > >   on runtime resume despite being unbound"). so vendors shouldn't be
> > > >   using this string to modify ASL any more.
> > > >   Reviewed-by: Lyude Paul 
> > > >   Signed-off-by: Mario Limonciello 
> > > >   Signed-off-by: Rafael J. Wysocki 
> > > >
> > > >    drivers/acpi/osi.c | 9 -
> > > >    1 file changed, 9 deletions(-)
> > > >
> > > > This machine is a Dell with an nvidia chip so it looks like this really
> > > > could be the commit that is causing the problems. The description
> > > > of the commit also seems (to my untrained eye) to be consistent with the
> > > > error reported on the console when the lockup occurs:
> > > >
> > > > [   58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error 
> > > > (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > [   58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to 
> > > > previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > [   60.083261] vfio-pci :01:00.0 Unable to change power state from D3cold 
> > > > to D0, device inaccessible
> > > >
> > > > Hopefully this is enough information for experts to resolve this.
> > >
> > > Does this ring some bell for you? Do you need any further information
> > > from Nick?
> > >
> > > Regards,
> > > Salvatore
> >
>
> > Have Nick try using "pcie_port_pm=off" and see if it helps the issue.
>
> I booted into a 6.1 kernel with this option. It has been running without
> problems for 1.5 hours. Usually I would expect the lockup to have
> occurred by now.
>
> > Does this happen in the latest 6.4 RC as well?
>
> I have compiled that kernel and will boot into it after running this one
> with the pcie_port_pm=off for another hour or so.
>
> > I think we need to see a full dmesg and acpidump to better
> > characterize it.
>
> Please find attached. Let me know if there is anything else I can provide.
>
> Regards,
>
> Nick.


I don't see nouveau loading, are you explicitly preventing it from 
loading?  Can I see the journal from a boot when it reproduced?




Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-05-28 Thread Mario Limonciello

On 5/28/23 01:49, Salvatore Bonaccorso wrote:

> Hi Mario
>
> Nick Hastings reported in Debian in https://bugs.debian.org/1036530
> lockups from his system after updating from a 6.0 based version to
> 6.1.y.
>
> #regzbot ^introduced 24867516f06d
>
> he bisected the issue and tracked it down to:
>
> On Sun, May 28, 2023 at 10:14:51AM +0900, Nick Hastings wrote:
> > Control: tags -1 - moreinfo
> >
> > Hi,
> >
> > I repeated the git bisect, and the bad commit seems to be:
> >
> > (git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad
> > 24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit
> > commit 24867516f06dabedef3be7eea0ef0846b91538bc
> > Author: Mario Limonciello 
> > Date:   Tue Aug 23 13:51:31 2022 -0500
> >
> > ACPI: OSI: Remove Linux-Dell-Video _OSI string
> >
> > This string was introduced because drivers for NVIDIA hardware
> > had bugs supporting RTD3 in the past.
> >
> > Before proprietary NVIDIA driver started to support RTD3, Ubuntu had
> > had a mechanism for switching PRIME on and off, though it had required
> > to logout/login to make the library switch happen.
> >
> > When the PRIME had been off, the mechanism had unloaded the NVIDIA
> > driver and put the device into D3cold, but the GPU had never come back
> > to D0 again which is why ODMs used the _OSI to expose an old _DSM
> > method to switch the power on/off.
> >
> > That has been fixed by commit 5775b843a619 ("PCI: Restore config space
> > on runtime resume despite being unbound"). so vendors shouldn't be
> > using this string to modify ASL any more.
> >
> > Reviewed-by: Lyude Paul 
> > Signed-off-by: Mario Limonciello 
> > Signed-off-by: Rafael J. Wysocki 
> >
> >  drivers/acpi/osi.c | 9 -
> >  1 file changed, 9 deletions(-)
> >
> > This machine is a Dell with an nvidia chip so it looks like this really
> > could be the commit that is causing the problems. The description
> > of the commit also seems (to my untrained eye) to be consistent with the
> > error reported on the console when the lockup occurs:
> >
> > [   58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error 
> > (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > [   58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to 
> > previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > [   60.083261] vfio-pci :01:00.0 Unable to change power state from D3cold 
> > to D0, device inaccessible
> >
> > Hopefully this is enough information for experts to resolve this.
>
> Does this ring some bell for you? Do you need any further information
> from Nick?
>
> Regards,
> Salvatore


Hi Salvatore,

Have Nick try using "pcie_port_pm=off" and see if it helps the issue.

Does this happen in the latest 6.4 RC as well?

I think we need to see a full dmesg and acpidump to better characterize it.
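
For reference, a sketch of one way to set a parameter such as pcie_port_pm=off persistently, assuming a Debian system that boots via GRUB and has the stock /etc/default/grub layout with a double-quoted GRUB_CMDLINE_LINUX_DEFAULT line. The helper name add_kernel_param is hypothetical. For a one-off test, the parameter can instead be typed into the kernel line from the GRUB menu at boot.

```shell
# Hypothetical helper: append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT.
# $1: parameter, e.g. pcie_port_pm=off; $2: grub defaults file (default /etc/default/grub).
add_kernel_param() {
    param=$1
    file=${2:-/etc/default/grub}
    # Nothing to do if the parameter is already on the line.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0
    fi
    # Append the parameter just before the closing double quote.
    sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\".*\)\"|\1 $param\"|" "$file"
}

# Usage (as root), then regenerate the GRUB config and reboot:
#   add_kernel_param pcie_port_pm=off && update-grub
#   cat /proc/cmdline    # after the reboot: confirm the parameter is present
```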



Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-05-28 Thread Salvatore Bonaccorso
Hi Mario

Nick Hastings reported in Debian in https://bugs.debian.org/1036530
lockups from his system after updating from a 6.0 based version to
6.1.y.

#regzbot ^introduced 24867516f06d

he bisected the issue and tracked it down to:

On Sun, May 28, 2023 at 10:14:51AM +0900, Nick Hastings wrote:
> Control: tags -1 - moreinfo
> 
> Hi,
> 
> I repeated the git bisect, and the bad commit seems to be:
> 
> (git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad
> 24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit
> commit 24867516f06dabedef3be7eea0ef0846b91538bc
> Author: Mario Limonciello 
> Date:   Tue Aug 23 13:51:31 2022 -0500
> 
> ACPI: OSI: Remove Linux-Dell-Video _OSI string
> 
> This string was introduced because drivers for NVIDIA hardware
> had bugs supporting RTD3 in the past.
> 
> Before proprietary NVIDIA driver started to support RTD3, Ubuntu had
> had a mechanism for switching PRIME on and off, though it had required
> to logout/login to make the library switch happen.
> 
> When the PRIME had been off, the mechanism had unloaded the NVIDIA
> driver and put the device into D3cold, but the GPU had never come back
> to D0 again which is why ODMs used the _OSI to expose an old _DSM
> method to switch the power on/off.
> 
> That has been fixed by commit 5775b843a619 ("PCI: Restore config space
> on runtime resume despite being unbound"). so vendors shouldn't be
> using this string to modify ASL any more.
> 
> Reviewed-by: Lyude Paul 
> Signed-off-by: Mario Limonciello 
> Signed-off-by: Rafael J. Wysocki 
> 
>  drivers/acpi/osi.c | 9 -
>  1 file changed, 9 deletions(-)
> 
> This machine is a Dell with an nvidia chip so it looks like this really
> could be the commit that is causing the problems. The description
> of the commit also seems (to my untrained eye) to be consistent with the
> error reported on the console when the lockup occurs:
> 
> [   58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous 
> error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> [   58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to 
> previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> [   60.083261] vfio-pci :01:00.0 Unable to change power state from D3cold 
> to D0, device inaccessible
> 
> Hopefully this is enough information for experts to resolve this.

Does this ring some bell for you? Do you need any further information
from Nick?

Regards,
Salvatore



Bug#1036530: linux-signed-amd64: Hard lock up of system

2023-05-27 Thread Nick Hastings
Control: tags -1 - moreinfo

Hi,

I repeated the git bisect, and the bad commit seems to be:

(git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad
24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit
commit 24867516f06dabedef3be7eea0ef0846b91538bc
Author: Mario Limonciello 
Date:   Tue Aug 23 13:51:31 2022 -0500

ACPI: OSI: Remove Linux-Dell-Video _OSI string

This string was introduced because drivers for NVIDIA hardware
had bugs supporting RTD3 in the past.

Before proprietary NVIDIA driver started to support RTD3, Ubuntu had
had a mechanism for switching PRIME on and off, though it had required
to logout/login to make the library switch happen.

When the PRIME had been off, the mechanism had unloaded the NVIDIA
driver and put the device into D3cold, but the GPU had never come back
to D0 again which is why ODMs used the _OSI to expose an old _DSM
method to switch the power on/off.

That has been fixed by commit 5775b843a619 ("PCI: Restore config space
on runtime resume despite being unbound"). so vendors shouldn't be
using this string to modify ASL any more.

Reviewed-by: Lyude Paul 
Signed-off-by: Mario Limonciello 
Signed-off-by: Rafael J. Wysocki 

 drivers/acpi/osi.c | 9 -
 1 file changed, 9 deletions(-)

This machine is a Dell with an nvidia chip so it looks like this really
could be the commit that is causing the problems. The description
of the commit also seems (to my untrained eye) to be consistent with the
error reported on the console when the lockup occurs:

[   58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error 
(AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
[   58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to 
previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
[   60.083261] vfio-pci :01:00.0 Unable to change power state from D3cold 
to D0, device inaccessible

Hopefully this is enough information for experts to resolve this.

Regards,

Nick.

* Salvatore Bonaccorso  [230526 20:30]:
> Control: tags -1 + moreinfo
> 
> Hi Nick,
> 
> On Fri, May 26, 2023 at 09:25:23AM +0900, Nick Hastings wrote:
> > Hi Salvatore,
> > 
> > thanks for your help. However, I'm now not sure if I really have
> > identified the commit that causes my problems. I fear I may have made
> > one or more mistakes when setting "git bisect good". I had been under
> > the impression that the lock up would happen no more than a few tens of
> > minutes after booting, however it seems that sometimes it can take a few
> > hours to occur.
> > 
> > So, I'm running the git bisect again and will be more careful before
> > marking "git bisect good". It could take a few days.
> > 
> > Should this particular bug be closed?
> 
> Thanks a lot for reporting back; the time you put into the bisect is very
> much appreciated and valued! No, there is no need to close this one, as the
> bug still persists. Please just follow up once you have identified the
> culprit with the fresh bisect.
> 
> Please also remove the moreinfo tag again at that point (you can write
> a control message with tag -1 - moreinfo, so that the bug won't appear as
> needing information from the reporter).
> 
> Thank you!
> 
> Regards,
> Salvatore



Bug#1036530: linux-signed-amd64: Hard lock up of system

2023-05-26 Thread Salvatore Bonaccorso
Control: tags -1 + moreinfo

Hi Nick,

On Fri, May 26, 2023 at 09:25:23AM +0900, Nick Hastings wrote:
> Hi Salvatore,
> 
> thanks for your help. However, I'm now not sure if I really have
> identified the commit that causes my problems. I fear I may have made
> one or more mistakes when setting "git bisect good". I had been under
> the impression that the lock up would happen no more than a few tens of
> minutes after booting, however it seems that sometimes it can take a few
> hours to occur.
> 
> So, I'm running the git bisect again and will be more careful before
> marking "git bisect good". It could take a few days.
> 
> Should this particular bug be closed?

Thanks a lot for reporting back, the time you put into the bisect is very
much appreciated and valued! No, no need to close this one, as the bug
still persists. Just follow up please once you have identified the
culprit with the fresh bisect.

Please also remove the moreinfo tag at that point (you can send a
control message with "tags -1 - moreinfo", so that the bug no longer
appears as needing information from the reporter).
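
For illustration, such a follow-up could mirror the pseudo-header at the top of this mail with the sign flipped. This is only a sketch: the body line is a placeholder, and it assumes the mail is sent as an ordinary reply to the bug.

```text
Control: tags -1 - moreinfo

(results of the fresh bisect go here)
```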

Thank you!

Regards,
Salvatore



Bug#1036530: linux-signed-amd64: Hard lock up of system

2023-05-25 Thread Nick Hastings
Hi Salvatore,

thanks for your help. However, I'm now not sure if I really have
identified the commit that causes my problems. I fear I may have made
one or more mistakes when setting "git bisect good". I had been under
the impression that the lock up would happen no more than a few tens of
minutes after booting, however it seems that sometimes it can take a few
hours to occur.

So, I'm running the git bisect again and will be more careful before
marking "git bisect good". It could take a few days.
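
One way to be more careful is to accept a build as good only after it has run for a minimum soak time without the ACPI timeout appearing in the kernel log. The sketch below assumes a four-hour soak (the lockup was only observed to take "a few hours", so the right value is a guess) and that the kernel log has been saved to a file; `bisect_verdict` is a hypothetical helper, not part of git.

```shell
#!/bin/sh
# bisect_verdict UPTIME_SECONDS KERNEL_LOG [MIN_SECONDS]
# Prints "bad" if the log contains the ACPI timeout seen before the lockup,
# "good" if the machine has survived the soak period, otherwise "wait".
bisect_verdict() {
    uptime_s=$1
    log=$2
    min_s=${3:-14400}   # assumed soak time: 4 hours
    if grep -q 'AE_AML_LOOP_TIMEOUT' "$log"; then
        echo bad
    elif [ "$uptime_s" -ge "$min_s" ]; then
        echo good
    else
        echo wait
    fi
}

# Example on a running test kernel (Linux-specific paths):
#   dmesg > /tmp/kernel.log   # may need root
#   bisect_verdict "$(cut -d. -f1 /proc/uptime)" /tmp/kernel.log
```

A hard lockup is of course an unambiguous "bad" on its own; the helper only guards against marking "good" too early.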

Should this particular bug be closed?

Thanks,

Nick.


* Salvatore Bonaccorso  [230526 00:19]:
> Hi Nick,
> 
> On Thu, May 25, 2023 at 08:23:15AM +0900, Nick Hastings wrote:
> > Hi,
> > 
> > * Salvatore Bonaccorso  [230524 19:26]:
> > >
> > > Given you were able to bisect it so far, can you try to isolate the
> > > commit from the merge commit causing it?
> > 
> > I guess I can try. The commit message states:
> > 
> > Merge: c77f54a9bcec a1cf1fd62ae7 562163595a91 018d6711c26e 6cc401be1648
> > 
> > > Is there a way to extract each of those?
> 
> The way I usually get all commits from a merge commit is
> 
> git log --oneline $mergecommit^..$mergecommit^2
> 
> though here we have three merge commits, merged with one merge commit
> on top, so you would go down the merges of the acpi-properties,
> acpi-tables, acpi-x86 and acpi-soc branches. They are:
> 
> * acpi-properties:
>   ACPI: property: Silence missing-declarations warning in apple.c
> 
> * acpi-tables:
>   ACPI: HMAT: Drop unused dev_fmt() and redundant 'HMAT' prefix
>   ACPI: tables: FPDT: Don't call acpi_os_map_memory() on invalid phys address
> 
> * acpi-x86:
>   ACPI: x86: Add a quirk for Dell Inspiron 14 2-in-1 for StorageD3Enable
> 
> * acpi-soc:
>   ACPI: LPSS: Deduplicate skipping device in acpi_lpss_create_device()
>   ACPI: LPSS: Replace loop with first entry retrieval
> 
> > > One remotely related might be "ACPI: x86: Add a quirk for Dell
> > > Inspiron 14 2-in-1 for StorageD3Enable".
> > 
> > Manually looking at the diff with
> > > git diff e996c7e01892ac18ec0db447294d4f591c325efe~ e996c7e01892ac18ec0db447294d4f591c325efe
> > I guess that means the following:
> > 
> > --- a/drivers/acpi/x86/utils.c
> > +++ b/drivers/acpi/x86/utils.c
> > @@ -207,9 +207,26 @@ static const struct x86_cpu_id storage_d3_cpu_ids[] = {
> > {}
> >  };
> >  
> > +static const struct dmi_system_id force_storage_d3_dmi[] = {
> > +   {
> > +   /*
> > +* _ADR is ambiguous between GPP1.DEV0 and GPP1.NVME
> > +* but .NVME is needed to get StorageD3Enable node
> > +* https://bugzilla.kernel.org/show_bug.cgi?id=216440
> > +*/
> > +   .matches = {
> > +   DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
> > > +   DMI_MATCH(DMI_PRODUCT_NAME, "Inspiron 14 7425 2-in-1"),
> > +   }
> > +   },
> > +   {}
> > +};
> > +
> >  bool force_storage_d3(void)
> >  {
> > -   return x86_match_cpu(storage_d3_cpu_ids);
> > > +   const struct dmi_system_id *dmi_id = dmi_first_match(force_storage_d3_dmi);
> > +
> > +   return dmi_id || x86_match_cpu(storage_d3_cpu_ids);
> >  }
> 
> That probably won't work actually as the code has been refactored
> substantially after the commit.
> 
> In the ideal case we could confirm the quirk change is the responsible
> commit, so we can make upstream aware.
> 
> Regards,
> Salvatore



Bug#1036530: linux-signed-amd64: Hard lock up of system

2023-05-25 Thread Salvatore Bonaccorso
Hi Nick,

On Thu, May 25, 2023 at 08:23:15AM +0900, Nick Hastings wrote:
> Hi,
> 
> * Salvatore Bonaccorso  [230524 19:26]:
> >
> > Given you were able to bisect it so far, can you try to isolate the
> > commit from the merge commit causing it?
> 
> I guess I can try. The commit message states:
> 
> Merge: c77f54a9bcec a1cf1fd62ae7 562163595a91 018d6711c26e 6cc401be1648
> 
> Is there a way to extract each of those?

The way I usually get all commits from a merge commit is

git log --oneline $mergecommit^..$mergecommit^2

though here we have three merge commits, merged with one merge commit
on top, so you would go down the merges of the acpi-properties,
acpi-tables, acpi-x86 and acpi-soc branches. They are:

* acpi-properties:
  ACPI: property: Silence missing-declarations warning in apple.c

* acpi-tables:
  ACPI: HMAT: Drop unused dev_fmt() and redundant 'HMAT' prefix
  ACPI: tables: FPDT: Don't call acpi_os_map_memory() on invalid phys address

* acpi-x86:
  ACPI: x86: Add a quirk for Dell Inspiron 14 2-in-1 for StorageD3Enable

* acpi-soc:
  ACPI: LPSS: Deduplicate skipping device in acpi_lpss_create_device()
  ACPI: LPSS: Replace loop with first entry retrieval
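
For an octopus merge such as e996c7e01892, whose Merge: line lists five parents, the commits of each merged branch can be listed with one revision range per parent. A minimal sketch, assuming the first parent is the branch that was merged into and parents 2 to 5 are the tips of the merged branches; `merge_ranges` is a hypothetical helper name, not a git command.

```shell
#!/bin/sh
# merge_ranges MERGE NPARENTS
# Print one revision range per merged branch of a merge commit:
# everything reachable from parent N but not from the first parent.
merge_ranges() {
    merge=$1
    nparents=$2
    n=2
    while [ "$n" -le "$nparents" ]; do
        printf '%s^1..%s^%s\n' "$merge" "$merge" "$n"
        n=$((n + 1))
    done
}

# Usage inside a kernel checkout:
#   merge_ranges e996c7e01892 5 | while read -r range; do
#       echo "== $range"
#       git log --oneline "$range"
#   done
```

Each range can also be handed to a nested bisect, or the few commits it lists can simply be tested one at a time.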

> > One remotely related might be "ACPI: x86: Add a quirk for Dell
> > Inspiron 14 2-in-1 for StorageD3Enable".
> 
> Manually looking at the diff with
> git diff e996c7e01892ac18ec0db447294d4f591c325efe~ e996c7e01892ac18ec0db447294d4f591c325efe
> I guess that means the following:
> 
> --- a/drivers/acpi/x86/utils.c
> +++ b/drivers/acpi/x86/utils.c
> @@ -207,9 +207,26 @@ static const struct x86_cpu_id storage_d3_cpu_ids[] = {
> {}
>  };
>  
> +static const struct dmi_system_id force_storage_d3_dmi[] = {
> +   {
> +   /*
> +* _ADR is ambiguous between GPP1.DEV0 and GPP1.NVME
> +* but .NVME is needed to get StorageD3Enable node
> +* https://bugzilla.kernel.org/show_bug.cgi?id=216440
> +*/
> +   .matches = {
> +   DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
> +   DMI_MATCH(DMI_PRODUCT_NAME, "Inspiron 14 7425 2-in-1"),
> +   }
> +   },
> +   {}
> +};
> +
>  bool force_storage_d3(void)
>  {
> -   return x86_match_cpu(storage_d3_cpu_ids);
> +   const struct dmi_system_id *dmi_id = dmi_first_match(force_storage_d3_dmi);
> +
> +   return dmi_id || x86_match_cpu(storage_d3_cpu_ids);
>  }

That probably won't work actually as the code has been refactored
substantially after the commit.

In the ideal case we could confirm the quirk change is the responsible
commit, so we can make upstream aware.

Regards,
Salvatore



Bug#1036530: linux-signed-amd64: Hard lock up of system

2023-05-24 Thread Nick Hastings
Hi,

* Salvatore Bonaccorso  [230524 19:26]:
>
> Given you were able to bisect it so far, can you try to isolate the
> commit from the merge commit causing it?

I guess I can try. The commit message states:

Merge: c77f54a9bcec a1cf1fd62ae7 562163595a91 018d6711c26e 6cc401be1648

Is there a way to extract each of those?

> One remotely related might be "ACPI: x86: Add a quirk for Dell
> Inspiron 14 2-in-1 for StorageD3Enable".

Manually looking at the diff with
git diff e996c7e01892ac18ec0db447294d4f591c325efe~ e996c7e01892ac18ec0db447294d4f591c325efe
I guess that means the following:

--- a/drivers/acpi/x86/utils.c
+++ b/drivers/acpi/x86/utils.c
@@ -207,9 +207,26 @@ static const struct x86_cpu_id storage_d3_cpu_ids[] = {
 	{}
 };
 
+static const struct dmi_system_id force_storage_d3_dmi[] = {
+	{
+		/*
+		 * _ADR is ambiguous between GPP1.DEV0 and GPP1.NVME
+		 * but .NVME is needed to get StorageD3Enable node
+		 * https://bugzilla.kernel.org/show_bug.cgi?id=216440
+		 */
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+			DMI_MATCH(DMI_PRODUCT_NAME, "Inspiron 14 7425 2-in-1"),
+		}
+	},
+	{}
+};
+
 bool force_storage_d3(void)
 {
-	return x86_match_cpu(storage_d3_cpu_ids);
+	const struct dmi_system_id *dmi_id = dmi_first_match(force_storage_d3_dmi);
+
+	return dmi_id || x86_match_cpu(storage_d3_cpu_ids);
 }
 

Thanks,

Nick.



Bug#1036530: linux-signed-amd64: Hard lock up of system

2023-05-24 Thread Salvatore Bonaccorso
Control: tags -1 + moreinfo

Hi Nick,

On Mon, May 22, 2023 at 08:56:12AM +0900, Nick Hastings wrote:
> Source: linux-signed-amd64
> Severity: important
> Tags: upstream
> X-Debbugs-Cc: nicholaschasti...@gmail.com
> 
> Dear Maintainer,
> 
> after upgrading from a 6.0.0 kernel to a 6.1.0 kernel I experienced
> hard lockups on my Dell XPS 15 7590 a few minutes after each boot.  On
> more than one occasion I was on the console and was able to see the
> error message. It was the same error on each occasion, and I reproduce
> it here:
> 
> [   58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> [   58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> [   60.083261] vfio-pci :01:00.0 Unable to change power state from D3cold to D0, device inaccessible
> 
> N.B. the message on the console was recorded with a photograph and
> then manually typed in, so it is possible that it may contain one or
> more errors.
> 
> I ran git bisect as described at
> https://wiki.debian.org/DebianKernel/GitBisect which seems to have
> found the bad commit. It is a merge commit that deals with acpi code.
> However I don't see what may actually be causing this issue.
> The commit is e996c7e01892ac18ec0db447294d4f591c325efe
> 
> Please find the report from git bisect below.

Given you were able to bisect it so far, can you try to isolate the
commit from the merge commit causing it? One remotely related might be
"ACPI: x86: Add a quirk for Dell Inspiron 14 2-in-1 for
StorageD3Enable".

Regards,
Salvatore



Bug#1036530: linux-signed-amd64: Hard lock up of system

2023-05-21 Thread Nick Hastings
Source: linux-signed-amd64
Severity: important
Tags: upstream
X-Debbugs-Cc: nicholaschasti...@gmail.com

Dear Maintainer,

after upgrading from a 6.0.0 kernel to a 6.1.0 kernel I experienced
hard lockups on my Dell XPS 15 7590 a few minutes after each boot.  On
more than one occasion I was on the console and was able to see the
error message. It was the same error on each occasion, and I reproduce
it here:

[   58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
[   58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
[   60.083261] vfio-pci :01:00.0 Unable to change power state from D3cold to D0, device inaccessible

N.B. the message on the console was recorded with a photograph and
then manually typed in, so it is possible that it may contain one or
more errors.

I ran git bisect as described at
https://wiki.debian.org/DebianKernel/GitBisect which seems to have
found the bad commit. It is a merge commit that deals with acpi code.
However I don't see what may actually be causing this issue.
The commit is e996c7e01892ac18ec0db447294d4f591c325efe

Please find the report from git bisect below.

Regards,

Nick.

-- System Information:
Debian Release: 12.0
  APT prefers testing
  APT policy: (990, 'testing'), (500, 'unstable')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 6.0.0-rc6-1-g018d6711c26e (SMP w/16 CPU threads; PREEMPT)
Locale: LANG=en_AU.UTF-8, LC_CTYPE=en_AU.UTF-8 (charmap=UTF-8), LANGUAGE=en_AU:en
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled


% git bisect good
e996c7e01892ac18ec0db447294d4f591c325efe is the first bad commit
commit e996c7e01892ac18ec0db447294d4f591c325efe
Merge: c77f54a9bcec a1cf1fd62ae7 562163595a91 018d6711c26e 6cc401be1648
Author: Rafael J. Wysocki 
Date:   Fri Sep 30 20:52:39 2022 +0200

Merge branches 'acpi-properties', 'acpi-tables', 'acpi-x86' and 'acpi-soc'

Merge changes related to ACPI data-only tables handling and ACPI device
properties management, x86-specific ACPI code changes and ACPI SoC driver
changes for 6.1-rc1:

 - Clean up the ACPI LPSS (Intel SoC) driver (Andy Shevchenko).

 - Add a quirk for Dell Inspiron 14 2-in-1 for StorageD3Enable (Mario
   Limonciello).

 - Drop unused dev_fmt() and redundant 'HMAT' prefix from the HMAT
  parsing code (Liu Shixin).

 - Make ACPI FPDT parsing code avoid calling acpi_os_map_memory() on
   invalid physical addresses (Hans de Goede).

 - Silence missing-declarations warning related to Apple device
   properties management (Lukas Wunner).

* acpi-properties:
  ACPI: property: Silence missing-declarations warning in apple.c

* acpi-tables:
  ACPI: HMAT: Drop unused dev_fmt() and redundant 'HMAT' prefix
  ACPI: tables: FPDT: Don't call acpi_os_map_memory() on invalid phys address

* acpi-x86:
  ACPI: x86: Add a quirk for Dell Inspiron 14 2-in-1 for StorageD3Enable

* acpi-soc:
  ACPI: LPSS: Deduplicate skipping device in acpi_lpss_create_device()
  ACPI: LPSS: Replace loop with first entry retrieval

 drivers/acpi/acpi_fpdt.c | 22 ++
 drivers/acpi/acpi_lpss.c | 45 +
 drivers/acpi/numa/hmat.c | 25 -
 drivers/acpi/x86/apple.c |  1 +
 drivers/acpi/x86/utils.c | 19 ++-
 5 files changed, 74 insertions(+), 38 deletions(-)