Re: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."

2021-12-22 Thread Alex Deucher
On Tue, Dec 21, 2021 at 1:47 PM Deucher, Alexander
 wrote:
>
> [Public]
>
> > -Original Message-
> > From: Deucher, Alexander
> > Sent: Tuesday, December 21, 2021 12:01 PM
> > To: Linus Torvalds ; Imre Deak
> > ; amd-gfx@lists.freedesktop.org
> > Cc: Daniel Vetter ; Kai-Heng Feng
> > 
> > Subject: RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release
> > PCI device ..."
> >
> > [Public]
> >
> > > -Original Message-
> > > From: Linus Torvalds 
> > > Sent: Monday, December 20, 2021 5:05 PM
> > > To: Imre Deak 
> > > Cc: Daniel Vetter ; Deucher, Alexander
> > > ; Kai-Heng Feng
> > > 
> > > Subject: Re: Expecting to revert commit 55285e21f045 "fbdev/efifb:
> > > Release PCI device ..."
> > >
> > > On Mon, Dec 20, 2021 at 1:33 PM Imre Deak 
> > wrote:
> > > >
> > > > amdgpu.runpm=0
> > >
> > > Hmmm.
> > >
> > > This does seem to "work", but not very well.
> > >
> > > With this, what seems to happen is odd: I lock the screen, wait, it
> > > goes "No signal, shutting down", but then doesn't actually shut down
> > > but stays black (with the backlight on). After _another_ five seconds
> > > or so, the monitor goes "No signal, shutting down" _again_, and at that
> > point it actually does it.
> > >
> > > So it solves my immediate problem - in that yes, the backlight finally
> > > does turn off in the end - but it does seem to be still broken.
> > >
> > > I'm very surprised if no AMD drm developers can see this exact same thing.
> > > This is a very simple setup. The only possibly slightly less common
> > > thing is that I have two monitors, but while that is not necessarily
> > > the _most_ common setup in an absolute sense, I'd expect it to be very
> > > common among DRM developers..
> > >
> > > I guess I can just change the revert to just a
> > >
> > > -int amdgpu_runtime_pm = -1;
> > > +int amdgpu_runtime_pm = 0;
> > >
> > > instead. The auto-detect is apparently broken. Maybe it should only
> > > kick in for LVDS screens on actual laptops?
> > >
> > > Note: on my machine, I get that
> > >
> > >amdgpu :49:00.0: amdgpu: Using BACO for runtime pm
> > >
> > > so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
> > > and it's only that BACO case that is broken.
> > >
> > > I have no idea what any of those three things are - I'm just looking
> > > at the uses of that amdgpu_runtime_pm variable.
> > >
> > > amdgpu people: if you don't want that amdgpu_runtime_pm turned off by
> > > default, tell me something else to try.
> >
> > For a little background, runtime PM support was added about 10 year ago
> > originally to support laptops with multiple GPUs (integrated and discrete).
> > It's not specific to the display hardware.  When the GPU is idle, it can be
> > powered down completely.  In the case of these laptops, it's D3 cold
> > (managed by ACPI, we call this BOCO in AMD parlance - Bus Off, Chip Off)
> > which powers off the dGPU completely (i.e., it disappears from the bus).  A
> > few years ago we extended this to support desktop dGPUs as well which
> > support their own version of runtime D3 (called BACO in AMD parlance - Bus
> > Active, Chip Off).  The driver can put the chip into a low power state where
> > everything except the bus interface is powered down (to avoid the device
> > disappearing from the bus).  So this has worked for almost 2 years now on
> > BACO capable parts and for a decade or more on BOCO systems.
> > Unfortunately, changing the default runpm parameter setting would cause a
> > flood of bug reports about runtime power management breaking and
> > suddenly systems are using more power.
> >
> > Imre's commit (55285e21f045) fixes another commit (a6c0fd3d5a8b).
> > Runtime pm was working on amdgpu prior to that commit.  Is it possible
> > there is still some race between when amdgpu takes over from efifb?  Does
> > it work properly when all pm_runtime calls in efifb are removed or if efifb 
> > is
> > not enabled?  Runtime pm for Polaris boards has been enabled by default
> > since 4fdda2e66de0b which predates both of those patches.
>
> Thinking about this more, I wonder if there was some change in some userspace 
> component which 

Re: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."

2021-12-21 Thread Alex Deucher
On Tue, Dec 21, 2021 at 5:09 PM Harry Wentland  wrote:
>
>
>
> On 2021-12-21 16:18, Alex Deucher wrote:
> > On Tue, Dec 21, 2021 at 1:47 PM Deucher, Alexander
> >  wrote:
> >>
> >> [Public]
> >>
> >>> -Original Message-
> >>> From: Deucher, Alexander
> >>> Sent: Tuesday, December 21, 2021 12:01 PM
> >>> To: Linus Torvalds ; Imre Deak
> >>> ; amd-gfx@lists.freedesktop.org
> >>> Cc: Daniel Vetter ; Kai-Heng Feng
> >>> 
> >>> Subject: RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release
> >>> PCI device ..."
> >>>
> >>> [Public]
> >>>
> >>>> -----Original Message-
> >>>> From: Linus Torvalds 
> >>>> Sent: Monday, December 20, 2021 5:05 PM
> >>>> To: Imre Deak 
> >>>> Cc: Daniel Vetter ; Deucher, Alexander
> >>>> ; Kai-Heng Feng
> >>>> 
> >>>> Subject: Re: Expecting to revert commit 55285e21f045 "fbdev/efifb:
> >>>> Release PCI device ..."
> >>>>
> >>>> On Mon, Dec 20, 2021 at 1:33 PM Imre Deak 
> >>> wrote:
> >>>>>
> >>>>> amdgpu.runpm=0
> >>>>
> >>>> Hmmm.
> >>>>
> >>>> This does seem to "work", but not very well.
> >>>>
> >>>> With this, what seems to happen is odd: I lock the screen, wait, it
> >>>> goes "No signal, shutting down", but then doesn't actually shut down
> >>>> but stays black (with the backlight on). After _another_ five seconds
> >>>> or so, the monitor goes "No signal, shutting down" _again_, and at that
> >>> point it actually does it.
> >>>>
> >>>> So it solves my immediate problem - in that yes, the backlight finally
> >>>> does turn off in the end - but it does seem to be still broken.
> >>>>
> >>>> I'm very surprised if no AMD drm developers can see this exact same 
> >>>> thing.
> >>>> This is a very simple setup. The only possibly slightly less common
> >>>> thing is that I have two monitors, but while that is not necessarily
> >>>> the _most_ common setup in an absolute sense, I'd expect it to be very
> >>>> common among DRM developers..
> >>>>
> >>>> I guess I can just change the revert to just a
> >>>>
> >>>> -int amdgpu_runtime_pm = -1;
> >>>> +int amdgpu_runtime_pm = 0;
> >>>>
> >>>> instead. The auto-detect is apparently broken. Maybe it should only
> >>>> kick in for LVDS screens on actual laptops?
> >>>>
> >>>> Note: on my machine, I get that
> >>>>
> >>>>amdgpu :49:00.0: amdgpu: Using BACO for runtime pm
> >>>>
> >>>> so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
> >>>> and it's only that BACO case that is broken.
> >>>>
> >>>> I have no idea what any of those three things are - I'm just looking
> >>>> at the uses of that amdgpu_runtime_pm variable.
> >>>>
> >>>> amdgpu people: if you don't want that amdgpu_runtime_pm turned off by
> >>>> default, tell me something else to try.
> >>>
> >>> For a little background, runtime PM support was added about 10 year ago
> >>> originally to support laptops with multiple GPUs (integrated and 
> >>> discrete).
> >>> It's not specific to the display hardware.  When the GPU is idle, it can 
> >>> be
> >>> powered down completely.  In the case of these laptops, it's D3 cold
> >>> (managed by ACPI, we call this BOCO in AMD parlance - Bus Off, Chip Off)
> >>> which powers off the dGPU completely (i.e., it disappears from the bus).  
> >>> A
> >>> few years ago we extended this to support desktop dGPUs as well which
> >>> support their own version of runtime D3 (called BACO in AMD parlance - Bus
> >>> Active, Chip Off).  The driver can put the chip into a low power state 
> >>> where
> >>> everything except the bus interface is powered down (to avoid the device
> >>> disappearing from the bus).  So this has worked for almost 2 years now on
> >>> BACO capable parts and for a decade or mo

Re: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."

2021-12-21 Thread Harry Wentland



On 2021-12-21 16:18, Alex Deucher wrote:
> On Tue, Dec 21, 2021 at 1:47 PM Deucher, Alexander
>  wrote:
>>
>> [Public]
>>
>>> -Original Message-
>>> From: Deucher, Alexander
>>> Sent: Tuesday, December 21, 2021 12:01 PM
>>> To: Linus Torvalds ; Imre Deak
>>> ; amd-gfx@lists.freedesktop.org
>>> Cc: Daniel Vetter ; Kai-Heng Feng
>>> 
>>> Subject: RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release
>>> PCI device ..."
>>>
>>> [Public]
>>>
>>>> -Original Message-
>>>> From: Linus Torvalds 
>>>> Sent: Monday, December 20, 2021 5:05 PM
>>>> To: Imre Deak 
>>>> Cc: Daniel Vetter ; Deucher, Alexander
>>>> ; Kai-Heng Feng
>>>> 
>>>> Subject: Re: Expecting to revert commit 55285e21f045 "fbdev/efifb:
>>>> Release PCI device ..."
>>>>
>>>> On Mon, Dec 20, 2021 at 1:33 PM Imre Deak 
>>> wrote:
>>>>>
>>>>> amdgpu.runpm=0
>>>>
>>>> Hmmm.
>>>>
>>>> This does seem to "work", but not very well.
>>>>
>>>> With this, what seems to happen is odd: I lock the screen, wait, it
>>>> goes "No signal, shutting down", but then doesn't actually shut down
>>>> but stays black (with the backlight on). After _another_ five seconds
>>>> or so, the monitor goes "No signal, shutting down" _again_, and at that
>>> point it actually does it.
>>>>
>>>> So it solves my immediate problem - in that yes, the backlight finally
>>>> does turn off in the end - but it does seem to be still broken.
>>>>
>>>> I'm very surprised if no AMD drm developers can see this exact same thing.
>>>> This is a very simple setup. The only possibly slightly less common
>>>> thing is that I have two monitors, but while that is not necessarily
>>>> the _most_ common setup in an absolute sense, I'd expect it to be very
>>>> common among DRM developers..
>>>>
>>>> I guess I can just change the revert to just a
>>>>
>>>> -int amdgpu_runtime_pm = -1;
>>>> +int amdgpu_runtime_pm = 0;
>>>>
>>>> instead. The auto-detect is apparently broken. Maybe it should only
>>>> kick in for LVDS screens on actual laptops?
>>>>
>>>> Note: on my machine, I get that
>>>>
>>>>amdgpu :49:00.0: amdgpu: Using BACO for runtime pm
>>>>
>>>> so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
>>>> and it's only that BACO case that is broken.
>>>>
>>>> I have no idea what any of those three things are - I'm just looking
>>>> at the uses of that amdgpu_runtime_pm variable.
>>>>
>>>> amdgpu people: if you don't want that amdgpu_runtime_pm turned off by
>>>> default, tell me something else to try.
>>>
>>> For a little background, runtime PM support was added about 10 year ago
>>> originally to support laptops with multiple GPUs (integrated and discrete).
>>> It's not specific to the display hardware.  When the GPU is idle, it can be
>>> powered down completely.  In the case of these laptops, it's D3 cold
>>> (managed by ACPI, we call this BOCO in AMD parlance - Bus Off, Chip Off)
>>> which powers off the dGPU completely (i.e., it disappears from the bus).  A
>>> few years ago we extended this to support desktop dGPUs as well which
>>> support their own version of runtime D3 (called BACO in AMD parlance - Bus
>>> Active, Chip Off).  The driver can put the chip into a low power state where
>>> everything except the bus interface is powered down (to avoid the device
>>> disappearing from the bus).  So this has worked for almost 2 years now on
>>> BACO capable parts and for a decade or more on BOCO systems.
>>> Unfortunately, changing the default runpm parameter setting would cause a
>>> flood of bug reports about runtime power management breaking and
>>> suddenly systems are using more power.
>>>
>>> Imre's commit (55285e21f045) fixes another commit (a6c0fd3d5a8b).
>>> Runtime pm was working on amdgpu prior to that commit.  Is it possible
>>> there is still some race between when amdgpu takes over from efifb?  Does
>>> it work properly when all pm_runtime calls in efifb are removed or if

Re: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."

2021-12-21 Thread Alex Deucher
On Tue, Dec 21, 2021 at 1:47 PM Deucher, Alexander
 wrote:
>
> [Public]
>
> > -Original Message-
> > From: Deucher, Alexander
> > Sent: Tuesday, December 21, 2021 12:01 PM
> > To: Linus Torvalds ; Imre Deak
> > ; amd-gfx@lists.freedesktop.org
> > Cc: Daniel Vetter ; Kai-Heng Feng
> > 
> > Subject: RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release
> > PCI device ..."
> >
> > [Public]
> >
> > > -Original Message-
> > > From: Linus Torvalds 
> > > Sent: Monday, December 20, 2021 5:05 PM
> > > To: Imre Deak 
> > > Cc: Daniel Vetter ; Deucher, Alexander
> > > ; Kai-Heng Feng
> > > 
> > > Subject: Re: Expecting to revert commit 55285e21f045 "fbdev/efifb:
> > > Release PCI device ..."
> > >
> > > On Mon, Dec 20, 2021 at 1:33 PM Imre Deak 
> > wrote:
> > > >
> > > > amdgpu.runpm=0
> > >
> > > Hmmm.
> > >
> > > This does seem to "work", but not very well.
> > >
> > > With this, what seems to happen is odd: I lock the screen, wait, it
> > > goes "No signal, shutting down", but then doesn't actually shut down
> > > but stays black (with the backlight on). After _another_ five seconds
> > > or so, the monitor goes "No signal, shutting down" _again_, and at that
> > point it actually does it.
> > >
> > > So it solves my immediate problem - in that yes, the backlight finally
> > > does turn off in the end - but it does seem to be still broken.
> > >
> > > I'm very surprised if no AMD drm developers can see this exact same thing.
> > > This is a very simple setup. The only possibly slightly less common
> > > thing is that I have two monitors, but while that is not necessarily
> > > the _most_ common setup in an absolute sense, I'd expect it to be very
> > > common among DRM developers..
> > >
> > > I guess I can just change the revert to just a
> > >
> > > -int amdgpu_runtime_pm = -1;
> > > +int amdgpu_runtime_pm = 0;
> > >
> > > instead. The auto-detect is apparently broken. Maybe it should only
> > > kick in for LVDS screens on actual laptops?
> > >
> > > Note: on my machine, I get that
> > >
> > >amdgpu :49:00.0: amdgpu: Using BACO for runtime pm
> > >
> > > so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
> > > and it's only that BACO case that is broken.
> > >
> > > I have no idea what any of those three things are - I'm just looking
> > > at the uses of that amdgpu_runtime_pm variable.
> > >
> > > amdgpu people: if you don't want that amdgpu_runtime_pm turned off by
> > > default, tell me something else to try.
> >
> > For a little background, runtime PM support was added about 10 year ago
> > originally to support laptops with multiple GPUs (integrated and discrete).
> > It's not specific to the display hardware.  When the GPU is idle, it can be
> > powered down completely.  In the case of these laptops, it's D3 cold
> > (managed by ACPI, we call this BOCO in AMD parlance - Bus Off, Chip Off)
> > which powers off the dGPU completely (i.e., it disappears from the bus).  A
> > few years ago we extended this to support desktop dGPUs as well which
> > support their own version of runtime D3 (called BACO in AMD parlance - Bus
> > Active, Chip Off).  The driver can put the chip into a low power state where
> > everything except the bus interface is powered down (to avoid the device
> > disappearing from the bus).  So this has worked for almost 2 years now on
> > BACO capable parts and for a decade or more on BOCO systems.
> > Unfortunately, changing the default runpm parameter setting would cause a
> > flood of bug reports about runtime power management breaking and
> > suddenly systems are using more power.
> >
> > Imre's commit (55285e21f045) fixes another commit (a6c0fd3d5a8b).
> > Runtime pm was working on amdgpu prior to that commit.  Is it possible
> > there is still some race between when amdgpu takes over from efifb?  Does
> > it work properly when all pm_runtime calls in efifb are removed or if efifb 
> > is
> > not enabled?  Runtime pm for Polaris boards has been enabled by default
> > since 4fdda2e66de0b which predates both of those patches.
>
> Thinking about this more, I wonder if there was some change in some userspace 
> component which 

RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."

2021-12-21 Thread Deucher, Alexander
[Public]

> -Original Message-
> From: Deucher, Alexander
> Sent: Tuesday, December 21, 2021 12:01 PM
> To: Linus Torvalds ; Imre Deak
> ; amd-gfx@lists.freedesktop.org
> Cc: Daniel Vetter ; Kai-Heng Feng
> 
> Subject: RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release
> PCI device ..."
> 
> [Public]
> 
> > -Original Message-
> > From: Linus Torvalds 
> > Sent: Monday, December 20, 2021 5:05 PM
> > To: Imre Deak 
> > Cc: Daniel Vetter ; Deucher, Alexander
> > ; Kai-Heng Feng
> > 
> > Subject: Re: Expecting to revert commit 55285e21f045 "fbdev/efifb:
> > Release PCI device ..."
> >
> > On Mon, Dec 20, 2021 at 1:33 PM Imre Deak 
> wrote:
> > >
> > > amdgpu.runpm=0
> >
> > Hmmm.
> >
> > This does seem to "work", but not very well.
> >
> > With this, what seems to happen is odd: I lock the screen, wait, it
> > goes "No signal, shutting down", but then doesn't actually shut down
> > but stays black (with the backlight on). After _another_ five seconds
> > or so, the monitor goes "No signal, shutting down" _again_, and at that
> point it actually does it.
> >
> > So it solves my immediate problem - in that yes, the backlight finally
> > does turn off in the end - but it does seem to be still broken.
> >
> > I'm very surprised if no AMD drm developers can see this exact same thing.
> > This is a very simple setup. The only possibly slightly less common
> > thing is that I have two monitors, but while that is not necessarily
> > the _most_ common setup in an absolute sense, I'd expect it to be very
> > common among DRM developers..
> >
> > I guess I can just change the revert to just a
> >
> > -int amdgpu_runtime_pm = -1;
> > +int amdgpu_runtime_pm = 0;
> >
> > instead. The auto-detect is apparently broken. Maybe it should only
> > kick in for LVDS screens on actual laptops?
> >
> > Note: on my machine, I get that
> >
> >amdgpu :49:00.0: amdgpu: Using BACO for runtime pm
> >
> > so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
> > and it's only that BACO case that is broken.
> >
> > I have no idea what any of those three things are - I'm just looking
> > at the uses of that amdgpu_runtime_pm variable.
> >
> > amdgpu people: if you don't want that amdgpu_runtime_pm turned off by
> > default, tell me something else to try.
> 
> For a little background, runtime PM support was added about 10 year ago
> originally to support laptops with multiple GPUs (integrated and discrete).
> It's not specific to the display hardware.  When the GPU is idle, it can be
> powered down completely.  In the case of these laptops, it's D3 cold
> (managed by ACPI, we call this BOCO in AMD parlance - Bus Off, Chip Off)
> which powers off the dGPU completely (i.e., it disappears from the bus).  A
> few years ago we extended this to support desktop dGPUs as well which
> support their own version of runtime D3 (called BACO in AMD parlance - Bus
> Active, Chip Off).  The driver can put the chip into a low power state where
> everything except the bus interface is powered down (to avoid the device
> disappearing from the bus).  So this has worked for almost 2 years now on
> BACO capable parts and for a decade or more on BOCO systems.
> Unfortunately, changing the default runpm parameter setting would cause a
> flood of bug reports about runtime power management breaking and
> suddenly systems are using more power.
> 
> Imre's commit (55285e21f045) fixes another commit (a6c0fd3d5a8b).
> Runtime pm was working on amdgpu prior to that commit.  Is it possible
> there is still some race between when amdgpu takes over from efifb?  Does
> it work properly when all pm_runtime calls in efifb are removed or if efifb is
> not enabled?  Runtime pm for Polaris boards has been enabled by default
> since 4fdda2e66de0b which predates both of those patches.

Thinking about this more, I wonder if there was some change in some userspace 
component which was hidden by the changes in 55285e21f045 and a6c0fd3d5a8b.  
E.g., some desktop component started polling for display changes or GPU 
temperature or something like that and when a6c0fd3d5a8b was in place the GPU 
never entered runtime suspend.  Then when 55285e21f045 was applied, it unmasked 
the new behavior in the userpace component.

What should happen is that when all of the displays blank, assuming the GPU is 
otherwise idle, the GPU will runtime suspend after  seconds.  When you move the 
mouse or hit the keyboard, that sho

RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."

2021-12-21 Thread Deucher, Alexander
[Public]

> -Original Message-
> From: Linus Torvalds 
> Sent: Monday, December 20, 2021 5:21 PM
> To: Imre Deak ; Koenig, Christian
> ; Pan, Xinhui ;
> Wentland, Harry 
> Cc: Daniel Vetter ; Deucher, Alexander
> ; Kai-Heng Feng
> ; amd-gfx list  g...@lists.freedesktop.org>
> Subject: Re: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release
> PCI device ..."
> 
> [ Adding back in more amd people and the amd list, the people Daniel added
> seem to have gotten lost again, but I think people at least saw my original
> report thanks to Daniel ]
> 
> With "amdgpu.runpm=0", things are better, but not perfect. With that I can
> lock the screen, and it has to go through *two* cycles of "No signal, turning
> off", but on the second cycle it does finally work.
> 
> This was exposed by commit 55285e21f045 ("fbdev/efifb: Release PCI
> device's runtime PM ref during FB destroy"), probably because that made
> runtime PM actually potentially work, but it is then broken on amdgpu.
> 
> Absolutely nothing odd in my setup. Two monitors, one GPU. PCI ID
> 1002:67df rev e7, subsystem ID 1da2:e353.
> 
> I'd expect pretty much any amdgpu person to see this.
> 
> On Mon, Dec 20, 2021 at 2:04 PM Linus Torvalds  foundation.org> wrote:
> >
> > Note: on my machine, I get that
> >
> >amdgpu :49:00.0: amdgpu: Using BACO for runtime pm
> >
> > so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
> > and it's only that BACO case that is broken.
> 
> Hmm. The *documentation* says:
> 
> PX runtime pm
> 2 = force enable with BAMACO,
> 1 = force enable with BACO,
> 0 = disable,
> -1 = PX only default
> 
> but the code actually makes anything != 0 enable it, except on VEGA20 and
> ARCTURUS, where it needs to be positive.
> 
> My card is apparently "POLARIS10", whatever that means, which means that
> any non-zero value of amdgpu_runtime_pm will enable runtime PM as long
> as "amdgpu_device_supports_baco()" is true. Which it is.
> 
> Whatever. Now I'm just kwetching about the documentation not matching
> what I see the code doing, which is never a great sign when things don't
> work.

Apologies on the documentation.  -1 is the default and is enabled for all dGPUs 
which support runtime D3.  It was never fixed up when we extended support for 
runtime pm beyond PX/HG laptops.  Fixed up the documentation here:
https://patchwork.freedesktop.org/patch/467681/

Alex


RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."

2021-12-21 Thread Deucher, Alexander
[Public]

> -Original Message-
> From: Linus Torvalds 
> Sent: Monday, December 20, 2021 5:05 PM
> To: Imre Deak 
> Cc: Daniel Vetter ; Deucher, Alexander
> ; Kai-Heng Feng
> 
> Subject: Re: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release
> PCI device ..."
> 
> On Mon, Dec 20, 2021 at 1:33 PM Imre Deak  wrote:
> >
> > amdgpu.runpm=0
> 
> Hmmm.
> 
> This does seem to "work", but not very well.
> 
> With this, what seems to happen is odd: I lock the screen, wait, it goes "No
> signal, shutting down", but then doesn't actually shut down but stays black
> (with the backlight on). After _another_ five seconds or so, the monitor goes
> "No signal, shutting down" _again_, and at that point it actually does it.
> 
> So it solves my immediate problem - in that yes, the backlight finally does
> turn off in the end - but it does seem to be still broken.
> 
> I'm very surprised if no AMD drm developers can see this exact same thing.
> This is a very simple setup. The only possibly slightly less common thing is 
> that
> I have two monitors, but while that is not necessarily the _most_ common
> setup in an absolute sense, I'd expect it to be very common among DRM
> developers..
> 
> I guess I can just change the revert to just a
> 
> -int amdgpu_runtime_pm = -1;
> +int amdgpu_runtime_pm = 0;
> 
> instead. The auto-detect is apparently broken. Maybe it should only kick in
> for LVDS screens on actual laptops?
> 
> Note: on my machine, I get that
> 
>amdgpu :49:00.0: amdgpu: Using BACO for runtime pm
> 
> so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
> and it's only that BACO case that is broken.
> 
> I have no idea what any of those three things are - I'm just looking at the
> uses of that amdgpu_runtime_pm variable.
> 
> amdgpu people: if you don't want that amdgpu_runtime_pm turned off by
> default, tell me something else to try.

For a little background, runtime PM support was added about 10 year ago 
originally to support laptops with multiple GPUs (integrated and discrete).  
It's not specific to the display hardware.  When the GPU is idle, it can be 
powered down completely.  In the case of these laptops, it's D3 cold (managed 
by ACPI, we call this BOCO in AMD parlance - Bus Off, Chip Off) which powers 
off the dGPU completely (i.e., it disappears from the bus).  A few years ago we 
extended this to support desktop dGPUs as well which support their own version 
of runtime D3 (called BACO in AMD parlance - Bus Active, Chip Off).  The driver 
can put the chip into a low power state where everything except the bus 
interface is powered down (to avoid the device disappearing from the bus).  So 
this has worked for almost 2 years now on BACO capable parts and for a decade 
or more on BOCO systems.  Unfortunately, changing the default runpm parameter 
setting would cause a flood of bug reports about runtime power management 
breaking and suddenly systems are using more power.

Imre's commit (55285e21f045) fixes another commit (a6c0fd3d5a8b).  Runtime pm 
was working on amdgpu prior to that commit.  Is it possible there is still some 
race between when amdgpu takes over from efifb?  Does it work properly when all 
pm_runtime calls in efifb are removed or if efifb is not enabled?  Runtime pm 
for Polaris boards has been enabled by default since 4fdda2e66de0b which 
predates both of those patches.

Alex


Re: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."

2021-12-21 Thread Linus Torvalds
[ Adding back in more amd people and the amd list, the people Daniel
added seem to have gotten lost again, but I think people at least saw
my original report thanks to Daniel ]

With "amdgpu.runpm=0", things are better, but not perfect. With that I
can lock the screen, and it has to go through *two* cycles of "No
signal, turning off", but on the second cycle it does finally work.

This was exposed by commit 55285e21f045 ("fbdev/efifb: Release PCI
device's runtime PM ref during FB destroy"), probably because that
made runtime PM actually potentially work, but it is then broken on
amdgpu.

Absolutely nothing odd in my setup. Two monitors, one GPU. PCI ID
1002:67df rev e7, subsystem ID 1da2:e353.

I'd expect pretty much any amdgpu person to see this.

On Mon, Dec 20, 2021 at 2:04 PM Linus Torvalds
 wrote:
>
> Note: on my machine, I get that
>
>amdgpu :49:00.0: amdgpu: Using BACO for runtime pm
>
> so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
> and it's only that BACO case that is broken.

Hmm. The *documentation* says:

PX runtime pm
2 = force enable with BAMACO,
1 = force enable with BACO,
0 = disable,
-1 = PX only default

but the code actually makes anything != 0 enable it, except on VEGA20
and ARCTURUS, where it needs to be positive.

My card is apparently "POLARIS10", whatever that means, which means
that any non-zero value of amdgpu_runtime_pm will enable runtime PM as
long as "amdgpu_device_supports_baco()" is true. Which it is.

Whatever. Now I'm just kwetching about the documentation not matching
what I see the code doing, which is never a great sign when things
don't work.

  Linus


Re: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."

2021-12-20 Thread Christian König

Good morning guys,

first of all get better soon Linus.

I'm unfortunately not the best expert for runtime power management 
(Alex) nor display (Harry), but from the lack of their response I guess 
that they are already on vacation. So maybe take everything I explain 
here with a grain of salt.


Then for the background we have two separate power management features 
here which doesn't seem to work as they should.


The first buggy one is runtime power management, which is what commit 
55285e21f045 surfaces. My educated guess is that the now corrected 
reference counting turns of the GPU before userspace has a chance to 
send a signal to the monitor to turn of it's backlight. Double checking 
the code I can see the correct calls to pm_runtime_get_*() and 
pm_runtime_put_*() in amdgpu_dm_atomic_commit_tail(), but to be honest 
that function seems to be quite a mess.


A trace of what exactly happens during PM autosuspend might help here. 
Daniel do you know any tracepoint for that?


Then we have DPMS, which is basically the way of telling the monitor to 
shut of it's backlight. When this doesn't work as expected (e.g. you 
need *two* cycles) then it can as well be that userspace is not sending 
the right command.


When you use X you could double check with "xset dpms force off" and 
"xset dpms force suspend". At least with my monitor it turns of the 
backlight in both cases, but maybe your hardware behaves differently.


Regards,
Christian.

Am 20.12.21 um 23:21 schrieb Linus Torvalds:

[ Adding back in more amd people and the amd list, the people Daniel
added seem to have gotten lost again, but I think people at least saw
my original report thanks to Daniel ]

With "amdgpu.runpm=0", things are better, but not perfect. With that I
can lock the screen, and it has to go through *two* cycles of "No
signal, turning off", but on the second cycle it does finally work.

This was exposed by commit 55285e21f045 ("fbdev/efifb: Release PCI
device's runtime PM ref during FB destroy"), probably because that
made runtime PM actually potentially work, but it is then broken on
amdgpu.

Absolutely nothing odd in my setup. Two monitors, one GPU. PCI ID
1002:67df rev e7, subsystem ID 1da2:e353.

I'd expect pretty much any amdgpu person to see this.

On Mon, Dec 20, 2021 at 2:04 PM Linus Torvalds
 wrote:

Note: on my machine, I get that

amdgpu :49:00.0: amdgpu: Using BACO for runtime pm

so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
and it's only that BACO case that is broken.

Hmm. The *documentation* says:

 PX runtime pm
 2 = force enable with BAMACO,
 1 = force enable with BACO,
 0 = disable,
 -1 = PX only default

but the code actually makes anything != 0 enable it, except on VEGA20
and ARCTURUS, where it needs to be positive.

My card is apparently "POLARIS10", whatever that means, which means
that any non-zero value of amdgpu_runtime_pm will enable runtime PM as
long as "amdgpu_device_supports_baco()" is true. Which it is.

Whatever. Now I'm just kwetching about the documentation not matching
what I see the code doing, which is never a great sign when things
don't work.

   Linus




Re: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."

2021-12-20 Thread Daniel Vetter
Adding more amdgpu folks.

Smells like this runtime pm leak is papering over an issue in the
amdgpu driver. Other bug reports are also only about amdgpu.ko it
seems, would an unconditional pm_runtime_get_sync() in
amdgpu_pci_probe() also work? If you do it before the
drm_aperture_remove_conflicting_pci_framebuffers() which throws out
efifb it should be equivalent to the efifb revert.

That would avoid the revert for everyone else, but revert's fine too
imo this close to holidays. I don't think actually fixing this rpm fun
in a late -rc is realistic, these tend to be very gnarly.
-Daniel

On Mon, Dec 20, 2021 at 6:38 PM Linus Torvalds
 wrote:
>
> So my lovely eldest daughter came back from NYC for xmas, and two days
> later was informed that she had been in contact with somebody who
> tested positive. And sure enough, so did she.
>
> I'm all vaxxed up and boostered, and feeling fine with just a slightly
> sore throat and no other symptoms, but I did test positive too a few
> days after that. And as a result I'm distancing in my office. It turns
> out that is not so hugely different from my normal life.
>
> One difference *is* that I now have an inflatable mattress and am
> sleeping here by my desktop, and as a result last night I noticed that
> my monitors didn't go to sleep. I had to turn them off manually to get
> the backlight to go away.
>
> So this morning I start to search for the reason, and find this listed
> by the regression bot, and sure enough, reverting commit 55285e21f045
> ("fbdev/efifb: Release PCI device's runtime PM ref during FB destroy")
> makes it all work again.
>
> I've reverted it in my private testing tree, in the hope that somebody
> figures out the actual real cause. But if that doesn't happen, I
> expect to revert it in my main git tree too before rc7.
>
> So this is mainly just a heads-up. I'm not seeing anything interesting
> in the kernel logs, apart from the usual unending stream of
>
>   [drm] PCIE GART of 256M enabled (table at 0x00F4).
>   [drm] UVD and UVD ENC initialized successfully.
>   [drm] VCE initialized successfully.
>   [drm] PCIE GART of 256M enabled (table at 0x00F4).
>   [drm] UVD and UVD ENC initialized successfully.
>   [drm] VCE initialized successfully.
>
> that happens randomly (sometimes every minute, sometimes hours apart)
> with or without that commit.
>
> This is some random Radeon card with two monitors attached (PCI ID
> 1002:67df rev e7, subsystem ID 1da2:e353).
>
> The symptoms are exactly as reported elsewhere:
>
>   
> https://lore.kernel.org/linux-fbdev/8a27c986-4767-bd29-2073-6c4ffed49...@jetfuse.net/
>   
> https://linux-regtracking.leemhuis.info/regzbot/regression/8a27c986-4767-bd29-2073-6c4ffed49...@jetfuse.net/
>   https://bugzilla.kernel.org/show_bug.cgi?id=215203
>
> ie I can lock my desktop, wait a few seconds, see the "No signal,
> enabling power management" on my monitors, but when that actually
> happens, the desktop lockscreen just comes right back.
>
> I suspect/assume that the screen off event triggers some status event
> on the GPU side, and now some pm logic then just wakes things right up
> again.
>
>Linus



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch