On Tue, Dec 21, 2021 at 5:09 PM Harry Wentland <harry.wentl...@amd.com> wrote:
>
>
>
> On 2021-12-21 16:18, Alex Deucher wrote:
> > On Tue, Dec 21, 2021 at 1:47 PM Deucher, Alexander
> > <alexander.deuc...@amd.com> wrote:
> >>
> >> [Public]
> >>
> >>> -----Original Message-----
> >>> From: Deucher, Alexander
> >>> Sent: Tuesday, December 21, 2021 12:01 PM
> >>> To: Linus Torvalds <torva...@linux-foundation.org>; Imre Deak
> >>> <imre.d...@intel.com>; amd-gfx@lists.freedesktop.org
> >>> Cc: Daniel Vetter <daniel.vet...@ffwll.ch>; Kai-Heng Feng
> >>> <kai.heng.f...@canonical.com>
> >>> Subject: RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release
> >>> PCI device ..."
> >>>
> >>> [Public]
> >>>
> >>>> -----Original Message-----
> >>>> From: Linus Torvalds <torva...@linux-foundation.org>
> >>>> Sent: Monday, December 20, 2021 5:05 PM
> >>>> To: Imre Deak <imre.d...@intel.com>
> >>>> Cc: Daniel Vetter <daniel.vet...@ffwll.ch>; Deucher, Alexander
> >>>> <alexander.deuc...@amd.com>; Kai-Heng Feng
> >>>> <kai.heng.f...@canonical.com>
> >>>> Subject: Re: Expecting to revert commit 55285e21f045 "fbdev/efifb:
> >>>> Release PCI device ..."
> >>>>
> >>>> On Mon, Dec 20, 2021 at 1:33 PM Imre Deak <imre.d...@intel.com>
> >>> wrote:
> >>>>>
> >>>>> amdgpu.runpm=0
> >>>>
> >>>> Hmmm.
> >>>>
> >>>> This does seem to "work", but not very well.
> >>>>
> >>>> With this, what seems to happen is odd: I lock the screen, wait, it
> >>>> goes "No signal, shutting down", but then doesn't actually shut down
> >>>> but stays black (with the backlight on). After _another_ five seconds
> >>>> or so, the monitor goes "No signal, shutting down" _again_, and at that
> >>> point it actually does it.
> >>>>
> >>>> So it solves my immediate problem - in that yes, the backlight finally
> >>>> does turn off in the end - but it does seem to be still broken.
> >>>>
> >>>> I'm very surprised if no AMD drm developers can see this exact same
> >>>> thing.
> >>>> This is a very simple setup.
> >>>> The only possibly slightly less common
> >>>> thing is that I have two monitors, but while that is not necessarily
> >>>> the _most_ common setup in an absolute sense, I'd expect it to be very
> >>>> common among DRM developers.
> >>>>
> >>>> I guess I can just change the revert to just a
> >>>>
> >>>> -int amdgpu_runtime_pm = -1;
> >>>> +int amdgpu_runtime_pm = 0;
> >>>>
> >>>> instead. The auto-detect is apparently broken. Maybe it should only
> >>>> kick in for LVDS screens on actual laptops?
> >>>>
> >>>> Note: on my machine, I get that
> >>>>
> >>>> amdgpu 0000:49:00.0: amdgpu: Using BACO for runtime pm
> >>>>
> >>>> so maybe the other possible runtime pm models (ATPX and BOCO) are ok,
> >>>> and it's only that BACO case that is broken.
> >>>>
> >>>> I have no idea what any of those three things are - I'm just looking
> >>>> at the uses of that amdgpu_runtime_pm variable.
> >>>>
> >>>> amdgpu people: if you don't want that amdgpu_runtime_pm turned off by
> >>>> default, tell me something else to try.
> >>>
> >>> For a little background, runtime PM support was added about 10 years ago,
> >>> originally to support laptops with multiple GPUs (integrated and
> >>> discrete). It's not specific to the display hardware. When the GPU is
> >>> idle, it can be powered down completely. In the case of these laptops,
> >>> it's D3 cold (managed by ACPI; we call this BOCO in AMD parlance - Bus
> >>> Off, Chip Off), which powers off the dGPU completely (i.e., it
> >>> disappears from the bus). A few years ago we extended this to support
> >>> desktop dGPUs as well, which support their own version of runtime D3
> >>> (called BACO in AMD parlance - Bus Active, Chip Off). The driver can put
> >>> the chip into a low power state where everything except the bus
> >>> interface is powered down (to avoid the device disappearing from the
> >>> bus). So this has worked for almost 2 years now on BACO-capable parts
> >>> and for a decade or more on BOCO systems.
> >>> Unfortunately, changing the default runpm parameter setting would cause a
> >>> flood of bug reports about runtime power management breaking and
> >>> systems suddenly using more power.
> >>>
> >>> Imre's commit (55285e21f045) fixes another commit (a6c0fd3d5a8b).
> >>> Runtime pm was working on amdgpu prior to that commit. Is it possible
> >>> there is still some race when amdgpu takes over from efifb? Does
> >>> it work properly when all pm_runtime calls in efifb are removed, or if
> >>> efifb is not enabled? Runtime pm for Polaris boards has been enabled by
> >>> default since 4fdda2e66de0b, which predates both of those patches.
> >>
> >> Thinking about this more, I wonder if there was some change in some
> >> userspace component which was hidden by the changes in 55285e21f045 and
> >> a6c0fd3d5a8b. E.g., some desktop component started polling for display
> >> changes or GPU temperature or something like that, and when a6c0fd3d5a8b
> >> was in place the GPU never entered runtime suspend. Then when
> >> 55285e21f045 was applied, it unmasked the new behavior in the userspace
> >> component.
> >>
> >> What should happen is that when all of the displays blank, assuming the
> >> GPU is otherwise idle, the GPU will runtime suspend after a few seconds.
> >> When you move the mouse or hit the keyboard, that should trigger the GPU
> >> to runtime resume, and then the displays will be re-enabled.
> >>
> >> In the behavior you are seeing, when the displays come back on after they
> >> blank, are you seeing the device resume from runtime suspend? On resume
> >> from suspend (runtime or system) we issue a hotplug notification to
> >> userspace in case any displays changed during suspend while the GPU was
> >> powered down (and hence could not detect a hotplug event).
> >> Perhaps that
> >> event is triggering userspace to reprobe and re-enable the displays
> >> shortly after resume from runtime suspend, due to some other event that
> >> caused the device to runtime resume. Does something like this help by any
> >> chance?
> >> https://bugzilla.kernel.org/attachment.cgi?id=300103&action=diff&collapsed=&headers=1&format=raw
> >
> > I'm not seeing this on my system, and another user tried the patch and
> > it fixed his system, so it indeed seems to be something along the
> > lines of what I described above. Something in userspace seems to be
> > accessing the GPU, causing it to runtime resume. On resume the driver
> > then sends a hotplug event to userspace (in case anything display
> > related changed while the GPU was suspended). The desktop manager
> > then sees the hotplug event, reprobes, and lights up the displays
> > again. @Wentland, Harry, I guess the display code may already handle
> > this (seeing if anything has changed on resume since suspend), so we
> > can probably drop the call from the generic amdgpu resume code? Or if
> > not, it should be pretty straightforward to fix that in
> > dm_suspend()/dm_resume().
>
> We handle re-detection in dm_resume but only seem to call
> drm_kms_helper_hotplug_event for MST. We might want to call it
> in dm_resume, but that would just move it from amdgpu_device_resume
> and will probably cause the same issue.
>
> What I think we'd really want here is to make sure
> drm_kms_helper_hotplug_event is only called when the display
> config actually changes.
> Unfortunately, we're not being very
> smart in our detection and just recreate everything (even
> though dc_link_detect_helper seems to have code that wants to
> be smart enough to detect whether the sink has changed).
> This means this change is non-trivial. At least I can't see
> an easy fix that doesn't run the risk of breaking a bunch of
> stuff.
I think something like this patch may do the trick:
https://bugzilla.kernel.org/attachment.cgi?id=300109&action=diff&collapsed=&headers=1&format=raw

We can use the helper to check if anything actually changed.

Alex

>
> Or maybe someone can try to see if (non-MST) detection during
> RPM still works with your patch. If it does (for some reason)
> then we should be good.
>
> I can't seem to repro the problem on my Navi 1x card with KDE Plasma.
> I wonder if it's a Polaris issue.
>
> I also don't see my Navi 1x card going into RPM after
> 'xset dpms force off' with the Manjaro 5.15 kernel. Might need
> to try the latest mainline or amd-stg.
>
> Harry
>
> >
> > Alex