On Tue, Dec 21, 2021 at 5:09 PM Harry Wentland <harry.wentl...@amd.com> wrote:
>
>
>
> On 2021-12-21 16:18, Alex Deucher wrote:
> > On Tue, Dec 21, 2021 at 1:47 PM Deucher, Alexander
> > <alexander.deuc...@amd.com> wrote:
> >>
> >> [Public]
> >>
> >>> -----Original Message-----
> >>> From: Deucher, Alexander
> >>> Sent: Tuesday, December 21, 2021 12:01 PM
> >>> To: Linus Torvalds <torva...@linux-foundation.org>; Imre Deak
> >>> <imre.d...@intel.com>; amd-gfx@lists.freedesktop.org
> >>> Cc: Daniel Vetter <daniel.vet...@ffwll.ch>; Kai-Heng Feng
> >>> <kai.heng.f...@canonical.com>
> >>> Subject: RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release
> >>> PCI device ..."
> >>>
> >>> [Public]
> >>>
> >>>> -----Original Message-----
> >>>> From: Linus Torvalds <torva...@linux-foundation.org>
> >>>> Sent: Monday, December 20, 2021 5:05 PM
> >>>> To: Imre Deak <imre.d...@intel.com>
> >>>> Cc: Daniel Vetter <daniel.vet...@ffwll.ch>; Deucher, Alexander
> >>>> <alexander.deuc...@amd.com>; Kai-Heng Feng
> >>>> <kai.heng.f...@canonical.com>
> >>>> Subject: Re: Expecting to revert commit 55285e21f045 "fbdev/efifb:
> >>>> Release PCI device ..."
> >>>>
> >>>> On Mon, Dec 20, 2021 at 1:33 PM Imre Deak <imre.d...@intel.com>
> >>> wrote:
> >>>>>
> >>>>> amdgpu.runpm=0
> >>>>
> >>>> Hmmm.
> >>>>
> >>>> This does seem to "work", but not very well.
> >>>>
> >>>> With this, what seems to happen is odd: I lock the screen, wait, it
> >>>> goes "No signal, shutting down", but then doesn't actually shut down
> >>>> but stays black (with the backlight on). After _another_ five seconds
> >>>> or so, the monitor goes "No signal, shutting down" _again_, and at that
> >>> point it actually does it.
> >>>>
> >>>> So it solves my immediate problem - in that yes, the backlight finally
> >>>> does turn off in the end - but it does seem to be still broken.
> >>>>
> >>>> I'm very surprised if no AMD drm developers can see this exact same
> >>>> thing.
> >>>> This is a very simple setup.
> >>>> The only possibly slightly less common
> >>>> thing is that I have two monitors, but while that is not necessarily
> >>>> the _most_ common setup in an absolute sense, I'd expect it to be very
> >>>> common among DRM developers.
> >>>>
> >>>> I guess I can just change the revert to just a
> >>>>
> >>>> -int amdgpu_runtime_pm = -1;
> >>>> +int amdgpu_runtime_pm = 0;
> >>>>
> >>>> instead. The auto-detect is apparently broken. Maybe it should only
> >>>> kick in for LVDS screens on actual laptops?
> >>>>
> >>>> Note: on my machine, I get that
> >>>>
> >>>> amdgpu 0000:49:00.0: amdgpu: Using BACO for runtime pm
> >>>>
> >>>> so maybe the other possible runtime pm models (ATPX and BOCO) are ok,
> >>>> and it's only that BACO case that is broken.
> >>>>
> >>>> I have no idea what any of those three things are - I'm just looking
> >>>> at the uses of that amdgpu_runtime_pm variable.
> >>>>
> >>>> amdgpu people: if you don't want that amdgpu_runtime_pm turned off by
> >>>> default, tell me something else to try.
> >>>
> >>> For a little background, runtime PM support was added about 10 years ago,
> >>> originally to support laptops with multiple GPUs (integrated and
> >>> discrete). It's not specific to the display hardware. When the GPU is
> >>> idle, it can be powered down completely. In the case of these laptops,
> >>> it's D3 cold (managed by ACPI; we call this BOCO in AMD parlance - Bus
> >>> Off, Chip Off), which powers off the dGPU completely (i.e., it
> >>> disappears from the bus). A few years ago we extended this to support
> >>> desktop dGPUs as well, which support their own version of runtime D3
> >>> (called BACO in AMD parlance - Bus Active, Chip Off). The driver can put
> >>> the chip into a low power state where everything except the bus
> >>> interface is powered down (to avoid the device disappearing from the
> >>> bus). So this has worked for almost 2 years now on BACO-capable parts
> >>> and for a decade or more on BOCO systems.
> >>> Unfortunately, changing the default runpm parameter setting would cause a
> >>> flood of bug reports about runtime power management breaking and
> >>> systems suddenly using more power.
> >>>
> >>> Imre's commit (55285e21f045) fixes another commit (a6c0fd3d5a8b).
> >>> Runtime pm was working on amdgpu prior to that commit. Is it possible
> >>> there is still some race when amdgpu takes over from efifb? Does
> >>> it work properly when all pm_runtime calls in efifb are removed, or if
> >>> efifb is not enabled? Runtime pm for Polaris boards has been enabled by
> >>> default since 4fdda2e66de0b, which predates both of those patches.
> >>
> >> Thinking about this more, I wonder if there was some change in some
> >> userspace component which was hidden by the changes in 55285e21f045 and
> >> a6c0fd3d5a8b. E.g., some desktop component started polling for display
> >> changes or GPU temperature or something like that, and when a6c0fd3d5a8b
> >> was in place the GPU never entered runtime suspend. Then when
> >> 55285e21f045 was applied, it unmasked the new behavior in the userspace
> >> component.
> >>
> >> What should happen is that when all of the displays blank, assuming the
> >> GPU is otherwise idle, the GPU will runtime suspend after a few seconds.
> >> When you move the mouse or hit the keyboard, that should trigger the GPU
> >> to runtime resume, and then the displays will be re-enabled.
> >>
> >> In the behavior you are seeing, when the displays come back on after they
> >> blank, are you seeing the device resume from runtime suspend? On resume
> >> from suspend (runtime or system) we issue a hotplug notification to
> >> userspace in case any displays changed during suspend while the GPU was
> >> powered down (and hence could not detect a hotplug event).
> >> Perhaps that
> >> event is triggering userspace to reprobe and re-enable the displays
> >> shortly after resume from runtime suspend, due to some other event that
> >> caused the device to runtime resume. Does something like this help by any
> >> chance?
> >> https://bugzilla.kernel.org/attachment.cgi?id=300103&action=diff&collapsed=&headers=1&format=raw
> >
> > I'm not seeing this on my system, and another user tried the patch and
> > it fixed his system, so it indeed seems to be something along the
> > lines of what I described above. Something in userspace seems to be
> > accessing the GPU, causing it to runtime resume. On resume the driver
> > then sends a hotplug event to userspace (in case anything display
> > related changed while the GPU was suspended). The desktop manager
> > then sees the hotplug event, reprobes, and lights up the displays
> > again. @Wentland, Harry, I guess the display code may already handle
> > this (seeing if anything has changed on resume since suspend), so we
> > can probably drop the call from the generic amdgpu resume code? Or if
> > not, it should be pretty straightforward to fix that in
> > dm_suspend()/dm_resume().
>
> We handle re-detection in dm_resume but only seem to call
> drm_kms_helper_hotplug_event for MST. We might want to call it
> in dm_resume, but that would just move it from amdgpu_device_resume
> and will probably cause the same issue.
>
> What I think we'd really want here is to make sure
> drm_kms_helper_hotplug_event is only called when the display
> config actually changes.
> Unfortunately, we're not being very
> smart in our detection and just recreate everything (even
> though dc_link_detect_helper seems to have code that wants to
> be smart enough to detect whether the sink has changed).
> This means this change is non-trivial. At least I can't see
> an easy fix that doesn't run the risk of breaking a bunch of
> stuff.
I think something like this patch may do the trick:
https://bugzilla.kernel.org/attachment.cgi?id=300109&action=diff&collapsed=&headers=1&format=raw

We can use the helper to check if anything actually changed.

Alex

>
> Or maybe someone can try to see if (non-MST) detection during
> RPM still works with your patch. If it does (for some reason)
> then we should be good.
>
> I can't seem to repro the problem on my Navi 1x card with KDE Plasma.
> I wonder if it's a Polaris issue.
>
> I also don't see my Navi 1x card going into RPM after
> 'xset dpms force off' with the Manjaro 5.15 kernel. Might need
> to try the latest mainline or amd-stg.
>
> Harry
>
> >
> > Alex